Hidden Multivariate Markov Models for Pattern Recognition in Genomic DNA Sequences

Leo Wang-Kit Cheung1
1lcheung@crch.hawaii.edu, Cancer Research Center of Hawaii, University of Hawaii

In this work, a multivariate class of probabilistic models is introduced for modeling multi-dimensional genomic DNA data. This class of models, which we call Hidden Multivariate Markov Models (HM3s), is defined as a doubly stochastic process whose underlying hidden (unobserved) state process follows a multivariate Markov chain. In essence, a genomic DNA sequence is viewed as having multivariate properties governed by the discrete multivariate states of the hidden multivariate Markov chain in an HM3. Because C+G composition and the structural property of bending propensity (bendability) have each been found to be associated with promoter regions of eukaryotic genes, a bivariate version of HM3s is developed. Specifically, a discrete bivariate state of a hidden bivariate Markov chain is defined, with one variate representing the C+G compositional property and the other representing the structural property of bendability. In the binary bivariate case, the compositional variate is either "CG-rich" or "CG-poor", and the structural variate is either "bendable" or "rigid". With this hidden bivariate Markov chain architecture, the observed discrete base-composition data and the observed continuous bendability data can be modeled in an integrated fashion. The model offers a new statistical framework within which the joint behavior of the C+G richness pattern and the bendability pattern of DNA can be explored. The forward-backward algorithm, the Viterbi algorithm, and the EM algorithm developed for a standard Hidden Markov Model (HMM) are modified for an HM3. Applications of the bivariate HM3s to the recognition and prediction of eukaryotic promoter regions are illustrated via case studies using real human DNA data provided by Dr. Anders Pedersen at the Center for Biological Sequence Analysis in Denmark.
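
As a rough illustration of the binary bivariate case described above, the following Python sketch runs a scaled forward recursion over the four joint states formed by the compositional and structural variates. It is only a sketch under stated assumptions, not the method of this work: the abstract does not specify the emission or transition forms, so the conditional independence of the two observations given the state, the Gaussian density for bendability, the "stay or switch uniformly" transition rule, and every numeric parameter and name (`P_BASE`, `BEND`, `STAY`, `forward_loglik`) are hypothetical choices made for illustration.

```python
import math
from itertools import product

# Joint hidden states: (compositional variate, structural variate).
STATES = list(product(["CG-rich", "CG-poor"], ["bendable", "rigid"]))

# Made-up parameters, for illustration only.
P_BASE = {  # assumed P(base | compositional variate)
    "CG-rich": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "CG-poor": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}
BEND = {  # assumed (mean, std) of a Gaussian bendability score
    "bendable": (1.0, 1.0),
    "rigid": (-1.0, 1.0),
}
STAY = 0.9  # assumed probability that the joint state does not change


def transition(s, t):
    """P(next = t | current = s): stay with STAY, else uniform over the rest."""
    return STAY if s == t else (1.0 - STAY) / (len(STATES) - 1)


def emission(state, base, bend):
    """Joint density of one discrete base and one continuous bendability value,
    assuming the two are independent given the bivariate state."""
    comp, struct = state
    mu, sd = BEND[struct]
    gauss = math.exp(-0.5 * ((bend - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))
    return P_BASE[comp][base] * gauss


def forward_loglik(bases, bends):
    """Scaled forward recursion with a uniform initial distribution.
    Returns (log-likelihood, filtered distribution over STATES at the last position)."""
    alpha = [emission(s, bases[0], bends[0]) / len(STATES) for s in STATES]
    loglik = 0.0
    for t in range(len(bases)):
        if t > 0:
            alpha = [
                sum(alpha[i] * transition(STATES[i], s) for i in range(len(STATES)))
                * emission(s, bases[t], bends[t])
                for s in STATES
            ]
        scale = sum(alpha)  # P(observation t | earlier observations)
        loglik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return loglik, alpha
```

Calling `forward_loglik("GCGCGC", [1.2, 0.8, 1.1, 0.9, 1.3, 1.0])` returns the log-likelihood of the paired observations together with the filtered probabilities of the four joint states at the final position. Flattening the bivariate state into four composite states lets the standard HMM recursion be reused unchanged; a transition matrix that couples the two variates more richly would replace `transition` without altering the rest.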