A Bayesian HMM algorithm for the identification of gene families

Richard Boys (1), Daniel Henderson (2)
(1) richard.boys@ncl.ac.uk, University of Newcastle upon Tyne; (2) d.a.henderson@open.ac.uk, The Open University

We describe an algorithm that identifies families of genes with similar nucleotide patterns and, plausibly, similar function. The procedure assumes a hidden Markov model (HMM) for the evolution of the DNA sequence. Typically, HMM segmentation methods assume that the observed process (here, the DNA sequence) evolves independently given the unobserved Markov chain, which locates the positions of the family members. However, this assumption does not properly account for the additional short-range structure that is often evident in DNA sequences. In addition, HMM-based classification methods usually allocate regions of DNA to a known, fixed number of possible families. Our algorithm is more general in that it determines both the number of statistically significant families and their distinctive (complex) nucleotide patterns.
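To make the departure from conditional independence concrete, the following minimal sketch computes the likelihood of a sequence under an HMM in which each base depends on the current hidden state and on the preceding base. It is an illustration only and not the authors' implementation: the function name forward_loglik, the first-order dependence, the uniform distribution for the first base, and all parameter values are assumptions made for the example.

```python
import numpy as np

BASES = "acgt"

def forward_loglik(seq, init, trans, emit):
    """Log-likelihood of `seq` under an HMM whose emissions depend on the
    hidden state and on the previous base (scaled forward recursion).

    init  : (r,)      initial distribution over the r hidden states
    trans : (r, r)    hidden-state transition matrix, rows sum to 1
    emit  : (r, 4, 4) emit[k, a, b] = P(base b | state k, previous base a)
    """
    x = np.array([BASES.index(c) for c in seq.lower()])
    # Assumption: the first base is uniform over the four nucleotides.
    alpha = init * 0.25
    loglik = 0.0
    for t in range(1, len(x)):
        c = alpha.sum()              # normalising constant, avoids underflow
        loglik += np.log(c)
        # Propagate the hidden chain, then weight by the emission
        # probability of base x[t] given the state and base x[t-1].
        alpha = (alpha / c) @ trans * emit[:, x[t - 1], x[t]]
    return loglik + np.log(alpha.sum())

# Hypothetical two-state example with arbitrary parameter values.
rng = np.random.default_rng(1)
init = np.array([0.5, 0.5])
trans = np.array([[0.99, 0.01],
                  [0.02, 0.98]])
emit = rng.dirichlet(np.ones(4), size=(2, 4))   # shape (2, 4, 4)
print(forward_loglik("gattacagattaca", init, trans, emit))
```

Setting every row of emit[k] equal to the same distribution recovers the usual HMM in which bases are conditionally independent given the hidden states.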

In this work, we adopt a Bayesian approach to inference, which allows us to take full account of the uncertainty in the locations and the composition of the various gene families. It also permits the incorporation of prior knowledge about these unknowns and provides a coherent framework for model comparison and selection. The complex structure of this model precludes a fully analytic treatment, and we therefore use modern computationally intensive statistical techniques. Markov chain Monte Carlo (MCMC) algorithms are ideally suited to such problems and, in particular, we use trans-dimensional MCMC to explore both the parameter space and the model space.
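As a rough sketch of the inferential target, in notation introduced here for illustration and not taken from the abstract: write y for the observed DNA sequence, s for the hidden state path, r for the number of families, and \theta_r for the transition and emission parameters of the model with r families. The joint posterior explored by a trans-dimensional sampler then has the generic form

```latex
\pi(r, \theta_r, s \mid y) \;\propto\;
  \pi(r)\, \pi(\theta_r \mid r)\, p(s \mid \theta_r, r)\, p(y \mid s, \theta_r, r),
\qquad
\pi(r \mid y) \;=\; \sum_{s} \int \pi(r, \theta_r, s \mid y)\, \mathrm{d}\theta_r .
```

Under this formulation, the marginal \pi(r \mid y) summarises the evidence for each candidate number of families, and the sampler's moves between different values of r change the dimension of \theta_r.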

We illustrate the general method by investigating the structure of the bacteriophage lambda genome, a common benchmark sequence used for the comparison of statistical segmentation algorithms.

References

Boys, R.J. and Henderson, D.A. (2003).
A Bayesian approach to DNA sequence segmentation. To appear in Biometrics.
Boys, R.J., Henderson, D.A. and Wilkinson, D.J. (2000).
Detecting homogeneous segments in DNA sequences by using hidden Markov models. Applied Statistics 49 269-285.
Green, P.J. (2003).
Trans-dimensional Markov chain Monte Carlo. In Highly Structured Stochastic Systems. Oxford University Press, Oxford.
Nicolas, P., Bize, L., Muri, F., Hoebeke, M., Rodolphe, F., Ehrlich, S.D., Prum, B. and Bessières, P. (2002).
Mining Bacillus subtilis chromosome heterogeneities using hidden Markov models. Nucleic Acids Research 30 1418-1426.