### A Bayesian HMM algorithm for the identification of gene families

Richard Boys^{1}, Daniel Henderson^{2}

^{1}richard.boys@ncl.ac.uk, University of Newcastle upon Tyne; ^{2}d.a.henderson@open.ac.uk, The Open University

We describe an algorithm that identifies families of genes with
similar nucleotide patterns and hopefully similar function. The
procedure assumes a hidden Markov model (HMM) for the evolution of the
DNA sequence. Typically, HMM segmentation methods assume that the
observed process - here the DNA sequence - evolves independently given
the unobserved Markov chain which locates the position of the family
members. However, this assumption does not properly account for the
additional short-range structure that is often evident in DNA
sequences. Additionally, HMM-based classification methods attempt to
allocate regions of DNA to a known number of possible families. Our
algorithm is more general in that it determines the number of
statistically significant families and their distinctive (complex)
nucleotide patterns.

In this work, we adopt a Bayesian approach to inference which allows
us to take full account of the uncertainty in the locations and the
composition of the various gene families. It also permits the
incorporation of prior knowledge about these unknowns and provides a
coherent framework for model comparison/selection. The complex
structure of this model precludes a fully analytic treatment and we
therefore use modern computationally intensive statistical techniques.
Markov chain Monte Carlo (MCMC) algorithms are ideally suited to such
problems and, in particular, we use trans-dimensional MCMC to explore
both the parameter and model space.

We illustrate the general method by investigating the structure of the
bacteriophage *lambda* genome, a common benchmark sequence used
for the comparison of statistical segmentation algorithms.

### References

- Boys, R.J. and Henderson, D.A. (2003).
- A Bayesian approach to DNA sequence segmentation. To appear in
*Biometrics*.
- Boys, R.J., Henderson, D.A. and Wilkinson, D.J. (2000).
- Detecting homogeneous segments in DNA sequences by using hidden
Markov models.
*Applied Statistics* **49** 269-285.
- Green, P.J. (2003).
- Trans-dimensional Markov chain Monte Carlo.
In
*Highly Structured Stochastic Systems*. Oxford
University Press, Oxford.
- Nicolas, P., Bize, L., Muri, F., Hoebeke, M., Rodolphe, F.,
Ehrlich, S.D., Prum, B. and Bressieres, P. (2002).
- Mining
*Bacillus subtilis* chromosome heterogeneities using
hidden Markov models. *Nucleic Acids Research* **30** 1418-1426.