Predicting accuracy of comparative gene finders using evolutionary models

Vladimir Pavlovic, Rutgers University

In the process of evolution, selective forces conserve different functional sites in DNA at different rates, thereby producing distinctive signatures in different genomic regions. Since the pattern of conservation in gene-coding regions differs from that in non-coding regions, comparative computational analysis can, in principle, improve the identification of genes in one species by comparing its genome to that of an evolutionarily related species. Many comparative approaches, ranging from visual studies of sequence alignments to the fully automated HMM-based TWINSCAN [Korf 2001] and SLAM [Pachter 2002], do so by relying on a given pair of organisms, such as human and mouse. More precisely, they rely on the ad hoc rule that the two organisms must be neither too close together nor too far apart; otherwise the approach fails, because the degree of similarity or dissimilarity becomes the same throughout the pair of genomes.

We propose a formal way to select an optimal pair of genomes or genomic regions. We start by assuming a general Markov model of evolution that gives a probabilistic interpretation of the evolutionary forces acting on conserved and non-conserved genomic regions. We combine this model with an HMM-based model of a comparative gene finder. In a key observation, we relate the task of selecting the "best" pair of genomes to that of minimizing the gene detection error in the combined HMM-Markov evolutionary model as a function of the evolutionary distance between genomic regions. We study error analysis in HMMs, an infrequently visited topic, and from it derive analytical solutions to the problem of accuracy maximization for a simplified comparative gene-finding model. With a more realistic gene finder model [Zhang 2003], our simulation studies indicate a wide range of evolutionary distances at which a reference genome appears to deliver reasonable prediction of human genes. The evolutionary time between human and mouse generally falls within this range; however, better accuracy might be achieved with a reference species other than mouse.
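The dependence of detection error on evolutionary distance can be illustrated with a toy calculation. The sketch below is not the model used in this work: it assumes a Jukes-Cantor substitution process, a single aligned site classified independently (no HMM), equal priors for coding and non-coding, and made-up substitution rates. The names match_prob, site_error, and best_distance are hypothetical. Under these assumptions the error is 0.5 when the two genomes are identical or fully diverged, and reaches its minimum at an intermediate distance that has a closed form.

```python
import math

def match_prob(rate, t):
    """Jukes-Cantor probability that two aligned bases are identical,
    given substitution rate `rate` and total separation time `t`."""
    return 0.25 + 0.75 * math.exp(-4.0 * rate * t / 3.0)

def site_error(t, rate_coding=0.2, rate_noncoding=1.0):
    """Bayes error of labeling one aligned site as coding or non-coding
    from a match/mismatch observation, with equal priors (assumed)."""
    pc = match_prob(rate_coding, t)      # P(match | coding)
    pn = match_prob(rate_noncoding, t)   # P(match | non-coding)
    return 0.5 * (min(pc, pn) + min(1.0 - pc, 1.0 - pn))

def best_distance(rate_coding=0.2, rate_noncoding=1.0):
    """Separation time that minimizes site_error, obtained by maximizing
    exp(-a t) - exp(-b t) with a = 4*rate_coding/3, b = 4*rate_noncoding/3."""
    a = 4.0 * rate_coding / 3.0
    b = 4.0 * rate_noncoding / 3.0
    return math.log(b / a) / (b - a)

if __name__ == "__main__":
    t_star = best_distance()
    for t in (0.0, 0.5 * t_star, t_star, 2.0 * t_star, 50.0):
        print(f"t = {t:7.3f}  error = {site_error(t):.4f}")
```

The rates 0.2 and 1.0 are illustrative values only. A realistic analysis would replace the per-site classifier with the HMM over whole regions, which is where the error analysis described above becomes necessary.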

Korf, I., Flicek, P., Duan, D. & Brent, M. (2001), ‘Integrating genomic homology into gene structure prediction’, Bioinformatics 17, S140–S148.
Pachter, L., Alexandersson, M. & Cawley, S. (2002), ‘Applications of generalized pair hidden Markov models to alignment and gene finding problems’, Journal of Computational Biology 9, 389–400.
Zhang, L., Pavlovic, V., Cantor, C.R. & Kasif, S. (2003), ‘Human-mouse gene identification by comparative evidence integration and evolutionary analysis’, Genome Research, to appear.