During evolution, selective forces conserve different functional sites
in DNA at different rates, thereby producing distinctive signatures in
different genomic regions.
Since the pattern of conservation in gene coding regions differs from
that in non-coding regions, a comparative computational analysis can, in
principle, improve the identification of genes in one species by
comparing its genome to that of an evolutionarily related species.
Many comparative models, ranging from visual studies of sequence
alignments to the fully automated HMM-based TWINSCAN [Korf 2001] and
SLAM [Pachter 2002], do so by relying on a given pair of organisms, such
as human and mouse. More precisely, they rely on the ad-hoc rule that
if the two organisms are too close together or too far apart, the
approach fails as the degree of similarity/dissimilarity becomes the
same throughout the pair of genomes.
We propose a formal way to select an optimal pair of genomes or genomic regions. We start by assuming a general Markov model of evolution that gives a probabilistic interpretation of the evolutionary forces acting on conserved and non-conserved genomic regions. We combine this model with an HMM-based model of a comparative gene finder. Our key observation relates the task of selecting the "best" pair of genomes to that of minimizing the gene detection error in the combined HMM-Markov evolutionary model as a function of the evolutionary distance between genomic regions. We study error analysis in HMMs, an infrequently visited topic, and from it derive analytical solutions to the problem of accuracy maximization for a simplified comparative gene finding model. With a more realistic gene finder model [Zhang 2003], our simulation studies indicate a wide range of evolutionary distances at which a reference genome appears to deliver reasonable prediction of human genes. The evolutionary time between human and mouse generally falls within this range; however, better accuracy might be achieved with a reference species other than mouse.
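The dependence of detection error on evolutionary distance can be illustrated with a deliberately small toy model. The sketch below is not the combined HMM-Markov model described above: it assumes a Jukes-Cantor substitution process, a single aligned site observed only as match or mismatch, equal priors for the two classes, and hypothetical rates `rate_cons` and `rate_neutral` for conserved and non-conserved sites. Distance `t` is measured so that `rate * t` is the expected number of substitutions per site. Under these assumptions the per-site Bayes error is 0.5 at both very small and very large distances and has a closed-form minimum in between.

```python
import math


def match_prob(rate, t):
    """Jukes-Cantor probability that two aligned bases are identical."""
    return 0.25 + 0.75 * math.exp(-4.0 * rate * t / 3.0)


def site_error(t, rate_cons=0.2, rate_neutral=1.0):
    """Bayes error of labelling one aligned site as conserved or neutral.

    Assumes equal class priors and a match/mismatch observation only.
    The rates are illustrative values, not estimates from any genome.
    """
    p_c = match_prob(rate_cons, t)
    p_n = match_prob(rate_neutral, t)
    return 0.5 * (1.0 - abs(p_c - p_n))


def optimal_distance(rate_cons=0.2, rate_neutral=1.0):
    """Closed-form minimiser of site_error for this toy model.

    Obtained by setting the derivative of
    exp(-4*rate_cons*t/3) - exp(-4*rate_neutral*t/3) to zero.
    """
    return 3.0 * math.log(rate_neutral / rate_cons) / (
        4.0 * (rate_neutral - rate_cons)
    )


if __name__ == "__main__":
    t_star = optimal_distance()
    for t in (0.01, 0.5, t_star, 5.0, 50.0):
        print(f"t = {t:7.3f}  error = {site_error(t):.4f}")
```

Running the script prints the error at a few distances, including the optimum; species that are too close or too far apart both give an error near 0.5, which mirrors the ad-hoc rule mentioned earlier.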
Korf, I., Flicek, P., Duan, D. & Brent, M. (2001), ‘Integrating genomic homology into gene structure prediction’, Bioinformatics.
Pachter, L., Alexandersson, M. & Cawley, S. (2002), ‘Applications of generalized pair hidden Markov models to alignment and gene finding problems’, Journal of Computational Biology 9, 389–400.
Zhang, L., Pavlovic, V., Cantor, C. R. & Kasif, S. (2003), ‘Human-mouse gene identification by comparative evidence integration and evolutionary analysis’, Genome Research, to appear.