CompMapper: An Automatic Pipeline to Define Conserved Segments between Genomes Systematically

Fu Lu1, Zhenyuan Wang2, Xiangqun Holly Zheng, Wenyan Zhong, Fei Zhong, Richard Mural
1fu.lu@celera.com, Celera Genomics; 2jack.wang@celera.com, Celera Genomics

Homologous genes have been widely used to identify conserved segments and construct comparative maps. Although this is an effective approach when complete genomic sequence is not available, it is error-prone because the data are sparse and the selection of homologous gene pairs is heuristic (Nadeau JH and Sankoff D, TIG 14, 495-501). With robust draft sequences of the mouse and human genomes in hand (J.C. Venter et al., Science 291, 1304-51; R. Mural et al., Science 296, 1661-1671), we have developed a new paradigm and systematic approach to define conserved segments between human and mouse directly from genomic sequence. The automatic pipeline should be applicable to mapping any pair of species with complete or draft genome sequences that lie within an appropriate phylogenetic distance.

The bioinformatics pipeline CompMapper identifies conserved segments between two genomes from their genomic sequences. The pipeline has two main functional modules. 1) LandMarker: This module compares the entire mouse and human genome sequences using BLAST to identify regions with a similarity score exceeding 80% and a length greater than 50 bp. The BLAST matches are then filtered to remove repetitive matches, so that the final set of alignments consists of mutually unique matches. Such regions are probably derived from the same ancestral sequence and serve as orthologous landmarks for identifying conserved segments. 2) CSGenerator (Conserved Segment Generator): Conserved segments are defined by comparing the locations of landmarks in the two genomes. A conserved segment is a maximal region in which a series of landmarks occurs in the same order on a single chromosome in both species. To avoid artificial breakpoints caused by imperfections in the draft genome sequences, single inconsistent landmarks are excluded from the computation.
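
The two modules can be illustrated with a minimal Python sketch. It is not the CompMapper implementation: the names Landmark, mutually_unique and conserved_segments are hypothetical, landmarks are reduced to single positions, "mutually unique" is read as "each region takes part in exactly one match", and a single inconsistent landmark is detected with a one-step look-ahead. All of these are assumptions made for illustration only.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple, Tuple


class Landmark(NamedTuple):
    """Hypothetical landmark: one position in genome A matched to one in genome B."""
    a_chrom: str
    a_pos: int
    b_chrom: str
    b_pos: int


def mutually_unique(hits: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep (region_a, region_b) matches in which both regions occur exactly once.

    Assumption: a repetitive region shows up as a region with several matches.
    """
    hits = list(hits)
    a_counts = Counter(a for a, _ in hits)
    b_counts = Counter(b for _, b in hits)
    return [(a, b) for a, b in hits if a_counts[a] == 1 and b_counts[b] == 1]


def _extends(segment: List[Landmark], lm: Landmark) -> bool:
    """True if lm continues the segment: same chromosome pair, same order."""
    last = segment[-1]
    if (lm.a_chrom, lm.b_chrom) != (last.a_chrom, last.b_chrom):
        return False
    if len(segment) < 2:
        return True  # orientation is not fixed until two landmarks are present
    forward = segment[1].b_pos > segment[0].b_pos
    return (lm.b_pos > last.b_pos) == forward


def conserved_segments(landmarks: Iterable[Landmark]) -> List[List[Landmark]]:
    """Group landmarks into runs with conserved order in both genomes."""
    ordered = sorted(landmarks, key=lambda lm: (lm.a_chrom, lm.a_pos))
    segments: List[List[Landmark]] = []
    for i, lm in enumerate(ordered):
        if segments and _extends(segments[-1], lm):
            segments[-1].append(lm)
        elif (segments and i + 1 < len(ordered)
              and _extends(segments[-1], ordered[i + 1])):
            # Single inconsistent landmark: the next one fits again, so skip it
            # instead of introducing an artificial breakpoint.
            continue
        else:
            segments.append([lm])
    return segments
```

In this sketch a run of landmarks whose positions in genome B decrease while those in genome A increase is treated as one inverted segment, and two or more consecutive inconsistent landmarks open a new segment.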

LandMarker identified a total of 444410 orthologous landmarks between the human and mouse genomes, a density of approximately 170 landmarks per million bases. These orthologous landmarks are several orders of magnitude denser than any gene-based landmarks. Remarkably, 97.5% of the landmarks are conserved in order across the two genomes. The conserved landmarks define a total of 601 regions of conserved synteny, which cover 91.5% and 92.4% of the mouse and human genomes, respectively. The N50 values of the blocks are 9.8 Mb and 10.5 Mb for mouse and human, respectively. In summary, the comparative map produced by this pipeline is much more sensitive and accurate than gene-based comparative maps. The human-mouse comparative map has been used in several tasks to predict functionally conserved elements, such as orthologs, gene regulatory regions and small non-coding RNAs. A dynamic, interactive comparative viewer has been developed to display the orthologous landmarks and conserved segments between the human and mouse genomes.
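
For readers unfamiliar with the N50 statistic, the following short Python sketch shows how it is conventionally computed from a list of block lengths. The function name n50 is hypothetical, and the abstract does not state which exact convention was used, so this follows the common definition: the length of the block at which the cumulative length, taken from the largest block downward, first reaches half of the total.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of block lengths (common definition)."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            # Blocks of at least this length cover half of the total.
            return length
    raise ValueError("n50 requires at least one length")
```

With block lengths of 2, 3, 4, 5 and 6, the total is 20, the two largest blocks already sum to 11, and the N50 is therefore 5.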