
Genome Rearrangements and Comparative Mapping

David Sankoff, Centre de Recherches Mathématiques Université de Montréal


Chromosome breakage and mistakes in repair, along with a number of other processes, give rise to changes in gene order.  These have important consequences for the cell, the organism, the population, and for the evolution of species.  This tutorial focuses on analytical methods for the characterization of genome rearrangements, for inference about them, and for their use in phylogenetic analysis.


Introduction.  During biological evolution, inter- and intrachromosomal  exchanges of chromosomal fragments disrupt the order of genes on a  chromosome and, for multichromosomal genomes, the partition of genes  among these chromosomes.  Any (maximal) contiguous region of the  genome in which gene content and order have been conserved in two  species is a conserved segment.  Adjacent conserved segments are separated  by breakpoints.  Conserved segments become shorter and more numerous  over time as they are disrupted by new events creating new breakpoints.  The number of conserved segments (or breakpoints) measures the genomic  distance between two species.  The duality between breakpoints and  conserved segments is mirrored in the two research traditions for  reconstructing genomic history based on gene orders in two or more  genomes.  The genome rearrangements approach, making use of  combinatorial optimisation techniques, attempts to infer a most economical  sequence of rearrangement events to account for the differences among the  genomes.  The comparative mapping approach focuses on the statistics of  genes per segment to infer the amount of evolutionary divergence.
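The duality of breakpoints and conserved segments can be made concrete with a small sketch (ours, not part of the tutorial): given two genomes as orderings of the same genes, a breakpoint is a pair of genes adjacent in one genome but not adjacent, in either orientation, in the other.

```python
# Illustrative sketch: counting breakpoints between two unsigned,
# single-chromosome genomes given as orderings of the same gene set.
# Maximal breakpoint-free runs are the conserved segments.

def breakpoints(a, b):
    """Number of breakpoints of genome `a` relative to genome `b`."""
    adjacent = set()
    for x, y in zip(b, b[1:]):
        adjacent.add((x, y))
        adjacent.add((y, x))  # unsigned comparison: orientation is ignored
    return sum((x, y) not in adjacent for x, y in zip(a, a[1:]))

a = [1, 2, 3, 4, 5]
b = [1, 4, 3, 2, 5]                # one inversion of the block 2..4
print(breakpoints(a, b))           # the inversion creates 2 breakpoints
```

Each new rearrangement event tends to add breakpoints, which is why the count serves as a genomic distance.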

Genomic Distances. The algorithmic study of comparative genomics tries to explain differences in gene orders in two or more genomes in terms of a limited number of rearrangement processes.  For single-chromosome genomes, this requires the calculation of an edit distance between two linear orders on the same set of objects, representing the ordering of homologous genes in two genomes.  In the "signed" version of the problem, a plus or minus sign is associated with each gene, representing the direction of transcription.  One edit operation consists of the inversion, or reversal, of any number of consecutive terms in the ordered set, which, in the case of signed orders, also reverses the polarity of each term within the scope of the inversion.  The calculation of the distance for unsigned genomes with inversions only is NP-hard; for the signed problem it is of polynomial complexity. For multi-chromosome genomes, another important edit operation is reciprocal translocation, representing the exchange of terminal fragments between two chromosomes. Some formulations of the distance problem for translocation are of polynomial complexity.  We sketch the Hannenhalli-Pevzner algorithms and available software for inversions and for translocations.
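To make the signed inversion operation concrete, here is a hedged sketch (not the Hannenhalli-Pevzner algorithm, which solves the signed case in polynomial time): a brute-force breadth-first search over inversions, feasible only for genomes of a handful of genes.

```python
# Illustration only: the signed inversion (reversal) operation, and an
# exhaustive BFS for the inversion distance on tiny signed genomes.
# Real instances require the polynomial-time Hannenhalli-Pevzner theory.

from collections import deque

def invert(genome, i, j):
    """Reverse the block genome[i:j+1] and flip the sign of each term."""
    block = tuple(-g for g in reversed(genome[i:j + 1]))
    return tuple(genome[:i]) + block + tuple(genome[j + 1:])

def inversion_distance(source, target):
    """Minimum number of signed inversions turning source into target."""
    source, target = tuple(source), tuple(target)
    seen, queue = {source}, deque([(source, 0)])
    while queue:
        g, d = queue.popleft()
        if g == target:
            return d
        for i in range(len(g)):
            for j in range(i, len(g)):
                h = invert(g, i, j)
                if h not in seen:
                    seen.add(h)
                    queue.append((h, d + 1))

print(inversion_distance([-3, 1, 2], [1, 2, 3]))  # 2 inversions suffice
```

For example, (-3, 1, 2) reaches (1, 2, 3) by inverting the block (1, 2) to give (-3, -2, -1) and then inverting the whole genome.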

Gene-order Phylogenies.  Though distance matrix methods are applicable to genome edit distances, the optimisation of a given tree, including the reconstruction of ancestral gene orders, is NP-hard, even with only three input genomes.  The number of breakpoints is an alternative, easily computed genomic distance which, however, is also theoretically hard to generalise to the phylogenetic context. Nevertheless, it can be transformed into the Traveling Salesman Problem, for which efficient software exists for moderate-sized input.  We illustrate the application of this methodology to the phylogeny of mitochondrial genomes in animals and in protists.
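The three-genome case above is the breakpoint median problem: find a gene order minimising the total breakpoint distance to the three inputs. The following sketch (ours; real instances use the Traveling Salesman reduction and dedicated software) scores every candidate order exhaustively, which is viable only for a handful of genes.

```python
# Brute-force breakpoint median of three unsigned genomes, for illustration.
# Practical solvers reduce the problem to the Traveling Salesman Problem.

from itertools import permutations

def breakpoints(a, b):
    """Breakpoints of order `a` relative to order `b` (unsigned)."""
    adj = {(x, y) for x, y in zip(b, b[1:])}
    adj |= {(y, x) for x, y in zip(b, b[1:])}
    return sum((x, y) not in adj for x, y in zip(a, a[1:]))

def breakpoint_median(genomes):
    """Gene order minimising the total breakpoint distance to all inputs."""
    genes = sorted(genomes[0])
    return min(permutations(genes),
               key=lambda cand: sum(breakpoints(cand, g) for g in genomes))

g1, g2, g3 = [1, 2, 3, 4], [1, 3, 2, 4], [1, 2, 3, 4]
print(breakpoint_median([g1, g2, g3]))  # (1, 2, 3, 4): total 2 breakpoints
```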

Genome Rearrangement with Gene Families. The above methods based on permutations of gene order are seriously compromised for larger genomes where several copies of the same gene, or several highly homologous (paralogous) genes, may be scattered across the genome.  One approach to overcoming this difficulty is based on exemplars: for each genome, an exemplar string is constructed by deleting all but one occurrence of each gene family.  The exemplar distance between two genomes is the minimum, over all choices of exemplar strings, of the distance between them. We describe algorithms and software for the exemplar breakpoint distance and the exemplar inversion distance.  We show how to extend exemplar analysis to phylogenetic analysis, using output from gene-tree/species-tree algorithms.
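The exemplar definition can be sketched directly (our illustration, not the tutorial's software): enumerate every way of keeping one occurrence per gene family in each genome, and take the minimum pairwise distance. The enumeration is exponential in the family multiplicities, which is why the real problem calls for dedicated algorithms.

```python
# Exhaustive computation of the exemplar breakpoint distance between two
# genomes with repeated gene families. Illustration only: the enumeration
# explodes for realistic multiplicities.

from itertools import product

def breakpoints(a, b):
    """Breakpoints of order `a` relative to order `b` (unsigned)."""
    adj = {(x, y) for x, y in zip(b, b[1:])}
    adj |= {(y, x) for x, y in zip(b, b[1:])}
    return sum((x, y) not in adj for x, y in zip(a, a[1:]))

def exemplars(genome):
    """Yield every string keeping exactly one occurrence per gene family."""
    positions = {}
    for i, g in enumerate(genome):
        positions.setdefault(g, []).append(i)
    for choice in product(*positions.values()):
        keep = set(choice)
        yield [g for i, g in enumerate(genome) if i in keep]

def exemplar_breakpoint_distance(a, b):
    return min(breakpoints(ea, eb)
               for ea in exemplars(a) for eb in exemplars(b))

# Keeping the first copy of gene 1 makes the genomes identical:
print(exemplar_breakpoint_distance([1, 2, 1, 3], [1, 2, 3]))  # 0
```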

Hybridization. Several types of biological processes, widespread in the plant kingdom, give rise to hybrids.  A variety of mathematical problems arise in trying to model these processes.  These problems differ according to the process responsible for hybridisation, the kinds of data available, and the aim of the evolutionary reconstruction.  For example, one form of hybridization of two karyotypically distinct species sees the fusion of two genomes followed by a series of chromosomal rearrangement events until the hybrid genome is finally stabilised as a diploid.  We seek to infer an ancestral hybrid genome, using data on which hybrid genes originated from each of the parent species, and/or additional data (synteny and gene order) from the modern purebred descendants of these parents.  A number of algorithms are available for these problems.  Hybrids may also be formed through the exceptional fertilisation of two distinct though related species, the parent species possibly differing from each other and from the modern hybrid by numerous genome rearrangements. This case may be analyzed using the three-genome generalisation discussed above.

Genome duplication. The effects of genome doubling show up across  the eukaryote spectrum, e.g., in yeast, plants and vertebrates. We propose a  suite of genome halving problems, and associated algorithms, for  reconstructing the ancestral, pre-duplication genome, each problem  depending on the level of detail of the data and the desired reconstruction.   In each case the idea is to find a genome consisting of pairs of identical  chromosomes, representing the original tetraploid, such that the number of  translocations (including chromosome fusions and fissions) required to  transform it to the modern genome is minimised.

Comparative Mapping.  The simplest model of genomic divergence, deriving from a 1984 study by Nadeau and Taylor, assumes the spatial homogeneity of both breakpoint and gene distributions along the chromosomes. The main focus has been the severe underestimation of the number of segments in comparisons where there are relatively few genes common to the data sets for a pair of species.  To address this, we study the marginal probability that a segment contains r genes, as well as the distribution of the number a of non-empty segments observed when there are m genes and n breakpoints.  We show that a is a sufficient statistic for the number of breakpoints, and provide estimators for n given m and a.  We discuss how to relax the homogeneity assumptions of the model.
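The homogeneous model can be simulated directly. In the sketch below (ours; the closed-form expressions are standard consequences of the uniform model, not quoted from the tutorial), m genes and n breakpoints are dropped uniformly on a unit chromosome, and a counts the segments containing at least one gene. Under this model E[a] = (n+1)m/(m+n), which inverts to the moment estimator n_hat = m(a-1)/(m-a), showing how a, given m, carries the information about n.

```python
# Simulation of the homogeneous (Nadeau-Taylor-style) model: uniform genes
# and uniform breakpoints on a unit chromosome; a = number of non-empty
# segments. Under this model E[a] = (n+1)m/(m+n).

import random

def simulate_nonempty_segments(m, n, rng):
    """One draw of a: segments (from n uniform cuts) holding >= 1 gene."""
    genes = sorted(rng.random() for _ in range(m))
    cuts = sorted(rng.random() for _ in range(n))
    bounds = [0.0] + cuts + [1.0]
    return sum(any(lo <= g < hi for g in genes)
               for lo, hi in zip(bounds, bounds[1:]))

def estimate_breakpoints(m, a):
    """Moment estimator for n given m genes and a non-empty segments."""
    return m * (a - 1) / (m - a)

rng = random.Random(0)
m, n, trials = 50, 20, 2000
mean_a = sum(simulate_nonempty_segments(m, n, rng)
             for _ in range(trials)) / trials
print(round(mean_a, 2), round((n + 1) * m / (m + n), 2))  # empirical vs E[a]
```

With few shared genes (small m), many segments are empty, a falls well below n+1, and naive segment counts severely underestimate n, which is exactly the bias discussed above.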

Identification of Conserved Segments.  As map data accumulate, it becomes increasingly difficult to find segments in which gene content and order are strictly parallel in two genomes, due in part to experimental error, but also to high rates of inversion of small regions of chromosomes.  We introduce a method for estimating the configuration of conserved segments resulting from the evolutionary history of reciprocal translocations, by minimizing a weighted sum of mapping error and rearrangement costs. This involves a variant of single link stepwise cluster analysis performed simultaneously on all conserved syntenies, with the interim results from each cluster analysis affecting the current state of all the others.
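A much-simplified sketch of the clustering ingredient (ours, not the tutorial's full method, which also weighs mapping error against rearrangement cost and couples the clusterings across syntenies): genes shared by two genomes become points (position in A, position in B), and two matches are single-linked into the same candidate segment when they fall within a gap threshold on both genomes.

```python
# Single-linkage clustering of gene matches into candidate conserved
# segments. A match joins a cluster if it is within `gap` of any member
# on both genomes; clusters linked by the same match are merged.

def single_link_segments(matches, gap):
    """Cluster (posA, posB) matches by single linkage with threshold gap."""
    clusters = []
    for m in matches:
        linked = [c for c in clusters
                  if any(abs(m[0] - x[0]) <= gap and abs(m[1] - x[1]) <= gap
                         for x in c)]
        merged = [m] + [x for c in linked for x in c]
        clusters = [c for c in clusters if c not in linked] + [merged]
    return clusters

matches = [(1, 10), (2, 11), (3, 12), (40, 70), (41, 69)]
print(single_link_segments(matches, gap=5))  # two candidate segments
```

The gap threshold is what tolerates experimental error and small local inversions: strictly parallel order is no longer required within a segment.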

Synthesis of the Algorithmic and Statistical Approaches.  To better characterise segment statistics, a realistic probability model must discard homogeneity in favour of one parameterised by regional chromosomal characteristics: isochore structure, telomeric proximity, recombination rates and gene richness.  At the same time, the simplified problems solvable by methods of combinatorial optimisation should be reformulated in a more realistic way, making use of the quantitative aspects of the biological context and parameterised probability models, perhaps sacrificing exact global solutions in favour of heuristic algorithms, local optimisation, and simulation.