David Sankoff, Centre de Recherches Mathématiques Université de Montréal
Chromosome breakage and mistakes in repair, along with a number of other processes, give rise to changes in gene order. These have important consequences for the cell, the organism, the population, and for the evolution of species. This tutorial focuses on analytical methods for the characterization of genome rearrangements, for inferences about them, and their use in phylogenetic analysis.
Introduction. During biological evolution, inter- and intrachromosomal exchanges of chromosomal fragments disrupt the order of genes on a chromosome and, for multichromosomal genomes, the partition of genes among these chromosomes. Any (maximal) contiguous region of the genome in which gene content and order have been conserved in two species is a conserved segment. Adjacent conserved segments are separated by breakpoints. Conserved segments become shorter and more numerous over time as they are disrupted by new events creating new breakpoints. The number of conserved segments (or breakpoints) measures the genomic distance between two species. The duality between breakpoints and conserved segments is mirrored in the two research traditions for reconstructing genomic history based on gene orders in two or more genomes. The genome rearrangements approach, making use of combinatorial optimisation techniques, attempts to infer a most economical sequence of rearrangement events to account for the differences among the genomes. The comparative mapping approach focuses on the statistics of genes per segment to infer the amount of evolutionary divergence.
Genomic Distances. The algorithmic study of comparative genomics tries to explain differences in gene orders in two or more genomes in terms of a limited number of rearrangement processes. For single-chromosome genomes, this requires the calculations of an edit distance between two linear orders on the same set of objects, representing the ordering of homologous genes in two genomes. In the ''signed'' version of the problem, a plus or minus is associated with each gene, representing the direction of transcription. One edit operation consists of the inversion, or reversal, of any number of consecutive terms in the ordered set, which, in the case of signed orders, also reverses the polarity of each term within the scope of the inversion. The calculation of the distance for unsigned genomes with inversions only is NP-hard; for signed problem it is of polynomial complexity. For multi-chromosome genomes, another important edit operation is reciprocal translocation, representing the exchange of terminal fragments between two chromosomes. Some formulations of the distance problem for translocation are of polynomial complexity. We sketch the Hannenhalli-Pevzner algorithms and available software for inversions and for translocations.
Gene-order Phylogenies. Though distance matrix methods are applicable to genome edit distances, the optimisation of a given tree, including the reconstruction of ancestral gene orders, is NP-hard, even with only three input genomes. The number of breakpoints is an alternate, easily computed, genomic distance which, however, is also theoretically hard to generalise to the phylogenetic context. Nevertheless, it can be transformed into the Traveling Salesman Problem, for which efficient software exists for moderate-sized input. We illustrate the application of this methodology to the phylogeny of mitochondrial genomes in animals and in protists.
Genome Rearrangement with Gene Families. The above methods based on permutations of gene order are seriously compromised for larger genomes where several copies of the same gene, or several highly homologous (paralogous) genes may be scattered across the genome. One approach to overcoming this difficulty is based on exemplars: for each genome, an exemplar sequence is constructed by deleting all but one occurrence of each gene family. The exemplar distance between two genomes is the minimum, over all their exemplar strings, of the distance between their exemplars. We describe algorithms and software for exemplar breakpoint distance and exemplar inversions distance. We show how to extend exemplar analysis to phylogenetic analysis, using output from gene-tree/species-tree algorithms.
Hybridization. Several types of biological processes, widespread in the plant kingdom, give rise to hybrids. A variety of mathematical problems arising in trying to model these processes. These problems differ according to the process responsible for hybridisation, the kinds of data available, and the aim of the evolutionary reconstruction. For example, one form of hybridization of two karyotypically distinct species sees the fusion of two genomes followed by a series of chromosomal rearrangement events until the hybrid genome is finally stabilised as a diploid. We seek to infer an ancestral hybrid genome, using data on which hybrid genes originated from each of the parent species, and/or additional data (synteny and gene order) from the modern purebred descendants of these parents. A number of algorithms are available for these problems. Hybrids may also be formed through the exceptional fertilisation of two distinct though related species, the parent species possibly differing from each other and from the modern hybrid by numerous genome rearrangements. This case may be analyzed using the three-genome generalisation discussed above.
Genome duplication. The effects of genome doubling show up across the eukaryote spectrum, e.g., in yeast, plants and vertebrates. We propose a suite of genome halving problems, and associated algorithms, for reconstructing the ancestral, pre-duplication genome, each problem depending on the level of detail of the data and the desired reconstruction. In each case the idea is to find a genome consisting of pairs of identical chromosomes, representing the original tetraploid, such that the number of translocations (including chromosome fusions and fissions) required to transform it to the modern genome is minimised.
Comparative Mapping. The simplest model of genomic divergence, deriving from a 1984 study by Nadeau and Taylor, assumes the spatial homogeneity of both breakpoint and gene distributions along the chromosomes. The main focus has been the severe underestimation of the number of segments in comparisons where there are relatively few genes common to the data sets for a pair of species. To this end, we study the marginal probability that a segment contain r genes, as well as the probability of observing a non-empty segments if there are m genes and n breakpoints. We show that a is a sufficient statistic for the number of breakpoints, and provide estimators for n given m and a. We discuss how to relax the homogeneity assumptions of the model.
Identification of Conserved Segments. As map data accumulate, it becomes increasingly difficult to find segments in which gene content and order are strictly parallel in two genomes, due in part to experimental error, but also to high rates of inversion of small regions of chromosomes. We introduce a method for estimating the configuration of conserved segments resulting from the evolutionary history of reciprocal translocations, by minimizing a weighted sum of mapping error and rearrangement costs. This involves a variant of single link stepwise cluster analysis performed simultaneously on all conserved syntenies, with the interim results from each cluster analysis affecting the current state of all the others .
Synthesis of the Algorithmic and Statistical Approaches. To better characterise segment statistics, a realistic probability model must discard homogeneity in favour of one parameterised by regional chromosomal characteristics: isochore structure, telomeric proximity, recombination rates and gene richness. At the same time, the simplified problems solvable by methods of combinatorial optimisation should be reformulated in a more realistic way, making use of the quantitative aspects of the biological context and parameterised probability models, perhaps sacrificing exact global solutions in favour of heuristic algorithms, local optimisation, and simulation.