LAGAN2: Probabilistic Global Alignment of DNA Under Multiple Conservation Models

Chuong B. Do1, Michael Brudno2, Serafim Batzoglou
1chuong.do@stanford.edu, Stanford University; 2brudno@cs.stanford.edu, Stanford University

Comparative analysis of genomic sequences from different organisms has long been considered a powerful method for inferring biological function based on nucleotide sequence similarity. Most standard global nucleotide aligners use some variation of or approximation to the basic Needleman-Wunsch algorithm for modeling evolutionary changes in a pair of sequences. When organisms are sufficiently diverged, however, simple DNA-level conservation sometimes provides too weak a signal for reliable comparisons, leading to inaccurate alignments. Better results for distantly related species can be achieved by combining several different alignment models to utilize the local characteristics of regions of sequences during alignment. We present LAGAN2, a probabilistic extension of the LAGAN global alignment method for comparison of sequences, which takes into account variance in local evolutionary constraints. Building on the limited area methodology employed in the original LAGAN algorithm, we generalize the notion of Needleman-Wunsch alignment to allow states for modeling nucleotide conservation in protein-coding, non-coding, and unconstrained alignment regions. This model, which simultaneously searches for amino acid and nucleotide conservation has advantages when aligning sequence pairs with both protein-coding and non-coding regions over models which rely on single base pair edits. Furthermore, our algorithm uses a method for approximating logarithmic gap penalties via piecewise affine gaps. This technique, which can differentiate between various mechanisms for gap creation and allows for more realistic gap functions than the common affine penalty, will have an advantage in handling regions with both long insertions and short deletions in a single alignment. By using this more sophisticated combination of models, LAGAN2 takes better advantage of the richness of features present in biological sequences. Unlike most other global nucleotide aligners, LAGAN2 incorporates a multi-state pair Hidden Markov Model for computing Viterbi alignments, which provides not only an intuitive meaning for the various transition and emission probabilities used but also permits statistical estimation of proper alignment parameters via expectation-maximization. Additional information is available at http://lagan.stanford.edu.