Efficient algorithms for sequence comparison and overlap detection - Correcting errors in shotgun sequence reads

Martti T. Tammi1, Erik Arner2, Ellen Kindlund, Björn Andersson
1martti.tammi@cgb.ki.se, Center for Genomics and Bioinformatics, Karolinska Institutet; 2erik.arner@cgb.ki.se, Center for Genomics and Bioinformatics, Karolinska Institute

Approximate sequence matching has a wide range of applications in biology. Large scale analysis of massive amounts of data require efficient and rapid algorithms to complete all necessary tasks in a reasonable time. The use of approximate matches instead of exact matches to detect potentially overlapping sequences results in increased sensitivity and specificity. This is desirable in order to avoid running unnecessarily computationally expensive dynamic programming algorithms. However, exact matches are widely used as seeds for alignments since it is more time consuming to compute inexact matches than exact matches between sequences. We present a novel approximate pattern matching algorithm that solves this problem. There is a symmetry between indices that can be computed for similar words of the same length. A computation of approximate matches between sequences can be made very rapid by taking advantage of this symmetry pattern. This allowed the development of an on-line, linear time algorithm for construction of multiple alignments, with no previous pairwise matching of sequence reads required. Preliminary results indicate an average specificity of 100.0%, while the sensitivity was 93.1%. These algorithms were were employed in a program for shotgun sequence error correction. Sequencing errors in combination with repeated regions is the key problem in shotgun sequencing. This is mainly due to the failure of assembly programs to distinguish single base differences between repeat copies from erroneous base calls. Errors are corrected utilizing defined nucleotide positions, DNPs. The DNP method distinguishes single base differences from sequencing errors by analyzing multiple alignments consisting of a read and all its overlaps with other reads. The accuracy of the error correction is highly dependent on the sensitivity of the multiple alignment algorithm. Results from a C++ implementation of this method show that up to 99% of sequencing errors can be corrected, while up to 87% of the single base differences remain and up to 80% of the corrected reads contain at most one error. The results also show that the method outperforms the error correction method used in the EULER assembler. The prototype software, MisEd, is freely available from the authors for academic use.