### Efficient algorithms for sequence comparison and overlap
detection - Correcting errors in shotgun sequence reads

Martti T. Tammi^{1}, Erik Arner^{2}, Ellen Kindlund, Björn Andersson

^{1}martti.tammi@cgb.ki.se, Center for Genomics and Bioinformatics, Karolinska Institutet; ^{2}erik.arner@cgb.ki.se, Center for Genomics and Bioinformatics, Karolinska Institute

Approximate sequence matching has a wide range of
applications in biology. Large scale analysis of massive amounts of
data require efficient and rapid algorithms to complete all necessary
tasks in a reasonable time. The use of approximate matches instead of
exact matches to detect potentially overlapping sequences results in
increased sensitivity and specificity. This is desirable in order to
avoid running unnecessarily computationally expensive dynamic
programming algorithms. However, exact matches are widely used as
seeds for alignments since it is more time consuming to compute
inexact matches than exact matches between sequences. We present a
novel approximate pattern matching algorithm that solves this
problem. There is a symmetry between indices that can be computed for
similar words of the same length. A computation of approximate matches
between sequences can be made very rapid by taking advantage of this
symmetry pattern. This allowed the development of an on-line, linear
time algorithm for construction of multiple alignments, with no
previous pairwise matching of sequence reads required. Preliminary
results indicate an average specificity of 100.0%, while the
sensitivity was 93.1%. These algorithms were were employed in a
program for shotgun sequence error correction. Sequencing errors in
combination with repeated regions is the key problem in shotgun
sequencing. This is mainly due to the failure of assembly programs to
distinguish single base differences between repeat copies from
erroneous base calls. Errors are corrected utilizing defined
nucleotide positions, DNPs. The DNP method distinguishes single base
differences from sequencing errors by analyzing multiple alignments
consisting of a read and all its overlaps with other reads. The
accuracy of the error correction is highly dependent on the
sensitivity of the multiple alignment algorithm. Results from a C++
implementation of this method show that up to 99% of sequencing errors
can be corrected, while up to 87% of the single base differences
remain and up to 80% of the corrected reads contain at most one
error. The results also show that the method outperforms the error
correction method used in the EULER assembler. The prototype software,
MisEd, is freely available from the authors for academic use.