Alexa: an improved EST and genomic sequence alignment tool

Miao Zhang1, Warren Gish2
1mzhang@sapiens.wustl.edu, Department of Genetics and Department of Biomedical Engineering, Washington University; 2gish@watson.wustl.edu, Department of Genetics, Washington University

Although several tools have been developed specifically for aligning EST and genomic sequences, pitfalls remain in their use. For example, some tools have difficulty identifying short exons. Furthermore, most programs require splice sites to conform to the canonical GT…AG rule, even though roughly 1.7% of human splice sites are non-canonical and a similar phenomenon is seen in other species. To address these concerns, we developed a tool named Alexa, which incorporates a potentially arbitrary splice site model into the dynamic programming recurrence. As currently implemented, the splice site model consists of two position-specific weight matrices that capture the base composition at donor and acceptor sites. The possibility of non-canonical sites is currently reflected in the model by the addition of pseudo-counts. By scaling scores from the splice site model to the match/mismatch scores of the sequence alignment algorithm, Alexa combines information from the aligned sequences with that from potential splice sites. In the presence of the sequencing errors and polymorphisms expected in EST data, Alexa is therefore more likely to use the correct splice sites in its alignments. To reduce memory consumption and increase speed (at the cost of a marginal decrease in overall accuracy), Alexa can be guided by an input file produced by WU-BLASTN (http://blast.wustl.edu). With the optional “topcombo” processing of WU-BLASTN, the influence of paralogs, pseudogenes and repetitive elements can also be reduced. Alexa is available at http://sapiens.wustl.edu/alexa.
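The role of pseudo-counts and score scaling in a splice site model of this kind can be illustrated with a minimal Python sketch. It is not Alexa's implementation: the function names, the toy donor-site windows, the uniform background frequency, and the `scale` parameter are all assumptions made for illustration.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build per-position log-odds scores (in bits) from aligned, equal-length sites.

    The pseudocount keeps unseen bases from scoring minus infinity, so a
    non-canonical site is penalized but never ruled out.
    """
    length = len(sites[0])
    pwm = []
    for pos in range(length):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2(counts[b] / total / background) for b in BASES})
    return pwm

def site_score(pwm, window, scale=1.0):
    """Sum the per-position scores for a window, multiplied by a scale factor.

    `scale` is a hypothetical knob for putting splice site scores on the same
    footing as the aligner's match/mismatch scores.
    """
    return scale * sum(column[base] for column, base in zip(pwm, window))

# Toy donor-site windows (hypothetical data): 2 exon bases followed by 4 intron bases.
donor_sites = ["AGGTAA", "AGGTGA", "TGGTAA", "AGGTAT", "CGGTAA"]
donor_pwm = build_pwm(donor_sites)

canonical = site_score(donor_pwm, "AGGTAA")      # GT donor, seen in training
non_canonical = site_score(donor_pwm, "AGGCAA")  # GC donor, never seen in training

print(f"canonical GT donor:     {canonical:.2f}")
print(f"non-canonical GC donor: {non_canonical:.2f}")
```

In this sketch the GC donor scores lower than the GT donor but remains finite, so an aligner adding such a score at candidate intron boundaries could still choose a non-canonical site when the flanking sequence matches strongly support it.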