Species-Specific Substitution Matrices

Michel Dumontier¹, Christopher W.V. Hogue²
¹micheld@mshri.on.ca, Department of Biochemistry, University of Toronto, Toronto, Ontario, Canada M5S 1A8, Samuel Lunenfeld Research Institute, Mount Sinai Hospital, 600 University Ave., Toronto, Ontario, Canada M5G 1X5; ²hogue@mshri.on.ca, Department of Biochemistry, University of Toronto, Toronto, Ontario, Canada M5S 1A8, Samuel Lunenfeld Research Institute, Mount Sinai Hospital, 600 University Ave., Toronto, Ontario, Canada M5G 1X5

With over 100 completely sequenced genomes spanning the archaea, bacteria and eukarya, new opportunities exist to exploit the wealth of generated sequence data. We previously investigated the wide spectrum of amino acid compositions in 100 complete genome sequence datasets and derived effective species-specific sequence and fold composition scoring functions [Dumontier, 2002]. Both compositional simplicity and compositional bias are known to adversely affect pairwise and profile-based sequence alignments, so corrective measures should be considered to improve alignments and database search statistics. We set out to build species-specific substitution matrices (SSSMs) that take compositional bias explicitly into account, and to evaluate whether they contribute information that enhances sequence alignment quality. Species-specific fold databases were created by aligning the open reading frames of each complete genome with their nearest structure-bearing sequence neighbours. From these alignments, the corresponding log-odds matrices were constructed. Both query and subject sequences were assigned an SSSM by ‘taxonomy approximation’ to that of a complete genome, using taxonomic intersection and species-specific sequence scoring. A variety of scoring schemes were tested to maximize the effect of using multiple matrices. Matrix performance was tested on the CASA server sequence alignment sets and compared with that of popular sequence alignment programs such as PSI-BLAST and pairwise ClustalW using standard substitution matrices. SSSMs capture the species-specific amino acid composition bias found in genomic sequences. Our results indicate that SSSMs may improve sequence alignment quality in the 20-30% identity range. Strategies for increasing sequence alignment length and optimizing gap penalties are discussed.
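The abstract does not spell out how the log-odds matrices were computed. As a rough illustration only, the sketch below shows the generic log-odds construction, in which the score for a residue pair is the scaled logarithm of its observed aligned-pair frequency over the frequency expected from background composition. The function name `log_odds_matrix`, the pseudocount, the half-bit scale factor and the toy input are all assumptions made for this example; they are not taken from the authors' method.

```python
import math
from collections import Counter


def log_odds_matrix(aligned_pairs, background, scale=2.0, pseudocount=1.0):
    """Build a symmetric log-odds substitution matrix (hypothetical helper).

    aligned_pairs: iterable of (query, subject) gapped strings of equal length.
    background:    dict mapping residue -> background frequency, e.g. the
                   amino acid composition of one genome (assumed input).
    Returns a dict mapping (residue_a, residue_b) -> rounded integer score.
    """
    alphabet = sorted(background)
    counts = Counter()
    for query, subject in aligned_pairs:
        for a, b in zip(query, subject):
            # Columns containing gaps or unknown residues are skipped.
            if a in background and b in background:
                # Count both directions so the matrix comes out symmetric.
                counts[(a, b)] += 1
                counts[(b, a)] += 1

    # Pseudocounts keep unobserved pairs from producing log(0).
    total = sum(counts.values()) + pseudocount * len(alphabet) ** 2

    matrix = {}
    for a in alphabet:
        for b in alphabet:
            observed = (counts[(a, b)] + pseudocount) / total
            expected = background[a] * background[b]
            matrix[(a, b)] = round(scale * math.log2(observed / expected))
    return matrix


if __name__ == "__main__":
    # Toy two-letter alphabet, purely illustrative.
    toy_background = {"A": 0.5, "C": 0.5}
    toy_pairs = [("AACC", "AACA")]
    for pair, score in sorted(log_odds_matrix(toy_pairs, toy_background).items()):
        print(pair, score)
```

Under these assumptions, a matrix becomes species-specific simply by passing in the background composition and alignments for one genome, so that pairs involving residues over-represented in that genome are scored against a correspondingly higher expected frequency.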