Detection of false positive results from PSI-BLAST

N. Faux (1), M. Cameron (2), M. Garcia de la Banda, J.C. Whisstock
(1) noel.faux@med.monash.edu.au, Department of Biochemistry and Molecular Biology, Monash University; (2) mcam@csse.monash.edu.au, School of Computer Science and Software Engineering, Monash University

Position-Specific Iterative (PSI)-BLAST is a sensitive and fast database search algorithm, commonly used to detect putative remote homologues. PSI-BLAST exploits the observation that functionally important motifs are often conserved even between very distantly related proteins. The algorithm achieves high sensitivity by building a matrix that reflects patterns of sequence conservation within the results. Subsequent searches using the matrix are performed until convergence is reached (i.e. no new sequences are identified). One problem with PSI-BLAST is its high ratio of false positives to true positives (compared, for example, to search methods such as SAM). This is because the presence of false positives in the result set used to build the search matrix can “skew” the search profile. The early detection of false positives is therefore important and, where a large number of searches are to be performed, automated methods (rather than time-consuming manual analysis of results) are required. In particular, if a search can be “stopped” just before it becomes significantly contaminated, valuable data can be retained. We have investigated instances where a large number of false positives are detected using PSI-BLAST. Our data reveal that in most instances false positive results have caused the matrix to skew to such an extent that true positives no longer predominate within the results. We are using these data to create a set of heuristics that facilitate the rapid and early detection of a matrix that no longer accurately describes the original query sequence or family.
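
The iterate-until-convergence loop, with an early stop before contamination, can be illustrated with a minimal Python sketch. It does not reproduce the heuristics under development: the function `iterate_with_early_stop`, the callback `search_round`, and the `min_seed_retention` threshold are all hypothetical names, and the drift test used here (the fraction of first-round hits still present in the current result set) is only an assumed stand-in for a real contamination check.

```python
from typing import Callable, Set, Tuple


def iterate_with_early_stop(
    search_round: Callable[[Set[str]], Set[str]],
    max_rounds: int = 10,
    min_seed_retention: float = 0.5,
) -> Tuple[Set[str], str]:
    """Run an iterative profile search with a simple drift check.

    search_round is a hypothetical callback: given the sequence IDs used to
    build the current matrix, it returns the IDs found by the next search.
    Returns the last accepted result set and the reason for stopping.
    """
    previous: Set[str] = set()   # hits accepted in the last round
    seed_hits: Set[str] = set()  # hits from the first round (assumed cleanest)

    for round_no in range(1, max_rounds + 1):
        hits = search_round(previous)

        if round_no == 1:
            seed_hits = set(hits)
        elif seed_hits:
            # Assumed heuristic: if too few first-round hits survive, the
            # matrix may no longer describe the original query.
            retention = len(seed_hits & hits) / len(seed_hits)
            if retention < min_seed_retention:
                # Stop and keep the results from before the suspected skew.
                return previous, "drift"

        # Convergence: this round identified no new sequences.
        if hits <= previous:
            return previous, "converged"

        previous = set(hits)

    return previous, "max_rounds"
```

In this sketch a search that drifts returns the result set from the round before the suspected skew, which mirrors the idea of stopping a search just before it becomes significantly contaminated.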