On the Correspondence between Scoring Matrices and Binding Site Sequence Distributions

Jan E. Gewehr¹, Jan T. Kim², Thomas Martinetz²
¹gewehr@bio.informatik.uni-muenchen.de, Institute for Computer Science, Ludwig-Maximilians-University Munich, Theresienstr. 39, D-80333 Munich, Germany; ²kim@inb.uni-luebeck.de, Institute for Neuro- and Bioinformatics, University of Luebeck, Seelandstr. 1a, D-23569 Luebeck, Germany

Processes that implement biological functions based on information stored in DNA require DNA-binding proteins to act on the genome at specific locations, called binding sites. Because knowing the location of binding sites provides information on, for example, the regulation of gene expression, binding site prediction is a highly relevant task in bioinformatics. Scoring matrices are a standard approach to finding binding sites by linear sequence classification. Using maximum likelihood estimation, we analyze the correspondence between popular matrix types and specific probability distributions of binding site sequences.
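
As an illustration of linear sequence classification with a scoring matrix, the following minimal Python sketch builds a log-odds matrix from aligned example sites and scores every window of a sequence. It is not one of the specific matrix types analyzed here: the function names, the uniform background of 0.25, and the pseudocount (which departs from a pure maximum likelihood estimate in order to avoid log(0)) are all assumptions made for the example.

```python
import math

ALPHABET = "ACGT"

def log_odds_matrix(sites, pseudocount=1.0, background=0.25):
    """Estimate per-position base frequencies from aligned sites and
    convert them to log2-odds against a uniform background.
    The pseudocount and the uniform background are illustrative assumptions."""
    length = len(sites[0])
    matrix = []
    for pos in range(length):
        counts = {base: pseudocount for base in ALPHABET}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        matrix.append(
            {base: math.log2(counts[base] / total / background) for base in ALPHABET}
        )
    return matrix

def score(matrix, window):
    """Linear score: sum of the matrix entries selected by the window's bases."""
    return sum(column[base] for column, base in zip(matrix, window))

def scan(matrix, sequence):
    """Score every window of matrix width along the sequence."""
    width = len(matrix)
    return [
        (start, score(matrix, sequence[start:start + width]))
        for start in range(len(sequence) - width + 1)
    ]
```

A window is then classified as a binding site when its score reaches a chosen threshold, for example `[start for start, s in scan(matrix, sequence) if s >= threshold]`.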

A good binding site finding method detects binding sites with both high sensitivity and high specificity. Because the availability of large amounts of genomic sequence data means that very large sequence sets must be searched, high specificity is an essential goal. Using sets of binding sites from TRANSFAC, we evaluate the specificity of matrix classifiers under the constraint that the known binding site sequences are classified correctly. The binding matrix achieves the highest specificity among all matrices included in our test. This indicates that the binding matrix is a good choice when the distribution of binding site sequences is unknown.
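
The evaluation criterion described above can be sketched as follows, assuming that each candidate sequence has already been assigned a score by some matrix. The threshold is set to the lowest score among the known sites, so that all of them are accepted, and specificity is taken as the fraction of non-site sequences scoring below that threshold. The function names are hypothetical, and how the set of non-site sequences is chosen is an assumption of this sketch, not something stated here.

```python
def threshold_for_known_sites(site_scores):
    """Highest threshold that still accepts every known site:
    the minimum score observed among the known sites."""
    return min(site_scores)

def specificity(non_site_scores, threshold):
    """Fraction of non-site sequences rejected, i.e. scoring
    strictly below the threshold."""
    rejected = sum(1 for s in non_site_scores if s < threshold)
    return rejected / len(non_site_scores)
```

With this setup, two matrices can be compared on the same data by computing each matrix's own threshold from the known sites and then comparing the resulting specificities.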