Elucidating Patterns within Patterns: A Post-Processing Step in Promoter Sequence Analysis
Jessica Mar¹, Alvis Brazma²
¹jess@ebi.ac.uk, European Bioinformatics Institute; ²brazma@ebi.ac.uk, European Bioinformatics Institute
Pattern discovery in promoter sequence analysis involves identifying regular expressions that are over-represented in the promoters of a set of co-expressed genes. Statistically significant patterns are candidates for transcription factor binding sites or other important regulatory elements. SPEXS (Sequence Pattern Exhaustive Search) is a suffix-tree-based algorithm that extracts a list of statistically over-represented patterns from a given set of promoter sequences. It is packaged as a web-based software module within Expression Profiler (Vilo et al. 2003).
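To make the notion of over-representation concrete, the following is a minimal sketch and not the SPEXS algorithm itself. It assumes fixed-length substrings in place of regular expressions, a brute-force scan in place of a suffix tree, and a hypergeometric tail probability as the significance score; the abstract does not specify the statistic SPEXS uses. The function names and the `alpha` threshold are hypothetical.

```python
from math import comb


def sequences_containing(pattern, sequences):
    """Count how many sequences contain the pattern at least once."""
    return sum(1 for s in sequences if pattern in s)


def hypergeom_tail(k, n, K, N):
    """P(X >= k) when n sequences are drawn from N, of which K contain the pattern."""
    upper = min(n, K)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


def overrepresented(cluster, others, length=6, alpha=0.01):
    """Return (pattern, p_value) pairs for substrings enriched in `cluster`.

    Assumption: `cluster` holds promoters of co-expressed genes and `others`
    holds the remaining promoters; together they form the background set.
    """
    # Candidate patterns: every substring of the given length seen in the cluster.
    candidates = {s[i:i + length] for s in cluster for i in range(len(s) - length + 1)}
    n, N = len(cluster), len(cluster) + len(others)
    results = []
    for pattern in sorted(candidates):
        k = sequences_containing(pattern, cluster)
        K = k + sequences_containing(pattern, others)
        p = hypergeom_tail(k, n, K, N)
        if p < alpha:
            results.append((pattern, p))
    # Most significant patterns first.
    return sorted(results, key=lambda item: item[1])
```

A pattern found in every promoter of a small cluster and nowhere else in the background receives a small tail probability, whereas a pattern that is common genome-wide does not, which is the intuition behind ranking patterns by statistical significance.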
In general, the output list from SPEXS contains too many significant patterns for a user to survey in detail, as noted in Vilo et al. (2000); a post-processing step that highlights the key patterns discovered is therefore helpful. Reducing this output list to a small number of key clusters, each grouping strongly similar patterns, facilitates the interpretation of the promoter sequence analysis. A consensus representation of each cluster may be visualized as a sequence logo or weight matrix for further analysis.
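The post-processing idea described above can be illustrated with a small sketch. It is not one of the clustering approaches presented in this work, whose details are not given here. It assumes equal-length patterns over the alphabet ACGT with no wildcards, groups them greedily by Hamming distance to the first member of each cluster, and summarizes each cluster as a position count matrix, a simple precursor of a weight matrix. The function names and the `max_mismatches` parameter are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length patterns."""
    return sum(x != y for x, y in zip(a, b))


def greedy_cluster(patterns, max_mismatches=1):
    """Assign each pattern to the first cluster whose seed is close enough."""
    clusters = []
    for p in patterns:
        for c in clusters:
            # c[0] is the seed: the first pattern placed in this cluster.
            if hamming(p, c[0]) <= max_mismatches:
                c.append(p)
                break
        else:
            # No existing cluster matched, so start a new one.
            clusters.append([p])
    return clusters


def count_matrix(cluster, alphabet="ACGT"):
    """Per-position base counts for a cluster of equal-length patterns."""
    length = len(cluster[0])
    return [{b: sum(p[i] == b for p in cluster) for b in alphabet}
            for i in range(length)]


def consensus(matrix):
    """Most frequent base at each position (ties go to the earlier base)."""
    return "".join(max(col, key=col.get) for col in matrix)


if __name__ == "__main__":
    patterns = ["ACGCGT", "ACGCGA", "TCGCGT", "GATAAG", "GATAAC"]
    for c in greedy_cluster(patterns):
        print(c, consensus(count_matrix(c)))
```

Greedy seeding is order-dependent and is chosen here only for brevity; hierarchical or other clustering schemes could be substituted without changing how the count matrix and consensus are derived.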
We present four clustering approaches developed to isolate these key clusters. The methods have been applied to yeast microarray expression data for which known biological signals have been reported in the literature. In all cases, the approaches grouped the list of significant patterns into clusters from which the known consensus binding sites were recovered with a high degree of accuracy.
References
Vilo, J., Kapushesky, M., Kemmeren, P., Sarkans, U. and Brazma, A. (2003). Expression Profiler. In Parmigiani, G., Garrett, E.S., Irizarry, R. and Zeger, S.L. (eds), The Analysis of Gene Expression Data: Methods and Software, Springer-Verlag, New York, NY.
Vilo, J., Brazma, A., Jonassen, I., Robinson, A. and Ukkonen, E. (2000). Mining for Putative Regulatory Elements in the Yeast Genome Using Gene Expression Data. ISMB-2000, August. AAAI Press, 384-394.