A Whole-genome Analysis of Transcription Factor Binding Site Data.

Caroline Finnerty1, Dr. James McInerney2
1caroline.s.finnerty@may.ie, Bioinformatics and Pharmacogenomics Laboratory; 2james.o.mcinerney@may.ie, Bioinformatics and Pharmacogenomics Laboratory

Modern biology has provided us with many surprises, most notably the number of genes in our genome. Before completion of the human genome, estimates of gene content were in the region of 50,000-100,000 genes. This large number of genes was believed to explain our complexity relative to other mammalian species. Upon completion of the human genome, the discovery that we have a mere 30,000 genes (the same number as in mouse) required a new theory. One theory that is gaining popular acceptance explains our uniqueness not by a greater number of genes but by how these genes are regulated. It suggests that we have the same families of genes as other mammals, but that these genes are expressed at different levels and at different stages in the different genomes.

Our approach is to analyse, on a genome-wide scale, the upstream regions of human genes, with an emphasis on transcription factor binding sites. Using existing databases such as TRANSFAC to retrieve transcription factor binding site data, we will recode each transcription factor binding site as a distinct number, so that each upstream region is represented by its own subset of numbers. Using these recoded vectors, the following analyses will be carried out: multiple alignment, multivariate analysis, neural-network analysis and expression analysis using microarray data.
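The recoding step can be illustrated with a minimal Python sketch. It assumes that binding-site annotations have already been retrieved and are available as an ordered list of site names per gene. The function names, gene names and site labels below are hypothetical and chosen for illustration only; they do not reflect the TRANSFAC data format or the actual pipeline.

```python
from typing import Dict, List


def build_site_codes(regions: Dict[str, List[str]]) -> Dict[str, int]:
    """Assign each distinct binding-site name an integer code.

    Codes start at 1 and are handed out in order of first appearance.
    """
    codes: Dict[str, int] = {}
    for sites in regions.values():
        for site in sites:
            if site not in codes:
                codes[site] = len(codes) + 1
    return codes


def recode_regions(
    regions: Dict[str, List[str]], codes: Dict[str, int]
) -> Dict[str, List[int]]:
    """Replace every site name in every upstream region with its code."""
    return {gene: [codes[s] for s in sites] for gene, sites in regions.items()}


# Hypothetical input: gene identifiers and site labels are illustrative only.
example_regions = {
    "geneA": ["SP1", "TATA", "AP-1"],
    "geneB": ["SP1", "NF-kB", "TATA"],
}

site_codes = build_site_codes(example_regions)
recoded = recode_regions(example_regions, site_codes)

for gene, vector in recoded.items():
    print(gene, vector)
```

Keeping the site order within each vector is a design choice made here so that the same vectors could feed an alignment-style analysis; an analysis that only needs presence or absence could use sets instead.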

These approaches should make it possible to identify homologous upstream regions as well as those that are divergent. The biological significance of this work lies in determining whether two genes with similar upstream regions have similar functions and are expressed at the same level and, conversely, whether two genes with very divergent upstream sequences also differ in function and expression level. The ultimate goal is to infer expression pattern from sequence.
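One simple way to quantify how similar or divergent two recoded upstream regions are is sketched below in Python. The abstract does not specify a similarity measure, so the Jaccard index used here is an assumption made for illustration, and the example vectors are hypothetical.

```python
from typing import Iterable


def jaccard_similarity(region_a: Iterable[int], region_b: Iterable[int]) -> float:
    """Return the Jaccard index of two sets of binding-site codes.

    The index is the number of shared codes divided by the number of
    distinct codes in either region. Two empty regions score 0.0.
    """
    set_a, set_b = set(region_a), set(region_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


# Hypothetical recoded upstream regions (integer codes stand for binding sites).
region_a = [1, 2, 3]
region_b = [1, 4, 2]
region_c = [5, 6]

sim_ab = jaccard_similarity(region_a, region_b)  # two shared codes out of four
sim_ac = jaccard_similarity(region_a, region_c)  # no shared codes

print(sim_ab, sim_ac)
```

This measure ignores the order and spacing of sites, so it captures only shared site content; an alignment-based score would be needed to take site order into account.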

Future work will include a comparative approach, analysing the mouse genome in a similar manner and comparing transcription factor binding site information with expression profiles of both human and mouse genes.