Functional Annotation in the Twilight Zone using Machine Learning

Ali Al-Shahib (1), David Gilbert (2)
(1) alshahib@dcs.gla.ac.uk, University of Glasgow; (2) drg@dcs.gla.ac.uk, University of Glasgow

Current estimates put the number of human genes at around 30,000-40,000, and the number whose function remains unknown is a growing concern. By some accounts this has risen to around 30% of our genes (roughly 10,000 genes), which emphasises the need for further functional genomics research. One area that requires further work is the uncertainty associated with alignments of low sequence similarity (the "twilight zone"). This area is important because evolutionary studies of genes and proteins suggest that functional annotations can still be obtained for sequences in this region. As bioinformaticians, we can combine these findings with computational technologies such as machine learning to provide accurate functional annotation in the twilight zone. In this poster, we highlight some of the principles behind functional genomics and propose a method for accurately annotating the function of genes that fall in the twilight zone. Our method uses machine learning techniques, expressed as logical rules, to predict functional annotations directly from the amino acid sequence. Each annotation will carry a measure of confidence or belief derived from well-established statistical methods. More generally, the poster outlines the problem of uncertainty in sequence similarity searches and functional annotation.