Automated Gene Ontology annotation for anonymous sequence data

Steffen Hennig1, Detlef Groth2, Hans Lehrach
1hennig@molgen.mpg.de, MPI for Molecular Genetics, Berlin; 2dgroth@molgen.mpg.de, MPI for Molecular Genetics, Berlin

Gene Ontology (GO) is the most widely accepted attempt to construct a unified and structured vocabulary for describing genes and their products in any organism. Most current genome projects annotate with GO terms, which, besides being general, are very convenient for computer-based classification methods. However, using GO directly in small sequencing projects is not easy, especially for species that are not commonly represented in public databases. We present a software package, GOblet, which performs GO-term annotation for anonymous cDNA or protein sequences. It uses the species-independent GO structure and vocabulary, together with a series of protein databases collected from various sites, to perform a detailed GO annotation by sequence similarity searches. The user can select the sensitivity and the reference protein sets. GOblet runs automatically and is available as a public service on our web server at http://goblet.molgen.mpg.de. Although orthology between genes from different species is frequently detected, a central question is how far GO terms derived for a specific organism (e.g. Drosophila, C. elegans) can be used to annotate distant species such as Homo sapiens. To quantify the reliability of GO annotation based on homologies to other species, we used a reference set of more than 6000 human proteins in Swissprot/TrEMBL whose GO annotation had been checked and verified by a curator. Consistent results were found in all major branches of the GO hierarchy.
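
The idea of similarity-based annotation and its evaluation against curated terms can be illustrated with a minimal Python sketch. This is not the GOblet implementation: the hit list, the protein identifiers, the E-value cutoff of 1e-10, and the function names `transfer_go_terms` and `precision_recall` are all hypothetical and chosen only for illustration. The sketch assumes that similarity-search hits are already available as (query, reference protein, E-value) tuples and that a lower E-value cutoff corresponds to a stricter sensitivity setting.

```python
from collections import defaultdict

# Hypothetical similarity-search hits: (query id, reference protein id, E-value).
HITS = [
    ("query1", "P12345", 1e-40),
    ("query1", "Q99999", 1e-3),
    ("query2", "P12345", 1e-12),
]

# Hypothetical GO annotations of the reference proteins (illustrative IDs only).
REFERENCE_GO = {
    "P12345": {"GO:0005524", "GO:0004672"},
    "Q99999": {"GO:0003677"},
}


def transfer_go_terms(hits, reference_go, max_evalue=1e-10):
    """Assign to each query the GO terms of all hits at or below the E-value cutoff."""
    annotations = defaultdict(set)
    for query_id, ref_id, e_value in hits:
        if e_value <= max_evalue:
            # Reference proteins without GO annotation contribute nothing.
            annotations[query_id] |= reference_go.get(ref_id, set())
    return dict(annotations)


def precision_recall(predicted, curated):
    """Compare transferred GO terms with a curated set for the same protein."""
    if not predicted or not curated:
        return 0.0, 0.0
    true_positives = len(predicted & curated)
    return true_positives / len(predicted), true_positives / len(curated)


if __name__ == "__main__":
    strict = transfer_go_terms(HITS, REFERENCE_GO)
    relaxed = transfer_go_terms(HITS, REFERENCE_GO, max_evalue=1e-2)
    print(sorted(strict["query1"]))
    print(sorted(relaxed["query1"]))
    print(precision_recall(relaxed["query1"], {"GO:0005524"}))
```

Relaxing the cutoff admits the weaker hit and therefore adds a term to `query1`. This is the kind of trade-off between coverage and reliability that a comparison with curated annotations is meant to measure. A fuller treatment would also take the GO hierarchy into account, for example by counting a transferred parent term as partially correct, which this sketch does not attempt.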