InterPro, a protein functional classification resource

Nicola Mulder1, InterPro Consortium2
1mulder@ebi.ac.uk, EBI; 2interpro@ebi.ac.uk, EBI

The exponential increase in the submission of sequences to the nucleotide sequence databases by genome sequencing centres has created a need for rapid, automatic methods for classifying the resulting protein sequences. Several signature-based and sequence cluster-based methods for protein classification exist, and each resource has distinct areas of optimum application owing to differences in the underlying analysis methods. In recognition of this, InterPro was developed as an integrated documentation resource for protein families, domains and functional sites, to rationalise the complementary efforts of the individual protein signature database projects. The member databases PRINTS, PROSITE, Pfam, ProDom, SMART and TIGRFAMs form the InterPro core. The databases that have recently joined are PIR Superfamilies and the SUPERFAMILY database, the latter providing the first structure-based, rather than sequence-based, families in InterPro. Related signatures from each member database are unified into single InterPro entries. Each InterPro entry includes a unique accession number, functional descriptions, literature references, and information on the taxonomic range of the proteins matching the entry. Each entry also provides links to the protein sequences, member database signatures, specialised protein family or classification databases, and curated structural information. InterPro entries may be arranged in a hierarchy in which one entry describes a subset of the proteins matched by another, which facilitates classification of proteins at the superfamily/subfamily level as well as at the level of domain composition. Protein matches to all signatures are calculated using the InterProScan software, which integrates the individual database search algorithms and provides the output in a single coherent format. Protein matches for each entry may be viewed in a table or graphically, in overview, detailed or domain architecture-type formats.
A new feature of InterPro is the provision of links to structural information, which broadens the user group of the database and enhances the functional classification of proteins. Release 6.1 of InterPro (April 2003) contains over 8000 entries, representing families, domains, repeats and sites of post-translational modification (PTMs), described by regular expressions, profiles, fingerprints and Hidden Markov Models (HMMs). There are over 3 million InterPro hits from SWISS-PROT and TrEMBL protein sequences in this release. InterPro has been used for large-scale classification and annotation of numerous complete genomes, notably the human genome. The database is freely accessible for text- and sequence-based searches at http://www.ebi.ac.uk/interpro/, and queries may be emailed to interhelp@ebi.ac.uk.
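
The entry structure described above (one accession that unifies related member database signatures, with an optional parent/child hierarchy) can be pictured with a small data model. The following is only an illustrative sketch, not InterPro's actual schema or software: the class `Entry`, the functions `lineage` and `member_databases`, and all accessions and signature identifiers are hypothetical placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Entry:
    """Hypothetical model of one integrated entry (not the real schema)."""
    accession: str                      # unique accession (placeholder values below)
    name: str                           # short functional description
    # Member database name -> signature identifiers unified into this entry.
    signatures: Dict[str, List[str]] = field(default_factory=dict)
    parent: Optional[str] = None        # accession of the broader entry, if any


def lineage(entries: Dict[str, Entry], accession: str) -> List[str]:
    """Return accessions from the given entry up to the root of its hierarchy."""
    path: List[str] = []
    current: Optional[str] = accession
    while current is not None:
        path.append(current)
        current = entries[current].parent
    return path


def member_databases(entry: Entry) -> List[str]:
    """Return the member databases contributing signatures, sorted by name."""
    return sorted(entry.signatures)


# Placeholder data: a family entry and a subfamily entry that refines it.
entries = {
    "ENTRY_FAMILY": Entry(
        "ENTRY_FAMILY",
        "example family",
        {"Pfam": ["SIG_1"], "PRINTS": ["SIG_2"]},
    ),
    "ENTRY_SUBFAMILY": Entry(
        "ENTRY_SUBFAMILY",
        "example subfamily",
        {"TIGRFAMs": ["SIG_3"]},
        parent="ENTRY_FAMILY",
    ),
}

print(lineage(entries, "ENTRY_SUBFAMILY"))
print(member_databases(entries["ENTRY_FAMILY"]))
```

Storing only a parent accession keeps each entry flat while still allowing a protein that matches a subfamily entry to be reported at the family level by walking up the lineage.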