Mapping and Visual Exploration of GPCR Classification Hierarchy in Interpro and GPCRDB System

Yanwei Niu1, Xiangyun Wang2, Yockey, Anastasia Christianson, Guang R. Gao
1niu@capsl.udel.edu, Department of ECE, University of Delaware, USA; 2Xiangyun.Wang@astrazeneca.com, EST Informatics Wilmington, Astra Zeneca PLC

G-protein coupled receptors (GPCRs) form the largest receptor superfamily. GPCRs play a key role in cellular signaling networks and in regulatory functions ranging from neural signaling to olfactory and visual processing. About 50% of the currently available drugs target GPCRs. However, some of these drugs have efficacy problems and side effects because the compounds do not discriminate well between receptor subtypes, which is why GPCR protein classification is so important to researchers in the pharmaceutical field.

Currently, there are two GPCR classification systems. Interpro is a general database of protein families, domains and functional sites; it uses a combination of HMM profiles and motif signatures for family assignment. GPCRDB is a full-fledged database dedicated to GPCRs; it organizes them according to the pharmacological classification of receptors. Because the two systems adopt different classification approaches, it is both important and challenging to identify where their hierarchical family structures agree and where they differ. They agree quite well at the superfamily level but become less consistent at the subfamily levels, so we focus on finding the mapping between the families of the two systems at each family level.

We compiled 4467 GPCR protein IDs from the Interpro sequences (SWISS-PROT Release 40.28 of 19-Sep-2002) and the Interpro GPCR family tree. Similarly, we compiled 3031 GPCR protein IDs from GPCRDB (release 6.1, September 2002). All accession numbers were converted to IDs to ensure that we compare the same data set. The part common to the two systems contains around 2127 non-redundant sequence IDs. We then focus on these 2127 sequence IDs and examine how the two systems classify them differently. For each family (or subfamily) at each level, we determined the sequence IDs it contains. We then compared the two systems family by family at all levels and calculated the overlap percentage. Two families that have 80% identity are considered to have a one-to-one mapping relationship, and a clear two-way mapping table (from Interpro to GPCRDB and vice versa) was established.
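The family-by-family comparison described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that each family at a given level has already been flattened to a set of sequence IDs, that the overlap percentage is directional (the share of one family's IDs that also occur in the other family), and that the 80% figure acts as a lower bound. The function names, family names and protein IDs are hypothetical.

```python
def restrict_to_common(source, target):
    """Keep only the IDs that occur in both classification systems."""
    common = set().union(*source.values()) & set().union(*target.values())
    source_common = {name: ids & common for name, ids in source.items()}
    target_common = {name: ids & common for name, ids in target.items()}
    return source_common, target_common


def overlap_fraction(a, b):
    """Fraction of the IDs in family `a` that also appear in family `b`.

    Assumption: the abstract does not define the overlap measure;
    a directional measure is used here for illustration.
    """
    if not a:
        return 0.0
    return len(a & b) / len(a)


def map_families(source, target, threshold=0.8):
    """For each source family, list the target families whose overlap
    meets the threshold, best match first."""
    mapping = {}
    for s_name, s_ids in source.items():
        hits = []
        for t_name, t_ids in target.items():
            frac = overlap_fraction(s_ids, t_ids)
            if frac >= threshold:
                hits.append((t_name, frac))
        mapping[s_name] = sorted(hits, key=lambda hit: -hit[1])
    return mapping


# Hypothetical toy data: family name -> set of sequence IDs.
interpro = {
    "IPR_A": {"P1", "P2", "P3", "P4", "P5"},
    "IPR_B": {"P6", "P7"},
}
gpcrdb = {
    "GDB_X": {"P1", "P2", "P3", "P4", "P9"},
    "GDB_Y": {"P5", "P6", "P7", "P8"},
}

interpro_c, gpcrdb_c = restrict_to_common(interpro, gpcrdb)
interpro_to_gpcrdb = map_families(interpro_c, gpcrdb_c)  # one direction
gpcrdb_to_interpro = map_families(gpcrdb_c, interpro_c)  # and the reverse
```

Running the mapping in both directions yields the two halves of a two-way table. Under the directional assumption, a pair can pass the threshold in one direction and fail in the other, which is one way to flag families that one system splits and the other merges.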

We introduced the information visualization tool "MulHier" (novel visualization techniques for working with multiple, overlapping classification hierarchies) to the protein classification area. By visualizing the two GPCR classification systems together, we can quickly and easily find the mapping from any family in Interpro to any family in GPCRDB, as well as the protein IDs belonging to each family.

The mapping table and the classification visualization prove very helpful to researchers in the pharmaceutical industry. They provide a better approach than the current cross-database references, which give the mapping only at the superfamily level. All source code and data files of this research are available via niu@ee.udel.edu. The visualization tool MulHier is publicly available at http://www.dcs.napier.ac.uk/~marting.