InterPro is a freely accessible resource for classifying protein sequences into families, domains, and functional sites, integrating predictive signatures from member databases such as Pfam, CDD, and PROSITE. However, generating descriptive abstracts for unannotated signatures is a time-consuming manual task.
To address this, we used large language models (LLMs) to generate high-quality family descriptions. Using GPT-4 with Swiss-Prot-derived context, we automatically produced abstracts for over 5,000 PANTHER families. Nearly 3,900 of these were used to create new InterPro entries, so that work which previously took months of curation was completed in days.
Since 2021, in collaboration with Dr Lucy Colwell's team at Google DeepMind, we have also explored deep learning for protein domain classification. This led to the development of InterPro-N, a novel model inspired by computer vision techniques and trained on all 13 InterPro member databases. InterPro-N significantly expands annotation coverage, assigning at least one annotation to ~90% of UniProtKB 2025_02 sequences, up from 84% using traditional methods. Predictions are accessible via the InterPro website, REST API, and FTP.
Additionally, we have integrated over 300,000 structure predictions from the Big Fantastic Virus Database (BFVD), together with domain boundaries from The Encyclopedia of Domains (TED), which are derived from AlphaFold models. These structure-based annotations are now shown alongside conventional InterPro and InterPro-N results, enabling users to compare annotations across methodologies.
Together, these AI-driven advances accelerate curation, expand functional coverage, and enrich protein classification, supporting faster and more comprehensive annotation of the rapidly growing protein sequence universe.