Schedule subject to change
All times listed are in BST
Tuesday, July 22nd
11:20-11:30
Introduction
Format: In person


Presentation Overview:

Introduction to the joint session Function and EvolCompGen

11:30-12:10
Invited Presentation: Evolution of function in light of gene expression
Format: In person


Authors List:

  • Marc Robinson Rechavi

Presentation Overview:

One of the fundamental questions of genome evolution is how gene function changes or is constrained, whether between species (orthologs) or inside gene families (paralogs). While computational prediction is making major progress on function in a broad sense, most evolutionary changes concern details that are small in the big picture, yet very significant for organismal function. For example, new organs or new physiological adaptations often come from repurposing genes whose basic molecular function is conserved while taking a novel role. Gene expression provides a unique window into such fine details of gene function. I will present how gene expression of diverse species, bulk and single-cell, is integrated into Bgee; how gene expression can be used to test hypotheses of functional change after duplication (the "ortholog conjecture"); and how the evolution of gene expression provides insight into evolvability and the molecular underpinning of new functions.

12:10-12:20
Convergent evolution to similar proteins confounds structure search
Format: In person


Authors List:

  • Erik Wright, University of Pittsburgh, United States

Presentation Overview:

Advances in protein structure prediction and structural search tools (e.g., FoldSeek and PLMSearch) have enabled large-scale comparison of protein structures. It is now possible to quickly identify structurally similar proteins ("structurlogs"), but it remains unclear whether these similarities reflect homology (common ancestry) or analogy (convergent evolution). In this study, we found that ~2.6% of FoldSeek clusters lack sequence-level support for homology, including about 1% of matches with high TM-score (>= 0.5). The lack of sequence homology could be due to extreme protein divergence or independent evolution to a similar structure. Here, we show that tandem repeats provide strong evidence for the presence of analogous protein structures. Our results suggest analogs infiltrate structure search results and care should be taken when relying on structural similarity alone if homology is desired. This problem may extend beyond repeat proteins to other low complexity folds, and structure search tools could be improved by masking these regions in the same manner as done by sequence search programs.

12:20-12:30
Evolution of the Metazoan Protein Domain Toolkit Revealed by a Birth-Death-Gain Model
Format: In person


Authors List:

  • Maureen Stolzer, Carnegie Mellon University, United States
  • Yuting Xiao, Carnegie Mellon University, United States
  • Dannie Durand, Carnegie Mellon University, United States

Presentation Overview:

Domains, sequence fragments that encode protein modules with a distinct structure and function, are the basic building blocks of proteins. The set of domains encoded in the genome serves as the functional toolkit of the species. Here, we use a phylogenetic birth-death-gain model to investigate the evolution of this protein toolkit in metazoa. Given a species tree and the set of protein domain families in each present-day species, this approach estimates the most likely rates of domain origination, duplication, and loss.
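
As an illustrative aside, the generative process behind a birth-death-gain model can be sketched as a forward simulation of one domain family's copy number along a single branch. This is not the authors' inference code: the abstract describes estimating the most likely rates from data, whereas this sketch only simulates forward under given rates. The function name, the per-copy parameterization of duplication and loss, and the constant gain rate are assumptions made for illustration.

```python
import random


def simulate_family_size(n0, birth, death, gain, t, rng):
    """Simulate the copy number of one domain family along a branch of length t.

    Assumed (hypothetical) parameterization:
      birth: duplication rate per existing copy
      death: loss rate per existing copy
      gain:  origination rate, independent of the current copy number
    Uses a Gillespie-style simulation: draw exponential waiting times
    until the branch length is exhausted.
    """
    n, clock = n0, 0.0
    while True:
        total_rate = n * (birth + death) + gain
        if total_rate == 0.0:
            return n  # nothing can happen any more (e.g. extinct family, no gain)
        clock += rng.expovariate(total_rate)
        if clock > t:
            return n
        u = rng.random() * total_rate
        if u < n * birth:
            n += 1  # duplication of an existing copy
        elif u < n * (birth + death):
            n -= 1  # loss of an existing copy
        else:
            n += 1  # gain: a new copy arises independently of existing ones


if __name__ == "__main__":
    rng = random.Random(42)
    sizes = [simulate_family_size(1, 0.3, 0.2, 0.05, 5.0, rng) for _ in range(1000)]
    print("mean simulated family size:", sum(sizes) / len(sizes))
```

The gain term is what allows a family with zero copies to reappear on a branch, which a plain birth-death process cannot do.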

Statistical hierarchical clustering of domain family rates reveals sets of domains with similar rate profiles, consistent with groups of domains evolving in concert. Moreover, we find that domains with similar functions tend to have similar rate profiles. Interestingly, domains with functions associated with metazoan innovations, including immune response, cell adhesion, tissue repair, and signal transduction, tend to have the fastest rates. We further infer the expected ancestral domain content and the history of domain family gains, losses, expansions, and contractions on each branch of the species tree. Comparative analysis of these events reveals that a small number of evolutionary strategies, corresponding to toolkit expansion, turnover, specialization, and streamlining, are sufficient to describe the evolution of the metazoan protein domain complement. Thus, the use of a powerful, probabilistic birth-death-gain model reveals a striking harmony between the evolution of domain usage in metazoan proteins and organismal innovation.

12:30-12:40
Deep Phylogenetic Reconstruction Reveals Key Functional Drivers in the Evolution of B1/B2 Metallo-β-Lactamases
Confirmed Presenter: Samuel Davis, School of Chemistry and Molecular Biosciences, The University of Queensland, Australia

Format: In person


Authors List:

  • Samuel Davis, School of Chemistry and Molecular Biosciences, The University of Queensland, Australia
  • Pallav Joshi, School of Chemistry and Molecular Biosciences, The University of Queensland, Australia
  • Ulban Adhikary, School of Chemistry and Molecular Biosciences, The University of Queensland, Australia
  • Julian Zaugg, School of Chemistry and Molecular Biosciences, The University of Queensland, Australia
  • Phil Hugenholtz, School of Chemistry and Molecular Biosciences, The University of Queensland, Australia
  • Marc Morris, School of Chemistry and Molecular Biosciences, The University of Queensland, Australia
  • Gerhard Schenk, School of Chemistry and Molecular Biosciences, The University of Queensland, Australia
  • Mikael Boden, School of Chemistry and Molecular Biosciences, The University of Queensland, Australia

Presentation Overview:

Metallo-β-lactamases (MBLs) comprise a diverse family of antibiotic-degrading enzymes. Despite their growing implication in drug-resistant pathogens, no broadly effective clinical inhibitors against MBLs currently exist. Notably, β-lactam-degrading MBLs appear to have emerged twice from within the broader, catalytically diverse MBL-fold protein superfamily, giving rise to two distinct monophyletic groups: B1/B2 and B3 MBLs.

Comparative analyses have highlighted distinct structural hallmarks of these subgroups, particularly in metal-coordinating residues. However, the precise evolutionary events underlying their emergence remain unclear due to challenges presented by extensive sequence divergence. Understanding the molecular determinants driving the evolution of β-lactamase activity may inform design of broadly effective inhibitors.

We sought to infer the evolutionary features driving the emergence of B1/B2 MBLs via phylogenetics and ancestral reconstruction. To overcome challenges associated with evolutionary analysis at this scale, we developed a phylogenetically aware sequence curation framework centred on iterative profile HMM refinement. This framework was applied over several iterations to construct a comprehensive phylogeny encompassing the B1/B2 MBLs and several other recently diverged clades. The resulting tree represents the most robust hypothesis to date regarding the emergence of B1/B2 MBLs and implies a parsimonious evolutionary history of key features, including variation in active site architecture and insertions and deletions of distinct structural elements.

Ancestral proteins inferred at key internal nodes were experimentally characterised, revealing distinct activity profiles that reflect underlying evolutionary transitions. These findings give rise to testable hypotheses regarding the molecular basis and evolutionary drivers of functional diversification, as well as potential targets for MBL inhibitor design.

12:40-12:50
A compendium of human gene functions derived from evolutionary modeling
Format: In person


Authors List:

  • Marc Feuermann, SIB Swiss Institute of Bioinformatics, Switzerland
  • Huaiyu Mi, University of Southern California, United States
  • Pascale Gaudet, Swiss Institute of Bioinformatics, Switzerland
  • Anushya Muruganujan, University of Southern California, United States
  • Suzanna E. Lewis, Lawrence Berkeley National Lab, United States
  • Dustin Ebert, University of Southern California, United States
  • Tremayne Mushayahama, University of Southern California, United States
  • Gene Ontology Consortium, Various, United States
  • Paul D. Thomas, University of Southern California, United States

Presentation Overview:

A comprehensive, computable representation of the functional repertoire of all macromolecules encoded within the human genome is a foundational resource for biology and biomedical research. We recently published the initial release of a human gene “functionome” (Feuermann et al., Nature 640:146, 2025), a comprehensive set of human gene function descriptions using Gene Ontology (GO) terms, supported by experimental evidence. This work involved integration of all applicable experimental GO annotations for human genes and their homologs, using a formal, explicit evolutionary modeling framework. We will review this work and its major findings, and describe subsequent progress on an updated version.

In more detail, we will describe the results of a large, international effort to integrate experimental findings from more than 100,000 publications to create a representation of human gene functions that is as complete and accurate as possible. Specifically, we applied an expert-curated, explicit evolutionary modeling approach to all human protein-coding genes, which integrates available experimental information across families of related genes into models reconstructing the gain and loss of functional characteristics over evolutionary time. The resulting set of integrated functions covers ~82% of human protein-coding genes, and the evolutionary models provide insights into the evolutionary origins of human gene functions. We show that our set of function descriptions can improve the widely used genomic technique of GO enrichment analysis. The experimental evidence for each functional characteristic is recorded, enabling the scientific community to help review and improve the resource, available at https://functionome.geneontology.org.

12:50-13:00
pLM in functional annotation: relationship between sequence conservation and embedding similarity
Format: In person


Authors List:

  • Ana Rojas, CABD, Spain
  • Ildefonso Cases, CABD-CSIC, Spain
  • Rosa Fernandez, Metazoa Phylogenomics Lab, Institute of Evolutionary Biology (CSIC-UPF), 08003 Barcelona, Spain
  • Gemma Martínez-Redondo, Metazoa Phylogenomics Lab, Institute of Evolutionary Biology (CSIC-UPF), 08003 Barcelona, Spain
  • Francisco M. Perez-Canales, CABD-CSIC, Spain

Presentation Overview:

Functional annotation of protein sequences remains a bottleneck for understanding the biology of both model and non-model organisms, as conventional homology-based tools often fail to assign functions to the majority of newly sequenced genes. We first benchmarked each protein language model (pLM) on well-characterized model organisms, demonstrating superior recovery of functional signals from transcriptomic datasets compared to traditional methods. We then applied our pipeline to annotate ~1,000 animal proteomes, encompassing 23 million genes, and discovered candidate genes involved in gill regeneration in a non-model insect. To elucidate how pLM embeddings relate to primary-sequence conservation, we computed cosine distances between embeddings and aligned sequences to derive percent identity. Statistical analyses, including Pearson correlation, polynomial regression, and quantile regression, revealed complex, non-linear relationships between embedding similarity and sequence identity that vary markedly across models. These findings indicate that pLM embeddings capture orthogonal functional features beyond simple residue conservation. Altogether, our work highlights the power of pLM-based annotation for expanding functional insights in biodiversity projects and underscores the need to interpret embedding distances in light of each model’s unique representational biases.
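
For readers unfamiliar with the two quantities being compared, the following is a minimal sketch of a cosine distance between two embedding vectors and a percent identity between two aligned sequences. It is not the authors' pipeline: the function names are hypothetical, and the choice to compute identity only over columns where neither sequence has a gap is an assumption (other denominators are also in common use).

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical residues over columns with no gap in either sequence."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


if __name__ == "__main__":
    # Toy inputs, not real embeddings or real protein sequences.
    print(cosine_distance([1.0, 0.0], [0.0, 1.0]))
    print(percent_identity("AC-GT", "ACTGA"))
```

Collecting both values for many protein pairs gives the paired observations on which a correlation or regression analysis of the kind described in the abstract could be run.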

14:00-14:20
Proceedings Presentation: GOAnnotator: Accurate protein function annotation using automatically retrieved literature
Confirmed Presenter: Huiying Yan, Fudan University, China

Format: In person


Authors List:

  • Huiying Yan, Fudan University, China
  • Hancheng Liu, Fudan University, China
  • Shaojun Wang, Fudan University, China
  • Shanfeng Zhu, Fudan University, China

Presentation Overview:

Automated protein function prediction/annotation (AFP) is vital for understanding biological processes and advancing biomedical research. Existing text-based AFP methods including the state-of-the-art method, GORetriever, rely on expert-curated relevant literature, which is costly and time-consuming, and covers only a small portion of the proteins in UniProt. To overcome this limitation, we propose GOAnnotator, a novel framework for automated protein function annotation. It consists of two key modules: PubRetriever, a hybrid system for retrieving and re-ranking relevant literature, and GORetriever+, an enhanced module for identifying Gene Ontology (GO) terms from the retrieved texts. Extensive experiments over three benchmark datasets demonstrate that GOAnnotator delivers high-quality functional annotations, surpassing GORetriever by uncovering unique literature and predicting additional functions. These results highlight its great potential to streamline and enhance the annotation of protein functions without relying on manual curation.

14:20-14:40
Proceedings Presentation: Data-Integrated Semi-Supervised Attention Enhances Performance and Interpretability of Biological Classification Tasks
Format: In person


Authors List:

  • Jun Kim, Department of Biomedical Data Science, Stanford University, United States
  • Russ Altman, Department of Biomedical Data Science, Stanford University, United States

Presentation Overview:

The extraction of meaningful information through selective attention improves both performance and interpretability of neural networks. However, high model performance on training data does not ensure alignment between the model’s attention patterns and human knowledge, which can limit the model’s relevance and applicability. We propose Data-Integrated Semi-Supervised Attention (DSSA), a method that numerically integrates a priori knowledge, represented as a knowledge map, into the model’s attention. By incorporating the similarity between the knowledge map and the attention map into a loss function, DSSA causes the model’s attention to correlate with the knowledge. We show that DSSA can improve the performance of neural networks using two biological tasks. In the first task, cancer type prediction from gene expression profiles was guided by identities of cancer type-specific biomarkers. In the second task, enzyme/non-enzyme classification from protein sequences was guided by the locations of the catalytic residues. In both tasks, DSSA leads to improved performance and attention that is explainable by the phenomena in the provided data. DSSA is a novel method for injecting knowledge to achieve model alignment and interpretability.
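
To make the idea of adding a knowledge-attention similarity term to a loss concrete, here is a minimal sketch. It is not the DSSA implementation: the abstract does not state which similarity measure or weighting is used, so the cosine similarity, the `weight` parameter, the function names, and the handling of samples without a knowledge map are all assumptions for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def knowledge_guided_loss(task_loss, attention, knowledge=None, weight=1.0):
    """Add a penalty that shrinks as the attention map aligns with the knowledge map.

    Assumption: samples without a knowledge map (knowledge=None) contribute
    only the ordinary task loss.
    """
    if knowledge is None:
        return task_loss
    penalty = 1.0 - cosine_similarity(attention, knowledge)
    return task_loss + weight * penalty


if __name__ == "__main__":
    # Toy attention over four input features; the knowledge map marks two of them.
    attention = [0.1, 0.6, 0.2, 0.1]
    knowledge = [0.0, 1.0, 1.0, 0.0]
    print(knowledge_guided_loss(0.42, attention, knowledge, weight=0.5))
```

In a real network the same term would be computed with differentiable tensor operations so that gradients flow into the attention weights.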

14:40-15:00
On the completeness, coherence, and consistency of protein function prediction: lifting function prediction from isolated proteins to biological systems
Format: In person


Authors List:

  • Rund Tawfiq, KAUST, Saudi Arabia
  • Maxat Kulmanov, KAUST, Saudi Arabia
  • Robert Hoehndorf, KAUST, Saudi Arabia

Presentation Overview:

The Critical Assessment of Functional Annotation (CAFA) defines protein function prediction as the task of assigning Gene Ontology (GO) terms to individual proteins, and evaluates performance using ontology-based metrics. However, proteins rarely function in isolation; instead, they act within biological systems that impose genome-wide constraints. With the increasing availability of complete genomes, we define a new computational problem that extends the CAFA approach to genome-scale protein function prediction.

Defining this task allows us to evaluate the biological plausibility of a set of predicted functions. We propose three evaluation criteria: completeness, coherence, and consistency. Completeness requires that all biologically essential functions are predicted for at least one protein in a genome. Coherence ensures that all necessary dependencies between functions are satisfied. Consistency is the absence of mutually exclusive functions within a genome or protein.
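
The three criteria can be illustrated with a small set-based check. This is only a sketch under simplifying assumptions, not the authors' evaluation framework: function labels are opaque placeholder strings rather than GO terms, the constraint tables are supplied by hand, and consistency is checked at the genome level only. All names are hypothetical.

```python
def evaluate_genome(predictions, essential, requires, exclusive):
    """Check a genome's predicted functions against three kinds of constraints.

    predictions: dict mapping protein id -> set of predicted function labels
    essential:   set of functions that must be predicted for at least one protein
    requires:    dict mapping function -> set of functions it depends on
    exclusive:   iterable of (f, g) pairs that must not both occur in the genome
    """
    present = set().union(*predictions.values())

    # Completeness: every essential function appears somewhere in the genome.
    missing_essential = essential - present

    # Coherence: every predicted function has all of its dependencies present.
    unmet_dependencies = {
        f: requires[f] - present
        for f in present
        if f in requires and requires[f] - present
    }

    # Consistency: no mutually exclusive pair is predicted together.
    conflicts = [(f, g) for f, g in exclusive if f in present and g in present]

    return {
        "missing_essential": missing_essential,
        "unmet_dependencies": unmet_dependencies,
        "conflicts": conflicts,
    }


if __name__ == "__main__":
    report = evaluate_genome(
        predictions={"p1": {"F1", "F2"}, "p2": {"F3"}},
        essential={"F1", "F4"},
        requires={"F2": {"F5"}, "F3": {"F1"}},
        exclusive=[("F1", "F3"), ("F1", "F9")],
    )
    print(report)
```

A protein-level consistency check would apply the same exclusivity test to each protein's own set of labels instead of the genome-wide union.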

We formalize these criteria as logical constraints using GO axioms, inter-ontology mappings, and curated biological knowledge. We implemented an evaluation framework based on the constraints we define, and applied it to six function prediction methods (DeepGOMeta, InterProScan, DeepFRI, TALE, DeepGraphGO, SPROF-GO) across 1,000 complete bacterial genomes. We also applied it to annotations from six well-annotated bacterial model organisms. The methods were not specifically designed to perform our genome-scale function prediction task, and our results revealed limitations in all methods when assessed against the metrics.

Our results demonstrate that current methods, while effective at the protein level, do not produce biologically plausible proteome annotations, motivating new frameworks for function prediction grounded in system-level biological constraints.

15:00-15:20
Contextual Gene Set Analysis with Large Language Models
Confirmed Presenter: Chih-Hsuan Wei, National Institutes of Health, United States

Format: In person


Authors List:

  • Zhizheng Wang, National Institutes of Health, United States
  • Chi-Ping Day, National Institutes of Health, United States
  • Chih-Hsuan Wei, National Institutes of Health, United States
  • Qiao Jin, National Institutes of Health, United States
  • Robert Leaman, National Institutes of Health, United States
  • Yifan Yang, National Institutes of Health, United States
  • Shubo Tian, National Institutes of Health, United States
  • Aodong Qiu, University of Pittsburgh, United States
  • Yin Fang, National Institutes of Health, United States
  • Qingqing Zhu, National Institutes of Health, United States
  • Xinghua Lu, University of Pittsburgh, United States
  • Zhiyong Lu, National Institutes of Health, United States

Presentation Overview:

Gene set analysis (GSA) is a foundational technique in genomics research, enabling the identification of biological processes and disease mechanisms associated with genes. Traditional GSA methods typically rely on predefined, manually curated biological databases to identify statistically enriched functions from gene sets created by high-throughput studies. However, these approaches, as well as recent large language model (LLM)-based methods, generally overlook the biological and experimental contexts in which the gene sets were derived. Consequently, they often produce extensive lists of enriched pathways that are generic, redundant, or misaligned with the study objectives. In addition, conventional GSA methods do not account for gene interactions within the input set, frequently resulting in the overrepresentation of central hub genes. This lack of context-awareness limits the biological relevance of the findings and hinders accurate interpretation of results, thereby reducing the potential to derive meaningful insights or generate hypothesis-driven conclusions.
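
For context, the statistical enrichment step of a traditional GSA is often an over-representation test. The sketch below shows one common form, a one-sided hypergeometric test, and is offered as background on the baseline the abstract critiques rather than as any part of the authors' method. The function name and the toy numbers are hypothetical.

```python
from math import comb


def enrichment_p_value(overlap, set_size, annotated, background):
    """One-sided hypergeometric p-value for over-representation.

    overlap:    genes in the input set that carry the annotation
    set_size:   size of the input gene set
    annotated:  genes in the background that carry the annotation
    background: total number of background genes
    Returns P(X >= overlap) when set_size genes are drawn without replacement.
    """
    upper = min(set_size, annotated)
    favourable = sum(
        comb(annotated, i) * comb(background - annotated, set_size - i)
        for i in range(overlap, upper + 1)
    )
    return favourable / comb(background, set_size)


if __name__ == "__main__":
    # Toy numbers: 8 of 50 input genes carry a term found on 200 of 20,000 genes.
    print(enrichment_p_value(8, 50, 200, 20000))
```

Note that this test treats genes as independent draws and knows nothing about the experiment that produced the gene set, which is the limitation the abstract highlights.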

15:40-15:50
A Novel Computational Pipeline for the Functional Characterization and Deorphanization of G-Protein Coupled Receptors
Format: In person


Authors List:

  • Catherine Zhou, Stanford University School of Medicine, United States

Presentation Overview:

G protein-coupled receptors (GPCRs) are integral membrane proteins central to cellular signaling and intercellular communication, with Class A GPCRs playing key roles in many physiological processes and diseases. Despite their therapeutic potential, many remain orphan receptors, lacking identified endogenous ligands. Traditional de-orphaning methods are labor- and resource-intensive, highlighting the need for more efficient strategies. Here, we developed a multi-omics pipeline combining sequencing data, GPCR and ligand features, AlphaFold2 predictions, and binding pocket analyses to streamline ligand identification for orphan GPCRs. The pipeline starts by analyzing tissue-specific gene expression data to identify co-expressed GPCR-ligand pairs, which are likely to interact. Next, sequence and motif analysis of GPCRs informs potential ligand binding regions, while coevolution, conservation, and binding site similarity analysis refine interaction predictions. To model GPCR-ligand complexes, AlphaFold2 predictions are performed using a high-throughput pipeline on a high performance computing server. This system is fully automated and optimized for parallelized batch predictions using GPUs, improving efficiency compared to manual methods. Models are evaluated using novel metrics to assess ligand binding feasibility, such as distance measurements between ligand and receptor domains and aggregated interaction scores across different types of contacts. The computational predictions are further validated using experimental techniques, and this integrated approach successfully identified novel ligand-receptor interactions, with ongoing efforts to develop a convolutional neural network for improved interaction classification. The pipeline’s success in de-orphanizing GPCRs has led to further collaboration and grant-funded initiatives to expand its use for drug discovery, accelerating the identification of therapeutic targets for complex diseases.

15:50-16:00
VaLPAS: Leveraging variation in experimental multi-omics data to elucidate protein function
Format: In person


Authors List:

  • Yannick Mahlich, Pacific Northwest National Laboratory, United States
  • Lummy Monteiro, Pacific Northwest National Laboratory, United States
  • Jason McDermott, Pacific Northwest National Laboratory, United States

Presentation Overview:

Despite continuing advances in sequencing and computational function determination, large parts of the studied gene, protein and metabolite space remain functionally undetermined. Most function assignment is driven by homology searches and annotation transfer from known and extensively studied proteins but often fails to leverage available experimental omics data generated via technologies like mass-spectrometry.
The VaLPAS (Variation-Leveraged Phenomic Association Study) framework is an approach combining experimental multi-omics readouts with computational methods to establish functional relationships between different omics modalities. The goal of this approach is to shed light on the functional dark matter of protein space by elucidating previously unknown functions of proteins and metabolites via association metrics (e.g. protein-metabolite correlation) and graph algorithms.
We demonstrate that the framework can reliably recapitulate known functional relationships by applying VaLPAS to multi-omic data from Rhodosporidium toruloides and Yarrowia lipolytica cultured under different growth and stress conditions. We used KEGG Orthology annotations for detected proteins and KEGG Compound annotations for metabolites, evaluating the resulting association scores in the context of chemical reactions (KEGG Reactions) and metabolic pathways (KEGG modules & pathways) using network analysis approaches.
The resulting performance metrics detail the applicability of using experimental abundance data from detectable metabolites and proteins (extendable to other modes of experimental data) to infer protein functionality and metabolite annotation for as-yet-unannotated data. Finally, the results imply that the approach can also aid in guiding experimental design to validate functional annotations.
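
As a rough illustration of an association metric of the kind mentioned above (protein-metabolite correlation), the sketch below computes Pearson correlations between abundance profiles measured across the same conditions and keeps strongly correlated pairs as edges of an association graph. It is not the VaLPAS code: the function names, the 0.8 threshold, and the toy abundance values are assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0.0 or sd_y == 0.0:
        return 0.0
    return cov / (sd_x * sd_y)


def association_edges(proteins, metabolites, threshold=0.8):
    """Return (protein, metabolite, r) for pairs with |r| at or above the threshold.

    proteins, metabolites: dicts mapping an identifier to a list of abundances,
    one value per condition, in the same condition order for every entry.
    """
    edges = []
    for protein, p_profile in proteins.items():
        for metabolite, m_profile in metabolites.items():
            r = pearson(p_profile, m_profile)
            if abs(r) >= threshold:
                edges.append((protein, metabolite, r))
    return edges


if __name__ == "__main__":
    proteins = {"P1": [1.0, 2.0, 3.0, 4.0], "P2": [1.0, 1.0, 1.0, 1.0]}
    metabolites = {"M1": [2.0, 4.0, 6.0, 8.0], "M2": [4.0, 3.0, 2.0, 1.0]}
    for edge in association_edges(proteins, metabolites):
        print(edge)
```

The resulting edge list is the kind of input on which graph algorithms can then look for groups of proteins and metabolites that vary together.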

16:40-17:00
Accelerating protein family classification in InterPro with AI innovations
Confirmed Presenter: Matthias Blum, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), United Kingdom

Format: In person


Authors List:

  • Matthias Blum, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), United Kingdom
  • Alessandro Polignano, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), United Kingdom
  • Irina Ponamareva, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), United Kingdom
  • Alex Bateman, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), United Kingdom

Presentation Overview:

InterPro is a freely accessible resource for classifying protein sequences into families, domains, and functional sites, integrating predictive signatures from member databases such as Pfam, CDD, and PROSITE. However, generating descriptive abstracts for unannotated signatures is a time-consuming manual task.

To address this, we employed large language models (LLMs) to generate high-quality family descriptions. Using GPT-4 with Swiss-Prot-derived context, we automatically produced abstracts for over 5,000 PANTHER families. Nearly 3,900 of these were used to create new InterPro entries, completing in days what previously took months of curation.

Since 2021, in collaboration with Dr Lucy Colwell's team at Google DeepMind, we have also explored deep learning for protein domain classification. This led to the development of InterPro-N, a novel model inspired by computer vision techniques and trained on all 13 InterPro member databases. InterPro-N significantly expands annotation coverage, assigning at least one annotation to ~90% of UniProtKB 2025_02 sequences, up from 84% using traditional methods. Predictions are accessible via the InterPro website, REST API, and FTP.

Additionally, we have integrated over 300,000 structure predictions from the Big Fantastic Virus Database (BFVD) and domain boundaries from The Encyclopedia of Domains (TED), derived from AlphaFold models. These structure-based insights are now shown alongside conventional InterPro and InterPro-N results, enabling users to compare annotations across methodologies.

Together, these AI-driven advances accelerate curation, expand functional coverage, and enrich protein classification, supporting faster and more comprehensive annotation of the rapidly growing protein sequence universe.

17:00-17:20
Thousands of confident genetic interactions in an Escherichia coli mutant collection elucidate numerous gene functions
Format: In person


Authors List:

  • Simon Jeanneau, Département de Biologie, Faculté des Sciences, Université de Sherbrooke, Canada
  • Mathias Martin Silva, Département de Biologie, Faculté des Sciences, Université de Sherbrooke, Canada
  • Antoine Champie, Département de Biologie, Faculté des Sciences, Université de Sherbrooke, Canada
  • Amélie De Grandmaison, Département de Biologie, Faculté des Sciences, Université de Sherbrooke, Canada
  • Antoine Castonguay, Département de Biologie, Faculté des Sciences, Université de Sherbrooke, Canada
  • Jean-Philippe Côté, Département de Biologie, Faculté des Sciences, Université de Sherbrooke, Canada
  • Sébastien Rodrigue, Département de Biologie, Faculté des Sciences, Université de Sherbrooke, Canada
  • Pierre-Étienne Jacques, Département de Biologie, Faculté des Sciences, Université de Sherbrooke, Canada

Presentation Overview:

Despite extensive research, nearly one-third of Escherichia coli genes remain uncharacterized. Understanding how these genes interact to support cellular viability is essential not only for fundamental biology but also for identifying vulnerabilities that may guide novel antimicrobial strategies. While resources such as the Keio collection, which includes all single-gene deletion mutants, have significantly advanced our knowledge of essential genes, the combinatorial nature of gene interactions remains largely unexplored at the genome scale, particularly in the context of synthetic lethality.

We recently developed High-Throughput Transposon Mutagenesis (HTTM), an optimized approach that enables systematic and high-resolution exploration of genetic interactions. By applying HTTM across thousands of mutants, we probed nearly 16 million gene pairs for synthetic lethality, resulting in the most comprehensive interaction screen conducted in E. coli to date.

Our analysis successfully recovered known synthetic lethal pairs and identified thousands of previously unreported interactions, including many involving poorly annotated or uncharacterized genes. Within this dataset, we identified densely connected regions of the interaction network, revealing genes that participate in a large number of critical interactions. These interaction hubs represent vulnerable nodes in bacterial survival networks. Furthermore, the recurring association of uncharacterized genes with well-annotated functional clusters supports functional propagation, a process by which gene function can be inferred from shared interaction patterns.
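
The two network ideas in this paragraph, interaction hubs and functional propagation, can be sketched on a toy list of synthetic lethal pairs. This is an illustration under simple assumptions, not the authors' analysis: hubs are defined by a plain degree cutoff, propagation is a majority vote among annotated partners, and the gene identifiers and function labels are placeholders.

```python
from collections import Counter, defaultdict


def interaction_hubs(pairs, min_degree):
    """Return genes that take part in at least min_degree interactions."""
    degree = Counter()
    for a, b in pairs:
        degree[a] += 1
        degree[b] += 1
    return {gene for gene, d in degree.items() if d >= min_degree}


def propagate_function(pairs, annotations):
    """Suggest a function for each unannotated gene by majority vote of its partners.

    annotations: dict mapping gene -> function label for characterized genes.
    Genes with no annotated partner receive no suggestion.
    """
    partners = defaultdict(set)
    for a, b in pairs:
        partners[a].add(b)
        partners[b].add(a)

    suggestions = {}
    for gene in sorted(partners):
        if gene in annotations:
            continue
        # Sorting keeps the result deterministic when votes are tied.
        votes = Counter(
            annotations[p] for p in sorted(partners[gene]) if p in annotations
        )
        if votes:
            suggestions[gene] = votes.most_common(1)[0][0]
    return suggestions


if __name__ == "__main__":
    pairs = [("g1", "g2"), ("g1", "g3"), ("g1", "u1"), ("g2", "u1"), ("g3", "u2")]
    annotations = {"g1": "cell wall", "g2": "cell wall", "g3": "transport"}
    print(interaction_hubs(pairs, min_degree=3))
    print(propagate_function(pairs, annotations))
```

On a real interaction map, a degree cutoff would likely be replaced by a statistically motivated definition of dense regions, and votes would be weighted by interaction confidence.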

This extensive interaction map enhances functional annotation of the E. coli genome and highlights combinatorial genetic vulnerabilities. These findings provide a valuable foundation for investigating bacterial physiology and for identifying new targets in the pursuit of antimicrobial development.

17:20-17:40
Present and future of the critical assessment of protein function annotation algorithms (CAFA)
Format: In person


Authors List:

  • M. Clara De Paolis Kaluza, Northeastern University, United States
  • Rashika Ramola, Northeastern University, United States
  • Parnal Joshi, Iowa State University, United States
  • An Phan, Iowa State University, United States
  • Priyanka Banarjee, Iowa State University, United States
  • Damiano Piovesan, BioComputing UP - University of Padova, Italy
  • Walter Reade, Kaggle, United States
  • Maggie Demkin, Kaggle, United States
  • Addison Howard, Kaggle, United States
  • Nate Keating, Kaggle, United States
  • Paul Thomas, University of Southern California, United States
  • Maria Martin, EMBL-EBI, United Kingdom
  • Sandra Orchard, EMBL-EBI, United Kingdom
  • Iddo Friedberg, Iowa State University, United States
  • Predrag Radivojac, Northeastern University, United States

Presentation Overview:

Since its launch in 2010, the Critical Assessment of Functional Annotation (CAFA) has brought together computational biologists, biocurators, and experimental biologists to benchmark the state of computational prediction of protein function. It has served as a forum for discussion and collaboration to drive innovation in the field. Recent advances in protein representation, coupled with a growing interest from the machine learning community in biological applications, motivated CAFA organizers to expand their reach and invite a broader range of model developers to participate. To this end, the fifth CAFA experiment (CAFA 5) was conducted in partnership with Kaggle, a platform for data science competitions and collaborative model development. The reach and technology of this format resulted in a 22-fold increase over previous CAFAs in the number of participating teams, composed of entrants from 77 countries and various scientific and technical backgrounds.

In this talk, we present an expanded analysis of the prediction models in CAFA 5. Our analysis shows marked improvements in the performance of predictions on Gene Ontology (GO) term annotations compared to models from past CAFA evaluations. We present a new setting for evaluating predictions of function annotations added to proteins with previously incomplete annotations and we suggest new directions for future computational prediction improvements based on these evaluations. Finally, we turn our attention to the future and discuss the planned challenges and assessments for CAFA 6, which will be launched in 2025.

17:40-18:00
ProtHGT: Heterogeneous Graph Transformers for Automated Protein Function Prediction Using Biological Knowledge Graphs and Language Models
Format: In person


Authors List:

  • Erva Ulusoy, Hacettepe University, Turkey
  • Tunca Dogan, Hacettepe University, Turkey

Presentation Overview:

Accurate functional annotation of proteins is crucial for understanding complex biological systems. As protein sequence data grows rapidly, experimental methods cannot keep pace, underscoring the need for scalable computational approaches. In this study, we present ProtHGT, a heterogeneous graph transformer-based model designed to predict protein functions by integrating diverse biological datasets, including protein-protein interactions, pathways, domains, and phenotypic data. ProtHGT constructs a comprehensive heterogeneous graph with over 542,000 nodes and 3.7 million edges to capture complex biological relationships and employs relationship-specific attention mechanisms to refine node embeddings into biologically meaningful representations. It achieves state-of-the-art performance on benchmark datasets, consistently outperforming graph-based and sequence-based approaches. Advanced pretrained embeddings further enhance predictive accuracy by providing rich feature representations. Ablation analyses highlight the critical role of heterogeneous data integration, demonstrating the value of incorporating multiple node types, such as pathways and domains, to improve predictions. To ensure accessibility, ProtHGT is available as a programmatic tool on https://github.com/HUBioDataLab/ProtHGT and as a user-friendly web service on https://huggingface.co/spaces/HUBioDataLab/ProtHGT, enabling researchers with varying expertise to easily utilize the model. By integrating diverse data sources and leveraging cutting-edge graph transformer architecture, ProtHGT establishes itself as a powerful and accessible tool for advancing bioinformatics research.
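
To give a feel for what relationship-specific attention over a heterogeneous graph means, the sketch below updates one node's embedding by attending separately over its neighbours of each relation type. It is a heavily simplified illustration and not the ProtHGT architecture: a real heterogeneous graph transformer learns projection matrices per node and edge type, whereas here each relation only has a single hypothetical scalar scale, and all names are assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a non-empty list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(target, neighbours_by_relation, relation_scale):
    """Update a node embedding with attention computed separately per relation.

    target:                 embedding of the node being updated
    neighbours_by_relation: dict relation name -> list of neighbour embeddings
    relation_scale:         dict relation name -> scalar applied to that
                            relation's attention scores (stand-in for learned,
                            relation-specific parameters)
    """
    dim = len(target)
    updated = list(target)  # residual connection: start from the node itself
    for relation, neighbours in neighbours_by_relation.items():
        if not neighbours:
            continue
        scale = relation_scale.get(relation, 1.0)
        scores = [
            scale * sum(t * n for t, n in zip(target, vec)) / math.sqrt(dim)
            for vec in neighbours
        ]
        weights = softmax(scores)
        for weight, vec in zip(weights, neighbours):
            for i in range(dim):
                updated[i] += weight * vec[i]
    return updated


if __name__ == "__main__":
    protein = [1.0, 0.0]
    neighbours = {
        "interacts_with": [[0.5, 0.5], [1.0, 0.0]],
        "has_domain": [[0.0, 1.0]],
    }
    print(aggregate(protein, neighbours, {"interacts_with": 2.0, "has_domain": 1.0}))
```

Keeping the attention separate for each relation lets interaction partners, domains, and pathways contribute to a protein's representation with different emphasis.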