ISMB/ECCB 2025

Schedule for Function

Date	Start Time	End Time	Room	Track	Title	Confirmed Presenter	Format	Authors	Abstract
2025-07-22	11:20:00	11:30:00	12	Function: Gene and Protein Function Annotation	Introduction			Iddo Friedberg	Introduction to the joint session Function and EvolCompGen
2025-07-22	11:30:00	12:10:00	12	Function: Gene and Protein Function Annotation	Evolution of function in light of gene expression	Marc Robinson-Rechavi	In person	Marc Robinson-Rechavi	One of the fundamental questions of genome evolution is how gene function changes or is constrained, whether between species (orthologs) or inside gene families (paralogs). While computational prediction is making major progress on function in a broad sense, most evolutionary changes concern details that are small in the big picture, yet very significant for organismal function. For example, new organs or new physiological adaptations often come from repurposing genes whose basic molecular function is conserved while taking a novel role. Gene expression provides a unique window into such fine details of gene function. I will present how gene expression of diverse species, bulk and single-cell, is integrated into Bgee; how gene expression can be used to test hypotheses of functional change after duplication (the
2025-07-22	12:10:00	12:20:00	12	Function: Gene and Protein Function Annotation	Convergent evolution to similar proteins confounds structure search	Erik Wright	Live stream	Erik Wright	Advances in protein structure prediction and structural search tools (e.g., FoldSeek and PLMSearch) have enabled large-scale comparison of protein structures. It is now possible to quickly identify structurally similar proteins ("structurlogs"), but it remains unclear whether these similarities reflect homology (common ancestry) or analogy (convergent evolution). In this study, we found that ~2.6% of FoldSeek clusters lack sequence-level support for homology, including about 1% of matches with high TM-score (>= 0.5). The lack of sequence homology could be due to extreme protein divergence or independent evolution to a similar structure. Here, we show that tandem repeats provide strong evidence for the presence of analogous protein structures. Our results suggest analogs infiltrate structure search results and care should be taken when relying on structural similarity alone if homology is desired. This problem may extend beyond repeat proteins to other low complexity folds, and structure search tools could be improved by masking these regions in the same manner as done by sequence search programs.
2025-07-22	12:20:00	12:30:00	12	Function: Gene and Protein Function Annotation	Evolution of the Metazoan Protein Domain Toolkit Revealed by a Birth-Death-Gain Model	Maureen Stolzer	In person	Maureen Stolzer, Yuting Xiao, Dannie Durand	Domains, sequence fragments that encode protein modules with a distinct structure and function, are the basic building blocks of proteins. The set of domains encoded in the genome serves as the functional toolkit of the species. Here, we use a phylogenetic birth-death-gain model to investigate the evolution of this protein toolkit in metazoa. Given a species tree and the set of protein domain families in each present-day species, this approach estimates the most likely rates of domain origination, duplication, and loss. Statistical hierarchical clustering of domain family rates reveals sets of domains with similar rate profiles, consistent with groups of domains evolving in concert. Moreover, we find that domains with similar functions tend to have similar rate profiles. Interestingly, domains with functions associated with metazoan innovations, including immune response, cell adhesion, tissue repair, and signal transduction, tend to have the fastest rates. We further infer the expected ancestral domain content and the history of domain family gains, losses, expansions, and contractions on each branch of the species tree. Comparative analysis of these events reveals that a small number of evolutionary strategies, corresponding to toolkit expansion, turnover, specialization, and streamlining, are sufficient to describe the evolution of the metazoan protein domain complement. Thus, the use of a powerful, probabilistic birth-death-gain model reveals a striking harmony between the evolution of domain usage in metazoan proteins and organismal innovation.
2025-07-22	12:30:00	12:40:00	12	Function: Gene and Protein Function Annotation	Deep Phylogenetic Reconstruction Reveals Key Functional Drivers in the Evolution of B1/B2 Metallo-β-Lactamases	Samuel Davis	In person	Samuel Davis, Pallav Joshi, Ulban Adhikary, Julian Zaugg, Phil Hugenholtz, Marc Morris, Gerhard Schenk, Mikael Boden	Metallo-β-lactamases (MBLs) comprise a diverse family of antibiotic-degrading enzymes. Despite their growing implication in drug-resistant pathogens, no broadly effective clinical inhibitors against MBLs currently exist. Notably, β-lactam-degrading MBLs appear to have emerged twice from within the broader, catalytically diverse MBL-fold protein superfamily, giving rise to two distinct monophyletic groups: B1/B2 and B3 MBLs. Comparative analyses have highlighted distinct structural hallmarks of these subgroups, particularly in metal-coordinating residues. However, the precise evolutionary events underlying their emergence remain unclear due to challenges presented by extensive sequence divergence. Understanding the molecular determinants driving the evolution of β-lactamase activity may inform design of broadly effective inhibitors. We sought to infer the evolutionary features driving the emergence of B1/B2 MBLs via phylogenetics and ancestral reconstruction. To overcome challenges associated with evolutionary analysis at this scale, we developed a phylogenetically aware sequence curation framework centred on iterative profile HMM refinement. This framework was applied over several iterations to construct a comprehensive phylogeny encompassing the B1/B2 MBLs and several other recently diverged clades. The resulting tree represents the most robust hypothesis to date regarding the emergence of B1/B2 MBLs and implies a parsimonious evolutionary history of key features, including variation in active site architecture and insertions and deletions of distinct structural elements. Ancestral proteins inferred at key internal nodes were experimentally characterised, revealing distinct activity profiles that reflect underlying evolutionary transitions. These findings give rise to testable hypotheses regarding the molecular basis and evolutionary drivers of functional diversification, as well as potential targets for MBL inhibitor design.
2025-07-22	12:40:00	12:50:00	12	Function: Gene and Protein Function Annotation	A compendium of human gene functions derived from evolutionary modeling	Paul D. Thomas	In person	Marc Feuermann, Huaiyu Mi, Pascale Gaudet, Anushya Muruganujan, Suzanna Lewis, Dustin Ebert, Tremayne Mushayahama, Gene Ontology Consortium, Paul D. Thomas	A comprehensive, computable representation of the functional repertoire of all macromolecules encoded within the human genome is a foundational resource for biology and biomedical research. We have recently published a paper (Feuermann et al., Nature 640:146, 2025) describing our initial release of a human gene “functionome,” a comprehensive set of human gene function descriptions using Gene Ontology (GO) terms, supported by experimental evidence. This work involved integration of all applicable experimental Gene Ontology (GO) annotations for human genes and their homologs, using a formal, explicit evolutionary modeling framework. We will review this work and its major findings, and describe subsequent progress on an updated version. In more detail, we will describe the results of a large, international effort to integrate experimental findings from more than 100,000 publications to create a representation of human gene functions that is as complete and accurate as possible. Specifically, we applied an expert-curated, explicit evolutionary modeling approach to all human protein-coding genes, which integrates available experimental information across families of related genes into models reconstructing the gain and loss of functional characteristics over evolutionary time. The resulting set of integrated functions covers ~82% of human protein-coding genes, and the evolutionary models provide insights into the evolutionary origins of human gene functions. We show that our set of function descriptions can improve the widely used genomic technique of GO enrichment analysis. The experimental evidence for each functional characteristic is recorded, enabling the scientific community to help review and improve the resource, available at https://functionome.geneontology.org.
2025-07-22	12:50:00	13:00:00	12	Function: Gene and Protein Function Annotation	pLM in functional annotation: relationship between sequence conservation and embedding similarity	Ana Rojas	In person	Ana Rojas, Ildefonso Cases, Rosa Fernandez, Gemma Martínez-Redondo, Francisco M. Perez-Canales	Functional annotation of protein sequences remains a bottleneck for understanding the biology of both model and non-model organisms, as conventional homology-based tools often fail to assign functions to the majority of newly sequenced genes. We first benchmarked each protein language model (pLM) on well-characterized model organisms, demonstrating superior recovery of functional signals from transcriptomic datasets compared to traditional methods. We then applied our pipeline to annotate ~1,000 animal proteomes, encompassing 23 million genes, and discovered candidate genes involved in gill regeneration in a non-model insect. To elucidate how pLM embeddings relate to primary-sequence conservation, we computed cosine distances between embeddings and aligned sequences to derive percent identity. Statistical analyses (including Pearson correlation, polynomial regression, and quantile regression) revealed complex, non-linear relationships between embedding similarity and sequence identity that vary markedly across models. These findings indicate that pLM embeddings capture orthogonal functional features beyond simple residue conservation. Altogether, our work highlights the power of pLM-based annotation for expanding functional insights in biodiversity projects and underscores the need to interpret embedding distances in light of each model’s unique representational biases.
2025-07-22	14:00:00	14:20:00	12	Function: Gene and Protein Function Annotation	GOAnnotator: Accurate protein function annotation using automatically retrieved literature	Huiying Yan	In person	Huiying Yan, Hancheng Liu, Shaojun Wang, Shanfeng Zhu	Automated protein function prediction/annotation (AFP) is vital for understanding biological processes and advancing biomedical research. Existing text-based AFP methods including the state-of-the-art method, GORetriever, rely on expert-curated relevant literature, which is costly and time-consuming, and covers only a small portion of the proteins in UniProt. To overcome this limitation, we propose GOAnnotator, a novel framework for automated protein function annotation. It consists of two key modules: PubRetriever, a hybrid system for retrieving and re-ranking relevant literature, and GORetriever+, an enhanced module for identifying Gene Ontology (GO) terms from the retrieved texts. Extensive experiments over three benchmark datasets demonstrate that GOAnnotator delivers high-quality functional annotations, surpassing GORetriever by uncovering unique literature and predicting additional functions. These results highlight its great potential to streamline and enhance the annotation of protein functions without relying on manual curation.
2025-07-22	14:20:00	14:40:00	12	Function: Gene and Protein Function Annotation	Semi-Supervised Data-Integrated Feature Importance Enhances Performance and Interpretability of Biological Classification Tasks	Jun Kim	In person	Jun Kim, Russ Altman	Accurate model performance on training data does not ensure alignment between the model’s feature weighting patterns and human knowledge, which can limit the model’s relevance and applicability. We propose Semi-Supervised Data-Integrated Feature Importance (DIFI), a method that numerically integrates a priori knowledge, represented as a sparse knowledge map, into the model’s feature weighting. By incorporating the similarity between the knowledge map and the feature map into a loss function, DIFI causes the model’s feature weighting to correlate with the knowledge. We show that DIFI can improve the performance of neural networks using two biological tasks. In the first task, cancer type prediction from gene expression profiles was guided by identities of cancer type-specific biomarkers. In the second task, enzyme/non-enzyme classification from protein sequences was guided by the locations of the catalytic residues. In both tasks, DIFI leads to improved performance and feature weighting that is interpretable. DIFI is a novel method for injecting knowledge to achieve model alignment and interpretability.
2025-07-22	14:40:00	15:00:00	12	Function: Gene and Protein Function Annotation	On the completeness, coherence, and consistency of protein function prediction: lifting function prediction from isolated proteins to biological systems	Rund Tawfiq	In person	Rund Tawfiq, Maxat Kulmanov, Robert Hoehndorf	The Critical Assessment of Functional Annotation (CAFA) defines protein function prediction as the task of assigning Gene Ontology (GO) terms to individual proteins, and evaluates performance using ontology-based metrics. However, proteins rarely function in isolation; instead, they act within biological systems that impose genome-wide constraints. With the increasing availability of complete genomes, we define a new computational problem that extends the CAFA approach to genome-scale protein function prediction. Defining this task allows us to evaluate the biological plausibility of a set of predicted functions. We propose three evaluation criteria: completeness, coherence, and consistency. Completeness requires that all biologically essential functions are predicted for at least one protein in a genome. Coherence ensures that all necessary dependencies between functions are satisfied. Consistency is the absence of mutually exclusive functions within a genome or protein. We formalize these criteria as logical constraints using GO axioms, inter-ontology mappings, and curated biological knowledge. We implemented an evaluation framework based on the constraints we define, and applied it to six function prediction methods (DeepGOMeta, InterProScan, DeepFRI, TALE, DeepGraphGO, SPROF-GO) across 1,000 complete bacterial genomes. We also applied it to annotations from six well-annotated bacterial model organisms. The methods were not specifically designed to perform our genome-scale function prediction task, and our results revealed limitations in all methods when assessed against the metrics. Our results demonstrate that current methods, while effective at the protein level, do not produce biologically plausible proteome annotations, motivating new frameworks for function prediction grounded in system-level biological constraints.
2025-07-22	15:00:00	15:20:00	12	Function: Gene and Protein Function Annotation	Contextual Gene Set Analysis with Large Language Models	Chih-Hsuan Wei	In person	Zhizheng Wang, Chi-Ping Day, Chih-Hsuan Wei, Qiao Jin, Robert Leaman, Yifan Yang, Shubo Tian, Aodong Qiu, Yin Fang, Qingqing Zhu, Xinghua Lu, Zhiyong Lu	Gene set analysis (GSA) is a foundational technique in genomics research, enabling the identification of biological processes and disease mechanisms associated with genes. Traditional GSA methods typically rely on predefined, manually curated biological databases to identify statistically enriched functions from gene sets created by high-throughput studies. However, these approaches, as well as the recent large language model (LLM)-based methods, generally overlook the biological and experimental contexts in which the gene sets were derived. Consequently, they often produce extensive lists of enriched pathways that are generic, redundant, or misaligned with the study objectives. In addition, conventional GSA methods do not account for gene interactions within the input set, frequently resulting in the overrepresentation of central hub genes. This lack of context-awareness limits the biological relevance of the findings and hinders the accurate interpretation of results, thereby reducing the potential to derive meaningful insights or generate hypothesis-driven conclusions.
2025-07-22	15:20:00	15:40:00	12	Function: Gene and Protein Function Annotation	Fine-tuning protein language models with a disorder-aware vocabulary improves intrinsic disorder classification and function prediction	Harsh Srivastava	In person	Harsh Srivastava, Daniel Berenberg, Omar Qassab, Jane M. Carlton, Richard Bonneau	Intrinsically disordered proteins (IDPs) and intrinsically disordered regions (IDRs) are essential to cellular processes but lack stable 3D conformations amenable to experimental structure determination. However, identifying key disorder-driving residues and their disorder-related functions remains challenging. Although protein language models (pLMs) generate rich sequence embeddings for many classification tasks, their explicit application to IDPs/IDRs is underexplored. Drawing from prior protein structure tokenization approaches, we hypothesize that fine-tuning pLM embeddings with disorder-aware tokens can substantially enhance downstream performance while preserving pretrained model representations. Here, we introduce a unified framework for predicting disordered residues, disordered binding regions, and disordered linker regions. (1) We developed DisToken, a disorder-aware per-residue vocabulary generated using a VQ-VAE trained on relevant intrinsic disorder annotations from MobiDB. DisTokens encode a meaningful composite of annotations, capture nuanced residue context, and distinguish intrinsic disorder from broader features, providing an alternative to conventional one-hot encodings used previously for fine-tuning pLMs. (2) We fine-tuned a low-parameter ESM-2 model with DisTokens, resulting in ESM-DisTok, which learned disorder-aware representations. (3) Minimal 1-D CNN classifiers trained on ESM-DisTok embeddings significantly outperformed those using baseline ESM embeddings and structure-aware ESM-3Di embeddings in disorder-residue classification, disorder-binding, and disorder-linker tasks. On CAID-2 benchmarks, our minimal ESM-DisTok-based classifiers ranked 1st by AUC and AUPR in predicting disorder-PDB, disorder-binding, and disorder-NOX, and 2nd for disorder-linker tasks relative to previously published methods. Overall, we demonstrate that integrating a disorder-aware vocabulary into pLM embeddings drastically enhances downstream intrinsic disorder-related predictive tasks.
2025-07-22	15:40:00	15:50:00	12	Function: Gene and Protein Function Annotation	A Novel Computational Pipeline for the Functional Characterization and Deorphanization of G-Protein Coupled Receptors	Catherine Zhou	Live stream	Catherine Zhou	G protein-coupled receptors (GPCRs) are integral membrane proteins central to cellular signaling and intercellular communication, with Class A GPCRs playing key roles in many physiological processes and diseases. Despite their therapeutic potential, many remain orphan receptors, lacking identified endogenous ligands. Traditional de-orphaning methods are labor- and resource-intensive, highlighting the need for more efficient strategies. Here, we describe ongoing development of a multi-omics pipeline combining GPCR and ligand features, AI structural predictions, binding pocket analyses, and genomic and transcriptomic sequencing data to streamline the discovery of ligand pairings with orphan GPCRs. The pipeline analyzes tissue-specific gene expression data to identify co-expressed GPCR-ligand pairs, which are positioned to interact. Receptor and candidate ligand sequences and motif analyses inform potential ligand binding regions, while coevolution, conservation, and binding site similarity analysis refine interaction predictions. To model GPCR-ligand complexes, structural predictions (AlphaFold2/3, Boltz-1/2) are generated using a high-throughput pipeline optimized for parallelized batch execution on high performance servers. Models are evaluated using novel metrics to assess ligand binding feasibility, such as distance measurements between ligand and receptor domains and aggregated interaction scores across different types of contacts. The computational predictions are validated using experimental techniques. Initial application of this integrated approach has successfully identified novel ligand-receptor interactions, with ongoing efforts to develop a recurrent neural network for improved interaction classification. The pipeline’s success in deorphanizing GPCRs will lead to initiatives to expand its use for drug discovery, accelerating the identification of therapeutic targets for complex diseases.
2025-07-22	15:50:00	16:00:00	12	Function: Gene and Protein Function Annotation	VaLPAS: Leveraging variation in experimental multi-omics data to elucidate protein function	Jason McDermott	In person	Yannick Mahlich, Lummy Monteiro, Jason McDermott	Despite continuing advances in sequencing and computational function determination, large parts of the studied gene, protein and metabolite space remain functionally undetermined. Most function assignment is driven by homology searches and annotation transfer from known and extensively studied proteins but often fails to leverage available experimental omics data generated via technologies like mass spectrometry. The VaLPAS (Variation-Leveraged Phenomic Association Study) framework is an approach combining experimental multi-omics readouts with computational methods to establish functional relationships between different omics modalities. The goal of this approach is to shed light on the functional dark matter of protein space by elucidating previously unknown functions of proteins and metabolites via association metrics (e.g. protein-metabolite correlation) and graph algorithms. We demonstrate that the framework can reliably recapitulate known functional relationships, by applying VaLPAS to multi-omic data from Rhodosporidium toruloides and Yarrowia lipolytica cultured under different growth and stress conditions. We used KEGG Ortholog for detected proteins and KEGG Compound annotations for metabolites, evaluating the resulting association scores in the context of chemical reactions (KEGG Reactions) and metabolic pathways (KEGG modules & pathways) utilizing network analysis approaches. The resulting performance metrics detail the applicability of using experimental abundance data from detectable metabolites and proteins (extendable to other modes of experimental data) to infer protein functionality and metabolite annotation for as-yet unannotated data. Finally, the results imply that the approach can also aid in guiding experimental design to validate functional annotations.
2025-07-22	16:40:00	17:00:00	12	Function: Gene and Protein Function Annotation	Accelerating protein family classification in InterPro with AI innovations	Matthias Blum	In person	Matthias Blum, Alessandro Polignano, Irina Ponamareva, Alex Bateman	InterPro is a freely accessible resource for classifying protein sequences into families, domains, and functional sites, integrating predictive signatures from member databases such as Pfam, CDD, and PROSITE. However, generating descriptive abstracts for unannotated signatures is a time-consuming manual task. To address this, we employed large language models (LLMs) to generate high-quality family descriptions. Using GPT-4 with Swiss-Prot-derived context, we automatically produced abstracts for over 5,000 PANTHER families. Nearly 3,900 of these were used to create new InterPro entries, completing in days what previously took months of curation. Since 2021, in collaboration with Dr Lucy Colwell's team at Google DeepMind, we have also explored deep learning for protein domain classification. This led to the development of InterPro-N, a novel model inspired by computer vision techniques and trained on all 13 InterPro member databases. InterPro-N significantly expands annotation coverage, assigning at least one annotation to ~90% of UniProtKB 2025_02 sequences, up from 84% using traditional methods. Predictions are accessible via the InterPro website, REST API, and FTP. Additionally, we have integrated over 300,000 structure predictions from the Big Fantastic Virus Database (BFVD) and domain boundaries from The Encyclopedia of Domains (TED), derived from AlphaFold models. These structure-based insights are now shown alongside conventional InterPro and InterPro-N results, enabling users to compare annotations across methodologies. Together, these AI-driven advances accelerate curation, expand functional coverage, and enrich protein classification, supporting faster and more comprehensive annotation of the rapidly growing protein sequence universe.
2025-07-22	17:00:00	17:20:00	12	Function: Gene and Protein Function Annotation	Thousands of confident genetic interactions in an Escherichia coli mutant collection elucidate numerous gene functions	Simon Jeanneau	In person	Simon Jeanneau, Mathias Martin Silva, Antoine Champie, Amélie De Grandmaison, Antoine Castonguay, Jean-Philippe Côté, Sébastien Rodrigue, Pierre-Étienne Jacques	Despite extensive research, nearly one-third of Escherichia coli genes remain uncharacterized. Understanding how these genes interact to support cellular viability is essential not only for fundamental biology but also for identifying vulnerabilities that may guide novel antimicrobial strategies. While resources such as the Keio collection, which includes a comprehensive set of single-gene deletion mutants, have significantly advanced our knowledge of essential genes, the combinatorial nature of gene interactions remains largely unexplored at the genome scale, particularly in the context of synthetic lethality. We recently developed High-Throughput Transposon Mutagenesis (HTTM), an optimized, high-resolution method for the systematic exploration of genetic interactions. By applying HTTM across thousands of mutants, we probed nearly 16 million gene pairs for synthetic lethality, resulting in the most comprehensive interaction screen conducted in E. coli to date. Our analysis successfully recovered known synthetic lethal pairs and identified thousands of previously unreported interactions, including many involving poorly annotated or uncharacterized genes. Within this dataset, we identified densely connected regions of the interaction network, revealing genes that participate in numerous critical interactions. These interaction hubs represent vulnerable nodes in bacterial survival networks. Furthermore, the recurring association of uncharacterized genes with well-annotated functional clusters supports the concept of functional propagation, a process by which gene function can be inferred from shared interaction patterns. This extensive interaction map enhances the functional annotation of the E. coli genome and highlights combinatorial genetic vulnerabilities. These findings provide a valuable foundation for investigating bacterial physiology and for identifying new targets in the pursuit of antimicrobial development.
2025-07-22	17:20:00	17:40:00	12	Function: Gene and Protein Function Annotation	Present and future of the critical assessment of protein function annotation algorithms (CAFA)	M. Clara De Paolis Kaluza	In person	M. Clara De Paolis Kaluza, Rashika Ramola, Parnal Joshi, An Phan, Priyanka Banarjee, Damiano Piovesan, Walter Reade, Maggie Demkin, Addison Howard, Nate Keating, Paul Thomas, Maria Martin, Sandra Orchard, Iddo Friedberg, Predrag Radivojac	Since its launch in 2010, the Critical Assessment of Functional Annotation (CAFA) has brought together computational biologists, biocurators, and experimental biologists to benchmark the state of computational prediction of protein function. It has served as a forum for discussion and collaboration to drive innovation in the field. Recent advances in protein representation, coupled with a growing interest from the machine learning community in biological applications, motivated CAFA organizers to expand their reach and invite a broader range of model developers to participate. To this end, the fifth CAFA experiment (CAFA 5) was conducted in partnership with Kaggle, a platform for data science competitions and collaborative model development. The reach and technology of this format resulted in a 22-fold increase over previous CAFAs in the number of participating teams, composed of entrants from 77 countries and various scientific and technical backgrounds. In this talk, we present an expanded analysis of the prediction models in CAFA 5 and discuss plans for CAFA 6. Our analysis of CAFA 5 shows marked improvements in the performance of predictions on Gene Ontology (GO) term annotations compared to models from past CAFA evaluations. We present a new setting for evaluating predictions of function annotations added to proteins with previously incomplete annotations and we suggest new directions for future computational prediction improvements based on these evaluations. Finally, we turn our attention to the future and discuss the planned challenges and assessments for CAFA 6, which will be launched in 2025.
2025-07-22	17:40:00	18:00:00	12	Function: Gene and Protein Function Annotation	ProtHGT: Heterogeneous Graph Transformers for Automated Protein Function Prediction Using Biological Knowledge Graphs and Language Models	Erva Ulusoy	In person	Erva Ulusoy, Tunca Dogan	Accurate functional annotation of proteins is crucial for understanding complex biological systems. As protein sequence data grows rapidly, experimental methods cannot keep pace, underscoring the need for scalable computational approaches. In this study, we present ProtHGT, a heterogeneous graph transformer-based model designed to predict protein functions by integrating diverse biological datasets, including protein-protein interactions, pathways, domains, and phenotypic data. ProtHGT constructs a comprehensive heterogeneous graph with over 542,000 nodes and 3.7 million edges to capture complex biological relationships and employs relationship-specific attention mechanisms to refine node embeddings into biologically meaningful representations. It achieves state-of-the-art performance on benchmark datasets, consistently outperforming graph-based and sequence-based approaches. Advanced pretrained embeddings further enhance predictive accuracy by providing rich feature representations. Ablation analyses highlight the critical role of heterogeneous data integration, demonstrating the value of incorporating multiple node types, such as pathways and domains, to improve predictions. To ensure accessibility, ProtHGT is available as a programmatic tool on https://github.com/HUBioDataLab/ProtHGT and as a user-friendly web service on https://huggingface.co/spaces/HUBioDataLab/ProtHGT, enabling researchers with varying expertise to easily utilize the model. By integrating diverse data sources and leveraging cutting-edge graph transformer architecture, ProtHGT establishes itself as a powerful and accessible tool for advancing bioinformatics research.