Schedule subject to change
All times listed are in EDT
Saturday, July 13th
10:40-11:00
Roll Call and Introduction to the Function COSI
Room: 520b
Format: In Person

Moderator(s): Ana Rojas


Authors List:

  • Iddo Friedberg

Presentation Overview:

Free cash will be given to the 7th and 131st persons to show up. Maybe.

11:00-11:40
Invited Presentation: Linking Gene and function in the post-genomic era: issues and opportunities
Confirmed Presenter: Valerie de Crécy-Lagard

Room: 520b
Format: In Person

Moderator(s): Ana Rojas


Authors List:

  • Valerie de Crécy-Lagard

Presentation Overview:

Identifying the function of every gene in all sequenced organisms is the major challenge of the post-genomic era and an obligate step for any systems biology approach. This objective is far from reached. By various estimates, at least 30-70% of the genes of any given organism are of unknown function, incorrectly annotated, or have only a generic annotation. Using comparative genomic approaches, we have linked genes and functions for over 65 gene families related mainly to coenzyme metabolism, nucleic acid modification, protein modification, and metabolite repair. Building on this body of work I will discuss the lessons learned, the next steps needed to improve protein function prediction and discovery, and the role of machine learning in this process.

11:40-12:00
Unveiling the Functional Fate of Duplicated Genes Through Expression Profiling
Confirmed Presenter: Alex Warwick Vesztrocy, University of Lausanne, SIB Swiss Institute of Bioinformatics, Switzerland

Room: 520b
Format: In Person

Moderator(s): Ana Rojas


Authors List:

  • Alex Warwick Vesztrocy, University of Lausanne, SIB Swiss Institute of Bioinformatics, Switzerland
  • Natasha Glover, University of Lausanne, SIB Swiss Institute of Bioinformatics, Switzerland
  • Christophe Dessimoz, University of Lausanne, SIB Swiss Institute of Bioinformatics, Switzerland
  • Paul D Thomas, University of Southern California, United States
  • Irene Julca, University of Lausanne, SIB Swiss Institute of Bioinformatics, Switzerland

Presentation Overview:

Gene duplication is a major evolutionary source of functional innovation – if one copy maintains the ancestral function then the other is no longer under selective pressure and is free to diverge. It is presumed that the more slowly evolving copy, in terms of sequence, maintains the ancestral function – “the least diverged orthologue (LDO) conjecture”. This is a specific case of the much-debated orthologue conjecture – that orthologous genes are functionally more similar than paralogous genes. However, annotation bias has led to issues when using gene ontology annotations in previous studies. In an attempt to remove this bias, this study uses gene expression profiles as a proxy for function.

This study analysed 15,693 gene families from PANTHER, paired with an extensive and large-scale expression atlas for 16 animal and 20 plant species. Using an evolutionary model, accounting for varying evolutionary rates between gene families, branches were categorised according to whether they display symmetric or asymmetric evolution following duplication.

Pairs of genes resulting from asymmetric duplications displayed a significantly lower expression profile similarity compared to those from symmetric duplications (Mann-Whitney, p<0.001). Additionally, the more diverged copy exhibited increased tissue specificity, suggesting specialisation.

The results of this study support the hypothesis that the gene copy with the least evolutionary change following duplication tends to retain the ancestral function. Additionally, the more diverged copy may acquire a novel, potentially specialised, function. This likely has implications when predicting functional annotations – particularly when only using pairwise distances between sequences.
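
The comparison of expression-profile similarity between the two duplication classes can be illustrated with a small rank-based test. The following is a minimal sketch of a two-sided Mann-Whitney U test using the normal approximation; it is not the authors' pipeline, and the similarity values and variable names are hypothetical. With samples this small the normal approximation is crude, so the p-value is only indicative.

```python
import math
from collections import Counter

def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of 1-based positions start+1..end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected,
    no continuity correction). Returns (U statistic for x, p-value)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    combined = list(x) + list(y)
    ranks = average_ranks(combined)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    tie_term = sum(t ** 3 - t for t in Counter(combined).values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:
        return u1, 1.0
    z = (u1 - n1 * n2 / 2) / math.sqrt(variance)
    return u1, math.erfc(abs(z) / math.sqrt(2))

# Hypothetical expression-profile similarities (e.g. correlations) per gene pair.
asymmetric_pairs = [0.20, 0.30, 0.25, 0.10]
symmetric_pairs = [0.80, 0.90, 0.70, 0.85]
u_stat, p_value = mann_whitney_u(asymmetric_pairs, symmetric_pairs)
print(u_stat, p_value)
```

A U statistic near zero for the first sample means its values almost always rank below those of the second, which is the direction reported for asymmetric duplications.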

12:00-12:20
Leveraging deep learning for characterization of malaria parasite PUFs — proteins of unknown function
Confirmed Presenter: Harsh R. Srivastava, Center for Genomics and Systems Biology, Department of Biology, New York University, New York, NY, USA, United States

Room: 520b
Format: In Person

Moderator(s): Ana Rojas


Authors List:

  • Harsh R. Srivastava, Center for Genomics and Systems Biology, Department of Biology, New York University, New York, NY, USA, United States
  • Daniel Berenberg, Courant Institute, Department of Computer Science, New York University, New York, NY, USA, United States
  • Omar Qassab, Center for Genomics and Systems Biology, Department of Biology, New York University, New York, NY, USA, United States
  • Tymor Hamamsy, Courant Institute, Department of Computer Science, New York University, New York, NY, USA, United States
  • Jane M. Carlton, Johns Hopkins Malaria Research Institute, Bloomberg School of Public Health, Baltimore, MD, USA, United States
  • Richard Bonneau, Prescient Design, gRED Computational Sciences, Genentech, New York, NY, USA, United States

Presentation Overview:

Exploiting the sequence-structure-function paradigm is crucial for annotating proteins of unknown function (PUFs) in Plasmodium falciparum, a member of the diverged eukaryotic SAR (Stramenopiles, Alveolates, and Rhizarians) supergroup. P. falciparum, a malaria-causing parasite, accounted for ~250 million cases and over 600,000 deaths in 2022. Discovery of diagnostic and therapeutic targets in P. falciparum is hindered, as ~23% of proteins are classified as PUFs while ~40% of proteins are partially annotated. Predicting GO annotations for these PUFs is difficult given low sequence similarity to annotated proteins and limited generalization of deep learning models trained on well-studied SwissProt species to diverged organisms. Here, we focused on the structure-function relationship of SAR sequences and developed a new method to predict GO terms in P. falciparum. PFP (Plasmodium Function Predictor) is a collection of structural-homology-based deep learning models trained using evolutionarily relevant structure-aware TM-Vec embeddings. We used a deep feedforward architecture with a dropout layer to predict GO annotations and quantify uncertainty using Monte Carlo dropout. When benchmarked against DeepGOPlus, PFP demonstrated a significant improvement in Fmax, Smin, and AUPR-micro/AUPR-macro for our test split as well as our Plasmodium holdout split. PFP-predicted GO terms respected the hierarchical structure of GO and aligned with expected information content distributions. For poorly annotated proteins, PFP imputed GO terms which are biologically plausible given existing annotations. Additionally, predictions made by PFP were categorized into confidence levels and aligned with published data targeting specific P. falciparum PUFs. PFP is the first curated function prediction model developed specifically for a subset of eukaryotic species. We will discuss findings in model architecture and highlight specific GO predictions contributing to an increase of more than 25% in P. falciparum proteome annotation.
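
For readers unfamiliar with the Fmax metric mentioned above, the following is a minimal sketch of the protein-centric Fmax as it is commonly defined in CAFA-style evaluations: precision is averaged over proteins that have at least one prediction at a threshold, recall is averaged over all benchmark proteins, and the best F-measure over thresholds is reported. The data layout, function name, and toy GO identifiers are assumptions for illustration; propagation of terms up the GO hierarchy is omitted, and every benchmark protein is assumed to have at least one true term.

```python
def fmax(predictions, truth, thresholds=None):
    """Protein-centric Fmax.

    predictions: {protein_id: {go_term: score in [0, 1]}}
    truth:       {protein_id: set of true go_terms (non-empty)}
    """
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 101)]
    best = 0.0
    for t in thresholds:
        precisions = []
        recall_sum = 0.0
        for protein, true_terms in truth.items():
            scored = predictions.get(protein, {})
            predicted = {term for term, score in scored.items() if score >= t}
            hits = len(predicted & true_terms)
            if predicted:  # precision only counts proteins with predictions
                precisions.append(hits / len(predicted))
            recall_sum += hits / len(true_terms)
        if not precisions:
            continue
        precision = sum(precisions) / len(precisions)
        recall = recall_sum / len(truth)
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best

# Toy example with placeholder term identifiers.
truth = {"P1": {"GO:A", "GO:B"}, "P2": {"GO:C"}}
predictions = {
    "P1": {"GO:A": 0.9, "GO:B": 0.4, "GO:X": 0.2},
    "P2": {"GO:C": 0.8, "GO:Y": 0.6},
}
print(fmax(predictions, truth))
```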

14:20-15:00
Crowdsourcing the Fifth Critical Assessment of Protein Function Annotation Algorithms (CAFA 5)
Confirmed Presenter: M. Clara De Paolis Kaluza, Northeastern University, United States

Room: 520b
Format: In Person

Moderator(s): Mark Wass


Authors List:

  • M. Clara De Paolis Kaluza, Northeastern University, United States
  • Rashika Ramola, Northeastern University, United States
  • Damiano Piovesan, BioComputing UP - University of Padova, Italy
  • Parnal Joshi, Iowa State University, United States
  • Iddo Friedberg, Iowa State University, United States
  • Predrag Radivojac, Northeastern University, United States

Presentation Overview:

The Critical Assessment of Functional Annotation (CAFA) is a long-standing, ongoing community effort to independently assess computational methods for protein function prediction, to highlight well-performing methodologies, to identify bottlenecks in the field, and to provide a forum for dissemination of results and exchange of ideas. Every three years since its inception in 2010, CAFA has solicited participation from computational groups and engaged biocurators and experimental biologists to collect high-quality data on which to evaluate algorithmic performance in a series of prospective computational challenges to predict function for a large set of target proteins. For the 5th installment (CAFA 5), launched in 2023, a partnership with Kaggle Inc. facilitated the participation of a much broader community of data scientists. Predictions were collected as entries to a competitive challenge on the crowdsourced science platform. The reach and technology of this approach resulted in a 22x increase in the number of participating teams, comprising entrants from 77 countries and various scientific and technical backgrounds. At the conclusion of the challenge, predictions were evaluated using a summary metric on a limited set of proteins which had accumulated annotations during a four-month period. In this talk, we will present the effect of the increased and diversified participation on the quality of Gene Ontology (GO) term predictions, evaluated on an expanded set of annotations, in greater detail across ontology aspects, and in relation to past CAFA evaluations.

15:00-15:20
StarFunc: interplaying template-based and deep learning approach for accurate protein function prediction
Confirmed Presenter: Chengxin Zhang, University of Michigan, United States

Room: 520b
Format: In Person

Moderator(s): Mark Wass


Authors List:

  • Chengxin Zhang, University of Michigan, United States
  • Quancheng Liu, University of Michigan, United States
  • P Lydia Freddolino, University of Michigan, United States

Presentation Overview:

Despite significant advancements in the development of novel methods for protein function prediction via deep learning, template information often remains a critical component. Common predictions typically utilize templates identified through sequence homology or protein-protein interactions. Yet, relatively few methods leverage structure similarity for template detection, even though protein structures underpin function. In response to this, we developed StarFunc, a comprehensive approach that seamlessly marries cutting-edge deep learning models with template information. This information is garnered from a range of sources, including sequence homology, protein-protein interaction partners, similar structures, and protein domain families. StarFunc was put to the test in large-scale benchmarking and blind tests during the 5th Critical Assessment of Function Annotation (CAFA5). The results consistently highlight its advantage over not only conventional template-based predictors but also other state-of-the-art deep learning methods.

15:20-15:40
ProtBoost: Prediction of functional properties of the proteins by Py-Boost and protein language models (CAFA5 top2)
Confirmed Presenter: Alexander Chervov, 1. Institut Curie, 2. Inserm U900, 3. Mines ParisTech, 4. PSL University, France

Room: 520b
Format: Live Stream

Moderator(s): Mark Wass


Authors List:

  • Alexander Chervov, 1. Institut Curie, 2. Inserm U900, 3. Mines ParisTech, 4. PSL University, France
  • Loredana Martignetti, 1. Institut Curie, 2. Inserm U900, 3. Mines ParisTech, 4. PSL University, France
  • Anton Vakhrushev, Higher School of Economics, Sberbank, Russia
  • Sergei Fironov, Yandex, Russia

Presentation Overview:

We will describe a machine learning approach to predicting protein functions from their sequences, which allowed our team to take second place in CAFA5 with a significant gap to the third-placed team. The main novelty of our approach is Py-Boost, a new gradient boosting algorithm designed to predict multiple targets simultaneously (designed by one of the team members; NeurIPS 2022). Gradient boosting methods (XGBoost, LightGBM, CatBoost) are the most powerful methods to date for tabular data; however, they are too slow for tasks with hundreds or thousands of targets. Py-Boost overcomes that problem through a special approximation of the loss function, thus combining the effectiveness of gradient boosting with the ability to predict thousands of targets at once. Other key ingredients are the extensive use of protein language models (T5, ESM2), a GCN (graph neural network) to aggregate predictions across the Gene Ontology, and an ensemble of neural networks. Language models are becoming quite popular tools for various tasks in bioinformatics: for proteins, DNA/RNA, as well as small-molecule SMILES data. We will report our analysis of the different language models obtained during the CAFA5 challenge as well as subsequent analysis.

15:40-16:00
GORetriever: Reranking protein-description-based GO candidates by literature-driven deep information retrieval for precise protein function annotation
Confirmed Presenter: Huiying Yan, Fudan University, China

Room: 520b
Format: Live Stream

Moderator(s): Mark Wass


Authors List:

  • Huiying Yan, Fudan University, China
  • Shaojun Wang, Fudan University, China
  • Hancheng Liu, Fudan University, China
  • Hiroshi Mamitsuka, Kyoto University / Aalto University, Japan
  • Shanfeng Zhu, Fudan University, China

Presentation Overview:

The vast majority of proteins still lack experimentally validated functional annotations, which highlights the importance of developing high-performance automated protein function prediction/annotation (AFP) methods. While existing approaches focus on protein sequences, networks, and structural data, textual information related to proteins has been overlooked. However, roughly 82% of SwissProt proteins already possess literature information that experts have annotated. To efficiently and effectively use literature information, we present GORetriever, a two-stage deep information retrieval-based method for AFP. Given a target protein, in the first stage, candidate Gene Ontology (GO) terms are retrieved by using annotated proteins with similar descriptions. In the second stage, the GO terms are reranked based on semantic matching between the GO definitions and textual information (literature and protein description) of the target protein. Extensive experiments over benchmark datasets demonstrate the remarkable effectiveness of GORetriever in enhancing AFP performance. Note that GORetriever is the key component of GOCurator, which achieved first place in the latest critical assessment of protein function annotation (CAFA5: over 1,600 teams participated), held in 2023–24.

16:40-17:00
InterLabelGO+: Unraveling label correlations in protein function prediction
Confirmed Presenter: Quancheng Liu, University of Michigan, United States

Room: 520b
Format: In Person

Moderator(s): Dukka KC


Authors List:

  • Quancheng Liu, University of Michigan, United States
  • Chengxin Zhang, University of Michigan, United States
  • P Lydia Freddolino, University of Michigan, United States

Presentation Overview:

Accurate prediction of protein functions is crucial for understanding biological processes and advancing biomedical research. However, the rapid growth of known protein sequences far outpaces experimental characterization of protein function, necessitating the development of automated computational methods. We present InterLabelGO, a cutting-edge deep learning approach that leverages protein language models and addresses the challenges of label imbalance and label dependencies in protein function prediction. By incorporating a novel loss function that captures complex functional relationships and integrates alignment-based methods through dynamic weighting, InterLabelGO achieves remarkable performance in predicting Gene Ontology (GO) terms. In the recent CAFA5 challenge, a preliminary version of InterLabelGO ranked 6th out of 1,625 teams worldwide, showcasing its effectiveness compared to state-of-the-art methods. Comprehensive evaluations on large-scale datasets demonstrate InterLabelGO's ability to accurately predict GO terms across various functional categories and evaluation metrics. With its innovative approach to harnessing deep learning and label correlations, InterLabelGO represents a significant advancement in automated protein function prediction, offering the potential to greatly accelerate and enrich the functional annotation of the ever-expanding universe of protein sequences.

17:00-17:20
Discovery of PETases using a computational classification system
Confirmed Presenter: Joel Roca Martinez, University College London, United Kingdom

Room: 520b
Format: In Person

Moderator(s): Dukka KC


Authors List:

  • Joel Roca Martinez, University College London, United Kingdom
  • Clemens Rauer, Universidad Autonoma de Madrid, Spain
  • Nicola Bordin, University College London, United Kingdom
  • Mahnaz Abbasian, University College London, United Kingdom
  • Josephin Holstein, University of Cambridge, United Kingdom
  • Mariana Rangel, University of Cambridge, United Kingdom
  • Florian Hollfelder, University of Cambridge, United Kingdom
  • Christine Orengo, University College London, United Kingdom

Presentation Overview:

Plastic accumulation is a pressing environmental issue that has escalated in recent decades. With around 25 million tons produced yearly, polyethylene terephthalate (PET) is the most common single-use plastic polymer. Current PET recycling protocols rely on mechanical processes that lead to a loss of properties and poor recycling rates. Enzymatic PET degradation emerged as an alternative solution a decade ago, with characterized natural enzymes and variants showing PET degrading activity (PETases). Considering PET has only been present in the environment for less than 50 years, PETases have not evolved naturally to degrade it, hindering the identification of very active enzymes with the right properties. In this study, starting from a dataset with over 1 billion sequences from the MGnify database, we filtered it to define a set of potential PETases that we refer to as the PETase zone, containing over 20,000 sequences. After splitting it into functional families, we focused on 2 families closely related to the family containing most of the known active PETases, while showing a high sequence diversity. To select a final set of promising candidates to test experimentally, we focused on a set of positions that we identified as differentially conserved among the functional families, thus likely important for the protein’s function. We used those positions alongside other information such as docking score, protein solubility or pocket properties to shortlist 11 potential PETases, of which 3 were experimentally active, proving the protocol’s efficacy and providing new starting points in the PETase sequence space.

17:20-17:40
Plastic-Ml-Tool, a machine learning tool for discovering and optimising plastic degrading enzymes
Confirmed Presenter: David Medina-Ortiz, Departamento de Ingeniería en Computación, Universidad de Magallanes, Chile

Room: 520b
Format: In Person

Moderator(s): Dukka KC


Authors List:

  • David Medina-Ortiz, Departamento de Ingeniería en Computación, Universidad de Magallanes, Chile
  • Anamaría Daza, Centre for Biotechnology and Bioengineering, CeBiB, Universidad de Chile, Chile
  • Nicole Soto-García, Departamento de Ingeniería en Computación, Universidad de Magallanes, Chile
  • Jacqueline Aldridge, Departamento de Ingeniería en Computación, Universidad de Magallanes, Chile
  • Diego Sandoval, Centre for Biotechnology and Bioengineering, CeBiB, Universidad de Chile, Chile
  • Bárbara Andrews, Centre for Biotechnology and Bioengineering, CeBiB, Universidad de Chile, Chile
  • Sebastián Rodríguez, Centre for Biotechnology and Bioengineering, CeBiB, Universidad de Chile, Chile
  • Diego Álvarez, Departamento de Ingeniería en Computación, Universidad de Magallanes, Chile
  • Juan A. Asenjo, Centre for Biotechnology and Bioengineering, CeBiB, Universidad de Chile, Chile

Presentation Overview:

Plastic contamination is a significant environmental threat that negatively impacts habitats, species, and ecosystems. Recycling strategies can involve chemical and thermo-mechanical processes to convert waste into energy. Nevertheless, these technologies have a slow processing time and a limited degrading rate. Biodegradation provides an environmentally friendly alternative through microorganisms with enzymes capable of degrading plastics. Machine learning approaches have been successfully applied to guide enzyme engineering for improving optimal temperature and to build classification systems for detecting plastic degrading enzymes. However, challenges persist in improving the generalisation of the predictive methods and discovering new plastic-degrading enzymes. This work presents Plastic-Ml-Tool, a machine-learning library that discovers and assists in designing new plastic-degrading enzymes. The proposed tool has a classification system to detect plastic-degrading enzymes and recognise different plastic targets. In addition, predictive models for optimal catalytic temperature were built using deep learning. Plastic-Ml-Tool has different generative approaches to discover new plastic degrading enzymes and an MLDE approach to guide the design of enzymes with desirable optimal catalytic temperatures. Different explorations were made to demonstrate the usability of the proposed methods, including the mapping of KEGG databases to recognise new plastic degrading enzymes, the generation of new PET enzymes through the generative approaches implemented in Plastic-Ml-Tool, and the protein engineering of target enzymes to improve PET-plastic degradation. The high usability and the ease of in-silico navigation of new enzymes demonstrate the advantages of the proposed work, which is becoming a valuable and high-impact tool to support experimental methods.

17:40-18:00
A BLAST from the past: revisiting BLAST's E-value
Confirmed Presenter: Yang Lu, University of Waterloo, Canada

Room: 520b
Format: In Person

Moderator(s): Dukka KC


Authors List:

  • Yang Lu, University of Waterloo, Canada
  • William Stafford Noble, University of Washington, United States
  • Uri Keich, The University of Sydney, Australia

Presentation Overview:

The Basic Local Alignment Search Tool, BLAST, is an indispensable tool for genomic research. BLAST established itself as the canonical tool for sequence similarity search in large part thanks to its meaningful statistical analysis. Specifically, BLAST reports the E-value of each reported alignment, which is defined as the expected number of optimal local alignments that will score at least as high as the observed alignment score, assuming that the query and the database sequences are randomly generated.
Here we critically reevaluate BLAST's E-values, showing that they can be at times significantly conservative while at others too liberal. We offer an alternative approach based on generating a small sample from the null distribution of random optimal alignments, and testing whether the observed alignment score is consistent with it.
In contrast with BLAST, our significance analysis seems valid, in the sense that it did not deliver inflated significance estimates in any of our extensive experiments. Moreover, although our method is slightly conservative, it is often significantly less so than the BLAST E-value. Indeed, in cases where BLAST's analysis is valid (i.e., not too liberal), our approach seems to deliver a greater number of correct alignments.
One advantage of our approach is that it works with any reasonable choice of substitution matrix and gap penalties, avoiding BLAST's limited options of matrices and penalties.
In addition, we can formulate the problem using a canonical family-wise error rate control setup, thereby dispensing with E-values, which can at times be difficult to interpret.
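
The general idea of testing an observed alignment score against a sample from a null distribution can be sketched with a generic Monte Carlo scheme. The code below is an illustration only, not the authors' procedure: it scores a Smith-Waterman local alignment with a simple match/mismatch scheme and a linear gap penalty, builds the null by shuffling the subject sequence (which preserves its composition), and reports an add-one empirical p-value. The scoring parameters, sample size, and sequences are assumptions, and multiple testing across a database is not addressed.

```python
import random

def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Optimal Smith-Waterman local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best

def empirical_p_value(query, subject, n_samples=200, seed=0):
    """Compare the observed score with scores against shuffled subjects.

    Returns (observed score, add-one empirical p-value)."""
    rng = random.Random(seed)
    observed = local_alignment_score(query, subject)
    letters = list(subject)
    at_least_as_high = 0
    for _ in range(n_samples):
        rng.shuffle(letters)
        if local_alignment_score(query, "".join(letters)) >= observed:
            at_least_as_high += 1
    return observed, (at_least_as_high + 1) / (n_samples + 1)

# Hypothetical sequences: the subject contains the query as a substring.
query = "MKTAYIAKQRQISFVKSHFSRQ"
subject = "GGSMKTAYIAKQRQISFVKSHFSRQLE"
print(empirical_p_value(query, subject))
```

With this estimator the smallest attainable p-value is 1 / (n_samples + 1), so a small null sample limits the resolution of the test rather than inflating its significance.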

Sunday, July 14th
10:40-11:20
Invited Presentation: Fast, high-performance biophysics-based computational methods in function prediction
Confirmed Presenter: Rafael Najmanovich

Room: 520b
Format: In Person

Moderator(s): Iddo Friedberg


Authors List:

  • Rafael Najmanovich

Presentation Overview:

In this presentation I will discuss a number of computational methods that rely on a simple approximation of molecular interactions as being proportional to the surface area in contact between the constituent atoms, modulated by a pairwise atom-type pseudo-energetic term. I will discuss three methods in this presentation: Surfaces, for the quantification of molecular interactions; NRGTEN, for normal-mode analysis; and NRGDock, for ultra-massive virtual screening. Their simplified nature makes them fast while still being accurate, allowing their utilization in a high-throughput manner for molecular function prediction.
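
The approximation described above can be written as a sum, over atom pairs in contact, of the contact area multiplied by a term that depends only on the two atom types. The sketch below illustrates that sum and nothing more; the atom-type labels, pseudo-energy values, areas, and function name are hypothetical and are not taken from Surfaces, NRGTEN, or NRGDock.

```python
def contact_score(contacts, pair_energy):
    """Sum of (contact area x atom-type pair term) over all contacts.

    contacts:    iterable of (atom_type_a, atom_type_b, contact_area)
    pair_energy: {(type_1, type_2): pseudo-energy}, keys sorted alphabetically
    """
    total = 0.0
    for type_a, type_b, area in contacts:
        key = tuple(sorted((type_a, type_b)))  # the pair term is symmetric
        total += pair_energy[key] * area
    return total

# Hypothetical atom-type pair terms and contact areas (arbitrary units).
pair_energy = {("C", "C"): -1.0, ("N", "O"): -2.0, ("C", "O"): 0.5}
contacts = [("C", "C", 10.0), ("N", "O", 5.0), ("O", "C", 2.0)]
print(contact_score(contacts, pair_energy))
```

Because the score is a single pass over precomputed contacts, its cost grows linearly with the number of contacts, which is consistent with the speed argument made in the abstract.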

11:20-11:40
Energetic Local Frustration Through Time and Species
Confirmed Presenter: R. Gonzalo Parra, Barcelona Supercomputing Center, Spain

Room: 520b
Format: In Person

Moderator(s): Iddo Friedberg


Authors List:

  • R. Gonzalo Parra, Barcelona Supercomputing Center, Spain
  • Maria Freiberger, Buenos Aires University, Argentina
  • Miriam Poley-Gil, Barcelona Supercomputing Center, Spain
  • Miguel Fernandez-Martin, Barcelona Supercomputing Center, Spain
  • Marko Ludaic, Barcelona Supercomputing Center, Spain
  • Victoria Ruiz-Serra, Barcelona Supercomputing Center, Spain
  • Leandro G. Radusky, Manas.tech, Spain
  • Peter G. Wolynes, Rice University, United States
  • Diego U. Ferreiro, Buenos Aires University, Argentina
  • Alfonso Valencia, Barcelona Supercomputing Center, Spain

Presentation Overview:

According to the Principle of Minimal Frustration, folded proteins minimize the amount of strong energetic conflicts in their native states. However, not all interactions are energetically optimized for folding but some remain in energetic conflict, i.e. they are highly frustrated. This remaining local energetic frustration has been shown to be statistically correlated with distinct functional aspects such as protein-protein interaction sites, allosterism and catalysis. Fuelled by the recent breakthroughs in efficient protein structure prediction that have made available good quality models for most proteins, we have developed a strategy to calculate local energetic frustration within large protein families and quantify its conservation over evolutionary time. Based on this evolutionary information we can identify how stability and functional constraints have appeared at the common ancestor of the family and have been maintained over the course of evolution.

I will summarize the results of two of our recent publications, plus some unpublished results, where we show how local frustration in proteins, and its conservation in extant protein families and protein ensembles, can be used to improve the biophysical understanding of the relationships between sequences, structures, dynamics and functions. Moreover, our local frustration based strategies can be exploited to better understand the protein features that are captured and output by novel Machine Learning methods and help to guide protein design strategies.

11:40-12:00
Function Prediction of Intrinsically Disordered Proteins and Regions: A Graph Auto-Encoder Approach
Confirmed Presenter: Mahta Mehdiabadi, University of Padova, Italy

Room: 520b
Format: In Person

Moderator(s): Iddo Friedberg


Authors List:

  • Mahta Mehdiabadi, University of Padova, Italy
  • Damiano Piovesan, University of Padova, Italy
  • Silvio Tosatto, University of Padova, Italy

Presentation Overview:

Intrinsically disordered proteins/regions (IDPs/IDRs) lack a well-defined three-dimensional structure yet carry out essential biological functions. Due to their highly dynamic nature and poor sequence conservation, conventional homology and structure-based methods cannot determine their functions. Here, we develop a Graph Auto-Encoder (GAE) model that exploits the information available to all forms of proteins to predict Gene Ontology (GO) functions for the entire protein and individual IDRs.
Our model is capable of encoding the proteins' structural units, such as domains and disordered regions, which are then used to assign functions not only to the entire protein but also to its specific regions. This allows us to map the proteins into a latent space, find similar protein embeddings, and transfer the functions among them. Due to training the model in an unsupervised end-to-end manner independent of GO labels, the model is not affected by the incomplete annotation of data and can be trained on large datasets, which boosts the encoding power of the GAE.
The model's predictive performance was assessed on an independent test set from the DisProt database and shows significant improvements compared to the standard approaches. The model achieves high throughput, processing hundreds of protein sequences per second. Its highly scalable nature enables integration into production environments like the MobiDB database, as well as the functional categorization and clustering of the more than 18 million disordered regions available in the database.
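
The step of finding similar embeddings in a latent space and transferring functions among them can be illustrated with a nearest-neighbour lookup. This is a minimal sketch under stated assumptions, not the model described in the abstract: the embeddings are toy two-dimensional vectors rather than graph auto-encoder outputs, similarity is plain cosine similarity, and each transferred term is scored by the highest similarity among the neighbours that carry it. All names and values are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def transfer_terms(query_vec, annotated, k=2):
    """Transfer GO terms from the k most similar annotated embeddings.

    annotated: list of (embedding, set_of_go_terms)
    Returns {go_term: best similarity among the k neighbours carrying it}.
    """
    ranked = sorted(annotated, key=lambda item: cosine(query_vec, item[0]),
                    reverse=True)[:k]
    scores = {}
    for embedding, terms in ranked:
        similarity = cosine(query_vec, embedding)
        for term in terms:
            scores[term] = max(scores.get(term, 0.0), similarity)
    return scores

# Toy embeddings with placeholder term identifiers.
annotated = [
    ([1.0, 0.0], {"GO:A"}),
    ([0.0, 1.0], {"GO:B"}),
    ([0.9, 0.1], {"GO:A", "GO:C"}),
]
print(transfer_terms([1.0, 0.0], annotated, k=2))
```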

12:00-12:20
Mapping the affinity of protein-protein interactions with multiple amino acid mutations using deep neural networks
Confirmed Presenter: Yaron Orenstein, Bar-Ilan University, Israel

Room: 520b
Format: In Person

Moderator(s): Iddo Friedberg


Authors List:

  • Reut Moshe, Bar-Ilan University, Israel
  • Shay-Lee Aharoni, Weizmann Institute of Science, Israel
  • Niv Papo, Ben-Gurion University, Israel
  • Yaron Orenstein, Bar-Ilan University, Israel

Presentation Overview:

Protein-protein interactions (PPIs) play vital roles in diverse biological processes. Hence, measuring PPIs is critical for decoding the evolution of proteins, and for developing powerful interactions for drugs. To date, studies have focused mainly on a narrow range of affinities and on single mutations in the amino acid sequence of a given protein to develop high-affinity PPIs due to limitations in the experimental techniques. Our study introduces a novel approach to comprehensively map PPIs and identify multiple (affinity-enhancing or affinity-reducing) mutations by applying machine-learning methods to next-generation sequencing selection data. We present a novel method to accurately predict the impact of multiple interacting mutations that were not observed in the experimental data. We applied our method to the N-TIMP2/MMP9 protein complex as a case study due to its unique interface, which consists of seven positions in N-TIMP2 crucial for binding. We developed a neural network to accurately and quantitatively predict the impact of multiple potentially interacting mutations on binding affinity. In cross-validation (training on 90% and testing on a held-out 10%), our neural network achieved a Pearson correlation of 0.963 between predicted and observed enrichment ratios. In addition, on an independent dataset of 26 experimentally validated variants, the Pearson correlation between their affinity constants and predicted enrichment ratios was 0.545. Currently, we are testing the affinity of five novel multiple-mutation variants that we predicted as high-affinity variants. Generally, our innovative approach can be applied to many more protein-function datasets to provide a rich characterization of a PPI affinity landscape.
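
The Pearson correlation used above to compare predicted and observed enrichment ratios can be computed directly from its definition. The following is a minimal sketch with hypothetical values; it says nothing about the authors' network or data, and it assumes both inputs have the same length and non-zero variance.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)

# Hypothetical predicted vs. observed enrichment ratios for held-out variants.
predicted = [0.8, 1.5, 2.1, 0.3, 1.1]
observed = [0.9, 1.4, 2.3, 0.2, 1.0]
print(pearson(predicted, observed))
```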

14:20-14:40
Utilising Large Language Models for GO Term Extraction in UniProt Annotation
Confirmed Presenter: Vishal Joshi, EMBL-EBI, United Kingdom

Room: 520b
Format: In Person

Moderator(s): Dukka KC


Authors List:

  • Vishal Joshi, EMBL-EBI, United Kingdom
  • Maria J Martin, EMBL-EBI, United Kingdom

Presentation Overview:

Automatic Annotation (AA) objective
Manually reviewed records (UniProtKB/Swiss-Prot) constitute only about 0.23% of UniProtKB; expert curation is time-intensive, and most published experimental data focuses on a rather limited range of model organisms. At the same time, the number of unreviewed records is growing continuously, yet for a large proportion of these records no experimental data is available. UniProtKB uses three prediction systems (UniRule, the Association-Rule-Based Annotator (ARBA), and Google’s ProtNLM) to functionally annotate around 88% of unreviewed (UniProtKB/TrEMBL) records automatically, a process we define as Automatic Annotation.

Large language models for GO term extraction from literature and literature summarisation

We have prototyped a new pipeline that employs the GPT-4 model (version gpt-4-1106-preview) to extract GO (Gene Ontology) terms from scientific literature accessible through PubMed. The extracted GO terms are validated against the GO Annotation database (GOA) at EMBL-EBI in order to exclude those that conflict with taxonomic constraints, are obsolete, or have been blacklisted. We have evaluated our GO term extraction prompts with GPT-4 (version gpt-4-1106-preview) and with open-source quantized models such as Mixtral-8x7B-Instruct-v0.1 and Mistral-7B-Instruct-v0.2, and we will share the results. Once evaluation against a structured annotation type like GO is complete, we plan to extend the approach to other annotation types such as keywords and EC numbers.

These models will not only assist in manual curation but also act as collaborative tools in creating UniProt entry-style summaries for proteins/genes from relevant literature. The performance of these models will be assessed for their potential contributions to augmenting existing manually curated entries.
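
The validation step described above (dropping extracted GO terms that are obsolete, blacklisted, or in conflict with taxonomic constraints) might look roughly like the sketch below. Everything here is an assumption for illustration: the `GoTerm` record, the in-memory `ontology` dictionary, and the flat `allowed_taxa` set are hypothetical stand-ins, not the GOA interface, and real taxon constraints are lineage-based rather than a flat set. The identifiers and constraints in the example are toy data.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class GoTerm:
    """Hypothetical local record for one GO term."""
    go_id: str
    obsolete: bool = False
    # None means "no taxon restriction" in this simplified model.
    allowed_taxa: Optional[FrozenSet[int]] = None


def filter_go_terms(extracted_ids, ontology, blacklist, taxon_id):
    """Keep only extracted GO IDs that pass all three checks."""
    kept = []
    for go_id in extracted_ids:
        term = ontology.get(go_id)
        if term is None or term.obsolete:
            continue  # unknown or obsolete term
        if go_id in blacklist:
            continue  # explicitly blacklisted
        if term.allowed_taxa is not None and taxon_id not in term.allowed_taxa:
            continue  # conflicts with the (simplified) taxon constraint
        kept.append(go_id)
    return kept


if __name__ == "__main__":
    # Toy ontology; the taxon restriction and the last two IDs are invented.
    ontology = {
        "GO:0005524": GoTerm("GO:0005524"),
        "GO:0009507": GoTerm("GO:0009507", allowed_taxa=frozenset({3702})),
        "GO:9999998": GoTerm("GO:9999998", obsolete=True),
        "GO:9999999": GoTerm("GO:9999999"),
    }
    extracted = ["GO:0005524", "GO:0009507", "GO:9999998", "GO:9999999"]
    print(filter_go_terms(extracted, ontology, {"GO:9999999"}, taxon_id=9606))
```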

14:40-15:00
ProstGOPred: Advancing Protein Function Prediction through Graph Contrastive Learning and Structure-Aware Protein Language Model Embeddings
Confirmed Presenter: Weining Lin, University College London, United Kingdom

Room: 520b
Format: Live Stream

Moderator(s): Dukka KC


Authors List: Show

  • Weining Lin, University College London, United Kingdom
  • David Miller, University College London, United Kingdom

Presentation Overview: Show

We introduce ProstGOPred, a state-of-the-art protein function prediction model that integrates both protein sequence and structural information with embeddings from the ProstT5 protein language model. By combining embeddings from ProstT5 and network information from the STRING database, our approach employs a graph contrastive learning strategy to optimise the model's ability to recognise functional similarities among proteins. This contrastive learning strategy optimises model performance by minimising the distance between an anchor and a positive sample while maximising the distance between the anchor and a negative sample, serving as a regularisation term in conjunction with supervised learning.

ProstGOPred was benchmarked on the CAFA3 dataset and achieved state-of-the-art performance, with f1max scores of BP: 0.534, MF: 0.561, CC: 0.64 on the three sub-ontologies, surpassing the traditional BLAST method (0.26, 0.42, 0.45 respectively), DomainPFP (0.38, 0.56, 0.63 respectively), and DeepGOPlus (0.469, 0.544, 0.623 respectively). By leveraging embeddings from ProstT5, a structure-aware protein language model, together with protein network information, our model requires no additional structural information or multiple sequence alignment (MSA) data to predict protein functions efficiently.

Our research demonstrates the immense potential of utilising protein language models and graph neural networks for predicting protein functions. In the next stage, we will include evolutionary data based on CATH-FunFams which are being regenerated using domain assignments based on AlphaFold structures and which will exploit structure embeddings to detect functional similarity.
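
The contrastive objective described in this abstract (pull an anchor towards a positive sample, push it away from a negative one, and use the result as a regularisation term next to the supervised loss) can be sketched with a standard triplet margin loss. The abstract does not state the exact loss, distance, margin, or weighting used, so all of those are assumptions in the example below.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet margin loss: zero once the negative is at least `margin`
    further from the anchor than the positive is."""
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


def combined_loss(supervised, contrastive, weight=0.1):
    """Supervised loss plus the contrastive term as a weighted regulariser.
    The weight is a hypothetical hyperparameter."""
    return supervised + weight * contrastive


if __name__ == "__main__":
    anchor, positive, negative = [0.0, 0.0], [0.1, 0.0], [3.0, 4.0]
    # The positive is close and the negative is far, so the loss is zero.
    print(triplet_loss(anchor, positive, negative))
```

A training loop would compute `triplet_loss` on embeddings of functionally similar and dissimilar proteins and minimise `combined_loss` over batches.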

15:00-15:20
Evaluation of large language models for discovery of gene set function
Confirmed Presenter: Mengzhou Hu, University of California, San Diego, United States

Room: 520b
Format: In Person

Moderator(s): Dukka KC


Authors List: Show

  • Mengzhou Hu, University of California, San Diego, United States
  • Sahar Alkhairy, University of California, San Diego, United States
  • Ingoo Lee, University of California, San Diego, United States
  • Rudolf Pillich, University of California, San Diego, United States
  • Dylan Fong, University of California, San Diego, United States
  • Kevin Smith, University of California, San Diego, United States
  • Robin Bachelder, University of California, San Diego, United States
  • Trey Ideker, University of California, San Diego, United States
  • Dexter Pratt, University of California, San Diego, United States

Presentation Overview: Show

Gene set analysis is a mainstay of functional genomics, but it relies on curated databases of gene functions that are incomplete. Here we evaluate five Large Language Models (LLMs) for their ability to discover the common biological functions represented by a gene set, substantiated by supporting rationale, citations and a confidence assessment. Benchmarking against canonical gene sets from the Gene Ontology, GPT-4 confidently recovered the curated name or a more general concept (73% of cases), while benchmarking against random gene sets correctly yielded zero confidence. Gemini-Pro and Mixtral-Instruct showed ability in naming but were falsely confident for random sets, whereas Llama2-70b had poor performance overall. In gene sets derived from ‘omics data, GPT-4 identified novel functions not reported by classical functional enrichment (32% of cases), which independent review indicated were largely verifiable and not hallucinations. The ability to rapidly synthesize common gene functions positions LLMs as valuable ‘omics assistants.

15:20-15:40
Transformer based data mining for predicting moonlighting in proteins and comparison with first principle annotation
Confirmed Presenter: Dana Varghese, Jawaharlal Nehru University, India

Room: 520b
Format: Live Stream

Moderator(s): Dukka KC


Authors List: Show

  • Dana Varghese, Jawaharlal Nehru University, India
  • Shandar Ahmad, Jawaharlal Nehru University, India

Presentation Overview: Show

Moonlighting proteins are a specific group of multifunctional proteins that independently carry out distinct functions at various time points, under different conditions, in conjunction with different partners, or in different locations without distinct domains associated with each function. The discovery of moonlighting behavior in proteins across various functional classes and organisms, from unicellular to multicellular eukaryotes, suggests that this evolutionary adaptation to multitask is widespread, despite not following conservation patterns within or between closely related species. Protein functional annotation has become more challenging due to this intriguing phenomenon. Several attempts have been made to predict moonlighting proteins from pre-existing annotations, sequence or structural features and computed features with varying degrees of accuracy (Khan et al. 2016; 2017). We have recently developed a method to specifically predict moonlighting in DNA-binding proteins and showed that within this class, a first principle approach worked well (Varghese et al. 2022). In this work we present a scaled up version of our prediction method to consider all human moonlighting proteins. Separately we developed independent models to identify moonlighting proteins using Natural Language Processing (NLP) methods from published literature and achieved superior results compared to the existing model by leveraging transformer-based models that were pre-trained on PubMed data. Together, the NLP and first principle methods provide the best performance to mine existing and predicted candidate moonlighting proteins with high confidence.

15:40-16:00
Gene families of unknown function conserved across Fungi
Confirmed Presenter: Asaf Salamov, DOE Joint Genome Institute, United States

Room: 520b
Format: In Person

Moderator(s): Dukka KC


Authors List: Show

  • Asaf Salamov, DOE Joint Genome Institute, United States
  • Igor Shabalov, DOE Joint Genome Institute, United States
  • Igor Grigoriev, DOE Joint Genome Institute, United States

Presentation Overview: Show

We constructed conserved gene families from over 2000 fungal genomes using the MMseqs2 clustering algorithm, yielding 339 gene clusters encompassing ~162K proteins. Our criteria were: a) conservation in at least 50% of all fungal species or over 90% of specific fungal clades, with representation in at least 100 species; b) lack of known function, indicated by absence of Pfam domains or meaningful functional annotations via EggNOG mapper; c) support from transcriptomics data for at least 20% of genes within each cluster.
Around half of these gene families (48%) are unique to the Fungal kingdom. Surprisingly, the second-largest group comprises 104 families (31% of total), shared solely between Fungi and the taxonomically distant Viridiplantae clade, underscoring the substantial number of conserved uncharacterized gene families between plants and fungi.
To infer potential functions, we employed Foldseek to align AlphaFold2-predicted structures against the PDB database, resulting in PDB hits (excluding hypothetical/uncharacterized proteins) for 14 gene clusters with TM-score > 70. Additionally, we used the deep-learning algorithms DeepFRI and ProteInfer for function prediction, revealing that only around 3% of 'Molecular Function' GO terms with high information content were shared between these methods. Furthermore, over 90% of the differing GO terms assigned by the two methods to the same proteins showed low semantic similarity (<25%) according to the GOGO software.
This resource is intended for the fungal genomics community to characterize individual family members functionally and extend their annotations across the Fungal Tree of Life. The portal for conserved fungal families of unknown function is accessible at:
https://mycocosm.jgi.doe.gov/conserved-clusters/run/run-2024;cyS6e
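
The three selection criteria listed in this abstract can be read as a simple filter over clusters. The sketch below is one possible reading, not the authors' pipeline: the `Cluster` fields are hypothetical, and it assumes the 100-species minimum applies to both conservation routes.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    """Hypothetical per-cluster summary statistics."""
    n_species: int                # species with at least one member
    total_species: int            # all fungal species considered
    best_clade_fraction: float    # highest fraction covered in any one clade
    has_known_function: bool      # Pfam domain or meaningful annotation found
    n_genes: int
    n_genes_with_expression: int  # genes with transcriptomics support


def is_conserved_unknown(c):
    """Apply criteria a), b), and c) from the abstract to one cluster."""
    broadly_conserved = c.n_species / c.total_species >= 0.5
    clade_conserved = c.best_clade_fraction > 0.9
    criterion_a = (broadly_conserved or clade_conserved) and c.n_species >= 100
    criterion_b = not c.has_known_function
    criterion_c = c.n_genes > 0 and c.n_genes_with_expression / c.n_genes >= 0.2
    return criterion_a and criterion_b and criterion_c


if __name__ == "__main__":
    # Toy cluster, invented for illustration.
    example = Cluster(n_species=1200, total_species=2000,
                      best_clade_fraction=0.8, has_known_function=False,
                      n_genes=1500, n_genes_with_expression=400)
    print(is_conserved_unknown(example))
```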

16:40-17:00
Improved prediction of DNA and RNA binding proteins with deep learning models
Confirmed Presenter: Jun-Tao Guo, University of North Carolina at Charlotte, United States

Room: 520b
Format: In Person

Moderator(s): Ana Rojas


Authors List: Show

  • Siwen Wu, University of North Carolina at Charlotte, United States
  • Jun-Tao Guo, University of North Carolina at Charlotte, United States

Presentation Overview: Show

Nucleic acid-binding proteins (NABPs), including DNA-binding proteins (DBPs) and RNA-binding proteins (RBPs), play important roles in essential biological processes. To facilitate functional annotation and accurate prediction of different types of NABPs, many machine learning-based computational approaches have been developed. However, the datasets used for training and testing, as well as the prediction scopes in these studies, have limited their application to real-world problems. In this paper, we developed new strategies to overcome these limitations by generating more accurate and robust datasets and developing deep learning methods, including both hierarchical and multi-class approaches, to predict the type of NABP for any given protein. The deep learning models employ two convolutional neural network (CNN) layers and one long short-term memory (LSTM) layer. Our approaches outperform the existing DBP and RBP predictors and give balanced predictions between DBPs and RBPs. The multi-class approach greatly improves the prediction accuracy for DBPs and RBPs, especially for DBPs, with about a 12% performance improvement. Moreover, we explored the prediction accuracy for single-stranded DNA binding proteins (SSBs) and their effect on the overall accuracy of NABP predictions.

17:00-17:20
Analysing multifunctional proteins with MultifacetedProtDB
Confirmed Presenter: Giulia Babbi, Biocomputing Group Bologna, Italy

Room: 520b
Format: In Person

Moderator(s): Ana Rojas


Authors List: Show

  • Giulia Babbi, Biocomputing Group Bologna, Italy
  • Elisa Bertolini, Biocomputing Group Bologna, University of Bologna, Italy
  • Castrense Savojardo, University of Bologna, Italy
  • Pier Luigi Martelli, University of Bologna, Italy
  • Rita Casadio, University of Bologna, Italy

Presentation Overview: Show

We recently proposed MultifacetedProtDB (https://multifacetedprotdb.biocomp.unibo.it), a curated database providing a collection of 1103 multifunctional human proteins, of which 812 are enzymes. The characterization of multifunctional proteins is an expanding research area aiming to elucidate the complexities of biological processes. In our resource, we merge information from UniProt, Humsavar, Monarch, and ClinVar, reporting disease nomenclatures from the MONDO, ICD10, OMIM and Orphanet catalogues. About 39% of the multifunctional proteins in our database (321 enzymes and 110 non-enzymes) are associated with 895 MONDO diseases, classified into 213 ICD10 categories and falling in 17 of the 19 ICD10 main chapters. Of the 895 diseases, 323 are included in the Orphanet catalogue of rare diseases. Of the 431 multifaceted proteins with MONDO disease annotation, 212 are associated with multiple diseases, and 56% are also associated with multiple Reactome pathways. Performing different functions in different pathways could explain why a protein is associated with different diseases. Using the “ADVANCED SEARCH” interface in MultifacetedProtDB, it is possible to search for multifunctional proteins associated with MONDO diseases and endowed with Pfam annotations, which returns a list of 428 entries. Specifically, the Protein kinase domain (PF00069), the Cytochrome P450 domain (PF00067), the Connexin domain (PF00029) and the β/γ crystallin domain (PF00030) are the domains most frequently associated with diseases in multifunctional proteins. Presently, data in our database indicate that multifunctionality is not exclusively related to multiple subcellular locations and/or association with diseases, and that some multifunctional proteins with specific domains seem more involved than others in diseases.

17:20-17:40
Dynamic network analysis of multi-scale -omics data for protein function prediction
Confirmed Presenter: Siyu Yang, University of Notre Dame, United States

Room: 520b
Format: In Person

Moderator(s): Ana Rojas


Authors List: Show

  • Siyu Yang, University of Notre Dame, United States
  • Tijana Milenkovic, University of Notre Dame, United States

Presentation Overview: Show

Protein function prediction is a prominent computational task. One common approach is analyzing proteins' 3D structures by modeling them as protein structure networks (PSNs). Another common approach is analyzing a protein-protein interaction (PPI) network. To leverage both data types, our lab recently integrated PSN and PPI network data into a multi-scale "network-of-networks" (NoN), where each node (protein) in the PPI network at the higher scale is a network (PSN) itself at the lower scale. Protein function prediction via NoN-based data integration often outperformed or complemented single-scale PSN-only or PPI network-only prediction. However, until recently, existing PSN approaches represented the final (native) 3D structure of a protein as a static PSN. Recognizing that protein folding is a dynamic process, our lab recently introduced dynamic PSNs, which, when used in a single-scale fashion, achieved a promising improvement over static PSNs in the task of protein structure classification. In our recent work to be presented at ISMB 2024, we evaluate the performance of dynamic vs. static PSNs in the protein function prediction task. First, we do so only on the PSN scale, without considering the PPI scale (and thus outside of the NoN framework). This is where dynamic PSNs show the most promise: for protein functions (Gene Ontology biological process terms) where one of the two PSN approaches (dynamic vs. static) is significantly superior to the other, it is almost exclusively dynamic PSNs that are superior. Second, we compare dynamic vs. static PSNs within the NoN framework. Here, the two approaches are mostly quantitatively comparable yet qualitatively complementary, i.e., each of dynamic and static PSNs is superior to the other for some of the GO terms. We are currently analyzing potential biological implications of these results.

17:40-18:00
Enhanced Functional Annotation for Genome-Scale Metabolic Models Using an Omics-Informed Integrated Pipeline
Confirmed Presenter: Jason Mcdermott, Pacific Northwest National Laboratory (US Dept of Energy), United States

Room: 520b
Format: In Person

Moderator(s): Ana Rojas


Authors List: Show

  • Jason Mcdermott, Pacific Northwest National Laboratory (US Dept of Energy), United States
  • David Geller-McGrath, MIT/Woods Hole Oceanographic Institute, United States
  • William Nelson, Pacific Northwest National Laboratory, United States
  • Jeremy Jacobson, Pacific Northwest National Laboratory, United States
  • Christine Chang, Pacific Northwest National Laboratory, United States
  • Tara Nitka, Pacific Northwest National Laboratory, United States
  • Aimee Kessell, University of Nebraska - Lincoln, United States
  • Ryan McClure, Pacific Northwest National Laboratory, United States
  • Robert Egbert, Pacific Northwest National Laboratory, United States
  • Chris Henry, Argonne National Laboratory, United States
  • Janaka Edirisinghe, Argonne National Laboratory, United States
  • Hyun-Seob Song, University of Nebraska Lincoln, United States
  • Travis Wheeler, University of Arizona, United States
  • Kirsten Hofmockel, Pacific Northwest National Laboratory, United States

Presentation Overview: Show

There are inherent challenges in characterizing the metabolic potential of complex microbiomes, particularly those derived from incomplete data like metagenome-assembled genomes (MAGs). Current annotation methods often miss metabolic enzymes due to incomplete sequencing data or significant evolutionary divergence. To address these issues, the authors present a pipeline integrating three tools: MetaPathPredict, a deep learning framework for predicting metabolic modules; OMics-Enabled Global GApfilling (OMEGGA), which performs global gap-filling from experimental growth data and integrates multi-omics data; and Snekmer, a computational framework for building sequence-based models for protein families. The pipeline begins with a traditionally annotated genome, uses MetaPathPredict to predict metabolic modules, and then uses OMEGGA to perform gap filling. Snekmer is then used to identify candidates for missing enzymes. The authors demonstrate the application of this pipeline to a model soil consortium, showing how it improves annotation for genome-scale metabolic model construction and refines models for accurate prediction of growth under novel conditions. The results suggest that this integrated, data-driven approach can improve functional annotation for bacterial metabolic models and propose novel functional annotations for important metabolic enzymes in environmental bacteria.