July 14, 2024
10:40-11:20
Invited Presentation: Fast, high-performance biophysics-based computational methods in function prediction
Confirmed Presenter: Rafael Najmanovich
Track: Function

Room: 520b
Format: In Person
Moderator(s): Iddo Friedberg


Authors List:

  • Rafael Najmanovich

Presentation Overview:

In this presentation I will discuss a number of computational methods that rely on a simple approximation in which molecular interactions are taken to be proportional to the surface area in contact between the constituent atoms, modulated by a pairwise atom-type pseudo-energetic term. I will discuss three methods: Surfaces, for the quantification of molecular interactions; NRGTEN, for normal-mode analysis; and NRGDock, for ultra-massive virtual screening. Their simplified nature makes them fast while still being accurate, allowing them to be used in a high-throughput manner for molecular function prediction.
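
As a rough illustration of the approximation described in this overview, the sketch below scores an interaction as the sum, over atom pairs in contact, of the contact surface area multiplied by a pairwise atom-type term. The atom types, the values in `PAIR_TERMS` and the contact areas are invented for illustration; they are not the parameters or code of Surfaces, NRGTEN or NRGDock.

```python
# Hypothetical pairwise atom-type pseudo-energy terms (arbitrary units per square angstrom).
# Negative values are favourable. These numbers are made up for illustration only.
PAIR_TERMS = {
    frozenset(["hydrophobic"]): -1.0,           # hydrophobic-hydrophobic
    frozenset(["hydrophobic", "polar"]): 0.5,   # hydrophobic-polar
    frozenset(["polar"]): -0.3,                 # polar-polar
}


def interaction_score(contacts):
    """Sum contact area times the pairwise term over (type_a, type_b, area) contacts."""
    total = 0.0
    for type_a, type_b, area in contacts:
        # frozenset makes the lookup independent of the order of the two atom types.
        total += area * PAIR_TERMS[frozenset([type_a, type_b])]
    return total


# Toy list of atom-atom contacts: (atom type, atom type, contact area).
contacts = [
    ("hydrophobic", "hydrophobic", 10.0),
    ("hydrophobic", "polar", 4.0),
    ("polar", "polar", 5.0),
]
score = interaction_score(contacts)
```

Because the score is a single weighted sum over contacts, its cost grows linearly with the number of contacts, which is consistent with the speed the overview attributes to this family of methods.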

July 14, 2024
11:20-11:40
Energetic Local Frustration Through Time and Species
Confirmed Presenter: R. Gonzalo Parra, Barcelona Supercomputing Center, Spain
Track: Function

Room: 520b
Format: In Person
Moderator(s): Iddo Friedberg


Authors List:

  • R. Gonzalo Parra, Barcelona Supercomputing Center
  • Maria Freiberger, Buenos Aires University
  • Miriam Poley-Gil, Barcelona Supercomputing Center
  • Miguel Fernandez-Martin, Barcelona Supercomputing Center
  • Marko Ludaic, Barcelona Supercomputing Center
  • Victoria Ruiz-Serra, Barcelona Supercomputing Center
  • Leandro G. Radusky, Manas.tech
  • Peter G. Wolynes, Rice University
  • Diego U. Ferreiro, Buenos Aires University
  • Alfonso Valencia, Barcelona Supercomputing Center

Presentation Overview:

According to the Principle of Minimal Frustration, folded proteins minimize the amount of strong energetic conflicts in their native states. However, not all interactions are energetically optimized for folding but some remain in energetic conflict, i.e. they are highly frustrated. This remaining local energetic frustration has been shown to be statistically correlated with distinct functional aspects such as protein-protein interaction sites, allosterism and catalysis. Fuelled by the recent breakthroughs in efficient protein structure prediction that have made available good quality models for most proteins, we have developed a strategy to calculate local energetic frustration within large protein families and quantify its conservation over evolutionary time. Based on this evolutionary information we can identify how stability and functional constraints have appeared at the common ancestor of the family and have been maintained over the course of evolution.

I will summarize the results of two of our recent publications, plus some unpublished results, in which we show how local frustration in proteins, and its conservation in extant protein families and protein ensembles, can be used to shed light on the biophysical relationships between sequences, structures, dynamics and functions. Moreover, our local-frustration-based strategies can be exploited to better understand the protein features that are captured and output by novel Machine Learning methods, and to help guide protein design strategies.

July 14, 2024
11:40-12:00
Function Prediction of Intrinsically Disordered Proteins and Regions: A Graph Auto-Encoder Approach
Confirmed Presenter: Mahta Mehdiabadi, University of Padova, Italy
Track: Function

Room: 520b
Format: In Person
Moderator(s): Iddo Friedberg


Authors List:

  • Mahta Mehdiabadi, University of Padova
  • Damiano Piovesan, University of Padova
  • Silvio Tosatto, University of Padova

Presentation Overview:

Intrinsically disordered proteins/regions (IDPs/IDRs) lack a well-defined three-dimensional structure yet carry out essential biological functions. Due to their highly dynamic nature and poor sequence conservation, conventional homology and structure-based methods cannot determine their functions. Here, we develop a Graph Auto-Encoder (GAE) model that exploits the information available to all forms of proteins to predict Gene Ontology (GO) functions for the entire protein and individual IDRs.
Our model is capable of encoding the proteins' structural units, such as domains and disordered regions, which are then used to assign functions not only to the entire protein but also to its specific regions. This allows us to map the proteins into a latent space, find similar protein embeddings, and transfer the functions among them. Due to training the model in an unsupervised end-to-end manner independent of GO labels, the model is not affected by the incomplete annotation of data and can be trained on large datasets, which boosts the encoding power of the GAE.
The model's predictive performance was assessed on an independent test set from the DisProt database and shows significant improvements over standard approaches. The model achieves high throughput, processing hundreds of protein sequences per second. Its highly scalable nature enables integration into production environments such as the MobiDB database, where it can functionally categorize and cluster the more than 18 million disordered regions available in the database.

July 14, 2024
12:00-12:20
Mapping the affinity of protein-protein interactions with multiple amino acid mutations using deep neural networks
Confirmed Presenter: Yaron Orenstein, Bar-Ilan University, Israel
Track: Function

Room: 520b
Format: In Person
Moderator(s): Iddo Friedberg


Authors List:

  • Reut Moshe, Bar-Ilan University
  • Shay-Lee Aharoni, Weizmann Institute of Science
  • Niv Papo, Ben-Gurion University
  • Yaron Orenstein, Bar-Ilan University

Presentation Overview:

Protein-protein interactions (PPIs) play vital roles in diverse biological processes. Hence, measuring PPIs is critical for decoding the evolution of proteins, and for developing powerful interactions for drugs. To date, because of limitations in the experimental techniques, studies aiming to develop high-affinity PPIs have focused mainly on a narrow range of affinities and on single mutations in the amino acid sequence of a given protein. Our study introduces a novel approach to comprehensively map PPIs and identify multiple (affinity-enhancing or affinity-reducing) mutations by applying machine-learning methods to next-generation sequencing selection data. We present a novel method to accurately predict the impact of multiple interacting mutations that were not observed in the experimental data. We applied our method to the N-TIMP2/MMP9 protein complex as a case study due to its unique interface, which consists of seven positions in N-TIMP2 crucial for binding. We developed a neural network to accurately and quantitatively predict the impact of multiple potentially interacting mutations on binding affinity. In cross-validation (training on 90% and testing on a held-out 10%), our neural network achieved a Pearson correlation of 0.963 between predicted and observed enrichment ratios. In addition, on an independent dataset of 26 experimentally validated variants, the Pearson correlation between their affinity constants and predicted enrichment ratios was 0.545. Currently, we are testing the affinity of five novel multiple-mutation variants that we predicted to be high-affinity variants. More generally, our approach can be applied to many more protein-function datasets to provide a rich characterization of a PPI affinity landscape.
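
For readers unfamiliar with the metric reported in this overview, the sketch below computes a Pearson correlation between predicted and observed values using only the standard library. The `predicted` and `observed` lists are toy numbers, not data from the study, and the function is a generic textbook implementation rather than the authors' evaluation code.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances, left unnormalised because the factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy enrichment ratios (illustrative values only).
predicted = [0.1, 0.4, 0.35, 0.8]
observed = [0.2, 0.5, 0.3, 0.9]
r = pearson(predicted, observed)
```

In a 90%/10% cross-validation setting such as the one described, a function like this would be applied to the held-out 10% in each fold.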

July 14, 2024
14:20-14:40
Utilising Large Language Models for GO Term Extraction in UniProt Annotation
Confirmed Presenter: Vishal Joshi, EMBL-EBI, United Kingdom
Track: Function

Room: 520b
Format: In Person
Moderator(s): Dukka KC


Authors List:

  • Vishal Joshi, EMBL-EBI
  • Maria J Martin, EMBL-EBI

Presentation Overview:

Automatic Annotation (AA) objective

Manually reviewed records (UniProtKB/SwissProt) constitute only about 0.23% of UniProtKB; expert curation is time-intensive and most published experimental data focuses on a rather limited range of model organisms. Simultaneously, the number of unreviewed records is growing continuously, yet for a large proportion of these records there is no experimental data available. UniProtKB uses three prediction systems, UniRule, the Association-Rule-Based Annotator (ARBA) and Google’s ProtNLM, to functionally annotate around 88% of unreviewed (UniProtKB/TrEMBL) records automatically, which we define as Automatic Annotation.

Large Language models for GO term extraction from literature & literature summarisation

We have prototyped a new pipeline that employs the GPT-4 model (version gpt-4-1106-preview) to extract GO (Gene Ontology) terms from scientific literature accessible through PubMed. The extracted GO terms are validated against the GO Annotation database (GOA) at EMBL-EBI, in order to exclude those that conflict with taxonomic constraints, are obsolete, or have been blacklisted. We have evaluated our GO term extraction prompts with GPT-4 (version gpt-4-1106-preview) and with open-source quantized models such as Mixtral-8x7B-Instruct-v0.1 and Mistral-7B-Instruct-v0.2, and will share the results. Once evaluation against a structured annotation type like GO is complete, we plan to potentially expand the approach to other annotation types such as keywords and EC numbers.

These models will not only assist in manual curation but also act as collaborative tools in creating UniProt entry-style summaries for proteins/genes from relevant literature. The performance of these models will be assessed for their potential contributions to augmenting existing manually curated entries.

July 14, 2024
14:40-15:00
ProstGOPred: Advancing Protein Function Prediction through Graph Contrastive Learning and Structure-Aware Protein Language Model Embeddings
Confirmed Presenter: Weining Lin, University College London, United Kingdom
Track: Function

Room: 520b
Format: Live Stream
Moderator(s): Dukka KC


Authors List:

  • Weining Lin, University College London
  • David Miller, University College London

Presentation Overview:

We introduce ProstGOPred, a state-of-the-art protein function prediction model that integrates both protein sequence and structural information with embeddings from the ProstT5 protein language model. By combining embeddings from ProstT5 and network information from the STRING database, our approach employs a graph contrastive learning strategy to optimise the model's ability to recognise functional similarities among proteins. This contrastive learning strategy optimises model performance by minimising the distance between an anchor and a positive sample while maximising the distance between the anchor and a negative sample, serving as a regularisation term in conjunction with supervised learning.
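
The anchor/positive/negative objective described in this overview can be illustrated with a margin-based triplet loss. The overview does not state the exact loss used by ProstGOPred, so the triplet form, the Euclidean distance and the `margin` value below are all assumptions, and the embeddings are toy 2-D vectors.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the negative is at least `margin` farther from the anchor than the positive."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Toy embeddings (illustrative only).
anchor = [0.0, 0.0]
positive = [0.0, 1.0]   # distance 1 from the anchor
negative = [3.0, 4.0]   # distance 5 from the anchor

loss_satisfied = triplet_loss(anchor, positive, negative)   # positive already closer
loss_violated = triplet_loss(anchor, negative, positive)    # roles swapped: penalised
```

When such a term acts as a regulariser, as the overview describes, it would typically be added to the supervised loss with a weighting factor; that weighting is not specified in the source.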

ProstGOPred was benchmarked on the CAFA3 dataset and achieved state-of-the-art performance on the three sub-ontologies, with f1max scores of 0.534 (BP), 0.561 (MF) and 0.64 (CC), surpassing the traditional BLAST method (0.26, 0.42 and 0.45, respectively), DomainPFP (0.38, 0.56 and 0.63, respectively), and DeepGOPlus (0.469, 0.544 and 0.623, respectively). By leveraging embeddings from ProstT5, a structure-aware protein language model, together with protein network information, our model requires no additional structural information or multiple sequence alignment (MSA) data to efficiently predict protein functions.
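
The f1max scores quoted above are understood here as the protein-centric maximum F-measure used in CAFA-style evaluations: the best F1 over a sweep of score thresholds, with precision averaged over proteins that have at least one prediction at the threshold and recall averaged over all benchmark proteins. The sketch below follows that definition as an assumption. It is not the official CAFA evaluation code, the term names are placeholders, and every protein in `truth` is assumed to have at least one true term.

```python
def fmax(predictions, truth, thresholds):
    """Maximum F1 over thresholds.

    predictions: {protein: {term: score}}
    truth:       {protein: set of true terms}, each set non-empty
    """
    best = 0.0
    for t in thresholds:
        precisions = []
        recall_sum = 0.0
        for protein, true_terms in truth.items():
            scores = predictions.get(protein, {})
            predicted = {term for term, s in scores.items() if s >= t}
            hits = len(predicted & true_terms)
            if predicted:
                # Precision is averaged only over proteins with a prediction at t.
                precisions.append(hits / len(predicted))
            # Recall is averaged over all benchmark proteins.
            recall_sum += hits / len(true_terms)
        if not precisions:
            continue
        pr = sum(precisions) / len(precisions)
        rc = recall_sum / len(truth)
        if pr + rc > 0:
            best = max(best, 2 * pr * rc / (pr + rc))
    return best


# Toy benchmark with placeholder term names (illustrative only).
truth = {"P1": {"GO:a", "GO:b"}, "P2": {"GO:c"}}
predictions = {
    "P1": {"GO:a": 0.9, "GO:b": 0.4, "GO:x": 0.2},
    "P2": {"GO:c": 0.6, "GO:y": 0.5},
}
thresholds = [i / 10 for i in range(1, 10)]
result = fmax(predictions, truth, thresholds)
```

In practice the metric is computed separately for each sub-ontology (BP, MF, CC), which is why three scores are reported per method.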

Our research demonstrates the immense potential of utilising protein language models and graph neural networks for predicting protein functions. In the next stage, we will include evolutionary data based on CATH-FunFams which are being regenerated using domain assignments based on AlphaFold structures and which will exploit structure embeddings to detect functional similarity.

July 14, 2024
15:00-15:20
Evaluation of large language models for discovery of gene set function
Confirmed Presenter: Mengzhou Hu, University of California, San Diego
Track: Function

Room: 520b
Format: In Person
Moderator(s): Dukka KC


Authors List:

  • Mengzhou Hu, University of California
  • Sahar Alkhairy, University of California
  • Ingoo Lee, University of California
  • Rudolf Pillich, University of California
  • Dylan Fong, University of California
  • Kevin Smith, University of California
  • Robin Bachelder, University of California
  • Trey Ideker, University of California
  • Dexter Pratt, University of California

Presentation Overview:

Gene set analysis is a mainstay of functional genomics, but it relies on curated databases of gene functions that are incomplete. Here we evaluate five Large Language Models (LLMs) for their ability to discover the common biological functions represented by a gene set, substantiated by supporting rationale, citations and a confidence assessment. Benchmarking against canonical gene sets from the Gene Ontology, GPT-4 confidently recovered the curated name or a more general concept (73% of cases), while benchmarking against random gene sets correctly yielded zero confidence. Gemini-Pro and Mixtral-Instruct showed ability in naming but were falsely confident for random sets, whereas Llama2-70b had poor performance overall. In gene sets derived from ‘omics data, GPT-4 identified novel functions not reported by classical functional enrichment (32% of cases), which independent review indicated were largely verifiable and not hallucinations. The ability to rapidly synthesize common gene functions positions LLMs as valuable ‘omics assistants.

July 14, 2024
15:20-15:40
Transformer based data mining for predicting moonlighting in proteins and comparison with first principle annotation
Confirmed Presenter: Dana Varghese, Jawaharlal Nehru University, India
Track: Function

Room: 520b
Format: Live Stream
Moderator(s): Dukka KC


Authors List:

  • Dana Varghese, Jawaharlal Nehru University
  • Shandar Ahmad, Jawaharlal Nehru University

Presentation Overview:

Moonlighting proteins are a specific group of multifunctional proteins that independently carry out distinct functions at various time points, under different conditions, in conjunction with different partners, or in different locations without distinct domains associated with each function. The discovery of moonlighting behavior in proteins across various functional classes and organisms, from unicellular to multicellular eukaryotes, suggests that this evolutionary adaptation to multitask is widespread, despite not following conservation patterns within or between closely related species. Protein functional annotation has become more challenging due to this intriguing phenomenon. Several attempts have been made to predict moonlighting proteins from pre-existing annotations, sequence or structural features and computed features with varying degrees of accuracy (Khan et al. 2016; 2017). We have recently developed a method to specifically predict moonlighting in DNA-binding proteins and showed that within this class, a first principle approach worked well (Varghese et al. 2022). In this work we present a scaled up version of our prediction method to consider all human moonlighting proteins. Separately we developed independent models to identify moonlighting proteins using Natural Language Processing (NLP) methods from published literature and achieved superior results compared to the existing model by leveraging transformer-based models that were pre-trained on PubMed data. Together, the NLP and first principle methods provide the best performance to mine existing and predicted candidate moonlighting proteins with high confidence.

July 14, 2024
15:40-16:00
Gene families of unknown function conserved across Fungi
Confirmed Presenter: Asaf Salamov, DOE Joint Genome Institute, United States
Track: Function

Room: 520b
Format: In Person
Moderator(s): Dukka KC


Authors List:

  • Asaf Salamov, DOE Joint Genome Institute
  • Igor Shabalov, DOE Joint Genome Institute
  • Igor Grigoriev, DOE Joint Genome Institute

Presentation Overview:

We constructed conserved gene families from over 2000 fungal genomes using the MMseqs2 clustering algorithm, obtaining 339 gene clusters encompassing ~162K proteins. Our criteria were: a) conservation in at least 50% of all fungal species or over 90% of specific fungal clades, with representation in at least 100 species; b) lack of known function, indicated by absence of Pfam domains or meaningful functional annotations via EggNOG mapper; c) support from transcriptomics data for at least 20% of genes within each cluster.
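
The three selection criteria can be read as a single filter applied to each candidate cluster. The sketch below is one possible reading, not the authors' pipeline: the `ClusterStats` fields are hypothetical names, and applying the 100-species minimum to both branches of criterion (a) is an assumption about an ambiguous sentence.

```python
from dataclasses import dataclass


@dataclass
class ClusterStats:
    """Hypothetical per-cluster summary; field names are illustrative."""
    frac_all_species: float            # fraction of all fungal species with a member
    max_frac_in_clade: float           # highest fraction reached within any single clade
    n_species: int                     # number of species represented
    has_known_function: bool           # Pfam domain or meaningful EggNOG annotation found
    frac_genes_with_expression: float  # fraction of genes with transcriptomics support


def is_conserved_unknown(c):
    """Apply criteria (a)-(c) from the overview to one cluster."""
    # (a) broad or clade-specific conservation, with at least 100 species represented.
    conserved = (c.frac_all_species >= 0.50 or c.max_frac_in_clade > 0.90) and c.n_species >= 100
    # (b) no known function; (c) expression support for at least 20% of genes.
    return conserved and not c.has_known_function and c.frac_genes_with_expression >= 0.20


# Toy clusters (values invented for illustration).
kept = ClusterStats(0.62, 0.95, 1300, False, 0.45)
too_few_species = ClusterStats(0.02, 0.97, 40, False, 0.50)
annotated = ClusterStats(0.80, 0.99, 1700, True, 0.90)
selected = [is_conserved_unknown(c) for c in (kept, too_few_species, annotated)]
```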
Around half of these gene families (48%) are unique to the Fungal kingdom. Surprisingly, the second-largest group comprises 104 families (31% of total), shared solely between Fungi and the taxonomically distant Viridiplantae clade, underscoring the substantial number of conserved uncharacterized gene families between plants and fungi.
To infer potential functions, we employed Foldseek, aligning AlphaFold2 predicted structures with the PDB database, resulting in PDB hits (excluding hypothetical/uncharacterized proteins) for 14 gene clusters with TMscore > 70. Additionally, we utilized the deep-learning algorithms DeepFRI and ProteInfer for function prediction, revealing that only around 3% of 'Molecular Function' GO terms with high information content were shared between these methods. Furthermore, over 90% of different GO terms assigned by both methods to the same proteins showed low semantic similarity (<25%) according to GOGO software.
This resource is intended for the fungal genomics community to characterize individual family members functionally and extend their annotations across the Fungal Tree of Life. The portal for conserved fungal families of unknown function is accessible at:
https://mycocosm.jgi.doe.gov/conserved-clusters/run/run-2024;cyS6e

July 14, 2024
16:40-17:00
Improved prediction of DNA and RNA binding proteins with deep learning models
Confirmed Presenter: Jun-Tao Guo, University of North Carolina at Charlotte, United States
Track: Function

Room: 520b
Format: In Person
Moderator(s): Ana Rojas


Authors List:

  • Siwen Wu, University of North Carolina at Charlotte
  • Jun-Tao Guo, University of North Carolina at Charlotte

Presentation Overview:

Nucleic acid-binding proteins (NABPs), including DNA-binding proteins (DBPs) and RNA-binding proteins (RBPs), play important roles in essential biological processes. To facilitate functional annotation and accurate prediction of different types of NABPs, many machine learning-based computational approaches have been developed. However, the datasets used for training and testing, as well as the prediction scopes of these studies, have limited their application to real-world problems. In this paper, we developed new strategies to overcome these limitations by generating more accurate and robust datasets and by developing deep learning methods, including both hierarchical and multi-class approaches, to predict the type of NABP for any given protein. The deep learning models employ two convolutional neural network (CNN) layers and one long short-term memory (LSTM) layer. Our approaches outperform the existing DBP and RBP predictors and give balanced predictions between DBPs and RBPs. The multi-class approach greatly improves the prediction accuracy for DBPs and RBPs, especially for DBPs, with about a 12% performance improvement. Moreover, we explored the prediction accuracy for single-stranded DNA binding proteins (SSBs) and their effect on the overall accuracy of NABP predictions.

July 14, 2024
17:00-17:20
Analysing multifunctional proteins with MultifacetedProtDB
Confirmed Presenter: Giulia Babbi, Biocomputing Group Bologna, Italy
Track: Function

Room: 520b
Format: In Person
Moderator(s): Ana Rojas


Authors List:

  • Giulia Babbi, Biocomputing Group Bologna
  • Elisa Bertolini, Biocomputing Group Bologna
  • Castrense Savojardo, University of Bologna
  • Pier Luigi Martelli, University of Bologna
  • Rita Casadio, University of Bologna

Presentation Overview:

We recently proposed MultifacetedProtDB (https://multifacetedprotdb.biocomp.unibo.it), a curated database providing a collection of 1103 multifunctional human proteins, of which 812 are enzymes. The characterization of multifunctional proteins is an expanding research area aiming to elucidate the complexities of biological processes. In our resource, we merge information from UniProt, Humsavar, Monarch, and ClinVar, reporting disease nomenclature from the MONDO, ICD10, OMIM and Orphanet catalogues. Some 30% of multifunctional proteins in our database (321 enzymes and 110 non-enzymes) are associated with 895 MONDO diseases classified into 213 ICD10 categories and in 17 out of the 19 ICD10 main chapters. Out of the 895 diseases, 323 are included in the Orphanet catalogue of rare diseases. Of the 431 multifaceted proteins with MONDO disease annotation, 212 are associated with multiple diseases, and 56% are also associated with multiple Reactome pathways. Performing different functions in different pathways could explain why a protein is associated with different diseases. Thanks to the “ADVANCED SEARCH” interface in MultifacetedProtDB, it is possible to search for multifunctional proteins associated with MONDO diseases and endowed with Pfam annotations, obtaining as a result a list of 428 entries. Specifically, the Protein kinase domain (PF00069), the Cytochrome P450 domain (PF00067), the Connexin domain (PF00029) and the β/γ crystallin domain (PF00030) are most frequently associated with diseases in multifunctional proteins. Presently, data in our DB indicate that multifunctionality is not exclusively related to multiple subcellular locations and/or association with diseases, and that some multifunctional proteins with specific domains seem more involved than others in diseases.

July 14, 2024
17:20-17:40
Dynamic network analysis of multi-scale -omics data for protein function prediction
Confirmed Presenter: Siyu Yang, University of Notre Dame, United States
Track: Function

Room: 520b
Format: In Person
Moderator(s): Ana Rojas


Authors List:

  • Siyu Yang, University of Notre Dame
  • Tijana Milenkovic, University of Notre Dame

Presentation Overview:

Protein function prediction is a prominent computational task. One common approach is analyzing proteins' 3D structures by modeling them as protein structure networks (PSNs). Another common approach is analyzing a protein-protein interaction (PPI) network. To leverage both data types, our lab recently integrated PSN and PPI network data into a multi-scale "network-of-networks" (NoN), where each node (protein) in the PPI network at the higher scale is a network (PSN) itself at the lower scale. Protein functional prediction via NoN-based data integration often outperformed or complemented single-scale PSN-only or PPI network-only prediction. However, until recently, existing PSN approaches represented the final (native) 3D structure of a protein as a static PSN. Recognizing that protein folding is a dynamic process, our lab recently introduced dynamic PSNs, which when used in single-scale fashion, achieved a promising improvement over static PSNs in the task of protein structure classification. In our recent work to be presented at ISMB 2024, we evaluate the performance of dynamic vs. static PSNs in the protein function prediction task. First, we do so only on the PSN scale, without considering the PPI scale (and thus outside of the NoN framework). Here is where using dynamic PSNs shows the most promise: when focusing on protein functions (Gene Ontology biological process terms) where one of the two PSN approaches (dynamic vs. static PSNs) is significantly superior to the other approach, almost exclusively it is dynamic PSNs that are superior to static PSNs. Second, we compare dynamic vs. static PSNs within the NoN framework. Here, the two approaches are mostly quantitatively comparable yet qualitatively complementary, i.e., each of dynamic and static PSNs is superior to the other approach for some of the GO terms. We are currently analyzing potential biological implications of these results.
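
The "network-of-networks" idea described in this overview, where each node of the PPI network is itself a protein structure network, can be pictured as a two-level data structure. The sketch below uses toy protein identifiers and residue contacts; the names `ppi_edges`, `psns` and `node_view` are hypothetical and do not come from the authors' software.

```python
# Higher scale: PPI network as an undirected edge set over protein IDs (toy data).
ppi_edges = {frozenset(("P1", "P2")), frozenset(("P2", "P3"))}

# Lower scale: one PSN per protein, stored as residue-index contact pairs (toy data).
psns = {
    "P1": {(1, 2), (2, 3), (1, 3)},
    "P2": {(1, 2)},
    "P3": {(1, 2), (2, 3)},
}


def ppi_neighbours(protein):
    """Proteins that interact with `protein` in the PPI network, sorted by ID."""
    return sorted(p for edge in ppi_edges if protein in edge for p in edge if p != protein)


def psn_size(protein):
    """(number of residues, number of contacts) in the protein's PSN."""
    contacts = psns[protein]
    residues = {r for pair in contacts for r in pair}
    return len(residues), len(contacts)


def node_view(protein):
    """Combine both scales: the protein's own PSN summary plus its PPI neighbours' summaries."""
    return {
        "protein": protein,
        "psn_size": psn_size(protein),
        "neighbour_psn_sizes": {n: psn_size(n) for n in ppi_neighbours(protein)},
    }


view = node_view("P2")
```

A dynamic PSN, as discussed in the overview, would replace each single contact set with a sequence of contact sets; that extension is not shown here.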

July 14, 2024
17:40-18:00
Enhanced Functional Annotation for Genome-Scale Metabolic Models Using an Omics-Informed Integrated Pipeline
Confirmed Presenter: Jason Mcdermott, Pacific Northwest National Laboratory (US Dept of Energy), United States
Track: Function

Room: 520b
Format: In Person
Moderator(s): Ana Rojas


Authors List:

  • Jason Mcdermott, Pacific Northwest National Laboratory (US Dept of Energy)
  • David Geller-McGrath, MIT/Woods Hole Oceanographic Institute
  • William Nelson, Pacific Northwest National Laboratory
  • Jeremy Jacobson, Pacific Northwest National Laboratory
  • Christine Chang, Pacific Northwest National Laboratory
  • Tara Nitka, Pacific Northwest National Laboratory
  • Aimee Kessell, University of Nebraska - Lincoln
  • Ryan McClure, Pacific Northwest National Laboratory
  • Robert Egbert, Pacific Northwest National Laboratory
  • Chris Henry, Argonne National Laboratory
  • Janaka Edirisinghe, Argonne National Laboratory
  • Hyun-Seob Song, University of Nebraska Lincoln
  • Travis Wheeler, University of Arizona
  • Kirsten Hofmo

Presentation Overview:

There are inherent challenges in characterizing the metabolic potential of complex microbiomes, particularly those derived from incomplete data like metagenome-assembled genomes (MAGs). Current annotation methods often miss metabolic enzymes due to incomplete sequencing data or significant evolutionary divergence. To address these issues, the authors present a pipeline integrating three tools: MetaPathPredict, a deep learning framework for predicting metabolic modules; OMics-Enabled Global GApfilling (OMEGGA), which performs global gap-filling from experimental growth data and integrates multi-omics data; and Snekmer, a computational framework for building sequence-based models for protein families. The pipeline begins with a traditionally annotated genome, uses MetaPathPredict to predict metabolic modules, and then uses OMEGGA to perform gap filling. Snekmer is then used to identify candidates for missing enzymes. The authors demonstrate the application of this pipeline to a model soil consortium, showing how it improves annotation for genome-scale metabolic model construction and refines models for accurate prediction of growth under novel conditions. The results suggest that this integrated, data-driven approach can improve functional annotation for bacterial metabolic models and propose novel functional annotations for important metabolic enzymes in environmental bacteria.