July 13, 2024
10:40-11:00
Roll Call and Introduction to the Function COSI
Track: Function

Room: 520b
Moderator(s): Ana Rojas


Authors List:

  • Iddo Friedberg

Presentation Overview:

Free cash will be given to the 7th and 131st persons to show up. Maybe.

July 13, 2024
11:00-11:40
Invited Presentation: Linking Gene and function in the post-genomic era: issues and opportunities
Confirmed Presenter: Valerie de Crécy-Lagard
Track: Function

Room: 520b
Format: In Person
Moderator(s): Ana Rojas


Authors List:

  • Valerie de Crécy-Lagard

Presentation Overview:

Identifying the function of every gene in all sequenced organisms is the major challenge of the post-genomic era and an obligate step for any systems biology approach. This objective is far from reached. By various estimates, at least 30-70% of the genes of any given organism are of unknown function, incorrectly annotated, or have only a generic annotation. Using comparative genomic approaches, we have linked genes and functions for over 65 gene families related mainly to coenzyme metabolism, nucleic acid modification, protein modification, and metabolite repair. Building on this body of work I will discuss the lessons learned, the next steps needed to improve protein function prediction and discovery, and the role of machine learning in this process.

July 13, 2024
11:40-12:00
Unveiling the Functional Fate of Duplicated Genes Through Expression Profiling
Confirmed Presenter: Alex Warwick Vesztrocy, University of Lausanne, SIB Swiss Institute of Bioinformatics
Track: Function

Room: 520b
Format: In Person
Moderator(s): Ana Rojas


Authors List:

  • Alex Warwick Vesztrocy, University of Lausanne
  • Natasha Glover, University of Lausanne
  • Christophe Dessimoz, University of Lausanne
  • Paul D Thomas, University of Southern California
  • Irene Julca, University of Lausanne

Presentation Overview:

Gene duplication is a major evolutionary source of functional innovation – if one copy maintains the ancestral function then the other is no longer under selective pressure and is free to diverge. It is presumed that the more slowly evolving copy, in terms of sequence, maintains the ancestral function – “the least diverged orthologue (LDO) conjecture”. This is a specific case of the much-debated orthologue conjecture – that orthologous genes are functionally more similar than paralogous genes. However, annotation bias has led to issues when using gene ontology annotations in previous studies. In an attempt to remove this bias, this study uses gene expression profiles as a proxy for function.

This study analysed 15,693 gene families from PANTHER, paired with an extensive and large-scale expression atlas for 16 animal and 20 plant species. Using an evolutionary model, accounting for varying evolutionary rates between gene families, branches were categorised according to whether they display symmetric or asymmetric evolution following duplication.

Pairs of genes resulting from asymmetric duplications displayed a significantly lower expression profile similarity compared to those from symmetric duplications (Mann-Whitney, p<0.001). Additionally, the more diverged copy exhibited increased tissue specificity, suggesting specialisation.

The results of this study support the hypothesis that the gene copy with the least evolutionary change following duplication tends to retain the ancestral function. Additionally, the more diverged copy may acquire a novel, potentially specialised, function. This likely has implications when predicting functional annotations – particularly when only using pairwise distances between sequences.
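
The abstract above does not say how expression-profile similarity or tissue specificity were measured. As a minimal sketch only, the block below assumes Pearson correlation for similarity and the tau index for specificity; both choices, and the function names, are illustrative assumptions rather than the authors' method.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile has no defined correlation; report 0
    return cov / sqrt(var_x * var_y)


def tau(profile):
    """Tau tissue-specificity index: 0 = uniform expression, 1 = one tissue only."""
    peak = max(profile)
    n = len(profile)
    if peak == 0 or n < 2:
        return 0.0
    return sum(1 - value / peak for value in profile) / (n - 1)


# Hypothetical expression values across four tissues for a duplicate pair.
less_diverged = [10.0, 12.0, 11.0, 9.0]
more_diverged = [0.5, 0.2, 14.0, 0.1]

similarity = pearson(less_diverged, more_diverged)
specificity_gap = tau(more_diverged) - tau(less_diverged)
```

In this toy pair the more diverged copy is expressed almost entirely in one tissue, so its tau is higher than that of the broadly expressed copy.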

July 13, 2024
12:00-12:20
Leveraging deep learning for characterization of malaria parasite PUFs — proteins of unknown function
Confirmed Presenter: Harsh R. Srivastava, Center for Genomics and Systems Biology, Department of Biology
Track: Function

Room: 520b
Format: In Person
Moderator(s): Ana Rojas


Authors List:

  • Harsh R. Srivastava, Center for Genomics and Systems Biology
  • Daniel Berenberg, Courant Institute
  • Omar Qassab, Center for Genomics and Systems Biology
  • Tymor Hamamsy, Courant Institute
  • Jane M. Carlton, Johns Hopkins Malaria Research Institute
  • Richard Bonneau, Prescient Design

Presentation Overview:

Exploiting the sequence-structure-function paradigm is crucial for annotating proteins of unknown function (PUFs) in Plasmodium falciparum, a member of the diverged eukaryotic SAR (Stramenopiles, Alveolates, and Rhizarians) supergroup. P. falciparum, a malaria-causing parasite, accounted for ~250 million cases and over 600,000 deaths in 2022. Discovery of diagnostic and therapeutic targets in P. falciparum is hindered, as ~23% of proteins are classified as PUFs while ~40% of proteins are partially annotated. Predicting GO annotations for these PUFs is difficult given low sequence similarity to annotated proteins and limited generalization of deep learning models trained on well-studied SwissProt species to diverged organisms. Here, we focused on the structure-function relationship of SAR sequences and developed a new method to predict GO terms in P. falciparum. PFP (Plasmodium Function Predictor) is a collection of structural-homology-based deep learning models trained using evolutionarily relevant structure-aware TM-Vec embeddings. We used a deep feedforward architecture with a dropout layer to predict GO annotations and quantify uncertainty using Monte Carlo dropout. When benchmarked against DeepGOPlus, PFP demonstrated a significant improvement in Fmax, Smin, and AUPR-micro/AUPR-macro for our test split as well as our Plasmodium holdout split. PFP-predicted GO terms respected the hierarchical structure of GO and aligned with expected information content distributions. For poorly annotated proteins, PFP imputed GO terms that are biologically plausible given existing annotations. Additionally, predictions made by PFP were categorized into confidence levels and aligned with published data targeting specific P. falciparum PUFs. PFP is the first curated function prediction model developed specifically for a subset of eukaryotic species.
We will discuss findings in model architecture and highlight specific GO predictions contributing to an increase of more than 25% in P. falciparum proteome annotation.
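
To illustrate the Monte Carlo dropout idea mentioned in this abstract, here is a minimal sketch: dropout stays active at prediction time, the same input is passed through the network many times, and the spread of the outputs serves as an uncertainty estimate. The tiny network, its weights, and all names are hypothetical; this is not the PFP architecture.

```python
import math
import random
import statistics


def forward(x, w_hidden, w_out, drop_p, rng):
    """One stochastic pass through a one-hidden-layer network with dropout."""
    hidden = []
    for weights in w_hidden:
        h = max(0.0, sum(w * xi for w, xi in zip(weights, x)))  # ReLU unit
        if rng.random() < drop_p:
            h = 0.0              # unit dropped for this pass
        else:
            h /= (1.0 - drop_p)  # inverted-dropout rescaling of kept units
        hidden.append(h)
    logit = sum(w * h for w, h in zip(w_out, hidden))
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid score for one GO term


def mc_dropout_predict(x, w_hidden, w_out, drop_p=0.5, passes=200, seed=0):
    """Return (mean score, standard deviation) over repeated dropout passes."""
    rng = random.Random(seed)
    samples = [forward(x, w_hidden, w_out, drop_p, rng) for _ in range(passes)]
    return statistics.fmean(samples), statistics.pstdev(samples)


# Arbitrary illustrative weights and input embedding (assumptions).
W_HIDDEN = [[0.5, -0.2], [0.1, 0.8], [-0.3, 0.4]]
W_OUT = [1.0, -1.0, 0.5]
mean_score, uncertainty = mc_dropout_predict([1.0, 2.0], W_HIDDEN, W_OUT)
```

A larger standard deviation across passes indicates a less stable prediction, which is one way predictions could be binned into confidence levels.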

July 13, 2024
14:20-15:00
Crowdsourcing the Fifth Critical Assessment of Protein Function Annotation Algorithms (CAFA 5)
Confirmed Presenter: M. Clara De Paolis Kaluza, Northeastern University, United States
Track: Function

Room: 520b
Format: In Person
Moderator(s): Mark Wass


Authors List:

  • M. Clara De Paolis Kaluza, Northeastern University
  • Rashika Ramola, Northeastern University
  • Damiano Piovesan, BioComputing UP - University of Padova
  • Parnal Joshi, Iowa State University
  • Iddo Friedberg, Iowa State University
  • Predrag Radivojac, Northeastern University

Presentation Overview:

The Critical Assessment of Functional Annotation (CAFA) is a long-standing, ongoing community effort to independently assess computational methods for protein function prediction, to highlight well-performing methodologies, to identify bottlenecks in the field, and to provide a forum for dissemination of results and exchange of ideas. Every three years since its inception in 2010, CAFA has solicited participation from computational groups and engaged biocurators and experimental biologists to collect high-quality data on which to evaluate algorithmic performance in a series of prospective computational challenges to predict function for a large set of target proteins. For the 5th installment (CAFA 5), launched in 2023, a partnership with Kaggle Inc. facilitated the participation of a much broader community of data scientists. Predictions were collected as entries to a competitive challenge on the crowdsourced science platform. The reach and technology of this approach resulted in a 22x increase in the number of participating teams, comprising entrants from 77 countries and various scientific and technical backgrounds. At the conclusion of the challenge, predictions were evaluated using a summary metric on a limited set of proteins that had accumulated annotations during a four-month period. In this talk, we will present the outcomes of the increased and diversified participation on the quality of predictions on Gene Ontology (GO) term annotations on an expanded set of annotations and in greater detail across ontology aspects and in relation to past CAFA evaluation.
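
The abstract does not name the summary metric used. As a hedged illustration of how GO term predictions can be scored, the sketch below computes a protein-centric Fmax: the best F-measure over score thresholds, with precision averaged over proteins that have at least one prediction and recall averaged over all proteins. It ignores ontology propagation and term weighting, assumes every protein has at least one true term, and uses placeholder term labels.

```python
def fmax(predictions, truth, thresholds=None):
    """Protein-centric Fmax.

    predictions: {protein: {go_term: score in [0, 1]}}
    truth:       {protein: non-empty set of true go_terms}
    """
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 101)]
    best = 0.0
    for t in thresholds:
        precisions = []
        recall_sum = 0.0
        for protein, true_terms in truth.items():
            scored = predictions.get(protein, {})
            predicted = {term for term, score in scored.items() if score >= t}
            hits = len(predicted & true_terms)
            if predicted:  # precision only counts proteins with predictions
                precisions.append(hits / len(predicted))
            recall_sum += hits / len(true_terms)
        if not precisions:
            continue
        precision = sum(precisions) / len(precisions)
        recall = recall_sum / len(truth)
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Placeholder proteins and term labels, for illustration only.
truth = {"P1": {"GO:A", "GO:B"}, "P2": {"GO:C"}}
predictions = {
    "P1": {"GO:A": 0.9, "GO:B": 0.8, "GO:D": 0.2},
    "P2": {"GO:C": 0.7},
}
score = fmax(predictions, truth)
```

Here a threshold of 0.5 keeps exactly the true terms for both proteins, so the score reaches its maximum.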

July 13, 2024
15:00-15:20
StarFunc: interplaying template-based and deep learning approach for accurate protein function prediction
Confirmed Presenter: Chengxin Zhang, University of Michigan, United States
Track: Function

Room: 520b
Format: In Person
Moderator(s): Mark Wass


Authors List:

  • Chengxin Zhang, University of Michigan
  • Quancheng Liu, University of Michigan
  • P Lydia Freddolino, University of Michigan

Presentation Overview:

Despite significant advancements in the development of novel methods for protein function prediction via deep learning, template information often remains a critical component. Common predictions typically utilize templates identified through sequence homology or protein-protein interactions. Yet, relatively few methods leverage structure similarity for template detection, even though protein structures underpin function. In response to this, we developed StarFunc, a comprehensive approach that seamlessly marries cutting-edge deep learning models with template information. This information is garnered from a range of sources, including sequence homology, protein-protein interaction partners, similar structures, and protein domain families. StarFunc was put to the test in large-scale benchmarking and blind tests during the 5th Critical Assessment of Function Annotation (CAFA5). The results consistently highlight its advantage over not only conventional template-based predictors but also other state-of-the-art deep learning methods.

July 13, 2024
15:20-15:40
ProtBoost: Prediction of functional properties of the proteins by Py-Boost and protein language models (CAFA5 top2)
Confirmed Presenter: Alexander Chervov, 1. Institut Curie, 2. Inserm U900
Track: Function

Room: 520b
Format: Live Stream
Moderator(s): Mark Wass


Authors List:

  • Alexander Chervov, 1. Institut Curie
  • Loredana Martignetti, 1. Institut Curie
  • Anton Vakhrushev, Higher School of Economics
  • Sergei Fironov, Yandex

Presentation Overview:

We will describe a machine learning approach to predicting protein functions from their sequences, which allowed our team to take second place in CAFA5 with a significant gap to the third-place team. The main novelty of our approach is Py-Boost, a new gradient boosting algorithm designed to predict multiple targets simultaneously (developed by one of the team members; NeurIPS 2022). Gradient boosting methods (XGBoost, LightGBM, CatBoost) are the most powerful methods to date for tabular data; however, they are too slow for tasks with hundreds or thousands of targets. Py-Boost overcomes that problem through a special approximation of the loss function, thus combining the effectiveness of gradient boosting with the ability to predict thousands of targets at once. Other key ingredients are extensive use of protein language models (T5, ESM2), a GCN (graph neural network) to aggregate predictions across the Gene Ontology, and an ensemble of neural networks. Language models are becoming quite popular tools for various tasks in bioinformatics: for proteins, DNA/RNA, as well as small-molecule SMILES data. We will report our analysis of the different language models, obtained during the CAFA5 challenge as well as in subsequent analysis.

July 13, 2024
15:40-16:00
GORetriever: Reranking protein-description-based GO candidates by literature-driven deep information retrieval for precise protein function annotation
Confirmed Presenter: Huiying Yan, Fudan University, China
Track: Function

Room: 520b
Format: Live Stream
Moderator(s): Mark Wass


Authors List:

  • Huiying Yan, Fudan University
  • Shaojun Wang, Fudan University
  • Hancheng Liu, Fudan University
  • Hiroshi Mamitsuka, Kyoto University / Aalto University
  • Shanfeng Zhu, Fudan University

Presentation Overview:

The vast majority of proteins still lack experimentally validated functional annotations, which highlights the importance of developing high-performance automated protein function prediction/annotation (AFP) methods. While existing approaches focus on protein sequences, networks, and structural data, textual information related to proteins has been overlooked. However, roughly 82% of SwissProt proteins already possess literature information that experts have annotated. To efficiently and effectively use literature information, we present GORetriever, a two-stage deep information retrieval-based method for AFP. Given a target protein, in the first stage, candidate Gene Ontology (GO) terms are retrieved by using annotated proteins with similar descriptions. In the second stage, the GO terms are reranked based on semantic matching between the GO definitions and textual information (literature and protein description) of the target protein. Extensive experiments over benchmark datasets demonstrate the remarkable effectiveness of GORetriever in enhancing the AFP performance. Note that GORetriever is the key component of GOCurator, which achieved first place in the latest critical assessment of protein function annotation (CAFA5: over 1,600 teams participated), held in 2023–24.
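
The two-stage retrieve-then-rerank flow described in this abstract can be sketched as follows. This is a toy illustration only: simple token overlap (Jaccard similarity) stands in for the deep retrieval and semantic matching models, and every name, description, and term label is a made-up placeholder.

```python
def tokens(text):
    """Lower-cased whitespace tokens of a text, as a set."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve_candidates(target_description, annotated, top_k=1):
    """Stage 1: pool GO terms from the top_k proteins with similar descriptions."""
    query = tokens(target_description)
    ranked = sorted(
        annotated,
        key=lambda rec: jaccard(query, tokens(rec["description"])),
        reverse=True,
    )
    candidates = set()
    for record in ranked[:top_k]:
        candidates |= record["go_terms"]
    return candidates


def rerank(candidates, go_definitions, target_text):
    """Stage 2: order candidates by overlap of GO definition and target text."""
    text = tokens(target_text)
    return sorted(
        candidates,
        key=lambda go: (-jaccard(tokens(go_definitions[go]), text), go),
    )


# Placeholder data (not real GO identifiers or UniProt entries).
annotated = [
    {"description": "ATP dependent DNA helicase", "go_terms": {"GO:A", "GO:B"}},
    {"description": "serine protease inhibitor", "go_terms": {"GO:C"}},
]
go_definitions = {
    "GO:A": "unwinding of DNA helix using ATP",
    "GO:B": "binding to a metal ion",
    "GO:C": "inhibits protease activity",
}
candidates = retrieve_candidates("putative DNA helicase", annotated)
ranking = rerank(
    candidates,
    go_definitions,
    "putative DNA helicase unwinding DNA using ATP hydrolysis",
)
```

Separating the stages keeps the expensive text matching limited to a short candidate list instead of the whole ontology.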

July 13, 2024
16:40-17:00
InterLabelGO+: Unraveling label correlations in protein function prediction
Confirmed Presenter: Quancheng Liu, University of Michigan, United States
Track: Function

Room: 520b
Format: In Person
Moderator(s): Dukka KC


Authors List:

  • Quancheng Liu, University of Michigan
  • Chengxin Zhang, University of Michigan
  • P Lydia Freddolino, University of Michigan

Presentation Overview:

Accurate prediction of protein functions is crucial for understanding biological processes and advancing biomedical research. However, the rapid growth of known protein sequences far outpaces experimental characterization of protein function, necessitating the development of automated computational methods. We present InterLabelGO, a cutting-edge deep learning approach that leverages protein language models and addresses the challenges of label imbalance and label dependencies in protein function prediction. By incorporating a novel loss function that captures complex functional relationships and integrates alignment-based methods through dynamic weighting, InterLabelGO achieves remarkable performance in predicting Gene Ontology (GO) terms. In the recent CAFA5 challenge, a preliminary version of InterLabelGO ranked 6th out of 1,625 teams worldwide, showcasing its effectiveness compared to state-of-the-art methods. Comprehensive evaluations on large-scale datasets demonstrate InterLabelGO's ability to accurately predict GO terms across various functional categories and evaluation metrics. With its innovative approach to harnessing deep learning and label correlations, InterLabelGO represents a significant advancement in automated protein function prediction, offering the potential to greatly accelerate and enrich the functional annotation of the ever-expanding universe of protein sequences.

July 13, 2024
17:00-17:20
Discovery of PETases using a computational classification system
Confirmed Presenter: Joel Roca Martinez, University College London, United Kingdom
Track: Function

Room: 520b
Format: In Person
Moderator(s): Dukka KC


Authors List:

  • Joel Roca Martinez, University College London
  • Clemens Rauer, Universidad Autonoma de Madrid
  • Nicola Bordin, University College London
  • Mahnaz Abbasian, University College London
  • Josephin Holstein, University of Cambridge
  • Mariana Rangel, University of Cambridge
  • Florian Hollfelder, University of Cambridge
  • Christine Orengo, University College London

Presentation Overview:

Plastic accumulation is a pressing environmental issue that has escalated in recent decades. With around 25 million tons produced yearly, polyethylene terephthalate (PET) is the most common single-use plastic polymer. Current PET recycling protocols rely on mechanical processes that lead to a loss of properties and poor recycling rates. Enzymatic PET degradation emerged as an alternative solution a decade ago, with characterized natural enzymes and variants showing PET degrading activity (PETases). Considering PET has only been present in the environment for less than 50 years, PETases have not evolved naturally to degrade it, hindering the identification of very active enzymes with the right properties. In this study, starting from a dataset with over 1 billion sequences from the MGnify database, we filtered it to define a set of potential PETases that we refer to as the PETase zone, containing over 20,000 sequences. After splitting it into functional families, we focused on 2 families closely related to the family containing most of the known active PETases, while showing a high sequence diversity. To select a final set of promising candidates to test experimentally, we focused on a set of positions that we identified as differentially conserved among the functional families, thus likely important for the protein’s function. We used those positions alongside other information such as docking score, protein solubility or pocket properties to shortlist 11 potential PETases, of which 3 were experimentally active, proving the protocol’s efficacy and providing new starting points in the PETase sequence space.
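
The abstract does not describe how differentially conserved positions were identified. One simple reading, sketched here as an assumption, is to flag alignment columns that are strongly conserved within each family but with a different consensus residue in each. The threshold and the toy sequences are hypothetical, and gap characters are treated like any other residue.

```python
from collections import Counter


def consensus(column):
    """Most common residue in an alignment column and its frequency."""
    residue, count = Counter(column).most_common(1)[0]
    return residue, count / len(column)


def differentially_conserved(family_a, family_b, min_fraction=0.8):
    """0-based columns conserved within both families but on different residues.

    Both families must come from the same multiple sequence alignment, so
    that column i refers to the same position in each.
    """
    positions = []
    columns = zip(zip(*family_a), zip(*family_b))
    for index, (col_a, col_b) in enumerate(columns):
        residue_a, fraction_a = consensus(col_a)
        residue_b, fraction_b = consensus(col_b)
        conserved = fraction_a >= min_fraction and fraction_b >= min_fraction
        if conserved and residue_a != residue_b:
            positions.append(index)
    return positions


# Toy aligned sequences for two hypothetical functional families.
family_a = ["ASDH", "ASDH", "ASEH"]
family_b = ["ATDH", "ATKH", "ATDH"]
flagged = differentially_conserved(family_a, family_b)
```

Only the second column is flagged: it is fully conserved as S in one family and as T in the other, while the third column is too variable and the others agree.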

July 13, 2024
17:20-17:40
Plastic-Ml-Tool, a machine learning tool for discovering and optimising plastic degrading enzymes
Confirmed Presenter: David Medina-Ortiz, Departamento de Ingeniería en Computación, Universidad de Magallanes
Track: Function

Room: 520b
Format: In Person
Moderator(s): Dukka KC


Authors List:

  • David Medina-Ortiz, Departamento de Ingeniería en Computación
  • Anamaría Daza, Centre for Biotechnology and Bioengineering
  • Nicole Soto-García, Departamento de Ingeniería en Computación
  • Jacqueline Aldridge, Departamento de Ingeniería en Computación
  • Diego Sandoval, Centre for Biotechnology and Bioengineering
  • Bárbara Andrews, Centre for Biotechnology and Bioengineering
  • Sebastián Rodríguez, Centre for Biotechnology and Bioengineering
  • Diego Álvarez, Departamento de Ingeniería en Computación
  • Juan A. Asenjo, Centre for Biotechnology and Bioengineering

Presentation Overview:

Plastic contamination is a significant environmental threat that negatively impacts habitats, species, and ecosystems. Recycling strategies can involve chemical and thermo-mechanical processes to convert waste into energy. Nevertheless, these technologies have a slow processing time and a limited degrading rate. Biodegradation provides an environmentally friendly alternative through microorganisms with enzymes capable of degrading plastics. Machine learning approaches have been successfully applied to guide enzyme engineering for improving optimal temperature and to build classification systems for detecting plastic-degrading enzymes. However, challenges persist in improving the generalisation of the predictive methods and discovering new plastic-degrading enzymes. This work presents Plastic-Ml-Tool, a machine-learning library that discovers and assists in designing new plastic-degrading enzymes. The proposed tool has a classification system to detect plastic-degrading enzymes and recognise different plastic targets. In addition, predictive models for optimal catalytic temperature were built using deep learning. Plastic-Ml-Tool has different generative approaches to discover new plastic-degrading enzymes and an MLDE approach to guide the design of enzymes with desirable optimal catalytic temperatures. Different explorations were made to demonstrate the usability of the proposed methods, including the mapping of KEGG databases to recognise new plastic-degrading enzymes, the generation of new PET enzymes through the generative approaches implemented in Plastic-Ml-Tool, and the protein engineering of target enzymes to improve PET-plastic degradation. The high usability and the ease of in-silico navigation of new enzymes demonstrate the advantages of the proposed work, which is becoming a valuable and high-impact tool to support experimental methods.

July 13, 2024
17:40-18:00
A BLAST from the past: revisiting BLAST's E-value
Confirmed Presenter: Yang Lu, University of Waterloo, Canada
Track: Function

Room: 520b
Format: In Person
Moderator(s): Dukka KC


Authors List:

  • Yang Lu, University of Waterloo
  • William Stafford Noble, University of Washington
  • Uri Keich, The University of Sydney

Presentation Overview:

The Basic Local Alignment Search Tool, BLAST, is an indispensable tool for genomic research. BLAST established itself as the canonical tool for sequence similarity search in large part thanks to its meaningful statistical analysis. Specifically, BLAST reports the E-value of each reported alignment, which is defined as the expected number of optimal local alignments that will score at least as high as the observed alignment score, assuming that the query and the database sequences are randomly generated. Here we critically reevaluate BLAST's E-values, showing that they can be at times significantly conservative while at others too liberal. We offer an alternative approach based on generating a small sample from the null distribution of random optimal alignments, and testing whether the observed alignment score is consistent with it. In contrast with BLAST, our significance analysis seems valid, in the sense that it did not deliver inflated significance estimates in any of our extensive experiments. Moreover, although our method is slightly conservative, it is often significantly less so than the BLAST E-value. Indeed, in cases where BLAST's analysis is valid (i.e., not too liberal), our approach seems to deliver a greater number of correct alignments. One advantage of our approach is that it works with any reasonable choice of substitution matrix and gap penalties, avoiding BLAST's limited options of matrices and penalties. In addition, we can formulate the problem using a canonical family-wise error rate control setup, thereby dispensing with E-values, which can at times be difficult to interpret.
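
As a rough illustration of testing an observed alignment score against a small sample from a null distribution, the sketch below scores local alignments with a basic Smith-Waterman recursion and estimates an empirical p-value from shuffled copies of the subject sequence. The scoring scheme, the shuffling null, and the sample size are all assumptions made for this example; the abstract does not specify the authors' procedure.

```python
import random


def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for char_a in a:
        curr = [0]
        for j, char_b in enumerate(b, start=1):
            diag = prev[j - 1] + (match if char_a == char_b else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def empirical_p_value(query, subject, n_null=200, seed=0):
    """Fraction of shuffled-subject scores that reach the observed score.

    Uses the (r + 1) / (n + 1) estimator, so the result is never exactly 0.
    """
    rng = random.Random(seed)
    observed = local_alignment_score(query, subject)
    letters = list(subject)
    exceed = 0
    for _ in range(n_null):
        rng.shuffle(letters)  # composition-preserving null sequence
        if local_alignment_score(query, "".join(letters)) >= observed:
            exceed += 1
    return (exceed + 1) / (n_null + 1)


# Hypothetical sequences: the query occurs verbatim inside the subject.
p_value = empirical_p_value("MKTAYIAKQR", "GGSMKTAYIAKQRLLE")
```

Because the null scores are recomputed with the same scoring function as the observed score, this kind of test works with any substitution scores and gap penalties passed to `local_alignment_score`.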