Function

Schedule subject to change
All times listed are in CEST
Wednesday, July 26th
10:30-11:10
Invited Presentation: Learning from unpopular activities: can unknown functions guide exploration of microbiome environmental preferences?
Room: Salle Saint Claire 3
Format: Live from venue

  • Yana Bromberg


The space of molecular functionality remains largely unexplored. For every experimentally annotated protein, hundreds, if not thousands, exist that likely carry out the same function in a different organism or environmental context. Function transfer by homology, the current de facto gold standard of annotation, has been very fruitfully employed over the past decades. As a result, many of the proteins in our databases have a functional label, with estimates of misannotation ranging wildly from 5% to 95%. Surprisingly, most of the proteins extracted from newly sequenced bacterial genomes have no labeled homologs and thus remain unannotated. In this talk I will explore the possibility of annotating protein functionality without using known functional labels. I will also describe our efforts to use these kinds of protein labels to learn about yet-unknown bacterial pathways and about functionality and pathways emergent in microbiome communities.

11:10-11:30
Understanding Earth’s Ecosystems with Machine Learning
Room: Salle Saint Claire 3
Format: Live from venue

  • Marcin Joachimiak, Lawrence Berkeley National Laboratory, United States
  • Ziming Yang, Brookhaven National Laboratory, United States
  • William Riehl, Lawrence Berkeley National Laboratory, United States
  • Chris Neely, Lawrence Berkeley National Laboratory, United States
  • Prachi Gupta, Lawrence Berkeley National Laboratory, United States
  • Sean Jungbluth, Lawrence Berkeley National Laboratory, United States
  • Adam Arkin, Lawrence Berkeley National Laboratory, United States
  • Paramvir Dehal, Lawrence Berkeley National Laboratory, United States


Our planet consists of a wide array of interconnected ecosystems that are dynamically changing on multiple time scales, and microbes are now recognized as contributing to major environmental effects. In recent years, advances in microbial metagenomics have provided a substantial collection of data about ecosystem features in the form of biological sequences, taxonomy assignments, and functional annotations. Prior work has used data collections with fewer ecosystems, metagenomes, and features, limiting performance and generalizability. Here we report results from the largest standardized collection of metagenome samples to date. Using rigorous machine learning model training and evaluation approaches, including semantic similarity to assess hierarchical multi-label overlap, we identified the best performing data type combinations and model parameters. While performance was high on training data cross-validation, our results also show that models trained at different ecosystem classification levels exhibit useful generalizability for classifying metagenome samples from environments unseen by the model. By applying model interpretation methods, we derived a set of metagenome features important for distinguishing 41 widely ranging ecosystems. These key features lead to biological insights into ecosystem properties, better agreement with curated ecosystem classifications, information relevant to unknown functions, and ecosystem networks with relationships not represented in current classifications.
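The abstract mentions semantic similarity for scoring hierarchical multi-label overlap but does not define the measure. As a minimal sketch under an assumed definition (not necessarily the authors' metric), one option is the Jaccard overlap between label sets after each label is expanded with all of its ancestors. The colon-separated ecosystem paths and both function names are hypothetical.

```python
def ancestor_closure(path, sep=":"):
    """Expand a hierarchical label into itself plus all of its ancestors."""
    parts = path.split(sep)
    return {sep.join(parts[:i]) for i in range(1, len(parts) + 1)}


def hierarchical_similarity(true_labels, predicted_labels, sep=":"):
    """Jaccard overlap of ancestor-closed label sets (an assumed metric)."""
    true_set = set().union(*(ancestor_closure(p, sep) for p in true_labels))
    pred_set = set().union(*(ancestor_closure(p, sep) for p in predicted_labels))
    union = true_set | pred_set
    if not union:
        # Two empty label sets are treated as identical.
        return 1.0
    return len(true_set & pred_set) / len(union)


# Hypothetical labels: the prediction is right down to "Aquatic" but picks
# the wrong leaf, so it earns partial credit instead of zero.
score = hierarchical_similarity(
    ["Environmental:Aquatic:Marine"],
    ["Environmental:Aquatic:Freshwater"],
)
```

Partial credit for a near miss in the hierarchy is the point of such a measure: an exact-match score would treat the Freshwater prediction the same as a completely unrelated ecosystem.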

11:30-11:50
A universal operon predictor for prokaryotic (meta-)genomics data using self-training
Room: Salle Saint Claire 3
Format: Live from venue

  • Hong Su, Max Planck Institute for Multidisciplinary Sciences, Germany
  • Ruoshi Zhang, Max Planck Institute for Multidisciplinary Sciences, Germany
  • Johannes Soeding, Max Planck Institute for Multidisciplinary Sciences, Germany


Improved computational methods are urgently required for enhancing gene functional annotation. Our novel operon predictor overcomes the limitations of existing methods by eliminating the need for prior knowledge. It employs a statistical framework to estimate the probability of genes being in the same operon based on intergenic distance. Furthermore, a self-training method utilizes conserved gene clusters across multiple genomes to predict operons. Comparative evaluations on seven genomes demonstrate superior performance compared to existing approaches (ofs and operon-mapper). This innovative approach holds great promise in advancing our understanding of microbial life processes and unveiling new functional connections.
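The abstract does not spell out its statistical framework. The sketch below only illustrates the general idea of turning an intergenic distance into a same-operon probability with Bayes' rule. The Gaussian distance models, their parameters, and the prior are all illustrative assumptions, not values from the talk.

```python
import math


def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def same_operon_probability(distance_bp, prior=0.5,
                            within=(20.0, 40.0), between=(200.0, 150.0)):
    """Posterior P(same operon | intergenic distance) via Bayes' rule.

    `within` and `between` are assumed (mean, sd) pairs in base pairs for
    adjacent genes inside one operon and in different transcription units.
    """
    like_same = prior * gaussian_pdf(distance_bp, *within)
    like_diff = (1.0 - prior) * gaussian_pdf(distance_bp, *between)
    total = like_same + like_diff
    if total == 0.0:
        # Both likelihoods underflowed: the distance is uninformative here.
        return prior
    return like_same / total


# Closely spaced genes score high, widely spaced genes score low.
close_pair = same_operon_probability(10)
distant_pair = same_operon_probability(400)
```

In a self-training setting, the two distance distributions would be re-estimated from confidently predicted gene pairs rather than fixed by hand as they are here.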

11:50-12:10
hkgfinder: find and classify prokaryotic housekeeping genes for multilocus sequence analysis
Room: Salle Saint Claire 3
Format: Live from venue

  • Anicet Ebou, Laboratoire de Bioinformatique et Biostatistiques, Institut National Polytechnique Félix Houphouët-Boigny, Cote d'Ivoire
  • Dominique Koua, Laboratoire de Bioinformatique et Biostatistiques, Institut National Polytechnique Félix Houphouët-Boigny, Cote d'Ivoire
  • Adolphe Zeze, Laboratoire de Biotechnologies végétales et microbiennes, Institut National Polytechnique Félix Houphouët-Boigny, Cote d'Ivoire


Housekeeping gene prediction in genomic data remains a difficult task. Despite the importance of housekeeping genes in cellular activities, as markers for multilocus sequence analysis, and in the taxonomic description of bacteria, there is, to the best of our knowledge, no practical tool to retrieve them quickly and accurately. Although genome and metagenome annotation tools exist and can be run for this task, they are inefficient when used this way. We present hkgfinder, a fast and accurate housekeeping gene finder and classifier for the identification of common genes used in multilocus sequence analysis. hkgfinder can run on raw sequences, genomes, and metagenomes. The novel value of this method lies in its ability to directly predict and classify gene sequences into housekeeping gene families with high specificity and sensitivity, while also being faster than genome and metagenome annotators on genome and metagenome data. We compare the results of hkgfinder with those of other methods and show its accuracy and speed. hkgfinder is a standalone Python 3 program available at https://github.com/Ebedthan/hkgfinder and on https://pypi.org.

12:10-12:30
PANORAMA: comparative pangenomics tools to explore interspecies diversity of microbial genomes
Room: Salle Saint Claire 3
Format: Live from venue

  • Jérôme Arnoux, LABGeM, UMR 8030 Metabolic Genomics, CEA, Paris Saclay University, CNRS, France
  • Laura Bry, LABGeM, UMR 8030 Metabolic Genomics, CEA, Paris Saclay University, CNRS, France
  • Quentin Fernandez de Grado, LABGeM, UMR 8030 Metabolic Genomics, CEA, Paris Saclay University, CNRS, France
  • David Vallenet, LABGeM, UMR 8030 Metabolic Genomics, CEA, Paris Saclay University, CNRS, France
  • Alexandra Calteau, LABGeM, UMR 8030 Metabolic Genomics, CEA, Paris Saclay University, CNRS, France


In recent years, to cope with the increase of genomes in databanks, comparative genomics studies have focused on the overall gene content of a species, the pangenome, imposing a paradigm shift in the representation of knowledge and in the algorithms used.
We developed PANORAMA, a flexible and open-source bioinformatics toolbox, which exploits multiprocessing, to perform rapid and easy-to-use comparative analysis of pangenomes using thousands of microbial genomes. PANORAMA integrates multiple features. It leverages homologous family conservation combined with graph connectivity to allow users to search for a specific genomic context in a set of pangenome graphs. PANORAMA also predicts biological systems, such as conjugation, secretion or defense systems, at the pangenome level using a system-modeling framework associated with HMM profile databases. All generated results are associated with pangenome partitions, as well as with regions of genomic plasticity, their spots of integration and their segmentation into conserved modules.
PANORAMA aims to help microbiologists understand the adaptive potential of bacteria and the evolutionary dynamics behind the metabolic diversity of microorganisms. Future developments will add models to identify further biological systems and will integrate pangenomes into graph databases, to address the challenge of large-scale comparative pangenomics.

13:50-14:10
Functional Variants Identify Sex-specific Genes and Pathways in Alzheimer’s Disease
Room: Salle Saint Claire 3
Format: Live from venue

  • Thomas Bourquard, Baylor College of Medicine, France
  • Kwanghyuk Lee, Baylor College of Medicine, United States
  • Ismael Al-Ramahi, Baylor College of Medicine, United States
  • Juan Botas, Baylor College of Medicine, United States
  • Olivier Lichtarge, Baylor College of Medicine, United States


The incidence of Alzheimer’s Disease (AD) in women is almost double that in men. Women also typically exhibit faster cognitive decline and increased cerebral atrophy, while men have higher mortality rates. Identifying the genes underlying these sex-specific differences is crucial but especially challenging, since it requires analyzing smaller, sex-specific cohorts.

To identify sex-specific gene associations, we developed a machine learning approach that focused on functionally impactful coding variants and incorporated a vast amount of evolutionary information to the study of disease-linked coding genome variants, thereby raising statistical power. In the AD Sequencing Project (ADSP) with mixed sexes, this approach identified genes enriched for immune response pathways. Upon sex-separation, we found genes that were specifically enriched for stress-response pathways in men and cell-cycle pathways in women. These genes improved disease risk prediction in silico and experimentally modulated neurodegeneration in live Drosophila AD models.

Therefore, a general, evolution-based approach for machine learning on functionally impactful variants was powerful enough to uncover sex-specific candidates towards the discovery of diagnostic biomarkers. These findings have implications in AD, and other complex diseases, for developing therapeutic strategies and stratifying clinical trials based on sex.

14:10-14:30
CHARTING γ-SECRETASE SUBSTRATES BY EXPLAINABLE AI
Room: Salle Saint Claire 3
Format: Live from venue

  • Stephan Breimann, Department of Bioinformatics, Technical University of Munich, Freising, Germany
  • Frits Kamp, Ludwig-Maximilians-University Munich, Biomedical Center, Division of Metabolic Biochemistry, München, Germany
  • Gökhan Güner, German Center for Neurodegenerative Diseases, DZNE Munich, München, Germany
  • Stefan F. Lichtenthaler, German Center for Neurodegenerative Diseases, DZNE Munich, München, Germany
  • Dieter Langosch, Technical University of Munich, Chair of Biopolymer Chemistry, Freising, Germany
  • Dmitrij Frishman, Technical University of Munich, Department of Bioinformatics, Freising, Germany
  • Harald Steiner, German Center for Neurodegenerative Diseases, DZNE Munich, München, Germany


Objectives: This study aimed to identify physicochemical properties defining γ-secretase substrates, associated with Alzheimer's disease and cancer, using a novel bioinformatics approach.

Methods: We developed an innovative sequence-based feature engineering algorithm, Comparative Physicochemical Profiling (CPP), to identify the most discriminative physicochemical features of γ-secretase substrates. Additionally, we designed a novel deterministic positive-unlabeled learning algorithm (dPULearn) to address the problem of an unbalanced dataset containing more substrates than non-substrates. Machine learning models were trained to predict new γ-secretase substrates.

Results: Over 100 substrate-defining features were identified. By combining CPP with the explainable AI tool SHAP, we found that these features were not exclusive but exhibited varied importance. The human γ-secretase substrate proteome was uncovered, with 16.3% (n=250) classified as high confidence substrates. Our approach achieved a 90% balanced accuracy, outperforming the ProtT5 deep protein language model (57%). We experimentally validated 12 predicted substrates and 4 non-substrates with an 89% success rate, including novel substrates related to immune diseases and cancer.

Conclusions: We charted the complete human membrane proteome of γ-secretase substrates. By combining CPP with explainable AI, we could reveal the physicochemical signature of γ-secretase substrates hidden in their primary sequence, offering potential applicability in studying other molecular recognition processes.
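The results are reported as balanced accuracy, which is the mean of sensitivity and specificity and is therefore not inflated by a majority class the way plain accuracy can be. The following minimal sketch shows the standard definition on made-up labels. The toy data are hypothetical and unrelated to the study's dataset.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = substrate)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return 0.5 * (sensitivity + specificity)


# Hypothetical toy labels: 4 substrates and 2 non-substrates.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]
toy_score = balanced_accuracy(y_true, y_pred)  # sensitivity 3/4, specificity 1/2
```

On this toy set plain accuracy is 4/6, while balanced accuracy weights the two classes equally regardless of their sizes.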

14:30-14:50
Leveraging massive protein structure datasets for function prediction on a metagenomic scale
Room: Salle Saint Claire 3
Format: Live from venue

  • Pawel Szczerbiak, Malopolska Centre of Biotechnology, Jagiellonian University, Kraków, Poland
  • Witold Wydmański, Malopolska Centre of Biotechnology, Jagiellonian University, Kraków, Poland
  • Mary Maranga, Malopolska Centre of Biotechnology, Jagiellonian University, Kraków, Poland
  • Łukasz Szydłowski, Malopolska Centre of Biotechnology, Jagiellonian University, Kraków, Poland
  • Valentyn Bezshapkin, Institute of Microbiology, ETH Zürich, Switzerland
  • Piotr Kucharski, Malopolska Centre of Biotechnology, Jagiellonian University, Kraków, Poland
  • Tomasz Kosciolek, Malopolska Centre of Biotechnology, Jagiellonian University, Kraków, Poland


Recent breakthroughs in protein structure prediction (AlphaFold, ESMFold and related methods) have resulted in unprecedented growth in the availability of high-quality structural models, which currently approach 1 billion. Since function is reflected in protein structure, we can leverage such structural data for more precise function prediction. Indeed, models such as deepFRI have shown that using structure instead of sequence can lead to much better function prediction. Moreover, deepFRI has been successfully applied to metagenomic datasets and has been shown to increase annotation coverage compared to other methods (eggNOG, HUMAnN2). The Metagenomic-DeepFRI framework extends the application of deepFRI to metagenomic datasets even further by efficiently mapping and aligning sequences to putative structures. However, since deepFRI has been trained on PDB and related structures, it produces high-coverage annotations that are more general than those of comparable homology-based methods. Here, we show how deepFRI retrained on the AlphaFold-UniProt dataset enriched with Gene Ontology annotations can alleviate this limitation, and we present its applicability to large metagenomic datasets. We will also comment on future directions in which deepFRI (and function prediction in general) can be pushed forward to address current challenges in the field.

14:50-15:10
AlphaFold meets large-networks: deep-learning assisted protein family discovery at an unprecedented scale
Room: Salle Saint Claire 3
Format: Live from venue

  • Joana Pereira, Biozentrum and SIB Swiss Institute of Bioinformatics, University of Basel, Switzerland
  • Janani Durairaj, Biozentrum and SIB Swiss Institute of Bioinformatics, University of Basel, Switzerland
  • Andrew M. Waterhouse, Biozentrum and SIB Swiss Institute of Bioinformatics, University of Basel, Switzerland
  • Toomas Mets, Institute of Technology, University of Tartu, Estonia
  • Tetiana Brodiazhenko, Institute of Technology, University of Tartu, Estonia
  • Minhal Abdullah, Institute of Technology, University of Tartu, Estonia
  • Gabriel Studer, Biozentrum and SIB Swiss Institute of Bioinformatics, University of Basel, Switzerland
  • Gerardo Tauriello, Biozentrum and SIB Swiss Institute of Bioinformatics, University of Basel, Switzerland
  • Mehmet Akdel, VantAI, United States
  • Antonina Andreeva, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), United Kingdom
  • Alex Bateman, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), United Kingdom
  • Tanel Tenson, Institute of Technology, University of Tartu, Estonia
  • Hauryliuk Vasili, Science for Life Laboratory and Department of Experimental Medical Science, Lund University, Sweden
  • Torsten Schwede, Biozentrum and SIB Swiss Institute of Bioinformatics, University of Basel, Switzerland


Despite the great success of automated annotation efforts, a large number of all catalogued proteins remain functionally unannotated and unclassified. Fortunately, we are now entering a new era in protein sequence and structure annotation, with hundreds of millions of predicted protein structures made available through the AlphaFold database. These models cover most catalogued natural proteins, including those challenging to annotate using standard homology-based approaches.
We measured the extent to which AlphaFold has illuminated the unannotated space of the natural protein universe and created the Protein Universe Atlas. It accounts for more than 6 million unique protein sequences and uses recent advances in GPU-accelerated force-directed graph layouting and complex network summarizing approaches. With this representation, we discovered at least 281 putative new protein families, identified a novel protein fold and defined a new superfamily of translation-targeting toxin-antitoxin systems.
Our work highlights that automated annotation of proteins requires a combination of data sources and approaches, which are becoming increasingly available due to the rapid and ongoing advances at the interface between life sciences and deep learning. It also shows that, as a community, we are closer than ever to unlocking the full potential of the protein universe, from unknown biology to new applications.

15:10-15:30
Systematic Analysis of Symmetry in Membrane Protein Function and Evolution
Room: Salle Saint Claire 3
Format: Live from venue

  • Antoniya Aleksandrova, National Institutes of Health, United States
  • Emily Yaklich, National Institutes of Health, United States
  • Edoardo Sarti, Inria, United States
  • Lucy Forrest, National Institutes of Health, United States


Membrane proteins are encoded by around one third of a given genome and play key roles in transmission of information and chemicals such as neurotransmitters into the cell. Available membrane protein structures have revealed an abundance of symmetry and pseudo-symmetry, which arose not only by the formation of multi-subunit assemblies, but also by repetition of internal structural elements. While symmetry often plays a crucial role in defining the functional properties of the proteins, much remains to be learned about the interplay between symmetry, folding, assembly, and evolution. To provide a framework for investigating these relationships, we previously built a robust symmetry detection methodology that takes into consideration the restrictions that the lipid bilayer places on protein structures. In recent years, we have almost tripled the number of analyzed structures, whose symmetries we present in an online database called EncoMPASS (encompass.ninds.nih.gov). We used the expanded dataset to identify key characteristics of membrane proteins such as the enrichment of specific symmetry types with distinct functions and the evolutionary trends that drive fusion and complex formation.

16:00-16:20
Holistic Protein Representation (HOPER): Few-Shot Protein Function Prediction with Multimodal Representation Learning
Room: Salle Saint Claire 3
Format: Live from venue

  • Serbulent Unsal, Karadeniz Technical University, Turkey
  • Sinem Özdemir, Karadeniz Technical University, Turkey
  • Işık Özdinç, Karadeniz Technical University, Turkey
  • Amine Bayraklı, Karadeniz Technical University, Turkey
  • Muammer Albayrak, Karadeniz Technical University, Turkey
  • Kemal Turhan, Karadeniz Technical University, Turkey
  • Tunca Dogan, Hacettepe University, Turkey
  • Aybar Acar, Middle East Technical University, Turkey


Proteins are crucial building blocks of the machinery of life. However, manual annotation of proteins is costly, and automated protein function prediction (PFP) faces prediction performance-related issues mainly due to sub-optimal (manual) feature extraction. To address these challenges, we propose HOPER (Holistic Protein Representation), a multimodal learning-based protein representation method for PFP in low-data settings. HOPER combines different input modalities, including protein sequence, descriptive natural language text (from literature), and protein-protein interaction (PPI) information, to fully describe the functional properties of a protein. We evaluated HOPER on our PFP benchmarking platform (PROBE) and found that multimodal models perform notably better in challenging conditions like few-shot training, compared to sequence-based single-modality representations. We also applied HOPER to identify new immune-escape proteins in lung adenocarcinoma as a clinical case study. Our findings highlight the advantages of leveraging multiple modalities for learning complex biological processes. The integration of protein sequence, descriptive text, PPI, and possibly other types of biological data opens up new avenues for the advancement of PFP, and holds promise for a variety of future applications in biomedicine and biotechnology.

16:20-16:40
Functional annotation of the regeneration process of a non-model organism using Language Models.
Room: Salle Saint Claire 3
Format: Live from venue

  • Patricia Medina, CABD-CSIC, Spain
  • Israel Barrios, CABD-CSIC, Spain
  • Ildefonso Cases, CABD-CSIC, Spain
  • Carlos Martín, CABD-CSIC, Spain
  • Fernando Casares, CABD-CSIC, Spain
  • Ana Rojas, CSIC-CABD, Spain


Functional annotation of relevant biological processes remains challenging for non-model organisms, as most annotation protocols rely on homology, leaving substantial regions of proteomes unannotated.
On the one hand, standard methods to transfer function may not be ideally suited, since most rely on sequence conservation, knowledge of protein-protein interactions, and similar information that is not available or easily deducible for many organisms.
On the other hand, there is a conceptual issue: the paradigm posed by the ortholog conjecture is challenged by the ability of proteins to perform multiple functions.
Recent developments have made use of language models to transfer annotations in an evolution-independent manner via a “transfer learning” approach.
Since these methods are evolution-agnostic, they are suited to our purposes, considering that the sequence-based mappable fraction of the Cloeon dipterum genome is poor.
Here we present a computational pipeline, using language models and standard analyses, devised to annotate the regeneration process of Cloeon dipterum, a non-model organism with extraordinary regeneration capabilities. We will discuss its caveats and advantages, and how it has enabled us to identify relevant functions in Cloeon genes.

16:40-17:00
Gene function prediction in five model eukaryotes exclusively based on gene relative location through machine learning
Room: Salle Saint Claire 3
Format: Live from venue

  • Flavio Pazos Obregón, Biological Research Institute "Clemente Estable" & Institut Pasteur Montevideo, Uruguay
  • Diego Silvera, Departamento de Biología del Neurodesarrollo - Instituto de Investigaciones Biológicas Clemente Estable, Uruguay
  • Pablo Soto, Departamento de Biología del Neurodesarrollo - Instituto de Investigaciones Biológicas Clemente Estable, Uruguay
  • Patricio Yankilevich, Bioinformatics Platform - Instituto de Investigaciones Biomédicas de Buenos Aires, Argentina
  • Gustavo Guerberoff, Facultad de Ingeniería, Universidad de la República, Uruguay
  • Rafael Cantera, Departamento de Biología del Neurodesarrollo - Instituto de Investigaciones Biológicas Clemente Estable, Uruguay


The function of most genes is unknown. The best results in automated function prediction are obtained with machine learning-based methods that combine multiple data sources, typically sequence-derived features, protein structure and interaction data. Even though there is ample evidence showing that a gene’s function is not independent of its location, the few available examples of gene function prediction based on gene location rely on sequence identity between genes of different organisms and are thus subject to the limitations of the relationship between sequence and function.
Here we predict thousands of gene functions in five model eukaryotes (Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster, Mus musculus and Homo sapiens) using machine learning models exclusively trained with features derived from the location of genes in the genomes to which they belong. Our aim was not to obtain the best performing method for automated function prediction but to explore the extent to which a gene's location can predict its function in eukaryotes. We found that our models outperform BLAST when predicting terms from the Biological Process and Cellular Component ontologies, showing that, at least in some cases, gene location alone can be more useful than sequence to infer gene function.

17:00-17:10
Co-transcriptional cis-R-loop forming lncRNAs: a new lncRNA subclass?
Room: Salle Saint Claire 3
Format: Live from venue

  • Kevin Muret, Université Paris-Saclay, CEA, Centre National de Recherche en Génomique Humaine (CNRGH), Evry, France
  • Jean-François Deleuze, Université Paris-Saclay, CEA, Centre National de Recherche en Génomique Humaine (CNRGH), Evry, France
  • Eric Bonnet, Université Paris-Saclay, CEA, Centre National de Recherche en Génomique Humaine (CNRGH), Evry, France


For more than a decade, lncRNAs have been studied in many research fields. However, these non-protein-coding entities of more than 200 nucleotides represent a wide diversity of RNAs with very different roles, and efforts to subclassify these genes according to their genic environment have not yielded subclasses of lncRNAs based on their function. LncRNAs are able to interact with other RNAs, DNA, peptides or proteins. Here, we focused on RNA:DNA interactions (R-loops), which can be studied by DRIP-seq based on the S9.6 antibody's high affinity for R-loops. Based on more than 120 DRIP-seq experiments and lncRNA annotation, we were able to show that 49% of lncRNAs are likely to form a cis-R-loop. We have also identified 1367 lncRNA/coding gene pairs for which we suspect a role for the lncRNA in regulating the expression of the nearby coding gene. The VIM/VIM-AS1 pair, a well-known case described by Boque-Sastre et al., is also retrieved. These initial results are very promising, and they will require experimental validation in the coming years. We hope, through this original approach, to annotate and subclassify lncRNAs more precisely, in order to help researchers devise better-adapted experimental methods for their functional studies.

17:10-17:20
Applications of bioinformatics methodologies in the study of lipoxygenases from diatoms
Room: Salle Saint Claire 3
Format: Live from venue

  • Simone Bonora, Istituto di Scienze dell’Alimentazione, CNR, via Roma 64, Avellino, Italy
  • Ilenia D'Orsi, Istituto di Scienze dell’Alimentazione, CNR, via Roma 64, Avellino, Italy
  • Deborah Giordano, Istituto di Scienze dell’Alimentazione, CNR, via Roma 64, Avellino, Italy
  • Domenico D'Alelio, Stazione Zoologica Anton Dohrn, Villa Comunale, 80121 Naples, Italy
  • Angelo Facchiano, Istituto di Scienze dell’Alimentazione, CNR, via Roma 64, Avellino, Italy


Diatoms produce oxylipins as a consequence of abiotic or biotic stress. These compounds derive from the oxidation of polyunsaturated fatty acids by lipoxygenase enzymes (LOX) and influence not only the growth of phytoplankton, but also the growth of numerous organisms constituting zooplankton, having a strong impact on marine flora and fauna.
This project focused on the search for, classification of, and sequence and structural analysis of lipoxygenases (LOX) in diatoms, using bioinformatics tools and database analysis. First, we analyzed the presence of a hypothetical LOX domain in uncharacterized diatom sequences, predicted from transcriptomic experiments, to collect the largest possible number of LOX belonging to these species. After this screening, we classified the sequences by constructing phylogenetic trees, which made it possible to detect at least six different groups that fall into two main classes. The two classes are split according to the presence or absence of a probable insertion between two coordination residues at the cofactor coordination site typical of LOX (three His, one Asn and one Ile, coordination residues of Fe2+). Finally, the 3D structures of LOX enzymes representative of each group were modelled, evaluated, and compared, revealing possible functional differences related to a different composition of the substrate-binding site.

17:20-17:30
SAP: Synteny-aware gene function prediction for bacteria using protein embeddings
Room: Salle Saint Claire 3
Format: Live from venue

  • Aysun Urhan, Delft University of Technology, Netherlands
  • Bianca-Maria Cosma, Delft University of Technology, Netherlands
  • Abigail L. Manson, Broad Institute of MIT and Harvard, United States
  • Ashlee Earl, Broad Institute of MIT and Harvard, United States
  • Thomas Abeel, Delft University of Technology, Netherlands


Today, we know the function of only a small fraction of all known protein sequences. This problem is especially salient in bacteria as human-centric studies are prioritized in the field and there is much to uncover in the bacterial genetic repertoire. Conventional approaches to bacterial gene annotation are inadequate for annotating unseen proteins in novel species since there are no homologs in the existing databases. Thus, we need alternative representations of proteins. Recently, there has been an uptick in interest in adopting natural language processing methods to solve challenging bioinformatics tasks, and great success in tackling various challenges, although with limited applications in bacteria. We developed SAP, a novel synteny-aware gene function prediction tool based on protein embeddings, to annotate bacterial species. SAP distinguishes itself from existing methods in two ways: (i) it uses embedding vectors extracted from state-of-the-art protein language models and (ii) it incorporates conserved synteny across the entire bacterial kingdom using a novel operon-based approach we developed. SAP outperformed conventional annotation methods as well as the state-of-the-art on a range of representative bacteria, for various gene prediction tasks including distant homolog detection where the sequence similarity between training and test proteins was 40% at its lowest.

17:30-17:40
Prediction of bacterial interactomes based on genome-wide coevolutionary networks: an updated implementation of the ContextMirror approach
Room: Salle Saint Claire 3
Format: Live from venue

  • Miguel Fernández Martín, Barcelona Supercomputing Center - Life Sciences, Spain
  • Camila Pontes, Barcelona Supercomputing Center - Life Sciences, Spain
  • Victoria Ruiz-Serra, Barcelona Supercomputing Center - Life Sciences, Spain
  • Alfonso Valencia, Barcelona Supercomputing Center - Life Sciences, Spain


Presentation Overview: Show

The biological function of proteins is preserved through coevolution and can be quantified by computing the similarity between the phylogenetic trees of pairs of protein families. High phylogenetic similarity indicates that the proteins are likely to interact. However, this similarity is influenced by many factors, including background evolution. Current coevolution-based methods treat protein pairs independently, even though each protein interacts with multiple others. The ContextMirror methodology evaluates coevolution by integrating the influence of every interactor on a given protein pair (coevolutionary network), providing more accurate protein-protein interaction predictions. In our study, we evaluate the ContextMirror pipeline, already shown to improve the prediction of protein-protein interactions, by predicting protein-protein interactions for the full proteome of Escherichia coli (4298 proteins). Preliminary predictions reveal the potential of this approach to improve our understanding of protein coevolution. The true positive rate of the top-500 predictions (≈ 50% accuracy) is comparable to that of other methods, and when compared to the STRING database, these predictions map only to high-confidence pairs (confidence score > 0.8). In the next steps of our analysis, ContextMirror will be used to identify differences in bacterial interactomes, with potential implications in drug design and protein engineering.
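
The tree-similarity idea that this family of methods builds on can be illustrated with a minimal sketch: the similarity of two protein families is approximated by the Pearson correlation between matching entries of their inter-species distance matrices. This shows only the basic pairwise score, not the ContextMirror pipeline or its network correction. The function names and the toy distances are hypothetical.

```python
import math
from itertools import combinations


def flatten_distances(dist, species):
    """List the pairwise distances in a fixed species-pair order."""
    return [dist[(a, b)] for a, b in combinations(species, 2)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def tree_similarity(dist_a, dist_b, species):
    """Correlate the distance matrices of two protein families."""
    return pearson(flatten_distances(dist_a, species),
                   flatten_distances(dist_b, species))


# Toy data (invented): distances between orthologs in four species.
species = ["sp1", "sp2", "sp3", "sp4"]
family_a = {("sp1", "sp2"): 0.1, ("sp1", "sp3"): 0.4, ("sp1", "sp4"): 0.5,
            ("sp2", "sp3"): 0.3, ("sp2", "sp4"): 0.6, ("sp3", "sp4"): 0.2}
# family_b evolves twice as fast but with the same tree shape.
family_b = {pair: 2 * d for pair, d in family_a.items()}
# family_c has an unrelated pattern of distances.
family_c = {("sp1", "sp2"): 0.6, ("sp1", "sp3"): 0.1, ("sp1", "sp4"): 0.3,
            ("sp2", "sp3"): 0.5, ("sp2", "sp4"): 0.2, ("sp3", "sp4"): 0.4}

sim_ab = tree_similarity(family_a, family_b, species)
sim_ac = tree_similarity(family_a, family_c, species)
```

A score close to 1 (as for `family_a` and `family_b`) suggests similar trees. Treating each pair in isolation like this is the limitation the abstract describes.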

17:40-17:50
Large language models improve annotation of viral proteins
Room: Salle Saint Claire 3
Format: Live from venue

  • Zachary Flamholz, Albert Einstein College of Medicine, United States
  • Steve Biller, Wellesley College, United States
  • Libusha Kelly, Albert Einstein College of Medicine, United States


Presentation Overview: Show

Viral sequences are poorly annotated in environmental samples, a major roadblock to understanding how viruses influence microbial community structure. Current annotation approaches rely on alignment-based sequence homology methods, which are limited by available libraries of annotated viral sequences used to construct probabilistic sequence models and sequence divergence in viral proteins, rendering them invisible to recognition by alignment-based approaches. Here, we show that protein language model (PLM)-based representations can capture viral protein function beyond the limits of remote sequence homology. Using the PHROGs database of categorically annotated viral protein families, we trained a functional classifier that achieved an average area under the precision recall curve of 0.62 across nine functions over five train-test splits. The classifier was further validated by achieving 67% accuracy on a reannotation of 57 PHROG families. Additionally, PLM representations capture protein functional properties specific to viruses. Families with functions related to phage virion structure and lysis separate in the embedded space from families with functions related to viral genome replication, host genome integration, and host associated genes. To highlight the potential of PLMs to identify function annotations inaccessible to current approaches, we used a PLM-based functional classifier to identify a novel tyrosine recombinase in the ocean microbiome. Protein language models capture features of viral proteins that aid in detecting remote homology, a necessary step in meaningfully describing viral populations across the planet.

17:50-18:00
Predicting S-nitrosylation Sites in Proteins using a Transformer-based Protein language model
Room: Salle Saint Claire 3
Format: Live from venue

  • Pawel Pratyush, Michigan Technological University, United States
  • Suresh Pokharel, Michigan Technological University, United States
  • Dukka Kc, Michigan Technological University, United States


Presentation Overview: Show

Protein S-nitrosylation (SNO) plays a crucial role in transferring nitric oxide-mediated signals in both animals and plants and has emerged as a vital mechanism for regulating protein functions and cell signaling across all main classes of proteins. Developing robust computational tools to predict protein SNO sites can aid in better understanding the pathological and physiological mechanisms of SNO. Therefore, we propose pLMSNOSite, a stacked generalization approach based on an intermediate fusion of models that combines two different learned marginal amino acid sequence representations: per-residue contextual embedding learned on full sequences from a pre-trained transformer-based protein language model (global context) and per-residue supervised word embedding learned on window sequences using an embedding layer (local context). Our pLMSNOSite approach achieved significant improvement over the current state-of-the-art methods on an independent test set of experimentally identified SNO sites, with ∼21.7% increase in sensitivity, ∼35.0% improvement in MCC, and ∼10.6% improvement in g-mean. These results demonstrate that pLMSNOSite outperforms other approaches for predicting S-nitrosylation sites in protein sequences.
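
For readers unfamiliar with the metrics quoted above, the following sketch computes sensitivity, specificity, g-mean, and the Matthews correlation coefficient (MCC) from a binary confusion matrix using their standard definitions. The function name and the example counts are hypothetical and are not taken from the study.

```python
import math


def site_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts.

    Assumes at least one positive and one negative example.
    """
    sensitivity = tp / (tp + fn)          # recall on true sites
    specificity = tn / (tn + fp)          # recall on non-sites
    g_mean = math.sqrt(sensitivity * specificity)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity,
            "g_mean": g_mean, "mcc": mcc}


# Invented counts for illustration only.
metrics = site_metrics(tp=8, fp=4, tn=6, fn=2)
```

Both g-mean and MCC take errors on the negative class into account, which makes them more informative than plain accuracy when SNO and non-SNO sites are imbalanced.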

Thursday, July 27th
8:30-8:50
M-Ionic: Prediction of metal ion binding sites from sequence using residue embeddings
Room: Salle Saint Claire 3
Format: Live from venue

  • Aditi Shenoy, Science for Life Laboratory and Department of Biochemistry and Biophysics, Stockholm University, Sweden
  • Yogesh Kalakoti, Department of Biochemical Engineering & Biotechnology, Indian Institute of Technology (IIT) Delhi, India
  • Durai Sundar, Department of Biochemical Engineering & Biotechnology, Indian Institute of Technology (IIT) Delhi, India
  • Arne Elofsson, Science for Life Laboratory and Department of Biochemistry and Biophysics, Stockholm University, Sweden


Presentation Overview: Show

Understanding metal-protein interactions can provide structural and functional insights into cellular processes. As the number of protein sequences increases, developing fast yet precise computational approaches to predict and annotate metal binding sites becomes imperative. We will present M-Ionic, a sequence-based method we developed to predict which metals a protein binds and the binding residues. Since the predictions use only residue embeddings from a pre-trained protein language model, quick predictions can be made for the ten most frequent ions (Ca2+, Co2+, Cu2+, Mg2+, Mn2+, PO43-, SO42-, Zn2+, Fe2+, Fe3+). Further refinement of this method using structural features will be presented.

8:50-9:10
Machine-learning analysis of neofunctionalization following gene tandem duplication in vertebrate evolution
Room: Salle Saint Claire 3
Format: Live from venue

  • Carlo De Rito, Department Of Chemistry, Life Sciences And Environmental Sustainability, University of Parma, Italy
  • Marco Malatesta, Department Of Chemistry, Life Sciences And Environmental Sustainability, University of Parma, Italy
  • Riccardo Percudani, Department Of Chemistry, Life Sciences And Environmental Sustainability, University of Parma, Italy


Presentation Overview: Show

After tandem duplication, a gene copy can undergo mutation and acquire a new function. Examples of neofunctionalization following tandem gene duplication are the evolution of color vision in primates and the origin of a pathway for taurine biosynthesis in sauropsids. For a systematic analysis of neofunctionalization in vertebrates, we developed a large-scale two-step procedure: 1) identification of tandem duplications in the human genome and 2) identification of neofunctionalization signals through machine learning. Best reciprocal hit sequences spaced by less than 10 non-homologous sequences were considered tandem duplications. Individual genes collected by this procedure were aligned with the respective orthologous sequences of other vertebrates. For each position we calculated the overall conservation score and the difference score between orthogroup pairs. The scores were used to build two-dimensional maps based on the contact maps obtained from AlphaFold structures. Embeddings were calculated from the protein sequences using an Evolutionary Scale Model. Using a convolutional neural network for map classification and a recurrent one for embedding classification, a neofunctionalization probability value was associated with each pair. Known case studies and a 5-fold cross-validation analysis support the possibility of training a neural network to recognize neofunctionalization patterns from protein alignments and structures.
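
The spacing criterion in step 1 can be sketched as a simple filter over gene order. This is an illustration under stated assumptions, not the authors' implementation: it assumes that best reciprocal hits have already been computed, that all genes in `gene_order` lie on the same chromosome, and that every intervening gene counts as a non-homologous spacer. All names and data are hypothetical.

```python
def tandem_pairs(gene_order, reciprocal_hits, max_spacers=10):
    """Keep best-reciprocal-hit pairs separated by fewer than
    `max_spacers` intervening genes along one chromosome."""
    position = {gene: i for i, gene in enumerate(gene_order)}
    kept = []
    for gene_a, gene_b in reciprocal_hits:
        if gene_a not in position or gene_b not in position:
            continue  # pair is not on this chromosome
        spacers = abs(position[gene_a] - position[gene_b]) - 1
        if 0 <= spacers < max_spacers:
            kept.append((gene_a, gene_b))
    return kept


# Toy chromosome with 30 genes and four invented reciprocal-hit pairs.
gene_order = [f"g{i}" for i in range(30)]
hits = [("g0", "g1"), ("g2", "g12"), ("g3", "g14"), ("g5", "g29")]
candidates = tandem_pairs(gene_order, hits)
```

Here `("g2", "g12")` has nine intervening genes and is kept, while `("g3", "g14")` has ten and is rejected.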

9:10-9:30
Subagging of Principal Components for Sample Balancing: Building a Condition-Independent Gene Coexpression Resource from Public Transcriptome Data
Room: Salle Saint Claire 3
Format: Live from venue

  • Takeshi Obayashi, Tohoku University, Japan


Presentation Overview: Show

Public repositories such as NCBI GEO provide extensive gene expression data, which has led to the construction of condition-independent gene coexpression databases. However, biases in sample collections limit the ideal condition-independent coexpression information. We propose a new coexpression calculation method that uses Principal Component Analysis (PCA) for sample balancing. This approach reduces random noise by omitting low contribution variances and considers a broader range of environments by managing differences in PC contribution variances. We implement two procedures to balance the contribution of PCs: using the Spearman correlation coefficient (SCC) and a subagging ensemble method. Comparisons using the Arabidopsis RNAseq platform showed that the methods using PCA and ensemble computation outperformed those without PCA. We confirmed this result on the 17 coexpression platforms in ATTED-II. Our proposed method improves gene coexpression performance by combining PCA and ensemble computation, considering both major and minor environmental components. Sample balancing is fundamental to harnessing the power of the vast publicly available transcriptome data. The resulting gene coexpression data are available in the ATTED-II and COXPRESdb databases for plant and animal research.

10:00-10:40
Invited Presentation: Crowdsourcing (Data) Science on Kaggle
Room: Salle Saint Claire 3
Format: Live from venue

  • Walter Reade, Kaggle / Google, USA

10:40-11:00
Kaggle-hosted Critical Assessment of protein Function Annotation algorithms (CAFA)
Room: Salle Saint Claire 3
Format: Live from venue

  • M. Clara De Paolis Kaluza, Khoury College of Computer Sciences, Northeastern University, United States
  • Damiano Piovesan, Dept. of Biomedical Sciences, University of Padova, Italy
  • Parnal Joshi, Dept. of Veterinary Microbiology and Preventive Medicine, Iowa State University, United States
  • Maggie Demkin, Kaggle, United States
  • Addison Howard, Kaggle, United States
  • Walter Reade, Kaggle, United States
  • Alexandr Ignatchenko, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), United Kingdom
  • Sandra Orchard, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), United Kingdom
  • Iddo Friedberg, Dept. of Veterinary Microbiology and Preventive Medicine, Iowa State University, United States
  • Predrag Radivojac, Khoury College of Computer Sciences, Northeastern University, United States


Presentation Overview: Show

The Critical Assessment of protein Function Annotation (CAFA) is an ongoing community effort to independently assess computational methods for protein function prediction, to highlight well-performing methodologies and identify bottlenecks in the field, and to provide a forum for dissemination of results and exchange of ideas. Since its inception in 2010, CAFA has engaged a community of a few hundred prediction groups as well as many biocurators and experimental biologists in a series of prospective computational challenges, typically formulated as the prediction of Gene Ontology (GO) terms, Human Phenotype Ontology (HPO) terms, Disorder Ontology (DO) terms, or functional residues in proteins. In its 5th round (CAFA5) launched in 2023, CAFA has for the first time partnered with Kaggle Inc. to expand the function prediction challenge to a broader community of data scientists. In this talk, we will discuss the challenges and opportunities of forming academic-corporate partnerships to address important scientific problems, and more specifically, protein function prediction. We will also address the mutual adjustments made by CAFA and Kaggle organizers, CAFA5 preliminary findings, and lessons learned throughout the process.

11:00-11:20
CAFA-evaluator: A Python Tool for Benchmarking Ontological Classification Methods
Room: Salle Saint Claire 3
Format: Live from venue

  • Davide Zago, University of Padova, Italy
  • Damiano Piovesan, University of Padova, Italy


Presentation Overview: Show

The automatic prediction of ontological annotations is essential for constructing knowledge bases, but reliable benchmarking is necessary to minimize error propagation. The Critical Assessment of protein Function Annotation (CAFA) initiative provides a well-defined framework for evaluating Gene Ontology (GO) prediction methods.

However, the development of novel function prediction methods has been hindered by the lack of an easy-to-use benchmarking tool. Existing solutions are problematic due to missing documentation, making them difficult to maintain, port, and develop. Additionally, they are tailored specifically for GO terms and the CAFA challenge.

To address these issues, we present the CAFA-evaluator, a fully documented, fast, and generic Python package that can be used with any ontology and annotation. Dataset processing is fully separated from the evaluation stage, and the input format is simple. The software has been successfully tested and replicates the results provided in CAFA2 and CAFA3. Moreover, it has been adopted as the official evaluation software in CAFA5.

The CAFA-evaluator offers an easy-to-use and versatile solution for the assessment of function prediction methods. The software is freely available for download at https://github.com/BioComputingUP/CAFA-evaluator.

11:20-11:40
NetGO 3.0: Protein Language Model Improves Large-scale Functional Annotations
Room: Salle Saint Claire 3
Format: Live-stream

  • Shaojun Wang, Fudan University, China
  • Ronghui You, Fudan University, China
  • Yunjia Liu, Fudan University, China
  • Yi Xiong, Shanghai Jiao Tong University, China
  • Shanfeng Zhu, Fudan University, China


Presentation Overview: Show

As one of the state-of-the-art automated function prediction (AFP) methods, NetGO 2.0 integrates multi-source information to improve the performance. However, it mainly utilizes the proteins with experimentally supported functional annotations without leveraging valuable information from a vast number of unannotated proteins. Recently, protein language models have been proposed to learn informative representations [e.g., Evolutionary Scale Modelling (ESM)-1b embedding] from protein sequences based on self-supervision. We represented each protein by ESM-1b and used logistic regression (LR) to train a new model, LR-ESM, for AFP. The experimental results showed that LR-ESM achieved comparable performance with the best-performing component of NetGO 2.0. Therefore, by incorporating LR-ESM into NetGO 2.0, we developed NetGO 3.0 to improve the performance of AFP extensively. NetGO 3.0 is freely accessible at https://dmiip.sjtu.edu.cn/ng3.0.

11:40-12:00
Mutual Annotation-Based Prediction of Protein Domain Functions with Domain2GO
Room: Salle Saint Claire 3
Format: Live from venue

  • Erva Ulusoy, Hacettepe University, Turkey
  • Tunca Dogan, Hacettepe University, Turkey


Presentation Overview: Show

Identifying functional properties of proteins is crucial for understanding their roles in health and disease states. Domains, the structural and functional units of proteins, can provide valuable information in this context. To overcome the challenges associated with the time and cost involved in experimental approaches, researchers have developed computational strategies for predicting protein functions. In this study, we introduce Domain2GO, a novel approach to predict associations between domains and GO terms by leveraging documented protein-level GO annotations and domain content, thus redefining the problem as domain function prediction. We employed statistical resampling and analyzed co-occurrence patterns of domains and GO terms on the same proteins to obtain highly reliable associations. We applied Domain2GO to predict unknown protein functions and evaluated its performance against other methods using the Critical Assessment of Function Annotation 3 (CAFA3) challenge datasets. The results demonstrated the high potential of Domain2GO, especially when predicting molecular function and biological process terms, even though Domain2GO is not a protein-level function predictor. The approach proposed here can be extended to other ontologies and biological entities to explore unknown relationships in complex and large-scale biological data. We shared Domain2GO as a programmatic tool at https://github.com/HUBioDataLab/Domain2GO.
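
The co-occurrence idea can be sketched as follows: for every domain and GO term found on the same protein, count how often they appear together and normalize by how often the domain appears at all. This conditional frequency is a stand-in chosen for illustration; it does not reproduce Domain2GO's actual scoring or its statistical resampling. The identifiers and the function name are hypothetical.

```python
from collections import Counter
from itertools import product


def domain_go_cooccurrence(proteins):
    """Score (domain, GO term) pairs by the fraction of proteins carrying
    the domain that are also annotated with the GO term.

    `proteins` maps a protein ID to (set of domains, set of GO terms).
    """
    domain_counts = Counter()
    pair_counts = Counter()
    for domains, go_terms in proteins.values():
        domain_counts.update(domains)
        pair_counts.update(product(domains, go_terms))
    return {(domain, term): n / domain_counts[domain]
            for (domain, term), n in pair_counts.items()}


# Invented annotations for three proteins.
proteins = {
    "P1": ({"DOM_A"}, {"GO:A"}),
    "P2": ({"DOM_A", "DOM_B"}, {"GO:A", "GO:B"}),
    "P3": ({"DOM_B"}, {"GO:B"}),
}
scores = domain_go_cooccurrence(proteins)
```

In this toy set, `DOM_A` always co-occurs with `GO:A` (score 1.0) but only half the time with `GO:B` (score 0.5). Weak pairs like the latter are the kind a significance filter would be expected to discard.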

13:20-13:40
Predicting function in UniProt: rule-based and natural language models
Room: Salle Saint Claire 3
Format: Live from venue

  • Vishal Joshi, EMBL-EBI, United Kingdom
  • Elena Speretta, EMBL-EBI, United Kingdom
  • Maria Martin, EMBL-EBI, United Kingdom


Presentation Overview: Show

Automatic Annotation (AA) objective
Manually reviewed records (UniProtKB/SwissProt) constitute only about 0.23% of UniProtKB; expert curation is time-intensive and most published experimental data focuses on a rather limited range of model organisms. Simultaneously, the number of unreviewed records is growing continuously, yet for a large proportion of these records there is no experimental data available. UniProtKB uses three prediction systems, UniRule, the Association-Rule-Based Annotator (ARBA), and Google’s ProtNLM, to functionally annotate around 85% of unreviewed (UniProtKB/TrEMBL) records automatically, a process we define as Automatic Annotation.

Google’s ProtNLM method
ProtNLM (Protein Natural Language Model) is a deep-learning method trained on reviewed (SwissProt) & unreviewed (TrEMBL) records to provide protein names to millions of uncharacterized TrEMBL sequences. It was released to production in UniProt v2022_04 in October 2022. The first version of this method was a sequence-to-sequence model based on the T5X framework which takes an amino acid sequence as input & produces a protein name as output.
To improve the accuracy of predictions, in v2022_05 we deployed an ensemble approach with an equal mix of sequence-only and sequence-taxonomy trained models. UniProt then post-processes these predictions after careful curator-led analysis and community feedback to propagate names that convey functional information about the protein more accurately.

13:40-14:00
Proceedings Presentation: Combining protein sequences and structures with transformers and equivariant graph neural networks to predict protein function
Room: Salle Saint Claire 3
Format: Live from venue

  • Frimpong Boadu, University of Missouri - Columbia, United States
  • Hongyuan Cao, University of Missouri - Columbia, United States
  • Jianlin Cheng, University of Missouri - Columbia, United States


Presentation Overview: Show

Motivation: Millions of protein sequences have been generated by numerous genome and transcriptome sequencing projects. However, experimentally determining the function of the proteins is still a time-consuming, low-throughput, and expensive process, leading to a large protein sequence-function gap. Therefore, it is important to develop computational methods to accurately predict protein function to fill the gap. Even though many methods have been developed to use protein sequences as input to predict function, far fewer methods leverage protein structures in protein function prediction because, until recently, accurate structures were lacking for most proteins.
Results: We developed TransFun - a method using a transformer-based protein language model and 3D-equivariant graph neural networks to distill information from both protein sequences and structures to predict protein function. It extracts feature embeddings from protein sequences using a pre-trained protein language model (ESM) via transfer learning and combines them with 3D structures of proteins predicted by AlphaFold2 through equivariant graph neural networks. Benchmarked on the CAFA3 test dataset and a new test dataset, TransFun outperforms several state-of-the-art methods, indicating the language model and 3D-equivariant graph neural networks are effective methods to leverage protein sequences and structures to improve protein function prediction. Combining TransFun predictions and sequence similarity-based predictions can further increase prediction accuracy.
Availability: The source code of TransFun is available at https://github.com/jianlin-cheng/TransFun .

14:00-14:20
LEGO-CSM: a tool for functional characterisation of proteins
Room: Salle Saint Claire 3
Format: Live from venue

  • Thanh Binh Nguyen, The University of Queensland, Australia
  • Alex de Sá, The University of Queensland, Australia
  • Carlos Rodrigues, The University of Queensland, Australia
  • Douglas Pires, The University of Melbourne, Australia
  • David Ascher, The University of Queensland, Australia


Presentation Overview: Show

With the advancement of sequencing techniques, the discovery of new proteins has significantly exceeded human capacity and resources for experimentally characterising protein functions. In this study, we developed LEGO-CSM, a comprehensive web-based resource that addresses this gap by combining well-established and robust graph-based signatures with supervised machine learning models that use both protein sequence and structure information to characterise proteins. LEGO-CSM’s models can accurately predict protein functions in terms of subcellular localisation, Enzyme Commission (EC) numbers, and Gene Ontology (GO) terms. We demonstrate that our models perform as well as or better than alternative approaches, achieving an Area Under the Receiver Operating Characteristic Curve (ROC AUC) of up to 0.93 for subcellular localisation, up to 0.93 for EC, and up to 0.81 for GO terms on independent blind tests. LEGO-CSM’s web server is freely available at https://biosig.lab.uq.edu.au/lego_csm.

14:20-14:40
Exploring machine learning algorithms and protein language model strategies to develop functional enzyme classification systems
Room: Salle Saint Claire 3
Format: Live from venue

  • Diego Fernández, Departamento de Ingeniería en Computación, Universidad de Magallanes, Chile
  • Alvaro Olivera-Nappa, Centre for Biotechnology and Bioengineering, Department of Chemical Engineering and Biotechnology, University of Chile, Chile
  • Roberto Uribe-Paredes, Departamento de Ingeniería en Computación, Universidad de Magallanes, Chile
  • David Medina, Departamento de Ingeniería en Computación, Universidad de Magallanes, Chile


Presentation Overview: Show

Discovering functionalities for unknown enzymes has been one of the most common bioinformatics tasks. Functional annotation methods based on phylogenetic properties have been the gold standard in every genome annotation process. However, these methods only succeed if the minimum requirements for expressing similarity or homology are met. Alternatively, machine learning and deep learning methods have proven helpful for this problem, yielding functional classification systems for various bioinformatics tasks. Nevertheless, a clear strategy is still needed for building predictive models and for representing amino acid sequences. In this work, we address the problem of functional classification of enzyme sequences (EC number) via machine learning methods, exploring various alternatives for training predictive models and numerical representation methods. The results show that the best performances are achieved by applying representations based on pre-trained models. The CNN-based methods proposed in this work show a greater capacity for learning and pattern extraction in complex systems, achieving performances above 97% and binary cross-entropy losses below 0.05. Finally, we discuss the strategies explored and analyze future work to develop integrated methods for functional classification and the discovery of new enzymes to support current bioinformatics tools.

14:40-15:00
Function COSI Discussion: what do you think and what would you like to see in the future?
Room: Salle Saint Claire 3
Format: Live from venue

  • Function COSI Track Chairs

15:30-15:50
Sensitive inference of alignment-safe intervals from biodiverse protein sequence clusters using EMERALD
Room: Salle Saint Claire 3
Format: Live from venue

  • Andreas Grigorjew, University of Helsinki, Finland
  • Artur Gynter, University of Helsinki, Finland
  • Fernando H. C. Dias, University of Helsinki, Finland
  • Benjamin Buchfink, Max Planck Institute for Biology, Tübingen, Germany
  • Hajk-Georg Drost, Max Planck Institute for Biology, Tübingen, Germany
  • Alexandru I. Tomescu, University of Helsinki, Finland


Presentation Overview: Show

Sequence alignments are the foundation of life science research, but most innovation has focused on optimal alignments while ignoring information derived from suboptimal solutions. We argue that one optimal alignment per pairwise sequence comparison was a reasonable approximation when dealing with very similar sequences, but is insufficient when exploring the biodiversity of the protein universe at tree-of-life scale. To overcome this limitation, we introduce pairwise alignment-safety to uncover the amino acid positions robustly shared across all suboptimal solutions. We implemented this approach in EMERALD, a dedicated software solution for alignment-safety inference, and applied it to 400k sequences from the SwissProt database.

15:50-16:10
Identifying how evolution has tuned myosin function as species have got larger
Room: Salle Saint Claire 3
Format: Live from venue

  • Chloe Johnson, University of Kent, United Kingdom
  • Jake McGreig, University of Kent, United Kingdom
  • Daniel Mulvihill, University of Kent, United Kingdom
  • Leslie Leinwand, University of Colorado, Boulder, United States
  • Marta Farre, University of Kent, United Kingdom
  • Michael Geeves, University of Kent, United Kingdom
  • Mark Wass, University of Kent, United Kingdom


Presentation Overview: Show

The speed of muscle contraction is related to body size; muscles in larger species contract at slower rates. Heart rate is an example of this: a mouse has a heart rate close to 300 beats per minute, whereas an elephant's is about 30. Since contraction speed is a property of the β-myosin isoform expressed in the heart, we investigated how the sequence of this protein changes with body size. Analysis of the motor domain sequence of β-myosin from 67 mammals from two distinct evolutionary clades identified 16 sites, out of 800, associated with body mass but not with clade. Both clades change the same small set of amino acids, in the same order from small to large mammals, suggesting a limited number of ways in which contraction velocity can be successfully manipulated. To test this, the nine sites that differ between human and rat were mutated in the human β-myosin to match the rat sequence. Biochemical analysis revealed that the rat-human β-myosin chimera functioned like the native rat myosin, with a two-fold increase in both motility and the rate of ADP release. Thus, these sequence changes indicate adaptation of β-myosin as species mass increased to enable a reduced contraction velocity and heart rate.

16:10-16:30
Proceedings Presentation: TSignal: A transformer model for signal peptide prediction
Room: Salle Saint Claire 3
Format: Live from venue

  • Alexandru Dumitrescu, Aalto University, Department of Computer Science, Finland
  • Emmi Jokinen, Aalto University, Department of Computer Science, Finland
  • Anja Paatero, University of Helsinki, Institute of Biotechnology, Finland
  • Juho Kellosalo, University of Helsinki, Institute of Biotechnology, Finland
  • Ville Paavilainen, University of Helsinki, Institute of Biotechnology, Finland
  • Harri Lähdesmäki, Aalto University, Department of Computer Science, Finland


Presentation Overview: Show

Motivation: Signal peptides are short amino acid segments present at the N-terminus of newly synthesized proteins that facilitate protein translocation into the lumen of the endoplasmic reticulum, after which they are cleaved off. Specific regions of signal peptides influence the efficiency of protein translocation, and small changes in their primary structure can abolish protein secretion altogether. The lack of conserved motifs across signal peptides, sensitivity to mutations, and variability in the length of the peptides make signal peptide prediction a challenging task that has been extensively pursued over the years.
Results: We introduce TSignal, a deep transformer-based neural network architecture that utilizes BERT language models (LMs) and dot-product attention techniques. TSignal predicts the presence of signal peptides (SPs) and the cleavage site between the SP and the translocated mature protein. We use common benchmark datasets and show competitive accuracy in terms of SP presence prediction and state-of-the-art accuracy in terms of cleavage site prediction for most of the SP types and organism groups. We further illustrate that our fully data-driven trained model identifies useful biological information on heterogeneous test sequences.
Availability: TSignal is available at: https://github.com/Dumitrescu-Alexandru/TSignal.
Contact: alexandru.dumitrescu@aalto.fi, harri.lahdesmaki@aalto.fi