Posters - Schedules

Monday, July 24, between 18:00 CEST and 19:00 CEST
Tuesday, July 25, between 18:00 CEST and 19:00 CEST
Session A Poster Set-up and Dismantle
Session A Posters set up:
Monday, July 24, between 08:00 CEST and 08:45 CEST
Session A Posters dismantle:
Monday, July 24, at 19:00 CEST
Session B Poster Set-up and Dismantle
Session B Posters set up:
Tuesday, July 25, between 08:00 CEST and 08:45 CEST
Session B Posters dismantle:
Tuesday, July 25, at 19:00 CEST
Wednesday, July 26, between 18:00 CEST and 19:00 CEST
Session C Poster Set-up and Dismantle
Session C Posters set up:
Wednesday, July 26, between 08:00 CEST and 08:45 CEST
Session C Posters dismantle:
Wednesday, July 26, at 19:00 CEST
Virtual
C-271: Exploring Graph-based Ligand Binding Prediction: Introducing bindNode23 for Residue Classification
Track: Function
  • Franz Birkeneder, Technical University of Munich, Germany
  • Kyra Erckert, Technical University of Munich, Germany
  • Burkhard Rost, Technical University of Munich, Germany


For many proteins, fulfilling their function requires binding to ligands. Yet, reliable binding data remains scarce due to the time-consuming process of experimental verification.
Recent advances in this field predict binding residues based on sequences using protein Language Model (pLM) embeddings. However, recent efforts in assembling large-scale predicted 3D structure databases enable leveraging protein topology for predictions. Representing these topologies as graphs allows us to train Graph Neural Networks (GNNs) on binding prediction.
Here, we propose bindNode23, a predictor based on GNNs, to predict three classes of ligand binding residues: small molecules, metal ions, and DNA/RNA macromolecules. We show that this approach can reduce the state-of-the-art method's parameters by almost 80% while maintaining predictive capabilities without using additional data. Furthermore, our analyses suggest that secondary and tertiary structure features extracted from AlphaFold2 predictions are redundant to the information captured in pLM embeddings.
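
The abstract does not say how the protein graphs are built. As a minimal sketch of one common convention, the snippet below links residues whose coordinates lie within a distance cutoff; the 8 Å cutoff, the function name, and the toy coordinates are illustrative assumptions, not details of bindNode23.

```python
import math

def residue_contact_graph(coords, cutoff=8.0):
    """Return undirected edges (i, j) between residues within `cutoff` angstroms.

    `coords` holds one (x, y, z) tuple per residue, e.g. C-alpha positions.
    The 8.0 default is an assumed, commonly used value, not one from the abstract.
    """
    edges = []
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if math.dist(coords[i], coords[j]) <= cutoff:
                edges.append((i, j))
    return edges

# Hypothetical coordinates for four residues (not real data).
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
toy_edges = residue_contact_graph(toy_coords)
```

In a GNN setting, each node would additionally carry a feature vector (for example a per-residue pLM embedding), and the edge list would define message passing between neighbouring residues.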

C-272: LEGO-CSM: a tool for functional characterisation of proteins
Track: Function
  • Thanh Binh Nguyen, The University of Queensland, Australia
  • Alex de Sá, The University of Queensland, Australia
  • Carlos Rodrigues, The University of Queensland, Australia
  • Douglas Pires, The University of Melbourne, Australia
  • David Ascher, The University of Queensland, Australia


With the advancement of sequencing techniques, the discovery of new proteins has significantly exceeded human capacity and resources for experimentally characterising protein functions. In this study, we developed LEGO-CSM, a comprehensive web-based resource that addresses this gap by feeding well-established, robust graph-based signatures into supervised machine learning models that use both protein sequence and structure information to characterise proteins. LEGO-CSM’s models can accurately predict protein functions in terms of subcellular localisation, Enzyme Commission (EC) numbers, and Gene Ontology (GO) terms. We demonstrate that our models perform as well as or better than alternative approaches, achieving an Area Under the Receiver Operating Characteristic Curve (ROC AUC) of up to 0.93 for subcellular localisation, up to 0.93 for EC, and up to 0.81 for GO terms on independent blind tests. LEGO-CSM’s web server is freely available at https://biosig.lab.uq.edu.au/lego_csm.

C-273: Functional annotation of the regeneration process of a non-model organism using Language Models.
Track: Function
  • Patricia Medina, CABD-CSIC, Spain
  • Israel Barrios, CABD-CSIC, Spain
  • Ildefonso Cases, CABD-CSIC, Spain
  • Carlos Martín, CABD-CSIC, Spain
  • Fernando Casares, CABD-CSIC, Spain
  • Ana Rojas, CSIC-CABD, Spain


Functional annotation of relevant biological processes remains challenging for non-model organisms, as most of the annotation protocols rely on homology, leaving substantial regions of proteomes un-annotated.
On the one hand, standard methods for transferring function may not be ideally suited, since most rely on sequence conservation, knowledge of protein-protein interactions, etc., information that is not available or easily deducible for many organisms.
On the other hand, there is a conceptual issue: the paradigm posed by the ortholog conjecture is challenged by the ability of proteins to perform multiple functions.
Recent developments have made use of Language Models to transfer annotations in an evolution-independent manner via a “transfer learning” approach.
Since these methods are evolution-agnostic, they are suited to our purposes, considering that the sequence-based mappable fraction of the Cloeon dipterum genome is poor.
Here we present a computational pipeline, using language models and standard analyses, devised to annotate the regeneration process of Cloeon dipterum, a non-model organism with extraordinary regeneration capabilities. We will discuss its caveats and advantages, and how it has enabled us to identify relevant functions in Cloeon genes.

C-274: Exploring machine learning algorithms and protein language model strategies to develop functional enzyme classification systems
Track: Function
  • Diego Fernández, Departamento de Ingeniería en Computación, Universidad de Magallanes, Chile
  • Alvaro Olivera-Nappa, Centre for Biotechnology and Bioengineering, Department of Chemical Engineering and Biotechnology, University of Chile, Chile
  • Roberto Uribe-Paredes, Departamento de Ingeniería en Computación, Universidad de Magallanes, Chile
  • David Medina, Departamento de Ingeniería en Computación, Universidad de Magallanes, Chile


Discovering functionalities for unknown enzymes has been one of the most common bioinformatics tasks. Functional annotation methods based on phylogenetic properties have been the gold standard in every genome annotation process. However, these methods only succeed if the minimum requirements for expressing similarity or homology are met. Alternatively, machine learning and deep learning methods have proven helpful for this problem, yielding functional classification systems for various bioinformatics tasks. Nevertheless, a clear strategy is still needed for building predictive models and for representing amino acid sequences numerically. In this work, we address the problem of functional classification of enzyme sequences (EC number) via machine learning methods, exploring various alternatives for training predictive models and numerical representation methods. The results show that the best performance is achieved with representations based on pre-trained models. The CNN-based methods proposed in this work show a greater capacity for learning and pattern extraction in complex systems, achieving performance above 97% with a binary cross-entropy loss below 0.05. Finally, we discuss the strategies explored and analyze future work to develop integrated methods for functional classification and the discovery of new enzymes to support current bioinformatics tools.

C-275: Co-transcriptional cis-R-loop forming lncRNAs: a new lncRNA subclass?
Track: Function
  • Kevin Muret, Université Paris-Saclay, CEA, Centre National de Recherche en Génomique Humaine (CNRGH), Evry, France
  • Jean-François Deleuze, Université Paris-Saclay, CEA, Centre National de Recherche en Génomique Humaine (CNRGH), Evry, France
  • Eric Bonnet, Université Paris-Saclay, CEA, Centre National de Recherche en Génomique Humaine (CNRGH), Evry, France


For more than a decade, lncRNAs have been studied across many research fields. However, these non-protein-coding entities of more than 200 nucleotides represent a wide diversity of RNAs with very different roles, and efforts to subclassify these genes according to their genic environment have not yielded subclasses of lncRNAs based on their function. LncRNAs are able to interact with other RNAs, DNA, peptides or proteins. Here, we focused on RNA:DNA interactions (R-loops), which can be studied by DRIP-seq based on the S9.6 antibody's high affinity for R-loops. Based on more than 120 DRIP-seq experiments and lncRNA annotation, we were able to show that 49% of lncRNAs are likely to form a cis-R-loop. We have also identified 1367 lncRNA/coding gene pairs for which we suspect a role for the lncRNA in regulating the expression of the nearby coding gene. The VIM/VIM-AS1 pair, a well-known case described by Boque-Sastre et al., is also retrieved. These initial results are very promising, but they will require experimental validation in the coming years. We hope, through this original approach, to annotate and subclassify lncRNAs more precisely, in order to help researchers choose better-suited experimental methods for their functional studies.

C-277: Sensitive inference of alignment-safe intervals from biodiverse protein sequence clusters using EMERALD
Track: Function
  • Andreas Grigorjew, University of Helsinki, Finland
  • Artur Gynter, University of Helsinki, Finland
  • Fernando H. C. Dias, University of Helsinki, Finland
  • Benjamin Buchfink, Max Planck Institute for Biology, Tübingen, Germany
  • Hajk-Georg Drost, Max Planck Institute for Biology, Tübingen, Germany
  • Alexandru I. Tomescu, University of Helsinki, Finland


Sequence alignments are the foundation of life science research, but most innovation has focused on optimal alignments, while ignoring information derived from suboptimal solutions. We argue that one optimal alignment per pairwise sequence comparison was a reasonable approximation when dealing with very similar sequences, but is insufficient when exploring the biodiversity of the protein universe at tree-of-life scale. To overcome this limitation, we introduce pairwise alignment-safety to uncover the amino acid positions robustly shared across all suboptimal solutions. We implemented this approach in EMERALD, a dedicated software solution for alignment-safety inference, and applied it to 400k sequences from the SwissProt database.

C-278: A universal operon predictor for prokaryotic (meta-)genomics data using self-training
Track: Function
  • Hong Su, Max Planck Institute for Multidisciplinary Sciences, Germany
  • Ruoshi Zhang, Max Planck Institute for Multidisciplinary Sciences, Germany
  • Johannes Soeding, Max Planck Institute for Multidisciplinary Sciences, Germany


Improved computational methods are urgently required for enhancing gene functional annotation. Our novel operon predictor overcomes the limitations of existing methods by eliminating the need for prior knowledge. It employs a statistical framework to estimate the probability of genes being in the same operon based on intergenic distance. Furthermore, a self-training method utilizes conserved gene clusters across multiple genomes to predict operons. Comparative evaluations on seven genomes demonstrate superior performance compared to existing approaches (ofs and operon-mapper). This innovative approach holds great promise in advancing our understanding of microbial life processes and unveiling new functional connections.
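
The abstract gives no details of the statistical framework. One generic way to turn an intergenic distance into a same-operon probability is Bayes' rule with two distance likelihoods, sketched below. The exponential densities, the mean distances, the prior, and all function names are hypothetical choices for illustration only, and distances are assumed to be non-negative.

```python
import math

def exp_density(distance, mean):
    """Exponential probability density with the given mean (assumed model)."""
    return math.exp(-distance / mean) / mean

def same_operon_posterior(distance, prior=0.5, mean_within=30.0, mean_between=200.0):
    """Posterior probability that two adjacent genes share an operon.

    Applies Bayes' rule to two assumed likelihoods: short distances are
    typical within operons, longer ones between operons. All parameter
    values here are placeholders, not estimates from any genome.
    """
    within = prior * exp_density(distance, mean_within)
    between = (1.0 - prior) * exp_density(distance, mean_between)
    return within / (within + between)

p_close = same_operon_posterior(10.0)   # short gap, favours same operon
p_far = same_operon_posterior(500.0)    # long gap, favours different operons
```

In a self-training setting, such parameters could be re-estimated from the gene pairs the model currently labels with high confidence, but that loop is not shown here.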

C-279: Leveraging massive protein structure datasets for function prediction on a metagenomic scale
Track: Function
  • Pawel Szczerbiak, Malopolska Centre of Biotechnology, Jagiellonian University, Kraków, Poland
  • Witold Wydmański, Malopolska Centre of Biotechnology, Jagiellonian University, Kraków, Poland
  • Mary Maranga, Malopolska Centre of Biotechnology, Jagiellonian University, Kraków, Poland
  • Łukasz Szydłowski, Malopolska Centre of Biotechnology, Jagiellonian University, Kraków, Poland
  • Valentyn Bezshapkin, Institute of Microbiology, ETH Zürich, Switzerland
  • Piotr Kucharski, Malopolska Centre of Biotechnology, Jagiellonian University, Kraków, Poland
  • Tomasz Kosciolek, Malopolska Centre of Biotechnology, Jagiellonian University, Kraków, Poland


Recent breakthroughs in protein structure prediction (AlphaFold, ESMFold and related methods) resulted in unprecedented growth in the availability of high-quality structural models, which currently approach 1 billion. Since we know that function is reflected in protein structure, we can leverage such structural data for more precise function prediction. Indeed, models such as deepFRI showed that using structure instead of sequence can lead to much better function prediction. Moreover, deepFRI has been successfully applied to metagenomic datasets and has been shown to increase annotation coverage compared to other methods (eggNOG, HUMAnN2). The Metagenomic-DeepFRI framework extends the applicability of deepFRI to metagenomic datasets even further by efficiently mapping and aligning sequences to putative structures. However, since deepFRI has been trained on PDB and related structures, it produces high-coverage annotations, albeit more general ones than comparable homology-based methods. Here, we show how deepFRI retrained on the AlphaFold-UniProt dataset enriched with Gene Ontology annotations can alleviate this limitation, and present its applicability to large metagenomic datasets. We will also comment on future directions in which deepFRI (and function prediction in general) can be pushed forward to reflect current challenges in the field.

C-280: CHARTING γ-SECRETASE SUBSTRATES BY EXPLAINABLE AI
Track: Function
  • Stephan Breimann, Department of Bioinformatics, Technical University of Munich, Freising, Germany
  • Frits Kamp, Ludwig-Maximilians-University Munich, Biomedical Center, Division of Metabolic Biochemistry, München, Germany
  • Gökhan Güner, German Center for Neurodegenerative Diseases, DZNE Munich, München, Germany
  • Stefan F. Lichtenthaler, German Center for Neurodegenerative Diseases, DZNE Munich, München, Germany
  • Dieter Langosch, Technical University of Munich, Chair of Biopolymer Chemistry, Freising, Germany
  • Dmitrij Frishman, Technical University of Munich, Department of Bioinformatics, Freising, Germany
  • Harald Steiner, German Center for Neurodegenerative Diseases, DZNE Munich, München, Germany


Objectives: This study aimed to identify physicochemical properties defining γ-secretase substrates, associated with Alzheimer's disease and cancer, using a novel bioinformatics approach.

Methods: We developed an innovative sequence-based feature engineering algorithm, Comparative Physicochemical Profiling (CPP), to identify the most discriminative physicochemical features of γ-secretase substrates. Additionally, we designed a novel deterministic positive-unlabeled learning algorithm (dPULearn) to address the problem of an unbalanced dataset containing more substrates than non-substrates. Machine learning models were trained to predict new γ-secretase substrates.

Results: Over 100 substrate-defining features were identified. By combining CPP with the explainable AI tool SHAP, we found that these features were not exclusive but exhibited varied importance. The human γ-secretase substrate proteome was uncovered, with 16.3% (n=250) classified as high-confidence substrates. Our approach achieved a 90% balanced accuracy, outperforming the ProtT5 deep protein language model (57%). We experimentally validated 12 predicted substrates and 4 non-substrates with an 89% success rate, including novel substrates related to immune diseases and cancer.

Conclusions: We charted the complete human membrane proteome of γ-secretase substrates. By combining CPP with explainable AI, we could reveal the physicochemical signature of γ-secretase substrates hidden in their primary sequence, offering potential applicability in studying other molecular recognition processes.

C-281: Functional Variants Identify Sex-specific Genes and Pathways in Alzheimer’s Disease
Track: Function
  • Thomas Bourquard, Baylor College of Medicine, France
  • Kwanghyuk Lee, Baylor College of Medicine, United States
  • Ismael Al-Ramahi, Baylor College of Medicine, United States
  • Juan Botas, Baylor College of Medicine, United States
  • Olivier Lichtarge, Baylor College of Medicine, United States


The incidence of Alzheimer’s Disease (AD) in women is almost double that of men. Women also typically exhibit faster cognitive decline and increased cerebral atrophy, while men have higher mortality rates. Identifying the genes underlying these sex-specific differences is crucial but especially challenging, since it requires analyzing smaller, sex-specific cohorts.

To identify sex-specific gene associations, we developed a machine learning approach that focused on functionally impactful coding variants and incorporated a vast amount of evolutionary information to the study of disease-linked coding genome variants, thereby raising statistical power. In the AD Sequencing Project (ADSP) with mixed sexes, this approach identified genes enriched for immune response pathways. Upon sex-separation, we found genes that were specifically enriched for stress-response pathways in men and cell-cycle pathways in women. These genes improved disease risk prediction in silico and experimentally modulated neurodegeneration in live Drosophila AD models.

Therefore, a general, evolution-based approach for machine learning on functionally impactful variants was powerful enough to uncover sex-specific candidates towards the discovery of diagnostic biomarkers. These findings have implications in AD, and other complex diseases, for developing therapeutic strategies and stratifying clinical trials based on sex.

C-282: Predicting function in UniProt: rule-based and natural language models
Track: Function
  • Vishal Joshi, EMBL-EBI, United Kingdom
  • Elena Speretta, EMBL-EBI, United Kingdom
  • Maria Martin, EMBL-EBI, United Kingdom


Automatic Annotation (AA) objective
Manually reviewed records (UniProtKB/SwissProt) constitute only about 0.23% of UniProtKB; expert curation is time-intensive and most published experimental data focuses on a rather limited range of model organisms. Simultaneously, the number of unreviewed records is growing continuously, yet for a large proportion of these records there is no experimental data available. UniProtKB uses three prediction systems, UniRule, Association-Rule-Based Annotator (ARBA) & Google’s ProtNLM, to functionally annotate around 85% of unreviewed (UniProtKB/TrEMBL) records automatically, a process we define as Automatic Annotation.

Google’s ProtNLM method
ProtNLM (Protein Natural Language Model) is a deep-learning method trained on reviewed (SwissProt) & unreviewed (TrEMBL) records to provide protein names to millions of uncharacterized TrEMBL sequences. It was released to production in UniProt v2022_04 in October 2022. The first version of this method was a sequence-to-sequence model based on the T5X framework which takes an amino acid sequence as input & produces a protein name as output.
To improve the accuracy of predictions, in v2022_05 we deployed an ensemble approach with an equal split of sequence-only & sequence-taxonomy trained models. UniProt then post-processes these predictions after careful curator-led analysis & community feedback to propagate names that may convey functional information about the protein more accurately.

C-283: Applications of bioinformatics methodologies in the study of lipoxygenases from diatoms
Track: Function
  • Simone Bonora, Istituto di Scienze dell’Alimentazione, CNR, via Roma 64, Avellino, Italy
  • Ilenia D'Orsi, Istituto di Scienze dell’Alimentazione, CNR, via Roma 64, Avellino, Italy
  • Deborah Giordano, Istituto di Scienze dell’Alimentazione, CNR, via Roma 64, Avellino, Italy
  • Domenico D'Alelio, Stazione Zoologica Anton Dohrn, Villa Comunale, 80121 Naples, Italy
  • Angelo Facchiano, Istituto di Scienze dell’Alimentazione, CNR, via Roma 64, Avellino, Italy


Diatoms produce oxylipins as a consequence of abiotic or biotic stress. These compounds derive from the oxidation of poly-unsaturated fatty acids by lipoxygenase enzymes (LOX), and influence not only the growth of phytoplankton, but also the growth of numerous organisms constituting zooplankton, having a strong impact on marine flora and fauna.
This project focused on the search for, classification of, and sequence and structural analysis of lipoxygenases (LOX) in diatoms, using bioinformatics tools and database analysis. First, we screened uncharacterized diatom sequences, predicted from transcriptomic experiments, for hypothetical LOX domains, to collect as many diatom LOX sequences as possible. After this screening, we classified the sequences by constructing phylogenetic trees, which made it possible to detect at least six different groups, divisible into two main classes. The latter are split according to the presence or absence of a probable insertion between two coordination residues at the cofactor coordination site typical of LOX (three His, one Asn and one Ile, the coordination residues of Fe2+). Finally, the 3D structures of LOX enzymes representative of each group were modelled, evaluated, and compared, revealing possible functional differences related to a different composition of the substrate-binding site.

C-284: SAP: Synteny-aware gene function prediction for bacteria using protein embeddings
Track: Function
  • Aysun Urhan, Delft University of Technology, Netherlands
  • Bianca-Maria Cosma, Delft University of Technology, Netherlands
  • Abigail L. Manson, Broad Institute of MIT and Harvard, United States
  • Ashlee Earl, Broad Institute of MIT and Harvard, United States
  • Thomas Abeel, Delft University of Technology, Netherlands


Today, we know the function of only a small fraction of all known protein sequences. This problem is especially salient in bacteria as human-centric studies are prioritized in the field and there is much to uncover in the bacterial genetic repertoire. Conventional approaches to bacterial gene annotation are inadequate for annotating unseen proteins in novel species since there are no homologs in the existing databases. Thus, we need alternative representations of proteins. Recently, there has been an uptick in interest in adopting natural language processing methods to solve challenging bioinformatics tasks, and great success in tackling various challenges, although with limited applications in bacteria. We developed SAP, a novel synteny-aware gene function prediction tool based on protein embeddings, to annotate bacterial species. SAP distinguishes itself from existing methods in two ways: (i) it uses embedding vectors extracted from state-of-the-art protein language models and (ii) it incorporates conserved synteny across the entire bacterial kingdom using a novel operon-based approach we developed. SAP outperformed conventional annotation methods as well as the state-of-the-art on a range of representative bacteria, for various gene prediction tasks including distant homolog detection where the sequence similarity between training and test proteins was 40% at its lowest.

C-285: Mutual Annotation-Based Prediction of Protein Domain Functions with Domain2GO
Track: Function
  • Erva Ulusoy, Hacettepe University, Turkey
  • Tunca Dogan, Hacettepe University, Turkey


Identifying functional properties of proteins is crucial for understanding their roles in health and disease states. Domains, the structural and functional units of proteins, can provide valuable information in this context. To overcome the challenges associated with the time and cost involved in experimental approaches, researchers have developed computational strategies for predicting protein functions. In this study, we introduce Domain2GO, a novel approach to predict associations between domains and GO terms by leveraging documented protein-level GO annotations and domain content, thus redefining the problem as domain function prediction. We employed statistical resampling and analyzed co-occurrence patterns of domains and GO terms on the same proteins to obtain highly reliable associations. We applied Domain2GO to predict unknown protein functions and evaluated its performance against other methods using the Critical Assessment of Function Annotation 3 (CAFA3) challenge datasets. The results demonstrated the high potential of Domain2GO, especially when predicting molecular function and biological process terms, even though Domain2GO is not a protein-level function predictor. The approach proposed here can be extended to other ontologies and biological entities to explore unknown relationships in complex and large-scale biological data. We shared Domain2GO as a programmatic tool at https://github.com/HUBioDataLab/Domain2GO.
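
The co-occurrence idea can be illustrated with a small counting sketch: for every domain and GO term annotated on the same protein, count the pair and normalise by how often the domain occurs. This is only an assumed, simplified scoring (the fraction of a domain's proteins that carry a GO term); it is not the Domain2GO scoring scheme and omits the statistical resampling step. All identifiers below are hypothetical placeholders.

```python
from collections import Counter
from itertools import product

def cooccurrence_scores(protein_domains, protein_go_terms):
    """Score each (domain, GO term) pair by the fraction of proteins
    containing the domain that are also annotated with the GO term.

    Both arguments map a protein ID to a set of identifiers.
    """
    domain_counts = Counter()
    pair_counts = Counter()
    for protein, domains in protein_domains.items():
        go_terms = protein_go_terms.get(protein, set())
        domain_counts.update(domains)
        pair_counts.update(product(domains, go_terms))
    return {pair: n / domain_counts[pair[0]] for pair, n in pair_counts.items()}

# Hypothetical toy annotations (placeholder IDs, not real domains or GO terms).
toy_domains = {"P1": {"D1"}, "P2": {"D1", "D2"}, "P3": {"D2"}}
toy_go = {"P1": {"GO:A"}, "P2": {"GO:A", "GO:B"}, "P3": {"GO:B"}}
toy_scores = cooccurrence_scores(toy_domains, toy_go)
```

Here the pair ("D1", "GO:A") appears on every protein that contains D1, so it scores highest, whereas ("D1", "GO:B") arises only from the multi-domain protein P2 and scores lower.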

C-286: Subagging of Principal Components for Sample Balancing: Building a Condition-Independent Gene Coexpression Resource from Public Transcriptome Data
Track: Function
  • Takeshi Obayashi, Tohoku University, Japan


Public repositories such as NCBI GEO provide extensive gene expression data, which has led to the construction of condition-independent gene coexpression databases. However, biases in sample collections limit the ideal condition-independent coexpression information. We propose a new coexpression calculation method that uses Principal Component Analysis (PCA) for sample balancing. This approach reduces random noise by omitting low contribution variances and considers a broader range of environments by managing differences in PC contribution variances. We implement two procedures to balance the contribution of PCs: using the Spearman correlation coefficient (SCC) and a subagging ensemble method. Comparisons using the Arabidopsis RNAseq platform showed that the methods using PCA and ensemble computation outperformed those without PCA. We confirmed this result on the 17 coexpression platforms in ATTED-II. Our proposed method improves gene coexpression performance by combining PCA and ensemble computation, considering both major and minor environmental components. Sample balancing is fundamental to harnessing the power of the vast publicly available transcriptome data. The resulting gene coexpression data are available in the ATTED-II and COXPRESdb databases for plant and animal research.

C-287: AlphaFold meets large-networks: deep-learning assisted protein family discovery at an unprecedented scale
Track: Function
  • Joana Pereira, Biozentrum and SIB Swiss Institute of Bioinformatics, University of Basel, Switzerland
  • Janani Durairaj, Biozentrum and SIB Swiss Institute of Bioinformatics, University of Basel, Switzerland
  • Andrew M. Waterhouse, Biozentrum and SIB Swiss Institute of Bioinformatics, University of Basel, Switzerland
  • Toomas Mets, Institute of Technology, University of Tartu, Estonia
  • Tetiana Brodiazhenko, Institute of Technology, University of Tartu, Estonia
  • Minhal Abdullah, Institute of Technology, University of Tartu, Estonia
  • Gabriel Studer, Biozentrum and SIB Swiss Institute of Bioinformatics, University of Basel, Switzerland
  • Gerardo Tauriello, Biozentrum and SIB Swiss Institute of Bioinformatics, University of Basel, Switzerland
  • Mehmet Akdel, VantAI, United States
  • Antonina Andreeva, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), United Kingdom
  • Alex Bateman, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), United Kingdom
  • Tanel Tenson, Institute of Technology, University of Tartu, Estonia
  • Vasili Hauryliuk, Science for Life Laboratory and Department of Experimental Medical Science, Lund University, Sweden
  • Torsten Schwede, Biozentrum and SIB Swiss Institute of Bioinformatics, University of Basel, Switzerland


Despite the great success of automated annotation efforts, a large number of all catalogued proteins remain functionally unannotated and unclassified. Fortunately, we are now entering a new era in protein sequence and structure annotation, with hundreds of millions of predicted protein structures made available through the AlphaFold database. These models cover most catalogued natural proteins, including those challenging to annotate using standard homology-based approaches.
We measured the extent to which AlphaFold has illuminated the unannotated space of the natural protein universe and created the Protein Universe Atlas. It accounts for more than 6 million unique protein sequences and uses recent advances in GPU-accelerated force-directed graph layout and complex network summarizing approaches. With this representation, we discovered at least 281 putative new protein families, identified a novel protein fold and defined a new superfamily of translation-targeting toxin-antitoxin systems.
Our work highlights that automated annotation of proteins requires a combination of data sources and approaches, which are becoming increasingly available due to the rapid and ongoing advances at the interface between life sciences and deep learning. It also shows that, as a community, we are closer than ever to unlocking the full potential of the protein universe, from unknown biology to new applications.

C-288: M-Ionic: Prediction of metal ion binding sites from sequence using residue embeddings
Track: Function
  • Aditi Shenoy, Science for Life Laboratory and Department of Biochemistry and Biophysics, Stockholm University, Sweden
  • Yogesh Kalakoti, Department of Biochemical Engineering & Biotechnology, Indian Institute of Technology (IIT) Delhi, India
  • Durai Sundar, Department of Biochemical Engineering & Biotechnology, Indian Institute of Technology (IIT) Delhi, India
  • Arne Elofsson, Science for Life Laboratory and Department of Biochemistry and Biophysics, Stockholm University, Sweden


Understanding metal-protein interactions can provide structural and functional insights into cellular processes. As the number of protein sequences increases, developing fast yet precise computational approaches to predict and annotate metal binding sites becomes imperative. We will present M-Ionic, a sequence-based method we developed to predict which metals a protein binds and the binding residues. Since the predictions use only residue embeddings from a pre-trained protein language model, quick predictions can be made for the ten most frequent ions (Ca2+, Co2+, Cu2+, Mg2+, Mn2+, PO43-, SO42-, Zn2+, Fe2+, Fe3+). Further refinement of this method using structural features will be presented.

C-289: Predicting S-nitrosylation Sites in Proteins using a Transformer-based Protein language model
Track: Function
  • Pawel Pratyush, Michigan Technological University, United States
  • Suresh Pokharel, Michigan Technological University, United States
  • Dukka Kc, Michigan Technological University, United States


Protein S-nitrosylation (SNO) plays a crucial role in transferring nitric oxide-mediated signals in both animals and plants and has emerged as a vital mechanism for regulating protein functions and cell signaling across all main classes of proteins. Developing robust computational tools to predict protein SNO sites can aid in better understanding the pathological and physiological mechanisms of SNO. Therefore, we propose pLMSNOSite, a stacked generalization approach based on an intermediate fusion of models that combines two different learned marginal amino acid sequence representations: per-residue contextual embedding learned on full sequences from a pre-trained transformer-based protein language model (global context) and per-residue supervised word embedding learned on window sequences using an embedding layer (local context). Our pLMSNOSite approach achieved significant improvement over the current state-of-the-art methods on an independent test set of experimentally identified SNO sites, with ∼21.7% increase in sensitivity, ∼35.0% improvement in MCC, and ∼10.6% improvement in g-mean. These results demonstrate that pLMSNOSite outperforms other approaches for predicting S-nitrosylation sites in protein sequences.

C-290: Prediction of bacterial interactomes based on genome-wide coevolutionary networks: an updated implementation of the ContextMirror approach
Track: Function
  • Miguel Fernández Martín, Barcelona Supercomputing Center - Life Sciences, Spain
  • Camila Pontes, Barcelona Supercomputing Center - Life Sciences, Spain
  • Victoria Ruiz-Serra, Barcelona Supercomputing Center - Life Sciences, Spain
  • Alfonso Valencia, Barcelona Supercomputing Center - Life Sciences, Spain


Presentation Overview:

The biological function of proteins is preserved through coevolution, which can be quantified by computing the similarity between the phylogenetic trees of pairs of protein families. High phylogenetic similarity indicates that the proteins are likely to interact. However, this similarity is influenced by many factors, including background evolution. Current coevolution-based methods treat protein pairs independently, even though each protein interacts with multiple others. The ContextMirror methodology evaluates coevolution by integrating the influence of every interactor on a given protein pair (the coevolutionary network), providing more accurate protein-protein interaction predictions. In our study, we evaluate the ContextMirror pipeline, already shown to improve the prediction of protein-protein interactions, by predicting interactions for the full proteome of Escherichia coli (4298 proteins). Preliminary predictions reveal the potential of this approach to improve our understanding of protein coevolution. The true positive rate of the top-500 predictions (≈ 50% accuracy) is comparable to that of other methods and, when compared against the STRING database, these predictions map only to high-confidence pairs (confidence score > 0.8). In the next steps of our analysis, ContextMirror will be used to identify differences between bacterial interactomes, with potential implications for drug design and protein engineering.
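The tree-similarity idea in the first sentences can be sketched as follows. This is a simplified, mirrortree-style illustration, not the ContextMirror pipeline: it only correlates the pairwise distances of two families over the same ordered set of species, and it omits the coevolutionary-network correction the abstract describes. The function names and distance matrices are hypothetical.

```python
import math


def upper_triangle(matrix):
    """Flatten the strict upper triangle of a square distance matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(xs, ys):
    """Pearson correlation of two equally long lists (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def tree_similarity(dist_a, dist_b):
    """Correlate two families' inter-species distances (same species order)."""
    return pearson(upper_triangle(dist_a), upper_triangle(dist_b))


# Toy distance matrices over four species (made-up values).
family_a = [[0, 1, 4, 5], [1, 0, 3, 6], [4, 3, 0, 2], [5, 6, 2, 0]]
family_b = [[2 * d for d in row] for row in family_a]  # same tree shape, faster clock
family_c = [[0, 6, 2, 1], [6, 0, 5, 3], [2, 5, 0, 4], [1, 3, 4, 0]]

sim_ab = tree_similarity(family_a, family_b)
sim_ac = tree_similarity(family_a, family_c)
print(sim_ab, sim_ac)
```

In this toy case the first pair has proportional distances and therefore a high score, whereas the second pair does not; a real analysis would also need to correct for the shared species phylogeny, which is the background signal mentioned in the abstract.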

C-291: Large language models improve annotation of viral proteins
Track: Function
  • Zachary Flamholz, Albert Einstein College of Medicine, United States
  • Steve Biller, Wellesley College, United States
  • Libusha Kelly, Albert Einstein College of Medicine, United States


Presentation Overview:

Viral sequences are poorly annotated in environmental samples, a major roadblock to understanding how viruses influence microbial community structure. Current annotation approaches rely on alignment-based sequence homology methods, which are limited both by the available libraries of annotated viral sequences used to construct probabilistic sequence models and by sequence divergence in viral proteins, which can render them invisible to alignment-based approaches. Here, we show that protein language model (PLM)-based representations can capture viral protein function beyond the limits of remote sequence homology. Using the PHROGs database of categorically annotated viral protein families, we trained a functional classifier that achieved an average area under the precision-recall curve of 0.62 across nine functions over five train-test splits. The classifier was further validated by achieving 67% accuracy on a reannotation of 57 PHROG families. Additionally, PLM representations capture protein functional properties specific to viruses. Families with functions related to phage virion structure and lysis separate in the embedded space from families with functions related to viral genome replication, host genome integration, and host-associated genes. To highlight the potential of PLMs to identify function annotations inaccessible to current approaches, we used a PLM-based functional classifier to identify a novel tyrosine recombinase in the ocean microbiome. Protein language models capture features of viral proteins that aid in detecting remote homology, a necessary step in meaningfully describing viral populations across the planet.
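As an illustration of the headline metric, the sketch below computes average precision (one common estimator of the area under the precision-recall curve) per functional category and then averages across categories. The abstract does not state which estimator the authors used, so this choice is an assumption, and the category names, labels, and scores are invented.

```python
def average_precision(labels, scores):
    """Average precision for one class: mean precision at each true positive."""
    # Rank examples from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def macro_average_precision(per_class):
    """Unweighted mean of the per-class average precision values."""
    values = [average_precision(labels, scores) for labels, scores in per_class.values()]
    return sum(values) / len(values)


# Invented one-vs-rest labels and classifier scores for two categories.
per_class = {
    "lysis": ([1, 0, 1, 1, 0], [0.9, 0.8, 0.7, 0.6, 0.5]),
    "tail": ([0, 1, 0, 0, 1], [0.2, 0.9, 0.4, 0.1, 0.8]),
}
ap_lysis = average_precision(*per_class["lysis"])
macro_ap = macro_average_precision(per_class)
print(ap_lysis, macro_ap)
```

Averaging per category rather than pooling all predictions keeps rare functional categories from being swamped by common ones.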

C-292: PANORAMA: comparative pangenomics tools to explore interspecies diversity of microbial genomes
Track: Function
  • Jérôme Arnoux, LABGeM, UMR 8030 Metabolic Genomics, CEA, Paris Saclay University, CNRS, France
  • Laura Bry, LABGeM, UMR 8030 Metabolic Genomics, CEA, Paris Saclay University, CNRS, France
  • Quentin Fernandez de Grado, LABGeM, UMR 8030 Metabolic Genomics, CEA, Paris Saclay University, CNRS, France
  • David Vallenet, LABGeM, UMR 8030 Metabolic Genomics, CEA, Paris Saclay University, CNRS, France
  • Alexandra Calteau, LABGeM, UMR 8030 Metabolic Genomics, CEA, Paris Saclay University, CNRS, France


Presentation Overview:

In recent years, to cope with the growing number of genomes in databanks, comparative genomics studies have focused on the overall gene content of a species, the pangenome, imposing a paradigm shift in the representation of knowledge and in the algorithms used.
We developed PANORAMA, a flexible, open-source bioinformatics toolbox that exploits multiprocessing to perform rapid and easy-to-use comparative analysis of pangenomes built from thousands of microbial genomes. PANORAMA integrates multiple features. It leverages homologous family conservation combined with graph connectivity to allow users to search for a specific genomic context in a set of pangenome graphs. PANORAMA also predicts biological systems, such as conjugation, secretion or defense systems, at the pangenome level using a system-modeling framework associated with HMM profile databases. All generated results are associated with pangenome partitions, as well as with regions of genomic plasticity, their spots of integration and their segmentation into conserved modules.
PANORAMA aims to help microbiologists understand the adaptive potential of bacteria and the evolutionary dynamics behind the metabolic diversity of microorganisms. Future developments will add models to identify further biological systems and will integrate pangenomes into graph databases, to address the challenge of large-scale comparative pangenomics.

C-293: Machine-learning analysis of neofunctionalization following gene tandem duplication in vertebrate evolution
Track: Function
  • Carlo De Rito, Department of Chemistry, Life Sciences and Environmental Sustainability, University of Parma, Italy
  • Marco Malatesta, Department of Chemistry, Life Sciences and Environmental Sustainability, University of Parma, Italy
  • Riccardo Percudani, Department of Chemistry, Life Sciences and Environmental Sustainability, University of Parma, Italy


Presentation Overview:

After tandem duplication, a gene copy can undergo mutation and acquire a new function. Examples of neofunctionalization following tandem gene duplication include the evolution of color vision in primates and the origin of a pathway for taurine biosynthesis in sauropsids. For a systematic analysis of neofunctionalization in vertebrates, we developed a large-scale two-step procedure: 1) identification of tandem duplications in the human genome and 2) identification of neofunctionalization signals through machine learning. Best reciprocal hit sequences separated by fewer than 10 non-homologous sequences were considered tandem duplications. Individual genes collected by this procedure were aligned with the respective orthologous sequences of other vertebrates. For each position we calculated the overall conservation score and the difference score between orthogroup pairs. The scores were used to build two-dimensional maps based on the contact maps obtained from AlphaFold structures. Embeddings were calculated from the protein sequences using an Evolutionary Scale Model. Using a convolutional neural network for map classification and a recurrent one for embedding classification, a neofunctionalization probability value was associated with each pair. Known case studies and a 5-fold cross-validation analysis support the possibility of training a neural network to recognize neofunctionalization patterns from protein alignments and structures.
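The first step of the procedure (calling tandem duplications from best reciprocal hits) could look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' code: genes are given in chromosomal order, reciprocal best hits are supplied as precomputed pairs, and every intervening gene is counted as a non-homologous spacer. All names and data are hypothetical.

```python
def tandem_pairs(gene_order, reciprocal_best_hits, max_spacers=10):
    """Return reciprocal-best-hit pairs separated by fewer than max_spacers genes.

    gene_order: gene identifiers in their order along one chromosome.
    reciprocal_best_hits: iterable of (gene_a, gene_b) pairs.
    """
    position = {gene: index for index, gene in enumerate(gene_order)}
    pairs = []
    for gene_a, gene_b in reciprocal_best_hits:
        # Skip pairs with a member on another chromosome (absent from gene_order).
        if gene_a not in position or gene_b not in position:
            continue
        # Number of genes lying strictly between the two hits.
        spacers = abs(position[gene_a] - position[gene_b]) - 1
        if 0 <= spacers < max_spacers:
            pairs.append(tuple(sorted((gene_a, gene_b), key=position.get)))
    return pairs


# Toy chromosome with 30 genes and a few made-up reciprocal best hits.
gene_order = [f"g{i}" for i in range(30)]
hits = [("g3", "g2"), ("g5", "g14"), ("g4", "g15"), ("g0", "g20"), ("g25", "gX")]
called = tandem_pairs(gene_order, hits)
print(called)
```

With the default cutoff, a pair separated by exactly 10 genes is rejected, matching the "fewer than 10" wording of the abstract.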

C-294: Kaggle-hosted Critical Assessment of protein Function Annotation algorithms (CAFA)
Track: Function
  • M. Clara De Paolis Kaluza, Khoury College of Computer Sciences, Northeastern University, United States
  • Damiano Piovesan, Dept. of Biomedical Sciences, University of Padova, Italy
  • Parnal Joshi, Dept. of Veterinary Microbiology and Preventive Medicine, Iowa State University, United States
  • Maggie Demkin, Kaggle, United States
  • Addison Howard, Kaggle, United States
  • Walter Reade, Kaggle, United States
  • Alexandr Ignatchenko, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), United Kingdom
  • Sandra Orchard, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), United Kingdom
  • Iddo Friedberg, Dept. of Veterinary Microbiology and Preventive Medicine, Iowa State University, United States
  • Predrag Radivojac, Khoury College of Computer Sciences, Northeastern University, United States


Presentation Overview:

The Critical Assessment of protein Function Annotation (CAFA) is an ongoing community effort to independently assess computational methods for protein function prediction, to highlight well-performing methodologies and identify bottlenecks in the field, and to provide a forum for dissemination of results and exchange of ideas. Since its inception in 2010, CAFA has engaged a community of a few hundred prediction groups as well as many biocurators and experimental biologists in a series of prospective computational challenges, typically formulated as the prediction of Gene Ontology (GO) terms, Human Phenotype Ontology (HPO) terms, Disorder Ontology (DO) terms, or functional residues in proteins. In its 5th round (CAFA5) launched in 2023, CAFA has for the first time partnered with Kaggle Inc. to expand the function prediction challenge to a broader community of data scientists. In this talk, we will discuss the challenges and opportunities of forming academic-corporate partnerships to address important scientific problems, and more specifically, protein function prediction. We will also address the mutual adjustments made by CAFA and Kaggle organizers, CAFA5 preliminary findings, and lessons learned throughout the process.

C-295: Gene function prediction in five model eukaryotes exclusively based on gene relative location through machine learning
Track: Function
  • Flavio Pazos Obregón, Biological Research Institute "Clemente Estable" & Institut Pasteur Montevideo, Uruguay
  • Diego Silvera, Departamento de Biología del Neurodesarrollo - Instituto de Investigaciones Biológicas Clemente Estable, Uruguay
  • Pablo Soto, Departamento de Biología del Neurodesarrollo - Instituto de Investigaciones Biológicas Clemente Estable, Uruguay
  • Patricio Yankilevich, Bioinformatics Platform - Instituto de Investigaciones Biomédicas de Buenos Aires, Argentina
  • Gustavo Guerberoff, Facultad de Ingeniería, Universidad de la República, Uruguay
  • Rafael Cantera, Departamento de Biología del Neurodesarrollo - Instituto de Investigaciones Biológicas Clemente Estable, Uruguay


Presentation Overview:

The function of most genes is unknown. The best results in automated function prediction are obtained with machine learning-based methods that combine multiple data sources, typically sequence-derived features, protein structure and interaction data. Even though there is ample evidence showing that a gene’s function is not independent of its location, the few available examples of gene function prediction based on gene location rely on sequence identity between genes of different organisms and are thus subject to the limitations of the relationship between sequence and function.
Here we predict thousands of gene functions in five model eukaryotes (Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster, Mus musculus and Homo sapiens) using machine learning models trained exclusively with features derived from the location of genes in the genomes to which they belong. Our aim was not to obtain the best-performing method for automated function prediction but to explore the extent to which a gene's location can predict its function in eukaryotes. We found that our models outperform BLAST when predicting terms from the Biological Process and Cellular Component ontologies, showing that, at least in some cases, gene location alone can be more useful than sequence to infer gene function.

C-296: A Comparative Study on Protein Sequence Embedding for Function Prediction
Track: Function
  • M. Clara De Paolis Kaluza, Khoury College of Computer Sciences, Northeastern University, United States
  • Predrag Radivojac, Khoury College of Computer Sciences, Northeastern University, United States


Presentation Overview:

Numerical representation of protein sequences is crucial to the development of computational methods. Large language models (LLMs) trained on millions of sequences are becoming ubiquitous for feature extraction in computational pipelines. To investigate the utility of these models, we compare unsupervised embedding methods for representing protein sequences and evaluate them on protein function prediction tasks. We compare two state-of-the-art protein embedding LLMs, ProtTrans (Elnaggar et al. 2021) and ESM2 (Rives et al. 2019, Lin et al. 2022), with traditional k-mer counts (k up to 5). We generate a challenging dataset of all proteins in SwissProt for 25 species with a minimum number of proteins in each function of interest. On the task of catalytic activity prediction, we find that LLM embeddings outperform k-mer embeddings (best model AUC = 0.9081 for ProtTrans, 0.9023 for ESM2, 0.8419 for k-mers). This is ongoing work, and we plan to compare more protein functions to investigate the potential discrepancy in performance at different functional complexity levels. Additionally, we are interested in evaluating combinations of these representations to assess to what extent they capture the same or different aspects of the language of proteins as it relates to function prediction from sequence.
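The k-mer baseline mentioned above can be sketched in a few lines. This is a generic k-mer counting illustration, assuming the 20 standard amino acids and overlapping windows; the abstract does not specify normalization or how non-standard residues were handled, and the function names and example sequence are made up.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def kmer_counts(sequence, k):
    """Count overlapping k-mers in a protein sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def kmer_vector(sequence, k):
    """Dense count vector over all 20**k possible k-mers, in a fixed order."""
    counts = kmer_counts(sequence, k)
    # k-mers containing non-standard residues are simply not represented.
    return [counts["".join(kmer)] for kmer in product(AMINO_ACIDS, repeat=k)]


sequence = "MKVLAAGMKV"  # made-up peptide
dipeptides = kmer_counts(sequence, 2)
vector = kmer_vector(sequence, 2)
print(dipeptides.most_common(2), len(vector))
```

The dense vector grows as 20^k (3.2 million dimensions at k = 5), so a sparse representation such as the `Counter` itself is usually the practical choice for larger k.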

C-297: Understanding Earth’s Ecosystems with Machine Learning
Track: Function
  • Marcin Joachimiak, Lawrence Berkeley National Laboratory, United States
  • Ziming Yang, Brookhaven National Laboratory, United States
  • William Riehl, Lawrence Berkeley National Laboratory, United States
  • Chris Neely, Lawrence Berkeley National Laboratory, United States
  • Prachi Gupta, Lawrence Berkeley National Laboratory, United States
  • Sean Jungbluth, Lawrence Berkeley National Laboratory, United States
  • Adam Arkin, Lawrence Berkeley National Laboratory, United States
  • Paramvir Dehal, Lawrence Berkeley National Laboratory, United States


Presentation Overview:

Our planet consists of a wide array of interconnected ecosystems that are dynamically changing on multiple time scales, and microbes are now recognized as contributing to major environmental effects. In recent years, advances in microbial metagenomics have provided a substantial collection of data about ecosystem features in the form of biological sequences, taxonomy assignments, and functional annotations. Prior work has used data collections with fewer ecosystems, metagenomes, and features, limiting model performance and generalizability. Here we report results from the largest standardized collection of metagenome samples to date. Using rigorous machine learning model training and evaluation approaches, including semantic similarity to assess hierarchical multi-label overlap, we identified the best-performing data type combinations and model parameters. While performance was high on training data cross-validation, our results also show that models trained at different ecosystem classification levels exhibit useful generalizability for classifying metagenome samples from environments unseen by the model. By applying model interpretation methods, we derived a set of metagenome features important for distinguishing 41 widely ranging ecosystems. These key features lead to biological insights for ecosystem properties, better agreement with curated ecosystem classifications, information relevant to unknown functions, and ecosystem networks with relationships not represented in current classifications.

C-298: Annotation of Myriapoda genomes with a new tool: EXOGAP
Track: Function
  • Dorine Merlat, Department of Computer Science, ICube, UMR 7357, University of Strasbourg, Strasbourg, France
  • Gemma Collins, LOEWE Centre for translational biodiversity genomics, Frankfurt am Main, Germany
  • Clément Schneider, LOEWE Centre for translational biodiversity genomics, Frankfurt am Main, Germany
  • Arnaud Kress, Department of Computer Science, ICube, UMR 7357, University of Strasbourg, Strasbourg, France
  • Peter Decker, LOEWE Centre for translational biodiversity genomics, Frankfurt am Main, Germany
  • Ricarda Lehmitz, LOEWE Centre for translational biodiversity genomics, Frankfurt am Main, Germany
  • Miklos Bálint, LOEWE Centre for translational biodiversity genomics, Frankfurt am Main, Germany
  • Odile Lecompte, Department of Computer Science, ICube, UMR 7357, University of Strasbourg, Strasbourg, France


Presentation Overview:

The MetaInvert project aims to understand and characterize the biodiversity of soil invertebrates, which play crucial roles in soil ecosystems but are largely unknown and threatened by human activities. Soil is a major reservoir of biodiversity, hosting 40% of terrestrial species. The project focused on myriapods, a subphylum of small arthropods that includes centipedes, millipedes, symphylans, and pauropods.
We sequenced nearly 300 genomes, including 50 myriapod genomes. To annotate these genomes, we developed a tool called EXOGAP (EXotic Organism Genome Annotation Pipeline). EXOGAP is an automated annotation Snakemake pipeline specifically designed for non-model species with limited available data. It predicts various genetic elements, including protein-coding genes, non-coding genes, pseudogenes, and repetitive elements.
Preliminary results from the project indicate that repetitive elements significantly influence genome size and exhibit different evolutionary dynamics between chilopods and diplopods. Additionally, we defined the pangenome of myriapods and observed distinct gene repertoires between chilopods and diplopods. These findings open avenues for future studies exploring the functional roles of conserved and unique genes in species adaptation to different environments. We will also continue the development of EXOGAP.

C-299: Anti-CRISPR Prediction using Transformer-based Protein Language Model
Track: Function
  • Chan-Seok Jeong, Korea Institute of Science and Technology Information, South Korea


Presentation Overview:

The discovery of anti-CRISPR proteins as natural inhibitors of the CRISPR-Cas system has opened new possibilities for post-translational regulation of the CRISPR-Cas system in various applications. Bioinformatic prediction holds promise for cost-effective screening, but algorithm development is limited by scarce verified anti-CRISPR data and sequence similarity. In this study, we propose a prediction approach based on amino acid sequences, utilizing fine-tuning of a pre-trained Transformer-based protein language model. By adding a classification layer to the Transformer-based model and subsequently training the resulting model on validated anti-CRISPR and putative non-anti-CRISPR prophage datasets, we implement an anti-CRISPR prediction model. Performance evaluation on independent datasets demonstrates our method's superiority over conventional predictors, achieving 2.1 times higher sensitivity at 95% specificity. Notably, our approach significantly improves sensitivity for the AcrIIA7 and AcrIIA9 families by 2.8-fold and 3-fold, respectively. Attention structure analysis reveals the model's ability to identify critical residues associated with anti-CRISPR function, even without explicit training on these regions. The method relies solely on amino acid sequences, eliminating the need for additional feature calculations or pre-filtering steps, making it suitable for large-scale genome investigations.
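"Sensitivity at 95% specificity" means fixing the decision threshold so that at least 95% of negatives are rejected, then reporting the fraction of positives recovered at that threshold. The sketch below shows one way to compute it; the function name, the "score >= threshold means positive" convention, and the scores are assumptions made for illustration and are not the authors' evaluation code.

```python
def sensitivity_at_specificity(pos_scores, neg_scores, target_specificity=0.95):
    """Return (sensitivity, threshold) at the lowest threshold meeting the target.

    A sequence is called positive when its score is >= threshold.
    """
    for threshold in sorted(set(pos_scores) | set(neg_scores)):
        # Specificity: fraction of negatives scoring below the threshold.
        specificity = sum(s < threshold for s in neg_scores) / len(neg_scores)
        if specificity >= target_specificity:
            # Sensitivity: fraction of positives scoring at or above the threshold.
            sensitivity = sum(s >= threshold for s in pos_scores) / len(pos_scores)
            return sensitivity, threshold
    # No observed score is strict enough; nothing would be called positive.
    return 0.0, None


# Made-up classifier scores: 30 negatives and 10 positives.
neg_scores = list(range(30))
pos_scores = [10, 20, 28, 29, 35, 40, 50, 60, 70, 80]
sens, thr = sensitivity_at_specificity(pos_scores, neg_scores)
print(sens, thr)
```

Fixing specificity before comparing predictors keeps the comparison fair when different models produce scores on different scales.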

C-300: hkgfinder: find and classify prokaryotic housekeeping genes for multilocus sequence analysis
Track: Function
  • Anicet Ebou, Laboratoire de Bioinformatique et Biostatistiques, Institut National Polytechnique Félix Houphouët-Boigny, Cote d'Ivoire
  • Dominique Koua, Laboratoire de Bioinformatique et Biostatistiques, Institut National Polytechnique Félix Houphouët-Boigny, Cote d'Ivoire
  • Adolphe Zeze, Laboratoire de Biotechnologies végétales et microbiennes, Institut National Polytechnique Félix Houphouët-Boigny, Cote d'Ivoire


Presentation Overview:

Housekeeping gene prediction in genomic data remains a difficult task. Despite their importance in cellular activities, their use as markers for multilocus sequence analysis, and their role in the taxonomic description of bacteria, there is, to the best of our knowledge, no practical tool to retrieve them quickly and accurately. Although genome and metagenome annotation tools exist and can be run for this purpose, their usefulness is limited by their inefficiency for such a task. We present hkgfinder, a fast and accurate housekeeping gene finder and classifier for the identification of common genes used in multilocus sequence analysis. hkgfinder can run on raw sequences, genomes, and metagenomes. The novel value of this method lies in its ability to directly predict and classify gene sequences into housekeeping gene families with high specificity and sensitivity while also being faster than genome and metagenome annotators on genome and metagenome data. We compare the results of hkgfinder with other methods and show its accuracy and speed. hkgfinder is available as a standalone Python 3 program at https://github.com/Ebedthan/hkgfinder and on https://pypi.org.