Posters - Schedules
Posters Home

View Posters By Category

Poster Presentation Times
Session A: Monday, July 24, between 18:00 CEST and 19:00 CEST
Session B: Tuesday, July 25, between 18:00 CEST and 19:00 CEST
Session C: Wednesday, July 26, between 18:00 CEST and 19:00 CEST

Session A Poster Set-up and Dismantle
Session A Posters set up: Monday, July 24, between 08:00 CEST and 08:45 CEST
Session A Posters dismantle: Monday, July 24, at 19:00 CEST

Session B Poster Set-up and Dismantle
Session B Posters set up: Tuesday, July 25, between 08:00 CEST and 08:45 CEST
Session B Posters dismantle: Tuesday, July 25, at 19:00 CEST

Session C Poster Set-up and Dismantle
Session C Posters set up: Wednesday, July 26, between 08:00 CEST and 08:45 CEST
Session C Posters dismantle: Wednesday, July 26, at 19:00 CEST
Virtual
A-263: Reliable interpretability of deep learning on biological networks
Track: MLCSB
  • Nikolaus Fortelny, University of Salzburg, Austria
  • Wolfgang Esser-Skala, University of Salzburg, Austria


Presentation Overview:

BACKGROUND: Deep learning is powerful, but interpretability remains a challenge. A unique approach for interpretability builds on biological knowledge to construct the computational graph of a neural network such that hidden nodes represent biological entities (pathways). After training, such “biology-inspired” neural networks reveal biological pathways involved in a given process (cancer).

MOTIVATION: Biology-inspired models provide an unprecedented ability to interpret hidden nodes, in contrast to the common approaches that interpret input features. However, critical elements of interpretability remain unsolved. First, the random initialization of weights limits the robustness of interpretations. Second, biases in biological knowledge favor highly connected hidden nodes. Yet, despite their critical relevance, robustness and network biases are largely unstudied.

METHODS: We developed methods to assess and control robustness and network biases, and validated them in state-of-the-art biology-inspired models to evaluate their impact on interpretations.

RESULTS: We demonstrate that controlling both robustness and biases is required for reliable interpretability. We find that the impact of robustness and biases on interpretations depends on the difficulty of the prediction task, and we identify which network biases most affect interpretations. Together, these results reveal critical elements of interpretability that may be relevant beyond the special case of biology-inspired deep learning.
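The pathway-constrained computational graph described in this abstract can be sketched as a masked linear layer, where a membership matrix from prior biological knowledge zeroes out gene-to-pathway connections that do not exist. This is an illustrative toy, not the authors' implementation; the weights, mask, and layer shape are invented for the example.

```python
def pathway_layer(gene_expr, weights, membership):
    # hidden node j represents a pathway; membership[i][j] == 1 only if
    # gene i belongs to pathway j, so the computational graph mirrors
    # prior biological knowledge rather than being fully connected
    n_pathways = len(weights[0])
    return [
        sum(gene_expr[i] * weights[i][j] * membership[i][j]
            for i in range(len(gene_expr)))
        for j in range(n_pathways)
    ]

# two genes, two pathways; gene 0 belongs only to pathway 0, gene 1 to both
membership = [[1, 0], [1, 1]]
weights = [[0.5, 0.5], [1.0, 1.0]]
out = pathway_layer([2.0, 3.0], weights, membership)
```

After training, the activation of each hidden node can then be read as the involvement of the corresponding pathway.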

A-264: Performance comparison between federated and centralized learning with a deep learning model on Hoechst-stained images
Track: MLCSB
  • Damien Alouges, CEA, France
  • Georg Wölflein, University of St Andrews, United Kingdom
  • In Hwa Um, University of St Andrews, United Kingdom
  • David Harrison, University of St Andrews, United Kingdom
  • Ognjen Arandjelović, University of St Andrews, United Kingdom
  • Christophe Battail, CEA, France
  • Stéphane Gazut, CEA, France


Presentation Overview:

Medical data is not fully exploited by machine learning (ML) techniques because privacy concerns restrict the sharing of sensitive information and, consequently, the use of centralized ML schemes. ML models trained only on local data usually fail to reach their full potential owing to low statistical power. Federated Learning (FL) addresses critical issues in the healthcare domain, such as data privacy, and enables multiple contributors to build a common and robust ML model by sharing local learning parameters without sharing data. FL approaches are mainly evaluated in the literature using benchmarks, and the trade-off between accuracy and privacy still needs further study in a realistic clinical context. In this work, part of the European project KATY (GA:101017453), we evaluate this trade-off for a CD3/CD8 cell-staining procedure from Hoechst images. Wölflein et al. developed a deep learning GAN model that synthesizes CD3 and CD8 stains from kidney cancer tissue slides, trained on 473,000 patches (256x256 pixels) from 8 whole-slide images. We modified the training to simulate an FL approach by distributing the learning across 8 clients and aggregating the parameters to create the overall model. We then present the performance comparison between FL and centralized learning.
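The parameter aggregation across clients could, for instance, follow a FedAvg-style weighted average. The sketch below is a generic illustration under that assumption, not the authors' exact aggregation scheme; the function name and the flat parameter-vector representation are simplifications.

```python
def fed_avg(client_params, client_sizes):
    """Weighted average of per-client parameter vectors.

    Each client trains locally and sends back its parameters; the server
    averages them, weighted by each client's number of training samples.
    """
    total = sum(client_sizes)
    dim = len(client_params[0])
    aggregated = [0.0] * dim
    for params, size in zip(client_params, client_sizes):
        weight = size / total
        for i, p in enumerate(params):
            aggregated[i] += weight * p
    return aggregated

# two clients with equal data volumes: the result is the plain mean
merged = fed_avg([[1.0, 2.0], [3.0, 4.0]], [100, 100])
```

Only these aggregated parameters cross institutional boundaries; the underlying images never leave the clients.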

A-265: Multi-Task Graph Neural Network Approach for Spatially Resolved Clonal Profiling
Track: MLCSB
  • Olga Lazareva, Division of Computational Genomics and Systems Genetics, German Cancer Research Center (DKFZ), Germany
  • Ilia Kats, Division of Computational Genomics and Systems Genetics, German Cancer Research Center (DKFZ), Germany
  • Oliver Stegle, Division of Computational Genomics and Systems Genetics, German Cancer Research Center (DKFZ), Germany


Presentation Overview:

The ability to decode genetic and non-genetic heterogeneity of tumors is crucial to understanding tumor formation and overcoming therapy-resistant phenotypes. New technologies have emerged that enable targeted analysis of clonal profiles to understand how tumor clones are organised in tissues, yet they are limited in throughput or restricted to a priori known genetic aberrations. In contrast, computational integration strategies offer greater flexibility and can help bridge spatial transcriptomics data with widely available single-cell sequencing data to map clonal profiles in tissues.
Here, we propose SpaceTree, a deep-learning approach based on a multi-task graph neural network architecture. Briefly, our model takes copy number reference profiles derived from single-cell data as input, which are then aligned to spatial transcriptomics profiles. SpaceTree jointly models spatially smooth cell type and clonal composition, employing a hierarchical tree loss, thereby avoiding the need to discretize clonal profiles or cell types. Our method can be applied to all major spatial omics technologies to augment them with inferred clonal information to study genetic heterogeneity of complex tissues. We illustrate SpaceTree on existing sequencing and Visium data from human breast cancers, successfully identifying clonal profiles corresponding to annotated, morphologically distinct DCIS.

A-266: Gene prioritisation algorithms for target discovery in complex diseases assisted by ML-driven data weighting
Track: MLCSB
  • Luis G. Leal, AI & Digital Research, Novo Nordisk Research Centre Oxford, United Kingdom
  • M. Lisandra Zepeda Mendoza, AI & Digital Research, Novo Nordisk Research Centre Oxford, United Kingdom
  • Thomas Monfeuga, AI & Digital Research, Novo Nordisk Research Centre Oxford, United Kingdom
  • Kasper S. Kjær, Novo Nordisk A/S, Denmark
  • Djordje Djordjevic, AI & Digital Research, Novo Nordisk A/S, Denmark
  • Vivek Das, AI & Digital Research, Novo Nordisk A/S, Denmark
  • Zahra McVey, AI & Digital Research, Novo Nordisk Research Centre Oxford, United Kingdom
  • Enrique M. Toledo, AI & Digital Research, Novo Nordisk Research Centre Oxford, United Kingdom
  • Lena K. Hansson, Definitive Healthcare, Sweden
  • William G Haynes, Novo Nordisk Research Centre Oxford, United Kingdom
  • Robert Kitchen, AI & Digital Research, Novo Nordisk Research Centre Oxford, United Kingdom
  • Ramneek Gupta, Global Drug Discovery, Novo Nordisk Research Centre Oxford, United Kingdom
  • Joanna L. Sharman, AI & Digital Research, Novo Nordisk Research Centre Oxford, United Kingdom


Presentation Overview:

Gene prioritisation is the process of algorithmically ranking genes based on their predicted involvement in a trait of interest. These methods typically integrate diverse layers of biological evidence to generate a ranking from an input gene list, which helps to triage long lists of genes and highlight promising candidates for validation. However, there remain challenges to this process: (1) Understanding an appropriate importance-weighting of biological layers, (2) ensuring that prioritised targets have a balance between disease evidence and novelty, and (3) tailoring ranking algorithms towards specific traits of interest.

Here we present two algorithms for integrating gene-disease evidence and describe their application to drug target discovery. We compile data for cardiometabolic diseases across 7 distinct evidence layers (transcriptomics, proteomics, literature, pathways, in-vitro experiments, knowledge graphs, and human genetics). We discuss mathematical functions for the aggregation of these molecular data and present a machine learning method to weight biological layers using benchmarks from drug target development. We demonstrate this in atherosclerosis and provide a web-tool to perform prioritisation of user-defined gene lists. Finally, we demonstrate via in-vitro assays that experimentally validated genes rank significantly higher in our atherosclerosis lists, and we reveal the most informative evidence layers for a particular assay.

A-267: Pirat: Peptide-level imputation for randomly truncated proteomic data
Track: MLCSB
  • Lucas Etourneau, Univ. Grenoble Alpes, CNRS, CEA, Inserm, ProFI FR2048 & TIMC, Grenoble INP, Grenoble, France
  • Laura Fancello, Univ. Grenoble Alpes, CNRS, CEA, Inserm, ProFI FR2048, Grenoble, France
  • Nelle Varoquaux, TIMC, Univ. Grenoble Alpes, CNRS, Grenoble INP, Grenoble, France
  • Thomas Burger, Univ. Grenoble Alpes, CNRS, CEA, Inserm, ProFI FR2048, Grenoble, France


Presentation Overview:

Mass spectrometry-based proteomics is the standard method for identifying and quantifying proteins in biological samples at large scale. However, the analysis of the resulting high-dimensional data frames is often hampered by a significant proportion of missing values (MVs), many of which are missing not at random (MNAR). To impute them, a promising approach is to rely on the transcriptomic analysis of the same or related samples. However, although methods exist to infer protein abundances from transcriptomic levels alone, transcriptome-informed imputation of proteomic data has not been proposed yet. Here, we propose a novel imputation method, Pirat, with two major novelties. First, it copes with MNAR values by estimating a global parametric censoring mechanism that improves imputation. Second, it allows for integration of transcriptomic data at the gene level. On two different datasets, we show that our method outperforms state-of-the-art imputation techniques. On another dataset, we show that integrating transcriptomic data can help reduce imputation errors for proteins with insufficient analytical coverage.

A-268: Using low-rank tensor formats to enable computations of cancer progression models in large state spaces
Track: MLCSB
  • Simon Pfahler, University of Regensburg, Germany
  • Peter Georg, University of Regensburg, Germany
  • Y. Linda Hu, University of Regensburg, Germany
  • Stefan Vocht, University of Regensburg, Germany
  • Rudolf Schill, University of Regensburg, Germany; ETH Zürich, Switzerland
  • Andreas Lösch, University of Regensburg, Germany
  • Kevin Rupp, ETH Zürich, Switzerland
  • Stefan Hansch, University Hospital Regensburg, Germany
  • Maren Klever, RWTH Aachen University, Germany
  • Lars Grasedyck, RWTH Aachen University, Germany
  • Rainer Spang, University of Regensburg, Germany
  • Tilo Wettig, University of Regensburg, Germany


Presentation Overview:

Cancer progresses by accumulating genomic events whose chronological order is key to understanding the disease. However, for most examined tumors, only their state at one point in time is available. To model progression processes from such cross-sectional data, Mutual Hazard Networks (MHNs) are currently the state of the art. Using Machine Learning techniques, they optimize a model to best fit the data. A major limitation of MHNs is that their computational complexity scales exponentially with the number of events, rendering calculations using more than 25 active events per sample infeasible. This restriction is severe since there are hundreds of genes known to be involved in cancer progression. Modern tensor formats like Tensor Trains break this curse of dimensionality by allowing for efficient handling of high-dimensional tensors. We explain how the Tensor Train formalism is used to accelerate MHN computations, leading to polynomial scaling in the number of events. By comparing MHNs obtained using classical and Tensor Train calculations, we also show this speed-up experimentally. Our work thus enables MHNs to infer dependencies between genomic events even when many events are considered. This allows for more comprehensive cancer progression models to be constructed, leading to a better understanding of the disease.
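To see why the Tensor Train format breaks the curse of dimensionality, one can compare parameter counts: a full joint distribution over n binary events needs 2^n entries, while a tensor train needs only n cores whose size is polynomial in n. The sketch below assumes binary events and a single fixed TT rank, which simplifies the actual format used for MHNs.

```python
def full_tensor_params(n_events):
    # a joint distribution over all 2^n binary tumor states
    return 2 ** n_events

def tensor_train_params(n_events, rank):
    # n cores: inner cores of shape (rank, 2, rank), boundary cores (1, 2, rank)
    inner = max(n_events - 2, 0) * rank * 2 * rank
    boundary = 2 * (2 * rank)
    return inner + boundary

# 25 binary events already need 2**25 (about 3.4e7) entries as a full tensor,
# while a rank-8 tensor train for 100 events needs only on the order of 1e4
full_25 = full_tensor_params(25)
tt_100 = tensor_train_params(100, rank=8)
```

The count grows linearly in the number of events and quadratically in the rank, which is what makes computations with hundreds of events tractable.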

A-269: Deep learning predictor of the protein-mediated chromatin looping from DNA sequence
Track: MLCSB
  • Mateusz Chiliński, Warsaw University of Technology, Poland
  • Anup Halder, Warsaw University of Technology, Poland
  • Yijun Ruan, Zhejiang University, China
  • Dariusz Plewczynski, Warsaw University of Technology, Poland


Presentation Overview:

Chromatin looping is a key mechanism in genome organization that regulates gene expression as well as multiple nuclear functions. In our work, we asked whether we can recreate ChIA-PET experiments, which aim at the detection of protein-mediated chromatin loops. We used DNABERT (a BERT model trained on DNA sequences), as well as a few classical machine learning models, namely SVMs, RFs, and KNN. The outputs of these classifiers were then combined using a deep hybrid learning approach. We tested our approach on the GM12878 cell line and the protein factors CTCF and RNAPOL2. As the positive set for training, we selected the sequences of anchors connected to each other; as the negative set, we selected random locations in the genome within a comparable range.

A-270: Elucidating Gene-Environment Relationships using Deep Learning
Track: MLCSB
  • Rajeeva Lokshanan Reguna Madhan, IIT Madras, India
  • Himanshu Sinha, IIT Madras, India
  • Nirav Bhatt, IIT Madras, India


Presentation Overview:

One of the major goals of genetics over the past century has been to understand the genotype-phenotype relationship, i.e. how the genotype of an organism affects its fitness in a given chemical environment. Traditionally, this goal has been pursued by experimentally determining the fitness of different genotypes of an organism (such as Saccharomyces cerevisiae) in various growth conditions. Modern machine learning and deep learning tools can help us understand the inner workings of this relationship while avoiding prohibitively expensive experiments. However, current methods require a separate model for each chemical condition, rendering the process cumbersome. We therefore develop a novel approach: a chemical-agnostic deep-learning model that elucidates the organism's fitness in any chemical condition. To achieve this, we use an autoencoder to obtain a numerical representation of a chemical molecule, which we combine with gene-level variant data in a deep-learning model to predict the organism's fitness. We show that our model predicts the growth rate of yeast strains more accurately than existing approaches in the literature.

A-271: Overcoming Batch Effects in Microbiome Data Analysis: A Review of Classical and Deep Learning Methods
Track: MLCSB
  • Mark Olenik, Leibniz Institute on Aging - Fritz Lipmann Institute (FLI), Germany
  • Melike Dönertaş, Leibniz Institute on Aging - Fritz Lipmann Institute (FLI), Germany


Presentation Overview:

Recent research has highlighted the crucial role of the human microbiome in aging, shedding light on the intricate interplay between gut bacteria and overall health. However, microbiome data analysis poses a significant challenge due to the presence of batch effects that can distort true signals of interest, particularly in longitudinal studies investigating microbiome composition changes over time. This poster discusses common data issues and study design challenges related to batch effects in microbiome data, as well as methods to address them, including ComBat, SVA, and Bayesian Dirichlet-multinomial regression. In addition, the poster explores the potential of deep learning methods such as autoencoders for removing batch effects, which can learn invariant representations of microbiome data while preserving biological signals. The aim of this poster is to provide guidance for researchers interested in using batch effect removal methods and to encourage further development in the field.

A-272: ChemLM: a language-based approach for molecular property prediction
Track: MLCSB
  • Georgios Kallergis, Computational Biology of Infection Research, HZI, Germany; BRICS, Technische Universität Braunschweig, Germany
  • Alice C. McHardy, Computational Biology of Infection Research, HZI, Germany; BRICS, Technische Universität Braunschweig, Germany
  • Ehsaneddin Asgari, Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany
  • Anna Hirsch, Department of Drug Design and Optimization (DDOP), Helmholtz-Institute for Pharmaceutical Research Saarland (HIPS), Germany
  • Martin Empting, Antiviral & Antivirulence Drugs (AVID), Helmholtz-Institute for Pharmaceutical Research Saarland (HIPS), HZI, Germany
  • Behrooz Azarkhalili, AI VIVO, United Kingdom


Presentation Overview:

Deep learning models have been widely used to predict molecular properties. Even though molecules are usually treated as graphs, their SMILES representation allows us to treat structures as sentences consisting of chemical tokens. Taking advantage of that, we developed ChemLM, a deep neural language-based approach for molecular property prediction. Our work relies on the Transformer architecture, which can be trained by efficiently utilising the abundance of unlabelled chemical compounds in public databases. The acquired knowledge can be transferred and, as a result, leads to improved performance when the model is fine-tuned on a specific chemical task. Moreover, we implemented several optimisations in the architecture aiming to enhance its predictive ability. The results show that ChemLM outperforms state-of-the-art graph neural networks on benchmark datasets. In addition, intrinsic evaluation reveals that our model produces representation embeddings that map structures meaningfully in chemical space. Our approach was also applied to experimental compounds designed against the Gram-negative pathogen Pseudomonas aeruginosa, where it proved substantially better than state-of-the-art models at identifying highly potent pathoblockers. Consequently, ChemLM can be successfully used for molecular property prediction.

A-273: Target Specific De Novo Design of Drug Candidate Molecules with Graph Transformer-based Generative Adversarial Networks
Track: MLCSB
  • Atabey Ünlü, Hacettepe University, Turkey
  • Elif Çevrim, Hacettepe University, Turkey
  • Ahmet Sarıgün, Middle East Technical University, Turkey
  • Hayriye Çelikbilek, Hacettepe University, Turkey
  • Heval Ataş Güvenilir, Middle East Technical University, Turkey
  • Altay Koyaş, Middle East Technical University, Turkey
  • Deniz Cansen Kahraman, Middle East Technical University, Turkey
  • Abdurrahman Olgac, Evias Pharmaceutical R&D, Turkey
  • Ahmet Süreyya Rifaioğlu, Heidelberg University, Germany
  • Tunca Dogan, Hacettepe University, Turkey


Presentation Overview:

Discovering novel drug candidate molecules is one of the most fundamental and critical steps in drug development. Generative models offer high potential for designing de novo molecules; however, to be useful in real-life drug development pipelines, these models should be able to design target-specific molecules. In this study, we propose a novel generative system, DrugGEN, for de novo design of drug candidate molecules that interact with selected target proteins. The proposed system represents compounds and protein structures as graphs and processes them via two serially connected generative adversarial networks comprising graph transformers. The system is trained using two million compounds from ChEMBL and target-specific bioactive molecules to design effective and specific inhibitory molecules against the AKT1 protein. DrugGEN shows competitive or better performance against other methods on fundamental benchmarks. To assess target-specific generation performance, we conducted further in silico analysis with molecular docking. The results indicate that the de novo molecules have high potential for interacting with the AKT1 protein structure, at the level of its native ligand. DrugGEN can be used to design novel and effective target-specific drug candidate molecules for any druggable protein, given the target features and a dataset of known bioactive molecules.

A-274: Deep Learning for Antibody Property Prediction: A Cost and Time-Efficient Alternative to Traditional Lab Experiments
Track: MLCSB
  • Daniel Fabian, Lonza, United Kingdom


Presentation Overview:

Defining antibody properties is fundamental to optimizing their manufacturability in large bioreactors. However, property assessment often requires time-consuming and expensive lab experiments. A potential solution is offered by machine learning, which could replace tedious lab assays with accurate computational predictions, thereby significantly reducing time and cost. Here, we develop a deep learning model trained on ~40 million amino acid sequences that transforms sequences into numerical embeddings capturing fundamental features of proteins through unsupervised learning. We next fine-tune this model by further training it with thousands of antibody sequences to improve antibody embeddings. Finally, we employ machine learning to link antibody embeddings with lab measurements, such as viscosity and aggregation, to obtain a final prediction model. Our work highlights that certain antibody properties can successfully be predicted from amino acid sequence but, contrary to suggestions in the literature, a large sample size is necessary to predict these properties accurately.

A-275: Exploring Explainability and its Limits in Machine Learning in Gene Expression Models
Track: MLCSB
  • Myriam Bontonou, LBMC, ENS de Lyon, CNRS, UMR5239, Inserm U1293, Univ Lyon, Lyon, France
  • Jean-Michel Arbona, LBMC, ENS de Lyon, CNRS, UMR5239, Inserm U1293, Univ Lyon, Lyon, France
  • Benjamin Audit, Univ Lyon, ENS de Lyon, CNRS, Laboratoire de Physique, F-69342 Lyon, France
  • Pierre Borgnat, Univ Lyon, ENS de Lyon, CNRS, Laboratoire de Physique, F-69342 Lyon, France


Presentation Overview:

Systematic profiling of cellular activity by genomic techniques such as RNA sequencing allows neural networks to be trained to classify cell samples into healthy and diseased states with high reliability. The rules learned by the networks promise to be informative about the cellular mechanisms that are deregulated during disease development.

To explore these rules, we study the relevance of the Integrated Gradients (IG) explainability method, which highlights the input features contributing the most to individual predictions. We propose several metrics, as well as a simulation model of gene expression data based on Latent Dirichlet Allocation, to evaluate the explanation power of the highlighted genes. We analyse the explanation patterns obtained on neural networks trained to classify normal and tumour tissues from gene expression data of The Cancer Genome Atlas and of associated simulations. The discriminating information appears to be dispersed over a large number of genes, which vary from one sample to another. Ranking features by IG importance is not enough to robustly identify biomarkers.

These results call into question the use of explainability on a gene-by-gene basis and suggest that this notion should be defined at the level of a group of genes.
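For readers unfamiliar with Integrated Gradients: the attribution for feature i is (x_i − b_i) times the gradient of the model averaged along a straight path from a baseline b to the input x. A minimal Riemann-sum sketch on a toy analytic model (not a gene-expression network; the model and values are invented for illustration) is:

```python
def integrated_gradients(grad_f, x, baseline, steps=50):
    # Riemann approximation of IG_i = (x_i - b_i) * integral over a in [0, 1]
    # of the partial derivative of f at (b + a * (x - b))
    n = len(x)
    avg_grad = [0.0] * n
    for k in range(1, steps + 1):
        a = k / steps
        point = [b + a * (xi - b) for xi, b in zip(x, baseline)]
        g = grad_f(point)
        for i in range(n):
            avg_grad[i] += g[i] / steps
    return [(xi - b) * ag for xi, b, ag in zip(x, baseline, avg_grad)]

# toy model f(x) = x0**2 + 2*x1 with analytic gradient (2*x0, 2)
attr = integrated_gradients(lambda p: [2 * p[0], 2.0],
                            x=[1.0, 1.0], baseline=[0.0, 0.0])
# completeness axiom: attributions approximately sum to f(x) - f(baseline) = 3
```

In practice the gradient comes from backpropagation through the trained network, and the baseline is a reference sample such as an all-zero expression profile.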

A-276: Iterative Data Augmentation of Near-Boundary Negative Samples Improves Model Generalizability in Compound-Protein Interaction Prediction
Track: MLCSB
  • Takuto Koyama, Graduate School of Medicine, Kyoto University, Japan
  • Shigeyuki Matsumoto, Graduate School of Medicine, Kyoto University, Japan
  • Hiroaki Iwata, Graduate School of Medicine, Kyoto University, Japan
  • Ryosuke Kojima, Graduate School of Medicine, Kyoto University, Japan
  • Yasushi Okuno, Graduate School of Medicine, Kyoto University, Japan


Presentation Overview:

Identifying compound-protein interactions (CPIs) is a critical component of drug discovery. Since experimental validation of CPIs is time-consuming and costly, machine-learning-based CPI prediction is expected to be a powerful approach to facilitate the process. However, prediction generalizability is often hindered by data imbalances attributed to a lack of experimentally validated inactive (negative) samples. Thus, data augmentation with negative samples is essential for constructing efficient CPI prediction models.
In this study, we explored effective negative sampling methods, demonstrating that augmenting the near-boundary samples defined by a CPI prediction model, i.e., samples with ambiguous prediction scores, significantly improved the model's generalizability. Investigation of the parameters defining “near boundary” indicated that negative samples with ambiguous prediction scores are more informative than those distant from the positive CPIs. Furthermore, iterative data augmentation can improve model performance, as it appears beneficial for defining an accurate decision boundary. Our study provides guidelines for improving CPI prediction on real-world data, thereby facilitating the drug discovery process.
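The core selection step, keeping candidate negatives whose prediction scores are ambiguous, can be sketched as a simple filter. The score range and toy predictor below are illustrative assumptions, not the study's actual parameters or model.

```python
def near_boundary_negatives(candidates, predict, low=0.4, high=0.6):
    # keep only candidate compound-protein pairs whose predicted interaction
    # score falls near the decision boundary (ambiguous predictions)
    return [c for c in candidates if low <= predict(c) <= high]

# toy predictor: scores keyed by candidate pair id (hypothetical values)
scores = {"pair_a": 0.05, "pair_b": 0.45, "pair_c": 0.55, "pair_d": 0.95}
selected = near_boundary_negatives(list(scores), scores.get)
```

In an iterative scheme, the selected pairs would be added as negatives, the model retrained, and the filter re-applied with the updated scores.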

A-277: A Hypernetwork-Based Approach for Analysis of Metagenomic Datasets
Track: MLCSB
  • Witold Wydmański, Jagiellonian University, Poland
  • Oleksii Bulenok, Jagiellonian University, Poland
  • Dagmara Błaszczyk, Jagiellonian University, Poland
  • Valentyn Bezshapkin, Małopolska Centre of Biotechnology, Poland
  • Paweł P. Łabaj, Małopolska Centre of Biotechnology of Jagiellonian University, Poland
  • Marek Śmieja, Jagiellonian University, Poland


Presentation Overview:

Metagenomic data analysis is essential for understanding microbial communities in various environments, but small sample sizes in these datasets limit the application of machine learning techniques. Current deep learning methods have limitations in exploiting small tabular datasets. In this study, we investigate HyperTab, a hypernetwork-based approach designed for small sample problems in tabular datasets, like those derived from metagenomic data.

HyperTab combines the advantages of Random Forests and neural networks by generating an ensemble of specialized neural networks, each processing a specific lower-dimensional view of the data. This approach effectively augments the data, virtually increasing the number of training samples without altering the number of trainable parameters, thus mitigating the risk of model overfitting. Our method captures complex patterns and interactions among microbial features, improving performance on small-sized datasets compared to traditional techniques.

We demonstrate HyperTab's effectiveness through evaluations on over 40 diverse datasets, including metagenomic ones, showcasing its ability to consistently outperform state-of-the-art shallow and deep learning models on small datasets while maintaining comparable performance on larger datasets.

In summary, HyperTab offers a powerful tool for researchers in microbiology, addressing the challenges of small sample sizes in metagenomic datasets.
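The ensemble-of-views idea can be sketched as follows. This is a simplified stand-in for HyperTab: in the real method a hypernetwork generates the weights of each per-view model, whereas here the per-view models and all helper names are invented placeholders.

```python
import random

def make_views(n_features, n_views, view_size, seed=0):
    # each "view" is a random lower-dimensional subset of feature indices,
    # acting as data augmentation for small tabular datasets
    rng = random.Random(seed)
    return [sorted(rng.sample(range(n_features), view_size))
            for _ in range(n_views)]

def ensemble_predict(x, views, view_models):
    # each specialized model scores only its own feature subset;
    # the per-view scores are averaged into the final prediction
    scores = [m([x[i] for i in v]) for v, m in zip(views, view_models)]
    return sum(scores) / len(scores)

views = make_views(n_features=10, n_views=3, view_size=4)
# constant stand-in models; HyperTab would generate these from a hypernetwork
prediction = ensemble_predict(list(range(10)), views, [lambda sub: 1.0] * 3)
```

Because every training sample contributes one example per view, the effective number of training examples grows with the number of views without adding trainable parameters.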

A-278: Non-linear Dimensionality Reduction of Genotype Data using Autoencoders
Track: MLCSB
  • Gizem Taş, Tilburg University, Netherlands
  • Timo Westerdijk, Utrecht University, Netherlands
  • Jan Herman Veldink, University Medical Center Utrecht, Netherlands
  • Alexander Schoenhuth, Bielefeld University, Germany
  • Marleen Balvert, Tilburg University, Netherlands


Presentation Overview:

Genome-wide association studies (GWAS) have been successful in identifying genetic variants associated with heritable phenotypes, including diseases. However, GWAS may not capture all genetic risk, as the loci they identify may only account for a small portion of the genetic variance. Moreover, non-additive interactions between loci can contribute to the missing heritability. Deep neural networks (DNNs) are promising for analyzing high-dimensional genotyping data but suffer from the curse of dimensionality. To address this issue, we propose a haplotype block-based autoencoder, which clusters single nucleotide polymorphisms (SNPs) into haplotype blocks based on linkage disequilibrium and trains per-block autoencoders. This method compresses high-dimensional genotyping data while preserving complex and non-linear patterns within the data. We apply this method to genotyping data from Project MinE, including 23,209 amyotrophic lateral sclerosis cases and 90,249 healthy controls. We evaluate the ability of the haplotype block-based autoencoders to extract a latent representation of the information contained within the haplotype blocks and quantify the degree to which the autoencoders are able to compress the blocks. We also investigate the standardization of model configuration for each haplotype block. Overall, the haplotype block-based autoencoder method shows promise for compressing high-dimensional genotyping data while preserving genetic information.
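The block-wise compression pipeline can be sketched schematically: split the SNP vector into LD-derived blocks, then encode each block independently. The "encoder" below is a trivial stand-in for a trained per-block autoencoder, and all names and values are illustrative.

```python
def split_into_blocks(genotypes, block_boundaries):
    # block_boundaries: (start, end) index pairs from LD-based clustering
    return [genotypes[s:e] for s, e in block_boundaries]

def encode_blocks(blocks, encoders):
    # one (hypothetical, already-trained) encoder per haplotype block;
    # each maps its block to a shorter latent vector, and the latent
    # vectors are concatenated into the compressed genotype representation
    latent = []
    for block, encoder in zip(blocks, encoders):
        latent.extend(encoder(block))
    return latent

# toy "encoder": summarize a block by its mean dosage (stand-in for a real AE)
mean_enc = lambda block: [sum(block) / len(block)]
blocks = split_into_blocks([0, 1, 2, 2, 0, 1], [(0, 3), (3, 6)])
latent = encode_blocks(blocks, [mean_enc, mean_enc])
```

Training one small autoencoder per block keeps each model's input dimension manageable while the concatenated latent vector still covers the whole genome.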

A-279: On the use of cell type hierarchies for annotation of single-cell transcriptomics data
Track: MLCSB
  • Lauren Theunissen, VIB Center for Inflammation Research, Ghent University, Belgium
  • Thomas Mortier, Ghent University Faculty of Bioscience Engineering, Belgium
  • Willem Waegeman, Ghent University Faculty of Bioscience Engineering, Belgium
  • Yvan Saeys, VIB Center for Inflammation Research, Ghent University Faculty of Computer Science and Statistics, Belgium


Presentation Overview:

Cell type annotation is a crucial step in single-cell analysis. The relationships between cell types are often hierarchical in nature, and several supervised annotation tools based on hierarchical classification exist. The idea is that hierarchical classification can make more fine-grained distinctions and provide more information in cases of uncertainty thanks to its multi-level labelling. However, an actual evaluation of the added benefits of hierarchical classification for cell type annotation has yet to be performed. In this research, we make a comprehensive comparison between flat and hierarchical classification, both in terms of annotation performance and annotation certainty. We investigate the rejection behaviour of the hierarchical model and whether we can recover relevant biological signals that drive the model's decisions. Our results show that hierarchical classification matches the annotation performance of flat classification and sometimes even improves on it. However, hierarchical classification slightly worsens probability calibration, so applying probability calibration methods prior to classification is advised. Furthermore, we observe that the rejection behaviour of the implemented classifiers differs substantially and should be taken into account in the final model set-up. Lastly, we can recover relevant biological signals in the hierarchical classifiers that increase annotation interpretability.

A-280: Re-discovering Triplet Loss for metagenomics in forensics – microbiome-based inference of soil sample origin
Track: MLCSB
  • Michał Kowalski, Jagiellonian University, Poland
  • Alina Frolova, The Institute of Molecular Biology and Genetics of NASU, Ukraine
  • Kamila Marszałek, Jagiellonian University, Poland
  • Kinga Herda, Jagiellonian University, Poland
  • Agata Jagiełło, Central Forensic Laboratory of the Police, Poland
  • Anna Woźniak, Central Forensic Laboratory of the Police, Poland
  • Kaja Milanowska-Zabel, Ardigen S.A., Poland
  • Andrzej Ossowski, Pomeranian Medical University, Poland
  • Rafał Płoski, Medical University of Warsaw, Poland
  • Wojciech Branicki, Institute of Zoology and Biomedical Research of the Jagiellonian University, Poland
  • Renata Zbieć-Piekarska, Central Forensic Laboratory of the Police, Poland
  • Paweł P. Łabaj, Małopolska Centre of Biotechnology of Jagiellonian University, Poland


Presentation Overview: Show

Microbiome and microbiome-associated data have been successfully applied in forensics. However, as MetaSUB and CAMDA have shown, the full potential of metagenomic data is yet to be unveiled. The aim of the SMAFT project was therefore to develop a complete (wet-lab + dry-lab) solution to be applied by forensic laboratories of the police in Poland. Here we focus on the ‘dry-lab’ challenges of this success story.
Samples were collected from 80 selected locations (12 meta-locations) across four seasons, in triplicate, resulting in ~1000 samples in total. They were profiled with WMS, with over 100M read pairs per sample. The metagenomic features (unitigs) were engineered using MetaGraph, and the dimensionality was reduced using Shannon entropy and Boruta. We ended up with 1,015 unitigs, which were used to design the TMS panel and served as the data source for the classification/prediction model.
The triplet-loss-based solution takes the count table for the 1,015 unitigs and reduces its dimensionality to a few hundred Euclidean embedding dimensions, which are fed to a deep neural network to obtain probabilities for the origin of the sample. The constructed system was validated on 3 datasets, obtaining the following performance metrics: Label Ranking Average Precision: 0.87; Weighted F1: 0.86; Balanced Accuracy: 0.94.
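For readers unfamiliar with the triplet loss at the core of this design, a minimal sketch (toy embeddings, not the SMAFT model) is:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: pull samples from the same location together,
    push samples from different locations at least `margin` apart."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return np.maximum(0.0, d_pos - d_neg + margin)

# Toy embeddings for one triplet of soil samples.
anchor   = np.array([0.1, 0.9, 0.2])
positive = np.array([0.2, 0.8, 0.1])  # same sampling location
negative = np.array([0.9, 0.1, 0.7])  # different location
print(triplet_loss(anchor, positive, negative))
```

When the negative is already further from the anchor than the positive by more than the margin, the loss is zero and the triplet contributes no gradient.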

A-281: SpatialDDLS: a deep learning-based algorithm to deconvolute spatial transcriptomics data
Track: MLCSB
  • Diego Mañanes, Centro Nacional de Investigaciones Cardiovasculares, Spain
  • Inés Rivero-García, Centro Nacional de Investigaciones Cardiovasculares, Spain
  • Daniel Jiménez-Carretero, Centro Nacional de Investigaciones Cardiovasculares, Spain
  • Miguel Torres, Centro Nacional de Investigaciones Cardiovasculares, Spain
  • David Sancho, Centro Nacional de Investigaciones Cardiovasculares, Spain
  • Carlos Torroja, Centro Nacional de Investigaciones Cardiovasculares, Spain
  • Fátima Sánchez-Cabo, CNIC, Spain


Presentation Overview: Show

Spatial transcriptomics technologies have revolutionized our ability to investigate biological processes by opening a new, unbiased way to understand tissue structure and cellular organization. Rather than studying cells as isolated entities, they incorporate the spatial dimension while preserving the powerful information provided by whole-transcriptome sequencing. However, due to limitations in their resolution, it remains challenging to disentangle the contribution of different cell types to individual measurements. To address this issue, we introduce SpatialDDLS, a fast neural network-based algorithm for cell type deconvolution of spatial transcriptomics data. SpatialDDLS utilizes single-cell RNA sequencing (scRNA-seq) data to simulate mixed transcriptional profiles with known cell composition, which are then used to train a fully connected neural network to uncover cell type diversity within each spot. To demonstrate its performance, we benchmarked SpatialDDLS against two state-of-the-art spatial deconvolution methods, cell2location and RCTD, on murine lymph node samples upon stimulation. Our algorithm accurately reproduced known cell type location patterns and produced results comparable to existing methods, making it a competitive alternative for deconvolution of spatial transcriptomics data. SpatialDDLS is available as an R package via CRAN (The Comprehensive R Archive Network).
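The simulation-based training idea (mixing single-cell profiles into pseudo-spots of known composition) can be sketched as follows; the reference profiles, spot size, and mixing scheme are toy values, not SpatialDDLS defaults:

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy scRNA-seq reference: mean expression profiles, 3 cell types x 20 genes.
n_types, n_genes = 3, 20
reference = rng.gamma(2.0, 1.0, size=(n_types, n_genes))

def simulate_spot(n_cells=10):
    """Mix randomly drawn cells into one pseudo-spot with known composition."""
    counts = rng.multinomial(n_cells, [1 / n_types] * n_types)
    profile = counts @ reference / n_cells
    return profile, counts / n_cells  # expression, ground-truth proportions

X, P = zip(*(simulate_spot() for _ in range(100)))
X, P = np.asarray(X), np.asarray(P)
print(X.shape, P.shape)  # (expression, proportion) training pairs
```

A network trained to map `X` to `P` can then be applied to real spots whose composition is unknown.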

A-282: Leveraging latent representation in metabolomics in the context of cancer
Track: MLCSB
  • Justine Labory, Université Côte d'Azur, France
  • Evariste Njomgue-Fotso, Université Côte d'Azur, France
  • Youssef Boulaimen, Université Côte d'Azur, France
  • Silvia Bottini, MDLab - MSI - Université Cote d'Azur, France


Presentation Overview: Show

Modern metabolomics experiments yield high-dimensional datasets that profile the molecular phenotype of diseases such as cancer and help identify the underlying pathological mechanisms of action.
Feature selection or feature extraction techniques can be applied to reduce the high dimensionality and imbalance of metabolomics data. Feature selection finds a subset of the original features that maximises the accuracy of a predictive model, while feature extraction refers to techniques that compute a set of representative features summarising the original dataset and its dimensions.
Here, we focused on latent representation learning, a machine learning technique that attempts to infer latent variables from empirical measurements. Latent components are not directly measurable and therefore have to be inferred from the empirical measurements. Several techniques have been developed to infer the latent space, with successful applications on omics data; however, choosing the model that best fits the available data remains very challenging.
Here we compared several latent space representations on metabolomics data for classifying individuals based on their phenotype (i.e. healthy/diseased), as well as a feature selection approach based on biological considerations. Overall, we observed that the combination of feature selection and feature extraction is the best-performing strategy.
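The selection-versus-extraction comparison can be illustrated with a toy sketch (synthetic data, logistic regression as the downstream classifier; nothing here reflects the actual datasets or models of the study):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)

# Synthetic metabolomics matrix: 100 samples x 50 metabolites; a class
# signal is injected into the first 5 features, mimicking a biologically
# selected panel.
y = rng.integers(0, 2, size=100)
X = rng.normal(size=(100, 50))
X[:, :5] += y[:, None] * 1.5

def cv_acc(features):
    """5-fold cross-validated accuracy of a logistic-regression classifier."""
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, features, y, cv=5).mean()

latent = PCA(n_components=10).fit_transform(X)          # feature extraction
selected = X[:, :5]                                     # feature selection
combined = PCA(n_components=3).fit_transform(selected)  # selection + extraction

for name, F in [("latent", latent), ("selected", selected), ("combined", combined)]:
    print(name, round(cv_acc(F), 2))
```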

A-283: Getting Personal with Epigenetics: Towards Individual-specific Epigenomic Imputation with Machine Learning
Track: MLCSB
  • Alex Hawkins-Hooker, University College London, United Kingdom
  • Giovanni Visonà, Max Planck Institute for Intelligent Systems, Tübingen, Germany
  • Tanmayee Narendra, University of Dundee, United Kingdom
  • Mateo Rojas-Carulla, Lakera AI, Switzerland
  • Bernhard Schölkopf, Max Planck Institute for Intelligent Systems, Tübingen, Germany
  • Gabriele Schweikert, University of Dundee, University of Tübingen, United Kingdom


Presentation Overview: Show

Epigenetic modifications are dynamic mechanisms involved in the regulation of gene expression. Unlike the DNA sequence, epigenetic patterns vary not only between individuals, but also between different cell types within an individual. Epigenetic changes are reversible and thus promising therapeutic targets for precision medicine. However, mapping efforts to determine an individual’s cell-type-specific epigenome are constrained by experimental costs and tissue accessibility. We developed eDICE, a deep-learning model that employs attention mechanisms to impute epigenomic tracks. eDICE is trained to reconstruct masked epigenomic tracks within sets of epigenomic measurements derived from large-scale mapping efforts. By learning to encode the epigenomic signal at a particular genomic position into factorised representations of the epigenomic state of each profiled cell type as well as the local activity profile of each epigenomic assay, eDICE is able to generate genome-wide imputations for the signal tracks of assays in cell types for which measurements are currently unavailable. We demonstrate improved performance relative to previous imputation methods on the reference Roadmap epigenomes, and additionally show that eDICE is able to predict individual-specific epigenetic patterns in unobserved tissues when trained on individual-specific epigenomes from ENTEx.

A-284: Machine learning pipeline for protein function annotation – identification of candidate mitochondrial glutathione transporters
Track: MLCSB
  • Luke Kennedy, Department of Biochemistry, Microbiology and Immunology, Faculty of Medicine, University of Ottawa, Ottawa, ON, Canada., Canada
  • Jagdeep Sandhu, University of Ottawa, Ottawa, ON, Canada. National Research Council Canada, Ottawa, ON, Canada., Canada
  • Mary-Ellen Harper, Department of Biochemistry, Microbiology and Immunology, Faculty of Medicine, University of Ottawa, Ottawa, ON, Canada., Canada
  • Miroslava Cuperlovic-Culf, University of Ottawa, Ottawa, ON, Canada. National Research Council Canada, Ottawa, ON, Canada., Canada


Presentation Overview: Show

Mitochondria are a metabolic hub within cells and rely on the activity of many inner mitochondrial membrane (IMM) transport proteins. Glutathione (GSH) is essential for redox control and is found at millimolar concentrations within mitochondria. Because it is synthesized in the cytoplasm, GSH must be transported into mitochondria. The solute carrier family 25 (SLC25) proteins comprise the main mitochondrial transport family, yet the functions of approximately 30% of the 53 human SLC25 proteins are unknown, and the functions of many others are only partially characterized.
In this work, we sought to develop a pipeline of machine learning-based hybrid approaches to better characterize mitochondrial GSH (mGSH) and the role of SLC25 proteins in its transport, through analysis of transcriptomics, metabolomics, and knowledge-based data.
Three classification models, categorizing genes by their mitochondrial localization, role in GSH metabolism, and membrane transport function, were developed and combined into an overall transporter score. Combining multi-omics data with knowledge yielded the best-performing models, with average AUC values of ~0.9. Several orphan SLC25 proteins were ranked highly for their likelihood of functioning as mGSH transporters and were further validated in silico.
The approach presented here provides a novel way of identifying previously unknown protein functions, which could guide subsequent experimental validation.
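How three per-criterion classifier probabilities might be combined into a single transporter score can be sketched as follows; the geometric-mean combination rule, the gene names, and the numbers are purely hypothetical illustrations, not the study's scoring scheme:

```python
import numpy as np

# Hypothetical per-gene probabilities from the three component classifiers
# (mitochondrial localization, GSH-metabolism involvement, membrane transport).
scores = {
    "geneA": {"mito": 0.95, "gsh": 0.80, "transport": 0.90},
    "geneB": {"mito": 0.20, "gsh": 0.60, "transport": 0.30},
}

def transporter_score(s):
    """Geometric mean of the three probabilities: rewards genes that
    score well on all criteria simultaneously."""
    vals = np.array([s["mito"], s["gsh"], s["transport"]])
    return float(vals.prod() ** (1.0 / len(vals)))

# Rank candidate genes by their combined score, highest first.
ranked = sorted(scores, key=lambda g: transporter_score(scores[g]), reverse=True)
print(ranked)
```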

A-285: Overview of DeCovarT, a holistic method for the deconvolution of heterogeneous transcriptomic samples
Track: MLCSB
  • Bastien Chassagnol, LPSM (Laboratoire de Probabilités, Statistiques et Modélisation), Sorbonne Université, France
  • Yufei Luo, IDRS, Les Laboratoires Servier, France
  • Gregory Nuel, LPSM (Laboratoire de Probabilités, Statistiques et Modélisation), CNRS 8001, Sorbonne Université, France
  • Etienne Becht, IRIS, Les Laboratoires Servier, France


Presentation Overview: Show

Although bulk transcriptomic analyses have greatly contributed to a better understanding of complex diseases, their sensitivity is hampered by the highly heterogeneous cellular composition of biological samples. To address this limitation, computational deconvolution methods have been designed to automatically recover individual features of the cell populations that make up tissues.
However, they perform poorly at differentiating closely related cell populations. We hypothesised that integrating pairwise genetic interactions could improve the performance of deconvolution algorithms, and therefore developed a new tool, DeCovarT, which takes the structure of the transcriptomic network into account.
To do so, we represent the set of transcriptomic interactions using a sparse network structure inferred with the gLasso estimator. We then virtually reconstruct the bulk profile using a probabilistic approach. Finally, our method returns the estimated cell ratios that maximise the probability of obtaining the global transcriptomic profile, given the individual cell profiles. Specifically, in our framework, we compute the associated maximum log-likelihood by first rewriting the log-likelihood to remove the simplex constraints and then optimising the reparametrised function with the Levenberg-Marquardt algorithm. We thus obtain an estimator that incorporates the simplex constraint and allows us to derive a multidimensional CLT on the distribution of the estimated ratios.
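The reparameterization step that removes the simplex constraint can be illustrated with a softmax map; this is a common choice for such reparameterizations, assumed here for illustration rather than taken from the abstract:

```python
import numpy as np

def softmax(z):
    """Map unconstrained reals to the probability simplex."""
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

# Cell-type proportions must lie on the simplex (positive, summing to one).
# Optimising over unconstrained z and mapping through softmax removes the
# constraint, so a generic optimiser (e.g. Levenberg-Marquardt) can be used.
z = np.array([0.3, -1.2, 0.5])
p = softmax(z)
print(p, p.sum())
```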

A-286: Exploring the Insulin-Resistance Molecular Landscape with Knowledge Graph Embeddings
Track: MLCSB
  • Tankred Ott, AI & Analytics Centre of Excellence. Novo Nordisk, Copenhagen, Denmark, Denmark
  • Marc Boubnovski Martell, Machine Intelligence, AI & Digital Research, Novo Nordisk Research Centre Oxford, Oxford, UK., United Kingdom
  • Viktor Sandberg, AI & Analytics Centre of Excellence. Novo Nordisk, Copenhagen, Denmark, Denmark
  • Tonia Sideri, AI & Analytics Centre of Excellence. Novo Nordisk, Copenhagen, Denmark, Denmark
  • Robert Kitchen, Systems Biology & Target Discovery, AI & Digital Research, Novo Nordisk Research Centre Oxford, Oxford, UK., United Kingdom
  • Jesper Ferkinghoff-Borg, Machine Intelligence, AI & Digital Research, Novo Nordisk, Copenhagen, Denmark., Denmark
  • Ramneek Gupta, Global Drug Discovery, Novo Nordisk Research Centre Oxford, Oxford, UK., United Kingdom
  • Marie Lisandra Zepeda Mendoza, Novo Nordisk Research Centre Oxford, United Kingdom


Presentation Overview: Show

Knowledge graphs (KGs) organize complex biomedical relational information from diverse sources. KG embeddings (KGEs) are vector representations of the entities in a KG and can be used to gain a better understanding of disease biology and identify novel drug targets. Several KG- and KGE-based approaches exist for discovering gene-disease associations; however, disease-agnostic approaches for complex phenotypes, such as insulin resistance (IR), are lacking.
We hypothesized, and confirmed, that gene KGEs effectively represent the genes' biological information, making genes related to IR distinguishable from unrelated ones: gene KGEs related to IR cluster in the embedding space, and those furthest from this cluster are unrelated to IR. We then developed metrics for disease-agnostic drug-target candidate prioritization for further in vitro evaluation. We validate our metrics by comparing the functional annotations of the genes closest to and furthest from the IR-related positive gene set. To define the IR negative gene set, we use positive-unlabelled machine learning approaches that account for the data imbalance.
Our workflow will help in understanding the complex IR molecular landscape and in identifying potential drug targets. Our flexible methodology can be extended to other complex phenotypes (e.g., fibrosis and inflammation), making it a valuable tool for disease-agnostic drug-target discovery.

A-287: Classifying mode of action by applying machine learning on flow cytometry data for antibiotic research in Pseudomonas aeruginosa
Track: MLCSB
  • David Dylus, F. Hoffmann-La Roche Ltd, Switzerland
  • Luise Wolf, F. Hoffmann-La Roche Ltd, Switzerland
  • Sebastien Rigo, F. Hoffmann-La Roche Ltd, Switzerland
  • Doris Berchtold, F. Hoffmann-La Roche Ltd, Switzerland


Presentation Overview: Show

Pseudomonas aeruginosa (PA) is ranked by the World Health Organisation as a critical gram-negative bacterium resistant to multiple antibiotics. Novel compounds that have an antibiotic effect on PA and are well characterized with regard to their mode of action (MoA) are urgently needed. The characterization of compound MoA is often laborious and difficult to scale up for screening. For this reason, we developed a flow cytometry assay that uses a panel of selected fluorescent probes at various antibiotic concentrations in conjunction with machine learning. In total, we collected features for 24 known antibiotics across 12 MoAs. Using these features, we first applied unsupervised machine learning and found that most antibiotics belonging to the same MoA class group together in a concentration-dependent manner. We then built a classifier and showed, in a leave-one-out experimental setup, good classification accuracy for certain MoAs but not for others. Finally, using these data, we re-iterated our approach and devised a more scalable setup that incorporates a reduced set of fluorescent probes and a reduced number of concentrations, potentially allowing more antibiotics to be tested in a single run.

A-288: Machine Learning for antibody target classification from sequence information
Track: MLCSB
  • Sara Joubbi, University of Pisa, Italy
  • Giuseppe Maccari, Toscana Life Sciences Foundation, Italy
  • Duccio Medini, Toscana Life Sciences Foundation, Italy
  • Alessio Micheli, University of Pisa, Italy
  • Paolo Milazzo, University of Pisa, Italy


Presentation Overview: Show

Antibodies play a critical role in our immune system and are utilized extensively in research and clinical settings due to their high specificity and affinity towards target molecules. Specificity is particularly crucial in ensuring that the antibody binds to the intended target, and identifying the nature of that target is one of the first pieces of experimental evidence needed for antibody characterization.
As wet-lab techniques are time-consuming and expensive, every piece of information that can improve the screening strategy is of utmost importance. In particular, identifying the molecular nature of the target can be crucial for prioritizing potential candidates for further in-vitro characterization. Machine learning methods have proven effective for various protein sequence tasks. Here, we propose several methods for classifying antibodies based on their target molecular class, distinguishing between protein and non-protein targets. To our knowledge, this is the first study utilizing machine learning methods for this classification task. Despite the lack of a large antibody dataset, our model achieves promising accuracy and serves as a baseline for future developments.

A-289: PRIOR KNOWLEDGE ENHANCED GNNs FOR ALZHEIMER'S DISEASE PREDICTIVE MODELING AND THERAPEUTIC TARGET DISCOVERY
Track: MLCSB
  • Stephen Keegan, The Jackson Laboratory, United States
  • Rohit K. Tripathy, The Jackson Laboratory, United States
  • Hong Wang, The Jackson Laboratory, United States
  • Greg Cary, The Jackson Laboratory, United States
  • Zachary Frohock, The Jackson Laboratory, United States
  • Jesse C. Wiley, Sage Bionetworks, United States
  • Laura Heath, Sage Bionetworks, United States
  • Robert R. Butler III, Stanford University, United States
  • Frank M. Longo, Stanford University, United States
  • Allen I. Levey, Emory University, United States
  • Anna K. Greenwood, Sage Bionetworks, United States
  • Yi Li, The Jackson Laboratory, United States
  • Gregory W. Carter, The Jackson Laboratory, United States


Presentation Overview: Show

Objectives: The TaRget Enablement to Accelerate Therapy Development for AD (TREAT-AD) consortium has the mission of providing tools to the Alzheimer’s research community for the discovery of therapeutic targets.

Methods: Prior knowledge graphs of Biological Domains, clusters of Gene Ontology terms that describe 19 different Alzheimer’s Disease (AD) endophenotypes, are constructed from protein-protein interactions in the Pathway Commons database. A graph neural network (GNN) is trained on transcriptomic expression data and patient clinical outcomes. A weighted Key Driver Analysis (wKDA) is conducted on a per-gene basis for each network. Using the TREAT-AD Target Risk Score as weights, the output is a ranked list of AD relevance per gene per biological domain.

Results: The networks provide a topological structure describing AD endophenotypes. Our wKDA and GNN methods interrogate the networks to assign each AD-relevant gene a likelihood of impact on disease pathogenesis. This narrows the therapeutic target search space by describing the impact of a gene on the heterogeneity of AD.

Conclusions: Therapeutic targets nominated by the blended approach of machine learning and prior knowledge derived graph constructions yield powerful disease relevant drug hypotheses. This approach enables target validation with cell models and assays tailored to a precision medicine strategy.

A-290: Heavy and Light chain pairing preferences in antibodies revealed by Deep Convolutional Neural Networks
Track: MLCSB
  • Dongjun Guo, King's College London, United Kingdom
  • Joseph Ng, University College London, United Kingdom
  • Deborah Dunn-Walters, University of Surrey, United Kingdom
  • Franca Fraternali, University College London, United Kingdom


Presentation Overview: Show

Antibodies are immunoglobulin proteins produced by B cells for defence against invading antigens. They are composed of two heavy (H) chains and two light (L) chains, which assemble into protein complexes to form antibodies. How an H chain chooses its L chain partner is still under debate; previous investigations relied on comparing gene usage in different H-L pairs, with little focus on the amino acid sequence segments that are important in forming H-L protein complexes. Here, we capitalise on recent high-throughput H-L paired antibody repertoire data to train convolutional neural networks (CNNs) to predict the likelihood of H-L pairing. Our model suggests that H-L pairing is not random: we are able to distinguish cognate from random H-L pairs in sequences from the same donors but unseen in training, as well as in cross-donor comparisons and in validations against orthogonal datasets and Protein Data Bank antibody structures. Learning the rules of antibody H-L chain pairing would help us better understand the formation of the antibody repertoire and the determinants of antibody stability, and support the efficient generation of new therapeutic antibodies by pre-excluding unlikely (presumably less stable) H-L combinations to speed up antibody design.

A-291: ProInterVal: Validation of Protein-Protein Interactions through Learned Interface Representations
Track: MLCSB
  • Damla Övek, Koç University, Turkey
  • Attila Gürsoy, Koç University, Turkey
  • Özlem Keskin, Koç University, Turkey


Presentation Overview: Show

Protein-protein interactions (PPIs) play a critical role in many biological processes. Understanding the mechanisms underlying these interactions is essential for developing targeted therapies for diseases. In this study, we present an approach for PPI validation through learned interface representation. Our method utilizes a graph-based contrastive autoencoder combined with a transformer to learn representations of PPI interfaces from a large set of unlabeled data. Then, PPIs are validated through learned representations. We demonstrate the effectiveness of our approach on a benchmark dataset and show that it outperforms existing methods. Our approach provides a promising solution for validating PPIs and has the potential to improve template-based PPI predictions.

A-292: Detecting Blasts with Single Cell Resolution in Acute Myeloid Leukaemia using an Auto-Encoder
Track: MLCSB
  • Alice Driessen, IBM Research & ETH Zürich, Switzerland
  • Susanne Unger, University of Zurich, Switzerland
  • An-Phi Nguyen, IBM Research Zurich, Switzerland
  • Burkhard Becher, University of Zurich, Switzerland
  • Maria Rodriguez Martinez, IBM, Zurich Research Laboratory, Switzerland


Presentation Overview: Show

Acute myeloid leukaemia (AML) is characterised by expansion of immature cells of the myeloid lineage in the bone marrow. Almost half of paediatric AML patients relapse after standard treatment. Personalised medicine—including immunotherapies—could target chemotherapy-resistant cells and achieve long-term remission. However, identifying targets for AML immunotherapy is complicated by patient heterogeneity, complex disease evolution, and similarity of aberrant and developing cells.

Therefore, we aimed to identify malignant cells and place them along the myeloid developmental trajectory using machine learning. We generated single-cell flow cytometry profiles of 20 paediatric AML patients with matched samples at diagnosis, remission, and relapse. With this dataset, we trained an auto-encoder on cells from remission samples (Fig. 1B&2A). We show that the auto-encoder’s latent space captures the healthy developmental trajectory (Fig. 3A). We then projected all samples onto this trajectory and identified their developmental stages (Fig. 3B&C). Furthermore, classifying malignant cells using the auto-encoder's reconstruction error achieved 96% accuracy (Fig. 4B&C). Finally, KMT2A-mutated AMLs changed their phenotype drastically from diagnosis to relapse (Fig. 5A&B).

Summarising, we identify malignant cells and developmental stages in AML samples. We uncover phenotypic changes related to mutation status. Our work could aid researchers investigating immunotherapy targets to improve AML treatment.
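The reconstruction-error idea behind this classification can be sketched with a linear stand-in for the auto-encoder (PCA fitted on "healthy" cells only; all data are synthetic, and the 96% figure above comes from the real model, not this sketch):

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic stand-in for the flow cytometry data: "healthy" cells used for
# training, and phenotypically shifted "malignant" cells.
healthy = rng.normal(size=(500, 10))
malignant = rng.normal(loc=3.0, size=(50, 10))

# A linear "auto-encoder" fitted on healthy cells only (PCA via SVD).
mu = healthy.mean(axis=0)
_, _, vt = np.linalg.svd(healthy - mu, full_matrices=False)
W = vt[:3].T  # 3-dimensional latent space

def reconstruction_error(X):
    Z = (X - mu) @ W          # encode
    X_hat = Z @ W.T + mu      # decode
    return np.linalg.norm(X - X_hat, axis=1)

# Cells that the healthy-trained model reconstructs poorly are flagged.
threshold = np.quantile(reconstruction_error(healthy), 0.95)
flagged = reconstruction_error(malignant) > threshold
print(round(flagged.mean(), 2))  # fraction of malignant cells detected
```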

A-293: Using mutational signatures to predict tissue type in highly mutated cancers
Track: MLCSB
  • Julia Cordes, St Olaf College, United States
  • Jaime Davila, St Olaf College, United States


Presentation Overview: Show

A mutational signature is a distinct pattern of mutations generated by a specific mutational process in cancer. For example, tobacco exposure causes a high number of C-to-A mutations, while UV light induces a high quantity of CC-to-TT mutations. Around three to five percent of all cancers have an unknown primary origin, and inferring their tissue type is crucial for therapeutic purposes. This is of particular importance for highly mutated cancers, which can benefit from immunotherapy. Hence, we leveraged mutational signature contributions to predict the cancer and tissue type of highly mutated tumor samples.
We used the Mutational Signatures v.3.3 from the Catalogue of Somatic Mutations in Cancer (COSMIC). We considered only nine highly mutated cancer types, resulting in a cohort of 1,477 samples. Furthermore, we considered only frequently occurring signatures and removed artifactual signatures, resulting in a core set of twenty signatures. These twenty signatures were used as features to predict cancer and tissue type using a variety of models.
Random forest models performed the best for predicting tissue type with an accuracy of 89.5%. The highest proportion of misclassified cases corresponded to endometrial, stomach, and colorectal cancers, with an accuracy of 67% in endometrial cases.
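A minimal stand-in for this setup, using scikit-learn's random forest on synthetic signature contributions (toy data, not the COSMIC-derived cohort, so the accuracy here says nothing about the 89.5% reported above):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic cohort: 300 samples x 20 signature contributions, 3 tissue
# classes, each enriched for one class-specific signature.
n_samples, n_signatures = 300, 20
tissue = rng.integers(0, 3, size=n_samples)
X = rng.random((n_samples, n_signatures))
for c in range(3):
    X[tissue == c, c] += 1.0  # class-specific signature enrichment

X_tr, X_te, y_tr, y_te = train_test_split(X, tissue, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print(round(clf.score(X_te, y_te), 2))
```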

A-294: Comparison of multiple modalities for anti-cancer drug response prediction
Track: MLCSB
  • Nikhil Branson, Queen Mary University of London, United Kingdom
  • Pedro Cutillas, Queen Mary University of London, United Kingdom
  • Conrad Bessant, Queen Mary University of London, United Kingdom


Presentation Overview: Show

An important problem within stratified medicine is predicting the effectiveness of anti-cancer drugs for different subgroups. This problem is called anti-cancer drug response prediction (DRP). Transcriptomic profiles of cancer cell lines are typically used for DRP, but we hypothesise that other types of omics data, such as proteomics or phosphoproteomics, might be better suited because they give a more direct insight into cellular processes. Furthermore, drugs typically target proteins. Recently, proteomics and phosphoproteomics datasets suitable for DRP have been made publicly available. However, the full potential of these datasets to improve DRP performance has not yet been evaluated.

Many studies have shown that neural networks (NNs) outperform traditional machine learning algorithms for DRP. However, XGBoost, a gradient-boosting algorithm that gives state-of-the-art performance in many fields, is not commonly used in DRP. We therefore evaluated the capability of NNs and XGBoost to predict drug response from three omics data types. We show that phosphoproteomics slightly outperforms RNA-seq and proteomics on the 38 cell lines for which profiles of all three omics data types are available. Furthermore, for the 877 cell lines that have proteomics and RNA-seq profiles, we show that XGBoost outperforms NNs.
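The model comparison can be sketched with scikit-learn, using `GradientBoostingRegressor` as a stand-in for XGBoost and `MLPRegressor` for the NN (synthetic data; the scores printed here say nothing about the study's findings):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Toy "omics" matrix: 80 cell lines x 100 features; the drug response
# depends on only a couple of features, plus noise.
X = rng.normal(size=(80, 100))
y = X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=80)

models = {
    "gbdt": GradientBoostingRegressor(random_state=0),  # XGBoost stand-in
    "nn": MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0),
}
# Compare 3-fold cross-validated R^2 for both model families.
results = {name: cross_val_score(m, X, y, cv=3, scoring="r2").mean()
           for name, m in models.items()}
for name, r2 in results.items():
    print(name, round(r2, 2))
```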

A-295: A machine learning-based meta-caller approach for structural variant detection from short read data
Track: MLCSB
  • Rudel Christian Nkouamedjo Fankep, Center for Familial Breast and Ovarian Cancer, Faculty of Medicine and University Hospital Cologne, Germany
  • Jochen Blom, Bioinformatics & Systems Biology, Justus Liebig University Gießen, Germany
  • Corinna Ernst, Center for Familial Breast and Ovarian Cancer, Faculty of Medicine and University Hospital Cologne, Germany
  • Susanne Motameny, Cologne Center for Genomics, University of Cologne, Faculty of Medicine and University Hospital Cologne, Germany


Presentation Overview: Show

Calling structural variants (SVs), i.e., genomic changes ≥ 50 bp, from short-read data is challenging due to the accuracy and robustness limitations of existing standalone callers. Meta-calling methods have therefore been proposed, but they lack criteria for evaluating the obtained results. Here, a meta-caller approach is developed by incorporating the results of six state-of-the-art standalone tools (BreakDancer, Delly, Lumpy, Manta, Pindel and TARDIS) into an XGBoost classifier model.
The classifier was trained on SV calls from the odd-numbered autosomes and evaluated on the even-numbered autosomes of the Genome in a Bottle HG002 reference. Two models were applied: a basic model (BM) considering which standalone tools support a call, as well as SV type and length, and an extended model (EM) additionally including variant call-specific quality measures from the individual VCF outputs. For deletions, the EM achieved a slightly higher F1-score than the BM (0.84 vs 0.83), but a slightly lower one for insertions (0.39 vs 0.40).
Because the approach yields prediction probabilities, it provides a suitable criterion for ranking SV calls according to evidence: in the top decile of ranked predictions, 99.2% (EM) and 99.4% (BM) of calls represented true positives in the SV reference, whereas in the bottom decile only 4.2% (EM) and 9% (BM) did.
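The decile-based ranking evaluation can be sketched as follows (synthetic probabilities and labels, not the HG002 benchmark; the 99%/4% figures above come from the real evaluation):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical meta-caller output: a per-call probability of being a true SV,
# and ground-truth labels, simulated here so that higher-probability calls
# are more often true.
probs = rng.random(1000)
truth = rng.random(1000) < probs

order = np.argsort(probs)[::-1]             # rank calls by evidence
top_decile = truth[order[:100]].mean()      # precision among the top 10%
bottom_decile = truth[order[-100:]].mean()  # precision among the bottom 10%
print(round(top_decile, 2), round(bottom_decile, 2))
```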

A-296: Mapping lineage-traced cells across time points with moslin
Track: MLCSB
  • Marius Lange, Department of Biosystems Science and Engineering, ETH Zürich, Basel, Switzerland, Switzerland
  • Zoe Piran, School of Computer Science and Engineering, The Hebrew University of Jerusalem, Israel, Israel
  • Michal Klein, Apple, Germany
  • Bastiaan Spanjaard, Berlin Institute for Medical Systems Biology, Max Delbrück Center for Molecular Medicine, Berlin, Germany, Germany
  • Dominik Klein, Institute of Computational Biology, Helmholtz Center Munich, Germany, Germany
  • Jan Philipp Junker, Berlin Institute for Medical Systems Biology, Max Delbrück Center for Molecular Medicine, Berlin, Germany, Germany
  • Fabian J. Theis, Institute of Computational Biology, Helmholtz Center Munich, Germany, Germany
  • Mor Nitzan, School of Computer Science and Engineering, The Hebrew University of Jerusalem, Israel, Israel


Presentation Overview: Show

Simultaneous profiling of single-cell gene expression and lineage history holds enormous potential for studying cellular decision-making beyond simpler pseudotime-based approaches. However, it is currently unclear how lineage and gene expression information across experimental time points can be combined in destructive experiments, which is particularly challenging for in-vivo systems. Here we present moslin, a Fused Gromov-Wasserstein-based model to couple matching cellular profiles across time points. In contrast to existing methods, moslin leverages both intra-individual lineage relations and inter-individual gene expression similarity. We demonstrate on simulated and real data that moslin outperforms state-of-the-art approaches that use either one or both data modalities, even when the lineage information is noisy. On C. elegans embryonic development, we show how moslin, combined with trajectory inference methods, predicts fate probabilities and putative decision driver genes. Finally, we use moslin to delineate lineage relationships among transiently activated fibroblast states during zebrafish heart regeneration. We anticipate moslin to play a crucial role in deciphering complex state change trajectories from lineage-traced single-cell data.

A-297: From Complexity to Clarity: Evaluating Explainability of Biomedical Machine Learning Models
Track: MLCSB
  • Marta González Mallo, Barcelona Supercomputing Center, Spain
  • Alfonso Valencia, Barcelona Supercomputing Center, Spain
  • Davide Cirillo, Barcelona Supercomputing Center, Spain


Presentation Overview: Show

Biomedicine has seen an exponential increase in available data, offering immense potential for transforming healthcare through advanced machine learning models. Despite these possibilities, such models are often "black boxes" that lack transparency and interpretability, challenging the ability of end users to understand the rationale behind automated decision-making. To address this issue, Explainable Artificial Intelligence (XAI) has emerged as a field aiming to provide explanations for complex machine learning model decisions. Despite the growing interest in XAI, there are currently no standardized guidelines or best practices for its application in biomedicine, particularly in terms of the interpretability and coherence of the explanations provided by different XAI methods. In this study, we evaluated the explanations provided by popular XAI methods, focusing on machine learning models trained on breast cancer RNA-Seq data from The Cancer Genome Atlas (TCGA), and used graph analysis of Reactome pathways to assess their interpretability and consistency. This analytical pipeline is implemented as an end-to-end workflow that can be reused by the community in different scenarios, contributing to the responsible and transparent use of machine learning models in health applications.
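
One simple way to quantify the consistency being evaluated here, whether two XAI methods agree on which features matter, is a rank correlation between their attribution vectors. The sketch below is purely illustrative (plain Python, no tie handling); the attribution values are invented, not from the study.

```python
def spearman(a, b):
    """Spearman rank correlation between two feature-attribution vectors
    (assumes no tied values)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / (va * vb) ** 0.5

# attributions from two hypothetical XAI methods for the same five genes
m1 = [0.9, 0.1, 0.4, 0.7, 0.2]
m2 = [0.8, 0.2, 0.5, 0.6, 0.1]
agreement = spearman(m1, m2)
```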

A-298: Fast Identification of Optimal Monotonic Classifiers
Track: MLCSB
  • Océane Fourquet, Computational Systems Biomedicine Lab, Institut Pasteur, Université Paris Cité, 75015, Paris, France
  • Martin Krejca, LIX, CNRS, Ecole Polytechnique, Institut Polytechnique de Paris, 91120, Palaiseau, France
  • Carola Doerr, LIP6, CNRS, Sorbonne Université, 4 Place Jussieu, 75005, Paris, France
  • Benno Schwikowski, Computational Systems Biomedicine Lab, Institut Pasteur, Université Paris Cité, 75015, Paris, France


Presentation Overview: Show

Monotonic bivariate classifiers can describe simple patterns in high-dimensional data that may not be discernible using only elementary linear decision boundaries. Such classifiers are relatively simple, easy to interpret, and do not require large amounts of data. Their scope can be increased by using an ensemble of multiple classifiers while remaining easily interpretable. A challenge is that finding optimal pairs of features from a vast number of possible pairs tends to be computationally intensive. We prove a simple mathematical inequality and show how it can be exploited to drastically speed up the identification of optimal feature pairs. Our empirical results on different biomedical datasets suggest that our approach eliminates most computational effort for 90% to 98% of all feature pairs.
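
The abstract does not state the inequality itself, but the generic pattern it enables, pruning pairs whose cheap upper bound cannot beat the current best, can be sketched as follows. The scores, the product pair-score, and the `min` bound are all invented for illustration.

```python
from itertools import combinations

def best_pair(n_features, pair_score, bound):
    """Find the best feature pair while skipping pairs whose cheap upper
    bound (bound(i, j) >= pair_score(i, j)) cannot beat the incumbent."""
    best, best_val, evaluated = None, float("-inf"), 0
    # visit pairs in decreasing bound order so a strong incumbent appears early
    order = sorted(combinations(range(n_features), 2), key=lambda ij: -bound(*ij))
    for i, j in order:
        if bound(i, j) <= best_val:
            break  # every remaining pair's bound is no larger either
        evaluated += 1
        v = pair_score(i, j)
        if v > best_val:
            best, best_val = (i, j), v
    return best, best_val, evaluated

# toy single-feature scores in [0, 1]; pair score = product, so min() is a valid bound
s = [0.9, 0.8, 0.1, 0.2, 0.05]
pair, val, n_eval = best_pair(len(s),
                              lambda i, j: s[i] * s[j],
                              lambda i, j: min(s[i], s[j]))
```

Here only one of the ten candidate pairs is ever evaluated, mirroring the abstract's claim that most computational effort is eliminated for the vast majority of feature pairs.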

A-299: Machine learning applications for enabling genomic medicine in oncology and rare diseases
Track: MLCSB
  • Stanley Ng, Genomics England, United Kingdom
  • Andreia Rogerio, Genomics England, United Kingdom
  • John Ambrose, Genomics England, United Kingdom
  • Olena Yavorska, Genomics England, United Kingdom
  • Joe Kaplinsky, Genomics England, United Kingdom
  • Alex Younger, Genomics England, United Kingdom
  • Roisin Sullivan, Genomics England, United Kingdom
  • Augusto Rendon, Genomics England, United Kingdom
  • Alona Sosinsky, Genomics England, United Kingdom
  • Francisco Azuaje, Genomics England, United Kingdom


Presentation Overview: Show

At Genomics England, we are exploring how machine learning (ML) and advanced analytics can increase the impact of Whole Genome Sequencing data within the NHS Genomic Medicine Service (GMS).
Here, we evaluate several ML models and algorithms for a) Sequence Defect Detection for whole genome cancer analyses and b) Variant Prioritisation in rare diseases, using data from the 100,000 Genomes Project and NHS GMS.
QC metrics (e.g., AT/GC dropout) target specific flaws in sequencing data, but such metrics may overlook samples with complex or unexpected defects. Here, we assess the performance of tabular ADBench models and internal models on three outlier detection tasks where individual QC metrics failed to catch such defective samples. We demonstrate that these models can detect unexpected sequencing defects in our bioinformatics pipeline (ROC-AUC = 0.907, 0.688, and 0.624 for the best model on the three tasks).
Identifying which variants are pathogenic is crucial for rare disease diagnostics and cancer therapeutics. We evaluate several deep learning algorithms (e.g., Enformer, Nucleotide Transformer) for predicting the functional impact of genomic variants. We show how these models contribute to the diagnostic pipeline and propose strategies to integrate them.
Our work illustrates how ML contributes to various stages of bioinformatics workflows for genomic medicine.

A-300: Improving Polygenic Risk Score Prediction by Phenotype- Agnostic Dimensionality Reduction
Track: MLCSB
  • Hadasa Kaufman, The Hebrew University of Jerusalem, Israel
  • Nadav Rappoport, Ben-Gurion University of the Negev, Israel
  • Yarden Hochenberg, Ben-Gurion University of the Negev, Israel
  • Michal Linial, The Hebrew University of Jerusalem, Israel


Presentation Overview: Show

Advances in sequencing technologies have enabled extensive genome-wide studies and advanced the detection of genetic loci associated with diseases. However, the problem of “missing heritability” raises the demand for improved polygenic risk score (PRS) models.
We hypothesize that non-additive PRS models can recover some of the missing heritability by incorporating high-dimensional genomic interactions.
Our method first trains an unsupervised model for dimensionality reduction and then, in a second stage, trains a supervised prediction model. This approach makes the PRS model computationally feasible without requiring variable selection techniques. Moreover, the first stage, which is computationally resource-intensive, is independent of phenotype: it needs to be trained only once and can then underpin a prediction model for any chosen trait. We evaluated the approach using two dimensionality reduction models (a deep autoencoder and principal component analysis) and two phenotype prediction models (deep neural networks and extreme gradient boosting).
The models were trained on the UK Biobank dataset and evaluated on binary and continuous phenotypes. We compared the results to two linear PRS models and to a supervised machine learning algorithm with feature selection. Our model outperforms the other PRS models for both phenotype types.
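
The two-stage design, a phenotype-agnostic reduction fitted once and a cheap per-trait predictor on top, can be sketched in plain Python with a toy power-iteration PCA and univariate least squares. The genotype matrix, dimensions, and trait below are invented for illustration.

```python
def fit_stage1(X, n_iter=100):
    """Stage 1 (phenotype-agnostic, fitted once): top principal component
    of the centered data via power iteration on X^T X."""
    n, d = len(X), len(X[0])
    mean = [sum(row[j] for row in X) / n for j in range(d)]
    Xc = [[row[j] - mean[j] for j in range(d)] for row in X]
    v = [1.0] * d
    for _ in range(n_iter):
        Xv = [sum(r[j] * v[j] for j in range(d)) for r in Xc]                 # X v
        w = [sum(Xc[i][j] * Xv[i] for i in range(n)) for j in range(d)]       # X^T X v
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return mean, v

def project(x, mean, v):
    return sum((xi - mi) * vi for xi, mi, vi in zip(x, mean, v))

def fit_stage2(z, y):
    """Stage 2 (per phenotype, cheap): univariate least squares on the
    reduced representation."""
    n = len(z)
    zm, ym = sum(z) / n, sum(y) / n
    b = sum((zi - zm) * (yi - ym) for zi, yi in zip(z, y)) / \
        sum((zi - zm) ** 2 for zi in z)
    return b, ym - b * zm

# toy genotype matrix whose dominant axis of variation drives the trait
X = [[t, t, 0.0] for t in (-2.0, -1.0, 0.0, 1.0, 2.0)]
y = [3.0 * t + 1.0 for t in (-2.0, -1.0, 0.0, 1.0, 2.0)]
mean, v = fit_stage1(X)                       # reusable across any trait
slope, intercept = fit_stage2([project(x, mean, v) for x in X], y)
pred = slope * project([1.0, 1.0, 0.0], mean, v) + intercept
```

Only `fit_stage2` needs rerunning for a new phenotype, which is the point of keeping the expensive reduction phenotype-agnostic.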

A-301: Learning Single-Cell Perturbation Responses using Neural Optimal Transport
Track: MLCSB
  • Charlotte Bunne, ETH Zürich, Department for Computer Science, Switzerland
  • Stefan Stark, ETH Zürich, Department for Computer Science, Switzerland
  • Gabriele Gut, University of Zurich, Department of Molecular Life Sciences, Switzerland
  • Jacobo Sarabia del Castillo, University of Zurich, Department of Molecular Life Sciences, Switzerland
  • Mitchell Levesque, University of Zurich Hospital, Switzerland
  • Kjong-Van Lehmann, ETH Zürich, Department for Computer Science, Switzerland
  • Lucas Pelkmans, University of Zurich, Department of Molecular Life Sciences, Switzerland
  • Andreas Krause, ETH Zürich, Department for Computer Science, Switzerland
  • Gunnar Ratsch, ETH Zürich, Department for Computer Science, Switzerland


Presentation Overview: Show

Understanding and predicting molecular responses in single cells enables the study of biological processes to a high degree of granularity. However, obtaining single-cell measurements of perturbation responses typically requires cells to be destroyed. This complicates the learning of heterogeneous perturbation responses as we only have access to unpaired distributions of treated and untreated cells. Here, we leverage optimal transport theory and recent convex neural architectures to present CellOT, a framework for modeling responses of individual cells to perturbations by coupling perturbed and unperturbed cell populations with a learned optimal transport map. This approach explicitly models heterogeneous responses, is able to capture detailed features of the distribution of treated states, and can be applied to unseen untreated cells.

We show that CellOT outperforms current state-of-the-art methods at modeling single-cell drug responses, as profiled by scRNA-seq and a multiplexed protein imaging technology. Further, we demonstrate that CellOT generalizes well to unseen settings by (a) predicting the scRNA-seq responses of unseen lupus patients exposed to IFN-β, (b) inferring LPS responses across species, and (c) modeling the developmental trajectories of different hematopoietic cell types. Finally, we outline how CellOT can be utilized in the clinic to aid treatment decisions for unseen melanoma patients.

A-302: In silico analysis and identification of protein amyloid aggregation inhibitors
Track: MLCSB
  • Ichrak Chouarfia, Université des Sciences et de la Technologie d'Oran Mohamed Boudiaf USTO-MB, Algeria
  • Hafida Bouziane, Université des Sciences et de la Technologie d'Oran USTO-MB, Algeria


Presentation Overview: Show

Many human degenerative and ultimately fatal diseases result from amyloid aggregation of misfolded proteins. The mechanism underlying the formation of such insoluble and toxic aggregates is strongly correlated with protein sequence properties. In previously published work, we investigated the power of string kernel-based support vector machines as an alternative to windowing approaches for predicting protein aggregation propensity, integrating predicted secondary structure and solvent accessibility. Our study revealed that certain amino acid residues are predominantly present in the regions promoting amyloid aggregation, whereas others, owing to their presence and concentration at particular positions, act as aggregation gatekeepers and stabilize the amyloid core once formed. To shed light on this phenomenon, which is still an open issue, we have integrated information on disordered regions and Gene Ontology annotation into our recent investigations. The results showed that amino acid residues in the protein N- and C-terminal regions strongly decrease its amyloidogenic potential, which agrees well with recent in vitro studies. This insight encourages us to further assess the role of certain amino acid residues in protein aggregation and how they favor, disrupt, or inhibit its formation. The experiments carried out might help identify effective approaches to counter amyloid formation.

A-303: HetCPI: Knowledge Graph-Centric Drug Discovery via Heterogeneous Graph Transformers
Track: MLCSB
  • Heval Ataş Güvenilir, Middle East Technical University, Turkey
  • Tunca Doğan, Hacettepe University, Turkey


Presentation Overview: Show

Recent developments in data-driven approaches have facilitated the processing and interpretation of vast quantities of biomedical data for drug discovery and development. As a new and practical data structure, heterogeneous knowledge graphs have the capacity to represent complex relationships between different layers of biomedical data. Relatedly, graph neural networks (GNNs) have emerged as a novel modelling technique for the inference of graph-based data; however, the majority of GNN algorithms are restricted to homogeneous graphs and cannot handle heterogeneous data with multiple types of nodes and edges. Here, we propose a new type of systems-level compound-protein interaction (CPI) representation and subsequent prediction framework called HetCPI, which uses large-scale biomedical knowledge graphs (KGs) obtained from the CROssBAR system as input. To process these biomedical KGs for bioactivity prediction, we employed the heterogeneous graph transformer (HGT) architecture, which handles graph heterogeneity and maintains node- and edge-type dependent representations through its attention mechanism. HetCPI has yielded promising results on challenging protein family-specific benchmark CPI datasets, in comparison to baseline and state-of-the-art methods. HetCPI is anticipated to aid computational drug discovery by leveraging direct and indirect relationships in molecular and cellular processes for bioactivity prediction, thereby accelerating the development of new treatments.

A-304: Signature Informed Sampling for Transcriptomic Data
Track: MLCSB
  • Nikita Janakarajan, ETH Zürich, IBM Research Europe, Switzerland
  • Mara Graziani, IBM Research Europe, Switzerland
  • María Rodríguez Martínez, IBM Research Europe, Switzerland


Presentation Overview: Show

The use of deep learning models on transcriptomic data, which often have a large number of features but a small number of patients, is challenging due to the tendency of these models to overfit the data and not generalize well. Data augmentation strategies have been proposed to help address this issue, but existing approaches can be computationally intensive or require parametric estimates. In this study, we introduce a novel, non-parametric data augmentation approach inspired by the phenomenon of chromosomal crossover during meiosis. Given non-overlapping gene signatures describing a phenotype, our method generates new artificial data points by sampling these signatures from different patients under certain phenotypic constraints. We apply this method to transcriptomic data of colorectal cancer from TCGA and CPTAC and demonstrate that it improves patient stratification, generalizes well to out-of-distribution data, and improves the models' robustness to overfitting and distribution shift. We evaluate our method on both discriminative and generative tasks, namely predicting colorectal cancer subtypes and learning a useful latent space with a variational autoencoder, and show that it performs on par with, if not better than, currently popular augmentation methods for RNA-Seq data such as SMOTE and Negative-Binomial sampling. Code for reproducibility is available at https://github.com/PaccMann/transcriptomic_signature_sampling.
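
The crossover-inspired sampling can be illustrated in a few lines: each non-overlapping signature block is copied intact from a (possibly different) donor of the same phenotype, preserving within-signature gene correlations. The gene names, values, and uniform donor choice below are illustrative assumptions, not the paper's exact constraints.

```python
import random

def crossover_augment(patients, signatures, n_new, seed=0):
    """Create artificial samples: every signature block is drawn intact
    from one donor, loosely mimicking chromosomal crossover."""
    rng = random.Random(seed)
    new = []
    for _ in range(n_new):
        sample = {}
        for sig in signatures:             # sig: list of gene names; blocks don't overlap
            donor = rng.choice(patients)   # independent donor per signature block
            for g in sig:
                sample[g] = donor[g]
        new.append(sample)
    return new

# toy: two signatures over four genes, three patients of the same subtype
sigs = [["g1", "g2"], ["g3", "g4"]]
patients = [{"g1": 1, "g2": 2, "g3": 3, "g4": 4},
            {"g1": 5, "g2": 6, "g3": 7, "g4": 8},
            {"g1": 9, "g2": 10, "g3": 11, "g4": 12}]
aug = crossover_augment(patients, sigs, 5)
```

Each artificial sample mixes donors across signatures but never within one, which is what keeps the generated profiles phenotypically plausible.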

A-305: Genome-wide inference of eukaryotic coding regions with deep learning
Track: MLCSB
  • Xavier Lapointe, Université de Sherbrooke, Canada
  • Marie A. Brunet, Université de Sherbrooke, Canada


Presentation Overview: Show

Recent work has presented new evidence of translation for thousands of previously unknown coding sequences (CDS), drastically expanding the human proteome. However, due to experimental limitations and inherent biases, many more CDS are likely to be overlooked or underestimated, thereby limiting our understanding of the full range of proteome diversity. Accurate annotation of functional elements holds crucial implications for clinical and fundamental research; hence we need revised tools to exhaustively evaluate the functional ORFeome. The recent success of deep learning on sequence modeling tasks, combined with quality -omics data, provides hope in the search for an approximation of the universal features that underlie translation. Here we present FOMOnet, a 1D convolutional neural network derived from the UNet architecture. FOMOnet is trained on human protein-coding transcripts and performs segmentation of coding regions within a transcript. With a ROC AUC of 99.3% and a PR AUC of 96.1%, it outperforms current tools aiming to predict canonical human CDS. Moreover, it confidently predicts hundreds of novel CDS supported by mass spectrometry and Ribo-seq. Overall, our approach provides an unbiased assessment of CDS within eukaryotic genomes, outperforming state-of-the-art tools and predicting both canonical and novel CDS.

A-306: A deep learning method for identification of viral sequences from diverse environments
Track: MLCSB
  • Rajitha Yasas Wijesekara, Institute for Bioinformatics, University of Medicine Greifswald, Felix-Hausdorff-Str. 8, 17475 Greifswald, Germany
  • Rick Beelo, Theoretical Biology and Bioinformatics, Science4Life, Utrecht University, Padualaan 8, 3584 CH Utrecht, the Netherlands
  • Ling-Yi Wu, Theoretical Biology and Bioinformatics, Science4Life, Utrecht University, Padualaan 8, 3584 CH Utrecht, the Netherlands
  • Piotr Rozwalak, Department of Computational Biology, Adam Mickiewicz University, Poznan 61-614, Poland
  • Bas E. Dutilh, Cluster of Excellence Balance of the Microverse, Friedrich Schiller University Jena, 07743 Jena, Germany
  • Lars Kaderali, Institute for Bioinformatics, University of Medicine Greifswald, Felix-Hausdorff-Str. 8, 17475 Greifswald, Germany


Presentation Overview: Show

Metagenomics has emerged as a powerful tool for the study of viruses in their natural environments, but identifying and classifying viral sequences in complex metagenomic datasets remains a challenge. Homology-based and homology-free methods are the two main approaches for virus discovery, with homology-free methods being particularly effective at detecting highly divergent viruses with no close homologs. Here, we introduce Jaeger, a homology-free deep learning method that detects both viruses and proviruses in metagenomic and genomic datasets. Jaeger utilizes a convolutional neural network to effectively capture the complex and diverse compositional features of genomic sequences. Our results demonstrate that Jaeger outperforms existing virus detection tools in predicting phage sequences, with accuracies of 70% for reads, 90% for contigs, and 93% for bins. Jaeger also accurately detects and classifies other sequence categories such as eukaryotic, bacterial, and archaeal genomes, highlighting its versatility and potential applications in a range of research areas. Our study validates the performance of Jaeger on simulated and real metagenomic datasets from three different biomes, but further validation using additional datasets and in different contexts may be necessary to fully evaluate the robustness and generalizability of our approach.

A-307: A Unified Framework for Target-Based Drug Discovery with Multimodal Contrastive Learning
Track: MLCSB
  • Gökçe Uludoğan, Bogazici University, Turkey
  • Elif Ozkirimli, Roche AG, Switzerland
  • Kutlu Ö. Ülgen, Bogazici University, Turkey
  • Nilgün Karalı, Istanbul University, Turkey
  • Arzucan Ozgur, Bogazici University, Turkey


Presentation Overview: Show

Identification of new compounds for biological targets can be accomplished with two paradigms: de novo drug design and drug repurposing. Existing models adopt one of these paradigms, and thus separate models are built for each approach. In this study, we propose a novel, unified framework that integrates both paradigms, allowing for target-specific de novo design and drug repurposing in a single, efficient approach. Our framework combines unimodal protein and molecule encoders with a molecule decoder, exploiting powerful pre-trained models. Its design leverages two jointly learnt objectives: contrastive learning and conditional language modelling. The contrastive learning objective aligns the protein and molecule representations, enabling drug repurposing as drug-target retrieval, while the conditional language modelling objective learns to generate molecules given a target of interest. Preliminary results show that the proposed framework can tackle both tasks with performance comparable to existing approaches.
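
The contrastive alignment objective can be sketched as a symmetric InfoNCE loss over a protein-molecule similarity matrix in which matched pairs sit on the diagonal. The toy similarities and temperature below are invented for illustration, not the framework's actual values.

```python
import math

def info_nce(sim, temp=0.1):
    """Symmetric InfoNCE: cross-entropy of picking the matched molecule for
    each protein (rows) plus the matched protein for each molecule (columns)."""
    n = len(sim)
    def ce(rows):
        total = 0.0
        for i in range(n):
            logits = [rows[i][j] / temp for j in range(n)]
            m = max(logits)
            lse = m + math.log(sum(math.exp(l - m) for l in logits))
            total += lse - logits[i]   # -log softmax at the matched index
        return total / n
    cols = [[sim[j][i] for j in range(n)] for i in range(n)]
    return 0.5 * (ce(sim) + ce(cols))

aligned = [[1.0, 0.1],
           [0.1, 1.0]]    # matched pairs most similar: low loss
shuffled = [[0.1, 1.0],
            [1.0, 0.1]]   # matched pairs least similar: high loss
```

Once the encoders are trained under such an objective, drug repurposing reduces to retrieval: rank all molecules by `sim[i][j]` for the target of interest.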

A-308: Resolving cell-state parallax reveals causal regulatory mechanisms
Track: MLCSB
  • Alexander Wu, Massachusetts Institute of Technology, United States
  • Rohit Singh, Massachusetts Institute of Technology, United States
  • Christopher Walsh, Harvard Medical School, United States
  • Bonnie Berger, Massachusetts Institute of Technology, United States


Presentation Overview: Show

Genome-wide association studies identify many disease-linked genetic variants at noncoding genomic loci, but it is difficult to determine which genes these variants affect. Single-cell multimodal assays that profile chromatin accessibility and gene expression in the same cell hold promise for addressing this challenge, as they can reveal causal locus-gene relationships. However, existing approaches are unable to account for what we refer to as “cell-state parallax,” the time lag between the epigenetic and transcriptional modalities due to their cause-and-effect relationship. Our algorithm, GrID-Net, is a neural network-based generalization of Granger causal inference that newly enables the detection of causal locus–gene associations in graph-based dynamical systems, such as single-cell trajectories. GrID-Net substantially outperforms existing approaches for inferring locus-gene links, achieving up to 27% higher agreement with independent population genetics-based estimates. Applying GrID-Net to interpret genetic variants in schizophrenia (SCZ), we identified 132 genes linked to SCZ, including the potassium transporters KCNG2 and SLC12A6. We also uncovered evidence for the role of neural transcription factor binding disruptions in SCZ etiology. Our work points to the transformative potential of single-cell multimodal assays for discovering causal regulatory mechanisms and provides a general strategy for unveiling the impact of noncoding variants on gene dysregulation in disease.
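
GrID-Net generalizes Granger causal inference to graph-based dynamics; the classic pairwise test it builds on asks whether the lagged candidate series reduces prediction error beyond the target's own past. A plain-Python sketch on synthetic series follows; the OLS solver and the data are illustrative, not the paper's method.

```python
import random

def rss(preds, y):
    """Residual sum of squares of OLS (with intercept), via Gaussian
    elimination on the normal equations."""
    n = len(y)
    X = [[1.0] + [p[i] for p in preds] for i in range(n)]
    d = len(X[0])
    A = [[sum(X[i][r] * X[i][s] for i in range(n)) for s in range(d)] for r in range(d)]
    c = [sum(X[i][r] * y[i] for i in range(n)) for r in range(d)]
    for k in range(d):                        # forward elimination, partial pivoting
        p = max(range(k, d), key=lambda r: abs(A[r][k]))
        A[k], A[p], c[k], c[p] = A[p], A[k], c[p], c[k]
        for r in range(k + 1, d):
            f = A[r][k] / A[k][k]
            for s in range(k, d):
                A[r][s] -= f * A[k][s]
            c[r] -= f * c[k]
    b = [0.0] * d
    for k in reversed(range(d)):              # back substitution
        b[k] = (c[k] - sum(A[k][s] * b[s] for s in range(k + 1, d))) / A[k][k]
    return sum((y[i] - sum(X[i][s] * b[s] for s in range(d))) ** 2 for i in range(n))

def granger_gain(x, y, lag=1):
    """Fraction of y's residual variance explained by adding lagged x."""
    yt, ylag, xlag = y[lag:], y[:-lag], x[:-lag]
    restricted = rss([ylag], yt)
    return (restricted - rss([ylag, xlag], yt)) / restricted

rng = random.Random(0)
x = [rng.random() for _ in range(200)]
y = [0.0] + x[:-1]                            # x drives y with a lag of 1
```

The gain is large in the causal direction (lagged `x` predicts `y` almost perfectly) and small in the reverse direction, which is the asymmetry Granger-style methods exploit.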

A-309: Chrome-Zoo: cross-species chromatin profile prediction using DNA Zoo data
Track: MLCSB
  • Anupama Jha, Department of Genome Sciences, University of Washington, Seattle, WA, United States
  • Jacob Schreiber, Stanford University School of Medicine, Stanford, CA, United States
  • Olga Dudchenko, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, United States
  • Georgi K. Marinov, Department of Genetics, Stanford University, Stanford, CA, United States
  • Anshul Kundaje, Department of Genetics, Stanford University, Stanford, CA, United States
  • William J. Greenleaf, Department of Genetics, Stanford University, Stanford, CA, United States
  • Erez S. Lieberman Aiden, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, United States
  • William Stafford Noble, Department of Genome Sciences, University of Washington, Seattle, WA, United States


Presentation Overview: Show

DNA Zoo is a large-scale project that has used Hi-C measurements to create draft genomic assemblies for over 500 species and has collected ATAC-seq datasets for a subset of 100 of those species. Hi-C and ATAC-seq data provide complementary views of chromatin organization, with Hi-C identifying pairwise genomic interactions and ATAC-seq measuring local chromatin accessibility. However, collecting both Hi-C and ATAC-seq data for all species is impractical due to cost and sample availability constraints. To address this challenge, we propose a deep tensor factorization model called Chrome-Zoo that can translate between ATAC-seq and Hi-C in species where only one of these assays is available. We address the challenges associated with handling multiple genomes by training an additional model that converts nucleotide sequences to learned genomic embeddings that are consistent across species. Using these learned genome embeddings, we trained Chrome-Zoo on 18 species with available genome assemblies, Hi-C and ATAC-seq datasets. We show that our model can successfully translate between Hi-C and ATAC-seq in new species at coarse (100 kb) and fine (1–10 kb) resolutions.

A-310: Meta-Training Convolutional Neural Networks for Microbiome Data Classification: Addressing Small Sample Size Challenges
Track: MLCSB
  • Andre Goncalves, Lawrence Livermore National Laboratory, United States
  • Camilo Valdes, Lawrence Livermore National Laboratory, United States
  • Jose Manuel Marti Martinez, Lawrence Livermore National Laboratory, United States
  • James Thissen, Lawrence Livermore National Laboratory, United States
  • Nisha Mulakken, Lawrence Livermore National Laboratory, United States
  • Car Reen Kok, Lawrence Livermore National Laboratory, United States
  • Crystal Jaing, Lawrence Livermore National Laboratory, United States
  • Nicholas Be, Lawrence Livermore National Laboratory, United States


Presentation Overview: Show

The human microbiome, a complex ecosystem of microorganisms inhabiting the human body, has been recognized as a key factor in human health and disease. The advent of high-throughput DNA sequencing technologies has prompted the interest of studying the microbiome using machine learning, including deep learning techniques such as convolutional neural networks (CNNs). However, microbiome datasets are often small in size, leading to challenges in training such models. We propose to leverage the Model-Agnostic Meta-Learning (MAML) algorithm to train CNNs on multiple small microbiome datasets to discriminate between diseased and healthy individuals based on their microbiome profiles. MAML is a meta-learning algorithm that enables fast adaptation of neural networks to new tasks with limited data, making it well-suited for training on small datasets. Our results demonstrate superior performance of the MAML-based approach over traditional CNN training methods in terms of prediction capacity. The meta-trained CNNs performed better on unseen datasets, likely capturing key features of the microbiome. Our research highlights the potential of MAML meta-learning model for training CNNs on small microbiome datasets. Ultimately, this could improve our understanding of the role of the human microbiome in health and disease, aiding in the clinical prediction and management of disease states.
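
MAML's two nested loops can be seen on a toy problem: tasks are 1-D quadratics, the inner loop takes one gradient step per task, and the outer loop updates the shared initialization through that step. Everything below (targets, learning rates, the analytic second-order term) is a didactic assumption, far simpler than a CNN on microbiome profiles.

```python
def maml_quadratic(targets, theta=0.0, alpha=0.1, beta=0.05, steps=500):
    """MAML on toy tasks with loss_i(t) = (t - target_i)**2.
    Inner loop: one SGD step per task; outer loop: gradient of the
    post-adaptation loss w.r.t. the initialization (the (1 - 2*alpha)
    factor is the exact second-order term for quadratics)."""
    for _ in range(steps):
        meta_grad = 0.0
        for tgt in targets:
            adapted = theta - alpha * 2 * (theta - tgt)           # inner adaptation
            meta_grad += 2 * (adapted - tgt) * (1 - 2 * alpha)    # d loss(adapted)/d theta
        theta -= beta * meta_grad / len(targets)
    return theta

# the meta-learned initialization is the point from which one inner step
# adapts best, on average, across tasks
init = maml_quadratic([1.0, 3.0])
```

For these quadratic tasks the meta-optimum is the mean of the targets; from `init`, a single inner step moves markedly toward whichever task is presented, which is the fast-adaptation property MAML trades for.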

A-311: BuDDI: A generative deconvolution framework to predict cell-type-specific perturbation response
Track: MLCSB
  • Natalie Davidson, University of Colorado, Anschutz Medical Campus, United States
  • Casey Greene, University of Colorado, Anschutz Medical Campus, United States


Presentation Overview: Show

While single-cell (sc) experiments provide deep, cellular resolution within a single sample, they are currently limited in their breadth across biological conditions. Furthermore, some single-cell experiments are inherently more challenging than bulk experiments due to dissociation difficulties or limited tissue availability. Using domain adaptation techniques to bridge this gap, we integrate available large corpora of case-control bulk and reference scRNA-seq observations to infer missing cell-type-specific perturbation effects.

We propose BuDDI (BUlk Deconvolution with Domain Invariance) to estimate cell-type-specific perturbation changes from bulk RNA-seq samples. BuDDI achieves this by learning three independent latent spaces within a single variational autoencoder (VAE), encompassing three sources of variability: 1) cell-type proportion, 2) perturbation effect, and 3) structured experimental noise. Taking advantage of our generative model, we simulate perturbations by sampling from one or more of our latent spaces. We train our model using a semi-supervised approach to jointly model bulk and scRNA-seq data.

We validated BuDDI’s performance using control and IFN-β-stimulated scRNA-seq data. We generated case-control pseudobulks to simulate cell-type-specific perturbations and used held-out unperturbed samples as the single-cell reference. We compared BuDDI against PCA, VAE, and conditional VAE baselines. BuDDI matched or outperformed all other models in predicting stimulated gene expression and effect size.

A-312: Learning Gut Microbiome Representations in Human Diseases with Vision and Hierarchical Models
Track: MLCSB
  • Camilo Valdes, Lawrence Livermore National Laboratory, Physical and Life Sciences Directorate, United States
  • Andre Goncalves, Lawrence Livermore National Laboratory, Engineering Directorate, United States
  • James Thissen, Lawrence Livermore National Laboratory, Physical and Life Sciences Directorate, United States
  • Jose Manuel Marti Martinez, Lawrence Livermore National Laboratory, Physical and Life Sciences Directorate, United States
  • Car Reen Kok, Lawrence Livermore National Laboratory, Physical and Life Sciences Directorate, United States
  • Nisha Joy Mulakken, Lawrence Livermore National Laboratory, Physical and Life Sciences Directorate, United States
  • Crystal Jaing, Lawrence Livermore National Laboratory, Physical and Life Sciences Directorate, United States
  • Nicholas Be, Lawrence Livermore National Laboratory, Physical and Life Sciences Directorate, United States


Presentation Overview: Show

Gut microbiome dysbiosis has been linked to several diseases, including colorectal cancer, diabetes, Crohn’s disease, and inflammatory bowel disease. Microbiome community abundance profiles report which microbial organisms are present in a sample and in what quantities. These profiles are generally created by means of high-throughput DNA sequencing.

In this work we present a method for fusing together two complementary representations of 3,810 gut microbiome community abundance profiles taken from 19 different datasets representing 14 human diseases.

Our model uses vision transformer (ViT) and convolutional neural network (CNN) architectures that leverage the fusion of two representations of the samples: a set of 2D images created via Hilbert curve visualizations; and a set of important taxonomic clades identified by a hierarchical feature engineering (HFE) approach.

The results show that the fusion model’s representation space encodes features related to taxonomical clades and the diseases they are most prevalent in. These representations create distinctive patterns in low dimensional space that can be used in downstream analyses to further identify the important species in disease conditions. Models such as this one can enable platforms for predicting human health states and guiding clinical management of disease.
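
The Hilbert-curve images rest on an index-to-coordinate mapping with a locality guarantee: consecutive positions of the 1-D abundance vector land in adjacent pixels, so neighbouring taxa stay neighbours in the image. Below is the classic iterative distance-to-coordinate conversion (a standard public algorithm; the image size is an illustrative choice, not the paper's).

```python
def hilbert_d2xy(order, d):
    """Map index d along a Hilbert curve filling a 2**order x 2**order grid
    to (x, y) pixel coordinates."""
    x = y = 0
    s, t = 1, d
    while s < (1 << order):
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:               # rotate the quadrant to keep the curve continuous
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

side = 8                          # 2**order pixels per side
pts = [hilbert_d2xy(3, d) for d in range(side * side)]
```

Every cell of the grid is visited exactly once and successive indices are always Manhattan-adjacent, which is what preserves local structure when an abundance vector is rendered as an image.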

A-314: An ensemble machine learning approach to predicting in vivo RNA-RNA interaction
Track: MLCSB
  • Weidong Wang, University of Massachusetts Amherst, United States
  • Eric Pederson, University of Massachusetts Amherst, United States
  • Zhengqing Ouyang, University of Massachusetts Amherst, United States


Presentation Overview: Show

RNA-RNA interactions play a critical role in regulating RNA biogenesis and function, yet the governing features remain poorly understood. In this study, we developed a novel ensemble machine learning method for predicting RNA-RNA interaction probabilities by combining multiple cutting-edge models through the super learner. Experimentally verified in vivo RNA-RNA interactions from PARIS and RIC-seq experiments in HeLa cells were utilized. Our model incorporates nucleotide k-mer sequence motifs, RNA structure motifs from RNAcofold, RNA-binding protein (RBP) sites, and RNA interaction energies. The model achieved high accuracy in both the PARIS (0.893) and RIC-seq (0.957) data sets, with areas under the ROC curve (AUC) of 0.952 and 0.993, and modified Matthews correlation coefficients (normMCC) of 0.894 and 0.957, respectively. The model effectively identified crucial sequence and structure motifs across different data sets. Our findings show that structure motifs are vital in the PARIS data set, while sequence motifs are critical in the RIC-seq data set, aligning with underlying mechanisms. In conclusion, our model enhances our understanding of RNA-RNA interactions and provides a valuable tool for future research, with significant implications for advancing knowledge in biological processes involving RNA-RNA interactions.
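For reference, a common way to rescale the Matthews correlation coefficient from [-1, 1] onto [0, 1] is (MCC + 1) / 2; whether this is the exact normMCC definition used in the abstract is an assumption. A minimal sketch from confusion-matrix counts:

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

def norm_mcc(tp, tn, fp, fn):
    """Rescale MCC from [-1, 1] to [0, 1], one common normMCC definition:
    1.0 is a perfect classifier, 0.5 is chance-level."""
    return (mcc(tp, tn, fp, fn) + 1) / 2
```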

A-315: Joint Diffusion for Protein Sequence-Structure Co-Design
Track: MLCSB
  • Ria Vinod, Brown University, United States
  • Ava Amini, Microsoft Research New England, United States
  • Kevin Yang, Microsoft Research New England, United States
  • Lorin Crawford, Microsoft Research New England, Brown University, United States


Presentation Overview: Show

There has been significant progress in designing new proteins with the advent of generative models. However, several works consider only the continuous domains of geometric or topological protein data. This approach produces geometrically viable structures but poor downstream task performance. The complexity of designing function-informed proteins, higher-order structures, and molecular interaction partners requires considering several other descriptors beyond backbone geometry, including sequence-structure relationships, physicochemical properties, and stable energy conformations. Here, we present a novel diffusion-based generative model, Co-Designing Proteins (CodeProt), and a sampling method that jointly operates on and samples from sequence-structure space. Our model consists of a structure module that processes protein structures as clouds of oriented reference frames in 3D space, and a sequence module that processes latent representations of residues. By introducing joint diffusion in continuous and discrete proteomic domains, we identify optimal sampling manifolds that inform sequence-structure pairs with specified downstream molecular properties. Preliminary results show that we significantly improve sequence recovery, perplexity, and structure designability scores compared to existing unimodal approaches. We perform further in silico evaluations on the novelty, diversity, and confidence of the designed proteins, indicating that we capture the key structural and sequence elements exhibited by natural proteins.

A-316: GrapHiC: An integrative graph-based approach for imputing missing Hi-C reads
Track: MLCSB
  • Ghulam Murtaza, Department of Computer Science, Brown University, United States
  • Justin Wagner, National Institute of Standards and Technology, United States
  • Justin Zook, National Institute of Standards and Technology, United States
  • Ritambhara Singh, Center for Computational Molecular Biology and Department of Computer Science, Brown University, United States


Presentation Overview: Show

Hi-C experiments allow researchers to study and understand the 3D genome organization and its regulatory function. Unfortunately, sequencing costs and technical constraints severely restrict access to high-quality Hi-C data for many cell types. Existing methods employ either a "Seq-to-HiC" or a "HiC-to-HiC" strategy to impute missing Hi-C reads. However, both approaches impose a strict Euclidean structure on the input data and consequently lose predictive capacity. We propose GrapHiC, which utilizes a Graph Auto-Encoder network to generate high-quality Hi-C contact maps. We reformulate Hi-C data as a graph, with ChIP-seq signals as its node features and Hi-C reads as its edges. This auxiliary ChIP-seq data provides cell-type-specific information to impute missing reads when the input Hi-C data is sparse. Our experiments on datasets with varying sparsity levels and cross-cell type inputs suggest that GrapHiC generalizes better than its baselines, improving average performance across various evaluation metrics, including Chromatin Loop recovery accuracy. Thus, our generalizable framework can make the analysis of high-quality Hi-C data more accessible for many cell types with a varying range of read sparsities.
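The graph reformulation described above can be sketched in a few lines. This is a simplified illustration, not the GrapHiC implementation; the `min_reads` threshold is a hypothetical parameter:

```python
import numpy as np

def hic_to_graph(contact_matrix, chipseq_tracks, min_reads=1):
    """Reformulate a Hi-C contact matrix as a graph: each genomic bin is
    a node, ChIP-seq signals are the node features, and bin pairs with
    at least `min_reads` contacts become weighted (undirected) edges."""
    node_features = np.asarray(chipseq_tracks)           # (n_bins, n_tracks)
    # Upper triangle only, so each undirected contact appears once.
    src, dst = np.nonzero(np.triu(contact_matrix >= min_reads, k=1))
    edge_index = np.stack([src, dst])                    # (2, n_edges)
    edge_weight = contact_matrix[src, dst]               # read counts as weights
    return node_features, edge_index, edge_weight
```

This node-feature / edge-index layout is the standard input format for graph auto-encoder libraries.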

A-317: DNA-Diffusion: Generative diffusion models for enhancing gene expression control through synthetic regulatory elements
Track: MLCSB
  • Lucas Ferreira Silva, Harvard/ MGH, United States
  • Simon Senan, openBIOML, United States
  • Matei Bejan, University of Bucharest, Romania
  • César Miguel Valdez Córdova, JKU Linz, Austria
  • Cameron Smith, MGH/Harvard/Broad, United States
  • Sameer Gabbita, MGH/TJHSST, United States
  • Aaron Wenteler, QMUL, United States
  • Zach Nussbaum, Nomic.ai, United States
  • Aniketh Janardhan Reddy, UC Berkeley, United States
  • Zelun Li, Victor Chang Cardiac Institute/UNSW, Australia
  • Zain Munir Patel, MGH/Harvard/Broad, United States
  • Noah Weber, Celeris Therapeutics, Germany
  • Tin M. Tunjic, Celeris Therapeutics, Germany
  • Emily S. Wong, Victor Chang Cardiac Institute/UNSW, Australia
  • Wouter Meuleman, University of Washington, United States
  • Luca Pinello, MGH/Harvard/Broad Institute, United States


Presentation Overview: Show

The challenge of systematically modifying and optimizing regulatory elements for precise gene expression control is central to modern genomics and synthetic biology. Advancements in generative AI have paved the way for designing synthetic sequences and identifying genomic locations for integration, with the aim of safely and accurately modulating gene expression. We leverage diffusion models to design context-specific DNA regulatory sequences, which hold significant potential toward enabling novel therapeutic applications requiring precise modulation of gene expression. Our framework uses a cell type-specific diffusion model to generate novel 200 bp regulatory elements based on chromatin accessibility across different cell types. We evaluate the generated sequences based on key metrics to ensure they retain properties of endogenous sequences, including binding specificity, composition, accessibility, and regulatory potential. We assess transcription factor binding site composition, potential for cell type-specific chromatin accessibility, and the capacity of sequences generated by DNA-Diffusion to activate gene expression in different cell contexts using state-of-the-art prediction models. Our results demonstrate the ability to robustly generate DNA sequences with cell type-specific regulatory potential. DNA-Diffusion paves the way for revolutionizing regulatory modulation approaches in mammalian synthetic biology and precision gene therapy.

A-319: NeTOIF: A Network-based approach for Time-series Omics data Imputation and Forecasting
Track: MLCSB
  • Shamim Mollah, Assistant Professor of Genetics, Washington University in St. Louis School of Medicine, United States
  • Min Shi, Washington University in St. Louis School of Medicine, United States


Presentation Overview: Show

Many omics studies produce time-series data capturing dynamic observations. While time-series omics data are essential to understanding disease mechanisms, they are often incomplete (missing values), resulting in data shortage. Missing data and data shortage are especially problematic for downstream data integration and analyses, which require complete and sufficient data representation. Data imputation and forecasting methods are widely used to mitigate these issues. However, existing techniques typically address static omics data representing a single time point and perform forecasting on complete data. As a result, these techniques cannot capture the time-ordered nature of the data and are unable to handle omics data with missing values at multiple time points. We propose a network-based method for time-series omics data imputation and forecasting (NeTOIF) that addresses this problem by taking advantage of topological relationships (e.g., protein-protein and gene-gene interactions) among omics data samples, incorporating a graph convolutional network to first infer the missing values at different time points. NeTOIF then combines these inferred values with the original data to perform time-series imputation and forecasting using a long short-term memory network. NeTOIF achieved an 11.3% improvement in average mean square error for imputation and a 6.4% improvement for forecasting over baseline methods.
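The graph-convolution step at the heart of such an approach can be sketched as a single Kipf-Welling-style layer with symmetric normalization; this is a generic illustration, and NeTOIF's actual architecture may differ:

```python
import numpy as np

def gcn_layer(adj, x, w):
    """One graph-convolution step: average each node's features with its
    neighbours' (symmetric normalization with self-loops), then apply a
    learned linear map and a ReLU nonlinearity."""
    a_hat = adj + np.eye(adj.shape[0])           # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ x @ w, 0.0)       # ReLU
```

In the imputation setting, `adj` would encode the protein-protein or gene-gene interaction network, and `x` the partially observed omics measurements at one time point.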

A-320: Probabilistic Machine Learning enables dense Spot Detection in Single-Cell Spatial Transcriptomics with high sensitivity
Track: MLCSB
  • Jenkin Tsui, BC Cancer Research Centre, Canada
  • Yukta Thapliyal, BC Cancer Research Centre, Canada
  • Andrew Roth, BC Cancer Research Centre, Canada


Presentation Overview: Show

Single-cell spatial transcriptomics is a powerful method for studying tissue composition and organization, but decoding multiplex image stacks remains challenging due to limitations in current methods. To address these issues, we present Savannah, a new probabilistic method for spot detection in single-cell spatial transcriptomics. Savannah utilizes a hierarchical Bayesian model, along with mean-field approximation and a stochastic variational inference algorithm, to efficiently estimate model parameters. With pre-processing and post-processing modules, Savannah enhances the signal-to-noise ratio and can handle large datasets. Savannah provides a measure of confidence in the decoded spots and quantifies the uncertainty associated with each spot. Our results demonstrate that Savannah provides similar or better accuracy than existing approaches, while greatly improving sensitivity. Compared to existing approaches, Savannah detected seven times as many genes per field of view and increased the detection of genes per cell by four times, with a 38% decrease in the rate of detection of genes outside cells. Furthermore, Savannah's running time is comparable to existing approaches, making it a promising tool for decoding spatial transcriptomics data. Savannah's probabilistic approach and ability to handle large datasets make it essential for studying tissue composition and organization.

A-321: SimbaML: Supporting informed machine learning by ordinary differential equation model simulations
Track: MLCSB
  • Maximilian Kleissl, Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Germany
  • Lukas Drews, Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Germany
  • Benedict B. Heyder, Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Germany
  • Julian Zabbarov, Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Germany
  • Pascal Iversen, Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Germany
  • Simon Witzke, Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Germany
  • Bernhard Y. Renard, Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Germany
  • Katharina Baum, Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam; Department of Mathematics and Computer Sciences, Free University Berlin, Germany, Germany


Presentation Overview: Show

Training sophisticated machine learning (ML) models requires large datasets. These are particularly expensive to collect for molecular processes, and coherent datasets for specific contexts that include sufficient samples are still sparsely available. If we have prior knowledge about system dynamics, mechanistic representations can be used to supplement real-world data. In life sciences, ordinary differential equation (ODE) models have been curated, e.g., for many signaling pathways, cellular metabolism, as well as macroscopic processes such as infection spreading dynamics. We present SimbaML (simulation-based ML), an open-source tool that leverages the information from such ODE models and unifies realistic synthetic dataset generation and direct analysis and inclusion in ML pipelines. Thereby, data generated with SimbaML accounts for measurement errors as well as biological variability. We showcase how SimbaML can be used to determine the best ML method for different amounts of available data for a MAPK signaling pathway model. In addition, we demonstrate how data augmented with simulations of a model of infection spread can improve ML-based forecasts, especially under sparse data settings. SimbaML conveniently enables investigating transfer learning from synthetic to real-world data, data augmentation, identifying needs for data collection, and benchmarking physics-informed ML approaches.
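As a toy illustration of the simulation-based idea, one can generate noisy synthetic training series from a mechanistic ODE. The SIR infection model and all parameters below are illustrative assumptions, not SimbaML's API:

```python
import numpy as np

def simulate_sir(beta=0.3, gamma=0.1, n_days=50, dt=0.1, noise_sd=5.0, seed=0):
    """Generate a noisy synthetic infection time series from an SIR ODE
    (integrated with forward Euler), mimicking simulation-based training
    data that accounts for measurement error via additive noise."""
    s, i, r = 990.0, 10.0, 0.0
    n = s + i + r
    steps_per_day = int(round(1 / dt))
    daily = []
    for _day in range(n_days + 1):
        daily.append(i)                          # record infected count each day
        for _ in range(steps_per_day):
            new_inf = beta * s * i / n * dt      # S -> I transitions
            new_rec = gamma * i * dt             # I -> R transitions
            s, i, r = s - new_inf, i + new_inf - new_rec, r + new_rec
    rng = np.random.default_rng(seed)
    return np.array(daily) + rng.normal(0, noise_sd, n_days + 1)
```

Series like this, drawn with varied parameters and noise levels, could supplement scarce real-world observations when training a forecasting model.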

A-322: ProCapNet: Dissecting the cis-regulatory syntax of transcription initiation with deep learning
Track: MLCSB
  • Kelly Cochran, Stanford University, Department of Computer Science, United States
  • Melody Yin, The Harker School, United States
  • Jacob Schreiber, Stanford University, Department of Genetics, United States
  • Anshul Kundaje, Stanford University, Departments of Genetics & Computer Science, United States


Presentation Overview: Show

The DNA sequence determinants of mammalian Pol II transcription initiation remain incompletely understood. Although we've identified overrepresented motifs in promoters, a third of human promoters contain no known motifs; in promoters with known motifs, how sequence translates into TSS positioning and promoter activity is poorly characterized. We know even less about initiation at enhancers. To address these knowledge gaps, we trained a deep learning model, ProCapNet, to predict transcription initiation, measured genome-wide at base-resolution by PRO-cap, from DNA sequence. ProCapNet accurately predicts exact TSS locations and initiation rate consistently across promoter classes and at enhancers. We next applied a model interpretation framework to identify a high-sensitivity collection of motifs predictive of transcription initiation. Then, to dissect how these motifs modulate initiation, we performed systematic in silico mutational experiments. Results suggest nuanced epistasis: motifs play specialized roles, dependent on other nearby motifs. For multiple motifs, we identified a novel secondary function as direct initiation sites. We quantified the contribution of motifs to TSS positioning and initiation rate, finding motif-specific positioning signatures that suggest a general rule of redistribution. Finally, we compared the sequence determinants of initiation in promoters vs. enhancers; results support a unified model of cis-regulatory syntax for transcription initiation.
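The in silico mutational experiments described above follow a generic pattern: mutate each base, re-run the model, and record the change in predicted signal. A minimal sketch with a stand-in `predict` callable (ProCapNet itself is not reproduced here):

```python
import numpy as np

def in_silico_mutagenesis(seq, predict):
    """Score every single-base substitution by the change it causes in a
    model's predicted signal. `predict` is any callable mapping a DNA
    sequence string to a scalar (a hypothetical stand-in for a trained
    model's output)."""
    base_score = predict(seq)
    effects = np.zeros((4, len(seq)))            # rows follow the order A, C, G, T
    for pos in range(len(seq)):
        for bi, base in enumerate("ACGT"):
            if seq[pos] == base:
                continue                         # reference base: effect stays 0
            mutant = seq[:pos] + base + seq[pos + 1:]
            effects[bi, pos] = predict(mutant) - base_score
    return effects
```

Large negative entries flag positions where a substitution destroys a predictive motif, which is how motif instances and their epistatic dependencies can be mapped.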

A-323: ntEmbd: Deep learning embedding for nucleotide sequences
Track: MLCSB
  • Saber Hafezqorani, BC Cancer Genome Sciences Centre, Canada
  • Ka Ming Nip, BC Cancer Genome Sciences Centre, Canada
  • Inanc Birol, BC Cancer Genome Sciences Centre, Canada


Presentation Overview: Show

Deep learning has become increasingly popular in genomics research, partly due to the explosion of 'omics data and improvements in computing power. We introduce ntEmbd, a deep learning embedding to extract features from nucleotide sequences, which can then be used for downstream tasks like classification and clustering. Utilizing an autoencoder architecture, it generates a fixed-dimensional latent representation that captures both local and long-range feature dependencies. We demonstrate ntEmbd's effectiveness in a functional annotation task, classifying RNA sequences as coding or noncoding transcripts. Using the ntEmbd-generated representations of the RNA sequences and their labels based on the GENCODE biotype classes, we train a supervised classifier to distinguish coding vs. noncoding transcripts. Our classifier achieved an accuracy of 0.88 on the mRNN-challenge dataset, outperforming five other predictors: RNASamba (0.83), mRNN (0.87), CPAT (0.73), CPC2 (0.69), and FEELnc (0.78). We further examined the model's performance in discerning full vs. partial-length RNA transcripts as well as in detecting adapter sequences on Oxford Nanopore cDNA reads, achieving an accuracy of 0.91, with 0.97 recall and 0.86 precision. We believe that ntEmbd has the potential to be a valuable tool for other sequence classification and clustering tasks when fine-tuned to the specific problem space.

A-324: tAMPer: structure-aware deep learning model for toxicity prediction of antimicrobial peptides
Track: MLCSB
  • Hossein Ebrahimikondori, BC Cancer's Genome Sciences Centre, Canada
  • Darcy Sutherland, BC Cancer's Genome Sciences Centre, Canada
  • Anat Yanai, BC Cancer's Genome Sciences Centre, Canada
  • Lauren Coombe, BC Cancer's Genome Sciences Centre, Canada
  • Rene Warren, BC Cancer's Genome Sciences Centre, Canada
  • Inanc Birol, BC Cancer's Genome Sciences Centre, Canada


Presentation Overview: Show

Antimicrobial resistance poses a significant threat to public health, and alternative treatments to conventional antibiotics are urgently needed. Antimicrobial peptides (AMPs) offer a promising avenue, but screening for potential toxicity can be time-consuming and costly. Accurately predicting the toxicity of peptides using computational tools can facilitate the rapid screening of large numbers of candidate AMPs. We introduce tAMPer, a deep learning model that predicts the toxicity of AMPs using sequence and structural information. tAMPer represents peptides as graphs based on their AlphaFold-predicted structures, where nodes and edges correspond to amino acids and their interactions, respectively. Graph neural networks are used to extract the structural features. Additionally, tAMPer utilizes recurrent neural networks to capture temporal dependencies in each peptide's amino acid sequence. Evaluations on a publicly available protein benchmark and in-house hemolysis experiments show that tAMPer outperforms state-of-the-art methods in accuracy and F1 score, achieving 75.00% accuracy in our independent set compared to the second-best method's 61.84%. tAMPer provides interpretable feature importance scores, aiding in identifying the structural features of AMPs contributing to toxicity. Our work demonstrates the potential of three-dimensional peptide structure predictions and graph neural networks for safer and more effective AMP therapeutics to combat antimicrobial resistance.
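Building a residue graph from a predicted structure reduces to thresholding pairwise distances. A sketch assuming Cα coordinates and an 8 Å cutoff (the cutoff and representation are assumptions, not tAMPer's documented settings):

```python
import numpy as np

def contact_graph(ca_coords, cutoff=8.0):
    """Build a residue-level contact graph from predicted C-alpha
    coordinates: nodes are residues, and edges connect residue pairs
    closer than `cutoff` angstroms (self-pairs excluded)."""
    coords = np.asarray(ca_coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]   # pairwise displacement
    dist = np.sqrt((diff ** 2).sum(-1))              # pairwise Euclidean distance
    adj = (dist < cutoff) & ~np.eye(len(coords), dtype=bool)
    return adj
```

The resulting boolean adjacency matrix, together with per-residue features, is the usual input to a graph neural network over peptide structures.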

A-325: EPIC-TRACE: predicting TCR binding to unseen epitopes using attention and contextualized embeddings
Track: MLCSB
  • Dani Korpela, Department of Computer Science, Aalto University, Finland
  • Emmi Jokinen, Hematology Research Unit Helsinki, Helsinki University Hospital Comprehensive Cancer Center, Finland
  • Alexandru Dumitrescu, Department of Computer Science, Aalto University, Finland
  • Jani Huuhtanen, Hematology Research Unit Helsinki, Helsinki University Hospital Comprehensive Cancer Center, Finland
  • Satu Mustjoki, Hematology Research Unit Helsinki, Helsinki University Hospital Comprehensive Cancer Center, Finland
  • Harri Lähdesmäki, Department of Computer Science, Aalto University, Finland


Presentation Overview: Show

T cells play an essential role in the adaptive immune system's defense against pathogens and cancer but may also give rise to autoimmune diseases. Recognition of a peptide-MHC (pMHC) complex by a T cell receptor (TCR) is required to elicit an immune response. Many machine learning models have been developed to predict this binding, but generalizing predictions to pMHCs outside the training data remains challenging. We have developed a new machine learning model that utilizes information about the TCR from both alpha and beta chains, the epitope sequence, and the MHC. Our method uses ProtBERT embeddings for the amino acid sequences of both chains and the epitope, as well as convolution and multi-head attention architectures.
We show the importance of each input feature as well as the benefit of including epitopes with only a few TCRs in the training data. We evaluate our model on existing databases and show that it compares favorably against other state-of-the-art models.

A-326: Uncertainty Quantification of Deep Learning Models for Predicting the Regulatory Activity of DNA Sequences
Track: MLCSB
  • Hüseyin Anil Gündüz, LMU Munich, MCML, Germany
  • Sheetal Giri, TU Munich, Germany
  • Martin Binder, LMU Munich, MCML, Germany
  • Bernd Bischl, LMU Munich, MCML, Germany
  • Mina Rezaei, LMU Munich, MCML, Germany


Presentation Overview: Show

The field of computational biology has been enhanced by deep learning (DL) models, which hold great promise for revolutionizing fields such as protein folding and drug discovery. Estimating the epistemic -- or model -- uncertainty of DL model predictions is critical for building trustworthy machine learning systems in biological applications, as data often contain experimental errors and biases, and models have their own biases. Uncertainty quantification (UQ) can help improve models by identifying overconfidence or underconfidence in predictions.
In this paper, we (i) study several uncertainty quantification methods on a multi-target regression task, specifically predicting regulatory activity profiles from DNA sequence data; (ii) explore the effectiveness of the proposed methods for in-domain generalization and out-of-domain (OOD) detection, including benchmarking the prediction performance increase for in-domain data and the Pearson correlation increase for OOD data when deep ensembles are utilized; (iii) observe good coverage of prediction intervals on the test set when using conformalized quantile regression as the UQ method; and (iv) observe a moderate correlation between the absolute change in uncertainty and experimentally validated gene expression log fold changes for single-nucleotide mutations in a subset of the data.
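Conformalized quantile regression works by widening (or shrinking) predicted quantile intervals with a calibration-set correction. A minimal split-conformal sketch, not the paper's exact implementation:

```python
import numpy as np

def conformalize(lo_cal, hi_cal, y_cal, lo_test, hi_test, alpha=0.1):
    """Adjust predicted quantile intervals so that, under exchangeability,
    test coverage is at least 1 - alpha. Conformity scores measure how far
    calibration targets fall outside their predicted intervals (negative
    scores mean the intervals were wider than necessary)."""
    scores = np.maximum(lo_cal - y_cal, y_cal - hi_cal)
    n = len(y_cal)
    # Finite-sample-corrected quantile of the conformity scores.
    k = min(int(np.ceil((n + 1) * (1 - alpha))) - 1, n - 1)
    q = np.sort(scores)[k]
    return lo_test - q, hi_test + q
```

When the base quantile model is over-wide, `q` is negative and the intervals tighten; when it is over-confident, `q` is positive and they widen until the coverage guarantee holds.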

A-327: Benchmarking Drug Perturbation Models: Exploring Model Architectures, Training Regimes, and Performance on Novel Data
Track: MLCSB
  • Adriano Martinelli, IBM Research Zurich; ETH Zurich, Switzerland
  • Jannis Born, IBM Research Zurich, Switzerland
  • Maria Anna Rapsomaniki, IBM Research Zurich, Switzerland


Presentation Overview: Show

Motivation: Accurate prediction of drug perturbations is critical for advancing drug discovery. Machine learning (ML) models have shown great potential, but they can be influenced by factors such as model architecture, data representation, and training regime [1, 2, 3].



Methods: We evaluate state-of-the-art ML models on multiple drug discovery tasks from the Therapeutic Data Commons Initiative (TDC) [3]. We explore the influence of different model architectures, data representation choices, and training regimes on model performance. We compare simpler, interpretable models, such as nearest-neighbor analysis [2], against more complex ML models. Finally, we evaluate the models on a novel drug perturbation dataset of bladder cancer patient-derived organoids (PDOs) [4].



Results & Conclusion:

Our benchmarking study provides insights into the performance of drug perturbation models and highlights the importance of carefully considering model architecture, data representation, and training regime when developing these models. We also observe that state-of-the-art models overemphasize drug features over omic features when making predictions, confirming previous findings [1, 5] and suggesting that simpler models are on par with deep learning models while being more interpretable. Finally, our evaluation on a novel drug perturbation dataset of PDOs provides a valuable resource for future drug discovery efforts.

A-328: Evaluating computational methods and multi-omics technologies used for biomarker discovery in late Phase2/Phase3 randomized clinical trials
Track: MLCSB
  • Nikolaos Trasanidis, Biomarker Analytics, Computational Biology, GSK, Stevenage, UK, United Kingdom
  • Ashwini Venkatasubramaniam, Statistics and Predictive Modelling, Development, GSK, Stevenage, UK, United Kingdom
  • Aris Perperoglou, Statistics and Predictive Modelling, Development, GSK, Stevenage, UK, United Kingdom
  • Johannes Freudenberg, Biomarker Analytics, Computational Biology, GSK, Stevenage, UK, United Kingdom
  • Paul Newcombe, Statistics and Predictive Modelling, Development, GSK, Stevenage, UK, United Kingdom


Presentation Overview: Show

The decreasing costs of high-throughput technologies, including genomics, transcriptomics, and proteomics, have driven their wider application in clinical trials, e.g., to inform precision medicine strategies. Interrogating such data to identify predictive biomarkers of response represents a compelling but challenging computational problem. There is growing recognition that traditional “one-at-a-time” univariate analysis is sub-optimal, and research is required into principled, data-driven, and efficient methodologies. Here, we present an overview of current -omics technologies and recently proposed machine learning methods for predictive biomarker discovery, including the “Modified Covariate Lasso”, “Causal Forests”, and the “X-Learner”. We evaluate their utility for late-phase clinical trials through a series of realistic simulation studies, motivated by real applications underway at GSK. In a realistic phase 3 setting with 1000 biomarkers, where univariate analysis falls short, the Modified Covariate Lasso proved highly effective in distinguishing predictive effects, while also showing some capability in a phase 2 setting. Additionally, stratification based on causal forest and X-Learner biomarker signatures consistently replicated in independent data. Overall, we showcase the most prominent omics technologies and a selection of compelling recent machine-learning approaches to biomarker discovery in clinical trials, which may help enable novel precision medicine and companion diagnostics strategies.

A-329: Two-step transfer learning improves deep learning-based drug response prediction in small datasets
Track: MLCSB
  • Jie Ju, Erasmus Medical Center, Netherlands
  • Ioannis Ntafoulis, Department of Neurosurgery, Brain Tumor Center, Erasmus MC Cancer Institute, Rotterdam, Netherlands
  • Marcel Reinders, Delft University of Technology, Netherlands
  • Martine Lamfers, Department of Neurosurgery, Brain Tumor Center, Erasmus Medical Center, Netherlands
  • Andrew Stubbs, Erasmus MC, Netherlands
  • Yunlei Li, Erasmus Medical Center, Netherlands


Presentation Overview: Show

In drug response prediction studies, the scarcity of patient samples limits the feasibility of deep learning (DL). Glioblastoma (GBM) is aggressive with a low incidence; less than 50% of patients benefit from the main chemotherapy, temozolomide (TMZ). We developed a DL framework on RNA profiling data to predict the response to TMZ in a small set of GBM patient-derived cell cultures and investigated how transfer learning alleviates the small-sample-size problem.
We deployed two-step transfer learning on three datasets: GDSC, HGCC, and IMPRESSING. The GDSC dataset contains miscellaneous cell cultures from all tissue sites treated with Cyclophosphamide (CTX), Bortezomib (PS-341), TMZ, and Oxaliplatin (OXA). HGCC (n=83) and IMPRESSING (n=22) include TMZ-treated GBM cell cultures. To identify the best source dataset, DL models were pre-trained on the cell cultures from GDSC treated with each drug, respectively. The best DL model was then refined on HGCC and validated on IMPRESSING.
Two-step transfer learning with pre-training on OXA outperformed both no transfer and one-step transfer. OXA-based transfer learning may have succeeded because the mechanism of action of OXA is similar to that of TMZ and the OXA-treated cell cultures contain a balanced ratio of responders and non-responders.
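The two-step scheme above (pre-train on a large source dataset, then continue training on the small target dataset before validating on a held-out cohort) can be sketched with scikit-learn on synthetic data. This is only an illustration of the training schedule, not the authors' network or data: the array shapes mimic the reported cohort sizes, and the `warm_start` mechanism stands in for weight transfer.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Placeholder stand-ins for the three expression datasets:
# a large source set (GDSC-like), a refinement set (HGCC-like, n=83),
# and a small validation set (IMPRESSING-like, n=22).
X_source, y_source = rng.normal(size=(1000, 50)), rng.integers(0, 2, 1000)
X_refine, y_refine = rng.normal(size=(83, 50)), rng.integers(0, 2, 83)
X_valid = rng.normal(size=(22, 50))

# Step 1: pre-train on the large source dataset.
model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=300,
                      warm_start=True, random_state=0)
model.fit(X_source, y_source)

# Step 2: refine the same weights on the small target dataset
# (warm_start=True continues from the pre-trained parameters).
model.set_params(max_iter=100)
model.fit(X_refine, y_refine)

# Validate on the held-out small cohort.
pred = model.predict(X_valid)
print(pred.shape)  # (22,)
```

In practice the pre-training step would be repeated once per GDSC drug, and the drug whose refined model performs best on HGCC would be carried forward to IMPRESSING.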

A-330: Uncovering key regulatory mechanisms in the molecular response of mesenchymal stem cells to Mg2+ ions through protein network analysis
Track: MLCSB
  • Jalil Nourisa, Helmholtz-Zentrum Hereon, Germany
  • Antoine Passemiers, Dynamical Systems, Signal Processing and Data Analytics (STADIUS), KU Leuven, Belgium
  • Farhad Shakeri, Medical Faculty, University of Bonn, Germany
  • Berit Zeller-Plumhoff, Helmholtz-Zentrum Hereon, Germany
  • Heike Helmholz, Helmholtz-Zentrum Hereon, Germany
  • Christian Cyron, Helmholtz-Zentrum Hereon, Germany
  • Regine Willumeit-Römer, Helmholtz-Zentrum Hereon, Germany


Presentation Overview: Show

Magnesium (Mg) biomaterials have emerged as a promising alternative for orthopedic applications, owing to their biodegradable and bioactive properties. Mesenchymal stem cells (MSCs) are vital for bone tissue regeneration, making it essential to understand the molecular response of MSCs to Mg2+ ions in order to maximize the potential of Mg-based biomaterials. In this study, we conducted a multifaceted proteomics analysis to examine the molecular responses of MSCs to Mg2+ ions 1-21 days after treatment. We conducted gene regulatory network analysis using multiple techniques to uncover the regulatory relationships among the proteins. We studied the impact of Mg2+ ions on the resulting networks and identified the significant regulatory changes caused by the treatment. In addition, we analyzed each protein's regulatory role in the network and identified those most sensitive to Mg2+ ion treatment. The results of our analysis nominate MYL1, MDH2, GLS, and TRIM28 as the primary targets of Mg2+ ions in the MSC response during the 1-21 day phase. Our results also identify MDH2-MYL1, MDH2-RPS26, TRIM28-AK1, TRIM28-SOD2, and GLS-AK1 as the critical protein relationships affected by Mg2+ ions.

A-331: Transfer Learning for T-Cell Response Prediction
Track: MLCSB
  • Josua Stadelmaier, University of Tübingen, Germany
  • Brandon Malone, NEC OncoImmunity, Norway
  • Ralf Eggeling, University of Tübingen, Germany


Presentation Overview: Show

We study the prediction of T-cell response for given peptides, which could, among other applications, be a crucial step towards the development of personalized cancer vaccines. It is, however, a computational problem that cannot be solved by mere peptide:MHC binding prediction. Challenges include limited, heterogeneous training data featuring a multi-domain structure; such data entail the danger of shortcut learning, where models learn general characteristics of peptide sources, such as the source organism, rather than specific peptide characteristics associated with T-cell response. Using a transformer model for T-cell response prediction, we show that the danger of inflated predictive performance is not merely theoretical but occurs in practice. Consequently, we propose a domain-aware evaluation scheme. We then study an adversarial domain adaptation approach to reduce shortcut learning; despite effectively reducing shortcuts, this approach fails to improve overall predictive performance. As a likely reason, we identify negative transfer between different peptide sources, and consequently investigate a transfer learning approach that fine-tunes a pre-trained transformer model for each peptide source. We demonstrate this approach to be effective across a wide range of peptide sources and further show that our final model outperforms existing state-of-the-art approaches for predicting T-cell responses for human peptides.

A-332: Simultaneous Scoring of Clusters in Recursive Cluster Elimination, Applied to Transcriptomic Data Analysis
Track: MLCSB
  • Nurten Bulut, Abdullah Gul University, Turkey
  • Burcu Bakir-Gungor, Abdullah Gul University, Turkey
  • Bahjat F. Qaqish, University of North Carolina at Chapel Hill, United States
  • Malik Yousef, Zefat Academic College, Israel


Presentation Overview: Show

Small sample sizes and high data dimensionality make gene expression data analysis challenging. Feature selection (FS) is helpful for dimensionality reduction. Support Vector Machines Recursive Cluster Elimination (SVM-RCE) is a cluster-based feature selection method. In SVM-RCE, K-means is used to detect clusters of genes. Each cluster's sub-data contains the gene expression values of its member genes, retaining the samples' class labels. An SVM is then run with internal cross-validation to assign a score to each cluster, after which low-scoring clusters are eliminated. The process is iterated until a certain number of clusters is retained. The current study suggests using a linear SVM or Random Forest (RF) on cluster centers to score all clusters simultaneously. The score for each cluster is computed as the absolute value of the coefficient given by the linear SVM; similarly, the feature weights given by RF serve as the scores of the corresponding clusters.
We tested the new technique on 17 GEO transcriptomic datasets. Results show that it performs similarly to, but runs 80% faster than, the original version of SVM-RCE.
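The simultaneous scoring idea can be sketched with scikit-learn on toy data. This is an illustrative simplification, not the authors' pipeline: the dataset is random, and cluster-level features are taken as per-cluster mean expression per sample.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Toy expression matrix: 60 samples x 200 genes, binary class labels.
X = rng.normal(size=(60, 200))
y = rng.integers(0, 2, 60)

# 1) Cluster genes (columns) with K-means, as in SVM-RCE.
n_clusters = 10
km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X.T)

# 2) Represent each sample by its per-cluster mean expression.
centers = np.column_stack(
    [X[:, km.labels_ == c].mean(axis=1) for c in range(n_clusters)]
)

# 3) Score all clusters simultaneously: one linear SVM on the
#    cluster-level features; |coefficient| is the cluster score.
svm = LinearSVC(dual=False, max_iter=5000).fit(centers, y)
scores = np.abs(svm.coef_).ravel()

# Low-scoring clusters would be eliminated in the next RCE iteration.
keep = np.argsort(scores)[::-1][: n_clusters // 2]
print(scores.shape)  # (10,)
```

A single SVM fit thus replaces one internal cross-validation per cluster, which is where the reported speed-up comes from.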

A-333: The influence of negative data on model training and interpretability for genomic data containing motif interactions
Track: MLCSB
  • Marta S. Lemanczyk, Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Germany
  • Paulo Yanez Sarmiento, Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Germany
  • Jakub M. Bartoszewicz, Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Germany
  • Bernhard Renard, Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Germany


Presentation Overview: Show

Post-hoc attribution methods can give insights into machine learning models and the underlying prediction tasks. In genomics, such methods can uncover biologically relevant motifs from models like convolutional neural networks trained on genomic sequences. However, motif patterns and interactions can be learned in various ways depending on data availability and selection as well as on the model architecture. In particular, the composition of negative data plays a decisive role in how models learn to predict the positive class in binary classification. Additionally, interactions often make the model more complex and less interpretable. In our work, we investigate the influence of different compositions of interactive motifs in the negative data on model training and post-hoc interpretability. In our experimental setup, we synthetically generated datasets with different negative data containing motifs of transcription factor binding sites. We observe that models learn interactions differently depending on which negative data is used for training. This results in performance disparities in prediction and motif detection, including an accuracy-interpretability trade-off.
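Generating such datasets can be sketched as follows. This is a minimal illustration of the setup, not the authors' generator: the motifs, sequence length, and negative-set compositions are invented placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
ALPHABET = np.array(list("ACGT"))
MOTIF_A, MOTIF_B = "TGACTCA", "GGGGC"  # illustrative motifs only

def random_seq(length=100):
    # Uniform random DNA background.
    return rng.choice(ALPHABET, size=length)

def implant(seq, motif):
    # Overwrite a random window of the sequence with the motif.
    pos = rng.integers(0, len(seq) - len(motif))
    seq = seq.copy()
    seq[pos:pos + len(motif)] = list(motif)
    return seq

# Positive class: both motifs implanted (a motif interaction).
positives = ["".join(implant(implant(random_seq(), MOTIF_A), MOTIF_B))
             for _ in range(5)]

# Two different negative-set compositions: pure background vs.
# single-motif negatives, which force the model to learn the interaction.
neg_background = ["".join(random_seq()) for _ in range(5)]
neg_single = ["".join(implant(random_seq(), MOTIF_A)) for _ in range(5)]

print(all(MOTIF_B in s for s in positives))  # True
```

Training the same model against each negative set and comparing attributions is then what exposes how the negative composition shapes what is learned.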

A-334: Machine learning for the prediction of relapse in BRAF-mutated metastatic melanoma patients
Track: MLCSB
  • Sarah Dandou, LPHI, Université de Montpellier, CNRS, Montpellier, France
  • Kriti Amin, Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Germany
  • Holger Frölich, Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Germany
  • Romain Larive, IRCM, Université de Montpellier, Montpellier, France
  • Ovidiu Radulescu, LPHI, Université de Montpellier, Montpellier, France


Presentation Overview: Show

Metastatic melanoma has poor sensitivity to conventional therapies. The discovery of activating mutations of the BRAF oncogene in melanoma has opened new therapeutic avenues using kinase inhibitors. Several treatments targeting protein kinases involved in the canonical MAPK signaling pathway were developed, but patients develop resistance to these treatments as well. The mechanisms of resistance can be very diverse and may change from one patient to another; thus, a general descriptive model cannot fully explain the variation in treatment response. Our work aims to understand the impact of the pre-treatment tumor state on intrinsic or acquired resistance to MAPK inhibitors in melanoma patients. We use several statistical and machine learning models on retrieved data to predict patients' progression-free survival time from individual characteristics. Models are trained on data from a curated SQL database integrating several preexisting studies from the literature on clinical and molecular features of metastatic melanoma patients. Several strategies for integrating multimodal data have been addressed in this project, with a focus on genomic data processing. Their use on cohorts of data from various sources can be crucial to understanding biological phenomena such as molecular resistance in oncology.

A-335: A hierarchical multi-label classifier for animal COI metabarcode classification through sequence embedding with a self-supervised language model
Track: MLCSB
  • Ho-Jin Gwak, Hanyang University, South Korea
  • Ji Yong You, Marine Bio-Strategy Center, Marine Biodiversity Institute of Korea, South Korea
  • Jeong-Hyeon Choi, Marine Bio-Strategy Center, Marine Biodiversity Institute of Korea, South Korea
  • Mina Rho, Hanyang University, South Korea


Presentation Overview: Show

Metabarcoding has long been the subject of studies in the field of genomics. In particular, cytochrome c oxidase I (COI) genes have been used as DNA barcodes of animals. Nearly eight million COI gene sequences have been reported in the Barcode of Life Data System (BOLD) database, but only half of them are fully annotated with ranks from phylum to species. Currently, homology-search and probabilistic methods are used in eDNA studies. Homology-based methods lack generalization performance, and probabilistic methods require fully labeled sequences to train. Moreover, both approaches require a huge amount of time for sequence searching. Recently, language models have shown good performance with self-supervised training on massive unlabeled data, which improves the performance of classification tasks. In this study, we pre-trained a language model using seven million COI gene sequences and trained a phylum-level classifier and lower-level classifiers for eight phyla (Annelida, Arthropoda, Chordata, Cnidaria, Echinodermata, Mollusca, Nematoda, and Platyhelminthes). Our model surpassed the most widely used method, the naïve Bayesian classifier, in AU-ROC and AU-PR for every rank. Moreover, our model can classify 1.8 and 48 times faster than the naïve Bayesian classifier and BLAST, respectively.

A-336: Deep Learning-Based Identification of Tau Deposition and Genetic Factors in Alzheimer's Disease
Track: MLCSB
  • Taeho Jo, Indiana University School of Medicine, United States
  • Kwangsik Nho, Indiana University School of Medicine, United States
  • Andrew J. Saykin, Indiana University School of Medicine, United States


Presentation Overview: Show

We developed an innovative deep learning framework integrating image analysis and genetic data to understand the pathogenesis of Alzheimer's disease (AD). Utilizing a convolutional neural network (CNN) and Layer-wise Relevance Propagation (LRP), we analyzed tau PET images from 300 ADNI participants to classify AD and identify tau deposition morphological phenotypes. Additionally, the Sliding Window Association Test (SWAT) was employed on GWAS data from 981 ADNI participants to discover AD-associated single nucleotide polymorphisms (SNPs). Our CNN-LRP framework achieved a classification accuracy of 90.8%, with significant AD-associated tau depositions identified in bilateral temporal lobes. The SWAT method revealed known and novel SNPs linked to AD, with an AUC of 0.82. Integrating these results through a stacking ensemble improved classification accuracy. This promising approach enhances understanding of AD's genetic contributors and tau accumulation's role in disease progression.

A-337: Graph-based multi-modality integration for prediction of cancer subtype and severity
Track: MLCSB
  • Diane Duroux, University of Liège, Belgium
  • Christian Wohlfart, Roche Diagnostic International Ltd, Penzberg, Germany
  • Kristel Van Steen, University of Liège, Belgium
  • Antoaneta Vladimirova, Roche Molecular Systems, Inc., Santa Clara, United States
  • Michael King, Roche Diagnostic International Ltd, Penzberg, Germany


Presentation Overview: Show

Personalised cancer screening before therapy paves the way toward improving diagnostic accuracy and treatment outcomes. Most approaches are limited to a single data type and do not consider interactions between features, leaving aside the complementary insights that multimodality and systems biology can provide. In this project, we demonstrate the use of graph theory for data integration via individual networks where nodes and edges are individual-specific. In particular, we showcase the consequences of early, intermediate, and late graph-based fusion of RNAseq data and histopathology whole-slide images for predicting cancer subtypes and severity. The methodology developed is as follows: 1) we create individual networks; 2) we compute the similarity between individuals from these graphs; 3) we train our model on the similarity matrices; 4) we evaluate the performance using the macro F1 score. Pros and cons of elements of the pipeline are evaluated on publicly available real-life datasets. We find that graph-based methods can increase performance over methods that do not study interactions. Additionally, merging multiple data sources often improves classification compared to models based on single data, especially through intermediate fusion. Notably, when the graphs' nodes are interpretable units of analysis, it helps highlight promising pathways and molecular processes.
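The four pipeline steps above can be sketched on toy data. This is a deliberately simplified stand-in for the authors' method: the individual networks here are outer-product graphs, the similarity is a Frobenius inner product, and all names and shapes are illustrative.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

# Toy data: 40 individuals, 15 features each, 3 classes.
X = rng.normal(size=(40, 15))
y = rng.integers(0, 3, 40)

# 1) Individual networks: an individual-specific adjacency matrix per
#    person (here, the outer product of each feature vector).
nets = np.einsum("ni,nj->nij", X, X)

# 2) Similarity between individuals: Frobenius inner product of graphs,
#    which yields a valid (positive semidefinite) kernel matrix.
flat = nets.reshape(len(X), -1)
K = flat @ flat.T

# 3) Train on the similarity matrix via a precomputed-kernel SVM.
train, test = np.arange(0, 30), np.arange(30, 40)
clf = SVC(kernel="precomputed").fit(K[np.ix_(train, train)], y[train])

# 4) Evaluate with the macro F1 score.
pred = clf.predict(K[np.ix_(test, train)])
macro_f1 = f1_score(y[test], pred, average="macro")
print(round(macro_f1, 3))
```

Early, intermediate, and late fusion would differ only in where the RNAseq and imaging modalities are merged: before building the networks, at the similarity matrix, or at the prediction level.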

A-338: Improving Clinical Outcomes for Patients with Cancer of Unknown Primary: a Gene-Expression Based Classifier to Predict Tissue of Origin
Track: MLCSB
  • Lorenzo Perino, Rigshospitalet - Center for Genomic Medicine, Denmark
  • Magnús Halldór Gíslason, Rigshospitalet - Center for Genomic Medicine, Denmark
  • Tobias Overlund Stannius, Rigshospitalet - Center for Genomic Medicine, Denmark
  • Frederik Otzen Bagger, Rigshospitalet - Center for Genomic Medicine, Denmark


Presentation Overview: Show

Carcinoma of unknown primary (CUP) is a recurrent clinical presentation in which metastasizing malignant cells are found in the body but the primary site of the cancer is not known. Because the origin of the cancer is unknown, treatment options are limited, and patients with CUP have a poor prognosis.
A CUP classification model has been developed to classify CUP tumors into the tissue type of origin. The classifier was trained on publicly available gene expression data from GTEx and TCGA, thus retaining the ability to also classify and account for components of normal tissue in the tumor biopsy. Importantly, we could test on an in-house clinical cohort of over 1500 tumor samples with a known primary from routine diagnostics.
The CUP classifier achieved promising results in accurately identifying the primary cancer type of origin for CUP tumors. The classifier's performance was evaluated with the Matthews correlation coefficient (MCC) against the main benchmarks in the field.
The CUP classifier has the potential to be a useful tool for oncologists in identifying the primary site of the cancer, which can lead to more targeted and effective treatment options.
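For reference, the MCC generalizes naturally to multi-class problems like tissue-of-origin prediction. A minimal example with hypothetical labels (the tissue names and predictions below are invented, not the study's data):

```python
from sklearn.metrics import matthews_corrcoef

# MCC on a toy multi-class prediction: hypothetical tissue-of-origin
# labels for eight samples, six of which are classified correctly.
y_true = ["lung", "colon", "breast", "lung", "colon", "breast", "lung", "colon"]
y_pred = ["lung", "colon", "breast", "colon", "colon", "breast", "lung", "lung"]

mcc = matthews_corrcoef(y_true, y_pred)
print(round(mcc, 3))  # 0.619
```

Unlike plain accuracy, the MCC accounts for class imbalance across tissue types, which is why it is a common choice for this kind of benchmark.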

A-339: Scoring and ranking strategies to benchmark cell type deconvolution pipelines
Track: MLCSB
  • Vadim Bertrand, CNRS, UMR 5525, VetAgro Sup, Grenoble INP, TIMC, University Grenoble Alpes, 38000 Grenoble, France
  • Elise Amblard, CNRS, UMR 5525, VetAgro Sup, Grenoble INP, TIMC, University Grenoble Alpes, 38000 Grenoble, France
  • Magali Richard, CNRS, UMR 5525, VetAgro Sup, Grenoble INP, TIMC, University Grenoble Alpes, 38000 Grenoble, France


Presentation Overview: Show

With the emergence of new standards of care for cancer patients, omics data are now routinely collected. These data are used for diagnostic purposes in order to classify patients. However, current classifications could be improved, especially by taking into account the cell type composition of the tumor (tumor heterogeneity). Because of the complexity of tumor heterogeneity, there is still no consensus method for estimating tumor composition from bulk data, although many deconvolution algorithms have been developed over the past decade.
To evaluate and compare these deconvolution methods, as well as our own deconvolution strategies, we need a robust and comprehensive ranking process. We focus on how to weight various aspects of a method's outcome, such as raw performance or speed.
We simulated benchmark data to compare deconvolution algorithms on several evaluation criteria. We tested several pipelines for combining the various performance scores and designed a ranking strategy to establish an overall leaderboard, whose stability we checked. The pipeline also computes p-values to assess the significance of differences between ranking scores.
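One simple way to combine heterogeneous scores into an overall leaderboard is rank aggregation: rank methods per criterion, then average the ranks. This is a generic sketch, not the authors' pipeline; the method names and scores are invented.

```python
import numpy as np

# Toy leaderboard: 4 deconvolution methods scored on 3 criteria
# (higher is better for all; "speed" is pre-converted so that higher
# means faster). Names and numbers are illustrative only.
methods = ["A", "B", "C", "D"]
scores = np.array([
    [0.90, 0.70, 0.50],   # method A: performance, stability, speed
    [0.85, 0.80, 0.90],   # method B
    [0.60, 0.95, 0.70],   # method C
    [0.70, 0.60, 0.95],   # method D
])

# Rank methods per criterion (rank 1 = best), then aggregate by
# averaging ranks across criteria to get an overall leaderboard.
order = np.argsort(-scores, axis=0)          # best-first per criterion
ranks = np.empty_like(order)
np.put_along_axis(ranks, order,
                  np.arange(1, len(methods) + 1)[:, None], axis=0)
overall = ranks.mean(axis=1)

leaderboard = [methods[i] for i in np.argsort(overall)]
print(leaderboard)
```

Rank aggregation avoids mixing incommensurable units (e.g., RMSE and seconds); stability can then be assessed by repeating the aggregation over bootstrap resamples of the benchmark datasets.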

A-340: Assessing software for antimicrobial resistance prediction from genome data
Track: MLCSB
  • Kaixin Hu, Helmholtz Centre for Infection Research, Germany
  • Fernando Meyer, Helmholtz Centre for Infection Research, Germany
  • Zhi-Luo Deng, Helmholtz Centre for Infection Research, Germany
  • Ehsaneddin Asgari, Helmholtz Centre for Infection Research, Germany
  • Tzu-Hao Kuo, Helmholtz Centre for Infection Research, Germany
  • Philipp Münch, Helmholtz Centre for Infection Research, Germany
  • Alice McHardy, Helmholtz Centre for Infection Research, Germany


Presentation Overview: Show

The dissemination of rapid and affordable whole-genome sequencing promises the utilization of genome data for predicting antimicrobial resistance (AMR) phenotypes with in silico techniques. Machine learning (ML) methods can operate without expert knowledge and have the potential to outperform rule-based AMR catalog mapping in accuracy and predictive scope, but the field currently lacks an objective assessment of this potential. To address this, we systematically evaluated both approaches for predicting AMR phenotypes across 44 antibiotics and 11 species. The best-performing ML methods varied in performance: they consistently had high F1-scores across classes (> 0.95) for C. jejuni and E. faecium, while for others, such as E. coli, M. tuberculosis, K. pneumoniae, and P. aeruginosa, the F1-score was 0.49-0.98 in random cross-validation and declined further with increasing evolutionary divergence. While ML-based methods performed best when “test” strains were closely related to the “training” strain genomes, rule-based AMR catalog mapping, where available, generalized better to evolutionarily more divergent ones. Our study guides the selection of AMR phenotype prediction methods for 78 species-antibiotic combinations, covering both ML- and rule-based predictions.

A-341: Evaluating Protein Language Model Finetuning
Track: MLCSB
  • Robert Schmirler, Technical University of Munich (TUM) - Rostlab, Germany
  • Burkhard Rost, Technical University of Munich (TUM) - Rostlab, Germany


Presentation Overview: Show

Predictors based on state-of-the-art protein language models (PLMs) have been shown to be competitive on a large variety of protein prediction tasks. While these models and their embeddings already perform well when used in their pretrained state, we investigate the potential of fine-tuning them for individual tasks.

For this, we provide a benchmark using eight previously published datasets. Our benchmark includes:
• structural and functional predictions,
• diverse datasets ranging from a large sequence space to narrow mutational fitness landscapes of a single protein,
• per-protein regression, per-protein classification, and per-residue classification tasks,
• three state-of-the-art model architectures (ESM2, ProtT5, Ankh),
• model sizes ranging from 8M to 2.8B parameters.

We add problem-type-specific classification heads on top of the PLM encoder and train the entire model, including the head, in a supervised fashion. For larger models we apply parameter-efficient fine-tuning, which allows computationally efficient training while preventing catastrophic forgetting.

Our results show that fine-tuning leads to increased prediction performance compared to pretrained embeddings of the same model. We find this for all datasets, problem types, model architectures, and sizes.
Therefore, our general recommendation is to fine-tune PLMs instead of training predictors on pretrained model embeddings.

A-342: Identifying risk factors for disease progression of COVID-19 using statistical machine learning
Track: MLCSB
  • Florian König, University of Tübingen, Tübingen, Germany
  • Pontus Hedberg, Karolinska Institutet, Stockholm, Sweden
  • Pontus Naucler, Karolinska Institutet, Stockholm, Sweden
  • Anders Sönnerborg, Karolinska Institutet, Stockholm, Sweden
  • Francis Drobniewski, Imperial College London, London, United Kingdom
  • Elizabeth Sheridan, University Hospitals Dorset NHS Foundation Trust Poole Hospital, Poole, United Kingdom
  • Anna Mantzouratou, Bournemouth University, Bournemouth, United Kingdom
  • Nico Pfeifer, University of Tübingen, Tübingen, Germany


Presentation Overview: Show

The COVID-19 pandemic has had a profound impact on global health, with mortality being a critical outcome of interest. One challenge in studying COVID-19 outcomes, such as mortality due to the acute disease, is the potential influence of phylogenetic selection bias, as the virus has evolved over time. To address this, we employed a Bayesian generalized linear mixed model (GLMM) that corrects for the phylogenetic structure of the SARS-CoV-2 variants of concern when predicting acute mortality.
We utilized viral sequence data from a large cohort of patients in Sweden and the UK, enabled through cooperation in the EUCARE project. By incorporating the viral population structure as a random effect in the GLMM, we are able to correct for mutations that are overrepresented due to their co-occurrence in variants of concern, as has been shown for HIV.
In conclusion, our study shows that incorporating phylogenetic structure in the analysis of viral sequence data is not only interesting for the analysis of HIV but also for SARS-CoV-2. Our findings could provide insights into the potential impact of viral evolution on disease outcomes, which may inform clinical decision-making in managing the ongoing COVID-19 pandemic. These approaches are also applicable to other viral infections.
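Such a model can be written in a generic form (the notation below is illustrative; the authors' exact parameterization may differ):

```latex
\begin{align*}
  \operatorname{logit}\, \Pr(y_i = 1)
    &= \mathbf{x}_i^\top \boldsymbol{\beta} + u_{g(i)}, \\
  \mathbf{u} &\sim \mathcal{N}\!\left(\mathbf{0}, \sigma^2 \mathbf{K}\right),
\end{align*}
```

where $y_i$ indicates acute mortality for patient $i$, $\mathbf{x}_i$ collects fixed-effect covariates (including viral mutations), $g(i)$ maps a patient to a viral lineage, and $\mathbf{K}$ encodes phylogenetic relatedness among lineages, so that mutations co-occurring within a variant of concern are not overcounted.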

A-343: Dissecting cellular heterogeneity in myelodysplastic syndrome progression using deconvolution methods.
Track: MLCSB
  • Francesco Gandolfi, Department of Experimental Oncology, European Institute of Oncology, Milan, Italy
  • Veronica Vallelonga, Department of Experimental Oncology, European Institute of Oncology, Milan, Italy
  • Alberto Termanini, Human Technopole, Milan, Italy
  • Elena Saba, Center for Accelerating Leukemia/Lymphoma Research – Humanitas Clinical and Research Center, Rozzano, Milan, Italy
  • Matteo Zampini, Center for Accelerating Leukemia/Lymphoma Research – Humanitas Clinical and Research Center, Rozzano, Milan, Italy
  • Matteo Della Porta, Center for Accelerating Leukemia/Lymphoma Research – Humanitas Clinical and Research Center, Rozzano, Milan, Italy
  • Serena Maria Luisa Ghisletti, Department of Experimental Oncology, European Institute of Oncology, Milan, Italy


Presentation Overview: Show

Myelodysplastic syndromes (MDS) are neoplastic disorders originating from an expanding subset of hematopoietic stem cells (HSCs). These proliferative MDS cells can progress to secondary acute myeloid leukemia (sAML), a disease characterized by a complex repertoire of expanding cells at various stages of differentiation. Understanding the dynamics of different cell subpopulations is crucial for comprehending disease evolution, mechanisms, and their impact on hematopoiesis.
In our study, we employed a support vector regression deconvolution method, CIBERSORTx, to dissect cellular complexity in a large cohort of MDS and AML patient RNA-seq samples at various stages of progression. Analyzing cell-type abundances among groups revealed distinct patterns of variability, consistent with ATAC-seq profiles from pre-sorted CD34+ hematopoietic stem/progenitor cell subpopulations. This highlights the efficacy of our approach in exploring cell heterogeneity among different patient groups.
We also performed high-resolution profiling of mixed samples to reconstruct cell-type-specific matrices across all patients, providing new insights into the genetic programs driving different HSPC subpopulations during the MDS/sAML transition.
Our work represents one of the first large-scale transcriptomic studies aimed at unraveling the regulatory dynamics of distinct HSPC subpopulations in MDS. It also demonstrates the utility of deconvolution techniques for analyzing highly heterogeneous transcriptomic datasets.

A-344: Graph Representation learning to Identify Regulatory Transcription and Splicing Patterns In Psychiatric Disorders
Track: MLCSB
  • Ghalia Rehawi, Helmholtz Zentrum Muenchen, Germany


Presentation Overview: Show

Changes to the expression levels of splicing factors have been found to disrupt the splicing process and have been identified as a risk factor in psychiatric diseases. In this work, we focus on splicing regulators and their relationship to the relative abundance of isoforms as one way to study the regulation of splicing in psychiatric patients vs. controls. Full-length transcriptome sequencing quantifies isoform-level expression, enabling us to study post-transcriptional modification processes. Our samples are extracted from peripheral blood mononuclear cells from 300 individuals with a broad spectrum of affective, anxiety, and stress-related mental disorders (n=223), as well as healthy controls (n=77). We aim to investigate the patterns of splicing regulators and transcription regulators in cases and controls using a network analysis framework; this would be critical in providing additional insights into disease risk mechanisms. We use the information-theoretic network inference approach ARACNE to create a network consisting of both the total expression of genes and the isoform ratios. We show the importance of investigating graph neural network techniques for the analysis and embedding of our created network.

A-345: Sweetwater: an efficient data-driven deconvolution framework for bulk RNA-seq data
Track: MLCSB
  • Jesús de la Fuente, Department of Electrical Engineering, University of Navarra, Spain and Center for Data Science, NYU, USA, Spain
  • Naroa Legarra Marcos, Center for Applied Medical Research (CIMA), University of Navarra, Spain
  • Ana Garcia-Osta, Center for Applied Medical Research (CIMA), University of Navarra, Spain
  • Carlos Fernandez-Granda, Center for Data Science, NYU, United States
  • Idoia Ochoa, Department of Electrical Engineering, University of Navarra, Spain
  • Mikel Hernaez, Center for Applied Medical Research (CIMA), University of Navarra, Spain


Presentation Overview: Show

Single-cell RNA sequencing (scRNA-seq) has revolutionized biomedical research with its ability to capture the cellular heterogeneity of biological tissues, but its elevated cost and the complexity of several tissues, such as the brain, hinder the applicability of this technology. Deconvolution (or digital cytometry) methods can infer cell type proportions and even cell-type-specific gene expression patterns (GEPs) from bulk RNA sequencing (bulk RNA-seq) data by leveraging a scRNA-seq reference matrix.
Current model-based digital cytometry approaches (e.g., CIBERSORTx and BayesPrism) have an inherent drawback: prior information on different cell types is encoded very rigidly, which precludes these methods from exploiting non-linear gene-gene or cell-cell interactions. Recently developed data-driven methods (e.g., Scaden and TAPE) can learn nonlinear patterns, but they lack interpretability, and their performance degrades significantly when tested on real bulk RNA-seq datasets.
In this work, we propose Sweetwater, an interpretable autoencoder-based deconvolution model that learns an interpretable low-dimensional embedding space while capturing the underlying patterns of bulk RNA-seq data, addressing both the interpretability and train/test transfer problems of state-of-the-art models. We have compared the performance of Sweetwater with state-of-the-art methods on datasets from different tissues, with promising results.

A-346: The Impact of Synthetic Dataset Size and Heterogeneity on Transfer Learning for Time Series Data
Track: MLCSB
  • Julian Zabbarov, Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Germany
  • Simon Witzke, Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Germany
  • Maximilian Kleissl, Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Germany
  • Pascal Iversen, Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Germany
  • Bernhard Y. Renard, Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Germany
  • Katharina Baum, Department of Mathematics and Computer Science, Free University Berlin, Germany


Presentation Overview: Show

Leveraging the capabilities of machine learning models heavily relies on the quality and quantity of data. Biological data is often difficult and costly to acquire. Transfer learning has been suggested to derive insights from accumulated, related data. Further, ordinary differential equation (ODE)-based synthetic data can be provided as additional knowledge. In this work, we investigate how synthetic data can supplement real-world datasets in a transfer learning approach for time series forecasting. Identifying the appropriate combination of synthetic dataset size and heterogeneity is crucial, as pre-training on homogeneous datasets can lead to overfitting. We conduct a multivariate study to explore how the interaction of features such as dataset size, different distributions of initial conditions, and kinetic parameters in ODEs impacts transfer learning. For this purpose, we generate different synthetic datasets, pre-train various standard time series forecasting models, and compare their predictions to models solely trained on observed data representing the real-world problem. Case studies stem from increasingly complex epidemiology and population dynamics prediction tasks to assess the impact of the ODE system’s complexity on the success of synthetic-data-derived transfer learning. We aim to provide researchers with practical guidelines to leverage the domain knowledge of ODEs in informed transfer learning approaches.

A-347: Extending mvTCR scRNAseq and TCR multimodal embeddings with pre-trained TCR-large language models
Track: MLCSB
  • Markus Haak, Technische Universität München, Ludwig-Maximilians-Universität München, Germany
  • Jan-Philipp Leusch, Technische Universität München, Ludwig-Maximilians-Universität München, Germany
  • Merle Stahl, Technische Universität München, Ludwig-Maximilians-Universität München, Germany
  • Yufan Xia, Technische Universität München, Ludwig-Maximilians-Universität München, Germany
  • Irene Bonafonte Pardàs, Helmholtz Munich, Germany
  • Benjamin Schubert, Helmholtz Munich, Germany


Presentation Overview: Show

Advances in single-cell sequencing enable the simultaneous measurement of the transcriptome and the T-cell receptor (TCR) sequences at cellular resolution, promising new insights into the T cell-mediated adaptive immune response. However, how to effectively integrate information from these two modalities is still an open question.
As shown by the multimodal generative model mvTCR, integrating transcriptomic profiles with TCR features into a joint representation improves the understanding of T-cell responses by identifying groups of cells with shared phenotypic and functional profiles reacting to the same antigen. In addition, several large language models (LLMs) have been proposed to improve TCR sequence representation, profiting from self-supervised representation learning on large unlabeled datasets.
Here, we combine these advances by replacing the sequential input and transformer-based architecture of mvTCR with a vectorized representation of TCRs from different TCR-specific LLMs. Using two multimodal datasets, we demonstrate that this approach improves mvTCR representations by generating embeddings that better capture transcriptomic heterogeneity while grouping cells with shared antigen reactivity. Among the tested LLMs, DeepTCR representations pre-trained on 13 million sequences offered the largest performance increase, representing each TCR by both its CDR3 sequences and V/D/J gene usage while leveraging large collections of TCR data.

A-348: Interpretable prediction of phage life cycle from unannotated DNA sequences
Track: MLCSB
  • Melania Nowicka, Hasso Plattner Institute, Germany
  • Jakub M. Bartoszewicz, Hasso Plattner Institute, Germany
  • Bernhard Y. Renard, Hasso Plattner Institute, Germany


Presentation Overview: Show

Background: Bacteriophages are an important part of the ecosystem and natural antibacterial agents that can be applied to treat bacterial infections. The phage life cycle is one of the optimality criteria for the design of phage-based therapeutics. Although many genes determining the life cycle are known, identifying or engineering phages with therapeutic potential remains challenging. Functional annotations of many known phage genomes are largely incomplete. Accurate life cycle predictions directly from reads or assembled contigs can help characterize complex metaviromic samples.

Results: We train deep residual networks (ResNets) to predict if a phage follows the virulent or temperate lifestyle directly from both full genomes and unassembled next-generation sequencing reads. The approach outperforms the previous state-of-the-art across a range of different datasets. This enables annotation of novel phages, discovered over an extended time after model training and belonging to novel clusters, highly divergent from training genomes. Further, we use custom feature attribution methods to explain the behaviour of our models, visualize the predicted genotype-phenotype relationships, and identify significant virulence-associated genes or regions of interest. Identification of those new, previously unannotated regions may open new paths towards rational phage engineering.

A-349: HOPS: Higher-order partial Shapley values trace deep contribution flows in neural networks
Track: MLCSB
  • Jakub M. Bartoszewicz, Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Germany
  • Paulo Yanez Sarmiento, Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Germany
  • Simon Witzke, Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Germany
  • Bernhard Y. Renard, Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Germany


Presentation Overview: Show

Background: Post-hoc attribution methods are an established group of approaches aiming to explain the behavior of machine learning models by assigning ‘relevance’ or ‘contribution’ scores to each input feature. A wide range of these methods have previously been shown to approximate Shapley values, a game-theoretic concept measuring the players’ influence on the result of a game. A specific subset of those approaches, including partial Shapley values, Concept Relevance Propagation (CRP), and GNN-LRP, track how hierarchical or conditional relevance of specific inputs flows through a multi-layer neural network. The applications range from explaining convolutional neural networks for genomics, to image processing, to graph neural networks for epidemiologic data.

Results: We propose a new theoretical framework, higher-order partial Shapley values (HOPS), unifying these previously developed approaches using a common notation. We show that while the other methods are special cases of HOPS, a larger class of more complex dependencies can easily be tracked using this more general approach. We also note a range of useful theoretical properties, which facilitate explaining arbitrarily complex hierarchical dependencies of the features learned by a model. Finally, we discuss example applications in viral, microbial, and eukaryotic genomics.
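For reference, the classical Shapley value that such attribution methods approximate assigns each feature its average marginal contribution over all feature subsets (the standard game-theoretic definition, not the HOPS extension itself):

```latex
\phi_i(v) \;=\; \sum_{S \subseteq N \setminus \{i\}}
\frac{|S|!\,\bigl(|N|-|S|-1\bigr)!}{|N|!}\,
\Bigl( v\bigl(S \cup \{i\}\bigr) - v(S) \Bigr)
```

Here $N$ is the set of players (input features) and $v(S)$ is the value of the game (model output) when only the features in $S$ participate; partial and higher-order variants additionally track how this contribution flows through specific hidden nodes of the network.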

A-350: A probabilistic approach for monitoring the proportions of SARS-CoV-2 variants in wastewater
Track: MLCSB
  • El Hacene Djaout, LPSM - Laboratoire de probabilités et modèles aléatoires - Sorbonne University, France
  • Siyun Wang, SUMMIT - Maison des Modélisations Ingénieries et Technologies - Sorbonne Université, France
  • Nicolas Cluzel, SUMMIT - Maison des Modélisations Ingénieries et Technologies - Sorbonne Université, France
  • Zuzana Gerber, Centre National de Recherche en Génomique Humaine, CEA, Université Paris-Saclay, Évry, France
  • Christian Daviaud, Centre National de Recherche en Génomique Humaine, CEA, Université Paris-Saclay, Évry, France
  • Marc Delepine, Centre National de Recherche en Génomique Humaine, CEA, Université Paris-Saclay, Évry, France
  • Robert Olaso, Centre National de Recherche en Génomique Humaine, CEA, Université Paris-Saclay, Évry, France
  • Florian Sandron, Centre National de Recherche en Génomique Humaine, CEA, Université Paris-Saclay, Évry, France
  • Jean-François Deleuze, Centre National de Recherche en Génomique Humaine, CEA, Université Paris-Saclay, Évry, France
  • Arnaud Gloaguen, Centre National de Recherche en Génomique Humaine, CEA, Université Paris-Saclay, Évry, France
  • Yvon Maday, Sorbonne Université, France
  • Vincent Marechal, Sorbonne Université, France
  • Gregory Nuel, LPSM - Laboratoire de probabilités et modèles aléatoires - Sorbonne University, France
  • Marie Courbariaux, SUMMIT - Maison des Modélisations Ingénieries et Technologies - Sorbonne Université, France


Presentation Overview: Show

Current epidemiological monitoring of COVID-19 relies heavily on nasopharyngeal swab PCR testing. Previous studies have shown that high-throughput sequencing of wastewater samples combined with deconvolution methods is a relevant alternative for this surveillance and allows the estimation of COVID-19 variant proportions from these complex mixtures. In this study, we further develop and implement a deconvolution method based on an Expectation-Maximization (EM) algorithm that can both take advantage of the co-occurrence of Single Nucleotide Polymorphisms (SNPs) on a single read and take into account the error rate of the sequencing technology. We validated this algorithm on simulations as well as synthetic and real wastewater data, together with optimised reimplementations of other deconvolution methods. Our results were consistent with those obtained by health authorities through PCR testing. Ongoing research focuses on detecting cryptic emergent variants, incorporating temporal and spatial dependencies for more robust estimations and forecasting, and exploring the applicability of our method to other RNA viruses. This innovative approach contributes to the ongoing efforts in viral variant detection and monitoring, providing an efficient and cost-effective solution for public health surveillance.
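The core of such an EM deconvolution can be sketched as follows. This is a hypothetical minimal version (uniform initialization, every read assumed to cover every SNP position, a single symmetric error rate), not the authors' implementation:

```python
import numpy as np

def em_variant_proportions(reads, signatures, error_rate=0.01, n_iter=200):
    """Estimate variant proportions in a mixture via EM.

    reads:      (R, S) 0/1 matrix, observed allele of each read at each SNP.
    signatures: (V, S) 0/1 matrix, expected allele of each variant at each SNP.
    Returns a length-V vector of estimated mixture proportions.
    """
    # P(read r | variant v): independent per-SNP agreement with the
    # variant's signature, allowing for a symmetric sequencing error rate.
    match = reads[:, None, :] == signatures[None, :, :]            # (R, V, S)
    lik = np.where(match, 1.0 - error_rate, error_rate).prod(axis=2)  # (R, V)

    n_variants = signatures.shape[0]
    pi = np.full(n_variants, 1.0 / n_variants)  # uniform starting proportions
    for _ in range(n_iter):
        # E-step: posterior responsibility of each variant for each read.
        post = lik * pi
        post /= post.sum(axis=1, keepdims=True)
        # M-step: proportions are the average responsibilities.
        pi = post.mean(axis=0)
    return pi
```

Modeling each read's full SNP pattern jointly (rather than pooling per-position allele counts) is what lets co-occurrence of SNPs on a single read inform the estimate.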

A-351: Learning spatial and temporal representations of developing embryos and organs
Track: MLCSB
  • Jiawei Wang, European Bioinformatics Institute (EMBL-EBI) and the University of Cambridge, United Kingdom
  • Jinzheng Ren, Australian National University, Australia
  • James Sharpe, EMBL Barcelona, Spain
  • Bianca Dumitrascu, Columbia University, United States
  • John Marioni, European Bioinformatics Institute (EMBL-EBI), United Kingdom


Presentation Overview: Show

Recent advancements in in toto imaging techniques have enabled real-time tracking of mouse embryo and organ development. However, the vast volume of image data makes manual operations on individual images impractical, necessitating automatic and accurate characterization of images with their spatial and temporal information. Here, we propose an architecture designed to learn spatial and temporal representations of developing embryos and organs. We present a range of solutions, from classical image feature extraction methods to the latest deep learning techniques. Our final model, which combines a pretrained Swin Transformer model with a Siamese network, demonstrates the ability to learn accurate and robust features. These features have been successfully applied to match images across mammalian developing embryos and hearts. Our work establishes a strong foundation for comparative analysis of real-time embryo and organ development, and is key to advancing the integration of multi-omics data within the wider context of combining real-time tracking of cellular dynamics and spatiotemporal gene expression profiles to study early mammalian development.

A-352: Predicting activities of chemical compounds using graph neural network with attention mechanism
Track: MLCSB
  • Heesang Moon, Hanyang University, South Korea
  • Mina Rho, Hanyang University, South Korea


Presentation Overview: Show

In drug discovery, analyzing the molecular features of chemical compounds is essential to predict their activities and properties. However, experimentally finding compounds with a desired property is a nontrivial challenge due to the high cost and the vastness of chemical space. To extract structure-activity relationships without unaffordable experiments, efficient and cost-effective computational methods are needed. Recently, drawing on large amounts of data, various deep learning methods have been introduced and have achieved remarkable performance, providing an advantage in understanding molecular structures. Among the advanced deep learning models, the graph neural network (GNN) and its variants draw attention because of their performance. In this work, we developed a model that focuses on atoms and bonds, which are equally essential for distinguishing non-isomorphic structures, whereas other GNNs mostly focus on atoms. We employed two encoders adapted to atoms and bonds, respectively; the encoders encourage the model to focus on the more important substructures. We applied graph attention and self-attention mechanisms to the extracted hidden states of the substructures. We tested our model on the five benchmark datasets in the MoleculeNet physiology collection. Our model achieved more precise or comparable results compared to existing methods.

A-353: Imputing protein abundance by modeling molecular relationships using GNNs
Track: MLCSB
  • Sukrit Gupta, Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Potsdam, Germany
  • Christoph N. Schlaffner, HPI, University of Potsdam, Germany; and Dept. of Pathology, Boston Children’s Hospital and Harvard Medical School, USA
  • Saima Ahmed, Department of Pathology, Boston Children’s Hospital and Harvard Medical School, Boston, USA
  • Hanno Steen, Department of Pathology, Boston Children’s Hospital and Harvard Medical School, Boston, USA
  • Bernhard Y. Renard, Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Potsdam, Germany
  • Katharina Baum, HPI, Potsdam; Inst. for Math & CS, FU Berlin; and Dept for AI and HPI at Icahn School of Medicine at Mt. Sinai, NY, USA


Presentation Overview: Show

Knowing which proteins are present at which concentration is critical for understanding a biological system and its dynamics. However, missing values in proteomics data often hinder accurate and reliable downstream analyses. Most previous methods for imputing missing protein abundances have not taken advantage of additional layers of quantitative information and molecular interactions present in the same and other data sources. We propose an inductive graph neural network (GNN) framework to predict missing protein abundances by leveraging relationships between proteins, peptides, and mRNA. While the incorporation of mRNA data does not provide any performance boost, the inclusion of unique and especially shared peptide information proves beneficial. We evaluate our novel network-based method on five diverse human protein datasets containing between 538 and 12,826 measured proteins. Peptide-based linear regression models, with and without regularization, can only impute between 11% and 54% of the missing protein values, compared to our GNN-based method. Our method performs significantly better than the best linear regression in terms of squared Pearson’s correlation in 4/5 datasets (and comparably in 1/5; p<0.05, one-sided Mann-Whitney U test, n1=n2=5). Our GNN-based imputation paves the way for generalized imputation approaches that consider more complete molecular information.

A-354: How noise assumptions in ODE-synthesized data influence data augmentation in machine learning
Track: MLCSB
  • Maximilian Kleissl, Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Germany
  • Julian Zabbarov, Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Germany
  • Pascal Iversen, Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Germany
  • Simon Witzke, Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Germany
  • Bernhard Y. Renard, Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Germany
  • Katharina Baum, Department of Mathematics and Computer Science, Free University Berlin, Germany


Presentation Overview: Show

The performance of machine learning (ML) models heavily depends on the quality and quantity of available data, which can be expensive to acquire, particularly in biology.
Augmenting real-world datasets with synthetic data generated from ordinary differential equations (ODEs) can address this issue. However, ODEs do not account for random fluctuations inherent to biological systems.

Therefore, we construct more realistic synthetic data by adding noise to mimic measurement errors and environmental variations. We distinguish noise characteristics including whether it is applied to the ODE itself or its solution, whether it is additive or multiplicative, and which distribution it follows. Based on these noise characteristics, we study the impact on data augmentation quality. For this, we consider two specific problems: (i) the prediction performance on real-world data of ML models trained on synthetically augmented datasets, and (ii) the ability to identify the best ML model for a specific prediction task and its data needs from synthetic data.

Our study focuses on ML for time series prediction. Example systems cover epidemiological, population dynamics, and biochemical signaling pathway models. This research aims to provide valuable insights into curating suitable and realistic synthetic data to augment real-world datasets in ML applications.
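As an illustration of two of the noise characteristics distinguished above, the sketch below adds additive Gaussian and multiplicative log-normal noise to the solution of a toy logistic-growth ODE. This is a hypothetical example for illustration, not one of the study's actual systems or its noise settings:

```python
import numpy as np

def logistic(t_max=10.0, dt=0.1, r=0.8, K=100.0, x0=1.0):
    """Euler-integrated logistic growth, standing in for an ODE solution."""
    n = int(t_max / dt)
    x = np.empty(n)
    x[0] = x0
    for i in range(1, n):
        x[i] = x[i - 1] + dt * r * x[i - 1] * (1.0 - x[i - 1] / K)
    return x

rng = np.random.default_rng(0)
clean = logistic()

# Noise applied to the ODE's solution (mimicking measurement error),
# in two of the variants distinguished above:
additive = clean + rng.normal(0.0, 2.0, clean.shape)            # additive Gaussian
multiplicative = clean * rng.lognormal(0.0, 0.05, clean.shape)  # multiplicative log-normal
```

Noise applied to the ODE itself (i.e., perturbing the right-hand side during integration) would instead inject fluctuations at every Euler step, mimicking environmental variation rather than measurement error.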

A-355: ProstT5: a Protein Language Model for the post AlphaFold2 era
Track: MLCSB
  • Michael Heinzinger, Technical University of Munich (TUM), Germany
  • Martin Steinegger, Seoul National University (SNU), South Korea
  • Burkhard Rost, Technical University of Munich (TUM), Germany


Presentation Overview: Show

Large language models (LLMs) have revolutionised Natural Language Processing, and their adaptation to protein sequences has led to the development of powerful protein language models (pLMs). Concurrently, the breakthrough of AlphaFold2 has accelerated protein structure prediction, enabling the exploration of proteins' dual nature - 1D sequences and 3D structures.
Here, we propose leveraging pLMs to model both modalities simultaneously by combining the protein sequences with their 3D counterparts. To achieve this, we first encode protein structures as token sequences using the 3Di-alphabet introduced by Foldseek. The resulting "structure-sequence" representation can be fed into a language model to extract features and patterns. Towards this end, we constructed a non-redundant dataset from the AlphaFold DB and fine-tuned an existing pLM (ProtT5) to translate between 3Di and amino acid sequences. Preliminary results demonstrate the feasibility of our approach, called Protein structure-sequence T5 (ProstT5), achieving high accuracy in both folding and inverse folding tasks.
This work serves as a proof-of-concept, showcasing the potential of pLMs to tap into the information-rich protein structure revolution fueled by AlphaFold2. It paves the way for the development of tools optimising the integration of this vast 3D structure data resource, opening new research avenues in the post AlphaFold2 era.

A-356: What does the biosynthetic gene cluster say? Understanding biosynthetic gene clusters with protein language models
Track: MLCSB
  • Tatiana Malygina, Helmholtz Institute for Pharmaceutical Research Saarland (HIPS), Germany
  • Olga V. Kalinina, Helmholtz Institute for Pharmaceutical Research Saarland (HIPS); Saarland University., Germany


Presentation Overview: Show

Many organisms, such as bacteria, fungi, and plants, produce intricate chemicals that are not needed for their growth and reproduction and are thus called secondary metabolites or natural products (NPs). NPs are a rich source of drugs, with most antibiotics being derivatives of NPs. In a producer organism, NPs are synthesized by a set of enzymes encoded by genes that often lie near each other on the chromosome and are called a biosynthetic gene cluster (BGC). Despite the clinical importance of some NPs, only a small number of naturally occurring BGCs have been explicitly described.

From the natural language processing (NLP) point of view, in terms of the number of samples, the existing collections of BGC sequences can be considered a low-resource language corpus. A common approach to such datasets is to transfer an existing model trained on a high-resource dataset from one language domain to another using transfer learning. A natural high-resource dataset for BGCs would be all protein sequences. Several protein language models (pLMs) trained on large collections of sequences are available nowadays. In this work, we employ them via transfer learning to explore how meaningful the resulting representations of BGCs are with regard to the chemistry of the expressed NPs.

A-357: A benchmark study of deconvolution methods and target enrichment kits to estimate the proportions of COVID-19 lineages in wastewater samples sequenced with Oxford Nanopore Technology
Track: MLCSB
  • Benjamin Vacus, Direction générale de l'Armement (DGA) / Centre National de Recherche en Génomique Humaine, France
  • Marc Délépine, Centre National de Recherche en Génomique Humaine, CEA, Université Paris-Saclay, Évry, France
  • Christian Daviaud, Centre National de Recherche en Génomique Humaine, CEA, Université Paris-Saclay, Évry, France
  • Robert Olaso, Centre National de Recherche en Génomique Humaine, CEA, Université Paris-Saclay, Évry, France
  • Florian Sandron, Centre National de Recherche en Génomique Humaine, CEA, Université Paris-Saclay, Évry, France
  • Damien Delafoy, Centre National de Recherche en Génomique Humaine, CEA, Université Paris-Saclay, Évry, France
  • Vincent Meyer, Centre National de Recherche en Génomique Humaine, CEA, Université Paris-Saclay, Évry, France
  • Jean-François Deleuze, Centre National de Recherche en Génomique Humaine, CEA, Université Paris-Saclay, Évry, France
  • Edith Le Floch, Centre National de Recherche en Génomique Humaine, CEA, Université Paris-Saclay, Évry, France
  • Zuzana Gerber, Centre National de Recherche en Génomique Humaine, CEA, Université Paris-Saclay, Évry, France
  • Arnaud Gloaguen, Centre National de Recherche en Génomique Humaine, CEA, Université Paris-Saclay, Évry, France


Presentation Overview: Show

Wastewater surveillance is a relevant tool for monitoring the evolution of SARS-CoV-2. After target enrichment for SARS-CoV-2 by Polymerase Chain Reaction (PCR) amplification followed by next-generation sequencing, previous studies were able to identify and quantify COVID-19 lineages present in wastewater samples.

Our goal is to investigate whether generating longer reads allows better discrimination among lineages present in such complex mixtures.

Our contribution is twofold. First, we generated three distinct datasets on the same set of synthetic samples, including positive/negative controls (single lineage/water only) and several known mixtures of lineages (among Alpha, Delta, Omicron, …). The datasets differ in the primer design used for target enrichment, generating amplicons of 400, 1200, or 2400 base pairs.

Second, three deconvolution methods, Freyja (Karthikeyan et al. 2021), LCS (Valieris et al. 2021), and VirPool (Gafurov et al. 2022), were compared for the estimation of the proportions of COVID-19 lineages. Only VirPool can take advantage of the co-occurrence of Single Nucleotide Polymorphisms (SNPs) on a single read, an event better captured by longer reads. This benchmark was built with Snakemake.

Relying on this benchmark, new methods able to exploit the co-occurrence of SNPs will be developed and extended to other viruses.

A-358: REPIC — an ensemble learning methodology for cryo-EM particle picking
Track: MLCSB
  • Christopher JF Cameron, Yale University, United States
  • Sebastian JH Seager, Yale University, United States
  • Fred J Sigworth, Yale University, United States
  • Hemant D Tagare, Yale University, United States
  • Mark B Gerstein, Yale University, United States


Presentation Overview: Show

Cryo-EM (cryogenic electron microscopy) is a modern biophysical technique for protein structure determination. Protein complexes are frozen and then imaged with electrons to produce various 2D projections (i.e., particles) in a digitized image called a micrograph. Particle-image identification in micrographs (i.e., picking) is challenging due to the low signal-to-noise ratios and lack of ground truth for particle locations. Moreover, current computational methods (“pickers”) pick different particle sets, complicating the selection of the best-suited picker for a protein of interest. Here, we present REPIC (REliable PIcking by Consensus), an ensemble learning methodology that uses multiple pickers to find consensus particles. REPIC identifies consensus particles by framing its task as a graph problem and using integer linear programming to select particles. REPIC picks high-quality particles even when the best picker is not known a priori and for known difficult-to-pick particles (e.g., TRPV1 ion channel). 3D reconstructions using consensus particles achieve resolutions comparable to those from particles picked by experts, without the need for downstream particle filtering (e.g., 2D or 3D classification). Overall, our results show that REPIC requires minimal (often no) manual intervention and significantly reduces the burden of picker selection and particle picking for cryo-EM users.
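The consensus idea can be illustrated with a toy sketch: keep a location only when several pickers independently place a pick nearby. Note this greedy simplification (and the hypothetical `consensus_picks` function, `radius`, and `min_support` below) only illustrates the notion of consensus; REPIC itself frames the matching as a graph problem solved with integer linear programming:

```python
import numpy as np

def consensus_picks(pickers, radius=10.0, min_support=2):
    """Greedy toy consensus over particle picks from multiple pickers.

    pickers: list of (N_i, 2) float arrays of (x, y) pick coordinates,
             one array per picker.
    Returns the mean coordinates of picks supported by at least
    `min_support` pickers within `radius` pixels of each other.
    """
    consensus = []
    for anchor in pickers[0]:                # seed candidates from picker 0
        agreeing = [anchor]
        for other in pickers[1:]:
            d = np.linalg.norm(other - anchor, axis=1)
            if d.size and d.min() <= radius:  # nearest pick agrees
                agreeing.append(other[d.argmin()])
        if len(agreeing) >= min_support:
            consensus.append(np.mean(agreeing, axis=0))
    return np.array(consensus)
```

An ILP formulation, as used by REPIC, instead optimizes the particle selection globally rather than seeding from one picker, which avoids the order dependence of this greedy version.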

A-359: LucidProts: Controlled Protein Design using Diffusion Language Models
Track: MLCSB
  • Adrian Henkel, TU Munich, Germany
  • Dr. Michael Heinzinger, TU Munich, Germany
  • Kyra Erckert, TU Munich, Germany
  • Prof. Dr. Burkhard Rost, TU Munich, Germany


Presentation Overview: Show

Generative diffusion models like DALL-E or Imagen have shown remarkable effectiveness in controlled de novo data generation. This can also be crucial for protein design in fields such as medicine, drug discovery, diagnostics (biomarkers), bioremediation (environmental cleanup), and agriculture. Our approach, LUCIDPROTS, utilizes diffusion language models specifically for protein sequence generation.

To achieve controlled protein sequence design based on user prompts, we employ the techniques of partial noising and conditional denoising (DiffuSeq), as proposed by Gong et al. (2022) on an Enhanced Transformer with Rotary Position Embedding (RoFormer). These techniques allow us to condition the generation process on three-dimensional structures encoded in a one-dimensional sequence, using the novel 3Di structural alphabet proposed by van Kempen, Kim, Tumescheit et al. (2023).

By evaluating the generated sequences in terms of novelty, foldability, and 3D structure similarity using AlphaFold and ESMFold, LucidProts aims to generate diverse amino acid sequences with similar structures, avoiding sequence duplications.
Overall, if further experiments prove successful, this work presents a powerful proof of concept for controlled protein sequence design. Its integration of the DiffuSeq model, trained on the 3Di dataset, enables generation conditioned on protein properties, leading to potential advancements in various biotechnological applications.

A-360: OmicsFootPrint: A Comprehensive Deep Learning Framework for Multi-Omics Integration and Interpretability using Circular Images
Track: MLCSB
  • Xiaojia Tang, Mayo Clinic, United States
  • Naresh Prodduturi, Mayo Clinic, United States
  • Kevin Thompson, Mayo Clinic, United States
  • Vera Suman, Mayo Clinic, United States
  • Krishna Kalari, Mayo Clinic, United States


Presentation Overview: Show

OmicsFootPrint is a novel deep-learning framework for investigating complex multi-omics data. It converts data into 2-D circular images by organizing features based on chromosomal locations and effectively analyzes multi-omics data using image-based deep-learning models with enhanced interpretability. Using the TCGA BRCA dataset (n=866), we conducted a proof-of-concept analysis to differentiate PAM50 subtypes. Predictive performance was evaluated using the area under the curve (AUC) with 10 repeats. The three-omics model, incorporating expression, copy number (CNV), and protein-phosphoprotein data with EfficientNetV2, consistently demonstrated stronger performance than other omics-data combinations. Mean AUC values were 0.97±0.02 for Basal, 0.89±0.09 for HER2, 0.86±0.03 for Luminal-A, and 0.77±0.05 for Luminal-B. CNV-only models had the lowest AUC. Expression-only and expression+CNV models showed similar performance, with lower AUC than the three-omics model. Furthermore, OmicsFootPrint interprets the model by mapping high SHAP values to genomic features. In the HER2 subtype, OmicsFootPrint identified differentiating features such as ERBB2 amplification, over-phosphorylated IGF1R, high TP53BP1 protein levels, and high interferon-gamma expression. These findings highlight the robustness and interpretability of OmicsFootPrint in accurately classifying breast cancer subtypes. We are currently summarizing the interpretable multi-omics features for the basal and luminal subtypes and look forward to presenting those additional insights at the upcoming ISMB meeting.

A-361: Interpretable multi-view dimensionality reduction for biologically meaningful sample representation.
Track: MLCSB
  • Nicolas Kersten, Institute for Bioinformatics and Medical Informatics (IBMI), University of Tübingen, Germany
  • Nora K. Speicher, Chalmers University of Technology, Sweden
  • Nico Pfeifer, Institute for Bioinformatics and Medical Informatics (IBMI), University of Tübingen, Germany


Presentation Overview: Show

Advancements in omics technologies, such as genomics, transcriptomics and proteomics, have led to a rapidly increasing availability of high-dimensional data sets. While omics data sets provide valuable information for supervised and unsupervised machine learning analyses, they usually contain a large number of features, which requires advanced techniques for extracting relevant information. This is particularly important for the joint analysis of multi-omics data. State-of-the-art methods, like multiple kernel clustering, employ dimensionality reduction to facilitate subsequent analysis steps. Although this procedure can improve the overall model performance, the low-dimensional representation of the data remains hard to interpret.
We propose a novel approach that improves the interpretability of multiple kernel learning for dimensionality reduction and clustering. This method employs self-supervised estimation of the contribution of each feature to the sample clustering and subsequent identification of feature subsets within each omics view.
We applied this method to different multi-omics cancer data sets. Our results demonstrate state-of-the-art performance in discovering biologically meaningful patient clusters, while being able to identify disease-relevant biomarkers that contribute to the observed patient clustering. This method could thus provide valuable information for the characterization of patient groups.