Posters - Schedules

Monday, July 24, between 18:00 CEST and 19:00 CEST
Tuesday, July 25, between 18:00 CEST and 19:00 CEST
Session A Poster Set-up and Dismantle
Session A Posters set up:
Monday, July 24, between 08:00 CEST and 08:45 CEST
Session A Posters dismantle:
Monday, July 24, at 19:00 CEST
Session B Poster Set-up and Dismantle
Session B Posters set up:
Tuesday, July 25, between 08:00 CEST and 08:45 CEST
Session B Posters dismantle:
Tuesday, July 25, at 19:00 CEST
Wednesday, July 26, between 18:00 CEST and 19:00 CEST
Session C Poster Set-up and Dismantle
Session C Posters set up:
Wednesday, July 26, between 08:00 CEST and 08:45 CEST
Session C Posters dismantle:
Wednesday, July 26, at 19:00 CEST
Virtual
B-211: ArkDTA: Attention Regularization guided by non-Covalent Interactions for Explainable Drug-Target Binding Affinity Prediction
Track: MLCSB
  • Mogan Gim, Korea University, South Korea
  • Junseok Choe, Korea University, South Korea
  • Seungheun Baek, Korea University, South Korea
  • Jueon Park, Korea University, South Korea
  • Chaeeun Lee, Korea University, South Korea
  • Minjae Ju, LG CNS, AI Research Center, South Korea
  • Sumin Lee, LG AI Research, South Korea
  • Jaewoo Kang, Korea University, South Korea


Protein-ligand binding affinity prediction is an important task in drug design and development. Cross-modal attention mechanisms have become a core component of deep learning models because of the importance of model explainability. Non-covalent interactions, one of the key chemical aspects of this task, should be incorporated into the protein-ligand attention mechanism. We propose ArkDTA, a novel deep neural architecture for explainable binding affinity prediction guided by non-covalent interactions. Experimental results show that ArkDTA achieves predictive performance comparable to current state-of-the-art models while significantly improving model explainability. Qualitative investigation into our novel attention mechanism reveals that ArkDTA can identify potential regions for non-covalent interactions between candidate drug compounds and target proteins, and can guide the internal operations of the model in a more interpretable and domain-aware manner. ArkDTA is available at https://github.com/dmis-lab/ArkDTA

B-212: Hygieia: AI/ML ready pipeline to investigate genes associated with targeted disorders and predict disease with high accuracy.
Track: MLCSB
  • William DeGroat, Institute for Health, Health Care Policy and Aging Research, Rutgers, The State University of New Jersey, United States
  • Vignesh Venkat, Institute for Health, Health Care Policy and Aging Research, Rutgers, The State University of New Jersey, United States
  • Widnie Pierre-Louis, Institute for Health, Health Care Policy and Aging Research, Rutgers, The State University of New Jersey, United States
  • Habiba Abdelhalim, Institute for Health, Health Care Policy and Aging Research, Rutgers, The State University of New Jersey, United States
  • Zeeshan Ahmed, Institute for Health, Health Care Policy and Aging Research, Rutgers, The State University of New Jersey, United States


Owing to advances in sequencing technologies, genomics data are being generated at an unmatched pace and scale, fostering translational research. Genome-wide association studies (GWAS) have remarkably assisted in understanding the genetic basis of human disease by uncovering millions of loci associated with various complex phenotypes. However, GWAS cannot predict disease or detect all the heritability explained by single nucleotide polymorphisms (SNPs), and can only target specific variants. The appropriate use of artificial intelligence (AI) and machine learning (ML) techniques can accelerate our ability to leverage and extend the information contained within the original data, and to model patient-specific genomics data against publicly available annotation repositories for understanding how coding and non-coding genomic variations are connected to disease mechanisms. The grand challenge here is the assimilation of genetics into precision medicine that translates across different ancestries, diverse diseases, and other distinct populations with the implementation of effective AI/ML methods. We present the first AI/ML-ready pipeline, Hygieia, which integrates genomics and clinical data to investigate genes associated with the targeted disorders and predict disease with high accuracy. Hygieia is an open-source, simple-to-use pipeline that does not require a strong computational background to execute.

B-213: The use of machine learning to detect the cell-type specific effects of genetic variants in disease
Track: MLCSB
  • Alan Murphy, Imperial College London, United Kingdom
  • Mike Phuycharoen, The University of Manchester, United Kingdom
  • Nathan Skene, Imperial College London, United Kingdom


Large-scale genetic efforts have been undertaken to identify disease-relevant variants in complex diseases like Alzheimer’s Disease (AD). However, most variants identified are non-coding, playing an elusive regulatory role. Moreover, gene regulatory mechanisms are highly cell type-specific, meaning variants' roles likely differ based on the cell type in question. The effect of genetic variants on the activity of regulatory elements can be measured experimentally through quantitative trait loci (QTL) studies. However, in order to robustly identify QTLs, very large numbers of samples are required. Undertaking such studies on the vast number of disease-related cell types is infeasible.
Recent approaches have shown the effectiveness of machine learning models at in silico prediction of regulatory variants’ effects. Despite their potential, all existing models either fail to take into account distal effects or cannot be applied to previously unseen cell types. Here, we propose Enformer Celltyping, which incorporates distal effects of DNA interactions and predicts across diverse cell types. The model is robustly validated using large-scale immune cell QTL studies and attains best-in-class performance in benchmarking. Using Enformer Celltyping, we will predict the effect of AD non-coding variants on particular regulatory factors to potentially uncover the causality of the disease, elucidating possible therapeutic targets.

B-214: SC2Spa: a deep learning based approach to map transcriptome to spatial origins at cellular resolution
Track: MLCSB
  • Linbu Liao, University of Copenhagen, Denmark
  • Esha Madan, Champalimaud Centre for the Unknown, Lisbon, Portugal
  • Hyobin Kim, Department of Computational Biomedicine, Cedars-Sinai Medical Center, Los Angeles, CA, United States
  • Rajan Gogna, Virginia Commonwealth University, United States
  • Kyoung Jae Won, Department of Computational Biomedicine, Cedars-Sinai Medical Center, Los Angeles, CA, United States


Integrating single cell RNAseq (scRNAseq) and spatial transcriptomics (ST) data is still challenging, especially when the spatial resolution is poor. For cellular resolution spatial mapping, we have developed the deep learning-based SC2Spa, which learns from ST data the intricate rules that map a transcriptome to its spatial location. Benchmarking tests show that SC2Spa uniquely recapitulates tissue architecture from scRNAseq. SC2Spa successfully mapped scRNAseq even to low resolution Visium data. SC2Spa identified spatially variable genes and suggested negative regulatory relationships between genes. SC2Spa, armed with deep learning, provides a new way to map the transcriptome to its location and perform subsequent analyses.

B-215: GediNET for discovering gene associations across diseases using knowledge based machine learning approach
Track: MLCSB
  • Emma Qumsiyeh, Al-Quds University, Palestine
  • Louise Showe, Wistar Institute, United States
  • Malik Yousef, Zefat College, Israel


The most common approaches to discovering genes associated with specific diseases are based on machine learning and use a variety of feature selection techniques to identify significant genes that can serve as biomarkers for a given disease. More recently, the integration of prior knowledge-based approaches into this process has shown significant promise in the discovery of new biomarkers with potential translational applications. In this study, we developed a novel approach, GediNET, that integrates prior biological knowledge to gene Groups that are shown to be associated with a specific disease such as cancer. The novelty of GediNET is that it then also allows the discovery of significant associations between that specific disease and other diseases. The initial step is the identification of gene Groups. The Groups are then subjected to a Scoring component to identify the top-significant Groups. Those Groups are used to train a machine learning model. The process of Grouping, Scoring, and Modelling (G-S-M) is used to identify other diseases that are similarly associated with this signature. GediNET identifies these relationships through Disease–Disease Association (DDA) based machine learning. DDA explores novel associations between diseases and identifies relationships that could be used to further improve approaches to diagnosis, prognosis, and treatment.

B-216: The language of proteins is in the codons, not the amino acids
Track: MLCSB
  • Carlos Outeiral, University of Oxford, United Kingdom
  • Charlotte Deane, University of Oxford, United Kingdom


Protein representations from deep language models have yielded state-of-the-art performance across many tasks in computational protein engineering. In recent years, progress has primarily focused on parameter count, with recent models' capacities surpassing the size of the very datasets they were trained on. Here, we propose an alternative direction. We show that large language models trained on codons, instead of amino acid sequences, provide high-quality representations that outperform comparable state-of-the-art models across a variety of tasks. In some tasks, like species recognition, prediction of protein and transcript abundance, or melting point estimation, we show that a language model trained on codons outperforms every other published protein language model, including some that contain over 50 times more parameters. These results suggest that, in addition to commonly studied scale and model complexity, the information content of biological data provides an orthogonal direction to improve the power of machine learning in biology.
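
The idea that codon sequences carry information that amino acid sequences discard can be shown with a minimal sketch. Two coding sequences that differ only in synonymous codons translate to the same protein, yet produce different codon token streams. The function names, the example sequences, and the small codon table (a subset of the standard genetic code) are illustrative assumptions and are not taken from the authors' model.

```python
# Subset of the standard genetic code: only the codons used in this example.
CODON_TO_AA = {
    "ATG": "M",              # methionine (start)
    "GCT": "A", "GCC": "A",  # synonymous alanine codons
    "AAA": "K", "AAG": "K",  # synonymous lysine codons
    "TAA": "*",              # stop
}


def codon_tokens(cds):
    """Split a coding sequence into non-overlapping 3-nucleotide tokens."""
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]


def amino_acid_tokens(cds):
    """Translate each codon to its one-letter amino acid code."""
    return [CODON_TO_AA[codon] for codon in codon_tokens(cds)]


# Hypothetical sequences that differ only in synonymous codons.
seq_a = "ATGGCTAAATAA"
seq_b = "ATGGCCAAGTAA"

print(codon_tokens(seq_a), amino_acid_tokens(seq_a))
print(codon_tokens(seq_b), amino_acid_tokens(seq_b))
```

A model that sees only the amino acid tokens cannot tell these two inputs apart, while a codon-level model can.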

B-217: SpheroScan: A User-Friendly Deep Learning Tool for Spheroid Image Analysis
Track: MLCSB
  • Akshay Akshay, Functional Urology Research Group, Department for BioMedical Research DBMR, University of Bern, Switzerland
  • Mitali Katoch, Institute of Neuropathology, Universitätsklinikum Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
  • Masoud Abedi, Department of Medical Data Science, Leipzig University Medical Centre, Germany
  • Mustafa Besic, Department of Urology, Inselspital University Hospital, Switzerland
  • Navid Shekarchizadeh, Department of Medical Data Science, Leipzig University Medical Centre, Germany
  • Fiona C. Burkhard, Department of Urology, Inselspital University Hospital, Switzerland
  • Alex Bigger-Allen, Biological & Biomedical Sciences Program, Division of Medical Sciences, Harvard Medical School, United States
  • Rosalyn M. Adam, Urological Diseases Research Center, Boston Children’s Hospital, United States
  • Katia Monastyrskaya, Functional Urology Research Group, Department for BioMedical Research DBMR, University of Bern, Switzerland
  • Ali Hashemi Gheinani, Broad Institute of MIT and Harvard, United States


Three-dimensional (3D) spheroid models are increasingly being used in scientific research due to their ability to mimic the in vivo microenvironment. However, the analysis of spheroid images has been limited by the lack of automated and user-friendly tools for spheroid detection and segmentation in an image. To address this issue, we have developed a fully automated, web-based tool called SpheroScan, which uses the Mask Region-based Convolutional Neural Network (Mask R-CNN) deep learning framework for image detection and segmentation. To develop a deep learning model that could be applied to spheroid images from a range of experimental conditions, we trained the model using spheroid images captured using the IncuCyte Live-Cell Analysis System or a conventional microscope. Performance evaluation of the trained model using validation and test datasets shows promising results. SpheroScan allows for easy analysis of large numbers of images and provides interactive visualization features for a more in-depth understanding of the data. Our tool represents a significant advancement in the analysis of spheroid images and will facilitate the widespread adoption of 3D spheroid models in scientific research. SpheroScan is available at https://github.com/FunctionalUrology/SpheroScan.

B-218: A novel Generative Adversarial Networks modelling for the class imbalance problem in high dimensional omics data
Track: MLCSB
  • Animesh Acharjee, University of Birmingham, United Kingdom
  • Samuel Cusworth, University of Birmingham, United Kingdom
  • Georgios Gkoutos, University of Birmingham, United Kingdom


Class imbalance remains a large problem in high-throughput omics analyses, causing bias towards the over-represented class when training machine learning-based classifiers. Oversampling is a common method used to balance classes, allowing for better generalization over the training data. More naive approaches can introduce other biases into the data and are especially sensitive to inaccuracies in the training data, a problem considering the characteristically noisy data obtained in healthcare. This is especially a problem with high-dimensional omics data. In this study, we propose a GAN-based methodology for use on high-dimensional data sets with small sample sizes that allows for the synthesis of new samples that represent the original data types. We further compared the performance of our approach against the synthetic minority over-sampling technique (SMOTE) and random oversampling (RO), using the area under the receiver operating characteristic curve (AUROC) when using the data to train a classifier. We performed extensive simulations and applied the proposed methodologies to real-world microarray and lipidomics data sets to demonstrate performance. We found evidence for an improved ability of the proposed GAN-based methodology to balance the classes of complex datasets with small sample sizes.
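
For context, the random oversampling (RO) baseline named above can be sketched in a few lines. This is a minimal illustration that assumes RO means duplicating minority-class samples (drawn with replacement) until every class matches the largest one; it is not the authors' GAN-based method, and the function name and toy data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority samples until all classes are balanced."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())  # size of the largest class
    out_samples, out_labels = list(samples), list(labels)
    for cls, n in counts.items():
        pool = [s for s, y in zip(samples, labels) if y == cls]
        for _ in range(target - n):
            out_samples.append(rng.choice(pool))  # exact copy of an existing sample
            out_labels.append(cls)
    return out_samples, out_labels


# Hypothetical toy data: three controls, one case.
X = [[0.1, 0.2], [0.3, 0.1], [0.2, 0.4], [0.9, 0.8]]
y = ["control", "control", "control", "case"]

X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))
```

Because RO only copies existing samples, any noise in a minority sample is replicated, which is the kind of bias the abstract attributes to naive approaches.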

B-219: Machine learning of cancer type and tissue of origin from proteomes of 1,277 human tissue samples and 975 cancer cell lines
Track: MLCSB
  • Zhaoxiang Cai, ProCan®, Children’s Medical Research Institute, The University of Sydney, Westmead, NSW, Australia
  • Zainab Noor, ProCan®, Children’s Medical Research Institute, The University of Sydney, Westmead, NSW, Australia
  • Adel Aref, ProCan®, Children’s Medical Research Institute, The University of Sydney, Westmead, NSW, Australia
  • Emma Boys, ProCan®, Children’s Medical Research Institute, The University of Sydney, Westmead, NSW, Australia
  • Dylan Xavier, ProCan®, Children’s Medical Research Institute, The University of Sydney, Westmead, NSW, Australia
  • Natasha Lucas, ProCan®, Children’s Medical Research Institute, The University of Sydney, Westmead, NSW, Australia
  • Steven Williams, ProCan®, Children’s Medical Research Institute, The University of Sydney, Westmead, NSW, Australia
  • Jennifer Koh, ProCan®, Children’s Medical Research Institute, The University of Sydney, Westmead, NSW, Australia
  • Erin Sykes, ProCan®, Children’s Medical Research Institute, The University of Sydney, Westmead, NSW, Australia
  • Rebecca Poulos, ProCan®, Children’s Medical Research Institute, The University of Sydney, Westmead, NSW, Australia
  • Peter Hains, ProCan®, Children’s Medical Research Institute, The University of Sydney, Westmead, NSW, Australia
  • Phillip Robinson, ProCan®, Children’s Medical Research Institute, The University of Sydney, Westmead, NSW, Australia
  • Rosemary Balleine, Westmead Institute for Medical Research, The University of Sydney, Westmead, NSW, Australia
  • Roger Reddel, ProCan®, Children’s Medical Research Institute, The University of Sydney, Westmead, NSW, Australia
  • Qing Zhong, ProCan®, Children’s Medical Research Institute, The University of Sydney, Westmead, NSW, Australia


Machine learning (ML) has powered the image-based prediction of cancer phenotypes and tissue of origin. Here, we used online ML, a method that updates a model at each step as new data become available in a sequential order, to predict cancer and tissue types using mass spectrometry-based proteomic data. A cohort of 975 cancer cell lines with 9,688 proteins served as the baseline training set (D1), while half of the cohort of 1,277 human tissue samples with 9,501 proteins was used as a variable training set (D2), and the other half as a test set (T1). We trained a model on D1 and sequentially updated it by adding 20% of data from D2 to expand the training set. We evaluated each model's performance on T1 and observed a monotonic performance increase from 0.89 (D1, top-1 accuracy) to 0.97 (D1 + 100% D2) when predicting six cancer types. We observed an analogous trend when predicting seven tissue types. Our results demonstrate that cancer cell lines can be used to predict cancer and tissue types, and reflect a real-world knowledgebase that will continue to increase in predictive power with additional data.

B-220: Attention-based graph neural network for prediction of plasma stability
Track: MLCSB
  • Woo Dae Jang, Korea Research Institute of Chemical Technology, South Korea
  • Seong Jun Park, Korea Research Institute of Chemical Technology, South Korea
  • Hwan Jung Lim, Korea Research Institute of Chemical Technology, South Korea
  • Kwang-Seok Oh, Korea Research Institute of Chemical Technology, South Korea


The stability of compounds in human plasma is crucial for maintaining sufficient systemic drug exposure and is considered an essential factor in the early stages of drug discovery and development. The rapid degradation of compounds in plasma can result in poor in vivo efficacy. Currently, there are no open-source software programs for predicting human plasma stability. In this study, we developed an attention-based graph neural network, PredPS, to predict the stability of compounds in human plasma using in-house and open-source datasets. PredPS outperformed the two machine learning and two deep learning algorithms that were used for comparison, indicating its stability-predicting efficiency. PredPS achieved an area under the receiver operating characteristic curve of 0.901±0.006, accuracy of 0.835±0.007, sensitivity of 0.823±0.054, and specificity of 0.846±0.049 when evaluated using 5-fold cross-validation.
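
As a reminder of how the threshold-based metrics reported above are defined, here is a minimal sketch that computes accuracy, sensitivity, and specificity from binary labels. The labels are hypothetical, and which class counts as positive (label 1) is an assumption, since the abstract does not say.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity (TP rate), and specificity (TN rate) for 0/1 labels.

    Assumes both classes occur in y_true; otherwise a division by zero occurs.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical labels for ten compounds (1 = positive class).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In a 5-fold cross-validation setting such as the one described, these values would be computed once per held-out fold and then summarized as mean ± spread.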

B-221: Unlocking ultrahigh-throughput drug combination screens using machine learning
Track: MLCSB
  • William Wright, St. Jude Children's Research Hospital, United States
  • Paul Geeleher, St. Jude Children's Research Hospital, United States


Here, we present Combocat: an end-to-end platform that scales drug combination screening to ultrahigh-throughput levels with the assistance of machine learning and overcomes the limitations of current drug combination screening methods. Combocat has two modes of screening, called dense mode and sparse mode. In dense mode, 10×10 drug combination matrices are experimentally tested on a 384-well microplate and subsequently analyzed for synergy. In sparse mode, the same 10×10 matrix is produced, but only 10 concentration pairs (corresponding to the matrix diagonal) are tested, and machine learning is used to impute the remaining 90 values.
The Combocat machine learning model is a regression model that was trained on hundreds of samples of dense mode data and allows for accurate capturing of synergy. By miniaturizing sparse mode (which uses 1536-well microplates, sparse matrix formats, and information-borrowing steps), we were able to increase the throughput nearly 300-fold. Combocat represents a union between state-of-the-art experimental and computational methods, and dramatically increases the capability of screening for new synergistic drug combinations.
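
The sparse-mode layout described above can be made concrete with a small sketch that marks which cells of the 10×10 matrix are measured and which are left to imputation. Treating "the matrix diagonal" as the main diagonal is an assumption, and the variable names are illustrative; the imputation model itself is not reproduced here.

```python
N = 10  # concentrations per drug, as in the 10x10 matrix described above

# True where a combination is measured experimentally in sparse mode.
# Assumption: "the matrix diagonal" means the main diagonal (row == col).
measured = [[row == col for col in range(N)] for row in range(N)]

n_measured = sum(cell for row in measured for cell in row)
n_imputed = N * N - n_measured

print(f"measured: {n_measured}, imputed: {n_imputed}")
```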

B-222: Cracking the black box of deep sequence-based protein-protein interaction prediction
Track: MLCSB
  • Judith Bernett, Technical University of Munich, Germany
  • David B. Blumenthal, Friedrich-Alexander University Erlangen-Nürnberg, Germany
  • Markus List, Technical University of Munich, Germany


Identifying protein-protein interactions (PPIs) is crucial for deciphering biological pathways and their dysregulation. Numerous prediction methods have been developed as a cheap alternative to biological experiments, reporting phenomenal accuracy estimates. While most methods rely exclusively on sequence information, PPIs occur in 3D space. As predicting protein structure from sequence is an infamously complex problem, the almost perfect reported performances for PPI prediction seem dubious. We systematically investigated how much reproducible deep learning models depend on data leakage, sequence similarities, and node degree information and compared them to basic machine learning models. We found that overlaps between training and test sets resulting from random splitting lead to strongly overestimated performances, giving a false impression of the field. In this setting, models learn solely from sequence similarities and node degrees. When data leakage is avoided by minimizing sequence similarities between training and test, performances become random, leaving this research field wide open.
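
The leakage problem described above can be illustrated with a minimal sketch. Splitting interaction pairs at random lets the same proteins appear on both sides of the split, whereas partitioning the proteins first guarantees that no protein is shared. This is a simplification: it removes shared proteins only and does not minimize sequence similarity between partitions as the authors do. The function names and protein identifiers are hypothetical.

```python
import random


def split_by_protein(pairs, test_fraction=0.5, seed=0):
    """Partition proteins first, then keep only pairs that lie within one partition."""
    rng = random.Random(seed)
    proteins = sorted({p for pair in pairs for p in pair})
    rng.shuffle(proteins)
    n_test = int(len(proteins) * test_fraction)
    test_proteins = set(proteins[:n_test])
    train = [(a, b) for a, b in pairs
             if a not in test_proteins and b not in test_proteins]
    test = [(a, b) for a, b in pairs
            if a in test_proteins and b in test_proteins]
    # Pairs that straddle the two partitions are dropped.
    return train, test


def shared_proteins(train, test):
    """Proteins that occur in both the training and the test pairs."""
    train_p = {p for pair in train for p in pair}
    test_p = {p for pair in test for p in pair}
    return train_p & test_p


pairs = [("P1", "P2"), ("P1", "P3"), ("P2", "P3"),
         ("P4", "P5"), ("P4", "P6"), ("P5", "P6"), ("P1", "P4")]

# Naive split over pairs: proteins leak between the two sets.
naive_train, naive_test = pairs[:4], pairs[4:]
print(sorted(shared_proteins(naive_train, naive_test)))

# Protein-level split: no protein appears on both sides.
train, test = split_by_protein(pairs)
print(sorted(shared_proteins(train, test)))
```

The cost of the stricter split is that cross-partition pairs are discarded, so less data remains for training and testing.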

B-223: MultiGML: Multimodal Graph Machine Learning for Prediction of Adverse Drug Events
Track: MLCSB
  • Sophia Krix, Fraunhofer Institute for Scientific Computing and Algorithms, Germany
  • Lauren Nicole DeLong, University of Edinburgh, United Kingdom
  • Sumit Madan, Fraunhofer Institute for Scientific Computing and Algorithms, Germany
  • Daniel Domingo-Fernández, Fraunhofer Institute for Scientific Computing and Algorithms, Germany
  • Ashar Ahmad, Grünenthal GmbH, Germany
  • Sheraz Gul, Fraunhofer Institute for Translational Medicine and Pharmacology, Germany
  • Andrea Zaliani, Fraunhofer Institute for Translational Medicine and Pharmacology, Germany
  • Holger Fröhlich, Fraunhofer Institute for Scientific Computing and Algorithms, Germany


Adverse drug events constitute a major challenge for the success of clinical trials. Several computational strategies have been suggested to estimate the risk of adverse drug events in preclinical drug development. While these approaches have demonstrated high utility in practice, they are limited to specific information sources and thus neglect a wealth of information that is uncovered by the fusion of different data sources, including biological protein function, gene expression, chemical compound structure, and cell-based imaging. In this work, we propose an integrative and explainable Graph Machine Learning approach (MultiGML), which fuses knowledge graphs with multiple further data modalities to predict drug-related adverse events. MultiGML demonstrates excellent prediction performance compared to alternative algorithms, including various knowledge graph embedding techniques. MultiGML distinguishes itself from alternative techniques by providing in-depth explanations of model predictions, which point towards biological mechanisms associated with predictions of an adverse drug event.

B-224: Exploring Cell-Cell Communication in Pancreas Tissue via Attributed-graph Convolutional Neural Networks on Single-cell RNA-sequencing Data
Track: MLCSB
  • Akram Vasighizaker, University of Windsor, Canada
  • Sheena Hora, University of Windsor, Canada
  • Luis Rueda, University of Windsor, Canada


Recently, computational approaches have been utilized to study cell-cell communication based on graph-structured data. Link prediction methods were employed, but they only performed well on specific networks by considering the likelihood of node interactions. Analyzing local subgraphs instead of the entire graph to include latent as well as explicit features was suggested as a solution.
This study presents a new approach that uses an attributed graph convolutional neural network to forecast cell-cell communication from single-cell RNA-seq data. The proposed method extracts the explicit and latent attributes of an attributed graph that was built from the gene expression profiles of individual cells. To convert high-dimensional and sparse single-cell RNA-seq data to a graph format we used a similarity-based optimization method.
Experiments on six datasets from pancreas tissue show that the proposed method outperforms the latent feature-based approaches and the state-of-the-art methods for link prediction, with 0.99 ROC and 99% prediction accuracy. Additionally, to identify underlying interactions, we ran GeneMANIA on the input list of the top 20 genes. The results on one dataset showed key regulators and effectors of the functional relationships between cells according to the GO annotation: growth factor binding, insulin-like growth factor binding, and type 1 interferon.

B-225: Scalable test of statistical significance for protein-DNA binding changes with insertion and deletion of bases in the genome
Track: MLCSB
  • Sunyoung Shin, Pohang University of Science and Technology, South Korea
  • Qinyi Zhou, University of Texas at Dallas, United States
  • Chandler Zuo, University of Texas at Dallas, United States
  • Yuannyu Zhang, St. Jude Children's Research Hospital, United States
  • Min Chen, University of Texas at Dallas, United States
  • Jian Xu, St. Jude Children's Research Hospital, United States


Mutations in noncoding DNA, which represents approximately 99% of the human genome, have been crucial to understanding disease mechanisms through dysregulation of disease-associated genes. One key element in gene regulation that noncoding mutations mediate is the binding of proteins to DNA sequences. Insertions and deletions of bases (InDels) are the second most common type of mutation, following single nucleotide polymorphisms, that may impact protein-DNA binding. However, no existing methods can estimate and test the effects of InDels on the process of protein-DNA binding. We develop a novel test of statistical significance, namely the binding change test (BC test), using a Markov model to evaluate the impact of InDels and identify those altering protein-DNA binding. The test predicts binding-changer InDels of regulatory significance with an efficient importance sampling algorithm that generates background sequences in favor of large binding affinity changes. Simulation studies demonstrate its excellent performance. The application to human leukemia data uncovers candidate pathological InDels modulating MYC binding in leukemic patients. We develop an R package, atIndel, which is available on GitHub.

B-226: Machine Learning Derived Transcriptional Signatures in Cancer
Track: MLCSB
  • Faeze Keshavarz, University of British Columbia, Canada
  • Erin Pleasance, Canada's Michael Smith Genome Sciences Centre, Canada
  • Steven Jones, University of British Columbia, Canada


A key to understanding cancer, and identifying tumour therapeutic vulnerabilities, is to determine the impact of mutations on cellular pathways. Advances in next-generation sequencing technologies have led to the generation of a large volume of genomic and transcriptomic data. This wealth of omics data provides opportunities for machine learning approaches to go beyond prediction of mutation oncogenicity to investigate the broader impact on the transcriptome of cancerous cells. We successfully demonstrate that a random forest model can identify a transcriptional signature that is associated with the loss of wild-type activity of the TP53 gene across various tumour types, with AUROC and AUPRC of 0.94 and 0.96 respectively. This model additionally identifies silent mutations with functional consequences, and reveals genes impacted by TP53 mutation which can point to therapeutic targets. We have further expanded supervised machine learning approaches on cancer transcriptome data from both primary and metastatic cancers to 50 other frequently mutated genes in cancer. This revealed tissue-specific patterns including for BRAF, PBRM1 and ATRX. These transcriptional signatures associated with cancer gene mutations, especially those associated with therapeutically targeted proteins, have the potential to aid in identification of new targeted therapies for patients with cancer.

B-227: The virtual biomembrane
Track: MLCSB
  • Andreas Denger, Saarland University, Germany
  • Volkhard Helms, Saarland University, Germany


The passage of small hydrophilic molecules across biological membranes is mediated by transmembrane transport proteins. These transporters are often highly substrate-specific, and that specificity is encoded in the sequence and, subsequently, in the three-dimensional structure of the protein. Predicting the substrates of all transporter genes in a particular genome would allow us to get an overview of the influx and efflux that a particular organism can carry out. This could help, e.g., to tune the ingredient concentrations in a medium for growing single-cell organisms in the lab, or with finding antibiotic-resistant strains in metagenomic samples. Using features derived from protein sequence and structure, we train machine learning models that predict all transmembrane transporters in a genome and assign them to their putative transported substrates. We will present methodology, software implementation, and initial results for the transporter substrate classification task for E. coli and S. cerevisiae.

B-229: Deciphering protein secretion from the brain to cerebrospinal fluid for biomarker discovery
Track: MLCSB
  • Katharina Waury, Vrije Universiteit Amsterdam, Netherlands
  • Renske de Wit, Vrije Universiteit Amsterdam, Netherlands
  • Inge Verberk, Amsterdam UMC, Netherlands
  • Charlotte Teunissen, Amsterdam UMC, Netherlands
  • Sanne Abeln, Vrije Universiteit Amsterdam, Netherlands


Presentation Overview:

Cerebrospinal fluid (CSF) protein biomarkers are urgently needed. However, the high dynamic range of protein concentrations in CSF hinders the detection of the least abundant proteins by mass spectrometry and complicates the discovery of novel brain-derived protein markers. Here, we explore whether secretion of brain proteins to CSF can be predicted by a machine learning approach, and which factors determine a protein's CSF presence. We curated a large CSF proteome by combining six exploratory mass spectrometry datasets. This CSF proteome was used to annotate the Human Protein Atlas (HPA) brain elevated proteome regarding CSF presence. A logistic classifier was trained on a wide range of sequence-based features to differentiate between CSF and non-CSF proteins. The model achieved an area under the curve of 0.81, which improved to as much as 0.93 when the stringency of the CSF proteome was increased with regard to the minimum number of studies in which a protein was detected. The most important prediction features included subcellular localization, the presence of a signal peptide, and specific motifs. The trained classifier is shown to generalize well to the larger HPA brain detected proteome and is able to identify novel low-abundance CSF proteins not identified by mass spectrometry.

B-230: Global pattern and determinant for interaction of seasonal influenza viruses
Track: MLCSB
  • Yilin Chen, Sun Yat-sen University, School of Public Health (Shenzhen), China
  • Xiangjun Du, Sun Yat-sen University, School of Public Health (Shenzhen), China


Presentation Overview:

The prevalence of different types/subtypes varies across seasons and regions for seasonal influenza viruses, indicating underlying interactions between subtypes. The global interaction patterns and determinants for seasonal influenza viruses need to be explored.
Laboratory influenza surveillance data and multidimensional data, including population, environment, and virus-related indicators from 55 countries worldwide, were used to explore subtype interactions based on the Spearman correlation coefficient. The machine learning method XGBoost, together with the interpretable framework SHAP, was used to quantify contributing factors and their effects on global subtype interactions of seasonal influenza viruses. Additionally, causal relationships between subtypes were explored based on convergent cross-mapping.
A consistent, globally negative correlation between influenza A/H3N2 and A/H1N1 indicates strong cross-protection. At the same time, interactions between influenza A (H3N2/H1N1) and B show large differences across regions, which were mainly influenced by population-related factors. Influenza A has a stronger driving force than influenza B, and A/H3N2 has a stronger driving force than A/H1N1.
The detailed interaction study for seasonal influenza viruses sheds light on better model construction and epidemic prediction of influenza. The revelation of the heterogeneous interaction patterns and dominant determinants suggests that influenza prevention, control, and prediction should be precisely formulated based on regional specificities.
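As a rough illustration of the Spearman correlation step mentioned in this abstract, the sketch below computes the rank correlation between two subtype activity series using only the Python standard library. The series values and the names `h3n2` and `h1n1` are made-up toy data, not surveillance figures from the study, and the authors' actual pipeline is not described here.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical seasonal positivity counts for two subtypes (toy values).
h3n2 = [10, 40, 25, 5, 60]
h1n1 = [50, 20, 30, 55, 8]
print(spearman(h3n2, h1n1))
```

A coefficient near -1 on such a pair of series would correspond to the kind of negative association the abstract reports between A/H3N2 and A/H1N1; correlation alone does not establish the causal direction, which is why the abstract additionally uses convergent cross-mapping.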

B-231: Association of stroke with anthropometry and body composition: Korean National Health and Nutrition Examination Survey (KNHANES Ⅳ-Ⅴ) 2008-2011
Track: MLCSB
  • Sang Yeob Kim, KOBIC, South Korea
  • Byeong Hoon Park, KOBIC, South Korea


Presentation Overview:

Anthropometry and body composition have been associated with stroke. However, very little has been reported on the association of stroke with anthropometry and body composition simultaneously, and the association has not been reported in South Korea. The objective of this study was to examine the association of stroke with anthropometry and body composition in the Korean population using data from the Korea National Health and Nutrition Examination Survey. A total of 8,407 subjects over 50 years old participated. A binary logistic regression analysis was conducted to assess the association of stroke with anthropometric and body composition indices, and areas under the curve were calculated to compare the power of all variables to identify stroke. Age showed the strongest association with stroke in both men and women. The results of the adjusted analysis showed that waist-to-height ratio and weight were associated with stroke in men, but not in women. There was no association between fat mass and stroke in men, but there was a strong association of stroke with trunk fat, total body fat without head, and total body fat in women. The association between body composition and stroke therefore differed by gender.
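To make the "area under the curve" comparison in this abstract concrete, the following minimal sketch computes the area under the ROC curve for a single variable as the probability that a randomly chosen case scores higher than a randomly chosen non-case. The ages and stroke labels are invented toy values, not KNHANES data, and the function name `auroc` is an illustrative choice.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one case and one non-case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: age as the single predictor, 1 = stroke, 0 = no stroke.
age = [52, 61, 58, 74, 66, 80]
stroke = [0, 0, 1, 1, 0, 1]
print(auroc(age, stroke))
```

Computing this value separately for each anthropometric or body composition index gives a simple way to rank variables by discriminative ability, which is one plausible reading of the comparison described above.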

B-232: Genotype Imputation with Multi-label Random Forests
Track: MLCSB
  • Ekaterina Antonenko, École Polytechnique, France
  • Jesse Read, École Polytechnique, France


Presentation Overview:

Single Nucleotide Polymorphisms (SNP) data is essential in genetic studies. Typically, such data is prone to missing values, and removing instances with missing values can adversely affect the quality of further data analysis; thus, imputation methods are required. While in human studies a reference genome of high quality may be an efficient solution, in non-human settings such panels are often not available.
While deep learning is a state-of-the-art approach for imputation in high-dimensional data, existing methods still require enough complete cases to be trained on, which are often unavailable in real-world problems.
In this paper, we propose Chains of Autoreplicative multi-label Random Forests which impute missing values based only on the information extracted from the presented data, are computationally effective, and work well for high-dimensional and low-sampled data. Experiments on several SNP datasets show that our algorithm effectively imputes missing values and exhibits better performance than standard algorithms that do not require any additional information. In this paper, the algorithm is implemented specifically for SNP data. Still, it can easily be adapted for other cases of missing value imputation in biological data, e.g. gene expression arrays.

B-233: COMPARATIVE ANALYSIS OF SUPERVISED INTEGRATIVE METHODS FOR MULTI-OMICS DATA
Track: MLCSB
  • Alexei Novoloaca, BIOASTER, France
  • Camilo Broc, BIOASTER, France
  • Laurent Beloeil, BIOASTER, France
  • Wen-Han Yu, Gates MRI, United States
  • Jérémie Becker, BIOASTER, France


Presentation Overview:

Recent advances in sequencing, mass spectrometry and cytometry technologies have enabled researchers to collect multiple omics data from a single sample. These large datasets have led to a growing consensus that a holistic approach is needed to identify new candidate biomarkers and unveil mechanisms underlying disease aetiology, key to precision medicine. While many reviews and benchmarks have been conducted on unsupervised approaches, their supervised counterparts have received less attention in the literature, and no gold standard has emerged yet.
In this work, we present a thorough comparison of a selection of five methods, representative of the main families of integrative approaches. Methods were evaluated on both simulated and real-world datasets, the latter being carefully selected to cover different medical applications and data modalities. A set of nineteen simulations was designed from the real-world datasets to explore a large and realistic parameter space.
Overall, integrative approaches showed comparable or higher performance on simulations and outperformed non-integrative methods on real-world data. More specifically, multiple kernel and matrix factorization methods demonstrated a strong ability to uncover modest effects in high-dimensional settings. The strengths and limitations of those methods will be discussed in detail, along with guidelines for future applications.

B-234: Genetic data underdetermination, or why everything seems additive in genetic risk modeling
Track: MLCSB
  • Nora Verplaetse, KULeuven, Belgium
  • Antoine Passemiers, KULeuven, Belgium
  • Adam Arany, KULeuven, Belgium
  • Yves Moreau, KULeuven, Belgium
  • Raimondi Daniele, KULeuven, Belgium


Presentation Overview:

Notwithstanding ample biological arguments for the existence of nonlinear interactions between genetic factors underlying polygenic diseases, research results so far have not made a persuasive case for the advantage of nonlinearity in genotype-to-phenotype modeling. Genetic data intrinsically suffer from an underdetermination issue (p >> n), with millions of variants being observed for each individual, while the collection of large, homogeneous cohorts is hampered by phenotype incidence, sequencing cost and batch effects.
Here we provide empirical evidence that this underdetermination is a major driver of the apparent optimality of additive genetic risk modeling. By optimizing the n/p ratio, through the provision of enough training data and the stringent control of the complexity of the nonlinear model, we can identify a sample size threshold after which nonlinear modeling clearly outperforms linear modeling in inflammatory bowel disease prediction. To control model complexity, we try different approaches to sparsify the connections in our neural network, including the use of biological knowledge, which also makes the neural network biologically interpretable.

B-235: Exploring tissue structure across spatial omic modalities with Vesalius
Track: MLCSB
  • Patrick Cn Martin, Cedars-Sinai Medical Center, United States
  • Wenqi Wang, BRIC, Denmark
  • Hyobin Kim, Cedars-Sinai Medical Center, United States
  • Kyoung Jae Won, Cedars-Sinai Medical Center, United States


Presentation Overview:

The field of spatially resolved transcriptomics has demonstrated how science can illuminate the hidden complexities of our world by mapping the intricate landscape of gene expression in each tissue or organ. Not content with only measuring the transcriptome, spatial biology has extended to cover a multitude of modalities: from chromatin accessibility to the proteome, we now have an unprecedented ability to explore the mysteries of development and disease. Yet for all its power, spatial biology remains a relatively new field, and computational tools to analyze this data are required to capitalize on the wealth of information that is generated. We developed Vesalius, a computational tool that can effectively recover tissue structures in multiple spatial biology modalities. Vesalius converts any spatial biology modality into an image set upon which it applies image processing methods to highlight and extract tissue territories. We applied Vesalius to a variety of modalities including spatial measurement of clonal diversity, chromatin accessibility, histone modification and gene expression. Vesalius can effectively recover tissue anatomy for in-depth analysis and exploration of tissue structure. The visual nature of Vesalius lends itself to intuitive exploration of spatial patterns and interactions across biological modalities.

B-236: DeepOM: single-molecule optical genome mapping via deep learning
Track: MLCSB
  • Yevgeni Nogin, Technion, Israel
  • Tahir Detinis Zur, Tel Aviv University, Israel
  • Sapir Margalit, Tel Aviv University, Israel
  • Ilana Barzilai, Technion, Israel
  • Onit Alalouf, Technion, Israel
  • Yuval Ebenstein, Tel Aviv University, Israel
  • Yoav Shechtman, Technion, Israel


Presentation Overview:

Tapping efficiently into genomic information from a microscopic image of a DNA molecule will open new frontiers in molecular diagnostics. Here, we present a new computational method for optical genome mapping via Deep Learning, termed DeepOM. A Convolutional Neural Network (CNN) model, trained on simulated images of labeled DNA molecules, improves the success rate in alignment of DNA images to genomic references. When compared to state-of-the-art commercial software on images of human DNA molecules stretched in nanochannels, the model achieves a significant advantage in alignment success rate and improves yield, sensitivity, and throughput of optical genome mapping experiments in human genomics and microbiology.

B-237: Sparse Explanations of Neural Networks on Genome Sequences Using Pruned Layer-Wise Relevance Propagation
Track: MLCSB
  • Paulo Yanez Sarmiento, Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Germany
  • Simon Witzke, Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Germany
  • Marta Lemanczyk, Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Germany
  • Jakub Bartoszewicz, Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Germany
  • Bernhard Renard, Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Germany


Presentation Overview:

Deep neural networks have become very popular in genomics for a variety of tasks including motif detection or pathogenicity prediction. Despite their good performance, inference on how a model came to a particular decision remains challenging for genomic data because of its high dimensionality and higher-order feature interactions. Layer-wise relevance propagation (LRP) is one of the widely used post-hoc explanation methods decomposing a network's prediction into relevance scores for the input features. However, identifying and interpreting the most important features might be difficult for noisy relevance attributions and large-scale datasets of genome sequences. Therefore, we present a modification of LRP that enforces sparsity directly by pruning the relevance propagation for the different layers. Thereby, we achieve more sparse relevance attributions for the input sequences as well as for the intermediate layers. As the relevance propagation is input-specific, we aim to prune the explanation rather than the underlying model. We show that our approach leads to noise reduction and concentrates relevance at the most important features compared to the unpruned approach. To demonstrate the efficacy of our method, we present an application to different model architectures trained on synthetic genomics data with known ground truth.

B-238: Predicting the pro-longevity or anti-longevity effect of model organism genes with novel hierarchy-aware ensemble Bayesian networks classification algorithms
Track: MLCSB
  • Cen Wan, Birkbeck, University of London, United Kingdom


Presentation Overview:

This abstract introduces a recently proposed hierarchy-aware ensemble Bayesian networks classification algorithm – Positive Feature Values Prioritized Hierarchical Dependency Constrained Averaged One-dependence Estimators (Hie-AODE+), which is applied to the task of predicting genes’ pro-longevity or anti-longevity effects using Gene Ontology terms as features. The experimental results confirmed that Hie-AODE+ significantly improved on the predictive performance of the conventional AODE classification algorithm.

B-239: Whole Genome Deconvolution Unveils Alzheimer’s Resilient Epigenetic Signature
Track: MLCSB
  • Eloise Berson, Stanford University, United States
  • Anjali Sreenivas, Stanford University, United States
  • Thanaphong Phongpreecha, Stanford University, United States
  • M. Ryan Corces, Gladstone Institute, United States
  • Nima Aghaeepour, Stanford University, United States
  • Thomas Montine, Stanford University, United States


Presentation Overview:

Assay for Transposase Accessible Chromatin by sequencing (ATAC-seq) provides an accurate way to depict the chromatin regulatory state and altered mechanisms guiding gene expression in disease. However, bulk sequencing entangles information from different cell types and obscures cellular heterogeneity. Here, we develop and validate Cellformer, a novel deep learning method that deconvolutes bulk ATAC-seq into cell type-specific expression across the whole genome. Cellformer enhances the bulk ATAC-seq resolution and allows efficient cell type-specific open chromatin profiling on large cohorts at a low cost. Applied to 191 bulk samples from 3 brain regions, Cellformer identifies cell type-specific gene regulatory mechanisms and putative mediators involved in resilience to Alzheimer’s disease (RAD), which characterizes an uncommon group of cognitively healthy individuals who harbor a high pathological load of Alzheimer’s disease (AD). Cell type-resolved chromatin profiling unveils cell type-specific pathways and nominates potential epigenetic mediators underlying RAD that may illuminate therapeutic opportunities to limit the cognitive impact of this highly prevalent yet incurable disease. Cellformer has been made freely and publicly available to advance analysis of high-throughput bulk ATAC-seq in future investigations.

B-240: Deep neural networks for modelling indel mutation rates at base-pair resolution in the human genome
Track: MLCSB
  • Iván Galván-Femenía, Genome Data Science, Institute for Research in Biomedicine (IRB Barcelona), Barcelona, Spain
  • Daniel Naro, Genome Data Science, Institute for Research in Biomedicine (IRB Barcelona), Barcelona, Spain
  • Marcell Veiner, Genome Data Science, Institute for Research in Biomedicine (IRB Barcelona), Barcelona, Spain
  • Fran Supek, Genome Data Science, Institute for Research in Biomedicine (IRB Barcelona), Barcelona, Spain


Presentation Overview:

Somatic mutation rates in cancer genomes vary across the human genome at different scales. At the megabase scale, higher mutation rates are associated with later replication timing. At the gene scale, highly expressed genes generally have a reduced mutation rate. However, at the sub-gene scale, mutation rate remains less explored, particularly for indel mutations. Here, we evaluate indel somatic mutation rates in the human genome at base-pair resolution, from 1bp-300bp, using tumors from 7600 whole-genome sequences. First, we identify DNA motifs enriched near somatic indels for each mutational signature. Next, we use these motifs and the DNA sequence around the indels to estimate mutation rate using deep neural network (DNN) models. The DNN is able to learn DNA sequence features to differentiate mutated from non-mutated sites in the human genome. We identified 494 significant de novo motifs; as a positive control, poly(A) motifs were the most highly enriched motifs for both the ID1 and ID2 replication slippage indel mutational signatures. Accordingly, these processes show the highest mutation rate prediction accuracy in the DNN models (AUC: 0.96 and 0.97). Overall, our DNN framework infers the risk of indel mutations occurring given the nearby DNA sequence, with implications for disease gene discovery and evolutionary biology.

B-241: SELFormer: Molecular Representation Learning via SELFIES Language Models
Track: MLCSB
  • Erva Ulusoy, Hacettepe University, Turkey
  • Atakan Yüksel, Hacettepe University, Turkey
  • Atabey Ünlü, Hacettepe University, Turkey
  • Tunca Doğan, Hacettepe University, Turkey


Presentation Overview:

The expensive and time-consuming nature of drug discovery necessitates the incorporation of innovative computational techniques into research and development pipelines. Representation learning has emerged as a promising solution for creating compact and informative numerical representations of molecules that can be utilized effectively in subsequent prediction tasks. However, current methods suffer from robustness and validity issues, primarily as a result of the input encoding or algorithms employed. In this study, we introduce SELFormer, a transformer-based chemical language model that utilizes SELFIES as input, a 100% valid, compact, and expressive notation. SELFormer is pre-trained on two million drug-like compounds in the ChEMBL database and evaluated on various molecular property prediction tasks. SELFormer demonstrated superior performance in predicting the aqueous solubility of molecules and adverse drug reactions compared to existing graph learning-based methods and SMILES-based chemical language models, while producing comparable results for other tasks. We shared SELFormer as a programmatic tool, along with its datasets and pre-trained models. Overall, our research demonstrates the benefits of combining a valid and expressive molecular notation with the appropriate deep learning architecture in chemical language modeling, thereby opening up new possibilities for discovering and designing novel drug candidates.

B-242: Structure-Independent Peptide Binder Design via Generative Language Models
Track: MLCSB
  • Garyk Brixi, Duke University, United States
  • Kalyan Palepu, Duke University, United States
  • Suhaas Bhat, Duke University, United States
  • Sophia Vincoff, Duke University, United States
  • Tianlai Chen, Duke University, United States
  • Lauren Hong, Duke University, United States
  • Vivian Yudistyra, Duke University, United States
  • Pranam Chatterjee, Duke University, United States


Presentation Overview:

The ability to modulate pathogenic proteins represents a powerful treatment strategy for diseases. Unfortunately, many proteins are considered “undruggable” by small molecules, and are often intrinsically disordered, precluding the usage of structure-based tools for binder design. To address these challenges, we have developed a suite of algorithms that enable the design of target-specific peptides via protein language model embeddings, without the requirement of 3D structures. First, we train a model that leverages ESM-2 embeddings to efficiently select high-affinity peptides from natural protein interaction interfaces. We experimentally fuse model-derived peptides to E3 ubiquitin ligases and identify candidates exhibiting robust degradation of undruggable targets in human cells. Next, we develop a high-accuracy discriminator, based on the CLIP architecture, to prioritize and screen peptides with selectivity to a specified target protein. As input to the discriminator, we create a Gaussian diffusion generator to sample an ESM-2-based latent space, fine-tuned on experimentally-valid peptide sequences. Finally, to enable de novo generation of binding peptides, we train an instance of GPT-2 with protein interacting sequences to enable peptide generation conditioned on target sequence. Our model demonstrates low perplexities across both existing and generated peptide sequences. Together, our work lays the foundation for programmable protein targeting applications.

B-243: A Deep learning model for simulating PacBio HiFi sequencing reads of recombinant Adeno-Associated Virus (rAAV) vectors
Track: MLCSB
  • Kokou Kevin Atsou, Pfizer, France
  • Joe Saelens, Pfizer, United States
  • Mykola Bordyuh, Pfizer, United States
  • Robert Stanton, Pfizer, United States
  • Yu-Ting Chen, Pfizer, United States
  • Herbert A. Runnels, Pfizer, United States
  • Reiko Nakashima, Pfizer, United States


Presentation Overview:

Background: To ensure the product quality of recombinant adeno-associated virus (rAAV)-based gene therapies, it is essential to examine the encapsidated genome identity, integrity, and the associated process- and product-related DNA impurities. While long-read sequencing has been an effective analytical tool to characterize vector genomes without fragmentation, the abundance of new long-read sequences of rAAV genomes has outpaced the development of accompanying computational tools tailored to these data. Notably, there is a shortage of simulators that can generate ground-truth data needed to compare method performance.

Methods: We built a simulator model for PacBio High-fidelity (HiFi) reads which was trained on AAV vector sequencing reads. The simulator is an auto-regressive model based on the transformer (encoder-decoder) neural network (NN) architecture. The model takes a reference AAV vector sequence as input and generates the reads and quality scores by mimicking PacBio sequencing outputs.

Results: Our model can simulate the properties of experimental sequencing reads including sequencing errors, truncations and chimeric DNA. Additionally, the model reproduces the flip-flop configurations generally observed in the Inverted Terminal Repeats (ITRs) of the vectors. The model achieved more than 90% concordance between empirical and simulated reads. The NN simulated reads can be used to benchmark existing variant calling tools.

B-244: MuLan-Methyl - Multiple Transformer-based Language Models for Accurate DNA Methylation Prediction
Track: MLCSB
  • Wenhuan Zeng, Tübingen University, Germany
  • Anupam Gautam, IBMI, Tübingen University, Germany
  • Daniel Huson, University of Tuebingen, Germany


Presentation Overview:

Transformer-based language models are successfully used to address massive text-related tasks. DNA methylation is an important epigenetic mechanism and its analysis provides valuable insights into gene regulation and biomarker identification. Several deep learning-based methods have been proposed to identify DNA methylation and each seeks to strike a balance between computational effort and accuracy. Here, we introduce MuLan-Methyl, a deep-learning framework for predicting DNA methylation sites, which is based on five popular transformer-based language models. The framework identifies methylation sites for three different types of DNA methylation, namely 6mA, 4mC and 5hmC. Each of the employed language models is adapted to the task using the “pre-train and fine-tune” paradigm. Pre-training is performed on a custom corpus of DNA fragments and taxonomy lineages using self-supervised learning. Fine-tuning aims at predicting the DNA-methylation status of each type. The five models are used to collectively predict the DNA methylation status. We report excellent performance of MuLan-Methyl on a benchmark dataset. Moreover, we show that the model captures characteristic differences between different species that are relevant for methylation. This demonstrates that language models can be successfully adapted to applications in biological sequence analysis and that joint utilization of different language models improves model performance.

B-245: PerturbDecode, a probabilistic analysis framework to recover regulatory circuits and predict genetic interactions from large-scale Perturb-Seq screens
Track: MLCSB
  • Basak Eraslan, Genentech Research and Early Development, United States
  • Kathryn Geiger-Schuller, Genentech Research and Early Development, United States
  • Olena Kuksenko, Broad Institute of MIT and Harvard, United States
  • Caroline Uhler, Broad Institute of MIT and Harvard, United States
  • Orit Rozenblatt-Rosen, Genentech Research and Early Development, United States
  • Aviv Regev, Genentech Research and Early Development, United States


Presentation Overview:

Pooled genetic perturbation screens with single-cell RNA-seq readouts (Perturb-Seq) open up new avenues in deciphering gene regulatory networks. However, it is challenging to analyze large-scale screens with thousands of perturbed genes and millions of profiled cells due to noise from varying degrees of efficiency of the CRISPR-based perturbations, the sheer scale of the data, and the need to predict effects of combinations of perturbations. Here, we present PerturbDecode, a framework for the automated analysis of such screens, including ComBVAE, a probabilistic deep generative model to identify effective CRISPR guides and significantly perturbed cells, as well as to predict the outcome of unseen perturbations. To test PerturbDecode, we analyzed a new large-scale Perturb-Seq screen, spanning 3,390 perturbations of 1,130 E3 ligase family members across 838,201 primary immune dendritic cells. Applying PerturbDecode to these data, we demonstrate that it is powerful in filtering out ineffective guides as well as unperturbed cells, thus increasing the signal-to-noise ratio. ComBVAE leveraged the modular organization of the regulatory network to predict the outcome of unseen combinations of perturbations. ComBVAE-generated profiles were close to experimentally observed profiles, outperformed other models in predicting the combinatorial perturbation responses, and revealed principles of regulation.

B-246: Predicting Disinfectant Resistance in Listeria monocytogenes using Whole Genome Sequencing and Machine Learning
Track: MLCSB
  • Alexander Gmeiner, Food Institute, Technical University of Denmark, Denmark
  • Mirena Ivanova, Food Institute, Technical University of Denmark, Denmark
  • Pimlapas Leekitcharoenphon, Food Institute, Technical University of Denmark, Denmark
  • Leonid Chindelevitch, Faculty of Medicine, School of Public Health, Imperial College London, United Kingdom


Presentation Overview:

Even though many studies focus on predicting antimicrobial resistance from bacterial genomic data, the analogous prediction of resistance to disinfectants remains poorly studied. The objectives of this study are to i) obtain a high-performing predictor of resistance to disinfectants (benzalkonium chloride (BC) and peracetic acid) and ii) explore the prediction of minimum inhibitory concentration (MIC) values. Within this study, 1650 Listeria monocytogenes whole genome sequencing samples and their corresponding MIC values were used to train different standard Machine Learning (ML) models, including a symbolic regression model. Different genomic feature levels (i.e., SNPs and genes) were explored as input for classification (i.e., resistant, susceptible) and regression (MIC values). The BC classification results show that a linear logistic regression model with L1 regularization performs best, achieving a balanced accuracy of 0.91 using SNP features. BC regression with SNP features found that random forest regressors (RFR) outperform the others, with R2 scores up to 0.68. Additionally, the test set results for BC RFR with SNP input show 287/330 predictions within a one-dilution difference of the measured MIC. The high-performing predictive models from this study could aid case-specific adaptation of cleaning procedures at food production sites in case of L. monocytogenes contamination.
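The two evaluation measures quoted in this abstract can be sketched as follows. This is a minimal illustration under two assumptions that the abstract does not state: that MIC values lie on a twofold dilution series, so "one dilution" means a factor of two, and that balanced accuracy is the mean of sensitivity and specificity. The function names and toy labels are hypothetical; only the counts 287 and 330 come from the abstract.

```python
import math


def within_one_dilution(predicted_mic, measured_mic):
    """True if the two MICs differ by at most one twofold dilution step.

    Assumes a twofold dilution series, i.e. one step equals a factor of 2.
    """
    steps = abs(math.log2(predicted_mic) - math.log2(measured_mic))
    return steps <= 1.0 + 1e-9  # small tolerance for floating-point error


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = resistant)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    return 0.5 * (tp / n_pos + tn / n_neg)


# Fraction of test-set predictions within one dilution, from the abstract.
reported_fraction = 287 / 330
print(f"{reported_fraction:.1%}")  # about 87% of test predictions

# Toy labels (not study data) to show the balanced accuracy calculation.
toy_true = [1, 1, 1, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 0, 0, 0, 0, 0, 1]
print(balanced_accuracy(toy_true, toy_pred))
```

Balanced accuracy is a natural choice when resistant and susceptible isolates are unevenly represented, since it weights both classes equally regardless of their counts.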

B-247: DeepResponse: Large Scale Prediction of Cancer Cell Line Drug Response with Deep Learning Based Pharmacogenomic Modelling
Track: MLCSB
  • Burakcan Izmirli, Hacettepe University, Turkey
  • Umut Onur Özcan, Hacettepe University, Turkey
  • Navid Mohammadvand, Hacettepe University, Turkey
  • Etkin Akar, Middle East Technical University, Turkey
  • Deniz Cansen Kahraman, Middle East Technical University, Turkey
  • Tunca Doğan, Hacettepe University, Turkey


Presentation Overview:

Assessing the best treatment option for each patient is the main goal of precision medicine. Patients with the same diagnosis may display varying sensitivity to the applied treatment due to genetic heterogeneity, especially in cancers. With the aim of predicting drug response in advance, to save valuable time and prevent the administration of ineffective drugs, computational approaches that utilise genetic features of patients have been developed. Here, we propose DeepResponse, a machine learning-based system that predicts drug responses (sensitivity) of cancer cells. DeepResponse takes as input the gene expression, mutation, copy number variation and methylation profiles of different cancer cell lines (each representing an individual tumour) obtained from large-scale profiling/screening projects, together with drugs’ molecular features, and processes them via hybrid convolutional and graph-transformer deep neural networks to learn the relationship between multi-omics features of the tumour and its sensitivity to the administered drug. Both the performance results and in vitro validation experiments indicated that DeepResponse successfully predicts drug sensitivity of cancer cells, and that the multi-omics aspect in particular benefited the learning process and yielded better performance compared to the state-of-the-art. DeepResponse can be used for early-stage discovery of new drug candidates and for repurposing existing ones against resistant tumours.

B-248: Developing sequence-based predictive models for protein-protein interactions using machine learning strategies
Track: MLCSB
  • David Medina, Departamento de Ingeniería en Computación, Universidad de Magallanes, Chile
  • Pedro Salinas, Escuela de Ingeniería civil en Bioinformática, Universidad de Talca, Chile
  • Fabio Durán-Verdugo, Escuela de Medicina, Universidad de Magallanes, Chile
  • Alvaro Olivera-Nappa, Centre for Biotechnology and Bioengineering, Department of Chemical Engineering and Biotechnology, University of Chile, Chile
  • Roberto Uribe-Paredes, Departamento de Ingeniería en Computación, Universidad de Magallanes, Chile


Presentation Overview: Show

Predicting the affinity between two proteins is one of the most relevant challenges in bioinformatics and one of the most useful for biotechnological and pharmaceutical applications. Current prediction methods use the structural information of the interaction complexes. However, predicting protein structures is computationally expensive. Machine learning methods have emerged as an alternative, but several options for data processing and predictive model training still need to be explored. This work builds sequence-based predictive models of protein interaction via deep learning architectures and classical machine learning algorithms, evaluating numerical representation methods and transformation techniques to represent structural complexes using linear information. Six types of predictive tasks, related to affinity and mutational variant evaluation and their effect on the interaction complex, were explored. We show that classical machine learning and CNN-based methods perform better than GCN methods for studying mutational variants. In contrast, graph-based methods perform better on affinity problems or association constants, using only the linear information of the protein sequences. Finally, we show an illustrative use case, explain how to use the developed models, discuss the limitations of the explored methods and comment on future development strategies for improving the studied processes.

B-249: Distinguishing First-Degree Relationships from Ancient Samples with Machine Learning
Track: MLCSB
  • Merve Nur Güler, Department of Biological Sciences, Middle East Technical University, Turkey
  • Ardan Yılmaz, Department of Computer Engineering, Middle East Technical University, Turkey
  • Tara Ekin Ünver, Department of Computer Engineering, Middle East Technical University, Turkey
  • Igor Mapelli, Middle East Technical University, Turkey
  • Kıvılcım Başak Vural, Department of Biological Sciences, Middle East Technical University, Turkey
  • Kanat Gürün, Department of Biological Sciences, Middle East Technical University, Turkey
  • Emre Akbas, Department of Computer Engineering, Middle East Technical University, Turkey
  • Mehmet Somel, Department of Biological Sciences, Middle East Technical University, Turkey


Presentation Overview: Show

This study aims to differentiate between parent-offspring and sibling pairs in low-coverage ancient genomes using deep learning. The ability to distinguish between these two first-degree relationship categories is vital for investigating long-gone cultural practices. We first simulated founders using the population genetic simulator msprime and used the pedigree simulator ped-sim to create sibling and parent-offspring pairs under realistic demographic scenarios. Then we applied ancient DNA simulation. Next, we estimated the kinship coefficient, i.e., the probability that two alleles at a given locus are identical by descent, across genomic windows using an allele-sharing coefficient statistic. We applied two-dimensional binning to these estimates and trained a convolutional neural network on the binned values. We tested the algorithm under scenarios with different numbers of shared SNPs and achieved 98% and 92% precision for pairs sharing 50K and 20K SNPs, respectively. Additionally, to improve the algorithm's accuracy, we studied the performance of curriculum learning and increased the sample size. This study demonstrates the potential of deep learning to precisely differentiate between first-degree relationships in low-coverage ancient genomes and provides a foundation for future research in this field.

B-250: Joint Representation Learning of Heterogeneous Biomedical Entity Relationships via Hypergraph Neural Networks
Track: MLCSB
  • Elif Çevrim, Hacettepe University, Turkey
  • Tunca Doğan, Hacettepe University, Turkey


Presentation Overview: Show

Studying complex relationships between biological and biomedical entities is critical to understand how life processes work and how they are perturbed in pathological states. Graphs are widely used for capturing complex interplay between biological entities. By nature, biomedical relationships are heterogeneous, and thus, graphs that represent them should contain multiple types of nodes and edges. Graphs and their analysis are mainly based on the principle of pairwise relationships. However, in many cases, more than two nodes should be linked together to capture the semantics of real biological phenomena, which can be handled using hypergraphs. In this project, we are developing a new computational framework to jointly learn disease–gene/protein–pathway–drug/metabolite relationships using hypergraph neural networks and previously constructed highly heterogeneous biomedical knowledge graphs of the CROssBAR system as the input data. Learned joint biomedical representations are then used for predicting new hyperedges on the knowledge graph, which are semantically analogous to defining a complex biological process with all its constituents. We hope that the approaches developed in this study for the prediction of high-order biological interactions will help researchers to better understand disease mechanisms and ultimately propose novel and effective treatment strategies.

B-251: Deciphering Sequence Determinants of Replication Origins using Deep Learning
Track: MLCSB
  • Marcell Veiner, Genome Data Science, Institute for Research in Biomedicine (IRB Barcelona), Barcelona, Spain, Spain
  • Fran Supek, Genome Data Science, Institute for Research in Biomedicine (IRB Barcelona), Barcelona, Spain, Spain


Presentation Overview: Show

Elucidating replication origins in the genome is essential for understanding the molecular mechanisms underlying DNA replication. Despite recent advances in experimental methodologies to pinpoint replication origins more accurately, their sequence determinants in metazoans remain elusive. Here, we aim to understand the DNA determinants of human replication origins using deep learning. We employed 5 novel architectures (including transformers) to predict origins of replication in the human and yeast (Saccharomyces cerevisiae) genomes from DNA sequence, and investigated the learned representations. We validated our approach by accurately predicting the locations of 520 autonomously replicating sequences (ARSs) in S. cerevisiae (97% AUC), and reconstructing the ARS consensus sequence and secondary motifs from the trained model. Next, we applied our pipeline to human replication origins, predicting OK-Seq-identified loci (57.25% AUC) and SNS-Seq-located sites (89.68% AUC), outperforming previous sequence-based methods. We further investigated the learned features to reveal DNA sequence determinants by artificially perturbing these sequences and observing a drop in accuracy. Overall, our deep learning models can successfully predict the loci with replication origins in the human and yeast genomes, and can uncover the DNA sequence determinants. This helps shed light on the molecular mechanisms of DNA replication initiation.

B-252: Multi-modal mutational signatures in ovarian cancer clarify mechanisms and predict patient survival
Track: MLCSB
  • Patricia Ferrer Torres, IRB Barcelona, Spain
  • Fran Supek, IRB Barcelona, Spain


Presentation Overview: Show

Mutational signatures with "featureless" profiles are difficult to accurately differentiate using existing methodologies. To overcome this, we tested a method based on non-negative matrix factorization (NMF) to extract multimodal signatures, obtained from more than one mutation type via joint analysis. We focused on SBS+ID multimodal signatures, composed of 179 channels (96 from SBS and 83 from ID subtypes), and tested them in four ovarian cancer cohorts. This method correctly separates two commonly confused signatures: SBS3, caused by homologous recombination (HR) deficiency, and SBS8, likely due to nucleotide excision repair (NER) failure. Additionally, we identified two distinct SBS3 signatures in ovarian cohorts. First, SBS3a, accompanied by ID signature ID6, is likely caused by HR deficiency. Second, SBS3b, found with other ID signatures but not ID6, is likely not related to HR deficiency, suggesting that another process independent of HR deficiency may cause SBS3-like signatures. Moreover, we studied the association of signatures with survival in ovarian cancer using Cox regression. SBS3a (caused by HR deficiency) correlated with better survival, whereas neither SBS3b nor SBS8 did, in line with previous reports. The correct separation of the signatures using the multimodal method allows establishing them as reliable predictors of ovarian cancer survival.
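
The joint SBS+ID factorization described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it concatenates a 96-channel SBS matrix and an 83-channel ID matrix into 179 channels and factorizes the result with plain multiplicative-update NMF. The random count matrices, the rank of 4 and all function and variable names are assumptions made for the example.

```python
import numpy as np

def nmf(V, rank, n_iter=200, seed=0, eps=1e-10):
    """Multiplicative-update NMF minimizing ||V - W H||_F with W, H >= 0."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Synthetic stand-in data: 40 tumours, mutation counts per channel.
rng = np.random.default_rng(1)
n_samples = 40
sbs_counts = rng.poisson(5.0, size=(n_samples, 96)).astype(float)
id_counts = rng.poisson(2.0, size=(n_samples, 83)).astype(float)

# Joint analysis: each multimodal signature spans all 179 channels.
V = np.hstack([sbs_counts, id_counts])
exposures, signatures = nmf(V, rank=4)
print(V.shape, exposures.shape, signatures.shape)  # (40, 179) (40, 4) (4, 179)
```

Because each row of `signatures` covers both mutation types, an SBS profile and the ID profile that co-occurs with it are learned together, which is the property the abstract relies on to separate SBS3a from SBS3b.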

B-253: Accurate drug response prediction in cancer by deep generative neural network and embedding-based methods
Track: MLCSB
  • Peilin Jia, University of Texas Health Science Center at Houston, United States
  • Ruifeng Hu, Harvard University, United States
  • Zhongming Zhao, University of Texas Health Science Center at Houston, United States


Presentation Overview: Show

Drug response varies greatly in cancer patients due to inter- and intra-tumor heterogeneity. Transcriptome context is critical in treatment outcome. We first developed a deep variational autoencoder (VAE) model to compress thousands of genes into latent vectors in a low-dimensional space. These encoded vectors could accurately impute cancer drug response, outperform standard signature-gene based approaches, and appropriately control the overfitting problem. We applied rigorous quality assessment and validation, including assessing the impact of cell line lineage, cross-validation, and cross-panel evaluation, to ensure the accuracy of the imputed drug response in both cell lines and cancer samples. Our novel measure, the expression-regulated component (EReX) of the observed drug response, achieved high correlation across panels. Using the well-trained models, we imputed drug response for The Cancer Genome Atlas (TCGA) data and investigated various features and signatures associated with the imputed drug response. Furthermore, we benchmarked our VAEN model and four other embedding-based methods, as well as the ensemble of these algorithms, for accurate and transferable prediction of drug response. The extensive evaluation results using cross-panels, cross-datasets, and target genes were implemented in a user-friendly online server, DrVAEN, which has broad use in cancer research, model evaluation, and drug development.

B-254: Machine Learning GWAS Platform, VariantSpark, Demonstrates Increased Power to Identify Genetic Variants Associated with Coronary Artery Disease
Track: MLCSB
  • Letitia Sng, Commonwealth Scientific and Industrial Research Organisation, Australia
  • Mitchell O'Brien, Commonwealth Scientific and Industrial Research Organisation, Australia
  • Piotr Szul, Commonwealth Scientific and Industrial Research Organisation, Australia
  • Roc Reguant, Commonwealth Scientific and Industrial Research Organisation, Australia
  • Angus Panagopoulos, Commonwealth Scientific and Industrial Research Organisation, Australia
  • Johan Verjans, South Australia Health and Medical Research Centre, Australia
  • Denis Bauer, Commonwealth Scientific and Industrial Research Organisation, Australia
  • Natalie Twine, Commonwealth Scientific and Industrial Research Organisation, Australia


Presentation Overview: Show

Coronary artery disease (CAD) is the leading cause of death and morbidity globally. Studies have shown a strong genetic component in CAD aetiology, with the latest meta-analysis of over a million samples identifying 279 genome-wide significant associations. However, such studies have focussed on additive genetic associations, foregoing potential non-linear associations. Using a machine learning GWAS platform, VariantSpark, we uncovered 25 loci associated with CAD, of which 16 are known loci including LPA, CDKN2B-AS1, and PMAIP1-MC4R. The more recently identified PMAIP1-MC4R was discovered in a meta-analysis of about 185,000 samples, while VariantSpark identified this locus with 72% fewer samples from the UK Biobank alone (~52,000). This increased detection power can be attributed to VariantSpark’s capability to account for both additive and non-linear epistatic effects. Indeed, we found significant epistatic interactions between LPA and CDKN2B-AS1 in the UK Biobank cohort, which we successfully replicated in the independent TOPMed cohort of ~11,000 samples, while a regression approach only validated the CDKN2B-AS1 locus. In conclusion, our findings show that VariantSpark has increased power to detect genetic variants associated with complex diseases.

B-255: Deep Transformer-based Prediction of Protein-Protein Interactions
Track: MLCSB
  • Jovana Aleksic, Weill Cornell Medicine Qatar, Qatar
  • Miguel Garcia-Remesal, Universidad Politécnica de Madrid, Spain
  • Joel Malek, Weill Cornell Medicine Qatar, Qatar
  • Stephanie Ramadan, Weill Cornell Medicine Qatar, Qatar
  • Yue Guan, Weill Cornell Medicine Qatar, Qatar


Presentation Overview: Show

Protein-protein interactions (PPIs) provide valuable insights into many biological processes. Modern methods for the experimental identification of PPIs are expensive and labor-intensive. Several deep learning approaches have been developed to predict PPIs in recent years. However, these do not provide predictions beyond binary interaction labels. We use an ensemble of transformers applied to primary amino acid sequences, together with AlphaFold's predictions of complex structures, to predict PPIs. We view this as a regression problem aimed at predicting the strength of the interaction and the interacting region. This is still an ongoing research effort. Currently, our model utilizes attention mechanisms to predict the interaction strength and combines them with AlphaFold’s prediction of protein complexes to determine the interacting region. The dataset used in this study was generated at Weill Cornell Medicine-Qatar using a recently developed all-vs-all (AVA-seq) sequencing approach to determine PPIs. Predicting the strength of the interaction and the interacting region provides information that can help researchers better understand protein function. In addition, this method can predict weaker, transient interactions that are difficult to detect experimentally.

B-256: Leveraging machine learning and combinatorial signaling motif libraries for engineering CAR T cells.
Track: MLCSB
  • Sara Capponi, Almaden Research Center IBM, United States


Presentation Overview: Show

During the past decade the use of machine learning (ML) approaches has been growing rapidly. ML models are deeply rooted in statistical methods, and their application represents a data-driven strategy to solve complex and often mathematically intractable scientific problems. The success of ML approaches is due to their ability to learn patterns and complex interactions inherently hidden in the data. Here I will discuss how we applied machine learning in combination with combinatorial signaling motif libraries to efficiently guide the engineering of receptors with a desired phenotype. Chimeric antigen receptor (CAR) costimulatory domains are constructed from native immune receptors and govern the phenotypic output of therapeutic T cells. In our work, we built a combinatorial library of 13 signaling motifs that can be placed in 3 different positions in a CAR. The resulting combinatorial library of CAR T cells contains ~2300 elements, and each CAR promotes a different T cell phenotype. Our ML model, trained on limited experimental data, enabled us to unveil the different effects of signaling motifs, motif combinations, and motif positions on CAR T cell phenotype. Our work demonstrates that ML approaches can be used jointly with more traditional design methods to engineer cells with desired functions.
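
The abstract does not spell out how the library size of ~2300 follows from 13 motifs and 3 positions. The sketch below counts two readings, both of which are assumptions: a library where all three positions are always filled, and a library where a receptor carries one, two or three motifs in order. The second reading gives a number of the order quoted above.

```python
from itertools import product

N_MOTIFS = 13
N_POSITIONS = 3

# Reading 1 (assumption): every position is filled with one of the 13 motifs.
full = sum(1 for _ in product(range(N_MOTIFS), repeat=N_POSITIONS))

# Reading 2 (assumption): receptors carry 1, 2 or 3 motifs in order.
variable_length = sum(N_MOTIFS ** k for k in range(1, N_POSITIONS + 1))

print(full)             # 2197
print(variable_length)  # 2379
```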

B-257: Charting spatial ligand-target activity using Renoir
Track: MLCSB
  • Narein Rao, Indian Institute of Technology Kanpur, India
  • Rhea Pai, Harry Perkins Institute of Medical Research, Nedlands, Perth, Western Australia, Australia
  • Archita Mishra, Telethon Kids Institute, University of Western Australia, Perth, Australia, Australia
  • Florent Ginhoux, Gustave Roussy Cancer Campus, Villejuif, France, France
  • Jerry Chan, KK Research Center, KK Women's and Children's Hospital, Singapore, Singapore
  • Ankur Sharma, Harry Perkins Institute of Medical Research, Nedlands, Perth, Western Australia, Australia
  • Hamim Zafar, Indian Institute of Technology Kanpur, India


Presentation Overview: Show

The advancement of single-cell RNA sequencing (scRNA-seq) and spatial transcriptomics has made it possible to infer interactions amongst heterogeneous cells and their surrounding cellular environments. Existing methods assist in the analysis of ligand-receptor interactions by either adding spatial information to the currently available scRNA-seq data or utilizing spot-level or high-resolution spatial transcriptomics data. However, to date, there is a lack of methods capable of mapping ligand-target interactions across a spatial topology with specific cell type composition, which could shed further light on the niche-specific relationship between ligands and their downstream targets. Here we present Renoir for charting ligand-target activities across a spatial topology and delineating spatial communication niches harboring specific ligand-target activities and cell type composition. Renoir can also spatially map pathway-level aggregate activity of ligand-target gene sets and identify domain-specific activities between ligands and targets. We applied Renoir to three spatial datasets ranging from development to disease to demonstrate its effectiveness in inferring cellular niches with distinct ligand-target interactions, spatially mapping hallmark pathway activities, ranking ligand activity across spatial niches, and visualizing overall cell type-specific ligand-target interactions in spatial niches.

B-258: SPLAM: an accurate deep-learning-based splice site predictor to clean up spurious spliced alignments
Track: MLCSB
  • Kuan-Hao Chao, Johns Hopkins University, United States
  • Steven L. Salzberg, Johns Hopkins University, United States
  • Mihaela Pertea, Johns Hopkins University, United States


Presentation Overview: Show

We introduce SPLAM, a novel deep residual convolutional neural network framework designed for recognizing splice junctions. Our model training incorporates three key innovations. First, we generated a high-quality junction dataset that includes positives from the MANE database and from protein-coding genes in RefSeq. Negatives include random GT-AG dimer pairs and junctions supported by 1 alignment on the opposite strand of protein-coding gene loci identified in 17,382 GTEx samples. Second, we use a window size of 400 bp around each donor and acceptor site, totaling 800 bp, in contrast to the 10,000 bp flanking sequences used by SpliceAI, the best previously published splice site predictor. Third, we train donor and acceptor pairs together, thereby modeling the splice junction rather than the separate splice sites on either end. We compare SPLAM and SpliceAI and show that SPLAM is superior at recognizing alternative splice junctions. Additionally, we describe using SPLAM to post-process the results of spliced alignment programs, and demonstrate significant improvements in intron matching and transcriptome assembly for both poly-A selected and ribo-depleted RNA-Seq libraries. In summary, SPLAM offers a faster and more accurate way of detecting splice junctions, while also providing an efficient solution for correcting errors in the output of spliced alignment programs.
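
The paired-window input described above (400 bp per splice site, 800 bp per junction) might be assembled along the following lines. This is a hedged sketch only: centering each window on the site, 0-based coordinates, padding with `N` at sequence ends, and the function names `site_window` and `junction_input` are all assumptions, and SPLAM's actual preprocessing may differ.

```python
def site_window(seq, pos, size=400, pad="N"):
    """Return `size` bases centered on `pos`, padded where the window leaves the sequence."""
    half = size // 2
    start, end = pos - half, pos + half
    left_pad = pad * max(0, -start)
    right_pad = pad * max(0, end - len(seq))
    return left_pad + seq[max(0, start):min(len(seq), end)] + right_pad

def junction_input(seq, donor_pos, acceptor_pos, size=400):
    """Concatenate the donor and acceptor windows so the pair is modelled together."""
    return site_window(seq, donor_pos, size) + site_window(seq, acceptor_pos, size)

# Toy 1,000 bp sequence with hypothetical donor and acceptor coordinates.
seq = "ACGT" * 250
x = junction_input(seq, donor_pos=100, acceptor_pos=950)
print(len(x))  # 800
```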

B-259: The novel 2nd generation epigenetic skin aging clock
Track: MLCSB
  • Agata Bienkowska, Beiersdorf AG, Research and Development, Germany/ Institute for Bioinformatics, University Medicine Greifswald, Germany, Germany
  • Günter Raddatz, Division of Epigenetics, DKFZ-ZMBH Alliance, German Cancer Research Center, Heidelberg, Germany, Germany
  • Jörn Söhle, Beiersdorf AG, Research and Development, Hamburg, Germany, Germany
  • Boris Kristof, Beiersdorf AG, Research and Development, Hamburg, Germany, Germany
  • Henry Völzke, Institute for Community Medicine, University Medicine Greifswald, Greifswald, Germany, Germany
  • Stefan Gallinat, Beiersdorf AG, Research and Development, Hamburg, Germany, Germany
  • Lars Kaderali, Institute for Bioinformatics, University Medicine Greifswald, Greifswald, Germany, Germany
  • Marc Winnefeld, Beiersdorf AG, Research and Development, Hamburg, Germany, Germany
  • Elke Grönniger, Beiersdorf AG, Research and Development, Hamburg, Germany, Germany
  • Frank Lyko, Division of Epigenetics, DKFZ-ZMBH Alliance, German Cancer Research Center, Heidelberg, Germany, Germany
  • Cassandra Falckenhayn, Beiersdorf AG, Research and Development, Hamburg, Germany, Germany


Presentation Overview: Show

Epigenetic changes, such as DNA methylation (DNAm), are increasingly recognized as important biomarkers of aging. Despite the significant progress made in developing DNAm epigenetic age clocks in recent years, a major limitation of many existing clocks is that they are not directly linked to mechanistic aspects of aging. We studied skin due to its direct exposure to the environment, which makes it a useful model for analyzing aging mechanisms. In this study, we present the 2nd generation epigenetic skin aging clock, which is a novel approach to predicting an individual's skin age based on changes in DNAm patterns specific to skin aging. Using the Infinium methylation EPIC array technology and regularized generalized linear regression algorithm, we developed the 2nd generation clock using over 370 skin samples spanning a wide range of ages. Our results demonstrate that this clock accurately predicts skin aging in independent datasets. In addition, we present a practical application of epigenetic clocks with the aim of developing anti-aging interventions. We highlight the potential of epigenetic markers as valuable biomarkers of skin aging that could transform the field of anti-aging interventions. Our novel skin-specific clock could improve our understanding of the molecular mechanisms underlying skin aging.

B-260: Identifying tumor-specific molecular dependencies using Bayesian spike-and-slab regression
Track: MLCSB
  • Hanwen Xing, University of Oxford, United Kingdom
  • Christopher Yau, University of Oxford, United Kingdom


Presentation Overview: Show

The identification of tumor-specific molecular dependencies is essential for the development of effective cancer therapies. Genetic and chemical perturbations are powerful tools for discovering these dependencies. Even though chemical perturbations can be applied to primary cancer samples at large scale, the interpretation of experimental outcomes is often complicated by the fact that one chemical compound can affect multiple proteins. To overcome this challenge, Batzilla et al. proposed DepInfeR, a regularized multi-response regression model designed to identify and estimate specific molecular dependencies of individual cancers from their ex vivo drug sensitivity profiles. Inspired by their work, we propose two Bayesian extensions to DepInfeR. Our proposed approaches offer several advantages over DepInfeR, including the ability to handle missing values in both protein-drug affinity and drug sensitivity profiles without the need for data pre-processing steps such as imputation. Moreover, our approach provides probabilistic statements about whether a protein in the protein-drug affinity profiles is informative for the drug sensitivity profiles. We further extend the DepInfeR model using Gaussian processes, allowing our approach to adapt to more complex molecular dependency structures. Simulation studies demonstrate that our proposed approaches achieve better prediction accuracy and are able to identify new dependency structures.

B-261: GeNNius: An ultrafast drug-target interaction inference method based on graph neural networks
Track: MLCSB
  • Uxía Veleiro, Center for Applied Medical Research (CIMA), University of Navarra, Spain
  • Jesus De la Fuente Cedeño, TECNUN, University of Navarra, Spain
  • Guillermo Serrano, TECNUN, University of Navarra, Spain
  • Marija Pizurica, Stanford Center for Biomedical Informatics Research, Stanford University, United States
  • Antonio Pineda-Lucena, Center for Applied Medical Research (CIMA) University of Navarra, Spain
  • Silve Vicent, Center for Applied Medical Research (CIMA) University of Navarra, Spain
  • Idoia Ochoa, TECNUN, University of Navarra, Spain
  • Olivier Gevaert, Stanford University, United States
  • Mikel Hernaez, Center for Applied Medical Research (CIMA) University of Navarra, Spain


Presentation Overview: Show

Drug-target interaction (DTI) prediction is a relevant but challenging task in the drug repurposing field. In-silico approaches have drawn particular attention as they can reduce the costs and time commitment associated with traditional methodologies. Yet, current state-of-the-art methods present several limitations: existing DTI prediction approaches are computationally expensive, thereby hindering the ability to use large networks and exploit available datasets; and the generalization of DTI prediction methods to unseen datasets remains unexplored, although studying it could improve the development of DTI inference approaches in terms of accuracy and robustness. In this work, we introduce GeNNius (Graph Embedding Neural Network Interaction Uncovering System), a Graph Neural Network-based method that outperforms state-of-the-art models in terms of both accuracy and time efficiency. Next, we assessed the generalization capability of our model by training and testing on different datasets, showing that the presented methodology potentially improves the DTI prediction task. Furthermore, we demonstrated its power to uncover new interactions by evaluating previously unknown DTIs for each dataset. Finally, we qualitatively investigated the embeddings generated by GeNNius, revealing that the GNN encoder maintains biological information after the graph convolutions while diffusing this information through nodes, eventually distinguishing protein families in the node embeddings.

B-262: Monte-carlo Thompson sampling for antibody design
Track: MLCSB
  • Kakuzaki Taro, Chugai pharmaceuticals, Co., Ltd., Japan
  • Hikaru Koga, Research Division, Chugai Pharmaceutical Co., Ltd., Japan
  • Shuuki Takizawa, Research Division, Chugai Pharmaceutical Co., Ltd., Japan
  • Shoichi Metsugi, Research Division, Chugai Pharmaceutical Co., Ltd., Japan
  • Reiji Teramoto, Research Division, Chugai Pharmaceutical Co., Ltd., Japan
  • Zenjiro Sampei, Research Division, Chugai Pharmaceutical Co., Ltd., Japan
  • Hirotake Shiraiwa, Research Division, Chugai Pharmaceutical Co., Ltd., Japan
  • Kenji Yoshida, Research Division, Chugai Pharmaceutical Co., Ltd., Japan
  • Hiroyuki Tsunoda, Research Division, Chugai Pharmaceutical Co., Ltd., Japan


Presentation Overview: Show

Antibodies represent a key therapeutic modality for diverse diseases. In order to enhance the antigen-binding affinity and stability of the primary antibody, we thoroughly investigate its sequence space by making extensive amino acid substitutions. Yet, due to the combinatorial explosion, it is impractical to experimentally examine all feasible mutation combinations. The literature reveals that machine-learning guided protein engineering techniques, such as Thompson sampling (TS), can efficiently explore sequence space. Nevertheless, TS often leads to excessive exploration when the initial data is biased towards the vicinity of the lead antibody, particularly when handling a vast virtual library that comprises numerous mutations. To solve this issue, we propose Monte-Carlo Thompson sampling (MTS), which balances the exploration-exploitation tradeoff by constructing the posterior distribution using the Monte-Carlo method. We evaluated the MTS method for engineering pH-dependent antigen binding of a lead antibody for the neutralization of antigen X as a model case. Our findings indicate that MTS significantly outperformed TS in identifying desirable candidates. Therefore, MTS can be a potent approach to efficiently uncover antibodies with desired traits, particularly when the number of rounds is restricted.
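
For readers unfamiliar with the baseline, standard Thompson sampling can be sketched on a toy Beta-Bernoulli bandit. This illustrates plain TS only, not the proposed MTS method, whose posterior construction is not described in enough detail here to reproduce. The arms standing in for candidate variants, their success probabilities and the function name are hypothetical.

```python
import random

def thompson_sampling(success_probs, n_rounds=2000, seed=0):
    """Beta-Bernoulli Thompson sampling; returns how often each arm was chosen."""
    rng = random.Random(seed)
    n_arms = len(success_probs)
    wins = [1] * n_arms    # Beta(1, 1) prior: alpha parameter per arm
    losses = [1] * n_arms  # Beta(1, 1) prior: beta parameter per arm
    pulls = [0] * n_arms
    for _ in range(n_rounds):
        # Draw one plausible success rate per arm from its current posterior.
        samples = [rng.betavariate(wins[a], losses[a]) for a in range(n_arms)]
        arm = max(range(n_arms), key=samples.__getitem__)
        # Simulated experiment: success with the arm's true (hidden) probability.
        if rng.random() < success_probs[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
        pulls[arm] += 1
    return pulls

# Three toy "variants" with hidden success rates; the last one is the best.
pulls = thompson_sampling([0.2, 0.5, 0.8])
print(pulls)
```

Sampling from the posterior rather than taking its mean is what drives exploration: arms with few observations have wide posteriors and are occasionally chosen even when their current estimate is low.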

B-263: Paired single-cell multi-omics data integration with Mowgli
Track: MLCSB
  • Geert-Jan Huizing, Institut Pasteur, Université Paris Cité, CNRS UMR 3738, Machine Learning for Integrative Genomics Group, France
  • Ina Maria Deutschmann, Institut de Biologie de l’Ecole Normale Supérieure, CNRS, INSERM, Ecole Normale Supérieure, Université PSL, France
  • Gabriel Peyre, CNRS and Département de mathématiques et applications de l’Ecole Normale Supérieure, Université PSL, France
  • Laura Cantini, Institut Pasteur, Université Paris Cité, CNRS UMR 3738, Machine Learning for Integrative Genomics Group, France


Presentation Overview: Show

The profiling of multiple molecular layers from the same set of cells has recently become possible. There is thus a growing need for multi-view learning methods able to jointly analyze such data. We here present Multi-Omics Wasserstein inteGrative anaLysIs (Mowgli), a novel method for the integration of paired multi-omics data with any type and number of omics. Of note, Mowgli combines integrative Nonnegative Matrix Factorization (NMF) and Optimal Transport (OT), enhancing at the same time the clustering performance and interpretability of integrative NMF. We apply Mowgli to multiple paired single-cell multi-omics datasets profiled with 10X Multiome, CITE-seq, and TEA-seq. Our in-depth benchmark demonstrates that Mowgli’s performance is competitive with the state-of-the-art in cell clustering and superior to the state-of-the-art when considering biological interpretability. Mowgli is implemented as a Python package seamlessly integrated within the scverse ecosystem and it is available at http://github.com/cantinilab/mowgli.

B-264: Prediction of virus-host association using protein language models and multiple instance learning
Track: MLCSB
  • Dan Liu, University of Glasgow, United Kingdom
  • Francesca Young, University of Glasgow, United Kingdom
  • David Robertson, University of Glasgow, United Kingdom
  • Ke Yuan, University of Glasgow, United Kingdom


Presentation Overview: Show

Predicting virus-host associations is essential to determine the specific host species that viruses interact with, and to discover whether new viruses infect humans and animals. Currently, the hosts of the majority of viruses are unknown, particularly in microbiomes. To address this challenge, we introduce EvoMIL, a deep learning method that predicts the host species for viruses from viral sequences only. It also identifies important viral proteins that significantly contribute to host prediction. The method combines a pre-trained large protein language model (ESM) and attention-based multiple instance learning to allow protein-orientated predictions. Our results show that protein embeddings capture stronger predictive signals than sequence composition features, including amino acids, physicochemical properties, and DNA k-mers. In multi-host prediction tasks, EvoMIL achieves median F1 score improvements of 8.6%, 12.3%, and 4.1% in prokaryotic hosts, and 0.5%, 1.8% and 3% in eukaryotic hosts. EvoMIL binary classifiers achieve AUC values of over 0.95 for all prokaryotic hosts and of roughly 0.8 to 0.9 for eukaryotic hosts. Furthermore, EvoMIL estimates the importance of single proteins in the prediction task and maps them to an embedding landscape of all viral proteins, where proteins with similar functions are distinctly clustered together, highlighting EvoMIL's ability to capture key proteins in virus-host specificity.
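
The attention-based multiple instance learning step can be illustrated with a minimal pooling sketch, in which a virus is a "bag" of per-protein embeddings and the attention weights indicate how much each protein contributes. It uses a common formulation (softmax over `w · tanh(V h_k)`); whether EvoMIL uses exactly this form is an assumption, and the random matrices below stand in for ESM embeddings and trained parameters.

```python
import numpy as np

def attention_mil_pool(H, V, w):
    """Pool instance embeddings H (n_instances x d) into one bag vector.

    V (hidden x d) and w (hidden,) are the attention parameters.
    Returns the bag embedding and the per-instance attention weights.
    """
    scores = np.tanh(H @ V.T) @ w          # one score per instance
    scores = scores - scores.max()         # stabilize the softmax
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ H, weights

# Toy bag: 5 proteins with 8-dimensional embeddings (placeholders for ESM output).
rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8))
V = rng.normal(size=(4, 8))  # untrained placeholder parameters
w = rng.normal(size=4)

bag, weights = attention_mil_pool(H, V, w)
print(bag.shape, weights.round(3))
```

In a trained model the bag vector would feed a classifier over host labels, and the weights would provide the per-protein importance scores mentioned in the abstract.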

B-265: Interspecies analysis to dissect cellular transcriptomic signatures of humans and hamsters in COVID-19.
Track: MLCSB
  • Thomas Hoefler, Freie Universität Berlin, Institute of Virology, Berlin, Germany
  • Geraldine Nouailles, Charité – Universitätsmedizin Berlin, Department of Infectious Diseases and Respiratory Medicine, Berlin, Germany
  • Holger Kirsten, University of Leipzig, Institute for Medical Informatics, Statistics, and Epidemiology, Leipzig, Germany
  • Martin Witzenrath, Charité – Department of Infectious Diseases and Respiratory Medicine, Berlin, Germany; DZL, Berlin, Germany
  • Markus Scholz, University of Leipzig, Institute for Medical Informatics, Statistics, and Epidemiology, Leipzig, Germany
  • Jakob Trimpert, Freie Universität Berlin, Institute of Virology, Berlin, Germany
  • Christine Goffinet, Charité – Universitätsmedizin Berlin, Institute of Virology, Berlin, Germany
  • Markus Landthaler, MDC, BIMSB, Berlin, Germany; Humboldt-Universität zu Berlin, Institute for Biology, Berlin, Germany
  • Cengiz Goekeri, Charité – Department of Infectious Diseases, Berlin, Germany; Cyprus International University, Nicosia, Cyprus
  • Vincent David Friedrich, University of Leipzig, IMISE, Leipzig, Germany; ScaDS.AI, Leipzig, Germany
  • Fabian Pott, Charité – Universitätsmedizin Berlin, Institute of Virology, Berlin, Germany
  • Julia Kazmierski, Charité – Universitätsmedizin Berlin, Institute of Virology, Berlin, Germany
  • Luiz Gustavo Teixeira Alves, MDC, Berlin Institute for Medical Systems Biology (BIMSB), Berlin, Germany
  • Sandro Andreotti, Freie Universität Berlin, Bioinformatics Solution Center, Berlin, Germany
  • Dylan Postmus, Charité – Universitätsmedizin Berlin, Institute of Virology, Berlin, Germany
  • Julia Adler, Freie Universität Berlin, Institute of Virology, Berlin, Germany
  • Emanuel Wyler, MDC, Berlin Institute for Medical Systems Biology (BIMSB), Berlin, Germany
  • Peter Pennitz, Charité – Universitätsmedizin Berlin, Department of Infectious Diseases and Respiratory Medicine, Berlin, Germany


The COVID-19 pandemic has led to an urgent demand for appropriate models of host-pathogen interactions. Characterizing disease severity-dependent immune responses is crucial. Hamster species develop either a moderate (Syrian hamster) or severe (Roborovski dwarf hamster) disease course following SARS-CoV-2 infection and are consequently considered particularly valuable. In addition to standard single-cell analysis, we apply deep learning to match disease states between hamsters and humans based on scRNA-seq of fresh whole blood in SARS-CoV-2 infection. Publicly available COVID-19 data is used for Syrian hamsters and humans, while newly generated data is used for Roborovski dwarf hamsters.
We employ an autoencoder framework to learn a joint low-dimensional embedding of hamster and human data. For each cell type of interest, we identify and apply a species shift vector in latent space to control for non-infection-related interspecies differences. Hamster disease states are then paired with their closest human COVID-19 severity counterparts using a similarity metric based on diffusion pseudotime.
Results show successful joint latent embedding of hamster and human data with preserved disease state separability. Disease state matching aligns with biological observations. The established workflow enables disease state matching in diverse tissues, species and diseases.
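The idea of a species shift vector in latent space can be illustrated with a small sketch. The abstract does not state how the vector is estimated, so the version below assumes the simplest option: the difference between the human and hamster latent centroids for one cell type. All function names and coordinates are hypothetical.

```python
def mean_vector(points):
    """Centroid of a list of equal-length latent vectors."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def species_shift(hamster_latent, human_latent):
    """Assumed shift: human centroid minus hamster centroid for one cell type."""
    mu_ham = mean_vector(hamster_latent)
    mu_hum = mean_vector(human_latent)
    return [b - a for a, b in zip(mu_ham, mu_hum)]

def apply_shift(points, shift):
    """Translate every latent vector by the shift vector."""
    return [[x + s for x, s in zip(p, shift)] for p in points]

# Made-up 2-dimensional latent coordinates for one cell type.
hamster = [[0.0, 1.0], [2.0, 3.0]]
human = [[5.0, 1.0], [7.0, 5.0]]
shift = species_shift(hamster, human)
shifted = apply_shift(hamster, shift)
```

After the shift, the hamster cells share a centroid with the human cells of the same type, so remaining differences are less dominated by species.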

B-266: Untangling the Knot: Machine Learning Uncovers Knotted Patterns in Protein Structures
Track: MLCSB
  • Denisa Šrámková, National Centre for Biomolecular Research, Faculty of Science, Masaryk University, Czechia
  • Agata Perlinska, Centre of New Technologies, University of Warsaw, Poland
  • Eva Klimentová, Central European Institute of Technology, Masaryk University, Czechia
  • Dawid Uchal, Centre of New Technologies, University of Warsaw, Poland
  • Marta Korpacz, Centre of New Technologies, University of Warsaw, Poland
  • Roksana Malinowska, Centre of New Technologies, University of Warsaw, Poland
  • Maciej Sikora, Centre of New Technologies, University of Warsaw, Poland
  • Mai Lan Nguyen, Centre of New Technologies, University of Warsaw, Poland
  • Paweł Rubach, Centre of New Technologies, University of Warsaw, Poland
  • Petr Šimeček, Central European Institute of Technology, Masaryk University, Czechia
  • Joanna Sulkowska, Centre of New Technologies, University of Warsaw, Poland


Proteins with knotted backbones are exceedingly rare, and the mechanisms governing knot formation and its functional implications remain poorly understood. We fine-tuned the ProtBert-BFD Transformer to classify proteins as either knotted or unknotted solely from their primary structure. As a training set, we used a collection of proteins from selected protein families whose 3D structures were predicted by AlphaFold2. The knotted status of proteins was assigned using Topoly (a polymer topology analysis tool). While the model exhibits high accuracy (98%) in predicting a protein's knot status, it does not directly provide a biological explanation or pinpoint which regions of the protein contribute to knot formation. To address this, we propose a patching technique: a sliding window (patch) replaces part of the sequence, thereby testing the importance of that part for knot formation. We tested this method on proteins from the SPOUT family and found that the most influential patches reside within the C-terminal portion of the knot core, which is also responsible for substrate binding.
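The patching technique can be sketched as follows. This is a minimal illustration under stated assumptions: the filler residue `X`, the window size, and `toy_predict` (a stand-in for the fine-tuned classifier that simply looks for a made-up motif) are hypothetical and not taken from the abstract.

```python
def patch_importance(sequence, predict, window=3, filler="X"):
    """Score each window by how much masking it lowers the prediction."""
    baseline = predict(sequence)
    drops = []
    for start in range(len(sequence) - window + 1):
        # Replace one window of residues with the filler symbol.
        patched = sequence[:start] + filler * window + sequence[start + window:]
        drops.append((start, baseline - predict(patched)))
    return drops

def toy_predict(seq):
    # Hypothetical classifier stand-in: the "knotted" score is 1.0 only
    # when the made-up motif "GKT" is present in the sequence.
    return 1.0 if "GKT" in seq else 0.0

drops = patch_importance("AAAAGKTAAAA", toy_predict)
influential = [start for start, drop in drops if drop > 0]
```

Windows whose masking causes the largest drop mark the regions the classifier relies on most.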

B-267: Capturing Protein Dynamics and Its Determinants Using Explainable Artificial Intelligence
Track: MLCSB
  • Faraneh Haddadi, Loschmidt Laboratories, Masaryk University, Brno, CZ; ICRC, St. Anne’s University Hospital, Brno, CZ, Czechia
  • Jiri Damborsky, Loschmidt Laboratories, Masaryk University, Brno, CZ; ICRC, St. Anne’s University Hospital, Brno, CZ, Czechia
  • Stanislav Mazurenko, Loschmidt Laboratories, Masaryk University, Brno, CZ; ICRC, St. Anne’s University Hospital, Brno, CZ, Czechia


Molecular dynamics is a simulation method for studying the motion and interactions of molecules. While it has proven instrumental in biocatalysis, it generates large amounts of data requiring manual analysis, prone to human bias and error. On the other hand, explainable artificial intelligence (XAI) provides powerful tools for extracting meaningful information automatically. This study aims to understand if XAI can capture the dynamical determinants of a selected group of luciferases. To this end, we employed advanced XAI methods to predict snapshots of trajectories 10 ns apart for enzymes RLuc8, AncFT, and AncHLDRluc [1], achieving root-mean-square deviations from the ground truth of 0.61-1.2 Å. We then used layer-wise relevance propagation to pinpoint protein parts most relevant for motion prediction. Experimental findings indicate that helices α4, α5', and L9 loop regions are critical for enzyme activity enhancement [1]. Their relevancies align with their computational B-factors. The relevance of the L9 loop also shows high correlations of 0.67-0.87 with experimental B-factors. Since our pipeline only requires molecular dynamics trajectories and secondary structure annotations of a protein of interest, we plan to test its generalizability on other biomolecular targets.
[1] Schenkmayerova et al. "Engineering the protein dynamics of an ancestral luciferase." Nature Communications 12.1 (2021): 3616.

B-268: Encoder-less generative modelling of transcriptomics data for improved representations
Track: MLCSB
  • Viktoria Schuster, University of Copenhagen, Denmark
  • Anders Krogh, University of Copenhagen, Denmark


The availability of single cell sequencing data has provided highly informative, but also complex transcriptomics data. The data is high-dimensional and noisy, which poses a unique challenge to inferring cellular behaviours and relationships. Generative modelling can help denoise the data and learn a lower-dimensional, more interpretable, representation. Even though resulting approaches have advanced our understanding of the cell as a system, they are still underpowered and thus inadequate for large projects.
From our perspective, limitations in previous generative approaches are rooted in the encoder. It requires much more data to infer the same number of features than the decoder. We have developed a highly data-efficient generative model, the Deep Generative Decoder, and applied it to bulk and single cell transcriptomics data. This approach can learn more informative representations and gives fewer false positives in differential expression analysis. It is a truly generative model with an implicit capability to predict missing data. It also enables us to model more complex latent distributions in a simple way, and model technical effects separately from desired representations. These advances have the potential to further our understanding of single cell data and to support the efficiency of downstream tasks.

B-269: Using Deep Learning to Uncover Species-agnostic Defining Features of Sarcoma
Track: MLCSB
  • Jonathan Rub, Weill Cornell, United States


Sarcomas are rare, deadly cancers that can arise in soft tissue. Accurate in vivo models would facilitate in-depth analysis of sarcoma progression and responses to therapy. Single-cell RNA sequencing (scRNA-seq) provides a way to capture the intratumoral heterogeneity of sarcomas in an unsupervised manner. I performed scRNA-seq on tumors from autochthonous genetically engineered mouse models (GEMM) that are histologically similar to undifferentiated pleomorphic sarcoma (UPS) and on patient-derived xenografts (PDX) of human UPS. Individual analysis of these two datasets identified sarcoma-specific cell types. However, current methods for cross-species analysis of scRNA-seq data are limited and do not perform well on cancer data. In this project I am developing an adversarial variational autoencoder (AVAE) to embed the GEMM and human PDXs of UPS in the same latent space. This method will allow robust downstream analysis that identifies conserved features of GEMM and human sarcomas. I will then apply this algorithm to chemotherapy-treated GEMM and PDX UPS. Understanding how human UPS responds to therapy and identifying treatment-resistant subpopulations may lead to improved therapeutic strategies. Successful completion of this algorithm will provide a better way to compare scRNA-seq data across species.

B-270: Prediction of cancer drug response using heterogenous graph convolutional networks
Track: MLCSB
  • David Earl Hostallero, McGill University, Canada
  • Yihui Li, McGill University, Canada
  • Amin Emad, McGill University, Canada


The prediction of drug response is a crucial task in precision medicine and drug development. In recent years, machine learning methods have shown great promise in tackling this problem. However, many of these methods do not fully utilize the patterns of similarities among drugs based on the response they induce in cancer cell lines (CCLs). In this recently published study, we developed BiG-DRP and BiG-DRP+, and showed that by incorporating drug similarities based on both CCL responses and chemical structures, they significantly improve the drug response prediction performance. Our deep learning models use a heterogeneous graph convolutional network that, in forming a drug’s embedding, leverages the information from highly sensitive and highly resistant CCLs as well as the drug’s structure. Our evaluation in multiple scenarios showed that incorporating this information in the form of a bipartite graph significantly improves performance over other machine learning models for drug response prediction. Furthermore, genes identified as contributing significantly to our predictions implicated important biological processes and signaling pathways. By utilizing our model to predict drug responses of more than 9000 TCGA tumors, we characterized important associations between mutations and drug sensitivity, highlighting the potential of our model in pharmacogenomics research.

B-271: Interpretable deep learning architectures for drug response prediction: insights and best practice
Track: MLCSB
  • Yihui Li, McGill University, Canada
  • David Earl Hostallero, McGill University, Canada
  • Amin Emad, McGill University, Canada


The importance of models that can provide biological insights, in addition to accurate predictions, has been highlighted by the biomedical community over the years. Recently, interpretable deep learning (DL) models that integrate signaling pathways have been suggested for drug response prediction. Although these models improve interpretability, it is still unclear whether this could lead to a compromise in prediction accuracy. Our study involved a comprehensive and systematic assessment of four state-of-the-art interpretable DL models using three distinct pathway collections to gauge the models' capacity to produce precise predictions on previously unseen samples from the same dataset and their generalizability to an independent dataset. We observed that explicitly incorporating pathway information in the form of a latent layer tends to perform worse than incorporating this information implicitly. However, in most cases, the best performance was achieved by a black-box multi-layer perceptron. We observed a decline in the performance of all the models when applied to an independent dataset. Most importantly, we observed that these models perform comparably when using randomly generated pathways instead of biological pathways. These findings emphasize the significance of conducting systematic evaluation of newly proposed models using carefully selected baselines, which we utilized in this study.

B-272: TINDL: A pipeline for Preclinical-to-clinical Anti-cancer Drug Response Prediction and Biomarker Identification
Track: MLCSB
  • David Earl Hostallero, McGill University, Canada
  • Lixuan Wei, Mayo Clinic, United States
  • Liewei Wang, Mayo Clinic, United States
  • Junmei Cairns, Mayo Clinic, United States
  • Amin Emad, McGill University, Canada


Clinical drug response (CDR) prediction and drug sensitivity biomarker identification are two major tasks in precision medicine. However, the lack of available CDR data poses a significant obstacle in the development of machine learning pipelines for these tasks. In our recently published study, we introduced TINDL, a deep learning pipeline that predicts the drug response of patient tumors. Although our model is trained solely on preclinical cancer cell lines, our proposed tissue-informed normalization minimizes the statistical discrepancies between cell lines and patient tumors by leveraging prior knowledge of the distribution of the tissue types. Our results showed that TINDL differentiates between resistant and sensitive tumors for 10 of 14 drugs, outperforming other models. Furthermore, TINDL features an explanation submodule that identifies genes whose expression contributes considerably to the model’s prediction, serving as novel biomarkers for drug response. We experimentally validated 10 genes identified by this submodule for tamoxifen through siRNA knockdown experiments and observed that most of these genes significantly influence tamoxifen sensitivity for MCF7 (10/10) and T47D (7/10). Moreover, genes identified as potential biomarkers for multiple drugs revealed shared mechanisms of action among drugs and implicated several important signaling pathways.

B-273: Individual modelling of chemotherapy-induced haematotoxicity with NARX neural networks: a knowledge transfer approach
Track: MLCSB
  • Marie Barbara Steinacker, University of Leipzig, IMISE, Leipzig, Germany; ScaDS.AI Dresden/Leipzig, Germany
  • Markus Scholz, University of Leipzig, IMISE, Leipzig, Germany
  • Yuri Kheifetz, University of Leipzig, IMISE, Leipzig, Germany


Cytotoxic cancer therapy often results in dose-limiting haematotoxic side effects. Predicting an individual’s risk is a major objective in precision medicine of cancer treatment. In this regard, patient heterogeneity presents a significant challenge. While several (semi-)mechanistic models of bone marrow haematopoiesis have been developed to solve this task, the established models could not sufficiently describe certain patients exhibiting chaotic dynamics. To address this challenge, we propose a data-driven machine learning approach, using recurrent neural networks based on non-linear autoregressive exogenous (NARX) models. We also propose a knowledge transfer approach to ameliorate the issue of sparse individual data, which typically hampers learning of individual networks. We demonstrate the feasibility of our approach based on a virtual patient population generated using a semi-mechanistic model of haematopoiesis and imposing different cytotoxic therapy scenarios on it. Employing different techniques of model optimisation, we derive robust and parsimonious individual networks with good generalisation performances. Results suggest that our transfer learning approach using NARX networks can provide robust predictions of individual patients’ response to treatment.
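A NARX model predicts the next output from lagged outputs and lagged exogenous inputs, y[t] = f(y[t-1..t-ny], u[t-1..t-nu]). The sketch below shows only how such training rows could be assembled from one patient's time series; the variable names, lag orders, and numbers are illustrative assumptions, and the network f itself is not shown.

```python
def narx_rows(y, u, ny, nu):
    """Build (features, target) pairs for y[t] = f(past y, past u)."""
    start = max(ny, nu)
    rows = []
    for t in range(start, len(y)):
        past_y = [y[t - i] for i in range(1, ny + 1)]  # autoregressive lags
        past_u = [u[t - i] for i in range(1, nu + 1)]  # exogenous lags
        rows.append((past_y + past_u, y[t]))
    return rows

# Made-up series: blood cell counts (output) and therapy doses (input).
cell_counts = [5.0, 4.0, 2.0, 3.0, 4.5]
doses = [0.0, 1.0, 0.0, 0.0, 0.0]
rows = narx_rows(cell_counts, doses, ny=2, nu=1)
```

Each row pairs a feature vector of past counts and doses with the count to be predicted, which is the supervised form a feed-forward or recurrent regressor would be trained on.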

B-274: MethylBERT: A transformer-based tumour-specific methylation pattern analysis model using read-level DNA methylomes
Track: MLCSB
  • Yunhee Jeong, Division of Cancer Epigenomics, German Cancer Research Center (DKFZ), Im Neuenheimer Feld 280, Heidelberg, Germany
  • Karl Rohr, Biomedical Computer Vision Group, BioQuant, IPMB, Heidelberg University, Im Neuenheimer Feld 267, Heidelberg, Germany
  • Pavlo Lutsik, Department of Oncology, Catholic University (KU) Leuven, Leuven, Belgium


DNA methylation (DNAm) at CpG sites is a key epigenetic mark that shows profound alterations in cancer. Due to its stability and cell-type specificity, it is often used to disentangle bulk tumour samples. Advances in sequencing technology, including higher throughput, accuracy, and cost-efficiency, have made sequencing-based methylation analysis methods, such as whole-genome bisulfite sequencing, increasingly popular for studying DNAm patterns. Although sequencing-based data provides broad genomic coverage and preserves rare cell-type signals, our previous benchmarking study argued that an accurate and robust computational tumour deconvolution model for read-level methylomes is still needed.

Here, we propose MethylBERT, a novel transformer-based read-level methylation pattern classification model. Although transformers have shown promising performance in methylation imputation, their application to tumour-specific methylation analysis has not been reported. We encoded read-level methylomes using BERT and successfully estimated tumour cell fractions within bulk samples, with superior accuracy compared to existing deconvolution methods. MethylBERT represents significant progress in read-level methylome analysis and will increase the precision of tumour deconvolution and circulating tumour DNA analysis.

B-275: Trees and Democracy in Precision Medicine
Track: MLCSB
  • Katyna Sada Del Real, Tecnun Universidad de Navarra, Spain
  • Enrique Montal De Grandes, Tecnun Universidad de Navarra, Spain
  • Paula Azqueta Tamés, Tecnun Universidad de Navarra, Spain
  • María Peña Merck, Tecnun Universidad de Navarra, Spain
  • Itziar Abadía Moreno, Tecnun Universidad de Navarra, Spain
  • Djuliana Imperio Domingo, Tecnun Universidad de Navarra, Spain
  • Angel Rubio, Tecnun Universidad de Navarra, Spain


Precision medicine is revolutionizing healthcare by providing individualized treatment plans for each patient based on their unique genetic makeup, medical history and lifestyle. This cutting-edge field uses data from multiple sources, including gene expression and genomic characteristics, to identify the optimal drug for each patient.
Applying machine learning algorithms to precision medicine is not straightforward because it requires handling the large and complex data sets involved in precision medicine and, more importantly, solving the assignment problem. In this study, we used a vote-sharing technique (each treatment is given multiple shares of a vote) to test three different tree-based algorithms: Optimal Decision Trees, Random Forest, and XGBoost.
While XGBoost showed superior performance in personalized treatments, it also has a higher computational cost, making it more difficult to train compared to other algorithms. The benefits of XGBoost's accuracy may outweigh the additional computational burden for critical healthcare decisions. On the other hand, optimal decision trees are the most interpretable. Random forests fall between the other algorithms in terms of interpretability and computational requirements. The choice of algorithm ultimately depends on specific needs and available resources.

B-276: Chainsaw: A supervised machine learning approach for protein domain prediction on AlphaFold structures
Track: MLCSB
  • Jude Wells, University College London, United Kingdom
  • Alex Hawkins-Hooker, University College London, United Kingdom
  • Christine Orengo, University College London, United Kingdom
  • Brooks Paige, University College London, United Kingdom
  • Nicola Bordin, University College London, United Kingdom


The release of 200 million predicted AlphaFold structures presents new opportunities and challenges for protein domain structure databases such as CATH. A key first step in organising this vast collection is identifying the individual domains within each predicted structure. To this end, we present Chainsaw, a novel supervised machine learning approach to domain boundary prediction that achieves close-to-human-level accuracy with rapid inference times and minimal dependencies. The approach uses a residual convolutional neural network to predict the probability that pairs of residues are in the same domain. This soft adjacency matrix is then used to derive domain assignments using a non-learned algorithm that is equivalent to maximising the likelihood of the assignments given the set of pairwise probabilities expressed in the soft adjacency matrix. We benchmark the method against recent state-of-the-art structure-based methods and show that Chainsaw outperforms them even when predicting novel domains which have not been observed during training. Our method achieves an NDO score of 0.87 vs 0.80 for the next closest method when predicting a non-redundant test set of multi-domain proteins.
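To make the likelihood view of the soft adjacency matrix concrete, the sketch below scores a candidate domain assignment: each residue pair contributes log p if the two residues share a domain and log(1 - p) otherwise. It only evaluates a given assignment and is not Chainsaw's assignment algorithm; the function name and the probability matrix are hypothetical.

```python
import math

def assignment_log_likelihood(probs, labels):
    """Log-likelihood of domain labels given pairwise same-domain probabilities.

    probs[i][j]: predicted probability that residues i and j share a domain.
    Off-diagonal values must lie strictly between 0 and 1 to avoid log(0).
    """
    total = 0.0
    n = len(labels)
    for i in range(n):
        for j in range(i + 1, n):
            p = probs[i][j]
            same = labels[i] == labels[j]
            total += math.log(p if same else 1.0 - p)
    return total

# Made-up soft adjacency matrix for four residues forming two blocks.
probs = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
two_domains = assignment_log_likelihood(probs, [0, 0, 1, 1])
one_domain = assignment_log_likelihood(probs, [0, 0, 0, 0])
```

On this toy matrix the two-domain labelling scores higher than placing all residues in one domain, which matches the block structure of the probabilities.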

B-277: DNA-Inception: predicting tumor-specific splicing sequences using convolutional neural networks
Track: MLCSB
  • Israa Alqassem, NEC Laboratories Europe, Germany
  • Filippo Grazioli, NEC Laboratories Europe, Germany
  • Anja Mösch, NEC Laboratories Europe, Germany
  • Trevor Clancy, NEC OncoImmunity AS, Norway


Aberrant alternative splicing (AS) plays a significant role in tumorigenesis and understanding the impact and mechanisms behind it can help in developing new diagnostic and therapeutic strategies. Here, we present DNA-Inception, the first deep learning-based method for classifying tumor- and tissue-specific AS events from their corresponding sequences. DNA-Inception is designed for interpretability and features a motif discovery algorithm that highlights over-represented motifs resulting from dysregulation of AS in tumor transcriptomes. Tumor-specific AS events and motifs may serve as potential biomarkers for cancer and can be leveraged in immunotherapy.

B-278: NeuronMotif: Deciphering cis-regulatory codes by layer-wise demixing of deep neural networks
Track: MLCSB
  • Zheng Wei, Tsinghua University, China
  • Kui Hua, Tsinghua University, China
  • Lei Wei, Tsinghua University, China
  • Shining Ma, Stanford University, United States
  • Rui Jiang, Tsinghua University, China
  • Xuegong Zhang, Tsinghua University, China
  • Yanda Li, Tsinghua University, China
  • Wing Hung Wong, Stanford University, United States
  • Xiaowo Wang, Tsinghua University, China


Discovering DNA regulatory sequence motifs and their relative positions is vital to understanding the mechanisms of gene expression regulation. Although deep convolutional neural networks (CNNs) have achieved great success in predicting cis-regulatory elements, the discovery of motifs and their combinatorial patterns from these CNN models has remained difficult. We show that the main difficulty is due to multifaceted neurons, which respond to multiple types of sequence patterns. Since existing interpretation methods were mainly designed to visualize the class of sequences that can activate the neuron, the resulting visualization will correspond to a mixture of patterns. Such a mixture is usually difficult to interpret without resolving the mixed patterns. We propose the NeuronMotif algorithm to interpret such neurons. Given any convolutional neuron (CN) in a CNN, NeuronMotif first generates a large sample of sequences capable of activating the CN, which typically consists of a mixture of patterns. Then, the sequences are “demixed” in a layer-wise manner by backward clustering of the feature maps of the involved convolutional layers. NeuronMotif can output the sequence motifs, and the syntax rules governing their combinations are depicted by position weight matrices organized in tree structures, which are supported by motif databases and omics data.

B-279: Using Statistical and Machine Learning Models to Understand the Determinants of Antibody Class Switch Recombination
Track: MLCSB
  • Lutecia Servius, King's College London, United Kingdom
  • Davide Pigoli, King's College London, United Kingdom
  • Joseph Ng, University College London, United Kingdom
  • Franca Fraternali, University College London, United Kingdom


Antibodies can change isotype to adapt their function via Class Switch Recombination (CSR). The mechanism is not well understood, but high-throughput sampling of antibody sequences offers an opportunity to build data-driven models to understand CSR determinants.
Here we apply statistical and machine learning methods such as logistic regression, random forest, and support vector machine to predict the occurrence of CSR in a published antibody repertoire dataset of donors challenged by COVID-19 and Respiratory Syncytial Virus. We establish a pipeline for repertoire representation to allow it to be fed as input to predictive models.
We observe a consistent, non-random unweighted average recall of CSR (mean 65%), suggesting that features in the antibody repertoire contain information contributing to CSR. Interestingly, we observe that clonal group diversity is a consistently strong indicator of CSR occurrence both before and during an immune challenge. We can apply these models to single-cell data, where sampling of clonal types is much sparser. Primarily, however, they allow us to understand the biological basis of CSR and its relation to the specificity and diversity of B cells. We also highlight the importance of tuning hyperparameters, e.g., tree depth in random forest, to obtain robust predictor performance.
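Unweighted average recall is the mean of the per-class recalls, so each class counts equally regardless of how many samples it has. A minimal sketch with made-up labels (1 for CSR, 0 for no CSR):

```python
def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recall; every class has the same weight."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        hits = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)

# Illustrative labels only, not data from the study.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]
uar = unweighted_average_recall(y_true, y_pred)  # (0.5 + 0.75) / 2
```

For two classes, a predictor that ignores the features has an expected value of 50% on this metric, which is the baseline against which a mean of 65% is read as non-random.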

B-280: Optimizing single-cell spatiotemporal delay variations to identify key features driving progression
Track: MLCSB
  • Komlan Atitey, National Institutes of Health (NIH), United States
  • Benedict Anchang, National Institute of Environmental Health Sciences, United States


The high-dimensional nature of single-cell data poses a challenge for data visualization, analysis, and interpretation. Developing interpretable, accurate, and robust models for single-cell spatiotemporal trajectory analysis therefore remains a significant challenge in the field. To this end, we propose a multi-state Markov-based computational framework, Time Order Structure Learning (TOSL), to quantify the spatiotemporal dependency of cells in the low-dimensional space resulting from any temporal data reduction or imaging analysis. TOSL is used to assess the performance of 6 data reduction methods (DRMs) applied to visualize three different dynamic biological processes: EMT, spermatogenesis, and stem cell reprogramming. TOSL shows that none of the 6 DRMs appears to significantly preserve the evolution dynamics of the various cell types. TOSL models the global delay time to achieve stationarity of cell state transitions, which can be used to identify therapeutic targets driving early deviations from normal development that result in diseases like cancer. In the future, we plan to use the TOSL framework to (1) identify genes and cells that trigger the stationarity of cell state transitions during development; (2) develop a better data reduction method that preserves global cellular spatiotemporal dynamics.

B-281: The covariance environment defines cellular niches for spatial inference
Track: MLCSB
  • Doron Haviv, Memorial Sloan Kettering Cancer Center, United States
  • Mohamed Gatie, Memorial Sloan Kettering Cancer Center, United States
  • Anna-Katerina Hadjantonakis, Memorial Sloan Kettering Cancer Center, United States
  • Tal Nawy, Memorial Sloan Kettering Cancer Center, United States
  • Dana Pe'er, Memorial Sloan Kettering Cancer Center, United States


Presentation Overview:

The tsunami of new multiplexed spatial profiling technologies has opened a range of computational challenges focused on leveraging these powerful data for biological discovery. A key challenge underlying computation is a suitable representation for features of cellular niches. Here, we develop the covariance environment (COVET), a representation that can capture the rich, continuous multivariate nature of cellular niches by capturing the gene-gene covariate structure across cells in the niche, which can reflect the cell-cell communication between them. We define a principled optimal transport-based distance metric between COVET niches and develop a computationally efficient approximation to this metric that can scale to millions of cells. Using COVET to encode spatial context, we develop environmental variational inference (ENVI), a conditional variational autoencoder that jointly embeds spatial and single-cell RNA-seq data into a latent space. Two distinct decoders either impute gene expression across the spatial modality or project spatial information onto dissociated single-cell data. We show that ENVI is not only superior in the imputation of gene expression but is also able to infer spatial context for dissociated single-cell genomics data.
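As a rough illustration of a gene-gene covariance representation of a niche, the sketch below takes the k spatially nearest cells and computes a plain sample covariance of their expression. This is an assumption-laden toy, not the COVET definition: the neighbour rule, the use of the niche-local mean, the function names, and the data are all hypothetical.

```python
import math


def k_nearest(coords, idx, k):
    """Indices of the k cells spatially closest to cell idx (itself included)."""
    order = sorted(range(len(coords)),
                   key=lambda j: math.dist(coords[idx], coords[j]))
    return order[:k]


def niche_covariance(expr, members):
    """Gene-gene sample covariance of expression across the niche's cells."""
    n_genes = len(expr[0])
    m = len(members)
    means = [sum(expr[c][g] for c in members) / m for g in range(n_genes)]
    return [[sum((expr[c][g] - means[g]) * (expr[c][h] - means[h])
                 for c in members) / (m - 1)
             for h in range(n_genes)]
            for g in range(n_genes)]


# Hypothetical data: 4 cells with 2D positions and 2 genes each.
coords = [(0, 0), (0, 1), (1, 0), (5, 5)]
expr = [[1, 2], [2, 4], [3, 6], [10, 0]]

niche = k_nearest(coords, 0, k=3)
cov = niche_covariance(expr, niche)
print(niche, cov)
```

Each cell then carries a small symmetric matrix rather than a single vector, which is why a dedicated distance between such matrices is needed.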

B-282: A generalizable framework to comprehensively predict epigenome, chromatin organization, and transcriptome
Track: MLCSB
  • Zhenhao Zhang, University of Michigan, United States
  • Fan Feng, University of Michigan, United States
  • Yiyang Qiu, University of Michigan, United States
  • Jie Liu, University of Michigan, United States
  • Yuanhao Huang, University of Michigan, United States


Presentation Overview:

Many deep learning approaches have been proposed to predict epigenetic profiles, chromatin organization, and transcription activity. While these approaches achieve satisfactory performance in predicting one modality from another, the learned representations are not generalizable across predictive tasks or across cell types. In this paper, we propose a deep learning approach named EPCOT which employs a pre-training and fine-tuning framework, and is able to accurately and comprehensively predict multiple modalities including epigenome, chromatin organization, transcriptome, and enhancer activity for new cell types, by only requiring cell-type specific chromatin accessibility profiles. Many of these predicted modalities, such as Micro-C and ChIA-PET, are quite expensive to get in practice, and the in silico prediction from EPCOT should be quite helpful. Furthermore, this pre-training and fine-tuning framework allows EPCOT to identify generic representations generalizable across different predictive tasks. Interpreting EPCOT models also provides biological insights including mapping between different genomic modalities, identifying TF sequence binding patterns, and analyzing cell-type specific TF impacts on enhancer activity.

B-283: Mechanistic Model Informed Machine Learning (M3L) for Rapid Patients Severity Classification
Track: MLCSB
  • Hsu Kiang Ooi, National Research Council Canada, Canada
  • Hang Hu, National Research Council Canada, Canada
  • Anguang Hu, Defence Research and Development Canada, Canada
  • Mohammad Sajjad Ghaemi, National Research Council Canada, Canada


Presentation Overview:

As the demand for healthcare rises, it is critical to triage patients' conditions rapidly, especially during emergency healthcare scenarios such as the COVID-19 pandemic. High-throughput single-cell transcriptomics accurately analyzes immune responses in blood samples but does not directly translate into disease severity prediction. We aim to rapidly predict patients' disease severity while mitigating the lack of training data by generating synthetic data of virtual patients (VPs) for machine learning (ML) training.

We leverage the biological insight of a mechanistic model fitted to the transcriptomics data to generate synthetic patients' disease outcomes, "testing" it by applying a data-driven machine learning method to classify disease severity. The synthetic data are used to train a supervised random forest (RF) classifier, which is then tested on transcriptomics data. The training data consist of 400 VPs with varying disease states.

The framework trained with synthetic data improves the performance of disease severity classification. The average performance of the model is above that of the baseline model and is in the acceptable range (AUC range of 0.7-0.80 and P-value < 0.05). This framework suggests that the mechanistic model can generate high-quality synthetic data for reasonably accurate ML prediction of disease severity.

B-284: A deep generative model for covariate dependent cell state dynamics
Track: MLCSB
  • Yasuhiro Kojima, Laboratory of Computational Life Science, National Cancer Center Research Institute, Japan
  • Haruka Hirose, Division of Systems Biology, Nagoya University Graduate School of Medicine, Japan
  • Teppei Shimamura, Department of Computational and Systems Biology, Medical Research Institute, Tokyo Medical and Dental University, Japan


Presentation Overview:

The recent accessibility of single-cell transcriptome analysis has enabled comparative observation across multiple experimental conditions, as well as multi-modal observation that integrates transcriptional profiles with other molecular layers such as chromatin accessibility. Although these advanced observations have enhanced the identification of condition-specific populations and of correlation structures across molecular layers, existing computational methodologies are not suited to dissecting the process that generates such populations or correlation structures, which may be regulated or perturbed by covariates. Here, we present a deep generative model of covariate-dependent cell state dynamics, CDYN, which realizes counterfactual estimation of cell state dynamics for varying covariates in a single cell state. After demonstrating the ability of CDYN to estimate differential cell state dynamics in a simulated dataset, we used it to reveal the complex relationships between gene expression dynamics and covariates in several real datasets. Applied to a squamous carcinoma dataset, it showed that a subpopulation of fibroblasts induced cell state dynamics toward invasive cancer cells. These results demonstrate the ability of CDYN to dissect the complex relationships between transcriptome dynamics and various covariates.

B-285: Deep learning methods for designing new-to-nature enzymes
Track: MLCSB
  • Sandra Castillo, VTT, Finland
  • Tuula Tenkanen, VTT, Finland
  • Harry Boer, VTT, Finland
  • Martina Blomster Andberg, VTT, Finland
  • Anna Borisova, VTT, Finland
  • Gopal Peddinti, VTT, Finland
  • Alejandro Revuelta, VTT, Finland
  • Paula Jouhten, Aalto University, Finland


Presentation Overview:

Generative deep learning models have been successfully used to generate new-to-nature functional enzymes. In this work, we aimed to generate novel enzymes with new-to-nature functions using a conditional variational autoencoder (cVAE). We designed the enzyme representation based on contact maps calculated using trRosetta, secondary structure predicted using PSIPRED, and a set of 7 selected physico-chemical amino acid features. The deep learning model contained a set of bidirectional recurrent neural networks (RNNs), and the decoder returned the sequence of the generated enzymes as a one-hot representation. The condition of the cVAE was a representation of the enzyme reaction based on fingerprints.

The model was trained with all the TIM-barrel enzymes available in UniProt (~1 million enzyme sequences with a total of 70 different functions), leaving out of training the class of enzymes whose function we aimed to create using our generative model. As the test case, we used the fructose 1,6-bisphosphate aldolase catalytic function. Most of the enzymes created by the generative model contained sequence signatures of the fructose 1,6-bisphosphate aldolase domain, and their predicted 3D structures superposed well with the structures of the natural enzymes.

B-286: Deep ensemble pre-training for target-gene expression prediction from landmark genes
Track: MLCSB
  • Da-Bin Lee, School of Computer Science & Engineering, Soongsil University, South Korea
  • Kyu-Baek Hwang, Soongsil University, South Korea


Presentation Overview:

Large-scale gene expression profiles are extensively utilized in biological and biomedical research. The L1000 assay is an efficient gene-expression profiling technique that infers the expression pattern of "target" genes by measuring the expression pattern of only 978 "landmark" genes. Deep neural networks (DNNs) have been shown to achieve state-of-the-art results in target gene expression prediction. We propose a novel approach that employs autoencoder pre-training and ensembling to improve the performance of DNNs. In our work, we applied autoencoder pre-training for both input reconstruction (landmark gene autoencoder) and output reconstruction (target gene autoencoder). Our experiments, which utilized the Gene Expression Omnibus (GEO) and Genotype-Tissue Expression (GTEx) datasets, demonstrated that autoencoder pre-training improved the mean absolute error (MAE) by approximately 1.1% and 0.4%, respectively, compared with training without pre-training. Additionally, our deep ensemble approach, which used 16 pre-trained DNNs, achieved an additional improvement of approximately 1.4% on the GEO dataset and 0.9% on the GTEx dataset. Our proposed method (MAE: 0.2834 in GEO, 0.4171 in GTEx) outperformed state-of-the-art DNNs (MAE: 0.2897 in GEO, 0.4214 in GTEx). These results support the effectiveness of employing autoencoder pre-training for both input and output, as well as DNN ensembles, to enhance the accuracy of target gene expression prediction.
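For readers who want to relate the reported MAE values to a percentage, the following minimal sketch computes MAE and a relative reduction in MAE. The four MAE values are taken from the abstract; treating "improvement" as a relative reduction against the state-of-the-art figure is an assumption, and the function names are illustrative.

```python
def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def relative_improvement(baseline, proposed):
    """Percentage reduction of an error metric relative to a baseline
    (assumed definition; the abstract does not state its formula)."""
    return 100.0 * (baseline - proposed) / baseline


# MAE values quoted in the abstract: state-of-the-art DNNs vs. proposed method.
geo = relative_improvement(0.2897, 0.2834)
gtex = relative_improvement(0.4214, 0.4171)
print(f"GEO: {geo:.2f}%  GTEx: {gtex:.2f}%")
```

Under this assumed definition the reductions come out at roughly 2.2% for GEO and 1.0% for GTEx; the stepwise percentages in the abstract may be measured against different baselines.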

B-287: A VAE-based model for predicting time-frequency of a single cell electrophysiological feature from its gene expression
Track: MLCSB
  • Kazuki Furumichi, Division of Systems Biology, Nagoya University Graduate School of Medicine, Japan
  • Yasuhiro Kojima, Laboratory of Computational Life Science, National Cancer Center Research Institute, Japan
  • Teppei Shimamura, Division of Systems Biology, Nagoya University Graduate School of Medicine, Japan


Presentation Overview:

Recent rapid advances in single-cell analysis have facilitated the comprehensive analysis of gene expression and the identification of cell types. The development of Patch-seq, which combines single-cell RNA sequencing (scRNA-seq) with patch-clamp recording, has gained significant attention for its ability to obtain multimodal information. However, interpreting this complex and enormous dataset is quite difficult and the causal relationships among the data modalities remain unclear.
Here, we developed a computational method that links the molecular states of cells with electrophysiological signatures, based on the variational autoencoder (VAE) framework.
We applied the model to a Patch-seq dataset from adult mouse primary motor cortex (MOp). Using variational inference, we optimized an encoder of the latent cell state from transcriptomes (t-features) and decoders of t-features and electrophysiological responses (e-features), processed with the continuous wavelet transform (CWT), from the latent cell state.
The learned latent space, compressed into two dimensions with UMAP, formed distinct clusters for each cell type. Moreover, the vector-Jacobian products (vjp) of this model showed which time-frequency components changed most with fluctuations in t-features.
The application of this method suggests the possibility of analysis across data modalities, such as extracting genes that have strong effects on a specific frequency of neural activity.

B-288: Cancer prognosis prediction integrating gene expression data and a background biomolecule network
Track: MLCSB
  • Kazuma Inoue, Kyoto University, Japan
  • Ryosuke Kojima, Kyoto University, Japan
  • Mayumi Kamada, Kyoto University, Japan
  • Yasushi Okuno, Kyoto University, Japan


Presentation Overview:

Cancer is a disease with a wide variety of clinical conditions, and prognosis commonly differs from patient to patient, even within the same cancer type. Although many deep-learning models to predict cancer prognosis have been developed, most of them use only individual-level information such as gene expression data and cancer type information.
Here we propose a novel deep-learning model that predicts a cancer patient's survival by combining the gene expression data of individual patients with molecular interaction information as general knowledge.
Our model consists of a graph neural network (GNN) part and an individual neural network (NN) part. In the GNN part, the molecular interaction information was represented as a graph, and a latent vector was calculated for each molecular node. In the NN part, the molecular latent vectors and gene expression data were combined to predict a patient's prognosis, namely whether the patient survives or dies within 1, 2, 3, 4, and 5 years.
Our model outperformed the conventional approach, which uses only the NN part, for every prediction year. In addition, we constructed contribution subgraphs consisting of high-importance nodes and confirmed that known biomarkers for cancer prognosis were included. Our model showed the potential to find novel biomarker candidates.

B-289: Machine learning and phylogenetic analysis improves predicting antibiotic resistance in M. tuberculosis
Track: MLCSB
  • Alper Yurtseven, Helmholtz Institute for Pharmaceutical Research Saarland (HIPS), Saarbruecken, Germany, Germany
  • Sofia Buyanova, Institute of Science and Technology Austria (ISTA), Vienna, Austria, Austria
  • Amay A. Agrawal, Helmholtz Institute for Pharmaceutical Research Saarland (HIPS), Saarbruecken, Germany, Germany
  • Olga O. Bochkareva, University of Vienna, CMESS, CUBE, ISTA, Vienna, Austria, Austria
  • Olga V. Kalinina, HIPS, HZI, CBI, Faculty of Medicine, Saarland University, Saarbrücken, Germany, Germany


Presentation Overview:

Antimicrobial resistance (AMR) poses a significant global health threat, and accurate prediction of bacterial resistance patterns is critical for effective treatment and control strategies. In recent years, machine learning (ML) approaches have emerged as powerful tools for analyzing large-scale bacterial AMR data. However, ML methods often ignore evolutionary relationships among bacterial strains, which can greatly affect their performance, especially when the goal is to detect resistance-associated features. Genome-Wide Association Studies (GWAS) methods that employ linear mixed models account for evolutionary relationships, but they uncover only highly significant variants, which have usually already been reported in the literature.

In this work, we introduce a novel phylogeny-related parallelism score (PRPS), which measures whether a certain feature is correlated with the population structure of a set of samples. We demonstrate that PRPS can be used, in combination with SVM- and random forest-based models, to reduce the number of features in the analysis while simultaneously increasing the models' performance. Applying the pipeline to a publicly available set of Mycobacterium tuberculosis genomes with phenotypic screens against six common antibiotics from the PATRIC database, followed by a feature importance analysis, we rediscovered known resistance-associated mutations as well as new, previously unreported candidates.

B-290: Applying NLP approaches for the detection of mobile genetic elements in metagenomic data
Track: MLCSB
  • Ella Rannon, Tel Aviv University, Israel
  • David Burstein, Tel Aviv University, Israel


Presentation Overview:

Plasmids, viruses, and other mobile genetic elements (MGEs) play a critical role in facilitating horizontal gene transfer, a process of inter-bacterial exchange of genetic material. This process allows the rapid spread of genes that promote adaptation to biological or environmental pressures, such as the presence of antibiotics. Thus, monitoring the presence of MGEs, and especially detecting mobile elements carrying antibiotic resistance genes or virulence factors, is essential for assessing the risk different bacteria might pose to human health. However, most computational methods for MGE detection are based on sequence similarity, are not time-efficient, and detect only a single type of MGE. In this study, we utilized Natural Language Processing (NLP) techniques to classify viruses and plasmids, both as individual elements and as integrated elements (e.g., prophages and integrative and conjugative elements) within metagenomic data. We combined protein- and DNA-level embeddings to develop a robust model for the classification of previously unknown MGEs. The model is designed to predict, for a given contig, whether it or specific segments of it belong to an MGE. Our tool will allow rapid detection of various MGEs within large metagenomic datasets and could potentially be used to identify novel genomic elements.

B-291: MSclassifR: an R package for supervised classification of mass spectra with machine learning methods
Track: MLCSB
  • Alexandre Godmer, AP-HP, Sorbonne université, Hôpital Saint-Antoine, Département de Bactériologie & INSERM, U1135, CIMI, Paris, France, France
  • Yahia Benzarara, AP-HP, Sorbonne université, Hôpital Saint-Antoine, Département de Bactériologie, Paris, France, France
  • Karen Druart, Institut Pasteur, Université de Paris, Proteomics Platform, MSBio Unit, UAR CNRS 2024, France
  • Nicolas Veziris, AP-HP, Sorbonne université, Hôpital Saint-Antoine, Département de Bactériologie & INSERM, U1135, CIMI, Paris, France, France
  • Mariette Matondo, Institut Pasteur, Université de Paris, Proteomics Platform, MSBio Unit, UAR CNRS 2024, France
  • Alexandra Aubry, Sorbonne Université, INSERM, U1135, CIMI & AP-HP, Hôpital Pitié-Salpêtrière, CNR-MyRMA, Paris, France, France
  • Quentin Giai Gianetto, Institut Pasteur, Université de Paris, MSBio Unit, UAR CNRS 2024 & Bioinformatics and Biostatistics HUB, France


Presentation Overview:

Classification of mass spectra is essential for identifying microorganisms by matrix-assisted laser desorption/ionization time-of-flight (MALDI-TOF) mass spectrometry. However, spectrally close organisms remain difficult to identify. In this context, we have developed the MSclassifR R package to improve the classification of mass spectra. Its open code reinforces the reproducibility of analyses in the community. We applied the functions of our package to raw mass spectra from three different laboratories. The best workflow available achieves nearly 100% accuracy on all three datasets. By comparison, commercial software achieved only 61% accuracy on an internal dataset. Thus, MSclassifR is an interesting alternative for reliable identification based on MALDI-TOF analysis. In addition to relying on existing machine learning methods, MSclassifR offers an original variable selection method based on random forest feature importances and a new decision rule based on AIC criteria. Moreover, it can deal with imbalanced datasets. It can also be used in the context of differential analyses of omics data to improve the selection of genes or proteins of interest. MSclassifR is freely available online from the CRAN repository (https://cran.r-project.org/web/packages/MSclassifR/index.html). Three vignettes illustrating how to use the functions of this package on real datasets are also available online to help users.

B-292: Application of Machine Learning for biomarker identification in cancer research
Track: MLCSB
  • Milan Hucko, Institute of Molecular Biology, Slovak Academy of Sciences, Bratislava, Slovakia, Slovakia
  • Katarina Kalavska, Translational Research Unit, Faculty of Medicine, Comenius University and National Cancer Institute, Bratislava, Slovakia
  • Zuzana Cierna, Department of Pathology, Faculty of Medicine, Comenius University, Bratislava, Slovakia, Slovakia
  • Dominik Hadzega, Medirex Group Academy n.p.o., Nitra, Slovakia, Slovakia
  • Gabriel Minarik, Medirex Group Academy n.p.o., Nitra, Slovakia, Slovakia
  • Lubos Klucar, Institute of Molecular Biology, Slovak Academy of Sciences, Bratislava, Slovakia, Slovakia
  • Michal Mego, 2nd Department of Oncology, Faculty of Medicine, Comenius University and National Cancer Institute, Bratislava, Slovakia


Presentation Overview:

Cancer is a major public health concern that has been studied for decades. Despite the effort, cancer often remains elusive because many of the underlying factors are not well understood. In our research, we focused on two features of cancer cells that make treatment difficult. The first is invasiveness, which is one of the hallmarks of cancer and contributes to treatment failure in later stages. The second is drug resistance, or chemoresistance, which is also one of the limiting factors in treatment effectiveness. Research on both of these features often produces a wealth of data at the genomic level; therefore, we utilised machine learning (ML) algorithms to better understand the data and possibly find new contributors to these processes. We created a deep-learning feedforward neural network with PyTorch and then evaluated feature importance with the SHAP library. Interestingly, the features that contributed the most in our model differed from the significant hits produced by classical statistical approaches. Overall, our study highlights the use of advanced data analysis techniques such as ML as a universal and highly adaptive tool for research into the complex nature of cancer. This work was supported by grant APVV-20-0158.

B-293: Timegated Raman spectroscopy and machine learning for the detection of common contaminants in algal cultivations
Track: MLCSB
  • Gopal Peddinti, VTT Technical Research Center of Finland, Finland
  • Jari Havisto, VTT Technical Research Center of Finland, Finland
  • Anu Tamminen, VTT Technical Research Center of Finland, Finland
  • Mervi Toivari, VTT Technical Research Center of Finland, Finland
  • Dorothee Barth, VTT Technical Research Center of Finland, Finland


Presentation Overview:

Large scale algal cultivations lack real-time monitoring methods for collecting data such as biomass composition, cell vitality and contamination from the ongoing bioprocess. Current measurement methods rely on manual sampling and time-delayed analyses preventing timely optimization of culture conditions and early intervention in case of disturbances in the process. Our project aims to develop optical measurement technologies and machine learning based data analysis methods to facilitate real-time monitoring of algal cultures in terms of early detection of contamination in the process.

In this study, we employed Timegated (TG) Raman spectroscopy (picoRaman Spectrometer by Timegate, 532 nm excitation laser) to study the cultures of the green alga Chlorella vulgaris with and without a variety of common contaminants in microalgal cultures—namely, Flagellate Poterioochromonas malhamensis, Dinoflagellate Oxyrrhis marina, Rotifer Brachionus plicatilis, and Cyanobacterium Synechocystis sp. PCC6803. The spectral data were used for classification of algae versus contaminants by training a variety of machine learning models including partial least squares approaches, tree-based methods as well as neural networks. Once adequately trained, such machine learning models can be used for online monitoring of the bioprocess.

B-294: Transfer learning improves matrix factorization of multi-omics datasets
Track: MLCSB
  • David Hirst, Aix Marseille Univ, INSERM, MMG, Marseille, France
  • Matthieu Vignes, School of Mathematical and Computational Sciences, College of Science, Massey University, Palmerston North, New Zealand
  • Anaïs Baudot, Aix Marseille Univ, INSERM, MMG, Marseille, France


Presentation Overview:

Matrix factorization is a popular method for extracting biological signal from multi-omics data. The resulting lower dimensional representations can be used to infer how latent processes differ across biological conditions. However, when a dataset is generated from a small number of samples, the effectiveness of matrix factorization is reduced. Therefore, transfer learning approaches to matrix factorization have previously been proposed. A transfer learning approach to matrix factorization uses information previously inferred from a large, heterogeneous learning dataset to supplement the factorization of a small, target dataset.

To the best of our knowledge, such transfer learning approaches with matrix factorization have only been applied to single-omics datasets so far. However, multi-omics datasets are increasingly available.

In this study, we propose an approach for multi-omics matrix factorization with transfer learning. We focused on the Bayesian method MOFA and implemented transfer learning by adapting MOFA's variational inference algorithm. Our goal was to assess whether our transfer learning approach improves the matrix factorization of small target multi-omics datasets, using both simulated and TCGA datasets. We observed that transfer learning improved the quality of the factorization compared to factorization of the small target dataset without transfer learning.

B-295: sciCSR infers B cell state transition and predicts class-switch recombination dynamics using single-cell transcriptomic data
Track: MLCSB
  • Joseph Ng, University College London, United Kingdom
  • Guillem Montamat Garcia, University College London, United Kingdom
  • Alexander Stewart, University of Surrey, United Kingdom
  • Paul Blair, University College London, United Kingdom
  • Deborah Dunn-Walters, University of Surrey, United Kingdom
  • Claudia Mauri, University College London, United Kingdom
  • Franca Fraternali, University College London, United Kingdom


Presentation Overview:

Class-switch recombination (CSR) is an integral part of B cell maturation. Steady-state analyses (e.g. B cell receptor [BCR] repertoire analysis of snapshots during an immune response) do not directly measure CSR dynamics, which is crucial in understanding how B cell maturation is regulated across time. We present sciCSR (pronounced ‘scissor’, single-cell inference of class switch recombination), a computational pipeline which analyses CSR events and dynamics of B cells from single-cell RNA-sequencing (scRNA-seq) experiments. sciCSR re-analyses transcriptomic sequence alignments to differentiate productive heavy-chain immunoglobulin transcripts from germline “sterile” transcripts. From a snapshot of B cell scRNA-seq data, a Markov state model is built to infer the dynamics and direction of CSR. Applying sciCSR on SARS-CoV-2 vaccination time-course scRNA-seq data, we observe that sciCSR predicts, using data from an earlier timepoint in the collected time-course, the isotype distribution of BCR repertoires of subsequent timepoints with high accuracy (cosine similarity ∼ 0.9). As an alternative to conventional RNA velocity, sciCSR makes use of biological signals specific to B cells to infer cell state transitions, and reveals insights into the regulation of CSR and the dynamics of B cell maturation during immune response.
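The cosine similarity used above to compare predicted and observed isotype distributions can be computed as in this minimal sketch. The isotype labels and the two proportion vectors are made-up illustrative values, not data from the study, and the function name is hypothetical.

```python
import math


def cosine_similarity(p, q):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)


# Hypothetical isotype proportions (illustrative values only).
isotypes = ["IgM", "IgD", "IgG", "IgA", "IgE"]
predicted = [0.40, 0.10, 0.35, 0.14, 0.01]
observed = [0.35, 0.10, 0.40, 0.14, 0.01]

similarity = cosine_similarity(predicted, observed)
print(f"cosine similarity: {similarity:.3f}")
```

A value close to 1 means the two distributions point in nearly the same direction, regardless of how the vectors are scaled.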

B-296: Paired Antibody Sequence Generation: A Deep Generative Language Model for the Creation of Highly Developable Libraries
Track: MLCSB
  • Oliver Turnbull, University of Oxford, United Kingdom
  • Rebecca Croasdale-Wood, AstraZeneca, United Kingdom
  • Charlotte Deane, University of Oxford, United Kingdom


Presentation Overview:

Modern monoclonal antibody drug discovery typically relies on large libraries of paired variable heavy (VH) and variable light (VL) sequences for use in phage display. For these sequences to make effective therapeutics, they must both bind to their target and be free from developability issues, such as aggregation, poly-specificity, and poor expression levels. Current natural and synthetic libraries often contain sequences that must later be discarded or re-engineered due to developability-related issues.

Here, we present a deep generative language model trained on 1.4M paired antibody sequences, which can generate diverse antibody libraries that can be biased towards desired properties. We use an auto-regressive model trained by a next residue prediction task. Our results demonstrate that the model can produce libraries with similar complementarity-determining region (CDR) length distributions and germline diversity to natural repertoires, but with increased diversity.

Furthermore, the model can be biased to generate sequences with desired properties through finetuning. Here, we bias the model to generate antibodies with 3D biophysical properties that fall within distributions seen in clinical stage therapeutic antibodies. This approach enables the creation of synthetic libraries enriched with 'safe' developable candidates, reducing the need for downstream engineering and minimizing developability issues.

B-297: Interpretable multi-label prediction of Mycobacterium tuberculosis drug-resistant phenotypes from whole genome sequence (WGS) variants
Track: MLCSB
  • Nina Mercedes Billows, Royal Veterinary College, United Kingdom
  • Jody Phelan, London School of Hygiene and Tropical Medicine, United Kingdom
  • Dong Xia, Royal Veterinary College, United Kingdom
  • Yonghong Peng, Manchester Metropolitan University, United Kingdom
  • Taane Clark, London School of Hygiene and Tropical Medicine, United Kingdom
  • Yu-Mei Chang, Royal Veterinary College, United Kingdom


Presentation Overview:

Machine learning (ML) has wide applications across the biomedical sciences and has demonstrated high-level performance for genotype-phenotype predictions. An example of such success is the prediction of drug-resistant phenotypes from Mycobacterium tuberculosis genomic variants to profile resistant tuberculosis (TB) strains. However, most ML models lack interpretable decisions, which makes it difficult to understand the mechanisms underlying drug-resistance prediction and prevents wider adoption.
We designed a multi-label weighted random forest (MLWRF) approach that can capture important features and simultaneously control for population structure to predict resistance across a panel of 10 drugs. We utilised a global TB dataset comprising 18,396 M. tuberculosis samples, which were split into training (80%) and testing (20%) data. Performance was measured using AUC-ROC, sensitivity, and specificity. We also performed deep exploration of feature interactions in the MLWRF model and used interactive knowledge graphs to integrate prior knowledge to boost interpretability.
Our MLWRF model achieved moderate-to-high performance that is comparable to or better than that of existing models. Using knowledge graphs, we showed that predictions are primarily underpinned by interactions between known drug-resistance mutations and co-occurring mutations. Overall, we highlight that using MLWRF in combination with knowledge graphs improves the interpretability of drug-resistance prediction and is useful for exploring feature interactions.
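The three performance measures named above can be sketched for a single drug as follows. This is a generic illustration, not the authors' evaluation code: the labels, scores, 0.5 threshold, and function names are hypothetical, and AUC-ROC is computed with the pairwise ranking formulation.

```python
def sensitivity_specificity(y_true, y_pred):
    """Sensitivity (true positive rate) and specificity (true negative rate)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def auc_roc(y_true, scores):
    """AUC as the fraction of (resistant, susceptible) pairs in which the
    resistant sample receives the higher score; ties count as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = resistant) and model scores for one drug.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.1, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

sens, spec = sensitivity_specificity(y_true, y_pred)
auc = auc_roc(y_true, scores)
print(sens, spec, auc)
```

In a multi-label setting such as the 10-drug panel, these measures would typically be computed separately for each drug.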

B-298: Crushing Antimicrobial Resistance using Explainable Artificial Intelligence (XAI)
Track: MLCSB
  • Amay Ajaykumar Agrawal, Helmholtz Institute for Pharmaceutical Research Saarland (HIPS), Saarbrücken, Saarland, Germany, Germany
  • Nils Walter, CISPA – Helmholtz Center for Information Security, Saarbrücken, Saarland, Germany, Germany
  • Alper Yurtseven, Helmholtz Institute for Pharmaceutical Research Saarland (HIPS), Saarbrücken, Saarland, Germany, Germany
  • Jilles Vreeken, CISPA – Helmholtz Center for Information Security, Saarbrücken, Saarland, Germany, Germany
  • Olga V. Kalinina, HIPS, HZI, CBI, Faculty of Medicine, Saarland University, Saarbrücken, Germany, Germany


Presentation Overview:

Antimicrobial resistance (AMR) is an urgent threat to human health worldwide, as microbes have developed resistance to even the most advanced drugs. Whole-genome sequencing followed by rule-based analysis has accelerated AMR diagnosis compared to traditional phenotypic susceptibility testing, but it cannot detect resistance caused by unknown AMR markers. Prediction approaches based on machine learning (ML) outperform rule-based approaches, but their limited interpretability restricts their use in clinical settings.

In this work, we propose a novel computational approach to develop inherently interpretable models based on pattern set mining. These models discover patterns of different lengths comprising features (mutations and genes) that are associated with resistance against a target antibiotic. We applied our models to publicly available AMR data from the PATRIC database for Mycobacterium tuberculosis against commonly prescribed antibiotics, such as quinolones and beta-lactams. Using a basic pattern enumerator model, we rediscovered most known common resistance mechanisms. Additionally, we observed novel sets of genes and mutations that can be related to resistance in rare cases and discuss their potential biological meaning. In the future, these novel markers could be used in AMR diagnostics.
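
To make the idea of a basic pattern enumerator concrete, the following sketch lists feature sets that co-occur in resistant samples. It is a brute-force illustration only, not the pattern set mining approach of the poster; the feature names (`mutA`, `mutB`, `geneX`), the thresholds, and the toy data are all hypothetical.

```python
from itertools import combinations


def enumerate_patterns(samples, labels, max_len=2,
                       min_support=2, min_precision=0.8):
    """List feature sets carried by at least `min_support` samples, of
    which at least a fraction `min_precision` are resistant (label 1)."""
    features = sorted(set().union(*samples))
    found = []
    for k in range(1, max_len + 1):
        for pattern in combinations(features, k):
            # Labels of all samples that carry every feature in the pattern.
            carriers = [lab for s, lab in zip(samples, labels)
                        if set(pattern) <= s]
            if len(carriers) < min_support:
                continue
            precision = sum(carriers) / len(carriers)
            if precision >= min_precision:
                found.append((pattern, len(carriers), precision))
    return found


# Each sample is the set of features (mutations or genes) it carries.
samples = [
    {"mutA", "geneX"},
    {"mutA", "geneX", "mutB"},
    {"mutA"},
    {"mutB"},
    {"geneX"},
    {"mutA", "geneX"},
]
labels = [1, 1, 0, 0, 0, 1]  # 1 = resistant, 0 = susceptible

patterns = enumerate_patterns(samples, labels)
print(patterns)
```

In this toy data neither `mutA` nor `geneX` alone passes the precision threshold, but their combination does, which is the kind of multi-feature association that single-marker rules would miss.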

B-299: Combining gradient tree boosting and network propagation for the identification of a pan-cancer survival network
Track: MLCSB
  • Kristina Thedinga, Max Planck Institute for Molecular Genetics, Germany
  • Ralf Herwig, Max Planck Institute for Molecular Genetics, Germany


Presentation Overview:

Cancer is a leading cause of death worldwide and improving patient survival is the ultimate goal of cancer therapy. Predicting cancer survival from molecular features is an important computational task that allows quantifying patient risks and individualizing treatment. We propose a survival prediction approach that applies XGBoost tree ensemble learning to transcriptome data of 8,024 patients from 25 cancer types contained in The Cancer Genome Atlas (TCGA) and that performs highly competitively with existing survival prediction methods. Additionally, we show that pan-cancer training, where we combine transcriptome data from the 25 different cancer types to train a shared survival prediction model, substantially improves prediction performance compared to cancer type-specific training, where a separate model is trained for each cancer type. To further improve plausibility, we apply network propagation to the feature importance weights learned by the pan-cancer survival prediction model and infer a pan-cancer survival network consisting of 103 genes. The survival network comprises cross-cohort features, and over-representation analysis shows that it is significantly enriched for the tumor microenvironment, which has been associated with tumor initiation, growth, invasion, metastasis, and response to therapies. This underpins the biological plausibility of the pan-cancer survival network identified by our approach.
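
Network propagation of feature importance weights can be sketched as a random walk with restart on a gene network. The abstract does not state which propagation scheme was used, so this is one common variant shown under stated assumptions: the gene names, the toy path graph, and the restart probability of 0.5 are invented, and every node is assumed to have at least one neighbour.

```python
def propagate(adjacency, seed_scores, restart=0.5, iterations=100):
    """Random walk with restart: p <- restart * p0 + (1 - restart) * W p,
    where W spreads each node's score equally over its neighbours."""
    nodes = sorted(adjacency)
    total = sum(seed_scores.get(n, 0.0) for n in nodes)
    p0 = {n: seed_scores.get(n, 0.0) / total for n in nodes}
    p = dict(p0)
    for _ in range(iterations):
        nxt = {n: restart * p0[n] for n in nodes}
        for source in nodes:
            share = (1 - restart) * p[source] / len(adjacency[source])
            for target in adjacency[source]:
                nxt[target] += share
        p = nxt
    return p


# Toy undirected path network g1 - g2 - g3 - g4 (hypothetical genes).
adjacency = {
    "g1": ["g2"],
    "g2": ["g1", "g3"],
    "g3": ["g2", "g4"],
    "g4": ["g3"],
}
# Seed scores stand in for model feature importances; only g1 is seeded.
scores = propagate(adjacency, {"g1": 2.0})
print(scores)
```

Genes close to highly weighted seeds receive part of their score, so a network built from the top propagated scores can include genes that the model itself ranked low but that sit next to important ones.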

B-300: Chemically Interpretable Molecular Representation for Property Prediction
Track: MLCSB
  • Roshan Balaji, IIT Madras, India
  • Nirav Bhatt, IIT Madras, India


Presentation Overview:

Molecular property prediction from a molecule's structure is a crucial step in drug and novel material discovery, as computational screening approaches rely on predicted properties to refine the existing design of molecules. Although the problem has existed for decades, it has recently gained attention due to the advent of big data and deep learning. On average, one drug receives FDA approval for every 250 compounds entering the preclinical research stage, which requires screening chemical libraries containing more than 20,000 compounds. In silico property prediction approaches using learnable representations increase the pace of development and reduce the cost of discovery. We propose developing molecule representations based on chemical functional groups to address the problem of deciphering the relationship between a molecule's structure and its properties. Functional groups are substructures in a molecule with distinctive chemical properties that influence its chemical characteristics. These substructures are found by (i) curating functional groups annotated by chemists and (ii) mining a large corpus of molecules to extract frequent substructures using a pattern-mining algorithm. We show that the Functional Group Representation (FGR) framework outperforms state-of-the-art models on several benchmark datasets while making the link between the predicted property and molecular structure explainable to experimentalists.

B-301: Machine learning combining multi-omics data and network algorithms identifies Adrenocortical carcinoma prognostic biomarkers
Track: MLCSB
  • Roberto Martin-Hernandez, Clarivate Analytics, Spain
  • Sergio Espeso-Gil, Clarivate Analytics, Spain
  • Pablo Latorre, Clarivate Analytics, Spain
  • Clara Domingo, Clarivate Analytics, Spain
  • José Ramon Hernandez, Clarivate Analytics, Spain
  • Ekaterina Kotelnikova, Clarivate Analytics, Spain


Presentation Overview:

Background: Adrenocortical carcinoma (ACC) is a rare endocrine cancer with a poor prognosis. Here we demonstrate how machine learning methods integrating multi-omics data, in combination with systems biology tools, can contribute to the identification of new prognostic biomarkers for ACC.

Methods: ACC multi-omics datasets were downloaded from the Xena Browser. Data integration analysis identified a multi-omics signature. Regulators were discovered using systems biology tools. A random forest classifier was trained, and prognostic value was evaluated with the Kaplan-Meier method.

Results: A multi-omics signature including 85 highly correlated features was generated. The classifier achieved AUC values higher than 0.81 when classifying patients from different disease stages. Association of the genes included in the signature with overall survival (OS) data highlighted genes and microRNAs as statistically significant prognostic biomarkers. Interestingly, we identified features (HAUS8 and miR-125b-2) that dedicated databases report as biomarkers of the disease with a high level of validity.

Conclusions: Machine learning and integrative analysis of multi-omics data, in combination with systems biology tools, identified a set of biomarkers with high prognostic value for ACC disease. Multi-omics data is a promising resource for the identification of drivers and new prognostic biomarkers.
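
The Kaplan-Meier method mentioned in the Methods estimates the survival function as a product over observed event times of (1 - deaths / patients at risk). The sketch below shows that calculation on invented follow-up data; it is an illustration of the estimator and not the analysis pipeline of the poster.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.
    events[i] is 1 if death was observed at times[i], 0 if censored."""
    survival = 1.0
    curve = []
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    for t in event_times:
        # Patients still under observation just before time t.
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        survival *= 1 - deaths / at_risk
        curve.append((t, survival))
    return curve


# Invented follow-up times (months) and event indicators for 7 patients.
times = [2, 3, 3, 5, 8, 8, 12]
events = [1, 1, 0, 1, 0, 1, 0]

curve = kaplan_meier(times, events)
for t, s in curve:
    print(t, round(s, 3))
```

Censored patients contribute to the at-risk count until they leave the study, which is why the estimator can use incomplete follow-up without treating those patients as deaths.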

B-302: Profiling chemical perturbations based on predicting the ranking of differentially expressed genes
Track: MLCSB
  • Hajung Kim, Korea University, South Korea
  • Mogan Gim, Korea University, South Korea
  • Sunkyu Kim, AIGEN Sciences, South Korea
  • Gwanghoon Jang, Korea University, South Korea
  • Seungheun Baek, Korea University, South Korea
  • Jaewoo Kang, Korea University, AIGEN Sciences, South Korea


Presentation Overview:

Gene expression analysis is an important part of in vitro assays for evaluating the impact of drug candidates. The magnitude of change observed in differentially expressed genes (DEGs) between pre- and post-perturbation gene expression reflects the effect of the drug treatment. Since drugs are designed to interact with their specific target proteins, the subsequent changes in response to drugs can vary depending on which drug is administered. We introduce a method that predicts the relative importance of DEGs based on the combination of drug candidates and cell lines.
We present ChemProfiler, a method that profiles chemical perturbation by predicting the ranking of DEGs using large-scale transcriptomic profiles. We aim to build a model that simulates the interplay of biological systems by computing attention between compounds and target-associated pathways and between target-associated pathways and genes. In addition, we use ground-truth pairs of small molecules and target-associated pathways as auxiliary information to regularize attention. The accuracy score for both the top up-regulated and top down-regulated genes provides insights into significant signals induced by the administered drugs. Furthermore, computing the similarity between the DEGs of existing drugs and those of novel compounds enables in silico drug screening.

B-303: Predicting Binding Affinity Changes in Protein Complexes with Protein Language Models
Track: MLCSB
  • Gianluca Lombardi, Sorbonne Université Laboratoire de Biologie Computationnelle et Quantitative - UMR 7238, France
  • Alessandra Carbone, Sorbonne Université, CNRS, IBPS, Laboratoire de Biologie Computationnelle et Quantitative - UMR 7238, France


Presentation Overview:

The study of the effects of mutations on protein-protein interactions is crucial for understanding the impact of genetic variations on protein function and disease. While many available computational tools are based on protein structure, the recent development of single-sequence protein language models has made it possible to exploit the exponential increase in protein sequencing data for large-scale prediction tasks, with promising results.

In this work, we first explore the connection between the variations in the protein embeddings obtained with the ESM-2 model and the change in binding affinity ($\Delta \Delta G$) induced by single-point mutations, using data extracted from the SKEMPI v2.0 database. We then train a deep learning framework based on a siamese architecture and convolutional light attention to predict $\Delta \Delta G$ from protein embeddings, obtaining a Pearson correlation coefficient (PCC) of 0.83 in cross-validation experiments. We find that the best predictions are achieved when the attention network focuses more on regions belonging to the interface, which not only points towards potential future improvements but also highlights the interpretability of the model. Further analyses will be carried out to verify the consistency of the model and its predictive power on other datasets.
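
The PCC reported here is the Pearson correlation between measured and predicted $\Delta \Delta G$ values. A minimal sketch of that calculation follows; the four value pairs are invented for illustration and are not taken from SKEMPI or from the poster.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Invented measured and predicted binding affinity changes (kcal/mol).
measured = [0.5, 1.2, -0.3, 2.0]
predicted = [0.7, 1.0, 0.1, 1.6]

pcc = pearson(measured, predicted)
print(round(pcc, 3))
```

Because the PCC is invariant to scale and offset, a high value shows that predictions track the measured trend but does not by itself show that the predicted magnitudes are accurate.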

B-304: XGBoost Forest Algorithm Combined with Flux Balance Analysis Predicts Survival Outcome and Elucidates Unforeseen Metabolic Vulnerabilities in Acute Myeloid Leukemia Patients
Track: MLCSB
  • Nathaniel Polley, INSERM, France
  • Petr Nazarov, Luxembourg Institute of Health, Luxembourg
  • Tony Kaoma, Luxembourg Institute of Health, Luxembourg
  • Jean-Emmanuel Sarry, INSERM, France


Presentation Overview:

The preliminary use of experimental screens to search for new biological targets in cancer research remains expensive and time-consuming, and rarely produces the sample size necessary for statistical significance. As a consequence, demand persists for an alternative method to accurately discover therapeutic mechanisms of action directly from patient samples. PolyMORPHOS (Metabolic-Oriented Regression Predicting Human Overall Survival) derives metabolic flux balances from diagnostic bulk patient RNA-seq data in order to calibrate a progressive XGBoost model, providing diverse metabolic landscapes to predict the fate of patients at early, middle, and late stages of survival. For each survival iteration, polyMORPHOS calculates the metabolic reactions most pertinent to death prediction, thus enabling precise therapies to best reverse unfavorable prognoses. Additionally, we announce SOURIS (Simulation Of Universal Reactions In Silico). Using polyMORPHOS, SOURIS simulates chronological metabolic survival conditions hundreds of times. Within each simulation, the cohort is stratified based upon discrete expression levels of any phenotypic variable of choice. The resulting metrics provide insight into which metabolic reactions result in a favorable outcome for each stratified characteristic. Hence, elucidating relevant mechanisms of action at early stages of the experimental process empowers researchers to dedicate more resources to effective therapeutic design, further bridging theoretical and tangible results.

B-305: Interpreting tree ensemble machine learning models with endoR
Track: MLCSB
  • Albane Ruaud, University of Tübingen, Germany
  • Niklas Pfister, University of Copenhagen, Denmark
  • Ruth E Ley, Max Planck Institute for Biology, Germany
  • Nicholas D Youngblut, Max Planck Institute for Biology, Germany


Presentation Overview:

Tree ensemble models are increasingly used in microbiome science to predict phenotypes using sequence-based microbial features. While such models are often accurate, it is generally unclear how these models combine microbial taxa to make predictions. We developed endoR, a method to interpret tree ensemble models. The endoR method simplifies the fitted model into a decision ensemble and extracts information on the importance of individual features and their pairwise interactions. This information is displayed as an interpretable network, which provides insights into how features and their interactions contribute to the model's predictive performance. Adjustable regularization and bootstrapping are used to reduce complexity and ensure that only essential parts of the model are retained.
We assessed endoR on simulated and real metagenomic data and found it is as accurate as state-of-the-art approaches while enhancing model interpretation. Using endoR, we also confirmed published results on microbiome differences between cirrhotic and healthy individuals. Finally, we utilized endoR to explore associations between human gut methanogens and microbiome components.
Overall, endoR is a useful, open-source tool for accurately interpreting tree ensemble models. Its visualizations and summary outputs facilitate model interpretation and the generation of novel hypotheses about complex systems.

B-306: Advancing Rhea database annotation with EC numbers using reaction fingerprints and machine learning
Track: MLCSB
  • Anastasia Sveshnikova, SIB Swiss Institute of Bioinformatics, Switzerland
  • Anne Morgat, SIB Swiss Institute of Bioinformatics, Switzerland
  • Parit Bansal, SIB Swiss Institute of Bioinformatics, Switzerland
  • Blanca Cabrera Gil, SIB Swiss Institute of Bioinformatics, Switzerland
  • Nicole Redaschi, SIB Swiss Institute of Bioinformatics, Switzerland
  • Alan Bridge, SIB Swiss Institute of Bioinformatics, Switzerland
  • Kristian Axelsen, SIB Swiss Institute of Bioinformatics, Denmark


Presentation Overview:

Rhea (www.rhea-db.org) is a FAIR resource of expert curated biochemical and transport reactions described using the ChEBI ontology of small molecules (www.ebi.ac.uk/chebi/). Rhea is the reference vocabulary for enzyme annotation in UniProtKB (www.uniprot.org) and provides reference reaction data for the Gene Ontology, Reactome, MetaboLights, and a host of other knowledge and data resources.
In this poster, we describe the application of machine learning and cheminformatics to establish links between Enzyme Commission (EC) numbers and Rhea reactions. We demonstrate that combining ML classifiers and neural networks with reaction fingerprints, derived either from cheminformatics or from natural language processing (NLP) methods applied to the simplified molecular-input line-entry system (SMILES), can yield reliable EC number predictions for Rhea reactions that currently lack enzymatic activity annotation. We believe such predictions can broaden the usability of the Rhea database for biologists and chemists interested in enzymatic catalysis.

B-307: MolE: deep pre-trained molecular representations can aid antimicrobial discovery
Track: MLCSB
  • Roberto Olayo Alarcon, Department of Statistics, Ludwig-Maximilians-Universität München, Germany
  • Martin Amstalden, Department of Microbiology, Biocenter, University of Würzburg, Germany
  • Annamaria Zannoni, Department of Molecular Infection Biology II, Institute of Molecular Infection Biology, University of Würzburg, Germany
  • Cynthia Sharma, Department of Molecular Infection Biology II, Institute of Molecular Infection Biology, University of Würzburg, Germany
  • Ana Rita Brochado, Department of Microbiology, Biocenter, University of Würzburg, Germany
  • Mina Rezaei, Department of Statistics, Ludwig-Maximilians-Universität München, Munich, Germany
  • Christian L. Müller, Department of Statistics, Ludwig-Maximilians-Universität München, Germany


Presentation Overview:

Determining the context-dependent bioactivity of chemical compounds by experimental means is a labor-intensive and costly process. The advent of large-scale machine and deep learning models holds the promise to aid in this process by predicting the biological activity of chemical compounds from known examples. However, the lack of large labeled datasets poses a challenge for the generalization performance of supervised learning schemes, especially for structurally novel molecules. To ameliorate this shortcoming, we present MolE (Molecular representation learned through Embedding decorrelation), a non-contrastive self-supervised deep learning framework that aims at learning meaningful task-independent molecular representations by pre-training on unlabelled chemical structures. Our experiments show that MolE's molecular representation enhances the performance of subsequent supervised machine learning algorithms such as Random Forest and XGBoost in molecular property prediction tasks. Moreover, fine-tuning the molecular representation for specific applications further improves the performance of Graph Neural Network-based prediction schemes. As a practical application, we trained a classifier using MolE's molecular features on a publicly available dataset to predict the antimicrobial activity of human-targeted drugs. Our results indicate that the classifier was able to predict growth-inhibitory effects of compounds that are structurally distinct from current antibiotics. In vitro experiments subsequently confirmed the predicted activity against human gut pathogens.

B-308: Graph neural networks for investigating complex diseases: A case study on Parkinson's Disease.
Track: MLCSB
  • Elisa Gómez de Lope, University of Luxembourg, Luxembourg
  • Ramón Viñas Torné, University of Cambridge, United Kingdom
  • Pietro Liò, University of Cambridge, United Kingdom
  • Enrico Glaab, University of Luxembourg, Luxembourg


Presentation Overview:

Omics data analysis is a critical component in the study of complex diseases, but the high dimension and heterogeneity of the data often pose challenges that are difficult to address by classical statistical and machine learning methods. Recently, structured data analyses using graph neural networks (GNNs) have emerged as a promising complementary approach, particularly for investigating the relational information between samples. However, it is still unclear which strategies for designing and optimizing GNNs are most effective when working with real-world data from complex disorders, such as Parkinson's disease (PD).

Our study addresses this gap by examining the application of various GNN models, including Graph Convolutional Network, ChebyNet, and Graph Attention Network, to identify and interpret discriminative patterns between PD patients and controls using omics data. The developed pipeline integrates Lasso penalty-based feature selection, similarity graph construction, and final modeling for sample classification. Through an end-to-end model building and evaluation process, we assess the practical utility of the pipeline on independent PD omics datasets.

Overall, our analyses highlight some of the benefits and challenges of using graph-structured data for machine learning analysis of disease-related omics data and provide directions for further research.
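
The similarity graph construction step of such a pipeline can be illustrated with a k-nearest-neighbour graph over samples. The abstract does not specify the similarity measure or graph type, so Euclidean distance, the value of k, and the toy feature vectors below are assumptions made for this sketch only.

```python
from math import dist


def knn_graph(samples, k=2):
    """Connect each sample to its k nearest neighbours (Euclidean
    distance) and return the undirected edges as sorted index pairs."""
    edges = set()
    for i, x in enumerate(samples):
        others = sorted((j for j in range(len(samples)) if j != i),
                        key=lambda j: dist(x, samples[j]))
        for j in others[:k]:
            edges.add((min(i, j), max(i, j)))
    return sorted(edges)


# Invented two-feature profiles for five samples (two clear groups).
samples = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0), (5.1, 5.0)]

edges = knn_graph(samples, k=1)
print(edges)
```

Each sample becomes a node, and a GNN classifier would then pass information along these edges, so the choice of k and of the distance measure directly shapes what the model can learn.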

B-309: Heterogeneous Domain Adaptation for Species-Agnostic Transfer Learning
Track: MLCSB
  • Youngjun Park, University Medical Center Göttingen, Germany
  • Nils Paul Muttray, Georg-August-Universität Göttingen, Germany
  • Anne-Christin Hauschild, University Medical Center Göttingen, Germany


Presentation Overview:

Model organisms such as mice and zebrafish play a crucial role in developing and validating new hypotheses in biomedical research, particularly in studying disease mechanisms and treatment responses. However, due to biological differences between species, translating these findings into human applications remains challenging. The commonly used homologous gene information is often unavailable, particularly for non-model organisms, and its use entails a significant loss of information during gene ID conversion. To address this issue, we present a novel methodology for species-agnostic transfer learning with heterogeneous domain adaptation.
Our approach allows for knowledge integration and translation across various species' datasets without relying on external human-curated knowledge. It is an extension of the cross-domain structure-preserving projection algorithm and has been evaluated against gene homology methods and related machine learning models.
The evaluation uses four different single-cell sequencing datasets and focuses on the out-of-sample prediction task. The results show that the approach can predict unseen cell types based on other species' data. More importantly, we observe similar gene ontologies amongst the most influential genes composing the primary latent space axis in both species. This demonstrates that our novel approach allows knowledge transfer across species barriers without gene homologies while utilizing all possible gene sets.

B-310: Alleviating cell-free DNA sequencing biases with optimal transport
Track: MLCSB
  • Antoine Passemiers, ESAT-STADIUS, KU Leuven, Belgium
  • Tatjana Jatsenko, Laboratory for Cytogenetics and Genome Research, Department of Human Genetics, KU Leuven, Belgium
  • Daniele Raimondi, ESAT-STADIUS, KU Leuven, Belgium
  • Joris Vermeesch, Laboratory for Cytogenetics and Genome Research, Department of Human Genetics, KU Leuven, Belgium
  • Yves Moreau, ESAT-STADIUS, KU Leuven, Belgium


Presentation Overview:

Cell-free DNA (cfDNA) is a promising source of biomarkers for the detection of cancer, autoimmune disorders and transplant rejection. Low-coverage whole genome sequencing of cfDNA is attractive due to its cost efficiency, and downstream analysis based on read counts is appealing due to the presence of copy number aberrations (CNAs) in many cancers. However, clinical applicability remains limited due to the presence of major confounders, such as the library preparation protocol or sequencing platform, which lead to distributional shifts as they vary across time or space. We present a novel method that builds on optimal transport theory and explicitly corrects for the effect of preanalytical variables. Our approach can be used to merge data sets that are representative of the same population but separated by technical biases. Moreover, we demonstrate that it improves cancer detection via machine learning by alleviating sources of variation that are not of biological origin, and that it improves over the widely used GC-bias correction. These results open perspectives for the analysis of larger data sets through the integration of cohorts produced by different sequencing pipelines or different centers. Notably, the approach is rather general, with the potential for application to many other genomic data analysis problems.
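
The core idea of using optimal transport to align two technically shifted data sets can be shown in one dimension, where the optimal coupling for a convex cost simply matches values by rank. This is a deliberately simplified sketch and not the method of the poster, which is not described in detail here; the equal sample sizes and the read-count values are assumptions made for illustration.

```python
def ot_map_1d(source, target):
    """Transport equal-sized 1-D samples: the i-th smallest source value
    is mapped to the i-th smallest target value, which is the optimal
    coupling for any convex cost such as squared distance."""
    if len(source) != len(target):
        raise ValueError("this sketch assumes equal sample sizes")
    order = sorted(range(len(source)), key=lambda i: source[i])
    sorted_target = sorted(target)
    mapped = [0.0] * len(source)
    for rank, i in enumerate(order):
        mapped[i] = sorted_target[rank]
    return mapped


# Invented normalised read counts for one genomic bin from two
# sequencing protocols that sample the same population.
source = [10.0, 14.0, 12.0, 18.0]  # protocol to be corrected
target = [5.0, 9.0, 6.0, 7.0]      # reference protocol

mapped = ot_map_1d(source, target)
print(mapped)
```

After mapping, the corrected values follow the reference distribution while each sample keeps its rank within its own cohort, so relative differences between samples are preserved.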

B-311: varCADD: large training sets for pathogenicity prediction based on standing genetic variation
Track: MLCSB
  • Lusiné Nazaretyan, Berlin Institute of Health at Charité – Universitätsmedizin Berlin, 10117 Berlin, Germany
  • Philipp Rentzsch, Science for Life Laboratory, Department of Gene Technology, KTH Royal Institute of Technology, Stockholm, Sweden
  • Martin Kircher, Institute of Human Genetics, University Medical Center Schleswig-Holstein, University of Lübeck, 23562 Lübeck, Germany


Presentation Overview:

Despite much progress in human genome research, it remains a major challenge to pinpoint the alterations that cause a disease or phenotype among millions of genomic variants. The reasons are numerous and include the lack of large unbiased datasets for training supervised machine learning models. However, due to advances in sequencing technologies, extensive catalogs of standing variation in the human population have recently become available for the computational prediction of genetic variant effects.

Here, we present several alternative training sets for variant prioritization based on gnomAD v3.0. We use them to train different models of CADD, a widely applied machine learning method for variant prioritization that originally uses variant simulations and human-derived sequence alterations to score deleteriousness (Kircher et al., Nat. Genet. 2014). We show that models trained on the alternative training sets perform equally well on a large validation set based on the NCBI ClinVar database. Further, those training sets have major advantages over the original CADD training set, including a larger (and growing) number of variants that provide better coverage of rare genomic annotations, a better reflection of natural mutational processes and representation along the genome, and the potential to capture weaker selection effects. Hence, these models present an opportunity for improved variant prioritization.

B-312: Contextualizing protein representations using deep learning on interactomes and single-cell experiments
Track: MLCSB
  • Michelle Li, Harvard Medical School, United States
  • Yepeng Huang, Harvard T.H. Chan School of Public Health, United States
  • Marissa Sumathipala, Harvard University, United States
  • Man Qing Liang, Harvard Medical School, United States
  • Alberto Valdeolivas, Roche, Switzerland
  • Katherine Liao, Brigham and Women's Hospital, United States
  • Daniel Marbach, Roche, Switzerland
  • Marinka Zitnik, Harvard Medical School, United States


Presentation Overview:

Protein interaction networks are a critical component in studying the function and therapeutic potential of proteins. However, accurately modeling protein interactions across diverse biological contexts, such as tissues and cell types, remains a significant challenge for existing algorithms. Here, we introduce PINNACLE, a flexible geometric deep learning approach that trains on contextualized protein interaction networks to generate context-aware protein representations. Leveraging a multi-organ single-cell transcriptomic atlas of humans, PINNACLE provides 394,760 protein representations split across 156 cell-type contexts from 24 tissues and organs. We demonstrate that PINNACLE's contextualized representations of proteins reflect cellular and tissue organization and PINNACLE's tissue representations enable zero-shot retrieval of tissue hierarchy. Our contextualized protein representations, infused with cellular and tissue organization, can easily be adapted for diverse downstream tasks. We fine-tune PINNACLE to study the genomic effects of drugs in multiple cellular contexts and show that our context-aware model significantly outperforms state-of-the-art, yet context-agnostic, models. Enabled by our context-aware modeling of proteins, PINNACLE is able to nominate promising protein targets and cell-type contexts for further investigation. PINNACLE exemplifies and empowers the long-standing paradigm of incorporating context-specific effects for studying biological systems, especially the impact of disease and therapeutics.

B-344: NCLUSION: Joint Bayesian nonparametric clustering and variable selection of heterogenous single cell populations
Track: MLCSB
  • Chibuikem Nwizu, Center for Computational Molecular Biology, Brown University, Providence, RI 02912, United States
  • Madeline Hughes, Microsoft Research New England, Cambridge, MA 02142, United States
  • Michelle Ramseier, Institute for Medical Engineering & Science, Massachusetts Institute of Technology, Cambridge, MA 02139, United States
  • Nicolo Fusi, Microsoft Research New England, Cambridge, MA 02142, United States
  • William C. Hahn, Department of Medical Oncology, Dana-Farber Cancer Institute, Boston, MA 02215, United States
  • Alex K. Shalek, Broad Institute of MIT and Harvard, Cambridge, MA 02142, United States
  • Andrew Navia, Institute for Medical Engineering & Science, Massachusetts Institute of Technology, Cambridge, MA 02139, United States
  • Srivatsan Raghavan, Department of Medical Oncology, Dana-Farber Cancer Institute, Boston, MA 02215, United States
  • Peter S. Winter, Broad Institute of MIT and Harvard, Cambridge, MA 02142, United States
  • Ava P. Amini, Microsoft Research New England, Cambridge, MA, United States
  • Lorin Crawford, Microsoft Research New England, Cambridge, MA, United States


Presentation Overview:

Using single-cell RNA-sequencing (scRNA-seq) to profile cellular populations has revealed cellular phenotypic heterogeneity, even in populations once considered phenotypically homogeneous. Clustering and cluster-specific feature selection are necessary steps in many scRNA-seq pipelines in order to partition, characterize, and interpret this phenotypic heterogeneity. Unfortunately, many methods rely on heuristic choices which have been shown to bias results and downstream inference. Here, we introduce a method for the "Nonparametric CLUstering of SIngle cell populatiONs" (NCLUSION). NCLUSION is a Bayesian dynamic Hierarchical Dirichlet Process Mixture Model that clusters expression data from one or more experimental conditions by learning phenotypic cluster membership of cells in the data while simultaneously performing variable selection to identify gene signatures driving these memberships. NCLUSION outputs the membership probability that a cell belongs to each phenotypic cluster and the informative genes for each phenotypic cluster. Importantly, our method scales to large data sets and is efficient enough to accommodate more genes as inputs to the clustering process, addressing the need for unbiased clustering and allowing for unbiased feature discovery in large heterogeneous cellular populations. We illustrate the potential of this new sparse nonparametric Bayesian analysis in analyzing data from single and multi-condition single-cell experiments.