MLCSB

Schedule subject to change
All times listed are in CEST
Monday, July 24th
10:30-11:30
Invited Presentation: Deciphering multiple facets of the cis-regulatory code with deep learning models of regulatory DNA
Room: Lumière Auditorium
Format: Live from venue

Moderator(s): Oznur Tastan

  • Anshul Kundaje, Stanford University, USA
11:30-11:40
DNA-Diffusion: Generative diffusion models for enhancing gene expression control through synthetic regulatory elements
Room: Lumière Auditorium
Format: Live from venue

Moderator(s): Oznur Tastan

  • Lucas Ferreira Silva, Harvard/MGH, United States
  • Simon Senan, openBIOML, United States
  • Matei Bejan, University of Bucharest, Romania
  • César Miguel Valdez Córdova, JKU Linz, Austria
  • Cameron Smith, MGH/Harvard/Broad, United States
  • Sameer Gabbita, MGH/TJHSST, United States
  • Aaron Wenteler, QMUL, United States
  • Zach Nussbaum, Nomic.ai, United States
  • Aniketh Janardhan Reddy, UC Berkeley, United States
  • Zelun Li, Victor Chang Cardiac Institute/UNSW, Australia
  • Zain Munir Patel, MGH/Harvard/Broad, United States
  • Noah Weber, Celeris Therapeutics, Germany
  • Tin M. Tunjic, Celeris Therapeutics, Germany
  • Emily S. Wong, Victor Chang Cardiac Institute/UNSW, Australia
  • Wouter Meuleman, University of Washington, United States
  • Luca Pinello, MGH/Harvard/Broad Institute, United States


Presentation Overview:

The challenge of systematically modifying and optimizing regulatory elements for precise gene expression control is central to modern genomics and synthetic biology. Advancements in generative AI have paved the way for designing synthetic sequences and identifying genomic locations for integration, with the aim of safely and accurately modulating gene expression. We leverage diffusion models to design context-specific DNA regulatory sequences, which hold significant potential toward enabling novel therapeutic applications requiring precise modulation of gene expression. Our framework uses a cell type-specific diffusion model to generate novel 200 bp regulatory elements based on chromatin accessibility across different cell types. We evaluate the generated sequences based on key metrics to ensure they retain properties of endogenous sequences including binding specificity, composition, accessibility, and regulatory potential. We assess transcription factor binding site composition, potential for cell type-specific chromatin accessibility, and capacity for sequences generated by DNA-Diffusion to activate gene expression in different cell contexts using state-of-the-art prediction models. Our results demonstrate the ability to robustly generate DNA sequences with cell type-specific regulatory potential. DNA-Diffusion paves the way for a regulatory modulation approach to mammalian synthetic biology and precision gene therapy.

11:40-11:50
Contemporary multi-task deep learning models of regulatory DNA exhibit widespread sensitivity to spurious sequence features
Room: Lumière Auditorium
Format: Live from venue

Moderator(s): Oznur Tastan

  • Surag Nair, Stanford University, United States
  • Alan Min, University of Washington, United States
  • Areeb Gani, Montgomery Blair High School, United States
  • Jacob Schreiber, Stanford University, United States
  • William Noble, University of Washington, United States
  • Anshul Kundaje, Stanford University, United States


Presentation Overview:

Deep learning models can accurately map regulatory DNA to genome-wide profiles of cellular processes. However, the influence of dataset and model design choices on the robustness and reliability of sequence features learned by these models remains unexplored. We demonstrate that contemporary design choices result in models that learn spurious sequence features that violate biologically valid causal interpretation. This phenomenon, which we term “feature leakage”, afflicts several state-of-the-art deep learning models of regulatory DNA such as DeepSEA, Enformer and scBasset. We identify two key design choices that result in feature leakage: biased selection of training genomic loci, and joint training of multiple tasks using canonical multi-task architectures and loss functions. Biased locus selection, such as training on the union of regulatory events (e.g. DNase-seq peaks) from multiple cellular contexts, creates an artificial enrichment and depletion of motifs of key lineage-defining TFs that are mutually exclusive across cell contexts, resulting in spurious leakage of non-causal features across tasks. Orthogonally, conventional multi-tasking architectures result in entangled shared representations that are unable to separate individual predictive sequence features. To mitigate feature leakage, we propose training single-task models using background locus selection strategies that are not biased towards regulatory events in specific cell contexts.

11:50-12:00
Massively parallel characterization of transcriptional regulatory elements in three diverse human cell types.
Room: Lumière Auditorium
Format: Live from venue

Moderator(s): Oznur Tastan

  • Vikram Agarwal, Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA, United States
  • Fumitaka Inoue, Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, CA 94158, USA, Japan
  • Max Schubach, Berlin Institute of Health at Charité - Universitätsmedizin Berlin, 10178, Berlin, Germany, Germany
  • Beth K. Martin, Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA, United States
  • Pyaree Mohan Dash, Berlin Institute of Health at Charité - Universitätsmedizin Berlin, 10178, Berlin, Germany, Germany
  • Zicong Zhang, Institute for the Advanced Study of Human Biology (WPI-ASHBi), Kyoto University, Kyoto, Japan, Japan
  • Ajuni Sohota, Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, CA 94158, USA, United States
  • William Stafford Noble, Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA, United States
  • Galip Gürkan Yardimci, Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA, United States
  • Martin Kircher, Berlin Institute of Health at Charité - Universitätsmedizin Berlin, 10178, Berlin, Germany, Germany
  • Jay Shendure, Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA, United States
  • Nadav Ahituv, Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, CA 94158, USA, United States


Presentation Overview:

The human genome contains millions of candidate cis-regulatory elements (CREs) with cell-type-specific activities that shape both health and myriad disease states. However, we lack a functional understanding of the sequence features that control the activity and cell-type specific features of these CREs. Here, we used lentivirus-based massively parallel reporter assays (lentiMPRAs) to test the regulatory activity of over 680,000 sequences, representing a comprehensive set of all annotated CREs among three cell types (HepG2, K562, and WTC11), finding 41.7% to be functional. By testing sequences in both orientations, we find promoters to have significant strand orientation effects. We also observe that their 200 nucleotide cores function as non-cell-type specific ‘on switches’ providing similar expression levels to their associated gene. In contrast, enhancers have weaker orientation effects, but increased tissue-specific characteristics. Utilizing our lentiMPRA data, we developed sequence-based models to predict CRE function with high accuracy and delineate regulatory motifs. Testing an additional lentiMPRA library encompassing 60,000 CREs in all three cell types, we further identified factors that determine cell-type specificity. Collectively, our work provides an exhaustive catalog of functional CREs in three widely used cell lines and highlights how large-scale functional measurements can be leveraged to dissect the regulatory grammar.

12:00-12:10
Chrome-Zoo: cross-species chromatin profile prediction using DNA Zoo data
Room: Lumière Auditorium
Format: Live from venue

Moderator(s): Oznur Tastan

  • Anupama Jha, Department of Genome Sciences, University of Washington, Seattle, WA, USA, United States
  • Jacob Schreiber, Stanford University School of Medicine, Stanford, CA, USA, United States
  • Olga Dudchenko, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA, United States
  • Georgi K. Marinov, Department of Genetics, Stanford University, Stanford, CA, USA, United States
  • Anshul Kundaje, Department of Genetics, Stanford University, Stanford, CA, USA, United States
  • William J. Greenleaf, Department of Genetics, Stanford University, Stanford, CA, USA, United States
  • Erez S. Lieberman Aiden, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA, United States
  • William Stafford Noble, Department of Genome Sciences, University of Washington, Seattle, WA, USA, United States


Presentation Overview:

DNA Zoo is a large-scale project that has used Hi-C measurements to create draft genomic assemblies for over 500 species and has collected ATAC-seq datasets for a subset of 100 of those species. Hi-C and ATAC-seq data provide complementary views of chromatin organization, with Hi-C identifying pairwise genomic interactions and ATAC-seq measuring local chromatin accessibility. However, collecting both Hi-C and ATAC-seq data for all species is impractical due to cost and sample availability constraints. To address this challenge, we propose a deep tensor factorization model called Chrome-Zoo that can translate between ATAC-seq and Hi-C in species where only one of these assays is available. We address the challenges associated with handling multiple genomes by training an additional model that converts nucleotide sequences to learned genomic embeddings that are consistent across species. Using these learned genome embeddings, we trained Chrome-Zoo on 18 species with available genome assemblies, Hi-C and ATAC-seq datasets. We show that our model can successfully translate between Hi-C and ATAC-seq in new species at coarse (100 kb) and fine (1–10 kb) resolutions.

12:10-12:20
ProCapNet: Dissecting the cis-regulatory syntax of transcription initiation with deep learning
Room: Lumière Auditorium
Format: Live from venue

Moderator(s): Oznur Tastan

  • Kelly Cochran, Stanford University, Department of Computer Science, United States
  • Melody Yin, The Harker School, United States
  • Jacob Schreiber, Stanford University, Department of Genetics, United States
  • Anshul Kundaje, Stanford University, Departments of Genetics & Computer Science, United States


Presentation Overview:

The DNA sequence determinants of mammalian Pol II transcription initiation remain incompletely understood. Although we've identified overrepresented motifs in promoters, a third of human promoters contain no known motifs; in promoters with known motifs, how sequence translates into TSS positioning and promoter activity is poorly characterized. We know even less about initiation at enhancers. To address these knowledge gaps, we trained a deep learning model, ProCapNet, to predict transcription initiation, measured genome-wide at base-resolution by PRO-cap, from DNA sequence. ProCapNet accurately predicts exact TSS locations and initiation rate consistently across promoter classes and at enhancers. We next applied a model interpretation framework to identify a high-sensitivity collection of motifs predictive of transcription initiation. Then, to dissect how these motifs modulate initiation, we performed systematic in silico mutational experiments. Results suggest nuanced epistasis: motifs play specialized roles, dependent on other nearby motifs. For multiple motifs, we identified a novel secondary function as direct initiation sites. We quantified the contribution of motifs to TSS positioning and initiation rate, finding motif-specific positioning signatures that suggest a general rule of redistribution. Finally, we compared the sequence determinants of initiation in promoters vs. enhancers; results support a unified model of cis-regulatory syntax for transcription initiation.

12:20-12:30
Getting Personal with Epigenetics: Towards Individual-specific Epigenomic Imputation with Machine Learning
Room: Lumière Auditorium
Format: Live from venue

Moderator(s): Oznur Tastan

  • Alex Hawkins-Hooker, University College London, United Kingdom
  • Giovanni Visonà, Max Planck Institute for Intelligent Systems, Tübingen, Germany
  • Tanmayee Narendra, University of Dundee, United Kingdom
  • Mateo Rojas-Carulla, Lakera AI, Switzerland
  • Bernhard Schölkopf, Max Planck Institute for Intelligent Systems, Tübingen, Germany
  • Gabriele Schweikert, University of Dundee, University of Tübingen, United Kingdom


Presentation Overview:

Epigenetic modifications are dynamic mechanisms involved in the regulation of gene expression. Unlike the DNA sequence, epigenetic patterns vary not only between individuals, but also between different cell types within an individual. Epigenetic changes are reversible and thus promising therapeutic targets for precision medicine. However, mapping efforts to determine an individual’s cell-type-specific epigenome are constrained by experimental costs and tissue accessibility. We developed eDICE, a deep-learning model that employs attention mechanisms to impute epigenomic tracks. eDICE is trained to reconstruct masked epigenomic tracks within sets of epigenomic measurements derived from large-scale mapping efforts. By learning to encode the epigenomic signal at a particular genomic position into factorised representations of the epigenomic state of each profiled cell type as well as the local activity profile of each epigenomic assay, eDICE is able to generate genome-wide imputations for the signal tracks of assays in cell types in which measurements are currently unavailable. We demonstrate improved performance relative to previous imputation methods on the reference Roadmap epigenomes, and additionally show that eDICE is able to predict individual-specific epigenetic patterns in unobserved tissues when trained on individual-specific epigenomes from ENTEx.

14:30-15:30
Panel: MLCSB Panel: Biological Foundation Models
Room: Lumière Auditorium
Format: Live from venue

Moderator(s): Anshul Kundaje

  • Bo Wang
  • Dana Pe'er
  • Mo Lotfollahi
16:00-17:00
Invited Presentation: Deep Geometric Methods for Learning Dynamics and Interactions from Cellular Data
Room: Lumière Auditorium
Format: Live-stream

Moderator(s): Magnus Rattray

  • Smita Krishnaswamy


Presentation Overview:

Abstract: In this talk we show how to learn the underlying geometry of data using heat diffusion, which can represent each point by a distribution of transition probabilities that are proportional to manifold distance. This distribution renders point cloud data into a statistical manifold. Then we show how to derive low dimensional visualizations and embeddings of such data using divergences between such datapoint transition probabilities. I will then cover recent work which learns a continuous model of such a statistical manifold using a neural network which is then used to learn the infinitesimal analog of such a divergence: a Fisher information metric. Next we show how to compare many such distributions using multiscale diffusion distances for optimal transport. Then we move from static to dynamic optimal transport using neural ODEs in order to learn dynamic trajectories from static snapshot data—a key problem in inference from single cell data. Finally, we will show how to characterize and classify dynamic regimes by combining geometry with topology to characterize the data. Throughout the talk, we present examples of such techniques being applied to massively high throughput and high dimensional datasets from biology and medicine including in cancer and neurodegeneration.
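
As a rough illustration of the first idea in this abstract, the sketch below turns a small point cloud into per-point transition probability distributions and then diffuses them for a few steps. The Gaussian kernel, the `bandwidth` parameter, and the function names are assumptions made for illustration; the abstract does not specify the kernel or implementation used in the talk.

```python
import math

def diffusion_operator(points, bandwidth=1.0):
    """Row-normalized Gaussian kernel (an assumed kernel choice).

    P[i][j] is the probability of stepping from point i to point j.
    """
    rows = []
    for p in points:
        affinities = [
            math.exp(-sum((a - b) ** 2 for a, b in zip(p, q)) / bandwidth ** 2)
            for q in points
        ]
        total = sum(affinities)
        rows.append([a / total for a in affinities])
    return rows

def diffuse(P, t):
    """t-step transition probabilities, computed as the matrix power P^t."""
    n = len(P)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(t):
        result = [
            [sum(result[i][k] * P[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)
        ]
    return result

# Toy point cloud: two nearby points and one distant point.
points = [(0.0, 0.0), (0.5, 0.0), (5.0, 0.0)]
P = diffusion_operator(points)
P3 = diffuse(P, 3)
```

Each row of `P` (and of `P3`) is a probability distribution over the points, so every point is represented by a distribution rather than by its raw coordinates.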

17:00-17:20
Proceedings Presentation: AttOmics: Attention-based architecture for diagnosis and prognosis from Omics data
Room: Lumière Auditorium
Format: Live from venue

Moderator(s): Magnus Rattray

  • Aurélien Beaude, Université Paris Saclay, France
  • Milad Rafiee Vahid, Sanofi R&D Data and Data Science, United States
  • Franck Augé, Sanofi R&D Data and Data Science, France
  • Farida Zehraoui, Université d'Évry, France
  • Blaise Hanczar, Université d'Évry, France


Presentation Overview:

The increasing availability of high-throughput omics data makes it possible to envision a medicine centered on individual patients. Precision medicine relies on exploiting these high-throughput data with machine-learning models, especially the ones based on deep-learning approaches, to improve diagnosis. Due to the high-dimensional small-sample nature of omics data, current deep-learning models end up with many parameters and have to be fitted with a limited training set. Furthermore, in these models, interactions between molecular entities inside an omics profile are not patient-specific but are the same for all patients.

In this article, we propose AttOmics, a new deep-learning architecture based on the self-attention mechanism. First, we decompose each omics profile into a set of groups, where each group contains related features. Then, by applying the self-attention mechanism to the set of groups, we can capture the different interactions specific to a patient. The results of different experiments carried out in this paper show that our model can accurately predict the phenotype of a patient with fewer parameters than deep neural networks. Visualizing the attention maps can provide new insights into the essential groups for a particular phenotype.
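
The grouping and self-attention steps described above can be sketched as follows. This is only a minimal illustration, not the AttOmics architecture: it assumes groups are contiguous, equal-sized chunks of the profile and uses identity query/key/value projections with no learned parameters, and the function names are hypothetical.

```python
import math

def split_into_groups(profile, group_size):
    """Split a feature vector into contiguous groups (assumed grouping rule)."""
    return [profile[i:i + group_size] for i in range(0, len(profile), group_size)]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(groups):
    """Scaled dot-product self-attention over groups.

    Queries, keys and values are the groups themselves (identity
    projections, an assumption made to keep the sketch short).
    """
    d = len(groups[0])
    weights = []
    for q in groups:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in groups]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * v[j] for w, v in zip(row, groups)) for j in range(d)]
        for row in weights
    ]
    return outputs, weights

# Toy omics profile for one patient: 6 features in 3 groups of 2.
profile = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
groups = split_into_groups(profile, group_size=2)
outputs, weights = self_attention(groups)
```

Because the attention weights are computed from each patient's own profile, the `weights` matrix differs from patient to patient, which is the sense in which the captured interactions are patient-specific.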

17:20-17:40
Proceedings Presentation: Robust reconstruction of single cell RNA-seq data with iterative gene weight updates
Room: Lumière Auditorium
Format: Live-stream

Moderator(s): Magnus Rattray

  • Yueqi Sheng, School of Engineering and Applied Sciences, Harvard University, Boston, MA, 02134, USA, United States
  • Boaz Barak, School of Engineering and Applied Sciences, Harvard University, Boston, MA, 02134, USA, United States
  • Mor Nitzan, The Hebrew University of Jerusalem, Jerusalem, 9190401, Israel, Israel


Presentation Overview:

Single-cell RNA-sequencing technologies have greatly enhanced our understanding of heterogeneous cell populations and underlying regulatory processes. However, structural (spatial or temporal) relations between cells are lost during cell dissociation. These relations are crucial for identifying associated biological processes. Many existing tissue-reconstruction algorithms use prior information about subsets of genes that are informative with respect to the structure or process to be reconstructed. When such information is not available, and in the general case when the input genes code for multiple processes and are susceptible to noise, biological reconstruction is often computationally challenging.

We propose an algorithm that iteratively identifies manifold-informative genes using existing reconstruction algorithms for single-cell RNA-seq data as a subroutine. We show that our algorithm improves the quality of tissue reconstruction for diverse synthetic and real scRNA-seq data, including data from the mammalian intestinal epithelium and liver lobules.

17:40-17:50
Paired single-cell multi-omics data integration with Mowgli
Room: Lumière Auditorium
Format: Live from venue

Moderator(s): Magnus Rattray

  • Geert-Jan Huizing, Institut Pasteur, Université Paris Cité, CNRS UMR 3738, Machine Learning for Integrative Genomics Group, France
  • Ina Maria Deutschmann, Institut de Biologie de l’Ecole Normale Supérieure, CNRS, INSERM, Ecole Normale Supérieure, Université PSL, France
  • Gabriel Peyre, CNRS and Département de mathématiques et applications de l’Ecole Normale Supérieure, Université PSL, France
  • Laura Cantini, Institut Pasteur, Université Paris Cité, CNRS UMR 3738, Machine Learning for Integrative Genomics Group, France


Presentation Overview:

The profiling of multiple molecular layers from the same set of cells has recently become possible. There is thus a growing need for multi-view learning methods able to jointly analyze such data. We here present Multi-Omics Wasserstein inteGrative anaLysIs (Mowgli), a novel method for the integration of paired multi-omics data with any type and number of omics. Of note, Mowgli combines integrative Nonnegative Matrix Factorization (NMF) and Optimal Transport (OT), enhancing at the same time the clustering performance and interpretability of integrative NMF. We apply Mowgli to multiple paired single-cell multi-omics data profiled with 10X Multiome, CITE-seq, and TEA-seq. Our in-depth benchmark demonstrates that Mowgli’s performance is competitive with the state-of-the-art in cell clustering and superior to the state-of-the-art once considering biological interpretability. Mowgli is implemented as a Python package seamlessly integrated within the scverse ecosystem and it is available at http://github.com/cantinilab/mowgli.

17:50-18:00
Machine learning-based informative feature selection for single cell RNA sequencing cell type characterization
Room: Lumière Auditorium
Format: Live from venue

Moderator(s): Magnus Rattray

  • Richard H. Scheuermann, J. Craig Venter Institute, United States
  • Yun Zhang, J. Craig Venter Institute, United States
  • Kayva Chegireddy, J. Craig Venter Institute, United States
  • Jeremy Miller, Allen Institute for Brain Science, United States
  • Michael Hawrylycz, Allen Institute for Brain Science, United States
  • Ed Lein, Allen Institute for Brain Science, United States


Presentation Overview:

Single cell transcriptomics is revolutionizing our understanding of the cellular complexity of complex tissues at a systems level. As cells are classified into clusters based on similar gene expression profiles, there is a need to identify cell type-specific biomarkers to reliably identify and match cells of the same type in new experiments. We’ve developed an algorithm – NS-Forest – that leverages the explainable characteristics of random forest machine learning to capture the most informative gene expression feature combinations that maximize cell type classification accuracy. Applied to several human reference datasets from brain, lung, and kidney, NS-Forest selects on average ~2.5 marker genes per cell type for optimal classification. These cell type marker genes can be used as targets for spatial transcriptomics cell localization, as definitional characteristics for semantic cell type representation in the Provisional Cell Ontology (https://bioportal.bioontology.org/ontologies/PCL), and as a reduced feature space for cell type matching between datasets. Using NS-Forest markers and the multivariate statistical graph algorithm – FR-Match – to compare human middle temporal gyrus and primary motor cortex, we find that the majority of GABAergic inhibitory neuron and glial cell types are well conserved across cortical brain regions, whereas the glutamatergic excitatory neuron types are region specific.

Tuesday, July 25th
10:30-11:30
Invited Presentation: Some Thoughts on Machine Learning-based Protein Engineering
Room: Lumière Auditorium
Format: Live from venue

Moderator(s): Karsten Borgwardt

  • Jennifer Listgarten


Presentation Overview:

Machine learning-based design has gained traction in the sciences, most notably in the design of small molecules, materials, and proteins, with societal implications spanning drug development and manufacturing, plastic degradation, and carbon sequestration. When designing objects to achieve novel property values with machine learning, one faces a fundamental challenge: how to push past the frontier of current knowledge, distilled from the training data into the model, in a manner that rationally controls the risk of failure. In addition to discussing this topic, I will also give an overview of the different ways in which machine learning is being developed and applied to protein engineering.

11:30-11:50
Proceedings Presentation: SynBa: Improved estimation of drug combination synergies with uncertainty quantification
Room: Lumière Auditorium
Format: Live from venue

Moderator(s): Karsten Borgwardt

  • Haoting Zhang, University of Cambridge, United Kingdom
  • Carl Henrik Ek, University of Cambridge, United Kingdom
  • Magnus Rattray, University of Manchester, United Kingdom
  • Marta Milo, Oncology R&D AstraZeneca, United Kingdom


Presentation Overview:

There exists a range of different quantification frameworks to estimate the synergistic effect of drug combinations. The diversity and disagreement in estimates make it challenging to determine which combinations from a large drug screening should be taken forward. Furthermore, the lack of accurate uncertainty quantification for those estimates precludes the choice of optimal drug combinations based on the most favourable synergistic effect. In this work, we propose SynBa, a flexible Bayesian approach to estimate the uncertainty of the synergistic efficacy and potency of drug combinations, so that actionable decisions can be derived from the model outputs. The actionability is enabled by incorporating the Hill equation into SynBa, so that the parameters representing the potency and the efficacy can be preserved. Existing knowledge may be conveniently inserted due to the flexibility of the prior, as shown by the empirical Beta prior defined for the normalised maximal inhibition. Through experiments on large combination screenings and comparison against benchmark methods, we show that SynBa provides improved accuracy of dose-response predictions and better-calibrated uncertainty estimation for the parameters and the predictions.
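
For readers unfamiliar with the Hill equation mentioned above, the sketch below shows one common four-parameter form of a dose-response curve, in which `ic50` plays the role of potency and `e_inf` the role of efficacy. The parameter names and this particular parameterization are assumptions for illustration; the abstract does not state the exact form used in SynBa.

```python
def hill_response(dose, e0, e_inf, ic50, hill):
    """Four-parameter Hill curve (one common parameterization, assumed here).

    e0    : response with no drug
    e_inf : response at saturating dose (efficacy)
    ic50  : dose giving the half-maximal effect (potency)
    hill  : Hill coefficient controlling the steepness of the curve
    """
    if dose <= 0:
        return e0
    return e0 + (e_inf - e0) / (1.0 + (ic50 / dose) ** hill)

# Toy example: viability falls from 100 to a floor of 20.
no_drug = hill_response(0.0, e0=100.0, e_inf=20.0, ic50=1.0, hill=2.0)
# At dose == ic50 the response is halfway between e0 and e_inf.
half_effect = hill_response(1.0, e0=100.0, e_inf=20.0, ic50=1.0, hill=2.0)
saturating = hill_response(1e6, e0=100.0, e_inf=20.0, ic50=1.0, hill=2.0)
```

Keeping `ic50` and `e_inf` as explicit parameters is what lets potency and efficacy be read directly from a fitted curve.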

11:50-12:10
Proceedings Presentation: SpatialSort: A Bayesian Model for Clustering and Cell Population Annotation of Spatial Proteomics Data
Room: Lumière Auditorium
Format: Live from venue

Moderator(s): Karsten Borgwardt

  • Eric Lee, Department of Molecular Oncology, BC Cancer Agency, Vancouver, BC, Canada, Canada
  • Kevin Chern, Department of Statistics, University of British Columbia, Vancouver, British Columbia, Canada, Canada
  • Michael Nissen, Terry Fox Laboratory, British Columbia Cancer Research Centre, Vancouver, British Columbia, Canada, Canada
  • Xuehai Wang, Terry Fox Laboratory, British Columbia Cancer Research Centre, Vancouver, British Columbia, Canada, Canada
  • Chris Huang, Translational Medicine Hematology, Bristol Myers Squibb, Summit NJ, USA, United States
  • Anita K. Gandhi, Translational Medicine Hematology, Bristol Myers Squibb, Summit NJ, USA, United States
  • Alexandre Bouchard-Côté, Department of Statistics, University of British Columbia, Vancouver, British Columbia, Canada, Canada
  • Andrew P. Weng, Terry Fox Laboratory, British Columbia Cancer Research Centre, Vancouver, British Columbia, Canada, Canada
  • Andrew Roth, Department of Molecular Oncology, BC Cancer Agency, Vancouver, BC, Canada, Canada


Presentation Overview:

Motivation: Recent advances in spatial proteomics technologies have enabled the profiling of dozens of proteins in thousands of single cells in situ. This has created the opportunity to move beyond quantifying the composition of cell types in tissue, and instead probe the spatial relationships between cells. However, current methods for clustering data from these assays only consider the expression values of cells and ignore the spatial context. Furthermore, existing approaches do not account for prior information about the expected cell populations in a sample.
Results: To address these shortcomings, we developed SpatialSort, a spatially aware Bayesian clustering approach that allows for the incorporation of prior biological knowledge. Our method is able to account for the affinities of cells of different types to neighbor in space, and by incorporating prior information about expected cell populations, it is able to simultaneously improve clustering accuracy and perform automated annotation of clusters. Using synthetic and real data, we show that by using spatial and prior information SpatialSort improves clustering accuracy. We also demonstrate how SpatialSort can perform label transfer between spatial and non-spatial modalities through the analysis of a real world diffuse large B-cell lymphoma dataset.

12:10-12:20
Whole Genome Deconvolution Unveils Alzheimer’s Resilient Epigenetic Signature
Room: Lumière Auditorium
Format: Live from venue

Moderator(s): Karsten Borgwardt

  • Eloise Berson, Stanford University, United States
  • Anjali Sreenivas, Stanford University, United States
  • Thanaphong Phongpreecha, Stanford University, United States
  • M. Ryan Corces, Gladstone Institute, United States
  • Nima Aghaeepour, Stanford University, United States
  • Thomas Montine, Stanford University, United States


Presentation Overview:

Assay for Transposase Accessible Chromatin by sequencing (ATAC-seq) provides an accurate way to depict the chromatin regulatory state and altered mechanisms guiding gene expression in disease. However, bulk sequencing entangles information from different cell types and obscures cellular heterogeneity. Here, we develop and validate Cellformer, a novel deep learning method that deconvolutes bulk ATAC-seq into cell type-specific expression across the whole genome. Cellformer enhances the resolution of bulk ATAC-seq and enables efficient, low-cost cell type-specific open chromatin profiling in large cohorts. Applied to 191 bulk samples from 3 brain regions, Cellformer identifies cell type-specific gene regulatory mechanisms and putative mediators involved in resilience to Alzheimer’s disease (RAD), which is observed in an uncommon group of cognitively healthy individuals who harbor a high pathological load of Alzheimer’s disease (AD). Cell type-resolved chromatin profiling unveils cell type-specific pathways and nominates potential epigenetic mediators underlying RAD that may illuminate therapeutic opportunities to limit the cognitive impact of this highly prevalent yet incurable disease. Cellformer has been made freely and publicly available to advance analysis of high-throughput bulk ATAC-seq in future investigations.

12:20-12:30
Graph-pMHC: Graph Neural Network Approach to MHC Class II Peptide Presentation and Antibody Immunogenicity
Room: Lumière Auditorium
Format: Live from venue

Moderator(s): Karsten Borgwardt

  • Will Thrift, Genentech, United States
  • Kai Liu, Genentech, United States


Presentation Overview:

Antigen presentation on MHC Class II (pMHCII presentation) plays an essential role in the adaptive immune response to extracellular pathogens and cancerous cells. But it can also reduce the efficacy of large-molecule drugs by triggering an anti-drug response. Significant progress has been made in pMHCII presentation modeling due to the collection of large-scale pMHC mass spectrometry datasets (ligandomes) and advances in machine learning. Here, we develop graph-pMHC, a graph neural network approach to predict pMHCII presentation. We derive adjacency matrices for pMHCII using Alphafold2-multimer, and address the peptide-MHC binding groove alignment problem with a simple graph enumeration strategy. We demonstrate that graph-pMHC dramatically outperforms methods with suboptimal inductive biases, such as the multilayer-perceptron-based NetMHCIIpan-4.0 (+22.84% average precision). Finally, we create an antibody drug immunogenicity dataset from clinical trial data, and develop a method for measuring anti-antibody immunogenicity risk using pMHCII presentation models. In comparison with BioPhi’s Sapiens score, a deep learning based measure of the humanness of an antibody drug, our strategy achieves a 7.14% ROC AUC improvement in predicting antibody drug immunogenicity.

14:10-14:30
Proceedings Presentation: Transfer Learning for Drug-Target Interaction Prediction
Room: Lumière Auditorium
Format: Live from venue

Moderator(s): Sushmita Roy

  • Alperen Dalkiran, Middle East Technical University, Turkey
  • Ahmet Atakan, Middle East Technical University, Turkey
  • Ahmet Süreyya Rifaioğlu, Heidelberg University, Germany
  • Maria Martin, EMBL-EBI, United Kingdom
  • Rengul Atalay, University of Chicago, United States
  • Aybar Acar, Middle East Technical University, Turkey
  • Tunca Dogan, Hacettepe University, Turkey
  • Volkan Atalay, Middle East Technical University, Turkey


Presentation Overview:

Utilizing AI-driven approaches for drug-target interaction (DTI) prediction requires large volumes of training data, which are not available for the majority of target proteins. In this study, we investigate the use of deep transfer learning for the prediction of interactions between drug candidate compounds and understudied target proteins with scarce training data. The idea here is to first train a deep neural network classifier with a generalized source training dataset of large size and then reuse this pre-trained neural network as an initial configuration for re-training/fine-tuning purposes with a small-sized specialized target training dataset. To explore this idea, we selected six protein families that have critical importance in biomedicine: kinases, G-protein-coupled receptors (GPCRs), ion channels, nuclear receptors, proteases, and transporters. The protein families of transporters and nuclear receptors were individually set as the target datasets, while the other five families were used as the source datasets. Several size-based target family training datasets were formed in a controlled manner. Here we present a disciplined evaluation by pre-training a feed-forward neural network with source training datasets and applying different modes of transfer learning from the pre-trained source network to a target dataset. The performance of deep transfer learning is evaluated and compared with that of training the same deep neural network from scratch. We found that when the training dataset is smaller than 100 compounds, transfer learning yields significantly better performance compared to training the system from scratch, suggesting an advantage to using transfer learning to predict binders to understudied targets.
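
The pre-train/fine-tune idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: a logistic-regression classifier stands in for the feed-forward network, the "source" and "target" datasets are synthetic, and all function names are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(weights, data):
    """Mean binary cross-entropy of a logistic model on (features, label) pairs."""
    total = 0.0
    for x, y in data:
        p = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
        p = min(max(p, 1e-12), 1 - 1e-12)  # guard against log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(data)

def train(data, init_weights, epochs=200, lr=0.1):
    """Full-batch gradient descent that starts from init_weights."""
    w = list(init_weights)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in data:
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
            for j, xj in enumerate(x):
                grad[j] += err * xj
        w = [wi - lr * g / len(data) for wi, g in zip(w, grad)]
    return w

def make_dataset(n, true_w, rng):
    """Synthetic data: random features plus a constant bias term (assumption)."""
    data = []
    for _ in range(n):
        x = [rng.uniform(-1, 1) for _ in range(len(true_w) - 1)] + [1.0]
        y = 1 if sum(w * xi for w, xi in zip(true_w, x)) > 0 else 0
        data.append((x, y))
    return data

rng = random.Random(0)
source = make_dataset(500, [2.0, -1.0, 0.5, 0.0], rng)  # large source set
target = make_dataset(20, [2.0, -1.2, 0.4, 0.1], rng)   # small, related target set

n_features = 4
pretrained = train(source, [0.0] * n_features)               # step 1: pre-train on source
fine_tuned = train(target, pretrained, epochs=20)            # step 2: fine-tune on target
from_scratch = train(target, [0.0] * n_features, epochs=20)  # baseline without transfer
```

Comparing `log_loss(fine_tuned, held_out)` with `log_loss(from_scratch, held_out)` on a separate target test set would mirror the comparison made in the study; the sketch leaves that evaluation out.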

14:30-14:50
Proceedings Presentation: DeepCoVDR: Deep transfer learning with graph transformer and cross-attention for predicting COVID-19 drug response
Room: Lumière Auditorium
Format: Live from venue

Moderator(s): Sushmita Roy

  • Zhijian Huang, Central South University, China
  • Pan Zhang, Central South University, China
  • Lei Deng, Central South University, China


Presentation Overview:

Motivation: The coronavirus disease 2019 (COVID-19) remains a global public health emergency. Although people, especially those with underlying health conditions, could benefit from several approved COVID-19 therapeutics, the development of effective antiviral COVID-19 drugs is still a very urgent problem. Accurate and robust drug response prediction to a new chemical compound is critical for discovering safe and effective COVID-19 therapeutics.

Results: In this study, we propose DeepCoVDR, a novel COVID-19 drug response prediction method based on deep transfer learning with graph transformer and cross-attention. First, we adopt a graph transformer and a feed-forward neural network to mine the drug and cell line information. Then, we use a cross-attention module that calculates the interaction between the drug and cell line. After that, DeepCoVDR combines the drug and cell line representations with their interaction features to predict drug response. To address the scarcity of SARS-CoV-2 data, we apply transfer learning and use the SARS-CoV-2 dataset to fine-tune the model pre-trained on the cancer dataset. Regression and classification experiments show that DeepCoVDR outperforms baseline methods. We also evaluate DeepCoVDR on the cancer dataset, and the results indicate that our approach performs well compared with other state-of-the-art methods. Moreover, we use DeepCoVDR to screen FDA-approved drugs for COVID-19 drug candidates and demonstrate its effectiveness in identifying novel COVID-19 drugs.
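
The cross-attention step mentioned above can be illustrated with plain scaled dot-product attention. This is a hedged sketch, not DeepCoVDR's code: the function names and toy embeddings are hypothetical, and the learned query/key/value projections of a real model are omitted for brevity.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Each query (e.g. a drug token) attends over keys/values (e.g. cell line tokens)."""
    d = len(keys[0])
    weights = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * v[j] for w, v in zip(row, values)) for j in range(len(values[0]))]
        for row in weights
    ]
    return outputs, weights

# Toy embeddings (assumptions): 2 drug tokens and 3 cell line tokens, dimension 2.
drug_tokens = [[1.0, 0.0], [0.5, 0.5]]
cell_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

outputs, weights = cross_attention(drug_tokens, cell_tokens, cell_tokens)
```

Each row of `weights` shows how strongly one drug token attends to each cell line token, and `outputs` holds the resulting interaction features.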

14:50-15:10
Proceedings Presentation: ArkDTA: Attention Regularization guided by non-Covalent Interactions for Explainable Drug-Target Binding Affinity Prediction
Room: Lumière Auditorium
Format: Live from venue

Moderator(s): Sushmita Roy

  • Mogan Gim, Korea University, South Korea
  • Junseok Choe, Korea University, South Korea
  • Seungheun Baek, Korea University, South Korea
  • Jueon Park, Korea University, South Korea
  • Chaeeun Lee, Korea University, South Korea
  • Minjae Ju, LG CNS, AI Research Center, South Korea
  • Sumin Lee, LG AI Research, South Korea
  • Jaewoo Kang, Korea University, South Korea


Presentation Overview:

Protein-ligand binding affinity prediction is an important task in drug design and development. Cross-modal attention mechanisms have become a core component of deep learning models due to the significance of model explainability. Non-covalent interactions, one of the key chemical aspects of this task, should be incorporated into the protein-ligand attention mechanism. We propose ArkDTA, a novel deep neural architecture for explainable binding affinity prediction guided by non-covalent interactions. Experimental results show that ArkDTA achieves predictive performance comparable to current state-of-the-art models while significantly improving model explainability. Qualitative investigation into our novel attention mechanism reveals that ArkDTA can identify potential regions for non-covalent interactions between candidate drug compounds and target proteins, and can guide the internal operations of the model in a more interpretable and domain-aware manner. ArkDTA is available at https://github.com/dmis-lab/ArkDTA

15:10-15:20
SELFormer: Molecular Representation Learning via SELFIES Language Models
Room: Lumière Auditorium
Format: Live from venue

Moderator(s): Sushmita Roy

  • Erva Ulusoy, Hacettepe University, Turkey
  • Atakan Yüksel, Hacettepe University, Turkey
  • Atabey Ünlü, Hacettepe University, Turkey
  • Tunca Doğan, Hacettepe University, Turkey


Presentation Overview:

The expensive and time-consuming nature of drug discovery necessitates the incorporation of innovative computational techniques into research and development pipelines. Representation learning has emerged as a promising solution for creating compact and informative numerical representations of molecules that can be utilized effectively in subsequent prediction tasks. However, current methods suffer from robustness and validity issues, primarily as a result of the input encoding or algorithms employed. In this study, we introduce SELFormer, a transformer-based chemical language model that takes as input SELFIES, a 100% valid, compact, and expressive notation. SELFormer is pre-trained on two million drug-like compounds in the ChEMBL database and evaluated on various molecular property prediction tasks. SELFormer demonstrated superior performance in predicting the aqueous solubility of molecules and adverse drug reactions compared to existing graph learning-based methods and SMILES-based chemical language models, while producing comparable results for other tasks. We shared SELFormer as a programmatic tool, along with its datasets and pre-trained models. Overall, our research demonstrates the benefits of combining a valid and expressive molecular notation with the appropriate deep learning architecture in chemical language modeling, thereby opening up new possibilities for discovering and designing novel drug candidates.
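
To make the input side of such a chemical language model concrete, the sketch below splits SELFIES strings into bracketed symbols and maps them to integer ids. It is an illustration under assumptions, not SELFormer's tokenizer: the function names and special tokens are hypothetical, the example strings are illustrative, and only single-fragment strings (no `.` separators) are handled.

```python
import re

def tokenize_selfies(selfies):
    """Split a SELFIES string into its bracketed symbols, e.g. '[C][=C]' -> ['[C]', '[=C]']."""
    tokens = re.findall(r"\[[^\[\]]+\]", selfies)
    if "".join(tokens) != selfies:
        raise ValueError("expected a string made only of bracketed SELFIES symbols")
    return tokens

def build_vocab(strings):
    """Assign an integer id to every symbol seen; ids 0 and 1 are reserved (assumption)."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for s in strings:
        for token in tokenize_selfies(s):
            vocab.setdefault(token, len(vocab))
    return vocab

def encode(selfies, vocab):
    """Convert a SELFIES string to token ids, falling back to <unk> for unseen symbols."""
    return [vocab.get(t, vocab["<unk>"]) for t in tokenize_selfies(selfies)]

benzene = "[C][=C][C][=C][C][=C][Ring1][=Branch1]"  # illustrative SELFIES string
ethanol = "[C][C][O]"                                # illustrative SELFIES string

vocab = build_vocab([benzene, ethanol])
ids = encode(ethanol, vocab)
```

Because every token is a complete SELFIES symbol, any id sequence a model emits can be joined back into a bracket-delimited string, which is what makes symbol-level tokenization a natural fit for this notation.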

15:20-15:30
Target Specific De Novo Design of Drug Candidate Molecules with Graph Transformer-based Generative Adversarial Networks
Room: Lumière Auditorium
Format: Live from venue

Moderator(s): Sushmita Roy

  • Atabey Ünlü, Hacettepe University, Turkey
  • Elif Çevrim, Hacettepe University, Turkey
  • Ahmet Sarıgün, Middle East Technical University, Turkey
  • Hayriye Çelikbilek, Hacettepe University, Turkey
  • Heval Ataş Güvenilir, Middle East Technical University, Turkey
  • Altay Koyaş, Middle East Technical University, Turkey
  • Deniz Cansen Kahraman, Middle East Technical University, Turkey
  • Abdurrahman Olgac, Evias Pharmaceutical R&D, Turkey
  • Ahmet Süreyya Rifaioğlu, Heidelberg University, Germany
  • Tunca Dogan, Hacettepe University, Turkey


Presentation Overview:

Discovering novel drug candidate molecules is one of the most fundamental and critical steps in drug development. Generative models offer high potential for designing de novo molecules; however, to be useful in real-life drug development pipelines, these models should be able to design target-specific molecules. In this study, we propose a novel generative system, DrugGEN, for de novo design of drug candidate molecules that interact with selected target proteins. The proposed system represents compounds and protein structures as graphs and processes them via two serially connected generative adversarial networks comprising graph transformers. The system is trained using two million compounds from ChEMBL and target-specific bioactive molecules, to design effective and specific inhibitory molecules against the AKT1 protein. DrugGEN performs competitively with or better than other methods on fundamental benchmarks. To assess the target-specific generation performance, we conducted further in silico analysis with molecular docking. The results indicate that the de novo molecules have high potential for interacting with the AKT1 protein structure at the level of its native ligand. DrugGEN can be used to design novel and effective target-specific drug candidate molecules for any druggable protein, given the target features and a dataset of known bioactive molecules.

16:00-17:00
Invited Presentation: Deep learning for biological sequences
Room: Lumière Auditorium
Format: Live from venue

Moderator(s): Su-In Lee

  • Jean-Philippe Vert, Owkin, France


Presentation Overview:

Deep neural networks are increasingly used to analyze biological sequences, including DNA, RNA and proteins, leading to promising applications in annotation, classification, structure prediction or generation. While the architectures of deep neural networks for biosequences have so far been largely borrowed from the field of natural language processing, I will discuss in this presentation some specificities of biosequences that deserve specific methodological developments, in particular 1) how to transform a biosequence into a sequence of tokens, 2) how to incorporate some known symmetries of biosequences in the architecture of the model, and 3) how to solve tasks which are specific to biosequences such as learning to align.
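
Two of the points above, tokenization and symmetry, can be illustrated on DNA. The sketch below is a generic example and not the speaker's method: it tokenizes a sequence into overlapping k-mers and builds a count representation that is invariant to reverse complementation, one well-known symmetry of double-stranded DNA. All function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def kmer_tokens(seq, k=3, stride=1):
    """Tokenize a sequence into k-mers taken every `stride` positions."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

def canonical_kmer_counts(seq, k=3):
    """Count k-mers after mapping each k-mer and its reverse complement to one key."""
    counts = {}
    for kmer in kmer_tokens(seq, k):
        key = min(kmer, reverse_complement(kmer))
        counts[key] = counts.get(key, 0) + 1
    return counts

seq = "GATTACA"
tokens = kmer_tokens(seq)
forward = canonical_kmer_counts(seq)
reverse = canonical_kmer_counts(reverse_complement(seq))
```

Here `forward` and `reverse` are equal, so a model fed these counts gives the same output for a sequence and its reverse complement. Building the symmetry into the representation is one option; building it into the architecture, as the abstract suggests, is another.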

17:00-17:20
Proceedings Presentation: COmic: Convolutional Kernel Networks for Interpretable End-to-End Learning on (Multi-)Omics Data
Room: Lumière Auditorium
Format: Live from venue

Moderator(s): Su-In Lee

  • Jonas Christian Ditz, University of Tübingen, Germany
  • Bernhard Reuter, University of Tübingen, Germany
  • Nico Pfeifer, University of Tübingen, Germany


Presentation Overview:

Motivation: The size of available omics datasets has steadily increased in recent years with technological advancement. While this increase in sample size can be used to improve the performance of relevant prediction tasks in healthcare, models that are optimized for large datasets usually operate as black boxes. In high-stakes scenarios, like healthcare, using a black-box model poses safety and security issues. Without an explanation about molecular factors and phenotypes that affected the prediction, healthcare providers are left with no choice but to blindly trust the models. We propose a new type of artificial neural network, named Convolutional Omics Kernel Network (COmic). By combining convolutional kernel networks with pathway-induced kernels, our method enables robust and interpretable end-to-end learning on omics datasets ranging in size from a few hundred to several hundred thousand samples. Furthermore, COmic can be easily adapted to utilize multi-omics data.

Results: We evaluated the performance capabilities of COmic on six different breast cancer cohorts. Additionally, we trained COmic models on multi-omics data using the METABRIC cohort. Our models performed better than or similarly to competitors on both tasks. We show how the use of pathway-induced Laplacian kernels opens the black-box nature of neural networks and results in intrinsically interpretable models that eliminate the need for post-hoc explanation models.

17:20-17:30
Structure-Independent Peptide Binder Design via Generative Language Models
Room: Lumière Auditorium
Format: Live from venue

Moderator(s): Su-In Lee

  • Garyk Brixi, Duke University, United States
  • Kalyan Palepu, Duke University, United States
  • Suhaas Bhat, Duke University, United States
  • Sophia Vincoff, Duke University, United States
  • Tianlai Chen, Duke University, United States
  • Lauren Hong, Duke University, United States
  • Vivian Yudistyra, Duke University, United States
  • Pranam Chatterjee, Duke University, United States


Presentation Overview:

The ability to modulate pathogenic proteins represents a powerful treatment strategy for diseases. Unfortunately, many proteins are considered “undruggable” by small molecules, and are often intrinsically disordered, precluding the usage of structure-based tools for binder design. To address these challenges, we have developed a suite of algorithms that enable the design of target-specific peptides via protein language model embeddings, without the requirement of 3D structures. First, we train a model that leverages ESM-2 embeddings to efficiently select high-affinity peptides from natural protein interaction interfaces. We experimentally fuse model-derived peptides to E3 ubiquitin ligases and identify candidates exhibiting robust degradation of undruggable targets in human cells. Next, we develop a high-accuracy discriminator, based on the CLIP architecture, to prioritize and screen peptides with selectivity to a specified target protein. As input to the discriminator, we create a Gaussian diffusion generator to sample an ESM-2-based latent space, fine-tuned on experimentally-valid peptide sequences. Finally, to enable de novo generation of binding peptides, we train an instance of GPT-2 with protein interacting sequences to enable peptide generation conditioned on target sequence. Our model demonstrates low perplexities across both existing and generated peptide sequences. Together, our work lays the foundation for programmable protein targeting applications.

17:30-17:40
Reliable interpretability of deep learning on biological networks
Room: Lumière Auditorium
Format: Live from venue

Moderator(s): Su-In Lee

  • Nikolaus Fortelny, University of Salzburg, Austria
  • Wolfgang Esser-Skala, University of Salzburg, Austria


Presentation Overview:

BACKGROUND: Deep learning is powerful, but interpretability remains a challenge. A unique approach for interpretability builds on biological knowledge to construct the computational graph of a neural network such that hidden nodes represent biological entities (pathways). After training, such “biology-inspired” neural networks reveal biological pathways involved in a given process (cancer).

MOTIVATION: Biology-inspired models provide an unprecedented ability to interpret hidden nodes, in contrast to common approaches that interpret input features. However, critical elements of interpretability remain unsolved. First, the random initialization of weights limits the robustness of interpretations. Second, biases in biological knowledge favor highly connected hidden nodes. Yet, despite their critical relevance, robustness and network biases are largely unstudied.

METHODS: We developed methods to assess and control robustness and network biases, and validated them in state-of-the-art biology-inspired models to evaluate their impact on interpretations.

RESULTS: We demonstrate that controlling both robustness and biases is required for reliable interpretability. We find that the impact of robustness and biases on interpretations depends on the difficulty of the prediction task, and we identify which network biases most affect interpretations. Together, these results reveal critical elements of interpretability that may be relevant beyond the special case of biology-inspired deep learning.

17:40-17:50
Cracking the black box of deep sequence-based protein-protein interaction prediction
Room: Lumière Auditorium
Format: Live from venue

Moderator(s): Su-In Lee

  • Judith Bernett, Technical University of Munich, Germany
  • David B. Blumenthal, Friedrich-Alexander University Erlangen-Nürnberg, Germany
  • Markus List, Technical University of Munich, Germany


Presentation Overview:

Identifying protein-protein interactions (PPIs) is crucial for deciphering biological pathways and their dysregulation. Numerous prediction methods have been developed as a cheap alternative to biological experiments, reporting phenomenal accuracy estimates. While most methods rely exclusively on sequence information, PPIs occur in 3D space. As predicting protein structure from sequence is an infamously complex problem, the almost perfect reported performances for PPI prediction seem dubious. We systematically investigated how much reproducible deep learning models depend on data leakage, sequence similarities, and node degree information and compared them to basic machine learning models. We found that overlaps between training and test sets resulting from random splitting lead to strongly overestimated performances, giving a false impression of the field. In this setting, models learn solely from sequence similarities and node degrees. When data leakage is avoided by minimizing sequence similarities between the training and test sets, performances become random, leaving this research field wide open.
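
The leakage caused by random splitting can be made visible with a small check. The sketch below is an illustration under assumptions, not the authors' pipeline: the protein names and pairs are synthetic, the function names are hypothetical, and the leak-free split only removes shared protein identities. It does not reduce sequence similarity between the two sets, which the study additionally controls for.

```python
import random

def random_split(pairs, test_fraction, rng):
    """Shuffle interaction pairs and cut them into train/test, ignoring protein identity."""
    shuffled = pairs[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def protein_disjoint_split(pairs, test_fraction, rng):
    """Assign proteins (not pairs) to test; pairs that mix both groups are dropped."""
    proteins = sorted({p for pair in pairs for p in pair})
    rng.shuffle(proteins)
    test_proteins = set(proteins[:int(len(proteins) * test_fraction)])
    train = [(a, b) for a, b in pairs if a not in test_proteins and b not in test_proteins]
    test = [(a, b) for a, b in pairs if a in test_proteins and b in test_proteins]
    return train, test

def shared_proteins(train, test):
    """Proteins that occur in both sets, i.e. a source of leakage."""
    return {p for pair in train for p in pair} & {p for pair in test for p in pair}

rng = random.Random(0)
names = [f"P{i}" for i in range(30)]                           # synthetic protein ids
pairs = [tuple(rng.sample(names, 2)) for _ in range(200)]      # synthetic interactions

leaky = shared_proteins(*random_split(pairs, 0.2, rng))
clean = shared_proteins(*protein_disjoint_split(pairs, 0.3, rng))
```

With the random split, `leaky` contains many proteins seen in both sets, whereas `clean` is empty by construction. The price of the disjoint split is that mixed pairs are discarded, which shrinks the usable data.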

17:50-18:00
G4mismatch: Deep neural networks to predict G-quadruplex propensity based on G4-seq data
Room: Lumière Auditorium
Format: Live from venue

Moderator(s): Su-In Lee

  • Mira Barshai, Ben-Gurion University, Israel
  • Barak Engel, Ben-Gurion University, Israel
  • Idan Haim, Ben-Gurion University, Israel
  • Yaron Orenstein, Bar-Ilan University, Israel


Presentation Overview:

G-quadruplexes are non-B-DNA structures that form in the genome by Hoogsteen bonds between guanines in single or multiple DNA strands. The functions of G-quadruplexes are linked to various molecular and disease phenotypes, and thus researchers are interested in measuring G-quadruplex formation genome-wide. Experimentally measuring G-quadruplexes is a long and laborious process. Computational prediction of G-quadruplex propensity from a given DNA sequence is thus a long-standing challenge. Unfortunately, despite the availability of high-throughput datasets measuring G-quadruplex propensity in the form of mismatch scores, extant methods to predict G-quadruplex formation either rely on small datasets or are based on domain-knowledge rules. We developed G4mismatch, a novel algorithm to accurately and efficiently predict G-quadruplex propensity for any genomic sequence. G4mismatch is based on a convolutional neural network trained on millions of human genomic loci measured in a single G4-seq experiment. When tested on sequences from a held-out chromosome, G4mismatch, the first method for the task, achieved a Pearson correlation of over 0.8. Moreover, when tested in detecting G-quadruplexes genome-wide using the predicted mismatch scores, G4mismatch achieved superior performance compared to extant methods. Last, we demonstrate the ability to deduce the mechanism behind G-quadruplex formation by unique visualization of the principles learned by the model.
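
The held-out chromosome evaluation described above can be sketched as follows. This is a minimal illustration, not the authors' code: the record layout, function names, and all numbers are toy assumptions and are not results from the study.

```python
import math

def split_by_chromosome(records, held_out):
    """Keep every locus on the held-out chromosome out of the training set."""
    train = [r for r in records if r["chrom"] != held_out]
    test = [r for r in records if r["chrom"] == held_out]
    return train, test

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)

# Toy loci with observed and predicted mismatch scores (made-up values).
records = [
    {"chrom": "chr1", "observed": 5.0, "predicted": 6.0},
    {"chrom": "chr1", "observed": 40.0, "predicted": 35.0},
    {"chrom": "chr2", "observed": 10.0, "predicted": 12.0},
    {"chrom": "chr2", "observed": 20.0, "predicted": 22.0},
    {"chrom": "chr2", "observed": 30.0, "predicted": 32.0},
]

train, test = split_by_chromosome(records, held_out="chr2")
r = pearson([t["observed"] for t in test], [t["predicted"] for t in test])
```

Splitting by chromosome rather than by random locus keeps neighboring or overlapping sequences from appearing in both training and test data, so the correlation reflects generalization to unseen genomic regions.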