MLCSB COSI

Schedule subject to change
All times listed are in CDT
Wednesday, July 13th
10:30-11:30
Keynote Presentation: Computational analysis of tumor single-cell sequencing data
Room: Madison CD
Format: Live-stream

Moderator(s): Marinka Zitnik

  • Niko Beerenwinkel, ETH Zurich, Switzerland


Presentation Overview:

Cancer progression is an evolutionary process that is characterized by the accumulation of genetic alterations and is responsible for tumor growth, clinical progression, and the development of drug resistance. We discuss how to reconstruct the evolutionary history of a tumor from single-cell sequencing data and present probabilistic models and efficient inference algorithms for mutation calling and learning tumor phylogenies from mutation and copy number data. We present methods for integrating single-cell DNA and RNA data obtained from tumor biopsies and for detecting common patterns of tumor evolution among patients, including re-occurring evolutionary trajectories and clonally exclusive mutations.

11:30-11:50
Proceedings Presentation: psupertime: supervised pseudotime analysis for time-series single cell RNA-seq data
Room: Madison CD
Format: Live from venue

Moderator(s): Marinka Zitnik

  • Will Macnair, ETH, Switzerland
  • Revant Gupta, University of Tübingen, Germany
  • Manfred Claassen, University of Tübingen, Germany


Presentation Overview:

Improvements in single cell RNA-seq technologies mean that studies measuring multiple experimental conditions, such as time series, have become more common. At present, few computational methods exist to infer time series-specific transcriptome changes, and such studies have therefore typically used unsupervised pseudotime methods. While these methods identify cell subpopulations and the transitions between them, they are not appropriate for identifying the genes which vary coherently along the time series. In addition, the orderings they estimate are based only on the major sources of variation in the data, which may not correspond to the processes related to the time labels.
We introduce psupertime, a supervised pseudotime approach based on a regression model, which explicitly uses time series labels as input. It identifies genes that vary coherently along a time series, in addition to pseudotime values for individual cells, and a classifier which can be used to estimate labels for new data with unknown or differing labels. We show that psupertime outperforms benchmark classifiers in terms of identifying time-varying genes, and provides better individual cell orderings than popular unsupervised pseudotime techniques. psupertime is applicable to any single cell RNA-seq dataset with sequential labels (principally time series but also drug dosage and disease progression, for example), derived from experimental design, and provides a fast, interpretable tool for targeted identification of genes varying along with specific biological processes.

11:50-12:00
Transcriptomic forecasting over short time periods using neural ODEs
Room: Madison CD
Format: Live from venue

Moderator(s): Marinka Zitnik

  • Rossin Erbe, Johns Hopkins School of Medicine, United States
  • Genevieve Stein-O'Brien, Johns Hopkins School of Medicine, United States
  • Elana Fertig, Johns Hopkins School of Medicine, United States


Presentation Overview:

Single-cell RNA-seq (scRNA-seq) technologies are often used to study changes in the molecular states that underlie cellular phenotypes. However, the transcript counts from scRNA-seq are obtained from each cell at a single instant in time. A single snapshot in time often does not provide sufficient information to understand the dynamic molecular processes the cell is undergoing. To address this challenge, we have developed a neural ordinary differential equation-based method, RNAForecaster, for predicting RNA expression states in single cells for multiple future time steps. We demonstrated that in 111 simulated single-cell expression data sets, RNAForecaster can accurately predict expression states up to two hundred simulated time steps downstream of the data it was trained on. Additionally, using metabolic labeling transcriptomic profiling data from human RPE cells, RNAForecaster was able to predict cellular progression through the cell cycle, with the predicted expression changes aligning with continuous measures of cell cycle at high statistical significance over a three-day prediction period. Thus, RNAForecaster enables predictions of future expression states in biological systems over short to medium time periods, which has the potential to yield significant insight into transcriptional dynamics.

12:00-12:10
CpG Transformer for imputation of single-cell methylomes
Room: Madison CD
Format: Live from venue

Moderator(s): Marinka Zitnik

  • Gaetan De Waele, Ghent University, Belgium
  • Jim Clauwaert, Ghent University, Belgium
  • Gerben Menschaert, Ghent University, Belgium
  • Willem Waegeman, Ghent University, Belgium


Presentation Overview:

The adoption of current single-cell DNA methylation sequencing protocols is hindered by incomplete coverage, underscoring the need for effective imputation techniques. The task of imputing single-cell (methylation) data requires models to build an understanding of underlying biological processes.
In this work, we adapt the transformer neural network architecture to operate on methylation matrices by combining axial attention with sliding window self-attention. The resulting CpG Transformer displays state-of-the-art performance on a wide range of scBS-seq and scRRBS-seq datasets. Furthermore, we demonstrate the interpretability of CpG Transformer and illustrate its rapid transfer learning properties, allowing practitioners to train models on new datasets with a limited computational and time budget.

12:10-12:20
FISHFactor: A Probabilistic Factor Model for Spatial Transcriptomics Data with Subcellular Resolution
Room: Madison CD
Format: Live from venue

Moderator(s): Marinka Zitnik

  • Florin Cornelius Walter, German Cancer Research Center (DKFZ), European Molecular Biology Laboratory (EMBL), Germany
  • Oliver Stegle, German Cancer Research Center (DKFZ), European Molecular Biology Laboratory (EMBL), Germany
  • Britta Velten, German Cancer Research Center (DKFZ), Germany


Presentation Overview:

Factor analysis is a widely used method for dimensionality reduction of datasets in molecular biology and has recently been adapted to spatial data. However, existing methods assume (count) matrices as input and are therefore not directly applicable to single molecule resolved data, which increasingly arise in the field of spatial transcriptomics and provide insight into the subcellular localization of individual RNA molecules. To address this, we propose FISHFactor, a probabilistic factor model that combines the benefits of spatial, non-negative factor analysis with a Poisson point process likelihood to explicitly model and account for the nature of single molecule resolved data. Given data from multiple segmented cells, FISHFactor infers a weight matrix that is shared across cells, while factors remain independent. This allows a consistent interpretation of factors and clustering of cells based on factor activities.
Using simulated data, we show that our approach leads to a more accurate estimate of the latent structure compared to methods that rely on aggregating information by spatial binning. We demonstrate on a real data set of cultured mouse embryonic fibroblast cells that FISHFactor identifies major subcellular expression patterns and accurately recovers known spatial gene clusters.

12:20-12:30
Unbiased discovery and annotation of cellular phenotypes in high-dimensional single-cell proteomic datasets
Room: Madison CD
Format: Live from venue

Moderator(s): Marinka Zitnik

  • Evan Greene, Ozette Technologies, United States
  • Greg Finak, Ozette Technologies, United States
  • Leonard D'Amico, Fred Hutchinson Cancer Research Center, United States
  • Nina Bhardwaj, Tisch Cancer Institute, Icahn School of Medicine at Mount Sinai, United States
  • Candice Church, Division of Dermatology, Department of Medicine, University of Washington, United States
  • Chihiro Morishima, Division of Dermatology, Department of Medicine, University of Washington, United States
  • Nirasha Ramchurren, Fred Hutchinson Cancer Research Center, United States
  • Janis Taube, Bloomberg Kimmel Institute for Cancer Immunotherapy and the Sidney Kimmel Comprehensive Cancer Center, United States
  • Paul Nghiem, Division of Dermatology, Department of Medicine, University of Washington, United States
  • Martin Cheever, Fred Hutchinson Cancer Research Center, United States
  • Steven Fling, Fred Hutchinson Cancer Research Center, United States
  • Raphael Gottardo, University of Lausanne and Lausanne University Hospital, Swiss Institute of Bioinformatics, Switzerland


Presentation Overview:

We present Full Annotation Using Shape-constrained Trees (FAUST), a recently published method for unbiased discovery and annotation of cellular phenotypes in high-dimensional cytometry datasets (https://doi.org/10.1016/j.patter.2021.100372). Analyzing such datasets is challenging as they can consist of hundreds of samples jointly profiling many millions of cells. To address these challenges, our method combines novel approaches for variable scoring, variable selection, clustering, cluster-matching, and feature selection [Fig 1]. We discuss how this allows FAUST to process samples independently from one another without sub-sampling, and match clusters on the basis of computationally derived phenotypes. We show how FAUST can automatically adapt to technical effects like batching as well as sample-to-sample variability that manifests as changes in the location and scale of the expression distributions. We demonstrate through simulation studies that FAUST can resolve phenotypes in the data that vary by orders of magnitude in their abundances. Also, we show that FAUST outperforms other methods on the FlowCAP-IV benchmark dataset. Finally, we discuss how FAUST has been used to analyze multiple high-dimensional cytometry datasets, including data generated in a Merkel cell carcinoma anti-PD-1 clinical trial. An open-source implementation of the FAUST method is available online (https://github.com/RGLab/faust).

14:30-15:30
Panel: Reproducibility and transparency in ML: Training the next generation
Room: Madison CD
Format: Live from venue

Moderator(s): Elana Fertig

  • Casey Greene
  • Florian Markowetz
  • Gabriella Rustici

16:00-17:00
Keynote Presentation: Image-based profiling for drug discovery: Cell Painting
Room: Madison CD
Format: Live from venue

Moderator(s): Marinka Zitnik

  • Anne Carpenter


Presentation Overview:

Cell images contain a vast amount of quantifiable information about the status of the cell: for example, whether it is diseased, whether it is responding to a drug treatment, or whether a pathway has been disrupted by a genetic mutation. We extract hundreds of features of cells from images. Just like transcriptional profiling, the similarities and differences in the patterns of extracted features reveal connections among diseases, drugs, and genes.

We are harvesting similarities in image-based profiles to identify, at a single-cell level, how diseases, drugs, and genes affect cells, which can uncover small molecules’ mechanism of action, discover gene functions, predict assay outcomes, discover disease-associated phenotypes, identify the functional impact of disease-associated alleles, and find novel therapeutic candidates. As part of the JUMP-Cell Painting Consortium (Joint Undertaking for Morphological Profiling-Cell Painting) we are aiming to establish experimental and computational best practices for image-based profiling (https://jump-cellpainting.broadinstitute.org/results) and produce the world’s largest public Cell Painting gene/compound image resource, with 140,000 perturbations in five replicates, to be released November 2022. With these data and new technologies like Pooled Cell Painting, we hope to bring drug discovery-accelerating applications to practice.

17:00-17:20
Proceedings Presentation: A Graph Neural Network Approach for Molecule Carcinogenicity Prediction
Room: Madison CD
Format: Live from venue

Moderator(s): Marinka Zitnik

  • Philip Fradkin, University of Toronto, Vector Institute, Canada
  • Adamo Young, University of Toronto, Vector Institute, Canada
  • Lazar Atanackovic, University of Toronto, Vector Institute, Canada
  • Brendan Frey, University of Toronto, Vector Institute, Canada
  • Leo Lee, University of Toronto, Vector Institute, Canada
  • Bo Wang, University of Toronto, Vector Institute, Canada


Presentation Overview:

Molecular carcinogenicity is a preventable cause of cancer, but systematically identifying carcinogenic compounds, which involves performing experiments on animal models, is expensive, time-consuming, and low-throughput. As a result, carcinogenicity information is fairly limited and building data-driven models with good prediction accuracy remains a major challenge. In this work, we propose CONCERTO, a deep learning model that uses a graph transformer in conjunction with a molecular fingerprint representation for carcinogenicity prediction from molecular structure. Special efforts have been made to overcome the data size constraint, such as enriching the training data with more informative labels, multi-round pre-training on related but lower quality mutagenicity data, and transfer learning from a large self-supervised model. Extensive experiments demonstrate that our model performs well and can generalize to external validation sets. CONCERTO could be useful for guiding future carcinogenicity experiments and could provide insight into the molecular basis of carcinogenicity.

17:20-17:30
Towards an in silico cell
Room: Madison CD
Format: Live from venue

Moderator(s): Marinka Zitnik

  • Yue Qin, University of California San Diego, United States
  • Emma Lundberg, KTH-Royal Institute of Technology, Sweden
  • Trey Ideker, University of California San Diego, United States


Presentation Overview:

The cell is a multi-scale structure with modular organization across at least four orders of magnitude. Two central approaches for mapping this structure – protein fluorescent imaging and protein biophysical association – each generate extensive datasets, but of distinct qualities and resolutions that are typically treated separately. Here, we integrate immunofluorescence images in the Human Protein Atlas (HPA) with affinity purifications in BioPlex to create a unified hierarchical map of human cell architecture. Integration is achieved by configuring each approach as a general measure of protein distance, then calibrating the two measures using machine learning. The map, called the Multi-Scale Integrated Cell (MuSIC 1.0), resolves 69 subcellular systems of which approximately half are undocumented. Accordingly we perform 134 additional affinity purifications, validating subunit associations for the majority of systems. The map reveals a pre-ribosomal RNA processing assembly and accessory factors, which we show govern rRNA maturation, and functional roles for SRRM1 and FAM120C in chromatin and RPS3A in splicing. By integration across scales, MuSIC increases the resolution of imaging while giving protein interactions a spatial dimension, paving the way to incorporate diverse types of data in proteome-wide cell maps.

17:30-17:40
Segmentation error aware probabilistic clustering for highly multiplexed imaging reveals densely packed tissue dynamics
Room: Madison CD
Format: Live-stream

Moderator(s): Marinka Zitnik

  • Yuju Lee, University of Toronto, Canada
  • Edward Chen, Lunenfeld-Tanenbaum Research Institute, Canada
  • Somi Afiuni, Lunenfeld-Tanenbaum Research Institute, Canada
  • Hartland W. Jackson, University of Toronto & Lunenfeld-Tanenbaum Research Institute, Canada
  • Kieran R. Campbell, University of Toronto & Lunenfeld-Tanenbaum Research Institute, Canada


Presentation Overview:

Highly multiplexed imaging technologies such as Imaging Mass Cytometry (IMC) enable the quantification of the expression of up to 40 proteins in tissue sections. Data preprocessing pipelines subsequently segment images to single cells, recording their average expression profile along with spatial characteristics. However, segmentation of the resulting images to single cells remains a challenge because of doublets: areas erroneously segmented as a single cell that are in fact composed of more than one "true" single cell. This results in cells with implausible protein co-expression combinations, confounding the interpretation of important cellular populations across tissues.

While doublets have been discussed extensively in the context of single-cell RNA-sequencing analysis, there is no clustering method for IMC that accounts for segmentation errors. Therefore, we introduce SegmentaTion AwaRe cLusterING (STARLING), a probabilistic model tailored for IMC that clusters the cells while explicitly allowing for doublets resulting from mis-segmentation. To benchmark STARLING against existing methods, we develop a novel evaluation system that penalizes clusters with biologically implausible marker co-expression combinations. Finally, we generate IMC data of human tonsil (a densely packed secondary lymphoid organ) and demonstrate that cellular states captured by STARLING identify known cell types that are not visible with other methods and are important for understanding the dynamics of immune response.

17:40-17:50
Joint modeling of rare variant genetic effects using deep learning and data-driven burden scores
Room: Madison CD
Format: Live from venue

Moderator(s): Marinka Zitnik

  • Brian Clarke, German Cancer Research Center (DKFZ), Germany
  • Eva Holtkamp, German Cancer Research Center (DKFZ), Germany
  • Hakime Öztürk, German Cancer Research Center (DKFZ), Germany
  • Felix Brechtmann, Department of Informatics, Technical University of Munich, Germany
  • Florian Hölzlwimmer, Department of Informatics, Technical University of Munich, Germany
  • Julien Gagneur, Department of Informatics, Technical University of Munich, Germany
  • Oliver Stegle, German Cancer Research Center (DKFZ), Germany


Presentation Overview:

Population-scale genomic sequencing provides novel opportunities to survey the effect of rare variants on phenotypes. Existing methods for rare-variant-association studies (RVASs), such as burden or variance-component tests, make strong assumptions about which variants exhibit phenotypic effects, limiting their efficacy.

Here, we propose DeepRVAT (Deep Rare Variant Association Testing), a data-driven framework that uses deep neural networks to learn a flexible rare variant aggregation function. Specifically, we build on DeepSet networks to efficiently model variant effects and interactions. Compared to existing methods, DeepRVAT (1) learns variant effects without strong filtering or specifying a kernel, (2) models nonlinear and epistatic effects, (3) efficiently incorporates dozens of multi-modal variant annotations, (4) provides trait-specific burden scores, and (5) utilizes GPUs for biobank-scale analyses.

We apply DeepRVAT to multiple phenotypes on 167,000 whole-exome-sequenced samples from UK Biobank. Compared with previous state-of-the-art methods, we obtain significantly increased power (e.g., 29 vs. 15 genes associated with human height; FDR < 0.05), while maintaining statistical calibration. Furthermore, we validate our results by multiple methods (e.g., enrichment analysis, comparison to larger studies) to ensure the biological plausibility of our associations. Collectively, our results demonstrate increased power and robustness for studying gene-trait associations using rare variants.

17:50-18:00
maxATAC: Predicting Transcription Factor Binding at Disease Risk Loci from ATAC-seq and DNA Sequence with Convolutional Neural Networks
Room: Madison CD
Format: Live from venue

Moderator(s): Marinka Zitnik

  • Tareian Cazares, University of Cincinnati, United States
  • Faiz Rizvi, University of Cincinnati, United States
  • Balaji Iyer, University of Cincinnati, United States
  • Xiaoting Chen, Cincinnati Children's Hospital Medical Center, United States
  • Michael Kotliar, Cincinnati Children's Hospital Medical Center, United States
  • Joseph Wayman, Cincinnati Children's Hospital Medical Center, United States
  • Anthony Bejjani, University of Cincinnati, United States
  • Omer Donmez, Cincinnati Children's Hospital Medical Center, United States
  • Benjamin Wronowski, Cincinnati Children's Hospital Medical Center, United States
  • Sreeja Parameswaran, Cincinnati Children's Hospital Medical Center, United States
  • Leah Kottyan, Cincinnati Children's Hospital Medical Center, United States
  • Artem Barski, Cincinnati Children's Hospital Medical Center, United States
  • Matthew Weirauch, Cincinnati Children's Hospital Medical Center, United States
  • Vb Surya Prasath, Cincinnati Children's Hospital Medical Center, United States
  • Emily Miraldi, Cincinnati Children's Hospital Medical Center, United States


Presentation Overview:

Most disease-associated genetic variants fall outside of protein-coding DNA and are often enriched in regulatory elements associated with DNA-binding proteins known as transcription factors (TFs). Computational methods are largely used to predict TF binding sites (TFBS) as the experimental characterization of most human TFs is intractable due to technical limitations. Instead, the most popular approaches use TF motifs and chromatin accessibility data to predict TF binding. Here, we present “maxATAC”, a suite of deep neural network models for genome-wide TFBS prediction from the assay for transposase-accessible chromatin (ATAC-seq) in any cell type, with models available for 127 human TFs. We demonstrate maxATAC’s capabilities by identifying TFBS associated with allele-dependent chromatin accessibility at atopic dermatitis genetic risk loci. We analyzed activated T cells isolated from patients with atopic dermatitis and their age-matched controls. Patient-specific ATAC-seq signal and DNA sequence were used as input for maxATAC to predict the binding of 103 nominally expressed TFs. We predicted increased binding of several TFs relevant to T cells, including MYB and FOXP1, in patients with atopic dermatitis. These results illustrate the utility of maxATAC models for exploring potential regulators of cellular biology and the effects of genetic variants on genomic regulation.

Thursday, July 14th
10:15-11:15
Keynote Presentation: Machine learning for the analysis of multimorbidities, disease and prescription trajectories
Room: Madison CD
Format: Live-stream

Moderator(s): Yves Moreau

  • Søren Brunak, University of Copenhagen, Denmark


Presentation Overview:

Analyses of disease progression patterns of multimorbid patients typically try to find systematic patterns of risk factors, diseases and complications. Such analyses are complicated by the fact that certain risk factors can also present as complications, thus representing “promiscuous” diseases that appear in quite different contexts. Another problem is that similar outcomes can be caused by different mechanisms, or mixed etiologies, that can be difficult to disentangle longitudinally. The talk will discuss approaches to patient stratification in such situations, including cases where cohort multi-omics data are available. These can potentially reveal patient-level disease characteristics and individualized response to treatment. We developed a deep learning-based framework, Multi-Omics Variational autoEncoders, to integrate such data and applied it to a cohort of 789 patients with newly diagnosed type 2 diabetes (T2D). These patients had clinical measurements and drug use recorded in electronic case report forms. We show how we can capture relevant clinical patterns and identify associations between the medication data and the multi-omics data. For the 20 most prevalent T2D drugs, we identified significant multi-omics associations, which we use to extract both known and novel biological associations between drugs and multi-omics features (genomics, transcriptomics, proteomics, metabolomics, and microbiomes as well as data from diet questionnaires and clinical measurements). To properly disentangle molecular mechanisms and response to drugs, we propose that future studies should explore the multi-omics space in depth rather than focusing on single data types, as is often done.

11:15-11:35
Proceedings Presentation: Scaling Multi-Instance Support Vector Machine to Breast Cancer Detection on the BreaKHis Dataset
Room: Madison CD
Format: Live-stream

Moderator(s): Yves Moreau

  • Hoon Seo, Colorado School of Mines, United States
  • Lodewijk Brand, Colorado School of Mines, United States
  • Lucia Saldana Barco, Colorado School of Mines, United States
  • Hua Wang, Colorado School of Mines, United States


Presentation Overview:

Breast cancer is a type of cancer that develops in breast tissue, and, after skin cancer, it is the most commonly diagnosed cancer in women in the United States. Given that an early diagnosis is imperative to prevent breast cancer progression, many machine learning models have automated the histopathological classification of the different types of carcinomas. However, many of them are not scalable to large datasets. In this study, we propose a novel Primal-Dual Multi-Instance Support Vector Machine (pdMISVM) to determine which tissue segments in an image exhibit an indication of an abnormality. We also derive an efficient optimization approach for the proposed method by bypassing the quadratic programming and least-squares problems, which are commonly employed to optimize Support Vector Machine (SVM) models in multi-instance learning. The proposed method is scalable to large datasets, and it is computationally efficient. We applied our method to the public BreaKHis dataset and achieved promising prediction performance and scalability for histopathological classification.

11:35-11:45
The multifocal transcriptomic landscape of locally advanced prostate cancer
Room: Madison CD
Format: Live from venue

Moderator(s): Yves Moreau

  • Maarten Larmuseau, Ghent University, Belgium
  • Kim Van der Eecken, Ghent University, Belgium
  • Louise de Schaetzen van Brienen, Ghent University, Belgium
  • Piet Ost, Ghent University, Belgium
  • Kathleen Marchal, Ghent University, Belgium


Presentation Overview:

Understanding the molecular alterations that allow transitioning from local to invasive disease is essential to improve cancer treatments. Investigating tumor progression has, however, been hampered by substantial intra- and intertumor heterogeneity. Here, we introduce a multifocal cohort of locally advanced prostate cancer, where several primary lesions and metastatic lymph nodes per patient have been transcriptomically profiled. Modeling pathway activity using a centroid-based approach allows tracing cancer progression from primary to metastatic tissue, highlighting how mainly invasion-related processes are altered. Moreover, using an external dataset we demonstrate that in lymph node positive primary tumors the activity of certain signatures is altered to resemble metastatic lymph nodes. We use this observation to identify the most likely seeding primary lesion in each patient of our cohort. The predicted seeding lesions agree well with seeding lesions estimated from RNA-seq derived somatic variants and are enriched in the invasive PAM50 Luminal B subtype. Finally, we leverage the unique design of our cohort and develop a new testing procedure to identify molecular processes that characterize the seeding lesions. Importantly, the poor correspondence between a lesion’s Gleason score and seeding status suggests that multifocal designs will be pivotal for the study of invasive disease.

11:45-11:55
SUPREME: A cancer subtype prediction methodology integrating multiomics data using Graph Convolutional Neural Network
Room: Madison CD
Format: Live from venue

Moderator(s): Yves Moreau

  • Ziynet Nesibe Kesimoglu, University of North Texas, United States
  • Serdar Bozdag, University of North Texas, United States


Presentation Overview:

To pave the road towards precision medicine, patients with similar biology should be grouped into cancer subtypes. Utilizing high-dimensional multiomics data generated from cancer tissues, integrative computational approaches have been developed to uncover cancer subtypes. Recently, Graph Neural Networks were developed to utilize node features and associations simultaneously on graph-structured data. Addressing limitations that existing tools have in leveraging these architectures, we developed SUPREME, an integrative approach that comprehensively analyzes multiomics data and patient associations with multiplex graph convolutions. Unlike existing tools, SUPREME generates patient similarity networks and obtains embeddings from each using Graph Convolutional Network models, on which it utilizes all multiomics. Also, SUPREME integrates embeddings with raw features to capture local and global features simultaneously. SUPREME integrates all embedding combinations as separate tasks, thus being interpretable regarding utilized networks and features. On TCGA, METABRIC, and combined datasets, SUPREME significantly outperformed seven supervised methods, with consistent results. SUPREME-inferred subtypes consistently had significant survival differences, mostly more significant than survival differences between ground truth (PAM50) subtypes, and outperformed all nine methods. Our findings suggest that, by properly utilizing multiple datatypes and associations, SUPREME could demystify subtype characteristics that cause significant survival differences and could improve ground truth, which depends mainly on one datatype.

11:55-12:05
Exploring the Mechanisms of Polypharmacy Side Effects via Two-stage Graph Neural Networks
Room: Madison CD
Format: Live-stream

Moderator(s): Yves Moreau

  • Hao Xu, Queen's University, Canada
  • Shengqi Sang, University of Waterloo; Perimeter Institute for Theoretical Physics, Canada
  • Herbert Yao, Queen's University, Canada
  • Alexandra Herghelegiu, The University of Sheffield, United Kingdom
  • Haiping Lu, The University of Sheffield, United Kingdom
  • James Yurkovich, University of California San Diego, United States
  • Laurence Yang, Queen's University, Canada


Presentation Overview:

Polypharmacy presents a major medical challenge for the world's ageing population due to novel drugs used in combination, the high cost of clinical trials, and limited studies on side-effect mechanisms. For problems like polypharmacy side effects, a predictor that predicts correct side effects is far from enough. A question of greater concern is: can the ‘predictor’ help us to figure out the cause of a predicted side effect? In this paper, we give a positive answer by proposing APRILE, a graph learning framework for discovering the mechanism behind drug side effects. Given a predictor, APRILE can generate hypothetical mechanisms by identifying the optimal set of drug targets and protein interactions of greatest importance to the predictor’s prediction. In the existing literature, we found ample evidence for mechanisms generated by APRILE, including drug hypersensitivity leading to anxiety, or paradoxical drug action leading to panic disorder. To make the framework more accessible to the biomedical research community, we wrapped it into a Python package and developed a web application that offers mechanistic explanations for 34 million POSE predictions and common mechanistic hypotheses for 472 diseases, 485 symptoms, 9 mental disorders, and 20 disease categories.

12:05-12:15
A Context-aware Deconfounding Autoencoder for Robust Prediction of Personalized in vivo Drug Response From Cell Line Compound Screening
Room: Madison CD
Format: Live from venue

Moderator(s): Yves Moreau

  • Di He, The City University of New York, United States
  • Qiao Liu, The City University of New York, United States
  • You Wu, The City University of New York, United States
  • Lei Xie, The City University of New York, United States


Presentation Overview: Show

Accurate and robust prediction of patient-specific responses to a new compound is critical to personalized drug discovery and development. However, patient data are often too scarce to train a generalized machine learning model. Although many methods have been developed to utilize cell line screens for predicting clinical responses, their performance is unreliable due to data heterogeneity and distribution shift. We have developed a novel Context-aware Deconfounding Autoencoder (CODE-AE) that can extract intrinsic biological signals masked by context-specific patterns and confounding factors. Extensive comparative studies demonstrated that CODE-AE effectively alleviated the out-of-distribution problem for model generalization and significantly improved accuracy and robustness over state-of-the-art methods in predicting patient-specific in vivo drug responses purely from in vitro compound screens. Using CODE-AE, we screened 59 drugs for 9,808 cancer patients. Our results are consistent with existing clinical observations, suggesting the potential of CODE-AE in developing personalized anti-cancer therapies and drug-response biomarkers.

13:15-14:15
Keynote Presentation: Explainable AI: where we are and how to move forward for cancer pharmacogenomics
Room: Madison CD
Format: Live-stream

Moderator(s): Yves Moreau

  • Su-In Lee


Presentation Overview: Show

In the first part of the talk, I will go over a number of research projects from my lab on explainable AI applied to biomedical problems, which exemplify how it addresses new scientific questions, makes new biological discoveries from data, informs clinical decisions, and even opens new research directions in biomedicine. In the second part of the talk, I will show that explainable AI needs to evolve and improve to solve real-world problems in computational biology and medicine, with a deep dive into our cancer pharmacogenomics project led by our Ph.D. student Joseph Janizek in collaboration with Prof. Kamila Naxerova at Harvard Medical School.

14:15-14:35
Proceedings Presentation: BITES: Balanced Individual Treatment Effect for Survival data
Room: Madison CD
Format: Live from venue

Moderator(s): Yves Moreau

  • Stefan Schrod, University Medical Center Göttingen, Germany
  • Andreas Schäfer, Universität Regensburg, Germany
  • Stefan Solbrig, Universität Regensburg, Germany
  • Robert Lohmayer, Leibniz Institute for Immunotherapy Regensburg, Germany
  • Wolfram Gronwald, Universität Regensburg, Germany
  • Peter J. Oefner, Universität Regensburg, Germany
  • Tim Beißbarth, University Medical Center Göttingen, Germany
  • Rainer Spang, Universität Regensburg, Germany
  • Helena Zacharias, Universität Kiel, Germany
  • Michael Altenbuchinger, University Medical Center Göttingen, Germany


Presentation Overview: Show

Estimating the effects of interventions on patient outcome is one of the key aspects of personalized medicine. Their inference is often challenged by the fact that the training data comprise only the outcome for the administered treatment, and not for alternative treatments (the so-called counterfactual outcomes). Several methods have been suggested for this scenario based on observational data, i.e., data where the intervention was not applied randomly, for both continuous and binary outcome variables. However, patient outcome is often recorded in terms of time-to-event data, comprising right-censored event times if an event does not occur within the observation period. Despite their enormous importance, time-to-event data are rarely used for treatment optimization.
We suggest an approach named BITES (Balanced Individual Treatment Effect for Survival data), which combines a treatment-specific semi-parametric Cox loss with a treatment-balanced deep neural network; i.e., we regularize differences between treated and non-treated patients using Integral Probability Metrics (IPM). We show in simulation studies that this approach outperforms the state of the art. Further, we demonstrate in an application to a cohort of breast cancer patients that hormone treatment can be optimized based on six routine parameters. We successfully validated this finding in an independent cohort. We provide BITES as an easy-to-use Python implementation including scheduled hyper-parameter optimization.
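
As an illustration of the objective described above, the following is a minimal NumPy sketch of a Cox partial-likelihood loss computed separately per treatment arm, plus a balancing penalty on a shared representation. It is not the BITES implementation: the function names, the weight `alpha`, and the use of a linear maximum mean discrepancy as the Integral Probability Metric are assumptions made for this example, and tied event times are not handled specially.

```python
import numpy as np

def cox_neg_partial_log_likelihood(risk, time, event):
    """Negative Cox partial log-likelihood, averaged over observed events.

    risk  : predicted log-risk scores, shape (n,)
    time  : observed follow-up times, shape (n,)
    event : 1 if the event was observed, 0 if right-censored, shape (n,)
    Assumption: no special treatment of tied event times.
    """
    order = np.argsort(-time)                 # sort patients by descending time
    risk, event = risk[order], event[order]
    # After sorting, the risk set of patient i is patients 0..i, so a running
    # log-sum-exp gives log(sum of exp(risk)) over each risk set.
    log_risk_set = np.logaddexp.accumulate(risk)
    return -np.sum((risk - log_risk_set) * event) / max(event.sum(), 1)

def linear_mmd(rep_treated, rep_control):
    """Squared distance between mean representations (a simple IPM; assumption)."""
    diff = rep_treated.mean(axis=0) - rep_control.mean(axis=0)
    return float(diff @ diff)

def balanced_survival_loss(risk, time, event, treated, representation, alpha=0.1):
    """Per-arm Cox losses plus an IPM penalty on the shared representation."""
    treated = treated.astype(bool)
    loss = 0.0
    for arm in (treated, ~treated):
        loss += cox_neg_partial_log_likelihood(risk[arm], time[arm], event[arm])
    return loss + alpha * linear_mmd(representation[treated], representation[~treated])

# Toy data standing in for network outputs (hypothetical values).
rng = np.random.default_rng(0)
n = 20
representation = rng.normal(size=(n, 4))
risk = representation @ rng.normal(size=4)
time = rng.exponential(size=n)
event = rng.integers(0, 2, size=n)
treated = np.arange(n) % 2 == 0
total = balanced_survival_loss(risk, time, event, treated, representation)
```

In a real model, `risk` and `representation` would come from a neural network and the loss would be minimized with automatic differentiation; the sketch only shows how the two terms fit together.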

14:35-14:55
Proceedings Presentation: SPARSE: a sparse hypergraph neural network for learning multiple types of latent combinations to accurately predict drug-drug interactions
Room: Madison CD
Format: Live from venue

Moderator(s): Yves Moreau

  • Duc Anh Nguyen, Kyoto University, Japan
  • Canh Hao Nguyen, Kyoto University, Japan
  • Peter Petschner, Kyoto University, Japan
  • Hiroshi Mamitsuka, Kyoto University, Japan


Presentation Overview: Show

Motivation: Predicting side effects of drug-drug interactions (DDIs) is an important task in pharmacology. The state-of-the-art methods for DDI prediction use hypergraph neural networks to learn latent representations of drugs and side effects to express high-order relationships among two interacting drugs and a side effect. The idea of these methods is that each side effect is caused by a unique combination of latent features of the corresponding interacting drugs. However, in reality, a side effect might have multiple, different mechanisms that cannot be represented by a single combination of latent features of drugs. Moreover, DDI data is sparse, suggesting that using a sparsity regularization would help to learn better latent representations to improve prediction performances.

Results: We propose SPARSE, which encodes the DDI hypergraph and drug features to latent spaces to learn multiple types of combinations of latent features of drugs and side effects, controlling the model sparsity by a sparse prior. Our extensive experiments using both synthetic and three real-world DDI datasets showed the clear predictive performance advantage of SPARSE over cutting-edge competing methods. Also, latent feature analysis over unknown top predictions by SPARSE demonstrated the interpretability advantage contributed by the model sparsity.

14:55-15:05
Deep-Multitask learning framework to predict synergistic drug combinations
Room: Madison CD
Format: Live-stream

Moderator(s): Yves Moreau

  • Mohamed Reda El Khili, McGill University, Canada
  • Amin Emad, McGill University and Mila - Quebec AI, Canada


Presentation Overview: Show

The improved therapeutic outcome and reduced adverse effects of synergistic drug combinations have turned them into a standard of care in many cancer types. Given the wealth of relevant information provided by high-throughput screening studies, the costly experimental design of these combinations can now be guided by advanced computational tools. Thus, we present MARSY, a deep-multitask learning method that predicts the level of synergism between drug pairs tested on cancer cell lines. Using gene expression to characterize cancer cell lines and induced signature profiles to represent drug pairs, MARSY learns a distinct set of embeddings to obtain multiple views of the features. Specifically, a representation of the entire combination and a representation of the drug pair are learned in parallel. These representations are then fed to a multitask network that predicts the synergy score of the drug combination alongside single drug responses. A thorough evaluation of MARSY revealed its superior performance compared to various state-of-the-art and traditional computational methods. A detailed analysis of the design choices of our framework demonstrated the predictive contribution of the embeddings learned by this model. Additionally, we predicted 116,348 new synergy scores, which revealed new insights regarding the impact of drug combinations in cancer cell lines.

15:05-15:15
EquiBind: Geometric Deep Learning for Drug Binding Structure Prediction
Room: Madison CD
Format: Live-stream

Moderator(s): Yves Moreau

  • Hannes Stärk, Massachusetts Institute of Technology, United States
  • Octavian-Eugen Ganea, Massachusetts Institute of Technology, United States
  • Lagnajit Pattanaik, Massachusetts Institute of Technology, United States
  • Regina Barzilay, Massachusetts Institute of Technology, United States
  • Tommi Jaakkola, Massachusetts Institute of Technology, United States


Presentation Overview: Show

Predicting how a drug-like molecule binds to a specific protein target is a core problem in drug discovery. An extremely fast computational binding method would enable key applications such as fast virtual screening or drug engineering. Existing methods are computationally expensive as they rely on heavy candidate sampling coupled with scoring, ranking, and fine-tuning steps. We challenge this paradigm with EquiBind, an SE(3)-equivariant geometric deep learning model performing direct-shot prediction of both i) the receptor binding location (blind docking) and ii) the ligand's bound pose and orientation. EquiBind achieves significant speed-ups and better quality compared to traditional and recent baselines. Further, we show extra improvements when coupling it with existing fine-tuning techniques at the cost of increased running time. Finally, we propose a novel and fast fine-tuning model that adjusts torsion angles of a ligand's rotatable bonds based on closed-form global minima of the von Mises angular distance to a given input atomic point cloud, avoiding previous expensive differential evolution strategies for energy minimization.

15:45-16:05
Proceedings Presentation: DECODE: a computational pipeline to discover T-cell receptor binding rules
Room: Madison CD
Format: Live from venue

Moderator(s): Yves Moreau

  • An-Phi Nguyen, IBM Research Europe, ETH Zurich, Switzerland
  • Maria Rodriguez Martinez, IBM Research Europe, Switzerland
  • Iliana Papadopoulou, IBM Research Europe, ETH Zurich, Switzerland
  • Anna Weber, IBM Research Europe, ETH Zurich, Switzerland


Presentation Overview: Show

Motivation: Understanding the mechanisms underlying T cell receptor (TCR) binding is of fundamental importance to understanding adaptive immune responses. A better understanding of the biochemical rules governing TCR binding can be used, for example, to guide the design of more powerful and safer T cell-based therapies. Advances in repertoire sequencing technologies have made available millions of TCR sequences. Data abundance has, in turn, fueled the development of many computational models to predict the binding properties of TCRs from their sequences. Unfortunately, while many of these works have made great strides towards predicting TCR specificity using machine learning, the black-box nature of these models has resulted in a limited understanding of the rules that govern the binding of a TCR and an epitope.

Results: We present an easy-to-use and customizable pipeline, DECODE, to extract the binding rules from any black-box model designed to predict the TCR-epitope binding. DECODE offers a range of analytical and visualization tools to guide the user in the extraction of such rules. We demonstrate our pipeline on a recently published TCR binding prediction model, TITAN, and show how to use the provided metrics to assess the quality of the computed rules.
In conclusion, DECODE can lead to a better understanding of the sequence motifs that underlie TCR binding. Our pipeline can facilitate the investigation of current immunotherapeutic challenges, such as cross-reactive events due to off-target TCR binding.

16:05-16:25
Proceedings Presentation: Fast and interpretable genomic data analysis using multiple approximate kernel learning
Room: Madison CD
Format: Live from venue

Moderator(s): Yves Moreau

  • Ayyüce Begüm Bektaş, Koç University, Turkey
  • Çiğdem Ak, Oregon Health & Science University, United States
  • Mehmet Gönen, Koç University, Turkey


Presentation Overview: Show

Motivation: Dataset sizes in computational biology have increased drastically with the help of improved data collection tools and the increasing size of patient cohorts. Previous kernel-based machine learning algorithms proposed for increased interpretability started to fail with large sample sizes, owing to their lack of scalability. To overcome this problem, we proposed a fast and efficient multiple kernel learning (MKL) algorithm, intended particularly for large-scale data, that integrates kernel approximation and group Lasso formulations into a conjoint model. Our method extracts significant and meaningful information from the genomic data while conjointly learning a model for out-of-sample prediction. It scales with increasing sample size by approximating distinct kernel matrices instead of calculating them.

Results: To test our computational framework, Multiple Approximate Kernel Learning (MAKL), we ran experiments on three cancer datasets and showed that MAKL is capable of outperforming the baseline algorithm while using only a small fraction of the input features. We also reported selection frequencies of approximated kernel matrices associated with feature subsets (i.e., gene sets/pathways), which helps to assess their relevance for the given classification task. Our fast and interpretable MKL algorithm, which produces sparse solutions, is promising for computational biology applications given its scalability and the highly correlated structure of genomic datasets, and it can be used to discover new biomarkers and new therapeutic guidelines.
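
To make the two ingredients named above concrete, here is a small NumPy sketch of kernel approximation and a group Lasso penalty. The abstract does not say which approximation MAKL uses; random Fourier features for a Gaussian kernel are an assumption chosen for this example, and all function names and parameter values are hypothetical.

```python
import numpy as np

def random_fourier_features(X, n_features, gamma, rng):
    """Map X (n, d) to Z (n, n_features) so that Z @ Z.T approximates the
    Gaussian kernel exp(-gamma * ||x - y||^2) without building the n x n matrix.
    Assumption: this is one standard approximation, not necessarily MAKL's."""
    d = X.shape[1]
    # Frequencies are drawn from the kernel's spectral density, N(0, 2 * gamma * I).
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

def group_lasso_penalty(weight_blocks):
    """Sum of Euclidean norms, one per block (e.g., one block per gene set).
    Blocks whose norm is driven to zero drop out of the model entirely."""
    return sum(np.linalg.norm(w) for w in weight_blocks)

# Compare the approximation with the exact kernel on toy data.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
gamma = 0.1
Z = random_fourier_features(X, 5000, gamma, rng)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K_exact = np.exp(-gamma * sq_dists)
max_error = np.abs(Z @ Z.T - K_exact).max()
```

With one feature map per gene set or pathway, a linear model on the concatenated maps has one weight block per kernel, so a group penalty of this form can switch whole kernels on or off, which is what selection frequencies would then summarize.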

16:25-16:35
Accelerating in-silico saturation mutagenesis using compressed sensing
Room: Madison CD
Format: Live from venue

Moderator(s): Yves Moreau

  • Jacob Schreiber, Stanford University, United States
  • Surag Nair, Stanford University, United States
  • Akshay Balsubramani, Stanford University, United States
  • Anshul Kundaje, Stanford University, United States


Presentation Overview: Show

A challenge with using modern machine learning methods in practice is that, frequently, their learned logic for transforming input features into output predictions is opaque and difficult for humans to understand. In-silico saturation mutagenesis (ISM) attempts to explain this logic on a per-example basis by applying the model to a biological sequence and also to all possible single mutations of that sequence, and comparing the model output. However, when a model contains convolution operations, this procedure can perform a significant amount of redundant calculations due to the convolution's limited receptive field.

We propose a method, named Yuzu, that speeds up ISM using two ideas: (1) Yuzu operates on the difference in layer outputs between the mutated sequences and the original sequence, which is sparse when the layer is a convolution, and (2) Yuzu uses the principles of compressed sensing to compress these sparse deltas into a compact set of probes that convolutions can efficiently operate on while eliminating redundant computation. Together, these properties enable Yuzu to achieve speedups of over three orders of magnitude for individual convolutions and around one order of magnitude for several published model architectures on both a CPU and GPU.
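
For reference, the baseline procedure that Yuzu accelerates can be sketched as below. This is naive ISM only, not Yuzu itself; the toy convolutional model, the kernel values, and the function names are hypothetical and chosen purely for illustration.

```python
import numpy as np

def naive_ism(model, onehot):
    """Naive in-silico saturation mutagenesis.

    model  : callable mapping a one-hot array (alphabet_size, length) to a scalar
    onehot : one-hot encoded reference sequence, shape (alphabet_size, length)
    Returns an (alphabet_size, length) array of output changes relative to the
    reference; entries for the reference characters stay 0.
    """
    n_chars, length = onehot.shape
    reference = model(onehot)
    deltas = np.zeros((n_chars, length))
    for pos in range(length):
        for char in range(n_chars):
            if onehot[char, pos] == 1:
                continue                      # identical to the reference sequence
            mutant = onehot.copy()
            mutant[:, pos] = 0
            mutant[char, pos] = 1
            # One full forward pass per mutation: this is the redundant work,
            # since a convolution only sees the mutation within its receptive field.
            deltas[char, pos] = model(mutant) - reference
    return deltas

def toy_conv_model(onehot, kernel):
    """Hypothetical model: one convolution filter followed by max pooling."""
    width = kernel.shape[1]
    n_windows = onehot.shape[1] - width + 1
    scores = [np.sum(kernel * onehot[:, i:i + width]) for i in range(n_windows)]
    return max(scores)

alphabet = "ACGT"
sequence = "ACGTAC"
onehot = np.array([[float(c == a) for c in sequence] for a in alphabet])
kernel = np.random.default_rng(0).normal(size=(4, 3))
deltas = naive_ism(lambda x: toy_conv_model(x, kernel), onehot)
```

The loop makes `length * (alphabet_size - 1)` forward passes in addition to the reference pass, each recomputing outputs that are unchanged far from the mutated position.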

16:35-16:45
Discovering interpretable features of the intrinsically disordered dark proteome by using evolution for contrastive learning
Room: Madison CD
Format: Live from venue

Moderator(s): Yves Moreau

  • Alex Lu, Microsoft Research, United States
  • Amy Lu, Berkeley, United States
  • Iva Pritišanac, Medizinische Universität Graz, Austria
  • Taraneh Zarin, Center for Genomic Regulation, Spain
  • Julie Forman-Kay, University of Toronto, Canada
  • Alan Moses, University of Toronto, Canada


Presentation Overview: Show

A major challenge in characterizing intrinsically disordered regions (IDRs), which are widespread in the proteome but relatively poorly understood, is the identification of molecular features that mediate functions of these regions, such as short motifs, amino acid repeats and physicochemical properties. Here, we introduce a proteome-scale feature discovery approach for IDRs. Our approach, which we call “reverse homology”, exploits the principle that important functional features are conserved over evolution. We use this as a contrastive learning signal for deep learning: given a set of homologous IDRs, the neural network has to correctly choose a held-out homologue from another set of IDRs otherwise sampled randomly from the proteome. We pair reverse homology with a simple architecture and standard interpretation techniques and show that the network learns conserved features of IDRs that can be interpreted as motifs, repeats, or bulk features like charge or amino acid propensities. We also show that our model can be used to produce visualizations of which residues and regions are most important to IDR function, generating hypotheses for uncharacterized IDRs. Our results suggest that feature discovery using unsupervised neural networks is a promising avenue to gain systematic insight into poorly understood protein sequences.
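
The contrastive task described above, choosing the held-out homologue among randomly sampled IDRs, can be written as a softmax cross-entropy over candidates. The sketch below is an assumption-laden illustration: the dot-product similarity, the use of a single query vector to summarize the homologous set, and the function name are not taken from the abstract, and the encoder that would produce the embeddings is not shown.

```python
import numpy as np

def contrastive_choice_loss(query, candidates, target_index):
    """Cross-entropy for picking the held-out homologue among the candidates.

    query        : embedding summarizing the homologous set, shape (d,)
    candidates   : embeddings of the held-out homologue and random IDRs, shape (k, d)
    target_index : row of `candidates` holding the true homologue
    Assumption: similarity is a plain dot product between embeddings.
    """
    logits = candidates @ query
    logits = logits - logits.max()            # subtract the max for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[target_index]

# Hypothetical embeddings: the first candidate aligns with the query.
query = np.array([1.0, 0.0])
candidates = np.array([[10.0, 0.0], [0.0, 0.0], [0.0, 3.0]])
loss_correct = contrastive_choice_loss(query, candidates, target_index=0)
loss_wrong = contrastive_choice_loss(query, candidates, target_index=1)
```

The loss is small when the true homologue is the most similar candidate and large otherwise, so minimizing it pushes the encoder towards features shared across homologues.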