MLCSB COSI

Schedule subject to change
All times listed are in UTC
Wednesday, July 28th
11:00-11:20
Proceedings Presentation: CALLR: a semi-supervised cell type annotation method for single-cell RNA sequencing data
Format: Pre-recorded with live Q&A

Moderator(s): Gabriele Schweikert

  • Ziyang Wei, Department of Statistics, University of Chicago, United States
  • Shuqin Zhang, School of Mathematical Sciences, Fudan University, China

Single-cell RNA sequencing (scRNA-seq) technology has been widely applied to capture the heterogeneity of different cell types within complex tissues. An essential step in scRNA-seq data analysis is the annotation of cell types. Traditional cell type annotation mainly clusters the cells first and then uses the aggregated cluster-level expression profiles and marker genes to label each cluster. Such methods depend heavily on the clustering results, which are still insufficient for accurate annotation. In this work, we propose a semi-supervised learning method for cell type annotation called CALLR. It combines an unsupervised learning term, represented by the Laplacian matrix constructed from all the cells, with a supervised term using logistic regression. By alternately updating the cell clusters and annotation labels, high annotation accuracy can be achieved. The model is formulated as an optimization problem, and a computationally efficient algorithm is developed to solve it. Experiments on seven real data sets show that CALLR outperforms the compared (semi-)supervised learning methods and the popular clustering methods.
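
The combination of a graph Laplacian term with logistic regression described above can be illustrated with a minimal sketch. This is a generic Laplacian-regularized logistic objective, not necessarily the formulation used by CALLR; the function names, the k-NN graph construction, and the weight `lam` are assumptions made for illustration.

```python
import numpy as np

def knn_laplacian(X, k=5):
    """Unnormalized Laplacian L = D - W of a symmetrized k-NN graph over cells."""
    n = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)  # squared distances
    np.fill_diagonal(d2, np.inf)                       # a cell is not its own neighbour
    nearest = np.argsort(d2, axis=1)[:, :k]
    W = np.zeros((n, n))
    W[np.repeat(np.arange(n), k), nearest.ravel()] = 1.0
    W = np.maximum(W, W.T)                             # make the graph undirected
    return np.diag(W.sum(axis=1)) - W

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def objective(X, L, B, labeled_idx, labels, lam=1.0):
    """Graph smoothness of predicted label probabilities plus logistic loss."""
    P = softmax(X @ B)                                 # (cells, cell types)
    smooth = np.trace(P.T @ L @ P)                     # unsupervised term
    nll = -np.mean(np.log(P[labeled_idx, labels]))     # supervised term
    return smooth + lam * nll

# Toy data (hypothetical): two groups of 10 cells, 3 genes, 4 labeled cells.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (10, 3)), rng.normal(5, 1, (10, 3))])
L = knn_laplacian(X, k=3)
B0 = np.zeros((3, 2))
value = objective(X, L, B0, np.array([0, 1, 10, 11]), np.array([0, 0, 1, 1]))
```

With all-zero coefficients every cell gets the same probabilities, so the smoothness term vanishes and only the logistic loss remains; an optimizer would then trade the two terms off against each other.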

11:20-11:30
Using topic modeling to detect cellular crosstalk in scRNA-seq
Format: Pre-recorded with live Q&A

Moderator(s): Gabriele Schweikert

  • Alexandrina Pancheva, University of Glasgow, United Kingdom
  • Helen Wheadon, University of Glasgow, United Kingdom
  • Simon Rogers, University of Glasgow, United Kingdom
  • Thomas Otto, University of Glasgow, United Kingdom

Cell-cell interactions are vital for numerous biological processes including development, differentiation, and response to inflammation. Currently, most methods for studying interactions at the scRNA-seq level are based on curated databases of ligands and receptors. Whilst useful, such methods are limited by current biological knowledge. Recent advances in single-cell protocols have allowed physically interacting cells to be captured, enabling complementary methods for studying interactions that do not rely on prior information. We introduce a new method, based on Latent Dirichlet Allocation (LDA), for detecting genes whose expression changes as a result of interaction in such datasets. We validate our method on synthetic data before applying it to two datasets of physically interacting cells, allowing us to identify genes that change as a result of interaction. For each dataset we produce a ranking of genes that are changing in subpopulations of the interacting cells. Lastly, we apply our method to a dataset generated by a standard droplet-based protocol that was not designed to capture interacting cells, and discuss its suitability for analysing interaction. We are able to rank genes that change as a result of interaction without relying on prior clustering and generation of synthetic reference profiles, as current methods do.

11:30-11:40
Representation learning of RNA velocity reveals robust cell transitions
Format: Pre-recorded with live Q&A

Moderator(s): Gabriele Schweikert

  • Chen Qiao, The University of Hong Kong, Hong Kong
  • Yuanhua Huang, The University of Hong Kong, Hong Kong

RNA velocity is a promising technique for revealing transient cellular dynamics among a heterogeneous cell population, and for quantifying cellular transitions from single-cell transcriptome experiments.
However, the cell transitions estimated from high dimensional RNA velocity are often unstable or inaccurate, partly due to the high technical noise and less informative projection.
Here, we present VeloAE, a tailored representation learning method, to learn a low-dimensional representation of RNA velocity on which cellular transitions can be robustly estimated.
On various experimental datasets, we show that VeloAE can not only accurately identify stimulation dynamics in time-series designs, but also effectively capture expected cellular differentiation in different biological systems. VeloAE therefore enhances the usefulness of RNA velocity for studying a wide range of biological processes.

11:40-11:50
Learning large-scale perturbation effects in single cell genomics
Format: Pre-recorded with live Q&A

Moderator(s): Gabriele Schweikert

  • Mohammad Lotfollahi, Helmholtz Zentrum München, Germany
  • Anna Klimovskaia Susmelj, Facebook AI, France
  • Carlo De Donno, Helmholtz Zentrum München, Germany
  • Yuge Ji, Helmholtz Zentrum München, Germany
  • Ignacio L. Ibarra, Helmholtz Zentrum München, Germany
  • F. Alex Wolf, Helmholtz Zentrum München, Germany
  • Nafissa Yakubova, Facebook, Germany
  • Fabian Theis, Helmholtz Zentrum München, Germany
  • David Lopez-Paz, Facebook, Germany

Recent advances in multiplexing single-cell transcriptomics across experiments are facilitating the high-throughput study of drug and genetic perturbations. However, an exhaustive exploration of the combinatorial perturbation space is experimentally unfeasible, so computational methods are needed to predict, interpret, and prioritize perturbations. We present the compositional perturbation autoencoder (CPA), which combines the interpretability of linear models with the flexibility of deep-learning approaches for single-cell response modeling. CPA encodes and learns transcriptional drug responses across different dosages, combinations, and treated cell types. The model produces easy-to-interpret embeddings for drugs and cell types, allowing drug similarity analysis and predictions for unseen drug dosages and combinations. We show that CPA accurately models single-cell perturbations across compounds, dosages, species, and time. We further demonstrate that CPA can predict combinatorial genetic interactions of several types, implying that it captures features that distinguish different interaction programs. Finally, we demonstrate that CPA allows in-silico generation of 5,329 non-measured genetic perturbation combinations (97.6% of all possibilities), inferring a diverse landscape of regulatory mechanisms. We envision CPA will facilitate efficient experimental design by enabling in-silico response prediction at the single-cell level.

11:50-12:00
Schema: metric learning enables interpretable synthesis of heterogeneous single-cell modalities
Format: Pre-recorded with live Q&A

Moderator(s): Gabriele Schweikert

  • Rohit Singh, Massachusetts Institute of Technology, United States
  • Brian Hie, Massachusetts Institute of Technology, United States
  • Ashwin Narayan, Massachusetts Institute of Technology, United States
  • Bonnie Berger, Massachusetts Institute of Technology, United States

A complete understanding of biological processes requires synthesizing information across heterogeneous modalities, such as age, disease status, or gene expression. Technological advances in single-cell profiling have enabled researchers to assay multiple modalities simultaneously. We present Schema, which uses a principled metric learning strategy that identifies informative features in a modality to synthesize disparate modalities into a single coherent interpretation. We use Schema to infer cell types by integrating gene expression and chromatin accessibility data; demonstrate informative data visualizations that synthesize multiple modalities; perform differential gene expression analysis in the context of spatial variability; and estimate evolutionary pressure on peptide sequences.

12:00-12:20
Proceedings Presentation: stPlus: a reference-based method for the accurate enhancement of spatial transcriptomics
Format: Pre-recorded with live Q&A

Moderator(s): Gabriele Schweikert

  • Shengquan Chen, Tsinghua University, China
  • Boheng Zhang, Tsinghua University, China
  • Xiaoyang Chen, Tsinghua University, China
  • Xuegong Zhang, Tsinghua University, China
  • Rui Jiang, Tsinghua University, China

Motivation: Single-cell RNA sequencing (scRNA-seq) techniques have revolutionized the investigation of the transcriptomic landscape in individual cells. Recent advancements in spatial transcriptomic technologies further enable gene expression profiling and spatial organization mapping of cells simultaneously. Among these technologies, imaging-based methods can offer higher spatial resolution, but they are limited by either the small number of genes imaged or low gene detection sensitivity. Although several methods have been proposed for enhancing spatially resolved transcriptomics, inadequate accuracy of gene expression prediction and insufficient ability to identify cell populations still impede the application of these methods.

Results: We propose stPlus, a reference-based method that leverages information in scRNA-seq data to enhance spatial transcriptomics. Based on an auto-encoder with a carefully tailored loss function, stPlus performs joint embedding and predicts spatial gene expression via a weighted k-NN. stPlus outperforms baseline methods with higher gene-wise and cell-wise Spearman correlation coefficients. We also introduce a clustering-based approach to assess the enhancement performance systematically. Using the data enhanced by stPlus, cell populations can be better identified than using the measured data. The predicted expression of genes unique to scRNA-seq data can also characterize spatial cell heterogeneity well. In addition, stPlus is robust and scalable to datasets of diverse gene detection sensitivity levels, sample sizes, and numbers of spatially measured genes. We anticipate stPlus will facilitate the analysis of spatial transcriptomics.

Availability: stPlus with detailed documents is freely accessible at http://health.tsinghua.edu.cn/software/stPlus/ and the source code is openly available on https://github.com/xy-chen16/stPlus.
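
The weighted k-NN prediction step mentioned in the Results can be sketched as follows. This is a minimal illustration that assumes the joint embedding has already been computed and uses inverse-distance weights; the weighting scheme, the function name `weighted_knn_predict`, and the toy arrays are assumptions, not the stPlus implementation (see the repository above for that).

```python
import numpy as np

def weighted_knn_predict(spatial_emb, ref_emb, ref_expr, k=3, eps=1e-8):
    """Predict expression for spatial cells from their k nearest reference cells.

    spatial_emb: (n_spatial, d) embedding of spatial cells
    ref_emb:     (n_ref, d) embedding of scRNA-seq reference cells
    ref_expr:    (n_ref, n_genes) expression of the reference cells
    """
    diff = spatial_emb[:, None, :] - ref_emb[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))            # (n_spatial, n_ref)
    idx = np.argsort(dist, axis=1)[:, :k]              # k nearest reference cells
    nearest_dist = np.take_along_axis(dist, idx, axis=1)
    w = 1.0 / (nearest_dist + eps)                     # inverse-distance weights (assumed)
    w /= w.sum(axis=1, keepdims=True)
    # Weighted average of the neighbours' expression profiles.
    return np.einsum("ik,ikg->ig", w, ref_expr[idx])

# Toy example (hypothetical values).
ref_emb = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0]])
ref_expr = np.array([[1.0, 0.0], [3.0, 0.0], [100.0, 100.0]])
pred_one = weighted_knn_predict(np.array([[0.0, 0.0]]), ref_emb, ref_expr, k=1)
pred_two = weighted_knn_predict(np.array([[0.5, 0.0]]), ref_emb, ref_expr, k=2)
```

In the second call the query lies midway between the first two reference cells, so their profiles are averaged with equal weight.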

12:40-13:00
Proceedings Presentation: Bayesian information sharing enhances detection of regulatory associations in rare cell types
Format: Pre-recorded with live Q&A

Moderator(s): Anshul Kundaje

  • Alexander P. Wu, Massachusetts Institute of Technology, United States
  • Jian Peng, University of Illinois at Urbana-Champaign, United States
  • Bonnie Berger, Massachusetts Institute of Technology, United States
  • Hyunghoon Cho, Broad Institute of MIT and Harvard, United States

Recent advances in single-cell RNA-sequencing (scRNA-seq) technologies promise to enable the study of gene regulatory associations at unprecedented resolution in diverse cellular contexts. However, identifying unique regulatory associations observed only in specific cell types or conditions remains a key challenge; this is particularly so for rare transcriptional states whose sample sizes are too small for existing gene regulatory network inference methods to be effective. We present ShareNet, a Bayesian framework for boosting the accuracy of cell type-specific gene regulatory networks by propagating information across related cell types via an information sharing structure that is adaptively optimized for a given single-cell dataset. The techniques we introduce can be used with a range of general network inference algorithms to enhance the output for each cell type. We demonstrate the enhanced accuracy of our approach on three benchmark scRNA-seq datasets. We find that our inferred cell type-specific networks also uncover key changes in gene associations that underpin the complex rewiring of regulatory networks across cell types, tissues, and dynamic biological processes. Our work presents a path towards extracting deeper insights about cell type-specific gene regulation in the rapidly growing compendium of scRNA-seq datasets.

13:00-13:20
Gene Network Connectivity Conveys Robustness in Gene Expression across Individuals, Cell types and Species
Format: Pre-recorded with live Q&A

Moderator(s): Anshul Kundaje

  • Amirreza Shaeiri, EMBL-Heidelberg, Iran
  • Olga Sigalova, EMBL-Heidelberg, Germany
  • Judith Zaugg, EMBL-Heidelberg, Germany

One of the remarkable properties of living systems is their propensity to be robust against various sources of variation. A growing body of work has studied this topic from the viewpoint of gene expression programs, under different names. These efforts, while shedding great light on the subject, have several limitations. They lack a comprehensive and comparative approach to studying variation across tissues and cell lines. The connectivity of genes through the co-expression or gene regulatory network is often not taken into account. A comparative analysis of gene expression variation across different species is missing. In this work, we aimed to address the above challenges by analyzing a variety of datasets. We present extensive evidence suggesting an important role of networks in controlling the variation. To the best of our knowledge, we build the most inclusive set of gene-specific regulatory features. Moreover, we predict our desired statistics based on (1) only the genome sequence, (2) only features, and (3) features and network, utilizing a variety of methods. We further observe that expression variation is conserved between closely related species in matching tissues.

13:20-13:40
Prediction and mechanistic dissection of transcriptional activation protein domains using deep learning and high-throughput screening
Format: Pre-recorded with live Q&A

Moderator(s): Anshul Kundaje

  • Adrian L. Sanborn, Stanford University, United States
  • Benjamin T. Yeh, Stanford University, United States
  • Jordan T. Feigerle, Stanford University, United States
  • Cynthia V. Hao, Stanford University, United States
  • Raphael J. L. Townshend, Stanford University, United States
  • Erez Lieberman Aiden, Baylor College of Medicine, United States
  • Ron O. Dror, Stanford University, United States
  • Roger D. Kornberg, Stanford University, United States

Transcription factor (TF) proteins comprise a DNA-binding domain and an effector domain that regulates nearby genes. Activation domains (ADs) – effector domains that increase transcription – have long been of particular interest due to their roles as oncogenic drivers and use as scientific tools.

We combined high-throughput measurements of in vivo activation with deep learning to characterize ADs. Using a domain-tiling screen, we identified all ADs in budding yeast. We trained a neural network named PADDLE (Predictor of Activation Domains using Deep Learning in Eukaryotes) that accurately predicts ADs across species, enabling identification of hundreds of ADs in human TFs. Surprisingly, ADs were also predicted and confirmed in all major yeast coactivator complexes.

Guided by PADDLE predictions, we designed and measured activation of thousands of AD mutants to derive a deeper understanding of the principles underlying activation. ADs shared no common sequence motifs. Acidic and bulky hydrophobic residues were each necessary for activation, but excess hydrophobicity was inhibiting. While a few ADs required alpha helical folding, most activated based on biochemical features and did not need specific sequences or secondary structure.

Altogether, these results demonstrate how deep learning and high-throughput experiments can be integrated to accelerate discovery of protein function.

13:40-14:00
Proceedings Presentation: Model learning to identify systemic regulators of the peripheral circadian clock
Format: Pre-recorded with live Q&A

Moderator(s): Anshul Kundaje

  • Julien Martinelli, Inria/Inserm/Institut Curie, France
  • Annabelle Ballesta, Inserm/Institut Curie, France
  • Xiao-Mei Li, Inserm/Université Paris Saclay, France
  • Sandrine Dulong, Inserm/Université Paris Saclay, France
  • Francis Lévi, Inserm/Université Paris Saclay, France
  • Michèle Teboul, Université Côte d'Azur/CNRS/Inserm/IbV, France
  • Sylvain Soliman, Inria, France
  • François Fages, Inria, France

Motivation: Personalized medicine aims at providing patient-tailored therapeutics based on multi-type data towards improved treatment outcomes. Chronotherapy, which consists of adapting drug administration to the patient's circadian rhythms, may be improved by such an approach. Recent clinical studies demonstrated large variability in patients' circadian coordination and optimal drug timing. Consequently, new eHealth platforms allow the monitoring of circadian biomarkers in individual patients through wearable technologies (rest-activity, body temperature), blood or salivary samples (melatonin, cortisol), and daily questionnaires (food intake, symptoms). A current clinical challenge is to design a methodology that predicts the patient's peripheral circadian clocks and associated optimal drug timing from circadian biomarkers. Because the mammalian circadian timing system is largely conserved between mice and humans, albeit in phase opposition, the study was developed using available mouse datasets.
Results: We investigated at the molecular scale the influence of systemic regulators (e.g. temperature, hormones) on peripheral clocks, through a model learning approach involving systems biology models based on ordinary differential equations. Using as prior knowledge our existing circadian clock model, we derived an approximation for the action of systemic regulators on the expression of three core-clock genes: Bmal1, Per2 and Rev-Erb-alpha.
These time profiles were then fitted with a population of models, based on linear regression. Selected models involved a modulation of either Bmal1 or Per2 transcription most likely by temperature or nutrient exposure cycles. This agreed with biological knowledge on temperature-dependent control of Per2 transcription. The strengths of systemic regulations were found to be significantly different according to mouse sex and genetic background.

14:20-15:20
MLCSB Keynote: TBC
Format: Live-stream

Moderator(s): Anshul Kundaje

  • Dana Pe'er, Memorial Sloan Kettering Cancer Center, United States
Thursday, July 29th
11:00-11:20
Deep multitask learning of gene risk for comorbid neurodevelopmental disorders
Format: Pre-recorded with live Q&A

Moderator(s): Gabriele Schweikert

  • A. Ercument Cicek, Bilkent University, Turkey
  • Ilayda Beyreli, Bilkent University, Turkey
  • Oguzhan Karakahya, Bilkent University, Turkey

Autism Spectrum Disorder (ASD) and Intellectual Disability (ID) are comorbid neurodevelopmental disorders with complex genetic architectures. Despite large-scale sequencing studies, only a fraction of the risk genes have been identified for both. Here, we present a novel network-based gene risk prioritization algorithm named DeepND that performs cross-disorder analysis to improve prediction power by exploiting the comorbidity of ASD and ID via multitask learning. Our model leverages information from gene co-expression networks that model human brain development using graph convolutional neural networks and learns which spatio-temporal neurodevelopmental windows are important for disorder etiologies. We show that our approach substantially improves the state-of-the-art prediction power in both single-disorder and cross-disorder settings. DeepND identifies the prefrontal cortex brain region and the early-mid fetal period as the highest neurodevelopmental risk window for both ASD and ID. Finally, we investigate frequent ASD- and ID-associated copy number variation regions and confident false findings to suggest several novel susceptibility gene candidates. DeepND can be generalized to analyze any combination of comorbid disorders and is released at http://github.com/ciceklab/deepnd.

11:20-11:40
Multitask group Lasso for Genome-Wide Association Studies in admixed populations
Format: Pre-recorded with live Q&A

Moderator(s): Gabriele Schweikert

  • Asma Nouira, MINES PARISTECH - Institut Curie - INSERM, France
  • Chloé-Agathe Azencott, MINES PARISTECH - Institut Curie - INSERM, France

Population stratification refers to the presence of differences in allele frequencies between subpopulations within samples, due to different ancestry. It is one of the major challenges in Genome Wide Association Studies (GWAS) as it increases type I error. An additional issue in GWAS is the presence of correlation between SNPs, or Linkage Disequilibrium (LD). To account for LD, we consider associations at the level of LD-groups (groups of correlated SNPs) rather than at the individual SNP level. In this contribution, we introduce multitask group Lasso for feature selection where each task corresponds to a subpopulation and each feature corresponds to an LD-group. Our algorithm provides the selection of either shared LD-groups across all tasks, or of population-specific LD-groups. We incorporate stability selection to improve the stability of sparsity-enforcing penalties. We used safe screening rules to provide a significant speed-up to scale the algorithm for GWAS data. To our knowledge, this is the first framework applied to GWAS associating feature selection, stability selection and safe screening rules for admixed populations at the LD-groups level. We show that our approach outperforms all standard methods on a simulated dataset and on two real cancer datasets.
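
The sparsity-enforcing group penalty at the heart of a multitask group Lasso is usually handled through block soft-thresholding, which zeroes out a whole group when its coefficients are jointly small. The sketch below shows that standard proximal step for coefficient blocks shared across tasks; the layout (rows are SNPs, columns are subpopulation tasks), the function name, and the toy values are assumptions, and the stability selection and safe screening steps from the abstract are not shown.

```python
import numpy as np

def group_soft_threshold(W, groups, lam):
    """Proximal operator of lam * sum_g ||W[g, :]||_F.

    W:      (n_snps, n_tasks) coefficient matrix, one column per subpopulation
    groups: list of row-index lists, one per LD-group
    A group whose joint norm across all tasks is at most lam is set to zero,
    so an LD-group is selected or discarded for all tasks together.
    """
    out = np.zeros_like(W, dtype=float)
    for rows in groups:
        block = W[rows, :]
        norm = np.linalg.norm(block)                   # Frobenius norm of the block
        if norm > lam:
            out[rows, :] = (1.0 - lam / norm) * block  # shrink towards zero
    return out

# Toy example (hypothetical): 3 SNPs, 2 subpopulations, 2 LD-groups.
W = np.array([[3.0, 0.0],
              [0.0, 4.0],
              [0.3, 0.4]])
groups = [[0, 1], [2]]
W_shrunk = group_soft_threshold(W, groups, lam=1.0)
```

Here the first LD-group (joint norm 5) is kept and scaled by 0.8, while the second (joint norm 0.5) is removed entirely. Population-specific selection would apply the same operator to per-task blocks instead.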

11:40-12:00
Navigating the pitfalls of applying machine learning in genomics
Format: Pre-recorded with live Q&A

Moderator(s): Gabriele Schweikert

  • William Noble, University of Washington, United States
  • Jacob Schreiber, Stanford University, United States
  • Sean Whalen, Gladstone Institute, UCSF, United States
  • Katherine Pollard, Gladstone Institute, UCSF, United States

The scale of genetic, epigenomic, transcriptomic, cheminformatic, and proteomic data available today, coupled with easy-to-use machine learning (ML) toolkits, has propelled a rise in the application of ML in genomics research. However, the assumptions behind the statistical models in ML software frequently are not met in biological systems. Furthermore, not all problems are well suited to being solved using ML. In this review, we illustrate the impact of several common pitfalls encountered when applying ML in genomics. We explore how the structure of genomics data can bias performance evaluations, predictions, and model interpretation. To address these challenges of translating cutting-edge ML methods to genomics, we describe solutions and appropriate use cases where ML modeling shows great potential.

12:00-12:20
Genomic data inequality and multi-ethnic machine learning
Format: Pre-recorded with live Q&A

Moderator(s): Gabriele Schweikert

  • Yan Gao, University of Tennessee Health Science Center, United States
  • Yan Cui, University of Tennessee Health Science Center, United States

Over 80% of the GWAS (genome-wide association study) and clinical omics data were collected from individuals of European ancestry (EA), who constitute approximately 16% of the world’s population. This severe data disadvantage of the ethnic minority groups is set to produce new health disparities as machine learning-powered biomedical research and health care become increasingly common. Current schemes of machine learning with multiethnic data fail to address this challenge and often lead to unintentional and even unnoticed low prediction accuracy for the data-disadvantaged ethnic groups. In this work, we show that the currently prevalent scheme for machine learning with multiethnic data, the mixture learning scheme, and its main alternative, the independent learning scheme, are prone to generating machine learning models with relatively low performance for data-disadvantaged ethnic groups due to inadequate training data and data distribution discrepancies among ethnic groups. We find that transfer learning can improve machine learning model performance for data-disadvantaged ethnic groups by leveraging knowledge learned from other groups having more abundant data. These results indicate that transfer learning can provide an effective approach to reduce health care disparities arising from data inequality among ethnic groups.

12:40-13:20
MLCSB Keynote: Using machine learning to increase health equality
Format: Live-stream

Moderator(s): Anshul Kundaje

  • Emma Pierson
13:20-14:00
MLCSB Panel discussion
Format: Live-stream

Moderator(s): Gabriele Schweikert

  • Emma Pierson, James Zou, Genevieve Wojcik, Ulrich Hemel
14:20-15:20
MLCSB Keynote: Algorithms for Infectious Disease
Format: Live-stream

Moderator(s): Anshul Kundaje

  • Bryan Bryson
Friday, July 30th
11:00-11:10
On the estimation of epigenetic energy landscapes from nanopore sequencing data
Format: Pre-recorded with live Q&A

Moderator(s): Gabriele Schweikert

  • Jordi Abante, Johns Hopkins University, United States
  • Sandeep Kambhampati, Johns Hopkins University, United States
  • Andrew P. Feinberg, Johns Hopkins University, United States
  • John Goutsias, Johns Hopkins University, United States

High-throughput third-generation sequencing devices, such as nanopore sequencers, can generate long reads spanning thousands of bases. This new technology offers the possibility of considering a wide range of epigenetic modifications and provides the capability to interrogate previously inaccessible regions of the genome, such as highly repetitive regions, and perform comprehensive allele-specific methylation analysis, among other applications. It is well known, however, that detection of DNA methylation from nanopore data results in substantially reduced per-read accuracy compared with bisulfite sequencing, due to noise introduced by the sequencer and its underlying pore chemistry. Therefore, new methods must be developed for reliable modeling and analysis of DNA methylation landscapes using nanopore sequencing data. Here we introduce such a method and, by using simulations, we provide evidence of its superiority to the state of the art. The proposed approach establishes a solid foundation for developing a comprehensive framework for the statistical analysis of DNA methylation and possibly of other epigenetic marks using nanopore sequencing data and potential energy landscapes.

11:10-11:20
RandomSCM: interpretable ensembles of sparse classifiers tailored for omics data
Format: Pre-recorded with live Q&A

Moderator(s): Gabriele Schweikert

  • Thibaud Godon, Laval University, Canada
  • Pier-Luc Plante, Laval University, Canada
  • Baptiste Bauvin, Laval University, Canada
  • Élina Francovic-Fontaine, Laval University, Canada
  • Alexandre Drouin, Element AI, a ServiceNow company; Laval University, Canada
  • François Laviolette, Laval University, Canada
  • Jacques Corbeil, Laval University, Canada

Recent metabolomics measurement devices, such as mass spectrometers, produce extremely high-dimensional data. Together with small sample sizes, this setting is known as the fat data (or p >> n) problem. Biomarker discovery in this configuration is a challenge. Classical statistical methods fail, and common Machine Learning (ML) algorithms produce models too complex to be interpretable. ML algorithms that rely on sparsity to predict phenotypes using very few covariates have been shown to thrive in this setting. While sparsity helps to avoid overfitting, it also leads to concise models that are easier to interpret for biomarker discovery.

The Set Covering Machine (SCM) algorithm produces sparse models based on simple decision rules. Recent work has applied SCMs to the genotype-to-phenotype prediction of antibiotic resistance and achieved state-of-the-art accuracy. To adapt this approach to metabolomics (fat) data, we developed a bootstrap aggregation of SCM models: RandomSCM.

We explored applications of RandomSCM beyond genotype-to-phenotype prediction by applying it to five metabolomics datasets. Prediction performance is at the state-of-the-art level. Furthermore, the study of the decision rules in RandomSCM revealed valid biomarkers of the phenotypes. These results demonstrate the high potential of the RandomSCM algorithm for biomarker discovery in omics sciences.
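
The idea of bootstrap-aggregating sparse rule-based classifiers can be sketched with a simplified stand-in. Each base model below is a single threshold rule (a decision stump), not a Set Covering Machine, which combines several rules; all function names and the toy data are hypothetical and do not reflect the RandomSCM implementation.

```python
import numpy as np

def fit_stump(X, y):
    """Best single rule 'feature j > threshold t' on (X, y); returns (j, t)."""
    best = (0, np.inf, -1.0)                           # (feature, threshold, accuracy)
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        thresholds = (values[:-1] + values[1:]) / 2.0  # midpoints between observed values
        for t in thresholds:
            acc = np.mean((X[:, j] > t).astype(int) == y)
            if acc > best[2]:
                best = (j, t, acc)
    return best[0], best[1]

def fit_bagged_rules(X, y, n_estimators=25, seed=0):
    """Fit one rule per bootstrap resample of the samples."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    rules = []
    for _ in range(n_estimators):
        idx = rng.integers(0, n, size=n)               # sample with replacement
        rules.append(fit_stump(X[idx], y[idx]))
    return rules

def predict(rules, X):
    """Majority vote over the individual rules."""
    votes = np.array([(X[:, j] > t).astype(int) for j, t in rules])
    return (votes.mean(axis=0) >= 0.5).astype(int)

def rule_usage(rules, n_features):
    """How often each feature appears in a rule: a crude biomarker ranking."""
    return np.bincount([j for j, _ in rules], minlength=n_features)

# Toy data (hypothetical): feature 0 separates the classes, the rest is noise.
rng = np.random.default_rng(1)
y = np.array([0] * 20 + [1] * 20)
X = rng.normal(size=(40, 5))
X[:, 0] = np.where(y == 1, rng.uniform(1, 2, 40), rng.uniform(-2, -1, 40))
rules = fit_bagged_rules(X, y)
y_pred = predict(rules, X)
usage = rule_usage(rules, X.shape[1])
```

Counting how often each feature is used across the ensemble is one simple way such a model can point to candidate biomarkers.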

11:20-11:40
Proceedings Presentation: CROTON: An Automated and Variant-Aware Deep Learning Framework for Predicting CRISPR/Cas9 Editing Outcomes
Format: Pre-recorded with live Q&A

Moderator(s): Gabriele Schweikert

  • Victoria Li, Hunter College High School, United States
  • Zijun Zhang, Flatiron Institute, Simons Foundation, United States
  • Olga Troyanskaya, Princeton University, United States

CRISPR/Cas9 is a revolutionary gene-editing technology that has been widely utilized in biology, biotechnology, and medicine. CRISPR/Cas9 editing outcomes depend on local DNA sequences at the target site and are thus predictable. However, existing prediction methods are dependent on both feature and model engineering, which restricts their performance to existing knowledge about CRISPR/Cas9 editing. Herein, deep multi-task convolutional neural networks (CNNs) and neural architecture search (NAS) were used to automate both feature and model engineering and create an end-to-end deep-learning framework, CROTON (CRISPR Outcomes Through cONvolutional neural networks). The CROTON model architecture was tuned automatically with NAS on a synthetic large-scale construct-based dataset and then tested on an independent primary T cell genomic editing dataset. CROTON outperformed existing expert-designed models and non-NAS CNNs in predicting 1 base pair insertion and deletion probability as well as deletion and frameshift frequency. Interpretation of CROTON revealed local sequence determinants for diverse editing outcomes. Finally, CROTON was utilized to assess how single nucleotide variants (SNVs) affect the genome editing outcomes of four clinically relevant target genes: the viral receptors ACE2 and CCR5 and the immune checkpoint inhibitors PDCD1 and CTLA4. Large SNV-induced differences in CROTON predictions in these target genes suggest that SNVs should be taken into consideration when designing widely-applicable gRNAs.

11:40-12:00
Proceedings Presentation: TITAN: T Cell Receptor Specificity Prediction with Bimodal Attention Networks
Format: Pre-recorded with live Q&A

Moderator(s): Gabriele Schweikert

  • Anna Weber, IBM, Zurich Research Laboratory and ETH Zurich, Switzerland
  • Jannis Born, IBM, Zurich Research Laboratory and ETH Zurich, Switzerland
  • Maria Rodriguez Martinez, IBM, Zurich Research Laboratory, Switzerland

Motivation: The activity of the adaptive immune system is governed by T-cells and their specific T-cell receptors (TCR), which selectively recognize foreign antigens. Recent advances in experimental techniques have enabled sequencing of TCRs and their antigenic targets (epitopes), making it possible to investigate the missing link between TCR sequence and epitope binding specificity. Scarcity of data and a large sequence space make this task challenging, and to date only models limited to a small set of epitopes have achieved good performance. Here, we establish a k-NN classifier as a strong baseline and then propose TITAN (Tcr epITope bimodal Attention Networks), a bimodal neural network that explicitly encodes both TCR sequences and epitopes to enable the independent study of generalization capabilities to unseen TCRs and/or epitopes.
Results: By encoding epitopes on the atomic level with SMILES sequences, we leverage transfer learning techniques to enrich the input data and boost performance. TITAN achieves high performance on general unseen TCR prediction (ROC-AUC 0.87 in 10-fold CV) and surpasses the results of the current state of the art (ImRex) by a large margin. While unseen epitope generalization remains challenging, we report two major breakthroughs. First, by dissecting the attention heatmaps, we demonstrate that the sparsity of available epitope data favors an implicit treatment of epitopes as classes. This may be a general problem that limits unseen epitope performance for sufficiently complex models. Second, we show that TITAN nevertheless exhibits significantly improved performance on unseen epitopes and is capable of focusing attention on chemically meaningful molecular structures.

12:00-12:20
Light Attention Predicts Protein Location from the Language of Life
Format: Pre-recorded with live Q&A

Moderator(s): Gabriele Schweikert

  • Hannes Stärk, Department of Informatics, Technical University of Munich, Germany
  • Christian Dallago, Department of Informatics, Technical University of Munich, Germany
  • Michael Heinzinger, Department of Informatics, Technical University of Munich, Germany
  • Burkhard Rost, Department of Informatics, Technical University of Munich, Germany

Presentation Overview:

Although knowing where a protein functions in a cell is important to characterize biological processes, this information remains unavailable for most known proteins. Machine learning narrows the gap through predictions from expertly chosen input features leveraging evolutionary information that is resource expensive to generate. We showcase using embeddings from protein language models for competitive localization predictions not relying on evolutionary information. Our lightweight deep neural network architecture uses a softmax weighted aggregation mechanism with linear complexity in sequence length referred to as light attention (LA). The method significantly outperformed the state-of-the-art for ten localization classes by about eight percentage points (Q10). The novel models are available as a web-service and as a stand-alone application at embed.protein.properties.
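
The softmax-weighted aggregation described above can be illustrated with a minimal sketch. It assumes per-residue embeddings are already available as a list of vectors, and it uses plain linear projections for the attention scores and values; the function name, the weight matrices, and the toy inputs are hypothetical, and the published model may compute scores and values differently. The point is only that each output channel is a softmax-weighted sum over sequence positions, so the cost grows linearly with sequence length and the output size does not depend on it.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def light_attention_pool(embeddings, score_weights, value_weights):
    """Pool per-residue embeddings (L x d) into one vector of size d_out.

    score_weights and value_weights are d x d_out matrices; they are
    hypothetical stand-ins for learned parameters.
    """
    def project(vec, weights):
        # vec (d) times weights (d x d_out) -> d_out
        return [sum(v * w for v, w in zip(vec, col)) for col in zip(*weights)]

    scores = [project(x, score_weights) for x in embeddings]  # L x d_out
    values = [project(x, value_weights) for x in embeddings]  # L x d_out
    pooled = []
    for j in range(len(scores[0])):
        # One softmax over sequence positions per output channel.
        alpha = softmax([row[j] for row in scores])
        pooled.append(sum(a * row[j] for a, row in zip(alpha, values)))
    return pooled

# Toy input: three residues with 2-dimensional embeddings (made-up numbers).
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
zero_scores = [[0.0, 0.0], [0.0, 0.0]]   # all scores equal -> uniform attention
identity = [[1.0, 0.0], [0.0, 1.0]]      # values are the embeddings themselves
pooled = light_attention_pool(embeddings, zero_scores, identity)
```

With all scores equal, the attention weights are uniform and the pooled vector reduces to the mean of the value vectors; learned scores would instead emphasize the positions that matter for the prediction.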

12:40-12:50
Subclonal reconstruction of tumors by using machine learning and population genetics
Format: Pre-recorded with live Q&A

Moderator(s): Anshul Kundaje

  • Giulio Caravagna, The Institute of Cancer Research and The University of Trieste, United Kingdom
  • Timon Heide, The Institute of Cancer Research, United Kingdom
  • Marc Williams, MSKCC, United States
  • Luis Zapata, The Institute of Cancer Research, United Kingdom
  • William Cross, Queen Mary University of London, United Kingdom
  • George D Cresswell, The Institute of Cancer Research, United Kingdom
  • Benjamin Werner, The Institute of Cancer Research, United Kingdom
  • Ahmet Acar, The Institute of Cancer Research, United Kingdom
  • Chris Barnes, University College London, United Kingdom
  • Guido Sanguinetti, The University of Edinburgh, United Kingdom
  • Trevor A. Graham, Queen Mary University of London, United Kingdom
  • Andrea Sottoriva, The Institute of Cancer Research, United Kingdom

Presentation Overview:

Most cancer genomic data are generated from bulk samples composed of mixtures of cancer subpopulations, as well as normal cells. Subclonal reconstruction methods based on machine learning aim to separate those subpopulations in a sample and infer their evolutionary history. However, current approaches are entirely data driven and agnostic to evolutionary theory. We demonstrate that systematic errors occur in the analysis if evolution is not accounted for, and this is exacerbated with multi-sampling of the same tumor. We present a novel approach for model-based tumor subclonal reconstruction, called MOBSTER, which combines machine learning with theoretical population genetics. Using public whole-genome sequencing data from 2,606 samples from different cohorts, new data and synthetic validation, we show that this method is more robust and accurate than current techniques in single-sample, multiregion and longitudinal data. This approach minimizes the confounding factors of nonevolutionary methods, thus leading to more accurate recovery of the evolutionary history of human cancers.

12:50-13:10
Proceedings Presentation: TUGDA: Task uncertainty guided domain adaptation for robust generalization of cancer drug response prediction from in vitro to in vivo settings
Format: Pre-recorded with live Q&A

Moderator(s): Anshul Kundaje

  • Rafael Peres da Silva, School of Computing, National University of Singapore, Singapore
  • Chayaporn Suphavilai, Genome Institute of Singapore, Singapore
  • Niranjan Nagarajan, Genome Institute of Singapore, Singapore

Presentation Overview:

Motivation: Large-scale cancer omics studies have highlighted the diversity of patient molecular profiles and the importance of leveraging this information to deliver the right drug to the right patient at the right time. Key challenges in learning predictive models for this include the high-dimensionality of omics data and heterogeneity in biological and clinical factors affecting patient response. The use of multi-task learning (MTL) techniques has been widely explored to address dataset limitations for in vitro drug response models, while domain adaptation (DA) has been employed to extend them to predict in vivo response. In both of these transfer learning settings, noisy data for some tasks (or domains) can substantially reduce the performance for others compared to single-task (domain) learners, i.e. lead to negative transfer (NT).
Results: We describe a novel multi-task unsupervised DA method (TUGDA) that addresses these limitations in a unified framework by quantifying uncertainty in predictors and weighting their influence on shared feature representations. TUGDA’s ability to rely more on predictors with low-uncertainty allowed it to notably reduce cases of NT for in vitro models (94% overall) compared to state-of-the-art methods. For DA to in vivo settings, TUGDA improved over previous methods for patient-derived xenografts (9 out of 14 drugs) as well as patient datasets (significant associations in 9 out of 22 drugs). TUGDA’s ability to avoid NT thus provides a key capability as we try to integrate diverse drug-response datasets to build consistent predictive models with in vivo utility.
Availability: https://github.com/CSB5/TUGDA
Contact: nagarajann@gis.a-star.edu.sg
Supplementary information: Attached

13:10-13:30
Proceedings Presentation: Predicting mechanism of action of novel compounds using compound structure and transcriptomic signature co-embedding
Format: Pre-recorded with live Q&A

Moderator(s): Anshul Kundaje

  • Gwanghoon Jang, Korea University, South Korea
  • Sungjoon Park, Korea University, South Korea
  • Sanghoon Lee, Korea University, South Korea
  • Sunkyu Kim, Korea University, South Korea
  • Sejeong Park, Korea University, South Korea
  • Jaewoo Kang, Korea University, South Korea

Presentation Overview:

Motivation: Identifying the mechanisms of action (MoAs) of novel compounds is crucial in drug discovery. Careful understanding of MoAs can avoid potential side effects of drug candidates. Efforts have been made to identify MoAs using the transcriptomic signatures induced by compounds. However, those approaches fail to reveal MoAs in the absence of actual compound signatures.

Results: We present MoAble, which predicts MoAs without requiring compound signatures. We train a deep learning-based co-embedding model to map compound signatures and compound structures into the same embedding space. The model generates a low-dimensional compound signature representation from the compound structure. To predict MoAs, pathway enrichment analysis is performed based on the connectivity between embedding vectors of compounds and those of genetic perturbations. Results show that MoAble is comparable to the methods that use actual compound signatures. We demonstrate that MoAble can be used to reveal MoAs of novel compounds without measuring compound signatures, with the same prediction accuracy as when they are measured.

Availability: MoAble is available at https://github.com/dmis-lab/moable

Contact: kangj@korea.ac.kr, sungjoonpark@korea.ac.kr

Supplementary information: Supplementary data are available at Bioinformatics online.

13:30-13:50
Proceedings Presentation: Modeling drug combination effects via latent tensor reconstruction
Format: Pre-recorded with live Q&A

Moderator(s): Anshul Kundaje

  • Tianduanyi Wang, Aalto University; University of Helsinki, Finland
  • Sandor Szedmak, Aalto University, Finland
  • Haishan Wang, Aalto University, Finland
  • Tero Aittokallio, Aalto University; University of Helsinki; University of Turku; Oslo University Hospital; University of Oslo, Finland
  • Tapio Pahikkala, University of Turku, Finland
  • Anna Cichonska, Aalto University; University of Helsinki, Finland
  • Juho Rousu, Aalto University, Finland

Presentation Overview:

Motivation: Combination therapies have emerged as a powerful treatment modality to overcome drug resistance and improve treatment efficacy. However, the number of possible drug combinations increases very rapidly with the number of individual drugs under consideration, which makes comprehensive experimental screening infeasible in practice. Machine learning models offer time- and cost-efficient means to aid this process by prioritising the most effective drug combinations for further pre-clinical and clinical validation. However, the complexity of the underlying interaction patterns across multiple drug doses and in different cellular contexts poses challenges to the predictive modelling of drug combination effects.
Results: We introduce comboLTR, a highly time-efficient method for learning complex, nonlinear target functions for describing the responses of therapeutic agent combinations at various doses and in various cancer cell contexts. The method is based on polynomial regression via powerful latent tensor reconstruction. It uses a combination of recommender system-style features indexing the data tensor of response values in different contexts, and chemical and multi-omics features as inputs. We demonstrate that comboLTR outperforms state-of-the-art methods in terms of predictive performance and running time, and produces highly accurate results even in the challenging and practical inference scenario where full dose-response matrices are predicted for completely new drug combinations with no available combination and monotherapy response measurements in any training cell line.

13:50-14:00
MatchMaker: A Deep Learning Framework for Drug Synergy Prediction
Format: Pre-recorded with live Q&A

Moderator(s): Anshul Kundaje

  • A. Ercument Cicek, Bilkent University, Turkey
  • Halil İbrahim Kuru, Bilkent University, Turkey
  • Oznur Tastan, Sabanci University, Turkey

Presentation Overview:

Drug combination therapies are commonly used for the treatment of complex diseases such as cancer due to increased efficacy and reduced side effects. However, experimentally validating all possible combinations for synergistic interaction, even with high-throughput screens, is intractable due to the vast combinatorial search space. Computational techniques are often used to reduce the number of combinations to be evaluated experimentally by prioritizing promising candidates. We present MatchMaker, a deep neural network-based algorithm that predicts drug synergy scores for a pair of drugs using the drugs’ chemical structures and gene expression profiles of untreated cell lines as input. The model contains three neural subnetworks: two subnetworks each learn a representation of one of the two drugs, conditioned on the gene expression of the given cell line, and the outputs of these two subnetworks are then input to a third subnetwork that predicts the Loewe synergy score of the pair. We train MatchMaker using the DrugComb dataset, which contained 286,421 examples. MatchMaker yields improvements of up to ~15% in correlation and ~33% in mean squared error (MSE) over the next best method, DeepSynergy.
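
The three-subnetwork layout described above can be sketched as a forward pass. This is only an illustration under stated assumptions: each subnetwork is reduced to a single dense layer, the weights are random rather than trained, and all names, layer sizes, and feature values are hypothetical rather than taken from the MatchMaker implementation.

```python
import random

def dense(inputs, weights, biases, relu=True):
    """One fully connected layer: weights is n_out x n_in, biases is n_out."""
    out = [sum(x * w for x, w in zip(inputs, row)) + b
           for row, b in zip(weights, biases)]
    return [max(0.0, o) for o in out] if relu else out

def init_layer(n_in, n_out, rng):
    """Random (untrained) parameters, used here only to make the sketch run."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def predict_synergy(drug_a, drug_b, cell_line, params):
    """Two drug subnetworks, each conditioned on the cell line, feed a third."""
    rep_a = dense(drug_a + cell_line, *params["drug_a_net"])
    rep_b = dense(drug_b + cell_line, *params["drug_b_net"])
    # The third subnetwork maps the concatenated representations to one score.
    (score,) = dense(rep_a + rep_b, *params["synergy_net"], relu=False)
    return score

rng = random.Random(0)
n_drug, n_cell, n_hidden = 4, 3, 5  # hypothetical feature and layer sizes
params = {
    "drug_a_net": init_layer(n_drug + n_cell, n_hidden, rng),
    "drug_b_net": init_layer(n_drug + n_cell, n_hidden, rng),
    "synergy_net": init_layer(2 * n_hidden, 1, rng),
}
score = predict_synergy(
    [1.0, 0.0, 1.0, 0.0],  # made-up chemical features of drug A
    [0.0, 1.0, 1.0, 0.0],  # made-up chemical features of drug B
    [0.2, 0.5, 0.1],       # made-up expression profile of the untreated cell line
    params,
)
```

Concatenating the cell line profile with each drug's features is one simple way to condition a drug representation on the cell line; in a trained model the final output would be regressed against measured Loewe synergy scores.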

14:20-15:20
MLCSB Keynote: Exploiting Molecular Interactions in Machine Learning Models for Cancer
Format: Live-stream

Moderator(s): Gabriele Schweikert

  • Oznur Tastan, Sabanci University, Turkey



International Society for Computational Biology
525-K East Market Street, RM 330
Leesburg, VA, USA 20176
