Leading Professional Society for Computational Biology and Bioinformatics
Connecting, Training, Empowering, Worldwide

MLCSB COSI

Presentations

Schedule subject to change
Monday, July 13th
10:40 AM-11:40 AM
Computational single-cell biology from one to many cells
Format: Live-stream

*Dr. Marks was unable to present

  • Oliver Stegle
12:00 PM-12:20 PM
Proceedings Presentation: Cancer mutational signatures representation by large-scale context embedding
Format: Pre-recorded with live Q&A

  • Jian Ma, Carnegie Mellon University, United States
  • Yang Zhang, Carnegie Mellon University, United States
  • Yunxuan Xiao, Shanghai Jiao Tong University, China
  • Muyu Yang, Carnegie Mellon University, United States

Presentation Overview:

The accumulation of somatic mutations plays critical roles in cancer development and progression. However, the global patterns of somatic mutations, especially non-coding mutations, and their roles in defining molecular subtypes of cancer have not been well characterized due to the computational challenges in analyzing the complex mutational patterns. Here we develop a new algorithm, called MutSpace, to effectively extract patient-specific mutational features using an embedding framework for larger sequence context. Our method is motivated by the observation that the mutational rate at megabase scale and the local mutational patterns jointly contribute to distinguishing cancer subtypes, both of which can be simultaneously captured by MutSpace. Simulation evaluations demonstrate that MutSpace can effectively characterize mutational features from bona fide patient subgroups and achieve superior performance compared with previous methods. As a proof-of-principle, we apply MutSpace to 560 breast cancer patient samples and demonstrate that our method achieves high accuracy in subtype identification. In addition, the learned embeddings from MutSpace reflect intrinsic patterns of breast cancer subtypes and other features of genome structure and function. MutSpace is a promising new framework to better understand cancer heterogeneity based on somatic mutations.

12:20 PM-12:40 PM
Proceedings Presentation: MHCAttnNet: Predicting MHC-Peptide Bindings for MHC Alleles Classes I & II Using An Attention-Based Deep Neural Model
Format: Pre-recorded with live Q&A

  • Gopalakrishnan Venkatesh, International Institute of Information Technology, Bangalore, India
  • Aayush Grover, International Institute of Information Technology, Bangalore, India
  • G Srinivasaraghavan, International Institute of Information Technology, Bangalore, India
  • Shrisha Rao, International Institute of Information Technology, Bangalore, India

Presentation Overview:

Motivation: Accurate prediction of binding between an MHC allele and a peptide plays a major role in the synthesis of personalized cancer vaccines. The immune system struggles to distinguish between a cancerous and a healthy cell. In a patient suffering from cancer who has a particular MHC allele, only those peptides that bind with the MHC allele with high affinity help the immune system recognize the cancerous cells.
Results: MHCAttnNet is a deep neural model that uses an attention mechanism to capture the relevant subsequences of the amino acid sequences of peptides and MHC alleles. It then uses this to accurately predict the MHC-peptide binding. MHCAttnNet achieves an AUC-PRC score of 94.18% with 161 class I MHC alleles which outperforms the state-of-the-art models for this task. MHCAttnNet also achieves a better AUC-ROC score in comparison to the state-of-the-art models while covering a greater number of class II MHC alleles. The attention mechanism used by MHCAttnNet provides a heatmap over the amino acids thus indicating the important subsequences present in the amino acid sequence. This approach also allows us to focus on a much smaller number of relevant trigrams corresponding to the amino acid sequence of an MHC allele, from 9251 possible trigrams to about 258. This significantly reduces the number of amino acid subsequences that need to be clinically tested.
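To make the trigram view in the abstract concrete, the sketch below splits an amino acid sequence into overlapping trigrams and counts them. It is only an illustration: the function names and the example sequence are hypothetical, the abstract does not say whether MHCAttnNet uses overlapping trigrams, and this is not the authors' preprocessing code.

```python
from collections import Counter


def trigrams(sequence):
    """Return all overlapping length-3 substrings of an amino acid sequence."""
    return [sequence[i:i + 3] for i in range(len(sequence) - 2)]


def trigram_counts(sequences):
    """Count how often each trigram occurs across a collection of sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(trigrams(seq))
    return counts


# Hypothetical sequence fragment, used only to show the call pattern.
example = "GSHSMRYF"
print(trigrams(example))
print(trigram_counts([example, "AAAA"]).most_common(3))
```

Ranking trigrams by an attention-derived score instead of raw counts would be one way to arrive at a short list of candidate subsequences, but that step depends on model details the abstract does not give.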

2:00 PM-3:00 PM
Deep learning at base-resolution reveals motif syntax of the cis-regulatory code
Format: Live-stream

  • Anshul Kundaje, Stanford University, United States

Presentation Overview:

Genes are regulated by cis-regulatory elements, which contain transcription factor (TF) binding motifs in specific arrangements. To understand the syntax of these motif arrangements and its influence on TF binding, we developed a new convolutional neural network called BPNet that models the relationship between regulatory DNA sequence and base-resolution binding profiles from ChIP-exo/nexus experiments targeting four pluripotency TFs Oct4, Sox2, Nanog, and Klf4 in mouse embryonic stem cells. BPNet is able to predict base-resolution binding profiles and footprints at unprecedented accuracy on par with replicate experiments. We developed a suite of model interpretation methods to learn novel motif representations, accurately map predictive motif instances in the genome and identify higher-order rules by which combinatorial motif syntax influences binding of these TFs. We discovered several novel motifs bound by these TFs supported by distinct footprints. We further found that instances of strict motif spacing are largely due to retrotransposons, but that soft motif syntax influences TF binding at protein or nucleosome range in a directional manner. Most strikingly, Nanog binding is driven by motifs with a strong preference for ~10.5 bp spacings corresponding to helical periodicity. We then validated our model's predictions using CRISPR-induced point mutations of motif instances followed by profiling TF binding. The sequence representations learned by the binding models also generalized to accurately predict differential chromatin accessibility after TF depletion as well as massively parallel reporter experiments. BPNet easily adapts to other types of profiling experiments (e.g. ChIP-seq, DNase-seq, ATAC-seq, PRO-seq) in mammals as well as other species such as yeast, thus paving the way to decipher the cis-regulatory code from diverse regulatory profiling experiments.

3:20 PM-3:40 PM
Fourier-transform-based attribution priors improve the stability and interpretability of deep learning models for regulatory genomics
Format: Pre-recorded with live Q&A

  • Alex Tseng, Stanford University, United States
  • Avanti Shrikumar, Stanford University, United States
  • Anshul Kundaje, Stanford University, United States

Presentation Overview:

Deep learning models of regulatory DNA can accurately predict transcription factor (TF) binding and chromatin accessibility profiles. Base-resolution importance (i.e. "attribution") scores learned by the models can highlight predictive motifs and syntax. Unfortunately, these models are prone to overfitting and are sensitive to random initializations, often resulting in noisy and irreproducible attributions that obfuscate underlying motifs. To address these shortcomings, we propose a novel attribution prior, where the Fourier transform of input-level attribution scores is computed at training time, and high-frequency components of the Fourier spectrum are penalized. We evaluate different model architectures with and without attribution priors to predict binary or continuous profiles of TF binding or chromatin accessibility. The prior is agnostic to the model architecture or predicted experimental assay, yet provides similar gains across all experiments. We show that our attribution prior dramatically improves the models’ stability, interpretability, and performance on held-out data, including when training data is severely limited. Our attribution prior also allows models to identify motifs more sensitively and precisely within individual regulatory elements. This work represents an important advancement in improving the reliability of deep learning models for deciphering the cis-regulatory code from regulatory profiling experiments.
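The idea of penalizing high-frequency components of an attribution track can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the hard frequency cutoff, the use of a magnitude fraction as the penalty, and all function names are hypothetical choices made here, and a real training setup would compute this on differentiable tensors inside the loss.

```python
import cmath
import math


def dft_magnitudes(values):
    """Magnitudes of the discrete Fourier transform at non-negative frequencies."""
    n = len(values)
    mags = []
    for k in range(n // 2 + 1):
        coeff = sum(v * cmath.exp(-2j * math.pi * k * t / n)
                    for t, v in enumerate(values))
        mags.append(abs(coeff))
    return mags


def high_frequency_fraction(attributions, cutoff):
    """Fraction of non-DC spectral magnitude at frequency indices >= cutoff.

    `cutoff` is an assumed hyperparameter and must be at least 1.
    """
    if cutoff < 1:
        raise ValueError("cutoff must be >= 1")
    mags = dft_magnitudes(attributions)[1:]  # drop the DC term
    total = sum(mags)
    if total == 0:
        return 0.0
    high = sum(mags[cutoff - 1:])  # mags[i] holds frequency index i + 1
    return high / total


# A smooth attribution track versus a rapidly alternating one.
smooth = [math.sin(2 * math.pi * t / 32) for t in range(32)]
noisy = [(-1) ** t for t in range(32)]
print(high_frequency_fraction(smooth, cutoff=4))
print(high_frequency_fraction(noisy, cutoff=4))
```

Adding a weighted version of this fraction to the training loss would push the model toward smoother attribution tracks, which is the intuition the abstract describes.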

3:40 PM-4:00 PM
Dissecting the grammar of chromatin architecture using megabase scale DNA sequence with deep neural networks and transfer learning
Format: Pre-recorded with live Q&A

  • Ron Schwessinger, University of Oxford, United Kingdom
  • Matthew Gosden, University of Oxford, United Kingdom
  • Damien Downes, University of Oxford, United Kingdom
  • Richard Brown, University of Oxford, United Kingdom
  • Marieke Oudelaar, University of Oxford, United Kingdom
  • Jelena Telenius, University of Oxford, United Kingdom
  • Yee Whye Teh, University of Oxford, United Kingdom
  • Gerton Lunter, University of Oxford, United Kingdom
  • Jim Hughes, University of Oxford, United Kingdom

Presentation Overview:

Mammalian genome architecture is characterized by an intricate framework of hierarchically folded domains that dictate essential gene regulatory functions such as enhancer-promoter interactions. Understanding which determinants drive 3D genome formation and how this architecture is altered by sequence and structural variation requires high-throughput, genome-wide approaches. Computational models sophisticated enough to grasp the determinants of chromatin folding allow us to perform these large-scale experiments in silico.
We have developed a deep neural network (deepC) that uses transfer learning to predict chromatin interactions from DNA sequence at the megabase scale. Our model predicts Hi-C interactions at high resolution, captures intricate, hierarchical chromatin structures, and can be used to fine-map Hi-C data.
DeepC allows us to predict the impact of single base pair variants as well as structural variation in the same end-to-end framework, bridging the different levels of resolution from base pairs to TADs. DeepC enables large-scale, computational screens that empower us to dissect the functional elements and sequence determinants that regulate chromatin architecture at base pair resolution and genome-wide scale. We demonstrate how we employ deepC to stratify the contribution of distinct classes of regulatory elements, study the grammar of domain boundaries and predict the effect of SNPs.

4:00 PM-4:20 PM
CoRE-ATAC: A Deep Learning model for the Classification of Regulatory Elements from single cell and bulk ATAC-seq data
Format: Pre-recorded with live Q&A

  • Asa Thibodeau, The Jackson Laboratory for Genomic Medicine, Farmington, CT, 06032, United States
  • Shubham Khetan, The Jackson Laboratory for Genomic Medicine, Farmington, CT, 06032, United States
  • Alper Eroglu, The Jackson Laboratory for Genomic Medicine, Farmington, CT, 06032, United States
  • Ryan Tewhey, The Jackson Laboratory, Bar Harbor, ME, 04609, United States
  • Michael L. Stitzel, The Jackson Laboratory for Genomic Medicine, Farmington, CT, 06032, United States
  • Duygu Ucar, The Jackson Laboratory for Genomic Medicine, Farmington, CT, 06032, United States

Presentation Overview:

Cis-Regulatory elements (cis-REs) include promoters, enhancers, and insulators that regulate gene expression programs via binding of transcription factors. ATAC-seq technology effectively identifies active cis-REs in a given cell type (including from single cells) by mapping the accessible chromatin at base-pair resolution. However, these maps are not immediately useful for inferring specific functions of cis-REs. For this purpose, we developed a deep learning framework (CoRE-ATAC) with novel data encoders that integrate DNA sequence (reference or personal genotypes) and ATAC-seq read pileups. CoRE-ATAC was trained on 4 cell types (n=6 samples/replicates) and accurately predicted known cis-RE functions from 7 cell types (n=40 samples) that were not used in model training (average precision=0.80). CoRE-ATAC enhancer predictions from 19 human islets coincided with genetically modulated gain/loss of enhancer activity, which was confirmed by massively parallel reporter assays (MPRAs). Finally, CoRE-ATAC effectively inferred functionality of cis-REs from single nucleus ATAC-seq data from human blood-derived immune cells that overlapped well with known functional annotations in sorted immune cells. ATAC-seq maps from primary human cells reveal individual- and cell-specific variation in cis-RE activity. CoRE-ATAC increases the functional resolution of these maps, a critical step for studying regulatory disruptions behind diseases.

4:20 PM-4:40 PM
Zero-shot imputations across species are enabled through joint modeling of human and mouse epigenomics
Format: Pre-recorded with live Q&A

  • Jacob Schreiber, Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, United States
  • Deepthi Hegde, University of Washington, United States
  • William S. Noble, University of Washington, United States

Presentation Overview:

Recent large-scale efforts to characterize biochemical activity along the human genome have produced thousands of genome-wide experiments that quantify various forms of biological activity, such as histone modifications, protein binding, and chromatin accessibility. Although these experiments represent a small fraction of the possible experiments that could be performed, the human genome remains the most characterized of any species. We propose an extension to the imputation approach Avocado that enables the model to leverage the large number of human genomic data sets when making imputations in other species. This extension takes advantage of shared synteny and the similar role that many types of biochemical activity have in evolutionarily-related species. We found that not only does this extension result in improved imputations of mouse genomics experiments, but that the extended model is able to make accurate imputations for assays that have been performed in humans but not in mice. This ability to make "zero-shot" imputations greatly increases the utility of such imputation approaches, and enables comprehensive imputations to be made for species even when experimental data are sparse.

4:40 PM-4:50 PM
A Combined Species Model using Branched Multitask Routing Networks
Format: Pre-recorded with live Q&A

  • Chendi Wang, The University of British Columbia, Canada
  • Bernard Ng, University of British Columbia, Canada
  • Gherman Novakovsky, Centre for Molecular Medicine and Therapeutics, University of British Columbia, Canada
  • Alexandra Maslova, The University of British Columbia, Canada
  • Sara Mostafavi, University of British Columbia, Canada

Presentation Overview:

Recent developments in deep learning have shown unprecedented accuracy in predicting chromatin features using DNA sequences alone. The high prediction accuracy partly stems from the millions of labeled sequences available across whole organs and cell-lines in culture, but for specific biological systems, the number of labeled sequences is often an order of magnitude lower. Acquiring additional labeled sequences from the same species adds little variability. A promising strategy is to combine datasets across species, since flanks around conserved regulatory subsequences tend to vary. Common approaches for combining datasets include transfer learning and multitask learning. However, these approaches do not facilitate common and species-specific attribute extraction from the models. To improve both prediction accuracy and model interpretability, we propose a branched multitask routing network (BMTRN) for cross-species chromatin feature prediction. The idea is to split a network into common and species-specific layers via task routing so that shared signals between datasets can be exploited without assuming the species have the same regulatory attributes. We apply BMTRN to ATAC-seq datasets from mouse and human, and show that BMTRN improves prediction accuracy, enhances filter reproducibility, facilitates easier interpretation of cross-species differences, and increases sensitivity in detecting effects of functional variants in silico.

5:00 PM-5:20 PM
Proceedings Presentation: Towards Heterogeneous Information Fusion: Bipartite Graph Convolutional Networks for In Silico Drug Repurposing
Format: Pre-recorded with live Q&A

  • Mu Zhou, SenseBrain Research, United States
  • Zichen Wang, University of California, Los Angeles, United States
  • Corey Arnold, University of California, Los Angeles, United States

Presentation Overview:

Motivation: Mining disease and drug associations and their interactions is essential for developing computational models for drug repurposing and for understanding the underlying biological mechanisms. Large-scale biological databases are increasingly available for pharmaceutical research, allowing for deep characterization in molecular informatics and drug discovery. In this study, we propose a bipartite graph convolution network model that integrates multi-scale pharmaceutical information. In particular, the introduction of protein nodes serves as a bridge for message passing, which provides insights into the protein-protein interaction (PPI) network for improved drug repositioning assessment.

Results: Our approach combines insights from multi-scale pharmaceutical information by constructing a multi-relational graph of protein–protein, drug–protein and disease-protein interactions. Specifically, our model offers a novel avenue for message passing among diverse domains: we learn useful feature representations for all graph nodes by fusing biological information across interaction edges. The high-level representations of drug-disease pairs are then fed into a multi-layer perceptron decoder to predict therapeutic indications. Unlike conventional graph convolution networks that assume the same node attributes in a global graph, our model is domain-consistent, modeling inter-domain information fusion with a bipartite graph convolution operation. We offer an exploratory analysis for finding novel drug-disease associations. Extensive experiments show that our approach outperforms multiple baseline approaches.

5:20 PM-5:30 PM
Learning Context-aware Structural Representations to Predict Antigen and Antibody Binding Interfaces
Format: Pre-recorded with live Q&A

  • Srivamshi Pittala, Dartmouth College, United States
  • Chris Bailey-Kellogg, Dartmouth College, United States

Presentation Overview:

Understanding how antibodies specifically interact with their antigens can enable better drug and vaccine design, as well as provide insights into natural immunity. Experimental structural characterization can detail the “ground truth” of antibody-antigen interactions, but computational methods are required to efficiently scale to large-scale studies. In order to increase prediction accuracy as well as to provide a means to gain new biological insights into these interactions, we have developed PECAN, a unified deep learning-based framework to predict binding interfaces on both antibodies and antigens. PECAN leverages three key aspects of antibody-antigen interactions to learn predictive structural representations: (1) since interfaces are formed from multiple residues in spatial proximity, we employ graph convolutions to aggregate properties across local regions in a protein; (2) since interactions are specific between antibody-antigen pairs, we employ an attention layer to explicitly encode the context of the partner; (3) since more data is available for general protein-protein interactions, we employ transfer learning to leverage this data as a prior for the specific case of antibody-antigen interactions. We show that PECAN achieves state-of-the-art performance at predicting binding interfaces on both antibodies and antigens, and that each of its three aspects drives additional improvement in the performance.

5:30 PM-5:40 PM
PaccMann^RL: Designing anticancer drugs from transcriptomic data via reinforcement learning
Format: Pre-recorded with live Q&A

  • Jannis Born, ETH Zurich, Switzerland
  • Matteo Manica, IBM, Switzerland
  • Ali Oskooei, IBM, Switzerland
  • Joris Cadow, IBM, Switzerland
  • Karsten Borgwardt, ETH Zurich, Switzerland
  • Maria Rodriguez Martinez, IBM Research Zurich, Switzerland

Presentation Overview:

While state-of-the-art deep learning approaches have shown potential in generating compounds with desired chemical properties, they disregard the cellular biomolecular properties of the target disease. We introduce a novel framework for de-novo molecular design that systematically leverages systems biology information into the drug discovery process. Embodied through two separate VAEs, the drug generation is driven by a disease context (transcriptomic profiles of cancer cells) deemed to represent the target environment of the drug. Showcased at the task of anticancer drug discovery, our conditional generative model is demonstrated to tailor anticancer compounds to target desired biomolecular profiles. Specifically, we reveal how the molecule generation can be biased towards compounds with high predicted inhibitory effect against individual cell-lines or cell-lines from specific cancer sites. We verify our approach by investigating candidate drugs generated against specific cancer types and find highest structural similarity to existing compounds with known efficacy against these types. Despite no direct optimization of other pharmacological properties, we report good agreement with cancer drugs in metrics like drug-likeness, synthesizability and solubility. We envision our approach to be a step towards increasing success rates in lead compound discovery and finding more targeted medicines by leveraging the cellular environment of the disease.

5:40 PM-5:50 PM
Tissue-guided LASSO for prediction of clinical drug response using preclinical samples
Format: Pre-recorded with live Q&A

  • Saurabh Sinha, University of Illinois at Urbana-Champaign, United States
  • Edward Huang, University of Illinois at Urbana-Champaign, United States
  • Ameya Bhope, McGill University, Canada
  • Jing Lim, University of Illinois at Urbana-Champaign, United States
  • Amin Emad, McGill University, Canada

Presentation Overview:

Predicting the clinical drug response (CDR) of cancer patients, based on their clinical parameters and their tumours' molecular profiles, can play an important role in precision medicine. While machine learning (ML) models have the potential to address this issue, their training requires data from a large number of patients treated with each drug, limiting their feasibility for many drugs. One alternative is training ML models on large databases containing molecular profiles of hundreds of preclinical cell lines and their response to hundreds of drugs. Here, we developed a novel algorithm (TG-LASSO) that explicitly incorporates information on samples' tissue of origin with gene expression profiles to predict the CDR of patients using preclinical samples. Using two large databases, we showed that TG-LASSO can accurately distinguish between resistant and sensitive patients for 7 out of 12 drugs, outperforming various other methods. Moreover, TG-LASSO identified genes associated with the drug response, including known targets and pathways involved in the drugs' mechanism of action. Additionally, genes identified by this method for multiple drugs in a tissue are associated with patient survival and can be used to predict their outcome. In summary, TG-LASSO can predict patients’ CDR and identify biomarkers of drug sensitivity and survival.

5:50 PM-6:00 PM
Quantifying gene selection in cancer through protein functional alteration bias
Format: Pre-recorded with live Q&A

  • Michal Linial, The Hebrew University of Jerusalem, Israel
  • Nadav Brandes, The Hebrew University of Jerusalem, Israel
  • Nathan Linial, The Hebrew University of Jerusalem, Israel

Presentation Overview:

Compiling the catalogue of genes actively involved in cancer is an ongoing endeavor, with profound implications for the understanding and treatment of the disease. An abundance of computational methods has been developed to screen the genome for candidate driver genes based on genomic data of somatic mutations in tumors. Existing methods make many implicit and explicit assumptions about the distribution of random mutations. We present FABRIC, a new framework for quantifying the selection of genes in cancer by assessing the effects of de-novo somatic mutations on protein-coding genes. Using a machine-learning model, we quantified the functional effects of ∼3M somatic mutations extracted from over 10 000 human cancerous samples, and compared them against the effects of all possible single-nucleotide mutations in the coding human genome. We detected 593 protein-coding genes showing statistically significant bias towards harmful mutations. These genes, discovered without any prior knowledge, show an overwhelming overlap with known cancer genes, but also include many overlooked genes. FABRIC is designed to avoid false discoveries by comparing each gene to its own background model using rigorous statistics, making minimal assumptions about the distribution of random somatic mutations. The framework is an open-source project with a simple command-line interface.

Tuesday, July 14th
10:40 AM-11:40 AM
Finding concise descriptors of genomic data
Format: Live-stream

  • Maria Chikina, University of Pittsburgh, United States

Presentation Overview:

Genome-scale technologies can measure thousands or even millions of molecules, but the individual measurements are often noisy readouts of the internal biological state. For many computational approaches the ultimate goal is to infer the true biological state parameters from their effects on a much larger set of experimental readouts. Indeed, much of genomic data analysis can be viewed as a special case of dimensionality reduction. Unlike a generic dimensionality reduction problem, however, in the biological case we have extensive prior knowledge about the data-generating process. In this talk we will discuss several methods that exploit this prior knowledge to create interpretable low-dimensional representations for genome-scale datasets.

12:00 PM-12:20 PM
Causal network learning using a semi-supervised approach
Format: Pre-recorded with live Q&A

  • Steven M. Hill, MRC Biostatistics Unit, University of Cambridge, United Kingdom
  • Chris J. Oates, Newcastle University, United Kingdom
  • Duncan A. Blythe, German Center for Neurodegenerative Diseases, Germany
  • Sach Mukherjee, German Center for Neurodegenerative Diseases, Germany

Presentation Overview:

Causal interplay between molecular components is central to the regulation of cellular behaviour. Data-driven learning of molecular network topology continues to be an active area of research, including development of methods that have a particular focus on learning causal relationships. These methods often use statistical models with explicit causal assumptions (for example, causal directed acyclic graphs). Here, we take an alternative approach and view causal network inference from a machine learning point of view. The idea is to allow direct learning of patterns of causal influence between nodes. Binary indicators of causal influence between pairs of variables are treated as "labels". Available data for the variables of interest are used to provide edge features for the labelling task. Background knowledge or any available interventional data provide labels on a subset of variable pairs. The task is to learn the remaining labels that are unobserved. Using a specific approach rooted in manifold semi-supervised learning we present empirical results on three different biological datasets, including data where causal effects can be verified experimentally. Our results demonstrate the efficacy and highly general nature of the approach as well as its simplicity from a user's point of view.

12:20 PM-12:40 PM
Proceedings Presentation: Factorized embeddings learns rich and biologically meaningful embedding spaces using factorized tensor decomposition
Format: Pre-recorded with live Q&A

  • Assya Trofimov, IRIC - Université de Montréal, Canada
  • Joseph Paul Cohen, University of Montreal, Canada
  • Yoshua Bengio, U. Montreal, Canada
  • Claude Perreault, IRIC - Université de Montréal, Canada
  • Sebastien Lemieux, IRIC / Université de Montréal, Canada

Presentation Overview:

The recent development of sequencing technologies revolutionised our understanding of the inner workings of the cell as well as the way disease is treated. A single RNA sequencing (RNA-Seq) experiment, however, measures tens of thousands of parameters simultaneously. While the results are information rich, data analysis provides a challenge. Dimensionality reduction methods help with this task by extracting patterns from the data by compressing it into compact vector representations.

We present the factorized embeddings (FE) model, a self-supervised deep learning algorithm that learns simultaneously, by tensor factorization, gene and sample representation spaces. We ran the model on RNA-Seq data from two large-scale cohorts and observed that the sample representation captures information on single-gene and global gene expression patterns. Moreover, we found that the gene representation space was organized such that tissue-specific genes, highly correlated genes as well as genes participating in the same GO terms were grouped. Finally, we compared the vector representation of samples learned by the FE model to other similar models on 49 regression tasks. We report that the FE-trained representations rank first or second in all of the tasks, surpassing, sometimes by a considerable margin, other representations.

2:00 PM-2:20 PM
Adversarial Deconfounding Autoencoder for Learning Robust Gene Expression Embeddings
Format: Pre-recorded with live Q&A

  • Ayse Dincer, University of Washington, United States
  • Joseph Janizek, University of Washington, United States
  • Su-In Lee, University of Washington, United States

Presentation Overview:

The increasing number of gene expression profiles has enabled the use of complex models, such as deep unsupervised neural networks, to extract a latent space from these profiles. However, expression profiles, especially when collected in large numbers, inherently contain variations introduced by technical artifacts (e.g., batch effects) and uninteresting biological variables (e.g., age) in addition to the true signals of interest. These sources of variation, called confounders, produce embeddings that fail to capture biological variables of interest and transfer to different domains. To remedy this problem, we attempt to disentangle confounders from true signals to generate biologically informative embeddings by introducing the AD-AE (Adversarial Deconfounding AutoEncoder) approach. The AD-AE model consists of an autoencoder and an adversary network that we jointly train to generate embeddings that can encode as much information as possible without encoding any confounding signal. By applying AD-AE to two distinct gene expression datasets, we show that our model can (1) generate embeddings that do not encode confounder information, (2) conserve the biological signals present in the original space, and (3) generalize successfully across different confounder domains. We believe that this adversarial deconfounding approach can be the key to discovering robust expression patterns.
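The joint training described above can be pictured as two opposing objectives. The sketch below is an assumption-laden illustration rather than the AD-AE code: the mean-squared-error losses, the trade-off weight `lam`, and every function name are hypothetical, and the abstract does not state the exact form of the loss.

```python
def mse(predicted, target):
    """Mean squared error between two equal-length sequences of numbers."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def adversary_objective(confounder, confounder_predicted):
    """The adversary tries to predict the confounder from the embedding."""
    return mse(confounder_predicted, confounder)


def autoencoder_objective(x, x_reconstructed, confounder,
                          confounder_predicted, lam=1.0):
    """Reconstruction error minus the weighted adversary error.

    Minimizing this rewards faithful reconstruction while rewarding
    embeddings from which the adversary predicts the confounder poorly.
    `lam` is an assumed trade-off weight.
    """
    reconstruction = mse(x_reconstructed, x)
    return reconstruction - lam * adversary_objective(confounder,
                                                      confounder_predicted)


# Toy numbers: perfect reconstruction, adversary off by 1.0 on the confounder.
print(autoencoder_objective([1.0, 2.0], [1.0, 2.0], [0.5], [1.5]))
```

In practice the two objectives would be optimized in alternation with a deep learning framework; the plain-Python functions here only show how the terms pull against each other.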

2:20 PM-2:40 PM
Latent periodic process inference from single-cell RNA-seq data
Format: Pre-recorded with live Q&A

  • Ken Chen, The University of Texas MD Anderson Cancer Center, United States
  • Fang Wang, The University of Texas MD Anderson Cancer Center, United States
  • Shaoheng Liang, The University of Texas MD Anderson Cancer Center, United States
  • Jincheng Han, The University of Texas MD Anderson Cancer Center, United States

Presentation Overview: Show

The development of a phenotype in a multicellular organism often involves multiple, simultaneously occurring biological processes. Advances in single-cell RNA-sequencing make it possible to infer latent developmental processes from the transcriptomic profiles of cells at various developmental stages. Accurate characterization is challenging, however, particularly for periodic processes such as the cell cycle. To address this, we develop Cyclum, an autoencoder approach to identify circular trajectories in the gene expression space. Experiments using scRNA-seq data from a set of proliferating cell lines and mouse embryonic stem cells show that Cyclum reconstructed experimentally labeled cell-cycle stages and rediscovered known cell-cycle genes more accurately than Cyclone, ccRemover, Seurat, and reCAT. Applying Cyclum to remove cell-cycle effects substantially improves the delineation of cell subpopulations, which is useful for establishing various cell atlases and studying tumor heterogeneity. Comparing circular patterns in each gene between nicotine-treated human embryonic cells and a control sample suggests both established and new target genes of nicotine. Thus, Cyclum can be applied as a generic tool for characterizing periodic processes underlying cellular development/differentiation and cellular architecture in scRNA-seq data. These features make it useful for constructing the Human Cell Atlas, the Human Tumor Atlas, and other cell ontologies.

2:40 PM-3:00 PM
SCIM: Universal Single-Cell Matching with Unpaired Feature Sets
Format: Pre-recorded with live Q&A

  • Gunnar Rätsch, ETH Zürich, Switzerland
  • Stefan Stark, Biomedical Informatics, Dept. of Computer Science, ETH Zürich, Universitätsstrasse 6, 8092 Zürich, Switzerland
  • Joanna Ficek, Biomedical Informatics, Dept. of Computer Science, ETH Zürich, Universitätsstrasse 6, 8092 Zürich, Switzerland
  • Francesco Locatello, Biomedical Informatics, Dept. of Computer Science, ETH Zürich, Universitätsstrasse 6, 8092 Zürich, Switzerland
  • Ximena Bonilla, Biomedical Informatics, Dept. of Computer Science, ETH Zürich, Universitätsstrasse 6, 8092 Zürich, Switzerland
  • Stéphane Chevrier, Department of Quantitative Biomedicine, University of Zürich, Winterthurerstrasse 190, 8057 Zürich, Switzerland
  • Franziska Singer, Nexus Personalized Health Technologies, ETH Zürich, Rämistrasse 101, 8092 Zürich, Switzerland
  • Kjong-Van Lehmann, Biomedical Informatics, Dept. of Computer Science, ETH Zürich, Universitätsstrasse 6, 8092 Zürich, Switzerland

Presentation Overview: Show

Multi-modal molecular profiling of samples on a single-cell level can yield deeper insights into tissue microenvironment and disease dynamics. Profiling technologies like scRNA-seq often consume the analyzed cells, so cellular correspondences between data modalities are lost. To exploit single-cell ’omics technologies jointly, we propose Single-Cell data Integration via Matching (SCIM), a scalable and accurate approach to recover such correspondences in two or more technologies, even in the absence of overlapping feature sets. SCIM assumes that cells share a common underlying structure and reconstructs this technology-invariant latent space using an auto-encoder framework with an adversarial objective. Cell pairs across technologies are then identified using a customized bipartite matching scheme operating on the latent representations. We evaluate SCIM on a simulated branching process designed for scRNA-seq data (total of 192,000 cells) and show that the cell-to-cell matches reflect the same pseudotime (Pearson’s coefficient: 0.86). Moreover, we apply our method to a real-world melanoma tumor sample and achieve 93% cell-matching accuracy with respect to cell-type label when aligning scRNA-seq and CyTOF datasets. SCIM is a scalable and flexible algorithm that bridges the gap between generation and integrative interpretation of diverse multi-modal data.
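
To make the matching step concrete, here is a toy sketch of bipartite matching on latent coordinates. It is not SCIM's customized scheme: it assumes equal numbers of cells in both technologies, uses Euclidean distance as the cost, and searches all assignments by brute force, which is only feasible for a handful of cells. The function name `match_cells` is hypothetical.

```python
import math
from itertools import permutations


def match_cells(latent_a, latent_b):
    """Pair each cell in latent_a with one cell in latent_b so that the
    total Euclidean distance between paired latent vectors is minimal.

    Returns a list `idx` where cell i of latent_a is matched to
    latent_b[idx[i]]. Brute force over all permutations: toy sizes only.
    """
    if len(latent_a) != len(latent_b):
        raise ValueError("this sketch assumes equally sized cell sets")
    best_cost, best_perm = math.inf, None
    for perm in permutations(range(len(latent_b))):
        cost = sum(math.dist(latent_a[i], latent_b[j])
                   for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return list(best_perm)


# Hypothetical 2-D latent codes for three cells per technology.
rna_latent = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
cytof_latent = [(5.1, 4.9), (0.1, 0.0), (0.9, 1.2)]
print(match_cells(rna_latent, cytof_latent))
```

For realistic cell counts, a polynomial-time assignment solver such as SciPy's `linear_sum_assignment` would replace the brute-force loop.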

3:20 PM-3:40 PM
Proceedings Presentation: TinGa: fast and flexible trajectory inference with Growing Neural Gas
Format: Pre-recorded with live Q&A

  • Helena Todorov, Ghent University, Belgium
  • Wouter Saelens, Ghent University, Belgium
  • Robrecht Cannoodt, Ghent University, Belgium
  • Yvan Saeys, Ghent University, Belgium

Presentation Overview: Show

Motivation: During the last decade, trajectory inference methods have emerged as a novel framework to model cell developmental dynamics, most notably in the area of single-cell transcriptomics. At present, more than 70 trajectory inference methods have been published, and recent benchmarks showed that, while some methods perform well for certain trajectory types, overall there is still a lot of room for improvement.
Results: In this work we present TinGa, a new trajectory inference model that is fast and flexible, and that is based on growing neural graphs. This allows TinGa to model both the simplest and the most complex trajectory types. We performed an extensive comparison of TinGa to the five best existing methods for trajectory inference on a set of 250 datasets, including both synthetic and real datasets. Overall, TinGa obtained better results than all other methods on all ranges of data complexity, from the simplest linear datasets to the most complex disconnected graphs. In addition, TinGa obtained the fastest run times, showing that our method is one of the most versatile methods to date.
Availability: R scripts for running TinGa, comparing it to top existing methods and generating the figures of this paper are available at https://github.com/Helena-todd/researchgng

3:40 PM-4:00 PM
Proceedings Presentation: Unsupervised Topological Alignment for Single-Cell Multi-Omics Integration
Format: Pre-recorded with live Q&A

  • Kai Cao, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, China
  • Xiangqi Bai, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, China
  • Yiguang Hong, Chinese Academy of Sciences, China
  • Lin Wan, Academy of Mathematics and Systems Science, CAS, China

Presentation Overview: Show

Motivation: Single-cell multi-omics data provide a comprehensive molecular view of cells. However, single-cell multi-omics datasets consist of unpaired cells measured with distinct unmatched features across modalities, making data integration challenging.
Results: In this study, we present a novel algorithm, termed UnionCom, for unsupervised topological alignment in single-cell multi-omics integration. UnionCom does not require any correspondence information, either among cells or among features. It first embeds the intrinsic low-dimensional structure of each single-cell dataset into a distance matrix of cells within the same dataset and then aligns the cells across single-cell multi-omics datasets by matching the distance matrices via a matrix optimization method. Finally, it projects the distinct unmatched features across single-cell datasets into a common embedding space for feature comparability of the aligned cells. To match the complex, nonlinear, geometrically distorted low-dimensional structures across datasets, UnionCom proposes and adjusts a global scaling parameter on distance matrices for aligning similar topological structures. It does not require one-to-one correspondence among cells across datasets, and it can accommodate samples with dataset-specific cell types. UnionCom outperforms state-of-the-art methods on both simulated and real single-cell multi-omics datasets. UnionCom is robust to parameter choices, as well as subsampling of features.
Availability: UnionCom software is available at https://github.com/caokai1073/UnionCom.

4:00 PM-4:20 PM
scNym: Semi-supervised neural networks for single cell identity classification
Format: Pre-recorded with live Q&A

  • David Kelley, Calico Life Sciences, LLC, United States
  • Jacob Kimmel, Calico Life Sciences, LLC, United States

Presentation Overview: Show

Single cell genomics experiments can reveal the keystone cellular actors in complex tissues. However, annotating cell type and state identities for each molecular profile in these experiments remains an analytical bottleneck. Here, we present scNym, a semi-supervised neural network that learns to transfer cell identity annotations from one experiment to another. scNym uses consistency regularization and entropy minimization techniques in the semi-supervised MixMatch framework to take advantage of information in both the labeled and unlabeled datasets. In benchmark experiments, we show that scNym offers superior performance to baseline approaches in transferring cell identity annotations across experiments performed with different technologies or in distinct biological conditions. We show with ablation experiments that semi-supervision techniques improved both the performance and calibration of scNym models. We also show that scNym models are well-calibrated and interpretable with saliency methods, allowing for review of model decisions by domain experts.
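
As an illustration of entropy minimization, the sketch below applies temperature sharpening to a predicted class distribution, which is the mechanism the MixMatch framework uses to push pseudolabels for unlabeled data toward low entropy. Whether scNym uses exactly this form, and the temperature value `0.5`, are assumptions; the cell-type probabilities are made up for the example.

```python
import math


def sharpen(probs, temperature=0.5):
    """Raise each probability to the power 1/temperature and renormalize.

    For temperature < 1 this concentrates mass on the most likely class,
    lowering the entropy of the distribution.
    """
    powered = [p ** (1.0 / temperature) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]


def entropy(probs):
    """Shannon entropy in nats; zero-probability classes contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical model guess for one unlabeled cell over two cell types.
guess = [0.6, 0.4]
pseudolabel = sharpen(guess)
print(pseudolabel, entropy(guess), entropy(pseudolabel))
```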

4:20 PM-4:30 PM
Gromov-Wasserstein based optimal transport to align single-cell multi-omics data
Format: Pre-recorded with live Q&A

  • William S. Noble, University of Washington, United States
  • Pinar Demetci, Brown University, United States
  • Rebecca Santorella, Brown University, United States
  • Bjorn Sandstede, Brown University, United States
  • Ritambhara Singh, Brown University, United States

Presentation Overview: Show

Data integration of single-cell measurements is critical for our understanding of cell development and disease, but the lack of correspondence between different types of single-cell measurements makes such efforts challenging. Several unsupervised algorithms are capable of aligning heterogeneous types of single-cell measurements in a shared space, enabling the creation of mappings between single cells in different data modalities.
We present Single-Cell alignment using Optimal Transport (SCOT), an unsupervised learning algorithm that uses Gromov-Wasserstein-based optimal transport to align single-cell multi-omics datasets. SCOT calculates a probabilistic coupling matrix that matches cells across two datasets. The optimization uses k-nearest neighbor graphs, thus preserving the local geometry of the data. We use the resulting coupling matrix to project one single-cell dataset onto another via barycentric projection. We compare the alignment performance of SCOT with state-of-the-art algorithms on three simulated and two real datasets. Our results demonstrate that SCOT yields results that are comparable in quality to those of competing methods, but SCOT is significantly faster and requires tuning fewer hyperparameters.
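
The barycentric projection step can be sketched as follows, assuming the coupling matrix has already been computed by some optimal transport solver. Each cell in the first dataset is placed at the coupling-weighted average of the cells in the second dataset. The function name and the toy numbers are hypothetical, and the Gromov-Wasserstein optimization itself is not shown.

```python
def barycentric_projection(coupling, target):
    """Project source cells into the target feature space.

    coupling: n_source x n_target matrix of non-negative weights,
              where every row has positive total mass.
    target:   n_target x d matrix of target-domain coordinates.
    Returns an n_source x d matrix whose row i is the weighted mean of
    the target rows, using row i of the coupling as weights.
    """
    dim = len(target[0])
    projected = []
    for row in coupling:
        mass = sum(row)
        projected.append([
            sum(w * y[k] for w, y in zip(row, target)) / mass
            for k in range(dim)
        ])
    return projected


# Toy coupling: source cell 0 maps entirely to target cell 0,
# source cell 1 is split evenly between both target cells.
coupling = [[0.5, 0.0],
            [0.25, 0.25]]
target = [[0.0, 0.0],
          [2.0, 4.0]]
print(barycentric_projection(coupling, target))
```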

4:30 PM-4:40 PM
A Non-Parametric Bayesian Framework for Detecting Coregulated Splicing Signals in Heterogeneous RNA Datasets with Applications to Acute Myeloid Leukemia
Format: Pre-recorded with live Q&A

  • Yoseph Barash, University of Pennsylvania, United States
  • David Wang, University of Pennsylvania, United States
  • Mathieu Quesnel-Vallieres, University of Pennsylvania, United States

Presentation Overview: Show

Analysis of RNASeq data from large patient cohorts can reveal transcriptomic perturbations that are associated with disease. This is typically framed as an unsupervised learning task to discover latent structure in a data matrix. However, the heterogeneity of these datasets makes such analysis challenging. For example, in acute myeloid leukemia, mutations in splice factor genes occurring in a subset of the patients may only result in alteration of a subset of splicing events. Thus, there is a need to identify “tiles”, defined by a subset of samples and splicing events with abnormal signals. Although algorithms exist for this task, they fail to model splicing data.
To address these challenges, we propose CHESSBOARD, a non-parametric Bayesian model for unsupervised discovery of tiles. Our algorithm does not require a priori knowledge of the number of tiles and uses a unique missing value model for cancer data. First, we apply our model to synthetic datasets and show it outperforms several baseline approaches. Next, we show that it recovers tiles characterized by splicing aberrations which are reproducible in multiple AML patient cohorts. Finally, we show that tiles we discover are correlated with drug response to therapeutics, pointing to translational potential of our findings.

5:00 PM-5:10 PM
Knowledge-primed neural networks enable biologically interpretable deep learning on single-cell sequencing data
Format: Pre-recorded with live Q&A

  • Nikolaus Fortelny, CeMM Center for Molecular Medicine of the Austrian Academy of Sciences, Austria
  • Christoph Bock, CeMM Center for Molecular Medicine of the Austrian Academy of Sciences, Austria

Presentation Overview: Show

Deep learning algorithms are powerful predictors but generally provide little insight into the functions that underlie a prediction. We hypothesized that deep learning on biological networks instead of artificial networks assigns meaningful weights that can be readily interpreted, and we therefore developed knowledge-primed neural networks (KPNNs). In KPNNs, every node has a molecular equivalent (protein activity) and every edge has a mechanistic interpretation (a regulatory interaction). KPNNs thereby connect gene expression to cellular phenotypes through large regulatory networks. After training on large single-cell transcriptomics datasets, KPNNs reveal important regulatory proteins.
We validate KPNN interpretability on simulated data and biological systems with known ground truth, and reveal regulatory proteins in underexplored systems. We show that KPNN interpretations largely differ from standard interpretation approaches that are focused on features, and that interpretations are lost in shuffled networks. These results are enabled by an optimized learning method that stabilizes node weights after random initialization and controls for connectivity biases in biological networks.
In summary, KPNNs combine the predictive power of deep learning with the interpretability of biological networks. While demonstrated on single-cell sequencing data and regulatory networks, our method is broadly relevant to all research areas where domain knowledge can be represented as networks.

5:10 PM-5:20 PM
DeepPLIER: a deep learning approach to pathway-level representation of gene expression data
Format: Pre-recorded with live Q&A

  • Bernard Ng, University of British Columbia, Canada
  • Sara Mostafavi, University of British Columbia, Canada
  • Yichen Zhang, University of British Columbia, Canada
  • Maria Chikina, University of Pittsburgh, United States

Presentation Overview: Show

Extracting latent representations of gene expression data can provide insights into components and activations of gene pathways and networks. Inspired by the Pathway-level information extractor (PLIER) method, we propose a deep learning model, DeepPLIER, that incorporates pathway information in its architecture as a prior for extracting latent variables (LVs) from gene expression data. DeepPLIER constructs LVs as combinations of known pathways by using partially connected nodes, but also includes fully connected nodes to correct for pathway misspecifications and to learn potentially unknown pathways from data. By incorporating pathway information, extraction of biologically relevant LVs is encouraged. Using simulation, we show that DeepPLIER achieves higher LV estimation accuracy than PLIER, especially in scenarios where prior pathways are partially missing. Using two large gene expression datasets from bulk brain tissues (ROSMAP and CommonMind), we show that DeepPLIER attains higher LV replication than PLIER. We also show that some of the identified LVs represent cell type proportions, and that for a number of brain cell types these LVs align more accurately with experimental estimates from immunohistochemistry than those from PLIER.
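
The idea of partially connected nodes can be illustrated with a masked linear layer, in which each latent node only receives input from the genes annotated to its pathway. This is a minimal sketch under assumed conventions (a binary gene-by-pathway mask, no bias or nonlinearity); the abstract does not specify DeepPLIER's actual layer design, and all names and numbers here are hypothetical.

```python
def masked_linear(expression, weights, mask):
    """Compute one activation per latent node.

    expression: list of gene expression values (length n_genes).
    weights:    n_nodes x n_genes weight matrix.
    mask:       n_nodes x n_genes matrix of 0/1 entries; a 1 means the
                gene belongs to that node's pathway. A row of all ones
                gives a fully connected node.
    """
    return [
        sum(w * m * x for w, m, x in zip(w_row, m_row, expression))
        for w_row, m_row in zip(weights, mask)
    ]


# Three genes; node 0 sees genes 0 and 1 (one pathway), node 1 sees
# only gene 2, node 2 is fully connected.
expression = [1.0, 2.0, 3.0]
weights = [[1.0, 1.0, 1.0]] * 3
mask = [[1, 1, 0],
        [0, 0, 1],
        [1, 1, 1]]
print(masked_linear(expression, weights, mask))
```

Because masked weights never contribute to the output, they also receive no gradient during training, which keeps each partially connected node tied to its pathway.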

5:20 PM-5:30 PM
Inferring Signaling Pathways with Probabilistic Programming
Format: Pre-recorded with live Q&A

  • David Merrell, University of Wisconsin, United States
  • Anthony Gitter, University of Wisconsin, United States

Presentation Overview: Show

Cells regulate themselves via complex biochemical processes called signaling pathways. These are usually depicted as networks, where nodes represent proteins and edges indicate their influence relationships. To understand diseases and therapies at the cellular level, it is crucial to understand the signaling pathways at work. Because signaling pathways can be rewired by disease, inferring signaling pathways from context-specific data is highly valuable.

We formulate signaling pathway inference as a Dynamic Bayesian Network (DBN) structure learning problem on phosphoproteomic time course data. We take a Bayesian approach, using MCMC to sample DBN structures. We use a novel proposal distribution that efficiently samples large, sparse graphs. We also relax some modeling assumptions made in past works. We call the resulting method Sparse Signaling Pathway Sampling (SSPS). We implement SSPS in Julia, using the Gen probabilistic programming language.

We evaluate SSPS on simulated and real data. SSPS attains superior scalability for large problems, in comparison to other DBN techniques. We also find that it competes well against established methods on the HPN-DREAM breast cancer network reconstruction challenge. SSPS significantly improves Bayesian techniques for network inference, and gives a proof of concept for probabilistic programming in this setting.

5:30 PM-5:40 PM
Can We Trust Convolutional Neural Networks for Genomics?
Format: Pre-recorded with live Q&A

  • Peter Koo, Cold Spring Harbor Laboratory, United States
  • Matthew Ploenzke, Harvard University, United States

Presentation Overview: Show

Convolutional neural networks (CNNs) are powerful methods to predict transcription factor binding sites from DNA sequence. Although CNNs are largely considered a "black box", attribution-based interpretability methods can be employed to identify single nucleotide variants that are important for model predictions. However, there is no guarantee that attribution methods will recover meaningful features even for state-of-the-art CNNs. Here we train CNNs with different architectures and training procedures on synthetic sequences embedded with known motifs and then quantitatively measure how well attribution methods recover ground truth. We find that deep CNNs tend to recover less interpretable motifs, despite yielding superior performance on held-out test data. This suggests that good model performance does not necessarily imply good model interpretability. Strikingly, we find that adversarial training, a method to promote robustness to small perturbations to the input data, can significantly improve the efficacy of attribution methods. We also find that CNNs specially designed with an inductive bias to learn strong motif representations consistently improve interpretability. We then show that these results generalize to in vivo ChIP-seq data. This work highlights the importance of moving beyond performance on benchmark datasets when considering whether to trust a CNN’s prediction in genomics.