Leading Professional Society for Computational Biology and Bioinformatics
Connecting, Training, Empowering, Worldwide


MLCSB: Machine Learning in Computational and Systems Biology

COSI Track Presentations

Schedule subject to change
Wednesday, July 24th
10:15 AM-10:20 AM
Welcome and Start MLCSB COSI
Room: San Francisco (3rd Floor)
10:20 AM-11:20 AM
MLCSB Keynote: Representation Learning of Patient Health States
Room: San Francisco (3rd Floor)
  • Gunnar Rätsch, ETH Zurich, Switzerland
11:20 AM-11:40 AM
Proceedings Presentation: Weighted Elastic Net for Unsupervised Domain Adaptation with Application to Age Prediction from DNA Methylation Data
Room: San Francisco (3rd Floor)
  • Lisa Handl, University of Tübingen, Germany
  • Adrin Jalali, Max-Planck-Institut für Informatik, Germany
  • Michael Scherer, Max-Planck Institute for Informatics, Germany
  • Ralf Eggeling, University of Tübingen, Germany
  • Nico Pfeifer, Department of Computer Science, University of Tübingen, Germany

Motivation: Predictive models are a powerful tool for solving complex problems in computational biology. They are typically designed to predict or classify data coming from the same unknown distribution as the training data. In many real-world settings, however, uncontrolled biological or technical factors can lead to a distribution mismatch between datasets acquired at different times, causing model performance to deteriorate on new data. A common additional obstacle in computational biology is scarce data with many more features than samples. To address these problems, we propose a method for unsupervised domain adaptation that is based on a weighted elastic net. The key idea of our approach is to compare dependencies between inputs in training and test data and to increase the cost of differently behaving features in the elastic-net regularization term. In doing so, we encourage the model to assign a higher importance to features that are robust and behave similarly across domains.

Results: We evaluate our method both on simulated data with varying degrees of distribution mismatch and on real data, considering the problem of age prediction based on DNA methylation data across multiple tissues. Compared to a non-adaptive standard model, our approach substantially reduces errors on samples with a mismatched distribution. On real data, we achieve far lower errors on cerebellum samples, a tissue which is not part of the training data and poorly predicted by standard models. Our results demonstrate that unsupervised domain adaptation is possible for applications in computational biology, even with many more features than samples.
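The idea of raising the regularization cost of individual features can be sketched as an elastic net with per-feature penalty weights. This is a minimal illustration, not the authors' implementation: the coordinate-descent solver, the function names, and the toy weights are assumptions, and the abstract does not specify how the weights are derived from the comparison of training and test data.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the lasso proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def weighted_elastic_net(X, y, weights, lam=0.1, alpha=0.5, n_iter=100):
    """Coordinate descent for the objective
        (1 / 2n) * ||y - X b||^2
        + lam * sum_j weights[j] * (alpha * |b_j| + (1 - alpha) / 2 * b_j^2).
    A larger weights[j] makes feature j more expensive to use.
    X is a list of rows, y a list of targets (hypothetical toy interface).
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # mean squared value of feature j
            for i in range(n):
                pred_without_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            rho /= n
            z /= n
            l1 = lam * alpha * weights[j]
            l2 = lam * (1 - alpha) * weights[j]
            beta[j] = soft_threshold(rho, l1) / (z + l2)
    return beta


# Two identical features; the second is assumed to behave differently
# across domains and therefore receives a higher penalty weight.
X = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
y = [1.0, 2.0, 3.0, 4.0]
adapted = weighted_elastic_net(X, y, weights=[1.0, 10.0])
```

In this toy setting the heavily penalized feature is dropped and the model relies on the feature with the lower weight.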

11:40 AM-12:00 PM
Proceedings Presentation: Model-Based Optimization of Subgroup Weights for Survival Analysis
Room: San Francisco (3rd Floor)
  • Jakob Richter, TU Dortmund, Germany
  • Katrin Madjar, TU Dortmund University, Germany
  • Jörg Rahnenführer, Technische Universität Dortmund, Germany

Motivation: To obtain a reliable prediction model for a specific cancer subgroup or cohort is often difficult due to limited sample size and, in survival analysis, due to potentially high censoring rates.
Sometimes similar data from other patient subgroups are available, e.g., from other clinical centers.
Simple pooling of all subgroups can decrease the variance of the predicted parameters of the prediction models, but also increase the bias due to heterogeneity between the cohorts.
A promising compromise is to identify those subgroups with a similar relationship between covariates and target variable and then include only these for model building.
Results: We propose a subgroup-based weighted likelihood approach for survival prediction with high-dimensional genetic covariates.
When predicting survival for a specific subgroup, for every other subgroup an individual weight determines the strength with which its observations enter into model building.
MBO (model-based optimization) can be used to quickly find a good prediction model in the presence of a large number of hyperparameters.
We use MBO to identify the best model for survival prediction of a specific subgroup by optimizing the weights for additional subgroups for a Cox model.
The approach is evaluated on a set of lung cancer cohorts with gene expression measurements.
The resulting models have competitive prediction quality, and they reflect the similarity of the corresponding cancer subgroups, with both weights close to 0, weights close to 1, and medium weights.
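A subgroup-weighted likelihood of this kind can be sketched with a weighted Cox partial log-likelihood, in which every observation carries the weight of its subgroup. This is a sketch under stated assumptions, not the authors' code: the function name and toy data are hypothetical, ties are handled without correction, and the model-based optimization of the weights is not shown.

```python
import math


def weighted_cox_loglik(times, events, etas, weights):
    """Weighted Cox partial log-likelihood.

    times   : observed times
    events  : 1 if the event was observed, 0 if censored
    etas    : linear predictors (x_i . beta) per observation
    weights : per-observation weights, e.g. 1.0 for the target subgroup
              and a value in [0, 1] for each additional subgroup
    """
    total = 0.0
    n = len(times)
    for i in range(n):
        # Censored or zero-weight observations contribute no event term.
        if events[i] == 0 or weights[i] == 0.0:
            continue
        # Weighted risk set: everyone still under observation at times[i].
        risk = sum(weights[j] * math.exp(etas[j])
                   for j in range(n) if times[j] >= times[i])
        total += weights[i] * (etas[i] - math.log(risk))
    return total


# Target subgroup: first two patients. Additional subgroup: last two.
times = [2.0, 5.0, 3.0, 4.0]
events = [1, 1, 1, 0]
etas = [0.3, -0.2, 0.1, 0.4]

target_only = weighted_cox_loglik(times[:2], events[:2], etas[:2], [1.0, 1.0])
weight_zero = weighted_cox_loglik(times, events, etas, [1.0, 1.0, 0.0, 0.0])
weight_half = weighted_cox_loglik(times, events, etas, [1.0, 1.0, 0.5, 0.5])
```

A weight of 0 reproduces the model fitted on the target subgroup alone, a weight of 1 corresponds to simple pooling, and intermediate weights interpolate between the two.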

12:00 PM-12:20 PM
Proceedings Presentation: Block HSIC Lasso: model-free biomarker detection for ultra-high dimensional data
Room: San Francisco (3rd Floor)
  • Héctor Climente-González, Institut Curie, France
  • Chloé-Agathe Azencott, MINES ParisTech, France
  • Makoto Yamada, Kyoto University, Japan
  • Samuel Kaski, Aalto University, Finland

Motivation: Finding nonlinear relationships between biomolecules and a biological outcome is computationally expensive and statistically challenging. Existing methods have important drawbacks, including among others lack of parsimony, non-convexity, and computational overhead. Here we propose block HSIC Lasso, a nonlinear feature selector that does not present the previous drawbacks.
Results: We compare block HSIC Lasso to other state-of-the-art feature selection techniques in both synthetic and real data, including experiments over three common types of genomic data: gene-expression microarrays, single-cell RNA sequencing, and genome-wide association studies. In all cases, we observe that features selected by block HSIC Lasso retain more information about the underlying biology than those selected by other techniques. As a proof of concept, we applied block HSIC Lasso to a single-cell RNA sequencing experiment on mouse hippocampus. We discovered that many genes linked in the past to brain development and function are involved in the biological differences between the types of neurons.
Availability: Block HSIC Lasso is implemented in the Python 2/3 package pyHSICLasso, available on PyPI. Source code is available on GitHub (https://github.com/riken-aip/pyHSICLasso).
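The dependence measure that gives the method its name, the Hilbert-Schmidt Independence Criterion (HSIC), can be sketched in a few lines. The code below is a plain-Python illustration of the biased empirical HSIC with Gaussian kernels; it is not the pyHSICLasso API, and the function names, bandwidth, and toy data are assumptions. The lasso-based selection step and the block approximation are not shown.

```python
import math


def gaussian_gram(values, sigma=1.0):
    """Gram matrix of a Gaussian kernel on one-dimensional values."""
    return [[math.exp(-((a - b) ** 2) / (2 * sigma ** 2)) for b in values]
            for a in values]


def center(gram):
    """Double-center a symmetric Gram matrix (H K H with H = I - 11^T / n)."""
    n = len(gram)
    row_means = [sum(row) / n for row in gram]
    total_mean = sum(row_means) / n
    return [[gram[i][j] - row_means[i] - row_means[j] + total_mean
             for j in range(n)] for i in range(n)]


def hsic(x, y, sigma=1.0):
    """Biased empirical HSIC: trace(K H L H) / n^2."""
    n = len(x)
    K = center(gaussian_gram(x, sigma))
    L = center(gaussian_gram(y, sigma))
    return sum(K[i][j] * L[i][j] for i in range(n) for j in range(n)) / n ** 2


x = [-2.0, -1.0, 0.0, 1.0, 2.0]
nonlinear = hsic(x, [v * v for v in x])  # y = x^2, zero linear correlation
constant = hsic(x, [3.0] * 5)            # y carries no information about x
```

The quadratic relationship has zero linear correlation with x but a positive HSIC, which is why kernel-based scores of this kind can pick up nonlinear feature-outcome relationships.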

12:20 PM-12:20 PM
Spotlight Session 1 - MLCSB
Room: San Francisco (3rd Floor)
12:20 PM-12:40 PM
Deconvolution of autoencoders to learn biological modules from single cell mRNA sequencing data
Room: San Francisco (3rd Floor)
  • Savvas Kinalis, Centre for Genomic Medicine, Rigshospitalet, University of Copenhagen, Denmark
  • Frederik Otzen Bagger, Centre for Genomic Medicine, Rigshospitalet, University of Copenhagen, Denmark
  • Ole Winther, Centre for Genomic Medicine, Rigshospitalet, University of Copenhagen, Denmark
  • Finn Cilius Nielsen, Centre for Genomic Medicine, Rigshospitalet, University of Copenhagen, Denmark

Background: Unsupervised neural network models have shown their usefulness with noisy single cell mRNA-sequencing data (scRNA-seq), where the models generalize well, despite the zero-inflation of the data. Variants of autoencoders have been useful for denoising of single cell data, imputation of missing values and dimensionality reduction.

Results: We present a feature with the potential to greatly increase the usability of autoencoders. By applying saliency maps to the representation layer, we can identify genes that are associated with each hidden unit. We apply a soft orthogonality constraint on the representation layer to aid the deconvolution of the input signal. Our model can delineate the biologically meaningful modules that govern a dataset and indicate which modules are active in each single cell. Importantly, most of these modules can be explained by known biological functions, as provided by the Hallmark gene sets.

Conclusions: We discover that tailored training of an autoencoder makes it possible to deconvolute biological modules inherent in the data, without any assumptions. In perspective, our model in combination with clustering methods is able to provide information about which subtype a given single cell belongs to, as well as which biological functions determine that membership.

CellGen: A mixture of expert autoencoder to cluster single cell data
Room: San Francisco (3rd Floor)
  • Vignesh Ram Somnath, Institute of Molecular Systems Biology, ETH Zürich, Switzerland
  • Andreas Kopf, Institute of Molecular Systems Biology, ETH Zürich, Switzerland

Defining cell types in high-dimensional single cell data, such as mass cytometry (CyTOF), single cell RNA-sequencing or images, has been applied extensively. Here we introduce CellGen, a novel Autoencoder based clustering approach, to learn and interpret the multimodal distribution of single cell data.

CellGen is built on an Autoencoder (AE) setup in which the decoder consists of a Mixture-of-Experts architecture. This architecture allows the individual experts to automatically learn different modes of the data. Additionally, we make use of a Multivariate Normal (MVN) distributed latent space. As a proof of concept, we tested CellGen on reconstructing synthetic data sampled from MVN distributions. We conducted preliminary tests with CellGen to define rare subpopulations in CyTOF measurements of peripheral blood mononuclear cell samples, where we achieved state-of-the-art results when comparing F-measures. Compared to competitors, CellGen is consistently superior, especially in detecting both abundant and rare cell populations, whereas baseline approaches performed well in detecting either abundant or rare cell populations, but not both.

Approaches like CellGen have the ability to perform unsupervised detection of abundant and rare subpopulations. This could enhance exploratory single-cell initiatives such as the Human Cell Atlas or be applied in personalized medicine.

Interpretability for computational biology
Room: San Francisco (3rd Floor)
  • María Rodríguez Martínez, IBM, Switzerland
  • An-Phi Nguyen, IBM Research Zurich, Switzerland

Understanding real-world datasets is often challenging due to size, complexity and/or poor knowledge about the problem to be tackled (e.g. electronic health records, OMICS data, ...). To achieve high accuracy in important tasks, equally complex machine learning models are usually used. In many situations (e.g. diagnosis prediction) the decisions reached by such automated systems can have significant, and potentially deleterious, consequences. It is therefore necessary for a model not only to provide a correct prediction, but also to provide an accompanying explanation of why a certain decision was reached. However, one of the flaws in current interpretable machine learning research is the lack of an agreed-upon definition of interpretability, which hinders the fair comparison of existing interpretability methods.
In our work we aim to properly define interpretability for problems and models in computational biology. This will enable us to systematically benchmark existing interpretability methods, and potentially lead to the development of novel methods.
Our work will focus on important tasks in computational biology, such as transcription factor binding prediction or 3D DNA structure reconstruction, leveraging data that are publicly available (e.g. TCGA) or produced by close collaborators.

Bijective Encoding of Proteins In a Scalable Distributed Deep Learning Framework
Room: San Francisco (3rd Floor)
  • Alexandre Perera-Lluna, B2SLab, ESAII, Universitat Politècnica de Catalunya, Barcelona, Spain
  • Maria Jesus Martin, EMBL-EBI, United Kingdom
  • Rabie Saidi, EMBL-EBI, United Kingdom
  • Angela Lopez-Del Rio, Universitat Politècnica de Catalunya, Spain

Deep learning protein-based prediction models have gained great popularity in recent years. For these models, protein sequences are usually encoded into feature vectors. However, these encoding features are generally aggregative and not bijective, or require sequences to be alignable, thus decreasing the generalisation capability of the models.

The use of raw amino acid sequences as model input is now gaining popularity. Padding is usually applied to bring proteins of different lengths to the same dimension, but little is known about how this addition affects model performance.

On the other hand, state-of-the-art Deep Learning models are not yet taking advantage of big data frameworks and distributed computation. Although there have been some approaches towards this integration, there are still no stable solutions. Closing this gap is crucial for getting the maximal potential out of the growing public biological datasets.

In this work, we build a scalable Deep Learning model by integrating big data and deep learning frameworks. We then analyse different bijective protein encodings in a protein function prediction problem and study the impact that padding has on the performance of the model. Our results provide good practices for distributed computing with protein-based deep learning models.
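As a concrete instance of a bijective encoding with padding, the sketch below one-hot encodes a sequence and pads it with all-zero rows up to a fixed length. It is only an illustration: the abstract does not state which encodings or padding schemes were compared, and the function names, alphabet ordering, and end-padding choice are assumptions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_padded(sequence, max_len):
    """Encode a protein as max_len rows of 20 values.

    Each residue becomes a one-hot row; positions past the end of the
    sequence are filled with all-zero padding rows.
    """
    if len(sequence) > max_len:
        raise ValueError("sequence is longer than max_len")
    rows = []
    for aa in sequence:
        row = [0] * len(AMINO_ACIDS)
        row[INDEX[aa]] = 1
        rows.append(row)
    rows.extend([0] * len(AMINO_ACIDS) for _ in range(max_len - len(sequence)))
    return rows


def decode(rows):
    """Invert the encoding: padding rows contain no 1 and are skipped."""
    return "".join(AMINO_ACIDS[row.index(1)] for row in rows if 1 in row)


encoded = one_hot_padded("MKT", max_len=5)
```

Because the original sequence can be recovered exactly from the encoded matrix, no sequence information is aggregated away; the padding rows are the only part of the input that carries no biological signal.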

Towards Explainable Anticancer Compound Sensitivity Prediction via Multimodal Attention-based Convolutional Encoders
Room: San Francisco (3rd Floor)
  • Vigneshwari Subramanian, AstraZeneca, Sweden
  • Julio Saez-Rodriguez, Institute of Computational Biomedicine, Heidelberg University, Germany
  • María Rodríguez Martínez, IBM, Switzerland
  • Jannis Born, IBM, Switzerland
  • Ali Oskooei, IBM, Switzerland
  • Matteo Manica, IBM, Switzerland

De novo drug development is hampered by tedious in-vitro and animal-screening experiments exploring the enormous space of candidate drugs.
In addition, the primary goal of personalized cancer medicine is to tailor a treatment given a patient's tumor molecular profile.
To tackle these challenges, it is imperative to devise techniques that enable effective screening of anticancer compound efficacy.
We analyze a family of models that makes use of three data modalities: compound's structure in SMILES format, gene expression profiles and prior knowledge from PPI networks.
We propose a novel architecture for interpretable anticancer compound sensitivity prediction using a multimodal attention-based convolutional encoder.
Our encoder outperforms a baseline model based on molecular fingerprints and a selection of SMILES encoders.
In addition, we demonstrate that our model outperforms previously reported state-of-the-art results.
Finally, we validate the attended genes and molecular sub-structures with domain knowledge and verify that the learned attention weights are meaningful.
The generalization power and the interpretability of our model enable in-silico evaluation of anticancer compound efficacy on unseen cancer cells, positioning it as a valid solution for the development of personalized therapies as well as for the evaluation of candidate compounds in de novo drug design.

DrugCell: A visible neural network to guide precision medicine
Room: San Francisco (3rd Floor)
  • Jason Kreisberg, University of California San Diego, United States
  • Jisoo Park, University of California San Diego, United States
  • Brent Kuenzi, University of California San Diego, United States
  • Trey Ideker, Department of Medicine, University of California, San Diego, United States
  • Samson Fong, University of California San Diego, United States

One of the major contributing factors to the notably low clinical success rate for most cancer therapies is a lack of understanding of how cancer cells respond to a particular drug. Recent advances in artificial intelligence (AI) could aid the development of drugs with better efficacy by predicting therapeutic responses of cells. However, black-box AI is limited as the reasoning behind its predictions is not interpretable.

Here we developed a “visible” neural network-based AI, DrugCell, that predicts anti-cancer drug response utilizing the structural features of drugs and genomic features of cells while providing insights into the pathways that sensitize the cell to therapy. DrugCell outperformed baseline models trained using ElasticNet and Random Forest. The structural restriction imposed for interpretability of the model did not affect performance, as DrugCell demonstrated comparable performance to fully-connected neural networks. The compound whose response DrugCell predicted most accurately was vincristine, which has well-defined mechanisms of action. The cell model in DrugCell highlighted well-known pathways altered through microtubule inhibition by vincristine, such as cell adhesion and cell division. Armed with interpretability and generalizability, DrugCell will serve as the first step towards the next generation of intelligent systems in drug discovery and precision medicine.

Identifying glycan motifs using a novel tree representation that considers terminal connections
Room: San Francisco (3rd Floor)
  • Paul Ramsland, RMIT University, Australia
  • Jeffrey Chan, RMIT University, Australia
  • Lachlan Coff, RMIT University, Australia
  • Andrew Guy, RMIT University, Australia

Glycans are complex sugar chains that are crucial components of many biological processes. Many proteins bind to glycans, including lectins and antibodies, with binding specificity to glycans often mediated by small motifs within the larger glycan. Identification of glycan binding motifs has previously been approached as a frequent subtree mining problem. In this work, we extend a standard frequent subtree mining approach by altering the glycan representation to include information on terminal connections. This allows specific identification of terminal residues as potential motifs. We achieve this by including additional nodes in a graph representation of the glycan structure to indicate the presence or absence of a connection at particular backbone carbon positions. Combining this frequent subtree mining approach with a state-of-the-art feature selection algorithm termed minimum-redundancy-maximum-relevance (mRMR), we have generated a classification pipeline that is trained on data from a glycan microarray. This classification pipeline enables prediction of binding to unknown/novel glycans, as well as identification of likely binding motifs based on existing array data. This new subtree mining approach will assist in the interpretation of the large number of publicly available glycan microarray datasets and will aid in the discovery of novel binding motifs for further experimental characterisation.

Generative classification of cell types in scRNA-seq data
Room: San Francisco (3rd Floor)
  • Murat Can Cobanoglu, UT Southwestern Medical Center, United States

We present PIET (Probabilistic Inference and Explanation of Transcriptomic data), a generative classifier that can train on bulk RNA-seq data and classify cell types in single cell RNA sequencing (scRNA-seq) data. ScRNA-seq enables single cell resolution insights in complex tissues. However, it also renders cellular identity and subpopulation identification challenging. PIET implements generative probabilistic modeling of scRNA-seq data, which can effectively address this challenge. We demonstrate the performance of our approach by classifying tumor infiltrating lymphocytes from the melanoma microenvironment.

Comparative machine learning framework for efficient prediction of host-pathogen protein-protein interactions using sequence-based features
Room: San Francisco (3rd Floor)
  • Rakesh Kaundal, PSC / Center for Integrated BioSystems, Utah State University, United States
  • Nicholas Flann, Utah State University, United States
  • Cristian Loaiza, Utah State University, United States

Host-pathogen protein-protein interactions (HPIs) play vital roles in several biological processes and are directly involved with infectious diseases. It is crucial to understand their mechanism and unravel potential targets to develop therapeutics. Beyond single-species Protein-Protein Interaction (PPI) prediction, no comprehensive analysis has been attempted to model HPIs on a genome scale.
In this study, a comparison between different machine learning methods such as support vector machines (SVM), artificial neural networks (ANN) and Deep Learning-based Convolutional Neural Networks (CNN) was performed to predict HPIs with high accuracy. Several sequence-based features were tested, including Autocorrelation, Dipep composition, Conjoint Triad, Quasi-order and One-hot. The training datasets were obtained from HPIDB and Negatome, to create positive and negative datasets as well as independent test datasets.
Most of the models performed well, with independent test sensitivity values ranging from 84.6% to 99.5%, specificity from 56.1% to 98.8%, and MCC from 0.61 to 0.91. We found that Negatome is not an ideal dataset for inter-species predictions. A novel approach to generate a more suitable non-interaction dataset is proposed. Among the methods tested, Deep Learning looks promising and further architectures are being explored. The best prediction models will be implemented on a web server called DeepHPI.

12:40 PM-2:00 PM
Room: San Francisco (3rd Floor)
2:00 PM-2:20 PM
Proceedings Presentation: Rotation equivariant and invariant neural networks for microscopy image analysis
Room: San Francisco (3rd Floor)
  • Benjamin Chidester, Carnegie Mellon University, United States
  • Tianming Zhou, Carnegie Mellon University, United States
  • Minh N. Do, University of Illinois at Urbana-Champaign, United States
  • Jian Ma, Carnegie Mellon University, United States

Neural networks have been widely used to analyze high-throughput microscopy images. However, the performance of neural networks can be significantly improved by encoding known invariance for particular tasks. Highly relevant to the goal of automated cell phenotyping from microscopy image data is rotation invariance. Here we consider the application of two schemes for encoding rotation equivariance and invariance in a convolutional neural network, namely, the group-equivariant CNN (G-CNN), and a new architecture with simple, efficient conic convolution, for classifying microscopy images. We additionally integrate the 2D-discrete-Fourier transform (2D-DFT) as an effective means for encoding global rotational invariance. We call our new method the Conic Convolution and DFT Network (CFNet). We evaluated the efficacy of CFNet and G-CNN as compared to a standard CNN for several different image classification tasks, including simulated and real microscopy images of subcellular protein localization, and demonstrated improved performance. We believe CFNet has the potential to improve many high-throughput microscopy image analysis applications. Source code of CFNet is available at: https://github.com/bchidest/CFNet.
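The property that makes a discrete Fourier transform useful for encoding invariance can be illustrated in one dimension. The sketch below is not the CFNet code: it assumes, for illustration only, that a rotation of the input appears as a cyclic shift of a feature sequence, and shows that the DFT magnitudes do not change under such a shift. Function names and values are hypothetical.

```python
import cmath


def dft_magnitudes(values):
    """Magnitudes of the discrete Fourier transform of a real sequence."""
    n = len(values)
    return [
        abs(sum(v * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, v in enumerate(values)))
        for k in range(n)
    ]


def cyclic_shift(values, steps):
    """Rotate the sequence to the left by the given number of positions."""
    steps %= len(values)
    return values[steps:] + values[:steps]


# Hypothetical responses of one filter at four orientations of an image.
features = [1.0, 4.0, 2.0, 0.5]
shifted = cyclic_shift(features, 1)  # the same image, rotated by one step

original_spectrum = dft_magnitudes(features)
shifted_spectrum = dft_magnitudes(shifted)
```

A cyclic shift only changes the phase of each Fourier coefficient, so the two magnitude spectra agree up to floating-point error even though the raw feature sequences differ.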

2:20 PM-2:40 PM
Proceedings Presentation: Identifying progressive imaging genetic patterns via multi-task sparse canonical correlation analysis: a longitudinal study of the ADNI cohort
Room: San Francisco (3rd Floor)
  • Lei Du, Northwestern Polytechnical University, China
  • Kefei Liu, University of Pennsylvania, United States
  • Lei Zhu, Xi'an University of Technology, China
  • Xiaohui Yao, University of Pennsylvania, United States
  • Shannon Leigh Risacher, Indiana University School of Medicine, United States
  • Andrew Saykin, Indiana University School of Medicine, United States
  • Lei Guo, Northwestern Polytechnical University, China
  • Li Shen, University of Pennsylvania, United States

Identifying the genetic basis of brain structure, function and disorder by using imaging quantitative traits (QTs) as endophenotypes is an important task in brain science. Brain QTs often change over time as the disorder progresses, so understanding how genetic factors affect progressive brain QT changes is of great importance. Most existing imaging genetics methods analyze only the baseline neuroimaging data, and thus omit the longitudinal imaging data across multiple time points, which contain important disease progression information. We propose a novel temporal imaging genetic model that performs multi-task sparse canonical correlation analysis (T-MTSCCA). Our model uses longitudinal neuroimaging data to uncover how SNPs affect brain QTs over time. By incorporating the relationships within the longitudinal imaging data and within the SNPs, T-MTSCCA can identify a trajectory of progressive imaging genetic patterns over time. We propose an efficient algorithm to solve the problem and show its convergence. We evaluate T-MTSCCA on 408 subjects from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database with longitudinal magnetic resonance imaging (MRI) data and genetic data available. The experimental results show that T-MTSCCA performs better than or on par with state-of-the-art methods. In particular, T-MTSCCA identifies higher canonical correlation coefficients and captures clearer canonical weight patterns. This suggests that T-MTSCCA identifies time-consistent and time-dependent SNPs and imaging QTs, which further helps in understanding the genetic basis of brain QT changes during disease progression.

2:39 PM-2:40 PM
Spotlight Session 2 - MLCSB
Room: San Francisco (3rd Floor)
2:40 PM-3:00 PM
WideDTA: prediction of drug-target binding affinity
Room: San Francisco (3rd Floor)
  • Elif Ozkirimli, Bogazici University, Turkey
  • Arzucan Ozgur, Bogazici University, Turkey
  • Hakime Öztürk, Boğaziçi University, Turkey

Motivation: Prediction of the interaction affinity between proteins and compounds is a major challenge in the drug discovery process. WideDTA is a deep-learning based prediction model that employs chemical and biological textual sequence information to predict binding affinity.

Results: WideDTA uses four text-based information sources, namely the protein sequence, ligand SMILES, protein domains and motifs, and maximum common substructure words, to predict binding affinity. WideDTA outperformed DeepDTA, one of the state-of-the-art deep learning methods for drug-target binding affinity prediction, on the KIBA dataset with statistical significance. This indicates that the word-based sequence representation adopted by WideDTA is a promising alternative to the character-based sequence representation approach in deep learning models for binding affinity prediction, such as the one used in DeepDTA. In addition, the results showed that, given the protein sequence and ligand SMILES, the inclusion of protein domain and motif information as well as ligand maximum common substructure words does not provide additional useful information for the deep learning model. Interestingly, however, using only domain and motif information to represent proteins achieved similar performance to using the full protein sequence, suggesting that important binding-relevant information is contained within the protein motifs and domains.
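The difference between a character-based and a word-based sequence representation can be sketched with a simple tokenizer. This is an illustration only: the abstract does not give the word lengths or the exact tokenization scheme, so the overlapping-window approach, the function names, and the lengths used below are assumptions.

```python
def char_tokens(sequence):
    """Character-based representation: one token per symbol."""
    return list(sequence)


def word_tokens(sequence, k):
    """Word-based representation: overlapping windows of k symbols."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


protein = "MKTAY"     # toy protein fragment
smiles = "CC(=O)O"    # acetic acid

protein_chars = char_tokens(protein)
protein_words = word_tokens(protein, k=3)  # assumed word length
smiles_words = word_tokens(smiles, k=4)    # assumed word length
```

Each word then receives its own entry in the model vocabulary, so local context such as a short residue pattern is available to the network as a single token rather than having to be learned from individual characters.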

RLFimpute: Using reinforcement learning framework for imputation of scRNA-seq data
Room: San Francisco (3rd Floor)
  • Jiajia Liu, Tongji University, China
  • Xiaobo Zhou, Tongji University, China

Introduction: To account for the heterogeneity of cell populations, single-cell RNA sequencing (scRNA-seq) technology was developed to enable a wide variety of transcriptomic analyses at the single cell level. However, a major problem in scRNA-seq is that there are too many zero or near-zero values, some of which reflect true expression levels while the others are missing values called dropouts. To address this issue, we propose a novel method, RLFimpute, that combines a reinforcement learning framework (RLF) with the Davies-Bouldin Index to impute dropouts in scRNA-seq data.
Methodology: In the RLFimpute framework, the agent represents a dropout, the environment represents the current data matrix, and the reward represents the Davies-Bouldin Index. We constructed the RLF with only two states: missing and imputed. Each dropout has many candidate replacement values, and choosing among them constitutes the imputation action. We then select the appropriate candidate values based on the returned rewards to impute dropouts in the scRNA-seq data.
Conclusion: Besides imputation, the proposed algorithm can detect outliers or rare cells and automatically identify cell types. The performance on simulated data and two real scRNA-seq datasets also shows that RLFimpute significantly improves cell clustering and cell type visualization compared with other methods.
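The selection of a candidate value by a Davies-Bouldin-based reward can be sketched as follows. This is a simplified illustration, not RLFimpute itself: it assumes that cluster labels are already available, treats a lower Davies-Bouldin Index as a higher reward, and uses hypothetical function names and toy data.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return [sum(p[d] for p in points) / len(points) for d in range(len(points[0]))]


def davies_bouldin(points, labels):
    """Davies-Bouldin Index; lower values mean tighter, better separated clusters."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    centers = {k: centroid(v) for k, v in clusters.items()}
    # Mean distance of each cluster's members to their centroid.
    scatter = {k: sum(math.dist(p, centers[k]) for p in v) / len(v)
               for k, v in clusters.items()}
    worst_ratios = [
        max((scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in clusters if j != i)
        for i in clusters
    ]
    return sum(worst_ratios) / len(worst_ratios)


def impute_dropout(points, labels, cell, gene, candidates):
    """Try each candidate for one dropout entry and keep the best-scoring one."""
    best_value, best_score = None, float("inf")
    for value in candidates:
        trial = [list(p) for p in points]  # copy of the data matrix
        trial[cell][gene] = value
        score = davies_bouldin(trial, labels)
        if score < best_score:
            best_value, best_score = value, score
    return best_value


# Four cells, two genes; cell 1 has a dropout (0.0) in gene 1.
cells = [[1.0, 1.0], [1.2, 0.0], [5.0, 5.0], [5.2, 5.1]]
labels = [0, 0, 1, 1]
chosen = impute_dropout(cells, labels, cell=1, gene=1, candidates=[0.0, 1.1, 5.0])
```

In this toy matrix the candidate that keeps the cell close to the other member of its cluster yields the lowest index and is selected.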

A principal curve approach to inferring 3D chromatin architecture
Room: San Francisco (3rd Floor)
  • Mark Segal, UCSF, United States
  • Trevor Hastie, Stanford University, United States
  • Elena Tuzhilina, Stanford University, United States

The 3D configuration of chromosomes is consequential for several cellular functions including expression regulation. The ability to infer 3D structure at increasing resolution has been enabled by Hi-C assays which yield matrices of chromatin contact counts. Several algorithms operate on these matrices to produce reconstructed 3D configurations. Many utilize multidimensional scaling variants following conversion of contacts to distances. However, none of the methods exploit the fact that the target solution is a (smooth) 1D curve in 3D: this contiguity attribute is either ignored or indirectly addressed via constraints. Here we demonstrate the utility of principal curves in directly obtaining 1D solutions that best recapitulate the contact matrix. Our target 1D curve in 3D is a vector function with three coordinates, each indexed by 1D genomic distance. Each coordinate function is represented using a spline basis with smoothness controlled by a degrees of freedom (df) parameter. A principal curve solution is then readily obtained using eigen-decomposition. While the suite of solutions from a range of df is informative with respect to chromatin architecture at differing scales, we also detail methods for selecting a single summary. Illustrative examples feature chromosomes 20-22 from IMR90 cells since orthogonal multiplex FISH imaging permits external validation.

Using Gene Expression and Clinical Data Profiles to Predict Sepsis at ER Admission
Room: San Francisco (3rd Floor)
  • Olga Pena, Centre for Microbial Diseases and Immunity Research, University of British Columbia, Canada
  • Arjun Baghela, The University of British Columbia, Canada
  • Amy Lee, Centre for Microbial Diseases and Immunity Research, University of British Columbia, Canada
  • Gabriela Cohen Freue, Department of Statistics, University of British Columbia, Canada
  • Robert Ew Hancock, Centre for Microbial Diseases and Immunity Research, University of British Columbia, Canada

Sepsis represents a suppressed immune response to infection. Despite advances in modern medicine, severe sepsis remains a major cause of mortality globally. Currently, physicians rely on their interpretations of heterogeneous clinical symptoms and medical scoring systems, often resulting in misdiagnoses. Gene expression biomarker studies tend to be inconclusive, resulting in no effective diagnostic panels [1]. In this global study of 300 emergency room patients, we assessed the predictive power of machine learning classifiers built using clinical and gene expression (RNA-Seq) data. We applied elastic net, support vector machine, and random forest classifiers to each feature type separately and to the combined features. The highest area under the receiver operating characteristic curve was 0.88, achieved using elastic net on the combined model. The combined approach involved the direct concatenation of features followed by a z-score transformation. We used “block-scaling” to address the issue of the model being dominated by gene features. This scales each feature by the inverse of the total number of features in a data-type specific block [2]. Genetic and clinical feature coefficients were also explored for biological relevance.
1. Biron et al. (2015) Biomarker Insights, 10:7-17.
2. Jeppe et al. (2008) Tox Sci, 102:2:444-54.
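The block-scaling step described above can be sketched in a few lines. This is a minimal illustration following the wording of the abstract (each feature is divided by the number of features in its block after a z-score transformation); the function names, the data layout, and the toy values are assumptions, and the study's actual pipeline may differ in the order of operations.

```python
import statistics


def z_score(column):
    """Standardize one feature to zero mean and unit (population) variance."""
    mean = statistics.mean(column)
    sd = statistics.pstdev(column)
    if sd == 0:
        raise ValueError("constant feature cannot be standardized")
    return [(value - mean) / sd for value in column]


def block_scale(blocks):
    """Scale every feature by the inverse of its block's feature count.

    blocks maps a data type (e.g. "clinical", "genes") to a list of
    feature columns, each column holding one value per patient.
    """
    scaled = {}
    for name, columns in blocks.items():
        factor = 1.0 / len(columns)
        scaled[name] = [[factor * z for z in z_score(col)] for col in columns]
    return scaled


# Three patients, one clinical feature and four gene features (toy values).
blocks = {
    "clinical": [[1.0, 2.0, 3.0]],
    "genes": [[1.0, 2.0, 3.0], [2.0, 4.0, 9.0], [5.0, 1.0, 3.0], [0.0, 1.0, 4.0]],
}
scaled = block_scale(blocks)
```

Features in the large gene block are shrunk more strongly than those in the small clinical block, which limits the influence that the sheer number of gene features has on a penalized model.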

Applying deep neural networks with feature extraction to target-gene expression prediction from landmark genes
Room: San Francisco (3rd Floor)
  • Da-Bin Lee, Soongsil University, South Korea
  • Kyu-Baek Hwang, Soongsil University, South Korea

Presentation Overview:

It has been widely shown that gene expression is highly correlated, meaning that the expression level of a subset of genes can be used to infer the expression level of other genes. As such, 978 landmark genes are used to predict the expression of 11,350 target genes in the L1000 assay developed by the Connectivity Map group (https://clue.io/cmap) for cost-effective gene expression profiling. Here, accurate inference of target-gene expression is a prerequisite. To this end, the ordinary least squares and regularization methods for linear regression (LR) have mainly been adopted. Recently, deep neural networks (DNNs) were shown to achieve higher accuracy than LR, suggesting a non-linear relationship between the landmark genes and their targets. To improve the accuracy of DNNs for target-gene expression prediction, we extracted non-linear features of landmark gene expression using various types of autoencoders. When applied to a dataset containing 111,009 expression profiles of 943 landmark and 9,520 target genes, feature extraction by denoising autoencoders improved the prediction accuracy of a DNN for ~94.4% of the target genes. On average, the prediction error for each target gene decreased by ~2.3%. Our results suggest that autoencoders can extract useful features of landmark gene expression for target-gene expression prediction.

Deep neural networks predict drug-induced histopathology based on gene expression
Room: San Francisco (3rd Floor)
  • Jitao David Zhang, Roche Innovation Center Basel, F. Hoffmann-La Roche AG, Switzerland
  • Pierre Maliver, Roche Pharma Research and Early Development, Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd, Switzerland
  • Alexia Phedonos, Roche Pharma Research and Early Development, Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd, Switzerland
  • Claudia Bossen, Roche Pharma Research and Early Development, Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd, Switzerland
  • Timo Schwandt, Roche Pharma Research and Early Development, Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd, Switzerland
  • Matthias Wittwer, Roche Pharma Research and Early Development, Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd, Switzerland
  • Annie Moisan, Roche Pharma Research and Early Development, Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd, Switzerland
  • Virginie Sandrin, Roche Pharma Research and Early Development, Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd, Switzerland
  • Mark D. Robinson, University of Zurich, Switzerland
  • Klas Hatje, Roche Pharma Research and Early Development, Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd, Switzerland
  • Tao Fang, Roche Pharma Research and Early Development, Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd, Switzerland
  • Lisa Sach-Peltason, Roche Pharma Research and Early Development, Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd, Switzerland

Presentation Overview:

Successful prediction and a molecular-level understanding of drug-induced organ toxicity are key to reducing both animal use and attrition rate in drug discovery. We leveraged gene-expression data in the Open TG-GATEs database and systematically compared the predictive power of a range of machine-learning models of increasing complexity, including logistic regression, support vector machines, and deep neural networks (DNNs). DNNs consistently and substantially outperformed other models for almost all types of liver histopathology. The findings were corroborated by benchmarking against published models and by a crowd-sourcing approach. We applied DNNs to independent datasets and confirmed their superior performance over other models across technological platforms of gene expression profiling and across rodent species. Finally, we observed that, by using transformed or additional features as input, it is possible to further boost either the interpretability or the performance of DNNs. The present study demonstrates the feasibility and advantage of applying deep-learning techniques to predict drug-induced liver histopathology based on gene expression. Joint application of physiology-emulating cellular systems, omics technologies and machine-learning models such as DNNs holds the promise to replace and reduce animal use in drug discovery and to ensure the safety profiles of drug candidates tested in clinical trials.

CNN-Peaks: ChIP-seq peak detector using convolution neural networks
Room: San Francisco (3rd Floor)
  • Giltae Song, Pusan National University, South Korea
  • J. Michael Cherry, Stanford University, United States
  • Dongpin Oh, Pusan National University, South Korea
  • J. Seth Strattan, Stanford University, United States
  • Junho Hur, Kyung Hee University, South Korea
  • Jose Bento, Boston College, United States

Presentation Overview:

To elucidate pathological mechanisms, and pinpoint target genes for developing therapeutics, it is important to assess abnormalities in the genome-wide interactions of proteins and genomic elements that cause gene mis-regulation. In this regard, ChIP-seq data is one of the core experimental resources to elucidate genome-wide epigenetic interactions and identify the functional elements which are associated with diseases. Accordingly, the analyses of ChIP-seq data are important and difficult computational challenges, due to the presence of irregular noise and bias on various levels. Although many peak-calling methods have been developed, the current computational tools still require, in some cases, human manual inspection using data visualization. However, the huge volumes of ChIP-seq data make it almost impossible for human researchers to manually determine all the peaks. We designed a novel supervised learning approach with convolution neural networks and integrated it into a software pipeline (called CNN-Peaks) for identifying ChIP-seq peaks using labeled data from human researchers who annotate the presence or absence of peaks in some genomic segments. The labeled data were used to train a model for identifying peak signals and the model was applied to unknown genomic segments. We validated our pipeline using ChIP-seq data available from the ENCODE data portal.

Modeling and prediction of genes associated with rare Mendelian phenotypes using an end-to-end deep learning approach
Room: San Francisco (3rd Floor)
  • Denise Duma, Icahn School of Medicine at Mount Sinai, United States
  • Adam Margolin, Icahn School of Medicine at Mount Sinai, United States

Presentation Overview:

Over the past decades, it has become increasingly evident that genetic background is a key determinant of many types of human diseases. Identifying the genes and mutations that underlie human disease phenotypes is important for multiple purposes, including: (1) understanding disease mechanisms via the functions of the genes and the broader background of their biological pathways; (2) pre- and post-natal risk assessment; (3) precision medicine translational advances. To this end, we annotate the set of human protein-coding genes using a large array of biological annotations. We then apply state-of-the-art unsupervised deep learning techniques (graph embeddings) and couple these with a neural network classifier in order to model the gene-disease associations cataloged by the Human Gene Mutation Database (HGMD). We both model the known associations between a subset of human genes and rare Mendelian phenotypes as well as predict novel associations not yet in HGMD. We note that our classifier is a hierarchical one able to handle parent-child relationships that exist between disease phenotypes as encoded by the Human Phenotype Ontology (HPO). Both the graph embeddings and classifier network are trained together making this an end-to-end approach. We show good prediction performance with training accuracy up to 85% for certain phenotypes.

3:00 PM-4:00 PM
MLCSB Keynote: Representation Learning of Patient Health States
Room: San Francisco (3rd Floor)
  • Barbara Engelhardt, Princeton University, United States

Presentation Overview:

Sequential decision-making has recently been recognized as an important analytic tool in biomedical data analysis. Consider, for example, designing single cell experiments, treating hospital patients, or even model selection as the biological data evolve: these methods are necessary but not particularly common. Two important challenges exist. First, many of these approaches are challenging to compare: showing that they perform as well as or better than random decisions is an unfortunate default. Second, these approaches are difficult to validate: it is hard to apply these methods to expensive or delicate decision-making scenarios including hospital patients because the price of a wrong decision is too high. I will discuss opportunities and also describe possible ways to address these challenges to allow decision-making methods to be used more effectively for biomedical research.

4:40 PM-5:00 PM
Proceedings Presentation: PRISM: Methylation Pattern-based, Reference-free Inference of Subclonal Makeup
Room: San Francisco (3rd Floor)
  • Dohoon Lee, Seoul National University, South Korea
  • Sangseon Lee, Seoul National University, South Korea
  • Sun Kim, Seoul National University, South Korea

Presentation Overview:

Motivation: Characterizing cancer subclones is crucial for the ultimate conquest of cancer. Thus, a number of bioinformatic tools have been developed to infer heterogeneous tumor populations based on genomic signatures such as mutations and copy number variations. Despite accumulating evidence for the significance of global DNA methylation reprogramming in certain cancer types including myeloid malignancies, none of the bioinformatic tools are designed to exploit subclonally reprogrammed methylation patterns to reveal constituent populations of a tumor. In accordance with the notion of global methylation reprogramming, our preliminary observations on acute myeloid leukemia (AML) samples implied the existence of subclonally-occurring focal methylation aberrance throughout the genome.
Results: We present PRISM, a tool for inferring the composition of epigenetically distinct subclones of a tumor solely from methylation patterns obtained by reduced representation bisulfite sequencing (RRBS). PRISM adopts DNA methyltransferase 1 (DNMT1)-like hidden Markov model-based in silico proofreading for the correction of erroneous methylation patterns. With error-corrected methylation patterns, PRISM focuses on short individual genomic regions harboring dichotomous patterns that can be split into fully methylated and unmethylated patterns. The frequencies of these two patterns form a sufficient statistic for subclonal abundance. The set of statistics collected from each genomic region is modeled with a beta-binomial mixture. Fitting the mixture with the expectation-maximization algorithm finally provides the inferred composition of subclones. Applying PRISM to two acute myeloid leukemia samples, we demonstrate that PRISM could infer the evolutionary history of malignant samples from an epigenetic point of view.
Availability: PRISM is freely available on GitHub (https://github.com/dohlee/prism).

5:00 PM-5:20 PM
Proceedings Presentation: Collaborative Intra-Tumor Heterogeneity Detection
Room: San Francisco (3rd Floor)
  • Sahand Khakabimamaghani, Simon Fraser University, Canada
  • Salem Malikic, Simon Fraser University, Canada
  • Jeffrey Tang, Simon Fraser University, Canada
  • Dujian Ding, Simon Fraser University, Canada
  • Ryan Morin, Simon Fraser University, Canada
  • Leonid Chindelevitch, Simon Fraser University, Canada
  • Martin Ester, Simon Fraser University, Canada

Presentation Overview:

Motivation: Despite the remarkable advances in sequencing and computational techniques, noise in the data and complexity of the underlying biological mechanisms render deconvolution of the phylogenetic relationships between cancer mutations difficult. To overcome these limitations, new methods are required for integrating and harnessing the full potential of the existing data.

Results: We introduce a method called Hintra for intra-tumor heterogeneity detection. Hintra integrates sequencing data for a cohort of tumors and infers tumor phylogeny for each individual based on the evolutionary information shared between different tumors. Through an iterative process, Hintra learns the repeating evolutionary patterns and uses this information for resolving the phylogenetic ambiguities of individual tumors. The results of synthetic experiments show an improved performance compared to two state-of-the-art methods. The experimental results with a recent Breast Cancer dataset are consistent with the existing knowledge and provide potentially interesting findings.

5:20 PM-5:40 PM
CloneSig: Joint Inference of intra-tumor heterogeneity and signature deconvolution in tumor bulk sequencing data
Room: San Francisco (3rd Floor)
  • Judith Abécassis, Mines Paristech, Institut Curie, France
  • Fabien Reyal, Institut Curie, France
  • Jean-Philippe Vert, Google, France

Presentation Overview:

The possibility to sequence DNA in cancer samples has recently triggered much effort to identify the forces at the genomic level that shape tumor emergence and evolution. Two main approaches have been followed for that purpose: (i) deciphering the clonal composition of each tumour by using the observed prevalences of somatic mutations, and (ii) elucidating the mutational processes involved in the generation of those same somatic mutations. Currently, subclonal deconvolution and mutational signature deconvolution are performed separately, although both are manifestations of the same underlying process.

We present CloneSig, the first method that jointly infers the evolution of the subclonal and mutational signature composition of a tumor sample from bulk sequencing. CloneSig is based on a probabilistic graphical model that models somatic mutations as derived from a mixture of subclones where different mutational signatures are active. Parameters of the model are estimated using an EM algorithm. We have conducted extensive simulations of various tumor evolution scenarios, which illustrate that joint inference with CloneSig allows an accurate reconstruction of both processes. Application to real data shows that results obtained with whole exome sequencing recapitulate characteristics of observations on whole genome sequencing, illustrating CloneSig's ability to recover relevant biological signal in the noisiest setting.

5:40 PM-6:00 PM
Proceedings Presentation: Inheritance and variability of kinetic gene expression parameters in microbial cells: Modelling and inference from lineage tree data
Room: San Francisco (3rd Floor)
  • Aline Marguet, Univ. Grenoble Alpes, Inria, 38000 Grenoble, France
  • Marc Lavielle, Inria Saclay & Ecole Polytechnique, Palaiseau, France
  • Eugenio Cinquemani, Univ. Grenoble Alpes, Inria, 38000 Grenoble, France

Presentation Overview:

Motivation: Modern experimental technologies enable monitoring of gene expression dynamics in individual cells and quantification of its variability in isogenic microbial populations. Among the sources of this variability is the randomness that affects inheritance of gene expression factors at cell division. Known parental relationships among individually observed cells provide invaluable information for the characterization of this extrinsic source of gene expression noise. Despite this fact, most existing methods to infer stochastic gene expression models from single-cell data dedicate little attention to the reconstruction of mother-daughter inheritance dynamics.
Results: Starting from a transcription and translation model of gene expression, we propose a stochastic model for the evolution of gene expression dynamics in a population of dividing cells. Based on this model, we develop a method for the direct quantification of inheritance and variability of kinetic gene expression parameters from single-cell gene expression and lineage data. We demonstrate that our approach provides unbiased estimates of mother-daughter inheritance parameters, whereas indirect approaches using lineage information only in the post-processing of individual cell parameters underestimate inheritance. Finally, we show on yeast osmotic shock response data that daughter cell parameters are largely determined by the mother, thus confirming the relevance of our method for the correct assessment of the onset of gene expression variability and the study of the transmission of regulatory factors.

Thursday, July 25th
8:35 AM-8:40 AM
Welcome and Start MLCSB COSI
Room: San Francisco (3rd Floor)
8:40 AM-9:00 AM
Proceedings Presentation: DeepLigand: accurate prediction of MHC class I ligands using peptide embedding
Room: San Francisco (3rd Floor)
  • Haoyang Zeng, Massachusetts Institute of Technology, United States
  • David Gifford, Massachusetts Institute of Technology, United States
  • Brandon Carter, Massachusetts Institute of Technology, United States

Presentation Overview:

The computational modeling of peptide display by class I major histocompatibility complexes (MHCs) is essential for peptide-based therapeutics design. Existing computational methods for peptide display focus on modeling the peptide-MHC binding affinity. However, such models are not able to characterize the sequence features for the other cellular processes in the peptide display pathway that determine MHC ligand selection. We introduce a semi-supervised model, DeepLigand, that outperforms the state-of-the-art models in MHC class I ligand prediction. DeepLigand combines a peptide language model and peptide binding affinity prediction to score MHC class I peptide presentation. The peptide language model characterizes sequence features that correspond to secondary factors in MHC ligand selection other than binding affinity. The peptide embedding is learned by pre-training on natural ligands, and discriminates between ligands and non-ligands in the absence of binding affinity prediction. While conventional affinity-based models fail to classify peptides with moderate affinities, DeepLigand discriminates ligands from non-ligands with consistently high accuracy.

9:00 AM-9:20 AM
Proceedings Presentation: Adversarial domain adaptation for cross data source macromolecule in situ structural classification in cellular electron cryo-tomograms
Room: San Francisco (3rd Floor)
  • Ruogu Lin, Carnegie Mellon University, United States
  • Xiangrui Zeng, Carnegie Mellon University, United States
  • Kris M. Kitani, Carnegie Mellon University, United States
  • Min Xu, Carnegie Mellon University, United States

Presentation Overview:

Motivation: Since 2017, an increasing amount of attention has been paid to supervised deep learning based macromolecule in situ structural classification (i.e. subtomogram classification) in Cellular Electron Cryo-Tomography (CECT) due to the substantially higher scalability of deep learning. However, the success of such a supervised approach relies heavily on the availability of large amounts of labeled training data. For CECT, creating valid training data from the same data source as prediction data is usually laborious and computationally intensive. It would be beneficial to have training data from a separate data source where the annotation is readily available or can be performed in a high-throughput fashion. However, the cross data source prediction is often biased due to the different image intensity distributions (a.k.a. domain shift).

Results: We adapt a deep learning based adversarial domain adaptation method (3D-ADA) to address the domain shift problem in CECT data analysis. 3D-ADA first uses a source domain feature extractor to extract discriminative features from the training data as the input to a classifier. Then it adversarially trains a target domain feature extractor to reduce the distribution differences of the extracted features between training and prediction data. As a result, the same classifier can be directly applied to the prediction data. We tested 3D-ADA on both experimental and realistically simulated subtomogram datasets under different imaging conditions. 3D-ADA stably improved the cross data source prediction and outperformed two popular domain adaptation methods. Furthermore, we demonstrate that 3D-ADA can improve cross data source recovery of novel macromolecular structures.

9:20 AM-9:40 AM
Targeted Genetic Optimization of TF Binding with Neural Editing Architectures
Room: San Francisco (3rd Floor)
  • Anvita Gupta, Stanford University, United States
  • Anshul Kundaje, Stanford University, United States

Presentation Overview:

Targeted gene editing, or optimizing existing DNA sequences, could enable synthetic biology applications from designing promoters for higher gene expression to modifying DNA to treat genetic diseases. Current approaches for targeted gene editing rely on the intuition of biologists or costly massively parallel experiments; no status-quo neural architectures exist that leverage state of the art machine learning methods to do targeted DNA sequence editing to explicitly optimize sequences for useful properties, like transcription factor binding or gene expression.

Here, we propose a custom neural architecture for targeted sequence editing - the EDA architecture - consisting of an encoder, decoder, and analyzer. We evaluate the architecture to edit genomic sequences to bind to transcription factor SPI1. We compare the novel architecture to baseline approaches, including a textual variational autoencoder and rule-based editing model, which are current state-of-the-art approaches for editing DNA sequences. The model outputs are evaluated on plausibility, similarity to original sequences, and predicted binding to transcription factor SPI1. Compared to state-of-the-art approaches, the EDA architecture significantly improves binding scores of genomic sequences to 84.4% predicted positive versus 17%, while maintaining a high gapped-kmer similarity between original and generated sequences (52.4% vs 5.2%).

10:15 AM-11:15 AM
MLCSB Keynote: NOT all the things: submodular representative set selection for when big data is too big
Room: San Francisco (3rd Floor)

Presentation Overview:

Not everyone loves big data. Sometimes a data set is too big to compute on efficiently or to wrap our minds around. In such situations, we would like a way to pare down a big data set, preferably in a smarter fashion than randomly discarding some data. Submodular optimization provides a flexible paradigm to approach such problems. I will describe how selecting a subset of data examples that maximizes a submodular set function can yield a subset that is much smaller than, but still representative of, the full data set. I will then demonstrate the utility of this general framework in three settings: selecting representative sets of protein sequences, selecting genomic loci for characterization in a CRISPR-based assay, and prioritizing epigenomic and transcriptomic experiments.
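
As a rough illustration of representative set selection by submodular maximization, the sketch below runs the standard greedy algorithm on a facility-location objective. The choice of facility location, the function names, and the toy similarity matrix are assumptions made for this example; the abstract does not specify which objective or optimizer the speaker uses.

```python
def facility_location(selected, sim):
    """f(S) = sum over all items i of the best similarity between i and S."""
    if not selected:
        return 0.0
    return sum(max(row[j] for j in selected) for row in sim)


def greedy_select(sim, k):
    """Pick k representatives, each time adding the item with the largest marginal gain."""
    selected = []
    for _ in range(k):
        current = facility_location(selected, sim)
        best_item, best_gain = None, -1.0
        for j in range(len(sim)):
            if j in selected:
                continue
            gain = facility_location(selected + [j], sim) - current
            if gain > best_gain:
                best_item, best_gain = j, gain
        selected.append(best_item)
    return selected


# Made-up similarity matrix: items 0-2 form one cluster, items 3-4 another.
sim = [
    [1.0, 0.9, 0.8, 0.1, 0.0],
    [0.9, 1.0, 0.9, 0.1, 0.1],
    [0.8, 0.9, 1.0, 0.0, 0.1],
    [0.1, 0.1, 0.0, 1.0, 0.9],
    [0.0, 0.1, 0.1, 0.9, 1.0],
]
representatives = greedy_select(sim, 2)
```

On this toy input the greedy procedure picks the most central item of the first cluster and then one item from the second cluster, rather than two near-duplicates. For monotone submodular objectives with non-negative similarities, greedy selection is known to achieve at least a (1 - 1/e) fraction of the optimal value.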

11:15 AM-11:35 AM
Proceedings Presentation: MOLI: Multi-Omics Late Integration with deep neural networks for drug response prediction
Room: San Francisco (3rd Floor)
  • Hossein Sharifi-Noghabi, School of Computing Science, Simon Fraser University, Burnaby, BC, Canada
  • Olga Zolotareva, Faculty of Technology and Center for Biotechnology, Bielefeld University, Germany
  • Colin Collins, Vancouver Prostate Centre, Vancouver, BC, Canada
  • Martin Ester, Simon Fraser University, Canada

Presentation Overview:

Motivation: Historically, gene expression has been shown to be the most informative data for drug response prediction. Recent evidence suggests that integrating additional omics can improve the prediction accuracy, which raises the question of how to integrate the additional omics. Regardless of the integration strategy, clinical utility and translatability are crucial. Thus, we reasoned that a multi-omics approach combined with clinical datasets would improve drug response prediction and clinical relevance.
Results: We propose MOLI, a Multi-Omics Late Integration method based on deep neural networks. MOLI takes somatic mutation, copy number aberration, and gene expression data as input, and integrates them for drug response prediction. MOLI uses type-specific encoding subnetworks to learn features for each omics type, concatenates them into one representation and optimizes this representation via a combined cost function consisting of a triplet loss and a binary cross-entropy loss. The former makes the representations of responder samples more similar to each other and different from the non-responders, and the latter makes this representation predictive of the response values. We validate MOLI on in vitro and in vivo datasets for five chemotherapy agents and two targeted therapeutics. Compared to state-of-the-art single-omics and early integration multi-omics methods, MOLI achieves higher prediction accuracy in external validations. Moreover, a significant improvement in MOLI’s performance is observed for targeted drugs when training on a pan-drug input, i.e. using all the drugs with the same target compared to training only on drug-specific inputs. MOLI's high predictive power suggests it may have utility in precision oncology.
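
The combined cost function described above (a triplet loss plus a binary cross-entropy loss on the concatenated representation) can be sketched for a single training triplet as follows. The squared Euclidean distance, the margin, and the weighting factor `gamma` are assumptions chosen for illustration, as are all function names; they are not taken from the MOLI implementation.

```python
import math


def squared_distance(u, v):
    """Squared Euclidean distance between two representation vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Penalize triplets where the anchor is not closer to the positive than to the negative by the margin."""
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(0.0, gap + margin)


def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Cross-entropy between a 0/1 response label and a predicted probability."""
    p = min(max(p_pred, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


def combined_loss(anchor, positive, negative, y_true, p_pred, gamma=0.5, margin=1.0):
    """Classification loss plus a weighted triplet term (gamma is a hypothetical weight)."""
    return binary_cross_entropy(y_true, p_pred) + gamma * triplet_loss(
        anchor, positive, negative, margin
    )


# Toy representations: a responder anchor, another responder, and a non-responder.
well_separated = combined_loss([0.0, 0.0], [0.1, 0.0], [2.0, 0.0], y_true=1, p_pred=0.9)
poorly_separated = combined_loss([0.0, 0.0], [1.0, 0.0], [0.5, 0.0], y_true=1, p_pred=0.9)
```

In the well-separated case the triplet term vanishes and only the cross-entropy remains, while in the poorly separated case the triplet term adds a penalty that pushes responder representations together and away from non-responders.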

11:35 AM-11:55 AM
Multi-group factor analysis framework for structured single-cell omics data
Room: San Francisco (3rd Floor)
  • Danila Bredikhin, EMBL, Germany

Presentation Overview:

Single-cell RNA sequencing (scRNA-seq) has become a ubiquitous method for studying gene expression dynamics. However, the high dimensionality of the datasets and the inherent amounts of technical noise make the analysis of scRNA-seq data challenging.

An important computational strategy for analysing scRNA-seq data is dimensionality reduction, which allows one to learn a latent representation of the cells that can be used for data interpretation, visualisation, and feature extraction. Existing approaches typically ignore the rich structure of single-cell experiments, which may include multiple groups of cells or multiple omics profiled.

We propose a statistical framework for learning the latent sources of cell-to-cell variability in structured data sets. The model builds upon group factor analysis, a Bayesian framework that includes hierarchical sparsity priors on factor loadings to efficiently integrate multi-view data. Here we re-define the sparsity priors to include a group-specific regularisation in order to disentangle the activity of factors across multiple groups of cells. Effectively, this allows the quantification, for every latent factor, of how much variability is shared between the different groups of cells, e.g. different cell types, tissues, or donor cohorts.
Importantly, we employ stochastic variational inference and GPU-accelerated computations in order to accommodate large volumes of single-cell sequencing data.

11:55 AM-12:15 PM
Proceedings Presentation: Modeling Clinical and Molecular Covariates of Mutational Process Activity in Cancer
Room: San Francisco (3rd Floor)
  • Welles Robinson, University of Maryland, United States
  • Roded Sharan, School of computer science, Tel Aviv university, Israel
  • Mark Leiserson, University of Maryland, United States

Presentation Overview:

Motivation: Somatic mutations result from processes related to DNA replication or environmental/lifestyle exposures. Knowing the activity of mutational processes in a tumor can inform personalized therapies, early detection, and understanding of tumorigenesis. Computational methods have revealed 30 validated signatures of mutational processes active in human cancers, where each signature is a pattern of single base substitutions. However, half of these signatures have no known etiology, and some similar signatures have distinct etiologies, making patterns of mutation signature activity hard to interpret. Existing mutation signature detection methods do not consider tumor-level clinical/demographic (e.g., smoking history) or molecular features (e.g., inactivations to DNA damage repair genes).
Results: To begin to address these challenges, we present the Tumor Covariate Signature Model (TCSM), the first method to directly model the effect of observed tumor-level covariates on mutation signatures. To this end, our model uses methods from Bayesian topic modeling to change the prior distribution on signature exposure conditioned on a tumor's observed covariates. We also introduce methods for imputing covariates in held-out data and for evaluating the statistical significance of signature-covariate associations. On simulated and real data, we find that TCSM outperforms both non-negative matrix factorization and topic modeling-based approaches, particularly in recovering the ground truth exposure to similar signatures. We then use TCSM to discover five mutation signatures in breast cancer and predict homologous recombination repair deficiency in held-out tumors. We also discover four signatures in a combined melanoma and lung cancer cohort, using cancer type as a covariate, and provide statistical evidence to support earlier claims that three lung cancers from The Cancer Genome Atlas are misdiagnosed metastatic melanomas.

Availability: TCSM is implemented in Python 3 and available at https://github.com/lrgr/tcsm, along with a data workflow for reproducing the experiments in the paper.

12:15 PM-12:35 PM
A deep learning system can accurately classify primary and metastatic cancers based on patterns of passenger mutations
Room: San Francisco (3rd Floor)
  • Wei Jiao, Ontario Institute for Cancer Research, Canada
  • Gurnit Atwal, University of Toronto, Vector Institute, Ontario Institute for Cancer Research, Canada
  • Paz Polak, Icahn School of Medicine at Mount Sinai, United States
  • Rosa Karlic, University of Zagreb, Croatia
  • Edwin Cuppen, UMC Utrecht, Netherlands
  • Alexandra Danyi, UMC Utrecht, Netherlands
  • Jeroen de Ridder, UMC Utrecht, Netherlands
  • Carla van Herpen, Radboud University, Netherlands
  • Martijn Lolkema, Erasmus University Medical Center, Netherlands
  • Neeltje Steeghs, Netherlands Cancer Institute, Netherlands
  • Gad Getz, Harvard University, United States
  • Lincoln Stein, University of Toronto, Ontario Institute for Cancer Research, Canada
  • Quaid Morris, University of Toronto, Canada

Presentation Overview:

In cancer, the primary tumour's organ of origin and histopathology are the strongest determinants of its clinical behaviour, but in 3% of new cancer diagnoses, a cancer patient presents with a metastatic tumour and no obvious primary. Challenges also arise when distinguishing a metastatic recurrence of a previously treated cancer from the emergence of a new one. Here we train a deep learning classifier to predict cancer type based on patterns of somatic passenger mutations detected in whole genome sequencing (WGS) of 2606 tumours representing 24 common cancer types. Our classifier achieves an accuracy of 91% on held-out tumour samples from this set. On primary and metastatic samples from an independent cohort, it achieves accuracies of 87% and 85%, respectively. This is double the accuracy of pathologists who were presented with a metastatic tumour without knowledge of the primary. Surprisingly, adding information about driver mutations reduced classifier accuracy. Our results have immediate clinical applicability, underscoring how patterns of somatic passenger mutations encode the state of the cell of origin, and can inform future strategies to detect the source of cell-free circulating tumour DNA.

2:00 PM-3:00 PM
MLCSB Keynote: Relational Representation Learning as a New Approach in Computational Biology
Room: San Francisco (3rd Floor)
  • Marinka Zitnik, Stanford University, United States

Presentation Overview: Show

The success of machine learning in biomedicine generally depends on data representation. This is because different representations describe more or less different dimensions of biological variation. Although one can carefully hand-engineer features for machine learning and painstakingly extract them from biomedical data, the quest for AI is motivating the design of algorithms that can automatically learn powerful data representations.

This talk will focus on a key aspect of this quest: integration of data and knowledge from several dimensions of biological variation into rich, heterogeneous networks; enhancement of these networks to reduce biases and uncertainty; and learning from the networks to open doors for scientific discoveries. I will first describe representation learning algorithms that learn how to map nodes, or larger network structures, to embeddings, points in a compact vector space whose geometry is optimized to reflect interactions, the essence of networks. These algorithms use data transformation techniques based on random walks over graphs, message-passing in neural networks, and graph convolutions and move beyond prevailing deep learning methods, which ignore biomedical interactions and treat networks as simple, flat matrix-like datasets.

I will then discuss how these advancements in graph neural networks and graph embedding methods enabled us to predict protein functions in specific human tissues, identify combinations of drugs safe for patients, repurpose old drugs for new diseases, and identify new drug targets and disease proteins, among many others. In all studies, we collaborated closely with experimental biologists and clinical scientists to give insights and validate predictions made by our methods. I will conclude with open directions for explainable discovery infrastructure that is necessary to fully unlock biomedical data.

3:00 PM-3:20 PM
Data Mining of Alcoholic Liver Disease Progression from Health Registry Data
Room: San Francisco (3rd Floor)
  • Dhouha Grissa, NNF Center for protein research - University of Copenhagen, Denmark
  • Ditlev Nytoft Rasmussen, Odense University Hospital, Denmark
  • Aleksander Krag, Odense University Hospital, Denmark
  • Søren Brunak, Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Denmark
  • Lars Juhl Jensen, The Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Denmark

Presentation Overview: Show

Alcoholic liver disease (ALD) is one of the most prevalent chronic liver diseases worldwide, causing pathological changes in the liver due to excessive consumption of alcohol. It progresses from fatty liver through alcoholic liver fibrosis (ALF) to cirrhosis (ALC). Unfortunately, the clear majority of ALD patients are diagnosed only once ALD has reached the irreversible ALC stage. Here, we use data from Danish health registries to examine whether it is possible to identify patients likely to develop ALF or ALC based on their past medical history. To this end, we use statistical and machine-learning techniques to analyze data from the Danish National Patient Registry. Consistent with the late diagnoses of ALD, we show that ALC is the most common form of ALD in the registry data and that ALC patients have a strong over-representation of diagnoses associated with liver dysfunction, such as ascites and hepatic failure. We also find a small number of ALF patients who appear to be much less sick than those with ALC. Our findings highlight the potential of this approach to uncover hidden knowledge in registry data related to ALD.

3:20 PM-3:40 PM
A pitfall for machine learning methods aiming to predict across cell types
Room: San Francisco (3rd Floor)
  • Jacob Schreiber, University of Washington, United States
  • Ritambhara Singh, University of Washington, United States
  • Jeffrey Bilmes, University of Washington, United States
  • William Noble, University of Washington, United States

Presentation Overview: Show

Machine learning models to predict phenomena such as gene expression, enhancer activity, transcription factor binding, or chromatin conformation are most useful when they can generalize to make accurate predictions across cell types. In this situation, a natural strategy is to train the model on experimental data from some cell types and evaluate performance on one or more held-out cell types. In this work, we show that, when the training set contains examples derived from the same genomic loci across multiple cell types, then the resulting model can be susceptible to a particular form of bias related to memorizing the average activity associated with each genomic locus. Consequently, the trained model may appear to perform well when evaluated on the genomic loci that it was trained on but tends to perform poorly on loci that it was not trained on. We demonstrate this phenomenon by using epigenomic measurements and nucleotide sequence to predict gene expression and chromatin domain boundaries, and we suggest methods to diagnose and avoid the pitfall. We anticipate that, as more data and computing resources become available, future projects will increasingly risk suffering from this issue.
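One way to make the pitfall concrete is to compare any cross-cell-type model against a baseline that simply predicts, for each locus, the average activity seen at that locus in the training cell types. The sketch below is illustrative only and is not taken from the presented work: the synthetic data, the function names, and the choice of Pearson correlation as the metric are all assumptions.

```python
import numpy as np

def locus_average_baseline(train_activity):
    """Predict activity in a held-out cell type as the per-locus mean
    over the training cell types.

    train_activity: array of shape (n_train_cell_types, n_loci).
    """
    return np.asarray(train_activity, dtype=float).mean(axis=0)

def pearson(a, b):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])

# Hypothetical synthetic data: each measurement is a shared per-locus
# effect plus a smaller cell-type-specific component.
rng = np.random.default_rng(0)
n_cell_types, n_loci = 6, 500
locus_effect = rng.normal(size=n_loci)
cell_specific = 0.3 * rng.normal(size=(n_cell_types, n_loci))
activity = locus_effect + cell_specific

# Hold out the last cell type; the loci are the same in train and test.
train, held_out = activity[:-1], activity[-1]
baseline = locus_average_baseline(train)
baseline_r = pearson(baseline, held_out)
print(f"locus-average baseline, Pearson r = {baseline_r:.3f}")
```

If a trained model does not clearly beat this baseline on a held-out cell type, its apparent accuracy may largely reflect memorized per-locus averages. Evaluating on loci that were excluded from training as well is a complementary check.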

3:40 PM-4:00 PM
Interpretation of deep learning models in genomics: splicing codes as a case study
Room: San Francisco (3rd Floor)
  • Anupama Jha, University of Pennsylvania, United States
  • Joseph K. Aicher, University of Pennsylvania, United States
  • Deependra Singh, University of Pennsylvania, United States
  • Yoseph Barash, University of Pennsylvania, United States

Presentation Overview: Show

The success of deep learning models has led to their rapid adoption for genomics tasks such as predicting DNA binding sites of proteins and RNA splicing outcomes. One major limitation of such models, though, especially in biomedical applications, is their black-box nature, which hinders interpretability. A recent promising method to address this limitation is Integrated Gradients (IG), which identifies features associated with a deep model's prediction for a given sample. IG works by aggregating the gradients along the inputs that fall on the straight line between a baseline point and the sample of interest.
In this work we address several limitations of IG. First, we define a procedure to identify features significantly associated with a specific prediction task such as differentially included exons in the brain. Then, we assess the effect of using different reference point definitions, and replacing the original single linear path used in IG with nonlinear variants. These variants include neighbors path in the original space (O-N-IG) and the hidden space (H-N-IG), and linear path in the hidden space (H-L-IG). Together, our proposed methods for selecting significant features, reference points, and paths for integrated gradients establish a framework to interpret deep learning models for genomic tasks.
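For orientation, the original straight-line form of IG can be sketched in a few lines. The example below is a minimal illustration, assuming a toy logistic model with an analytic gradient; the model, weights, and input are hypothetical, and the nonlinear path variants named above (O-N-IG, H-N-IG, H-L-IG) are not implemented here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def model(x, w, b):
    """Toy differentiable model: logistic regression output for one sample."""
    return sigmoid(x @ w + b)

def model_grad(x, w, b):
    """Analytic gradient of the toy model output with respect to x."""
    p = model(x, w, b)
    return p * (1.0 - p) * w

def integrated_gradients(x, baseline, w, b, steps=200):
    """Approximate IG along the straight line from baseline to x.

    The path integral is approximated by a midpoint Riemann sum:
    gradients are averaged over `steps` points on the line and then
    scaled by (x - baseline).
    """
    alphas = (np.arange(steps) + 0.5) / steps
    path = baseline + alphas[:, None] * (x - baseline)  # (steps, n_features)
    grads = np.stack([model_grad(point, w, b) for point in path])
    return (x - baseline) * grads.mean(axis=0)

# Hypothetical weights and input, chosen only for illustration.
w = np.array([1.5, -2.0, 0.0, 0.5])
b = 0.1
x = np.array([1.0, -0.5, 3.0, 1.0])
baseline = np.zeros_like(x)

attributions = integrated_gradients(x, baseline, w, b)
output_change = model(x, w, b) - model(baseline, w, b)
print("attributions:", attributions)
print("sum of attributions:", attributions.sum(), "output change:", output_change)
```

The attributions should sum approximately to the difference between the model output at the sample and at the baseline, which is a convenient sanity check on the number of integration steps. The result depends on the choice of baseline, which is one of the design decisions the abstract examines.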

4:00 PM-4:20 PM
Learning the language of life
Room: San Francisco (3rd Floor)
  • Jose Juan Almagro Armenteros, Technical University of Denmark, Denmark
  • Alexander Rosenberg Johansen, Technical University of Denmark, Denmark
  • Henrik Nielsen, Technical University of Denmark, Denmark
  • Ole Winther, Technical University of Denmark, Denmark

Presentation Overview: Show

Machine learning models trained on protein data tend to underperform due to the small amount of annotated data. Recent research has shown that Language Models (LM) trained on unlabeled protein sequences can be used to improve performance on protein prediction tasks. However, protein LMs have not been fully studied, and their full capabilities are yet to be explored. A protein LM can be defined as a model that predicts the next amino acid given the context preceding that amino acid. In this research, we focus on assembling a high-quality protein dataset suitable for protein language modeling and training a Recurrent Neural Language Model on this dataset. We show that the protein LM learns to predict the next amino acid in a sequence and creates amino acid representations that are context dependent. In addition, our protein LM is able to predict the probability of a protein sequence and can thereby discriminate between real and fake proteins. Finally, we show that our model can also generate new protein sequences with similar features to real proteins.
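The definition above implies that a sequence's probability is the product of its next-amino-acid probabilities. The sketch below illustrates that scoring step with a smoothed bigram model, where the context is just the previous residue. This is a deliberately simple stand-in, not the recurrent model described in the abstract, and the training strings are made-up toy sequences.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
START = "^"  # hypothetical start-of-sequence symbol

def train_bigram_lm(sequences, pseudocount=1.0):
    """Fit a bigram next-amino-acid model with additive smoothing.

    Returns a function next_prob(prev, aa) giving P(aa | prev).
    """
    counts = {}
    for seq in sequences:
        prev = START
        for aa in seq:
            row = counts.setdefault(prev, {})
            row[aa] = row.get(aa, 0.0) + 1.0
            prev = aa

    def next_prob(prev, aa):
        row = counts.get(prev, {})
        total = sum(row.values()) + pseudocount * len(AMINO_ACIDS)
        return (row.get(aa, 0.0) + pseudocount) / total

    return next_prob

def sequence_log_prob(seq, next_prob):
    """Log-probability of a sequence as a sum of next-residue log-probabilities."""
    log_p = 0.0
    prev = START
    for aa in seq:
        log_p += math.log(next_prob(prev, aa))
        prev = aa
    return log_p

# Toy training strings, invented for illustration only.
toy_sequences = ["MKTAYIAKQR", "MKTLLVAAGA", "MKKTAIAIAV"]
next_prob = train_bigram_lm(toy_sequences)

seen_like = sequence_log_prob("MKTA", next_prob)
unseen_like = sequence_log_prob("WWWW", next_prob)
print(seen_like, unseen_like)
```

A sequence resembling the training data receives a higher log-probability than one built from unseen transitions, which is the basis for separating real from artificial sequences. A recurrent model replaces the single-residue context with a learned summary of the whole prefix.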

4:20 PM-4:40 PM
Concluding Remarks - MLCSB
Room: San Francisco (3rd Floor)