

MLCSB: Machine Learning in Computational and Systems Biology

COSI Track Presentations

Attention Presenters - please review the Speaker Information Page available here
Schedule subject to change
Monday, July 9th
10:15 AM-10:20 AM
MLCSB: Introduction
Room: Grand Ballroom C-F
10:20 AM-11:20 AM
Keynote: Olga Troyanskaya
Room: Grand Ballroom C-F
  • Olga Troyanskaya, Princeton University, United States
11:20 AM-11:40 AM
Deep-learning improves prediction of CRISPR-Cpf1 guide RNA activity
Room: Grand Ballroom C-F
  • Seonwoo Min, Seoul National University, South Korea
  • Hui Kwon Kim, Yonsei University College of Medicine, South Korea
  • Myungjae Song, Yonsei University College of Medicine, South Korea
  • Soobin Jung, Yonsei University College of Medicine, South Korea
  • Jae Woo Choi, Yonsei University College of Medicine, South Korea
  • Younggwang Kim, Yonsei University College of Medicine, South Korea
  • Sangeun Lee, Yonsei University College of Medicine, South Korea
  • Hyongbum Kim, Yonsei University College of Medicine, South Korea
  • Sungroh Yoon, Seoul National University, South Korea

Presentation Overview:

Targeted genome editing using the CRISPR-Cas (clustered regularly interspaced short palindromic repeats and CRISPR-associated proteins) system has rapidly become a mainstream method in molecular biology. Cpf1 (from Prevotella and Francisella 1), a recently reported effector endonuclease of the class 2 CRISPR-Cas system, has several characteristics that distinguish it from the predominant Cas9 nuclease. Although Cpf1 has broadened our options for efficiently modifying genes in various species and cell types, our knowledge of Cpf1 remains limited, especially regarding its target-sequence-dependent activity profiles.
Determination of CRISPR nuclease activity is one of the key initial steps in genome editing. Several computational approaches have been proposed for the in silico prediction of CRISPR nuclease activities. However, they rely on manual feature extraction, which inevitably limits their efficiency, robustness, and generalization performance. To address these limitations, this paper presents DeepCpf1, an end-to-end deep learning framework for CRISPR-Cpf1 guide RNA activity prediction. Leveraging (1) a convolutional neural network for feature learning from target sequence composition and (2) a multi-modal architecture for seamless integration of an epigenetic factor (chromatin accessibility), the proposed method significantly outperforms conventional approaches with an unprecedented level of accuracy. (Published in Nature Biotechnology, 2018.)
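As a toy illustration of the kind of input such a sequence model consumes (the actual DeepCpf1 architecture and learned filters are not reproduced here), the sketch below one-hot encodes a DNA target and scans it with a single hand-made motif filter, the elementary operation of a convolutional layer:

```python
import numpy as np

def one_hot(seq):
    """Encode a DNA sequence as a 4 x L binary matrix (rows: A, C, G, T)."""
    idx = {"A": 0, "C": 1, "G": 2, "T": 3}
    m = np.zeros((4, len(seq)))
    for j, b in enumerate(seq):
        m[idx[b], j] = 1.0
    return m

def conv1d(x, filt):
    """Valid 1-D convolution of a 4 x L one-hot matrix with a 4 x k motif filter."""
    k, L = filt.shape[1], x.shape[1]
    return np.array([np.sum(x[:, j:j + k] * filt) for j in range(L - k + 1)])

x = one_hot("TTTAGCTG")          # Cpf1 targets begin with a T-rich PAM
pam_filter = one_hot("TTT")      # a hand-made filter that fires on a TTT motif
scores = conv1d(x, pam_filter)
print(scores[0])                 # full match at position 0 -> 3.0
```

A trained model would learn many such filters from data and feed their activations through further layers, alongside the chromatin-accessibility input.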

11:40 AM-12:00 PM
Automatically eliminating errors induced by suboptimal parameter choices in transcript assembly
Room: Grand Ballroom C-F
  • Dan DeBlasio, Carnegie Mellon University, United States
  • Carl Kingsford, Carnegie Mellon University, United States

Presentation Overview:

The computational tools used for genomic analyses are becoming increasingly sophisticated. While these applications provide more accurate results, a new problem is emerging in that these pieces of software have a large number of tunable parameters. The default parameter choices are designed to work well on average, but the most interesting experiments are often not "average". Choosing the wrong parameter values can lead to significant results being overlooked, or false results being reported. We take the first steps towards generating an automated genomic analysis pipeline by developing a method for automatically choosing input-specific parameter values for reference-based transcript assembly. Using the parameter advising framework, first developed for multiple sequence alignment, we can optimize parameter choices for each input. In doing so, we provide the first method for finding advisor sets for applications with large numbers of tunable parameters. On over 1500 RNA-Seq samples in the Sequence Read Archive, area under the curve (AUC) for the Scallop transcript assembler shows a median increase of 28.9% over using only the default parameter choices. This approach is general, and when applied to StringTie it increases AUC by 13.1% on experiments from ENCODE. A parameter advisor for Scallop is available on Github (https://github.com/Kingsford-Group/scallopadvising).
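To make the parameter-advising idea concrete, here is a minimal, hypothetical sketch (toy numbers; not the authors' advisor-set algorithm or Scallop data): given a table of per-input accuracies for a few parameter choices, greedily pick a small advisor set, then take the best result per input:

```python
import numpy as np

# Toy accuracy table: rows = parameter choices, cols = inputs (e.g., AUC per sample).
acc = np.array([
    [0.70, 0.40, 0.55],   # row 0: the default parameter vector
    [0.50, 0.80, 0.50],
    [0.45, 0.45, 0.85],
])

def greedy_advisor_set(acc, k):
    """Greedily pick k parameter choices maximizing mean per-input best accuracy."""
    chosen = []
    for _ in range(k):
        best, best_gain = None, -1.0
        for c in range(acc.shape[0]):
            if c in chosen:
                continue
            gain = acc[chosen + [c]].max(axis=0).mean()
            if gain > best_gain:
                best, best_gain = c, gain
        chosen.append(best)
    return chosen

advisor = greedy_advisor_set(acc, 2)
# Per input, run every choice in the advisor set and keep the best result.
advised = acc[advisor].max(axis=0).mean()
default = acc[0].mean()
print(advisor, advised, default)
```

On this toy table the advised accuracy beats always using the default row, which is the effect the parameter advising framework aims for on real assembler inputs.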

12:00 PM-12:20 PM
Multi-scale Deep Tensor Factorization Learns a Latent Representation of the Human Epigenome
Room: Grand Ballroom C-F
  • Jacob Schreiber, University of Washington, United States
  • Timothy Durham, University of Washington, United States
  • Jeffrey Bilmes, University of Washington, United States
  • William Noble, University of Washington, United States

Presentation Overview:

The human epigenome has been experimentally characterized by dozens of assays across hundreds of cell types. Unfortunately, most potential experiments—combinations of cell types and assay types—have not been performed and likely will never be run due to their cost. A natural desire is to impute these missing data sets by extrapolating from currently available data. Previous imputation techniques include ChromImpute, an ensemble of regression trees, and PREDICTD, a tensor factorization approach. We adopt a deep tensor factorization model, called Avocado, that outperforms both prior approaches in terms of the mean-squared error on a pre-defined 1% of the human genome. In addition, we show that Avocado learns a latent representation of the genome that can be used to predict aspects of chromatin architecture, gene expression, promoter-enhancer interactions, and replication timing more accurately than similar predictions made from real or imputed data. We then use feature attribution methods to better understand how Avocado works. Finally, we use submodular selection to identify a representative subset of genomic regions, and we demonstrate that these regions can be used to train genome-wide estimators more efficiently than using all positions and to better effect than a randomly selected set of positions.
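A deep tensor factorization like Avocado is beyond a few lines, but the core idea — factorize a partially observed tensor and read imputed values off the reconstruction — can be sketched with a rank-1 model fit by gradient descent on the observed entries only (toy data; not the Avocado model or Roadmap data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "cell type x assay x genomic position" tensor generated from a rank-1 model.
I, J, K = 4, 3, 5
a, b, c = rng.normal(size=I), rng.normal(size=J), rng.normal(size=K)
truth = np.einsum("i,j,k->ijk", a, b, c)

mask = rng.random(truth.shape) < 0.7          # ~70% of entries "measured"
ah, bh, ch = (0.1 * rng.normal(size=s) for s in (I, J, K))

lr = 0.01
for _ in range(5000):
    # Squared error on observed entries only; missing entries contribute nothing.
    err = np.where(mask, np.einsum("i,j,k->ijk", ah, bh, ch) - truth, 0.0)
    ga = np.einsum("ijk,j,k->i", err, bh, ch)
    gb = np.einsum("ijk,i,k->j", err, ah, ch)
    gc = np.einsum("ijk,i,j->k", err, ah, bh)
    ah, bh, ch = ah - lr * ga, bh - lr * gb, ch - lr * gc

pred = np.einsum("i,j,k->ijk", ah, bh, ch)
mse_missing = np.mean((pred - truth)[~mask] ** 2)   # imputation error
baseline = np.mean(truth[~mask] ** 2)               # error of predicting zero
print(mse_missing, baseline)
```

Avocado replaces the rank-1 multilinear model with learned latent factors at multiple genomic scales fed through a neural network, but the train-on-observed, predict-the-rest structure is the same.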

12:20 PM-12:40 PM
Proceedings Presentation: Random forest based similarity learning for single cell RNA sequencing data
Room: Grand Ballroom C-F
  • Maziyar Baran Pouyan, University of Pittsburgh, United States
  • Dennis Kostka, University of Pittsburgh, United States

Presentation Overview:

Motivation: Genome-wide transcriptome sequencing applied to single cells (scRNA-seq) is rapidly becoming an assay of choice across many fields of biological and biomedical research. Scientific objectives often revolve around discovery or characterization of types or sub-types of cells, and therefore obtaining accurate cell–cell similarities from scRNA-seq data is a critical step in many studies. While rapid advances are being made in the development of tools for scRNA-seq data analysis, few approaches exist that explicitly address this task. Furthermore, the abundance and type of noise present in scRNA-seq datasets suggest that applying generic methods, or methods developed for bulk RNA-seq data, is likely suboptimal.
Results: Here we present RAFSIL, a random forest based approach to learn cell-cell similarities from scRNA-seq data. RAFSIL implements a two-step procedure, where feature construction geared towards scRNA-seq data is followed by similarity learning. It is designed to be adaptable and expandable, and RAFSIL similarities can be used for typical exploratory data analysis tasks like dimension reduction, visualization, and clustering. We show that our approach compares favorably with current methods across a diverse collection of datasets, and that it can be used to detect and highlight unwanted technical variation in scRNA-seq datasets in situations where other methods fail. Overall, RAFSIL implements a flexible approach yielding a useful tool that improves the analysis of scRNA-seq data.
Availability and Implementation: The RAFSIL R package is available at www.kostkalab.net/software.html

12:40 PM-2:00 PM
Lunch Break
2:00 PM-2:20 PM
Proceedings Presentation: mGPfusion: Predicting protein stability changes with Gaussian process kernel learning and data fusion
Room: Grand Ballroom C-F
  • Emmi Jokinen, Aalto University, Finland
  • Markus Heinonen, Aalto University, Finland
  • Harri Lähdesmäki, Aalto University, Finland

Presentation Overview:

Motivation: Proteins are commonly used by the biochemical industry in numerous processes. Refining a protein's properties via mutations affects its stability as well. Accurate computational methods for predicting how mutations affect protein stability are necessary to facilitate efficient protein design. However, the accuracy of predictive models is ultimately constrained by the limited availability of experimental data.

Results: We have developed mGPfusion, a novel Gaussian process (GP) method for predicting a protein's stability changes upon single and multiple mutations. The method complements the limited experimental data with large amounts of molecular simulation data. We introduce a Bayesian data fusion model that re-calibrates the experimental and in silico data sources and then learns a predictive GP model from the combined data. Our protein-specific model requires experimental data only for the protein of interest, and performs well even with few experimental measurements. mGPfusion models proteins by contact maps and infers the stability effects caused by mutations with a mixture of graph kernels. Our results show that mGPfusion outperforms state-of-the-art methods in predicting protein stability on a dataset of 15 different proteins, and that incorporating molecular simulation data improves model learning and prediction accuracy.
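The flavor of the data-fusion step can be sketched with an ordinary 1-D GP regression in which each data source carries its own noise variance, so a few precise "experimental" points are weighted more heavily than many noisy "simulated" ones (toy function and noise levels; not the mGPfusion graph kernels or calibration model):

```python
import numpy as np

def rbf(x1, x2, ell=1.0):
    """Squared-exponential kernel between two sets of 1-D inputs."""
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

f = np.sin                                    # stand-in "true stability change"

x_exp = np.array([0.5, 2.0])                  # few, precise experimental points
x_sim = np.linspace(0.0, 6.0, 25)             # many, noisy simulated points
rng = np.random.default_rng(1)
y_exp = f(x_exp) + 0.01 * rng.normal(size=x_exp.size)
y_sim = f(x_sim) + 0.30 * rng.normal(size=x_sim.size)

X = np.concatenate([x_exp, x_sim])
y = np.concatenate([y_exp, y_sim])
# Per-source noise variances play the role of the re-calibration step.
noise = np.concatenate([np.full(x_exp.size, 0.01**2),
                        np.full(x_sim.size, 0.30**2)])

K = rbf(X, X) + np.diag(noise)
xs = np.linspace(0.0, 6.0, 50)
mean = rbf(xs, X) @ np.linalg.solve(K, y)     # GP posterior mean at test points

rmse = np.sqrt(np.mean((mean - f(xs)) ** 2))
print(rmse)
```

The posterior mean recovers the underlying function well below the simulated-data noise level, illustrating how cheap but noisy data can be fused with scarce accurate measurements.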

Availability: Software implementation and datasets are available at github.com/emmijokinen/mgpfusion
Contact: emmi.jokinen@aalto.fi

2:20 PM-2:40 PM
Proceedings Presentation: DLBI: Deep learning guided Bayesian inference for structure reconstruction of super-resolution fluorescence microscopy
Room: Grand Ballroom C-F
  • Yu Li, KAUST, Saudi Arabia
  • Fan Xu, Chinese Academy of Sciences, China
  • Fa Zhang, Chinese Academy of Sciences, China
  • Pingyong Xu, Chinese Academy of Sciences, China
  • Mingshu Zhang, Chinese Academy of Sciences, China
  • Ming Fan, Hangzhou Dianzi University, China
  • Lihua Li, Hangzhou Dianzi University, China
  • Xin Gao, King Abdullah University of Science and Technology, Saudi Arabia
  • Renmin Han, KAUST, Saudi Arabia

Presentation Overview:

Super-resolution fluorescence microscopy, with a resolution beyond the diffraction limit of light, has become an indispensable tool to directly visualize biological structures in living cells at a nanometer-scale resolution. Despite advances in high-density super-resolution fluorescent techniques, existing methods still have bottlenecks, including extremely long execution time, artificial thinning and thickening of structures, and lack of ability to capture latent structures.

Here we propose DLBI, a novel deep learning guided Bayesian inference approach for the time-series analysis of high-density fluorescent images. Our method combines the strengths of deep learning and statistical inference: deep learning captures the underlying distribution of the fluorophores that is consistent with the observed time-series fluorescent images by exploring local features and correlations along the time axis, and statistical inference further refines the ultrastructure extracted by deep learning and endows the final image with physical meaning. In particular, our method contains three main components. The first is a simulator that takes a high-resolution image as input and simulates time-series low-resolution fluorescent images based on experimentally calibrated parameters, providing supervised training data for the deep learning model. The second is a multi-scale deep learning module that captures both spatial information in each input low-resolution image and temporal information among the time-series images. The third is a Bayesian inference module that takes the image from the deep learning module as the initial localization of fluorophores and removes artifacts by statistical inference. Comprehensive experimental results on both real and simulated datasets demonstrate that our method provides more accurate and realistic local-patch and large-field reconstruction than the state-of-the-art method, the 3B analysis, while being more than two orders of magnitude faster.

2:40 PM-3:00 PM
Proceedings Presentation: COSSMO: Predicting Competitive Alternative Splice Site Selection using Deep Learning
Room: Grand Ballroom C-F
  • Hannes Bretschneider, University of Toronto, Canada
  • Shreshth Gandhi, Deep Genomics, Inc., Canada
  • Amit G Deshwar, Deep Genomics, Inc., Canada
  • Khalid Zuberi, Deep Genomics, Inc., Canada
  • Brendan Frey, Deep Genomics, Inc., Canada

Presentation Overview:

Motivation: Alternative splice site selection is inherently competitive, and the probability of a given splice site being used depends strongly on the strength of neighboring sites. Here we present a new model, the Competitive Splice Site Model (COSSMO), which explicitly models these competitive effects and predicts the PSI (percent spliced in) distribution over any number of putative splice sites. We model an alternative splicing event as the choice of a 3’ acceptor site conditional on a fixed upstream 5’ donor site, or the choice of a 5’ donor site conditional on a fixed 3’ acceptor site. We build four different architectures that use convolutional layers, communication layers, LSTMs, and residual networks, respectively, to learn relevant motifs from sequence alone. We also construct a new dataset from genome annotations and RNA-Seq read data that we use to train our model.
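The competitive aspect — per-site scores for any number of candidates normalized into a PSI distribution — comes down to a softmax over the scores, sketched here with hypothetical values (the networks that produce such scores are the substance of COSSMO and are not reproduced):

```python
import numpy as np

def psi(scores):
    """Softmax over per-site scores -> PSI (percent spliced in) distribution."""
    e = np.exp(scores - np.max(scores))   # subtract max for numerical stability
    return e / e.sum()

# Hypothetical strengths of three competing acceptor sites given a fixed donor.
scores = np.array([2.0, 0.5, -1.0])
p = psi(scores)
print(p, p.sum())
```

Strengthening one site necessarily lowers the predicted usage of its competitors, which is exactly the coupling that scoring sites in isolation cannot capture.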

Results: COSSMO predicts the most frequently used splice site with an accuracy of 70% on unseen test data, and achieves an R2 of 60% in modeling the PSI distribution. We visualize the motifs that COSSMO learns from sequence and show that COSSMO recognizes the consensus splice site sequences as well as many known splicing factors with high specificity.

Availability: Our dataset is available from http://cossmo.deepgenomics.com.

3:00 PM-4:00 PM
Keynote: Adventures with sparsity and structure in computational biology
Room: Grand Ballroom C-F
  • Matthew Stephens, University of Chicago, United States
4:00 PM-4:40 PM
Coffee Break
4:40 PM-5:00 PM
A Machine learning approach to dissect subcellular protrusion heterogeneity and the underlying actin regulator dynamics
Room: Grand Ballroom C-F
  • Hee June Choi, Worcester Polytechnic Institute, United States
  • Chuangqi Wang, Worcester Polytechnic Institute, United States
  • Kwonmoo Lee, Worcester Polytechnic Institute, United States

Presentation Overview:

Cell protrusion is morphodynamically heterogeneous at the subcellular level. However, the mechanism of cell protrusion has been understood based on the ensemble average of actin regulator dynamics. Here, we establish a computational framework to deconvolve the subcellular heterogeneity of lamellipodial protrusion from live cell imaging. HACKS (deconvolution of Heterogeneous Activity in Coordination of cytosKeleton at a Subcellular level) identifies distinct subcellular phenotypes based on machine-learning algorithms and reveals their underlying actin regulator dynamics at the leading edge. Using our method, we discover ‘accelerating protrusion’, which is driven by previously unknown temporal coordination of Arp2/3 and VASP activities. We validate our finding by drug treatment assays and further identify fine regulation of Arp2/3 and VASP recruitment associated with accelerating protrusion. Our study suggests that HACKS, combined with pharmacological perturbations, can be a powerful tool for discovering drug effects by revealing susceptible morphodynamic phenotypes and the associated changes in molecular dynamics.

5:00 PM-5:20 PM
Inferring transcriptional regulatory programs in gynecological cancers
Room: Grand Ballroom C-F
  • Hatice Osmanbeyoglu, MSKCC, United States
  • Petar Jelinic, New York University, United States
  • Douglas Levine, New York University, United States
  • Christina Leslie, Memorial Sloan-Kettering Cancer Center, United States

Presentation Overview:

Most regulatory network inference approaches in cancer use expression data to analyze transcription factor (TF) motifs in annotated promoter regions. However, distal enhancers are important for fine-tuning gene expression. We developed an experimental and computational strategy to incorporate the effect of enhancers on gene regulatory programs across multiple cancers. Our framework, PSIONIC (Patient Specific Inference of Networks Incorporating Chromatin), enables us to selectively share information across tumors and explore similarities and differences in patient-specific inferred TF activities. The impact of TFs on gene regulation has not been well characterized in gynecological and basal breast cancers. Hence, we applied our approach to 723 RNA-seq experiments from gynecological and basal breast cancer tumors as well as 96 cell lines. To integrate regulatory sequence information into our models, we also generated an ATAC-seq dataset profiling chromatin accessibility in cell line models of these cancers. Our analysis identified tumor-type-specific and common TF regulators of gene expression, as well as predicted dysregulated transcriptional regulators. In vitro assays confirmed that PSIONIC-inferred TF activities were predictive of sensitivity to targeted TF inhibitors. Moreover, many of the identified TF regulators were significantly associated with survival outcome within the tumor type.

5:20 PM-5:40 PM
Proceedings Presentation: Optimization and profile calculation of ODE models using second order adjoint sensitivity analysis
Room: Grand Ballroom C-F
  • Paul Stapor, Helmholtz Center for Environmental Health, Germany
  • Fabian Fröhlich, Helmholtz Center for Environmental Health, Germany
  • Jan Hasenauer, Institute of Computational Biology, Helmholtz Zentrum München, Germany

Presentation Overview:

Motivation: Parameter estimation methods for ordinary differential equation (ODE) models of biological processes can exploit gradients and Hessians of objective functions to achieve convergence and computational efficiency. However, the computational complexity of established methods to evaluate the Hessian scales linearly with the number of state variables and quadratically with the number of parameters. This limits their application to low-dimensional problems.
Results: We introduce second order adjoint sensitivity analysis for the computation of Hessians and a hybrid optimization-integration based approach for profile likelihood computation. Second order adjoint sensitivity analysis scales linearly with the number of parameters and state variables. The Hessians are effectively exploited by the proposed profile likelihood computation approach. We evaluate our approaches on published biological models with real measurement data. Our study reveals an improved computational efficiency and robustness of optimization compared to established approaches, when using Hessians computed with adjoint sensitivity analysis. The hybrid computation method was more than two-fold faster than the best competitor. Thus, the proposed methods and implemented algorithms allow for the improvement of parameter estimation for medium and large scale ODE models.
Availability: The algorithms for second order adjoint sensitivity analysis are implemented in the Advanced MATLAB Interface to CVODES and IDAS (AMICI, https://github.com/ICB-DCM/AMICI/). The algorithm for hybrid profile likelihood computation is implemented in the parameter estimation toolbox (PESTO, https://github.com/ICB-DCM/PESTO/). Both toolboxes are freely available under the BSD license. Contact: jan.hasenauer@helmholtz-muenchen.de
Supplementary information: Supplementary data are available at Bioinformatics online.

5:40 PM-6:00 PM
Proceedings Presentation: Improved pathway reconstruction from RNA interference screens by exploiting off-target effects
Room: Grand Ballroom C-F
  • Sumana Srivatsa, ETH Zurich, Switzerland
  • Jack Kuipers, ETH Zurich, Switzerland
  • Fabian Schmich, ETH Zurich, Switzerland
  • Simone Eicher, University of Basel, Switzerland
  • Mario Emmenlauer, University of Basel, Switzerland
  • Christoph Dehio, University of Basel, Switzerland
  • Niko Beerenwinkel, ETH Zurich, Switzerland

Presentation Overview:

Motivation: Pathway reconstruction has proven to be an indispensable tool for analyzing the molecular mechanisms of signal transduction underlying cell function. Nested effects models (NEMs) are a class of probabilistic graphical models designed to reconstruct signalling pathways from high-dimensional observations resulting from perturbation experiments, such as RNA interference (RNAi). NEMs assume that the short interfering RNAs (siRNAs) designed to knockdown specific genes are always on-target. However, it has been shown that most siRNAs exhibit strong off-target effects, which further confound the data, resulting in unreliable reconstruction of networks by NEMs.

Results: Here, we present an extension of NEMs called probabilistic combinatorial nested effects models (pc-NEMs), which capitalize on the ancillary siRNA off-target effects for network reconstruction from combinatorial gene knockdown data. Our model employs an adaptive simulated annealing search algorithm for simultaneous inference of network structure and error rates inherent to the data. Evaluation of pc-NEMs on simulated data with varying numbers of phenotypic effects and noise levels, as well as on real data, demonstrates improved reconstruction compared to classical NEMs. Application to Bartonella henselae infection RNAi screening data yielded an eight-node network largely in agreement with previous work, and revealed novel binary interactions of direct impact between established components.
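The adaptive simulated annealing search is problem-specific, but the generic annealing loop it builds on can be sketched on a stand-in scoring function (a hidden set of "true edges" scored by mismatch count; not the pc-NEM likelihood):

```python
import numpy as np

rng = np.random.default_rng(2)
target = rng.random(12) < 0.5           # hidden "true network" (edge indicators)

def score(g):
    """Stand-in for a network score (lower is better): edges wrong vs. target."""
    return np.sum(g != target)

g = np.zeros(12, dtype=bool)            # start from the empty network
T = 2.0                                 # initial temperature
for _ in range(4000):
    cand = g.copy()
    cand[rng.integers(12)] ^= True      # propose flipping one edge
    delta = score(cand) - score(g)
    # Always accept improvements; accept worsenings with probability exp(-delta/T).
    if delta <= 0 or rng.random() < np.exp(-delta / T):
        g = cand
    T *= 0.999                          # geometric cooling schedule

print(score(g))
```

Early on, the high temperature lets the search cross score barriers; as it cools, the walk settles into a (near-)optimal structure. pc-NEMs additionally adapt the proposal and infer error rates alongside the structure.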

Tuesday, July 10th
8:35 AM-8:40 AM
MLCSB: Introduction
Room: Grand Ballroom C-F
8:40 AM-9:40 AM
Keynote: Machine learning in cancer genomics
Room: Grand Ballroom C-F
  • Quaid Morris
9:40 AM-10:15 AM
Coffee Break
10:20 AM-10:40 AM
Drug-Target Interaction Prediction with Deep Convolutional Neural Networks Using Compound Images
Room: Grand Ballroom C-F
  • Ahmet Süreyya Rifaioğlu, Middle East Technical University, Turkey
  • Volkan Atalay, Middle East Technical University, Turkey
  • Maria Martin, EMBL-EBI, United Kingdom
  • Rengul Atalay, METU, Turkey
  • Tunca Dogan, EMBL-EBI, CanSyL, METU, United Kingdom

Presentation Overview:

The identification of interactions between drug candidate compounds and target biomolecules is an important step in drug discovery. Within the last decade, computational approaches (i.e., "virtual screening") have been developed with the objective of aiding experimental work by predicting novel drug-target interactions (DTIs) via the construction and application of statistical models. Deep learning algorithms drew attention in virtual screening after a deep learning algorithm won Merck's drug discovery challenge. One of the current issues in DTI prediction is feature engineering, where obtaining the most representative compound feature vectors is challenging even with intensive manual work. In this study, our aim was to create a novel virtual screening system using convolutional neural networks (CNNs), widely used in image analysis, that extracts features of compounds from simple images of their skeletal formulas (i.e., 2D drawings). The main advantage of this system is that it reduces the time spent on generating complex compound features by letting the CNN learn complex features inherently from ready-to-use drawings. With extensive tests of different architectures and hyper-parameters, our initial optimal models reached an average accuracy of 0.82, demonstrating the potential of CNNs for DTI prediction without any feature engineering.

10:40 AM-11:00 AM
Can Deep Learned Omic features predict clinically successful therapeutic targets?
Room: Grand Ballroom C-F
  • Andrew Rouillard, GSK, United States
  • Mark Hurle, GlaxoSmithKline, United States
  • Pankaj Agarwal, GlaxoSmithKline, United States

Presentation Overview:

Target selection is the first and pivotal step in drug discovery. We collected a set of 332 targets that succeeded or failed in phase III clinical trials, and explored whether Omic features from Harmonizome could predict clinical success. We used stringent validation schemes with bootstrapping and modified permutation tests to assess feature robustness and generalizability while accounting for target class selection bias. We also used classifiers to perform multivariate feature selection and found that classifiers with a single feature performed as well in cross-validation as classifiers with more features. Successful targets tended to have lower mean expression and higher expression variance than failed targets. Finding this modest predictive signal led us to ask whether a Deep Learning model could automatically derive transcriptomic features with greater predictive signal. We opted for a transfer learning approach, where we first trained a stacked denoising auto-encoder to learn a low-dimensional encoding of targets’ tissue expression patterns, and then used the learned features to annotate targets. Preliminary results suggest the learned features encode both tissue-specific and non-tissue-specific functional information about targets and are promising predictors of target outcomes in clinical trials, as well as other target properties.

11:00 AM-11:20 AM
Proceedings Presentation: Improving genomics-based predictions for precision medicine through active elicitation of expert knowledge
Room: Grand Ballroom C-F
  • Iiris Sundin, Aalto University, Finland
  • Tomi Peltola, Aalto University, Finland
  • Luana Micallef, Aalto University, Finland
  • Homayun Afrabandpey, Aalto University, Finland
  • Marta Soare, Aalto University, Finland
  • Muntasir Mamun Majumder, University of Helsinki, Finland
  • Pedram Daee, Aalto University, Finland
  • Chen He, University of Helsinki, Finland
  • Baris Serim, University of Helsinki, Finland
  • Aki Havulinna, National Institute for Health and Welfare THL, Helsinki, Finland
  • Caroline Heckman, University of Helsinki, Finland
  • Giulio Jacucci, University of Helsinki, Finland
  • Pekka Marttinen, Aalto University, Finland
  • Samuel Kaski, Aalto University, Finland

Presentation Overview:

Motivation: Precision medicine requires the ability to predict the efficacies of different treatments for a given individual using high-dimensional genomic measurements. However, identifying predictive features remains a challenge when the sample size is small. Incorporating expert knowledge offers a promising approach to improve predictions, but collecting such knowledge is laborious if the number of candidate features is very large.
Results: We introduce a probabilistic framework to incorporate expert feedback about the impact of genomic measurements on the outcome of interest, and present a novel approach to collect the feedback efficiently, based on Bayesian experimental design. The new approach outperformed other recent alternatives in two medical applications: prediction of metabolic traits and prediction of sensitivity of cancer cells to different drugs, both using genomic features as predictors. Furthermore, the intelligent approach to collect feedback reduced the workload of the expert to approximately 11%, compared to a baseline approach.
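As a rough sketch of feedback targeting (a crude stand-in for the paper's Bayesian experimental design criterion, not its actual model), one can fit a Bayesian linear regression on a small genomic dataset and query the expert about the coefficient with the largest posterior variance:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 10, 5                                  # few samples, several candidate features
X = rng.normal(size=(n, p))
w_true = np.array([2.0, 0.0, 0.0, -1.5, 0.0]) # only two features are predictive
y = X @ w_true + 0.1 * rng.normal(size=n)

tau2, sigma2 = 1.0, 0.1**2                    # prior and noise variances
S = np.linalg.inv(X.T @ X / sigma2 + np.eye(p) / tau2)  # posterior covariance
m = S @ X.T @ y / sigma2                                # posterior mean

# Query the expert about the coefficient we are most uncertain about, so each
# piece of (laborious) feedback carries as much information as possible.
query = int(np.argmax(np.diag(S)))
print(query, np.round(m, 2))
```

The paper's contribution is a principled version of this loop — choosing queries by expected information gain and incorporating the elicited feedback back into the posterior — which is what drives the reported ~89% reduction in expert workload.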
Availability: Source code implementing the introduced computational methods is freely available at https://github.com/AaltoPML/knowledge-elicitation-for-precision-medicine.
Contact: first.last@aalto.fi
Supplementary information: Supplementary data are available at Bioinformatics

11:20 AM-11:40 AM
Proceedings Presentation: Discriminating early- and late-stage cancers using multiple kernel learning on gene sets
Room: Grand Ballroom C-F
  • Arezou Rahimi, Koç University, Turkey
  • Mehmet Gönen, Koç University, Turkey

Presentation Overview:

Motivation: Identifying molecular mechanisms that drive cancers from early to late stages is highly important for developing new preventive and therapeutic strategies. Standard machine learning algorithms can be used to discriminate early- and late-stage cancers from each other using their genomic characterisations. Even though these algorithms can achieve satisfactory predictive performance, their knowledge-extraction capability is quite restricted due to the highly correlated nature of genomic data. That is why we need algorithms that can also extract relevant information about these biological mechanisms using our prior knowledge about pathways/gene sets.

Results: In this study, we addressed the problem of separating early- and late-stage cancers from each other using their gene expression profiles. We proposed to use a multiple kernel learning formulation that makes use of pathways/gene sets (i) to obtain satisfactory/improved predictive performance and (ii) to identify biological mechanisms that might have an effect in cancer progression. We extensively compared our proposed multiple kernel learning on gene sets algorithm against two standard machine learning algorithms, namely, random forests and support vector machines, on 20 diseases from TCGA cohorts for two different sets of experiments. Our method obtained statistically significantly better or comparable predictive performance on most of the datasets using significantly fewer gene expression features. We also showed that our algorithm was able to extract meaningful and disease-specific information that gives clues about the progression mechanism.
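A minimal sketch of the gene-set kernel idea (toy data; the paper's MKL formulation learns the kernel weights jointly with the classifier, which is not reproduced here): build one kernel per gene set and weight each by its centered alignment with the class labels:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 40
y = np.repeat([1.0, -1.0], n // 2)                 # early vs. late stage labels

# Three "gene sets" of expression features; only the first is informative.
informative = y[:, None] * 0.8 + 0.3 * rng.normal(size=(n, 10))
noise1 = rng.normal(size=(n, 10))
noise2 = rng.normal(size=(n, 10))
gene_sets = [informative, noise1, noise2]

def alignment(K, y):
    """Centered kernel-target alignment: similarity of K to the label kernel."""
    H = np.eye(len(y)) - 1.0 / len(y)
    Kc = H @ K @ H
    Ky = np.outer(y, y)
    return np.sum(Kc * Ky) / (np.linalg.norm(Kc) * np.linalg.norm(Ky))

kernels = [Z @ Z.T for Z in gene_sets]             # one linear kernel per gene set
w = np.array([max(alignment(K, y), 0.0) for K in kernels])
w /= w.sum()                                       # normalized kernel weights
K_comb = sum(wi * Ki for wi, Ki in zip(w, kernels))
print(np.round(w, 2))
```

The weight on the informative gene set dominates, which is the interpretability payoff: the learned kernel weights point at the pathways that matter for discriminating stages.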

Availability: Our implementations of support vector machine and multiple kernel learning algorithms in R are available at https://github.com/mehmetgonen/gsbc together with the scripts that replicate the reported experiments.

11:40 AM-12:00 PM
Proceedings Presentation: A new method for constructing tumor specific gene co-expression networks based on samples with tumor purity heterogeneity
Room: Grand Ballroom C-F
  • Francesca Petralia, Mount Sinai Medical School, United States
  • Jie Peng, University of California, Davis, United States

Presentation Overview:

Tumor tissue samples often contain an unknown fraction of stromal cells. This problem, widely known as tumor purity heterogeneity (TPH), was recently recognized as a severe issue in omics studies. Specifically, if TPH is ignored when inferring co-expression networks, edges are likely to be estimated among genes with a mean shift between non-tumor and tumor cells rather than among gene pairs interacting with each other in tumor cells. To address this issue, we propose TSNet, a new method that constructs tumor-cell-specific gene/protein co-expression networks based on gene/protein expression profiles of tumor tissues. TSNet treats the observed expression profile as a mixture of expressions from different cell types and explicitly models the tumor purity percentage in each tumor sample.

Using extensive synthetic data experiments, we demonstrate that TSNet outperforms a standard graphical model which does not account for tumor purity heterogeneity. We then apply TSNet to estimate tumor specific gene co-expression networks based on TCGA ovarian cancer RNAseq data. We identify novel co-expression modules and hub structure specific to tumor cells.
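The mixture assumption at the heart of such models can be sketched in a simplified setting where tumor purities are known: each observed profile is a purity-weighted mixture of a tumor and a stromal mean profile, which least squares can deconvolve gene by gene (toy data; TSNet additionally estimates purity and the network structure):

```python
import numpy as np

rng = np.random.default_rng(5)
n_samples, n_genes = 50, 4
purity = rng.uniform(0.3, 0.9, size=n_samples)     # tumor fraction per sample

mu_tumor = np.array([5.0, 2.0, 8.0, 1.0])          # per-gene means in tumor cells
mu_stroma = np.array([1.0, 6.0, 2.0, 3.0])         # per-gene means in stromal cells
obs = (purity[:, None] * mu_tumor
       + (1 - purity)[:, None] * mu_stroma
       + 0.2 * rng.normal(size=(n_samples, n_genes)))

# Least-squares deconvolution of the two mean profiles from the mixed signal.
design = np.column_stack([purity, 1 - purity])     # n_samples x 2
est, *_ = np.linalg.lstsq(design, obs, rcond=None)
print(np.round(est[0], 1))   # estimated tumor means
print(np.round(est[1], 1))   # estimated stromal means
```

Ignoring the mixture (e.g., correlating `obs` columns directly) would pick up spurious structure driven by the tumor-vs-stroma mean shifts, which is exactly the artifact the abstract describes.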

12:00 PM-12:20 PM
Proceedings Presentation: Gene Prioritization Using Bayesian Matrix Factorization with Genomic and Phenotypic Side Information
Room: Grand Ballroom C-F
  • Pooya Zakeri, Katholieke Universiteit Leuven, Belgium
  • Jaak Simm, Katholieke Universiteit Leuven, Belgium
  • Adam Arany, Katholieke Universiteit Leuven, Belgium
  • Sarah Elshal, Katholieke Universiteit Leuven, Belgium
  • Yves Moreau, Katholieke Universiteit Leuve, Belgium

Presentation Overview: Show

Motivation: Most gene prioritization methods model each disease or phenotype individually, but this fails to capture patterns common to several diseases or phenotypes. To overcome this limitation, we formulate the gene prioritization task as the factorization of a sparsely filled gene-phenotype matrix, where the objective is to predict the unknown matrix entries. To deliver more accurate gene-phenotype matrix completion, we extend classical Bayesian matrix factorization to work with multiple side information sources. The availability of side information allows us to make nontrivial predictions for genes for which no previous disease association is known.

Results: Our gene prioritization method innovatively integrates not only data sources describing genes but also data sources describing Human Phenotype Ontology terms. Experimental results on our benchmarks show that our proposed model can effectively improve accuracy over the well-established gene prioritization method Endeavour. In particular, compared to Endeavour, our method offers promising results on diseases of the nervous system; diseases of the eye and adnexa; endocrine, nutritional and metabolic diseases; and congenital malformations, deformations and chromosomal abnormalities.
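The side-information mechanism, in which latent gene factors are expressed as a function of gene features so that genes with no known associations still receive predictions, can be sketched as follows. The SVD-plus-regression shortcut is a non-Bayesian simplification of the matrix factorization described above, and all dimensions and data are made up.

```python
import numpy as np

rng = np.random.default_rng(2)
n_genes, n_phen, n_feat, k = 60, 12, 5, 3

F = rng.normal(size=(n_genes, n_feat))   # gene side-information features
A = rng.normal(size=(n_feat, k))         # true link from features to latents
U = F @ A                                # gene latent factors
V = rng.normal(size=(n_phen, k))         # phenotype latent factors
R = U @ V.T                              # gene-phenotype association scores

train, test = np.arange(40), np.arange(40, 60)  # test genes: no known associations

# Factorize the training block (rank-k truncated SVD stands in for BMF).
Uf, s, Vt = np.linalg.svd(R[train], full_matrices=False)
U_train = Uf[:, :k] * s[:k]
V_hat = Vt[:k].T

# Link side features to latent factors by least squares (the side-information step).
A_hat, *_ = np.linalg.lstsq(F[train], U_train, rcond=None)

# Cold-start prediction for genes with no observed associations.
pred_test = (F[test] @ A_hat) @ V_hat.T
corr = np.corrcoef(pred_test.ravel(), R[test].ravel())[0, 1]
```

Because the test genes' predictions come entirely through their features, this mirrors the abstract's point that side information enables nontrivial predictions for genes with no previous disease association.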

12:20 PM-12:40 PM
Proceedings Presentation: Modeling Polypharmacy Side Effects with Graph Convolutional Networks
Room: Grand Ballroom C-F
  • Marinka Zitnik, Stanford University, United States
  • Monica Agrawal, Stanford University, United States
  • Jure Leskovec, Stanford University, United States

Presentation Overview: Show

Motivation: The use of drug combinations, termed polypharmacy, is common to treat patients with complex diseases or co-existing conditions. However, a major consequence of polypharmacy is a much higher risk of adverse side effects for the patient. Polypharmacy side effects emerge because of drug-drug interactions, in which activity of one drug may change, favorably or unfavorably, if taken with another drug. The knowledge of drug interactions is often limited because these complex relationships are rare, and are usually not observed in relatively small clinical testing. Discovering polypharmacy side effects thus remains an important challenge with significant implications for patient mortality and morbidity.

Results: Here, we present Decagon, an approach for modeling polypharmacy side effects. The approach constructs a multimodal graph of protein-protein interactions, drug-protein target interactions, and the polypharmacy side effects, which are represented as drug-drug interactions, where each side effect is an edge of a different type. Decagon is developed specifically to handle such multimodal graphs with a large number of edge types. Our approach develops a new graph convolutional neural network for multirelational link prediction in multimodal networks. Unlike approaches limited to predicting simple drug-drug interaction values, Decagon can predict the exact side effect, if any, through which a given drug combination manifests clinically. Decagon accurately predicts polypharmacy side effects, outperforming baselines by up to 69%. We find that it automatically learns representations of side effects indicative of co-occurrence of polypharmacy in patients. Furthermore, Decagon is particularly good at modeling polypharmacy side effects that have a strong molecular basis, while on predominantly non-molecular side effects it still achieves good performance thanks to effective sharing of model parameters across edge types. Decagon opens up opportunities to use large pharmacogenomic and patient population data to flag and prioritize polypharmacy side effects for follow-up analysis via formal pharmacological studies.
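Decagon's decoder scores a drug pair for each side-effect (edge) type with a tensor factorization over node embeddings, sigmoid(z_i^T D_r R D_r z_j), where R is shared across edge types and D_r is a per-type diagonal. A minimal sketch of just that decoder, with random placeholder values standing in for the embeddings the GCN encoder would produce:

```python
import numpy as np

rng = np.random.default_rng(3)
n_drugs, d, n_se = 6, 4, 2   # drugs, embedding dim, side-effect (edge) types

Z = rng.normal(size=(n_drugs, d))    # node embeddings (from the GCN encoder)
R_shared = rng.normal(size=(d, d))   # global interaction matrix, shared by all types
D = rng.normal(size=(n_se, d))       # per-edge-type diagonal importance weights

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def score(i, j, r):
    # Tensor-factorized decoder: sigmoid(z_i^T D_r R D_r z_j).
    Dr = np.diag(D[r])
    return sigmoid(Z[i] @ Dr @ R_shared @ Dr @ Z[j])

# Probability of each of the two side-effect types for two drug pairs.
probs = np.array([[score(0, 1, r), score(2, 3, r)] for r in range(n_se)])
```

Sharing R_shared while keeping D_r type-specific is what lets the model cover a very large number of edge types without a full parameter matrix per side effect.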

12:40 PM-2:00 PM
Lunch Break
2:00 PM-3:00 PM
Keynote: Context-specific and dynamic effects of genetic variation
Room: Grand Ballroom C-F
  • Alexis Battle
3:00 PM-3:20 PM
Integrating Heterogeneous Predictive Models using Reinforcement Learning
Room: Grand Ballroom C-F
  • Ana Stanescu, University of West Georgia, United States

Presentation Overview: Show

The application of systems biology and machine learning approaches to large amounts and varieties of biomedical data often yields predictive models that can potentially transform data into knowledge. However, it is not always obvious which techniques and/or datasets are most appropriate for specific problems, calling for alternatives such as heterogeneous ensembles capable of incorporating the inherent variety and complementarity of the many possible models. The problem of systematically constructing these ensembles from a large number and variety of base models/predictors is itself computationally and mathematically challenging. We developed novel algorithms for this problem that operate within a Reinforcement Learning (RL) framework to search the large space of all possible ensembles that can be generated from an initial set of base predictors. RL offers a more systematic alternative to the conventional ad hoc methods of choosing which base predictors to include in the final ensemble, and has the potential to derive optimal solutions to the problem. For the example problem of splice site identification, our algorithms yielded effective ensembles that perform competitively with those consisting of all the base predictors. Furthermore, these ensembles utilized a substantially smaller subset of the base predictors, potentially aiding the reverse engineering and eventual interpretation of the ensembles.
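A minimal flavor of the search problem: treat ensemble construction as sequentially adding base predictors while balancing exploration and exploitation. This epsilon-greedy sketch over majority-vote ensembles is only illustrative; the talk's actual RL formulation, base predictors, and reward function are not specified here, so every number below is invented.

```python
import random

random.seed(0)

# Hypothetical outputs of four base predictors on eight binary examples.
truth = [1, 0, 1, 1, 0, 0, 1, 0]
base_preds = [
    [1, 0, 1, 0, 0, 0, 1, 0],   # strong
    [1, 1, 1, 1, 0, 0, 1, 1],   # biased toward 1
    [0, 0, 1, 1, 0, 1, 1, 0],   # medium
    [1, 0, 0, 1, 1, 0, 0, 0],   # weak
]

def ensemble_accuracy(subset):
    # Majority vote of the chosen base predictors (ties count as 0).
    if not subset:
        return 0.0
    correct = 0
    for i, y in enumerate(truth):
        votes = sum(base_preds[m][i] for m in subset)
        correct += int((votes * 2 > len(subset)) == bool(y))
    return correct / len(truth)

# Epsilon-greedy episodes: grow an ensemble greedily, sometimes exploring.
best_subset, best_acc = [], 0.0
for episode in range(200):
    subset = []
    for _ in range(len(base_preds)):
        remaining = [m for m in range(len(base_preds)) if m not in subset]
        if not remaining:
            break
        if random.random() < 0.2:                                  # explore
            pick = random.choice(remaining)
        else:                                                      # exploit
            pick = max(remaining, key=lambda m: ensemble_accuracy(subset + [m]))
        if ensemble_accuracy(subset + [pick]) < ensemble_accuracy(subset):
            break                                                  # stop if adding hurts
        subset.append(pick)
    acc = ensemble_accuracy(subset)
    if acc > best_acc:
        best_subset, best_acc = subset[:], acc
```

Even this toy version often finds a small subset that matches or beats the full ensemble, which is the abstract's point about smaller, more interpretable ensembles.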

3:20 PM-3:40 PM
Applying semi-supervised variational inference to heterogeneous genomic data predicts heart enhancers
Room: Grand Ballroom C-F
  • Tahmid Mehdi, University of Toronto, Canada
  • Alan Moses, University of Toronto, Canada

Presentation Overview: Show

One challenge in integrating large genomic datasets is their increasing heterogeneity: continuous, binary and discrete features may all be relevant to a biological question. Coupled with the typically small numbers of positive and negative training examples (which renders deep learning suboptimal), this calls for semi-supervised approaches that handle heterogeneous data. However, most distance- and kernel-based clustering algorithms have difficulty partitioning heterogeneous data. Bayesian non-parametric model-based approaches, such as Dirichlet Process Mixtures, offer attractive alternatives. Here, we implement a Dirichlet Process Heterogeneous Mixture that can infer Gaussian, Bernoulli or Poisson distributions over features. We derived a novel variational inference algorithm to handle semi-supervised learning tasks where certain observations are forced to cluster together. We applied this model to 6209 genomic regions bound by transcription factors in mouse heart tissues, based on heterogeneous epigenetic features. A small number of experimentally validated enhancers were constrained to appear in the same cluster, and 29 additional bound regions clustered with them. Many of these are located near genes that are important for the heart, suggesting that our model can discover new enhancers. Our model provides a principled Bayesian method for semi-supervised integration of heterogeneous data.
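The heterogeneous-likelihood idea, a Gaussian likelihood for continuous features and a Bernoulli likelihood for binary ones within a single mixture, can be shown with a plain finite-mixture EM. The Dirichlet Process machinery, the Poisson option, and the semi-supervised must-link constraints are omitted here, and all data are simulated.

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulate two clusters of "regions" with one continuous and one binary feature.
n = 300
z = rng.random(n) < 0.5
cont = np.where(z, rng.normal(2.0, 0.5, n), rng.normal(-2.0, 0.5, n))
binf = (rng.random(n) < np.where(z, 0.9, 0.1)).astype(float)

# EM for a 2-component mixture with a Gaussian (fixed sigma=1) and a Bernoulli
# likelihood per component -- a finite stand-in for the DP version.
mu = np.array([1.0, -1.0])      # Gaussian means
p = np.array([0.6, 0.4])        # Bernoulli success probabilities
w = np.array([0.5, 0.5])        # mixture weights
for _ in range(50):
    # E-step: responsibilities from the product of per-feature likelihoods.
    g = np.exp(-0.5 * (cont[:, None] - mu) ** 2)
    b = p ** binf[:, None] * (1 - p) ** (1 - binf[:, None])
    r = w * g * b
    r /= r.sum(axis=1, keepdims=True)
    # M-step: responsibility-weighted updates of each feature's parameters.
    nk = r.sum(axis=0)
    mu = (r * cont[:, None]).sum(axis=0) / nk
    p = (r * binf[:, None]).sum(axis=0) / nk
    w = nk / n
```

The semi-supervised constraint in the talk would additionally force the responsibilities of certain observations (the validated enhancers) to agree, which is handled there by a dedicated variational inference algorithm.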

3:40 PM-4:00 PM
Proceedings Presentation: A scalable estimator of SNP heritability for Biobank-scale data
Room: Grand Ballroom C-F
  • Yue Wu, University of California, Los Angeles, United States
  • Sriram Sankararaman, University of California, Los Angeles, United States

Presentation Overview: Show

Heritability, the proportion of variation in a trait that can be explained by genetic variation, is an important parameter in efforts to understand the genetic architecture of complex phenotypes as well as in the design and interpretation of genome-wide association studies. Attempts to understand the heritability of complex phenotypes attributable to genome-wide SNP variation data have motivated the analysis of large datasets as well as the development of sophisticated tools to estimate heritability in these datasets.

Linear Mixed Models (LMMs) have emerged as a key tool for heritability estimation, where the parameters of the LMM, i.e., the variance components, are related to the heritability attributable to the SNPs analyzed. Likelihood-based inference in LMMs, however, imposes a serious computational burden.

We propose a scalable randomized algorithm for estimating variance components in LMMs. Our method is based on a method-of-moments (MoM) estimator with runtime complexity O(NMB) for N individuals and M SNPs, where B is a parameter that controls the number of random matrix-vector multiplications. Further, by leveraging the structure of the genotype matrix, we can reduce the time complexity to O(NMB / max(log_3 N, log_3 M)).

We demonstrate the scalability and accuracy of our method on simulated as well as empirical data. On standard hardware, our method computes heritability on a dataset of 500,000 individuals and 100,000 SNPs in 38 minutes.

4:20 PM-4:40 PM
Proceedings Presentation: A unifying framework for joint trait analysis under a non-infinitesimal model
Room: Grand Ballroom C-F
  • Ruth Johnson, University of California, Los Angeles, United States
  • Huwenbo Shi, University of California, Los Angeles, United States
  • Bogdan Pasaniuc, University of California, Los Angeles, United States
  • Sriram Sankararaman, University of California, Los Angeles, United States

Presentation Overview: Show

Motivation: A large proportion of risk regions identified by genome-wide association studies (GWAS) are shared across multiple diseases and traits. Understanding whether this clustering is due to sharing of causal variants or chance colocalization can provide insights into the shared etiology of complex traits and diseases.

Results: In this work, we propose a flexible, unifying framework to quantify the overlap between two traits called UNITY (Unifying Non-Infinitesimal Trait analYsis). We formulate a full generative model that makes minimal assumptions under a non-infinitesimal model and performs inference starting from GWAS summary association data. To address the very large parameter space, we propose a Metropolis-Hastings within collapsed Gibbs sampler to perform inference. Through comprehensive simulations and an analysis of height and BMI, we show that our method produces estimates consistent with the known genetic makeup of both height and BMI.
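As a toy illustration of the Metropolis-Hastings ingredient, the sketch below infers a single shared-causal proportion from simulated causal indicators. UNITY's actual collapsed Gibbs sampler operates on GWAS summary statistics over a much larger parameter space, so everything here is a simplified stand-in.

```python
import numpy as np

rng = np.random.default_rng(6)

# Simulate causal-for-both-traits indicators at 1000 SNPs with proportion p11.
n_snp, p11_true = 1000, 0.15
shared = rng.random(n_snp) < p11_true
k = int(shared.sum())

def log_lik(p, k, n):
    # Binomial log-likelihood of k shared-causal SNPs out of n, flat prior.
    if not 0.0 < p < 1.0:
        return -np.inf
    return k * np.log(p) + (n - k) * np.log(1 - p)

# Random-walk Metropolis-Hastings for p11 (one coordinate of a Gibbs sweep).
p_cur, samples = 0.5, []
for it in range(5000):
    prop = p_cur + rng.normal(0.0, 0.05)
    if np.log(rng.random()) < log_lik(prop, k, n_snp) - log_lik(p_cur, k, n_snp):
        p_cur = prop
    if it >= 1000:                      # discard burn-in
        samples.append(p_cur)

p11_hat = float(np.mean(samples))
```

In the full model the indicators themselves are latent and updated from summary statistics within the collapsed Gibbs sweep; here they are observed to keep the example to one parameter.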

4:40 PM-5:00 PM
Coffee Break (on the go) to Closing Keynote