- Mi Yang, Heidelberg University, Germany
- Michael Menden, AstraZeneca, United Kingdom
- Patricia Jaaks, Wellcome Sanger Institute, United Kingdom
- Jonathan Dry, AstraZeneca, United States
- Mathew Garnett, Wellcome Sanger Institute, United Kingdom
- Julio Saez-Rodriguez, Heidelberg University, Germany
Cancer monotherapies are hampered by the ability of tumor cells to escape inhibition through rewiring or alternative pathways. Therefore, smart drug combination approaches are essential in controlling cancer proliferation and survival. We present two complementary workflows: one for prioritising drug synergy enrichment in high-throughput screens, and a subsequent workflow to predict hypothesis-driven patient stratification. Both workflows rely on Bayesian matrix factorization to explore mechanistic relations between pathway activations derived from gene expression profiles and putative drug targets.
We introduce the notion of target functional similarity, which reflects how similarly effective drugs are as a function of targeted signaling pathway activities. We argue that functional similarities between protein targets shed light on different synergy mechanisms: (Mechanism 1) when functionally similar proteins are targeted, synergy occurs by hitting the same pathway or mechanism; (Mechanism 2) when functionally opposite proteins are targeted, synergy emerges because each drug compensates for the other’s escape mechanism. The synergy prediction workflow estimates the likelihood of a synergistic effect for any given pair of protein targets based on the value of the target functional similarity.
We next use the resulting mechanisms to build specific models to stratify cell lines. For functionally similar target pairs (Mechanism 1), synergy is predicted by maximizing the sensitizing pathways and minimizing those associated with resistance: we predict the synergy score with a linear combination of pathway activations, namely the activities of sensitizing pathways minus the activities of those conferring resistance. For functionally opposite pairs (Mechanism 2), synergy may arise in a situation of drug resistance. Therefore, we maximize the pathways conferring resistance and minimize the sensitizing pathways.
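The scoring rule described above can be sketched as follows. This is a minimal illustration, not the authors' model: it assumes unit weights in the linear combination, and the function name `synergy_score`, the pathway names, and the activity values are all hypothetical.

```python
from typing import Mapping, Sequence


def synergy_score(
    activity: Mapping[str, float],
    sensitizing: Sequence[str],
    resistance: Sequence[str],
    mechanism: int = 1,
) -> float:
    """Score one cell line from its pathway activities.

    Assumption: every pathway has weight 1; the abstract only says
    "linear combination" and does not give the coefficients.
    """
    sens = sum(activity[p] for p in sensitizing)
    res = sum(activity[p] for p in resistance)
    if mechanism == 1:
        # Functionally similar targets: sensitizing minus resistance.
        return sens - res
    if mechanism == 2:
        # Functionally opposite targets: the sign is flipped.
        return res - sens
    raise ValueError("mechanism must be 1 or 2")


# Hypothetical pathway activities for a single cell line.
cell_line = {"MAPK": 1.2, "PI3K": 0.4, "p53": -0.3}
score_m1 = synergy_score(cell_line, ["MAPK"], ["PI3K", "p53"], mechanism=1)
score_m2 = synergy_score(cell_line, ["MAPK"], ["PI3K", "p53"], mechanism=2)
```

Under these assumptions the two mechanisms give scores of equal magnitude and opposite sign for the same cell line, so ranking cell lines under Mechanism 2 simply reverses the Mechanism 1 ranking.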
We tested our synergy stratification workflow on a drug combination dataset for 7 pairs of protein targets (29 drug combinations) applied to 33 breast cancer cell lines. As the performance metric, we used Pearson’s correlation between observed and predicted synergy scores. We reached an average drug-wise correlation of 0.27. We next experimentally validated our synergy stratification workflow with a BRAF/Insulin Receptor combination (Dabrafenib/BMS-754807) on 48 colorectal cancer cell lines. The correlation was 0.31 across all 48 cell lines and 0.4 when KRAS status was taken into account.
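For readers who want to reproduce the metric, here is a small sketch of Pearson’s correlation and of one possible reading of "average drug-wise correlation": the correlation is computed per drug combination across cell lines and then averaged. That reading, the helper names, and the toy numbers are assumptions, not taken from the study.

```python
import math
from typing import Mapping, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def mean_drugwise_correlation(
    observed: Mapping[str, Sequence[float]],
    predicted: Mapping[str, Sequence[float]],
) -> float:
    """Average the per-combination correlations (assumed interpretation)."""
    values = [pearson(observed[c], predicted[c]) for c in observed]
    return sum(values) / len(values)


# Toy synergy scores for two hypothetical combinations over three cell lines.
observed = {"comboA": [1.0, 2.0, 3.0], "comboB": [1.0, 2.0, 3.0]}
predicted = {"comboA": [2.0, 4.0, 6.0], "comboB": [3.0, 2.0, 1.0]}
average_r = mean_drugwise_correlation(observed, predicted)
```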
In conclusion, the synergy prediction workflow can be a powerful framework for compound prioritization in large-scale drug screenings. For instance, testing only drugs that target two functionally very similar or very distinct proteins could significantly reduce the search space. The synergy stratification workflow could potentially maximize the efficacy of drugs already known to induce synergy.
- Sarvenaz Choobdar, University of Lausanne, Switzerland
- Mehmet Ahsen, Icahn School of Medicine at Mount Sinai, United States
- Jake Crawford, Tufts University, United States
- The Dream Module Identification Challenge Consortium, The Dream Challenges, Switzerland
- Donna Slonim, Tufts University, United States
- Julio Saez-Rodriguez, RWTH Aachen and European Bioinformatics Institute, Germany
- Lenore Cowen, Tufts University, United States
- Sven Bergmann, University of Lausanne, Switzerland
- Daniel Marbach, University of Lausanne and Roche Pharmaceuticals, Switzerland
The Disease Module Identification challenge aimed to comprehensively
assess disease and trait module identification methods across diverse
molecular networks. Six research groups contributed
unpublished molecular networks and over 400 participants from all over
the world developed and applied a diverse set of module identification
methods. Teams predicted disease-relevant modules both within
individual networks (Subchallenge 1) and across multiple, layered
networks (Subchallenge 2). In the final round, 75 submissions,
including method descriptions and code, were made across the two
sub-challenges, providing a broad sampling of state-of-the-art
methods. The challenge employed a novel approach to use a large
collection of GWAS datasets to assess the performance of these methods
based on the number of discovered modules associated with complex
traits or diseases.
We highlight the biological and therapeutic relevance of some
of the trait-associated modules found by the top performing methods,
and also introduce a new method developed after the close of the
challenge to generate robust consensus modules which we describe in
detail. All challenge data, including the networks, GWAS datasets,
team submissions and code are available as a community resource at
https://www.synapse.org/modulechallenge. These benchmark datasets and
results of the challenge provide a reference point for future method
- Michael Banf, Max Planck Institute for Plant Breeding Research and Carnegie Institution for Science, Germany
- Seung Y. Rhee, Carnegie Institution for Science, United States
A gene regulatory network links transcription factors to their target genes and represents a map of transcriptional regulation. Much progress has been made in deciphering gene regulatory networks computationally. However, gene regulatory network inference for most eukaryotic organisms remains challenging. To improve the accuracy of gene regulatory network inference and facilitate candidate selection for experimentation, we developed an algorithm called GRACE (Gene Regulatory network inference ACcuracy Enhancement). GRACE exploits biological a priori knowledge and heterogeneous data integration to generate high-confidence network predictions for eukaryotic organisms using Markov Random Fields in a semi-supervised fashion. GRACE uses a novel optimization scheme to integrate regulatory evidence and biological relevance. It is particularly suited for model learning with sparse regulatory gold standard data. We show GRACE’s potential to produce high-confidence regulatory networks compared to state-of-the-art approaches using Drosophila melanogaster and Arabidopsis thaliana data. In an A. thaliana developmental gene regulatory network, GRACE recovers cell cycle related regulatory mechanisms and further hypothesizes several novel regulatory links, including a putative control mechanism of vascular structure formation due to modifications in cell proliferation.
- Andrew Gentles, Departments of Medicine, and Biomedical Data Science, Stanford University School of Medicine, United States
- Brian White, Sage Bionetworks, United States
- Aurélien de Reyniès, Programme Cartes d'Identité des Tumeurs, Ligue Nationale Contre le Cancer, France
- Aaron Newman, Stanford University, United States
- Andrew Lamb, Sage Bionetworks, United States
- Laura Heiser, Oregon Health and Science University, United States
- Joshua Waterfall, Institut Curie, PSL Research University, France
Immune cell infiltration in solid tumors correlates with patient outcome and therapeutic response. While specific cell-type infiltration can be elucidated by single-cell transcriptomic techniques, these suffer from biases and limitations of scale. To instead leverage existing large repositories of bulk gene expression data with clinical outcomes, computational methods have been developed to deconvolve sample profiles into their stromal (including immune) and malignant cell components. However, their performance has yet to be compared within an unbiased, objective framework. To assess methods and catalyze development of new approaches, we are organizing a community-wide DREAM Challenge.
The Challenge consists of: (1) an open phase, during which methods are trained on publicly-available transcriptomic profiles of cell populations; (2) a leaderboard phase, during which methods are submitted, assessed, and revised using bulk expression data having ground truth (e.g., ratios from mixing experiments or from flow cytometry); and (3) a validation phase, during which final submissions are assessed using independent expression profiles of known admixtures. The latter are generated in vitro by mixing RNA from multiple types of purified stromal and cancer cells. To assess sensitivity, we provide in silico admixtures from expression profiles of purified populations with a range of tumor “contamination.” Models will be submitted as Docker containers, executed in the cloud, and assessed based on their ability to predict levels of an individual cell type across samples. In sub-Challenge 1 models will predict coarsely-defined stromal populations, while in sub-Challenge 2 models will further dissect these into subsets (e.g., of T-cell subtypes).
The active phase will launch in early 2019 for seven weeks, with 240 participants already pre-registered. At its completion, we will identify features of best performing models and provide guidelines where improvements are needed.
- Pei Wang, Icahn School of Medicine at Mount Sinai, United States
Sample mislabeling that contributes to irreproducible results and invalid conclusions is known to be one of the obstacles in basic and translational research. This is also prevalent in data-rich large-scale omics studies, in which human errors could arise anywhere in the data production and analysis pipeline—either sample mislabeling (early in the pipeline) or data mislabeling (later in the pipeline). To address this issue, the National Cancer Institute and the Food and Drug Administration, in coordination with the DREAM Challenges, are launching the first computational challenge using multi-omics datasets to detect and correct specimen mislabeling. The objective of this challenge is to encourage development and evaluation of computational algorithms that can accurately detect and correct mislabeled samples using rich multi-omics datasets, enhancing the assurance that the right data is attributed to the right patient. In the talk, I will introduce the design of the challenge and some initial results.
- Sage Davis, Eck Institute of Global Health, University of Notre Dame, United States
- Geoffrey Siwo, Eck Institute of Global Health, University of Notre Dame, United States
- Pablo Meyer, Thomas J. Watson Research Center, IBM, United States
- Katrina Button-Simons, Eck Institute of Global Health, University of Notre Dame, United States
- Lisa Checkley-Needham, Eck Institute of Global Health, University of Notre Dame, United States
- François H Nosten, Shoklo Malaria Research Unit, Mahidol-Oxford Tropical Medicine Research Unit, Thailand
- Timothy J.C. Anderson, Texas Biomedical Research Institute, United States
- Gustavo Stolovitzky, Thomas J. Watson Research Center, IBM, United States
- Michael T Ferdig, Eck Institute of Global Health, University of Notre Dame, United States
Malaria is still a global health burden, despite numerous coordinated efforts for eradication. One obstacle to eradication is the malaria parasite’s quick adaptation to anti-malarial drugs, resulting in drug resistance. Resistance mechanisms are poorly understood for many anti-malarials, making it difficult to overcome these costly evolutionary changes. Recent reports from Southeast Asia of resistance against Artemisinins (Art), the last line of defense against multi-drug resistant malaria, raise concerns of a significant setback in eradication efforts. Understanding the changing biology of malaria parasites as they acquire resistance may be key to preventing the spread of resistance. Here we introduce the DREAM of Malaria challenge, the first DREAM challenge that focuses on an infectious disease agent, aimed at predicting the changing biology of Art-resistant malaria. This two-part challenge will predict Art resistance states of malaria isolates based on transcription data and predict synergistic anti-malarial combinations in drug-resistant isolates. The first challenge has two sub-challenges: (i) predicting Art-resistance status measured as in vivo clearance rate using transcriptional data from sensitive and resistant parasites; and (ii) predicting Art-resistance status measured as in vitro IC50. The aim of this challenge is to determine whether transcriptional data obtained from Art-resistant malaria field isolates could explain the complex Art resistance landscape in Southeast Asia. Unique aspects of the sub-challenges include a largely unannotated genome, the malaria parasite’s “just in time” transcription, and the peculiar nature of the resistance phenotype. The second challenge aims at determining which drugs may prove most effective in combination with Art.
The second challenge will build on knowledge gained from the first challenge, utilizing transcriptional signals to determine if there are any potential untapped combination therapies that may exploit the changed biology of Art-resistant isolates. Between the two challenges, the DREAM of Malaria challenge aims at understanding and attacking anti-malarial resistance, preventing any further treatment failures, and taking a large step towards eradication.
- Anna Cichonska, Institute for Molecular Medicine Finland FIMM, University of Helsinki, Finland
Despite several years of target-based drug discovery, chemical agents inhibiting single targets are still rare. Mapping the complete target space of drugs and drug-like compounds, including both intended “primary targets” as well as secondary “off-targets”, is therefore a critical part of drug discovery efforts. Such a map would enable one not only to explore the therapeutic potential of chemical agents but also to better predict and manage their possible adverse effects prior to clinical trials. However, the massive size of the chemical universe makes experimental bioactivity mapping of the full space of compound-target interactions quickly infeasible in practice, even with the modern automated high-throughput profiling assays.
The Illuminating the Druggable Genome (IDG)-DREAM Drug-Kinase Binding Prediction Challenge aims at evaluating the power of statistical and machine learning models as systematic and cost-effective means for catalyzing compound-target interaction mapping efforts by prioritizing the most potent interactions for further experimental evaluation. The focus is directed towards quantitative modelling of compound-kinase interactions to fully characterize the wide target activity spectrum of each compound. Kinase inhibitors currently form the largest group of new drugs approved for cancer treatment, and it is anticipated that a variety of other human diseases could also be treated using such kinase-modulating drugs.
The Challenge seeks to determine (i) the best computational approaches for predicting compound-kinase binding affinities, (ii) the most predictive chemical and genomic features for both compounds and kinase targets, and (iii) the best bioactivity data for the model training. In addition to the quantitative structure-activity relationship (QSAR) models, the teams are encouraged to explore various other statistical and machine learning modelling approaches, such as those based on multi-target, multi-view or multi-kernel learning as well as transfer learning. To develop the predictive models, the teams have access to standardized bioactivity data for model training and cross-validation, whereas the final evaluation of the model predictions will be based on blinded test data.
The Challenge was launched in October 2018 and it will be completed by the end of the year. The submissions from the participants will be evaluated based on currently unpublished dissociation constants (Kd) measuring the strength of the physical interaction for 944 compound-kinase pairs. This blinded dataset, provided by the IDG-Kinase consortium, was selected based on single-concentration percent inhibition data points from an initial full kinome-scan profiling of 400 multi-targeted compounds. Therefore, it provides a standardized and large enough set to evaluate the on- and off-target prediction accuracy of the submitted models. This talk will summarize the Challenge goals and baseline data, as well as give an overview of the round 1 leaderboard results.
- Minji Jeon, Korea University, South Korea
- Donghyeon Park, Korea University, South Korea
- Jinhyuk Lee, Korea University, South Korea
- Hwisang Jeon, Interdisciplinary Graduate Program in Bioinformatics, Korea University, South Korea
- Miyong Ko, Korea University, South Korea
- Jaewoo Kang, Korea University, South Korea
- Aik-Choon Tan, University of Colorado Anschutz Medical Campus, United States
The traditional drug discovery approach is to identify a suitable target for a disease and then find a compound that binds to the target. In this approach, the structures of compounds are considered the most important feature because it is assumed that similar structures will bind to the same target. Therefore, structural analogs of the drugs that bind to the target have been selected as drug candidates. However, compounds that are not structural analogs may still achieve the desired response and may be used for the disease. A new drug discovery method based on drug response, and not solely on drug structure, is necessary; therefore, we propose a drug response-based drug discovery model called ReSimNet.
- Shengbao Suo, Harvard University, United States
- Qian Zhu, Harvard University, United States
- Assieh Saadatpour, Harvard University, United States
- Lijiang Fei, Zhejiang University School of Medicine, China
- Guoji Guo, Zhejiang University School of Medicine, China
- Guo-Cheng Yuan, Harvard University, United States
Recent progress in single-cell technologies has enabled the identification of all major cell types in mouse. However, for most cell types, the regulatory mechanism underlying their identity remains poorly understood. By computational analysis of the recently published mouse cell atlas data, we have identified 202 regulons whose activities are highly variable across different cell types and, more importantly, predicted a small set of essential regulators for each major cell type in mouse. Systematic validation by automated literature- and data-mining provides strong additional support for our predictions. Thus, these predictions serve as a valuable resource for the broad biological community. Finally, we have built a user-friendly, interactive web portal to enable users to navigate this mouse cell network atlas.
- Thin Nguyen, Deakin University, Australia
- Buu Truong, University of South Australia, Australia
- Hoang Pham, University of South Australia, Australia
- Li Xiaomei, University of South Australia, Australia
- Thuc Le, University of South Australia, Australia
The positions of cells in the Drosophila embryo have been accurately estimated using 84 driver genes. The DREAM Single Cell Transcriptomics Challenge asks whether the cell locations can be comparably estimated using fewer genes. Feature selection thus becomes a critical part of the estimation. Our approach is to keep the genes that are least predictable from other genes. However, as missing data is a known issue with single-cell RNA sequencing (scRNA-seq), imputation should be employed before handing the data to the feature selection procedure. Once the features are selected for a prediction model, the locations whose gene expression best matches that of a cell are chosen as the cell’s locations. This matching can be measured via the Matthews correlation coefficient (MCC). In addition, as the locations are not evenly distributed, a less abnormal location is preferred. The local outlier factor (LOF) can be used to measure how abnormal a location is.
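The MCC-based matching step can be illustrated with a short sketch. It assumes that both the cell's expression and the reference atlas have been binarized to 0/1 over the selected genes; the function names and the toy data are hypothetical, and imputation and the LOF preference are left out.

```python
import math
from typing import List, Sequence


def mcc(cell: Sequence[int], location: Sequence[int]) -> float:
    """Matthews correlation coefficient between two binary (0/1) vectors."""
    tp = sum(1 for c, l in zip(cell, location) if c == 1 and l == 1)
    tn = sum(1 for c, l in zip(cell, location) if c == 0 and l == 0)
    fp = sum(1 for c, l in zip(cell, location) if c == 0 and l == 1)
    fn = sum(1 for c, l in zip(cell, location) if c == 1 and l == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: report 0 when the coefficient is undefined.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def best_locations(cell: Sequence[int], atlas: Sequence[Sequence[int]]) -> List[int]:
    """Return the indices of all atlas locations with the highest MCC."""
    scores = [mcc(cell, location) for location in atlas]
    best = max(scores)
    return [i for i, score in enumerate(scores) if score == best]


# Toy example: one cell, three candidate locations, four selected genes.
cell = [1, 0, 1, 0]
atlas = [
    [1, 0, 1, 0],  # identical pattern
    [0, 1, 0, 1],  # inverted pattern
    [1, 1, 0, 0],  # unrelated pattern
]
matches = best_locations(cell, atlas)
```

Ties are returned together on purpose: a further criterion, such as the LOF preference mentioned above, could then choose among equally well-matching locations.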
- Jianhua Ruan, Department of Computer Science, University of Texas at San Antonio, United States
- Maryam Zand, Department of Computer Science, University of Texas at San Antonio, United States
The DREAM Single Cell Transcriptomics Challenge is a community-wide effort to seek computational solutions for spatial mapping of single cells in tissues using single-cell RNAseq data and a reference atlas obtained from in situ hybridization data. We approached this problem by combining unsupervised and supervised machine learning algorithms and obtained promising results. First, to find a set of most informative genes, an unsupervised feature selection method was designed to optimize two biologically rational metrics based on the consistency between gene expression similarity and cell proximity. The “gold standard” locations of the cells to be predicted were not used at this stage, thus significantly reducing the chance of overfitting. Second, a Particle Swarm Optimization (PSO) algorithm was used to learn proper weights for different genes in order to maximize matches between the predicted locations and the “gold standard” locations. Cross-validation was performed to avoid overfitting. Finally, the information embedded in the cell topology was used to improve the predicted cell-location scores by weighted averaging of scores from neighbor locations. While our own evaluation shows that all three components are important for the performance of the algorithm, post-challenge analysis will be performed to evaluate the contribution from individual components and the biological significance of the selected genes and their associated weights.
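The final smoothing step, weighted averaging of scores from neighbor locations, might look roughly like the sketch below. The abstract does not give the weighting scheme, so the equal split between a location's own score and the mean of its neighbors (`self_weight=0.5`), the function name, and the toy neighborhood are all assumptions.

```python
from typing import List, Sequence


def smooth_scores(
    scores: Sequence[float],
    neighbors: Sequence[Sequence[int]],
    self_weight: float = 0.5,
) -> List[float]:
    """Blend each location's score with the mean score of its neighbors.

    scores[i] is the cell-location score of location i, and neighbors[i]
    lists the indices of the locations adjacent to i.
    """
    smoothed = []
    for i, score in enumerate(scores):
        adjacent = neighbors[i]
        if not adjacent:
            # An isolated location keeps its original score.
            smoothed.append(score)
            continue
        neighbor_mean = sum(scores[j] for j in adjacent) / len(adjacent)
        smoothed.append(self_weight * score + (1.0 - self_weight) * neighbor_mean)
    return smoothed


# Three locations on a line: 0 - 1 - 2, with a single sharp peak at 0.
raw = [1.0, 0.0, 0.0]
adjacency = [[1], [0, 2], [1]]
smoothed = smooth_scores(raw, adjacency)
```

In this toy case part of the peak at location 0 spreads to its neighbor, which is the intended effect: a cell that scores well at one location is also considered plausible at adjacent ones.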
- Peng Qiu, Georgia Tech, United States
This challenge posed an interesting feature selection question in the context of scRNAseq data and spatial analysis: how to select the best subset of the 84 landmark genes without using the in situ data of these 84 genes. Since the in situ data should not be used in feature selection, an alternative question is explored: within the scRNAseq data, how can we find a subset of the 84 genes that describes the same amount of variation as the full set of 84 genes? This alternative question motivated a new pipeline, which combined linear regression for expanding the landmark genes, PCA for dimension reduction, kNN for encoding similarities, and a new scoring metric to evaluate gene subsets.
Feature selection in this submission was performed by ranking the features with this pipeline, scoring each gene subset by its ability to recapitulate the similarities.
This method takes the scRNAseq data and the list of 84 genes as input, and builds a nearest neighbor graph to connect cells that are similar to each other according to the landmark-relevant genes. The subsequent feature ranking analyses seek to identify "useless" genes, whose exclusion does not significantly distort the similarity encoded in the graph. With this method, the 84 genes are rank-ordered by importance, which provides a unified solution for all three sub-challenges.
- Chang Shu, Department of Psychiatry, Yale School of Medicine, United States
- Xiaoyu Liang, Department of Psychiatry, Yale School of Medicine, United States
- Xinyu Zhang, Department of Psychiatry, Yale School of Medicine, United States
- Ying Hu, Center for Biomedical Bioinformatics, National Cancer Institute, United States
- Ke Xu, Department of Psychiatry, Yale School of Medicine, United States
Our goal is to select subsets of 60, 40, and 20 genes that contain the most spatial information out of the 84 known genes to predict cell position in the Drosophila embryo. We used two methods to select genes: the Maximum Distance (MD) approach, which we developed for this challenge, and a Genetic Algorithm (GA). To limit potential overfitting, we applied the DistMap method to map cell positions using the selected genes, with an additional procedure to remove outliers among the predicted positions. The MD approach maximizes the sum of distances between all selected genes. We calculated the Euclidean distance between each pair of genes and removed, one gene at a time, the gene with the lowest sum of Euclidean distances to the remaining genes. The gene removal procedure was repeated until the designated number of genes remained. The GA approach is based on the single-cell expression data and the gold standard predicted position. Multivariate Random Forest (MRF) was used in the GA to predict the x, y, z coordinates of the cells, and the genes were selected by the correlation between MRF-predicted and DistMap-predicted x, y, z coordinates. We found that the MD approach performed best in selecting the 60- and 40-gene sets, whereas the GA method performed better than the MD method in selecting the most informative 20-gene set. In conclusion, we successfully selected informative 60-, 40-, and 20-gene sets for the reconstruction of cell positions in the Drosophila embryo without overfitting the model.
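The MD gene-removal loop described above can be sketched in a few lines. This is an illustrative reading of the description, not the authors' code: it assumes each gene is represented by its expression vector across cells, and the function names and toy profiles are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two expression vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def select_by_max_distance(
    profiles: Dict[str, Sequence[float]], n_keep: int
) -> List[str]:
    """Backward elimination: repeatedly drop the gene whose summed
    distance to the remaining genes is the smallest."""
    if not 1 <= n_keep <= len(profiles):
        raise ValueError("n_keep must be between 1 and the number of genes")
    genes = list(profiles)
    # Pairwise distances are computed once and reused in every round.
    dist = {
        (g, h): euclidean(profiles[g], profiles[h])
        for g in genes
        for h in genes
    }
    while len(genes) > n_keep:
        weakest = min(
            genes,
            key=lambda g: sum(dist[(g, h)] for h in genes if h != g),
        )
        genes.remove(weakest)
    return genes


# Toy profiles: g1 and g2 are nearly redundant, g3 and g4 are distinct.
profiles = {
    "g1": [0.0, 0.0],
    "g2": [0.0, 1.0],
    "g3": [10.0, 0.0],
    "g4": [0.0, 10.0],
}
kept_three = select_by_max_distance(profiles, 3)
kept_two = select_by_max_distance(profiles, 2)
```

In the toy data the near-duplicate genes are removed first, which matches the intent of keeping genes that are far apart from one another. The sum over remaining genes is recomputed each round, so a gene's rank can change as others are removed.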