Leading Professional Society for Computational Biology and Bioinformatics
Connecting, Training, Empowering, Worldwide

CAMDA: Critical Assessment of Massive Data Analysis

COSI Track Presentations



Schedule subject to change
Wednesday, July 24th
10:15 AM-10:20 AM
CAMDA Welcome
Room: Boston 1/2 (Ground Floor)
  • David P. Kreil, Chair of Bioinformatics, Boku University Vienna, Austria
10:20 AM-11:20 AM
International Space Station and hospital environments: Composition and function of microbiomes in confined built environments
Room: Boston 1/2 (Ground Floor)
  • Christine Moissl-Eichinger, Medical University of Graz, Austria
11:20 AM-11:40 AM
Data Analysis Challenges of the CAMDA Contest 2019
Room: Boston 1/2 (Ground Floor)
  • Wenzhong Xiao, Stanford and Harvard Medical School, United States
11:40 AM-12:00 PM
A Machine Learning Framework to Determine Geolocations from Metagenomics Profiling
Room: Boston 1/2 (Ground Floor)
  • Lihong Huang, Xiamen University, China
  • Canqiang Xu, Aginome Scientific, China
  • Wenxian Yang, Aginome Scientific, China
  • Rongshan Yu, Xiamen University, China

Profiling of microbiomes can answer forensic questions, including the geographical origin of an environmental sample. However, due to the rich and diverse interactions among microbiomes and the environment, associating microbiome samples with their origins is usually a challenging task. To this end, we developed a machine learning framework to predict the geolocations of microbiome samples based on metagenomics profiling. Specifically, our method uses abundance profiles of a set of species as fingerprints, where the species are selected using machine learning algorithms based on their power to discriminate between the cities in the dataset. In addition, the abundance profiles are binned to binary values according to their percentile in the dataset to avoid potential overfitting due to the small training set. Our results show that once the abundance profiles of the metagenomic data are extracted, data-driven machine learning algorithms can be used to predict the geolocation of an environmental sample from its metagenomic sequencing data with reasonable accuracy.
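
The percentile-based binning described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `binarize_by_percentile`, the choice of the 50th percentile as the cut-off, and the toy abundance values are all assumptions made for illustration.

```python
from statistics import quantiles


def binarize_by_percentile(profiles, percentile=50):
    """Turn per-species abundances into 0/1 fingerprints.

    profiles: list of samples, each a list of abundances (one per species).
    percentile: hypothetical cut-off (1..99); the abstract does not state one.
    """
    n_species = len(profiles[0])
    thresholds = []
    for j in range(n_species):
        column = [sample[j] for sample in profiles]
        # quantiles(..., n=100) returns the 99 cut points for percentiles 1..99.
        cuts = quantiles(column, n=100, method="inclusive")
        thresholds.append(cuts[percentile - 1])
    # A species is marked 1 in a sample when its abundance exceeds the
    # dataset-wide threshold for that species.
    return [
        [1 if sample[j] > thresholds[j] else 0 for j in range(n_species)]
        for sample in profiles
    ]


# Toy data: four samples, two species (values are made up).
toy_profiles = [[0, 5], [2, 1], [4, 3], [9, 7]]
fingerprints = binarize_by_percentile(toy_profiles)
```

Reducing each abundance to a single bit discards magnitude information, which is one plausible way to limit the capacity of a model trained on few samples.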

12:00 PM-12:20 PM
Metagenomic sequence classification to search for the origin of samples
Room: Boston 1/2 (Ground Floor)
  • Jolanta Kawulok, Silesian University of Technology, Poland

Metagenomes are sets of DNA fragments derived from microbes living in a given environment. Analysis of these fragments makes it possible to extract crucial information on the organisms that have left their traces in a given environmental sample. Based on this information, we may try to create a sample origin profile.
The work presented here is based on our earlier solution published at the 2018 CAMDA MetaSUB challenge. Here, we are focused on verifying the reliability of the microbiome fingerprints for identifying the sample origin. The samples from the MetaSUB Forensics Challenge, being a part of the CAMDA 2019 conference, are used in our research. We exploit our CoMeta program, which allows for fast classification of metagenome samples, as well as the Mash program to determine the mutual similarities between the new samples of unpublished origin.

12:20 PM-12:40 PM
Comparison between functional profiles derived from whole genome sequencing and inferred from 16S sequencing.
Room: Boston 1/2 (Ground Floor)
  • Carlos Sánchez Casimiro-Soriguer, Clinical Bioinformatic Area, Fundacion Progreso y Salud, Spain
  • Carlos Loucera, Clinical Bioinformatic Area, Fundacion Progreso y Salud, Spain
  • Daniel López López, Fundación Progreso y Salud, Spain
  • Joaquin Dopazo, Fundacion Progreso y Salud, Spain

Although whole genome sequencing (WGS) of metagenomic samples is becoming more frequent, offering comprehensive information on the gene composition of the bacterial species present in the pool, there is still a wealth of 16S metagenomic data available that allows indirect inference of the gene composition in the sample sequenced. We have recently demonstrated the feasibility of functional profiles derived from metagenomic WGS for predicting the geographical location of samples. Here we explore the suitability of functional profiles inferred from 16S metagenomic samples for geographical origin prediction and compare the compatibility of the functional information obtained from both types of sequencing technologies.

2:00 PM-2:40 PM
Systematic evaluation of microbial abundance from amplicon and shotgun sequencing for machine learning prediction of sample origin
Room: Boston 1/2 (Ground Floor)
  • Julie Chih-Yu Chen, Public Health Agency of Canada - National Microbiology Laboratory, Canada
  • Andrea Tyler, Public Health Agency of Canada - National Microbiology Laboratory, Canada

Recent technological advances have provided different ways to measure microbial abundances in collected samples. The 16S ribosomal RNA amplicon sequencing approach targets and specifically sequences the 16S rRNA gene of bacteria and archaea, whereas the shotgun whole genome sequencing (WGS) approach sequences all the DNA present in a sample. The latter thus allows identification to the level of species, evaluation of functional units, and concurrent identification of eukaryotes, fungi and DNA viruses. With the abundant data from different environmental sources and protocols in the CAMDA challenge, we first set out to evaluate differences in organism abundance among datasets generated using different sequencing technologies. Subsequently, we applied a supervised machine learning approach to predict sample origin using these data. Current prediction models focus on classification methods such as support vector machines, random forest or Bayesian approaches. However, the predictions of these classification models are limited to the sources from which the training samples were drawn. Hence, we propose to model longitude and latitude as the outcome variables using multivariate regression to enable predictions of new origins, and to use Lasso regularization to enhance prediction accuracy and avoid model overfitting.

2:40 PM-3:00 PM
MetaSUB: A Global Atlas of the Urban Microbiome
Room: Boston 1/2 (Ground Floor)
  • David Danko, Weill Cornell Graduate School of Medical Sciences, United States
3:00 PM-3:40 PM
Integration of human cell lines gene expression and chemical properties of drugs for Drug Induced Liver Injury prediction
Room: Boston 1/2 (Ground Floor)
  • Wojciech Lesinski, University of Bialystok, Poland
  • Agnieszka Kitlas Golinska, University of Bialystok, Poland
  • Krzysztof Mnich, University of Białystok, Poland
  • Witold R. Rudnicki, University of Białystok and ICM University of Warsaw, Poland

Motivation: Drug-induced liver injury (DILI) is one of the primary problems in drug development. Early prediction of DILI, based on the chemical properties of substances and experiments performed on cell lines, can bring a significant reduction in the cost of clinical trials. The current study aims to build predictive models of DILI using both the chemical properties of drugs and gene expression levels in cell lines treated with them.
Methods: We built cross-validated Random Forest predictive models using gene expression from 13 human cell lines and molecular properties of drugs. In this process we identified the most informative variables and built models on them. Models were built for expression profiles of individual cell lines and for chemical properties separately, as well as with different methods of integrating them.
Results: Models that used molecular descriptors alone, and models that used expression profiles from some cell lines, were only weakly predictive, with AUC ∈ (0.55 - 0.61). The individual models were then integrated using the Super Learner approach. The accuracy is significantly improved in the case of the composite model (AUC=0.74, MCC=0.34), which allows for a division of drug compounds into low-risk and high-risk classes.
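
For readers unfamiliar with the MCC figure quoted in the results, the following is a minimal sketch of how the Matthews correlation coefficient is computed from a binary confusion matrix. The function name and the example counts are illustrative assumptions and are not taken from the study.

```python
from math import sqrt


def matthews_corrcoef(tp, fp, tn, fn):
    """Matthews correlation coefficient for a binary confusion matrix.

    Returns a value in [-1, 1]; 0 corresponds to chance-level prediction.
    """
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Convention: MCC is set to 0 when a row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical counts for a high-risk / low-risk split (made-up numbers).
example_mcc = matthews_corrcoef(tp=6, fp=2, tn=8, fn=4)
```

Unlike plain accuracy, MCC uses all four cells of the confusion matrix, so it stays informative when the two classes are of unequal size.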

3:40 PM-4:00 PM
An ensemble learning approach for modeling the systems biology of drug-induced injury in human liver
Room: Boston 1/2 (Ground Floor)
  • Emre Guney, GRIB (IMIM-UPF), Spain
  • Joaquim Aguirre-Plans, GRIB (IMIM-UPF), Spain
  • Terezinha Souza, Maastricht University, Netherlands
  • Janet Piñero, GRIB (IMIM-UPF), Spain
  • Giulia Callegaro, Leiden University, Netherlands
  • Steven J. Kunnen, Leiden University, Netherlands
  • Laura I. Furlong, GRIB (IMIM-UPF), Spain
  • Baldo Oliva, GRIB (IMIM-UPF), Spain

Drug-induced liver injury (DILI) has a relatively high incidence rate, estimated to affect around 20 in 100,000 inhabitants worldwide each year. Many drugs ranging from pain killers to anti-tuberculous treatments can cause DILI. Despite DILI being one of the leading causes of acute liver failure, the pathophysiology of DILI is poorly understood and pinpointing the toxicity of compounds in human liver remains non-trivial. Accordingly, several methods have been proposed to predict the hepatotoxicity of compounds. Among these, machine learning models trained using structural features of drugs have shown good performance. Furthermore, the incorporation of gene- and pathway-level signatures from transcriptomics data has shown high predictive accuracy using Deep Neural Networks. In this work, to predict DILI, we investigated combining gene expression data from the Connectivity Map (CMap), target binding information and chemical similarity of drugs upon drug treatment into ensemble learning methods using random forest classifiers and gradient boosting machines.

4:40 PM-5:00 PM
Steps towards predictive models for DILI based on chemical structure and gene expression signatures and their interpretation
Room: Boston 1/2 (Ground Floor)
  • Anika Liu, University of Cambridge, United Kingdom
  • Peter Wright, University of Cambridge, United Kingdom
  • Aleksandra Bartosik, University of Cambridge, United Kingdom
  • Daniela Dolciami, University of Cambridge, United Kingdom
  • Moritz Walter, University of Cambridge, United Kingdom
  • Andreas Bender, University of Cambridge, United Kingdom

Drug-induced liver injury (DILI) is a major safety concern in drug development. One approach to detecting DILI early is to use cellular readouts such as gene expression profiles. However, it is unclear how much signal these contain with respect to DILI compared to molecular structure.
We identified 13 DILI-related proteins based on target prediction, such as Prostaglandin E synthase and c-Jun-N-terminal synthase 2, and 4 scaffolds from substructural mining including benzenesulfonamides.

Random Forest models based on chemical structure afforded better performance than those using gene expression with balanced accuracies of 0.70 and 0.55 during cross-validation, respectively. However, this is only a preliminary result as the number of drugs varied between the gene expression- and the chemical structure-based models, and the models’ ability to extrapolate to novel chemical space has not yet been evaluated.

A deeper analysis of the LINCS data has been difficult due to its noisiness and the limited coverage of compounds and genes. Improved results might be obtained by different data processing and filtering based on therapeutic doses. Further work will continue to evaluate the usefulness of gene expression and molecule structure with respect to understanding and predicting DILI and will also explore their complementarity.
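
The balanced accuracies quoted in this abstract (0.70 and 0.55) average the per-class recall, so a value of 0.5 corresponds to chance for a binary task regardless of class balance. A minimal sketch of the calculation follows; the function name and the confusion-matrix counts are illustrative assumptions, not figures from the study.

```python
def balanced_accuracy(tp, fp, tn, fn):
    """Mean of sensitivity (recall on positives) and specificity (recall on negatives)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Made-up counts: 10 DILI-positive and 100 DILI-negative compounds.
example_score = balanced_accuracy(tp=7, fp=60, tn=40, fn=3)
```

With these made-up counts, sensitivity is 0.7 and specificity is 0.4, so the score sits close to chance even though most positives are recovered.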

5:00 PM-5:10 PM
Prediction of human clinical drug-induced liver injury: cell-line responses versus chemical structures
Room: Boston 1/2 (Ground Floor)
  • Thin Nguyen, Deakin University, Australia
  • Hang Le, Nha Trang University, Viet Nam
  • Svetha Venkatesh, Deakin University, Australia

Drug-induced liver injury (DILI) is a major issue in drug discovery and development. This work examines whether gene expression is a good proxy for DILI, in comparison with the chemical structures of drugs. The results show that chemical structures, whether represented in 1D or as graphs, yield better results both in cross-validation on drugs with known labels and in independent validation with toxicity labels provided by the US FDA.

5:10 PM-5:20 PM
A Novel Gene Selection Method for Gene Expression Data for the Task of Cancer Type Classification
Room: Boston 1/2 (Ground Floor)
  • Arzucan Ozgur, Bogazici University, Turkey
  • Nuriye Özlem Özcan Şimşek, Boğaziçi University, Turkey
  • Fikret Gurgen, Bogazici University, Turkey

Genomic data can be utilized for the diagnosis of many diseases, such as cancer. Cancer is caused by mutations in DNA. These mutations may be active or suppressed, and the result of their active or suppressed state can be identified from gene expression. In this study, we utilize and transfer information on the effect of mutations in the development of cancer to build a novel gene selection method for gene expression data. We tested the proposed method on the task of diagnosing and differentiating cancer types. Our experimental results show that the proposed gene selection method leads to similar or improved performance metrics compared to classical feature selection methods and curated gene sets.

5:20 PM-5:30 PM
mi-faser based partition of the CAMDA 2019 mystery samples in the Metagenomic Forensics Challenge
Room: Boston 1/2 (Ground Floor)
  • Maximilian Miller, Rutgers University, United States
  • Yana Bromberg, Rutgers University, United States
  • Yannick Mahlich, Technical University of Munich, Germany
  • Chengsheng Zhu, Rutgers University, United States

Here we present an analysis of the CAMDA 2019 Metagenomic Forensics Challenge data. Our aim was to predict the geographic origins of so-called mystery location metagenome samples. For this, we partitioned the mystery samples into groups (i.e. cities) and compared their functional fingerprints against metagenome samples of known geographic origin. All samples are whole-genome shotgun sequenced microbiomes extracted from subway systems, a worldwide effort of the MetaSUB project. To this end, we used our mi-faser pipeline to functionally profile all 16 known cities provided in the challenge based on their metagenome samples. We created the same functional profiles for each of the mystery location samples. Those samples belong to cities which were not sampled before. Applying t-Distributed Stochastic Neighbor Embedding (t-SNE) and k-means clustering, we propose a partition of the mystery samples of unknown location into ten sub-groups, i.e. cities. We describe ongoing efforts to further augment the mi-faser results to generate final location estimates for the set of mystery location metagenomes.

5:30 PM-5:40 PM
CAMDA Forensics Challenge: An Evaluation of Mass-Transit, Microbiome Profiles
Room: Boston 1/2 (Ground Floor)
  • Keenan Berry, Saint Louis University, United States
  • Jason Holdener, Saint Louis University, United States
  • Yu Zhan, Saint Louis University, United States
  • Scott Lewis, Saint Louis University, United States
  • Tae-Hyuk Ahn, Saint Louis University, United States

Microbial communities exist in virtually every environment on our planet, including subway transit systems. As a challenge to the global bioinformatics community, the MetaSUB International Consortium has developed a challenge of classifying the location(s) of genetic samples obtained from various surfaces found in the subway systems of 16 global cities. All samples used in our study were provided by the CAMDA MetaSUB consortium for analysis. Quality control of the samples was established using FASTQC. Taxonomic profiles displaying organismal abundance of the samples were generated using MetaPhlAn2. The taxonomic profiles were split into test and training sets, pre-processed, labeled appropriately, and run through machine learning models using Jupyter notebooks and Python 3. Based upon prior research, the random forest machine learning model was implemented for both continent and city sample classification. The best parameters for the model were determined before implementation using random and grid search. Accuracy scores for continent and city prediction were 98% and 95%, respectively. Further details for the CAMDA MetaSUB challenge can be found at the following link - http://metasub.org/.

5:40 PM-5:50 PM
Constructing microbial fingerprint for unraveling city-specific signature and identifying sample origin locations
Room: Boston 1/2 (Ground Floor)
  • Runzhi Zhang, University of Florida, United States
  • Alejandro Walker, University of Florida, United States
  • Susmita Datta, University of Florida, United States

The composition of microbial communities can be location specific, and the differing abundance of taxa within a location could help us to construct a microbial fingerprint for accurately predicting sample origin locations. In this study, whole genome shotgun (WGS) metagenomics data from samples across several cities around the world were used to construct the microbial fingerprint. Principal Component Analysis was used to assess the separation of the cities based on the combined taxa. Appropriate machine learning methods including Random Forest, Support Vector Machine and Linear Discriminant Analysis were used to predict the origin of samples. Raw data were used at first; because of the low sequencing coverage of the London samples, the resulting count of common species, families and orders in the final dataset was enough justification for removing London's samples. This reduced the error rate of each classifier. Analysis of composition of microbiomes (ANCOM-II) was conducted and showed a pattern of differences in microbial composition between cities. The results of this study highlight the importance of the number of taxa, which could be improved by more samples or better sequencing depth.

8:00 PM-11:00 PM
CAMDA dinner: Kohlmanns - essen & trinken
Room: Boston 1/2 (Ground Floor)
Thursday, July 25th
8:30 AM-8:40 AM
CAMDA Welcome
Room: Boston 1/2 (Ground Floor)
  • Joaquin Dopazo, Fundacion Progreso y Salud, Spain
8:40 AM-9:40 AM
Transcriptome Alterations in Cancer: Challenges and Opportunities
Room: Boston 1/2 (Ground Floor)
  • Gunnar Rätsch, ETH Zurich, Switzerland
10:15 AM-10:20 AM
Data Analysis Challenges of the CAMDA Contest 2019 (II)
Room: Boston 1/2 (Ground Floor)
  • Paweł P. Łabaj, Małopolska Centre of Biotechnology of Jagiellonian University, Poland
10:20 AM-10:40 AM
Analyzing cancer through Hipathia: a new insight on cancer signaling pathways
Room: Boston 1/2 (Ground Floor)
  • Sergio Romera-Giner, Principe Felipe Research Center, Spain
  • Marta R. Hidalgo, Centro de Investigación Príncipe Felipe, Spain

Cancer research proves to be a difficult task, mainly because of the multiple variables that underlie this kind of disease. A more biological approach, based on cell signaling pathways to give biological and functional context to gene expression data, may help to understand cancer better. The Hipathia suite, a tool made for this purpose, is being used with gene expression data from three different types of cancer to shed some light on their molecular causes and to reach a better understanding of their hidden mechanisms. Differential pathway activations have been found between healthy and tumor-affected tissues, and interesting similarities between the three types of cancer have also appeared. These insights highlight the importance of signaling pathways in the future of cancer research, and could prove helpful in the future design of diagnostic methods and treatments for this kind of disease.

10:40 AM-11:00 AM
A systematic analysis of multiple cancer studies within a novel enhanced framework for semantic data integration
Room: Boston 1/2 (Ground Floor)
  • Iliyan Mihaylov, Sofia University St. Kliment Ohridski, Faculty of Mathematics and Informatics, Sofia, Bulgaria
  • Maciej Kańduła, Chair of Bioinformatics Research Group, Boku University Vienna, Poland
  • Dimitar Vassilev, Sofia University St. Kliment Ohridski, Faculty of Mathematics and Informatics, Sofia, Bulgaria

Integrative approaches to cancer data analysis remain an active field of research, where effective integration of heterogeneous data sources, such as clinical, morphologic and molecular data, is becoming crucial for subtyping and treating cancer. Moreover, measurements need to be combined not only across patients but also across assay types, i.e. both horizontally and vertically. This makes integration a complex problem. Importantly, management of knowledge, data accessibility and usability, and the lack of standards and common interfaces are also well recognized challenges in bioinformatics.

We develop a computational framework for integration of heterogeneous data, where relations between structurally unrelated data sources are inferred both from the data themselves and from additional external sources, seamlessly facilitating knowledge discovery. We develop an enhanced novel universal predictive parameter for survival time prediction in cancer patients, focusing here on the TCGA cancer data sets. Our framework applies multiple machine learning regression-based models and incorporates cross-validation methodologies for effective benchmarking.

11:00 AM-11:40 AM
A sparse Bayesian factor model for the construction of gene co-expression networks from single-cell RNA sequencing count data
Room: Boston 1/2 (Ground Floor)
  • Susmita Datta, University of Florida, United States
  • Michael Sekula, University of Louisville, United States
  • Jeremy Gaskins, University of Louisville, United States

Gene co-expression networks (GCNs) are powerful tools that enable biologists to examine associations between genes during different biological processes. With the advancement of new technologies, such as single-cell RNA sequencing (scRNA-seq), there is a need for developing novel network methods appropriate for new types of data. Here, we present a novel sparse Bayesian factor structure to explore the network structure associated with genes in scRNA-seq data. Latent factors impact the gene expression values for each cell and provide flexibility to account for the most common features of scRNA-seq: high proportions of zero values, increased cell-to-cell variability, and overdispersion due to abnormally large expression counts. From our model, we construct a GCN by analyzing the positive and negative associations of the factors that are shared between each pair of genes. Results from simulation studies and real data analysis demonstrate the performance of our methodology in constructing GCNs.

11:40 AM-12:00 PM
Benchmarking scRNA-seq clustering methods using multi-parameter ensembles of simulated data and workflows
Room: Boston 1/2 (Ground Floor)
  • Xianing Zheng, University of Michigan, United States
  • Jun Z. Li, University of Michigan - Ann Arbor, United States

Single-cell RNA sequencing (scRNA-seq) has emerged as a powerful technology for surveying cell types and state transitions. Today, >350 tools have appeared to address >30 scRNA-seq tasks. However, the community still struggles to identify the best workflow for any given task, including clustering. Benchmarking studies to date have relied on real datasets, using published methods at their default settings. We expanded these efforts by creating an ensemble of truth-known simulated datasets and testing them across many parameter combinations of existing methods. We produced scRNA-seq count matrices by systematically altering five parameters: cluster size, cluster distance, cell library size, cell library size variability, and dropout rate, creating a full combination of 1,024 datasets. We evaluated 15 clustering methods, each with three gene selection methods and five k values, for 225 workflows, and 1,024 × 225 = 230,400 runs. Performance variation over the five parameters revealed strengths/weaknesses of individual methods/workflows. SC3 and Seurat performed well for most of the datasets. RaceID2 was sensitive to the true cluster distance; SIMR and RaceID2 performed well only for moderate-to-high library sizes. The ability to benchmark algorithmic choices using simulations that cover a wide parameter space is essential in developing customized pipelines for each real study.
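
The run count in this abstract can be checked by enumerating the full factorial design. The sketch below is only a bookkeeping illustration: the abstract does not say how many levels each simulation parameter takes, so four levels per parameter (4^5 = 1,024) is an assumption, and the integer indices stand in for the actual methods and settings.

```python
from itertools import product

PARAMETERS = [
    "cluster size",
    "cluster distance",
    "cell library size",
    "cell library size variability",
    "dropout rate",
]
LEVELS_PER_PARAMETER = 4  # assumption: chosen because 4 ** 5 == 1024

# Every combination of parameter levels defines one simulated dataset.
datasets = list(product(range(LEVELS_PER_PARAMETER), repeat=len(PARAMETERS)))

# Each workflow is (clustering method, gene selection method, k value);
# the indices are placeholders for the 15, 3 and 5 options in the abstract.
workflows = list(product(range(15), range(3), range(5)))

total_runs = len(datasets) * len(workflows)
print(len(datasets), len(workflows), total_runs)
```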

12:00 PM-12:20 PM
Proceedings Presentation: Comprehensive evaluation of transcriptome-based cell-type quantification methods for immuno-oncology
Room: Boston 1/2 (Ground Floor)
  • Gregor Sturm, Chair of Experimental Bioinformatics, TUM School of Life Sciences Weihenstephan, Technical University of Munich, Germany
  • Francesca Finotello, Medical University of Innsbruck, Austria
  • Florent Petitprez, Ligue Nationale Contre le Cancer, France
  • Jitao David Zhang, Roche Innovation Center Basel, F. Hoffmann-La-Roche AG, Switzerland
  • Jan Baumbach, Technical University of Munich, Germany
  • Wolf H. Fridman, Cordeliers Research Centre, UMRS_1138, INSERM, University Paris-Descartes, Sorbonne University, Paris, France
  • Markus List, Technical University of Munich, Germany
  • Tatsiana Aneichyk, Pieris Pharmaceuticals GmbH, Lise-Meitner-Straße 30, 85354 Freising, Germany

Motivation: The composition and density of immune cells in the tumor microenvironment profoundly influence tumor progression and the success of anti-cancer therapies. Flow cytometry, immunohistochemistry staining, and single-cell sequencing are often unavailable, so we rely on computational methods to estimate the immune-cell composition from bulk RNA-sequencing (RNA-seq) data. Various methods have been proposed recently, yet their capabilities and limitations have not been evaluated systematically. A general guideline leading the research community through cell type deconvolution is missing.

Results: We developed a systematic approach for benchmarking such computational methods and assessed the accuracy of tools at estimating nine different immune- and stromal cells from bulk RNA-seq samples. We used a single-cell RNA-seq dataset of ~11,000 cells from the tumor microenvironment to simulate bulk samples of known cell type proportions, and validated the results using independent, publicly available gold-standard estimates. This allowed us to analyze and condense the results of more than a hundred thousand predictions to provide an exhaustive evaluation across seven computational methods over nine cell types and ~1,800 samples from five simulated and real-world datasets. We demonstrate that computational deconvolution performs at high accuracy for well-defined cell-type signatures and propose how fuzzy cell-type signatures can be improved. We suggest that future efforts should be dedicated to refining cell population definitions and finding reliable signatures.

Availability: A snakemake pipeline to reproduce the benchmark is available at https://github.com/grst/immune_deconvolution_benchmark. An R package allows the community to perform integrated deconvolution using different methods (https://grst.github.io/immunedeconv).

12:20 PM-12:40 PM
Proceedings Presentation: PRECISE: A domain adaptation approach to transfer predictors of drug response from pre-clinical models to tumors
Room: Boston 1/2 (Ground Floor)
  • Soufiane Mourragui, Delft University of Technology and the Netherlands Cancer Institute, Netherlands
  • Marco Loog, TU Delft and University of Copenhagen, Netherlands
  • Mark van de Wiel, VUmc Amsterdam, Netherlands
  • Marcel Reinders, TU Delft and Leiden University Medical Center, Netherlands
  • Lodewyk Wessels, The Netherlands Cancer Institute, Netherlands

Motivation: Cell lines and patient-derived xenografts (PDX) have been used extensively to understand the molecular underpinnings of cancer. While core biological processes are typically conserved, these models also show important differences compared to human tumors, hampering the translation of findings from pre-clinical models to the human setting. In particular, employing drug response predictors generated on data derived from pre-clinical models to predict patient response remains a challenging task. As very large drug response datasets have been collected for pre-clinical models, and patient drug response data is often lacking, there is an urgent need for methods that efficiently transfer drug response predictors from pre-clinical models to the human setting.

Results: We show that cell lines and PDXs share common characteristics and processes with human tumors. We quantify this similarity and show that a regression model cannot simply be trained on cell lines or PDXs and then applied on tumors. We developed PRECISE, a novel methodology based on domain adaptation that captures the common information shared amongst pre-clinical models and human tumors in a consensus representation. Employing this representation, we train predictors of drug response on pre-clinical data and apply these predictors to stratify human tumors. We show that the resulting domain-invariant predictors show a small reduction in predictive performance in the pre-clinical domain but, importantly, reliably recover known associations between independent biomarkers and their companion drugs on human tumors.

2:00 PM-2:40 PM
Evaluation of Connectivity Map shows limited reproducibility in drug repositioning
Room: Boston 1/2 (Ground Floor)
  • Nathaniel Lim, The University of British Columbia, Canada
  • Paul Pavlidis, The University of British Columbia, Canada

The Connectivity Map (CMap) is a widely used resource enabling data-driven drug repositioning using a large compendium of gene expression profiles. However, evaluations of its performance are limited. We took advantage of the availability of two iterations of CMap (CMap 1 and CMap 2) to assess their comparability and reliability. First, we queried CMap 2 with 28 drug signatures derived from CMap 1, hypothesizing that CMap 2 would highly prioritize the same drugs. We found that CMap 2 succeeded only 2/28 times. In a similar analysis, CMap 2 was unable to replicate previously published drug prioritization recommendations that were originally obtained using CMap 1 from three studies. Next, we compared the similarity of individual differential expression profiles for the same conditions between both CMap versions (109 profiles), and also compared a third dataset (De Abrew et al. (2016), 12 compounds). We found that the profiles were highly dissimilar among all data sets (mean correlation < 0.1), likely explaining the limited reproducibility of prioritization. Because of the general lack of consistency, it is unclear which CMap version is more reliable. Our findings have implications for the use of CMap and suggest steps investigators can take to limit false positives.
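
The profile comparison in this abstract amounts to correlating matched differential expression profiles and averaging the result. The sketch below illustrates that idea only: the abstract does not state which correlation measure was used, so Pearson correlation is an assumption, and the function names, condition labels and values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def mean_profile_correlation(profiles_a, profiles_b):
    """Average correlation over the conditions present in both datasets.

    profiles_a, profiles_b: dicts mapping a condition label to a list of
    per-gene differential expression values (same gene order in both).
    """
    shared = sorted(set(profiles_a) & set(profiles_b))
    return sum(pearson(profiles_a[c], profiles_b[c]) for c in shared) / len(shared)


# Made-up profiles for two hypothetical conditions in two dataset versions.
version_1 = {"drug_A": [1.0, 2.0, 3.0], "drug_B": [1.0, 2.0, 3.0]}
version_2 = {"drug_A": [2.0, 4.0, 6.0], "drug_B": [3.0, 2.0, 1.0]}
toy_mean = mean_profile_correlation(version_1, version_2)
```

In this toy case one condition agrees perfectly and the other is perfectly reversed, so the mean is near zero, which shows how a low average can hide very different per-condition behavior.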

2:40 PM-2:50 PM
Contest voting and summary
Room: Boston 1/2 (Ground Floor)
2:50 PM-3:20 PM
Retrospective and Outlook: Metagenomic Forensic Challenge
Room: Boston 1/2 (Ground Floor)
  • Paweł P. Łabaj, Małopolska Centre of Biotechnology of Jagiellonian University, Poland
3:20 PM-4:00 PM
Retrospective and Discussion: Read-level Data Anonymization
Room: Boston 1/2 (Ground Floor)
  • Andre Kahles, ETH Zurich, Switzerland
4:00 PM-4:20 PM
Discussion & outlook - CAMDA
Room: Boston 1/2 (Ground Floor)
  • Wenzhong Xiao, Stanford and Harvard Medical School, United States
4:20 PM-4:40 PM
Awards and Closing - CAMDA
Room: Boston 1/2 (Ground Floor)
  • Julia E Vogt, ETH Zurich, Switzerland
  • David P. Kreil, Chair of Bioinformatics, Boku University Vienna, Austria