Leading Professional Society for Computational Biology and Bioinformatics
Connecting, Training, Empowering, Worldwide

CAMDA: Critical Assessment of Massive Data Analysis

COSI Track Presentations

Attention Presenters - please review the Speaker Information Page available here
Schedule subject to change
Saturday, July 7th
10:15 AM-10:20 AM
Data Analysis Challenges of the CAMDA Contest 2018
Room: Columbus GH
  • David P. Kreil, Boku University, Vienna, Austria

Welcome, introduction, and overview of this year's CAMDA contests and sessions.

10:20 AM-11:20 AM
Why You Should Care (A Lot) About Real World Evidence
Room: Columbus GH
  • Lawrence J. Lesko, University of Florida, United States

Healthcare problems need technology solutions. Clinicians expect to understand how a new therapeutic intervention will perform in the real world – the actual medical practice setting outside the controlled environment of randomized controlled clinical trials. Real world data (RWD) comes from many sources, for example: electronic health records and patient chart reviews, off-label use of medications, adverse event databases, health monitoring devices, patient registries, pragmatic clinical trials and payer claims databases. Many researchers, corporations and government institutions, including the Food and Drug Administration, are eager to tap into these RWD streams in order to answer questions about the efficacy and safety of medicines. RWD, in and of itself, is meaningless until critical context is added, which transforms RWD into useful information. Technology, in turn, can analyze this information to yield real world evidence (RWE) that can be used to answer carefully framed scientific and clinical questions. This presentation will show how technology and RWE can be used to support decision-making, with an emphasis on opportunities in drug development and regulatory science.

11:20 AM-11:40 AM
The CMap Drug Safety Challenge
Room: Columbus GH
  • Weida Tong, National Center for Toxicological Research, United States
  • Shraddha Thakkar, NCTR, FDA, United States

The CMap Drug Safety Challenge presents clinical toxicity results and gene expression responses to hundreds of drugs. This lets scientists compare or integrate responses of multiple cell lines, and predict drug-induced liver injury in humans.

11:40 AM-12:20 PM
IS-DILI: Drug-Induced Liver Injury Inference in Big Data era
Room: Columbus GH
  • Leihong Wu, NCTR, FDA, United States
  • Zhichao Liu, National Center for Toxicological Research, United States
  • Shraddha Thakkar, NCTR, FDA, United States
  • Weida Tong, National Center for Toxicological Research, United States

Under the umbrella of the CMap Drug Safety Challenge, we developed models to understand or predict drug-induced liver injury in humans from cell-based screens, and conducted a comprehensive analysis of model performance. In detail, we provide a standardized pipeline that converts the original microarray data (i.e., in .cel format) into a normalized data matrix (.txt or .RData format), which can be used directly as input for current machine learning environments such as MATLAB, R, and Python. Using the normalized data matrix, we built several predictive models with common algorithms such as K-Nearest Neighbor (KNN), Random Forest (RF), Support Vector Machine (SVM), and Linear Discriminant Analysis (LDA). Additionally, we tested model performance both in-platform (training and testing datasets from the same cell line) and cross-platform (training and testing datasets from different cell lines).
Our study provides a fundamental view of DILI prediction with cell-based screens and demonstrates that a cell-based screen model performs at least comparably to conventional, structure-based DILI models. This study serves as the baseline of the challenge, to see whether more “state-of-the-art” algorithms from the competitors achieve better performance.

12:20 PM-12:40 PM
Predicting Drug Induced Liver Injury from gene expression profiles in cancer cell lines using machine learning algorithms.
Room: Columbus GH
  • Wojciech Lesinski, Institute of Informatics, University of Białystok, Poland
  • Krzysztof Mnich, University of Białystok, Poland
  • Witold R. Rudnicki, Institute of Informatics and Computational Centre, University of Białystok; ICM, University of Warsaw, Poland

Motivation:
Drug-induced liver injury (DILI) is one of the primary problems in drug development. In this work we examined whether the occurrence of DILI can be predicted from gene expression profiles in cancer cell lines.
Methods:
We built predictive models with supervised machine learning algorithms, using gene expression as descriptor variables. The expression data covered two cancer cell lines (MCF7 and PC3) after exposure to active compounds and vehicle.
To this end we first identified informative variables and then built two kinds of classification models, a global random forest and a local k-NN based on correlation between observations, for two problems: discerning active compound from vehicle, and discerning DILI-inducing compounds from harmless ones.
Models were built and tested in a fully cross-validated manner, with feature selection performed inside the cross-validation loop.
Results:
We obtained weakly predictive models (AUC=0.6, MCC=0.3) for discerning between samples exposed to active compound and vehicle. These models were robust and transferable between cell lines. Global models for prediction of DILI were no better than a random guess (AUC=0.5, MCC=0.0). Local k-NN models gave better predictions (AUC=0.6, MCC=0.15).
Apparently, the GE response is only weakly transferable between different compounds, and cancer cell lines are not a good model for DILI.
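Several abstracts in this track report the Matthews correlation coefficient (MCC). As a reading aid, the following is a minimal sketch of how MCC is computed from a binary confusion matrix. The function name and the example counts are illustrative assumptions, not values from the study.

```python
import math

def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary confusion matrix."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention: report 0 when any marginal total is zero.
        return 0.0
    return (tp * tn - fp * fn) / denom

# Hypothetical counts, chosen only to illustrate three regimes.
perfect = matthews_corrcoef(tp=50, tn=50, fp=0, fn=0)    # every call correct
chance = matthews_corrcoef(tp=25, tn=25, fp=25, fn=25)   # no better than guessing
weak = matthews_corrcoef(tp=30, tn=30, fp=20, fn=20)     # slightly better than guessing
```

MCC ranges from -1 to 1, with 0 corresponding to chance-level prediction, which is why it is often preferred over accuracy when classes are imbalanced.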

12:40 PM-2:00 PM
Lunch Break
2:00 PM-2:40 PM
Deep Learning for drug-induced liver injury prediction
Room: Columbus GH
  • Marco Chierici, Fondazione Bruno Kessler, Italy
  • Nicole Bussola, Fondazione Bruno Kessler, Italy
  • Margherita Francescatto, Fondazione Bruno Kessler, Italy
  • Cesare Furlanello, Fondazione Bruno Kessler, Italy

Drug-induced liver injury (DILI) is a major concern in drug development, as hepatotoxicity may not be apparent at early stages. It is therefore crucial to identify gene expression signatures that could predict DILI. The CAMDA 2018 CMap Drug Safety challenge provides an opportunity to develop statistical machine learning approaches for predicting DILI from CMap gene expression profiles of two different cancer cell lines (MCF7 and PC3) measured on Affymetrix GeneChip arrays. In the context of this challenge, we explored the application of Deep Learning and other machine learning architectures to DILI prediction on the CMap screens. We used either the log-fold change of gene expression of treated vs vehicle, or the log2 expression of treated samples directly. The deep learning models were compared with standard classifiers (random forests, LSVMs, KNNs) within the MAQC Data Analysis Plan, i.e. 10x5 CV over the training set, stratified over classes, yielding an internal MCC in the interval [0.00 - 0.18]. Further, we simulated the validation task with a nested 80-20 split, iterated 100 times, obtaining an average MCC < 0.1.
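The "10x5 CV, stratified over classes" scheme mentioned above can be sketched roughly as follows. This is a plain-Python illustration under stated assumptions: the function name, the seed handling and the toy label vector are hypothetical, and the authors' actual pipeline is not described in the abstract.

```python
import random
from collections import defaultdict

def repeated_stratified_folds(labels, n_repeats=10, n_folds=5, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) with class-balanced folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    for repeat in range(n_repeats):
        folds = [[] for _ in range(n_folds)]
        for label in sorted(by_class):
            members = by_class[label][:]
            rng.shuffle(members)
            # Deal shuffled members round-robin so each fold gets its share.
            for pos, idx in enumerate(members):
                folds[pos % n_folds].append(idx)
        for fold, test_idx in enumerate(folds):
            held_out = set(test_idx)
            train_idx = [i for i in range(len(labels)) if i not in held_out]
            yield repeat, fold, train_idx, sorted(test_idx)

# Hypothetical imbalanced label vector: 20 positives, 80 negatives.
labels = [1] * 20 + [0] * 80
splits = list(repeated_stratified_folds(labels))
```

Each of the 10 repeats reshuffles the samples before dealing them into 5 folds, so the 50 resulting train/test splits all preserve the class ratio of the full training set.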

2:40 PM-3:00 PM
An Ensemble Approach to Predicting Drug-induced Liver Injury based on RNA Expression Levels
Room: Columbus GH
  • Glen Rex Sumsion, Brigham Young University, United States
  • Michael S Bradshaw III, Brigham Young University, United States
  • Jeremy T Beales, Brigham Young University, United States
  • Emi Ford, Brigham Young University, United States
  • Griffin R. G. Caryotakis, Brigham Young University, United States
  • Daniel J Garrett, Brigham Young University, United States
  • Emily D. Lebaron, Brigham Young University, United States
  • Ifeanyichukwu O. Nwosu, Brigham Young University, United States
  • Stephen Piccolo, Brigham Young University, United States

Drug-induced liver injury (DILI) is a serious concern during drug development. DILI is characterized by elevated levels of alanine aminotransferase, and it can ultimately result in patient death in serious cases. Evidence suggests that reactive drug metabolites play a role in initiating DILI. To quantify the effects of drugs on human cells in general, the Connectivity Map (CMap, build 02) data set measures drugs’ impact on RNA expression in cancer cell lines. In this paper we outline an attempt to use CMap data before and after drug exposure to predict whether specific drugs in CMap cause hepatic injury. First, we applied seven classification algorithms independently. None of these algorithms predicted liver injury on a consistent basis with high accuracy. In an attempt to improve accuracy, we aggregated predictions for six of the algorithms (excluding one that had performed exceptionally poorly) using a soft-voting ensemble method. This approach improved our results in some cases for the training set but failed to generalize well to the test set. We conclude that more robust methods and/or datasets will be necessary to effectively predict drug-induced liver injury based on RNA expression levels in cell lines.
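For readers unfamiliar with the term, soft voting averages the class probabilities predicted by the individual models and thresholds the average. The sketch below assumes equal model weights and a 0.5 threshold; the function name and the probabilities are hypothetical and do not come from the study.

```python
def soft_vote(probabilities, threshold=0.5):
    """Average per-model positive-class probabilities, then threshold.

    probabilities: one list per model, each holding P(DILI) per sample.
    """
    n_models = len(probabilities)
    n_samples = len(probabilities[0])
    if any(len(p) != n_samples for p in probabilities):
        raise ValueError("every model must score the same samples")
    averaged = [
        sum(model[i] for model in probabilities) / n_models
        for i in range(n_samples)
    ]
    calls = [int(p >= threshold) for p in averaged]
    return averaged, calls

# Hypothetical outputs of three classifiers for four compounds.
model_probs = [
    [0.9, 0.4, 0.2, 0.6],
    [0.7, 0.6, 0.1, 0.4],
    [0.8, 0.2, 0.3, 0.2],
]
averaged, calls = soft_vote(model_probs)
```

Unlike hard (majority) voting, this lets a confident model outweigh several uncertain ones, which helps only if the individual probability estimates are reasonably calibrated.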

3:00 PM-3:20 PM
Predicting Drug Induced Liver Injury Through Combined Genomics Indicator and Ensemble Machine Learning Approaches
Room: Columbus GH
  • Zhixiu Lu, University of South Dakota, United States
  • Miyuraj Harishchandra Hikkaduwa Withanage, University of South Dakota, United States
  • Erliang Zeng, University of South Dakota, United States

Conventional toxicity assessment is usually conducted using indicators such as pathology and clinical chemistry data, which can detect only around 60% of drug-induced liver injury (DILI) cases in preclinical studies. The agreement between studies on animal models and human clinical trials is often poor. In this paper, we designed a computational framework to predict DILI for drug compounds from cell-based CMap gene expression responses of two different cancer cell lines (MCF7 and PC3). The framework uses ensemble feature selection methods to first identify candidate discriminative genomic indicators, and then evaluates the performance of those indicators with ensemble classifiers. Finally, the inherent connections among the identified genomic indicators are explored using network analysis. The network analysis reveals redundancy among genomic indicators and helps identify an optimal set of non-redundant indicators, which in turn improves DILI prediction. We use the ROC (receiver operating characteristic) curve and AUC (area under the curve) to comprehensively evaluate our methods. The cross-validation results show that our method can achieve high AUCs, indicating the effectiveness of our framework.

3:20 PM-4:00 PM
Metrics for clinically relevant characterization of patient stratification.
Room: Columbus GH
  • Maciej M. Kandula, Boku University, Vienna, Austria
  • David P. Kreil, Boku University, Vienna, Austria

The identification and interpretation of biologically relevant patterns in genome-scale data remains the rate-limiting bottleneck in the translation of experimental advances to the clinic. A lot of hope is now placed in the combination of measurements from complementary sources. Vertical data integration works on matching profiles of different data types collected for each patient. In addition, network representations have recently been explored for horizontal data integration to exploit not just the complementary nature of the data sources but also similarities across patients. We have previously developed a novel network-based approach for the integration of multiple molecular and clinical data types that also incorporates prior knowledge from curated databases. We first demonstrated remarkable performance in patient stratification for survival analysis in neuroblastoma patients. Building on this work, extended analyses show that while the algorithm performs disappointingly for the METABRIC breast cancer cohort, it performs strongly for several TCGA patient cohorts, including breast cancer data sets. Here we further characterize alternative patient cohorts for the same disease, and investigate how differences in the cohort structures, especially as reflected in the inferred inter-patient networks, affect algorithm performance.

4:00 PM-4:40 PM
Coffee Break
4:40 PM-4:50 PM
Short highlight talks
Room: Columbus GH
  • Paweł P. Łabaj, MCB UJ, Kraków, Poland & Austrian Academy of Sciences, Vienna, Austria

An introduction to the short highlight talk format and subsequent discussion session.

4:50 PM-5:00 PM
Connecting Expression Profiles in Cancer Cell Lines with Drug Induced Liver Injury
Room: Columbus GH
  • Anish Datta, Indian Institute of Engineering Science and Technology, Shibpur, India
  • Losiana Nayak, Indian Statistical Institute, India
  • Malay Bhattacharyya, Indian Institute of Engineering Science and Technology, India

Connecting the missing links between drugs, genes and diseases has gained a lot of interest among researchers in recent years, owing to the availability of large-scale data from cell-based screens in response to various drugs. With an explicit focus on drug-induced liver injury (DILI), we study data from the Broad Institute Connectivity Map to learn the effect of drugs on gene expression profiles. A variety of features, mainly related to expression profiles, is collected for this purpose from the MCF7 and PC3 cancer cell lines. By applying robust supervised learning approaches powered by feature selection, we show that expression profiles in cancer cell lines can be connected with DILI for the purpose of prediction. We adopt a random under-sampling approach to handle the data imbalance and obtain a prediction accuracy close to 75% for each of the two cell lines taken separately.
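Random under-sampling, as mentioned above, balances the classes by discarding randomly chosen majority-class samples. A minimal sketch follows; the function name, the seed and the toy compound labels are assumptions for illustration only.

```python
import random

def random_undersample(samples, labels, seed=0):
    """Drop random majority-class items until all classes are equally sized."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    # Every class is cut down to the size of the smallest one.
    target = min(len(items) for items in by_class.values())
    balanced = []
    for label, items in by_class.items():
        for sample in rng.sample(items, target):
            balanced.append((sample, label))
    rng.shuffle(balanced)
    return balanced

# Hypothetical data: 30 DILI-positive and 70 DILI-negative compounds.
samples = [f"compound_{i}" for i in range(100)]
labels = ["DILI"] * 30 + ["no_DILI"] * 70
balanced = random_undersample(samples, labels)
```

On a balanced two-class set the chance-level accuracy is 50%, which is the baseline against which a reported accuracy should be read. Under-sampling is normally applied to the training portion only, so that test performance reflects the original class distribution.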

5:00 PM-5:10 PM
Multi-view feature selection for multi-omics data integration and its application in cancer survival prediction
Room: Columbus GH
  • Yasser El-Manzalawy, The Pennsylvania State University, United States
  • Mostafa Abbas, Qatar Computing Research Institute, Qatar
  • Thanh Le, The Pennsylvania State University, United States
  • Vasant Honavar, The Pennsylvania State University, United States

Large-scale collaborative precision medicine initiatives are yielding rich multi-omics data. Integrative analyses of the resulting multi-omics data, such as somatic mutation, copy number alteration (CNA), and gene expression, offer the tantalizing possibility of realizing the potential of precision medicine in cancer prevention, diagnosis, and treatment by substantially improving our understanding of underlying mechanisms as well as the discovery of novel biomarkers for different types of cancers. However, several challenges arise in the integrative analysis of such data, including the heterogeneity and high dimensionality of the omics data. In this study, we evaluate two multi-view feature selection algorithms, developed by our group, using the CAMDA Cancer Data Integration datasets. Specifically, we benchmark the performance of different cancer survival predictors developed using some state-of-the-art machine learning algorithms that are trained using: i) a single omics data source; ii) multi-view integrated data; and iii) baseline data fusion approaches. Our results demonstrate the viability of multi-view feature selection for multi-omics data integration and lay the groundwork for developing effective multi-omics data integration models using multi-view feature selection.

5:10 PM-5:20 PM
Data integration and survival time prediction models in cancer studies
Room: Columbus GH
  • Iliyan Mihaylov, Sofia University "St. Kliment Ohridski", Faculty of Mathematics and Informatics, Bulgaria
  • Dimitar Vassilev, Sofia University "St. Kliment Ohridski", Faculty of Mathematics and Informatics, Bulgaria

Our horizontal and vertical data integration for two types of cancer, neuroblastoma and breast cancer, is based on different NoSQL databases, chosen for good data management without loss of information. We used the schema-less and compact document-based MongoDB to store all records and their attributes from the raw data. The graph-based Neo4j DB is used to manage all complicated relationships (such as mutations, CNA, and expression).
Using a k-nearest-neighbour classifier, we defined groups based on a novel complex trait that includes tumour size, tumour stage and age at diagnosis. This data integration approach and the subsequent classification contributed to the determination of common groups of mutated proteins related to both types of cancer.
The novel complex trait, combined with survival time and mutation identifiers, is used in several machine learning models with different feature selection methods for optimal survival time prediction. We validate the results using random smaller subsets of both raw and integrated data. The applied machine learning methods provide mechanisms for assessing cancer progression and predicting survival time.

5:20 PM-5:30 PM
Clinical and molecular markers for the prediction of clinical endpoints in breast cancer patients
Room: Columbus GH
  • Aneta Polewko-Klim, Institute of Informatics, University of Białystok, Poland
  • Witold R. Rudnicki, Institute of Informatics and Computational Centre, University of Białystok; ICM, University of Warsaw, Poland

Motivation:
The goal of the study is to build predictive models for the clinical endpoints of breast cancer patients (CEBCP) using clinical descriptors (CD) and molecular data from gene expression (GE) and Copy Number Alteration (CNA) assays from the METABRIC dataset.
Methods:
To this end we first used different feature selection methods to identify informative variables in the clinical and molecular datasets.
Then the predictive models were built separately for each dataset using the Random Forest classification algorithm. Finally, we tested different methods of combining information from the different datasets.
Results:
All 17 clinical covariates were found to be relevant and non-redundant. Among the molecular descriptors, there are hundreds of relevant variables in the GE dataset, but only a handful in the CNA dataset. For individual datasets, the best models were built on CD (MCC=0.33), followed by GE (MCC=0.25) and CNA (MCC=0.19). The best results for the integration of clinical and molecular data were achieved when the clinical data were augmented by the fraction of votes for a negative CEBCP from models built on molecular data. The best model was built using CD+GE+CNA (MCC=0.37), followed by CD+GE (MCC=0.36) and CD+CNA (MCC=0.35).

5:30 PM-5:40 PM
Metagenomic fingerprints reveal geographic origin of biological samples collected in mass-transit areas
Room: Columbus GH
  • Marco Chierici, Fondazione Bruno Kessler, Italy
  • Gabriele Franch, Fondazione Bruno Kessler, Italy
  • Giuseppe Jurman, Fondazione Bruno Kessler, Italy
  • Cesare Furlanello, Fondazione Bruno Kessler, Italy

The MetaSUB consortium is building a unique database of whole-genome shotgun metagenomics data sampled in mass-transit areas worldwide. We propose to analyze the CAMDA MetaSUB Forensic dataset with the Ph-CNN deep learning architecture, recently developed for classifying metagenomics data (Fioravanti D et al, BMC Bioinformatics, 2018). The Ph-CNN architecture is based on Convolutional Neural Networks and the patristic distance computed on the phylogenetic tree as an appropriate metrization embedding. The approach is applied to identify metagenomic fingerprints that detect the geographic location, based on features quantified by MetaPhlAn2. The impact of sample multiplicity and surface type is also considered. The novel Uniform Manifold Approximation and Projection (UMAP) method is also applied as a non-linear dimensionality reduction algorithm and compared to t-SNE for visualization, to explore the global structure of the data without computational restrictions on the embedding dimension.

5:40 PM-5:50 PM
Deciphering bacterial signatures from WGS metagenomics data from multiple subway stations through MetaSUB International Consortium
Room: Columbus GH
  • Susmita Datta, University of Florida, United States
  • Alejandro Walker, University of Florida, United States

Metagenomic whole genome sequencing (WGS) data from samples across several cities around the globe are key to unravelling city-specific signatures of microbes, and can consequently be beneficial for forensic analysis. Illumina sequence data were provided from 8 cities in 6 different countries as part of the 2018 CAMDA “MetaSUB Forensic Challenge”. We will use appropriate machine learning techniques on this massive dataset to effectively identify the geographical provenance of additional “mystery” samples. Additionally, we will pursue compositional data analysis techniques to develop accurate inferential techniques for such microbiome data. Learning from the CAMDA 2017 MetaSUB challenge data, we expect that the current data, with higher quality and greater sequencing depth, combined with improved analytical techniques, will yield many more interesting, robust and useful results for forensic analysis.

5:50 PM-6:00 PM
Massive Metagenomic Data Analysis using Abundance-Based Machine Learning
Room: Columbus GH
  • Zachary Harris, Saint Louis University, United States
  • Eliza Dhungel, Saint Louis University, United States
  • Matthew Mosior, Saint Louis University, United States
  • Tae-Hyuk Ahn, Saint Louis University, United States

Appropriate handling of large-scale metagenomics analyses remains an open problem in both biological and computational spaces. On one hand, many tools already exist for profiling community abundance and many are optimized to work for large data sets. On the other hand, these approaches often preclude novel discoveries because they rely on precompiled databases. Other open problems include identifying proper clusters within data and even the basic prediction of metadata traits concerning the samples at hand. The MetaSUB Forensics Challenge provides an ideal opportunity to assess current tools and methodologies on these problems. Specifically, MetaSUB presents two major challenges: 1) prediction of location and substrate of sampling and 2) identifying clusters in unlabeled data. To answer these challenges, we have constructed an entirely de novo metagenome assembly analysis based on a subset of available data. These assembled data are then used as the basis for community profiling in the complete data set, and those profiles are used as machine learning features for location and substrate predictions. Moreover, we plan to use a similar framework for the clustering of entirely unknown samples to infer the number of samples provided as the rest of the data become available.

Sunday, July 8th
10:15 AM-10:20 AM
Data Analysis Challenges of the CAMDA Contest 2018. (II)
Room: Columbus GH
  • Joaquin Dopazo, Fundacion Progreso y Salud, Spain

Welcome, introduction, and overview of this year's CAMDA contests and sessions. (II)

10:20 AM-11:20 AM
Computational analysis of short and long microbiome sequencing reads - Keynote
Room: Columbus GH
  • Daniel H. Huson, University of Tuebingen, Germany

Microbiome research is the study of microbes in their theater of activity. It has many applications in medicine, biotechnology, crop sciences, and other areas. The analysis of microbiome sequencing data poses a wide range of computational challenges. While much early work has focused on taxonomic profiling based on 16S rRNA amplicon sequencing data, many projects now apply shotgun sequencing to collect more detailed information from samples.
A typical microbiome sequencing project may involve a few billion short reads obtained by an Illumina HiSeq 3000, say. How can a detailed computational analysis of such a large dataset be performed in a reasonable amount of time on a modest server? A key question is which genes are present in a sample. A comprehensive answer can be obtained by aligning all reads against the NCBI-nr protein reference database, which is possible using the high-throughput alignment tool DIAMOND (Buchfink, Xie and Huson, 2015). A program called Meganizer can be applied to DIAMOND's output files so as to perform taxonomic and functional binning of all reads, and to index all results. The resulting meganized DIAMOND files can then be opened and explored in MEGAN, an interactive microbiome sequence analysis program (Huson et al, 2007, 2011, 2016).
There is increasing interest in sequencing microbiome samples using long read technologies as provided by Oxford Nanopore or PacBio. Microbiome analysis tools designed for the analysis of short reads need to be adapted to long reads. We have developed a number of new algorithms for long read (and contig) analysis that are implemented in MEGAN and we refer to these extensions as MEGAN-LR (CAMDA 2017, Huson et al, 2018). Similarly, the DIAMOND alignment tool has recently been extended to support the alignment of long and error-prone reads. We will discuss the details of some of the new algorithms and will provide examples of their applications.

11:20 AM-12:00 PM
Functional biomarkers for precise simple classification in the MetaSUB Forensic Challenge
Room: Columbus GH
  • Carlos Sanchez, Fundacion Progreso y Salud, Spain
  • Javier Perez Florido, Fundacion Progreso y Salud, Spain
  • Carlos Loucera, Fundacion Progreso y Salud, Spain
  • Joaquin Dopazo, Fundacion Progreso y Salud, Spain

The availability of hundreds of city microbiome profiles allows the development of predictors of origin based on microbiota composition and other related factors. Here we explore the use of profiles of functional activity not only to predict the putative origin of a sample but also to study the biogeography of the microbiota global metabolism and functionalities.

12:00 PM-12:20 PM
mi-faser deciphers city subway microbiome functional fingerprints
Room: Columbus GH
  • Chengsheng Zhu, Rutgers University, United States
  • Maximilian Miller, Rutgers University, United States
  • Nick Lusskin, Rutgers University, United States
  • Yannick Mahlich, Technical University of Munich, Germany
  • Yana Bromberg, Rutgers University, United States

Molecular functionality of microbial communities is often assessed via meta-genomic/-transcriptomic sequencing. We recently created mi-faser, a computational method for super fast (minutes-per-microbiome) and accurate (90% precision) mapping of sequencing reads to molecular functions of the corresponding genes. The method is augmented by a manually curated reference database. Comparing microbiome function profiles between different environments, we identified oil degradation-specific functions in BP oil-spill data, functional signatures of individual-specific gut microbiome responses to a dietary intervention, and Crohn's Disease patient gut microbiome functional differences from microbiomes of related healthy individuals. In short, due to its speed, accuracy, and capability to highlight key functions, mi-faser is useful for generating testable hypotheses of emergent metagenome molecular functionality.
We used mi-faser to functionally annotate the CAMDA data. The preliminary results reveal clear city-specific patterns in function profiles of the set of microbiomes with known assigned locations. With further clustering and machine learning analyses, we expect to address challenges such as assigning unknown samples to specific cities and surface types.

12:20 PM-12:40 PM
Environmental Metagenome Classification for construction of a microbiome fingerprint
Room: Columbus GH
  • Jolanta Kawulok, Silesian University of Technology, Poland

Metagenome analysis makes it possible to extract key information on the organisms that have left their traces in a given environmental sample. In many cases (e.g., in forensic analysis) it is sufficient to determine the origin of the environmental sample, rather than to accurately identify the organisms living there. In this research we want to check the reliability of fingerprints for identifying the origin of a sample. Moreover, we want to investigate how the type of surface affects the results of sample recognition. For this purpose, we exploit our CoMeta program, which allows for fast classification of metagenome samples, and we apply it to classify the extracted unknown metagenomes to various collections of known samples. Our contribution lies in building separate groups of metagenomic reads for each city-surface pair, and then comparing the samples by measuring their similarity directly in the space of the metagenome reads.

12:40 PM-2:00 PM
Lunch Break
2:00 PM-2:20 PM
Application of machine learning techniques for creating urban microbial fingerprints
Room: Columbus GH
  • Feargal Ryan, South Australian Health and Medical Research Institute, EMBL, Australia

The Metagenomics and Metadesign of the Subways and Urban Biomes (MetaSUB) International Consortium has generated shotgun metagenomics sequence data from large urban areas around the world. As part of the MetaSUB forensic challenge we have taken the data and combined a taxonomic classifier against the NCBI nr database with machine learning techniques to generate microbial profiles of each city. This approach resulted in distinct profiles for each city, capable of clustering samples by city of origin in an unsupervised fashion, with the exception of Auckland and Hamilton, which appear almost indistinguishable. t-Distributed Stochastic Neighbor Embedding allowed visualization of this in 2 dimensions, while a random forest can classify samples to their city of origin with near perfect accuracy. The “mystery samples” which were already featured in the dataset cluster discretely with Porto, Ofa, Santiago and the New Zealand cities. The three new unnamed cities each clustered discretely with a country already featured: C1 with Nigeria, C2 with New Zealand and C3 with the U.S.A. Further assembly-based analysis demonstrated that there are still many microbes in urban environments for which we have neither a culture nor a complete genome.

2:20 PM-3:00 PM
Meta-analysis of Breast Cancer and Neuroblastoma through the integration of RNA-seq network analysis, clinical data, and known signaling pathways
Room: Columbus GH
  • Tyler Grimes, University of Florida, United States
  • Somnath Datta, University of Florida, United States

Gene expression profiles from RNA-sequencing provide a window to the activity inside cancerous cells. This view has enabled researchers to identify driver genes for the pathology, but often the results only hold for the particular cancer at hand. In this study, we investigate a robust method for identifying cancer-related genes and deciding which ones are unique to the given cancer or are likely to be involved in a variety of cancers. Gene expression data, clinical information, and known signaling pathways are integrated to identify differentially expressed and differentially connected genes (in gene-gene association networks) among subgroups of patients with a particular cancer. The perturbations of gene expression and pathway connectivity found in a collection of cancers are then combined through a meta-analysis to reveal common mechanisms behind all cancers.
This unified approach allows researchers to identify genomic properties that are common across cancers, which in turn helps uncover the unique differences that characterize individual cancers.
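
One simple way to make the notion of a "differentially connected" gene concrete is sketched below: build a Pearson correlation network per patient subgroup and score each gene by how much its correlations change between subgroups. This is an assumed scoring scheme on synthetic data, chosen for illustration only; the abstract does not specify the association measure or test statistic the authors use.

```python
import numpy as np

def differential_connectivity(expr_a, expr_b):
    """Score each gene by the summed absolute change in its Pearson
    correlations between two subgroups (rows = patients, cols = genes)."""
    corr_a = np.corrcoef(expr_a, rowvar=False)
    corr_b = np.corrcoef(expr_b, rowvar=False)
    return np.abs(corr_a - corr_b).sum(axis=1)

rng = np.random.default_rng(1)
n_samples, n_genes = 300, 5  # hypothetical sizes

# Subgroup A: genes 0 and 1 are strongly co-expressed.
group_a = rng.normal(size=(n_samples, n_genes))
group_a[:, 1] = group_a[:, 0] + 0.3 * rng.normal(size=n_samples)

# Subgroup B: all genes are independent, so the 0-1 edge is lost.
group_b = rng.normal(size=(n_samples, n_genes))

scores = differential_connectivity(group_a, group_b)
top_two = sorted(int(i) for i in np.argsort(scores)[-2:])
print(top_two)
```

Per-cancer scores of this kind could then be combined across cancers in a meta-analysis step, which is not shown here.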

3:00 PM-3:40 PM
Pair-based Integration of Gene Expression and CNV Data
Room: Columbus GH
  • Maarten Larmuseau, Ghent University, Belgium
  • Lieven P.C. Verbeke, Ghent University, Belgium
  • Kathleen Marchal, Ghent University, Belgium

Presentation Overview:

Recent technological advances have led to an exponential increase in data in all the omics fields. It is expected that integration of these different data sources will drastically enhance our knowledge of the biological mechanisms behind genomic diseases such as cancer. However, the integration of different omics data remains a challenge. In this work we propose a methodology based on gene pair analysis of both expression and CNV data in breast cancer and neuroblastoma. Gene expression is defined as either aberrant or basal, using a Gaussian Mixture Model. Using p-value testing, we select significantly co-aberrant pairs of genes. We develop a clustering algorithm that merges pairs of genes into gene sets that are collectively aberrant in a subset of the patients. Genes in those gene sets can have vastly different functions, indicating that they represent complementary pathways that together define a subtype. It is demonstrated that pairwise clustering also works for CNV data, where we retrieve gene sets corresponding to chromosomal regions. By looking for co-occurrence in aberrant gene expression and CNV, we can infer a causal relation and link the CNV gene sets to the expression gene sets.
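
The first two steps of this abstract (calling expression aberrant or basal with a Gaussian Mixture Model, then testing gene pairs for co-aberrance) can be sketched as follows. Several details are assumptions made for illustration: a two-component mixture per gene, treating the larger component as basal, and a one-sided Fisher exact test as the p-value test. The data are synthetic.

```python
import numpy as np
from scipy.stats import fisher_exact
from sklearn.mixture import GaussianMixture

def call_aberrant(expr, seed=0):
    """Return a boolean mask of patients whose expression is aberrant."""
    values = expr.reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=seed).fit(values)
    component = gmm.predict(values)
    basal = np.argmax(gmm.weights_)  # assumption: majority component is basal
    return component != basal

def co_aberrance_pvalue(a, b):
    """One-sided Fisher exact test for excess overlap of two aberrant masks."""
    table = [[np.sum(a & b), np.sum(a & ~b)],
             [np.sum(~a & b), np.sum(~a & ~b)]]
    _, p_value = fisher_exact(table, alternative="greater")
    return p_value

rng = np.random.default_rng(0)
n_patients = 200  # hypothetical cohort size
gene_a = rng.normal(size=n_patients)
gene_b = rng.normal(size=n_patients)
gene_c = rng.normal(size=n_patients)
gene_a[:40] += 6      # genes A and B are aberrant in the same patients
gene_b[:40] += 6
gene_c[100:140] += 6  # gene C is aberrant in a disjoint patient subset

mask_a, mask_b, mask_c = (call_aberrant(g) for g in (gene_a, gene_b, gene_c))
p_ab = co_aberrance_pvalue(mask_a, mask_b)
p_ac = co_aberrance_pvalue(mask_a, mask_c)
print(p_ab, p_ac)
```

Pairs with a small p-value after multiple-testing correction would be the input to the merging step that builds gene sets, which is not shown here.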

3:40 PM-4:00 PM
Proceedings Presentation: Personalized Regression Enables Sample-Specific Pan-Cancer Analysis
Room: Columbus GH
  • Ben Lengerich, Carnegie Mellon University, United States
  • Bryon Aragam, Carnegie Mellon University, United States
  • Eric Xing, Carnegie Mellon University, United States

Presentation Overview:

Motivation: In many applications, inter-sample heterogeneity is crucial to understanding the complex biological processes under study. For example, in genomic analysis of cancers, each patient in a cohort may have a different driver mutation, making it difficult or impossible to identify causal mutations from an averaged view of the entire cohort. Unfortunately, many traditional methods for genomic analysis seek to estimate a single model which is shared by all samples in a population, ignoring this inter-sample heterogeneity entirely. In order to better understand patient heterogeneity, it is necessary to develop practical, personalized statistical models.
Results: To uncover this inter-sample heterogeneity, we propose a novel regularizer for achieving patient-specific personalized estimation. This regularizer operates by learning two latent distance metrics – one between personalized parameters and one between clinical covariates – and attempting to match the induced distances as closely as possible. Crucially, we do not assume these distance metrics are already known. Instead, we allow the data to dictate the structure of these latent distance metrics. Finally, we apply our method to learn patient-specific, interpretable models for a pan-cancer gene expression dataset containing samples from more than 30 distinct cancer types and find strong evidence of personalization effects between cancer types as well as between individuals. Our analysis uncovers sample-specific aberrations that are overlooked by population-level methods, suggesting a promising new path for precision analysis of complex diseases such as cancer.
Availability: Software for personalized linear and personalized logistic regression, along with code to reproduce experimental results, is freely available at github.com/blengerich/personalized_regression.
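
To illustrate the idea of matching distances between personalized parameters and distances between clinical covariates, the sketch below computes a penalty that is zero when the two sets of pairwise distances agree and grows as they diverge. The weighted L1 form of the distances and the names `phi` and `psi` are assumptions made for this example, and the weights are fixed here rather than learned; see the authors' repository for the actual method.

```python
import numpy as np

def weighted_l1_distances(points, weights):
    """Pairwise weighted L1 distances between rows of `points`."""
    diffs = np.abs(points[:, None, :] - points[None, :, :])  # shape (n, n, d)
    return diffs @ weights                                    # shape (n, n)

def distance_matching_penalty(betas, covariates, phi, psi):
    """Sum of squared gaps between parameter and covariate distances."""
    d_beta = weighted_l1_distances(betas, phi)
    d_cov = weighted_l1_distances(covariates, psi)
    upper = np.triu_indices(len(betas), k=1)  # count each pair once
    return float(np.sum((d_beta[upper] - d_cov[upper]) ** 2))

# Hypothetical clinical covariates for four patients.
covariates = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
phi = np.ones(2)  # weights of the parameter metric (fixed, not learned)
psi = np.ones(2)  # weights of the covariate metric (fixed, not learned)

# Parameters that mirror the covariate geometry incur no penalty.
matched = distance_matching_penalty(covariates.copy(), covariates, phi, psi)

# A single shared model (all patients identical) ignores the geometry.
shared = np.zeros_like(covariates)
mismatched = distance_matching_penalty(shared, covariates, phi, psi)
print(matched, mismatched)
```

In a full model this penalty would be added to the per-patient prediction loss, so that patients with similar covariates are encouraged to have similar parameters.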

4:00 PM-4:40 PM
Coffee Break

4:40 PM-5:20 PM
Robust Pathway-based Multi-Omics Data Integration using Directed Random Walk and a Denoising Autoencoder for Survival Prediction in Cancer Studies
Room: Columbus GH
  • So Yeon Kim, Ajou University, South Korea
  • Hyun-Hwan Jeong, Baylor College of Medicine, United States
  • Jaesik Kim, Ajou University, South Korea
  • Jeong-Hyeon Moon, Ajou University, South Korea
  • Kyung-Ah Sohn, Ajou University, South Korea

Presentation Overview:

Integrating the rich information from multi-omics data has been popular for survival prediction and biomarker identification in various cancer studies. To facilitate the integrative analysis of multiple genomic profiles, several studies have suggested utilizing pathway information rather than using individual genomic profiles.
We have recently proposed an integrative directed random walk-based method (iDRW) that combines the idea of pathway activity inference and a denoising autoencoder for more robust and effective genomic feature extraction. Our results showed that the proposed method not only improved the accuracy of survival prediction, but also identified more cancer-specific pathways using gene expression and methylation profiles for breast cancer patients. Accordingly, a further investigation into the robustness of the integrative pathway-based approach on multiple genomic profiles from different cancer datasets is necessary.
In this study, we benchmark the iDRW method and several state-of-the-art pathway-based and gene-based integrative machine learning classification methods with respect to the classification accuracy and the robustness of each model across different datasets. Finally, we show that a robust data integration model guided by the pathway information can improve survival classification performance and provide better biological insight into the top pathways and genes prioritized by the model in both neuroblastoma and breast cancer patients.
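
For readers unfamiliar with directed random walks on pathway graphs, the sketch below runs a generic random walk with restart over a small directed gene graph to turn initial per-gene scores into topology-aware gene weights. It is not the iDRW implementation: the toy graph, the restart probability and the seed scores are all hypothetical, and the denoising autoencoder step is not shown.

```python
import numpy as np

def directed_random_walk(adj, seed_scores, restart=0.7, tol=1e-10, max_iter=1000):
    """Random walk with restart on a directed graph.

    adj[i, j] = 1 means an edge from gene i to gene j. Returns a weight
    per gene that sums to one.
    """
    adj = np.asarray(adj, dtype=float)
    out_degree = adj.sum(axis=1, keepdims=True)
    # Row-normalise; genes without outgoing edges keep an all-zero row.
    trans = np.divide(adj, out_degree, out=np.zeros_like(adj), where=out_degree > 0)
    w0 = np.asarray(seed_scores, dtype=float)
    w0 = w0 / w0.sum()
    w = w0.copy()
    for _ in range(max_iter):
        w_next = (1 - restart) * (trans.T @ w) + restart * w0
        w_next /= w_next.sum()  # renormalise in case mass leaked at sinks
        if np.abs(w_next - w).sum() < tol:
            return w_next
        w = w_next
    return w

# Toy pathway: 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3, 3 -> 0 (hypothetical).
adjacency = np.array([
    [0, 1, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
])
# Hypothetical initial scores, e.g. absolute test statistics per gene.
seeds = [1.0, 0.5, 0.5, 2.0]

gene_weights = directed_random_walk(adjacency, seeds)
print(gene_weights)
```

Weights of this kind are typically used to aggregate gene-level measurements into a per-pathway activity score for each patient, which then serves as input to a downstream classifier.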

5:20 PM-5:40 PM
FDA Meta-analysis of the CMap Drug Safety Challenge, and Outlook: Follow-on Challenges 2019
Room: Columbus GH
  • Shraddha Thakkar, NCTR, FDA, United States

Presentation Overview:

The US FDA coordinator will present first results of this year's meta-analysis of the CMap Drug Safety Challenge, followed by a presentation and discussion of follow-on challenges in this topic for CAMDA 2019.

5:40 PM-6:00 PM
CAMDA: Contest voting, awards, and outlook
Room: Columbus GH
  • Wenzhong Xiao, Harvard Medical School & Stanford, United States
  • David P. Kreil, Boku University, Vienna, Austria

Presentation Overview:

All delegates vote for the CAMDA Best Analysis Award winner(s), followed by presentation of the trophy and introduction of the challenges for CAMDA 2019. Outlook and closing of the conference.