Leading Professional Society for Computational Biology and Bioinformatics
Connecting, Training, Empowering, Worldwide

CAMDA COSI

Presentations

Schedule subject to change
Monday, July 13th
10:45 AM-11:40 AM
CAMDA KEYNOTE: Climate, Oceans, and Human Health: Cholera as a paradigm for predicting infectious diseases
Format: Live-stream

  • Rita Colwell

Presentation Overview:

Climate and the oceans have historically been closely intertwined with human health. Today, significant advances in information technology have brought new discoveries, from the outer reaches of space, where remote sensing monitors on satellites circle the earth, to the ultramicroscopic, through application of next generation sequencing and bioinformatics. Vibrio cholerae provides a useful example of the fundamental link between human health and the oceans. This bacterium is the causative agent of cholera and is associated with major pandemics, yet it is a marine bacterium with versatile genetics and is distributed globally in estuaries, notably the Bay of Bengal, and in other coastal regions and aquatic systems. Vibrio species, both nonpathogenic and those pathogenic for humans, marine animals, or marine vegetation, play a fundamental role in nutrient cycling. They have also been shown to respond to warming of surface waters of the North Atlantic, with an increase in their numbers correlated with increased incidence of vibrio disease in humans. The models we have developed for understanding and predicting outbreaks of cholera are based on work done in the Chesapeake Bay and the Bay of Bengal, and these models are now used by UNICEF and aid agencies to predict cholera in Yemen and in countries of the African continent. With the onset of COVID-19, these models are currently being modified to predict SARS-CoV-2 and the incidence of COVID-19, the current coronavirus pandemic. In summary, molecular microbial ecology coupled with computational science can provide a critical indicator and prediction of human health and wellness. How this is being accomplished and how we are beginning to understand environmental aspects of COVID-19 will be discussed in this talk.

12:00 PM-12:40 PM
Metagenomic Geolocation using Read Signatures
Format: Pre-recorded with live Q&A

  • Timothy Chappell, Queensland University of Technology, Australia
  • Dimitri Perrin, Queensland University of Technology, Australia
  • Shlomo Geva, Queensland University of Technology, Australia
  • James Hogan, Queensland University of Technology, Australia
  • David Lovell, Queensland University of Technology, Australia

Presentation Overview:

We present a novel approach to the Metagenomic Geolocation Challenge based on random projection of the sample reads for each location. Individual signatures are computed for all reads, and each location is characterised by a hierarchical vector space representation of the resulting clusters. Classification is then treated as a problem in ranked retrieval of locations, where similar signatures are taken as a proxy for underlying microbial similarity. We evaluate our approach on the 2016 and 2020 Challenge datasets and obtain promising results with nearest neighbour classification.
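
A minimal sketch of the signature idea may help make the abstract concrete. It is not the authors' implementation: the k-mer length, the 64-bit signature width, the use of SHA-256 as the source of pseudo-random projection bits, and the function names are all assumptions made for illustration, and the hierarchical clustering step mentioned in the abstract is omitted. Each read is reduced to a binary signature, and locations are ranked by the mean Hamming distance from each query read to its nearest reference read.

```python
import hashlib

SIG_BITS = 64  # signature width (illustrative assumption)
K = 4          # k-mer length (illustrative assumption)


def kmer_bits(kmer: str) -> int:
    """Map a k-mer to a fixed pseudo-random 64-bit pattern via hashing."""
    digest = hashlib.sha256(kmer.encode()).digest()
    return int.from_bytes(digest[:SIG_BITS // 8], "big")


def read_signature(read: str) -> int:
    """Each k-mer votes +1 or -1 on every bit; the sign of the sum is kept."""
    votes = [0] * SIG_BITS
    for i in range(len(read) - K + 1):
        bits = kmer_bits(read[i:i + K])
        for b in range(SIG_BITS):
            votes[b] += 1 if (bits >> b) & 1 else -1
    return sum(1 << b for b in range(SIG_BITS) if votes[b] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two signatures."""
    return bin(a ^ b).count("1")


def rank_locations(query_reads, reference):
    """Rank locations by mean nearest-signature distance (closest first).

    reference maps a location name to a list of read signatures.
    """
    query_sigs = [read_signature(r) for r in query_reads]
    scores = {}
    for location, sigs in reference.items():
        nearest = [min(hamming(q, s) for s in sigs) for q in query_sigs]
        scores[location] = sum(nearest) / len(nearest)
    return sorted(scores.items(), key=lambda item: item[1])
```

Similar reads share most k-mers and therefore most signature bits, which is why Hamming distance between signatures can stand in for sequence similarity in this sketch.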

2:00 PM-2:20 PM
Unraveling city-specific microbial signature and identifying sample origin for the data from CAMDA 2020 Metagenomic Geolocation Challenge
Format: Pre-recorded with live Q&A

  • Runzhi Zhang, University of Florida, China
  • Alejandro Riveros Walker, University of Florida, Chile
  • Dorothy Ellis, University of Florida, United States
  • Susmita Datta, University of Florida, United States

Presentation Overview:

As the composition of microbial communities can be location-specific, investigating the microbial community of different cities could unravel city-specific microbial signatures and further identify the origin of future samples. In this study, the whole genome shotgun (WGS) metagenomics data from over 20 cities in 17 countries and city-specific information including location, climate and biomes data were provided as part of the CAMDA 2020 “Metagenomic Geolocation Challenge”. Appropriate methods including feature selection, normalization, two popular machine learning methods (Random Forest (RF) and Support Vector Machine (SVM)), one deep learning method (Multilayer Perceptron (MLP)), and Principal Coordinates Analysis (PCoA) were used.

2:20 PM-2:40 PM
Spatial models for assessment of bacterial classification relevant to AMR
Format: Pre-recorded with live Q&A

  • Maya Zhelyazkova, Sofia University "St. Kliment Ohridski", Faculty of Mathematics and Informatics, Bulgaria
  • Roumyana Yordanova, Hokkaido University, Sapporo, Japan, Bulgaria
  • Iliyan Mihaylov, Sofia University "St. Kliment Ohridski", Faculty of Mathematics and Informatics, Bulgaria
  • Stefan Kirov, Bristol-Myers Squibb, NJ, United States
  • David Danko, Weill Cornell Medical College, NY, United States
  • Dimitar Vassilev, Sofia University "St. Kliment Ohridski", Faculty of Mathematics and Informatics, Bulgaria

Presentation Overview:

The work and development of the MetaSUB international consortium project raises important questions about the antimicrobial resistance (AMR) of the collected samples. Different bioinformatics and statistical approaches are developed for bacterial classification, AMR taxa discovery and characterization of their geographical distribution across multiple cities around the world. In this work we use methods from epidemiology to estimate relative risk of antimicrobial resistance by modeling the spatial correlation in the data.
The novelty of our approach in this work is that we apply a convolution model, more specifically the BYM (Besag, York, Mollié) model, to explicitly incorporate the spatial structure in the data as determined by the longitude and latitude of the samples. We use the Bayesian implementation in the R package CARBayes, where inference is based on Markov chain Monte Carlo (MCMC) simulation. We adapt the epidemiological concept of relative risk (RR) to find regions in the cities with elevated AMR risk. The model can also be used for data with excessive zeros by modeling the response as a Zero-Inflated Poisson (ZIP) process. In addition to spatial modeling, we used several machine learning approaches such as GBM, Random Forest and Neural Network to predict the geographical origin of the samples.
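
For readers unfamiliar with the BYM model, its usual textbook form for count data can be written as below. This is the standard formulation, not necessarily the exact specification used in the talk; the symbols (observed count Y_k and expected count E_k in areal unit k, relative risk θ_k, neighbourhood weights w_kj, variances τ² and σ², zero-inflation probability ω_k) are notation assumed here for illustration.

```latex
\begin{aligned}
Y_k \mid \mu_k &\sim \mathrm{Poisson}(\mu_k), \qquad \mu_k = E_k\,\theta_k,\\
\log \theta_k &= \beta_0 + \phi_k + \psi_k,\\
\phi_k \mid \phi_{-k} &\sim \mathcal{N}\!\left(\frac{\sum_{j} w_{kj}\,\phi_j}{\sum_{j} w_{kj}},\ \frac{\tau^2}{\sum_{j} w_{kj}}\right),\\
\psi_k &\sim \mathcal{N}(0,\ \sigma^2),\\[4pt]
\text{ZIP variant:}\quad \Pr(Y_k = 0) &= \omega_k + (1-\omega_k)\,e^{-\mu_k}, \qquad
\Pr(Y_k = y) = (1-\omega_k)\,\frac{\mu_k^{\,y}\,e^{-\mu_k}}{y!}\quad (y \ge 1).
\end{aligned}
```

Here φ_k is the spatially structured effect (smoothed towards the average of its neighbours), ψ_k is the unstructured effect, and θ_k > 1 marks a unit with elevated risk relative to expectation.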

2:40 PM-3:00 PM
Metagenomic Geolocation Prediction Using an Adaptive Ensemble Classifier
Format: Pre-recorded with live Q&A

  • Somnath Datta, University of Florida, United States
  • Samuel Anyaso-Samuel, University of Florida, United States
  • Archie Sachdeva, University of Florida, United States
  • Subharup Guha, University of Florida, United States

Presentation Overview:

Several standard classifiers have been employed for the prediction of the origin of a given microbial sample. The application of these individual classification algorithms yields varying degrees of classification accuracy, and the performance of such classifiers is also dependent on the structure of the available data. Rather than employing several individual classifiers, we adopt the adaptive ensemble classification algorithm proposed in (Datta, 2010). The ensemble classifier, which is constructed by bagging and rank aggregation, comprises a set of standard classification algorithms that are combined flexibly to yield classification performance at least as good as that of the best classification algorithm in the ensemble. For our analysis, we trained and tested fourteen standard classifiers including the ensemble classifier, and in different instances, we also applied class weighting and an optimal oversampling technique to overcome the problem of class imbalance in the primary data. These analyses were conducted both on the primary data set of relative abundances and on data in a reduced feature space. In each instance, we found that the standard algorithms performed differently, whereas the ensemble classifier consistently showed optimal performance. Lastly, we predict the source cities of the mystery samples provided by the CAMDA organizers.

3:20 PM-3:40 PM
Separation of Mystery-Samples using mi-faser and forest embedding
Format: Pre-recorded with live Q&A

  • Yana Bromberg, Rutgers University, United States
  • Maximilian Miller, Rutgers University, United States
  • Ariel Aptekmann, Rutgers University, United States

Presentation Overview:

Here we propose a new approach for identifying and clustering city profiles based on the metagenomic fingerprints of related subway systems. We generate functional fingerprints of shotgun-sequenced metagenomes and use forest embeddings in combination with a density-based clustering method to identify samples of common provenance.
We applied this approach to the CAMDA2020 Metagenomic Forensics Challenge dataset and found that the set of 121 mystery samples may come from eight different cities. We also tried to predict metadata of the mystery samples, i.e. associated subway network length and usage, in order to assign each sample to a city.

3:40 PM-4:00 PM
Metagenomic Data Analysis with Probability-Based Reduced Dataset Representation
Format: Pre-recorded with live Q&A

  • Tae-Hyuk Ahn, Saint Louis University, United States
  • Cory Gardner, Saint Louis University, United States
  • Sadiya Ahmad, Saint Louis University, United States

Presentation Overview:

Metagenomics has become of increasing interest to researchers exploring human health and disease. Higher sequencing throughput and lower associated costs, in turn, have made whole genome sequencing of complex environmental samples a realistic possibility for researchers. The massive datasets produced by these approaches, however, call for new approaches to their processing and interpretation. MetaSUB and CAMDA have partnered to confront this challenge, offering a yearly contest in which participants are asked to predict the origin of unlabeled metagenomic samples by applying machine learning methods to a set of city-labeled whole genome sequencing data. A remaining roadblock, however, is the enormous computational power required to process these massive datasets. Previous research has studied the effect on mystery sample prediction of using a "reduced-representation subset" which contained only 50% of the sample data in the training set. In each case, there was relatively little loss of predictive power using the subset. In this paper, we extend the previous work and explore the relationship between mystery sample prediction and reduced-representation subsets at 20% intervals.
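
One way to picture reduced-representation subsets at 20% intervals is the sketch below. The abstract does not specify how reads are selected, so independent per-read retention with probability p is an assumption here, as are the function names and the fixed seed.

```python
import random


def reduce_reads(reads, keep_probability, seed=0):
    """Keep each read independently with the given probability (assumed scheme)."""
    rng = random.Random(seed)
    return [read for read in reads if rng.random() < keep_probability]


def subsets_at_intervals(reads, fractions=(0.2, 0.4, 0.6, 0.8, 1.0), seed=0):
    """Build one reduced subset per retention fraction."""
    return {p: reduce_reads(reads, p, seed) for p in fractions}
```

Each subset could then be passed through the same feature extraction and classifier so that prediction accuracy can be compared across retention levels.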

4:00 PM-4:40 PM
Towards a metagenomics interpretable model for understanding the transition from adenoma to colorectal cancer
Format: Pre-recorded with live Q&A

  • Carlos Loucera, Clinical Bioinformatics Area (FPS), Spain

Presentation Overview:

TBP

5:00 PM-5:20 PM
Proceedings Presentation: IDMIL: An alignment-free interpretable deep multiple instance learning (MIL) for predicting disease from whole-metagenomic data
Format: Pre-recorded with live Q&A

  • Huzefa Rangwala, George Mason University, United States
  • Mohammad Arifur Rahman, George Mason University, United States

Presentation Overview:

The human body hosts more microbial organisms than human cells. Analysis of this microbial diversity provides key insight into the role played by these microorganisms in human health. Metagenomics is the collective DNA sequencing of coexisting microbial organisms in a host. This has several applications in precision medicine, agriculture, environmental science, and forensics. State-of-the-art predictive models for phenotype prediction from the microbiome rely on alignment, assembly, extensive pruning, and taxonomic profiling, along with expert-curated reference databases. These processes are time-consuming, and they discard the majority of the DNA sequences before downstream analysis, limiting the potential of whole-metagenomics. We formulate the problem of predicting human disease from whole-metagenomic data using Multiple Instance Learning (MIL), a popular supervised learning paradigm. Our proposed alignment-free approach provides higher prediction accuracy by harnessing the capability of a deep convolutional neural network (CNN) within a MIL framework and provides interpretability via a neural attention mechanism.

The MIL formulation combined with the hierarchical feature extraction capability of deep-CNN provides significantly better predictive performance compared to popular existing approaches. The attention mechanism allows for the identification of groups of sequences that are likely to be correlated to diseases providing the much-needed interpretation. Our proposed approach does not rely on alignment, assembly and manually curated databases; making it fast and scalable for large-scale metagenomic data. We evaluate our method on well-known large-scale metagenomic studies and show that our proposed approach outperforms comparative state-of-the-art methods for disease prediction.
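
The role of attention in a MIL setting can be illustrated with a small pooling sketch. This is a generic softmax attention pooling over instance feature vectors, not the IDMIL architecture itself; the scoring vector `w` stands in for learned parameters and is hypothetical, as are the function and variable names.

```python
import math


def attention_pool(instances, w):
    """Pool a bag of instance vectors into one bag vector.

    instances: list of equal-length feature vectors (one per sequence group).
    w: scoring vector (hypothetical learned parameter).
    Returns (bag_vector, attention_weights).
    """
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for x in instances]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(instances[0])
    bag = [sum(a * x[d] for a, x in zip(weights, instances)) for d in range(dim)]
    return bag, weights
```

The attention weights sum to one, so instances with large weights are the ones that contribute most to the bag-level prediction, which is the basis for the interpretability claim.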

5:20 PM-5:40 PM
DNA Based Methods in Intelligence - Moving Towards Metagenomics
Format: Pre-recorded with live Q&A

  • Paweł Łabaj

Presentation Overview:

TBP

5:40 PM-6:00 PM
Day summary
Format: Live-stream

  • Joaquin Dopazo, Clinical Bioinformatics Area. Fundación Progreso y Salud, Sevilla, Spain

Tuesday, July 14th
10:45 AM-11:40 AM
CAMDA KEYNOTE: Challenges and tools for integrative analysis of single cell 'omics data
Format: Live-stream

  • Aedin Culhane

Presentation Overview:

Single-cell multi 'omics data has the potential to provide unprecedented insights into molecular, spatial and cellular organization of tissue during health and disease. Whilst approaches for integrating single-cell 'omics data are emerging, the sparsity, high dimensionality, heterogeneity and scale of these data types present unique challenges. Due to the scale of single cell data, dimension reduction is an essential preliminary step in analysis pipelines, data integration, cell clustering and trajectory analysis. Principal component analysis (PCA) is widely used for this because it is relatively fast, and can easily scale to large datasets when used with sparse-matrix representations. I will review different forms of PCA, the impact of scaling and log-transforming, and how to recognize problems such as the horseshoe or arch effect. I will describe an alternative to PCA which we have recently implemented in a new Bioconductor package, corral, and show how simple replacement of PCA with corral can improve integrative analysis of single cell data. I will describe challenges we have encountered and how we are re-envisioning tools we developed for bulk RNAseq data for emerging single cell data types.

12:00 PM-12:20 PM
Effect of Tumor Purity on The Analysis of Gene Expression Data
Format: Pre-recorded with live Q&A

  • Seungjun Ahn, Department of Biostatistics, University of Florida, United States
  • Tyler Grimes, Department of Biostatistics, University of Florida, United States
  • Somnath Datta, Department of Biostatistics, University of Florida, United States

Presentation Overview:

The tumor microenvironment is comprised of tumor cells, stroma cells, immune cells, blood vessels, and other associated non-cancerous cells. Gene expression measurements on tumor samples are an average over cells in the microenvironment. However, research questions often seek answers about tumor cells rather than the surrounding non-tumor tissue. In this study, we investigate the effect of tumor purity – the proportion of tumor cells in a solid tumor sample – on the performance of two statistical methods used for analyzing gene expression data: differential network (DN) analysis and differential gene expression analysis (DGEA). We perform the DN analysis on a breast invasive carcinoma dataset. The results reveal cancer-progression related pathways when analyzing the high-purity samples, whereas many non-cancer-related pathways are obtained when analyzing the complete dataset. In addition, a simulation study is conducted to assess the effect of replacing low tumor purity samples compared to random sampling variability with two additional datasets: head and neck squamous cell carcinoma (HNSC) and lung squamous cell carcinoma (LUSC). The approach described in this study provides a general strategy for assessing the effect of tumor purity on any gene expression data analyses and can be applied to other types of cancers.

12:20 PM-12:40 PM
Mechanistic models of CMap drug perturbation functional profiles
Format: Pre-recorded with live Q&A

  • Macarena Lopez-Sanchez, Clinical Bioinformatics Area. Fundación Progreso y Salud, Sevilla, Spain
  • Marina Esteban-Medina, Clinical Bioinformatics Area, Fundación Progreso y Salud, Spain
  • Carlos Loucera, Clinical Bioinformatics Area (FPS), Spain
  • Joaquin Dopazo, Clinical Bioinformatics Area. Fundación Progreso y Salud, Sevilla, Spain
  • Maria Peña-Chilet, CIBERER, Spain

Presentation Overview:

We have used mechanistic models of pathway activities to generate a complete catalogue of CMap functional profiles that can be further used to detect reverse functional profiles in diseases, thus providing a functional basis and a potential biological interpretation for the drug-disease inverse matching.

2:00 PM-3:00 PM
CAMDA Cafe
Format: Live-stream

  • Wenzhong Xiao
  • Joaquin Dopazo, Clinical Bioinformatics Area. Fundación Progreso y Salud, Sevilla, Spain

3:20 PM-4:00 PM
Prediction of Drug Induced Liver Injury with different data sets and different end points
Format: Pre-recorded with live Q&A

  • Wojciech Lesinski, University of Bialystok, Poland
  • Witold Rudnicki, University of Bialystok, Poland
  • Krzysztof Mnich, University of Białystok, Poland

Presentation Overview:

Motivation: Drug-induced liver injury (DILI) is one of the primary problems in drug development. Early prediction of DILI would bring a significant reduction in the cost of clinical trials and faster development of drugs. The current study aims at building predictive models of the DILI potential of chemical compounds.
Methods: We build predictive models for several alternative splits of compounds between DILI and non-DILI classes, using supervised machine learning (ML) algorithms.
To this end we use chemical properties of the compounds under scrutiny, their effects on gene expression levels in 6 human cell lines treated with them, and their toxicological profiles. We first identify the most informative variables and then use them to build ML models.
Individual models built using gene expression of single cell lines, chemical properties of compounds, and their toxicology profiles are then combined using the Super Learner approach.
Results: We obtained weakly predictive models using molecular descriptors and DILI statistics, with AUC exceeding 0.7 for some DILI definitions.
With one exception, gene expression profiles of human cell lines were non-informative and resulted in random models. Gene expression profiles of the HEPG2 cell line led to statistically significant models (AUC=0.67) only for one definition of DILI.

4:00 PM-4:20 PM
Improving Deep Learning Performance on Prediction of Drug-Induced Liver Injury
Format: Pre-recorded with live Q&A

  • Peter Tran, Saint Louis University, United States
  • David Ahrens, Saint Louis University, United States
  • John Reddy Peasari, Saint Louis University, United States
  • Stephen Tahan, Saint Louis University, United States
  • Tae-Hyuk Ahn, Saint Louis University, United States

Presentation Overview:

Drug-induced liver injury (DILI) is an important safety issue in the field of drug development. The ability to accurately predict DILI from both in vivo and in vitro studies would be an advantage in assessing a drug's potential to cause DILI, because hepatotoxicity might not be evident at the beginning stages of development. The annual international conference on Critical Assessment of Massive Data Analysis (CAMDA) releases challenges each year to tackle big data problems in the life sciences; the CMap Drug Safety challenge focuses on the prediction of DILI. In previous years, deep learning has been utilized, but with less than stellar performance. This year (2020), we seek to improve the performance of deep learning on this challenge by investigating different methods of preprocessing data and different network architectures. Our current leading model is able to predict severe DILI more accurately than previous deep learning results.

4:20 PM-4:40 PM
Gene expression signature-based machine learning classifier of drug-induced liver injury
Format: Pre-recorded with live Q&A

  • Brett McGregor, University of North Dakota, United States
  • Kai Guo, University of North Dakota, United States
  • Junguk Hur, University of North Dakota, United States

Presentation Overview:

Drug-induced liver injury (DILI) is considered a primary factor in regulatory clearance for drug development, and there is a pressing need to develop and evaluate new prediction models for DILI. The CAMDA 2020 CMap Drug Safety Challenge included 422 drugs for training and 195 drugs with blinded labels for testing to predict four types of DILI classes (DILI1, DILI3, DILI5, and DILI6). We used a machine learning (ML) approach focusing on drug perturbation gene expression signatures from the six human cell lines (PHH, HEPG2, HA1E, A375, MCF7, and PC3). We created representative expression signatures, the 250 most up-regulated and 250 most down-regulated genes, for each drug using Kruskal-Borda merging of ranked z-score profiles. Various ML algorithms, including random forest (RF), recursive partitioning and regression trees (RPART), support vector machine (SVM), generalized linear model (GLM), and naïve-Bayes classifier, were built and evaluated using 100 times 5-fold cross-validation. The initial model results range from a ROC value of 0.818 in the RF DILI3 to 0.491 in GLM DILI6. These models are still a work in progress, in which data from Tox21, FAERS, and Mold2 will need to be incorporated alongside the gene expression response garnered from CMap.
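
The signature construction step can be sketched with a plain Borda count. The abstract names "Kruskal-Borda merging" without giving details, so what follows is the simple Borda variant only, stated here as a substitute; the function names and the toy data layout (one dictionary of gene z-scores per profile) are assumptions.

```python
def borda_merge(profiles):
    """Merge several z-score profiles into one gene ranking (highest first).

    profiles: list of {gene: z_score} dictionaries, one per cell line or replicate.
    A gene earns one point per gene ranked below it in each profile.
    """
    genes = sorted(set.intersection(*(set(p) for p in profiles)))
    points = {gene: 0 for gene in genes}
    for profile in profiles:
        ordered = sorted(genes, key=lambda g: profile[g])  # ascending z-score
        for rank, gene in enumerate(ordered):
            points[gene] += rank
    return sorted(genes, key=lambda g: points[g], reverse=True)


def expression_signature(profiles, n_up=250, n_down=250):
    """Return the top n_up and bottom n_down genes of the merged ranking."""
    merged = borda_merge(profiles)
    return merged[:n_up], merged[max(0, len(merged) - n_down):]
```

Rank-based merging of this kind is insensitive to differences in z-score scale between cell lines, since only the ordering within each profile matters.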

5:00 PM-5:20 PM
Proceedings Presentation: AITL: Adversarial Inductive Transfer Learning with input and output space adaptation for pharmacogenomics
Format: Pre-recorded with live Q&A

  • Hossein Sharifi Noghabi, Simon Fraser University, Canada
  • Shuman Peng, Simon Fraser University, Canada
  • Olga Zolotareva, Bielefeld University, Germany
  • Colin Collins, University of British Columbia, Canada
  • Martin Ester, Simon Fraser University, Canada

Presentation Overview:

Motivation: The goal of pharmacogenomics is to predict drug response in patients using their single- or multi-omics data. A major challenge is that clinical data (i.e. patients) with drug response outcome is very limited, creating a need for transfer learning to bridge the gap between large pre-clinical pharmacogenomics datasets (e.g. cancer cell lines), as a source domain, and clinical datasets as a target domain. Two major discrepancies exist between pre-clinical and clinical datasets: 1) in the input space, the gene expression data differ due to differences in the basic biology, and 2) in the output space, the measures of drug response differ. Therefore, training a computational model on cell lines and testing it on patients violates the i.i.d. assumption that train and test data are from the same distribution.
Results: We propose Adversarial Inductive Transfer Learning (AITL), a deep neural network method for addressing discrepancies in input and output space between the pre-clinical and clinical datasets. AITL takes gene expression of patients and cell lines as the input, employs adversarial domain adaptation and multi-task learning to address these discrepancies, and predicts the drug response as the output. To the best of our knowledge, AITL is the first adversarial inductive transfer learning method to address both input and output discrepancies. Experimental results indicate that AITL outperforms state-of-the-art pharmacogenomics and transfer learning baselines and may guide precision oncology more accurately.

5:20 PM-5:40 PM
Proceedings Presentation: Improved survival analysis by learning shared genomic information from pan-cancer data
Format: Pre-recorded with live Q&A

  • Sunkyu Kim, Korea University, South Korea
  • Keonwoo Kim, Korea University, South Korea
  • Junseok Choe, Korea University, South Korea
  • Inggeol Lee, Korea University, South Korea
  • Jaewoo Kang, Korea University, South Korea

Presentation Overview:

Motivation: Recent advances in deep learning have offered solutions to many biomedical tasks. However, there remains a challenge in applying deep learning to survival analysis using human cancer transcriptome data. Since the number of genes, the input variables of the survival model, is larger than the number of available cancer patient samples, deep learning models are prone to overfitting. To address the issue, we introduce a new deep learning architecture called VAECox. VAECox employs transfer learning and fine-tuning.
Results: We pre-trained a variational autoencoder on all RNA-seq data in 20 TCGA datasets and transferred the trained weights to our survival prediction model. Then we fine-tuned the transferred weights while training the survival model on each dataset. Results show that our model outperformed previous models such as Cox-PH with LASSO and ridge penalty and Cox-nnet on 7 of 10 TCGA datasets in terms of C-index. The results signify that the transferred information obtained from entire cancer transcriptome data helped our survival prediction model reduce overfitting and show robust performance on unseen cancer patient samples.
Availability: Our implementation of VAECox is available at https://github.com/SunkyuKim/VAECox
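
Since the results are reported in terms of the C-index, a short sketch of how that metric is commonly computed may be useful. This is the standard Harrell-style concordance index written from scratch, not code from the VAECox repository; the function and argument names are assumptions.

```python
def concordance_index(times, events, risks):
    """Fraction of comparable patient pairs ordered correctly by predicted risk.

    times: observed survival or censoring times.
    events: 1 if the event (death) was observed, 0 if censored.
    risks: predicted risk scores (higher means shorter expected survival).
    A pair (i, j) is comparable when patient i had an observed event
    strictly before patient j's time. Ties in risk count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 1.0 means every comparable pair is ranked correctly, 0.5 corresponds to random ordering, and values below 0.5 indicate systematically inverted predictions.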

5:40 PM-6:00 PM
Closing remarks and awards
Format: Live-stream

  • David Kreil