Session A: (July 7 and July 8)
Session B: (July 9 and July 10)
Short Abstract: The recent large-scale collection of city-specific whole genome shotgun (WGS) metagenomics data has enabled the mapping of microbes present in public spaces and transit systems across different cities. With an explicit focus on metagenomics data from the MetaSUB International Consortium, we study a variety of metagenomic features (including species distribution, relative abundance, diversity, etc.) for the purpose of learning the city-specific fingerprints of different samples. The features are mainly collected using tools available in the COSMOSID platform. By applying various supervised learning approaches, we show that the microbiome community can be linked to its city of origin, in a majority of cases, with an accuracy of more than 80%. However, for some of the cities (e.g., Ofa and Porto for the MetaSUB data), the principal components are found to overlap with those of other cities, making it difficult to identify the metagenomic fingerprints with the chosen features. This may indicate the need for more sophisticated sample collection or feature extraction.
Short Abstract: Motivation: The goal of the study is to build predictive models for the clinical endpoints of breast cancer patients (CEBCP) using clinical descriptors (CD) and molecular data from gene expression (GE) and Copy Number Alteration (CNA) assays from the METABRIC dataset. Methods: To this end, we first used different feature selection methods to identify informative variables in the clinical and molecular datasets. Then the predictive models were built separately for each dataset using the Random Forest classification algorithm. Finally, we tested different methods of combining information from the different datasets. Results: All 17 clinical covariates were found to be relevant and non-redundant. In the case of molecular descriptors, there are hundreds of relevant variables in the GE dataset, whereas there are only a handful for CNA. For individual datasets, the best models were built on CD (MCC=0.33), followed by GE (MCC=0.25) and CNA (MCC=0.19). The best results for the integration of clinical and molecular data were achieved when clinical data was augmented by the fraction of votes for a negative CEBCP from models built on molecular data. The best model was built using CD+GE+CNA (MCC=0.37), followed by CD+GE (MCC=0.36) and CD+CNA (MCC=0.35).
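The MCC values quoted in the abstract above are Matthews correlation coefficients. As a minimal sketch of how such a score is obtained from a binary confusion matrix (the function name and the example counts are illustrative assumptions, not taken from the study):

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: report 0.0 when a marginal total is zero
    # and the ratio is therefore undefined.
    return numerator / denominator if denominator else 0.0


if __name__ == "__main__":
    # Hypothetical counts, chosen only to illustrate the call.
    print(round(matthews_corrcoef(tp=40, tn=30, fp=20, fn=10), 3))
```

MCC ranges from -1 (total disagreement) through 0 (no better than chance) to 1 (perfect prediction), which makes it more informative than plain accuracy when the two endpoint classes are imbalanced.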
Short Abstract: Toxicity is considered a main drawback in drug discovery and development. Specifically, the improvement of methods for the prediction of drug-induced liver injury (DILI) from existing preclinical models is a major challenge in drug development and regulatory clearance. Gene expression profiles have been used extensively for prediction purposes in a wide range of biological conditions. However, they are not exempt from problems due to noise, redundancies, non-relevant trends, etc. Recently, mathematical models have been proposed that use low-informative, decontextualized gene expression profiles to estimate signaling pathway activities, which account for the cell functional activity that ultimately determines the phenotype. Such activity values are expected to have a more direct functional link to DILI phenotypes than individual gene expression values. Therefore, we propose to transform gene expression values into signaling pathway activities and use them as features for prediction. This not only improves the prediction but also points to the molecular mechanisms of toxicity, given that the features selected account for specific cell functionalities.
Short Abstract: Horizontal and vertical data integration for two types of cancer, neuroblastoma and breast cancer, is based on the use of different NoSQL databases, with the aim of managing the data well without losing any information. We used the schema-less and compact document-based MongoDB to store all records and their attributes from the raw data. The graph-based Neo4j DB is used to manage all complicated relationships (such as mutations, CNA, and expression). Using a k-nearest neighbour classifier, we defined groups based on a novel complex trait that includes tumour size, tumour stage and age at diagnosis. This data integration approach and the subsequent classification contributed to the determination of common groups of mutated proteins related to both types of cancer. The novel complex trait, combined with survival time and mutation identifiers, is used in several machine learning models with different feature selection methods for optimal survival time prediction. We validate the results using randomly selected smaller subsets of both raw and integrated data. The applied machine learning methods provide mechanisms for assessing cancer progression and predicting survival time.
Short Abstract: Appropriate handling of large-scale metagenomics analyses remains an open problem in both the biological and computational domains. On the one hand, many tools already exist for profiling community abundance, and many are optimized to work with large data sets. On the other hand, these approaches often rely on precompiled databases, at the cost of precluding novel discoveries. Other open problems include identifying proper clusters within data and even the basic prediction of metadata traits concerning the samples at hand. The MetaSUB Forensics Challenge provides an ideal opportunity to assess current tools and methodologies on these problems. Specifically, MetaSUB presents two major challenges: 1) prediction of the location and substrate of sampling and 2) identification of clusters in unlabeled data. To answer these challenges, we have constructed an entirely de novo metagenome assembly analysis based on a subset of the available data. These assembled data are then used as the basis for community profiling in the complete data set, and those profiles are used as machine learning features for location and substrate predictions. Moreover, we plan to use a similar framework for the clustering of entirely unknown samples to infer the number of samples provided as the rest of the data become available.
Short Abstract: The Sequence Read Archive contains more than one million runs of publicly available sequencing data. However, the lack of consistently preprocessed summary and molecular quantification data (for example, gene expression quantification for RNA-seq) for each sequencing run hinders efficient Big Data interpolation. Here, we introduce Skymap, a standalone database that offers a single, multi-species data matrix incorporating all public sequencing studies. The data matrix contains several omic layers, including expression quantification, allelic read counts, microbial read counts, and ChIP-seq. We reprocessed petabytes of sequencing data to generate the data matrix for each data type. We also offer a reprocessed biological metadata file that describes the relationships between the sequencing runs and the associated keywords, extracted from over 3 million free-text annotations using natural language processing. The processed data can fit onto a single hard drive (<500GB). At https://github.com/brianyiktaktsui/Skymap, we showcase how one can (1) retrieve and analyze the SNPs and expression of a genetic variant across >200k runs in less than a minute and (2) increase the temporal resolution when tracking gene expression in mouse development.
Short Abstract: The identification and interpretation of biologically relevant patterns in genome-scale data remains the rate-limiting bottleneck in the translation of experimental advances to the clinic. A lot of hope is now placed in the combination of measurements from complementary sources. Vertical data integration works on matching profiles of different data types collected for each patient. In addition, network representations have recently been explored for horizontal data integration to exploit not just the complementary nature of the data sources but also similarities across patients. We have previously developed a novel network-based approach for the integration of multiple molecular and clinical data types that also incorporates prior knowledge from curated databases. We first demonstrated remarkable performance in patient stratification for survival analysis in neuroblastoma patients. Building on this work, extended analyses show that while the algorithm performs disappointingly on the METABRIC breast cancer cohort, it performs strongly on several TCGA patient cohorts, including breast cancer data sets. Here we further characterize alternative patient cohorts for the same disease and investigate how differences in the cohort structures, especially as reflected in the inferred inter-patient networks, affect algorithm performance.
Short Abstract: Metagenomic whole genome sequencing (WGS) data from samples across several cities around the globe are key to unravelling city-specific signatures of microbes. Consequently, they can be beneficial for forensic analysis. Illumina sequence data were provided from 8 cities in 6 different countries as part of the 2018 CAMDA “MetaSUB Forensic Challenge”. We will use appropriate machine learning techniques on this massive dataset to effectively identify the geographical provenance of additional “mystery” samples. Additionally, we will apply compositional data analysis techniques to develop accurate inferential methods for such microbiome data. Learning from the CAMDA 2017 MetaSUB challenge data, we expect that the current data, with higher quality and greater sequencing depth, combined with improved analytical techniques, will yield many more interesting, robust and useful results for forensic analysis.
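The abstract above does not say which compositional data analysis techniques will be used. One widely used building block is the centered log-ratio (CLR) transform, sketched below as an illustration only; the function name and the pseudocount of 0.5 are assumptions and do not come from the abstract.

```python
import math
from typing import List, Sequence


def clr_transform(counts: Sequence[float], pseudocount: float = 0.5) -> List[float]:
    """Centered log-ratio transform of one sample's taxon counts.

    A pseudocount is added so that zero counts have a defined logarithm.
    """
    logs = [math.log(count + pseudocount) for count in counts]
    mean_log = sum(logs) / len(logs)
    # Subtracting the mean log is the same as dividing each part
    # by the geometric mean of the sample before taking logs.
    return [value - mean_log for value in logs]


if __name__ == "__main__":
    # Hypothetical read counts for four taxa in a single sample.
    print(clr_transform([120, 30, 0, 850]))
```

The transformed values of a sample sum to zero, and without a pseudocount they do not change when every count is multiplied by the same factor, which is why the transform is often applied to samples with unequal sequencing depth.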
Short Abstract: We proposed a network-based machine learning approach to identify biomarkers capable of predicting breast cancer outcomes, including disease-free survival and overall survival at 5 years and in the long term, after a specific treatment. We integrated gene expression data from the METABRIC study (2016) with protein-protein interactions from the Human Protein Reference Database and ConsensusPathDB to find sub-networks of functionally related genes that can produce high classification accuracy. The seed genes used to initialize sub-networks were selected by filter feature selection techniques and collected from the literature. We introduced a score to estimate the predictive ability of a sub-network, which helps rank and prune the space of sub-networks. Then, sub-networks were gradually generated and evaluated for classification performance during a greedy search. The highest classification accuracies achieved by a Support Vector Machine for chemotherapy, hormone therapy, radiotherapy, and none of these treatments were 100%, 78.5%, 87.3%, and 82.7%, respectively. The extracted sub-networks contain many genes that are involved in cancer-relevant pathways and are related to breast cancer, according to several public databases. Such genes include: FGFR2, SMAD2, FGF8, MAPK8, NEDD4, CSTF3, EIF2S2, GSK3B, AKT1, UBA52, TERT, STAT5A, RARB, APC, ACVR1C, UBE2D1, RIPK1, TP53, PIK3CA, etc. This shows that our method is useful for sub-network biomarker detection in breast cancer.
Short Abstract: Hundreds of millions of people are exposed to, and also contribute to, microbial communities residing in subway transit systems across the world. Understanding and exploiting subway system microbes, especially their geographically characterized compositions, is a question of importance in an increasingly interconnected world, with implications for human health. For these reasons, the MetaSUB International Consortium conducted shotgun metagenomic sequencing of various surface types within subway systems across different continents. In partnership with CAMDA, multiple datasets including samples of both known and unknown provenance have been released with the following challenges: design a workflow to generate metagenomic fingerprints for cities and determine how these fingerprints can be leveraged to classify samples with unknown locations and surface types. To attempt these challenges, we propose a new pipeline as follows. After the data sets were trimmed and filtered, metaSPAdes was used to assemble the merged datasets for each city. The assembled sequences were analyzed using the bioBakery suite, including HUMAnN2 and MetaPhlAn2, in order to gather abundance information for the species in each assembly. Finally, the relatedness of the mystery datasets was established using statistical methods.
Short Abstract: In this research, we developed a computational model to analyse the gene expression data of Caenorhabditis elegans (C. elegans), chosen because of its simplicity and significance in studying the genetic and molecular mechanisms of human development and disease. Leveraging and processing the data in C. elegans ribonucleic acid (RNA) molecules provides a path to understanding this organism of interest, and a huge amount of RNA sequence data is now available. Traditional transcriptomic data analysis methods have not been able to achieve the desired level of accuracy in the computational discrimination of protein-coding and non-coding RNA transcripts. We first trained our computational model with labeled RNA transcripts and evaluated how successfully it scored each transcript. We used a computational technique that resulted in a perfect confidence level in the predictions and obtained accurate results when predicting the class of previously unseen RNA transcripts. Our computational predictions were compared with results from validated experimental data, and the model was correct in 100% of the instances. We have therefore designed an efficient bioinformatics tool that achieves accurate results in class prediction for C. elegans transcriptomes. This work is a step forward in determining the computational and molecular make-up of a C. elegans protein.
Short Abstract: Large-scale collaborative precision medicine initiatives are yielding rich multi-omics data. Integrative analyses of the resulting multi-omics data, such as somatic mutation, copy number alteration (CNA), and gene expression, offer the tantalizing possibility of realizing the potential of precision medicine in cancer prevention, diagnosis, and treatment by substantially improving our understanding of underlying mechanisms as well as the discovery of novel biomarkers for different types of cancers. However, several challenges arise in the integrative analysis of such data, including the heterogeneity and high dimensionality of the omics data. In this study, we evaluate two multi-view feature selection algorithms, developed by our group, using the CAMDA Cancer Data Integration datasets. Specifically, we benchmark the performance of different cancer survival predictors developed using state-of-the-art machine learning algorithms that are trained using: (i) a single omics data source; (ii) multi-view integrated data; and (iii) baseline data fusion approaches. Our results demonstrate the viability of multi-view feature selection for multi-omics data integration and lay the groundwork for developing effective multi-omics data integration models using multi-view feature selection.
Short Abstract: Connecting the missing links between drugs, genes and diseases has gained a lot of interest among researchers in recent years. This is due to the availability of large-scale data from cell-based screens of the response to various drugs. With an explicit focus on drug-induced liver injury (DILI), we study the data from the Broad Institute Connectivity Map to learn the effect of drugs on gene expression profiles. A variety of features is collected for this purpose from the MCF7 and PC3 cancer cell lines. The features are mainly related to expression profiles. By applying robust supervised learning approaches powered by feature selection, we show that the expression profiles in cancer cell lines can be connected with DILI for the purpose of prediction. We adopt a random under-sampling approach to handle the data imbalance and obtain a prediction accuracy close to 75% for both cell lines taken separately.
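The random under-sampling step mentioned in the abstract above can be illustrated with a minimal sketch: every class is reduced, by random selection without replacement, to the size of the smallest class. The function name, the fixed seed and the toy labels are assumptions made for this example; the abstract does not describe the authors' implementation.

```python
import random
from collections import defaultdict
from typing import Any, Hashable, List, Sequence, Tuple


def random_undersample(
    samples: Sequence[Any], labels: Sequence[Hashable], seed: int = 0
) -> List[Tuple[Any, Hashable]]:
    """Return (sample, label) pairs with every class cut to the minority size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    minority_size = min(len(group) for group in by_class.values())
    balanced = []
    for label, group in by_class.items():
        # Draw without replacement so that no sample is duplicated.
        for sample in rng.sample(group, minority_size):
            balanced.append((sample, label))
    rng.shuffle(balanced)
    return balanced


if __name__ == "__main__":
    # Hypothetical toy data: 2 positive and 8 negative compounds.
    compounds = [f"drug_{i}" for i in range(10)]
    dili_labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
    print(random_undersample(compounds, dili_labels))
```

Under-sampling should be applied to the training portion only, so that the held-out data keep the original class proportions.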
Short Abstract: There is a growing need to understand the biological processes involved in different diseases from the large amount of biological data available, such as genomic sequences, microarrays, protein interactions, and biomedical images, among others. In addition, the rapid adoption of electronic medical records offers an opportunity for large-scale research. Therefore, data mining techniques for discovering new information from different sources are increasingly important in biology and health care. The biggest challenge in mining genomic data is extracting relevant information from large quantities of data and knowledge protocols. Specifically, the main challenges lie in: a) the collection of clinical and genomic data, b) the recovery of relevant information and c) the extraction of new knowledge and its association with the health status of populations.
Short Abstract: The MetaSUB consortium is building a unique database of whole-genome shotgun metagenomics data sampled in mass-transit areas worldwide. We propose to analyze the CAMDA MetaSUB Forensic dataset with the Ph-CNN deep learning architecture, recently developed for classifying metagenomics data (Fioravanti D et al., BMC Bioinformatics, 2018). The Ph-CNN architecture is based on Convolutional Neural Networks and uses the patristic distance computed on the phylogenetic tree as an appropriate metrization embedding. The approach is applied to identify metagenomic fingerprints that detect the geographic location, based on features quantified by MetaPhlAn2. The impact of sample multiplicity and surface type is also considered. The novel Uniform Manifold Approximation and Projection (UMAP) method is also applied as a non-linear dimensionality reduction algorithm and compared to t-SNE for visualization, in order to explore the global structure of the data without computational restrictions on the embedding dimension.
Short Abstract: An increasingly promising form of cancer immunotherapy is neoantigen vaccination, which attempts to elicit an anti-tumor T-cell response against a patient's specific mutations. One of the most important features to consider in designing a neoantigen vaccine is the expression level of each mutation. Unfortunately, the commercial targeted panels most commonly used in the clinic sequence only DNA. We built a model using publicly available tumor DNA/RNA sequencing data to predict allele-specific expression of mutations using only features of the DNA. We evaluated this model on sequencing data from an ongoing neoantigen vaccine trial to see how many of the expressed predicted neoantigens would have been correctly identified using only DNA.
Short Abstract: The Metagenomics and Metadesign of the Subways and Urban Biomes (MetaSUB) International Consortium has generated shotgun metagenomics sequence data from large urban areas around the world. As part of the MetaSUB forensic challenge, we have taken the data and combined taxonomic classification against the NCBI nr database with machine learning techniques to generate microbial profiles of each city. This approach resulted in distinct profiles for each city, capable of clustering samples by city of origin in an unsupervised fashion, with the exception of Auckland and Hamilton, which appear almost indistinguishable. t-Distributed Stochastic Neighbor Embedding allowed this to be visualized in two dimensions, while a random forest can classify samples to their city of origin with near-perfect accuracy. The “mystery samples”, which were already featured in the dataset, cluster discretely with Porto, Ofa, Santiago and the New Zealand cities. The three new unnamed cities each clustered discretely with a country already featured: C1 with Nigeria, C2 with New Zealand and C3 with the U.S.A. Further assembly-based analysis demonstrated that there are still many microbes in urban environments for which we have neither a culture nor a complete genome.
Short Abstract: Drug-induced liver injury (DILI) is a serious concern during drug development. DILI is characterized by elevated levels of alanine aminotransferase, and it can ultimately result in patient death in serious cases [1]. Evidence suggests that reactive drug metabolites play a role in initiating DILI [1]. To quantify the effects of drugs on human cells in general, the Connectivity Map (CMap, build 02) data set measures drugs’ impact on RNA expression in cancer cell lines [2]. In this paper we outline an attempt to use CMap data before and after drug exposure to predict whether specific drugs in CMap cause hepatic injury. First, we applied seven classification algorithms independently. None of these algorithms predicted liver injury on a consistent basis with high accuracy. In an attempt to improve accuracy, we aggregated predictions for six of the algorithms (excluding one that had performed exceptionally poorly) using a soft-voting ensemble method. This approach improved our results in some cases for the training set but failed to generalize well to the test set. We conclude that more robust methods and/or datasets will be necessary to effectively predict drug-induced liver injury based on RNA expression levels in cell lines.
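Soft voting, as mentioned in the abstract above, combines classifiers by averaging their predicted class probabilities rather than counting their predicted labels. The following is a minimal sketch with equal weights; the function names and the probabilities are invented for illustration and do not reproduce the authors' ensemble.

```python
from collections import Counter
from typing import List, Sequence


def average_probabilities(probabilities: Sequence[Sequence[float]]) -> List[float]:
    """Mean predicted probability per class across all models."""
    n_models = len(probabilities)
    n_classes = len(probabilities[0])
    return [
        sum(model[k] for model in probabilities) / n_models
        for k in range(n_classes)
    ]


def soft_vote(probabilities: Sequence[Sequence[float]]) -> int:
    """Index of the class with the highest mean probability."""
    mean = average_probabilities(probabilities)
    return max(range(len(mean)), key=lambda k: mean[k])


def hard_vote(probabilities: Sequence[Sequence[float]]) -> int:
    """Majority vote over each model's most probable class, for comparison."""
    votes = [max(range(len(p)), key=lambda k: p[k]) for p in probabilities]
    return Counter(votes).most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical [P(no DILI), P(DILI)] outputs of three models for one drug.
    predictions = [[0.90, 0.10], [0.40, 0.60], [0.45, 0.55]]
    print(soft_vote(predictions), hard_vote(predictions))
```

In this toy case the two schemes disagree: one confident model outweighs two marginal ones under soft voting, while the majority wins under hard voting. Soft voting therefore depends on the individual models producing reasonably calibrated probabilities.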
Short Abstract: The TCSS 588 Bioinformatics class at University of Washington Tacoma consists of 26 senior (undergraduate) and Master’s students who major in Computer Science. In order to give the students experience in building predictive models using big, complex biomedical data for real-life applications, the instructor of this class, Dr. Yeung, adopts a crowdsourcing model for the CMAP Drug Safety Challenge in the classroom.