CAMDA COSI Track Presentations

CAMDA Welcome and opening remarks
Date: Saturday, July 22
Time: 10:00 am - 10:05 am
Room: North Hall
  • Joaquin Dopazo, Fundación Progreso y Salud (Ministry of Health), Bioinformatics Area, University Hospital Virgen del Rocío, Seville, Spain
Molecular networks as determinants of response and outcome
Date: Saturday, July 22
Time: 10:05 am - 11:00 am
Room: North Hall
  • Lodewyk Wessels, Netherlands Cancer Institute, Netherlands

Presentation Overview:

Keynote - Lodewyk Wessels

The CAMDA Challenges
Date: Saturday, July 22
Time: 11:00 am - 11:10 am
Room: North Hall
  • David P. Kreil, Chair of Bioinformatics Research Group, Boku University Vienna, Austria

Presentation Overview:

An Introduction

Predicting clinical outcome of neuroblastoma patients using an integrative network-based approach
Date: Saturday, July 22
Time: 11:10 am - 11:30 am
Room: North Hall
  • Léon-Charles Tranchevent, Luxembourg Institute of Health (LIH), Luxembourg
  • Petr Nazarov, Luxembourg Institute of Health (LIH), Luxembourg
  • Tony Kaoma, Luxembourg Institute of Health (LIH), Luxembourg
  • Arnaud Muller, Luxembourg Institute of Health (LIH), Luxembourg
  • Sang-Yoon Kim, Luxembourg Institute of Health (LIH), Luxembourg
  • Jagath C. Rajapakse, Nanyang Technological University (NTU), Singapore
  • Francisco Azuaje, Luxembourg Institute of Health (LIH), Luxembourg

Presentation Overview:

One of the main current challenges in computational biology is to make the best of the huge amount of experimental data that is being produced. For instance, large cohorts of patients are often screened using different high-throughput technologies, effectively producing multiple molecular profiles per patient for hundreds or thousands of patients. We propose and implement a network-based method that integrates such patient omics data and uses them to predict various clinical features. Using a neuroblastoma dataset, we then demonstrate that the networks inferred from omics data contain clinically relevant information and that patient clinical outcomes can therefore be predicted using only network topological data.

Predicting clinical outcomes in neuroblastoma with genomic data integration
Date: Saturday, July 22
Time: 11:30 am - 11:50 am
Room: North Hall
  • Hilal Kazan, Antalya International University, Turkey
  • Saber Hafezqorani, Middle East Technical University, Turkey
  • Tunde Aderinwale, Antalya International University, Turkey
  • Ilyes Baali, Antalya International University, Turkey
  • Durmus Alp Emre Acar, Antalya International University, Turkey

Presentation Overview:

Neuroblastoma is a heterogeneous disease with diverse clinical outcomes. Recently collected genome-wide datasets provide opportunities to infer neuroblastoma subtypes more accurately than the existing classification of risk groups. To this end, we used machine learning techniques to predict overall survival and event-free survival profiles of patients. Using the model that we trained on the SEQC cohort, we can predict patient survival in an independent cohort with high accuracy (AUROC: 0.96), indicating the applicability of the model to different datasets. Additionally, we used unsupervised learning techniques that can effectively integrate multiple high-dimensional datasets to identify subgroups of patients with distinct survival profiles after stratification based on MYCN expression. These subgroups can improve treatment stratification of neuroblastoma patients.
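
As a reading aid for the AUROC figure quoted above, the following minimal sketch shows how an AUROC value can be computed from predicted risk scores and binary outcome labels, using the pairwise-ranking formulation. The function name and the toy scores and labels are hypothetical and are not taken from the study; the abstract does not state how its AUROC was computed.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise ranking.

    scores: predicted risk scores (higher means more likely positive).
    labels: 1 for the positive class (e.g. an event), 0 for the negative class.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    # Count positive/negative pairs that are ranked correctly; ties count as half.
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data for illustration only.
toy_scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 0, 0]
toy_auc = auroc(toy_scores, toy_labels)
```

An AUROC of 1.0 means every positive sample is scored above every negative sample, while 0.5 corresponds to random ranking.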

Accumulation of Potential Driver Genes with Genomic Alterations Predicts Survival in High-Risk Neuroblastoma
Date: Saturday, July 22
Time: 11:50 am - 12:20 pm
Room: North Hall
  • Chen Suo, Collaborative Innovation Center for Genetics and Development, Ministry of Education Key Laboratory of Contemporary Anthropology and the State Key Laboratory of Genetic Engineering, School of Life Scie, China
  • Wenjiang Deng, Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Sweden
  • Trung Nghia Vu, Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Sweden
  • Leming Shi, Collaborative Innovation Center for Genetics and Development, Ministry of Education Key Laboratory of Contemporary Anthropology and the State Key Laboratory of Genetic Engineering, School of Life Scie, China
  • Yudi Pawitan, Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Sweden

Presentation Overview:

Background: Neuroblastoma is the most common pediatric malignancy with heterogeneous clinical behaviors, ranging from spontaneous regression to aggressive progression. Many studies have identified potential aberrations related to the pathogenesis and prognosis, but predicting tumor progression and clinically managing high-risk patients remain major challenges.

Method: We integrate gene-level expression, array-based comparative genomic hybridization and functional gene-interaction-network profile of 145 neuroblastoma patients to detect potential driver genes. The drivers are summarized within each patient into a score (DGscore), and we then validate its clinical relevance in terms of association with patient survival.

Results: Focusing on the subset of 48 clinically defined high-risk patients, we identify 193 recurrent copy number aberrations (CNAs), resulting in 274 altered genes with copy number gain or loss which have corresponding impact on the gene expression. Using a network enrichment analysis, we detect four common driver genes, ERCC6, HECTD2, KIAA1279, EMX2, and 66 patient-specific driver genes. Patients with high DGscore, i.e. carrying more copy-number-altered genes with correspondingly up or down-regulated expression and functional implications, have worse survival than those with low DGscore (P = 0.006). Furthermore, Cox proportional-hazards regression analysis indicates that, adjusted for age, tumor stage or MYCN amplification, DGscore is the only significant prognostic factor for high-risk neuroblastoma patients (P = 0.008).

Conclusions: Integration of genomic copy-number alteration, expression and functional interaction-network data reveals clinically relevant and prognostic putative driver genes in neuroblastoma. The identified putative drivers may give us new drug targets for individualized therapy.
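
The abstract describes the DGscore only in words (more copy-number-altered driver genes with correspondingly up- or down-regulated expression give a higher score). The sketch below is one possible reading of that description, not the authors' implementation: it assumes hypothetical per-gene calls coded as -1 (loss or down-regulation), 0 (no change) and +1 (gain or up-regulation), and counts driver genes whose copy-number call and expression call agree. The patient values are invented for illustration.

```python
def dg_score(copy_number, expression, driver_genes):
    """Count driver genes with a copy-number change and a concordant expression change.

    copy_number, expression: dicts mapping gene -> -1, 0 or +1 (assumed encoding).
    driver_genes: iterable of gene names considered putative drivers.
    """
    score = 0
    for gene in driver_genes:
        cn = copy_number.get(gene, 0)
        ex = expression.get(gene, 0)
        # A gene contributes only if it is altered and expression moves the same way.
        if cn != 0 and cn == ex:
            score += 1
    return score


# The four common driver genes named in the abstract; the calls below are made up.
drivers = ["ERCC6", "HECTD2", "KIAA1279", "EMX2"]
toy_copy_number = {"ERCC6": -1, "HECTD2": -1, "KIAA1279": 0, "EMX2": 1}
toy_expression = {"ERCC6": -1, "HECTD2": 1, "KIAA1279": -1, "EMX2": 1}
toy_score = dg_score(toy_copy_number, toy_expression, drivers)
```

In this toy case ERCC6 and EMX2 are concordant, HECTD2 is discordant and KIAA1279 has no copy-number change, so the score is 2.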

Multi-omics integration for neuroblastoma clinical endpoint prediction
Date: Saturday, July 22
Time: 2:00 pm - 2:30 pm
Room: North Hall
  • Margherita Francescatto, Fondazione Bruno Kessler, Trento, Italy
  • Setareh Rezvan Dezfooli, Fondazione Bruno Kessler, Trento, Italy
  • Alessandro Zandonà, Fondazione Bruno Kessler, Trento, Italy; CIBIO, University of Trento, Italy; DEI, University of Padova, Italy
  • Marco Chierici, Fondazione Bruno Kessler, Trento, Italy
  • Giuseppe Jurman, Fondazione Bruno Kessler, Trento, Italy
  • Cesare Furlanello, Fondazione Bruno Kessler, Trento, Italy

Presentation Overview:

Recent high-throughput methodologies such as microarrays and next-generation sequencing are well established and routinely used in cancer research, generating large amounts of complex data at different omics layers. The effective integration of omics data could provide a broader insight into the mechanisms of cancer biology, helping researchers and clinicians to develop personalized therapies. Here we explore the use of Integrative Network Fusion, a bioinformatics framework combining a novel similarity network fusion method and machine learning for the integration of multiple omics data. We apply the framework to the classification of neuroblastoma patients belonging to multiple disease stages, integrating microarray, RNAseq and array comparative genomic hybridization data. We provide detailed results for one case study, the integration of microarray and CNV data for the classification and prediction of Event-Free Survival in patients. Further, we discuss ongoing work on the dataset proposed for the CAMDA2017 neuroblastoma challenge.

Integration of Molecular Features with Clinical Information for Predicting Outcome for Neuroblastoma Patients
Date: Saturday, July 22
Time: 2:30 pm - 2:50 pm
Room: North Hall
  • Yatong Han, Harbin Engineering University, China
  • Jie Zhang, The Ohio State University, United States
  • Chao Wang, Thermo Fisher Scientific, China
  • Xiufen Ye, Harbin Engineering University, China
  • Yusong Liu, Harbin Engineering University, China
  • Kun Huang, The Ohio State University, United States

Presentation Overview:

Neuroblastoma (NB) is the most common extracranial solid tumor in children. NB in about 50% of pediatric patients will metastasize and result in a poor outcome. In order to provide better prognosis and facilitate individualized precise treatment, we developed a novel workflow that integrates clinical information and molecular features such as gene expression for prognosis. First, we mined co-expressed gene modules from microarray and RNA-seq data using the weighted network mining algorithm lmQCM; second, we built a weight matrix with module eigengenes and a consensus clustering method called Molecular Regularized Consensus Patient Stratification (MRCPS), which aggregates both essential clinical information and multiple eigengene data for patient stratification. Our method improves prognosis significantly by regularizing the clinical partition of patients using the additional weight matrix information. Our results suggest that this method predicts survival better than using only genetic data or only clinical diagnosis results. In addition, a subgroup of patients with extremely poor survival in the early months was identified.

Integration analysis based on survival associated co-expression gene modules for predicting neuroblastoma patients survival times
Date: Saturday, July 22
Time: 2:50 pm - 3:10 pm
Room: North Hall
  • Yatong Han, Harbin Engineering University, China
  • Xiufen Ye, Harbin Engineering University, China
  • Jun Cheng, Southern Medical University, China
  • Siyuan Zhang, Harbin Engineering University, China
  • Jie Zhang, The Ohio State University, United States
  • Kun Huang, The Ohio State University, United States

Presentation Overview:

In this paper, we provide a workflow to improve survival prognosis for neuroblastoma patients. With a step of gene co-expression network/module (GCN) mining in microarray and RNA-seq data, we extracted the molecular features from each module and summarized them into eigengenes. Then we adopted the lasso-regularized Cox proportional hazards model to select the most informative eigengene features in terms of association with the risk of metastasis. Nine eigengenes were selected that show strong association with patient survival prognosis. All nine modules also have highly enriched biological functions or cytoband locations. Three of them are modules unique to RNA-seq data, which complement the modules from microarray in terms of survival prognosis. We then merged all eigengenes from the nine modules and used an integrative method called Similarity Network Fusion to test the prognostic power of these eigengenes. The prognostic accuracies are significantly improved compared with using all eigengenes, and two subgroups of patients with very poor survival rates were identified.

A multi-layer network approach to data integration for patient stratification
Date: Saturday, July 22
Time: 3:10 pm - 3:30 pm
Room: North Hall
  • Maciej Kandula, Chair of Bioinformatics Research Group, Boku University Vienna, Austria
  • Swati Singh, Department of Biological Sciences and Bioengineering, Indian Institute of Technology, Kanpur, India
  • Eric Kolaczyk, Department of Mathematics and Statistics, Boston University, United States
  • David P. Kreil, Chair of Bioinformatics Research Group, Boku University Vienna, Austria

Presentation Overview:

The Big Data bottleneck experienced in both basic and applied research, which remains rate-limiting in the translation of experimental advances to the clinic, is the identification and interpretation of biologically relevant patterns in genome scale data. A lot of hope is now being placed in analysis approaches which combine measurements from different sources. With complementary data collected for each patient, it is natural to try to combine all of a patient's interrelated information into a single joint representation. Recently, network representations have been explored to exploit not just the complementary nature of the data sources but also similarities across patients. Networks have already been applied to identify dysregulated pathways, optimize biotechnological processes, and predict patient survival. We here introduce a novel network-based approach for the integration of multiple molecular and clinical data types that also incorporates prior knowledge from curated databases, exploring performance in comparative quantitative benchmarks. Our algorithm creates a multi-layer network, where the highest level is a network of patients, and each patient has information from multiple data types, where each data type is itself characterized by a network. Notably, prior functional knowledge is incorporated in their construction. This structure facilitates the identification of similarities not only between patients but also of functional modules at the molecular level, across data types. We present first promising outcomes on BRCA and will report results for the CAMDA neuroblastoma challenge data set at the conference.

Models of cell signalling uncover molecular mechanisms of high-risk neuroblastoma and predict outcome
Date: Saturday, July 22
Time: 3:30 pm - 4:00 pm
Room: North Hall
  • Marta Hidalgo, Clinical Bioinformatics Research Area. Fundación Progreso y Salud (FPS). Hospital Virgen del Rocio. 41013. Sevilla. Spain
  • Alicia Amadoz, Computational Genomics Department. Centro de Investigación Principe Felipe (CIPF). 46012 Valencia. Spain
  • Cankut Çubuk, Clinical Bioinformatics Research Area. Fundación Progreso y Salud (FPS). Hospital Virgen del Rocio. 41013. Sevilla. Spain
  • Jose Carbonell-Caballero, Computational Genomics Department. Centro de Investigación Principe Felipe (CIPF). 46012 Valencia. Spain
  • Joaquin Dopazo, Fundación Progreso y Salud (Ministry of Health), Bioinformatics Area, University Hospital Virgen del Rocío, Seville, Spain

Presentation Overview:

Despite the progress in neuroblastoma therapies, the mortality of high-risk patients is still high (40% - 50%) and the molecular basis of the disease remains poorly understood. Here we use models of cell signalling, a key process in this cancer, to understand the molecular determinants of poor prognosis. We also show how the activity of signalling circuits can be used as a predictor of survival in neuroblastoma patients.

Predicting survival times for neuroblastoma patients using RNA-Seq expression profiles
Date: Saturday, July 22
Time: 4:30 pm - 4:50 pm
Room: North Hall
  • Tyler Grimes, University of Florida, United States
  • Alejandro Walker, University of Florida, United States
  • Somnath Datta, University of Florida, United States
  • Susmita Datta, University of Florida, United States

Presentation Overview:

In this study, we undertake the CAMDA 2017 Neuroblastoma data integration challenge. In our analysis, clinical data and expression profiles from RNA-Seq data are integrated to model survival times directly. The effects of using various feature levels of expression profiles (genes, transcripts, and introns) are examined and compared to a model without RNA-Seq data. The inclusion of RNA-Seq profiles is shown to increase the prediction accuracy for both overall survival and event-free survival times. These models can also be used as a classifier to accurately identify high-risk groups.

Integration of CNV and RNA-seq data can increase the predictive power of Neuroblastoma endpoint
Date: Saturday, July 22
Time: 4:50 pm - 4:59 pm
Room: North Hall
  • Yimin Ma, East China Normal University, China
  • Jiajun Chen, East China Normal University, China
  • Tieliu Shi, East China Normal University, China

Presentation Overview:

Neuroblastoma (NB) is the most common extracranial solid tumor in children. To compare the predictive power of data integration with that of the original expression-only study, we first built two risk-score models based on RNA-seq data and CNV data, respectively. We then combined them using two different strategies. Finally, we evaluated the predictive power of these four models.
Using the Cox regression method, we built the first risk-score model with five genes. NB patients could be classified into a high-risk group or a low-risk group based on this model. Overall survival differed significantly between these two groups (P = 0.00953 in the testing set). In addition, this model can further subdivide each of the clinically defined high- and low-risk groups into two subgroups.
By applying similar procedures, we selected four CNV loci and built the second risk-score model. This model can also classify the matched NB patients, but its predictive power was weaker than that of the RNA-seq based model (P = 0.0884 in the testing set).
To test whether integrating the two data types (CNV and RNA-seq) can increase predictive power, we combined the two individual models using two strategies. The first strategy defined the new high-risk group as the patients classified as high-risk by both individual models; this significantly improved the predictive power (P = 0.00228 in the testing set). The second strategy combined the high-risk samples defined by either individual model into a new high-risk group, but the predictive power of this strategy improved only marginally (P = 0.0412 in the testing set). However, the clinically defined high-risk samples were entirely included in the expanded high-risk group.
According to the clinical information, about three quarters of the NB patients were alive and most patients were defined as low-risk. When we redefined the high-risk group as the overlap of the classification results of both models, which makes the samples more consistent with the risk distribution of this disease, the predictive power of the new model improved significantly. This suggests that integration strategies should be chosen according to the purpose and the data in order to improve predictive performance.
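
The two combination strategies described in this abstract amount to taking the intersection or the union of the high-risk groups from the RNA-seq and CNV models. A minimal sketch, assuming each model simply returns a set of patient identifiers it calls high-risk (the function name and the patient identifiers are hypothetical):

```python
def combine_high_risk(high_rna, high_cnv, strategy):
    """Combine the high-risk calls of two models.

    strategy "intersection": high-risk only if both models agree (first strategy).
    strategy "union": high-risk if either model says so (second strategy).
    """
    rna, cnv = set(high_rna), set(high_cnv)
    if strategy == "intersection":
        return rna & cnv
    if strategy == "union":
        return rna | cnv
    raise ValueError("strategy must be 'intersection' or 'union'")


# Hypothetical patient identifiers for illustration only.
toy_high_rna = {"P01", "P02", "P03"}
toy_high_cnv = {"P02", "P03", "P04", "P05"}
strict_group = combine_high_risk(toy_high_rna, toy_high_cnv, "intersection")
expanded_group = combine_high_risk(toy_high_rna, toy_high_cnv, "union")
```

The intersection yields a smaller, stricter high-risk group, while the union yields an expanded one, which mirrors the trade-off reported in the abstract.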

Computational Approaches to Assessing Clinical Relevance of Pre-clinical Cancer Models
Date: Saturday, July 22
Time: 4:59 pm - 5:07 pm
Room: North Hall
  • Vladimir Uzun, University of Sheffield, United Kingdom
  • Ian Sudbery, University of Sheffield, United Kingdom
  • James Bradford, University of Sheffield, United Kingdom

Presentation Overview:

TBA

Analysis of CAMDA RNA-seq data with the knowledge of protein domains in genes
Date: Saturday, July 22
Time: 5:07 pm - 5:16 pm
Room: North Hall
  • Anna Leśniewska, Institute of Computer Science, Poznan University of Technology, Poland
  • Alicja Szabelska-Beręsewicz, Department of Mathematical and Statistical Methods Poznan University of Life Sciences, Poland
  • Joanna Zyprych-Walczak, Department of Mathematical and Statistical Methods Poznan University of Life Sciences, Poland
  • Michał Okoniewski, Scientific IT Services, ETH Zurich, Switzerland

Presentation Overview:

In RNA sequencing with short reads, it is often not possible to assign an RNA fragment to a gene due to similarities in repetitive regions or protein domains. This may influence the downstream analysis. We have compiled a gene-domain database and used it to examine the differences between genes that share a domain and the rest of the genes in the Neuroblastoma dataset. The major findings are:
* pairs of genes that share a domain have increased Pearson's correlation coefficients of counts
* the distribution of correlation coefficients for those pairs leans more towards positive values for smaller numbers of biological samples
* using diverse primary analysis counting strategies on non-CAMDA datasets suggests that the increased correlation reflects real biological co-expression rather than sequence-based artifacts
* genes sharing a domain are expected to have lower predictive power due to increased correlation, but with various types of classifiers the number of misclassified samples does not yet show an obvious dependence
* various classifiers perform very differently on the CAMDA data, which indicates that clinical application of gene signatures from similar datasets may be difficult
We have to admit that the outcomes sometimes do not follow our intuition and experience from standard RNA-seq analysis. That is why we would like to present them at the CAMDA meeting and discuss them there with experts in the area.
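
The first finding in this abstract rests on Pearson's correlation coefficient between the read counts of two genes across samples. The sketch below computes that coefficient for one gene pair; the count vectors are invented for illustration, and the abstract does not specify how counts were normalized before correlation.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient between two equally long count vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two vectors of equal length with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical counts of two genes across five samples.
toy_gene_a = [10, 20, 30, 40, 50]
toy_gene_b = [12, 25, 31, 47, 55]
toy_r = pearson(toy_gene_a, toy_gene_b)
```

Applying such a function to every gene pair that shares a domain, and to a comparison set of pairs that do not, would give the two distributions of coefficients that the findings compare.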

Microbiome Diversity on Materials
Date: Saturday, July 22
Time: 5:16 pm - 5:24 pm
Room: North Hall
  • Chandrima Bhattacharya, Indian Institute of Engineering Science and Technology, Shibpur, India
  • Pinaki Chakraborty, Indian Institute of Engineering Science and Technology, Shibpur, India
  • Rohit Pandey, Indian Institute of Engineering Science and Technology, Shibpur, India
  • Malay Bhattacharyya, Indian Institute of Engineering Science and Technology, Shibpur, India

Presentation Overview:

The study of the microbiome is promising for understanding higher-level organisms more broadly. In this paper, we analyze microbiome diversity across different materials. We focus in particular on the metagenomics data from the MetaSUB International Consortium for this purpose. With an emphasis on materials such as metals, plastics, and woods found in subways and their vicinity, we demonstrate how diverse the microbiome community might appear in multiple cities.

Codon usage diversity in city microbiomes
Date: Saturday, July 22
Time: 5:24 pm - 5:33 pm
Room: North Hall
  • Haruo Suzuki, Keio University, Japan

Presentation Overview:

For the MetaSUB Inter-City Challenge, we propose to apply annotation-independent approaches for synonymous codon usage to the microbiomes of three cities: New York City, Boston, and Sacramento. Multivariate statistical analysis identified gene features such as the codon-anticodon interaction efficiency and nucleotide content at third codon positions as major trends of variation in synonymous codon usage among genes of the metagenomes. We also found that diversity in synonymous codon usage was high in Sacramento, intermediate in Boston, and low in New York City. Our results suggest that codon usage can provide additional information on genetic diversity in microbiomes.
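
One of the gene features mentioned above, nucleotide content at third codon positions, is commonly summarized as GC3 (the fraction of G or C at the third position of each codon). The sketch below computes GC3 and raw codon counts for a single in-frame coding sequence. It is only an illustration of the feature; the abstract does not state which statistics or software were used, and the example sequence is made up.

```python
from collections import Counter


def split_codons(cds):
    """Split an in-frame coding sequence into a list of codons."""
    cds = cds.upper()
    if len(cds) == 0 or len(cds) % 3 != 0:
        raise ValueError("sequence length must be a positive multiple of 3")
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]


def gc3(cds):
    """Fraction of codons whose third position is G or C."""
    codons = split_codons(cds)
    return sum(codon[2] in "GC" for codon in codons) / len(codons)


def codon_counts(cds):
    """Raw codon usage counts for the sequence."""
    return Counter(split_codons(cds))


# Hypothetical coding sequence: ATG GCC AAG TAA.
toy_cds = "ATGGCCAAGTAA"
toy_gc3 = gc3(toy_cds)
toy_counts = codon_counts(toy_cds)
```

Here three of the four codons end in G or C, so GC3 is 0.75.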

Unraveling bacterial fingerprints of city subways from microbiome 16S gene profiles
Date: Saturday, July 22
Time: 5:33 pm - 5:42 pm
Room: North Hall
  • Alejandro Walker, University of Florida, United States
  • Tyler Grimes, University of Florida, United States
  • Susmita Datta, University of Florida, United States
  • Somnath Datta, University of Florida, United States


Assessing reproducibility of metagenomics studies and diversity of public transport systems microbiome profiles of New York, Boston and Sacramento cities
Date: Saturday, July 22
Time: 5:42 pm - 5:51 pm
Room: North Hall
  • Alina Frolova, The Institute of Molecular Biology and Genetics of NASU, Ukraine

Presentation Overview:

As new members of the MetaSUB Consortium, we were greatly interested in analyzing the microbiome profiles of Boston, New York and Sacramento to point out important issues and problems before the upcoming global City Sampling Day 2017. Here we performed detailed quality control of raw sequences, evaluated collected metadata in the context of creating a uniform specification for data collection, assessed the reproducibility of OTU abundance calculation, verified the presence of Yersinia pestis (the causative agent of plague) and Bacillus anthracis (the causative agent of anthrax) in the NY microbiome profile, and investigated biodiversity vs biolocation. We conclude that it is important to sample fewer, more controlled environments with greater specificity and uniform coverage of meta-variables.

Identification of mobile elements in metagenomic data.
Date: Saturday, July 22
Time: 5:51 pm - 6:00 pm
Room: North Hall
  • Josef Moser, Austrian Centre of Industrial Biotechnology (ACIB), Vienna, Austria
  • Samuel Gerner, FH Campus Wien, Austria
  • Alexandra Graf, FH Campus Wien, Austria

Presentation Overview:

Little is known about mobile elements in metagenome samples, but they certainly represent important traits of microbial communities. Their potential to confer resistance and to transfer beneficial mutations and advantageous genes across species borders improves the evolutionary fitness of bacteria in ways that are sometimes disastrous for humans. Antibiotic resistance, which is partly acquired by horizontal gene transfer through mobile elements, has been termed one of the major threats humanity faces today. In current metamobilomics experiments, additional lab protocols such as plasmid purification are applied to achieve higher sensitivity and selectivity compared with whole metagenome samples. This, however, removes plasmids from the population context they exist in.
In this study, we evaluate the opportunities to study aspects of mobile elements in shotgun sequenced metagenomic data, using only bioinformatics methods. This would allow the scientific community to get a better understanding of existing data and thereby give a clearer picture of the trade-off between the different experimental approaches. Functional analysis of the plasmids found will help to elucidate the specific advantage they impart to the microbial community. Additionally, we look at CRISPR found in the provided metagenome samples to gain information about the exposure history of the microbial populations in the sample. We used the metagenome data provided by CAMDA to compare the different cities with regard to mobile elements and CRISPR content and composition.
Preliminary results on the plasmid and CRISPR content of the Boston and Sacramento data show that the Sacramento samples contain a higher abundance of both plasmids and CRISPR. Alignment of reads to known plasmids produced very limited results compared to de novo assembled plasmid candidates, highlighting the knowledge that can be gained from urban metagenome data and therefore its importance.

CAMDA Welcome and opening remarks
Date: Sunday, July 23
Time: 10:00 am - 10:10 am
Room: North Hall
  • Pawel Labaj, Apart Fellow of the Austrian Academy of Sciences, Boku University Vienna, Austria
Strain-level bacterial and viral diversity in the MetaSUB dataset
Date: Sunday, July 23
Time: 10:10 am - 10:40 am
Room: North Hall
  • Moreno Zolfo, CIBIO, University of Trento, Italy
  • Federica Pinto, CIBIO, University of Trento, Italy
  • Francesco Asnicar, CIBIO, University of Trento, Italy
  • Francesco Beghini, CIBIO, University of Trento, Italy
  • Paolo Manghi, CIBIO, University of Trento, Italy
  • Edoardo Pasolli, CIBIO, University of Trento, Italy
  • Serena Manara, CIBIO, University of Trento, Italy
  • Adrian Tett, CIBIO, University of Trento, Italy
  • Nicola Segata, CIBIO, University of Trento, Italy

Presentation Overview:

The study of the microbiomes associated with the urban and built environment is of increasing research interest, as it is recognized that the microbial diversity associated with the places where people live and work every day influences human health. While endeavours like the MetaSUB Consortium have started to characterize the structure and the composition of the microbiome of our cities, improved resolution in metagenome profiling is needed to uncover key microbial features that distinguish different strains within the same species. Small genetic differences among microbes of the same species can characterize ecological niches associated with different places, and small variations in the genomic landscape of microbes can convey important phenotypic variations. For example, the presence of toxins, antibiotic resistance genes, and virulence factors underlies the variable level of pathogenicity associated with bacterial species like Escherichia coli and Staphylococcus aureus, which can be useful commensals or life-threatening pathogens. However, metagenomic profiling has so far failed to provide strain-level characterization of the microbial diversity on large datasets, preventing the use of metagenomics as a tool for strain-level microbial epidemiology and population genomics.
We analysed here the 1614 metagenomes of the MetaSUB/CAMDA2017 dataset with a set of recently developed and new computational tools to unravel the strain-level diversity in the public transportation microbiome. All microbial strain profilers were applied directly to the raw reads and are applied to non-human microbiomes for the first time in this study. We first applied MetaMLST, a computational tool that profiles the strains in metagenomes using the Multi Locus Sequence Typing approach, and is capable of identifying both known and previously unseen (novel) sequence types (STs). Once profiled, such STs allow tracing (potentially pathogenic) strains across samples and comparing the identified strains with the extensive set of STs deposited in publicly available MLST databases containing more STs than available reference genomes. In our analysis, a total of 109 species were profiled, and 642 STs were observed. Of those, 500 are being observed here for the first time and differ by at most 10 SNVs from their closest known reference ST (Figure 1A). Among these, the most prevalent species were Acinetobacter baumannii, Enterobacter cloacae and Stenotrophomonas maltophilia, all species that are known to be abundant in the environment but that are also associated with potentially pathogenic infections in humans. Using E. coli profiles as an example, STs can be epidemiologically modelled using Minimum Spanning Trees, which highlighted that 10 of the 19 identified E. coli STs are very closely related and within the E. coli phylotype A. The remaining STs are instead clearly distinct clones from phylotypes often populating the human gut microbiome, which would be consistent with the hypothesis of a human origin of these strains (Figure 1B).
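
To illustrate how sequence types can be related through a Minimum Spanning Tree, the sketch below treats each ST as a tuple of allele numbers over the MLST loci, measures the distance between two STs as the number of loci at which they differ, and builds the tree with Prim's algorithm. This is an assumption-laden illustration and not MetaMLST's implementation: the ST names and allele profiles are hypothetical, and the abstract itself reports distances in SNVs rather than in differing loci.

```python
def allelic_distance(profile_a, profile_b):
    """Number of loci at which two allele profiles differ."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same loci")
    return sum(a != b for a, b in zip(profile_a, profile_b))


def minimum_spanning_tree(profiles):
    """Prim's algorithm over all STs; returns a list of (st_u, st_v, distance) edges."""
    names = sorted(profiles)
    if not names:
        return []
    in_tree = {names[0]}
    edges = []
    while len(in_tree) < len(names):
        best = None
        # Pick the cheapest edge that connects the tree to an ST outside it.
        for u in sorted(in_tree):
            for v in names:
                if v in in_tree:
                    continue
                d = allelic_distance(profiles[u], profiles[v])
                if best is None or d < best[0]:
                    best = (d, u, v)
        edges.append((best[1], best[2], best[0]))
        in_tree.add(best[2])
    return edges


# Hypothetical seven-locus allele profiles for four sequence types.
toy_profiles = {
    "ST_A": (1, 1, 1, 1, 1, 1, 1),
    "ST_B": (1, 1, 1, 1, 1, 1, 2),
    "ST_C": (1, 1, 1, 1, 1, 2, 2),
    "ST_D": (5, 6, 7, 8, 9, 2, 2),
}
toy_tree = minimum_spanning_tree(toy_profiles)
```

In the toy tree, ST_A, ST_B and ST_C are joined by single-locus edges and form a tight group, while ST_D attaches through a long edge, which is the kind of pattern that separates closely related STs from distinct clones.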

To further extend the strain-level profiling to larger portions of the genome, we applied StrainPhlAn, a tool that characterizes strains by analysing single nucleotide variations (SNVs) in clade-specific marker genes. We reconstructed whole-genome scale phylogenies for the strains belonging to the six most abundant species in the dataset (Pseudomonas stutzeri, S. maltophilia, Enterobacter cloacae, Propionibacterium acnes, Acinetobacter pittii and Acinetobacter radioresistens), and found remarkable sub-species variability in the samples. As expected, when associating such strain diversity with location and surface type we did not find strong genetic patterns, consistent with the high level of microbial seeding these samples are exposed to. Nevertheless, we were able to identify very closely related strains present along the same subway line (trains and stations) in the Boston dataset (Figure 2A). In other cases, we could identify the same clones within the same station, such as for the P. stutzeri strains found on lines NY-M and NY-R (data not shown).
The genetic variability across samples of the profiled strains highlighted in some cases the presence of discrete clusters (sub-species) within species. Specifically, we identified two and three subspecies-level clusters for P. stutzeri and S. maltophilia respectively (Figure 2D,2E). Even though no specific associations with sample types were found, these genetic niches could reflect different functional potential properties that are adapting to similar environments.
Another key aspect in microbial ecology is the presence of multiple related strains of the same species in a sample. By analysing the polymorphisms in clade-specific marker genes for all the considered species, it is possible to infer the presence of more than one strain of a given species. We performed such an analysis with StrainPhlAn, and we highlight that E. cloacae, P. acnes and A. pittii have high rates of polymorphism, suggesting that in the large majority of samples these species are represented by multiple strains (Figure 2C). These polymorphic rates tend to be higher than those found in the human microbiome, probably owing to the higher exposure to diverse strains and lower strain adaptation costs of environmental microbiomes.
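The idea of a polymorphic rate can be sketched as the fraction of well-covered marker positions at which no single base clearly dominates. The Python sketch below is a simplified illustration, not StrainPhlAn's code; the function name, the depth cut-off and the dominant-frequency threshold are assumptions made for the example.

```python
def polymorphic_rate(site_counts, min_depth=10, max_dominant_freq=0.8):
    """Fraction of sufficiently covered sites that look polymorphic.

    site_counts: iterable of dicts mapping base -> read count, one dict
    per position in the marker genes. The thresholds are illustrative
    assumptions, not values taken from the study.
    """
    covered = 0
    polymorphic = 0
    for counts in site_counts:
        depth = sum(counts.values())
        if depth == 0 or depth < min_depth:
            continue  # too little coverage to judge this site
        covered += 1
        if max(counts.values()) / depth < max_dominant_freq:
            polymorphic += 1  # no base reaches the dominance threshold
    return polymorphic / covered if covered else 0.0


# Hypothetical pileup counts for four marker positions.
example_sites = [
    {"A": 20},
    {"A": 12, "G": 8},
    {"C": 3},
    {"T": 19, "C": 1},
]
example_rate = polymorphic_rate(example_sites)
```

A high rate across a species' markers suggests that the sample contains more than one strain of that species.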
Finally, we developed and applied a new computational pipeline aimed at profiling the viral fraction of microbial communities. This is based on mapping the reads against known or newly assembled phage genomes and identifying sample-specific SNVs for each detected viral genome. We identified a total of 73 viral genomes present in at least 5 samples. By reconstructing the dominant allele at the identified SNV positions, we then reconstructed the sample-specific genomes of several members of the virome. For example, the pipeline reconstructed the genomes of 11 distinct bacteriophages usually associated with Bacillus species, with a variable level of similarity to previously available sequences (Figure 1B). More in-depth analyses of the viral fraction of the MetaSUB dataset and its relation to the bacterial community are being conducted, and will be ready to be presented at CAMDA/ECCB 2017.
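Reconstructing a sample-specific genome from dominant alleles can be illustrated as follows. This is a minimal Python sketch under the assumption that per-position base counts at the SNV sites are already available from read mapping; it does not reproduce the pipeline described above, and the function name and data are hypothetical.

```python
def dominant_allele_consensus(reference, snv_counts):
    """Replace reference bases with the most frequent observed base.

    reference: reference genome sequence as a string.
    snv_counts: dict mapping 0-based position -> {base: read count}
    for the SNV positions detected in one sample. On ties, the base
    listed first in the dict is kept.
    """
    consensus = list(reference)
    for position, counts in snv_counts.items():
        if counts:  # leave the reference base if nothing was observed
            consensus[position] = max(counts, key=counts.get)
    return "".join(consensus)


# Hypothetical example: one position changes, one stays as in the reference.
example_consensus = dominant_allele_consensus(
    "ACGTAC",
    {1: {"C": 2, "T": 9}, 4: {"A": 7, "G": 1}},
)
```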

Together, these findings show that it is possible to analyse and trace the microbes of the urban environment at strain-level, directly from the raw reads and without the need for assembly. Coupled with ongoing assembly-based efforts to uncover genomes of novel and uncharacterized microbial species, we show the potential use of metagenomic data for large-scale epidemiology, bio-surveillance, outbreak tracking, and bacterial-resistance profiling.

Viral and eukaryotic communities of urban ecosystems across US metropolitan areas
Date: Sunday, July 23
Time: 10:40 am - 11:10 am
Room: North Hall
  • Serghei Mangul, UCLA, United States
  • Nathan Lapierre, UCLA, United States
  • Igor Mandric, Georgia State University, United States
  • Lana Martin, UCLA, United States
  • Nicholas Wu, UCLA, United States
  • Eleazar Eskin, University of California, Los Angeles, United States
  • David Koslicki, Oregon State University, United States

Presentation Overview: Show

To analyze microbiome communities in urban areas, we used the metagenomics data provided in the MetaSUB Inter-City Challenge (CAMDA 2017). The dataset comprised metagenomics samples collected from subway and light rail stations across three US metropolitan areas: New York (n=1572), Boston (n=141), and Sacramento (n=18). We randomly selected 65 samples from New York subway stations and considered all samples from Sacramento and Boston stations. In total, 224 samples were available for the analysis of microbiome composition and diversity across US metropolitan areas. We used CMH and miCoP to obtain the composition of the urban virome and eukaryome from samples obtained from multiple subway stations across the three metropolitan areas. We observe that many viruses are station-specific, suggesting that the composition of the virome is shaped by the location and the distinct human populations relevant to the stations. On the other hand, we found that the eukaryome was composed of several species that are widely spread across the vast majority of the stations. Given the wide geographical distribution of those species (e.g., Aspergillus and Pythium aphanidermatum), they can be characterized as a typical North American urban eukaryome.

Assessment of urban microbiome assemblies with the help of targeted mock communities
Date: Sunday, July 23
Time: 11:10 am - 11:30 am
Room: North Hall
  • Samuel Gerner, FH Campus Wien, Vienna, Austria, Austria
  • Josef W. Moser, Austrian Centre of Industrial Biotechnology (ACIB), Vienna, Austria, Austria
  • Thomas Rattei, University of Vienna, Vienna, Austria, Austria
  • Alexandra B. Graf, FH Campus Wien, Vienna, Austria, Austria

Presentation Overview: Show

Urban microbiomes are characterized by their comparatively high population dynamics, especially at public transport sites such as subway systems, which see a high turnover of people passing through. To detect novel species and to enable a detailed analysis of microbe-microbe or host-microbe interactions in such communities, metagenomic reads have to be assembled into genomes that are ideally complete.
In this study we aim to assemble the urban metagenome datasets provided by the CAMDA MetaSUB Inter-City Challenge and to assess the quality, the taxonomic and functional content, and the putative virulence factors of the resulting bins and genomes. To analyse the performance of assembly-based methods on urban metagenomic communities, we created mock communities of varying complexity, sequencing depth and quality, based on taxonomic profiles of the provided urban metagenome samples. To our knowledge, urban metagenomes have not previously been analysed using assembly-based methods; mock communities can therefore be used to demonstrate the general applicability of these methods to such communities, enabling thorough testing of the applied assembly methods on samples with known truth. These mock community assemblies are used to propose a set of recommendations for sequencing parameters to obtain optimal assembly and binning quality for urban metagenome data.
We assembled single samples and pooled samples from the same surface type, and binned the resulting contigs. Using assembled and binned mock communities, we could demonstrate the impact of sequencing depth and quality, as well as the ability of the applied assembler and binning programs to achieve high-quality bins with minimal contamination. Bins from real samples that consist mainly of reads left unclassified by taxonomic profilers are the main targets for further analysis, as they might represent unknown species.

MetaBinG2: a fast and accurate metagenomics sequence classification method for samples with many unknown organisms
Date: Sunday, July 23
Time: 11:30 am - 12:00 pm
Room: North Hall
  • Chaochun Wei, Shanghai Jiao Tong University, China
  • Ben Jia, Shanghai Jiao Tong University, China
  • Yuyang Qiao, Shanghai Jiao Tong University, China

Presentation Overview: Show

Background: Many methods have been developed for metagenomic sequence classification, and most of them depend heavily on the known organisms. In addition, a large portion of reads may be classified as unknown, which greatly impairs our understanding of the whole sample.
Result: Here we present MetaBinG2, a fast method for metagenomic sequence classification and subsequent abundance analysis in complex environments with a large number of unknown organisms. MetaBinG2 is based on sequence composition and uses GPUs for acceleration. A million 100bp Illumina sequences can be classified within two minutes. We applied MetaBinG2 to the MetaSUB Inter-City Challenge and identified microbial community structures for different cities.
Conclusion: Compared to existing methods, MetaBinG2 is fast and highly accurate, especially for samples with a significant percentage of unknown organisms.
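To make the notion of sequence composition concrete, the following Python sketch computes a normalized k-mer frequency vector for a read, the kind of feature on which composition-based classifiers typically operate. It is an illustration only and does not reproduce MetaBinG2's model or its GPU implementation; the function name and the default k are assumptions.

```python
from collections import Counter
from itertools import product


def kmer_composition(sequence, k=2):
    """Relative frequency of every DNA k-mer in a sequence.

    k-mers containing characters other than A, C, G, T (for example N)
    are ignored. Returns a dict with all 4**k k-mers as keys.
    """
    sequence = sequence.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(
        sequence[i:i + k] for i in range(len(sequence) - k + 1)
    )
    total = sum(counts[kmer] for kmer in kmers)
    return {
        kmer: (counts[kmer] / total if total else 0.0) for kmer in kmers
    }


# Hypothetical read used only to show the shape of the output.
example_vector = kmer_composition("ACGTACGT", k=2)
```

Because such a vector depends only on the read itself, it can be compared against composition models of known taxa without requiring an alignment to a closely related reference genome.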

LAST + MEGAN-LR Approach to the Oxford Nanopore Wiggle Space Challenge
Date: Sunday, July 23
Time: 12:00 pm - 12:30 pm
Room: North Hall
  • Caner Bagci, University of Tuebingen, Germany
  • Benjamin Albrecht, University of Tuebingen, Germany
  • Dominic Boceck, University of Tuebingen, Germany
  • Ania Gorska, University of Tuebingen, Germany
  • Dino Jolic, Max-Planck Institute for Developmental Biology, Germany
  • Irina Bessarab, Singapore Centre for Environmental Life Sciences Engineering, Singapore
  • Rohan Williams, Singapore Centre for Environmental Life Sciences Engineering, Singapore
  • Daniel H. Huson, University of Tuebingen, Germany

Presentation Overview: Show

There exist many different computer programs and webservers for analyzing metagenomic datasets, aiming at taxonomic and/or functional binning and/or profiling the given data, such as MEGAN, MG-RAST or Kraken. Such tools are designed and engineered to work fast and accurately on short reads. Long reads pose a different set of challenges and require that existing approaches be adapted to address them. Here, we present a new variant of the lowest common ancestor (LCA) algorithm that is designed to perform taxonomic binning of long reads, and we demonstrate the use of the method on datasets provided by “The Oxford Nanopore ‘Wiggle Space’ Challenge” by CAMDA 2017, and on simulated Nanopore reads.
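For readers unfamiliar with LCA-based binning, the Python sketch below shows the naive form of the algorithm: a read is assigned to the deepest taxon shared by the lineages of all taxa it aligns to. This is the classical short-read idea, not the long-read variant presented in the talk; the toy taxonomy and function names are hypothetical.

```python
def lineage(taxon, parent):
    """Path from the root of the taxonomy down to the given taxon."""
    path = [taxon]
    while taxon in parent:
        taxon = parent[taxon]
        path.append(taxon)
    return path[::-1]  # root first


def lowest_common_ancestor(taxa, parent):
    """Deepest taxon shared by the lineages of all given taxa.

    parent: dict mapping each taxon to its parent (the root has no entry).
    Returns None if taxa is empty or the lineages share no node.
    """
    lineages = [lineage(taxon, parent) for taxon in taxa]
    lca = None
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break  # the lineages diverge below this level
        lca = ranks[0]
    return lca


# Toy taxonomy (illustrative only).
toy_parent = {
    "Bacteria": "root",
    "Proteobacteria": "Bacteria",
    "Firmicutes": "Bacteria",
    "Escherichia coli": "Proteobacteria",
    "Pseudomonas stutzeri": "Proteobacteria",
    "Bacillus subtilis": "Firmicutes",
}
example_lca = lowest_common_ancestor(
    ["Escherichia coli", "Pseudomonas stutzeri"], toy_parent
)
```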

Towards a scientific blockchain framework for reproducible data analysis
Date: Sunday, July 23
Time: 2:00 pm - 2:55 pm
Room: North Hall
  • Cesare Furlanello, FBK, Italy

Presentation Overview: Show

Keynote - Cesare Furlanello

Rectified Factor Networks for Biclustering of Omics Data
Date: Sunday, July 23
Time: 2:55 pm - 3:25 pm
Room: North Hall
  • Djork-Arné Clevert, Bayer AG, Germany

Presentation Overview: Show

Motivation: Biclustering has become a major tool for analyzing large data sets given as a matrix of samples times features and has been successfully applied in life sciences and e-commerce for drug design and recommender systems, respectively. FABIA, one of the most successful biclustering methods, is a generative model that represents each bicluster by two sparse membership vectors: one for the samples and one for the features. However, FABIA is restricted to about 20 code units because of the high computational complexity of computing the posterior. Furthermore, code units are sometimes insufficiently decorrelated and sample membership is difficult to determine. We propose to use the recently introduced unsupervised Deep Learning approach Rectified Factor Networks (RFNs) to overcome the drawbacks of existing biclustering methods. RFNs efficiently construct very sparse, non-linear, high-dimensional representations of the input via their posterior means. RFN learning is a generalized alternating minimization algorithm based on the posterior regularization method, which enforces non-negative and normalized posterior means. Each code unit represents a bicluster, where samples for which the code unit is active belong to the bicluster and features that have activating weights to the code unit belong to the bicluster.
Results: On 400 benchmark data sets and on three gene expression data sets with known clusters, RFN outperformed 13 other biclustering methods including FABIA. On data of the 1000 Genomes Project, RFN could identify DNA segments which indicate that interbreeding with other hominins started already before the ancestors of modern humans left Africa.
Availability and Implementation: https://github.com/bioinf-jku/librfn
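The membership rule stated in the abstract (one bicluster per code unit) can be sketched in a few lines of Python. The example assumes that posterior means and weights have already been computed by some RFN implementation; it does not call librfn, and the function name, the weight threshold and the small matrices are hypothetical.

```python
def extract_biclusters(activations, weights, weight_threshold=0.0):
    """Read one bicluster per code unit from activations and weights.

    activations: list of rows (one per sample) of non-negative
    posterior means, one column per code unit.
    weights: list of rows (one per feature), one column per code unit.
    A sample belongs to a bicluster if its code unit is active (> 0);
    a feature belongs if its weight exceeds weight_threshold, which is
    an assumption about what counts as an activating weight.
    """
    n_units = len(activations[0]) if activations else 0
    biclusters = []
    for unit in range(n_units):
        samples = [i for i, row in enumerate(activations) if row[unit] > 0]
        features = [
            j for j, row in enumerate(weights) if row[unit] > weight_threshold
        ]
        biclusters.append((samples, features))
    return biclusters


# Hypothetical values: 3 samples, 4 features, 2 code units.
example_activations = [[0.9, 0.0], [0.4, 0.0], [0.0, 1.2]]
example_weights = [[0.7, 0.0], [0.0, 0.5], [0.3, 0.0], [0.0, 0.0]]
example_biclusters = extract_biclusters(example_activations, example_weights)
```

Because the posterior means are sparse and non-negative, most entries are exactly zero, which is what makes the membership decision a simple comparison.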

Applying meta-analysis to Genotype-Tissue Expression data from multiple tissues to identify eQTLs and increase the number of eGenes
Date: Sunday, July 23
Time: 3:25 pm - 3:55 pm
Room: North Hall
  • Dat Duong, UCLA, U.S.A

Presentation Overview: Show

Motivation: There is recent interest in using gene expression data to contextualize findings from traditional genome wide association studies (GWAS). Conditioned on a tissue, expression quantitative trait loci (eQTLs) are genetic variants associated with gene expression, and eGenes are genes whose expression levels are associated with genetic variants. eQTLs and eGenes provide great supporting evidence for GWAS hits and important insights into the regulatory pathways involved in many diseases. When a significant variant or a candidate gene identified by GWAS is also an eQTL or eGene, there is strong evidence to further study this variant or gene. Multi-tissue gene expression datasets like the Genotype-Tissue Expression (GTEx) data are used to find eQTLs and eGenes. Unfortunately, these datasets often have small sample sizes in some tissues. For this reason, many meta-analysis methods have been designed to combine gene expression data across many tissues to increase power for finding eQTLs and eGenes. However, these existing techniques are not scalable to datasets containing many tissues, like the GTEx data. Furthermore, these methods ignore the biological insight that the same variant may be associated with the same gene across similar tissues.
Result: We introduce a meta-analysis model that addresses these problems in existing methods. We focus on the problem of finding eGenes in gene expression data from many tissues, and show that our model is better than other types of meta-analyses.
Availability: Source code and supplementary data are at https://github.com/datduong/RECOV.
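As background on what combining evidence across tissues means, the Python sketch below implements a standard fixed-effects (inverse-variance weighted) meta-analysis of per-tissue effect estimates for one variant-gene pair. This is a generic baseline and not the model introduced in the talk; in particular it treats tissues as independent. The function name and input values are hypothetical.

```python
import math


def fixed_effects_meta(betas, std_errors):
    """Inverse-variance weighted fixed-effects meta-analysis.

    betas: per-tissue effect size estimates for one variant-gene pair.
    std_errors: the corresponding standard errors.
    Returns (combined effect, its standard error, z score,
    two-sided p-value under a normal approximation).
    """
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    se = math.sqrt(1.0 / total_weight)
    z = beta / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return beta, se, z, p_value


# Hypothetical estimates from three tissues.
meta_beta, meta_se, meta_z, meta_p = fixed_effects_meta(
    [0.20, 0.30, 0.25], [0.1, 0.1, 0.1]
)
```

Pooling shrinks the standard error relative to any single tissue, which is the source of the power gain that motivates meta-analysis when per-tissue sample sizes are small.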

Best presentation award voting
Date: Sunday, July 23
Time: 3:55 pm - 4:00 pm
Room: North Hall
  • Joaquin Dopazo, Fundación Progreso y Salud (Ministry of Health), Bioinformatics Area, University Hospital Virgen del Rocío, Seville, Spain, Spain
Future challenges in Big Data: Precision Medicine, Human Exposome, …
Date: Sunday, July 23
Time: 4:30 pm - 5:00 pm
Room: North Hall
  • Wenzhong Xiao, Immuno-Metabolic Computational Center, Massachusetts General Hospital, Harvard Medical School, Boston MA, USA, and Computational Genomics lab, Stanford Genome Technology Center, Stanford CA, USA, United States

Presentation Overview: Show

A presentation of future challenges and opportunities for CAMDA in 2018 and beyond. Discussion about possible next Contest Data Sets.

CAMDA Panel discussion: key insights & future challenges
Date: Sunday, July 23
Time: 5:00 pm - 5:40 pm
Room: North Hall
  • Pawel Labaj, Apart Fellow of the Austrian Academy of Sciences, Boku University Vienna, Austria
CAMDA Award announcements and closing remarks
Date: Sunday, July 23
Time: 5:40 pm - 6:00 pm
Room: North Hall
  • David P. Kreil, Chair of Bioinformatics Research Group, Boku University Vienna, Austria