Leading Professional Society for Computational Biology and Bioinformatics
Connecting, Training, Empowering, Worldwide

MICROBIOME

COSI Track Presentations

Attention Presenters - please review the Speaker Information Page available here
Schedule subject to change
Monday, July 9th
2:00 PM-2:40 PM
Microbiome COSI Keynote I: Why is species-level classification so hard and how can we make it easy?
Room: Columbus KL
  • Adam Phillippy, National Human Genome Research Institute, National Institutes of Health, United States

Accurate species-level taxonomic classification and profiling of complex microbial communities remains a challenge due to homologous regions shared among closely related species and a sparse representation of microbial diversity in the database. I will present some examples of species-level classification gone wrong and propose possible solutions to this problem, including improvements to both classification methods and public databases. Key to these advances are new, democratized sequencing platforms that promise to generate sequence data on a previously unimaginable scale.

2:40 PM-3:00 PM
CAMI Overview, Introduction CAMI II challenges
Room: Columbus KL
  • Alice McHardy, Helmholtz Centre for Infection Research, Germany
3:00 PM-3:10 PM
CAMISIM: Simulating metagenomes and microbial communities
Room: Columbus KL
  • Adrian Fritz, Helmholtz Centre for Infection Research, Germany
  • Peter Hofmann, Helmholtz Centre for Infection Research, Germany
  • Stephan Majda, Heinrich Heine University Dusseldorf, Germany
  • Eik Dahms, Helmholtz Centre for Infection Research, Germany
  • Johannes Dröge, Chalmers University of Technology, Sweden
  • Jessika Fiedler, Heinrich Heine University Dusseldorf, Germany
  • Till R. Lesker, Helmholtz Centre for Infection Research, Germany
  • Peter Belmann, Helmholtz Centre for Infection Research, Germany
  • Matthew Z. Demaere, University of Technology, Sydney, Australia
  • Aaron E. Darling, University of Technology, Sydney, Australia
  • Alexander Sczyrba, Bielefeld University, Germany
  • Andreas Bremges, Helmholtz Centre for Infection Research, Germany
  • Alice McHardy, Helmholtz Centre for Infection Research, Germany

Studies like the Initiative for the Critical Assessment of Metagenome Interpretation (CAMI) have shown that, while metagenomic software already produces promising results, there is still a lot of room for improvement and an absolute requirement for standardized benchmarking data sets.
To overcome these obstacles, we present CAMISIM, software for the automatic generation of complete microbial communities in silico. CAMISIM has already been used successfully to create the data sets for the first CAMI challenge and offers extensive options for customizing data sets so that they represent particular microbial compositions, sampling strategies and experimental setups as closely as possible. In addition to providing full microbial sequence samples, CAMISIM always provides a ground truth for assembly, binning and profiling of the produced metagenome, which can subsequently be used to measure the performance of different metagenomic software.
We successfully used CAMISIM to create different data sets to show its value in producing both small, specialised data sets for testing metagenomic software as well as large, realistic benchmarking data sets.
CAMISIM is implemented in Python and available under the Apache 2.0 license on GitHub (https://github.com/CAMI-challenge/CAMISIM).

3:10 PM-3:20 PM
AMBER: Assessment of Metagenome BinnERs
Room: Columbus KL
  • Fernando Meyer, Helmholtz Centre for Infection Research, Germany
  • Peter Hofmann, Helmholtz Centre for Infection Research, Germany
  • Peter Belmann, Helmholtz Centre for Infection Research, Germany
  • Ruben Garrido-Oter, Max Planck Institute for Plant Breeding Research, Germany
  • Adrian Fritz, Helmholtz Centre for Infection Research, Germany
  • Alexander Sczyrba, Bielefeld University, Germany
  • Alice McHardy, Helmholtz Centre for Infection Research, Germany

Reconstructing the genomes of microbial community members is key to the interpretation of shotgun metagenome samples. Genome binning programs deconvolute reads or assembled contigs of such samples into individual bins, but assessing their quality is difficult due to the lack of evaluation software and standardized metrics. We present AMBER, an evaluation package for the comparative assessment of genome reconstructions from metagenome benchmark data sets. It calculates the performance metrics and comparative visualizations used in the first benchmarking challenge of the Initiative for the Critical Assessment of Metagenome Interpretation (CAMI). As an application, we show the outputs of AMBER for eleven different binning software options on two CAMI benchmark data sets. AMBER is implemented in Python and available under the Apache 2.0 license on GitHub (https://github.com/CAMI-challenge/AMBER).
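
As a rough illustration of the kind of per-bin metrics such an evaluation reports, the sketch below computes purity and completeness for each predicted bin against a gold-standard assignment. It is not AMBER's code: the function name `bin_purity_completeness`, the input layout (contig to bin, contig to genome, contig to length), and the rule of mapping each bin to the genome that contributes the most base pairs are all assumptions made for this example.

```python
from collections import defaultdict


def bin_purity_completeness(predicted, truth, lengths):
    """Return {bin: (mapped_genome, purity, completeness)}, weighted by base pairs.

    predicted: contig id -> predicted bin id (unbinned contigs are simply absent)
    truth:     contig id -> true genome id (the ground truth)
    lengths:   contig id -> contig length in bp
    """
    # Total base pairs of each true genome across all contigs.
    genome_size = defaultdict(int)
    for contig, genome in truth.items():
        genome_size[genome] += lengths[contig]

    # Base pairs of each genome that ended up in each predicted bin.
    overlap = defaultdict(lambda: defaultdict(int))
    for contig, bin_id in predicted.items():
        overlap[bin_id][truth[contig]] += lengths[contig]

    results = {}
    for bin_id, per_genome in overlap.items():
        bin_size = sum(per_genome.values())
        # Hypothetical mapping rule: the genome with the most base pairs in the bin.
        genome, shared = max(per_genome.items(), key=lambda item: item[1])
        results[bin_id] = (genome, shared / bin_size, shared / genome_size[genome])
    return results


# Toy data set: two genomes, four contigs, two predicted bins.
lengths = {"c1": 1000, "c2": 3000, "c3": 2000, "c4": 4000}
truth = {"c1": "gA", "c2": "gA", "c3": "gB", "c4": "gB"}
predicted = {"c1": "bin1", "c2": "bin1", "c3": "bin1", "c4": "bin2"}
metrics = bin_purity_completeness(predicted, truth, lengths)
```

In this toy case `bin1` recovers all of genome `gA` but is contaminated by a contig from `gB`, while `bin2` is pure but recovers only part of `gB`; the two metrics separate these failure modes.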

3:20 PM-3:40 PM
CAMI Evaluation Metrics: Assembly, Profiling
Room: Columbus KL
  • Alexander Sczyrba, Bielefeld University, Germany

The Critical Assessment of Metagenome Interpretation (CAMI) aims to evaluate methods for metagenome analysis comprehensively and objectively by establishing standards through community involvement in the design of benchmark data sets, evaluation procedures, choice of performance metrics and questions to focus on. The benchmarking results can only be as meaningful as the metrics used for performance measurements. In this session, we will introduce the metrics used for benchmarking assembly and profiling tools during the first CAMI challenge and show examples of benchmarking results for these tools. We encourage the community to comment on these metrics and propose additional metrics for future challenges (see also session: CAMI - how to get involved).

3:40 PM-4:00 PM
A CAMI metagenomic Hi-C challenge: what should it look like?
Room: Columbus KL
  • Aaron E. Darling, University of Technology, Sydney, Australia

Metagenomic Hi-C is an emerging technology that provides primary observational data on physical contacts among DNA strands in a sample. Metagenomic Hi-C has been demonstrated to provide a useful signal for assembling genomes from metagenomes, both on mock and natural microbial communities, yielding insights that would be difficult to obtain with other short or long read sequencing technologies. Recent advances in Hi-C protocols have led to commercial kit offerings for sample processing from companies like Phase Genomics and Arima Genomics, and the wide availability of these kits is expected to lead to greater demand for algorithms to process metagenomic Hi-C data. Yet metagenomic Hi-C data analysis algorithms are still in their infancy, and both the developer and user communities could benefit from high quality evaluation data sets. In this short talk I will announce a recently launched project to create such datasets for the community, and I will solicit input on various options and alternatives for benchmark datasets.

4:00 PM-4:40 PM
Coffee Break
4:40 PM-5:20 PM
Microbiome COSI Keynote II: Methods for multi'omics in microbial community population studies
Room: Columbus KL
  • Curtis Huttenhower, Harvard University, United States

The human microbiome - the collection of microbial organisms residing in and on the body, mostly in the gut - has been associated with diseases ranging from autism to cancer, but the causative molecular or ecological mechanisms are difficult to discern. In particular, it remains challenging to integrate multiple types of molecular data detailing the microbiome's interaction with human hosts in large, epidemiology-scale studies. The Integrative Human Microbiome Project (iHMP or "HMP2") is one of several efforts to better understand microbiome and immune molecular activity as it affects human health. I will describe our work in the HMP2 longitudinally profiling gut microbiome multi'omics and host immune activity in inflammatory bowel disease, in addition to computational methods developed over the course of the project to facilitate bioinformatic and statistical analyses. Results to date include strain haplotyping and tracking within and among subjects, linking small molecule metabolites to microbial activity, and functional profiling of metatranscriptomes during inflammation. I will conclude with suggestions for future applications and open questions in microbiome functional 'omics, quantitative methods, and epidemiology more broadly in public health.

5:20 PM-5:40 PM
Proceedings Presentation: Viral quasispecies reconstruction via tensor factorization with successive read removal
Room: Columbus KL
  • Soyeon Ahn, The University of Texas at Austin, United States
  • Ziqi Ke, The University of Texas at Austin, United States
  • Haris Vikalo, The University of Texas at Austin, United States

As RNA viruses mutate and adapt to environmental changes, often developing resistance to antiviral vaccines and drugs, they form an ensemble of viral strains – a viral quasispecies. While high-throughput sequencing has enabled in-depth studies of viral quasispecies, sequencing errors and limited read lengths render the problem of reconstructing the strains and estimating their spectrum challenging. Inference of viral quasispecies is difficult due to generally non-uniform frequencies of the strains, and is further exacerbated when the genetic distances between the strains are small.
This paper presents TenSQR, an algorithm that uses a tensor factorization framework to analyze high-throughput sequencing data and reconstruct viral quasispecies characterized by highly uneven frequencies of its components. Fundamentally, TenSQR performs clustering with successive data removal to infer strains in a quasispecies in order from the most to the least abundant one; every time a strain is inferred, sequencing reads generated from that strain are removed from the dataset. The proposed successive strain reconstruction and data removal enables discovery of rare strains in a population and facilitates detection of deletions in such strains. Results on simulated datasets demonstrate that TenSQR can reconstruct full-length strains having widely different abundances, generally outperforming state-of-the-art methods at diversities 1-10% and detecting long deletions even in rare strains. A study on a real HIV-1 dataset demonstrates that TenSQR outperforms competing methods in experimental settings as well. Finally, we apply TenSQR to analyze a Zika virus sample and reconstruct the full-length strains it contains.
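
The successive-removal idea described above can be illustrated with a deliberately simplified toy. This is not TenSQR: the tensor factorization step is replaced by a per-position majority vote, every read is assumed to span the whole region, and the names `successive_reconstruction` and `max_mismatch` are hypothetical. The sketch only shows why removing the reads of an inferred strain helps expose rarer strains in later rounds.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def successive_reconstruction(reads, max_mismatch=1):
    """Infer strains from most to least abundant, removing explained reads.

    Toy assumption: every read covers the full region, so a majority vote
    over the remaining reads stands in for the clustering step.
    """
    total = len(reads)
    remaining = list(reads)
    strains = []
    while remaining:
        # The consensus of the remaining reads approximates the dominant strain.
        consensus = "".join(
            Counter(column).most_common(1)[0][0] for column in zip(*remaining)
        )
        explained = [r for r in remaining if hamming(r, consensus) <= max_mismatch]
        if not explained:
            break  # nothing matches the consensus; stop rather than loop forever
        strains.append((consensus, len(explained) / total))
        # Successive removal: drop the reads attributed to the strain just inferred.
        remaining = [r for r in remaining if hamming(r, consensus) > max_mismatch]
    return strains


# Three made-up strains at frequencies 60%, 30% and 10%.
reads = ["ACGTACGT"] * 6 + ["ACGTTCGA"] * 3 + ["TCGTACGG"]
strains = successive_reconstruction(reads)
```

In the first round the rare strain contributes almost nothing to the consensus; it only becomes the majority once the reads of the two more abundant strains have been removed.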

5:40 PM-5:50 PM
Batch Effects Correction for Microbiome Data with Dirichlet-multinomial Regression
Room: Columbus KL
  • Fangda Song, The Chinese University of Hong Kong, Hong Kong
  • Zhenwei Dai, The Chinese University of Hong Kong, Hong Kong
  • Yingying Wei, The Chinese University of Hong Kong, Hong Kong
  • Jun Yu, The Chinese University of Hong Kong, Hong Kong
  • Hei Wong, The Chinese University of Hong Kong, Hong Kong

Metagenomic sequencing techniques enable quantitative analyses of the microbiome. However, combining the microbial data from these experiments is challenging due to the variations between experiments. The existing methods for correcting batch effects do not consider the interactions between variables---microbial taxa in microbial studies---and the overdispersion of the microbiome data. Therefore, they are not applicable to microbiome data. We developed a new method, Bayesian Dirichlet-multinomial regression meta-analysis (BDMMA), to simultaneously model the batch effects and detect the microbial taxa associated with phenotypes. BDMMA automatically models the dependence among microbial taxa and is robust in detecting associations in high-dimensional, over-dispersed microbiome data with sparse associations. Simulation studies and real data analysis have shown that BDMMA can successfully adjust batch effects and substantially reduce false discoveries in microbial meta-analyses. BDMMA is a powerful tool to perform meta-analysis for metagenomic studies and detect taxa that are truly associated with the phenotypes with high accuracy. We envision that BDMMA will be widely applied in practice, especially with the rise of large consortium projects such as the American Gut Project and the MetaHIT project.
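
For readers unfamiliar with the distribution named in the title, the sketch below evaluates the standard Dirichlet-multinomial log-probability of a vector of taxon counts. It is only the basic likelihood building block, not the BDMMA model, which additionally involves batch and phenotype terms not detailed in this abstract; the function name is an assumption of this example.

```python
from math import exp, lgamma


def dirichlet_multinomial_logpmf(counts, alpha):
    """Log-probability of `counts` under a Dirichlet-multinomial with parameters `alpha`."""
    n = sum(counts)
    a0 = sum(alpha)
    # Multinomial coefficient n! / prod(x_k!).
    log_coeff = lgamma(n + 1) - sum(lgamma(x + 1) for x in counts)
    # Normalizing ratio Gamma(a0) / Gamma(n + a0).
    log_norm = lgamma(a0) - lgamma(n + a0)
    # Per-taxon terms Gamma(x_k + alpha_k) / Gamma(alpha_k).
    log_terms = sum(lgamma(x + a) - lgamma(a) for x, a in zip(counts, alpha))
    return log_coeff + log_norm + log_terms


# Two taxa, ten reads, all assigned to the first taxon.
p_dm = exp(dirichlet_multinomial_logpmf([10, 0], [1.0, 1.0]))
# The same outcome under a plain multinomial with equal proportions.
p_multinomial = 0.5 ** 10
```

With `alpha = (1, 1)` the extreme outcome (10, 0) has probability 1/11 under the Dirichlet-multinomial, far more than under a multinomial with the same mean proportions. This heavier spread is the overdispersion the abstract refers to.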

5:50 PM-6:00 PM
Ampliclust: A Fully Probabilistic Model-Based Approach Denoising Illumina Amplicon Data
Room: Columbus KL
  • Xiyu Peng, Iowa State University, United States
  • Heliang Shi, Pfizer, United States
  • Karin Dorman, Iowa State University, United States

Next-generation amplicon sequencing is a powerful tool for understanding microbial communities. Downstream analysis is often based on the construction of Operational Taxonomic Units (OTUs) with a dissimilarity threshold of 3%. The arbitrary threshold and reliance on OTU references can lead to low resolution, false positives, and misestimation of alpha and beta microbial diversity. We introduce Ampliclust, a reference-free method to resolve the number, abundance and identity of error-free sequences in Illumina amplicon data. Unlike existing methods, Ampliclust is a fully probabilistic model, allowing the data, rather than an algorithm or an external database, to drive the conclusions. We use a modified Bayesian information criterion to estimate the number of sequence variants and obtain maximum likelihood estimates of the abundance and identity of error-free sequences. Our model is able to match the performance of DADA2 on well-separated mock communities, but in simulated communities with more similar real sequences, Ampliclust can achieve better accuracy. The major challenge is computational scalability, which we begin to address through principled iterative schemes and improved initialization methods.

Tuesday, July 10th
8:35 AM-8:40 AM
MICROBIOME: Session Overview and Introductions
Room: Columbus KL
8:40 AM-9:20 AM
Microbiome COSI Keynote III: Metagenotyping Reveals Cryptic Functional Variation in the Human Microbiome
Room: Columbus KL
  • Katherine Pollard, Gladstone Institutes and University of California, United States

Metagenotyping is the identification of microbial genetic variation from shotgun metagenomics data. Capturing genetic variants is important, because strains of the same species can differ significantly in gene content and gene sequence, which in turn affects the functional capabilities of microbial communities from one host to another. We developed high-throughput computational methods to identify single nucleotide variants and gene copy number variants within the specific microbial strains present in a shotgun metagenome, as well as phylogenetic models to test for associations between genetic variants and microbial or host traits. Using these methods, we tracked microbial transmissions from mothers to infants, discovered bacterial genes associated with colonization of the human gut, and quantified population genetic evidence for selection in prevalent gut microbes.

9:20 AM-9:30 AM
Multivariable Association in Population-scale Meta'omic Surveys
Room: Columbus KL
  • Himel Mallick, Harvard University, United States
  • Timothy L. Tickle, Harvard University, United States
  • Lauren J. McIver, Harvard University, United States
  • George Weingart, Harvard University, United States
  • Joseph N. Paulson, Genentech, United States
  • Siyuan Ma, Harvard University, United States
  • Boyu Ren, Harvard University, United States
  • Emma Schwager, Harvard University, United States
  • Ayshwarya Subramanian, Harvard University, United States
  • Eric Franzosa, Harvard University, United States
  • Hector Corrada Bravo, University of Maryland, United States
  • Curtis Huttenhower, Harvard University, United States

It is challenging to relate features such as human health outcomes, diet, environmental conditions, or other metadata to microbial community measurements, due in part to their quantitative properties. Microbiome multi’omics are typically noisy, sparse (zero-inflated), high-dimensional, and extremely non-normal, often in the form of either count or compositional measurements. Here, we introduce an optimal combination of established methodology to assess multivariable association of microbial community features with complex metadata in population-scale epidemiological studies. Our approach, MaAsLin2 (Multivariable Association with Linear Models), relies on multiple statistical models to account for the inherent characteristics of modern meta’omic epidemiology study designs, including repeated measures and multiple covariates. To construct this method, we conducted a large-scale evaluation of a broad range of data settings under which straightforward identification of meta’omic associations can be challenging. These simulation studies reveal that MaAsLin2 preserves statistical power in the presence of repeated measures and multiple covariates while accounting for the nuances of meta’omic features and controlling false discovery. Finally, we applied MaAsLin2 to a microbial multi’omic dataset from the Integrative Human Microbiome Project (HMP2) which, in addition to reproducing established results, revealed a unique, integrated landscape of inflammatory bowel disease (IBD) across multiple time points and ‘omics profiles.

9:30 AM-9:40 AM
Gaussian process models for microbial dynamics in the expanded Human Microbiome Project
Room: Columbus KL
  • Jason Lloyd-Price, Harvard University, United States
  • Anup Mahurkar, Institute for Genome Sciences, United States
  • Gholamali Rahnavard, Broad Institute of MIT and Harvard, United States
  • Jonathan Crabtree, Institute for Genome Sciences, United States
  • Joshua Orvis, Institute for Genome Sciences, United States
  • A. Brantley Hall, Harvard University, United States
  • Arthur Brady, Institute for Genome Sciences, United States
  • Heather H. Creasy, Institute for Genome Sciences, United States
  • Carrie McCracken, Institute for Genome Sciences, United States
  • Michelle Giglio, University of Maryland School of Medicine, United States
  • Daniel McDonald, UCSD, United States
  • Eric A. Franzosa, Harvard University, United States
  • Rob Knight, UCSD, United States
  • Owen White, Institute for Genome Sciences, United States
  • Curtis Huttenhower, Harvard University, United States

Multiple molecular data types are increasingly used to study microbial community dynamics over time, for example in the NIH Human Microbiome Project (HMP). We have developed a set of complementary multi'omic longitudinal models for such data, including Gaussian Processes (GPs) with a Beta-Binomial likelihood appropriate for microbial communities' technical zeros, sequencing depth, overdispersion, and compositionality. Using GPs, we present new findings from a dramatic expansion of shotgun metagenomes (now ~2,400 samples) from the HMP (“HMP1-II”). We partitioned variance of microbial taxa and metabolic processes into host-specific, temporally-variable, and rapidly-variable subsets. We found that species abundances in the gut were highly individualized, particularly within the Bacteroidetes phylum, while Firmicutes tended to be shared among individuals with abundances that varied over time. Microbes at other sites did not exhibit such a phylum-level distinction, and were less personalized than those in the gut. Meanwhile, metabolic pathways were not personalized despite being encoded by personalized microbial communities, indicating that community assembly may be mediated by the need for keystone functions rather than particular taxa. The results and framework presented here will enable further in-depth characterizations of the dynamics of the microbiome, particularly as longitudinal datasets become more widely available in the field.

9:40 AM-10:15 AM
Coffee Break
10:15 AM-10:20 AM
MICROBIOME: Session Continuation and Introductions
Room: Columbus KL
10:20 AM-11:20 AM
CAMI - how to get involved
Room: Columbus KL
  • CAMI steering committee
11:20 AM-11:40 AM
Proceedings Presentation: MicroPheno: Predicting environments and host phenotypes from 16S rRNA gene sequencing using a k-mer based representation of shallow sub-samples
Room: Columbus KL
  • Ehsaneddin Asgari, University of California, Berkeley, United States
  • Kiavash Garakani, University of California, Berkeley, United States
  • Alice McHardy, Helmholtz Centre for Infection Research, Germany
  • Mohammad R.K. Mofrad, University of California, Berkeley, United States

Motivation: Microbial communities play important roles in the function and maintenance of various biosystems, ranging from the human body to the environment. A major challenge in microbiome research is the classification of microbial communities of different environments or host phenotypes. The most common and cost-effective approach for such studies to date is 16S rRNA gene sequencing. Recent falls in sequencing costs have increased the demand for simple, efficient, and accurate methods for rapid detection or diagnosis with proven applications in medicine, agriculture, and forensic science. We describe a reference- and alignment-free approach for predicting environments and host phenotypes from 16S rRNA gene sequencing based on k-mer representations that benefits from a bootstrapping framework for investigating the sufficiency of shallow sub-samples. Deep learning methods as well as classical approaches were explored for predicting environments and host phenotypes.
Results: A k-mer distribution of shallow sub-samples outperformed Operational Taxonomic Unit (OTU) features in the tasks of body-site identification and Crohn's disease prediction. Aside from being more accurate, using k-mer features in shallow sub-samples (i) allows skipping the computationally costly sequence alignments required in OTU-picking, and (ii) provides a proof of concept for the sufficiency of shallow and short-length 16S rRNA sequencing for phenotype prediction. In addition, k-mer features predicted representative 16S rRNA gene sequences of 18 ecological environments and 5 organismal environments with high macro-F1 scores of 0.88 and 0.87, respectively. For large datasets, deep learning outperformed classical methods such as Random Forest and SVM.
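
A k-mer representation of a shallow sub-sample, as described above, can be sketched in a few lines. This is an illustrative assumption about what such a feature vector might look like, not MicroPheno's implementation; the function name `kmer_profile` and the parameters `sample_size` and `seed` are hypothetical.

```python
import random
from collections import Counter
from itertools import product


def kmer_profile(sequences, k=3, sample_size=None, seed=0):
    """Normalized k-mer frequency vector over an optional random sub-sample of reads."""
    rng = random.Random(seed)
    if sample_size is not None and sample_size < len(sequences):
        # Shallow sub-sample: keep only `sample_size` randomly chosen reads.
        sequences = rng.sample(list(sequences), sample_size)
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    # Fixed ordering of all 4**k DNA k-mers so vectors are comparable across samples.
    vocabulary = ["".join(p) for p in product("ACGT", repeat=k)]
    # k-mers containing other symbols (e.g. N) are ignored.
    total = sum(counts[kmer] for kmer in vocabulary)
    return [counts[kmer] / total if total else 0.0 for kmer in vocabulary]


# Two made-up reads; with k=2 the vector has 16 entries ordered AA, AC, AG, ...
profile = kmer_profile(["ACGTAC", "GGGTTT"], k=2)
```

The resulting fixed-length vector can be fed to any standard classifier, and no alignment or OTU-picking is needed to compute it.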

11:40 AM-11:50 AM
MEGAN-LR: New algorithms allow accurate binning and easy interactive exploration of metagenomic long reads and contigs
Room: Columbus KL
  • Daniel H. Huson, University of Tuebingen, Germany
  • Benjamin Albrecht, University of Tuebingen, Germany
  • Caner Bagci, University of Tuebingen, Germany
  • Irina Bessarab, SCELSE, National University of Singapore, Singapore
  • Anna Gorska, University of Tuebingen, Germany
  • Dino Jolic, Max Planck Institute for Developmental Biology, Germany
  • Rohan Williams, Singapore Centre for Environmental Life Sciences Engineering, National University of Singapore, Singapore

Computational tools for taxonomic or functional analysis of microbiome samples run on hundreds of millions of short, high quality reads. Programs like MEGAN allow interactive navigation of such datasets. As long read technologies improve, there is increasing interest in using them in microbiome sequencing, and a need to adapt short read tools to long reads.

We describe a new LCA-based algorithm for taxonomic binning, and an interval-tree algorithm for functional binning, for long reads and contigs. We provide a new interactive tool for the alignment of long reads against reference sequences. For taxonomic and functional binning, we propose to use DIAMOND or LAST to compare long reads against the NCBI-nr protein reference database so as to obtain frame-shift aware alignments, and then to process the results using our new methods.
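
For orientation, the basic lowest common ancestor (LCA) idea behind this family of binning methods is sketched below: a read is assigned to the deepest taxonomy node shared by all taxa it aligns to. This is the naive LCA only, not the new long-read algorithm presented in the talk, whose details are not given in this abstract. The toy taxonomy and function names are assumptions of this example.

```python
def lineage(taxon, parent):
    """Path from a taxon up to the root, following child -> parent links."""
    path = [taxon]
    while taxon in parent:
        taxon = parent[taxon]
        path.append(taxon)
    return path


def lowest_common_ancestor(taxa, parent):
    """Deepest taxonomy node shared by the lineages of all given taxa."""
    paths = [lineage(t, parent) for t in taxa]
    common = set(paths[0]).intersection(*paths[1:])
    # Walking root-ward from the first taxon, the first shared node is the deepest.
    return next(node for node in paths[0] if node in common)


# Toy taxonomy as child -> parent links (a tiny excerpt for illustration).
parent = {
    "Escherichia coli": "Escherichia",
    "Escherichia fergusonii": "Escherichia",
    "Escherichia": "Enterobacteriaceae",
    "Salmonella enterica": "Salmonella",
    "Salmonella": "Enterobacteriaceae",
    "Enterobacteriaceae": "Bacteria",
}

# Taxa hit by the alignments of two hypothetical reads.
bin_a = lowest_common_ancestor(["Escherichia coli", "Escherichia fergusonii"], parent)
bin_b = lowest_common_ancestor(["Escherichia coli", "Salmonella enterica"], parent)
```

The first read is placed at genus level and the second only at family level, which shows how hits to closely related references push an assignment higher up the taxonomy.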

All presented methods are implemented in the open source edition of MEGAN. We refer to this as MEGAN-LR (MEGAN long read). We evaluate the MEGAN-LR approach in a simulation study, and on a number of mock community datasets consisting of Nanopore reads, PacBio reads and assembled PacBio reads. We also illustrate the application on a Nanopore dataset that we obtained from an anammox bio-reactor community.

To appear in: Biology Direct

11:50 AM-12:00 PM
Probabilistic abundance estimation accelerates metagenome binning by orders of magnitude
Room: Columbus KL
  • Andreas Bremges, Helmholtz Centre for Infection Research, Germany
  • Alice McHardy, Helmholtz Centre for Infection Research, Germany

Abundance estimation in metagenomics and transcript quantification in RNAseq are similar challenges, yet state-of-the-art methods differ. For abundance estimation (a prerequisite for metagenome binning), the first step is usually alignment-based (i.e. read mapping), while for transcript quantification, alignment-based methods were basically retired 2–3 years ago in favor of so-called alignment-free methods (more accurately: probabilistic methods implementing quasi-mapping or pseudo-alignment). I wonder why this is the case and whether we can easily transfer these advances in RNAseq analysis to our beloved applications in metagenomics. Using real and mock microbial communities as well as simulated benchmarking datasets from the CAMI initiative, I compare the performance of popular alignment-based and alignment-free methods in the context of metagenome binning, and conclude that probabilistic abundance estimation yields comparable binning results but accelerates the process by orders of magnitude.

12:00 PM-12:20 PM
Zero-Inflated Generalized Dirichlet Multinomial (ZIGDM) Regression Model for Microbiome Compositional Data
Room: Columbus KL
  • ZhengZheng Tang, University of Wisconsin-Madison, United States

There is heightened interest in using high-throughput sequencing technologies to quantify abundances of microbial taxa and linking the abundance to human diseases and traits. Proper modeling of multivariate taxon counts is essential to the power of detecting this association. Existing models are limited in handling excessive zero observations in taxon counts and in flexibly accommodating complex correlation structures and dispersion patterns among taxa. We develop a new probability distribution, Zero-Inflated Generalized Dirichlet Multinomial (ZIGDM), that overcomes these limitations in modeling multivariate taxon counts. Based on this distribution, we propose a ZIGDM regression model to link microbial abundances to covariates (e.g. disease status), and develop a fast Expectation-Maximization (EM) algorithm to efficiently estimate parameters in the model. The derived tests enable us to reveal rich patterns of variation in microbial compositions including differential mean and dispersion. The advantages of the proposed methods are demonstrated through simulation studies and an analysis of a gut microbiome dataset.

12:20 PM-12:30 PM
Identifying important uncharacterized genes using metagenomes and metatranscriptomes
Room: Columbus KL
  • Gholamali Rahnavard, Broad Institute of MIT and Harvard, United States
  • Afrah Shafquat, Harvard University, United States
  • Kevin Bonham, Harvard University, United States
  • Himel Mallick, Harvard University, United States
  • Eric Franzosa, Harvard University, United States
  • Curtis Huttenhower, Harvard University, United States

The discovery of novel microbial genes from metagenomes and, increasingly, metatranscriptomes has outpaced our ability to functionally characterize those genes. In this work, we present PPANINI (Prioritization and Prediction of functional Annotation for Novel and Important genes via automated data Network Integration), a method to prioritize genes based on an “importance” score calculated across microbial communities. We validated PPANINI by, first, assessing homologs of known essential genes, achieving high accuracy (e.g. AUC=0.74, 0.82, and 0.94). This was true across a range of microbial habitats, including four human body sites (skin, vagina, gut, and mouth), marine, and prairie soil metagenomes. Applying the method to these environments prioritized in total 463,044 novel and 274,913 uncharacterized gene families, in addition to 124,332 already-characterized genes. These differed strikingly from isolate genome analysis, with 722,304 gene families identified based solely on metagenomes. Finally, applying PPANINI to the Crohn’s disease metatranscriptome revealed enriched functional categories important in the disease, including viral release from host cells. This method thus provides an efficient strategy to identify potentially important, undercharacterized genes from microbial communities, paving the way for improved bioinformatic and biochemical characterization efforts. http://huttenhower.sph.harvard.edu/ppanini.

12:30 PM-12:40 PM
Predicting Microbial Ecology from Shotgun Metagenomic Data
Room: Columbus KL
  • Jamie Strampe, Boston University, United States
  • Anthony Federico, Boston University, United States
  • Aaron Chevalier, Boston University, United States

Thousands of bacterial species can coexist in one gram of soil, but little is currently known about the structure and metabolic interactions of these communities. When grown in carbon-limited medium, bacteria grown ex situ from soil samples formed stable family-level communities at steady state. This result is surprising given that species competing for a single limiting resource, in our case a single carbon source, should not be able to stably coexist according to the competitive exclusion principle. Spent media experiments indicate that metabolites are exchanged in this microbial community, but we do not know which metabolites or how these cross-feeding interactions contribute to producing a stable and reproducible community structure. To address these questions, we used whole genome shotgun sequencing and experimentally-derived phenotypic data to build constraint-based models of core carbon metabolism for five bacterial community members. Utilizing flux balance analysis (FBA), we simulated growth and predicted pairwise cross-feeding interactions. Simulations showed that our model of each organism was capable of growing on the spent media of all other organisms. They further exhibited expected levels of carbon-conversion efficiency and cross-feeding preference consistent with experimental results.

12:40 PM-2:00 PM
Lunch Break
2:00 PM-2:40 PM
Microbiome COSI Keynote IV: Metagenomic insights into ecology, evolution, and biochemistry of single environmental populations through single-amino acid variants.
Room: Columbus KL
  • Murat Eren, University of Chicago, United States

The diversity and geographical distribution of surface ocean microbial populations are largely governed by temperature and its co-variables. However, neither the mechanisms by which genomic heterogeneity emerges within a single population, nor how it drives the partitioning of ecological niches, are well understood. An increasing number of environmental metagenomes with astonishing depth of sequencing makes it possible to characterize genomic heterogeneity within single microbial populations, investigate evolutionary processes acting upon them, and link genomic variation to predicted tertiary structures of genes to gain biochemical insights.

2:40 PM-2:50 PM
Identifying novel lateral gene transfer events from assembled metagenomes
Room: Columbus KL
  • Tiffany Hsu, Harvard University, United States
  • Eric Franzosa, Harvard University, United States
  • Dennis Wong, Dalhousie University, Canada
  • Chengwei Luo, Harvard University, United States
  • Robert Beiko, Dalhousie University, Canada
  • Morgan Langille, Dalhousie University, Canada
  • Curtis Huttenhower, Harvard University, United States

Presentation Overview:

Lateral gene transfer (LGT) is an important mechanism for genome diversification in microbial communities, including the human microbiome. While methods exist to identify LGTs from sequenced isolate genomes, identifying LGTs from community metagenomes remains an open problem. To address this, we developed WAAFLE: the Workflow to Annotate Assemblies and Find LGT Events. WAAFLE integrates gene sequence homology and taxonomic provenance to identify metagenomic contigs explained by pairs of microbial clades but not by single clades (i.e. putative LGTs). It also rules out alternative explanations such as gene deletion and misassembly. We validated our approach on synthetic contigs containing spiked LGTs: WAAFLE identified challenging intra-genus LGTs with 51% sensitivity, other LGTs with >91% sensitivity, and was >99.9% specific. We then applied WAAFLE to 138 million contigs from 2,289 assembled human metagenomes (the HMP1-II dataset), revealing 393 thousand novel LGTs (182±173 per metagenome, mean±SD). These were enriched in the oral and gut body sites (compared to skin and vagina) and among phylogenetically related taxa. Transferred functions were enriched for known mobile elements as well as outer membrane proteins, such as TonB receptors. Hence, WAAFLE is a powerful and useful approach for profiling LGTs in microbial communities.
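
The central test described above (a contig explained by a pair of clades but by no single clade) can be illustrated with a minimal Python sketch. This is not WAAFLE's implementation: the per-gene score table, the 0.8 threshold, and the function name `explain_contig` are assumptions made for illustration, and the checks for gene deletion and misassembly are left out.

```python
from itertools import combinations

def explain_contig(gene_scores, threshold=0.8):
    """Classify a contig from per-gene homology scores.

    gene_scores: list of dicts, one per gene, mapping clade -> score in [0, 1].
    Returns ("single", clade), ("pair", (a, b)), or ("unexplained", None).
    The threshold and the scoring scheme are illustrative assumptions.
    """
    clades = sorted({c for g in gene_scores for c in g})

    def covers(clade, gene):
        return gene.get(clade, 0.0) >= threshold

    # A single clade that explains every gene means no transfer is needed.
    for clade in clades:
        if all(covers(clade, g) for g in gene_scores):
            return ("single", clade)

    # Otherwise look for two clades that together explain every gene.
    for a, b in combinations(clades, 2):
        if all(covers(a, g) or covers(b, g) for g in gene_scores):
            return ("pair", (a, b))

    return ("unexplained", None)

# Hypothetical contig with three genes; the scores are invented for the example.
contig = [
    {"Bacteroides": 0.95, "Prevotella": 0.20},
    {"Bacteroides": 0.90, "Prevotella": 0.10},
    {"Bacteroides": 0.15, "Prevotella": 0.97},
]
print(explain_contig(contig))
```

Checking single clades before pairs mirrors the abstract's definition: a putative LGT is reported only when no one clade accounts for every gene on the contig.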

2:50 PM-3:00 PM
Population structure discovery in meta-analyzed microbial communities
Room: Columbus KL
  • Siyuan Ma, Harvard University, United States
  • Dmitry Shungin, Harvard University, United States
  • Himel Mallick, Harvard University, United States
  • Raivo Kolde, Philips Research, United States
  • Eric A. Franzosa, Harvard University, United States
  • Hera Vlamakis, Harvard University, United States
  • Ramnik Xavier, Harvard University, United States
  • Curtis Huttenhower, Harvard University, United States

Presentation Overview:

Human microbiome studies have now achieved a scale at which it is practical to associate features of the microbiome with health outcomes and covariates in multiple large populations. This permits the development of rigorous meta-analysis and population structure analysis methods. We have developed MMUPHin (Meta-analysis Methods with Uniform Pipeline for Heterogeneity in Microbiome Studies), a set of normalization, meta-analysis, and population structure discovery methods appropriate for microbiome taxonomic and functional profiles. By applying our methods to a combination of eight inflammatory bowel disease (IBD) cohorts (5,232 total samples), we characterized consistent population structure in patients’ gut microbiomes. Evaluation of data handling practices identified those most sensitive to biological variation and robust to batch and technical differences, including known effects of Bacteroides and Prevotella species. Linear mixed effects models revealed consistent enrichment and depletion in the IBD population versus controls. Finally, multiple unsupervised clustering methods, combined with different clustering strength metrics, agreed on a lack of discrete microbiome “types” in the IBD gut microbiome. As these results are consistent across datasets, we anticipate they will provide a reference for the IBD microbiome and a future framework for human microbiome meta-analyses more broadly.

3:00 PM-3:10 PM
Phylogenetic placement of exact amplicon sequences improves associations with clinical information
Room: Columbus KL
  • Stefan Janssen, University of California San Diego, United States
  • Daniel McDonald, UCSD, United States
  • Antonio Gonzalez, UCSD, United States
  • Lingjing Jiang, UCSD, United States
  • Zhenjiang Zech Xu, UCSD, United States
  • Kevin Winker, University of Alaska Museum and Department of Biology and Wildlife, Fairbanks, Alaska, United States
  • Deborah M Kado, University of California, San Diego, United States
  • Eric Orwoll, Oregon Health & Science University, United States
  • Mark Manary, Washington University in Saint Louis, United States
  • Siavash Mirarab, UCSD, United States
  • Rob Knight, UCSD, United States

Presentation Overview:

Recent algorithmic advances in amplicon-based microbiome studies enable inference of exact amplicon sequence fragments. These new methods allow for investigation of sub-operational-taxonomic-units (sOTUs) by removing erroneous sequences. However, short DNA sequence fragments do not contain sufficient phylogenetic signal to reproduce a reasonable tree, which is a barrier to using critical phylogenetically aware metrics such as Faith's PD or UniFrac. Although fragment insertion methods exist, they have not been tested for sOTUs from high-throughput amplicon studies when inserting against a broad reference phylogeny. We benchmark the SATé-enabled phylogenetic placement (SEPP) technique explicitly against 16S V4 sequence fragments, and show that it outperforms the conceptually problematic but often used practice of reconstructing de novo phylogenies. In addition, we provide a BSD-licensed QIIME2 plugin (https://github.com/biocore/q2-fragment-insertion) for SEPP and integration into the microbial study management platform QIITA.

The move from OTU-based to sOTU-based analysis, while providing additional resolution, also introduces computational challenges. We demonstrate that one popular method of dealing with sOTUs (building a de novo tree from the short sequences) can provide incorrect results in human gut metagenomic studies, and show that phylogenetic placement of the new sequences with SEPP resolves this problem while also yielding other benefits over existing methods.
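
To show why metrics such as Faith's PD depend on a tree that contains every sOTU, the following minimal Python sketch computes Faith's PD as the total length of the branches connecting the observed tips to the root. The tree encoding, node names, and branch lengths are invented for illustration; in the workflow described above, the tree would come from SEPP placement rather than being written by hand.

```python
def faith_pd(parent, observed):
    """Sum the branch lengths on the paths from each observed tip to the root.

    parent: dict mapping node -> (parent_node, branch_length); the root is absent.
    observed: iterable of tip names present in the sample.
    Each branch is counted once, however many tips share it.
    """
    visited = set()
    total = 0.0
    for tip in observed:
        node = tip
        # Walk upward until the root or an already counted branch is reached.
        while node in parent and node not in visited:
            visited.add(node)
            up, length = parent[node]
            total += length
            node = up
    return total

# Hypothetical four-branch tree: ((A:1.0, B:2.0)n1:0.5, C:3.0)root
tree = {
    "A": ("n1", 1.0),
    "B": ("n1", 2.0),
    "n1": ("root", 0.5),
    "C": ("root", 3.0),
}
print(faith_pd(tree, ["A", "B"]))
print(faith_pd(tree, ["A", "C"]))
```

A tip that is missing from the tree contributes nothing in this sketch, which is one way to see how poorly placed or absent fragments can distort the metric.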

3:10 PM-3:20 PM
PopPhy-CNN: A Convolutional Neural Network Approach Using Embedded Phylogenetic Trees for Analyzing the Association of Host Microbiome and Phenotype
Room: Columbus KL
  • Derek Reiman, University of Illinois at Chicago, United States
  • Ahmed Metwally, University of Illinois at Chicago, United States
  • Yang Dai, University of Illinois at Chicago, United States

Presentation Overview:

Accurate prediction of the host phenotype from a metagenomic sample and identification of the associated bacterial markers are important in metagenomic studies. We introduce PopPhy-CNN, a novel convolutional neural network (CNN) learning architecture that effectively exploits phylogenetic structure in microbial taxa. PopPhy-CNN takes as input a 2D matrix that represents the phylogenetic tree as an image, populated with the relative abundances of the microbial taxa in a metagenomic sample. This conversion empowers CNNs to explore the spatial relationship of the taxonomic annotations on the tree and their quantitative characteristics in metagenomic data. PopPhy-CNN is evaluated using three metagenomic datasets of moderate size. We show the superior performance of PopPhy-CNN compared to random forest, support vector machines, LASSO and a baseline 1D-CNN model constructed with relative abundance microbial feature vectors. In addition, we design a novel scheme of feature extraction from the learned CNN models and demonstrate the improved performance when the extracted features are used to train support vector machines. PopPhy-CNN facilitates not only the retrieval of informative microbial taxa from the trained CNN models but also the visualization of the taxa on the phylogenetic tree.
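
As a rough illustration of turning a populated tree into a 2D matrix, the Python sketch below places the nodes of each tree depth in one row and pads the rows with zeros. This is one plausible layout and not necessarily the one PopPhy-CNN uses; the function name `tree_to_matrix`, the summing of child abundances into internal nodes, and the example taxa and values are all assumptions.

```python
def tree_to_matrix(children, root, abundance):
    """Lay a populated tree out as a zero-padded 2D matrix.

    children: dict mapping node -> list of child nodes (tips are absent or []).
    abundance: dict mapping tip -> relative abundance in one sample.
    Row d holds the nodes at depth d, left to right; an internal node's
    value is the sum of its children's values (an assumed convention).
    """
    values = {}

    def fill(node):
        kids = children.get(node, [])
        if not kids:
            values[node] = abundance.get(node, 0.0)
        else:
            values[node] = sum(fill(k) for k in kids)
        return values[node]

    fill(root)

    # Collect the tree level by level, keeping left-to-right order.
    rows, level = [], [root]
    while level:
        rows.append([values[n] for n in level])
        level = [k for n in level for k in children.get(n, [])]

    width = max(len(r) for r in rows)
    return [r + [0.0] * (width - len(r)) for r in rows]

# Hypothetical three-level tree and sample.
children = {
    "root": ["Firmicutes", "Bacteroidetes"],
    "Firmicutes": ["Clostridium", "Lactobacillus"],
    "Bacteroidetes": ["Bacteroides"],
}
abundance = {"Clostridium": 0.25, "Lactobacillus": 0.25, "Bacteroides": 0.5}
for row in tree_to_matrix(children, "root", abundance):
    print(row)
```

Because neighbouring cells in a row are siblings or near relatives, a convolution over such a matrix sees related taxa together, which is the property the abstract attributes to the image-like input.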

3:20 PM-3:30 PM
Computing Metabolic Routes in the Human Microbiome
Room: Columbus KL
  • Peter Karp, SRI International, United States
  • Markus Krummenacker, SRI International, United States

Presentation Overview:

The new BioCyc.org Multi-Organism (metabolic) Route Search (MORS) tool enables a researcher to explore biochemical conversions among metabolites that are accomplished by multiple organisms in a metabolic community. MORS computes optimal routes that connect starting and ending metabolites specified by the user. A route is a linear sequence of reactions that converts the starting to the ending compound. An optimal route minimizes the number of reactions used while maximizing the number of atoms from the starting compound that are incorporated into the ending compound. The set of reactions from which routes are computed is the union of all reactions in the metabolic networks of the selected organisms.

As the endpoint for one example route, we picked indoxyl sulfate, which is implicated in toxicity among kidney disease patients. We selected L-tryptophan as a starting point because it is the known source of a microbial route to indoxyl sulfate. MORS found a known route of 3 reaction steps, retaining 9 atoms from start to end. The first reaction can be catalyzed by 164 different microbes, whereas the last two reactions are catalyzed by Homo sapiens only. MORS also computed a known route from L-carnitine to trimethylamine-N-oxide.
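
The route search can be pictured as a shortest-path problem over the pooled reactions of all selected organisms. The Python sketch below uses breadth-first search to minimize the number of reactions only; it does not track atoms, so it covers just one of the two optimality criteria described above, and it is not the MORS algorithm. The reaction labels and the intermediates (indole, indoxyl, tryptamine) are filled in from general biochemistry for illustration and are not taken from BioCyc.

```python
from collections import deque

def shortest_route(reactions, start, end):
    """Breadth-first search for a route with the fewest reactions.

    reactions: list of (name, substrate, product, organisms) tuples pooled
    from all selected organisms. Atom tracking is deliberately omitted.
    Returns the list of reaction names, or None if no route exists.
    """
    outgoing = {}
    for name, substrate, product, _organisms in reactions:
        outgoing.setdefault(substrate, []).append((name, product))

    queue = deque([start])
    came_from = {start: None}  # metabolite -> (previous metabolite, reaction)
    while queue:
        metabolite = queue.popleft()
        if metabolite == end:
            # Walk back to the start, collecting reaction names.
            route = []
            while came_from[metabolite] is not None:
                metabolite, name = came_from[metabolite]
                route.append(name)
            return route[::-1]
        for name, product in outgoing.get(metabolite, []):
            if product not in came_from:
                came_from[product] = (metabolite, name)
                queue.append(product)
    return None

# Illustrative union of reactions; organism labels are simplified.
reactions = [
    ("tryptophanase", "L-tryptophan", "indole", "microbial"),
    ("tryptophan decarboxylation", "L-tryptophan", "tryptamine", "microbial"),
    ("indole hydroxylation", "indole", "indoxyl", "Homo sapiens"),
    ("indoxyl sulfation", "indoxyl", "indoxyl sulfate", "Homo sapiens"),
]
print(shortest_route(reactions, "L-tryptophan", "indoxyl sulfate"))
```

Pooling the reactions before searching is what lets a single route cross organism boundaries, as in the example where a microbial first step is followed by two host reactions.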

3:30 PM-3:40 PM
Genomics-based prediction of metabolic phenotypes in microbial communities
Room: Columbus KL
  • Stanislav Iablokov, Institute for Information Transmission Problems and P.G. Demidov Yaroslavl State University, Russia
  • Pavel Novichkov, Lawrence Berkeley National Laboratory, United States
  • Andrei Osterman, Sanford Burnham Prebys Institute, United States
  • Dmitry Rodionov, Sanford Burnham Prebys Institute and IITP RAS, United States

Presentation Overview:

High-throughput genomic and metagenomic sequencing have revolutionized the exploration of complex microbial communities such as human or soil microbiota. We have developed an approach for describing and comparing microbial communities in terms of their metabolic (in addition to phylogenetic) signatures. Using subsystems-based metabolic reconstruction methodology, we infer phenotypic features (nutrient requirements, carbohydrate utilization capabilities, quorum sensing, etc.) directly from microbial genomes. The resulting collection of binary metabolic phenotypes for ~2,300 reference bacterial genomes representing human gut microbiota was used in a two-step pipeline for predicting phenotype profiles of 16S rRNA samples. The upstream module determines the taxonomic composition of input samples using classifiers implemented in QIIME2 and various 16S databases (GreenGenes, NCBI, RDP, SILVA), which were assessed for their coverage of the reference collection. The downstream module calculates the matrix of cumulative phenotypes normalized by species abundance for each sample and each metabolic feature. It uses a three-step taxonomic mapping procedure and computes averaged phenotype indices at the levels of species, genus and family for probabilistic assessment of metabolic features of those taxonomic entities that cannot be mapped to presently available reference genomes of individual species. We also implemented sequence-based weighted mapping to reference genomes and compared it with taxonomy-based approaches for community phenotype inference.
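
A minimal Python sketch of the downstream calculation, under the assumption that a community-level phenotype value is the abundance-weighted mean of the binary phenotypes of the mapped taxa. The function name, the taxon and feature names, and the renormalization over mapped taxa are illustrative choices; the genus- and family-level averaging described above is not reproduced.

```python
def community_phenotype_index(abundances, phenotypes):
    """Abundance-weighted fraction of a community carrying each phenotype.

    abundances: dict taxon -> relative abundance in one sample.
    phenotypes: dict taxon -> dict feature -> 0 or 1 (binary phenotype).
    Taxa without a phenotype entry are skipped and the remaining
    abundances are renormalized, a simplification of the mapping step.
    """
    mapped = {t: a for t, a in abundances.items() if t in phenotypes}
    total = sum(mapped.values())
    if total == 0:
        return {}
    features = sorted({f for t in mapped for f in phenotypes[t]})
    return {
        f: sum(a * phenotypes[t].get(f, 0) for t, a in mapped.items()) / total
        for f in features
    }

# Hypothetical sample and reference phenotypes.
abundances = {"taxon_A": 0.5, "taxon_B": 0.25, "taxon_unmapped": 0.25}
phenotypes = {
    "taxon_A": {"starch_utilization": 1, "b12_synthesis": 0},
    "taxon_B": {"starch_utilization": 0, "b12_synthesis": 1},
}
print(community_phenotype_index(abundances, phenotypes))
```

Running the function once per sample and stacking the results gives a samples-by-features matrix of the kind the abstract describes.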

3:40 PM-4:00 PM
Text-mining-based interpretation of microbiome data
Room: Columbus KL
  • Lars Juhl Jensen, NNF Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Denmark
  • Evangelos Pafilis, Hellenic Center for Marine Research, Greece
  • Manimozhiyan Arumugam, University of Copenhagen, Denmark

Presentation Overview:

Studying microbiomes using metagenomic sequence data is a topic of major scientific importance due to the impact of microbiota on human health and environments. However, a major bottleneck in microbiome studies is interpreting these complex data in the context of existing literature and datasets.

There is thus an urgent need to speed up the interpretation phase of microbiome data analysis. Text-mining tools can help with this by systematically annotating the literature with organisms, diseases, processes, environmental descriptors, and associations among them. However, most microbiome analysts are unaware of the power of text mining.

We use the published human colorectal cancer microbiome to show how text mining could have accelerated its interpretation by using named entity recognition to identify relevant articles, extracting cooccurrence-based organism–disease associations, discovering indirect associations that can explain observed associations, and performing enrichment analyses based on text-mined associations to show an intriguing link to oral diseases.

We hope this will make the microbiome community aware of what is already possible with text mining, lead to the development of tools tailored to their specific needs, and thereby greatly reduce the work required to interpret new microbiomes in light of the existing biomedical literature.
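
As a rough picture of co-occurrence-based association extraction, the Python sketch below counts how often an organism and a disease are mentioned in the same abstract and compares that with the count expected if mentions were independent. It is not the authors' pipeline: the substring matching stands in for real named entity recognition, the observed/expected score is an assumed choice, and the four example abstracts are invented.

```python
from collections import Counter
from itertools import product

def cooccurrence_scores(abstracts, organisms, diseases):
    """Score organism-disease pairs by observed/expected co-mentions.

    Entity recognition here is naive case-insensitive substring matching
    against two small dictionaries, a stand-in for real NER.
    """
    n = len(abstracts)
    org_count, dis_count, pair_count = Counter(), Counter(), Counter()
    for text in abstracts:
        lower = text.lower()
        orgs = {o for o in organisms if o.lower() in lower}
        dis = {d for d in diseases if d.lower() in lower}
        org_count.update(orgs)
        dis_count.update(dis)
        pair_count.update(product(orgs, dis))
    # Expected co-mentions under independence: org_count * dis_count / n.
    return {
        pair: observed * n / (org_count[pair[0]] * dis_count[pair[1]])
        for pair, observed in pair_count.items()
    }

# Invented toy corpus, for illustration only.
abstracts = [
    "Fusobacterium nucleatum is enriched in colorectal cancer tissue.",
    "Fusobacterium nucleatum is a common agent in periodontitis.",
    "Escherichia coli strains were isolated from colorectal cancer patients.",
    "Lactobacillus species are found in fermented food.",
]
organisms = ["Fusobacterium nucleatum", "Escherichia coli"]
diseases = ["colorectal cancer", "periodontitis"]
for pair, score in sorted(cooccurrence_scores(abstracts, organisms, diseases).items()):
    print(pair, score)
```

An organism linked to two diseases in this way is the kind of indirect association the abstract mentions, for example a gut finding that points toward oral disease.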

4:20 PM-4:40 PM
Outlook / Community Input
Room: Columbus KL
4:40 PM-5:00 PM
Coffee Break (on the go) to Closing Keynote