Session A: (July 7 and July 8)
Session B: (July 9 and July 10)
Short Abstract: Reconstructing the genomes of microbial community members is key to the interpretation of shotgun metagenome samples. Genome binning programs deconvolute reads or assembled contigs of such samples into individual bins, but assessing their quality is difficult due to the lack of evaluation software and standardized metrics. We present AMBER, an evaluation package for the comparative assessment of genome reconstructions from metagenome benchmark data sets. It calculates the performance metrics and comparative visualizations used in the first benchmarking challenge of the Initiative for the Critical Assessment of Metagenome Interpretation (CAMI). As an application, we show the outputs of AMBER for eleven different binning software options on two CAMI benchmark data sets. AMBER is implemented in Python and available under the Apache 2.0 license on GitHub (https://github.com/CAMI-challenge/AMBER).
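To make the kind of per-bin metric mentioned above concrete, the following is a minimal sketch of bin purity and completeness measured in base pairs. It is not AMBER's own code: the function name, the input dictionaries, and the rule of assigning each bin to the genome contributing the most base pairs are assumptions made for illustration.

```python
from collections import defaultdict


def bin_purity_completeness(gold, predicted, lengths):
    """Illustrative purity/completeness per predicted bin (hypothetical helper).

    gold:      contig id -> true genome id (the benchmark ground truth)
    predicted: contig id -> predicted bin id
    lengths:   contig id -> contig length in base pairs
    Returns bin id -> (assigned genome, purity, completeness).
    """
    # Total size of each true genome in base pairs.
    genome_size = defaultdict(int)
    for contig, genome in gold.items():
        genome_size[genome] += lengths[contig]

    # Base pairs of each genome that ended up in each bin, plus bin sizes.
    overlap = defaultdict(lambda: defaultdict(int))
    bin_size = defaultdict(int)
    for contig, bin_id in predicted.items():
        overlap[bin_id][gold[contig]] += lengths[contig]
        bin_size[bin_id] += lengths[contig]

    results = {}
    for bin_id, per_genome in overlap.items():
        # Assumption: a bin is mapped to the genome contributing most base pairs.
        genome, true_positive_bp = max(per_genome.items(), key=lambda kv: kv[1])
        purity = true_positive_bp / bin_size[bin_id]
        completeness = true_positive_bp / genome_size[genome]
        results[bin_id] = (genome, purity, completeness)
    return results


# Toy example with two genomes and two bins.
gold = {"c1": "A", "c2": "A", "c3": "B", "c4": "B"}
lengths = {"c1": 100, "c2": 300, "c3": 200, "c4": 200}
predicted = {"c1": "bin1", "c2": "bin1", "c3": "bin1", "c4": "bin2"}
metrics = bin_purity_completeness(gold, predicted, lengths)
```

In the toy example, bin1 contains all of genome A plus one contig of genome B, so it is complete but impure, while bin2 is pure but recovers only half of genome B.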
Short Abstract: Molecular functionality of microbiomes is often assessed via metagenomic or metatranscriptomic sequencing. We recently created mi-faser, a computational method for super fast (minutes-per-microbiome) and accurate (90% precision) mapping of sequencing reads to molecular functions of the read-correspondent genes, augmented with a manually curated reference database. Comparing microbiome function profiles between different conditions, we identified previously unseen oil degradation-specific functions in BP oil-spill data, as well as functional signatures of individual-specific gut microbiome responses to a dietary intervention in children with Prader-Willi syndrome. Mi-faser also distinguished Crohn's Disease patient microbiomes from those of related healthy individuals, highlighting the role of the microbiome in CD pathogenicity. In a subsequent, soon-to-be-published study of snow from Svalbard, Norway, we identified higher microbiome dissimilarity in the early vs. late spring samples, suggesting a community recovery hypothesis. The observed correlation between organic acid levels and the geraniol degradation pathway further indicates that members of those communities can degrade complex organic compounds at temperatures below 0°C. These capabilities are potentially valuable in both industrial and bioremediation settings and will be followed up experimentally. In short, due to its speed, accuracy, and robustness to evolutionary differences, mi-faser is useful for generating testable hypotheses of emergent microbiome molecular functionality.
Short Abstract: It is challenging to relate features such as human health outcomes, diet, environmental conditions, or other metadata to microbial community measurements, due in part to their quantitative properties. Microbiome multi’omics are typically noisy, sparse (zero-inflated), high-dimensional, and extremely non-normal, often in the form of either count or compositional measurements. Here, we introduce an optimal combination of established methodology to assess multivariable association of microbial community features with complex metadata in population-scale epidemiological studies. Our approach, MaAsLin2 (Multivariable Association with Linear Models), relies on multiple statistical models to account for the inherent characteristics of modern meta’omic epidemiology study designs, including repeated measures and multiple covariates. To construct this method, we conducted a large-scale evaluation of a broad range of data settings under which straightforward identification of meta’omic associations can be challenging. These simulation studies reveal that MaAsLin2 preserves statistical power in the presence of repeated measures and multiple covariates while accounting for the nuances of meta’omic features and controlling false discovery. Finally, we applied MaAsLin2 to a microbial multi’omic dataset from the Integrative Human Microbiome Project (HMP2) which, in addition to reproducing established results, revealed a unique, integrated landscape of inflammatory bowel disease (IBD) across multiple time points and ‘omics profiles.
Short Abstract: Studies like the Initiative for the Critical Assessment of Metagenome Interpretation (CAMI) have shown that, while metagenomic software already produces promising results, there is still a lot of room for improvement and an absolute requirement for standardized benchmarking data sets. To overcome these obstacles, we present CAMISIM, software for the automatic generation of complete microbial communities in silico. CAMISIM has already been used successfully to create the data sets of the first CAMI challenge and offers extensive options for customizing the desired data sets so that they represent certain microbial compositions, sampling strategies and experimental setups as closely as possible. In addition to providing full microbial sequence samples, CAMISIM always provides a ground truth for assembly, binning and profiling of the produced metagenome, which can subsequently be used to measure the performance of different metagenomic software. We successfully used CAMISIM to create different data sets to show its value in producing both small, specialised data sets for testing metagenomic software and large, realistic benchmarking data sets. CAMISIM is implemented in Python and available under the Apache 2.0 license on GitHub (https://github.com/CAMI-challenge/CAMISIM).
Short Abstract: With the rapid growth of microbiome research across the globe, one major challenge is large-scale analysis of metagenome datasets in the context of the metagenome space known to date. Here we describe Microbiome Search Engine (MSE), a powerful database engine that enables rapid sample search against large-scale microbiome datasets. This database now contains 101,983 curated microbiome samples with clear scientific background from 293 studies. The input query microbiome can be uploaded by users via web or standalone interface and then searched against the entire database for structurally or functionally similar microbiomes in a ‘BLAST-like’ manner. The querying speed is independent of database size, and is less than 0.3 seconds against the whole database, which is 80 times faster than pairwise exhaustive comparison. The search results provide visualized organismal or functional alignment patterns between queries and matches with quantitative similarity scores. The search for ‘best matches’ and ‘top N matches’ among the vast number of microbiomes accumulated so far represents a novel way of not only annotating new microbiome datasets but also identifying scientific hypotheses that probe the complex interplay between microbiome features and ecosystem parameters. More information is available at: http://184.108.40.206/mse.
Short Abstract: Thousands of bacterial species can coexist in one gram of soil, but little is currently known about the structure and metabolic interactions of these communities. When grown in carbon-limited medium, bacteria grown ex situ from soil samples formed stable family-level communities at steady state. This result is surprising given that species competing for a single limiting resource, in our case a single carbon source, should not be able to stably coexist according to the competitive exclusion principle. Spent media experiments indicate that metabolites are exchanged in this microbial community, but we do not know which metabolites or how these cross-feeding interactions contribute to producing a stable and reproducible community structure. To address these questions, we used whole genome shotgun sequencing and experimentally-derived phenotypic data to build constraint-based models of core carbon metabolism for five bacterial community members. Utilizing flux balance analysis (FBA), we simulated growth and predicted pairwise cross-feeding interactions. Simulations showed that our model of each organism was capable of growing on the spent media of all other organisms. They further exhibited expected levels of carbon-conversion efficiency and cross-feeding preference consistent with experimental results.
Short Abstract: The goal of the ImMiGeNe Project is to implement a streamlined pipeline for high-throughput sampling, sequencing, analysis and integrative interpretation of clinical data collected from patients undergoing hematopoietic stem cell transplantation. Gut metagenomic samples, whole exome sequencing data, immunogenic characteristics and inflammatory biomarkers will be collected from patients before and after they receive their transplant. Donors will have their gut microbiome sampled, to be able to compare the communities before and after transplantation and monitor their change in relation to the change in the patient’s immune system. The bioinformatics analysis pipeline we established is highly automated with configuration files provided for different scenarios that guide the analysis. This way it will be easy to use in the clinical diagnostic workflow in the future to determine the microbial community match of a patient and their potential donor. We hope that the gut microbiome can provide markers to predict the outcome of the procedure and aid the donor selection process. The pipeline includes raw read preprocessing and uses fast alignment tools such as DIAMOND and MALT to enable fast processing of samples. Taxonomic and functional analysis is conducted using MEGAN as well as additional scripts for in-depth analysis using the available metadata.
Short Abstract: Culture-independent metagenomic methods are commonly employed in microbiome studies to study the millions of microbes living around/within a host and understand how they impact host function. The rapid decrease in sequencing costs and increasing interest in microbiome studies have increased the metagenome sample sizes per study. The critical bottleneck in analyzing large metagenome samples is access to computational resources: high-performance computing (HPC) and high-performance data analytics (HPDA) systems that have large memory, fast I/O, and multithreaded, and distributed parallel computing necessary to analyze these complex communities. The National Center for Genome Analysis Support (NCGAS) and eXtreme Science and Engineering Discovery Environment (XSEDE) are NSF-funded organizations that collaborate to enable the biological research community to analyze genomic data. Through NCGAS and XSEDE, users have access to a range of resources for bioinformatic analysis. NCGAS and XSEDE offer additional support as online materials, training/workshops, project consultation, and software installation, optimization, and maintenance on these clusters. Metagenomic analysis is also constantly evolving to include additional steps or new upgraded software. Organizations such as NCGAS and XSEDE are resources for staying current with these developments.
Short Abstract: Metagenomic sequencing techniques enable quantitative analyses of the microbiome. However, combining the microbial data from these experiments is challenging due to the variations between experiments. The existing methods for correcting batch effects do not consider the interactions between variables (microbial taxa, in microbial studies) or the overdispersion of microbiome data, and are therefore poorly suited to such data. We developed a new method, Bayesian Dirichlet-multinomial regression meta-analysis (BDMMA), to simultaneously model the batch effects and detect the microbial taxa associated with phenotypes. BDMMA automatically models the dependence among microbial taxa and is robust in detecting associations in high-dimensional, over-dispersed microbiome data with sparse associations. Simulation studies and real data analysis have shown that BDMMA can successfully adjust for batch effects and substantially reduce false discoveries in microbial meta-analyses. BDMMA is a powerful tool for performing meta-analysis of metagenomic studies and detecting, with high accuracy, taxa that are truly associated with the phenotypes. We envision that BDMMA will be widely applied in practice, especially with the rise of large consortium projects such as the American Gut Project and the MetaHIT project.
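As background for the abstract above, the sketch below evaluates the standard Dirichlet-multinomial log-likelihood for one sample of taxon counts and the factor by which it inflates variance relative to a plain multinomial. This is only the textbook distribution that the method's name refers to, not the BDMMA regression or its batch-effect model; the function names are hypothetical.

```python
from math import lgamma, exp


def dirichlet_multinomial_logpmf(counts, alpha):
    """Log-probability of a count vector under a Dirichlet-multinomial.

    counts: observed read counts per taxon
    alpha:  positive concentration parameters, one per taxon
    """
    n = sum(counts)
    a = sum(alpha)
    # Multinomial coefficient: n! / prod(x_i!)
    log_coef = lgamma(n + 1) - sum(lgamma(x + 1) for x in counts)
    # Normalising term: Gamma(A) / Gamma(n + A)
    log_norm = lgamma(a) - lgamma(n + a)
    # Per-taxon terms: Gamma(x_i + alpha_i) / Gamma(alpha_i)
    log_terms = sum(lgamma(x + al) - lgamma(al) for x, al in zip(counts, alpha))
    return log_coef + log_norm + log_terms


def overdispersion_factor(n, alpha):
    """Variance inflation relative to a multinomial with the same mean:
    Var(X_i) = n * p_i * (1 - p_i) * (n + A) / (1 + A), with A = sum(alpha)."""
    a = sum(alpha)
    return (n + a) / (1 + a)


# With alpha = (1, 1) and two reads, all three outcomes are equally likely.
prob = exp(dirichlet_multinomial_logpmf([1, 1], [1.0, 1.0]))
```

Smaller concentration parameters give a larger inflation factor, which is one way to see why count models that ignore overdispersion can understate the variability of microbiome data.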
Short Abstract: Human microbiome studies have now achieved a scale at which it is practical to associate features of the microbiome with health outcomes and covariates in multiple large populations. This permits the development of rigorous meta-analysis and population structure analysis methods. We have developed MMUPHin (Meta-analysis Methods with Uniform Pipeline for Heterogeneity in Microbiome Studies), a set of normalization, meta-analysis, and population structure discovery methods appropriate for microbiome taxonomic and functional profiles. By applying our methods to a combination of eight inflammatory bowel disease (IBD) cohorts (5,232 total samples), we characterized consistent population structure in patients’ gut microbiomes. Evaluation of data handling practices identified those most sensitive to biological variation and robust to batch and technical differences, including known effects of Bacteroides and Prevotella species. Linear mixed effects models revealed consistent enrichment and depletion in the IBD population versus controls. Finally, multiple unsupervised clustering methods, combined with different clustering strength metrics, agreed on a lack of discrete microbiome “types” in the IBD gut microbiome. As these results are consistent across datasets, we anticipate they will provide a reference for the IBD microbiome and a future framework for human microbiome meta-analyses more broadly.
Short Abstract: The human gut microbiota interacts with host metabolism in conditions including insulin resistance, type-2 diabetes (T2D) and obesity. However, the exact contribution of gut microbiota to the development of T2D is not fully understood due to the complexity and diversity of gut microbes, ethnic variation and large variations between individuals studied. The aim of this study was to characterize the gut microbiome of obese adults with T2D versus non-diabetics using 16S rRNA gene amplicon sequencing and shotgun metagenomics. We found that Firmicutes and Bacteroidetes were the dominant phyla by both sequencing approaches, but 16S analysis was biased towards detecting higher proportions of Firmicutes. An increased abundance of Bacilli, in particular Streptococcus, in both approaches and a decrease of Clostridiales in metagenomics data were noted in T2D compared to non-diabetic subjects. Furthermore, functional profiling of shotgun data using HUMAnN2 revealed significant differences in pathway abundances linked to specific species involving short-chain fatty acids and branched chain amino acids. These findings suggest that shotgun sequencing has complementary advantages compared with the 16S amplicon approach in studying the association between gut microbiota and T2D, providing comprehensive insights into bacterial communities and their functional repertoires.
Short Abstract: The bacterial communities in the gut of dairy cattle are very important since they relate to host health, milk production and food safety. However, a comprehensive analysis of gut microbiota in dairy cattle corresponding to each dairy production stage is still lacking. Here we report a systematic analysis of fecal microbiota from 90 dairy cattle over 12 dairy production stages using the DADA2 package, which models and corrects amplicon sequencing errors and thus infers sequence variants, in contrast to traditional OTU clustering approaches that can easily mask biological variation. The study identified 236 genera in 21 phyla, dominated by Firmicutes, Patescibacteria, and Verrucomicrobia. The next-generation sequencing data revealed a high level of heterogeneity in terms of diversity, richness, and composition in cattle of various stages, especially between parous and nulliparous animals. Additionally, we summarized compositional change patterns of overall bacteria along the stages as well as patterns of certain interesting taxa such as Ruminococcaceae, an active plant degrader. Overall, this study provides comprehensive insights into the stability, variability, and composition of gut microbiota in dairy cattle across the entire dairy production line, and it may lay a foundation for future research on dairy food safety, ruminant management, and disease control.
Short Abstract: There is a strong correlation between some pathogens and certain cancer types. One example is Helicobacter pylori and gastric cancer. Exactly how they contribute to host tumorigenesis is, however, a mystery. Pathogens often interact with the host through proteins. To subvert defense, they may mimic host proteins at the sequence, structure, motif, or interface levels. Interface similarity permits pathogen proteins to compete with those of the host for a target protein and thereby alter the host signaling. Detection of host–pathogen interactions (HPIs) and mapping the re-wired superorganism HPI network, with structural details, can provide unprecedented clues to the underlying mechanisms and aid the development of therapeutics. Here, we describe the first computational approach exploiting solely interface mimicry to model potential HPIs. Interface mimicry can identify more HPIs than sequence or complete structural similarity since it appears more common than the other mimicry types. We illustrate the usefulness of this concept by modeling HPIs of H. pylori to understand how they modulate host immunity, persist lifelong, and contribute to tumorigenesis. H. pylori proteins interfere with multiple host pathways as they target several host hub proteins. Our results help illuminate the structural basis of resistance to apoptosis, immune evasion, and loss of cell junctions seen in infected host cells.
Short Abstract: Next-generation amplicon sequencing is a powerful tool for understanding microbial communities. Downstream analysis is often based on the construction of Operational Taxonomic Units (OTUs) with a dissimilarity threshold of 3%. The arbitrary threshold and reliance on OTU references can lead to low resolution, false positives, and misestimation of alpha and beta microbial diversity. We introduce Ampliclust, a reference-free method to resolve the number, abundance and identity of error-free sequences in Illumina amplicon data. Unlike existing methods, Ampliclust is a fully probabilistic model, allowing the data, rather than an algorithm or an external database, to drive the conclusions. We use a modified Bayesian information criterion to estimate the number of sequence variants and obtain maximum likelihood estimates of the abundance and identity of error-free sequences. Our model is able to match the performance of DADA2 on well-separated mock communities, but in simulated communities with more similar real sequences, Ampliclust can achieve better accuracy. The major challenge is computational scalability, which we begin to address through principled iterative schemes and improved initialization methods.
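The model-selection step described above can be illustrated with the standard Bayesian information criterion, BIC = k ln(n) - 2 ln(L). The abstract refers to a modified criterion whose form is not given here, so the sketch below uses the unmodified textbook formula; the function names and the candidate log-likelihood values are invented purely for illustration.

```python
from math import log


def bic(log_likelihood, n_params, n_obs):
    """Standard BIC (lower is better); not the modified criterion of the abstract."""
    return n_params * log(n_obs) - 2.0 * log_likelihood


def select_num_variants(candidates, n_obs):
    """Pick the number of sequence variants K with the lowest BIC.

    candidates: K -> (maximised log-likelihood, number of free parameters)
    n_obs:      number of reads used to fit each model
    """
    scores = {k: bic(ll, p, n_obs) for k, (ll, p) in candidates.items()}
    best_k = min(scores, key=scores.get)
    return best_k, scores


# Hypothetical fits: going from 2 to 3 variants barely improves the likelihood,
# so the extra parameters are penalised and K = 2 is preferred.
candidates = {1: (-1000.0, 10), 2: (-900.0, 20), 3: (-895.0, 30)}
best_k, scores = select_num_variants(candidates, n_obs=1000)
```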
Short Abstract: High-throughput genomic and metagenomic sequencing has revolutionized the exploration of complex microbial communities such as human or soil microbiota. We have developed an approach for describing and comparing microbial communities in terms of their metabolic (in addition to phylogenetic) signatures. Using subsystems-based metabolic reconstruction methodology, we infer phenotypic features (nutrient requirements, carbohydrate utilization capabilities, quorum sensing, etc.) directly from microbial genomes. The obtained collection of binary metabolic phenotypes for ~2,300 reference bacterial genomes representing human gut microbiota was used in a two-step pipeline for prediction of phenotype profiles for 16S rRNA samples. The upstream module determines the taxonomic composition of input samples using classifiers implemented in QIIME2 and various 16S databases (GreenGenes, NCBI, RDP, SILVA) assessed for coverage of the reference collection. The downstream module calculates the matrix of cumulative phenotypes normalized by species abundance for each sample and each metabolic feature. It uses a three-step taxonomic mapping procedure and computes averaged phenotype indices at the levels of species, genus and family for probabilistic assessment of metabolic features of those taxonomic entities that cannot be mapped to presently available reference genomes of individual species. We also implemented sequence-based weighted mapping to reference genomes and compared it with taxonomy-based approaches for community phenotype inference.
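One plausible reading of the downstream module described above is an abundance-weighted average of binary phenotypes over the taxa in a sample. The sketch below implements that reading only: the function name, the input layout, and the choice to renormalise over taxa that have a reference phenotype are assumptions, and the three-step taxonomic mapping is not reproduced.

```python
def community_phenotype_index(abundance, phenotypes):
    """Abundance-weighted fraction of the community carrying each feature.

    abundance:  taxon -> relative abundance in one sample
    phenotypes: taxon -> {feature name: 0 or 1} from reference genomes
    Taxa without a reference phenotype are ignored (an assumption).
    """
    mapped = {t: a for t, a in abundance.items() if t in phenotypes}
    total = sum(mapped.values())
    if total == 0:
        return {}
    features = sorted({f for t in mapped for f in phenotypes[t]})
    return {
        f: sum(a * phenotypes[t].get(f, 0) for t, a in mapped.items()) / total
        for f in features
    }


# Toy sample: sp3 has no reference genome, so it is left out of the average.
abundance = {"sp1": 0.5, "sp2": 0.3, "sp3": 0.2}
phenotypes = {
    "sp1": {"B12_synthesis": 1, "lactose_use": 0},
    "sp2": {"B12_synthesis": 0, "lactose_use": 1},
}
index = community_phenotype_index(abundance, phenotypes)
```

In a fuller version, unmapped taxa such as sp3 would instead receive a genus-level or family-level averaged phenotype, as the abstract describes.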
Short Abstract: Software pipelines have become almost standardized tools for microbiome analysis. Currently many pipelines are available, often sharing some of the same algorithms as stages. This is largely because each pipeline has its own source language and file formats, making it typically more economical to reinvent the wheel than to learn and interface to an existing package. We present Plugin-Based Microbiome Analysis (PluMA), which addresses this problem by providing a lightweight back end that can be infinitely extended using dynamically loaded plugin extensions. These can be written in one of many compiled or scripting languages. With PluMA and its online plugin pool, algorithm designers can easily plug-and-play existing pipeline stages with no knowledge of their underlying implementation, allowing them to efficiently test a new algorithm alongside these stages or combine them in a new and creative way. We demonstrate the usefulness of PluMA through an example pipeline (P-M16S) that expands an obesity study involving gut microbiome samples from the mouse, by integrating multiple plugins using a variety of source languages and file formats, and producing new results.
Short Abstract: The pathways of the holobiont (the host and its microbial symbionts) encompass genes and proteins encoded by the hologenome (the collective genomic content of a host and its microbiome), metabolites and other molecules. Holobiont pathways underpin the host-microbiota interactions that etiologically underlie complex diseases. Given that metagenomics data generally feature a limited sample size and millions of microbial gene variables, it remains challenging to correlate holobiont pathways with host traits, especially disease and intermediate phenotypes. Here we present a framework for associating holobiont pathways with complex diseases including type II diabetes mellitus (T2DM), atherosclerotic cardiovascular disease (ACVD) and inflammatory bowel disease (IBD) by using microbiome data via an adaptive canonical-correlation analysis model (Acam). In this model, we weight the metagene variables with enzyme kinetic data such as the Michaelis constants of related encoding enzymes and gut microbiome metagene abundance. We implemented this model in an R package, HolobiontR, and assessed it with synthetic and real microbiome datasets. By applying this model to the MetaHIT and HMP2 metagenome and metatranscriptome data, we show the potential association of the short chain fatty acid pathway with T2DM, the secondary bile acids pathway with IBD, and trimethylamine holobiont pathways with ACVD.
Short Abstract: Through the Genomics Research and Development Initiative, the EcoBiomics Project focuses on the urgent need to better understand the extent and significance of ongoing changes to microbial and invertebrate biodiversity in the soil and aquatic ecosystems in response to anthropogenic stressors. To address the need for sustainability of ecosystem services and economically important natural resources such as fisheries, forests, and agriculture, the Government of Canada has engaged eight science-based departments and agencies to collaboratively address this challenge. The EcoBiomics project has three overarching objectives: i) Develop standard methods for sample collection, DNA extraction, next-generation DNA sequencing, and a federal bioinformatics platform for harmonizing the analysis of metagenomics data across federal departments, ii) Pilot genomic observatories for establishing comprehensive metagenomics baselines at nationally important long-term environmental monitoring sites in Canada, iii) Apply next generation sequencing to comprehensively characterize aquatic microbiomes, soil microbiomes, and invertebrate zoobiomes and test hypotheses for improving environmental monitoring, assessment, and remediation activities for water quality and soil health across Canada. Through the use of robust and standardized metadata, protocols, and data analysis methods, the EcoBiomics project will contribute high quality datasets that will enable environmental monitoring initiatives and comparative studies using geographically diverse sites.
Short Abstract: Deep learning has revolutionized various fields by offering incomparable strategies to extract abstract nonlinear features that are refractory to traditional methods. Specifically, Long Short-Term Memory (LSTM) networks have the ability to learn dynamic temporal behavior for a time sequence event. Allergic asthma and food allergy, meanwhile, are usually hard to diagnose at young ages. The inability to diagnose patients with these atopic diseases at an earlier age may lead to severe complications. Recently, many studies have linked the infant gut microbiome to allergy development. In this work, we investigate the use of an autoencoder and an LSTM to predict various types of allergies in young children (0-3 years) from a subject’s longitudinal microbiome profiles of stool samples. Our results demonstrate the proper use of the proposed model and show a significant increase in predictive power compared to SVM and logistic regression.
Short Abstract: Risperidone, a commonly prescribed antipsychotic, causes weight gain in humans and mice. We have shown that risperidone-induced shifts in the gut microbiome are mechanistically involved in this weight gain phenotype. It is known that males and females have inherent differences in their microbiome; therefore, we hypothesize that risperidone alters the microbiome differently for the two sexes, which by extension differentially affects weight gain. By co-assembling metagenomic reads, assigning taxonomy, and analyzing the data with Anvi’o, we determined the microbiota of male and female mice in response to risperidone. Female mice showed a loss of species associated with a healthy gut during risperidone treatment, which correlated with a gain in body weight. Conversely, male mice did not show a loss of these organisms in response to risperidone and did not gain weight compared to controls. Assessing the functional capabilities of each gut microbiota, we observed more protein counts associated with antibiotic and secondary metabolite biosynthesis in male mice compared to female mice in response to risperidone. We speculate that the gut microbiome of male mice is protective against risperidone-induced weight gain through a fitness advantage due to increased antibiotic and metabolite biosynthesis. Future biological and informatics approaches will explore this hypothesis.
Short Abstract: Recent algorithmic advances in amplicon-based microbiome studies enable inference of exact amplicon sequence fragments. These new methods allow for investigation of sub-operational-taxonomic-units (sOTUs) by removing erroneous sequences. However, short DNA sequence fragments do not contain sufficient phylogenetic signal to reproduce a reasonable tree, introducing a barrier to the use of critical phylogenetically-aware metrics, like Faith's PD or UniFrac. Although fragment insertion methods do exist, these methods have not been tested for sOTUs from high throughput amplicon studies when inserting against a broad reference phylogeny. We benchmark the SATé-enabled phylogenetic placement (SEPP) technique explicitly against 16S V4 sequence fragments, and show that it outperforms the conceptually problematic but often used practice of reconstructing de novo phylogenies. In addition, we provide a BSD-licensed QIIME2 plugin (https://github.com/biocore/q2-fragment-insertion) for SEPP and integration into the microbial study management platform QIITA. The move from OTU-based to sOTU-based analysis, while providing additional resolution, also introduces computational challenges. We demonstrate that one popular method of dealing with sOTUs (building a de novo tree from the short sequences) can provide incorrect results in human gut metagenomic studies, and show that phylogenetic placement of the new sequences with SEPP resolves this problem while also yielding other benefits over existing methods.
Short Abstract: As genome sequencing continues at its inexorably rapid pace, the resulting information provides unprecedented glimpses into the diversity of life in our biosphere. Numerous novel organisms, with new metabolic pathways and unique nanostructural designs continue to be discovered, and along with that breadth of diversity, as the depth of coverage also increases and multiple genome assemblies become available for a greater number of organisms, pangenomic analysis provides for detailed studies of microevolutionary change. The ATGC (Alignable Tight Genomic Clusters) database is a collection of data for closely related prokaryotic genomes that provides several tools to aid research into evolutionary processes in the microbial world. These clusters, which contain millions of proteins from thousands of genomes organized into hundreds of clusters, are objectively defined based on local gene order (synteny) and synonymous substitutions in the protein-coding genes, since in the realm of small evolutionary distances where traditional phylogenetic markers such as 16S rRNA become useless, these criteria become extremely useful (and are far more useful than raw DNA similarity). As such, each ATGC is suited for analysis of microevolutionary variations within a cohesive group of organisms (e.g., species), whereas the entire collection of ATGCs is useful for macroevolutionary studies.
Short Abstract: Patients with Inflammatory Bowel Disease (IBD) show a greater tendency to develop depression symptoms than healthy individuals. The traditional explanation for this fact has been that being chronically ill has a negative effect on the mental health of the patient. However, the link between depression and the microbiome is now quite well defined, and since IBD is also closely related to microbiome structure, our focus was to analyze which alterations in the microbiome explain this comorbidity. To achieve this, we used data from the Integrative Human Microbiome Project that included several stool samples from 70 individuals with Crohn's Disease, Ulcerative Colitis, or control status, and a mental health questionnaire that all of them answered. A machine learning approach was used to determine which bacteria most influence the different phenotypes. We identified several significantly distinct taxa. Afterwards, we analyzed the metabolic pathways of Burkholderiales bacterium 1_1_47, since it is only present in patients with IBD who do not show symptoms of depression. These results could help us better understand the relationship between the microbiome and the brain, and why some cases of dysbiosis lead to mental health problems and others do not.
Short Abstract: Non-digestible carbohydrate (NDC) intake is associated with beneficial health outcomes. Several lines of evidence suggest that NDCs modulate the microbial community in the gastrointestinal tract, promoting microbes with beneficial traits. However, the extent to which different NDCs modulate the microbiome and their temporal dynamics remains unclear. We designed three complementary studies to thoroughly characterize the effects of two NDCs, oligofructose (FOS) and polydextrose (PDX), on the structure and function of the human microbiome. Using dense time-series metagenomic data, we determined responses to the NDCs in two studies and validated them in a third cross-over study featuring both. We observed that FOS and PDX had distinct effects on microbiome structure, with FOS decreasing and PDX increasing diversity. FOS consistently promoted Bifidobacterium spp. while PDX promoted Parabacteroides spp., among other significant taxa shifts. These shifts were recapitulated in the cross-over study independent of the order of NDC intake. Taxonomic shifts were further linked to changes in functional gene profiles, suggesting differences in carbohydrate utilization and metabolic capacity. Taken together, these results demonstrate that NDCs can modulate the human gut microbiome in distinct, targeted, and predictable manners, which could be used to promote specific health-related outcomes. These important findings support further research with proprietary compounds.
Short Abstract: Accurate prediction of the host phenotype from a metagenomic sample and identification of the associated bacterial markers are important in metagenomic studies. We introduce PopPhy-CNN, a novel convolutional neural network (CNN) learning architecture that effectively exploits phylogenetic structure in microbial taxa. PopPhy-CNN takes as input a 2D matrix that represents the phylogenetic tree as an image, populated with the relative abundance of microbial taxa in a metagenomic sample. This conversion enables CNNs to explore the spatial relationship of the taxonomic annotations on the tree and their quantitative characteristics in metagenomic data. PopPhy-CNN is evaluated using three metagenomic datasets of moderate size. We show the superior performance of PopPhy-CNN compared to random forest, support vector machines, LASSO, and a baseline 1D-CNN model constructed with relative abundance microbial feature vectors. In addition, we design a novel scheme of feature extraction from the learned CNN models and demonstrate the improved performance when the extracted features are used to train support vector machines. PopPhy-CNN facilitates not only the retrieval of informative microbial taxa from the trained CNN models but also the visualization of the taxa on the phylogenetic tree.
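The tree-to-matrix conversion described in the PopPhy-CNN abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the layout (one matrix row per tree depth, nodes placed left to right, zero padding), the rule that an internal node carries the sum of its children, the function names, and the toy taxa are all assumptions made for illustration.

```python
def populate(tree, abundance, root="root"):
    """Return {node: abundance}; an internal node sums its children (assumed rule)."""
    totals = {}

    def visit(node):
        children = tree.get(node, [])
        if not children:
            totals[node] = abundance.get(node, 0.0)
        else:
            totals[node] = sum(visit(child) for child in children)
        return totals[node]

    visit(root)
    return totals


def tree_to_matrix(tree, abundance, root="root"):
    """Lay the populated tree out as a zero-padded 2D matrix, one row per depth."""
    totals = populate(tree, abundance, root)
    layers, frontier = [], [root]
    while frontier:
        layers.append(frontier)
        frontier = [child for node in frontier for child in tree.get(node, [])]
    width = max(len(layer) for layer in layers)
    return [
        [totals[node] for node in layer] + [0.0] * (width - len(layer))
        for layer in layers
    ]


# Hypothetical toy tree (parent -> children) and leaf relative abundances.
toy_tree = {
    "root": ["Firmicutes", "Bacteroidetes"],
    "Firmicutes": ["Clostridium", "Lactobacillus"],
    "Bacteroidetes": ["Bacteroides"],
}
toy_abundance = {"Clostridium": 0.25, "Lactobacillus": 0.25, "Bacteroides": 0.5}
matrix = tree_to_matrix(toy_tree, toy_abundance)
```

A matrix of this kind keeps related taxa in neighboring cells, which is what allows a 2D convolution to pick up signals shared within a clade.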
Short Abstract: Biomarker discovery is one of the most successful means for translating genomic data into clinical practice. Changes in microbial compositions in the gut have been associated with disease states such as Type 2 Diabetes (T2D), Obesity, and Inflammatory Bowel Disease (IBD). Reliable identification of the most informative features (i.e., microbes) for discriminating metagenomics samples from two or more groups (i.e., phenotypes) is a major challenge in computational metagenomics. In this work, we propose a comparative network-based framework for detecting biomarkers from metagenomics data. Our framework has two customizable components: i) a network inference component, which applies any existing tool for inferring ecological networks from the abundances of microbial operational taxonomic units (OTUs); ii) a node importance scoring component, which compares the networks constructed for two phenotypes and scores each node based on a measure of the change in its topological properties between the two networks. Our preliminary results for identifying biomarkers for IBD using a large cohort dataset of 657 IBD and 316 healthy control metagenomic samples show that our network-based approach is very competitive with some state-of-the-art feature selection methods, including the widely used method based on random forest variable importance scores.
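The node importance scoring component in the abstract above can be sketched as follows. The abstract does not say which topological property is compared, so this sketch assumes node degree and scores each node by the absolute change in degree between the two phenotype networks. The function names and the toy edge lists are hypothetical.

```python
from collections import Counter


def degree(edges):
    """Count how many edges touch each node in an undirected edge list."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def node_importance(edges_a, edges_b):
    """Score each node by |degree in network A - degree in network B| (assumed measure)."""
    deg_a, deg_b = degree(edges_a), degree(edges_b)
    return {node: abs(deg_a[node] - deg_b[node]) for node in set(deg_a) | set(deg_b)}


# Hypothetical co-occurrence networks inferred separately for each phenotype.
healthy_edges = [("A", "B"), ("B", "C"), ("C", "D")]
ibd_edges = [("A", "B"), ("A", "C"), ("A", "D")]

scores = node_importance(healthy_edges, ibd_edges)
# Rank candidate biomarkers: largest change first, ties broken by name.
ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

Any other node-level property (for example betweenness or clustering coefficient) could replace degree without changing the structure of the comparison.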
Short Abstract: The discovery of novel microbial genes from metagenomes and, increasingly, metatranscriptomes has outpaced our ability to functionally characterize those genes. In this work, we present PPANINI (Prioritization and Prediction of functional Annotation for Novel and Important genes via automated data Network Integration), a method to prioritize genes based on an “importance” score calculated across microbial communities. We validated PPANINI by, first, assessing homologs of known essential genes, achieving high accuracy (e.g. AUC=0.74, 0.82, and 0.94). This was true across a range of microbial habitats, including four human body sites (skin, vagina, gut, and mouth), marine, and prairie soil metagenomes. Applying the method to these environments prioritized in total 463,044 novel and 274,913 uncharacterized gene families, in addition to 124,332 already-characterized genes. These differed strikingly from isolate genome analysis, with 722,304 gene families identified based solely on metagenomes. Finally, applying PPANINI to the Crohn’s disease metatranscriptome revealed enriched functional categories important in the disease, including viral release from host cells. This method thus provides an efficient strategy to identify potentially important, undercharacterized genes from microbial communities, paving the way for improved bioinformatic and biochemical characterization efforts. http://huttenhower.sph.harvard.edu/ppanini.
Short Abstract: For the last decade, a cultivation-independent metagenomics approach, in which the entire set of microorganisms in a sample is directly sequenced together, has been widely applied to understand the crucial roles of microbes in human health. In previous work, Sigma was proposed for strain-level identification and quantification of microbes using their reference genomes in metagenomic analysis. Here we present SigmaW, a fast and accurate taxonomy profiler for metagenomic analysis on cloud computing. SigmaW uses Amazon Web Services (AWS) to provide its primary cloud computing capabilities. Cloud computing makes SigmaW more user friendly by giving users a quick and easy way of running the metagenomics profiling tool without undertaking the initial software setup and command line program execution. Elastic Beanstalk (EB), Relational Database Service (RDS), and EC2 are the central services used in SigmaW. In addition, the small size of NCBI reference genomes enabled a quick analysis of the metagenomic datasets to get a sketch of microbiome compositions. The algorithm performance was evaluated using simulated mock communities and human microbiome samples.
Short Abstract: Shotgun metagenomic sequencing creates incredibly rich datasets, with abundant information regarding the different organisms and biological functions present in a system. Devising methods for accurately measuring this information in a computationally tractable manner remains a tremendous challenge due to the large diversity of these datasets. There are generally two approaches, albeit with many alternative strategies for each: (1) read-based annotation methods, which annotate each raw read separately against a database, and (2) assembly-based methods, which first perform a de-novo assembly of raw reads and then annotate the resulting contigs. Here, we compare these two approaches, looking at the computational requirements and scaling of each strategy, the diversity and types of results obtained, and the quantitative similarities in specific estimates, such as taxonomic abundance, that can be measured from each approach. Furthermore, we provide general guidelines about what types of experimental questions would be best addressed by read-based annotation versus assembly.
Short Abstract: Advances in sequencing have led to an improved understanding, and promise, of the environmental impact of microbial communities. Polystyrene (PS), a biodegradation-resistant material commonly known as Styrofoam, can be used as a carbon source for microorganisms; however, its high molecular weight limits its use as a substrate. Recently, mealworms have demonstrated the ability to consume PS, and we have shown increases in consumption when the mealworms are conditioned on a high-sugar diet. As the exact mechanism of degradation is unclear, 16S rRNA sequencing was performed to compare the fecal and gut microbiomes of two diet-conditioned mealworm groups at four time points (Day 0, Day 5, Day 8, Day 12) of exposure to a PS-only diet. No significant differences (Shannon Index, p=0.88) were found between sample types, indicating that fecal material is representative of the gut microbiome. Significant differences (Shannon Index, p=7.88E-5) were found between mealworms on a sugar-rich (dry apple slices) and a sugar-poor (rice bran) diet. Among collection time points, the strongest differences were noted between Day 0/12 and Day 8/12 (Shannon Index, p=3E-5 and p=6.7E-3). These findings may be indicative of a rapid adaptation to the changing food sources. Future studies include culturing overrepresented species to provide a detailed characterization of the community capable of degrading Styrofoam and, possibly, other plastics.
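The Shannon Index reported in the abstract above is the standard diversity measure H = -sum(p_i * ln p_i) over the relative abundances p_i of the taxa in a sample. A minimal sketch of the calculation follows. The count vectors are made-up illustrations, not data from the study, and the abstract does not say which statistical test produced the p-values, so none is shown here.

```python
import math


def shannon_index(counts):
    """Shannon diversity H = -sum(p * ln p) over taxa with nonzero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


# Hypothetical per-taxon read counts for two samples.
even_sample = [10, 10, 10, 10]   # four equally abundant taxa
skewed_sample = [37, 1, 1, 1]    # one dominant taxon

h_even = shannon_index(even_sample)
h_skewed = shannon_index(skewed_sample)
```

For a perfectly even sample of n taxa the index equals ln(n), and it falls toward 0 as a single taxon comes to dominate.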
Short Abstract: Most studies of microbiome function rely on identifying individual species (or higher taxonomic ranks), mainly through the use of marker genes, followed by functional annotation and/or phylogenetic mapping. This approach can suffer from loss of information due to the lack of reference genomes, and is computationally very expensive. In this study, we treat a microbiome as a population of genes/proteins and identify overrepresented protein families in the whole sample as opposed to individual species. This approach obviates factors such as horizontal gene transfer or unculturable species by assessing the behaviour of the entire microbiome in response to different conditions. Using a k-mer based pipeline to find frequently-occurring motif fragments in short-read data from lean, overweight, and obese twins, we construct functional groupings which show the most influential functions in all three cases. This is done without the need for global alignment, assembly or reference genomes, and a typical run takes approximately 3 hours on a standard laptop. Preliminary results have shown a much greater diversity of influential functional groupings in obese twins compared to lean and overweight ones. We also found around 185 potential candidates for novel protein families which warrant further experimental investigation.
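The k-mer counting step at the core of a pipeline like the one in the abstract above can be sketched in a few lines. This is only an illustration of alignment-free counting of frequently occurring fragments: the value of k, the count threshold, the nucleotide alphabet, and the function names are assumptions, and the abstract's actual pipeline (which targets protein motif fragments) is not reproduced here.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def frequent_kmers(reads, k, min_count):
    """Keep only k-mers seen at least min_count times (assumed cutoff)."""
    return {kmer: n for kmer, n in count_kmers(reads, k).items() if n >= min_count}


# Hypothetical short reads.
toy_reads = ["ACGTAC", "TTACGT"]
frequent = frequent_kmers(toy_reads, k=3, min_count=2)
```

Because each read is scanned once and no alignment or assembly is needed, this style of counting scales linearly with the amount of sequence.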
Short Abstract: Multiple molecular data types are increasingly used to study microbial community dynamics over time, for example in the NIH Human Microbiome Project (HMP). We have developed a set of complementary multi'omic longitudinal models for such data, including Gaussian Processes (GPs) with a Beta-Binomial likelihood appropriate for microbial communities' technical zeros, sequencing depth, overdispersion, and compositionality. Using GPs, we present new findings from a dramatic expansion of shotgun metagenomes (now ~2,400 samples) from the HMP (“HMP1-II”). We partitioned variance of microbial taxa and metabolic processes into host-specific, temporally-variable, and rapidly-variable subsets. We found that species abundances in the gut were highly individualized, particularly within the Bacteroidetes phylum, while Firmicutes tended to be shared among individuals with varied abundance over time. Microbes at other sites did not exhibit such a phylum-level distinction, and were less personalized than those in the gut. Meanwhile, metabolic pathways were not personalized despite being encoded by personalized microbial communities, indicating that community assembly may be mediated by the need for keystone functions rather than particular taxa. The results and framework presented here will enable further in-depth characterizations of the dynamics of the microbiome, particularly as longitudinal datasets become more widely available in the field.
Short Abstract: Lateral gene transfer (LGT) is an important mechanism for genome diversification in microbial communities, including the human microbiome. While methods exist to identify LGTs from sequenced isolate genomes, identifying LGTs from community metagenomes remains an open problem. To address this, we developed WAAFLE: the Workflow to Annotate Assemblies and Find LGT Events. WAAFLE integrates gene sequence homology and taxonomic provenance to identify metagenomic contigs explained by pairs of microbial clades but not by single clades (i.e. putative LGTs). It also rules out alternative explanations such as gene deletion and misassembly. We validated our approach on synthetic contigs containing spiked LGTs: WAAFLE identified challenging intra-genus LGTs with 51% sensitivity, other LGTs with >91% sensitivity, and was >99.9% specific. We then applied WAAFLE to 138 million contigs from 2,289 assembled human metagenomes (the HMP1-II dataset), revealing 393 thousand novel LGTs (182±173 per metagenome, mean±SD). These were enriched in the oral and gut body sites (compared to skin and vagina) and among phylogenetically related taxa. Transferred functions were enriched for known mobile elements as well as outer membrane proteins, such as TonB receptors. Hence, WAAFLE is a powerful and useful approach for profiling LGTs in microbial communities.
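The core idea in the WAAFLE abstract above, flagging contigs that are explained by a pair of clades but not by any single clade, can be illustrated with a simplified sketch. This is not WAAFLE's algorithm: the per-gene homology scores, the 0.8 threshold, the function names, and the example clades are all assumptions, and the checks for gene deletion and misassembly mentioned in the abstract are omitted.

```python
from itertools import combinations


def explains(gene_scores, clades, threshold):
    """True if every gene on the contig scores >= threshold for at least one clade."""
    return all(
        max(scores.get(clade, 0.0) for clade in clades) >= threshold
        for scores in gene_scores
    )


def classify_contig(gene_scores, threshold=0.8):
    """Prefer a single-clade explanation; otherwise look for a two-clade one."""
    clades = sorted({clade for scores in gene_scores for clade in scores})
    for clade in clades:
        if explains(gene_scores, (clade,), threshold):
            return ("single", (clade,))
    for pair in combinations(clades, 2):
        if explains(gene_scores, pair, threshold):
            return ("putative_LGT", pair)
    return ("unexplained", ())


# Hypothetical contigs: one dict of {clade: homology score} per gene, in order.
lgt_contig = [
    {"E. coli": 0.95, "B. fragilis": 0.10},
    {"E. coli": 0.90},
    {"B. fragilis": 0.97, "E. coli": 0.20},
]
plain_contig = [{"E. coli": 0.90}, {"E. coli": 0.85}]

lgt_call = classify_contig(lgt_contig)
plain_call = classify_contig(plain_contig)
```

Testing single clades before pairs reflects the abstract's framing: a contig counts as a putative LGT only when no one-clade explanation is adequate.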