Microbiome COSI

Schedule subject to change
All times listed are in CDT
Wednesday, July 13th
10:30-11:15
Keynote Presentation: Methods for integration and hypothesis generation from high dimensional biomedical microbiome datasets
Room: FI
Format: Live from venue

Moderator(s): Mihai Pop

  • Catherine Lozupone


Presentation Overview:

Differences in microbiome composition have been described in a variety of disease contexts, but a major challenge is to understand the mechanisms that may drive these relationships. Microbes may influence disease through cellular components or metabolites that they produce and that in turn interact with the host, for example by influencing the immune system. These relationships might further be influenced by environmental or demographic characteristics such as diet or race/ethnicity. Complex datasets that we have been working with to predict relationships between the microbiome and disease have included in-depth information on microbiome composition and activity, diet, immune status (e.g., extensive cytokine panels and data on cell populations from time-of-flight mass cytometry (CyTOF)), metabolome, and demographics. I will discuss tools that we have been developing in this context for feature reduction of microbiome data (SCNIC: Sparse Correlation Network Investigation for Compositional Data), and for finding and visualizing relationships between microbes and other complex data types using linear regression (VOLARE: visual analysis of disease-associated microbiome-immune system interplay). Once correlative relationships have been determined, a further goal is to use existing knowledge to generate hypotheses regarding their underlying basis. I will discuss our approach to exploring microbiome:metabolite relationships using metabolic networks (AMON: Analysis of Metabolite Origins Using Networks) and broader knowledge bases. Taken together, these approaches should enable the integration of complex data to generate hypotheses worthy of further experimental validation.

11:15-11:30
Lasonolide A is synthesized by a trans-AT PKS pathway present in an uncultured Verrucomicrobiota
Room: FI
Format: Live from venue

Moderator(s): Mihai Pop

  • Jackie Metz, Florida Atlantic University, Harbor Branch Oceanographic Institute, United States
  • René Xavier, Florida Atlantic University, Harbor Branch Oceanographic Institute, United States
  • Guojun Wang, Valent BioSciences, United States
  • Amy Wright, Florida Atlantic University, Harbor Branch Oceanographic Institute, United States
  • Jason Kwan, University of Wisconsin-Madison, United States
  • Siddharth Uppal, Division of Pharmaceutical Sciences, School of Pharmacy, University of Wisconsin-Madison, United States


Presentation Overview:

Lasonolide A (LSA) is a bioactive polyketide isolated from the marine sponge Forcepia sp. LSA exhibits potent anticancer activity against certain cell lines in the National Cancer Institute-60 cell line screen. Furthermore, LSA acts by a unique mechanism, making it an excellent drug lead. However, the limited supply of the sponge and the laborious chemical synthesis of LSA have hampered its transition into clinical trials. Identifying the LSA producer and elucidating the biosynthetic genes that produce LSA could allow researchers to access larger quantities of LSA, thus facilitating its progression into the clinic.
We analyzed the metagenome of Forcepia sp. and uncovered a putative trans-AT PKS pathway (las BGC) proposed to produce LSA. The las BGC was found in a bacterium belonging to a novel genus of the phylum Verrucomicrobiota, which we named “Candidatus Thermopylae lasonolidus”. The significantly different pentanucleotide (5-mer) composition and GC percent of the las BGC compared with the “Ca. T. lasonolidus” genome suggest a horizontal acquisition of the gene cluster. Mapping of paired-end reads and analysis of the assembly graph revealed three copies of the las BGC in “Ca. T. lasonolidus”. The three repeats were not identical and contained differences including insertions and single-nucleotide polymorphisms.
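
The composition comparison described above (GC percent and 5-mer frequencies of a gene cluster versus its host genome) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the Euclidean distance, and the toy sequences are assumptions made for illustration, and no significance test is performed.

```python
from collections import Counter
from math import sqrt


def gc_percent(seq: str) -> float:
    """Percentage of G and C bases in a DNA sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def kmer_profile(seq: str, k: int = 5) -> dict[str, float]:
    """Relative frequency of every k-mer made only of A, C, G, T."""
    seq = seq.upper()
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= set("ACGT")
    )
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


def profile_distance(p: dict[str, float], q: dict[str, float]) -> float:
    """Euclidean distance between two k-mer frequency profiles."""
    keys = set(p) | set(q)
    return sqrt(sum((p.get(key, 0.0) - q.get(key, 0.0)) ** 2 for key in keys))


# Hypothetical toy sequences standing in for a gene cluster and its host genome.
cluster = "ATATATTTAAATATATTTAAATATAT"
genome = "GCGCGGCCGCGCGGGCCCGCGCGGCC"
print(gc_percent(cluster), gc_percent(genome))
print(profile_distance(kmer_profile(cluster), kmer_profile(genome)))
```

A large profile distance or GC gap between a cluster and the rest of the genome is only a hint of horizontal acquisition; a real analysis would compare the cluster against the distribution of values over many genomic windows.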

11:30-11:45
Host-microbiome protein-protein interactions capture disease-relevant pathways
Room: FI
Format: Live from venue

Moderator(s): Mihai Pop

  • Juan Felipe Beltrán, Cornell University, United States
  • Ilana Brito, Cornell University, United States
  • Hao Zhou, Cornell University, United States


Presentation Overview:

Host-microbe interactions are crucial for normal physiological and immune system development and are implicated in a variety of diseases. To identify potential pathways through which human-associated bacteria impact host health, we leverage publicly available interspecies protein-protein interaction (PPI) data to find clusters of microbiome-derived proteins with high sequence identity to known human-protein interactors. We observe differential targeting of putative human-interacting bacterial genes in nine independent metagenomic studies, finding evidence that the microbiome broadly targets human proteins involved in immune, oncogenic, apoptotic, and endocrine signaling pathways in relation to IBD, CRC, obesity, and T2D diagnoses. This host-centric analysis provides a mechanistic hypothesis-generating platform and adds extensive human functional annotation to commensal bacterial proteins.

11:45-12:00
Pan-cancer characterization of microbiome signatures
Room: FI
Format: Live from venue

Moderator(s): Mihai Pop

  • Wei-Hao Lee, Systems, Synthetic, and Physical Biology Program, Rice University, United States
  • Ruth Dannenfelser, Department of Computer Science, Rice University, United States
  • Vicky Yao, Department of Computer Science, Rice University, United States


Presentation Overview:

Cancer has been studied at the molecular and genetic level in cells for decades, but non-cellular elements from the microenvironment, such as the microbiome, are rarely investigated. The tumor-resident microbiome has been linked to malignancies in both direct and indirect ways. To systematically characterize the cancer-associated microbiome, we re-examine the 32 cancer types from The Cancer Genome Atlas (TCGA) by matching non-human reads to microbial reference genomes. We then use semi-supervised non-negative matrix factorization to identify microbiome signatures. By analyzing these signatures, we successfully recapitulate known cancer-associated microbes and further identify several novel associations, including signatures that are linked with survival outcomes. This comprehensive investigation provides an overview of the microbiome spectrum across cancer types and establishes a new method for assessing the interplay between the microbiome and human disease.

12:00-12:15
Characterization and integration of transcriptional and microbial profiles of oral lesions and cancer
Room: FI
Format: Live from venue

Moderator(s): Mihai Pop

  • Mohammed Muzamil Khan, Boston University, United States
  • Jennifer Frustino, Erie County Medical Center, United States
  • Alessandro Villa, University of California, San Francisco, United States
  • Cuc Bach-Nguyen, Boston University School of Medicine, United States
  • Sook Bin-Woo, Brigham and Women's Hospital, United States
  • Xaralabos Varelas, Boston University School of Medicine, United States
  • Maria Kukuruzinska, Boston University School of Medicine, United States
  • Stefano Monti, Boston University School of Medicine, United States


Presentation Overview:

Head and neck cancer is a complex malignancy, and cancer of the oral cavity, its major anatomical subsite, ranks among the most deadly and disfiguring cancers due to the lack of early detection and effective treatments. Oral cancer (OC) presents primarily as HPV-negative oral squamous cell carcinoma, whose etiology includes tobacco and alcohol use. OC is thought to progress through a series of well-defined clinical and histopathological stages that start as premalignant lesions (PMLs). In this study, using total RNA sequencing and leveraging multiple bioinformatics methods, including differential gene expression and pathway enrichment analyses, we show that PMLs are characterized by the activation of major pro-inflammatory and tumor-promoting pathways, such as epithelial-to-mesenchymal transition, TNFa, and NFkB, along with anti-inflammatory pathways, such as FCGR and IL6, which may prevent PMLs from progressing further. Through mediation analysis integrating host transcriptome and microbiome, we further show that these pathways may be driven by a concomitant differential abundance of specific microbes previously shown to be associated with OC. These results suggest that the cross-talk between host and microbial activity may play a significant role in the malignant transformation of PMLs and may help uncover early detection markers and drivers of transformation of HPV(-) lesions and cancers.

12:15-12:30
The Gastric Microbiome and Gastric Carcinogenesis: Bacteria diversity, Co-occurrence patterns and Predictive Models
Room: FI
Format: Live from venue

Moderator(s): Mihai Pop

  • Edwin Moses Appiah, Department of Biochemistry and Biotechnology, KNUST, Ghana
  • Samson Pandam Salifu, Department of Biochemistry and Biotechnology, KNUST, Ghana


Presentation Overview:

Changes in microbiome composition and interactions have been implicated in gastric cancer development. Many studies aimed at understanding how the microbiome affects the pathogenesis of the disease have provided relevant yet varying results. We present a comprehensive analysis of the gastric microbiome in gastric carcinogenesis, focusing on bacterial diversity, co-occurrence patterns, and ultimately the identification of potential microbial biomarkers. We combined raw 16S rRNA data from six studies, covering 985 samples from healthy individuals and from patients with gastritis, intestinal metaplasia, and cancer. Batch effects were corrected with the Harman package in R. Proteobacteria composition and diversity decrease, and Actinobacteria increase, with carcinogenesis. Transient oral pathogenic and intestinal bacteria, including Prevotella, Propionibacterium acnes, Acinetobacter baumannii, Lactobacillus, and Gordonia polyisoprenivorans, were highly enriched with increasing carcinogenesis from gastritis to cancer. Microbial co-occurrence analysis revealed essential keystone species, with Pseudoxanthomonas spadix and Sphingobium represented as hubs in healthy individuals. Filifactor alocis showed significant interaction with the pathogenic bacterium Fusobacterium nucleatum in gastric cancer communities. LASSO models revealed Bacteroides dorei, Hydrogenophilus hirschii, and Propionibacterium granulosum as potential biomarkers for gastric cancer. This study provides significant insight into the gastric microbial communities and how they could serve as a potential tool for predicting gastric carcinogenesis.

14:30-15:00
Proceedings Presentation: Syotti: Scalable Bait Design for DNA Enrichment
Room: FI
Format: Live from venue

Moderator(s): Mihai Pop

  • Jarno Alanko, University of Helsinki, Finland
  • Ilya Slizovskiy, Department of Veterinary Population Medicine, College of Veterinary Medicine, University of Minnesota, United States
  • Daniel Lokshtanov, University of California, Santa Barbara, United States
  • Travis Gagie, Diego Portales University, Chile
  • Noelle Noyes, University of Minnesota, United States
  • Christina Boucher, University of Florida, United States


Presentation Overview:

Motivation: Bait enrichment is a relatively new protocol that is becoming increasingly ubiquitous, as it has been shown to successfully amplify regions of interest in metagenomic samples. In this method, a set of synthetic probes (“baits”) is designed, manufactured, and applied to fragmented metagenomic DNA. The probes bind to the fragmented DNA and any unbound DNA is rinsed away, leaving the bound fragments to be amplified for sequencing. Most recently, Metsky et al. (Nature Biotech 2019) demonstrated that bait enrichment is capable of detecting a large number of human viral pathogens within metagenomic samples.

Results: We formalize the problem of designing baits by defining the Minimum Bait Cover problem, show that the problem is NP-hard even under very restrictive assumptions, and design an efficient heuristic that takes advantage of succinct data structures. We refer to our method as Syotti. The running time of Syotti shows linear scaling in practice, running at least an order of magnitude faster than state-of-the-art methods, including the recent method of Metsky et al. At the same time, our method produces bait sets that are smaller than the ones produced by the competing methods, while also leaving fewer positions uncovered. Lastly, we show that Syotti requires only 25 minutes to design baits for a dataset comprising 3 billion nucleotides from 1000 related bacterial substrains, whereas the method of Metsky et al. shows clearly super-linear running time and fails to process even a subset of 8% of the data in 24 hours.

Availability: https://github.com/jnalanko/syotti.
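
To make the bait cover idea concrete, here is a minimal greedy sketch. It is an illustration only and not Syotti itself: it scans for matches naively instead of using succinct data structures, ignores reverse complements, and the function names and parameters (`bait_len`, `max_mismatches`) are assumptions.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def greedy_bait_cover(seqs: list[str], bait_len: int, max_mismatches: int) -> list[str]:
    """Pick baits left to right until every position of every sequence is covered.

    A position counts as covered when it lies inside a window of length
    bait_len that matches some bait with at most max_mismatches mismatches.
    """
    if any(len(s) < bait_len for s in seqs):
        raise ValueError("every sequence must be at least bait_len long")

    covered = [[False] * len(s) for s in seqs]
    baits: list[str] = []
    for s_idx, seq in enumerate(seqs):
        for pos in range(len(seq)):
            if covered[s_idx][pos]:
                continue
            # Take the window starting at the first uncovered position,
            # shifted left if it would run past the end of the sequence.
            start = min(pos, len(seq) - bait_len)
            bait = seq[start:start + bait_len]
            baits.append(bait)
            # Naive scan: mark every window the new bait can bind to.
            for t_idx, target in enumerate(seqs):
                for j in range(len(target) - bait_len + 1):
                    if hamming(bait, target[j:j + bait_len]) <= max_mismatches:
                        covered[t_idx][j:j + bait_len] = [True] * bait_len
    return baits


# Hypothetical toy input: two closely related sequences.
toy = ["ACGTACGT", "ACGTACGA"]
print(greedy_bait_cover(toy, bait_len=4, max_mismatches=1))
print(greedy_bait_cover(toy, bait_len=4, max_mismatches=0))
```

Allowing one mismatch lets a single bait cover both toy sequences, while exact matching needs a second bait. The naive scan is quadratic in the input size, which is the cost an indexed search is meant to avoid.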

15:00-15:15
Target-enriched long-read sequencing (TELSeq) contextualizes antimicrobial resistance risk in metagenomes.
Room: FI
Format: Live from venue

Moderator(s): Mihai Pop

  • Ilya B. Slizovskiy, University of Minnesota, United States
  • Marco Oliva, University of Florida, United States
  • Jonathen K. Settle, University of Florida, United States
  • Lidiya V. Zyskina, University of Maryland, United States
  • Mattia C. F. Prosperi, University of Florida, United States
  • Christina Boucher, University of Florida, United States
  • Noelle R. Noyes, University of Minnesota, United States


Presentation Overview:

Metagenomic data can be used to profile high-importance functions within microbiomes. However, current metagenomic workflows produce data that suffer from low sensitivity and an inability to accurately reconstruct partial or full genomes. These limitations preclude colocalization analysis, i.e., the ability to characterize the genomic context of genes and functions within a metagenomic sample. Genomic context is especially crucial for functions associated with horizontal gene transfer (HGT) via mobile genetic elements (MGEs), for example antimicrobial resistance (AMR). To overcome this current limitation of metagenomics, we present a method for comprehensive and accurate reconstruction of antimicrobial resistance genes (ARGs) and MGEs from metagenomic DNA, termed target-enriched long-read sequencing (TELSeq).
Using replicates of diverse sample types, we compared TELSeq performance to that of non-enriched PacBio and short-read Illumina sequencing. TELSeq achieved much higher sensitivity than the other methods, revealing an extensive resistome profile comprising many low-abundance ARGs, including some with public health importance. Using the long reads generated by TELSeq, we identified numerous MGEs flanking the low-abundance ARGs, indicating that these ARGs could be transferred across bacterial taxa via HGT.

15:15-15:30
SCRAPT: An Iterative Algorithm for Clustering Large 16S rRNA Data Sets
Room: FI
Format: Live from venue

Moderator(s): Mihai Pop

  • Tu Luan, University of Maryland, College Park, United States
  • Harihara Subrahmaniam Muralidharan, University of Maryland, United States
  • Marwan Alshehri, University of Maryland, College Park, United States
  • Mihai Pop, University of Maryland, United States


Presentation Overview:

16S rRNA sequence clustering is an important tool in characterizing the diversity of microbial communities. As 16S rRNA data sets grow in size, existing sequence clustering algorithms become an analytical bottleneck. Existing methods spend much of their time clustering singletons and produce fragmented clusters, leaving room for improvement. We propose an iterative sampling-based 16S rRNA sequence clustering approach that targets the largest clusters, allowing users to stop the clustering process when sufficient clusters are available for the specific analysis being targeted. We describe a probabilistic analysis of the iterative clustering process that supports the intuition that the clustering process identifies the larger clusters in the data set first. Using real data sets of 16S rRNA gene sequences, we show that our iterative algorithm SCRAPT, coupled with an adaptive sampling process and a mode-shifting strategy for identifying cluster representatives, substantially speeds up the clustering process while being effective at capturing the large clusters in the data set. The experiments also show that SCRAPT produces Operational Taxonomic Units (OTUs) that are less fragmented than those produced by the popular tools UCLUST, CD-HIT, and DNACLUST.

Software Availability: The algorithm is implemented in the open-source package SCRAPT and is available at https://github.com/hsmurali/SCRAPT.
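
The sampling loop described in the abstract can be sketched roughly as follows. This is a simplified illustration, not the SCRAPT implementation: it uses a position-wise identity on equal-length, non-empty strings instead of alignment, a fixed sampling fraction instead of adaptive sampling, and no mode-shifting; all names and defaults are assumptions.

```python
import random


def identity(a: str, b: str) -> float:
    """Fraction of matching positions; assumes non-empty, equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))


def iterative_cluster(seqs, threshold=0.9, sample_frac=0.5, max_iters=10, seed=0):
    """Cluster sequences by repeatedly sampling centers and recruiting members.

    Returns (center_index, member_indices) pairs, largest cluster first.
    """
    rng = random.Random(seed)
    remaining = list(range(len(seqs)))
    clusters = []
    for _ in range(max_iters):
        if not remaining:
            break
        sample_size = max(1, int(sample_frac * len(remaining)))
        sample = rng.sample(remaining, sample_size)

        # Greedy center selection inside the sample: a sampled sequence
        # becomes a center only if it is far from all existing centers.
        centers = []
        for idx in sample:
            if all(identity(seqs[idx], seqs[c]) < threshold for c in centers):
                centers.append(idx)

        # Recruit every remaining sequence to the first center it matches.
        members = {c: [] for c in centers}
        unassigned = []
        for idx in remaining:
            for c in centers:
                if identity(seqs[idx], seqs[c]) >= threshold:
                    members[c].append(idx)
                    break
            else:
                unassigned.append(idx)

        clusters.extend(members.items())
        remaining = unassigned

    # Anything left after the last iteration is reported as a singleton.
    clusters.extend((idx, [idx]) for idx in remaining)
    return sorted(clusters, key=lambda cm: len(cm[1]), reverse=True)


# Hypothetical toy data: three groups of identical sequences.
toy = ["AAAAAAAAAA"] * 5 + ["CCCCCCCCCC"] * 3 + ["GGGGGGGGGG"]
print([len(m) for _, m in iterative_cluster(toy)])
```

Because large clusters are more likely to contribute a sampled center, they tend to be found in early iterations, which is the intuition behind stopping the loop once enough clusters are available.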

16:00-16:15
Breastfeeding and Farming Lifestyle Promotes Predominant Bifidobacterium in Infants
Room: FI
Format: Live from venue

Moderator(s): David Koslicki

  • Deborah Chasman, University of Wisconsin-Madison, United States
  • Krittisak Chaiyakul, University of Wisconsin-Madison, United States
  • Samantha Fye, University of Wisconsin-Madison, United States
  • James Gern, University of Wisconsin-Madison, United States
  • Susan Lynch, University of California San Francisco, United States
  • Christine Seroogy, University of Wisconsin-Madison, United States
  • Irene Ong, University of Wisconsin-Madison, United States


Presentation Overview:

Introduction: The inception of immune-mediated disorders, which have increased worldwide, typically occurs during early childhood and leads to chronic and lifelong diseases. Children exposed to microbes from pets or farm animals, or who live in traditional communities such as the Amish, have reduced rates of these diseases. The gut microbiome influences neonatal immune development; however, the contributing microbial features are unknown. We compared stool metagenomes from Wisconsin infants across three levels of farming exposure: traditionally-farming Amish (n=27), dairy farming (n=46), and rural non-farming (n=43). We hypothesized that microbiome composition would vary between the groups.

Methods: We analyzed farm group, diet, and metagenomic features using statistical tests and machine learning.

Results: Microbiome composition significantly differed by diet and farm group. Machine learning models successfully classified Amish from non-Amish (AUROC=0.94). Variable importance and statistical analysis highlighted a significantly greater abundance of Bifidobacterium longum in Amish. Gene families found uniquely in Amish samples included genes from B. longum infantis, which encodes a large complement of human milk utilization gene clusters.

Conclusion: Breastfeeding and Amish lifestyle influence early gut colonization. Pioneer microbes may protect against colonization by pathogens and aid immune maturation via metabolic products of human milk.

16:15-16:30
A constraint-based method to identify function-specific minimal microbiomes from large microbial communities
Room: FI
Format: Live from venue

Moderator(s): David Koslicki

  • Aswathy K. Raghu, Northwestern University, United States
  • Karthik Raman, Indian Institute of Technology Madras, India


Presentation Overview:

Microorganisms thrive in large communities of diverse species, exhibiting various functionalities. The mammalian gut microbiome, for instance, has the functionality of digesting dietary fibre and producing different short-chain fatty acids. Not all microbes present in a community contribute to a given functionality; it is possible to find a minimal microbiome, a subset of the large microbiome, that is capable of performing the functionality while maintaining other properties such as growth rate. Such a minimal microbiome will also contain keystone species of that community. When perturbations of the gut microbiome result in disease conditions, cultivated minimal microbiomes can be administered to restore lost functionalities. In this work, we present a systematic approach to find a minimal microbiome for a specific functionality from a large community. We employ a top-down approach with sequential deletion followed by solving a mixed-integer linear programming problem with the objective of minimizing the L1-norm of the membership vector. We demonstrate the utility of our algorithm by identifying the minimal microbiomes of some communities and discuss their validity based on the presence of the keystone species in the community.

Availability: The algorithm is available from https://github.com/RamanLab/MinMicrobiome
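
The two-stage idea (sequential deletion, then minimizing the number of members) can be illustrated on a toy model. This sketch is not the MinMicrobiome algorithm: it assumes a hypothetical additive community in which each species contributes a fixed amount to one function, omits the growth-rate constraint, and replaces the mixed-integer linear program with exhaustive search over subsets. With a binary membership vector, the L1-norm is simply the number of species kept.

```python
from itertools import combinations


def feasible(members, production, min_production):
    """Toy constraint: total production of the function meets the threshold."""
    return sum(production[m] for m in members) >= min_production


def sequential_deletion(species, production, min_production):
    """Top-down pruning: drop a species whenever the community stays feasible."""
    kept = list(species)
    for s in sorted(species, key=lambda name: production[name]):
        trial = [m for m in kept if m != s]
        if feasible(trial, production, min_production):
            kept = trial
    return kept


def smallest_feasible_subset(candidates, production, min_production):
    """Exhaustive stand-in for the MILP: the feasible subset with fewest members."""
    for size in range(len(candidates) + 1):
        for subset in combinations(candidates, size):
            if feasible(subset, production, min_production):
                return list(subset)
    return None


# Hypothetical contributions of four species to one function.
production = {"A": 5, "B": 3, "C": 1, "D": 1}
pruned = sequential_deletion(list(production), production, min_production=8)
print(pruned)
print(smallest_feasible_subset(pruned, production, min_production=8))
```

Sequential deletion shrinks the candidate pool cheaply, so the exact minimization only has to consider the survivors. In a constraint-based setting the feasibility check would be a flux simulation rather than a sum.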

16:30-16:45
Comprehensive functional annotation of metagenomes using a deep learning-based method (DeepFRI)
Room: FI
Format: Live from venue

Moderator(s): David Koslicki

  • Mary Maranga, Małopolska Centre of Biotechnology, Jagiellonian University, Poland
  • Tomasz Kosciolek, Małopolska Centre of Biotechnology, Jagiellonian University, Poland
  • Tommi Vatanen, Liggins Institute, University of Auckland, New Zealand
  • Valentyn Bezshapkin, Małopolska Centre of Biotechnology, Jagiellonian University, Poland
  • Pawel Szczerbiak, Małopolska Centre of Biotechnology, Jagiellonian University, Poland
  • Paweł Łabaj, Małopolska Centre of Biotechnology, Jagiellonian University, Poland
  • Richard Bonneau, Flatiron Institute, New York, United States


Presentation Overview:

The exact mechanisms by which the human gut microbiome influences health are still elusive, due to our limited understanding of the functional potential encoded in microbial genomes. There is a need for a comprehensive functional annotation method with high coverage. The present study aims to characterize the functional potential of the human gut microbiome in type 1 diabetes. We used DIABIMMUNE datasets (1,070 samples) as a case study. We have developed a custom metagenomics functional annotation pipeline centered on DeepFRI, a deep learning method, and the orthology-based method EggNOG. The pipeline integrates taxonomic profiling, functional annotation, and construction of pan-genomes in a reference-free manner using metagenome-assembled genomes (MAGs). We generated a sequence catalog comprising 2,255 high-quality genomes and 1.9M non-redundant microbial genes. We observed an increase in the annotation coverage weighted by relative abundance with DeepFRI (~26%) in comparison to EggNOG (~16%). Our pan-genome analysis covered 42 bacterial species and consisted of a total of 70,997 core genes and 355,761 accessory genes. EggNOG was shown to annotate more functions coming from well-studied species such as Escherichia coli. Functional enrichment varied between core and accessory genomes with DeepFRI annotations; Bacteroides vulgatus had the highest number of annotated genes (4,383).

16:45-17:00
Information-Measure Predicts the Generalizability of Machine Learning Models in Metagenomics Datasets
Room: FI
Format: Live-stream

Moderator(s): David Koslicki

  • Harrison Ho, Joint Genome Institute / University of California Merced, United States
  • Gerald Friedland, Department of EECS, University of California Berkeley, United States
  • Zhong Wang, Lawrence Berkeley National Lab; The Joint Genome Institute, United States


Presentation Overview:

The application of high-throughput genomics, transcriptomics, proteomics, and metabolomics technologies to clinical metagenomics makes it possible to link specific microbes to certain host phenotypes. Although machine learning approaches have been attempted, their success has been limited by the inherently noisy and complex nature of these multi-omics datasets. Here, we mathematically define a novel quantitative estimator based on information theory that measures the ability of a given dataset to support successful models. Using both synthetic and real-world metagenomics datasets, we demonstrate that this estimator, termed the generalizability measure (GM), correlates with model accuracy and monotonically decreases as noise increases. We also show that GM can guide train-test splits that produce models with higher accuracy than those from random splits. Furthermore, GM can predict inherent inconsistencies among datasets from different studies. We propose that GM is an indispensable metric to reduce the cost of unsuccessful machine learning experiments, and the results we obtained from meta-omics datasets can likely generalize to other types of complex datasets.

17:00-17:15
Identifying microbial drivers in biological phenotypes with a Bayesian Network Regression model
Room: FI
Format: Live from venue

Moderator(s): David Koslicki

  • Samuel Ozminkowski, University of Wisconsin-Madison, United States
  • Claudia Solis-Lemus, University of Wisconsin-Madison, United States


Presentation Overview:

Understanding the composition of microbial communities and how these compositions shape biological phenotypes is crucial to comprehend complex biological processes in soil, plants, and humans. Standard approaches to study these connections do not account for correlations between microbes, and models to connect a microbial network to a biological phenotype remain unknown. A handful of new methods using a regression framework to identify associations between network predictors and a phenotype have been developed but have only been studied for dense networks.
We introduce a Bayesian Network Regression (BNR) model that uses the microbial network as the predictor of a biological phenotype. This model accounts for interactions among microbes and can identify influential interactions and microbes that drive phenotypic variability. While the model itself is not new, it has only been studied for brain networks. Its applicability to microbial networks, which are sparser and higher-dimensional, has not been studied. We develop the first thorough investigation of BNR models for microbial datasets on synthetic data generated under realistic biological scenarios. We show that this model can identify influential nodes and edges in the microbial networks that drive changes in the phenotype for most biological settings and identify scenarios where this method performs poorly.

17:15-17:30
Resistome characterization of the honey bee gut microbiome via shotgun metagenomic sequencing
Room: FI
Format: Live-stream

Moderator(s): David Koslicki

  • Lance Lansing, Agriculture and Agri-Food Canada, Canada
  • Kurtis Clarke, Agriculture and Agri-Food Canada, Canada
  • Jackie Zorz, Agriculture and Agri-Food Canada, Canada
  • Linzhi Wu, Agriculture and Agri-Food Canada, Canada
  • Morgan Cunningham, Agriculture and Agri-Food Canada, Canada
  • Lan Tran, Agriculture and Agri-Food Canada, Canada
  • Marta Guarna, Agriculture and Agri-Food Canada, Canada
  • Rodrigo Ortega Polo, Agriculture and Agri-Food Canada, Canada


Presentation Overview:

Antimicrobial resistance (AMR) is a global threat to agriculture as bacterial infections become increasingly difficult to treat. In the beekeeping industry, antibiotics are frequently used for the prevention and treatment of diseases such as American and European foulbrood. However, increasing rates of AMR have limited the effectiveness of antibiotic treatments, which has direct consequences for colony health and for the production and stability of pollinator-dependent crops.

In this work we describe the characterization of the honey bee gut resistome – the collective antibiotic resistance in a microbial community – using shotgun metagenomics sequencing. We analyzed 10 short-read bee gut microbiome samples from Ontario, Canada, where antibiotics are commonly used as prophylactics. AMR genes were identified at the read level with the AMRPlusPlus pipeline. Reads were also assembled using Metagenome-Atlas and searched against five AMR databases. Our work found 199 unique AMR genes providing resistance to such compounds as tetracyclines, sulfonamides, and aminoglycosides. Tetracycline resistance was most prevalent, which correlates with the history of oxytetracycline use in Ontario apiculture. Using metagenomics for characterizing the honey bee gut resistome can be a powerful tool to monitor AMR genes in pollinators and their environment.

17:30-17:45
Metagenome-scale species-resolved functional profiling of the gut microbiota in Celiac Disease
Room: FI
Format: Live-stream

Moderator(s): David Koslicki

  • Analeigha Colarusso, Tufts University, United States
  • Isabella Goodchild-Michelman, Harvard University, United States
  • Alessio Fasano, Massachusetts General Hospital - Harvard Medical School, United States
  • Ali Zomorrodi, Massachusetts General Hospital - Harvard Medical School, United States


Presentation Overview:

Alterations in the gut microbiota have been associated with Celiac Disease (CeD); however, the roles of individual microbial species in CeD pathogenesis are largely unknown. This is because existing functional profiling approaches do not provide any information about which microbes carry which pathways or produce which metabolites in the gut. To address this gap, we used fecal metagenomic data and GEnome-scale Models (GEMs) of metabolism to functionally profile the gut microbiota of ten children with CeD and ten controls at species- and molecular-level resolution. To this end, we integrated GEMs for microbial species present in each metagenome to construct sample-specific species-resolved GEMs of the gut microbiota (spanning 359 species across all samples). By computationally simulating these models, we traced specific secreted metabolites back to the individual microbial species producing them. We identified 18 (out of 2,120) species-metabolite pairs involving nine species and 14 metabolites that were significantly different between cases and controls (Wilcoxon test, p < 0.05). Some of these metabolites were previously implicated in CeD or inflammation, e.g., L-tryptophan, arabinose, xanthine, and cholic acid, which are linked to species of Bacteroides, Bifidobacterium, and Blautia in our models. This study provides a roadmap for mechanistically linking the gut microbial and molecular markers of CeD.

17:45-18:00
A rat microbial BodyMap across 11 tissue types and 4 developmental stages
Room: FI
Format: Live-stream

Moderator(s): David Koslicki

  • Lan Zhao, Stanford University, United States
  • Mark Nicolls, Stanford University, United States


Presentation Overview:

The rat has been widely used as a model in a variety of fields. To determine whether specific microbiome patterns and signatures are associated with different developmental stages in rat organs, we systematically analyzed a cohort of RNA-Seq samples generated by the SEQC consortium from 11 organs of juvenile, adolescent, adult, and aged healthy Fischer 344 rats. Raw sequencing data were mapped to the rat reference genome (Ensembl release 104) with the STAR aligner (v2.7.9a). The unmapped reads were then subjected to microbial taxonomic classification based on exact k-mer matches with Kraken (v2.0.8-beta). A biclustering algorithm based on consensus non-negative matrix factorization (cNMF) was applied to identify rat-microbe interaction patterns and signatures. A total of 4,647 taxa were identified across all rat tissue types, and 4 rat-microbe interaction patterns were subsequently determined. The lung's microbial profiles showed distinct separation among clusters and shifted significantly across developmental stages. Our study provides taxonomic profiles of 11 rat tissue types at four stages of development. The four identified rat-organ clusters, with 362 microbial signatures, may be useful for accessing the unculturable microbial communities and facilitating discovery of the roles the microbiome plays in human health.
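
Classification by exact k-mer matches, as used in the workflow above, can be illustrated with a toy sketch. It does not reproduce Kraken, which maps each k-mer to the lowest common ancestor of the genomes containing it and works on a compact database; here k-mers shared between taxa are simply ignored, and the reference sequences and names are hypothetical.

```python
from collections import Counter


def build_index(references: dict[str, str], k: int) -> dict[str, set[str]]:
    """Map every k-mer to the set of taxa whose reference contains it."""
    index: dict[str, set[str]] = {}
    for taxon, genome in references.items():
        for i in range(len(genome) - k + 1):
            index.setdefault(genome[i:i + k], set()).add(taxon)
    return index


def classify(read: str, index: dict[str, set[str]], k: int):
    """Assign a read to the taxon with the most taxon-specific k-mer hits."""
    votes: Counter = Counter()
    for i in range(len(read) - k + 1):
        taxa = index.get(read[i:i + k], ())
        if len(taxa) == 1:  # only k-mers unique to one taxon get a vote
            votes.update(taxa)
    return votes.most_common(1)[0][0] if votes else None


# Hypothetical reference sequences for two taxa.
references = {"taxon_a": "ACGTACGTTTGA", "taxon_b": "GGGCCCAAATTT"}
index = build_index(references, k=4)
print(classify("CGTACGTT", index, k=4))
print(classify("TTTTTTTT", index, k=4))
```

Reads with no matching k-mer stay unclassified, which mirrors how host-depleted reads without a microbial match drop out of a taxonomic profile.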

Thursday, July 14th
10:15-11:00
Keynote Presentation: Revisiting string graph model for long-read assembly of genomes and metagenomes
Room: FI
Format: Live-stream

Moderator(s): Zhong Wang

  • Chirag Jain


Presentation Overview: Show

Read-overlap-based graph data structures play a central role in computing de novo genome assembly using long reads. Most long-read assembly tools use the string graph model to sparsify overlap graphs. Graph sparsification is crucial for high-quality genome assembly as it simplifies the graph significantly by removing redundant edges. However, a graph model must be coverage-preserving, i.e., it must ensure that each haplotype can be spelled as a walk in the graph, given sufficient sequencing coverage. This property becomes even more important for polyploid genomes and metagenomes, where there is a risk of losing haplotype-specific information during graph sparsification. In the first part, we prove that the de Bruijn graph and overlap graph models are guaranteed to be coverage-preserving. However, using the same framework, we show that the commonly used string graph model lacks this guarantee. To address this, the second part of our work introduces a novel sparse read-overlap-based graph model that is well supported by our theoretical results. The practical advantage of this model is demonstrated using CHM13 and HG002 human sequencing data.
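
To illustrate what "removing redundant edges" can mean when sparsifying an overlap graph, here is a minimal sketch of a two-hop transitive reduction. It is not the model presented in the talk: the function name, the read labels, and the toy graph are hypothetical, and real string graph construction also accounts for overlap lengths, read orientation, and contained reads.

```python
from typing import Dict, Set


def transitive_reduction(overlaps: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Drop edge u->w whenever some other successor v of u also has v->w.

    Simplified two-hop rule only; all decisions are made on the original graph.
    """
    reduced: Dict[str, Set[str]] = {}
    for u, succs in overlaps.items():
        keep = set()
        for w in succs:
            # u->w is redundant if a path u->v->w exists through another read v.
            redundant = any(
                v != w and w in overlaps.get(v, set()) for v in succs
            )
            if not redundant:
                keep.add(w)
        reduced[u] = keep
    return reduced


# Hypothetical toy overlap graph: r1 overlaps r2 and r3, and r2 overlaps r3.
overlap_graph = {
    "r1": {"r2", "r3"},
    "r2": {"r3"},
    "r3": set(),
}
sparse_graph = transitive_reduction(overlap_graph)
print(sparse_graph)
```

In this toy case the edge r1->r3 is removed because the same walk is spelled by r1->r2->r3. The coverage-preserving question raised in the abstract is whether such removals can ever make a true haplotype walk impossible.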

11:00-11:15
Persistent Memory as an Effective Alternative to Random Access Memory in Metagenome Assembly
Room: FI
Format: Live from venue

Moderator(s): Zhong Wang

  • Jingchao Sun, MemVerge Inc, United States
  • Rob Egan, DOE Joint Genome Institute, United States
  • Harrison Ho, DOE Joint Genome Institute, EGSB Lawrence Berkeley National Lab, United States
  • Yue Li, MemVerge Inc, United States
  • Zhong Wang, DOE Joint Genome Institute, EGSB Lawrence Berkeley National Lab, UC Merced, United States


Presentation Overview: Show

The assembly of metagenomes decomposes the member genomes of complex microbe communities and allows the characterization of these genomes without laborious cultivation or single-cell metagenomics. Metagenome assembly is a memory-intensive and time-consuming process. Multi-terabyte sequences can become too large to be assembled on a single computer node, and there is no reliable way to predict the memory requirement because memory consumption patterns are data-specific. Currently, out-of-memory (OOM) is one of the most prevalent problems in metagenome assembly experiments. In this study, we explored the possibility of using Persistent Memory (PMem) as a less expensive substitute for dynamic random access memory (DRAM) to reduce OOM and increase the scalability of metagenome assemblers. We evaluated the execution time and memory usage of three popular metagenome assemblers (MetaSPAdes, MEGAHIT, and MetaHipMer2) on datasets up to one terabase. We found that PMem can enable metagenome assemblers on terabyte-sized datasets by partially or fully substituting DRAM at a cost of longer running times. In addition, different assemblers displayed distinct memory/speed trade-offs in the same hardware/software environment. Because PMem was provided directly without any application-specific code modification, these findings are likely to generalize to other memory-intensive bioinformatics applications.

11:15-11:30
Adversarial and variational autoencoders improve metagenomics binning
Room: FI
Format: Live from venue

Moderator(s): Zhong Wang

  • Pau Piera Lindez, University of Copenhagen, Novo Nordisk Foundation Center for Protein Research, Denmark
  • Joachim Johansen, University of Copenhagen, Novo Nordisk Foundation Center for Protein Research, Denmark
  • Jakob Nybo Nissen, University of Copenhagen, Novo Nordisk Foundation Center for Protein Research, Denmark
  • Simon Rasmussen, University of Copenhagen, Novo Nordisk Foundation Center for Protein Research, Denmark


Presentation Overview: Show

Reconstruction of high-quality genomes from metagenomic samples is a hard problem, often resulting in highly fragmented genome assemblies. Metagenomic binning allows us to reconstruct genomes by re-grouping the sequences by their organism of origin, thus representing a crucial bottleneck for exploring biological diversity in metagenomic samples. Here we present Adversarial Autoencoders for Metagenomics Binning (AAMB), a deep learning approach that integrates sequence co-abundances and tetranucleotide frequencies into a common denoised space that enables precise clustering of sequences into microbial genomes. When benchmarked, AAMB presented similar or better results compared with the state-of-the-art binner VAMB, reconstructing 0-35% and 5-10% more near-complete (NC) genomes on simulated and real data, respectively. When integrating VAMB and AAMB NC bins with dRep, we obtained, on average, 30% additional NC bins across simulated and real datasets. In addition, the VAMB-AAMB integrated bins had higher completeness, greater taxonomic diversity, and covered a wider range of sample prevalence compared with VAMB. Finally, we implemented a pipeline integrating VAMB, AAMB, and dRep that enables efficient binning and integration without extensive additional runtime.

11:30-11:45
Charcoal: filtering contamination in metagenome-assembled genome bins and other genomes
Room: FI
Format: Live from venue

Moderator(s): Zhong Wang

  • Taylor Reiter, University of Colorado Anschutz Medical Campus, United States
  • N. Tessa Pierce-Ward, Population Health And Reproduction, University of California, Davis, United States
  • Luiz Irber, Population Health And Reproduction, University of California, Davis, United States
  • Erich M. Schwarz, Department of Molecular Biology and Genetics, Cornell University, United States
  • C. Titus Brown, Population Health And Reproduction, University of California, Davis, United States


Presentation Overview: Show

Metagenomics has expanded our knowledge of microbial diversity, but contaminant sequences are frequently included by accident in metagenome-assembled genomes. Genome contamination is often estimated from the presence of marker genes, an approach that is biased against detecting contaminants lacking these sequences. Further, most contamination detection tools do not remove contamination. We present charcoal, a tool that rapidly identifies and removes contamination in metagenome-assembled genomes using k-mer based methods. K-mers are nucleotide sequences of length k. Sufficiently long k-mers are usually specific to a taxonomic lineage. Exploiting this property of k-mers, charcoal identifies majority and minority lineages for each contiguous sequence in a genome and removes contiguous sequences belonging to minority lineages when those lineages occur below a taxonomic threshold (by default, order). Applying charcoal to the GTDB rs207 database, we found that approximately 25% of genomes in GTDB were contaminated, with contamination broadly distributed across species and occurring in both representative and RefSeq genomes. Genomes with longer contiguous sequences are less likely to be contaminated. Our results show concordance with CheckM on detecting the presence of contamination in a genome. Charcoal is a snakemake workflow developed around the tool sourmash. It is available at github.com/dib-lab/charcoal, and is pip installable.
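
The majority/minority lineage idea can be sketched in a few lines. This is only an illustration under strong assumptions, not charcoal's implementation: the `kmer_to_lineage` lookup, the function names, the toy sequences, and the choice to count one vote per contig at a single rank (here a made-up order label) are all hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def kmers(seq: str, k: int) -> List[str]:
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def contig_lineage(seq: str, k: int, kmer_to_lineage: Dict[str, str]) -> Optional[str]:
    """Most common lineage among the contig's k-mers found in the lookup."""
    hits = Counter(
        kmer_to_lineage[km] for km in kmers(seq, k) if km in kmer_to_lineage
    )
    return hits.most_common(1)[0][0] if hits else None


def filter_genome(
    contigs: Dict[str, str], k: int, kmer_to_lineage: Dict[str, str]
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Keep contigs matching the genome's majority lineage (or unclassified)."""
    per_contig = {n: contig_lineage(s, k, kmer_to_lineage) for n, s in contigs.items()}
    votes = Counter(l for l in per_contig.values() if l is not None)
    if not votes:
        return dict(contigs), {}
    majority = votes.most_common(1)[0][0]
    kept = {n: s for n, s in contigs.items() if per_contig[n] in (majority, None)}
    removed = {n: s for n, s in contigs.items() if per_contig[n] not in (majority, None)}
    return kept, removed


# Hypothetical reference: k-mers from two made-up sequences, labelled by order.
k = 4
lookup = {km: "order_A" for km in kmers("ACGTACGTAC", k)}
lookup.update({km: "order_B" for km in kmers("TTGGCCAATT", k)})

genome = {"c1": "ACGTACGT", "c2": "GTACGTAC", "c3": "TTGGCCAA"}
kept, removed = filter_genome(genome, k, lookup)
print(sorted(kept), sorted(removed))
```

With k = 4 the k-mers are far too short to be lineage-specific in real data; the value is chosen only to keep the example readable.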

11:45-12:00
MetaBinner: a high-performance and stand-alone ensemble binning method to recover individual genomes from complex microbial communities
Room: FI
Format: Live-stream

Moderator(s): Zhong Wang

  • Ziye Wang, Fudan University, China
  • Pingqin Huang, Fudan University, China
  • Ronghui You, Fudan University, China
  • Fengzhu Sun, University of Southern California, United States
  • Shanfeng Zhu, Fudan University, China


Presentation Overview: Show

Binning aims to recover microbial genomes from metagenomic data. For complex metagenomic communities, the available binning methods are far from satisfactory, as they usually do not make full use of different types of features and important biological knowledge. We developed a novel ensemble binner, MetaBinner, which generates component results with multiple types of features by k-means and utilizes single-copy gene (SCG) information for initialization. It then employs a two-stage ensemble strategy based on SCGs to integrate the component results efficiently and effectively. Extensive experimental results on three large-scale simulated datasets and one real-world dataset demonstrate that MetaBinner significantly outperforms the state-of-the-art binners. MetaBinner is freely available at https://github.com/ziyewang/MetaBinner.

13:15-13:45
Proceedings Presentation: CMash: fast, multi-resolution estimation of k-mer-based Jaccard and containment indices
Room: FI
Format: Live from venue

Moderator(s): Zhong Wang

  • Shaopeng Liu, Pennsylvania State University, United States
  • David Koslicki, Pennsylvania State University, United States


Presentation Overview: Show

K-mer based methods are used ubiquitously in the field of computational biology. However, determining the optimal value of k for a specific application often remains heuristic. Simply reconstructing a new k-mer set with another k-mer size is computationally expensive, especially in metagenomic analysis where data sets are large. Here, we introduce a hashing-based technique that leverages a kind of bottom-m sketch as well as a k-mer ternary search tree (KTST) to obtain k-mer based similarity estimates for a range of k values. By truncating k-mers stored in a pre-built KTST with a large k = k_max value, we can simultaneously obtain k-mer based estimates for all k values up to k_max. This truncation approach circumvents the reconstruction of new k-mer sets when changing k values, making analysis more time and space-efficient. For example, we show that when using a KTST to estimate the containment index between a RefSeq-based microbial reference database and simulated metagenome data for 10 values of k, the running time is close to 10x faster compared to a classic MinHash approach while using less than one-fifth the space to store the data structure. A python implementation of this method, CMash, is available at https://github.com/dkoslicki/CMash. The reproduction of all experiments presented herein can be accessed via https://github.com/KoslickiLab/CMASH-reproducibles.
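
The truncation idea can be illustrated with plain Python sets. This sketch is not CMash itself: it uses exact k-mer sets instead of a bottom-m sketch and a ternary search tree, and the function names and toy sequences are hypothetical.

```python
from typing import Dict, Iterable, Set


def kmer_set(seq: str, k: int) -> Set[str]:
    """Exact set of k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def truncated_sets(kmax_mers: Set[str], k_values: Iterable[int]) -> Dict[int, Set[str]]:
    """Derive k-mer sets for smaller k by taking prefixes of the k_max-mers.

    Prefix truncation misses the few k-mers that start in the last
    (k_max - k) positions of the sequence, so this is an approximation.
    """
    return {k: {km[:k] for km in kmax_mers} for k in k_values}


def jaccard(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(a: Set[str], b: Set[str]) -> float:
    """Fraction of a's k-mers that also occur in b."""
    return len(a & b) / len(a) if a else 0.0


k_max = 8
k_values = [4, 6, 8]
seq_a = "ACGTTGCAACGTAGCTAGCT"   # hypothetical toy sequences
seq_b = "TTGCAACGTAGCTAGGTACC"

trunc_a = truncated_sets(kmer_set(seq_a, k_max), k_values)
trunc_b = truncated_sets(kmer_set(seq_b, k_max), k_values)

for k in k_values:
    print(k, jaccard(trunc_a[k], trunc_b[k]), containment(trunc_a[k], trunc_b[k]))
```

The k_max-mer sets are built once, and every smaller k is derived from them without rescanning the sequences, which is the saving the abstract describes.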

13:45-14:00
Average Nucleotide Identity estimation from FracMinHash sketches
Room: FI
Format: Live from venue

Moderator(s): Zhong Wang

  • N. Tessa Pierce-Ward, University of California, Davis, United States
  • Mahmudur Rahman Hera, The Pennsylvania State University, United States
  • Luiz Irber, University of California, Davis, United States
  • Taylor Reiter, University of Colorado, United States
  • David Koslicki, The Pennsylvania State University, United States
  • C. Titus Brown, University of California, Davis, United States


Presentation Overview: Show

Average Nucleotide Identity (ANI), a measure representing the sequence similarity of all orthologous genes shared between two genomes, has shown lasting utility for taxonomic classification and phylogenetic analysis. BLAST-based alignment remains the gold standard method, but a number of methods have been developed in an effort to reduce computational requirements and allow ANI estimation to scale with the availability of sequenced genomes.
FracMinHash is a MinHash variant for selecting and hashing a set of representative k-mers from a sequence dataset. Downsampling sequencing datasets in this way enables estimation of Containment, which has been shown to permit more accurate estimation of genomic distance as compared with the Jaccard Index, particularly for genomes of very different lengths. Here, we demonstrate FracMinHash-based ANI estimation, an alignment-free method that is highly correlated with mapping-based ANI. Estimating ANI via FracMinHash containment makes this method relatively robust to contamination and allows ANI estimation directly from metagenome samples (comparing matched portions of the metagenome sample to their respective reference sequences).
Alignment-free ANI estimation from FracMinHash is fast, scalable, and assembly-independent. It is implemented in sourmash, an open-source command line tool and Python library for sketching k-mers for biological analysis of large-scale sequencing datasets.
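
A rough sketch of the idea follows. It is not sourmash's implementation: the hash function (BLAKE2b from the standard library), the function names, and the parameter values are assumptions, canonical (strand-aware) k-mers are ignored, and the point estimate ANI ≈ C^(1/k) assumes a simple model in which each base mutates independently, so a k-mer survives unchanged with probability ANI^k.

```python
import hashlib
import random
from typing import Set

MAX_HASH = 2 ** 64  # size of the 64-bit hash space used below


def hash_kmer(kmer: str) -> int:
    """Deterministic 64-bit hash of a k-mer."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def frac_minhash(seq: str, k: int, scaled: int) -> Set[int]:
    """Keep the hashes falling in the lowest 1/scaled fraction of hash space."""
    threshold = MAX_HASH // scaled
    hashes = (hash_kmer(seq[i:i + k]) for i in range(len(seq) - k + 1))
    return {h for h in hashes if h < threshold}


def containment(query: Set[int], reference: Set[int]) -> float:
    """Fraction of the query sketch that is also present in the reference sketch."""
    if not query:
        raise ValueError("empty query sketch; lower `scaled` or use a longer sequence")
    return len(query & reference) / len(query)


def ani_from_containment(c: float, k: int) -> float:
    """Point estimate under the independent-mutation assumption."""
    return c ** (1.0 / k)


random.seed(0)
reference = "".join(random.choice("ACGT") for _ in range(5000))
query = reference[1000:3000]  # an exact sub-sequence of the reference

k, scaled = 21, 4
c = containment(frac_minhash(query, k, scaled), frac_minhash(reference, k, scaled))
ani = ani_from_containment(c, k)
print(c, ani)
```

Because both sketches use the same hash threshold, a k-mer retained from the query is also retained from any reference that contains it, which is what makes containment meaningful between sequences of very different lengths.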

14:00-14:15
TAMPA: interpretable analysis and visualization of metagenomics-based taxon abundance profiles
Room: FI
Format: Live from venue

Moderator(s): Zhong Wang

  • Varuni Sarwal, UCLA, United States
  • Serghei Mangul, University of California, Los Angeles, United States
  • David Koslicki, Penn State University, United States


Presentation Overview: Show

Taxonomic metagenome profiling aims to predict the identity and relative abundances of taxa in a given whole genome sequencing metagenomic sample. A recent surge in computational methods that aim to accomplish this, called taxonomic profilers, has motivated community-driven efforts to create standardized benchmarking datasets, standardized taxonomic profile formats, and benchmarking platforms to assess tool performance. Here we report the development of Tampa (Taxonomic metagenome profiling evaluation), a robust and easy-to-use method that allows scientists to easily interpret and interact with taxonomic profiles produced by the many different taxonomic profiler methods. We demonstrate Tampa's ability by illuminating critical biological differences between samples and conditions that are otherwise missed by commonly utilized metrics. We plan to apply Tampa to CAMI data. Additionally, we show that Tampa can enable biologists to effectively choose the most appropriate profiling method to use on their real data. When ground truth taxonomic profiles are available, we show how Tampa can augment existing benchmarking platforms such as OPAL. Tampa will be provided in a platform-independent fashion via Bioconda and integrated into the Galaxy Toolshed. Tampa will allow scientists to quickly contextualize, assess, and extract insight from taxonomic profiles instead of relying primarily on statistical summaries or manual manipulation.

14:15-14:30
Critical assessment of pan-genomics of metagenome-assembled genomes
Room: FI
Format: Live from venue

Moderator(s): Zhong Wang

  • Yanbin Yin, University of Nebraska - Lincoln, United States
  • Tang Li, University of Nebraska - Lincoln, United States


Presentation Overview: Show

Background: Large-scale metagenome assembly and binning to generate metagenome-assembled genomes (MAGs) has become possible in the past five years. As a result, millions of MAGs have been produced and are increasingly included in pan-genomics workflows. However, pan-genome analyses of MAGs may suffer from the known issues with MAGs: fragmentation, incompleteness, and contamination. Here, we conducted a critical assessment of including MAGs in pan-genome analysis.

Results: We found that incompleteness led to more significant core gene loss than fragmentation. Contamination had little effect on core genome size but had major influence on accessory genomes. The core gene loss remained when using different pan-genome analysis tools and when using a mixture of MAGs and complete genomes. Importantly, the core gene loss was partially alleviated by lowering the core gene threshold and using gene prediction algorithms that consider fragmented genes. The core gene loss also led to incorrect pan-genome functional predictions and inaccurate phylogenetic trees.

Conclusions: We conclude that lowering the core gene threshold and predicting genes in metagenome mode (as Anvi’o does with Prodigal) are necessary in pan-genome analysis of MAGs. Better quality control of MAGs and the development of new pan-genome analysis tools specifically designed for MAGs are needed in future studies.
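
To make the core gene threshold concrete, here is a minimal sketch. It assumes the threshold is the minimum fraction of genomes that must carry a gene family for it to count as core; the function name, genome labels, and gene names are hypothetical and not taken from the study.

```python
from collections import Counter
from typing import Dict, Set


def core_genes(presence: Dict[str, Set[str]], threshold: float) -> Set[str]:
    """Gene families present in at least `threshold` (0-1) of the genomes."""
    n_genomes = len(presence)
    counts = Counter(gene for genes in presence.values() for gene in genes)
    return {gene for gene, c in counts.items() if c / n_genomes >= threshold}


# Hypothetical pan-genome: three complete genomes plus one incomplete MAG
# that is missing geneB.
genomes = {
    "g1": {"geneA", "geneB", "geneC"},
    "g2": {"geneA", "geneB"},
    "g3": {"geneA", "geneB"},
    "mag1": {"geneA"},
}

strict_core = core_genes(genomes, threshold=1.0)
relaxed_core = core_genes(genomes, threshold=0.75)
print(strict_core, relaxed_core)
```

In this toy case the incomplete MAG removes geneB from the strict core, and relaxing the threshold to 75% recovers it, which mirrors the kind of core gene loss and partial alleviation described in the results.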

14:30-14:45
Sample origin prediction in forensics with use of targeted metagenomic sequencing
Room: FI
Format: Live from venue

Moderator(s): Zhong Wang

  • Michał Kowalski, Jagiellonian University, Poland
  • Kamila Marszałek, Jagiellonian University, Poland
  • Kinga Herda, Jagiellonian University, Poland
  • Agata Jagiełło, Central Forensic Laboratory of the Police, Poland
  • Anna Woźniak, Central Forensic Laboratory of the Police, Poland
  • Łukasz Nowak, Ardigen, Poland
  • Andrzej Ossowski, Pomeranian Medical University, Poland
  • Rafał Płoski, Medical University of Warsaw, Poland
  • Wojciech Branicki, Jagiellonian University, Poland
  • Paweł P. Łabaj, Małopolska Centre of Biotechnology of Jagiellonian University, Poland
  • Renata Zbieć-Piekarska, Central Forensic Laboratory of the Police, Poland


Presentation Overview: Show

Microbiome data have been successfully applied in forensics. However, studies by MetaSUB and CAMDA have shown that the full potential of metagenomics is yet to be unveiled. To allow unconstrained analysis and interpretation, reference-free approaches need to be used.
Here we characterize the composition of the soil microbiome in Poland with MetaGraph on Whole Metagenome Sequencing data. We collected about 1000 samples throughout the year from multiple locations and sequenced them. These constitute the “positive class” for Poland, while the MetaSUB and EMP500 collections constitute the “negative class”. We constructed the corresponding graphs and extracted features in the form of unitigs, which are sampling-site specific. These were used to design a targeted panel that yields metagenomic profiles of high uniqueness. The sample profiles obtained from TMS are first clustered by their similarity to each other and then passed as embeddings to the DNN classifier, which predicts the probability of the origin of the sample.
In the pilot, we tested samples from 6 Polish cities with a classifier trained on only a fraction of the WMS data. Even so, the results are very promising (accuracy >90%), indicating that the application of k-mer based approaches will be a step towards a new era of metagenomics in forensic applications.

14:45-15:00
A protocol for studying metabolic interactions in a microbial community using graph-based approaches
Room: FI
Format: Live from venue

Moderator(s): Zhong Wang

  • Dinesh Kumar Kuppa Baskaran, Indian Institute of Technology Madras, India
  • Karthik Raman, Indian Institute of Technology Madras, India


Presentation Overview: Show

A major part of research in microbial systems biology deals with revealing the ecological principles that shape a microbial community. Approaches that can help understand microbial inter-species interactions open new landscapes for controlling, engineering and synthesizing microbial communities for various applications. Computational studies in this field often rely on genome-scale metabolic models built from genome sequences. The main bottleneck in using these models is the need for manual curation. Alternatively, genome-scale metabolic networks generated from draft metabolic models lend themselves to many graph-theoretic analyses. The method we used in our research is based on a Python package previously developed in our group, MetQuest, which employs graph-theoretic algorithms to study metabolic networks. We developed scripts for constructing individual and community metabolic networks and predicting possible pairwise microbial interactions, higher-order microbial interactions, metabolic exchanges, and unique contributors of metabolic capabilities in the community. Because this approach captures the effect of changes in the metabolic environment, it allows us to study how interaction patterns change across environments. This method can help researchers focus wet-lab experiments on a specific aspect by pointing out possible interesting interactions in a microbiome. MetQuest is available at https://github.com/RamanLab/metquest
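
One simple graph-based way to look for possible metabolic exchanges is to compare which metabolites are reachable from a set of seed nutrients in individual versus pooled reaction networks. The sketch below illustrates that idea only; it does not use MetQuest's API or algorithm, and the organisms, reactions, and function name are hypothetical.

```python
from typing import FrozenSet, Iterable, List, Set, Tuple

Reaction = Tuple[FrozenSet[str], FrozenSet[str]]  # (substrates, products)


def reachable_metabolites(reactions: Iterable[Reaction], seeds: Set[str]) -> Set[str]:
    """Expand the seed set until no reaction can add a new metabolite.

    A reaction fires only when all of its substrates are already reachable.
    """
    reactions = list(reactions)
    scope = set(seeds)
    changed = True
    while changed:
        changed = False
        for substrates, products in reactions:
            if substrates <= scope and not products <= scope:
                scope |= products
                changed = True
    return scope


# Hypothetical draft networks for two organisms.
org_a: List[Reaction] = [
    (frozenset({"glucose"}), frozenset({"pyruvate"})),
    (frozenset({"lactate"}), frozenset({"propionate"})),
]
org_b: List[Reaction] = [
    (frozenset({"pyruvate"}), frozenset({"lactate"})),
]

environment = {"glucose"}
alone_a = reachable_metabolites(org_a, environment)
alone_b = reachable_metabolites(org_b, environment)
together = reachable_metabolites(org_a + org_b, environment)

# Metabolites that appear only when the two networks are pooled.
gained = together - (alone_a | alone_b)
print(sorted(gained))
```

Changing `environment` and recomputing the sets shows how the predicted interactions depend on the available nutrients, which corresponds to the environment-dependence mentioned in the abstract.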