Presentation Overview:
Metagenomic samples have high spatiotemporal variability. Hence, it is useful to summarize and characterize the microbial makeup of a given environment in a way that is biologically reasonable and interpretable. The UniFrac metric has been a robust and widely used metric for measuring the variability between metagenomic samples. We propose that the characterization of metagenomic environments can be achieved by finding the average, a.k.a. the barycenter, among the samples with respect to the UniFrac distance. However, it is possible that such a UniFrac-average includes negative entries, making it no longer a valid representation of a metagenomic community. To overcome this intrinsic issue, we propose a special version of the UniFrac metric, termed L2UniFrac, which inherits the phylogenetic nature of the traditional UniFrac and with respect to which one can easily compute the average, producing biologically meaningful environment-specific “representative samples”. We demonstrate the usefulness of such representative samples as well as the extended usage of L2UniFrac in efficient clustering of metagenomic samples, and provide mathematical characterizations and proofs of the desired properties of L2UniFrac. A prototype implementation is provided at: https://github.com/KoslickiLab/L2-UniFrac.git.
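The averaging idea can be illustrated with a minimal Python sketch. It assumes that an L2-type UniFrac distance takes the form sqrt(sum over branches e of l_e * (P_e - Q_e)^2), where l_e is the branch length and P_e is the abundance mass below branch e. The toy tree, the sample values, and all function names are hypothetical and are not taken from the L2-UniFrac repository. Under this assumption the push-up map is linear, so the barycenter is simply the component-wise mean of the samples and cannot contain negative entries.

```python
import math

# Hypothetical toy tree: child -> (parent, branch_length); "root" has no entry.
TREE = {
    "A": ("n1", 1.0),
    "B": ("n1", 2.0),
    "C": ("root", 3.0),
    "n1": ("root", 0.5),
}

def push_up(sample):
    """Map leaf abundances to sqrt(branch length) * mass below each branch."""
    mass = {node: 0.0 for node in TREE}
    for leaf, abundance in sample.items():
        node = leaf
        while node in TREE:  # walk from the leaf up to the root
            mass[node] += abundance
            node = TREE[node][0]
    return {node: math.sqrt(TREE[node][1]) * m for node, m in mass.items()}

def l2_unifrac(p, q):
    """Euclidean distance between pushed-up vectors (the assumed L2 form)."""
    up_p, up_q = push_up(p), push_up(q)
    return math.sqrt(sum((up_p[n] - up_q[n]) ** 2 for n in TREE))

def representative(samples):
    """Component-wise mean; the L2 barycenter because push_up is linear."""
    leaves = sorted({leaf for s in samples for leaf in s})
    return {
        leaf: sum(s.get(leaf, 0.0) for s in samples) / len(samples)
        for leaf in leaves
    }

samples = [
    {"A": 0.5, "B": 0.5, "C": 0.0},
    {"A": 0.0, "B": 0.5, "C": 0.5},
]
rep = representative(samples)
```

In this toy case the representative sample is equidistant from both inputs and still sums to one, which is the property that makes it a valid community profile.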
Presentation Overview:
In metagenomics, a fundamental computational task is that of determining which genomes from a reference database are present or absent in a given metagenomic sample. While tools exist to answer this question, all approaches to date return point estimates with no associated confidence or uncertainty. This has led to practitioners experiencing difficulty when interpreting the results from these tools, particularly for low-abundance organisms, which often reside in the “noisy tail” of incorrect predictions. Furthermore, few tools account for incomplete reference databases that rarely contain exact replicas of genomes present in a given metagenome.
Here, we present solutions for these issues by introducing the algorithm YACHT: Yes/No Answers to Community membership via Hypothesis Testing. YACHT is a statistical framework that accounts for sequence divergence between the reference and sample genomes, in terms of average nucleotide identity, as well as incomplete sequencing depth, thus providing a hypothesis test for determining the presence or absence of a reference genome in a sample.
After introducing our approach, we quantify its statistical power and perform extensive experiments using both simulated and real data to confirm the accuracy and scalability of this approach. Code implementing this approach is available at https://github.com/KoslickiLab/YACHT.
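To make the hypothesis-testing idea concrete, here is a minimal Python sketch. It assumes a simple independent point-mutation model in which a reference k-mer survives unmutated with probability ANI^k, and that sketched k-mers are observed independently. The function names, the default k = 31, and the significance level are illustrative assumptions; this is not YACHT's actual interface or its exact statistical model.

```python
from math import comb

def kmer_survival_prob(ani, k):
    """Probability that a k-mer is unmutated if each base matches with probability `ani`."""
    return ani ** k

def binom_cdf(x, n, p):
    """P(X <= x) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(x + 1))

def presence_threshold(n_kmers, ani, k=31, alpha=0.05):
    """Largest match count m such that P(X < m) <= alpha under the null
    hypothesis 'a genome at this ANI is present'."""
    p = kmer_survival_prob(ani, k)
    m = 0
    while m < n_kmers and binom_cdf(m, n_kmers, p) <= alpha:
        m += 1
    return m

def is_present(observed_matches, n_kmers, ani, k=31, alpha=0.05):
    """Call the genome present unless too few of its k-mers are observed."""
    return observed_matches >= presence_threshold(n_kmers, ani, k, alpha)
```

Calling a genome absent only when the match count falls below the threshold bounds the false-negative rate by `alpha` under the assumed model; a lower ANI threshold tolerates more divergence and therefore requires fewer matching k-mers.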
Presentation Overview:
Heart failure (HF) is a complex clinical syndrome characterized by the heart's inability to meet the body's blood supply needs, which affects approximately 26 million adults globally. Current diagnosis relies heavily on symptoms and clinical history, underscoring the importance of early identification of individuals at risk. In September 2022 we launched the FINRISK Microbiome DREAM challenge (synapse.org/finrisk) to investigate the potential of gut microbiome compositions (n=5749 taxonomic features) in predicting HF risk in a large population of 7231 Finnish adults (FINRISK 2002, n = 493 incident HF cases). To protect the privacy of individuals, we provided synthetic data that closely mimics the real data. Challenge participants' models were evaluated using Harrell's C and the Hosmer-Lemeshow test, with robust ranking ensured through bootstrap sampling of predicted and true scores. After the evaluations we selected as winners two teams whose performances were comparable (Harrell’s C statistic: 0.8394 and 0.8351; Hosmer-Lemeshow test: 0.0033 and 0.012, respectively). Both teams employed regression methods with different approaches for defining feature importance. The challenge offered a platform for advancing our understanding of the microbiome's role in HF and paves the way for future research to improve HF risk prediction and patient outcomes.
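For readers unfamiliar with the evaluation metric, the following minimal Python sketch computes Harrell's C on toy data. It uses the standard pairwise definition; the function name and the example values are illustrative and are not the challenge's scoring code. Pairs with tied follow-up times are ignored for simplicity.

```python
def harrell_c(times, events, risk_scores):
    """Fraction of comparable pairs in which the higher-risk subject fails first.

    times: follow-up time per subject
    events: 1 if the event (e.g. incident HF) was observed, 0 if censored
    risk_scores: model output, higher meaning higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when subject i has an observed event
            # strictly before subject j's follow-up time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # ties in predicted risk count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
perfect = harrell_c(times, events, [0.9, 0.7, 0.4, 0.1])
```

A value of 1.0 means every comparable pair is ranked correctly, 0.5 corresponds to random ranking, and a bootstrap over subjects (as described above) can be wrapped around this function to obtain a ranking distribution.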
Presentation Overview:
Background: Head and neck squamous cell carcinoma (HNSCC) is the 7th most common type of cancer worldwide, leading to 450,000 deaths annually. Previous studies have shown that intra-tumoural microbiome signatures can be extracted from public tumour WGS datasets. However, these studies commonly consider the effects of relative microbial abundances, not total bacterial load.
Methods: Independent multiomic datasets from 118 Human Papilloma Virus negative HNSCC patients from the Cancer Genome Atlas (TCGA) and 110 patients from the Clinical Proteomic Tumor Analysis Consortium (CPTAC) were analysed using all tumour WGS reads that did not map to the human genome. Likely contaminant species were removed based on sequencing centre batch effects. Bacterial load was calculated as the proportion of classified microbial reads to total library read depth. Bacterial load was integrated with matching transcriptomic, proteomic and survival datasets using non-negative matrix factorisation (TCGA) and variational autoencoder (CPTAC) based approaches.
Results: Low intratumoural bacterial load was associated with mesenchymal tumour phenotypes based on transcriptomic and proteomic signatures and was negatively associated with 3- and 5-year patient survival. Tumours with higher bacterial load were characterised by the presence of the common oral microbes Fusobacteria, Treponema, Prevotella and Streptococcus, by epithelial tumour phenotypes, and by significantly higher survival.
Presentation Overview:
Motivation: Interactions among microbes within microbial communities have been shown to play crucial roles in human health. In spite of recent progress, knowledge of the bacteria driving microbial interactions within microbiomes remains limited, restricting our ability to fully decipher and control microbial communities.
Results: We present Bakdrive, a novel approach for identifying species driving interactions within microbiomes. Bakdrive infers ecological networks of given metagenomic sequencing samples and identifies minimum sets of driver species (MDS) using control theory. Bakdrive has three key innovations in this space: (i) it leverages inherent information from metagenomic sequencing samples to identify driver species, (ii) it explicitly takes host-specific variation into consideration, and (iii) it does not require a known ecological network. Using extensive simulated data, we demonstrate that by identifying driver species from healthy donor samples and introducing them into the disease samples, we can restore the gut microbiome of patients with recurrent Clostridioides difficile infection (rCDI) to a healthy state. We also applied Bakdrive to two real datasets, from rCDI and Crohn's disease patients, uncovering driver species consistent with previous work. Bakdrive represents a novel approach for capturing microbial interactions.
Availability: Bakdrive is open-source and available at: https://gitlab.com/treangenlab/bakdrive
Presentation Overview:
The gut microbiome is known to play a crucial role in human health and is influenced by various factors, particularly diet. However, the relationship between diet and the gut microbiome is complex and heterogeneous, as individuals with different diets provide the microbiome with different sources of energy, which can affect not only the abundance of microbes but also the relationships among them. A better understanding of these complex diet-microbiome interactions holds promise for the development of personalized nutrition.
To uncover such diet-related heterogeneous microbial interactions, we propose a novel method, the Nutrition-Ecotype Graphical Mixture of Experts (NEGMoE), which models microbial co-abundance networks and accounts for diet-specific cohort variability via a mixture-of-experts model with a graphical lasso penalty.
We applied NEGMoE to real-world microbiome Parkinson’s disease (PD) datasets and identified two subcohorts with different energy and fiber intakes. We observed differential correlation structures among the microbiome within these two diet subcohorts. The correlation between the short-chain fatty acid-producing genera Faecalibacterium and Blautia increased in the high-fiber group, while the correlation between Bifidobacterium and Ruminococcus decreased in this group. These taxa have been shown to be important in PD. Our results further demonstrate that dietary fiber can influence interactions among them.
Presentation Overview:
Integrating multi-omics data is critical to understanding the role that the gut microbiome plays in regulating host immunity development. In 2017, we established the systems immunology Antibiotics and Immune Responses (AIR) study to understand how early life antibiotic exposure may be impacting responses to scheduled vaccines. The AIR study assessed the impact of neonatal antibiotic exposure on infant vaccine immune responses at 7 and 15 months in a cohort of 255 vaginally-born, healthy, term infants. Neonates directly exposed to antibiotics had significantly lower antibody titres against multiple different vaccine antigens, most notably to polysaccharides in the PCV13 pneumococcal vaccine.
To understand the host-microbe interactions underpinning how antibiotic exposure impacted vaccine responses, we undertook a longitudinal multi-omics analysis integrating shotgun metagenomics and bacterial load measurements (n=409 samples) with whole blood transcriptomics (RNAseq, n=329 samples), multi-parameter immunophenotyping by flow cytometry (n=156) and vaccine antibody responses (n=499). Through this analysis we identified a direct correlation between Bifidobacterium spp. and vaccine responses, which appeared to be mediated by changes in B cell transcriptional signatures pre-vaccination. The AIR study provides a compelling example of the utility of multi-omics microbiota research for understanding how host-microbe interactions shape health.
Presentation Overview:
The growing threat of antimicrobial resistance (AMR) calls for new epidemiological surveillance methods, as well as a deeper understanding of how antimicrobial resistance genes (ARGs) have been transmitted around the world. The large pool of sequencing data available in public repositories provides an excellent resource for monitoring the temporal and spatial dissemination of AMR in different ecological settings. However, only a limited number of research groups globally have the computational resources to analyze such data. We retrieved 442 Tbp of sequencing reads from 214,095 metagenomic samples from the European Nucleotide Archive (ENA) and aligned them using a uniform approach against ARGs and 16S/18S rRNA genes. Here, we present the results of this extensive computational analysis and share the counts of reads aligned. Over 6.76 × 10^8 read fragments were assigned to ARGs and 3.21 × 10^9 to rRNA genes, where we observed distinct differences in both the abundance of ARGs and the link between microbiome and resistome compositions across various sampling types. This collection is another step towards establishing global surveillance of AMR and can serve as a resource for further research into the environmental spread and dynamic changes of ARGs.
Presentation Overview:
Biosynthetic Gene Clusters (BGCs) are regions of co-localized genes encoding the biosynthetic machinery capable of synthesizing a particular secondary metabolite. BGCs found in the gut microbiome are known to produce antibiotics, but can also encode the production of compounds relevant for host health, such as colibactin, a genotoxin contributing to colorectal cancer.
We present three scalable and accurate methods for the computational discovery and analysis of BGCs: GECCO, an accurate and scalable genome mining method for identifying BGCs in (meta-)genomes; HTGCF, a high-throughput clustering pipeline for grouping BGCs into Gene Cluster Families (GCFs), based on sequence and protein-content similarity; and CONCH, a data-driven machine-learning approach for predicting the chemical structure of a BGC compound.
We applied these methods to analyze over 300,000 genomes (including metagenomic assemblies) from human gut bacteria, identifying many BGCs (>400,000 BGCs, 64.5%) undiscovered by existing methods (e.g. antiSMASH). Clustering these BGCs into GCFs revealed that the majority (>65% of GCFs) was confined to a single species. After manually screening BGCs predicted in Bacteroidota, we selected gene clusters with rare biosynthetic features previously unseen in the human gut for further experimental validation.
Presentation Overview:
Next-generation sequencing has substantially increased genomic data volume and complexity, often exceeding terabytes in size. Traditional bioinformatic tools, designed for single computer operations, struggle to cope with these datasets. Despite the emergence of parallel frameworks like Apache Spark, Dask, Polars, and Ray, their application to genomic problems remains limited.
We introduce Axolotl, a scalable library built on Apache Spark, specifically designed for large-scale genomic data analysis. Axolotl creates genomics-specific function modules, enabling biologists to utilize Python for distributed computing environments. Users can harness Spark's built-in SQL and Machine Learning libraries for scalable bioinformatics analysis. We present two distinct use cases: a global-scale examination of over 1.5 million biosynthetic gene clusters (BGCs) and a distributed batch computation of polygenic risk scores (PRS). Axolotl efficiently processes these datasets in parallel using 32+ 16-core compute nodes, virtually combining the power of a 512-core, 4TB RAM machine, with entire analysis pipelines implemented in fewer than 50 lines of code.
Our findings highlight Axolotl's potential to revolutionize how researchers tackle large genomic data sets, enabling swift, scalable, and accurate analyses across a broad spectrum of omics applications.
Presentation Overview:
Microbial natural products represent a major source of bioactive compounds for drug discovery. Among these molecules, Non-Ribosomal Peptides (NRPs) represent a diverse class that includes antibiotics, immunosuppressants, anticancer agents, toxins, siderophores, pigments, and cytostatics. The discovery of novel NRPs remains a laborious process because many NRPs consist of non-standard amino acids that are assembled by Non-Ribosomal Peptide Synthetases (NRPSs). Adenylation domains (A-domains) in NRPSs are responsible for selection and activation of monomers appearing in NRPs. During the past decade, several support vector machine-based algorithms have been developed for predicting the specificity of the monomers present in NRPs. These algorithms utilize physicochemical features of the amino acids present in the A-domains of NRPSs. In this paper, we benchmarked the performance of various machine learning algorithms and features for predicting specificities of NRPSs and showed that the extra trees model paired with one-hot encoding features outperforms the existing approaches. Moreover, we show that unsupervised clustering of 453,560 A-domains reveals many clusters that correspond to potentially novel amino acids. While it is challenging to predict the chemical structure of these amino acids, we developed novel techniques to predict their various properties, including polarity, hydrophobicity, charge, and the presence of aromatic rings, carboxyl groups, and hydroxyl groups.
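The one-hot feature encoding mentioned above can be sketched in a few lines of Python. This is a minimal illustration, assuming that each A-domain is represented by a fixed-length string of residues; the 20-letter alphabet ordering, the function name, and the 10-residue example string are arbitrary choices for illustration and do not come from the paper.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 standard residues
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot(signature):
    """Concatenate one 20-dimensional indicator vector per residue.

    Gap characters or non-standard residues are left as all-zero blocks.
    """
    features = []
    for residue in signature:
        block = [0] * len(AMINO_ACIDS)
        if residue in INDEX:
            block[INDEX[residue]] = 1
        features.extend(block)
    return features

# Arbitrary 10-residue string used purely as an example input.
features = one_hot("DAWTIAAICK")
```

A matrix of such vectors, one row per A-domain, is the kind of input a tree ensemble such as an extra trees classifier can be trained on; unlike physicochemical descriptors, it makes no prior assumption about which residue properties matter.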
Presentation Overview:
Summary: A major bottleneck in understanding newly discovered microbial genes is the lack of functional annotations. Genes participating in the same biological pathways are often found co-localized in the genome as conserved gene clusters. Conservation of gene neighborhood between two genes over a long evolutionary distance is a strong indicator of functional association, in addition to sequence homology. Spacedust (Spatially conserved gene cluster search tool) is a fast and sensitive tool for systematic, de novo discovery of conserved gene clusters across multiple genomes, which does not require experimental or functional information. Given a set of genomes, it systematically finds all clusters of matched homologous genes showing significant neighborhood conservation between two genomes. Matches to homologous genes are found by sensitive structure similarity search with Foldseek. Additionally, we improved search speed against a large target database, making it fast enough to support all-vs-all searches of a large number of genomes. We anticipate that Spacedust will provide insights into the function and evolutionary conservation of existing complete prokaryotic genomes and novel (meta)genomic contigs.
Availability and implementation: Spacedust is available as an open-source (GPLv3), user-friendly command-line software for Linux and macOS (https://github.com/soedinglab/spacedust).
Presentation Overview:
Secondary metabolites are biomolecules that are not essential for the growth or survival of organisms but offer ecological and physiological advantages to them. They have various applications in medicine, biotechnology, and agriculture. These metabolites are often produced by biosynthetic gene clusters (BGCs), groups of genes located close to each other in a microbial genome.
The advent and widespread availability of metagenomics have enabled the study of BGCs and their associated secondary metabolites directly from environmental samples, eliminating the need to culture individual microorganisms.
Here, we present the BGC Atlas, a web resource that facilitates exploring and analysing the diversity of biosynthetic gene clusters found in various environments through metagenomic sequencing. The BGC Atlas identifies and clusters BGCs from publicly available metagenomic datasets and provides a centralized database for exploring BGCs, their gene cluster families (GCFs), their associations with metadata, and the ability to search for similarities to the identified BGCs.
A web resource enabling researchers to easily explore and analyze the diversity of biosynthetic gene clusters in environmental samples can significantly enhance our understanding of secondary metabolites produced by microorganisms. Additionally, it can promote the identification of ecological and evolutionary factors that influence the biosynthetic potential of microbial communities.
Presentation Overview:
Carbohydrate Active EnZymes (CAZymes) are important for microbial communities to thrive in carbohydrate-rich environments such as animal guts, agricultural soils, forest floors, and ocean sediments. Since 2017, microbiome sequencing and assembly have produced numerous metagenome-assembled genomes (MAGs). We have updated our dbCAN-seq database (https://bcb.unl.edu/dbCAN_seq) to include the following new data and features: (i) ∼498 000 CAZymes and ∼169 000 CAZyme gene clusters (CGCs) from 9421 MAGs of four ecological environments (human gut, human oral, cow rumen, and marine); (ii) Glycan substrates for 41 447 (24.54%) CGCs inferred by two novel approaches (dbCAN-PUL homology search and eCAMI subfamily majority voting), which agreed on substrate assignments for 4183 CGCs; (iii) A redesigned CGC page to include the graphical display of CGC gene compositions, the alignment of query CGC and subject PUL (polysaccharide utilization loci) of dbCAN-PUL, and the eCAMI subfamily table to support the predicted substrates; (iv) A statistics page to organize all the data for easy CGC access according to substrates and taxonomic phyla; and (v) A batch download page. In summary, this updated dbCAN-seq database highlights glycan substrates predicted for CGCs from microbiomes. Future work will implement the substrate prediction function in our dbCAN2 web server.
Presentation Overview:
Microbiome research is currently facing a reproducibility challenge. Numerous laboratory protocols exist for generating microbiome sequencing data, and each laboratory method creates protocol-specific biases in microbiome results, limiting their comparability. Even more concerningly, poor reporting of methods hampers the reproducibility of findings.
To evaluate the FAIR status, and to quantify the impact of specific laboratory choices on protocol bias, we performed a meta-analysis of bacterial mock communities from 52 published microbiome studies. We extracted 67 protocol variables from each study’s methods section, and jointly pre-processed corresponding raw mock sequencing data of 171 samples.
Regarding FAIR principles, we found no increase in raw data deposition over time. Key factors for protocol reproducibility, e.g. PCR details, were frequently not reported. The substantial protocol-specific bias in mock microbiome composition was mainly driven by the choice of extraction kit, 16S region, primers, and PCR conditions, which jointly explained up to 96% of the variation in bias per bacterial genus.
This unique meta-analysis approach using standardized mock controls revealed striking protocol-specific variation in microbiome data. We provide novel insights into the incomplete implementation of FAIR principles and reporting guidelines, highlighting the need for open and reproducible science for overcoming protocol biases in microbiome research.
Presentation Overview:
The advancement in high-throughput sequencing has increased the number of sequenced microbial genomes at an exponential pace, contributing to our understanding of the genetic diversity encoded in microbiomes. However, there is a vast gap between the amount of sequence information generated and their functional characterization. We have developed a new metagenome analysis workflow integrating de novo genome reconstruction, taxonomic profiling, and deep learning-based functional annotations from DeepFRI. We validate DeepFRI functional annotations by comparing them to orthology-based annotations from eggNOG on a set of 1,070 infant metagenome samples from the DIABIMMUNE cohort. Additionally, the workflow facilitates mapping between the Gene Ontology terms and COG categories. We have generated a sequence catalogue of 1.9M non-redundant genes. The functional annotations revealed 70% concordance between Gene Ontology annotations predicted by DeepFRI and eggNOG. DeepFRI improved the annotation coverage, with 99% of the gene catalogue obtaining Gene Ontology molecular function annotations, albeit less specific compared to eggNOG. Additionally, we construct pan-genomes in a reference-free manner and analyse the associated annotations. eggNOG annotated more genes on well-studied organisms such as Escherichia coli while DeepFRI was less sensitive to taxa. This workflow will contribute to novel understanding of the functional signature of the human gut microbiome.
Presentation Overview:
Microbial communities play essential roles in various biological processes, and manipulating their composition and structure can enhance the production of valuable products and improve human health. Coexistence theory provides insights into the mechanisms that allow multiple species to coexist in the same community. Understanding these mechanisms is essential for predicting how a community may change in response to environmental disturbances and for developing effective strategies for conserving and managing biodiversity. However, to apply this theory we need to identify the niche of each species and model trade-offs. We recently proposed a technique, FBA-PRCC, that applies global sensitivity analysis to whole-genome bacterial metabolic models. We modelled the sensitivity of bacterial growth to the presence of various external metabolites via FBA-PRCC and created the community sensitivity graph, in which each bacterial species is connected to the metabolites it is sensitive to. Based on the AGORA reconstructions, we have prepared the sensitivity graph of the human gut microbiome. In addition, we have collected publicly available human gut metagenomics data from MG-RAST in which species of the AGORA collection cover more than 50% of the DNA abundance. We are developing graph-based techniques to evaluate the stability of the community composition from its sensitivity graph.
Presentation Overview:
In recent years, the microbial composition of a large number of samples has been resolved through sequencing, thereby enabling microbial network inference and analysis. Dozens of network inference algorithms have been developed to predict interactions in microbial communities from abundance data, but comparative evaluations revealed that their accuracy is generally low. However, the analysis of microbial networks can generate biological hypotheses beyond interaction prediction. Here, I will present several tools dedicated to the analysis of microbial networks, including manta for network clustering, anuran for network comparison, mako for network querying and microbetag for network annotation, and illustrate them on examples.
Presentation Overview:
Motivation: As viruses that mainly infect bacteria, phages are key players across a wide range of ecosystems. Analyzing phage proteins is indispensable for understanding phages' functions and roles in microbiomes. High-throughput sequencing enables us to obtain phages in different microbiomes at low cost. However, compared to the fast accumulation of newly identified phages, phage protein classification remains difficult. In particular, a fundamental need is to annotate virion proteins, i.e. structural proteins such as major tail and baseplate proteins. Although there are experimental methods for virion protein identification, they are too expensive or time-consuming, leaving a large number of proteins unclassified. Thus, there is a great demand for a computational method for fast and accurate phage virion protein (PVP) classification.
Results: In this work, we adapted the state-of-the-art image classification model, Vision Transformer, to conduct virion protein classification. By encoding protein sequences into unique images using chaos game representation, we can leverage Vision Transformer to learn both local and global features from sequence “images”. Our method, PhaVIP, has two main functions: classifying PVP and non-PVP sequences and annotating the types of PVP, such as capsid and tail. We tested PhaVIP on several datasets with increasing difficulty and benchmarked it against alternative tools. The experimental results show that PhaVIP has superior performance. After validating the performance of PhaVIP, we investigated two applications that can use the output of PhaVIP: phage taxonomy classification and phage host prediction. The results showed the benefit of using classified proteins over all proteins.
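The chaos game representation step can be illustrated with a minimal Python sketch. It assumes a generic construction in which the 20 standard amino acids sit on the vertices of a regular polygon and the current point moves halfway toward the vertex of each successive residue; the vertex layout, contraction ratio, image resolution, and function name are illustrative assumptions and may differ from the encoding PhaVIP actually uses.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumed layout: one vertex per amino acid on the unit circle.
VERTICES = {
    aa: (math.cos(2 * math.pi * i / 20), math.sin(2 * math.pi * i / 20))
    for i, aa in enumerate(AMINO_ACIDS)
}

def cgr_image(sequence, resolution=32):
    """Return a resolution x resolution grid of visit counts for a protein."""
    image = [[0] * resolution for _ in range(resolution)]
    x, y = 0.0, 0.0  # start at the polygon centre
    for residue in sequence:
        if residue not in VERTICES:
            continue  # skip ambiguous or non-standard residues
        vx, vy = VERTICES[residue]
        x, y = (x + vx) / 2, (y + vy) / 2  # move halfway toward the vertex
        col = min(int((x + 1) / 2 * resolution), resolution - 1)
        row = min(int((y + 1) / 2 * resolution), resolution - 1)
        image[row][col] += 1
    return image

image = cgr_image("MKTAYIAKQR")
```

Each pixel count depends on the residue just read and, with decreasing weight, on the residues before it, which is why a grid like this can expose both local and longer-range sequence patterns to an image model.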
Presentation Overview:
The development of low-cost sequencing technologies has generated a massive number of microbiome datasets in public repositories during the last 20 years. However, their reuse raises many difficulties, which makes their comparison and integration very limited, even for a given ecosystem.
In this study, we present an integrative bioinformatics approach focusing on public metagenetic 16S datasets targeting lacto-fermented vegetables. This ecosystem needs to be better characterized regarding how microbial communities interact and evolve dynamically.
We have developed a workflow to explore, compare, and integrate public 16S datasets to conduct meta-analyses in the microbiota field. The workflow includes searching and selecting public time-series datasets and constructing Amplicon Sequence Variant (ASV) association networks based on co-abundance metrics. Microbial community detection is achieved by comparing and clustering ASV networks (Figure 1). We applied the workflow to ten public datasets and demonstrated its value in precisely monitoring fermentation by identifying the succession of bacterial communities (Figure 2) and putative core consortia shared by different plant fermentation types (Figure 3).
Our integrative analysis demonstrates that the reuse and integration of microbiome datasets can provide new insights into a little-known biotope and add value to the independent analysis of individual studies.
Presentation Overview:
Differential network (DN) analysis disentangles the microbial co-abundance among taxa by comparing network properties between two or more graphs under different biological conditions. However, existing methods for DN analysis of microbiome data do not adjust for other clinical differences between subjects.
We propose Statistical Approach via Pseudo-value Information and Estimation for Differential Network Analysis (SOHPIE-DNA), which incorporates additional covariates such as continuous age and categorical BMI. SOHPIE-DNA is the first attempt to introduce a regression framework that adopts jackknife pseudo-values for DN analysis of microbiome data. This enables modelling of a network's connectivity characteristics in the presence of additional covariate information in the regression.
We demonstrate through simulations that SOHPIE-DNA consistently reaches higher recall and F1-score, while maintaining similar precision and accuracy to existing methods (NetCoMi and MDiNE). Lastly, we apply SOHPIE-DNA to two real datasets from the American Gut Project and the Diet Exchange Study to showcase its utility. The latter example highlights that SOHPIE-DNA can be used to incorporate temporal change of connectivity of taxa with covariate adjustment. As a result, our method has found taxa that are related to the prevention of intestinal inflammation and fatigue in advanced metastatic cancer patients.
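The jackknife pseudo-value construction that underlies the regression can be sketched in Python as follows. The formula is the standard one, n * theta_hat - (n - 1) * theta_hat_(-i); the function name and the use of a plain mean as the example statistic are illustrative assumptions, whereas in a DN setting the statistic would be a connectivity measure computed from the estimated co-abundance network.

```python
def jackknife_pseudo_values(observations, statistic):
    """Return one pseudo-value per observation for the given statistic.

    pseudo_i = n * statistic(all) - (n - 1) * statistic(all except i)
    """
    n = len(observations)
    full_estimate = statistic(observations)
    pseudo_values = []
    for i in range(n):
        leave_one_out = observations[:i] + observations[i + 1:]
        pseudo_values.append(n * full_estimate - (n - 1) * statistic(leave_one_out))
    return pseudo_values

def mean(values):
    return sum(values) / len(values)

data = [1.0, 2.0, 4.0, 7.0]
pseudo = jackknife_pseudo_values(data, mean)
```

For the mean, each pseudo-value reduces to the original observation, which shows why pseudo-values can act as subject-level responses: they give every subject its own contribution to a network-level quantity, and those contributions can then be regressed on covariates such as age or BMI.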
Presentation Overview:
Efficient and cost-effective high throughput DNA sequencing techniques have enhanced the study of complex microbial systems, leading to important conclusions in different fields. Differential abundance (DA) analysis finds a microbial signature looking at differences in taxa abundances between classes of samples. While many bioinformatics methods have been specifically developed for microbiome data, currently there is no consensus about the best approach.
In this work we performed an extensive benchmarking of 12 widely used DA methods (ALDEx2, eBay, ANCOM-BC, corncob, MaAsLin2, metagenomeSeq, edgeR, DESeq2, ...) across scenarios and covariates not yet investigated by previous studies such as the combined effect of sample size, percentage of DA taxa, sequencing depth, fold change, variability of taxa, low abundance DA taxa, normalisation and different ecological niches.
We simulated count data with DA features (both in terms of absolute and relative abundances) using the metaSPARSim simulator, taking care to reproduce real data characteristics (e.g. compositionality, sparsity and the taxa intensity-variability relationship).
This paper provides researchers with useful recommendations for properly conducting DA analysis on their own datasets. Moreover, the proposed assessment framework is released in the metaBenchDA R package and through a Docker container, thus representing a robust and reproducible tool for future benchmarking studies.
Presentation Overview:
Motivation: Metagenomic binning methods to reconstruct metagenome-assembled genomes (MAGs) from environmental samples have been widely used in large-scale metagenomic studies. The recently proposed semi-supervised binning method, SemiBin, achieved state-of-the-art binning results in several environments. However, this required annotating contigs, a computationally costly and potentially biased process.
Results: We propose SemiBin2, which uses self-supervised learning to learn feature embeddings from the contigs. In simulated and real datasets, we show that self-supervised learning achieves better results than the semi-supervised learning used in SemiBin1 and that SemiBin2 outperforms other state-of-the-art binners. Compared to SemiBin1, SemiBin2 can reconstruct 8.3%–21.5% more high-quality bins and requires only 25% of the running time and 11% of peak memory usage in real short-read sequencing samples. To extend SemiBin2 to long-read data, we also propose an ensemble-based DBSCAN clustering algorithm, resulting in 13.1%–26.3% more high-quality genomes than the second best binner for long-read data.
Availability and implementation: SemiBin2 is available as open source software at https://github.com/BigDataBiology/-
SemiBin/ and the analysis script used in the study can be found at https://github.com/BigDataBiology/SemiBin2_benchmark.
Contact: Correspondence should be addressed to xmzhao@fudan.edu.cn and luispedro@big-data-biology.org.
Supplementary information: Supplementary data are available online.
Presentation Overview: Show
Metagenomics studies genomic material derived from the mixed microbial communities found in diverse environments and has significant implications for both human health and environmental sustainability. Metagenomic binning refers to the clustering of genomic subsequences obtained from high-throughput DNA sequencing into distinct bins, each corresponding to a constituent organism in the community. In contrast to earlier methods that use the composition and abundance of sequences, certain graph-based binning tools have been proposed that leverage homophily information from the assembly graph. However, the binning problem is exacerbated by the fragment-level assembly graph, heterophilous constraints from single-copy marker genes, an unknown number of bins, and a skewed bin size distribution. In this paper, we formulate metagenomic binning as a combination of graph learning and constraint satisfaction problems and design a reference-free binning tool, NeuroBin, which involves (i) a graph neural network model to learn the fragment-level assembly graph while respecting constraints, (ii) a constrained-contigs matching algorithm to generate initial bins with an accurate count, (iii) a neural network-based combinatorial optimization model to minimize the constraints violated by binning, and (iv) a local refinement strategy to adjust binning results. Extensive experiments conducted on simulated datasets demonstrate that NeuroBin surpasses state-of-the-art binning methods based on the assembly graph.
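The constraint from single-copy marker genes can be made concrete with a small sketch: two contigs that carry the same single-copy marker gene should not end up in the same bin. The function below is an illustrative assumption, not NeuroBin's objective; `count_violations`, the contig names and the marker assignments are all hypothetical, and a violation is counted here as each extra copy of a marker within a bin.

```python
from collections import Counter


def count_violations(bin_of, markers_of):
    """Count extra copies of single-copy marker genes within each bin.

    bin_of:     dict mapping contig name -> bin id
    markers_of: dict mapping contig name -> set of marker genes on that contig
    """
    copies = Counter()
    for contig, bin_id in bin_of.items():
        for marker in markers_of.get(contig, ()):
            copies[(bin_id, marker)] += 1
    # A marker seen n times in one bin contributes n - 1 violations.
    return sum(n - 1 for n in copies.values() if n > 1)


# Hypothetical contigs and marker genes.
markers_of = {"c1": {"rpoB"}, "c2": {"rpoB", "recA"}, "c3": {"recA"}}

binning = {"c1": 0, "c2": 0, "c3": 1}   # c1 and c2 share rpoB in bin 0
separated = {"c1": 0, "c2": 2, "c3": 1}  # every contig in its own bin

print(count_violations(binning, markers_of))
print(count_violations(separated, markers_of))
```

A binning method that minimizes this kind of count is pushed to separate contigs that cannot belong to the same genome, even when they are neighbours in the assembly graph.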
Presentation Overview: Show
Plasmids are mobile genetic elements that carry important accessory genes. Cataloging plasmids is crucial for elucidating their roles in promoting horizontal gene transfer between bacteria. Metagenomic sequencing is the main source for discovering new plasmids. However, it is difficult to detect plasmid contigs, which are often short and have heterogeneous origins. Available tools for plasmid contig detection have limitations. Specifically, alignment-based tools tend to miss diverged plasmids, while learning-based tools often have lower precision.
In this work, we develop a plasmid detection tool, PLASMe, that exploits both alignment and learning-based strategies. Plasmids closely related to the reference plasmids can be easily identified using alignment, while diverged plasmids can be predicted using order-specific Transformer models. By encoding plasmids as a language defined on a protein cluster-based token set, the Transformer can learn the importance of proteins and their correlation through positional token embedding and the attention mechanism. We compared PLASMe and other tools on complete plasmids, plasmid contigs, and contigs assembled from CAMI2 simulated data. PLASMe achieved the highest F1-score. We also tested it on real metagenomic and plasmidome data. The examination of some commonly used marker genes shows that PLASMe exhibits more reliable performance than other tools.
Presentation Overview: Show
Bacterial species in microbial communities are often represented by mixtures of strains. Variation in strain genomes may have important phenotypic effects; however, strain-level deconvolution of microbial communities remains challenging. Short-read approaches can be used to detect small-scale variation between strains, but fail to phase these variants into contiguous haplotypes. Recent advances in long-read metagenomics have resulted in complete de novo assemblies of various bacterial species. However, current assembly approaches often suppress strain-level variation and instead produce a species-level consensus representation. Strain variants are often unevenly distributed, and regions of high and low heterozygosity may interleave in the assembly graph, resulting in tangles. To address this, we developed an algorithm for metagenomic phasing and assembly called stRainy. Our approach takes a sequence graph as input, identifies graph regions that represent collapsed strains, phases them and represents the results in an expanded and simplified assembly graph. We benchmark stRainy using simulated data and mock metagenomic communities with both PacBio HiFi and Oxford Nanopore reads and show that it achieves strain-level deconvolution with high completeness and low error rates compared to other strain assembly and phasing approaches.
Presentation Overview: Show
Long-read assemblers struggle to distinguish between closely related strains and therefore tend to collapse them into a single sequence. This hinders metagenome analysis, as closely related strains present in a sample may have important functional differences. To solve this problem, we present a new pipeline, called HairSplitter, that phases a (partially or totally) collapsed assembly, thereby improving the reconstruction of the genomes of the different strains in a metagenome. The originality of the method lies in a custom variant-calling step that allows HairSplitter to filter out most sequencing errors from a long-read alignment. On simulated datasets comprising up to 10 strains of the same species and on a real dataset containing 5 strains of E. coli, HairSplitter improved on metaFlye, outperforming Strainberry and stRainy in terms of k-mer completeness, both for low-quality Nanopore and high-quality PacBio HiFi sequences.
Presentation Overview: Show
Genome mining has become a key technology for discovering novel natural products with therapeutic potential. Such analysis involves searching for biosynthetic gene clusters (BGCs) within the genome of an organism to identify genes responsible for the production of natural products. Bacteria and fungi are particularly attractive targets for genome mining due to their high genetic diversity and ability to produce a wide range of bioactive compounds. Microbial genomes are relatively small compared to those of higher eukaryotes, making them easier to sequence and analyze. In the pre-genomic era, a limited number of microbes were studied for drug discovery. The growing availability of genomic data has led to a significant increase in the number of novel BGCs identified through genome sequence mining. The limited computational methods for prioritization, as well as a lack of comparison across virulence risk classes of species (e.g. COGEM), have created challenges for further understanding and identification of species. In this work, we examine 81 species of bacteria and fungi from the COGEM list to identify genomic traits that could be used to prioritize species and clades for drug discovery. Based on our analysis, we propose novel methods and suggestions for genome mining that can improve the efficiency of discovering new biomolecules.
Presentation Overview: Show
The analysis of bacterial isolates to detect plasmids is important due to their role in the propagation of antimicrobial resistance. In short-read sequence assemblies, both plasmids and bacterial chromosomes are typically split into several contigs of various lengths, making identification of plasmids a challenging problem. In plasmid contig binning, the goal is to distinguish short-read assembly contigs based on their origin into plasmid and chromosomal contigs and subsequently sort plasmid contigs into bins, each bin corresponding to a single plasmid. Previous works on this problem consist of de novo approaches and reference-based approaches. De novo methods rely on contig features such as length, circularity, read coverage, or GC content. Reference-based approaches compare contigs to databases of known plasmids or plasmid markers from finished bacterial genomes.
Recent developments suggest that leveraging information contained in the assembly graph improves the accuracy of plasmid binning. We present PlasBin-flow, a hybrid method that defines contig bins as subgraphs of the assembly graph. PlasBin-flow identifies such plasmid subgraphs through a mixed integer linear programming model that relies on the concept of network flow to account for sequencing coverage, while also accounting for the presence of plasmid genes and the GC content that often distinguishes plasmids from chromosomes. We demonstrate the performance of PlasBin-flow on a real data set of bacterial samples.
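As a small illustration of the contig features mentioned above (length and GC content), the sketch below computes both for a few made-up sequences. It is only a toy feature extractor with hypothetical contig names and is not part of PlasBin-flow, whose mixed integer linear programming model is not reproduced here.

```python
def gc_content(sequence):
    """Fraction of G and C bases among the unambiguous (A, C, G, T) bases."""
    seq = sequence.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0  # no unambiguous bases, e.g. a run of N
    return (seq.count("G") + seq.count("C")) / acgt


# Hypothetical contigs; real contigs would be read from an assembly file.
contigs = {
    "contig_1": "ATGCGCGCTTAAGGCC",
    "contig_2": "ATATATTTAAATAT",
    "contig_3": "GGCCNNNNGGCC",
}

for name, seq in contigs.items():
    print(f"{name}: length={len(seq)}, GC={gc_content(seq):.2f}")
```

Features of this kind could then be compared between contigs, since GC content often differs between plasmids and chromosomes.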
Presentation Overview: Show
Metagenome binning groups contigs assembled from metagenomic samples by their genome of origin, resulting in 'metagenome-assembled genomes' (MAGs). High-quality MAGs are essential for most downstream analyses of microbiomes. We present McDevol, a metagenomic binner which applies a novel Bayesian distance measure using Poisson statistics, agglomerative clustering, and refinement through density-based clustering. Results on a CAMI II dataset demonstrate that McDevol performs better on almost all AMBER quality scores and provides more high-quality MAGs than the widely used binners MetaBAT2, MaxBin2, and CONCOCT. McDevol is fast and memory-efficient, making it a suitable tool for binning contigs of large metagenomic datasets.
Presentation Overview: Show
The soil microbiome establishes a complex network of interactions with its environment, which play a significant role in shaping microbial community structure. We propose a Deep Learning approach to exploit these microbiome-to-environment connections and to predict soil microbiome profiles through physicochemical and climatic metadata, without the need for sequencing technologies. Our methodology is based on a heterogeneous autoencoder adapted to an extensive database of corn and vineyard samples distributed worldwide and containing bacterial and fungal taxa. The model cross-validation was implemented under a stratified sampling schema to account for the spatial component, and an extensive environmental analysis served to identify the best combination of features to feed the model and depict microbial composition. The optimized system achieved reliable results in a first training phase (>0.70 Pearson correlation, <0.40 Bray-Curtis dissimilarity) and displayed remarkable performance in transfer learning (0.20-0.30 increase in Pearson correlation, 0.60 reduction in Bray-Curtis dissimilarity) with respect to baseline models. This technology has the potential to expand and automate the application of microbiome-based solutions in precision agriculture, enabling their implementation across a broader range of spatial and environmental contexts.
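For readers unfamiliar with the two evaluation metrics quoted above, the following sketch computes the Pearson correlation and the Bray-Curtis dissimilarity between an observed and a predicted abundance profile. The profiles are made-up numbers, and the function names are illustrative assumptions rather than part of the described model.

```python
import math


def bray_curtis(x, y):
    """Bray-Curtis dissimilarity: sum|x_i - y_i| / sum(x_i + y_i)."""
    numerator = sum(abs(a - b) for a, b in zip(x, y))
    denominator = sum(a + b for a, b in zip(x, y))
    return numerator / denominator if denominator else 0.0


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / (sd_x * sd_y)


# Hypothetical relative abundances of four taxa in one sample.
observed = [0.5, 0.3, 0.2, 0.0]
predicted = [0.4, 0.3, 0.2, 0.1]

print(f"Bray-Curtis dissimilarity: {bray_curtis(observed, predicted):.3f}")
print(f"Pearson correlation:       {pearson(observed, predicted):.3f}")
```

Bray-Curtis ranges from 0 (identical profiles) to 1 (no shared taxa), so lower values indicate better predictions, whereas higher Pearson correlations do.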
Presentation Overview: Show
Sequence comparison algorithms for metagenome-assembled genomes (MAGs) often have difficulties dealing with data that is high-volume or low-quality. We present skani (https://github.com/bluenote-1577/skani), a method for calculating average nucleotide identity (ANI) using sparse approximate alignments. skani is more accurate than FastANI for comparing incomplete, fragmented MAGs while also being > 20 times faster. For searching a database of > 65,000 prokaryotic genomes, skani takes only seconds per query and 6 GB of memory. skani is a versatile tool that unlocks higher-resolution insights for larger, noisier metagenomic data sets.
Presentation Overview: Show
The Sequence Read Archive is the central repository for genomics experiments and a treasure trove of over 70 petabases of sequence data. However, its massive size presents a significant challenge to traditional search methods. Bloom-filter and sketching-based methods have been proposed as scalable alternatives, but their sensitivity is limited.
We present Petasearch, a tool for quickly and accurately searching protein sequences within large databases. Petasearch's algorithm involves three stages: First, the sequences in the database are pre-processed, sorted, and stored in a compressed k-mer index. Then, similar query k-mers are extracted and matched with database k-mers, filtering out non-homologous sequences early. Finally, high-scoring k-mer matches are aligned with a SIMD-accelerated banded Smith-Waterman.
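The first two stages can be sketched in a heavily simplified form. The code below is an assumption-laden toy and not Petasearch's implementation: it uses an in-memory dictionary instead of a compressed, sorted k-mer index, matches only exact k-mers rather than similar ones, and omits the final alignment stage. All function names and sequences are hypothetical.

```python
from collections import defaultdict


def kmers(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(database, k):
    """Stage 1 (simplified): map each k-mer to the sequences containing it."""
    index = defaultdict(set)
    for name, seq in database.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)
    return index


def prefilter(query, index, k, min_hits=2):
    """Stage 2 (simplified): keep targets sharing at least min_hits k-mers."""
    hits = defaultdict(int)
    for kmer in set(kmers(query, k)):
        for name in index.get(kmer, ()):
            hits[name] += 1
    return {name: n for name, n in hits.items() if n >= min_hits}


# Hypothetical protein sequences.
database = {
    "protA": "MKTAYIAKQRQISFVKSHFSRQ",
    "protB": "GSHMLEDPVRRGGAAW",
}
index = build_index(database, k=3)
candidates = prefilter("TAYIAKQRQ", index, k=3)
print(candidates)  # only candidates would be passed on to alignment
```

The point of the prefilter is that sequences sharing too few k-mers with the query are discarded before any costly alignment is attempted.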
We optimize Petasearch using modern CPU caching and prefetching, advanced Linux IO techniques, and high read-bandwidth NVMe-SSDs. Across 21 NVMe-SSDs, Petasearch is 15 and 145 times faster than current search algorithms for a 450GB and 9.3TB dataset, respectively. Petasearch maintains comparable sensitivity to state-of-the-art algorithms, detecting sequence identities as low as 60%, and identifying homology using its profile-search down to 40% in a SCOP25 benchmark.
Petasearch is available at petasearch.mmseqs.com as free open-source software for analysis and comparison of protein sequences at scale.
Presentation Overview: Show
In microbiome analysis, one main approach is to align metagenomic sequencing reads against a protein reference database, such as NCBI-nr, and then to perform taxonomic and functional binning based on the alignments. This approach is embodied, for example, in the standard DIAMOND+MEGAN analysis pipeline, which first aligns reads against NCBI-nr using DIAMOND and then performs taxonomic and functional binning using MEGAN. Here, we propose the use of the AnnoTree protein database, rather than NCBI-nr, in such alignment-based analyses to determine the prokaryotic content of metagenomic samples. We demonstrate a 2-fold speedup over the usage of the prokaryotic part of NCBI-nr and increased assignment rates, in particular assigning twice as many reads to KEGG. In addition to binning to the NCBI taxonomy, MEGAN now also bins to the GTDB taxonomy.
IMPORTANCE The NCBI-nr database is not explicitly designed for the purpose of microbiome analysis, and its increasing size makes it unwieldy and computationally expensive for this purpose. The AnnoTree protein database is only one-quarter the size of the full NCBI-nr database and is explicitly designed for metagenomic analysis, so it should be supported by alignment-based pipelines.
Presentation Overview: Show
The drop in sequencing costs, coupled with the rise of initiatives such as the Human Microbiome Project, has resulted in an increasing amount of multi-omics data available in public repositories. As microbiome analysis methodologies lack standardization, it is only through the integration of multiple datasets that we can capture the complexity of the microbiome by eliminating the biases associated with individual studies. For this reason, the accessibility and reusability of data and related metadata are the first pivotal points for current microbiome research.
Here, we present MADAME (MetADAta MicrobiomE), an open-source and easy-to-use bioinformatic tool to facilitate and automate the retrieval and download of metadata and data from the European Nucleotide Archive (ENA). Moreover, MADAME allows users to search for related publications and to visualize the downloaded information by generating a report with graphs and statistics. We applied MADAME to the specific case study of the skin microbiome, exploring the available metadata for a total of 33 projects and 7162 samples.
In conclusion, MADAME can easily provide data and all the related information for downstream analysis, with the ultimate aim of contributing to advances in microbiome research by facilitating the integration of multiple datasets.