A rigorous benchmarking of methods for SARS-CoV-2 lineage abundance estimation in wastewater
Confirmed Presenter: Victor Gordeev, Department of Computers, Informatics, and Microelectronics, Technical University of Moldova, Moldova
Room: 520c
Format: Live Stream
Moderator(s): Mihai Pop
Authors List:
- Victor Gordeev, Department of Computers, Informatics, and Microelectronics, Technical University of Moldova, Moldova
- Viorel Munteanu, Department of Computers, Informatics, and Microelectronics, Technical University of Moldova, Moldova
- Shelesh Agrawal, Department of Civil and Environmental Engineering Sciences, Technical University of Darmstadt, Germany
- Martin Hölzer, Genome Competence Center (MF1), Method Development and Research Infrastructure, Robert Koch Institute, Germany
- Adam Smith, Astani Department of Civil and Environmental Engineering, University of Southern California, United States
- Dumitru Ciorba, Department of Computers, Informatics, and Microelectronics, Technical University of Moldova, Moldova
- Serghei Mangul, Titus Family Department of Clinical Pharmacy, University of Southern California, United States
Presentation Overview:
Wastewater genomic surveillance of SARS-CoV-2 has emerged as a scalable, cost-effective, passive tool for monitoring viral variants circulating in the human population. However, accurate estimation of viral lineage prevalence in communities depends on the performance of the computational methods used to analyze wastewater sequencing data. We perform a comprehensive benchmark of bioinformatics methods designed for estimating the relative abundance of SARS-CoV-2 (sub)lineages from wastewater sequencing data, along with RNA-Seq and metagenomics methods repurposed for this task. We systematically compare the accuracy of these methods in estimating the relative abundances of the (sub)lineages present in a sample, including closely related and low-abundance (sub)lineages. Our preliminary results, based on simulated data and a small set of methods, show that the RNA-Seq methods RSEM (the most accurate), Kallisto, and Salmon consistently achieve lower L1 errors for lineages, and particularly for sublineages, than the wastewater-surveillance methods Alcov and PiGx. In particular, while the distribution of absolute errors for lineages is similar across methods, for sublineages roughly 80% of the absolute error values for RSEM, Kallisto, and Salmon are below 0.04%, compared to roughly 75% for Alcov and only 25% for PiGx. In addition to extensive simulated data, we will use in vitro mixtures of (sub)lineages of varying complexity, prepared from synthetic RNA genomes or inactivated viral particles and sequenced using short-read and long-read technologies. Using different experimental strategies, we will also investigate how the performance of these methods is affected by the wastewater matrix or wastewater nucleic acid background, as well as by the design of the sequencing experiment.
Our study will inform the selection of the most accurate, robust, and sensitive methods for SARS-CoV-2 lineage prevalence estimation to enable effective wastewater-based genomic surveillance.
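The error metrics mentioned in this abstract can be illustrated with a minimal sketch. It assumes abundances are given as fractions that sum to 1 per sample, and that a lineage missing from either profile has abundance 0. The function names and the example values are hypothetical and are not taken from the study.

```python
from typing import Dict, List


def absolute_errors(truth: Dict[str, float], estimate: Dict[str, float]) -> List[float]:
    """Per-lineage absolute error over the union of lineages in both profiles.

    Assumption: a lineage absent from a profile has abundance 0.
    """
    lineages = sorted(set(truth) | set(estimate))
    return [abs(truth.get(name, 0.0) - estimate.get(name, 0.0)) for name in lineages]


def l1_error(truth: Dict[str, float], estimate: Dict[str, float]) -> float:
    """L1 error: the sum of per-lineage absolute errors."""
    return sum(absolute_errors(truth, estimate))


def fraction_below(truth: Dict[str, float], estimate: Dict[str, float], threshold: float) -> float:
    """Share of per-lineage absolute errors that fall strictly below a threshold."""
    errors = absolute_errors(truth, estimate)
    return sum(e < threshold for e in errors) / len(errors)


# Made-up example profiles (fractions, not percentages).
example_truth = {"BA.1": 0.5, "BA.2": 0.3, "B.1.617.2": 0.2}
example_estimate = {"BA.1": 0.45, "BA.2": 0.35, "B.1.617.2": 0.15, "XBB.1.5": 0.05}

if __name__ == "__main__":
    print("L1 error:", l1_error(example_truth, example_estimate))
    print("Share of errors below 0.1:", fraction_below(example_truth, example_estimate, 0.1))
```

Taking the union of lineages means that a false-positive call (a lineage reported but not present) contributes to the error, which matters when comparing methods on closely related sublineages.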
MetaViz: Realistic assortment of novel metagenomics benchmarks with diverse biological and technological characteristics
Confirmed Presenter: Nitesh Kumar Sharma, Department of Clinical Pharmacy, University of Southern California, Los Angeles, CA 90089, United States
Room: 520c
Format: Live Stream
Moderator(s): Mihai Pop
Authors List:
- Nitesh Kumar Sharma, Department of Clinical Pharmacy, University of Southern California, Los Angeles, CA 90089, United States
- Karishma Chhugani, Department of Clinical Pharmacy, University of Southern California, Los Angeles, CA 90089, United States
- Viorel Munteanu, Department of Computers, Informatics and Microelectronics, Technical University of Moldova, Chisinau, 2045, Moldova
- Pavel Skums, School of Computing, University of Connecticut, 371 Fairfield Way, Storrs, CT 06269, United States
- Alex Zelikovsky, Department of Computer Science, Georgia State University, Atlanta, GA, United States
- Serghei Mangul, Department of Clinical Pharmacy, University of Southern California, Los Angeles, CA 90089, United States
Presentation Overview:
Metagenomics research relies heavily on bioinformatics methods for analyzing complex microbial communities, and these methods require rigorous validation through benchmarking. However, creating high-quality experimental benchmarks can be costly and challenging. Current benchmarking efforts often rely on a limited number of gold-standard samples or on synthetic data, which hinders comprehensive evaluation. To address this, we propose MetaViz, a tool for generating novel semi-real metagenomics benchmarks through in silico modification of existing experimental data. MetaViz offers a cost-effective alternative that combines elements of real data with simulated modifications, overcoming the limitations of purely simulated datasets. Our tool allows precise control over sample composition, diversity, and technological characteristics, enhancing benchmarking accuracy and applicability. We applied MetaViz to more than 27 real metagenomics benchmarks, including in vitro viral mock communities and intra-host clinical samples. The tool allowed us to precisely control the composition and abundance of microbial genomes in the in vitro mixtures (mock communities), and to adjust their relative abundances across frequencies ranging from 0.1% to 10%. Leveraging reference mapping, we introduced varying errors into the read data, thereby enhancing reliability. Our method introduces a novel approach to benchmarking in metagenomics that is particularly valuable where creating traditional gold standards is impractical. By capturing the complexity of actual datasets, MetaViz produces semi-real benchmarks that encompass a broader range of clinical and technological characteristics, ultimately making benchmarking more comprehensive. Adoption of our approach promises to significantly improve the robustness and accuracy of benchmarking studies, advancing our understanding of microbial communities across diverse biological contexts.
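One way to picture the abundance adjustment described in this abstract is read subsampling. The sketch below is an assumption about how such a step could work, not MetaViz's actual implementation: it presumes reads have already been grouped by source genome (for example through reference mapping) and draws a subset so that read counts match target relative abundances. The function name `resample_to_target` and its parameters are hypothetical.

```python
import random
from typing import Dict, List


def resample_to_target(
    reads_by_genome: Dict[str, List[str]],
    target: Dict[str, float],
    seed: int = 0,
) -> Dict[str, List[str]]:
    """Subsample reads so per-genome read counts follow target relative abundances.

    Assumptions (illustrative only): every genome in `target` has reads in
    `reads_by_genome`, and abundance is approximated by read count.
    """
    total = sum(target.values())
    if total <= 0:
        raise ValueError("target abundances must sum to a positive value")
    fractions = {genome: value / total for genome, value in target.items()}

    # Largest sample size for which no genome needs more reads than it has.
    n_total = min(
        len(reads_by_genome[genome]) / frac
        for genome, frac in fractions.items()
        if frac > 0
    )

    rng = random.Random(seed)  # fixed seed keeps the benchmark reproducible
    sampled = {}
    for genome, frac in fractions.items():
        available = reads_by_genome[genome]
        k = min(int(n_total * frac), len(available))
        sampled[genome] = rng.sample(available, k)  # sampling without replacement
    return sampled
```

For example, with 1,000 reads available for each of two genomes and targets of 0.9 and 0.1, the sketch keeps nearly all reads of the first genome and about 111 of the second. Sampling without replacement keeps every output read a real experimental read, which is the point of a semi-real benchmark.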
Phage Host Prediction Using Novel Global-Scale Phage-Host Interaction Atlas and Genomic Language Models
Confirmed Presenter: Jonas Grove, Phase Genomics, United States
Room: 520c
Format: In Person
Moderator(s): Mihai Pop
Authors List:
- Jonas Grove, Phase Genomics, United States
- Samuel Bryson, Phase Genomics, United States
- Benjamin Auch, Phase Genomics, United States
- Bradley Nelson, Phase Genomics, United States
- Cristiana Carpinteiro, Loka, Portugal
- Zach Sisson, Phase Genomics, United States
- Demi Glidden, Phase Genomics, United States
- Emily Reister, Phase Genomics, United States
- Ivan Liachko, Phase Genomics, United States
Presentation Overview:
Viruses, including bacteriophages and archaeal viruses, are the most abundant form of life on earth (10^31), interacting with all life and shaping the global ecosystem. However, phage-host relationships have proven challenging to identify without culture-based experiments to generate unambiguous evidence for a phage’s presence in a given host. These experiments inherently require that all hosts are culturable, restricting the microbial diversity that can be surveyed.
Proximity ligation sequencing is a powerful metagenomic method for associating viruses with their hosts directly in native microbial communities. Proximity ligation captures, in vivo, physical interactions between the host microbial genome and the genetic material of both lytic and lysogenic phages. These linkages offer direct evidence that phage sequences were present within an intact host cell, establishing a phage-host pair without the propagation of living bacterial cells. The combination of intra-phage and phage-host signal enables us to simultaneously deconvolve viral and microbial genomes directly from metagenomes, and to assign microbial hosts to large numbers of viruses without culturing.
Our application of this technology to thousands of complex microbiome samples has yielded host assignments for hundreds of thousands of novel phages and archaeal viruses. Using this expanded phage-host interaction training data, and building on advances in natural language processing (NLP) and genomic large language models (LLMs), we have developed deep learning networks that model the dynamics between phages and microbial hosts at sequence-level resolution. We will report published and unpublished work highlighting the power of this approach in the field of metagenomic discovery.
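The idea of assigning hosts from proximity-ligation linkages can be sketched as a simple counting problem. This is a simplified illustration under stated assumptions, not the pipeline used by the authors: it assumes each linkage is a pair of contig names, that viral contigs and host genome bins are already known and disjoint, and that a fixed minimum link count is an acceptable evidence threshold. All names (`assign_hosts`, `min_links`, the bin labels) are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Set, Tuple


def assign_hosts(
    link_pairs: Iterable[Tuple[str, str]],
    viral_contigs: Set[str],
    contig_to_bin: Dict[str, str],
    min_links: int = 5,
) -> Dict[str, Tuple[str, int]]:
    """Assign each viral contig to the host bin with the most linking read pairs.

    Returns a mapping from viral contig to (host bin, supporting link count).
    Viral contigs with fewer than `min_links` links to their best bin are omitted.
    """
    counts = defaultdict(Counter)
    for left, right in link_pairs:
        # A pair can list the viral contig on either side, so check both orders.
        for virus, other in ((left, right), (right, left)):
            if virus in viral_contigs and other in contig_to_bin:
                counts[virus][contig_to_bin[other]] += 1

    hosts = {}
    for virus, per_bin in counts.items():
        best_bin, n_links = per_bin.most_common(1)[0]
        if n_links >= min_links:
            hosts[virus] = (best_bin, n_links)
    return hosts
```

A real analysis would likely normalize link counts by contig length, coverage, and background ligation noise before calling a host; the raw-count threshold here only keeps the sketch short.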