Refinement of SARS-CoV-2 Intra-host Mutations Using Explainable Representations
Confirmed Presenter: Fatima Mostefai, Université de Montréal; Montreal Heart Institute, Canada
Track: iRNA
Room: 519
Format: In Person
Moderator(s): Julia Salzman
Authors List:
- Fatima Mostefai, Université de Montréal; Montreal Heart Institute
- Jean-Christophe Grenier, Montreal Heart Institute
- Raphaël Poujol, Montreal Heart Institute
- Julie Hussin, Université de Montréal; Montreal Heart Institute
Presentation Overview:
SARS-CoV-2, an RNA virus, has evolved into multiple variants by accumulating mutations during transmission (inter-host) and infection (intra-host). De novo mutations arise in viral genomes during infection, and analyzing these mutations in sequencing data may predict emerging variants. Intra-host single nucleotide variants (iSNVs) can be identified by analyzing RNA sequencing (RNA-seq) reads from infections. However, sequencing artifacts introduced during the RNA-seq process can result in erroneous iSNVs. We aim to identify true intra-host mutations from viral RNA-seq data and propose metrics to refine RNA-seq analysis.
We developed a two-step workflow to isolate de novo iSNVs from a SARS-CoV-2 RNA-seq dataset. First, we processed the RNA-seq libraries, using whole-genome quality control to ensure high-quality library preparation. Second, we called iSNVs on these libraries, using metrics such as Alternative Allele Frequency (AAF) and Strand Bias Likelihood (S) to distinguish iSNVs from sequencing artifacts. We also used dimensionality reduction representations, such as PHATE and t-SNE, complemented with an explainability metric, to visualize and analyze library structures.
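To make the filtering step concrete, here is a minimal Python sketch of how per-site read counts could be turned into an AAF value and a strand-balance check. It is an illustration under stated assumptions, not the workflow's actual implementation: the abstract does not define the S metric, so an exact two-sided binomial test on the strand split of alternate-allele reads stands in for it, and the function names and thresholds (`min_aaf`, `min_p`) are hypothetical.

```python
from math import comb


def alt_allele_frequency(ref_count: int, alt_count: int) -> float:
    """Fraction of reads at a site that carry the alternate allele."""
    depth = ref_count + alt_count
    if depth == 0:
        raise ValueError("no reads cover this position")
    return alt_count / depth


def strand_balance_pvalue(alt_fwd: int, alt_rev: int, p_fwd: float = 0.5) -> float:
    """Exact two-sided binomial test on the strand split of alternate reads.

    Stand-in for a strand bias metric (assumption: NOT the abstract's S).
    A small p-value means the alternate allele sits almost entirely on one
    strand, which is a common signature of a sequencing artifact.
    """
    n = alt_fwd + alt_rev
    if n == 0:
        raise ValueError("no alternate-allele reads at this position")
    pmf = [comb(n, k) * p_fwd**k * (1 - p_fwd) ** (n - k) for k in range(n + 1)]
    observed = pmf[alt_fwd]
    # Sum the probability of every outcome at most as likely as the observed one.
    total = sum(p for p in pmf if p <= observed * (1 + 1e-9))
    return min(1.0, total)


def keep_candidate(ref_count: int, alt_fwd: int, alt_rev: int,
                   min_aaf: float = 0.05, min_p: float = 0.01) -> bool:
    """Keep a candidate iSNV if it is frequent enough and strand-balanced.

    Both thresholds are hypothetical placeholders chosen for illustration.
    """
    aaf = alt_allele_frequency(ref_count, alt_fwd + alt_rev)
    return aaf >= min_aaf and strand_balance_pvalue(alt_fwd, alt_rev) >= min_p
```

For example, `keep_candidate(90, 5, 5)` keeps a site whose 10 alternate reads are split evenly across strands, while `keep_candidate(90, 10, 0)` rejects a site with the same AAF whose alternate reads all come from the forward strand.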
We applied our workflow to a comprehensive SARS-CoV-2 RNA-seq dataset, distinguishing between de novo and consensus iSNVs, which is crucial for understanding viral intra-host evolution. We identified batch effects from sequencing centers and refined the AAF and S metrics for artifact resolution. Analyzing libraries from 2020 to 2023, we observed low intra-host diversity per infection, significant diversity in the spike gene, and strong purifying selection. This workflow enhances the precision and depth of RNA-seq and viral genomic analyses, advancing studies in RNA viruses.