Bacteriophages, the viruses that infect prokaryotes, are the most abundant biological entities on Earth and play fundamental roles in shaping microbial communities. Despite their ubiquity, the vast majority remain uncharacterized, constituting a significant fraction of unidentified sequences in metagenomic datasets. While deep learning-based tools have improved viral sequence identification, they often suffer from high false positive rates when analyzing divergent sequences [1]. To address this challenge, we introduce Jaeger, a homology-free deep learning framework designed to identify bacteriophage genome fragments from metagenome-assembled contigs.
Jaeger leverages a convolutional neural network (CNN) with dilated convolutions and six-frame amino acid parameter sharing to directly recognize protein-level signatures from six-frame translated nucleotide sequences. The model is trained to classify short nucleotide fragments into one of four categories: bacteria, archaea, eukaryote, and phage. For longer sequences, a sliding window approach aggregates predictions across multiple non-overlapping fragments to determine the final classification.
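To make the input representation and the windowing step concrete, the following is a minimal sketch under stated assumptions, not Jaeger's actual code. It translates a nucleotide fragment in all six reading frames with the standard genetic code, cuts a longer contig into non-overlapping windows, and combines per-window labels. The function names, the handling of a short trailing window (dropped here), and the majority-vote aggregation rule are all illustrative assumptions; the abstract does not specify how window predictions are combined.

```python
from collections import Counter

# Standard genetic code, codons enumerated in TCAG x TCAG x TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def six_frame_translate(seq):
    """Return six amino acid strings: 3 forward frames, then 3 reverse-complement frames."""
    seq = seq.upper()
    rev = seq.translate(COMPLEMENT)[::-1]
    frames = []
    for strand in (seq, rev):
        for offset in range(3):
            codons = (strand[i:i + 3] for i in range(offset, len(strand) - 2, 3))
            # Codons with ambiguous bases (e.g. N) map to "X".
            frames.append("".join(CODON_TABLE.get(c, "X") for c in codons))
    return frames


def split_windows(seq, size):
    """Cut seq into consecutive non-overlapping windows; a short tail is dropped (assumption)."""
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, size)]


def aggregate(window_labels):
    """Majority vote over per-window labels (hypothetical aggregation rule)."""
    return Counter(window_labels).most_common(1)[0][0]
```

In a real pipeline, each window would be translated with `six_frame_translate`, encoded, and passed to the trained network. The per-window class labels would then be combined by something like `aggregate`.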
While neural networks are highly sensitive, they can generate spurious predictions when encountering sequences that deviate significantly from the training distribution. To mitigate this, we incorporated an auxiliary model based on neural mean discrepancy [2], termed the reliability model, to detect out-of-distribution samples at deployment, further improving performance.
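The idea behind neural mean discrepancy can be illustrated with a short sketch. It compares the channel-wise mean activations of an input against the corresponding means observed on the training data, and treats a large difference as a sign that the input is out of distribution. This is a simplified illustration, not Jaeger's reliability model: the function names are hypothetical, the activations are assumed to be given as one list of values per channel, and the Euclidean norm with a fixed threshold stands in for the trained detector that would normally consume the discrepancy vector.

```python
from statistics import fmean


def channel_means(activations):
    """Average each channel's activations over positions.

    `activations` is a list of channels, each a list of floats.
    """
    return [fmean(channel) for channel in activations]


def nmd_vector(activations, training_means):
    """Per-channel mean activation minus the training-set mean for that channel."""
    return [m - t for m, t in zip(channel_means(activations), training_means)]


def ood_score(nmd):
    """Euclidean norm of the discrepancy vector (stand-in for a trained detector)."""
    return sum(d * d for d in nmd) ** 0.5


def is_reliable(activations, training_means, threshold):
    """Flag a prediction as reliable when the discrepancy stays under a chosen threshold."""
    return ood_score(nmd_vector(activations, training_means)) <= threshold
```

Under this sketch, a prediction whose score exceeds the threshold would be flagged as unreliable rather than reported as a confident class call.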
Extensive benchmarking on the IMG/VR [3] database and real-world metagenomes shows that Jaeger achieves consistently high sensitivity (0.87) and precision (0.92). Compared to state-of-the-art tools such as VirSorter2 and geNomad [4], Jaeger achieves similar classification accuracy while offering substantial computational speed improvements, running up to 20 times faster in CPU mode and 140 times faster with GPU acceleration. Its scalability allows it to process vast metagenomic datasets efficiently.
Application of Jaeger to approximately 16,000 metagenomic assemblies from the MGnify [5] database identified over five million putative phage contigs, highlighting its potential for uncovering hidden viral diversity. Additionally, Jaeger effectively identifies prophages and distinguishes viral sequences from bacterial, archaeal, and eukaryotic sequences. By integrating deep learning with reliability assessment, Jaeger enhances the robustness of viral sequence identification, making it a powerful tool for large-scale metagenomic studies.
Jaeger is open-source, easy to install, and supports GPU acceleration, making it accessible for large-scale analyses. Its ability to accurately and efficiently classify bacteriophage sequences will aid in uncovering viral diversity and advancing microbial ecology research.
Availability:
Code: https://github.com/MGXlab/Jaeger
Preprint: https://www.biorxiv.org/content/10.1101/2024.09.24.612722v1
Bibliography
1. Wu, L.-Y. et al. Benchmarking bioinformatic virus identification tools using real-world metagenomic data across biomes. Genome Biol. 25, 97 (2024).
2. Dong, X. et al. Neural Mean Discrepancy for Efficient Out-of-Distribution Detection. arXiv (2021) doi:10.48550/arxiv.2104.11408.
3. Camargo, A. P. et al. IMG/VR v4: an expanded database of uncultivated virus genomes within a framework of extensive functional, taxonomic, and ecological metadata. Nucleic Acids Res. 51, D733–D743 (2023).
4. Camargo, A. P. et al. Identification of mobile genetic elements with geNomad. Nat. Biotechnol. 42, 1303–1312 (2024).
5. Richardson, L. et al. MGnify: the microbiome sequence data analysis resource in 2023. Nucleic Acids Res. 51, D753–D759 (2023).