The European Nucleotide Archive (ENA) at EMBL-EBI is one of the largest and longest-standing public databases and, together with partner resources at NCBI and DDBJ, forms the International Nucleotide Sequence Database Collaboration. ENA serves the bioinformatics community worldwide through the submission, processing, archiving and distribution of sequencing data. Supported data types range from raw reads, through alignments and assemblies, to functional annotation, enriched with contextual information relating to samples and experimental configurations. Since its inception in the early 1980s, ENA has evolved into a comprehensive sequence data coordination platform with applications to specific scientific domains. Since 2020, EMBL-EBI has created and maintained services that build on the ENA data coordination platform, such as the Pathogens Portal and the COVID-19 Data Portal; ENA has helped these initiatives analyse millions of SARS-CoV-2 genomes systematically.
The volume of data in ENA has grown steadily, at around 35% year over year for the last five years, and is projected to exceed 150 PB by 2027. To sustain ENA in the face of ever-increasing sequencing data volumes, we present three technical goals:
1. To increase operational performance in the most valuable areas
2. To improve and simplify systems to ensure sustainability
3. To tailor ENA services to fit modern sequencing analysis workflows
In support of these goals, we are taking steps to make ENA serve the bioinformatics community more efficiently, and we are undertaking a programme of user outreach to establish closer collaboration with our user community and to better understand the data generation and analytical workflows that make use of ENA services. These include workflows that users have built around ENA, such as pipelines for SARS-CoV-2 analysis and metagenomic sequence assembly.
In addition to highlighting the technical challenges of operating at such a large scale, we present ENA services in the context of two specific analysis workflows, illustrating how ENA can be used as a data management platform to support scientific drivers. In particular, we use the Pathogen Analysis System to showcase some of the more familiar ENA features and to show how they can be applied to analyse millions of SARS-CoV-2 genomes in a streaming mode. We also describe how metagenomics workflows can be integrated with ENA.
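As a rough illustration of what consuming ENA records in a streaming fashion can look like on the client side, the sketch below builds a search request and reads the tab-separated response one row at a time instead of loading it whole. It is not the Pathogen Analysis System itself. It assumes the ENA Portal API search endpoint with the `read_run` result type and the `run_accession` and `fastq_ftp` fields; the function names are hypothetical, and the endpoint, field names and query syntax should be checked against the current ENA documentation.

```python
import csv
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the ENA Portal API search service.
PORTAL_SEARCH = "https://www.ebi.ac.uk/ena/portal/api/search"


def build_search_url(taxon_id, fields, limit=10):
    """Build a search URL for read runs under a taxon (hypothetical helper)."""
    params = {
        "result": "read_run",
        "query": f"tax_tree({taxon_id})",
        "fields": ",".join(fields),
        "format": "tsv",
        "limit": limit,
    }
    return f"{PORTAL_SEARCH}?{urlencode(params)}"


def iter_records(lines):
    """Lazily yield one dict per TSV row from an iterable of text lines."""
    reader = csv.DictReader(lines, delimiter="\t")
    for row in reader:
        yield row


def stream_runs(taxon_id, fields, limit=10):
    """Fetch matching runs and yield them row by row as they arrive."""
    url = build_search_url(taxon_id, fields, limit)
    with urlopen(url) as response:
        # Decode each line as it is read, so memory use stays flat.
        text_lines = (raw.decode("utf-8") for raw in response)
        yield from iter_records(text_lines)
```

For example, `stream_runs(2697049, ["run_accession", "fastq_ftp"])` would iterate over runs under NCBI taxon 2697049 (SARS-CoV-2), and each yielded record could be handed to a downstream analysis step without waiting for the full result set.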
1. Burgin, Josephine et al. “The European Nucleotide Archive in 2022.” Nucleic Acids Research vol. 51, D1 (2023): D121-D125. https://doi.org/10.1093/nar/gkac1051
2. Rahman, Nadim et al. “Mobilisation and analyses of publicly available SARS-CoV-2 data for pandemic responses.” bioRxiv (2023). https://www.biorxiv.org/content/10.1101/2023.04.19.537514v2