Over the past two decades, public repositories like GEO and SRA have accumulated vast omics datasets, prompting a crucial discussion about secondary data analysis. Access to this data is vital for reproducibility, novel experiments, meta-analyses, and new discoveries. However, the extent of reuse, and the factors that influence it, have remained unclear.
A large-scale study analyzed over six million open-access publications from 2001 to 2025 to quantify reuse patterns and identify influencing factors. The analysis identified 213,213 omics-based publications, approximately 65% of which were based on secondary analysis, marking a significant shift. Since 2015, studies reusing existing gene expression data, particularly microarray data, have increasingly outnumbered those generating new data. Despite this trend, a large portion of datasets, especially RNA-seq, remain underutilized: over 72% of RNA-seq datasets in GEO and SRA have not been reused even once.
Reusability varies by data type; microarray data shows the highest average Reusability Index (RI), while RNA-seq and other sequencing data have lower RIs. Human datasets consistently exhibit higher reusability than non-human ones.
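The source does not give the formal definitions of RI and NRI, but the idea of a per-dataset reuse metric can be illustrated with a minimal sketch. Here RI is assumed to be the average number of reusing publications per dataset within a data type, and NRI is assumed to additionally normalize by how long each dataset has been publicly available; both definitions, and all numbers, are hypothetical.

```python
from statistics import mean

def reusability_index(reuse_counts):
    """Assumed RI: average number of reuses per dataset for one data type."""
    return mean(reuse_counts)

def normalized_reusability_index(reuse_counts, years_public):
    """Assumed NRI: average reuse rate adjusted for time since deposit."""
    return mean(c / y for c, y in zip(reuse_counts, years_public))

# Illustrative (made-up) reuse counts for datasets of two data types.
microarray = [4, 0, 7, 2, 5]
rna_seq = [0, 1, 0, 3, 0]

print(reusability_index(microarray))  # 3.6
print(reusability_index(rna_seq))     # 0.8
```

Under these toy numbers the microarray collection scores a higher RI, mirroring the pattern reported in the study; a time-normalized NRI would additionally avoid penalizing recently deposited datasets that have had less opportunity to be reused.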
Significant barriers to reuse persist, including incomplete metadata, lack of standardization, and the complexity of raw data formats. Many researchers also lack the necessary computational tools or expertise. The study proposes several remedies: enforcing metadata standards, integrating automated data-processing tools into repositories, formally recognizing data contributions with metrics such as the RI and Normalized Reusability Index (NRI), and incentivizing reuse through journals and funding agencies. Addressing these challenges is crucial to unlocking the full potential of existing omics data.