Reusability of Public Omics Data Across 6 Million Publications
Confirmed Presenter: Serghei Mangul, Stefan cel Mare University of Suceava, Romania
Room: 01C
Format: In person
Moderator(s): Wenzhong Xiao
Authors List:
- Serghei Mangul, Stefan cel Mare University of Suceava, Romania
- Viorel Munteanu, Technical University of Moldova, Moldova
- Dumitru Ciorbă, Technical University of Moldova, Moldova
- Viorel Bostan, UTM, Moldova
- Mihai Dimian, Stefan cel Mare University of Suceava, Romania
- Nicolae Drabcinski, Technical University of Moldova, Chisinau, Moldova
Presentation Overview:
Over the past two decades, public repositories like GEO and SRA have accumulated vast omics datasets, sparking a crucial discussion on secondary data analysis. Access to this data is vital for reproducibility, novel experiments, meta-analyses, and new discoveries. However, the extent and factors influencing reuse have been unclear.
A large-scale study analyzed over six million open-access publications from 2001 to 2025 to quantify reuse patterns and identify influencing factors. The analysis identified 213,213 omics-based publications, approximately 65% of which were based on secondary analysis, a marked shift toward data reuse. Since 2015, studies reusing existing gene expression data, particularly microarray data, have increasingly outnumbered those generating new data. Despite this, a large portion of datasets, especially RNA-seq, remains underutilized: over 72% of RNA-seq datasets in GEO and SRA have not been reused even once.
Reusability varies by data type; microarray data shows the highest average Reusability Index (RI), while RNA-seq and other sequencing data have lower RIs. Human datasets consistently exhibit higher reusability than non-human ones.
Significant barriers to reuse persist, including incomplete metadata, lack of standardization, and the complexity of raw data formats. Many researchers also lack the necessary computational tools or expertise. The study proposes solutions: enforcing metadata standards, integrating automated data processing tools into repositories, formally recognizing data contributions with metrics like RI and Normalized Reusability Index (NRI), and incentivizing reuse through journals and funding agencies. Addressing these challenges is crucial to unlock the full potential of existing omics data.
Pre-publication sharing of omics data improves paper citations
Confirmed Presenter: Serghei Mangul, Stefan cel Mare University of Suceava, Romania
Room: 01C
Format: In person
Moderator(s): Wenzhong Xiao
Authors List:
- Serghei Mangul, Stefan cel Mare University of Suceava, Romania
- Dhrithi Deshpande, University of Southern California, United States
- Viorel Munteanu, Technical University of Moldova, Moldova
- Mihai Dimian, Stefan cel Mare University of Suceava, Romania
- Grigore Boldirev, Georgia State University, United States
- Alexander Zelikovsky, GSU and University of Suceava, United States
Presentation Overview:
Advancements in omics technologies generate vast datasets, while public repositories facilitate their sharing, which is crucial for accelerating discovery, enhancing reproducibility, and meeting funder and journal mandates. Pre-publication data sharing, particularly alongside preprints, is increasingly beneficial: it enables early re-analysis and proved vital during public health crises like COVID-19, where data access is critical for verifying rapid findings and maintaining scientific integrity. However, a key question is whether raw omics data is consistently deposited when preprints are posted. Our study presents the first comprehensive analysis of pre-publication data sharing practices and their impact on citations in biomedical research. We analyzed 106,000 bioRxiv/medRxiv preprints and 72,715 publications with primary Gene Expression Omnibus (GEO) datasets, identifying 6,819 preprints mentioning GEO IDs and matching 2,022 preprint-publication pairs. The analysis revealed significant variability: only 29.7% of matched pairs had identical, single GEO IDs. While 71-87% of datasets were available before publication, only 9-23% were available at preprint posting. We examined the relationship between dataset release timing and citation counts and found a statistically significant difference (Kolmogorov-Smirnov test, p = 8.596 × 10⁻⁶), indicating a discernible effect of early data availability on citations. We also found over 1,600 cases where data IDs appeared in publications but not in their preprints. Our findings reveal a fragmented landscape of pre-publication omics data sharing, which poses a challenge to reproducibility and transparency.
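The abstract does not describe how GEO IDs were detected or how preprint-publication pairs were compared, so the following is only a minimal sketch of one way such a comparison could be done. It assumes that only GEO series accessions (the "GSE" prefix followed by digits) are of interest and that plain text of both documents is available. The function names and the accession numbers in the usage example are hypothetical placeholders, not taken from the study.

```python
import re

# Assumption: we only look for GEO series accessions, which have the
# form "GSE" followed by digits. Other GEO record types are ignored.
GEO_SERIES_PATTERN = re.compile(r"\bGSE\d+\b")


def extract_geo_ids(text):
    """Return the set of GEO series accessions mentioned in a text."""
    return set(GEO_SERIES_PATTERN.findall(text))


def compare_pair(preprint_text, publication_text):
    """Compare GEO accessions in a preprint and its matched publication.

    Hypothetical helper: reports whether both documents cite the same
    single accession, and which accessions appear in only one of them.
    """
    pre = extract_geo_ids(preprint_text)
    pub = extract_geo_ids(publication_text)
    return {
        "identical_single_id": len(pre) == 1 and pre == pub,
        "shared": sorted(pre & pub),
        "only_in_publication": sorted(pub - pre),
        "only_in_preprint": sorted(pre - pub),
    }


if __name__ == "__main__":
    # Placeholder accession numbers, used purely for illustration.
    result = compare_pair(
        "Data are deposited under GSE100001.",
        "Data are deposited under GSE100001 and GSE100002.",
    )
    print(result)
```

A pair counts as "identical, single GEO ID" here only when both texts mention exactly one accession and it is the same one; accessions found only in the publication correspond to the kind of case the abstract reports as data IDs present in publications but absent from preprints.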