Comparison of Single Cell RNA Synthetic Data Generators: A CAMDA Health Challenge Analysis
Confirmed Presenter: Patrick McKeever, University of Washington, United States
Room: 01C
Format: In person
Moderator(s): Hakime Öztürk
Authors List:
- Patrick McKeever, University of Washington, United States
- Daniil Filienko, University of Washington, United States
- Steven Golob, University of Washington, United States
- Shane Menzies, University of Washington, United States
- Sikha Pentyala, University of Washington, United States
- Jineta Banerjee, Sage Bionetworks, United States
- Luca Foschini, Sage Bionetworks, United States
- Martine De Cock, University of Washington, United States
Presentation Overview:
Single-cell RNA sequencing has a wide range of applications in medical research, allowing researchers to identify distinct cell types and to assess the impact of experimental conditions on a per-cell-type basis. However, the scarcity of count data for rare cell types or experimental conditions makes single-cell expression data difficult to analyze. As a result, a large literature has developed around the generation of synthetic single-cell data. Synthetic single-cell expression data allows biologists to model rare cell states, test new statistical methods against a known ground truth, perform in-silico gene perturbations, and plan the structure of sequencing experiments in advance. However, while several comparative benchmarks of single-cell data generators exist, far less work has considered the privacy-preserving aspects of these algorithms.
In this abstract, we explore and compare multiple types of synthetic data generators (SDGs) for single-cell RNA-seq (scRNA-seq) data using the OneK1K dataset provided by the CAMDA Healthcare Challenge. Specifically, we evaluate the statistical methods scDesign2 (Sun et al., 2021) and Private-PGM (which also provides formal differential privacy guarantees), as well as the recent diffusion-based model cfDiffusion. Our analysis follows the evaluation pipeline and metrics defined by the challenge organizers. We find that scDesign2 far exceeds the other generators in terms of data quality.
NoisyDiffusion: Privacy Preserving Synthetic Gene Expression Data Generation
Confirmed Presenter: Jules Kreuer, Methods in Medical Informatics, University of Tübingen, Germany
Room: 01C
Format: In person
Moderator(s): Hakime Öztürk
Authors List:
- Jules Kreuer, Methods in Medical Informatics, University of Tübingen, Germany
Presentation Overview:
Generating synthetic gene expression data has the potential to advance computational biology and health research by enabling broader access to data. However, creating synthetic data that is both highly faithful to the original and useful from a biological perspective while also ensuring privacy is a significant challenge. While diffusion models are powerful generative tools, their application to sensitive genomic data requires careful consideration of privacy implications, especially regarding their susceptibility to memorisation and membership inference attacks (MIAs). This project presents NoisyDiffusion: a conditional diffusion model designed to generate synthetic gene expression data while incorporating mechanisms for differential privacy to mitigate MIAs.
As this project is part of the CAMDA 2025 - Health Privacy Challenge, it was evaluated on the TCGA-COMBINED and TCGA-BRCA datasets. NoisyDiffusion demonstrated strong utility, with classifiers trained on its synthetic data achieving high accuracy (e.g., 96.92% on TCGA-COMBINED) and AUPR, rivaling top non-private baselines (Multivariate, CVAE) and significantly outperforming other generative models, including those with explicit DP (DP-CVAE, CTGAN).
Crucially, for privacy, MIA AUCs were close to 0.5 (the value expected from random guessing), suggesting good resilience and performance comparable to the Multivariate baseline.
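To illustrate why an MIA AUC near 0.5 indicates resilience, the following is a minimal sketch, not the challenge's evaluation pipeline. It assumes an attacker assigns each sample a membership score (higher meaning "more likely in the training set") and computes the AUC as the probability that a random training member outscores a random non-member. The function name `mia_auc` and the simulated Gaussian scores are hypothetical and chosen only for illustration.

```python
import numpy as np


def mia_auc(member_scores, nonmember_scores):
    """AUC of a membership inference attack (hypothetical helper).

    Equals the probability that a randomly chosen training member receives
    a higher attack score than a randomly chosen non-member, with ties
    counted as one half (the Mann-Whitney U formulation of the AUC).
    """
    m = np.asarray(member_scores, dtype=float)
    n = np.asarray(nonmember_scores, dtype=float)
    # Pairwise differences between every member and every non-member score.
    diff = m[:, None] - n[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return wins / (m.size * n.size)


rng = np.random.default_rng(0)

# Simulated scores (assumption): a leaky generator lets the attacker score
# members higher on average; a resilient one gives identical distributions.
leaky_auc = mia_auc(rng.normal(1.0, 1.0, 500), rng.normal(0.0, 1.0, 500))
resilient_auc = mia_auc(rng.normal(0.0, 1.0, 500), rng.normal(0.0, 1.0, 500))

print(f"leaky generator AUC:     {leaky_auc:.3f}")
print(f"resilient generator AUC: {resilient_auc:.3f}")
```

When member and non-member scores follow the same distribution, the attacker cannot rank members above non-members better than chance, so the AUC stays near 0.5; a value well above 0.5 would indicate that the synthetic data leaks membership information.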
This work demonstrates that diffusion models can effectively generate high-quality, privacy-respecting synthetic genomic data, offering a promising pathway for advancing research while safeguarding sensitive information.