July 24, 2025
8:40-9:40
Invited Presentation: Data, Diagnoses, and Discovery: Improving Healthcare through Electronic Health Records
Confirmed Presenter: Spiros Denaxas, UCL
Track: CAMDA: Critical Assessment of Massive Data Analysis

Room: 01C
Format: In person
Moderator(s): Joaquin Dopazo


Authors List:

  • Spiros Denaxas

Presentation Overview:

Electronic health records (EHRs) represent rich, multidimensional data generated through routine interactions within the healthcare system. These records have transformed biomedical research, shifting the traditional approach of studying diseases in isolation toward the simultaneous analysis of thousands of conditions. This talk will explore the unique opportunities and challenges that EHRs present to researchers and highlight best practices through examples.

July 24, 2025
9:40-10:00
Stage-Disease Grouping, Scoring, and Modeling for Predicting Diabetes Complications from Electronic Health Records
Confirmed Presenter: Daniel Voskergian, Al-Quds University, Computer Engineering Department
Track: CAMDA: Critical Assessment of Massive Data Analysis

Room: 01C
Format: Live stream
Moderator(s): Joaquin Dopazo


Authors List:

  • Daniel Voskergian, Al-Quds University
  • Burcu Bakir-Gungor, Abdullah Gul University
  • Malik Yousef, Zefat College

Presentation Overview:

Diabetes mellitus remains a major global health challenge, contributing significantly to morbidity, disability, and mortality. Accurate prediction of diabetes-related complications from electronic health records (EHRs) is essential for early intervention and personalized care. This study proposes a novel predictive framework that combines new feature engineering with XGBoost-based feature selection and a Grouping–Scoring–Modeling (GSM) approach to improve predictive performance. Rather than relying on individual features, the proposed method constructs Stage-Disease Groups: sets of clinically related variables grouped by disease category (e.g., cardiovascular, renal) and typical onset stage (e.g., early, mid, late) following diabetes diagnosis. Each group captures interactions between variables such as age range and chronic conditions, reflecting real-world progression patterns. Predictive models were developed for four critical diabetes complications: retinopathy, chronic kidney disease, ischemic heart disease, and amputations. These models were trained on a large-scale dataset of synthetic EHRs representing nearly 1 million patients, generated using dual-adversarial autoencoders to preserve realistic temporal and clinical patterns. Results demonstrate that leveraging structured, group-based features improves both classification accuracy and model interpretability. Final models achieved accuracies between 70% and 77% and AUC scores between 76% and 84%, underscoring the effectiveness of the GSM framework in clinical risk prediction.

July 24, 2025
11:20-12:00
Invited Presentation: Benchmarking for Better Private Algorithms
Confirmed Presenter: Antti Honkela, University of Helsinki, Finland
Track: CAMDA: Critical Assessment of Massive Data Analysis

Room: 01C
Format: In person
Moderator(s): Wenzhong Xiao


Authors List:

  • Antti Honkela, University of Helsinki

Presentation Overview:

Responsible application of machine learning (ML) on sensitive health and genetic data requires privacy-preserving algorithms to ensure that the data are not exposed. There is even legislative pressure, especially in Europe, requiring privacy in trained ML models. My talk will discuss how to organise a challenge for privacy-preserving ML to stimulate the development of better private algorithms. This is significantly more difficult than organising regular ML challenges, because there are no straightforward means of reliably evaluating privacy, and fair comparison of solutions requires specifying a comparable privacy-utility trade-off for all participants. Building on experience from running multiple privacy-preserving ML challenges, I will review good and not so good solutions to these issues, hoping to encourage others to include a privacy component in their challenges.

July 24, 2025
12:00-12:20
Invited Presentation: The Health Privacy Challenge - Introduction
Confirmed Presenter: Hakime Öztürk, EMBL, Germany
Track: CAMDA: Critical Assessment of Massive Data Analysis

Room: 01C
Format: In person
Moderator(s): Wenzhong Xiao


Authors List:

  • Hakime Öztürk, EMBL

Presentation Overview:

The Health Privacy Challenge, which is organized in the context of the European Lighthouse on Safe and Secure AI (ELSA, https://elsa-ai.eu/), explores the privacy-preserving aspect of synthetic data generation models in the context of biological datasets. The challenge, through a

July 24, 2025
12:20-13:00
Panel: The Health Privacy - panel discussion
Track: CAMDA: Critical Assessment of Massive Data Analysis

Room: 01C
Format: In person
Moderator(s): Hakime Öztürk


Authors List:

  • Spiros Denaxas, Oliver Stegle

July 24, 2025
14:00-14:20
Synthetic genomic data generation through Differential Privacy-enhanced Non-Negative Matrix Factorization (DP-NMF)
Confirmed Presenter: Andrew Wicks, DKFZ, Germany
Track: CAMDA: Critical Assessment of Massive Data Analysis

Room: 01C
Format: In person
Moderator(s): Hakime Öztürk


Authors List:

  • Andrew Wicks, DKFZ
  • Kyle Fogerty, University of Maryland

Presentation Overview:

Generation of synthetic genomics data is increasingly considered a routine approach for safely sharing sensitive genomic datasets. While traditional data-sharing methods often expose participants to privacy risks such as membership-inference attacks, the necessity of such methods may be reconsidered in favor of privacy-preserving alternatives. In this work, we outline scenarios in genomics research where synthetic data generation via non-negative matrix factorization (NMF) can effectively replace direct data sharing, thereby significantly enhancing privacy. We introduce a simple yet robust heuristic leveraging differential privacy (DP) integrated into NMF-based clustering, combined with a zero-inflated negative binomial or Poisson sampling strategy. We demonstrate the utility and viability of this method through proof-of-concept evaluations on real genomic data, discuss practical use-cases, and highlight broader implications for secure and privacy-compliant genomic data dissemination.

July 24, 2025
14:20-14:40
Synthetic Data Generation for bulk RNA-seq Data: A CAMDA Health Challenge Analysis
Confirmed Presenter: Steven Golob, University of Washington Tacoma, United States
Track: CAMDA: Critical Assessment of Massive Data Analysis

Room: 01C
Format: In person
Moderator(s): Hakime Öztürk


Authors List:

  • Shane Menzies, University of Washington Tacoma
  • Sikha Pentyala, University of Washington Tacoma
  • Daniil Filienko, University of Washington Tacoma
  • Steven Golob, University of Washington Tacoma
  • Jineta Banerjee, Sage Bionetworks
  • Luca Foschini, Sage Bionetworks
  • Martine De Cock, University of Washington Tacoma

Presentation Overview:

One of the major barriers to AI-driven medical discoveries is the limited availability of high-quality, accessible healthcare data. This is because medical data is inherently sensitive, necessitating strict privacy protections that often lead to data being siloed across clinical sites and research institutions. Lack of access to such data hinders reproducibility and slows AI adoption.

To address this bottleneck, we investigate the use of Synthetic Data Generation (SDG) algorithms, capable of generating realistic data with formal privacy guarantees. Here, we investigate the extent to which state-of-the-art SDG algorithms can be applied to bulk RNA-seq data to generate high-quality genomics data suitable for downstream analysis.

July 24, 2025
14:40-15:00
Comparison of Single Cell RNA Synthetic Data Generators: A CAMDA Health Challenge Analysis
Confirmed Presenter: Patrick McKeever, University of Washington, United States
Track: CAMDA: Critical Assessment of Massive Data Analysis

Room: 01C
Format: In person
Moderator(s): Hakime Öztürk


Authors List:

  • Patrick McKeever, University of Washington
  • Daniil Filienko, University of Washington
  • Steven Golob, University of Washington
  • Shane Menzies, University of Washington
  • Sikha Pentyala, University of Washington
  • Jineta Banerjee, Sage Bionetworks
  • Luca Foschini, Sage Bionetworks
  • Martine De Cock, University of Washington

Presentation Overview:

Single cell RNA sequencing has a wide range of applications in medical research, allowing researchers to identify distinct cell types and consider the impact of experimental conditions on a per-cell-type basis. However, the scarcity of counts data for rare cell types or experimental conditions poses considerable difficulties in the analysis of single-cell expression data. As such, a large literature has developed around the generation of synthetic single-cell data. Synthetic single-cell expression data allows biologists to model rare cell states, test new statistical methods against a known ground truth, perform in-silico gene perturbations, and guide the development of sequencing experiment structure in advance. However, while several comparative benchmarks of single cell data exist, much less literature has considered the privacy-preserving aspects of these algorithms. This extended abstract

In this abstract, we explore and compare multiple types of synthetic data generators (SDGs) to generate single-cell RNA-seq (scRNA-seq) data using the OneK1K dataset provided by the CAMDA Healthcare Challenge. Specifically, we evaluate the statistical methods scDesign2 and Private-PGM (which also provides formal differential privacy guarantees) as well as the recent diffusion-based model cfDiffusion. Our analysis follows the evaluation pipeline and metrics defined by the challenge organizers. We find that scDesign2 far exceeds the other generators in terms of data quality.

July 24, 2025
14:40-15:00
NoisyDiffusion: Privacy Preserving Synthetic Gene Expression Data Generation
Confirmed Presenter: Jules Kreuer, Methods in Medical Informatics, University of Tübingen
Track: CAMDA: Critical Assessment of Massive Data Analysis

Room: 01C
Format: In person
Moderator(s): Hakime Öztürk


Authors List:

  • Jules Kreuer, Methods in Medical Informatics

Presentation Overview:

Generating synthetic gene expression data has the potential to advance computational biology and health research by enabling broader access to data. However, creating synthetic data that is both highly faithful to the original and useful from a biological perspective while also ensuring privacy is a significant challenge. While diffusion models are powerful generative tools, their application to sensitive genomic data requires careful consideration of privacy implications, especially regarding their susceptibility to memorisation and membership inference attacks (MIAs). This project presents NoisyDiffusion: a conditional diffusion model designed to generate synthetic gene expression data while incorporating mechanisms for differential privacy to mitigate MIAs.

As this project is part of the CAMDA 2025 - Health Privacy Challenge, it was evaluated on the TCGA-COMBINED and TCGA-BRCA datasets. NoisyDiffusion demonstrated strong utility, with classifiers trained on its synthetic data achieving high accuracy (e.g., 96.92% on TCGA-COMBINED) and AUPR, rivaling top non-private baselines (Multivariate, CVAE) and significantly outperforming other generative models, including those with explicit DP (DP-CVAE, CTGAN).

Crucially, for privacy, Membership Inference Attack (MIA) AUCs were close to 0.5, suggesting good resilience and performance comparable to the Multivariate baseline. This work demonstrates that diffusion models can effectively generate high-quality, privacy-respecting synthetic genomic data, offering a promising pathway for advancing research while safeguarding sensitive information.
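As a rough illustration of why an MIA AUC near 0.5 indicates resilience, the sketch below computes the AUC of a membership inference attack as the probability that a randomly chosen training member receives a higher attack score than a randomly chosen non-member. The function name and all score values are hypothetical and are not taken from the NoisyDiffusion evaluation or the challenge pipeline.

```python
def mia_auc(member_scores, nonmember_scores):
    """AUC as a pairwise win rate: P(member score > non-member score), ties count 0.5."""
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


# Hypothetical attack scores (higher = attacker believes "was in the training set").
# A leaky generator: members are clearly separable from non-members.
leaky_members = [0.9, 0.8, 0.7, 0.6]
leaky_nonmembers = [0.4, 0.3, 0.2, 0.1]

# A resilient generator: both groups receive the same score distribution.
private_members = [0.2, 0.4, 0.6, 0.8]
private_nonmembers = [0.2, 0.4, 0.6, 0.8]

auc_leaky = mia_auc(leaky_members, leaky_nonmembers)        # perfect separation
auc_private = mia_auc(private_members, private_nonmembers)  # no better than chance

print(f"leaky AUC:   {auc_leaky:.2f}")
print(f"private AUC: {auc_private:.2f}")
```

An AUC of 1.0 means the attacker ranks every member above every non-member, while 0.5 means the attack scores carry no membership signal.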

July 24, 2025
15:00-15:20
Reusability of Public Omics Data Across 6 Million Publications
Confirmed Presenter: Serghei Mangul, Stefan cel Mare University of Suceava, Romania
Track: CAMDA: Critical Assessment of Massive Data Analysis

Room: 01C
Format: In person
Moderator(s): Wenzhong Xiao


Authors List:

  • Serghei Mangul, Stefan cel Mare University of Suceava
  • Viorel Munteanu, Technical University of Moldova
  • Dumitru Ciorbă, Technical University of Moldova
  • Viorel Bostan, UTM
  • Mihai Dimian, Stefan cel Mare University of Suceava
  • Nicolae Drabcinski, Technical University of Moldova

Presentation Overview:

Over the past two decades, public repositories like GEO and SRA have accumulated vast omics datasets, sparking a crucial discussion on secondary data analysis. Access to this data is vital for reproducibility, novel experiments, meta-analyses, and new discoveries. However, the extent and factors influencing reuse have been unclear.

A large-scale study analyzed over six million open-access publications from 2001 to 2025 to quantify reuse patterns and identify influencing factors. The analysis identified 213,213 omics-based publications, with approximately 65% based on secondary analysis, marking a significant shift. Since 2015, studies reusing existing gene expression data, particularly microarray data, have increasingly outnumbered those with new data. Despite this, a large portion of datasets, especially RNA-seq, remain underutilized, with over 72% of RNA-seq datasets in GEO and SRA not reused even once.

Reusability varies by data type; microarray data shows the highest average Reusability Index (RI), while RNA-seq and other sequencing data have lower RIs. Human datasets consistently exhibit higher reusability than non-human ones.

Significant barriers to reuse persist, including incomplete metadata, lack of standardization, and the complexity of raw data formats. Many researchers also lack the necessary computational tools or expertise. The study proposes solutions: enforcing metadata standards, integrating automated data processing tools into repositories, formally recognizing data contributions with metrics like RI and Normalized Reusability Index (NRI), and incentivizing reuse through journals and funding agencies. Addressing these challenges is crucial to unlock the full potential of existing omics data.

July 24, 2025
15:00-15:20
Pre-publication sharing of omics data improves paper citations
Confirmed Presenter: Serghei Mangul, Stefan cel Mare University of Suceava, Romania
Track: CAMDA: Critical Assessment of Massive Data Analysis

Room: 01C
Format: In person
Moderator(s): Wenzhong Xiao


Authors List:

  • Serghei Mangul, Stefan cel Mare University of Suceava
  • Dhrithi Deshpande, University of Southern California
  • Viorel Munteanu, Technical University of Moldova
  • Mihai Dimian, Stefan cel Mare University of Suceava
  • Grigore Boldirev, Georgia State University
  • Alexander Zelikovsky, GSU and University of Suceava

Presentation Overview:

Advancements in omics technologies generate vast datasets, while public repositories facilitate their sharing, crucial for accelerating discovery, enhancing reproducibility, and meeting funder/journal mandates. Pre-publication data sharing, particularly alongside preprints, is increasingly beneficial, enabling early re-analysis and proving vital during public health crises like COVID-19, where data access is critical for verifying rapid findings and maintaining scientific integrity. However, a key question is whether raw omics data is consistently deposited when preprints are posted. Our study presents the first comprehensive analysis of pre-publication data sharing practices and their impact on citations in biomedical research. We analyzed 106,000 bioRxiv/medRxiv preprints and 72,715 publications with primary Gene Expression Omnibus (GEO) datasets, identifying 6,819 preprints mentioning GEO IDs and matching 2,022 preprint-publication pairs. Analysis revealed significant variability; only 29.7% of matched pairs had identical, single GEO IDs. While 71-87% of datasets were available before publication, only 9-23% were available at preprint posting. We examined the relationship between dataset release timing and citation counts, revealing statistically significant findings (Kolmogorov-Smirnov test, p = 8.596 x 10⁻⁶) indicating a discernible impact of early data availability on citation benefit. We also found over 1,600 cases where data IDs were in publications but not their preprints. Our findings reveal a fragmented landscape of pre-publication omics data sharing, challenging reproducibility and transparency.
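The abstract above reports a two-sample Kolmogorov-Smirnov test on citation counts. As a minimal sketch of what that test measures, the code below computes only the KS statistic (the largest gap between two empirical cumulative distribution functions), not the p-value. The function name and the citation counts are hypothetical illustrations, not data from the study.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: max |ECDF_a(x) - ECDF_b(x)| over all observed x."""
    a_sorted = sorted(sample_a)
    b_sorted = sorted(sample_b)
    largest_gap = 0.0
    # Both ECDFs are step functions, so the gap only changes at observed values.
    for x in a_sorted + b_sorted:
        ecdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        ecdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        largest_gap = max(largest_gap, abs(ecdf_a - ecdf_b))
    return largest_gap


# Hypothetical citation counts for two groups of papers.
early_release = [12, 15, 20, 22, 30]   # data available at preprint posting
late_release = [3, 5, 8, 12, 14]       # data available only at publication

d_stat = ks_statistic(early_release, late_release)
d_same = ks_statistic(early_release, early_release)

print(f"KS statistic (early vs late): {d_stat:.2f}")
print(f"KS statistic (identical samples): {d_same:.2f}")
```

A statistic near 0 means the two citation distributions are nearly indistinguishable, and a value near 1 means they barely overlap. Turning the statistic into a p-value requires the sample sizes and the KS distribution, which a statistics library would normally handle.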

July 24, 2025
15:20-15:40
Proceedings Presentation: HI-MGSyn: A Hypergraph and Interaction-aware Multi-Granularity Network for Predicting Synergistic Drug Combinations
Confirmed Presenter: Yuexi Gu, School of Mathematics and Statistics, Xi’an Jiaotong University
Track: CAMDA: Critical Assessment of Massive Data Analysis

Room: 01C
Format: Live stream
Moderator(s): Wenzhong Xiao


Authors List:

  • Yuexi Gu, School of Mathematics and Statistics
  • Jian Zu, School of Mathematics and Statistics
  • Yongheng Sun, School of Mathematics and Statistics
  • Louxin Zhang, Department of Mathematics and Centre for Data Science and Machine Learning, National University of Singapore, Singapore

Presentation Overview:

Motivation: Drug combinations can not only enhance drug efficacy but also effectively reduce toxic side effects and mitigate drug resistance. With the advancement of drug combination screening technologies, large amounts of data have been generated. The availability of large data enables researchers to develop deep learning methods for predicting drug targets for synergistic combination. However, these methods still lack sufficient accuracy for practical use, and most overlook the biological significance of their models.

Results: We propose the HI-MGSyn (Hypergraph and Interaction-aware Multi-granularity Network for Drug Synergy Prediction) model, which integrates a coarse-granularity module and a fine-granularity module to predict drug combination synergy. The former utilizes a hypergraph to capture global features, while the latter employs interaction-aware attention to simulate biological processes by modeling substructure-substructure and substructure-cell line interactions. HI-MGSyn outperforms state-of-the-art machine learning models on our validation datasets extracted from the DrugComb and GDSC2 databases. Furthermore, the fact that five of the 12 novel synergistic drug combinations predicted by HI-MGSyn are strongly supported by experimental evidence in the literature underscores its practical potential.

July 24, 2025
15:40-16:00
CAMDA Trophy
Track: CAMDA: Critical Assessment of Massive Data Analysis

Room: 01C
Format: Live stream

Authors List:

  • David Kreil

July 24, 2025
15:40-16:00
Closing remarks
Track: CAMDA: Critical Assessment of Massive Data Analysis

Room: 01C
Format: Live stream

Authors List:

  • David Kreil