Schedule subject to change
All times listed are in BST
Wednesday, July 23rd
11:20-12:20
Invited Presentation: Genome-based prediction of microbial traits
Confirmed Presenter: Thomas Rattei, University of Vienna, Austria

Room: 01C
Format: In person

Moderator(s): Paweł Łabaj


Authors List:

  • Thomas Rattei, University of Vienna, Austria

Presentation Overview:

The prediction of phenotypic traits from genomic information is an ongoing challenge in computational biology. Although the fundamental principles of information encoding in genomes have been studied for decades and have enabled the first directed modifications, the expression of phenotypic traits is often the result of complex interactions. Predictive approaches in bioinformatics therefore focus on machine learning from labeled genomic data.

In recent years, we have focused on the computational prediction of microbial phenotypic traits from metagenomic data. These data have been collected on a large scale to explore the diversity and composition of microbial communities and to correlate them with environmental factors (e.g. human health and disease). The prediction of traits for these millions of genomes, based on neural networks that use protein families as features, goes one step further and is ready for first applications.

12:20-12:40
Invited Presentation: The Anti-Microbial Resistance Prediction Challenge - Introduction
Room: 01C
Format: In person

Moderator(s): Paweł Łabaj


Authors List:

  • Leonid Chindelevitch

Presentation Overview:

TBP

12:40-13:00
A Hybrid Pipeline for Feature Reduction, and Ordinal Classification to Predict Antimicrobial Resistance from Genetic Profiles
Room: 01C
Format: In person

Moderator(s): Paweł Łabaj


Authors List:

  • Adriana Haydeé Contreras Peruyero, Centro de Ciencias Matemáticas UNAM Morelia, Mexico
  • Yesenia Villicaña Molina, Centro de Ciencias Matemáticas UNAM Morelia, Mexico
  • Nelly Sélem Mojica, Centro de Ciencias Matemáticas UNAM Morelia, Mexico
  • Francisco Santiago Nieto de la Rosa, Centro de Ciencias Matemáticas UNAM Morelia, Mexico
  • Victor Muñiz Sánchez, CIMAT MTY, Mexico
  • Anton Pashkov, ENES Morelia UNAM, Mexico
  • Johanna Atenea Carreón Baltazar, Centro de Ciencias Matemáticas UNAM Morelia, Mexico
  • Luis Raúl Figueroa Martínez, Centro de Ciencias Matemáticas UNAM Morelia, Mexico
  • Evelia Lorena Coss Navarrete, LIIGH, UNAM, Mexico
  • César Augusto Aguilar Martínez, Campus Monterrey, School of Engineering and Sciences, Mexico

Presentation Overview:

One of the three challenges proposed by the Community of Interest Critical Assessment of Massive Data Analysis (CAMDA) involves predicting antimicrobial resistance or susceptibility for nine bacterial species and four antibiotics of interest. The dataset underwent a cleaning process to remove duplicate IDs with differing MIC values or phenotypes. After data cleaning and preprocessing, three distinct strategies were implemented to perform the predictions. The first strategy focused on predicting minimum inhibitory concentration (MIC) values. We adapted machine learning models for ordinal classification, assuming MIC as an ordinal variable. Two main approaches were used: multiple binary models (logistic regression, CART, random forests) and threshold models (neural networks). Due to the high dimensionality and sparsity of the AMR gene count data, we applied preprocessing techniques including a TF-IDF-like transformation (GF-IAF) and dimensionality reduction (truncated SVD and NMF). In the second strategy, we tested several classical machine learning models to predict the phenotype directly and used a grid search to find the optimal set of parameters, without using MIC values. In the third, we applied dimensionality reduction methods such as TF-IDF, along with a biological filtering step, before predicting the phenotype. Finally, as a preliminary result, ANI and pangenome analyses of E. coli isolates revealed divergence in gene content among some strains. Accessory regions potentially linked to antibiotic resistance suggest that key resistance determinants may lie outside the core genome.
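The TF-IDF-like weighting of sparse AMR gene counts mentioned in the abstract above can be illustrated with a minimal sketch. The abstract does not define GF-IAF, so this uses a standard smoothed TF-IDF as a stand-in, treating each isolate as a "document" and each gene as a "term"; the function name `tfidf_like` and the toy matrix are hypothetical and this is not the authors' code.

```python
import math

def tfidf_like(counts):
    """Weight an isolates-by-genes count matrix with a smoothed TF-IDF scheme.

    Assumption: rows are isolates, columns are AMR genes. This is plain
    TF-IDF used as a stand-in, not the GF-IAF definition from the talk.
    """
    n_samples = len(counts)
    n_genes = len(counts[0])
    # Number of isolates in which each gene occurs at least once.
    doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_genes)]
    # Smoothed inverse frequency: genes seen in every isolate get weight 1.
    idf = [math.log((1 + n_samples) / (1 + df)) + 1.0 for df in doc_freq]
    weighted = []
    for row in counts:
        total = sum(row)
        if total == 0:
            # An isolate with no detected genes stays all-zero.
            weighted.append([0.0] * n_genes)
            continue
        weighted.append([(c / total) * idf[j] for j, c in enumerate(row)])
    return weighted

# Toy data: 3 isolates x 3 genes (made-up counts).
counts = [
    [2, 0, 1],
    [1, 0, 0],
    [3, 1, 0],
]
weighted = tfidf_like(counts)
```

Genes that occur in few isolates receive a larger weight per count than genes present everywhere, which is the property that makes such a transform useful before truncated SVD or NMF.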

14:00-14:40
Predicting Antimicrobial Resistance Using Microbiome-Pretrained DNABERT2 and DBGWAS-Derived Genomic Features
Room: 01C
Format: In person

Moderator(s): Paweł Łabaj


Authors List:

  • Jack Vaska, Department of Biomedical Informatics, Stony Brook University, Stony Brook, NY, United States
  • Pratik Dutta, Department of Biomedical Informatics, Stony Brook University, Stony Brook, NY, United States
  • Max Chao, Department of Biomedical Informatics, Stony Brook University, Stony Brook, NY, United States
  • Rekha Sathian, Department of Biomedical Informatics, Stony Brook University, Stony Brook, NY, United States
  • Zhihan Zhou, Department of Computer Science, Northwestern University, Evanston, IL, United States
  • Han Liu, Department of Computer Science, Northwestern University, Evanston, IL, United States
  • Ramana Davuluri, Department of Biomedical Informatics, Stony Brook University, Stony Brook, NY, United States

Presentation Overview:

Antimicrobial resistance (AMR) is an escalating public health threat, especially in hospitals where diverse resistance gene reservoirs have emerged. With the increasing availability of metagenomic and whole-genome sequencing data from AMR pathogens, there is a timely opportunity to develop predictive models. Given the complexity of these genomic datasets, large language models (LLMs) offer a promising approach due to their ability to capture long-range sequence patterns. DNABERT2, an LLM pretrained on diverse DNA sequences, has shown strong performance in various genomic tasks and is well-suited for AMR prediction (Zhou et al., 2023). We present a novel method to predict AMR across nine pathogenic bacterial species treated with four common antibiotics. Four custom DNABERT2 models, pretrained on human microbiome-derived genomic sequences, were fine-tuned on sequences obtained from de novo assembled bacterial genomes. To extract phenotype-associated features, we employed De Bruijn Graph-based Genome-Wide Association Study (DBGWAS) in an alignment-free manner (Jaillard et al., 2018). Statistically significant sequences (p < 0.05) were aligned back to assemblies using BLAST (≥80% identity), and 1,000 bp flanking subsequences were extracted. Resistant samples showed a markedly higher number of BLAST hits than susceptible ones. Data were grouped by antibiotic and each group was fine-tuned using a DNABERT2 model incorporating species and BLAST hit count as additional features. Consensus predictions across sequences achieved 84.5% accuracy and a macro F1 score of 0.84. Our findings demonstrate that resistant bacteria contain distinct genomic features absent in susceptible strains, highlighting the promise of LLM-based methods for AMR prediction.
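The abstract above reports consensus predictions across sequences but does not say how the consensus is formed. As a hedged illustration only, the sketch below assumes a simple majority vote over per-sequence labels for each isolate, with ties resolved toward "resistant"; the function name, label strings, and sample IDs are all hypothetical.

```python
from collections import Counter, defaultdict

def consensus_by_sample(sequence_predictions):
    """Collapse per-sequence labels into one label per sample by majority vote.

    Assumption: the input is an iterable of (sample_id, label) pairs, one per
    extracted subsequence. Ties go to "resistant" (an illustrative choice).
    """
    votes = defaultdict(list)
    for sample_id, label in sequence_predictions:
        votes[sample_id].append(label)

    consensus = {}
    for sample_id, labels in votes.items():
        counts = Counter(labels)
        top = max(counts.values())
        winners = sorted(lab for lab, c in counts.items() if c == top)
        consensus[sample_id] = "resistant" if "resistant" in winners else winners[0]
    return consensus

# Made-up per-sequence predictions for three isolates.
predictions = [
    ("isolate_A", "resistant"),
    ("isolate_A", "resistant"),
    ("isolate_A", "susceptible"),
    ("isolate_B", "susceptible"),
    ("isolate_B", "susceptible"),
    ("isolate_C", "resistant"),
    ("isolate_C", "susceptible"),
]
result = consensus_by_sample(predictions)
```

Other aggregation rules, such as averaging predicted probabilities, would fit the same description equally well.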

14:40-15:00
The Antimicrobial Resistance Prediction Challenge
Confirmed Presenter: Alper Yurtseven, Helmholtz Institute for Pharmaceutical Research Saarland (HIPS), Saarbruecken, Germany

Room: 01C
Format: In person

Moderator(s): Paweł Łabaj


Authors List:

  • Alper Yurtseven, Helmholtz Institute for Pharmaceutical Research Saarland (HIPS), Saarbruecken, Germany
  • Dilfuza Djamalova, Computational Metagenomics (IBG-5), Forschungszentrum Jülich, Bielefeld, Germany
  • Marco Galardini, Molecular Bacteriology Institute, TWINCORE, Hannover, Germany
  • Olga V. Kalinina, Helmholtz Institute for Pharmaceutical Research Saarland (HIPS), Saarbruecken, Germany

Presentation Overview:

Antimicrobial Resistance (AMR) is an urgent threat to human health worldwide as microbes have developed resistance to even the most advanced drugs. In this year’s CAMDA challenge, we focused on predicting antimicrobial resistance of 5,346 bacterial strains that belong to 9 different species (Acinetobacter baumannii, Campylobacter jejuni, Escherichia coli, Klebsiella pneumoniae, Neisseria gonorrhoeae, Pseudomonas aeruginosa, Salmonella enterica, Staphylococcus aureus, Streptococcus pneumoniae) using two machine learning algorithms.

15:00-15:20
Antimicrobial Resistance Prediction via Binary Ensemble Classifier and Assessment of Variable Importance
Confirmed Presenter: Owen Visser, University of Florida, United States

Room: 01C
Format: In person

Moderator(s): Paweł Łabaj


Authors List:

  • Owen Visser, University of Florida, United States
  • Victor Agboli, University of Florida, United States
  • Somnath Datta, University of Florida, United States

Presentation Overview:

Antimicrobial resistance (AMR) presents a growing challenge to global health, driven by antibiotic overuse and the rapid evolution of resistant bacteria. Predicting whether an isolate is resistant or susceptible to a drug remains difficult due to genomic variability. As part of the 2025 CAMDA Challenge, we adapted a standard bioinformatic pipeline to preprocess the variable raw sequencing data and derived features from strain-specific markers and AMR gene classes. Three machine learning methods that have shown high accuracy in recent AMR prediction research were trained and combined into an ensemble to predict binary resistance phenotypes across nine bacterial pathogens and four antibiotics. The ensemble performed well across most species, notably achieving 96.8% accuracy for C. jejuni and 98.2% for A. baumannii. Permutation-based variable importance analysis identified relevant resistance genes and strains, such as sulphonamide and aminoglycoside genes and the LAC-4 strain in A. baumannii. These results demonstrate the utility of ensemble models for AMR prediction on large, heterogeneous genomic datasets.
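Permutation-based variable importance, as named in the abstract above, can be sketched in a few lines: shuffle one feature column at a time and record how much accuracy drops. The toy model, data, and function names below are hypothetical and serve only to show the mechanics, not the ensemble used in the talk.

```python
import random

def accuracy(model, X, y):
    """Fraction of rows for which the model's prediction matches the label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(model, X, y, n_repeats=20, seed=0):
    """Mean drop in accuracy after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the matrix with only column j permuted.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

def toy_model(row):
    # Hypothetical classifier that only looks at feature 0
    # (e.g. presence of a single resistance gene).
    return 1 if row[0] > 0 else 0

# Made-up presence/absence features and binary resistance labels.
X = [[1, 0], [0, 1], [1, 1], [0, 0], [1, 0], [0, 1]]
y = [1, 0, 1, 0, 1, 0]
importances = permutation_importance(toy_model, X, y)
```

Because the toy model ignores feature 1, shuffling that column never changes its predictions and its importance is exactly zero, while shuffling feature 0 degrades accuracy.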

A Highly Accurate Workflow for Inference of Antimicrobial Resistance from Genetic Data Based on Machine Learning and Global Data Curation
Room: 01C
Format: In person

Moderator(s): Paweł Łabaj


Authors List:

  • Gabor Fidler, Biotia Inc, Hungary
  • Heather Wells, Biotia Inc, United States
  • Ford Combs, Biotia Inc, United States
  • John Papciak, Biotia Inc, United States
  • Mara Couto-Rodriguez, Biotia Inc, United States
  • Sol Rey, Biotia Inc, United States
  • Tiara Rivera, Biotia Inc, United States
  • Lorenzo Uccellini, Biotia Inc, United States
  • Christopher Mason, Biotia Inc, Cornell University, United States
  • Niamh O'Hara, Biotia Inc, United States
  • Dorottya Nagy-Szakal, Biotia Inc, United States
  • David Danko, Biotia Inc, United States

Presentation Overview:

Note: This abstract is paired with the prediction submission “Base Model, 2nd Submission (Biotia)” made by user gfidler from team Biotia on May 15, 2025.

We present BIOTIA-DX Resistance, our submission to the CAMDA AMR Challenge. This tool builds on our clinically validated metagenomic workflow to provide broad domain predictions for antimicrobial resistance from microbial sequencing data. We achieved an F1 score of 84 on the CAMDA challenge test set. Our technique is based on curation of global datasets, machine learning-based predictions from input data, and highly stringent preprocessing of input data and databases.

15:20-15:40
Invited Presentation: The Gut Microbiome Health Index Challenge - Introduction
Confirmed Presenter: Kinga Zielińska

Room: 01C
Format: In person

Moderator(s): Paweł Łabaj


Authors List:

  • Kinga Zielińska

15:40-16:00
Integrating Taxonomic and Functional Features for Gut Microbiome Health Indexing
Room: 01C
Format: In person

Moderator(s): Kinga Zielińska


Authors List:

  • Shaday Guerrero Flores, Cinvestav Unidad Irapuato, Mexico
  • Rafael Pérez-Estrada, Centro de Ciencias Matemáticas, UNAM, Mexico
  • Juan Francisco Espinosa Maya, Centro de Ciencias Matemáticas UNAM, Mexico
  • Nelly Selem Mojica, Centro de Ciencias Matemáticas de la UNAM, Mexico
  • David Alberto García Estrada, Unidad Genómica Avanzada Cinvestav, Mexico
  • Orlando Camargo Escalante, Unidad Genómica Avanzada Cinvestav, Mexico
  • Mario Jardon, Centro de Ciencias Matemáticas UNAM, Mexico
  • Jose Daniel Chavez Gonzalez, Universidad Autonoma de Guerrero, Mexico

Presentation Overview:

Accurate characterization of the gut microbiome is essential for understanding its role in health and disease; however, while current indices such as GMHI and hiPCA rely on taxonomic profiles to associate microbiome composition with health states, they do not consider underlying functional variability. Here, we integrate species-level (MetaPhlAn) and pathway-level (HUMAnN) data from 4,398 samples provided by CAMDA 2025 to understand key organisms and pathways in different groups of diseases and to develop and evaluate composite health indices.
We first built co-occurrence networks, identifying Dorea formicigenerans and Erysipelatoclostridium ramosum as keystone taxa in hypertension and T2D respiratory infirmity. We then recalibrated GMHI and hiPCA for both taxonomic and functional data and developed three ensemble models. The best-performing, the Optimized Pathway Ensemble, reached an F1-score of 0.76. We extended GMHI to distinguish between disease groups and tested pairwise classifiers across conditions—including healthy, gastrointestinal, metabolic, psychiatric, and neurological disorders. Additionally, we developed the Gut Microbiome Health Calculator, a web tool for computing and comparing these indices. Our results show that combining taxonomic and functional features enhances classification and reveals biologically relevant patterns in disease.

16:40-17:20
Building a Rare-Disease Microbiome Health Index: Integrating Gut Metagenomes, Synthetic PKU EHRs and Rare-Variant Profiles to Forecast Phenylalanine Crises
Room: 01C
Format: In person

Moderator(s): Kinga Zielińska


Authors List:

  • Khartik Uppalapati, RareGen Youth Network 501(c)(3), United States
  • Bora Yimenicioglu, RareGen Youth Network 501(c)(3), United States
  • Shakeel Abdulkareem, RareGen Youth Network 501(c)(3), United States
  • Adan Eftekhari, Harvard University, United States

Presentation Overview:

Phenylketonuria (PKU) is an autosomal recessive metabolic disorder characterized by deficient phenylalanine hydroxylase activity, leading to episodic neurotoxic elevations in plasma phenylalanine (Phe) despite strict dietary management. Existing gut health metrics, however, fail to capture rare-disease–specific dysbiosis. To address these concerns, we developed a Rare-Disease Microbiome Health Index (RDMHI) that integrates MetaPhlAn-derived species abundances, HUMAnN functional pathways, synthetic electronic health record timelines, and rare-variant burdens to forecast imminent Phe crises. We curated 4,398 metagenomic profiles from the CAMDA dataset alongside three external PKU cohorts (n < 100), applied centered log-ratio transformation and batch correction, and generated 5,000 patient-month windows via Synthea-augmented GAN models to simulate clinical and laboratory events. Rare-variant burdens for PAH and BH₄-pathway genes were collapsed into gene-level indicators. A LightGBM-DART classifier was trained under nested five-fold, leave-one-dataset-out cross-validation and evaluated by AUROC, AUPRC, and Matthews correlation coefficient (MCC) with 1,000-sample bootstrap CIs. RDMHI achieved an AUROC of 0.91 (95% CI 0.88–0.94) and an MCC of 0.64, outperforming clinical-only (AUROC 0.78; MCC 0.38) and microbiome-only (AUROC 0.81; MCC 0.45) baselines. External validation on 50 registry windows yielded an AUROC of 0.85 (0.81–0.89) and 78% sensitivity at a 22% false-positive rate. By outperforming existing gut-health indices (GMHI and hiPCA), RDMHI demonstrates the impact of tailoring health indices to rare diseases and establishes a new standard of microbiome-based prognostic modeling for precision risk stratification in rare metabolic disorders.
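The centered log-ratio (CLR) transformation mentioned in the abstract above is a standard step for compositional abundance data. A minimal sketch follows; the pseudocount used to handle zero abundances is an assumption, since the abstract does not say how zeros were treated, and the example profile is made up.

```python
import math

def clr(abundances, pseudocount=1e-6):
    """Centered log-ratio transform of one compositional profile.

    Each value becomes log(x_i) minus the mean of all log(x_j), which equals
    log(x_i / geometric_mean(x)). The pseudocount is an illustrative choice
    to avoid log(0); it is not taken from the abstract.
    """
    logs = [math.log(a + pseudocount) for a in abundances]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]

# Made-up relative abundances for four species in one sample.
profile = [0.5, 0.3, 0.2, 0.0]
clr_values = clr(profile)
```

CLR-transformed values sum to zero within each sample and preserve the rank order of the original abundances, which removes the unit-sum constraint before downstream modeling.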

17:20-17:40
Toward the Development of a Novel and Comprehensive Gut Health Index: An Ensemble Model Integrating Taxonomic and Functional Profiles
Confirmed Presenter: Vincent Mei, University of Florida, United States

Room: 01C
Format: In person

Moderator(s): Kinga Zielińska


Authors List:

  • Vincent Mei, University of Florida, United States
  • Yulin Li, University of Florida, United States
  • Somnath Datta, University of Florida, United States

Presentation Overview:

Diseases linked to the gut microbiome have been on the rise, contributing to the rising cost of healthcare and worsening patient outcomes. Since stool samples provide an accurate representation of the gut microbiome and can be collected frequently and non-invasively, it is of clinical interest to create an index that can accurately classify samples as healthy or non-healthy.
Several indices already exist to assess microbiome health, such as the Gut Microbiome Health Index (GMHI), health index with PCA (hiPCA), and Shannon entropy measures, but their reliance solely on species abundance limits their ability to distinguish between healthy and non-healthy individuals.
To improve upon these indices, we proposed a novel ensemble-based index that integrates both taxonomical and metabolic pathway abundance data from stool samples to predict individual health status.
From the provided data with 1211 species features and 619 pathway features, 61 species and 21 pathways were identified and used to train the ensemble model. The best threshold for the index generated from the ensemble model was selected using Youden’s index, resulting in a balanced accuracy of 0.7234 compared to values below 0.5 for GMHI, hiPCA, and Shannon entropy measures.
Feature importance was also calculated simultaneously with the ensemble model training by permuting one feature at a time, leading to the identification of the 20 most important species and pathways when determining gut microbiome health.
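Threshold selection with Youden's index, as described in the abstract above, picks the cut-off that maximizes J = sensitivity + specificity - 1. The sketch below is a minimal illustration on made-up scores; treating label 1 as the positive class and predicting positive when the score is at or above the threshold are both assumptions.

```python
def youden_threshold(scores, labels):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1.

    Assumption: label 1 is the positive class and a sample is predicted
    positive when its score is >= threshold. Ties keep the lowest threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    best_threshold, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, lab in zip(scores, labels) if s >= t and lab == 1)
        tn = sum(1 for s, lab in zip(scores, labels) if s < t and lab == 0)
        j = tp / positives + tn / negatives - 1.0
        if j > best_j:
            best_threshold, best_j = t, j
    return best_threshold, best_j

# Made-up index values and binary health labels.
scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
threshold, j_stat = youden_threshold(scores, labels)
balanced_accuracy = (j_stat + 1.0) / 2.0
```

Since balanced accuracy is the mean of sensitivity and specificity, it equals (J + 1) / 2, so maximizing Youden's index also maximizes balanced accuracy on the data used for selection.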

17:40-18:00
Topology-Enabled Integration of Taxonomic and Functional Microbiome Profiles Reveals Distinct Subgroups in Healthy Individuals
Confirmed Presenter: Doroteya Staykova, Multicore Dynamics Ltd, New Milton, United Kingdom, Bulgaria

Room: 01C
Format: In person

Moderator(s): Kinga Zielińska


Authors List:

  • Doroteya Staykova, Multicore Dynamics Ltd, New Milton, United Kingdom, Bulgaria

Presentation Overview:

High-throughput sequencing technologies have enabled detailed taxonomic and functional profiling of the human gut microbiome. However, integrating these diverse, high-dimensional data sources remains a major challenge - particularly in defining robust, cross-modal indicators of gut health - due to significant inter-individual variability observed even within healthy populations. In this study, I applied Topological Data Analysis (TDA) to the CAMDA 2025 Microbiome Challenge dataset to integrate taxonomic and functional profiles from healthy individuals. My primary aim was to establish a baseline for human gut health by identifying microbial patterns within a large, healthy cohort. A cross-modal network representation of over 1,600 microbiome samples was constructed using the Mapper algorithm with PHATE-based topological lenses. The derived topological shape revealed two distinct subgroups within the landscape of the healthy gut microbiome. Subsequent statistical analyses identified characteristic taxonomic and functional signatures associated with each subgroup, demonstrating the utility of TDA in uncovering intrinsic patterns and providing a data-driven framework for more precise stratification of gut health.

Ensemble-Based Topic Selection for Text Classification via a Grouping, Scoring, and Modeling Approach
Room: 01C
Format: In person

Moderator(s): Kinga Zielińska


Authors List:

  • Daniel Voskergian, Al-Quds University, Computer Engineering Department, Palestine
  • Burcu Bakir-Gungor, Abdullah Gul University, Department of Computer Engineering, Faculty of Engineering, Kayseri, Turkey
  • Malik Yousef, Zefat College, Israel

Presentation Overview:

The exponential growth in scientific literature, especially in biomedical domains, has intensified the need for effective automatic text classification (ATC) systems. TextNetTopics is a recent approach that classifies documents using topic-based features derived from Latent Dirichlet Allocation (LDA), reducing dimensionality while maintaining semantic richness. However, TextNetTopics’ reliance on single topic models introduces performance variability across datasets, limiting its generalizability.

This study introduces ENTM-TS (Ensemble Topic Modeling for Topic Selection), a novel framework that enhances TextNetTopics by integrating multiple topic models through a three-stage Grouping, Scoring, and Modeling (GSM) approach. First, topics are extracted from various models and merged based on semantic similarity to reduce redundancy and generate discriminative topic groups. These groups are then scored using internal and external evaluation strategies, ensuring normalized comparison and identifying top-performing subsets. Finally, a modeling phase iteratively aggregates and evaluates these groups to build an optimal feature set for classification.
ENTM-TS was evaluated on two biomedical text datasets: the DILI dataset and the WOS-5736 dataset of scientific abstracts. Results demonstrate that ENTM-TS consistently meets or exceeds the performance of single-model configurations, improving classification accuracy and reducing variability. This ensemble-based approach not only preserves semantic richness but also ensures robustness across diverse datasets.
ENTM-TS offers a generalizable and interpretable solution for biomedical text mining, with future work aimed at automating parameter selection for greater usability.

Thursday, July 24th
8:40-9:40
Invited Presentation: Data, Diagnoses, and Discovery: Improving Healthcare through Electronic Health Records
Room: 01C
Format: In person

Moderator(s): Joaquin Dopazo


Authors List:

  • Spiros Denaxas, UCL

Presentation Overview:

Electronic health records (EHRs) represent rich, multidimensional data generated through routine interactions within the healthcare system. These records have transformed biomedical research, shifting the traditional approach of studying diseases in isolation toward the simultaneous analysis of thousands of conditions. This talk will explore the unique opportunities and challenges that EHRs present to researchers and highlight best practices through examples.

9:40-10:00
Stage-Disease Grouping, Scoring, and Modeling for Predicting Diabetes Complications from Electronic Health Records
Room: 01C
Format: In person

Moderator(s): Joaquin Dopazo


Authors List:

  • Daniel Voskergian, Al-Quds University, Computer Engineering Department, Palestine
  • Burcu Bakir-Gungor, Abdullah Gul University, Department of Computer Engineering, Faculty of Engineering, Kayseri, Turkey
  • Malik Yousef, Zefat College, Israel

Presentation Overview:

Diabetes mellitus remains a major global health challenge, contributing significantly to morbidity, disability, and mortality. Accurate prediction of diabetes-related complications from electronic health records (EHRs) is essential for early intervention and personalized care. This study proposes a predictive framework that combines novel feature engineering with XGBoost-based feature selection and a Grouping–Scoring–Modeling (GSM) approach to improve predictive performance. Rather than relying on individual features, the proposed method constructs Stage-Disease Groups: sets of clinically related variables grouped by disease category (e.g., cardiovascular, renal) and typical onset stage (e.g., early, mid, late) following diabetes diagnosis. Each group captures interactions between variables such as age range and chronic conditions, reflecting real-world progression patterns. Predictive models were developed for four critical diabetes complications: retinopathy, chronic kidney disease, ischemic heart disease, and amputations. These models were trained on a large-scale dataset of synthetic EHRs representing nearly 1 million patients, generated using dual-adversarial autoencoders to preserve realistic temporal and clinical patterns. Results demonstrate that leveraging structured, group-based features improves both classification accuracy and model interpretability. Final models achieved accuracies between 70% and 77% and AUC scores between 76% and 84%, underscoring the effectiveness of the GSM framework in clinical risk prediction.

11:20-12:00
Invited Presentation: Benchmarking for Better Private Algorithms
Room: 01C
Format: In person

Moderator(s): Wenzhong Xiao


Authors List:

  • Antti Honkela

Presentation Overview:

Responsible application of machine learning (ML) on sensitive health and genetic data requires privacy-preserving algorithms to ensure that the data are not exposed. There is even legislative pressure, especially in Europe, requiring privacy in trained ML models. My talk will discuss how to organise a challenge for privacy-preserving ML to stimulate the development of better private algorithms. This is significantly more difficult than organising regular ML challenges, because there are no straightforward means of reliably evaluating privacy, and fair comparison of solutions requires specifying a comparable privacy-utility trade-off for all participants. Building on experience from running multiple privacy-preserving ML challenges, I will review good and not so good solutions to these issues, hoping to encourage others to include a privacy component in their challenges.

12:00-12:20
Invited Presentation: The Health Privacy Challenge - Introduction
Confirmed Presenter: Hakime Öztürk

Room: 01C
Format: In person

Moderator(s): Wenzhong Xiao


Authors List:

  • Hakime Öztürk

Presentation Overview:

The Health Privacy Challenge, which is organized in the context of the European Lighthouse on Safe and Secure AI (ELSA, https://elsa-ai.eu/), explores the privacy-preserving aspect of synthetic data generation models in the context of biological datasets. The challenge, through a

12:20-13:00
Panel: The Health Privacy - panel discussion
Room: 01C
Format: In person

Moderator(s): Hakime Öztürk


Authors List:

  • Spiros Denaxas, Oliver Stegle, Antti Honkela, David Kreil, Wenzhong Xiao, Joaquin Dopazo

14:00-14:20
Synthetic genomic data generation through Differential Privacy-enhanced Non-Negative Matrix Factorization (DP-NMF)
Room: 01C
Format: In person

Moderator(s): Hakime Öztürk


Authors List:

  • Andrew Wicks, DKFZ, Germany
  • Kyle Fogerty, University of Maryland, United States

Presentation Overview:

Generation of synthetic genomics data is increasingly considered a routine approach for safely sharing sensitive genomic datasets. While traditional data-sharing methods often expose participants to privacy risks such as membership-inference attacks, the necessity of such methods may be reconsidered in favor of privacy-preserving alternatives. In this work, we outline scenarios in genomics research where synthetic data generation via non-negative matrix factorization (NMF) can effectively replace direct data sharing, thereby significantly enhancing privacy. We introduce a simple yet robust heuristic leveraging differential privacy (DP) integrated into NMF-based clustering, combined with a zero-inflated negative binomial or Poisson sampling strategy. We demonstrate the utility and viability of this method through proof-of-concept evaluations on real genomic data, discuss practical use cases, and highlight broader implications for secure and privacy-compliant genomic data dissemination.

14:20-14:40
Synthetic Data Generation for bulk RNA-seq Data: A CAMDA Health Challenge Analysis
Confirmed Presenter: Steven Golob, University of Washington Tacoma, United States

Room: 01C
Format: In person

Moderator(s): Hakime Öztürk


Authors List:

  • Shane Menzies, University of Washington Tacoma, United States
  • Sikha Pentyala, University of Washington Tacoma, United States
  • Daniil Filienko, University of Washington Tacoma, United States
  • Steven Golob, University of Washington Tacoma, United States
  • Jineta Banerjee, Sage Bionetworks, Seattle, United States
  • Luca Foschini, Sage Bionetworks, Seattle, United States
  • Martine De Cock, University of Washington Tacoma, United States

Presentation Overview:

One of the major barriers to AI-driven medical discoveries is the limited availability of high-quality, accessible healthcare data. This is because medical data is inherently sensitive, necessitating strict privacy protections that often lead to data being siloed across clinical sites and research institutions. Lack of access to such data hinders reproducibility and slows AI adoption.

To address this bottleneck, we investigate the use of Synthetic Data Generation (SDG) algorithms, capable of generating realistic data with formal privacy guarantees. Here, we investigate the extent to which state-of-the-art SDG algorithms can be applied to bulk RNA-seq data to generate high-quality genomics data suitable for downstream analysis.

14:40-15:00
Comparison of Single Cell RNA Synthetic Data Generators: A CAMDA Health Challenge Analysis
Confirmed Presenter: Patrick McKeever, University of Washington, United States

Room: 01C
Format: In person

Moderator(s): Hakime Öztürk


Authors List:

  • Patrick McKeever, University of Washington, United States
  • Daniil Filienko, University of Washington, United States
  • Steven Golob, University of Washington, United States
  • Shane Menzies, University of Washington, United States
  • Sikha Pentyala, University of Washington, United States
  • Jineta Banerjee, Sage Bionetworks, United States
  • Luca Foschini, Sage Bionetworks, United States
  • Martine De Cock, University of Washington, United States

Presentation Overview:

Single cell RNA sequencing has a wide range of applications in medical research, allowing researchers to identify distinct cell types and consider the impact of experimental conditions on a per-cell-type basis. However, the scarcity of counts data for rare cell types or experimental conditions poses considerable difficulties in the analysis of single-cell expression data. As such, a large literature has developed around the generation of synthetic single-cell data. Synthetic single-cell expression data allows biologists to model rare cell states, test new statistical methods against a known ground truth, perform in-silico gene perturbations, and guide the development of sequencing experiment structure in advance. However, while several comparative benchmarks of single cell data exist, much less literature has considered the privacy-preserving aspects of these algorithms. This extended abstract

In this abstract, we explore and compare multiple types of synthetic data generators (SDGs) to generate single-cell RNA-seq (scRNA-seq) data using the OneK1K dataset provided by the CAMDA Healthcare Challenge. Specifically, we evaluate the statistical methods scDesign2 (Sun et al., 2021) and Private-PGM (which also provides formal differential privacy guarantees), as well as the recent diffusion-based model cfDiffusion. Our analysis follows the evaluation pipeline and metrics defined by the challenge organizers. We find that scDesign2 far exceeds the other generators in terms of data quality.

NoisyDiffusion: Privacy Preserving Synthetic Gene Expression Data Generation
Confirmed Presenter: Jules Kreuer, Methods in Medical Informatics, University of Tübingen, Germany

Room: 01C
Format: In person

Moderator(s): Hakime Öztürk


Authors List:

  • Jules Kreuer, Methods in Medical Informatics, University of Tübingen, Germany

Presentation Overview:

Generating synthetic gene expression data has the potential to advance computational biology and health research by enabling broader access to data. However, creating synthetic data that is both highly faithful to the original and useful from a biological perspective while also ensuring privacy is a significant challenge. While diffusion models are powerful generative tools, their application to sensitive genomic data requires careful consideration of privacy implications, especially regarding their susceptibility to memorisation and membership inference attacks (MIAs). This project presents NoisyDiffusion: a conditional diffusion model designed to generate synthetic gene expression data while incorporating mechanisms for differential privacy to mitigate MIAs.

As this project is part of the CAMDA 2025 - Health Privacy Challenge, it was evaluated on the TCGA-COMBINED and TCGA-BRCA datasets. NoisyDiffusion demonstrated strong utility, with classifiers trained on its synthetic data achieving high accuracy (e.g., 96.92% on TCGA-COMBINED) and AUPR, rivaling top non-private baselines (Multivariate, CVAE) and significantly outperforming other generative models, including those with explicit DP (DP-CVAE, CTGAN).

Crucially, for privacy, Membership Inference Attack (MIA) AUCs were close to 0.5, suggesting good resilience and performance comparable to the Multivariate baseline.
This work demonstrates that diffusion models can effectively generate high-quality, privacy-respecting synthetic genomic data, offering a promising pathway for advancing research while safeguarding sensitive information.
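
As an illustration of why an MIA AUC near 0.5 suggests resilience, the sketch below computes the AUC of a membership attack from its scores. The function name `mia_auc` and all scores are hypothetical and are not taken from the NoisyDiffusion evaluation. The AUC is the probability that a randomly chosen training member gets a higher attack score than a randomly chosen non-member, so 0.5 corresponds to an attacker who does no better than guessing.

```python
def mia_auc(member_scores, non_member_scores):
    """AUC of a membership inference attack, computed from pairwise comparisons.

    Counts how often a member's attack score beats a non-member's score;
    ties count as half a win (the Mann-Whitney formulation of AUC).
    """
    wins = 0.0
    for m in member_scores:
        for n in non_member_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(non_member_scores))


# Hypothetical attack scores (higher = "attacker believes this sample was in training").
# A leaky generator: members are always scored above non-members.
leaky_auc = mia_auc([0.9, 0.8, 0.7, 0.6], [0.4, 0.3, 0.2, 0.1])

# A resilient generator: member and non-member scores are interleaved.
resilient_auc = mia_auc([0.1, 0.4, 0.6, 0.9], [0.2, 0.3, 0.7, 0.8])

print(leaky_auc)      # 1.0: the attacker separates members perfectly
print(resilient_auc)  # 0.5: the attacker does no better than chance
```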

15:00-15:20
Reusability of Public Omics Data Across 6 Million Publications
Room: 01C
Format: In person

Moderator(s): Wenzhong Xiao


Authors List:

  • Serghei Mangul, Stefan cel Mare University of Suceava, Romania
  • Viorel Munteanu, Technical University of Moldova, Moldova
  • Dumitru Ciorbă, Technical University of Moldova, Moldova
  • Viorel Bostan, UTM, Moldova
  • Mihai Dimian, Stefan cel Mare University of Suceava, Romania
  • Nicolae Drabcinski, Technical University of Moldova, Chisinau, Moldova

Presentation Overview:

Over the past two decades, public repositories like GEO and SRA have accumulated vast omics datasets, sparking a crucial discussion on secondary data analysis. Access to this data is vital for reproducibility, novel experiments, meta-analyses, and new discoveries. However, the extent and factors influencing reuse have been unclear.

A large-scale study analyzed over six million open-access publications from 2001 to 2025 to quantify reuse patterns and identify influencing factors. The analysis identified 213,213 omics-based publications, with approximately 65% based on secondary analysis, marking a significant shift. Since 2015, studies reusing existing gene expression data, particularly microarray data, have increasingly outnumbered those with new data. Despite this, a large portion of datasets, especially RNA-seq, remain underutilized, with over 72% of RNA-seq datasets in GEO and SRA not reused even once.

Reusability varies by data type; microarray data shows the highest average Reusability Index (RI), while RNA-seq and other sequencing data have lower RIs. Human datasets consistently exhibit higher reusability than non-human ones.

Significant barriers to reuse persist, including incomplete metadata, lack of standardization, and the complexity of raw data formats. Many researchers also lack the necessary computational tools or expertise. The study proposes solutions: enforcing metadata standards, integrating automated data processing tools into repositories, formally recognizing data contributions with metrics like RI and Normalized Reusability Index (NRI), and incentivizing reuse through journals and funding agencies. Addressing these challenges is crucial to unlock the full potential of existing omics data.

Pre-publication sharing of omics data improves paper citations
Room: 01C
Format: In person

Moderator(s): Wenzhong Xiao


Authors List:

  • Serghei Mangul, Stefan cel Mare University of Suceava, Romania
  • Dhrithi Deshpande, University of Southern California, United States
  • Viorel Munteanu, Technical University of Moldova, Moldova
  • Mihai Dimian, Stefan cel Mare University of Suceava, Romania
  • Grigore Boldirev, Georgia State University, United States
  • Alexander Zelikovsky, GSU and University of Suceava, United States

Presentation Overview:

Advancements in omics technologies generate vast datasets, while public repositories facilitate their sharing, crucial for accelerating discovery, enhancing reproducibility, and meeting funder/journal mandates. Pre-publication data sharing, particularly alongside preprints, is increasingly beneficial, enabling early re-analysis and proving vital during public health crises like COVID-19, where data access is critical for verifying rapid findings and maintaining scientific integrity. However, a key question is whether raw omics data is consistently deposited when preprints are posted. Our study presents the first comprehensive analysis of pre-publication data sharing practices and their impact on citations in biomedical research. We analyzed 106,000 bioRxiv/medRxiv preprints and 72,715 publications with primary Gene Expression Omnibus (GEO) datasets, identifying 6,819 preprints mentioning GEO IDs and matching 2,022 preprint-publication pairs. Analysis revealed significant variability; only 29.7% of matched pairs had identical, single GEO IDs. While 71-87% of datasets were available before publication, only 9-23% were available at preprint posting. We examined the relationship between dataset release timing and citation counts and found a statistically significant difference (Kolmogorov-Smirnov test, p = 8.596 × 10⁻⁶), indicating a discernible effect of early data availability on citations. We also found over 1,600 cases where data IDs were in publications but not their preprints. Our findings reveal a fragmented landscape of pre-publication omics data sharing, challenging reproducibility and transparency.
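
For readers unfamiliar with the Kolmogorov-Smirnov test mentioned above, the following minimal sketch computes the two-sample KS statistic, the largest gap between the empirical cumulative distributions of two samples. The function name `ks_statistic` and the citation counts are hypothetical and do not come from the study. The sketch stops at the statistic; a p-value such as the one reported is derived from this statistic and the sample sizes.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic.

    Evaluates both empirical CDFs at every observed value and returns
    the largest absolute difference between them.
    """
    points = sorted(set(sample_a) | set(sample_b))
    largest_gap = 0.0
    for x in points:
        cdf_a = sum(1 for v in sample_a if v <= x) / len(sample_a)
        cdf_b = sum(1 for v in sample_b if v <= x) / len(sample_b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Hypothetical citation counts, invented purely for illustration.
early_release_citations = [12, 15, 20, 22, 30]  # data released at preprint posting
late_release_citations = [3, 5, 8, 12, 14]      # data released at publication

d = ks_statistic(early_release_citations, late_release_citations)
print(round(d, 3))  # 0.8: the two citation distributions differ strongly
```

A statistic of 0 would mean the two empirical distributions coincide, and 1 would mean they do not overlap at all.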

15:20-15:40
Proceedings Presentation: HI-MGSyn: A Hypergraph and Interaction-aware Multi-Granularity Network for Predicting Synergistic Drug Combinations
Confirmed Presenter: Yuexi Gu, School of Mathematics and Statistics, Xi’an Jiaotong University, Shaanxi, People’s Republic of China, China

Room: 01C
Format: In person

Moderator(s): Wenzhong Xiao


Authors List:

  • Yuexi Gu, School of Mathematics and Statistics, Xi’an Jiaotong University, Shaanxi, People’s Republic of China, China
  • Jian Zu, School of Mathematics and Statistics, Xi’an Jiaotong University, Shaanxi, People’s Republic of China, China
  • Yongheng Sun, School of Mathematics and Statistics, Xi’an Jiaotong University, Shaanxi, People’s Republic of China, China
  • Louxin Zhang, Department of Mathematics and Centre for Data Science and Machine Learning, National University of Singapore, Singapore, Singapore

Presentation Overview:

Motivation: Drug combinations can not only enhance drug efficacy but also effectively reduce toxic side effects and mitigate drug resistance. With the advancement of drug combination screening technologies, large amounts of data have been generated. The availability of large data enables researchers to develop deep learning methods for predicting drug targets for synergistic combination. However, these methods still lack sufficient accuracy for practical use, and most overlook the biological significance of their models.

Results: We propose the HI-MGSyn (Hypergraph and Interaction-aware Multi-granularity Network for Drug Synergy Prediction) model, which integrates a coarse-granularity module and a fine-granularity module to predict drug combination synergy. The former utilizes a hypergraph to capture global features, while the latter employs interaction-aware attention to simulate biological processes by modeling substructure-substructure and substructure-cell line interactions. HI-MGSyn outperforms state-of-the-art machine learning models on our validation datasets extracted from the DrugComb and GDSC2 databases. Furthermore, the fact that five of the 12 novel synergistic drug combinations predicted by HI-MGSyn are strongly supported by experimental evidence in the literature underscores its practical potential.

15:40-16:00
CAMDA Trophy
Room: 01C
Format: In person


Authors List:

  • David Kreil

Closing remarks
Room: 01C
Format: In person


Authors List:

  • David Kreil