ISMB/ECCB 2025

Schedule for CAMDA


Date	Start Time	End Time	Room	Track	Title	Confirmed Presenter	Format	Authors	Abstract
2025-07-23	11:20:00	12:20:00	01C	CAMDA: Critical Assessment of Massive Data Analysis	Genome-based prediction of microbial traits	Thomas Rattei	In person	Thomas Rattei	The prediction of phenotypic traits from genomic information is an ongoing challenge in computational biology. Although the fundamental principles of information encoding in genomes have been studied for decades and have allowed the first directed modifications, the expression of phenotypic traits is often the result of complex interactions. Predictive approaches in bioinformatics therefore focus on machine learning from labeled genomic data. In recent years, we have focused on the computational prediction of microbial phenotypic traits from metagenomic data. These data have been collected on a large scale, to explore the diversity and composition of microbial communities and to correlate them with environmental factors (e.g. human health and disease). The prediction of traits for these millions of genomes, based on neural networks that use protein families as features, goes one step further and can be used in first applications.
2025-07-23	12:20:00	12:40:00	01C	CAMDA: Critical Assessment of Massive Data Analysis	The Anti-Microbial Resistance Prediction Challenge - Introduction	Leonid Chidelevitch	In person	Leonid Chidelevitch	The AMR prediction challenge at CAMDA is now in its third year. This year's challenge on predicting AMR quantitatively (MIC values) as well as qualitatively (resistance vs susceptibility) has been developed in conjunction with our CABBAGE project. CABBAGE, which stands for a Comprehensive Assessment of Bacterial-Based AMR prediction from GEnotypes, involves the collection, curation, and exploitation of all the publicly available data on AMR genotypes and phenotypes, not only from databases, but also from individual publications. In this introductory talk I will describe the process by which we arrived at the selected datasets for this year's challenge, discuss other progress we've made on CABBAGE so far, and preview the plans for next year's challenge.
2025-07-23	12:40:00	13:00:00	01C	CAMDA: Critical Assessment of Massive Data Analysis	A Hybrid Pipeline for Feature Reduction, and Ordinal Classification to Predict Antimicrobial Resistance from Genetic Profiles	Anton Pashkov	In person	Adriana Haydeé Contreras Peruyero, Yesenia Villicaña Molina, Nelly Sélem Mojica, Francisco Santiago Nieto de la Rosa, Victor Muñiz Sánchez, Anton Pashkov, Johanna Atenea Carreón Baltazar, Luis Raúl Figueroa Martínez, Evelia Lorena Coss Navarrete, César Augusto Aguilar Martínez	One of the three challenges proposed by the Community of Interest Critical Assessment of Massive Data Analysis (CAMDA) involves predicting antimicrobial resistance or susceptibility for nine bacterial species and four antibiotics of interest. The dataset underwent a cleaning process to remove duplicate IDs with differing MIC values or phenotypes. After data cleaning and preprocessing, three distinct strategies were implemented to perform the predictions. The first strategy focused on predicting minimum inhibitory concentration (MIC) values. We adapted machine learning models for ordinal classification, assuming MIC as an ordinal variable. Two main approaches were used: multiple binary models (logistic regression, CART, random forests) and threshold models (neural networks). Due to the high dimensionality and sparsity of the AMR gene count data, we applied preprocessing techniques including a TF-IDF-like transformation (GF-IAF) and dimensionality reduction (truncated SVD and NMF). In the second strategy, we tested several classical machine learning models to predict the phenotype directly and used a grid search to find the optimal set of parameters, without using MIC values. In the third, we applied dimensionality reduction methods such as TF-IDF, along with a biological filtering step, before predicting the phenotype. Finally, as a preliminary result, ANI and pangenome analyses of E. coli isolates revealed divergence in gene content among some strains. Accessory regions potentially linked to antibiotic resistance suggest that key resistance determinants may lie outside the core genome.
2025-07-23	14:00:00	14:40:00	01C	CAMDA: Critical Assessment of Massive Data Analysis	Predicting Antimicrobial Resistance Using Microbiome-Pretrained DNABERT2 and DBGWAS-Derived Genomic Features	Jack Vaska	In person	Jack Vaska, Pratik Dutta, Max Chao, Rekha Sathian, Zhihan Zhou, Han Liu, Ramana Davuluri	Antimicrobial resistance (AMR) is an escalating public health threat, especially in hospitals where diverse resistance gene reservoirs have emerged. With the increasing availability of metagenomic and whole-genome sequencing data from AMR pathogens, there is a timely opportunity to develop predictive models. Given the complexity of these genomic datasets, large language models (LLMs) offer a promising approach due to their ability to capture long-range sequence patterns. DNABERT2, an LLM pretrained on diverse DNA sequences, has shown strong performance in various genomic tasks and is well-suited for AMR prediction (Zhou et al., 2023). We present a novel method to predict AMR across nine pathogenic bacterial species treated with four common antibiotics. Four custom DNABERT2 models, pretrained on human microbiome-derived genomic sequences, were fine-tuned on sequences obtained from de novo assembled bacterial genomes. To extract phenotype-associated features, we employed De Bruijn Graph-based Genome-Wide Association Study (DBGWAS) in an alignment-free manner (Jaillard et al., 2018). Statistically significant sequences (p < 0.05) were aligned back to assemblies using BLAST (≥80% identity), and 1,000 bp flanking subsequences were extracted. Resistant samples showed a markedly higher number of BLAST hits than susceptible ones. Data were grouped by antibiotic and each group was fine-tuned using a DNABERT2 model incorporating species and BLAST hit count as additional features. Consensus predictions across sequences achieved 84.5% accuracy and a macro F1 score of 0.84. Our findings demonstrate that resistant bacteria contain distinct genomic features absent in susceptible strains, highlighting the promise of LLM-based methods for AMR prediction.
2025-07-23	14:40:00	15:00:00	01C	CAMDA: Critical Assessment of Massive Data Analysis	The Antimicrobial Resistance Prediction Challenge	Alper Yurtseven	In person	Alper Yurtseven, Dilfuza Djamalova, Marco Galardini, Olga V. Kalinina	Antimicrobial Resistance (AMR) is an urgent threat to human health worldwide as microbes have developed resistance to even the most advanced drugs. In this year’s CAMDA challenge, we focused on predicting antimicrobial resistance of 5,346 bacterial strains that belong to 9 different species (Acinetobacter baumannii, Campylobacter jejuni, Escherichia coli, Klebsiella pneumoniae, Neisseria gonorrhoeae, Pseudomonas aeruginosa, Salmonella enterica, Staphylococcus aureus, Streptococcus pneumoniae) using two machine learning algorithms.
2025-07-23	15:00:00	15:20:00	01C	CAMDA: Critical Assessment of Massive Data Analysis	Antimicrobial Resistance Prediction via Binary Ensemble Classifier and Assessment of Variable Importance	Owen Visser	In person	Owen Visser, Victor Agboli, Somnath Datta	Antimicrobial resistance (AMR) presents a growing challenge to global health, driven by antibiotic overuse and the rapid evolution of resistant bacteria. Predicting whether an isolate is resistant or susceptible to a drug remains difficult due to genomic variability. As part of the 2025 CAMDA Challenge, we altered a standard bioinformatic pipeline to preprocess the variable raw sequencing data, and features were derived from strain-specific markers and AMR gene classes. Three machine learning methods which have shown high accuracy in recent AMR prediction research were trained and compiled into an ensemble to predict binary resistance phenotypes for nine bacterial pathogens for four antibiotics. The ensemble performed well across most species, notably achieving 96.8% accuracy for C. jejuni and 98.2% for A. baumannii. Permutation-based variable importance analysis identified relevant resistance genes and strains, such as sulphonamide and aminoglycoside genes and the LAC-4 strain in A. baumannii. These results demonstrate the utility of ensemble models for AMR prediction on large, heterogeneous genomic datasets.
2025-07-23	15:00:00	15:20:00	01C	CAMDA: Critical Assessment of Massive Data Analysis	A Highly Accurate Workflow for Inference of Antimicrobial Resistance from Genetic Data Based on Machine Learning and Global Data Curation	David Danko	In person	Gabor Fidler, Heather Wells, Ford Combs, John Papciak, Mara Couto-Rodriguez, Sol Rey, Tiara Rivera, Lorenzo Uccellini, Christopher Mason, Niamh O'Hara, Dorottya Nagy-Szakal, David Danko	Note: This abstract is paired with the prediction submission “Base Model, 2nd Submission (Biotia)” made by user gfidler from team Biotia on May 15, 2025. We present BIOTIA-DX Resistance, our submission to the CAMDA AMR Challenge. This tool builds off of our clinically validated metagenomic workflow to provide broad domain predictions for antimicrobial resistance from microbial sequencing data. We achieved an F1 score of 84 on the CAMDA challenge test set. Our technique is based on curation of global datasets, machine learning-based predictions from input data, and highly stringent preprocessing of input data and databases.
2025-07-23	15:20:00	15:40:00	01C	CAMDA: Critical Assessment of Massive Data Analysis	The Gut Microbiome Health Index Challenge - Introduction	Kinga Zielińska	In person	Kinga Zielińska
2025-07-23	15:40:00	16:00:00	01C	CAMDA: Critical Assessment of Massive Data Analysis	Integrating Taxonomic and Functional Features for Gut Microbiome Health Indexing	Rafael Pérez Estrada	In person	Shaday Guerrero Flores, Rafael Pérez Estrada, Juan Francisco Espinosa Maya, Nelly Selem Mojica, David Alberto García Estrada, Orlando Camargo Escalante, Mario Jardón Santos, Jose Daniel Chavez Gonzalez	Accurate characterization of the gut microbiome is essential for understanding its role in health and disease; however, while current indices such as GMHI and hiPCA rely on taxonomic profiles to associate microbiome composition with health states, they do not consider underlying functional variability. Here, we integrate species-level (MetaPhlAn) and pathway-level (HUMAnN) data from 4,398 samples provided by CAMDA 2025 to understand key organisms and pathways in different groups of diseases and to develop and evaluate composite health indices. We first built co-occurrence networks, identifying keystone taxa. We then recalibrated GMHI and hiPCA for both taxonomic and functional data and developed three ensemble models. The best-performing, the Optimized Pathway Ensemble, reached an F1-score of 0.76. We extended GMHI to distinguish between disease groups and tested pairwise classifiers across conditions, including healthy, gastrointestinal, metabolic, psychiatric, and neurological disorders. Additionally, we developed the Gut Microbiome Health Calculator, a web tool for computing and comparing these indices. Our results show that combining taxonomic and functional features enhances classification and reveals biologically relevant patterns in disease.
2025-07-23	16:40:00	17:20:00	01C	CAMDA: Critical Assessment of Massive Data Analysis	Building a Rare-Disease Microbiome Health Index: Integrating Gut Metagenomes, Synthetic PKU EHRs and Rare-Variant Profiles to Forecast Phenylalanine Crises			Khartik Uppalapati, Bora Yimenicioglu, Shakeel Abdulkareem, Adan Eftekhari	Phenylketonuria (PKU) is an autosomal recessive metabolic disorder characterized by deficient phenylalanine hydroxylase activity, leading to episodic neurotoxic elevations in plasma phenylalanine (Phe) despite strict dietary management. However, existing gut health metrics fail to capture rare-disease–specific dysbiosis. To address these concerns, we developed a Rare-Disease Microbiome Health Index (RDMHI) that integrates MetaPhlAn-derived species abundances, HUMAnN functional pathways, synthetic electronic health record timelines, and rare-variant burdens to forecast imminent Phe crises. We curated 4 398 metagenomic profiles from the CAMDA dataset alongside three external PKU cohorts (n < 100), applied centered log-ratio transformation and batch correction, and generated 5 000 patient-month windows via Synthea-augmented GAN models to simulate clinical and laboratory events. Rare-variant burdens for PAH and BH₄-pathway genes were collapsed into gene-level indicators. A LightGBM-DART classifier was trained under nested five-fold, leave-one-dataset-out cross-validation and evaluated by AUROC, AUPRC, and Matthews correlation coefficient with 1 000-sample bootstrap CIs. RDMHI achieved an AUROC of 0.91 (95 % CI 0.88–0.94) and an MCC of 0.64, outperforming clinical-only (AUROC 0.78; MCC 0.38) and microbiome-only (AUROC 0.81; MCC 0.45) baselines. External validation on 50 registry windows yielded an AUROC of 0.85 (0.81–0.89) and 78 % sensitivity at a 22 % false-positive rate. By outperforming existing gut-health indices (GMHI and hiPCA), RDMHI demonstrates the impact of tailoring health indices to rare diseases and establishes a new standard of microbiome-based prognostic modeling for precision risk stratification in rare metabolic disorders.
2025-07-23	17:20:00	17:40:00	01C	CAMDA: Critical Assessment of Massive Data Analysis	Toward the Development of a Novel and Comprehensive Gut Health Index: An Ensemble Model Integrating Taxonomic and Functional Profiles	Vincent Mei	In person	Vincent Mei, Yulin Li, Somnath Datta	Diseases linked to the gut microbiome have been on the rise, which contributes to the rising cost of healthcare and worsening patient outcomes. Since stool samples provide an accurate representation of the gut microbiome and can be collected frequently and non-invasively, it is of clinical interest to create an index that can accurately classify samples as healthy or non-healthy. Several indices already exist to assess microbiome health, such as the Gut Microbiome Health Index (GMHI), health index with PCA (hiPCA), and Shannon entropy measures, but their reliance solely on species abundance limits their ability to distinguish between healthy and non-healthy individuals. To improve upon these indices, we proposed a novel ensemble-based index that integrates both taxonomical and metabolic pathway abundance data from stool samples to predict individual health status. From the provided data with 1211 species features and 619 pathway features, 61 species and 21 pathways were identified and used to train the ensemble model. The best threshold for the index generated from the ensemble model was selected using Youden’s index, resulting in a balanced accuracy of 0.7234 compared to values below 0.5 for GMHI, hiPCA, and Shannon entropy measures. Feature importance was also calculated simultaneously with the ensemble model training by permuting one feature at a time, leading to the identification of the 20 most important species and pathways when determining gut microbiome health.
2025-07-23	17:40:00	18:00:00	01C	CAMDA: Critical Assessment of Massive Data Analysis	Topology-Enabled Integration of Taxonomic and Functional Microbiome Profiles Reveals Distinct Subgroups in Healthy Individuals	Doroteya Staykova	In person	Doroteya Staykova	High-throughput sequencing technologies have enabled detailed taxonomic and functional profiling of the human gut microbiome. However, integrating these diverse, high-dimensional data sources remains a major challenge - particularly in defining robust, cross-modal indicators of gut health - due to significant inter-individual variability observed even within healthy populations. In this study, I applied Topological Data Analysis (TDA) to the CAMDA 2025 Microbiome Challenge dataset to integrate taxonomic and functional profiles from healthy individuals. My primary aim was to establish a baseline for human gut health by identifying microbial patterns within a large, healthy cohort. A cross-modal network representation of over 1,600 microbiome samples was constructed using the Mapper algorithm with PHATE-based topological lenses. The derived topological shape revealed two distinct subgroups within the landscape of the healthy gut microbiome. Subsequent statistical analyses identified characteristic taxonomic and functional signatures associated with each subgroup, demonstrating the utility of TDA in uncovering intrinsic patterns and providing a data-driven framework for more precise stratification of gut health.
2025-07-23	17:40:00	18:00:00	01C	CAMDA: Critical Assessment of Massive Data Analysis	Ensemble-Based Topic Selection for Text Classification via a Grouping, Scoring, and Modeling Approach			Daniel Voskergian, Burcu Bakir-Gungor, Malik Yousef	The exponential growth in scientific literature, especially in biomedical domains, has intensified the need for effective automatic text classification (ATC) systems. TextNetTopics is a recent approach that classifies documents using topic-based features derived from Latent Dirichlet Allocation (LDA), reducing dimensionality while maintaining semantic richness. However, TextNetTopics’ reliance on single topic models introduces performance variability across datasets, limiting its generalizability. This study introduces ENTM-TS (Ensemble Topic Modeling for Topic Selection), a novel framework that enhances TextNetTopics by integrating multiple topic models through a three-stage Grouping, Scoring, and Modeling (GSM) approach. First, topics are extracted from various models and merged based on semantic similarity to reduce redundancy and generate discriminative topic groups. These groups are then scored using internal and external evaluation strategies, ensuring normalized comparison and identifying top-performing subsets. Finally, a modeling phase iteratively aggregates and evaluates these groups to build an optimal feature set for classification. ENTM-TS was evaluated on two biomedical text datasets: the DILI dataset and the WOS-5736 dataset of scientific abstracts. Results demonstrate that ENTM-TS consistently meets or exceeds the performance of single-model configurations, improving classification accuracy and reducing variability. This ensemble-based approach not only preserves semantic richness but also ensures robustness across diverse datasets. ENTM-TS offers a generalizable and interpretable solution for biomedical text mining, with future work aimed at automating parameter selection for greater usability.
2025-07-24	08:40:00	09:20:00	01C	CAMDA: Critical Assessment of Massive Data Analysis	Data, Diagnoses, and Discovery: Improving Healthcare through Electronic Health Records	Spiros Denaxas	In person	Spiros Denaxas	Electronic health records (EHRs) represent rich, multidimensional data generated through routine interactions within the healthcare system. These records have transformed biomedical research, shifting the traditional approach of studying diseases in isolation toward the simultaneous analysis of thousands of conditions. This talk will explore the unique opportunities and challenges that EHRs present to researchers and highlight best practices through examples.
2025-07-24	09:20:00	09:40:00		CAMDA: Critical Assessment of Massive Data Analysis	The Synthetic Clinical Health Records Challenge Introduction			Carlos Loucera
2025-07-24	09:40:00	10:00:00	01C	CAMDA: Critical Assessment of Massive Data Analysis	Stage-Disease Grouping, Scoring, and Modeling for Predicting Diabetes Complications from Electronic Health Records	Daniel Voskergian	Live stream	Daniel Voskergian, Burcu Bakir-Gungor, Malik Yousef	Diabetes mellitus remains a major global health challenge, contributing significantly to morbidity, disability, and mortality. Accurate prediction of diabetes-related complications from electronic health records (EHRs) is essential for early intervention and personalized care. This study proposes a predictive framework that combines novel feature engineering with XGBoost-based feature selection and a Grouping–Scoring–Modeling (GSM) approach to improve predictive performance. Rather than relying on individual features, the proposed method constructs Stage-Disease Groups: sets of clinically related variables grouped by disease category (e.g., cardiovascular, renal) and typical onset stage (e.g., early, mid, late) following diabetes diagnosis. Each group captures interactions between variables such as age range and chronic conditions, reflecting real-world progression patterns. Predictive models were developed for four critical diabetes complications: retinopathy, chronic kidney disease, ischemic heart disease, and amputations. These models were trained on a large-scale dataset of synthetic EHRs representing nearly 1 million patients, generated using dual-adversarial autoencoders to preserve realistic temporal and clinical patterns. Results demonstrate that leveraging structured, group-based features improves both classification accuracy and model interpretability. Final models achieved accuracies between 70% and 77% and AUC scores between 76% and 84%, underscoring the effectiveness of the GSM framework in clinical risk prediction.
2025-07-24	11:20:00	12:00:00	01C	CAMDA: Critical Assessment of Massive Data Analysis	Benchmarking for Better Private Algorithms	Antti Honkela	In person	Antti Honkela	Responsible application of machine learning (ML) on sensitive health and genetic data requires privacy-preserving algorithms to ensure that the data are not exposed. There is even legislative pressure, especially in Europe, requiring privacy in trained ML models. My talk will discuss how to organise a challenge for privacy-preserving ML to stimulate the development of better private algorithms. This is significantly more difficult than organising regular ML challenges, because there are no straightforward means of reliably evaluating privacy, and fair comparison of solutions requires specifying a comparable privacy-utility trade-off for all participants. Building on experience from running multiple privacy-preserving ML challenges, I will review good and not so good solutions to these issues, hoping to encourage others to include a privacy component in their challenges.
2025-07-24	12:00:00	12:20:00	01C	CAMDA: Critical Assessment of Massive Data Analysis	The Health Privacy Challenge - Introduction	Hakime Öztürk	In person	Hakime Öztürk	The Health Privacy Challenge, which is organized in the context of the European Lighthouse on Safe and Secure AI (ELSA, https://elsa-ai.eu/), explores the privacy-preserving aspect of synthetic data generation models in the context of biological datasets. The challenge, through a
2025-07-24	12:20:00	13:00:00	01C	CAMDA: Critical Assessment of Massive Data Analysis	The Health Privacy - panel discussion			Spiros Denaxas, Oliver Stegle, Antti Honkela, David Kreil, Wenzhong Xiao, Joaquin Dopazo
2025-07-24	14:00:00	14:20:00	01C	CAMDA: Critical Assessment of Massive Data Analysis	Synthetic genomic data generation through Differential Privacy-enhanced Non-Negative Matrix Factorization (DP-NMF)	Andrew Wicks	In person	Andrew Wicks, Kyle Fogerty	Generation of synthetic genomics data is increasingly considered as a routine approach for safely sharing sensitive genomic datasets. While traditional data-sharing methods often expose participants to privacy risks such as membership-inference attacks, the necessity of such methods may be reconsidered in favor of privacy-preserving alternatives. In this work, we outline scenarios in genomics research where synthetic data generation via non-negative matrix factorization (NMF) can effectively replace direct data sharing, thereby significantly enhancing privacy. We introduce a simple yet robust heuristic leveraging differential privacy (DP) integrated into NMF-based clustering, combined with a zero-inflated negative binomial or Poisson sampling strategy. We demonstrate the utility and viability of this method through proof-of-concept evaluations on real genomic data, discuss practical use-cases, and highlight broader implications for secure and privacy-compliant genomic data dissemination.
2025-07-24	14:20:00	14:40:00	01C	CAMDA: Critical Assessment of Massive Data Analysis	Synthetic Data Generation for bulk RNA-seq Data: A CAMDA Health Challenge Analysis	Steven Golob	In person	Shane Menzies, Sikha Pentyala, Daniil Filienko, Steven Golob, Jineta Banerjee, Luca Foschini, Martine De Cock	One of the major barriers to AI-driven medical discoveries is the limited availability of high-quality, accessible healthcare data. This is because medical data is inherently sensitive, necessitating strict privacy protections that often lead to data being siloed across clinical sites and research institutions. Lack of access to such data hinders reproducibility and slows AI adoption. To address this bottleneck, we investigate the use of Synthetic Data Generation (SDG) algorithms, capable of generating realistic data with formal privacy guarantees. Here, we investigate the extent to which state-of-the-art SDG algorithms can be applied to bulk RNA-seq data to generate high-quality genomics data suitable for downstream analysis.
2025-07-24	14:40:00	15:00:00	01C	CAMDA: Critical Assessment of Massive Data Analysis	Comparison of Single Cell RNA Synthetic Data Generators: A CAMDA Health Challenge Analysis	Patrick McKeever	In person	Patrick McKeever, Daniil Filienko, Steven Golob, Shane Menzies, Sikha Pentyala, Jineta Banerjee, Luca Foschini, Martine De Cock	Single cell RNA sequencing has a wide range of applications in medical research, allowing researchers to identify distinct cell types and consider the impact of experimental conditions on a per-cell-type basis. However, the scarcity of counts data for rare cell types or experimental conditions poses considerable difficulties in the analysis of single-cell expression data. As such, a large literature has developed around the generation of synthetic single-cell data. Synthetic single-cell expression data allows biologists to model rare cell states, test new statistical methods against a known ground truth, perform in-silico gene perturbations, and guide the development of sequencing experiment structure in advance. However, while several comparative benchmarks of single cell data exist, much less literature has considered the privacy-preserving aspects of these algorithms. In this abstract, we explore and compare multiple types of synthetic data generators (SDGs) to generate single-cell RNA-seq (scRNA-seq) data using the OneK1K dataset provided by the CAMDA Healthcare Challenge. Specifically, we evaluate both the statistical methods scDesign2 and Private-PGM (which also provides formal differential privacy guarantees) as well as the recent diffusion-based model cfDiffusion. Our analysis follows the evaluation pipeline and metrics defined by the challenge organizers. We find that scDesign2 far exceeds the other generators in terms of data quality.
2025-07-24	14:40:00	15:00:00	01C	CAMDA: Critical Assessment of Massive Data Analysis	NoisyDiffusion: Privacy Preserving Synthetic Gene Expression Data Generation	Jules Kreuer	In person	Jules Kreuer	Generating synthetic gene expression data has the potential to advance computational biology and health research by enabling broader access to data. However, creating synthetic data that is both highly faithful to the original and useful from a biological perspective while also ensuring privacy is a significant challenge. While diffusion models are powerful generative tools, their application to sensitive genomic data requires careful consideration of privacy implications, especially regarding their susceptibility to memorisation and membership inference attacks (MIAs). This project presents NoisyDiffusion: a conditional diffusion model designed to generate synthetic gene expression data while incorporating mechanisms for differential privacy to mitigate MIAs. As this project is part of the CAMDA 2025 - Health Privacy Challenge, it was evaluated on the TCGA-COMBINED and TCGA-BRCA datasets. NoisyDiffusion demonstrated strong utility, with classifiers trained on its synthetic data achieving high accuracy (e.g., 96.92% on TCGA-COMBINED) and AUPR, rivaling top non-private baselines (Multivariate, CVAE) and significantly outperforming other generative models, including those with explicit DP (DP-CVAE, CTGAN). Crucially, for privacy, Membership Inference Attack (MIA) AUCs were close to 0.5, suggesting good resilience and performance comparable to the Multivariate baseline. This work demonstrates that diffusion models can effectively generate high-quality, privacy-respecting synthetic genomic data, offering a promising pathway for advancing research while safeguarding sensitive information.
2025-07-24	15:00:00	15:20:00	01C	CAMDA: Critical Assessment of Massive Data Analysis	Reusability of Public Omics Data Across 6 Million Publications	Serghei Mangul	In person	Serghei Mangul, Viorel Munteanu, Dumitru Ciorbă, Viorel Bostan, Mihai Dimian, Nicolae Drabcinski	Over the past two decades, public repositories like GEO and SRA have accumulated vast omics datasets, sparking a crucial discussion on secondary data analysis. Access to this data is vital for reproducibility, novel experiments, meta-analyses, and new discoveries. However, the extent and factors influencing reuse have been unclear. A large-scale study analyzed over six million open-access publications from 2001 to 2025 to quantify reuse patterns and identify influencing factors. The analysis identified 213,213 omics-based publications, with approximately 65% based on secondary analysis, marking a significant shift. Since 2015, studies reusing existing gene expression data, particularly microarray data, have increasingly outnumbered those with new data. Despite this, a large portion of datasets, especially RNA-seq, remain underutilized, with over 72% of RNA-seq datasets in GEO and SRA not reused even once. Reusability varies by data type; microarray data shows the highest average Reusability Index (RI), while RNA-seq and other sequencing data have lower RIs. Human datasets consistently exhibit higher reusability than non-human ones. Significant barriers to reuse persist, including incomplete metadata, lack of standardization, and the complexity of raw data formats. Many researchers also lack the necessary computational tools or expertise. The study proposes solutions: enforcing metadata standards, integrating automated data processing tools into repositories, formally recognizing data contributions with metrics like RI and Normalized Reusability Index (NRI), and incentivizing reuse through journals and funding agencies. Addressing these challenges is crucial to unlock the full potential of existing omics data.
2025-07-24	15:00:00	15:20:00	01C	CAMDA: Critical Assessment of Massive Data Analysis	Pre-publication sharing of omics data improves paper citations	Serghei Mangul	In person	Serghei Mangul, Dhrithi Deshpande, Viorel Munteanu, Mihai Dimian, Grigore Boldirev, Alexander Zelikovsky	Advancements in omics technologies generate vast datasets, while public repositories facilitate their sharing, crucial for accelerating discovery, enhancing reproducibility, and meeting funder/journal mandates. Pre-publication data sharing, particularly alongside preprints, is increasingly beneficial, enabling early re-analysis and proving vital during public health crises like COVID-19, where data access is critical for verifying rapid findings and maintaining scientific integrity. However, a key question is whether raw omics data is consistently deposited when preprints are posted. Our study presents the first comprehensive analysis of pre-publication data sharing practices and their impact on citations in biomedical research. We analyzed 106,000 bioRxiv/medRxiv preprints and 72,715 publications with primary Gene Expression Omnibus (GEO) datasets, identifying 6,819 preprints mentioning GEO IDs and matching 2,022 preprint-publication pairs. Analysis revealed significant variability; only 29.7% of matched pairs had identical, single GEO IDs. While 71-87% of datasets were available before publication, only 9-23% were available at preprint posting. We examined the relationship between dataset release timing and citation counts, revealing statistically significant findings (Kolmogorov-Smirnov test, p = 8.596 x 10⁻⁶) indicating a discernible impact of early data availability on citation benefit. We also found over 1,600 cases where data IDs were in publications but not their preprints. Our findings reveal a fragmented landscape of pre-publication omics data sharing, challenging reproducibility and transparency.
2025-07-24	15:20:00	15:40:00	01C	CAMDA: Critical Assessment of Massive Data Analysis	HI-MGSyn: A Hypergraph and Interaction-aware Multi-Granularity Network for Predicting Synergistic Drug Combinations	Yuexi Gu	Live stream	Yuexi Gu, Jian Zu, Yongheng Sun, Louxin Zhang	Motivation: Drug combinations can not only enhance drug efficacy but also effectively reduce toxic side effects and mitigate drug resistance. With the advancement of drug combination screening technologies, large amounts of data have been generated. The availability of large data enables researchers to develop deep learning methods for predicting drug targets for synergistic combination. However, these methods still lack sufficient accuracy for practical use, and most overlook the biological significance of their models. Results: We propose the HI-MGSyn (Hypergraph and Interaction-aware Multi-granularity Network for Drug Synergy Prediction) model, which integrates a coarse-granularity module and a fine-granularity module to predict drug combination synergy. The former utilizes a hypergraph to capture global features, while the latter employs interaction-aware attention to simulate biological processes by modeling substructure-substructure and substructure-cell line interactions. HI-MGSyn outperforms state-of-the-art machine learning models on our validation datasets extracted from the DrugComb and GDSC2 databases. Furthermore, the fact that five of the 12 novel synergistic drug combinations predicted by HI-MGSyn are strongly supported by experimental evidence in the literature underscores its practical potential.
2025-07-24	15:40:00	16:00:00	01C	CAMDA: Critical Assessment of Massive Data Analysis	CAMDA Trophy			David Kreil
2025-07-24	15:40:00	16:00:00	01C	CAMDA: Critical Assessment of Massive Data Analysis	Closing remarks			David Kreil
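
The 17:20 abstract by Mei, Li, and Datta mentions selecting the index threshold with Youden's index and reporting balanced accuracy. As a minimal sketch of that general technique (not the authors' implementation), the Python snippet below scans candidate thresholds and keeps the one that maximizes J = sensitivity + specificity - 1. The function name `youden_threshold`, the score orientation (higher means healthier), and the toy data are assumptions made purely for illustration.

```python
def youden_threshold(scores, labels):
    """Return (threshold, J, balanced_accuracy) for the threshold maximizing Youden's J.

    scores: index values; higher is assumed to mean "more likely healthy".
    labels: 1 for healthy, 0 for non-healthy.
    A sample is predicted healthy when its score is >= the threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one sample of each class")

    best = None
    # Every distinct observed score is a candidate threshold.
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
        sensitivity = tp / positives
        specificity = tn / negatives
        j = sensitivity + specificity - 1
        # Strict comparison keeps the lowest threshold when J is tied.
        if best is None or j > best[1]:
            best = (t, j, (sensitivity + specificity) / 2)
    return best


# Toy data (hypothetical): six samples with one overlapping pair.
scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.8]
labels = [0, 0, 1, 0, 1, 1]
threshold, j, balanced_accuracy = youden_threshold(scores, labels)
```

Because balanced accuracy equals (J + 1) / 2, the threshold that maximizes J also maximizes balanced accuracy.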
