Return to ISMB/ECCB 2025 Homepage   Click here for the abridged agenda


Select Track: 3DSIG | Bio-Ontologies and Knowledge Representation | BioInfo-Core | Bioinfo4Women Meet-Up | Bioinformatics in the UK | BioVis | BOSC | CAMDA | CollaborationFest | CompMS | Computational Systems Immunology | Distinguished Keynotes | Dream Challenges | Education | Equity and Diversity | EvolCompGen | Fellows Presentation | Function | General Computational Biology | HiTSeq | iRNA | ISCB-China Workshop | JPI | MICROBIOME | MLCSB | NetBio | NIH Cyberinfrastructure and Emerging Technologies Sessions | NIH/Elixir | Publications - Navigating Journal Submissions | RegSys | Special Track | Stewardship Critical Infrastructure | Student Council Symposium | SysMod | Tech Track | Text Mining | The Innovation Pipeline: How Industry & Academia Can Work Together in Computational Biology | TransMed | Tutorials | VarI | WEB 2025 | Youth Bioinformatics Symposium | All


Schedule for CAMDA

NOTE: Browser resolution may limit the width of the agenda and you may need to scroll the iframe to see additional columns.
Click the buttons below to download your current table in that format

Date Start Time End Time Room Track Title Confrimed Presenter Format Authors Abstract
2025-07-23 11:20:00 12:20:00 01C CAMDA Genome-based prediction of microbial traits Thomas Rattei Thomas Rattei The prediction of phenotypic traits from genomic information is an ongoing challenge in computational biology. Although the fundamental principles of information encoding in genomes have been studied since decades and allowed first directed modifications, the expression of phenotypic traits is often the result of complex interactions. Predictive approaches in bioinformatics therefore focus on machine learning from labeled genomic data. During the last years, we have focused on the computational prediction of microbial phenotypic traits from metagenomic data. These data have been collected on large scale, to explore the diversity and composition of microbial communities and to correlate them with environmental factors (e.g. human health and disease). The prediction of traits for these millions of genomes, based on neural networks that use protein families as features, goes one step further and can be used in first applications.
2025-07-23 12:20:00 12:40:00 01C CAMDA The Anti-Microbial Resistance Prediction Challenge - Introduction Leonid Chidelevitch Leonid Chidelevitch The AMR prediction challenge at CAMDA is now in its third year. This year's challenge on predicting AMR quantiatively (MIC values) as well as qualitatively (resistance vs susceptibility) has been developed in conjunction with our CABBAGE project. CABBAGE, which stands for a Comprehensive Assessment of Bacterial-Based AMR prediction from GEnotypes, involves the collection, curation, and exploitation of all the publicly available data on AMR genotypes and phenotypes, not only from databases, but also from individual publications. In this introductory talk I will describe the process by which we arrived at the selected datasets for this year's challenge, discuss other progress we've made on CABBAGE so far, and preview the plans for next year's challenge.
2025-07-23 12:40:00 13:00:00 01C CAMDA A Hybrid Pipeline for Feature Reduction, and Ordinal Classification to Predict Antimicrobial Resistance from Genetic Profiles Anton Pashkov Adriana Haydeé Contreras Peruyero, Yesenia Villicaña Molina, Nelly Sélem Mojica, Francisco Santiago Nieto de la Rosa, Victor Muñiz Sánchez, Anton Pashkov, Johanna Atenea Carreón Baltazar, Luis Raúl Figueroa Martínez, Evelia Lorena Coss Navarrete, César Augusto Aguilar Martínez One of the three challenges proposed by the Community of Interest Critical Assessment of Massive Data Analysis (CAMDA) involves predicting antimicrobial resistance or susceptibility for nine bacterial species and four antibiotics of interest. The dataset underwent a cleaning process to remove duplicate IDs with differing MIC values or phenotypes. After data cleaning and preprocessing, three distinct strategies were implemented to perform the predictions. The first strategy focused on predicting minimum inhibitory concentration (MIC) values. We adapted machine learning models for ordinal classification, assuming MIC as an ordinal variable. Two main approaches were used: multiple binary models (logistic regression, CART, random forests) and threshold models (neural networks). Due to the high dimensionality and sparsity of the AMR gene count data, we applied preprocessing techniques including a TF-IDF-like transformation (GF-IAF) and dimensionality reduction (truncated SVD and NMF). In the second strategy, we tested several classical machine learning models to predict the phenotype directly and used a grid search to find the optimal set of parameters, without using MIC values. In the third, we applied dimensionality reduction methods such as TF-IDF, along with a biological filtering step, before predicting the phenotype. Finally, as a preliminary result, ANI and pangenome analyses of E. coli isolates revealed divergence in gene content among some strains. Accessory regions potentially linked to antibiotic resistance suggest that key resistance determinants may lie outside the core genome.
2025-07-23 14:00:00 14:40:00 01C CAMDA Predicting Antimicrobial Resistance Using Microbiome-Pretrained DNABERT2 and DBGWAS-Derived Genomic Features Jack Vaska Jack Vaska, Pratik Dutta, Max Chao, Rekha Sathian, Zhihan Zhou, Han Liu, Ramana Davuluri Antimicrobial resistance (AMR) is an escalating public health threat, especially in hospitals where diverse resistance gene reservoirs have emerged. With the increasing availability of metagenomic and whole-genome sequencing data from AMR pathogens, there is a timely opportunity to develop predictive models. Given the complexity of these genomic datasets, large language models (LLMs) offer a promising approach due to their ability to capture long-range sequence patterns. DNABERT2, an LLM pretrained on diverse DNA sequences, has shown strong performance in various genomic tasks and is well-suited for AMR prediction (Zhou et al., 2023). We present a novel method to predict AMR across nine pathogenic bacterial species treated with four common antibiotics. Four custom DNABERT2 models, pretrained on human microbiome-derived genomic sequences, were fine-tuned on sequences obtained from de novo assembled bacterial genomes. To extract phenotype-associated features, we employed De Bruijn Graph-based Genome-Wide Association Study (DBGWAS) in an alignment-free manner (Jaillard et al., 2018). Statistically significant sequences (p < 0.05) were aligned back to assemblies using BLAST (≥80% identity), and 1,000 bp flanking subsequences were extracted. Resistant samples showed a markedly higher number of BLAST hits than susceptible ones. Data were grouped by antibiotic and each group was fine-tuned using a DNABERT2 model incorporating species and BLAST hit count as additional features. Consensus predictions across sequences achieved 84.5% accuracy and a macro F1 score of 0.84. Our findings demonstrate that resistant bacteria contain distinct genomic features absent in susceptible strains, highlighting the promise of LLM-based methods for AMR prediction.
2025-07-23 14:40:00 15:00:00 01C CAMDA The Antimicrobial Resistance Prediction Challenge Alper Yurtseven Alper Yurtseven, Dilfuza Djamalova, Marco Galardini, Olga V. Kalinina Antimicrobial Resistance (AMR) is an urgent threat to human health worldwide as microbes have developed resistance to even the most advanced drugs. In this year’s CAMDA challenge, we focused on predicting antimicrobial resistance of 5,346 bacterial strains that belong to 9 different species (Acinetobacter baumannii, Campylobacter jejuni, Escherichia coli, Klebsiella pneumoniae, Neisseria gonorrhoeae, Pseudomonas aeruginosa, Salmonella enterica, Staphylococcus aureus, Streptococcus pneumoniae) using two machine learning algorithms.
2025-07-23 15:00:00 15:20:00 01C CAMDA Antimicrobial Resistance Prediction via Binary Ensemble Classifier and Assessment of Variable Importance Owen Visser Owen Visser, Victor Agboli, Somnath Datta Antimicrobial resistance (AMR) presents a growing challenge to global health, driven by antibiotic overuse and the rapid evolution of resistant bacteria. Predicting whether an isolate is resistant or susceptible to a drug remains difficult due to genomic variability. As part of the 2025 CAMDA Challenge, we altered a standard bioinformatic pipeline to preprocess the variable raw sequencing data, and features were derived from strain-specific markers and AMR gene classes. Three machine learning methods which have shown high accuracy in recent AMR prediction research were trained and compiled into an ensemble to predict binary resistance phenotypes for nine bacterial pathogens for four antibiotics. The ensemble performed well across most species, notably achieving 96.8% accuracy for C. jejuni and 98.2% for A. baumannii. Permutation-based variable importance analysis identified relevant resistance genes and strains, such as sulphonamide and aminoglycoside genes and the LAC-4 strain in A. baumannii. These results demonstrate the utility of ensemble models for AMR prediction on large, heterogeneous genomic datasets.
2025-07-23 15:00:00 15:20:00 01C CAMDA A Highly Accurate Workflow for Inference of Antimicrobial Resistance from Genetic Data Based on Machine Learning and Global Data Curation David Danko Gabor Fidler, Heather Wells, Ford Combs, John Papciak, Mara Couto-Rodriguez, Sol Rey, Tiara Rivera, Lorenzo Uccellini, Christopher Mason, Niamh O'Hara, Dorottya Nagy-Szakal, David Danko Note: This abstract is paired with the prediction submission “Base Model, 2nd Submission (Biotia)” made by user gfidler from team Biotia on May 15, 2025. We present BIOTIA-DX Resistance, our submission to the CAMDA AMR Challenge. This tool builds off of our clinically validated metagenomic workflow to provide broad domain predictions for antimicrobial resistance from microbial sequencing data. We achieved an F1 score of 84 on the CAMDA challenge test set. Our technique is based on curation of global datasets, machine learning-based predictions from input data, and highly stringent prepreprocessing of input data and databases.
2025-07-23 15:20:00 15:40:00 01C CAMDA The Gut Microbiome Health Index Challenge - Introduction Kinga Zielińska Kinga Zielińska
2025-07-23 15:40:00 16:00:00 01C CAMDA Integrating Taxonomic and Functional Features for Gut Microbiome Health Indexing Rafael Pérez Estrada Shaday Guerrero Flores, Rafael Pérez Estrada, Juan Francisco Espinosa Maya, Nelly Selem Mojica, David Alberto García Estrada, Orlando Camargo Escalante, Mario Jardón Santos, Jose Daniel Chavez Gonzalez Accurate characterization of the gut microbiome is essential for understanding its role in health and disease; however, while current indices such as GMHI and hiPCA rely on taxonomic profiles to associate microbiome composition with health states, they do not consider underlying functional variability. Here, we integrate species-level (MetaPhlAn) and pathway-level (HUMAnN) data from 4,398 samples provided by CAMDA 2025 to understand key organisms and pathways in different groups of diseases and to develop and evaluate composite health indices. We first built co-occurrence networks, identifyin keystone taxa. We then recalibrated GMHI and hiPCA for both taxonomic and functional data and developed three ensemble models. The best-performing, the Optimized Pathway Ensemble, reached an F1-score of 0.76. We extended GMHI to distinguish between disease groups and tested pairwise classifiers across conditions—including healthy, gastrointestinal, metabolic, psychiatric, and neurological disorders. Additionally, we developed the Gut Microbiome Health Calculator, a web tool for computing and comparing these indices. Our results show that combining taxonomic and functional features enhances classification and reveals biologically relevant patterns in disease.
2025-07-23 16:40:00 17:20:00 01C CAMDA Building a Rare-Disease Microbiome Health Index: Integrating Gut Metagenomes, Synthetic PKU EHRs and Rare-Variant Profiles to Forecast Phenylalanine Crises Khartik Uppalapati, Bora Yimenicioglu, Shakeel Abdulkareem, Adan Eftekhari Phenylketonuria (PKU) is an autosomal recessive metabolic disorder characterized by deficient phenylalanine hydroxylase activity, leading to episodic neurotoxic elevations in plasma phenylalanine (Phe) despite strict dietary management. However, existing gut health metrics fail to capture rare-disease–specific dysbiosis. In order to address, these concerns, we developed a Rare-Disease Microbiome Health Index (RDMHI) that integrates MetaPhlAn-derived species abundances, HUMAnN functional pathways, synthetic electronic health record timelines, and rare-variant burdens to forecast imminent Phe crises. We curated 4 398 metagenomic profiles from the CAMDA dataset alongside three external PKU cohorts (n < 100), applied centered log-ratio transformation and batch correction, and generated 5 000 patient-month windows via Synthea-augmented GAN models to simulate clinical and laboratory events. Rare-variant burdens for PAH and BH₄-pathway genes were collapsed into gene-level indicators. A LightGBM-DART classifier was trained under nested five-fold, leave-one-dataset-out cross-validation and evaluated by AUROC, AUPRC, and Matthews correlation coefficient with 1 000-sample bootstrap CIs. RDMHI achieved an AUROC of 0.91 (95 % CI 0.88–0.94), and MCC 0.64, outperforming clinical-only (AUROC 0.78; MCC 0.38) and microbiome-only (AUROC 0.81; MCC 0.45) baselines. External validation on 50 registry windows yielded an AUROC of 0.85 (0.81–0.89) and 78 % sensitivity at a 22 % false-positive rate. By outperforming existing gut-health indices (GMHI and hiPCA), RDMHI demonstrates the impact of tailoring health indices to rare diseases and establishes a new standard of microbiome-based prognostic modeling for precision risk stratification in rare metabolic disorders.
2025-07-23 17:20:00 17:40:00 01C CAMDA Toward the Development of a Novel and Comprehensive Gut Health Index: An Ensemble Model Integrating Taxonomic and Functional Profiles Vincent Mei Vincent Mei, Yulin Li, Somnath Datta Diseases linked to the gut microbiome have been on the rise, which contributes to the rising cost of healthcare and worsening patient outcomes . Since stool samples provide an accurate representation of the gut microbiome and can be collected frequently and non-invasively, it is of clinical interest to create an index that can accurately classify samples as healthy or non-healthy. Several indices already exist to assess microbiome health, such as the Gut Microbiome Health Index (GMHI), health index with PCA (hiPCA), and Shannon entropy measures, but their reliance solely on species abundance limits their ability to distinguish between healthy and non-healthy individuals. To improve upon these indices, we proposed a novel ensemble-based index that integrates both taxonomical and metabolic pathway abundance data from stool samples to predict individual health status. From the provided data with 1211 species features and 619 pathway features, 61 species and 21 pathways were identified and used to train the ensemble model. The best threshold for the index generated from the ensemble model was selected using Youden’s index, resulting in a balanced accuracy of 0.7234 compared to values below 0.5 for GMHI, hiPCA, and Shannon entropy measures. Feature importance was also calculated simultaneously with the ensemble model training by permuting one feature at a time, leading to the identification of the 20 most important species and pathways when determining gut microbiome health.
2025-07-23 17:40:00 18:00:00 01C CAMDA Topology-Enabled Integration of Taxonomic and Functional Microbiome Profiles Reveals Distinct Subgroups in Healthy Individuals Doroteya Staykova Doroteya Staykova High-throughput sequencing technologies have enabled detailed taxonomic and functional profiling of the human gut microbiome. However, integrating these diverse, high-dimensional data sources remains a major challenge - particularly in defining robust, cross-modal indicators of gut health - due to significant inter-individual variability observed even within healthy populations. In this study, I applied Topological Data Analysis (TDA) to the CAMDA 2025 Microbiome Challenge dataset to integrate taxonomic and functional profiles from healthy individuals. My primary aim was to establish a baseline for human gut health by identifying microbial patterns within a large, healthy cohort. A cross-modal network representation of over 1,600 microbiome samples was constructed using the Mapper algorithm with PHATE-based topological lenses. The derived topological shape revealed two distinct subgroups within the landscape of the healthy gut microbiome. Subsequent statistical analyses identified characteristic taxonomic and functional signatures associated with each subgroup, demonstrating the utility of TDA in uncovering intrinsic patterns and providing a data-driven framework for more precise stratification of gut health.
2025-07-23 17:40:00 18:00:00 01C CAMDA Ensemble-Based Topic Selection for Text Classification via a Grouping, Scoring, and Modeling Approach Daniel Voskergian, Burcu Bakir-Gungor, Malik Yousef The exponential growth in scientific literature, especially in biomedical domains, has intensified the need for effective automatic text classification (ATC) systems. TextNetTopics is a recent approach that classifies documents using topic-based features derived from Latent Dirichlet Allocation (LDA), reducing dimensionality while maintaining semantic richness. However, TextNetTopics’ reliance on single topic models introduces performance variability across datasets, limiting its generalizability. This study introduces ENTM-TS (Ensemble Topic Modeling for Topic Selection), a novel framework that enhances TextNetTopics by integrating multiple topic models through a three-stage Grouping, Scoring, and Modeling (GSM) approach. First, topics are extracted from various models and merged based on semantic similarity to reduce redundancy and generate discriminative topic groups. These groups are then scored using internal and external evaluation strategies, ensuring normalized comparison and identifying top-performing subsets. Finally, a modeling phase iteratively aggregates and evaluates these groups to build an optimal feature set for classification. ENTM-TS was evaluated on two biomedical text datasets: the DILI dataset and the WOS-5736 dataset of scientific abstracts. Results demonstrate that ENTM-TS consistently meets or exceeds the performance of single-model configurations, improving classification accuracy and reducing variability. This ensemble-based approach not only preserves semantic richness but also ensures robustness across diverse datasets. ENTM-TS offers a generalizable and interpretable solution for biomedical text mining, with future work aimed at automating parameter selection for greater usability.
2025-07-24 08:40:00 09:40:00 01C CAMDA Data, Diagnoses, and Discovery: Improving Healthcare through Electronic Health Records Spiros Denaxas Spiros Denaxas Electronic health records (EHRs) represent rich, multidimensional data generated through routine interactions within the healthcare system. These records have transformed biomedical research, shifting the traditional approach of studying diseases in isolation toward the simultaneous analysis of thousands of conditions. This talk will explore the unique opportunities and challenges that EHRs present to researchers and highlight best practices through examples.
2025-07-24 09:40:00 10:00:00 01C CAMDA Stage-Disease Grouping, Scoring, and Modeling for Predicting Diabetes Complications from Electronic Health Records Daniel Voskergian Daniel Voskergian, Burcu Bakir-Gungor, Malik Yousef Diabetes mellitus remains a major global health challenge, contributing significantly to morbidity, disability, and mortality. Accurate prediction of diabetes-related complications from electronic health records (EHRs) is essential for early intervention and personalized care. This study proposes a novel predictive framework that utilizes a novel feature engineering, combined with XGBoost-based feature selection and a Grouping–Scoring–Modeling (GSM) approach to improve predictive performance. Rather than relying on individual features, the proposed method constructs Stage-Disease Groups—sets of clinically related variables grouped by disease category (e.g., cardiovascular, renal) and typical onset stage (e.g., early, mid, late) following diabetes diagnosis. Each group captures interactions between variables such as age range and chronic conditions, reflecting real-world progression patterns. Predictive models were developed for four critical diabetes complications: retinopathy, chronic kidney disease, ischemic heart disease, and amputations. These models were trained on a large-scale dataset of synthetic EHRs representing nearly 1 million patients, generated using dual-adversarial autoencoders to preserve realistic temporal and clinical patterns. Results demonstrate that leveraging structured, group-based features improves both classification accuracy and model interpretability. Final models achieved accuracies between 70% and 77% and AUC scores between 76% and 84%, underscoring the effectiveness of the GSM framework in clinical risk prediction.
2025-07-24 11:20:00 12:00:00 01C CAMDA Benchmarking for Better Private Algorithms Antti Honkela Antti Honkela, Antti Honkela Responsible application of machine learning (ML) on sensitive health and genetic data requires privacy-preserving algorithms to ensure that the data are not exposed. There is even legislative pressure, especially in Europe, requiring privacy in trained ML models. My talk will discuss how to organise a challenge for privacy-preserving ML to stimulate the development of better private algorithms. This is significantly more difficult than organising regular ML challenges, because there are no straightforward means of reliably evaluating privacy, and fair comparison of solutions requires specifying a comparable privacy-utility trade-off for all participants. Building on experience from running multiple privacy-preserving ML challenges, I will review good and not so good solutions to these issues, hoping to encourage others to include a privacy component in their challenges.
2025-07-24 12:00:00 12:20:00 01C CAMDA The Health Privacy Challenge - Introduction Hakime Öztürk Hakime Öztürk The Health Privacy Challenge, which is organized in the context of the European Lighthouse on Safe and Secure AI (ELSA, https://elsa-ai.eu/), explores the privacy-preserving aspect of synthetic data generation models in the context of biological datasets. The challenge, through a
2025-07-24 12:20:00 13:00:00 01C CAMDA The Health Privacy - panel discussion Spiros Denaxas, Oliver Stegle, Antti Honkela, David Kreil, Wenzhong Xiao, Joaquin Dopazo, Spiros Denaxas, Oliver Stegle, Antti Honkela, David Kreil, Wenzhong Xiao, Joaquin Dopazo
2025-07-24 14:00:00 14:20:00 01C CAMDA Synthetic genomic data generation through Differential Privacy-enhanced Non-Negative Matrix Factorization (DP-NMF) Andrew Wicks Andrew Wicks, Kyle Fogerty Generation of synthetic genomics data is increasingly considered as a routine approach for safely sharing sensitive genomic datasets. While traditional data-sharing methods often expose participants to privacy risks such as membership-inference attacks, the necessity of such methods may be reconsidered in favor of privacy-preserving alternatives. In this work, we outline scenarios in genomics research where synthetic data generation via non-negative matrix factorization (NMF) can effectively replace direct data sharing, thereby significantly enhancing privacy. We introduce a simple yet robust heuristic leveraging differential privacy (DP) integrated into NMF-based clustering, combined with a zero-inflated negative binomial or poisson sampling strategy. We demonstrate the utility and viability of this method through proof-of-concept evaluations on real genomic data, discuss practical use-cases, and highlight broader implications for secure and privacy-compliant genomic data dissemination.
2025-07-24 14:20:00 14:40:00 01C CAMDA Synthetic Data Generation for bulk RNA-seq Data: A CAMDA Health Challenge Analysis Steven Golob Shane Menzies, Sikha Pentyala, Daniil Filienko, Steven Golob, Jineta Banerjee, Luca Foschini, Martine De Cock One of the major barriers to AI-driven medical discoveries is the limited availability of high-quality, accessible healthcare data. This is because medical data is inherently sensitive, necessitating strict privacy protections that often lead to data being siloed across clinical sites and research institutions. Lack of access to such data hinders reproducibility and slows down the AI adaption. To address this bottleneck, we investigate the use of Synthetic Data Generation (SDG) algorithms, capable of generating realistic data with formal privacy guarantees. Here, we investigate the extent to which state-of-the-art SDG algorithms can be applied to bulk RNA-seq data to generate high-quality genomics data suitable for downstream analysis.
2025-07-24 14:40:00 15:00:00 01C CAMDA Comparison of Single Cell RNA Synthetic Data Generators: A CAMDA Health Challenge Analysis Patrick McKeever Patrick McKeever, Daniil Filienko, Steven Golob, Shane Menzies, Sikha Pentyala, Jineta Banerjee, Luca Foschini, Martine De Cock Single cell RNA sequencing has a wide range of applications in medical research, allowing researchers to identify distinct cell types and consider the impact of experimental conditions on a per-cell-type basis. However, the scarcity of counts data for rare cell types or experimental conditions poses considerable difficulties in the analysis of single-cell expression data. As such, a large literature has developed around the generation of synthetic single-cell data. Synthetic single-cell expression data allows biologists to model rare cell states, test new statistical methods against a known ground truth, perform in-silico gene perturbations, and guide the development of sequencing experiment structure in advance. However, while several comparative benchmarks of single cell data exist, much less literature has considered the privacy-preserving aspects of these algorithms. This extended abstract In this abstract, we explore and compare multiple types of synthetic data generators (SDGs) to generate single-cell RNA-seq (scRNA-seq) data using the OneK1K dataset provided by the CAMDA Healthcare Challenge. Specifically, we evaluate both the statistical methods scDesign2 \cite{sun2021_scdesign} and Private-PGM (which also provides formal differential privacy guarantees) as well as the recent diffusion-based modelcfDiffusion. Our analysis follows the evaluation pipeline and metrics defined by the challenge organizers. We find that scDesign2 far exceeds the other generators in terms of data quality.
2025-07-24 14:40:00 15:00:00 01C CAMDA NoisyDiffusion: Privacy Preserving Synthetic Gene Expression Data Generation Jules Kreuer Jules Kreuer Generating synthetic gene expression data has the potential to advance computational biology and health research by enabling broader access to data. However, creating synthetic data that is both highly faithful to the original and useful from a biological perspective while also ensuring privacy is a significant challenge. While diffusion models are powerful generative tools, their application to sensitive genomic data requires careful consideration of privacy implications, especially regarding their susceptibility to memorisation and membership inference attacks (MIAs). This project presents NoisyDiffusion: a conditional diffusion model designed to generate synthetic gene expression data while incorporating mechanisms for differential privacy to mitigate MIAs. As this project is part of the CAMDA 2025 - Health Privacy Challenge, it was evaluated on the TCGA-COMBINED and TCGA-BRCA datasets. NoisyDiffusion demonstrated strong utility, with classifiers trained on its synthetic data achieving high accuracy (e.g., 96.92% on TCGA-COMBINED) and AUPR, rivaling top non-private baselines (Multivariate, CVAE) and significantly outperforming other generative models, including those with explicit DP (DP-CVAE, CTGAN). Crucially, for privacy, Membership Inference Attack (MIA) AUCs were close to 0.5, suggesting good resilience and performance comparable to the Multivariate baseline. This work demonstrates that diffusion models can effectively generate high-quality, privacy-respecting synthetic genomic data, offering a promising pathway for advancing research while safeguarding sensitive information.
2025-07-24 15:00:00 15:20:00 01C CAMDA Reusability of Public Omics Data Across 6 Million Publications Serghei Mangul Serghei Mangul, Viorel Munteanu, Dumitru Ciorbă, Viorel Bostan, Mihai Dimian, Nicolae Drabcinski Over the past two decades, public repositories like GEO and SRA have accumulated vast omics datasets, sparking a crucial discussion on secondary data analysis. Access to this data is vital for reproducibility, novel experiments, meta-analyses, and new discoveries. However, the extent and factors influencing reuse have been unclear. A large-scale study analyzed over six million open-access publications from 2001 to 2025 to quantify reuse patterns and identify influencing factors. The analysis identified 213,213 omics-based publications, with approximately 65% based on secondary analysis, marking a significant shift. Since 2015, studies reusing existing gene expression data, particularly microarray data, have increasingly outnumbered those with new data. Despite this, a large portion of datasets, especially RNA-seq, remain underutilized, with over 72% of RNA-seq datasets in GEO and SRA not reused even once. Reusability varies by data type; microarray data shows the highest average Reusability Index (RI), while RNA-seq and other sequencing data have lower RIs. Human datasets consistently exhibit higher reusability than non-human ones. Significant barriers to reuse persist, including incomplete metadata, lack of standardization, and the complexity of raw data formats. Many researchers also lack the necessary computational tools or expertise. The study proposes solutions: enforcing metadata standards, integrating automated data processing tools into repositories, formally recognizing data contributions with metrics like RI and Normalized Reusability Index (NRI), and incentivizing reuse through journals and funding agencies. Addressing these challenges is crucial to unlock the full potential of existing omics data.
2025-07-24 15:00:00 15:20:00 01C CAMDA Pre-publication sharing of omics data improves paper citations Serghei Mangul Serghei Mangul, Dhrithi Deshpande, Viorel Munteanu, Mihai Dimian, Grigore Boldirev, Alexander Zelikovsky Advancements in omics technologies generate vast datasets, while public repositories facilitate their sharing, crucial for accelerating discovery, enhancing reproducibility, and meeting funder/journal mandates. Pre-publication data sharing, particularly alongside preprints, is increasingly beneficial, enabling early re-analysis and proving vital during public health crises like COVID-19, where data access is critical for verifying rapid findings and maintaining scientific integrity. However, a key question is whether raw omics data is consistently deposited when preprints are posted. Our study presents the first comprehensive analysis of pre-publication data sharing practices and their impact on citations in biomedical research. We analyzed 106,000 bioRxiv/medRxiv preprints and 72,715 publications with primary Gene Expression Omnibus (GEO) datasets, identifying 6,819 preprints mentioning GEO IDs and matching 2,022 preprint-publication pairs. Analysis revealed significant variability; only 29.7% of matched pairs had identical, single GEO IDs. While 71-87% of datasets were available before publication, only 9-23% were available at preprint posting. We examined the relationship between dataset release timing and citation counts, revealing statistically significant findings (Kolmogorov-Smirnov test, p = 8.596 x 10⁻⁶) indicating a discernible impact of early data availability on citation benefit. We also found over 1,600 cases where data IDs were in publications but not their preprints. Our findings reveal a fragmented landscape of pre-publication omics data sharing, challenging reproducibility and transparency.
2025-07-24 15:20:00 15:40:00 01C CAMDA HI-MGSyn: A Hypergraph and Interaction-aware Multi-Granularity Network for Predicting Synergistic Drug Combinations Yuexi Gu Yuexi Gu, Jian Zu, Yongheng Sun, Louxin Zhang Motivation: Drug combinations can not only enhance drug efficacy but also effectively reduce toxic side effects and mitigate drug resistance. With the advancement of drug combination screening technologies, large amounts of data have been generated. The availability of large data enables researchers to develop deep learning methods for predicting drug targets for synergistic combination. However, these methods still lack sufficient accuracy for practical use, and most overlook the biological significance of their models. Results: We propose the HI-MGSyn (Hypergraph and Interaction-aware Multi-granularity Network for Drug Synergy Prediction) model, which integrates a coarse-granularity module and a fine-granularity module to predict drug combination synergy. The former utilizes a hypergraph to capture global features, while the latter employs interaction-aware attention to simulate biological processes by modeling substructure-substructure and substructure-cell line interactions. HI-MGSyn outperforms state-of-the-art machine learning models on our validation datasets extracted from the DrugComb and GDSC2 databases. Furthermore, the fact that five of the 12 novel synergistic drug combinations predicted by HI-MGSyn are strongly supported by experimental evidence in the literature underscores its practical potential.
2025-07-24 15:40:00 16:00:00 01C CAMDA CAMDA Trophy David Kreil
2025-07-24 15:40:00 16:00:00 01C CAMDA Closing remarks David Kreil

- top -