The SciFinder tool lets you search Titles, Authors, and Abstracts of talks and panels. Enter your search term below and your results will be shown at the bottom of the page. You can also click on a track to see all the talks given in that track on that day.

Results

July 21, 2025
11:20-12:20
Invited Presentation: Where does it hurt (in your genome)?
Track: MLCSB: Machine Learning in Computational and Systems Biology

Room: 01A
Moderator(s): Anshul Kundaje


Authors List:

  • Julien Gagneur

Presentation Overview:

The identification of genetic variants that strongly affect phenotypes remains an unsolved problem with major relevance in rare disease diagnostics, oncology, and the identification of effector genes of complex traits and diseases.

I will present a series of published and ongoing work from my lab tackling this issue, with a focus on non-coding variants. This will span variant scoring based on genomic language models [1], methods to predict aberrant expression [2] and splicing [3], all the way to integrative deep learning models for rare variant association analyses demonstrated on UK Biobank [4].

1. Tomaz da Silva et al. Nucleotide dependency analysis of DNA language models reveals genomic functional elements. bioRxiv, 2024
2. Hölzlwimmer et al. Aberrant gene expression prediction across human tissues. Nature Communications, 2025
3. Wagner et al. Aberrant splicing prediction across human tissues. Nature Genetics, 2023
4. Clarke, Holtkamp, et al. Integration of variant annotations using deep set networks boosts rare variant association genetics. Nature Genetics, 2024

July 21, 2025
12:20-12:40
Proceedings Presentation: Locality-aware pooling enhances protein language model performance across varied applications
Confirmed Presenter: Minh Hoang, Princeton University, United States
Track: MLCSB: Machine Learning in Computational and Systems Biology

Room: 01A
Format: In person
Moderator(s): Anshul Kundaje


Authors List:

  • Minh Hoang, Princeton University
  • Mona Singh, Princeton University

Presentation Overview:

Protein language models (PLMs) are amongst the most exciting recent advances for characterizing protein sequences, and have enabled a diverse set of applications including structure determination, functional property prediction, and mutation impact assessment, all from single protein sequences alone. State-of-the-art PLMs leverage transformer architectures originally developed for natural language processing, and are pre-trained on large protein databases to generate contextualized representations of individual amino acids. To harness the power of these PLMs to predict protein-level properties, these per-residue embeddings are typically "pooled" to fixed-size vectors that are further utilized in downstream prediction networks. Common pooling strategies include Cls-Pooling and Avg-Pooling, but neither of these approaches can capture the local substructures and long-range interactions observed in proteins. To address these weaknesses in existing PLM pooling strategies, we propose the use of attention pooling, which can naturally capture these important features of proteins.
To make the expensive attention operator (quadratic in length of the input protein) feasible in practice, we introduce bag-of-mer pooling (BoM-Pooling), a locality-aware hierarchical pooling technique that combines windowed average pooling with attention pooling. We empirically demonstrate that both full attention pooling and BoM-Pooling outperform previous pooling strategies on three important, diverse tasks: (1) predicting the activities of two proteins as they are varied; (2) detecting remote homologs; and (3) predicting signaling interactions with peptides. Overall, our work highlights the advantages of biologically inspired pooling techniques in protein sequence modeling and is a step towards more effective adaptations of language models in biological settings.
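
Editorial illustration: the pooling strategies named in this abstract can be sketched in a few lines of plain Python. This is a minimal sketch, not the authors' implementation. The non-overlapping windows, the single query vector standing in for learned attention parameters, and all function names are assumptions made for the example.

```python
import math

def cls_pool(embeddings):
    """Cls-style pooling: use the first token's vector as the summary."""
    return list(embeddings[0])

def mean_pool(embeddings):
    """Avg-style pooling: average all per-residue vectors."""
    dim = len(embeddings[0])
    return [sum(vec[d] for vec in embeddings) / len(embeddings) for d in range(dim)]

def windowed_mean_pool(embeddings, window):
    """Average consecutive, non-overlapping windows of residues (assumed layout)."""
    return [mean_pool(embeddings[i:i + window])
            for i in range(0, len(embeddings), window)]

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(vectors, query):
    """Weight each vector by softmax(query . vector / sqrt(dim)) and sum."""
    dim = len(query)
    scores = [sum(q * v for q, v in zip(query, vec)) / math.sqrt(dim)
              for vec in vectors]
    weights = softmax(scores)
    return [sum(w * vec[d] for w, vec in zip(weights, vectors)) for d in range(dim)]

def windowed_attention_pool(embeddings, query, window):
    """Shorten the sequence with windowed averaging, then attend over windows."""
    return attention_pool(windowed_mean_pool(embeddings, window), query)
```

Attending over L / window averaged vectors instead of L residues is what keeps the attention step affordable in this sketch. With an all-zero query the attention weights are uniform, so the result falls back to plain averaging over windows.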

July 21, 2025
12:40-13:00
Proceedings Presentation: NEAR: Neural Embeddings for Amino acid Relationships
Confirmed Presenter: Daniel Olson, University of Montana, United States
Track: MLCSB: Machine Learning in Computational and Systems Biology

Room: 01A
Format: In person
Moderator(s): Anshul Kundaje


Authors List:

  • Daniel Olson, University of Montana
  • Thomas Colligan, University of Arizona
  • Daphne Demekas, University of Arizona
  • Jack Roddy, University of Arizona
  • Ken Youens-Clark, University of Arizona
  • Travis Wheeler, University of Arizona

Presentation Overview:

Protein language models (PLMs) have recently demonstrated potential to supplant classical protein database search methods based on sequence alignment, but are slower than common alignment-based tools and appear to be prone to a high rate of false labeling.
Here, we present NEAR, a method based on neural representation learning that is designed to improve both speed and accuracy of search for likely homologs in a large protein sequence database.
NEAR’s ResNet embedding model is trained using contrastive learning guided by trusted sequence alignments. It computes per-residue embeddings for target and query protein sequences, and identifies alignment candidates with a pipeline consisting of residue-level k-NN search and a simple neighbor aggregation scheme.
Tests on a benchmark consisting of trusted remote homologs and randomly shuffled decoy sequences reveal that NEAR substantially improves accuracy relative to state-of-the-art PLMs, with lower memory requirements and faster embedding / search speed. While these results suggest that the NEAR model may be useful for standalone homology detection with increased sensitivity over standard alignment-based methods, in this manuscript we focus on a more straightforward analysis of the model's value as a high-speed pre-filter for sensitive annotation. In that context, NEAR is at least 5x faster than the pre-filter currently used in the widely-used profile hidden Markov model (pHMM) search tool, HMMER3, and also outperforms the pre-filter used in our fast pHMM tool, nail.
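
Editorial illustration: the search pipeline described above (per-residue embeddings, residue-level k-NN search, neighbor aggregation) can be pictured with the following minimal sketch. It is not NEAR's code. The brute-force cosine search, the sum-of-similarities aggregation, and the names `cosine` and `rank_targets` are assumptions chosen for brevity; a practical system would use an approximate nearest-neighbor index.

```python
import math
from collections import defaultdict

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank_targets(query_residues, target_db, k=1):
    """Rank target sequences by aggregated residue-level neighbor hits.

    query_residues: list of per-residue embedding vectors for the query.
    target_db: dict mapping a target id to its list of per-residue vectors.
    """
    # Flatten the database into (target id, residue vector) entries.
    index = [(tid, vec) for tid, residues in target_db.items() for vec in residues]
    scores = defaultdict(float)
    for q in query_residues:
        # Brute-force k nearest target residues for this query residue.
        hits = sorted(((cosine(q, vec), tid) for tid, vec in index), reverse=True)[:k]
        for similarity, tid in hits:
            scores[tid] += similarity  # assumed aggregation: sum of similarities
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Targets that collect many strong residue-level hits rise to the top of the ranking and would be passed on as alignment candidates.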

July 21, 2025
14:00-14:20
Proceedings Presentation: LoRA-DR-suite: adapted embeddings predict intrinsic and soft disorder from protein sequences
Confirmed Presenter: Gianluca Lombardi, Sorbonne Université, France
Track: MLCSB: Machine Learning in Computational and Systems Biology

Room: 01A
Format: In person
Moderator(s): Oznur Tastan


Authors List:

  • Gianluca Lombardi, Sorbonne Université
  • Beatriz Seoane, Universidad Complutense de Madrid
  • Alessandra Carbone, Sorbonne Université

Presentation Overview:

Intrinsic disorder regions (IDRs) and soft disorder regions (SDRs) provide crucial information on a protein's structure that underpins its functioning, interaction with other molecules and assembly path. Circular dichroism experiments are used to identify intrinsic disorder residues, while SDRs are characterized using B-factors, missing residues, or a combination of both in alternative X-ray crystal structures of the same molecule. These flexible regions in proteins are particularly significant in diverse biological processes and are often implicated in pathological conditions. Accurate computational prediction of these disordered regions is thus essential for advancing protein research and understanding their functional implications. To address this challenge, LoRA-DR-suite employs a simple adapter-based architecture that utilizes protein language model embeddings as protein sequence representations, enabling the precise prediction of IDRs and SDRs directly from primary sequence data. Alongside the fast LoRA-DR-suite implementation, we release SoftDis, a unique soft disorder database constructed for approximately 500,000 PDB chains. SoftDis is designed to facilitate new research, testing, and applications on soft disorder, advancing the study of protein dynamics and interactions.

July 21, 2025
14:20-14:40
Proceedings Presentation: TCR-epiDiff: Solving Dual Challenges of TCR Generation and Binding Prediction
Confirmed Presenter: Se Yeon Seo, Soongsil University, South Korea
Track: MLCSB: Machine Learning in Computational and Systems Biology

Room: 01A
Format: In person
Moderator(s): Oznur Tastan


Authors List:

  • Se Yeon Seo, Soongsil University
  • Je-Keun Rhee, Soongsil University

Presentation Overview:

Motivation: T-cell receptors (TCRs) are fundamental components of the adaptive immune system, recognizing specific antigens for targeted immune responses. Understanding their sequence patterns is essential for designing effective vaccines and immunotherapies. However, the vast diversity of TCR sequences and complex binding mechanisms pose significant challenges in generating TCRs that are specific to a particular epitope.
Results: Here, we propose TCR-epiDiff, a diffusion-based deep learning model for generating epitope-specific TCRs and predicting TCR-epitope binding. TCR-epiDiff integrates epitope information during TCR sequence embedding using ProtT5-XL and employs a denoising diffusion probabilistic model for sequence generation. Using external validation datasets, we demonstrate the ability to generate biologically plausible, epitope-specific TCRs. Furthermore, we leverage the model's encoder to develop a TCR-epitope binding predictor that shows robust performance on the external validation data. Our approach provides a comprehensive solution for both de novo generation of epitope-specific TCRs and TCR-epitope binding prediction. This capability provides valuable insights into immune diversity and has the potential to advance targeted immunotherapies.
Availability and implementation: The data and source codes for our experiments are available at https://github.com/seoseyeon/TCR-epiDiff
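
Editorial illustration: the abstract mentions a denoising diffusion probabilistic model. The sketch below shows only the standard forward (noising) step of such a model, applied to a generic continuous embedding vector. It is not taken from the TCR-epiDiff repository. The linear beta schedule, its default values, and the function names are assumptions for the example.

```python
import math
import random

def linear_beta_schedule(steps, beta_start=1e-4, beta_end=0.02):
    """Noise variances beta_1..beta_T on a linear ramp (assumed schedule)."""
    if steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (steps - 1)
    return [beta_start + i * step for i in range(steps)]

def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(x0, t, alpha_bars, rng):
    """Sample x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise."""
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * n for x, n in zip(x0, noise)]
    return xt, noise
```

During training, a denoising network would be asked to predict `noise` from `xt` and `t`; generation then runs the learned denoising steps in reverse, starting from pure noise.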

July 21, 2025
14:40-15:00
Proceedings Presentation: Incorporating Hierarchical Information into Multiple Instance Learning for Patient Phenotype Prediction with scRNA-seq Data
Confirmed Presenter: Chau Do, Aalto University, Finland
Track: MLCSB: Machine Learning in Computational and Systems Biology

Room: 01A
Format: Live stream
Moderator(s): Oznur Tastan


Authors List:

  • Chau Do, Aalto University
  • Harri Lähdesmäki, Aalto University

Presentation Overview:

Multiple Instance Learning (MIL) provides a structured approach to patient phenotype prediction with single-cell RNA-sequencing (scRNA-seq) data. However, existing MIL methods tend to overlook the hierarchical structure inherent in scRNA-seq data, especially the biological groupings of cells, or cell types. This limitation may lead to suboptimal performance and poor interpretability at higher levels of cellular division. To address this gap, we present a novel approach to incorporate hierarchical information into the attention-based MIL framework. Specifically, our model applies the attention-based aggregation mechanism over both cells and cell types, thus enforcing a hierarchical structure on the flow of information throughout the model. Across extensive experiments, our proposed approach consistently outperforms existing models and demonstrates robustness in data-constrained scenarios. Moreover, ablation test results show that simply applying the attention mechanism on cell types instead of cells leads to improved performance, underscoring the benefits of incorporating the hierarchical groupings. By identifying the critical cell types that are most relevant for prediction, we show that our model is capable of capturing biologically meaningful associations, thus facilitating biological discoveries.
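
Editorial illustration: the two-level aggregation described above (attention over cells within each cell type, then attention over cell types) can be sketched as follows. This is a simplified picture, not the authors' model. The linear scoring vectors `cell_scorer` and `type_scorer` stand in for learned attention networks and are assumptions, as are the function names.

```python
import math

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_aggregate(vectors, scorer):
    """Return the attention-weighted sum of vectors and the weights used."""
    scores = [sum(w * x for w, x in zip(scorer, vec)) for vec in vectors]
    weights = softmax(scores)
    dim = len(vectors[0])
    pooled = [sum(w * vec[d] for w, vec in zip(weights, vectors)) for d in range(dim)]
    return pooled, weights

def hierarchical_mil(cells_by_type, cell_scorer, type_scorer):
    """Aggregate cells into cell-type vectors, then cell types into a patient vector."""
    type_names = sorted(cells_by_type)
    type_vectors = [attention_aggregate(cells_by_type[name], cell_scorer)[0]
                    for name in type_names]
    patient_vector, type_weights = attention_aggregate(type_vectors, type_scorer)
    return patient_vector, dict(zip(type_names, type_weights))
```

The returned per-type weights are what make the second level interpretable: a cell type with a large weight contributed most to the patient-level representation.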

July 21, 2025
15:00-15:10
PANDORA: Peptide-based ANtimicrobial Design Optimized by Reinforcement Automation
Confirmed Presenter: Julián García-Vinuesa, University of Chile, Chile
Track: MLCSB: Machine Learning in Computational and Systems Biology

Room: 01A
Format: In person
Moderator(s): Oznur Tastan


Authors List:

  • Julián García-Vinuesa, University of Chile
  • Nicole Soto, University of Magallanes
  • David Medina-Ortiz, University of Magallanes

Presentation Overview:

The global rise of antibiotic resistance, particularly in critical pathogens such as Acinetobacter baumannii, Pseudomonas aeruginosa, and Enterobacteriaceae, demands innovative therapeutic strategies. Despite their promise, conventional drug discovery pipelines remain slow, costly, and often ineffective. In contrast, antimicrobial peptides (AMPs) offer a compelling alternative due to their specificity, low resistance induction, and multifunctional activity, although their development is hindered by issues like poor stability and immunogenicity. To address these challenges, we introduce PANDORA (Peptide-based ANtimicrobial Design Optimized by Reinforcement Automation), an autonomous AI-driven platform for the optimized design of antibiotic peptides. PANDORA integrates predictive models (e.g., AMP activity, half-life, toxicity), generative transformer-based architectures for de novo sequence design, and explainable AI for property inference. All components are coordinated by a multi-agent reinforcement learning system that enables adaptive, end-to-end peptide engineering. Predictive models, fine-tuned on protein language embeddings with advanced feature extraction, reach classification accuracies above 90% for antimicrobial activity and >85% for toxicity risk estimation. Generative models, guided by physicochemical constraints, yield diverse candidate peptides with optimized therapeutic profiles. The platform supports natural language prompts and user-defined constraints, offering a flexible, user-friendly interface. Preliminary candidates are undergoing experimental validation against WHO-priority bacteria, with feedback used to further train the system via reinforcement learning. PANDORA represents a scalable, autonomous solution that fuses artificial intelligence, automation, and experimental validation. 
Its capacity to iteratively learn and adapt positions it as a transformative tool for accelerating peptide-based drug discovery in the fight against antibiotic resistance.

July 21, 2025
15:10-15:20
NetStart 2.0: Prediction of Eukaryotic Translation Initiation Sites Using a Protein Language Model
Confirmed Presenter: Line Sandvad Nielsen, Section for Computational and RNA Biology, Department of Biology
Track: MLCSB: Machine Learning in Computational and Systems Biology

Room: 01A
Format: In person
Moderator(s): Oznur Tastan


Authors List:

  • Line Sandvad Nielsen, Section for Computational and RNA Biology
  • Anders Gorm Pedersen, Section for Bioinformatics
  • Ole Winther, Section for Computational and RNA Biology
  • Henrik Nielsen, Section for Bioinformatics

Presentation Overview:

Background: Accurate identification of translation initiation sites is essential for the proper translation of mRNA into functional proteins. In eukaryotes, the choice of the translation initiation site is influenced by multiple factors, including its proximity to the 5' end and the local start codon context. Translation initiation sites mark the transition from non-coding to coding regions. This fact motivates the expectation that the upstream sequence, if translated, would assemble a nonsensical order of amino acids, while the downstream sequence would correspond to the structured beginning of a protein. This distinction suggests potential for predicting translation initiation sites using a protein language model.

Results: We present NetStart 2.0, a deep learning-based model that integrates the ESM-2 protein language model with the local sequence context to predict translation initiation sites across a broad range of eukaryotic species. NetStart 2.0 was trained as a single model across multiple species, and despite the broad phylogenetic diversity represented in the training data, it consistently relied on features marking the transition from non-coding to coding regions.

Conclusion: By leveraging "protein-ness", NetStart 2.0 achieves state-of-the-art performance in predicting translation initiation sites across a diverse range of eukaryotic species. This success underscores the potential of protein language models to bridge transcript- and peptide-level information in complex biological prediction tasks. The NetStart 2.0 webserver is available at: https://services.healthtech.dtu.dk/services/NetStart-2.0/
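
Editorial illustration: the idea in the Background, that sequence upstream of a true start codon translates to nonsense while the downstream sequence looks like the start of a protein, can be made concrete with a small sketch. It is not NetStart 2.0's pipeline. It only shows how the two translated flanks around a candidate ATG could be obtained using the standard genetic code; the function names and the DNA-alphabet (T rather than U) input are assumptions.

```python
# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))

def translate(dna):
    """Translate complete codons; unknown codons become 'X', stops become '*'."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))

def candidate_starts(transcript):
    """Positions of every ATG in the transcript (uppercase DNA alphabet)."""
    return [i for i in range(len(transcript) - 2) if transcript[i:i + 3] == "ATG"]

def flanking_peptides(transcript, atg_index):
    """Translate upstream and downstream sequence in the frame of the given ATG."""
    frame_start = atg_index % 3  # first upstream position in the same frame
    upstream = translate(transcript[frame_start:atg_index])
    downstream = translate(transcript[atg_index:])
    return upstream, downstream
```

For the toy transcript `TTTTAAATGGCCAAA`, the only candidate is at position 6; the upstream flank translates to `F*` (interrupted by a stop) and the downstream flank to `MAK`. A protein language model could then be asked how protein-like each flank appears.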

July 21, 2025
15:20-15:30
Towards a more inductive world for drug repurposing approaches
Confirmed Presenter: Uxía Veleiro, Center for Applied Medical Research (CIMA) University of Navarra, Spain
Track: MLCSB: Machine Learning in Computational and Systems Biology

Room: 01A
Format: In person
Moderator(s): Oznur Tastan


Authors List:

  • Jesus De la Fuente Cedeño, Center for Applied Medical Research (CIMA) University of Navarra
  • Guillermo Serrano, King Abdullah University of Science and Technology
  • Uxía Veleiro, Center for Applied Medical Research (CIMA) University of Navarra
  • Mikel Casals, TECNUN University of Navarra
  • Laura Vera, Center for Applied Medical Research (CIMA)
  • Marija Pizurica, Stanford University
  • Nuria Gomez-Cebrian, Instituto de Investigación Sanitaria La Fe
  • Leonor Puchades-Carrasco, Instituto de Investigación Sanitaria La Fe
  • Antonio Pineda-Lucena, Center for Applied Medical Research (CIMA)
  • Idoia Ochoa, TECNUN University of Navarra
  • Silve Vicent, Center for Applied Medical Research (CIMA) University of Navarra
  • Olivier Gevaert, Stanford University

Presentation Overview:

Drug–target interaction (DTI) prediction is a challenging albeit essential task in drug repurposing. Learning on graph models has drawn special attention as they can substantially reduce drug repurposing costs and time commitment. However, many current approaches require high-demand additional information besides DTIs that complicates their evaluation process and usability. Additionally, structural differences in the learning architecture of current models hinder their fair benchmarking. In this work, we first perform an in-depth evaluation of current DTI datasets and prediction models through a robust benchmarking process and show that DTI methods based on transductive models lack generalization and lead to inflated performance when traditionally evaluated, making them unsuitable for drug repurposing. We then propose a biologically driven strategy for negative-edge subsampling and uncover, via in vitro validation, previously unknown interactions missed by traditional subsampling. Finally, we provide a toolbox from all generated resources, crucial for fair benchmarking and robust model design.
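
Editorial illustration: the contrast between transductive and inductive evaluation comes down to whether test drugs were already seen during training. The sketch below shows one simple way to build an inductive split by holding out whole drugs. It is not the authors' benchmarking toolbox; the function name, the drug-level (rather than target-level) hold-out, and the rounding rule are assumptions.

```python
import random

def inductive_split(pairs, test_fraction, seed=0):
    """Split (drug, target) pairs so that test drugs never appear in training."""
    drugs = sorted({drug for drug, _ in pairs})
    rng = random.Random(seed)
    rng.shuffle(drugs)
    # Hold out at least one drug, otherwise roughly test_fraction of them.
    n_test = max(1, round(len(drugs) * test_fraction))
    test_drugs = set(drugs[:n_test])
    train = [pair for pair in pairs if pair[0] not in test_drugs]
    test = [pair for pair in pairs if pair[0] in test_drugs]
    return train, test
```

A random split over individual pairs would let the same drug appear on both sides, which is the setting in which, according to the abstract, performance looks inflated.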

July 21, 2025
15:30-15:40
Hierarchical Multi-Agent Reinforcement Learning For Optimizing CRISPR-Based Polygenic Therapeutic Design
Confirmed Presenter: Nhung Duong, Hanoi University of Pharmacy, Viet Nam
Track: MLCSB: Machine Learning in Computational and Systems Biology

Room: 01A
Format: In person
Moderator(s): Oznur Tastan


Authors List:

  • Nhung Duong, Hanoi University of Pharmacy
  • Tuan Do, N2TP Technology Solutions JSC
  • Anh Truong, Hanoi University of Pharmacy
  • Ngoc Do, Hanoi University of Pharmacy
  • Lap Nguyen, Hanoi University of Pharmacy

Presentation Overview:

Background. CRISPR therapies for polygenic disorders require simultaneous optimization of multiple guides and delivery constraints. Current approaches optimize guides individually, neglecting target interactions. We developed a hierarchical multi-agent reinforcement learning (MARL) framework to optimize CRISPR strategies for polygenic diseases while balancing efficiency, synergy, vector capacity, and immunogenicity.

Methods. We implemented a three-layer MARL architecture in which the layers (1) optimize guide RNA sequences for each target gene; (2) select optimal editing modes and guide combinations while maximizing synergy; and (3) ensure that vector capacity and immunogenicity constraints are satisfied. For computational feasibility, we used a simplified 1 Mb mini-genome for off-target analysis rather than the full human genome, and employed approximated versions of RuleSet2 scoring, synergy effect and immunogenicity prediction. We trained the framework using iterative policy optimization and validated it on a model polygenic retinal disease involving five genes with known pathogenic mutations.

Results. The framework generated optimized guides with high on-target efficiency (0.70-1.00) and minimal off-target effects. The system selected optimal editing modes (prime editing for PDE6B, RHO, USH2A; base editing for RP1) while maximizing synergistic effects. The final design utilized Cas9 with a total size within the 4500 bp capacity and zero immunogenic epitopes, requiring only two rework iterations.

Conclusion. Our MARL approach demonstrates AI's potential for solving complex therapeutic design challenges. The framework navigates multi-dimensional optimization involving sequence, strategy, and clinical constraints simultaneously. While current implementation uses simplified biological modeling, the foundation is robust and provides a proof-of-concept for AI-guided design of personalized CRISPR therapeutics for polygenic diseases.

July 21, 2025
15:40-15:50
Developing a Deep Learning Model for Single-Cell RNA Splicing Analysis
Confirmed Presenter: Luyang Li, Division of Infection & Immunity, School of Medicine
Track: MLCSB: Machine Learning in Computational and Systems Biology

Room: 01A
Format: In person
Moderator(s): Oznur Tastan


Authors List:

  • Luyang Li, Division of Infection & Immunity
  • You Zhou, Systems Immunity University Research Institute and the Division of Infection & Immunity

Presentation Overview:

Approximately 95% of human genes undergo alternative splicing (AS), a process that allows a single gene to produce multiple proteins with distinct functions. This mechanism enormously increases the complexity of our genome and plays an important role in maintaining health. Disruptions in normal splicing can lead to various diseases, and predicting AS events and understanding their regulatory mechanisms at the single-cell level can open the door to discovering new therapeutic targets. Despite the importance of RNA splicing, current single-cell RNA sequencing (scRNA-seq) efforts primarily focus on gene expression profiling, and very few scRNA-seq computational tools are available for identifying and quantifying RNA splicing.
Inspired by the successful application of large language models in biomedical research, we developed a new State space model-based framework for Alternative Splicing prediction, named SAS, trained on long-read RNA sequencing data. SAS employs a stacked selective state space model architecture to generate latent state representations of transcript sequences, enabling accurate predictions of diverse AS events, even in data-limited conditions. Furthermore, this model is specifically tailored for single-cell splicing prediction. Our results show that SAS outperforms existing methods, achieving an accuracy of 0.97, PR-AUC of 0.99, and F1 score of 0.97. This innovative framework provides valuable insights into identifying splicing events at single-cell resolution, guiding experimental efforts to uncover novel splicing mechanisms and therapeutic targets.

July 21, 2025
15:50-16:00
Evolutionary constraints guide AlphaFold2 in predicting alternative conformations and inform rational mutation design
Confirmed Presenter: Francesca Cuturello, Area Science Park, Italy
Track: MLCSB: Machine Learning in Computational and Systems Biology

Room: 01A
Format: In person
Moderator(s): Oznur Tastan


Authors List:

  • Valerio Piomponi, Area Science Park
  • Alberto Cazzaniga, Area Science Park
  • Francesca Cuturello, Area Science Park

Presentation Overview:

Investigating structural variability is essential for understanding protein biological functions. Although AlphaFold2 accurately predicts static structures, it fails to capture the full spectrum of functional states. Recent methods have used AlphaFold2 to generate diverse structural ensembles, but they offer limited interpretability and overlook the evolutionary signals underlying predictions. In this work, we enhance the generation of conformational ensembles and identify sequence patterns that influence alternative fold predictions for several protein families. Building on prior research that clustered Multiple Sequence Alignments to predict fold-switching states, we introduce a refined clustering strategy that integrates protein language model representations with hierarchical clustering, overcoming limitations of density-based methods. Our strategy effectively identifies high-confidence alternative conformations and generates abundant sequence ensembles, providing a robust framework for applying Direct Coupling Analysis (DCA). Through DCA, we uncover key coevolutionary signals within the clustered alignments, leveraging them to design mutations that stabilize specific conformations, which we validate using alchemical free energy calculations from molecular dynamics. Notably, our method extends beyond fold-switching, effectively capturing a variety of conformational changes.

July 21, 2025
16:40-16:50
Integrating Machine Learning and Systems Biology to rationally design operational conditions for in vitro / in vivo translation of microphysiological systems
Confirmed Presenter: Nikolaos Meimetis, Department of Biological Engineering, Massachusetts Institute of Technology
Track: MLCSB: Machine Learning in Computational and Systems Biology

Room: 01A
Format: In person
Moderator(s): Tunca Doğan


Authors List:

  • Nikolaos Meimetis, Department of Biological Engineering
  • Jose Cadavid, Department of Biological Engineering
  • Linda Griffith, Department of Biological Engineering
  • Douglas Lauffenburger, Department of Biological Engineering

Presentation Overview:

Preclinical models are used extensively to study diseases and potential therapeutic treatments. Complex in vitro platforms incorporating human cellular components, known as microphysiological systems (MPS), can model cellular and microenvironmental features of diseased tissues. However, determining experimental conditions -- particularly biomolecular cues such as growth factors, cytokines, and matrix proteins -- providing the most effective translatability of MPS-generated information to in vivo human subject contexts is a major challenge. Here, using metabolic dysfunction-associated fatty liver disease (MAFLD) studied using the CNBio PhysioMimix as a case study, we developed a machine learning framework called Latent In Vitro to In Vivo Translation (LIV2TRANS) to ascertain how MPS data map to in vivo data, first sharpening translation insights and consequently elucidating experimental conditions that can further enhance translation capability. Our findings in this case study highlight TGFβ as a crucial cue for MPS translatability and indicate that adding JAK-STAT pathway perturbations via interferon stimuli could increase the predictive performance of this MPS in MAFLD studies. Finally, we developed an optimization approach that identified androgen and EGFR signaling as key for maximizing the capacity of this MPS to capture in vivo human biological information germane to MAFLD. More broadly, this work establishes a mathematically principled approach for identifying experimental conditions that most beneficially capture in vivo human-relevant molecular pathways and processes, generalizable to preclinical studies for a wide range of diseases and potential treatments.

July 21, 2025
16:50-17:00
Data Splitting Against Information Leakage with DataSAIL
Confirmed Presenter: Roman Joeres, Helmholtz Institute for Pharmaceutical Research Saarland, Germany
Track: MLCSB: Machine Learning in Computational and Systems Biology

Room: 01A
Format: In person
Moderator(s): Tunca Doğan


Authors List:

  • Roman Joeres, Helmholtz Institute for Pharmaceutical Research Saarland
  • David B. Blumenthal, Department Artificial Intelligence in Biomedical Engineering
  • Olga Kalinina, Helmholtz Institute for Pharmaceutical Research Saarland

Presentation Overview:

Information leakage (IL) is an increasingly important topic in machine learning (ML) research, especially in biomedical applications. When IL happens during a model's development, the model is prone to memorizing the training data instead of learning generalizable properties. This can lead to inflated performance metrics that do not reflect the actual performance at inference time.

Therefore, we present DataSAIL, a versatile Python package to facilitate leakage-reduced data splitting to enable realistic evaluation of ML models for biological data that are intended to be applied in out-of-distribution scenarios. DataSAIL formulates the search for leakage-reduced data splits as a combinatorial optimization problem. We prove that this problem is NP-hard and provide a scalable heuristic based on clustering and integer linear programming. DataSAIL uses similarities between samples to compute leakage-reduced splits for classical property prediction tasks, stratified splits, and drug-target interaction datasets where information can be leaked along two dimensions. DataSAIL is accepted in principle by Nature Communications.

We empirically demonstrate DataSAIL's impact on evaluating biomedical ML models. We compare DataSAIL to seven other algorithms on 14 datasets from the MoleculeNet benchmark and LP-PDBBind. We show that DataSAIL is consistently amongst the best algorithms in removing IL. Furthermore, we train 6 different ML models on each split to evaluate how information leakage affects different models. We observe that deep learning models generally perform better than statistical models and that higher IL leads to better performance estimates. Another ablation study shows that DataSAIL reduces IL better than the PLINDER benchmark.
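
Editorial illustration: the general idea of similarity-aware splitting can be shown with a toy sketch that groups similar samples into clusters and assigns whole clusters to one side of the split. This does not use or reproduce the DataSAIL package or its API. The threshold-based single-linkage clustering, the greedy assignment (in place of the integer linear programming step mentioned in the abstract), and all function names are assumptions.

```python
def cluster_by_similarity(names, similarity, threshold):
    """Single-linkage clusters: samples at or above the threshold are merged."""
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the cluster root, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if similarity(a, b) >= threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return list(clusters.values())

def split_clusters(clusters, test_fraction):
    """Greedily fill the test set with whole clusters; the rest go to training."""
    total = sum(len(cluster) for cluster in clusters)
    train, test = [], []
    for cluster in sorted(clusters, key=len, reverse=True):
        if len(test) + len(cluster) <= test_fraction * total:
            test.extend(cluster)
        else:
            train.extend(cluster)
    return train, test

def kmer_jaccard(a, b, k=2):
    """Toy similarity: Jaccard index of the k-mer sets of two strings."""
    kmers_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    kmers_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    union = kmers_a | kmers_b
    return len(kmers_a & kmers_b) / len(union) if union else 0.0
```

Because similar samples always land on the same side, a model cannot score well on the test set merely by memorizing near-duplicates of its training data.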

July 21, 2025
17:00-18:00
Invited Presentation: Generative AI for Unlocking the Complexity of Cells
Track: MLCSB: Machine Learning in Computational and Systems Biology

Room: 01A
Format: In person
Moderator(s): Tunca Doğan


Authors List:

  • Maria Brbic

Presentation Overview:

We are witnessing an AI revolution. At the heart of this revolution are generative AI models that, powered by advanced architectures and large datasets, are transforming AI across a variety of disciplines. But how can AI facilitate and eventually enable discoveries in life sciences? How can it bring us closer to understanding biology, the functions of our cells and relationships across different molecular layers? In this talk, I will present AI methods that can extract meaningful differences between classes from representations of foundation models with minimal or no supervision. I will then introduce generative AI methods designed to uncover relationships across different omics layers. I will demonstrate how these approaches enable the reassembly of tissues from dissociated single cells and how AI-driven tissue reconstruction can overcome existing technological limitations.