Schedule subject to change
All times listed are in BST
Monday, July 21st
11:20-12:20
Invited Presentation: Is distribution shift still an AI problem
Confirmed Presenter: Sanmi Koyejo

Room: 01A
Format: In person


Authors List:

  • Sanmi Koyejo

Presentation Overview:

Distribution shifts describe the phenomenon where the deployment performance of an AI model differs from its performance during training. On the one hand, some claim that distribution shifts are ubiquitous in real-world deployments. On the other hand, modern implementations (e.g., foundation models) often claim to be robust to distribution shifts by design. Similarly, phenomena such as “accuracy on the line” promise that standard training produces distribution-shift-robust models. When are these claims valid, and do modern models fail due to distribution shifts? If so, what can be done about it? This talk will outline modern principles and practices for understanding the role of distribution shifts in AI, discuss how the problem has changed, and outline recent methods for engaging with distribution shifts with comprehensive and practical insights. Some highlights include a taxonomy of shifts, the role of foundation models, and finetuning. This talk will also briefly discuss how distribution shifts might interact with AI policy and governance.

Bio: Sanmi Koyejo is an assistant professor in the Department of Computer Science at Stanford University and a co-founder of Virtue AI. At Stanford, Koyejo leads the Stanford Trustworthy Artificial Intelligence (STAIR) lab, which works to develop the principles and practice of trustworthy AI, focusing on applications to science and healthcare. Koyejo has been the recipient of several awards, including a Skip Ellis Early Career Award, a Presidential Early Career Award for Scientists and Engineers (PECASE), and a Sloan Fellowship. Koyejo serves on the Neural Information Processing Systems Foundation Board, the Association for Health Learning and Inference Board, and as president of the Black in AI Board.

12:20-12:40
Proceedings Presentation: Locality-aware pooling enhances protein language model performance across varied applications
Confirmed Presenter: Minh Hoang, Princeton University, United States

Room: 01A
Format: In person


Authors List:

  • Minh Hoang, Princeton University, United States
  • Mona Singh, Princeton University, United States

Presentation Overview:

Protein language models (PLMs) are amongst the most exciting recent advances for characterizing protein sequences, and have enabled a diverse set of applications including structure determination, functional property prediction, and mutation impact assessment, all from single protein sequences alone. State-of-the-art PLMs leverage transformer architectures originally developed for natural language processing, and are pre-trained on large protein databases to generate contextualized representations of individual amino acids. To harness the power of these PLMs to predict protein-level properties, these per-residue embeddings are typically "pooled" to fixed-size vectors that are further utilized in downstream prediction networks. Common pooling strategies include Cls-Pooling and Avg-Pooling, but neither of these approaches can capture the local substructures and long-range interactions observed in proteins. To address these weaknesses in existing PLM pooling strategies, we propose the use of attention pooling, which can naturally capture these important features of proteins.
To make the expensive attention operator (quadratic in length of the input protein) feasible in practice, we introduce bag-of-mer pooling (BoM-Pooling), a locality-aware hierarchical pooling technique that combines windowed average pooling with attention pooling. We empirically demonstrate that both full attention pooling and BoM-Pooling outperform previous pooling strategies on three important, diverse tasks: (1) predicting the activities of two proteins as they are varied; (2) detecting remote homologs; and (3) predicting signaling interactions with peptides. Overall, our work highlights the advantages of biologically inspired pooling techniques in protein sequence modeling and is a step towards more effective adaptations of language models in biological settings.
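As a rough illustration of the pooling strategies named in this abstract, the sketch below contrasts plain average pooling with a windowed-average-then-attention scheme in the spirit of BoM-Pooling. It is not the authors' implementation: the non-overlapping windows, the window size, the single fixed query vector standing in for learned attention parameters, and all function names are assumptions made for illustration.

```python
import math

def avg_pool(residues):
    """Avg-Pooling: mean of the per-residue embeddings (one fixed-size vector)."""
    dim = len(residues[0])
    return [sum(r[d] for r in residues) / len(residues) for d in range(dim)]

def windowed_avg(residues, window):
    """Average non-overlapping windows of residues; the last window may be shorter."""
    return [avg_pool(residues[i:i + window]) for i in range(0, len(residues), window)]

def attention_pool(vectors, query):
    """Softmax-weighted sum of vectors, scored by dot product with `query`.

    `query` is a hypothetical stand-in for parameters that would be learned.
    """
    scores = [sum(v * q for v, q in zip(vec, query)) for vec in vectors]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for numerical stability
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(vectors[0])
    pooled = [sum(w * vec[d] for w, vec in zip(weights, vectors)) for d in range(dim)]
    return pooled, weights

def bom_like_pool(residues, query, window=3):
    """Local step (windowed average) followed by a global step (attention)."""
    return attention_pool(windowed_avg(residues, window), query)

# Toy protein: six residues with 2-dimensional embeddings.
residues = [[1.0, 0.0]] * 3 + [[0.0, 1.0]] * 3
pooled, weights = bom_like_pool(residues, query=[0.0, 2.0], window=3)
```

Averaging within windows first shortens the sequence that the attention step has to process, which is the cost-saving idea the abstract describes; the attention weights then indicate which windows dominate the pooled vector.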

12:40-13:00
Proceedings Presentation: NEAR: Neural Embeddings for Amino acid Relationships
Confirmed Presenter: Daniel Olson, University of Montana, United States

Room: 01A
Format: In person


Authors List:

  • Daniel Olson, University of Montana, United States
  • Thomas Colligan, University of Arizona, United States
  • Daphne Demekas, University of Arizona, United States
  • Jack Roddy, University of Arizona, United States
  • Ken Youens-Clark, University of Arizona, United States
  • Travis Wheeler, University of Arizona, United States

Presentation Overview:

Protein language models (PLMs) have recently demonstrated potential to supplant classical protein database search methods based on sequence alignment, but are slower than common alignment-based tools and appear to be prone to a high rate of false labeling.
Here, we present NEAR, a method based on neural representation learning that is designed to improve both speed and accuracy of search for likely homologs in a large protein sequence database.
NEAR’s ResNet embedding model is trained using contrastive learning guided by trusted sequence alignments. It computes per-residue embeddings for target and query protein sequences, and identifies alignment candidates with a pipeline consisting of residue-level k-NN search and a simple neighbor aggregation scheme.
Tests on a benchmark consisting of trusted remote homologs and randomly shuffled decoy sequences reveal that NEAR substantially improves accuracy relative to state-of-the-art PLMs, with lower memory requirements and faster embedding / search speed. While these results suggest that the NEAR model may be useful for standalone homology detection with increased sensitivity over standard alignment-based methods, in this manuscript we focus on a more straightforward analysis of the model's value as a high-speed pre-filter for sensitive annotation. In that context, NEAR is at least 5x faster than the pre-filter currently used in the widely-used profile hidden Markov model (pHMM) search tool, HMMER3, and also outperforms the pre-filter used in our fast pHMM tool, nail.
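The pipeline of residue-level k-NN search followed by neighbor aggregation can be sketched roughly as follows. This is a toy illustration only, not NEAR itself: it uses brute-force cosine similarity rather than a fast nearest-neighbor index, it sums neighbor similarities per target as one plausible aggregation rule, and the function names and the toy embeddings are assumptions.

```python
import math
from collections import defaultdict

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_targets(query_residues, target_residues, k=1):
    """Rank target proteins by the summed similarity of their residue-level hits.

    query_residues: list of residue embeddings for the query protein.
    target_residues: dict mapping target name -> list of residue embeddings.
    """
    # Flatten all target residues into one searchable list.
    index = [(name, vec) for name, vecs in target_residues.items() for vec in vecs]
    scores = defaultdict(float)
    for q in query_residues:
        hits = sorted(((cosine(q, vec), name) for name, vec in index), reverse=True)
        for similarity, name in hits[:k]:  # k nearest residues for this query residue
            scores[name] += similarity     # aggregate hits per target protein
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

query = [[1.0, 0.0], [0.0, 1.0]]
targets = {
    "homolog": [[0.9, 0.1], [0.1, 0.9]],
    "decoy": [[-1.0, 0.0], [0.0, -1.0]],
}
ranked = rank_targets(query, targets, k=1)
```

Targets that never appear among any residue's nearest neighbors receive no score at all, which is what makes this kind of scheme usable as a pre-filter ahead of a more sensitive aligner.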

14:00-14:20
Proceedings Presentation: LoRA-DR-suite: adapted embeddings predict intrinsic and soft disorder from protein sequences
Confirmed Presenter: Gianluca Lombardi, Sorbonne Université, France

Room: 01A
Format: In person


Authors List:

  • Gianluca Lombardi, Sorbonne Université, France
  • Beatriz Seoane, Universidad Complutense de Madrid, Spain
  • Alessandra Carbone, Sorbonne Université, France

Presentation Overview:

Intrinsic disorder regions (IDRs) and soft disorder regions (SDRs) provide crucial information about a protein's structure that underpins its function, its interactions with other molecules, and its assembly path. Circular dichroism experiments are used to identify intrinsic disorder residues, while SDRs are characterized using B-factors, missing residues, or a combination of both in alternative X-ray crystal structures of the same molecule. These flexible regions in proteins are particularly significant in diverse biological processes and are often implicated in pathological conditions. Accurate computational prediction of these disordered regions is thus essential for advancing protein research and understanding their functional implications. To address this challenge, LoRA-DR-suite employs a simple adapter-based architecture that utilizes protein language model embeddings as protein sequence representations, enabling the precise prediction of IDRs and SDRs directly from primary sequence data. Alongside the fast LoRA-DR-suite implementation, we release SoftDis, a unique soft disorder database constructed for approximately 500,000 PDB chains. SoftDis is designed to facilitate new research, testing, and applications on soft disorder, advancing the study of protein dynamics and interactions.
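Assuming the "LoRA" in the tool's name refers to low-rank adaptation (the abstract itself only says "adapter-based"), the general idea of such an adapter can be sketched as below: a frozen weight matrix W is augmented with a trainable low-rank update B·A. This is a generic illustration of the technique, not the LoRA-DR-suite code; the matrices, the scaling factor `alpha`, and the function names are hypothetical.

```python
def matvec(matrix, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in matrix]

def lora_forward(W, A, B, x, alpha=1.0):
    """Compute y = W x + (alpha / r) * B (A x).

    W is the frozen d_out x d_in weight, A is r x d_in, B is d_out x r,
    and r (the number of rows of A) is the adapter rank.
    """
    rank = len(A)
    base = matvec(W, x)                # frozen pretrained path
    update = matvec(B, matvec(A, x))   # trainable low-rank path
    return [b + (alpha / rank) * u for b, u in zip(base, update)]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen weight (identity, for readability)
A = [[1.0, 1.0]]               # rank-1 down-projection
x = [2.0, 4.0]

# With B initialised to zero the adapter leaves the frozen model unchanged.
y_init = lora_forward(W, A, [[0.0], [0.0]], x)
# After (hypothetical) training, B is non-zero and shifts the output.
y_tuned = lora_forward(W, A, [[0.5], [-0.5]], x)
```

Only A and B would be trained, so the number of tuned parameters scales with the rank r rather than with the size of W.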

14:20-14:40
Proceedings Presentation: TCR-epiDiff: Solving Dual Challenges of TCR Generation and Binding Prediction
Confirmed Presenter: Se Yeon Seo, Soongsil University, South Korea

Room: 01A
Format: In person


Authors List:

  • Se Yeon Seo, Soongsil University, South Korea
  • Je-Keun Rhee, Soongsil University, South Korea

Presentation Overview:

Motivation: T-cell receptors (TCRs) are fundamental components of the adaptive immune system, recognizing specific antigens for targeted immune responses. Understanding their sequence patterns is essential for designing effective vaccines and immunotherapies. However, the vast diversity of TCR sequences and complex binding mechanisms pose significant challenges in generating TCRs that are specific to a particular epitope.
Results: Here, we propose TCR-epiDiff, a diffusion-based deep learning model for generating epitope-specific TCRs and predicting TCR-epitope binding. TCR-epiDiff integrates epitope information during TCR sequence embedding using ProtT5-XL and employs a denoising diffusion probabilistic model for sequence generation. Using external validation datasets, we demonstrate the ability to generate biologically plausible, epitope-specific TCRs. Furthermore, we leverage the model's encoder to develop a TCR-epitope binding predictor that shows robust performance on the external validation data. Our approach provides a comprehensive solution for both de novo generation of epitope-specific TCRs and TCR-epitope binding prediction. This capability provides valuable insights into immune diversity and has the potential to advance targeted immunotherapies.
Availability and implementation: The data and source codes for our experiments are available at https://github.com/seoseyeon/TCR-epiDiff

14:40-15:00
Proceedings Presentation: Incorporating Hierarchical Information into Multiple Instance Learning for Patient Phenotype Prediction with scRNA-seq Data
Confirmed Presenter: Chau Do, Aalto University, Finland

Room: 01A
Format: In person


Authors List:

  • Chau Do, Aalto University, Finland
  • Harri Lähdesmäki, Aalto University, Finland

Presentation Overview:

Multiple Instance Learning (MIL) provides a structured approach to patient phenotype prediction with single-cell RNA-sequencing (scRNA-seq) data. However, existing MIL methods tend to overlook the hierarchical structure inherent in scRNA-seq data, especially the biological groupings of cells, or cell types. This limitation may lead to suboptimal performance and poor interpretability at higher levels of cellular division. To address this gap, we present a novel approach to incorporate hierarchical information into the attention-based MIL framework. Specifically, our model applies the attention-based aggregation mechanism over both cells and cell types, thus enforcing a hierarchical structure on the flow of information throughout the model. Across extensive experiments, our proposed approach consistently outperforms existing models and demonstrates robustness in data-constrained scenarios. Moreover, ablation test results show that simply applying the attention mechanism on cell types instead of cells leads to improved performance, underscoring the benefits of incorporating the hierarchical groupings. By identifying the critical cell types that are most relevant for prediction, we show that our model is capable of capturing biologically meaningful associations, thus facilitating biological discoveries.
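The two-level aggregation described here, attention over cells within each cell type and then attention over cell types, can be sketched in a few lines. The sketch is an assumption-laden toy, not the authors' model: the fixed scoring vectors `cell_scorer` and `type_scorer` stand in for learned attention parameters, and the embeddings and cell-type labels are invented.

```python
import math

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_aggregate(vectors, scorer):
    """Attention pooling: softmax over dot-product scores, then a weighted sum."""
    weights = softmax([sum(v * s for v, s in zip(vec, scorer)) for vec in vectors])
    dim = len(vectors[0])
    pooled = [sum(w * vec[d] for w, vec in zip(weights, vectors)) for d in range(dim)]
    return pooled, weights

def hierarchical_mil(cells, cell_types, cell_scorer, type_scorer):
    """Aggregate cells -> cell types -> one patient-level vector."""
    groups = {}
    for vec, cell_type in zip(cells, cell_types):
        groups.setdefault(cell_type, []).append(vec)
    names = sorted(groups)
    # Level 1: one embedding per cell type, pooled from its own cells only.
    type_vectors = [attention_aggregate(groups[name], cell_scorer)[0] for name in names]
    # Level 2: pool the cell-type embeddings into the patient representation.
    patient_vector, type_weights = attention_aggregate(type_vectors, type_scorer)
    return patient_vector, dict(zip(names, type_weights))

cells = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9]]
cell_types = ["T cell", "T cell", "B cell", "B cell"]
patient_vector, type_weights = hierarchical_mil(
    cells, cell_types, cell_scorer=[1.0, 1.0], type_scorer=[0.0, 3.0]
)
```

The returned `type_weights` are what would be inspected for interpretability: they indicate how much each cell type contributes to the patient-level representation.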

15:00-15:10
PANDORA: Peptide-based ANtimicrobial Design Optimized by Reinforcement Automation
Confirmed Presenter: Julián García-Vinuesa, University of Chile, Chile

Room: 01A
Format: In person


Authors List:

  • Julián García-Vinuesa, University of Chile, Chile
  • Nicole Soto, University of Magallanes, Chile
  • David Medina-Ortiz, University of Magallanes, Chile

Presentation Overview:

The global rise of antibiotic resistance, particularly in critical pathogens such as Acinetobacter baumannii, Pseudomonas aeruginosa, and Enterobacteriaceae, demands innovative therapeutic strategies. Despite their promise, conventional drug discovery pipelines remain slow, costly, and often ineffective. In contrast, antimicrobial peptides (AMPs) offer a compelling alternative due to their specificity, low resistance induction, and multifunctional activity, although their development is hindered by issues like poor stability and immunogenicity. To address these challenges, we introduce PANDORA (Peptide-based ANtimicrobial Design Optimized by Reinforcement Automation), an autonomous AI-driven platform for the optimized design of antibiotic peptides. PANDORA integrates predictive models (e.g., AMP activity, half-life, toxicity), generative transformer-based architectures for de novo sequence design, and explainable AI for property inference. All components are coordinated by a multi-agent reinforcement learning system that enables adaptive, end-to-end peptide engineering. Predictive models, fine-tuned on protein language embeddings with advanced feature extraction, reach classification accuracies above 90% for antimicrobial activity and >85% for toxicity risk estimation. Generative models, guided by physicochemical constraints, yield diverse candidate peptides with optimized therapeutic profiles. The platform supports natural language prompts and user-defined constraints, offering a flexible, user-friendly interface. Preliminary candidates are undergoing experimental validation against WHO-priority bacteria, with feedback used to further train the system via reinforcement learning. PANDORA represents a scalable, autonomous solution that fuses artificial intelligence, automation, and experimental validation. 
Its capacity to iteratively learn and adapt positions it as a transformative tool for accelerating peptide-based drug discovery in the fight against antibiotic resistance.

15:10-15:20
NetStart 2.0: Prediction of Eukaryotic Translation Initiation Sites Using a Protein Language Model
Confirmed Presenter: Line Sandvad Nielsen, Section for Computational and RNA Biology, Department of Biology, University of Copenhagen, Denmark

Room: 01A
Format: In person


Authors List:

  • Line Sandvad Nielsen, Section for Computational and RNA Biology, Department of Biology, University of Copenhagen, Denmark
  • Anders Gorm Pedersen, Section for Bioinformatics, Department of Health Technology, Technical University of Denmark, Denmark
  • Ole Winther, Section for Computational and RNA Biology, Department of Biology, University of Copenhagen, Denmark
  • Henrik Nielsen, Section for Bioinformatics, Department of Health Technology, Technical University of Denmark, Denmark

Presentation Overview:

Background: Accurate identification of translation initiation sites is essential for the proper translation of mRNA into functional proteins. In eukaryotes, the choice of the translation initiation site is influenced by multiple factors, including its proximity to the 5' end and the local start codon context. Translation initiation sites mark the transition from non-coding to coding regions. This fact motivates the expectation that the upstream sequence, if translated, would assemble a nonsensical order of amino acids, while the downstream sequence would correspond to the structured beginning of a protein. This distinction suggests potential for predicting translation initiation sites using a protein language model.

Results: We present NetStart 2.0, a deep learning-based model that integrates the ESM-2 protein language model with the local sequence context to predict translation initiation sites across a broad range of eukaryotic species. NetStart 2.0 was trained as a single model across multiple species, and despite the broad phylogenetic diversity represented in the training data, it consistently relied on features marking the transition from non-coding to coding regions.

Conclusion: By leveraging "protein-ness", NetStart 2.0 achieves state-of-the-art performance in predicting translation initiation sites across a diverse range of eukaryotic species. This success underscores the potential of protein language models to bridge transcript- and peptide-level information in complex biological prediction tasks. The NetStart 2.0 webserver is available at: https://services.healthtech.dtu.dk/services/NetStart-2.0/
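The intuition in the Background paragraph, that translating the sequence upstream of a true start codon gives a nonsensical peptide while the downstream sequence looks like the start of a protein, can be made concrete with a small sketch. It only prepares the two in-frame translations around a candidate ATG; it is not NetStart 2.0, and the function names, the flank length, and the toy transcript are assumptions. The sequence is written in the DNA alphabet (T rather than U) and is assumed to be upper case.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

def tis_context(transcript, atg_pos, flank_codons=3):
    """Return (upstream peptide, downstream peptide) in frame with a candidate ATG."""
    # Step back a whole number of codons so the upstream part stays in frame.
    start = atg_pos - 3 * min(flank_codons, atg_pos // 3)
    upstream = translate(transcript[start:atg_pos])
    downstream = translate(transcript[atg_pos:atg_pos + 3 * flank_codons])
    return upstream, downstream

transcript = "GCTAAGTGA" + "ATGGCTAAAGGT"   # toy 5' region followed by a toy coding start
upstream, downstream = tis_context(transcript, atg_pos=9, flank_codons=3)
```

In this toy case the upstream translation contains a stop codon while the downstream translation begins with methionine; a protein language model would be asked to score such peptide windows rather than relying on a simple rule like this.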

15:20-15:30
Towards a more inductive world for drug repurposing approaches
Confirmed Presenter: Uxía Veleiro, Center for Applied Medical Research (CIMA) University of Navarra, Spain

Room: 01A
Format: In person


Authors List:

  • Jesus De la Fuente Cedeño, Center for Applied Medical Research (CIMA) University of Navarra, Spain
  • Guillermo Serrano, King Abdullah University of Science and Technology, Saudi Arabia
  • Uxía Veleiro, Center for Applied Medical Research (CIMA) University of Navarra, Spain
  • Mikel Casals, TECNUN University of Navarra, Spain
  • Laura Vera, Center for Applied Medical Research (CIMA), University of Navarra, Spain
  • Marija Pizurica, Stanford University, United States
  • Nuria Gomez-Cebrian, Instituto de Investigación Sanitaria La Fe, Spain
  • Leonor Puchades-Carrasco, Instituto de Investigación Sanitaria La Fe, Spain
  • Antonio Pineda-Lucena, Center for Applied Medical Research (CIMA), University of Navarra, Spain
  • Idoia Ochoa, TECNUN University of Navarra, Spain
  • Silve Vicent, Center for Applied Medical Research (CIMA) University of Navarra, Spain
  • Olivier Gevaert, Stanford University, United States
  • Mikel Hernaez, Center for Applied Medical Research (CIMA) University of Navarra, Spain

Presentation Overview:

Drug–target interaction (DTI) prediction is a challenging albeit essential task in drug repurposing. Learning on graph models has drawn special attention as they can substantially reduce drug repurposing costs and time commitment. However, many current approaches require high-demand additional information besides DTIs that complicates their evaluation process and usability. Additionally, structural differences in the learning architecture of current models hinder their fair benchmarking. In this work, we first perform an in-depth evaluation of current DTI datasets and prediction models through a robust benchmarking process and show that DTI methods based on transductive models lack generalization and lead to inflated performance when traditionally evaluated, making them unsuitable for drug repurposing. We then propose a biologically driven strategy for negative-edge subsampling and uncover, via in vitro validation, previously unknown interactions missed by traditional subsampling. Finally, we provide a toolbox from all generated resources, crucial for fair benchmarking and robust model design.

15:30-15:40
Hierarchical Multi-Agent Reinforcement Learning For Optimizing CRISPR-Based Polygenic Therapeutic Design
Confirmed Presenter: Nhung Duong, Hanoi University of Pharmacy, Viet Nam

Room: 01A
Format: In person


Authors List:

  • Nhung Duong, Hanoi University of Pharmacy, Viet Nam
  • Tuan Do, N2TP Technology Solutions JSC, Viet Nam
  • Anh Truong, Hanoi University of Pharmacy, Viet Nam
  • Ngoc Do, Hanoi University of Pharmacy, Viet Nam
  • Lap Nguyen, Hanoi University of Pharmacy, Viet Nam

Presentation Overview:

Background. CRISPR therapies for polygenic disorders require simultaneous optimization of multiple guides and delivery constraints. Current approaches optimize guides individually, neglecting target interactions. We developed a hierarchical multi-agent reinforcement learning (MARL) framework to optimize CRISPR strategies for polygenic diseases while balancing efficiency, synergy, vector capacity, and immunogenicity.

Methods. We implemented a three-layer MARL architecture in which (1) the first layer optimizes guide RNA sequences for each target gene; (2) the second selects optimal editing modes and guide combinations while maximizing synergy; and (3) the third ensures that vector capacity and immunogenicity constraints are satisfied. For computational feasibility, we used a simplified 1 Mb mini-genome for off-target analysis rather than the full human genome, and employed approximated versions of RuleSet2 scoring, synergy effect and immunogenicity prediction. We trained the framework using iterative policy optimization and validated it on a model polygenic retinal disease involving five genes with known pathogenic mutations.

Results. The framework generated optimized guides with high on-target efficiency (0.70-1.00) and minimal off-target effects. The system selected optimal editing modes (prime editing for PDE6B, RHO, USH2A; base editing for RP1) while maximizing synergistic effects. The final design utilized Cas9 with a total size within the 4500 bp capacity and zero immunogenic epitopes, requiring only two rework iterations.

Conclusion. Our MARL approach demonstrates AI's potential for solving complex therapeutic design challenges. The framework navigates multi-dimensional optimization involving sequence, strategy, and clinical constraints simultaneously. While the current implementation uses simplified biological modeling, the foundation is robust and provides a proof-of-concept for AI-guided design of personalized CRISPR therapeutics for polygenic diseases.

15:40-15:50
Developing a Deep Learning Model for Single-Cell RNA Splicing Analysis
Confirmed Presenter: Luyang Li, Division of Infection & Immunity, School of Medicine, Cardiff University, United Kingdom

Room: 01A
Format: In person


Authors List:

  • Luyang Li, Division of Infection & Immunity, School of Medicine, Cardiff University, United Kingdom
  • You Zhou, Systems Immunity University Research Institute and the Division of Infection & Immunity, Cardiff University, United Kingdom

Presentation Overview:

Approximately 95% of human genes undergo alternative splicing (AS), a process that allows a single gene to produce multiple proteins with distinct functions. This mechanism enormously increases the complexity of our genome and plays an important role in maintaining health. Disruptions in normal splicing can lead to various diseases, and predicting AS events and understanding their regulatory mechanisms at the single-cell level can open the door to discovering new therapeutic targets. Despite the importance of RNA splicing, current single-cell RNA sequencing (scRNA-seq) efforts primarily focus on gene expression profiling, and very few scRNA-seq computational tools are available for identifying and quantifying RNA splicing.
Inspired by the successful application of large language models in biomedical research, we developed a new State space model based framework for Alternative Splicing prediction, named SAS, trained on long-read RNA sequencing data. SAS employs a stacked selective state space model architecture to generate latent state representations of transcript sequences, enabling accurate predictions of diverse AS events, even in data-limited conditions. Furthermore, this model is specifically tailored for single-cell splicing prediction. Our results show that SAS outperforms existing methods, achieving an accuracy of 0.97, PR-AUC of 0.99, and F1 score of 0.97. This innovative framework provides valuable insights into identifying splicing events at single-cell resolution, guiding experimental efforts to uncover novel splicing mechanisms and therapeutic targets.

15:50-16:00
Evolutionary constraints guide AlphaFold2 in predicting alternative conformations and inform rational mutation design
Confirmed Presenter: Francesca Cuturello, Area Science Park, Italy

Room: 01A
Format: In person


Authors List:

  • Valerio Piomponi, Area Science Park, Italy
  • Alberto Cazzaniga, Area Science Park, Italy
  • Francesca Cuturello, Area Science Park, Italy

Presentation Overview:

Investigating structural variability is essential for understanding protein biological functions. Although AlphaFold2 accurately predicts static structures, it fails to capture the full spectrum of functional states. Recent methods have used AlphaFold2 to generate diverse structural ensembles, but they offer limited interpretability and overlook the evolutionary signals underlying predictions. In this work, we enhance the generation of conformational ensembles and identify sequence patterns that influence alternative fold predictions for several protein families. Building on prior research that clustered Multiple Sequence Alignments to predict fold-switching states, we introduce a refined clustering strategy that integrates protein language model representations with hierarchical clustering, overcoming limitations of density-based methods. Our strategy effectively identifies high-confidence alternative conformations and generates abundant sequence ensembles, providing a robust framework for applying Direct Coupling Analysis (DCA). Through DCA, we uncover key coevolutionary signals within the clustered alignments, leveraging them to design mutations that stabilize specific conformations, which we validate using alchemical free energy calculations from molecular dynamics. Notably, our method extends beyond fold-switching, effectively capturing a variety of conformational changes.

16:40-16:50
Integrating Machine Learning and Systems Biology to rationally design operational conditions for in vitro / in vivo translation of microphysiological systems
Confirmed Presenter: Nikolaos Meimetis, Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA, United States

Room: 01A
Format: In person


Authors List:

  • Nikolaos Meimetis, Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA, United States
  • Jose Cadavid, Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA, United States
  • Linda Griffith, Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA, United States
  • Douglas Lauffenburger, Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA, United States

Presentation Overview:

Preclinical models are used extensively to study diseases and potential therapeutic treatments. Complex in vitro platforms incorporating human cellular components, known as microphysiological systems (MPS), can model cellular and microenvironmental features of diseased tissues. However, determining experimental conditions -- particularly biomolecular cues such as growth factors, cytokines, and matrix proteins -- providing the most effective translatability of MPS-generated information to in vivo human subject contexts is a major challenge. Here, using metabolic dysfunction-associated fatty liver disease (MAFLD) studied using the CNBio PhysioMimix as a case study, we developed a machine learning framework called Latent In Vitro to In Vivo Translation (LIV2TRANS) to ascertain how MPS data map to in vivo data, first sharpening translation insights and consequently elucidating experimental conditions that can further enhance translation capability. Our findings in this case study highlight TGFβ as a crucial cue for MPS translatability and indicate that adding JAK-STAT pathway perturbations via interferon stimuli could increase the predictive performance of this MPS in MAFLD studies. Finally, we developed an optimization approach that identified androgen and EGFR signaling as key for maximizing the capacity of this MPS to capture in vivo human biological information germane to MAFLD. More broadly, this work establishes a mathematically principled approach for identifying experimental conditions that most beneficially capture in vivo human-relevant molecular pathways and processes, generalizable to preclinical studies for a wide range of diseases and potential treatments.

16:50-17:00
Data Splitting Against Information Leakage with DataSAIL
Confirmed Presenter: Roman Joeres, Helmholtz Institute for Pharmaceutical Research Saarland, Germany

Room: 01A
Format: In person


Authors List:

  • Roman Joeres, Helmholtz Institute for Pharmaceutical Research Saarland, Germany
  • David B. Blumenthal, Department Artificial Intelligence in Biomedical Engineering, Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
  • Olga Kalinina, Helmholtz Institute for Pharmaceutical Research Saarland, Germany

Presentation Overview:

Information leakage (IL) is an increasingly important topic in machine learning (ML) research, especially in biomedical applications. When IL happens during a model's development, the model is prone to memorizing the training data instead of learning generalizable properties. This can lead to inflated performance metrics that do not reflect the actual performance at inference time.

Therefore, we present DataSAIL, a versatile Python package to facilitate leakage-reduced data splitting to enable realistic evaluation of ML models for biological data that are intended to be applied in out-of-distribution scenarios. DataSAIL formulates the search for leakage-reduced data splits as a combinatorial optimization problem. We prove that this problem is NP-hard and provide a scalable heuristic based on clustering and integer linear programming. DataSAIL uses similarities between samples to compute leakage-reduced splits for classical property prediction tasks, stratified splits, and drug-target interaction datasets where information can be leaked along two dimensions. DataSAIL is accepted in principle by Nature Communications.

We empirically demonstrate DataSAIL's impact on evaluating biomedical ML models. We compare DataSAIL to seven other algorithms on 14 datasets from the MoleculeNet benchmark and LP-PDBBind. We show that DataSAIL is consistently amongst the best algorithms in removing IL. Furthermore, we train 6 different ML models on each split to evaluate how information leakage affects different models. We observe that deep learning models generally perform better than statistical models and that higher IL leads to higher, more optimistic performance estimates. Another ablation study shows that DataSAIL reduces IL better than the PLINDER benchmark.
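The general idea of a similarity-aware split, grouping similar samples first and then assigning whole groups to either train or test, can be sketched as below. This is not the DataSAIL API or its algorithm: the abstract describes clustering combined with integer linear programming, whereas this toy uses single-linkage clustering via union-find and a greedy assignment, and every name in it is hypothetical.

```python
def cluster_by_similarity(n_samples, similar_pairs):
    """Group samples so that any two samples listed as similar share a cluster."""
    parent = list(range(n_samples))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for a, b in similar_pairs:
        parent[find(a)] = find(b)

    clusters = {}
    for i in range(n_samples):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

def split_clusters(clusters, test_fraction):
    """Greedily place whole clusters in the test set until it reaches its target size."""
    n_samples = sum(len(c) for c in clusters)
    target = test_fraction * n_samples
    train, test = [], []
    for cluster in sorted(clusters, key=len, reverse=True):
        if len(test) + len(cluster) <= target:
            test.extend(cluster)
        else:
            train.extend(cluster)
    return train, test

# Six samples; pairs above a (hypothetical) similarity threshold.
similar_pairs = [(0, 1), (1, 2), (3, 4)]
clusters = cluster_by_similarity(6, similar_pairs)
train, test = split_clusters(clusters, test_fraction=0.34)
```

Because clusters are never divided, no pair of similar samples ends up on opposite sides of the split, which is the property that a purely random split does not guarantee.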

17:00-18:00
Invited Presentation: Generative AI for Unlocking the Complexity of Cells
Room: 01A
Format: In person


Authors List:

  • Maria Brbic

Presentation Overview:

We are witnessing an AI revolution. At the heart of this revolution are generative AI models that, powered by advanced architectures and large datasets, are transforming AI across a variety of disciplines. But how can AI facilitate and eventually enable discoveries in life sciences? How can it bring us closer to understanding biology, the functions of our cells and relationships across different molecular layers? In this talk, I will present AI methods that can extract meaningful differences between classes from representations of foundation models with minimal or no supervision. I will then introduce generative AI methods designed to uncover relationships across different omics layers. I will demonstrate how these approaches enable the reassembly of tissues from dissociated single cells and how AI-driven tissue reconstruction can overcome existing technological limitations.

Tuesday, July 22nd
11:20-12:20
Invited Presentation: Toward Mechanistic Genomics: Advances in Sequence-to-Function Modeling
Room: 01A
Format: In person


Authors List:

  • Maria Chikina

Presentation Overview:

Recent advances have firmly established sequence-to-function models as essential tools in modern genomics, enabling unprecedented insights into how genomic sequences drive molecular and cellular phenotypes. As these models have matured—with increasingly robust architectures, improved training strategies, and the emergence of standardized software frameworks—the field has rapidly evolved from proof-of-concept demonstrations to widespread practical applications across a variety of biological systems.

With the core methodologies now widely adopted and infrastructure in place, the community's focus is shifting toward ambitious new frontiers. There is growing momentum around developing models that are biologically interpretable, capable of uncovering causal mechanisms of gene regulation, and generalizable to novel contexts—such as predicting the effects of perturbing a regulatory protein rather than simply altering a DNA sequence. These efforts reflect a broader aspiration: to create models that serve not just as black-box predictors, but as scientific instruments that deepen our understanding of genome function.

In this talk, we will explore how such models can move us from descriptive genomics to mechanistic insight, highlighting recent innovations in architecture and training that support interpretability, modularity, and reusability. We will examine the contexts in which these models offer clear advantages, the limitations that remain, and practical considerations for their training. Ultimately, we will consider how advancing these models may refine the role of machine learning in biology, supporting not only accurate prediction but also the generation of more detailed and mechanistically informed hypotheses.

12:20-12:40
Proceedings Presentation: Deep learning models for unbiased sequence-based PPI prediction plateau at an accuracy of 0.65
Confirmed Presenter: Judith Bernett, Technical University of Munich, Germany

Room: 01A
Format: In person


Authors List: Show

  • Timo Reim, Technical University of Munich; Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
  • Anne Hartebrodt, Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
  • David B. Blumenthal, Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
  • Judith Bernett, Technical University of Munich, Germany
  • Markus List, Technical University of Munich, Germany

Presentation Overview: Show

As most proteins interact with other proteins to perform their respective functions, methods to computationally predict these interactions have been developed.
However, flawed evaluation schemes and data leakage in test sets have obscured the fact that sequence-based protein-protein interaction (PPI) prediction is still an open problem. Recently, methods achieving better-than-random performance on leakage-free PPI data have been proposed. Here, we show that the use of ESM-2 protein embeddings explains this performance gain irrespective of model architecture. We compared models of varying complexity, per-protein and per-token embeddings, and the influence of self- or cross-attention; all models plateaued at an accuracy of 0.65. Moreover, we show that the tested sequence-based models cannot implicitly learn a contact map as an intermediate layer.
These results imply that other input types, such as structure, might be necessary for producing reliable PPI predictions.

12:40-13:00
Proceedings Presentation: Accurate PROTAC targeted degradation prediction with DegradeMaster
Confirmed Presenter: Jie Liu, The University of Adelaide, Australia

Room: 01A
Format: In person


Authors List: Show

  • Jie Liu, The University of Adelaide, Australia
  • Michael Roy, The University of Adelaide, Australia
  • Luke Isbel, The University of Adelaide, Australia
  • Fuyi Li, The University of Adelaide, Australia

Presentation Overview: Show

Motivation: Proteolysis-targeting chimeras (PROTACs) are heterobifunctional molecules that can degrade ‘undruggable’ proteins of interest (POIs) by recruiting E3 ligases and hijacking the ubiquitin-proteasome system. Some efforts have been made to develop deep learning-based approaches to predict the degradation ability of a given PROTAC. However, existing deep learning methods either simplify proteins and PROTACs as 2D graphs by disregarding crucial 3D spatial information or exclusively rely on limited labels for supervised learning without considering the abundant information from unlabeled data. Given the potential to accelerate drug discovery, developing more accurate computational methods for PROTAC-targeted protein degradation prediction is critical.

Results: This study proposes DegradeMaster, a semi-supervised E(3)-equivariant graph neural network-based predictor for targeted degradation prediction of PROTACs. DegradeMaster leverages an E(3)-equivariant graph encoder to incorporate 3D geometric constraints into the molecular representations and utilizes a memory-based pseudo-labeling strategy to enrich annotated data during training. A mutual attention pooling module is also designed for interpretable graph representation. Experiments on both supervised and semi-supervised PROTAC datasets demonstrate that DegradeMaster outperforms state-of-the-art baselines, substantially improving AUROC by 10.5%. Case studies show DegradeMaster achieves 88.33% and 77.78% accuracy in predicting the degradability of VZ185 candidates on BRD9 and ACBI3 on KRAS mutants. Visualization of attention weights on the 3D molecular graph demonstrates that DegradeMaster recognizes linking and binding regions of warhead and E3 ligands and emphasizes the importance of structural information in these areas for degradation prediction. Together, this shows the potential for cutting-edge tools to highlight functional PROTAC components, thereby accelerating novel compound generation.

14:00-14:20
Proceedings Presentation: GPO-VAE: Modeling Explainable Gene Perturbation Responses utilizing GRN-Aligned Parameter Optimization
Confirmed Presenter: Seungheun Baek, Korea University, South Korea

Room: 01A
Format: In person


Authors List: Show

  • Seungheun Baek, Korea University, South Korea
  • Soyon Park, Korea University, South Korea
  • Yan Ting Chok, Korea University, Malaysia
  • Mogan Gim, Hankuk University of Foreign Studies, South Korea
  • Jaewoo Kang, Korea University, Aigen Sciences, South Korea

Presentation Overview: Show

Predicting cellular responses to genetic perturbations is essential for understanding biological systems and developing targeted therapeutic strategies. While variational autoencoders (VAEs) have shown promise in modeling perturbation responses, their limited explainability poses a significant challenge, as the learned features often lack clear biological meaning. Yet model explainability is one of the most important aspects of biological AI. One of the most effective ways to achieve explainability is to incorporate gene regulatory networks (GRNs) into the design of deep learning models such as VAEs. GRNs capture the underlying causal relationships between genes and can explain the transcriptional responses caused by genetic perturbation treatments. We propose GPO-VAE, an explainable VAE enhanced by GRN-aligned Parameter Optimization that explicitly models gene regulatory networks in the latent space. Our key approach is to optimize the learnable parameters related to latent perturbation effects towards GRN-aligned explainability. Experimental results on perturbation prediction show our model achieves state-of-the-art performance in predicting transcriptional responses across multiple benchmark datasets. Furthermore, additional results on evaluating the GRN inference task reveal our model's ability to generate meaningful GRNs compared to other methods. According to qualitative analysis, GPO-VAE possesses the ability to construct biologically explainable GRNs that align with experimentally validated regulatory pathways.

14:20-14:40
Proceedings Presentation: Fast and scalable Wasserstein-1 neural optimal transport solver for single-cell perturbation prediction
Confirmed Presenter: Yanshuo Chen, Department of Computer Science, University of Maryland, United States

Room: 01A
Format: Live stream


Authors List: Show

  • Yanshuo Chen, Department of Computer Science, University of Maryland, United States
  • Zhengmian Hu, Department of Computer Science, University of Maryland, United States
  • Wei Chen, Department of Pediatrics, UPMC Children’s Hospital of Pittsburgh, United States
  • Heng Huang, Department of Computer Science, University of Maryland, United States

Presentation Overview: Show

Predicting single-cell perturbation responses requires mapping between two unpaired single-cell data distributions. Optimal transport (OT) theory provides a principled framework for constructing such mappings by minimizing transport cost. Recently, Wasserstein-2 ($W_2$) neural optimal transport solvers (e.g., CellOT) have been employed for this prediction task. However, $W_2$ OT relies on the general Kantorovich dual formulation, which involves optimizing over two conjugate functions, leading to a complex min-max optimization problem that converges slowly. To address these challenges, we propose a novel solver based on the Wasserstein-1 ($W_1$) dual formulation. Unlike $W_2$, the $W_1$ dual simplifies the optimization to a maximization problem over a single 1-Lipschitz function, thus eliminating the need for time-consuming min-max optimization. While solving the $W_1$ dual only reveals the transport direction and does not directly provide a unique optimal transport map, we incorporate an additional step using adversarial training to determine an appropriate transport step size, effectively recovering the transport map. Our experiments demonstrate that the proposed $W_1$ neural optimal transport solver can mimic the $W_2$ OT solvers in finding a unique and "monotonic" map on 2D datasets. Moreover, the $W_1$ OT solver performs on par with or better than $W_2$ OT solvers on real single-cell perturbation datasets. Furthermore, we show that the $W_1$ OT solver achieves a $25 \sim 45\times$ speedup, scales better on high-dimensional transport tasks, and can be applied directly to single-cell RNA-seq datasets with highly variable genes. Our implementation and experiments are open-sourced at https://github.com/poseidonchan/w1ot.
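
For readers unfamiliar with the single-function dual mentioned above, it can be written in the standard Kantorovich-Rubinstein form shown below. The symbols $\mu$ and $\nu$ (source and target cell distributions), $f^{*}$ (a maximizer of the dual) and $\eta$ (a non-negative step size) are notation assumed here for illustration; the second line is only one plausible way to express "direction plus step size" and is not taken from the authors' paper.

```latex
% W_1 dual: a single maximization over 1-Lipschitz functions f
W_1(\mu, \nu) \;=\; \sup_{\|f\|_{L} \le 1} \; \mathbb{E}_{x \sim \mu}\big[f(x)\big] \;-\; \mathbb{E}_{y \sim \nu}\big[f(y)\big]

% Sketch of a map built from the dual solution: the gradient of f^* gives
% the direction, and a separately learned step size \eta(x) >= 0 gives the distance
T(x) \;=\; x \;-\; \eta(x)\, \nabla f^{*}(x)
```

With this sign convention $f^{*}$ decreases along the transport rays from $\mu$ to $\nu$, which is why the displacement points along $-\nabla f^{*}$.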

14:40-15:00
Proceedings Presentation: Recovering Time-Varying Networks From Single-Cell Data
Confirmed Presenter: Euxhen Hasanaj, Carnegie Mellon University, United States

Room: 01A
Format: In person


Authors List: Show

  • Euxhen Hasanaj, Carnegie Mellon University, United States
  • Barnabás Póczos, Carnegie Mellon University, United States
  • Ziv Bar-Joseph, Carnegie Mellon University, United States

Presentation Overview: Show

Gene regulation is a dynamic process that underlies all aspects of human development, disease response, and other key biological processes. The reconstruction of temporal gene regulatory networks has conventionally relied on regression analysis, graphical models, or other types of relevance networks. With the large increase in time series single-cell data, new approaches are needed to address the unique scale and nature of this data for reconstructing such networks. Here, we develop a deep neural network, Marlene, to infer dynamic graphs from time series single-cell gene expression data. Marlene constructs directed gene networks using a self-attention mechanism where the weights evolve over time using recurrent units. By employing meta learning, the model is able to recover accurate temporal networks even for rare cell types. In addition, Marlene can identify gene interactions relevant to specific biological responses, including COVID-19 immune response, fibrosis, and aging, paving the way for potential treatments. The code used to train Marlene is available at https://github.com/euxhenh/Marlene.
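
As a rough illustration of how a self-attention matrix can be read as a directed gene network whose weights change over time, the sketch below computes one attention matrix per time point. It is not the Marlene implementation: the function name `attention_adjacency`, the random data, and the toy `tanh` update standing in for the recurrent units are all assumptions made for this example.

```python
import numpy as np


def attention_adjacency(gene_features, w_q, w_k):
    """Row-normalized attention matrix, read as a directed gene-gene graph.

    Entry (i, j) is the attention that gene i pays to gene j.
    """
    q = gene_features @ w_q                      # queries, one row per gene
    k = gene_features @ w_k                      # keys, one row per gene
    logits = q @ k.T / np.sqrt(q.shape[1])       # scaled dot-product scores
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
n_genes, n_cells, d = 6, 30, 4

# Hypothetical projection weights; a trained model would learn these.
w_q = rng.normal(size=(n_cells, d))
w_k = rng.normal(size=(n_cells, d))
u = rng.normal(scale=0.5, size=(d, d))  # toy transition, stands in for a recurrent unit

graphs = []
for t in range(3):
    # Synthetic genes-by-cells expression matrix for time point t.
    expr = rng.normal(size=(n_genes, n_cells))
    graphs.append(attention_adjacency(expr, w_q, w_k))
    # Let the attention weights evolve between time points (toy recurrence).
    w_q = np.tanh(w_q @ u)
    w_k = np.tanh(w_k @ u)
```

Each element of `graphs` is a 6 x 6 matrix whose rows sum to one; thresholding it would give a directed edge list for that time point.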

15:00-15:10
Developing a Deep Learning Model for Single-Cell RNA Splicing Analysis
Confirmed Presenter: Luyang Li, Division of Infection & Immunity, School of Medicine, Cardiff University, United Kingdom

Room: 01A
Format: In person


Authors List: Show

  • Luyang Li, Division of Infection & Immunity, School of Medicine, Cardiff University, United Kingdom
  • You Zhou, Systems Immunity University Research Institute and the Division of Infection & Immunity, Cardiff University, United Kingdom

Presentation Overview: Show

Approximately 95% of human genes undergo alternative splicing (AS), a process that allows a single gene to produce multiple proteins with distinct functions. This mechanism enormously increases the complexity of our genome and plays an important role in maintaining health. Disruptions in normal splicing can lead to various diseases, and predicting AS events and understanding their regulatory mechanisms at the single-cell level can open the door to discovering new therapeutic targets. Despite the importance of RNA splicing, current single-cell RNA sequencing (scRNA-seq) efforts primarily focus on gene expression profiling, and very few scRNA-seq computational tools are available for identifying and quantifying RNA splicing.
Inspired by the successful application of large language models in biomedical research, we developed a new State space model based framework for Alternative Splicing prediction, named SAS, trained on long-read RNA sequencing data. SAS employs a stacked selective state space model architecture to generate latent state representations of transcript sequences, enabling accurate predictions of diverse AS events, even in data-limited conditions. Furthermore, this model is specifically tailored for single-cell splicing prediction. Our results show that SAS outperforms existing methods, achieving an accuracy of 0.97, PR-AUC of 0.99, and F1 score of 0.97. This innovative framework provides valuable insights into identifying splicing events at single-cell resolution, guiding experimental efforts to uncover novel splicing mechanisms and therapeutic targets.

15:10-15:20
Benchmarking foundation cell models for post-perturbation RNA-seq prediction
Confirmed Presenter: Gerold Csendes, Turbine, Hungary

Room: 01A
Format: In person


Authors List: Show

  • Gerold Csendes, Turbine, Hungary
  • Gema Sanz, Turbine, Spain
  • Kristóf Szalay, Turbine, Hungary
  • Bence Szalai, Turbine, Hungary

Presentation Overview: Show

Accurately predicting cellular responses to perturbations is essential for understanding cell behaviour in both healthy and diseased states. While perturbation data is ideal for building such predictive models, its availability is considerably lower than baseline (non-perturbed) cellular data. To address this limitation, several foundation cell models have been developed using large-scale single-cell gene expression data. These models are fine-tuned after pre-training for specific tasks, such as predicting post-perturbation gene expression profiles, and are considered state-of-the-art for these problems. However, proper benchmarking of these models remains an unsolved challenge.

In this study, we benchmarked two recently published foundation models, scGPT and scFoundation, against baseline models. Surprisingly, we found that even the simplest baseline model - taking the mean of training examples - outperformed scGPT and scFoundation. Furthermore, basic machine learning models that incorporate biologically meaningful features outperformed scGPT by a large margin. Additionally, we identified that the current Perturb-Seq benchmark datasets exhibit low perturbation-specific variance, making them suboptimal for evaluating such models.

Our results highlight important limitations in current benchmarking approaches and provide insights into more effectively evaluating post-perturbation gene expression prediction models.
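
The "mean of training examples" baseline described above can be sketched in a few lines. This is an illustrative reconstruction on synthetic data, not the authors' benchmark code: the helper names `mean_baseline` and `pearson`, the data-generating assumptions (a large shared response plus a small perturbation-specific part), and the choice of Pearson correlation on expression changes as the metric are all assumptions made for this example.

```python
import numpy as np


def mean_baseline(train_profiles):
    """Predict the gene-wise mean of the training post-perturbation profiles."""
    return train_profiles.mean(axis=0)


def pearson(a, b):
    """Pearson correlation between two 1-D vectors."""
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


rng = np.random.default_rng(0)
n_genes = 50

# Synthetic setup: every perturbation shares most of its response, and only
# a small part is perturbation-specific (low perturbation-specific variance).
control = rng.normal(size=n_genes)
shared_shift = rng.normal(size=n_genes)
profiles = control + shared_shift + 0.1 * rng.normal(size=(20, n_genes))

train, test = profiles[:15], profiles[15:]
pred = mean_baseline(train)  # the same prediction for every held-out perturbation

# Score on expression changes relative to control.
scores = [pearson(pred - control, t - control) for t in test]
print(min(scores))
```

When the shared component dominates, as in this synthetic setup, the constant prediction already correlates strongly with every held-out profile, which is one way a benchmark can fail to separate models.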

15:20-15:30
scPRINT: pre-training on 50 million cells allows robust gene network predictions
Confirmed Presenter: Jeremie Kalfon, Institut Pasteur, Université Paris Cité, CNRS UMR 3738, Machine Learning for Integrative Genomics group, France

Room: 01A
Format: In person


Authors List: Show

  • Jeremie Kalfon, Institut Pasteur, Université Paris Cité, CNRS UMR 3738, Machine Learning for Integrative Genomics group, France
  • Gabriel Peyré, CNRS and DMA de l’Ecole Normale Supérieure, CNRS, Ecole Normale Supérieure, Université PSL, France
  • Laura Cantini, Institut Pasteur, Université Paris Cité, CNRS UMR 3738, Machine Learning for Integrative Genomics group, France

Presentation Overview: Show

A cell is governed by the interaction of myriads of macromolecules. Inferring such a network of interactions has remained an elusive milestone in cellular biology. Building on recent advances in large foundation models and their ability to learn without supervision, we present scPRINT, a large cell model for the inference of gene networks pre-trained on more than 50 million cells from the cellxgene database. Using innovative pretraining tasks and model architecture, scPRINT pushes large transformer models towards more interpretability and usability when uncovering the complex biology of the cell. Based on our atlas-level benchmarks, scPRINT demonstrates superior performance in gene network inference to the state of the art, as well as competitive zero-shot abilities in denoising, batch effect correction, and cell label prediction. On an atlas of benign prostatic hyperplasia, scPRINT highlights the profound connections between ion exchange, senescence, and chronic inflammation.

15:30-15:40
MGCL-ST: Multi-view Graph Self-supervised Contrastive Learning for Spatial Transcriptomics Enhancement
Confirmed Presenter: Hongmin Cai, South China University of Technology, China

Room: 01A
Format: In person


Authors List: Show

  • Hongmin Cai, South China University of Technology, China
  • Siqi Ding, South China University of Technology, China
  • Weitian Huang, South China University of Technology, China

Presentation Overview: Show

Spatial transcriptomics enables the investigation of gene expression within its native spatial context, but existing technologies often suffer from low resolution and sparse sampling. These limitations hinder the accurate delineation of fine tissue structures and reduce robustness to noise. To address these challenges, we propose MGCL-ST, a spatial transcriptomics super-resolution framework that integrates multi-view contrastive learning with a dual-metric neighbor selection strategy. By combining spatial structure and histological image features, MGCL-ST achieves robust, pixel-level gene expression imputation. Experimental results on both simulated and real datasets demonstrate its superior reconstruction accuracy and generalization capability, supporting advanced analysis of the tumor microenvironment.

15:40-15:50
Characterizing cell-type spatial relationships across length scales in spatially resolved omics data
Confirmed Presenter: Rafael dos Santos Peixoto, Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, United States

Room: 01A
Format: In person


Authors List: Show

  • Rafael dos Santos Peixoto, Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, United States
  • Brendan Miller, Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, United States
  • Maigan Brusko, Department of Pathology, Immunology, and Laboratory Medicine, University of Florida, United States
  • Gohta Aihara, Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, United States
  • Lyla Atta, Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, United States
  • Manjari Anant, Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, United States
  • Adina Jailova, Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, United States
  • Mark Atkinson, Department of Pathology, Immunology, and Laboratory Medicine, University of Florida, United States
  • Todd Brusko, Department of Pathology, Immunology, and Laboratory Medicine, University of Florida, United States
  • Clive Wasserfall, Department of Pathology, Immunology, and Laboratory Medicine, University of Florida, United States
  • Jean Fan, Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, United States

Presentation Overview: Show

Spatially resolved omics (SRO) technologies enable the identification of cell types while preserving their organization within tissues. Application of such technologies offers the opportunity to delineate cell-type spatial relationships, particularly across different length scales, and enhance our understanding of tissue organization and function. To quantify such multi-scale cell-type spatial relationships, we present CRAWDAD, Cell-type Relationship Analysis Workflow Done Across Distances, as an open-source R package. To demonstrate the utility of such multi-scale characterization, recapitulate expected cell-type spatial relationships, and evaluate against other cell-type spatial analyses, we apply CRAWDAD to various simulated and real SRO datasets of diverse tissues assayed by diverse SRO technologies. We further demonstrate how such multi-scale characterization enabled by CRAWDAD can be used to compare cell-type spatial relationships across multiple samples. Finally, we apply CRAWDAD to SRO datasets of the human spleen to identify consistent as well as patient and sample-specific cell-type spatial relationships. In general, we anticipate such multi-scale analysis of SRO data enabled by CRAWDAD will provide useful quantitative metrics to facilitate the identification, characterization, and comparison of cell-type spatial relationships across axes of interest.

15:50-16:00
Segger: Fast and accurate cell segmentation of imaging-based spatial transcriptomics data
Confirmed Presenter: Elyas Heidari, DKFZ Heidelberg, EMBL Heidelberg, Germany

Room: 01A
Format: In person


Authors List: Show

  • Elyas Heidari, DKFZ Heidelberg, EMBL Heidelberg, Germany
  • Andrew Moorman, Memorial Sloan Kettering Cancer Center, United States
  • Tal Nawy, Memorial Sloan Kettering, United States
  • Moritz Gerstung, DKFZ Heidelberg, Germany
  • Dana Pe'er, Memorial Sloan Kettering Cancer Center, United States
  • Oliver Stegle, DKFZ Heidelberg, EMBL Heidelberg, Germany

Presentation Overview: Show

Accurate cell segmentation is a critical first step in the analysis of imaging-based spatial transcriptomics (iST). Despite decades of research in cell segmentation, current methods fail to address this task with adequate accuracy: they tend to over- or under-segment and create false-positive transcript assignments, and many fail to scale to large datasets with hundreds of millions of transcripts. To address these limitations, we introduce segger, a versatile graph neural network (GNN) that frames cell segmentation as a transcript-to-cell link prediction task. Segger employs a heterogeneous graph representation of individual transcripts and cells, and can optionally leverage single-cell RNA-seq information to enhance transcript assignments.
In benchmarks on multiple iST datasets, including a lung adenocarcinoma dataset with membrane staining for validation, segger demonstrates superior sensitivity and specificity compared to existing methods such as Baysor and BIDCell. At the same time, segger requires orders of magnitude less compute time than existing approaches. The Segger software features adaptive tiling and efficient task scheduling, supporting multi-GPU processing and multi-threading for scalability. Segger also includes a new workflow to cluster unassigned transcripts into ‘fragments’, enabling the recovery of information missed by nucleus or membrane marker-dependent methods. Segger is implemented as user-friendly open source software (https://github.com/PMBio/segger), comes with extensive documentation and integrates seamlessly into existing workflows, enabling atlas-scale applications with high accuracy and speed.

16:40-16:50
Dissecting cellular and molecular mechanisms of pancreatic cancer with deep learning
Confirmed Presenter: Aarthi Venkat, Broad Institute of MIT and Harvard, United States

Room: 01A
Format: In person


Authors List: Show

  • Aarthi Venkat, Broad Institute of MIT and Harvard, United States
  • Cathy Garcia, Stanford University, United States
  • Daniel McQuaid, Yale University, United States
  • Smita Krishnaswamy, Yale University, United States
  • Mandar Muzumdar, Yale University, United States

Presentation Overview: Show

Pancreatic endocrine-exocrine crosstalk plays a key role in normal physiology and disease and is perturbed by altered host metabolic states. For example, obesity imparts a stress-induced endocrine secretion of cholecystokinin (CCK), which promotes pancreatic ductal adenocarcinoma (PDAC), an exocrine tumor. However, the mechanisms governing endocrine-exocrine signaling in obesity-driven tumorigenesis remain unclear.

Here, we design a suite of machine learning tools (TrajectoryNet, AAnet, scMMGAN, DiffusionEMD) to reveal from single-cell RNA-seq data the cellular and molecular mechanisms by which beta cells express CCK and promote obesity-driven PDAC. AAnet identifies an immature beta cell state characterized by low insulin and maturation marker expression and high dedifferentiation and immaturity marker expression. TrajectoryNet predicts obesity stimulates this immature state to expand and adapt toward a pro-tumorigenic CCK-hi state, which we validate with in vivo genetic lineage tracing. TrajectoryNet-based gene regulatory network inference predicts cJun regulates CCK, validated by JNK inhibition and CUT&RUN sequencing showing cJun mediates CCK expression by binding to a novel conserved 3’ enhancer ~3kb downstream of the Cck gene. Finally, mapping beta cells from diverse physiologic and pharmacologic stressors, developmental stages, and species to our dataset with scMMGAN and DiffusionEMD reveals concordance between adult beta cell dedifferentiation and embryonic beta cells, as well as shared stress induction mechanisms between obesity and type II diabetes in mice and humans. Together, this work uncovers new avenues to target the endocrine pancreas to subvert exocrine tumorigenesis and highlights the utility of developing biological and computational models in a wet-to-dry and dry-to-wet fashion toward mechanistic discovery.

16:50-17:00
SpliceSelectNet: A Hierarchical Transformer-Based Deep Learning Model for Splice Site Prediction
Confirmed Presenter: Yuna Miyachi, Department of Computer Science, Graduate School of Information Science and Technology, Japan

Room: 01A
Format: In person


Authors List: Show

  • Yuna Miyachi, Department of Computer Science, Graduate School of Information Science and Technology, Japan
  • Kenta Nakai, Human Genome Center, The Institute of Medical Science, The University of Tokyo, Japan

Presentation Overview: Show

RNA splicing is a critical post-transcriptional process that enables the generation of diverse protein isoforms. Aberrant splicing is implicated in a wide range of genetic disorders and cancers, making accurate prediction of splice sites and mutation effects essential. Convolutional neural network-based models such as SpliceAI and Pangolin have achieved high accuracy but often lack interpretability. Recently, Transformer-based models like DNABERT and SpTransformer have been applied to genomic sequences, yet they typically inherit input length limitations from natural language processing models, restricting context to a few thousand base pairs, which is insufficient for capturing long-range regulatory signals.
To overcome these challenges, we propose SpliceSelectNet (SSNet), a hierarchical Transformer model that integrates local and global attention mechanisms to handle up to 100 kb of input while maintaining nucleotide-level interpretability. Trained on multiple datasets, including those incorporating splice site usage derived from RNA-seq data, SSNet outperforms SpliceAI and Pangolin on the Gencode test dataset, a clinically curated BRCA variant dataset, and a deep intronic variant benchmark. It demonstrates improved performance, particularly in regions characterized by complex splicing regulation, such as long exons and deep introns, as measured by area under the precision-recall curve.
Furthermore, SSNet’s attention maps provide direct insight into sequence context. In the case of a pathogenic variant in BRCA1 exon 10, the model highlighted an upstream region that may contribute to cryptic splice site activation.
These results demonstrate that SSNet combines high predictive performance with biological interpretability, offering a powerful tool for splicing analysis in both research and clinical settings.

17:00-18:00
Invited Presentation: Where does it hurt (in your genome)?
Room: 01A
Format: In person


Authors List: Show

  • Julien Gagneur

Presentation Overview: Show

The identification of genetic variants strongly affecting phenotypes remains an unsolved problem with major relevance in rare disease diagnostics, oncology, and the identification of effector genes of complex traits and diseases.

I will present a series of published and ongoing work from my lab tackling this issue, with a focus on non-coding variants. This will span variant scoring based on genomic language models [1], methods to predict aberrant expression [2] and splicing [3], all the way to integrative deep learning models for rare variant association analyses demonstrated on UK Biobank [4].

1. Tomaz da Silva et al. Nucleotide dependency analysis of DNA language models reveals genomic functional elements. bioRxiv, 2024
2. Hölzlwimmer et al. Aberrant gene expression prediction across human tissues. Nature Communications, 2025
3. Wagner et al. Aberrant splicing prediction across human tissues. Nature Genetics, 2023
4. Clarke, Holtkamp, et al. Integration of variant annotations using deep set networks boosts rare variant association genetics. Nature Genetics, 2024