July 22, 2025
11:20-12:20
Invited Presentation: Toward Mechanistic Genomics: Advances in Sequence-to-Function Modeling
Track: MLCSB: Machine Learning in Computational and Systems Biology

Room: 01A
Moderator(s): Julian Gagneur


Authors List:

  • Maria Chikina

Presentation Overview:

Recent advances have firmly established sequence-to-function models as essential tools in modern genomics, enabling unprecedented insights into how genomic sequences drive molecular and cellular phenotypes. As these models have matured—with increasingly robust architectures, improved training strategies, and the emergence of standardized software frameworks—the field has rapidly evolved from proof-of-concept demonstrations to widespread practical applications across a variety of biological systems.

With the core methodologies now widely adopted and infrastructure in place, the community's focus is shifting toward ambitious new frontiers. There is growing momentum around developing models that are biologically interpretable, capable of uncovering causal mechanisms of gene regulation, and generalizable to novel contexts—such as predicting the effects of perturbing a regulatory protein rather than simply altering a DNA sequence. These efforts reflect a broader aspiration: to create models that serve not just as black-box predictors, but as scientific instruments that deepen our understanding of genome function.

In this talk, we will explore how such models can move us from descriptive genomics to mechanistic insight, highlighting recent innovations in architecture and training that support interpretability, modularity, and reusability. We will examine the contexts in which these models offer clear advantages, the limitations that remain, and practical considerations for their training. Ultimately, we will consider how advancing these models may refine the role of machine learning in biology, supporting not only accurate prediction but also the generation of more detailed and mechanistically informed hypotheses.

July 22, 2025
12:20-12:40
Proceedings Presentation: Deep learning models for unbiased sequence-based PPI prediction plateau at an accuracy of 0.65
Confirmed Presenter: Judith Bernett, Technical University of Munich, Germany
Track: MLCSB: Machine Learning in Computational and Systems Biology

Room: 01A
Format: In person
Moderator(s): Julian Gagneur


Authors List:

  • Timo Reim, Technical University of Munich; Friedrich-Alexander-Universität Erlangen-Nürnberg
  • Anne Hartebrodt, Friedrich-Alexander-Universität Erlangen-Nürnberg
  • David B. Blumenthal, Friedrich-Alexander-Universität Erlangen-Nürnberg
  • Judith Bernett, Technical University of Munich
  • Markus List, Technical University of Munich

Presentation Overview:

As most proteins interact with other proteins to perform their respective functions, methods to computationally predict these interactions have been developed.
However, flawed evaluation schemes and data leakage in test sets have obscured the fact that sequence-based protein-protein interaction (PPI) prediction is still an open problem. Recently, methods achieving better-than-random performance on leakage-free PPI data have been proposed. Here, we show that the use of ESM-2 protein embeddings explains this performance gain irrespective of model architecture. We compared models of varying complexity, per-protein and per-token embeddings, and the influence of self- or cross-attention; all models plateaued at an accuracy of 0.65. Moreover, we show that the tested sequence-based models cannot implicitly learn a contact map as an intermediate layer.
These results imply that other input types, such as structure, might be necessary for producing reliable PPI predictions.

July 22, 2025
12:40-13:00
Proceedings Presentation: Accurate PROTAC targeted degradation prediction with DegradeMaster
Confirmed Presenter: Jie Liu, The University of Adelaide, Australia
Track: MLCSB: Machine Learning in Computational and Systems Biology

Room: 01A
Format: In person
Moderator(s): Julian Gagneur


Authors List:

  • Jie Liu, The University of Adelaide
  • Michael Roy, The University of Adelaide
  • Luke Isbel, The University of Adelaide
  • Fuyi Li, The University of Adelaide

Presentation Overview:

Motivation: Proteolysis-targeting chimeras (PROTACs) are heterobifunctional molecules that can degrade ‘undruggable’ proteins of interest (POIs) by recruiting E3 ligases and hijacking the ubiquitin-proteasome system. Some efforts have been made to develop deep learning-based approaches to predict the degradation ability of a given PROTAC. However, existing deep learning methods either simplify proteins and PROTACs as 2D graphs, disregarding crucial 3D spatial information, or rely exclusively on limited labels for supervised learning without considering the abundant information in unlabeled data. Given the potential to accelerate drug discovery, developing more accurate computational methods for PROTAC-targeted protein degradation prediction is critical.

Results: This study proposes DegradeMaster, a semi-supervised E(3)-equivariant graph neural network-based predictor for targeted degradation prediction of PROTACs. DegradeMaster leverages an E(3)-equivariant graph encoder to incorporate 3D geometric constraints into the molecular representations and utilizes a memory-based pseudo-labeling strategy to enrich annotated data during training. A mutual attention pooling module is also designed for interpretable graph representation. Experiments on both supervised and semi-supervised PROTAC datasets demonstrate that DegradeMaster outperforms state-of-the-art baselines, substantially improving AUROC by 10.5%. Case studies show DegradeMaster achieves 88.33% and 77.78% accuracy in predicting the degradability of VZ185 candidates on BRD9 and ACBI3 on KRAS mutants. Visualization of attention weights on the 3D molecular graph demonstrates that DegradeMaster recognises linking and binding regions of warhead and E3 ligands and emphasizes the importance of structural information in these areas for degradation prediction. Together, this shows the potential for cutting-edge tools to highlight functional PROTAC components, thereby accelerating novel compound generation.

July 22, 2025
14:00-14:20
Proceedings Presentation: GPO-VAE: Modeling Explainable Gene Perturbation Responses utilizing GRN-Aligned Parameter Optimization
Confirmed Presenter: Seungheun Baek, Korea University, South Korea
Track: MLCSB: Machine Learning in Computational and Systems Biology

Room: 01A
Format: In person
Moderator(s): Maria Brbic


Authors List:

  • Seungheun Baek, Korea University
  • Soyon Park, Korea University
  • Yan Ting Chok, Korea University
  • Mogan Gim, Hankuk University of Foreign Studies
  • Jaewoo Kang, Korea University

Presentation Overview:

Predicting cellular responses to genetic perturbations is essential for understanding biological systems and developing targeted therapeutic strategies. While variational autoencoders (VAEs) have shown promise in modeling perturbation responses, their limited explainability poses a significant challenge, as the learned features often lack clear biological meaning. Nevertheless, model explainability is one of the most important aspects in the realm of biological AI. One of the most effective ways to achieve explainability is incorporating the concept of gene regulatory networks (GRNs) in designing deep learning models such as VAEs. GRNs elicit the underlying causal relationships between genes and are capable of explaining the transcriptional responses caused by genetic perturbation treatments. We propose GPO-VAE, an explainable VAE enhanced by GRN-aligned Parameter Optimization that explicitly models gene regulatory networks in the latent space. Our key approach is to optimize the learnable parameters related to latent perturbation effects towards GRN-aligned explainability. Experimental results on perturbation prediction show our model achieves state-of-the-art performance in predicting transcriptional responses across multiple benchmark datasets. Furthermore, additional results on evaluating the GRN inference task reveal our model's ability to generate meaningful GRNs compared to other methods. According to qualitative analysis, GPO-VAE possesses the ability to construct biologically explainable GRNs that align with experimentally validated regulatory pathways.

July 22, 2025
14:20-14:40
Proceedings Presentation: Fast and scalable Wasserstein-1 neural optimal transport solver for single-cell perturbation prediction
Confirmed Presenter: Yanshuo Chen, Department of Computer Science, University of Maryland
Track: MLCSB: Machine Learning in Computational and Systems Biology

Room: 01A
Format: Live stream
Moderator(s): Maria Brbic


Authors List:

  • Yanshuo Chen, Department of Computer Science
  • Zhengmian Hu, Department of Computer Science
  • Wei Chen, Department of Pediatrics
  • Heng Huang, Department of Computer Science

Presentation Overview:

Predicting single-cell perturbation responses requires mapping between two unpaired single-cell data distributions. Optimal transport (OT) theory provides a principled framework for constructing such mappings by minimizing transport cost. Recently, Wasserstein-2 (W2) neural optimal transport solvers (e.g., CellOT) have been employed for this prediction task. However, W2 OT relies on the general Kantorovich dual formulation, which involves optimizing over two conjugate functions, leading to a complex min-max optimization problem that converges slowly. To address these challenges, we propose a novel solver based on the Wasserstein-1 (W1) dual formulation. Unlike W2, the W1 dual simplifies the optimization to a maximization problem over a single 1-Lipschitz function, thus eliminating the need for time-consuming min-max optimization. While solving the W1 dual only reveals the transport direction and does not directly provide a unique optimal transport map, we incorporate an additional step using adversarial training to determine an appropriate transport step size, effectively recovering the transport map. Our experiments demonstrate that the proposed W1 neural optimal transport solver can mimic the W2 OT solvers in finding a unique and "monotonic" map on 2D datasets. Moreover, the W1 OT solver performs on par with or better than W2 OT solvers on real single-cell perturbation datasets. Furthermore, we show that the W1 OT solver achieves a 25 to 45× speedup, scales better on high-dimensional transport tasks, and can be directly applied to single-cell RNA-seq datasets with highly variable genes. Our implementation and experiments are open-sourced at https://github.com/poseidonchan/w1ot.
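
For readers less familiar with the two dual problems contrasted in this abstract, their standard textbook forms can be sketched as follows. These are the generic Kantorovich and Kantorovich-Rubinstein duals, not necessarily the exact objectives used by the authors. The symbols \mu (source distribution), \nu (target distribution), \varphi, \psi, f, and the step-size function \eta are notation assumed here for illustration.

```latex
% Generic Kantorovich dual for the squared-Euclidean cost (W_2):
% two potentials coupled by a pointwise constraint.
W_2^2(\mu,\nu) = \sup_{\varphi(x)+\psi(y)\,\le\,\|x-y\|^2}
  \left\{ \mathbb{E}_{x\sim\mu}[\varphi(x)] + \mathbb{E}_{y\sim\nu}[\psi(y)] \right\}

% Kantorovich-Rubinstein dual (W_1): a single 1-Lipschitz potential.
W_1(\mu,\nu) = \sup_{\|f\|_{\mathrm{Lip}}\,\le\,1}
  \left\{ \mathbb{E}_{x\sim\mu}[f(x)] - \mathbb{E}_{y\sim\nu}[f(y)] \right\}

% An optimal potential f^* fixes only the direction of transport.
% A map can then be written with a separate nonnegative step size \eta
% (hypothetical notation, not taken from the abstract):
T(x) = x - \eta(x)\,\nabla f^*(x), \qquad \eta(x) \ge 0
```

In the W_2 case, eliminating one potential through its conjugate is what typically gives rise to the min-max objective the abstract mentions, whereas the W_1 dual is a single maximization. The step size \eta corresponds to the quantity the abstract describes as being determined by adversarial training.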

July 22, 2025
14:40-15:00
Proceedings Presentation: Recovering Time-Varying Networks From Single-Cell Data
Confirmed Presenter: Euxhen Hasanaj, Carnegie Mellon University, United States
Track: MLCSB: Machine Learning in Computational and Systems Biology

Room: 01A
Format: In person
Moderator(s): Maria Brbic


Authors List:

  • Euxhen Hasanaj, Carnegie Mellon University
  • Barnabás Póczos, Carnegie Mellon University
  • Ziv Bar-Joseph, Carnegie Mellon University

Presentation Overview:

Gene regulation is a dynamic process that underlies all aspects of human development, disease response, and other key biological processes. The reconstruction of temporal gene regulatory networks has conventionally relied on regression analysis, graphical models, or other types of relevance networks. With the large increase in time series single-cell data, new approaches are needed to address the unique scale and nature of this data for reconstructing such networks. Here, we develop a deep neural network, Marlene, to infer dynamic graphs from time series single-cell gene expression data. Marlene constructs directed gene networks using a self-attention mechanism where the weights evolve over time using recurrent units. By employing meta learning, the model is able to recover accurate temporal networks even for rare cell types. In addition, Marlene can identify gene interactions relevant to specific biological responses, including COVID-19 immune response, fibrosis, and aging, paving the way for potential treatments. The code used to train Marlene is available at https://github.com/euxhenh/Marlene.

July 22, 2025
15:00-15:10
Sliding Window INteraction Grammar (SWING): a generalized interaction language model for peptide and protein interactions
Confirmed Presenter: Jishnu Das, University of Pittsburgh, United States
Track: MLCSB: Machine Learning in Computational and Systems Biology

Room: 01A
Format: In person
Moderator(s): Maria Brbic


Authors List:

  • Jane Siwek, University of Pittsburgh
  • Alisa Omelchenko, University of Pittsburgh
  • Prabal Chhibbar, University of Pittsburgh
  • Alok Joglekar, University of Pittsburgh
  • Jishnu Das, University of Pittsburgh

Presentation Overview:

Protein language models (pLMs) can embed protein sequences for different proteomic tasks. However, these methods are suboptimal at learning the language of protein interactions. We developed an interaction LM (iLM), Sliding Window Interaction Grammar (SWING), which leverages differences in amino acid properties to generate an interaction vocabulary. This vocabulary is embedded by an LM, and supervised learning is performed on the embeddings.
SWING was used across a range of tasks. Using only sequence information, it successfully predicted both Class I and Class II pMHC interactions, performing as well as state-of-the-art approaches. Further, the Class I SWING model could uniquely cross-predict Class II interactions, a complex prediction task not attempted by existing methods. A unique Mixed Class model effectively predicted interactions for both classes. Using only human Class I or Class II data, SWING accurately predicted novel murine Class II pMHC interactions involving risk alleles in SLE and T1D. SWING also accurately predicted how Mendelian and population variants can disrupt specific protein-protein interactions, based on sequence information alone. Across these tasks, SWING outperformed passive uses of pLM embeddings, demonstrating the value of the unique iLM architecture. Overall, SWING is a first-in-class generalizable zero-shot iLM that learns the language of PPIs.

July 22, 2025
15:10-15:20
Benchmarking foundation cell models for post-perturbation RNA-seq prediction
Confirmed Presenter: Gerold Csendes, Turbine, Hungary
Track: MLCSB: Machine Learning in Computational and Systems Biology

Room: 01A
Format: In person
Moderator(s): Maria Brbic


Authors List:

  • Gerold Csendes, Turbine
  • Gema Sanz, Turbine
  • Kristóf Szalay, Turbine
  • Bence Szalai, Turbine

Presentation Overview:

Accurately predicting cellular responses to perturbations is essential for understanding cell behaviour in both healthy and diseased states. While perturbation data is ideal for building such predictive models, its availability is considerably lower than baseline (non-perturbed) cellular data. To address this limitation, several foundation cell models have been developed using large-scale single-cell gene expression data. These models are fine-tuned after pre-training for specific tasks, such as predicting post-perturbation gene expression profiles, and are considered state-of-the-art for these problems. However, proper benchmarking of these models remains an unsolved challenge.

In this study, we benchmarked two recently published foundation models, scGPT and scFoundation, against baseline models. Surprisingly, we found that even the simplest baseline model - taking the mean of training examples - outperformed scGPT and scFoundation. Furthermore, basic machine learning models that incorporate biologically meaningful features outperformed scGPT by a large margin. Additionally, we identified that the current Perturb-Seq benchmark datasets exhibit low perturbation-specific variance, making them suboptimal for evaluating such models.

Our results highlight important limitations in current benchmarking approaches and provide insights into more effectively evaluating post-perturbation gene expression prediction models.
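
The "mean of training examples" baseline mentioned above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the synthetic data, the function names, and the choice of Pearson correlation as the score are not taken from the study, whose actual datasets and metrics are not given in this abstract.

```python
import random
from statistics import fmean


def fit_mean_baseline(train_profiles):
    """Return the gene-wise mean of the training post-perturbation profiles."""
    n_genes = len(train_profiles[0])
    return [fmean(profile[g] for profile in train_profiles) for g in range(n_genes)]


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mean_x, mean_y = fmean(x), fmean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


# Synthetic data (hypothetical): every perturbation shares one dominant
# response plus a small perturbation-specific deviation, mimicking the
# low perturbation-specific variance described in the abstract.
random.seed(0)
n_genes = 50
shared_response = [random.gauss(0.0, 1.0) for _ in range(n_genes)]


def simulate_profile():
    return [s + random.gauss(0.0, 0.1) for s in shared_response]


train = [simulate_profile() for _ in range(20)]
test = [simulate_profile() for _ in range(5)]

# The baseline predicts the same mean profile for every unseen perturbation.
mean_profile = fit_mean_baseline(train)
scores = [pearson(mean_profile, true_profile) for true_profile in test]
```

When the perturbation-specific signal is small relative to the shared response, as in this toy setup, a constant prediction already scores highly, which is one way to read the abstract's point about benchmark datasets with low perturbation-specific variance.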

July 22, 2025
15:20-15:30
scPRINT: pre-training on 50 million cells allows robust gene network predictions
Confirmed Presenter: Jeremie Kalfon, Institut Pasteur, Université Paris Cité
Track: MLCSB: Machine Learning in Computational and Systems Biology

Room: 01A
Format: In person
Moderator(s): Maria Brbic


Authors List:

  • Jeremie Kalfon, Institut Pasteur
  • Gabriel Peyré, CNRS and DMA de l’Ecole Normale Supérieure
  • Laura Cantini, Institut Pasteur

Presentation Overview:

A cell is governed by the interaction of myriads of macromolecules. Inferring such a network of interactions has remained an elusive milestone in cellular biology. Building on recent advances in large foundation models and their ability to learn without supervision, we present scPRINT, a large cell model for the inference of gene networks pre-trained on more than 50 million cells from the cellxgene database. Using innovative pretraining tasks and model architecture, scPRINT pushes large transformer models towards more interpretability and usability when uncovering the complex biology of the cell. Based on our atlas-level benchmarks, scPRINT demonstrates superior performance in gene network inference to the state of the art, as well as competitive zero-shot abilities in denoising, batch effect correction, and cell label prediction. On an atlas of benign prostatic hyperplasia, scPRINT highlights the profound connections between ion exchange, senescence, and chronic inflammation.

July 22, 2025
15:30-15:40
MGCL-ST: Multi-view Graph Self-supervised Contrastive Learning for Spatial Transcriptomics Enhancement
Confirmed Presenter: Hongmin Cai, South China University of Technology, China
Track: MLCSB: Machine Learning in Computational and Systems Biology

Room: 01A
Format: In person
Moderator(s): Maria Brbic


Authors List:

  • Hongmin Cai, South China University of Technology
  • Siqi Ding, South China University of Technology
  • Weitian Huang, South China University of Technology

Presentation Overview:

Spatial transcriptomics enables the investigation of gene expression within its native spatial context, but existing technologies often suffer from low resolution and sparse sampling. These limitations hinder the accurate delineation of fine tissue structures and reduce robustness to noise. To address these challenges, we propose MGCL-ST, a spatial transcriptomics super-resolution framework that integrates multi-view contrastive learning with a dual-metric neighbor selection strategy. By combining spatial structure and histological image features, MGCL-ST achieves robust, pixel-level gene expression imputation. Experimental results on both simulated and real datasets demonstrate its superior reconstruction accuracy and generalization capability, supporting advanced analysis of the tumor microenvironment.

July 22, 2025
15:40-15:50
Characterizing cell-type spatial relationships across length scales in spatially resolved omics data
Confirmed Presenter: Rafael dos Santos Peixoto, Center for Computational Biology, Whiting School of Engineering
Track: MLCSB: Machine Learning in Computational and Systems Biology

Room: 01A
Format: In person
Moderator(s): Maria Brbic


Authors List:

  • Rafael dos Santos Peixoto, Center for Computational Biology
  • Brendan Miller, Center for Computational Biology
  • Maigan Brusko, Department of Pathology
  • Gohta Aihara, Center for Computational Biology
  • Lyla Atta, Center for Computational Biology
  • Manjari Anant, Center for Computational Biology
  • Adina Jailova, Center for Computational Biology
  • Mark Atkinson, Department of Pathology

Presentation Overview:

Spatially resolved omics (SRO) technologies enable the identification of cell types while preserving their organization within tissues. Application of such technologies offers the opportunity to delineate cell-type spatial relationships, particularly across different length scales, and enhance our understanding of tissue organization and function. To quantify such multi-scale cell-type spatial relationships, we present CRAWDAD, Cell-type Relationship Analysis Workflow Done Across Distances, as an open-source R package. To demonstrate the utility of such multi-scale characterization, recapitulate expected cell-type spatial relationships, and evaluate against other cell-type spatial analyses, we apply CRAWDAD to various simulated and real SRO datasets of diverse tissues assayed by diverse SRO technologies. We further demonstrate how such multi-scale characterization enabled by CRAWDAD can be used to compare cell-type spatial relationships across multiple samples. Finally, we apply CRAWDAD to SRO datasets of the human spleen to identify consistent as well as patient- and sample-specific cell-type spatial relationships. In general, we anticipate such multi-scale analysis of SRO data enabled by CRAWDAD will provide useful quantitative metrics to facilitate the identification, characterization, and comparison of cell-type spatial relationships across axes of interest.

July 22, 2025
15:50-16:00
Segger: Fast and accurate cell segmentation of imaging-based spatial transcriptomics data
Confirmed Presenter: Elyas Heidari, DKFZ Heidelberg, EMBL Heidelberg
Track: MLCSB: Machine Learning in Computational and Systems Biology

Room: 01A
Format: In person
Moderator(s): Maria Brbic


Authors List:

  • Elyas Heidari, DKFZ Heidelberg
  • Andrew Moorman, Memorial Sloan Kettering Cancer Center
  • Tal Nawy, Memorial Sloan Kettering
  • Moritz Gerstung, DKFZ Heidelberg
  • Dana Pe'er, Memorial Sloan Kettering Cancer Center
  • Oliver Stegle, DKFZ Heidelberg

Presentation Overview:

Accurate cell segmentation is a critical first step in the analysis of imaging-based spatial transcriptomics (iST). Despite decades of research in cell segmentation, current methods fail to address this task with adequate accuracy: they tend to over- or under-segment and to create false-positive transcript assignments, and many fail to scale to large datasets with hundreds of millions of transcripts. To address these limitations, we introduce segger, a versatile graph neural network (GNN) that frames cell segmentation as a transcript-to-cell link prediction task. Segger employs a heterogeneous graph representation of individual transcripts and cells, and can optionally leverage single-cell RNA-seq information to enhance transcript assignments.
In benchmarks on multiple iST datasets, including a lung adenocarcinoma dataset with membrane staining for validation, segger demonstrates superior sensitivity and specificity compared to existing methods such as Baysor and BIDCell. At the same time, segger requires orders of magnitude less compute time than existing approaches. The Segger software features adaptive tiling and efficient task scheduling, supporting multi-GPU processing and multi-threading for scalability. Segger also includes a new workflow to cluster unassigned transcripts into ‘fragments’, enabling the recovery of information missed by nucleus or membrane marker-dependent methods. Segger is implemented as user-friendly open source software (https://github.com/PMBio/segger), comes with extensive documentation and integrates seamlessly into existing workflows, enabling atlas-scale applications with high accuracy and speed.

July 22, 2025
16:40-16:50
Dissecting cellular and molecular mechanisms of pancreatic cancer with deep learning
Confirmed Presenter: Aarthi Venkat, Broad Institute of MIT and Harvard, United States
Track: MLCSB: Machine Learning in Computational and Systems Biology

Room: 01A
Format: In person
Moderator(s): Maria Chikina


Authors List:

  • Aarthi Venkat, Broad Institute of MIT and Harvard
  • Cathy Garcia, Stanford University
  • Daniel McQuaid, Yale University
  • Smita Krishnaswamy, Yale University
  • Mandar Muzumdar, Yale University

Presentation Overview:

Pancreatic endocrine-exocrine crosstalk plays a key role in normal physiology and disease and is perturbed by altered host metabolic states. For example, obesity imparts a stress-induced endocrine secretion of cholecystokinin (CCK), which promotes pancreatic ductal adenocarcinoma (PDAC), an exocrine tumor. However, the mechanisms governing endocrine-exocrine signaling in obesity-driven tumorigenesis remain unclear.

Here, we design a suite of machine learning tools (TrajectoryNet, AAnet, scMMGAN, DiffusionEMD) to reveal from single-cell RNA-seq data the cellular and molecular mechanisms by which beta cells express CCK and promote obesity-driven PDAC. AAnet identifies an immature beta cell state characterized by low insulin and maturation marker expression and high dedifferentiation and immaturity marker expression. TrajectoryNet predicts obesity stimulates this immature state to expand and adapt toward a pro-tumorigenic CCK-hi state, which we validate with in vivo genetic lineage tracing. TrajectoryNet-based gene regulatory network inference predicts cJun regulates CCK, validated by JNK inhibition and CUT&RUN sequencing showing cJun mediates CCK expression by binding to a novel conserved 3’ enhancer ~3kb downstream of the Cck gene. Finally, mapping beta cells from diverse physiologic and pharmacologic stressors, developmental stages, and species to our dataset with scMMGAN and DiffusionEMD reveals concordance between adult beta cell dedifferentiation and embryonic beta cells, as well as shared stress induction mechanisms between obesity and type II diabetes in mice and humans. Together, this work uncovers new avenues to target the endocrine pancreas to subvert exocrine tumorigenesis and highlights the utility of developing biological and computational models in a wet-to-dry and dry-to-wet fashion toward mechanistic discovery.

July 22, 2025
16:50-17:00
SpliceSelectNet: A Hierarchical Transformer-Based Deep Learning Model for Splice Site Prediction
Confirmed Presenter: Yuna Miyachi, Department of Computer Science, Graduate School of Information Science and Technology
Track: MLCSB: Machine Learning in Computational and Systems Biology

Room: 01A
Format: In person
Moderator(s): Maria Chikina


Authors List:

  • Yuna Miyachi, Department of Computer Science
  • Kenta Nakai, Human Genome Center

Presentation Overview:

RNA splicing is a critical post-transcriptional process that enables the generation of diverse protein isoforms. Aberrant splicing is implicated in a wide range of genetic disorders and cancers, making accurate prediction of splice sites and mutation effects essential. Convolutional neural network-based models such as SpliceAI and Pangolin have achieved high accuracy but often lack interpretability. Recently, Transformer-based models like DNABERT and SpTransformer have been applied to genomic sequences, yet they typically inherit input length limitations from natural language processing models, restricting context to a few thousand base pairs, which is insufficient for capturing long-range regulatory signals.
To overcome these challenges, we propose SpliceSelectNet (SSNet), a hierarchical Transformer model that integrates local and global attention mechanisms to handle up to 100 kb of input while maintaining nucleotide-level interpretability. Trained on multiple datasets, including those incorporating splice site usage derived from RNA-seq data, SSNet outperforms SpliceAI and Pangolin on the Gencode test dataset, a clinically curated BRCA variant dataset, and a deep intronic variant benchmark. It demonstrates improved performance, particularly in regions characterized by complex splicing regulation, such as long exons and deep introns, as measured by area under the precision-recall curve.
Furthermore, SSNet’s attention maps provide direct insight into sequence context. In the case of a pathogenic variant in BRCA1 exon 10, the model highlighted an upstream region that may contribute to cryptic splice site activation.
These results demonstrate that SSNet combines high predictive performance with biological interpretability, offering a powerful tool for splicing analysis in both research and clinical settings.

July 22, 2025
17:00-18:00
Invited Presentation: Is distribution shift still an AI problem
Confirmed Presenter: Sanmi Koyejo
Track: MLCSB: Machine Learning in Computational and Systems Biology

Room: 01A
Format: In person
Moderator(s): Maria Chikina


Authors List:

  • Sanmi Koyejo

Presentation Overview:

Distribution shift describes the phenomenon in which the deployment performance of an AI model differs from its performance in training. On the one hand, some claim that distribution shifts are ubiquitous in real-world deployments. On the other hand, modern implementations (e.g., foundation models) often claim to be robust to distribution shifts by design. Similarly, phenomena such as “accuracy on the line” promise that standard training produces distribution-shift-robust models. When are these claims valid, and do modern models fail due to distribution shifts? If so, what can be done about it? This talk will outline modern principles and practices for understanding the role of distribution shifts in AI, discuss how the problem has changed, and outline recent methods for engaging with distribution shifts with comprehensive and practical insights. Some highlights include a taxonomy of shifts, the role of foundation models, and finetuning. This talk will also briefly discuss how distribution shifts might interact with AI policy and governance.

Bio: Sanmi Koyejo is an assistant professor in the Department of Computer Science at Stanford University and a co-founder of Virtue AI. At Stanford, Koyejo leads the Stanford Trustworthy Artificial Intelligence (STAIR) lab, which works to develop the principles and practice of trustworthy AI, focusing on applications to science and healthcare. Koyejo has been the recipient of several awards, including a Skip Ellis Early Career Award, a Presidential Early Career Award for Scientists and Engineers (PECASE), and a Sloan Fellowship. Koyejo serves on the Neural Information Processing Systems Foundation Board, the Association for Health Learning and Inference Board, and as president of the Black in AI Board.