Date | Start Time | End Time | Room | Track | Title | Confirmed Presenter | Authors | Abstract |
---|---|---|---|---|---|---|---|---|
2025-07-21 | 11:20:00 | 12:20:00 | 01A | MLCSB | Is distribution shift still an AI problem | Sanmi Koyejo | Sanmi Koyejo | Distribution shifts describe the phenomena where the deployment performance of an AI model exhibits differences from training. On the one hand, some claim that distribution shifts are ubiquitous in real-world deployments. On the other hand, modern implementations (e.g., foundation models) often claim to be robust to distribution shifts by design. Similarly, phenomena such as “accuracy on the line” promise that standard training produces distribution-shift-robust models. When are these claims valid, and do modern models fail due to distribution shifts? If so, what can be done about it? This talk will outline modern principles and practices for understanding the role of distribution shifts in AI, discuss how the problem has changed, and outline recent methods for engaging with distribution shifts with comprehensive and practical insights. Some highlights include a taxonomy of shifts, the role of foundation models, and finetuning. This talk will also briefly discuss how distribution shifts might interact with AI policy and governance. Bio: Sanmi Koyejo is an assistant professor in the Department of Computer Science at Stanford University and a co-founder of Virtue AI. At Stanford, Koyejo leads the Stanford Trustworthy Artificial Intelligence (STAIR) lab, which works to develop the principles and practice of trustworthy AI, focusing on applications to science and healthcare. Koyejo has been the recipient of several awards, including a Skip Ellis Early Career Award, a Presidential Early Career Award for Scientists and Engineers (PECASE), and a Sloan Fellowship. Koyejo serves on the Neural Information Processing Systems Foundation Board, the Association for Health Learning and Inference Board, and as president of the Black in AI Board. | |
2025-07-21 | 12:20:00 | 12:40:00 | 01A | MLCSB | Locality-aware pooling enhances protein language model performance across varied applications | Minh Hoang | Minh Hoang, Mona Singh | Protein language models (PLMs) are amongst the most exciting recent advances for characterizing protein sequences, and have enabled a diverse set of applications including structure determination, functional property prediction, and mutation impact assessment, all from single protein sequences alone. State-of-the-art PLMs leverage transformer architectures originally developed for natural language processing, and are pre-trained on large protein databases to generate contextualized representations of individual amino acids. To harness the power of these PLMs to predict protein-level properties, these per-residue embeddings are typically "pooled" to fixed-size vectors that are further utilized in downstream prediction networks. Common pooling strategies include Cls-Pooling and Avg-Pooling, but neither of these approaches can capture the local substructures and long-range interactions observed in proteins. To address these weaknesses in existing PLM pooling strategies, we propose the use of attention pooling, which can naturally capture these important features of proteins. To make the expensive attention operator (quadratic in length of the input protein) feasible in practice, we introduce bag-of-mer pooling (BoM-Pooling), a locality-aware hierarchical pooling technique that combines windowed average pooling with attention pooling. We empirically demonstrate that both full attention pooling and BoM-Pooling outperform previous pooling strategies on three important, diverse tasks: (1) predicting the activities of two proteins as they are varied; (2) detecting remote homologs; and (3) predicting signaling interactions with peptides. Overall, our work highlights the advantages of biologically inspired pooling techniques in protein sequence modeling and is a step towards more effective adaptations of language models in biological settings. (A toy sketch of bag-of-mer pooling appears after the agenda table.) | |
2025-07-21 | 12:40:00 | 13:00:00 | 01A | MLCSB | NEAR: Neural Embeddings for Amino acid Relationships | Daniel Olson | Daniel Olson, Thomas Colligan, Daphne Demekas, Jack Roddy, Ken Youens-Clark, Travis Wheeler | Protein language models (PLMs) have recently demonstrated potential to supplant classical protein database search methods based on sequence alignment, but are slower than common alignment-based tools and appear to be prone to a high rate of false labeling. Here, we present NEAR, a method based on neural representation learning that is designed to improve both speed and accuracy of search for likely homologs in a large protein sequence database. NEAR’s ResNet embedding model is trained using contrastive learning guided by trusted sequence alignments. It computes per-residue embeddings for target and query protein sequences, and identifies alignment candidates with a pipeline consisting of residue-level k-NN search and a simple neighbor aggregation scheme. Tests on a benchmark consisting of trusted remote homologs and randomly shuffled decoy sequences reveal that NEAR substantially improves accuracy relative to state-of-the-art PLMs, with lower memory requirements and faster embedding / search speed. While these results suggest that the NEAR model may be useful for standalone homology detection with increased sensitivity over standard alignment-based methods, in this manuscript we focus on a more straightforward analysis of the model's value as a high-speed pre-filter for sensitive annotation. In that context, NEAR is at least 5x faster than the pre-filter currently used in the widely-used profile hidden Markov model (pHMM) search tool, HMMER3, and also outperforms the pre-filter used in our fast pHMM tool, nail. (A toy sketch of residue-level k-NN search with neighbor aggregation appears after the agenda table.) | |
2025-07-21 | 14:00:00 | 14:20:00 | 01A | MLCSB | LoRA-DR-suite: adapted embeddings predict intrinsic and soft disorder from protein sequences | Gianluca Lombardi | Gianluca Lombardi, Beatriz Seoane, Alessandra Carbone | Intrinsic disorder regions (IDR) and soft disorder regions (SDR) provide crucial information on a protein structure to underpin its functioning, interaction with other molecules and assembly path. Circular dichroism experiments are used to identify intrinsic disorder residues, while SDRs are characterized using B-factors, missing residues, or a combination of both in alternative X-ray crystal structures of the same molecule. These flexible regions in proteins are particularly significant in diverse biological processes and are often implicated in pathological conditions. Accurate computational prediction of these disordered regions is thus essential for advancing protein research and understanding their functional implications. To address this challenge, LoRA-DR-suite employs a simple adapter-based architecture that utilizes protein language models embeddings as protein sequence representations, enabling the precise prediction of IDRs and SDRs directly from primary sequence data. Alongside the fast LoRA-DR-suite implementation, we release SoftDis, a unique soft disorder database constructed for approximately 500,000 PDB chains. SoftDis is designed to facilitate new research, testing, and applications on soft disorder, advancing the study of protein dynamics and interactions. | |
2025-07-21 | 14:20:00 | 14:40:00 | 01A | MLCSB | TCR-epiDiff: Solving Dual Challenges of TCR Generation and Binding Prediction | Se Yeon Seo | Se Yeon Seo, Je-Keun Rhee | Motivation: T-cell receptors (TCRs) are fundamental components of the adaptive immune system, recognizing specific antigens for targeted immune responses. Understanding their sequence patterns is essential for designing effective vaccines and immunotherapies. However, the vast diversity of TCR sequences and complex binding mechanisms pose significant challenges in generating TCRs that are specific to a particular epitope. Results: Here, we propose TCR-epiDiff, a diffusion-based deep learning model for generating epitope-specific TCRs and predicting TCR-epitope binding. TCR-epiDiff integrates epitope information during TCR sequence embedding using ProtT5-XL and employs a denoising diffusion probabilistic model for sequence generation. Using external validation datasets, we demonstrate the ability to generate biologically plausible, epitope-specific TCRs. Furthermore, we leverage the model's encoder to develop a TCR-epitope binding predictor that shows robust performance on the external validation data. Our approach provides a comprehensive solution for both de novo generation of epitope-specific TCRs and TCR-epitope binding prediction. This capability provides valuable insights into immune diversity and has the potential to advance targeted immunotherapies. Availability and implementation: The data and source codes for our experiments are available at https://github.com/seoseyeon/TCR-epiDiff | |
2025-07-21 | 14:40:00 | 15:00:00 | 01A | MLCSB | Incorporating Hierarchical Information into Multiple Instance Learning for Patient Phenotype Prediction with scRNA-seq Data | Chau Do | Chau Do, Harri Lähdesmäki | Multiple Instance Learning (MIL) provides a structured approach to patient phenotype prediction with single-cell RNA-sequencing (scRNA-seq) data. However, existing MIL methods tend to overlook the hierarchical structure inherent in scRNA-seq data, especially the biological groupings of cells, or cell types. This limitation may lead to suboptimal performance and poor interpretability at higher levels of cellular division. To address this gap, we present a novel approach to incorporate hierarchical information into the attention-based MIL framework. Specifically, our model applies the attention-based aggregation mechanism over both cells and cell types, thus enforcing a hierarchical structure on the flow of information throughout the model. Across extensive experiments, our proposed approach consistently outperforms existing models and demonstrates robustness in data-constrained scenarios. Moreover, ablation test results show that simply applying the attention mechanism on cell types instead of cells leads to improved performance, underscoring the benefits of incorporating the hierarchical groupings. By identifying the critical cell types that are most relevant for prediction, we show that our model is capable of capturing biologically meaningful associations, thus facilitating biological discoveries. (A toy sketch of this two-level attention pooling appears after the agenda table.) | |
2025-07-21 | 15:00:00 | 15:10:00 | 01A | MLCSB | AI-Guided Multi-Objective Discovery of Antimicrobial Peptides via Self-Play Reinforcement Learning | | Chia-Ru Chung, Tzong-Yi Lee, Yun Tang | The rising crisis of antimicrobial resistance has created an urgent demand for innovative therapeutic solutions, and antimicrobial peptides (AMPs) present a promising option due to their broad-spectrum effectiveness and unique mechanisms of action. However, current experimental and computational discovery pipelines for AMPs are often slow and limited in the diversity of sequences and properties they can explore. Although AI-driven generative models for peptide design are gaining traction, the field still lacks methods that offer adequate sequence diversity, simultaneous optimization of multiple therapeutic properties, and biologically contextualized control over safety and structural attributes. To address these shortcomings, we propose a strategic reinforcement learning framework for multi-objective AMP discovery, employing policy-guided Monte Carlo tree search within a self-play environment. This AI-driven approach utilizes surrogate models for antimicrobial activity, structural stability, and hemolytic toxicity to inform the generation process while incorporating biologically inspired filtering rules to ensure safety and structural constraints. Candidate peptide sequences were further assessed through an ensemble classifier consensus to validate their predicted efficacy and safety robustly. Our results demonstrate that this framework effectively generates structurally reliable, potent, and non-toxic AMPs, surpassing existing design strategies in the diversity of sequences produced and the favorable trade-offs achieved among activity, stability, and toxicity objectives. The AI-designed peptides display stable secondary structures and strong antimicrobial potency with minimal hemolytic activity, highlighting the advantages of our multi-objective optimization strategy. In summary, this work establishes a powerful AI-driven paradigm for therapeutic peptide design. | |
2025-07-21 | 15:10:00 | 15:20:00 | 01A | MLCSB | NetStart 2.0: Prediction of Eukaryotic Translation Initiation Sites Using a Protein Language Model | Line Sandvad Nielsen | Line Sandvad Nielsen, Anders Gorm Pedersen, Ole Winther, Henrik Nielsen | Background: Accurate identification of translation initiation sites is essential for the proper translation of mRNA into functional proteins. In eukaryotes, the choice of the translation initiation site is influenced by multiple factors, including its proximity to the 5' end and the local start codon context. Translation initiation sites mark the transition from non-coding to coding regions. This fact motivates the expectation that the upstream sequence, if translated, would assemble a nonsensical order of amino acids, while the downstream sequence would correspond to the structured beginning of a protein. This distinction suggests potential for predicting translation initiation sites using a protein language model. Results: We present NetStart 2.0, a deep learning-based model that integrates the ESM-2 protein language model with the local sequence context to predict translation initiation sites across a broad range of eukaryotic species. NetStart 2.0 was trained as a single model across multiple species, and despite the broad phylogenetic diversity represented in the training data, it consistently relied on features marking the transition from non-coding to coding regions. Conclusion: By leveraging "protein-ness", NetStart 2.0 achieves state-of-the-art performance in predicting translation initiation sites across a diverse range of eukaryotic species. This success underscores the potential of protein language models to bridge transcript- and peptide-level information in complex biological prediction tasks. The NetStart 2.0 webserver is available at: https://services.healthtech.dtu.dk/services/NetStart-2.0/ | |
2025-07-21 | 15:20:00 | 15:30:00 | 01A | MLCSB | Towards a more inductive world for drug repurposing approaches | Uxía Veleiro | Jesus De la Fuente Cedeño, Guillermo Serrano, Uxía Veleiro, Mikel Casals, Laura Vera, Marija Pizurica, Nuria Gomez-Cebrian, Leonor Puchades-Carrasco, Antonio Pineda-Lucena, Idoia Ochoa, Silve Vicent, Olivier Gevaert, Mikel Hernaez | Drug–target interaction (DTI) prediction is a challenging albeit essential task in drug repurposing. Learning on graph models has drawn special attention as such models can substantially reduce drug repurposing costs and time commitment. However, many current approaches require additional, hard-to-obtain information besides DTIs, which complicates their evaluation process and usability. Additionally, structural differences in the learning architecture of current models hinder their fair benchmarking. In this work, we first perform an in-depth evaluation of current DTI datasets and prediction models through a robust benchmarking process and show that DTI methods based on transductive models lack generalization and lead to inflated performance when traditionally evaluated, making them unsuitable for drug repurposing. We then propose a biologically driven strategy for negative-edge subsampling and, via in vitro validation, uncover previously unknown interactions missed by traditional subsampling. Finally, we provide a toolbox from all generated resources, crucial for fair benchmarking and robust model design. | |
2025-07-21 | 15:30:00 | 15:40:00 | 01A | MLCSB | Hierarchical Multi-Agent Reinforcement Learning For Optimizing CRISPR-Based Polygenic Therapeutic Design | Nhung Duong | Nhung Duong, Tuan Do, Anh Truong, Ngoc Do, Lap Nguyen | Background. CRISPR therapies for polygenic disorders require simultaneous optimization of multiple guides and delivery constraints. Current approaches optimize guides individually, neglecting target interactions. We developed a hierarchical multi-agent reinforcement learning (MARL) framework to optimize CRISPR strategies for polygenic diseases while balancing efficiency, synergy, vector capacity, and immunogenicity. Methods. We implemented a three-layer MARL architecture in which the layers (1) optimize guide RNA sequences for each target gene; (2) select optimal editing modes and guide combinations while maximizing synergy; and (3) ensure vector capacity and immunogenicity constraints are satisfied. For computational feasibility, we used a simplified 1 Mb mini-genome for off-target analysis rather than the full human genome, and employed approximated versions of RuleSet2 scoring, synergy effects, and immunogenicity prediction. We trained the framework using iterative policy optimization and validated it on a model polygenic retinal disease involving five genes with known pathogenic mutations. Results. The framework generated optimized guides with high on-target efficiency (0.70-1.00) and minimal off-target effects. The system selected optimal editing modes (prime editing for PDE6B, RHO, USH2A; base editing for RP1) while maximizing synergistic effects. The final design utilized Cas9 with a total size within the 4,500 bp vector capacity and zero immunogenic epitopes, requiring only two rework iterations. Conclusion. Our MARL approach demonstrates AI's potential for solving complex therapeutic design challenges. The framework navigates multi-dimensional optimization involving sequence, strategy, and clinical constraints simultaneously. While the current implementation uses simplified biological modeling, the foundation is robust and provides a proof-of-concept for AI-guided design of personalized CRISPR therapeutics for polygenic diseases. | |
2025-07-21 | 15:40:00 | 15:50:00 | 01A | MLCSB | Sliding Window INteraction Grammar (SWING): a generalized interaction language model for peptide and protein interactions | Jishnu Das | Jane Siwek, Alisa Omelchenko, Prabal Chhibbar, Alok Joglekar, Jishnu Das | Protein language models (pLMs) can embed protein sequences for different proteomic tasks. However, these methods are suboptimal at learning the language of protein interactions. We developed an interaction LM (iLM), Sliding Window Interaction Grammar (SWING), which leverages differences in amino acid properties to generate an interaction vocabulary. This is embedded by an LM and supervised learning is performed on the embeddings. SWING was used across a range of tasks. Using only sequence information, it successfully predicted both class I and class II pMHC interactions, performing on par with state-of-the-art approaches. Further, the Class I SWING model could uniquely cross-predict Class II interactions, a complex prediction task not attempted by existing methods. A unique Mixed Class model effectively predicted interactions for both classes. Using only human Class I or Class II data, SWING accurately predicted novel murine Class II pMHC interactions involving risk alleles in SLE and T1D. SWING also accurately predicted how Mendelian and population variants can disrupt specific protein-protein interactions, based on sequence information alone. Across these tasks, SWING outperformed passive uses of pLM embeddings, demonstrating the value of the unique iLM architecture. Overall, SWING is a first-in-class generalizable zero-shot iLM that learns the language of PPIs. | |
2025-07-21 | 15:50:00 | 16:00:00 | 01A | MLCSB | Evolutionary constraints guide AlphaFold2 in predicting alternative conformations and inform rational mutation design | Francesca Cuturello | Valerio Piomponi, Alberto Cazzaniga, Francesca Cuturello | Investigating structural variability is essential for understanding protein biological functions. Although AlphaFold2 accurately predicts static structures, it fails to capture the full spectrum of functional states. Recent methods have used AlphaFold2 to generate diverse structural ensembles, but they offer limited interpretability and overlook the evolutionary signals underlying predictions. In this work, we enhance the generation of conformational ensembles and identify sequence patterns that influence alternative fold predictions for several protein families. Building on prior research that clustered Multiple Sequence Alignments to predict fold-switching states, we introduce a refined clustering strategy that integrates protein language model representations with hierarchical clustering, overcoming limitations of density-based methods. Our strategy effectively identifies high-confidence alternative conformations and generates abundant sequence ensembles, providing a robust framework for applying Direct Coupling Analysis (DCA). Through DCA, we uncover key coevolutionary signals within the clustered alignments, leveraging them to design mutations that stabilize specific conformations, which we validate using alchemical free energy calculations from molecular dynamics. Notably, our method extends beyond fold-switching, effectively capturing a variety of conformational changes. | |
2025-07-21 | 16:40:00 | 16:50:00 | 01A | MLCSB | Integrating Machine Learning and Systems Biology to rationally design operational conditions for in vitro / in vivo translation of microphysiological systems | Nikolaos Meimetis | Nikolaos Meimetis, Jose Cadavid, Linda Griffith, Douglas Lauffenburger | Preclinical models are used extensively to study diseases and potential therapeutic treatments. Complex in vitro platforms incorporating human cellular components, known as microphysiological systems (MPS), can model cellular and microenvironmental features of diseased tissues. However, determining experimental conditions -- particularly biomolecular cues such as growth factors, cytokines, and matrix proteins -- providing the most effective translatability of MPS-generated information to in vivo human subject contexts is a major challenge. Here, using metabolic dysfunction-associated fatty liver disease (MAFLD) studied using the CNBio PhysioMimix as a case study, we developed a machine learning framework called Latent In Vitro to In Vivo Translation (LIV2TRANS) to ascertain how MPS data map to in vivo data, first sharpening translation insights and consequently elucidating experimental conditions that can further enhance translation capability. Our findings in this case study highlight TGFβ as a crucial cue for MPS translatability and indicate that adding JAK-STAT pathway perturbations via interferon stimuli could increase the predictive performance of this MPS in MAFLD studies. Finally, we developed an optimization approach that identified androgen and EGFR signaling as key for maximizing the capacity of this MPS to capture in vivo human biological information germane to MAFLD. More broadly, this work establishes a mathematically principled approach for identifying experimental conditions that most beneficially capture in vivo human-relevant molecular pathways and processes, generalizable to preclinical studies for a wide range of diseases and potential treatments. | |
2025-07-21 | 16:50:00 | 17:00:00 | 01A | MLCSB | Data Splitting Against Information Leakage with DataSAIL | Roman Joeres | Roman Joeres, David B. Blumenthal, Olga Kalinina | Information leakage (IL) is an increasingly important topic in machine learning (ML) research, especially in biomedical applications. When IL happens during a model's development, the model is prone to memorizing the training data instead of learning generalizable properties. This can lead to inflated performance metrics that do not reflect the actual performance at inference time. Therefore, we present DataSAIL, a versatile Python package to facilitate leakage-reduced data splitting to enable realistic evaluation of ML models for biological data that are intended to be applied in out-of-distribution scenarios. DataSAIL is based on formulating the search for leakage-reduced data splits as a combinatorial optimization problem. We prove that this problem is NP-hard and provide a scalable heuristic based on clustering and integer linear programming. DataSAIL uses similarities between samples to compute leakage-reduced splits for classical property prediction tasks, stratified splits, and drug-target interaction datasets where information can be leaked along two dimensions. DataSAIL is accepted in principle by Nature Communications. We empirically demonstrate DataSAIL's impact on evaluating biomedical ML models. We compare DataSAIL to seven other algorithms on 14 datasets from the MoleculeNet benchmark and LP-PDBBind. We show that DataSAIL is consistently amongst the best algorithms in removing IL. Furthermore, we train 6 different ML models on each split to evaluate how information leakage affects different models. We observe that deep learning models generally perform better than statistical models and that higher IL leads to inflated performance estimates. Another ablation study shows that DataSAIL reduces IL better than the PLINDER benchmark. (A toy sketch of the cluster-to-split assignment idea appears after the agenda table.) | |
2025-07-21 | 17:00:00 | 18:00:00 | 01A | MLCSB | Generative AI for Unlocking the Complexity of Cells | Maria Brbic | | We are witnessing an AI revolution. At the heart of this revolution are generative AI models that, powered by advanced architectures and large datasets, are transforming AI across a variety of disciplines. But how can AI facilitate and eventually enable discoveries in life sciences? How can it bring us closer to understanding biology, the functions of our cells and relationships across different molecular layers? In this talk, I will present AI methods that can extract meaningful differences between classes from representations of foundation models with minimal or no supervision. I will then introduce generative AI methods designed to uncover relationships across different omics layers. I will demonstrate how these approaches enable the reassembly of tissues from dissociated single cells and how AI-driven tissue reconstruction can overcome existing technological limitations. | |
2025-07-22 | 11:20:00 | 12:20:00 | 01A | MLCSB | Where does it hurt (in your genome)? | Julien Gagneur | | The identification of genetic variants strongly affecting phenotypes remains an unsolved problem with major relevance for rare disease diagnostics, oncology, and the identification of effector genes of complex traits and diseases. I will present a series of published and ongoing work from my lab tackling this issue, with a focus on non-coding variants. This will span variant scoring based on genomic language models [1], methods to predict aberrant expression [2] and splicing [3], all the way to integrative deep learning models for rare variant association analyses demonstrated on UK Biobank [4]. 1. Tomaz da Silva, et al. Nucleotide dependency analysis of DNA language models reveals genomic functional elements. bioRxiv, 2024 2. Hölzlwimmer et al. Aberrant gene expression prediction across human tissues. Nature Communications, 2025 3. Wagner et al. Aberrant splicing prediction across human tissues. Nature Genetics, 2023 4. Clarke, Holtkamp, et al. Integration of variant annotations using deep set networks boosts rare variant association genetics. Nature Genetics, 2024 | |
2025-07-22 | 12:20:00 | 12:40:00 | 01A | MLCSB | Deep learning models for unbiased sequence-based PPI prediction plateau at an accuracy of 0.65 | Judith Bernett | Timo Reim, Anne Hartebrodt, David B. Blumenthal, Judith Bernett, Markus List | As most proteins interact with other proteins to perform their respective functions, methods to computationally predict these interactions have been developed. However, flawed evaluation schemes and data leakage in test sets have obscured the fact that sequence-based protein-protein interaction (PPI) prediction is still an open problem. Recently, methods achieving better-than-random performance on leakage-free PPI data have been proposed. Here, we show that the use of ESM-2 protein embeddings explains this performance gain irrespective of model architecture. We compared the performance of models with varying complexity, per-protein, and per-token embeddings, as well as the influence of self- or cross-attention, where all models plateaued at an accuracy of 0.65. Moreover, we show that the tested sequence-based models cannot implicitly learn a contact map as an intermediate layer. These results imply that other input types, such as structure, might be necessary for producing reliable PPI predictions. (A minimal sketch of such an embedding-based pair classifier appears after the agenda table.) | |
2025-07-22 | 12:40:00 | 13:00:00 | 01A | MLCSB | Accurate PROTAC targeted degradation prediction with DegradeMaster | Jie Liu | Jie Liu, Michael Roy, Luke Isbel, Fuyi Li | Motivation: Proteolysis-targeting chimeras (PROTACs) are heterobifunctional molecules that can degrade ‘undruggable’ proteins of interest (POI) by recruiting E3 ligases and hijacking the ubiquitin-proteasome system. Some efforts have been made to develop deep learning-based approaches to predict the degradation ability of a given PROTAC. However, existing deep learning methods either simplify proteins and PROTACs as 2D graphs by disregarding crucial 3D spatial information or exclusively rely on limited labels for supervised learning without considering the abundant information from unlabeled data. Nevertheless, considering the potential to accelerate drug discovery, developing more accurate computational methods for PROTAC-targeted protein degradation prediction is critical. Results: This study proposes DegradeMaster, a semi-supervised E(3)-equivariant graph neural network-based predictor for targeted degradation prediction of PROTACs. DegradeMaster leverages an E(3)-equivariant graph encoder to incorporate 3D geometric constraints into the molecular representations and utilizes a memory-based pseudo-labeling strategy to enrich annotated data during training. A mutual attention pooling module is also designed for interpretable graph representation. Experiments on both supervised and semi-supervised PROTAC datasets demonstrate that DegradeMaster outperforms state-of-the-art baselines, substantially improving AUROC by 10.5%. Case studies show DegradeMaster achieves 88.33% and 77.78% accuracy in predicting the degradability of VZ185 candidates on BRD9 and ACBI3 on KRAS mutants. Visualization of attention weights on the 3D molecule graph demonstrates that DegradeMaster recognises the linking and binding regions of the warhead and E3 ligands and emphasizes the importance of structural information in these areas for degradation prediction. Together, this shows the potential for cutting-edge tools to highlight functional PROTAC components, thereby accelerating novel compound generation. | |
2025-07-22 | 14:00:00 | 14:20:00 | 01A | MLCSB | GPO-VAE: Modeling Explainable Gene Perturbation Responses utilizing GRN-Aligned Parameter Optimization | Seungheun Baek | Seungheun Baek, Soyon Park, Yan Ting Chok, Mogan Gim, Jaewoo Kang | Predicting cellular responses to genetic perturbations is essential for understanding biological systems and developing targeted therapeutic strategies. While variational autoencoders (VAEs) have shown promise in modeling perturbation responses, their limited explainability poses a significant challenge, as the learned features often lack clear biological meaning. Nevertheless, model explainability is one of the most important aspects in the realm of biological AI. One of the most effective ways to achieve explainability is incorporating the concept of gene regulatory networks (GRNs) in designing deep learning models such as VAEs. GRNs elicit the underlying causal relationships between genes and are capable of explaining the transcriptional responses caused by genetic perturbation treatments. We propose GPO-VAE, an explainable VAE enhanced by GRN-aligned Parameter Optimization that explicitly models gene regulatory networks in the latent space. Our key approach is to optimize the learnable parameters related to latent perturbation effects towards GRN-aligned explainability. Experimental results on perturbation prediction show our model achieves state-of-the-art performance in predicting transcriptional responses across multiple benchmark datasets. Furthermore, additional results on evaluating the GRN inference task reveal our model's ability to generate meaningful GRNs compared to other methods. According to qualitative analysis, GPO-VAE possesses the ability to construct biologically explainable GRNs that align with experimentally validated regulatory pathways. | |
2025-07-22 | 14:20:00 | 14:40:00 | 01A | MLCSB | Fast and scalable Wasserstein-1 neural optimal transport solver for single-cell perturbation prediction | Yanshuo Chen | Yanshuo Chen, Zhengmian Hu, Wei Chen, Heng Huang | Predicting single-cell perturbation responses requires mapping between two unpaired single-cell data distributions. Optimal transport (OT) theory provides a principled framework for constructing such mappings by minimizing transport cost. Recently, Wasserstein-2 (W2) neural optimal transport solvers (e.g., CellOT) have been employed for this prediction task. However, W2 OT relies on the general Kantorovich dual formulation, which involves optimizing over two conjugate functions, leading to a complex min-max optimization problem that converges slowly. To address these challenges, we propose a novel solver based on the Wasserstein-1 (W1) dual formulation. Unlike W2, the W1 dual simplifies the optimization to a maximization problem over a single 1-Lipschitz function, thus eliminating the need for time-consuming min-max optimization. While solving the W1 dual only reveals the transport direction and does not directly provide a unique optimal transport map, we incorporate an additional step using adversarial training to determine an appropriate transport step size, effectively recovering the transport map. Our experiments demonstrate that the proposed W1 neural optimal transport solver can mimic the W2 OT solvers in finding a unique and "monotonic" map on 2D datasets. Moreover, the W1 OT solver achieves performance on par with or surpasses W2 OT solvers on real single-cell perturbation datasets. Furthermore, we show that the W1 OT solver achieves a 25-45× speedup, scales better on high-dimensional transport tasks, and can be directly applied to single-cell RNA-seq datasets with highly variable genes. Our implementation and experiments are open-sourced at https://github.com/poseidonchan/w1ot. (A hedged sketch of training such a 1-Lipschitz critic appears after the agenda table.) | |
2025-07-22 | 14:40:00 | 15:00:00 | 01A | MLCSB | Recovering Time-Varying Networks From Single-Cell Data | Euxhen Hasanaj | Euxhen Hasanaj, Barnabás Póczos, Ziv Bar-Joseph | Gene regulation is a dynamic process that underlies all aspects of human development, disease response, and other key biological processes. The reconstruction of temporal gene regulatory networks has conventionally relied on regression analysis, graphical models, or other types of relevance networks. With the large increase in time series single-cell data, new approaches are needed to address the unique scale and nature of this data for reconstructing such networks. Here, we develop a deep neural network, Marlene, to infer dynamic graphs from time series single-cell gene expression data. Marlene constructs directed gene networks using a self-attention mechanism where the weights evolve over time using recurrent units. By employing meta learning, the model is able to recover accurate temporal networks even for rare cell types. In addition, Marlene can identify gene interactions relevant to specific biological responses, including COVID-19 immune response, fibrosis, and aging, paving the way for potential treatments. The code use to train Marlene is available at https://github.com/euxhenh/Marlene. | |
2025-07-22 | 15:00:00 | 15:10:00 | 01A | MLCSB | Developing a Deep Learning Model for Single-Cell RNA Splicing Analysis | Luyang Li | Luyang Li, You Zhou | Approximately 95% of human genes undergo alternative splicing (AS), a process that allows a single gene to produce multiple proteins with distinct functions. This mechanism enormously increases the complexity of our genome and plays an important role in maintaining health. Disruptions in normal splicing can lead to various diseases, and predicting AS events and understanding their regulatory mechanisms at the single-cell level can open the door to discovering new therapeutic targets. Despite the importance of RNA splicing, current single-cell RNA sequencing (scRNA-seq) efforts primarily focus on gene expression profiling, and very few scRNA-seq computational tools are available for identifying and quantifying RNA splicing. Inspired by the successful application of large language models in biomedical research, we developed a new State space model based framework for Alternative Splicing prediction, named SAS, trained on long-read RNA sequencing data. SAS employs a stacked selective state space model architecture to generate latent state representations of transcript sequences, enabling accurate predictions of diverse AS events, even in data-limited conditions. Furthermore, this model is specifically tailored for single-cell splicing prediction. Our results show that SAS outperforms existing methods, achieving an accuracy of 0.97, PR-AUC of 0.99, and F1 score of 0.97. This innovative framework provides valuable insights into identifying splicing events at single-cell resolution, guiding experimental efforts to uncover novel splicing mechanisms and therapeutic targets. | |
2025-07-22 | 15:10:00 | 15:20:00 | 01A | MLCSB | Benchmarking foundation cell models for post-perturbation RNA-seq prediction | Gerold Csendes | Gerold Csendes, Gema Sanz, Krisóf Szalay, Bence Szalai | Accurately predicting cellular responses to perturbations is essential for understanding cell behaviour in both healthy and diseased states. While perturbation data is ideal for building such predictive models, its availability is considerably lower than baseline (non-perturbed) cellular data. To address this limitation, several foundation cell models have been developed using large-scale single-cell gene expression data. These models are fine-tuned after pre-training for specific tasks, such as predicting post-perturbation gene expression profiles, and are considered state-of-the-art for these problems. However, proper benchmarking of these models remains an unsolved challenge. In this study, we benchmarked two recently published foundation models, scGPT and scFoundation, against baseline models. Surprisingly, we found that even the simplest baseline model - taking the mean of training examples - outperformed scGPT and scFoundation. Furthermore, basic machine learning models that incorporate biologically meaningful features outperformed scGPT by a large margin. Additionally, we identified that the current Perturb-Seq benchmark datasets exhibit low perturbation-specific variance, making them suboptimal for evaluating such models. Our results highlight important limitations in current benchmarking approaches and provide insights into more effectively evaluating post-perturbation gene expression prediction models. (The mean-of-training-examples baseline is written out after the agenda table.) | |
2025-07-22 | 15:20:00 | 15:30:00 | 01A | MLCSB | scPRINT: pre-training on 50 million cells allows robust gene network predictions | Jeremie Kalfon | Jeremie Kalfon, Gabriel Peyré, Laura Cantini | A cell is governed by the interaction of myriads of macromolecules. Inferring such a network of interactions has remained an elusive milestone in cellular biology. Building on recent advances in large foundation models and their ability to learn without supervision, we present scPRINT, a large cell model for the inference of gene networks pre-trained on more than 50 million cells from the cellxgene database. Using innovative pretraining tasks and model architecture, scPRINT pushes large transformer models towards more interpretability and usability when uncovering the complex biology of the cell. Based on our atlas-level benchmarks, scPRINT demonstrates superior performance in gene network inference to the state of the art, as well as competitive zero-shot abilities in denoising, batch effect correction, and cell label prediction. On an atlas of benign prostatic hyperplasia, scPRINT highlights the profound connections between ion exchange, senescence, and chronic inflammation. | |
2025-07-22 | 15:30:00 | 15:40:00 | 01A | MLCSB | MGCL-ST: Multi-view Graph Self-supervised Contrastive Learning for Spatial Transcriptomics Enhancement | Hongmin Cai | Hongmin Cai, Siqi Ding, Weitian Huang | Spatial transcriptomics enables the investigation of gene expression within its native spatial context, but existing technologies often suffer from low resolution and sparse sampling. These limitations hinder the accurate delineation of fine tissue structures and reduce robustness to noise. To address these challenges, we propose MGCL-ST, a spatial transcriptomics super-resolution framework that integrates multi-view contrastive learning with a dual-metric neighbor selection strategy. By combining spatial structure and histological image features, MGCL-ST achieves robust, pixel-level gene expression imputation. Experimental results on both simulated and real datasets demonstrate its superior reconstruction accuracy and generalization capability, supporting advanced analysis of the tumor microenvironment. | |
2025-07-22 | 15:40:00 | 15:50:00 | 01A | MLCSB | Characterizing cell-type spatial relationships across length scales in spatially resolved omics data | Rafael dos Santos Peixoto | Rafael dos Santos Peixoto, Brendan Miller, Maigan Brusko, Gohta Aihara, Lyla Atta, Manjari Anant, Adina Jailova, Mark Atkinson, Todd Brusko, Clive Wasserfall, Jean Fan | Spatially resolved omics (SRO) technologies enable the identification of cell types while preserving their organization within tissues. Application of such technologies offers the opportunity to delineate cell-type spatial relationships, particularly across different length scales, and enhance our understanding of tissue organization and function. To quantify such multi-scale cell-type spatial relationships, we present CRAWDAD, Cell-type Relationship Analysis Workflow Done Across Distances, as an open-source R package. To demonstrate the utility of such multi-scale characterization, recapitulate expected cell-type spatial relationships, and evaluate against other cell-type spatial analyses, we apply CRAWDAD to various simulated and real SRO datasets of diverse tissues assayed by diverse SRO technologies. We further demonstrate how such multi-scale characterization enabled by CRAWDAD can be used to compare cell-type spatial relationships across multiple samples. Finally, we apply CRAWDAD to SRO datasets of the human spleen to identify consistent as well as patient and sample-specific cell-type spatial relationships. In general, we anticipate such multi-scale analysis of SRO data enabled by CRAWDAD will provide useful quantitative metrics to facilitate the identification, characterization, and comparison of cell-type spatial relationships across axes of interest. | |
2025-07-22 | 15:50:00 | 16:00:00 | 01A | MLCSB | Segger: Fast and accurate cell segmentation of imaging-based spatial transcriptomics data | Elyas Heidari | Elyas Heidari, Andrew Moorman, Moritz Gerstung, Dana Pe'Er, Oliver Stegle, Tal Nawy | Accurate cell segmentation is a critical first step in the analysis of imaging-based spatial transcriptomics (iST). Despite decades of research in cell segmentation, current methods fail to address this task with adequate accuracy, tending to either over- or under-segment and to create false-positive transcript assignments; additionally, many methods fail to scale to large datasets with hundreds of millions of transcripts. To address these limitations, we introduce segger, a versatile graph neural network (GNN) that frames cell segmentation as a transcript-to-cell link prediction task. Segger employs a heterogeneous graph representation of individual transcripts and cells, and can optionally leverage single-cell RNA-seq information to enhance transcript assignments. In benchmarks on multiple iST datasets, including a lung adenocarcinoma dataset with membrane staining for validation, segger demonstrates superior sensitivity and specificity compared to existing methods such as Baysor and BIDCell. At the same time, segger requires orders of magnitude less compute time than existing approaches. The Segger software features adaptive tiling and efficient task scheduling, supporting multi-GPU processing and multi-threading for scalability. Segger also includes a new workflow to cluster unassigned transcripts into ‘fragments’, enabling the recovery of information missed by nucleus or membrane marker-dependent methods. Segger is implemented as user-friendly open source software (https://github.com/PMBio/segger), comes with extensive documentation and integrates seamlessly into existing workflows, enabling atlas-scale applications with high accuracy and speed. | |
2025-07-22 | 16:40:00 | 16:50:00 | 01A | MLCSB | Dissecting cellular and molecular mechanisms of pancreatic cancer with deep learning | Aarthi Venkat | Aarthi Venkat, Cathy Garcia, Daniel McQuaid, Smita Krishnaswamy, Mandar Muzumdar | Pancreatic endocrine-exocrine crosstalk plays a key role in normal physiology and disease and is perturbed by altered host metabolic states. For example, obesity imparts a stress-induced endocrine secretion of cholecystokinin (CCK), which promotes pancreatic ductal adenocarcinoma (PDAC), an exocrine tumor. However, the mechanisms governing endocrine-exocrine signaling in obesity-driven tumorigenesis remain unclear. Here, we design a suite of machine learning tools (TrajectoryNet, AAnet, scMMGAN, DiffusionEMD) to reveal from single-cell RNA-seq data the cellular and molecular mechanisms by which beta cells express CCK and promote obesity-driven PDAC. AAnet identifies an immature beta cell state characterized by low insulin and maturation marker expression and high dedifferentiation and immaturity marker expression. TrajectoryNet predicts obesity stimulates this immature state to expand and adapt toward a pro-tumorigenic CCK-hi state, which we validate with in vivo genetic lineage tracing. TrajectoryNet-based gene regulatory network inference predicts cJun regulates CCK, validated by JNK inhibition and CUT&RUN sequencing showing cJun mediates CCK expression by binding to a novel conserved 3’ enhancer ~3kb downstream of the Cck gene. Finally, mapping beta cells from diverse physiologic and pharmacologic stressors, developmental stages, and species to our dataset with scMMGAN and DiffusionEMD reveals concordance between adult beta cell dedifferentiation and embryonic beta cells, as well as shared stress induction mechanisms between obesity and type II diabetes in mice and humans. Together, this work uncovers new avenues to target the endocrine pancreas to subvert exocrine tumorigenesis and highlights the utility of developing biological and computational models in a wet-to-dry and dry-to-wet fashion toward mechanistic discovery. | |
2025-07-22 | 16:50:00 | 17:00:00 | 01A | MLCSB | SpliceSelectNet: A Hierarchical Transformer-Based Deep Learning Model for Splice Site Prediction | Yuna Miyachi | Yuna Miyachi, Kenta Nakai | RNA splicing is a critical post-transcriptional process that enables the generation of diverse protein isoforms. Aberrant splicing is implicated in a wide range of genetic disorders and cancers, making accurate prediction of splice sites and mutation effects essential. Convolutional neural network-based models such as SpliceAI and Pangolin have achieved high accuracy but often lack interpretability. Recently, Transformer-based models like DNABERT and SpTransformer have been applied to genomic sequences, yet they typically inherit input length limitations from natural language processing models, restricting context to a few thousand base pairs, which are insufficient for capturing long-range regulatory signals. To overcome these challenges, we propose SpliceSelectNet (SSNet), a hierarchical Transformer model that integrates local and global attention mechanisms to handle up to 100 kb of input while maintaining nucleotide-level interpretability. Trained on multiple datasets, including those incorporating splice site usage derived from RNA-seq data, SSNet outperforms SpliceAI and Pangolin on the Gencode test dataset, a clinically curated BRCA variant dataset, and a deep intronic variant benchmark. It demonstrates improved performance, particularly in regions characterized by complex splicing regulation, such as long exons and deep introns, as measured by area under the precision-recall curve. Furthermore, SSNet’s attention maps provide direct insight into sequence context. In the case of a pathogenic variant in BRCA1 exon 10, the model highlighted an upstream region that may contribute to cryptic splice site activation. These results demonstrate that SSNet combines high predictive performance with biological interpretability, offering a powerful tool for splicing analysis in both research and clinical settings. | |
2025-07-22 | 17:00:00 | 18:00:00 | 01A | MLCSB | Toward Mechanistic Genomics: Advances in Sequence-to-Function Modeling | Maria Chikina | | Recent advances have firmly established sequence-to-function models as essential tools in modern genomics, enabling unprecedented insights into how genomic sequences drive molecular and cellular phenotypes. As these models have matured—with increasingly robust architectures, improved training strategies, and the emergence of standardized software frameworks—the field has rapidly evolved from proof-of-concept demonstrations to widespread practical applications across a variety of biological systems. With the core methodologies now widely adopted and infrastructure in place, the community's focus is shifting toward ambitious new frontiers. There is growing momentum around developing models that are biologically interpretable, capable of uncovering causal mechanisms of gene regulation, and generalizable to novel contexts—such as predicting the effects of perturbing a regulatory protein rather than simply altering a DNA sequence. These efforts reflect a broader aspiration: to create models that serve not just as black-box predictors, but as scientific instruments that deepen our understanding of genome function. In this talk, we will explore how such models can move us from descriptive genomics to mechanistic insight, highlighting recent innovations in architecture and training that support interpretability, modularity, and reusability. We will examine the contexts in which these models offer clear advantages, the limitations that remain, and practical considerations for their training. Ultimately, we will consider how advancing these models may refine the role of machine learning in biology, supporting not only accurate prediction but also the generation of more detailed and mechanistically informed hypotheses. | |
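A minimal sketch of the bag-of-mer pooling idea from "Locality-aware pooling enhances protein language model performance across varied applications": residues are first averaged within local windows, and the window summaries are then attention-pooled into a single protein vector. The window size, embedding dimension, and the random query standing in for a learned query are illustrative assumptions, not the authors' settings.

```python
# Toy bag-of-mer (BoM) pooling: windowed average pooling followed by attention pooling.
import numpy as np

def bom_pool(residue_emb: np.ndarray, window: int = 16, rng=None) -> np.ndarray:
    """residue_emb: (L, D) per-residue PLM embeddings -> (D,) protein-level embedding."""
    rng = np.random.default_rng(0) if rng is None else rng
    L, D = residue_emb.shape
    # 1) Locality-aware step: average residues within non-overlapping windows ("bags of mers").
    n_win = int(np.ceil(L / window))
    win_emb = np.stack([residue_emb[i * window:(i + 1) * window].mean(axis=0)
                        for i in range(n_win)])              # (n_win, D)
    # 2) Attention pooling over window summaries; a random query stands in for a learned one.
    query = rng.standard_normal(D) / np.sqrt(D)
    scores = win_emb @ query                                  # (n_win,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                                  # softmax attention weights
    return weights @ win_emb                                  # (D,) pooled protein embedding

# Toy usage: a 100-residue protein with 320-dimensional embeddings.
pooled = bom_pool(np.random.default_rng(1).standard_normal((100, 320)))
print(pooled.shape)  # (320,)
```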
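A toy sketch of the residue-level k-NN search and neighbor aggregation described for NEAR: each query residue retrieves its nearest database residues, and hits are tallied per target sequence to rank candidate homologs. Random vectors stand in for the learned ResNet embeddings, and a brute-force search stands in for a real nearest-neighbor index.

```python
# Toy residue-level k-NN search with simple per-target neighbor aggregation.
import numpy as np

rng = np.random.default_rng(0)
D, k = 64, 8
targets = {f"target_{i}": rng.standard_normal((rng.integers(80, 200), D)) for i in range(50)}
query = rng.standard_normal((120, D))                         # one query protein (placeholder embeddings)

# Flatten the target database, remembering which sequence each residue came from.
db = np.vstack(list(targets.values()))
owners = np.concatenate([[name] * emb.shape[0] for name, emb in targets.items()])

scores = {}
for residue in query:                                          # residue-level k-NN search
    dists = np.linalg.norm(db - residue, axis=1)
    for idx in np.argsort(dists)[:k]:                          # k nearest database residues
        scores[owners[idx]] = scores.get(owners[idx], 0) + 1   # simple neighbor aggregation

candidates = sorted(scores, key=scores.get, reverse=True)[:5]
print("top candidate targets:", candidates)
```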
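A toy sketch of the two-level attention idea from "Incorporating Hierarchical Information into Multiple Instance Learning for Patient Phenotype Prediction with scRNA-seq Data": attention pooling is applied first over cells within each annotated cell type and then over the resulting cell-type embeddings to obtain a patient-level representation. The dimensions and the random vectors standing in for learned attention parameters are illustrative only.

```python
# Toy hierarchical attention-based MIL: cells -> cell types -> patient embedding.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(X, v):
    """X: (n, D) instance embeddings, v: (D,) scoring vector -> (D,) bag embedding."""
    w = softmax(np.tanh(X) @ v)            # attention weights over instances
    return w @ X

rng = np.random.default_rng(0)
D = 64
cells_by_type = {                           # hypothetical patient: cell embeddings grouped by cell type
    "T_cell": rng.standard_normal((120, D)),
    "B_cell": rng.standard_normal((40, D)),
    "Monocyte": rng.standard_normal((80, D)),
}
v_cell = rng.standard_normal(D)             # stand-in for learned cell-level attention parameters
v_type = rng.standard_normal(D)             # stand-in for learned cell-type-level attention parameters

# Level 1: pool cells within each cell type; Level 2: pool the cell-type embeddings.
type_embs = np.stack([attention_pool(X, v_cell) for X in cells_by_type.values()])  # (3, D)
patient_emb = attention_pool(type_embs, v_type)                                     # (D,)
print(patient_emb.shape)                    # would feed a phenotype prediction head in the real model
```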
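A toy illustration of the cluster-to-split assignment problem that DataSAIL addresses: pre-computed sample clusters are assigned to train/test so that similarity crossing the split (potential leakage) is minimized while the test split stays near a target size. DataSAIL itself formulates this as an integer linear program solved with a scalable clustering-based heuristic; the brute-force search below is only meant to make the objective concrete.

```python
# Toy leakage-reduced split assignment over a handful of clusters (brute force, not an ILP solver).
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_clusters = 8
sizes = rng.integers(5, 30, size=n_clusters)                  # samples per cluster
sim = rng.random((n_clusters, n_clusters))
sim = (sim + sim.T) / 2                                        # symmetric cluster-cluster similarity
np.fill_diagonal(sim, 0.0)
target_test = 0.2 * sizes.sum()                                # aim for an 80/20 split

best_mask, best_cost = None, np.inf
for assignment in itertools.product([0, 1], repeat=n_clusters):   # 0 = train, 1 = test
    mask = np.array(assignment)
    if mask.sum() in (0, n_clusters):                          # each split needs at least one cluster
        continue
    leakage = sim[mask == 0][:, mask == 1].sum()               # similarity crossing the split
    size_penalty = abs(sizes[mask == 1].sum() - target_test)   # deviation from target test size
    cost = leakage + size_penalty
    if cost < best_cost:
        best_mask, best_cost = mask, cost

print("test clusters:", np.where(best_mask == 1)[0], "objective:", round(float(best_cost), 3))
```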
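A minimal sketch of the kind of sequence-embedding PPI classifier discussed in "Deep learning models for unbiased sequence-based PPI prediction plateau at an accuracy of 0.65": per-protein (mean-pooled) language-model embeddings of the two partners are concatenated and passed to a simple classifier. Random vectors stand in for real ESM-2 embeddings and random labels for real interaction data, so the printed number is meaningless; only the pipeline shape is illustrated.

```python
# Toy per-protein embedding pair classifier for PPI prediction.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_pairs, d = 2000, 1280                       # 1280-d, the size used by the ESM-2 650M model
emb_a = rng.standard_normal((n_pairs, d))     # placeholder per-protein embeddings (partner A)
emb_b = rng.standard_normal((n_pairs, d))     # placeholder per-protein embeddings (partner B)
y = rng.integers(0, 2, size=n_pairs)          # placeholder interact / not-interact labels

X = np.hstack([emb_a, emb_b])                 # pair representation by concatenation
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("held-out accuracy:", round(clf.score(X_te, y_te), 3))  # ~0.5 here, since the data are random
```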
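A hedged sketch of the Wasserstein-1 dual idea behind the single-cell perturbation OT solver: a single critic f is trained under an (approximate) 1-Lipschitz constraint to maximize E_source[f] - E_target[f], and its gradient then provides a transport direction for each source cell. The gradient penalty, network architecture, and fixed step size below are generic stand-ins, not the paper's implementation (which recovers the step size adversarially).

```python
# Toy W1 critic: maximize E_source[f] - E_target[f] with a gradient-penalty Lipschitz surrogate.
import torch
import torch.nn as nn

d = 50
critic = nn.Sequential(nn.Linear(d, 128), nn.ReLU(), nn.Linear(128, 1))
opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

source = torch.randn(512, d)          # e.g. control cells (placeholder data)
target = torch.randn(512, d) + 0.5    # e.g. perturbed cells (placeholder data)

for step in range(200):
    xs, xt = source, target
    # Gradient penalty on interpolates keeps f approximately 1-Lipschitz.
    eps = torch.rand(xs.size(0), 1)
    xi = (eps * xs + (1 - eps) * xt).requires_grad_(True)
    grad = torch.autograd.grad(critic(xi).sum(), xi, create_graph=True)[0]
    gp = ((grad.norm(dim=1) - 1) ** 2).mean()
    # Maximize E_source[f] - E_target[f]  <=>  minimize its negative.
    loss = -(critic(xs).mean() - critic(xt).mean()) + 10.0 * gp
    opt.zero_grad()
    loss.backward()
    opt.step()

# Transport direction for each source cell: move "downhill" along the critic.
x = source.clone().requires_grad_(True)
direction = -torch.autograd.grad(critic(x).sum(), x)[0]
transported = source + 0.1 * direction     # fixed small step in place of the learned step size
print(transported.shape)
```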
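The "simplest baseline" referenced in the foundation-model benchmarking abstract, written out: predict the mean post-perturbation expression profile of the training perturbations for every held-out perturbation and score it with a per-perturbation correlation. The array shapes, random placeholder data, and metric choice are illustrative assumptions, not the study's setup.

```python
# Mean-of-training-examples baseline for post-perturbation expression prediction.
import numpy as np

rng = np.random.default_rng(0)
n_train_perts, n_test_perts, n_genes = 80, 20, 2000
train_profiles = rng.standard_normal((n_train_perts, n_genes))   # placeholder post-perturbation profiles
test_profiles = rng.standard_normal((n_test_perts, n_genes))

mean_baseline = train_profiles.mean(axis=0)                       # one profile reused for every prediction
pred = np.tile(mean_baseline, (n_test_perts, 1))

# Pearson correlation per held-out perturbation, a common evaluation choice for this task.
corr = [np.corrcoef(p, t)[0, 1] for p, t in zip(pred, test_profiles)]
print("mean Pearson r:", round(float(np.mean(corr)), 3))
```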