Date | Start Time | End Time | Room | Track | Title | Confirmed Presenter | Authors | Abstract |
---|---|---|---|---|---|---|---|---|
2025-07-21 | 11:20:00 | 12:20:00 | 01A | MLCSB | Is distribution shift still an AI problem | Sanmi Koyejo | Sanmi Koyejo | Distribution shifts describe the phenomena where the deployment performance of an AI model exhibits differences from training. On the one hand, some claim that distribution shifts are ubiquitous in real-world deployments. On the other hand, modern implementations (e.g., foundation models) often claim to be robust to distribution shifts by design. Similarly, phenomena such as “accuracy on the line” promise that standard training produces distribution-shift-robust models. When are these claims valid, and do modern models fail due to distribution shifts? If so, what can be done about it? This talk will outline modern principles and practices for understanding the role of distribution shifts in AI, discuss how the problem has changed, and outline recent methods for engaging with distribution shifts with comprehensive and practical insights. Some highlights include a taxonomy of shifts, the role of foundation models, and finetuning. This talk will also briefly discuss how distribution shifts might interact with AI policy and governance. Bio: Sanmi Koyejo is an assistant professor in the Department of Computer Science at Stanford University and a co-founder of Virtue AI. At Stanford, Koyejo leads the Stanford Trustworthy Artificial Intelligence (STAIR) lab, which works to develop the principles and practice of trustworthy AI, focusing on applications to science and healthcare. Koyejo has been the recipient of several awards, including a Skip Ellis Early Career Award, a Presidential Early Career Award for Scientists and Engineers (PECASE), and a Sloan Fellowship. Koyejo serves on the Neural Information Processing Systems Foundation Board, the Association for Health Learning and Inference Board, and as president of the Black in AI Board. | |
2025-07-21 | 12:20:00 | 12:40:00 | 01A | MLCSB | Locality-aware pooling enhances protein language model performance across varied applications | Minh Hoang | Minh Hoang, Mona Singh | Protein language models (PLMs) are amongst the most exciting recent advances for characterizing protein sequences, and have enabled a diverse set of applications including structure determination, functional property prediction, and mutation impact assessment, all from single protein sequences alone. State-of-the-art PLMs leverage transformer architectures originally developed for natural language processing, and are pre-trained on large protein databases to generate contextualized representations of individual amino acids. To harness the power of these PLMs to predict protein-level properties, these per-residue embeddings are typically "pooled" to fixed-size vectors that are further utilized in downstream prediction networks. Common pooling strategies include Cls-Pooling and Avg-Pooling, but neither of these approaches can capture the local substructures and long-range interactions observed in proteins. To address these weaknesses in existing PLM pooling strategies, we propose the use of attention pooling, which can naturally capture these important features of proteins. To make the expensive attention operator (quadratic in length of the input protein) feasible in practice, we introduce bag-of-mer pooling (BoM-Pooling), a locality-aware hierarchical pooling technique that combines windowed average pooling with attention pooling. We empirically demonstrate that both full attention pooling and BoM-Pooling outperform previous pooling strategies on three important, diverse tasks: (1) predicting the activities of two proteins as they are varied; (2) detecting remote homologs; and (3) predicting signaling interactions with peptides. Overall, our work highlights the advantages of biologically inspired pooling techniques in protein sequence modeling and is a step towards more effective adaptations of language models in biological settings. (A toy sketch of bag-of-mer pooling appears after the agenda table.) | |
2025-07-21 | 12:40:00 | 13:00:00 | 01A | MLCSB | NEAR: Neural Embeddings for Amino acid Relationships | Daniel Olson | Daniel Olson, Thomas Colligan, Daphne Demekas, Jack Roddy, Ken Youens-Clark, Travis Wheeler | Protein language models (PLMs) have recently demonstrated potential to supplant classical protein database search methods based on sequence alignment, but are slower than common alignment-based tools and appear to be prone to a high rate of false labeling. Here, we present NEAR, a method based on neural representation learning that is designed to improve both speed and accuracy of search for likely homologs in a large protein sequence database. NEAR’s ResNet embedding model is trained using contrastive learning guided by trusted sequence alignments. It computes per-residue embeddings for target and query protein sequences, and identifies alignment candidates with a pipeline consisting of residue-level k-NN search and a simple neighbor aggregation scheme. Tests on a benchmark consisting of trusted remote homologs and randomly shuffled decoy sequences reveal that NEAR substantially improves accuracy relative to state-of-the-art PLMs, with lower memory requirements and faster embedding / search speed. While these results suggest that the NEAR model may be useful for standalone homology detection with increased sensitivity over standard alignment-based methods, in this manuscript we focus on a more straightforward analysis of the model's value as a high-speed pre-filter for sensitive annotation. In that context, NEAR is at least 5x faster than the pre-filter currently used in the widely-used profile hidden Markov model (pHMM) search tool, HMMER3, and also outperforms the pre-filter used in our fast pHMM tool, nail. (A toy sketch of residue-level k-NN search with neighbor aggregation appears after the agenda table.) | |
2025-07-21 | 14:00:00 | 14:20:00 | 01A | MLCSB | LoRA-DR-suite: adapted embeddings predict intrinsic and soft disorder from protein sequences | Gianluca Lombardi | Gianluca Lombardi, Beatriz Seoane, Alessandra Carbone | Intrinsic disorder regions (IDR) and soft disorder regions (SDR) provide crucial information on a protein structure to underpin its functioning, interaction with other molecules and assembly path. Circular dichroism experiments are used to identify intrinsic disorder residues, while SDRs are characterized using B-factors, missing residues, or a combination of both in alternative X-ray crystal structures of the same molecule. These flexible regions in proteins are particularly significant in diverse biological processes and are often implicated in pathological conditions. Accurate computational prediction of these disordered regions is thus essential for advancing protein research and understanding their functional implications. To address this challenge, LoRA-DR-suite employs a simple adapter-based architecture that utilizes protein language models embeddings as protein sequence representations, enabling the precise prediction of IDRs and SDRs directly from primary sequence data. Alongside the fast LoRA-DR-suite implementation, we release SoftDis, a unique soft disorder database constructed for approximately 500,000 PDB chains. SoftDis is designed to facilitate new research, testing, and applications on soft disorder, advancing the study of protein dynamics and interactions. | |
2025-07-21 | 14:20:00 | 14:40:00 | 01A | MLCSB | TCR-epiDiff: Solving Dual Challenges of TCR Generation and Binding Prediction | Se Yeon Seo | Se Yeon Seo, Je-Keun Rhee | Motivation: T-cell receptors (TCRs) are fundamental components of the adaptive immune system, recognizing specific antigens for targeted immune responses. Understanding their sequence patterns is essential for designing effective vaccines and immunotherapies. However, the vast diversity of TCR sequences and complex binding mechanisms pose significant challenges in generating TCRs that are specific to a particular epitope. Results: Here, we propose TCR-epiDiff, a diffusion-based deep learning model for generating epitope-specific TCRs and predicting TCR-epitope binding. TCR-epiDiff integrates epitope information during TCR sequence embedding using ProtT5-XL and employs a denoising diffusion probabilistic model for sequence generation. Using external validation datasets, we demonstrate the ability to generate biologically plausible, epitope-specific TCRs. Furthermore, we leverage the model's encoder to develop a TCR-epitope binding predictor that shows robust performance on the external validation data. Our approach provides a comprehensive solution for both de novo generation of epitope-specific TCRs and TCR-epitope binding prediction. This capability provides valuable insights into immune diversity and has the potential to advance targeted immunotherapies. Availability and implementation: The data and source codes for our experiments are available at https://github.com/seoseyeon/TCR-epiDiff | |
2025-07-21 | 14:40:00 | 15:00:00 | 01A | MLCSB | Incorporating Hierarchical Information into Multiple Instance Learning for Patient Phenotype Prediction with scRNA-seq Data | Chau Do | Chau Do, Harri Lähdesmäki | Multiple Instance Learning (MIL) provides a structured approach to patient phenotype prediction with single-cell RNA-sequencing (scRNA-seq) data. However, existing MIL methods tend to overlook the hierarchical structure inherent in scRNA-seq data, especially the biological groupings of cells, or cell types. This limitation may lead to suboptimal performance and poor interpretability at higher levels of cellular division. To address this gap, we present a novel approach to incorporate hierarchical information into the attention-based MIL framework. Specifically, our model applies the attention-based aggregation mechanism over both cells and cell types, thus enforcing a hierarchical structure on the flow of information throughout the model. Across extensive experiments, our proposed approach consistently outperforms existing models and demonstrates robustness in data-constrained scenarios. Moreover, ablation test results show that simply applying the attention mechanism on cell types instead of cells leads to improved performance, underscoring the benefits of incorporating the hierarchical groupings. By identifying the critical cell types that are most relevant for prediction, we show that our model is capable of capturing biologically meaningful associations, thus facilitating biological discoveries. (A toy sketch of this two-level attention pooling appears after the agenda table.) | |
2025-07-21 | 15:00:00 | 15:10:00 | 01A | MLCSB | AI-Guided Multi-Objective Discovery of Antimicrobial Peptides via Self-Play Reinforcement Learning | | Chia-Ru Chung, Tzong-Yi Lee, Yun Tang | The rising crisis of antimicrobial resistance has created an urgent demand for innovative therapeutic solutions, and antimicrobial peptides (AMPs) present a promising option due to their broad-spectrum effectiveness and unique mechanisms of action. However, current experimental and computational discovery pipelines for AMPs are often slow and limited in the diversity of sequences and properties they can explore. Although AI-driven generative models for peptide design are gaining traction, the field still lacks methods that offer adequate sequence diversity, simultaneous optimization of multiple therapeutic properties, and biologically contextualized control over safety and structural attributes. To address these shortcomings, we propose a strategic reinforcement learning framework for multi-objective AMP discovery, employing policy-guided Monte Carlo tree search within a self-play environment. This AI-driven approach utilizes surrogate models for antimicrobial activity, structural stability, and hemolytic toxicity to inform the generation process while incorporating biologically inspired filtering rules to ensure safety and structural constraints. Candidate peptide sequences were further assessed through an ensemble classifier consensus to validate their predicted efficacy and safety robustly. Our results demonstrate that this framework effectively generates structurally reliable, potent, and non-toxic AMPs, surpassing existing design strategies in the diversity of sequences produced and the favorable trade-offs achieved among activity, stability, and toxicity objectives. The AI-designed peptides display stable secondary structures and strong antimicrobial potency with minimal hemolytic activity, highlighting the advantages of our multi-objective optimization strategy. In summary, this work establishes a powerful AI-driven paradigm for therapeutic peptide design. | |
2025-07-21 | 15:10:00 | 15:20:00 | 01A | MLCSB | NetStart 2.0: Prediction of Eukaryotic Translation Initiation Sites Using a Protein Language Model | Line Sandvad Nielsen | Line Sandvad Nielsen, Anders Gorm Pedersen, Ole Winther, Henrik Nielsen | Background: Accurate identification of translation initiation sites is essential for the proper translation of mRNA into functional proteins. In eukaryotes, the choice of the translation initiation site is influenced by multiple factors, including its proximity to the 5' end and the local start codon context. Translation initiation sites mark the transition from non-coding to coding regions. This fact motivates the expectation that the upstream sequence, if translated, would assemble a nonsensical order of amino acids, while the downstream sequence would correspond to the structured beginning of a protein. This distinction suggests potential for predicting translation initiation sites using a protein language model. Results: We present NetStart 2.0, a deep learning-based model that integrates the ESM-2 protein language model with the local sequence context to predict translation initiation sites across a broad range of eukaryotic species. NetStart 2.0 was trained as a single model across multiple species, and despite the broad phylogenetic diversity represented in the training data, it consistently relied on features marking the transition from non-coding to coding regions. Conclusion: By leveraging "protein-ness", NetStart 2.0 achieves state-of-the-art performance in predicting translation initiation sites across a diverse range of eukaryotic species. This success underscores the potential of protein language models to bridge transcript- and peptide-level information in complex biological prediction tasks. The NetStart 2.0 webserver is available at: https://services.healthtech.dtu.dk/services/NetStart-2.0/ | |
2025-07-21 | 15:20:00 | 15:30:00 | 01A | MLCSB | Towards a more inductive world for drug repurposing approaches | Uxía Veleiro | Jesus De la Fuente Cedeño, Guillermo Serrano, Uxía Veleiro, Mikel Casals, Laura Vera, Marija Pizurica, Nuria Gomez-Cebrian, Leonor Puchades-Carrasco, Antonio Pineda-Lucena, Idoia Ochoa, Silve Vicent, Olivier Gevaert, Mikel Hernaez | Drug–target interaction (DTI) prediction is a challenging albeit essential task in drug repurposing. Learning on graph models has drawn special attention as such models can substantially reduce drug repurposing costs and time commitment. However, many current approaches require additional, hard-to-obtain information besides DTIs, which complicates their evaluation process and usability. Additionally, structural differences in the learning architecture of current models hinder their fair benchmarking. In this work, we first perform an in-depth evaluation of current DTI datasets and prediction models through a robust benchmarking process and show that DTI methods based on transductive models lack generalization and lead to inflated performance when traditionally evaluated, making them unsuitable for drug repurposing. We then propose a biologically driven strategy for negative-edge subsampling and, via in vitro validation, uncover previously unknown interactions missed by traditional subsampling. Finally, we provide a toolbox from all generated resources, crucial for fair benchmarking and robust model design. | |
2025-07-21 | 15:30:00 | 15:40:00 | 01A | MLCSB | Hierarchical Multi-Agent Reinforcement Learning For Optimizing CRISPR-Based Polygenic Therapeutic Design | Nhung Duong | Nhung Duong, Tuan Do, Anh Truong, Ngoc Do, Lap Nguyen | Background. CRISPR therapies for polygenic disorders require simultaneous optimization of multiple guides and delivery constraints. Current approaches optimize guides individually, neglecting target interactions. We developed a hierarchical multi-agent reinforcement learning (MARL) framework to optimize CRISPR strategies for polygenic diseases while balancing efficiency, synergy, vector capacity, and immunogenicity. Methods. We implemented a three-layer MARL architecture in which the layers (1) optimize guide RNA sequences for each target gene; (2) select optimal editing modes and guide combinations while maximizing synergy; and (3) ensure vector capacity and immunogenicity constraints are satisfied. For computational feasibility, we used a simplified 1 Mb mini-genome for off-target analysis rather than the full human genome, and employed approximated versions of RuleSet2 scoring, synergy effects, and immunogenicity prediction. We trained the framework using iterative policy optimization and validated it on a model polygenic retinal disease involving five genes with known pathogenic mutations. Results. The framework generated optimized guides with high on-target efficiency (0.70-1.00) and minimal off-target effects. The system selected optimal editing modes (prime editing for PDE6B, RHO, USH2A; base editing for RP1) while maximizing synergistic effects. The final design utilized Cas9 with a total size within the 4,500 bp vector capacity and zero immunogenic epitopes, requiring only two rework iterations. Conclusion. Our MARL approach demonstrates AI's potential for solving complex therapeutic design challenges. The framework navigates multi-dimensional optimization involving sequence, strategy, and clinical constraints simultaneously. While the current implementation uses simplified biological modeling, the foundation is robust and provides a proof-of-concept for AI-guided design of personalized CRISPR therapeutics for polygenic diseases. | |
2025-07-21 | 15:40:00 | 15:50:00 | 01A | MLCSB | Sliding Window INteraction Grammar (SWING): a generalized interaction language model for peptide and protein interactions | Jishnu Das | Jane Siwek, Alisa Omelchenko, Prabal Chhibbar, Alok Joglekar, Jishnu Das | Protein language models (pLMs) can embed protein sequences for different proteomic tasks. However, these methods are suboptimal at learning the language of protein interactions. We developed an interaction LM (iLM), Sliding Window Interaction Grammar (SWING), which leverages differences in amino acid properties to generate an interaction vocabulary. This is embedded by an LM and supervised learning is performed on the embeddings. SWING was used across a range of tasks. Using only sequence information, it successfully predicted both class I and class II pMHC interactions, performing on par with state-of-the-art approaches. Further, the Class I SWING model could uniquely cross-predict Class II interactions, a complex prediction task not attempted by existing methods. A unique Mixed Class model effectively predicted interactions for both classes. Using only human Class I or Class II data, SWING accurately predicted novel murine Class II pMHC interactions involving risk alleles in SLE and T1D. SWING also accurately predicted how Mendelian and population variants can disrupt specific protein-protein interactions, based on sequence information alone. Across these tasks, SWING outperformed passive uses of pLM embeddings, demonstrating the value of the unique iLM architecture. Overall, SWING is a first-in-class generalizable zero-shot iLM that learns the language of PPIs. | |
2025-07-21 | 15:50:00 | 16:00:00 | 01A | MLCSB | Evolutionary constraints guide AlphaFold2 in predicting alternative conformations and inform rational mutation design | Francesca Cuturello | Valerio Piomponi, Alberto Cazzaniga, Francesca Cuturello | Investigating structural variability is essential for understanding protein biological functions. Although AlphaFold2 accurately predicts static structures, it fails to capture the full spectrum of functional states. Recent methods have used AlphaFold2 to generate diverse structural ensembles, but they offer limited interpretability and overlook the evolutionary signals underlying predictions. In this work, we enhance the generation of conformational ensembles and identify sequence patterns that influence alternative fold predictions for several protein families. Building on prior research that clustered Multiple Sequence Alignments to predict fold-switching states, we introduce a refined clustering strategy that integrates protein language model representations with hierarchical clustering, overcoming limitations of density-based methods. Our strategy effectively identifies high-confidence alternative conformations and generates abundant sequence ensembles, providing a robust framework for applying Direct Coupling Analysis (DCA). Through DCA, we uncover key coevolutionary signals within the clustered alignments, leveraging them to design mutations that stabilize specific conformations, which we validate using alchemical free energy calculations from molecular dynamics. Notably, our method extends beyond fold-switching, effectively capturing a variety of conformational changes. | |
2025-07-21 | 16:40:00 | 16:50:00 | 01A | MLCSB | Integrating Machine Learning and Systems Biology to rationally design operational conditions for in vitro / in vivo translation of microphysiological systems | Nikolaos Meimetis | Nikolaos Meimetis, Jose Cadavid, Linda Griffith, Douglas Lauffenburger | Preclinical models are used extensively to study diseases and potential therapeutic treatments. Complex in vitro platforms incorporating human cellular components, known as microphysiological systems (MPS), can model cellular and microenvironmental features of diseased tissues. However, determining experimental conditions -- particularly biomolecular cues such as growth factors, cytokines, and matrix proteins -- providing the most effective translatability of MPS-generated information to in vivo human subject contexts is a major challenge. Here, using metabolic dysfunction-associated fatty liver disease (MAFLD) studied using the CNBio PhysioMimix as a case study, we developed a machine learning framework called Latent In Vitro to In Vivo Translation (LIV2TRANS) to ascertain how MPS data map to in vivo data, first sharpening translation insights and consequently elucidating experimental conditions that can further enhance translation capability. Our findings in this case study highlight TGFβ as a crucial cue for MPS translatability and indicate that adding JAK-STAT pathway perturbations via interferon stimuli could increase the predictive performance of this MPS in MAFLD studies. Finally, we developed an optimization approach that identified androgen and EGFR signaling as key for maximizing the capacity of this MPS to capture in vivo human biological information germane to MAFLD. More broadly, this work establishes a mathematically principled approach for identifying experimental conditions that most beneficially capture in vivo human-relevant molecular pathways and processes, generalizable to preclinical studies for a wide range of diseases and potential treatments. | |
2025-07-21 | 16:50:00 | 17:00:00 | 01A | MLCSB | Data Splitting Against Information Leakage with DataSAIL | Roman Joeres | Roman Joeres, David B. Blumenthal, Olga Kalinina | Information leakage (IL) is an increasingly important topic in machine learning (ML) research, especially in biomedical applications. When IL happens during a model's development, the model is prone to memorizing the training data instead of learning generalizable properties. This can lead to inflated performance metrics that do not reflect the actual performance at inference time. Therefore, we present DataSAIL, a versatile Python package to facilitate leakage-reduced data splitting to enable realistic evaluation of ML models for biological data that are intended to be applied in out-of-distribution scenarios. DataSAIL is based on formulating the search for leakage-reduced data splits as a combinatorial optimization problem. We prove that this problem is NP-hard and provide a scalable heuristic based on clustering and integer linear programming. DataSAIL uses similarities between samples to compute leakage-reduced splits for classical property prediction tasks, stratified splits, and drug-target interaction datasets where information can be leaked along two dimensions. DataSAIL is accepted in principle by Nature Communications. We empirically demonstrate DataSAIL's impact on evaluating biomedical ML models. We compare DataSAIL to seven other algorithms on 14 datasets from the MoleculeNet benchmark and LP-PDBBind. We show that DataSAIL is consistently amongst the best algorithms in removing IL. Furthermore, we train 6 different ML models on each split to evaluate how information leakage affects different models. We observe that deep learning models generally perform better than statistical models and that higher IL leads to inflated performance estimates. Another ablation study shows that DataSAIL reduces IL better than the PLINDER benchmark. (A toy sketch of the cluster-to-split assignment idea appears after the agenda table.) | |
2025-07-21 | 17:00:00 | 18:00:00 | 01A | MLCSB | Generative AI for Unlocking the Complexity of Cells | Maria Brbic | | We are witnessing an AI revolution. At the heart of this revolution are generative AI models that, powered by advanced architectures and large datasets, are transforming AI across a variety of disciplines. But how can AI facilitate and eventually enable discoveries in life sciences? How can it bring us closer to understanding biology, the functions of our cells and relationships across different molecular layers? In this talk, I will present AI methods that can extract meaningful differences between classes from representations of foundation models with minimal or no supervision. I will then introduce generative AI methods designed to uncover relationships across different omics layers. I will demonstrate how these approaches enable the reassembly of tissues from dissociated single cells and how AI-driven tissue reconstruction can overcome existing technological limitations. | |
2025-07-22 | 11:20:00 | 12:20:00 | 01A | MLCSB | Where does it hurt (in your genome)? | Julien Gagneur | | The identification of genetic variants strongly affecting phenotypes remains an unsolved problem with major relevance for rare disease diagnostics, oncology, and the identification of effector genes of complex traits and diseases. I will present a series of published and ongoing work from my lab tackling this issue, with a focus on non-coding variants. This will span variant scoring based on genomic language models [1], methods to predict aberrant expression [2] and splicing [3], all the way to integrative deep learning models for rare variant association analyses demonstrated on UK Biobank [4]. 1. Tomaz da Silva, et al. Nucleotide dependency analysis of DNA language models reveals genomic functional elements. bioRxiv, 2024 2. Hölzlwimmer et al. Aberrant gene expression prediction across human tissues. Nature Communications, 2025 3. Wagner et al. Aberrant splicing prediction across human tissues. Nature Genetics, 2023 4. Clarke, Holtkamp, et al. Integration of variant annotations using deep set networks boosts rare variant association genetics. Nature Genetics, 2024 | |
2025-07-22 | 12:20:00 | 12:40:00 | 01A | MLCSB | Deep learning models for unbiased sequence-based PPI prediction plateau at an accuracy of 0.65 | Judith Bernett | Timo Reim, Anne Hartebrodt, David B. Blumenthal, Judith Bernett, Markus List | As most proteins interact with other proteins to perform their respective functions, methods to computationally predict these interactions have been developed. However, flawed evaluation schemes and data leakage in test sets have obscured the fact that sequence-based protein-protein interaction (PPI) prediction is still an open problem. Recently, methods achieving better-than-random performance on leakage-free PPI data have been proposed. Here, we show that the use of ESM-2 protein embeddings explains this performance gain irrespective of model architecture. We compared the performance of models with varying complexity, per-protein, and per-token embeddings, as well as the influence of self- or cross-attention, where all models plateaued at an accuracy of 0.65. Moreover, we show that the tested sequence-based models cannot implicitly learn a contact map as an intermediate layer. These results imply that other input types, such as structure, might be necessary for producing reliable PPI predictions. (A minimal sketch of such an embedding-based pair classifier appears after the agenda table.) | |
2025-07-22 | 12:40:00 | 13:00:00 | 01A | MLCSB | Accurate PROTAC targeted degradation prediction with DegradeMaster | Jie Liu | Jie Liu, Michael Roy, Luke Isbel, Fuyi Li | Motivation: Proteolysis-targeting chimeras (PROTACs) are heterobifunctional molecules that can degrade ‘undruggable’ proteins of interest (POI) by recruiting E3 ligases and hijacking the ubiquitin-proteasome system. Some efforts have been made to develop deep learning-based approaches to predict the degradation ability of a given PROTAC. However, existing deep learning methods either simplify proteins and PROTACs as 2D graphs by disregarding crucial 3D spatial information or exclusively rely on limited labels for supervised learning without considering the abundant information from unlabeled data. Nevertheless, considering the potential to accelerate drug discovery, developing more accurate computational methods for PROTAC-targeted protein degradation prediction is critical. Results: This study proposes DegradeMaster, a semi-supervised E(3)-equivariant graph neural network-based predictor for targeted degradation prediction of PROTACs. DegradeMaster leverages an E(3)-equivariant graph encoder to incorporate 3D geometric constraints into the molecular representations and utilizes a memory-based pseudo-labeling strategy to enrich annotated data during training. A mutual attention pooling module is also designed for interpretable graph representation. Experiments on both supervised and semi-supervised PROTAC datasets demonstrate that DegradeMaster outperforms state-of-the-art baselines, substantially improving AUROC by 10.5%. Case studies show DegradeMaster achieves 88.33% and 77.78% accuracy in predicting the degradability of VZ185 candidates on BRD9 and ACBI3 on KRAS mutants. Visualization of attention weights on the 3D molecule graph demonstrates that DegradeMaster recognises the linking and binding regions of the warhead and E3 ligands and emphasizes the importance of structural information in these areas for degradation prediction. Together, this shows the potential for cutting-edge tools to highlight functional PROTAC components, thereby accelerating novel compound generation. | |
2025-07-22 | 14:00:00 | 14:20:00 | 01A | MLCSB | GPO-VAE: Modeling Explainable Gene Perturbation Responses utilizing GRN-Aligned Parameter Optimization | Seungheun Baek | Seungheun Baek, Soyon Park, Yan Ting Chok, Mogan Gim, Jaewoo Kang | Predicting cellular responses to genetic perturbations is essential for understanding biological systems and developing targeted therapeutic strategies. While variational autoencoders (VAEs) have shown promise in modeling perturbation responses, their limited explainability poses a significant challenge, as the learned features often lack clear biological meaning. Nevertheless, model explainability is one of the most important aspects in the realm of biological AI. One of the most effective ways to achieve explainability is incorporating the concept of gene regulatory networks (GRNs) in designing deep learning models such as VAEs. GRNs elicit the underlying causal relationships between genes and are capable of explaining the transcriptional responses caused by genetic perturbation treatments. We propose GPO-VAE, an explainable VAE enhanced by GRN-aligned Parameter Optimization that explicitly models gene regulatory networks in the latent space. Our key approach is to optimize the learnable parameters related to latent perturbation effects towards GRN-aligned explainability. Experimental results on perturbation prediction show our model achieves state-of-the-art performance in predicting transcriptional responses across multiple benchmark datasets. Furthermore, additional results on evaluating the GRN inference task reveal our model's ability to generate meaningful GRNs compared to other methods. According to qualitative analysis, GPO-VAE possesses the ability to construct biologically explainable GRNs that align with experimentally validated regulatory pathways. | |
2025-07-22 | 14:20:00 | 14:40:00 | 01A | MLCSB | Fast and scalable Wasserstein-1 neural optimal transport solver for single-cell perturbation prediction | Yanshuo Chen | Yanshuo Chen, Zhengmian Hu, Wei Chen, Heng Huang | Predicting single-cell perturbation responses requires mapping between two unpaired single-cell data distributions. Optimal transport (OT) theory provides a principled framework for constructing such mappings by minimizing transport cost. Recently, Wasserstein-2 (W2) neural optimal transport solvers (e.g., CellOT) have been employed for this prediction task. However, W2 OT relies on the general Kantorovich dual formulation, which involves optimizing over two conjugate functions, leading to a complex min-max optimization problem that converges slowly. To address these challenges, we propose a novel solver based on the Wasserstein-1 (W1) dual formulation. Unlike W2, the W1 dual simplifies the optimization to a maximization problem over a single 1-Lipschitz function, thus eliminating the need for time-consuming min-max optimization. While solving the W1 dual only reveals the transport direction and does not directly provide a unique optimal transport map, we incorporate an additional step using adversarial training to determine an appropriate transport step size, effectively recovering the transport map. Our experiments demonstrate that the proposed W1 neural optimal transport solver can mimic the W2 OT solvers in finding a unique and "monotonic" map on 2D datasets. Moreover, the W1 OT solver achieves performance on par with or surpasses W2 OT solvers on real single-cell perturbation datasets. Furthermore, we show that the W1 OT solver achieves a 25-45× speedup, scales better on high-dimensional transport tasks, and can be directly applied to single-cell RNA-seq datasets with highly variable genes. Our implementation and experiments are open-sourced at https://github.com/poseidonchan/w1ot. (A hedged sketch of training such a 1-Lipschitz critic appears after the agenda table.) | |
2025-07-22 | 14:40:00 | 15:00:00 | 01A | MLCSB | Recovering Time-Varying Networks From Single-Cell Data | Euxhen Hasanaj | Euxhen Hasanaj, Barnabás Póczos, Ziv Bar-Joseph | Gene regulation is a dynamic process that underlies all aspects of human development, disease response, and other key biological processes. The reconstruction of temporal gene regulatory networks has conventionally relied on regression analysis, graphical models, or other types of relevance networks. With the large increase in time series single-cell data, new approaches are needed to address the unique scale and nature of this data for reconstructing such networks. Here, we develop a deep neural network, Marlene, to infer dynamic graphs from time series single-cell gene expression data. Marlene constructs directed gene networks using a self-attention mechanism where the weights evolve over time using recurrent units. By employing meta learning, the model is able to recover accurate temporal networks even for rare cell types. In addition, Marlene can identify gene interactions relevant to specific biological responses, including COVID-19 immune response, fibrosis, and aging, paving the way for potential treatments. The code use to train Marlene is available at https://github.com/euxhenh/Marlene. | |
2025-07-22 | 15:00:00 | 15:10:00 | 01A | MLCSB | Developing a Deep Learning Model for Single-Cell RNA Splicing Analysis | Luyang Li | Luyang Li, You Zhou | Approximately 95% of human genes undergo alternative splicing (AS), a process that allows a single gene to produce multiple proteins with distinct functions. This mechanism enormously increases the complexity of our genome and plays an important role in maintaining health. Disruptions in normal splicing can lead to various diseases, and predicting AS events and understanding their regulatory mechanisms at the single-cell level can open the door to discovering new therapeutic targets. Despite the importance of RNA splicing, current single-cell RNA sequencing (scRNA-seq) efforts primarily focus on gene expression profiling, and very few scRNA-seq computational tools are available for identifying and quantifying RNA splicing. Inspired by the successful application of large language models in biomedical research, we developed a new State space model based framework for Alternative Splicing prediction, named SAS, trained on long-read RNA sequencing data. SAS employs a stacked selective state space model architecture to generate latent state representations of transcript sequences, enabling accurate predictions of diverse AS events, even in data-limited conditions. Furthermore, this model is specifically tailored for single-cell splicing prediction. Our results show that SAS outperforms existing methods, achieving an accuracy of 0.97, PR-AUC of 0.99, and F1 score of 0.97. This innovative framework provides valuable insights into identifying splicing events at single-cell resolution, guiding experimental efforts to uncover novel splicing mechanisms and therapeutic targets. | |
2025-07-22 | 15:10:00 | 15:20:00 | 01A | MLCSB | Benchmarking foundation cell models for post-perturbation RNA-seq prediction | Gerold Csendes | Gerold Csendes, Gema Sanz, Krisóf Szalay, Bence Szalai | Accurately predicting cellular responses to perturbations is essential for understanding cell behaviour in both healthy and diseased states. While perturbation data is ideal for building such predictive models, its availability is considerably lower than baseline (non-perturbed) cellular data. To address this limitation, several foundation cell models have been developed using large-scale single-cell gene expression data. These models are fine-tuned after pre-training for specific tasks, such as predicting post-perturbation gene expression profiles, and are considered state-of-the-art for these problems. However, proper benchmarking of these models remains an unsolved challenge. In this study, we benchmarked two recently published foundation models, scGPT and scFoundation, against baseline models. Surprisingly, we found that even the simplest baseline model - taking the mean of training examples - outperformed scGPT and scFoundation. Furthermore, basic machine learning models that incorporate biologically meaningful features outperformed scGPT by a large margin. Additionally, we identified that the current Perturb-Seq benchmark datasets exhibit low perturbation-specific variance, making them suboptimal for evaluating such models. Our results highlight important limitations in current benchmarking approaches and provide insights into more effectively evaluating post-perturbation gene expression prediction models. (The mean-of-training-examples baseline is written out after the agenda table.) | |
2025-07-22 | 15:20:00 | 15:30:00 | 01A | MLCSB | scPRINT: pre-training on 50 million cells allows robust gene network predictions | Jeremie Kalfon | Jeremie Kalfon, Gabriel Peyré, Laura Cantini | A cell is governed by the interaction of myriads of macromolecules. Inferring such a network of interactions has remained an elusive milestone in cellular biology. Building on recent advances in large foundation models and their ability to learn without supervision, we present scPRINT, a large cell model for the inference of gene networks pre-trained on more than 50 million cells from the cellxgene database. Using innovative pretraining tasks and model architecture, scPRINT pushes large transformer models towards more interpretability and usability when uncovering the complex biology of the cell. Based on our atlas-level benchmarks, scPRINT demonstrates superior performance in gene network inference to the state of the art, as well as competitive zero-shot abilities in denoising, batch effect correction, and cell label prediction. On an atlas of benign prostatic hyperplasia, scPRINT highlights the profound connections between ion exchange, senescence, and chronic inflammation. | |
2025-07-22 | 15:30:00 | 15:40:00 | 01A | MLCSB | MGCL-ST: Multi-view Graph Self-supervised Contrastive Learning for Spatial Transcriptomics Enhancement | Hongmin Cai | Hongmin Cai, Siqi Ding, Weitian Huang | Spatial transcriptomics enables the investigation of gene expression within its native spatial context, but existing technologies often suffer from low resolution and sparse sampling. These limitations hinder the accurate delineation of fine tissue structures and reduce robustness to noise. To address these challenges, we propose MGCL-ST, a spatial transcriptomics super-resolution framework that integrates multi-view contrastive learning with a dual-metric neighbor selection strategy. By combining spatial structure and histological image features, MGCL-ST achieves robust, pixel-level gene expression imputation. Experimental results on both simulated and real datasets demonstrate its superior reconstruction accuracy and generalization capability, supporting advanced analysis of the tumor microenvironment. | |
2025-07-22 | 15:40:00 | 15:50:00 | 01A | MLCSB | Characterizing cell-type spatial relationships across length scales in spatially resolved omics data | Rafael dos Santos Peixoto | Rafael dos Santos Peixoto, Brendan Miller, Maigan Brusko, Gohta Aihara, Lyla Atta, Manjari Anant, Adina Jailova, Mark Atkinson, Todd Brusko, Clive Wasserfall, Jean Fan | Spatially resolved omics (SRO) technologies enable the identification of cell types while preserving their organization within tissues. Application of such technologies offers the opportunity to delineate cell-type spatial relationships, particularly across different length scales, and enhance our understanding of tissue organization and function. To quantify such multi-scale cell-type spatial relationships, we present CRAWDAD, Cell-type Relationship Analysis Workflow Done Across Distances, as an open-source R package. To demonstrate the utility of such multi-scale characterization, recapitulate expected cell-type spatial relationships, and evaluate against other cell-type spatial analyses, we apply CRAWDAD to various simulated and real SRO datasets of diverse tissues assayed by diverse SRO technologies. We further demonstrate how such multi-scale characterization enabled by CRAWDAD can be used to compare cell-type spatial relationships across multiple samples. Finally, we apply CRAWDAD to SRO datasets of the human spleen to identify consistent as well as patient and sample-specific cell-type spatial relationships. In general, we anticipate such multi-scale analysis of SRO data enabled by CRAWDAD will provide useful quantitative metrics to facilitate the identification, characterization, and comparison of cell-type spatial relationships across axes of interest. | |
2025-07-22 | 15:50:00 | 16:00:00 | 01A | MLCSB | Segger: Fast and accurate cell segmentation of imaging-based spatial transcriptomics data | Elyas Heidari | Elyas Heidari, Andrew Moorman, Moritz Gerstung, Dana Pe'Er, Oliver Stegle, Tal Nawy | Accurate cell segmentation is a critical first step in the analysis of imaging-based spatial transcriptomics (iST). Despite decades of research in cell segmentation, current methods fail to address this task with adequate accuracy, tending to either over- or under-segment and to create false-positive transcript assignments; additionally, many methods fail to scale to large datasets with hundreds of millions of transcripts. To address these limitations, we introduce segger, a versatile graph neural network (GNN) that frames cell segmentation as a transcript-to-cell link prediction task. Segger employs a heterogeneous graph representation of individual transcripts and cells, and can optionally leverage single-cell RNA-seq information to enhance transcript assignments. In benchmarks on multiple iST datasets, including a lung adenocarcinoma dataset with membrane staining for validation, segger demonstrates superior sensitivity and specificity compared to existing methods such as Baysor and BIDCell. At the same time, segger requires orders of magnitude less compute time than existing approaches. The Segger software features adaptive tiling and efficient task scheduling, supporting multi-GPU processing and multi-threading for scalability. Segger also includes a new workflow to cluster unassigned transcripts into ‘fragments’, enabling the recovery of information missed by nucleus or membrane marker-dependent methods. Segger is implemented as user-friendly open source software (https://github.com/PMBio/segger), comes with extensive documentation and integrates seamlessly into existing workflows, enabling atlas-scale applications with high accuracy and speed. | |
2025-07-22 | 16:40:00 | 16:50:00 | 01A | MLCSB | Dissecting cellular and molecular mechanisms of pancreatic cancer with deep learning | Aarthi Venkat | Aarthi Venkat, Cathy Garcia, Daniel McQuaid, Smita Krishnaswamy, Mandar Muzumdar | Pancreatic endocrine-exocrine crosstalk plays a key role in normal physiology and disease and is perturbed by altered host metabolic states. For example, obesity imparts a stress-induced endocrine secretion of cholecystokinin (CCK), which promotes pancreatic ductal adenocarcinoma (PDAC), an exocrine tumor. However, the mechanisms governing endocrine-exocrine signaling in obesity-driven tumorigenesis remain unclear. Here, we design a suite of machine learning tools (TrajectoryNet, AAnet, scMMGAN, DiffusionEMD) to reveal from single-cell RNA-seq data the cellular and molecular mechanisms by which beta cells express CCK and promote obesity-driven PDAC. AAnet identifies an immature beta cell state characterized by low insulin and maturation marker expression and high dedifferentiation and immaturity marker expression. TrajectoryNet predicts obesity stimulates this immature state to expand and adapt toward a pro-tumorigenic CCK-hi state, which we validate with in vivo genetic lineage tracing. TrajectoryNet-based gene regulatory network inference predicts cJun regulates CCK, validated by JNK inhibition and CUT&RUN sequencing showing cJun mediates CCK expression by binding to a novel conserved 3’ enhancer ~3kb downstream of the Cck gene. Finally, mapping beta cells from diverse physiologic and pharmacologic stressors, developmental stages, and species to our dataset with scMMGAN and DiffusionEMD reveals concordance between adult beta cell dedifferentiation and embryonic beta cells, as well as shared stress induction mechanisms between obesity and type II diabetes in mice and humans. Together, this work uncovers new avenues to target the endocrine pancreas to subvert exocrine tumorigenesis and highlights the utility of developing biological and computational models in a wet-to-dry and dry-to-wet fashion toward mechanistic discovery. | |
2025-07-22 | 16:50:00 | 17:00:00 | 01A | MLCSB | SpliceSelectNet: A Hierarchical Transformer-Based Deep Learning Model for Splice Site Prediction | Yuna Miyachi | Yuna Miyachi, Kenta Nakai | RNA splicing is a critical post-transcriptional process that enables the generation of diverse protein isoforms. Aberrant splicing is implicated in a wide range of genetic disorders and cancers, making accurate prediction of splice sites and mutation effects essential. Convolutional neural network-based models such as SpliceAI and Pangolin have achieved high accuracy but often lack interpretability. Recently, Transformer-based models like DNABERT and SpTransformer have been applied to genomic sequences, yet they typically inherit input length limitations from natural language processing models, restricting context to a few thousand base pairs, which are insufficient for capturing long-range regulatory signals. To overcome these challenges, we propose SpliceSelectNet (SSNet), a hierarchical Transformer model that integrates local and global attention mechanisms to handle up to 100 kb of input while maintaining nucleotide-level interpretability. Trained on multiple datasets, including those incorporating splice site usage derived from RNA-seq data, SSNet outperforms SpliceAI and Pangolin on the Gencode test dataset, a clinically curated BRCA variant dataset, and a deep intronic variant benchmark. It demonstrates improved performance, particularly in regions characterized by complex splicing regulation, such as long exons and deep introns, as measured by area under the precision-recall curve. Furthermore, SSNet’s attention maps provide direct insight into sequence context. In the case of a pathogenic variant in BRCA1 exon 10, the model highlighted an upstream region that may contribute to cryptic splice site activation. These results demonstrate that SSNet combines high predictive performance with biological interpretability, offering a powerful tool for splicing analysis in both research and clinical settings. | |
2025-07-22 | 17:00:00 | 18:00:00 | 01A | MLCSB | Toward Mechanistic Genomics: Advances in Sequence-to-Function Modeling | Maria Chikina | | Recent advances have firmly established sequence-to-function models as essential tools in modern genomics, enabling unprecedented insights into how genomic sequences drive molecular and cellular phenotypes. As these models have matured—with increasingly robust architectures, improved training strategies, and the emergence of standardized software frameworks—the field has rapidly evolved from proof-of-concept demonstrations to widespread practical applications across a variety of biological systems. With the core methodologies now widely adopted and infrastructure in place, the community's focus is shifting toward ambitious new frontiers. There is growing momentum around developing models that are biologically interpretable, capable of uncovering causal mechanisms of gene regulation, and generalizable to novel contexts—such as predicting the effects of perturbing a regulatory protein rather than simply altering a DNA sequence. These efforts reflect a broader aspiration: to create models that serve not just as black-box predictors, but as scientific instruments that deepen our understanding of genome function. In this talk, we will explore how such models can move us from descriptive genomics to mechanistic insight, highlighting recent innovations in architecture and training that support interpretability, modularity, and reusability. We will examine the contexts in which these models offer clear advantages, the limitations that remain, and practical considerations for their training. Ultimately, we will consider how advancing these models may refine the role of machine learning in biology, supporting not only accurate prediction but also the generation of more detailed and mechanistically informed hypotheses. | |
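A minimal sketch of the bag-of-mer pooling idea from "Locality-aware pooling enhances protein language model performance across varied applications": residues are first averaged within local windows, and the window summaries are then attention-pooled into a single protein vector. The window size, embedding dimension, and the random query standing in for a learned query are illustrative assumptions, not the authors' settings.

```python
# Toy bag-of-mer (BoM) pooling: windowed average pooling followed by attention pooling.
import numpy as np

def bom_pool(residue_emb: np.ndarray, window: int = 16, rng=None) -> np.ndarray:
    """residue_emb: (L, D) per-residue PLM embeddings -> (D,) protein-level embedding."""
    rng = np.random.default_rng(0) if rng is None else rng
    L, D = residue_emb.shape
    # 1) Locality-aware step: average residues within non-overlapping windows ("bags of mers").
    n_win = int(np.ceil(L / window))
    win_emb = np.stack([residue_emb[i * window:(i + 1) * window].mean(axis=0)
                        for i in range(n_win)])              # (n_win, D)
    # 2) Attention pooling over window summaries; a random query stands in for a learned one.
    query = rng.standard_normal(D) / np.sqrt(D)
    scores = win_emb @ query                                  # (n_win,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                                  # softmax attention weights
    return weights @ win_emb                                  # (D,) pooled protein embedding

# Toy usage: a 100-residue protein with 320-dimensional embeddings.
pooled = bom_pool(np.random.default_rng(1).standard_normal((100, 320)))
print(pooled.shape)  # (320,)
```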
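A toy sketch of the residue-level k-NN search and neighbor aggregation described for NEAR: each query residue retrieves its nearest database residues, and hits are tallied per target sequence to rank candidate homologs. Random vectors stand in for the learned ResNet embeddings, and a brute-force search stands in for a real nearest-neighbor index.

```python
# Toy residue-level k-NN search with simple per-target neighbor aggregation.
import numpy as np

rng = np.random.default_rng(0)
D, k = 64, 8
targets = {f"target_{i}": rng.standard_normal((rng.integers(80, 200), D)) for i in range(50)}
query = rng.standard_normal((120, D))                         # one query protein (placeholder embeddings)

# Flatten the target database, remembering which sequence each residue came from.
db = np.vstack(list(targets.values()))
owners = np.concatenate([[name] * emb.shape[0] for name, emb in targets.items()])

scores = {}
for residue in query:                                          # residue-level k-NN search
    dists = np.linalg.norm(db - residue, axis=1)
    for idx in np.argsort(dists)[:k]:                          # k nearest database residues
        scores[owners[idx]] = scores.get(owners[idx], 0) + 1   # simple neighbor aggregation

candidates = sorted(scores, key=scores.get, reverse=True)[:5]
print("top candidate targets:", candidates)
```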
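A toy sketch of the two-level attention idea from "Incorporating Hierarchical Information into Multiple Instance Learning for Patient Phenotype Prediction with scRNA-seq Data": attention pooling is applied first over cells within each annotated cell type and then over the resulting cell-type embeddings to obtain a patient-level representation. The dimensions and the random vectors standing in for learned attention parameters are illustrative only.

```python
# Toy hierarchical attention-based MIL: cells -> cell types -> patient embedding.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(X, v):
    """X: (n, D) instance embeddings, v: (D,) scoring vector -> (D,) bag embedding."""
    w = softmax(np.tanh(X) @ v)            # attention weights over instances
    return w @ X

rng = np.random.default_rng(0)
D = 64
cells_by_type = {                           # hypothetical patient: cell embeddings grouped by cell type
    "T_cell": rng.standard_normal((120, D)),
    "B_cell": rng.standard_normal((40, D)),
    "Monocyte": rng.standard_normal((80, D)),
}
v_cell = rng.standard_normal(D)             # stand-in for learned cell-level attention parameters
v_type = rng.standard_normal(D)             # stand-in for learned cell-type-level attention parameters

# Level 1: pool cells within each cell type; Level 2: pool the cell-type embeddings.
type_embs = np.stack([attention_pool(X, v_cell) for X in cells_by_type.values()])  # (3, D)
patient_emb = attention_pool(type_embs, v_type)                                     # (D,)
print(patient_emb.shape)                    # would feed a phenotype prediction head in the real model
```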
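A toy illustration of the cluster-to-split assignment problem that DataSAIL addresses: pre-computed sample clusters are assigned to train/test so that similarity crossing the split (potential leakage) is minimized while the test split stays near a target size. DataSAIL itself formulates this as an integer linear program solved with a scalable clustering-based heuristic; the brute-force search below is only meant to make the objective concrete.

```python
# Toy leakage-reduced split assignment over a handful of clusters (brute force, not an ILP solver).
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_clusters = 8
sizes = rng.integers(5, 30, size=n_clusters)                  # samples per cluster
sim = rng.random((n_clusters, n_clusters))
sim = (sim + sim.T) / 2                                        # symmetric cluster-cluster similarity
np.fill_diagonal(sim, 0.0)
target_test = 0.2 * sizes.sum()                                # aim for an 80/20 split

best_mask, best_cost = None, np.inf
for assignment in itertools.product([0, 1], repeat=n_clusters):   # 0 = train, 1 = test
    mask = np.array(assignment)
    if mask.sum() in (0, n_clusters):                          # each split needs at least one cluster
        continue
    leakage = sim[mask == 0][:, mask == 1].sum()               # similarity crossing the split
    size_penalty = abs(sizes[mask == 1].sum() - target_test)   # deviation from target test size
    cost = leakage + size_penalty
    if cost < best_cost:
        best_mask, best_cost = mask, cost

print("test clusters:", np.where(best_mask == 1)[0], "objective:", round(float(best_cost), 3))
```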
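A minimal sketch of the kind of sequence-embedding PPI classifier discussed in "Deep learning models for unbiased sequence-based PPI prediction plateau at an accuracy of 0.65": per-protein (mean-pooled) language-model embeddings of the two partners are concatenated and passed to a simple classifier. Random vectors stand in for real ESM-2 embeddings and random labels for real interaction data, so the printed number is meaningless; only the pipeline shape is illustrated.

```python
# Toy per-protein embedding pair classifier for PPI prediction.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_pairs, d = 2000, 1280                       # 1280-d, the size used by the ESM-2 650M model
emb_a = rng.standard_normal((n_pairs, d))     # placeholder per-protein embeddings (partner A)
emb_b = rng.standard_normal((n_pairs, d))     # placeholder per-protein embeddings (partner B)
y = rng.integers(0, 2, size=n_pairs)          # placeholder interact / not-interact labels

X = np.hstack([emb_a, emb_b])                 # pair representation by concatenation
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("held-out accuracy:", round(clf.score(X_te, y_te), 3))  # ~0.5 here, since the data are random
```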
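A hedged sketch of the Wasserstein-1 dual idea behind the single-cell perturbation OT solver: a single critic f is trained under an (approximate) 1-Lipschitz constraint to maximize E_source[f] - E_target[f], and its gradient then provides a transport direction for each source cell. The gradient penalty, network architecture, and fixed step size below are generic stand-ins, not the paper's implementation (which recovers the step size adversarially).

```python
# Toy W1 critic: maximize E_source[f] - E_target[f] with a gradient-penalty Lipschitz surrogate.
import torch
import torch.nn as nn

d = 50
critic = nn.Sequential(nn.Linear(d, 128), nn.ReLU(), nn.Linear(128, 1))
opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

source = torch.randn(512, d)          # e.g. control cells (placeholder data)
target = torch.randn(512, d) + 0.5    # e.g. perturbed cells (placeholder data)

for step in range(200):
    xs, xt = source, target
    # Gradient penalty on interpolates keeps f approximately 1-Lipschitz.
    eps = torch.rand(xs.size(0), 1)
    xi = (eps * xs + (1 - eps) * xt).requires_grad_(True)
    grad = torch.autograd.grad(critic(xi).sum(), xi, create_graph=True)[0]
    gp = ((grad.norm(dim=1) - 1) ** 2).mean()
    # Maximize E_source[f] - E_target[f]  <=>  minimize its negative.
    loss = -(critic(xs).mean() - critic(xt).mean()) + 10.0 * gp
    opt.zero_grad()
    loss.backward()
    opt.step()

# Transport direction for each source cell: move "downhill" along the critic.
x = source.clone().requires_grad_(True)
direction = -torch.autograd.grad(critic(x).sum(), x)[0]
transported = source + 0.1 * direction     # fixed small step in place of the learned step size
print(transported.shape)
```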
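The "simplest baseline" referenced in the foundation-model benchmarking abstract, written out: predict the mean post-perturbation expression profile of the training perturbations for every held-out perturbation and score it with a per-perturbation correlation. The array shapes, random placeholder data, and metric choice are illustrative assumptions, not the study's setup.

```python
# Mean-of-training-examples baseline for post-perturbation expression prediction.
import numpy as np

rng = np.random.default_rng(0)
n_train_perts, n_test_perts, n_genes = 80, 20, 2000
train_profiles = rng.standard_normal((n_train_perts, n_genes))   # placeholder post-perturbation profiles
test_profiles = rng.standard_normal((n_test_perts, n_genes))

mean_baseline = train_profiles.mean(axis=0)                       # one profile reused for every prediction
pred = np.tile(mean_baseline, (n_test_perts, 1))

# Pearson correlation per held-out perturbation, a common evaluation choice for this task.
corr = [np.corrcoef(p, t)[0, 1] for p, t in zip(pred, test_profiles)]
print("mean Pearson r:", round(float(np.mean(corr)), 3))
```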