Return to ISMB/ECCB 2025 Homepage   Click here for the abridged agenda




Schedule for RegSys


Date Start Time End Time Room Track Title Confirmed Presenter Format Authors Abstract
2025-07-23 11:20:00 12:00:00 11BC RegSys Exploring cellular plasticity: 4D epigenomes in the context of the tumour microenvironment Vera Pancaldi Vera Pancaldi Oncogenesis is characterized by alterations in chromatin organization and the reactivation of unicellular phenotypes at both metabolic and transcriptional levels. The underlying mechanisms remain largely unexplored, despite their critical relevance in cancer biology. We studied the spatial organization of genes in relation to their evolutionary origins, as well as changes occurring during cell differentiation and oncogenesis. We reveal significant topological changes in chromatin organization during cell differentiation, with patterns in specific regulatory marks involving Polycomb repression and RNA Polymerase II pausing, being reversed during oncogenesis. Reflecting on recent findings regarding epigenomic routes to oncogenesis made us consider the importance of the tumour microenvironment in determining plasticity of cancer cells in different environments, which we are studying through data-driven inference of regulatory networks in simplified in-vitro culture systems. We will discuss our recent results and frame them in the context of changing oncogenesis paradigms.
2025-07-23 12:00:00 12:20:00 11BC RegSys Leveraging Transcription Factor Physical Proximity for Enhancing Gene Regulation Inference Yijie Wang Xiaoqing Huang, Aamir Raza Muneer Ahemad Hullur, Elham Jafari, Kaushik Shridhar, Kun Huang, Yijie Wang, Kenneth Mackie, Mu Zhou Motivation: Gene regulation inference, a key challenge in systems biology, is crucial for understanding cell function, as it governs processes such as differentiation, cell state maintenance, signal transduction, and stress response. Leading methods utilize gene expression, chromatin accessibility, Transcription Factor (TF) DNA binding motifs, and prior knowledge. However, they overlook the fact that TFs must be in physical proximity to facilitate transcriptional gene regulation. Results: To fill this gap, we develop GRIP – Gene Regulation Inference by considering TF Proximity – a gene regulation inference method that directly considers the physical proximity between regulating TFs. Specifically, we use the distance in a protein-protein interaction (PPI) network to estimate the physical proximity between TFs. We design a novel Boolean convex program, which can identify TFs that not only can explain the gene expression of target genes (TGs) but also stay close in the PPI network. We propose an efficient algorithm to solve the Boolean relaxation of the proposed model with a theoretical tightness guarantee. We compare our GRIP with state-of-the-art methods (SCENIC+, DirectNet, Pando, and CellOracle) on inferring cell-type-specific (CD4, CD8, and CD14) gene regulation using the PBMC 3k scMultiome-seq data and demonstrate its superior performance in terms of the predictive power of the inferred TFs, the physical distance between the inferred TFs, and the agreement between the inferred gene regulation and PCHiC ground-truth data.
2025-07-23 12:20:00 12:40:00 11BC RegSys miRBench: novel benchmark datasets for microRNA binding site prediction that mitigate against prevalent microRNA Frequency Class Bias Panagiotis Alexiou Stephanie Sammut, Katarina Gresova, Dimosthenis Tzimotoudis, Eva Marsalkova, David Cechak, Panagiotis Alexiou Motivation: MicroRNAs (miRNAs) are crucial regulators of gene expression, but the precise mechanisms governing their binding to target sites remain unclear. A major contributing factor to this is the lack of unbiased experimental datasets for training accurate prediction models. While recent experimental advances have provided numerous miRNA-target interactions, these are solely positive interactions. Generating negative examples in silico is challenging and prone to introducing biases, such as the miRNA frequency class bias identified in this work. Biases within datasets can compromise model generalization, leading models to learn dataset-specific artifacts rather than true biological patterns. Results: We introduce a novel methodology for negative sample generation that effectively mitigates the miRNA frequency class bias. Using this methodology, we curate several new, extensive datasets and benchmark several state-of-the-art methods on them. We find that a simple convolutional neural network model, retrained on some of these datasets, is able to outperform state-of-the-art methods. This highlights the potential for leveraging unbiased datasets to achieve improved performance in miRNA binding site prediction. To facilitate further research and lower the barrier to entry for machine learning researchers, we provide an easily accessible Python package, miRBench, for dataset retrieval, sequence encoding, and the execution of state-of-the-art models. Availability: The miRBench Python Package is accessible at https://github.com/katarinagresova/miRBench/releases/tag/v1.0.0 Contact: panagiotis.alexiou@um.edu.mt
2025-07-23 12:40:00 13:00:00 11BC RegSys Flash Talk Session 1 Aryan Kamal, Damla Baydar, Laura Hinojosa, Charles-Henri Lecellier Session with 4 short talks: Aryan Kamal - Transcriptional regulation of cell fate plasticity in hematopoiesis Damla Övek Baydar - Enhancing JASPAR and UniBind databases with deep learning models for transcription factor-DNA interactions Laura Hinojosa - Master Transcription Factors Regulate Replication Timing Charles-Henri Lecellier - DNA replication timing and Copy Number Variations are confounders of RNA-DNA interaction data
2025-07-23 14:00:00 14:20:00 11BC RegSys Unicorn: Enhancing Single-Cell Hi-C Data with Blind Super-Resolution for 3D Genome Structure Reconstruction Oluwatosin Oluwadare Mohan Kumar Chandrashekar, Rohit Menon, Samuel Olowofila, Oluwatosin Oluwadare Motivation: Single-cell Hi-C (scHi-C) data provide critical insights into chromatin interactions at individual cell levels, uncovering unique genomic 3D structures. However, scHi-C datasets are characterized by sparsity and noise, complicating efforts to accurately reconstruct high-resolution chromosomal structures. In this study, we present ScUnicorn, a novel blind Super-Resolution framework for scHi-C data enhancement. ScUnicorn employs an iterative degradation kernel optimization process, unlike traditional Super-resolution approaches, which rely on downsampling, predefined degradation ratios, or constant assumptions about the input data to reconstruct high-resolution interaction matrices. Hence, our approach more reliably preserves critical biological patterns and minimizes noise. Additionally, we propose 3DUnicorn, a maximum likelihood algorithm that leverages the enhanced scHi-C data to infer precise 3D chromosomal structures. Result: Our evaluation demonstrates that ScUnicorn achieves superior performance over the state-of-the-art methods in terms of Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index Measure (SSIM), and GenomeDisco scores. Moreover, 3DUnicorn’s reconstructed structures align closely with experimental 3D-FISH data, underscoring its biological relevance. Together, ScUnicorn and 3DUnicorn provide a robust framework for advancing genomic research by enhancing scHi-C data fidelity and enabling accurate 3D genome structure reconstruction. Code Availability: Unicorn implementation is publicly accessible at https://github.com/OluwadareLab/Unicorn
2025-07-23 14:20:00 14:40:00 11BC RegSys Predicting gene-specific regulation with transcriptomic and epigenetic single-cell data Laura Rumpf Laura Rumpf, Fatemeh Behjati, Dennis Hecker, Marcel Schulz To gain insights into phenotype-specific gene regulation, we present our integrative analysis approach MetaFR harnessing single-cell epigenetic and transcriptomic data. MetaFR generates random forest regression models in a gene-specific manner utilizing both scATAC-seq and scRNA-seq data to predict gene expression in a large window around a target gene. The gene window is partitioned into bins of equal size which correspond to the model features holding the epigenetic signal counts. The importance of model features can be leveraged to prioritize enhancer-gene interactions. The inherent sparsity problem of single-cell data is addressed by aggregating the scRNA-seq and scATAC-seq signal into metacells based on gene activity similarities. MetaFR enables large-scale analysis of scATAC-seq and scRNA-seq data in an automated fashion. The automated pipeline has been successfully applied to a human PBMC dataset to identify immune cell-specific enhancer-gene interactions. We validated our findings with experimentally measured interactions (CRISPRi regions) and fine-mapped eQTLs. We benchmarked our performance against the state-of-the-art method SCARlink. We were able to outperform SCARlink in both accuracy and runtime. Our pipeline allows time-efficient analysis and obtains reliable models of gene expression, which can be used to study gene regulatory elements in any organism for which scRNA-seq and scATAC-seq data becomes available.   
2025-07-23 14:40:00 15:00:00 11BC RegSys Biophysical deep learning resolves how TF and DNA sequence specify the genome state of every cell population in human embryogenesis Vitalii Kleshchevnikov Vitalii Kleshchevnikov, Oliver Stegle, Omer Bayraktar Understanding how interactions between transcription factors (TFs) and DNA sequence are orchestrated and give rise to the vast complexity of cell types is a major challenge of regulatory developmental biology. Large-scale multimodal single-cell RNA-seq and ATAC-seq atlases enable reconstructing the regulatory mechanisms across cell types from data, laying the foundation for cell programming and design of synthetic regulatory elements. Despite significant progress, current DNA sequence models fail to account for cellular context, TF-DNA sequence relationships and TF combinatorics in a principled manner, limiting their causal expressiveness and generalization capacity across cell types. To overcome this, we developed cell2state, an end-to-end deep learning model with biophysical constraints on how TFs specify the genome accessibility state in every cell population. Cell2state leverages known TF-motif interactions while accounting for biophysical constraints, employs an interpretable neural network based on HyenaDNA architecture and captures TF-TF synergy and antagonism, enabling the model to integrate DNA sequence and transcription factor (TF) abundance. We demonstrated cell2state's generalisation capabilities by predicting ATAC-seq signals for new chromosomes and cell types. To link regulatory TF interactions to developmental processes at whole embryo scale, we applied cell2state to an unpublished multimodal single-cell and spatial transcriptomics atlas covering over 1,000 human developmental cell states (n=4,000 pseudobulk replicates, n=5 embryos). At critical developmental junctions, such as the dorsal-ventral patterning of the spinal cord/hindbrain and anterior-posterior patterning of the forebrain, cell2state revealed how enhancer DNA sequences integrate activities of cell-type-defining TFs (LHX2, PAX6) with cell communication pathway TFs (GLI, TCF).
2025-07-23 15:00:00 15:20:00 11BC RegSys Nona: A unifying multimodal masked modeling framework for functional genomics Surag Nair Surag Nair, Alex Tseng, Ehsan Hajiramezanali, Nathaniel Diamant, Avantika Lal, Tommaso Biancalani, Gabriele Scalia, Gokcen Eraslan We present Nona, a unifying multimodal masked modeling paradigm for functional genomics. Nona is a neural network model that operates on both DNA sequence and epigenetic tracks such as DNase-seq, ChIP-seq, and RNA-seq at base-pair resolution. By leveraging a flexible masking strategy, Nona can predict any subset of masked DNA and/or tracks from the unmasked subset. As a result, Nona encompasses versatile existing and novel use cases that were hitherto addressed using separate models. In addition to vanilla sequence-to-function prediction and DNA language modeling, Nona enables multiple novel application modes, of which we highlight 3: 1) context-aware prediction, where the model predicts epigenetic tracks in a local genomic window by taking into account the observed epigenetic tracks in adjacent windows, in addition to the DNA sequence, 2) sequence generation, where a conditional language model is used to iteratively generate a DNA sequence with desired epigenetic profiles across cellular states, 3) functional genotyping, where a conditional language model trained on base resolution ATAC-seq is used to infer the genotype of the sample donors. Beyond these applications, Nona can enable use cases such as functional perturbations and denoising functional measurements. Altogether, Nona is a versatile paradigm that extends sequence-to-function and masked language modeling to novel applications in regulatory genomics.
2025-07-23 15:20:00 15:40:00 11BC RegSys SCRIMPy: Single Cell Replication Inference from Multiome data using Python Tatevik Jalatyan Tatevik Jalatyan, Jennifer Herrmann, Antonio Rodriguez-Romera, Beth Psaila, Jim Hughes, Simone Riva, Robert Beagrie The cell cycle is a fundamental biological process crucial for an organism’s growth and development. Dysregulation of the cell cycle can lead to diseases such as cancer, neurodegenerative, cardiovascular, or autoimmune disorders. Thus, accurate characterization of cell cycle dynamics in healthy and disease states is important for understanding disease mechanisms. Existing methods for cell cycle state prediction from single-cell data use the expression of marker genes in individual cells. However, these approaches perform poorly on single-cell multiome (ATAC+GEX) data, likely due to the increased data sparsity and nuclear RNA bias. To address these limitations, we propose a novel method for cell cycle state inference that uses replication-driven DNA copy number signals from scATAC-seq data. Our approach is based on two complementary metrics that reflect the replication state of individual cells. First, we capture the imbalance of ATAC fragment depth between early- and late-replicating regions of genome to identify S-phase cells with higher DNA copy number in early replicating domains. Second, we introduce a novel metric for DNA copy number in ATAC-seq data to differentiate G1-phase cells from G2/M-phase cells, since the latter have duplicated DNA content. We apply this method to multiome data from mouse embryonic stem cells sorted by cell cycle state (G1, S, G2/M) and show that SCRIMPy outperforms the commonly used expression-based classifier Seurat. With the increasing availability of multiome datasets, this approach holds promise for deriving novel insights into cell cycle mechanisms in diseases and identifying potential therapeutic targets.
2025-07-23 15:40:00 16:00:00 11BC RegSys Soffritto: a deep-learning model for predicting high-resolution replication timing Dante Bolzan Dante Bolzan, Ferhat Ay Motivation: Replication Timing (RT) refers to the order by which DNA loci are replicated during S phase. RT is cell-type specific and implicated in cellular processes including transcription, differentiation, and disease. RT is typically quantified genome-wide using two-fraction assays (e.g., Repli-Seq) which sort cells into early and late S phase fractions followed by DNA sequencing yielding a ratio as the RT signal. While two-fraction RT data is widely available in multiple cell lines, it is limited in its ability to capture high-resolution RT features. To address this, high-resolution Repli-Seq, which quantifies RT across 16 fractions, was developed, but it is costly and technically challenging with very limited data generated to date. Results: Here we developed Soffritto, a deep learning model that predicts high-resolution RT data using two-fraction RT data, histone ChIP-seq data, GC content, and gene density as input. Soffritto is composed of a Long Short Term Memory (LSTM) module and a prediction module. The LSTM module learns long- and short-range interactions between genomic bins while the prediction module is composed of a fully connected layer that outputs a 16-fraction probability vector for each bin using the LSTM module’s embeddings as input. By performing both within cell line and cross cell line training and testing for five human and mouse cell lines, we show that Soffritto is able to capture experimental 16-fraction RT signals with high accuracy and the predicted signals allow detection of high-resolution RT patterns.
2025-07-23 16:40:00 17:00:00 11BC RegSys Ledidi: Programmatic design and editing of cis-regulatory elements Jacob Schreiber Jacob Schreiber, Franziska Lorbeer, Monika Heinzl, Yang Lu, Alexander Stark, William Noble The development of modern genome editing tools has enabled researchers to make such edits with high precision, but has left unsolved the problem of designing these edits. As a solution, we propose Ledidi, a computational approach that rephrases the design of genomic edits as a continuous optimization problem where the goal is to produce the desired outcome as measured by one or more predictive models using as few edits from an initial sequence as possible. Ledidi can be paired with any pre-trained machine learning model, and when applied across dozens of such models, we find that Ledidi can quickly design edits to precisely control transcription factor binding, chromatin accessibility, transcription, and enhancer activity across several species. Ledidi can achieve its target objective using surprisingly few edits by converting weak affinity TF binding sites into stronger affinity ones, and can do so almost an order of magnitude faster than other approaches. Unlike other approaches, Ledidi can use several models simultaneously to programmatically design edits that exhibit multiple desired characteristics. We demonstrate this capability by designing uniformly accessible regions with controllable patterns of TF binding, by designing cell type-specific enhancers, and by showing how one can use multiple models that predict the same thing to more robustly design edits. Finally, we introduce the concept of an affinity catalog, in which multiple sets of edits are designed that induce a spectrum of outcomes, and demonstrate the practical benefits of this approach for design tasks and scientific understanding.
2025-07-23 17:00:00 17:20:00 11BC RegSys Lilliput: Compact native regulatory element design with machine learning-guided miniaturization Laura Gunsalus Laura Gunsalus, Avantika Lal, Tommaso Biancalani, Gokcen Eraslan Size-limited gene therapy vectors require compact cell type-specific regulatory elements. Existing miniaturized sequences have been hand-selected and curated, relying on costly experimental iteration. We present Lilliput, a method for designing compact and specific regulatory elements by nominating and iteratively editing endogenous elements with state-of-the-art DNA sequence-to-function models. Our approach involves scoring elements in silico, removing subsequences with limited predicted impact, and introducing minimal mutations to increase specificity. We demonstrate the effectiveness of our approach by reducing a 10kb heart-specific locus to under 300bp. Our method offers a generalizable framework for engineering mini-elements across diverse target cell types. More broadly, we identify core sequence features sufficient to determine cell-type specific expression patterns, advancing our understanding of the mechanisms underlying precise control of gene expression.
2025-07-23 17:20:00 18:00:00 11BC RegSys Learning the Regulatory Genome by Destruction and Creation Luca Pinello Luca Pinello The regulatory genome operates through a complex DNA language that controls gene expression. In this keynote, I will present two complementary approaches to decode this language: learning by precise perturbation and learning by generative design. First, I will introduce CRISPR-CLEAR, which combines dense base editing with sequencing of resulting mutations to map regulatory elements at single-nucleotide resolution. We systematically "destroy" regulatory sequences through targeted mutations to identify functional nucleotides. Applied to the CD19 enhancer, we pinpoint exact bases whose alteration confers resistance to CAR-T therapy—revealing clinically actionable insights. Second, I will present DNA-Diffusion, which uses generative AI to "create" novel regulatory elements by learning from thousands of cell-type-specific sequences. This diffusion model generates synthetic 200bp elements that often exceed the activity of endogenous enhancers. We validated 5,850 sequences through reporter assays and pioneered direct genomic replacement to show these synthetic elements can precisely modulate therapeutic targets like AXIN2 in leukemia cells. Together, systematic perturbation and generative design provide complementary lenses for understanding regulatory logic. CRISPR-CLEAR reveals which nucleotides matter; DNA-Diffusion demonstrates we can engineer better solutions. This dual framework opens new avenues for precision gene therapy, where understanding and designing regulatory elements become two sides of the same coin.
2025-07-24 08:40:00 09:20:00 11BC RegSys TBA Roser Vento-Tormo
2025-07-24 09:20:00 09:40:00 11BC RegSys Anomaly Detection in Spatial Transcriptomics via Spatially Localized Density Comparison Gary Hu Gary Hu, Julian Gold, Uthsav Chitra, Sunay Joshi, Benjamin Raphael Motivation: Perturbations in biological tissues – e.g. due to inflammation, disease, or drug treatment – alter the composition of cell types and cell states in the tissue. These alterations are often spatially localized in different regions of a tissue, and can be measured using spatial transcriptomics technologies. However, current methods to analyze differential abundance in cell types or cell states either do not incorporate spatial information – and thus cannot identify spatially localized alterations – or use heuristic and inaccurate approaches. Results: We introduce Spatial Anomaly Region Detection in Expression Manifolds (Sardine), a method to estimate spatially localized changes in spatial transcriptomics data obtained from tissue slices from two or more conditions. Sardine estimates the probability of a cell state being at the same (relative) spatial location between different conditions using spatially localized density estimation. On simulated data, Sardine recapitulates the spatial patterning of expression changes more accurately than existing approaches. On a Visium dataset of the mouse cerebral cortex before and after injury response, as well as on a Visium dataset of a mouse spinal cord undergoing electrotherapy, Sardine identifies regions of spatially localized expression changes that are more biologically plausible than alternative approaches.
2025-07-24 09:40:00 10:00:00 11BC RegSys Flash Talk Session 2 Maxime Christophe, Gabriela A Merino, Erick Isaac Navarro Delgado, Tomas Rube Session with 4 short talks: Maxime Christophe - Interpretable deep learning reveals sequence determinants of nucleosome positioning in mammalian genomes Gabriela A Merino - Ensembl’s multispecies catalogue of regulatory elements Erick Isaac Navarro Delgado - RAMEN: A reproducible framework for dissecting individual, additive and interactive gene-environment contributions in genomic regions with variable DNA methylation Tomas Rube - Accurate affinity models for SH2 domains from peptide binding assays and free-energy regression
2025-07-24 11:20:00 11:40:00 11BC RegSys GASTON-Mix: a unified model of spatial gradients and domains using spatial mixture-of-experts Uthsav Chitra Uthsav Chitra, Shu Dan, Fenna Krienen, Ben Raphael Motivation: Gene expression varies across a tissue due to both the organization of the tissue into spatial domains, i.e. discrete regions of a tissue with distinct cell type composition, and continuous spatial gradients of gene expression within different spatial domains. Spatially resolved transcriptomics (SRT) technologies provide high-throughput measurements of gene expression in a tissue slice, enabling the characterization of spatial gradients and domains. However, existing computational methods for quantifying spatial variation in gene expression either model only spatial domains – and do not account for continuous gradients of expression – or require restrictive geometric assumptions on the spatial domains and spatial gradients that do not hold for many complex tissues. Results: We introduce GASTON-Mix, a machine learning algorithm to identify both spatial domains and spatial gradients within each domain from SRT data. GASTON-Mix extends the mixture-of-experts (MoE) deep learning framework to a spatial MoE model, combining the clustering component of the MoE model with a neural field model that learns a separate 1-D coordinate (“isodepth”) within each domain. The spatial MoE is capable of representing any geometric arrangement of spatial domains in a tissue, and the isodepth coordinates define continuous gradients of gene expression within each domain. We show using simulations and real data that GASTON-Mix identifies spatial domains and spatial gradients of gene expression more accurately than existing methods. GASTON-Mix reveals spatial gradients in the striatum and lateral septum that regulate complex social behavior, and GASTON-Mix reveals localized spatial gradients of hypoxia and TNF-α signaling in the tumor microenvironment.
2025-07-24 11:40:00 12:00:00 11BC RegSys Refinement Strategies for Tangram for Reliable Single-Cell to Spatial Mapping Merle Stahl Merle Stahl, Lena J. Straßer, Chit Tong Lio, Judith Bernett, Richard Röttger, Markus List Motivation: Single-cell RNA sequencing (scRNA-seq) provides comprehensive gene expression data at a single-cell level but lacks spatial context. In contrast, spatial transcriptomics captures both spatial and transcriptional information but is limited by resolution, sensitivity, or feasibility. No single technology combines both the high spatial resolution and deep transcriptomic profiling at the single-cell level without trade-offs. Spatial mapping tools that integrate scRNA-seq and spatial transcriptomics data are crucial to bridge this gap. However, we found that Tangram, one of the most prominent spatial mapping tools, provides inconsistent results over repeated runs. Results: We refine Tangram to achieve more consistent cell mappings and investigate the challenges that arise from data characteristics. We find that the mapping quality depends on the gene expression sparsity. To address this, we (1) train the model on an informative gene subset, (2) apply cell filtering, (3) introduce several forms of regularization, and (4) incorporate neighborhood information. Evaluations on real and simulated mouse datasets demonstrate that this approach improves both gene expression prediction and cell mapping. Consistent cell mapping strengthens the reliability of the projection of cell annotations and features into space, gene imputation, and correction of low-quality measurements. Our pipeline, which includes gene set and hyperparameter selection, can serve as guidance for applying Tangram on other datasets, while our benchmarking framework with data simulation and inconsistency metrics is useful for evaluating other tools or Tangram modifications. Availability: The refinements for Tangram and our benchmarking pipeline are available at https://github.com/daisybio/Tangram_Refinement_Strategies.
2025-07-24 12:00:00 12:20:00 11BC RegSys Encoding single-cell chromatin landscapes as probability distributions with optimal transport Cassandra Burdziak Cassandra Burdziak, Danielle Maydan, Doron Haviv, Ronan Chaligne, Dana Pe'Er Single-cell measurement of paired epigenetic and transcriptomic features is becoming routine, and promises to license more sophisticated models of gene regulation. Still, most existing models are limited to the cis-regulatory element representation (typically, averaged signal at pre-defined accessibility “peaks”), which shrouds much of the chromatin molecule’s fine-grained structure. To maximize chromatin’s explanatory power for cell-state (and fate) prediction, we sought to achieve a more unbiased, quantitative representation of the chromatin molecule by treating the accessibility landscape as a discrete (per-base pair) probability distribution. Given single-cell accessibility data, our approach embeds the chromatin landscape of each cell state according to the optimal transport (OT) distance between the empirical distribution of accessibility at particular loci, whilst controlling for sequence-related biases in DNA tagmentation. The resulting embeddings capture the precise shape of the accessibility distribution, which itself reflects transcription factor binding footprints, nucleosome positions, and RNA polymerase movement. Application of this model in the well-studied hematopoiesis system highlights its superior ability to explain cell-state: the latent accessibility distribution is more universally predictive of gene expression than promoter accessibility, and can define transcription factor binding modes active in specific branches of development. Most excitingly, position in latent space may closely correspond with the presence of certain activating or repressive chromatin marks, despite the model lacking such information during training. This representation may thus empower future models of gene regulation with a richer representation of epigenetic data with stronger ties to cellular phenotypes.
2025-07-24 12:20:00 12:40:00 11BC RegSys scooby: Modeling multi-modal genomic profiles from DNA sequence at single-cell resolution Laura D. Martens Laura D. Martens, Johannes C. Hingerl, Alexander Karollus, Trevor Manz, Jason D. Buenrostro, Fabian J. Theis, Julien Gagneur Understanding how regulatory sequences shape gene expression across individual cells is a fundamental challenge in genomics. Joint RNA-seq and epigenomic profiling provides opportunities to build models capturing sequence determinants across steps of gene expression. However, current models, developed primarily for bulk omics data, fail to capture the cellular heterogeneity and dynamic processes revealed by single-cell multi-modal technologies. Here, we introduce scooby, a framework to model genomic profiles of scRNA-seq coverage and scATAC-seq insertions from sequence at single-cell resolution. For this, we leverage the pre-trained multi-omics profile predictor Borzoi and equip it with a cell-specific decoder. Scooby recapitulates cell-specific expression levels of held-out genes and identifies regulators and their putative target genes. Moreover, scooby allows resolving single-cell effects of bulk eQTLs and delineating their impact on chromatin accessibility and gene expression. We anticipate scooby to aid unraveling the complexities of gene regulation at the resolution of individual cells.
2025-07-24 12:40:00 13:00:00 11BC RegSys Uncovering Novel Cellular Programs and Regulatory Circuits Underlying Bifurcating Human B Cell States Jishnu Das Zarifeh Rarani, Swapnil Keshari, Akanksha Sachan, Nicholas Pease, Jingyu Fan, Peter Gerges, Harinder Singh, Jishnu Das Upon antigen encounter, B cells undergo activation followed by a bifurcation either into extrafollicular plasmablasts (PB) or into germinal center (GC) B cells. We have assembled gene regulatory networks (GRNs) underlying this bifurcation using temporally resolved single-cell multiomics. To complement this, we analyzed transcriptomic states of GC and PB cells using SLIDE, a novel interpretable machine learning approach to infer a small set of cellular programs (latent factors/LFs) necessary and sufficient to distinguish GC and PB cells. These LFs provide stronger discrimination between the two emergent cell states than DEG analyses. Interestingly, when the LF genes were cross-referenced with state-specific GRNs, the LFs recapitulated aspects of GRN architecture orchestrating the bifurcation. Intriguingly, the LFs also captured gene programs reflective of cell-fate propensity prior to the bifurcation in activated B cells. These programs were validated using perturbation of key TFs. To move beyond high-resolution static state-specific GRNs, we used a stochastic ODE-based framework to construct a dynamic GRN across the 5 states. In addition to recapitulating previously known lineage-defining TFs and their regulons, we identify novel regulons driving divergent gene activity across the bifurcation trajectory. We also combined the dynamic GRN with the inferred cellular programs to predict TF pairs that combinatorially control B cell fate dynamics. Intriguingly, several of these inferred TF pairs are not detected by conventional network topological metrics.
Overall, our framework is generalizable and applicable across contexts to identify cellular programs and regulatory circuits underlying diverse cell fate bifurcations.
2025-07-24 14:00:00 14:20:00 11BC RegSys Detection of Cell-type-specific Differentially Methylated Regions in Epigenome-Wide Association Studies Yingying Wei Ruofan Jia, Yingying Wei DNA methylation at cytosine-phosphate-guanine (CpG) sites is one of the most important epigenetic markers. Therefore, epidemiologists are interested in investigating DNA methylation in large cohorts through epigenome-wide association studies (EWAS). However, the observed EWAS data are bulk data with signals aggregated from distinct cell types. Deconvolution of cell-type-specific signals from EWAS data is challenging because phenotypes can affect both cell-type proportions and cell-type-specific methylation levels. Recently, there has been active research on detecting cell-type-specific risk CpG sites for EWAS data. However, existing methods all assume that the methylation levels of different CpG sites are independent and perform association detection for each CpG site separately. Although they significantly improve detection at the aggregated level (identifying a CpG site as a risk CpG site as long as it is associated with the phenotype in any cell type), they have low power in detecting cell-type-specific associations for EWAS with typical sample sizes. Here, we develop a new method, Fine-scale inference for Differentially Methylated Regions (FineDMR), to borrow strength from nearby CpG sites to improve cell-type-specific association detection. Via a Bayesian hierarchical model built upon Gaussian process functional regression, FineDMR takes advantage of the spatial dependencies between CpG sites. FineDMR can provide cell-type-specific association detection as well as output subject-specific and cell-type-specific methylation profiles for each subject. Simulation studies and real data analysis show that FineDMR substantially improves the power in detecting cell-type-specific associations for EWAS data.
FineDMR is freely available at https://github.com/JiaRuofan/Detection-of-Cell-type-specific-DMRs-in-EWAS.
2025-07-24 14:20:00 14:40:00 11BC RegSys MutBERT: Probabilistic Genome Representation Improves Genomics Foundation Models Weicai Long Weicai Long, Houcheng Su, Jiaqi Xiong, Yanlin Zhang Motivation: Understanding the genomic foundation of human diversity and disease requires models that effectively capture sequence variation, such as single nucleotide polymorphisms (SNPs). While recent genomic foundation models have scaled to larger datasets and multi-species inputs, they often fail to account for the sparsity and redundancy inherent in human population data, such as those in the 1000 Genomes Project. SNPs are rare in humans, and current masked language models (MLMs) trained directly on whole-genome sequences may struggle to efficiently learn these variations. Additionally, training on the entire dataset without prioritizing regions of genetic variation results in inefficiencies and negligible gains in performance. Results: We present MutBERT, a probabilistic genome-based masked language model that efficiently utilizes SNP information from population-scale genomic data. By representing the entire genome as a probabilistic distribution over observed allele frequencies, MutBERT focuses on informative genomic variations while maintaining computational efficiency. We evaluated MutBERT against DNABERT-2, various versions of Nucleotide Transformer, and modified versions of MutBERT across multiple downstream prediction tasks. MutBERT consistently ranked as one of the top-performing models, demonstrating that this novel representation strategy enables better utilization of biobank-scale genomic data in building pretrained genomic foundation models. Availability: https://github.com/ai4nucleome/mutBERT Contact: yanlinzhang@hkust-gz.edu.cn
2025-07-24 14:40:00 15:00:00 11BC RegSys Detecting and avoiding homology-based data leakage in genome-trained sequence models Abdul Muntakim Rafi Abdul Muntakim Rafi, Brett Kiyota, Nozomu Yachie, Carl de Boer Models that predict function from DNA sequence have become critical tools in deciphering the roles of genomic sequences and genetic variation within them. However, traditional approaches for dividing the genomic sequences into training data, used to create the model, and test data, used to determine the model’s performance on unseen data, fail to account for the widespread homology that permeates the genome. Using models that predict human gene expression from DNA sequence, we demonstrate that model performance on test sequences varies by their similarity with training sequences, consistent with homology-based ‘data leakage’ that influences model performance by rewarding overfitting of homologous sequences. Because the sequence and its function are inexorably linked, even a maximally overfit model with no understanding of gene regulation can predict the expression of sequences that are similar to its training data. To prevent leakage in genome-trained models, we introduce ‘hashFrag’, a scalable solution for partitioning data with minimal leakage. hashFrag improves estimates of model performance and can actually increase model performance by providing improved splits for model training. Altogether, we demonstrate how to account for homology-based leakage when partitioning genomic sequences for model training and evaluation, and highlight the consequences of failing to do so.
2025-07-24 15:00:00 15:20:00 11BC RegSys Predicting gene expression using millions of yeast promoters reveals cis-regulatory logic Susanne Bornelöv Tirtharaj Dash, Susanne Bornelöv Gene expression is largely controlled by transcription factors and their binding and interactions in gene promoter regions. Early attempts to use deep learning to learn about this gene-regulatory logic were limited to training sets containing naturally occurring promoter sequences. However, using massively parallel reporter assays, potential training data can now be expanded by orders of magnitude, going beyond naturally occurring sequences. Nevertheless, a clear understanding of how to best use deep learning to study gene regulation is still lacking. Here we investigate the complex association between promoters and gene expression in S. cerevisiae using Camformer, a residual convolutional neural network that ranked 4th in the Random Promoter DREAM Challenge 2022. We explore the original Camformer model trained on 6.7 million random promoter sequences and investigate 270 alternative models to determine what factors contribute most to model performance. We show that Camformer accurately decodes the association between promoters and gene expression (r² = 0.914 ± 0.003, ρ = 0.962 ± 0.002) and provides a substantial improvement over the previous state of the art. Using explainable AI techniques, such as in silico mutagenesis, we demonstrate that the model learns both individual motifs and their hierarchy. For example, while an IME1 motif on its own increases gene expression, the co-occurrence of IME1 and UME6 motifs strongly reduces gene expression, beyond the repressive effect of UME6 on its own. Thus, we demonstrate that Camformer can be used to provide detailed insights into cis-regulatory logic.
2025-07-24 15:20:00 16:00:00 11BC RegSys What can the diversity of life on Earth teach us about disease? Mafalda Dias Mafalda Dias Biological sequences across the tree of life reflect the cumulative effects of millions of years of evolution. Modelling variation in these sequences offers a powerful window into the sequence constraints that shape protein function and genome regulation, and holds great promise for uncovering the genetic basis of human disease. In this talk, I will explore how recent advances in deep learning are enabling us to decode these evolutionary signatures at scale. I will highlight how such models are already improving the diagnostic yield of patient sequencing, by providing evidence for hundreds of new disorders, and how they offer new avenues to assess disease risk before symptoms arise.
