Schedule subject to change
All times listed are in BST
Wednesday, July 23rd
11:20-12:00
Invited Presentation: TBA
Room: 02F
Format: In person

Moderator(s): Anthony Mathelier


Authors List:

  • Vera Pancaldi
12:00-12:20
Proceedings Presentation: Leveraging Transcription Factor Physical Proximity for Enhancing Gene Regulation Inference
Confirmed Presenter: Yijie Wang, School of Informatics and Computing, Indiana University, Bloomington, IN 47408, United States

Room: 02F
Format: In person

Moderator(s): Anthony Mathelier


Authors List:

  • Xiaoqing Huang, Department of Biostatistics and Health Data Science, School of Medicine, Indiana University, United States
  • Aamir Raza Muneer Ahemad Hullur, School of Informatics and Computing, Indiana University, Bloomington, IN 47408, United States
  • Elham Jafari, Indiana University, United States
  • Kaushik Shridhar, School of Informatics and Computing, Indiana University, Bloomington, IN 47408, United States
  • Mu Zhou, Rutgers University, United States
  • Kenneth Mackie, Indiana University Bloomington, United States
  • Kun Huang, Indiana University School of Medicine, United States
  • Yijie Wang, School of Informatics and Computing, Indiana University, Bloomington, IN 47408, United States

Presentation Overview:

Motivation: Gene regulation inference, a key challenge in systems biology, is crucial for understanding cell function, as it governs processes such as differentiation, cell state maintenance, signal transduction, and stress response. Leading methods utilize gene expression, chromatin accessibility, Transcription Factor (TF) DNA binding motifs, and prior knowledge. However, they overlook the fact that TFs must be in physical proximity to facilitate transcriptional gene regulation.
Results: To fill the gap, we develop GRIP – Gene Regulation Inference by considering TF Proximity – a gene regulation inference method that directly considers the physical proximity between regulating TFs. Specifically, we use the distance in a protein-protein interaction (PPI) network to estimate the physical proximity between TFs. We design a novel Boolean convex program, which can identify TFs that not only explain the gene expression of target genes (TGs) but also stay close in the PPI network. We propose an efficient algorithm to solve the Boolean relaxation of the proposed model with a theoretical tightness guarantee. We compare GRIP with state-of-the-art methods (SCENIC+, DirectNet, Pando, and CellOracle) on inferring cell-type-specific (CD4, CD8, and CD14) gene regulation using the PBMC 3k scMultiome-seq data and demonstrate that it outperforms them in terms of the predictive power of the inferred TFs, the physical distance between the inferred TFs, and the agreement between the inferred gene regulation and PCHiC ground-truth data.
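
The idea of using PPI network distance as a proxy for TF proximity can be illustrated with a minimal sketch. It is not GRIP's Boolean convex program; the function names and the toy network are hypothetical, and the network is assumed to be an undirected adjacency mapping.

```python
from collections import deque
from itertools import combinations

def ppi_distance(adj, source, target):
    """Shortest-path length (in edges) between two proteins, found by BFS."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for nbr in adj.get(node, ()):
            if nbr == target:
                return dist + 1
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, dist + 1))
    return float("inf")  # the two proteins are not connected

def mean_pairwise_distance(adj, tfs):
    """Average PPI distance over all pairs in a candidate TF set."""
    pairs = list(combinations(tfs, 2))
    return sum(ppi_distance(adj, a, b) for a, b in pairs) / len(pairs)

# Made-up toy network (a simple chain): A - B - C - D
toy_ppi = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
print(mean_pairwise_distance(toy_ppi, ["A", "B", "C"]))
```

A lower mean pairwise distance would indicate a TF set that sits closer together in the network, which is the property the abstract says the model rewards.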

12:20-12:40
Proceedings Presentation: miRBench: novel benchmark datasets for microRNA binding site prediction that mitigate against prevalent microRNA Frequency Class Bias
Confirmed Presenter: Panagiotis Alexiou, University of Malta, Malta

Room: 02F
Format: In person

Moderator(s): Anthony Mathelier


Authors List:

  • Stephanie Sammut, University of Malta, Malta
  • Katarina Gresova, Masaryk University, Czechia
  • Dimosthenis Tzimotoudis, University of Malta, Malta
  • Eva Marsalkova, Masaryk University, Czechia
  • David Cechak, Masaryk University, Czechia
  • Panagiotis Alexiou, University of Malta, Malta

Presentation Overview:

Motivation: MicroRNAs (miRNAs) are crucial regulators of gene expression, but the precise mechanisms governing their binding to target sites remain unclear. A major contributing factor to this is the lack of unbiased experimental datasets for training accurate prediction models. While recent experimental advances have provided numerous miRNA-target interactions, these are solely positive interactions. Generating negative examples in silico is challenging and prone to introducing biases, such as the miRNA frequency class bias identified in this work. Biases within datasets can compromise model generalization, leading models to learn dataset-specific artifacts rather than true biological patterns.

Results: We introduce a novel methodology for negative sample generation that effectively mitigates the miRNA frequency class bias. Using this methodology, we curate several new, extensive datasets and benchmark several state-of-the-art methods on them. We find that a simple convolutional neural network model, retrained on some of these datasets, is able to outperform state-of-the-art methods. This highlights the potential for leveraging unbiased datasets to achieve improved performance in miRNA binding site prediction. To facilitate further research and lower the barrier to entry for machine learning researchers, we provide an easily accessible Python package, miRBench, for dataset retrieval, sequence encoding, and the execution of state-of-the-art models.
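
To make the notion of a frequency class bias concrete, here is a rough sketch of one way negatives could be generated so that every miRNA appears equally often in both classes. This is an assumption for illustration only, not the authors' methodology and not the miRBench package API; all names and the toy data are hypothetical.

```python
import random
from collections import Counter

def matched_negatives(positives, seed=0):
    """For each positive (miRNA, site) pair, emit one negative that keeps the
    same miRNA but borrows a site observed with a different miRNA."""
    rng = random.Random(seed)
    negatives = []
    for mirna, _site in positives:
        candidates = [s for m, s in positives if m != mirna]
        negatives.append((mirna, rng.choice(candidates)))
    return negatives

def class_frequency_gap(positives, negatives):
    """Per-miRNA difference between positive and negative counts.
    All zeros means miRNA identity alone cannot predict the label."""
    pos = Counter(m for m, _ in positives)
    neg = Counter(m for m, _ in negatives)
    return {m: pos[m] - neg[m] for m in set(pos) | set(neg)}

# Made-up toy interactions
toy_positives = [("miR-a", "site1"), ("miR-a", "site2"), ("miR-b", "site3")]
toy_negatives = matched_negatives(toy_positives)
print(class_frequency_gap(toy_positives, toy_negatives))
```

A randomly borrowed site may still be a true target of the miRNA it is paired with, so a real pipeline would need additional filtering.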

Availability: The miRBench Python Package is accessible at https://github.com/katarinagresova/miRBench/releases/tag/v1.0.0

Contact: panagiotis.alexiou@um.edu.mt

12:40-13:00
Flash Talk Session 1
Room: 02F
Format: In person

Moderator(s): Anthony Mathelier


Authors List:

  • Aryan Kamal
  • Damla Baydar
  • Laura Hinojosa
  • Charles-Henri Lecellier

Presentation Overview:

Session with 4 short talks:
Aryan Kamal - Transcriptional regulation of cell fate plasticity in hematopoiesis
Damla Övek Baydar - Enhancing JASPAR and UniBind databases with deep learning models for transcription factor-DNA interactions
Laura Hinojosa - Master Transcription Factors Regulate Replication Timing
Charles-Henri Lecellier - DNA replication timing and Copy Number Variations are confounders of RNA-DNA interaction data

14:00-14:20
Proceedings Presentation: Unicorn: Enhancing Single-Cell Hi-C Data with Blind Super-Resolution for 3D Genome Structure Reconstruction
Confirmed Presenter: Oluwatosin Oluwadare, University of Colorado, Colorado Springs, United States

Room: 02F
Format: In person

Moderator(s): Annique Claringbould


Authors List:

  • Mohan Kumar Chandrashekar, University of Colorado, Colorado Springs, United States
  • Rohit Menon, University of Colorado, Colorado Springs, United States
  • Samuel Olowofila, University of Colorado, Colorado Springs, United States
  • Oluwatosin Oluwadare, University of Colorado, Colorado Springs, United States

Presentation Overview:

Motivation: Single-cell Hi-C (scHi-C) data provide critical insights into chromatin interactions at individual cell levels, uncovering unique genomic 3D structures. However, scHi-C datasets are characterized by sparsity and noise, complicating efforts to accurately reconstruct high-resolution chromosomal structures. In this study, we present ScUnicorn, a novel blind Super-Resolution framework for scHi-C data enhancement. ScUnicorn employs an iterative degradation kernel optimization process, unlike traditional Super-resolution approaches, which rely on downsampling, predefined degradation ratios, or constant assumptions about the input data to reconstruct high-resolution interaction matrices. Hence, our approach more reliably preserves critical biological patterns and minimizes noise. Additionally, we propose 3DUnicorn, a maximum likelihood algorithm that leverages the enhanced scHi-C data to infer precise 3D chromosomal structures.

Result: Our evaluation demonstrates that ScUnicorn achieves superior performance over the state-of-the-art methods in terms of Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index Measure (SSIM), and GenomeDisco scores. Moreover, 3DUnicorn’s reconstructed structures align closely with experimental 3D-FISH data, underscoring its biological relevance. Together, ScUnicorn and 3DUnicorn provide a robust framework for advancing genomic research by enhancing scHi-C data fidelity and enabling accurate 3D genome structure reconstruction.
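
For readers unfamiliar with PSNR, the standard definition is 10·log10(MAX² / MSE). The sketch below applies it to two small contact matrices; the function name, the toy matrices, and the assumption that values are scaled to [0, 1] are illustrative and not taken from the Unicorn code.

```python
import math

def psnr(reference, estimate, max_value=1.0):
    """Peak Signal-to-Noise Ratio (dB) between two equally sized matrices."""
    flat_ref = [v for row in reference for v in row]
    flat_est = [v for row in estimate for v in row]
    mse = sum((r - e) ** 2 for r, e in zip(flat_ref, flat_est)) / len(flat_ref)
    if mse == 0:
        return float("inf")  # identical matrices
    return 10 * math.log10(max_value ** 2 / mse)

# Made-up 2x2 "contact matrices" scaled to [0, 1]
high_res = [[0.0, 1.0], [1.0, 0.0]]
enhanced = [[0.0, 1.0], [1.0, 0.5]]
print(round(psnr(high_res, enhanced), 2))
```

Higher values mean the enhanced matrix is closer to the reference.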

Code Availability: Unicorn implementation is publicly accessible at https://github.com/OluwadareLab/Unicorn

14:20-14:40
Predicting gene-specific regulation with transcriptomic and epigenetic single-cell data
Confirmed Presenter: Laura Rumpf, Goethe University Frankfurt Main, Germany

Room: 02F
Format: In person

Moderator(s): Annique Claringbould


Authors List:

  • Laura Rumpf, Goethe University Frankfurt Main, Germany
  • Fatemeh Behjati, Goethe University Frankfurt Main, Germany
  • Dennis Hecker, Goethe University Frankfurt, Germany
  • Marcel Schulz, Goethe University, Germany

Presentation Overview:

To gain insights into phenotype-specific gene regulation, we present our integrative analysis approach MetaFR harnessing single-cell epigenetic and transcriptomic data.
MetaFR generates random forest regression models in a gene-specific manner utilizing both scATAC-seq and scRNA-seq data to predict gene expression in a large window around a target gene. The gene window is partitioned into bins of equal size which correspond to the model features holding the epigenetic signal counts. The importance of model features can be leveraged to prioritize enhancer-gene interactions.
The inherent sparsity problem of single-cell data is addressed by aggregating the scRNA-seq and scATAC-seq signal into metacells based on gene activity similarities.
MetaFR enables large-scale analysis of scATAC-seq and scRNA-seq data in an automated fashion. The automated pipeline has been successfully applied to a human PBMC dataset to identify immune cell-specific enhancer-gene interactions. We validated our findings with experimentally measured interactions (CRISPRi regions) and fine-mapped eQTLs. We benchmarked our performance against the state-of-the-art method SCARlink.
We were able to outperform SCARlink in both accuracy and runtime.
Our pipeline allows time-efficient analysis and obtains reliable models of gene expression, which can be used to study gene regulatory elements in any organism for which scRNA-seq and scATAC-seq data become available.
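
The binning step described above can be sketched as follows. The window and bin sizes are made-up defaults and the function name is hypothetical; the resulting per-bin counts are the kind of feature vector that could be passed, one row per metacell, to a random forest regressor.

```python
def bin_counts(fragment_positions, gene_tss, window=1000, bin_size=250):
    """Count scATAC-seq fragments in equal-size bins of a window centred on a gene.

    Positions outside the window are ignored; bins are half-open [start, end).
    """
    start = gene_tss - window // 2
    n_bins = window // bin_size
    counts = [0] * n_bins
    for pos in fragment_positions:
        offset = pos - start
        if 0 <= offset < n_bins * bin_size:
            counts[offset // bin_size] += 1
    return counts

# Made-up fragment positions around a gene with TSS at 10,000
print(bin_counts([9600, 9760, 10010, 10490, 10500], gene_tss=10000))
```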

14:40-15:00
Biophysical deep learning resolves how TF and DNA sequence specify the genome state of every cell population in human embryogenesis
Room: 02F
Format: In person

Moderator(s): Annique Claringbould


Authors List:

  • Vitalii Kleshchevnikov, Wellcome Sanger Institute, United Kingdom
  • Oliver Stegle, Computational Genomics and Systems Genetics, Deutsches Krebsforschungszentrum (DKFZ), Germany
  • Omer Bayraktar, Wellcome Sanger Institute, United Kingdom

Presentation Overview:

Understanding how interactions between transcription factors (TFs) and DNA sequence are orchestrated and give rise to the vast complexity of cell types is a major challenge of regulatory developmental biology. Large-scale multimodal single-cell RNA-seq and ATAC-seq atlases enable reconstructing the regulatory mechanisms across cell types from data, laying the foundation for cell programming and design of synthetic regulatory elements.


Despite significant progress, current DNA sequence models fail to account for cellular context, TF-DNA sequence relationships and TF combinatorics in a principled manner, limiting their causal expressiveness and generalization capacity across cell types. To overcome this, we developed cell2state, an end-to-end deep learning model with biophysical constraints on how TFs specify the genome accessibility state in every cell population. Cell2state leverages known TF-motif interactions while accounting for biophysical constraints, employs an interpretable neural network based on HyenaDNA architecture and captures TF-TF synergy and antagonism, enabling the model to integrate DNA sequence and transcription factor (TF) abundance. We demonstrated cell2state's generalisation capabilities by predicting ATAC-seq signals for new chromosomes and cell types.


To link regulatory TF interactions to developmental processes at whole embryo scale, we applied cell2state to an unpublished multimodal single-cell and spatial transcriptomics atlas covering over 1,000 human developmental cell states (n=4,000 pseudobulk replicates, n=5 embryos). At critical developmental junctions, such as the dorsal-ventral patterning of the spinal cord/hindbrain and anterior-posterior patterning of the forebrain, cell2state revealed how enhancer DNA sequences integrate activities of cell-type-defining TFs (LHX2, PAX6) with cell communication pathway TFs (GLI, TCF).

15:00-15:20
Nona: A unifying multimodal masked modeling framework for functional genomics
Confirmed Presenter: Surag Nair, Genentech Inc, United States

Room: 02F
Format: In person

Moderator(s): Annique Claringbould


Authors List:

  • Surag Nair, Genentech Inc, United States
  • Alex Tseng, Genentech Inc, United States
  • Ehsan Hajiramezanali, Genentech Inc, United States
  • Nathaniel Diamant, Genentech Inc, United States
  • Avantika Lal, Genentech Inc, United States
  • Tommaso Biancalani, Genentech Inc, United States
  • Gabriele Scalia, Genentech Inc, United States
  • Gokcen Eraslan, Genentech Inc, United States

Presentation Overview:

We present Nona, a unifying multimodal masked modeling paradigm for functional genomics. Nona is a neural network model that operates on both DNA sequence and epigenetic tracks such as DNase-seq, ChIP-seq, and RNA-seq at base-pair resolution. By leveraging a flexible masking strategy, Nona can predict any subset of masked DNA and/or tracks from the unmasked subset. As a result, Nona encompasses versatile existing and novel use cases that were hitherto addressed using separate models. In addition to vanilla sequence-to-function prediction and DNA language modeling, Nona enables multiple novel application modes, of which we highlight 3: 1) context-aware prediction, where the model predicts epigenetic tracks in a local genomic window by taking into account the observed epigenetic tracks in adjacent windows, in addition to the DNA sequence, 2) sequence generation, where a conditional language model is used to iteratively generate a DNA sequence with desired epigenetic profiles across cellular states, 3) functional genotyping, where a conditional language model trained on base resolution ATAC-seq is used to infer the genotype of the sample donors. Beyond these applications, Nona can enable use cases such as functional perturbations and denoising functional measurements. Altogether, Nona is a versatile paradigm that extends sequence-to-function and masked language modeling to novel applications in regulatory genomics.

15:20-15:40
SCRIMPy: Single Cell Replication Inference from Multiome data using Python
Room: 02F
Format: In person

Moderator(s): Annique Claringbould


Authors List:

  • Tatevik Jalatyan, Armenian Bioinformatics Institute; Chromatin and Disease Group, Centre for Human Genetics, Oxford University, Armenia
  • Jennifer Herrmann, Chromatin and Disease Group, Centre for Human Genetics, Oxford University, United Kingdom
  • Antonio Rodriguez-Romera, MRC Weatherall Institute of Molecular Medicine, University of Oxford, United Kingdom
  • Beth Psaila, MRC Weatherall Institute of Molecular Medicine, University of Oxford, United Kingdom
  • Jim Hughes, MRC Weatherall Institute of Molecular Medicine, University of Oxford, United Kingdom
  • Simone Riva, MRC Weatherall Institute of Molecular Medicine, University of Oxford, United Kingdom
  • Robert Beagrie, Chromatin and Disease Group, Centre for Human Genetics, Oxford University, United Kingdom

Presentation Overview:

The cell cycle is a fundamental biological process crucial for an organism’s growth and development. Dysregulation of the cell cycle can lead to diseases such as cancer, neurodegenerative, cardiovascular, or autoimmune disorders. Thus, accurate characterization of cell cycle dynamics in healthy and disease states is important for understanding disease mechanisms. Existing methods for cell cycle state prediction from single-cell data use the expression of marker genes in individual cells. However, these approaches perform poorly on single-cell multiome (ATAC+GEX) data, likely due to the increased data sparsity and nuclear RNA bias.
To address these limitations, we propose a novel method for cell cycle state inference that uses replication-driven DNA copy number signals from scATAC-seq data. Our approach is based on two complementary metrics that reflect the replication state of individual cells. First, we capture the imbalance of ATAC fragment depth between early- and late-replicating regions of the genome to identify S-phase cells with higher DNA copy number in early-replicating domains. Second, we introduce a novel metric for DNA copy number in ATAC-seq data to differentiate G1-phase cells from G2/M-phase cells, since the latter have duplicated DNA content. We apply this method to multiome data from mouse embryonic stem cells sorted by cell cycle state (G1, S, G2/M) and show that SCRIMPy outperforms the commonly used expression-based classifier Seurat.
With the increasing availability of multiome datasets, this approach holds promise for deriving novel insights into cell cycle mechanisms in diseases and identifying potential therapeutic targets.
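
The first metric, the imbalance of fragment depth between early- and late-replicating regions, could be approximated per cell as a log ratio of fragment densities. This is only a sketch of the idea under stated assumptions: the function name, the pseudocount, and the normalisation by region length are hypothetical and not taken from SCRIMPy.

```python
import math

def early_late_log_ratio(early_fragments, late_fragments,
                         early_bp, late_bp, pseudocount=1.0):
    """log2 ratio of ATAC fragment density in early- vs late-replicating regions.

    Values well above zero would suggest extra DNA copies in early-replicating
    domains, as expected for a cell in S phase.
    """
    early_density = (early_fragments + pseudocount) / early_bp
    late_density = (late_fragments + pseudocount) / late_bp
    return math.log2(early_density / late_density)

# Made-up counts for one cell, with equally sized early and late regions
print(early_late_log_ratio(199, 99, early_bp=1000, late_bp=1000))
```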

15:40-16:00
Proceedings Presentation: Soffritto: a deep-learning model for predicting high-resolution replication timing
Confirmed Presenter: Dante Bolzan, La Jolla Institute for Immunology, United States

Room: 02F
Format: In person

Moderator(s): Annique Claringbould


Authors List:

  • Dante Bolzan, La Jolla Institute for Immunology, United States
  • Ferhat Ay, La Jolla Institute for Immunology, United States

Presentation Overview:

Motivation: Replication Timing (RT) refers to the order in which DNA loci are replicated during S phase. RT is cell-type specific and implicated in cellular processes including transcription, differentiation, and disease. RT is typically quantified genome-wide using two-fraction assays (e.g., Repli-Seq), which sort cells into early and late S phase fractions followed by DNA sequencing, yielding a ratio as the RT signal. While two-fraction RT data is widely available in multiple cell lines, it is limited in its ability to capture high-resolution RT features. To address this, high-resolution Repli-Seq, which quantifies RT across 16 fractions, was developed, but it is costly and technically challenging, with very limited data generated to date.
Results: Here we developed Soffritto, a deep learning model that predicts high-resolution RT data using two-fraction RT data, histone ChIP-seq data, GC content, and gene density as input. Soffritto is composed of a Long Short Term Memory (LSTM) module and a prediction module. The LSTM module learns long- and short-range interactions between genomic bins while the prediction module is composed of a fully connected layer that outputs a 16-fraction probability vector for each bin using the LSTM module’s embeddings as input. By performing both within cell line and cross cell line training and testing for five human and mouse cell lines, we show that Soffritto is able to capture experimental 16-fraction RT signals with high accuracy and the predicted signals allow detection of high-resolution RT patterns.
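
The prediction module described above, a fully connected layer producing a 16-fraction probability vector per bin, can be sketched in plain Python. Using a softmax to normalise the output is an assumption (the abstract only states that the output is a probability vector), and the names, embedding size, and weights are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_fractions(embedding, weights, bias):
    """Fully connected layer followed by softmax: one probability per S-phase fraction."""
    logits = [sum(w * x for w, x in zip(row, embedding)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)

# Made-up 2-dimensional embedding for one genomic bin and an untrained (all-zero) layer
n_fractions = 16
weights = [[0.0, 0.0] for _ in range(n_fractions)]
bias = [0.0] * n_fractions
probs = predict_fractions([0.3, -1.2], weights, bias)
print(len(probs), sum(probs))
```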

16:40-17:00
Ledidi: Programmatic design and editing of cis-regulatory elements
Confirmed Presenter: Jacob Schreiber, Research Institute of Molecular Pathology (IMP), Austria

Room: 02F
Format: In person

Moderator(s): Alejandra Medina Rivera


Authors List:

  • Jacob Schreiber, Research Institute of Molecular Pathology (IMP), Austria
  • Franziska Lorbeer, Research Institute of Molecular Pathology (IMP), Austria
  • Monika Heinzl, Research Institute of Molecular Pathology (IMP), Austria
  • Yang Lu, University of Waterloo, Canada
  • Alexander Stark, Research Institute of Molecular Pathology (IMP), Austria
  • William Noble, University of Washington, United States

Presentation Overview:

The development of modern genome editing tools has enabled researchers to make genomic edits with high precision, but has left unsolved the problem of designing these edits. As a solution, we propose Ledidi, a computational approach that rephrases the design of genomic edits as a continuous optimization problem where the goal is to produce the desired outcome as measured by one or more predictive models using as few edits from an initial sequence as possible. Ledidi can be paired with any pre-trained machine learning model, and when applied across dozens of such models, we find that Ledidi can quickly design edits to precisely control transcription factor binding, chromatin accessibility, transcription, and enhancer activity across several species. Ledidi can achieve its target objective using surprisingly few edits by converting weak affinity TF binding sites into stronger affinity ones, and can do so almost an order of magnitude faster than other approaches. Unlike other approaches, Ledidi can use several models simultaneously to programmatically design edits that exhibit multiple desired characteristics. We demonstrate this capability by designing uniformly accessible regions with controllable patterns of TF binding, by designing cell type-specific enhancers, and by showing how one can use multiple models that predict the same thing to more robustly design edits. Finally, we introduce the concept of an affinity catalog, in which multiple sets of edits are designed that induce a spectrum of outcomes, and demonstrate the practical benefits of this approach for design tasks and scientific understanding.
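
The trade-off at the heart of this formulation, reaching a target model output while making few edits, can be written as a penalised objective. The sketch below only evaluates such an objective on discrete candidate sequences; it does not perform Ledidi's continuous optimization and does not use the Ledidi library API. The stand-in predictor, the squared-error loss, and the weight are all hypothetical.

```python
def edit_count(original, edited):
    """Number of positions at which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(original, edited))

def design_objective(original, edited, predict, target, edit_weight=0.1):
    """Squared distance to the target output plus a penalty per edit."""
    return (predict(edited) - target) ** 2 + edit_weight * edit_count(original, edited)

def toy_predict(seq):
    """Hypothetical stand-in for a trained model: returns the GC fraction."""
    return sum(base in "GC" for base in seq) / len(seq)

# Two edits reach the target exactly; the penalty makes each edit cost 0.1
print(design_objective("AAAA", "GGAA", toy_predict, target=0.5))
print(design_objective("AAAA", "AAAA", toy_predict, target=0.5))
```

Raising `edit_weight` favours leaving the sequence closer to the original; lowering it favours hitting the target more precisely.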

17:00-17:20
Lilliput: Compact native regulatory element design with machine learning-guided miniaturization
Room: 02F
Format: In person

Moderator(s): Alejandra Medina Rivera


Authors List:

  • Laura Gunsalus, Genentech, United States
  • Avantika Lal, Genentech, United States
  • Tommaso Biancalani, Genentech, United States
  • Gokcen Eraslan, Genentech, United States

Presentation Overview:

Size-limited gene therapy vectors require compact cell type-specific regulatory elements. Existing miniaturized sequences have been hand-selected and curated, relying on costly experimental iteration. We present Lilliput, a method for designing compact and specific regulatory elements by nominating and iteratively editing endogenous elements with state-of-the-art DNA sequence-to-function models. Our approach involves scoring elements in silico, removing subsequences with limited predicted impact, and introducing minimal mutations to increase specificity. We demonstrate the effectiveness of our approach by reducing a 10kb heart-specific locus to under 300bp. Our method offers a generalizable framework for engineering mini-elements across diverse target cell types. More broadly, we identify core sequence features sufficient to determine cell-type specific expression patterns, advancing our understanding of the mechanisms underlying precise control of gene expression.
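
The step of removing subsequences with limited predicted impact can be sketched as a greedy loop. This is an illustration of the general idea, not Lilliput's procedure: the scoring function stands in for a sequence-to-function model, and the chunk size, tolerance, and motif are made up.

```python
def miniaturize(sequence, score, chunk=3, tolerance=0.0):
    """Greedily delete fixed-size chunks while the predicted score stays
    within `tolerance` of the score of the full-length sequence."""
    baseline = score(sequence)
    current = sequence
    changed = True
    while changed:
        changed = False
        for start in range(0, len(current) - chunk + 1):
            candidate = current[:start] + current[start + chunk:]
            if candidate and baseline - score(candidate) <= tolerance:
                current = candidate  # accept the deletion and rescan
                changed = True
                break
    return current

def toy_score(seq):
    """Hypothetical stand-in model: active only if a GATA motif is present."""
    return 1.0 if "GATA" in seq else 0.0

print(miniaturize("TTTGATATTTCCC", toy_score))
```

Each accepted deletion shortens the sequence, so the loop always terminates.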

17:20-18:00
Invited Presentation: TBA
Room: 02F
Format: In person

Moderator(s): Alejandra Medina Rivera


Authors List:

  • Luca Pinello
Thursday, July 24th
8:40-9:20
Invited Presentation: TBA
Room: 02F
Format: In person

Moderator(s): Xiuwei Zhang


Authors List:

  • Roser Vento-Tormo
9:20-9:40
Proceedings Presentation: Anomaly Detection in Spatial Transcriptomics via Spatially Localized Density Comparison
Confirmed Presenter: Gary Hu, Princeton University, United States

Room: 02F
Format: In person

Moderator(s): Xiuwei Zhang


Authors List:

  • Gary Hu, Princeton University, United States
  • Julian Gold, Princeton University, United States
  • Uthsav Chitra, Broad Institute of MIT and Harvard, United States
  • Sunay Joshi, University of Pennsylvania, United States
  • Benjamin Raphael, Princeton University, United States

Presentation Overview:

Motivation
Perturbations in biological tissues – e.g. due to inflammation, disease, or drug treatment – alter the composition of cell types and cell states in the tissue. These alterations are often spatially localized in different regions of a tissue, and can be measured using spatial transcriptomics technologies. However, current methods to analyze differential abundance in cell types or cell states either do not incorporate spatial information – and thus cannot identify spatially localized alterations – or use heuristic and inaccurate approaches.

Results
We introduce Spatial Anomaly Region Detection in Expression Manifolds (Sardine), a method to estimate spatially localized changes in spatial transcriptomics data obtained from tissue slices from two or more conditions. Sardine estimates the probability of a cell state being at the same (relative) spatial location between different conditions using spatially localized density estimation. On simulated data, Sardine recapitulates the spatial patterning of expression changes more accurately than existing approaches. On a Visium dataset of the mouse cerebral cortex before and after injury response, as well as on a Visium dataset of a mouse spinal cord undergoing electrotherapy, Sardine identifies regions of spatially localized expression changes that are more biologically plausible than alternative approaches.
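
As a rough illustration of comparing spatially localized densities between two conditions, the sketch below uses a generic Gaussian kernel density estimate over spot coordinates and reports a log ratio. It ignores expression entirely and is not Sardine's estimator; the function names, bandwidth, and toy coordinates are hypothetical.

```python
import math

def local_density(points, query, bandwidth=1.0):
    """2-D Gaussian kernel density estimate at `query`."""
    qx, qy = query
    total = 0.0
    for x, y in points:
        d2 = (x - qx) ** 2 + (y - qy) ** 2
        total += math.exp(-d2 / (2 * bandwidth ** 2))
    return total / (len(points) * 2 * math.pi * bandwidth ** 2)

def log_density_ratio(cond_a, cond_b, query, bandwidth=1.0, eps=1e-12):
    """Positive where condition B is locally denser than condition A."""
    dens_a = local_density(cond_a, query, bandwidth)
    dens_b = local_density(cond_b, query, bandwidth)
    return math.log((dens_b + eps) / (dens_a + eps))

# Made-up coordinates of spots carrying one cell state in two conditions
control = [(0.0, 0.0), (10.0, 10.0)]
treated = [(0.0, 0.0), (0.0, 0.0)]
print(log_density_ratio(control, treated, query=(0.0, 0.0)) > 0)
```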

9:40-10:00
Flash Talk Session 2
Room: 02F
Format: In person


Authors List:

  • Maxime Christophe
  • Gabriela A Merino
  • Erick Isaac Navarro Delgado
  • Tomas Rube

Presentation Overview:

Session with 4 short talks:
Maxime Christophe - Interpretable deep learning reveals sequence determinants of nucleosome positioning in mammalian genomes
Gabriela A Merino - Ensembl’s multispecies catalogue of regulatory elements
Erick Isaac Navarro Delgado - RAMEN: A reproducible framework for dissecting individual, additive and interactive gene-environment contributions in genomic regions with variable DNA methylation
Tomas Rube - Accurate affinity models for SH2 domains from peptide binding assays and free-energy regression

11:20-11:40
Proceedings Presentation: GASTON-Mix: a unified model of spatial gradients and domains using spatial mixture-of-experts
Confirmed Presenter: Uthsav Chitra, Princeton University, United States

Room: 02F
Format: In person


Authors List:

  • Uthsav Chitra, Princeton University, United States
  • Shu Dan, Princeton University, United States
  • Fenna Krienen, Princeton University, United States
  • Ben Raphael, Princeton University, United States

Presentation Overview:

Motivation: Gene expression varies across a tissue due to both the organization of the tissue into spatial domains, i.e. discrete regions of a tissue with distinct cell type composition, and continuous spatial gradients of gene expression within different spatial domains. Spatially resolved transcriptomics (SRT) technologies provide high-throughput measurements of gene expression in a tissue slice, enabling the characterization of spatial gradients and domains. However, existing computational methods for quantifying spatial variation in gene expression either model only spatial domains – and do not account for continuous gradients of expression – or require restrictive geometric assumptions on the spatial domains and spatial gradients that do not hold for many complex tissues.

Results: We introduce GASTON-Mix, a machine learning algorithm to identify both spatial domains and spatial gradients within each domain from SRT data. GASTON-Mix extends the mixture-of-experts (MoE) deep learning framework to a spatial MoE model, combining the clustering component of the MoE model with a neural field model that learns a separate 1-D coordinate (“isodepth”) within each domain. The spatial MoE is capable of representing any geometric arrangement of spatial domains in a tissue, and the isodepth coordinates define continuous gradients of gene expression within each domain. We show using simulations and real data that GASTON-Mix identifies spatial domains and spatial gradients of gene expression more accurately than existing methods. GASTON-Mix reveals spatial gradients in the striatum and lateral septum that regulate complex social behavior, and GASTON-Mix reveals localized spatial gradients of hypoxia and TNF-α signaling in the tumor microenvironment.

11:40-12:00
Proceedings Presentation: Refinement Strategies for Tangram for Reliable Single-Cell to Spatial Mapping
Confirmed Presenter: Merle Stahl, Data Science in Systems Biology, TUM School of Life Sciences, Technical University of Munich, Freising, Germany

Room: 02F
Format: In person


Authors List:

  • Merle Stahl, Data Science in Systems Biology, TUM School of Life Sciences, Technical University of Munich, Freising, Germany
  • Lena J. Straßer, Data Science in Systems Biology, TUM School of Life Sciences, Technical University of Munich, Freising, Germany
  • Chit Tong Lio, Data Science in Systems Biology, TUM School of Life Sciences, Technical University of Munich, Freising, Germany
  • Judith Bernett, Data Science in Systems Biology, TUM School of Life Sciences, Technical University of Munich, Freising, Germany
  • Richard Röttger, Department of Mathematics and Computer Science, University of Southern Denmark, Odense, Denmark
  • Markus List, Data Science in Systems Biology, TUM School of Life Sciences, Technical University of Munich, Freising, Germany

Presentation Overview:

Motivation: Single-cell RNA sequencing (scRNA-seq) provides comprehensive gene expression data at a single-cell level but lacks spatial context. In contrast, spatial transcriptomics captures both spatial and transcriptional information but is limited by resolution, sensitivity, or feasibility. No single technology combines both the high spatial resolution and deep transcriptomic profiling at the single-cell level without trade-offs. Spatial mapping tools that integrate scRNA-seq and spatial transcriptomics data are crucial to bridge this gap. However, we found that Tangram, one of the most prominent spatial mapping tools, provides inconsistent results over repeated runs.
Results: We refine Tangram to achieve more consistent cell mappings and investigate the challenges that arise from data characteristics. We find that the mapping quality depends on the gene expression sparsity. To address this, we (1) train the model on an informative gene subset, (2) apply cell filtering, (3) introduce several forms of regularization, and (4) incorporate neighborhood information. Evaluations on real and simulated mouse datasets demonstrate that this approach improves both gene expression prediction and cell mapping. Consistent cell mapping strengthens the reliability of the projection of cell annotations and features into space, gene imputation, and correction of low-quality measurements. Our pipeline, which includes gene set and hyperparameter selection, can serve as guidance for applying Tangram on other datasets, while our benchmarking framework with data simulation and inconsistency metrics is useful for evaluating other tools or Tangram modifications.
Availability: The refinements for Tangram and our benchmarking pipeline are available at https://github.com/daisybio/Tangram_Refinement_Strategies
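
Inconsistency over repeated runs can be quantified in many ways; the abstract does not specify the authors' metrics. One simple, hypothetical option is the average fraction of cells assigned to the same spot across every pair of runs, sketched here with made-up mappings (this does not use the Tangram API).

```python
from itertools import combinations

def mapping_agreement(runs):
    """Mean pairwise fraction of cells mapped to the same spot.

    `runs` is a list of dicts (cell id -> spot id), one per repeated run,
    all covering the same cells. 1.0 means fully consistent mappings.
    """
    pairs = list(combinations(runs, 2))
    agreement = [sum(a[cell] == b[cell] for cell in a) / len(a) for a, b in pairs]
    return sum(agreement) / len(agreement)

# Made-up hard assignments from two repeated runs
run_1 = {"cell1": "spotA", "cell2": "spotB"}
run_2 = {"cell1": "spotA", "cell2": "spotC"}
print(mapping_agreement([run_1, run_2]))
```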

12:00-12:20
Encoding single-cell chromatin landscapes as probability distributions with optimal transport
Confirmed Presenter: Cassandra Burdziak, Memorial Sloan Kettering Cancer Center, United States

Room: 02F
Format: In person


Authors List:

  • Cassandra Burdziak, Memorial Sloan Kettering Cancer Center, United States
  • Danielle Maydan, Columbia University, United States
  • Doron Haviv, Memorial Sloan Kettering Cancer Center, United States
  • Marisa Mariani, Memorial Sloan Kettering Cancer Center, United States
  • Ronan Chaligne, Memorial Sloan Kettering Cancer Center, United States
  • Dana Pe'er, Memorial Sloan Kettering Cancer Center, United States

Presentation Overview:

Single-cell measurement of paired epigenetic and transcriptomic features is becoming routine, and promises to license more sophisticated models of gene regulation. Still, most existing models are limited to the cis-regulatory element representation (typically, averaged signal at pre-defined accessibility “peaks”), which shrouds much of the chromatin molecule’s fine-grained structure. To maximize chromatin’s explanatory power for cell-state (and fate) prediction, we sought to achieve a more unbiased, quantitative representation of the chromatin molecule by treating the accessibility landscape as a discrete (per-base pair) probability distribution. Given single-cell accessibility data, our approach embeds the chromatin landscape of each cell state according to the optimal transport (OT) distance between the empirical distributions of accessibility at particular loci, whilst controlling for sequence-related biases in DNA tagmentation. The resulting embeddings capture the precise shape of the accessibility distribution, which itself reflects transcription factor binding footprints, nucleosome positions, and RNA polymerase movement. Application of this model in the well-studied hematopoiesis system highlights its superior ability to explain cell state: the latent accessibility distribution is more universally predictive of gene expression than promoter accessibility, and can define transcription factor binding modes active in specific branches of development. Most excitingly, position in latent space may closely correspond with the presence of certain activating or repressive chromatin marks, despite the model lacking such information during training. This representation may thus empower future models of gene regulation with a richer representation of epigenetic data with stronger ties to cellular phenotypes.

12:20-12:40
scooby: Modeling multi-modal genomic profiles from DNA sequence at single-cell resolution
Confirmed Presenter: Laura D. Martens, Technical University of Munich, Germany

Room: 02F
Format: In person


Authors List:

  • Johannes C. Hingerl, Technical University of Munich, Germany
  • Laura D. Martens, Technical University of Munich, Germany
  • Alexander Karollus, Technical University of Munich, Germany
  • Trevor Manz, Harvard Medical School, United States
  • Jason D. Buenrostro, Harvard University, United States
  • Fabian J. Theis, Helmholtz Center Munich, Germany
  • Julien Gagneur, Technical University of Munich, Germany

Presentation Overview:

Understanding how regulatory sequences shape gene expression across individual cells is a fundamental challenge in genomics. Joint RNA-seq and epigenomic profiling provides opportunities to build models capturing sequence determinants across steps of gene expression. However, current models, developed primarily for bulk omics data, fail to capture the cellular heterogeneity and dynamic processes revealed by single-cell multi-modal technologies. Here, we introduce scooby, a framework to model genomic profiles of scRNA-seq coverage and scATAC-seq insertions from sequence at single-cell resolution. For this, we leverage the pre-trained multi-omics profile predictor Borzoi and equip it with a cell-specific decoder. Scooby recapitulates cell-specific expression levels of held-out genes and identifies regulators and their putative target genes. Moreover, scooby makes it possible to resolve single-cell effects of bulk eQTLs and to delineate their impact on chromatin accessibility and gene expression. We anticipate that scooby will aid in unraveling the complexities of gene regulation at the resolution of individual cells.

12:40-13:00
Uncovering Novel Cellular Programs and Regulatory Circuits Underlying Bifurcating Human B Cell States
Room: 02F
Format: In person


Authors List:

  • Zarifeh Rarani, University of Pittsburgh, United States
  • Swapnil Keshari, University of Pittsburgh, United States
  • Akanksha Sachan, University of Pittsburgh, United States
  • Nicholas Pease, University of Pittsburgh, United States
  • Jingyu Fan, University of Pittsburgh, United States
  • Peter Gerges, University of Pittsburgh, United States
  • Harinder Singh, University of Pittsburgh, United States
  • Jishnu Das, University of Pittsburgh, United States

Presentation Overview:

Upon antigen encounter, B cells undergo activation followed by a bifurcation into either extrafollicular plasmablasts (PB) or germinal center (GC) B cells. We have assembled gene regulatory networks (GRNs) underlying this bifurcation using temporally resolved single-cell multiomics. To complement this, we analyzed transcriptomic states of GC and PB cells using SLIDE, a novel interpretable machine learning approach to infer a small set of cellular programs (latent factors/LFs) necessary and sufficient to distinguish GC and PB cells. These LFs provide stronger discrimination between the two emergent cell states than DEG analyses. Interestingly, when the LF genes were cross-referenced with state-specific GRNs, the LFs recapitulated aspects of GRN architecture orchestrating the bifurcation. Intriguingly, the LFs also captured gene programs reflective of cell-fate propensity prior to the bifurcation in activated B cells. These programs were validated using perturbation of key TFs.

To move beyond high-resolution static state-specific GRNs, we used a stochastic ODE-based framework to construct a dynamic GRN across the 5 states. In addition to recapitulating previously known lineage-defining TFs and their regulons, we identify novel regulons driving divergent gene activity across the bifurcation trajectory. We also combined the dynamic GRN with the inferred cellular programs to predict TF pairs that combinatorially control B cell fate dynamics. Intriguingly, several of these inferred TF pairs are not detected by conventional network topological metrics. Overall, our framework is generalizable and applicable across contexts to identify cellular programs and regulatory circuits underlying diverse cell fate bifurcations.

14:00-14:20
Proceedings Presentation: Detection of Cell-type-specific Differentially Methylated Regions in Epigenome-Wide Association Studies
Confirmed Presenter: Yingying Wei, The Chinese University of Hong Kong, Hong Kong

Room: 02F
Format: In person

Moderator(s): Marcel Schulz


Authors List:

  • Ruofan Jia, The Chinese University of Hong Kong, Hong Kong
  • Yingying Wei, The Chinese University of Hong Kong, Hong Kong

Presentation Overview:

DNA methylation at cytosine-phosphate-guanine (CpG) sites is one of the most important epigenetic markers. Therefore, epidemiologists are interested in investigating DNA methylation in large cohorts through epigenome-wide association studies (EWAS). However, the observed EWAS data are bulk data with signals aggregated from distinct cell types. Deconvolution of cell-type-specific signals from EWAS data is challenging because phenotypes can affect both cell-type proportions and cell-type-specific methylation levels. Recently, there has been active research on detecting cell-type-specific risk CpG sites for EWAS data. However, existing methods all assume that the methylation levels of different CpG sites are independent and perform association detection for each CpG site separately. Although they significantly improve detection at the aggregated level (identifying a CpG site as a risk CpG site as long as it is associated with the phenotype in any cell type), they have low power in detecting cell-type-specific associations for EWAS with typical sample sizes. Here, we develop a new method, Fine-scale inference for Differentially Methylated Regions (FineDMR), to borrow strength from nearby CpG sites to improve the cell-type-specific association detection. Via a Bayesian hierarchical model built upon Gaussian process functional regression, FineDMR takes advantage of the spatial dependencies between CpG sites. FineDMR can provide cell-type-specific association detection as well as output cell-type-specific methylation profiles for each subject. Simulation studies and real data analysis show that FineDMR substantially improves the power in detecting cell-type-specific associations for EWAS data. FineDMR is freely available at https://github.com/JiaRuofan/Detection-of-Cell-type-specific-DMRs-in-EWAS.

14:20-14:40
Proceedings Presentation: MutBERT: Probabilistic Genome Representation Improves Genomics Foundation Models
Confirmed Presenter: Weicai Long, Hong Kong University of Science and Technology (Guangzhou), China

Room: 02F
Format: In person

Moderator(s): Marcel Schulz


Authors List:

  • Weicai Long, Hong Kong University of Science and Technology (Guangzhou), China
  • Houcheng Su, Hong Kong University of Science and Technology (Guangzhou), China
  • Jiaqi Xiong, Hong Kong University of Science and Technology (Guangzhou), China
  • Yanlin Zhang, Hong Kong University of Science and Technology (Guangzhou), China

Presentation Overview:

Motivation: Understanding the genomic foundation of human diversity and disease requires models that effectively capture sequence variation, such as single nucleotide polymorphisms (SNPs). While recent genomic foundation models have scaled to larger datasets and multi-species inputs, they often fail to account for the sparsity and redundancy inherent in human population data, such as those in the 1000 Genomes Project. SNPs are rare in humans, and current masked language models (MLMs) trained directly on whole-genome sequences may struggle to efficiently learn these variations. Additionally, training on the entire dataset without prioritizing regions of genetic variation results in inefficiencies and negligible gains in performance.
Results: We present MutBERT, a probabilistic genome-based masked language model that efficiently utilizes SNP information from population-scale genomic data. By representing the entire genome as a probability distribution over observed allele frequencies, MutBERT focuses on informative genomic variations while maintaining computational efficiency. We evaluated MutBERT against DNABERT-2, various versions of Nucleotide Transformer, and modified versions of MutBERT across multiple downstream prediction tasks. MutBERT consistently ranked as one of the top-performing models, demonstrating that this novel representation strategy enables better utilization of biobank-scale genomic data in building pretrained genomic foundation models.
Availability: https://github.com/ai4nucleome/mutBERT
Contact: yanlinzhang@hkust-gz.edu.cn

14:40-15:00
Detecting and avoiding homology-based data leakage in genome-trained sequence models
Confirmed Presenter: Abdul Muntakim Rafi, University of British Columbia, Canada

Room: 02F
Format: In person

Moderator(s): Marcel Schulz


Authors List:

  • Abdul Muntakim Rafi, University of British Columbia, Canada
  • Brett Kiyota, University of British Columbia, Canada
  • Nozomu Yachie, University of British Columbia, Canada
  • Carl de Boer, University of British Columbia, Canada

Presentation Overview:

Models that predict function from DNA sequence have become critical tools in deciphering the roles of genomic sequences and genetic variation within them. However, traditional approaches for dividing the genomic sequences into training data, used to create the model, and test data, used to determine the model’s performance on unseen data, fail to account for the widespread homology that permeates the genome. Using models that predict human gene expression from DNA sequence, we demonstrate that model performance on test sequences varies by their similarity with training sequences, consistent with homology-based ‘data leakage’ that influences model performance by rewarding overfitting of homologous sequences. Because the sequence and its function are inextricably linked, even a maximally overfit model with no understanding of gene regulation can predict the expression of sequences that are similar to its training data. To prevent leakage in genome-trained models, we introduce ‘hashFrag’, a scalable solution for partitioning data with minimal leakage. hashFrag improves estimates of model performance and can actually increase model performance by providing improved splits for model training. Altogether, we demonstrate how to account for homology-based leakage when partitioning genomic sequences for model training and evaluation, and highlight the consequences of failing to do so.

15:00-15:20
Predicting gene expression using millions of yeast promoters reveals cis-regulatory logic
Confirmed Presenter: Susanne Bornelöv, University of Cambridge, United Kingdom

Room: 02F
Format: In person

Moderator(s): Marcel Schulz


Authors List:

  • Tirtharaj Dash, University of Cambridge, United Kingdom
  • Susanne Bornelöv, University of Cambridge, United Kingdom

Presentation Overview:

Gene expression is largely controlled by transcription factors and their binding and interactions in gene promoter regions. Early attempts to use deep learning to learn about this gene-regulatory logic were limited to training sets containing naturally occurring promoter sequences. However, using massively parallel reporter assays, potential training data can now be expanded by orders of magnitude, going beyond naturally occurring sequences. Nevertheless, a clear understanding of how to best use deep learning to study gene regulation is still lacking. Here we investigate the complex association between promoters and gene expression in S. cerevisiae using Camformer, a residual convolutional neural network that ranked 4th in the Random Promoter DREAM Challenge 2022. We explore the original Camformer model trained on 6.7 million random promoter sequences and investigate 270 alternative models to determine what factors contribute most to model performance. We show that Camformer accurately decodes the association between promoters and gene expression (r² = 0.914 ± 0.003, ρ = 0.962 ± 0.002) and provides a substantial improvement over the previous state of the art. Using explainable AI techniques, such as in silico mutagenesis, we demonstrate that the model learns both individual motifs and their hierarchy. For example, while an IME1 motif on its own increases gene expression, the co-occurrence of IME1 and UME6 motifs strongly reduces gene expression, beyond the repressive effect of UME6 on its own. Thus, we demonstrate that Camformer can be used to provide detailed insights into cis-regulatory logic.

15:20-16:00
Invited Presentation: TBA
Room: 02F
Format: In person

Moderator(s): Marcel Schulz


Authors List:

  • Mafalda Dias