Schedule subject to change
All times listed are in EDT
Saturday, July 13th
10:40-11:20
Invited Presentation: Continual improvement of cis-regulatory models
Confirmed Presenter: Carl de Boer

Room: 518
Format: In Person

Moderator(s): Shaun Mahony


Authors List:

  • Carl de Boer

Presentation Overview:

Gene expression is regulated by transcription factors that work together to read cis-regulatory DNA sequences. A primary aim of my group is to decipher the “cis-regulatory code” - the rules that cells use to determine when, where, and how much genes should be expressed. While cis-regulation has proven to be exceedingly complex, recent advances in our ability to query the activity of DNA, combined with machine learning, have enabled significant progress towards deciphering this code. Here, I will focus on several of our recent efforts to improve cis-regulatory models. First, I will describe a recent DREAM Challenge, in which competitors from across the globe competed to create the best sequence-expression models using a dataset of random yeast promoter sequences and their experimentally determined expression levels, and which resulted in state-of-the-art model architectures, even for human cis-regulatory data. Next, I will describe an ongoing effort to make cis-regulatory models and evaluation tasks interoperable, streamlining model evaluation and enabling model comparison. Then, I will describe an alternate strategy for dividing the genome into training and test datasets, which substantially mitigates the homology-driven data leakage common in genome-trained models. Finally, I will give a perspective on where the field needs to go to crack the cis-regulatory code: namely, profiling the regulatory activities of non-genomic DNA sequences at very high throughput, and using these data to train models that understand genome regulation without ever having seen genomic sequences.

11:20-11:40
Interpreting Cis-Regulatory Interactions from Large-Scale Deep Neural Networks for Genomics
Confirmed Presenter: Peter Koo, Cold Spring Harbor Laboratory, United States

Room: 518
Format: In Person

Moderator(s): Shaun Mahony


Authors List:

  • Shushan Toneyan, Cold Spring Harbor Laboratory, United States
  • Peter Koo, Cold Spring Harbor Laboratory, United States

Presentation Overview:

The rise of large-scale, sequence-based deep neural networks (DNNs) for predicting gene expression has introduced challenges in their evaluation and interpretation. Current evaluations align DNN predictions with experimental perturbation assays, which provide insights into the generalization capabilities within the studied loci but offer a limited perspective of what drives their predictions. Moreover, existing model explainability tools focus mainly on motif analysis, which becomes complex when interpreting longer sequences. Here we introduce CREME, an in silico perturbation toolkit that interrogates large-scale DNNs to uncover the rules of gene regulation that they learn. Using CREME, we investigate Enformer, a prominent DNN in gene expression prediction, revealing cis-regulatory elements (CREs) that directly enhance or silence target genes. We explore the intricate complexity of higher-order CRE interactions, the effect of CRE distance from the transcription start site on gene expression, as well as the biochemical features of enhancers and silencers learned by Enformer. Moreover, we demonstrate the flexibility of CREME to efficiently uncover a higher-resolution view of functional sequence elements within CREs. This work demonstrates how CREME can be employed to translate the powerful predictions of large-scale DNNs to study open questions in gene regulation.
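
To illustrate the general idea of an in silico perturbation experiment described in this abstract, here is a minimal Python sketch. It is not CREME's API or its actual tests: the function names, the use of plain random shuffling as the perturbation, and the toy motif-counting predictor standing in for a large DNN such as Enformer are all assumptions made for illustration.

```python
import random


def shuffle_tile(seq, start, end, rng):
    """Return seq with the bases in seq[start:end] randomly shuffled."""
    tile = list(seq[start:end])
    rng.shuffle(tile)
    return seq[:start] + "".join(tile) + seq[end:]


def tile_effect(predict, seq, start, end, n_shuffles=20, seed=0):
    """Mean change in the model output after shuffling one tile.

    A negative value suggests the tile acts like an enhancer in the model,
    a positive value suggests a silencer-like element.
    """
    rng = random.Random(seed)
    baseline = predict(seq)
    perturbed = [
        predict(shuffle_tile(seq, start, end, rng)) for _ in range(n_shuffles)
    ]
    return sum(perturbed) / n_shuffles - baseline


def toy_predict(seq):
    """Hypothetical stand-in for a trained DNN: counts a made-up motif."""
    return float(seq.count("GATAAG"))


seq = "C" * 20 + "GATAAG" + "C" * 20
effect_motif = tile_effect(toy_predict, seq, 20, 26)      # tile holding the motif
effect_background = tile_effect(toy_predict, seq, 0, 10)  # motif-free tile
```

In this toy setting, shuffling the motif-free tile leaves the prediction unchanged, while shuffling the tile that holds the motif lowers it; with a real model, `toy_predict` would be replaced by the model's predicted activity at the gene of interest.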

11:40-12:00
Chromatin accessibility is driven by intra-nucleosomal pioneer cooperativity that includes low affinity motifs
Confirmed Presenter: Melanie Weilert, Stowers Institute for Medical Research, United States

Room: 518
Format: In Person

Moderator(s): Shaun Mahony


Authors List:

  • Melanie Weilert, Stowers Institute for Medical Research, United States
  • Kaelan Brennan, Stowers Institute for Medical Research, United States
  • Khyati Dalal, Stowers Institute for Medical Research, United States
  • Sabrina Krueger, European Molecular Biology Laboratory, Germany
  • Charles McAnany, Stowers Institute for Medical Research, United States
  • Yue Liang, Stowers Institute for Medical Research, United States
  • Julia Zeitlinger, Stowers Institute for Medical Research, United States

Presentation Overview:

The regulation of chromatin accessibility at cis-regulatory DNA sequences is a key rate-limiting step for enhancer activation and thus is an important element of the cis-regulatory code. Pioneer transcription factors (TFs) that induce nucleosome remodeling mediate chromatin opening, but the sequence rules by which pioneer or other TFs cooperate to make chromatin accessible are not well understood. To identify these sequence rules in an unbiased manner, we trained and interpreted BPNet-derived deep learning models that predict base-resolution TF binding data and bias-corrected chromatin accessibility data in mouse embryonic stem cells. By comparing the interpretations from both models, we can distinguish between TFs that are strong pioneers, weak pioneers and non-pioneers. Furthermore, we find that pioneering depends on low-affinity TF motifs, which increase in importance when they cooperate with other motifs. This reliance on cooperativity is observed to be important at low TF concentrations, when high-affinity motifs have a decreased effect on chromatin accessibility, confirming our model predictions. By probing the cooperativity in more detail, we find that it generally occurs at intra-nucleosomal distances, supporting a nucleosome-mediated mechanism of cooperativity. This highlights the ability of deep learning models to learn complex sequence rules, suggesting that widespread cooperativity and involvement of low affinity motifs could explain why the context-dependent function of pioneer TFs has been difficult to decipher.

12:00-12:20
Characterizing transcription factor binding with multi-omics sequence model
Confirmed Presenter: Fangxin Cai, University of Hong Kong, Hong Kong

Room: 518
Format: In Person

Moderator(s): Shaun Mahony


Authors List:

  • Fangxin Cai, University of Hong Kong, Hong Kong
  • Yuanhua Huang, University of Hong Kong, Hong Kong

Presentation Overview:

The linkage between transcription factors (TFs) and cis-regulatory regions (CREs) is crucial to understanding gene regulation. Conventionally, it is determined by a step-wise process: motif enrichment and correlation/regression-based analysis. As the presence of motifs does not always imply binding, and correlation analysis may miss low-expression TFs, this process can suffer from false positives and negatives. Here we propose a holistic model that takes joint single-cell RNA sequencing (scRNA-seq) data and single-cell assay for transposase-accessible chromatin sequencing (scATAC-seq) data to delineate TF-CRE linkage. Inspired by multi-omics factor analysis and sequence modeling, our model decomposes peaks’ accessibility into cell factors, encoded from TF expression, and peak factors, encoded from DNA sequences.
We demonstrate our model on an embryonic mouse brain dataset. Both modalities are accurately reconstructed on held-out cells and sequences. Cell factors preserve cell type distinction and trajectory structure, while sequence factors moderately localize some motifs, such as those of Neurod2 and Sox11, indicating that regulatory information is captured.
To delineate TF-CRE linkage, we take gradients with respect to the two inputs. High gradient times TF expression values (gradTF) are assigned to high-correlation TF-CRE pairs, whereas low-correlation, high gradTF pairs may correspond to low-expression TFs, though systematic evaluation remains to be done. As an example, Runx1, a low-expression TF, correlates poorly with almost all peaks’ accessibility; however, its potential target CREs (compiled from ChIP-Atlas) have a higher absolute gradTF. On the other hand, gradient times sequence (gradSeq) highlights regulatory motifs.
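
The gradient-times-input scores mentioned in this abstract (gradTF, gradSeq) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the linear toy model, its weights, and the finite-difference gradient are hypothetical stand-ins, not the authors' model or code, which would normally obtain gradients by automatic differentiation.

```python
def numerical_gradient(f, x, eps=1e-6):
    """Central-difference estimate of df/dx_i for each input x_i."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        up[i] += eps
        down = list(x)
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def grad_times_input(f, x):
    """Attribution score per input: gradient multiplied by the input value."""
    return [g * xi for g, xi in zip(numerical_gradient(f, x), x)]


# Hypothetical stand-in for a trained model: accessibility of one peak as a
# weighted sum of the expression of three TFs.
weights = [2.0, -1.0, 0.0]


def toy_accessibility(tf_expression):
    return sum(w * e for w, e in zip(weights, tf_expression))


tf_expression = [0.5, 1.0, 3.0]
grad_tf = grad_times_input(toy_accessibility, tf_expression)
# grad_tf is approximately [1.0, -1.0, 0.0]: the third TF is highly
# expressed but has no influence on this peak in the toy model.
```

The same recipe applied to a one-hot encoded DNA sequence input, rather than TF expression, gives a gradSeq-style score per base.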

Protein Language Models improve the target prediction of nucleic acid-binding proteins
Confirmed Presenter: Cyrus Tam, Memorial Sloan Kettering Cancer Center, United States

Room: 518
Format: In Person

Moderator(s): Shaun Mahony


Authors List:

  • Cyrus Tam, Memorial Sloan Kettering Cancer Center, United States
  • Ilyes Baali, Memorial Sloan Kettering Cancer Center, United States
  • Kaitlin Laverty, Memorial Sloan Kettering Cancer Center, United States
  • Taykhoom Dalal, Memorial Sloan Kettering Cancer Center, United States
  • Debashish Ray, University of Toronto, Canada
  • Alexander Sasse, University of Washington, United States
  • Woojeong Kim, Cornell University, United States
  • Alexander Rush, Cornell University, United States
  • Matthew Weirauch, University of Cincinnati College of Medicine, United States
  • Timothy Hughes, University of Toronto, Canada
  • Quaid Morris, Memorial Sloan Kettering Cancer Center, United States

Presentation Overview:

Unraveling the DNA and RNA-binding preferences of regulatory proteins, like transcription and splicing factors, is important for understanding gene regulatory networks. In vitro binding assays, including Protein Binding Microarrays (PBMs) and RNAcompete, have been conducted for hundreds of nucleic-acid-binding proteins (NBPs) and provide training data for homology models that can predict the binding preferences of unmeasured NBPs. However, to date, these homology models have used simple rules to predict motifs.

Protein Language Models (PLMs) have emerged as effective models for downstream protein property prediction; however, their value in predicting protein-ligand interactions is less clear. To evaluate this, we extracted diverse NBP representations from four PLMs (AlphaFold2, AminoBERT, ESM-2, ProteinBERT) and compared their performance against baseline representations when used as inputs for different target prediction models, including unsupervised methods and neural networks.

Extensive evaluations across diverse datasets revealed that PLM-learned representations consistently outperformed baseline methods. Further analysis demonstrated the particular value of PLM-learned representations for proteins with distant homologs. Feature attribution analyses demonstrated that PLM-learned representations capture global and local structural properties, showcasing their efficacy in predicting binding preferences.

ESM-2 emerged as a top performer across predictive models. We further evaluated its performance in predicting the targets of unmeasured RBPs and unmeasured RNAs after fine-tuning. By introducing new tokens for nucleic acids and using concatenated RBP-RNA sequences as inputs, we demonstrated that a fine-tuned ESM-2 model matches the SOTA approach in generalizing to unseen RNA sequences and outperforms the SOTA approach in generalizing to unseen RBPs.

DNA language models reveal the architecture of nucleotide dependencies in genomes
Confirmed Presenter: Pedro Tomaz da Silva, Technical University of Munich, Germany

Room: 518
Format: In Person

Moderator(s): Shaun Mahony


Authors List:

  • Pedro Tomaz da Silva, Technical University of Munich, Germany
  • Alexander Karollus, Technical University of Munich, Germany
  • Johannes Hingerl, Technical University of Munich, Germany
  • Xavier Hernandez-Alias, Mechanisms of Protein Biogenesis, Max Planck Institute of Biochemistry, Germany
  • Gihanna Galindez, Technical University of Munich, Germany
  • Nils Wagner, Technical University of Munich, Germany
  • Danny Incarnato, University of Groningen, Netherlands
  • Julien Gagneur, Technical University of Munich, Germany

Presentation Overview:

While the genome is composed of individual nucleotides, functional elements such as cis-regulatory elements and structural interactions are formed from sets of interdependent nucleotides. In principle, these dependencies are reflected in coevolutionary relationships. However, their detection beyond coding sequences is challenging with classical approaches.

DNA language models (LMs), which are trained by predicting nucleotides given their sequence context, have recently been proposed as foundational models for sequence-based prediction problems. DNA LMs implicitly capture functional elements from genomic sequences alone. However, which dependencies DNA LMs learn and whether they reflect known or even novel biology remains an open question.

Here we introduce nucleotide dependency maps to systematically study nucleotide dependencies captured by DNA LMs in a purely unsupervised setup.
We compute these maps genome-wide and show that they reveal and clearly delineate known functional genomic features such as transcription factor binding motifs, functional interactions between splice sites, RNA tertiary structures, and coding sequences. Additionally, we uncover novel and conserved dependency structures suitable for experimental validation.
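
As a rough illustration of what a nucleotide dependency map could look like, the Python sketch below mutates each position in turn and records the largest change in the log-odds a model assigns to the nucleotides at every other position. The abstract does not give the exact score, so this definition is an assumption, and `toy_lm` is a hypothetical stand-in for a trained DNA LM in which only the first and last bases depend on each other.

```python
import math

BASES = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def toy_lm(seq):
    """Hypothetical DNA LM: per-position probabilities over A, C, G, T.

    The first and last positions are modeled as a base pair (each favors
    the complement of the other); all other positions are uniform.
    """
    n = len(seq)
    probs = [[0.25] * 4 for _ in range(n)]
    for j, partner in ((0, n - 1), (n - 1, 0)):
        favored = COMPLEMENT[seq[partner]]
        probs[j] = [0.85 if b == favored else 0.05 for b in BASES]
    return probs


def log_odds(p):
    return math.log(p / (1 - p))


def dependency_map(lm, seq):
    """dep[i][j]: largest log-odds change at position j over substitutions at i."""
    n = len(seq)
    ref = lm(seq)
    dep = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for alt in BASES:
            if alt == seq[i]:
                continue
            mutated = seq[:i] + alt + seq[i + 1:]
            alt_probs = lm(mutated)
            for j in range(n):
                if j == i:
                    continue
                change = max(
                    abs(log_odds(a) - log_odds(r))
                    for a, r in zip(alt_probs[j], ref[j])
                )
                dep[i][j] = max(dep[i][j], change)
    return dep


seq = "ACGTACGT"
dep = dependency_map(toy_lm, seq)
# Only dep[0][7] and dep[7][0] are non-zero, recovering the built-in pairing.
```

With a real DNA LM in place of `toy_lm`, off-diagonal structure in such a map would point to interdependent positions such as paired bases or motif cores.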

We furthermore investigate dependency maps from in silico manipulated sequences, revealing the ability of DNA LMs to capture operations such as copying and reverse complementarity without memorization.

Lastly, we compare dependency maps from openly available DNA LMs, showcasing the drawbacks and advantages of different models. We find stark differences in the ability of models to accurately learn conserved but infrequent features.

Altogether, by leveraging the flexibility of DNA language models, nucleotide dependency mapping emerges as a general methodology to discover and study functional interactions in genomes.

LoopHunter: Enhancing Chromatin Loop Annotation by Focusing on Larger Regions in Hi-C Data
Confirmed Presenter: Yusen Hou, The Hong Kong University of Science and Technology (Guangzhou), China

Room: 518
Format: In Person

Moderator(s): Shaun Mahony


Authors List:

  • Yusen Hou, The Hong Kong University of Science and Technology (Guangzhou), China
  • Yanlin Zhang, The Hong Kong University of Science and Technology (Guangzhou), China

Presentation Overview:

Chromatin loops, which bring distant loci into close contact, play a crucial role in gene expression and regulation. Although several methods have been developed for annotating loops from Hi-C contact maps, these methods remain unsatisfactory, particularly in accurately identifying loops from low coverage or single-cell Hi-C contact maps at high resolutions. Chromatin loops manifest as small blob-shaped patterns on Hi-C contact maps, encouraging existing tools to focus on analyzing contact pairs within small areas, such as a 21x21 window. However, these blob-shaped patterns are often indistinct in sparse regions, providing insufficient data for precise loop detection. Meanwhile, many chromatin loops exhibit broader patterns, including stripes, particularly in loops associated with the formation of Topologically Associating Domains (TADs), which current tools largely ignore. In this study, we introduce LoopHunter, an axial attention-based deep learning model to annotate loops from Hi-C contact maps at high resolutions across various coverages. LoopHunter utilizes a 224x224 sub-matrix as input and employs a combination of axial attention transformer and convolutional blocks to capture multi-scale data, facilitating robust loop prediction within the input region. Unlike traditional approaches that focus only on the center of the input matrix, we propose to train LoopHunter based on knowledge distillation, enabling it to make dense predictions. Our comparisons of LoopHunter against alternative tools demonstrate that LoopHunter significantly enhances loop annotation across both low and high coverage Hi-C contact maps.

A systematic comparison of Machine learning methods for the prediction of enhancer-gene interactions from epigenomic data
Confirmed Presenter: Shamim Ashrafiyan, Goethe University Frankfurt, Germany

Room: 518
Format: In Person

Moderator(s): Shaun Mahony


Authors List:

  • Fatemeh Behjati Ardakani, Goethe University Frankfurt, Germany
  • Shamim Ashrafiyan, Goethe University Frankfurt, Germany
  • Dennis Hecker, Goethe University Frankfurt, Germany
  • Laura Rumpf, Goethe University Frankfurt, Germany
  • Marcel Schulz, Goethe University, Germany

Presentation Overview:

Understanding the complex interaction between histone modifications, enhancers, and gene regulation is pivotal in deciphering the mechanisms governing cellular identity and function. This study investigates the critical task of predicting enhancer-gene interactions, essential for unraveling non-coding variation and DNA-binding factor-mediated gene regulation. Leveraging the comprehensive EpiATLAS dataset, encompassing high-quality histone ChIP-seq and RNA-seq data from a wide variety of cell types curated by IHEC, we embarked on a systematic comparison of various machine learning methods tailored to gene-specific prediction of gene expression from epigenome data.
Our investigation extends beyond traditional approaches by incorporating the large EpiATLAS dataset and exploring different state-of-the-art machine learning methods. Notably, we optimized novel Convolutional Neural Network (CNN) and Multi-Layer Perceptron (MLP) architectures, as well as Random Forest-based (RF) methods, in comparison to established linear models. By harnessing H3K27ac histone mark signatures within megabase genomic windows surrounding each gene, our models, especially RF and CNN, demonstrated exceptional performance in predicting gene expression. Many different aspects of a gene, such as gene structure and expression variance across cell types, dictate the success of building an accurate model.
Through comprehensive validation using CRISPRi screens and eQTL data, we investigate the efficacy of the learned models in predicting enhancer-gene interactions using an in silico perturbation setup.
In summary, our work offers a comprehensive framework for understanding enhancer-mediated gene regulation, supported by rigorous validation methods. These findings provide valuable insights into the regulatory landscape of the human genome, advancing our understanding of cellular function and disease mechanisms.

Q&A for Flash Talks
Room: 518
Format: In Person

Moderator(s): Shaun Mahony


14:20-15:00
Invited Presentation: Integrative modeling of multiscale single-cell spatial epigenome
Confirmed Presenter: Jian Ma

Room: 518
Format: In Person

Moderator(s): Ferhat Ay


Authors List:

  • Jian Ma

Presentation Overview:

Despite significant advancements in high-throughput data acquisition in genomics and cell biology, our understanding of the diverse cell types within the human body remains limited. In particular, the principles governing intracellular molecular spatial organization and interaction, as well as cellular spatial organization within complex tissues, are still largely unclear. A major challenge lies in developing computational methods capable of integrating heterogeneous and multiscale molecular, cellular, and tissue information. In this talk, I will discuss our recent work on creating integrative approaches to advance regulatory genomics using single-cell spatial epigenomics. These methods hold the potential to reveal new insights into fundamental genome structure, gene regulation, and cellular function within complex tissues, across a wide range of biological contexts in both health and disease.

15:00-15:20
Proceedings Presentation: Enhancing Hi-C contact matrices for loop detection with Capricorn, a multi-view diffusion model
Confirmed Presenter: William Noble, University of Washington, United States

Room: 518
Format: In Person

Moderator(s): Ferhat Ay


Authors List:

  • Tangqi Fang, University of Washington, United States
  • Yifeng Liu, University of Washington, United States
  • Addie Woicik, University of Washington, United States
  • Minsi Lu, University of Washington, United States
  • Anupama Jha, University of Washington, United States
  • Xiao Wang, Purdue University, United States
  • Gang Li, University of Washington, United States
  • Borislav Hristov, University of Washington, United States
  • Zixuan Liu, University of Washington, United States
  • Hanwen Xu, University of Washington, United States
  • William Noble, University of Washington, United States
  • Sheng Wang, University of Washington, United States

Presentation Overview:

Motivation: High-resolution Hi-C contact matrices reveal the detailed three-dimensional architecture of the genome, but high-coverage experimental Hi-C data are expensive to generate. On the other hand, chromatin structure analyses struggle with extremely sparse contact matrices. To address this problem, computational methods to enhance low-coverage contact matrices have been developed, but existing methods are largely based on resolution enhancement methods for natural images and hence often employ models that do not distinguish between biologically meaningful contacts, such as loops, and other stochastic contacts.

Results: We present Capricorn, a machine learning model for Hi-C resolution enhancement that incorporates small-scale chromatin features as additional views of the input Hi-C contact matrix and leverages a diffusion probability model backbone to generate a high-coverage matrix. We show that Capricorn outperforms the state of the art in a cross-cell-line setting, improving on existing methods by 17% in mean squared error and 26% in F1 score for chromatin loop identification from the generated high-coverage data. We demonstrate that Capricorn performs well in the cross-chromosome setting and cross-chromosome, cross-cell-line setting. We further show that our multi-view idea can also be used to improve several existing methods, HiCARN and HiCNN, indicating the wide applicability of this approach. Finally, we use DNA sequence to validate discovered loops and find that the fraction of CTCF-supported loops from Capricorn is similar to those identified from the high-coverage data. Capricorn is a powerful Hi-C resolution enhancement method that enables scientists to find chromatin features that cannot be identified in the low-coverage contact matrix.

15:20-15:40
Ultra-long-range and interchromosomal loops link T cell superenhancers
Confirmed Presenter: Gabriel Dolsten, Princeton, United States

Room: 518
Format: In Person

Moderator(s): Ferhat Ay


Authors List:

  • Gabriel Dolsten, Princeton, United States
  • Zhong-Min Wang, Memorial Sloan Kettering Cancer Center, United States
  • Xiao Huang, Memorial Sloan Kettering Cancer Center, United States
  • Anthony Michaels, Memorial Sloan Kettering Cancer Center, United States
  • Mike Wilson, Princeton, United States
  • Susie Song, Princeton, United States
  • Aaron Viny, Columbia University, United States
  • Alexander Rudensky, Memorial Sloan Kettering Cancer Center, United States
  • Yuri Pritykin, Princeton, United States

Presentation Overview:

Functional enhancer-promoter interactions are typically thought to occur at distances less than two megabases. To explore the role of long-range regulatory interactions, we generated two Hi-C libraries in regulatory (Treg) and conventional (Tcon) CD4+ T cells. We found that interactions beyond 10Mb dramatically improved prediction of gene expression from Hi-C, suggesting that long-range interactions may play an important role in gene regulation. To analyze the role of long-range interactions, we examined differential contact frequency between Treg and Tcon genome-wide. This analysis revealed 78,089 differential interactions at distances greater than two megabases. Differential long-range contact was especially common at critical T cell genes regulated by superenhancers, such as Ikzf2. These interactions often presented as focal contacts (“megaloops”), such as the Treg-specific 9Mb megaloop between Ikzf2 and Ctla4. A second 18Mb megaloop between Ikzf2 and Arl4c was confirmed by DNA-FISH. We developed a novel algorithm and package, LONGSHOT, to find megaloops and identified 33,791 intrachromosomal and 23,003 interchromosomal megaloops in the T cell connectome. Clustering of megaloops revealed three distinct interchromosomal megaloop hubs. Two of the hubs were highly enriched for superenhancers, capturing 50% of all Treg cell superenhancers. Analysis of a published Hi-C dataset with an Ets1 superenhancer knockout revealed changes in megalooping after superenhancer knockout and changes in gene expression at the megalooped sites. Together, these results suggest that ultra-long-range chromatin contacts, partly mediated by superenhancers, are an important component of T cell gene regulation.

15:40-16:00
Proceedings Presentation: scGrapHiC: Deep learning-based graph deconvolution for Hi-C using single cell gene expression
Confirmed Presenter: Ghulam Murtaza, Department of Computer Science, Brown University, United States

Room: 518
Format: In Person

Moderator(s): Ferhat Ay


Authors List:

  • Ghulam Murtaza, Department of Computer Science, Brown University, United States
  • Byron Butaney, Department of Computer Science, Brown University, United States
  • Justin Wagner, Material Measurement Laboratory, National Institute of Standards and Technology, United States
  • Ritambhara Singh, Department of Computer Science and Center for Computational Molecular Biology, Brown University, United States

Presentation Overview:

The single-cell Hi-C (scHi-C) protocol helps identify cell-type-specific chromatin interactions and sheds light on cell differentiation and disease progression. Despite providing crucial insights, scHi-C data is often underutilized due to the high cost and the complexity of the experimental protocol. We present a deep learning framework, scGrapHiC, that predicts pseudo-bulk scHi-C contact maps using pseudo-bulk scRNA-seq data. Specifically, scGrapHiC performs graph deconvolution to extract genome-wide single-cell interactions from a bulk Hi-C contact map using scRNA-seq as a guiding signal. Our evaluations show that scGrapHiC, trained on 7 cell-type co-assay datasets, outperforms typical sequence encoder approaches. For example, scGrapHiC achieves a substantial improvement of 23.2% in recovering cell-type-specific Topologically Associating Domains over the baselines. It also generalizes to unseen embryo and brain tissue samples. scGrapHiC is a novel method to generate cell-type-specific scHi-C contact maps using widely available genomic signals that enables the study of cell-type-specific chromatin interactions.

16:40-17:00
Cross-species and tissue imputation of species-level DNA methylation samples across mammalian species
Confirmed Presenter: Emily Maciejewski, UCLA, United States

Room: 518
Format: In Person

Moderator(s): Shamim Mollah


Authors List:

  • Emily Maciejewski, UCLA, United States
  • Steve Horvath, Altos Labs, United Kingdom
  • Jason Ernst, University of California, Los Angeles, United States

Presentation Overview:

DNA methylation data offers valuable insights into various aspects of mammalian biology. However, the availability of such data for many mammals has been historically limited due to a lack of applicable microarrays in species other than human and mouse. The recent introduction and large-scale application of the mammalian methylation array has significantly expanded the availability of such data across conserved sites in many mammalian species. In our study, we consider 13,245 samples profiled on this array encompassing 348 species and 59 tissues from 746 species-tissue combinations. While having some coverage of many different species and tissue types, this data captures only 3.6% of potential species-tissue combinations. To address this gap, we developed CMImpute (Cross-species Methylation Imputation) which uses a Conditional Variational Autoencoder (CVAE), a conditional generative model implemented via neural networks, to impute DNA methylation of non-profiled species-tissue combinations. CMImpute specifically conditions the CVAE on species and tissue labels, allowing for direct control over the combination to be imputed. In cross-validation, we demonstrate that CMImpute achieves a strong correlation with actual observed values, surpassing several baseline methods in terms of agreement across methylation array probes with a mean correlation of 0.92 and across samples with a mean correlation of 0.69. Using CMImpute we imputed methylation data for 19,786 new species-tissue combinations representing the remaining 96.4% of potential combinations. We believe that both CMImpute and our imputed data resource will be useful for DNA methylation analyses across a wide range of mammalian species.
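
The coverage figures quoted in this abstract can be checked with a few lines of Python. The sketch assumes that the "potential species-tissue combinations" are every one of the 348 species paired with every one of the 59 tissues; the abstract does not state this explicitly, but the quoted numbers are consistent with it.

```python
# Counts quoted in the abstract.
n_species = 348
n_tissues = 59
n_profiled = 746  # species-tissue combinations with measured data

# Assumption: every species paired with every tissue is a potential combination.
n_possible = n_species * n_tissues
n_missing = n_possible - n_profiled  # combinations left to impute

profiled_pct = round(100 * n_profiled / n_possible, 1)
missing_pct = round(100 * n_missing / n_possible, 1)

print(n_possible, n_missing, profiled_pct, missing_pct)
# prints: 20532 19786 3.6 96.4
```

Under that assumption, the 19,786 imputed combinations and the 3.6% / 96.4% split reported in the abstract follow directly from the species, tissue, and profiled-combination counts.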

17:00-17:20
Ontology-aware prediction of tissue-specific DNA methylation
Confirmed Presenter: Mirae Kim, Rice University, United States

Room: 518
Format: In Person

Moderator(s): Shamim Mollah


Authors List:

  • Mirae Kim, Rice University, United States
  • Ruth Dannenfelser, Rice University, United States
  • Yufei Cui, Massachusetts Institute of Technology, United States
  • Vicky Yao, Rice University, United States

Presentation Overview:

DNA methylation (DNAm) has shown tremendous potential in distinguishing physiological states such as aging and cancer progression, and epigenetic clocks, in particular, have had far-reaching applications. Though DNAm is also highly tissue-specific, no pan-tissue classifier currently exists. Here, we manually curate 3,145 healthy human DNA methylation samples across 116 studies spanning 50 tissue types and pair this data compendium with a novel framework that combines Minipatch feature selection with ontology-aware classification. Through this study, we identify a minimal set of 741 CpG sites that can accurately distinguish between different tissue types. A deeper examination of the CpG sites also reveals biological mechanisms that underpin the tissue-specificity of DNA methylation. Furthermore, we demonstrate that this ontology-aware learning structure enables effective zero-shot learning for tissues not seen in training.

17:20-18:00
Invited Presentation: Virtual ChIP-seq: predicting transcription factor binding by learning from the transcriptome
Confirmed Presenter: Michael Hoffman, University Health Network/University of Toronto, Canada

Room: 518
Format: In Person

Moderator(s): Shamim Mollah


Authors List:

  • Michael Hoffman, University Health Network/University of Toronto, Canada

Presentation Overview:

We will discuss a new method, Virtual ChIP-seq, which predicts binding of individual transcription factors in new cell types using an artificial neural network that integrates ChIP-seq results from other cell types and chromatin accessibility data in the new cell type. Virtual ChIP-seq also uses learned associations between gene expression and transcription factor binding at specific genomic regions. This approach outperforms methods that use transcription factor sequence preferences in the form of position weight matrices, predicting binding for 33 transcription factors.

Sunday, July 14th
10:40-11:20
Invited Presentation: Single-cell and single-molecule computational epigenomics
Confirmed Presenter: Maria Colomé Tatché

Room: 518
Format: In Person

Moderator(s): Alejandra Medina-Rivera


Authors List:

  • Maria Colomé Tatché

Presentation Overview:

Recent breakthroughs in high-throughput sequencing of single cells are revolutionizing the biological and biomedical sector. Among the different -omics layers that can be measured at the single-cell level, single-cell epigenomic measurements present a rich layer of regulatory information that stands between the genome and the transcriptome. These measurements can be obtained for large heterogeneous samples of single cells to profile tissues, organs and whole organisms, and to study dynamic processes like cellular differentiation, reprogramming or cancer evolution. These data types provide an unprecedented level of measurement resolution.
In this talk I will discuss how single-cell ATAC-seq and single-cell DNA methylation data can be used to study cell identity [1,2]. I will introduce and compare multiple feature space constructions for epigenetic data analysis and show the feasibility of common clustering, dimension reduction, batch integration and trajectory learning techniques for both single-cell DNA methylation data and scATAC-seq data.
Studying single-cell DNA methylation heterogeneity using single-cell DNA methylation measurements is, however, complicated, as experimental protocols are costly and difficult to implement. I will present an alternative strategy, which involves MinION sequencing combined with deconvolution of single-molecule methylation signals to reconstruct cell-type methylation profiles. I will show how, using this method, it is possible to deconvolve the methylomes of different cell types from an in-silico mix of cells.
Another level of genomic information that can be extracted from single-cell data is single-cell copy number variation (CNV). I will present a novel algorithm, epiAneufinder [3], which exploits the read count information from scATAC-seq data to extract genome-wide CNVs for individual cells, and I will show how the obtained CNVs are comparable to the ones obtained from single-cell whole genome sequencing data. Thanks to epiAneufinder it is therefore possible to add a relevant extra layer of genomic information, namely single-cell copy number variation, to every scATAC-seq dataset without the need for additional experiments.

[1] A. Danese, M.L. Richter, D.S. Fischer, F.J. Theis and M. Colomé-Tatché*. EpiScanpy: integrated single-cell epigenomic analysis. Nature Communications, 12, 5228 (2021).
[2] M.D. Luecken, M. Büttner, K. Chaichoompu, A. Danese, M. Interlandi, M.F. Mueller, D.C. Strobl, L. Zappia, M. Dugas, M. Colomé-Tatché*, F.J. Theis*. Benchmarking atlas-level data integration in single-cell genomics. Nat. Methods 19, 41–50 (2022).
[3] A. Ramakrishnan, A. Symeonidi, P. Hanel, K. T. Schmid, M. L. Richter, M. Schubert, M. Colomé-Tatché*. epiAneufinder identifies copy number alterations from single-cell ATAC-seq data. Nat. Commun. 14, 5846 (2023).

11:20-11:40
Proceedings Presentation: REUNION: transcription factor binding prediction and regulatory association inference from single-cell multi-omics data
Confirmed Presenter: Yang Yang, Memorial Sloan Kettering Cancer Center, Howard Hughes Medical Institute, United States

Room: 518
Format: Live Stream

Moderator(s): Alejandra Medina-Rivera


Authors List: Show

  • Yang Yang, Memorial Sloan Kettering Cancer Center, Howard Hughes Medical Institute, United States
  • Dana Pe'er, Memorial Sloan Kettering Cancer Center, Howard Hughes Medical Institute, United States

Presentation Overview: Show

Motivation: Profiling of gene expression and chromatin accessibility by single-cell multi-omics approaches can help to systematically decipher how transcription factors (TFs) regulate target gene expression via cis-region interactions. However, integrating information from different modalities to discover regulatory associations is challenging, in part because motif scanning approaches miss many likely TF binding sites.
Results: We develop REUNION, a framework for predicting genome-wide TF binding and cis-region-TF-gene “triplet” regulatory associations using single-cell multi-omics data. The first component of REUNION, Unify, utilizes information theory-inspired complementary score functions that incorporate TF expression, chromatin accessibility, and target gene expression to identify regulatory associations. The second component, Rediscover, takes Unify estimates as input for pseudo semi-supervised learning to predict TF binding in accessible genomic regions that may or may not include detected TF motifs. Rediscover leverages latent chromatin accessibility and sequence feature spaces of the genomic regions, without requiring chromatin immunoprecipitation data for model training. Applied to peripheral blood mononuclear cell data, REUNION outperforms alternative methods in TF binding prediction on average. In particular, it recovers missing region-TF associations from regions lacking detected motifs, which circumvents the reliance on motif scanning and facilitates discovery of novel associations involving potential co-binding transcriptional regulators. Newly identified region-TF associations, even in regions lacking a detected motif, improve the prediction of target gene expression in regulatory triplets, and are thus likely to genuinely participate in the regulation.
Availability and implementation: All source code is available at https://github.com/yangymargaret/REUNION.

11:40-12:00
scHOCMO: Higher Order Correlation Model for Single-cell Multi-omics
Confirmed Presenter: Reetika Ghag, Washington University in St. Louis School of Medicine, United States

Room: 518
Format: In Person

Moderator(s): Alejandra Medina-Rivera


Authors List: Show

  • Reetika Ghag, Washington University in St. Louis School of Medicine, United States
  • Shamim Mollah, Washington University in St. Louis School of Medicine, United States

Presentation Overview: Show

Single-cell technologies enable system level interrogations across several molecular layers at a single-cell resolution. Current single-cell technologies incorporate massive parallelism enabling high-throughput joint profiling of various modalities on a cell. Multi-modal single-cell data can be further integrated to understand the causal relationships among the several molecular layers driving regulatory mechanisms in disease progression. However, the heterogeneity introduced by various modalities and their feature spaces makes it challenging to unify data into a single inference framework. Here, we propose a novel method, scHOCMO (Higher Order Correlation Model for Single cell multiomics), to address the scalability and generalizability challenges in the existing single-cell multimodal data integration methods. We extend the previously developed tensor-based HOCMO (Higher Order Correlation Model) to improve the scalability of single-cell data analysis from the 10^7 to the 10^13 scale using the Trillion-Tensor framework. We illustrate our method using the single-nucleus RNA seq (sn-RNA) and single-nucleus ATAC seq (sn-ATAC) data for Diabetic Kidney Disease. Using differentially expressed genes, we aim to elucidate the regulatory dynamics of disease progression based on these disease specific marker genes and cell types.

12:00-12:20
Pan-cell type continuous chromatin state annotation of all IHEC epigenomes
Confirmed Presenter: Habib Daneshpajouh, Simon Fraser University, Canada

Room: 518
Format: In person

Moderator(s): Alejandra Medina-Rivera


Authors List: Show

  • Habib Daneshpajouh, Simon Fraser University, Canada
  • Kay C. Wiese, Simon Fraser University, Canada
  • Maxwell W. Libbrecht, Simon Fraser University, Canada

Presentation Overview: Show

Understanding the mechanistic basis of genetic disease requires annotating the regulatory elements in the human genome. To this end, international consortia such as IHEC, ENCODE, and Roadmap Epigenomics have generated thousands of epigenomic datasets such as ChIP-seq, DNase-seq, and ATAC-seq that measure various biochemical activities in the genome, including transcription factor binding, histone modification, and DNA accessibility. Currently, the predominant methods for integrating these data sets to annotate regulatory elements are segmentation and genome annotation (SAGA) algorithms such as ChromHMM and Segway. SAGA algorithms partition the genome and assign a chromatin state label to each segment, indicating the epigenetic activity at that position. To alleviate the limitations of the discrete SAGA framework, we recently developed epigenome-ssm, a method that produces a vector of continuous chromatin state features at each position that summarizes epigenetic activity. Unlike discrete labels, these continuous features can easily represent varying strengths of a given element and can represent combinatorial elements with multiple types of activities. Here, we present a continuous chromatin state feature map generated using epigenome-ssm on 9,539 genome-wide signal tracks from six core histone modification assays across 1,698 epigenomes. We show that these feature maps constitute an intuitive and visualizable summary of epigenomic data and enable accurate identification of mechanisms of disease association.

Automated and genome-scale exploration of the cis-regulatory code involved in neuronal differentiation
Confirmed Presenter: Océane Cassan, LIRMM, Univ Montpellier, CNRS, Montpellier, France

Room: 518
Format: In person

Moderator(s): Alejandra Medina-Rivera


Authors List: Show

  • Océane Cassan, LIRMM, Univ Montpellier, CNRS, Montpellier, France
  • Christophe Vroland, Institut de Génétique Moléculaire de Montpellier, University of Montpellier, CNRS, Montpellier, France
  • Raynal Julien, Institut de Génétique Moléculaire de Montpellier, University of Montpellier, CNRS, Montpellier, France
  • Masaki Kato, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, Japan
  • Hazuki Takahashi, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, Japan
  • Takeya Kasukawa, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, Japan
  • Piero Carninci, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, Japan & Human Technopole, Milan, Italy
  • Chi Wai Yip, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, Japan
  • Laurent Bréhélin, LIRMM, Univ Montpellier, CNRS, Montpellier, France
  • Charles-Henri Lecellier, Institut de Génétique Moléculaire de Montpellier & LIRMM, Univ Montpellier, CNRS, Montpellier, France

Presentation Overview: Show

Gene expression is controlled by proximal and distal cis-regulatory elements (CREs), containing DNA motifs bound by various transcription factors (TFs). Other sequence features, such as specific k-mers or low complexity regions, have also been implicated.
However, in a dynamic biological process such as cell differentiation, we lack an understanding of how the transcriptional activity of CREs progressively changes and what sequence features underlie these transitions, which may reflect common and/or coordinated regulatory processes.
Here, we use single-nucleus ATAC-seq with single-cell 5’ RNA-seq to follow, at a genome scale, CREs along differentiation of induced pluripotent stem cells into cortical neurons. We propose a guided clustering algorithm, STOIC (Statistical learning To Optimize Integrative Clustering) that jointly learns the different CRE clusters and their distinctive sequence-level features using an interpretable machine learning approach.
This procedure explores the expression space and delineates the CRE clusters iteratively in order to optimize the performance of a supervised classifier predicting CRE cluster membership based on DNA sequence features.
We show that STOIC provides more predictive sequence-level features than a standard k-means clustering. Furthermore, orthogonal chromatin and TF binding data collected in the same settings are used to validate the inferred CRE clusters and their sequence features, and to associate them with specific enhancer or promoter signatures and biological processes. Our results explore the complexity of the cis-regulatory code at the genome scale and provide an updated perspective on the transcriptional regulation at play during neuronal differentiation.

Expanding GTEx dataset with brain ontology-based graph neural networks to investigate genetic impacts on brain diseases
Confirmed Presenter: Jianfeng Ke, University of Massachusetts Lowell, United States

Room: 518
Format: In person

Moderator(s): Alejandra Medina-Rivera


Authors List: Show

  • Jianfeng Ke, University of Massachusetts Lowell, United States
  • Rachel Melamed, University of Massachusetts Lowell, United States
  • Tingjian Ge, University of Massachusetts Lowell, United States

Presentation Overview: Show

The human brain, with its intricate network of diverse regions, profoundly influences disease development. The Genotype-Tissue Expression (GTEx) program gathered transcriptome data and matched genotype data from over three hundred post-mortem donors, which allows us to understand how genetic variation can impact gene expression in diverse regions. However, the GTEx dataset included only 13 brain regions and only 10% of subjects had all brain regions measured. Improving the completeness of gene expression data within the GTEx project has the potential to elucidate the impact of disease risk variants on gene regulation in crucial tissues relevant to disease development. A possible resource to address this issue is the Allen Human Brain Atlas dataset. It collected transcriptome data from post-mortem brain tissue samples from 6 individuals, covering over a hundred distinct brain subregions. Leveraging the Allen dataset, we proposed a graph neural network model based on an expert ontology describing a hierarchy of increasingly fine-grained brain regions. This Graph Ontology model can predict 103 subordinate or previously uncollected brain regions for subjects within the GTEx dataset. We showed that our model outperformed several existing multi-tissue imputation models. Our model extended the initial 13 GTEx regions to 103 subordinate regions, enabling us to explore how genetic variation represented in GTEx can impact diverse disease-relevant regions that were not originally covered by the GTEx. Our prediction results can serve as a foundation for future investigations into how specific genetic variations influence diseases by altering gene expression patterns across a wide range of brain regions.

Interpretable single-cell factor decomposition using sciRED
Confirmed Presenter: Delaram Pouyabahar, Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada

Room: 518
Format: In person

Moderator(s): Alejandra Medina-Rivera


Authors List: Show

  • Delaram Pouyabahar, Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada
  • Tallulah Andrews, Departments of Biochemistry and Computer Science, University of Western Ontario, London, Ontario, Canada
  • Gary Bader, Departments of Molecular Genetics and Computer Science, University of Toronto, Toronto, Ontario, Canada

Presentation Overview: Show

Single-cell RNA sequencing (scRNA-seq) enables the exploration of gene expression heterogeneity within large cell populations, arising from biological and technical factors. Inferring gene expression programs from scRNA-seq data is challenging due to noise, sparsity, and high dimensionality, challenges addressed by computational approaches like matrix factorization. Specialized factorization techniques tailored for scRNA-seq, such as glmPCA and cNMF, have emerged in recent years. However, the resulting factors must be manually interpreted. To address this gap, we developed sciRED, a tool to improve the interpretation of scRNA-seq factor analysis. sciRED implements a four-step approach to characterizing gene expression programs: (1) removing confounding effects and using rotations to maximize factor interpretability, (2) calculating association statistics to map factors to known covariates, (3) highlighting unexplained factors that may indicate hidden biological phenomena, and (4) determining the genes and biological processes represented by unexplained factors. We apply sciRED across diverse datasets including the scMixology benchmark dataset and four biological single-cell atlases. Specifically, we showcase its application in identifying cell identity programs and sex-specific variations in a kidney map, discerning strong and weak stimulation signals in a PBMC dataset, eliminating ambient RNA contamination in a rat liver atlas to unveil strain variations, and revealing the hidden biology, represented by a rare cell type signature and anatomical zonation gene programs, in the healthy human liver map. These demonstrate the utility of our approach on real datasets for characterizing intricate biological signals within scRNA-seq maps.

Accurate allocation of multi-mapped reads enables regulatory element analysis at repeats
Confirmed Presenter: Shaun Mahony, Penn State University, United States

Room: 518
Format: In person

Moderator(s): Alejandra Medina-Rivera


Authors List: Show

  • Alexis Morrissey, Penn State University, United States
  • Jeffrey Shi, Penn State University, United States
  • Daniela James, Penn State University, United States
  • Shaun Mahony, Penn State University, United States

Presentation Overview: Show

Transposable elements (TEs) and other repetitive regions have been shown to contain gene regulatory elements, including transcription factor binding sites. Unfortunately, regulatory elements harbored by repeats have proven difficult to characterize using short-read sequencing assays such as ChIP-seq or ATAC-seq, as most regulatory genomics analysis pipelines discard “multi-mapped” reads. To address this shortcoming, we developed Allo, a new approach to allocate multi-mapped reads in an efficient, accurate, and user-friendly manner. Allo combines probabilistic mapping of multi-mapped reads with a convolutional neural network that recognizes the read distribution features of potential peaks, offering enhanced accuracy in multi-mapping read assignment. To demonstrate Allo’s potential, we apply it to reanalyze almost 500 transcription factor ChIP-seq datasets from K562 cells. This analysis resulted in over 385,000 previously unidentified transcription factor binding sites in repetitive regions of the genome. We find that Allo is particularly beneficial in identifying ChIP-seq peaks at centromeres and in younger TEs. In particular, we find novel associations between particular TFs and the recently expanded SVA and ERVK transposon families. We also find that Allo has a striking ability to disambiguate multi-mapped reads at recently duplicated genes. Using Allo, we analyze how regulatory elements diverge at recently generated paralogous genes, enabling new regulatory insights at sites of recent evolutionary novelty that often get overlooked in regulatory genomics analyses. Finally, we demonstrate that TF binding sites harbored by repeats are particularly difficult for neural network-based methods to predict de novo, and we speculate on approaches that can offer improved performance.

Q&A for Flash Talks
Room: 518
Format: In person

Moderator(s): Alejandra Medina-Rivera



14:20-14:40
Proceedings Presentation: A count-based model for delineating cell-cell interactions in spatial transcriptomics data
Confirmed Presenter: Hirak Sarkar, Princeton University, United States

Room: 518
Format: In Person

Moderator(s): Amin Emad


Authors List: Show

  • Hirak Sarkar, Princeton University, United States
  • Uthsav Chitra, Princeton University, United States
  • Julian Gold, Princeton University, United States
  • Benjamin Raphael, Princeton University, United States

Presentation Overview: Show

Motivation: Cell-cell interactions (CCIs) consist of cells exchanging signals with themselves and neighboring cells by expressing ligand and receptor molecules, and play a key role in cellular development, tissue homeostasis, and other critical biological functions. Since direct measurement of CCIs is challenging, multiple methods have been developed to infer CCIs by quantifying correlations between the gene expression of the ligands and receptors that mediate CCIs, originally from bulk RNA sequencing data and more recently from single-cell or spatial transcriptomics data. Spatial transcriptomics has a particular advantage over single-cell approaches since ligand-receptor correlations can be computed between cells or spots that are physically close in the tissue. However, the transcript counts of individual ligands and receptors in spatial transcriptomics data are generally low, complicating the inference of CCIs from expression correlations.

Results: We introduce Copulacci, a count-based model for inferring CCIs from spatial transcriptomics data. Copulacci uses a Gaussian copula to model dependencies between the expression of ligands and receptors from nearby spatial locations even when the transcript counts are low. On simulated data, Copulacci outperforms existing CCI inference methods based on the standard Spearman and Pearson correlation coefficients. Using several real spatial transcriptomics datasets, we show that Copulacci discovers biologically meaningful ligand-receptor interactions that are lowly expressed and undiscoverable by existing CCI inference methods.

Availability: Copulacci is implemented in Python and available at https://github.com/raphael-group/copulacci
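
As a rough illustration of the Gaussian copula idea described above, the sketch below estimates a copula correlation between ligand and receptor counts by mapping each count vector to normal scores through its empirical ranks and then taking a Pearson correlation. This is a simplified rank-based variant written for this listing, not Copulacci's count-based likelihood model; the function names and the toy counts are hypothetical.

```python
from statistics import NormalDist, mean


def normal_scores(counts):
    """Map counts to Gaussian scores using mid-ranks (ties share an average rank)."""
    n = len(counts)
    order = sorted(range(n), key=lambda i: counts[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of tied values starting at position i.
        while j + 1 < n and counts[order[j + 1]] == counts[order[i]]:
            j += 1
        mid_rank = (i + j) / 2 + 1  # average 1-based rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = mid_rank
        i = j + 1
    std_normal = NormalDist()
    # Dividing by n + 1 keeps every probability strictly inside (0, 1).
    return [std_normal.inv_cdf(r / (n + 1)) for r in ranks]


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either vector is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def copula_correlation(ligand, receptor):
    """Correlation of the two count vectors on the Gaussian (normal-score) scale."""
    return pearson(normal_scores(ligand), normal_scores(receptor))


# Hypothetical low counts for one ligand-receptor pair across neighboring spots.
ligand = [0, 1, 0, 2, 3, 0, 1, 5]
receptor = [0, 2, 1, 3, 4, 0, 1, 6]
print(copula_correlation(ligand, receptor))
```

Working on ranks makes the estimate insensitive to the marginal distribution of the counts, which is the property a copula is meant to provide. With many zeros the ties dominate, which is one reason a model with explicit count marginals may be preferable.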

14:40-15:00
Mapping lineage-resolved scRNA-seq data with spatial transcriptomics using TemSOMap
Confirmed Presenter: Xinhai Pan, Georgia Institute of Technology, United States

Room: 518
Format: In Person

Moderator(s): Amin Emad


Authors List: Show

  • Xinhai Pan, Georgia Institute of Technology, United States
  • Xiuwei Zhang, Georgia Institute of Technology, United States

Presentation Overview: Show

Spatial transcriptomics (ST) has become a powerful technique that bridges the gap between traditional gene expression analysis and spatial information within tissues or organisms. While ST can obtain a snapshot of cells’ spatial gene expression, the library size is relatively limited compared to scRNAseq datasets. This limitation can be overcome by integrating scRNAseq data with the ST data. By mapping the single cells onto the spatial data, we can also infer the spatial coordinates of the cells from the scRNAseq dataset. On the other hand, CRISPR/Cas9-based lineage tracing technologies have enabled paired sequencing of cells’ gene expression and lineage barcodes. The reconstructed cell lineage tree from the barcodes represents cells’ clonal distances. With the availability of spatial and temporal information at single-cell resolution, it is of great interest to look into the spatio-temporal dynamics of cells, which requires the inference of spatial coordinates of the lineage-traced cells. Therefore, we developed TemSOMap (Temporal and Spatial-Omics Mapping of single cells), which infers the spatial coordinates of cells by mapping a paired gene expression and lineage barcode dataset onto a spatial transcriptomics dataset using deep learning. The method aims to improve the accuracy of state-of-the-art mapping methods by utilizing the temporal and spatial information in the data. We show that TemSOMap can more accurately infer the spatial location of single cells, and can help us better understand the spatio-temporal dynamics of single cells using the spatially-resolved cell lineage and transcriptomic map.

15:00-15:20
Enhancing spatial transcriptomics analysis using deep learning-based batch effect mitigation
Confirmed Presenter: Rian Pratama, Pusan National University, South Korea

Room: 518
Format: In Person

Moderator(s): Amin Emad


Authors List: Show

  • Rian Pratama, Pusan National University, South Korea
  • Jason Hilton, Stanford University, United States
  • Jeonghoon Choi, Pusan National University, South Korea
  • J. Michael Cherry, Stanford University, United States
  • Giltae Song, Pusan National University, South Korea

Presentation Overview: Show

Spatial transcriptomics (ST) is a groundbreaking technique for studying the correlation between cellular organization within a tissue and their physiological and pathological properties. Every facet of spatial information, including cell/spot proximity, distribution, and dimensionality, holds significance. Most methods lean heavily on proximity for ST analysis; each yields useful insights but leaves other aspects untapped. In addition, samples procured at different times, from different donors, and with different technologies introduce batch effects that hinder the statistical approaches employed by most analysis tools. Addressing these challenges, we have developed a deep learning method for the integrated analysis of multiple ST datasets, focusing on the distribution aspect. Additionally, our method leverages single-cell analysis tools.

Our study introduces Spatial Gene Net, a data integration pipeline that uses a representation learning approach to extract the spatial distribution of genes into the same feature space as gene expression features. We employ an encoder network to extract spatial embeddings, facilitating the projection of spatial features into the gene expression feature space. Our approach allows seamless integration of multiple samples with minimal detriment, bolstering the statistical power of ST data analysis tools. We show an application of our method to the human DLPFC dataset. Our method consistently improves the performance of Seurat clustering, with the largest increase observed in sample 151673, where the ARI score rises from 0.236 to 0.405, a roughly 1.7-fold improvement. This result reveals the potential of the spatial distribution of genes, encouraging the development of better spatial feature extractors that emphasize the impact of integration and batch effect correction for understanding tissue characteristics.

15:20-15:40
Gene Regulatory Networks analysis from single cell multi-omics data
Confirmed Presenter: Zhana Duren, Clemson University, United States

Room: 518
Format: In Person

Moderator(s): Amin Emad


Authors List: Show

  • Qiuyue Yuan, Clemson University, United States
  • Zhana Duren, Clemson University, United States

Presentation Overview: Show

Existing methods for gene regulatory networks (GRN) inference rely on gene expression data alone or on lower resolution bulk data. Despite the recent integration of chromatin accessibility and RNA sequencing data, learning complex mechanisms from limited independent data points still presents a daunting challenge. Here we present LINGER (Lifelong neural network for gene regulation), a machine-learning method to infer GRNs from single-cell paired gene expression and chromatin accessibility data. LINGER incorporates atlas-scale external bulk data across diverse cellular contexts and prior knowledge of transcription factor motifs as a manifold regularization. LINGER achieves a fourfold to sevenfold relative increase in accuracy over existing methods and reveals a complex regulatory landscape of genome-wide association studies, enabling enhanced interpretation of disease-associated variants and genes. Following the GRN inference from reference single-cell multiome data, LINGER enables the estimation of transcription factor activity solely from bulk or single-cell gene expression data, leveraging the abundance of available gene expression data to identify driver regulators from case-control studies.

15:40-16:00
Dynamic Gene Regulatory Network Inference with Interpretable, Biophysically-Motivated Neural ODEs
Confirmed Presenter: Maggie Beheler-Amass, New York University, United States

Room: 518
Format: In Person

Moderator(s): Amin Emad


Authors List: Show

  • Maggie Beheler-Amass, New York University, United States
  • Christopher A Jackson, New York Genome Center, United States
  • David Gresham, New York University, United States
  • Richard Bonneau, Prescient Design, Genentech, United States

Presentation Overview: Show

Gene Regulatory Networks (GRNs) are complex dynamical systems that modulate gene expression and drive transitions between phenotypic cell states. Determining these networks is crucial in understanding how gene dysregulation can lead to phenotypic variation and diseases. We present a novel biophysically-motivated neural ordinary differential equation (ODE) model framework with a biologically interpretable deep learning architecture that leverages dynamic single-cell data. This model framework infers GRNs by implicitly estimating underlying biophysical parameters such as RNA velocity, mRNA transcription rate, and mRNA decay rate.

To test the accuracy of our model, we apply it to a simulated dataset with a known ground truth GRN. We demonstrate that the neural ODE can successfully predict gene expression at unseen time points, and decompose the inferred RNA velocity into the transcription and degradation driving the system, while inferring the underlying GRN. Next, we train a model on a single-cell Saccharomyces cerevisiae dataset dynamically responding to rapamycin treatment. The model learns regulatory responses to the rapamycin perturbation and reveals key genes involved in the cellular response in silico. Finally, we apply the model to a dynamic hematopoiesis dataset to test whether the model can capture bifurcations of hematopoietic stem cells progressing along the myeloid and lymphoid lineages. The applicability to real-world datasets highlights the utility of neural ODEs coupled with interpretable deep learning. This framework has the potential to advance our understanding of complex biological systems and aid in the discovery of regulatory mechanisms underlying cellular responses to perturbations.
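
To make the velocity decomposition described above concrete, here is a minimal sketch that assumes the common first-order form dx/dt = transcription - decay_rate * x for a single gene and integrates it with forward Euler steps. The constant transcription rate is a simplifying assumption for illustration; in the framework described in the abstract these quantities are inferred by a neural network, and all names and parameter values below are hypothetical.

```python
def simulate_expression(x0, transcription_rate, decay_rate, dt, steps):
    """Integrate dx/dt = transcription_rate - decay_rate * x with forward Euler.

    Returns the expression trajectory and the per-step velocity components.
    """
    x = x0
    trajectory = [x]
    components = []
    for _ in range(steps):
        synthesis = transcription_rate      # mRNA produced per unit time
        degradation = decay_rate * x        # mRNA lost per unit time
        velocity = synthesis - degradation  # net rate of change (RNA velocity)
        components.append((synthesis, degradation, velocity))
        x = x + dt * velocity
        trajectory.append(x)
    return trajectory, components


# Hypothetical parameters: the steady state is transcription_rate / decay_rate.
trajectory, components = simulate_expression(
    x0=0.0, transcription_rate=2.0, decay_rate=0.5, dt=0.01, steps=5000
)
print(trajectory[-1], components[-1])
```

At steady state synthesis and degradation balance, so the velocity approaches zero and expression approaches transcription_rate / decay_rate.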

16:40-17:00
Proceedings Presentation: Optimal sequencing budget allocation for trajectory reconstruction of single cells
Confirmed Presenter: Noa Moriel, Hebrew University of Jerusalem, Israel

Room: 518
Format: Live Stream

Moderator(s): Marcel Schulz


Authors List: Show

  • Noa Moriel, Hebrew University of Jerusalem, Israel
  • Edvin Memet, Harvard University, United States
  • Mor Nitzan, Hebrew University of Jerusalem, Israel

Presentation Overview: Show

Charting cellular trajectories over gene expression is key to understanding dynamic cellular processes and their underlying mechanisms. While advances in single-cell RNA-sequencing technologies and computational methods have pushed forward the recovery of such trajectories, trajectory inference remains a challenge due to the noisy, sparse, and high-dimensional nature of single-cell data. This challenge can be alleviated by increasing either the number of cells sampled along the trajectory (breadth) or the sequencing depth, i.e. the number of reads captured per cell (depth). Generally, these two factors are coupled due to an inherent breadth-depth tradeoff that arises when the sequencing budget is constrained due to financial or technical limitations. Here we study the optimal allocation of a fixed sequencing budget to optimize the recovery of trajectory attributes. Empirical results reveal that reconstruction accuracy of internal cell structure in expression space scales with the logarithm of either the breadth or depth of sequencing. We additionally observe a power law relationship between the optimal number of sampled cells and the corresponding sequencing budget. For linear trajectories, non-monotonicity in trajectory reconstruction across the breadth-depth tradeoff can impact downstream inference, such as expression pattern analysis along the trajectory. We demonstrate these results for five single-cell RNA-sequencing datasets encompassing differentiation of embryonic stem cells, pancreatic β cells, hepatoblast and multipotent haematopoietic cells, as well as induced reprogramming of embryonic fibroblasts into neurons. By addressing the challenges of single-cell data, our study offers insights into maximizing the efficiency of cellular trajectory analysis through strategic allocation of sequencing resources.
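
The breadth-depth tradeoff above can be sketched as follows, assuming the sequencing budget is simply the number of cells multiplied by the reads per cell. The helper names and the toy counts are hypothetical and are not taken from the authors' code; the downsampling step shows one simple way to emulate a shallower depth from an existing count vector.

```python
import random


def allocations(budget, cell_options):
    """List (n_cells, reads_per_cell) pairs that spend a fixed read budget."""
    return [(n, budget // n) for n in cell_options if 0 < n <= budget]


def downsample_cell(counts, target_reads, rng):
    """Subsample one cell's reads without replacement down to target_reads."""
    # Expand the count vector into one entry per read, labeled by gene index.
    reads = [gene for gene, c in enumerate(counts) for _ in range(c)]
    if target_reads >= len(reads):
        return list(counts)
    kept = rng.sample(reads, target_reads)
    out = [0] * len(counts)
    for gene in kept:
        out[gene] += 1
    return out


# A fixed budget of one million reads split across different numbers of cells.
for n_cells, depth in allocations(1_000_000, [100, 1_000, 10_000]):
    print(n_cells, "cells at", depth, "reads per cell")

# Emulate a shallower depth for one hypothetical cell with four genes.
rng = random.Random(0)
print(downsample_cell([5, 0, 3, 2], 4, rng))
```

Sweeping the allocations while downsampling a deeply sequenced dataset is one way to explore how reconstruction quality changes along the tradeoff.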

17:00-17:20
Charting the role of RNA binding proteins in tissue-specific alternative splicing using machine explanations
Confirmed Presenter: Ayan Paul, Northeastern University, United States

Room: 518
Format: In Person

Moderator(s): Marcel Schulz


Authors List: Show

  • Ayan Paul, Northeastern University, United States
  • Shalini Karthyk, Northeastern University, United States
  • Yogi Raghav, University of Virginia, United States
  • Jennifer Dy, Northeastern University, United States
  • John Platig, University of Virginia, United States
  • Peter Castaldi, Harvard Medical School, United States

Presentation Overview:

The regulation of alternative splicing by RNA-binding proteins (RBPs) is an essential mechanism in determining tissue specificity. How the roles of RBPs, singly and collectively, vary across tissues is not well understood. We present a study of two cell lines, HepG2 and K562, using eCLIP RBP binding data and shRNA RBP knockdown followed by RNA-seq data from the ENCODE project to chart the role of RBP cooperativity in regulating exon skipping, one of the primary modes of alternative splicing. We build RBP binding graphs from exon triplets and train machine learning models, both linear and non-linear, to map RBP binding to exon-skipping quantification. We show that significant non-linearities are present in both cell lines. We achieve state-of-the-art performance with Extreme Gradient Boosted Decision Trees and Deep Neural Networks with skip connections. We use Shapley values as post-hoc explanations of machine learning models to quantify the importance of individual RBPs and to identify instances of cooperative regulation between sets of RBPs. We explore RBP activity in close proximity to intron-exon junctions and in deep intronic regions. We show that RBPs have a subset of cell-line-agnostic roles and a subset of cell-line-specific roles in regulating alternative splicing. Furthermore, we identify binding-region-specific roles of RBPs as splicing enhancers or silencers, displaying the power of our analysis in elucidating the functional roles by which RBPs regulate alternative splicing.
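
To illustrate how Shapley values attribute a prediction to individual features and surface cooperative effects, here is a minimal sketch that computes exact Shapley values by enumerating feature subsets. The toy scoring function and the names `RBP_A`, `RBP_B`, and `RBP_C` are hypothetical; the study itself applies Shapley values to trained models, and its exact procedure is not shown here.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values for a set function `value(frozenset) -> float`.

    Enumerates every subset, so it is only practical for a few players.
    """
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that p joins.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


def toy_skipping_score(bound):
    """Hypothetical model: RBP_A has a solo effect, RBP_A and RBP_B cooperate,
    and RBP_C has no effect."""
    score = 0.0
    if "RBP_A" in bound:
        score += 0.2
    if "RBP_A" in bound and "RBP_B" in bound:
        score += 0.4  # cooperative term, split evenly between A and B
    return score


if __name__ == "__main__":
    phi = shapley_values(["RBP_A", "RBP_B", "RBP_C"], toy_skipping_score)
    for name, contribution in phi.items():
        print(name, round(contribution, 3))
```

In this toy case the cooperative term is shared equally between the two interacting features, and the attributions sum to the full model output. Exact enumeration grows exponentially with the number of features, which is why approximate or model-specific estimators are typically used in practice.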

17:20-18:00
Invited Presentation: Harnessing deep learning to amplify insights from GWAS
Confirmed Presenter: Hae Kyung Im

Room: 518
Format: In Person

Moderator(s): Marcel Schulz


Authors List:

  • Hae Kyung Im

Presentation Overview:

Genome-wide Association Studies (GWAS) have identified associations with thousands of complex traits across a significant portion of the genome. Transcriptome-wide Association Studies (TWAS) and similar methods (xWAS) aim to uncover causal mechanisms by leveraging genetic predictors of molecular traits. However, their effectiveness is constrained by the current limitations in predicting these traits from genotypes. In this talk, I will explore recent advancements in deep learning methods for predicting gene expression from DNA sequences and demonstrate how these techniques can enhance the power of TWAS. By fine-tuning pre-trained large-scale models, we can predict molecular traits on a much larger scale than is possible with traditional population-based approaches. This methodology holds promise for addressing challenges related to portability across ancestries and species, rare variations, linkage disequilibrium (LD) confounding, and single-cell expression analysis.