The SciFinder tool lets you search Titles, Authors, and Abstracts of talks and panels. Enter your search term below and your results will be shown at the bottom of the page. You can also click on a track to see all the talks given in that track on that day.

Results

July 13, 2024
10:40-11:20
Invited Presentation: Continual improvement of cis-regulatory models
Confirmed Presenter: Carl de Boer
Track: RegSys

Room: 518
Format: In Person
Moderator(s): Shaun Mahony


Authors List:

  • Carl de Boer

Presentation Overview:

Gene expression is regulated by transcription factors that work together to read cis-regulatory DNA sequences. A primary aim of my group is to decipher the “cis-regulatory code” - the rules that cells use to determine when, where, and how much genes should be expressed. While cis-regulation has proven to be exceedingly complex, recent advances in our ability to query the activity of DNA, combined with machine learning, have enabled significant progress towards deciphering this code. Here, I will focus on several of our recent efforts to improve cis-regulatory models. First, I will describe a recent DREAM Challenge, where competitors from across the globe competed to create the best sequence-expression models using a dataset of random yeast promoter sequences and their experimentally determined expression levels, which resulted in state-of-the-art model architectures, even for human cis-regulatory data. Next, I will describe an ongoing effort to make cis-regulatory models and evaluation tasks interoperable, streamlining model evaluation and enabling model comparison. Then, I will describe an alternate strategy for dividing the genome into training and test datasets, which substantially mitigates the homology-driven data leakage common in genome-trained models. Finally, I will give a perspective on where the field needs to go to crack the cis-regulatory code: namely, profiling the regulatory activities of non-genomic DNA sequences at very high throughput, and using these data to train models that understand genome regulation without ever having seen genomic sequences.
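The abstract does not spell out the alternate splitting strategy. Purely as an illustration of the general idea of avoiding homology-driven leakage, the sketch below clusters sequences that are linked by homology and assigns whole clusters to either the training or the test set. The `homolog_pairs` input, the function names, and the greedy smallest-first assignment are assumptions made for this example, not the method presented in the talk.

```python
from collections import defaultdict


def homology_groups(seq_ids, homolog_pairs):
    """Cluster sequence IDs into connected components of the homology graph."""
    parent = {s: s for s in seq_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in homolog_pairs:
        parent[find(a)] = find(b)

    groups = defaultdict(list)
    for s in seq_ids:
        groups[find(s)].append(s)
    return list(groups.values())


def split_by_homology_group(seq_ids, homolog_pairs, test_fraction=0.2):
    """Assign whole homology groups to train or test so no pair straddles the split."""
    groups = sorted(homology_groups(seq_ids, homolog_pairs), key=len)
    target = test_fraction * len(seq_ids)
    train, test = [], []
    for group in groups:
        # Fill the test set with the smallest groups first; everything else trains.
        if len(test) + len(group) <= target:
            test.extend(group)
        else:
            train.extend(group)
    return train, test


# Hypothetical example: s1-s2-s3 and s4-s5 are homologous clusters.
ids = ["s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8", "s9", "s10"]
pairs = [("s1", "s2"), ("s2", "s3"), ("s4", "s5")]
train, test = split_by_homology_group(ids, pairs)
```

A plain random split could place `s1` in training and `s2` in test; grouping first rules that out by construction.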

July 13, 2024
11:20-11:40
Interpreting Cis-Regulatory Interactions from Large-Scale Deep Neural Networks for Genomics
Confirmed Presenter: Peter Koo, Cold Spring Harbor Laboratory, United States
Track: RegSys

Room: 518
Format: In Person
Moderator(s): Shaun Mahony


Authors List:

  • Shushan Toneyan, Cold Spring Harbor Laboratory
  • Peter Koo, Cold Spring Harbor Laboratory

Presentation Overview:

The rise of large-scale, sequence-based deep neural networks (DNNs) for predicting gene expression has introduced challenges in their evaluation and interpretation. Current evaluations align DNN predictions with experimental perturbation assays, which provide insights into the generalization capabilities within the studied loci but offer a limited perspective of what drives their predictions. Moreover, existing model explainability tools focus mainly on motif analysis, which becomes complex when interpreting longer sequences. Here we introduce CREME, an in silico perturbation toolkit that interrogates large-scale DNNs to uncover the rules of gene regulation that they learn. Using CREME, we investigate Enformer, a prominent DNN in gene expression prediction, revealing cis-regulatory elements (CREs) that directly enhance or silence target genes. We explore the intricate complexity of higher-order CRE interactions, the effect of CRE distance from the transcription start site on gene expression, as well as the biochemical features of enhancers and silencers learned by Enformer. Moreover, we demonstrate the flexibility of CREME to efficiently uncover a higher-resolution view of functional sequence elements within CREs. This work demonstrates how CREME can be employed to translate the powerful predictions of large-scale DNNs to study open questions in gene regulation.
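The basic shape of an in silico perturbation test can be sketched as follows: shuffle one tile of the input sequence, re-run the model, and compare the prediction with the unperturbed baseline. The toy motif-counting `toy_predict` function, the tile size, and all function names are stand-ins chosen for this example; this is neither CREME's API nor Enformer.

```python
import random


def toy_predict(seq):
    """Stand-in for a sequence-to-expression model: counts 'TATA' occurrences."""
    return float(sum(seq[i:i + 4] == "TATA" for i in range(len(seq) - 3)))


def shuffle_tile(seq, start, end, rng):
    """Return seq with the bases in [start, end) randomly permuted."""
    tile = list(seq[start:end])
    rng.shuffle(tile)
    return seq[:start] + "".join(tile) + seq[end:]


def tile_effects(seq, predict, tile_size, n_shuffles=20, seed=0):
    """For each tile, the mean change in prediction after shuffling that tile."""
    rng = random.Random(seed)
    baseline = predict(seq)
    effects = []
    for start in range(0, len(seq), tile_size):
        end = min(start + tile_size, len(seq))
        deltas = [
            predict(shuffle_tile(seq, start, end, rng)) - baseline
            for _ in range(n_shuffles)
        ]
        effects.append(sum(deltas) / n_shuffles)
    return effects


# The middle tile carries the motif; the flanking tiles are uniform filler.
seq = "GGGGGGGG" + "TATATATA" + "CCCCCCCC"
effects = tile_effects(seq, toy_predict, tile_size=8)
```

In this toy setting a negative effect marks a tile whose disruption lowers the prediction (enhancer-like behaviour), while a positive effect would mark a silencer-like tile.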

July 13, 2024
11:40-12:00
Chromatin accessibility is driven by intra-nucleosomal pioneer cooperativity that includes low affinity motifs
Confirmed Presenter: Melanie Weilert, Stowers Institute for Medical Research, United States
Track: RegSys

Room: 518
Format: In Person
Moderator(s): Shaun Mahony


Authors List:

  • Melanie Weilert, Stowers Institute for Medical Research
  • Kaelan Brennan, Stowers Institute for Medical Research
  • Khyati Dalal, Stowers Institute for Medical Research
  • Sabrina Krueger, European Molecular Biology Laboratory
  • Charles McAnany, Stowers Institute for Medical Research
  • Yue Liang, Stowers Institute for Medical Research
  • Julia Zeitlinger, Stowers Institute for Medical Research

Presentation Overview:

The regulation of chromatin accessibility at cis-regulatory DNA sequences is a key rate-limiting step for enhancer activation and thus is an important element of the cis-regulatory code. Pioneer transcription factors (TFs) that induce nucleosome remodeling mediate chromatin opening, but the sequence rules by which pioneer or other TFs cooperate to make chromatin accessible are not well understood. To identify these sequence rules in an unbiased manner, we trained and interpreted BPNet-derived deep learning models that predict base-resolution TF binding data and bias-corrected chromatin accessibility data in mouse embryonic stem cells. By comparing the interpretations from both models, we can distinguish between TFs that are strong pioneers, weak pioneers and non-pioneers. Furthermore, we find that pioneering depends on low-affinity TF motifs, which increase in importance when they cooperate with other motifs. This reliance on cooperativity is observed to be important at low TF concentrations, when high-affinity motifs have a decreased effect on chromatin accessibility, confirming our model predictions. By probing the cooperativity in more detail, we find that it generally occurs at intra-nucleosomal distances, supporting a nucleosome-mediated mechanism of cooperativity. This highlights the ability of deep learning models to learn complex sequence rules, suggesting that widespread cooperativity and involvement of low affinity motifs could explain why the context-dependent function of pioneer TFs has been difficult to decipher.

July 13, 2024
12:00-12:20
Characterizing transcription factor binding with multi-omics sequence model
Confirmed Presenter:
Track: RegSys

Room: 518
Format: In Person
Moderator(s): Shaun Mahony


Authors List:

  • Fangxin Cai, University of Hong Kong
  • Yuanhua Huang, University of Hong Kong

Presentation Overview:

The linkage between transcription factors (TFs) and cis-regulatory regions (CREs) is crucial to understanding gene regulation. Conventionally, it is determined by a step-wise process: motif enrichment followed by correlation- or regression-based analysis. As the presence of motifs does not always imply binding, and correlation analysis may miss low-expression TFs, this process can suffer from false positives and false negatives. Here we propose a holistic model that takes joint single-cell RNA sequencing (scRNA-seq) data and single-cell assay for transposase-accessible chromatin sequencing (scATAC-seq) data to delineate TF-CRE linkage. Inspired by multi-omics factor analysis and sequence modeling, our model decomposes peaks’ accessibility into cell factors, encoded from TF expression, and peak factors, encoded from DNA sequences.
We demonstrate our model on an embryonic mouse brain dataset. Both modalities are accurately reconstructed on held-out cells and sequences. Cell factors preserve cell type distinction and trajectory structure, while sequence factors moderately localize some motifs, such as those of Neurod2 and Sox11, indicating that regulatory information is captured.
To delineate TF-CRE linkage, we take gradients with respect to the two inputs. High gradient times TF expression values (gradTF) are assigned to high-correlation TF-CRE pairs, whereas low-correlation, high-gradTF pairs may correspond to low-expression TFs, though systematic evaluation remains to be done. As an example, Runx1, a low-expression TF, correlates poorly with almost all peaks’ accessibility; however, its potential target CREs (compiled from ChIP-Atlas) have a higher absolute gradTF. On the other hand, gradient times sequence (gradSeq) highlights regulatory motifs.
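Gradient-times-input attribution, the idea behind gradTF and gradSeq, multiplies each input value by the partial derivative of the model output with respect to that input. The sketch below illustrates this on a toy function, using central finite differences in place of automatic differentiation; `toy_accessibility` and the input values are invented for the example and are unrelated to the authors' model.

```python
def grad_times_input(f, x, eps=1e-6):
    """Approximate x_i * df/dx_i for every input via central finite differences."""
    scores = []
    for i in range(len(x)):
        hi = list(x)
        lo = list(x)
        hi[i] += eps
        lo[i] -= eps
        grad_i = (f(hi) - f(lo)) / (2 * eps)
        scores.append(x[i] * grad_i)
    return scores


def toy_accessibility(tf_expr):
    """Toy model: linear in TF 0, quadratic in TF 1, independent of TF 2."""
    return 3.0 * tf_expr[0] + 0.5 * tf_expr[1] ** 2


tf_expression = [2.0, 4.0, 10.0]
scores = grad_times_input(toy_accessibility, tf_expression)
# Analytically: 2.0 * 3.0 = 6, 4.0 * (0.5 * 2 * 4.0) = 16, and 10.0 * 0 = 0.
```

The third input has the highest expression but a score of zero, because the toy model never uses it; attribution depends on the gradient as well as the input magnitude.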

July 13, 2024
12:00-12:20
Protein Language Models improve the target prediction of nucleic acid-binding proteins
Confirmed Presenter:
Track: RegSys

Room: 518
Format: In Person
Moderator(s): Shaun Mahony


Authors List:

  • Cyrus Tam, Memorial Sloan Kettering Cancer Center
  • Ilyes Baali, Memorial Sloan Kettering Cancer Center
  • Kaitlin Laverty, Memorial Sloan Kettering Cancer Center
  • Taykhoom Dalal, Memorial Sloan Kettering Cancer Center
  • Debashish Ray, University of Toronto
  • Alexander Sasse, University of Washington
  • Woojeong Kim, Cornell University
  • Alexander Rush, Cornell University
  • Matthew Weirauch, University of Cincinnati College of Medicine
  • Timothy Hughes, University of Toronto
  • Quaid Morris, Memorial Sloan Kettering Cancer Center

Presentation Overview:

Unraveling the DNA- and RNA-binding preferences of regulatory proteins, like transcription and splicing factors, is important for understanding gene regulatory networks. In vitro binding assays, including Protein Binding Microarrays (PBMs) and RNAcompete, have been conducted for hundreds of nucleic-acid-binding proteins (NBPs) and provide training data for homology models that can predict the binding preferences of unmeasured NBPs. However, to date, these homology models have used simple rules to predict motifs.

Protein Language Models (PLMs) have emerged as effective models for downstream protein property prediction; however, their value in predicting protein-ligand interactions is less clear. To evaluate this, we extracted diverse NBP representations from four PLMs (AlphaFold2, AminoBERT, ESM-2, ProteinBERT) and compared their performance against baseline representations when used as inputs for different target prediction models, including unsupervised methods and neural networks.

Extensive evaluations across diverse datasets revealed that PLM-learned representations consistently outperformed baseline methods. Further analysis demonstrated the particular value of PLM-learned representations for proteins with only distant homologs. Feature attribution analyses demonstrated that PLM-learned representations capture global and local structural properties, showcasing their efficacy in predicting binding preferences.

ESM-2 emerged as a top performer across predictive models. We further evaluated its performance in predicting the targets of unmeasured RNA-binding proteins (RBPs) and unmeasured RNAs after fine-tuning. By introducing new tokens for nucleic acids and using concatenated RBP-RNA sequences as inputs, we demonstrated that a fine-tuned ESM-2 model matches the state-of-the-art (SOTA) approach in generalizing to unseen RNA sequences and outperforms the SOTA approach in generalizing to unseen RBPs.

July 13, 2024
12:00-12:20
DNA language models reveal the architecture of nucleotide dependencies in genomes
Track: RegSys

Room: 518
Format: In Person
Moderator(s): Shaun Mahony


Authors List:

  • Pedro Tomaz da Silva, Technical University of Munich
  • Alexander Karollus, Technical University of Munich
  • Johannes Hingerl, Technical University of Munich
  • Xavier Hernandez-Alias, Mechanisms of Protein Biogenesis
  • Gihanna Galindez, Technical University of Munich
  • Nils Wagner, Technical University of Munich
  • Danny Incarnato, University of Groningen
  • Julien Gagneur, Technical University of Munich

Presentation Overview:

While the genome is composed of individual nucleotides, functional elements such as cis-regulatory elements and structural interactions are formed from sets of interdependent nucleotides. In principle, these dependencies are reflected in coevolutionary relationships. However, their detection beyond coding sequences is challenging with classical approaches.

DNA language models (LMs), which are trained by predicting nucleotides given their sequence context, have recently been proposed as foundational models for sequence-based prediction problems. DNA LMs implicitly capture functional elements from genomic sequences alone. However, which dependencies DNA LMs learn and whether they reflect known or even novel biology remains an open question.

Here we introduce nucleotide dependency maps to systematically study nucleotide dependencies captured by DNA LMs in a purely unsupervised setup. We compute these maps genome-wide and show that they reveal and clearly delineate known functional genomic features such as transcription factor binding motifs, functional interactions between splice sites, RNA tertiary structures, and coding sequences. Additionally, we uncover novel and conserved dependency structures suitable for experimental validation.

We furthermore investigate dependency maps from in silico manipulated sequences, revealing the ability of DNA LMs to capture operations such as copying and reverse complementarity without memorization.

Lastly, we compare dependency maps from openly available DNA LMs, showcasing the drawbacks and advantages of different models. We find stark differences in the ability of models to accurately learn conserved but infrequent features.

Altogether, by leveraging the flexibility of DNA language models, nucleotide dependency mapping emerges as a general methodology to discover and study functional interactions in genomes.

July 13, 2024
12:00-12:20
LoopHunter: Enhancing Chromatin Loop Annotation by Focusing on Larger Regions in Hi-C Data
Confirmed Presenter:
Track: RegSys

Room: 518
Format: In Person
Moderator(s): Shaun Mahony


Authors List:

  • Yusen Hou, The Hong Kong University of Science and Technology (Guangzhou)
  • Yanlin Zhang, The Hong Kong University of Science and Technology (Guangzhou)

Presentation Overview:

Chromatin loops, which bring distant loci into close contact, play a crucial role in gene expression and regulation. Although several methods have been developed for annotating loops from Hi-C contact maps, these methods remain unsatisfactory, particularly in accurately identifying loops from low-coverage or single-cell Hi-C contact maps at high resolutions. Chromatin loops manifest as small blob-shaped patterns on Hi-C contact maps, encouraging existing tools to focus on analyzing contact pairs within small areas, such as a 21x21 window. However, these blob-shaped patterns are often indistinct in sparse regions, providing insufficient data for precise loop detection. Meanwhile, many chromatin loops exhibit broader patterns, including stripes, particularly in loops associated with the formation of Topologically Associating Domains (TADs), which current tools largely ignore. In this study, we introduce LoopHunter, an axial attention-based deep learning model to annotate loops from Hi-C contact maps at high resolutions across various coverages. LoopHunter utilizes a 224x224 sub-matrix as input and employs a combination of axial attention transformer and convolutional blocks to capture multi-scale data, facilitating robust loop prediction within the input region. Unlike traditional approaches that focus only on the center of the input matrix, we propose to train LoopHunter based on knowledge distillation, enabling it to make dense predictions. Our comparisons of LoopHunter against alternative tools demonstrate that LoopHunter significantly enhances loop annotation across both low- and high-coverage Hi-C contact maps.
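To make the 224x224 input concrete, the sketch below shows one way a chromosome-wide contact map could be cut into square sub-matrices along the main diagonal. Only the 224x224 size comes from the abstract; the stride of 112, the diagonal-only tiling, and the function names are assumptions for illustration, not LoopHunter's actual preprocessing.

```python
def diagonal_windows(n_bins, size=224, stride=112):
    """Yield (start, end) bin ranges of square windows along the main diagonal."""
    if n_bins <= size:
        yield (0, n_bins)
        return
    start = 0
    while start + size < n_bins:
        yield (start, start + size)
        start += stride
    # Align the final window to the end so the last bins are always covered.
    yield (n_bins - size, n_bins)


def extract(matrix, start, end):
    """Slice the square sub-matrix [start:end, start:end] from a list-of-lists map."""
    return [row[start:end] for row in matrix[start:end]]


# A hypothetical chromosome with 500 bins at the chosen resolution.
windows = list(diagonal_windows(n_bins=500))
print(windows)  # [(0, 224), (112, 336), (224, 448), (276, 500)]
```

Overlapping windows mean most loci appear in more than one input, which fits naturally with a model that makes dense predictions across the whole window instead of only at its center.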

July 13, 2024
12:00-12:20
A systematic comparison of Machine learning methods for the prediction of enhancer-gene interactions from epigenomic data
Confirmed Presenter:
Track: RegSys

Room: 518
Format: In Person
Moderator(s): Shaun Mahony


Authors List:

  • Fatemeh Behjati Ardakani, Goethe University Frankfurt
  • Shamim Ashrafiyan, Goethe University Frankfurt
  • Dennis Hecker, Goethe University Frankfurt
  • Laura Rumpf, Goethe University Frankfurt
  • Marcel Schulz, Goethe University

Presentation Overview:

Understanding the complex interaction between histone modifications, enhancers, and gene regulation is pivotal in deciphering the mechanisms governing cellular identity and function. This study investigates the critical task of predicting enhancer-gene interactions, essential for unraveling non-coding variation and DNA-binding factor-mediated gene regulation. Leveraging the comprehensive EpiATLAS dataset, encompassing high-quality histone ChIP-seq and RNA-seq data from a wide variety of cell types curated by IHEC, we embarked on a systematic comparison of various machine learning methods tailored to gene-specific prediction of gene expression from epigenome data.
Our investigation extends beyond traditional approaches by incorporating the large EpiATLAS dataset and exploring different state-of-the-art machine learning methods. Notably, we optimized novel Convolutional Neural Network (CNN) and Multi-Layer Perceptron (MLP) architectures, as well as Random Forest-based (RF) methods, in comparison to established linear models. By harnessing H3K27ac histone mark signatures within megabase genomic windows surrounding each gene, our models, especially RF and CNN, demonstrated exceptional performance in predicting gene expression. Many different aspects of a gene, such as gene structure and expression variance across cell types, dictate the success of building an accurate model.
Through comprehensive validation using CRISPRi screens and eQTL data, we investigate the efficacy of the learned models in predicting enhancer-gene interactions using an in silico perturbation setup.
In summary, our work offers a comprehensive framework for understanding enhancer-mediated gene regulation, supported by rigorous validation methods. These findings provide valuable insights into the regulatory landscape of the human genome, advancing our understanding of cellular function and disease mechanisms.

July 13, 2024
12:00-12:20
Q&A for Flash Talks
Track: RegSys

Room: 518
Format: In Person
Moderator(s): Shaun Mahony



July 13, 2024
14:20-15:00
Invited Presentation: Integrative modeling of multiscale single-cell spatial epigenome
Confirmed Presenter: Jian Ma
Track: RegSys

Room: 518
Format: In Person
Moderator(s): Ferhat Ay


Authors List:

  • Jian Ma

Presentation Overview:

Despite significant advancements in high-throughput data acquisition in genomics and cell biology, our understanding of the diverse cell types within the human body remains limited. In particular, the principles governing intracellular molecular spatial organization and interaction, as well as cellular spatial organization within complex tissues, are still largely unclear. A major challenge lies in developing computational methods capable of integrating heterogeneous and multiscale molecular, cellular, and tissue information. In this talk, I will discuss our recent work on creating integrative approaches to advance regulatory genomics using single-cell spatial epigenomics. These methods hold the potential to reveal new insights into fundamental genome structure, gene regulation, and cellular function within complex tissues, across a wide range of biological contexts in both health and disease.

July 13, 2024
15:00-15:20
Proceedings Presentation: Enhancing Hi-C contact matrices for loop detection with Capricorn, a multi-view diffusion model
Confirmed Presenter: William Noble, University of Washington, United States
Track: RegSys

Room: 518
Format: In Person
Moderator(s): Ferhat Ay


Authors List:

  • Tangqi Fang, University of Washington
  • Yifeng Liu, University of Washington
  • Addie Woicik, University of Washington
  • Minsi Lu, University of Washington
  • Anupama Jha, University of Washington
  • Xiao Wang, Purdue University
  • Gang Li, University of Washington
  • Borislav Hristov, University of Washington
  • Zixuan Liu, University of Washington
  • Hanwen Xu, University of Washington
  • William Noble, University of Washington
  • Sheng Wang, University of Washington

Presentation Overview:

Motivation: High-resolution Hi-C contact matrices reveal the detailed three-dimensional architecture of the genome, but high-coverage experimental Hi-C data are expensive to generate. On the other hand, chromatin structure analyses struggle with extremely sparse contact matrices. To address this problem, computational methods to enhance low-coverage contact matrices have been developed, but existing methods are largely based on resolution enhancement methods for natural images and hence often employ models that do not distinguish between biologically meaningful contacts, such as loops, and other stochastic contacts.

Results: We present Capricorn, a machine learning model for Hi-C resolution enhancement that incorporates small-scale chromatin features as additional views of the input Hi-C contact matrix and leverages a diffusion probability model backbone to generate a high-coverage matrix. We show that Capricorn outperforms the state of the art in a cross-cell-line setting, improving on existing methods by 17% in mean squared error and 26% in F1 score for chromatin loop identification from the generated high-coverage data. We demonstrate that Capricorn performs well in the cross-chromosome setting and the cross-chromosome, cross-cell-line setting. We further show that our multi-view idea can also be used to improve several existing methods, HiCARN and HiCNN, indicating the wide applicability of this approach. Finally, we use DNA sequence to validate discovered loops and find that the fraction of CTCF-supported loops from Capricorn is similar to that of loops identified from the high-coverage data. Capricorn is a powerful Hi-C resolution enhancement method that enables scientists to find chromatin features that cannot be identified in the low-coverage contact matrix.

July 13, 2024
15:20-15:40
Ultra-long-range and interchromosomal loops link T cell superenhancers
Confirmed Presenter: Gabriel Dolsten, Princeton, United States
Track: RegSys

Room: 518
Format: In Person
Moderator(s): Ferhat Ay


Authors List:

  • Gabriel Dolsten, Princeton
  • Zhong-Min Wang, Memorial Sloan Kettering Cancer Center
  • Xiao Huang, Memorial Sloan Kettering Cancer Center
  • Anthony Michaels, Memorial Sloan Kettering Cancer Center
  • Mike Wilson, Princeton
  • Susie Song, Princeton
  • Aaron Viny, Columbia University
  • Alexander Rudensky, Memorial Sloan Kettering Cancer Center
  • Yuri Pritykin, Princeton

Presentation Overview:

Functional enhancer-promoter interactions are typically thought to occur at distances less than two megabases. To explore the role of long-range regulatory interactions, we generated two Hi-C libraries in regulatory (Treg) and conventional (Tcon) CD4+ T cells. We found that interactions beyond 10Mb dramatically improved prediction of gene expression from Hi-C, suggesting that long-range interactions may play an important role in gene regulation. To analyze the role of long-range interactions, we examined differential contact frequency between Treg and Tcon genome-wide. This analysis revealed 78,089 differential interactions at distances greater than two megabases. Differential long-range contact was especially common at critical T cell genes regulated by superenhancers, such as Ikzf2. These interactions often presented as focal contacts (“megaloops”), such as the Treg-specific 9Mb megaloop between Ikzf2 and Ctla4. A second 18Mb megaloop between Ikzf2 and Arl4c was confirmed by DNA-FISH. We developed a novel algorithm and package, LONGSHOT, to find megaloops and identified 33,791 intrachromosomal and 23,003 interchromosomal megaloops in the T cell connectome. Clustering of megaloops revealed three distinct interchromosomal megaloop hubs. Two of the hubs were highly enriched for superenhancers, capturing 50% of all Treg cell superenhancers. Analysis of a published Hi-C dataset with an Ets1 superenhancer knockout revealed changes in megalooping after superenhancer knockout and changes in gene expression at the megalooped sites. Together, these results suggest that ultra-long-range chromatin contacts, partly mediated by superenhancers, are an important component of T cell gene regulation.

July 13, 2024
15:40-16:00
Proceedings Presentation: scGrapHiC: Deep learning-based graph deconvolution for Hi-C using single cell gene expression
Confirmed Presenter: Ghulam Murtaza, Department of Computer Science, Brown University
Track: RegSys

Room: 518
Format: In Person
Moderator(s): Ferhat Ay


Authors List:

  • Ghulam Murtaza, Department of Computer Science
  • Byron Butaney, Department of Computer Science
  • Justin Wagner, Material Measurement Laboratory
  • Ritambhara Singh, Department of Computer Science and Center for Computational Molecular Biology

Presentation Overview:

The single-cell Hi-C (scHi-C) protocol helps identify cell-type-specific chromatin interactions and sheds light on cell differentiation and disease progression. Despite providing crucial insights, scHi-C data is often underutilized due to the high cost and the complexity of the experimental protocol. We present a deep learning framework, scGrapHiC, that predicts pseudo-bulk scHi-C contact maps using pseudo-bulk scRNA-seq data. Specifically, scGrapHiC performs graph deconvolution to extract genome-wide single-cell interactions from a bulk Hi-C contact map using scRNA-seq as a guiding signal. Our evaluations show that scGrapHiC, trained on 7 cell-type co-assay datasets, outperforms typical sequence encoder approaches. For example, scGrapHiC achieves a substantial improvement of 23.2% in recovering cell-type-specific Topologically Associating Domains over the baselines. It also generalizes to unseen embryo and brain tissue samples. scGrapHiC is a novel method for generating cell-type-specific scHi-C contact maps from widely available genomic signals, enabling the study of cell-type-specific chromatin interactions.

July 13, 2024
16:40-17:00
Cross-species and tissue imputation of species-level DNA methylation samples across mammalian species
Confirmed Presenter: Emily Maciejewski, UCLA, United States
Track: RegSys

Room: 518
Format: In Person
Moderator(s): Shamim Mollah


Authors List:

  • Emily Maciejewski, UCLA
  • Steve Horvath, Altos Labs
  • Jason Ernst, University of California

Presentation Overview:

DNA methylation data offers valuable insights into various aspects of mammalian biology. However, the availability of such data for many mammals has been historically limited due to a lack of applicable microarrays in species other than human and mouse. The recent introduction and large-scale application of the mammalian methylation array has significantly expanded the availability of such data across conserved sites in many mammalian species. In our study, we consider 13,245 samples profiled on this array encompassing 348 species and 59 tissues from 746 species-tissue combinations. While having some coverage of many different species and tissue types, this data captures only 3.6% of potential species-tissue combinations. To address this gap, we developed CMImpute (Cross-species Methylation Imputation) which uses a Conditional Variational Autoencoder (CVAE), a conditional generative model implemented via neural networks, to impute DNA methylation of non-profiled species-tissue combinations. CMImpute specifically conditions the CVAE on species and tissue labels, allowing for direct control over the combination to be imputed. In cross-validation, we demonstrate that CMImpute achieves a strong correlation with actual observed values, surpassing several baseline methods in terms of agreement across methylation array probes with a mean correlation of 0.92 and across samples with a mean correlation of 0.69. Using CMImpute we imputed methylation data for 19,786 new species-tissue combinations representing the remaining 96.4% of potential combinations. We believe that both CMImpute and our imputed data resource will be useful for DNA methylation analyses across a wide range of mammalian species.
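The coverage figures quoted in this abstract follow directly from the counts it reports, as the short check below shows. The variable names are illustrative only.

```python
n_species = 348
n_tissues = 59
n_profiled = 746  # species-tissue combinations profiled on the array

n_possible = n_species * n_tissues            # all potential combinations
n_missing = n_possible - n_profiled           # combinations left to impute
coverage_pct = 100 * n_profiled / n_possible  # share of combinations profiled

print(n_possible, n_missing, round(coverage_pct, 1), round(100 - coverage_pct, 1))
# prints: 20532 19786 3.6 96.4
```

The 19,786 imputed combinations are therefore exactly the 20,532 potential species-tissue pairs minus the 746 that were profiled.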

July 13, 2024
17:00-17:20
Ontology-aware prediction of tissue-specific DNA methylation
Confirmed Presenter: Mirae Kim, Rice University, United States
Track: RegSys

Room: 518
Format: In Person
Moderator(s): Shamim Mollah


Authors List:

  • Mirae Kim, Rice University
  • Ruth Dannenfelser, Rice University
  • Yufei Cui, Massachusetts Institute of Technology
  • Vicky Yao, Rice University

Presentation Overview:

DNA methylation (DNAm) has shown tremendous potential in distinguishing physiological states such as aging and cancer progression, and epigenetic clocks, in particular, have had far-reaching applications. Though DNAm is also highly tissue-specific, no pan-tissue classifier currently exists. Here, we manually curate 3,145 healthy human DNA methylation samples across 116 studies spanning 50 tissue types and combine this data compendium with a novel framework that combines Minipatch feature selection with ontology-aware classification. Through this study, we identify a minimal set of 741 CpG sites that can accurately distinguish between different tissue types. A deeper examination of the CpG sites also reveals underlying biological mechanisms that underpin the tissue-specificity of DNA methylation. Furthermore, we demonstrate that this ontology-aware learning structure enables effective zero-shot learning for tissues not seen in training.

July 13, 2024
17:20-18:00
Invited Presentation: Virtual ChIP-seq: predicting transcription factor binding by learning from the transcriptome
Confirmed Presenter: Michael Hoffman, University Health Network/University of Toronto, Canada
Track: RegSys

Room: 518
Format: In Person
Moderator(s): Shamim Mollah


Authors List:

  • Michael Hoffman, University Health Network/University of Toronto

Presentation Overview:

We will discuss a new method, Virtual ChIP-seq, which predicts binding of individual transcription factors in new cell types using an artificial neural network that integrates ChIP-seq results from other cell types and chromatin accessibility data in the new cell type. Virtual ChIP-seq also uses learned associations between gene expression and transcription factor binding at specific genomic regions. This approach outperforms methods that use transcription factor sequence preferences in the form of position weight matrices, predicting binding for 33 transcription factors.