Pan-cell type continuous chromatin state annotation of all IHEC epigenomes
Confirmed Presenter: Habib Daneshpajouh, Simon Fraser University, Canada
Room: 518
Format: In person
Moderator(s): Alejandra Medina-Rivera
Authors List:
- Habib Daneshpajouh, Simon Fraser University, Canada
- Kay C. Wiese, Simon Fraser University, Canada
- Maxwell W. Libbrecht, Simon Fraser University, Canada
Presentation Overview:
Understanding the mechanistic basis of genetic disease requires annotating the regulatory elements in the human genome. To this end, international consortia such as IHEC, ENCODE, and Roadmap Epigenomics have generated thousands of epigenomic datasets such as ChIP-seq, DNase-seq, and ATAC-seq that measure various biochemical activities in the genome, including transcription factor binding, histone modification, and DNA accessibility. Currently, the predominant methods for integrating these data sets to annotate regulatory elements are segmentation and genome annotation (SAGA) algorithms such as ChromHMM and Segway. SAGA algorithms partition the genome and assign a chromatin state label to each segment, indicating the epigenetic activity at that position. To alleviate the limitations of the discrete SAGA framework, we recently developed epigenome-ssm, a method that produces a vector of continuous chromatin state features at each position that summarizes epigenetic activity. Unlike discrete labels, these continuous features can easily represent varying strengths of a given element and can represent combinatorial elements with multiple types of activities. Here, we present a continuous chromatin state feature map generated using epigenome-ssm on 9,539 genome-wide signal tracks from six core histone modification assays across 1,698 epigenomes. We show that these feature maps constitute an intuitive and visualizable summary of epigenomic data and enable accurate identification of mechanisms of disease association.
Automated and genome-scale exploration of the cis-regulatory code involved in neuronal differentiation
Confirmed Presenter: Océane Cassan, LIRMM, Univ Montpellier, CNRS, Montpellier, France
Room: 518
Format: In person
Moderator(s): Alejandra Medina-Rivera
Authors List:
- Océane Cassan, LIRMM, Univ Montpellier, CNRS, Montpellier, France
- Christophe Vroland, Institut de Génétique Moléculaire de Montpellier, University of Montpellier, CNRS, Montpellier, France
- Raynal Julien, Institut de Génétique Moléculaire de Montpellier, University of Montpellier, CNRS, Montpellier, France
- Masaki Kato, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, Japan
- Hazuki Takahashi, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, Japan
- Takeya Kasukawa, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, Japan
- Piero Carninci, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, Japan & Human Technopole, Milan, Italy
- Chi Wai Yip, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, Japan
- Laurent Bréhélin, LIRMM, Univ Montpellier, CNRS, Montpellier, France
- Charles-Henri Lecellier, Institut de Génétique Moléculaire de Montpellier & LIRMM, Univ Montpellier, CNRS, Montpellier, France
Presentation Overview:
Gene expression is controlled by proximal and distal cis-regulatory elements (CREs), containing DNA motifs bound by various transcription factors (TFs). Other sequence features, such as specific k-mers or low complexity regions, have also been implicated.
However, in a dynamic biological process such as cell differentiation, we lack an understanding of how the transcriptional activity of CREs progressively changes and what sequence features underlie these transitions, which may reflect common and/or coordinated regulatory processes.
Here, we use single-nucleus ATAC-seq with single-cell 5’ RNA-seq to follow, at a genome scale, CREs along differentiation of induced pluripotent stem cells into cortical neurons. We propose a guided clustering algorithm, STOIC (Statistical learning To Optimize Integrative Clustering), that jointly learns the different CRE clusters and their distinctive sequence-level features using an interpretable machine learning approach.
This procedure explores the expression space and delineates the CRE clusters iteratively in order to optimize the performance of a supervised classifier predicting CRE cluster membership based on DNA sequence features.
We show that STOIC provides more predictive sequence-level features than a standard k-means clustering. Furthermore, orthogonal chromatin and TF binding data collected in the same settings are used to validate the inferred CRE clusters and their sequence features, and to associate them with specific enhancer or promoter signatures and biological processes. Our results explore the complexity of the cis-regulatory code at the genome scale and provide an updated perspective on the transcriptional regulation at play during neuronal differentiation.
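The guided-clustering idea described above (choose the clustering whose labels a sequence-based classifier predicts best) can be illustrated with a toy sketch. This is not the STOIC implementation: the single expression value per CRE, the single sequence feature, the nearest-centroid classifier, the balanced-accuracy score, and the function names `balanced_accuracy` and `pick_guided_split` are all assumptions made for illustration.

```python
def balanced_accuracy(features, labels):
    """Balanced training accuracy of a 1-D nearest-centroid classifier."""
    classes = sorted(set(labels))
    # One centroid per cluster: the mean sequence feature of its members.
    centroids = {
        c: sum(f for f, l in zip(features, labels) if l == c) / labels.count(c)
        for c in classes
    }
    recalls = []
    for c in classes:
        members = [f for f, l in zip(features, labels) if l == c]
        hits = sum(
            1 for f in members
            if min(classes, key=lambda k: abs(f - centroids[k])) == c
        )
        recalls.append(hits / len(members))
    return sum(recalls) / len(recalls)


def pick_guided_split(expression, seq_feature, thresholds):
    """Try each candidate two-cluster split of the expression values and keep
    the one whose cluster labels are best predicted from the sequence feature.
    Returns (threshold, score), or None if no threshold yields two clusters."""
    best = None
    for t in thresholds:
        labels = [int(e > t) for e in expression]
        if len(set(labels)) < 2:
            continue  # degenerate split: everything in one cluster
        score = balanced_accuracy(seq_feature, labels)
        if best is None or score > best[1]:
            best = (t, score)
    return best


# Hypothetical data: per-CRE activity and one sequence feature (e.g. GC content).
expression = [0.1, 0.2, 0.3, 0.8, 0.9, 1.0]
gc_content = [0.30, 0.32, 0.31, 0.60, 0.62, 0.61]
print(pick_guided_split(expression, gc_content, [0.15, 0.5, 0.85]))
```

In this toy setting the clustering is "guided" only in the sense that candidate partitions are scored by classifier performance rather than by within-cluster variance, which is what a standard k-means objective would use.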
Expanding GTEx dataset with brain ontology-based graph neural networks to investigate genetic impacts on brain diseases
Confirmed Presenter: Jianfeng Ke, University of Massachusetts Lowell, United States
Room: 518
Format: In person
Moderator(s): Alejandra Medina-Rivera
Authors List:
- Jianfeng Ke, University of Massachusetts Lowell, United States
- Rachel Melamed, University of Massachusetts Lowell, United States
- Tingjian Ge, University of Massachusetts Lowell, United States
Presentation Overview:
The human brain, with its intricate network of diverse regions, profoundly influences disease development. The Genotype-Tissue Expression (GTEx) program gathered transcriptome data and matched genotype data from over three hundred post-mortem donors, which allows us to understand how genetic variation can impact gene expression in diverse regions. However, the GTEx dataset included only 13 brain regions, and only 10% of subjects had all brain regions measured. Improving the completeness of gene expression data within the GTEx project has the potential to elucidate the impact of disease risk variants on gene regulation in crucial tissues relevant to disease development. A possible resource to address this issue is the Allen Human Brain Atlas dataset, which collected transcriptome data from post-mortem brain tissue samples from 6 individuals, covering over a hundred distinct brain subregions. Leveraging the Allen dataset, we proposed a graph neural network model based on an expert ontology describing a hierarchy of increasingly fine-grained brain regions. This Graph Ontology model can predict gene expression in 103 subordinate or previously uncollected brain regions for subjects within the GTEx dataset. We showed that our model outperformed several existing multi-tissue imputation models. Our model extended the initial 13 GTEx regions to 103 subordinate regions, enabling us to explore how genetic variation represented in GTEx can impact diverse disease-relevant regions that were not originally covered by GTEx. Our prediction results can serve as a foundation for future investigations into how specific genetic variations influence diseases by altering gene expression patterns across a wide range of brain regions.
Interpretable single-cell factor decomposition using sciRED
Confirmed Presenter: Delaram Pouyabahar, Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada
Room: 518
Format: In person
Moderator(s): Alejandra Medina-Rivera
Authors List:
- Delaram Pouyabahar, Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada
- Tallulah Andrews, Departments of Biochemistry and Computer Science, University of Western Ontario, London, Ontario, Canada
- Gary Bader, Departments of Molecular Genetics and Computer Science, University of Toronto, Toronto, Ontario, Canada
Presentation Overview:
Single-cell RNA sequencing (scRNA-seq) enables the exploration of gene expression heterogeneity within large cell populations, which arises from both biological and technical factors. Inferring gene expression programs from scRNA-seq data is challenging due to noise, sparsity, and high dimensionality; computational approaches such as matrix factorization address these challenges. Specialized factorization techniques tailored for scRNA-seq, such as glmPCA and cNMF, have emerged in recent years. However, the resulting factors must be manually interpreted. To address this gap, we developed sciRED, a tool to improve the interpretation of scRNA-seq factor analysis. sciRED implements a four-step approach to characterizing gene expression programs: (1) removing confounding effects and using rotations to maximize factor interpretability, (2) calculating association statistics to map factors to known covariates, (3) highlighting unexplained factors that may indicate hidden biological phenomena, and (4) determining the genes and biological processes represented by unexplained factors. We apply sciRED across diverse datasets, including the scMixology benchmark dataset and four biological single-cell atlases. Specifically, we showcase its application in identifying cell identity programs and sex-specific variations in a kidney map, discerning strong and weak stimulation signals in a PBMC dataset, eliminating ambient RNA contamination in a rat liver atlas to unveil strain variations, and revealing hidden biology, represented by a rare cell type signature and anatomical zonation gene programs, in the healthy human liver map. These applications demonstrate the utility of our approach on real datasets for characterizing intricate biological signals within scRNA-seq maps.
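Steps (2) and (3) of the approach above can be sketched in a few lines. The abstract does not say which association statistics sciRED computes, so this sketch assumes the absolute Pearson correlation between per-cell factor scores and a one-hot covariate indicator; the function names and the 0.5 threshold are hypothetical and do not reflect the sciRED API.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def factor_covariate_associations(factor_scores, covariate_labels):
    """Step (2), assumed form: |correlation| of each factor with each covariate level.

    factor_scores: dict mapping factor name -> list of per-cell scores.
    covariate_labels: list with one covariate label per cell.
    """
    levels = sorted(set(covariate_labels))
    return {
        name: {
            level: abs(pearson(
                scores,
                [1.0 if lab == level else 0.0 for lab in covariate_labels],
            ))
            for level in levels
        }
        for name, scores in factor_scores.items()
    }


def unexplained_factors(table, threshold=0.5):
    """Step (3), assumed form: factors not associated with any covariate level."""
    return [name for name, row in table.items() if max(row.values()) < threshold]


# Hypothetical toy data: four cells, two factors, one covariate (sex).
scores = {"F1": [1.0, 1.1, -1.0, -0.9], "F2": [0.5, -0.5, 0.5, -0.5]}
sex = ["male", "male", "female", "female"]
table = factor_covariate_associations(scores, sex)
print(unexplained_factors(table))
```

In this toy example F1 tracks the covariate closely while F2 does not, so F2 is the factor that would be carried forward to step (4) for gene-level interpretation.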
Accurate allocation of multi-mapped reads enables regulatory element analysis at repeats
Confirmed Presenter: Shaun Mahony, Penn State University, United States
Room: 518
Format: In person
Moderator(s): Alejandra Medina-Rivera
Authors List:
- Alexis Morrissey, Penn State University, United States
- Jeffrey Shi, Penn State University, United States
- Daniela James, Penn State University, United States
- Shaun Mahony, Penn State University, United States
Presentation Overview:
Transposable elements (TEs) and other repetitive regions have been shown to contain gene regulatory elements, including transcription factor binding sites. Unfortunately, regulatory elements harbored by repeats have proven difficult to characterize using short-read sequencing assays such as ChIP-seq or ATAC-seq, as most regulatory genomics analysis pipelines discard “multi-mapped” reads. To address this shortcoming, we developed Allo, a new approach to allocate multi-mapped reads in an efficient, accurate, and user-friendly manner. Allo combines probabilistic mapping of multi-mapped reads with a convolutional neural network that recognizes the read distribution features of potential peaks, offering enhanced accuracy in multi-mapping read assignment. To demonstrate Allo’s potential, we apply it to reanalyze almost 500 transcription factor ChIP-seq datasets from K562 cells. This analysis resulted in over 385,000 previously unidentified transcription factor binding sites in repetitive regions of the genome. We find that Allo is particularly beneficial in identifying ChIP-seq peaks at centromeres and in younger TEs. Specifically, we find novel associations between individual TFs and the recently expanded SVA and ERVK transposon families. We also find that Allo has a striking ability to disambiguate multi-mapped reads at recently duplicated genes. Using Allo, we analyze how regulatory elements diverge at recently generated paralogous genes, enabling new regulatory insights at sites of recent evolutionary novelty that often get overlooked in regulatory genomics analyses. Finally, we demonstrate that TF binding sites harbored by repeats are particularly difficult for neural network-based methods to predict de novo, and we speculate on approaches that can offer improved performance.
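The probabilistic part of multi-mapped read allocation can be sketched as follows. This is not Allo's implementation and omits the convolutional neural network the abstract describes; it only shows the general idea of assigning a read to one of its candidate loci with probability proportional to the uniquely mapped reads nearby. The function names, the 200 bp window, and the pseudocount are assumptions.

```python
import random


def allocation_probabilities(candidates, unique_counts, window=200, pseudocount=1.0):
    """Probability of each candidate locus, proportional to nearby unique reads.

    candidates: list of (chrom, position) alignments for one multi-mapped read.
    unique_counts: dict mapping (chrom, position) -> number of uniquely mapped reads.
    """
    weights = []
    for chrom, pos in candidates:
        nearby = sum(
            n for (c, p), n in unique_counts.items()
            if c == chrom and abs(p - pos) <= window
        )
        # The pseudocount keeps loci with no unique reads from being ruled out.
        weights.append(nearby + pseudocount)
    total = sum(weights)
    return [w / total for w in weights]


def allocate_read(candidates, unique_counts, rng, window=200):
    """Sample a single locus for the read according to the probabilities above."""
    probs = allocation_probabilities(candidates, unique_counts, window)
    return rng.choices(candidates, weights=probs, k=1)[0]


# Hypothetical example: one read aligning equally well to two repeat copies.
unique_counts = {("chr1", 1000): 9, ("chr2", 5000): 0}
candidates = [("chr1", 1050), ("chr2", 5000)]
print(allocation_probabilities(candidates, unique_counts))
print(allocate_read(candidates, unique_counts, random.Random(0)))
```

Sampling a single locus, rather than splitting the read fractionally, keeps the output compatible with peak callers that expect one alignment per read; that choice is also an assumption of this sketch.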
Q&A for Flash Talks
Room: 518
Format: In person
Moderator(s): Alejandra Medina-Rivera