Presentation Overview: Show
Transcriptional regulatory networks (TRNs) are a powerful conceptual framework to describe mechanisms of gene expression changes accompanying biological processes. It is increasingly apparent that TRNs can change between conditions, e.g., due to mutations or epigenomic changes associated with disease progression. Several statistical methods have been proposed for inference of TRN changes from transcriptomic data under two biological conditions or sample groups. These methods generally report inter-group changes in correlation between transcription factors’ (TFs) and target genes’ expression, revealed either by pairwise analysis or multi-variable regression and machine learning models. However, correlation-based discovery of a TF’s regulatory influence on a gene and of changes in such influence between biological groups suffers in the presence of confounders such as other TFs and group identity. This results in the reconstructed TRN and differential TRN deviating from the true causal network of interest.
We present a causal inference approach to the reconstruction of TF-gene relationships that change between two biological groups. We define the desired differential relationship as an estimand of “do calculus”, tied to a causal diagram that explicitly encodes the confounding effects of other TFs and of the biological group or condition. We develop a computational tool called CIMLA (Counterfactual Inference by Machine Learning and Attribution Models) to estimate this quantity, using non-linear models such as random forests and neural networks to fit gene expression as a function of TFs’ expression, “SHAPley Additive exPlanations” (SHAP) scores to assess each TF’s influence on a gene in each sample, and aggregating inter-group differences of this influence across all samples.
We compared CIMLA with leading methods for differential GRN inference (identified in a recent benchmarking study), using data simulated by our biophysics-inspired SERGIO simulator, and showed that CIMLA significantly outperforms these methods, especially when confounders are present. We also showed that CIMLA can identify differential regulatory relationships that do not exhibit significant differential correlation – a capability at which correlation-based methods commonly fall short.
We employed CIMLA to analyze a previously published single-cell RNA-seq data comprising more than 60,000 human brain cells collected from subjects with and without Alzheimer’s disease (AD). Our analysis points to several potential regulators of AD, some of which are strongly supported by the literature.
Finally, we note that the applications of CIMLA go beyond differential GRN inference; with simple modifications it can be extended to the inference of condition-related changes in a variety of causal associations from observational data.
Presentation Overview: Show
Linear and nonlinear latent variable models to infer cellular manifolds are a corner-stone component of the analysis of single-cell (sc) RNA-seq data. Experimental advances have enabled to jointly profile gene expression and chromatin accessibility in the same physical cells, which yields a wealth of additional information that are not fully exploited by existing approaches. In particular, the integrative analysis of multi-ome data promises to enable the comprehensive recovery of cis-regulatory logic, which could improve the interpretability and the regulatory relevance of inferred latent space components.
Here, we present single-cell Deep multi-Omic Regulatory Inference (scDoRI), an end to end deep learning (DL) model that decomposes variation in gene expression and chromatin accessibility into continuous states (topics) of regulatory modules (gene-regulatory networks, GRNs). For each topic, scDoRI learns a set of co-accessible cis-regulatory elements and links them to downstream genes in a predictive framework. By modelling transcription factor (TF) binding at cis-regulatory elements using evidence from pretrained sequence based DL models, accessibility and expression, scDoRI learns topic-specific links between TFs and their downstream target genes. This yields directly interpretable latent space components in terms of GRNs, aiding downstream analyses and exploration of multi-omic scRNA-ATAC datasets. To illustrate the model, we apply scDoRI to a multi-omic sc RNA-ATAC atlas of mouse gastrulation to discover continuous changes in regulatory modules and infer fine-grain TF activities across different differentiation trajectories. By exploiting the interpretability of the latent space, scDoRI recovers context-specific changes in GRNs upon TF perturbation and accurately predicts the effect of perturbations in unseen contexts. Finally, we apply scDoRI to a multi-omic scRNA-ATAC and spatial transcriptomics dataset of Glioblastoma (GBM) patients to discover changes in GRNs in heterogeneous GBM cell-states across patients. We develop a transfer learning approach to map regulatory modules on spatial data and find colocalization of specific GBM GRNs with immune GRNs in a patient specific manner.
Presentation Overview: Show
Transcription factor-DNA binding is an important aspect of gene regulation, but the effect of cooperativity between TFs at neighboring binding sites has not been sufficiently characterized. In the present work we describe a protein-DNA binding microarray (PBM) methodology to measure TF-TF cooperativity and demonstrate that the resulting data can be used to train (via supervised machine learning) accurate classification and regression models to predict cooperativity. We additionally show that analysis of our data and models aligns with low-throughput results regarding mechanism of cooperativity for the two TFs we selected for our case-studies, ETS1 and RUNX1.
We begin by selecting a number of probe sequences from the genome that have putative, neighboring binding sites for both TFs of interest. Following a standard PBM protocol, we measure TF binding levels when the DNA library is incubated with each TF individually or with both TFs simultaneously. We then compute the cooperativity as the difference in TF binding levels between the two experiments. A given probe sequence is classified as exhibiting cooperative or independent behavior based on whether the presence of the co-factor TF leads to a statistically significant increase in the TF binding level.
Next, we demonstrate that the cooperativity data collected via our novel experimental approach can be used to train accurate classification and regression models. We found that the random forest classifier is effective for distinguishing whether TFs with neighboring binding sites on particular sequences will exhibit independent versus cooperative binding. For the regression task, we investigated the efficacy of random forest regression (RFR), support vector regression (SVR), and convolutional neural networks (CNN). For RFR and SVR, many of the same features were used, such as predicted binding affinity of each TF, the distance between binding site centers, the relative orientation of the two binding sites, as well as DNA sequence and DNA shape features for the region between and outside of the binding site cores. RFR performed well, but error-rates were further reduced when CNNs were used for modeling, especially when using a combination of one-hot encoding (to capture the DNA sequence) and the predicted binding site strengths. This suggests that while important information can be gleaned directly from the sequence data, there is information on binding site strength that the CNN is unable to learn directly from sequence.
Presentation Overview: Show
While transcription factors (TFs) are central to the establishment of cell fates, we still know little about how cell-specific TF regulatory activities result from the interplay between a TF’s sequence preference and cell-specific chromatin environments. To understand the determinants of TF DNA-binding specificity, we need to examine how newly activated TFs interact with sequence and preexisting chromatin landscapes to select their binding sites. Here, we present a principled approach to model the sequence and preexisting chromatin determinants of TF binding. Specifically, we have developed a bimodal neural network that jointly models sequence and prior chromatin data to interpret the binding specificity of TFs that have been induced in well-characterized chromatin environments. The bimodal network architecture allows us to quantify the degree to which sequence and prior chromatin features explain induced TF binding, both at individual sides and genome-wide.
Here, we apply our approach to characterize differential binding activities across a selection of Forkhead-domain TFs when each is expressed in mouse embryonic stem cells. We show that the preexisting chromatin landscape is an important determinant of differential Fox TF binding specificity. While all share similar Forkhead DNA-binding domains with the prototypical “pioneer” factor FoxA1, paralogous Fox TFs display a wide range of abilities to engage relatively inaccessible chromatin. Thus, despite having similar DNA-binding preferences, paralogous Fox TFs can bind to different DNA targets, and drive differential gene expression patterns, even when expressed in the same chromatin environment. We propose that modifying preferences for preexisting chromatin states is an important strategy by which evolution enables the functional diversification of paralogous TFs.
Presentation Overview: Show
R-loops are hybrid structures of DNA and nascent RNA that form widely across eukaryotic and bacterial genomes during transcription and can lead to transcriptional repression and genome instability. Factors such as primary DNA sequence, nascent RNA levels, DNA and RNA binding factors, and chromatin state are thought to influence R-loop metabolism. What is unknown are the relative contributions of these factors towards R-loop formation, and to what extent the underlying rules governing R-loop formation are conserved across species.
To address this, we constructed a deep neural network to predict R-loop formation along the genome that is based on primary DNA sequence and other genomic and epigenomic annotations. Our genomic annotations included those related to nascent and mature transcription, chromatin state (histone marks, methylation), and chromatin accessibility. Surprisingly and despite R-loop formation’s dependence on transcription, we found that DNA sequence was the most predictive factor for R-loop formation and frequency; performance of the DNA sequence-only model was 41% higher than the nascent transcription-only model. When using a DNA sequence-only model as a baseline, we found that adding epigenetic state and nascent transcription data provided modest additional prediction power (5% more). This suggests that the role of DNA sequence is distinct from its influence on transcription and epigenetic state. When interrogating the neural network to identify the most salient DNA sequence features that predicted R-loop formation, we found that local islands of G’s and T’s provided strikingly high contributions (positively and negatively, respectively). Simulations removing those islands yielded substantial decrease in predicted R-loop formation.
Finally, an open question is the extent to which the underlying molecular rules governing R-loop formation is conserved across species. To probe this, we trained individual models on human, mouse, zebrafish, chicken, fruit fly, and roundworm datasets and evaluated how well a model trained on one species could predict R-loop data of another species. We found that models trained on human data were 84% as predictive of yeast R-loop formation as other models trained directly on yeast, despite the large variance in GC content and gene density between their genomes. Our results suggest high conservation of sequence rules governing R-loop formation.
Presentation Overview: Show
Genome-wide maps of epigenetic modifications are powerful resources for non-coding genome annotation. Maps of multiple epigenetics marks have been integrated into sample-specific chromatin state annotations for many samples of various cell or tissue types. With the increasing availability of multiple chromatin state maps for biologically similar samples, there is a need for methods that can effectively (1) summarize the information about chromatin state annotations within groups of samples and (2) identify chromatin state differences across groups of samples at a high resolution.
To address this, we developed CSREP, which takes as input chromatin state annotations for a group of samples and then probabilistically estimates state assignment probabilities at each genomic position, which it then summarizes with the maximum-probability chromatin state at each position for the group. CSREP uses an ensemble of multi-class logistic regression classifiers to predict the chromatin state assignment of each sample given the state maps from all other samples with shared biological properties. The difference of CSREP’s summary probability assignments for two groups can be used to identify genomic locations with differential chromatin state patterns.
Using groups of chromatin state maps of a diverse set of cell and tissue types, we demonstrate the advantages of using CSREP to summarize chromatin state maps. We show that CSREP is better able to predict the chromatin state of held out samples in a group than the counting-based baseline approach. We also show that CSREP can identify biologically relevant differences between groups at a high resolution, such as predicting cell-type-specific peaks of histone modification marks. The CSREP source code, and the summary chromatin state maps for 11 sample groups from Roadmap Epigenomics project, and 75 sample groups from Epimap is available at http://github.com/ernstlab/csrep.
Presentation Overview: Show
The three-dimensional (3D) organization of the genome, which determines how the DNA is packaged inside the nucleus, has emerged as a key regulatory mechanism of cellular function and malfunction. High-throughput chromosomal conformation capture (Hi-C) technologies have enabled the study of 3D genome organization by experimentally measuring the tendency of genomic regions to interact with one another in 3D space. Changes or disruptions to the 3D genome organization have been associated with disease, developmental, and evolutionary processes. Therefore, a key problem in regulatory genomics is to systematically detect higher-order structural changes across Hi-C datasets from multiple timepoints and conditions. Existing methods to detect changes in 3D genome organization are limiting in one or more ways: they identify changes at the individual interaction level, they offer only pairwise comparison between two datasets at a time, or they do not account for more complex nested relationships among the conditions represented by the datasets. We address these limitations with a new approach, Tree-Guided Integrated Factorization (TGIF) to examine the dynamics of topologically associating domain (TAD) boundaries. TGIF is based on hierarchical multi-task Non-negative Matrix Factorization (NMF), a dimensionality reduction method particularly useful for co-clustering the row and the column entities of a matrix. TGIF takes as input Hi-C matrices from multiple related biological conditions, for example, different cell types from a cell lineage. TGIF constrains the lower-dimensional factors from closely related tasks (e.g. cell types derived from the same “parent” of the cell lineage) to be similar. The output factor matrices represent the lower-dimensional view of the chromosomal architecture. We use these low-dimensional views to calculate a TAD boundary score for each region and identify differential or common boundaries across all conditions. Compared to existing approaches, TGIF recovers ground-truth differential TAD boundaries with higher precision in simulated data and is also less susceptible to identifying spurious boundaries due to variation in dataset depth. We applied TGIF to study across multiple cell types as well as across a time course. From a cardiomyocyte differentiation time course data, TGIF identified embryonic-stage-specific TAD boundaries associated with transcriptionally active HERV-H retrotransposon; such sites are known to demarcate TAD boundaries in human and ape pluripotent cells. We further identify transcription factor motifs enriched in embryonic-stage-specific boundary regions and demonstrate that cardiovascular-disease associated GWAS SNPs are depleted in such boundary regions. Together, these results demonstrate the utility of TGIF to identify biologically meaningful topological shifts in genome organization.
Presentation Overview: Show
Loss of chromosome Y (LOY) is the most common form of clonal mosaicism in males. Presence of this phenomenon is most pronounced in leukocytes and increases with age. LOY is also associated with an increased risk for disease and mortality.
Within a single cell, LOY is an event leading to a complete loss of chromosome Y and causes disappearance of almost 2% of the male haploid nuclear genome. This should in turn lead to changes in DNA’s compaction and its rearrangement in the cell’s nucleus and hence alter the epigenetic processes that influence gene expression. Among these are DNA methylation, histone modifications, as well as higher level 3-dimensional structures such as topologically associating domains.
Significantly reduced DNA methylation levels were found to coincide with low Y levels. Our previous research showed that LOY has a clear transcriptional effect in leukocytes, and it is associated with dysregulation of multiple genes. Many of these are involved in immune functions. Moreover, we found that levels of LOY in leukocytes were varying dependent on the investigated cell type but had the highest range of values in granulocytes and monocytes. These two cell types make up the major human blood fractions and play key roles in inflammatory and anti-microbial defense processes.
Alzheimer disease (AD), the most common chronic neurodegenerative disorder and cause of dementia in the elderly, is a major public health problem worldwide. Loosing chromosome Y is associated with higher risks of AD. Disruption in the inflammatory signals has also been documented to be implicated AD.
Here we aimed at investigating the epigenetic effect of LOY in leukocytes in the Alzheimer’s disease. We generated a broad set of DNA methylation data representing AD patients and healthy controls. All these had their LOY levels estimated that allowed to further divide them into LOY and non-LOY groups. We confirmed the involvement of dysregulated genes (both at the level of DNA methylation, as well as bulk and single-cell RNA expression) in immune processes and found these to be abnormally activated in LOY. Additionally, we show that LOY-associated epigenetic changes are much more pronounced in AD when compared to controls.
Presentation Overview: Show
Tandem repeats (TR) represent one of the largest sources of genetic variation in humans. While repetitive regions of the genome have been understudied mainly due to the technical challenges they present, the advent of improved sequencing technology and bioinformatics methods for genome-wide analysis now enable more systematic study of the contribution of TRs to human phenotypes. Here I will discuss how new computational methods have uncovered thousands of TRs associated with gene regulation or other complex phenotypes, and have revealed new biological insights into these traits.
Presentation Overview: Show
The exploration of ligand-receptor networks has attracted a lot of attention in current research, especially through single-cell omics. Here, we introduce BulkSignalR, a R package that offers comparable capabilities for bulk transcriptomics or proteomics as well as spatial transcriptomics.
Despite the revolution brought by single-cell methods, sample cohorts that were (or are) profiled in bulk and accompanied with clinical data remain much larger so far. It is therefore very relevant to be able to extract cellular signaling networks from the latter, as an orthogonal and yet complementary option to single-cell investigations.
Moreover, the other revolutionary technology that is spatial transcriptomics is usually performed at a scale where every data point is a localized bulk transcriptome. The transcriptomes of 10-30 cells are combined. Accordingly, our bulk-dedicated model can be applied to such spatial data also.
In its implementation, BulkSignalR leverages our previous work in single-cell data with SingleCellSignalR and LRdb (Cabello-Aguilar, NAR, 2020). It integrates ligand-receptor interactions with receptor downstream pathways to estimate interaction statistical significance. A range of visualization methods complement the statistical analysis including functions dedicated to spatial applications. BulkSignalR can be virtually applied to any species thanks to a built-in generic ortholog mapping mechanism. User-specific putative interactions can be added to our reference database LRdb.
We show how easily BulkSignalR analyses can be conducted through a few lines of R thanks to its S4 object-oriented design. Examples from cancer and development biology will illustrate the rich graphical and analytical functionalities, including those dedicated to spatial data.
Presentation Overview: Show
Imaging-based spatial transcriptomics technologies such as MERFISH and SEQFISH provide an extraordinarily detailed view of biological processes, which includes individual transcript locations within a cell, across thousands of cells, of hundreds to thousands of genes. We present InSTAnT (Intra-cellular Spatial Transcriptomics Analysis Toolkit), a toolkit for extracting molecular relationships from spatial transcriptomics data at the intra-cellular resolution. InSTAnT detects gene pairs and modules with unusual patterns of co-localization within and across cells. Intra-cellular spatial patterns discovered by InSTAnT may have diverse biological interpretations, including RNA-RNA interactions with regulatory functions, formation of condensates, shared sub-cellular localization, etc., and thus provide a rich compendium of testable hypotheses regarding molecular functions.
At the heart of the InSTAnT suite is a statistical test to detect “proximal pairs” of genes in a single cell’s spatial transcriptome. This test determines if transcripts of a gene pair are located within a distance d significantly more often than expected by chance. We define a gene pair to be “d-colocalized” if it is detected as a proximal pair in many cells, thereby increasing our confidence in a spatial relationship between the two genes. InSTAnT detects d-colocalization using the “Conditional Poisson Binomial” (CPB) test, which is based on a Poisson Binomial distribution and is specially designed to (a) allow for the fact that different cells can have varying numbers of proximal pairs and (b) to deemphasize abundant genes, thus yielding a diverse pool of gene pairs. It then annotates the detected gene pairs by cellular regions e.g., nuclear, perinuclear, cytoplasmic or perimembrane, where they tend to colocalize, and provides specialized statistical tests to identify d-colocalized pairs specific to a cell type. InSTAnT also implements a probabilistic graphical model that detects special cases where an intra-cellular colocalization signal exhibits a non-random spatial pattern at the inter-cellular level. Finally, it relies on a frequent sub-graph mining algorithm to discover gene colocalization modules that are supported by many cells.
We demonstrate the usefulness of the InSTAnT toolkit through extensive analysis of two MERFISH datasets, profiling a human U2OS cell line and the mouse hypothalamic preoptic region respectively. We report numerous colocalization patterns revealed by these analyses, along with rigorous statistical assessment and supporting evidence from databases and predicted RNA interactomes. We identify several novel cell type-specific gene colocalizations in the brain.
In summary, InSTAnT is a powerful toolkit for unbiased spatial pattern discovery and analysis for spatial transcriptomics technologies of the future.
Presentation Overview: Show
Transcription factors (TFs) are important modulators of cell fate and function, responsible for large-scale changes in response to the environment or intercellular communication. Hence, other types of cells in proximity are critical for instructing cell context-specific TF activities. Since TFs are typically expressed at low levels and regulated post transcriptionally and/or post translationally, estimating TF activity in specific contexts is a formidable challenge. TF activity can be indirectly assessed by target gene expression. Emerging spatially transcriptomics technologies measure genome-wide mRNA expression across thousands of spots on a tissue slice while preserving information about the location of spots and allowing characterization of the microenvironment. To date, these datasets have not been used to systematically infer the underlying context-specific TFs defining cell identity. Here, we present STAN (Spatially informed Transcription Factor Activity Network), a computational method to predict spot-specific TF activities by utilizing spatial transcriptomics datasets and cis-regulatory information. Specifically, we develop a linear mixed effect model that integrates curated TF target-gene interactions, mRNA expression, spatial coordinates, and imaging data to learn gene regulatory programs that predict the expression of target genes. Spatial coordinates and morphological features extracted from corresponding imaging data are used to promote spatially cohesive gene regulatory programs. Importantly, our model can identify TFs whose activity are primarily dependent on cell type proportions, and TFs whose cell-type specific activity vary spatially by using cell-type deconvolution approaches.
We apply STAN to breast cancer spatial transcriptomics data from 15 patients. For statistical evaluation, we computed the mean Spearman correlation between predicted and measured gene expression profiles on held-out samples and obtained significantly better performance with models utilizing spatial coordinates and imaging data. Clustering of spots by STAN-predicted TF activities largely recovered the distinct spatial domains. For example, we show that STAN yields biological insights into TFs associated with structures such as tertiary lymphoid structures (TLSs). TLSs are well recognized and consist of T cell-rich areas containing dendritic cells and B cell-rich areas containing germinal centers. STAN-predicted BCL6 activity is associated with B-cells in samples with TLSs. Indeed, BCL6 is a TF repressor which has emerged as a critical regulator of germinal centers. The corresponding mRNA levels for BCL6 are lowly abundant in the transcriptomics data, which highlights the advantage of our method for quantifying the effect of TFs. Taken together, STAN enhances the utility of spatial transcriptomics datasets to uncover TF and spatial relationships in diverse cellular states.
Presentation Overview: Show
A major goal of cancer biology is to fill in the gap between somatically acquired protein-altering mutations and malignant transformation of cells. Two central approaches—identification of spatially clustered mutations on protein structures and inference of commonly dysregulated cellular pathways—interpret the potential function of mutations at different scales. Integrating these two approaches is a promising way to decipher the multiscale functional effects spread from mutated amino acid residues to cellular systems. However, limited 3D structural coverage of the human proteome and interactome has hindered understanding of how the local mutation clusters are organized in the protein-protein interaction network. Here we compiled a complete repository of the structures of every single protein as well as interfaces for all known interacting protein pairs in humans which are determined by either experimental or deep learning-based approaches, and developed a computational tool integrating spatial cluster identification with 3D structurally-informed protein network analysis to create a multiscale functional map of somatic mutations in cancer. By applying our methodology to 5,950 TCGA tumors across 19 cancer types, we identified 1,656 intra- and 3,343 inter-protein mutation clusters, of which ~50% would not have been found if using only experimentally-determined protein structures. Moreover, studying the global organization of local mutation clusters led to a 5.5-fold increase in the number of significantly dysregulated protein subnetworks, the majority of which were previously blurred by non-clustered background mutations using standard network analyses. A notable discovery was a group of local mutation clusters converging on non-canonical PRC1, a protein complex repressing cell cycle and metabolic genes. Altogether, this study demonstrates that charting the functional landscape of cancer mutations requires a combination of their local spatial organization within protein structures and their global organization at the network level. Our multiscale functional map of somatic mutations can ultimately spark the discovery of novel molecular mechanisms underlying tumorigenesis.
Presentation Overview: Show
Genome-scale sequencing and screening data provide exceptionally detailed genetic and transcriptional profiles of cells. These can potentially be harnessed for early cancer detection and targeted treatment. One of the most critical steps in interpreting such data is biological pathway analysis. Pathway analysis identifies statistical relationships between experimentally derived group of genes and groups of genes (gene sets) with known biological functions. Several methods are available for pathway analysis (e.g., GSEA and Enrichr). However, the performance of these methods to identify correct biological pathways from either bulk or single cell data and to rank them by importance remains unknown, largely due to the absence of evaluation platforms using real biological data.
Here, we develop a comprehensive benchmark for evaluating pathway analysis methods. This benchmark is composed of experimental data collected from a range cell types, assays and species. Using this benchmark, we assessed the performance of commonly used pathway analysis methods at correctly identifying true biological pathways in bulk RNA-seq data. We specifically looked at true and false positive rates, robustness to noise and key parameters in analysis such as gene set size and ranking of genes in input genesets. Our benchmark determined best practices for conducting pathway analysis and a simple statistical stratification to reduce the number of false positive pathways from these analyses. Additionally, we developed a new ensemble pathway enrichment method that combines multiple approaches and greatly outperforms underlying methods in all key parameters. Importantly, we have applied best practices to predict pathways associated with poor and good survival in early-stage TCGA cancer patients. We show that this approach is able to identify novel prognostic markers with strong predictive power for patient survival. Finally, we predicted drugs that modulate the expression of prognostic markers for further therapeutic use. In summary, we have created novel benchmarking and pathway analysis tools to enable unlocking the full potential of genome-scale studies.
Presentation Overview: Show
Understanding of mutagenic processes acting on cancer cells is fundamental for a better understating of etiology of carcinogenesis and designing treatment strategies. There is a growing appreciation that mutagenic processes can be studied through the lenses of mutational signatures, which represent characteristic mutation patterns attributed to individual mutagens. However, the causal link between mutagens and observed mutation patterns remains not fully understood, limiting the utility of mutational signatures.
To gain insights into these relationships, we developed a network-based method, named GeneSigNet that uncovers dominant influences among genes and mutational signatures. Without using perturbation data or prior knowledge, GeneSigNet relies solely on the purely observed activities of nodes and infers weighted-edges and their directions among genes and signatures. As input, patient-specific activity of a node corresponding to a gene is measured by its gene expression and the exposure of a mutational signature is seen as a measure of the activity of the corresponding mutagenic process. The network construction consists of two main steps: First, a sparse partial correlation technique is used to obtain an initial sparse weighted directed network. This graph contains both unidirectional and bidirectional edges. Next, where applicable, a partial higher moment strategy is used to orient bidirectional edges between nodes.
In general, the inference of direction of influence relations from statistical dependencies without additional prior knowledge about regulatory mechanisms or dedicated perturbation experiments is highly challenging. However, it has a potential advantage of capturing the causative dependencies. GeneSigNet provides an important step toward this direction that is independent of the specific application considered in this study.
Applying GeneSigNet to cancer data sets, we uncovered important relations between mutational signatures and several cellular processes that can shed light on cancer related mutagenic processes. Our results are consistent with previous findings such as the impact of homologous recombination deficiency on a clustered APOBEC mutations in breast cancer. The network identified by GeneSigNet also suggest an interaction between APOBEC hypermutation and activation of FOXP3 expressing regulatory T Cells that suppress anti-tumor immune response. It also inferred a relation between APOBEC mutations and changes in DNA conformation, supporting the view that DNA conformational changes sensitive DNA to APOBEC mutations. GeneSigNet also exposed a possible link between the SBS8 signature of unknown aetiology and the nucleotide excision repair pathway. These results demonstrate that GeneSigNet provides a new and powerful method to reveal the relation between mutational signatures and gene expression.
Presentation Overview: Show
Stem cells and their progeny play critical roles in diverse physiological and pathological processes. While many computational tools have been developed to infer developmental trajectories from single cell RNA-sequencing (scRNA-seq) data, such methods require specialized prior knowledge or are limited to inferring developmental orderings that are relative to each dataset (e.g., CytoTRACE, CellRank, RNA velocity, Monocle 3). To overcome these challenges, we developed CytoTRACE 2, an integrated deep learning framework for simultaneously predicting developing orderings and the absolute developmental potential of single cells, known as potency, from scRNA-seq data for the first time. Potency describes the capacity of a cell to create more specialized cells within a developmental continuum – from totipotent stem cells (e.g., fertilized egg) and pluripotent stem cells (embryonic stem cells) to multipotent cells (adult stem cells) and cells with increasingly restricted developmental potential. We trained and validated CytoTRACE 2 on 30 human and mouse scRNA-seq datasets with experimentally confirmed potency levels encompassing 101,089 cells, 39 tissues, and 9 platforms. CytoTRACE 2 was substantially more accurate than existing methods at predicting developmental orderings and potency in held-out datasets. To demonstrate its utility, we applied CytoTRACE 2 to >1.6M mouse cells from five datasets spanning diverse stages of embryonic development and 386,276 malignant cells from 18 cancer types. Without requiring batch correction or dataset integration, CytoTRACE 2 correctly reconstructed mouse developmental ontogeny and revealed striking associations between clinical outcomes and the predicted potency levels of human cancer cells. Our results highlight the value of CytoTRACE 2 for characterizing single-cell potency in health and disease, with implications for basic and clinical research, including ongoing cell atlasing efforts.
Presentation Overview: Show
Understanding how single cells divide and differentiate into different cell types in developed organs is one of the major tasks of developmental biology. Recently, lineage tracing technology using CRISPR/Cas9 genome editing have enabled simultaneous readouts of gene expressions and lineage barcodes, which allows for reconstruction of the cell division tree, and inference of ancestral cell types and differentiation pathways at the whole organism level. While most state-of-the-art methods for lineage reconstruction utilizes only the lineage barcode data, some hybrid methods start to emerge to incorporate gene expression data, aiming to further improve the accuracy of lineage reconstruction. The hybrid methods tend to vary in their ways to utilize gene expression, which can potentially not improve, or even damage the accuracy. Here, we present LinRace (lineage reconstruction with asymmetric cell division model), a hybrid method which combines the lineage information and gene expression using the asymmetric cell division model, and infers cell lineage under a joint neighbor joining and maximum-likelihood framework. For a candidate tree, a combined likelihood function of gene expression and lineage barcode data, which prefers proper placement of cell state and barcode, is implemented to evaluate the state-lineage consistency on the tree. Using local search with the subtree swapping technique, LinRace searches the tree space for the optimal substructure of cells with maximum likelihood. Using the maximum-likelihood framework, LinRace is able to search for tree topologies that are neglected by the Neighbor Joining heuristic. With a more biologically motivated way of combining the two modalities, LinRace can reconstruct cell lineages with better state-lineage coherence and uncover new insights on cell fate decisions.
Presentation Overview: Show
Single-cell genomics datasets offer vast new resources with which to study cells, but their potential to inform parameter inference of cell dynamics has yet to be realized. Here we develop methods for Bayesian parameter inference with data that jointly measure gene expression and fluorescent reporter dynamics in single cells. By ordering cells by their transcriptional similarity, the posterior distribution of one is used to inform the prior distribution of its neighbor via transfer learning along a cell chain. In application to Ca2+ signaling dynamics, we fit the parameters of thousands of models of variable single-cell responses. We show that transfer learning accelerates inference, although constructing cell chains by gene expression does not improve over randomly ordered cells. Clustering cell posterior distributions reveals that only using similarity-based chains – and not randomly sampled chains – can we distinguish Ca2+ dynamic profiles and associated marker genes. Both at the global and the individual-gene level, transcriptional states predict features of the Ca2+ response. We discover complex and competing sources of cell heterogeneity: parameter covariation can diverge between the intracellular and intercellular contexts. Single-cell parameter inference thus offers broad means to quantify relationships between transcriptional states and the dynamic responses of cells.
Presentation Overview: Show
Large-scale copy number variation (CNV) is a major driver of tumor heterogeneity, which is the phenomenon by which cellular clones within the tumor may demonstrate differing disease states and responses to therapy. CNVs are traditionally profiled at the bulk level through whole genome sequencing (WGS), and clonal structure is inferred from this bulk data. Understanding these genomic changes along with respective transcriptional output at a single-cell resolution has the potential to inform both tumorigenesis, as well as genomic and phenotypic determinants of drug response/resistance. Here, we propose ECHIDNA, a novel probabilistic model that leverages single-cell RNA sequencing (scRNA-seq) data to deconvolve concurrently collected, population-matched bulk WGS data into constituent clones defined by a Dirichlet process. Jointly, we factorize scRNA-seq data into cell and gene latent factors [1] driven by the inferred CNV that may be examined for biological insight. These latent factors are learned with an additional temporal dimension, corresponding to samples collected over the course of treatment, while the inferred CNV are uniform across all time points. This temporal axis allows for the connection of a constant clonal genotype to temporally dynamic changes in cell phenotype, while the integration of both data modalities increases the robustness of both deconvolution and definition of clones. We apply the model to biopsies from melanoma patients undergoing anti-PD1 immunotherapy, and demonstrate accurate deconvolution as well as the ability to study the impact of therapy on diverse tumor clones.
[1] Levitin, Hanna Mendes, Yuan, Jinzhou, Cheng, Yim Ling. De novo gene signature identification from single-cell RNA-seq with hierarchical Poisson factorization. Mol Syst Biol (2019)
Presentation Overview: Show
Genotype-phenotype association is found in many biological systems, such as brain-related diseases and behavioral traits. Despite the recent improvement in the prediction of phenotypes from genotypes, they can be further improved and explainability of these predictions remains challenging, primarily due to complex underlying molecular and cellular mechanisms. Emerging multimodal data enables studying such mechanisms at different scales from genotype to phenotypes involving intermediate phenotypes like gene expression. However, due to the black-box nature of many machine learning techniques, it is challenging to integrate these multi-modalities and interpret the biological insights in prediction, especially when some modality is missing. Biological knowledge has recently been incorporated into machine learning modeling to help understand the reasoning behind the choices made by these models.
To this end, we developed DeepGAMI, an interpretable deep learning model to improve genotype-phenotype prediction from multimodal data. DeepGAMI uses prior biological knowledge to define the neural network architecture. Notably, it embeds an auxiliary-learning layer for cross-modal imputation while training the model from multimodal data. Using this pre-trained layer, we can impute latent features of additional modalities and thus enable predicting phenotypes from a single modality only. Finally, the model uses integrated gradient to prioritize multimodal features and links for phenotypes. We applied DeepGAMI to multiple emerging multimodal datasets: (1) population-level genotype and bulk-tissue gene expression data for predicting schizophrenia, (2) population-level genotype and gene expression data for predicting clinical phenotypes in Alzheimer's Disease, (3) gene expression and electrophysiological data of single neuronal cells in the mouse visual cortex, and (4) cell-type gene expression and genotype data for predicting schizophrenia. We found that DeepGAMI outperforms existing state-of-the-art methods and provides a profound understanding of gene regulatory mechanisms from genotype to phenotype, especially at cellular resolution. DeepGAMI is an open-source tool and is available at https://github.com/daifengwanglab/DeepGAMI.
Presentation Overview: Show
Measures of nucleotide sequence conservation across species are useful for identifying functional genomic loci. For example, phyloP scores are often able to a substantial boost in fine-mapping candidate loci from genome-wide associations studies (GWAS). However, these methods can fail when regulatory function is maintained, often in a cell-type specific manner, even when the genome sequence itself shows minimal conservation. To overcome those limitations, we introduce Cell-TACIT, the Cell Type-Aware Conservation Inference Toolkit, to identify human trait-associated regulatory variants. In Cell-TACIT, we train convolutional neural networks models corresponding to each cell type on matched single nucleus ATAC-Seq datasets across at least 2 species. Then, we use those models impute cell type-specific open chromatin across hundreds of orthologous candidate regulatory elements, which provides an estimate on the extent to which cell type-specific function is conserved. Finally, we compare the degree of cell type-specific conservation to using GWAS summary statistics stratified heritability enrichment scores.
Applying Cell-TACIT to dozens neuropsychiatric trait loci identified higher heritability enrichment and more fine-mapped variants than nucleotide conservation and human open chromatin data alone. In the case of schizophrenia, a highly-powered GWAS, we found conservation based on phyloP score yielded a heritability enrichment score of only 3.8 relative to other neural cell types. Using open chromatin annotations across the human population for a specific subtype of neuron implicated in schizophrenia (D1-D2-Hybrid) raised the heritability enrichment to 53.3. Using only highly cell type-specific open chromatin Cell-TACIT with cell TACIT raised the heritability enrichment to 116.7 (adj. P = 1.32E-8). These stronger enrichments were more likely to be found for neural cell types implicated in the disorder. For example, open chromatin regions human oligodendrocyte precursor cells did show an enrichment for schizophrenia-associated mutations (heritability enrichment 21.0), but these loci did not tend to have orthologous regions also predicted to be active in oligodendrocyte precursors (heritability enrichment 10.9).
We experimentally validate predictions for enhancers with risk variants near the DRD2 schizophrenia risk locus using in vivo reporter assays. By integrating genome conservation and multi-species open chromatin data, Cell-TACIT prioritizes variants within regions of conserved regulatory function for in vivo characterization, and addresses a major challenge in translating disease associations to mechanistic understanding.
Presentation Overview: Show
Evolutionary constraint is a powerful predictor of genome function. Previous studies of mammalian constraint were limited by species number and reliance on human-referenced alignments. The Zoonomia Consortium has created a reference-free whole-genome alignment of 240 species to explore the evolution of placental mammals.
Using this alignment, we computed new constraint scores, achieving single-base resolution of constraint and acceleration. Within constrained transcription factor binding sites, constraint scores are highly correlated with motif information content. We identified regions of contiguous constraint (RoCC), defined as twenty or more consecutive constrained bases. The longest RoCC is in an intron of METAP1D, encompassing four distal enhancer-like ENCODE candidate cis-regulatory elements. METAP1D encodes an essential mitochondrial protein conserved at least back to the common ancestor of human and zebrafish. In addition to the METAP1D transcription start site (TSS), this RoCC physically interacts with the TSS’s of TLK1 and HAT1 in the human adult cerebral cortex, which regulate chromatin structure and histone production and acetylation, respectively, and are expressed in the cortex of diverse mammals. Some of the most accelerated regions in human and chimpanzee overlap topologically associated domain boundaries near neurodevelopmental genes, suggesting that they alter these genes’ regulatory landscapes.
Intriguingly, 48.5% of constrained bases are unannotated by resources like ENCODE; we call such regions UNannotated Intergenic COnstrained RegioNs (UNICORNs). Most UNICORNs are within 500kb of the transcription start site for a protein-coding gene. UNICORNs tend to contain fewer variants, which tend to have lower allele frequencies, than in other intergenic regions. 17% of UNICORNs overlap open chromatin regions identified in recently published datasets from the developing brain, specific adult brain regions, and specific cell types within the motor cortex, suggesting that UNICORNs may have functions that will be revealed as additional regulatory genomics data is generated.
To associate genetic variation across species with mammalian traits, we developed new computational methods for pairing our alignment with recently curated phenotype annotations. We found that the number of olfactory receptor genes is associated with the number of olfactory turbinals, genes evolving more quickly in hibernators are associated with stress adaption, and motor cortex regulatory elements near genes associated with microcephaly and macrocephaly are associated with the evolution of brain size. As more phenotype annotations along with regulatory element data from relevant tissues and cell types become available, we anticipate that our alignment and new methods will provide additional exciting insights into the evolution of placental mammals.
Presentation Overview: Show
Sequence-based deep learning models, particularly convolutional neural networks (CNNs), have shown superior performance on a wide range of genomic tasks. A key limitation of these models is the lack of interpretability, slowing down their adoption by the genomics community. Current approaches to model interpretation do not readily reveal how a model makes predictions, can be computationally intensive, and depend on the implemented architecture. Here, we introduce ExplaiNN, an adaptation of neural additive models for genomic tasks wherein predictions are computed as a linear combination of multiple independent CNNs, each consisting of a single convolutional filter and fully connected layers. This approach brings together the expressiveness of CNNs with the interpretability of linear models, providing global (cell state level) as well as local (individual sequence level) biological insights into the data. We use ExplaiNN to predict transcription factor (TF) binding and chromatin accessibility states, demonstrating performance levels comparable to state-of-the-art methods, while providing a transparent view of the model’s predictions in a straightforward manner. Applied to de novo motif discovery, ExplaiNN identifies equivalent motifs to those obtained from specialized algorithms across a range of datasets. Finally, we present ExplaiNN as a plug and play platform in which pretrained TF binding models and annotated position weight matrices from reference databases can be easily combined. We expect that ExplaiNN will accelerate the adoption of deep learning by biological domain experts in their daily genomic sequence analyses. ExplaiNN is open-source software distributed under the MIT license on GitHub (https://github.com/wassermanlab/ExplaiNN) and is accompanied by a bioRxiv preprint (https://doi.org/10.1101/2022.05.20.492818).
Presentation Overview: Show
The integration of genome-wide association studies and functional genomics data have led to the identification of many associated cell types for any one disease. While diseases tend to have tissue-specific presentations and gene regulatory effects across tissues are highly correlated, it is critical yet challenging to infer which cell types are responsible for mediating genetic effects on disease. In this talk, I will describe our work toward inferring causal cell types regulating disease.
Presentation Overview: Show
Pre-mRNA splicing is initiated with the recognition of a single-nucleotide intronic branchpoint (BP) within a BP motif by spliceosome elements. Forty-eight rare variants in 43 human genes have been reported to alter splicing and cause disease by disrupting BP. However, until now, no computational approach was available to efficiently detect such variants in massively parallel sequencing data. We established a comprehensive human genome-wide BP database by integrating existing BP data, and generating new BP data from RNA-seq of lariat debranching enzyme DBR1-mutated patients and from machine-learning predictions. We characterized multiple features of BP in major and minor introns, and found that BP and BP-2 (two-nucleotides upstream of BP) positions exhibit a lower rate of variation in human populations and higher evolutionary conservation than the intronic background, whilst being comparable to the exonic background. We developed BPHunter as a genome-wide computational approach to systematically and efficiently detect intronic variants that may disrupt BP recognition. BPHunter retrospectively identified 40 of the 48 known pathogenic BP variants, in which we summarized a strategy for prioritizing BP variant candidates. The remaining 8 variants all create AG-dinucleotides between the BP and acceptor site, which is the likely reason for mis-splicing. We demonstrated the practical utility of BPHunter prospectively by using it to identify a novel germline heterozygous BP variant of STAT2 in a patient with critical COVID-19 pneumonia, and a novel somatic intronic 59-nucleotide deletion of ITPKB in a lymphoma patient, both of which were validated experimentally. BPHunter is publicly available from https://hgidsoft.rockefeller.edu/BPHunter and https://github.com/casanova-lab/BPHunter.
Presentation Overview: Show
Polygenic scores (PGS) attracted significant research interest, given the recent methodological advancements and increasing availability of large-scale genotyped cohorts. However, limited transferability across populations remains one of the key challenges in realizing equitable healthcare benefits from advances in genetic research. Given the trans-ancestry genetic correlation in complex traits, admixed individuals provide valuable opportunities to fill the gap. Most modern PGS approaches rely on linkage disequilibrium reference panels, often excluding admixed individuals in PGS training. Here we overcome the methodological limitations by leveraging a sparse PGS approach that directly operates on individual-level data. We systematically characterize and evaluate PGS models across more than 70 traits in UK Biobank using more than n = 406,000 participants of European, African, Asian, and admixed individuals. We demonstrate that the direct inclusion of ancestry-diverse populations results in, on average, a 30% increase in the transferability of polygenic scores in individuals of African ancestry. We observe the greatest improvements in individuals of African ancestry, for example, in neutrophil count (50-fold increase in R^2, 0.0577 vs. 0.0011 in the single population model and inclusive model, respectively), leukocyte (white blood cell) count (R^2 = 0.0465 vs. 0.0044), and lymphocyte count (R^2 = 0.0048 vs. 0.0021) PGS models. We also find improvements in predictive performance in European ancestry thanks to the increased power of the inclusive training set. Together, our results offer insights into developing transferable polygenic scores.