- Anat Kreimer, University of California, Berkeley, United States
- Fumitaka Inoue, UCSF, United States
- Tal Ashuah, University of California, Berkeley, United States
- Nadav Ahituv, UCSF, United States
- Nir Yosef, University of California, Berkeley, United States
The temporal interplay between gene regulation and gene expression during cell differentiation remains largely unknown. Using neural induction as a model, we set out to decipher these dynamics. We performed RNA-seq, ChIP-seq (H3K27ac, H3K27me3) and ATAC-seq on human embryonic stem cells at seven early neural differentiation time points (0-72 hours). We found that DNA accessibility precedes H3K27ac, which is then followed by gene expression changes. Using massively parallel reporter assays (~2,500 sequences tested at all time-points), we further show that temporal enhancers correlate with H3K27ac. Development of a prioritization method that incorporated all genomic data identified key transcription factors (TFs) involved in temporal function, several of which were functionally validated to be important and novel neural induction regulators. Combined, our results provide a temporal framework for gene regulation and expression during differentiation, specify a blueprint for neural induction and identify novel enhancers and TFs that are instrumental for this process.
- Tobias Zehnder, Max Planck Institute for Molecular Genetics, Germany
- Philipp Benner, Max Planck Institute for Molecular Genetics, Germany
- Martin Vingron, Max Planck Institute for Molecular Genetics, Germany
Eukaryotic gene regulation is a complex process comprising the
dynamic interaction of enhancers and promoters in order to activate gene
expression. In recent years, research in regulatory genomics has contributed to a
better understanding of the characteristics of promoter elements and for most
sequenced model organism genomes there exist comprehensive and reliable
promoter annotations. For enhancers, however, a reliable description of their
characteristics and location has so far proven to be elusive. With the
development of high-throughput methods such as ChIP-seq, large amounts of
data about epigenetic conditions have become available, and many existing
methods use the information on chromatin accessibility or histone modifications
to train classifiers in order to segment the genome into functional groups such as
enhancers and promoters. However, these methods often do not consider prior
biological knowledge about enhancers such as their diverse lengths or molecular structure.
We developed enhancer HMM (eHMM), a supervised hidden Markov
model designed to learn the molecular structure of promoters and enhancers.
Both consist of a central stretch of accessible DNA flanked by nucleosomes with
distinct histone modification patterns. We evaluated the performance of eHMM
within and across cell types and developmental stages and found that eHMM
successfully predicts enhancers with high precision and recall comparable to
state-of-the-art methods, and consistently outperforms those in terms of
accuracy and resolution.
eHMM predicts active enhancers based on data from chromatin
accessibility assays and a minimal set of histone modification ChIP-seq
experiments. In comparison to other 'black box' methods, its parameters are easy
to interpret. eHMM can be used as a stand-alone tool for enhancer prediction
without the need for additional training or a tuning of parameters. The high
spatial precision of enhancer predictions gives valuable targets for potential
knockout experiments or downstream analyses such as motif search.
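The structure described above, an accessible core flanked by modified nucleosomes, can be illustrated with a toy left-to-right hidden Markov model decoded with the Viterbi algorithm. This is a minimal sketch and not the eHMM implementation: the four state names, the binarized per-bin observations ("none", "histone", "open"), and every probability are assumptions chosen only for illustration.

```python
import math

# Hypothetical states: an accessible core flanked by two nucleosome states.
STATES = ["background", "nuc_left", "accessible", "nuc_right"]
START = {"background": 1.0}  # assumed: decoding starts in background
TRANS = {  # assumed left-to-right topology
    "background": {"background": 0.9, "nuc_left": 0.1},
    "nuc_left": {"nuc_left": 0.6, "accessible": 0.4},
    "accessible": {"accessible": 0.6, "nuc_right": 0.4},
    "nuc_right": {"nuc_right": 0.6, "background": 0.4},
}
EMIT = {  # assumed emission probabilities for one binarized signal per bin
    "background": {"none": 0.90, "histone": 0.05, "open": 0.05},
    "nuc_left": {"none": 0.10, "histone": 0.80, "open": 0.10},
    "accessible": {"none": 0.10, "histone": 0.10, "open": 0.80},
    "nuc_right": {"none": 0.10, "histone": 0.80, "open": 0.10},
}


def _log(p):
    """Logarithm that maps probability zero to minus infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(observations):
    """Return the most probable state path for a list of observed symbols."""
    score = {
        s: _log(START.get(s, 0.0)) + _log(EMIT[s][observations[0]])
        for s in STATES
    }
    backpointers = []
    for obs in observations[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            # Best predecessor of state s at this position.
            best = max(STATES, key=lambda p: score[p] + _log(TRANS[p].get(s, 0.0)))
            new_score[s] = (
                score[best] + _log(TRANS[best].get(s, 0.0)) + _log(EMIT[s][obs])
            )
            pointers[s] = best
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


bins = ["none", "none", "histone", "open", "open", "histone", "none", "none"]
print(viterbi(bins))
```

Because the topology only allows the order nucleosome, accessible, nucleosome, a decoded enhancer-like segment always has a well-defined core, which is one way to obtain the spatial resolution the abstract refers to.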
- William Lai, The Pennsylvania State University, United States
- Kylie Bocklund, The Pennsylvania State University, United States
- Kate Mistretta, The Pennsylvania State University, United States
- B Franklin Pugh, The Pennsylvania State University, United States
Mammalian chromatin contains thousands of distinct proteins spanning numerous isoforms and post-translationally modified variants. The study of chromatin composition and its organization often involves the use of antibodies directed against specific chromosomal proteins. For example, factor-specific antibodies are used in chromatin immunoprecipitation (ChIP) assays to probe the genomic binding locations of proteins. This is most applicable where epitope tagging is impractical (e.g., human tissue, or cell lines from a wide variety of origins). However, antibody reliability and efficacy in ChIP are notoriously problematic. It is estimated that fewer than 20% of commercially produced antibodies are viable for their recommended applications in ChIP. In response to this gap in reliability, the NIH Common Fund initiated the Protein Capture Reagents Program (PCRP) to generate a renewable source of antibodies useful in a wide range of molecular biology applications. We set out to test the utility of these antibodies using the ultra-high resolution ChIP-exo assay. We tested 450 monoclonal antibodies derived from hybridoma sources generated by PCRP. The generation of a large number of ChIP-exo datasets against a wide range of chromatin proteins presented the unique difficulty of developing an equally high-throughput analysis strategy capable of evaluating success when the very definition of “success” is unknown. We developed a principled method for assessing success in antibody performance. The application of an automated analysis pipeline to multiple proteins also served as a ‘first-pass’ analysis, which revealed conserved regulatory modules across different mammalian cell lines. We believe that the methodologies described here can be used as a framework not only for antibody validation, but also as the first step towards an automated and agnostic analysis platform for genomics.
- Alireza Fotuhi Siahpirani, University of Wisconsin-Madison, United States
- Rupa Sridharan, Wisconsin Institute for Discovery, Department of Cell and Regenerative Biology, University of Wisconsin--Madison, United States
- Sushmita Roy, University of Wisconsin-Madison, United States
Transcriptional regulatory networks control the context-specific expression levels of genes by specifying the regulatory proteins of target genes but remain challenging to map using experimental or computational approaches. Recent work in computational inference of regulatory networks has advanced in two directions: (a) integration of prior information to improve the agreement with physical networks, and (b) inference of transcription factor (TF) activities to overcome possible issues with using mRNA levels of the regulator as a proxy for TF activity [Liao et al., 2003, Ocone and Sanguinetti, 2011, Arrieta-Ortiz et al., 2015]. Both of these directions require an input network that provides an initial assessment of potential regulatory edges. In most systems, the input regulatory network can be very noisy; however, the extent to which the quality of the input network influences results is not known. This is especially an issue for estimation of TF activity (TFA), which directly uses the network structure to infer the TF activity levels.
To address these issues, we developed a new approach for inferring genome-scale regulatory networks that uses a regularized regression (constrained by the input graph structure) to estimate factor activities from noisy input networks and incorporates them into an integrative network learning framework [Roy et al. 2013, Siahpirani & Roy 2017]. On simulated data, we show that our approach of estimating TFA is more robust to noise and infers better networks and more accurate TFA than an approach that naively uses the noisy input network. We collected a large compendium of expression profiles spanning RNA-seq and microarray data across three human cell lines and one mouse cell line, comprising a total of 5,644 samples from 691 datasets. We integrated these expression datasets with DNaseI-derived input networks. We applied our network inference approach to yeast and mammalian systems. Networks inferred using estimated TFA agree better with the yeast gold standard network [MacIsaac et al. 2006] than networks inferred using the mRNA level of the regulator alone, and this property holds for different network inference methods. Consistent with yeast experiments, we observe that using regularized TFA in mammalian cell types in a prior-based framework is significantly better than using either TFA or prior alone. Taken together, our results show that by handling noisy input prior networks, our algorithm provides a powerful tool for reconstructing gene regulatory networks and is broadly useful across diverse systems.
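The idea of estimating TF activities from an input network by regularized regression can be sketched as follows. The abstract does not specify the regularizer, so this example assumes a plain ridge penalty and the common factorization expression ≈ prior × activity; the function names `solve` and `estimate_tfa` and the toy matrices are hypothetical.

```python
def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting.

    a is a square matrix and b a matrix of right-hand sides (lists of lists).
    """
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def estimate_tfa(prior, expression, ridge=0.1):
    """Ridge estimate of TF activities A in: expression ~ prior @ A.

    prior:      genes x TFs matrix of input-network edges (assumed 0/1 here)
    expression: genes x samples matrix
    returns:    TFs x samples matrix, A = (P^T P + ridge * I)^-1 P^T E
    """
    n_genes, n_tfs = len(prior), len(prior[0])
    n_samples = len(expression[0])
    gram = [
        [
            sum(prior[g][i] * prior[g][j] for g in range(n_genes))
            + (ridge if i == j else 0.0)
            for j in range(n_tfs)
        ]
        for i in range(n_tfs)
    ]
    rhs = [
        [sum(prior[g][i] * expression[g][s] for g in range(n_genes))
         for s in range(n_samples)]
        for i in range(n_tfs)
    ]
    return solve(gram, rhs)


# Toy example: two TFs, four target genes, two samples.
prior = [[1, 0], [1, 0], [0, 1], [0, 1]]
expression = [[1.0, 2.0], [1.0, 2.0], [3.0, -1.0], [3.0, -1.0]]
print(estimate_tfa(prior, expression, ridge=2.0))
```

Only genes connected to a TF in the input network contribute to that TF's estimate, and a larger `ridge` value shrinks the activities toward zero, which limits the influence of spurious edges at the cost of some bias.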
- Amir Alavi, Carnegie Mellon University, United States
- Matthew Ruffalo, Carnegie Mellon University, United States
- Aiyappa Parvangada, Carnegie Mellon University, United States
- Zhilin Huang, Carnegie Mellon University, United States
- Ziv Bar-Joseph, Carnegie Mellon University, United States
Single cell RNA-Seq (scRNA-seq) studies often profile upward of thousands of cells in heterogeneous environments. Current methods for characterizing cells, including PCA and t-SNE, perform unsupervised analysis followed by assignment using a small set of known marker genes. Such approaches are limited to a few, well characterized cell types. In past work, we have shown that supervised methods based on neural networks perform better than these unsupervised methods for the analysis of large scRNA-seq datasets. To enable large scale supervised characterization we recently developed an automated pipeline to download, process, and annotate publicly available scRNA-seq datasets. We further developed various types of supervised neural networks, including biologically based architectures (where connections are based on the Gene Ontology to reduce overfitting), and networks which directly learn a discriminatory function (Siamese and triplet architectures). We use the neural embedding to learn reduced dimension representations for each expression profile. We applied our pipeline to analyze hundreds of thousands of cells from over 500 different studies representing 300 unique cell types. We tested our methods on retrieval tasks and demonstrated that they greatly outperform unsupervised methods. In addition to reducing the dimensions, we utilize fast approximate nearest neighbors search algorithms to implement a cell profile retrieval system, allowing global comparative analysis of new expression profiles to all available scRNA-seq data. We also used the data to perform cell-type specific differential expression analysis which successfully identified important genes for several different cell types in an unbiased fashion. A case study of neural degeneration data highlights the ability of our methods and data to identify differences between cell type distributions in healthy and diseased mice.
Using our neural networks we found an increase in immune system related cell types in late-stage diseased mice. Our DE gene lists for immune cells revealed known and novel up-regulated genes in late-stage disease cells. We implemented a web server that incorporates our reprocessed dataset, neural networks, and fast matching methods into an easy to use web application in order to determine cell types, key genes, similar prior studies, and more.
Lin et al., NAR 45.17, e156, 2017
Alavi et al., bioRxiv, doi: https://doi.org/10.1101/323238, 2018
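Two ingredients named in the abstract above, the triplet objective and retrieval by nearest neighbors in the embedding space, can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' pipeline: the embeddings are made-up two-dimensional vectors, the cell type labels are hypothetical, and an exact brute-force search stands in for the approximate nearest neighbor algorithms mentioned in the text.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss that is zero once the negative is at least `margin`
    farther from the anchor than the positive."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def retrieve(query, database, k=2):
    """Return the labels of the k database profiles closest to the query.

    database is a list of (label, embedding) pairs. An exact scan is used
    here; a production system would swap in an approximate index.
    """
    ranked = sorted(database, key=lambda item: euclidean(query, item[1]))
    return [label for label, _ in ranked[:k]]


# Hypothetical embedded reference profiles.
database = [
    ("neuron", [0.0, 0.0]),
    ("neuron", [0.1, 0.2]),
    ("microglia", [5.0, 5.0]),
    ("astrocyte", [3.0, 0.0]),
]
print(retrieve([0.2, 0.1], database, k=2))
print(triplet_loss([0.0, 0.0], [0.0, 1.0], [0.0, 1.5]))
```

Training with a triplet objective pulls profiles of the same cell type together and pushes other types away, which is what makes a simple distance-based lookup meaningful at query time.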
- Florian Wagner, New York University, United States
- Itai Yanai, New York University, United States
Single-cell RNA-Seq (scRNA-Seq) excels at characterizing heterogeneous tissues in the context of health and disease, as well as following experimental perturbation. However, given the inherent noisiness of scRNA-Seq data and the frequent lack of a gold standard, it is currently unclear how to establish formal cell type definitions based on scRNA-Seq data, which severely impedes the consistent analysis of scRNA-Seq data across experiments and studies. To address this challenge, we have developed Moana, a machine learning framework that leverages a computational denoising algorithm to enable the construction of precise, robust, and efficient scRNA-Seq cell type classifiers for arbitrary tissues. We propose a hierarchical approach to clustering and classification that allows complex cell type classifiers to be constructed based on a single, large scRNA-Seq dataset containing a heterogeneous mixture of cell types. Moreover, we propose a novel strategy to rigorously validate classification performance in the absence of a gold standard, using an independent dataset that contains the same set of cell types. To demonstrate the ability of our framework to resolve cell types with highly similar transcriptomes, we show that a Moana classifier for human peripheral blood mononuclear cells (PBMCs) is able to reliably distinguish between all major immune cell populations present, including for example CD14+ and CD16+ monocytes. To demonstrate that Moana classifiers are highly robust to batch and treatment effects, we show that a Moana PBMC classifier can accurately assign cell identities in data generated using a different scRNA-Seq protocol, as well as following an experimental perturbation. To demonstrate the general applicability of our framework, we construct and validate cell type classifiers for human pancreatic islets and lung tumors. An efficient open-source implementation of our framework will be made available on GitHub. 
Our work represents a major step towards the routine transfer of knowledge across scRNA-Seq experiments, and opens up the possibility of building a universal scRNA-Seq atlas of cell types and states as a library of tissue-specific classifiers that can be directly applied for the analysis of new datasets.
- Joseph A. Wayman, Cincinnati Children's Hospital Medical Center, United States
- Diep Nguyen, Oberlin College, United States
- Peter DeWeirdt, Harvard University, United States
- Bryan D. Bryson, Massachusetts Institute of Technology, United States
- Emily R. Miraldi, Cincinnati Children's Hospital Medical Center, United States
Transcriptional regulatory networks (TRNs) promote cellular behavior through coordination of gene expression by transcription factors (TFs). Single-cell RNA-seq (scRNA-seq) enables characterization of rare and/or heterogeneous cell populations and has already provided insights into the TFs driving cell fate decisions and states. Thus, there is great interest in leveraging scRNA-seq for genome-scale TRN inference. However, scRNA-seq poses significant statistical challenges. Low transcript capture rate results in many technical zeros ("drop-out" genes) that obscure biological signal.
Here, we present methods for TRN inference from scRNA-seq (scTRN methods) that leverage advances in machine learning to (1) impute technical zeros in scRNA-seq data and (2) amplify biological signal through incorporation of prior information and relevant data. In our framework, gene expression is modeled as a sparse multivariate function of TF activities, and prior information (e.g., TF-gene interactions derived from a curated database, TF ChIP-seq and/or TF motif analysis of chromatin accessibility) enters our formulation twice: first, to estimate TF activities based on a priori knowledge of TF-target gene expression and, secondly, using the adaptive LASSO to reinforce selection of prior-supported interactions. To incorporate relevant gene expression data (e.g., bulk RNA-seq data of the same cell types, scRNA-seq of related cell types), we use a multi-task learning framework that enables decomposition of the resulting TRN into dataset-shared and dataset-specific interactions.
Importantly, we developed a scRNA-seq dataset in T Helper 17 (Th17) cells for benchmarking scTRN methods, as a rich set of existing Th17 genomics resources enabled construction of a gold-standard TRN (TF ChIP-seq, RNA-seq of TF knockouts, bulk RNA-seq, ATAC-seq)[1,3]. Using our benchmarking dataset, we tested the effects of experimental design, data normalization and imputation on precision-recall. Proper treatment of these factors not only improves TRN inference but is also likely to improve other scRNA-seq methods (e.g., clustering). We compared our performance to published scTRN inference methods for which evaluation with an extensive gold standard was not previously possible. Our scTRN method outperforms the state of the art, and our benchmarking dataset affords the opportunity for further innovation in scTRN inference.
1. Miraldi, E. R. et al. Leveraging chromatin accessibility for transcriptional regulatory network inference in T Helper 17 Cells. bioRxiv 292987 (2018).
2. Castro, D. M., Veaux, N. de, Miraldi, E. R. & Bonneau, R. Multi-study inference of regulatory networks for more accurate models of gene regulation. bioRxiv 279224 (2018).
3. Ciofani, M. et al. A validated regulatory network for Th17 cell specification. Cell 151, 289‑303 (2012).
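The second use of prior information described above, an adaptive LASSO that favors prior-supported interactions, amounts to giving each candidate TF its own penalty. The sketch below solves such a weighted LASSO by coordinate descent for a single target gene. It is an assumption-laden illustration: the penalty values, the toy design matrix, and the function names are hypothetical, and the multi-task and TF-activity components of the framework are not represented.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold, returning 0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def weighted_lasso(x, y, penalties, n_iter=200):
    """Coordinate descent for
        (1 / 2n) * ||y - X b||^2 + sum_j penalties[j] * |b_j|.

    x: samples x TFs matrix of TF activities (or expression)
    y: target gene expression, one value per sample
    penalties: per-TF penalty; smaller values for prior-supported TFs
    """
    n, p = len(x), len(x[0])
    beta = [0.0] * p
    col_sq = [sum(x[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of feature j with the residual that excludes j.
            rho = sum(
                x[i][j] * (y[i] - sum(x[i][k] * beta[k] for k in range(p) if k != j))
                for i in range(n)
            ) / n
            beta[j] = soft_threshold(rho, penalties[j]) / col_sq[j]
    return beta


# Two TFs that explain the target equally well; only TF 0 has prior support.
x = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [1.0, 0.0, 0.0, -1.0]
print(weighted_lasso(x, y, penalties=[0.1, 0.6]))
```

In this toy case the data alone cannot distinguish the two regulators, so the lower penalty on the prior-supported TF decides which edge survives. With strong enough data support, an edge without prior support can still be selected.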
- Gunsagar Gulati, Stanford University, Institute for Stem Cell Biology and Regenerative Medicine, Department of Biomedical Data Science, United States
- Shaheen Sikandar, Stanford University, Institute for Stem Cell Biology and Regenerative Medicine, United States
- Daniel Wesche, Stanford University, Institute for Stem Cell Biology and Regenerative Medicine, United States
- Anjan Bharadwaj, Stanford University, Institute for Stem Cell Biology and Regenerative Medicine, Department of Biomedical Data Science, United States
- Anoop Manjunath, Stanford University, Institute for Stem Cell Biology and Regenerative Medicine, Department of Biomedical Data Science, United States
- Francisco Ilagan, Stanford University, Institute for Stem Cell Biology and Regenerative Medicine, Department of Biomedical Data Science, United States
- Mark Berger, Stanford University, Department of Computer Science, United States
- Michael Clarke, Stanford University, Institute for Stem Cell Biology and Regenerative Medicine, United States
- Aaron Newman, Stanford University, Institute for Stem Cell Biology and Regenerative Medicine, Department of Biomedical Data Science, United States
Increasing evidence suggests that human malignancies arise from subpopulations of tumor-initiating cells (TICs), which have been implicated in tumor growth, metastasis, and resistance to therapy. Despite their importance, the molecular profiles and functional properties of TICs are poorly understood. Single-cell RNA-sequencing (scRNA-seq) has enabled the study of cell states and their transitions at high resolution, revealing cell types and gene expression programs that are masked in aggregated, bulk measurements. However, computational methods to accurately infer differentiation status from single cell transcriptomes and prioritize TICs without prior information are lacking. We therefore developed CytoTRACE, a new computational framework to predict cellular differentiation trajectories from scRNA-seq data. We evaluated the performance of CytoTRACE on nearly 250,000 single cell transcriptomes with known differentiation or developmental status, representing 55 lineages, 402 phenotypes, 14 tissue types, 5 species, and 9 sequencing platforms from 35 studies. When compared against nearly 18,000 gene sets (e.g. pluripotency genes), 3 stemness inference tools (e.g. SLICE), and 6 methods for inferring cell lineage trajectories (e.g. Monocle DDRTree), CytoTRACE exhibited superior performance (P < 0.05), and correctly inferred the direction of differentiation in 82% of datasets. Therefore, we applied CytoTRACE to scRNA-seq profiles of 4,724 human breast epithelial cells, including 2,795 cancer cells, from 18 human breast cancer patients. We found that human breast basal epithelial genes ranked by their correlation with CytoTRACE recapitulated the gene expression patterns of mouse mammary development and were enriched for genes independently shown to be associated with tumorigenesis. 
When CytoTRACE was applied to human breast luminal progenitors, we significantly enriched for known genes associated with clonogenicity, such as ALDH1A3, and identified new candidates associated with less differentiation. Preliminary data from the knockdown of a leading candidate in a human breast cancer xenograft model validates the ability of CytoTRACE to prioritize genes involved in tumor growth and survival. In conclusion, CytoTRACE outperforms existing methods for characterizing differentiation trajectories in complex tissues and enables the identification of candidate cell types and genes associated with tumorigenesis.
- Qian Zhu, Harvard University, United States
- Sheel Shah, California Institute of Technology, United States
- Ruben Dries, Harvard University, United States
- Long Cai, California Institute of Technology, United States
- Guo-Cheng Yuan, Harvard University, United States
Humans and other multicellular organisms are composed of diverse cell types characterized by distinct gene expression patterns. Within each cell type, there is also considerable heterogeneity. The source of cellular heterogeneity remains poorly understood, but it is commonly thought to be modulated by the balance between intrinsic regulatory networks and the extrinsic cellular microenvironment. In this work, we seek to systematically dissect the differential roles of intrinsic and extrinsic factors in mediating cellular heterogeneity, by combining two complementary approaches: single-cell RNA sequencing (scRNAseq), and multiplexed single-molecule fluorescence in situ hybridization (smFISH).
Each technology features a distinct set of advantages and limitations. Multiplexed smFISH carries the advantage of measuring the transcriptome with high accuracy in its native spatial environment, but current implementations profile only a few hundred genes, whereas scRNAseq provides whole-transcriptome estimation but requires cells to be removed from their environment, resulting in a loss of spatial information.
It is clear that an integrative analysis framework combining both scRNAseq and sequential FISH would better enable one to characterize both cell type and spatially dependent variations. We thus developed a computational approach that contains two major components. First, the scRNAseq data is used as a guide to accurately determine the cell-types corresponding to the cells profiled by sequential FISH. Second, distinct spatial domain patterns are systematically detected from sequential FISH data. These spatial patterns are then in turn used to dissect the environment-associated variation in a scRNAseq dataset.
We illustrate the two components using a seqFISH dataset generated from the mouse visual cortex region. We applied a supervised approach based on a support vector machine to map cell types of the reference scRNAseq dataset to the seqFISH data. To systematically dissect the contributions of microenvironments to gene expression variation, we developed a novel hidden-Markov random field (HMRF) approach to infer the organizational structure of the visual cortex in an unbiased manner.
Most existing studies have focused on identifying cell-type differences, but, as shown in our analysis of the mouse visual cortex region, cell-type differences represent only one component of cell-state variation, whereas the local environment (or spatial domain) plays a significant role in mediating gene activities, probably through cell-cell interactions and signaling. As each technology has its own strengths and weaknesses, the integrated approach presented here provides a powerful modeling framework that is broadly applicable to the analysis of cell type and spatial variation in diverse tissues from various model systems.
- Nelson Johansen, University of California, Davis, United States
- Gerald Quon, University of California, Davis, United States
Single cell RNA sequencing (scRNA-seq) technologies are quickly advancing our ability to characterize the transcriptional heterogeneity of biological samples, given their ability to identify novel cell types and characterize precise transcriptional changes during previously difficult-to-observe processes such as differentiation and cellular reprogramming. An emerging challenge in scRNA-seq analysis is the characterization of cell type-specific transcriptional responses to stimuli, measured when similar collections of cells are assayed under two or more conditions, such as in control/treatment or cross-organism studies.
We will present a novel computational strategy for identifying cell type specific responses using associative domain adaptation neural networks. Briefly, our method projects single cells measured across different conditions into the same low dimensional cell state space, in a way such that the effect of the condition(s) is removed. The most novel component of our approach is that we then train neural networks to project cells from the low dimensional cell state space back to the original gene expression profile, allowing us to predict paired differential expression at the level of an individual cell. Compared to other existing approaches, ours can generate paired differential expression at the individual cell level, is unsupervised (does not require identification of cell types before alignment), and can align more than two conditions simultaneously.
We applied our model to two problems. First, we aligned hematopoietic progenitor populations collected under control conditions and after being challenged by an inflammatory stimulus (LPS). We identify three distinct sub-populations of long term HSCs that respond differentially to LPS, which are not apparent when cells are aligned using other approaches. These subpopulations are distinguished by a set of approximately 30 genes, enriched in functions related to inflammation response to other pathogens. Second, we apply our method to a recent dataset comparing a control with a knockdown of the AP2-G gene, which is necessary for sexual commitment and subsequent transmission of malaria. It is unknown at what stage of the malaria cell cycle malaria cells commit to asexual versus sexual reproduction, and the underlying mechanism is also unknown. Using our method, we were able to find a novel component of the malaria cell cycle that exhibited strong differences between the case and control conditions and that was not identified in the original study. In this component, we identify both known (P48/45 and Pf11-1) and novel sexual commitment-specific genes (NEK2, PM6, among others). These genes potentially interact with AP2-G to establish sexual commitment of malaria.
- Akpeli Nordor, Harvard University, United States
- Martin Aryee, Harvard University, United States
- Geoffrey Siwo, University of Notre Dame, United States
Genome editing technologies, especially CRISPR/Cas systems, could revolutionize the treatment of various genetic diseases. Early identification of potential pharmacological interactions between CRISPR/Cas systems and small molecules or drugs is therefore needed, as such interactions could lead to adverse reactions when these technologies are used in the clinic. Furthermore, interactions between small molecules and genome editing technologies could also open new avenues for modulating the activity of genome editing technologies and help fine-tune specific cellular processes that influence editing outcomes. Here, we show that transcriptional responses of CD34+ hematopoietic stem cells to CRISPR/Cas9-AAV6 editing machinery can be used to predict small molecules that interact with components of the editing machinery. Specifically, we compared the transcriptional signatures of cellular responses to CRISPR/Cas9-AAV6 editing components (recently described by Cromer et al. 2018) to those of small molecule perturbations of cancer cell lines using the next generation Connectivity Map. We found that transcriptional responses to several small molecule perturbations and to CRISPR/Cas9-AAV6 editing machinery show close similarities. While transcriptional signatures of some small molecules were significantly positively correlated with those of CRISPR/Cas9-AAV6, others were significantly negatively correlated. Some of the small molecules identified by our approach have been previously shown to influence CRISPR/Cas9 genome editing. For example, trichostatin (an HDACi) enhances targeted nucleotide substitutions in CRISPR/Cas9 edited cells, and L755507 (an adrenergic receptor agonist) enhances homology directed repair. Consistent with recent studies showing that P53 inhibits Cas9 dependent gene editing, we also found strong positive correlations between Cas9 ribonucleoprotein and serdemetan, an inhibitor of the MDM2 protein, which negatively regulates P53.
Our results suggest that transcriptional responses to genome editing components could potentially be used to predict small molecules that modulate CRISPR/Cas9 editing and may be applicable in a combination, hybrid therapeutic approach in which small molecules are administered concomitantly with gene editing machinery.
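The comparison of transcriptional signatures described above can be illustrated with a rank correlation between two differential expression vectors over the same genes. This is a minimal sketch under assumptions: Spearman correlation is used here as a simple stand-in and is not claimed to be the scoring used by the Connectivity Map or by the authors, and all signature values are made up.

```python
import math


def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(u, v):
    """Pearson correlation of two equal-length, non-constant vectors."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (norm_u * norm_v)


def spearman(u, v):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(u), ranks(v))


# Hypothetical log fold changes for the same five genes.
editing_response = [2.0, 1.0, -0.5, -1.5, 0.3]
compound_a = [1.8, 0.7, -0.2, -2.0, 0.1]    # mimics the editing response
compound_b = [-1.1, -0.4, 0.6, 2.2, -0.1]   # opposes the editing response
print(spearman(editing_response, compound_a))
print(spearman(editing_response, compound_b))
```

A strongly positive score marks a compound whose cellular response resembles that of the editing machinery, and a strongly negative score marks one that reverses it. Both kinds of hit are of interest in the setting the abstract describes.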
- Gregory Nuel, LPSM, CNRS 8001, Sorbonne University, Paris, France
- Flaminia Zane, IBPS, Sorbonne University, Paris, France
- Andrea Rau, GABI, INRA, France
- Florence Jaffrezic, INRA, France
Systems biology aims at modeling complex biological systems such as gene regulation networks (GRN). In this context, correlations and causal relationships between the expression of genes are inferred from high-throughput transcriptome data (e.g., RNA-seq) using various computational models. For inferring causality, a classical approach is based on directed acyclic graphs (DAGs). The typical output of a DAG inference algorithm is a (possibly weighted) sample of DAGs which represents the posterior distribution of the DAG structure conditionally on the available data. Due to the high complexity of this full posterior distribution of DAGs, it is common practice to derive from this distribution easy-to-grasp quantities like the marginal edge distribution or a so-called consensus graph (e.g., keeping only edges with high marginal probabilities). In this work, we suggest an alternative approach which takes advantage of the full DAG distribution. Our idea is to perform unsupervised clustering directly on the (weighted) sample of DAGs using a simple conditional mixture model (with edge probabilities). The likelihood is maximized through a classical Expectation-Maximization algorithm and the number of mixture components is selected using classical criteria like the Bayesian information criterion (BIC) or the extended BIC (EBIC). Our method is illustrated both on simulated and real datasets using a causal Gaussian Bayesian network inference model, and the interest of the proposed approach compared to classical ones is discussed.
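The clustering idea above can be sketched as a mixture of independent Bernoulli edge probabilities fitted by Expectation-Maximization on a weighted sample of DAGs, with BIC used to compare numbers of components. This is a simplified illustration under assumptions, not the authors' model: each DAG is flattened into a 0/1 edge vector, edges are treated as independent within a component, the initialization is deterministic, and the sum of the weights is used as the sample size in the BIC penalty.

```python
import math


def log_joint(dag, pi_c, theta_c):
    """log of pi_c * prod_e theta_e^x_e * (1 - theta_e)^(1 - x_e)."""
    return math.log(pi_c) + sum(
        math.log(t) if x else math.log(1.0 - t) for x, t in zip(dag, theta_c)
    )


def e_step(dags, weights, pi, theta):
    """Responsibilities per DAG and the weighted log-likelihood."""
    resp, loglik = [], 0.0
    for dag, w in zip(dags, weights):
        logs = [log_joint(dag, p, t) for p, t in zip(pi, theta)]
        top = max(logs)
        norm = top + math.log(sum(math.exp(v - top) for v in logs))
        resp.append([math.exp(v - norm) for v in logs])
        loglik += w * norm
    return resp, loglik


def fit_mixture(dags, weights, k, n_iter=50, eps=1e-3):
    """EM for a k-component Bernoulli mixture; returns (pi, theta, bic)."""
    n, n_edges = len(dags), len(dags[0])
    total = sum(weights)
    # Assumed initialization: evenly spaced sample members as soft centres.
    theta = [[0.75 if x else 0.25 for x in dags[(c * n) // k]] for c in range(k)]
    pi = [1.0 / k] * k
    for _ in range(n_iter):
        resp, _ = e_step(dags, weights, pi, theta)
        for c in range(k):
            mass = sum(w * r[c] for w, r in zip(weights, resp))
            pi[c] = max(mass / total, eps)
            theta[c] = [
                min(1.0 - eps, max(eps, sum(
                    w * r[c] * d[j] for w, r, d in zip(weights, resp, dags)
                ) / max(mass, eps)))
                for j in range(n_edges)
            ]
        scale = sum(pi)
        pi = [p / scale for p in pi]
    _, loglik = e_step(dags, weights, pi, theta)
    n_params = k * n_edges + (k - 1)
    bic = -2.0 * loglik + n_params * math.log(total)
    return pi, theta, bic


# Toy sample: two groups of DAGs over four candidate edges, equal weights.
dags = [[1, 1, 0, 0]] * 6 + [[0, 0, 1, 1]] * 6
weights = [1.0] * len(dags)
for k in (1, 2):
    print(k, fit_mixture(dags, weights, k)[2])
```

On this toy sample the marginal edge probabilities are all 0.5, so a consensus graph would carry little information, while the two-component fit recovers two distinct structures and obtains the lower BIC.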
- Peter Koo, Howard Hughes Medical Institute, Harvard University, United States
- Praveen Anand, Harvard University, United States
- Steffan Paul, Harvard University, United States
- Sean Eddy, Howard Hughes Medical Institute, Harvard University, United States
To infer the sequence and RNA structure specificities of RNA-binding proteins (RBPs) from experiments that enrich for bound sequences, we introduce a convolutional residual network which we call ResidualBind. ResidualBind significantly outperforms previous methods on experimental data from many RBP families. We interrogate ResidualBind to identify what features it has learned from high-affinity sequences with saliency analysis along with 1st-order and 2nd-order in silico mutagenesis. We then use synthetic sequences to verify the importance of putative features and to uncover their functional relationship with predictions. We show that in addition to sequence motifs, ResidualBind learns a model that includes the number of motifs, their spacing, and both positive and negative effects of RNA structure context. Strikingly, ResidualBind learns RNA structure context, including detailed base-pairing relationships, directly from sequence data, which we confirm on synthetic data. ResidualBind is a powerful, flexible, and interpretable model that can uncover cis-recognition preferences across a broad spectrum of RBPs.
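First-order in silico mutagenesis, one of the interrogation methods named above, scores every single-nucleotide variant of a sequence and records the change in the model's prediction. The sketch below shows the procedure with a stand-in predictor. `toy_predict` is a hypothetical motif counter and is not ResidualBind, and the motif and sequence are made up for illustration.

```python
ALPHABET = "ACGU"


def first_order_mutagenesis(sequence, predict):
    """Return, per position, the prediction change for each substituted base.

    predict is any callable mapping an RNA string to a number; the result is
    a list of dicts {base: predict(mutant) - predict(sequence)}.
    """
    baseline = predict(sequence)
    effects = []
    for i in range(len(sequence)):
        row = {}
        for base in ALPHABET:
            mutant = sequence[:i] + base + sequence[i + 1:]
            row[base] = predict(mutant) - baseline
        effects.append(row)
    return effects


def toy_predict(sequence, motif="GCAUG"):
    """Hypothetical stand-in for a trained model: counts motif occurrences."""
    width = len(motif)
    return float(sum(
        sequence[i:i + width] == motif for i in range(len(sequence) - width + 1)
    ))


effects = first_order_mutagenesis("AAGCAUGAA", toy_predict)
for position, row in enumerate(effects):
    print(position, row)
```

Positions whose substitutions change the prediction mark the features the model relies on. A second-order scan applies the same idea to pairs of positions, which is how dependencies such as base-pairing could be probed.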
- Jingyi Jessica Li, University of California, Los Angeles, United States
- Guo-Liang Chew, Fred Hutchinson Cancer Research Center, United States
- Mark Biggin, Lawrence Berkeley National Laboratory, United States
General translational cis-elements are found in the mRNAs of most or all genes and affect the formation and progress of preinitiation complexes and the ribosome under many physiological states. These elements, or sequence features, are: mRNA folding, upstream ORFs, specific nucleotides flanking the initiating AUG, the length of the protein coding sequence, poly(A) tail length and codon usage. Using a greatly improved model, we show that these features collectively specify at least 40%, 46% and 80% of the variance in translation rates in M. musculus, Arabidopsis thaliana and S. cerevisiae respectively, more than twice the percentage suggested by earlier work. We establish common principles for these features, including that control by secondary structure is chiefly mediated by highly folded ~35-55 nucleotide segments within mRNA 5’ regions; that relatively small differences in hairpin stem and loop lengths distinguish high and low translation rates; that the changes in tri-nucleotide frequencies between highly and poorly translated 5’ regions are correlated across all three species; and that control by distinct biochemical steps is strongly correlated, as is control by different sections within mRNAs. Species-specific differences are observed, however. For example, the relative contributions of features vary between species. Our analysis provides a more precise, quantitative understanding of general translation elements and the biochemical processes they direct. It also sets an upper limit on the extent to which other types of cis-element that mediate gene- and condition-specific control, such as miRNA recognition sites, impact translation.
- Svetlana Shabalina, NCBI, NLM, NIH, United States
Alternative splicing (AS) and alternative transcription (AT) create the extraordinary complexity of transcriptomes and lay the basis for the structural and functional diversity of mammalian proteomes. We present evidence that the acquisition of new exons in spliced genes can occur by mosaic extension of gene functional domains, where new alternative coding exons can be incorporated during evolution, preferentially at the ends of CDSs. In this study, it is shown how gene architecture and evolutionary rates of human receptor genes (nuclear receptors, opioid receptors, etc.) influence their expression patterns and translation.
Notably, the acquisition of novel exons at the boundaries of CDSs and UTRs is mainly mediated by alternative transcription events - initiation (ATI) and termination (ATT). Extended 5' and 3' ends, associated with AT events, make a major contribution to the diversity of the protein isoforms and harbor approximately five times more alternative nucleotides in the coding exons than in the core protein-coding regions which are subject to AS. Thus, alternative transcription makes an even larger contribution to transcriptome and proteome diversity than alternative splicing, specifically for tissue- and condition-specific expression. Our results suggest that differential processing of the 5' and 3' ends reflects two different regulatory strategies employed by large gene groups: regulation by ATI at the level of transcription initiation, and regulation by ATT that alters post-transcriptional stability of mRNA, enriches C-terminal variability of protein isoforms and provides options for differential posttranslational regulation. We also show that during evolution, compact protein domains are typically encoded by highly structured mRNAs, suggesting that alternative mRNA structures might control protein folding of alternative isoforms.
- Shahin Mohammadi, MIT, United States
- Jose Davila-Velderrain, MIT, United States
- Manolis Kellis, MIT, United States
Protein-protein interaction networks have been used extensively to identify protein complexes, pathways, and functional modules, and to study the structural context and potential mechanisms of action of disease-associated perturbations. However, current representations of these networks fail to capture the tissue- and cell-type-specificity of gene expression and protein function, which have been recently elucidated to unprecedented detail using single-cell profiling technologies.
To harness this unique opportunity, we develop SCINET (single-cell interactomes by expression imputation), which infers an ensemble of cell type-specific interactomes by integrating protein-protein-interactions with single-cell gene expression data. To address the high levels of noise, sparsity, and skewness of single-cell data, we develop a network-based imputation method, based on the ACTION framework (Mohammadi et al., 2018), and then transform imputed expressions to normalize their underlying distributions using the rank-based inverse Normal transformation. Given the distribution of transformed expression values of each pair of potentially-interacting genes, we calculate their interaction probability using the analytical form for the tail distribution of their minimum values, resulting in cell-type-specific networks for all cells in each cell type.
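The rank-based inverse normal transformation mentioned above can be sketched as follows. This is a generic version of the transformation and not SCINET code: the Blom offset c = 3/8 and the use of average ranks for ties are assumptions, and the function name is hypothetical.

```python
from statistics import NormalDist

def rank_inverse_normal(values, c=3.0 / 8.0):
    """Map values to normal scores via their ranks (ties receive the average rank)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # extend j to the end of the group of tied values starting at i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0        # 1-based average rank of the tie group
        for t in range(i, j + 1):
            ranks[order[t]] = average_rank
        i = j + 1
    std_normal = NormalDist()
    # assumption: Blom-type offset c keeps the quantiles strictly inside (0, 1)
    return [std_normal.inv_cdf((r - c) / (n - 2.0 * c + 1.0)) for r in ranks]

scores = rank_inverse_normal([0.0, 0.0, 5.0, 120.0, 3.0])
```

Because only ranks enter the transformation, the heavy skew and outliers typical of single-cell expression values do not affect the resulting scores, which is the property the paragraph above relies on.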
We apply SCINET to infer cell-type-specific networks in the human brain for six major cell types (excitatory neurons, inhibitory neurons, oligodendrocytes, oligodendrocyte progenitor cells, astrocytes, and microglia) using protein-protein-interactions from InBioMap and single-cell expression profiles from the human prefrontal cortex (Lake et al., 2018). We validate the resulting cell type-specific networks based on the specificity of the inferred edges, enrichment for known cell type-specific markers, and pathway enrichment for cell-type-specific functions.
We use the resulting networks to infer putative schizophrenia driver genes, by integration with common genetic variants from the GWAS catalog, rare variants (Fromer et al., 2014; Purcell et al., 2014), and case-control differential expression (Jaffe et al., 2018). We define dense subgraphs of schizophrenia-associated genes and find quantitatively-distinct cell-type-specific localization patterns and cell-type-specific enrichments for predicted driver genes.
Overall, SCINET provides a general framework to leverage single-cell profiles to assign context-specificity to global protein-interaction networks and is applicable to study any combination of reference networks/single-cell profiles.
- Federica Eduati, Eindhoven University of Technology, Netherlands
- Ramesh Utharala, European Molecular Biology Laboratory, Germany
- Patricia Jaaks, Wellcome Trust Sanger Institute, United Kingdom
- Mathew Garnett, Wellcome Trust Sanger Institute, United Kingdom
- Thorsten Cramer, RWTH University Hospital, Germany
- Christoph Merten, European Molecular Biology Laboratory, Germany
- Julio Saez-Rodriguez, Institute for Computational Biomedicine, Heidelberg University, Faculty of Medicine, BIOQUANT-Center, Germany
Research on precision medicine is essential to improve our ability to tailor treatments to patients, especially for cancer types with high patient heterogeneity (e.g. pancreatic cancer). Ideally, we would test patient-specific response to different prospective therapies, but the use of standard screening technologies is limited by the number of cells available from biopsy, so new platforms are needed. Additionally, understanding signaling pathways mediating patient-specific response to therapy would help to unveil resistance mechanisms and improve rational therapeutic strategies. With this motivation, we combined a novel microfluidics platform with mathematical modeling, for screening of biopsies from tumor patients and prioritization of personalized therapies.
We developed a plug-based microfluidics platform for combinatorial drug screening (Eduati et al., Nature Communications, 2018), which allowed us to perform a large number of experiments (>1200 data points: 56 different conditions with at least 20 replicates each) with the limited number of viable cells available from tumor biopsy. The platform allows fast (<48 hours after resection) and inexpensive (<$150 per patient) screening without the need for ex vivo culturing. We screened two cell lines and four pancreatic tumor biopsies from patients at different stages. Data were used to prioritize patient-specific treatments, which were also validated in vivo on cell line-derived mouse xenografts.
Combinatorial drug screening data were used to build patient-specific mathematical models to investigate patient heterogeneity at the level of signaling pathways (Eduati et al., bioRxiv, 2018). We started from general prior knowledge (from different public resources) on signaling pathways involved in apoptosis, and interpreted it using a logic-based ordinary differential equation formalism. The model was trained using the experimental data to derive specific predictive models, applying bootstrapping to obtain a distribution of model parameters and predictions.
Optimized models were used to compare the two cell lines, highlighting functional differences especially in the dynamics of the PI3K-Akt pathway, which cannot be derived only from basal transcriptomic or genomic data. Similarly, we investigated patient heterogeneity, showing dissimilarities in different important pathways that reflect the different tumor stages better than the screening data. Models also showed great potential for prediction of new combinatorial treatments (cross-validated Pearson correlation = 0.7), which were experimentally validated on cell lines.
In summary, we present both technological and computational advances towards personalized medicine. Combining our microfluidics platform for screening of patient biopsies with mathematical models allows us to investigate deregulations and to generate mechanism-based hypotheses of effective personalized combinatorial therapy in cancer.
- Lu Cheng, Cardiff School of Biosciences, Cardiff University, United Kingdom
- Siddharth Ramchandran, Department of Computer Science, Aalto University School of Science, Finland
- Tommi Vatanen, Broad Institute of MIT and Harvard, United States
- Juho Timonen, Department of Computer Science, Aalto University School of Science, Finland
- Niina Lietzen, Turku Centre for Biotechnology, University of Turku and Abo Akademi University, Finland
- Riitta Lahesmaa, Turku Centre for Biotechnology, University of Turku and Abo Akademi University, Finland
- Aki Vehtari, Department of Computer Science, Aalto University School of Science, Finland
- Harri Lähdesmäki, Department of Computer Science, Aalto University School of Science, Finland
Biomedical research typically involves longitudinal study designs where samples from individuals are measured repeatedly over time and the goal is to identify risk factors that are associated with an outcome value, such as disease onset, a biomarker abundance, or any other value characterizing a phenotype. General linear mixed effect models (LMM) have become the standard workhorse for statistical analysis of longitudinal data, and LMMs have successfully been used in numerous biomedical studies. However, analysis of longitudinal data using existing (parametric) statistical methods can be complicated for both practical and theoretical reasons, including difficulties in modelling correlated outcome values, functional (time-varying) covariates, nonlinear and non-stationary effects, and model inference. Consequently, recent work on modern statistical methods for longitudinal data analysis has predominantly focused on non-parametric models, such as splines, and more recently latent stochastic processes, such as Gaussian processes (GP).
We have developed LonGP, an additive Gaussian process regression model that is specifically designed for longitudinal study designs. LonGP implements a flexible, non-parametric modelling framework that solves commonly faced challenges in longitudinal data analysis. LonGP can model time-varying random effects and non-stationary signals, naturally account for missing values and irregular sampling times, incorporate multiple kernel learning, and provide interpretable results for the effects of individual covariates and their interactions. Regarding statistical inference, we have developed an accurate Bayesian inference and model selection method for our non-parametric LonGP model. We demonstrate LonGP’s performance and accuracy by analysing various simulated and real longitudinal -omics datasets, including high-throughput longitudinal proteomics and metagenomics data. We also benchmark our non-parametric method against the standard analysis methods. LonGP is implemented as versatile software that is publicly available to the research community. Given the importance of longitudinal study designs in biomedicine, we believe that our interpretable probabilistic non-parametric methods will be highly valuable tools for longitudinal data analysis.
https://www.biorxiv.org/content/early/2018/02/06/259564 (preliminary version)
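The interpretability of an additive Gaussian process, as described in the LonGP abstract above, comes from the fact that the posterior mean splits into one term per kernel component. The sketch below illustrates this with two assumed components, a trend over time shared by all individuals and an individual-specific deviation. It is not the LonGP software: the kernel choices and the fixed hyperparameters are assumptions made for illustration, whereas the abstract describes Bayesian inference and model selection for these quantities.

```python
import numpy as np

def se_kernel(a, b, length_scale, variance):
    """Squared-exponential kernel between two vectors of time points."""
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / length_scale) ** 2)

def additive_kernel(t, ids):
    """Return the shared and the individual-specific kernel matrices."""
    shared = se_kernel(t, t, length_scale=2.0, variance=1.0)
    same_id = (ids[:, None] == ids[None, :]).astype(float)
    # the individual component only correlates samples from the same individual
    individual = same_id * se_kernel(t, t, length_scale=1.0, variance=0.5)
    return shared, individual

def posterior_components(t, ids, y, noise_var=0.1):
    """Posterior mean of each additive component at the training points."""
    shared, individual = additive_kernel(t, ids)
    k_total = shared + individual + noise_var * np.eye(len(t))
    alpha = np.linalg.solve(k_total, y)           # (K + noise * I)^{-1} y
    return shared @ alpha, individual @ alpha

# toy data: two individuals measured at four time points each
t = np.tile(np.arange(4.0), 2)
ids = np.repeat([0, 1], 4)
y = np.sin(t) + np.where(ids == 0, 0.5, -0.5)
f_shared, f_individual = posterior_components(t, ids, y)
```

Plotting `f_shared` and `f_individual` separately shows how much of the signal is attributed to each component, which is the kind of covariate-level readout the abstract refers to.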
- Jonathan Warrell, Yale University, United States
- Daifeng Wang, Yale University, Stony Brook University, United States
- Shuang Liu, Yale University, United States
- Hyejung Wong, The University of North Carolina at Chapel Hill, United States
- Xu Shi, Yale University, United States
- Fabio Navarro, Yale University, United States
- Declan Clarke, Yale University, United States
- Mengting Gu, Yale University, United States
- Prashant Emani, Yale University, United States
- Mark Gerstein, Yale University, United States
Disorders of the brain affect nearly a fifth of the world’s population. Robust phenotype-genotype associations have been established for a number of brain disorders including psychiatric diseases (e.g., schizophrenia, bipolar disorder). However, understanding the molecular causes of brain disorders is still a challenge. To address this, recent large scientific projects have generated comprehensive genomic datasets for the human brain -- e.g., the PsychENCODE consortium generated ~5,500 genotype, transcriptome, chromatin, and single-cell datasets from 1,866 individuals. Using these data, we have developed a set of interpretable machine learning approaches for deciphering functional genomic elements and linkages in the brain and psychiatric disorders. In particular, we have found ~79,000 brain-active enhancers and ~2.5M eQTLs comprising ~238K linkage-disequilibrium-independent SNPs. Leveraging our QTL and Hi-C datasets, we predicted a full regulatory network for the brain, linking TFs, enhancers, and target genes, using elastic-net regression. Using the full regulatory network, we connected genes and epigenetic changes to GWAS variants for psychiatric disorders (connecting a total of 304 genes to SNPs for schizophrenia and finding new genes potentially associated with the disease). In order to further connect these molecular-level networks and phenotypes to psychiatric disorders, we developed an interpretable deep-learning model embedding the physical regulatory network to predict high-level disease and cognitive traits from the genotype via intermediate phenotypes. Our model uses a conditional Deep Boltzmann Machine architecture and introduces lateral connectivity at the visible layer to embed the biological structure learned from the regulatory network and QTL linkages.
Further, we develop a rank-statistic based interpretation scheme which allows us to functionally annotate hidden nodes and prioritize them relative to disorders, generating a hierarchy of ‘higher-order modules’ (generalizing gene co-expression modules) linked to traits of interest. Our model improves disease prediction (by 6-fold compared to additive polygenic risk scores), highlights key genes for disorders, and allows imputation of missing transcriptome information from genotype data alone. Further, we show how our results can be explicitly transformed into liability estimates, allowing us to compare the predictive performance of models in our framework that have varying amounts of intermediate structure with heritability estimates across disorders.
- Hatice Osmanbeyoglu, Memorial Sloan Kettering Cancer Center, United States
- Fumiko Shimizu, Memorial Sloan Kettering Cancer Center, United States
- Angela Rynne-Vidal, The University of Texas MD Anderson Cancer Center, United States
- Tsz-Lun Yeung, The University of Texas MD Anderson Cancer Center, United States
- Petar Jelinic, New York University, United States
- Samuel Mok, The University of Texas MD Anderson Cancer Center, United States
- Gabriela Chiosis, Memorial Sloan Kettering Cancer Center, United States
- Douglas Levine, New York University, United States
- Christina Leslie, Memorial Sloan Kettering Cancer Center, United States
Epigenomic data on transcription factor occupancy and chromatin accessibility can elucidate the developmental origin of cancer cells and reveal the enhancer landscape of key oncogenic transcriptional regulators. We develop a computational strategy called PSIONIC (patient-specific inference of networks informed by chromatin) to combine cell line chromatin accessibility data with large tumor expression data sets and model the effect of enhancers on transcriptional programs in multiple cancers. We generated a new ATAC-seq data set profiling chromatin accessibility in gynecologic and basal breast cancer cell lines and applied PSIONIC to 723 patient and 96 cell line RNA-seq profiles from ovarian, uterine, and basal breast cancers. Our computational framework enables us to share information across tumors to learn patient-specific TF activities, revealing regulatory differences between and within tumor types. Many of the identified TFs were significantly associated with survival outcome in basal breast, uterine serous and endometrioid carcinomas. To validate one PSIONIC-derived prognostic TF, we performed immunohistochemical analyses in 14 uterine serous tumors for ETV6 and confirmed that the corresponding protein expression pattern was also significantly associated with prognosis. Moreover, PSIONIC-predicted activity for MTF1 in cell line models correlated with sensitivity to MTF1 inhibition, showing the potential of our approach for personalized therapy.