Session A: (July 7 and July 8)
Session B: (July 9 and July 10)
Short Abstract: Complex regulatory programs of multicellular organisms are controlled by regulatory elements, which can be located at large distances from their target genes. The scope of action of regulatory elements is constrained by the spatial chromatin organization, in particular within Topologically Associating Domains (TADs). Recent studies investigated how genomic variations that affect TAD boundaries lead to changes in TAD structure and gene expression. However, those results remain limited to a small number of loci. To globally assess the relationship between gene expression and chromatin organization, we utilized highly rearranged balancer chromosomes in Drosophila melanogaster. These chromosomes feature multiple types of genomic variations at different scales. We compared gene expression in developing embryos using RNA-seq between the rearranged chromosomes and their wild-type counterparts. Performing this comparison in a heterozygous cross allowed us to intrinsically account for trans regulatory effects. We also quantified the differences in chromatin organization using Hi-C. In line with previous studies, we found that differential gene expression is correlated with local changes in genome topology. Surprisingly though, we observed that changes in large-scale chromatin organization do not globally correlate with changes in gene expression, despite the frequent disruption of TADs. Overall, our results are indicative of robust mechanisms buffering genomic variation.
Short Abstract: Enhancers are critical for gene regulation not only in differentiation processes but also during disease development. It remains a challenge to identify these regulatory elements in a cell-type or even disease-state dependent manner. Thus, rather than comparing separated epigenetic signature tracks we propose an approach to computationally map and compare enhancers across different samples and conditions. Here we present a two-step framework to predict and assign condition dependent enhancers solely based on ChIP-seq histone modification data. To this end, a random forest based classifier is trained on a set of high confidence regions and used for enhancer prediction. We will demonstrate that the presented approach can be applied across different tissues and species without the need of re-training. In a second step, all regions are assigned to different biological conditions by applying a permutation test directly to enhancer probability values and are subsequently formed into regulatory units by incorporating topologically associated domains (TADs). We have applied our strategy to several projects which encompass different numbers and types of conditional states and were able to prioritize candidate enhancer regions that are correlated to the respective biological question.
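The condition-assignment step described above can be illustrated with a minimal sketch. It assumes a two-group comparison of per-sample enhancer probabilities for a single region and uses the absolute difference in group means as the test statistic. The abstract does not specify the statistic or the number of permutations, so those choices and all function and parameter names here are hypothetical.

```python
import random


def permutation_p_value(group_a, group_b, n_permutations=10000, seed=0):
    """Two-sided permutation test on the difference of group means.

    group_a, group_b: enhancer probabilities (0..1) of one region in the
    samples of two conditions. Hypothetical inputs for illustration.
    """
    rng = random.Random(seed)
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_permutations):
        # Randomly reassign the pooled values to the two conditions.
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:n_a], pooled[n_a:]
        stat = abs(sum(perm_a) / n_a - sum(perm_b) / len(perm_b))
        # Small tolerance guards against floating-point summation order.
        if stat >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (hits + 1) / (n_permutations + 1)


p_specific = permutation_p_value([0.90, 0.85, 0.95, 0.88, 0.92],
                                 [0.10, 0.20, 0.15, 0.12, 0.18])
p_shared = permutation_p_value([0.5, 0.6, 0.4, 0.5], [0.5, 0.4, 0.6, 0.5])
```

A region with a small p-value would be a candidate condition-dependent enhancer; with many regions, a multiple-testing correction would be needed on top of this.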
Short Abstract: The ability to profile the expression levels of thousands of genes simultaneously and identify the genes associated with a disease has opened new avenues in understanding disease mechanisms and developing precision medicine interventions. Since the organization of physical and functional cellular networks into databases, it has been possible to develop methods that analyze expression data in the context of these networks. A key challenge is to combine the expression data with the systems-level information and still obtain specific molecular targets. We present a new analysis technique, which we call GeneSurrounder, that identifies specific disease-associated genes and takes into account the complex network of cellular interactions. GeneSurrounder identifies genes that (i) appear to influence nearby genes on the network that (ii) themselves are dysregulated and associated with the disease under study. We apply GeneSurrounder to three distinct ovarian cancer studies using a global KEGG network and show that our method yields more consistent results across multiple studies of the same phenotype than competing methods. These methods can open up new avenues of precision medicine by identifying disease-associated genes.
Short Abstract: Building on the tremendous resource of The Cancer Genome Atlas (TCGA), we carried out a comprehensive analysis of the transcriptomes of 8,705 cancer patients over a total of 32 cancer types. We uniformly processed both RNA and Whole Exome sequencing data from TCGA and extracted alternative splicing (AS) events and tumor variants. We observe thousands of AS events present in cancer samples that are absent from TCGA normals or GTEx samples and find a consistent increase of splicing in cancer vs normal (≈30%). In a genome-wide association of splicing and somatic variation we confirmed known trans-associations involving SF3B1 and U2AF1 and identified three additional trans-acting variants (IDH1, TADA1, PPP2R1A). Integrating data from protein-MS for Breast and Ovarian Cancer samples, we were able to confirm on average ≈1.7 peptides derived from novel exon-exon junctions compared to ≈0.6 SNV-derived peptides per tumor sample, for peptides that were also predicted MHC-I binders. Hence, by including neoantigens derived from novel exon-exon junctions, the fraction of samples for which at least one putative neoantigen can be identified increases from 30% to 75%, presenting a new class of splicing-associated potential neoantigens that could be exploited for immunotherapy.
Short Abstract: Integrative analysis of histone modifications across diverse tissue types and diseases has uncovered the dependence of gene regulation on chromatin organization. High-throughput technologies for analyzing genome-wide chromosomal conformation have revealed that chromatin is arranged in topologically associated domains (TADs), which remain largely stable across cell types, while intra-TAD activities are cell-type specific. Consequently, detailed knowledge about TAD boundaries can be utilized for associating epigenomic signals with their target genes. For example, we have recently identified enhancer-associated genes in 42 primary ependymoma brain tumors across six distinct molecular subgroups by H3K27ac ChIP-sequencing. Our TAD-guided analysis leveraged Hi-C data previously generated from human fetal fibroblasts and revealed promising molecular targets for improved treatment of ependymoma tumors. We have now implemented our analysis strategy as an open-source R package that can be applied to any heterogeneous cohort of samples analyzed by a combination of gene expression and epigenetic profiling techniques, with or without sample-matched chromosomal conformation information. To investigate the impact of tumor-specific TADs, we have generated chromosomal conformation data from patient-derived ependymoma cell lines. Our preliminary results confirm that enhancer-associated genes can largely be inferred by borrowing TAD information from unrelated reference samples.
Short Abstract: DNA methylation is an epigenetic event that occurs when a methyl group is added to cytosine in the DNA. This event, which regulates gene expression, is usually found in repetitive sequences. However, it can be associated with repression or stimulation of the expression of genes with important roles in the biology of different tumor types. We applied an innovative approach that uses different types of comparisons as filters between methylation microarray data from different cancer patients. The publicly available phase 3 methylation microarray data were obtained from the TCGA database. We used a total of 567 samples from different normal tissues, including 27 samples from normal breast tissues, and 1405 samples from tumor tissues from different types of cancer. As a result, we identified 764 genes with differentially methylated regions specific to breast tumors, of which 15 are associated with this type of cancer. In addition, the methylation of 8 genes had already been identified in cell-free DNA from the plasma of breast cancer patients. The same method can be used to identify specific candidate biomarkers in other tumor types. This new approach will help to identify new specific candidate diagnostic biomarkers not previously described in breast cancer, including in blood plasma.
Short Abstract: The importance of protein-DNA interactions in gene regulation is indisputable. Yet, the role of RNA-DNA interactions in gene regulation has been poorly explored so far. We are interested in triple helices, where a single RNA strand binds to the major groove of a double helix and individual nucleobases form specific Hoogsteen hydrogen bonds with adenine or guanine residues of the purine-rich DNA strand. There is increasing evidence for the use of triple helix binding in transcription regulation. So far, computational methods for triple helix detection are based on enumerating all triple helices, i.e. short sequences with a high proportion of bases following the triple helix code, for a given pair of RNA and DNA sequences. We describe here a method for statistical characterization of the potential of a set of RNAs to bind to particular DNA regions. Triplex Domain Finder (TDF) indicates regions within the RNAs (DNA binding domains) with the highest potential for forming triple helices. Case studies on long noncoding RNAs known to form triple helices demonstrate that TDF is able to recover known regions of RNA and DNA forming triple helices. Moreover, sequencing confirms triple helix binding sites of a known and a novel Meg3 DNA binding domain.
Short Abstract: A catalogue of mutations that drive tumorigenesis and progression is essential to understanding tumor biology and developing therapies. Protein-coding driver mutations have been well characterized by large exome-sequencing studies; however, many tumors have no mutations in protein-coding drivers, and few non-coding drivers besides the TERT promoter are known. To fill this gap, we analyzed 150,000 cis-regulatory regions in 1,844 whole cancer genomes from the ICGC-TCGA PCAWG project. Using our new method, ActiveDriverWGS, we found 41 frequently mutated regulatory elements (FMREs) enriched in non-coding SNVs and indels characterized by aging-associated mutation signatures and frequent structural variants. FMREs were enriched in super-enhancers and long-range chromatin interactions, suggesting that the mutations drive cancer by altering distal gene regulation. The chromatin interaction network of FMREs and target genes revealed associations of mutations and differential gene expression of known and novel cancer genes, activation of immune response pathways and altered enhancer marks. Thus, distal genomic regions may include additional, infrequently mutated drivers that act on target genes via chromatin loops. Our study is an important step towards finding such regulatory regions and deciphering the somatic mutation landscape of the non-coding genome.
Short Abstract: The Assay for Transposase-Accessible Chromatin followed by sequencing (ATAC-seq) is a simple and fast protocol for detection of open chromatin. However, computational footprinting in ATAC-seq, i.e. the search for regions with depletion of cleavage events due to transcription factor binding, has been poorly explored so far. We propose HINT-ATAC, a footprinting method that addresses ATAC-seq-specific protocol artifacts. HINT-ATAC uses a probabilistic framework based on variable-order Markov models to learn the complex sequence cleavage preferences of the transposase enzyme. Moreover, we observed strand-specific cleavage patterns around the binding sites of transcription factors, which are determined by local nucleosome architecture. HINT-ATAC exploits local nucleosome architecture to significantly outperform competing footprinting methods in predicting transcription factor binding sites defined by ChIP-seq.
Short Abstract: Principal component analysis (PCA) is a widely used technique for dimensionality reduction and visualization in genomics, where the number of dimensions can be thousands or even hundreds of thousands. However, since each principal component (PC) is a linear combination of original dimensions, the meaning of the new dimensions can be hard to interpret. For PCA of DNA methylation data, the cytosines which are the original dimensions may not have a clear biological annotation, further hindering interpretation. Currently, there is a lack of methods for interpreting PCs of DNA methylation data. We present a method which annotates PCs using sets of genomic regions corresponding to a given biological annotation, such as transcription factor binding or histone modifications. We tested the method on DNA methylation data from breast cancer, confirming known associations, and data from the rare childhood cancer Ewing sarcoma, discovering novel associations. Our method is computationally efficient, scales well with increasing number of samples, and will fit well into existing analysis workflows. This method will be broadly useful to help researchers understand variation in DNA methylation among samples.
Short Abstract: The development of DNA sequencing technologies has dramatically increased the amount of genome sequence data derived from diverse species and individual humans. The next pressing challenge is a deeper understanding of what genome sequences encode and how to extract useful information from them. Such sequence-based understanding would ultimately make phenotypes predictable from genome sequences. While the syntax of protein-coding genes is well understood (which allows us to predict some phenotypic consequences, such as those of nonsense mutations), much remains to be explored about the basic rules of non-coding gene regulatory sequences. In this presentation, I would like to introduce a deep learning-based approach to tackle this limitation. I design a deep convolutional neural network that can learn and predict regulatory DNA sequences. The key features of my method are the following: a) a convolutional layer that integrates information from forward and reverse DNA sequences; b) a simplified data structure; c) a new quality index to filter out low-quality data from a training data set. These features improve the prediction accuracy of the model. Furthermore, by extracting what the model learned, I show some preliminary results that may be useful for interpreting genomic information.
Short Abstract: Ewing sarcoma is an aggressive pediatric cancer predominantly driven by EWS-FLI1. Little is known about the systemic impact of EWS-FLI1 and the underlying basis of its chemosensitivity. We probed a genome-wide RNAi screen and identified transcription, RNA metabolism and DNA damage response as being required for Ewing sarcoma viability. Interestingly, these processes were also altered in response to damage in expression profiles of Ewing sarcoma cell lines. We found a highly significant accumulation of R-loops (three-stranded RNA:DNA structures), a consequence of transcription dysregulation, in Ewing sarcoma. We developed an analysis pipeline in order to compare genome-wide R-loops and other ChIP-seq data and found a strong concordance between R-loops and RNA Polymerase II, as well as the DNA repair protein BRCA1 both in terms of peak height (depicting level of enrichment) as well as coverage. Importantly, BRCA1 co-localization was significantly higher at genes also bound by EWS-FLI1. Further, BRCA1 localization was diminished following damage in control cell lines but less so in Ewing sarcoma. Finally, these observations were confirmed by experimental evidence of impaired homologous recombination and sensitivity to PARP1 inhibitors. In conclusion, our study combines bioinformatics and experimental data to establish the underlying basis of Ewing sarcoma chemosensitivity.
Short Abstract: We have shown previously that higher-order Bayesian Markov Models (BaMMs) perform substantially better than PWMs or first-order models for motif discovery [Siebert M, NAR, 2016]. To bring the community the high-order BaMMs with improved quality and to offer users the possibility to combine various standard analyses, we developed the BaMM webserver with user-friendly interfaces and results pages. The BaMM webserver offers four tools: (i) de-novo motif discovery in a sequence set, (ii) scanning a sequence set with motifs to find motif occurrences, (iii) searching with an input motif for similar motifs in our BaMM database, and (iv) browsing and keyword searching in the database. Our motif database contains motifs for 798 transcription factors, trained from two ChIP-seq databases for human and mouse. In contrast to other servers, e.g. JASPAR and HOCOMOCO, we represent sequence motifs not by PWMs but by 4th-order BaMMs. To address the inadequacy of P- and E-values as measures of motif quality, which are badly correlated with biological relevance of the motif, we developed the AURRC score (area under the recall-versus-true-positive-to-false-positive-ratio curve). The AURRC score summarizes how well the motif model can distinguish true motif instances from the background. The BaMM server is freely accessible at https://bammmotif.mpibpc.mpg.de.
Short Abstract: Non-coding gene regulatory enhancers are essential to transcription in mammalian cells. As a result, numerous experimental and computational strategies have been developed to identify cis-regulatory enhancer sequences. Most studies consider enhancers identified by only a single method, and concordance between sets from different methods has not been comprehensively evaluated. We assessed the similarities of enhancer sets identified by ten representative strategies in four biological contexts and evaluated the robustness of resulting downstream conclusions. We demonstrate significant dissimilarity between enhancer sets in genomic characteristics, evolutionary conservation, and association with functional loci. We find most regions identified as enhancers are supported by only one method. The disagreement is sufficient to influence interpretation of functional loci, and to lead to disparate conclusions about enhancer biology and disease mechanisms. We also find limited evidence that regions identified by multiple methods are better enhancer candidates than regions identified by a single strategy. Our results highlight the inherent complexity of enhancer biology and argue that current approaches have yet to adequately account for enhancer diversity. To facilitate assessment of enhancer diversity in future studies, we developed creDB, a database of enhancer annotations designed to integrate into bioinformatics workflows.
Short Abstract: Generating detailed and accurate organogenesis models using single-cell RNA-seq data remains a major challenge. Current methods have relied primarily on the assumption that descendant cells are similar to their parents in terms of gene expression levels. These assumptions do not always hold for in vivo studies, which often include infrequently sampled, unsynchronized, and diverse cell populations. Thus, additional information may be needed to determine the correct ordering and branching of progenitor cells and the set of transcription factors (TFs) that are active during advancing stages of organogenesis. To enable such modeling, we have developed a method that learns a probabilistic model that integrates expression similarity with regulatory information to reconstruct the dynamic developmental cell trajectories. When applied to mouse lung developmental data, the method accurately distinguished different cell types and lineages. Existing and new experimental data validated the ability of the method to identify key regulators of cell fate.
Short Abstract: High-throughput chromatin conformation assays such as Hi-C have enabled genome-wide detection of long-range chromatin contacts, which have been shown to be integral in various regulatory mechanisms. However, interactions from Hi-C experiments are typically identified at relatively coarse resolutions (e.g., 5-25kb) and thus do not robustly identify interactions at a fine-scale. We present a novel computational method, Chromatin Interaction Siamese Convolutional Neural Net (ChISCNN), to fine map Hi-C detected interactions to their likely source at a high resolution. Using high resolution information within DNase-seq and ChIP-seq data for transcription factors and histone marks, we trained a Siamese Convolutional Neural Network (SCNN) to discriminate between true interactions and non-interactions. We then use a feature importance algorithm along with the SCNN to assign each pair of 100bp subregions a score that corresponds to its importance in the Hi-C interaction. We demonstrate the effectiveness of our approach both by comparing our predictions to independent genome annotations and the recovery of original Hi-C peaks after extending their boundaries. Finally, we discuss what signals give chromatin interactions their specificity.
Short Abstract: Regulation of gene expression is an important mechanism through which genetic variation can affect complex traits. A substantial portion of gene expression variation can be explained by both local (cis) and distal (trans) genetic variation. Much progress has been made in uncovering cis-acting expression quantitative trait loci (cis-eQTL), but trans-acting eQTL have been more difficult to identify and replicate. Rather than testing every SNP for association with every gene, we first imputed the component of gene expression determined by local genetic variation. Then, we tested this imputed gene expression component for association with observed expression of genes on different chromosomes to identify trans-acting genes. Gene expression imputation models were trained by applying statistical machine learning to independent eQTL panels. We leverage a recent extension of PrediXcan called MulTiXcan, which is a gene level association method that aggregates imputation models across multiple eQTL panels, to identify 1159 trans-acting genes and their 1247 targets, for a total of 3657 trans-acting/target gene pairs (FDR < 0.05). Trans-acting genes identified by MulTiXcan are enriched in transcription and transcription factor pathways, which indicates our method uncovers genes of expected function.
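The two-stage idea above (impute the locally determined expression component, then test it against observed expression of a distal gene) can be sketched as follows. This is a deliberately simplified single-panel illustration using a linear predictor and a Pearson correlation; it is not the MulTiXcan procedure, and the weights, dosages, and function names are hypothetical.

```python
import math


def impute_expression(dosages, weights):
    """Predict the genetically determined expression component.

    dosages: one row per individual with allele dosages (0, 1, 2) of cis SNPs.
    weights: per-SNP weights from a previously trained model (assumed given).
    """
    return [sum(w * d for w, d in zip(weights, row)) for row in dosages]


def pearson_r(x, y):
    """Pearson correlation between two equally long numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


# Toy data: five individuals, two cis SNPs for the candidate trans-acting gene.
dosages = [[0, 1], [1, 1], [2, 0], [2, 2], [0, 0]]
weights = [0.5, -0.2]
predicted = impute_expression(dosages, weights)

# Observed expression of a gene on a different chromosome (toy values that
# depend linearly on the imputed component, so the correlation is perfect).
target_expression = [2 * p + 1 for p in predicted]
r = pearson_r(predicted, target_expression)
```

In a real analysis the association would be tested with a regression model and corrected for multiple testing (the abstract reports FDR < 0.05) rather than read off a single correlation.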
Short Abstract: The availability of increasing volumes of multi-omics profiles across many cancers promises to improve our understanding of the regulatory mechanisms underlying cancer. The main challenge is to integrate these multiple levels of omics profiles and especially to analyze them across many cancers. Here we present AMARETTO, an algorithm that addresses both challenges in three steps. First, AMARETTO identifies potential cancer driver genes through integration of copy number, DNA methylation and gene expression data. Then AMARETTO connects these driver genes with co-expressed target genes that they control, defined as regulatory modules. Thirdly, we connect AMARETTO modules identified from different cancer sites into a pancancer network to identify cancer driver genes. Here we applied AMARETTO in a pancancer study comprising eleven cancer sites and confirmed that AMARETTO captures hallmarks of cancer. We also demonstrated that AMARETTO enables the identification of novel pancancer driver genes. In particular, our analysis led to the identification of pancancer driver genes of smoking-induced cancers and ‘antiviral’ interferon-modulated innate immune response.
Short Abstract: Both repressive (H3K27me3) and active (H3K4me3) histone modifications are present at key developmental promoters in embryonic stem cells due to co-localization of repressive Polycomb group (PcG) and activating Trithorax/COMPASS group (TrxG) protein complexes. Direct functional interactions between PcG and TrxG at these bivalent promoters are unclear. Our integrative analysis of public RNA-seq and ChIP-seq datasets of multiple PcG and TrxG proteins revealed a quantitative genome-wide correlation between chromatin occupancies of Kdm2b, a component of Polycomb Repressive Complex 1 (PRC1), and Mll2, an essential H3K4 methylase component of TrxG. This correlation suggested potential functional crosstalk between Kdm2b and Mll2 at both active and repressed promoters. Experimental validation of this hypothesis revealed that loss of Kdm2b resulted in depletion of Mll2 at promoters genome-wide, suggesting that Kdm2b is required for Mll2 occupancy at both bivalent and active promoters. Loss of Kdm2b or the core PRC1 component Ring1b also resulted in the reduction of H3K4me3 at bivalent promoters. This surprising finding suggests a direct pathway for cooperation between PcG and TrxG complexes at bivalent promoters, an unexpected modification to the current model of bivalency. In addition, our results reveal a genome-wide role for the Kdm2b protein independent of the full PRC1 complex.
Short Abstract: Hypoxia is prevalent in many tumors and a regulator of malignant tumor progression, notably through hypoxia-inducible transcription factors (HIFs). HIF’s downstream effects on alternative splicing (AS) are unclear. To identify HIF-dependent AS, we performed RNA-seq on human pancreatic cancer (PDAC) cells subjected to hypoxia or normoxia +/- ARNT/HIF1B, a dimerization partner for hypoxia transcriptional response. We identified 538 HIF-dependent events (FDR<15%), of which 38 have a percent-spliced-in change (delta-PSI) > 0.05. We experimentally validated events using multiple PDAC cell lines and patient-derived PDAC organoids. More than half (22/38) were confirmed in TCGA. We compared PSI values between tumor and normal tissues across cancer types with sufficient sample size. In breast cancer, where HIFs are upregulated, 10/22 events have a significant PSI difference. Among BRCA patients, we found differential usage of the hypoxia-inducible event we identified for SLC35A3 (q-value=2.8e-08; t-test), a transporter involved in metabolism. Focusing on SLC35A3, we illustrate how HIF-dependent isoforms constitute a novel regulatory mechanism in hypoxia biology. In summary, we report the discovery of hypoxia-inducible isoforms in PDAC, linking HIF proteins with post-transcriptional regulation. We validate a set of HIF-dependent splicing events in several model systems and investigate their prevalence in human cancer patients.
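As a worked illustration of the delta-PSI threshold mentioned above, the sketch below computes PSI as the fraction of reads supporting inclusion of an event and keeps events whose change between conditions exceeds 0.05. The read counts, event names, and the use of the absolute change are assumptions for illustration; the abstract does not state how PSI was estimated, and the FDR filter is not reproduced here.

```python
def psi(inclusion_reads, exclusion_reads):
    """Percent spliced in: fraction of reads supporting inclusion."""
    total = inclusion_reads + exclusion_reads
    if total == 0:
        raise ValueError("no reads support this event")
    return inclusion_reads / total


def delta_psi(condition, control):
    """PSI change between two (inclusion, exclusion) read-count pairs."""
    return psi(*condition) - psi(*control)


def select_events(events, min_abs_delta=0.05):
    """Keep events whose absolute PSI change exceeds the threshold.

    events: dict mapping an event id to a pair of (inclusion, exclusion)
    read counts, first for hypoxia and then for normoxia.
    """
    selected = {}
    for event_id, (hypoxia, normoxia) in events.items():
        change = delta_psi(hypoxia, normoxia)
        if abs(change) > min_abs_delta:
            selected[event_id] = change
    return selected


# Hypothetical read counts for two splicing events.
toy_events = {
    "event_1": ((80, 20), (50, 50)),  # PSI 0.80 vs 0.50
    "event_2": ((51, 49), (50, 50)),  # PSI 0.51 vs 0.50
}
selected = select_events(toy_events)
```

Here only `event_1` passes the threshold, with a PSI change of about 0.30.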
Short Abstract: Nearly all protein-coding genes undergo alternative RNA splicing, which provides an important means to expand transcriptome diversity beyond the scope of genomic information. While splicing is an elaborate process, it can be prone to errors that could become pathogenic. Unsurprisingly, aberrant splicing, which collectively refers to splicing events that could confer risk of a disease, is often implicated in cancer. Recent studies have revealed that splicing regulation is characterized by increased levels of nucleosome density and positioning, DNA methylation, and distinct histone modification patterns. However, most studies on aberrant splicing have largely focused on identifying genomic- and transcriptomic-level variations within splice sites, cis-acting splicing regulatory elements, and trans-acting splicing factors. The extent, nature, and effects of epigenomic dysregulation in aberrant splicing remain unresolved. By systematically profiling the epigenomic landscape of aberrant splicing using transcriptomic and epigenomic data from the ENCODE and the Epigenome Roadmap projects, we aim to (1) identify chromatin status and distinct epigenetic signatures that characterize aberrant splicing in cancer, (2) classify aberrant splicing by class of epigenomic dysregulation, and (3) elucidate the role of epigenomic control in aberrant splicing. The proposed study will significantly advance our understanding of the epigenomic contribution to aberrant splicing in cancer.
Short Abstract: As part of a DARPA program called “Communicating with Computers”, we are developing a natural-language-driven dialog system for analyzing gene expression. This system is one part of a larger system that allows for development of executable network models of biological pathways, network path analysis, and integration of cancer proteomic and genetic data. The overall system employs an agent-based architecture. The system incorporates state-of-the-art natural language processing, reasoning systems, and dialog management that incorporates semantics of biological molecules and processes. We have developed an agent for the system that we call the “Transcription Factor and Targeting Agent” or TFTA. The TFTA can answer queries regarding which transcription factors are associated with a given gene and which genes are bound by a given transcription factor, the tissue-specific expression of a gene, and the association of a gene and transcription factor with a pathway. The TFTA can find transcription factors common to a list of genes or pathways common to a set of genes. The TFTA also answers queries about microRNAs and their target genes. The goal of the system is to allow biologists to easily access and analyze gene regulatory relations using natural language.
Short Abstract: It is widely recognized that disruptions of chromatin-based mechanisms caused by epigenetic alterations through histone modifications contribute to cancer development and progression. These histone modifications are highly reversible, making them potential drug targets in cancer therapy. While many molecularly targeted drugs have the potential to revert these modifications, their precise mechanisms of action, i.e., alterations in gene regulatory networks, are poorly understood. To address this problem, we developed an integrated phosphoprotein-histone-drug network (iPhDnet) that serves as a window into histone modifications in breast cancer, revealing molecular fingerprints referred to as “global chromatin profile fingerprints”. The model is based on a hybrid approach, whereby an unsupervised clustering method is used for histone signature generation and a supervised multivariate regression method is used for histone prediction using high-information-content mass spectrometry data.
Short Abstract: Enhancers and promoters both play indispensable roles in gene transcription activation. Recently, we observed that mutation in, or loss of, a preferred cognate promoter can release its regulatory enhancer to loop to, and activate, an alternative promoter in its chromosomal neighborhood. Here, we present a novel computational approach to identify such 'enhancer release and retargeting' (ERR) events on a genome-wide scale, and their implications in human diseases, through statistical analysis and integration of various genomic data sets and the GWAS catalog of human complex diseases. We identified putative ERR events, with a count ranging from 31 to 525, in all 48 human tissues available in the current version of GTEx. Over a hundred of them are common to multiple tissues, with some occurring in as many as 36 tissue types. In several ERR events, enhancer retargeting would cause activation of genes associated with diseases. Our analysis shows that ERR, a previously unobserved and unsuspected mechanism by which genetic alterations of promoters cause activation of alternative gene promoters, is a common occurrence in the transcriptomes of multiple tissue types. Moreover, our study suggests that ERR may also point to a previously overlooked mechanism underlying disease and developmental defect risk.
Short Abstract: Characterizing DNA binding specificities of transcription factors (TFs) is of primary importance for studying gene regulation. Recently, several lines of evidence suggested that both DNA sequence and shape contribute to TF binding. However, providing direct evidence for the role of DNA shape in TF binding has been challenging due to the difficulty in separating the sequence and structure contributions to the binding. To address this challenge, we developed a novel way of analyzing the results of in vitro HT-SELEX experiments for TF-DNA binding. Specifically, the presence of motif-free sequences in late HT-SELEX cycles and their enrichment in weak binders allowed us to detect evidence for the role of DNA shape features in TF binding. Our approach revealed that, even in the absence of a sequence motif, TFs weakly bind to DNA molecules enriched in specific shape features that were often TF specific. Surprisingly, we also found that some properties of DNA shape contribute to promiscuous binding of all tested TF families. Strikingly, such promiscuously bound shapes correspond to the most frequent shape formed by the DNA. We propose that this promiscuous binding facilitates sliding of a TF along the DNA molecule before it is locked in its binding site.
Short Abstract: MicroRNA Data Integration Portal (mirDIP) is an integrative database of human microRNA target predictions. In its recent version (v4.1), mirDIP compiles nearly 152 million computational predictions obtained from 30 different resources, covering almost all known human microRNAs and the vast majority of human genes. In contrast to other integrative resources, the scope of mirDIP extends beyond collection of the existing data. Predictions obtained from individual resources were standardized to the contemporary nomenclature and their predictive precision was benchmarked on the currently available experimental evidence. We used statistical learning to infer what we refer to as an integrative score, to assess overall confidence in the existence of a given microRNA-target interaction. As we demonstrated, the integrative score provides more precise predictions than those obtained from individual resources. Importantly, due to its integrative nature, predictions derived using this score are also less biased than those from other resources, which tend to identify targets belonging to specific biological processes or pathways (likely due to the underlying knowledge bias). Using mirDIP we identified previously unknown functional classes of microRNAs and revealed novel associations between microRNAs and various human pathologies. Altogether, mirDIP provides a comprehensive and reliable resource for miRNA-target predictions, substantially advancing human microRNA-related research.
Short Abstract: Topologically associated domains (TADs) are regions of the genome defined by strong intra-TAD interaction patterns. TADs have been shown to be associated with a variety of genomic functions, including gene regulation. Despite this importance, there is no consensus on how to properly detect TADs from raw Hi-C contact matrices. We propose a method that reframes TAD detection as a spectral clustering problem. To perform clustering, the contact matrix is interpreted as an adjacency matrix corresponding to a graph with weights indicating the number of inter-loci contacts. Spectral clustering is performed with each cluster corresponding to a unique TAD. Clusters are assigned based on the iterative discretization method described in (Yu and Shi 2003). We demonstrate how the eigengap, a common heuristic for determining the number of clusters for spectral clustering, fails when analyzing Hi-C matrices, and introduce a novel alternative for detecting the optimal number of TADs based on maximizing cluster-wise silhouette scores. Our method is implemented in the SpectralTAD R package. Results show that TAD boundaries identified with this method co-locate heavily with CTCF peaks. Additionally, this method produces TADs that have better separation when compared to other commonly used methods.
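The silhouette-based choice of the number of TADs can be illustrated with a minimal sketch. This is not the SpectralTAD implementation: the toy contact matrix, the conversion of contacts to distances as 1/(1 + contacts), the hand-written candidate partitions (standing in for the output of spectral clustering), and the use of the mean silhouette over all bins are assumptions made for illustration only.

```python
def silhouette(dist, labels):
    """Mean silhouette width for a precomputed distance matrix and labels.

    Requires at least two distinct labels.
    """
    n = len(labels)
    clusters = sorted(set(labels))
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for i in range(n):
        own = [j for j in range(n) if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(dist[i][j] for j in own) / len(own)
        # b: smallest mean distance to any other cluster
        b = min(
            sum(dist[i][j] for j in range(n) if labels[j] == c) / labels.count(c)
            for c in clusters
            if c != labels[i]
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / n


# Toy 6-bin contact matrix with two dense blocks (hypothetical counts).
contacts = [[10 if (i < 3) == (j < 3) else 1 for j in range(6)] for i in range(6)]
# Assumed contact-to-distance transform; the diagonal is set to zero.
dist = [[0.0 if i == j else 1.0 / (1.0 + contacts[i][j]) for j in range(6)]
        for i in range(6)]

# Hand-written candidate partitions of contiguous bins, keyed by cluster count.
candidates = {2: [0, 0, 0, 1, 1, 1], 3: [0, 0, 1, 1, 2, 2]}
best_k = max(candidates, key=lambda k: silhouette(dist, candidates[k]))
```

On this toy matrix the two-cluster partition, which matches the two dense blocks, scores higher than the three-cluster one, so `best_k` is 2.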
Short Abstract: Chromatin interactions play important roles in enhancer-promoter interactions (EPI) and in regulating the transcription of genes. CTCF and cohesin proteins are located at the anchors of chromatin interactions, forming their loop structures. DNA binding sequences of CTCF show an orientation bias at chromatin interaction anchors, with forward-reverse (FR) orientation frequently observed. However, it is still unclear which other proteins are associated with chromatin interactions. To find DNA binding motif sequences of transcription factors (TFs), such as CTCF, that affect EPI and the transcription of genes, transcriptional target genes were predicted based on enhancer-promoter association (EPA). EPA was shortened at the genomic locations of FR or reverse-forward (RF) orientation of DNA binding motifs of TFs. The expression level of the target genes predicted based on EPA was compared with that of target genes predicted from promoters only. In total, 351 DNA motifs with biased orientation significantly affected the expression level of putative transcriptional target genes consistently across monocytes from four individuals, and included known transcription factors associated with chromatin interactions and EPI, such as CTCF, cohesin (RAD21 and SMC3), ZNF143 and YY1. Moreover, EPI predicted using FR or RF orientation of some DNA motifs overlapped with chromatin interaction data (Hi-C) more than EPI predicted using the other EPA.
Short Abstract: Cell differentiation is driven by changes in gene expression that manifest as changes in cellular phenotype or function. Altered cellular phenotypes, stemming from genetic mutations or other perturbations, are widely assumed to directly correspond to changes in the transcriptome and vice versa. Here, we use the cytologically well-defined Prdm9 mutant mouse as a model of developmental arrest to demonstrate that parallel programs of cellular differentiation and transcription can become dis-associated. By comparing cytological phenotype markers and transcriptomes in wild-type and mutant spermatocytes, we identified multiple instances of cellular and transcriptional uncoupling in Prdm9-/- mutants. Most notably, Prdm9-/- germ cells arrest cytologically in late-leptotene/zygotene but nevertheless develop gene expression signatures characteristic of later, post-arrest developmental substages. These findings suggest that transcriptome changes may not reliably map to cellular phenotypes in perturbed systems.
Short Abstract: Transcription factors (TFs) are known to recognize DNA using both sequence (direct) and shape (indirect) readout. To investigate the contribution of shape to protein-DNA binding, we use mismatches (i.e. mis-paired bases) to induce significant structural changes in TF-DNA binding sites, with minimal changes in the DNA sequence of these sites. We present Saturation Mismatch Binding Assay (SaMBA), the first assay to characterize the effects of mismatches on TF-DNA binding in high-throughput. For genomic sequences of interest, SaMBA generates DNA duplexes containing all possible single-base mismatches, and quantitatively assesses the effects of the mismatches on TF-DNA interactions. We applied SaMBA to measure binding of 21 TFs (covering 14 structural families) to thousands of mismatched sequences, and mapped the impact of mismatches on these TFs. For all tested factors we found that DNA mismatches within binding sites can significantly increase TF binding levels. Furthermore, for several TFs we identified non-specific genomic regions that become strongly bound after certain mismatches are introduced. Structural analyses of mismatches that increase TF binding revealed that these mismatches oftentimes distort the naked DNA to induce shapes that are also present in the protein-bound sites, thus providing direct evidence of the contribution of DNA shape to protein-DNA recognition.
Short Abstract: Chromatin and its 3D organization play important roles in cellular function in the eukaryotic cell. With advances in 3C (Hi-C) technology, more long-range intra-chromosomal and inter-chromosomal interactions between genomic loci have come to light. Specifically, the 3D organization of the genome may play important roles in transcription regulation. The theory of the “transcription factory” is one such hypothesis. These nuclear subcompartments are dynamically organized so that the genes in these compartments have coordinated transcription. This study is an attempt to further consolidate the theory of the “transcription factory” using a spatial Markov Random Field (MRF) model. By directly modelling gene expression values on a spatial neighborhood network inferred from Hi-C data, we were able to estimate the level of spatial dependency among protein-coding genes in the human IMR90 cell line. We overcame the computational challenges of large matrices by using the double Metropolis algorithm to carry out the Markov Chain Monte Carlo (MCMC) simulation for this Bayesian model. Our study confirms the spatial dependency of transcription among neighboring genes in the 3D genome organization on a global scale. Further insights can be gained into the mechanism of differential expression as a response to stimuli involving the chromatin compartments.
Short Abstract: Better understanding of regulatory architectures and underlying disease etiology substantially enhances the targeting of effective risk variants or biological entities in complex diseases, including Alzheimer's disease (AD). To resolve the heterogeneity of various immune cell types, researchers recently applied scRNA-seq to AD transgenic mice to identify potential markers. However, current analytical pipelines for these error-prone single-cell assays either use conventional methods from bulk RNA-seq studies with vulnerable assumptions or incorporate partial findings from scRNA-seq data into statistical methods. Due to the effects of inevitable noise and large sparsity in such high-dimensional data, and a complex mixture of biological stochasticity and technical variability, the accuracy of the analytical outcomes is highly questionable. With more than 100,000 cells from fresh human brains of 12 individuals obtained via drop-seq protocols, we developed a novel computational framework for identifying immune-related cell subtypes at unprecedentedly high resolution by iteratively combining parametric modeling and nonparametric approaches. We show that our approach successfully identifies known, hidden and rare subtypes and reveals their associated marker genes more accurately than current methods. Our novel method would provide a detailed description of the mechanistic interplay among distinctive immune cells at multiple scales and therefore assist in developing better therapeutics for neurological diseases.
Short Abstract: Advances in single-cell transcriptomics have enabled observing gene expression in individual cells, providing a detailed view of dynamic biological processes. Cells’ expression states allow them to be ordered based on their progression through a process such as differentiation. Ordered data can be valuable in understanding the underlying gene-gene regulatory interactions that control the process. Regulatory interactions between genes can manifest as 'causal' relationships in their expression trends; for example, increases and decreases in expression of a regulator gene may consistently precede those of its target genes. However, the distribution of cells along the process is not uniform, preventing the use of standard mathematical methods for detecting dependencies in temporal data, including Granger causality. We present an ensemble approach using a generalized Lasso-based Granger causality test suitable for analyzing irregular time series to infer gene regulatory networks from ordered single-cell data. Modified Borda count aggregation combines multiple rankings obtained from diverse kernel-based Granger causality analyses. The kernel smooths over the irregularly-spaced and missing observations. We apply our algorithm to mouse embryonic stem-cell differentiation datasets and demonstrate that it recovers gold standard transcriptional regulatory interactions more accurately than existing single-cell network inference algorithms.
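The rank-aggregation step described above can be sketched with a plain Borda count. The abstract refers to a modified Borda count whose details are not given here, so the unmodified scoring rule, the function name `borda_aggregate`, and the toy edge rankings are assumptions for illustration only.

```python
def borda_aggregate(rankings):
    """Combine several rankings (best item first) with a plain Borda count.

    An item at position p in a ranking of n items earns n - 1 - p points.
    Ties in total points are broken alphabetically for determinism.
    """
    points = {}
    for ranking in rankings:
        n = len(ranking)
        for position, item in enumerate(ranking):
            points[item] = points.get(item, 0) + (n - 1 - position)
    return sorted(points, key=lambda item: (-points[item], item))


# Hypothetical edge rankings from three Granger analyses with different kernels.
rankings = [
    ["A->B", "B->C", "A->C"],
    ["A->B", "A->C", "B->C"],
    ["B->C", "A->B", "A->C"],
]
consensus = borda_aggregate(rankings)
```

Here `A->B` collects 5 points, `B->C` 3 and `A->C` 1, so the consensus ranking lists the edges in that order.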
Short Abstract: DNA-binding specificity is a fundamental characteristic of transcription factors (TFs). In eukaryotes, most TF-coding genes have undergone gene duplication and divergence during evolution, resulting in paralogous factors that have highly conserved DNA-binding domains and recognize similar DNA sequence motifs. However, paralogous TFs oftentimes bind to distinct targets in the cell, and they perform distinct regulatory functions. The differential genomic targeting by paralogous TFs is generally assumed to be due to interactions with protein cofactors or the chromatin environment. Using a computational-experimental framework called iMADS (integrative Modeling and Analysis of Differential Specificity), we show that, contrary to previous assumptions, paralogous TFs bind differently to genomic target sites even in vitro. We used iMADS to quantify, model, and analyze specificity differences between 11 TFs from 4 protein families. We found that paralogous TFs have diverged mainly at medium and low affinity sites, which are poorly captured by current motif models. We identify sequence and shape features differentially preferred by paralogous TFs, and we show that the intrinsic differences in specificity among paralogous TFs contribute to their differential in vivo binding. Thus, our study represents a step forward in deciphering the molecular mechanisms of differential specificity in TF families.
Short Abstract: Mammalian genomes are organized into different levels. As one of the fundamental structural units, Topologically Associating Domains (TADs) play a key role in gene regulatory machinery. Recent studies found that hierarchical structures are also present within some TADs. However, precise identification of the locations of hierarchical TAD structures remains challenging. Here we present HitHiC, a dynamic programming based method that can accurately and quickly uncover hierarchical TAD structures from Hi-C data. Through a systematic evaluation, we show that HitHiC has better accuracy, reproducibility and running speed than existing methods. We applied HitHiC to high resolution Hi-C matrices and found that TADs with nested structures are in general more active than those without. Furthermore, we identified a group of boundaries that are shared by multiple TADs, which we call super boundaries. We showed that the super boundaries are highly enriched with active chromatin states and expressed genes. This observation of super boundaries is potentially consistent with asymmetric movement in the loop extrusion model. Altogether, our results provide new insights towards understanding the complex system of gene regulation.
Short Abstract: Environmental changes have profound effects on the well-being and survival of living organisms. For example, environmental stress factors have been found to account for over 70% of chronic diseases, yet knowledge about their genome-wide impacts remains limited. Therefore, it is crucial to uncover environmentally responsive genetic mechanisms at the systems level. We have built comprehensive gene co-expression networks with in-house and publicly available transcriptome data of Caenorhabditis elegans from different environmental stress conditions, such as exposures to microgravity and dietary alterations. Our gene networks showed high accuracy (>88%) for predicting the known gene networks and are significantly enriched in previously found protein-protein interactions (P<0.01). Moreover, the networks predicted 19% - 98% more interactions for the known pathways through guilt-by-association. A high correlation was found between the function of the gene networks and the observed phenotype. By incorporating data from histone modifications and transcription factors, we identified the putative drivers of the gene co-expression networks. Overall, our combinatory approach can help facilitate exploration of the disease-driving genetic responses to environmental stimuli and identify potential therapeutic targets.
Short Abstract: Despite the availability of large datasets of RNA and protein expression data, the underlying regulatory mechanisms that allow a cell to adequately respond to a wide range of external stimuli have not been fully quantified. For example, many existing transcription network inference methods typically do not estimate biophysical parameters involved in RNA transcription, but instead rely on probabilistic methods such as random decision forests. Conversely, proper biophysical modeling has typically only been implemented in small-scale systems, such as the LAC operon. The primary goal of my work is to demonstrate that increasing the level of biophysical detail can improve large-scale modeling and inference of regulation. First, I demonstrate that explicitly accounting for RNA degradation improves transcription regulatory network inference in S. cerevisiae and B. subtilis, and that RNA half-lives estimated de-novo via condition- and gene-specific network inference optimization correspond to experimentally measured RNA half-lives. Furthermore, I demonstrate that more accurate mathematical treatment of the contribution of transcriptional repression to RNA expression regulation infers more biophysically accurate models of regulation for many genes.
Short Abstract: Approximately 150 of the 200,000 neurons in the Drosophila brain comprise the fly's circadian neural network. Two main types of cells, PDF and DN1, have diverse functions in guiding fly circadian activity and behavior. PDF (or “morning”) cells control the morning peak of activity and are important in short photoperiods. DN1 cells, part of a group called “evening” cells, control the evening peak of activity as well as the morning peak, and enhance morning arousal. DN1 cells feed back to morning and evening cells and promote sleep. It has been shown recently that temperature and sleep regulate the activity of DN1 neurons. Using a variety of computational methods, we analyzed cycling patterns of gene expression in RNA-seq data from both DN1 and PDF neurons under different diet and temperature conditions. While the core clock genes retain their phase of oscillations in both cell types and under all conditions, we observe clock-regulated genes with a strong phase difference between the neuron types. Differential expression analysis shows that temperature impacts gene cycling profiles more strongly than diet in these neurons. Analysis of gene rhythmicity, differential cycling and functional clustering provides insights into these neurons’ distinctive functions and their response to environmental perturbations.
Short Abstract: The Encyclopedia of DNA Elements (ENCODE) and the Roadmap Epigenomics Project seek to characterize the epigenome in diverse cell types using assays that identify, for example, genomic regions with modified histones or accessible chromatin. These efforts have produced thousands of datasets but cannot possibly measure each epigenomic factor in all cell types. To address this, we present a method, PaRallel Epigenomics Data Imputation with Cloud-based Tensor Decomposition (PREDICTD), to computationally impute missing experiments. PREDICTD leverages an elegant model called “tensor decomposition” to impute many experiments simultaneously. Tensor decomposition learns a low-rank representation of the epigenome that captures latent patterns in ChIP-seq and DNase-seq experiments from the Roadmap Epigenomics data corpus. Compared with the current state-of-the-art method, ChromImpute, PREDICTD produces lower overall mean squared error, and combining the two methods yields further improvement. We show that PREDICTD data captures enhancer activity at noncoding human-accelerated regions. PREDICTD provides reference imputed data and open-source software for investigating new cell types, and demonstrates the utility of tensor decomposition and cloud computing, both promising technologies for bioinformatics.
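How a low-rank tensor decomposition yields an imputed value can be sketched as follows. This is a generic rank-R CP-style reconstruction, not necessarily the exact PREDICTD model (which may include additional terms and is trained on real data); the factor matrices, their values, and the function name `impute` are hypothetical.

```python
def impute(cell_factors, assay_factors, position_factors, c, a, g):
    """Reconstruct one tensor entry from three rank-R factor matrices.

    The entry for (cell type c, assay a, genomic position g) is the sum over
    latent dimensions of the element-wise product of the three factor rows.
    """
    return sum(
        x * y * z
        for x, y, z in zip(cell_factors[c], assay_factors[a], position_factors[g])
    )


# Hypothetical rank-2 factors: 2 cell types, 1 assay, 2 genomic positions.
cell_factors = [[1.0, 0.0], [0.5, 2.0]]
assay_factors = [[2.0, 1.0]]
position_factors = [[1.0, 1.0], [3.0, 0.0]]

# Any (cell type, assay, position) combination can be filled in, including
# combinations that were never measured experimentally.
value = impute(cell_factors, assay_factors, position_factors, c=1, a=0, g=0)
```

In practice the factor matrices would be learned by minimizing the error on observed entries; the sketch only shows the reconstruction step that produces imputed signal.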
Short Abstract: The placenta is crucial during pregnancy, regulating proper fetal growth and development. However, many aspects of placental function and development are not yet fully understood. We therefore aimed to identify active and repressed gene networks in mouse placenta at e9.5. We generated open chromatin data using ATAC-seq, and integrated it with previously published transcriptomic data. RNA-seq reads were quantified using transcripts per million (TPM), and ATAC-seq reads were quantified at gene promoters using the maximum read pileup (coverage). We then grouped genes based on their TPM and promoter coverage values. Genes with high expression and high coverage were enriched for house-keeping functions. Surprisingly, we identified genes that have high expression and medium-low coverage, that were enriched for placenta related terms including vasculogenesis and endothelial cell migration. We also identified genes that have low expression and high promoter coverage and, within this group, we extracted a protein-protein interaction network enriched for neuronal functions. Finally, we generalized these findings by running our analysis pipeline on eight other tissues/cell-lines. We found that the genes with medium-low coverage and high expression are consistently enriched for tissue-specific terms and genes. We also identified potentially repressed neuronal networks in placental cells and embryonic stem cells.
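The grouping of genes by expression and promoter accessibility can be sketched as a simple two-threshold classification. The cutoffs (`tpm_cut`, `cov_cut`), the two-level coverage split, the gene names, and the values below are hypothetical and are not taken from the study.

```python
def classify_gene(tpm, coverage, tpm_cut=10.0, cov_cut=50.0):
    """Label a gene by expression (TPM) and promoter ATAC-seq coverage.

    Both cutoffs are illustrative assumptions, not the study's thresholds.
    """
    expr = "high_expr" if tpm >= tpm_cut else "low_expr"
    cov = "high_cov" if coverage >= cov_cut else "medlow_cov"
    return f"{expr}/{cov}"


def group_genes(genes):
    """Group gene names by their expression/coverage label."""
    groups = {}
    for name, (tpm, coverage) in genes.items():
        groups.setdefault(classify_gene(tpm, coverage), []).append(name)
    return groups


# Hypothetical genes mapped to (TPM, maximum promoter read pileup).
genes = {
    "geneA": (120.0, 200.0),  # expressed, accessible promoter
    "geneB": (85.0, 20.0),    # expressed, weakly accessible promoter
    "geneC": (0.5, 150.0),    # accessible promoter but low expression
}
groups = group_genes(genes)
```

Each resulting group could then be passed to a functional enrichment analysis, as the abstract describes.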
Short Abstract: A large amount of multi-species functional genomic data from high-throughput assays are becoming available to help understand the molecular mechanisms for phenotypic diversity across species. However, continuous-trait probabilistic models, which are key to such comparative analysis, remain under-explored. Here we develop a new model, called phylogenetic hidden Markov Gaussian processes (Phylo-HMGP), to simultaneously infer heterogeneous evolutionary states of functional genomic features in a genome-wide manner. Both simulation studies and real data application demonstrate the effectiveness of Phylo-HMGP. Importantly, we applied Phylo-HMGP to analyze a new cross-species DNA replication timing (RT) dataset from the same cell type in five primate species (human, chimpanzee, orangutan, gibbon, and green monkey). We demonstrate that our Phylo-HMGP model enables discovery of genomic regions with distinct evolutionary patterns of RT. Our method provides a generic framework for comparative analysis of multi-species continuous functional genomic signals to help reveal regions with conserved or lineage-specific regulatory roles.
Short Abstract: Long non-coding RNAs (lncRNAs) play crucial roles in many developmental processes in plants. In particular, plant lncRNAs have emerged as important regulatory elements in response to biotic and abiotic stress. To date, identification of lncRNAs in switchgrass has not been performed and their regulatory roles are unknown. In this study, we predicted lncRNAs using two tools, Coding Potential Calculator (CPC) and Plant Long Non-Coding RNA Prediction by Random Forest (PLncPRO) from RNA-Seq data of switchgrass derived from plants under heat and drought stress. A total of 14,144 novel candidate lncRNAs were predicted, of which 90, 44 and 128 were differentially expressed in drought, heat, and drought + heat stress conditions as compared to control, respectively. Further characterization of candidate lncRNAs was performed for their length distribution, exon number, AU content and annotation categories. Genes that overlap with, and are in close proximity (<= 10 kb) of differentially expressed lncRNAs were extracted and associated Gene Ontology (GO) terms were analyzed. Neighboring genes of differentially expressed lncRNAs were associated with stress related GO terms. This study will enable the exploration of potential regulatory roles of switchgrass lncRNAs under specific stress conditions and provide information to uncover functions of lncRNAs in switchgrass.
Short Abstract: Identification of functional transcription factors that regulate a given gene set is an important problem in gene regulation studies. Conventional approaches for identifying transcription factors, such as DNA sequence motif analysis, are unable to predict functional binding of specific factors and are not sensitive enough to detect factors binding at distal enhancers. Here we present Binding Analysis for Regulation of Transcription (BART), a novel computational method and software package for predicting functional transcription factors that regulate a query gene set or associate with a query genomic profile, based on more than 6000 existing ChIP-seq datasets for over 400 factors in human or mouse. This method demonstrates the advantage of utilizing publicly available data for regulatory genomics research.
Short Abstract: We present a new computational pipeline to identify transcription factors (TFs) associated with inter-individual drug response variation by integrating genotype, gene expression, and cytotoxicity data on a panel of ~300 cell lines, in the context of transcription factor binding sites (TFBS) derived from ENCODE and motifs from various databases. The first method of the pipeline, STAPMM, predicts the impact of a SNP on TF binding; we demonstrate its efficacy by predicting allele-specific binding SNPs and comparing its performance to other methods, such as gkmSVM. The second component of the pipeline assesses the extent to which SNPs impacting TFBS are proximal to drug response genes (genes whose expression co-varies with cytotoxicity), thereby associating TFs with drug response. We predicted 38 significant (TF, Drug) pairs at an FDR threshold of 0.05, 21 of which are not significant in the absence of TF binding predictors. Among the 38 (TF, Drug) pairs, three of them, (ELF1, Epirubicin), (ELF1, Doxorubicin), and (SP1, Carboplatin), were experimentally validated. We took a further look at (ELF1, Doxorubicin) and found that 25 out of the 44 drug response genes predicted to be affected by SNPs that change TF binding are central to the apoptosis pathway.
Short Abstract: Massively parallel reporter assays (MPRAs) are a technique that enables testing thousands of regulatory DNA sequences in a single, quantitative experiment. Since MPRA is still a nascent technology, there is no established set of computational methods dedicated to effectively leveraging its promise. Development of such methods could help improve future MPRA candidate sequence selection, enhance our ability to predict functional regulatory sequences and increase our understanding of the regulatory code and how its alteration can lead to a phenotypic consequence. Here we present MPRAnalyze: a statistical framework dedicated to analyzing MPRA count data. MPRAnalyze addresses all major questions posed in the context of MPRA experiments: estimating the magnitude of the effect of a regulatory sequence in a single condition setting, and comparing differential activity of regulatory sequences across multiple conditions. The framework allows for various distributional assumptions and uses generalized linear models to account for uncertainty in both DNA and RNA observations, control for various sources of unwanted variation, and incorporate negative controls for robust hypothesis testing, thereby providing clear quantitative answers in complex experimental settings. We demonstrate the robustness, accuracy and applicability of MPRAnalyze on simulated data and published data sets. MPRAnalyze is implemented as a publicly available R package.
Short Abstract: Although the usage of pathway analysis in whole genome sequencing (WGS) has greatly increased our understanding of the genetic determinants of disease, most disease-associated WGS variants are located outside of protein-coding regions and are thought to reside in regulatory regions, including many enhancers. In order to be able to utilize this information, it is necessary to map enhancers into existing pathway networks via their target genes. We want to be able to associate individual enhancers to the specific genes that they regulate in order to extend pathway networks to include enhancer function in a tissue-specific manner. We developed a pipeline to create links between putative enhancers and genes to infer the target genes of enhancers. We used this pipeline in conjunction with data from ENCODE, Ensembl, VISTA, FANTOM, GTEx, and EMBL-EBI to create a database of enhancer-gene links that allows for tissue-specific relationships between enhancers and genes, as well as transparency regarding which assays from which data sources were used to generate the links. Using epigenetic marks associated with enhancer activity, ChIA-PET data, single-tissue eQTL variants located within enhancers, and topologically associated domain data, we linked 340,119 enhancers to 18,674 protein-coding genes in 72 tissues and cell types.
Short Abstract: Forkhead (Fkh) transcription factors are evolutionarily conserved among eukaryotes, and coordinate timely cell cycle progression. In budding yeast, Fkh are expressed during a lengthy window of the cell cycle, and are thus potentially able to function as hubs integrating multiple cellular networks. Here, we report on a novel ChIP-exo dataset of Fkh targets. ChIP-exo combines ChIP with lambda exonuclease digestion followed by high-throughput sequencing, and allows identification of a nearly complete set of binding sites at single nucleotide resolution. The available software for ChIP-seq analyses, GEM and MACE, encountered problems when analyzing the ChIP-exo dataset. Therefore, we have developed a novel ChIP-exo data analysis method, which we named maxPeak. This method confirms known Fkh targets, and points to many novel ones across various cellular processes. We analyzed target genes with respect to their functional enrichment, temporal expression during the cell cycle and the metabolic pathways they occur in. Furthermore, we present a comprehensive overview of the current knowledge of Fkh targets by integrating our results with complementary genome-wide studies available in the literature, also pointing at differences in metabolic targets between Fkh. Our work highlights Fkh as hubs that integrate multi-scale regulatory networks to achieve proper timing of cell division in budding yeast.
Short Abstract: Models for predicting phenotypic outcomes from genotypes have important applications to understanding genomic function and improving human health. Here, we developed a machine learning system to predict cell type-specific epigenetic and transcriptional profiles in large mammalian genomes from DNA sequence alone. We introduce densely connected dilated convolution layers to propagate information across large sequence distances in a convolutional neural network. Using this architecture, the system identified promoters and distal regulatory elements and synthesized their content to make effective gene expression predictions. We trained models to predict thousands of human genomic profiles across hundreds of cell types. Model predictions for the influence of genomic variants on gene expression align well to causal variants underlying human eQTLs mapped by the Genotype-Tissue Expression project. We demonstrate how these predictions can be used to generate mechanistic hypotheses to enable fine mapping of disease loci.
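Why dilated convolutions help propagate information across large sequence distances can be seen from the receptive field of a stack of layers. The sketch below uses the standard formula for stride-1 convolutions; the kernel size and dilation schedule are illustrative assumptions, not the architecture used in the study.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in input positions) of stacked stride-1 convolutions.

    Each layer with dilation d widens the receptive field by (kernel_size - 1) * d.
    """
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Four ordinary layers versus four layers with exponentially growing dilation.
plain = receptive_field(3, [1, 1, 1, 1])    # 9 positions
dilated = receptive_field(3, [1, 2, 4, 8])  # 31 positions
```

With the same number of layers and parameters, doubling the dilation at each layer makes the receptive field grow exponentially rather than linearly with depth.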
Short Abstract: The three-dimensional (3D) organization of genomes plays a key role in the regulation of genes. High-throughput chromosome conformation capture (Hi-C) allows 3D genomic organization to be determined by capturing all chromatin contacts within a cell population. Much work has already been done to computationally predict these contacts using one or more types of biochemical data (e.g., nucleosome positioning, ChIP-seq, Hi-C) as input. Although informative, these models cannot be applied to cell types or genomes where the required input data is unavailable (e.g., ancestral genomes). Moreover, most studies only predict a subset of all genome-wide chromatin interactions, which are typically found at relatively short distances (<1 Mb). Here, we describe the supervised regression problem of predicting complete Hi-C contact maps from genomic sequence alone. To address this problem, we define multiple features derived from genomic sequence data that allow machine learning algorithms to fit the underlying distribution of a Hi-C contact map at restriction-fragment resolution. We show that our models provide (i) accurate predictions of Hi-C contact frequency by properly weighting input features relevant to a particular cell type as well as (ii) insight into potential factors contributing to chromatin architecture changes.
Short Abstract: In an era with an ever increasing number of organisms with available genomes, direct estimation of functionally related genes from the genome is a fundamental challenge in computational biology. To achieve this goal, we have investigated genomic features associated with the functional relationships of genes by using gene coexpression information. Gene coexpression, the similarity of gene expression profiles, provides a genome-wide approximation of functional gene relationships at the level of transcriptional regulation. In a comparative analysis between the similarity of genomic features and the strength of gene coexpression, we found that genes belonging to the same evolutionary age group tend to be strongly coexpressed. By comparing allele frequencies calculated from the results of several large genome cohort studies, we further found that individual genetic diversity differed significantly between older and younger gene loci. We also investigated the effect of genetic diversity on the architecture of other gene networks, such as a protein-protein interaction network. We anticipate our results to be a starting point for understanding the mechanisms underlying the evolution of cellular systems, and for developing a genome-based gene function prediction method that takes individual genetic diversity into consideration.
Short Abstract: DNA methylation is a well-studied epigenetic mark with key roles in normal cell differentiation and cancer. Several software packages facilitate individual DNA methylation analysis steps, such as normalization and differential analysis. However, tools that provide a start-to-finish pipeline are rare. To fill this gap, we developed the RnBeads software package. It is structured into the following modules: import and export, quality control, preprocessing, covariate inference, and exploratory and differential analysis. Here, we present a substantially extended version of RnBeads with major improvements to each of the modules, including support for new data types (e.g. the Illumina EPIC array), new inference methods (such as epigenetic age prediction and estimating immune cell content of cancer samples) and improved usability through a new graphical user interface. We showcase this on four reproducible examples, each highlighting the new features: a large array-based blood data set, a whole-genome bisulfite sequencing data set on human hematopoiesis, a reduced-representation bisulfite sequencing data set on Ewing sarcoma and a benchmark data set for cross-platform integration. RnBeads represents a comprehensive tool for DNA methylation analysis and is available through R/Bioconductor. Assenov, Y. et al. Comprehensive analysis of DNA methylation data with RnBeads. Nat. Methods 11, 1138–1140 (2014).
Short Abstract: see the attached.
Short Abstract: In the face of changes in their environment, bacteria adjust gene expression levels and produce appropriate responses. The individual layers of this process have been widely studied: the transcriptional regulatory network describes the regulatory interactions that produce changes in the metabolic network, and both are coordinated by the signaling network. However, the interplay between these layers has never been described in a systematic fashion. Here, we formalize the process of detection and processing of environmental information mediated by individual transcription factors (TFs), utilizing a concept termed genetic sensory response units (GENSOR units), which are composed of four components: (1) a signal, (2) signal transduction, (3) a genetic switch, and (4) a response. We used experimentally validated data sets from two databases to assemble a GENSOR unit for each of the 189 local TFs of Escherichia coli K-12 in RegulonDB. Further analysis suggested that feedback is a common occurrence in signal processing, and that there is a gradient of functional complexity in the response mediated by each TF, as opposed to a one regulator/one capacity rule. Finally, we provide examples of other GENSOR unit applications, such as hypothesis generation, detailed description of cellular decision making, and elucidation of indirect regulatory mechanisms.
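The four-component structure of a GENSOR unit described above can be pictured as a simple record. The following is only an illustrative sketch: the class name, field names, placeholder values, and the feedback criterion (a response element that also appears in the signal or signal-transduction components) are assumptions made for this example and are not taken from the abstract or from RegulonDB.

```python
from dataclasses import dataclass
from typing import Set


@dataclass
class GensorUnit:
    """Hypothetical container for the four GENSOR unit components of one TF."""
    tf: str
    signal: Set[str]               # (1) signal, e.g. an effector metabolite
    signal_transduction: Set[str]  # (2) elements relaying the signal to the TF
    genetic_switch: Set[str]       # (3) TF conformations or regulated promoters
    response: Set[str]             # (4) products of the regulated genes

    def has_feedback(self) -> bool:
        # Assumed criterion: the response acts back on signal perception
        # if a response element is also part of components (1) or (2).
        upstream = self.signal | self.signal_transduction
        return bool(self.response & upstream)


# Placeholder names only; not real E. coli annotations.
unit = GensorUnit(
    tf="TF_X",
    signal={"metabolite_M"},
    signal_transduction={"transporter_T"},
    genetic_switch={"TF_X:metabolite_M"},
    response={"transporter_T", "enzyme_E"},
)
no_feedback_unit = GensorUnit(
    tf="TF_Y",
    signal={"metabolite_N"},
    signal_transduction={"sensor_S"},
    genetic_switch={"TF_Y:metabolite_N"},
    response={"enzyme_F"},
)
print(unit.has_feedback(), no_feedback_unit.has_feedback())
```

Representing each component as a set keeps the feedback check a plain set intersection; a real analysis would need curated identifiers for metabolites and gene products.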
Short Abstract: As the cost of RNA sequencing has continued to fall, the amount of publicly available RNA-seq data has continued to grow. This technology offers several advantages over microarrays, including the ability to capture known and novel transcripts and all isoforms. Constructing gene co-expression networks is a predominant method for studying gene function in specific biological contexts. However, integrating RNA-seq data from multiple sources into an accurate co-expression network still poses a significant challenge, largely due to the need for read count normalization and the presence of batch effects from different experiments, which introduce non-biological variation into the data. In this research, we leverage thousands of uniformly aligned RNA-seq samples from various experiments and tissues to address these challenges. We construct gene co-expression networks for different experimental conditions (such as different tissues) using different normalization methods and batch effect correction methods to find the best methodology for each pipeline. The resulting networks are evaluated based on their ability to recover documented gene relationships.
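As a rough illustration of the workflow outlined above (normalize read counts, build a co-expression network, score it against documented gene relationships), here is a minimal sketch. It assumes counts-per-million normalization with a log2 transform, Pearson correlation with a fixed cutoff, and recall of known gene pairs as the evaluation; none of these specific choices, function names, or the toy data come from the abstract, and batch effect correction is omitted.

```python
import math
from itertools import combinations


def log_cpm(counts):
    """Convert raw counts (gene -> per-sample list) to log2(CPM + 1)."""
    genes = list(counts)
    n_samples = len(counts[genes[0]])
    lib_sizes = [sum(counts[g][s] for g in genes) for s in range(n_samples)]
    return {
        g: [math.log2(counts[g][s] / lib_sizes[s] * 1e6 + 1)
            for s in range(n_samples)]
        for g in genes
    }


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either profile is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def coexpression_edges(expr, threshold=0.8):
    """Gene pairs whose absolute correlation meets the (assumed) cutoff."""
    edges = set()
    for g1, g2 in combinations(sorted(expr), 2):
        if abs(pearson(expr[g1], expr[g2])) >= threshold:
            edges.add((g1, g2))
    return edges


def recall(edges, known_pairs):
    """Fraction of documented gene pairs recovered as network edges."""
    known = {tuple(sorted(p)) for p in known_pairs}
    if not known:
        return 0.0
    return len(edges & known) / len(known)


# Toy counts for four hypothetical genes across four samples.
counts = {
    "geneA": [10, 20, 30, 40],
    "geneB": [20, 40, 60, 80],
    "geneC": [40, 30, 20, 10],
    "geneD": [5, 5, 5, 5],
}
edges = coexpression_edges(log_cpm(counts))
score = recall(edges, [("geneB", "geneA")])
print(sorted(edges), score)
```

Swapping `log_cpm` for another normalization function and comparing the resulting `recall` values mirrors, in miniature, the kind of method comparison the abstract describes.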
Short Abstract: Cell fate acquisition is a fundamental process in the ontogeny of multicellular organisms, involving a plethora of intrinsic and extrinsic instructive signals that direct the lineage progression of pluripotent cells. In the present study, we reveal the signal-propagating role of the chromatin interactome in the commitment and propagation of the initiating signal in early neurogenesis by reconstructing dynamic loop-enhanced Gene Regulatory Networks (eGRNs) that integrate transcriptome, chromatin accessibility and long-range chromatin interactions in a temporal dimension. We observe a highly dynamic re-wiring of chromatin interactions at very early stages of neuronal differentiation. Long-range chromatin interactions are massively reorganized; only 30% of the initial interactome is conserved through cell differentiation, while new interactions are established as early as 6 hours after induction of neurogenesis. By integrating chromatin interactions with temporal epigenome and transcriptome data, we identify a group of key regulatory elements that respond to and propagate the initial signal. Our data reveal an enormous capacity of the morphogen to reorganize long-range chromatin interactions by “reading” distant epigenetic signals and chromatin accessibility to drive cell fate acquisition. These results suggest that the differential establishment of chromatin contacts directs the acquisition of cell fate.
Short Abstract: Several UV cross-linking protocols such as eCLIP have been established to delineate the molecular interactions between RNA-binding proteins (RBPs) and their target RNAs. With the advancement of pooled CRISPR/Cas9 screens, it is possible to perturb noncoding genomic regions and hence re-investigate the global impact of these proteins. We present SliceIt (http://sliceit.soic.iupui.edu/), a database of in silico designed sgRNA (guide RNA) libraries to facilitate such high-throughput screens. We used CRISPR-DO to design ~4.8 million unique sgRNAs targeting all possible RBP binding sites from ENCODE eCLIP experiments of 123 RBPs in HepG2 and K562 cell lines. SliceIt provides a user-friendly environment built on the Elasticsearch framework. It is available in both table and genome browser views, facilitating easy navigation of RBP binding sites, designed sgRNAs, and exon expression levels across 53 human tissues, along with the prevalence of SNPs and GWAS hits on binding sites. Users can also upload custom tracks in various file formats (in the browser) to navigate and compare additional genomic features and omics data in the hg38 genome. SliceIt provides a one-stop sgRNA library resource for RBP binding sites, along with several layers of functional information, to design both low- and high-throughput CRISPR/Cas9 screens.
Short Abstract: Identifying regions that are similar in function among species is crucial because they are likely to be functionally conserved and therefore important. In particular, finding functionally similar regions between human and mouse would help us make informed use of murine models. Previous studies are limited in that they require matching cell types across species and measure conservation at the functional genomics level only for a small portion of the genome. We therefore developed a score that quantifies conservation at the functional genomics level between human and mouse. Scores are generated at base-pair resolution by a neural network trained to learn the characteristics of functionally similar regions. Features consist of ChromHMM annotations and peak calls from DNase-seq, ChIP-seq, and CAGE experiments across human and mouse cell types. Unlike previous methods, our method does not require matching cell types, taking advantage of the wealth of publicly available data. Regulatory elements conserved in sequence or active in similar cell types in human and mouse score highly, suggesting that our score captures conservation of regulatory activity. Our score is moderately correlated with sequence conservation scores, which suggests that our method offers complementary information to existing genomic annotations based on sequence alignment.