RECOMB/ISCB Conference on Regulatory & Systems Genomics with DREAM Challenges 2020 Poster Information
Late Breaking Posters
- Wei Vivian Li, Rutgers, The State University of New Jersey, United States
- Yanzeng Li, Rutgers, The State University of New Jersey, United States
Short Abstract: A system-level understanding of the regulation and coordination mechanisms of gene expression is essential to understanding the complexity of biological processes in health and disease. With the rapid development of single-cell RNA sequencing technologies, it is now possible to investigate gene interactions in a cell-type-specific manner. Here we propose the scLink method, which uses statistical network modeling to construct sparse gene co-expression networks from single-cell gene expression data. We first propose a new correlation measure for the strength of gene co-expression relationships, while accounting for the sparsity feature of single-cell gene expression data. Next, relying on the more robust correlation measure, scLink identifies gene co-expression networks in single cells using a penalized and data-adaptive graph model. We have evaluated and validated the accuracy and robustness of scLink in both simulation and real data studies. First, our simulation studies showed that scLink has the best accuracy in gene network construction, given different network topologies, gene numbers, and sparsity levels of gene expression, compared with five other network construction methods. We then conducted a series of studies to evaluate the performance of scLink on real single-cell gene expression data. Our results based on the Tabula Muris database show that scLink is able to identify cell-type-specific networks and functional gene modules, and the edges inferred by scLink can capture regulatory relationships between gene pairs. Moreover, our real data studies also demonstrate that scLink can help identify co-expression changes and gene network rewiring between healthy and disease states such as breast cancer. In addition, scLink was also demonstrated to reveal network differences and critical hub genes in time course data, such as those from the differentiation process of definitive endoderm. 
For easy application of scLink to additional single-cell gene expression datasets, we implemented the methods in the R package scLink (https://github.com/Vivianstats/scLink). To demonstrate the applications of scLink and disseminate the research findings in our real data studies, we also developed a web application of scLink (https://rutgersbiostat.shinyapps.io/sclink/).
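The idea of a correlation measure that accounts for sparsity can be illustrated with a minimal sketch. It assumes, purely for illustration, that the correlation is computed only over cells in which both genes are expressed above a threshold. The function name, the threshold and the minimum cell count are hypothetical, and this is not necessarily the measure implemented in the scLink R package.

```python
import math

def coexpressed_correlation(x, y, threshold=0.0, min_cells=3):
    """Pearson correlation restricted to cells where both genes exceed `threshold`.

    x, y: expression values of two genes across the same cells.
    Returns 0.0 when too few cells qualify or when either gene is constant.
    """
    if len(x) != len(y):
        raise ValueError("x and y must cover the same cells")
    # Keep only cells in which both genes are detected (assumed sparsity handling).
    pairs = [(a, b) for a, b in zip(x, y) if a > threshold and b > threshold]
    n = len(pairs)
    if n < min_cells:
        return 0.0
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

# Toy data: zeros in either gene are ignored when computing the correlation.
gene_a = [0, 1, 2, 3, 0]
gene_b = [5, 2, 4, 6, 0]
r = coexpressed_correlation(gene_a, gene_b)
```

Restricting the computation to co-expressed cells keeps zero counts from driving the estimate, at the cost of using fewer cells per gene pair.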
- Arjun Yadaw, Icahn School of Medicine at Mount Sinai, United States
- Yan-Chak Li, Icahn School of Medicine at Mount Sinai, United States
- Sonali Bose, Icahn School of Medicine at Mount Sinai, United States
- Ravi Iyengar, Icahn School of Medicine at Mount Sinai, United States
- Supinda Bunyavanich, Icahn School of Medicine at Mount Sinai, United States
- Gaurav Pandey, Icahn School of Medicine at Mount Sinai, United States
Short Abstract: The COVID-19 pandemic has affected millions of individuals and caused hundreds of thousands of deaths worldwide. Predicting mortality among patients with COVID-19 who present with a spectrum of complications is very difficult, hindering the prognostication and management of the disease. We aimed to develop an accurate prediction model of COVID-19 mortality using unbiased computational methods, and identify the clinical features most predictive of this outcome. For this, we applied a rigorous machine learning framework to clinical data from a large cohort of patients with COVID-19 treated at the Mount Sinai Health System in New York City, NY, USA, to predict mortality. This framework consisted of missing value imputation, feature selection and classification steps. We analyzed patient-level data captured in the Mount Sinai Data Warehouse database for individuals with a confirmed diagnosis of COVID-19 who had a health system encounter between March 9 and April 6, 2020. For initial analyses, we used patient data from March 9 to April 5, and randomly assigned (80:20) the patients to the development dataset (n=3841) or test dataset 1 (n=961; retrospective). Patient data for those with encounters on April 6, 2020, were used in test dataset 2 (n=249; prospective). We designed prediction models based on clinical features and patient characteristics during health system encounters to predict mortality using the development dataset. We then assessed the resultant models in terms of the area under the Receiver Operating Characteristic curve (AUC) score in the test datasets. Using the development dataset and the systematic machine learning framework, we developed a COVID-19 mortality prediction model that showed high accuracy (AUC=0·91) when applied to test datasets of retrospective and prospective patients. 
This model was based on the XGBoost algorithm and three clinical features: patient's age, minimum oxygen saturation over the course of their medical encounter, and type of patient encounter (inpatient vs outpatient and telehealth visits). These features showed significantly differential distributions among the deceased and alive classes, and have previously been associated with COVID-19 severity and outcomes. After external validation, such an accurate and parsimonious mortality prediction model based on three features might have utility in clinical settings to guide the management and prognostication of patients affected by this disease.
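The evaluation protocol described above (a random 80:20 split followed by AUC scoring) can be sketched as follows. The function names, the seed and the toy scores are assumptions for illustration only. The pairwise AUC below is a plain-Python stand-in and not the authors' pipeline.

```python
import random

def split_80_20(items, seed=0):
    """Randomly assign items to a development set (80%) and a test set (20%)."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = int(round(0.8 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

def auc(labels, scores):
    """Area under the ROC curve, computed as the probability that a random
    positive case scores higher than a random negative case (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy usage: labels are 1 = deceased, 0 = alive; scores are model outputs.
development, test = split_80_20(range(100))
toy_auc = auc([1, 0, 1, 0], [0.9, 0.8, 0.2, 0.1])
```

The pairwise formulation is quadratic in the number of patients, so a rank-based implementation would be preferable for cohorts of the size reported here.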
- Fan Zheng, University of California, San Diego, United States
- She Zhang, University of Pittsburgh, United States
- Chris Churas, University of California, San Diego, United States
- Dexter Pratt, University of California, San Diego, United States
- Ivet Bahar, University of Pittsburgh, United States
- Trey Ideker, University of California, San Diego, United States
Short Abstract: In any ‘omics study, significant patterns in data often become apparent only when looking at the right scale. For instance, when clustering single-cell transcriptomes, is the analysis tuned to discover broad or specific cell types? Likewise, protein communities revealed from protein networks can vary widely in size depending on the method. Here, we use the concept of “persistent homology”, drawn from mathematical topology, to develop a new computational method, termed HiDeF (Hierarchical community Decoding Framework), for identifying robust structures in data at all scales simultaneously. HiDeF outperformed alternative approaches that do not consider persistence of structures in benchmark experiments with simulated and real-world single-cell transcriptomics data and protein networks. Comparative analysis of protein-protein interaction networks found significant differences in the distributions of community sizes across these networks, correlating with the different measurement approaches used to generate them. HiDeF could facilitate biological interpretation of large ‘omics data, and its application to mouse single-cell transcriptomes significantly expands the catalog of identified cell types, while analysis of SARS-CoV-2 protein interactions suggests hijacking of WNT. The method is available via Python and Cytoscape.
- Jiajie Peng, Northwestern Polytechnical University, China
- Xiaoyu Wang, Northwestern Polytechnical University, China
- Yuxian Wang, Northwestern Polytechnical University, China
- Hansheng Xue, Australian National University, Australia
- Xuequn Shang, Northwestern Polytechnical University, China
Short Abstract: Single-cell RNA-sequencing (scRNA-seq) is a powerful technique that enables researchers to measure gene expression at the resolution of single cells. An effective low-dimensional representation of scRNA-seq data is critical for downstream analyses, such as interpreting cell sub-populations. However, scRNA-seq data suffer from technical noise and bias, which have to be modeled to reduce the uncertainty in downstream analyses. In this study, by incorporating prior biological knowledge from gene ontology, we developed a deep semi-autoencoder model, named scAN, to denoise scRNA-seq data and obtain a low-dimensional representation. We demonstrate, with datasets from different platforms and of different sizes, that the newly proposed model can give a more accurate low-dimensional representation of scRNA-seq data than several state-of-the-art methods. Furthermore, by incorporating gene ontology, we provide an opportunity to interpret the biological knowledge behind the low-dimensional representation.
- Christina Akirtava, Carnegie Mellon University, United States
- C. Joel McManus, Carnegie Mellon University, United States
- Gemma E. May, Carnegie Mellon University, United States
- Hunter Kready, Carnegie Mellon University, United States
- Lauren Nazzaro, Carnegie Mellon University, United States
- Matt Agar-Johnson, Carnegie Mellon University, United States
Short Abstract: Translation initiation is regulated by sequences surrounding the start codon, the “Kozak context” (Kozak, 1984), cis-acting sequences and structures in the 5’ transcript leader (TL), and corresponding trans-acting factors. Previous work evaluating the in vitro translation efficiency of 96 native yeast TLs showed that changes of 50-200 nt in TL length varied translation up to 100-fold (Rojas-Duran & Gilbert, 2012). Larger-scale in vivo reporter studies of fixed-length synthetic TLs identified upstream AUGs as major repressors of gene expression (Cuperus et al., 2017; Dvir et al., 2013; Sample et al., 2018). However, the in vivo regulation of translation by native yeast TLs has not been systematically studied. To investigate cis-regulatory translational control by native TLs in vivo, we assayed gene expression from ~11,000 endogenous TLs of S. cerevisiae and S. paradoxus using Fluorescence-Activated Cell Sorting and high-throughput sequencing (FACS-seq) (Noderer et al., 2014). Additionally, we tested all Kozak variants surrounding AUG start codons in S. cerevisiae. We find that Kozak context influences expression over a ~20-fold range, with the expected strong preference for -3 A. Our results show that alternative transcription start sites that change leader length by as little as 10 nucleotides can impact gene expression as much as ~16-fold. Finally, we trained a machine learning model on native 5’ TLs to identify regulatory features that improve sequence-based predictions of translation. Although Kozak strengths explain much of the variance in expression, our model quantitates the influence of upstream open reading frames, mRNA folding around the 5’ cap, and other structures that regulate the rate of translation initiation in vivo. Thus, our results identify the range and relative regulatory impacts of Kozak context and other cis-acting sequences and structures on translation from native yeast TLs in vivo.
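Two of the sequence features discussed above, the nucleotide at Kozak position -3 and the presence of upstream AUGs in the transcript leader, can be extracted with a short sketch. The function names and the toy sequence are hypothetical. Positions follow the common convention in which the A of the start codon is +1 and the preceding nucleotide is -1.

```python
def kozak_minus3(mrna, start):
    """Return the nucleotide at position -3 relative to the start codon.

    mrna: RNA sequence (A, C, G, U); start: 0-based index of the A in the AUG.
    Returns None when the leader is shorter than three nucleotides.
    """
    if mrna[start:start + 3] != "AUG":
        raise ValueError("no AUG at the given start index")
    if start < 3:
        return None
    return mrna[start - 3]

def upstream_augs(mrna, start):
    """Return 0-based positions of AUG triplets lying entirely within the leader."""
    leader = mrna[:start]
    return [i for i in range(len(leader) - 2) if leader[i:i + 3] == "AUG"]

# Toy transcript: an upstream AUG at index 2 and the main start codon at index 10.
toy = "GGAUGCCAAAAUGGCU"
minus3 = kozak_minus3(toy, 10)
uaugs = upstream_augs(toy, 10)
```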
- Xiao Li, Beijing Institute of Genomics, CAS, and China National Center for Bioinformation, China
- Zhihua Zhang, Beijing Institute of Genomics, CAS, and China National Center for Bioinformation, China
Short Abstract: The human genome has a dynamic, well-organized hierarchical 3D architecture, including megabase-sized topologically associating domains (TAD). TADs in the genome are a key structure in regulating intra-nucleus biological processes, such as gene expression, DNA replication and damage repair. However, owing to a lack of proper computational tools, TADs have still not been systematically and reliably surveyed. Even the basic properties of TADs in single cells, e.g., the existence, the biogenesis, and dynamics, remain elusive. In the present work, we developed a new algorithm to decode TAD boundaries that keep chromatin interaction insulated (deTOKI) from ultra-sparse Hi-C data. This novel algorithm seeks regions that insulate the genome into blocks with minimal chance of clustering by nonnegative matrix factorization. We found that deTOKI outperformed competitive tools and that it reliably identified TADs with single-cell Hi-C (scHi-C) data. By applying deTOKI, we found that the domain structures are prevalent in single cells. Further, although their structure is highly dynamic between cells, TADs adhere to the ensemble, suggesting tight regulation of single-cell TADs. Finally, we found that the insulation property of TAD boundaries has a major effect on the epigenetic landscape in single cells. In sum, deTOKI serves as a powerful tool for profiling TADs in single cells.
- Meiyue Wang, CAS Center for Excellence in Molecular Plant Sciences, Chinese Academy of Sciences, China
- Zijuan Li, CAS Center for Excellence in Molecular Plant Sciences, Chinese Academy of Sciences, China
- Yuyun Zhang, CAS Center for Excellence in Molecular Plant Sciences, Chinese Academy of Sciences, China
- Yijing Zhang, CAS Center for Excellence in Molecular Plant Sciences, Chinese Academy of Sciences, China
Short Abstract: The widely cultivated wheat has a large allohexaploid genome. Subgenome-divergent regulation contributed to the genome plasticity and success of polyploid wheat in domestication. However, the specificity encoded in wheat genome determining the subgenome-divergent spatio-temporal regulation has been largely unexplored. The considerable size and complexity of the genome are major obstacles to dissecting the regulatory specificity. Here, we compared the epigenomes and transcriptomes from a large spectrum of samples under diverse developmental and environmental conditions. A total of 223,976 distal regulatory elements (REs) were specifically linked to their target promoters with orchestrated epigenomic activity. We detected distinct epigenetic architectures of REs representing different levels of subgenome divergence. Furthermore, through employing quantitative epigenomic approaches, we detected key responsive cis- and trans-acting factors validated by DNA Affinity Purification and sequencing (DAP-seq), and demonstrated the coordinated interplay between RE sequence contexts, epigenetic factors, and transcription factors in determining subgenome divergence. Altogether, this study provides a wealth of resources for elucidating the RE regulomics and subgenome-divergent regulation in hexaploid wheat, and gives new clues for interpreting genetic and epigenetic interplay in regulating the benefits of polyploid wheat.
- Agnes Preethy. H, SASTRA Deemed to be University, India
- Vigneshwar Ramakrishnan, SASTRA Deemed to be University, India
- Y B Venkatakrishnan, SASTRA Deemed to be University, India
- Uma Maheswari Krishnan, SASTRA Deemed to be University, India
Short Abstract: Our present study aims to predict the potential therapeutic targets of a complex traditional Siddha formulation, Brahmi Nei (BN), through in silico network pharmacological (NP) approaches. BN comprises nine different herbs, such as Bacopa monnieri (BM) and Curcuma aromatica (CA), and has traditionally been used to treat various nervous system diseases such as depression and epilepsy. Its major herb is BM, reported earlier for its memory-enhancing properties. Owing to their complexity, understanding the mechanism of action of complex herbal formulations is difficult, yet the emerging field of NP helps us to do so. Briefly, we collected the list of active compounds (AC) for BN herbs via literature review and Dr. Duke’s database. The targets of the BN AC were predicted using the Binding and STITCH databases. Pathway and disease information was retrieved from the STRING and KEGG databases, respectively. In total, 408 different AC of BN herbs hit 455 targets, which in turn are involved in 276 different pathways and 41 disease categories. The network of BN herbs vs AC vs targets was constructed using Cytoscape software. The AC and targets with the highest node degrees, as well as the AC and targets shared between BN herbs, were identified. A two-mode Target-Pathway interaction network was created and converted into a one-mode network, i.e., Target-(Pathway)-Target, in which the information on pathway interactions is converted into edges, using Pajek software. We visualized this network in Gephi software and identified modules within the network using the Louvain algorithm. In total, the Target-(Pathway)-Target network of BN contains 4 modules, and the contribution score (C) of each module towards disease is calculated using the C algorithm. The C scores of module 2 and module 1 are higher for diseases that could be treated by BN.
Centrality measures of the Target-(Pathway)-Target network were used to analyze the targets. PTGS2, CYCS, FOS, NOS2A etc. are the targets of module 2, and ACACA, 15 LOX, TS11, 15 LOX, 5LOX etc. are the targets of module 1 with the highest C scores; these are known to have implications in the therapeutically relevant diseases of BN. To further validate our C score for the modules, studies using the Randic index calculation method and molecular docking studies with Schrödinger software are currently underway.
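The conversion of a two-mode Target-Pathway network into a one-mode Target-(Pathway)-Target network can be sketched as a bipartite projection. The sketch below is in plain Python rather than Pajek. The target and pathway labels are placeholders, and weighting each edge by the number of shared pathways is an assumption, not a detail stated in the abstract.

```python
from collections import defaultdict
from itertools import combinations

def project_targets(target_to_pathways):
    """Project a two-mode Target-Pathway network onto targets.

    target_to_pathways: dict mapping each target to the set of pathways it is in.
    Returns a dict mapping (target_a, target_b) to the number of shared pathways,
    with each pair stored once in alphabetical order.
    """
    # Invert the mapping: which targets take part in each pathway?
    pathway_to_targets = defaultdict(set)
    for target, pathways in target_to_pathways.items():
        for pathway in pathways:
            pathway_to_targets[pathway].add(target)
    # Every pair of targets sharing a pathway gains one unit of edge weight.
    edges = defaultdict(int)
    for targets in pathway_to_targets.values():
        for a, b in combinations(sorted(targets), 2):
            edges[(a, b)] += 1
    return dict(edges)

# Placeholder network: T1 and T2 share two pathways, T4 shares none.
toy_network = {
    "T1": {"pathway_1", "pathway_2"},
    "T2": {"pathway_1", "pathway_2"},
    "T3": {"pathway_2"},
    "T4": {"pathway_3"},
}
one_mode = project_targets(toy_network)
```

The resulting weighted edge list is the kind of input on which module detection (for example with the Louvain algorithm) and centrality measures are then computed.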
- Chen Su, McGill University, Department of Electrical and Computer Engineering, Canada
- Simon Rousseau, McGill University, Department of Medicine, Division of Experimental Medicine, Canada
- Amin Emad, McGill University, Department of Electrical and Computer Engineering, Canada
Short Abstract: COVID-19 has resulted in the death of more than 1 million individuals as of October 2020. There is an urgent need for the development of therapies targeting two processes elicited by SARS-CoV-2: direct viral infection and the inflammatory immune response. Unraveling the gene expression programs involved in the response of the host to infection by this virus would provide a fundamental understanding of the disease and enable identification of therapeutic targets and novel treatments. The transcriptional regulatory network (TRN), composed of transcription factors (TFs) and their target genes, plays a significant role in regulating these immune-related gene expression programs. Here, we first used a tool that we have developed for reconstruction of phenotype-relevant TRNs (InPheRNo) to identify transcriptional regulatory mechanisms involved in the host response to SARS-CoV-2 infection. InPheRNo utilizes a probabilistic graphical model to integrate the collective regulatory influence of multiple TFs on a gene with the association of target genes’ expression and a phenotypic label (here COVID-19 positive vs. negative, or COVID-19 positive vs. infected by other viruses). Our results, obtained using gene expression profiles of human lung epithelial cell lines that were mock-treated or infected by a virus (SARS-CoV-2, RSV, H1N1 and HPIV3), revealed known and novel regulatory mechanisms in the course of COVID-19. For example, pathways related to the immune system, cytokines and interferon signaling were implicated for the top TFs. Recent genetic data have highlighted the importance of interferon signaling in COVID-19 severity. Next, we used our recently developed computational tool foRWaRD to prioritize kinases that are most associated with the reconstructed COVID-19-relevant TRNs. Kinases are enzymes that are involved in the regulation of protein activities through phosphorylation, and are a major category of drug targets for human diseases.
foRWaRD (a tool based on the random walk with restarts algorithm) ranks kinases based on their relevance to the COVID-19-relevant TRNs on a heterogeneous network consisting of known kinase-substrate relationships, gene-gene interactions, and the identified COVID-19-relevant TRNs. Our results implicated the JAK family and the MAPK family (including MAPK11 and MAPK14) of kinases, the inhibition of which has been used as a therapeutic strategy for inflammatory diseases. JAK-family kinases are important transducers of interferon signaling, while the p38 MAPKs (MAPK11 and MAPK14) are key regulators of genes linked to inflammation such as IL-6, found elevated in the circulation of severely ill COVID-19 patients. Currently, we are experimentally testing the predictions of our analyses.
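The random walk with restarts algorithm on which the ranking step is based can be sketched generically. The sketch below is not the foRWaRD implementation. The graph, the node labels, the restart probability and the handling of isolated nodes are all assumptions for illustration.

```python
def random_walk_with_restart(adj, seeds, restart=0.5, tol=1e-10, max_iter=1000):
    """Steady-state visiting probabilities of a walker that, at every step,
    jumps back to the seed set with probability `restart`.

    adj: dict node -> dict neighbor -> edge weight (every neighbor is also a key).
    seeds: nodes the walker restarts from (for example, genes of interest).
    """
    nodes = sorted(adj)
    seeds = set(seeds)
    if not seeds or not seeds.issubset(adj):
        raise ValueError("seeds must be a non-empty subset of the graph nodes")
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    out_weight = {n: sum(adj[n].values()) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {n: restart * p0[n] for n in nodes}
        for u in nodes:
            if out_weight[u] == 0:
                # Isolated node: send its walking mass back to the seeds.
                for n in nodes:
                    nxt[n] += (1.0 - restart) * p[u] * p0[n]
                continue
            for v, w in adj[u].items():
                nxt[v] += (1.0 - restart) * p[u] * w / out_weight[u]
        delta = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if delta < tol:
            break
    return p

# Toy path graph A - B - C - D with the walker restarting from A.
toy_graph = {
    "A": {"B": 1.0},
    "B": {"A": 1.0, "C": 1.0},
    "C": {"B": 1.0, "D": 1.0},
    "D": {"C": 1.0},
}
scores = random_walk_with_restart(toy_graph, seeds=["A"])
ranking = sorted(scores, key=scores.get, reverse=True)
```

Nodes closer to the seeds receive higher steady-state probability, which is what makes the scores usable as a relevance ranking.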
- Anne Nicole, Insight biosolutions, France
- Sarra Akermi, Annotation Analytics pvt. ltd., India
- Sunil Jayant, annotation analytics pvt. ltd., India
Short Abstract: Drug-induced liver injury (DILI) is a liver disease caused by drugs or other metabolites during drug application. Hepatic transporters have been implicated in inducing hepatic cholestasis (accumulation of bile acid in liver cells) through DILI. It has been reported that inhibition of the bile salt export pump (BSEP) is not the only mechanism of cholestatic DILI; other transporters such as MRP2 also contribute to inducing cholestasis. Therefore, it becomes particularly important to understand the differences between hepatic transporters in order to better understand DILI. In our work, we perform comparative sequence- and structure-based studies of the hepatic transporters BSEP and MRP2. Alignment using the Basic Local Alignment Search Tool (BLASTp) reveals that the two transporters show identities of 20% (332/1628), positives of 36% (599/1628) and gaps of 23% (390/1628), with a total alignment score of 320. This shows that the two transporters have low identity at the sequence level. We obtained the BSEP PDB structure from the RCSB Protein Data Bank and a protein model of MRP2 from SWISS-MODEL. Structure-structure superimposition of the BSEP and MRP2 three-dimensional structures shows an RMSD of 0.28 Å, which indicates a close structural relationship between BSEP and MRP2. Different FDA-approved inhibitors such as Bosentan, Cyclosporine A, Ketoconazole, Rifampicin and Chlorpromazine were screened against the BSEP and MRP2 structures by molecular docking. Among the five inhibitors tested, Cyclosporine A is the highest-affinity inhibitor of both BSEP and MRP2. Our docking analysis detected a higher affinity of Cyclosporine A for BSEP (IC50 0.6 µM, DE -19.00 kcal/mol) than for MRP2 (IC50 14.5 µM, DE -16.97 kcal/mol). Ketoconazole shows a higher affinity for BSEP (IC50 8.78 µM, DE -15.2 kcal/mol) than for MRP2 (IC50 ~133 µM, DE -14 kcal/mol).
Chlorpromazine shows the lowest affinity for both BSEP (IC50 147.6 µM, DE -8.56 kcal/mol) and MRP2 (IC50 ~133 µM, DE -9.58 kcal/mol). This in silico analysis indicates that Cyclosporine A inhibits BSEP more potently than MRP2. Our proof-of-concept study will provide more insight into the functions of BSEP and MRP2 in DILI.
- Fatemeh Behjati, Uniklinikum and Goethe University Frankfurt, Institute of Cardiovascular Regeneration, Germany
- Kathrin Kattler, Department of Genetics, University of Saarland, Germany
- Tobias Heinen, German Cancer Research Center (DKFZ), Germany
- Florian Schmidt, Genome Institute of Singapore, Singapore
- David Feuerborn, Leibniz Research Centre for Working Environment and Human Factors (IfADo), Germany
- Gilles Gasparoni, Department of Genetics, Saarland University, Germany
- Konstantin Lepikhov, Department of Genetics, Saarland University, Germany
- Patrick Nell, Leibniz Research Centre for Working Environment and Human Factors (IfADo), Germany
- Jan Hengstler, Leibniz Research Centre for Working Environment and Human Factors (IfADo), Germany
- Luka Nicin, Uniklinikum and Goethe University Frankfurt, Institute of Cardiovascular Regeneration, Germany
- Wesley Abplanalp, Uniklinikum and Goethe University Frankfurt, Institute of Cardiovascular Regeneration, Germany
- Hannah Melentin, Uniklinikum and Goethe University Frankfurt, Institute of Cardiovascular Regeneration, Germany
- Thomas Walther, Uniklinikum and Goethe University Frankfurt, Institute of Cardiovascular Regeneration, Germany
- David John, Uniklinikum and Goethe University Frankfurt, Institute of Cardiovascular Regeneration, Germany
- Joern Walter, Saarland University, Germany
- Stefanie Dimmeler, Uniklinikum and Goethe University Frankfurt, Institute of Cardiovascular Regeneration, Germany
- Marcel Schulz, Goethe University, Germany
Short Abstract: Single-cell sequencing technology enables probing gene regulation dynamics at a refined resolution. Analyses that exploit single-cell gene expression data demand suitable computational approaches that can handle the caveats of such data as well as elucidate interesting biological insights related to the regulatory mechanism. A current challenge is to infer transcriptional and post-transcriptional regulation from scRNA data. We developed a machine learning framework, called TRIANGULATE, to predict single-cell gene expression using diverse feature types, such as transcription factor (TF) measured in a static state of the genome, TF measured in open chromatin regions defined based on the DNase-seq signal, and TF obtained from the TF ChIP-seq experiments. Through training a multi-task-learning regression model that is able to share information across cells, we were able to infer TF activity per individual cell. By interrogating the coefficients of this linear model, we could determine positive and negative associations between TFs and gene expression. This framework is explained in our recent publication [1], and the implementation of TRIANGULATE can be accessed under an MIT license via https://github.com/SchulzLab/Triangulate. TRIANGULATE’s capability in inferring negative regulatory activities encouraged us to extend our approach by enriching the feature set with regulatory elements, such as miRNA binding to target genes. In addition, we made our framework adaptable to sparse sequencing protocols such as 10x Genomics, through an additional heuristic step of aggregating similar cells by creating a summarized profile of single-cell gene expression. We show that our approach is superior to competing methods in terms of identifying cell-specific regulators.
In addition, application to 10x single-cell sequencing of diseased human hearts illustrates that known and novel associations of disease- and cell-type-specific regulators can be accurately inferred from sparse single-cell data. [1] Behjati et al., “Prediction of single cell gene expression for transcription factor analysis”, 2020, GigaScience (in press).
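The heuristic of aggregating similar cells into a summarized expression profile can be illustrated with a minimal sketch. It assumes that cells have already been assigned to groups of similar cells and that the summary is the per-gene mean. Both choices, along with the function name and the toy data, are assumptions rather than details of the TRIANGULATE implementation.

```python
from collections import defaultdict

def aggregate_cells(expression, groups):
    """Summarize sparse single-cell profiles by averaging within groups.

    expression: dict cell -> list of per-gene expression values (same gene order).
    groups: dict cell -> label of the group of similar cells it belongs to.
    Returns dict group -> list of mean expression values per gene.
    """
    sums = {}
    counts = defaultdict(int)
    for cell, profile in expression.items():
        group = groups[cell]
        if group not in sums:
            sums[group] = [0.0] * len(profile)
        for i, value in enumerate(profile):
            sums[group][i] += value
        counts[group] += 1
    return {g: [s / counts[g] for s in total] for g, total in sums.items()}

# Toy data: three cells, three genes, two groups of similar cells.
toy_expression = {"c1": [0, 2, 0], "c2": [2, 0, 0], "c3": [1, 1, 4]}
toy_groups = {"c1": "A", "c2": "A", "c3": "B"}
summary = aggregate_cells(toy_expression, toy_groups)
```

Averaging over similar cells reduces the number of zero entries per profile, which is the motivation for such a step when working with sparse protocols.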
- Sean Cheah, University of California, Los Angeles, United States
- Mithun Mitra, University of California, Los Angeles, United States
- Hilary Coller, University of California, Los Angeles, United States
Short Abstract: Quiescence is the reversible arrest of cell proliferation and is important across biological processes ranging from stem cell maintenance to cancer cell dormancy. While several key transcriptional regulators of quiescence have been identified, it is unclear how these individual transcription factors (TFs) fit into the broader regulatory networks that govern this cellular state. Based on the methodology published by Gerstein et al. [1], we devised a highly adaptable pipeline that identifies potential cooperative binding interactions between TFs using their individual ChIP-Seq binding profiles. This was achieved by reframing the search for TF interactions as a binary classification problem. We then fit a gradient-boosting decision tree model and drew conclusions about predicted TF interactions using a model output explainer called SHAP [2]. By exploiting recent advancements in machine learning, our models return improved cross-validation classification accuracies when compared to prior approaches, allowing potential TF interactions to be captured with greater fidelity. Additionally, the granularity of our final outputs enables prediction of TF interactions across anything from entire genomes to individual loci. The improved protocol was used to explore putative interactions between 127 TFs that displayed significant differential expression when comparing in-house RNA-Seq data between quiescent and proliferating primary human dermal fibroblasts. ChIP-Seq peak files corresponding to these TFs were retrieved from the Cistrome v2 database [3]. Initial findings include previously characterized interactions between TFs like FOXM1 and MYBL2, both regulators of proliferation that were depleted in quiescence. Additionally, the model offered compelling evidence for undocumented TF-TF interactions that may warrant further study.
One such novel interaction was predicted between NR1H3 and PPARG, two known inhibitors of tumor cell proliferation that were both enriched in quiescence. The applications of our pipeline are not limited to the study of quiescence and can be tailored to target other cell states (e.g. senescence). The exact machine learning model can be substituted for alternatives like neural networks to achieve even greater accuracy on other data sets. In short, our proposed pipeline shows promising results in identifying interactions between quiescence-associated TFs and can be easily adapted to other cellular contexts. [1] Gerstein, Mark B. et al. Nature 489, 91-100 (2012). [2] Lundberg, Scott M. et al. Nature Machine Intelligence 2, 56-67 (2020). [3] Zheng, Rongbin et al. Nucleic Acids Res. 47, D1, D729-D735 (2019).
- Ellie Xi, Basis Independent Silicon Valley, United States
- Shurui Cai, The Ohio State University, United States
- Qi-En Wang, The Ohio State University, United States
- Yongsheng Bai, Next-Generation Intelligent Science Training, United States
Short Abstract: Ovarian cancer ranks fifth in cancer deaths among women, accounting for more deaths than any other cancer of the female reproductive system. Cancer stem cells (CSCs) represent a small subpopulation of cells within a tumor that are capable of self-renewal and differentiation. MicroRNAs (miRNAs) are a class of non-coding small RNAs (~20 nucleotides) and have been reported to regulate CSC self-renewal, differentiation, and tumorigenesis by regulating the expression of targeted genes. In our previous study, we profiled miRNA expression in cancer stem cells and bulk cancer cells from two ovarian cancer cell lines and identified 12 upregulated and 5 downregulated miRNAs in cancer stem cells. Here, we developed a bioinformatics method to identify putative target genes of these dysregulated miRNAs and further investigated their role in the maintenance of ovarian cancer stem cells. We first used three target prediction databases, TargetScan, miRDB, and TargetProfiler, to obtain two separate mutual gene lists: one for downstream targets of upregulated miRNAs, totaling 442 genes, and the other for downstream targets of downregulated miRNAs, totaling 271 genes. These gene lists were further validated using MMiRNA-Tar. We then ran the Database for Annotation, Visualization, and Integrated Discovery (DAVID) on the 271 genes from our mutual downregulated-miRNA target gene list to perform functional annotation and pathway analysis. Based on the results from DAVID, we identified the axon guidance pathway as one of the most prominent pathways in which our input genes were involved. Five genes (EPHA8, EPHB3, ROBO1, ROBO2, and SEMA3A) are involved in the axon guidance pathway in both KEGG and Biological Process from DAVID. We also searched the literature to verify these identified marker genes. EPHA8 has been previously reported as a prognostic marker in epithelial ovarian cancer. EPHB3 is associated with ovarian cancer.
Basal ROBO1 and ROBO2 were both expressed lower in primary cultures of ovarian cancer epithelial cells compared to normal ovarian surface epithelium and SKOV-3 cells. The decreased expression of SEMA3A could be associated with the development of ovarian cancer carcinoma. Our developed bioinformatic method is valuable for studying miRNA targeted genes involved in tumorigenesis.
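The construction of a mutual gene list from several target prediction databases can be sketched as a set intersection. This assumes that "mutual" means a gene is predicted by every database consulted, which the abstract does not state explicitly. The function name and the gene labels are placeholders, and no real database output is reproduced here.

```python
def mutual_gene_list(per_database_targets):
    """Return the genes predicted as targets by every database.

    per_database_targets: dict database name -> iterable of predicted target genes.
    The result is sorted alphabetically for reproducible output.
    """
    gene_sets = [set(genes) for genes in per_database_targets.values()]
    if not gene_sets:
        return []
    return sorted(set.intersection(*gene_sets))

# Placeholder predictions for one set of dysregulated miRNAs.
toy_predictions = {
    "TargetScan": ["GENE_A", "GENE_B", "GENE_C"],
    "miRDB": ["GENE_B", "GENE_C", "GENE_D"],
    "TargetProfiler": ["GENE_C", "GENE_B", "GENE_E"],
}
shared = mutual_gene_list(toy_predictions)
```

Requiring agreement across all databases trades sensitivity for specificity; a looser rule (for example, at least two of three databases) would yield a longer list.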
- Zaara Yakub, Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, United States
- Amber Tang, Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, United States
- Shushan Toneyan, Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, United States
- Peter Koo, Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, United States
Short Abstract: Recent progress has shown that architecture choices for convolutional neural networks (CNNs) can influence the extent to which first layer filters learn robust representations of genomic features, such as sequence motifs. However, it remains unclear whether these same design principles extend to hybrid convolutional-recurrent networks, which have demonstrated improved predictions compared to deep CNNs across many regulatory genomics tasks. Here we explore how hybrid convolutional-recurrent networks, specifically those with a bi-directional long short-term memory (bi-LSTM) layer, build representations of genomic sequence motifs. We find that some architecture choices (such as pooling size and the number of units in the bi-LSTM) significantly modulate the extent to which first layer filters learn motif representations, while other design choices (such as filter size, number of filters, and common activation functions) do not appear to be influential. Similar to deep CNNs, our findings support that the ability to assemble partial features into whole motifs in deeper layers (i.e. the bi-LSTM layer) modulates the extent to which interpretable sequence motifs are learned by first layer filters. This work makes it possible to design hybrid CNN-recurrent networks that intentionally learn interpretable representations of genomic sequence motifs in first layer filters and perform as well as other state-of-the-art architecture choices in this model class.
- Yosuke Tanigawa, Stanford University, United States
- Ethan Dyer, Stanford University, United States
- Gill Bejerano, Stanford University, United States
Short Abstract: Transcription factors (TFs) are the master regulators of development and regulate biological processes in healthy adult tissues. Mutations in both TF genes and their genomic binding sites have been linked to human disease. With the advent of high-throughput experimental profiling of context-specific TF activities, virtually all TF families, if not all ~1500 TFs, have their binding profile characterized in some cellular context. Many computational tools that benefit from these reference data have been developed to identify functionally important TFs. However, those tools often focus on the sheer abundance of the TFs in their ranking, limiting the ability to identify context-specific TFs of functional importance across diverse cell types. Here, we develop WhichTF, a computational method to identify functionally dominant TFs from chromatin accessibility profiles. To rank TFs, WhichTF applies an ontology-guided functional approach to compute novel enrichment by integrating accessibility measurements, high-confidence pre-computed conservation-aware TF binding sites, and putative gene-regulatory models. Comparison with prior methods reveals the unique ability of WhichTF to identify context-specific TFs with known functional relevance, including NF-κB family members in lymphocytes, GATA factors in cardiac cells, and SOX2 in brain cells, while the results from other approaches are often not cell-type specific and are dominated by structural elements, such as CTCF. Our scoring framework naturally provides a way to investigate the ontology term-TF pairs contributing to the ranking. To further distinguish the transcriptional regulatory landscape in closely related samples, we devise a differential analysis framework and demonstrate its comparative utility against existing tools in lymphocyte, mesoderm developmental, and disease cells.
In lymphocytes, we find ETS family members in B-cells and RUNX family members in T-cells, both of which are well supported by the literature. On top of the confirmatory results, we find many suggestive, under-characterized TFs, such as RUNX3 in mesoderm developmental trajectory and GLI1 in systemic lupus erythematosus. We also find TFs known for stress response in many samples, suggesting routine experimental caveats that warrant careful consideration, as well as ZNF410, a novel candidate stress response factor. Together, we demonstrate the utilities in the ontology-guided functional approach that integrates experimentally characterized chromatin accessibility, high-quality TFBS reference datasets, and ontology-based genome annotation, suggesting that systematic identification of dominant TFs across a large number of samples will be a powerful approach to understand the molecular mechanisms of gene regulation and their influence on cell type differentiation, development, and disease.
- Zhi-Jie Cao, Peking University, China
- Lin Wei, Peking University, China
- Shen Lu, Peking University, China
- De-Chang Yang, Peking University, China
- Ge Gao, Peking University, China
Short Abstract: Single-cell transcriptomic profiling provides valuable insights for important biological questions like cellular function and gene expression regulation. Recent technological innovations lead to a rapid accumulation of single-cell transcriptomic data, which further enables data-driven cell annotation. However, multiple confounding effects like intra-/inter-dataset batch effect raise serious challenges for effective and efficient cross-dataset querying and integration. Herein, we designed Cell BLAST, a single-cell transcriptomic data querying method based on deep generative modeling. By introducing adversarial domain adaptation to correct for batch effect, as well as a posterior-based cell-to-cell similarity metric, we significantly improve the accuracy of cell querying over existing tools. Combined with a well-curated reference database ACA and a user-friendly Web server (https://cblast.gao-lab.org), Cell BLAST provides the one-stop solution for real-world scRNA-seq cell querying and annotation.
- Shilu Zhang, University of Wisconsin-Madison, United States
- Sushmita Roy, University of Wisconsin-Madison, United States
Short Abstract: Cell-fate specification is a dynamic process during which a cell in one state transitions into a different state. Cell state is characterized by cell type-specific expression patterns, which are in turn determined by interactions between transcription factors (TFs) and chromatin modifications. However, our understanding of these interactions and how they change between different cell states is limited. Recently, genome-wide datasets for multiple chromatin marks and TFs have become available that can be leveraged to identify how chromatin marks and TFs interact and how it impacts the expression state. Existing approaches to analyze these data have either modeled one cell type at a time or concatenated the measurements from multiple cell types and learned a single model. However, these approaches either cannot capture dynamics across cell types or have a harder learning task and are prone to spurious interactions by merging the variables from all cell types. To overcome these limitations, we present Cell type Varying Networks (CVN), a multi-task learning framework to capture the interactions between chromatin marks, TFs and expression levels in each cell type related by a linear or hierarchical process. CVN is based on probabilistic graphical models to represent each cell type’s network and models the relatedness information by using a structure prior on the set of graphs. Compared to existing approaches that learn a single network across all cell types (ChromNet) or learn each network separately (INDEP), CVN has a better performance in predicting the interactions from simulated data. The similarity of CVN networks along the lineage tree is consistent with the similarity of the simulated networks across cell lineage. This suggests that our multi-task learning framework is helpful to build accurate models of cell type-specific networks. 
We applied CVN to published ChIP-seq, ATAC-seq and RNA-seq datasets for four stages during mouse reprogramming to infer the cell type-specific networks of chromatin marks, TFs and expression at each stage. Our results captured global differences in the network structure consistent with the lineage and identified cell type-specific subnetworks that are common and differentially wired across the cellular stages, and specific interaction patterns of TFs (e.g. Sox2, Zfx) and chromatin marks (e.g. H3K27ac, H3K9ac) that help prioritize TFs and histone marks important for defining the transition from a differentiated state to a pluripotent state. Taken together, CVN is a powerful framework to infer cell type-specific interactions between chromatin marks, TFs and expression.
- Tassia Mangetti Goncalves, Washington University in St. Louis, United States
- Guoyan Zhao, Washington University in St. Louis, United States
Short Abstract: We are still far from a complete understanding of regulatory elements in mammalian genomes, even though their central role in most biological processes is widely appreciated. Development of sequence-based models to predict the function and activity of regulatory elements is a fundamental step in being able to address many unsolved questions. Cis-Regulatory Modules (CRMs) are DNA sequences with transcription factor binding sites clustered into modular structures, such as locus control regions, promoters, enhancers, silencers, and other modulators. The interaction between CRMs and transcription factors (TFs) can control the spatio-temporal expression of any gene in any cell. We developed the Mammalian Regulatory Module Detector (MrMod), an algorithm to predict CRMs using predicted TF binding motifs in mammalian genomes. Comparison with over 100 experimentally defined CRMs curated from the literature demonstrated high sensitivity and accuracy of the predictions for enhancers located upstream of the coding region and those located in introns. Next, we compared our predicted CRMs with the Cistrome database collection of chromatin accessibility data for both the mouse and human genomes. Epigenomic methods such as the assay for transposase-accessible chromatin (ATAC-seq) and DNase I hypersensitivity (DNase-seq) identify accessible chromatin regions, which are important for understanding the transcriptional control of many biological processes. The results also demonstrated high sensitivity and accuracy in identifying functional regulatory sequences in mammalian genomes, and the method offers novel approaches to study genome-wide transcription regulation during evolution and development. This method is applicable to any species that resides between mouse and human within the mammalian phylogenetic tree.
- Nina Baumgarten, Institute of Cardiovascular Regeneration, Frankfurt am Main and Saarland Informatics Campus, Saarbrücken, Germany
- Dennis Hecker, Institute of Cardiovascular Regeneration, Frankfurt am Main, Germany
- Marcel H. Schulz, Institute of Cardiovascular Regeneration, Frankfurt am Main and Saarland Informatics Campus, Saarbrücken, Germany
Short Abstract: Genome-wide association studies (GWAS) indicate that most single nucleotide polymorphisms (SNPs) appear in non-coding genomic regions. Additionally, a large percentage of these SNPs are located within regulatory elements, like promoters or enhancers. Since TF binding sites occur frequently in regulatory elements, SNPs may disrupt them. Consequently, this can affect TF binding behavior and alter the expression of possibly distant target genes. Since SNPs in non-coding regions can lead to functional consequences like various traits or diseases, a major aim is to understand their molecular mechanisms. We present the SNEEP approach, which prioritizes SNPs as targets of one or several TFs and infers whether a gene’s expression is influenced by the change in the TF binding behavior. First, we evaluate the impact of a SNP on a potential TF binding site by calculating a probabilistic differential binding score for the difference in TF binding affinity in the wild type versus a mutated sequence. To associate SNPs with potential target genes, SNEEP can make use of different types of information, such as epigenetic elements linked using Hi-C data or catalogued elements that associate with gene expression changes from the EpiRegio database. SNEEP can easily handle large collections of SNPs and can incorporate various kinds of user-specific epigenetic data, like open chromatin data to exclude SNPs located in regions not accessible in the cell type of interest or gene expression data to restrict the analysis to TFs which are expressed. Further, it provides a comprehensive report with different summary statistics. This summary, for instance, highlights TFs whose binding affinity is most often affected by the analyzed SNPs, lists TFs associated with a gain or a loss of TF binding affinity based on the input data, and provides information about how many SNPs are linked to a target gene.
All statistical assessments are compared against proper random controls to highlight biologically interesting results. To sum up, SNEEP is a method that helps to prioritize GWAS SNPs to study the impact of genetically induced transcriptional mis-regulation in human diseases and other phenotypes.
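To make the idea of a differential binding score concrete, the sketch below scores a wild-type and a mutated sequence against a position weight matrix and takes the difference. It is a deliberately simplified log-odds illustration, not SNEEP's probabilistic score: the count matrix, pseudocount, uniform background, and function names are all hypothetical, and only the forward strand is scanned.

```python
import math

BASES = "ACGT"
BACKGROUND = 0.25   # assumed uniform background frequency
PSEUDOCOUNT = 1.0   # assumed pseudocount to avoid log(0)

# Hypothetical count matrix for a 4-bp motif (rows: positions; columns: A, C, G, T).
COUNTS = [
    [8, 1, 0, 1],
    [0, 9, 1, 0],
    [1, 0, 9, 0],
    [0, 1, 0, 9],
]


def log_odds_matrix(counts):
    """Convert raw counts into log2 odds against the background."""
    matrix = []
    for row in counts:
        total = sum(row) + 4 * PSEUDOCOUNT
        matrix.append(
            [math.log2(((c + PSEUDOCOUNT) / total) / BACKGROUND) for c in row]
        )
    return matrix


def best_window_score(sequence, pwm):
    """Highest motif score over all forward-strand windows of the sequence."""
    width = len(pwm)
    scores = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        scores.append(sum(pwm[i][BASES.index(b)] for i, b in enumerate(window)))
    return max(scores)


def differential_score(wild_type, mutant, pwm):
    """Mutant minus wild-type score; negative values suggest loss of binding."""
    return best_window_score(mutant, pwm) - best_window_score(wild_type, pwm)


pwm = log_odds_matrix(COUNTS)
# The SNP changes the G inside the ACGT match to a T.
delta = differential_score("TTACGTAA", "TTACTTAA", pwm)
```

In this toy case `delta` is negative because the mutation destroys the only consensus match. A real analysis would also scan the reverse strand and assess significance against random controls, as the abstract describes.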
- Junil Kim, University of Copenhagen, Denmark
- Kyoung-Jae Won, University of Copenhagen, Denmark
Short Abstract: Accurate prediction of gene regulatory rules is important towards understanding of cellular processes. Existing computational algorithms devised for bulk transcriptomics typically require a large number of time points to infer gene regulatory networks (GRNs), are applicable for a small number of genes, and fail to detect potential causal relationships effectively. We aim to quantify the strength of causality between genes by using a concept originating from information theory, called transfer entropy (TE). TE measures the amount of directed information transfer between two variables. Based on TE, we developed TENET (https://github.com/neocaleb/TENET) to reconstruct GRNs from scRNAseq data. Using single-cell gene expression profile along the pseudo-time, TENET calculates TE values between each set of gene pairs. We found that TE values of the known critical regulators (i.e. target genes) were significantly higher than that of randomly selected targets. Interestingly, target genes with higher TE values were influenced more profoundly by the perturbation analysis. We also show that TENET outperforms previous GRN constructors in identifying target genes. Unique to TENET is the ability to represent key regulators with the hub nodes in the reconstructed GRNs. For instance, TENET identified pluripotency factors from scRNAseq during mouse embryonic stem cell (mESC) differentiation and the key programming factors from scRNAseq for the direct reprogramming toward cardiomyocytes, where existing methods either failed to identify or capture their importance for the regulatory network. Interestingly, the factors that TENET identified were more negatively correlated with the number of final states (or attractors) in the Boolean networks, which confirms the importance of the identified hub nodes. 
We applied TENET to identify previously uncharacterized regulatory rules for the autophagic process induced by amino acid starvation (AAS) as well as glucose starvation (GS) in mouse embryonic fibroblasts. Interestingly, TENET reconstructed GS- as well as AAS-specific GRNs. Strikingly, TENET newly found that Cebpg is an AAS-specific transcription factor regulating key autophagy-associated genes. Knockdown of Cebpg dramatically dampened the autophagic process upon AAS but not upon GS. To our knowledge, our results provide the first evidence of distinct GRNs for AAS- and GS-induced autophagy. More importantly, we show that TENET has the potential to elucidate previously uncharacterized regulatory mechanisms by reprocessing scRNAseq data.
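Transfer entropy, the quantity TENET builds on, can be illustrated on a toy pair of discrete time series. The sketch below is a plug-in estimate with history length 1 on already-discretized values; it is an assumption-laden illustration of the concept and not TENET's implementation, which works on expression values ordered along pseudo-time. The function name and example series are hypothetical.

```python
import math
from collections import Counter


def transfer_entropy(source, target):
    """Plug-in transfer entropy (bits) from source to target, history length 1."""
    if len(source) != len(target) or len(source) < 2:
        raise ValueError("series must have equal length of at least 2")
    triples = Counter()    # (target_next, target_now, source_now)
    pairs_ts = Counter()   # (target_now, source_now)
    pairs_tt = Counter()   # (target_next, target_now)
    singles = Counter()    # target_now
    n = len(source) - 1
    for t in range(n):
        y_next, y_now, x_now = target[t + 1], target[t], source[t]
        triples[(y_next, y_now, x_now)] += 1
        pairs_ts[(y_now, x_now)] += 1
        pairs_tt[(y_next, y_now)] += 1
        singles[y_now] += 1
    te = 0.0
    for (y_next, y_now, x_now), count in triples.items():
        p_joint = count / n
        # p(y_next | y_now, x_now) versus p(y_next | y_now)
        p_with_source = count / pairs_ts[(y_now, x_now)]
        p_without_source = pairs_tt[(y_next, y_now)] / singles[y_now]
        te += p_joint * math.log2(p_with_source / p_without_source)
    return te


# Hypothetical binarized expression: the target copies the source one step later.
regulator = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0]
target_gene = [0] + regulator[:-1]
te_forward = transfer_entropy(regulator, target_gene)
```

Because the target here is a delayed copy of the regulator, knowing the regulator's current value removes all uncertainty about the target's next value, so `te_forward` is positive. A source that never changes carries no information and yields zero.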
- Dennis Hecker, Goethe University Frankfurt, Germany
- Nina Baumgarten, Goethe University, Germany
- Marcel Schulz, Goethe University, Germany
Short Abstract: Advances in epigenomics have highlighted the importance of non-coding regions in the DNA for regulation of gene expression. Many different studies have further increased the acknowledgement of regulatory regions in the context of diseases and cell differentiation and underlined the necessity to further unravel the interactions and processes that orchestrate the regulatory program of cells. Key players in gene regulation are enhancers. They can be located far away from their target genes and affect gene expression by serving as transcription factor (TFs) binding sites. A gene’s promoter is also bound by TFs which form protein complexes together with the associated enhancer-bound TFs to govern gene expression. We applied topic modelling using Latent Dirichlet Allocation (LDA) on the predicted TF binding sites in enhancers and gene promoters of regulatory connections to explore whether enhancer-promoter interactions can be characterized by their TFs. LDA is an unsupervised statistical model which helps to explain differences in TF binding in between regulatory interactions by assigning them to unobserved groups, the so-called topics. Further, LDA evaluates the relevance of TFs to assign an interaction to a topic and thus, allows to identify sets of TFs that are important for distinguishing the interactions. Although topic modelling has been applied to TF binding before, we follow a different approach by taking both sides of the interaction into account, the enhancer-bound TFs as well as the promoter-bound TFs. This separation could help to describe the formation of regulatory interactions, as we know whether a TF was predicted to bind in the enhancer or the promoter region. The method is not restricted to enhancer-promoter interactions. Any set of interacting sequences can be analysed, for example loops called on Hi-C data. 
The two probability distributions calculated by the LDA – the probability of an interaction to belong to a topic and the probability of a TF to belong to a topic – in combination with other genomics data can be used to explore the correlation of TF combinations with genomic features like gene function, ubiquitousness of genes or histone modification. By applying topic modelling on regulatory interactions in different cell types or conditions, sets of important TFs can be inferred to explain the differences in cell states.
- Ofir Yaish, Ben Gurion University, Israel
- Yaron Orenstein, Ben Gurion University, Israel
Short Abstract: mRNA degradation has a critical role in post-transcriptional gene regulation. The 3’ untranslated region (UTR) of an mRNA transcript represents a central regulatory hub that integrates multiple signals to control mRNA translation, localization, stability, and polyadenylation status. Hence, researchers are interested in studying mRNA dynamics as a function of 3’ UTR sequence elements. A recent study measured the mRNA degradation dynamics of tens of thousands of 3' UTR sequences using a massively parallel reporter assay in zebrafish during early embryogenesis. However, the computational approach used to model mRNA degradation dynamics was based on the simplifying assumption of a linear degradation rate. As a result, the underlying mechanism of the regulatory sequence elements in the 3’ UTR is still not fully understood. In this study, we developed deep neural networks to predict mRNA degradation dynamics and interpreted the networks to detect regulatory elements in the 3’ UTR. Using a 110nt-long 3' UTR sequence and the initial mRNA level measured at the first hour, the model predicts mRNA levels at 8 time points: 2-8 hr and 10 hr. Our deep neural networks significantly improved the prediction accuracy of mRNA degradation dynamics compared to the existing methods used for this task (Lasso and Random Forest). For example, our newly developed convolutional and recurrent neural networks (CNN and RNN) achieved average root mean squared error (RMSE) of 0.687 and 0.670, respectively, compared to Lasso and Random Forest achieving 0.808 and 0.769, respectively. Moreover, we demonstrated that models that simultaneously predict the dynamics of two otherwise identical 3’ UTR sequences, differing only in their poly(A) tail, achieve better performance than single-task models. In general, we observed that RNNs are more accurate for this task than CNNs.
On the interpretability front, by using Integrated Gradients, our CNN models identified relevant cis-regulatory sequence elements of mRNA degradation embedded in designed validation sequences. The models identified poly-U as a motif associated with increased mRNA stability. In contrast, the models classified miR-430, AU-rich (ARE), and Pumilio elements as destabilizing. By systematically evaluating model interpretability, we demonstrated that the RNN models are inferior to the CNN models in terms of interpretability. To conclude, we developed the first deep learning-based approach for predicting mRNA degradation dynamics and successfully used it to identify regulatory elements. Through this work, we characterized the relative advantages of CNNs versus RNNs and of multi-task versus single-task models.
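For readers unfamiliar with the error metric quoted in the abstract above, the following sketch computes RMSE and averages it over time points. How the authors averaged RMSE is not stated, so the per-time-point averaging here is an assumption, and the function names and toy values are hypothetical.

```python
import math


def rmse(predicted, observed):
    """Root mean squared error between two equal-length sequences."""
    if len(predicted) != len(observed) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


def mean_rmse_over_time_points(predicted, observed):
    """Average the per-time-point RMSE; rows are sequences, columns are time points."""
    n_time_points = len(predicted[0])
    per_point = []
    for t in range(n_time_points):
        per_point.append(
            rmse([row[t] for row in predicted], [row[t] for row in observed])
        )
    return sum(per_point) / n_time_points


# Hypothetical toy values: 2 sequences x 2 time points (not data from the study).
predicted = [[1.0, 2.0], [3.0, 4.0]]
observed = [[1.0, 4.0], [3.0, 2.0]]
average_error = mean_rmse_over_time_points(predicted, observed)
```

In the toy data the first time point is predicted exactly and the second is off by 2 for both sequences, so the average of the two per-time-point RMSE values is 1.0.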
- Ilka Keller, University of Debrecen, Faculty of Medicine, Department of Medical Chemistry, Hungary
- Dániel Horváth, University of Debrecen, Faculty of Medicine, Department of Medical Chemistry, Hungary
- Beáta Lontay, University of Debrecen, Faculty of Medicine, Department of Medical Chemistry, Hungary
Short Abstract: Introduction: The myosin phosphatase (MP) holoenzyme is a protein phosphatase-1 (PP1) type Ser/Thr-specific enzyme that consists of a PP1 catalytic subunit (PP1c) and myosin phosphatase target subunit-1 (MYPT1). MYPT1 is a ubiquitously expressed isoform that targets PP1c to its substrates. By uncovering the nuclear MYPT1 interactome of hepatocellular carcinoma cells, we identified several novel nuclear MYPT-interacting proteins, such as the protein arginine methyltransferase 5 (PRMT5) enzyme of the methylosome complex. One of the potential upstream regulatory elements of MP is the Mg2+-dependent protein phosphatase 1B (PPM1B), which was identified as a MYPT-binding protein by coimmunoprecipitation and pull-down assays followed by mass spectrometry analysis. Aim: Our research aims to investigate the regulatory effect that PPM1B exerts on MP in the nucleus, focusing mainly on downstream effectors of MP. Methods: Flag-PPM1B was overexpressed in the HeLa cell line and, as the opposite experiment, PPM1B was inhibited with sanguinarine. Samples were analyzed by Western blot. Results: Overexpression of Flag-PPM1B decreased the inhibitory phosphorylation of MYPT1 and the stimulating phosphorylation of PRMT5, in parallel with decreased symmetric dimethylation of histones 2A and 4, suggesting activation of the expression of tumor suppressor genes. Silencing of PPM1B, as well as inhibition of PPM1B by sanguinarine, produced the opposite effects, clarifying the mechanism between PPM1B and MP and indicating that PPM1B indeed activates MP, along with its tumor-protective effects. Our results suggest a tumor suppressor role of PPM1B via dephosphorylation of MP and indirect modulation of the activity of PRMT5, thereby regulating gene expression through histone arginine dimethylation.  Kiss A, Erdodi F, Lontay B. Myosin phosphatase: Unexpected functions of a long-known enzyme. Biochim Biophys Acta Mol Cell Res. 2019;1866(1):2-15.
doi:10.1016/j.bbamcr.2018.07.023
- Christoph J. Thieme, MDC Berlin, Germany
- Robert A. Beagrie, Weatherall Institute of Molecular Medicine, United Kingdom
- Carlo Annunziatella, Università di Napoli Federico II, Italy
- Catherine Baugher, Ohio University, United States
- Yingnan Zhang, Ohio University, United States
- Markus Schueler, MDC Berlin, Germany
- Alexander Kukalev, MDC Berlin, Germany
- Rieke Kempfer, MDC Berlin, Germany
- Dorothee C.A. Kramer, MDC Berlin, Germany
- Andrea M Chiariello, Università di Napoli Federico II, Italy
- Simona Bianco, Università di Napoli Federico II, Italy
- Yichao Li, Ohio University, United States
- Antonio Scialdone, Helmholtz Zentrum München, Germany
- Lonnie R. Welch, Ohio University, United States
- Mario Nicodemi, Università di Napoli Federico II, Italy
- Ana Pombo, MDC Berlin, Germany
Short Abstract: Gene expression and regulation are functionally coupled with 3D genome organization. Genome Architecture Mapping (GAM) is a ligation-free, genome-wide method that maps chromatin contacts in 3D, based on measuring the frequency of locus co-segregation from an ensemble of ultra-thin nuclear slices of random orientation. To compare GAM and Hi-C, we used our new high-throughput, multiplexed GAM pipeline to produce a deep dataset from mouse embryonic stem cells, and devised a procedure to extract contacts preferentially detected by either GAM or Hi-C. Strong contacts enriched in the GAM data contain a 2-fold amplification of feature pairs associated with TF binding (including CTCF), histone marks, and enhancers. In contrast, feature pairs enriched in Hi-C-specific contacts are characterized by heterochromatin marks (H3K9me3 and H3K20me3). Contacts with the strongest differences in intensity between the two methods can be distinguished by the presence pattern of CTCF, Enhancer, RNAPII-S7p, and RNAPII-S5p, which frequently co-occur in GAM-specific contacts. In general, we observe that genomic regions with increased transcriptional activity often form strong GAM contacts that are underestimated by Hi-C. We are currently investigating whether the differences can be explained by increased contact multiplicity, which could limit ligation-dependent detection. Our findings expand our current understanding of 3D genome folding and highlight the importance of orthogonal approaches.
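The co-segregation principle behind GAM can be sketched as counting how often two loci are detected in the same nuclear slice. The code below computes only the raw pairwise co-segregation frequency on hypothetical data; it is not the authors' pipeline, and real analyses typically apply a normalization on top of these raw frequencies. Locus and function names are placeholders.

```python
from itertools import combinations


def cosegregation_frequency(slices, loci):
    """Fraction of slices in which both loci of each pair are detected."""
    if not slices:
        raise ValueError("at least one slice is required")
    n = len(slices)
    frequency = {}
    for a, b in combinations(loci, 2):
        both = sum(1 for detected in slices if a in detected and b in detected)
        frequency[(a, b)] = both / n
    return frequency


# Hypothetical data: each set lists the loci detected in one nuclear slice.
slices = [
    {"locus_A", "locus_B"},
    {"locus_A", "locus_B", "locus_C"},
    {"locus_C"},
    {"locus_A"},
]
loci = ["locus_A", "locus_B", "locus_C"]
frequency = cosegregation_frequency(slices, loci)
```

Loci that are close in 3D space tend to fall into the same thin slice more often than distant loci, so a higher frequency is read as a stronger contact. Here `locus_A` and `locus_B` co-segregate in half of the slices.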
- Ting Jin, Department of Biostatistics and Medical Informatics, University of Wisconsin–Madison, United States
- Nam D. Nguyen, Department of Computer Science, Stony Brook University, United States
- Flaminia Talos, Departments of Pathology and Urology, Stony Brook Medicine; Stony Brook Cancer Center, Stony Brook Medicine, United States
- Daifeng Wang, Department of Biostatistics and Medical Informatics, University of Wisconsin–Madison; Waisman Center, UW–Madison, United States
Short Abstract: Gene expression and regulation, a key molecular mechanism driving human disease development, remains elusive, especially at early stages. Integrating the increasing amount of population-level genomic data and understanding gene regulatory mechanisms in disease development are still challenging. Machine learning has emerged to address this, but many machine learning methods have typically been limited to building an accurate prediction model as a “black box”, providing little biological or clinical interpretability. To address these challenges, we developed an interpretable and scalable machine learning model, ECMarker, to predict gene expression biomarkers for disease phenotypes and simultaneously reveal underlying regulatory mechanisms. In particular, ECMarker consists of three major components: (1) a neural network model integrating the semi- and discriminative- restricted Boltzmann machines for classifying disease phenotypes from continuous gene expression values at the population level; (2) direct identification, without any prior feature selection, of a gene network by allowing lateral connections at the input gene layer, and the prioritization of gene expression biomarkers for each phenotype using the integrated gradient method based on the neural network connectivity; (3) functional and survival analyses of biomarker genes and networks for revealing underlying molecular mechanisms in the disease phenotypes (biological interpretability) and predicting clinical outcomes (clinical interpretability). Applying ECMarker to the gene expression data of non-small cell lung cancer (NSCLC) patients, we found that it not only achieved a relatively high accuracy for predicting cancer stages but also identified biomarker genes and gene networks implying the regulatory mechanisms in lung cancer development.
Additionally, ECMarker demonstrates clinical interpretability as its prioritized biomarker genes can predict survival rates of early lung cancer patients (p-value < 0.005). Finally, we identified a number of drugs currently in clinical use for late stages or other cancers with effects on these early lung cancer biomarkers, suggesting potential novel candidates on early cancer medicine. ECMarker is also general purpose for other disease types and available as an open-source tool at https://github.com/daifengwanglab/ECMarker. *note: this work is under review.
- Yidan Sun, University of California, Los Angeles, United States
- Heather Zhou, University of California, Los Angeles, United States
- Jingyi Li, University of California, Los Angeles, United States
Short Abstract: Motivation: Gene clustering is a widely-used technique that has enabled computational prediction of unknown gene functions within a species. However, it remains a challenge to refine gene function prediction by leveraging evolutionarily conserved genes in another species. This challenge calls for a new computational algorithm to identify gene co-clusters in two species, so that genes in each co-cluster exhibit similar expression levels in each species and strong conservation between the species. Results: Here we develop the bipartite tight spectral clustering (BiTSC) algorithm, which identifies gene co-clusters in two species based on gene orthology information and gene expression data. BiTSC novelly implements a formulation that encodes gene orthology as a bipartite network and gene expression data as node covariates. This formulation allows BiTSC to adopt and combine the advantages of multiple unsupervised learning techniques: kernel enhancement, bipartite spectral clustering, consensus clustering, tight clustering, and hierarchical clustering. As a result, BiTSC is a flexible and robust algorithm capable of identifying informative gene co-clusters without forcing all genes into co-clusters. Another advantage of BiTSC is that it does not rely on any distributional assumptions. Beyond cross-species gene co-clustering, BiTSC also has wide applications as a general algorithm for identifying tight node co-clusters in any bipartite network with node covariates. We demonstrate the accuracy and robustness of BiTSC through comprehensive simulation studies. In a real data example, we use BiTSC to identify conserved gene co-clusters of D. melanogaster and C. elegans, and we perform a series of downstream analysis to both validate BiTSC and verify the biological significance of the identified co-clusters.
- Gregory M. Chen, University of Pennsylvania, United States
- Changya Chen, Children's Hospital of Philadelphia, United States
- Rajat K. Das, Children's Hospital of Philadelphia, United States
- Peng Gao, Children's Hospital of Philadelphia, United States
- Chia-Hui Chen, Children's Hospital of Philadelphia, United States
- Yang-Yang Ding, Children's Hospital of Philadelphia, United States
- Yasin Uzun, Children's Hospital of Philadelphia, United States
- Qin Zhu, University of Pennsylvania, United States
- Stephan A. Grupp, Children's Hospital of Philadelphia, United States
- David M. Barrett, Children's Hospital of Philadelphia, United States
- Kai Tan, Children's Hospital of Philadelphia, United States
Short Abstract: Chimeric Antigen Receptor (CAR) T-cell therapy has been a major breakthrough in B-cell Acute Lymphoblastic Leukemia (ALL), yet therapy resistance remains a significant challenge. Given that approximately 20% of patients do not respond to therapy and 40% of initial responders relapse within a year, there is a pressing need to understand the basis of therapy resistance in order to identify prognostic biomarkers and improve the therapeutic modality. Here, we take a systems immunology approach to uncover the transcriptional regulatory mechanisms mediating long-term CAR T-cell persistence in patients. We obtained pre-manufacture T-cells from 71 patients treated with anti-CD19 CAR T-cell therapy, sorted the cells into five major T-cell subtypes (Naive, Stem Cell Memory, Central Memory, Effector Memory, and Effector), and performed RNA-Seq on sorted T-cells. To our knowledge, this is the largest and most comprehensive transcriptomic atlas of clinical T-cells in CAR T-cell therapy to date. In addition, we performed integrative single-cell RNA-Seq and ATAC-Seq on T-cells from 6 of these patients in order to characterize the transcriptional and epigenetic states. We found that higher proportions of CCR7+CD62L+ naive and early memory T-cell phenotypes associate with longer CAR T-cell persistence, and higher proportions of CCR7-CD62L- effector memory and effector T-cells associate with shorter CAR T-cell persistence (p = 2.0e-4 to 0.04). Using a mixed-effects interaction model to account for the potential confounding effect of T-cell subtype composition, we found that genes associated with chronic interferon response, such as IRF7, RSAD2, MX1, and ISG15, were significantly associated with poor CAR T-cell persistence across T-cell subsets (FDR = 1.0e-5 to 6.3e-3). 
While network analysis strongly implicated TCF7 in maintenance of naive and early memory states, we also found that TCF7 expression in effector memory and effector T-cells was significantly up-regulated in patients with long-term CAR T-cell persistence (FDR = 0.018). Motif accessibility in our single-cell ATAC-Seq data strongly supported key roles of TCF7 and IRF7 in determining functional cellular states. Single-cell RNA-Seq revealed a population of naive/memory T-cells that was enriched in genes associated with both the TCF7 regulon and interferon response; this finding suggests that these cellular states are not necessarily mutually exclusive, and likely have an independent effect on clinical CAR T-cell efficacy. Together, these data shed light on the critical role of pre-infusion T-cell populations in CAR T-cell therapy, and may inform clinical prognosis and the development of improved CAR T-cell therapies.
- Marissa Sumathipala, Harvard Medical School Channing Division of Network Medicine, United States
- Kimberly Glass, Harvard Medical School Channing Division of Network Medicine, United States
Short Abstract: Chronic Obstructive Pulmonary Disease (COPD) is a heterogeneous condition comprising many sub-diseases, including emphysema. Despite its prevalence, the molecular mechanisms underlying COPD remain poorly understood. A major barrier to elucidating the pathophysiology of COPD is its remarkable genetic heterogeneity. Standard computational methods that average across all COPD patients obscure the genetic signatures that are only present in a subset of the patients. To better capture individual genetic heterogeneity in COPD, we use a novel systems biology approach to model the complex gene regulatory processes underlying the disease. Specifically, we apply network modeling to construct 326 patient-specific gene networks, combining gene expression microarray data collected from lung biopsies of patients with varying stages of COPD and smoking controls with information from GWAS on COPD. Patient-specific networks are first generated by applying linear interpolation to Pearson correlation co-expression networks in a jack-knifing approach. Next, we separate the COPD patient networks into Mild, Moderate, Severe, and Very Severe based on their clinical progression, and compare each group to networks for the controls. We find significant regulatory differences between networks for COPD and control patients, particularly in Very Severe COPD patients. Our network comparison approach identifies 109 genes with potential relevance for COPD that are not found with standard differential gene expression analysis. Gene ontology analysis implicates these 109 genes in antigen processing, major histocompatibility complex regulation, T cell receptor signaling, and p53 signal transduction. These pathways are not found using differential gene expression analysis, warranting further research into their role in COPD and the potential for targeting these pathways with therapeutics.
Clustering the COPD patient networks revealed two groups with distinct patterns of gene co-expression, while clustering on gene expression did not reveal such a pattern. Clinical data indicate that these two clusters correspond to patients with significantly different levels of emphysema. Network structure analysis reveals that differentially weighted edges between COPD and controls form a single connected component characterized by scale-free properties, with several key hubs that include cell signaling proteins from the Notch pathway and autophagy modulators. Our network model for COPD enables a more comprehensive study of the disease’s molecular underpinnings by preserving the genetic heterogeneity, allowing for molecular-based patient subtyping and prioritization of disease gene candidates.
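The jack-knife interpolation step described in this abstract can be illustrated with a minimal sketch. It assumes the commonly used leave-one-out form net_q = N * (net_all - net_without_q) + net_without_q; the abstract does not give the exact formula, and the function names and toy data are hypothetical.

```python
import numpy as np

def pearson_network(expr):
    """Gene-by-gene Pearson correlation matrix; expr is genes x samples."""
    return np.corrcoef(expr)

def sample_specific_networks(expr):
    """One network per sample via leave-one-out linear interpolation.

    Assumed form (not stated in the abstract):
    net_q = N * (net_all - net_without_q) + net_without_q
    """
    n_samples = expr.shape[1]
    net_all = pearson_network(expr)
    networks = []
    for q in range(n_samples):
        # Recompute the aggregate network with sample q left out.
        net_without_q = pearson_network(np.delete(expr, q, axis=1))
        networks.append(n_samples * (net_all - net_without_q) + net_without_q)
    return np.stack(networks)  # shape: samples x genes x genes

# Toy data: 5 hypothetical genes measured in 12 hypothetical patients.
rng = np.random.default_rng(0)
toy_expr = rng.normal(size=(5, 12))
toy_networks = sample_specific_networks(toy_expr)
```

Each slice `toy_networks[q]` is a symmetric gene-by-gene matrix that could then be grouped by disease stage or clustered, as the abstract describes.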
- Rakesh Netha Vadnala, The Institute of Mathematical Sciences, Chennai, India. Homi Bhabha National Institute, Mumbai, India., India
- Leelavati Narlikar, Department of Chemical Engineering, CSIR-National Chemical Laboratory, Pune, India., India
- Sridhar Hannenhalli, Cancer Data Science Lab, National Cancer Institute, NIH, Bethesda, Maryland, USA., United States
- Rahul Siddharthan, The Institute of Mathematical Sciences, Chennai, India. Homi Bhabha National Institute, Mumbai, India., India
Short Abstract: Transcription factors (TFs) bind to specific DNA loci to regulate transcription. TFs are well known to interact with each other cooperatively or competitively when binding in sequential proximity. However, eukaryotic genomes are organized in a densely packed 3D chromatin structure where sequentially distant regions may be in close spatial proximity. Few studies explore such sequentially distant but spatially proximal TF-TF interaction. Here we explore 3D co-occurrence patterns of TFs using chromatin interaction data (ChIA-PET, Hi-C) and ChIP-seq data, primarily from the ENCODE project. We assess significance of co-occurrence or avoidance of TF pairs based on carefully randomized networks that retain essential features of the true chromatin contact and TF-DNA interaction networks. Across cell types, we find that most TF pairs either co-occur or avoid each other significantly more than expected by chance. Further, TFs clearly cluster into two main groups: TFs within each group co-occurring while avoiding TFs in the other group. ChIP-seq peaks for one group tend to occur significantly closer to annotated transcription start sites than the other group, suggesting association with promoter and enhancer activity respectively. These trends agree with previous work by Ma et al. (2018), who used different methodology to assess co-occurrence in GM12878 and mESC. GO term enrichment suggests that while one group of TFs is enriched for lineage-specific functions, the other group is enriched for regulation of constitutive gene expression. Co-occurring TF pairs show significant enrichment of protein-protein interactions and domain-domain interactions compared to avoiding TF pairs, providing a potential mechanism for this. Cohesin subunits SMC3 and RAD21 show significant association with each other and CTCF, while avoiding most TFs. 
Intriguingly, we find a TF-TF co-occurrence pattern within sequentially contiguous genomic regions largely consistent with that observed in sequentially distant but spatially proximal regions. A similar analysis based on DNA sequence motif co-occurrence also segregates TFs into two groups. These observations are broadly consistent across cell lines; however, the specific TFs that attract or repel vary, suggesting plasticity in the TF-TF interaction patterns across cell lines. Our work extends understanding of spatial co-occurrence patterns of TFs and provides new mechanistic insights.
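As a rough illustration of testing TF co-occurrence across chromatin contacts against a randomized background, the sketch below shuffles which TF set sits on which anchor. This is a deliberately simple null model chosen for illustration; the randomization used in the study retains more features of the real networks and is not specified here. All names and the toy data are hypothetical.

```python
import random

def cooccurrence_count(contacts, bound, tf_a, tf_b):
    """Count contacts with tf_a bound at one anchor and tf_b at the other."""
    count = 0
    for left, right in contacts:
        if (tf_a in bound[left] and tf_b in bound[right]) or (
            tf_b in bound[left] and tf_a in bound[right]
        ):
            count += 1
    return count

def empirical_p_value(contacts, bound, tf_a, tf_b, n_perm=1000, seed=0):
    """One-sided empirical p-value for co-occurrence under anchor shuffling."""
    rng = random.Random(seed)
    observed = cooccurrence_count(contacts, bound, tf_a, tf_b)
    anchors = list(bound)
    as_extreme = 0
    for _ in range(n_perm):
        shuffled = anchors[:]
        rng.shuffle(shuffled)
        # Reassign each anchor the TF set of a randomly chosen anchor.
        permuted = {a: bound[s] for a, s in zip(anchors, shuffled)}
        if cooccurrence_count(contacts, permuted, tf_a, tf_b) >= observed:
            as_extreme += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (as_extreme + 1) / (n_perm + 1)

# Toy contact map with six anchors (illustrative only).
toy_contacts = [("a1", "a2"), ("a3", "a4"), ("a5", "a6")]
toy_bound = {
    "a1": {"CTCF"}, "a2": {"RAD21"}, "a3": {"CTCF"},
    "a4": {"RAD21"}, "a5": {"YY1"}, "a6": {"SP1"},
}
toy_observed, toy_p = empirical_p_value(toy_contacts, toy_bound, "CTCF", "RAD21")
```

Avoidance could be tested the same way by counting permutations with a count less than or equal to the observed one.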
- Naoki Osato, Osaka University, Japan
Short Abstract: Chromatin interactions are essential in enhancer-promoter interactions (EPIs) and transcriptional regulation. CTCF and cohesin proteins located at chromatin interaction anchors and other DNA-binding proteins such as YY1, ZNF143, and SMARCA4 are involved in chromatin interactions. However, there is still no good overall understanding of proteins associated with chromatin interactions and insulator functions. Here, I describe a systematic and comprehensive approach for discovering DNA-binding motifs of transcription factors (TFs) that affect EPIs and gene expression. This analysis identified 96 directional motifs [64 in forward-reverse (FR) and 52 in reverse-forward (RF) orientation] that significantly affected the expression level of putative transcriptional target genes in monocytes, T cells, HMEC, and NPC; these included CTCF, cohesin (RAD21 and SMC3), YY1, and ZNF143. Some TFs have more than one motif in databases; thus, the total number is smaller than the sum of FRs and RFs. KLF4, ERG, RFX, RFX2, HIF1, SP1, STAT3, and AP1 were associated with chromatin interactions. Many other TFs were also known to have chromatin-associated functions. The predicted directional motifs were compared with chromatin interaction data. Correlations in expression level of nearby genes separated by the motif sites were then examined among 53 tissues. Most TFs showed weak directional biases at chromatin interaction anchors and were difficult to identify using enrichment analysis of motifs. These findings contribute to the understanding of chromatin-associated motifs involved in transcriptional regulation, chromatin interactions/regulation, and histone modifications. References: Osato N, bioRxiv 2020, https://doi.org/10.1101/290825
- Seong Kyu Han, Boston Children's Hospital / Harvard Medical School, United States
- Matthew Sampson, Boston Children's Hospital / Harvard Medical School, United States
- Dongwon Lee, Boston Children's Hospital / Harvard Medical School, United States
Short Abstract: Chromatin-accessibility assays, such as DNase-seq and ATAC-seq, have become a standard method for genome-wide identification of cis-regulatory elements (CREs), which play an essential role in transcriptional regulation. However, evaluating their data quality is still challenging since several biological and technical factors can significantly influence it. The unwitting use of low-quality data can confound the results of downstream transcriptional regulation research, leading to incorrect interpretations. To tackle this, we devised a sequence-based approach, gapped k-mer-SVM quality check (gkmQC), to conduct quality assessment and refinement of the chromatin-accessibility data. We employed a sequence-based predictive model for CREs and derived a quality metric from CRE prediction performance. Testing >800 samples from the ENCODE/Roadmap DNase-seq dataset, we discovered that gkmQC scores are significantly correlated with the degree to which the open-chromatin peaks deviate from core CRE regions (Spearman’s ρ = 0.51), while the SPOT2 scores, a conventional metric that measures read density within peaks, are not (ρ = 0.23). Next, we identified >100 high-quality (HQ) samples of 8 representative tissues and cells using gkmQC, and found that CRE variants from these HQ samples systematically contribute more to heritability than low quality (LQ) samples for 30 relevant traits from UK-Biobank (1.8x of normalized z-score of stratified LD-score regression coefficients; paired t-test P=8.7×10??). Moreover, a detailed inspection of peaks from 35 developing-kidney samples showed that peaks near the fine-mapped kidney GWAS loci from the HQ samples are more precisely located in the core CRE regions (average genomic distance between peak summits and the peak centroids: 19±1.2bp for HQ vs. 38±4.8bp for LQ [P=2.8×10⁻⁴]). Lastly, we demonstrate that gkmQC can be used to optimize a peak calling threshold, especially for single-cell data with sparse read-mapping.
By applying gkmQC to kidney single-nuclei ATAC-seq data, we successfully identified additional open chromatin regions from a rare cell type (podocytes; <1%). Furthermore, we showed that variants in these newly found peaks can explain significant heritability of a major kidney trait (eGFR; P(Pr[h²]/Pr[SNPs]) = 3.5×10⁻⁵ for default vs. 2.8×10⁻⁷ for gkmQC), suggesting that these are functionally relevant. Taken together, we expect that gkmQC will enable us to construct accurate CRE maps by identifying HQ samples and optimizing peak calling thresholds. Ultimately, fine-tuned CRE maps by gkmQC will be useful to empower the functional interpretation of disease-associated genetic variation.
- Jonas Ibn-Salem, TRON Translational Oncology Mainz, Germany
- Franziska Lang, TRON Translational Oncology Mainz, Germany
- Barbara Schrörs, TRON Translational Oncology Mainz, Germany
- Martin Löwer, TRON Translational Oncology Mainz, Germany
- Ugur Sahin, BioNTech, Germany
Short Abstract: Many cancer immunotherapy approaches rely on tumor-specific antigens (neo-antigens), which drive specific recognition and killing of tumor cells by T cells. While neo-antigens are commonly predicted from non-synonymous somatic mutations only, aberrant splicing can also lead to tumor-specific transcript isoforms, which encode further neo-antigens. However, detection of truly cancer-specific splicing variants is challenging because many identified splice junctions appear also in normal tissue. To overcome this problem, we combine somatic mutations from whole-exome sequencing with alternative splicing detection from RNA-seq in order to predict aberrant splice-junctions caused by somatic mutations as potential truly cancer-specific splicing variants. Here, we present a computational pipeline to detect tumor-specific splice-junctions encoding neo-antigen candidates. Three established tools are used to detect alternative splicing from RNA-seq data (JUM, LeafCutter, SplAdder). Next, we predict the effect of somatic point mutations on splicing in matching sequencing data using deep learning-based approaches (SpliceAI, MMsplice). We integrate the resulting splice junctions and confirm them in RNA-seq by re-mapping RNA-seq reads to splice-junction sequences. Finally, we filter out canonical splice junctions by comparing to GENCODE reference transcripts and splice junctions detected in RNA-seq data from thousands of healthy tissue samples from GTEx and TCGA consortia. In a proof of principle study, we applied our approach to 14 FFPE solid tumor samples from diverse tumor entities. The integration of RNA-seq derived splice-junction with mutation-based predictions results in eight candidates. Stringent re-quantification of mutation-derived splice-junctions with RNA-seq reads confirmed many further candidates per sample. Detailed annotation of the encoded peptides with MHC binding predictions may result in promising neo-antigen candidates. 
Taken together, our approach allows detecting tumor-specific splice variants by associating them with a causal somatic mutation. This enables a more comprehensive characterization of the neo-antigen load for both the prediction of response to immune checkpoint therapy and the design of personalized mutanome vaccines.
- Avirath Sundaresan, The Nueva School, United States
- Benjamin D. Lee, National Center for Biotechnology Information; Nuffield Department of Medicine, University of Oxford; In-Q-Tel Labs, United States
Short Abstract: We propose a novel two-dimensional graphical visualization method to intuitively analyze and compare multiple DNA sequences. To do so, we employ arithmetic coding, a lossless entropy-based compression method, to transform a sequence into a representative series of bits, taking advantage of differing DNA k-mer frequencies to encode DNA sequences more efficiently than standard two-bit encoding methods. To better visualize and compare related sequences, the k-mer frequencies used to generate the encoding can be derived from multiple sequences. The compressed binary sequence then is represented by a series of vectors constituting a two-dimensional graphical visualization that avoids degeneracy and loss of information. This DNA visualization method can be used to explore variation across sequences in an efficient and intuitive manner.
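To make the compression step concrete, here is a minimal sketch of arithmetic coding using exact fractions. It assumes single-base frequencies taken from the sequence itself, whereas the abstract uses k-mer frequencies that may be derived from multiple sequences; the mapping from bits to two-dimensional vectors is not specified in the abstract and is not attempted. All function names are hypothetical.

```python
import math
from collections import Counter
from fractions import Fraction

def symbol_intervals(sequence):
    """Assign each symbol a sub-interval of [0, 1) proportional to its frequency."""
    counts = Counter(sequence)
    total = sum(counts.values())
    intervals, start = {}, Fraction(0)
    for symbol in sorted(counts):
        width = Fraction(counts[symbol], total)
        intervals[symbol] = (start, start + width)
        start += width
    return intervals

def arithmetic_encode(sequence, intervals):
    """Narrow [low, high) symbol by symbol, then emit the shortest bit string
    whose dyadic interval [v / 2^n, (v + 1) / 2^n) lies inside [low, high)."""
    low, high = Fraction(0), Fraction(1)
    for symbol in sequence:
        span = high - low
        sym_low, sym_high = intervals[symbol]
        low, high = low + span * sym_low, low + span * sym_high
    n = 1
    while True:
        scale = 2 ** n
        v = math.ceil(low * scale)
        if Fraction(v + 1, scale) <= high:
            return format(v, "b").zfill(n)
        n += 1

def arithmetic_decode(bits, intervals, length):
    """Recover `length` symbols from the bit string (lossless round trip)."""
    value = Fraction(int(bits, 2), 2 ** len(bits))
    decoded = []
    for _ in range(length):
        for symbol, (sym_low, sym_high) in intervals.items():
            if sym_low <= value < sym_high:
                decoded.append(symbol)
                # Rescale the value back into [0, 1) for the next symbol.
                value = (value - sym_low) / (sym_high - sym_low)
                break
    return "".join(decoded)

toy_sequence = "AAAAAAAATTTTTTCG"
toy_intervals = symbol_intervals(toy_sequence)
toy_bits = arithmetic_encode(toy_sequence, toy_intervals)
toy_roundtrip = arithmetic_decode(toy_bits, toy_intervals, len(toy_sequence))
```

Because the toy sequence has skewed base frequencies, `toy_bits` is shorter than the 32 bits a fixed two-bit encoding would need, which is the property the abstract exploits.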
- Jan Grau, Institute of Computer Science, Martin Luther University Halle-Wittenberg, Germany
- Florian Schmidt, Genome Institute of Singapore, Singapore
- Marcel Schulz, Goethe University, Germany
Short Abstract: Accurate models describing the binding specificity of transcription factors (TFs) are essential for a better understanding of transcriptional regulation. Aside from chromatin accessibility and sequence specificity, several studies suggested that DNA methylation influences TF binding in both activating and repressive ways. We present MeDeMo (Methylation and Dependencies in Motifs), a novel framework for TF motif discovery and TFBS prediction that combines information about DNA methylation with models capturing intra-motif dependencies. Specifically, we represent methylation information, captured by genome-wide bisulfite sequencing, by an extended alphabet defining symbols "M" for a methylated C and "H" for a G opposite a methyl-C on the DNA double strand. Intra-motif dependencies are represented by adapted LSlim models, which are learned within the Dimont framework. We conduct a large-scale study of DNA methylation sensitivity for an unbiased collection of 144 TFs, applying MeDeMo to more than 600 ENCODE ChIP-seq datasets with matched bisulfite-seq data in four cell types. For 32 of the TFs, using DNA methylation information leads to a significantly and consistently improved discrimination between TF-bound and unbound sequences. MeDeMo allows for computing a position-specific profile of methylation sensitivity based on the inferred binding motifs, i.e., a score indicating the influence of CpG methylation on TF binding at a particular motif position. Overall, we find that the influence of methylation on the prediction score is detrimental for many of the TFs and that this is consistent for the analyzed cell types. Furthermore, the introduction of intra-motif dependencies is additionally important for 11 out of the 32 methylation-sensitive TFs. We illustrate for several example TFs why the combination of methylation information and modelling intra-motif dependencies is important to yield accurate models of TF-DNA binding preferences.
For instance, JUND binds DNA as a dimer with a variable 1-2 bp spacer, which may be captured by dependency models but not by PWM models. The two spacer variants show a different sensitivity to DNA methylation. MeDeMo is available as a stand-alone tool allowing both de novo discovery of methylation-aware TF motifs and genome-wide TFBS predictions using our catalogue of methylation-sensitive TF motifs. References: Keilwagen & Grau, Nucleic acids research, 2015, doi:10.1093/nar/gkv577; Grau et al., Nucleic acids research, 2013, doi:10.1093/nar/gkt831
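The extended alphabet described in this abstract can be sketched in a few lines. The input representation (0-based forward-strand positions for methylated cytosines on each strand) and the function name are assumptions made for illustration, not the tool's actual interface.

```python
def encode_methylation(sequence, methylated_forward, methylated_reverse):
    """Rewrite a forward-strand sequence in the extended alphabet.

    methylated_forward: positions of methylated C on the forward strand.
    methylated_reverse: forward-strand positions of a G that pairs with a
    methylated C on the reverse strand.
    """
    encoded = []
    for position, base in enumerate(sequence.upper()):
        if base == "C" and position in methylated_forward:
            encoded.append("M")  # methylated C
        elif base == "G" and position in methylated_reverse:
            encoded.append("H")  # G opposite a methyl-C
        else:
            encoded.append(base)
    return "".join(encoded)

# A fully methylated CpG at positions 2-3 becomes "MH"; the C at position 5
# is unmethylated and stays "C".
toy_encoded = encode_methylation("TACGTCA", methylated_forward={2}, methylated_reverse={3})
```

A motif model trained on such strings can then treat "M" and "H" as ordinary symbols alongside A, C, G, and T.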
- Aditi Deokar, Boston University Academy, United States
Short Abstract: Systemic lupus erythematosus (SLE) is the tenth leading cause of death in females 15-24 years old in the US. The diversity of symptoms and immune pathways expressed in SLE patients causes difficulties in treating SLE as well as in new clinical trials. This study used unsupervised learning on gene expression data from adult SLE patients to separate patients into clusters. The dimensionality of the gene expression data was reduced by three separate methods (PCA, UMAP, and a simple linear autoencoder) and the results from each of these methods were used to separate patients into six clusters with k-means clustering. The clusters revealed three separate immune pathways in the SLE patients that caused SLE. These pathways were: (1) high interferon levels, (2) high autoantibody levels, and (3) dysregulation of the mitochondrial apoptosis pathway. Mitochondrial apoptosis has not been investigated before to our knowledge as a standalone cause of SLE, independent of autoantibody production, and mitochondrial proteins could be investigated as a therapeutic target for SLE in the future.
- Satyaki Roy, University of North Carolina, Chapel Hill, United States
- Benjamin Keith, University of North Carolina, Chapel Hill, United States
- Shehzad Sheikh, University of North Carolina, Chapel Hill, United States
- Terrence Furey, University of North Carolina, Chapel Hill, United States
Short Abstract: Background: Weighted co-expression networks are undirected graphs where gene pairs having correlated gene expression across samples share edge connections. These networks are of biological interest as the associated genes often have the same transcriptional regulators or signaling (or metabolic) pathways. Co-expression networks can help identify higher-order modules of coordinated genes that collectively affect infectious, inflammatory, or neurological diseases. In this work, we leverage network analysis to identify genomic biomarkers for Crohn’s disease (CD), which is an autoimmune, inflammatory bowel disease (IBD) that causes chronic inflammation of the gastrointestinal tract. Methods: We generate co-expression networks on colon tissue samples of CD and non-IBD patients and apply network centrality measures to pinpoint significant genes in each module. We further carry out enrichment analysis to find consequential genes and their associated pathways corresponding to the two subtypes of CD, namely ileum- and colon-like (IL and CL). We apply the Jaccard similarity (a well-studied variant of the topological overlap measure) to analyze variations in gene expression profiles across control and diseased networks. Findings: Our analysis demonstrates that Jaccard similarity identifies genes and pathways reported to be responsible for heightened immunoreactivity. We posit that it can be a useful metric to pinpoint variational gene expression across co-expression networks. The combination of network and pathway analysis revealed that (1) the Interleukin (IL) genes, which are often associated with innate and adaptive immune systems controlling inflammation, exhibit the highest topological alteration, and (2) cytokine signaling pathways are enriched in expression profiles in CD patients. The CL subtypes show a higher dissimilarity with non-IBD samples compared to the IL counterpart.
Interpretation: The present findings show that a smaller pool of immune response, stress response, and pro-inflammatory genes is influencing the pathogenesis of CD. We are currently expanding this study by performing graph difference between large-scale databases of CD and non-IBD co-expression networks to gain a more fundamental understanding of the genes and their effects on the underlying biological processes. Finally, we are analyzing the eigengene networks to infer the functional relationship among modules in the co-expression networks.
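One way to read the Jaccard comparison in this abstract is as a per-gene similarity between a gene's neighbors in the control network and its neighbors in the disease network. The sketch below follows that reading on unweighted edge lists; the abstract does not specify the exact construction, and the gene labels are placeholders.

```python
def neighbors(edges):
    """Build an adjacency map from an undirected edge list."""
    adjacency = {}
    for gene_a, gene_b in edges:
        adjacency.setdefault(gene_a, set()).add(gene_b)
        adjacency.setdefault(gene_b, set()).add(gene_a)
    return adjacency

def jaccard(set_a, set_b):
    """Size of the intersection over size of the union (1.0 if both are empty)."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0

def neighborhood_similarity(control_edges, disease_edges):
    """Per-gene Jaccard similarity of neighborhoods across the two networks."""
    control, disease = neighbors(control_edges), neighbors(disease_edges)
    genes = set(control) | set(disease)
    return {
        gene: jaccard(control.get(gene, set()), disease.get(gene, set()))
        for gene in genes
    }

# Toy networks: G1 swaps one neighbor (G3 for G4) between conditions.
toy_control = [("G1", "G2"), ("G1", "G3"), ("G3", "G2")]
toy_disease = [("G1", "G2"), ("G1", "G4"), ("G3", "G2")]
toy_similarity = neighborhood_similarity(toy_control, toy_disease)
```

Genes with low similarity are the ones whose connectivity changes most between conditions, which is the kind of topological alteration the abstract reports.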
- Omer Acar, University of Pittsburgh, United States
- She Zhang, University of Pittsburgh, United States
- Ivet Bahar, University of Pittsburgh, United States
- Anne-Ruxandra Carvunis, University of Pittsburgh, United States
Short Abstract: Genetic networks are high-level representations of relationships between genes. These networks allow investigating not only pairwise relationships but also long-range ones. Network propagation methods have been widely used to identify these long-range relationships by diffusing information into the network, similar to the flow of a liquid. For example, if a gene is known to be a disease gene, diffusing information from this gene enables the discovery of other disease genes. However, existing network propagation methods lack the power to calculate and compare the total information flow caused by all genes in the network. Indeed, the total information in the system is restricted to a constant, predetermined quantity and the detection of long-range interactions depends on a chosen duration of propagation. Here, we address these limitations to obtain a global, unbiased view of long-range interactions in genetic networks by leveraging a perturbation-response scanning (PRS) method originally developed for identifying long-range interactions within protein structures. The PRS method utilizes elastic network models where each network node is a ball and each network edge is a spring, forming a network that is a system of balls connected by springs. The method allows for applying forces/perturbations on network nodes and measuring the cooperative motions/responses of all other nodes, where the former represent the initial information and the latter represent the propagated information. We adapted the PRS methodology to genetic networks and systematically identified long-range relationships between genes in the yeast genetic interaction similarity network, the yeast protein-protein interaction (PPI) network, and the human PPI network. We then used this information to evaluate the signal transduction ability of the genes using two metrics: effectiveness and sensitivity.
Effectiveness is the ability of a gene to distribute information to others; sensitivity is the propensity of receiving information, independent of the source of information. Genes distinguished by their high ability to transmit and receive information are defined as effectors and sensors, respectively. We find that higher degree genes show higher effectiveness, while lower degree genes show higher sensitivity. Closer examination suggests that low-degree genes we identified as sensors regulate energy metabolism in yeast and transcriptional control in human, integrating information from diverse pathways that are crucial for cell survival and plasticity.
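A minimal sketch of how effectiveness and sensitivity profiles could be derived from a network is given below. It assumes a scalar elastic network in which the graph Laplacian plays the role of the spring matrix, the response of node j to a perturbation at node i is the squared covariance entry, and each row is normalized by its self-response. These choices are assumptions for illustration, not necessarily the formulation used in the poster.

```python
import numpy as np

def prs_profiles(adjacency):
    """Return the normalized response matrix plus effectiveness and sensitivity."""
    adjacency = np.asarray(adjacency, dtype=float)
    # Laplacian (Kirchhoff) matrix: node degrees on the diagonal minus adjacency.
    laplacian = np.diag(adjacency.sum(axis=1)) - adjacency
    # The pseudo-inverse acts as the covariance of node fluctuations.
    covariance = np.linalg.pinv(laplacian)
    # Assumed response of node j (column) to a unit perturbation at node i (row).
    response = covariance ** 2
    # Normalize each row by the perturbed node's own response.
    response = response / np.diag(response)[:, None]
    effectiveness = response.mean(axis=1)  # how strongly a node drives others
    sensitivity = response.mean(axis=0)    # how strongly a node responds
    return response, effectiveness, sensitivity

# Toy star network: node 0 is the hub, nodes 1-3 are leaves.
star = [
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]
response, effectiveness, sensitivity = prs_profiles(star)
```

Nodes with high row averages would be candidate effectors and nodes with high column averages candidate sensors, in the sense defined in the abstract.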
- Irene Kaplow, Carnegie Mellon University, United States
- Daniel Schäffer, Carnegie Mellon University, United States
- Michael Kleyman, Carnegie Mellon University, United States
- Morgan Wirthlin, Carnegie Mellon University, United States
- Andreas Pfenning, Carnegie Mellon University, United States
Short Abstract: Many phenotypes, including vocal learning, longevity, and brain size, have evolved at least in part through gene expression, meaning that some of their differences across species are caused by differences in genome sequence at enhancers. While some of the genes involved in these phenotypes have been identified, in most cases, it remains unknown which enhancers are responsible and how genome sequence differences in those enhancers have led to differences in gene expression. We used open chromatin regions (OCRs), which can serve as a proxy for enhancers, from datasets that we and others generated from diverse mammalian brains and livers to train machine learning models that predict brain and liver open chromatin status from genome sequences at orthologs of OCRs. We found that our models could accurately predict lineage-specific and tissue-specific OCR ortholog activity, that our predictions across species tended to match our expectations based on the phylogenetic tree, and that our predictions more accurately predict open chromatin status conservation than predictions made using conservation scores. We applied our models to make brain and liver OCR ortholog open chromatin status predictions in hundreds of mammals. We found distinct clusters of OCR orthologs based on their predicted open chromatin status in brain and liver. For example, we found clusters of OCR orthologs that are predicted to be closed in specific lineages, such as bats or cetaceans, and clusters of OCR orthologs whose open chromatin status seems to have evolved convergently, such as OCR orthologs that are open in brain or liver in only primates and ungulates or only primates and carnivora. We also associated our predictions with cross-species brain and liver phenotype annotations in a way that accounts for phylogeny to identify candidate OCRs that might be involved in these phenotypes.
For example, human OCR orthologs associated with longevity are enriched for occurring near genes involved in response to epinephrine and negative regulation of oxidative stress-induced neuron death. Our approach to identifying OCR orthologs associated with phenotypes that have evolved through gene expression can be applied to phenotypes associated with any tissue or cell type that has open chromatin data from at least a few species.
- Aryan Kamal, EMBL, Germany
- Christian Arnold, EMBL, Germany
- Sophia Müller-Dott, EMBL, Germany
- Maksim Kholmatov, EMBL, Germany
- Neha Daga, EMBL, Germany
- Judith Zaugg, EMBL, Germany
Short Abstract: Genetic variants associated with diseases often affect non-coding regions, thus likely having a regulatory role. To understand the effects of genetic variants in these regulatory regions, identifying genes that are modulated by specific regulatory elements (REs) is crucial. The effect of gene regulatory elements, such as enhancers, is often cell-type specific, likely because the combinations of transcription factors (TFs) that are regulating a given enhancer have cell-type specific activity. This TF activity can be quantified with existing tools such as diffTF and captures differences in binding of a TF in open chromatin regions. Collectively, this forms a gene regulatory network (GRN) with cell-type and data-specific TF-RE and RE-gene links. Here, we reconstruct such a GRN using bulk RNAseq and open chromatin (e.g., using ATACseq or ChIPseq for open chromatin marks) and optionally TF activity data. Our network contains different types of links, connecting TFs to regulatory elements, the latter of which are connected to genes in the vicinity or within the same chromatin domain (TAD). We use a statistical framework to assign empirical FDRs and weights to all links using a permutation-based approach. Since no widely accepted ground-truth dataset for assessing the constructed GRN exists, we propose a novel evaluation algorithm that does not use a ground-truth network and instead assesses a GRN based on its performance in predicting differential expression response. For this, we use a random forest regression model and evaluate how well the GRN links predict differential expression values based on differential TF activity. Overall, our GRNs consistently perform significantly better than corresponding randomized versions, showing that they capture reliable links between TFs and their target genes. Our framework also allows benchmarking and comparing different GRN reconstruction algorithms.
Finally, we ran our GRN construction and evaluation pipeline on diverse datasets such as naive CD4-positive T cells or an AML cohort and identified a set of cell-type specific TFs with crucial roles in predicting differential gene expression based on differential TF activity. The resulting core subnetwork has higher predictive power and enables deeper understanding of the underlying regulatory programs.
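The permutation-based assignment of empirical FDRs to links can be sketched generically as follows. The sketch assumes each link has a score (for example, a TF-to-RE correlation) and that the same scores were recomputed on several permuted datasets; the scoring and permutation scheme actually used in the poster are not described, so every name here is hypothetical.

```python
def empirical_fdr(observed, permuted_runs):
    """Empirical FDR for each observed link score.

    observed: list of link scores from the real data.
    permuted_runs: list of score lists, one list per permuted dataset.
    FDR at threshold s = (average number of permuted scores >= s per run)
                         / (number of observed scores >= s), capped at 1.
    """
    fdr = []
    for score in observed:
        n_observed = sum(1 for s in observed if s >= score)  # always >= 1
        n_null = sum(
            sum(1 for s in run if s >= score) for run in permuted_runs
        ) / len(permuted_runs)
        fdr.append(min(1.0, n_null / n_observed))
    return fdr

# Toy scores for three links and two permuted datasets.
toy_observed = [0.9, 0.5, 0.2]
toy_permuted = [[0.3, 0.1, 0.6], [0.2, 0.4, 0.1]]
toy_fdr = empirical_fdr(toy_observed, toy_permuted)
```

Links would then be retained if their empirical FDR falls below a chosen cutoff.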
- Mira Barshai, Ben-Gurion University of the Negev, Israel
- Alice Aubert, Ecole Polytechnique, France
- Yaron Orenstein, Ben-Gurion University of the Negev, Israel
Short Abstract: G-quadruplexes (G4s) are nucleic acid secondary structures that form within guanine-rich DNA or RNA sequences. G4 formation can affect chromatin architecture and gene regulation and has been associated with genomic instability, genetic diseases and cancer progression. The experimental data produced by the G4-seq experiment provides unprecedented details on G4 formation in the genome. Still, running the experimental protocol on a whole genome is an expensive and time-consuming process. Thus, it is highly desirable to have a computational method to predict G4 formation of new DNA sequences or whole genomes. Here, we present G4detector, a new method to predict G4s from DNA sequences based on a convolutional neural network. On top of the sequence information, we improved prediction accuracy by combining RNA secondary structure information obtained from running the ViennaRNA program RNAplfold. To train and test G4detector, we compiled novel high-throughput benchmarks with three different negative example types over multiple species genomes measured by the G4-seq protocol (human, mouse, zebrafish and drosophila). The purpose of the three negative types is to test the method on progressively more difficult problems. Not only does G4detector achieve very high area under the ROC curve (AUC) scores on all three negative types, it also outperforms extant methods on the same task and metric on all benchmark datasets and is able to extrapolate human-trained measurements to the three non-human species from which we generate the other datasets. We also show, through the use of integrated gradients and mutation maps, that the method is able to learn relevant biological information for each dataset. The code and benchmarks are publicly available on github.com/OrensteinLab/G4detector
- Jan Zrimec, Chalmers University of Technology, Sweden
- Aleksej Zelezniak, Chalmers University of Technology, Sweden
Short Abstract: The DNA regulatory code that governs gene expression is present in the gene regulatory structure that spans the coding and adjacent non-coding regulatory DNA regions, including promoters, terminators and untranslated regions. Deciphering this regulatory code, as well as how the whole gene regulatory structure interacts to produce mRNA transcripts and regulate mRNA abundance, can greatly improve our capabilities for controlling gene expression and solving problems related to both medicine and biotechnology. Here, we consider that natural systems offer the most accurate information on gene expression regulation and apply deep learning on over 20,000 mRNA datasets to learn the DNA-encoded regulatory code across a variety of model organisms from bacteria to Human (https://doi.org/10.1101/792531). Since up to 82% of the regulatory code is encoded in the gene regulatory structure, mRNA abundance can be predicted directly from DNA with high accuracy in all model organisms. Coding and regulatory regions in fact carry both overlapping and orthogonal information and additively contribute to gene expression levels. By mining the gene expression models for the relevant DNA regulatory motifs, we uncover motif interactions across the whole gene regulatory structure that define over 3 orders of magnitude of gene expression levels. Based on these findings we develop a novel AI-guided approach for protein expression engineering and experimentally verify its usefulness. Our results challenge the current paradigm that single motifs or regulatory regions are solely responsible for gene expression levels. Instead, we demonstrate that the whole gene regulatory structure, comprising the DNA regulatory grammar of interacting DNA motifs across protein coding and adjacent regulatory regions, forms a coevolved transcriptional regulatory unit and provides a mechanism by which whole gene systems with pre-specified expression patterns can be designed.
- Isabella He, Pittsford Mendon High School, United States
- Zhaohui Qin, Emory University, United States
- Yongsheng Bai, Next-Gen Intelligent Science Training, United States
Short Abstract: Schizophrenia is a neurological disorder that affects behavior and emotions and can result in serious hallucinations and motor problems. Around 3.5 million Americans have been diagnosed with schizophrenia. The underlying disease mechanism of schizophrenia is still unknown. We downloaded genome-wide association study (GWAS) results for schizophrenia from the Phenotype-Genotype Integrator (PheGenI). Using Enrichr, we inputted 1,453 genes associated with schizophrenia and discovered a pathway connection between the disease and ARCHS4 co-expression. We obtained 9 missense variant-containing genes and 14 3’UTR variant-containing genes associated with schizophrenia. Of these, 3 missense variant-containing genes and 4 3’UTR variant-containing genes were reported to have ClinVar annotation significance (pathogenic/likely pathogenic). Protein-protein interaction (PPI) results from the STRING database provide no evidence of interacting genes overlapping across the 9 missense variant-containing genes. Using the UCSC Genome Browser, we retrieved 3’UTR sequences for all 14 genes to run the MEME software. We identified one promising motif present in all 14 3’UTR variant-containing genes. We are currently investigating whether any of the 3’UTR variants are also functionally important in the context of the discovered motif.
- Julong Wei, Wayne State University, United States
- Justyna Resztak, Wayne State University, United States
- Peijun Wu, University of Michigan Ann Arbor, United States
- Edward Sendler, Wayne State University, United States
- Adnan Alazizi, Wayne State University, United States
- Henriette Mair-Meijers, Wayne State University, United States
- Samuele Zilioli, Wayne State University, United States
- Xiang Zhou, University of Michigan Ann Arbor, United States
- Francesca Luca, Wayne State University, United States
- Roger Pique-Regi, Wayne State University, United States
Short Abstract: Single cell technologies enabled gene expression analysis at single cell resolution to study heterogeneous cell populations across conditions and individuals; yet few computational methods exist to profile response dynamics. Beyond looking at the transcriptional response to stimuli for each cell-type separately, single cell data can capture more fine-grained details by analyzing trajectory dynamics, variance, other higher order moments or velocity. These response dynamics are crucial to gain a better understanding of the molecular underpinnings in inter-individual variation in drug response. Here we collected new data and developed a new supervised approach based on the linear discriminant analysis (LDA), to construct a low dimensional representation of the response for each cell-type. We then used the LDA axes to identify gene-by-environment interactions in the response dynamics to different stimuli for each cell-type. To evaluate the new method and to also compare to other dynamic metrics (dispersion and velocity), we activated peripheral blood mononuclear cells from 96 African Americans with phytohemagglutinin (PHA) or lipopolysaccharide (LPS), and treated with the glucocorticoid dexamethasone (DEX). We performed scRNA-seq for 292,394 cells and identified four major cell types: B-cell, Monocyte, NK-cell and T-cell. We employed negative binomial distribution to calculate RNA expression dispersion (which is less correlated with mean expression than variance). We detected 2,585 genes with variable dispersion, most in monocytes and enriched in immune-related processes. Effects on dispersion induced by PHA and LPS are negatively correlated with those induced by DEX, which implies that DEX suppresses the activated immune response through effects on both gene expression mean and dispersion. Using scVelo we quantified and determined 1,706 genes with differential gene expression velocities between treatments (34% of them with variable dispersion). 
Our new approach (LDA) is able to capture gene expression dynamic changes in response to treatments in two components: LDA1 captures dynamic changes induced by DEX, while LDA2 tends to reflect dynamic changes induced by PHA and LPS. For dynamic interaction eQTL mapping, we identified 834 eQTLs interacting with LDA1 (39 genes) in the DEX groups, and 12,512 eQTLs interacting with LDA2 (2,531 genes) from PHA or LPS groups. We also discovered 130 dispersion eQTLs corresponding to 19 unique genes. Our results shed light on the dynamics of gene expression in response to stimuli and across individuals. These dynamic processes fundamentally regulate the immune system and may contribute to inter-individual variation in immunological processes and diseases.
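As a rough illustration of the LDA idea described above, the sketch below computes a two-class Fisher discriminant axis for toy two-gene expression profiles and projects cells onto it. The data, the restriction to two genes and two conditions, and all function names are hypothetical; the abstract does not describe the implementation, and the actual analysis involves several treatments and many genes.

```python
def mean_vec(rows):
    """Column-wise mean of a list of equal-length rows."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]

def fisher_axis(control, treated):
    """Fisher LDA direction w = Sw^-1 (m_treated - m_control) for 2-D data."""
    m0, m1 = mean_vec(control), mean_vec(treated)
    # Pooled within-class scatter matrix (2x2).
    s = [[0.0, 0.0], [0.0, 0.0]]
    for rows, m in ((control, m0), (treated, m1)):
        for r in rows:
            d = [r[0] - m[0], r[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
    # Closed-form inverse of the 2x2 scatter matrix.
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det],
           [-s[1][0] / det, s[0][0] / det]]
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]

def project(cells, w):
    """Score each cell by its position along the discriminant axis."""
    return [c[0] * w[0] + c[1] * w[1] for c in cells]

# Toy expression values (gene 1, gene 2) per cell; purely illustrative.
control = [[1.0, 2.0], [1.5, 1.8], [0.8, 2.2], [1.2, 2.1]]
treated = [[3.0, 0.9], [3.4, 1.2], [2.8, 0.7], [3.1, 1.1]]

w = fisher_axis(control, treated)
control_scores = project(control, w)
treated_scores = project(treated, w)
```

Each cell's score along the axis could then serve as a continuous covariate, for example in an interaction term with genotype, though how the authors set up that model is not stated here.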
- Matt Kanke, Cornell University, United States
- Meaghan Kennedy, University of North Carolina at Chapel Hill, United States
- Sean Connelly, University of North Carolina at Chapel Hill, United States
- Matthew Schaner, University of North Carolina at Chapel Hill, United States
- Ashley Wolber, University of North Carolina at Chapel Hill, United States
- Caroline Beasley, University of North Carolina at Chapel Hill, United States
- Terrence Furey, University of North Carolina at Chapel Hill, United States
- Shehzad Sheikh, University of North Carolina at Chapel Hill, United States
- Praveen Sethupathy, Cornell University, United States
Short Abstract: Crohn’s disease (CD) is a condition caused by an abnormal immune response to enteric microbiota in the gastrointestinal tract of a genetically susceptible host (1). Due to the high heterogeneity in disease location, behavior, and progression, precise clinical diagnosis of CD is challenging (2). Furthermore, there are no reliable methods to aid prognosis, and there are no long-term treatments for CD, necessitating a better understanding of CD at a molecular level (2). A key contributing factor to this chronic inflammatory condition is intestinal epithelial cell (IEC) barrier breakdown. IECs separate microbes in the lumen from the mucosal immune system, and multiple IEC subtypes compose the intestinal epithelium with distinct functions (3). In this study we generated single-cell RNA sequencing data from colonic IECs of 3 treatment naïve newly diagnosed adult CD and 4 healthy control (NIBD) patients to understand which IEC subtypes drive changes in gene expression and IEC function that lead to dysregulation and disease. Previous studies have looked at single-cell transcriptomics of healthy colonic and ileal samples as well as those with ulcerative colitis and CD (4-6), but our study is the first to focus on the single-cell composition of colonic IECs from CD patients. We found cell types that were not identified previously, such as CEACAM7+ colonocytes, and confirmed previous findings, such as the existence of high BEST4 signal in an enterocyte cell subtype. Cellular composition was generally consistent between NIBD and CD samples, though CD patients had significantly more CA1+ late colonocytes and reduced numbers of stem cells. Furthermore, differences between CD and NIBD patients included a higher ratio of goblet cells to enteroendocrine cells in CD.
We also see differences between CD and NIBD samples in expression levels of genes within each IEC subtype, and KEGG pathway analysis showed that genes upregulated in CD patients overlapped with gene sets implicated in autoimmune diseases and infections (viral or bacterial). Overall, these differences may reflect altered IEC function in disease. 1. Sartor R.B. 2006. Nat Clin Pract Gastroenterol Hepatol 3(7):390-407. 2. Furey, T.S., Sethupathy, P., Sheikh, S.Z. 2019. Nat Rev Gastroenterol Hepatol 16, 296–311. 3. Peterson, L., Artis, D. 2014. Nat Rev Immunol 14, 141–153. 4. Haber, A. et al. 2017. Nature 551, 333–339. 5. Smillie, C.S. et al. 2019. Cell 178(3):714-730.e22. 6. Parikh, K. et al. 2019. Nature 567, 49–55.
- Hannah Zhou, Palo Alto High School, United States
- Avanti Shrikumar, Stanford University, United States
- Anshul Kundaje, Stanford University, United States
Short Abstract: Predictive models that map double-stranded regulatory DNA to molecular signals of regulatory activity should, in principle, produce identical predictions regardless of whether the sequence of the forward strand or its reverse complement (RC) is supplied as input. Unfortunately, standard convolutional neural network architectures can produce highly divergent predictions across strands, even when the training set is augmented with RC sequences. Two strategies have emerged in the literature to enforce this symmetry: conjoined a.k.a. "siamese" architectures where the model is run in parallel on both strands and predictions are combined, and RC parameter sharing or RCPS where weight sharing ensures that the response of the model is equivariant across strands. However, systematic benchmarks are lacking, and neither architecture has been adapted to base-resolution signal profile prediction tasks. In this work, we extend conjoined and RCPS models to signal profile prediction, and introduce a strong baseline: a standard model (trained on RC augmented data) that is converted to a conjoined model only after it has been trained, which we call a "post-hoc" conjoined model. We then conduct benchmarks on both binary and signal profile prediction. We find post-hoc conjoined models consistently perform as well as or better than models that were conjoined during training, and present a mathematical intuition for why. We also find that - despite its theoretical appeal - RCPS performs surprisingly poorly on certain tasks, in particular, signal profile prediction. In fact, RCPS can sometimes do worse than even standard models trained with RC data augmentation. We prove that the RCPS models can represent the solution learned by the conjoined models, implying that the poor performance of RCPS may be due to optimization difficulties.
We therefore suggest that users interested in RC symmetry should default to post-hoc conjoined models as a reliable baseline before exploring RCPS.
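The post-hoc conjoined idea can be sketched in a few lines: take an already trained model, run it on a sequence and on its reverse complement, and average the two outputs. In the sketch below, `toy_model` and `toy_profile_model` are hypothetical stand-ins for a trained network, and simple averaging is an assumption about how the two strands are combined. For per-base profiles, the sketch also assumes a single output track, so the RC prediction is reversed back into forward coordinates before averaging.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(seq):
    """Reverse-complement a DNA string over the alphabet ACGT."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))

def toy_model(seq):
    """Hypothetical strand-sensitive scalar scorer (stands in for a trained CNN)."""
    # Position-weighted count of G, so the score changes with strand orientation.
    return sum(i + 1 for i, b in enumerate(seq) if b == "G") / len(seq)

def toy_profile_model(seq):
    """Hypothetical per-base predictor: one value per input position."""
    return [1.0 if b == "G" else 0.0 for b in seq]

def posthoc_conjoined(model, seq):
    """Average the model's scalar outputs on the forward and RC strands."""
    return 0.5 * (model(seq) + model(reverse_complement(seq)))

def posthoc_conjoined_profile(model, seq):
    """Average per-base outputs after flipping the RC profile to forward coordinates."""
    fwd = model(seq)
    rev = list(reversed(model(reverse_complement(seq))))
    return [0.5 * (f + r) for f, r in zip(fwd, rev)]

seq = "GGATCA"
rc = reverse_complement(seq)

raw_fwd, raw_rc = toy_model(seq), toy_model(rc)        # differ across strands
sym_fwd = posthoc_conjoined(toy_model, seq)            # identical across strands
sym_rc = posthoc_conjoined(toy_model, rc)
profile_fwd = posthoc_conjoined_profile(toy_profile_model, seq)
profile_rc = posthoc_conjoined_profile(toy_profile_model, rc)
```

Because the wrapper is applied only at prediction time, the underlying model can be trained in the usual way with RC data augmentation, which matches the baseline described in the abstract.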
- Kevin Dsouza, University of British Columbia, Canada
- Vijay Bhargava, University of British Columbia, Canada
- Maxwell Libbrecht, Simon Fraser University, Canada
Short Abstract: The organization of the genome in 3D space inside the nucleus plays an important role in general functional characteristics of the genome. Chromosome Conformation Capture (3C) and Hi-C techniques have enabled us to quantify the strength of interactions between loci that are nearby in space. A Hi-C assay outputs a matrix of interaction strengths for every pair of genomic positions. This huge amount of data requires the development of computational techniques to discover hidden relationships between genome structure and function. Representation learning methods provide a way to understand 3D organization. These methods assign a low-dimensional vector of features to each genomic position that summarizes the 3D organization properties of that position. Representations learnt from Hi-C datasets can serve multiple purposes. They can capture the existing elements that drive 3D conformation, identify spatially clustering genomic regions, and identify relationships between 3D structure and functional phenomena. Several methods for Hi-C representation learning have been developed recently, including SNIPER and SCI. However, these existing methods do not take into account the linear structure of the genome, which hampers their performance. In this work, we propose a method, Hi-C-LSTM, that produces low-dimensional representations of the Hi-C intra-chromosomal contacts, assigning a vector of features to each genomic position that represents that position's contact activity with all other positions in the given chromosome. We do this by using a deep long short-term memory (LSTM) recurrent neural network. This LSTM model takes the representations as input and outputs predicted Hi-C contact strength for each pair of positions. The key benefit of the LSTM is its sequential structure, which allows the model to take the linear structure of the genome into account.
We find that representation learning using an LSTM structure results in extremely effective representations according to several metrics. First, we find that Hi-C-LSTM captures more information from the input Hi-C matrix than existing methods. Second, we find that Hi-C-LSTM identifies which genomic loci drive 3D organization. Third, we find that Hi-C-LSTM can be used to perform in-silico experiments to evaluate the effect of removing or editing genomic loci.
- Faezeh Bayat, Simon Fraser University, Canada
- Maxwell Libbrecht, Simon Fraser University, Canada
Short Abstract: A sequencing-based genomic assay such as ChIP-seq outputs a real-valued signal for each position in the genome that measures the strength of activity at that position. Read counts of genomic assays have a nonuniform mean-variance relationship, which poses a challenge to their analysis. For example, a locus with 1,000 reads in one experiment might get 1,100 reads in a replicate experiment by chance, whereas a locus with 100 reads might usually see no more than 110 reads by chance in a replicate. This property means that, for example, the difference in read count between biosamples is a poor measure of the difference in activity. To handle this issue, most statistical models of genomic signals such as those used in peak calling, model the mean-variance relationship of read counts explicitly using, for example, a negative binomial distribution. These statistical models can account for this pattern, but learning these models is computationally challenging. Therefore, many applications including imputation and segmentation and genome annotation (SAGA) instead use Gaussian models and use a transformation such as log or inverse hyperbolic sine (asinh) to stabilize variance. In this study, we proposed VSS, a method that produces units for sequencing-based genomic signals that have the desirable property of variance stability. We found that the transformations that are currently used to stabilize variance like log(x+1) and asinh(x) do not fully do so. In fact, we found that the mean-variance relationship of genomic signals varies greatly between data sets, indicating that no single transformation can be applied to all data sets uniformly. Instead, variance stability requires a method such as VSS that empirically determines the experiment-specific mean-variance relationship. We showed that VSS successfully stabilizes variance in genomic data sets. Further, we found that using variance-stabilized data improves the performance of Gaussian models such as SAGA. 
Variance-stabilized signals will aid in all downstream applications of genomic signals. In particular, they are valuable for two reasons. First, VSS signals allow downstream methods to use squared error loss or Gaussian likelihood distributions, which are much easier to optimize than the existing practice of implementing a model that accounts for the mean-variance relationship. This will improve tasks that currently use Gaussian models, such as chromatin state annotation and imputation. Second, VSS signals can be easily analyzed by eye because the viewer does not need to take the mean-variance relationship into account when visually inspecting the data.
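The motivating example above (1,000 vs. 1,100 reads and 100 vs. 110 reads) can be worked through numerically. The sketch below applies the two standard transformations named in the abstract to those replicate pairs. It assumes, as in that example, noise proportional to the read count; it is not an implementation of VSS, which according to the abstract estimates the mean-variance relationship empirically for each experiment because real data sets deviate from any single fixed transformation.

```python
import math

def log1p_transform(x):
    """log(x + 1) transformation."""
    return math.log(x + 1.0)

def asinh_transform(x):
    """Inverse hyperbolic sine transformation."""
    return math.asinh(x)

# Replicate pairs from the example in the text: (replicate 1, replicate 2).
pairs = {"high": (1000.0, 1100.0), "low": (100.0, 110.0)}

# Raw differences scale with the mean: 100 reads vs. 10 reads.
raw_diff = {k: b - a for k, (a, b) in pairs.items()}

# After transformation the two chance differences become nearly equal.
log_diff = {k: log1p_transform(b) - log1p_transform(a) for k, (a, b) in pairs.items()}
asinh_diff = {k: asinh_transform(b) - asinh_transform(a) for k, (a, b) in pairs.items()}
```

Under this idealized proportional-noise assumption both transformations equalize the replicate differences; the abstract's point is that measured mean-variance curves vary between data sets, so a fixed transformation is not sufficient in general.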
- Habib Daneshpajouh, Simon Fraser University, Canada
- Bowen Chen, Simon Fraser University, Canada
- Neda Shokraneh, Simon Fraser University, Canada
- Shohre Masoumi, Simon Fraser University, Canada
- Kay Wiese, Simon Fraser University, Canada
- Maxwell Libbrecht, Simon Fraser University, Canada
Short Abstract: Sequencing-based genomics assays can measure many types of genomic biochemical activity, including transcription factor binding, chromatin accessibility, transcription, and histone modifications. Data from sequencing-based genomics assays is now available from hundreds of human cellular conditions, including varying tissues, individuals, disease states, and drug perturbations. Semi-automated genome annotation (SAGA) methods are widely used to understand genome activity and gene regulation. These algorithms take as input a collection of sequencing-based genomics data sets from a particular tissue. They output an annotation of the genome that assigns a label to each genomic position. All existing SAGA methods output a discrete annotation that assigns a single label to each position. This discrete annotation strategy has several limitations. First, discrete annotations cannot represent the strength of genomic elements. Variation among genomic elements in intensity or frequency of activity of cells in the sample is captured in variation in the intensity of the associated marks. Such variation is lost if all such elements are assigned the same label. Second, a discrete annotation cannot represent combinatorial elements that simultaneously exhibit multiple types of activity. To model combinatorial activity, a discrete annotation must use a separate label to represent each pair (or triplet etc.) of activity types. We propose a method that uses a Kalman filter state space model to efficiently annotate the genome with chromatin state features. That is, our method outputs a vector of real-valued chromatin state features for each genomic position, where each chromatin state feature putatively represents a different type of activity. Continuous chromatin state features have a number of benefits over discrete labels. First, these features preserve the underlying continuous nature of the input signal tracks. 
Second, in contrast to discrete labels, continuous features can easily capture the strength of a given element. Third, continuous features can easily handle positions with combinatorial activity by assigning a high weight to multiple features. Fourth, chromatin state features lend themselves to expressive visualizations because they project complex data sets onto a small number of dimensions. We also propose several measures of the quality of a chromatin state feature annotation relative to genes and gene expression, and we compare epigenome-ssm to existing SAGA methods (ChromHMM, Segway) according to these quality measures. We found that epigenome-ssm produces the highest-quality annotations of the methods we compared. Particularly, an epigenome-ssm annotation is a better predictor of gene expression and enhancer activity.
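To give a feel for the Kalman filter state space model mentioned above, here is a minimal scalar filter in which one latent value evolves along genomic positions and is observed through one noisy signal track. This is only a one-dimensional sketch with hand-picked parameters (`a`, `c`, `q`, `r`, `x0`, `p0` are all assumptions); the method in the abstract outputs a vector of features per position from many input tracks, and its parameterization is not described here.

```python
def kalman_filter_1d(observations, a=1.0, c=1.0, q=0.01, r=1.0, x0=0.0, p0=1.0):
    """Scalar Kalman filter.

    State model:       x_t = a * x_{t-1} + process noise (variance q)
    Observation model: y_t = c * x_t     + measurement noise (variance r)
    Returns the filtered state estimate at each position.
    """
    x, p = x0, p0
    estimates = []
    for y in observations:
        # Predict step: propagate the state and its variance.
        x_pred = a * x
        p_pred = a * a * p + q
        # Update step: blend prediction and observation using the Kalman gain.
        gain = p_pred * c / (c * c * p_pred + r)
        x = x_pred + gain * (y - c * x_pred)
        p = (1.0 - gain * c) * p_pred
        estimates.append(x)
    return estimates

# Toy signal track: a constant observed value of 5 at 50 consecutive positions.
signal = [5.0] * 50
estimates = kalman_filter_1d(signal)
```

The filtered estimate is a real-valued quantity at every position, which is the property the abstract contrasts with the single discrete label produced by existing SAGA methods.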
- Soo Bin Kwon, University of California, Los Angeles, United States
- Jason Ernst, University of California, Los Angeles, United States
Short Abstract: Identifying genomic regions with functional genomic properties that are conserved between human and mouse is an important challenge in the context of mouse model studies. To address this, we develop a computational method, Learning Evidence of Conservation from Integrated Functional genomic annotations (LECIF). LECIF integrates thousands of human and mouse functional genomic annotations to learn a score of evidence for conservation at the functional genomics level. While LECIF leverages data from diverse sources, it does not require explicit matching of experiments across species by biological origin or type. To do so, LECIF trains an ensemble of neural networks that take human and mouse regions aligning at the sequence level as positive pairs and randomly matched human and mouse regions that do not align to each other but somewhere else in the other species as negative pairs. Each human or mouse region is characterized by a species-specific feature vector with thousands of values corresponding to ChromHMM chromatin state annotations, signals from RNA-seq experiments, and peak calls from DNase-seq, transcription factor ChIP-seq, histone mark ChIP-seq, and CAGE experiments. We used the trained classifier to provide a LECIF score for all aligning regions. The LECIF score is highly predictive of pairs of regions that align at the sequence level, even when controlling for properties of regions that align in general. The score captures correspondence of biologically similar annotations between human and mouse, even though LECIF was not explicitly given such information. While the LECIF score is moderately correlated with sequence constraint scores, it captures distinct information on conserved properties. The LECIF score is higher in regions previously shown to have similar phenotypic properties in human and mouse at the genetic and epigenetic level.
We expect the human-mouse LECIF score will be an important resource for studies using mouse as a model organism.
- Jiyoung Lee, Virginia Tech, United States
- Shuo Geng, Virginia Tech, United States
- Liwu Li, Virginia Tech, United States
- Song Li, Virginia Tech, United States
Short Abstract: Monocytes are a key innate immune cell type modulating diverse host inflammatory responses. Subclinical doses of LPS (SD-LPS) are known to cause low-grade inflammation in monocytes, which could lead to inflammatory diseases including atherosclerosis and metabolic syndrome. Sodium 4-phenylbutyrate (4-PBA) is a potential therapeutic compound which can reduce the inflammation caused by SD-LPS. In this study, we aim to understand the gene regulatory networks of monocytes under low-grade inflammation and the mechanism of action for 4-PBA by integrating single cell RNA sequencing (scRNAseq), transcription factor binding motifs and ATAC-seq data using machine learning. We have generated scRNAseq data from mouse monocytes treated with PBS, SD-LPS, 4-PBA, and SD-LPS + 4-PBA and identified 11 clusters in the single-cell RNA-seq data from these four conditions. A machine learning method based on guided regularized random forest (GRRF) and feature selection was developed to select the best candidate TFs that are involved in this immune response. Our method achieved high auROC, auPRC, and F1 scores on the testing dataset and outperformed traditional enrichment-based methods. In particular, among 531 candidate TFs, our method achieves an auROC of 0.961 with only 10 motifs for one of the 11 clusters. Our method is particularly efficient in selecting a few candidate genes to explain the observed expression pattern. For example, our GRRF method achieved auROC=0.90 using only three TFs whereas the enrichment method required seven TFs to achieve the same performance. Finally, we found two novel subpopulations of monocytes in response to LD-LPS and we confirmed our analysis using independent flow cytometry experiments. Our results suggest our new machine learning method can select candidate regulatory genes as potential targets for developing new therapeutics against low-grade inflammation.
- Yang Yang, Carnegie Mellon University, United States
- Yuchuan Wang, Carnegie Mellon University, United States
- Jian Ma, Carnegie Mellon University, United States
Short Abstract: DNA replication in eukaryotic cells duplicates the genome during cell division with a highly regulated temporal order. Proper replication timing (RT) control is of vital importance to maintain the composition of the epigenome (such as 3D genome organization) and gene transcription. However, our understanding of the genomic sequence determinants regulating DNA replication timing remains surprisingly limited. A major algorithmic challenge is to delineate a series of potential sequence determinants in shaping the RT programs over large-scale sequence domains. Here we develop a new method, named CONCERT, to simultaneously predict RT from sequence features and identify genomic sequence elements that modulate RT in a genome-wide manner. CONCERT integrates two functionally cooperative modules, a selector and a predictor, which are trained jointly. The selector module performs importance estimation-based subset sampling of the genomic sequences to detect predictive elements. Utilizing sequence importance estimation from the selector, the predictor module incorporates the bi-directional recurrent neural networks and the self-attention mechanism to achieve selective learning of long-range spatial dependencies across genomic loci and context-aware feature representation learning of genomic sequences. The model also employs a hierarchical structure for capturing genomic context information at different scales. We applied CONCERT to predict RT in mouse embryonic stem cells (mESCs) and human cell types. Using only the genomic sequences as input, CONCERT reaches above 0.90 Pearson correlation coefficients (PCCs) for RT prediction in mESCs and 0.80-0.88 PCCs in seven human cell types. Importantly, the identified predictive genomic elements exhibit strong correlations with specific types of genomic features including repetitive elements and cis-regulatory elements, revealing sequence properties that may regulate RT. 
In particular, each of the five early replication control elements (ERCEs) in mESCs that were experimentally validated through CRISPR-mediated deletions in a recent study (Sima et al., Cell 2019) can be reliably identified by our method. Furthermore, by applying to multiple human cell types, CONCERT delineated conserved and cell type-specific sequence elements that may play key roles in RT regulation. Taken together, CONCERT provides a generic interpretable machine learning framework for predicting large-scale genomic profiles based on sequence features and provides new insights into the potential sequence determinants of the DNA replication timing program.
- Da-Inn Lee, University of Wisconsin-Madison, United States
- Sushmita Roy, University of Wisconsin-Madison, United States
Short Abstract: Three-dimensional (3D) genome organization, which determines how the DNA is packaged inside the nucleus, has emerged as a key regulatory mechanism of cellular processes. High-throughput chromosomal conformation capture (Hi-C) technologies have enabled the study of 3D genome organization by experimentally measuring interactions among genomic regions in 3D space. Analysis of Hi-C data has revealed higher-order organizational units at multiple resolutions: chromosomal territories, compartments, and topologically associating domains (TADs). Changes or disruptions to such structures have been associated with disease, developmental, and evolutionary processes. Therefore, a key problem is to systematically detect higher-order structural changes across Hi-C datasets from multiple conditions. Existing methods to detect changes in 3D genome organization either do not model higher-order structural units, are specialized to one type of unit (e.g., TADs), or only compare pairs of Hi-C datasets. We address these limitations with Tree-structured Graph-regularized Integrated Factorization (TGIF), a new multi-task Non-negative Matrix Factorization (NMF) approach. TGIF makes use of complex tree-structured relationships among multiple Hi-C datasets such that closely related tasks, one for each Hi-C matrix, have similar lower-dimensional factors. The factors can be further constrained with task-specific graph regularization and are used to extract clusters of genomic regions with dynamically changing interaction profiles across tasks. We applied TGIF to simulated data and to real Hi-C data from cancer cell lines and a mouse neural development process. TGIF effectively recovers ground-truth clusters in simulated data even with a large amount of noise and sparsity.
When applied to genome-wide Hi-C matrices from karyotypically normal hematopoietic stem and progenitor cells (HSPC) and two chronic myelogenous leukemia (CML) cell lines (K562 and KBM7), TGIF detects the Philadelphia translocation, a large reciprocal translocation between chr9 and chr22 used in the diagnosis of CML. In per-chromosome Hi-C matrices from three cell states during mouse neural development (embryonic stem cell, neural progenitors, and cortical neurons), TGIF is able to identify compartmental switches as well as local TAD shifts accompanying change in nearby gene expression. Taken together, TGIF provides a powerful multi-task framework to study the dynamics and context-specificity of 3D genome organization.
- Minjun Park, Baylor College of Medicine, United States
- Salvi Singh, Baylor College of Medicine, United States
- Francisco Grisanti, Baylor College of Medicine, United States
- Hassan Samee, Baylor College of Medicine, United States
Short Abstract: Predicting a sequence’s enhancer activity, defined as the extent to which it changes a gene’s expression, is a fundamental objective of regulatory genomics. We expect an enhancer activity model to reveal at least two pieces of mechanistic information. First, it should identify the putative binding sites for regulatory transcription factors (TFs) in the modeled sequences. Secondly, it should infer the corresponding TFs’ regulatory effects and integrate them to model the sequences’ enhancer activities. Unfortunately, human regulatory genomics still lacks models that meet the above expectations. Convolutional neural networks (CNNs) show high accuracy, and their post hoc analysis reveals position weight matrices that characterize potential TF binding sites in the sequences. However, interpreting the deep architectures in a biophysical manner remains difficult. It is also unclear if a biophysical model could use these post hoc discovered motifs toward similar accuracy. Other models in this realm leverage features, such as k-mers, evolutionary conservation, GC-content, and epigenetic features, that are difficult to relate to TF-DNA binding or TF’s regulatory effect on enhancer activity. To address this gap, we propose MuSeAM (Multinomial CNNs for Sequence Activity Modeling), a CNN model that learns convolutions as multinomial distributions over the four nucleotides. The multinomial convolutions are directly interpretable (without post hoc analysis) as motifs of TF-DNA binding specificity that have been classically used in biophysical modeling of genomic sequences. Convolving a sequence with these multinomial convolutions gives us likelihood terms, which we use in a linear regression model to fit the sequence’s enhancer activity. The model’s linear coefficients represent the regulatory effects and strength of the TFs, yielding an overall mechanistically interpretable enhancer activity model. 
MuSeAM is customizable; one can replace linear regression with more mechanistic statistical thermodynamic functions. We applied MuSeAM on data from a massively parallel reporter assay of human genome sequences in human liver cells. MuSeAM achieved state-of-the-art performance in modeling this data, while also learning multinomial convolutions that recapitulate motifs of TFs known to be active in the human liver, their known regulatory roles, and motif co-occurrence patterns that reflect known TF-TF interactions. The trained MuSeAM model showed high generalization on unseen tasks such as predicting chromatin accessibility and prioritizing functional single-nucleotide polymorphisms (SNPs). The prioritized SNPs are enriched for low minor allele frequencies and occur within promoters and enhancers, highlighting the model’s credibility. We believe the multinomial convolution approach will be transformative in building mechanistically interpretable models of genomic sequences.
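The core computation described above, scanning a sequence with a filter whose positions are multinomial distributions over the four nucleotides and feeding the resulting likelihood terms into a linear model, can be sketched as follows. The filter values, the uniform background, the use of log-likelihood ratios, the max pooling over windows, and the regression weight are all assumptions made for illustration; they are not taken from MuSeAM itself.

```python
import math

# Hypothetical length-3 multinomial filter: each position is a probability
# distribution over A, C, G, T (each row sums to 1). It favors the motif "AGC".
FILTER = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
BACKGROUND = 0.25  # uniform nucleotide background (an assumption)

def window_scores(seq, filt):
    """Log-likelihood ratio (filter vs. background) for every window of the sequence."""
    k = len(filt)
    scores = []
    for start in range(len(seq) - k + 1):
        window = seq[start:start + k]
        scores.append(sum(math.log(filt[i][b] / BACKGROUND)
                          for i, b in enumerate(window)))
    return scores

def predict_activity(seq, filters, weights, bias=0.0):
    """Linear model on the best window score of each filter (max pooling assumed)."""
    features = [max(window_scores(seq, f)) for f in filters]
    return bias + sum(w * x for w, x in zip(weights, features))

seq = "TTAGCTT"
scores = window_scores(seq, FILTER)
best_start = scores.index(max(scores))   # window that best matches the filter
activity = predict_activity(seq, [FILTER], weights=[0.5])
```

Because each filter row is already a probability distribution, the filter can be read directly as a motif, and the sign and size of its regression weight can be read as the direction and strength of the corresponding regulatory effect.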
- Ahmed Abbas, The Jackson Laboratory for Genomic Medicine, United States
- Hideyuki Oguro, Department of Cell Biology, University of Connecticut School of Medicine, United States
- Sheng Li, The Jackson Laboratory for Genomic Medicine, United States
Short Abstract: Single-cell sequencing assay for transposase-accessible chromatin (scATAC-seq) measures chromatin accessibility at the single-cell level. Analyzing scATAC-seq data reveals much knowledge about cell-to-cell variability through variation in genomic sites' accessibility among different cells. On the other hand, single-cell RNA sequencing (scRNA-seq) measures the gene expression levels at the single-cell level. Both types of data suffer from false zero counts (technical dropout entries) to different degrees. More specifically, scATAC-seq data can detect only 1–10% of expected accessible peaks per cell vs. 10–45% of expected expressed genes per cell for scRNA-seq data. Thus, we can consider scRNA-seq data to be of better quality than scATAC-seq data. However, both types of data are closely related because of the positive correlation between the gene expression levels of specific genes (from the scRNA-seq data) and the open chromatin status of their corresponding peaks (in the scATAC-seq data). Thus, we exploit this relation between the two types of data and use the better-quality data (scRNA-seq data) to improve the scATAC-seq data quality. Here, we propose denoising scATAC-seq data (recovering its technical dropout entries) through integration with scRNA-seq data. We integrate the two types of data using Seurat3, which uses canonical correlation analysis (CCA) and mutual nearest neighbors (MNNs) to find cells that share the same biological nature from the two datasets. After finding the matching cells, and according to the matching score, we detect the peaks corresponding to the significantly expressed genes. Then, we impute the entries of those peaks in the count matrix if the original entries are closed (‘0’). We tested our method on a dataset of six scATAC-seq libraries of mouse bone marrow hematopoietic stem cells (HSCs) and their corresponding scRNA-seq libraries.
Our approach improved the quality of integration between scRNA-seq and scATAC-seq libraries, both in the accuracy of transferring cell type labels from the reference (scRNA-seq) dataset to the query (scATAC-seq) dataset and in the balance of cell numbers from the libraries of the two datasets within clusters. Also, a larger number of detected marker genes were found close to the differentially accessible peaks in the libraries denoised by our method. Furthermore, our method resulted in denoised libraries whose pseudotime trajectories match the known hematopoietic lineage tree. Altogether, we believe that our approach enhances the quality of scATAC-seq libraries and consequently improves the accuracy and usefulness of the downstream analyses performed on them.
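The imputation step described in this abstract can be illustrated with a minimal sketch. It assumes the cell matching (done with Seurat3 in the abstract) has already produced, for each scATAC-seq cell, the expression profile of its matched scRNA-seq cell. The function name, the `peak_to_gene` mapping, and the fixed expression threshold are all hypothetical stand-ins; the abstract does not specify how "significantly expressed" or the matching score are applied.

```python
def impute_dropouts(atac_counts, matched_expr, peak_to_gene, expr_threshold=1.0):
    """Fill closed ('0') peak entries supported by expression in the matched cell.

    atac_counts  : list of rows, one per scATAC-seq cell, entries 0 or 1
    matched_expr : list of dicts, one per scATAC-seq cell, mapping
                   gene name -> expression in the matched scRNA-seq cell
    peak_to_gene : dict mapping peak column index -> linked gene name
    expr_threshold is an assumed cutoff standing in for a significance test.
    """
    imputed = [row[:] for row in atac_counts]  # leave the input untouched
    for i, row in enumerate(imputed):
        for j, gene in peak_to_gene.items():
            if row[j] == 0 and matched_expr[i].get(gene, 0.0) >= expr_threshold:
                row[j] = 1
    return imputed


# Toy data: 2 cells x 3 peaks; peaks 0 and 2 are linked to genes.
atac = [[0, 1, 0], [0, 0, 0]]
expr = [{"Gata1": 2.5}, {"Gata1": 0.0, "Spi1": 3.0}]
links = {0: "Gata1", 2: "Spi1"}
denoised = impute_dropouts(atac, expr, links)
```

Only zero entries are ever changed, so observed open peaks are preserved and unlinked peaks stay as measured.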
- Anthony Schmitt, Arima Genomics, United States
- Frank Boellmann, Arima Genomics, United States
- Jon Belton, Arima Genomics, United States
- Derek Raid, Arima Genomics, United States
- Steven Mac, Arima Genomics, United States
- Xiang Zhou, Arima Genomics, United States
Short Abstract: Chromosome conformation capture technologies developed by Arima Genomics, such as Arima-Hi-C+, Arima-HiChIP, and Arima Capture-HiC, are powerful approaches for profiling 3D genome structure and providing valuable insights into the mechanisms of gene regulation in human disease. Disease-specific chromosome folding patterns have been implicated across numerous human pathologies, such as cancer and have been valuable for the functional interpretation of non-coding disease associated variants (GWAS). The ability to physically connect extremely distant regions of the same chromosome without the need to isolate intact ultra-long DNA molecules can also be leveraged for applications such as the phased de-novo assembly of diploid genomes, the validation and phasing of assembled contigs and the discovery and phasing of single nucleotide, InDel and structural variants. Arima’s R&D efforts to improve coverage uniformity have resulted in the broad utilization of Arima sample prep solutions across various assembly projects, like the Vertebrate Genome Project and the Darwin Tree of Life. We will showcase the utility of high-coverage Arima-HiC data for chromosome-scale scaffolding of vertebrate genomes and report our ongoing analyses of sample collection and preservation methodologies. We will also highlight the results of a recent collaborative publication that demonstrated utility of our technology towards haplotype-resolved chromosome-scale de novo assembly of 3 human samples PGP1, HG002 and NA12878 with contig NG50 of up to 25 Mb and scaffold NG50 of up to 130 Mb. Around 99.5% of the heterozygous loci could be phased to over 98% accuracy, outperforming other approaches in terms of both contiguity and phasing completeness. Our reproducible and flexible kit platform covering genome-wide (Hi-C) and targeted (HiChIP, Capture-HiC) genome structures is based on our core proximity ligation chemistry. 
The 6-hour protocol improves analytical sensitivity through rapid, multiple restriction enzyme Hi-C chemistry. This improved technology detects more chromatin folding features, such as chromatin loops, at significantly reduced sequencing depth. The Arima-HiC kits have been widely validated through scientific publications across research domains including oncology, cardiology, neurobiology, and immunology. The optimized targeted chromosome conformation capture technologies from Arima Genomics significantly enhance the ability of clinical and translational researchers to study pathological misregulation of chromosome conformation in fine detail and at reduced cost. We will present our customer-validated H3K27ac and H3K4me3 HiChIP protocols, demonstrating reproducible detection of long-range interactions at active promoters, as well as data from high-resolution (500 bp) Capture-HiC experiments targeting oncogenes and tumor suppressors in a panel of 10 cancer and non-cancer samples.
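The abstract quotes contig and scaffold NG50 values. For readers unfamiliar with the metric, NG50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the estimated genome size. A minimal sketch follows; the function name and toy lengths are illustrative and are not taken from the cited assemblies.

```python
def ng50(lengths, genome_size):
    """Return NG50 for a list of contig/scaffold lengths, or None if the
    assembly covers less than half of the estimated genome size."""
    half = genome_size / 2
    covered = 0
    for length in sorted(lengths, reverse=True):
        covered += length
        if covered >= half:
            return length
    return None


# Toy assembly: five sequences against an estimated genome size of 200.
toy_ng50 = ng50([50, 40, 30, 20, 10], genome_size=200)
```

Unlike N50, which uses the total assembly length as the denominator, NG50 uses the genome size estimate, so it does not reward assemblies for simply being incomplete.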
- Ziqi Zhang, Georgia Institute of Technology, United States
- Xiuwei Zhang, Georgia Institute of Technology, United States
Short Abstract: The availability of large-scale single-cell RNA-Sequencing (scRNA-Seq) data allows researchers to study the underlying mechanisms that drive the change of cells within dynamic processes such as stem cell differentiation and cancer cell development. Trajectory inference (TI) methods are often used to infer the trajectory of such a dynamic process, namely, to assign developmental lineages and pseudo-time to every cell. Most current TI methods infer cell developmental trajectories based on the transcriptome similarity between cells, using only scRNA-Seq data. These methods have two disadvantages: 1) a method is often restricted to certain trajectory structures such as trees, and complex structures such as cycles or mixtures of different topologies are hard to reconstruct; 2) the direction of the trajectory cannot be inferred, and the root cell is often required as a prior. The recent surge of RNA-velocity estimation methods has opened up a new perspective for trajectory inference of cells. RNA-velocity provides a short-term prediction of the gene expression profile in each cell by incorporating unspliced mRNA counts, and thus helps compensate for the loss of developmental direction information in scRNA-Seq data. We present CellPaths, a single-cell TI method that infers multiple high-resolution developmental trajectories by integrating RNA velocity information. CellPaths has the following advantages: 1) using the direction information from RNA-velocity, CellPaths can find multiple high-resolution trajectories between cells instead of the single trajectory found by traditional TI methods; 2) CellPaths allows for automatic root cell detection; 3) CellPaths does not require the trajectory structure to be of any specific topology. We evaluate CellPaths on both simulated and real datasets. 
We ran CellPaths on simulated datasets with complex trajectory structures, including trees with a high number of branches and multiple cycles, and found that CellPaths performs significantly better than traditional TI methods like Slingshot, especially in separating close lineages. We further applied CellPaths to the dentate gyrus, pancreatic endocrinogenesis, and human forebrain glutamatergic neurogenesis datasets, and found that CellPaths is able to find not only the main differentiation lineages but also multiple small pathways that are biologically meaningful.
- Elysia Saputra, University of Pittsburgh, United States
- Matthew MacDonald, University of Pittsburgh, United States
- Maria Chikina, University of Pittsburgh, United States
Short Abstract: Schizophrenia is a debilitating psychiatric disorder with approximately 1% lifetime risk globally. Dendritic spine loss is known to be strongly associated with schizophrenia, although the exact causal mechanism is subject to ongoing debate. While the effect is classically attributed to mature spine loss, our prior work characterizing the distribution of spine densities supports a hypothesis that instead spineogenesis and spine stabilization are dysregulated. In this work, we integrate spine density measurements with multi-modal molecular measurements including homogenate/synaptic proteomics and genotype from 40 schizophrenia patients and matched controls. Using state-of-the-art factor analysis, multivariate genetic approaches, and graphical modeling methods, we infer an undirected graphical model of the molecular mechanism regulating spine density. Overall, the graphical modeling results further support spine turnover as the underlying mechanism. Our analysis also highlights upstream metabolic pathways regulating spine density, such as the mTOR pathway and the possible role of aldehyde oxidase 1, which to our knowledge has not been associated with schizophrenia but has been implicated in amyotrophic lateral sclerosis, a motor neurone disease.
- Ha Vu, University of California, LA, United States
- Zane Koch, University of California, LA, United States
- Petko Fiziev, Illumina Inc, United States
- Jason Ernst, University of California, LA, United States
Short Abstract: Genome-wide maps of epigenetic modifications are a powerful resource for cell-type-specific genome annotations, and have become available for multiple different epigenetic marks in many different cell types and conditions. Maps of multiple epigenetic marks have been integrated into widely used cell-type-specific ‘chromatin state’ annotations of the genome. In many cases, given a group of biologically similar samples, it can be desirable to have a single chromatin state annotation that summarizes the annotations of all samples in the group. However, determining an effective summary annotation that best represents a set of chromatin state annotations from multiple samples is challenging, since there exists no explicit notion of distance between chromatin states, while in practice some chromatin states carry more similar biological implications than others. To address this challenge, we developed a novel method, CSREP, that takes in a set of chromatin state annotations from a group of biologically similar samples and probabilistically estimates the most representative chromatin state map for the group. CSREP makes the modeling assumption that each sample in the group has the same underlying chromatin state, but that noise could lead to a different chromatin state actually being observed for each sample. CSREP estimates the summary by training a logistic regression classifier to predict the chromatin state assignment of each biological sample, given the equivalent state maps from all other samples, and then averaging the prediction probabilities. This enables implicitly learning a notion of distance among chromatin states. Additionally, CSREP can be applied to different groups of samples, and the difference between CSREP’s probability assignments for two groups can help identify genomic locations with differential chromatin state assignments. 
We designed a permutation-based testing framework to help derive the statistical significance of those differences, and declare differential chromatin regions between the two groups. We applied CSREP to different groups of chromatin state maps from the Roadmap Epigenomics project. We demonstrate advantages of CSREP compared to the baseline method of assigning the state with maximum frequency across samples to represent the group’s chromatin state at each genomic position. We also demonstrate how CSREP can effectively identify biologically relevant differences between groups of samples at a higher resolution or with greater power than previous approaches.
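For orientation, the baseline that CSREP is compared against (assigning the most frequent state across samples at each genomic position) can be sketched in a few lines. This is the baseline only, not CSREP itself. The state labels, the per-bin list representation, and the alphabetical tie-break are assumptions made for illustration.

```python
from collections import Counter


def majority_state_map(sample_maps):
    """Summarize per-sample chromatin state maps by majority vote.

    sample_maps: list of equal-length lists; sample_maps[s][b] is the state
    label of sample s at genomic bin b. Ties are broken by taking the
    alphabetically smallest label (an arbitrary but deterministic choice).
    """
    summary = []
    for states_at_bin in zip(*sample_maps):
        counts = Counter(states_at_bin)
        top = max(counts.values())
        summary.append(min(s for s, c in counts.items() if c == top))
    return summary


# Three hypothetical samples annotated over three genomic bins.
maps = [
    ["TSS", "Enh", "Quies"],
    ["TSS", "Enh", "Het"],
    ["TxWk", "Quies", "Het"],
]
group_map = majority_state_map(maps)
```

The vote treats all states as equally dissimilar, which is the limitation the abstract raises: it has no notion that some states are biologically closer than others.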
- Rydberg Supo Escalante, Columbia University, Peru
- David Requena, Rockefeller University, Peru
- Mirko Zimic, Universidad Peruana Cayetano Heredia, Peru
Short Abstract: Pyrazinamide (PZA) is one of the most important drugs used in first- and second-line treatments against tuberculosis to eliminate bacteria in the latent state, reducing relapse incidence. Pyrazinoic acid (POA) is the active form of PZA, generated by hydrolysis of PZA mediated by the enzyme pyrazinamidase (PZase), encoded by the gene pncA. Its antibiotic effect requires an acidic environment, as indicated by its high in vivo but poor in vitro sterilizing activity, potentially due to the acidic environment experienced by M. tuberculosis inside granulomas. The molecular targets of this drug are still unknown, and are thought to be more than one. The current mechanism, based on POA accumulation, states that PZA enters the cell through passive diffusion, is hydrolyzed to POA (mediated by PZase), and is expelled to the extracellular environment by efflux pumps. Outside, expelled POA is protonated to HPOA in the acidic environment and re-enters the bacterium via a membrane potential gradient. In this way, the external acidic environment helps to recover and accumulate intracellular POA, setting up a cycle that leads to a slight acidification of the cytoplasm, which together with the intracellular accumulation of POA acts on several potential targets and metabolic pathways. However, cytoplasmic acidification is not strictly required. Previous studies have shown that overexpression of pncA or efflux pump inhibitors eliminate the requirement of an acidic environment for susceptibility in vitro. To better understand the contribution of an acidic extracellular environment to the internal accumulation of POA, and to study alternative scenarios of resistance/susceptibility, we modelled the PZA/POA metabolic pathway using a system of nonlinear differential equations, fitting experimental parameters. 
Our results indicate that, in equilibrium (laboratory) conditions, POA accumulation is independent of external pH and depends only on the ratio between the rates of POA production and efflux. In addition, the acidic environment contributes significantly to the internal accumulation of total-POA (defined as POA+HPOA) only when the ratio of efflux rate to diffusion of external POA and HPOA is greater than 1. Interestingly, the rate of POA production could help to increase the total-POA accumulation independently of the POA efflux rate and external pH. In addition, although low POA efflux rates accumulate more internal total-POA than an acidic environment, the cytoplasm will not be acidified, suggesting that a reduction of internal pH is not sufficient to produce the bactericidal effect. Finally, we performed simulations to evaluate possible consequences for growth rate under different PZA concentrations.
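The first finding, that equilibrium POA accumulation depends only on the ratio of production to efflux, can be illustrated with a deliberately simplified one-compartment toy model. This is not the authors' nonlinear system: it assumes constant (zero-order) POA production `k_prod` and first-order efflux `k_efflux`, so that dP/dt = k_prod - k_efflux * P with steady state P* = k_prod / k_efflux. All names and parameter values are hypothetical.

```python
def simulate_internal_poa(k_prod, k_efflux, p0=0.0, dt=0.01, steps=5000):
    """Forward-Euler integration of dP/dt = k_prod - k_efflux * P.

    Toy assumption: external pH and HPOA re-entry are ignored entirely,
    so this only illustrates the production/efflux ratio at steady state.
    """
    p = p0
    for _ in range(steps):
        p += dt * (k_prod - k_efflux * p)
    return p


# Two parameter sets with the same ratio reach the same steady state.
level_a = simulate_internal_poa(k_prod=2.0, k_efflux=0.5)
level_b = simulate_internal_poa(k_prod=4.0, k_efflux=1.0)
```

In this toy setting, doubling both rates leaves the steady-state level unchanged, while lowering efflux alone raises it, which mirrors the qualitative statement in the abstract.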
- Wennan Chang, Purdue University, United States
- Norah Alghamdi, Indiana University, United States
- Xiaoyu Lu, Indiana University, United States
- Sha Cao, Indiana University, Department of Biostatistics, United States
- Chi Zhang, Indiana University School of Medicine, United States
Short Abstract: Metabolic heterogeneity and the metabolic interplay between cells and their microenvironment are known to be significant contributors to disease treatment resistance. Unfortunately, our understanding of intra-tissue metabolic heterogeneity and cooperation phenomena among cell populations is quite limited in the absence of a mature single-cell metabolomics technology. To mitigate this knowledge gap, we developed a novel computational method, scFEA (single cell Flux Estimation Analysis), to infer the single-cell fluxome from single-cell RNA-sequencing (scRNA-seq) data. scFEA is empowered by a human metabolic map comprehensively reorganized into focused metabolic modules, a novel probabilistic model that leverages flux balance constraints on scRNA-seq data, and a novel graph neural network based optimization solver. The intricate information cascade from transcriptome to metabolome is captured using multi-layer neural networks to recapitulate the non-linear dependency between enzymatic gene expression and reaction rates. We experimentally validated scFEA by generating an scRNA-seq dataset with matched metabolomics data on cells under perturbed oxygen and genetic conditions. Application of scFEA to this dataset demonstrated that the predicted flux and metabolic imbalance are consistent with the observed variation of metabolites in the matched metabolomics data. We also applied scFEA to publicly available single-cell melanoma and head and neck cancer datasets, and discovered different metabolic landscapes between cancer and stromal cells. The cell-wise fluxome predicted by scFEA empowers a series of downstream analyses, including identification of metabolic modules or cell groups that share common metabolic variations, sensitivity evaluation of enzymes with regard to their impact on the whole metabolic flux, and inference of cell-tissue and cell-cell metabolic communications.
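To make the notion of a flux balance constraint concrete, the sketch below computes, for one cell, the net imbalance of each metabolite given module fluxes, and a squared-imbalance penalty of the kind an optimizer could minimize. This is an assumption-laden illustration rather than scFEA's actual loss or network: the metabolites, module names, coefficients, and the plain sum-of-squares form are all hypothetical.

```python
def flux_imbalance(stoichiometry, fluxes):
    """Net production minus consumption for each metabolite.

    stoichiometry: dict metabolite -> dict module -> coefficient
                   (+1 if the module produces it, -1 if it consumes it)
    fluxes       : dict module -> estimated flux in one cell
    """
    return {
        metabolite: sum(coef * fluxes[module] for module, coef in row.items())
        for metabolite, row in stoichiometry.items()
    }


def balance_loss(stoichiometry, fluxes):
    """Sum of squared imbalances; zero when every metabolite is balanced."""
    return sum(v * v for v in flux_imbalance(stoichiometry, fluxes).values())


# Hypothetical three-module network around pyruvate and lactate.
S = {
    "pyruvate": {"glycolysis": 1, "tca_entry": -1, "ldh": -1},
    "lactate": {"ldh": 1, "export": -1},
}
cell_fluxes = {"glycolysis": 2.0, "tca_entry": 1.5, "ldh": 0.5, "export": 0.25}
imbalance = flux_imbalance(S, cell_fluxes)
loss = balance_loss(S, cell_fluxes)
```

Here pyruvate is balanced while lactate accumulates, so the residual imbalance points at the metabolite whose abundance would be expected to change.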
- Peter T. Nguyen, Cedars-Sinai Medical Center, United States
- Simon G. Coetzee, Cedars-Sinai Medical Center, United States
- Daniel L. Lakeland, Lakeland Applied Sciences LLC, United States
- Dennis J. Hazelett, Cedars-Sinai Medical Center, United States
Short Abstract: Cancer is an organism-level disease, impacting processes from cellular metabolism and the microenvironment to systemic immune response. Nevertheless, efforts to distinguish overarching mutational processes from interactions with the cell of origin for a tumor have seen limited success, presenting a barrier to individualized medicine. Here we present a novel, pathway-centric approach, extracting somatic mutational profiles within and between tissues, largely orthogonal to cell of origin, mutational burden, or stage. Known predisposition variants are equally distributed among clusters, and largely independent of molecular subtype. Prognosis and risk of death vary jointly by cancer type and cluster. Analysis of metastatic tumors reveals that differences are largely cluster-specific and complementary, implicating convergent mechanisms that combine familiar driver genes with diverse low-frequency lesions in tumor-promoting pathways, ultimately producing distinct molecular phenotypes. The results shed new light on the interplay between organism-level dysfunction and tissue-specific lesions.
- Shengnan Sun, James J. Peters VA Medical Center; Department of Neuroscience, Icahn School of Medicine at Mount Sinai, United States
- Zhaoyu Wang, James J. Peters VA Medical Center; Department of Neuroscience, Icahn School of Medicine at Mount Sinai, United States
- Yongchao Ge, Department of Neurology, Icahn School of Medicine at Mount Sinai, United States
- Jeffrey Nemes, Walter Reed Army Institute of Research, Silver Spring, MD, United States
- Christina LaValle, Walter Reed Army Institute of Research, Silver Spring, MD, United States
- Angela Boutte, Walter Reed Army Institute of Research, Silver Spring, MD, United States
- Walter Carr, Walter Reed Army Institute of Research; Oak Ridge Institute for Science and Education, United States
- Gary Kamimori, Walter Reed Army Institute of Research, Silver Spring, MD, United States
- Fatemeh Haghighi, James J. Peters VA Medical Center; Department of Neuroscience, Icahn School of Medicine at Mount Sinai, United States
Short Abstract: Background: Injuries from exposure to blast explosions rose dramatically during the Iraq and Afghanistan wars due to increased use of IEDs, resulting in blast-related neurotrauma. To investigate the effect of blast on gene regulatory networks, we use time-series gene expression data from military “breachers” exposed to controlled, low-level blast explosives during training. Methods: Blood samples were collected from male participants (age 30.2 ± 7.4 years) during a 3-day breacher training at a U.S. Army training site: baseline (day 1), pre- and post-breaching (day 2), and follow-up (day 3). RNA-seq was performed for every time point. RNA-seq read counts were transformed to logCPM and fitted with a linear model that included the variable of interest (i.e., pre-post breaching operations) and was adjusted for subject effect using limma. Potential batch effects were eliminated using surrogate variable analysis. To investigate whether exposure to explosive blast affects gene expression in a coordinated fashion, network analysis was performed via multiscale embedded gene co-expression network analysis (MEGENA) using variably expressed genes (sd >= 0.25) across all subjects and time points. Principal component (PC) analysis was performed on each network, and a linear model was fitted on the 1st PC across timepoints for pre-post and pre-follow-up comparisons separately. Results & Conclusions: To investigate whether prior exposure load, including history of traumatic brain injury (TBI) or cumulative career breaching in military service, impacts acute responsivity to an explosion, interaction models of prior exposure load by time (pre-post breaching) were fitted. Compared to those with no history of TBI, those with prior TBI history showed blunted acute gene expression responsivity (i.e., no significant alterations) following exposure to blast explosives. 
Network analyses identified 5 unnested-networks with significant interaction between TBI/Breaching history and time (pre-post/follow-up, p<0.05) that included genes involved in inflammation and adaptive immunity. Semaphorin 7A (a hub gene within one of these networks) is a membrane associated protein that plays an important role in mediating the connection between the CNS and the immune system. Emerging data from both experimental and human studies have shown that a number of Sema loci are involved in response to CNS injury. Although SEMA7A has not been previously implicated in CNS injury, it shows increased expression following blast sub-acutely in subjects with prior history of TBI. These findings show the power of network approaches to identify transcriptional alterations directly in response to exposure to explosives, which is translationally significant in furthering our understanding of blast-related neurotrauma.
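The preprocessing described in the Methods (logCPM transformation followed by selection of variably expressed genes with sd >= 0.25) can be sketched as follows. The study used limma in R; this pure-Python version is only an illustration. The 0.5 pseudo-count with a library-size offset of 1 is one common convention for log2 counts per million and is an assumption here, as are the function names and toy counts.

```python
import math
from statistics import stdev


def log_cpm(counts):
    """log2 counts-per-million with a 0.5 pseudo-count (assumed convention).

    counts: dict gene -> list of raw read counts, one entry per sample.
    """
    n_samples = len(next(iter(counts.values())))
    lib_sizes = [sum(row[s] for row in counts.values()) for s in range(n_samples)]
    return {
        gene: [math.log2((c + 0.5) / (lib_sizes[s] + 1.0) * 1e6)
               for s, c in enumerate(row)]
        for gene, row in counts.items()
    }


def variable_genes(log_expr, min_sd=0.25):
    """Genes whose logCPM standard deviation across samples is >= min_sd."""
    return sorted(g for g, values in log_expr.items() if stdev(values) >= min_sd)


# Toy counts for three genes in three samples (library size 1000 each).
toy_counts = {
    "flat": [100, 100, 100],
    "up": [100, 400, 100],
    "rest": [800, 500, 800],
}
toy_log = log_cpm(toy_counts)
kept = variable_genes(toy_log)
```

Filtering on variability before network construction removes genes that cannot contribute to co-expression structure, which keeps the downstream network analysis tractable.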
- Hao Wu, University of Connecticut, United States
- Disheng Mao, University of Connecticut, United States
- Yuping Zhang, University of Connecticut, United States
- Zhiyi Chi, University of Connecticut, United States
- Michael Stitzel, The Jackson Laboratory for Genomic Medicine, United States
- Zhengqing Ouyang, University of Massachusetts, United States
Short Abstract: Traditional bulk RNA-sequencing of human pancreatic islets mainly reflects transcriptional response of major cell types. Single-cell RNA sequencing technology enables transcriptional characterization of individual cells, and thus makes it possible to detect cell types and subtypes. To tackle the heterogeneity of single-cell RNA-seq data, powerful and appropriate clustering is required to facilitate the discovery of cell types. In this paper, we propose a new clustering framework based on a graph-based model with various types of distances. We take the compositional nature of single-cell RNA-seq data into account and employ log-ratio transformations. The practical merit of the proposed method is demonstrated through the application to the centered log-ratio transformed single-cell RNA-seq data for human pancreatic islets. The practical merit is also demonstrated through comparisons with existing single-cell clustering methods.
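The centered log-ratio (CLR) transformation mentioned in the abstract maps a composition to log values centered on their mean, that is, clr(x)_i = log(x_i) - mean_j log(x_j). A minimal sketch for one cell's count vector is below; the pseudo-count used to handle zeros is an assumption, since the abstract does not say how zero counts are treated.

```python
import math


def clr(counts, pseudocount=1.0):
    """Centered log-ratio transform of one cell's counts.

    A pseudo-count (assumed, default 1.0) is added so that zeros are defined.
    With pseudocount=0 the input must be strictly positive.
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


cell = [0, 3, 12, 150]
transformed = clr(cell)
```

Without a pseudo-count the transform is invariant to rescaling the whole vector, which is why it suits compositional data where only relative abundances are informative.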
- Satvik Lolla, George Washington University, United States
Short Abstract: Prostate cancer (PCa) is one of the most widespread and deadly cancers among American men. Currently, patients who are suspected of having PCa are recommended to undergo a biopsy, an mpMRI scan, or an MRI scan. These scans are often not reliable and can have a low detection rate. Furthermore, pathologists often have trouble grading the aggressiveness of the prostate tumor. These tumors are classified by the Gleason Grade scale or the ISUP Group Grade scale, which ranks the aggressiveness of the tumor on a scale from 1 to 5, where higher numbers represent a more dangerous tumor. To help identify, segment, and grade prostate tumors, recent studies point to the use of machine learning or artificial intelligence. We present a novel method to both grade and localize prostate tumors in multiparametric magnetic resonance images (mpMRI). The method uses a convolutional neural network (CNN) to identify mpMRI images that contain tumors, a Faster region-based convolutional neural network (Faster RCNN) to segment the tumors, and a second CNN to grade the tumors as high-grade or low-grade. It performs at least as well as current machine learning techniques while using faster networks and requiring less data. Specifically, we used deep learning to segment the tumors, whereas previous studies had used machine learning, which is more computationally expensive than our method. Our CNN that identified whether mpMRI images had tumors or not had an accuracy of 98.7%. The Faster RCNN had a sensitivity of 0.972, a false positive rate of 0.263, and an area under the curve (AUC) of 0.9007. The CNN used to grade the tumors had an accuracy of 97.6%, which is similar to other machine learning models. In conclusion, these three models provide a novel approach to classifying and grading prostate tumors while requiring less data.
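The metrics reported above (accuracy, sensitivity, false positive rate) all derive from the four cells of a confusion matrix. A minimal sketch of those definitions follows; the counts in the example are invented for illustration and are not the study's data.

```python
def detection_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),          # true positive rate
        "false_positive_rate": fp / (fp + tn),  # 1 - specificity
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Hypothetical counts: 100 tumor images and 100 tumor-free images.
metrics = detection_metrics(tp=90, fp=20, tn=80, fn=10)
```

Reporting sensitivity together with the false positive rate matters because a detector can reach high sensitivity simply by flagging most images, which the false positive rate exposes.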
- Nina Baumgarten, Institute of Cardiovascular Regeneration, Frankfurt am Main and Saarland Informatics Campus, Saarbrücken, Germany
- Dennis Hecker, Institute of Cardiovascular Regeneration, Frankfurt am Main, Germany
- Sivarajan Karunanithi, Institute of Cardiovascular Regeneration, Frankfurt am Main and Saarland Informatics Campus, Saarbrücken, Germany
- Florian Schmidt, Genome Institute of Singapore, Singapore
- Markus List, Chair of Experimental Bioinformatics, TUM School of Life Sciences, Technical University of Munich, Germany
- Marcel H. Schulz, Institute of Cardiovascular Regeneration, Frankfurt am Main and Saarland Informatics Campus, Saarbrücken, Germany
Short Abstract: Understanding gene regulation requires an extensive examination of the impact of noncoding regions on target genes. GWAS studies show that a large part of genomic variants is found in those noncoding regions. However, their impact on gene regulation is often unknown, as it is challenging to accurately identify target genes whose expression is affected by these regulatory elements (REMs). REMs such as enhancers, repressors and promoters regulate the expression of genes by serving e.g. as binding sites for Transcription Factors (TFs). Enhancers can also be actively transcribed to bi-directional enhancer RNA. Identifying REMs is difficult, as there is no method yet to locate them with absolute certainty. Instead, indirect epigenomic indicators are used in different combinations, leading to a variety of REM annotation approaches. An additional challenge is to reliably identify the target genes of the REMs, which is an essential step in understanding their impact on gene expression. We developed the EpiRegio web server, a resource of REMs and their target genes. The underlying algorithm STITCHIT identifies REMs by analyzing variations in gene expression across samples in combination with chromatin accessibility profiles. This approach is unique in its way to annotate REMs as it defines REMs and their target genes simultaneously. Other methods look for REMs first and then make use of varying techniques to link them to target genes. Further, EpiRegio’s REMs were observed in relation to actual changes in gene expression, which potentially leads to a higher specificity. EpiRegio incorporates data for various human primary cell types and tissues, providing an integrated view of REMs in the genome. It allows the analysis of genes and their associated REMs, including the REM’s activity and its estimated cell type-specific contribution to its target gene’s expression. 
Moreover, it is possible to explore genomic regions for their regulatory potential and to investigate overlapping REMs, thereby dissecting regions of high epigenomic complexity. EpiRegio allows programmatic access through a REST API and is freely available at https://epiregio.de/.
- Anna Hendrika Cornelia Vlot, Berlin Institute for Medical Systems Biology, Germany
- Uwe Ohler, Berlin Institute for Medical Systems Biology, Germany
Short Abstract: The identification of informative features (i.e. genes and regulatory regions) is an essential step in single-cell data analytics. Current marker identification methods typically rely on the cluster assignments of cells. Clustering, in particular in developmental data, is non-trivial and potentially biologically arbitrary, and cluster interpretation is frequently based on prior knowledge. Methods also do not generally take cluster assignment uncertainties into account. To circumvent these issues, we developed SEMITONES (github.com/ohlerlab/SEMITONES; Single-cEll Marker IdentificaTiON by Enrichment Scoring), a principled method for cluster-free identification of informative features. The method consists of three steps. First, we identify a representative set of reference cells from the population. Next, for each feature, we quantify its enrichment in the reference cell neighbourhood using a linear regression framework to compute an enrichment score. Lastly, we test for significance of these enrichment scores with respect to an empirical null-distribution of enrichment scores for random gene expression profiles. We showcase the application of SEMITONES in healthy haematopoiesis data for 1) a robust alternative to highly variable gene selection, 2) the identification of individual marker genes and regulatory regions, and 3) gene set enrichment scoring for the construction of co-enrichment graphs to identify regulators of cell identity. SEMITONES identifies a smaller set of highly enriched features that captures the same biological variation as a larger set of highly variable features. These highly enriched features include markers for small cell populations like the eosinophil/mast cell/basophil lineage and plasma cells, which is of interest when selecting a set of genes for targeted scRNA-seq. Additionally, SEMITONES identifies markers for 1) global lineages, e.g. GATA2 for the myeloid lineage, 2) specific lineages, e.g. 
marker for the small pre-B, large pre-B and transitional B cells, and 3) markers for highly specialized cells like TNFRSF4 for Treg cells. Moreover, we use co-enrichment scores to construct co-enrichment networks in which we identify known regulatory modules, like components of the CD3-complex-CD8 interaction which is essential for T cell activation. Lastly, SEMITONES can be used for regulatory element identification in scATAC-seq data: given significantly enriched regions, we can annotate cell types and perform motif enrichment to identify the binding sites of transcription factors with known regulatory functions, like E2A, PU.1 and IRF8 in B cell development. In conclusion, SEMITONES identifies biologically meaningful marker genes and regulatory regions from scRNA-seq and scATAC-seq data without reliance on clustering.
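The second and third steps of the method (a regression-based enrichment score followed by a test against an empirical null) can be sketched in simplified form. This is an illustration under stated assumptions, not the SEMITONES implementation: the score is taken to be the ordinary least-squares slope of a feature on each cell's similarity to one reference cell, and the null is built by permuting the feature across cells as a stand-in for the "random gene expression profiles" mentioned in the abstract. All function names are hypothetical.

```python
import random


def enrichment_score(feature, similarity):
    """OLS slope of feature values regressed on similarity to a reference cell."""
    n = len(feature)
    mean_x = sum(similarity) / n
    mean_y = sum(feature) / n
    sxx = sum((x - mean_x) ** 2 for x in similarity)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(similarity, feature))
    return sxy / sxx


def permutation_p_value(feature, similarity, n_perm=1000, seed=0):
    """Empirical two-sided p-value from a permutation null (assumed null model)."""
    rng = random.Random(seed)
    observed = abs(enrichment_score(feature, similarity))
    shuffled = list(feature)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(enrichment_score(shuffled, similarity)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction avoids p = 0


# Six cells ordered by decreasing similarity to the reference cell;
# the toy feature rises linearly with similarity (slope 2).
sim = [1.0, 0.8, 0.6, 0.4, 0.2, 0.0]
expr = [2 * s + 1 for s in sim]
score = enrichment_score(expr, sim)
p_value = permutation_p_value(expr, sim)
```

Because the score is computed against a continuous similarity rather than cluster membership, no clustering step is needed, which is the property the abstract emphasizes.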