Session A: (July 7 and July 8)
Session B: (July 9 and July 10)
Short Abstract: Dengue virus (DENV) causes dengue hemorrhagic fever (DHF) and affects the liver, one of the most important target tissues in severe cases. We sequenced the miRNome of human formalin-fixed paraffin-embedded (FFPE) liver tissue from ten fatal DHF cases. Eight miRNAs were found to be differentially expressed using miRDeep2 and edgeR. Among them, three miRNAs were closely related to dengue immunopathogenesis: miR-126-5p, a regulatory molecule of endothelial cells, was up-regulated, whereas miR-122-5p (liver-specific) and miR-146a-5p (an interferon regulator) were down-regulated (Fig1). Enrichment analysis of predicted target genes of the overexpressed miRNAs revealed regulatory pathways of apoptosis and immune response (Fig2). We detected 188 differentially expressed isoforms, including those of the differentially expressed miRNAs, and identified divergences between isomiR and canonical expression levels. Lastly, we also detected nine potential novel miRNAs targeting 76 genes, which may be involved in 131 cellular metabolic pathways and biological processes. This is the first description of the human hepatic miRNA profile from DHF cases. The results demonstrate the association of miR-126-5p, miR-122-5p and miR-146a-5p with DHF liver pathogenesis, involving endothelial repair and vascular permeability regulation, control of homeostasis and expression of inflammatory cytokines, which can help in understanding the regulatory mechanisms of DHF and in developing diagnostics and antiviral therapies.
Short Abstract: MicroRNAs (miRNAs) play a vital role as post-transcriptional regulators in gene expression. As the experimental determination of miRNAs is highly resource-consuming and error-prone, developing computational methods has become an active research area. This study aims to identify proposed solutions and unresolved problems in ab initio plant miRNA identification methods over the last decade. We first query five popular scientific databases for retrieving the relevant set of articles on novel plant miRNA identification. Then, a comprehensive comparative analysis is carried out on their methodologies and performance. In the last decade, there were 16 articles published on novel miRNA identification methods using plant datasets and 10 of them focused entirely on plants. Thirteen studies use supervised machine learning algorithms; Support Vector Machines algorithm is the most popular. The rest use RNA sequence mapping strategies for identifying miRNAs. We observe that, although the reported prediction accuracies of these methods are satisfactory, they still report a considerable amount of false negatives. In comparison to the large number of similar tools available for miRNA identification in animals, there is a need for more studies on plant miRNA identification, especially because the miRNA mechanisms significantly vary across animals and plants.
Short Abstract: Next generation sequencing (NGS) technologies have indicated that more than 90% of eukaryotic genomes are transcribed into protein-coding or non-protein-coding RNAs and approximately 98% comprise the latter. microRNA-offset RNAs (moRNAs) are a novel type of short non-protein-coding RNA (ncRNA), which were once considered co-products or degradation products of miRNAs. Previous screening studies indicated that moRNAs may act as miRNAs in some biological processes, but knowledge of their biological function is still fragmentary. Currently, there is no publicly available bioinformatics tool for moRNA detection. This project succeeded in developing an effective and accurate software package named moRNA Finder that can detect moRNAs from NGS data sets. Based on Bayesian statistics theory and a previous algorithm to detect miRNAs, we constructed an optimized algorithm for moRNA identification and implemented it in a software tool. The software package provided accurate and sensitive identification of both known and novel moRNAs. This project will improve the knowledge concerning the abundance and distribution of moRNAs in biological systems and will benefit future studies of this new class of small RNA.
Short Abstract: G-quadruplexes (G4) are tetra-helices formed by the stacking of planar guanine tetrads. Their folding in RNA molecules was shown to affect mRNA post-transcriptional regulation and miRNA biogenesis. However, there are not enough data available to draw conclusions on the biological functions associated with RNA G4. The G4RNA tools were developed as a first step to address the issue. The G4RNA database is a reference support and a source of curated data for comparative analysis which was used to train an artificial neural network (G4NN). This approach allows the prediction of unusual observed G4 that cannot be predicted by classical motif searches. G4NN provides good classification performance and was thoroughly described during its optimization. It was validated using a set of high-throughput detected G4 occurrences and was also shown to be very efficient at discarding randomly selected sequences from the transcriptome. G4NN is integrated in G4RNA screener, which scans RNA sequences to find favorable G4 folding conditions. G4RNA screener is used to identify and characterize sub-populations of G4 structures which act as shared features of regulation common to groups of RNA molecules. Its predictions have been challenged experimentally, producing a G4-based structural sub-categorization that relates to colorectal cancer pathways.
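The classical motif search that G4NN is contrasted with can be illustrated with a short sketch. The pattern below is the commonly used canonical G4 motif (four runs of at least three guanines separated by loops of 1 to 7 nucleotides). The exact loop bounds and the function name `find_canonical_g4` are assumptions for illustration and are not taken from the G4RNA tools.

```python
import re

# Canonical G4 motif: four G-runs (>= 3 Gs) separated by three loops of 1-7 nt.
# The loop bounds are a widely used convention, assumed here for illustration.
G4_MOTIF = re.compile(
    r"G{3,}[ACGU]{1,7}G{3,}[ACGU]{1,7}G{3,}[ACGU]{1,7}G{3,}"
)


def find_canonical_g4(sequence):
    """Return (start, end, matched_sequence) for each canonical G4 motif hit.

    DNA-style input is accepted: the sequence is upper-cased and T is
    converted to U before matching.
    """
    rna = sequence.upper().replace("T", "U")
    return [(m.start(), m.end(), m.group()) for m in G4_MOTIF.finditer(rna)]


# A sequence with four GGG runs matches; one with only GG runs does not.
print(find_canonical_g4("AAGGGAGGGAGGGAGGGAA"))
print(find_canonical_g4("GGAGGAGGAGG"))
```

Sequences such as the second one, which lack four runs of three guanines, are exactly the kind of non-canonical candidates a pattern search of this form cannot report.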
Short Abstract: Multiple approaches have been developed to infer abundance of different cell types in heterogeneous samples (=computational deconvolution). Albeit potentially applicable to different RNA fractions, current methods have been designed and tested on mRNAs only. Using expression data of long non-coding RNAs, circular RNAs, microRNAs and mRNAs from RNA-sequencing data across 160 normal cell types and 45 tissues from the RNA Atlas project, we investigated the performance of additional RNA fractions in the computational deconvolution. Tissues and cell types in the RNA-Atlas were matched based on UBERON ontology. For each cell type, we defined cell-type specific markers based on matching mRNA, lncRNA, miRNA and circRNA expression data. These markers were subsequently applied to determine the proportion of each cell type in each of the tissues through computational deconvolution. For any given tissue, we defined the “signal” as the sum of the proportions of all its constituent cell types. This signal was computed for mRNA, miRNA, lncRNA and circRNA markers separately. We found that mRNAs contained the highest amount of biological signal across tissues, closely followed by lncRNAs. Furthermore, despite having lower overall performance, both miRNAs and circRNAs can deconvolve specific tissues with higher accuracy than mRNAs and lncRNAs.
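The "signal" defined above is a simple sum, which a minimal sketch makes concrete. The function name, the cell-type labels and the proportion values are hypothetical and chosen only for illustration; they are not data from the RNA Atlas.

```python
def tissue_signal(estimated_proportions, constituent_cell_types):
    """Sum the deconvolved proportions of the cell types that make up a tissue.

    estimated_proportions: dict mapping cell type -> estimated proportion
    constituent_cell_types: cell types matched to the tissue (e.g. via ontology)
    Cell types missing from the estimates contribute 0.
    """
    return sum(
        estimated_proportions.get(cell_type, 0.0)
        for cell_type in constituent_cell_types
    )


# Hypothetical deconvolution result for one tissue sample.
example_proportions = {"hepatocyte": 0.5, "endothelial cell": 0.25, "neuron": 0.125}
liver_constituents = {"hepatocyte", "endothelial cell"}

signal = tissue_signal(example_proportions, liver_constituents)
print(signal)
```

Computing this value separately for markers from each RNA class would allow the classes to be compared tissue by tissue.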
Short Abstract: AGO-PAR-CLIP is considered one of the most powerful high-throughput methodologies for miRNA target identification. Until today, PAR-CLIP experiments have been performed in numerous tissues and cell types from physiological or pathological conditions. Current AGO-CLIP-guided implementations present limitations that undermine the central position of these experiments in the characterization of miRNA targetome. They depend strongly upon the T-to-C conversions to define miRNA bindings, while the efficacy of neglected interactions remains unknown. By analyzing miRNA perturbation experiments and structural sequencing data we showed that the previously neglected non-T-to-C clusters exhibit functional miRNA binding events and strong accessibility. Our findings are integrated in microCLIP, an innovative in silico framework based on deep structured learning for CLIP-Seq-guided detection of miRNA interactions. microCLIP was trained and evaluated against a compendium of miRNA binding sites deduced by numerous low-yield techniques and the analysis of more than 200 high-throughput experiments. Contrary to existing implementations, microCLIP operates on every AGO-enriched cluster. The proper incorporation of non-T-to-C clusters yields an average 14% increase in miRNA-target interactions per PAR-CLIP library, uncovering previously elusive regulatory events. microCLIP framework robustly identifies 1.6-fold more validated binding sites compared to state-of-the-art algorithms, ushering in a new era of experimentally supported miRNA target annotation.
Short Abstract: Candida glabrata is an opportunistic pathogen that causes deadly infection in immunocompromised individuals. In order to understand how these pathogens maintain homeostasis and establish virulence in hosts, we set out to map gene expression changes during macrophage infection. Many genome-wide studies focus on gene expression at the transcription level, but expression of genes depends on mRNA stability and translation in addition to the rate of mRNA synthesis. Here, we perform an integrated analysis of RNA Pol II occupancy by ChIP-Seq and mRNA levels by RNA-Seq to infer mRNA stability upon Candida glabrata infection of macrophage cells. We identified many genes whose ratio of transcription (measured by Pol II occupancy) to mRNA level changes significantly upon infection, suggesting that the stability of those transcripts is altered during infection. Our preliminary results reveal changes in transcript stability for classes of genes with specific functions. For instance, transcripts of genes involved in ribosome biogenesis and amino acid metabolism become unstable after C. glabrata enters the macrophage host, suggesting that coordinated stability of related transcripts is a mechanism used by cells to adapt to changing environments. Our method provides a convenient means to determine mRNA stability in any organism for a better understanding of gene expression regulation under a given environmental condition.
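One way to picture the ratio-based inference described above is a per-gene log-ratio comparison between conditions. This is a minimal sketch under stated assumptions: the abstract does not give the statistic used, so the log2 form, the pseudocount and the function name `stability_shift` are all hypothetical.

```python
import math


def stability_shift(pol2_control, mrna_control, pol2_infected, mrna_infected,
                    pseudocount=1.0):
    """Change in log2(mRNA level / Pol II occupancy) between two conditions.

    A negative value means less mRNA per unit of transcription after
    infection, which is compatible with reduced transcript stability.
    The pseudocount (an assumption) avoids division by zero.
    """
    ratio_control = (mrna_control + pseudocount) / (pol2_control + pseudocount)
    ratio_infected = (mrna_infected + pseudocount) / (pol2_infected + pseudocount)
    return math.log2(ratio_infected) - math.log2(ratio_control)


# Hypothetical gene: unchanged Pol II occupancy, but mRNA drops on infection.
shift = stability_shift(pol2_control=7, mrna_control=15,
                        pol2_infected=7, mrna_infected=3)
print(shift)
```

In practice such a score would need normalized inputs and a significance test across replicates, neither of which is shown here.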
Short Abstract: The high-throughput sequencing of mRNA, together with software (e.g. Salmon and Kallisto) that allows high-throughput processing of thousands of samples, makes possible a global description of alternative splicing at the level of a sample, a tissue or a species. We define heteroformity, a measure of transcript diversity, as the fraction of transcript pairs drawn at random from a single gene that differ. The heteroformity of a gene thus varies between 0 (a single isoform) and 1 (the limit, when each transcript is different). The abundance-weighted gene heteroformity for a sample can be visualized as a cumulative distribution. Application to 11,688 human samples from 30 tissues in the Genotype-Tissue Expression (GTEx) project revealed some general patterns. About 25% of transcripts from all samples lie in genes with very low heteroformity, while the top quartile of transcripts are in genes with over 0.5 heteroformity. Tissues differ; reproductive and nervous tissue show more heteroformity. Nevertheless, there is great individual variation, with specific heart samples varying over threefold. We note that overall heteroformity and differential alternative splicing are distinct measures. Both reveal patterns of regulation. We are currently applying heteroformity to diverse samples and exploring its properties as a robust and useful metric.
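The definition of heteroformity given above can be sketched in a few lines. This version assumes that transcript pairs are drawn with replacement according to relative isoform abundance, so that the probability of two draws differing is one minus the sum of squared isoform fractions; the abstract does not state the sampling scheme, and the function name is hypothetical.

```python
def heteroformity(isoform_abundances):
    """Probability that two transcripts drawn at random from a gene differ.

    isoform_abundances: non-negative abundances (e.g. TPM), one per isoform.
    Assumes sampling with replacement, giving 1 - sum(p_i ** 2).
    """
    total = sum(isoform_abundances)
    if total <= 0:
        raise ValueError("gene has no expressed isoforms")
    return 1.0 - sum((a / total) ** 2 for a in isoform_abundances)


print(heteroformity([10.0]))                 # single isoform
print(heteroformity([5.0, 5.0]))             # two equally abundant isoforms
print(heteroformity([1.0, 1.0, 1.0, 1.0]))   # four equally abundant isoforms
```

With n equally abundant isoforms this gives 1 - 1/n, which approaches 1 as the number of distinct transcripts grows, matching the limits described in the abstract.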
Short Abstract: Aberrant splicing is a hallmark of leukemias with mutations in splicing factor (SF)-encoding genes. Here we investigated its prevalence in pediatric B-cell acute lymphoblastic leukemias (B-ALL), where SFs are not mutated. By comparing them to normal pro-B cells, we found thousands of aberrant local splice variations (LSVs) per sample, with 279 LSVs in 241 genes present in every comparison. These genes were enriched in RNA processing pathways and encoded ~100 SFs, e.g. hnRNPA1. The hnRNPA1 3’UTR was pervasively misspliced, yielding a transcript subject to nonsense-mediated decay. Thus, we knocked it down in B-lymphoblastoid cells, identified 213 hnRNPA1-dependent splicing events, and defined the hnRNPA1 splicing signature in pediatric leukemias. One of its elements was DICER1, a known tumor suppressor gene; its LSVs were consistent with reduced translation of DICER1 mRNA. Additionally, we searched for LSVs in other leukemia and lymphoma drivers and discovered 81 LSVs in 41 genes. 77 LSVs were confirmed using two large independent B-ALL RNA-seq datasets. In fact, the twenty most common B-ALL drivers showed higher prevalence of aberrant splicing than of somatic mutations. Thus, post-transcriptional deregulation of SFs can drive widespread changes in B-ALL splicing and likely contributes to disease pathogenesis.
Short Abstract: With an increasing number of non-coding RNA families being identified, there is strong interest in developing computational methods to estimate sequence alignment and secondary structure. I developed TurboFold II, an algorithm that takes multiple, unaligned homologous RNA sequences and outputs the predicted secondary structures and the structural alignment of the sequences. Secondary structure conservation information is incorporated into the alignment by using a match score, calculated from estimated base pairing probabilities, to represent the secondary structural similarity between nucleotide positions in the two sequences. TurboFold II computes a multiple sequence alignment based on a probabilistic consistency transformation and a hierarchically computed guide tree. The TurboFold II algorithm is modified for prediction of RNA secondary structures to utilize base pairing probabilities guided by SHAPE experimental data. Results demonstrate that the SHAPE mapping data for a sequence improve structure prediction accuracy of other homologous sequences beyond the accuracy obtained by sequence comparison alone. To assess TurboFold II, its sequence alignment and structure predictions were compared with leading tools. TurboFold II has alignment accuracy comparable to MAFFT and higher accuracy than other tools. TurboFold II also has structure prediction accuracy comparable to the original TurboFold algorithm, which is one of the most accurate methods.
Short Abstract: Technological advances in RNA expression profiling methods revealed that our genome is pervasively transcribed, producing an unexpectedly complex transcriptome consisting of various classes of RNA molecules and a huge isoform diversity. Many of these RNAs show high tissue specificity, with some being expressed in only one or a few cell types. While numerous large-scale RNA-sequencing studies have been performed, the samples involved are often complex tissues, masking transcripts expressed in low-frequency cell populations, and sequencing methods typically focus on one class of RNA transcripts. By applying complementary RNA sequencing methods (total RNA, poly-A RNA and small-RNA sequencing) across an extensive cohort of 300 human samples, we captured a wide variety of human transcripts, including protein coding genes, miRNAs, circular RNAs and long non-coding RNAs, a large fraction of which were previously unknown. We found that many non-coding RNAs show variable polyadenylation status across samples. We also compared cell-type specificity between different RNA species. Our results confirm the dynamic nature of the transcriptome, with many RNAs being expressed in only a limited number of cell types. The RNA Atlas constitutes a unique resource for further studies on the function, organization and regulation of the different layers of the human transcriptome.
Short Abstract: Circular RNAs (cRNA) are increasingly being recognized as an important class of noncoding RNA that are pervasively expressed in a variety of eukaryotes, display significant conservation across mammals, and are coherently expressed independently of their cognate linear isoforms. Their functional role and biogenesis remain largely unknown. Here we studied the role of cRNA in animal models of the onset of inflammatory bowel disease (IBD). Leveraging ribosomal-depleted RNA sequencing obtained on 403 C57/B6 mice in longitudinal dextran sulfate sodium (DSS) and adoptive T-cell transfer models, we compared the power of cRNA and mRNA expression signatures to predict disease severity and evolution, jointly modeled cRNA and mRNA via co-expression networks to identify key drivers of colitis development, and finally detected cognate linear and circular RNA that displayed evidence of regulating different phenotypes to infer cRNA function. We found that cRNA signatures derived from blood rival the predictive power of mRNA signatures in tissue in predicting colitis disease severity. Furthermore, co-expression networks identify cRNA disease drivers and suggest that scalable functional cRNA screening is facilitated by identifying differential cognate cRNA/mRNA phenotype associations.
Short Abstract: Rfam is a database of non-coding RNA families in which each family is represented by a multiple sequence alignment, a consensus secondary structure, and a covariance model. Rfam currently contains 2,772 families and continues to grow. Starting with release 13.0, Rfam switched to a new genome-based sequence database, which currently includes a non-redundant set of over 14,000 reference genomes identified by UniProt. The new database is more scalable and gives a more accurate view of the distribution of Rfam entries. Using complete genomes enables meaningful taxonomic comparisons and identification of a repertoire of RNA families found in a certain species. The text search functionality of the Rfam website was significantly improved. Users can now more easily search Rfam with the new and more powerful faceted text search. For example, it is possible to explore RNA families or ncRNAs in any annotated genome and compare annotations across genomes. The transition of Rfam to a genome-centric sequence database and the new website features make Rfam a more valuable resource for the sequence analysis community. Rfam is available at http://rfam.org.
Short Abstract: We introduce Scallop, an accurate reference-based transcript assembler that improves reconstruction of multi-exon and lowly expressed transcripts. Scallop preserves long-range phasing paths extracted from reads, while producing a parsimonious set of transcripts and minimizing coverage deviation. On 10 human RNA-seq samples, Scallop produces 34.5% and 36.3% more correct multi-exon transcripts than StringTie and TransComb, and respectively identifies 67.5% and 52.3% more lowly expressed transcripts. Scallop achieves higher sensitivity and precision than previous approaches over a wide range of coverage thresholds.
Short Abstract: Background: The RNAcentral database (http://rnacentral.org) is a continuously growing, comprehensive collection containing over 11 million non-coding RNA (ncRNA) sequences of all types across a broad range of organisms. RNAcentral integrates over 25 expert resources, such as miRBase, LNCipedia, HGNC, and Ensembl, and provides an integrated faceted text and sequence search. Results: To identify potentially inconsistent annotations, RNAcentral implemented new quality control procedures that annotate all RNAcentral sequences with Rfam families. These procedures warn users about partial sequences and potential contamination, allowing users to identify and exclude problematic sequences from search results. Additionally, Rfam is used to annotate sequences with GO terms. RNAcentral is one of the largest sources of genome-level ncRNA annotations as it maintains a mapping of all ncRNA sequences from key species to reference genomes, including sequences without annotated genomic locations or coming from non-reference assemblies. The data are available in a genome browser, a set of track hubs, and in multiple downloadable formats. Conclusions: The RNAcentral website has been continuously improved with an updated text search interface and a feature viewer displaying Rfam annotations and modified nucleotides. We welcome feedback about the resource and invite new member databases to join RNAcentral.
Short Abstract: Predicting the secondary structure of an RNA sequence with speed and accuracy is useful in many applications such as drug design. The state-of-the-art predictors have a fundamental limitation: they have a runtime that scales cubically with the length of the input sequence, which is slow for longer RNAs and limits the use of secondary structure prediction in genome-wide applications. To address this bottleneck, we designed the first linear-time algorithm for this problem, which can be used with both thermodynamic and machine-learned scoring functions. Our algorithm, like previous work, is based on dynamic programming (DP), but with two crucial differences: (a) we incrementally process the sequence in a left-to-right rather than in a bottom-up fashion, and (b) because of this incremental processing, we can further employ beam search pruning to ensure linear runtime in practice (at the cost of exact search). Even though our search is approximate, surprisingly, it results in even higher overall accuracy on a diverse database of sequences with known structures. More interestingly, it leads to significantly more accurate predictions on the longest sequence families in that database (16S and 23S ribosomal RNAs), as well as improved accuracies for long-range base pairs (500+ nucleotides apart).
Short Abstract: Accurate measurement of RNA expression is crucial in the quest to understand disease and identify drug targets. NanoString RNA assays represent an optimal solution to this challenge by using a direct, amplification-free expression measurement system to simultaneously detect hundreds of targets. Although NanoString assays overcome the multiplex limitations of traditional qPCR-based approaches and the stringent RNA quality and purity requirements needed for sequencing, robust data processing is crucial for reliable results. NavSIVRAC is a data analysis pipeline developed at Navigate BioPharma Services, Inc., a Novartis subsidiary, that is a modularly designed collection of open source and custom algorithms. It allows for the rapid implementation of novel analysis scripts and the high degree of flexibility needed to suit individual clinical trial needs. The system integrates sample demographic information with the NanoString Digital Analyzer, and performs custom normalization strategies and gene expression differentiation using clustering and statistical inference. It also provides visualization tools for the raw and/or analyzed data for final result formatting and reporting. We have applied NavSIVRAC to multiple placebo/drug sets to aid in RNA profiling studies. Our pipeline identifies clear gene differentiation patterns among the data sets, maximizing the value of clinical information obtained from NanoString gene expression assays.
Short Abstract: RNA binding proteins (RBPs) accompany RNA from birth to death, affecting RNA biogenesis and functions. Identifying RBP-RNA interactions is essential to understand their complex roles in different cellular processes. However, detecting in vivo RNA targets of RBPs, especially in a small number of discrete cells, has been a technically challenging task. We have previously developed a novel technique called TRIBE (Targets of RNA-binding proteins Identified By Editing) to overcome this problem. TRIBE expresses a fusion protein consisting of a queried RBP and the catalytic domain from the RNA editing enzyme ADAR (ADARcd), which marks target RNA transcripts by converting adenosine to inosine near the RBP binding sites. These marks can be subsequently identified via high-throughput sequencing. In spite of its usefulness, TRIBE is constrained by a low editing efficiency and editing-sequence bias from the ADARcd. We therefore developed HyperTRIBE by incorporating a previously characterized hyperactive mutation, E488Q, into the ADARcd. This strategy increases the editing efficiency and reduces sequence bias, which dramatically increases the sensitivity of the technique without sacrificing specificity. HyperTRIBE provides a more powerful strategy to identify RNA targets of RBPs with an easy experimental and computational protocol at low cost in both flies and mammals.
Short Abstract: Conventional short-read RNA sequencing has been widely used to quantify gene expression in a variety of applications. However, short reads on their own lack the ability to resolve full-length isoforms, which can be several kilobases in length. Furthermore, computational methods developed to reconstruct isoforms from short read data are plagued by challenges, and results from different algorithms tend to be inconsistent. While long read sequencing technologies such as PacBio Iso-seq and Oxford Nanopore have a higher error rate than Illumina sequencing, they have great potential for isoform discovery and characterization of the 90% of multi-exon human genes that are thought to undergo alternative splicing. To take advantage of these properties, we develop a computational pipeline to process long reads into cleaned isoforms and generate a high-quality, full-length transcriptome. We demonstrate this process on PacBio Iso-seq data from human cell lines K562, GM12878, and HepG2 and show that the technology is mature enough to produce full-length transcriptomes by comparing the results to existing ENCODE data.
Short Abstract: MicroRNAs (miRNAs) play important roles in interindividual variability in drug safety by modulating the expression of drug metabolizing enzymes. The Phase II drug metabolizing enzyme sulfotransferase 2A1 (SULT2A1) catalyzes the sulfonation of many drugs to increase their solubility and facilitate their elimination. Down-regulation of SULT2A1 may affect drug-induced toxicity and is associated with several liver diseases including cholestasis and primary sclerosing cholangitis. However, little is known about the roles of miRNAs in down-regulation of SULT2A1. We utilized two prediction programs to identify potential binding positions of miRNAs on SULT2A1 mRNA. To evaluate the binding strength, the minimum free energy (MFE) of miRNA-mRNA interaction was then calculated by using RNAhybrid. Furthermore, we extracted RNA-seq and miRNA-seq data from The Cancer Genome Atlas (TCGA) and conducted Pearson correlation analyses of the levels of SULT2A1 mRNA and miRNA candidates. We found that hsa-mir-495 and hsa-mir-486 may target SULT2A1 at the 5’UTR and 3’UTR, respectively, and that their expression levels are inversely correlated with that of SULT2A1 in human liver samples. Our integrative analyses provide a foundation for investigating the repressive regulation of SULT2A1 by miRNAs.
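The Pearson correlation step mentioned above can be sketched as follows. The expression values are synthetic and exist only to show the calculation; they are not TCGA data, and the function name is hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Synthetic per-sample expression levels (illustrative only).
mirna_levels = [1.0, 2.0, 3.0, 4.0]
mrna_levels = [8.0, 6.0, 4.0, 2.0]

r = pearson_r(mirna_levels, mrna_levels)
print(r)  # negative r indicates an inverse relationship
```

A negative coefficient is what would be expected for a miRNA that represses its target, although correlation alone does not establish direct regulation.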
Short Abstract: High-throughput sequencing methods such as RNA-Seq have offered us a way to reveal the complexity of the transcriptomic landscape. However, it has become increasingly obvious that classical RNA-Seq poorly detects highly structured RNAs, which has contributed to their poor characterization. Our previous studies led us to favor the TGIRT-Seq method, which replaces the retroviral reverse transcriptase with a Thermostable Group II Intron Reverse Transcriptase (TGIRT). TGIRT-Seq allows detection of highly structured RNAs such as snoRNAs and tRNAs at their correct biological abundance. We present here the discovery of hundreds of non-annotated non-coding RNA genes that are only found in these TGIRT-Seq datasets. We show that many of these novel genes share high sequence and structure similarity with known RNAs such as snoRNAs and tRNAs. Comparisons with RNA polymerase III ChIP datasets and ddPCR following the depletion of specific RNA-binding proteins validate that many of these genes are, indeed, actively transcribed, and give indications of their functions. Understanding the function of these genes is a challenge, as more than a third show no similarity with known genes. Nevertheless, it is clear that much remains to be understood about highly structured RNA and that the endeavor of gene annotation in human is not over yet.
Short Abstract: Bordetella pertussis (Bp) is the causative agent of the highly contagious disease whooping cough. Expression of virulence genes is under the control of a master two-component regulation system, BvgAS. BvgS, a sensor kinase, phosphorylates a response regulator, BvgA, which then forms the active phosphorylated dimers (BvgA~P). Previous studies have shown that the RNA chaperone Hfq is important in virulence of Bp, which suggested that Hfq-dependent small RNAs (sRNAs) may play a crucial role in virulence regulation. Therefore, we conducted a genome-wide search for sRNAs in Bp under various conditions. The RNA-seq pipeline READemption was used for alignment and differential expression analysis; sRNA and sRNA target prediction were performed with ANNOgesic. Our RNA-seq data revealed about 150 possible sRNAs, 33 of which are potential Hfq-binders. Among the 15 predicted sRNAs tested by Northern blot, the numbers of true positives, true negatives, false positives and false negatives were 7, 3, 3 and 2, respectively. S17 is an example of an Hfq-binding sRNA. The level of S17 increases in the presence of BvgA~P and Hfq in both RNA-seq and Northern blot analyses. This suggests that an unknown repressor(s) may be involved in the expression of this RNA. Target genes of S17 are currently under investigation.
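The Northern blot validation counts quoted above (7 true positives, 3 true negatives, 3 false positives, 2 false negatives) can be turned into standard summary metrics. The sketch below is only arithmetic on those four numbers; the abstract itself does not report these metrics, and the function name is hypothetical.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Standard summary metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),   # fraction of real sRNAs predicted
        "specificity": tn / (tn + fp),   # fraction of non-sRNAs rejected
        "precision": tp / (tp + fp),     # fraction of predictions confirmed
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }


# Counts as stated in the abstract for the 15 tested candidates.
metrics = confusion_metrics(tp=7, tn=3, fp=3, fn=2)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With only 15 tested candidates these estimates carry wide uncertainty and are best read as rough indicators.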
Short Abstract: We identify 665 conserved lncRNA promoters in mouse and human that are preserved in genomic position relative to orthologous coding genes. These positionally conserved lncRNA genes are primarily associated with developmental transcription factor loci with which they are coexpressed in a tissue-specific manner. Over half of positionally conserved RNAs in this set are linked to chromatin organization structures, overlapping binding sites for the CTCF chromatin organiser and located at chromatin loop anchor points and borders of topologically associating domains (TADs). We define these RNAs as topological anchor point RNAs (tapRNAs). Characterization of these noncoding RNAs and their associated coding genes shows that they are functionally connected: they regulate each other’s expression and influence the metastatic phenotype of cancer cells in vitro in a similar fashion. Furthermore, we find that tapRNAs contain conserved sequence domains that are enriched in motifs for zinc finger domain-containing RNA-binding proteins and transcription factors, whose binding sites are found mutated in cancers. This work leverages positional conservation to identify lncRNAs with potential importance in genome organization, development and disease. The evidence that many developmental transcription factors are physically and functionally connected to lncRNAs represents an exciting stepping-stone to further our understanding of genome regulation.
Short Abstract: RNA-protein interactions are implicated in a wide range of critical regulatory and structural roles whose disruption can lead to numerous diseases. Computational methods for predicting RNA-protein interaction partners (RPIPs) are valuable because experimentally characterizing these interactions is time-consuming and expensive. Published prediction methods utilize various sequence and structural features, but are generally limited by high false positive rates (FPRs) and/or query sequence length. Because intrinsically disordered regions (IDRs) are abundant in RNA-binding sites of proteins, we hypothesized that incorporating IDR information with sequence features could improve prediction of RPIPs. We developed a new random forest machine learning classifier, RPIDisorder, which requires only primary sequences of potential RNA and protein interaction partners as input. RPIDisorder outperformed our published classifier, RPISeq, on an independent test set of 11,281 RPIPs and 971 non-interacting pairs, with MCC 0.68 (vs 0.47) and FPR 21% (vs 55%). In a case study, RPIDisorder was used to identify RNAs bound to the Fragile-X Mental Retardation Protein (FMRP). On a test set of 30 RNAs (14 binding and 16 non-binding ncRNAs), RPIDisorder achieved an MCC of 0.73 and FPR 6.3%. These results indicate that incorporating IDR information can improve the reliability of RNA-protein partner prediction over sequence composition alone.
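The two evaluation measures quoted above, the Matthews correlation coefficient (MCC) and the false positive rate (FPR), are computed from confusion-matrix counts as in the sketch below. The function names are hypothetical and the abstract does not give the underlying counts, so the example values are illustrative only.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when the denominator is zero (a common convention,
    assumed here).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


def false_positive_rate(fp, tn):
    """Fraction of true negatives that were wrongly predicted positive."""
    return fp / (fp + tn)


# Illustrative check: one false positive among 16 non-interacting pairs.
print(false_positive_rate(fp=1, tn=15))
print(matthews_corrcoef(tp=5, tn=5, fp=0, fn=0))
```

For instance, a single false positive among 16 non-binding RNAs gives an FPR of 6.25%, which would round to the 6.3% quoted for the FMRP case study; since the counts are not given, this is only a consistency check.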
Short Abstract: Presence or absence of genetic variation in the general population is reflective of negative selection (particularly for genes predicted to be haploinsufficient) and mutation probability. In this work, we used this property to evaluate changes in splicing regulatory sequence. We considered all genomic positions covered by the 120,000 gnomAD exomes. We then looked at variance in negative selection with respect to mutation type, corrected for nucleotide mutation probability. We observed 10% of all possible synonymous variants and 5.5% of all possible missense as well as 1% of all nonsense mutations (stop gains, splice-acceptor, splice-donor). Based on selection constraint methodology, we present an unbiased approach for evaluating mis-splicing predictors. MaxEntScan and DeepScan, a deep learning tool, are the best splicing variant effect predictors when considering only variants in the splicing consensus region. However, splicing effects in the distal intronic and exonic regions appear to be too weak to be detected. In conclusion, we present an investigation into the impact of negative selection with regard to mutation type as well as an alternative approach for splicing predictor evaluation that trades a reduction in statistical power for an unbiased evaluation set.
Short Abstract: We introduce ReQTL, a method to assess co-regulated genomic regions via correlation between gene expression and expressed allele frequency at single nucleotide variant (SNV) positions from RNA-sequencing data. We exemplify the application on sets of cancer genomic data from TCGA and demonstrate that ReQTL analyses show consistently high performance and sufficient power to outline both previously known and novel molecular associations. ReQTL analyses are computationally feasible and do not require matched DNA data, and hence hold strong potential to facilitate the discovery of novel molecular interactions through exploration of the increasingly accessible RNA-sequencing datasets. The ReQTL toolkit is available from: https://github.com/HorvathLab/ReQTL
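The core idea described above, correlating gene expression with the expressed allele frequency at an SNV across samples, can be illustrated as follows. This is a minimal sketch and not the ReQTL toolkit's implementation: the function names, the use of a plain Pearson correlation, and the per-sample read counts are all assumptions made for illustration.

```python
import math


def expressed_allele_frequency(n_var, n_ref):
    """Fraction of RNA-seq reads at an SNV position carrying the variant allele."""
    total = n_var + n_ref
    if total == 0:
        raise ValueError("no reads cover this position")
    return n_var / total


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: (variant reads, reference reads) at one SNV in five samples,
# and the expression of one gene in the same samples.
read_counts = [(2, 18), (5, 15), (10, 10), (14, 6), (19, 1)]
gene_expression = [3.1, 4.0, 5.2, 6.1, 7.4]

frequencies = [expressed_allele_frequency(v, r) for v, r in read_counts]
print(f"correlation = {pearson(frequencies, gene_expression):.3f}")
```

Because both quantities come from the same RNA-sequencing reads, a sketch like this needs no matched DNA genotypes, which is the property the abstract highlights.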
Short Abstract: The development of next generation sequencing (NGS) methods has resulted in a rapid increase in the generation of large genomic datasets. However, the development of tools accessible to those without bioinformatics training has not progressed at the same pace, and the lack of user-friendly tools remains a significant challenge. Additionally, the correct processing pipeline and normalization strategy for NGS data is important for downstream analysis; several pipelines have been published, and expert knowledge of genomics and statistics is required to select the appropriate methods. This presents a two-fold challenge: selection of the most appropriate analysis pipelines, and bioinformatics skills sufficient to apply them. To address these challenges, we have combined R packages used for RNA-Seq analysis and visualization in a Shiny web application to create GENAVi (Gene Expression Normalization Analysis and Visualization). This GUI-based application provides a user-friendly platform to normalize expression data, cluster samples based on expression, perform differential expression analysis and visualize results. We have performed RNA-Seq on a panel of 20 cell lines frequently used for the study of breast and ovarian cancer and included these data within GENAVi as a resource, and a foundation for users to bring their own data to the application.
Short Abstract: Pre-mRNA splicing is an essential step of gene expression that is regulated through multiple trans-acting splicing factors interacting at intronic and exonic positions. Since most exons are protein coding, evolution of exons must be modulated by a combination of selective coding and splicing pressures. We have previously demonstrated that deconvolution of splicing pressures is enhanced when phylogenetic comparisons are made in the framework of identically sized exons. We hypothesize that exon size-filtered sequence alignments may improve identification of nucleotides that have evolved to mediate efficient exon ligation. To address this, an exon size database was generated evaluating 100 vertebrate sequence alignments based on exon size conservation. The inclusion of splice site strength, gene position, and flanking intron length information in the database permits identification of exons simultaneously conserved by sequence and size. While highly size-conserved exons are always sequence conserved, sequence conservation did not necessitate exon size conservation. Our analysis identified exons unique to humans/primates, indicative of exons considered to be evolutionarily young. By further comparing exon-size alignments with a published dataset of disease-associated SNPs, we demonstrated that coding pressures dominate nucleotide composition at invariable codon positions. This exon-size alignment approach permits identification of splice-altering nucleotides specifically at wobble positions.
Short Abstract: Background: RNA processing dysfunction has been implicated in the pathology of the neurodegenerative disease amyotrophic lateral sclerosis (ALS), notably due to the characteristic mislocalisation of the crucial RNA-binding protein TDP-43. This indicates the importance of investigating the widespread TDP-43 dysfunction-mediated changes in RNA processing, with the aim of identifying differential gene and transcript expression in the context of neurodegenerative disease. Methods: We investigated two mouse models of TDP-43, each containing a single substitution within the coding region of the TDP-43 gene: one mutation in the RRM2 domain, the other in the C-terminal hotspot for ALS-causative mutations. RNA sequencing was used to examine differential gene expression and alternative splicing events, while iCLIP highlighted changes in RNA-binding patterns. Results and discussion: Severe molecular dysregulation was identified in both models. The mutation of RRM2 led to dose-dependent preferential exon inclusion, including cryptic exons. Alongside this was downregulation of long intron-containing genes, typically neuronal. The C-terminal mutation caused greater levels of exon skipping, including novel gain-of-function splicing which resulted in mutant-specific ‘skiptic’ transcripts. iCLIP confirmed both cryptic and ‘skiptic’ events to be enriched for TDP-43 binding sites. Collectively, these results highlight the array of TDP-43-mediated disrupted RNA processing features in neurodegenerative disease models.
Short Abstract: Global quantification of total RNA is used to investigate steady state levels of gene expression. However, being able to differentiate pre-existing RNA and newly transcribed RNA can provide invaluable information (e.g., to estimate RNA half-lives or identify fast and complex regulatory processes). Recently, new techniques based on metabolic labeling and RNA-seq have emerged that allow quantification of new and old RNA: nucleoside analogs are incorporated into newly transcribed RNA and are made detectable as point mutations in mapped reads. However, relatively infrequent incorporation events and significant sequencing error rates make the differentiation between old and new RNA a highly challenging task. We developed a statistical approach termed GRAND-SLAM that, for the first time, allows estimation of the proportion of old and new RNA in such an experiment. Uncertainty in the estimates is quantified in a Bayesian framework. Simulation experiments show our approach to be unbiased and highly accurate. Furthermore, we analyze how uncertainty in the proportion translates into uncertainty in estimating RNA half-lives and give guidelines for planning experiments. Finally, we demonstrate that our estimates of RNA half-lives compare favorably to other experimental approaches and that biological processes affecting RNA half-lives can be investigated with greater power than offered by any other method.
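How uncertainty in the proportion of new RNA carries over to the half-life can be illustrated with a small calculation. The sketch below is not GRAND-SLAM itself; it assumes steady-state expression and first-order (exponential) decay, under which the new fraction p after labeling time t satisfies p = 1 - exp(-λt), so the half-life is -t·ln 2 / ln(1 - p). The function names and the example interval are hypothetical.

```python
import math


def half_life(new_fraction, labeling_time):
    """Half-life implied by the fraction of new RNA after a given labeling time.

    Assumes steady state and exponential decay: p = 1 - exp(-lambda * t).
    """
    if not 0.0 < new_fraction < 1.0:
        raise ValueError("new_fraction must lie strictly between 0 and 1")
    return -labeling_time * math.log(2) / math.log(1.0 - new_fraction)


def half_life_interval(lower, upper, labeling_time):
    """Map an interval on the new fraction to an interval on the half-life.

    A larger new fraction implies faster turnover, so the bounds swap.
    """
    return half_life(upper, labeling_time), half_life(lower, labeling_time)


# Hypothetical example: 2 h of labeling, new fraction estimated at 0.5
# with an interval of 0.4 to 0.6.
print(half_life(0.5, 2.0))
print(half_life_interval(0.4, 0.6, 2.0))
```

The mapping is strongly nonlinear near p = 0 and p = 1, so the same uncertainty in the proportion yields much wider half-life intervals for very stable or very unstable transcripts; this is one reason the choice of labeling time matters when planning experiments.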
Short Abstract: Background: Long intergenic noncoding RNAs (lincRNAs) have risen to prominence in cancer biology. Association of lincRNAs with cis-regulatory DNA elements (enhancers) provides mechanistic insight into transcriptional regulation; however, in the absence of an enhancer, functional lincRNAs remain challenging for computational prediction. Methods: We designed and evaluated a cis-pi score to predict regulatory lincRNAs by assessing the mutual biological relevance between lincRNAs and target genes. To predict transcriptional regulatory lincRNAs in neuroblastoma, an aggressive pediatric cancer, we enhanced this scoring system and developed a novel side-by-side analytics pipeline for RNA-Seq data to measure lincRNAs with relatively low expression levels. Results: Risk-dependently transcribed lincRNAs over-represented neuroblastoma susceptibility loci and recaptured novel clinical biomarkers. The lincRNAs prioritized by cis-pi not only dissected independent high-risk patients but were significantly prognostic. The predicted target genes further inherited the prognostic significance of these lincRNAs. Conclusion: Altered expression of lincRNAs that stratifies tumor risk is an informative readout of oncogenic enhancer activity. Risk-dependent and prognostic lincRNAs provide cis-regulatory insights into cancer biology. Significance: RNA-Seq alone is sufficient to identify regulatory lincRNAs using our methodologies, allowing broader applications. Regulatory lincRNAs that have polyA tails without a hallmark of enhancer activity could represent a new class of functional lincRNAs.
Short Abstract: We describe new techniques and technologies under development for HMMER4, the fourth generation of the HMMER software for identifying homologous biological sequences using profile HMMs. These advances allow HMMER4 to efficiently analyze billion-sequence databases and million-sequence family alignments while improving its ability to recognize remote homologs. HMMER4 replaces HMMER3's local-only alignments with a combined global/local alignment probability model that is better able to annotate the bounds of complete sequence domains when they are present. A new probabilistic domain identification algorithm annotates domain coordinates in multidomain proteins using an ensemble calculation that takes alignment uncertainty into account, rather than relying on a single optimal alignment. Memory-efficient sparse dynamic programming and checkpointing techniques allow HMMER4 to support 100,000-element sequences and 100,000-position HMMs while consuming less than 1GB of RAM per core. Use of wider AVX and AVX-512 vector instructions increases performance by about 2x over HMMER3's 128-bit SSE implementation. An improved data format allows HMMER4 to scale well on multi-core processors, unlike HMMER3, which saturated at 2-4 cores. Improved load-balancing and parallelization improve performance on multi-computer systems, delivering sub-second search times on a 16-server cluster. A prototype GPU implementation shows potential to further improve performance.
Short Abstract: During liver regeneration, most new hepatocytes arise from pre-existing ones; yet, the underlying mechanisms that drive quiescent hepatocytes to proliferate following injury remain poorly defined. By combining high-resolution transcriptome and polysome profiling of hepatocytes purified from quiescent and toxin-injured adult mouse livers, we uncover pervasive shifts in ribosome occupancies for transcripts encoding metabolic and RNA processing factors. The translational remodeling modulates protein levels of a set of splicing factors, amongst which, downregulation of Epithelial Splicing Regulatory Protein 2 (ESRP2) activates a neonatal splicing program that rewires the Hippo signaling pathway in regenerating hepatocytes. We show that neonatal Hippo protein isoforms have lower signaling capacity, which allows higher transcriptional activity of the downstream YAP1 and TEAD1 effectors, thereby sustaining hepatocyte proliferation. We further demonstrate that ESRP2 knockout mice manifest excessive hepatocyte proliferation upon injury, whereas forced expression of ESRP2 inhibits hepatocyte proliferation by impeding the production of neonatal Hippo isoforms. Thus, our findings reveal an ESRP2-Hippo pathway alternative-splicing axis that controls hepatocyte proliferation in response to chronic liver injury.
Short Abstract: The differential production of transcript isoforms through the mechanism of alternative splicing is crucial in multiple biological processes as well as pathologies, including cancer. This has been exhaustively shown at the RNA level but remains elusive at the protein level. Sequencing of ribosome-protected mRNA fragments (ribosome profiling) provides information on the transcripts being translated. We describe a new pipeline for the quantification of individual transcript coding sequences from ribosome profiling using both RNA-seq and Ribo-seq. Using multiple datasets, we find evidence of translation for 50-70% of the isoforms quantified with RNA-seq. Additionally, we performed differential splicing analysis between glia and glioma samples from human and mouse and found consistent changes occurring in both RNA-seq and Ribo-seq for the majority of cases, indicating that changes in the relative abundance of transcript isoforms lead to changes in the production of protein isoforms in the same direction. Among the cassette exon events with changes in splicing, we identified an enrichment of orthologous exons, with the majority of them preserving the directionality of the change. Interestingly, there was a significant enrichment of microexons that decrease inclusion in glioma compared to glia in both human and mouse, suggesting a concerted mechanism of dedifferentiation in glioma.
Short Abstract: Myotonic dystrophy type 1 (DM1) is a dominantly inherited neuromuscular disease caused by a CTG repeat expansion in the 3’-UTR of the DMPK gene. DM1 affects multiple tissues, but cardiac dysfunctions are the second leading cause of death. The best characterized pathogenic mechanism of DM1 is a toxic gain-of-function of expanded CUG repeat RNA that accumulates to form ribonuclear foci affecting the MBNL and CELF families of splicing factors. However, misregulation of MBNL1 or CELF1 does not explain the cardiac phenotypes observed in DM1. We have discovered that steady-state protein levels of RBFOX2, a critical splicing regulator, are drastically increased in DM1 heart tissue, which is accompanied by simultaneous skipping of a muscle-specific exon in the Rbfox2 transcript. We demonstrate that tet-inducible overexpression of the non-muscle RBFOX isoform or CRISPR/Cas9-mediated deletion of the muscle-specific RBFOX2 exon in the mouse heart results in prolonged PR and QRS intervals, slower conduction velocity, and cardiac arrhythmias that mirror human DM1 pathology. RNA-sequencing of the isolated cardiomyocytes from these mice identified a core network of mRNA splicing defects in genes involved in cardiac conduction and excitation-contraction coupling. Collectively, our study has uncovered a novel role for a non-muscle Rbfox2 splice isoform in DM1 cardiac pathogenesis.
Short Abstract: The untranslated regions (UTRs) of mRNAs have characteristics of noncoding RNAs that regulate gene expression. Due to their sessile nature, plants are exposed to various biotic and abiotic stresses and undergo different post-transcriptional modifications to combat such stresses, leading to 5’/3’ UTR lengthening. Here, we report that prolonged drought and heat stress may result in an extension of the 5’/3’ UTR in stress-related genes compared to non-treated controls in switchgrass. We identified more than 17,000 UTR extensions of varying lengths. Based on the differential expression of these UTR extensions, we selected 330 extensions for further characterization. Of these, 148 are 5’ UTR and 182 are 3’ UTR extensions. The characterization of these extensions revealed their similarities to long noncoding RNAs based on length distribution and coding potential. Since the reference genome is still in draft, some of these extensions may be due to misannotation, as 38 of 330 extensions are predicted to have coding potential. Based on the differential expression of mRNA with and without UTR extension reads, we identified putative UTR extensions that may play an important role in stress response genes and thus expand our current understanding of the switchgrass transcriptome.
Short Abstract: Since the experimental validation of the first riboswitch classes in 2002, more than 40 additional classes have been reported. With the recent identification of the ligands for some long-standing ‘orphan’ riboswitch candidates such as ykkC (guanidine-I riboswitch), it is increasingly likely that all the most widespread riboswitch classes have already been found. However, it has been proposed that many thousands of additional riboswitch classes that are less widespread remain to be discovered. New computational approaches will be required to enable their rapid discovery. We have developed a computational approach that is optimized for discovering new, less-common classes of structured noncoding RNAs (ncRNAs), including new riboswitch candidates. This approach employs in-depth homology searches on the longer, GC-rich intergenic regions of individual bacterial genomes. Potential ncRNA motifs can then be evaluated based on predicted structure, sequence conservation, nucleotide covariation, and genetic context to assign probable biological functions. Preliminary results on a set of five bacterial genomes have revealed the existence of a wide variety of probable regulatory RNA motifs including uORFs and riboswitch candidates.
Short Abstract: Numerous studies have demonstrated the critical role of translational control in the dynamic regulation of protein synthesis. However, most of them suggested that the elongation phase is not regulated in a condition-specific manner and is rather 'static'. Here, we employ novel computational approaches applied to ribosome profiling data to estimate for the first time the distinct changes in translation elongation and initiation at multiple time points during yeast meiosis. We show that codon decoding rates and thus mRNA elongation rates change dynamically and substantially during meiosis to facilitate the translation of transcripts whose proteins are required at specific time points. Our approach captured a unique elongation pattern at the onset of anaphase II that was invisible to previous translational analyses. In particular, we identified a large cluster of lowly expressed genes involved in sister chromatid segregation that showed a strong temporal shift toward increased elongation efficiency precisely when these processes occurred. Also at this time point, the elongation of the ribosomal proteins is decreased but their initiation is maintained to promote the translation of these anaphase II genes. Our analysis provides new insights into gene expression regulation during meiosis and demonstrates a functional role of translation elongation dynamics.
Short Abstract: Background. Heart failure affects 2–3% of the adult Western population and its prevalence is increasing, in particular the proportion of heart failure with preserved (P) left ventricular (LV) ejection fraction (EF). We hypothesized that patients undergoing elective coronary bypass surgery (CABG) with PEF physiology will show distinctive gene expression compared to patients with normal LV physiology. Methods. Cardiac biopsies from the left ventricle were obtained from the CABG patients. The patients were divided into two groups, Normal or PEF physiology, according to echocardiography, NTproBNP levels and HF guideline definitions. Results. Of a total of 16 patients, 5 were classified as having PEF and 11 as having Normal physiology. Utilizing principal component analysis on batch-corrected, normalized gene expression data, the samples clearly clustered into these two groups. A total of 743 differentially expressed genes were identified and analyzed to characterize functional correlations and regulatory properties. We found that the top biological functions associated with down-regulated genes in PEF were cardiac muscle contraction, oxidative phosphorylation, endocytosis and matrix organization. Conclusions. This exploratory study could confirm our hypothesis that patients undergoing elective CABG with PEF physiology had distinctive gene expression compared to patients with normal physiology.
Short Abstract: The effects of consumption of insoluble proteins on physiology, including the host response and the microbial community, have not been studied in chickens. In this study, we adapted a novel approach that combines RNA-based microbial identification with host gene expression to characterize and validate metagenomic taxonomic profiling and to elucidate the genomic responses in the intestines of chickens fed gluten as an insoluble protein. Using whole metagenomic shotgun RNA sequencing, we identified and compared the microbial communities of gluten-fed and control chickens (1-week old and 4-week old, respectively). Microbial reads were used to characterize the microbial diversity and potential differences between those groups. Chicken reads were used to estimate the expression of known genes involved in the host response to gluten and to detect potential differences between those groups. We identified 289 differentially expressed host genes (DEGs) in the comparison of those groups. KEGG pathway analysis of these DEGs identified the PPAR signaling pathway and ribosome as pathways enriched by gluten uptake. Microbial communities in the small intestine of chickens also differed significantly between those groups; in particular, Shigella sonnei was significantly overrepresented in gluten-fed chickens, showing that the dual RNA-sequencing approach can be applied to dissect the interactions between host and microbes.
Short Abstract: What do we truly measure with RNA-Seq? In eukaryotes, aberrant mRNAs are eliminated by mRNA surveillance pathways whereas canonical mRNAs are degraded by deadenylation, decapping and exonucleolysis. Until recently, decay and translation were considered distinct processes but new studies are beginning to show otherwise. In this work, we simultaneously capture the 3' and 5' ends of capped and polyadenylated RNAs respectively, in human cells, in vivo. We integrated this with large-scale genomic datasets and found that unexpectedly, mRNAs are subject to repeated, cotranslational, ribosome-phased, endonucleolytic cuts in a process that we termed ribothrypsis. We showed that mRNA decay is initiated by a ribosome stall that triggers an endonucleolytic cleavage and propagates by upstream ribosomes cleavages. Ribothrypsis is a conserved process with potential regulatory roles and can be triggered by G-quadruplexes. Our results demonstrate a cotranslational mRNA decay far beyond expectations with a remarkable ~64% of the 3′ ends of capped RNAs and ~63% of the 5′ ends of polyadenylated RNAs mapping within coding sequences. Also, cells are awash with mRNA fragments, residuals of ribothrypsis, challenging the central assumption behind profiling methods such as RNA-Seq, microarrays and RT-PCR, that mRNAs exist as full-length molecules in cells.
Short Abstract: The role of lncRNAs in the extensive genomic and epigenetic responses of mammalian liver to xenobiotic exposure remains elusive. Here, we analyzed 115 liver RNA-seq data sets from male rats exposed to 27 chemicals representing diverse mechanisms of action, ranging from activation of nuclear receptors to induction of DNA damage, to assemble the long non-coding transcriptome. We characterized gene structures and response patterns for 5798 rat liver lncRNAs, of which 1447 were differentially expressed by xenobiotic exposure. Remarkably, 280 of these lncRNAs responded to >10 of the 27 xenobiotics. In most cases, chemicals with a common mode of action clustered tightly based on gene expression pattern. Weighted Correlation Network Analysis (WGCNA) identified lncRNA-PCG regulatory modules enriched for specific biological functions, and revealed putative regulatory lncRNAs occupying key points (hubs) in co-expression networks with genes involved in liver metabolism and hepatotoxicity. These putative lncRNA regulators showed strong co-expression patterns with local (cis effect) and distal PCGs (trans effect). Many of these PCGs belonged to the Cyp and Sult families of genes with known involvement in xenobiotic metabolism. Our findings will guide further mechanistic research on the roles of these lncRNAs in the hepatotoxicity or detoxification responses to diverse chemical exposures.
Short Abstract: Formation of RNA structure begins during RNA transcription, and the final folded structure can depend on the series of folding events during transcription called a cotranscriptional folding pathway. However, few methods exist that can generate high resolution models of this ubiquitous folding process that is important for many biological processes including gene expression, splicing, and macromolecular assembly. Previous work showed improvement in equilibrium RNA structure predictions when experimental RNA structure probing data is incorporated in the algorithm. Not many existing computational methods can predict out of equilibrium structures that occur during transcription and none of these take RNA structure probing data as input to guide the predictions. We present a novel method to predict RNA secondary and tertiary structure from cotranscriptional SHAPE-Seq data called Reconstructing RNA Dynamics from Data (R2D2). We applied R2D2 to the E. coli Signal Recognition Particle (SRP) RNA. Our predictions informed a point mutation design that disrupts the wildtype cotranscriptional folding pathway and precludes the formation of the wildtype final structure, which is predicted by computational minimum free energy structure methods. Overall the R2D2 algorithm provides a powerful starting point for utilizing experimental data to gain deeper insights into cotranscriptional RNA folding and its biological impacts.
Short Abstract: Mechanically ventilated patients in the intensive care unit (ICU) are frequently exposed to unnecessary antibiotics. Novel approaches to exclude bacterial pneumonia in critically ill patients are urgently needed to avoid antibiotic-induced complications. We used RNA-Seq to analyze the mRNA transcriptome in flow-sorted alveolar macrophages collected from mechanically ventilated patients with and without bacterial pneumonia defined by quantitative culture. A transcriptional signature of bacterial infection was present in both resident and recruited alveolar macrophages. Gene signatures from both cell types identified patients with bacterial pneumonia. Test characteristics were used to construct a positive prediction model. Informative transcriptomic biomarkers can be generated from BAL fluid obtained during routine clinical care in the ICU. Transcriptomic profiling of BAL fluid offers promise in aiding antibiotic stewardship efforts in the ICU.
Short Abstract: Background: Alternative splicing has a key role in increasing transcriptome diversity. Since differential splicing is prevalent among tissues, cell types and developmental stages, its misregulation can lead to diseases. This motivates computational research efforts to uncover splicing regulatory mechanisms. Splicing codes are probabilistic graphical models that predict splicing outcome in different conditions. The connections in these models can be queried to understand the contribution of different regulatory mechanisms towards the splicing outcome. A key limitation of previous splicing codes is that they model only cassette/exon-skipping events. Results: Here, we propose a computational framework that extends the work from Jha et al. 2017 in three directions. First, we introduce a unified framework for splicing code for alternative 3’ and 5’ events in addition to the exon-skipping events. Second, we improve the framework to handle inherent structure in the splicing data, modeling it explicitly. Finally, we develop a convolutional neural network that learns motifs from the RNA sequence de novo while making use of the existing and newly added features. We evaluate the new framework on diverse tissue datasets from human and mouse and demonstrate its improvement compared to previous models.
Short Abstract: Despite the long-held assumption that transposons are normally only expressed in the germ-line, recently we learned that full length or partial transcripts of LINE1 are frequently found in the somatic cells. However, the extent of variation in LINE1 transcript levels across different tissues and different individuals, and the genes and pathways that are co-expressed with LINE1 are unknown. Here we report the extent of tissue-specific variation in LINE1 expression levels across tissues and between individuals observed in the normal tissues collected for The Cancer Genome Atlas (TCGA). Our results confirm earlier reports of higher L1HS expression in the esophagus and stomach tissue. We also show that mitochondrial functions are enriched among the genes that show negative correlation with L1HS in transcript level, and that PHD fingers, bromodomains and KRAB-zinc fingers (KRAB-ZFPs) are enriched among the genes positively co-expressed with L1HS. The stable tissue-specific expression of individual LINE1 integrants and their correlated expression with KRAB-ZFPs support the hypothesis that specific LINE1 integrants are co-opted as part of the human gene regulation network, with many KRAB-ZFPs as their activators.
Short Abstract: High-throughput methods are commonly used to study polygenic disorders. Despite the increasing use of single-cell data, the main source for these analyses remains bulk tissue, especially in the field of neuropsychiatric and neurodevelopmental disorders. However, while bulk tissue data are relatively abundant, the analysis and interpretation of these data are not straightforward since the observed transcriptional alterations can represent alterations in cellular densities as well as functional or regulatory changes. We have previously demonstrated that cellular marker genes can be used to infer cell-type specific changes from brain bulk tissue expression data. We have now implemented this approach in the analysis of multiple datasets of bipolar disorder and schizophrenia subjects, demonstrating robust changes in astrocytes and parvalbumin cells in both disorders. Importantly, accounting for alterations in these two cell types had a dramatic effect on the outcome of differential expression and functional enrichment analyses. Specifically, our results indicate that the previously reported downregulation of mitochondria-related genes might merely be an outcome of a decrease in parvalbumin interneurons exhibiting high expression of these genes rather than a global alteration in mitochondrial function. Our results emphasize that analysis and interpretation of bulk tissue data should always be done with the consideration of possible cell-type specific alterations.
Short Abstract: The importance of chemical modifications of RNA sequences in different biological contexts is being increasingly appreciated, giving rise to the field of RNA epigenetics. A pivotal challenge in this area is the identification of modified RNA residues within their sequence contexts. Mass spectrometry would offer a solution by using approaches analogous to shotgun proteomics. However, software support for the necessary data analyses is currently lacking. In particular, search engines that match tandem mass spectra to theoretical spectra derived from sequence databases are required. We present a database search engine for RNA sequences, developed in C++ within the OpenMS framework for computational mass spectrometry. We implemented classes representing endonucleases, (modified) ribonucleotides, RNA sequences, and a corresponding generator for theoretical spectra. We integrated modification data from the MODOMICS database and developed an output format for RNA identification results based on the proteomics standard mzTab. Finally, we added visualisation capabilities for these results to OpenMS’ viewer application. Our search engine supports the estimation of false discovery rates (FDR) based on target-decoy search strategies. We evaluated the performance of our software based on two benchmark samples, containing modified and unmodified versions of in vitro transcribed and chemically synthesised RNA, respectively, with promising initial results.
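The target-decoy strategy mentioned above estimates the false discovery rate from the number of decoy matches that pass a score threshold. The following is a minimal sketch under stated assumptions: it uses the simple decoys/targets ratio, which is one common convention and may differ from the formula used in OpenMS, and the function name, data layout, and scores are hypothetical.

```python
def fdr_at_threshold(matches, threshold):
    """Estimate the FDR among target matches scoring at or above a threshold.

    matches: iterable of (score, is_decoy) pairs, higher score = better match.
    Uses the decoys/targets ratio (an assumption; other conventions exist).
    """
    targets = sum(1 for score, is_decoy in matches if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in matches if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0


# Hypothetical spectrum matches against a combined target and decoy database.
matches = [(10.0, False), (9.0, False), (8.0, True), (7.0, False), (6.0, True)]

for cutoff in (9.0, 7.0):
    print(cutoff, fdr_at_threshold(matches, cutoff))
```

The rationale is that matches to decoy sequences (for example, reversed or shuffled RNA sequences) can only be false, so their count above a threshold approximates the number of false matches among the targets at that threshold.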
Short Abstract: RNA-binding proteins (RBPs) are key players in several post-transcriptional regulatory mechanisms. High-throughput technologies have led to the identification of a large number of RBPs and RNA-binding regions. Although experimental methods have increased the repertoire of RBPs in model systems, the repertoire of RBPs across species is far from complete. In this study, we developed a computational pipeline to predict RNA-binding proteins using RNA-binding domain (RBD) and homology information. Our approach involved using peptides of RNA-binding regions from 529 RBPs and a dataset of 1344 experimentally known human RBPs as a reference set. Domain-based prediction using HMMER was integrated with homology information to obtain a genome-wide prediction of RBPs. Benchmarking of our predictions against mouse genes annotated with the GO term 'RNA Binding' resulted in a precision of 60% and a recall of 75%. An average of 1750 RBPs were identified across eukaryotes, with a few lower-order species exhibiting fewer RBPs, suggestive of the divergence of the RBP repertoire in distant relatives. In contrast to transcription factors and kinases, RBPs exhibited an increase in their number with increase in genome size. A co-occurrence network of RBDs revealed prominent enrichment of classical RBDs with other domains.
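The benchmarking step in the abstract above compares a predicted gene set against a GO-annotated reference set. A minimal Python sketch of that kind of comparison follows; the function name and the gene identifiers are hypothetical, and the toy sets were chosen so that the ratios match the reported 60% and 75%. They are not the study's data.

```python
def precision_recall(predicted, reference):
    """Return (precision, recall) of a predicted set against a reference set."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Toy example with made-up gene names (illustration only).
predicted = {"geneA", "geneB", "geneC", "geneD", "geneE"}
reference = {"geneA", "geneB", "geneC", "geneF"}

precision, recall = precision_recall(predicted, reference)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Precision is the fraction of predictions that appear in the reference, and recall is the fraction of the reference that was recovered; a reference built from GO annotation is itself incomplete, so precision measured this way is likely a lower bound.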
Short Abstract: Alternative splicing and alternative 5' and 3' splice sites are an important evolutionary advantage for eukaryotes since they allow a gene to have more than one product. There are seven canonical models describing basic splicing mechanisms, but they are not sufficient for the representation of complex splicing events. In the literature, there is already an established diagrammatic visual language for comparing gene models of alternative gene transcripts. This work presents a new formal regular language that allows such alignments to be represented in multiple valid ways with different levels of generalization. More general representations cover a much greater variety of gene models. These multiple valid representations for an alignment are nested, with an IS-A relation between them. Such relations form a partially ordered set, or directed acyclic graph, that can be used for refined summarizing of all alternative gene transcripts in a genome. This is the basis for our work on analysing alternative transcripts in evolution and evaluating genome annotation maturity.
Short Abstract: Analysis of gene expression in whole tissues remains an important tool for the study of neurological disorders. These types of analyses are complicated by the heterogeneity of brain tissues, which makes it difficult to differentiate cell type specific differences from global changes in gene expression. Recently, we published a method for summarising expression of cell type markers using principal component analysis (marker gene profiles) and demonstrated that they reflect cell type proportion changes in select datasets. We now expand the scope of our analysis by introducing quality metrics to ensure that the differences in marker gene profiles reflect cell type specific changes in arbitrary whole tissue studies. Our quality metrics examine the effect size, the number of correlating cell type markers and how much variance is explained by the first principal component of marker gene expression. We show that quality metrics are useful in differentiating between true positives (whole tissue studies with known differences in cell type proportions) and false positives (studies where no cell type proportion differences are expected). Finally, we examine marker gene profiles in ~400 previously published whole tissue datasets from mouse and human brains to detect cell type specific differences not analysed by the original studies.
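The idea of a marker gene profile described in the abstract above, together with two of its quality metrics, can be sketched in Python with NumPy. This is an illustrative reconstruction under stated assumptions, not the authors' published code: the per-gene scaling, the sign convention, the function name `marker_gene_profile`, and the simulated data are all assumptions.

```python
import numpy as np


def marker_gene_profile(expr):
    """Summarise a samples x marker-genes matrix by its first principal component.

    Returns per-sample scores, per-gene loadings, and the fraction of variance
    explained by PC1. Scaling each gene to unit variance is an assumption.
    """
    centered = expr - expr.mean(axis=0)
    scaled = centered / expr.std(axis=0, ddof=1)
    # PCA via SVD: rows of vt are principal axes, u * s are sample scores.
    u, s, vt = np.linalg.svd(scaled, full_matrices=False)
    scores, loadings = u[:, 0] * s[0], vt[0]
    # Orient PC1 so that most markers load positively (assumed convention).
    if loadings.sum() < 0:
        scores, loadings = -scores, -loadings
    var_explained = s[0] ** 2 / np.sum(s ** 2)
    return scores, loadings, var_explained


# Simulated data: 12 samples whose cell type proportion rises from 0 to 1,
# and 4 hypothetical marker genes that track that proportion with noise.
rng = np.random.default_rng(0)
proportion = np.linspace(0.0, 1.0, 12)
expr = proportion[:, None] * np.array([2.0, 1.5, 1.0, 2.5])
expr = expr + rng.normal(scale=0.1, size=expr.shape)

scores, loadings, var_explained = marker_gene_profile(expr)
n_consistent_markers = int((loadings > 0).sum())
print(var_explained, n_consistent_markers)
```

Here the variance explained by PC1 and the number of markers loading in the same direction stand in for two of the quality metrics named in the abstract; a low value for either would suggest that the profile does not reflect a shared, proportion-driven signal.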
Short Abstract: Coxsackievirus B3 (CVB3) is a cardiovirulent enterovirus from the family Picornaviridae. The RNA genome houses an internal ribosome entry site (IRES) in the 5’ untranslated region (5’UTR) that enables cap-independent translation. Ample evidence suggests that the structure of the 5’UTR is a critical element for virulence. We probe RNA structure in solution using base-specific modifying agents such as dimethyl sulfate as well as backbone-targeting agents such as N-methylisatoic anhydride, used in Selective 2’-Hydroxyl Acylation Analyzed by Primer Extension (SHAPE). We have developed a pipeline that merges and evaluates base-specific and SHAPE data together with statistical analyses that provide confidence intervals for reactivity values. Combining the “2%-8% rule” for normalization with base-specific mean and standard deviation calculations, ANOVA and multiple comparison procedures, we generate confidence intervals for each position, thereby verifying resulting secondary and tertiary structure models. Our datasets demonstrate that reactivity of each nucleotide base primarily parallels modification of the backbone, but not at every position. Using reactivity values validated by our statistical analyses, we are now in a position to provide base-by-base analysis of RNA structural transitions. Understanding these transitions extends our previous comparative analysis of genomes from virulent and avirulent serotypes and sequential structural states during RNA-protein interaction.
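The abstract above does not define the “2%-8% rule”. The Python sketch below follows one common reading from the SHAPE literature: the top 2% of reactivities are treated as outliers, and all values are divided by the mean of the next 8%. The function name, the rounding of the cut-offs, and the toy input are assumptions, and the authors' pipeline may differ.

```python
import numpy as np


def normalize_2_8(reactivities):
    """Normalize reactivities by the mean of the 8% of values below the top 2%.

    NaN entries (positions without data) are ignored when computing the scale
    factor and remain NaN in the output.
    """
    r = np.asarray(reactivities, dtype=float)
    ranked = np.sort(r[~np.isnan(r)])[::-1]       # highest reactivity first
    n_outliers = int(round(0.02 * ranked.size))   # top 2% treated as outliers
    n_next = max(1, int(round(0.08 * ranked.size)))
    scale = ranked[n_outliers:n_outliers + n_next].mean()
    return r / scale


# Toy profile of 100 positions with raw reactivities 1..100.
raw = np.arange(1, 101, dtype=float)
normalized = normalize_2_8(raw)
print(normalized.max())
```

With this toy profile the two highest values (100 and 99) are set aside and the scale factor is the mean of 98 down to 91, so highly reactive positions end up near 1.0 rather than being dominated by a few extreme values.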
Short Abstract: The accelerating rate of evidence discovery for the role of long non-coding RNAs (lncRNAs) in various critical biochemical, cellular and physiological processes is creating a need for robust lncRNA annotation resources. Although there is a plethora of resources for annotating protein-coding genes, resources with lncRNA-ontology annotations are rare. Here, we present the lncRNA annotation extractor and repository, Lantern (http://www.iupui.edu/~sysbio/lantern/), which provides high quality-controlled ontology annotations, extracted by mining recent lncRNA literature using a robust Natural Language Processing (NLP) based approach. LncRNA-relevant literature was obtained as a corpus of abstracts, and ontology annotations were extracted with NCBO’s ontology-recommender system in a semi-automated pipeline. Benchmarking analysis was performed by evaluating the extracted annotations against lncRNAdb’s manually curated free-text. Lantern’s extracted annotations have a recall of 0.62 and precision of 0.8. Lantern’s web-interface not only provides Gene, Human Phenotype, SNOMEDCT and Disease Ontology annotations, but also houses an extensive range of functional omics data, such as tissue-specific lncRNA expression, lncRNA-RBP interactions, lncRNA-protein co-expression, coding-potential, sub-cellular localization and SNPs on lncRNAs, computed and extracted via robust NGS pipelines. This makes Lantern a holistic resource for improving the understanding of the noncoding transcriptome, with extracted annotations and functional associations for ~11,000 lncRNAs in the human genome.
Short Abstract: Long non-coding RNAs (lncRNAs) are emerging as important regulators of various biological processes. While many studies have exploited public resources such as The Cancer Genome Atlas to study lncRNAs in cancer, it is crucial to choose the optimal method for expression quantification of lncRNAs. In this benchmarking study, we compared the performance of pseudoalignment methods Kallisto and Salmon, and alignment-based methods HTSeq, featureCounts, and RSEM, in lncRNA quantification, by applying them to a simulated RNA-Seq dataset and a pan-cancer dataset. Pseudoalignment-based methods detect more lncRNAs than alignment-based methods and correlate highly with simulated ground truth, while alignment-based methods underestimate the expression for some lncRNAs, including cancer-relevant lncRNAs TERC and ZEB2-AS1. Overall, 10-16% of lncRNAs are detected in the samples, with antisense and lincRNAs the two most abundant categories. A higher proportion of antisense RNAs are detected than lincRNAs. Moreover, antisense RNAs and lncRNAs with fewer transcripts, fewer than three exons, and lower sequence uniqueness are more discordant with ground truth. Full transcriptome annotation, including both protein coding and noncoding RNAs, greatly improves the specificity of lncRNA quantification. In summary, pseudoalignment methods Kallisto or Salmon in combination with full transcriptome annotation is our recommended strategy for RNA-Seq analysis for lncRNAs.
Short Abstract: Third generation sequencing platforms produce reads from DNA molecules with much larger read lengths than second generation sequencing platforms but with lower throughput. For transcriptome sequencing, long reads have been used to construct full-length gene isoforms, while higher throughput short reads have remained popular for quantifying isoform abundance. PacBio long read RNA sequencing also requires a size selection step to alleviate bias due to sequencer preference for shorter molecules. Here, we have developed a method for gene isoform abundance quantification using long reads that allows for ambiguity in assignment of reads to isoforms and accounts for sampling bias due to isoform length. We conducted numerical studies to understand situations where bias correction is necessary and analyzed statistical properties of our method. We also analyzed short reads and long reads simulated from the human transcriptome to understand how read length, number of reads, and repetitive regions of the genome impact abundance quantification. Further, to evaluate the adequacy of our method to improve quantification, we compared our method to a standard quantification approach on three long read data sets.
Short Abstract: TDP-43 is an RNA-binding protein associated with the neurodegenerative diseases amyotrophic lateral sclerosis (ALS) and frontotemporal dementia (FTD). This protein is ubiquitously expressed and is localized primarily in the nucleus where it regulates pre-mRNA processing, such as alternative splicing. While many datasets have been generated in order to elucidate the regulatory targets of TDP-43, previous studies focused on a single context or type of splicing event (e.g. cryptic exons). Furthermore, little effort has been made to link TDP-43 regulated splicing to changes seen in ALS patients on a transcriptome-wide scale. To this end, we performed a comprehensive meta-analysis of available RNA-seq data using MAJIQ, a splicing quantification algorithm that captures both classical and complex splicing variations. Our analysis includes TDP-43 depletion, knockout, overexpression, and mutations across a wide array of metazoans including human, mouse, Drosophila, and C. elegans. We also include samples from control and ALS patient tissues including cortex, cerebellum, spine, and PBMCs. Strikingly, we find common splicing variations that are regulated by TDP-43 across evolution and are dysregulated in ALS patients. Finally, analysis of sequence features using probabilistic splicing code models for TDP-43 regulated events or those dysregulated in ALS showed common regulatory signatures, suggesting novel co-regulators.
Short Abstract: The Zika virus (ZIKV) can cause a congenital syndrome that leads to early brain development impairment by affecting neural progenitor cells. However, the molecular mechanisms of this pathology have not been established. We employed whole transcriptome sequencing of human neurospheres exposed to ZIKV isolated in Brazil (AB strain). In addition, we also investigated changes in gene expression induced by ZIKV MR-766 and Dengue 2. When comparing the Brazilian ZIKV strain with ZIKV 766 and Dengue 2, 455 and 91 differentially expressed genes (DEGs) were found, respectively. Several DEGs involved in the regulation of actin and cytoskeleton were overexpressed in both ZIKV strains, but significantly more in the AB strain. The same was true for the gap junction and tight junction pathways. GO analysis revealed a significant increase in the biological processes of GO categories such as “cell adhesion”, “cell projection morphogenesis”, “cell morphogenesis involved in neuron differentiation” and “neuron projection morphogenesis” in the Brazilian Zika strain, when compared to ZIKV 766 and Dengue 2. Several studies have identified alterations of the cytoskeleton by some flaviviruses, including Dengue 2 and Zika, but our results suggest that the AB strain can disrupt cytoskeletal reorganization more effectively than ZIKV 766 and Dengue 2.
Short Abstract: The growing power and falling cost of RNA sequencing (RNAseq) technologies have resulted in an explosion of RNAseq dataset production. Comparing gene expression values within RNAseq datasets is relatively trivial for many interdisciplinary biomedical researchers. However, more complex analyses and deep exploration of multiple datasets are bottlenecked by the availability of highly skilled bioinformaticians. ROGUE (RNAseq Ontology Graphic User Environment) is a user-friendly R Shiny application that allows a biologist to perform differentially expressed gene analysis, gene ontology and pathway enrichment analysis, potential biomarker identification, and advanced statistical analyses. Here we use ROGUE to identify potential biomarkers and show unique enriched pathways between various tumors derived from the neural crest. User-friendly tools for the advanced analyses of NGS data, such as ROGUE, will allow biologists to efficiently explore their datasets, discover expression patterns, and advance their research with the development and testing of hypotheses in the absence of a bioinformatician.