Accepted Posters
If you need assistance please contact submissions@iscb.org and provide your poster title or submission ID.
Track: RNA
Short Abstract: Mutually exclusive splicing is an important splicing pattern that gives rise to functionally distinct proteins. It appears to be highly specific since it gives rise to mRNA variants that contain exactly one exon from a set of mutually exclusive exons. Several mechanisms have been recognized to drive mutually exclusive splicing. Of these, the formation of long-range secondary structures in mRNA is of particular interest since it can explain mutually exclusive splicing of numerous cassette exons.
Our goal is to examine mutually exclusive splicing events in a set of closely related Drosophila species. To this end, we built a tool for joint visualization of splicing events and competitive RNA secondary structures. We hypothesize that since mutually exclusive exons in most cases result from tandem exon duplication, such competitive regulatory elements may also result from duplications of selector sequences.
Our hypothesis suggests an evolutionary mechanism for the generation of mutually exclusive splicing patterns: tandem exon duplication occurs together with duplication of neighbouring selector sequences, which creates competitive RNA secondary structures. In theory, this mechanism might also be applicable to other complex splicing events such as multiple exon skipping, where tandem exon duplication should happen within the base pairing region, leading to a mutually inclusive splicing pattern. However, here we show that in Drosophila this mechanism is likely to be unique to mutually exclusive exons. For instance, in other complex splicing events such as multiple exon skipping there is little or no evidence of tandem exon duplication.
Short Abstract: Recent years saw an explosion in the discovery of ncRNAs in all kingdoms of life. NcRNAs commonly interact with other RNAs, mainly mRNAs, by complementary base pairing, to carry out their regulatory function. The diversity and large number of ncRNAs demands a high-throughput method to identify and unravel targets and functions of ncRNAs. Psoralen crosslinking is a very promising tool for the detection of RNA-RNA interactions in vivo and in a transcriptome-wide fashion. Direct-Duplex-Detection methods (DDD) [1] rely on the Psoralen-mediated reversible crosslinking of interacting RNAs, their ligation to obtain a single RNA molecule, subsequent sequencing and mapping of reads to the genome. The duplex RNAs, obtained after nuclease digestion to enrich for crosslinked RNAs, may often be (nearly) blunt-ended and thus a poor substrate for the ligation of the 5´-end to the 3´-end at one side of the duplex. We improved the ligation efficiency by adding nucleotides at the 3´-ends through terminal deoxynucleotidyl transferase (TdT) treatment. Thereby, the overhang gains flexibility necessary for efficient ligation. Here we present the DDD protocol optimized in this way and the results we obtained for the RNA-RNA interactome of E. coli under standard growth conditions. In comparison to results from another group [2], we detected more interactions in total and recovered more known interactions when compared with sRNATarBase, which holds experimentally verified RNA-RNA interactions. References 1. Weidmann et al. (2016) Trends in Biochemical Sciences. 41(9) 2. Liu et al. (2017) BMC Genomics, 18, 343
Short Abstract: The ability of RNA to base-pair with itself and other RNAs is crucial for its function in vivo. While there are reasonable approaches to map RNA secondary structures genome-wide, understanding how different RNAs interact to carry out their regulatory functions requires mapping of intermolecular base pairs. Recently, different strategies to detect RNA-RNA duplexes in cells, termed direct duplex detection (DDD) methods (reviewed in [1]) have been developed. Common to all is that they rely on Psoralen mediated in vivo crosslinking and RNA Proximity Ligation (RPL) [2], which covalently links the interacting RNA strands. Subsequently, the RNA is sequenced using RNA-seq and analyzed with respect to inter- and intramolecular RNA-RNA interactions. The methods that have been used so far implement strict algorithms that lack a sophisticated processing of the reads and tend to miss captured interactions. In this work, we present a general pipeline for the inference of RNA-RNA interactions from raw DDD reads. We applied our pipeline to data from different DDD methods and compared our results to the original ones. This showed that our method due to its tolerant primary data analysis reconstructs more information about known and novel RNA-RNA interactions that otherwise would have been lost. In order to ensure comparability between the established and future DDD methods there is a need for a standardized pipeline to analyze chimeric reads to infer inter- and intramolecular interactions and to guarantee the reproducibility of the analysis. References 1. Weidmann C.A. et al. (2016) Trends in Biochemical Sciences. 2. Ramani V. et al. (2015) Nature Biotechnology.
Short Abstract: In human cells, mature microRNAs (miRNAs) are produced from primary precursors (pri-miRNA) through an intermediate pre-miRNA step, with or without the use of canonical protein machinery that includes Drosha/DGCR8 and Dicer. The complexity of the miRNA maturation process arises from the involvement of multiple other regulatory proteins that bind directly to distinct miRNA precursors in a sequence- or structure-dependent manner. Thus far, a number of proteins were shown to bind to the terminal loop of miRNA precursors (e.g., hnRNPA1, HuR, KSRP, Lin28, MBNL1, MCPIP1) and other auxiliary proteins were demonstrated to interact with the stem portion or flanking sequences of pre- and pri-miRNA. In plants different proteins are involved in miRNA biogenesis, and in both animal and plant systems multiple auxiliary components of their miRNA biogenesis machineries remain to be identified. To facilitate their discovery, we present here a web server that searches for miRNA precursors that can be recognized by diverse RNA binding proteins, based on known sequence motifs. The database used by the server contains known human, murine and A. thaliana pre-miRNAs. The server may also be used to predict new RNA binding protein motifs based on a group of user-provided sequences. We show examples of miRNAmotif applications, presenting precursors that contain a motif recognized by Lin28 and predicting motifs within pre-miRNA precursors that are recognized by DDX1.
Short Abstract: RNA binding proteins (RBPs) are essential for cell processes. Many RBPs recognize specific RNA binding sites characterized by specific short sequences known as binding motifs. Besides primary RNA sequence, the structure of the RNA target is known to play a major role in RBP-RNA recognition. Inferring both sequence and structure preferences of RBPs remains a big challenge. Here we present a novel method, named SMARTIV (Sequence and Structure Motif Enrichment Analysis for Ranked RNA daTa generated from In-Vivo binding experiments), for enriched motif discovery from in-vivo high-throughput RNA binding data. SMARTIV uses ranked numerical sequence scores from results of CrossLinking and ImmunoPrecipitation (CLIP) experiments and predicted secondary structure of the sequences to generate motifs. SMARTIV motifs are concisely represented in a combined sequence and structure 8-letter alphabet ACGUacgu (upper case for unpaired and lower case for paired nucleotides). SMARTIV is an extremely fast algorithm representing motif sequence and structure in an informative single logo and is available both as a stand-alone program and a user-friendly web-server (http://smartiv.technion.ac.il). The method is based on the ranked CLIP-data with no requirement to split the input data into bound and unbound datasets. SMARTIV provides data-driven p-value assessment for the detected motifs. We tested our method on CLIP-seq data for a variety of RBPs and show that our results are highly consistent with previously known sequence and structure binding preferences of the proteins.
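The combined 8-letter alphabet described in the abstract above can be illustrated with a minimal sketch. It assumes the secondary structure is supplied in dot-bracket notation; the function name `encode_sequence_structure` is hypothetical and is not part of SMARTIV.

```python
def encode_sequence_structure(sequence: str, dot_bracket: str) -> str:
    """Merge an RNA sequence and its dot-bracket structure into one string.

    Unpaired positions ('.') are written in upper case and paired positions
    ('(' or ')') in lower case, giving the 8-letter alphabet ACGUacgu.
    """
    if len(sequence) != len(dot_bracket):
        raise ValueError("sequence and structure must have equal length")
    encoded = []
    for base, state in zip(sequence.upper(), dot_bracket):
        if base not in "ACGU":
            raise ValueError(f"unexpected base: {base!r}")
        if state == ".":
            encoded.append(base)          # unpaired nucleotide
        elif state in "()":
            encoded.append(base.lower())  # paired nucleotide
        else:
            raise ValueError(f"unexpected structure symbol: {state!r}")
    return "".join(encoded)


# A small hairpin: three base pairs closing a four-nucleotide loop.
print(encode_sequence_structure("GGGAAAUCCC", "(((....)))"))  # gggAAAUccc
```

Encoding both layers in one string means that a single motif logo can carry sequence and pairing state at each position.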
Short Abstract: Inflammatory bowel disease (IBD) is a chronic intestinal disease entity comprising two major subtypes: Crohn’s disease (CD) and ulcerative colitis (UC). Because of extensive heterogeneity in the disease presentation, behavior, and response to treatment, the diagnostic distinction of CD and UC remains a clinical challenge. In this study we aimed to evaluate the suitability of sequencing-based isomiR expression profiling and state-of-the-art machine learning techniques for non-invasive diagnostic tests. Full blood was drawn from a total of 672 German and Swedish individuals, in order to profile isomiRs for the following traits: CD (138 treated, 44 untreated samples), UC (108 treated, 49 untreated samples) as well as different types of controls (333 samples). After normalization sequencing read count data was corrected for biological (country-of-origin and sex) as well as non-biological experimental variation (sequencing machine, run, chemistry and technician performing library preparation) using empirical Bayes adjustments. Subsequently, various types of (penalized) support vector machines (SVMs) were employed to solve binary classification problems, considering main diagnoses (CD, UC, controls) as well as subphenotypes (CD location, CD behaviour, UC extent) and evaluated with respect to model performance, stability and sparsity. In terms of median Matthews correlation coefficient (MCC) resulting models showed remarkable predictive performance estimated as being 1.00 (main diagnoses) or ranging from 0.66 to 0.76 (CD behavior), 0.68 to 0.76 (CD location) and 0.69 to 0.76 (UC extent), respectively, incorporating a median number of 754 to 1298 (non-penalized models) and 1 to 39 isomiRs (penalized models).
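For readers unfamiliar with the Matthews correlation coefficient used as the performance measure above, the following sketch computes it for a binary classification from the standard confusion-matrix formula. The labels in the example are invented for illustration and are unrelated to the study's data.

```python
import math


def matthews_corrcoef(y_true, y_pred):
    """Return the MCC for binary labels coded as 0 (negative) and 1 (positive)."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            fn += 1
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the MCC is set to 0 when any marginal total is zero.
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Invented example: one of two positives is missed.
print(round(matthews_corrcoef([1, 1, 0, 0], [1, 0, 0, 0]), 3))
```

An MCC of 1.00 corresponds to perfect agreement, 0 to chance-level prediction, and -1 to complete disagreement, which makes the measure informative for unbalanced class sizes.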
Short Abstract: MicroRNAs (miRNAs) are small noncoding regulatory RNAs, which are involved in complex regulatory processes including inhibition of translation and modulation of transcript stability. These molecules are implicated in the pathogenesis of oncological, immune-related, cardiac and other diseases. The discovery of circulating miRNAs in serum, plasma, and other body fluids has attracted great interest in biomarker research. To date, numerous studies have reported circulating miRNAs as biomarkers for a variety of diseases. However, the origin of these circulating miRNAs has been poorly examined. With this study, we provide a comprehensive reference dataset of detailed miRNA expression profiles from seven types of human peripheral blood cells, serum, exosomes and whole blood. The peripheral blood cells from buffy coats were typed and sorted using FACS/MACS. The overall dataset was generated from 450 small RNA libraries using high-throughput sequencing. We define the cell lineage-specific miRNA/isomiR expression and modification patterns. Furthermore, we identify novel cell-type specific miRNA candidates. The study provides the most comprehensive contribution to date towards a complete miRNA catalogue of human peripheral blood, which can be used as a reference for future studies. The dataset is publicly available on GEO and also can be explored interactively following this link: http://134.245.63.235/ikmb-tools/bloodmiRs.
Short Abstract: Background: Human cells respond to a broad range of stimuli with a characteristic burst of transcription within minutes at many sites across the genome; this underlies differentiation, responses to cellular stress, and inflammation. The earliest events involve the transient activation of the promoters of immediate-early genes (IEGs), a special class of genes dysregulated in developmental diseases and cancer [1, 2]. However, the core IEG repertoire active across cellular responses, and the mechanisms underlying IEG induction remain controversial. Results: Here we present a rigorous meta-analysis of 8 genome-wide FANTOM5 CAGE time course expression datasets [3]. These unique datasets measure the responses of different human cell types (including embryonic, immune and cancer cells) to many different stimuli over the first 5 hours post stimulus. Using novel approaches we classify promoter expression profiles into several predefined patterns, including a transient early peak representing IEG dynamics, to identify the likely core complement of IEGs. These genes are strongly enriched for known IEGs and relevant Gene Ontology terms, but also include a number of compelling novel IEG candidates, including transcription factors and noncoding RNAs. However, comparing genes classified as IEGs between datasets we found differences in their numbers and identity, which explain the heterogeneity and inconsistencies of the literature related to the immediate-early response. Furthermore, we show that the promoters of many candidate and known IEGs are activated in a consistent order across datasets, suggesting deeper levels of conservation in the regulation of immediate-early responses. Conclusions: Here we exploit unusual, densely sampled promoter expression datasets using novel approaches to estimate the core IEG response of human cells to cellular stimuli.
We discuss surprising candidate IEGs that may be important new factors in disease-related processes and the potential roles for non-coding RNAs in the immediate-early response. We also uncover unexpected conservation in the temporal order of promoter activation across stimuli and cell types. References: 1. Healy, S., P. Khan, and J.R. Davie, Immediate early response genes and cell transformation. Pharmacology & Therapeutics. 137(1): p. 64-77 (2013). 2. Fowler, T., R. Sen, and A.L. Roy, Regulation of primary response genes. Molecular Cell. 44(3): p. 348-360 (2011). 3. Lizio, M., et al., Update of the FANTOM web resource: high resolution transcriptome of diverse cell types in mammals. Nucleic Acids Research. 45(D1): p. D737-D743 (2017).
Short Abstract: Background and Methods: Hfq is a homohexameric RNA chaperone that stabilizes small noncoding RNAs (sRNAs) and facilitates riboregulation by promoting sRNA base pairing with target mRNAs. Recent studies showed that the C-terminus contributed to the stability of a subset of sRNAs (Class II sRNAs) and release of RNA from Hfq. Here we further investigate the global effect of the C-terminus on Hfq interactions in E. coli by comparing RNAseq data of wild type and C-terminus deleted Hfq65, using total RNA as well as Hfq immunoprecipitated (IP) RNA samples. Results: Comparing the IP Hfq65 mutant to IP wild type samples, 82 genes are 2 fold down regulated in Hfq65. Among them, 1 is an antisense RNA, 3 are tRNAs, and 12 are Hfq-binding sRNAs, including 3 of 4 Class II sRNAs. Many of the sRNA targets are among the down regulated mRNAs. Further investigation is being undertaken to confirm these findings and investigate the effects on tRNAs. Differences were also observed for the regions flanking mature tRNAs, possibly reflecting changes in tRNA processing or independent roles of these RNA regions. Conclusion: About 1/5 of sRNAs in the database used here are 2 fold down regulated in Hfq65. In vivo results thus far are consistent with the in vitro role for the C-terminus in modulating on and off rates for RNAs, but suggest that this is not rate-limiting for steady state RNA levels for most genes. A better computational tool is needed to capture and analyze RNAseq signals for tRNA precursors and non-coding regions.
Short Abstract: A large portion of the human transcriptome does not code for proteins. These non-coding RNAs (ncRNAs) have been shown to be associated with a large range of important cellular functions. Clustering of RNA sequences is currently one of the prevalent approaches for detecting and annotating the function of putative ncRNAs and regulatory elements. The structure of ncRNAs often plays a crucial role and is usually better conserved than the sequence, making it computationally more expensive to compare them with traditional algorithms. Thanks to the pervasive availability of transcriptomic and metatranscriptomic data, generated by high-throughput sequencing (HTS), efficient and easy-to-use approaches are in high demand. Here we introduce Galaxy-GraphClust, a workflow for large-scale clustering of RNAs based on sequential and structural similarity in linear time that is provided via the Galaxy framework. This extension of GraphClust considerably simplifies clustering and analysing large amounts of RNA sequences by making it possible to: a) interactively perform the clustering via a web interface, b) support back end computation on diverse platforms, ranging from personal computers to large scale computer clusters and the cloud, c) run both HTS transcriptomic analysis and structural clustering in a homogeneous manner. The highly modular design of Galaxy-GraphClust has made it possible to enhance the clustering performance by offering alternative RNA structure prediction and annotation tools and incorporating chemical structure probing data. We also present the applicability of the tool for predicting conserved structural motifs in the presence of noisy unrelated sequences and long surrounding contexts. Availability: https://github.com/BackofenLab/docker-galaxy-graphclust
Short Abstract: Massively parallel sequencing of transcriptomes revealed the presence of miRNA variants named isomiRs. The sequence variations identified within isomiR molecules with respect to the miRNA sequences with which they share the same seed can affect their targeting activity, with consequences for gene expression and a potential impact on multi-factorial diseases. miRNAs are considered good biomarkers, making their adoption for disease characterization highly desirable. Several methodologies and tools were devised to identify and quantify miRNAs from sequencing data. However, all these tools are built on top of general-purpose alignment algorithms, yielding results of limited accuracy and no information concerning isomiRs and conserved miRNA-mRNA interaction sites. To overcome these limitations we developed the isomiR-SEA algorithm. By implementing a miRNA-specific alignment procedure, isomiR-SEA analysis accounts for accurate miRNA/isomiR expression levels and for a precise evaluation of the conserved interaction sites. First, isomiR-SEA identifies miRNA seeds within the tags. If the seed is found, the alignment is extended and the positions of the encountered mismatches are recorded. Then, the collected information is evaluated to distinguish between miRNAs and isomiRs and to assess the conservation of the interaction sites. isomiR-SEA performance was assessed on 7 public RNA-Seq datasets. 40% of reads attributed to miRNAs (189M) come from mature miRNAs, 50% derive instead from 3’ isomiRs, and the remaining reads account for 5’/SNP isomiRs or combinations between them. Furthermore, about 2% of reads lost some interaction sites. This proves the importance of a miRNA-specific alignment algorithm to correctly evaluate miRNA targeting activity. For further information, please see eda.polito.it/isomir-sea/
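The seed-then-extend idea described above can be sketched in a few lines. This is a simplified illustration and not the isomiR-SEA implementation: it assumes an exact, ungapped seed match, takes the seed to be miRNA positions 2-8 (a common convention, assumed here), and uses hypothetical function names and an illustrative 22-nt sequence.

```python
def align_by_seed(mirna: str, tag: str, seed_start: int = 1, seed_len: int = 7):
    """Anchor a read (tag) on the miRNA seed and extend without gaps.

    Returns None when the seed is absent; otherwise a dict with the tag index
    aligned to miRNA position 1 (offset), the 1-based miRNA positions that
    mismatch, and the length difference at the 3' end.
    """
    seed = mirna[seed_start:seed_start + seed_len]
    pos = tag.find(seed)
    if pos == -1:
        return None
    offset = pos - seed_start
    mismatches = [
        i + 1
        for i, base in enumerate(mirna)
        if 0 <= i + offset < len(tag) and tag[i + offset] != base
    ]
    three_prime_diff = (len(tag) - offset) - len(mirna)
    return {"offset": offset, "mismatches": mismatches,
            "three_prime_diff": three_prime_diff}


def classify(result) -> str:
    """Label a read from its alignment record (hypothetical categories)."""
    if result is None:
        return "no seed"
    if (result["offset"] == 0 and not result["mismatches"]
            and result["three_prime_diff"] == 0):
        return "canonical miRNA"
    return "isomiR"


mirna = "UGAGGUAGUAGGUUGUAUAGUU"                # illustrative 22-nt sequence
variant = mirna[:11] + "A" + mirna[12:] + "A"   # substitution at position 12, one extra 3' base
print(align_by_seed(mirna, variant), classify(align_by_seed(mirna, variant)))
```

Recording mismatch positions relative to the miRNA, as opposed to the read, is what allows a later step to ask whether a known interaction site is still intact.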
Short Abstract: RNA binding sites for a protein of interest can now be detected genome-wide and at a high resolution thanks to the development of CLIP-seq technologies. Among these methods, iCLIP and eCLIP provide single-nucleotide resolution and are particularly powerful in characterizing protein-RNA interaction landscapes. However, current methods do not address both problems of peak calling and crosslink sites detection simultaneously, and fail to model the various sources of biases, such as transcript abundances or unspecific crosslink (CL) sequence motifs. We developed an approach based on a non-homogeneous Hidden Markov model, which calls individual crosslink sites taking into account both regions enriched in protein bound fragments and the specifics of iCLIP truncation patterns. Our modeling framework also incorporates information from various covariates, such as RNA abundances or information from CL-motifs. We extensively validated the superiority of our approach over other common strategies, both within a realistic iCLIP simulation setup (using real RNA-seq data) and on five published iCLIP/eCLIP datasets where the protein's predominant binding regions are known. Over a large range of simulation parameters, our tool recovers binding sites with a better accuracy than other methods. Further, on all real datasets our approach is more precise in determining the bona fide binding sites. Our results show the importance of combining peak calling and cross link site detection when analyzing iCLIP or eCLIP data. We also show that the incorporation of covariates (input signals, as well as CL-motifs) clearly improves the accuracy of the calls.
Short Abstract: Background: RNA-binding proteins (RBPs) play an important role in alternative splicing and other RNA processing steps. Mutations that occur in splice sites are the major risk factor for genetic diseases such as neurological disorders or cancer. Experimental validation of a mutation’s impact is expensive and time consuming. Thus, in silico predictions using primary RNA sequence provide a convenient way to score and prioritize mutations [1], but this requires a highly accurate computational model capturing the RBP binding “code” of cis-regulatory RNA elements. Results: Using a large collection of functional RBP binding sites derived for K562 and HepG2 cell lines from eCLIP-seq experiments within the ENCODE project [2], we developed a machine learning method that predicts RBP binding. The dataset contains nearly 80 RBPs profiled in a genome-wide manner with the eCLIP-seq method. As a background, we used regions that were sampled randomly from the human genome and matched by GC content with the foreground set. Notably, we used an unbalanced dataset with 20-100 times more negative regions in order to have a classifier with high specificity. To train our model we used convolutional neural networks. Several topologies were tested in order to select the network structure yielding the highest prediction power. To this end, we varied the size, the number of filters (to learn long RNA binding motifs from primary sequence), and the number of convolutional layers to select the best performing models to predict binding of a certain RBP type. Conclusion: Performance of our models varies depending on the RBP type. We also found that for the majority of RBPs, the optimal network topology converges to a certain architecture and that the number of filters, but not the number of convolutional layers, is the most sensitive hyperparameter with respect to the classification performance. We applied our classifiers to identify variants that may affect binding of regulatory proteins to RNA.
References [1] Li, X., Kazan, H., Lipshitz, H. D., & Morris, Q. D. (2014). Finding the target sites of RNA-binding proteins. WIREs RNA, 5, 111–130. http://doi.org/10.1002/wrna.1201 [2] Redesigning CLIP for efficiency, accuracy and speed. Nature Methods, 13(6), 482. http://doi.org/10.1038/NMETH.3870
Short Abstract: A class of structured non-coding RNAs is known to play various roles in the cells, but the annotation of these RNAs is still lacking even within the human genome. Part of the reason is ascribed to the fact that the currently available computational tools are either too computationally heavy for use in full genomic screens or rely on pre-aligned sequences. In this poster we present DotcodeR for detecting a set of structurally similar RNA pairs of predefined window length in a pair of genomic sequences by comparing their corresponding coarse-grained secondary structure dot plots at string level. This allows us to perform an all-against-all scan of all window pairs from two genomes without alignment. Our computational experiments with simulated data and real chromosomes show that DotcodeR has good sensitivity while reducing the search space drastically. Considering the results, DotcodeR can be useful as a prefilter in a comparative genomic scan for structured RNAs, which could be followed by a more rigorous approach to structural alignment for functional analysis of predicted RNA regions.
Short Abstract: The role of most RNA-binding proteins (RBPs) in alternative splicing (AS) remains unclear. Next-generation sequencing (NGS) assays enable the search for the splicing code, a model that can relate multiple cis- and trans-acting factors to differential exon usage. Contemporary differential exon usage (DEU) statistical tests compare multiple experimental conditions (e.g. RBP knockdowns) to a single reference condition. The emergence of datasets including hundreds of experimental conditions calls for tailored models to detect condition-specific changes in AS and uncover RBP-specific regulation. We design a novel statistical model, named Condition-specific differential exon expression (csDEX), to discover changes in exon usage that occur only in a small subset of conditions. The package supports both read count- and Percent spliced-in (PSI)-based exon expression quantification. We test for splicing changes on a real-size public dataset with 189 shRNA knockdown samples of different RBPs (including SRSF1, U2AF1/2, PTBP1, hnRNPs, TARDBP) provided by the ENCODE project. We demonstrate the advantages of PSI-based quantification when seeking changes in AS that are not due to gene expression. The precision of related methods is evaluated using the UCSC knownAlt annotation, where the csDEX PSI-based model retrieves known AS events with the highest precision (98%). The causal effect of RBP binding on AS is further validated by multiple independent data sources, such as RBP binding assays (eCLIP) and motif analysis. For TARDBP, the functional relevance was further verified by successfully retrieving cryptic exons known to be specifically TARDBP-regulated. We provide the first statistical package for computationally efficient detection of condition-specific AS changes in RNA-seq datasets with hundreds of experimental conditions. The predicted condition-specific changes in AS, verified by multiple independent data sources, provide functionally relevant candidates.
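The distinction between count-based and PSI-based quantification mentioned above can be made concrete with a small sketch. It uses the basic definition of percent spliced-in as inclusion reads over inclusion plus exclusion reads; the function name and read counts are invented, and csDEX itself may apply further normalization.

```python
def percent_spliced_in(inclusion_reads: int, exclusion_reads: int) -> float:
    """Return PSI, the fraction of reads supporting inclusion of an exon."""
    total = inclusion_reads + exclusion_reads
    if total == 0:
        raise ValueError("PSI is undefined when no reads cover the event")
    return inclusion_reads / total


# Same splicing ratio at two expression levels (invented counts).
low_expression = percent_spliced_in(30, 10)
high_expression = percent_spliced_in(60, 20)
print(low_expression, high_expression)  # 0.75 0.75
```

The raw exon count doubles between the two cases while PSI is unchanged, which is why a PSI-based measure is less likely to mistake a change in gene expression for a change in splicing.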
Short Abstract: The role of small RNA (sRNA) molecules in genome regulation is not fully understood. Micro RNAs (miRNA), small interfering RNAs (siRNA), piwi-interacting RNAs (piRNA) and trans-acting RNAs (taRNA) are some members of the sRNA family. Existing sRNA analysis tools predominantly focus on predicting novel miRNAs and piRNAs and on quantifying them. As a result, other classes of sRNA are either ignored or require custom-made scripts. Understanding the role of these sRNAs in diverse biological processes requires paying attention to minor details, e.g. sRNAs that originate from different strands and vary in length have different targets and act in different downstream pathways. No integrated computational solution exists to investigate novel sRNA data in an unbiased way. Hence, we developed a generic sRNA analysis tool capturing read counts along with strand specificity, length distribution, and base modification. Our tool also automatically produces numerous visualizations covering multiple categories required for sRNA analysis. Our tool allows various normalization techniques to compare different sRNA samples, tailored to different scenarios, e.g. knockdown of RNA interference components in the model organism. All analyses can be restricted to certain read lengths. For ease of use, our tool integrates an automated differential expression analysis using DESeq2. Finally, we present analyses of multiple datasets from different organisms. Our tool is designed to simplify the work of data analysts and offers a different perspective on available data. Our tool is available at: https://github.com/SchulzLab/RAPID.
Short Abstract: Random Sample Consensus (RANSAC) is a technique that has been widely and successfully used in areas such as computer vision for modeling data with a large amount of noise. RANSAC's algorithm randomly selects a number of observations and creates a model that is expanded with observations that are below a distance threshold, named inliers. This procedure is repeated k times, leading to a consensus model. We applied this technique in the biomedical area to both synthetic and clinical datasets, namely, the breast invasive carcinoma dataset from The Cancer Genome Atlas (TCGA-BRCA), including RNA-seq expression levels for both tumour and non-tumor tissues. The results of a baseline regularized logistic model, trained with all observations, are then compared against RANSAC. To evaluate the robustness of this method, the original dataset is perturbed by randomly changing the response class for 0% to 25% of the observations. At each step, 10 replicates were generated and the average misclassification rates and standard errors were obtained. In RANSAC, an observation is considered an outlier if it is not included in the model. Outlier observations are compared against misclassifications of the baseline model. The results show that RANSAC has high precision with the inlier observations, as expected, and is robust for increasingly perturbed data. In conclusion, the RANSAC results for these experiments show that this algorithm can identify a subset of observations for which the model is highly accurate, while simultaneously identifying outlier observations.
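The RANSAC loop summarized above (random sample, fit, collect inliers under a distance threshold, repeat k times) can be sketched on a toy problem. The example below fits a straight line where the abstract uses a regularized logistic model, purely to keep the illustration short; the data, threshold and function names are invented.

```python
import random


def fit_line(points):
    """Least-squares fit of y = a*x + b; returns None for a vertical sample."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        return None
    a = sum((x - mean_x) * (y - mean_y) for x, y in points) / sxx
    return a, mean_y - a * mean_x


def ransac_line(points, n_sample=2, threshold=1.0, k=100, seed=0):
    """Return (model, inliers, outliers) for the largest consensus set found."""
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(k):
        model = fit_line(rng.sample(points, n_sample))
        if model is None:
            continue
        a, b = model
        # Expand the sample with every observation below the distance threshold.
        inliers = [(x, y) for x, y in points if abs(y - (a * x + b)) <= threshold]
        if len(inliers) > len(best_inliers):
            refit = fit_line(inliers)
            if refit is not None:
                best_model, best_inliers = refit, inliers
    outliers = [p for p in points if p not in best_inliers]
    return best_model, best_inliers, outliers


# Ten points on y = 2x + 1 plus three gross outliers (invented data).
data = [(x, 2 * x + 1) for x in range(10)] + [(2, 30), (5, -20), (8, 40)]
model, inliers, outliers = ransac_line(data)
print(model, sorted(outliers))
```

Observations left out of the consensus set are the ones the abstract calls outliers, so the same run yields both a model and a list of suspect samples.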
Short Abstract: Motivation: Circular RNAs are a special class of RNA forming a covalently closed loop through a process called back-splicing. Potential functions have been shown for only a few well-studied circRNAs; these include miRNA sponging, RNA binding protein (RBP) sponging, and regulation of their host gene’s transcription. Circular RNAs can be identified in rRNA-depleted RNA-Sequencing data by detecting chimeric reads, which span a back-splice junction. A variety of circRNA detection tools exists but no tool is able to summarize and characterize the identified circRNAs. To perform accurate downstream analyses after circRNA detection, it is crucial to know the exact exon-intron structure of circRNAs. Here, I am presenting FUCHS and FUCHSdenovo to summarize circRNAs and reconstruct their exon-intron chain based on linear-splice signals of back-splice junction anchored reads. Results: Running FUCHS on mouse samples revealed that heart circRNAs are less diverse but more abundant than liver circRNAs. The average length of circRNAs was 500 bp. A de novo reconstruction of the inner circle structure using FUCHSdenovo showed a gain of information of 15%. Furthermore, FUCHSdenovo identified alternative splicing in 8-10% of circRNAs. To exemplify the value of the reconstructed circRNA models in downstream analyses, I performed a miRNA seed search and RBP motif search. Comparing the seed density of circRNAs and mRNAs showed that circRNAs were more densely populated with both miRNA seeds and RBP motifs, suggesting that circRNAs could form an additional layer in the gene-regulatory network by competing with their host genes for miRNA or RBP binding. Availability: https://github.com/dieterich-lab/FUCHS.git
Short Abstract: Antisense transcripts impact gene transcription in several different ways, affecting transcription initiation, such as overlapping antisense transcripts repressing initiation; transcription itself, where antisense transcripts can limit the length of the sense transcripts to shorter isoforms; or post-transcriptionally, when, for example, antisense transcripts can compete with sense transcripts for binding sites. Stranded RNA-Seq determines the strand from which an RNA fragment originates, and so can be used to identify where antisense transcription may be implicated in gene regulation. However, by analysing over 100 experiments across multiple organisms from both ENCODE and our own work, we show that spurious antisense reads are often present in experiments, and can manifest at levels greater than 1% of sense transcript levels. This is enough to disrupt analyses by causing spurious antisense counts to dominate the set of genes with high antisense transcription levels. Our tool RoSA (Removal of Spurious Antisense) detects the presence of high levels of non-authentic antisense transcripts, by analysing ERCC spike-in data to find the ratio of antisense:sense transcripts in the spike-ins. Similarly, RoSA will calculate a correction to the antisense counts based on either the spike-in antisense:sense ratio, or, where possible, using antisense and sense counts around splice sites to provide a gene-specific correction. We demonstrate the utility of our tool to filter authentic antisense transcript counts in an Arabidopsis thaliana RNA-Seq experiment.
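The spike-in based correction described above can be illustrated with a minimal sketch. It assumes that spike-in transcripts have no genuine antisense transcription, so any antisense reads on them are spurious, and that the correction simply subtracts the expected spurious count; the function names and counts are hypothetical, and the actual RoSA calculation may differ.

```python
def spurious_ratio(spikein_counts):
    """Estimate the antisense:sense ratio from (sense, antisense) spike-in counts."""
    total_sense = sum(sense for sense, _ in spikein_counts)
    total_antisense = sum(antisense for _, antisense in spikein_counts)
    if total_sense == 0:
        raise ValueError("no sense reads on spike-ins")
    return total_antisense / total_sense


def corrected_antisense(sense: float, antisense: float, ratio: float) -> float:
    """Subtract the expected spurious antisense reads, never going below zero."""
    return max(0.0, antisense - ratio * sense)


# Invented counts: spike-ins show about 1% spurious antisense signal.
ratio = spurious_ratio([(10000, 120), (5000, 40), (5000, 40)])
print(ratio)
print(corrected_antisense(3000, 50, ratio))  # some antisense signal remains
print(corrected_antisense(8000, 60, ratio))  # fully explained by the artefact
```

A highly expressed gene can carry many antisense reads purely because of the artefact, which is why the correction scales with the sense count.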
Short Abstract: We present the R package “RATs” – (Relative Abundance of Transcripts) – that identifies transcriptome-wide Differential Transcript Usage (DTU) directly from transcript abundance estimations, without requiring access to alignment or assembly information. RATs is agnostic to quantification methods and unique in that it exploits bootstrapped quantifications, if available, to inform the significance of detected DTU events. In addition, RATs shows the DTU results graphically, and achieves a median False Discovery Rate ≤0.05 even at low replication levels. The package is available through Github at https://github.com/bartongroup/Rats. We applied RATs to a public human RNA-seq dataset for which three DTU events had previously been validated by qRT-PCR. We found that the ability to reproduce the reported DTU events depended on the genome annotation used for quantification of the data. The isoform abundance profiles of two of the three genes changes radically between Ensembl v60 and v87. Of the >500 and >400 DTU events identified respectively by RATs, only 141 were in common and only 8 were among those reported by the original study. Investigation of this discrepancy revealed that the effect size of most of the originally reported events was small and below our threshold. More importantly, our analysis revealed that the qRT-PCR probes designed based on Ensembl v60 no longer corresponded to the intended transcripts according to Ensembl v87, but rather they matched an incompatible multitude of isoforms. As a consequence, interpretation of the original qRT-PCR quantifications is impossible with the newer annotation.
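As a rough illustration of what differential transcript usage measures, the following Python sketch computes each isoform's share of its gene's total abundance in two conditions and the shift between them. RATs itself is an R package; nothing below reproduces its API or its statistical testing, and the function names and toy abundances are hypothetical.

```python
from collections import defaultdict


def isoform_proportions(abundances, tx2gene):
    """Return each transcript's fraction of its gene's total abundance."""
    gene_totals = defaultdict(float)
    for tx, value in abundances.items():
        gene_totals[tx2gene[tx]] += value
    proportions = {}
    for tx, value in abundances.items():
        total = gene_totals[tx2gene[tx]]
        # A gene with zero total abundance gets proportion 0 for all isoforms.
        proportions[tx] = value / total if total > 0 else 0.0
    return proportions


def proportion_shifts(cond_a, cond_b, tx2gene):
    """Difference in isoform proportion (condition B minus condition A)."""
    prop_a = isoform_proportions(cond_a, tx2gene)
    prop_b = isoform_proportions(cond_b, tx2gene)
    return {tx: prop_b[tx] - prop_a[tx] for tx in prop_a}


# Toy example: one gene with two isoforms whose dominance switches.
tx2gene = {"tx1": "geneG", "tx2": "geneG"}
cond_a = {"tx1": 80.0, "tx2": 20.0}
cond_b = {"tx1": 30.0, "tx2": 70.0}
shifts = proportion_shifts(cond_a, cond_b, tx2gene)
```

A shift of this kind is one way to express an effect size for a DTU event; a real analysis would additionally need replicate-aware significance testing and an effect size threshold, which this sketch omits.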
Short Abstract: There is no escaping it; identifying the differentially expressed genes between two experimental conditions requires replicates. The question is, how many do you need? The answer to this deceptively simple question is not straight-forward. By conducting the most highly replicated RNA-seq experiment to date, we show that the answer depends on: 1) the statistical character of the underlying RNA-seq expression measurements, 2) the differential gene expression (DGE) tool you use, and 3) the effect size you are looking to detect. Expanding this analysis beyond the simple transcriptome of yeast and into a higher eukaryote (Arabidopsis thaliana) we show that transcriptomic complexity does not appear to be a strong factor in answering this question. Based on this analysis of these two highly-replicated datasets, for future RNA-seq DGE experiments we recommend 1) DGE tools based around a negative binomial count distribution that use a shrinkage variance approach, 2) >6 replicates in all conditions, 3) a minimum effect size threshold of ~10-20% (depending on the replication level), and 4) >12 replicates per condition when it is important to identify cases of DGE with effect sizes in the ~10-20% range.
Short Abstract: Dogs have lived with humans for thousands of years and have shared much of human life. Over the past decade, a variety of functional genomic studies have been conducted using one dog breed, the beagle. Beagle genes are similar to human genes, and the beagle is a good model organism for studying human diseases. For this reason, more accurate gene annotation studies of the beagle genome have been performed using various next generation sequencing techniques. In this study, we compared gene expression profiles among various beagle tissues, including RNA-Seq datasets obtained from beagle bone (alveolar bone, maxilla, skull and tibia). Principal component analysis reveals that 78% of the variation in gene expression is explained by the first three principal components and that the first principal component separates the data according to tissue. When we focused on the bone tissues, a total of 19,360 differentially expressed genes were identified among the different bone tissues. Analysis of gene ontology and KEGG pathways showed different functional gene expression profiles among the bone tissue types. Interestingly, we found a total of 3,789 long noncoding RNAs (lncRNAs) in the four tissue types; 3,744 of them were not listed in known lncRNA databases, and functional annotation of these novel lncRNAs will be performed. The presented study could provide valuable information on novel lncRNAs, especially regarding the functional roles of lncRNAs in the various bone tissues.
Short Abstract: Circular RNAs (circRNAs) are an emerging class of RNAs originating from exon Back-Splicing (BS) and, given their extracellular stability, they represent a promising set of biomarkers for several diseases, including breast cancer. We analysed in depth the circRNA transcriptome of the luminal breast cancer model MCF-7 using twelve paired-end poly(A-) RNA-Seq experiments performed in four different cell growth conditions. We predicted 3,271 circRNAs using the CIRI algorithm and characterized their genomic properties using CircHunter, a novel algorithm for circRNA post-discovery analysis developed by our group. We confirmed that circRNAs are predominantly formed by two exons, but we also identified intergenic, intronic, and monoexonic BS events. We also observed that circRNA host genes are longer, generate a high number of transcripts, have longer introns, and show significant enrichment in the H3K36me3 histone modification compared to control gene sets. Then, to extend the circRNA expression analysis to public total RNA-Seq datasets of breast cancer tissues, we developed an alignment-free method to directly compare sequencing reads with reconstructed BS sequences. As a result, we identified 113 circRNAs differentially expressed between Triple Negative and ER+ tumors and 622 circRNAs differentially expressed between ER+ and normal tissue. We analysed experimentally a set of 28 circRNAs in breast cancer cell lines and in 52 breast tumor samples. We identified circRNAs that show higher expression than their host gene and that are significantly highly expressed in ER+ luminal breast cancer cell lines and tissues.
Short Abstract: In agricultural production, it is fundamental to characterize the phenological stage of the plants to ensure a good evaluation of the development, growth and health of the crops. In viticulture, the phenological characterization allows early-detection of nutritional deficiencies in the plants, those diminish the growth, the productive yield and drastically affect the quality of its fruits. Currently, the phenological estimation in grapevine (Vitis vinifera) is done using the scale of Eichhorn and Lorenz and its derivatives. For this estimation, seven phenological stages of the plant are divided into two categories: vegetative growth and reproductive growth. According to the visualization of certain structures, the phenological stage of the plant is determined. This system, which has been widely used for the last 20 years, requires the exhaustive evaluation of crops, which makes it intensive in terms of labor, personnel and time required for its application. There are several genomic information databases for Vitis vinifera and the function of their genes has been widely characterized. The application of advanced molecular biology, including massive parallel sequencing of RNA (RNA-seq), and the handling of large volumes of highly complex data, provide state-of-the-art tools for the determination of phenological stages on a global scale of molecular functions of plants. The main objective of this work is to create a phenological classifier to accurately estimate the stage of development of the grapevine. This estimation will help to improve crop productivity and reduce the costs associated with the use of fertilizers and pesticides to obtain quality fruits.
Short Abstract: More than a dozen research studies on novel high-throughput structure analyses have been published to date. While high-throughput structure analyses can be practical for quite a few subjects beyond scalability problems, their estimates of RNA reactivity at each nucleotide can be inconsistent between analyses. This inconsistency is thought to arise from differences in detectability and systematic biases of the individual high-throughput structure analyses, as well as from the sparseness of the sequencing read distribution. To establish a statistical methodology for robust structure analyses, I present a novel pipeline, reactIDR, which is designed to extract reliable structure information from general high-throughput structure analyses. To evaluate the reliability of each reactivity score, reactIDR computes the irreproducible discovery rate (IDR) by modeling the joint probability distribution among replicates. Moreover, reactIDR can take the local consistency of IDR into account by combining a hidden Markov model with the IDR model, using the EM algorithm for parameter optimization. The efficiency of IDR filtering and classification for reproducible structure prediction was evaluated against the reference structure of human 18S rRNA and against computationally estimated stem probabilities for the whole transcriptome. According to the results, IDR-based classification showed higher consistency with the reference structure and with the stem probabilities calculated by in silico structure prediction, indicating that reactIDR can substantially assist in extracting condition-specific differences in secondary structure, with a view to deciphering the global landscape of RNA secondary structure.
Short Abstract: Background: In mammals, sex chromosomes are a source of an inherent genetic difference between the sexes. A balance between the sexes is reached by random inactivation of one of the X-chromosomes, in each female somatic cell. Thus resulting in a tissue mosaic such that some of the cells express one of the X-chromosomes and some the other. While most genes from the inactivated X-chromosome are silenced, about 15-25% of them have been shown to escape inactivation (referred as escapees). These escapees have so far been identified using multiple indirect methods considering the mosaic nature of female tissues. Results: We use single-cell RNA-seq to directly quantify the extent of escape from X-inactivation phenomenon. Analyzing data from single cells is preferable as in each cell only one of the X chromosomes is inactivated. Our method relies on allelic specific expression of genes with heterozygous SNPs, thus enabling us to discriminate between the expression of genes from the active X-chromosome (Xa) and from the inactive (Xi) one. We apply our method to datasets from (i) primary fibroblasts without genomic phasing (n=104), and (ii) clonal lymphoblasts cells with phased parental genomes (n=25). Applying our single-cell analysis we identify 27 and 34 escapees from fibroblasts and lymphoblasts, respectively. On the other hand, when analyzing a pool of lymphoblasts, only 14 escapees are discovered. Altogether, we report on 51 escapees, many of which are known escapees discovered by indirect methods (p=2.74e-06). We identify a few overlooked escapees, and propose to revise the annotations associated with some others. Conclusions: Chromosome X-inactivation and escaping from it are robust phenomena detected at a single-cell resolution. These phenomena are apparent in isolated primary fibroblasts as well as in cell-lines. Genomic phasing substantially improves the detection of escapees. 
Cumulative data from individual cells increases the sensitivity of detecting escapees compared to data extracted from a pooled sample. Our results support the notion that different cell types express different escapees with only a handful of consistent escapees across cell types and experimental settings.
Short Abstract: Multiple sclerosis (MS) is the most common cause of neurological disability among young adults. Early in the disease course, in the relapsing and remitting phase, influx of peripheral inflammatory cells induces focal demyelinating lesions in the central nervous system (CNS). Within a few decades, ongoing inflammatory demyelination eventually leads to diffuse axonal/neuronal degeneration with clinical decline, resulting in a progressive phase. Unfortunately, in this phase the effectiveness of immunotherapies is lost, and there are no sufficient criteria that determine the conversion from the relapsing and remitting to the progressive phase. We hypothesize that specific transcriptome signatures can characterize lesion evolution in the progressive MS brain and that this can be used for stage-specific biomarker discovery. In order to address gene expression changes during lesion evolution/expansion, we identified 20 normal-appearing white matter areas, 15 active, 17 chronic active, 14 inactive and 6 repairing lesions from 10 brains of patients with MS. We characterized tissues for cellular changes (i.e. microglia/macrophage activity), myelin integrity and phagocytosis of myelin proteins by using standard histology and immunohistochemistry. As controls, we chose 25 white matter (WM) areas from five brains without neurological disease. We microdissected the areas of interest, extracted the RNA, and performed next-generation RNA sequencing of the different lesion types using paired-end sequencing of 2x80 bases on Illumina’s NextSeq500/550. The transcriptome assembly of the RNA reads was done using reference sequences from Ensembl and the Bowtie/TopHat2 alignment programs. We counted and analyzed differentially expressed genes using HTSeq and the edgeR package. In the preliminary data, comparing the chronic active MS lesions with the WM control areas, 1301 significantly differentially expressed genes and 62 significantly altered pre-defined pathways were found.
The most changed genes belong to the immunoglobulin family, which support the autoantibody-mediated theory of MS damage and correlates with the oligoclonal bands detected in the CSF of MS patients. Other differentially expressed genes and pathways confirmed the known key factors playing a role in MS such as the presence of the CD8+ T cells and CD20+ B cells, oxidative stress markers, Ca2+/Na+-induced K-channels and metabolic pathways. Differentially expressed genes involved in degeneration and cell death/survival were also found, such as growth factors and components for axonal regeneration. The transcriptome signature of different lesion types will be compared to our CSF proteome database obtained from 97 patients with early inflammatory and late progressive disease to explore, if lesion type signatures are reflected in MS-CSF, and can be used as composite biomarkers related to disease stages. The multi-omics approach of MS brain lesions may provide radically new insight into MS pathogenesis, reveal novel potential treatment targets, and contribute to discovery of composite biomarkers predicting irreversible disease progression.
Short Abstract: Motivation: RNA is a biopolymer with many different applications inside the cell and in biotechnology. The structure of RNA molecules largely determines their function. Accurate prediction of RNA structure is therefore important. RNA secondary structure prediction has received attention in the past decade. However, a long-standing question for improving the prediction accuracy of RNA secondary structure is whether to focus on the prediction algorithm or the energy model. This problem is particularly pronounced for complex pseudoknotted structures, for which there is an even greater trade-off between the computational cost of the prediction algorithm and its generality. The aim of this paper is to draw attention to the importance of the energy model and to invite more research in this area. Results: In this work, we thoroughly compare the performance of two of the most general RNA pseudoknotted secondary structure prediction algorithms, which have two different folding hypotheses but the same underlying energy model, on a large data set of known structures. Based on the methods compared in this work, we hypothesize that the energy model contributes more to the prediction accuracy of a method than its folding hypothesis does.
Short Abstract: Brazil is facing an unprecedented growth in the number of microcephaly cases in babies. This phenomenon coincided with the recent Zika virus (ZIKV) outbreak in the country. Although the Brazilian Ministry of Health was quick to recognize that ZIKV was probably the cause of microcephaly in newborns, the underlying mechanisms leading to the development of this pathology have not been established. To tackle this problem at the molecular level, we employed whole transcriptome sequencing of human neurospheres derived from neural stem cells exposed to a ZIKV strain isolated in Brazil that belongs to the Asian genotype. Differential gene expression analysis of control (MOCK) and ZIKV-infected neurospheres generated a list of 26 down-regulated and 64 up-regulated genes. Among the up-regulated genes, the Cyclin-dependent kinase inhibitor 1A (CDKN1A) and the Glial fibrillary acidic protein gene (GFAP) were found. CDKN1A prevents the activation of the Cyclin E/CDK2 complex, acting as a regulator of cell cycle progression during G1, and GFAP is a known marker of astrocytes. We also observed a decrease in the expression of the neurogenic differentiation 1 gene (NEUROD1), which is directly involved in the neurogenic program. These findings suggest that ZIKV infection induces cell cycle arrest and inhibits neuronal differentiation, resulting not only in a reduction in size but in a deeper disruption of the normal development of the human brain.
Short Abstract: By capturing and sequencing the RNA fragments protected by translating ribosomes, ribosome profiling sketches the landscape of translation at subcodon resolution. We developed a new method, RiboCode, which uses ribosome profiling data to assess the translation of each RNA transcript genome-wide. As supported by multiple tests with simulated data and cell type-specific QTI-seq and mass spectrometry data, RiboCode exhibits superior efficiency, sensitivity, and accuracy for de novo annotation of the full translatome, which covers various types of novel ORFs in the previously annotated coding and non-coding regions and overlapping ORFs. Finally, to showcase its application, we applied RiboCode on a published ribosome profiling dataset and assembled the context-dependent translatomes of yeast under normal condition, heat shock, and oxidative stress. Comparisons among these translatomes revealed stress-activated novel upstream and downstream ORFs, some of which are associated with the potential translational dysregulations of the main protein coding ORFs in response to the stress signals.
Short Abstract: Insertions of retrotransposons can disrupt genes and cause dysregulation of gene expression. Our objective is to understand retrotransposon expression in cancer, and to identify miRNAs and somatic mutations of genes that control retrotransposons. We measured retrotransposon expression levels in the RNA-seq data of cancer samples in The Cancer Genome Atlas (TCGA). We used different statistics to test the association between miRNA/gene expression and L1HS expression, and between somatic mutations and L1HS expression, across 634 patients. We found that, unlike other transposon families, L1HS transcripts are always overexpressed in cancer compared to normal tissue, although the degree of overexpression varied across patients and cancer types. We identified a list of candidate miRNAs and genes that may control transposon expression. The list of genes we identified includes several of the known host factors in L1HS activity.
Short Abstract: Owing to the heterogeneous nature of the brain, molecular classification of individual cell types is often challenging because individual studies focus on a limited set of cell types or on specific brain regions. Data from such studies, however, are constantly accumulating, allowing us to aggregate them for a more comprehensive view of the brain cell types. Here we present NeuroExpresso, a curated database of mouse brain cell type specific expression profiles representing 35 major cell types from 10 brain regions, acquired from independent expression profiling experiments using both pooled cell microarray data and single cell RNA sequencing. We make this database available to the community at neuroexpresso.org, a website that allows visualization and basic analysis (differential expression) of gene expression in brain cell types. The database is a valuable resource as it allows researchers to identify novel properties of brain cell types in the context of the whole brain or specific brain regions, which are not detectable by individual studies. Further, we used this database to identify marker genes specific to individual cell types and identified a substantial number of previously unknown cellular markers. These markers were then validated using in silico analyses and in situ hybridization. Finally, we demonstrate that summarized expression of marker genes (marker gene profiles, MGPs) in bulk tissue correlates with changes in cellular proportions. We use MGPs to recapture the known loss of dopaminergic cells in Parkinson’s disease (PD) patients and discover that a substantial proportion of genes previously reported as differentially expressed in PD patients can be attributed to the reduction of dopaminergic cells.
Short Abstract: The ability to compare the abundance of one RNA molecule to another is a crucial step for understanding how gene expression is modulated to shape the transcriptome landscape. However, little information is available about the relative expression of the different classes of coding and non-coding RNA or even between RNA of the same class. We have determined the most accurate experimental and bioinformatic sequencing methodology to date to elaborate a complete portrait of the human transcriptome that depicts the relationship of all classes of non-ribosomal RNA longer than sixty nucleotides. The results show that the most abundant RNA in the human rRNA-depleted transcriptome is tRNA followed by spliceosomal RNA. Surprisingly, the signal recognition particle RNA 7SL by itself occupies 8% of the ribodepleted transcriptome producing a similar number of transcripts as that produced by all snoRNA genes combined. In general, the most abundant RNA are non-coding but many more protein coding than non-coding genes produce more than 1 transcript per million. Examination of gene functions suggests that RNA abundance reflects both gene and cell function. Together, the data indicate that the human transcriptome is shaped by a small number of highly expressed non-coding genes and a large number of moderately expressed protein coding genes that reflect cellular phenotypes.
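The abundance unit mentioned in this abstract, transcripts per million (TPM), can be sketched in Python as follows. The abstract does not describe the authors' quantification pipeline, so this shows only the standard TPM definition; the function name and the toy counts are illustrative assumptions.

```python
def tpm(counts, lengths):
    """Convert read counts to transcripts per million (TPM).

    counts: dict mapping gene -> number of assigned reads
    lengths: dict mapping gene -> effective length in bases
    Each count is divided by the gene length to get a per-base rate, and
    the rates are rescaled so that they sum to one million.
    """
    rates = {gene: counts[gene] / lengths[gene] for gene in counts}
    total = sum(rates.values())
    if total == 0:
        raise ValueError("all rates are zero; TPM is undefined")
    return {gene: rate / total * 1e6 for gene, rate in rates.items()}


# Toy example: a short, highly covered RNA and a long, modestly covered mRNA.
counts = {"short_ncRNA": 300, "long_mRNA": 300}
lengths = {"short_ncRNA": 300, "long_mRNA": 3000}
values = tpm(counts, lengths)
share_short = values["short_ncRNA"] / 1e6  # fraction of all transcripts
```

Because TPM normalizes by length, equal read counts translate into a much larger transcript share for the short RNA, which is one reason short, highly expressed non-coding RNAs can dominate abundance rankings.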
Short Abstract: Motivation: Advancements in sequencing technologies have highlighted the role of alternative splicing (AS) in increasing transcriptome complexity. This role of AS, combined with the relation of aberrant splicing to malignant states, motivated two streams of research, experimental and computational. The first involves a myriad of techniques such as RNA-Seq and CLIP-Seq to identify splicing regulators and their putative targets. The second involves probabilistic models, also known as splicing codes, which infer regulatory mechanisms and predict splicing outcome directly from genomic sequence. To date, these models have utilized only expression data. In this work we address two related challenges: can we improve on previous models for AS outcome prediction, and can we integrate additional sources of data to improve predictions for AS regulatory factors? Results: We perform a detailed comparison of two previous modeling approaches, Bayesian and Deep Neural networks, dissecting the confounding effects of datasets and target functions. We then develop a new target function for AS prediction in exon skipping events and show it significantly improves model accuracy. Next, we develop a modeling framework that leverages transfer learning to incorporate CLIP-Seq, knockdown and overexpression experiments, which are inherently noisy and suffer from missing values. Using several datasets involving key splice factors in mouse brain, muscle and heart, we demonstrate both the prediction improvements and biological insights offered by our new models. Overall, the framework we propose offers a scalable integrative solution to improve splicing code modeling as vast amounts of relevant genomic data become available. Availability: code and data available at: majiq.biociphers.org/jha_et_al_2017/
Short Abstract: Correct quantification of transcript abundance is essential to understand the functional products of the genome in different physiological conditions and developmental stages. The development of high-throughput RNA sequencing (RNA-Seq) allows researchers to perform transcriptome analysis for organisms without a reference genome or transcriptome. For such projects, de novo transcriptome assembly must be carried out prior to quantification. However, the large number of fragmented contigs and redundant sequences produced by the assembler may result in ambiguous read mapping or unreliable abundance estimation, reducing the reliability of transcript quantification. This study therefore investigates how assembly quality affects the quality of read mapping and count estimation. Through the experiments and analyses conducted in this study, several important factors that might seriously affect the accuracy of the RNA-Seq analysis workflow are discussed. The effects of ambiguous regions originally present in the transcriptome on mapping and assembly quality were examined. Factors such as assembly completeness and types of mis-assembly were examined on both synthetic and real sequencing reads. By assessing the quantification quality for each sequence category, this study shows that ambiguous regions present in the reference transcriptome only slightly influence mapping quality but lead to mis-assemblies that are not eliminated. The resulting mis-assemblies then heavily decrease the reliability of read mapping and count estimation. Among all the wrongly assembled contigs, the mistaken aggregation of several transcripts into one was shown to cause the most serious damage to the reliability of quantification.
Fortunately, it is generally believed that this category has chances to be largely reduced by post-modifications in the advanced transcriptome assembly pipelines developed recently.
Short Abstract: High-throughput sequencing (HTS) technology has become essential in the study of genomics and has made it possible to obtain millions of sequencing reads in a single experiment in a cost-effective manner. However, the analysis of HTS data requires heavy utilization of computationally intensive techniques, since millions of sequencing samples have to undergo various processing steps, from read quality assessment and alignment to quantification. Each step in the analysis needs a specialized tool or algorithm, and all of these steps need to be streamlined due to their computationally demanding nature. Both commercial and open source platforms are available for analysis of HTS data; however, they are either too limited in terms of tools and functionality (hence, rigid), or incorporate a wide array of tools but are difficult to use (hence, too complex). The aim of our project was to develop a framework that would address these concerns and be user-friendly while retaining its flexibility and reproducibility. We have developed “PIPELINER”, a framework that is efficient for the user and can generate flexible and modular workflows for quality assessment and processing of HTS data. PIPELINER is based on Nextflow, a portable, scalable, parallelizable, domain-specific language (DSL) that enables our pipelines to have language and platform independence, implicit parallelism, and automatic failure recovery. PIPELINER also incorporates an Anaconda virtual environment, which allows for the pre-compilation of all the tools involved with the pipeline being generated. This makes the deployment process and execution platform-independent and less cumbersome. As a proof of concept, we developed a pipeline for processing of bulk RNA-sequencing (RNA-seq) data. Based on the lessons learned, we will next develop a pipeline for single cell RNA sequencing (scRNA-seq) data.
Short Abstract: Eukaryotic RNAs undergo extensive processing at the post-transcriptional level, including capping, 3’-cleavage and polyadenylation, and splicing. These steps happen synergistically and also concurrently with each other and with transcription, generating multiple alternative products arising from the same locus. Here, by using comparative genomics, we identified a robust set of ~15,000 pairs of conserved complementary regions in non-coding regions of human protein-coding genes. These conserved regions have a large non-random overlap with eCLIP peaks, and in particular with those of RBFOX2, suggesting widespread structural regulatory mechanisms similar to RNA bridges. The complementary regions tend to avoid mutations and even if polymorphisms occur in them, the corresponding impact on the free energy is smaller than that of random mutations. Conserved regions also contain cryptic splice sites and are possibly involved in the suppression of aberrant splicing. Interestingly, the complementary pairs are located preferentially close to splice junctions and contain unexpectedly high number of internal transcription termination and transcription start sites. This leads to a hypothesis that intramolecular RNA structure in combination with splicing could serve to suppress premature cleavage and polyadenylation by holding RNA parts together while the spliceosome excises the intron containing the cleavage site. This mechanism could as well be responsible for seemingly alternative transcription start sites that are generated through premature cleavage and nuclear re-capping of initial introns. Overall, we find a highly non-random distribution of conserved complementary regions with respect to mammalian gene structure including not only splicing signal demarcation, but also transcriptional start and stop sites.
Short Abstract: In this study, I developed TurboFold II, an extension of the TurboFold algorithm for predicting secondary structures for multiple RNA homologs. TurboFold II makes several improvements upon the original TurboFold algorithm. Whereas TurboFold only provided secondary structure predictions, TurboFold II also provides a multiple sequence alignment that incorporates information from secondary structure conservation. In contrast with TurboFold that used fixed alignment probabilities computed at the start using only sequence information, TurboFold II updates the alignment probabilities for inter-sequence alignment at each iteration. The updates incorporate secondary structure conservation information in the alignment by using a match score, calculated from estimated base pairing probabilities to represent the secondary structural similarity between nucleotide positions in the two sequences. Upon completion of the iterations, in addition to structure predictions computed as in TurboFold, TurboFold II computes a multiple sequence alignment that is progressively computed based on a probabilistic consistency transformation and a hierarchically computed guide tree, adopted from other sequence alignment methods. The TurboFold II algorithm is modified for prediction of RNA secondary structures to utilize base pairing probabilities guided by SHAPE experimental data. Results demonstrate that the SHAPE mapping data for a sequence improves structure prediction accuracy of other homologous sequences beyond the accuracy obtained by sequence comparison alone.
TurboFold II has alignment accuracy comparable to MAFFT and higher accuracy than other tools. TurboFold II also has structure prediction accuracy comparable to the original TurboFold algorithm, which is one of the most accurate methods. TurboFold II is part of the RNAstructure software package (http://rna.urmc.rochester.edu).