Poster presentations at ISMB 2020 will be presented virtually. Authors will pre-record their poster talk (5-7 minutes) and will upload it to the virtual conference platform site along with a PDF of their poster.
All registered conference participants will have access to the poster and presentation through the conference platform until October 31, 2020. Q&A opportunities through a chat function allow interaction between presenters and participants.
Preliminary information on preparing your poster and poster talk is available at: https://www.iscb.org/ismb2020-general/presenterinfo#posters
Ideally authors should be available for interactive chat during the times noted below:
Poster Session A: July 13 and July 14, 7:45 am - 9:15 am Eastern Daylight Time
Poster Session B: July 15 and July 16, 7:45 am - 9:15 am Eastern Daylight Time
Short Abstract: To fulfill its functions, RNA must fold into specific three-dimensional structures, so predicting the 3D structure of an RNA is essential. Some interactions, called canonical interactions, form the local secondary structure and are now predicted effectively. However, other interactions are essential to the correct global folding of the molecule and currently remain unpredictable. The best-known class is the A-minor-like motifs, which have been proven to play a crucial role in folding and cellular mechanisms, from ribozymes to ribosomes by way of riboswitches and viral RNAs.
We developed an automated pipeline for the exhaustive extraction and classification of A-minor-like motifs in RNA 3D structures. The classification is done by comparing the local context of these motifs, i.e. the arrangement of bonds around the motifs. Each motif and its context are represented by a special graph, in which nucleotides are grouped according to their proximity and to the bonds they are involved in. A dedicated similarity measure has been developed for comparing these graphs. We first show that there is a fairly good correlation between our similarity measure and the RMSD. Then, our first results show that, as expected, homologous occurrences share similar contexts. More interestingly, a small number of non-homologous occurrences do as well.
Short Abstract: Ribosome profiling is a powerful method for studying translation. In this method, ribosome-protected RNA fragments are sequenced to quantify transcript-level ribosome occupancy. Recent studies have shown that sequence read lengths are variable and carry critical information. Therefore, analyzing ribosome profiling data involves computation and storage of multiple metrics for different ribosome footprint lengths. The current solution to this problem, using text files, is inefficient.
Here, we developed a complete software ecosystem including a new, efficient binary file format for ribosome profiling. The first component, RiboFlow, is a pipeline that processes raw ribosome profiling reads to generate sequence alignments and related statistics. Importantly, RiboFlow assembles the results into a new binary file format, which we named “ribo”. Ribo files store all quantities of interest grouped by different ribosome footprint lengths. The other components, RiboR and RiboPy, provide interfaces in R and Python. Using RiboR and RiboPy, users can efficiently access ribo files and generate plots.
Taken together, these tools provide a robust software ecosystem for researchers to study translation. Moreover, we provide a more than 15-fold improvement in storage compared to the conventional approach. We expect our infrastructure to facilitate large-scale and high-resolution studies on translation.
Short Abstract: Numerous RNA-associated proteins have been characterized as being part of the mRNA maturation machinery. However, it remains unclear whether RNA-associated proteins preferentially associate with specific types of RNAs or process all RNAs irrespective of their nature. We built a dataset of 124 affinity purified proteins implicated in mRNA maturation. These proteins and their interactions were characterized using mass spectrometry and RNA transcripts isolated from 47 of the purifications were sequenced. Using this multiomics dataset, we propose an unsupervised learning approach to identify subsets of proteins that share common protein-protein interactions and transcript associations. We also designed a co-clustering approach using bootstrapping to discover groups of transcripts that are associated with the same subsets of proteins. Our algorithms identified 24 groups of proteins that significantly cluster based on their shared protein interactions and 8 groups of RNA transcripts that are preferentially purified by similar sets of proteins (bootstrapping value > 90). Furthermore, our co-clustering technique discovered sets of RNA transcripts and proteins that preferentially associate with each other and share similar functionality. Our results demonstrate that a subset of proteins playing a role in mRNA maturation preferentially associate with certain RNA transcripts, thereby providing a better understanding of mRNA maturation mechanisms.
Short Abstract: As a result of the high number of measured gene expressions, low number of replicates, interconnected gene expression networks and many biases introduced by the multistep processing, analysis of RNA-seq data can be challenging. Therefore, inferential statistical analysis must be adjusted for these factors. Type I errors are well studied for RNA-seq data, whereas type II errors are far less frequently analyzed although similarly important. Here, we propose the use of a univariate smoothed bootstrap with a Gaussian kernel density method to accurately estimate the type II errors in fold change within an RNA-seq experiment. A key advantage of this approach is that experiment-level power can be assessed and used to verify that the sample replication is sufficient for a given effect size. The two RNA-seq data sets used for testing are a study comparing cell line responses to different drug treatments and data from the TCGA-BRCA human breast cancer project. The results show a high overall power, even for relatively low replicate numbers. Furthermore, the type II errors for genes are highly dependent on the expression level. The proposed method will help provide additional confidence for genes of interest and complete RNA-seq experiments.
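The smoothed bootstrap idea can be illustrated with a minimal sketch. This is not the authors' implementation: the bandwidth rule (a normal reference rule), the decision rule (the absolute mean log2 fold change reaching a threshold), and the names smoothed_bootstrap and estimate_power are all assumptions made for illustration. Inputs are taken to be log2 expression values for one gene.

```python
import numpy as np


def smoothed_bootstrap(sample, n_draws, rng, bandwidth=None):
    """Resample with replacement, then add Gaussian kernel noise to each draw."""
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    if bandwidth is None:
        # Normal reference rule; an assumption, the abstract names no bandwidth rule.
        bandwidth = 1.06 * sample.std(ddof=1) * n ** (-1 / 5)
    picks = rng.choice(sample, size=(n_draws, n), replace=True)
    return picks + rng.normal(0.0, bandwidth, size=(n_draws, n))


def estimate_power(control, treated, min_log2fc=1.0, n_draws=2000, seed=0):
    """Fraction of bootstrap replicates whose |mean log2 fold change| >= min_log2fc.

    One minus this fraction is a rough estimate of the type II error rate
    under the (hypothetical) decision rule used here.
    """
    rng = np.random.default_rng(seed)
    control_means = smoothed_bootstrap(control, n_draws, rng).mean(axis=1)
    treated_means = smoothed_bootstrap(treated, n_draws, rng).mean(axis=1)
    return float(np.mean(np.abs(treated_means - control_means) >= min_log2fc))


# Made-up log2 expression values for three replicates per condition.
strong_effect = estimate_power([5.0, 5.1, 4.9], [8.0, 8.2, 7.9])
no_effect = estimate_power([5.0, 5.1, 4.9], [5.05, 4.95, 5.0])
```

With only three replicates the smoothing step matters: a plain bootstrap can only reproduce the observed values, whereas the kernel noise lets the resamples explore values between and around them.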
Short Abstract: A number of studies have revealed that gene expression networks in dementia are deregulated. However, the molecular mechanisms associated with this deregulation remain largely unknown. In previous work, we have shown that mechanisms involving non-protein-coding RNAs, such as Short Interspersed Nuclear Element (SINE) RNAs (Zovoilis et al., Cell 2016), are key in maintaining transcriptional homeostasis in cells. RNA from one of the most frequent SINE repeat classes controls expression of early response genes through binding and inhibition of RNA Polymerase II. However, it remained unclear whether deregulation of this mechanism could underlie a pathological condition in humans. Here, we applied an integrative RNA genomics and bioinformatics approach to dissect any connection between SINE RNAs and dementia. Our study reveals that SINE RNAs are subject to abnormal processing in human dementia patients and are connected with deregulation of gene expression.
Short Abstract: Predicting non-coding RNA structures is crucial to understanding their mechanism of action. Comparative approaches have contributed significantly to improving the prediction of conserved RNA structures of homologous RNA sequences. Computational methods that rely on comparative approaches mainly exploit multiple sequence and/or structure alignments of homologous RNA to improve the accuracy of the prediction of conserved RNA structures. Another comparative approach, named "align-free-fold", consists of predicting the conserved RNA structure without relying on any time-consuming alignment step. The "align-free-fold" strategy has the advantage of being generally faster than alignment-based strategies. Recently, we introduced an align-free-fold method named aliFreeFold, which predicts a representative secondary structure of a set of homologous RNA by using a vectorial representation of their samples of Minimum-Free-Energy sub-optimal structures. In this work, we present aliFreeFoldMulti, an extension of the aliFreeFold algorithm. aliFreeFoldMulti improves on aliFreeFold by computing the conserved structure of each sequence of the input family. To assess the accuracy and efficiency of aliFreeFoldMulti prediction, we compare it to a selection of current RNA structure prediction methods using a dataset of non-coding RNA families with known secondary structures.
Short Abstract: RNA-small molecule binding is a key regulatory mechanism which can stabilize 3D structures and activate molecular functions.
The discovery of RNA-targeting compounds is thus a current topic of interest for novel therapies.
Our work is a first attempt at bringing the scalability and generalizability of learning methods to the problem of RNA drug discovery, as well as a step towards understanding the interactions which drive binding specificity.
Our tool, RNAmigos, builds a network representation of RNA structures to predict likely ligands for novel binding sites.
We subject RNAmigos to virtual screening and show that we enrich the true ligand in the 71st-73rd percentile in two decoy libraries, showing a significant improvement over several baselines and a state-of-the-art method.
Furthermore, we observe that augmenting structural networks with non-canonical base pairing data is the only representation able to uncover a significant signal.
Finally, we find that pre-training with a graph representation task significantly boosts performance.
This finding can serve as a general principle for RNA structure-function prediction when data is scarce.
We show that RNA binding data contains structural patterns with potential for drug discovery, and provide methodological insights for possible applications to other structure-function learning tasks.
Short Abstract: RNA editing and its regulation are increasingly being recognized as important mechanisms in disease pathogenesis. With the growing availability of deep RNA-seq datasets, global detection of RNA editing events has become achievable. However, despite advances, reliably detecting RNA editing events in the transcriptome remains difficult.
Here we introduce Balanced Output of RNA Editing (BORE), a cloud-native application that provides researchers with the ability to process their own high-throughput sequencing data and find potential RNA editing sites. The application is written in Python and Go, and is hosted on Amazon Web Services (AWS).
The application uses AWS Simple Storage Service (S3) to store the raw FASTQ files. It then processes the files through the BORE pipeline to generate candidate RNA editing sites. The results are then available for the researcher to download.
The internal workflow of BORE is: download a FASTQ file, align it to a reference genome with HISAT2, and then preprocess the alignment into a filtered, high-quality BAM file. The main processing workflow then takes over. We filter out known Single Nucleotide Variants (SNVs), generate an internal representation of the putative editing sites, and output them as machine-readable files such as VCF and as a summary report.
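The SNV filtering step of such a workflow can be sketched as a simple set lookup. This is an illustrative assumption, not BORE's code: the function name filter_known_snvs, the tuple layout (chromosome, position, reference base, alternate base) and the coordinates are all hypothetical.

```python
def filter_known_snvs(candidates, known_snvs):
    """Keep candidate sites that are not listed among known SNVs.

    Each site is a (chrom, pos, ref, alt) tuple; this layout is an
    assumption made for the example.
    """
    known = set(known_snvs)
    return [site for site in candidates if site not in known]


# Made-up candidate mismatches and one made-up known variant.
candidates = [
    ("chr1", 1000, "A", "G"),
    ("chr1", 2000, "A", "G"),
    ("chr2", 500, "T", "C"),
]
known_snvs = [("chr1", 2000, "A", "G")]
putative_sites = filter_known_snvs(candidates, known_snvs)
```

Matching on all four fields means a known variant only removes a candidate with the same base change at the same position. A stricter filter could match on chromosome and position alone.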
Short Abstract: miRNAs are short (∼23 nucleotide) single-stranded noncoding RNA molecules that post-transcriptionally regulate gene expression, through target cleavage, degradation and/or translational suppression. The expression and distribution of miRNAs across tissues is crucial for delineating the underlying mechanisms in both physiological and pathological aspects of biological systems.
In this work we developed DIANA-miTED, a web database application offering the expression of miRNAs obtained from 4143 publicly available miRNA-seq data retrieved from NCBI-SRA and analysed through a well-defined workflow.
The workflow consists of the following steps: Data Acquisition, Quality Checking utilizing FastQC, Contaminant Detection using DNApi and Minion, Alignment step and Quantification using miRDeep2.
miRNA Tissue-Expression Database was developed utilizing the latest technologies in web development, namely PostgreSQL (relational database), PHP and Laravel 6 for data access layer development, and TypeScript and Angular 9 for the presentation layer. Through its intuitive user interface, the user can retrieve the expression values of 2656 miRNAs from 4143 analyzed experiments. The expression units offered are Read Counts, Reads Per Million (RPM) and Log2RPM. The analyzed samples are classified into 120 tissues and 289 cell lines.
miRNA Tissue-Expression Database is an important utility for the scientific community, enabling further research in miRNA tissue expression patterns.
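As a rough illustration of the expression units mentioned above, the sketch below converts raw read counts to RPM and to a log2 scale. The pseudocount of 1 in the log transform is an assumption (the abstract does not say how zero counts are handled), and the function names and counts are made up for the example.

```python
import math


def reads_per_million(counts):
    """Scale raw read counts so that they sum to one million per sample."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no mapped reads")
    return {mirna: count * 1_000_000 / total for mirna, count in counts.items()}


def log2_rpm(rpm, pseudocount=1.0):
    """log2(RPM + pseudocount); the pseudocount is an assumption of this sketch."""
    return {mirna: math.log2(value + pseudocount) for mirna, value in rpm.items()}


# Made-up read counts for a single sample.
sample_counts = {"hsa-miR-21-5p": 750, "hsa-miR-155-5p": 250, "hsa-miR-1-3p": 0}
rpm = reads_per_million(sample_counts)
log_values = log2_rpm(rpm)
```

RPM makes samples with different sequencing depths comparable, and the log scale compresses the very wide dynamic range typical of miRNA expression.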
Short Abstract: G-quadruplexes (G4) are non-canonical secondary structures present in both genomes and transcriptomes. Although their role is not fully known, many studies point to implications in gene expression regulatory mechanisms. In the nervous system, dysregulation of DNA-G4 has been associated with pathological disorders including neurological dysfunction, accelerated ageing, and increased risk of cancer development. While clear links between RNA-G4 and diseases of the nervous system have not been established, their critical role in RNA metabolism suggests that such links should be investigated. Prediction of putative G4 regions (pG4r) in the human transcriptome has recently been conducted, showing that pG4r are present in all types of transcripts (mRNA, ncRNA and pseudogenes) and enriched in the 5’ and 3’ UTRs of mRNA. In this context, the aim of this project is to study more specifically the pG4r in the human transcriptome of the nervous system compared to the remaining transcriptome. Transcript and gene expression data from the Genotype-Tissue Expression (GTEx) project are used to determine transcripts and genes significantly expressed in tissues of the nervous system. After this first step, protein expression data will also be incorporated to find differences between transcript and protein abundance and thus highlight post-translational regulatory mechanisms.
Short Abstract: The rise of (single-cell) RNA-seq data collection in recent years is unmatched. Commonly, this data is used to determine differentially expressed genes (DEGs) between samples or cell types. Utilizing transcriptomic quantifiers, the very same data sets can additionally be used for a transcript-level analysis. Complementary to a DEG analysis, proportional changes in the transcript composition of a gene would be of great interest for many research questions, such as analysis of differential splicing at the bulk or single-cell level.
We present the easy-to-use R package DTUrtle, which enables a broad audience to perform differential transcript usage (DTU) analysis. It is the first package that supports counts from both bulk and single-cell RNA-seq data, optionally utilizing a sparse data representation. DTUrtle builds upon established statistical frameworks for DTU analysis and offers customizable filtering schemes and a variety of visualization options. It has been successfully applied to single-cell and bulk RNA-seq data from both human and mouse.
DTUrtle offers researchers a convenient way to perform DTU analysis of single-cell and bulk data and allows for an in-depth inspection of the results via multiple visualization options.
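To make the notion of transcript usage concrete, the following sketch computes the proportion each transcript contributes to its gene's total count. It is written in Python purely for illustration and does not reflect the DTUrtle API; the function name, transcript and gene identifiers, and counts are hypothetical.

```python
from collections import defaultdict


def transcript_proportions(tx_counts, tx_to_gene):
    """Return each transcript's share of the total counts of its gene."""
    gene_totals = defaultdict(float)
    for tx, count in tx_counts.items():
        gene_totals[tx_to_gene[tx]] += count
    proportions = {}
    for tx, count in tx_counts.items():
        total = gene_totals[tx_to_gene[tx]]
        # Genes with no reads get a proportion of 0.0 to avoid division by zero.
        proportions[tx] = count / total if total > 0 else 0.0
    return proportions


# Made-up counts: the gene total is 100 in both conditions, so a gene-level
# (DEG) analysis sees no change, while the transcript usage is reversed.
tx_to_gene = {"txA": "gene1", "txB": "gene1"}
condition_1 = transcript_proportions({"txA": 80, "txB": 20}, tx_to_gene)
condition_2 = transcript_proportions({"txA": 20, "txB": 80}, tx_to_gene)
```

A DTU analysis then tests statistically whether such proportion shifts differ between conditions; that testing step is not shown here.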
Short Abstract: In this study, we show that deep feed-forward neural nets are able to accurately classify RNA families directly from RNA sequences only. By selectively obfuscating combinations of features, we demonstrate to what degree those models use length, nucleotide and dinucleotide composition, and higher-order subsequences present in the RNA to make accurate predictions. We report the area under the receiver operating characteristic curve (ROC-AUC) for the classification task on a diverse selection of RNA families, showing how randomizing various implicit sequence features affects the performance of these models and suggesting which features they are able to detect. We hope these findings will encourage the use of artificial neural network models for reliable data-driven detection of RNA families directly from primary structure, and the integration of these models into various other sequence-based bioinformatics tasks, such as de novo genome annotation.
Short Abstract: Single-cell RNA sequencing measures gene expression states in individual cells, enabling high-resolution biological studies. Analysis of this data requires high speed and interactivity, to allow easy exploration, clustering cellular populations at multiple scales, comparisons between cell types, and integration of datasets and modalities. This is made increasingly difficult by the growth in single-cell data, with datasets of >1 million cells being generated.
We present an open-source GPU (Graphics Processing Unit) based pipeline for high-speed scRNA-seq data analysis. Our pipeline is built upon Python and RAPIDS, a free and open-source software suite for GPU-accelerated data science. It exploits GPUs to accelerate all steps of analysis, including machine learning, dimensionality reduction, clustering, visualization and differential gene expression.
As a demonstration, we analyzed gene expression from 70,000 human lung cells to identify cell types susceptible to COVID-19. Compared to a standard CPU-based pipeline using Scanpy, we achieved 4-90 times faster runtime for each step, including 5x faster preprocessing, 90x faster t-SNE visualization, 70x faster UMAP and 50x faster Louvain clustering, on a Tesla V100 GPU compared to 32 CPU cores. Runtimes on the order of seconds enable interactivity, allowing researchers to explore, analyze and visualize large single-cell datasets in real time.
Short Abstract: N6-methyladenosine (m6A) is one of the most abundant RNA modifications found in nature. Several wet-lab studies have identified some RNA binding proteins (RBPs) that function as m6A regulators. This study’s objective was to identify m6A-associated RBPs as potential m6A regulators using an integrative computational framework. The framework was composed of an enrichment analysis and a classification model. We identified reproducible m6A regions from independent studies in two cell lines and then utilized RBPs’ binding data from the same cell line to identify m6A-associated RBPs. The enrichment analysis identified known m6A regulators, including YTH domain-containing proteins; it also identified a potential m6A regulator, RBM3, in mouse. In addition, we built a Random Forest classification model for the reproducible m6A regions using RBPs’ binding data. The RBP-based predictor not only demonstrated competitive performance when compared with sequence-based predictions but also helped to identify m6A-repelled RBPs. These results suggest that our framework allows us to infer interactions between m6A and m6A regulators beyond the sequence level. In summary, we designed an integrative computational framework for the identification of known and potential m6A regulators. We hope the analysis will provide more insights into the studies of m6A and RNA modification.
Short Abstract: Despite the expansion of high-throughput RNA sequencing (RNA-seq), the correct characterization and quantification of small ncRNA remains difficult. Because these RNAs are usually highly structured and stable (e.g., tRNA, snoRNA), sequencing them requires accurate reverse transcription, which standard reverse transcriptases rarely achieve. Recently, a new RNA-seq method using a bacterial Thermostable Group II Intron Reverse Transcriptase (TGIRT) has been shown to circumvent this issue in human cell lines, thanks to its thermostability, higher processivity and fidelity. Here we use this technique to study S. cerevisiae and compare our findings to human experiments.
While we observe that TGIRT-seq and conventional RNA-seq give similar results for mRNA, TGIRT-seq appears superior for detecting small ncRNAs such as snoRNAs and tRNAs. In particular tRNA, almost undetectable by conventional RNA-seq, is found to be one of the most abundant biotypes (22%). Furthermore, the 20 most abundant genes are tRNAs and snoRNAs using TGIRT-seq, whereas protein coding genes compose the majority of the ranking using standard RNA-seq. This confirms the sensitivity of TGIRT-seq for highly structured RNA previously observed in human studies. Our results also question the current understanding of the transcriptome of S. cerevisiae.
Short Abstract: To understand the biological factors driving cancer, the regulatory circuitry of genes needs to be discovered. Recently, a new gene regulation mechanism called competing endogenous RNA (ceRNA) interactions has been discovered. Certain RNAs targeted by common microRNAs (miRNAs) “compete” for these miRNAs, thereby regulating each other by freeing one another from miRNA regulation. Several computational tools have been published to infer ceRNA networks. In most existing tools, however, expression abundance and the groupwise effect of ceRNAs are not considered. In this study, we developed a computational pipeline named Crinet to infer cancer-associated ceRNA networks while addressing these critical drawbacks. Crinet considers lncRNAs, pseudogenes and mRNAs as potential ceRNAs and incorporates a network deconvolution method to exclude the amplifying effect of ceRNA pairs. We tested Crinet on breast cancer data in TCGA. Crinet inferred reproducible ceRNA interactions and groups, which were significantly enriched in cancer-related genes and biological processes. We validated our ceRNA interactions using protein expression data. Crinet outperformed existing tools in predicting gene expression change in knockdown assays. The top high-degree genes in the inferred network included known suppressor/oncogene lncRNAs of breast cancer, showing the importance of including noncoding RNAs in ceRNA inference.
Short Abstract: Third-generation (PacBio/Oxford Nanopore) transcriptomics generates long reads, which, in contrast to short-read sequencing, have the potential to resolve and quantify complex alternative isoforms. Because long-read RNA sequencing is still new, few computational methods analyse such data; the SQANTI pipeline is an exception, and it is designed primarily for PacBio data.
Here, we present a software tool called IsoQuant for reference-based analysis of long error-prone reads. IsoQuant assigns reads to annotated isoforms based on their intron and exon structure, and further performs gene and isoform quantification. For high-error-rate data, the algorithm uses inexact intron and exon matching, which accurately resolves various error-induced alignment artifacts, such as skipped short exons or shifted splice sites.
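The idea of inexact intron matching can be sketched as follows. This is a simplified illustration, not IsoQuant's algorithm: it only tolerates shifted splice sites (not skipped short exons), and the tolerance delta, the function names and the coordinates are assumptions made for the example.

```python
def introns_match(read_introns, isoform_introns, delta=6):
    """True if both intron chains have the same length and every splice site
    of the read lies within delta bases of the annotated one."""
    if len(read_introns) != len(isoform_introns):
        return False
    return all(
        abs(read_start - iso_start) <= delta and abs(read_end - iso_end) <= delta
        for (read_start, read_end), (iso_start, iso_end) in zip(read_introns, isoform_introns)
    )


def assign_read(read_introns, isoforms, delta=6):
    """Return the sorted ids of annotated isoforms compatible with the read."""
    return sorted(
        name for name, introns in isoforms.items()
        if introns_match(read_introns, introns, delta)
    )


# Made-up annotation: two isoforms differing in the start of the second intron.
isoforms = {
    "tx1": [(100, 200), (300, 400)],
    "tx2": [(100, 200), (350, 400)],
}
# A read whose splice sites are shifted by a few bases, as alignment of
# error-prone reads might produce.
noisy_read = [(102, 199), (301, 403)]
assigned = assign_read(noisy_read, isoforms)
```

With exact matching (delta=0) the noisy read would be left unassigned. With a small tolerance it is attributed to tx1, while tx2 is still excluded because its second intron start differs by far more than the tolerance.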
To estimate the accuracy of IsoQuant, we simulated several Nanopore and PacBio datasets based on mouse and human transcriptomes. For low-error reads (e.g. PacBio CCS), both IsoQuant and SQANTI2 show near-perfect accuracy, but for high-error data with complex artifacts (such as Oxford Nanopore, for which SQANTI2 was not designed), IsoQuant’s inexact intron/exon matching yields a strong improvement.
IsoQuant is open-source software implemented in Python and is available at github.com/ablab/IsoQuant.
Short Abstract: A messenger RNA (mRNA) vaccine has emerged as a promising direction to combat the current COVID-19 pandemic. This requires an mRNA sequence that is stable and highly productive in protein expression, features which have been shown to benefit from greater mRNA secondary structure folding stability and optimal codon usage. However, sequence design remains a hard problem due to the exponentially many synonymous mRNA sequences that encode the same protein. We reduce this problem to a classical problem in formal language theory and computational linguistics that can be solved in O(n^3) time, where n is the mRNA sequence length. This algorithm could still be too slow for large n, so we further developed a linear-time approximate version, LinearDesign, which can compute the approximate MFE mRNA sequence for the SARS-CoV-2 spike protein in 11 minutes using beam size b = 1,000, with only a 0.6% loss in free energy change compared to exact search, which takes 1 hour. We also develop two algorithms for incorporating codon optimality, one based on k-best parsing to find alternative sequences and one directly incorporating codon optimality into the dynamic programming. Our work provides efficient computational tools to speed up and improve mRNA vaccine development.
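The size of the design space can be made concrete with a short sketch that counts the synonymous mRNA sequences of a peptide as the product of per-residue codon degeneracies in the standard genetic code. The function name and the example peptide are made up for illustration; the stop codon is ignored.

```python
import math

# Number of codons per amino acid in the standard genetic code (61 sense codons).
CODON_DEGENERACY = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4, "H": 2, "I": 3,
    "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6, "T": 4, "W": 1, "Y": 2, "V": 4,
}


def count_synonymous_mrnas(protein):
    """Number of distinct coding sequences that translate to the given protein."""
    return math.prod(CODON_DEGENERACY[residue] for residue in protein)


# A made-up 10-residue peptide.
peptide = "MKLSVRTGAW"
n_sequences = count_synonymous_mrnas(peptide)
# Order of magnitude if the same peptide were repeated 100 times (1000 residues).
log10_for_long_protein = 100 * math.log10(n_sequences)
```

Because the count is a product over residues, it grows exponentially with protein length, which is why exhaustive enumeration is infeasible and a dynamic programming formulation is attractive.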
Short Abstract: The etiology of Parkinson’s disease (PD) is largely unknown. Genome-wide transcriptomic studies in bulk brain tissue have identified several molecular signatures associated with the disease, with the most consistent alterations in pathways related to energy metabolism/mitochondrial function and protein degradation, followed by synaptic transmission, vesicle trafficking, lysosome/autophagy and neuroinflammation. These studies are, however, limited by two major confounders: post-mortem RNA degradation and heterogeneous cell-type composition. We performed RNA-seq following ribosomal RNA depletion in the prefrontal cortex of 49 individuals from two independent cohorts. Using cell-type-specific markers, we estimated relative cell-type composition across samples and included these estimates in the differential expression models to account for cell-type variability. Our results indicate that ribosomal RNA depletion results in substantially more even coverage compared to poly(A) capture. We show that cell-type composition is a major confounder of differential gene expression analysis in the PD brain. Accounting for variability in cellularity attenuates numerous cell-type-specific transcriptomic signatures that have been previously associated with PD, including mitochondrial function, vesicle trafficking, synaptic transmission, and immune function. Conversely, pathways related to endoplasmic reticulum, lipid oxidation and unfolded protein response are strengthened and surface as the top differential gene expression signatures in the PD prefrontal cortex.
Short Abstract: While the effects of confounders on gene expression analysis have been extensively studied, there is a lack of equivalent analysis and tools for RNA splicing analysis. Here we assess the effect of confounders in two large public RNA-Seq datasets (TARGET, ENCODE), develop a new method, MOCCASIN, to correct the effect of both known and unknown confounders on RNA splicing quantification, and demonstrate MOCCASIN’s effectiveness on both synthetic and real data.
Short Abstract: Aberrant pre-mRNA alternative splicing (AS) is widespread in cancer, but the causes and consequences of AS dysregulation during cancer progression are not well understood. We developed a novel computational framework, PEGASAS, as a pathway-guided approach for examining the effects of oncogenic signaling on exon incorporation. PEGASAS was designed to study the interplay among oncogenic signaling, AS, and affected biological processes. In this study, we applied PEGASAS to define the AS landscape across prostate cancer disease states and the relationship between splicing and known driver alterations. We compiled a meta-dataset of RNA-seq data of 876 tissue samples from publicly available sources, covering a range of disease states, from normal tissues to aggressive metastatic tumors. PEGASAS analysis revealed a correlation between Myc signaling and splicing changes in RNA binding proteins (RBPs), suggestive of a previously undescribed auto-regulatory phenomenon. We experimentally verified this result in a human prostate cell transformation assay. Our findings establish a role for Myc in regulating RNA processing by controlling incorporation of nonsense mediated decay-determinant exons in RBP-encoding genes. In conclusion, PEGASAS can mine large-scale transcriptomic data to connect changes in pre-mRNA AS with oncogenic alterations that are common to many cancer types.
Short Abstract: Long noncoding RNAs (lncRNAs) play a key role in many cellular processes including chromatin regulation. To modify chromatin, lncRNAs often interact with DNA in a sequence-specific manner, forming RNA:DNA triple helices. Computational tools for triple helix search do not always provide genome-wide predictions of sufficient quality. Here, we used four human lncRNAs (MEG3, DACOR1, TERC and HOTAIR) and their experimentally determined binding regions to evaluate the triple helix parameters used by the Triplexator software. We find that a minimum length of 10 nt, a maximum error rate of 20% and a minimum G-content of 70% or 40% provide the highest accuracy of triple helix prediction in terms of area under the curve (AUC). Additionally, we combined triple helix prediction with the lncRNA secondary structure and demonstrated that considering only single-stranded fragments of lncRNA, predicted by RNAplfold with thresholds of 0.95 or 0.5 for the probability of pairing, can further improve the quality of DNA-RNA triplex prediction, especially in the case of MEG3. This improvement can be explained by the number and characteristics of DBDs, the regions of lncRNA that form the majority of the triplexes, as detected by the TDF software.
Short Abstract: Eukaryotic genomes code for hundreds of RNA-binding proteins (RBPs), which regulate the fate of RNA from synthesis to degradation. These RBPs form an extensive network of condition-specific regulation, including direct and indirect regulation of other RBPs. This fact, combined with technological limitations, makes systematic experimental characterization of their effect on gene expression infeasible. Here, we propose to leverage the available expression measurements in diverse conditions to instead build predictive models for the effect of RBP knockdown on expression of other RBPs, capturing a condition-specific “RBP state”. We develop a model for prediction of knockdown effects via DNN experiments (Pokedex). We use an unsupervised learning approach where we construct a variational autoencoder for the expression of 531 RBPs. Training it on the naturally occurring variations of RBP expression across 53 GTEx tissues, we test its ability to predict knockdown effects in ENCODE experiments. Using the expression changes of RBPs compared to control as the test statistic, we show this DNN performs significantly better than a standard PCA-based linear model of variation. Finally, we make the learned model available to the RNA community through a web tool, RBP-Pokedex (tools.biociphers.org/rbp-pokedex), to predict the effect of pan-tissue RBP knockdowns.
Short Abstract: RNA structures possess multiple levels of structural organization. Secondary structures are made of canonical (i.e. Watson-Crick and Wobble) helices, connected by loops whose local conformations are critical determinants of global 3D architectures. Such local 3D structures consist of conserved sets of non-canonical base pairs embedded between canonical base pairs, called RNA modules. Their prediction from sequence data is thus a milestone towards 3D structure modelling. Unfortunately, the computational efficiency and scope of current 3D module identification methods are still too limited to benefit from all the knowledge accumulated in module databases.
Here, we introduce BayesPairing 2, a new sequence search algorithm leveraging secondary structure tree decomposition, which reduces the computational complexity and improves predictions on new sequences. BayesPairing identifies 3D modules in sequences via stochastic sampling of structural contexts: loops are identified in predicted secondary structures and matched to networks of non-canonical base pairs.
We benchmarked our methods on 75 modules and ~6000 RNA sequences, and report accuracies that are comparable to the state of the art, with considerable running time improvements. When identifying 200 modules on a single sequence, BayesPairing 2 is over 100 times faster than its previous version, opening new doors for genome-wide applications.
Short Abstract: Current studies look at genomic variants or differential gene expression to predict genetic predisposition and find biomarkers for many neurodevelopmental, psychiatric and degenerative disorders. Here we explore a novel approach to elucidate potential diagnostic, prognostic and/or therapeutic biomarkers for major depressive disorder and suicide, focusing on a more nuanced aspect of transcriptome diversity: RNA editing. More specifically, we focus on adenosine deaminase acting on RNA (ADAR) editing, which contributes to transcriptome diversity by dynamically altering the ratios of differentially functioning proteins, underpinning the “fine-tuning” of neural signaling and synaptic plasticity. We use an item response theory model, the Guttman scale, to create ADAR editing landscape profiles, which are then used to map differential editing in major depressive disorder and suicide. We found a handful of genes of interest for further investigation into their direct contribution to neurological symptoms. We also highlight pathways, including ion homeostasis, that are altered in depression and suicide and that warrant further investigation for their role in synaptic plasticity. Furthermore, we provide evidence that this model can be used in biomarker discovery for many other neurological disorders in which transcriptome diversity plays an important role.
Short Abstract: RNA secondary structure prediction is one of the oldest computational problems in bioinformatics, which has been studied for more than 40 years. Usually, researchers tend to utilize dynamic programming to resolve it, which can be relatively slow, with the F1 score being around 0.6 and having difficulty in handling pseudoknots. In this paper, we address it from an entirely new angle, viewing it as a translation with constraints problem. We propose a novel end-to-end deep learning model, called E2Efold, which has the problem-specific constraints embedded in the network architecture. The core idea of E2Efold is to predict the RNA base-pairing matrix directly, and use an unrolled algorithm for constrained programming as the template for deep architectures to enforce constraints. With comprehensive experiments on benchmark datasets, we demonstrate the superior performance of E2Efold: it predicts significantly better structures compared to the previous state-of-the-art methods (especially for pseudoknotted structures), with the F1 score being around 0.8, while being as efficient as the fastest algorithm in terms of inference time. The original paper has been published as an oral paper in ICLR 2020. The code of E2Efold is available at github.com/ml4bio/e2efold.
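The F1 scores quoted in this abstract compare predicted and reference base pairs. A minimal sketch of that metric follows, assuming a structure is represented either as a set of (i, j) index pairs or as a symmetric 0/1 base-pairing matrix; the function names are hypothetical and the exact evaluation protocol of the paper (for example, any tolerance for shifted pairs) is not reproduced here.

```python
def pairs_from_matrix(matrix):
    """Read base pairs (i, j), i < j, from a symmetric 0/1 pairing matrix."""
    n = len(matrix)
    return {(i, j) for i in range(n) for j in range(i + 1, n) if matrix[i][j]}


def base_pair_f1(predicted, reference):
    """Harmonic mean of precision and recall over exact base-pair matches."""
    pred = {tuple(sorted(p)) for p in predicted}
    ref = {tuple(sorted(p)) for p in reference}
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    reference = {(0, 8), (1, 7), (2, 6)}
    predicted = {(0, 8), (1, 7), (2, 5)}  # one pair is off by one position
    print(round(base_pair_f1(predicted, reference), 3))
```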
Short Abstract: Over half of the non-coding genome is composed of transposable elements, which include a large group called Short Interspersed Nuclear Elements (SINEs). B2 in mouse and Alu in human are the most frequent SINE families. Despite the apparent nonfunctionality of these elements, many of them are transcribed. These RNAs have been demonstrated to affect RNA POLII activity by binding it and suppressing transcription. Additionally, these SINE RNAs have been shown to be processed at different rates, which releases POLII from repression; this process has been shown to be connected with amyloid pathology. To further investigate the connection between SINE RNA processing and amyloid pathology, we searched for factors that influence B2’s processing rates and structural integrity. We found post-transcriptional adenosine-to-inosine (A-to-I) editing to be a key candidate of interest. We have identified and quantified the rates of A-to-I editing in Illumina RNA-seq data produced from a variety of mouse and human samples and identified a unique editing profile underpinning amyloid pathology, which would further our understanding of human diseases like Alzheimer’s disease.
Short Abstract: The organ of Corti, the receptor organ for hearing, is formed by a variety of sensory hair cells (HCs) and supporting cells (SCs) within the cochlea. However, the gene regulation mechanisms of cochlea development are not fully understood.
The aim of this study is to identify regulatory elements controlling the differentiation and maturation of the organ of Corti. To achieve this goal, we generated scATAC-seq and scRNA-seq libraries from postnatal day 2 organ of Corti preparations divided into apical and basal compartments. By integrating the scRNA-seq data, we assigned cell types to the scATAC-seq data by calculating a Jaccard similarity matrix, identified cell type-specific transcription factors (TFs), classified them as activators or repressors based on function, and further validated them by footprinting.
Focusing on HCs, we reconstructed the organ’s one-dimensional architecture from both scRNA-seq and scATAC-seq data. We identified novel differentially expressed genes along the tonotopic axis and validated them by RNAscope. Additionally, we identified TFs that drive HC differentiation and maturation by reconstructing developmental trajectories.
The results of this study enable us to understand how the epigenomic landscape delineates cellular identities and functions within the organ of Corti. Further studies will investigate regulatory elements driving SC maturation, which will contribute to regenerative strategies.
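The Jaccard-based cell type assignment mentioned in this abstract can be sketched as follows. The sketch assumes that each scATAC-seq cluster and each scRNA-seq cell type is summarized by a set of marker genes and that a cluster takes the label of its most similar cell type; the abstract does not specify which sets were compared, and the marker sets and function names below are illustrative only.

```python
def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two collections of gene names."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def jaccard_matrix(atac_markers, rna_markers):
    """Rows: scATAC-seq clusters; columns: scRNA-seq cell types."""
    return {
        cluster: {cell_type: jaccard(genes, rna_genes)
                  for cell_type, rna_genes in rna_markers.items()}
        for cluster, genes in atac_markers.items()
    }


def assign_labels(matrix):
    """Give each cluster the cell type with the highest Jaccard index."""
    return {cluster: max(row, key=row.get) for cluster, row in matrix.items()}


if __name__ == "__main__":
    # Illustrative marker sets, not results from the study.
    rna_markers = {"HC": {"Atoh1", "Pou4f3", "Myo7a"},
                   "SC": {"Sox2", "Prox1", "Hes5"}}
    atac_markers = {"cluster_0": {"Atoh1", "Myo7a", "Gfi1"},
                    "cluster_1": {"Sox2", "Hes5"}}
    matrix = jaccard_matrix(atac_markers, rna_markers)
    print(assign_labels(matrix))
```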
Short Abstract: Altered pre-mRNA splicing may result in aberrations that phenocopy classical somatic mutations. Despite the importance of RNA splicing, most studies of acute myeloid leukemia (AML) have not broadly explored the means by which altered splicing may functionally disrupt genes associated with AML. To address this gap, we investigated the splicing variability of 70 AML-associated genes within RNA-Seq data from 29 in-house AML patient samples (PENN cohort). In brief, using the MAJIQ splicing quantification algorithm, we detected 40 highly variable splicing events across the patients of the PENN cohort, many of which are novel and reduce protein expression without changing overall transcript abundance. Splicing variability occurred independently of known cis-mutations, thus highlighting pathogenic mechanisms overlooked by standard genetic analyses. We also find these 40 splicing events to be significantly more variable within the ~400-patient BEAT-AML cohort than in normal CD34+ cells. Furthermore, hierarchical clustering revealed a high degree of correlation between 23 of these 40 splicing events in both the PENN and BEAT-AML cohorts, suggesting a pathogenic co-regulation that is not observed in normal CD34+ cells. Overall, our findings highlight underlying transcriptomic complexity across AML populations and demonstrate how previously unreported splicing variations contribute to protein dysregulation in AML.
Short Abstract: Long-read RNA-sequencing platforms such as PacBio and Oxford Nanopore have led to an explosion in discovery of transcript isoforms that were impossible to assemble with short reads. Current transcript model visualization tools are difficult to interpret on a genomic scale and complicate distinguishing similar isoforms.
We introduce the Swan Python library, which is designed for the analysis and visualization of transcript models. Swan offers a robust visualization suite for easily differentiating splicing events. Using a graphical model approach, Swan provides a platform to visually discriminate between transcript models and to identify novel exon skipping as well as intron retention events that are commonly missed in short read transcriptomics. Furthermore, Swan is integrated with flexible differential gene and transcript expression statistical tools that enable the analysis of full-length transcript models in different biological settings. We demonstrate the utility of this software by applying Swan to the HepG2 and HFFc6 human cell lines which have full-length PacBio transcriptome data available on the ENCODE portal. Swan found 4,503 differentially expressed transcripts, including 280 transcripts that are differentially expressed even though the parent gene is not. Swan provides a comprehensive environment to analyze long-read transcriptomes and produce high-quality publication-ready figures.
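How a transcript can be differentially expressed while its parent gene is not is easiest to see with a toy isoform switch. The sketch below is not Swan's API or its statistical test: it assumes per-transcript counts for two conditions and flags transcripts by a plain log2 fold-change threshold; all names and numbers are hypothetical.

```python
import math


def log2_fold_change(a, b, pseudocount=1.0):
    """log2 ratio of condition B over condition A, with a pseudocount."""
    return math.log2((b + pseudocount) / (a + pseudocount))


def switching_transcripts(counts_a, counts_b, threshold=1.0):
    """Find transcripts that change although their gene total does not.

    counts_a / counts_b: {gene: {transcript: count}} for two conditions.
    Returns (gene, transcript, log2_fold_change) tuples.
    """
    hits = []
    for gene, tx_a in counts_a.items():
        tx_b = counts_b.get(gene, {})
        gene_lfc = log2_fold_change(sum(tx_a.values()), sum(tx_b.values()))
        if abs(gene_lfc) >= threshold:
            continue  # the gene itself changes, so it is not a pure switch
        for tx in sorted(set(tx_a) | set(tx_b)):
            lfc = log2_fold_change(tx_a.get(tx, 0), tx_b.get(tx, 0))
            if abs(lfc) >= threshold:
                hits.append((gene, tx, lfc))
    return hits


if __name__ == "__main__":
    # Toy counts: geneX swaps its dominant isoform, geneY changes overall.
    cond_a = {"geneX": {"tx1": 90, "tx2": 10}, "geneY": {"tx1": 10}}
    cond_b = {"geneX": {"tx1": 10, "tx2": 90}, "geneY": {"tx1": 100}}
    for hit in switching_transcripts(cond_a, cond_b):
        print(hit)
```

A real analysis would replace the fold-change cut-off with a statistical test on replicated, normalized data.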
Short Abstract: Small nucleolar RNAs (snoRNAs) are a class of highly abundant non-coding RNAs mainly known for their involvement in ribosomal RNA maturation and processing. There are two distinct groups of snoRNAs: box C/D and box H/ACA. The canonical function of the former is to guide 2’-O-methylation, while that of the latter is to guide pseudouridylation. In the last two decades, however, many non-canonical functions of snoRNAs have been discovered, including modulation of alternative splicing and mRNA stability, chromatin remodeling, regulation of the oxidative stress response and protein activation. This might only be the tip of the iceberg, and snoRNAs are likely to be part of the bigger picture of cell regulation.
In 2016, three groups published different but similar methods to sequence RNA-RNA duplexes in the cell (PARIS, LIGR-seq and SPLASH). We re-analysed those three datasets using a single pipeline and created a network based on the snoRNA-RNA interactions. We integrated diverse data including 2’-O-methylation, pseudouridylation and CLIP-seq datasets as well as RNA-seq expression profiles and duplex interaction predictions. Using this information, we found many interesting snoRNAs with potential non-canonical functions, which we are investigating more extensively with other datasets and wet lab experiments.
Short Abstract: Long noncoding RNAs (lncRNAs) are starting to garner attention because of their potential role in regulating epigenetic processes through various mechanisms. They do so by interacting with biomolecules such as DNA, RNA, and proteins during the transcriptional and post-transcriptional phases. However, the function of fewer than 2,000 lncRNAs is known out of the ~270,000 transcripts identified [LncBook database].
Currently, wet lab protocols are available to capture lncRNAs, and complementary in silico methods for their identification also exist. However, we lack more generic tools that can be run on genome-scale, high-throughput data to both identify and functionally annotate lncRNAs with high accuracy. Machine learning models trained on validated data to make novel predictions hold promise for developing such methods.
Here, we present a deep learning model which improves the ab initio identification algorithms by taking into consideration both primary and secondary structural features along with interaction motifs on structures. To the best of our knowledge this is the first comprehensive pipeline that combines lncRNA identification information with prediction of the secondary structure and finds the binding sites on them. It is expected to result in a significant jump in the capacity to identify novel lncRNAs and their function.