Posters - Schedules
Posters at ISMB/ECCB 2021 will be presented virtually. Authors will pre-record their poster talk (5-7
minutes) and upload it to the virtual conference platform, along with a PDF of their poster, beginning July 19
and no later than July 23. All registered conference participants will have access to the posters and presentations
throughout the conference, and to the content until October 31, 2021. There are Q&A opportunities through a chat
function, and poster presenters can schedule small group discussions with up to 15 delegates during the conference.
Information on preparing your poster and poster talk is available at: https://www.iscb.org/ismbeccb2021-general/presenterinfo#posters
Ideally authors should be available for interactive chat during the times noted below:
Session A: Sunday, July 25 between 15:20 - 16:20 UTC
Session B: Monday, July 26 between 15:20 - 16:20 UTC
Session C: Tuesday, July 27 between 15:20 - 16:20 UTC
Session D: Wednesday, July 28 between 15:20 - 16:20 UTC
Session E: Thursday, July 29 between 15:20 - 16:20 UTC
Short Abstract: Culicine mosquitoes have large genomes, display extensive spatially periodic patterns of small RNA (piRNA) biogenesis, and vector human pathogenic viruses. Anopheline mosquitoes have compact genomes, exhibit weak small RNA biogenesis patterns, and transmit malaria but few relevant viruses. The piRNAs responsible for these differences in periodic ‘phasing’ patterns and their impact on mosquito viral immunity are unknown.
Current methodology for evaluating phasing strength relies upon by-eye ranking of piRNA phasing curves, which is unsuitable for classification at the genomic scale. To address this challenge, we developed a quantitative method to rate phasing patterns across mosquito small RNA datasets. This scoring mechanism measures seven features from a given piRNA phasing curve and weights these to form a single ranking metric. We then classify the curves by finding similar curves using a trained K-means clustering which optimizes the model feature weights.
We use this scoring mechanism to compare piRNA phasing patterns generated with both random subsampling of reads and segmentation according to genomic loci of small RNA datasets. This step will yield insight into genomic loci that drive periodic piRNA phasing patterns in culicine mosquitoes and elucidate the weakness in phasing patterns in anopheline mosquitoes.
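As a rough illustration of the scoring idea described in this abstract, the sketch below combines seven curve features into one ranking metric using a normalised weighted sum. The weighted-sum form, the function names `phasing_score` and `rank_curves`, and the example feature values are assumptions made for illustration; the abstract does not specify the features, the weights, or how the K-means step tunes them.

```python
from typing import Dict, List, Sequence


def phasing_score(features: Sequence[float], weights: Sequence[float]) -> float:
    """Combine per-curve features into one score (assumed form: weighted mean)."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(f * w for f, w in zip(features, weights)) / total_weight


def rank_curves(curves: Dict[str, Sequence[float]],
                weights: Sequence[float]) -> List[str]:
    """Return curve names ordered from strongest to weakest score."""
    return sorted(curves,
                  key=lambda name: phasing_score(curves[name], weights),
                  reverse=True)


# Hypothetical example: two curves, seven placeholder features each.
example_curves = {
    "weak_curve": [0.0] * 7,
    "strong_curve": [1.0] * 7,
}
equal_weights = [1.0] * 7
ranking = rank_curves(example_curves, equal_weights)
```

With equal weights the score reduces to the plain mean of the features; unequal weights would let some curve properties dominate the ranking.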
Short Abstract: Identification of cancer subtypes characterized by actionable genetic lesions is a pivotal step for developing treatment and improving clinical care. However, in heterogeneous diseases such as Acute Myeloid Leukemia (AML), subtype discovery can be challenging since mutation burden, which has traditionally been prioritized for this task, is low. Recent studies pointing to splicing aberrations in AML instead motivate splicing-based detection of cancer subtypes. We thus developed CHESSBOARD, an unsupervised machine learning algorithm, to identify “tiles” defined by a subset of splicing events and patient samples that represent disease subtypes. We applied our method to the beatAML dataset and found tiles that are enriched for splicing events that are affected by upstream regulator factors that have strong evidence of binding and differential behavior, including SRSF1 and HNRNPC. We also show that the tiles correlate with therapeutic response to Sorafenib and mutations in FLT3, NPM1 and CEBPA that define known subtypes. Finally, to further explore CHESSBOARD’s utility, we analyze the TARGET B-ALL data, discovering novel splicing subtypes characterized by patients with low relapse rates harboring a RUNX1-ETV6 fusion. Our results reveal novel mechanisms that drive tile formation and further confirm the translational importance of splicing aberrations in AML and related diseases.
Short Abstract: Spatial transcriptomics technologies produce imaging data at unprecedented resolution and scale, enabling insight into the spatial organization of RNA transcripts. Despite the recent influx of methods for tasks such as identifying cell types, characterizing tissue heterogeneity and predicting ligand-receptor interactions, methods for studying structure at the subcellular scale are lacking. We present Bento, a Python toolkit for subcellular spatial analysis of RNA, implementing tools for ingesting high-throughput multiplexed spatial transcriptomics data, visualization, detection of RNA localization patterns, identification of colocalizing transcripts, and RNA regulon analysis. We applied our methods to analyze multiple spatial transcriptomics datasets, including seqFISH+ data (10k genes in ~200 fibroblast cells) and MERFISH data (130 genes quantified in ~1000 U2-OS cells) to show how subcellular spatial analysis can be leveraged to understand RNA localization and behavior. We also show that Bento is scalable and flexible, able to work with datasets generated from various technologies. The package has an accessible API built on top of existing single-cell tools, and its runtime and memory usage are optimized for analysis of entire datasets on the average work laptop.
Short Abstract: Corals are composed of thousands of different organisms that coexist as a holobiont. This requires a very particular balance between all of these individual pieces, and even a slight change in the coral's immune response can produce massive shifts that affect the health of the holobiont. Differential gene expression experiments using RNAseq have been used to identify the important genes involved in various coral stress responses, borrowing bioinformatics tools developed for human studies. Extensive benchmarking has been done on alternative RNAseq pipelines in humans, where the quality of the reference genome is much higher. We compare three different RNAseq analysis pipelines for detecting differential gene expression in the coral species Pocillopora damicornis, using RNAseq data from a previously published study on corals exposed to an endotoxin to simulate immune system stress. We find some variation in both the individual genes and the pathways that are found to be differentially expressed in each pipeline. Our preliminary results indicate that, overall, results appear biologically consistent enough that each tool recovers some consistent signal, but in this context, the pipeline we tested that operates primarily on the transcriptome and makes least use of the (preliminary) reference genome is the most sensitive.
Short Abstract: Breast cancer is the most commonly diagnosed cancer among women worldwide. It is classified into four subtypes, of which triple-negative breast cancer is the high-risk subtype. Genes and pathways have been characterized in cancer, but integrating them and their interactions across treated tissues or cells, in order to find new markers for better diagnosis of breast cancer patients, remains a challenge. We used and curated normal and tumor RNA-seq data from The Cancer Genome Atlas. We established criteria for sample exclusion and normalization (TMM+COMBAT) to avoid spurious samples, batch effects and outliers. Adopting approaches such as Rank Product and Stouffer, we combined multiple samples from different studies, allowing study-wise comparison through graphical and quantitative representations. Our user-friendly Shiny web tool supports and integrates multiple studies by meta-analysis approaches with interactive viewers. Well-established associations were confirmed, such as a higher correlation between Luminal A and Luminal B subtypes at the gene level and the higher relevance of pathways like Extracellular Matrix Organization, alongside new, promising, highly ranked associations. We have established a multifactorial meta-analysis approach to identify tumor-related molecular markers and signaling pathways. The interface improves visualization of dynamic tables, interactive heatmaps and graph connections from multiple studies.
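For readers unfamiliar with the Stouffer approach named in this abstract, the following is a minimal sketch of Stouffer's Z-score method for combining one-sided p-values from several studies. The function name `stouffer_combine`, the optional per-study weights, and the example p-values are assumptions for illustration; the abstract does not say how the authors apply or weight the method.

```python
from math import sqrt
from statistics import NormalDist
from typing import Optional, Sequence, Tuple

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def stouffer_combine(p_values: Sequence[float],
                     weights: Optional[Sequence[float]] = None
                     ) -> Tuple[float, float]:
    """Combine one-sided p-values; returns (combined z, combined p).

    Each p-value must lie strictly between 0 and 1, otherwise
    NormalDist.inv_cdf raises statistics.StatisticsError.
    """
    if not p_values:
        raise ValueError("at least one p-value is required")
    if weights is None:
        weights = [1.0] * len(p_values)
    if len(weights) != len(p_values):
        raise ValueError("weights and p_values must have the same length")
    # A small p-value maps to a large positive z-score.
    z_scores = [_STD_NORMAL.inv_cdf(1.0 - p) for p in p_values]
    combined_z = (sum(w * z for w, z in zip(weights, z_scores))
                  / sqrt(sum(w * w for w in weights)))
    combined_p = 1.0 - _STD_NORMAL.cdf(combined_z)
    return combined_z, combined_p


# Hypothetical example: the same gene tested in two independent studies.
z, p = stouffer_combine([0.05, 0.05])
```

Two studies that each give modest evidence in the same direction yield a combined p-value smaller than either input, which is the behaviour a meta-analysis across studies relies on.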
Short Abstract: A G-quadruplex (G4) is a special structure formed from a G-rich sequence, and is enriched in UTRs of mRNAs. Recent high-throughput sequencing technologies can generate profiles of RNA G4s (rG4s), but their biological roles still remain unclear. In this poster abstract, we present D-Quartet, a deep convolutional network approach to predicting potential rG4s from their sequence information. Exhaustive tests on augmented rG4-sequencing data show that D-Quartet outperforms an earlier neural network approach as well as scoring schemes in discriminative power.
Short Abstract: microRNAs (miRNAs) are short (∼23 nucleotide) single-stranded non-coding RNAs that negatively regulate gene expression, through transcript cleavage, degradation and/or translational suppression. DIANA-miTED is a web database offering abundance estimates of miRNAs, as obtained via consistent analysis from scratch of thousands of small RNA-Seq (sRNA-Seq) datasets. The DIANA-miTED collection features information for >240 cell-lines from Sequence Read Archive (SRA) and for >190 healthy/disease tissues or organs derived and efficiently catalogued from 32 projects of The Cancer Genome Atlas project (TCGA), as well as SRA. In total, users can retrieve expression values of 2656 miRNAs (miRBase) and relevant meta-information from 15,183 analyzed experiments. Pre-processing and analysis was performed following a well-defined sRNA-Seq analysis workflow, “DIANA-mAP” (Alexiou et al., Genes, 2021). The rich visualization capacities of miTED (e.g. result-specific grouped boxplots, pie charts) enable direct inspection of the distribution of requested data across variables, such as health state and gender. Sankey diagrams are provided, enabling exploratory analyses of the relationships between variables (Tissue-anatomical locations, Tissue-Diagnosis etc). miTED is connected with other DIANA-Tools (www.dianalab.gr/microrna-research/), facilitating tissue-specific miRNA-target analyses and functional explorations. miTED currently constitutes the largest systematic context-specific miRNA abundance resource, filling an existing gap in non-coding RNA research.
Short Abstract: The conservation of circular RNAs (circRNAs) between closely related species remains unclear. In the particular case of primates, it is known that many primate genes produce circular RNAs, but their extent of conservation is unknown. By comparing tissue-specific transcriptomes across over 70 million years of primate evolution, we identify that within 3 million years circRNA expression profiles diverged such that they are more related to species identity than organ type. Nonetheless, our analysis also revealed a subset of circRNAs with conserved neural expression across tens of millions of years of evolution. These circRNAs are defined by an extended downstream intron that has shown dramatic lengthening during evolution due to the insertion of novel retrotransposons. Our work provides comparative analyses of the mechanisms promoting circRNAs to generate increased transcriptomic complexity in primates.
Short Abstract: A-to-I RNA editing diversifies the transcriptome and has multiple downstream functional effects. We analyze matched genetic and transcriptomic data in 49 tissues across 437 individuals to identify RNA editing events that are associated with genetic variation. Using an RNA editing quantitative trait loci (edQTL) mapping approach, we identify 3117 unique RNA editing events associated with a cis genetic polymorphism. Fourteen percent of these edQTL events are also associated with genetic variation in their gene expression. A subset of these events are associated with genome-wide association study signals of complex traits or diseases. We find that certain microRNAs are able to differentiate between the edited and unedited isoforms of their targets. Furthermore, microRNAs can generate an expression quantitative trait loci (eQTL) signal from an edQTL locus by microRNA-mediated transcript degradation in an editing-specific manner. By integrative analyses of edQTL, eQTL, and microRNA expression profiles, we computationally discover and experimentally validate edQTL-microRNA pairs for which the microRNA may generate an eQTL signal from an edQTL locus in a tissue-specific manner. Our work suggests a mechanism in which RNA editing variability can influence the phenotypes of complex traits and diseases by altering the stability and steady-state level of critical RNA molecules.
Short Abstract: RNA splicing is a key step of gene expression in higher organisms. Accurate quantification of the two-step splicing kinetics is of high interest not only for understanding the regulatory machinery, but also for estimating the RNA velocity in single cells. However, the kinetic rates remain poorly understood due to the intrinsic low content of unspliced RNAs and its stochasticity across contexts. Here, we estimated the relative splicing efficiency across a variety of single-cell RNA-Seq data with scVelo. We further extracted three large feature sets, including 92 basic genomic sequence features, 65,536 octamers and 120 RNA-binding protein features, and found they are highly predictive of RNA splicing efficiency across multiple tissues in human and mouse. A set of important features has been identified with strong regulatory potential for splicing efficiency. This predictive power brings promise to reveal the complexity of RNA processing and to enhance the estimation of single-cell RNA velocity.
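The figure of 65,536 octamers corresponds to every possible 8-nucleotide word over a four-letter alphabet (4^8 = 65,536). The sketch below shows one way such a feature vector could be built by counting overlapping octamers in a sequence. The function name `octamer_counts`, the use of raw counts rather than presence/absence or frequencies, and the handling of ambiguous bases are assumptions; the abstract does not describe the encoding.

```python
from itertools import product
from typing import List

K = 8
ALPHABET = "ACGT"

# Enumerate all 4**8 = 65,536 octamers in a fixed (lexicographic) order.
KMERS = ["".join(letters) for letters in product(ALPHABET, repeat=K)]
KMER_INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def octamer_counts(sequence: str) -> List[int]:
    """Count overlapping octamers; windows with non-ACGT letters are skipped."""
    seq = sequence.upper().replace("U", "T")  # accept RNA or DNA input
    counts = [0] * len(KMERS)
    for start in range(len(seq) - K + 1):
        index = KMER_INDEX.get(seq[start:start + K])
        if index is not None:
            counts[index] += 1
    return counts


# Hypothetical 9-nt sequence: it contains exactly two overlapping octamers.
features = octamer_counts("ACGUACGUA")
```

The resulting vector is very sparse for short sequences, so a real pipeline would likely store it in a sparse format before passing it to a model.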
Short Abstract: The codon composition of messenger RNAs (mRNAs) imposes regulatory information that strongly affects transcript stability, allowing cells to fine-tune protein expression. Current codon optimization methods revolve around codon usage frequency, despite the fact that it weakly correlates with mRNA stability. Here, we trained a machine learning model with mRNA stability profiles from several vertebrate species to predict mRNA stability based on the regulatory properties of codon composition. Using this model, we developed www.iCodon.org, a web interface that predicts mRNA stability, and customizes gene expression by introducing synonymous codon substitutions. To validate the potential of iCodon, we constructed twelve EGFP variants ranging in levels of predicted mRNA stability. Transfection of these variants in human cells revealed that mRNA stability predictions correlated with fluorescence intensity and captured a range of nearly 50-fold differences in gene expression. Additionally, zebrafish embryos injected with these EGFP variants recapitulated the human cells results, demonstrating that iCodon can also modulate gene expression in vivo. In conclusion, iCodon provides a powerful tool to interrogate mRNA stability and design strategies to modulate gene expression in vertebrates, for a wide range of applications for research, and for the potential optimization of RNA-based therapeutics and vaccines.
Short Abstract: MicroRNAs are involved in many biological contexts, including pregnancy. The identification of genetic variants influencing gene expression, known as expression quantitative trait loci (eQTL), helps to understand the underlying molecular determinants of complex disease. Using data from the Genetics of Glucose Regulation in Gestation and Growth (Gen3G) Cohort, we investigated associations between SNPs and microRNA plasma levels during the first trimester of pregnancy. We used 369 samples from which maternal genotypes and full microRNA quantification were both available to identify 22,634 eQTLs involving 149 unique microRNAs.
Using these eQTLs we built genetic risk scores (GRS) using elastic-net regressions to select the most relevant SNPs. For about half of the selected microRNAs, the GRS capture more than 10% of the plasma level variance.
We also applied Mendelian randomization using eQTLs involved in GRS and found associations between the levels of circulating microRNAs during the first trimester of pregnancy and pregnancy complications reported in the Gen3G cohort, including gestational diabetes mellitus.
Our results highlight the potential of genetic instruments in predicting circulating microRNA levels associated with pregnancy complications. Such instruments can help in understanding the regulation of microRNA expression and the etiology of complex traits such as pregnancy complications.
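To make the genetic risk score (GRS) idea in this abstract concrete, the sketch below computes a score as a weighted sum of allele dosages, using only the SNPs that a penalised regression such as elastic net left with non-zero coefficients. The function name `genetic_risk_score`, the SNP identifiers, the dosages and the coefficients are all hypothetical; the abstract does not report the fitted models.

```python
from typing import Dict


def genetic_risk_score(dosages: Dict[str, float],
                       coefficients: Dict[str, float],
                       intercept: float = 0.0) -> float:
    """Predicted microRNA level: intercept + sum(coefficient * allele dosage).

    `coefficients` holds only the SNPs retained by the regression
    (non-zero weights). A SNP missing from `dosages` raises KeyError.
    """
    score = intercept
    for snp, beta in coefficients.items():
        score += beta * dosages[snp]
    return score


# Hypothetical individual: dosage = number of effect alleles (0, 1 or 2).
individual = {"snp_a": 2, "snp_b": 1, "snp_c": 0}
selected = {"snp_a": 0.5, "snp_b": -0.2}  # snp_c was not selected
score = genetic_risk_score(individual, selected)
```

The share of plasma-level variance captured by such a score would then be assessed by comparing predicted and measured microRNA levels across the cohort.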
Short Abstract: Mammalian cells generate >100 different RNA modifications that can change the base-pairing, RNA structures, or recruitment of RNA binding proteins. Pseudouridine-modified mRNAs are more resistant to RNase-mediated degradation and also have the potential to modulate immunogenicity and enhance translation in vivo. However, we have yet to understand the precise biological function of pseudouridine on mRNAs due to a lack of tools for their direct detection and quantification.
We have recently developed an algorithm for identifying pseudouridylated sites directly on mammalian mRNA transcripts using nanopore sequencing. We use our algorithm to classify 3 types of pseudouridine hyper-modification that may occur on mRNAs: Type 1 has a high percentage of pseudouridine at a given site; type 2 has >1 pseudouridine on a single read; type 3 has pseudouridine in addition to other modifications on a given read.
Our pipeline enables the direct identification and quantification of the pseudouridine modification on native RNA molecules. Further, the long read lengths allow multiple modifications to be detected on the same transcript, which allows layering of RNA modification data and RNA sequence information. Future applications of this pipeline will enhance our understanding of the biological impacts of pseudouridylation as they pertain to disease and development.
Short Abstract: Cancer is a set of diseases characterized by unchecked cell proliferation and invasion of surrounding tissues. The many genes that have been genetically associated with cancer or shown to directly contribute to oncogenesis vary widely between tumor types and it is not clear whether there exists a set of genes or other transcriptomic features commonly deregulated across several cancer types. We trained three feed-forward neural networks to predict the cancer state (healthy tissue versus tumor) of RNA-seq samples using either gene expression (protein-coding or lncRNA) or splice junction usage data on a set comprising 17 healthy tissue types and 18 solid tumor types. All three models achieve high precision (95.7% ± 2.1%) and high recall (97.3% ± 1.3%) across 14 datasets. Analysis of attribution values extracted from our models reveals that genes with high attribution values are evolutionarily conserved and are under strong selective pressure against loss of function. These findings suggest that the features making up the transcriptomic profile of cancer have essential cellular functions. Our results also highlight that deregulation of RNA-regulating genes and aberrant splicing are pervasive features across a large array of solid tumor types.
Short Abstract: The classical stochastic sampling algorithm generates secondary structures according to their probabilities in the Boltzmann ensemble. However, consisting of a bottom-up partition function phase followed by a top-down sampling phase, this algorithm suffers from three limitations: (a) the formulation and implementation are unnecessarily complicated; (b) the sampling phase recalculates many redundant recursions already done during the partition function phase; (c) the partition function run-time scales cubically with the sequence length. These issues prevent it from being used for very long RNAs such as the full genome of SARS-CoV-2. To address these problems, we present LinearSampling, an end-to-end linear-time sampling algorithm that is orders of magnitude faster than previous ones. More importantly, LinearSampling can scale up to full-length SARS-CoV-2 (29,903 nt), taking only 69.2 seconds. It finds 23 regions of 15 nt with high accessibilities, which can be potentially used for COVID-19 diagnostics and drug design.
Short Abstract: Despite significant steps in our understanding of Alzheimer’s disease (AD), many of the molecular processes underlying its pathogenesis remain largely unknown. Here, we focus on the role of non‐coding RNAs produced by small interspersed nuclear elements (SINEs). RNAs from SINE repeats, such as B2 RNAs in mouse and Alu RNAs in humans, control gene expression by binding RNA polymerase II and suppressing transcription. They also possess self‐cleaving activity that is accelerated through their interaction with certain proteins, disabling this suppression. Here, we show that, similar to their mouse counterparts, human Alu RNAs are processed in hippocampus and cortex, and the processing rate is increased in AD patients. This increased processing correlates with the activation of genes up‐regulated in AD patients, while increased intact Alu RNA levels correlate with down‐regulated gene expression in AD. In vitro assays show that processing of Alu RNAs is accelerated by HSF1. Overall, our data show that RNAs from SINE elements in the human brain show a similar pattern of deregulation during amyloid beta pathology as in mouse.
Short Abstract: Alzheimer’s disease is a widespread neurodegenerative disease, which features pervasive neurodegeneration of the brain. Most often, attempts to treat or detect the disease focus on amyloid-beta prion-like aggregation, widely thought to cause neurodegeneration; however, such attempts remain unsuccessful. Recent attention has been paid to the transcriptome for its involvement in neuro-cellular stress regulation and, in particular, non-coding RNAs have been shown to play essential roles in neurodegeneration and neuroprotection. The SINE non-coding Alu RNA is a self-cleaving RNA that is highly transcribed in brain tissue and is known to inhibit the stress response until a stimulus is present. Afterwards, Alu RNA is processed and a stress response proceeds. Here, we show how the SINE non-coding Alu RNA is aberrantly processed in Alzheimer’s disease patients in correlation with elevated stress response gene transcription. Furthermore, we show that many of the same genes are upregulated in a neural cell line when Alu RNA is targeted for breakdown using an antisense LNA, and that Alu RNA processing is accelerated by the Hsf1 protein. Our study contributes to the discussion of the involvement of non-coding RNAs in disease and, in particular, in neurodegeneration.
Short Abstract: 3’ untranslated regions (3’UTRs) play a critical role in controlling gene expression because they contain binding sites for microRNAs and RNA binding proteins (RBPs) that alter mRNA stability, translation, and localization. Most 3’UTRs in diverse organisms have multiple polyadenylation sites which are utilized in condition-specific manners through a process called alternative polyadenylation (APA). Although the breadth of APA in different biological contexts is known, the mechanisms driving the regulation of APA remain poorly understood. Therefore, identification of auxiliary factors that direct the core machinery and drive APA regulation in tissue- and disease-specific contexts warrants further attention.
To identify novel 3’end processing factors we analyzed data available for RBPs from ENCODE. We performed an integrative analysis of diverse algorithms in order to detect multiple patterns of APA from RNA-seq data. We applied this approach on the large set of over 350 RBP depletion experiments to detect significant shifts in 3’UTR isoforms. Integrating binding (eCLIP) and motif data with these functional targets identified several novel regulators of APA. Co-expression analysis shows altered expression of specific regulators is associated with APA shifts in certain cancers and erythropoiesis. We validated several targets of one of these novel regulators by knockdown followed by 3’RACE.
Short Abstract: Rheumatoid arthritis (RA) is a chronic, inflammatory and autoimmune disease affecting 1% of the worldwide population. It is characterized by an alternation of asymptomatic phases and symptomatic flares during which significant inflammation and destruction of the joints appears. The pathophysiology of the disease is still poorly understood and the available treatments can only alleviate its symptoms without inducing long-term remission. At present, there are no effective tools to predict the course and response to treatments.
RA is a heterogeneous entity resulting in not all patients responding positively to the same treatments. Our hypothesis is that this difference is due to a plurality of immune endophenotypes.
To conduct this study, single-cell RNA sequencing data has been generated from blood mononuclear cell samples of patients presenting with RA, prior to initiation of treatments. The point of using resolution at the cellular level is that different cell types are activated and act differently during symptomatic flares.
These datasets will be analyzed to find the most relevant biomarkers to create a tool capable of relating a new expression profile to a previously characterised endophenotype. This tool will make it possible to predict the response of a patient to treatment according to their expression profile.
Short Abstract: As the COVID-19 outbreak spreads, there is a growing need for an efficient tool to identify conserved RNA structures as critical targets for diagnostics and treatments. We present LinearTurboFold, an end-to-end linear-time algorithm that estimates conserved structures for a set of unaligned homologous RNA sequences. LinearTurboFold significantly improves structure prediction accuracy and achieves comparable alignment accuracy. LinearTurboFold uses the same iterative refinement of structure and alignment as TurboFold, but it is substantially faster than previous methods and it can globally fold full-length SARS-CoV-2 homologs without constraints on base-pairing distance. LinearTurboFold identifies conserved structures as potential binding domains for active small-molecule drugs and discovers accessible and sequence-conserved regions as promising targets for designing efficient siRNAs, CRISPR-Cas13 gRNAs, and RT-PCR primers as COVID therapeutics and diagnostics.
Short Abstract: Alternative splicing is a key regulatory process that allows multiple transcripts to be produced from a single gene. Splicing has been primarily studied on a high-throughput scale via RNA sequencing (RNA-Seq). However, most of the reads in a standard RNA-Seq experiment are not the junction spanning reads used for detecting splicing changes and new splice isoforms. To address this limitation, we developed a cost-effective, targeted RNA-Seq method to quantify splicing variations. Building on a previous method for detecting such splicing variations in yeast, our LSV-Seq method captures unannotated and complex splicing variations in the human transcriptome. First, we created a new pipeline based on the MAJIQ algorithm to identify targetable regions from previous RNA-Seq data. Next, we implement a model for selecting high-performance primers, based on input features including in silico predicted binding, primer sequence characteristics, and positional/nucleotide biases. During targeted library preparation, highly specific reverse transcription conditions prevent off-target amplification. Preliminary results on a small target pool demonstrate an overall median enrichment of 120-fold compared to standard RNA-Seq and a median enrichment of 8000-fold for lowly expressed genes. When fully optimized, we envision that our assay can quantify splicing with both bulk and single-cell resolution.
Short Abstract: Combining functional metrics with genomic data can better link phenotype to genotype to elucidate drivers of within-cell-type heterogeneity. We developed a photoconversion based technique to label and isolate cells grown in 3D culture based on visually measurable characteristics. This technique enables high throughput enrichment of phenotypically defined subpopulations for supervised single cell analysis. Using this approach, we isolated subpopulations of cells that exhibit invasive versus non-invasive modes of collective migration and subjected them to single cell RNA sequencing. Then, we developed a scRNA sequencing analysis pipeline integrating supervised and unsupervised clustering, gene expression pattern detection, and pseudotime trajectory analysis. Our analyses revealed that unsupervised clusters based on the most variably expressed genes did not recapitulate measured phenotypic characteristics, suggesting that the largest gene expression signals may not necessarily reflect the most obvious phenotypic differences. We also saw that non-negative matrix factorization-based approaches were capable of identifying expression programs that distinguished cells transitioning between phenotypes. We automated these computational analyses in a GenePattern notebook to provide a user-friendly interface, making it easy to apply to any scRNAseq data. This protocol thus provides an end-to-end strategy to characterize functional within-cell-type heterogeneity associated with visual phenotype-based morphological traits.
Short Abstract: Box C/D small nucleolar RNAs (snoRNAs) are a subfamily of small noncoding RNAs known to guide methylation of ribosomal and small nuclear RNAs. This function is well characterized and requires an interaction with specific regions of the snoRNAs. However, some snoRNAs do not have known canonical targets. Also, some snoRNAs exhibit noncanonical functions, such as regulation of alternative splicing and of mRNA stability, the deregulation of which has been implicated in diseases such as Prader-Willi syndrome and cancer. Such functions were shown in human, budding yeast and mouse.
Recently, a tool called snoGloBe was developed to predict both canonical and noncanonical box C/D snoRNA interactions in human. Transcriptome-wide prediction of the interactions of every expressed box C/D snoRNA in human using snoGloBe brought to light interesting interaction patterns for different snoRNAs regarding their localization (UTR, CDS, etc.), the (anti-)correlation of RNA levels between the snoRNAs and their targets, and some gene ontology enrichment of their predicted targets.
To further help the snoRNA community study their wide range of functions, we will evaluate the performance of snoGloBe on other model organisms, such as mouse and yeast, and compare the predictions with known interactions and high-throughput RNA-RNA interaction datasets from CLASH and MARIO.
Short Abstract: Background/Motivation: Small nucleolar RNAs (snoRNAs) are non-coding RNAs involved in multiple levels of gene regulation in eukaryotes (ribosome biogenesis, splicing, pre-mRNA stability, etc.). snoRNAs are located either in introns of host genes (HG) or within intergenic regions. The expression of intronic snoRNAs is presumed to depend on their HG transcription, and that of intergenic snoRNAs on the use of an independent promoter. We recently showed that fewer than 50% of all annotated human snoRNAs are expressed across human tissues, highlighting our lack of understanding of snoRNA abundance determinants.
Methods/Results: To characterize which factors modulate whether a snoRNA is expressed or not, we set out to build a model that predicts snoRNA expression in human tissues based on published low-structure-bias RNA-Seq datasets. The input features given to the algorithm are the snoRNA structure stability and conservation, the chromatin state of its surroundings, its location within the HG and HG characteristics (if applicable). After optimization, the principal features used by the model for classification will be identified.
Conclusion: By accurately predicting whether a snoRNA is expressed or not, the model we built will provide valuable insights into the main snoRNA abundance determinants.
Short Abstract: The multiple sequence alignment (MSA) is the entry point for many RNA structure modeling tasks, such as prediction of RNA secondary structure (rSS), molecular contacts and solvent accessibility (SA). Yet, there are few automated programs for consistent generation of high quality MSA for a target RNA. We developed rMSA, an automated five-stage approach for sensitive search and accurate alignment of RNA homologs from the standard RNAcentral and NCBI nucleotide collection (nt) databases for a target RNA. rMSA is benchmarked on a diverse set of 365 non-redundant and high-resolution RNA structures against four state-of-the-art programs (RNAcmap, Infernal, nhmmer, blastn). It significantly outperforms the state-of-the-art MSA programs by approximately 20% and 5% higher F1-score for rSS and contact prediction, respectively. Moreover, it is comparable to state-of-the-art for SA prediction. Our program enables the detection of conserved rSS in 3 lncRNAs (RepA, SRA and HOTAIR) previously claimed to lack evolutionary conserved base pairs. Detailed analysis suggests that the advantage of rMSA lies in its hierarchical search strategy, which progressively incorporates more diverse homologs at each stage while avoiding attraction of unrelated sequences. rMSA is available at github.com/pylelab/rMSA.
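The F1-score used in this abstract to compare secondary-structure and contact predictions is the harmonic mean of precision and recall. The sketch below computes it for predicted versus reference base pairs. The function name `base_pair_f1` and the example pairs are assumptions for illustration; the abstract does not state how predicted pairs are matched to the reference (for example, whether shifted pairs count as correct), so exact matching is assumed here.

```python
from typing import Iterable, Tuple

Pair = Tuple[int, int]


def base_pair_f1(predicted: Iterable[Pair], reference: Iterable[Pair]) -> float:
    """F1-score of predicted base pairs against a reference structure.

    Pairs are treated as unordered, so (i, j) and (j, i) are the same pair.
    Only exact matches count as true positives (an assumption).
    """
    predicted_set = {tuple(sorted(pair)) for pair in predicted}
    reference_set = {tuple(sorted(pair)) for pair in reference}
    true_positives = len(predicted_set & reference_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted_set)
    recall = true_positives / len(reference_set)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: one of two predicted pairs is in the reference.
f1 = base_pair_f1(predicted=[(1, 10), (2, 9)], reference=[(10, 1), (3, 8)])
```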
Short Abstract: Accurate predictions of RNA secondary structures can help uncover the roles of functional non-coding RNAs. Although machine learning-based models have achieved high performance in terms of prediction accuracy, overfitting is a common risk for such highly parameterized models. Here we show that overfitting can be minimized when RNA folding scores learnt using a deep neural network are integrated with Turner’s nearest-neighbor free energy parameters. Training the model with thermodynamic regularization ensures that folding scores and the calculated free energy are as close as possible. In computational experiments designed for newly discovered non-coding RNAs, our algorithm (MXfold2) achieves the most robust and accurate predictions of RNA secondary structures without sacrificing computational efficiency compared to several other algorithms. The results suggest that integrating thermodynamic information could help improve the robustness of deep learning-based predictions of RNA secondary structure. The MXfold2 source code is available at github.com/keio-bioinformatics/mxfold2/, and the MXfold2 web server is available for use at www.dna.bio.keio.ac.jp/mxfold2/.
Short Abstract: Small nucleolar RNAs (snoRNAs) are highly expressed short non-coding RNAs subdivided into two groups: box C/D and box H/ACA snoRNAs. The canonical function of the former is to guide 2’-O-methylation of ribosomal RNAs (rRNAs), whereas that of the latter is to guide pseudouridylation of rRNAs. During the last two decades, a growing body of evidence has shown that snoRNAs can also have non-canonical functions in a variety of cellular processes. To gain more insight into snoRNA non-canonical functions, we reanalysed datasets from high-throughput methods aimed at capturing RNA-RNA interactions in cells to create a snoRNA-RNA interaction network.
Surprisingly, we found that one third of the snoRNAs detected in our network interact with their host gene. Among these candidates, we identified SNORD2, a snoRNA embedded in an intronic sequence of the EIF4A2 gene. We discovered an alternative folding of the snoRNA with the downstream EIF4A2 intron sequence, producing a stable intermediate whose expression is strongly anti-correlated with the percent spliced in of a cassette exon located immediately downstream of the SNORD2 intron. In conclusion, we unveiled a novel alternative splicing mechanism involving regulation by embedded snoRNAs, which could be used by the cell to alter gene expression.
Short Abstract: Clustering is a crucial step in single-cell data analysis. Unsupervised graph-based clustering is the most prevalent technique for clustering single-cell data. After clustering, cell type labels are determined based on differentially expressed genes between clusters. In contrast, supervised methods use a reference panel of known transcriptomes to guide both clustering and cell type identification. Supervised and unsupervised clustering approaches have distinct merits and demerits, which leads to different but often complementary clustering results. Hence, a consensus algorithm leveraging the merits of both clustering paradigms could improve both clustering and cell type annotation.
We present scConsensus, a framework for generating a consensus clustering by (1) integrating results from both unsupervised and supervised approaches and (2) refining the consensus clusters using differentially expressed genes. scConsensus demonstrates a marked improvement in cell type detection in numerous CITE-Seq datasets and in a FACS-sorted PBMC data set.
scConsensus combines the merits of unsupervised and supervised clustering approaches to improve cluster separation and cluster homogeneity, thereby increasing our confidence in detecting distinct cell types. scConsensus is implemented in R and is freely available on GitHub at github.com/prabhakarlab/scConsensus.
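The integration step (1) can be pictured with a small sketch that cross-tabulates the two labelings. This is only an illustration under stated assumptions and is not the scConsensus implementation: the function name, the `min_cells` threshold and the toy labels are hypothetical, and the refinement by differentially expressed genes (step 2) is not shown.

```python
from collections import Counter


def consensus_labels(unsupervised, supervised, min_cells=2):
    """Combine two per-cell labelings into consensus labels.

    Each (cluster, cell type) combination with at least `min_cells` cells
    becomes its own consensus cluster; rarer combinations are merged into
    the dominant cell type of their unsupervised cluster.
    """
    pair_counts = Counter(zip(unsupervised, supervised))

    # Most frequent supervised label within each unsupervised cluster
    # (ties keep the label seen first).
    dominant = {}
    for (cluster, cell_type), n in pair_counts.items():
        if cluster not in dominant or n > pair_counts[(cluster, dominant[cluster])]:
            dominant[cluster] = cell_type

    labels = []
    for cluster, cell_type in zip(unsupervised, supervised):
        if pair_counts[(cluster, cell_type)] < min_cells:
            cell_type = dominant[cluster]
        labels.append(f"{cluster}|{cell_type}")
    return labels


# Toy example: eight cells, two unsupervised clusters, three reference cell types.
unsup = [0, 0, 0, 0, 1, 1, 1, 1]
sup = ["T", "T", "B", "B", "NK", "NK", "NK", "T"]
consensus = consensus_labels(unsup, sup)
```

In this toy case, cluster 0 is split into two consensus clusters because the reference separates T and B cells, while the single T-labelled cell in cluster 1 is absorbed into the dominant NK group.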
Short Abstract: We present a novel statistical framework for identifying differential distributions in single-cell RNA-sequencing (scRNA-seq) data between treatment conditions by modelling gene expression read counts using generalized linear models. We model each gene independently under each treatment condition using Poisson, Negative Binomial, Zero-inflated Poisson and Zero-inflated Negative Binomial error distributions with a log link function and model-based normalization for differences in sequencing depth. Model selection is done by calculating the Bayesian Information Criterion and the likelihood ratio test statistic. While most methods for differential gene expression analysis aim to detect a shift in the mean of expressed values, single-cell data are driven by over-dispersion and dropouts, requiring statistical distributions that can handle the excess zeros. By modelling gene expression distributions, our framework, scShapes, can identify subtle variations that do not involve a change in the mean. It also has the flexibility to adjust for covariates and perform multiple comparisons while explicitly modelling the variability between samples. Through simulation, we show that this framework is able to detect zero-inflated genes. When applied to real scRNA-seq datasets, it identified genes and pathways linked to the phenotype of interest that were not discovered through traditional analysis of transcriptomic data.
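As a rough illustration of BIC-based model selection for a single gene, the sketch below compares an intercept-only Poisson model with a zero-inflated Poisson (ZIP) model fitted by a simple EM loop. It is a simplified assumption-laden example, not the scShapes code: it omits the negative binomial variants, covariates, depth normalization and the likelihood ratio test, and the counts are made up. It assumes at least one non-zero count.

```python
import math


def poisson_logpmf(k, lam):
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def poisson_loglik(counts, lam):
    return sum(poisson_logpmf(k, lam) for k in counts)


def zip_loglik(counts, pi, lam):
    """Log-likelihood of a zero-inflated Poisson with zero-inflation probability pi."""
    total = 0.0
    for k in counts:
        if k == 0:
            total += math.log(pi + (1 - pi) * math.exp(-lam))
        else:
            total += math.log(1 - pi) + poisson_logpmf(k, lam)
    return total


def fit_zip(counts, iterations=200):
    """Estimate (pi, lam) by expectation-maximization."""
    n = len(counts)
    n_zero = sum(1 for k in counts if k == 0)
    total = sum(counts)
    pi, lam = 0.5 * n_zero / n, total / n
    for _ in range(iterations):
        # E-step: probability that an observed zero is a structural zero.
        z = pi / (pi + (1 - pi) * math.exp(-lam))
        # M-step: update the mixing weight and the Poisson mean.
        pi = n_zero * z / n
        lam = total / (n - n_zero * z)
    return pi, lam


def bic(loglik, n_params, n_obs):
    return n_params * math.log(n_obs) - 2 * loglik


# Hypothetical read counts for one gene across 20 cells, with many zeros.
counts = [0] * 12 + [3, 4, 5, 2, 6, 4, 3, 5]
n_cells = len(counts)

lam_pois = sum(counts) / n_cells  # Poisson maximum-likelihood estimate
pi_hat, lam_hat = fit_zip(counts)

bic_scores = {
    "poisson": bic(poisson_loglik(counts, lam_pois), 1, n_cells),
    "zip": bic(zip_loglik(counts, pi_hat, lam_hat), 2, n_cells),
}
best_model = min(bic_scores, key=bic_scores.get)  # lower BIC is preferred
```

The BIC adds a penalty of ln(n) per parameter, so the zero-inflated model is selected only when the improvement in likelihood outweighs its extra parameter.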
Short Abstract: Long-read transcriptomics requires understanding the error sources inherent to each technology. Current approaches cannot compare methods for an individual RNA molecule. Here, we combined barcoding strategies and long-read sequencing to sequence cDNA copies representing an individual RNA molecule on both Pacific Biosciences and Oxford Nanopore (ONT) platforms. We compared these long-read pairs in terms of sequence content and splicing structure. Although individual read pairs show high similarity, we found differences in (i) aligned length, (ii) polyA-tail length, (iii) TSS assignment, (iv) polyA-site assignment and (v) exon-intron structures. Overall, 25% of read pairs disagreed on either the TSS, the polyA-site or a splice site. Intron-chain disagreement typically arises from microexons and complicated splice sites. Our single-molecule technology comparison revealed that inconsistencies are often caused by sequencing-error-induced inaccurate ONT alignments, especially to downstream GTNNGT donor motifs. However, annotation-disagreeing upstream shifts in NAGNAG acceptors are often confirmed by both technologies and are thus real. We also analyzed non-barcoded ONT reads and confirmed that the intron number and the proximity of other GT/AGs predict inconsistency with the annotation better than read quality alone. Taken together, our novel technology comparison approach enables an accurate delineation of true isoform characteristics from sequencing and analysis errors in individual reads.
Short Abstract: Host response to viral infection can vary greatly, as seen in the current COVID-19 pandemic, and it is often unclear what defines the host response. One potential avenue of exploration is non-coding RNAs such as Short Interspersed Nuclear Elements (SINEs), which have recently been found to be involved in the stress response. A previously published Sus scrofa RNA-seq time-course dataset was chosen to explore the role of SINEs in viral infection and host response. Our analysis focused on the Pre1 sequence, which is analogous to the human Alu SINE, due to recent discoveries showing that these SINEs play a role in the stress response. Our analysis revealed two sets of Pre1 elements, at 4 and 21 days post infection (DPI), that are differentially expressed with respect to time. One of these sets is highly downregulated at 4 DPI and highly upregulated at 21 DPI. This occurs even though there is no significant difference in viral load between the resistant and nonresistant groups. Pathway analysis reveals that the Pre1 elements are located near genes associated with viral infection, innate immunity and response to virus, and ribosomal proteins, which supports a role for these SINEs in the host response.
Short Abstract: The complexity of the genetic component of cancers complicates outcome prediction. For decades, researchers have derived new biomarker signatures for cancer outcome prediction every year, with ever more markers and little overlap between them. For the same cancer type, different characteristics are statistically associated with different genes. Nevertheless, cancer development is based on distinctive biological processes responsible for tumour growth and invasion. The question arises whether it is possible to introduce a universal signature for the same cancer type. If so, is it possible to generalise that and construct a universal cancer signature?
We collected more than 130 high-quality microarray datasets for 13 cancer types and analysed several phenotypes in each dataset, such as prognosis, survival time, clinical stage, tumour size and treatment response. While investigating the similarity between published signatures, we did not discover a significant overlap. In contrast, our network-based signatures show a strong overlap. We improved the Netrank algorithm that we previously developed by considering the connectivity of genes based on protein-protein interaction links. Our approach supports the synergy between the statistical significance and the biological relevance of the genes regarding their relation to cancer development and the coverage of general cancer signals.
Short Abstract: Long non-coding RNAs (lncRNAs) are emerging players in gene regulation. Here, we identified a novel lncRNA, Tapir, that regulates the pluripotent state of embryonic stem cells (ESCs), cells that are necessary for proper embryo development. Knock-down of Tapir in ESCs leads to a decrease in pluripotency gene expression, whereas up-regulation of Tapir increases their expression. Moreover, Tapir accelerates the reprogramming of differentiated cells towards induced pluripotent stem cells, which is accompanied by rapid up-regulation of chromatin regulators and pluripotency gene expression. We also found that it directly interacts with several mRNAs implicated in pluripotency and chromatin regulation in ESCs. By screening for sequence complementarity, we identified a SINE retrotransposable element in the Tapir lncRNA that is also present, in the antisense orientation, in the intronic regions and 3’UTRs of several pluripotency genes, suggesting that Tapir functions through intermolecular hybridization with these mRNAs. Overexpression of a SINE-depleted form of Tapir in ESCs or during reprogramming significantly altered the pluripotency-enhancing effect of Tapir. This suggests that the SINE element is essential for Tapir’s function, possibly by affecting the regulation or translation of pluripotency-associated mRNAs. Our study highlights a new aspect of pluripotent stem cell regulation by lncRNAs.
Short Abstract: Dysregulation of miRNAs in biological processes underlies several diseases, including cancer. The objective of this study is to analyze miRNAs for their putative role as diagnostic, prognostic, and therapeutic markers in head and neck squamous cell carcinoma (HNSCC). We gathered a list of 69 miRNAs that were associated with HNSCC according to the miRBase, miRNet, and miRCancer databases. Next, we constructed miRNA-gene interaction networks using interaction data from miRNet to study the network topology, including the identification of hubs and highly connected modules within the network. This was followed by integration with miRNA expression data from the miRCancer database, which highlighted hubs in our network that were over-expressed in HNSCC. We identified hsa-miR-21-5p, hsa-miR-30a-5p, hsa-miR-106b-5p, hsa-miR-16-5p, and hsa-miR-93-5p as high-priority candidates for further validation of their putative roles in HNSCC initiation or progression. We analyzed the GO and KEGG Pathway annotations of these miRNAs, which showed them to be involved in regulatory and binding processes inside the nucleus and cytoplasm, and also in endometrial, pancreatic, and non-small cell lung cancers and cancer-related pathways. We intend to validate these results by undertaking in vitro experiments using anti-miR molecules against this prioritized set of up-regulated miRNAs in primary HNSCC cell lines at the Precision Medicine Lab.
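The hub identification and expression filtering described above could be sketched as follows, using node degree as the hub criterion. This is an illustrative assumption: the abstract does not state which topology measure was used, and the miRNA and gene names below are placeholders rather than real interaction data from miRNet or miRCancer.

```python
from collections import defaultdict


def mirna_degrees(edges):
    """Number of distinct target genes per miRNA in a miRNA-gene edge list."""
    targets = defaultdict(set)
    for mirna, gene in edges:
        targets[mirna].add(gene)
    return {mirna: len(genes) for mirna, genes in targets.items()}


def find_hubs(edges, top_n=2):
    """Return the top_n miRNAs ranked by degree (ties broken alphabetically)."""
    degrees = mirna_degrees(edges)
    ranked = sorted(degrees, key=lambda m: (-degrees[m], m))
    return ranked[:top_n]


# Placeholder interaction data; the duplicate edge is counted only once.
edges = [
    ("miR-A", "GENE1"), ("miR-A", "GENE2"), ("miR-A", "GENE3"),
    ("miR-B", "GENE1"), ("miR-B", "GENE2"),
    ("miR-C", "GENE3"),
    ("miR-A", "GENE1"),
]
hubs = find_hubs(edges, top_n=2)

# Keep only hubs that are also reported as over-expressed (placeholder set).
overexpressed = {"miR-B", "miR-C"}
prioritized = [m for m in hubs if m in overexpressed]
```

Other centrality measures, such as betweenness, could be substituted for degree without changing the overall filtering logic.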