Category N - 'Sequence Analysis'
Short Abstract: Non-coding RNAs are an area of intense scientific interest; however, the field has been hampered by the lack of a comprehensive collection of sequences representing all types of non-coding RNAs from all organisms. We begin to address this challenge by creating RNAcentral (http://rnacentral.org), a database that provides a unified view of data from over twenty RNA resources such as miRBase, RefSeq, Vega, Rfam, dictyBase, PomBase, WormBase, and others. The RNAcentral search functionality makes it easy to explore all ncRNA sequences, compare data across different resources, and discover what is known about each ncRNA. Using the RNAcentral sequence similarity search, one can search data from multiple RNA databases using a web interface, which is a unique capability worldwide. Where possible, sequences are mapped onto reference genomes from key species and made available in an integrated genome browser as well as in the Ensembl and UCSC genome browsers. In the near future we plan to integrate functional annotations of non-coding RNAs, such as intermolecular interactions, nucleotide modifications, and high-quality secondary structures. The ultimate goal of RNAcentral is to incorporate curated information about all non-coding RNAs, as UniProt does for proteins, and to provide a single entry point for anyone interested in ncRNA biology.
Short Abstract: Cervical small cell neuroendocrine tumors (CSCNETs) are an extremely rare and aggressive form of neuroendocrine tumors (NETs). Because reliable standardized diagnostic or prognostic markers are lacking to date, it is difficult to diagnose the disease and predict its progression, and treatment strategies are limited. For the first time we provide the mutation profile of five tumor-normal paired CSCNETs using whole exome sequencing. ATRX, ERBB4, and genes in the Akt/mTOR pathways were most frequently mutated. The Akt/mTOR signaling pathway displayed a common mutation signature across NETs, and downstream signaling was affected by ERBB4. When ERBB4 expression was examined in 16 CSCNETs using immunohistochemical staining, positive cytoplasmic ERBB4 expression was detected in tumor tissue but not in adjacent normal tissue. This result suggests that CSCNETs share the genetic characteristics of NETs and provides new insight for sufficient clinical management in patients by targeting ERBB4 and the Akt/mTOR signaling pathway axis.
Short Abstract: While many consider assembly to be a solved problem, unordered and fragmented assemblies with false joins are widespread, significantly hampering downstream analyses that depend on high-quality contiguous assemblies. In addition, because many genomic regions have low entropy, short-read-based approaches tend to fall short. Single molecule sequencing technologies capable of producing long reads offer relief from such issues, but with the downside of a high per-base error rate and the presence of large-scale sequencing artifacts. We provide an efficient end-to-end solution for assembling genomes with notoriously complex genomic structure using very long noisy reads with high error rates produced by single molecule sequencing technologies. Here we present MARVEL, an assembler capable of handling a wide range of genomes within the tree of life, ranging from the elemental Escherichia coli to the intricate Schmidtea mediterranea and beyond.
Short Abstract: Prokaryotic genomes typically consist of one circular chromosome of a few million base pairs encoding a few thousand genes. Multiple genes arranged in tandem with the same orientation are often transcribed into a single transcript by sharing the same promoter and terminator. Such co-transcribed genes are called an operon, and in most cases these genes have similar or coordinated biological functions and are involved in related biological pathways. Using RNA-seq techniques, multiple studies have attempted to redefine the operon maps of some bacterial and archaeal species; however, the majority of these studies have only investigated changes in operon structures under a single culture condition. Thus only a small portion of alternative operons has been revealed in these species, and the patterns and functional implications of alternative operon utilization in response to environmental conditions are largely unknown. In this research we determined the strand-specific transcriptomes of E. coli K12 at multiple time points in five culture conditions and propose a new statistical method to investigate operon utilization under multiple conditions. We found that 40% of the operons show alternate transcription of their genes. We have investigated the functional implications of this alternation for multiple candidate operons.
Short Abstract: Motivation: The Full-text index in Minute space (FM-index) derived from the Burrows–Wheeler transform (BWT) is broadly used for fast string matching in large genomes or huge sets of sequencing reads. Several graphics processing unit (GPU) accelerated aligners based on the FM-index have been proposed recently; however, the construction of the index is still handled by the central processing unit (CPU), is only parallelized at the data level (e.g., by performing blockwise suffix sorting on the GPU), or is not scalable to large genomes.
Results: To fulfill the need for a more practical, hardware-parallelizable indexing approach, we propose in this work a k-ordered FM-index based on a BWT variant (i.e., the Schindler transform) that can be built with highly simplified, hardware-acceleration-friendly algorithms and still suffices for accurate and fast string matching in repetitive references. In our tests, the proposed implementation achieves significant speedups in indexing and searching compared to other BWT-based tools and can be applied to a variety of domains.
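As a rough illustration of the BWT variant mentioned above, the following minimal Python sketch contrasts a naive full BWT with a k-ordered (Schindler) transform, in which rotations are sorted only by their first k symbols and ties keep their original order. The function names are our own, the sentinel convention (`$` as a unique terminator) is an assumption, and this is a didactic quadratic-time version, not the hardware-friendly construction the abstract describes.

```python
def bwt(text):
    """Naive BWT: sort all rotations fully, emit the last column."""
    n = len(text)
    order = sorted(range(n), key=lambda i: text[i:] + text[:i])
    # text[i - 1] is the symbol preceding rotation i (wraps around for i == 0)
    return "".join(text[i - 1] for i in order)


def schindler_transform(text, k):
    """k-ordered transform: sort rotations by their first k symbols only.

    Python's sort is stable, so rotations with equal k-prefixes stay in
    their original text order, which is the tie-breaking rule of the
    Schindler transform.
    """
    n = len(text)
    doubled = text + text  # lets us slice a k-prefix of any rotation
    order = sorted(range(n), key=lambda i: doubled[i:i + k])
    return "".join(text[i - 1] for i in order)


if __name__ == "__main__":
    s = "banana$"  # '$' is assumed to be a unique terminator
    print(bwt(s))
    for k in (1, 2, len(s)):
        print(k, schindler_transform(s, k))
```

Once k reaches the text length, the k-ordered output coincides with the full BWT, which is why small k trades some sorting work for a cheaper, more parallelizable build.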
Short Abstract: Variation analysis plays an important role in elucidating the causes of various human diseases. The massive volume of sequencing data imposes significant technical challenges for data management and analysis, including collection, storage, transfer, sharing, and privacy protection. Currently, each analysis group must download all the relevant sequence data into a local file system before variation analysis is initiated. This heavy-weight transaction not only slows down the pace of the analysis, but also creates financial burdens for researchers. To overcome such limitations and explore the feasibility of analyzing NCBI sequencing data in a cloud environment while maintaining data privacy and security, we introduce a cloud-based analysis framework that facilitates variation analysis using direct access to the NCBI Sequence Read Archive through the SRA Toolkit, which allows users to programmatically access data with encryption and decryption capabilities and converts it from the SRA format to the desired format for data analysis. A customized machine image (ngs-swift) with preconfigured tools, including the NCBI SRA Toolkit and other resources essential for variant analysis, has been created for instantiating an EC2 instance or instance cluster on the Amazon cloud. The performance of this framework has been evaluated using dbGaP study phs000710.v1.p1 (1000Genome Dataset in dbGaP, http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000710.v1.p1) and compared with that of a traditional analysis pipeline. Security handling in the cloud environment has been addressed. We demonstrate that with this framework, it is cost effective to make variant calls without first transferring the entire set of aligned sequence data into a local storage environment, thereby accelerating variant discovery.
Short Abstract: CsrA family RNA-binding proteins are widely distributed in bacteria and regulate gene expression at the post-transcriptional level. Pseudomonas aeruginosa has a canonical CsrA family member (RsmA) and a novel, structurally distinct variant (RsmF). To better understand RsmF binding properties, we performed parallel systematic evolution of ligands by exponential enrichment (SELEX) experiments for both RsmA and RsmF. The initial target aptamer was a 57 nt RNA transcript containing a central core randomized at 15 sequential positions. Most of the selected aptamers were the expected size and shared a common consensus sequence (CAnGGAyG). Concatemerized aptamers (80-140 nts) containing two consensus-binding sites were also identified. Representative short (single consensus site) and long (two consensus sites) aptamers were tested for RsmA and RsmF binding. Whereas RsmA bound the short aptamers with high affinity, RsmF was unable to bind the same targets. RsmA and RsmF both bound the long aptamers with high affinity. Mutation of either consensus GGA site in the long aptamers reduced or eliminated RsmF binding, suggesting a requirement for two binding sites. Based on our observations that high affinity binding by RsmF appears to require two binding sites, we used an in-silico approach to search for candidate RsmF targets in the P. aeruginosa genome. We queried a library of 5’ UTRs for potential targets of RsmF based on the number and positions of GGAs, and secondary structure. Experimental validation of potential targets yielded few direct targets for both RsmA and RsmF indicating that more than sequence and structure contribute to differential binding.
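The in-silico screen described above can be pictured with a small Python sketch that locates GGA motifs and consensus (CAnGGAyG) sites in 5' UTR sequences and keeps UTRs with at least two GGA sites. Everything here is an assumption for illustration: the function names, the `min_sites` threshold, and the example sequences are hypothetical, and the secondary-structure criterion mentioned in the abstract is not modeled.

```python
import re

# CAnGGAyG: n = any nucleotide, y = pyrimidine (C or U)
CONSENSUS = re.compile(r"(?=CA[ACGU]GGA[CU]G)")
GGA = re.compile(r"(?=GGA)")  # lookahead so overlapping hits are counted


def to_rna(seq):
    """Normalize to upper-case RNA so DNA-style input also works."""
    return seq.upper().replace("T", "U")


def gga_positions(seq):
    """0-based start positions of every GGA in the sequence."""
    return [m.start() for m in GGA.finditer(to_rna(seq))]


def consensus_positions(seq):
    """0-based start positions of every CAnGGAyG consensus match."""
    return [m.start() for m in CONSENSUS.finditer(to_rna(seq))]


def candidate_targets(utrs, min_sites=2):
    """Return names of UTRs carrying at least min_sites GGA motifs."""
    return [name for name, seq in utrs.items()
            if len(gga_positions(seq)) >= min_sites]


if __name__ == "__main__":
    utrs = {  # hypothetical sequences, for illustration only
        "utr_one_site": "AACAUGGACGUUUU",
        "utr_two_sites": "AAGGAUUUCAUGGACGAA",
    }
    print(candidate_targets(utrs))
```

A fuller screen would also constrain the spacing between sites and check that each GGA falls in a predicted loop, which this sketch leaves out.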
Short Abstract: Motivation: Functional annotation is a key step in biological data interpretation. The quality of this classification is directly related to the database used during the procedure. One of the most important databases for functional classification is KEGG ORTHOLOGY (KO). The KEGG hierarchy system facilitates biological contextualization into pathways or broader categories. Because KEGG contains only data from complete published genomes, the number of represented organisms and their entries is restricted. In this work, we present a procedure to enrich orthologous KO groups with UniProt entries to increase the number of proteins, and we evaluate the impact of this enrichment.
Results: After the enrichment, the number of protein entries increased from 4,604,372 to 18,035,804, representing 69,374 taxa. Using the UEKO database to annotate metagenomic sequences, we were able to exclusively characterize 59,424 genes. Another 80,129 predicted genes had their BLAST bit score increased when compared to KO.
Short Abstract: In today’s era of ever-increasing biological data, it is essential to maintain databases that not only integrate and cluster the data as a warehouse but also help analyze it. Such resources are invaluable tools for researchers, as they provide a useful platform to retrieve biological information such as sequences, structures, structure classes, and pathways, thus greatly aiding research. KIXBASE is a global repository as well as a prediction tool for KIX domains. These domains are present in coactivators and play a central role in the transcription process. KIXBASE has two main parts, a web server and a database. To date, there is no other web resource that provides information on KIX domains. The KIX prediction program detects KIX domains in any organism on the basis of profile hidden Markov models (HMMs) developed through alignment of known KIX sequences, along with additional filtering criteria for improving accuracy. KIXBASE incorporates widely used programs such as PSIPRED-3.5 and CLUSTALW2 in the web server for further annotation and quality assessment of the predictions. Users can upload batch entries of protein, genomic, or EST sequences in FASTA format, detect potential KIX domains, generate the secondary structure for a domain of interest, and examine conservation with other KIX domains. The backend prediction algorithm is also used to build the KIX database, which contains 1891 KIX proteins representing 427 organisms spanning metazoans, fungi, and plants, comprising the largest online collection of KIX sequences.
Short Abstract: During the last decade, the gap between sequence determination and functional annotation has increased dramatically, resulting in an incomplete understanding of the data we have generated. Automatic annotation pipelines ease the burden of manual annotation, but are limited in scope and coverage. Computational tools for proteome annotation are intrinsically conservative in assigning a definitive function (76% of proteins in UniProt/TrEMBL are annotated as “unknown” or “uncharacterized”) and tend to focus on specific aspects of the protein such as functional domains, signal peptide prediction, or the presence of transmembrane helices. Compartmentalizing the annotation gives a very specific characterization of one aspect of the protein, at the cost of losing the general overview of the protein's function and role in its environment. Integrating results from several databases and tools allows us to simultaneously question several related aspects of a protein's function. We have created a computational pipeline that combines the advantages of manual curation with the speed and power of bioinformatics. The pipeline allows the characterization of whole proteomes as well as single proteins. We applied the Integrative Cell Biology (ICB) pipeline to 40 bacterial proteomes belonging to the PVC superphylum and were able to increase the average proportion of annotated proteins from 54% to 78%. The pipeline is modular, open, and can be installed at your location or run on our server. We illustrate the advantages of ICB with detailed results. The system and results will be available at www.pvcbacteria.org/pvcbacteria.
Short Abstract: The third generation PacBio long reads could effectively address the read length issue of the second generation sequencing technology, but contain about 15% sequencing errors. Several error correction algorithms have been designed to efficiently reduce the error rate to 1%, but they have to discard large amounts of uncorrected bases.
Here we introduce HiBAM, a high base maintenance algorithm for long read error correction. It aligns the long reads to short read contigs with small local alignments, so that a target long read region could be aligned to both its corresponding contig region and its repeats' contig regions (repeat based alignment method). It then builds a contig graph to validate the alignments and find the most similar contig regions to make correction by referencing other long read alignments (long read support based validation method). Even though some target regions without the corresponding contig regions are corrected with their repeats' contig regions, this makes it possible to further refine these regions with the initial insufficient short reads and correct the uncorrected regions in between. In our performance tests on E. coli, A. thaliana and Maylandia zebra data, HiBAM was able to correct more long read bases and obtain 5.8-42.5% higher base maintenance ratio than the existing algorithms while keeping comparable accuracy. The HiBAM corrected long reads can also result in higher assembly quality than the existing algorithms.
The HiBAM software can be downloaded for free from this site: https://github.com/lanl001/hibam
Short Abstract: Because of Next Generation Sequencing (NGS) technology, a huge amount of genome information has been generated and analyzed. As the number of sequenced genomes has increased, many tools and pipelines have been developed to identify gene function, such as InterPro, Pfam, SignalP, PSortII, ChloroP, TMHMM2, and NetPhos. Furthermore, rapidly increasing genome information has been a major reason for the growth of genome-wide functional studies of genes of interest or gene families through comparative genomics. Although many web-based platforms for comparative genomics have been developed, their application has been limited because they allow only small numbers of genomes or tools. A web-based comparative genomics platform that handles large amounts of genome information and various tools is desirable for comprehensive genome-wide gene family studies, functional studies of genes of interest, and their evolutionary studies. Here we present Prometheus, the omics portal for comparative genomics. Prometheus is a web-based and cloud computing-based comparative genomics platform that contains information on more than 30,000 genomes, from prokaryotic to eukaryotic, with 3 primary and 20 secondary databases generated from various analyses such as InterPro, TargetP, and OrthoMCL. In addition, the My Genes system in Prometheus makes it possible to analyze genes of interest with various tools or with Chlosha, a cloud-based analysis pipeline in Prometheus, as well as Chlosha II, a version for advanced users that allows customized analysis pipelines. Furthermore, assembly and annotation pipelines for genomes and transcriptomes will be added to Prometheus in the near future.
Short Abstract: Recently the research community has come to appreciate the severe impact that many factors have on the variability of RNA Sequencing (RNA-Seq) data. The main problem with RNA-Seq pre-processing is that local variance introduces global variance. For example, one highly variable gene will introduce variance in all other expressed genes. Originally the primary factors of concern were the total read depth and the feature length, and FPKM normalization mitigates the variance from these two factors. However, many other factors that add to the global variance have been discovered, for example the ribosomal content, the mitochondrial content, the variable fragment length distribution, variable 3' bias, variable exon/intron/intergenic balance, and variable sense/anti-sense balance. Which factors matter depends on whether one is doing gene level analysis, isoform level analysis, or exon/intron/junction level analysis. All currently available normalization methods use a scaling approach that takes place after quantification, but this does not solve most of the problems adequately. Instead we have implemented a normalization pipeline called PORT that uses resampling at the read-alignment level, producing normalized SAM files. PORT starts with fastq files and produces as output a set of SAM files that have the same number of reads in each, as well as the same number of exon-mapping reads in each, the same number of intron-mapping reads in each, etc. These files are then used to quantify features, after which no further normalization is necessary. PORT is implemented in Perl, runs on a cluster, and is highly configurable.
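The resampling idea can be sketched in a few lines of Python: for each read category, every sample is randomly downsampled to the smallest count seen in any sample, so all outputs end up with identical per-category read numbers. This is only a toy model under our own assumptions (in-memory read identifiers, category labels already assigned, a hypothetical function name); it is not the PORT implementation, which is written in Perl and operates on alignment files.

```python
import random


def resample_to_common_depth(samples, seed=0):
    """Downsample reads so every sample has equal counts per category.

    samples maps sample name -> {category -> list of read identifiers}.
    Returns the same structure with each category cut down (by uniform
    random sampling without replacement) to the minimum size observed
    for that category across all samples.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    categories = sorted({c for reads in samples.values() for c in reads})
    target = {
        c: min(len(reads.get(c, [])) for reads in samples.values())
        for c in categories
    }
    return {
        name: {c: rng.sample(reads.get(c, []), target[c]) for c in categories}
        for name, reads in samples.items()
    }


if __name__ == "__main__":
    samples = {  # hypothetical read identifiers
        "sample_a": {"exon": [f"a_e{i}" for i in range(90)],
                     "intron": [f"a_i{i}" for i in range(15)]},
        "sample_b": {"exon": [f"b_e{i}" for i in range(60)],
                     "intron": [f"b_i{i}" for i in range(40)]},
    }
    normalized = resample_to_common_depth(samples)
    for name, reads in normalized.items():
        print(name, {c: len(r) for c, r in reads.items()})
```

Because balancing happens before quantification, feature counts derived from the resampled reads are directly comparable across samples, at the cost of discarding reads from the deeper samples.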
Short Abstract: Alignment is the first step in most RNA-Seq analysis pipelines, and the accuracy of most downstream analyses depends heavily on this step. There are many available methods, with conflicting claims of superiority, resulting in an ongoing debate. Unlike most other steps in the RNA-Seq analysis pipeline, alignment is particularly amenable to benchmarking with simulated data. We have performed a comprehensive benchmark analysis of the fourteen most popular methods, with metrics of accuracy and efficiency at the base level, read level, and junction level. The benchmarking has been performed on simulated data from two genomes, each with three levels of alignment complexity. The results show a tremendous variation in performance, with the most popular method, TopHat, generally underperforming. We further compare the performance of the default settings with the performance achievable by optimizing parameters. We find that TopHat has the worst performance of all aligners when using default parameters, underscoring the importance of parameter optimization. These results give clear guidance for this crucial step in most RNA-Seq analysis pipelines, and the conclusions are in stark contrast to the advice most commonly given. These results can therefore have a significant impact on a wide range of research projects involving RNA sequencing.
Short Abstract: Genome-scale expression profiling has become a key tool of functional genomics, critically supporting progress in the post-genomic era. It improves our understanding of living systems at the molecular level. The fast development of sequencing technology has recently led to many updates of genome sequences as well as gene models and has revealed the complexity of the gene model annotation of many species.
This fast pace has, however, made it difficult to map between corresponding alternative transcripts in different gene model annotations. Moreover, for many organisms where gene model updates are provided only once every few years, it is even more difficult to map between alternative transcripts of the same gene from different versions of the gene models. Mapping by gene/transcript name, as well as by simple sequence-based alignment, fails in many cases, especially where consecutive versions differ substantially in both genome sequence and gene model annotation. The problem is even more pronounced for plant genomes due to their high ploidy.
Here we present a new approach for mapping between gene model annotations. Instead of full sequence alignment based scoring, we perform first the sequence similarity assessment on the ‘pseudo-exon’ level. Then the final score is calculated as a distance metric based on these local alignments. To mimic the importance of sequence composition, for functional interpretation, in the final score the influence of insertions/deletions in the context of possible frame shift in the coding sequence is also taken into account.
Short Abstract: Circular RNAs (circRNAs) are a new class of abundant, non-adenylated, and stable RNAs that form a covalently closed loop. Recent studies have suggested that circRNAs play important regulatory roles through interactions with miRNAs and ribonucleoproteins. High-throughput RNA sequencing to detect circRNAs requires non-poly(A)-selected protocols. In this study, we established the use of an Exome Capture RNA-Seq protocol to profile circRNAs across more than 1000 human cancer samples. We validated our protocol against two other gold-standard methods, depletion of rRNA (Ribo-Zero) and digestion of linear transcripts (RNase-R). Capture RNA-seq was shown to greatly facilitate the high-throughput profiling of circRNAs, providing the most comprehensive catalogue of circRNA species to date. Specifically, our method achieved significantly better enrichment for circRNAs than rRNA depletion and, unlike RNase-R treatment, preserved accurate circular-to-linear ratios. Although the correlation between circular and linear isoform abundance was modest in general, we found strong evidence that the lineage specificity of circular RNAs is due to the lineage specificity of their parent genes. To shed light on the mechanism of circRNA biogenesis, we are investigating the associations of mutations in canonical splicing sites and splicing factors with aberrant formation of circRNAs. Finally, the ratio of circular to linear transcript abundance was explored to give insight into the dynamics between transcriptome stability/turnover and cell proliferation. Overall, our compendium provides a comprehensive resource that could aid the exploration of circRNAs as a new type of biomarker, or as intriguing splicing and regulatory phenomena.
Short Abstract: The assumption of lack of memory, i.e. Markovianity, is common to many models of protein sequence evolution, in particular to those based on point accepted mutation matrices (Dayhoff et al., 1978). Nevertheless, it has been observed (Benner et al., 1994; Mitchison and Durbin, 1995) that evolution seems to proceed differently at different time scales, calling the Markovian assumption into question. We show that the among-site variability of substitution rates introduces an effective memory that makes protein sequence evolution non-Markovian: each site retains the `memory' of its own substitution rate, and this influences both the local destiny of that site and the global destiny of the full sequence. We introduce a simple model that describes the occurrence of substitutions in a generic protein sequence, based on the idea that mutations are more likely to be accepted at sites that interact with a spot where a substitution has occurred in the recent past. The model therefore extends the usual assumptions made in protein coevolution by introducing a time damping on the effect of a substitution. We validate this model by successfully predicting the correlation of substitutions as a function of their distance along the sequence. Despite its simplicity, this model predicts a distribution of substitution rates highly compatible with a gamma distribution, consistent with common wisdom (Yang 1993, Yang et al. 1994).
Short Abstract: Since the introduction of the de Bruijn graph assembly approach by Pevzner et al. in 2001, de Bruijn graph assemblers have become the dominant method for de novo assembly of large genomes. Nonetheless, assembling large genomes such as Homo sapiens remains a challenging task that requires abundant computing resources. Here we present two fundamental improvements to the ABySS assembler that significantly reduce the memory and run time requirements for large genomes. First, using the approach pioneered in the Minia assembler (Chikhi et al., 2012), we have reduced the original memory requirements of ABySS by an order of magnitude, using a de Bruijn graph represented in a succinct Bloom filter data structure. While Minia operates as a standalone unitig assembler, our Bloom filter assembler is integrated into the standard ABySS pipeline, including downstream stages for contig building, mate pair and long read scaffolding. Second, we have reduced the run time of the assembler through the use of a specialized hash function called "ntHash", which achieves runtimes orders of magnitude faster than standard hash functions by means of a constant-time sliding window calculation on adjacent k-mers. On a single 32-core machine with 120GB RAM, the ABySS Bloom filter pipeline assembles a modern 76X human dataset (SRA:ERR309932) and scaffolds with MPET data (SRA:ERR262997) with a wallclock time of 46 hours and a peak memory usage of 102GB RAM, achieving a scaffold NG50 of 1.7 Mbp.
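The constant-time sliding window calculation mentioned above can be illustrated with a cyclic polynomial (rotate-and-XOR) rolling hash, the general family that ntHash belongs to. In the minimal Python sketch below, the per-base seed constants are arbitrary placeholders chosen for illustration, not the published ntHash values, and the function names are our own; reverse-complement handling and multi-hash generation for Bloom filters are omitted.

```python
MASK = (1 << 64) - 1  # work in 64-bit words

# Arbitrary 64-bit constants, one per base (NOT the published ntHash seeds)
SEED = {
    "A": 0x9E3779B97F4A7C15,
    "C": 0xC2B2AE3D27D4EB4F,
    "G": 0x165667B19E3779F9,
    "T": 0x27D4EB2F165667C5,
}


def rol(x, r):
    """Rotate a 64-bit word left by r bits."""
    r %= 64
    return ((x << r) | (x >> (64 - r))) & MASK


def hash_kmer(kmer):
    """Hash one k-mer from scratch in O(k): XOR of rotated base seeds."""
    h = 0
    for base in kmer:
        h = rol(h, 1) ^ SEED[base]
    return h


def rolling_hashes(seq, k):
    """Yield the hash of every k-mer in seq, updating in O(1) per step."""
    if k <= 0 or len(seq) < k:
        return
    h = hash_kmer(seq[:k])
    yield h
    for i in range(k, len(seq)):
        outgoing, incoming = seq[i - k], seq[i]
        # rotate everything by one, cancel the base leaving the window
        # (now rotated k times), then add the base entering the window
        h = rol(h, 1) ^ rol(SEED[outgoing], k) ^ SEED[incoming]
        yield h


if __name__ == "__main__":
    seq, k = "ACGTTGCATGCA", 5
    for i, h in enumerate(rolling_hashes(seq, k)):
        print(seq[i:i + k], f"{h:016x}")
```

Each update costs a constant number of rotations and XORs regardless of k, whereas rehashing every k-mer independently costs O(k) per position.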
Short Abstract: Long intergenic noncoding RNAs (lincRNAs) are a novel class of regulators that play important roles in many biological processes. Myogenesis is the formation of muscular tissue, particularly during embryonic development. Little is known about how lincRNAs are involved in skeletal myogenesis. First, to identify the functional lincRNAs in myogenesis, we present a novel computational framework that can accurately identify potential functional lincRNAs from millions of assembled transcripts obtained from transcriptome sequencing data during myogenesis. Second, among the many identified potential functional lincRNAs, we functionally validate a novel lincRNA, Linc-YY1, from the promoter of the transcription factor (TF) Yin Yang 1 (YY1) gene. We demonstrate that Linc-YY1 is dynamically regulated during myogenesis in vitro and in vivo. Gain or loss of function of Linc-YY1 in C2C12 myoblasts or muscle satellite cells alters myogenic differentiation and, in injured muscles, has an impact on the course of regeneration. Linc-YY1 interacts with YY1 through its middle domain to evict YY1/Polycomb repressive complex (PRC2) from target promoters, thus activating gene expression in trans. Altogether, we show that Linc-YY1 regulates skeletal myogenesis and uncover a previously unappreciated mechanism of gene regulation by lincRNA.
The work described here is substantially supported by General Research Funds (GRF) and Collaborative Research Fund (CRF) from the Research Grants Council (RGC) of the Hong Kong Special Administrative Region, China 476113, 473713, 14116014, 14113514 and C6015-14G
Short Abstract: Comprehensive identification of insertions/deletions (indels) across the full size spectrum from second generation sequencing is challenging due to the relatively short read length inherent in the technology. Different indel calling methods exist but are limited in detection to specific sizes with varying accuracy and resolution. We present ScanIndel, an integrated framework for detecting indels with multiple heuristics including gapped alignment, split reads and de novo assembly. Using simulation data, we demonstrate ScanIndel’s superior sensitivity and specificity relative to several state-of-the-art indel callers across various coverage levels and indel sizes. ScanIndel yields higher predictive accuracy with lower computational cost compared to existing tools for both targeted resequencing data from tumor specimens and high coverage whole-genome sequencing data from the human NIST standard NA12878. Thus we anticipate ScanIndel will improve indel analysis in both clinical and research settings. ScanIndel is implemented in Python, and is freely available for academic use at https://github.com/cauyrd/ScanIndel
Short Abstract: Over 90% of the human exome is alternatively spliced. To fully understand the complexity of splicing regulation, one needs to quantify relative splice form abundance within and between samples. Traditional quantification methods only account for simple, binary alternative splicing events such as cassette exons. While these binary events make up about 70% of all events, the remaining 30% are complex events that are often discarded in the literature.
We recently published the MAJIQ algorithm which captures and accurately quantifies local splicing variations (LSVs) of arbitrary complexity. However, the current iteration of MAJIQ assumes that data replicates within a sample group generally agree on the percent inclusion (PSI) of splicing junctions, making it more sensitive to outliers. Given the prevalence of large heterogeneous datasets (e.g. patients vs. controls in disease studies), it is therefore important to address both efficiency and heterogeneity in estimating PSI and how it changes (delta PSI) between conditions or experimental groups.
We are developing MAJIQ-het, a generalization of the MAJIQ model which handles within-group heterogeneity to robustly estimate PSI and delta PSI. Briefly, MAJIQ-het assigns a weight to each experiment representing a posterior belief in its relevance or group membership. We show that MAJIQ-het converges to the MAJIQ model on well-behaved datasets, and retains much of its power when an outlier is introduced. We consider two alternative weighting schemes, termed the inside and outside models, and compare MAJIQ to each on synthetic and real-life data, demonstrating significant gains in both reproducibility and sensitivity for detecting differentially spliced LSVs.
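To make the quantities above concrete, here is a minimal Python sketch of naive point estimates: PSI for each junction of an LSV as its share of the junction reads, a per-experiment weighted group PSI, and delta PSI between two groups. This is an illustrative simplification under our own assumptions (raw read ratios, externally supplied weights, hypothetical function names and counts); it does not reproduce the MAJIQ or MAJIQ-het statistical models.

```python
def psi(junction_reads):
    """Naive PSI: each junction's fraction of all reads in the LSV."""
    total = sum(junction_reads)
    if total == 0:
        return [0.0] * len(junction_reads)
    return [reads / total for reads in junction_reads]


def weighted_group_psi(experiments, weights):
    """Weighted mean of per-experiment PSI vectors.

    experiments: list of junction read-count lists (one per experiment).
    weights: one non-negative weight per experiment; a low weight
    reduces the influence of a suspected outlier.
    """
    if len(experiments) != len(weights):
        raise ValueError("need exactly one weight per experiment")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not all be zero")
    per_experiment = [psi(counts) for counts in experiments]
    n_junctions = len(per_experiment[0])
    return [
        sum(w * p[j] for w, p in zip(weights, per_experiment)) / total_weight
        for j in range(n_junctions)
    ]


def delta_psi(group_a, group_b):
    """Per-junction difference in PSI between two groups (b minus a)."""
    return [b - a for a, b in zip(group_a, group_b)]


if __name__ == "__main__":
    controls = [[30, 10], [28, 12], [5, 35]]   # third run looks like an outlier
    patients = [[20, 20], [18, 22], [21, 19]]
    ctrl = weighted_group_psi(controls, weights=[1.0, 1.0, 0.1])
    case = weighted_group_psi(patients, weights=[1.0, 1.0, 1.0])
    print(ctrl, case, delta_psi(ctrl, case))
```

With equal weights this reduces to a plain average of replicates, which loosely mirrors the behavior described above of a weighted model falling back to the unweighted one on well-behaved data.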
Short Abstract: Mutation rates can vary across the residues of a protein, but when multiple sequence alignments are computed for protein sequences, typically the same choice of values for the substitution score and gap penalty parameters is used across the entire protein. We provide for the first time a new method called adaptive local realignment, which computes protein multiple sequence alignments that automatically use diverse alignment parameter settings in different regions of the input sequences. This allows the aligner’s parameter settings to locally adapt across a protein to more closely follow varying mutation rates.
Our method builds on the Facet alignment accuracy estimator and on our prior work on global alignment parameter advising. In a computed alignment, for each region that has low estimated accuracy, a collection of candidate realignments is generated using a set of alternate parameter choices. If one of these alternate realignments has higher estimated accuracy than the original subalignment, the original subalignment is replaced by that realignment.
Adaptive local realignment significantly improves the quality of alignments over using the single best default parameter choice. In particular, local realignment, when combined with existing methods for global parameter advising, boosts alignment accuracy by almost 24% over the best default parameter setting on the hardest-to-align benchmarks.
A new version of the Opal multiple sequence aligner that incorporates adaptive local realignment, using Facet for parameter advising, is available free for non-commercial use at http://facet.cs.arizona.edu. This site also contains the benchmarks from our experiments, and optimal sets of parameter choices.
Short Abstract: In recent years, the rapid expansion of mobile devices, including smart phones and tablets, has created a new trend in personal computing. Personal mobile devices have become convenient tools for daily information retrieval and exchange. However, few mobile applications (apps) have been created to retrieve and display genome annotation information on tablets or smart phones, and currently no bioinformatics-related mobile applications have been developed specifically for the visualization of large-scale NGS sequence data. With their increasing computation and graphic display capacities, mobile devices and mobile applications could become suitable, user-friendly platforms for interrogating large-scale bioinformatic and genomic data. Here, we developed mobile application software to demonstrate the feasibility of visualizing large-scale human cancer gene expression information. We have implemented an iOS mobile application (RNA-Seq Viewer) to visualize Next Generation Sequencing gene expression information from over 2,500 human cancer patients retrieved from The Cancer Genome Atlas (TCGA). Users can select RNA-Seq data of any given individual sample from nine different cancer types, and the application efficiently displays whole-transcriptome expression information over a human chromosome framework with an easily accessible, intuitive navigation interface. Local gene modulation patterns can be inspected thoroughly. In addition, users can visualize their own RNA-Seq data by building a customized dataset. We imagine such mobile applications could be utilized in future personalized medicine applications by serving as an underlying component for easy access to genomic and medical information using cloud infrastructure on various mobile devices.
Short Abstract: RNA-Seq has been used to identify expression profiles of whole transcripts on a genome. TopHat and Cufflinks are often used to predict new transcripts and/or alternative splicing variants and to estimate their expression levels from RNA-Seq data. However, the accuracy of the estimated expression levels can be a problem. For example, of the two isoforms (PPARG1 and PPARG2) of peroxisome proliferator-activated receptor gamma, PPARG2 is known to be highly abundant in adipose tissue, but in our preliminary experiment with RNA-Seq data from mouse adipocytes, TopHat-Cufflinks indicated that PPARG1 was more abundant than PPARG2. Another tool, TIGAR2, obtained the correct result (PPARG2 was more abundant) with the same RNA-Seq data, but it cannot discover novel transcripts because it requires reference RNA sequences (i.e., an already-fixed set of transcripts) as input.
To cope with this problem, we propose a new analysis method that combines these two tools. The method (1) discovers whole transcripts, including new transcripts, using TopHat-Cufflinks, and (2) accurately estimates the expression levels of those transcripts using TIGAR2.
The performance of the proposed method has been demonstrated by comparing the result of the method with those of TopHat-Cufflinks and TIGAR2 using the RNA-Seq data obtained from mouse adipocytes.
Short Abstract: Chromosomal translocations leading to fusion transcripts represent a class of oncogenic aberrations that are of high interest for understanding cancer biology, treating cancer patients, and as targets for the development of new therapies. Transcriptome sequencing via RNA-Seq coupled with downstream bioinformatics software applications offers an effective method for identifying candidate fusion transcripts, and is more targeted and cost-effective than whole genome sequencing. Although many such fusion detection software tools have recently been made available, they often differ greatly in prediction accuracy, execution times, computational requirements, and installation complexity, and are often not readily accessible to non-bioinformatician cancer researchers.
We present the Trinity Cancer Transcriptome Analysis Toolkit (CTAT), a suite of RNA-Seq targeted fusion detection tools leveraging a combination of reference genome read-mapping and de novo transcriptome assembly, coupled to in silico fusion transcript validation, annotation, and visualization. We show that components of our Trinity CTAT fusion detection toolkit, including STAR-Fusion and FusionInspector, yield improved accuracy and run times as compared to alternative leading tools. We explore fusion transcript discovery in a large cohort of ~300 chronic lymphocytic leukemia (CLL) patient samples, and identify several novel recurrent fusions that may represent drivers of CLL etiology and provide new avenues to therapies.
Trinity CTAT aims to provide cancer researchers with easy access to methods for analyzing cancer RNA-Seq, including transcript reconstruction, identification of mutations, expression analysis, and tumor heterogeneity. Trinity CTAT is freely available open source software and readily accessible to all cancer researchers via our public Galaxy web portal: https://galaxy.ncgas-trinity.indiana.edu/root
Short Abstract: Honey bee colonies exhibit an age-related division of labor, with worker bees performing discrete sets of behaviors throughout their lifespan. These behavioral states are associated with distinct brain transcriptomic states, yet little is known about the regulatory mechanisms governing them. We used CAGEscan (a variant of the Cap Analysis of Gene Expression technique) for the first time to characterize the promoter regions of differentially expressed brain genes during two behavioral states (brood care (aka “nursing”) and foraging) and identified transcription factors (TFs) that may govern their expression. More than half of the differentially expressed TFs were associated with motifs enriched in the promoter regions of differentially expressed genes (DEGs), suggesting they are regulators of behavioral state. Strikingly, five TFs (nf-kb, egr, pax6, hairy, and clockwork orange) were predicted to co-regulate nearly half of the genes that were upregulated in foragers. Finally, differences in alternative TSS usage between nurses and foragers were detected upstream of 646 genes, whose functional analysis revealed enrichment for Gene Ontology terms associated with neural function and plasticity. This demonstrates for the first time that alternative TSSs are associated with stable differences in behavior, suggesting they may play a role in organizing behavioral state.
Short Abstract: Messenger RNA (mRNA) 3’ untranslated regions (3’ UTRs) regulate gene functions by modifying cellular localization, stability and/or translational efficiency of transcripts during normal biological functions (e.g., development, nervous system functions) and disease states (e.g., UTR shortening in cancer). The recent surge in whole transcriptome sequencing has revealed an unexpected diversity in the regulation of 3’UTRs. The length of a given mRNA’s 3’UTR is not fixed; rather, 3’UTR isoform diversity is generated by alternate polyadenylation (APA) and/or alternate splicing (AS) during mRNA biosynthesis, thereby providing a dynamic substrate for RNA-binding proteins (RBPs) and miRNAs. Accurate measurement of 3’UTR dynamics using standard RNA-seq protocols remains a challenge. Current methods either rely heavily on 3’UTR annotation and/or employ statistical models sensitive to UTR shortening events at the expense of lengthening events.
We therefore developed a method to assess 3’UTR dynamics in publicly available RNA-seq profiles. First, a comprehensive genome-wide database of polyadenylation signals was compiled from publicly available sources, including high-throughput 3’-end sequencing studies and EST datasets. All collated alternate 3’ ends were then assigned to their respective gene loci. This allowed us to generate a transcriptome-wide model of all predicted cleavage sites and therefore all possible 3’UTR variants. We then developed a novel computational method to detect and quantify 3’UTR dynamics in RNA-seq profiles from publicly available datasets. Computational analyses of these dynamic 3’UTR sequences revealed over-represented motifs for neuron-specific miRNAs and RNA-binding proteins. This demonstrates the utility of our method to assess 3’UTR dynamics using standard mRNA sequencing protocols.
Short Abstract: The vast majority of microbial species found in nature has yet to be grown in pure culture, turning metagenomics and – more recently – single cell genomics into indispensable methods to study microbial dark matter. We developed kgrep, a new tool that naturally exploits the complementary nature of single cells and metagenomes to improve de novo assembly of single cell genomes.
Prior to sequencing of a single cell, its DNA needs to be amplified. This usually is done by multiple displacement amplification (MDA), introducing a tremendous coverage bias. Poorly amplified regions result in extremely low sequencing coverage or physical sequencing gaps. These parts of the genome cannot be reconstructed in the subsequent assembly step and therefore genomic information is lost.
Frequently, single amplified genomes (SAGs) and shotgun metagenomes are generated from the same environmental sample. We developed a fast, k-mer based recruitment method to sensitively identify metagenomic “proxy” reads representing the single cell of interest, using the raw SAG reads as recruitment seeds. By assembling metagenomic proxy reads instead of single cell reads, we circumvent most challenges of single cell genome assembly, such as the aforementioned coverage bias and chimeric reads.
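A k-mer based recruitment step of this kind can be sketched as follows. This is a minimal illustration and not the kgrep implementation: the function names, the k-mer size, and the `min_shared` cutoff are assumptions, and reverse complements and sequencing errors are ignored for brevity.

```python
def kmers(seq, k):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def recruit_proxy_reads(sag_reads, metagenome_reads, k=4, min_shared=2):
    """Keep metagenomic reads sharing at least `min_shared` k-mers with the SAG reads."""
    # Index every k-mer seen in the raw single-cell (SAG) reads: the recruitment seeds.
    seed_index = set()
    for read in sag_reads:
        seed_index |= kmers(read, k)
    # A metagenomic read becomes a "proxy" read if enough of its k-mers hit the index.
    return [read for read in metagenome_reads
            if len(kmers(read, k) & seed_index) >= min_shared]


sag_reads = ["ACGTACGTGA"]
metagenome_reads = ["CGTACGTG", "TTTTTTTT"]
proxy_reads = recruit_proxy_reads(sag_reads, metagenome_reads)
```

Running the function again with the recruited reads added to the seeds would give one simple form of iterative recruitment.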
On real and simulated data we show that, with sufficient metagenomic coverage, assembling metagenomic proxy reads instead of single cell reads significantly improves assembly contiguity while maintaining the original accuracy. By applying our method iteratively, we span physical sequencing gaps and are able to recover genomic regions that otherwise would have been lost. However, careful contamination screening is needed.
Short Abstract: Nearly half of the human genome is composed of transposable elements (TEs), which are mobile elements that can insert themselves into new locations within the genome, potentially altering the function of any genes or regulatory regions nearby. Most of these elements are no longer mobile but still contain regulatory sequences that can serve as promoters, enhancers or repressors for cellular genes. Many chromatin-associated factors such as transcription factors, histone modifiers, and other DNA binding proteins are known to bind to repetitive regions of the genome, with some showing preferential enrichment at these loci.
Here, we present TEpeaks, a method for identifying ChIP-seq peaks genome-wide that includes the repetitive fraction of the genome as well as uniquely mappable sites. Most existing methods either discard non-uniquely mapped reads or randomly choose one from the multiple locations to which they align. Both strategies reduce the accuracy in determining enrichment in repetitive regions. TEpeaks carefully distributes multiply mapped reads using the uniquely mapped reads as a guide and optimizes the assignment by an expectation maximization (EM) algorithm. Moreover, TEpeaks provides multiple normalization options and also includes a module for differential binding analysis to determine differential enrichment statistics at these candidate binding sites when comparing between different experimental conditions or genotypes. By applying TEpeaks to publicly available OCT4 ChIP-seq data of H1 cells, we found 50% more ERV1 derived binding sites than using existing methods. There are potentially many more undiscovered regulatory regions associated with TE sequences.
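The idea of distributing multiply mapped reads with uniquely mapped reads as a guide can be illustrated with a generic EM sketch. This is not the TEpeaks code: the function name, the pseudocount, and the simplification that loci are weighted only by read counts (no locus lengths or mapping qualities) are assumptions made for illustration.

```python
def em_assign_multireads(unique_counts, multireads, n_iter=50, pseudocount=1e-6):
    """Fractionally assign multi-mapped reads to loci by expectation maximization.

    unique_counts: dict mapping locus -> number of uniquely mapped reads
    multireads:    list of tuples, each holding the candidate loci of one read
    Returns a dict mapping locus -> expected read count (unique + fractional).
    """
    # Initial abundances come from the uniquely mapped reads (the "guide").
    abundance = {locus: count + pseudocount for locus, count in unique_counts.items()}
    for read in multireads:
        for locus in read:
            abundance.setdefault(locus, pseudocount)

    expected = dict(abundance)
    for _ in range(n_iter):
        # E-step: split each multi-read in proportion to current abundances.
        expected = {locus: float(unique_counts.get(locus, 0)) for locus in abundance}
        for read in multireads:
            total = sum(abundance[locus] for locus in read)
            for locus in read:
                expected[locus] += abundance[locus] / total
        # M-step: the expected counts become the new abundance estimates.
        abundance = {locus: value + pseudocount for locus, value in expected.items()}
    return expected


unique_counts = {"locus_A": 9, "locus_B": 1}
multireads = [("locus_A", "locus_B")] * 10
assigned = em_assign_multireads(unique_counts, multireads)
```

Here the ten ambiguous reads are split roughly 9:1 between the two loci, following the ratio of their uniquely mapped reads, rather than being discarded or placed at random.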
Short Abstract: RNA nearest neighbor parameters are a set of parameters for estimating the folding energy changes of RNA secondary structures. These parameters are used widely in software to predict the secondary structures of given RNA sequences. Despite their widespread application, a comprehensive review of the impact of each parameter on the precision of calculations had not been conducted. To identify the parameters with greatest impact, we performed a sensitivity analysis on the 290 independent parameters that compose the 2004 version of the nearest neighbor rules. This required that the nearest neighbor parameters be recalculated and each parameter's uncertainty be determined. Each nearest neighbor parameter was modulated by increments of either experimental uncertainty or a fixed folding free energy value and the effect of this parameter change on predicted base-pair probabilities and secondary structures was observed. This identified parameters that should be determined to higher accuracy and energetic models of specific secondary structure motifs that should be updated in order to improve the accuracy of RNA structure prediction. In particular, the bulge loop initiations, multibranch loop parameters, and AU/GU closure terms stood out as particularly important. An analysis of parameter usage during folding free energy calculations of stochastic samples of predicted RNA secondary structures revealed a correlation between parameter usage and impact on structure prediction. The results of this analysis can be also used to inform which parameters are the most important to determine with high precision for the expansion of nearest neighbor parameters to nucleotides with modified chemistries.
Short Abstract: Next-generation sequencing (NGS) technology has recently been widely applied in clinical and public health laboratory investigations for pathogen detection and surveillance. Major gaps currently exist in NGS data analysis and data interpretation. Topic modeling is an active research field in machine learning and has mainly been used as an analytical tool to structure large textual corpora for data mining. A framework was developed to perform data mining on NGS datasets by topic modeling. It consists of four major procedures: NGS data retrieval, preprocessing, topic modeling, and data mining of the LDA topic outputs. The preprocessed NGS sequences were transformed into a corpus in which each document was treated under the “bag of words” assumption, which is essential for the effectiveness of the topic modeling approach. An NGS data set of 119 Salmonella isolates was retrieved from the National Center for Biotechnology Information (NCBI) database and used as an example to show the workflow of this framework. The topics output by the LDA algorithms could be treated as features of Salmonella strains to accurately describe the genetic diversity of the fliC gene in various serotypes. Implementing topic modeling in an NGS data analysis framework provides a new way to elucidate genetic information and identify biomarkers, and therefore enhances NGS data analysis and its applications in pathogen identification, source tracking, and population genome evolution.
Short Abstract: Next-generation sequencing (NGS) technologies and data processing pipelines are rapidly and inexpensively providing increasingly numerous sequencing data and associated (epi)genomic features of many individual genomes in multiple biological and clinical conditions, generally made publicly available within well-curated repositories. Answers to fundamental biomedical problems are hidden in these data; yet, their efficient management and integrative processing is becoming the biggest and most important “big data” problem of mankind. Multi-sample processing of heterogeneous information can support data-driven discoveries and biomolecular sense making, such as discovering how heterogeneous genomic, transcriptomic and epigenomic features cooperate to characterize biomolecular functions; yet, it requires state-of-the-art “big data” computing strategies, with abstractions beyond commonly used tool capabilities.
We recently proposed a new paradigm in NGS data management and processing by introducing an essential Genomic Data Model (GDM) using few general abstractions for genomic region data and associated experimental, biological and clinical metadata that guarantee interoperability between existing data formats. Leveraging on GDM, we developed a next-generation, high-level, declarative GenoMetric Query Language (GMQL) for genomics data; here, we demonstrate its usefulness, flexibility and simplicity of use through several biological query examples. GMQL operates downstream of raw data preprocessing pipelines and supports queries over thousands of heterogeneous samples; computational efficiency and high scalability are achieved by using parallel computing on clusters or public clouds. GDM and GMQL are applicable to federated repositories, and can be exploited to provide integrated access to curated data, made available by large consortia such as ENCODE, Epigenomics Roadmap, or TCGA, through user-friendly search services.
Short Abstract: Reliability, reproducibility, and reusability are essential
requirements for analysis pipelines, especially in the context of a
bioinformatics core facility tasked with analyzing a large number of
We are developing a software environment for the rapid development of
analysis pipelines in a high-performance computing setting. The
environment is based on Actor, a meta-scripting library allowing
developers to quickly create complex analysis pipelines, by combining
predefined objects and methods implementing standard analysis tasks (e.g., different sequence alignment programs). These building blocks only need to be developed and debugged once, leading to high reliability and reusability.
Input to the pipelines consists of a configuration file providing
details about the analysis process. Analysis parameters can therefore be changed without modifying the pipeline. Input data are stored in a data
structure able to represent multiple experimental conditions, multiple
samples for each condition, and multiple technical replicates for each
sample. Conditions, samples and replicates are managed automatically,
greatly simplifying pipeline development.
Actor is designed to work in a cluster environment, and provides
methods to submit jobs to a queue and wait for their termination under
either PBS or Slurm. It therefore facilitates the creation of complex
pipelines consisting of multiple processes running sequentially or in
parallel. Finally, Actor pipelines automatically generate a
publication-quality report describing all steps of the analysis,
including figures and links to all relevant input and output files.
We present the architecture of this platform and the pipelines we
developed using it, for RNA-Seq, SNP calling, and methylation analysis.
Short Abstract: The human leukocyte antigen (HLA) gene family plays a critical role in many biomedical areas, including organ transplantation, autoimmune diseases and infectious diseases. Because the gene family also contains the most polymorphic genes in humans, clinical applications and biomedical research require highly accurate HLA typing. NGS data have proved able to achieve high-resolution HLA typing; however, the reads of most platforms are not long enough to cover the two sequential exons, i.e., exon 2 and exon 3, which leads to phasing ambiguities. On the other hand, the long reads of the PacBio system can unequivocally solve the phasing problem. Because the advantage of PacBio long reads can be compromised by their high error rates, we propose a typing method that adapts Bayes’ theorem so that it can tolerate sequencing errors as well as de-multiplexing errors. We have implemented the method and integrated the HLA typing pipeline into a program named BayesTyping.
Short Abstract: Vector integration sites (IS) in hematopoietic stem cell (HSC) gene therapy (GT) applications are stable genetic marks, distinctive for each independent cell clone and its progeny. The characterization of IS is required to support assessments of the safety and efficacy of the treatment. Bioinformatics tools for IS detection identify IS only from sequence reads with unique mapping, discarding those landing in repetitive elements (~30% of the whole dataset), which hides potential malignant events and reduces the reliability of IS studies. We developed a novel tool for IS identification in any genomic region, even one composed of repetitive genomic sequences. It uses a dereplication approach with a genome-free analysis of reads and graph theory to identify indivisible subgraphs of sequences corresponding to single IS. To avoid false positive IS, the method statistically validates the clusters through a permutation test and produces the final list of IS. We tested the reliability of our tool and compared the results with other state-of-the-art methods. In simulated datasets of IS, our tool showed precision and recall ranging from 0.85 to 0.97 and from 0.88 to 0.99, respectively (F-score ranging from 0.86 to 0.98, higher than other tools). We reanalyzed the dataset of our published GT clinical trial for Metachromatic Leukodystrophy, including IS in repetitive elements. The number of IS and estimated HSCs increased by an average of 1.3-fold, whereas the clonal population diversity index did not change and no aberrant clones in repeats occurred. Our tool addresses important open issues in IS identification, allowing the generation of a comprehensive repertoire of IS.
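One way to read "indivisible subgraphs" is as the connected components of a read-similarity graph. The sketch below assumes that interpretation, which the abstract does not confirm. The node names and the edge list are hypothetical, and the similarity test that would produce the edges, as well as the permutation-based validation, are left out.

```python
from collections import defaultdict


def connected_components(nodes, edges):
    """Group nodes into connected components with an iterative depth-first search."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    components = []
    for node in nodes:
        if node in seen:
            continue
        seen.add(node)
        stack, component = [node], []
        while stack:
            current = stack.pop()
            component.append(current)
            for neighbour in adjacency[current]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(sorted(component))
    return components


# Hypothetical dereplicated reads; an edge means two reads were judged similar.
reads = ["r1", "r2", "r3", "r4"]
similar_pairs = [("r1", "r2"), ("r2", "r3")]
clusters = connected_components(reads, similar_pairs)
```

Each resulting cluster would then be a candidate IS to pass on to statistical validation.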
Short Abstract: Modern high-throughput single cell approaches allow for the molecular profiling of thousands of cells simultaneously. However, individual sequencing runs for every cell are cumbersome and lead to enormous costs. These costs can be reduced dramatically by using multiplexed sequencing approaches at both the gene and the sample level. We propose a method for multiplexed sequencing analysis based on sample barcoding (BART-seq), which involves simultaneous amplification of multiple target sequences using mixed sets of primers in a single solution. BART-seq allows us to analyze samples containing small amounts of template molecules, for example for genotyping or analyzing transcriptional profiles of single cells. Designing optimal barcodes and optimizing “cocktails” of primers may be challenging due to sequence similarities and secondary structure formation, especially for sets employing dozens or more barcodes and primer pairs. Here we present a novel pipeline for barcode and primer design, as well as easy-to-use web applications, which form the basis for conducting barcoded NGS experiments and deconvolving the resulting sequencing data. We show that using our approaches in combination with an analysis pipeline leads to accurate read identification, which is necessary to reliably quantify gene expression. With this novel approach we are also able to reliably measure gene expression for dozens of single cells simultaneously, as demonstrated in an experiment comparing a human embryonic stem cell line with a fibroblast cell line.
Short Abstract: N6-methyl-adenosine (m6A) is one of the most abundant internal modifications in polyadenylated mRNAs, but much remains to be learnt about its biological roles. Reversible m6A methylation plays critical regulatory roles in RNA processing and has been postulated to participate in miRNA-related pathways. Moreover, aberrant m6A modification has been linked to several human diseases, including prostatitis, obesity and various forms of cancer, amongst others.
While transcriptome-wide m6A profiling using next generation sequencing technologies in principle permits analysis of the RNA “methylome”, bioinformatics resources available for the analysis of m6A sequencing data are still limited. Furthermore, transcriptome-wide m6A detection using existing enrichment-based methods suffers from limited resolution and a high false positive rate, both problems reflecting the biological characteristics of antibody binding, including non-specific binding.
Here, we present m6aViewer, a cross-platform software application for the analysis, annotation and visualisation of m6A peaks from sequencing data. m6aViewer implements a novel m6A peak-calling algorithm capable of identifying high-confidence methylated residues with more precision than previously described approaches. It does this by processing data at single nucleotide resolution and modelling RNA fragment distributions from both single- and paired-end read data. Using m6A methyltransferase knockout data, we train a supervised ensemble learning model that captures multiple peak features while retaining sequence information and can accurately distinguish true m6A peaks from non-specific antibody binding sites. We show that this approach can be generalised to multiple tissue types, thus enabling precise detection of high-confidence m6A residues.
Short Abstract: DNA transposons, RNA retrotransposons, and endogenous retroviruses are collectively called transposable elements (TEs) and together make up nearly half of mammalian genomes. While TEs cover a large portion of the genome, they constitute a much smaller fraction of the transcribed RNA in cells. This is due, in large part, to regulation of these transcripts via transcriptional and post-transcriptional gene silencing mechanisms (TGS and PTGS, respectively). Small RNAs form a central part of the PTGS mechanism, and are increasingly appreciated for their role in TGS-mediated regulation as well. However, most small RNA profiling pipelines focus on microRNAs (miRNAs) and ignore the small RNAs derived from repetitive regions of the genome.
Here, we present a pipeline to include TE-associated reads in small RNA profiling and differential expression analysis. The pipeline closely follows the standard workflow for analyzing small RNA sequencing data, which includes mapping reads to a reference genome, annotating the mapped reads to genomic features according to a priority table, and profiling relative abundance. We include additional steps to carefully allocate non-uniquely mapping reads and to analyze the distribution of sense/antisense reads mapping to TE loci. Extensive annotation and exploratory plots are provided for miRNA analysis, TE-mapping siRNA analysis, and for reads mapping to other genomic features. The pipeline generates the analytical results in interactive HTML format to let users easily browse the data values and modify the figures. The code is written in Python with the Bokeh visualization library.
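Annotation by a priority table can be sketched as follows. This is a minimal illustration, not the pipeline's code: the priority order in `FEATURE_PRIORITY`, the feature names, and the function names are all hypothetical, and the abstract does not state the actual table.

```python
from collections import Counter

# Hypothetical priority table: earlier entries win when a read overlaps several features.
FEATURE_PRIORITY = ["miRNA", "TE", "gene"]


def annotate_read(overlapping_features, priority=FEATURE_PRIORITY):
    """Return the highest-priority feature a read overlaps, or 'unannotated'."""
    rank = {name: position for position, name in enumerate(priority)}
    known = [feature for feature in overlapping_features if feature in rank]
    if not known:
        return "unannotated"
    return min(known, key=rank.__getitem__)


def relative_abundance(labels):
    """Convert a list of per-read labels into fractions of the total."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Each set lists the feature types one mapped read overlaps.
read_overlaps = [{"TE", "gene"}, {"miRNA", "TE"}, set(), {"gene"}]
labels = [annotate_read(features) for features in read_overlaps]
profile = relative_abundance(labels)
```

A fixed priority order keeps each read from being counted under more than one feature class.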
Short Abstract: Inefficiency of sequence analysis algorithms is a major bottleneck in metagenomic research.
Sequence alignment provides the most information about the composition of a sample, but methods are prohibitively slow.
This inefficiency has led to reliance on faster, but less accurate, algorithms which only produce simple taxonomic classification or abundance estimation, losing the valuable information given by full alignments of reads against annotated genomes.
k-SLAM is a novel, ultra-fast method for the alignment and taxonomic classification of metagenomic data.
Using a k-mer based method, k-SLAM achieves speeds three orders of magnitude faster than current alignment based approaches.
The alignments found can also be used to find variants and genes present in a mixed sample, along with their taxonomic origins.
A novel pseudo-assembly method is used to produce more specific taxonomic classifications for species which have high sequence homology within their genus, providing up to a 40% increase in accuracy on these species.
Short Abstract: The motif discovery problem can be thought of as searching for an unknown number of l-length subsequences in an N-length DNA sequence. Thus, all of the l-length subsequences of the given DNA sequence should be extracted in order to be statistically analyzed for the probability of being a motif instance. This probability is inversely proportional to the likelihood ratio of the nucleotide frequencies of the subsequence to the background model. MEME is a popular probabilistic motif discovery program which handles these issues by performing the EM algorithm to infer position weight matrices (PWMs). However, MEME scales poorly with large datasets and is unable to find motifs that include insertions and deletions. This study aims to develop a framework for discovering DNA motifs in which fuzzy C-means (FCM) membership functions and an EM technique are employed to extract putative motifs allowing indels. The proposed method follows these steps: (a) clustering subsequences into a certain number of clusters using FCM; (b) using the fuzzy membership values of each subsequence as initial posterior probability values in the EM technique; (c) testing each PWM, i.e., each cluster center, to see whether it is statistically meaningful.
We present FCM-EM, a motif discovery algorithm designed to find DNA-binding motifs in ChIP-Seq and DNase-Seq data. The nine motifs discovered in a ChIP-Seq dataset are matched to known motifs using TOMTOM, and the calculated E-values show the significance of each discovered motif.
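Steps (a) and (b) rely on fuzzy membership values for every l-length subsequence. The sketch below computes those values with the standard fuzzy C-means membership formula, for fixed cluster centers. It is not the FCM-EM implementation: the one-hot encoding, the Euclidean distance, the fuzzifier m = 2, and the choice of centers are assumptions for illustration, and the center update and EM refinement are omitted.

```python
import math


def extract_lmers(sequence, l):
    """Slide a window of width l over the sequence and return every subsequence."""
    return [sequence[i:i + l] for i in range(len(sequence) - l + 1)]


def one_hot(lmer):
    """Encode a DNA word as a flat 4*l vector (A, C, G, T per position)."""
    return [1.0 if base == symbol else 0.0 for base in lmer for symbol in "ACGT"]


def fcm_memberships(points, centers, m=2.0):
    """Standard FCM membership: u_ij = 1 / sum_k (d_ij / d_ik) ** (2 / (m - 1))."""
    power = 2.0 / (m - 1.0)
    memberships = []
    for point in points:
        distances = [math.dist(point, center) for center in centers]
        if any(d == 0.0 for d in distances):
            # A point sitting exactly on a center belongs fully to that center.
            hits = [1.0 if d == 0.0 else 0.0 for d in distances]
            row = [h / sum(hits) for h in hits]
        else:
            row = [1.0 / sum((dj / dk) ** power for dk in distances)
                   for dj in distances]
        memberships.append(row)
    return memberships


lmers = extract_lmers("ACGTT", 3)                # ["ACG", "CGT", "GTT"]
centers = [one_hot("ACG"), one_hot("GTT")]       # hypothetical cluster centers
memberships = fcm_memberships([one_hot(w) for w in lmers], centers)
```

Each row sums to one, so it can be used directly as an initial posterior over clusters for the corresponding subsequence.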
Short Abstract: With the amount of heterogeneous data that Next Generation Sequencing (NGS) is producing, many interesting computational problems are emerging and call for urgent solutions. Genome browsers (e.g., IGB) are tools to visually compare and browse multiple genomic feature samples aligned to the same reference genome and laid out on different genome browser tracks. They allow the visual inspection and identification of interesting “patterns” on multiple tracks, i.e., sets of genomic regions/peaks at given distances from each other in different genome browser tracks. However, once such patterns are visually identified in a genome section, searching for their occurrences along the whole genome is a complex computational task that is currently not supported, although discovering them genome-wide is very important for the biological interpretation of NGS experimental results and the comprehension of biomolecular phenomena. To overcome this limitation, we present an optimized “similarity”-based pattern-search algorithm able to efficiently find, within a large set of genomic data, genomic region sets that are similar to a given pattern. We implemented our algorithm within an IGB plugin, which allows intuitive user interaction both in visually selecting an interesting pattern on the loaded IGB tracks and in visualizing, within IGB, the genome-wide occurrences of region sets similar to the selected pattern found by our algorithm. This demonstrates the efficiency and the accuracy of the proposed method.
Short Abstract: Occasionally, new protein-coding genes arise not from duplication and divergence, but rather de novo from non-coding DNA. These proteins will then be restricted to only certain taxa. Older protein-coding genes will have homologs in more distantly related taxa. We assigned an evolutionary age to all mouse proteins, and observed how their biochemical properties vary with gene age. Younger proteins are less aggregation-prone than older proteins and both are less aggregation-prone than intergenic sequences would be, if translated. Younger proteins have a higher intrinsic structural disorder (ISD) than older proteins, and this is not the result of biases whereby high ISD evolves faster and hence escapes detectable homology. Intergenic controls have the lowest ISD, contradicting a recent “continuum” theory of de novo gene birth and confirming an alternative theory of “preadaptation”. Previous work has found that hydrophobic amino acids are spread out or overdispersed along the primary sequence, compared to scrambled protein sequences. We found that this is only true for the very oldest genes with homologs in prokaryotes; younger genes have progressively higher levels of clumping or underdispersion as a function of youth. Our three correlates of gene age – aggregation propensity, ISD, and the dispersion of hydrophobic amino acids – are themselves interdependent in complex ways.
Short Abstract: Motivation: Most RNA-seq data analysis software packages are not designed to handle the complexities involved in properly apportioning short sequencing reads to highly repetitive regions of the genome. These regions are often occupied by transposable elements (TEs), which make up between 20% and 80% of eukaryotic genomes. They can contribute a substantial portion of transcriptomic and genomic sequence reads, but are typically ignored in most analyses.
Results: Here we present a method and software package for including both gene- and TE-associated ambiguously mapped reads in differential expression analysis. Our method shows improved recovery of TE transcripts over other published expression analysis methods, in both synthetic data and qPCR/NanoString-validated published datasets.
Availability: The source code, associated GTF files for TE annotation, and testing data are freely available at http://
Short Abstract: Single-molecule, real-time (SMRT) sequencing developed by Pacific Biosciences produces longer reads than second-generation sequencing technologies such as Illumina, enabling more complete genome assembly than is possible with short reads. However, SMRT sequencing data contains more (11-15%) insertion and deletion errors, which cause frameshifts and lead to fragmented alignments with marginal scores during homology search. Existing error correction methods such as hybrid sequencing still suffer from various limitations.
In this work we present a novel method to improve homology search sensitivity while also correcting sequencing errors for SMRT data. Our method is based on two key observations. First, errors in SMRT data are randomly distributed. Thus aligning short reads to long reads of the same library can reduce errors, as shown in the HGAP assembly pipeline. Second, by using characterized protein family models as the reference, we correct additional frameshift errors by maximizing the alignment scores. Our method combines a directed acyclic graph (DAG) for producing consensus sequences with a profile hidden Markov model (pHMM) for profile homology search.
As a proof of concept, we applied our method to protein domain annotation in simulated PacBio reads from E. coli. With 20X sequencing coverage, the average HMM alignment length improved from 83.7 to 102.0, compared with 123.3 on the ground truth sequences. By integrating the DAG and the pHMM, we corrected more errors for 47.0% of 6,132 input domains than by using the DAG only. In conclusion, our new method provides more powerful error correction for protein domain homology search on SMRT sequencing reads.
Short Abstract: As biological data continues to grow at exponential rates, visualisation and integration become increasingly necessary to interpret data. Protein sequence features describe regions or sites of biological interest, such as domains, PTMs and binding sites, which play a critical role in understanding protein function. A new visual approach by UniProt presents protein sequence features in one compact view using a highly interactive BioJS component, designed following a user-centered process. This is the first visualisation in a public resource that lets users see different types of protein sequence features such as domains, sites, PTMs and natural variants from multiple sources in a single view.
The viewer displays protein features in different tracks, providing an intuitive and compact picture of co-localized elements; initial tracks currently include domains & sites, molecule processing, PTMs, sequence information, secondary structure, topology, mutagenesis and natural variants. The variant track offers a novel visualization using a matrix that maps amino acids to their position on the sequence, thereby allowing the display of a large number of variants in a restricted space. UniProt also provides a new REST API that complements this functionality and allows easy access to UniProt protein features and additional large-scale data for variants and proteomics.
We plan to continue to integrate selected large-scale experimental data, including proteomics-related data (i.e., peptides) as well as antigenic data (i.e., epitope bindings).
Short Abstract: Trypanosoma cruzi is a trypanosomatid and the etiologic agent of Chagas disease, which is currently estimated to affect at least 6 million people worldwide. In 2005 El-Sayed and collaborators sequenced the genome of the hybrid T. cruzi CL Brener clone, currently available as a partially assembled set of 82 chromosomes (41 per haplotype). Given the lack of completely assembled genomic data, RNA-Seq experiments should rely on assembled transcriptomes. We therefore decided to provide a de novo assembled, curated and analyzed version of the T. cruzi CL Brener epimastigote transcriptome, based on high-quality RNA-Seq data, to serve as a reference. We applied a polyA capture-based, strand-specific Illumina RNA-Seq methodology to generate ~70 million paired-end reads that should reflect the total RNA content of epimastigotes. Several filters were applied to these sequences in order to eliminate contaminants and to discard lower quality reads and reads derived from spike-in RNAs. Over the entire curation process, nearly 15% of the reads were discarded, which makes curation a relevant step prior to de novo transcriptome assembly. Reads were assembled by Trinity, and the number and length of assembled transcripts reflected the annotated sequences in the T. cruzi genome. A database will shortly be available to store and allow access to the assembled transcripts, their annotations, expression levels and basic statistics. We consider this a pioneering approach for the investigation of gene expression in trypanosomatids and a very useful resource for scientists working with neglected tropical diseases caused by these parasites.
Short Abstract: Many women develop tumor recurrence and metastatic disease years after surgery and drug treatment, even though early diagnosis and new therapeutics have significantly improved breast cancer patient survival. It has been proposed that cancer stem cells, the most aggressive subpopulation of a tumor, are highly associated with micrometastasis formation, tumor recurrence and chemoresistance. In this project, our main aim is to identify genes that are highly enriched in breast cancer stem cells and may serve as biomarkers to predict relapse and chemoresistance. We therefore compared the transcriptome profiles of breast cancer stem cells, chemotherapy-treated cancer stem cells and normal breast cancer cells using RNA-sequencing to identify novel, cancer stem cell-specific genes, which provide crucial information for predicting chemoresistance and tumor relapse. Cancer stem cells isolated from estrogen receptor-positive clinical samples and from triple-negative breast cancer patient-derived xenografts were used to investigate the distinct and common genes that play a role in stemness, chemoresistance and relapse. Overlap analysis was performed between significantly enriched genes in cancer stem cells and chemoresistant cells. Data mining approaches were used to integrate various biological databases to define the roles of these genes in clinical outcomes. The Oncomine and Gene Expression Omnibus resources were utilized to validate associations between stemness, recurrence and chemoresistance. As a result, we defined a catalog of breast cancer stemness genes involved in chemoresistance and tumor relapse, which are specific to either one or both breast cancer subtypes.
Short Abstract: RNA can fold into secondary and tertiary structures, which are important for regulation of gene expression. We recently developed a method to perform genome-wide RNA structure profiling in vivo employing high-throughput sequencing techniques, and applied this methodology to Arabidopsis. This method makes it possible to probe thousands of RNA structures at one time in living cells. Hidden RNA codes have been revealed by bioinformatic analyses of our RNA structuromes, including RNA structures related to alternative polyadenylation and splicing [1].
Recently, further analysis of this dataset revealed a correlation between mRNA structure and the encoded protein structure, wherein the regions of individual mRNAs that code for protein domains generally have significantly higher structural reactivity than regions that encode protein domain junctions. This relationship is prominent for proteins annotated for catalytic activity but is reversed in proteins annotated for binding and transcription regulatory activity. We also found that mRNA segments that code for ordered regions have significantly higher structural reactivity than those that encode disordered regions [2].
We also developed a new computational platform, StructureFold, to facilitate the analysis of high-throughput RNA structure profiling data. As a component of the Galaxy platform (https://usegalaxy.org), StructureFold integrates four computational modules in a user-friendly web-based interface or via local installation [3].
[1] Ding Y, Tang Y, Kwok CK, Zhang Y, Bevilacqua PC, Assmann SM. Nature. 2014;505:696-700.
[2] Tang Y, Assmann SM, Bevilacqua PC. J Mol Biol. 2016;428:758-766.
[3] Tang Y, Bouvier E, Kwok CK, Ding Y, Nekrutenko A, Bevilacqua PC, Assmann SM. Bioinformatics. 2015;31:2668-75.
Short Abstract: With recent advances in deep learning techniques in the machine learning and artificial intelligence fields, increasing attention has been paid to applying deep learning methods to bioinformatics challenges that benefit from automatic feature representation. In this work, we applied recurrent neural networks (RNNs) to RNA-seq analysis to solve the sequence-specific bias correction problem. We are motivated to use RNNs to model genomic sequences whose structure is unknown. We modelled nucleotide sequence probabilities using an RNN and estimated sequence-specific bias weights for RNA-seq reads. We demonstrated that an RNN provides a flexible way to model nucleotide sequences without pre-determining or assuming sequence structures. The RNN-based correction method provides competitive results compared with the state-of-the-art RNA-seq bias correction method "SeqBias", which uses a Bayesian network.
Short Abstract: Resolution of ambiguity when a read can be aligned to more than one location in a reference sequence is a key problem in the processing of short read data. Despite the importance of ambiguity resolution in read mapping, its impact has not been extensively investigated. In this research, we examine sources of ambiguity together with the shortcomings of existing read mapping methods with regard to ambiguous reads. In this context we investigate the impact that ambiguous mappings have on the interpretation of short read data, using several k-mer analyses on two different bacterial genomes: Mycobacterium tuberculosis (MTB) and Orientia tsutsugamushi (OT).
The difficulties posed by ambiguous mappings are illustrated by limitations of a read mapper (Bowtie2) when applied to a set of synthetic reads. While less than 3% of reads map to multiple locations for the MTB experiment, in the case of OT, about 45% of reads map to more than one location using single-end mapping and this only drops to 35% when the paired-end information is exploited. The accuracy of read mapping can be considerably affected by multi-reads even when using the best existing mapping tools.
The results show the importance of ambiguity resolution in read mapping. These analyses can be used to discover the extent to which ambiguity affects mapping accuracy for a given genome, and they pave the way for more sophisticated mapping techniques.
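As a rough illustration of the kind of k-mer analysis mentioned above, the sketch below measures how much of a sequence is covered by k-mers that occur more than once, which is one simple proxy for how many reads of length k could map ambiguously. It is not the authors' analysis: the function name `repeated_kmer_fraction` is hypothetical, only the forward strand is considered, and the input is a toy string rather than a real genome.

```python
from collections import Counter


def repeated_kmer_fraction(genome: str, k: int) -> float:
    """Return the fraction of k-mer positions whose k-mer occurs more than once.

    Simplifying assumptions: forward strand only, exact matches only.
    """
    n_positions = len(genome) - k + 1
    if n_positions <= 0:
        raise ValueError("k must not exceed the sequence length")
    counts = Counter(genome[i:i + k] for i in range(n_positions))
    # Every occurrence of a k-mer seen at least twice is an ambiguous position.
    ambiguous = sum(c for c in counts.values() if c > 1)
    return ambiguous / n_positions


if __name__ == "__main__":
    toy = "ACGTACGTTTGA"  # toy sequence, not real data
    print(repeated_kmer_fraction(toy, 4))
```

Running the same function over a range of k values (for example, typical read lengths) would show how quickly ambiguity falls as reads get longer for a given genome.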
Short Abstract: Motivation: Duplication in biological sequence databases has persisted for 20 years. Duplicate records introduce redundancies to databases, delay biocuration processes, and undermine the accuracy of studies based on sequence analysis such as GC content and melting temperature. Rapid growth of data makes purely manual de-duplication nearly impossible, and existing automatic systems cannot detect duplicates as precisely as experts. Supervised learning has the potential to address such problems by building automatic systems that learn from expert curation to detect duplicates precisely and efficiently. While a mature approach in other duplicate detection contexts, machine learning has seen only preliminary application in the large biological sequence databases.
Results: We developed a supervised duplicate detection method, employing an expert-curated dataset of over one million duplicate pairs across five organisms derived from genomic sequence databases. We selected 22 features to represent distinct attributes of the database records, and developed both binary and multi-class models. Both models achieve promising performance; the binary model had over 90% accuracy in all five organisms, while the multi-class model maintains high accuracy and generalises more robustly. We performed an ablation study to quantify the impact of different sequence record features, finding that features derived from meta-data, sequence identity, and alignment quality impact performance most strongly. In particular, better measurement of sequences drives the performance.
Short Abstract: Single cell RNA-seq has been widely used in biological studies. Removing technical noise and normalizing the sequencing data are critical to fully exploit the power of this technology. Various methods have been developed for normalization, including FPKM, UQ, DESeq, RUV, and GRM. Of these, RUV and GRM can use ERCC spike-ins to calibrate the technical noise. There is an urgent need to assess the performance of these methods using data with ground truth.
Recently, the NIH Single Cell Analysis Program – Transcriptome Project generated an RNA-seq data set using different amounts of RNA (10pg, 100pg and bulk) with ERCC. These data provide an unprecedented opportunity to compare different methods using the same data set. After normalization using each method, we clustered the samples and assumed that bulk samples are most similar to each other, that 100pg samples are more similar to bulk than 10pg samples are, and that 10pg samples are more diverse. We evaluated the clustering performance using several statistical indices.
The results showed that, among methods not using ERCC, UQ, DESeq and RUV have comparable performance and are better than FPKM. RUV and GRM with ERCC significantly outperformed the methods without ERCC. Between RUV and GRM, GRM is more robust to the choice of gene set.
In summary, we presented the first systematic comparison of normalization methods for single cell RNA-seq. We found that considering ERCC helps to remove technical noise and drastically improves clustering results. This study provides guidance for selecting normalization methods for analyzing single cell RNA-seq data.
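For readers unfamiliar with the simplest of the methods compared above, the sketch below shows a minimal form of upper-quartile (UQ) scaling: each sample's counts are divided by the upper quartile of its non-zero counts and rescaled to a common target. This is an illustrative simplification, not the pipeline used in the study; the function names, the choice of the mean upper quartile as target, and the toy counts are all assumptions.

```python
from statistics import quantiles


def upper_quartile(counts):
    """Upper quartile (75th percentile) of the non-zero counts of one sample."""
    nonzero = [c for c in counts if c > 0]
    # quantiles() needs at least two data points.
    return quantiles(nonzero, n=4, method="inclusive")[2]


def uq_normalize(samples):
    """Scale every sample so that its upper quartile equals the mean upper quartile.

    `samples` maps a sample name to a list of per-gene counts (same gene order).
    """
    uqs = {name: upper_quartile(counts) for name, counts in samples.items()}
    target = sum(uqs.values()) / len(uqs)
    return {
        name: [c * target / uqs[name] for c in counts]
        for name, counts in samples.items()
    }


if __name__ == "__main__":
    # Toy data: sample "b" was sequenced twice as deeply as sample "a".
    toy = {"a": [0, 10, 20, 30, 40], "b": [0, 20, 40, 60, 80]}
    for name, values in uq_normalize(toy).items():
        print(name, values)
```

After scaling, the two toy samples become identical, which is the behaviour one wants when the only difference between them is sequencing depth. Spike-in based methods such as RUV and GRM go further by modelling technical noise from the ERCC controls, which this sketch does not attempt.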
Short Abstract: Mucin-type O-glycosylation is a common posttranslational protein modification that is initiated in the Golgi by the addition of alpha-N-acetylgalactosamine (GalNAc) to threonine or serine residues. This is performed by a family of 20 ppGalNAc-T transferases. O-glycosylation is found over a wide range of species and serves multiple biological functions such as protecting cell surfaces, modulating receptor activity and modulating protease cleavage sites. The ppGalNAc-Ts are differentially expressed, and changes in their expression have been associated with a range of cancers and other disorders. It is now known that each ppGalNAc-T has a preferred recognition sequence or motif, including long- and short-range preferences for residues that have already been glycosylated by GalNAc. We have obtained such preferences, or enhancement values, for 10 isoforms using a series of random peptide substrate libraries. From these data the so-called enhancement value product (EVP) can be obtained, which reflects the propensity of a site to be O-glycosylated by a particular ppGalNAc transferase. This calculation can now be performed by a Python web application called ISOGlyP at isoglyp.utep.edu.
This work reports the expansion of ISOGlyP to include an iterative method that will take into account the effect of the prior glycosylation of neighboring sites on subsequent glycosylation. Here the program attempts to mimic the actual process of glycosylation by first glycosylating the most preferred predicted sites and then repeating the prediction process until some desired threshold of glycosylation is achieved. With this approach the proposed regulatory roles of prior glycosylation may be tested.
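The iterative idea described above can be sketched as a greedy loop: score every unglycosylated site given the sites already glycosylated, glycosylate the best one, and repeat until no site reaches a chosen threshold. This is only one reading of the procedure, not ISOGlyP's implementation; the function `iterative_glycosylation`, the stopping rule, the `toy_score` function and its numbers are hypothetical and are not real enhancement values.

```python
def iterative_glycosylation(sites, score, threshold):
    """Greedily glycosylate sites, re-scoring after every addition.

    sites     : iterable of candidate Ser/Thr positions
    score     : callable(site, glycosylated_sites) -> propensity (EVP-like value)
    threshold : stop when the best remaining site scores below this value
    Returns the glycosylated positions in the order they were added.
    """
    glycosylated = []
    remaining = sorted(set(sites))
    while remaining:
        scores = {site: score(site, glycosylated) for site in remaining}
        best = max(remaining, key=scores.get)
        if scores[best] < threshold:
            break
        glycosylated.append(best)
        remaining.remove(best)
    return glycosylated


# Toy scoring function (assumption): a fixed base propensity per site,
# boosted by 50% when a site within 3 residues is already glycosylated.
BASE = {5: 2.0, 8: 0.8, 20: 0.5}


def toy_score(site, glycosylated):
    boost = 1.5 if any(abs(site - g) <= 3 for g in glycosylated) else 1.0
    return BASE[site] * boost


if __name__ == "__main__":
    print(iterative_glycosylation(BASE, toy_score, threshold=1.0))
```

In the toy example, position 8 falls below the threshold on its own but passes once its neighbour at position 5 has been glycosylated, which is the kind of prior-glycosylation effect the iterative method is meant to capture.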
Short Abstract: Retroviruses transcribe messenger RNA for the overlapping Gag and Gag-Pol polyproteins, by using a programmed -1 ribosomal frameshift which requires a slippery sequence and an immediate downstream stem-loop secondary structure, together called frameshift stimulating signal (FSS). It follows that the molecular evolution of this genomic region of HIV-1 is highly constrained, since the retroviral genome must contain a slippery sequence (sequence constraint), code appropriate peptides in reading frames 0 and 1 (coding requirements), and form a thermodynamically stable stem-loop secondary structure (structure requirement). We describe a unique computational tool, RNAsampleCDS, designed to compute the number of RNA sequences that code up to six user-specified peptides, possibly of various lengths (where optional IUPAC sequence constraints may be stipulated), in six overlapping reading frames. Subsequently, RNAsampleCDS can sample a user-specified number of RNAs that correctly code the stipulated peptides, and/or compute the exact position-specific scoring matrix. Using RNAsampleCDS, we obtain an estimate of 69% for the Boltzmann probability of stem-loop formation for 52 nt RNAs that contain the slippery sequence UUUUUUA and code Gag [resp. Pol] 17-mer peptides, whose amino acids have BLOSUM62 similarity at least +1 with those coded in Rfam family RF00480 of HIV-1 Gag-Pol frameshift stimulating signal RNAs. Such sample applications show that RNAsampleCDS constitutes a unique tool in the software arsenal now available to evolutionary biologists. Availability: Source code for the programs and additional data are available at http://bioinformatics.bc.edu/clotelab.
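To make the counting problem concrete, the sketch below counts RNA sequences that code two given peptides in two overlapping reading frames (frame 0 and frame +1), using dynamic programming over the last two nucleotides. It is a deliberately reduced illustration and not the RNAsampleCDS algorithm: it handles only two frames and equal-length peptides, ignores IUPAC constraints, and the function name `count_coding_rnas` is hypothetical. The standard genetic code is assumed.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def count_coding_rnas(pep0: str, pep1: str) -> int:
    """Count RNAs of length 3n+1 whose frame 0 codes pep0 and frame +1 codes pep1."""
    if not pep0 or len(pep0) != len(pep1):
        raise ValueError("this sketch needs two non-empty peptides of equal length")
    length = 3 * len(pep0) + 1
    # ways[(a, b)] = number of valid prefixes whose last two nucleotides are a, b
    ways = {(a, b): 1 for a in BASES for b in BASES}
    for i in range(2, length):
        start = i - 2  # first position of the codon that ends at position i
        frame, index = start % 3, start // 3
        extended = {}
        for (a, b), n in ways.items():
            for c in BASES:
                aa = CODON_TABLE[a + b + c]
                if frame == 0 and aa != pep0[index]:
                    continue
                if frame == 1 and aa != pep1[index]:
                    continue
                # Codons in frame 2 are unconstrained in this sketch.
                extended[(b, c)] = extended.get((b, c), 0) + n
        ways = extended
    return sum(ways.values())


if __name__ == "__main__":
    # AUGG is the only 4-nt RNA coding M in frame 0 and W in frame +1.
    print(count_coding_rnas("M", "W"))  # 1
```

The same table of partial counts could in principle be walked backwards to sample sequences uniformly, which is the sampling step the abstract mentions; that part is not shown here.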
Short Abstract: Probiotic products, such as dietary supplements and foods that contain intentionally added live microorganisms, are purported to provide a human health benefit. Data on the identification and characterization of the microbial community found in these products (i.e. post-market surveillance) is scant; furthermore, pre-market requirements focus on ensuring that foods containing these live microbes are safe for consumption. Our objective was to create a genome sequence database of Gram-positive bacteria found in these products, to be used to verify the labeled microbial content of dietary supplements sold and consumed in the US. To construct this database, product samples were analyzed using a culture-independent metagenomic sequencing approach. Analysis of the shotgun genomic sequence data was used to identify product-specific microbial communities using unique clade-specific markers based on our custom k-mer database. In addition, the microbial contents of the dietary products were grown under different temperatures and atmospheric conditions to allow for the growth of all species listed on the product label. Purified isolates were sequenced with next-generation Illumina platforms and identified at the whole-genome, strain-specific level. In this way, a genome database was established based on the genomic sequences of single-colony Gram-positive isolates associated with each product. A phylogenetic tree was also created from these newly added sequences along with reference strains from GenBank, using SNPs from core genes. This database can be used to identify intended and unintended (contaminating) microbial constituents and to corroborate product-label contents.
Short Abstract: Structure probing coupled with high-throughput sequencing holds the potential to revolutionise our understanding of the role of RNA structure in regulation of gene expression. Despite major technological advances, intrinsic noise and high coverage requirements greatly limit the applicability of these techniques. Existing methods [1, 2, 3] do not provide strategies for correcting biases of the technology and are not sufficiently informed by inter-replicate variability in order to perform justifiable statistical assessments. We developed a probabilistic modelling pipeline which specifically accounts for biological variability and provides automated empirical strategies to correct coverage- and sequence-dependent biases in the data. The output of our method yields statistically interpretable scores for the probability of nucleotide modification transcriptome-wide, obviating the need for arbitrary thresholds and post-processing. We demonstrate on two yeast data sets that our method has greatly increased sensitivity, enabling the identification of modified regions on a greatly increased number of transcripts, compared with existing pipelines. Our method also provides accurate and confident predictions at much lower coverage levels than those recommended in recent studies [3, 4], which are normally only met for a handful of transcripts in transcriptome-wide experiments. Our results show that statistical modelling greatly extends the scope and potential of transcriptome-wide structure probing experiments.
[1] Ding, Yiliang, et al. Nature 505.7485 (2014).
[2] Kielpinski, Lukasz Jan, and Jeppe Vinther. Nucleic Acids Research (2014).
[3] Talkish, Jason, et al. RNA 20.5 (2014).
[4] Siegfried, Nathan A., et al. Nature Methods 11.9 (2014).
Short Abstract: Though Illumina has largely dominated the RNA-seq market, the emergence of Ion Torrent in recent years has left scientists wondering which platform is best suited for RNA-seq analysis. Previous studies comparing the two have often used reference samples derived from cell lines and brain tissue. These comparisons more closely model studies of tissue-specific expression, marked by large-scale transcriptional differences between samples. Here we use a treatment/control experimental design, which allows us to compare these platforms in the context of the more subtle differences common to differential gene expression (DGE) experiments. We assessed the hepatic inflammatory response of mice by using an Illumina HiSeq and an Ion Torrent Proton to study liver RNA from control and IL1B-treated animals. While we found the greatest differences between these platforms at the level of read alignment/coverage, they showed greater concordance at the level of DGE analysis, and yielded very similar results for pathways affected by IL1B treatment. Interestingly, we also observed a strong interaction between sequencing platform and choice of aligner. By aligning both real and simulated Illumina and Ion Torrent data with twelve of the most commonly cited aligners in the literature, we found that different aligner and platform combinations were better suited to probing different genomic features, such as disentangling the source of expression in gene-pseudogene pairs. Taken together, our results show that while Illumina and Ion Torrent show similar capacity to detect changes in biology from a treatment/control experiment, these platforms may be tailored to interrogate different transcriptional phenomena through careful selection of alignment software.
Short Abstract: Translation, the second step of the central dogma of molecular biology, is the process of translating mRNAs into polypeptide chains. Finding the ORF corresponding to a given mRNA transcript is an important step in reconstructing protein isoforms, which is vital for a better understanding of the effects of alternative RNA splicing in diseases like cancer.
Existing tools for finding ORFs suffer from several shortcomings. First, every mRNA, in general, contains only one main ORF, whereas the existing tools detect all regions between any start codon and stop codon in a given sequence without determining the actual ORF. Second, ORFs in a set of transcripts are found individually for each sequence. The lack of a batch mode for finding protein isoforms makes the whole process time-consuming and inconvenient for the user when dealing with a large number of transcripts.
In our approach, we have developed a tool for finding all six frames corresponding to a given mRNA sequence and identifying the corresponding ORFs from RNA-Seq data. The ability to process several transcripts in batch mode makes this tool well suited to protein isoform prediction for a large number of transcripts. Moreover, the proposed tool is able to find both known and novel protein isoforms from RNA-Seq data. Using this tool we have been able to identify several novel transcripts, including one with an extra exon in the 5’ UTR of WWP2. WWP2 has been shown to be related to several types of cancer.
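The six-frame step described above can be sketched as follows. The code translates a transcript (given as cDNA over A, C, G, T) in all six reading frames and then picks the longest stretch that starts with methionine and ends at a stop codon. Choosing the longest ORF as the "main" ORF is an assumption made for this illustration, not necessarily the rule the tool uses, and the function names are hypothetical. The standard genetic code is assumed.

```python
import re
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def six_frames(seq: str):
    """Yield (strand, offset, protein) for the six reading frames of seq."""
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            codons = (s[i:i + 3] for i in range(offset, len(s) - 2, 3))
            # Codons with ambiguous bases are translated as 'X'.
            yield strand, offset, "".join(CODON_TABLE.get(c, "X") for c in codons)


def longest_orf(seq: str):
    """Return (strand, offset, peptide) for the longest M...stop ORF, or None."""
    best = None
    for strand, offset, protein in six_frames(seq.upper()):
        # An ORF here is an M followed by non-stop residues up to a stop codon.
        for match in re.finditer(r"M[^*]*(?=\*)", protein):
            if best is None or len(match.group()) > len(best[2]):
                best = (strand, offset, match.group())
    return best


def longest_orfs_batch(transcripts: dict) -> dict:
    """Batch mode: map each transcript name to its longest ORF (or None)."""
    return {name: longest_orf(seq) for name, seq in transcripts.items()}


if __name__ == "__main__":
    print(longest_orfs_batch({"toy_tx": "CCATGAAATGGTAACC"}))
```

The batch wrapper is a plain dictionary comprehension; for very large transcript sets the same per-transcript function could be mapped over a FASTA stream instead of an in-memory dictionary.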
Short Abstract: Whole Exome Sequencing (WES) and Whole Genome Sequencing (WGS) are the two major sequencing technologies used to interrogate the genome for variations in order to identify the genetic etiology underlying clinical diseases. Although the price of WGS has fallen significantly, it is still twice as expensive as WES for equivalent coverage. Moreover, WES is efficient in both library preparation and computational time because only a small percentage of the genome (1-2%) is analyzed. In a recent study that analyzed the genomes of a large cohort of patients by WES, the underlying molecular basis of various disorders was identified in 25% of the cases. Thus, WES will continue to be a valuable clinical tool for disease-associated studies in the near future. The most significant concern about WES data is the occurrence of low coverage regions and the non-uniform distribution of coverage. We recently showed that such low coverage regions could not be improved by increasing the number of sequencing runs. We identified mistakenly discarded reads as a significant factor associated with the occurrence of low coverage regions. Here, we describe a new program, Rescuer, to retrieve discarded reads. Rescuer first clusters overlapping reads mapped to multiple locations on the reference genome, and then assembles adjacent reads into longer contigs. With this strategy we are able to recover about 10-15% of the discarded, ambiguously mapped, on-target sequencing reads. Rescuer significantly contributes towards variant detection in clinical investigations and accounts for some of the missing heritability issues in the study of complex diseases.
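The first step described above, clustering overlapping reads, can be illustrated with a standard interval sweep. This is a sketch of the general technique only, not Rescuer's code: reads are reduced to hypothetical `(chrom, start, end)` tuples with half-open coordinates, and the assembly of each cluster into a contig is not shown.

```python
def cluster_overlapping(reads):
    """Group reads whose mapped intervals overlap on the same chromosome.

    reads : iterable of (chrom, start, end) tuples, half-open coordinates
    Returns a list of clusters, each a list of reads sorted by position.
    """
    clusters = []
    current, current_chrom, current_end = [], None, None
    for chrom, start, end in sorted(reads):
        if current and chrom == current_chrom and start < current_end:
            # The read overlaps the running cluster: extend it.
            current.append((chrom, start, end))
            current_end = max(current_end, end)
        else:
            if current:
                clusters.append(current)
            current, current_chrom, current_end = [(chrom, start, end)], chrom, end
    if current:
        clusters.append(current)
    return clusters


if __name__ == "__main__":
    toy_reads = [  # toy coordinates, not real data
        ("chr1", 140, 190),
        ("chr1", 100, 150),
        ("chr1", 300, 350),
        ("chr2", 120, 170),
    ]
    for cluster in cluster_overlapping(toy_reads):
        print(cluster)
```

Sorting first keeps the sweep linear after an O(n log n) sort, so the step scales to the read counts typical of exome data.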
Short Abstract: The MinION sequencing platform from Oxford Nanopore Technologies (ONT) is still a pre-commercial technology, yet it is generating substantial excitement in the field for its features – longer read lengths and single-molecule sequencing in particular. As groups start developing bioinformatics tools for this new platform, a method to model and simulate the properties of the sequencing data will be valuable to test alternative approaches and to establish performance metrics. Here, we introduce NanoSim, a fast and lightweight read simulator that captures the technology-specific characteristics of ONT data with robust statistical models.
NanoSim is built on our observation that patterns of correct base calls and errors (mismatches and indels) can be described by statistical mixture models. Further, the structures of these models are consistent across chemistries and organisms (E. coli and S. cerevisiae). NanoSim generates synthetic ONT reads with empirical profiles derived from reference datasets, or using runtime parameters. Empirical profiles include read lengths and alignment fractions (the ratio of alignment lengths after unaligned portions of reads are soft-clipped from their flanks to read lengths). The lengths of intervals between errors (stretches of correct bases) and error types are modeled by Markov chains, and the lengths of errors are drawn from mixed statistical models.
In this work, we demonstrate the performance of NanoSim on publicly available datasets generated using R7 and R7.3 chemistries. The scalability of NanoSim to human-size genomes will benefit the development of scalable NGS technologies for long nanopore reads.
Short Abstract: Species-, tissue-, or condition-specific transcriptomes can illuminate the underlying cellular metabolism for which they encode. RNASeq data, if properly assembled, can provide a snapshot of the transcriptome. Transcriptome assembly tools can be divided into two broad categories: de novo tools using only sequence data without references, and genome-guided tools that reconstruct the sequences using a reference genome. Differences in how assemblers generate and process their graphs result in drastically different behavior given the same input. Each method and each combination of k-mer length or parameter selection can generate drastically different predicted transcripts. In this study, we compare the performance of six different transcriptome assemblers, three genome-guided and three de novo, using two different sets of simulated paired-end RNASeq data as a benchmark. The first set contains no sequencing errors and has a completely uniform insert size and coverage across transcripts. The second, more realistic, set contains errors, variable insert size, and biased transcript coverage modeled after 76nt Illumina HiSeq sequencing of Arabidopsis thaliana. For both data sets, no assembler was able to accurately assemble all of the reference transcripts; a large number of transcripts were not assembled correctly by any assembler (20-30% of the references, depending on the simulation model). Each tool correctly assembled transcripts that no other assembler did. De novo tools uniquely assembled similar numbers of transcripts to genome-guided tools. There remains much room for improvement in transcriptome assembly, which can be achieved either by developing more accurate assemblers or by combining the correct predictions from existing tools.
Short Abstract: The advent of NGS technologies resulted in an exponential increase in the number of complete genomes available in biological databases. With this advance, many computational tools were developed specifically to analyze this large amount of data in different steps, from processing and quality filtering to gap filling and manual curation. Tools developed for closing gaps are very useful as they result in more accurate genomes, which will influence downstream analyses such as genomic plasticity and comparative genomics. However, the gap filling step is still a challenge for genome assembly and often requires manual intervention.
Here, we present GapBlaster, a graphical application for evaluating and closing gaps. GapBlaster was developed in the Java programming language. The software aligns the contigs obtained in the assembly against the draft genome/scaffold, using BLAST or Mummer, to close gaps. Then, all identified alignments of contigs that extend through the gaps in the draft sequence are presented to the user for further evaluation in the GapBlaster graphical interface. GapBlaster presents better results when compared to other similar software and has the advantage of a graphical interface for manual curation of the gaps.
The GapBlaster program, the user manual and the test datasets are freely available at https://sourceforge.net/projects/gapblaster2015/. It requires Sun JDK 8, and Blast or Mummer installed to perform local alignments.
Short Abstract: Circular RNAs (circRNAs) are a class of endogenous noncoding RNAs that have attracted great attention due to their potential biological function as regulators of microRNAs. Next-generation sequencing technologies and novel bioinformatics approaches enable the detection of circRNAs in many species. This study provides an overview of circRNA candidates detected in an RNA-seq dataset covering 11 organs of Fischer 344 rats at 4 developmental stages. The induction of circRNA candidates displays clear organ-specific patterns and gender differences. Liver and muscle have the lowest numbers of circRNA candidates and brain has the most; the same pattern was observed for expressed genes and transcripts in our previous study. Among the 1,793 parental genes, only 58 were detected with backspliced junctions in all eleven organs, and 63 were detected in ten organs but absent in one organ. The overlap of the induced circRNAs between males and females is less than 50% in most non-sexual organs. A trend of increase in circRNA candidates along the four developmental stages was observed in brain and liver for both sexes. There is a drop in circRNA candidates in thymus for aged rats. The number of circRNA candidates was stable in heart and lung. In the sex organs, the number of circRNA candidates remained stable across the aging points in uterus, increased in the younger ages (Juvenile through Adult) in thymus and then significantly dropped for aged rats. Further knowledge of circRNA candidates in rat will undoubtedly advance the study of drug toxicity at the RNA regulation level.
Short Abstract: Previous analyses of site-specific variation in the rate of protein evolution (rate heterogeneity) have uncovered significant correlations between amino acid replacement rates and several structural and functional properties of proteins. To better understand the relationship between protein structure and rate heterogeneity, we have conducted a large-scale phylogenetic and statistical analysis testing the effects of three structural properties on site-specific amino acid replacement rates. Using sequence-based computational methods, we were able to predict amino acid replacement rates along with 1) intrinsic disorder propensity, 2) secondary structure and 3) functional domain involvement for millions of amino acid sites in nearly 12,000 animal protein sequence alignments. This data was evaluated via an unbalanced factorial analysis to assess the significance of individual structural effects on evolutionary rates, as well as statistical interactions among structural properties. Our results somewhat corroborate earlier findings that intrinsically disordered sites are more variable, on average, than ordered sites. However, there is considerable overlap among the rate distributions of ordered and disordered sites. Also, a significant confounding interaction exists between intrinsic disorder and secondary structure. Notably, a number of protein sites are consistently predicted to be both intrinsically disordered and involved in secondary structures. These sites tend to be conserved at the amino acid level, suggesting they are highly constrained and functionally important. Ultimately, these results suggest that multiple structural drivers of protein evolution should be evaluated simultaneously in order to get a clear picture of their individual effects as well as any confounding interactions among them.
Short Abstract: RNA-seq not only measures gene transcription levels; it also contains genetic variant information. The variants identified from RNA-seq data are of particular interest as they may reveal functional roles of the genetic alterations as well as regulatory information such as RNA editing. Currently, RNA-seq variant detection is largely based on methods developed for DNA sequencing such as whole exome sequencing (WES) or whole genome re-sequencing (WGS). While this is a widely adopted practice, there has not been a systematic comparison of the variants detected from DNA sequencing and those detected from RNA-seq in matched samples. With the recently released NIST benchmark on matched DNA and RNA samples, we tested multiple variant-calling tools and workflows and compared the concordance of variants detected from the matched DNA and RNA samples. We further examined the variants detected from WES and RNA-seq data from matched samples in multiple cancer samples from TCGA.
Short Abstract: DMAP is a comprehensive pipeline for the differential analysis of
methylation from high-throughput sequencing of sodium-bisulfite
converted DNA. Its main features include: flexibility in the
specification of the methylated sites, ability to evaluate methylation
changes across multiple conditions, and ease of configuration.
DMAP is based on the MOABS pipeline, but replaces its MCALL module,
responsible for identifying base conversions and computing methylation
rates, with the new tool CSCALL. CSCALL uses samtools to
determine coverage and conversion rates at each site of interest,
resulting in fast and reliable processing, and is able to examine
conversion of cytosine in contexts other than CG, whose importance in
plants and mammalian embryonic stem cells is increasingly recognized.
The main CSCALL output is a BED file containing coverage and
conversion rates at each analyzed site; the program also generates a
report of coverage and conversion rates (by chromosome and
genomewide), and reports conversion rates for "lone" cytosines, an
important quality control measure. Finally, CSCALL provides a
filtering function that can be applied to short reads before the
mapping step, which removes reads with an excessive number of
unconverted lone cytosines.
DMAP output consists of an automatically-generated report including
tables of differentially methylated sites between conditions, plots of
methylation across chromosomes and gene regions, and quality control
statistics. Usage of the pipeline is controlled by a simple
configuration file listing input files for each sample and all
We will describe the architecture of the pipeline and results obtained
on experimental datasets.
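As an illustration of the per-site coverage and conversion rates described for CSCALL above, the following is a minimal sketch and not CSCALL's actual code. It assumes the read counts at each cytosine have already been tallied (for example from a samtools pileup); the `SiteCounts` type, the function names, and the BED column layout are hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass
class SiteCounts:
    """Read counts at one cytosine site (hypothetical container for this sketch)."""
    chrom: str
    pos: int          # 0-based position of the cytosine
    unconverted: int  # reads still showing C (interpreted as methylated)
    converted: int    # reads showing T (interpreted as unmethylated)


def site_rates(site):
    """Return (coverage, conversion_rate, methylation_rate), or None if uncovered."""
    coverage = site.unconverted + site.converted
    if coverage == 0:
        return None  # no reads: the rates are undefined
    conversion = site.converted / coverage
    methylation = site.unconverted / coverage
    return coverage, conversion, methylation


def to_bed_line(site):
    """Format one site as a tab-separated BED-like line (assumed column layout)."""
    rates = site_rates(site)
    if rates is None:
        return None
    coverage, conversion, _ = rates
    return f"{site.chrom}\t{site.pos}\t{site.pos + 1}\t{coverage}\t{conversion:.3f}"
```

For a site covered by three unconverted and one converted read, `site_rates` gives a coverage of 4, a conversion rate of 0.25, and a methylation rate of 0.75.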
Short Abstract: Background: Whole exome next-generation sequencing data is commonly used for detection of small exonic mutations, but there has been a growing desire to accurately detect copy number variations (CNVs) as well. In order to address this research and clinical need, we previously developed the novel method PatternCNV (Wang C., Evans J., et al. (Bioinformatics 2014)). We are now releasing PatternCNV 2.0 and have made significant modifications and improvements to the detection algorithms, data integrity checks, CNV calling, and reports.
Methods and Results: Major improvements in PatternCNV 2.0 include complete automation to increase usability, GC bias correction, CNV segmentation, data quality reports, and publication-quality images. Algorithm improvements were also made to improve somatic CNV detection as well as germline CNV detection in trios. Additionally, a set of utilities is included that allows users to easily plot CNVs in focused genes of interest. We demonstrate the somatic CNV enhancements by accurately detecting CNVs in whole-exome data from ovarian-TCGA samples and in a lymphoma case study with paired tumor and normal samples. We also show PatternCNV’s improved germline CNV calling using a HapMap trio to detect CNVs with various modes of inheritance. The performance of PatternCNV 2.0 is evaluated by comparing CNV calling results with results from orthogonal platforms.
Short Abstract: RNA secondary structure folding kinetics is known to be important for the biological function
of certain processes, such as the hok/sok system in E. coli. Although linear algebra provides an
exact computational solution of secondary structure folding kinetics with respect to the Turner
energy model for tiny ( 20 nt) RNA sequences, the folding kinetics for larger sequences can only
be approximated by either using repeated simulations or by binning structures in a coarse-grained
model. Here we investigate the relation between two approaches to approximate the distribution
of first passage times in refolding an RNA sequence from the initial secondary structure A to the
target secondary structure B. In particular, we compare the time estimate to reach the minimum
free energy secondary structure, using a time-driven Metropolis Monte Carlo algorithm and using
the Gillespie algorithm. We prove that asymptotically, the Monte Carlo trajectory time is
larger than the Gillespie trajectory time, where
secondary structure neighbors, where a neighbor is obtained by removal or addition of a single base
pair. Despite this result, there is no such relation between the mean first passage time (MFPT)
computed by the Monte Carlo or by the Gillespie algorithms. Software developed for the Monte
Carlo and Gillespie algorithms for RNA folding, as well as software to compute the expected number
of neighbors, is available at http://bioinformatics.bc.edu/clote/RNAexpNumNbors.
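To make the two simulation schemes compared in the abstract above concrete, here is a minimal sketch on a toy energy landscape. It is not the authors' software: the five-state chain, its energies, and all function names are assumptions made for illustration, and real RNA folding would enumerate secondary structure neighbors under the Turner energy model instead.

```python
import math
import random

RT = 0.6163  # kcal/mol, gas constant times temperature at roughly 37 C

# Toy landscape (hypothetical): five states on a line, each with an energy.
ENERGY = [0.0, 1.0, -0.5, 0.5, -2.0]
NEIGHBORS = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}


def rate(e_from, e_to):
    """Metropolis rate: 1 for downhill moves, Boltzmann factor for uphill moves."""
    return min(1.0, math.exp(-(e_to - e_from) / RT))


def metropolis_fpt(energy, neighbors, start, target, rng):
    """Time-driven Monte Carlo: each proposal, accepted or not, costs one time unit."""
    state, t = start, 0.0
    while state != target:
        candidate = rng.choice(neighbors[state])  # uniform proposal
        if rng.random() < rate(energy[state], energy[candidate]):
            state = candidate
        t += 1.0
    return t


def gillespie_fpt(energy, neighbors, start, target, rng):
    """Event-driven Gillespie: draw an exponential waiting time, then always move."""
    state, t = start, 0.0
    while state != target:
        rates = [rate(energy[state], energy[n]) for n in neighbors[state]]
        flux = sum(rates)                 # total rate of leaving the state
        t += rng.expovariate(flux)        # waiting time with mean 1 / flux
        state = rng.choices(neighbors[state], weights=rates)[0]
    return t


def mean_fpt(simulate, runs, seed):
    """Average first passage time from state 0 to state 4 over repeated runs."""
    rng = random.Random(seed)
    return sum(simulate(ENERGY, NEIGHBORS, 0, 4, rng) for _ in range(runs)) / runs
```

Calling `mean_fpt(metropolis_fpt, 1000, 1)` and `mean_fpt(gillespie_fpt, 1000, 1)` gives the two estimates for this toy chain. In this sketch the uniform proposal divides the probability of each Monte Carlo move by the number of neighbors of the current state, which is one reason the two clocks run at different speeds.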
Short Abstract: Conformational and functional flexibility promote protein evolvability. High evolvability allows related proteins to functionally diverge and potentially neostructuralize. p53 is a conformationally flexible, multifunctional protein with roles in maintaining cellular and genomic integrity, idolized as the Guardian of the Genome. Despite its deemed importance, p53 is often mutated and implicated in various cancers. Here, the pre- and post-gene-duplication phases of vertebrate p53 and its paralogs, p63 and p73, were examined to gain insights into the evolutionary dynamics of functional domains, structural properties, and phosphorylation within >300 sequence representatives of this protein family. In early metazoans, a four-domain p53 protein ancestor was present. In some lineages, such as nematodes and arthropods, this protein has lost domains through both deletion and depletion of domain motifs, while the amount of structural disorder is simultaneously reduced. In other invertebrate lineages, complete four-domain p53 proteins remain. Coinciding with gene duplications in early vertebrates, the four-domain p53 protein ancestor underwent domain loss, resulting in the p53, p63, and p73 paralogs. Regions of structural disorder and phosphorylations were found to be redistributed both among paralogs and within clades, suggesting functional divergence. Finally, of the three paralogs, p63 is the most constrained while p53 is the least constrained. Indeed, p53 is a highly evolvable protein, as demonstrated through rearranged regions of structural disorder and changing domain signatures among lineages, and does not appear to fit the role of a resilient Guardian.
Short Abstract: DNA methylation patterns reveal gene regulatory features across multiple scales, from positions of individual nucleosomes to megabase-scale chromosomal domains. We have previously shown that long chromosomal domains of hypomethylation in colorectal cancer, or Partially Methylated Domains (PMDs), correspond to 3D topological domains that are late replicating and localized to the nuclear lamina. We noted that the same chromosomal domains had weak, but detectable, PMDs in adjacent normal tissues. Here, we report a comprehensive analysis of PMDs using a large and diverse whole-genome bisulfite sequencing (WGBS) dataset of 97 samples, including unpublished tumor and matched normal samples from 8 different cancer types from TCGA, along with 48 published samples from tissues of non-disease individuals. By using novel sequence features and a multi-sample bioinformatics approach, we present the global structure of PMDs across normal and cancer tissues. We provide this global segmentation based on the largest collection of WGBS datasets to date and conclude that DNA methylation maps of tumors and relevant normal cell types will allow us to understand gene regulatory changes that take place at the level of 3D nuclear topology.
Short Abstract: In recent years, advancements in PCR technology have enabled the use of capillary electrophoresis analysis of microsatellite biomarkers which, unlike SNVs, are challenging to analyze via next-generation sequencing. Two disease-relevant microsatellite biomarkers are the triplet (CGG)n repeat region of the FMR1 gene (fragile X syndrome) and the hexanucleotide (GGGGCC)n repeat region of the C9ORF72 gene (familial ALS and FTD). Manual fragment size analysis of assays targeting these regions is time-consuming, typically requiring a trained technician ~2 hours to analyze a batch of 96 samples. We present the development and evaluation of a novel algorithm for fully automated, rapid, and accurate sizing of the FMR1 and C9ORF72 variable repeat regions. The algorithm processes capillary electrophoresis data derived from assays that amplify the FMR1 and C9ORF72 repetitive elements, and employs a composite signal-processing and machine-learning approach to instantly report accurate repeat sizing for gene-specific amplification products. The algorithm was tested on an independent set of 1000 samples spanning the full FMR1 genotype range (up to 200 repeats resolved on CE), and on a set of 50 samples spanning the full C9ORF72 genotype range (up to 120 repeats on CE). Automated sizing was compared against manual genotyping performed by a trained analyst, and was shown to be 100% concordant (within ACMG guidelines) for FMR1 sizing, and 100% concordant (within +/- 3 repeat elements) for C9ORF72 sizing. These results demonstrate the feasibility of high-throughput automated analysis of repeat disorders.
Short Abstract: RNA-seq is primarily used in measuring gene expression, quantification of transcript abundance, and building reference transcriptomes. Without bias from a reference sequence, de novo RNA-seq assembly is particularly useful for building new reference transcriptomes, detecting fusion genes, and discovering novel transcripts. A number of approaches for de novo RNA-seq assembly were developed over the past six years, including Trans-ABySS, Trinity, Oases, IDBA-tran, and SOAPdenovo-Trans. Using 12 CPUs, assembling a human RNA-seq sample takes approximately a day and requires up to 100GB of memory. While the high memory usage may be alleviated by distributed computing, access to a high-performance computing environment is a strict requirement for RNA-seq assembly.
Here, we present a novel de novo RNA-seq assembler, “RNA-Bloom,” that utilizes Bloom filter-based data structures for compact storage of k-mer counts and the de Bruijn graph of two k-mer sizes in memory. Compared to existing approaches, RNA-Bloom can assemble a human RNA-seq sample with comparable accuracy using merely 10GB of memory, which is readily available on modern desktop computers. The de Bruijn graph of two k-mer sizes allows RNA-Bloom to effectively assemble both lowly and highly expressed transcripts. In addition, RNA-Bloom can assemble and quantify transcript isoforms without alignment of sequence reads, thus resulting in a quicker run-time than existing alignment-based protocols.
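The compact k-mer storage mentioned above can be illustrated with a plain Bloom filter. This is a minimal sketch, not RNA-Bloom's implementation: it records only k-mer membership (not counts or graph edges), and the class name, the parameters, and the use of salted BLAKE2b hashes are assumptions made for this example.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array with several hash functions; no false negatives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive independent hash values by salting BLAKE2b with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=i.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


def kmers(seq, k):
    """Yield every consecutive k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]
```

The memory cost is fixed by `num_bits` regardless of how many k-mers are inserted; the trade-off is a false-positive rate that grows as the filter fills.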
Short Abstract: In bioinformatics, there are many applications including sequence alignment, genome and transcriptome assembly, RNA-seq expression quantification, k-mer counting, error correction, and large-scale sequence analysis that rely on cataloguing or counting consecutive k-mers in DNA/RNA sequences for indexing, querying, and similarity search. An efficient way of implementing such operations is through the use of hash-based data structures, such as hash tables or Bloom filters. Therefore, improving the performance of hashing algorithms would have a broad impact for a wide range of bioinformatics tools.
Here, we present ntHash, a fast hash function for computing hash values for consecutive k-mers recursively. The algorithm calculates hash values for consecutive k-mers in a string using a recursive hash function, in which the hash value of the current k-mer is derived from the hash value of the previous k-mer. In this work, we have implemented an efficient cyclic polynomial rolling hash function for computing the hash values for all k-mers of a given DNA sequence. In hashing by cyclic polynomial, barrel shifts are used instead of multiplications to make the process faster. Experimental results demonstrate substantial speed improvement over conventional approaches, while retaining near-ideal hash value distribution. A run-time comparison of the proposed method with state-of-the-art general-purpose hash functions demonstrates that ntHash performs over 20x faster than the closest competitor, CityHash, the leading algorithm developed by Google.
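The recursive update described above can be sketched as a cyclic polynomial rolling hash over 64-bit words. This is an illustration of the general technique, not the ntHash source code: the per-nucleotide seed constants are arbitrary values chosen for this example, and the function names are hypothetical.

```python
MASK = (1 << 64) - 1

# Arbitrary 64-bit constants, one per nucleotide (assumed for this sketch).
SEED = {
    "A": 0x9E3779B97F4A7C15,
    "C": 0xC2B2AE3D27D4EB4F,
    "G": 0x165667B19E3779F9,
    "T": 0xD6E8FEB86659FD93,
}


def rol(x, r):
    """Rotate the 64-bit word x left by r bits (a barrel shift)."""
    r %= 64
    return ((x << r) | (x >> (64 - r))) & MASK


def direct_hash(kmer):
    """Hash one k-mer from scratch: XOR of seeds, each rotated by its distance from the end."""
    h = 0
    for base in kmer:
        h = rol(h, 1) ^ SEED[base]
    return h


def rolling_hashes(seq, k):
    """Yield the hash of every consecutive k-mer, updating in O(1) per step."""
    if len(seq) < k:
        return
    h = direct_hash(seq[:k])
    yield h
    for i in range(k, len(seq)):
        outgoing, incoming = seq[i - k], seq[i]
        # Rotate the old hash, cancel the base that left, add the base that entered.
        h = rol(h, 1) ^ rol(SEED[outgoing], k) ^ SEED[incoming]
        yield h
```

Because rotation distributes over XOR, each update reproduces exactly the value that `direct_hash` would compute for the new k-mer, while touching only two bases instead of all k.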
Short Abstract: In this project, we use the computational tool GAMI (Genetic Algorithms for Motif Inference) to identify candidate regulatory regions for the PDGFRβ gene.
The PDGFRβ gene is expressed in mural cells of the vasculature, including the mesangium of the kidney. The mesangium is of particular interest due to its role in renal scarring. In order to better understand the regulation of PDGFRβ in the kidney, candidate cis-regulatory modules (CRMs) were identified upstream and in the first intron of PDGFRβ using the computational tool GAMI. GAMI is a de novo motif inference system that is used to predict candidate regulatory regions in noncoding DNA by identifying putative conserved elements. GAMI is able to search multiple (e.g., 100) long sequence lengths (e.g., 100,000nt). Typically, a GAMI run will yield hundreds of putative conserved motifs, which are then assembled into putative CRMs using the companion tool GAMI-CRM. However, putative conservation is not sufficient to ensure function.
Activity of conserved sequences is dependent on biological context and the epigenetic factors that distinguish that context. Typically, a gene is active in many contexts, usually requiring different regulatory elements and epigenetic factors for each. In order to better identify these context-dependent regulatory elements, we have incorporated tissue-specific epigenetic data from the ENCODE project into our workflow. This allows for the discovery of relationships between epigenetic factors and putative conserved sequences. Using this technique, we have identified and validated kidney-specific active regulatory elements for PDGFRβ using a mouse model.
Short Abstract: The distribution of mutations across the length of the gene differs between oncogenes and tumor suppressor genes (TSGs). Oncogenes require gain of function and are prone to harbor mutations at select locations or “hotspots” on genes. TSGs, in contrast, can lose function through mutations, presumably, at any location on the coding sequence.
In our analysis of mutations in various cancer types in The Cancer Genome Atlas (TCGA), we found a significant difference in the distribution of missense and synonymous mutations across the length of the genes among oncogenes, TSGs and other “normal” genes. Normal genes harbor missense mutations equally along the length of the genes, while in oncogenes and TSGs, these mutations are more concentrated towards the N-terminus. This effect is stronger in TSGs than in oncogenes. Interestingly, oncogenes and TSGs show an inverse relationship in the distribution of synonymous mutations, with TSGs having more mutations towards the N-terminus and oncogenes having more mutations towards the C-terminus.
We measured the codon-usage bias in these 3 categories of genes. We found a significant difference in codon usage between oncogenes and TSGs along the length of the genes. For synonymous mutations, the codon-usage bias is higher towards N-terminal regions for both oncogenes and tumor suppressor genes. Oncogenes, on the other hand, have strikingly high codon-usage bias towards the C-terminus.
We speculate that these differences in the mutation distribution and the codon-usage bias across the length of the genes reflect distinct mechanisms of action of oncogenes and TSGs. We are currently investigating the factors that might contribute to such differences.
Short Abstract: The emergence of next-generation sequencing technologies over the past decade has facilitated a massive increase in the sequencing of new genomes and given rise to numerous comparative studies among species. This development has served as the basis to study evolution by considering the genomic differences between closely related species. The genomic differences can be caused by SNPs, indels or larger rearrangements. The rearrangements between closely related species can be associated with hotspots of divergence leading to speciation. In this study, we generated a de novo unitig assembly of the central chimpanzee sub-species and compared it to the available chimp reference genome (western chimpanzee). Instead of relying solely on read-level comparison, we attempted a unitig-based similarity-level approach for the detection of genomic rearrangements. This simplified approach can be used in the exploration of potential genomic rearrangements or breakpoints for genomes with low-coverage data and fragmented assemblies. Associating these genomic hotspots with inter-specific differences in gene expression allows us to improve our understanding of speciation processes.
Short Abstract: Single nucleotide variants (SNVs) are often detected in human specimens to correlate with other phenotypic variables. To date, there is no clear consensus on how much difference we should expect to see between blood and tissue, and between DNA and RNA of the same subject. Answering these questions can greatly contribute to the accuracy of SNV identification. We conducted a thorough study comparing the nucleotide sequence between each sample-sequencing type using a unique set of sequencing data from TCGA. The heterozygous consistency analysis showed a high consistency rate between sequencing data pairs of DNA samples. Once RNA data was introduced in the pair, the heterozygous consistency rate dropped substantially. We also analyzed the allele change frequencies, and the frequency of the two nucleotides upstream and downstream of the inconsistent genotype sites. The GG and CC nucleotides showed significant enrichment upstream, suggesting that GC content played a role in producing the mismatched genotypes. Clustering of allele change frequencies indicated that transition changes were clearly preferred over transversion changes. Our results systematically explore the differences between RNA and DNA samples from the same patient, and can help the identification of SNVs in blood or RNA samples.
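The transition versus transversion distinction used above can be made concrete with a short sketch. The function names and the input format (a list of reference/alternate base pairs) are assumptions made for this example and do not describe the authors' pipeline.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_change(ref, alt):
    """Label a single-base change as a transition or a transversion."""
    ref, alt = ref.upper(), alt.upper()
    valid = PURINES | PYRIMIDINES
    if ref not in valid or alt not in valid or ref == alt:
        raise ValueError(f"not a single-base change: {ref}>{alt}")
    # Transitions stay within a chemical class (A<->G or C<->T).
    if {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES:
        return "transition"
    return "transversion"


def ti_tv_ratio(changes):
    """Return transitions / transversions, or None if there are no transversions."""
    labels = [classify_change(ref, alt) for ref, alt in changes]
    ti = labels.count("transition")
    tv = labels.count("transversion")
    return ti / tv if tv else None
```

For example, `ti_tv_ratio([("A", "G"), ("C", "T"), ("A", "T")])` counts two transitions and one transversion, giving a ratio of 2.0.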
Short Abstract: The most common adult lymphoma, DLBCL, is a heterogeneous disease with respect to gene expression, clinical outcome, and mutational profile. Existing studies are inadequately powered to identify all the infrequently mutated genes that define the bulk of genetic alterations in the disease.
We embarked on the largest single-cancer whole exome sequencing study to date, characterizing over 1000 DLBCL cases. This sample size is an order of magnitude larger than existing DLBCL studies. While previous studies were only powered to detect mutations present in at least 15% of samples, this 1000-sample study is powered for detection of mutations occurring in fewer than 5% of cases.
Mutational analysis reveals striking patterns at multiple scales, from single nucleotide to gene to gene network. At the single nucleotide scale, the most striking genetic alteration was MYD88 L265P in 8% of cases. At the gene scale, MLL2/KMT2D was the most recurrently mutated, affecting 15% of cases. Finally, at the gene network level, candidates were computationally identified as interacting genes that exhibit maximal scores of both mutual exclusivity and coverage. Recurrent mutations occurred in chromatin modification, NF-κB, PI3K, and B-cell receptor gene networks.
In this largest single-cancer whole exome sequencing study to date, we have comprehensively and sensitively identified genetic alterations in DLBCL and used them to detect underlying gene network patterns that clinically stratify the disease. These findings advance our understanding of the heterogeneous genetic nature of DLBCL and enable us to identify potential therapeutic targets.