20th Annual International Conference on
Intelligent Systems for Molecular Biology


Poster numbers will be assigned May 30th.
If you cannot find your poster below, you probably have not yet confirmed that you will attend ISMB/ECCB 2015. To confirm your poster, find the poster acceptance email, which contains a confirmation link. Click on it and follow the instructions.

If you need further assistance please contact submissions@iscb.org and provide your poster title or submission ID.

Category U
U01 - DAPPLE: a program that facilitates the design of species-specific kinome microarrays
Short Abstract: The kinome microarray is a relatively new technology for studying phosphorylation-mediated cellular signalling. Relatively little phosphorylation data are available for organisms other than human, rat, and mouse, making it difficult to design kinome microarrays suitable for studying them. We recently developed a protocol for leveraging known phosphorylation sites from one organism to identify putative sites in a different organism. While effective, this procedure is time-consuming, tedious, and cannot feasibly make use of even a small fraction of the known phosphorylation sites. To solve this problem, we have developed a collection of Perl scripts called DAPPLE that automates the identification of putative phosphorylation sites in an organism of interest, improving and accelerating the process of designing kinome microarrays for species other than human, rat, and mouse.
U02 - miRSeqNovel: An R based workflow for analyzing miRNA sequencing data
Short Abstract: MicroRNAs (miRNAs) are short single-stranded RNA molecules that play an important role in regulating gene expression in many organisms. With the advent of modern high-throughput sequencing, new opportunities have arisen to quantitate the expression of known miRNAs as well as to predict novel miRNAs. Because of the short length of miRNAs, Illumina/Solexa and ABI SOLiD platforms are currently preferred for such experiments. However, customized protocols are needed to predict novel microRNAs due to dissimilar features of miRNAs in different species, such as different lengths of stem-loop sequences, or complementarity between mature and star miRNAs. To date, there is no workflow built into R to predict novel miRNAs from sequencing data.

We present miRSeqNovel, an R/Bioconductor based workflow for novel miRNA prediction from deep sequencing data. miRSeqNovel can process data in both colorspace (SOLiD) and basespace (Illumina/Solexa) formats, as well as the results from different mapping algorithms and reference genomes. It finds differentially expressed known miRNAs using popular statistical methods built into Bioconductor, and gives conservative predictions of novel miRNA candidates. miRSeqNovel is flexible and fully supports customized parameters.

miRSeqNovel is a general workflow for processing miRNA sequencing data in the R environment. The miRSeqNovel package and user manual are freely available at http://sourceforge.net/projects/mirseq/files
U03 - Accurate Multiple RNA Alignments
Short Abstract: In recent years there has been growing interest in the field of RNA. This is motivated both by the huge amount of transcript data now available and by the fact that RNA has proved to be relevant for many cellular functions. In this context, multiple sequence alignment is a strategy to highlight the evolutionary relationships between homologs, such as structurally conserved regions or, more generally, any biologically relevant pattern. However, comparing RNA sequences is a more challenging task than comparing protein sequences. This is because the RNA nucleotide alphabet is much simpler and therefore less informative than the alphabet of amino acids. Moreover, for many RNAs evolution is likely to constrain structure more than sequence, resulting in homologous sequences with very poor sequence conservation, which hinders the accuracy of sequence comparisons. Therefore, there is a need for comparison methods that can include structural information to produce more accurate multiple RNA sequence alignments. In this work we present SARA-Coffee, a new algorithm that combines the pairwise RNA structure alignments produced by SARA with the T-Coffee multiple sequence alignment framework.
U04 - Cleaning contaminated mouse reads from xenograft next-generation sequencing data by XenoCP: a xenograft cleansing pipeline
Short Abstract: Grafting human cancer tissue onto mice is a powerful tool for understanding cancer development and for exploring new therapeutic treatments. Analysis of xenograft samples by next-generation sequencing (NGS) can determine whether the xenograft faithfully represents the genetic and epigenetic abnormalities observed in its primary tumor. However, tumor samples extracted from mouse xenografts may be contaminated by murine cells, a potential source of misinterpretation in which human/mouse sequence differences are mistaken for de novo mutations acquired in the xenograft. Such errors are most prevalent in regions of strong human/mouse homology, such as gene-coding regions. Currently, there is no publicly available method to identify and remove murine reads from xenograft NGS data.
We developed a xenograft cleansing pipeline (XenoCP) that identifies and cleanses contaminated murine reads in xenograft NGS data by comparing NGS read identity to the human and mouse genomes. The method was tested on simulated data of admixed human and mouse reads as expected in a xenograft WGS, where it achieved 70-86% sensitivity and 99.98% specificity. It was applied to a retinoblastoma xenograft sample with 148 coding single nucleotide variations (SNVs) identified in the uncleansed WGS. Of these, 142 SNVs were false positives, as manual review found that the reads harboring the “mutant” alleles matched the mouse genome. Re-running SNV discovery on the cleansed WGS identified only the 6 variants that passed manual review and none of the false positives, demonstrating the power of this method to greatly reduce false findings caused by contaminating mouse reads.
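The core per-read decision can be illustrated with a minimal sketch: compare a read's best alignment score against the human and mouse references and discard reads that fit mouse better. The function names, scores, and the tie-breaking rule below are illustrative assumptions, not XenoCP's actual implementation.

```python
# Minimal sketch of xenograft read cleansing (illustrative, not XenoCP itself):
# a read is kept as human if its best alignment score against the human
# reference is at least as good as its best score against the mouse reference.

def classify_read(human_score, mouse_score):
    """Return 'human' if the read aligns at least as well to human as to mouse,
    otherwise 'mouse' (to be removed from the xenograft data)."""
    return "human" if human_score >= mouse_score else "mouse"

def cleanse(reads):
    """Keep only reads classified as human.

    `reads` is a list of (read_id, human_score, mouse_score) tuples.
    """
    return [rid for rid, h, m in reads if classify_read(h, m) == "human"]

reads = [
    ("r1", 60, 20),   # clearly human
    ("r2", 15, 58),   # contaminating mouse read
    ("r3", 40, 40),   # tie: retained as human (a conservative choice here)
]
print(cleanse(reads))  # ['r1', 'r3']
```

A real pipeline would derive the scores from BAM alignments against both genomes; the tie-breaking toward human is one possible design choice.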
U05 - GRAPE RNAseq Analysis Pipeline
Short Abstract: In this work we present a pipeline schema for the initial processing, analysis and management of RNAseq data, as well as the implementation of this pipeline using a workflow management system. This pipeline starts from the raw sequencing reads, or BAM alignments, a genome file and an annotation of the species from which the reads originate. It will run a set of quality control steps, align the reads if necessary and perform some standard analyses such as estimation of expression levels. The results are stored in a MySQL database that can be accessed using scripts or through a web application (RAISIN) that allows for easy browsing of the results; an example of this web application can be seen at rnaseq.crg.cat. GRAPE may be run locally or it can use a queuing system such as SGE to improve the speed and allow for parallelization of the different steps. New steps can be easily added, allowing it to be extended in order to fit specific needs. The software can be downloaded from our website at big.crg.cat.
U06 - Simultaneous Detection of Structure Variation in Multiple Related or Unrelated Individuals with Paired-End Sequencing
Short Abstract: Structural variation (SV) is known to be a significant source of differences between individual genomes, and can be discovered via paired-end sequencing with combinatorial or statistical methods. However, due to variance in sequencing coverage, fragmentation inaccuracies, and alignment bias, existing methods often fail to uncover SVs in low-coverage regions and have little detection power for small SVs. In this poster, we propose a new SV detection framework called HapSVDetector for the simultaneous detection of SVs across related or unrelated individuals within a family or population. HapSVDetector captures SV signatures from discordant and breakpoint reads, corrects the alignment bias, and distinguishes SV from non-SV chromosomes for individuals within a family or population by solving a mixed integer program. Finally, the inferred SV sizes and heterozygosity are further refined via additional evidence from paired-end breakpoint reads and the distributions of mapping distances, respectively. The results indicate that HapSVDetector has better precision and recall than Pindel and BreakDancer across various coverages and SV sizes. In low-coverage sequencing, HapSVDetector is able to continuously improve detection power as the sample size increases. This implies that the reduction of average sequencing coverage in large-scale re-sequencing projects can be compensated for by simultaneously calling SVs for multiple individuals.
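One of the SV signatures mentioned above, discordant read pairs, can be sketched as follows: pairs whose mapping distance deviates strongly from the library's insert-size distribution suggest an insertion or deletion. The threshold, function name, and example values are assumptions for illustration; HapSVDetector's actual statistical treatment is more involved.

```python
# Illustrative discordant-pair detection: flag read pairs whose mapping
# distance is far from the library's expected insert size. mu and sd are the
# library's insert-size mean and standard deviation (assumed known).

def discordant_pairs(insert_sizes, mu, sd, n_sd=3.0):
    """Return indices of pairs whose insert size deviates from the library
    mean by more than n_sd standard deviations (a classic deletion/insertion
    signature in paired-end data)."""
    return [i for i, x in enumerate(insert_sizes)
            if abs(x - mu) > n_sd * sd]

# Library: mean insert 300 bp, sd 30 bp (invented numbers).
sizes = [310, 285, 302, 700, 295, 120]
print(discordant_pairs(sizes, 300, 30))  # [3, 5]: a stretched and a shrunken pair
```

In practice the flagged pairs would then be clustered by position and combined with breakpoint-read evidence before an SV call is made.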
U07 - Correlation of NGS, microarray and qPCR data in a small RNA study.
Short Abstract: Platforms to study small RNA expression differ in their molecular biological procedures and in their methods of data analysis, which may explain the weak reproducibility of results across techniques.
The aim of the study was to compare next generation sequencing (NGS) of small RNAs with miRNA microarrays and to evaluate options for validating small RNA features selected in NGS analysis. Additionally, there was a need to explain how small RNA isoforms detected by NGS might influence the outcome of experiments.
20 follicular thyroid tumor samples were analyzed with Illumina HiScan NGS and Illumina Bead v2 microarrays. RPM-normalized small RNA sequencing data were divided into low, medium and high expression values. qPCR was applied to verify the sequencing data for 6 miRNAs, each represented by various variants.
Correlations greater than 0.8 were obtained for all 3 platform comparisons (NGS, microarray, qPCR). Our data suggest that in microarray data and qPCR the expression of miR isoforms with the same seed is summed. Based on this assumption, the correlation of NGS and qPCR reached 0.99.
qPCR and microarrays are not as specific as NGS in small RNA isoform detection, but they capture the expression of miR variant families well. Therefore, standard qPCR protocols must be extended so that they can detect miR isoforms playing different roles in cellular regulation processes.
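The summing assumption described above can be made concrete with a small sketch: group isomiR read counts by a shared seed (taken here as nucleotides 2-8) before comparing platforms. The sequences, RPM values, and exact seed definition are invented for illustration.

```python
# Sketch of seed-level aggregation: sum NGS expression (RPM) of isomiRs that
# share a seed region, mimicking how microarrays/qPCR report summed signal.
from collections import defaultdict

def seed(mirna_seq):
    """Return the seed region, nucleotides 2-8 (0-based slice 1:8)."""
    return mirna_seq[1:8]

def sum_by_seed(isoform_rpm):
    """Sum RPM values of isoforms that share a seed."""
    totals = defaultdict(float)
    for seq, rpm in isoform_rpm.items():
        totals[seed(seq)] += rpm
    return dict(totals)

isoforms = {
    "UGAGGUAGUAGGUUGUAUAGUU": 120.0,  # example canonical sequence
    "UGAGGUAGUAGGUUGUAUAG":   30.0,   # 3' trimmed isomiR, same seed
    "UAAAGUGCUUAUAGUGCAGGUA": 55.0,   # different miRNA family
}
print(sum_by_seed(isoforms))  # {'GAGGUAG': 150.0, 'AAAGUGC': 55.0}
```

The aggregated values, rather than per-isoform counts, would then be correlated with the qPCR measurements.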
U09 - Running Next Generation Sequencing on a Heterogeneous Enterprise Cloud: A Case Study for Illumina CASAVA Analysis Using Parabon Frontier
Short Abstract: Advanced sequencing technologies have revolutionized the ability to generate massive amounts of sequence data. However, these technologies bring wide-ranging bioinformatic challenges. The associated computational burden is usually assumed by a dedicated cluster or by moving analysis to external cloud infrastructure. We present a hybrid approach to addressing these challenges using the Parabon® Frontier® Enterprise Computing platform. Using even a modest number of computers, not necessarily dedicated to the task, Frontier software can be used to create an internal "cluster-on-demand" to run applications such as Illumina CASAVA. The approach has the added advantage that a private enterprise cloud can be seamlessly expanded to include external cloud resources provided by Parabon.
The Illumina CASAVA package has multiple functions: alignment, variant calling and genotype calling. We adapted the package to parallelize its computations. Although the solution can scale to thousands of nodes, we used a modest number of Windows CPUs (25 cores, 2 GB RAM) and achieved alignment within a couple of hours. The process required a one-time transmission of the reference sequence; however, this cost is amortized over many runs. Additional comparative analysis will be presented.
There are two key challenges for cloud/grid computation: data movement and adapting the analysis to efficiently utilize large numbers of CPU nodes. Our proof of concept shows that highly complex alignment and variant calling with Illumina CASAVA 1) can be accomplished using desktop computers, 2) is highly extensible, and 3) is easily managed by Frontier.
U10 - Nanoanatomy Museum: Creating a Protein Family ProfileGrid Database
Short Abstract: ProfileGrids allow the efficient visualization of very large protein multiple sequence alignments (MSAs). A ProfileGrid is a matrix color-coded according to the residue frequency occurring at each column position. While databases of protein families exist (such as Pfam), there are few curated repositories of user-generated MSAs, possibly due to the previous lack of paradigms for visualizing large MSAs. Here we present progress toward building a database of protein family ProfileGrids that we call the Nanoanatomy Museum. Our initial dataset was the pre-calculated MSAs of the largest protein families from the Pfam database (ranging up to 160,000+ homologs). The final database will be a proof of principle for how established databases can incorporate ProfileGrids in the standard description of protein families, thus replacing other limited visualizations such as sequence logos.

The new JProfileGrid v2.0 software was used for calculations. New features include algorithm optimization, a command-line interface, and a new PNG image file output format. Software performance has been enhanced due to improved memory handling, calculation optimization, and code parallelization. In addition, the software can now sort and search a long menu list of sequence names. A graphic "overview" mode enables the user to visualize the entire data set. The detailed ProfileGrid window has a new second pane that facilitates simultaneous viewing of different parts of the MSA. Data sampling was introduced to speed up similarity plot calculations. Finally, regular expressions and metadata can be used to filter large MSAs. [Supported by Erasmo Foundation grant TSC13702.]
U11 - NGSpass: system for analysis and management of next-generation genomic sequence data
Short Abstract: With the rapidly falling cost and growing availability of high-throughput sequencing technology, the bottleneck in using genomic analysis effectively in the laboratory is shifting to managing, analyzing, and sharing these massive genomic data. Here, we present a web-based system, called NGSpass, for analyzing and managing NGS genomic sequence data. Our system accepts a FASTQ-formatted sequencing file as input and then executes back-end analysis pipelines previously constructed by users. Users can simply build analysis pipelines by adding or deleting programs and adjusting the parameters of each program. Users can also monitor the running state of each NGS project. In addition, our system contains a module that allows researchers to build Sequence Read Archive (SRA) submission files from their NGS and related data. Final results can be easily downloaded using a web browser. Our system is easy to update and modify because it has been developed using the Google Web Toolkit (GWT), based on Java and JavaScript, with a MySQL database. It has a user-friendly interface and can be installed on any platform, such as Linux, Mac or Windows. We believe the NGSpass system will be very useful for NGS genomic data analysis and processing.
U12 - Computational discovery and analysis of rDNA sequence heterogeneity in yeast
Short Abstract: Ribosomal RNA genes, known as ribosomal DNA or rDNA, are found in tandem arrays of tens or even hundreds of repeating units. The sequences of each unit in an array were once thought to be identical but it is now known that mutations may occur, causing heterogeneity amongst units. Opposing these divergent mutational processes, unit sequences are homogenised through concerted evolutionary processes such as unequal sister chromatid exchange (USCE) and gene conversion (GC).

Using bespoke Perl software, including the TURNIP tool for identifying nucleotide polymorphisms in repetitive genomic regions (http://www.ncyc.co.uk/software/turnip.html), we have uncovered rDNA sequence variation in the yeast Saccharomyces paradoxus, using data derived from the Saccharomyces Genome Resequencing Project. This analysis, in conjunction with a reanalysis of the Saccharomyces cerevisiae dataset, gives us detailed information regarding rDNA sequence heterogeneity in two contrasting, yet closely-related yeast species.

We are further investigating rDNA evolutionary dynamics through the development of a Java computational tool that models the USCE and GC events within an rDNA array. We aim to fit our model to our estimated variation, enabling us to understand the balance of the different mutational processes in the evolution of the yeast rDNA arrays. Ultimately, by exploiting the fast tempo of rDNA evolution, we aim to use our model in the estimation of fine-scale phylogenies of yeast strains.

Here, we describe recent progress with the development of our computational tools and models for sequence heterogeneity discovery and analysis, focussing on new insights gained into rDNA dynamics in two yeast species.
U13 - The effect of ribosome collisions and queuing on gene expression
Short Abstract: The movement of translating ribosomes along a transcript allows collisions and the formation of ribosome traffic jams, leading to perturbed and improper synthesis of the protein product. Earlier reports revealed several mechanisms that evolved to prevent this phenomenon; however, no quantitative measure of ribosome queuing has been defined so far.

We introduced two theoretical measures of mRNA susceptibility to ribosome collisions: dE, the total time wasted by the second translating ribosome due to collisions with the first, and Z, the total number of collisions between two translating ribosomes. Using these measures, we examined the role of ribosome collisions in shaping gene expression levels and analyzed transcript-specific features responsible for collision-free ribosome movement. We discovered that the measures correlate negatively with expression levels over distinct sets of genes, and that the strength of this relationship can reach the level of the correlations with GC content, mRNA secondary structure and codon usage. Additionally, although codons translated with slower velocity tend to cause collisions more often, the final effect depends on their position in the coding sequence. We also confirm that transcripts more resistant to ribosome collisions are able to persist longer in the cell.
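A minimal deterministic two-ribosome model illustrates how quantities like dE and Z could be computed from per-codon dwell times. The exclusion rule used here (one codon per ribosome, identical dwell times for both ribosomes) is a simplifying assumption for illustration, not necessarily the authors' model.

```python
# Toy model: ribosome 1 starts at t=0; ribosome 2 initiates `delta` time units
# later and cannot enter a codon until ribosome 1 has left it. dE accumulates
# the trailing ribosome's waiting time; Z counts codons where a wait occurred.

def collision_measures(dwell, delta):
    """Return (dE, Z) for per-codon dwell times `dwell` and initiation gap
    `delta` between the two ribosomes."""
    dE, Z = 0.0, 0
    exit1 = 0.0    # time ribosome 1 exits the current codon
    ready2 = delta # time ribosome 2 is ready to enter the current codon
    for d in dwell:
        exit1 += d
        wait = max(0.0, exit1 - ready2)  # blocked until ribosome 1 leaves
        if wait > 0:
            dE += wait
            Z += 1
        ready2 = ready2 + wait + d       # wait, then dwell, then move on
    return dE, Z

# A slow codon (dwell 5) causes the trailing ribosome to queue:
print(collision_measures([1, 1, 5, 1], 0.5))   # (4.5, 2)
# A large initiation gap avoids collisions entirely:
print(collision_measures([1, 1, 5, 1], 10.0))  # (0.0, 0)
```

The example reproduces the qualitative claim in the abstract: slowly translated codons cause collisions, and the effect depends on spacing between ribosomes.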

Based on these results we propose that mRNA susceptibility to queuing is another feature that must be taken into account when considering translation productivity. The presented results may have important implications for research on translational productivity and heterologous expression.
U14 - The role of rare codons in protein expression
Short Abstract: The heterologous expression of proteins is central to modern biotechnology and is performed routinely in a huge range of research fields. Despite the ubiquity of the process some details of the translation of genes into proteins are still poorly understood, and many important protein targets have proven difficult or impossible to express in researchers' systems of choice. The mapping between codons and amino acids embedded in the universal genetic code is well established, but different organisms utilise it differently, selecting specific codons with varying frequencies. The mapping is only one of numerous layers of complexity, and the precise nucleotide sequence chosen from amongst the range of synonymous options can influence the functionality of the resulting protein in more subtle ways that are dependent on the cellular environment. Codon selection can directly affect the kinetics of translation, but evidence for its impact on protein efficacy is conflicting and inconclusive, with various studies finding that rare codons can be detrimental, beneficial or have no impact at all.

Here, we present the results of a comprehensive study of rare codon usage in approximately 4000 Escherichia coli genes, using homologous relationships in a database of 3.3 million prokaryotic genes. We have identified regions of conserved rare codon usage in multiple sequence alignments of sets of homologous genes and computed a statistical measure of the significance of these regions. Experimental analyses have been used to assess the accuracy of the computational conclusions.
U15 - Gene-targeted metagenome assembly
Short Abstract: Very large metagenomes tax the abilities of current-generation short-read assemblers. In addition to space and time complexity issues, most assemblers are not designed to correctly treat reads from closely related populations of organisms. Also, general assembly annotation pipelines may not be well tuned for analysis of specific genes that directly code for important environmental functions. We are developing a gene-targeted approach for metagenome assembly. In this approach, information about specific genes is used to guide assembly, and gene annotation occurs concomitantly with assembly. This approach combines a space-efficient modified De Bruijn graph representation of the reads with a protein profile Hidden Markov Model for the gene(s) of interest. To limit the search, we use a heuristic to identify nucleotide kmers that translate to peptides found in a set of representatives of the target protein family. Contigs are assembled in both directions from these starting kmers by applying graph path-finding algorithms on the combined De Bruijn-HMM graph structure. Using this technique we have been able to extract complete nifH protein coding regions from a 50G Iowa prairie metagenome and buk (butyrate kinase) coding regions from a human gut metagenome. Future work will focus on improving search efficiency and separating sequencing artifacts from low-coverage rare populations.
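The seed-finding heuristic described above can be sketched as follows: a nucleotide k-mer becomes a starting point for assembly if its translation matches a peptide drawn from representatives of the target protein family. The codon table below is truncated to what the toy example needs, and the sequences and peptide set are invented; a real implementation would use a full table, both strands, and all reading frames.

```python
# Toy version of the gene-targeted seeding heuristic: scan a read for k-mers
# whose in-frame translation appears in a peptide set built from the target
# protein family (e.g. nifH representatives).
CODON = {"ATG": "M", "GGC": "G", "AAA": "K"}  # deliberately minimal table

def translate(dna):
    """Translate in frame 0; unknown codons become 'X'."""
    return "".join(CODON.get(dna[i:i+3], "X") for i in range(0, len(dna) - 2, 3))

def seed_kmers(read, k, target_peptides):
    """Yield (position, kmer) for k-mers whose translation is in the target set."""
    for i in range(len(read) - k + 1):
        kmer = read[i:i+k]
        if translate(kmer) in target_peptides:
            yield i, kmer

targets = {"MGK"}  # invented peptide 3-mers from family representatives
print(list(seed_kmers("TTATGGGCAAACG", 9, targets)))  # [(2, 'ATGGGCAAA')]
```

Each seed k-mer found this way would then anchor bidirectional path-finding through the combined De Bruijn-HMM graph.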
U16 - SAAP-RRBS: Streamlined Analysis and Annotation Pipeline for Reduced Representation Bisulfite Sequencing
Short Abstract: Reduced representation bisulfite sequencing (RRBS) is an efficient and economical approach for genome-wide methylation pattern profiling. Analyzing RRBS sequencing data is challenging and specialized alignment/mapping programs are needed. Although such programs have been developed, alignment is only part of RRBS data analysis, and a comprehensive solution that provides researchers with good-quality, analyzable data is lacking. To address this need, we have developed a Streamlined Analysis and Annotation Pipeline for RRBS data (SAAP-RRBS) that integrates read quality assessment and clean-up, alignment, methylation data extraction, annotation, reporting, and visualization. With this package, bioinformaticians or investigators can start from sequencing reads and quickly obtain a fully annotated CpG methylation report, freeing more time for biological interpretation. The pipeline is flexible, allowing different aligners to be plugged in and the annotation module to be run independently.
U17 - Bioinformatics Tools and Analysis for RAD Tag Analysis of Incipient Speciation
Short Abstract: A relatively new technique for DNA sequencing prep has emerged known as Restriction site Associated DNA marker tagging, or RAD tagging. RAD tagging is a method that can sequence pooled batches of DNA tags from individuals while still being able to differentiate which individual each sequence came from by attaching DNA barcodes to an end of the DNA strands. This technique thus presented us with data that, while ultimately familiar, required new methods of analysis. Here, we present bioinformatics improvements and a Galaxy analysis pipeline to look for SNPs important in speciation based on Rhagoletis pomonella (apple maggot) Illumina RAD-tagged sequences. Researchers start with quality control scripts and then create a config file to give all the necessary parameters for labeling of sequences, barcode correction, trimming, and division into specified populations – in our case, four populations from two subspecies. The pipeline allows for analysis that is specific to any given breakdown of the data; comparisons can be carried out between any pair of divisions declared by the user, and populations can be looked at individually throughout the process. It can function with or without the use of a reference genome. Through using this pipeline we have identified hundreds of SNPs that may relate to speciation, and have been able to position some of these SNPs along chromosomes without a reference genome.
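The barcode labeling and correction steps can be illustrated with a minimal demultiplexer that assigns a read to an individual when its prefix matches exactly one barcode within a small Hamming distance. The barcodes, sample names, and one-mismatch policy are assumptions for illustration.

```python
# Toy RAD-tag demultiplexer: match a read's leading bases against known
# barcodes, tolerating up to `max_mismatch` errors (barcode correction), and
# trim the barcode off before downstream analysis.

def hamming(a, b):
    """Number of mismatching positions between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def assign_barcode(read, barcodes, max_mismatch=1):
    """Return (individual, trimmed_read) if the read's prefix matches exactly
    one barcode within max_mismatch, else (None, read)."""
    prefix_len = len(next(iter(barcodes)))
    prefix = read[:prefix_len]
    hits = [ind for bc, ind in barcodes.items()
            if hamming(prefix, bc) <= max_mismatch]
    if len(hits) == 1:
        return hits[0], read[prefix_len:]
    return None, read  # ambiguous or unmatched reads are set aside

barcodes = {"ACGT": "fly_1", "TTAG": "fly_2"}  # invented barcodes/samples
print(assign_barcode("ACGAGGGTTC", barcodes))  # ('fly_1', 'GGGTTC')
```

Reads that match no barcode, or more than one, are left unassigned rather than guessed, which keeps the population assignments clean.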
U18 - A probabilistic method for RNA-Seq read error correction
Short Abstract: Sequencing of RNAs with next generation sequencing technologies (RNA-Seq) has revolutionized the field of transcriptomics for genetics and medical research. RNA-Seq experiments are routinely applied to study mRNAs, miRNAs, and other short RNAs in a diverse range of organisms. Error correction of RNA-Seq data is an important research direction to improve data analysis. Specifically, error-sensitive analyses such as de novo transcriptome assembly and detection of RNA editing events may benefit from sequencing error correction. Existing methods are ad-hoc or originally developed for genomic sequencing data.
In this work, we devise the first general method to remove sequencing errors from RNA-Seq reads. Removal of sequencing errors in RNA-seq data is challenging because of the overlapping effects of non-uniform abundance, polymorphisms, and alternative splicing (mRNAs). We present the SEECER algorithm based on a formulation of probabilistic profile Hidden Markov Models that addresses all the above-mentioned challenges. We show that SEECER reduces the amount of sequencing errors, significantly increases the performance of downstream analyses with or without available reference sequence, and vastly outperforms ad-hoc approaches that researchers currently use.
U19 - Transcriptome analysis reveals genes involved in early cone setting
Short Abstract: Tree products are among Sweden's largest exports. Tree selection programs to increase this benefit have started but are progressing slowly due to the long generation time of trees. This is especially true for Norway spruce, which sets cones only after 20 years. We are interested in identifying the genes involved in cone setting and in using them to reduce the generation time of Norway spruce. A naturally occurring mutant of Norway spruce called Acrocona sets cones after just four years. We have collected samples from both Norway spruce and Acrocona during development in order to identify the genes involved in early cone setting.

Since the genome is large, more than 20 gigabases, and littered with repetitive regions, no genome or transcriptome sequence for Norway spruce is yet available. By performing de novo assembly on RNAseq data, we have created a transcriptome covering approximately 80 percent of all transcripts in spruce. By comparing the transcriptomes of Norway spruce and Acrocona, we are generating a library of Acrocona-specific SNPs. We have further analyzed differential transcript expression patterns across the different samples and time points to identify genes involved in cone setting. Among other results, we found one transcription factor known to be important in flower development in other plants. We hypothesize that this transcription factor is one of the key players in the initiation of cone setting.
U20 - Genomic analysis and chromatin profiling of sequences distal to nucleolar organizer regions on the human acrocentric chromosomes
Short Abstract: The human genome includes around 300 copies of a ribosomal DNA (rDNA) repeat, tandemly clustered in nucleolar organizer regions (NORs) on the short arms of the five acrocentric chromosomes. Typically, in human cells not all NORs actively participate in the formation of nucleoli, and even within active NORs not all copies of the rDNA repeat are transcribed. While it is known how individual rDNA repeats within an active NOR can toggle between active and silent states, the selection mechanism operating at the level of whole NORs is not understood. One possibility is that rDNA-adjacent sequences on the acrocentric short arms regulate NOR activity. We established a 400 kb contig distal to the rDNA repeats that is shared among all acrocentric chromosomes but absent from the human genome assembly; we refer to this contig as the distal junction (DJ). To identify functional regions of the DJ potentially involved in regulating NOR activity, we carried out an analysis of its chromatin structure and transcription profile using ChIP-seq, RNA-seq, FAIRE-seq, DNase-seq, and MNase-seq data in the public domain. Interestingly, there is strong evidence of regions of open chromatin consistent across different cell types and of mRNA transcripts originating from the DJ. These results were validated experimentally. Our findings suggest that the DJ is actively transcribed by RNA polymerase II and contains potential NOR-regulating elements. On the path to understanding NOR regulation, the next key question is to determine the distribution of these chromatin features and transcripts between active and inactive NORs.
U21 - Faster and More Accurate Sequence Alignment with SNAP
Short Abstract: As the cost of DNA sequencing continues to drop faster than Moore's Law, there is a growing need for tools that can efficiently analyze larger bodies of sequence data. By mid-2013, sequencing a human genome is expected to cost $1000, at which point this technology enters the realm of routine clinical practice. For example, it is expected that each cancer patient will have their genome and their cancer's genome sequenced. Assembling and interpreting the short read data produced by sequencers in a timely fashion, however, is a significant challenge, with current pipelines taking thousands of CPU-hours per genome.

Here, we address the first and most expensive step of this process: aligning reads to a reference genome. We present the Scalable Nucleotide Alignment Program (SNAP), a new aligner that is 10-100x faster and simultaneously more accurate than existing tools like BWA, Bowtie2 and SOAP2. Unlike recent aligners that use graphical processing units (GPUs), SNAP runs on commodity processors. Furthermore, whereas existing fast aligners limit the number and types of differences from the reference genome they allow per read, SNAP supports a rich error model and can cheaply match reads with more differences. This gives it up to 2x lower error rates than current tools and lets it match classes of mutations, such as longer indels, that these tools miss.

Today, SNAP can align a human genome in 1.5 hours on a 16-core machine, compared to 1.5 days for BWA, while offering higher accuracy. In addition, the algorithm scales well to upcoming long-read technologies.
U22 - Computational detection of A-To-I RNA Editing Sites in Human mRNAs by RNA-Seq data
Short Abstract: RNA editing is a post-transcriptional process occurring in a wide range of organisms including prokaryotes, plants, viruses and animals. A-to-I RNA editing, carried out by members of the ADAR family of enzymes, is the most frequent event in human and occurs in coding as well as non-coding RNAs. Editing changes are essential for cellular homeostasis and have been linked to several diseases such as epilepsy, schizophrenia, amyotrophic lateral sclerosis and cancer. Recent RNA-Seq technology is extremely useful for investigating editing sites in a variety of experimental conditions. Although RNA editing can be detected using RNA-Seq and whole genome sequencing from the same individual, deep characterization of both transcriptome and genome is still an expensive solution. Here we propose a simple computational strategy to predict de novo A-to-I RNA editing sites in human RNA-Seq experiments. After read mapping, we explore all alignments and calculate the empirical probability of observing a substitution. These probabilities are then used to detect statistically significant base conversions by applying Fisher’s exact test to compare the observed and expected occurrences in the aligned reads. The methodology has been tested on the SRA study SRA012427, comprising over 22 million 50 nt paired-end reads from human brain. We found 19 highly significant A-to-I conversions in known human coding regions. Interestingly, 11 of these changes have already been described in the literature and 6 were experimentally confirmed. A further assessment was performed on short reads from human spinal cord, yielding 15 editing candidates, of which 12 were confirmed by exome sequencing.
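The statistical step named above, Fisher's exact test on observed versus expected substitution counts, can be sketched with a small from-scratch two-sided implementation using the hypergeometric distribution. All read counts in the example are invented for illustration.

```python
# Two-sided Fisher's exact test for a 2x2 table [[a, b], [c, d]]: sum the
# hypergeometric probabilities of every table with the same margins that is
# no more likely than the observed one.
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    n = a + b + c + d
    row1, col1 = a + b, a + c
    denom = comb(n, row1)
    lo = max(0, row1 - (n - col1))
    hi = min(row1, col1)
    probs = {x: comb(col1, x) * comb(n - col1, row1 - x) / denom
             for x in range(lo, hi + 1)}
    p_obs = probs[a]
    # Small tolerance guards against float round-off when comparing probabilities.
    return sum(p for p in probs.values() if p <= p_obs * (1 + 1e-9))

# Candidate site: 12 of 40 reads show an A->G mismatch, against a background
# of 50 mismatches in 10050 aligned bases (invented counts).
p = fisher_exact_two_sided(12, 28, 50, 10000)
print(p < 1e-6)  # True: the site's substitution rate far exceeds background
```

In practice one would apply this per candidate site and correct for multiple testing across the transcriptome.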
U23 - SNPAAMapper: an efficient genome-wide SNP variant analysis pipeline for next-generation sequencing data
Short Abstract: This poster is based on Proceedings Submission 158.

Motivation: Next-generation sequencing technologies have provided unprecedented opportunities for biomedical research. Exome sequencing, in particular, has become a popular strategy to identify genetic variants (both common and rare) due to its cost effectiveness. Currently, there are many tools focusing on read alignment and variant calling functions for exome data. However, publicly available tools dealing with the downstream analysis of genome-wide variants are fewer and have limited functionality.
Results: To help resolve these issues, we developed SNPAAMapper, a novel variant analysis pipeline that can effectively classify variants by region (e.g. exon, intron, etc), predict amino acid change type (e.g. synonymous, non-synonymous mutation, etc), and prioritize mutation effects. SNPAAMapper takes VCF format input and classifies variants by region using two algorithms: Algorithm 1 assigns “coding” information or “coding annotation structure” for each exon; Algorithm 2 associates identified variants with genomic locus/gene(s) and classifies the variants by region. Specifically, the UCSC genome database annotation file “knownGene.txt” was preprocessed and a table containing reported “coding” information for each exon was constructed. This processed exon information file was used to report the isoforms and/or gene(s) in which a variant falls and to classify the variant by its genomic location with an efficient search algorithm. The reported genomic locations for variants were used as key information for custom scripts to predict potential mutations and/or amino acid changes. Lastly, SNPAAMapper can prioritize the mutation effect for each variant based on the affected region and protein coding change.
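The region-classification step can be sketched as an interval lookup against one transcript's preprocessed exon table. This is a hypothetical simplification for illustration, not SNPAAMapper's actual code; coordinates, field layout and function names are assumptions:

```python
import bisect

def classify_variant(pos, exons, cds_start, cds_end):
    """Classify a variant position against one transcript.

    exons: sorted list of (start, end) half-open genomic intervals,
    as might be preprocessed from knownGene.txt.
    Returns 'exon-CDS', 'exon-UTR', 'intron', or 'intergenic'."""
    starts = [s for s, _ in exons]
    # binary search for the rightmost exon starting at or before pos
    i = bisect.bisect_right(starts, pos) - 1
    if i >= 0 and pos < exons[i][1]:
        return "exon-CDS" if cds_start <= pos < cds_end else "exon-UTR"
    if exons[0][0] <= pos < exons[-1][1]:
        return "intron"
    return "intergenic"
```

Classified exonic variants would then be passed on to the amino-acid-change prediction step.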

Acknowledgements: A.S. is grateful for a FONDECYT postdoctoral research grant (N° 3110009) and A.W.S. is grateful for a CONICYT postgraduate scholarship. This research was funded by grants from FONDECYT (N° 1110400) and ICM (N° P09-016-F).

U24 - Combination of paired-end short reads with split read alignment for the detection of translocations, inversions and deletions
Short Abstract: The rise of paired-end sequencing technologies has enabled a deeper understanding of structural variations (SVs) such as insertions, deletions, inversions and translocations, and has shown that SVs contribute substantially to various diseases, including tumor growth and development in diverse cancers.

Since detection methods based on the insert size distribution or aberrant mappings of the paired-end reads are limited in resolution and often lead to numerous false positive detections, many recent publications describe split read alignment around anchor reads to infer base-exact SV breakpoint sequences with higher reliability. These approaches mainly focus on medium- to large-sized indels, and the accurate detection of translocations and inversions with paired-end reads often remains problematic.

We propose a split read detection approach for paired-end sequencing data with a focus on translocations, inversions and large deletions. We combine the approximate breakpoint estimates of published insert-size-based methods such as GASV or BreakDancer with a local BWA alignment of previously unmapped or partially mapped reads to validate their SV detections and to estimate the breakpoint sequence. Split reads are clustered and filtered according to prior knowledge of the paired-end reads surrounding the breakpoint. The integration of established methods and the reduction of the search space to previously approximated breakpoint regions enable fast split read detection with relatively short reads of 50 bp or longer. We demonstrate the method and its improvement in accuracy over the insert-size-based approach alone on publicly available paired-end sequencing data.
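The clustering-and-filtering step can be illustrated as grouping candidate breakpoint positions derived from split-read alignments and keeping clusters with sufficient read support. This is a minimal sketch under assumed parameters; the actual method additionally uses the surrounding paired-end evidence:

```python
def cluster_breakpoints(positions, max_gap=10, min_support=3):
    """Cluster candidate breakpoint positions that lie within
    max_gap bases of their neighbor, and keep clusters supported
    by at least min_support split reads.

    Returns (start, end, support) tuples for surviving clusters."""
    clusters, current = [], []
    for p in sorted(positions):
        if current and p - current[-1] > max_gap:
            clusters.append(current)
            current = []
        current.append(p)
    if current:
        clusters.append(current)
    return [(min(c), max(c), len(c))
            for c in clusters if len(c) >= min_support]
```

Isolated split reads (support of 1-2) are discarded as likely alignment noise, while well-supported clusters define the approximate breakpoint interval for validation.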
U25 - HAPCOMPASS: A fast cycle basis algorithm for accurate haplotype assembly of next-generation sequence data
Short Abstract: This poster is based on Proceedings Submission 259

Genome assembly methods produce haplotype-phase-ambiguous assemblies due to limitations in current sequencing technologies. Determining the haplotype phase of an individual is computationally challenging and experimentally expensive. However, haplotype phase information is crucial in many bioinformatics workflows such as genetic association studies and genomic imputation. Current methods of determining haplotype phase from sequence data -- known as haplotype assembly -- have difficulty producing accurate results for large (1000 Genomes-scale) data, or operate on restricted optimizations that are unrealistic for high-throughput sequencing technologies.
We present a novel algorithm, HAPCOMPASS, for haplotype assembly of densely sequenced human genome data. The HAPCOMPASS algorithm operates on a graph where SNPs are nodes and edges are defined by the sequencing reads and viewed as supporting evidence of co-occurring SNP alleles in a haplotype. In our graph model, haplotype phasings correspond to spanning trees and each spanning tree uniquely defines a cycle basis. We define a global optimization on this graph and translate it into local optimization moves of the cycle basis using rules for resolving conflicting evidence. We estimate the amount of sequencing required to produce a complete haplotype assembly of a chromosome. Using metrics borrowed from genome assembly and haplotype phasing, we compare the accuracy of HAPCOMPASS, the Genome Analysis ToolKit, and HapCut on 1000 Genomes and simulated data. We show that HAPCOMPASS performs significantly better for a variety of data and metrics. HAPCOMPASS is available for download at http://www.brown.edu/Research/Istrail_Lab/
U26 - FusionQ: A Novel Approach for Detection and Quantification of Gene Fusion from Paired-end RNA-Seq
Short Abstract: This poster is based on Proceedings Submission “FusionQ: A Novel Approach for Detection and Quantification of Gene Fusion from paired-end RNA-Seq”. Chromosome abnormality is an important lesion in cancers. Recent biological studies have discovered many gene fusions related to specific cancers, which have aided diagnosis and targeted therapies in clinical trials. Fusion detection and description are therefore very important in cancer genetic studies. Next generation sequencing technologies give researchers a more informative view of cancers. One of these techniques, whole transcriptome sequencing (RNA-Seq), has shown its power in gene fusion detection. In this paper, we developed a novel gene fusion detection tool, FusionQ, based on paired-end RNA-Seq data. FusionQ can detect gene fusions, construct the full-length chimeric transcripts and estimate their expression levels, providing a more complete view of gene fusions. To determine the exact positions of the fusion partners from the short reads, we propose a novel approach, “residual sequence extension”, which extends the short reads and helps to find their unique mapping positions. In addition, detected fusions often contain many false positives; hence, we calculate fusion scores to report fewer fusions with higher confidence. Furthermore, we incorporate the expression estimation framework of RSEM and use an EM algorithm with sparsity constraints to estimate chimeric transcript expression. We applied FusionQ to prostate and breast cancer cell lines and detected known fusion genes as well as novel ones. In some cases, FusionQ significantly outperforms other tools. FusionQ is available at http://www.methodisthealth.com/Software.
U27 - A Fast Reads Alignment Algorithm
Short Abstract: Alignment of sequenced reads to a reference genome is a common task in biology. Many algorithms and software packages have been developed to perform this task. Traditionally, alignment algorithms have been based on prefix/suffix trees or hash tables.
Here we propose a new algorithm for read alignment, which does not use a standard prefix/suffix-tree or hashing approach. The algorithm allows trade-offs between accuracy, speed and memory footprint.
We present a prototype software implementation of the algorithm, designed for human-scale genomes. This software can align over a million reads per CPU minute. Our experiments comparing our software to existing, popular, fast alignment packages (such as Bowtie and BWA) suggest that it is comparable in accuracy to these packages and 5-10 times faster in many scenarios.
U28 - Data Intensive Academic Grid (DIAG): A Computational Cloud Infrastructure Designed for Bioinformatics Analysis
Short Abstract: We have deployed the NSF-funded Data Intensive Academic Grid (DIAG) (http://www.diagcomputing.org), a shared computational cloud designed to meet the analytical needs of the bioinformatics community. DIAG includes a computational infrastructure, a high-performance storage network, and optimized data sets generated by mining public sequence repositories. DIAG includes 1500 cores for high-throughput computational analysis and 160 cores, connected via a low latency network, for high-performance computing. Complementing this computational capacity, over 400 Terabytes (TB) of shared high-performance parallel storage and 400 TB of local storage are available.
DIAG’s cloud infrastructure is built using Nimbus (http://www.nimbusproject.org), an open source framework, which transforms a traditional computational cluster into a cloud. The DIAG cloud can be accessed using the popular Amazon EC2 API. The shared storage can be accessed through an S3 compatible interface.
The bioinformatics community can access DIAG as a PaaS using Ergatis, a web based pipeline creation and management tool, or as an IaaS using bioinformatics oriented virtual machines (VMs) such as CloVR (http://clovr.org), or other custom EC2 compatible Linux VMs. DIAG is also accessible as a traditional computational grid for interactive shell and batch processing, or as an Open Science Grid (OSG) (http://www.opensciencegrid.org) compute element.
DIAG currently supports a number of bioinformatics pipelines and tools such as the IGS Annotation Engine (http://ae.igs.umaryland.edu/), Virome (http://virome.diagcomputing.org), CloVR, Galaxy (http://galaxy.psu.edu/), ISGA (http://isga.cgb.indiana.edu), BioLinux (http://nebc.nerc.ac.uk/tools/bio-linux/bio-linux-6.0), Trinity (http://trinityrnaseq.sourceforge.net/), and Maker (http://www.yandell-lab.org/software/maker.html).
DIAG has over 100 registered users who conduct large-scale genomics, transcriptomics, and metagenomics analyses. DIAG is a free resource available to the academic community.
U29 - Optimizing sequence yield & interpreting data quality for RNA-Seq
Short Abstract: RNA-Seq analysis has rapidly become the primary method for quantifying gene expression, identifying alternative splicing events, and detecting gene fusions. There is, however, a lack of studies designed to address pre-analytic design and post-analytic quality control. We introduce a statistical model that estimates, prior to analysis, the sequence coverage needed to detect differentially expressed genes. We also comprehensively compared and optimized three open source packages for quality control.
Without proper pre-analytics, investigators can under-sequence samples, limiting their ability to quantify gene expression and alternative splicing events. Likewise, without careful study design, investigators are equally susceptible to over-sequencing and wasting resources. By understanding the heterogeneity of read distribution along gene length, irrespective of tissue type, it is possible to estimate the sequence yield that would sufficiently meet the needs of the scientific experiment. Taking into account sequencing errors and subject-to-subject variation, our model predicts the percentage of differentially expressed genes falsely detected for a given combination of targeted depth of coverage and number of subjects. Thus, one can choose the percentage of expressed genes to detect by altering sequence depth or sample size.
Post-analytic processing of RNA-Seq data is also a complicated, multistep process. A single improper operation at any stage can result in biased or unusable data. To understand how useful the data are, it is crucial to generate, interpret, and understand post-sequencing quality control measures. We have integrated best practices from FastQC, RNA-SeQC, and RSeQC, along with in-house quality control measures, to check the quality of RNA-Seq data before pursuing any further analysis.
U30 - Improving RNA-Seq precision with MapAl
Short Abstract: With currently available RNA-Seq pipelines, expression estimates for most genes are very noisy. We here introduce MapAl, a tool for fast and straightforward expression profiling by RNA-Seq that builds on existing tools. In the post-processing of RNA-Seq reads, MapAl incorporates gene models already at the stage of read alignment, consistently increasing the number of reliably measured known transcripts by 50%. Adding genes identified de novo then allows reliable assessment of double the total number of transcripts compared to other available pipelines.
This substantial improvement is of general relevance: Measurement precision determines the power of any analysis to reliably identify significant signals, such as in screens for differential expression, independent of whether the experimental design incorporates replicates or not.
MapAl supports both users and further development by allowing a free choice of alternative steps at different stages of the process. In particular, a wide range of read mappers supporting the standard SAM format can be employed. With the new release we have also improved the handling of exon junctions, especially for reads spanning multiple splice junctions. Such reads are particularly powerful for discriminating specific spliceforms and, with the read lengths of modern platforms ever increasing, are becoming ever more common.
U31 - Genome-Wide Identification of Allele-specific Methylation
Short Abstract: Among the most well-known functions of DNA methylation is mediating imprinted gene expression by passing an epigenomic state across generations. Imprinting has been tied to the evolution of the placenta in mammals, and errors in imprinting have been associated with human diseases. We have designed a novel statistical model to describe allele-specific methylation (ASM) based on data from high-throughput short read bisulfite sequencing (BS-seq). We validated our method using semi-simulated data in which methylation states were simulated within actual reads from BS-seq experiments. Our results indicate that technical characteristics of existing public methylomes (i.e. read length and coverage) are sufficient to accurately identify AMRs. By applying our model to 22 human methylomes, emphasising those from uncultured cells, we identified a set of candidate AMRs involved in imprinted gene regulation. Candidates consistently identified across methylomes display remarkable concordance with known imprinted genes and allow boundaries of known AMRs to be precisely defined. Many candidates not associated with known imprinted genes mark the promoters of long non-coding RNAs (lncRNAs) and are also supported by similar analyses at orthologous regions in chimp; these provide a starting point for identifying additional imprinted genes, novel ICRs and possibly novel imprinted clusters. Our model, therefore, is an essential analytical complement to recently emerged experimental methods for understanding the role of DNA methylation in genomic imprinting.
U32 - PACE: a web based tool to extract CDR3 sequence from next generation T cell receptor sequence data.
Short Abstract: Parallel Algorithm for CDR3 Extraction (PACE) is a web-based parallelized algorithm to obtain complementarity determining region 3 (CDR3) subsequences from next generation sequencing data from a population of T cell receptors (TCRs). The T cell repertoire shapes an organism’s adaptive immune response and is an important regulator of self-antigen recognition. CDR3 distribution analysis is a crucial step in studying the mechanisms that govern the dynamics of the T cell repertoire. This requires a high-throughput and thorough survey of millions of TCR sequences. There is a complete lack of publicly available tools to perform such an analysis, and manual analysis is cumbersome and highly inefficient. One way to address this issue is to devise a parallel algorithm to extract CDR3 sequences from millions of reads. To accomplish this, PACE was developed to obtain CDR3 subsequences from next generation sequencing data of TCRs from a population of isolated T cells. The algorithm uses BLAST as a sequence comparison module and distributes high volumes of sequence data into manageable partitions. These partitions are then queued for extraction at multiple nodes on the network via web services. The algorithm framework is designed to be scalable: additional compute nodes can be added to increase performance. Our results indicate PACE is considerably faster than traditional approaches and can analyze up to two million sequences within 8 minutes in a single run on nine nodes. Our approach will allow researchers to analyze the T cell repertoire efficiently and accurately.
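The partitioning step described above can be sketched as splitting the read set into near-equal chunks for dispatch to worker nodes. This is illustrative only; PACE's actual distribution runs over web services with BLAST on each node, and the function name is an assumption:

```python
def partition(seqs, n_parts):
    """Split a list of sequences into n_parts roughly equal,
    order-preserving chunks for distribution to compute nodes."""
    k, r = divmod(len(seqs), n_parts)
    out, i = [], 0
    for j in range(n_parts):
        size = k + (1 if j < r else 0)  # spread the remainder
        out.append(seqs[i:i + size])
        i += size
    return out
```

Each chunk would then be submitted as one extraction job, so the wall-clock time scales down roughly linearly with the number of nodes.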
U33 - TavernaPBS: High Performance, Next-Generation Sequence Analysis Workflows with Taverna
Short Abstract: Next-generation sequencing projects require large data sets to be processed with analytical pipelines composed of many UNIX-based tools. Complex, multi-step pipelines implemented with procedural programming languages can be difficult for student programmers to construct, maintain, revise, and/or extend. We have developed TavernaPBS, an extension for the GUI-based Taverna workflow system, to facilitate the use of moderately complex bioinformatic pipelines while allowing efficient batch-parallel workflow execution via high-performance computing resources. Using Taverna’s visual programming environment, users create arbitrarily interconnected, nested pipelines of individual UNIX commands with specified inputs/outputs. TavernaPBS ensures that UNIX command lines are correctly submitted to an underlying PBS queue, while maintaining the job interdependencies encoded by the workflow. Additional features include an audit trail with real-time visual monitoring of workflow status, and the elimination of redundant computations by detecting which work units were completed in prior workflow runs. We demonstrate three complete examples of complicated, nested workflows: 1) Exome Variant Analysis (EVA) – combines disparate tools (samtools, Picard, GATK, SeattleSeq, etc.) according to best practices to discover, annotate and genotype SNP/indel variants in human exome sequences; 2) Reference-guided RNAseq Analysis (RRA) – executes Tuxedo-suite software (TopHat, Cufflinks, etc.) according to published protocols; 3) de novo RNAseq Analysis (DRA) – uses Trinity, RSEM, and edgeR as an alternative RNAseq analysis protocol. TavernaPBS provides an effective environment for crafting reproducible workflows for high-performance computing clusters that are easily shared within and between lab groups.
U34 - RseqFlow: Workflows for RNA-Seq data analysis
Short Abstract: We have developed an RNA-Seq analysis workflow for single-ended Illumina reads, named RseqFlow. This workflow integrates more analytical functions than previous tools while remaining flexible and easy to use. The major modules include: mapping sequencing reads to genome and transcriptome references, performing quality control (QC) of sequencing data, generating files for visualizing signal tracks, calculating gene expression levels, identifying differentially expressed genes, calling coding SNPs, and producing MRF and BAM files. The modules can be extended with more functions, or replaced by other methods, as desired. We provide RseqFlow with two different mappers, PerM and Bowtie.
The workflow is formalized and managed by Pegasus Workflow Management System, which supports large-scale workflows on diverse environments in a scalable and reliable manner. Pegasus WMS maps the analysis modules onto available computational resources, varying from a laptop, desktop, high-performance clusters, to national Grids or computational Clouds, and executes the steps in the appropriate order. RseqFlow is also available as a Virtual Machine (VM), which eliminates any complex configuration and installation steps. Users can run it with just one command, instead of inputting commands and arguments for each module in a step-by-step manner.
RseqFlow is currently available for the analysis of data from both Human (hg19 and Gencode) and Rhesus Macaque (rheMac2 and RefSeq), and versions for additional species will be forthcoming. Further work is being done to improve the performance and usability of the workflows, as well as to add alternative modules. More information is available at http://genomics.isi.edu/rnaseq.
U35 - SNP@splicesite: genome-wide splice-site SNP database
Short Abstract: Deep sequencing has shown that over 90% of human genes undergo alternative splicing. The splicing process requires exon-intron boundary recognition. SNPs located in these boundaries (splice sites) influence exon configuration. Splice-site SNPs (ssSNPs) can also alter translation efficiency of the mRNA and lead to important changes in disease susceptibility. We developed the SNP@splicesite database to provide splice-site SNP information for human and mouse genes. It includes: 1) information on splice-site SNPs located within human and mouse genes; 2) effects of splice-site SNPs: junction strength change, protein domain change, and alternative splicing events (exon skipping, 5’- or 3’-exon extension); 3) splice site conservation in eukaryotes; and 4) associated disease information derived from OMIM and GAD. Our database contains 1,576 human splice-site SNPs associated with 1,193 genes and 538 mouse splice-site SNPs associated with 281 genes. Users can query SNP@splicesite with several types of search terms (gene symbol, SNP rs number, transcript ID, or genomic position), and the information can be accessed at http://variome.kobic.re.kr/ssSNPTarget/ or http://ssSNPTarget.org/.
U36 - Nebula – a web-server for advanced ChIP-Seq data analysis
Short Abstract: We present a web service, Nebula, with which biologists can analyze their ChIP-seq data. ChIP-Seq is chromatin immunoprecipitation followed by sequencing of the extracted DNA fragments. This technique allows accurate characterization of the binding sites of transcription factors and other DNA-associated proteins.
Many existing tools for ChIP-Seq data analysis are difficult for inexperienced users, as they are often command line applications or R packages.
Our web service, Nebula, was designed both for biologists and bioinformaticians. It is based on the Galaxy open source framework. Galaxy already includes a large number of functionalities for mapping reads and peak calling. We added the following to Galaxy: (1) peak calling with FindPeaks and a module for immunoprecipitation quality control, (2) de novo motif discovery with ChIPmunk, (3) calculation of the density and the cumulative distribution of peak locations around gene TSSs, (4) annotation of peaks with genomic features, and (5) annotation of genes with peak information. Nebula generates the graphs and the enrichment statistics at each step of the process. During steps 3 to 5, Nebula optionally repeats the analysis on a control dataset and compares these results with those from the main dataset. Nebula can also incorporate gene expression (or gene modulation) data during these steps. In summary, Nebula is an innovative web service that provides an advanced ChIP-seq analysis pipeline, the output of which is directly publishable.
U37 - Comparing assemblage of metagenomics and metatranscriptomics data obtained from Medium-Length Sequencing Technologies
Short Abstract: While metagenomics provides an inventory of community gene content and can reveal genetic diversity of microbial species recovered from their native habitat, metatranscriptomics aims to identify active functional genes by providing information about the expressed genetic patterns in the ecosystem. Thus, pairing metatranscriptomics and metagenomics analysis could provide an exceptional opportunity to explore molecular evolution of microbial species, and the structure and function of microbial communities. In both metagenomic and metatranscriptomic studies, sequence assembly is a fundamental step that determines whether an experiment can succeed.

The focus of this study is to compare the levels of assembly and the sizes of gene families resulting from complex marine metagenomic and metatranscriptomic data associated with uranium mining. The data were obtained from random whole-community DNA and mRNA using a medium-length sequencing technology, i.e. Ion Semiconductor Sequencing™.

The results are compiled using a number of state-of-the-art assemblers. The performance of the assemblers is compared based on the size, accuracy and integrity of their contigs and scaffolds, computational time, and maximum random access memory occupancy.
U38 - Analysis of Gene Expression Profiles During Determinate Primary Root Growth in Cardón, Pachycereus pringlei (Cactaceae)
Short Abstract: Unlike the roots of most plant species, the primary root of cardón, Pachycereus pringlei, a Sonoran Desert Cactaceae, exhibits determinate growth. The root apical meristem of the seedling primary root becomes exhausted and all cells in the root tip differentiate. Determinate growth of the primary and most lateral roots results in the formation of a compact root system that gives seedlings an advantage for survival in a desert environment. In order to identify and characterize genes involved in root meristem maintenance and determinate root growth in P. pringlei, we employed mRNA-seq on the Illumina GA II. The 85 nt reads were assembled de novo into about 26,000 contigs using the CLC Genomics Workbench. The largest contig, of >15 kb, represented the longest plant transcript, from the BIG gene. The transcriptome contigs were annotated using similarity searches against GenBank RefSeq proteins. Differential gene expression was estimated in the primary root tip. Over 400 and almost 900 transcripts were up-regulated more than 5-fold during the initial and terminal phases of root growth, respectively. Sixteen putative transcription regulators were up-regulated during the initial phase. Significant conservation between P. pringlei and Arabidopsis was revealed for the amino acid sequences and RNA expression patterns of various genes. We also detected differences in the expression profiles of some PIN auxin efflux carriers between P. pringlei and Arabidopsis mutants with determinate primary root growth. Cytokinin-synthesis-related genes are expressed during the terminal phase of P. pringlei root development, suggesting that the root tip remains functionally active after meristem exhaustion.
U39 - Reference guided assembly of heterogeneous DNA sequencing data sets to improve the quality of the P. falciparum DD2 genome
Short Abstract: More and more laboratories throughout the world are producing genomic data, often using different protocols and technologies to sequence genomes. With the emergence of several next generation sequencing technologies, the diversity of read specifications used by these labs is increasing. It is not difficult to predict the need for new approaches for combining multiple types of sequencing data in post-sequencing analysis.
Here we propose an adapted method to assemble a mixture of reads and DNA fragments based on a reference genome. We also describe a method to find the locations of new fragments across draft contigs and patch them into their appropriate positions. We used the method to improve the quality of P. falciparum strain DD2 contigs based on the finished genome of P. falciparum strain 3D7.
We followed a workflow consisting of trimming, mapping, partitioning, de novo assembly, combining contigs, validating and scaffolding.
We had access to a set of raw data, including single- and paired-end reads of DD2, Sanger contigs of DD2 from the Broad Institute (BI) of MIT and Harvard, and a reference genome (3D7).
Our results show that the contigs generated by our assembly of reads mapped to the 3D7 chromosomes are longer on average than the BI contigs. However, BI contigs not mapped to the 3D7 genome are longer than the contigs generated by our assembly of unmapped short reads.
U40 - A Statistical Framework for Computationally Dissecting Heterogeneous Tissue Samples Based on mRNA-Seq data
Short Abstract: Motivation: Next Generation Sequencing (NGS) offers a unique opportunity to delineate the genome-wide architecture of regulatory variation. The promising biomedical applications of NGS have spurred the development of new statistical methods to capitalize on the wealth of information contained in mRNA-Seq data sets. However, for heterogeneous tissues, measurements of gene expression through mRNA-Seq data can be confounded by the presence of multiple cell types present in each sample.
Results: We present a novel approach for deconvolution of heterogeneous tissues based on mRNA-Seq data. First, we hypothesized that expression levels from heterogeneous cell populations in mRNA-Seq are mixed as the weighted average of expression from the different constituent cell types. We then studied the feasibility and validity of a globally optimized non-negative decomposition algorithm through quadratic programming using mRNA-Seq data. We validated our computational approach in in-silico simulation experiments on various benchmark data sets. Agreement between our predicted cell proportions and the actual proportions was excellent. We further built tissue-specific signatures using data from the Human Body Map 2.0 project, and were able to accurately deconvolute complex, multi-tissue in-silico mixtures from independent experiments. Our analysis demonstrates the utility of a carefully designed statistical framework to discern heterogeneous composition in mRNA-Seq data. Our study describes tailored analytical methods and provides a rigorous, quantitative, and high-resolution tool that can be used to interrogate mRNA-Seq data.
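The weighted-average hypothesis and its constrained decomposition can be sketched as minimizing ||S w - m||^2 over the probability simplex, here via projected gradient descent. This is a toy stand-in for the quadratic-programming solver described above; signature matrix, data and step size are all hypothetical:

```python
def project_simplex(v):
    """Euclidean projection onto {w : w_j >= 0, sum_j w_j = 1}
    (standard sort-based algorithm)."""
    u = sorted(v, reverse=True)
    css, theta = 0.0, 0.0
    for i, ui in enumerate(u, start=1):
        css += ui
        if ui - (css - 1.0) / i > 0:
            theta = (css - 1.0) / i
    return [max(x - theta, 0.0) for x in v]

def deconvolve(signatures, mixture, iters=2000, lr=0.001):
    """Estimate cell-type proportions w for a mixed sample.

    signatures: per-gene lists of reference expression, one value
    per cell type; mixture: per-gene expression of the mixed sample.
    Minimizes the squared residual subject to the simplex constraint."""
    k = len(signatures[0])
    w = [1.0 / k] * k  # start from uniform proportions
    for _ in range(iters):
        # residual of the weighted-average model: S w - m
        resid = [sum(s[j] * w[j] for j in range(k)) - mv
                 for s, mv in zip(signatures, mixture)]
        # gradient of the squared error: 2 * S^T resid
        grad = [2 * sum(signatures[i][j] * resid[i]
                        for i in range(len(signatures)))
                for j in range(k)]
        w = project_simplex([w[j] - lr * grad[j] for j in range(k)])
    return w
```

For a two-cell-type mixture with distinct marker genes, the recovered weights converge to the true mixing proportions, mirroring the in-silico validation described above.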
U41 - Resequencing a bacterial genome using high-throughput next-gen sequencing as an aid in RNA-Seq transcriptomics analysis: a case study using Bacillus subtilis str. DSM10
Short Abstract: RNA-Seq gene expression analysis using high throughput short reads from next-gen sequencing depends upon, among other factors, having a well-annotated genome upon which such reads can be reliably mapped. If the strain of interest does not have a reference genome, then the mapping must be completed using a fully assembled genomic sequence from a similar strain. Given this typical constraint, several questions arise: (1) At what point does it become worthwhile to perform additional experimental work to obtain the reference genome to improve our mapping? (2) How accurate and reliable is the de novo assembly when using the publicly available assembly tools? (3) What does resequencing add to RNA-Seq analysis in terms of accurate mapping (possibly to additional, newly identified genes)? (4) Does the additional information, if any, provide sufficient value to justify the additional experimental resources and staff time that was required? Here we present a case study of de novo assembly of Bacillus subtilis str. DSM10 using SOLiD sequencing. We compare the assembled DSM10 genomic sequence at the called gene level with the public B. subtilis str. 168 genome in the SEED environment of the U.S. Department of Energy’s Systems Biology Knowledgebase, where we can also compare the organisms at the pathway (SEED subsystem) level using the SEED comparative genomics utilities. We present the SOLiD sequencing, the assembly methods used (Velvet and ABySS), the contig ordering, the genome-to-genome comparison in the SEED environment, and, finally, RNA-Seq mapping results against the two genomes.
U42 - Electus: a lightweight tool for extracting targeted sets of reads from large NGS data sets
Short Abstract: Second Generation Sequencing enables the collection of large quantities of sequence data.
While such data sets are incredibly rich, researchers are often interested in answering specific questions about the data. Typically, a researcher with hundreds of gigabytes of RNA-Seq data may be interested in SNPs and relative expression for only a small number of specific genes. In such a scenario, it is unnecessarily time-consuming to process the entire data set against a whole reference set of genes, only to discard most of the analysis products.

Furthermore, to answer some questions detailed alignments may not be necessary. For instance, to determine relative gene expression, simply extracting and counting the relevant subset of reads may be sufficient. While other questions may require assembly or alignment to be performed, applying these algorithms to a subset of the total collection of reads can yield substantial computational savings, provided that the subset can be extracted efficiently.

We have developed a tool, Electus, which allows the user to quickly and sensitively extract a relevant selection of reads. Using our previously published representation for k-mer sets, our tool uses a k-mer decomposition of the reference sequence(s) of interest to yield crude but effective read alignments.

We have tested this approach in the analysis of a number of prostate cancer RNA-Seq data sets, and show that for analyses targeted to the genes involved in known fusions, Electus enables us to efficiently produce files containing just those reads that map to the nominated genes, allowing the FPKM measure of gene expression to be computed quickly.
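The core idea, retaining only the reads that share k-mers with the genes of interest, can be sketched in a few lines. This is a simplification of the published k-mer set representation; the function names and thresholds are illustrative.

```python
def kmers(seq, k):
    """All distinct k-mers of a sequence."""
    return {seq[i:i+k] for i in range(len(seq) - k + 1)}

def extract_reads(reference_seqs, reads, k=21, min_shared=2):
    """Keep reads sharing at least `min_shared` k-mers with any reference.

    A crude, alignment-free filter: reads that share enough k-mers with
    the genes of interest are retained for downstream alignment,
    assembly, or expression counting; everything else is discarded.
    """
    ref_kmers = set()
    for ref in reference_seqs:
        ref_kmers |= kmers(ref, k)
    kept = []
    for read in reads:
        shared = sum(1 for km in kmers(read, k) if km in ref_kmers)
        if shared >= min_shared:
            kept.append(read)
    return kept
```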
U43 - De novo detection of copy number variation
Short Abstract: Comparing genomes of individual organisms using next generation sequencing (NGS) data is, until now, mostly performed using a reference genome. This is challenging when the reference is distant and introduces bias towards the exact sequence present in the reference. Recent improvements in both NGS read length and efficiency of assembly algorithms have brought direct comparison of individual genomes by de novo assembly, rather than via a reference genome, within reach.

Here, we develop and test a Poisson mixture model (PMM) for copy number estimation of contigs assembled from NGS data. We combine this with co-assembly to allow de novo detection of copy number variation between two individual genomes, without mapping reads to a reference genome. In co-assembly, multiple sequencing samples are combined, generating a single contig graph with different traversal counts for the nodes and edges between the samples. In the resulting “colored” contig graph the contigs have integer copy numbers; this negates the need to segment genomic regions based on depth of coverage, as required for mapping-based detection methods. The PMM is then used to assign integer copy numbers to contigs, after which copy number variation probabilities are inferred.

The copy number estimator and copy number variation detector perform well on simulated data. Application of the algorithms to hybrid yeast genomes showed allotriploid content of different origins in the wine yeast Y12, and extensive copy number variation in the aneuploid brewing yeast genomes. Integer copy number variation was also accurately detected in a short-term laboratory-evolved yeast strain.
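The copy-number assignment step can be illustrated by a maximum-likelihood sketch. This simplifies the Poisson mixture model of the abstract: here the haploid coverage rate (lambda) is assumed known rather than fitted, and each contig is assigned the integer copy number maximizing the Poisson likelihood of its coverage counts.

```python
import math

def copy_number(contig_counts, haploid_mean, max_copy=8):
    """Assign the integer copy number maximizing the Poisson likelihood.

    contig_counts : per-position (or per-k-mer) coverage counts of a contig.
    haploid_mean  : expected coverage of a single-copy region (lambda).
    A contig at copy number c is modeled as having Poisson(c * lambda)
    coverage at each position.
    """
    def loglik(c):
        lam = c * haploid_mean
        # sum of Poisson log-pmf terms: x*log(lam) - lam - log(x!)
        return sum(x * math.log(lam) - lam - math.lgamma(x + 1)
                   for x in contig_counts)
    return max(range(1, max_copy + 1), key=loglik)
```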
U44 - All the 1+3n one-mismatch sequences of n-mer DNA are involved in 22.2+0.00879n strings of Perfect Linear Code words on DNA
Short Abstract: Background:
With the advent of high-throughput DNA sequencers, the growing demand to map short DNA sequences to a genome has promoted the development of faster algorithms and tools for genome mapping. Nowadays, the fastest algorithms are based on suffix arrays and the Burrows-Wheeler transform (BWT). They are best suited to finding the genome position of an exactly matching DNA short read, but they are not good at finding genome positions with some mismatches: their computational time grows exponentially with the number of mismatches. As the amount of DNA short reads generated by sequencers keeps increasing, faster algorithms to map short reads with mismatches are eagerly required.

To overcome this shortcoming of the algorithms, we propose to encode DNA sequences onto a perfect linear code (PLC) over the Galois extension field GF(4). Using perfect linear codes as an alphabet for BWT, the solution space for finding genome positions with some mismatches decreases drastically.
Regarding the four types of nucleotides as elements of GF(4), we use the 4-ary (5,3) PLC and encode 5-mer DNA subsequences onto the PLC. With the PLC’s error-correcting ability, each word of the PLC represents 16 5-mer DNA subsequences: one representative subsequence and all the 1-mismatch subsequences derived from it. The PLC has 64 words, and each of the 1024 5-mer DNA subsequences belongs to exactly one word of the PLC. We show that 22.2 + 0.00879n strings of PLC words on DNA include an n-mer DNA sequence and its 3n one-mismatch sequences.
U45 - Structured Protein Sequences Are More Uniform Than Random in terms of the Distributions of Hydrophobic and Hydrophilic Amino Acids.
Short Abstract: Amino acid sequences of structured proteins have evolved under biological and physical pressures, which make natural proteins different from random sequences. There have been extensive studies on the local correlations of amino acids in the sequences of structured proteins, such as secondary structures, but whether such correlations exist in the global distributions of amino acids has not been well understood. To clarify the peculiarity of structured protein sequences, we compared the occurrence of each type of amino acid pair occurring in local sequence proximity, up to 10-residue separations, in SCOP representative domains with the occurrence expected from a random distribution of the amino acids in each domain. As a result, pairs between two hydrophobic residues and those between two hydrophilic residues were significantly depleted, whereas pairs between a hydrophobic residue and a hydrophilic one were enriched in local proximity. In other words, the hydrophobic and hydrophilic residues were distributed more uniformly than expected from a random distribution, suggesting that there is a selection pressure to control the balance between hydrophobicity and charge across the entire sequence. This trend was not observed in the protein sequences of eukaryotic genomes, in which the occurrence of identical amino acids in local proximity is prominent due to the existence of intrinsically disordered regions. Therefore, the uniform distributions of hydrophobic and hydrophilic amino acids are a characteristic feature of the amino acid sequences of structured proteins.
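The observed-versus-expected pair comparison can be sketched as follows. This is a simplification of the analysis: the `group` mapping is illustrative, and the expectation here is the simple product of group frequencies (a shuffled-sequence null), not the exact per-domain randomization of the study.

```python
from collections import Counter

def pair_enrichment(seq, group, max_sep=10):
    """Observed/expected ratios for residue-group pairs within max_sep.

    group maps each amino acid to 'H' (hydrophobic) or 'P' (hydrophilic).
    Expected counts assume residues are placed at random, so pair-type
    fractions follow the product of group frequencies. Ratios < 1 mean
    depletion, > 1 mean enrichment.
    """
    labels = [group[a] for a in seq]
    n = len(labels)
    obs = Counter()
    total = 0
    for i in range(n):
        for j in range(i + 1, min(n, i + max_sep + 1)):  # separation <= max_sep
            obs[tuple(sorted((labels[i], labels[j])))] += 1
            total += 1
    fH = labels.count('H') / n
    fP = 1.0 - fH
    expected = {('H', 'H'): fH * fH * total,
                ('H', 'P'): 2 * fH * fP * total,
                ('P', 'P'): fP * fP * total}
    return {pair: obs[pair] / exp for pair, exp in expected.items()}
```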
U46 - Jalview 2.8: Including new JABAWS 2 Alignment, Conservation and Disorder prediction webservices and support for RNA secondary structure
Short Abstract: Jalview is an open source platform for multiple sequence alignment, editing and analysis. The latest version of this standalone and web-based visualization tool is Jalview v2.8, which includes new web services for protein disorder and alignment conservation provided by JABAWS 2, support for DAS 1.6/2.0 via JDAS, and visualization of RNA secondary structure and base-pair conservation.
U47 - GRASS: a generic algorithm for scaffolding next-generation sequencing assemblies
Short Abstract: The millions of short reads usually involved in high-throughput sequencing (HTS) are first assembled into longer fragments called contigs, which are then scaffolded, i.e. ordered and oriented using additional information, to produce even longer sequences called scaffolds. Most existing scaffolders are not suited for using information other than paired reads to perform scaffolding. They use this limited information to construct scaffolds, often preferring scaffold length to accuracy, when faced with the tradeoff.

We present GRASS (GeneRic ASsembly Scaffolder) - a novel algorithm for scaffolding second-generation sequencing assemblies capable of using diverse information sources. GRASS offers a mixed-integer programming formulation of the contig scaffolding problem, which combines contig order, distance and orientation in a single optimization objective. The resulting optimization problem is solved using an Expectation-Maximization procedure and an unconstrained binary quadratic programming approximation of the original problem. Although the current implementation of GRASS uses paired read information and/or related genomes, the algorithm is not limited to any particular set of information sources.

We compared GRASS to a number of state-of-the-art scaffolders (SSPACE, MIP and OPERA) using Illumina paired reads of three bacterial genomes. GRASS constructs the most accurate scaffolds on all datasets, while keeping the number of scaffolds low. In additional experiments we used two related genomes to help scaffold an E. coli assembly. GRASS achieved the best results using this additional data, as it allowed conflicting scaffolding information to be ignored. The tradeoff between accuracy and contiguity displayed by GRASS puts it in a unique niche compared to existing scaffolders.
U48 - An n-gram based probabilistic method for de novo sequence assembly
Short Abstract: For next-generation sequencers, reconstructing an organism’s genome from millions of reads is a computationally expensive task. Our algorithm approaches this problem by organizing and indexing the reads using n-grams: short DNA subsequences of fixed length n. These n-grams are used to efficiently locate putative read joins, thereby eliminating the need to perform an exhaustive search over all possible read pairs. Specifically, a probabilistic, iterative approach determines the most likely reads to join, based on a new metric that models the probability of any two arbitrary reads being joined together. Our model utilizes the quality scores that are output with each read to improve the likelihood of making a correct join. We also incorporate a rigorous error-correcting pre-processing step to limit the dictionary to only high-quality reads. Results on simulated 454 genomes of lengths 100 kb and 1 Mb are reported, as well as results using short-read experimental data for Brucella suis and Staphylococcus aureus.
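The n-gram indexing idea, seeding candidate joins instead of comparing all read pairs, can be sketched as below. This is illustrative only: the probabilistic join metric and the quality-score model of the abstract are omitted, and the names are our own.

```python
from collections import defaultdict

def build_index(reads, n=8):
    """Map every n-gram to the (read id, offset) pairs containing it."""
    index = defaultdict(list)
    for rid, read in enumerate(reads):
        for pos in range(len(read) - n + 1):
            index[read[pos:pos+n]].append((rid, pos))
    return index

def candidate_joins(reads, n=8, min_overlap=12):
    """Find read pairs whose suffix n-gram seeds a verified prefix overlap."""
    index = build_index(reads, n)
    joins = set()
    for rid, read in enumerate(reads):
        seed = read[-n:]                       # suffix n-gram as anchor
        for other, pos in index.get(seed, []):
            if other == rid:
                continue
            overlap = pos + n                  # implied overlap length
            # verify: the end of `read` must equal the start of the other read
            if overlap >= min_overlap and read.endswith(reads[other][:overlap]):
                joins.add((rid, other, overlap))
    return joins
```

Only reads sharing an exact n-gram are ever compared, so the quadratic all-pairs step is replaced by hash lookups plus cheap verification.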
U49 - Novel Short Peptide Assembler (SPA) for Metagenomic Sequence Analysis
Short Abstract: A fundamental computational problem in metagenomic analysis of microbial communities is the de novo assembly of nucleotide reads to infer genome sequences of the constituent microbes. This problem has proved to be challenging, with results dependent on several factors including sequencing depth, number and abundance distribution of the species, and polymorphisms within strains of the same species. A consequence of poor nucleotide assemblies is that a vast majority of the gene sequences identified from the assemblies are fragmentary and a large fraction of the reads are unassembled.
We investigate metagenomic assembly with respect to reconstructing protein sequences directly from metagenomic data generated using NGS technologies. Our framework is based on a novel Short Peptide Assembler (SPA) that assembles complete protein sequences from their constituent peptide fragments identified on short reads. The SPA algorithm is based on informed traversals of a de Bruijn graph, defined on an amino acid alphabet, to identify probable paths that correspond to proteins.
SPA can effectively handle tens to hundreds of millions of Illumina reads. It performs very well on isolate genomes (with specificity and sensitivity in the high 90s and a chimera rate below 0.01%). On simulated metagenomic data, it far outperforms the conventional approach of identifying genes on nucleotide sequence assemblies. For instance, on a simulated oral metagenomic dataset, our approach has 92% specificity, 86% sensitivity, and a 0.19% chimera rate, while the conventional approach has 96% specificity, 65% sensitivity, and a 0.02% chimera rate. Furthermore, 86% of reads were assembled by SPA, while the conventional approach assembled only 32% of reads.
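The amino-acid de Bruijn traversal at the heart of SPA can be illustrated on an unambiguous toy case. This is a sketch, not the SPA algorithm itself, which must score and select among branching paths in real data.

```python
from collections import defaultdict

def assemble_peptide(fragments, k=4):
    """Rebuild a protein from overlapping peptide fragments via a de Bruijn graph.

    Nodes are (k-1)-mers over the amino acid alphabet; each k-mer in a
    fragment adds an edge prefix -> suffix. For an unambiguous graph,
    walking the single unbranched path spells out the assembled protein.
    """
    edges = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set()
    for frag in fragments:
        for i in range(len(frag) - k + 1):
            kmer = frag[i:i+k]
            u, v = kmer[:-1], kmer[1:]
            if v not in edges[u]:        # ignore duplicate edges from overlaps
                edges[u].append(v)
                indeg[v] += 1
            nodes.update((u, v))
    # start at the unique node with no incoming edge
    start = next(node for node in nodes if indeg[node] == 0)
    path, node = start, start
    while edges[node]:
        node = edges[node][0]            # unambiguous: follow the only edge
        path += node[-1]
    return path
```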
U50 - Comparison of Tera-BLAST, a massively accelerated implementation of the BLAST algorithm, to other BLAST implementations.
Short Abstract: A number of technologies have emerged for accelerating similarity search algorithms in bioinformatics, including FPGAs, GPUs, and standard CPU clusters. We present Tera-BLAST, an FPGA-accelerated implementation of the BLAST algorithm, and compare its performance to GPU-accelerated BLAST and the industry-standard NCBI BLAST on high-performance computers. Tera-BLAST, running on the TimeLogic J1-series FPGA similarity search engine, performs tens of times faster than BLAST running on generic Tesla M2090 GPU cards and hundreds of times faster than standard multi-core CPUs.
U51 - A Non-parametric Statistical Test for Differential Peak Detection in ChIP-Seq Data
Short Abstract: ChIP-seq has rapidly become the dominant experimental technique in functional genomic and epigenomic studies. Statistical analysis of ChIP-seq data, however, remains challenging, due to the highly structured nature of the data and the paucity of replicates. Current approaches to detect regions that differ significantly between two samples are largely borrowed from RNA-seq data analysis and focus on total counts of fragments mapped to a peak, thus ignoring information encoded in the shape of the peaks. We demonstrate empirically on real data that higher-order features of ChIP-seq peaks carry important and often complementary information to total counts, and hence are potentially important in assessing differential binding. We then propose to incorporate higher-order information in testing for differential binding by adapting recently proposed kernel-based statistical tests to ChIP-seq data.
To test the hypothesis that shape factors can be important for differential peak detection, we use a ChIP-seq data set investigating the epigenetic mark H3K4me3. The experiment consists of ChIP-seq measurements from a wild-type mouse embryonic stem cell line (WT) and a mutant line lacking the protein Cfp1, a component of the enzymatic complex responsible for the methylation of H3K4. The results demonstrate that our approach can lead to very different and complementary predictions compared with standard tests based on total counts. In addition, an independent analysis of a repeated experiment with a different H3K4me3 antibody led to consistent results, demonstrating the improved robustness of our method and corroborating the appropriateness of our approach.
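A kernel two-sample statistic of the kind adapted here can be sketched with the unbiased maximum mean discrepancy (MMD) between two sets of peak-shape vectors. This is an assumption about the specific test being adapted, shown purely for illustration; in practice significance would come from a permutation test on this statistic.

```python
import numpy as np

def mmd2(X, Y, gamma=1.0):
    """Unbiased MMD^2 between two samples of peak-shape vectors.

    X, Y : (n, d) and (m, d) arrays; each row is a peak profile (e.g. read
    counts binned across the peak, normalized so that total count is
    removed and only shape remains). Uses k(x, y) = exp(-gamma ||x-y||^2).
    """
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    Kxx, Kyy, Kxy = k(X, X), k(Y, Y), k(X, Y)
    n, m = len(X), len(Y)
    # drop diagonal terms for the unbiased estimate
    return ((Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
            + (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
            - 2 * Kxy.mean())
```

Two sets of peaks with the same total counts but different shapes yield a large MMD^2, which is exactly the signal a count-only test discards.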
U52 - Lessons learned from RNA-Seq Read Alignment Evaluations
Short Abstract: Numerous recent biological studies base their findings on massive amounts of high-throughput sequencing (HTS) data. Independent of the ultimate goal of the investigation, the alignment of spliced fragments to a reference sequence plays a central role in many established analysis pipelines and is one of the most critical steps, posing a wide spectrum of computationally and statistically challenging problems. Hence, the quality and significance of biological conclusions ultimately depend on considered and careful handling of the alignment problem.

Based on datasets made available through the RNA-seq Genome Annotation Assessment Project (RGASP, organised by the Wellcome Trust Sanger Institute), we experienced a striking influence of the choice of the alignment method on many downstream analyses. Within a comprehensive study, we evaluated a broad variety of different methods, including BLAT, GEM, PALMapper, SIBsim4, TopHat, STAR, ERANGE, and GSnap.
The great diversity in the behavior of the different alignment strategies, with surprisingly small agreement between subsets of methods, led us to develop various filtering and tuning strategies. This post-processing step can drastically increase the precision of transcript prediction and transcript quantification as well as other downstream analyses. We developed a versatile toolbox, called RNA-geeq, that allows a clear comparison of different alignment methods and helps to improve alignment quality in many ways. Numerous filters can be optimized with respect to a given annotation or any other GFF file. Moreover, we implemented a multiple-mapper resolution strategy, which chooses for each read its most likely alignment.

All tools are available within the Galaxy-instance at http://oqtans.org. For further details visit http://bioweb.me/rna-geeq.
U53 - Sequence Signature Geometry and Alignment-Free Sequence Comparison
Short Abstract: A new algorithm for rapid local alignment-free sequence comparison is proposed. We show that under $dt$ and its important variant $dts$ as the similarity measure, local comparison between a pair of sequences can be formulated as the problem of finding the maximum bichromatic dot product between two sets of points in high dimensions. We introduce a geometric framework that transfers this to the problem of finding the bichromatic closest pair (BCP), for which an approximation algorithm is presented. While a brute-force search strategy can solve the problem in quadratic time, the average running time of our approximation algorithm is sub-quadratic in the length of the sequences, with bounded error. Results of our implementation further demonstrate the robustness and efficiency of our algorithm in finding local similarity even in cases where the sequences are not alignable. Our algorithm can therefore extend the current usage of alignment-free methods and can also be regarded as a substitute for local alignment algorithms in many biological studies.
U54 - Sequencing data analysis starting from images can substantially improve quality of the results
Short Abstract: With the advent of the newer generation of sequencers such as Illumina's HiSeq, primary data analysis has been unavailable to the user, since many configurable options in HCS have been abstracted away in order to present the user with minimal choices. For HiSeq, unlike GAII, primary data analysis starting from images has generally not been thought to be possible, for reasons that include storage issues. Unfortunately, there are certain situations where this can be detrimental to good results, e.g. when sequencing libraries with low initial sequence diversity (in our experience only ~7% of reads are usable in such cases when obtained with Illumina's recommended protocol of 50% PhiX spike-in).
These libraries arise from experiments where DNA has a non-Illumina prefixed barcode (from restriction enzyme digestion, or otherwise), leading to low initial sequence diversity.
To circumvent this problem, we perform off-line primary analysis from the raw image data (TIFF files), recreating cluster coordinates from images of cycles with adequate diversity.
We demonstrate off-line analysis for single-ended runs using a commercially available disk array (Drobo-S) configured for 12 TB of disk space (using 5x3 TB HDDs, 1 redundant, connected via eSATA). Using the Drobo, we store and transfer images from the RTA computer to the primary analysis computer running a modified version of Illumina's GOAT pipeline. Finally, we discuss the improved results, i.e. substantially more data, as well as future directions, which include more accurate determination of cluster coordinates and hence overall better results from sequencing runs.
U55 - Customized Short Read Assembly: Graph Theoretic Integration of Domain-Specific Information
Short Abstract: Current sequencing technologies are producing massive amounts of short reads on an unprecedented scale. Several computational algorithms have been developed to assemble these reads into a representation of the original target sequence. Assembling short reads continues to be a challenge due to multiple factors including genome repeats, sequence gaps, and other sequencing errors. In addition, the expanding array of sequencing applications makes it unlikely that generic assemblers will produce accurate results in all problem domains. To address this issue, we introduce a data-centric assembler that uses graph enrichment to capture domain-specific dataset characteristics for developing a customized approach to the assembly problem. The proposed assembler follows the overlap-layout-consensus paradigm for assembly. Our approach departs from previous methods by enriching the overlap graph with domain-specific information associated with the genomes being assembled. This knowledge is used to make customized adjustments on parameters of the overlap graph during assembly. We present the results of applying our customized assembly algorithm on multiple read datasets that differ in read length, coverage, and composition. We compare the results obtained from the proposed assembler with those obtained from other popular assemblers for various genomes to demonstrate the impact of graph enrichment on the quality of the assembled output. This study demonstrates that the integration of domain-specific information into the assembly process makes a positive impact on assembly results. It also suggests that incorporating data enrichment approaches into computational models in a variety of bioinformatics applications has the potential of overcoming the shortcomings associated with generic methods.
U56 - Tools For Fast and Flexible Short Read Quality Assessment and Improvement
Short Abstract: Next generation sequences are the input for a wide and growing array of biological assays. It is widely recognized that removing poor quality reads or parts of reads, and sequences not arising from the biological sample (e.g. Illumina adapter sequences) can dramatically improve speed and reduce memory requirements for downstream computation, while improving the accuracy and validity of the ultimate biological inferences. We have developed open source tools to carry out the assessment of sequencing quality and read improvement. On the R/Bioconductor platform, qrqc (quick read quality control) generates per-cycle quality and nucleotide summaries, and hashes reads to provide a list of the most frequent, which can often lead to sample contaminant detection. In addition, qrqc can calculate the Kullback-Leibler divergence on a per-cycle basis for k-mers of arbitrary size, which may allow detection of unexpected contaminating sequences that are not at a fixed position within reads. Sabre, Scythe, and Sickle are written in C and allow fast demultiplexing of reads with "in-line" barcoding, fuzzy quality-aware 3'-adapter contamination detection and removal, and windowed 5'- and 3'-end trimming of low quality bases, respectively. (Scythe is described in another poster at this conference.)
As next-generation sequencing is a rapidly changing field, we have designed these tools consistent with the Unix philosophy that the best toolkit consists of a set of high-performance tools with limited scope that can be applied flexibly as pipelines evolve. qrqc can be downloaded from bioconductor.org, and Sabre, Sickle, and Scythe are available on GitHub (https://github.com/ucdavis-bioinformatics).
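The per-cycle divergence idea in qrqc can be sketched as follows, here with k=1 (per-base composition) for simplicity; this is an illustration of the statistic, not qrqc's code.

```python
import math
from collections import Counter

def per_cycle_kl(reads, k=1):
    """KL divergence of each cycle's k-mer distribution from the pooled one.

    A cycle (read position) whose k-mer composition diverges sharply from
    the overall distribution can flag contaminating sequence that is not
    fixed at the read ends, which adapter trimmers would miss.
    """
    length = min(len(r) for r in reads)
    pooled = Counter()
    by_cycle = [Counter() for _ in range(length - k + 1)]
    for read in reads:
        for i in range(length - k + 1):
            km = read[i:i+k]
            pooled[km] += 1
            by_cycle[i][km] += 1
    total = sum(pooled.values())
    kl = []
    for counter in by_cycle:
        n = sum(counter.values())
        kl.append(sum((c / n) * math.log((c / n) / (pooled[km] / total))
                      for km, c in counter.items()))
    return kl
```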
U57 - Have we got enough sequencing reads from a metagenomic sample?
Short Abstract: Metagenomics is a research field studying uncultured organisms to understand the diversity, functions, cooperation and evolution of microbes in a given habitat such as soil or water. A large volume of shotgun sequencing reads from metagenomic samples has been generated, and continues to be generated at unprecedented speed, thanks to the development of next-generation sequencing technology. These reads can then be assembled into longer, contiguous sequences for further analysis. In a typical metagenomic project, as the sequencer is still generating reads day and night, we often need to ask: is the coverage of the reads we have already obtained high enough to make assembly feasible? Or how much more sequencing effort is required to reach a decent coverage for assembly and further analysis? Traditionally, sequencing coverage is estimated by mapping reads to a reference or assembly. For metagenomic data sets this is infeasible, since we do not have reference sequences for most species in the sample. Here we present a novel approach that uses k-mer counting to estimate the coverage of a metagenomic read data set without a reference. Using a memory-efficient CountMin Sketch data structure to count k-mers, we find that the distribution of the median k-mer abundance of reads in a metagenomic data set reflects the sequencing coverage well. We test this method on several artificial data sets and real metagenomic data sets from soil samples. This also leads to a novel approach to estimating the diversity of a metagenomic sample.
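A minimal CountMin sketch and the median k-mer abundance of a read can be written as below. The hash construction and parameters are illustrative choices, not those of the authors' implementation.

```python
import hashlib

class CountMin:
    """A tiny CountMin sketch for approximate k-mer counting.

    Counts are stored in `depth` hash tables of size `width`; a query
    returns the minimum over rows, which can only overestimate the truth.
    """
    def __init__(self, width=1 << 16, depth=4):
        self.width, self.depth = width, depth
        self.tables = [[0] * width for _ in range(depth)]

    def _hashes(self, item):
        for seed in range(self.depth):
            h = hashlib.blake2b(item.encode(), digest_size=8,
                                salt=str(seed).encode()).digest()
            yield int.from_bytes(h, "big") % self.width

    def add(self, item):
        for row, h in enumerate(self._hashes(item)):
            self.tables[row][h] += 1

    def count(self, item):
        return min(self.tables[row][h]
                   for row, h in enumerate(self._hashes(item)))

def median_abundance(read, sketch, k=21):
    """Median sketch count over a read's k-mers, a proxy for its coverage."""
    counts = sorted(sketch.count(read[i:i+k])
                    for i in range(len(read) - k + 1))
    return counts[len(counts) // 2]
```

The median is robust to the occasional overestimated k-mer count, which is why it tracks true coverage well even with a compact sketch.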
U58 - Scythe – a Bayesian-based 3’-end adaptor contamination trimming tool
Short Abstract: 3’-end adapter contamination is a common problem on many modern sequencing platforms. Partial adapter sequences can adversely affect many downstream analyses, including assembly and mapping of reads. Scythe is a read-improvement tool that rescues contaminated Illumina reads by removing 3’-end contaminants. Increasing Illumina read lengths make it more useful, because after trimming there is still significant usable sequence left. Trimming adapter sequence has been a difficult problem because the 3’-end also has the highest proportion of mis-called bases. These poor-quality regions make accurate adapter identification and removal particularly challenging. Several adapter filtering or trimming tools are available. However, they all depend on finding alignments with a fixed, maximum number of mismatches. These approaches may not work well for low-quality stretches of bases that can contain many mismatches. Scythe tackles the issue by using Bayesian methods, considering both individual base qualities and prior contamination rates to decide whether a given match is a contaminant or background sequence. Benchmarks have been performed using datasets with controlled error rates and prior contamination rates. Scythe outperforms other adapter removal software tools in terms of true positive and false positive rates.
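The Bayesian classification idea, quality-aware likelihoods under a contaminant model and a background model combined with a prior contamination rate, can be sketched as follows. This is a simplified stand-in, not Scythe's actual model.

```python
import math

def contamination_posterior(read_tail, adapter, quals, prior=0.05):
    """Posterior probability that a read's 3' tail is adapter sequence.

    Under the contaminant model, each base matches the adapter unless it
    was miscalled (error prob 10^(-Q/10) from its Phred score Q); under
    the background model a base matches by chance with probability 1/4.
    """
    ll_cont = ll_back = 0.0
    for base, ref, q in zip(read_tail, adapter, quals):
        err = 10 ** (-q / 10)
        if base == ref:
            ll_cont += math.log(1 - err)
            ll_back += math.log(0.25)
        else:
            ll_cont += math.log(err / 3)      # miscall to one specific base
            ll_back += math.log(0.75 / 3)
    num = math.exp(ll_cont) * prior
    den = num + math.exp(ll_back) * (1 - prior)
    return num / den
```

Because low-quality bases have large error probabilities, mismatches there barely penalize the contaminant model, which is exactly what fixed-mismatch aligners get wrong.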
U59 - Identification of differentially methylated Cytosines and regions by mSuite, DNA methylation analysis tools in a suite
Short Abstract: DNA methylation is a major form of epigenetic modification, and its genome-wide profiling at base-pair resolution is critical to understanding its biological role and its correlation with other epigenetic or genetic mechanisms. High-throughput NGS combined with DNA bisulfite conversion has recently gained popularity as a measurement of the methylation status of cytosines.

However, the data analysis is challenging from two major perspectives. First, the challenge comes from the lack of methods for quantification of the methylation ratio, and for the identification of differentially methylated cytosines (DMCs) and differentially methylated regions (DMRs) between different biological samples. Here we propose an exact numerical method to calculate a confidence interval (CI) for a single methylation ratio, a CI for the difference of two methylation ratios, the p-value for testing equality of two methylation ratios, and the adjusted methylation ratio difference. The adjusted methylation ratio difference is shown to be a good quantity for ranking millions of cytosines to find the most statistically and biologically significant DMCs. Using this exact numerical method, we also successfully applied a Hidden Markov Model to find DMRs. Second, the challenge comes from the lack of mature data analysis pipelines. The fast evolution of high-throughput sequencing technologies has generated huge amounts of data on a whole-genome scale and at single-base-pair resolution. To analyze such amounts of methylation data quickly, from the single-cytosine level to the regional level, the open source software package “mSuite” has been developed and applied to a public dataset to find allele-specific DMRs in mouse embryonic development.
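One exact construction of a CI for a single methylation ratio is the Clopper-Pearson interval, obtained by numerically inverting the binomial tail. The sketch below illustrates the flavor of such an exact numerical method; it is not necessarily the authors' construction.

```python
import math

def binom_tail(n, x, p):
    """P(X >= x) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(x, n + 1))

def clopper_pearson(x, n, alpha=0.05):
    """Exact (Clopper-Pearson) CI for a methylation ratio x/n.

    The lower bound solves P(X >= x; p) = alpha/2 and the upper bound
    solves P(X <= x; p) = alpha/2; both are found by bisection, since the
    tails are monotone in p.
    """
    def bisect(f, lo=0.0, hi=1.0):
        for _ in range(60):
            mid = (lo + hi) / 2
            if f(mid):
                hi = mid
            else:
                lo = mid
        return (lo + hi) / 2
    lower = 0.0 if x == 0 else bisect(
        lambda p: binom_tail(n, x, p) > alpha / 2)
    upper = 1.0 if x == n else bisect(
        lambda p: 1 - binom_tail(n, x + 1, p) < alpha / 2)
    return lower, upper
```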
U60 - Screening Short-Read DNA Sequences Against a Protein Database Using a Hybrid-Core Implementation of Smith-Waterman
Short Abstract: This presentation describes the use of SWSearch, a sequence search and alignment program, to screen short-read DNA sequences against a protein database. SWSearch is optimized for Convey’s hybrid-core (HC) computing architecture, which combines a traditional x86 environment with a reconfigurable coprocessor, to dramatically reduce the time to perform Smith-Waterman alignments, resulting in much faster performance than BLASTx.

With SWSearch, nucleotide or protein sequences can be compared to a database using linear or tabular scoring and affine gap penalties. Optionally, nucleotide sequences may be translated to proteins for comparison against a protein database. One use of translated search is to screen short-read sequences against a small protein database (e.g. known toxins or patented proteins) to quickly identify those reads coming from a gene associated with the proteins of interest for local assembly. This can be particularly useful for gene sequences which have multiple copies of highly conserved regions in a genome that would otherwise be difficult to assemble with whole-genome assembly, or when the goal is simply to identify one microbe out of a metagenome for further study.

Results will be presented comparing the performance of Convey’s SWSearch with NCBI’s BLASTx for a selection of data sets. Search times for Illumina reads against a database of protein sequences can be up to 7x faster with SWSearch on an HC server compared with BLASTx on a commodity x86 system. Furthermore, BLAST uses a heuristic filter and may miss relevant hits compared to a full Smith-Waterman approach.
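The Smith-Waterman recurrence that the coprocessor accelerates can be stated compactly; the sketch below uses linear gap costs for brevity (SWSearch itself also supports affine gap penalties) and returns only the best local score.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment score by the Smith-Waterman recurrence (linear gaps).

    H[i][j] = max(0, H[i-1][j-1] + s(a_i, b_j),
                     H[i-1][j] + gap, H[i][j-1] + gap);
    the best local alignment score is the maximum cell in the matrix,
    and the zero floor lets an alignment restart anywhere.
    """
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = match if a[i-1] == b[j-1] else mismatch
            H[i][j] = max(0, H[i-1][j-1] + sub,
                          H[i-1][j] + gap, H[i][j-1] + gap)
            best = max(best, H[i][j])
    return best
```

Every cell depends only on three neighbors, which is what makes the algorithm map so well onto systolic FPGA pipelines, unlike BLAST's heuristic seeding.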
U61 - A high dimensional NGS study on non-small-cell lung cancer (NSCLC) adenocarcinoma
Short Abstract: Lung cancer is the leading cause of cancer-related mortality in the world. In an effort to elucidate the detailed genetic and epigenetic causes of lung cancer development, we performed a high-dimensional next-generation sequencing (NGS) study on a matched cell line pair derived from a never-smoker female patient with early-stage NSCLC adenocarcinoma. For the two cell lines (H2347 and BL2347, an NSCLC adenocarcinoma cell line and a lymphoblastoid cell line, respectively, from the same patient), we have generated the most complete set of multi-platform OMICS data: whole-genome sequencing, RNA-Seq, small RNA-Seq, MeDIP-Seq, histone modifications (H3K4me3, H3K27me3, H3K36me3), and proteomics data. Our extensive NGS data reveal not only the mutational spectrum of lung cancer but also the detailed mechanisms of gene dysregulation as manifested in the transcriptome, proteome, and epigenome data. Furthermore, we have generated exome sequencing, RNA-Seq, small RNA-Seq, and MeDIP-Seq data for six patients with similar characteristics using fresh surgical samples. Each type of data is being analyzed with respect to recurrence and causal association with NSCLC. These data should elucidate clinically important mutations and regulatory elements in NSCLC.
U62 - BRANCH: boosting RNA-Seq assemblies with genome contigs
Short Abstract: Motivation: Transcriptome assemblies of RNA-Seq data from Next Generation Sequence (NGS) technologies play an important role in many genomics applications of unsequenced organisms. Often it is very difficult to obtain high quality assemblies representing the full-length mRNA sequences expressed in a genome. Rapid improvements in NGS technologies make it now feasible to improve this situation by incorporating genomic information in the assembly process with reasonable time and resource investments.
Results: This study introduces BRANCH, an algorithm specifically designed for this situation. Its inputs include separately assembled RNA reads (transfrags) and genomic DNA reads (contigs) as well as the RNA-Seq reads themselves. It uses a modified version of BLAT to align the transfrags and RNA reads to the genome contigs. It then joins and extends incomplete transfrags by applying an algorithm that finds the Minimum weight Minimum Path Cover with given Paths (MMPP). In our performance tests on real data, BRANCH could improve the sensitivity and precision of de novo transfrags generated by Velvet/Oases by 6.0-9.7% and 5.0-6.3%, respectively. These improvements added 62-392 completely assembled transcripts to the initial set of 3,248 members.
Availability: The BRANCH software can be downloaded for free from this site: http://manuals.bioinformatics.ucr.edu/home/branch.
U63 - A Benchmark of Sequence Comparison Methods
Short Abstract: In the past few years there have been significant advances in sequence comparison methods (i.e., homology detection). We present benchmark results, using SCOP as a gold standard, comparing all major sequence, profile, and profile-profile methods, including HHBLITS, HMMER3, HMMER2, SAM, BLAST, PSI-BLAST, CSBLAST, PHMMER, PRC, and HHSEARCH. We also compare iterative methods such as SAM-T2K, SAM-T99, JackHMMER, and HHblits. We compare prediction accuracy, score accuracy, and computational speed, and examine the difference between domain and full-length sequence prediction. Furthermore, we consider the contribution of the individual components of these methods where applicable; e.g., model building and model scoring are assessed independently.
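Benchmarks of this kind typically rank all query-target hits by score or E-value and count true positives recovered before some number of false positives. A minimal sketch of that bookkeeping, with hypothetical SCOP-style domain identifiers (the function and variable names below are our own illustration, not the benchmark's code), might look like:

```python
# Sketch of scoring a homology-detection method against a SCOP-style
# gold standard: hits arrive ranked best-first, and a hit counts as a
# true positive when query and target share a SCOP superfamily.
def tps_before_first_n_fps(ranked_hits, is_homolog, n_fps=1):
    """Count true positives seen before the n-th false positive."""
    tps = fps = 0
    for query, target in ranked_hits:
        if is_homolog(query, target):
            tps += 1
        else:
            fps += 1
            if fps >= n_fps:
                break
    return tps

# Hypothetical mini gold standard: domain id -> SCOP superfamily.
superfamily = {"d1": "a.1.1", "d2": "a.1.1", "d3": "b.2.1"}
same_sf = lambda q, t: superfamily[q] == superfamily[t]
hits = [("d1", "d2"), ("d1", "d3"), ("d2", "d3")]
print(tps_before_first_n_fps(hits, same_sf))  # → 1
```

Sweeping `n_fps` over a range of thresholds yields the ROC-style curves on which such comparisons are usually reported.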

This work can be considered an update of our previous HMM benchmark paper in NAR a decade ago, which has had a significant impact on the field, both on developers and on users of sequence comparison methods. We expect that our results will provide a reference for the community of users and inform developers of the areas still in need of further development.
U64 - Tools for working with NGS data in the cloud: Hadoop-BAM and Seqpig
Short Abstract: We have developed tools and libraries for processing data from next-generation sequencing experiments with the Hadoop distributed computing framework. Hadoop-BAM is a library that allows scalable manipulation of aligned NGS data via Hadoop MapReduce [1, 2]. It acts as an integration layer between analysis applications and BAM files by presenting a convenient API for implementing map and reduce functions that can operate directly on BAM records. Hadoop-BAM builds on top of the Picard SAM JDK, so tools that rely on the Picard API are easily convertible to support large-scale distributed processing.

To facilitate adoption of Hadoop-BAM among end users, we have developed a command-line interface that makes functionality such as distributed sorting and merging easily accessible without Java or Hadoop expertise. We have also recently released bindings for the popular Apache Pig Latin scripting language under the SeqPig project on SourceForge [3]. Access to NGS data via Pig opens up exciting possibilities for much more scalable analysis pipelines with a shorter development cycle than traditional distributed solutions.


1. Niemenmaa et al. Bioinformatics 28(6):876 (2012)
2. http://hadoop-bam.sourceforge.net/
3. https://sourceforge.net/projects/seqpig/
U65 - Differential oestrogen receptor binding is associated with clinical outcome in breast cancer
Short Abstract:
U66 - Alignment free high throughput DNA reads filtering with GPGPU computing
Short Abstract: Background
The output of a high-throughput next-generation sequencing machine is a collection of short reads, which have to be properly assembled in order to reconstruct the original DNA sequence of the analyzed organism.
The DNA sequence assembly process is based on aligning and merging these reads to effectively reconstruct the real primary structure of the DNA sample sequence or reference genome.
We implemented an alignment-free method for read preprocessing and filtering using general-purpose GPU computing, which allows fast, parallel selection of promising read pairs.
The aim of the filter is to identify promising read pairs in order to reduce the amount of input data given to the actual assembly algorithm. Alignment-free read filtering accelerates the assembly process without a substantial loss in accuracy of the reconstructed DNA sample sequence.
Preliminary experiments on real and simulated data demonstrated the efficacy of this approach for filtering promising read pairs, which are eligible candidates for successfully assembling the entire genome of a given organism.
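As a rough illustration of the alignment-free idea (the k-mer size, threshold, and function names below are our assumptions for the sketch, not the poster's parameters), a read pair can be flagged as promising when the two reads share enough k-mers, a test that is trivially data-parallel across pairs:

```python
# Alignment-free pre-filter sketch: a read pair is "promising" when the
# reads share at least `min_shared` k-mers. A GPU version would evaluate
# many pairs in parallel; k=4 and min_shared=2 are illustrative choices.
def kmers(seq, k=4):
    """Set of all overlapping k-mers in a read."""
    return {seq[i:i+k] for i in range(len(seq) - k + 1)}

def promising(read1, read2, k=4, min_shared=2):
    return len(kmers(read1, k) & kmers(read2, k)) >= min_shared

print(promising("ACGTACGTAC", "GTACGTTTTT"))  # overlapping reads → True
print(promising("ACGTACGTAC", "TTTTTTTTTT"))  # unrelated reads → False
```

Only pairs passing such a test would be handed to the downstream assembler, which is where the reported speed-up comes from.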
U67 - A Hidden Markov Model Integrating Genotype and Copy Number Information from Exome Sequencing Data
Short Abstract: In routine diagnostics, targeted sequencing is preferred over whole-genome approaches due to its lower cost. Various articles in recent months have shown that, in addition to the detection of structural aberrations, exome sequencing data can also be used for copy number (CN) estimation. Several methods are based on hidden Markov models (HMMs) using the underlying CNs as hidden states. Curiously, none of these incorporates data from genotype (GT) calling, although this could provide valuable additional information.

The observed data sequence along genomic coordinates consists of two components: read counts inside exome regions and allele frequencies at known SNP locations. To integrate these two, we developed an HMM with four hidden states: 1.) deletion - decreased read depth and loss of heterozygosity in GT calls, 2.) normal - no CN change and retention of heterozygosity, 3.) copy-neutral loss of heterozygosity - normal CN paired with an unusually long run of homozygous calls, and 4.) amplification - gain in CN with no information about its influence on GT. Emission probabilities were calculated for CN and GT separately and then combined to estimate the most probable sequence of hidden states. The model for CN estimation accounts for GC content and background read depth.
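The decoding step described above can be sketched with a toy Viterbi pass over the four states. All probabilities below are invented for illustration (the actual model works with continuous read depths, GC correction, and calibrated emissions); the sketch only mimics the key design choice of multiplying separately calculated CN and GT emission probabilities, here over discretised observations:

```python
# Toy Viterbi decoding for a four-state CN/GT model: observations are
# (depth, genotype) pairs with depth in {"low","mid","high"} and
# genotype in {"het","hom"}. All numbers are illustrative assumptions.
import math

STATES = ["deletion", "normal", "cnLOH", "amplification"]

CN_EMIT = {"deletion":      {"low": .80, "mid": .15, "high": .05},
           "normal":        {"low": .10, "mid": .80, "high": .10},
           "cnLOH":         {"low": .10, "mid": .80, "high": .10},
           "amplification": {"low": .05, "mid": .15, "high": .80}}
GT_EMIT = {"deletion":      {"het": .05, "hom": .95},
           "normal":        {"het": .50, "hom": .50},
           "cnLOH":         {"het": .05, "hom": .95},
           "amplification": {"het": .50, "hom": .50}}
STAY = 0.9  # self-transition probability; the rest split evenly

def trans(a, b):
    return STAY if a == b else (1 - STAY) / (len(STATES) - 1)

def emit(s, obs):
    depth, gt = obs
    return CN_EMIT[s][depth] * GT_EMIT[s][gt]  # combined CN x GT emission

def viterbi(obs):
    """Most probable hidden-state path for a list of (depth, gt) pairs."""
    logp = {s: math.log(1 / len(STATES)) + math.log(emit(s, obs[0]))
            for s in STATES}
    back = []
    for o in obs[1:]:
        new, ptr = {}, {}
        for s in STATES:
            lp, prev = max((logp[p] + math.log(trans(p, s)), p)
                           for p in STATES)
            new[s] = lp + math.log(emit(s, o))
            ptr[s] = prev
        logp = new
        back.append(ptr)
    state = max(STATES, key=lambda s: logp[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]

print(viterbi([("low", "hom"), ("low", "hom"), ("mid", "het")]))
# → ['deletion', 'deletion', 'normal']
```

The sticky self-transition is what lets the model call long runs of homozygous, normal-depth SNPs as copy-neutral LOH rather than flipping state at every noisy observation.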

Using simulated data as well as whole-genome and exome sequencing data from a sample obtained at three different time points (initial, remission, relapse), we compared our method to other algorithms for copy number estimation from exome sequencing data.
