Accepted Posters

Attention Conference Presenters - please review the Speaker Information Page available here.

If you need assistance please contact submissions@iscb.org and provide your poster title or submission ID.

Category N - 'Sequence Analysis'

N01 - Accurate prediction of mitochondrial presequences and their cleavage sites with MitoFates identifies hundreds of novel human mitochondrial protein candidates
  • Yoshinori Fukasawa, University of Tokyo, Japan

Short Abstract: Mitochondria provide numerous essential functions for cells, and their dysfunction causes disorders such as neurodegenerative diseases. Thus, obtaining a complete mitochondrial proteome should be a crucial step toward understanding the roles of mitochondria. Many mitochondrial proteins have been identified, but a complete list is not available. Unfortunately, the accuracy of existing predictors is far from perfect and has not improved significantly for a decade!

Here, we report MitoFates, a predictor to accelerate the discovery of mitochondrial proteins. In developing MitoFates we introduced novel presequence features: a modified hydrophobic moment, novel motifs and a refined PWM for the cleavage site. We combined those with classical features and presented them to an SVM.

According to our benchmarks on a non-redundant test set of proteins, MitoFates achieves significantly higher performance than the well-known predictors TargetP, Predotar and MitoProtII.

To investigate the utility of MitoFates, we looked for undiscovered mitochondrial proteins in the human proteome. MitoFates predicts 1231 genes, and 633 of these were annotated as “mitochondria” in neither UniProt nor GO. Interestingly, these include candidate regulators of Parkin translocation to damaged mitochondria, a trigger of degradation of dysfunctional mitochondria. This suggests that careful investigation of other predictions will be helpful in elucidating the functions of mitochondria in health and disease.
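The hydrophobic-moment feature mentioned above can be illustrated with the classical amphipathicity calculation; the sketch below is only a generic version of that quantity (MitoFates' modified moment is not specified here, so the hydropathy scale, helix angle, window length and example peptide are assumptions).

```python
import math

# Kyte-Doolittle hydropathy values (one common scale, assumed here for illustration).
KD = {'A': 1.8, 'R': -4.5, 'N': -3.5, 'D': -3.5, 'C': 2.5, 'Q': -3.5,
      'E': -3.5, 'G': -0.4, 'H': -3.2, 'I': 4.5, 'L': 3.8, 'K': -3.9,
      'M': 1.9, 'F': 2.8, 'P': -1.6, 'S': -0.8, 'T': -0.7, 'W': -0.9,
      'Y': -1.3, 'V': 4.2}

def hydrophobic_moment(window, angle_deg=100.0):
    """Amphipathicity of a window: magnitude of the vector sum of per-residue
    hydropathies around an idealized helix (100 degrees per residue), divided by
    window length. Presequences are thought to form positively charged amphipathic
    helices, which is what such a feature is designed to detect."""
    delta = math.radians(angle_deg)
    sin_sum = sum(KD[aa] * math.sin(i * delta) for i, aa in enumerate(window))
    cos_sum = sum(KD[aa] * math.cos(i * delta) for i, aa in enumerate(window))
    return math.hypot(sin_sum, cos_sum) / len(window)

# Example: maximum moment over 11-residue windows of a hypothetical N-terminal region.
nterm = "MLRTSSLFTRRVQPSLFRNILRLQST"
print(max(hydrophobic_moment(nterm[i:i + 11]) for i in range(len(nterm) - 10)))
```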

N02 - Identification of Conserved RNA Structural Motifs using A Subgraph Random Sampling Approach
  • Jiajie Huang, Purdue University, United States

Short Abstract: (This poster is based on Proceedings Submission 99.)
RNA structures can be represented as graphs, and common substructures can be identified by finding common (isomorphic) subgraphs among multiple structures. We have developed a subgraph sampling algorithm that efficiently identifies subgraphs in an RNA graph without exhaustive search. The set of subgraphs in an RNA structure, called its fingerprint, can be compared with the fingerprints of other RNA structures to identify conserved or common structural motifs. We have also developed a distance function for comparing RNA structural fingerprints and demonstrate that it is able to correctly identify the similarity between known classes of RNA structures.
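As a rough illustration of the fingerprint idea (not the authors' sampling algorithm or distance function), the sketch below grows small connected subgraphs from a networkx graph at random, labels them with an isomorphism-invariant hash, and compares two resulting fingerprints with a simple L1 distance; all function names and parameters are placeholders.

```python
import random
from collections import Counter
import networkx as nx

def sample_connected_subgraph(G, size, rng):
    """Grow one connected subgraph of up to `size` nodes by random expansion."""
    start = rng.choice(list(G.nodes))
    nodes = {start}
    frontier = set(G.neighbors(start))
    while len(nodes) < size and frontier:
        nxt = rng.choice(list(frontier))
        nodes.add(nxt)
        frontier |= set(G.neighbors(nxt))
        frontier -= nodes
    return G.subgraph(nodes)

def fingerprint(G, size=4, n_samples=2000, seed=0):
    """Multiset of canonical subgraph labels; a stand-in for an RNA graph fingerprint."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_samples):
        sg = sample_connected_subgraph(G, size, rng)
        # Weisfeiler-Lehman hash gives an isomorphism-invariant label (networkx >= 2.5).
        counts[nx.weisfeiler_lehman_graph_hash(sg)] += 1
    return counts

def fingerprint_distance(fp1, fp2):
    """L1 distance between normalized fingerprints (illustrative, not the paper's metric)."""
    keys = set(fp1) | set(fp2)
    n1, n2 = sum(fp1.values()), sum(fp2.values())
    return sum(abs(fp1[k] / n1 - fp2[k] / n2) for k in keys)
```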

N03 - Accelerating DNA profile HMM searches with the FM index
  • Travis Wheeler, University of Montana, United States

Short Abstract: The FM index is a low-memory data structure that has recently played an important role in applications that depend on recognition of near-exact similarity between DNA sequences, such as read mapping and assembly. We introduce an algorithm that employs the FM index to accelerate highly sensitive sequence homology searching. We recently released software that improves detection of remote DNA homologs by applying probabilistic inference methods based on profile hidden Markov models (profile HMMs). That software, called nhmmer, represented a 100-fold acceleration over previous DNA-DNA search using profile HMMs, but even greater speed is required in the face of today’s massive sequence databases. Our FM-index-based algorithm replaces the bottleneck in nhmmer’s filter pipeline, a stage that identifies high-scoring ungapped alignment seeds. With default settings, this approach yields a >20-fold acceleration of seed finding, and makes nhmmer competitive with blastn speed while retaining nearly all of nhmmer’s improved sensitivity.

This accelerated DNA search is implemented in the upcoming HMMER3.1 release. Source code and documentation can be downloaded from http://hmmer.org.
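For readers unfamiliar with the underlying data structure, here is a minimal, naive FM-index sketch showing the backward-search operation that such seed finding builds on; it is purely illustrative and has no relation to nhmmer's optimized implementation (which uses compressed, sampled structures).

```python
class TinyFMIndex:
    """Naive FM index over a small text: counts exact occurrences of a pattern."""

    def __init__(self, text):
        text += "$"                      # unique terminator, lexicographically smallest
        sa = sorted(range(len(text)), key=lambda i: text[i:])   # naive suffix array
        self.bwt = "".join(text[i - 1] for i in sa)             # Burrows-Wheeler transform
        self.alphabet = sorted(set(text))
        # C[c] = number of characters in the text strictly smaller than c
        self.C, total = {}, 0
        for c in self.alphabet:
            self.C[c] = total
            total += text.count(c)
        # occ[c][i] = occurrences of c in bwt[:i]
        self.occ = {c: [0] * (len(self.bwt) + 1) for c in self.alphabet}
        for i, b in enumerate(self.bwt):
            for c in self.alphabet:
                self.occ[c][i + 1] = self.occ[c][i] + (1 if b == c else 0)

    def count(self, pattern):
        """Backward search: number of exact occurrences of `pattern` in the text."""
        lo, hi = 0, len(self.bwt)
        for c in reversed(pattern):
            if c not in self.C:
                return 0
            lo = self.C[c] + self.occ[c][lo]
            hi = self.C[c] + self.occ[c][hi]
            if lo >= hi:
                return 0
        return hi - lo

fm = TinyFMIndex("ACGTACGTGACGA")
print(fm.count("ACG"))   # 3
```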

N04 - GraphProt: How to make sense out of CLIP-seq data
  • Rolf Backofen, University of Freiburg, Germany

Short Abstract: The paper deals with one of today's hottest topics in biology, namely the analysis of RNA-protein interactions. Recent studies revealed that hundreds of RNA-binding proteins (RBPs) regulate a plethora of post-transcriptional processes. The gold standard for identifying RBP targets is the experimental CLIP-seq approach. However, a large number of binding sites remain unidentified, which is a major yet underestimated problem. The reason is simply that CLIP-seq is sensitive to expression levels. Thus, available CLIP-seq experiments for a specific protein in liver cells cannot be used to infer targets in, say, kidney cells.

We provide a solution by learning an accurate protein-binding model based on an efficient graph-kernel approach that learns sequence-structure properties from several thousand binding sites. Transcripts targeted in any other cells can then be identified with high specificity. For example, we show that the up-regulation in an AGO-knockdown cannot be explained with existing AGO-CLIP-seq data, but it can when using our predictions.

N06 - QueryFuse: A Hypothesis-Based, Gene-Specific Fusion Detection Algorithm
  • Yuxiang Tan, Boston University, United States

Short Abstract: Recurrent chromosomal translocations/fusions are characteristic features of many types of cancers, especially lymphoma and leukemia. Because fusions play important roles in carcinogenesis, they can serve as valuable diagnostic and therapeutic targets. To detect novel transcribed fusion targets, mRNA-seq (Whole Transcriptome Shotgun Sequencing) provides an ideal platform, and several computational methods have been developed to identify fusion transcript candidates from mRNA-seq data. However, all these methods require a full realignment to the transcriptome, a task that is computationally expensive and not necessary. Additionally, genome-wide fusion detection methods may use biologists' time inefficiently on genes that are not their priority. Here we propose QueryFuse, a novel hypothesis-based, gene-specific fusion detection algorithm for pre-aligned mRNA-seq data. It is designed to help biologists find and/or confirm fusions of genes of interest in a sample within minutes, together with detailed information and visualization of supporting reads. By narrowing down the reads to those related to the query genes, we can not only reduce realignment time, but also use a maximally accurate local aligner that is more sensitive and would not be computationally feasible genome-wide. We use a simulated dataset to estimate the sensitivity and FDR of QueryFuse and compare them to those of TopHat-Fusion and deFuse. We also validate our method on a well-studied, publicly available breast cancer RNA-seq dataset. Finally, we report the results of our de-novo analysis of a series of clinical testicular lymphoma samples, and the identification of a novel fusion event with potential therapeutic implications.

N07 - Modular Domain-Peptide Interactions
  • Kousik Kundu, University of Freiburg, Germany

Short Abstract: Protein-protein interactions are among the most essential cellular processes in eukaryotes and are involved in many important biological activities such as signal transduction and the maintenance of cell polarity. Many protein-protein interactions in cellular signaling are mediated by modular protein domains. Peptide recognition modules (PRMs) are an important subclass of modular protein domains that specifically recognize short linear peptides to mediate various post-translational modifications. Computational identification of modular domain-peptide interactions is an open challenge with high relevance. In this study we applied machine learning approaches to identify the binding specificity of three modular protein domains (i.e. SH2, SH3 and PDZ domains). All models are based on support vector machines with different kernel functions ranging from polynomial, to Gaussian, to advanced graph kernels. In this way we model non-linear interactions between amino acid residues. Additionally, a powerful semi-supervised technique was used to tackle the data-imbalance problem. We validated our results on manually curated data sets and achieved competitive performance against state-of-the-art approaches. Finally, we developed an interactive and easy-to-use webserver, MoDPepInt (Modular Domain-Peptide Interactions), for the prediction of binding partners for the aforementioned modular protein domains. Currently we offer models for SH2, SH3 and PDZ domains via the tools SH2PepInt, SH3PepInt and PDZPepInt. More specifically, our server offers predictions for 51 human SH2 domains and 69 human SH3 domains via single-domain models, and predictions for 226 PDZ domains across several species via 43 multi-domain models. MoDPepInt includes the largest number of models and offers a comprehensive domain-peptide prediction system in a single platform.
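To make the general setup concrete, here is a small sketch of training SVMs with polynomial and Gaussian kernels on one-hot-encoded peptides using scikit-learn; the peptide data and labeling rule are synthetic, and neither the graph kernel nor the semi-supervised imbalance handling used by MoDPepInt is reproduced.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(peptide):
    """Encode a fixed-length peptide as a flat one-hot vector (length x 20)."""
    vec = np.zeros((len(peptide), len(AMINO_ACIDS)))
    for i, aa in enumerate(peptide):
        vec[i, AMINO_ACIDS.index(aa)] = 1.0
    return vec.ravel()

# Synthetic 10-mer peptides with a toy "binder" rule (hydrophobic-residue count);
# this stands in for curated interaction data and is for illustration only.
rng = np.random.default_rng(0)
peptides = ["".join(rng.choice(list(AMINO_ACIDS), size=10)) for _ in range(300)]
labels = np.array([int(sum(aa in "AILMFVW" for aa in p) >= 4) for p in peptides])
X = np.array([one_hot(p) for p in peptides])

for kernel in ("poly", "rbf"):   # polynomial and Gaussian kernels, as in the abstract
    clf = SVC(kernel=kernel, class_weight="balanced")
    auc = cross_val_score(clf, X, labels, cv=5, scoring="roc_auc")
    print(kernel, auc.mean().round(3))
```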

N08 - Combining NGS Data with Semantic Similarity for Transcriptional Cofactor Discovery in Specific Tissues
  • Paul Bible, National Institute of Arthritis and Musculoskeletal and Skin Diseases, United States

Short Abstract: Transcription factor (TF) binding site discovery is central to the study of gene regulation. Discovery of transcriptional cofactors is essential to determining the mechanisms of gene regulation and leads to hypotheses that are readily testable. We have developed a novel method for discovering potential TF cofactors for a given TF of interest by integrating ChIP-seq and RNA-seq NGS data with Gene Ontology (GO) functional semantic similarity. Trimethylated K4 (K4me3) ChIP-seq data identify regions of regulatory interest in specific tissues. Using ChIP-seq peaks from this data, traditional position weight matrix (PWM) motif models are used to identify potential binding sites and targets of the TF of interest as a first pass. From motif data in K4me3 regions, potential cofactors and target genes are identified. Determining cofactors from motif data is confounded by redundancies and overlaps between TF binding preferences. To overcome this challenge, we employ a scoring system based on GO semantic similarity between potential cofactors, the TF of interest, and potential targets of the TF and its cofactors. Conditional knockout (KO) RNA-seq data from the TF of interest and wildtype controls provide a method of prioritizing potential cofactors for experimental validation. Our method has identified two potential cofactors for an essential TF in hair and skin. Potential targets of the TF and cofactors are known regulators of hair development. KOs of these cofactors in mouse result in phenotypes closely resembling the KO phenotype of the TF of interest. These cofactors are currently being experimentally validated.

N09 - MetAmp: a novel approach to clustering analysis of microbial community structures using multiple genomic fingerprints
  • Ilya Zhbannikov, University of Idaho, United States

Short Abstract: High-throughput sequencing technologies allow researchers to characterize microbial community composition and structure without cultivation. There are two current approaches to the analysis of microbial communities: metagenomic and genomic marker sequencing. Metagenomic analysis employs shotgun sequencing followed by optional assembly. Marker sequencing clusters amplicon sequences by similarity into operational taxonomic units (OTUs). The limitation of the OTU approach is that it suffers from sequence noise, which reduces taxonomic resolution or leads to misclassifications and misassignments. It is also possible that phylogenies estimated from multiple markers can differ substantially from phylogenies deduced from known, complete 16S rRNA genes, since clustering can be different for different markers. Moreover, it is difficult to conduct analysis of data obtained from sequencing of different markers.
We present a novel method of marker analysis of microbial data, which performs clustering analysis of microbial communities using multiple genomic markers. The novel algorithm calibrates these markers using data from known microbial genomes. This in turn represents an improvement on current methods used to characterize bacterial composition and community structures. We also present MetAmp, an application developed to serve these purposes.

N10 - New Features in SPAdes Genome Assembler
  • Andrey Prjibelski, St. Petersburg Academic University, Russia

Short Abstract: Despite all efforts, high-quality genome assembly is a complex task that so far remains unsolved. It is well known that the majority of problems are caused by repeats, which are present in genomes of any nature. The usage of multiple methods of genomic DNA isolation, different sequencing technologies and different types of genomic libraries for research projects introduces additional levels of complication to genome assembly. The assembler tool SPAdes was originally developed at the St. Petersburg Academic University (St. Petersburg, Russia) for the purpose of overcoming the complications associated with single-cell microbial data (uneven coverage, increased level of errors and chimeric reads). The tool was able to successfully resolve these issues for Illumina reads and was recognized by the scientific community as one of the best assemblers working with both isolate and single-cell data. Even though the assembler was specifically designed to work solely with microbial genomes, scientists have tested the tool on a large number of other data types.

Their efforts and feedback have inspired us to extend the capabilities of SPAdes to include additional platforms (Ion Torrent, PacBio, Sanger), combinations of platforms, and to work with both paired-end and mate-pair libraries of different insert sizes. In this poster we present novel features of SPAdes 3.1: hybrid assemblies including the combination of Illumina/IonTorrent with PacBio (or other long-read technologies), improved algorithms for scaffolding and repeat resolution, and an approach for mate-pair-only assembly using the new Illumina NexteraMP protocol.

N11 - MicroRNA adenylation in early embryonic development
  • Narry Kim, Seoul National University, Korea, Rep

Short Abstract: Small RNAs are frequently modified at their 3’ ends. Uridylation and adenylation are the most common modifications, but the frequency of modification varies widely depending on RNA species and cell type.
By analyzing the Drosophila transcriptome, we discovered that a large portion of microRNAs (miRNAs) are highly adenylated at their 3’ ends in activated oocytes and early embryos. The role and mechanism of post-transcriptional regulation of mRNA during early animal development is well studied, but it remains largely unknown how miRNAs are regulated at this stage.
Here we show that adenylation of miRNAs during early animal development is mediated by Wispy, a noncanonical poly(A) polymerase. Through wispy knockout and overexpression studies, we find that Wispy induces degradation of miRNAs and decreases their activity. We report here that maternally enriched miRNAs are more highly adenylated than zygotically abundant miRNAs. We also find that adenylation of maternal miRNAs occurs pervasively in mouse and sea urchin eggs. These results suggest that adenylation may be a conserved mechanism that contributes to the clearance of maternally deposited miRNAs.

N12 - Next Generation Sequencing identifies a structural variant in FKBP5 moderating a gene by environment interaction after trauma exposure
  • Simone Röh, Max Planck Institute of Psychiatry, Germany

Short Abstract: Next Generation Sequencing (NGS) technologies are important tools to dissect the genetic contribution to complex disorders. They outperform common array-based approaches in the identification of novel and rare single nucleotide polymorphisms and structural variation. In psychiatric research, the detection of such variation can be based on pooled samples. Sample pooling is a strategy enabling the analysis of large sample sizes in a cost-efficient manner. Here, we use a pooled targeted re-sequencing approach to identify variation in the FKBP5 gene, a gene involved in depression, post-traumatic stress disorder and anxiety.
We pooled genomic DNA of 400 African-American individuals from the Grady Trauma Project that were exposed to childhood trauma and interrogated for current psychopathology. Eight libraries containing 50 individuals each were prepared and sequenced on a SOLiD 5500xl platform, generating approximately 400 million mappable reads. Overall, 1040 SNPs were detected, around 45% of which were previously unknown. Moreover, sequencing revealed a deletion of 3.3 kb located in the first intron of FKBP5 that was confirmed by PCR genotyping in single individuals. This structural variant further modifies a previously described gene by environment interaction of rs1360780 and childhood trauma on the risk for post-traumatic stress disorder in adulthood. Our data underline the importance of structural variants for gene-environment interaction analyses in interplay with common single nucleotide variants.

N13 - Context-based mapping of RNA-seq data with ContextMap 2.0
  • Thomas Bonfert, Institute for Informatics, Germany

Short Abstract: Sequencing of RNA (RNA-seq) using next generation sequencing technology has effectively become the standard approach for profiling the transcriptomic state of a cell. This requires mapping of millions of sequencing reads to determine their transcriptomic origin. Recently, we developed a context-based mapping approach, ContextMap, which determines the most likely origin of a read by evaluating the context of the read in terms of alignments of other reads to the same genomic region. While the original implementation of ContextMap focused on improving mappings provided by other RNA-seq mapping tools, we recently extended this into a standalone version using a modification of the Bowtie short read aligner.
Here, we present ContextMap 2.0, an extension of the original ContextMap method, which now allows the use of alternative short read aligners without modification. Currently, ContextMap 2.0 explicitly supports Bowtie, Bowtie2 or BWA, but other short read alignment programs can easily be included in the ContextMap workflow. This allows the accuracy of RNA-seq mapping to be improved in a straightforward way by replacing the internal alignment program with improved short read alignment approaches.
While the initial ContextMap version was already very accurate compared to other state-of-the-art approaches, we have now additionally improved accuracy by adding new mapping strategies that significantly reduce false discovery rates. Furthermore, sensitivity was increased by implementing novel methods to detect reads spanning an arbitrary number of exons or containing insertions or deletions. Finally, the design of ContextMap 2.0 allows for massively parallelized data processing, resulting in reasonable running times despite the higher complexity of the context-based approach.

N14 - SoftSV: Assembling soft-clipped alignments to detect structural variation breakpoints from paired-end sequencing data
  • Christoph Bartenhagen, University of Münster, Germany

Short Abstract: Structural variations (SVs), such as deletions, inversions, tandem duplications and translocations, play an important role in genetic diversity in the general population and specifically in diseases such as cancer. The increasing read lengths of modern NGS paired-end technologies enable the refinement of approximately detected breakpoints by split-read alignments down to single-base-pair resolution. But repetitive or complex breakpoint sequences with further co-occurring mutations hamper local alignments of short (sub-)sequences against a reference sequence. Furthermore, most SV detection programs require a minimum sequence coverage at the breakpoint to reduce false-positive predictions, but setting this threshold remains difficult and is largely empirical.
We present a method, called SoftSV, which addresses problems of SV detection such as ambiguous mappings, unclean breakpoints, coverage thresholds and overlapping SVs. It analyses and reports any SV prediction with a minimum support of only one discordantly mapping read pair and one split read at each breakpoint, without a threshold on their mapping quality. SoftSV mutually aligns and assembles soft-clipped sequences from local alignment algorithms to accurately determine the breakpoint and its flanking sequences without using a reference sequence.
Simulations show that this enables robust and sensitive SV detection for all four SV types and for sequencing parameters starting from 10x depth and 75 bp read length, even within or close to repeat-rich regions of the genome. We further compared SoftSV to other paired-end and split-read methods on real datasets from the 1000 Genomes and HapMap Projects.
SoftSV is implemented in C++ and freely available at http://sourceforge.net/projects/softsv.

N15 - Janda: a two-layer approach to detect somatic structural variants in whole-genome DNA sequence data
  • Peter Johansson, QIMR Berghofer Medical Research Institute, Australia

Short Abstract: We present janda, a method to detect genomic structural variants (SVs) in whole-genome DNA sequence data. The janda algorithm detects SVs in two steps. First, candidate regions are identified by clustering anomalous read pairs (ARPs), i.e. read pairs that are not properly paired. Second, unmapped and clipped reads are aligned against the two candidate regions with a modified version of the Smith-Waterman algorithm that allows the two regions to be joined together.

To evaluate janda we generated synthetic sequencing data from a genome with 26 SVs, including 11 inversions, 7 deletions, 4 tandem duplications, 2 intra-chromosomal translocations, and 2 inter-chromosomal translocations. Janda detected 24 SVs correctly with only one false positive (FDR=4%). The two false negatives were both relatively small deletions (<100bp), which illustrates that the sensitivity of the approach decreases with the size of the variant. This is expected, as janda relies on ARPs and smaller indels are less likely to cause an ARP. This analysis demonstrated that the janda algorithm detects SVs with both high sensitivity and specificity.

To further evaluate the algorithm, we applied it to whole-genome sequencing data obtained from a melanoma tumor with matched blood. Janda predicted 24 SVs; we designed PCR primers for these SVs, and Sanger sequencing confirmed the breakpoints of 12 SVs, including three translocations, one inversion, and eight deletions. Interestingly, one of the deletions included PTEN, a known tumor suppressor in the PI3K pathway, which is a key player in melanoma and other cancers.

The algorithm is implemented in C++ and available as free software under GNU General Public License.
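Since the method builds on a modified Smith-Waterman alignment, the sketch below shows only the unmodified textbook recurrence (score matrix with a zero floor and a linear gap penalty) as orientation; janda's modification for joining two candidate regions is not reproduced, and the scoring parameters are arbitrary.

```python
def smith_waterman(a, b, match=2, mismatch=-2, gap=-3):
    """Standard Smith-Waterman local alignment: best score and its end coordinates."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            score = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + score,   # match / mismatch
                          H[i - 1][j] + gap,         # gap in b
                          H[i][j - 1] + gap)         # gap in a
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    return best, best_pos

# Toy example on two short sequences.
print(smith_waterman("ACACACTA", "AGCACACA"))
```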

N16 - RNASurface: a web-server for comprehensive reconstruction of RNA structural profile
  • Svetlana Vinogradova, Moscow State University, Russia

Short Abstract: RNA secondary structure, rather than the sequence, is known to be the hallmark of many functionally important RNAs. There exist a number of computational tools to predict the RNA secondary structure of either a single sequence or a group of homologous sequences. However, the development of computational methods to scan genomes for structured RNAs remains a largely unsolved problem.
We developed the RNASurface algorithm, which constructs a matrix of Z-scores of RNA secondary structures and identifies all locally optimal structured RNA segments. Based on the matrix, a new measure of structure along a sequence (referred to as structural significance) was defined as the maximum of all squared Z-scores of segments covering a given genomic position. The profile of structural significance provides information about the structure of a segment at single-nucleotide scale.
The recently implemented RNASurface web server provides an interface for the algorithm, including a heatmap visualization of segment Z-scores. We applied our algorithm to Drosophila and bacterial genomes to analyze locally optimal structured RNA segments at the genome-wide scale and to perform an evolutionary analysis.
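The structural-significance profile defined above is simple to compute once segment Z-scores are available; a minimal sketch follows, with hypothetical segments and coordinates (computing the Z-scores themselves is RNASurface's job and is not reproduced here).

```python
import numpy as np

def structural_significance(z_scores, seq_len):
    """Profile of structural significance along a sequence.

    z_scores maps (start, end) segments (0-based, end-exclusive) to the Z-score of
    their predicted structure; the value at each position is the maximum squared
    Z-score over all segments covering that position, as defined in the abstract.
    """
    profile = np.zeros(seq_len)
    for (start, end), z in z_scores.items():
        profile[start:end] = np.maximum(profile[start:end], z * z)
    return profile

# Hypothetical example: three overlapping segments with their Z-scores.
segments = {(0, 40): -1.2, (25, 90): -3.5, (60, 110): -0.8}
profile = structural_significance(segments, seq_len=120)
print(profile[30], profile[100], profile[115])   # 12.25, 0.64, 0.0
```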

N17 - Structural evolution of the Human Accelerated Region 1 (HAR1) as part of a long non-coding RNA
  • Katja Nowick, University of Leipzig, Germany

Short Abstract: The Human Accelerated Region 1 (HAR1) is part of a long non-coding RNA. The 118-nucleotide-long region is very well conserved among vertebrates, but has 18 nucleotide changes that are human-specific. HAR1 is specifically expressed in Cajal-Retzius neurons during embryonic development and is co-expressed with reelin, which suggests that HAR1 is involved in brain development and cognitive abilities.
The molecular function of HAR1 is yet to be understood, but it seems likely that it acts through its secondary structure. To test this possibility, we investigated evolutionary changes in the secondary structure of HAR1 during great ape evolution. Using RNAfold (ViennaRNA) secondary structure predictions, refined by experimental data, we performed initial molecular evolutionary simulations. Interestingly, we discovered an evolutionary path to a more stable structure from the ancestral to the human HAR1 structure. Our findings are backed up by base pair probability calculations for the HAR1 sequences from human, chimpanzee and Denisovan. While the ancestral sequence has a diverse ensemble of possible structures, the Denisovan and especially the human sequence have a more restricted structural space.
Using RNAsnp, we tested which single-nucleotide mutations of the HAR1 sequence would have an impact on its structure. Three out of the 18 human-specific changes are predicted to significantly affect the structure of HAR1. In conclusion, at least some of the human-specific mutations in HAR1 seem to have evolved under positive selection to form a more stable secondary structure, and might have been involved in the evolution of human-specific brain functions.

N18 - Heterogeneity in mESC under different conditions
  • Tomislav Ilicic, Wellcome Trust Sanger Institute,

Short Abstract: Mouse embryonic stem cells (mESCs) cultured in serum/LIF are morphologically more heterogeneous than when cultured in 2i/LIF.
Stochastic fluctuations in transcription, also referred to as "noise", can cause phenotypic changes in cells, leading to heterogeneity. This implies that supposedly homogeneous populations of cells can yield biologically important subpopulations. Bulk sequencing masks such heterogeneity, as gene expression is averaged over a cell population. By using single-cell sequencing, we were able to capture the transcriptomes of single mESCs cultured under three different conditions (serum/LIF, 2i/LIF and alternative 2i/LIF) and chart the landscape of heterogeneity. Our findings suggest that different growth conditions result in distinct transcriptional profiles of cells. Cells cultured in alternative 2i/LIF seem to share greater transcriptional similarity with cells from 2i/LIF. Comparative analysis of single cells within the same culture condition revealed a differentiated subpopulation of cells in serum/LIF, characterised by downregulation of Nanog, Oct4, Rex1, Esrrb, Sox2 and Tet2, and upregulation of Krt8, Krt18, Klf6 and Tpm1.
To measure the degree of noise across culture conditions, the cells were examined for allele-specific expression patterns in hybrid mice. We identified four distinct noise signatures representing extrinsic noise (variability between cells), intrinsic noise (variability within a cell) and allele specific expression. Most genes show signatures characterised by intrinsic noise and only a few genes show extrinsic noise.
In summary, our work reveals transcriptional heterogeneity of mouse embryonic stem cells grown in different culture conditions and gives a global overview about the degree of stochastic fluctuations in such cells.

N19 - SpliceSeq 2.0: A tool for analysis and visualization of splicing variation in RNASeq data.
  • Michael Ryan, In Silico Solutions, United States

Short Abstract: Alternative mRNA splicing enables a single gene to produce multiple protein products that are specific to tissue type or developmental stage. Aberrant splicing patterns lead to disease, notably cancer. Next generation sequencing platforms provide the finest resolution yet of the transcriptome, opening the door to inclusion of alternative splicing analysis in mainstream gene expression studies. SpliceSeq is a biologist-friendly application that performs differential splicing analysis and provides results that can be sorted, filtered, and visualized. SpliceSeq makes it easy to identify the most significant changes in splicing patterns between two groups of samples and to explore the potential functional impact of those changes. We will present results of SpliceSeq analysis of TCGA tumor sample data.

SpliceSeq is freely available for academic, government, or commercial use at http://bioinformatics.mdanderson.org/main/SpliceSeq:Overview

1. Ryan MC, Cleland J, Kim R, Wong WC, Weinstein JN. SpliceSeq: A Resource for Analysis and Visualization of RNA-Seq Data on Alternative Splicing and Its Functional Impacts. Bioinformatics, 10.1093, 2012.

N20 - Benchmark Analysis of Algorithms for Reconstructing Full Splice Forms from RNA-Seq
  • Katharina Hayer, University of Pennsylvania, United States

Short Abstract: A serious difficulty of RNA-Sequencing analysis derives from the highly fragmented nature of the data. It is straightforward to quantify exons and junctions; however, combining this local information to assess the expression of full-length splice forms is highly complex. There are published algorithms offering partial solutions, one of which, Cufflinks, is currently widely used. The accuracy of these methods is difficult to assess, due to a lack of benchmarks. However, simulated data can provide upper bounds on the accuracy. We have developed a simulator (BEERS), which is highly effective at assessing transcript reconstruction applications. The simulator mimics the discrete operations that produce paired-end reads. We have assessed the accuracy of Cufflinks and the other algorithms as functions of the number of expressed splice forms for a given gene. We first provide a 100% accurate alignment, to establish an upper bound on the accuracy, as well as alignments with TopHat, RUM and GSNAP. We present our findings as false-positive and false-negative rates with regard to transcript structure, and as assessments of the error of the estimated FPKM values. When there is only one splice form that is correctly annotated, abundantly expressed, sequenced without error, and perfectly aligned to a non-polymorphic genome, the algorithms are capable of detecting it. However, when factors are introduced such as incomplete annotation, multiple splice forms, or alignment artifacts, the accuracy drops precipitously. We conclude that the current published algorithms are probably not effective enough to be practical, underscoring the need for further algorithmic development and funding.

N21 - RVD2: An ultra-sensitive variant detection model for low-depth targeted next-generation sequencing data
  • Patrick Flaherty, Worcester Polytechnic Institute, United States

Short Abstract: We present a novel variant calling algorithm that uses a hierarchical Bayesian model to estimate allele frequency and call variants in heterogeneous samples.

N22 - Improved Pipeline for the Alignment of Plant DNA Barcode Region
  • Hannah Garbett, University of South Wales,

Short Abstract: Plant DNA barcoding uses short DNA sequences to identify species from a specimen (i.e. roots, pollen, leaf, etc.). In 2009, the Consortium for the Barcode of Life (CBOL) established a standard for plant barcoding projects. The standard consists of the chloroplast genes matK and rbcL. To analyze the data from this process, accurate and reliable alignment tools are required. Simply aligning the matK barcode regions using standard alignment tools often produces inaccurate results, because the tools often break codons as they align DNA sequences, which leads to inaccurate attributions of gene functionality. Hence, it is imperative to ensure that codons remain intact throughout the data analysis process. We selected a range of open source alignment tools based on their popularity in the scientific literature and reviewed their performance. We identified one tool, transAlign, which did not break any codons in our analysis. Some tools have the capability to align translated sequences, but the sequences must be in the same reading frame for the process to be successful. However, this is not always possible, as the frame may differ between sequences in the same sequence file due to editing of the sequences. We therefore modified transAlign to develop an algorithm for more efficient multiple alignment. Maximum likelihood models were used to detect the best possible open reading frame (ORF). The pipeline includes several different open source alignment tools (i.e. ClustalW, MAFFT, MUSCLE) as default options. Other alignment tools can also be included as modular components of the pipeline.
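As a toy illustration of why the reading frame must be fixed before a codon-preserving alignment, the sketch below simply picks the frame with the fewest internal stop codons; this is a crude stand-in for the maximum-likelihood ORF detection described above, and the sequence is hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}   # standard stop codons

def best_reading_frame(seq):
    """Return the frame (0, 1 or 2) with the fewest internal stop codons.

    A deliberately simple heuristic: a frame riddled with stops is unlikely to be
    the coding frame, so it should not be used for a translation-guided alignment.
    """
    seq = seq.upper()
    best_frame, best_stops = 0, float("inf")
    for frame in range(3):
        codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
        stops = sum(c in STOPS for c in codons)
        if stops < best_stops:
            best_frame, best_stops = frame, stops
    return best_frame

print(best_reading_frame("ATGGCTTAATCTACTGATCAAA"))   # hypothetical barcode fragment
```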

N23 - Identification and characterization of circular DNA elements in Ipomoea spp. samples collected from Brazil and Nigeria
  • Priscila Grynberg, Brazilian Agricultural Research Corporation, Brazil

Short Abstract: The Ipomoea genus has over 500 species, including morning glory and sweet potato. Little is known about their own (chloroplast, mitochondrial, mitochondrial plasmid) and exogenous (microbial diversity) circular DNA elements (circomics). Understanding the chloroplast sequence can bring insights for genetic transformation by homologous recombination. We intended to characterize the circomics present in plants collected in Brazil and Nigeria using next generation sequencing. Rolling circle amplification (RCA) reaction products from 14 Ipomoea spp. samples collected in Brazil (from the states of Ceará, Pernambuco and the Distrito Federal), as well as 9 samples collected in Nigeria, were pooled separately by country of origin and sequenced on the Roche 454 GS FLX platform. The two libraries, IPOBRA and IPONIG, were each sequenced on 1/8th of a plate. IPOBRA yielded 44,433 reads and IPONIG a total of 75,318 reads. FastQC and PRINSEQ were used for quality checking and trimming, respectively. Coral was applied to correct possible indel errors, introduced mainly by homopolymers. Artificial duplicates were removed with cd-hit-454. After the pre-processing steps, 28,539 sequences (average length of 653.3 bases) from the Brazilian samples and 51,576 sequences (average length of 701 bases) from the Nigerian samples were used for the next steps. Preliminary results using SSU rRNA markers confirmed the presence of chloroplast and mitochondrial sequences. The sample from Nigeria also presented a 16S rRNA sequence from the endophytic nondiazotrophic bacterium Enterobacterium asburia. Future steps include assembly and annotation of the Ipomoea spp. chloroplast and other circular DNA elements, as well as further characterization of possible pathogens and symbionts.

N24 - Diversification of small RNAs pathways in developing seeds and inbred lines of maize
  • Oliver Tam, Cold Spring Harbor Laboratory, United States

Short Abstract: Small RNAs are important regulators of gene expression that act in a homology-dependent manner to guide transcriptional and post-transcriptional silencing mechanisms. However, in maize, knowledge of small RNA pathways and the targets that they regulate is greatly lacking. The limited available data indicate substantial divergence of small RNA pathways between Arabidopsis and maize, yet even within maize varieties there is considerable pathway diversity. Depending on inbred background, mutants that perturb the biogenesis of miRNAs or trans-acting siRNAs condition drastically different phenotypes. This implies that variation in small RNA pathways or downstream processes could drive intra-species diversity and/or robustness against perturbations. To elucidate the unique roles of small RNAs in maize, we sequenced 12 DAP embryo and endosperm tissues from several inbreds, as well as small RNA biogenesis mutants. Preliminary analysis of dicer-like 1 mutants has uncovered both known and potentially novel miRNAs, originating from unique and repetitive regions of the genome. These candidates will be further investigated to determine targets and assess conservation across inbred lines. Our analyses also extend beyond miRNAs to other small RNA species processed by alternative DICER-LIKE proteins. Characterization of small RNA populations in maize is critical to understanding their roles in regulating growth and development. Through deep sequencing of small RNAs and their targets, we aim to uncover their molecular and functional diversity across different tissues and inbred lines of maize, and gain valuable insights into the molecular mechanisms of small RNA biogenesis, regulatory functions and evolutionary divergence, not only within maize, but also across other plant species.

N25 - Constructing Improved Advisors for Multiple Sequence Alignment
  • Dan DeBlasio, University of Arizona, United States

Short Abstract: While the multiple sequence alignment output by an aligner strongly depends on the parameter values used for the alignment scoring function (i.e. choice of gap penalties, substitution scores), most users rely on the single default parameter setting. A different parameter setting, however, might yield a much higher-quality alignment for the specific set of input sequences. The problem of picking a good choice of parameter values for specific input sequences is called parameter advising. A parameter advisor has two ingredients: (i) a set of parameter choices to select from, and (ii) an estimator that provides an estimate of the accuracy of the alignment computed by the aligner using a parameter choice; the parameter advisor then picks the parameter choice from the set whose resulting alignment has highest estimated accuracy.

We consider the class of estimators that are linear combinations of real-valued feature functions of an alignment, and assume these feature functions are given as well as the universe of parameter choices from which the advisor’s set is drawn. For this scenario, we define the problem of constructing an optimal advisor by finding the best possible set or estimator for advising on a collection of training data in which the reference alignment is available. While these optimal advising problems are NP-complete, they can all be formulated as integer linear programming problems. These ILPs are still impractical to solve directly, so we develop a fast greedy procedure based on rounding the ILP that is close to optimal and outperforms the set generated without considering the estimator.
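The advising step itself is straightforward once an estimator has been trained; the following is a minimal sketch under the assumption that an aligner wrapper, a list of feature functions and learned coefficients are supplied (all names here are placeholders, and the ILP/greedy construction of the advisor set is not shown).

```python
def estimate_accuracy(alignment, features, coefficients):
    """Linear estimator: weighted sum of real-valued feature functions of an alignment."""
    return sum(w * f(alignment) for f, w in zip(features, coefficients))

def advise(sequences, parameter_set, align, features, coefficients):
    """Pick the parameter choice whose alignment has the highest estimated accuracy.

    `align(sequences, params)` stands in for invoking the aligner with a given
    parameter choice (gap penalties, substitution scores, ...); the feature
    functions and coefficients are assumed to have been learned beforehand.
    """
    best_alignment, best_score = None, float("-inf")
    for params in parameter_set:
        alignment = align(sequences, params)
        score = estimate_accuracy(alignment, features, coefficients)
        if score > best_score:
            best_alignment, best_score = alignment, score
    return best_alignment, best_score
```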

N26 - Inferring Intra-Tumor Heterogeneity from Whole-Genome/Exome Sequencing Data
  • Layla Oesper, Brown University, United States

Short Abstract: Most tumor samples are a heterogeneous mixture of cells, including admixture by normal (non-cancerous) cells and subpopulations of cancerous cells with different complements of somatic aberrations. This intra-tumor heterogeneity complicates the analysis of somatic aberrations in DNA sequencing data from tumor samples. We describe an algorithm to infer the composition of a tumor sample, including both the percentage of normal admixture and the percentage and content (copy number aberrations) of any number of tumor subpopulations, directly from high-throughput DNA sequencing data, both whole-genome and whole-exome sequencing data.

Our new algorithm extends our published THetA algorithm (Oesper et al. 2013). First, the new algorithm is substantially faster (up to 235x faster) in the case of more than one tumor population. Second, the algorithm has improved performance on cancer genomes with large numbers of rearrangements and/or extensive amplification. We apply the new algorithm to whole-genome and whole-exome sequencing data from 9 ovarian carcinoma, glioblastoma multiforme and lung squamous cell carcinoma samples from The Cancer Genome Atlas (TCGA). For 4 of 6 samples for which we have whole-exome and whole-genome data, our estimates of tumor purity from both data types are within 3% of each other, thus demonstrating the consistency of the algorithm across different data types. In glioblastoma sample TCGA-06-0214 we identify two tumor subpopulations (48% and 19% of cells in the sample, respectively), and both include two extensive amplifications (> 30 copies) containing EGFR and PDGFRA - genes known to be amplified in glioblastoma.

N27 - Detecting complex fusion transcripts in pediatric cancer using a novel assembly-based algorithm CICERO
  • Yongjin Li, St Jude Children's Research Hospital, United States

Short Abstract: Fusion genes are important for cancer diagnosis, subtype definition and targeted therapy. Although RNAseq is useful for detecting fusion transcripts, computational methods to identify fusion transcripts arising from internal tandem duplication (ITD), or that have multiple partners, low expression or non-template insertions, are limited. We developed an assembly-based algorithm, CICERO (CICERO Is Clipped-reads Extended for RNA Optimization), that is able to extend the read length spanning fusion junctions for detecting complex fusions. Using test data that include RNAseq from 3 ependymoma (EPD), 39 low-grade glioma (LGG), and 128 acute lymphoblastic leukemia (ALL) samples, we have shown that CICERO is able to detect multi-segment fusion transcripts resulting from chromothripsis, ITD or fusions with long non-template insertions, all of which would be missed by existing fusion analysis methods. The overall sensitivity and accuracy of CICERO are much higher than those of existing tools such as deFuse and TopHat-Fusion. Using CICERO, we analyzed >600 brain tumor and leukemia transcriptomes from the St. Jude/Washington University Pediatric Cancer Genome Project (PCGP) and detected recurrent C11orf95-RELA fusions in EPD, FGFR1 ITD in LGG, NTRK fusions in high-grade glioma and activating kinase fusions with multiple partners in ALL. CICERO also shows high sensitivity when detecting fusions with low expression, like BRAF fusions in LGG, making it useful for analyzing tumor specimens with low purity. Furthermore, the power of CICERO increases with the extended read lengths enabled by improvements in next-generation sequencing (NGS) technology. Using paired-end 300bp RNAseq reads, CICERO shows the ability to assemble near full-length fusion transcripts.

N28 - Assessment of Programs for Mapping Short DNA Reads to a Reference Genome
  • Farzana Rahman, University of South Wales,

Short Abstract: Background

In the last decade, a number of algorithms and associated software packages were developed to align next generation sequencing (NGS) reads to relevant reference genomes. The results of these programs may vary significantly, especially when the NGS reads are quite different from the reference genome. Yet there is no standard way to compare the biological relevance of these programs.

Method

We developed a benchmark to assess the accuracy of short read mapping based on pre-computed global alignments of closely related genome sequences. We used pairwise alignments of the Escherichia coli O6 CFT073 genome with the genomes of seven other bacteria. We simulated short reads from these genomes and mapped the fragments onto the corresponding reference genome. We have compared several popular programs (BLAST, SOAP, SHRiMP, Bowtie, BWA) and one recently published program (ReadsMap) for mapping NGS reads to a reference genome.

Outcome

A benchmark to assess biological relevance of programs for reference genome alignments has been developed and applied to compare the most popular freely available programs. All programs show similar performance for very closely related organisms but some programs are significantly better than others when genomes are less similar. We have also found that the performance of these programs is related to the degree of duplication in the reference genome.

N29 - ABySS-Connector: Connect paired-end reads using a Bloom filter de Bruijn Graph
  • Shaun Jackman, BC Cancer Agency Genome Sciences Centre, Canada

Short Abstract: Paired-end sequencing yields a read from each end of a DNA molecule, typically leaving a gap of unsequenced nucleotides in the middle of the fragment. We have developed ABySS-Connector, a software tool that fills in the nucleotides of the unsequenced gap by navigating a de Bruijn graph to find a path between the two reads and connect the pair. ABySS-Connector represents the de Bruijn graph using a Bloom filter, a probabilistic and memory-efficient data structure that represents a set. Our implementation is able to store the de Bruijn graph using a mean of 1.5 bytes of memory per k-mer, a marked improvement over the typical hash table data structure. The memory usage per k-mer is independent of k, enabling its application to larger genomes. The use of a Bloom filter to represent a de Bruijn graph has previously been described for genome sequence assembly, a task which benefits from a second, non-probabilistic data structure to enumerate the critical false positives. We observe that this additional data structure is unnecessary for connecting reads, reducing memory requirements. The de Bruijn graph of the 20-gigabase white spruce genome sequencing data, for example, can be represented in 40 gigabytes. k-mers observed only once are usually erroneous and are therefore discarded by using a counting Bloom filter. Constructing the Bloom filter is parallelized and distributed over multiple machines, and connecting the reads is likewise parallelized and distributed. ABySS-Connector is expected to have broad applications in genomic analysis, including read alignment, sequence assembly and haplotype variant calling.
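To illustrate the core idea of an implicit, Bloom-filter-backed de Bruijn graph (though not ABySS-Connector's actual implementation, which adds counting, distributed construction and the path search between read ends), here is a tiny sketch that stores k-mers in a Bloom filter and queries the possible one-base extensions of a k-mer.

```python
import hashlib

class BloomFilter:
    """A tiny Bloom filter over strings, purely for illustration."""

    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

def successors(kmer, bloom):
    """Possible one-base extensions of a k-mer according to the (probabilistic) graph."""
    return [kmer[1:] + b for b in "ACGT" if kmer[1:] + b in bloom]

# Build the implicit de Bruijn graph (k = 5) from some reads and walk one step.
k, reads = 5, ["ACGTACGTGA", "CGTGACCA"]
bloom = BloomFilter()
for read in reads:
    for i in range(len(read) - k + 1):
        bloom.add(read[i:i + k])
print(successors("ACGTA", bloom))   # likely ['CGTAC'] (Bloom filters can report false positives)
```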

N30 - Error Correction of Illumina, Ion Torrent, and 454 Reads
  • Eric Marinier, University of Waterloo, Canada

Short Abstract: The information produced by second-generation sequencing technologies typically consists of millions of relatively short reads. However, these reads contain a variety of often platform-specific errors that can make sequence assembly and other downstream analyses more challenging. We have developed software to correct multiple error types introduced by Illumina, Ion Torrent, and 454 sequencing technologies. The software locates and corrects substitution, insertion, deletion, and homopolymer errors while remaining sensitive to low-coverage areas of a sequencing project.

Our software corrects errors by locating erroneous positions in reads and exploring the appropriateness of different possible corrections. We locate and categorize errors by recognizing adjacent k-mers with associated k-mer depths that deviate from an expected random sampling process. We attempt a variety of corrections and accept those which minimize k-mer depth deviation for a maximal number of k-mers.
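A heavily simplified sketch of the locating step described above: count k-mers over all reads and flag read positions covered only by low-depth k-mers as candidate error sites. The fixed depth threshold and the absence of a correction search are simplifications relative to the software's model of expected sampling deviation.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count every k-mer across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def weak_positions(read, counts, k, threshold=2):
    """Positions covered only by low-depth k-mers, i.e. candidate error sites."""
    depth = [0] * len(read)
    for i in range(len(read) - k + 1):
        c = counts[read[i:i + k]]
        for j in range(i, i + k):
            depth[j] = max(depth[j], c)   # best depth of any k-mer covering position j
    return [i for i, d in enumerate(depth) if d < threshold]

reads = ["ACGTACGTTACG"] * 8 + ["ACGTACCTTACG"]   # last read carries a substitution
counts = kmer_counts(reads, k=5)
print(weak_positions(reads[-1], counts, k=5))      # [6]: position of the substituted base
```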

We evaluate our software by aligning reads from Illumina MiSeq, Ion Torrent PGM, and Roche 454 GS Junior against a Roche 454 GS FLX reference. The reads and reference are all sequenced from the same E. coli O104:H4 isolate. We compare the alignments of uncorrected and corrected reads and find our software corrects the majority of errors in such technologies while introducing a very small number of errors. Furthermore, we find that our approach improves the quality of genome assemblies when used as a preprocessing step.

N31 - Comprehensive large scale assessment of intrinsic protein disorder
  • Manuel Giollo, University of Padova, Italy

Short Abstract: High throughput experiments are producing a large amount of data that needs careful annotation. Over the last decade a number of fast tools claiming accurate prediction were developed, but an objective, third-party evaluation is often missing. In this work, we analysed the performance of 11 state-of-the-art protein disorder predictors, namely FoldIndex, GlobPlot, IUPred (short and long), Espritz (X-ray, NMR, DisProt), RONN, DisEMBL (465 and HL) and VSL2b. All these predictors are part of public databases like MobiDB, suggesting that a validation of these widely used methods is necessary. These tools were compared on ca. 25,000 UniProt sequences with experimental annotation, a large test set with at least one order of magnitude more samples than in the original papers.
Our assessment shows that disorder detection is a hard task, and there is room for improvement in overall prediction quality. In fact, there are classes of proteins, like those related to Biological Adhesion or Signaling, where predictor performance clearly differs from the average accuracy. This is probably due to a bias in the design of the tested methods, which is also clear from the tendency to predict N- and C-terminal regions as unstructured. In addition, different methods tend to disagree when they predict a protein region as disordered, suggesting that they use very diverse definitions of the problem. For this reason, it is important to develop new guidelines for this problem and to design a continuous assessment of predictors of intrinsic protein disorder.

N32 - GO4D – Gene Ontology for Dummies
  • Rommel Ramos, Federal University of Pará, Brazil

Short Abstract: Next-generation sequencing (NGS) has increased the amount of genomic data available at lower cost. Beyond its advantages for structural genomics, NGS has enabled other studies such as functional genomics (transcriptomics and proteomics), which allow us to understand the function and behavior of an organism under a given condition, an important resource for developing biological and medical products. However, scripts and/or computational tools are required to handle and analyze these data, due to the number of biological databases that can be integrated to improve the results, as can be observed for Corynebacterium pseudotuberculosis, a pathogenic bacterium that causes high economic losses around the world and has been used for structural and functional genomics.
Among the functional analyses that can be applied to transcriptomics experiments, Gene Ontology annotation through Blast2GO is important for classifying genes with respect to biological process, cellular component and molecular function. However, there are some limitations in accessing and handling the results. Therefore, we developed B4D (Blast2GO for Dummies), which allows the Gene Ontology results to be inserted into a database and produces reports based on filters defined by the user. The tool uses threads to improve performance and accesses a local Gene Ontology database. Beyond applying filters, it generates graphs for the results, presenting descriptions and levels for each identified GO category.

N33 - Estimating the sample size for RNA sequencing experiments based on real data
  • Shilin Zhao, Center for Quantitative Sciences, Vanderbilt University, United States

Short Abstract: Next-generation sequencing of mRNA (RNA-Seq) has greatly expanded researchers' understanding of gene expression and is becoming more widely used. Sample size estimation is the most important issue in the design of RNA-Seq experiments, and a few negative binomial model based methods have been developed that use the average read count and dispersion of a gene to estimate sample size. In these methods, read counts and dispersions for genes are fixed and often estimated from experience or set to the most conservative values. But in RNA-Seq experiments, thousands of genes are quantified simultaneously; their read counts span more than four orders of magnitude, and dispersion is related to read count. As a result, sample sizes derived from estimated or conservative values will be inaccurate or over-estimated. To solve these issues, we developed a sample size estimation method based on the distributions of gene read counts and dispersions from real data. Data sets from The Cancer Genome Atlas (TCGA) were used as reference. Gene read counts and their related dispersions are selected randomly from the reference based on their distributions, and the sample size is then estimated and summarized. Users can also use their own preliminary experimental data sets as reference to achieve more reliable results. A user-friendly web interface for this method is provided at http://cqs.mc.vanderbilt.edu/shiny/RnaSeqSampleSize/.
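In the same spirit, though not the published analytic NB-based method, here is a simulation sketch: resample per-gene means and dispersions from a reference distribution, simulate negative-binomial counts for two groups at a candidate sample size, and report the fraction of truly changed genes detected. The reference distributions, fold change, test statistic and lack of multiple-testing correction are all assumptions made only for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def simulate_counts(mean, dispersion, n, fold_change=1.0):
    """Draw NB counts with var = m + dispersion * m^2 (a common RNA-Seq parameterization)."""
    m = mean * fold_change
    size = 1.0 / dispersion                 # scipy/numpy "n" parameter
    p = size / (size + m)
    return rng.negative_binomial(size, p, n)

def estimated_power(ref_means, ref_disps, n_per_group, fold_change=2.0,
                    n_genes=2000, alpha=0.05):
    """Fraction of truly changed genes called significant at `alpha`
    (per-gene Welch t-test on log counts; no multiple-testing correction)."""
    idx = rng.integers(0, len(ref_means), n_genes)   # resample genes from the reference
    detected = 0
    for m, d in zip(np.asarray(ref_means)[idx], np.asarray(ref_disps)[idx]):
        a = np.log2(simulate_counts(m, d, n_per_group) + 1)
        b = np.log2(simulate_counts(m, d, n_per_group, fold_change) + 1)
        if stats.ttest_ind(a, b, equal_var=False).pvalue < alpha:
            detected += 1
    return detected / n_genes

# Hypothetical reference distributions standing in for counts/dispersions from real data.
ref_means = rng.lognormal(mean=4.0, sigma=1.5, size=5000)
ref_disps = 0.2 + 1.0 / np.sqrt(ref_means)           # dispersion shrinking with expression
for n in (3, 5, 8, 12):
    print(n, round(estimated_power(ref_means, ref_disps, n), 3))
```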

N34 - Calculation of loop probabilities using an RNA partition function
  • Michael Sloma, University of Rochester Medical Center, United States

Short Abstract: RNA secondary structure prediction is widely used to analyze RNA sequences. In an RNA partition function calculation, free energy nearest neighbor parameters are used in a dynamic programming algorithm to estimate statistical properties of the secondary structure ensemble. Previously, partition functions have largely been used to estimate the probability that a given pair of nucleotides forms a base pair, the stacking probability of two pairs, or the accessibility to binding of continuous nucleotides. Here, we show that an RNA partition function can also be used to analytically calculate the probability of hairpin loops, internal loops, bulge loops, and multibranch loops at a given position. Benchmarking on a large number of RNA sequences with known secondary structures indicated that loops that were calculated to be more probable were more likely to be present in the known structure. This calculation is important for drug discovery because small molecules are known to bind secondary structure motifs that could be found in RNAs with this method. [1]

[1] Velagapudi S.P., Gallo S.M., Disney M.D. Nature Chemical Biology advance online publication, 9 February 2014 (doi:10.1038/nchembio.1452)

N35 - SplAdder: Integrated Quantification, Visualization and Differential Analysis of Alternative Splicing
  • Andre Kahles, Memorial Sloan Kettering Cancer Center, United States

Short Abstract: The alternative choice of splice sites during maturation of mRNA, termed alternative splicing (AS), is one of the key mechanisms that shape the complexity of most eukaryotic transcriptomes. It ensures the required flexibility in expression from a single locus that is vital for development and regulation. The analysis of AS-events from RNA-Sequencing (RNA-Seq) data plays a central role in understanding mechanisms of gene regulation and elucidating the causes of diseases.
We present SplAdder, the first fully integrated analysis framework that can detect AS-events from RNA-Seq alignments, quantify and differentially test them between two given sample sets, and visualize the results. AS-events are detected from a given annotation or can be derived based on the given RNA-Seq evidence. The detected events can then be quantified on the same or a different set of RNA-Seq alignments. Differential analysis, employing a negative binomial test accounting for biological sources of variation within replicates, is realized with the rDiff package. SplAdder provides several visualization routines, producing publication-ready plots of the quantified splicing graph, displaying one or many events, or showing sashimi plots for different isoforms.
SplAdder can easily handle several thousand human samples and has been developed and tested on more than 3,500 whole transcriptome RNA-Seq libraries from The Cancer Genome Atlas project, on over 400 samples from the plant A. thaliana and in many more organisms, including the nematode C. elegans. The software is implemented in Python and is available as open source code. For more information visit www.bioweb.me/spladder.

N36 - Using 2D Energy Landscapes to Estimate the Kinetics of RNA Folding
  • Evan Senter, Boston College, United States

Short Abstract: RNA folding pathways play an important role in various biological processes, such as (i) the conformational switch in spliced leader RNA from Leptomonas collosoma, which controls trans-splicing of a portion of the 5′ exon, and (ii) riboswitches, portions of the 5′ untranslated region of mRNA that regulate genes by allostery. Since RNA folding pathways are determined by the energy landscape, we describe a novel algorithm, FFTbor2D, which computes the 2D projection of the energy landscape for a given RNA sequence. Given two arbitrary secondary structures A and B, both compatible with an input sequence, FFTbor2D approximates the Boltzmann probability p(x, y) = Z_{x,y}/Z for secondary structures having base pair distance x from A and y from B. This algorithm runs in O(n^5) time and O(n^2) space, an improvement over prior techniques, by using polynomial interpolation with the fast Fourier transform.

With the 2D energy landscape generated by FFTbor2D, we then produce a model for RNA folding kinetics that can compute the mean first passage time (MFPT) in a deterministic and efficient manner, using a new software package we call Hermes. With a carefully designed set of 1000 RNA sequences, we compare our new approach to the de facto standard, Kinfold, and find that both correlate with the mathematically exact MFPT: we observe Pearson correlation coefficients of 78.9% for Kinfold versus 60.3% for our coarse-grained model. These results motivate the use of Hermes to quickly approximate the kinetics of RNA folding, since the efficiency of our algorithm makes the analysis of large RNA sequences tractable.
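
For context, here is a minimal sketch of the standard mean first passage time computation on a coarse-grained Markov chain over the 2D landscape states; this is a generic linear-system formulation, not the Hermes implementation:

    import numpy as np

    def mean_first_passage_time(P, target):
        """MFPT from every state to `target` for a Markov chain with transition matrix P.
        Solves (I - Q) h = 1, where Q is P with the target row and column removed."""
        n = P.shape[0]
        keep = [i for i in range(n) if i != target]
        Q = P[np.ix_(keep, keep)]
        h = np.linalg.solve(np.eye(n - 1) - Q, np.ones(n - 1))
        mfpt = np.zeros(n)
        mfpt[keep] = h          # mfpt[target] stays 0 by definition
        return mfpt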

N37 - Common Secondary Structure Prediction for RNA Homologs with Domain Insertions: Dynalign II
  • Yinghan Fu, University of Rochester, United States

Short Abstract: Automated comparative prediction of RNA secondary structure is an important tool for non-coding RNA identification and functional studies, and it can be executed fast enough for genome-scale research.
Domain insertions, however, where a structural motif is inserted in a sequence (with respect to its homologs), are prevalent. This phenomenon can affect the accuracy of RNA secondary structure prediction and has not been effectively accounted for. In this project, we introduce a novel methodology for handling domain insertions during common RNA secondary structure prediction. We demonstrate the methodology by developing Dynalign II, an update to the dynamic programming-based Dynalign algorithm for common secondary structure prediction for two RNA homologs. Our update introduces recursions that explicitly allow and account for inserted domains, at a negligible increase in computational cost, using precomputed information from single-sequence structure prediction.
Benchmarks on ncRNA families with domain insertions validate the proposed method. Dynalign II has improved accuracy over Dynalign, attaining an 80.8% sensitivity (compared with 14.4% for Dynalign) and an 88.3% positive predictive value over base pairs occurring in inserted domains for tRNA, and a 69.3% sensitivity (compared with 44.8% for Dynalign) and a 60.0% positive predictive value (PPV) over base pairs in inserted domains for RNase P. Overall, Dynalign II also exhibits a statistically significant improvement in sensitivity and PPV compared with Dynalign. The proposed method can serve as a starting point for improving automated comparative genomic analysis tools for RNA.

N38 - Standards-based management of Specimen derived DNA sequences
  • Satpal Bilkhu, Agriculture and Agri-Food Canada, Canada

Short Abstract: Agriculture and Agri-Food Canada (AAFC) Ottawa is home to a world-class taxonomy program based on Canada’s national Agricultural Collections for Botany (DAO), Mycology (DAOM & CCFC) and Entomology (CNC). These collections are valuable resources of authoritative identification material for DNA barcoding, metagenomic sequence identification and whole-genome sequencing applications. AAFC’s internally developed web application, termed SeqDB, manages source specimen information, DNA extractions, PCR reactions, and sequencing reactions leading to DNA sequences. SeqDB is built on an enterprise-ready Java stack consisting of Hibernate, Spring, and Struts 2, and permits local user management as well as corporate directory integration. The database adheres to the Taxonomic Databases Working Group (TDWG) Darwin Core standard for Biodiversity Occurrence Data, as well as the Genomic Standards Consortium (GSC) Minimum Information about any (x) Sequence (MIxS) specification. The database will be further updated to reflect the Global Genome Biodiversity Network (GGBN) Data Standards when they are finalized. We are actively exploring options to make the source code for the database available to the public. The next iteration of SeqDB will see an increased focus on the management of metagenomics data in the context of specimen-based reference sequences, integration with community-developed collection management software, and seamless integration with the Galaxy workflow system for analysis of SeqDB-managed data.

N39 - TEToolkit: A Tool to Analyze Transposable Elements from Next-Generation Sequencing Data
  • Ying Jin, Cold Spring Harbor Laboratory, United States

Short Abstract: Transposons are mobile DNA elements that constitute a large fraction of most eukaryotic genomes. These parasitic genetic elements propagate by multiplying within the genomes of host germ cells. While the majority of TE copies are nonfunctional, a subset has retained the ability to mobilize when host control mechanisms are compromised. Because of their potential to copy themselves and insert into new genomic locations, as well as to generate enormous levels of expression, transposable elements present a massive endogenous reservoir of genomic instability and cellular toxicity. However, reads deriving from transposon sequences have largely been ignored in most high-throughput sequencing datasets. Understanding the contribution of transposons to cellular processes requires integration across multiple types of sequencing data. Analyzing these datasets requires dedicated bioinformatic algorithms to handle the complexities of mapping and accounting for short reads from highly repetitive regions of the genome. Here, we present TEToolkit, a software package for efficient quantification and differential analysis of repetitive regions in sequencing studies such as ChIP-seq, CLIP-seq, and RNA-seq assays. TEToolkit provides both element-wise and individual insertion-level abundance estimation. Applications of TEToolkit reveal a number of correctly 'rescued' genomic sites and uncover important roles of TEs in certain biological processes.

N40 - A computational approach for decoding microRNA targets using high-throughput CLIP and CLASH sequencing
  • Chih-Hung Chou, National Chiao Tung University, Taiwan

Short Abstract: MicroRNAs (miRNAs) are a group of small non-coding RNA molecules that play a critical role in post-transcriptional regulation of gene expression. To reveal miRNA-target interactions (MTIs), scientists have typically applied miRNA target prediction tools to select candidates and then validated them by reporter assays. However, prediction tools produce many false positives, and experimental validation remains time-consuming and unsuited to large-scale screening. Recently, high-throughput CLIP (crosslinking immunoprecipitation) sequencing has been widely used to identify miRNA-target interactions and has also revealed atypical miRNA targets. In addition, many CLIP-seq-related protocols have been developed, such as HITS-CLIP (high-throughput sequencing of RNA isolated by CLIP), PAR-CLIP (photoactivatable-ribonucleoside-enhanced CLIP), iCLIP (individual-nucleotide resolution CLIP), iPAR-CLIP (in vivo PAR-CLIP), and CLASH (crosslinking ligation and sequencing of hybrids). However, no suitable platform exists for the analysis, unification and integration of all these datasets. Here we develop the first computational tool to comprehensively analyze all CLIP-seq-related data specifically for decoding the miRNA targetome. Compared with existing CLIP-seq tools, ours provides richer visualization and can infer possible biological mechanisms by reconstructing miRNA-target interaction networks and performing functional annotation. In summary, our tool for analyzing CLIP-seq data may reveal further biological mechanisms and molecular functions of miRNA targeting.

N41 - Horizontal Gene Transfer Incident Detector (HGT-ID): A novel computational system to identify horizontal gene transfer candidates from sequencing data
  • Krishna Kalari, Mayo clinic, United States

Short Abstract: It has been shown that breast tissue contains diverse bacterial communities, and it is suspected that the composition of the breast microbiota is altered in breast tumors. Close contact with microbes has been shown to facilitate the integration of microbial DNA into the human genome through horizontal gene transfer (HGT). We have developed a computational workflow (HGT-ID) to identify HGT candidates from next-generation sequencing data and applied it to The Cancer Genome Atlas (TCGA) breast cancer data. Specifically, we investigate whether breast cancer subtypes are potentially altered through genetic interactions with the microbial population in the breast. We extract paired-end reads in which one mate maps to the human genome and the other is unaligned, as well as read pairs in which neither mate aligns to human. These reads are then mapped to 81,089 viral, 4,984 bacterial and 39,963 human microbial genomes or partial genomes. Upon identification of read pairs in which one end maps to human and the other maps to a microbial genome (HGT candidates), we apply additional filtering criteria: coverage of the viral or bacterial genome should be at least 30% of its length, and there should be more than 50 reads supporting the HGT candidate. To further reduce the false discovery rate, we also process the nominated HGT candidates through the USEARCH algorithm to scan both protein and nucleotide databases. We are currently processing TCGA breast cancer data through the workflow to compare HGT candidates across breast tumor subtypes, which may reveal significant associations with known oncogenes or drug-related pathways.
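
A minimal sketch of the filtering step described above, using hypothetical field names rather than HGT-ID's actual data structures:

    def passes_hgt_filters(candidate, min_genome_fraction=0.30, min_supporting_reads=50):
        """Mirror the criteria stated in the abstract: the microbial genome must be
        covered over at least 30% of its length, and more than 50 reads must support
        the HGT candidate. `candidate` is assumed to expose covered_bases,
        genome_length and supporting_reads attributes (illustrative names only)."""
        fraction_covered = candidate.covered_bases / candidate.genome_length
        return (fraction_covered >= min_genome_fraction
                and candidate.supporting_reads > min_supporting_reads)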

N42 - ChIP-Seq and evolution of transcription factor binding sites in Mycobacteria
  • Anna Lyubetskaya, Boston University, United States

Short Abstract: Bacterial regulons evolve through changes in regulatory protein sequences, their binding sites, and their target genes; a study of these complex interconnected processes heavily depends on computational methods and is limited by the absence of experimental binding and complementary expression data in related organisms. To study the regulon evolution in bacteria, we performed ChIP-Seq and expression experiments for six conserved transcription factors for two to four Mycobacteria relatives (M. tuberculosis, M. avium, M. smegmatis, Rhodococcus sp.). We studied the conservation of binding sites and target genes by comparing binding site occupancy (ChIP-Seq coverage) and strength (predicted motif score) to binding site impact on its target gene transcription.
Many highly conserved binding sites were located in promoter areas of their target genes and enriched for both strong binding motifs and high expression impact. However, some strong conserved binding motifs were not bound experimentally. In some cases, a conserved binding motif was bound in one organism and not bound in another, which indicated differences in DNA accessibility. We found conserved binding sites within genes, as well as conserved sites located in convergent areas of the genome. We detected conserved site configurations, some of which were indicative of DNA looping.
Regulon composition was significantly less conserved than expected from the overall gene conservation level across Mycobacteria. In particular, M. avium had an unusually small kstR regulon. We observed a significant overlap between the targets of some regulons, for example LSR2, a nucleoid-associated protein, and orthologs of Rv0081; however, the two had distinct transcriptional roles.

N43 - Classifying RNA-seq runs by organ of origin and other features using machine learning
  • Yasunobu Okamura, Tohoku University, Japan

Short Abstract: The volume of published RNA-seq data is increasing rapidly. Although many RNA-seq and other short-read datasets are registered in the SRA, the annotations of these data are typically insufficient for re-analysis. Descriptions of runs, samples, experiments and studies are usually written in natural language, often in abbreviated form, rather than in a machine-friendly form. To perform large-scale re-analyses, such as meta-analysis or gene co-expression analysis, comprehensive machine-friendly annotations are required.
In this study, we automatically classify RNA-seq runs, based on per-gene read counts, by features such as (1) organ of origin, (2) cell line, or (3) tumor status. Using a support vector machine on 574 runs from 17 organs, we correctly predicted the organ for 85.9% of runs. We applied a similar procedure to cell lines and tumor status: the cell line of origin was correctly predicted for 95.2% of 228 runs, and tumor/normal status for 90.1% of 111 runs. Our results are useful for large-scale re-analysis and for annotating RNA-seq data; for example, gene co-expression within a single organ can help to elucidate pathways and systems in that organ. We will also report the contribution of individual genes to the classification of runs.
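
A minimal sketch of this kind of organ classifier using scikit-learn; the data below are random placeholders standing in for the per-gene read counts of the SRA runs, not the authors' code or data:

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = np.log1p(rng.poisson(5.0, size=(120, 2000)).astype(float))  # runs x genes, log counts
    y = rng.integers(0, 5, size=120)   # placeholder organ labels (the poster uses 17 organs, 574 runs)

    clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
    scores = cross_val_score(clf, X, y, cv=5)
    print("cross-validated accuracy: %.3f" % scores.mean())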

N44 - Reconstructing the gene phylogeny of the arrestin family: annotation of multi-exon genes
  • Henrike Indrischek, University Leipzig, Germany

Short Abstract: The cytosolic arrestin proteins mediate the desensitization of G-protein coupled receptors and activate effector molecules within signaling cascades. Since phylogenetic information is incomplete for arrestins, this study aims to annotate arrestin homologs in order to trace their evolution back to the last common ancestor.
Each of the four human paralogs is encoded by 15 or 16 relatively short exons interleaved with intronic regions of up to 60,000 nt in length. As a consequence, automatic methods frequently fail to correctly annotate arrestin genes. We manually improved the Ensembl and GenBank annotations of arrestins using BLAST, transcriptome data and synteny information to obtain a reliable initial alignment for each paralog. We then built a pipeline to automatically annotate homologous protein sequences in three steps: first, we build exon-specific hidden Markov models based on a protein alignment; second, we use the models to search for the best-scoring hits within related genomes; third, we assign the hits to their paralogs based on alignment scores and synteny.
Starting from the annotation in human, we could identify four paralogs in Mammalia and Sauropsida. Within the mammalian superorder Afrotheria one paralog appears to be lost or severely truncated suggesting at least a partial complementation of the missing function by retained paralogs. Due to the teleost-specific genome duplication we find six to seven arrestin paralogs in pufferfish and zebrafish as expected.
In conclusion, we showed that our approach can be used to automatically annotate the sequences of homologous proteins encoded by multi-exon genes, even for species with low-quality genomes.
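
A hedged sketch of the exon-wise search step using the standard HMMER command-line tools (assuming HMMER is installed; the file naming and inputs are illustrative, not the authors' pipeline, and paralog assignment by score and synteny happens downstream):

    import subprocess

    def search_exon_models(exon_alignments, proteins_fasta):
        """For each per-exon alignment (Stockholm format), build a profile HMM and
        search it against the candidate protein set of a related genome."""
        hit_tables = {}
        for exon_id, aln_path in exon_alignments.items():
            hmm_path = f"{exon_id}.hmm"
            tbl_path = f"{exon_id}.tbl"
            subprocess.run(["hmmbuild", hmm_path, aln_path], check=True)
            subprocess.run(["hmmsearch", "--tblout", tbl_path, hmm_path, proteins_fasta],
                           check=True)
            hit_tables[exon_id] = tbl_path   # parse best-scoring hits downstream
        return hit_tables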

N45 - Realign PacBio Reads to Improve Variant Call Performance
  • Xiaoli Jiao, Frederick National Laboratory/Leidos Biomedical Research, Inc., United States

Short Abstract: Pacific Biosciences (PacBio) SMRT (Single Molecule, Real Time) technology generates contiguously long and unbiased reads, which greatly facilitates characterization of genetic complexity and diversity, including rare SNPs, indels, structural variants, haplotypes and phasing. Alignment is a prerequisite and plays a key role in many applications. However, to account for the high insertion and deletion error rate of PacBio data, aligners must set the gap-open penalty lower than the base mismatch penalty in order to maximize alignment performance. Although this allows most reads to align successfully, it has side effects: aligners may position deletions differently on forward and reverse strands and sometimes mask a true SNP inside an insertion, which is referred to as reference bias. It is important to note, however, that the discrepancy between forward and reverse alignments and the reference bias are artifacts of the alignment process, not of the data, and can be greatly reduced by locally realigning the reads based on the reference and the data. Presently, the available software for local realignment is not compatible with the length and high indel rate of PacBio data. We developed methods to handle these problems and will demonstrate the improvement in variant calling performance on several datasets. In addition, we devised new mechanisms to automatically examine the statistics of the base calls in a local region to identify possible alternative references and thereby facilitate local realignment.

N46 - Analysis of Sequence Variation in the Surfaceome of 23 Colorectal Cancer Cell Lines
  • Elisa Donnard, Sírio-Libanês Hospital, Brazil

Short Abstract: Colorectal cancer cells contain a large number of somatic mutations, ranging from hundreds (microsatellite stable) to thousands (microsatellite unstable), which can offer new therapeutic targets for known drugs and can also generate immunogenic epitopes for use as vaccines for tumor control. The subset of genes coding for cell surface proteins (the surfaceome) is a highly interesting target for the analysis of non-synonymous mutations and has not been properly explored to date. We sequenced the portion of the exome corresponding to 3594 genes coding for transmembrane proteins located at the cell surface. For each of the 23 cell lines we called SNVs and InDels and removed common polymorphisms. Non-synonymous SNVs and frameshift InDels were submitted to several algorithms to assess their functional impact or druggable status. The protein fragment containing each non-synonymous SNV was used as input to identify new immunogenic epitopes present in these cancer cell lines. Lastly, we combined expression data from previous studies with local RNA-seq data to provide new insight into the percentage of expressed mutated genes and predicted epitopes. A large set of somatic mutations comprising SNVs and InDels was identified, ranging from 40 to 1000 per cell line. A subset of these somatic SNVs is located in druggable genes (on average 37 per cell line). New epitopes resulting from these mutations were identified (on average 3 per cell line). We also identified an average of 5 mutations affecting kinases per cell line. The mutation landscape in these cell lines also includes disruption of phosphorylation sites and clusters of juxtamembrane mutations.

N47 - Examination of antibody dynamics after vaccination against anthrax
  • Kaitlin Sawatzki, Boston University, United States

Short Abstract: The introduction of vaccination was a paradigm-shifting event in modern healthcare. A major mechanism of protection afforded by vaccination is the development of pathogen-specific antibodies. During repeated vaccination with the same immunogen, as occurs in the delivery of a vaccine series, affinity maturation leads to the development of antibodies with increased affinity for the immunizing agent. However, the mutation and selection events during this process are not well understood. We have used recent advances in next-generation sequencing to investigate individual antibody repertoires throughout vaccination. We followed adult volunteers receiving 5 doses of Anthrax Vaccine Adsorbed (AVA) and analyzed B cell immunoglobulin heavy chains. Our lab has developed sophisticated statistical tools to analyze the diverse V-D-J region that encodes the variable region of antibodies. From our data, we can infer affinity maturation events within clonally related families, which may have future predictive value for functionally important mutations. In addition, our data indicate that clonal lineages may be more dynamic than previously recognized, with naïve, memory and plasmacyte B cells all regularly preceding expansion events. Further investigation of these observations will potentially contribute to the future development of improved vaccines.

N48 - Detection Efficiency for INDELs in Clinical Test Amplicon Sequencing Panel
  • Eric Klee, Mayo Clinic, United States

Short Abstract: The emergence of next generation sequencing as a diagnostic platform necessitates an increased understanding of the performance and bias associated with bioinformatics analyses. INDEL detection is a critical component of pathogenic alteration identification. We previously demonstrated a significant relationship between INDEL length and detection accuracy. In this study, we expanded our analyses to determine position and frequency influences on INDEL detection within a clinical amplicon sequencing panel.

We created synthetic datasets by introducing site-specific and frequency-specific INDELs directly into existing next-generation short sequence reads. This enabled variants of varying size and location to be studied within the context of a real clinical sample. The synthetic samples were processed using our clinical sequencing workflow and the impact of INDEL position, frequency, and size on detection accuracy was assessed.

Our results revealed a significant bias associated with position relative to amplicon ends that could significantly impact clinical diagnostic performance. We also confirmed the previously reported INDEL length bias and showed a dependence on frequency as well. While in silico data are not always representative of real clinical samples, they can provide a means for systematically profiling performance limitations. By revealing the analytical biases via this study, our bioinformatics and laboratory processes can be adjusted to account for these effects and improve the overall assay performance.
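
As a toy illustration of how a variant can be spiked into a chosen fraction of reads covering a reference position (purely illustrative; the clinical workflow above operates on full alignments and also covers deletions and systematic position/frequency sweeps):

    import random

    def spike_insertion(reads, read_starts, pos, insert_seq, allele_fraction, seed=0):
        """Insert `insert_seq` at reference position `pos` into roughly
        `allele_fraction` of the reads that cover that position.
        `reads` are plain sequences; `read_starts` are their reference start positions."""
        rng = random.Random(seed)
        out = []
        for seq, start in zip(reads, read_starts):
            offset = pos - start
            covers = 0 <= offset < len(seq)
            if covers and rng.random() < allele_fraction:
                seq = seq[:offset] + insert_seq + seq[offset:]
            out.append(seq)
        return out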

N49 - Estimation of Isoform-specific and Allele-specific Expression from RNA-seq Data of a Genetically Diverse Population
  • Gary Churchill, The Jackson Laboratory, United States

Short Abstract: The Diversity Outbred (DO) mouse population is a new heterogeneous stock derived from the eight Collaborative Cross (CC) founder strains. The DO mice have uniformly high levels of heterozygosity and genetic diversity, and thus provide a high-resolution mapping resource for identifying key genetic factors underlying complex traits and disease.

As each DO animal is genetically unique, carrying a large number of SNPs and indels, aligning its RNA-seq reads to the reference genome is problematic. Misalignment due to unaccounted-for strain variation is common, and it is difficult to derive accurate estimates of allele-specific expression (ASE) from a single reference alignment strategy. For more accurate estimation of ASE in the DO, we align each read against a single search space that includes custom transcript sequences from all eight CC founder strains. For each read, all and only the best alignments are reported. In our EM framework, we iteratively estimate the posterior probability of where each read originates, based on the alignment profile of each read as well as a summary of how all reads align globally.

Just as many transcriptome analyses borrow information across genes for better estimation of expression means and variances, we also borrow information about each transcript across samples. The main idea is to balance individual and population-level estimates using shrinkage estimators. The degree of shrinkage decreases as the level of dispersion increases, since the data then suggest that each individual is less likely to follow a single population-level distribution.
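
A bare-bones sketch of the E- and M-steps for allocating multi-mapping reads among candidate transcripts or alleles, in generic mixture-model form; the authors' framework additionally uses per-read alignment profiles and cross-sample shrinkage, which are not shown:

    import numpy as np

    def em_abundances(compat, n_iter=200):
        """compat: reads x transcripts matrix of alignment compatibilities (e.g. 1 for
        each reported best alignment, 0 otherwise); every read is assumed to have at
        least one compatible transcript. Returns estimated transcript/allele proportions."""
        n_reads, n_tx = compat.shape
        theta = np.full(n_tx, 1.0 / n_tx)
        for _ in range(n_iter):
            resp = compat * theta                       # E-step: unnormalized posteriors
            resp /= resp.sum(axis=1, keepdims=True)     # posterior origin of each read
            theta = resp.sum(axis=0) / n_reads          # M-step: average responsibilities
        return theta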

N50 - Prediction of Antibacterial Peptides Using Support Vector Machines
  • Francy Camacho, Escuela de Ingeniería de Sistemas, Facultad de Ingenierías Físico-mecánicas, Universidad Industrial de Santander, Colombia

Short Abstract: In silico design of pharmaceuticals has been used in the prediction and design of antibacterial peptides, which are considered promising alternative antibiotics. In this study, we propose the use of Quantitative Structure-Activity Relationships (QSAR) and Support Vector Machines (SVMs) to identify synthetic antibacterial peptides active against pathogenic E. coli with a Minimal Inhibitory Concentration (MIC) below 10 µM. Before the classification step, we verified whether the descriptors used in the QSAR methodology were appropriate. For this purpose, we took a set of 200 peptide sequences with their respective MIC50 values.

Subsequently, we encoded the structural information of the sequences using descriptors. In addition, an SVM was trained, with its free parameters tuned by cross-validation. This allowed classification of antibacterial peptides with a precision of 88.47% and a correlation coefficient of 0.762. From these results, we conclude that only a few descriptors are needed to identify active antibacterial peptides.

Moreover, we carried out the above procedure using the Database of Anuran Defense Peptides (DADP). From DADP, we extracted a set of 479 peptide sequences with antibacterial activity against E. coli for which MIC values (ranging from 0.3 µM to 231 µM) are reported. We encoded the structural information of every sequence and trained an SVM with tuned free parameters. Our classification model achieved an estimated precision of 72.73% with a correlation coefficient of 0.454. These results indicate that, using QSAR and SVMs, it is possible to identify peptide sequences with a MIC below 10 µM.
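
A sketch of tuning SVM free parameters by cross-validation on peptide descriptor vectors; the descriptor computation is not shown, and the data below are random placeholders rather than the study's peptides:

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV

    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 30))        # peptides x QSAR descriptors (placeholder)
    y = rng.integers(0, 2, size=200)      # 1 if MIC < 10 uM, else 0 (placeholder)

    param_grid = {"C": [0.1, 1, 10, 100], "gamma": ["scale", 0.01, 0.1]}
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="matthews_corrcoef")
    search.fit(X, y)
    print(search.best_params_, search.best_score_)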

N51 - Characterization of protein phosphorylation in the context of evolution
  • Shujiro Okuda, Niigata University, Japan

Short Abstract: Protein phosphorylation contributes to the regulation of a broad range of cellular processes. New technologies such as high-throughput mass spectrometry have led to the discovery of massive numbers of phosphoproteins. Large-scale information on phosphosites identified by phosphoproteomics is already stored and available in databases. Employing these known phosphosite data enabled us to develop a novel method for detecting protein phosphorylation related to important cellular functions. Known phosphosites retrieved from a large-scale database were mapped to the human genome, and the regions surrounding each phosphosite were extracted. Subsequently, we clustered the sequences by homology and created original phosphomotifs. To trace the evolutionary transitions of the three phosphorylated amino acid residues, serine, threonine, and tyrosine, we performed a comparative genome analysis and investigated their conservation from yeast to human. We observed that several phosphomotifs identified by our method were frequently acquired at certain points in evolution. We found that proteins possessing such phosphomotifs were specific to several fundamental functions depending on when they were acquired. Our comparative sequence analysis of protein phosphorylation is a powerful approach to efficiently extract novel phosphomotifs involved in physiological functions and to understand complicated phosphorylation signaling cascades.

N52 - Functional and taxonomic annotation of orphan genes based on codon bias signatures
  • Eva Maria Novoa Pardo, Massachusetts Institute of Technology, United States

Short Abstract: Next-generation sequencing has provided a great opportunity to explore complex ecological systems, such as microbiomes from the human gut or oceanic samples. However, homology-based searches are not sufficient to correctly annotate most of these sequences, leaving a large number of unannotated orphan sequences.

Here we show that orphan sequences can be classified and taxonomically annotated using a novel method based exclusively on their codon usage bias signatures. We further show that these codon usage signatures can also be used to functionally annotate these genes. The method can be widely applied to taxonomically and functionally annotate metagenomic sequences that would otherwise remain uncharacterized.
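
A small sketch of the kind of feature vector such a method builds on, namely relative codon frequencies of an in-frame coding sequence; the published signatures may be normalized per amino acid, so this is an illustrative simplification:

    from collections import Counter
    from itertools import product

    CODONS = ["".join(c) for c in product("ACGT", repeat=3)]

    def codon_usage_signature(cds):
        """Return a 64-dimensional vector of relative codon frequencies for an
        in-frame coding sequence (any trailing partial codon is ignored)."""
        codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
        counts = Counter(codons)
        total = sum(counts[c] for c in CODONS) or 1
        return [counts[c] / total for c in CODONS]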

N53 - The evolution of gene encryption in Oxytricha trifallax
  • Kelsi Lindblad, Princeton University, United States

Short Abstract: The ciliate Oxytricha trifallax has a dual-genome architecture, with a somatic genome that is transcribed during asexual growth and reproduction and an encrypted germline genome that is active only during sexual conjugation. Genome differentiation involves the decryption of the germline through a complex process that eliminates ~95% of the germline sequence, and the remaining fragments assemble to form the mature somatic genome. In about 10% of cases the germline fragments are out of order with respect to their position in the somatic genome, and it remains unknown how such intricate gene scrambling patterns arose through evolution. Here we present the results of a computational screen for genes that Oxytricha recently acquired through horizontal gene transfer and that are predicted to have arrived as intact, unencrypted sequences. We then use phylogenetic comparisons to determine the approximate order in which these genes arrived and compare their germline sequences to infer how germline genes become fragmented and scrambled over time.

N54 - Modeling complex DNA sequence grammar with recurrent neural networks
  • Zhizhuo Zhang, Massachusetts Institute of Technology, United States

Short Abstract: The chromatin state or gene expression of a genomic region is largely governed by the regulatory code in the static DNA sequence, and the study of the underlying regulatory grammar has been hindered by the lack of large-scale datasets and by its complex nature. To understand this regulatory grammar, shallow learning algorithms (e.g., SVMs, random forests) have been developed that use sequence motifs or k-mers to predict chromatin state or gene expression; these consider only motif co-occurrence statistics and ignore the relative positions and orientations of motifs within regulatory regions. Biologically, some regulatory elements function as insulators, and the position of an insulator can completely change the semantic meaning of the whole regulatory region. Here, we propose a recurrent neural network (RNN) approach of the kind successfully applied in natural language processing. The RNN approach models the semantic role of each motif, each phrase (motif pair) and each whole sentence (regulatory region) in a unified high-dimensional hidden space. More interestingly, we can explore this hidden space to identify different functional clusters of motifs. The interplay among motif elements at different positions is naturally encoded during the parsing procedure in the RNN, which makes it possible to learn more complex regulatory grammars than previous approaches. Using back-propagation, the RNN can be trained against multiple regulatory outcomes (e.g., different histone marks), so that a single regulatory grammar learned by the RNN can explain them simultaneously. With simulated datasets, we show that our RNN approach can identify not only activators and repressors but also insulators, and recovers regulatory outcomes accurately.
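
A toy PyTorch sketch of a recurrent model that reads a regulatory region as a sequence of motif tokens and predicts several regulatory outcomes; the architecture and dimensions are illustrative assumptions, not the authors' network:

    import torch
    import torch.nn as nn

    class MotifGrammarRNN(nn.Module):
        """Embed motif tokens, run a GRU over the region, and predict multiple outputs
        (e.g. several histone marks) from the final hidden state."""
        def __init__(self, n_motifs, hidden=64, n_outputs=3):
            super().__init__()
            self.embed = nn.Embedding(n_motifs, hidden)
            self.rnn = nn.GRU(hidden, hidden, batch_first=True)
            self.head = nn.Linear(hidden, n_outputs)

        def forward(self, motif_ids):                  # motif_ids: (batch, region_length)
            h, _ = self.rnn(self.embed(motif_ids))
            return self.head(h[:, -1])

    model = MotifGrammarRNN(n_motifs=500)
    logits = model(torch.randint(0, 500, (8, 20)))     # 8 regions, 20 motif tokens each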

N55 - RNAalignClust: Discovering ncRNA families by sequence-structure-based clustering of multiple sequence alignments
  • Alexander Junge, Center for non-coding RNA in Technology and Health, University of Copenhagen, Denmark

Short Abstract: Clustering RNA molecules is an established approach to classify and functionally annotate non-coding RNAs (ncRNAs). The GraphClust pipeline (Heyne et al., 2012) performs a graph kernel-based clustering to identify ncRNAs related in terms of sequence and local secondary structure.
We present RNAalignClust, a novel algorithm to cluster a set of multiple sequence alignments with the aim of discovering families of ncRNAs. Each multiple alignment contains a group of RNAs derived from a common ancestor. This evolutionary relationship allows us to infer conserved sequence and structure commonalities in each group, information that previous approaches designed to cluster unaligned RNA sequences do not utilize.
RNAalignClust furthermore extends GraphClust with a graph kernel that scores both global and local structural similarities, and it produces the final result in a single, non-iterative clustering step.

Given a set of multiple alignments of evolutionary related sequences, RNAalignClust computes (sub)optimal structures for each alignment using highly conserved base pairs as constraints. Similar structural alignments are then identified using a fast nearest neighbors search approach based on hashing and graph kernel similarity notions.
To benchmark our algorithm, ncRNA families stored in the Rfam database were split into several subalignments and recovered using RNAalignClust.

RNAalignClust is a novel approach to identify families of ncRNAs related in terms of both sequence and structure from a given set of multiple alignments. It is able to take alternative secondary structures into account while still achieving a linear-time performance. This enables the analysis of complete genomes and transcriptomes within hours of computation time.

N56 - Evaluation of Quality Control Metrics for the Assessment of RNA Sequencing Performance
  • Christopher Mason, Weill Cornell Medical College, United States

Short Abstract: Advanced, high-throughput sequencing technologies allow fast, single-base-resolution scans of entire transcriptomes. Large-scale sequencing projects are producing this information for cancer research and transcriptional studies at many sites around the world. However, a comprehensive evaluation of cross-site quality checks and their impact on downstream transcriptome analysis of RNA-seq data, needed before data can be merged between sites, has not yet been established. Here, we relate a set of quality metrics for RNA-seq data to the reliability of detecting gene expression across sites from the SEQC consortium. We found that false-positive detection of differentially expressed genes (DEGs) shows extremely high inter-site variation, and that cross-site data normalization incorporating all genes improves RNA-seq quality. In addition, our use of the cross-site internal control library (#5) demonstrated that GC content and nucleotide composition (from the 20th base to the end of reads) are preparation-specific, not laboratory-specific, and we have added the coefficient of variation of gene coverage as a new quality measure for RNA-seq, thus quantifying 5’-3’ bias. These metrics and internal controls complement those currently in use and provide new granularity in assessing the quality of an RNA-seq dataset.
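
The coverage-uniformity metric added by the authors is straightforward to compute; a minimal sketch, assuming per-base coverage values along a transcript are already available:

    import numpy as np

    def coverage_cv(per_base_coverage):
        """Coefficient of variation of per-base read coverage along a transcript;
        higher values indicate less uniform (e.g. 5'-3' biased) coverage."""
        cov = np.asarray(per_base_coverage, dtype=float)
        mean = cov.mean()
        return cov.std() / mean if mean > 0 else float("nan")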

N57 - Normalization of RNA-Seq Data for Differential Expression Analysis
  • Gregory Grant, University of Pennsylvania, United States

Short Abstract: A "normalization problem" in RNA-Seq data analysis is any effect in the data that adversely affects the Type I or Type II error and can be mitigated algorithmically. Due to the nature of sequencing data, as one feature contributes more reads, the rest of the features necessarily contribute fewer; thus variability in any given feature introduces variability globally. The common normalization known as FPKM only attempts to normalize for depth of coverage and feature length, and for differential expression analysis, depth of coverage is the only factor it addresses. We have identified seven additional factors that introduce considerable global variability into feature quantifications: ribosomal content, mitochondrial content, the balance of exonic and non-exonic signal, the balance between intronic and intergenic signal, the fragment length distribution, the balance between unique and multi-mappers, and the 3' bias of the signal. We further propose methods to normalize for all of these factors based on read resampling. Using a dataset with six experimental conditions, each with eight biological replicates, we demonstrate the extent to which these factors contribute significantly to variation. We compare our method with the standard TopHat/Cufflinks/Cuffdiff pipeline and demonstrate a considerable gain in the power of the statistical analysis. Software to perform the analysis is freely available through a GitHub repository.
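
For reference, the FPKM normalization discussed above rescales only by feature length and sequencing depth; a minimal sketch of that calculation (not the authors' resampling-based method):

    def fpkm(read_count, feature_length_bp, total_mapped_reads):
        """Fragments per kilobase of feature per million mapped reads."""
        return read_count / ((feature_length_bp / 1e3) * (total_mapped_reads / 1e6))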

N58 - Regression-based multiscale decomposition of the ChIP-Seq signal reveals novel quantifying factors
  • Parameswaran Ramachandran, Ottawa Hospital Research Institute, Canada

Short Abstract: Although the nature of the "ChIP-Seq signal" has been analyzed in previous studies, a clear quantification of the roles of factors such as mappability, chromatin accessibility, and different notions of control at a genome-wide scale is missing from the literature. Using data from a large set of ChIP-Seq experiments spanning multiple cell lines generously made available by the ENCODE Consortium, we build a regression-based model that not only provides a means of achieving such a quantification, but also analyzes the ChIP-Seq signal at multiple scales. The model is validated on a subset of the datasets considered by correlating our predictions of potential binding sites around the transcription start sites (TSSs) with the motif densities found in those same regions. Results obtained using our model show an improved separation of the binding portion of the ChIP-Seq signal as compared to that obtained using existing peak-calling methodologies such as MACS.
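
A hedged sketch of the general idea of regressing binned ChIP-Seq coverage on covariates such as mappability, chromatin accessibility and control coverage; the covariates, bin counts and coefficients below are simulated placeholders, not the authors' model or the ENCODE data:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(2)
    covariates = rng.random((10000, 3))            # mappability, accessibility, control coverage
    chip = covariates @ np.array([0.5, 1.2, 0.8]) + rng.normal(0.0, 0.1, 10000)

    model = LinearRegression().fit(covariates, chip)
    residual = chip - model.predict(covariates)    # signal not explained by the covariates
    print(model.coef_)                             # per-covariate contribution to background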

