Poster numbers will be assigned May 30th.
If you cannot find your poster below, that probably means you have not yet confirmed that you will be attending ISMB/ECCB 2015. To confirm your poster, find the poster acceptance email; it contains a confirmation link. Click on it and follow the instructions.

If you need further assistance, please contact and provide your poster title or submission ID.

Category N - 'Sequence Analysis'
N01 - The PARA-suite: an easy to use toolkit for the analysis of error prone CLIP sequencing data
Andreas Kloetgen, Department of Algorithmic Bioinformatics, Heinrich Heine University, Germany
Arndt Borkhardt, Department of Pediatric Oncology, Hematology and Clinical Immunology, Heinrich Heine University, Germany
Jessica Hoell, Department of Pediatric Oncology, Hematology and Clinical Immunology, Heinrich Heine University, Germany
Alice McHardy, Computational Biology of Infection Research, Helmholtz Center for Infection Research, Germany
Short Abstract: During recent years, many RNA-binding proteins (RBPs) have been described as key players in posttranscriptional gene regulation. Thus, it is not surprising that RBPs showing aberrant functions or changes in expression patterns have been associated with disease progression or even with carcinogenesis, highlighting the importance of protein–RNA interactions in cellular development.

The advent of next-generation sequencing technologies has impacted many areas of research, including the analysis of protein–RNA interactions. Experimental protocols such as cross-linking and immunoprecipitation (CLIP) are often followed by deep sequencing and have enabled the genome-wide identification of protein–RNA interactions. Specific protocols such as PAR-CLIP or HITS-CLIP introduce single-nucleotide conversions or even insertions or deletions during reverse transcription, a necessary step before sequencing. Due to these "mutations", the analysis of the produced short sequencing reads, which represent the functional network of the RBP, is more complicated than the analysis of sequencing reads produced by regular transcriptome sequencing.

We have developed the PAR-CLIP Analyzer suite (PARA-suite) for the analysis of error-prone short sequencing reads, for example those produced by PAR-CLIP or HITS-CLIP. It enables the user to infer an empirical error profile underlying the sequencing run, to apply short-read alignment using this error profile, and to combine the results of genomic and transcriptomic read alignments. We also implemented a PAR-CLIP read simulator, available via the PARA-suite, to show the improvements in alignment accuracy of our alignment strategy compared to the approaches of previous CLIP studies. The PARA-suite is licensed under the GNU GPLv3.
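The error-profile idea lends itself to a compact illustration. The sketch below is not the PARA-suite's actual implementation — the function name and toy data are invented — but it shows the kind of empirical per-substitution rate estimate a PAR-CLIP-aware aligner could exploit, computed from aligned (reference, read) pairs:

```python
from collections import Counter

def error_profile(pairs):
    """Estimate an empirical substitution profile from aligned
    (reference, read) sequence pairs of equal length."""
    subs = Counter()
    ref_counts = Counter()
    for ref, read in pairs:
        for r, q in zip(ref.upper(), read.upper()):
            ref_counts[r] += 1
            if r != q:
                subs[(r, q)] += 1
    # rate of each substitution type, normalised by reference base count
    return {f"{r}>{q}": n / ref_counts[r] for (r, q), n in subs.items()}

# PAR-CLIP-like toy data: T>C conversions dominate
pairs = [("GATTACA", "GACTACA"), ("TTTT", "TCTC")]
profile = error_profile(pairs)
```

On real PAR-CLIP data, T>C conversions would dominate such a profile, as they do in the toy pairs above.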
N02 - Computational Prediction of Modular Domain-Peptide Interactions
Kousik Kundu, University of Freiburg, Germany
Fabrizio Costa, University of Freiburg, Germany
Martin Mann, University of Freiburg, Germany
Rolf Backofen, University of Freiburg, Germany
Short Abstract: Protein-protein interactions are among the most essential cellular processes in eukaryotes, involved in many important biological activities such as signal transduction and the maintenance of cell polarity. Many protein-protein interactions in cellular signaling are mediated by modular protein domains. Peptide recognition modules (PRMs) are an important subclass of modular protein domains that specifically recognize short linear peptides to mediate various post-translational modifications. Computational identification of modular domain-peptide interactions is an open challenge with high relevance. In this study, we applied machine learning approaches to identify the binding specificity of three modular protein domains (i.e. SH2, SH3 and PDZ domains). All models are based on support vector machines with different kernel functions, ranging from polynomial to Gaussian to advanced graph kernels. In this way, we model non-linear interactions between amino acid residues. Additionally, a powerful semi-supervised technique was used to tackle the data-imbalance problem. We validated our results on manually curated data sets and achieved competitive performance against state-of-the-art approaches. Finally, we developed an interactive and easy-to-use webserver, MoDPepInt (Modular Domain-Peptide Interactions), for the prediction of the binding partners of the aforementioned modular protein domains. Currently, we offer models for SH2, SH3 and PDZ domains via the tools SH2PepInt, SH3PepInt and PDZPepInt. More specifically, our server offers predictions for 51 human SH2 domains and 69 human SH3 domains via single-domain models, and predictions for 226 PDZ domains across several species via 43 multi-domain models. MoDPepInt includes the largest number of models and offers a comprehensive domain-peptide prediction system in a single platform.
N03 - Stalled ribosomes and the control of translation in the infective metacyclic trypomastigote forms of Trypanosoma cruzi.
Saloe Bispo, ICC-Fiocruz, Brazil
Fabiola Holetz, ICC-Fiocruz, Brazil
Ashton Belew, UMD, United States
Eloise Slompo, ICC-Fiocruz, Brazil
Bruno Dallagiovanna, ICC-Fiocruz, Brazil
Short Abstract: In Trypanosoma cruzi, gene expression regulation occurs mainly at the posttranscriptional level. Previous studies from our group have shown that mRNAs encoding proteins of the small subunit processome complex (SSU) are more abundant in metacyclic trypomastigotes (MT) than in epimastigote (E) stages. However, even though a greater amount of TcSof1, TcImp4 and TcDhr1 mRNA is observed associated with polysomes in MTs, no protein products were observed and no new synthesis was detected.
The trapping of mRNAs in polysomes suggests a mechanism of translational downregulation.
The aim of our study is to understand the gene expression regulation processes involved in the repression of translation in MT forms of T. cruzi, and in particular to identify the presence of stalled ribosomes and the sub-population of mRNAs regulated by this mechanism.
We monitored the translational profiles in E and MT using RNA-seq and ribosome profiling. Using harringtonine, a drug that binds 60S ribosomes and has been reported to block initiation, followed by ribosome profiling analysis, we observed the presence of ribosome footprints along the coding region of the TcSof1, TcImp4 and TcDhr1 transcripts in MTs. This result strongly suggests that ribosomes are stalled on the mRNA. Moreover, several transcripts showed this pattern of footprints along the coding sequence even after harringtonine treatment, with no protein product detected. Gene ontology analysis showed that these co-regulated genes are associated at the functional level. Our results suggest that a functionally related population of transcripts is negatively regulated at the translational level by a mechanism that stalls ribosomes along the mRNA.
N04 - The TOPCONS web server for signal peptide and membrane protein topology prediction
Konstantinos Tsirigos, , Sweden
Christoph Peters, Department of Biochemistry and Biophysics & Science for Life Laboratory, Stockholm University, Sweden
Nanjiang Shu, Department of Biochemistry and Biophysics & Science for Life Laboratory, Stockholm University, Sweden
Lukas Käll, Department of Biochemistry and Biophysics & Science for Life Laboratory, Stockholm University, Sweden
Arne Elofsson, Department of Biochemistry and Biophysics & Science for Life Laboratory, Stockholm University, Sweden
Short Abstract: Topology prediction of α-helical transmembrane proteins is a crucial step towards understanding their function in organisms. Recently, a number of computational methods have been developed, and several studies indicate that methods which incorporate multiple sequence alignments perform best. Two major issues regarding membrane protein topology prediction are the processing time and the efficient separation between membrane and non-membrane proteins, as well as between the N-terminal signal peptide and a transmembrane region. The TOPCONS method, initially presented in 2009, has been shown to perform best in several benchmark studies. We present here some major updates and improvements to the server; we now rely on a much bigger database to create the profiles but, at the same time, have kept the overall processing time at the same level as previously. The ability to distinguish between a membrane and a non-membrane protein is now much higher, especially for globular proteins that contain a signal peptide, for which the previous version of TOPCONS falsely predicted an N-terminal membrane helix. Moreover, the method is now able to perform signal peptide prediction satisfactorily, reaching the levels of even dedicated software for that purpose. The rapid prediction response and the batch submission functionality of the server ensure that a user can perform a whole-genome scan within a reasonable time. In the benchmark we carried out, TOPCONS performs 4% better overall (and 10% better on TM proteins alone) compared to the existing state-of-the-art methods.
N05 - Computational identification and characterization of RNA binding proteins in S. Typhimurium
Malvika Sharan, IMIB-ZINF, University of Wuerzburg, Germany
Charlotte Michaux, IMIB-ZINF, University of Wuerzburg, Germany
Nora C. Marbaniang, IMIB-ZINF, University of Wuerzburg, Germany
Erik Holmquist, IMIB-ZINF, University of Wuerzburg, Germany
Joerg Vogel, IMIB-ZINF, University of Wuerzburg, Germany
Short Abstract: Several studies have been carried out using experimental techniques such as cross-linking and immunoprecipitation, which have enabled the characterization of RNA-binding motifs and their regulatory mechanisms in RBPs. Unfortunately, large-scale screening for RBPs by those methods is expensive and time consuming; therefore, such studies have been performed for only a limited selection of organisms. Recently, computational approaches have been developed for the identification and characterization of protein-RNA interactions, but these methods have not been adapted for proteome-wide identification of RBPs.
Here, we aim to develop an efficient and cost-effective computational method for a systematic proteome-wide screening of RBPs in bacteria. We have focused on identifying functional domains in proteins that may interact with RNA and on predicting their regulatory roles and mechanisms. For this purpose, we assembled sequence-based approaches for the functional characterization of proteins in an automated pipeline called APRICOT (Analysing Protein RNA Interactions by Computational Techniques). Using this pipeline, a proteome-wide prediction of RBPs was carried out in Salmonella Typhimurium, and selected candidates are being tested by sequencing of immunoprecipitated RNA. The experimental results are used recursively to improve the computational RBP identification.
In the future, APRICOT will be extended to assess the potential of a protein to interact with other molecular components in a variety of biological systems.
N06 - FunFHMMer: protein function predictions using CATH-Gene3D functional family assignments
Sayoni Das, University College London, United Kingdom
David Lee, University College London, United Kingdom
Ian Sillitoe, University College London, United Kingdom
Natalie Dawson, University College London, United Kingdom
Jonathan Lees, University College London, United Kingdom
Christine Orengo, University College London, United Kingdom
Short Abstract: Due to the rapid increase in international genome-sequencing initiatives and structural genomics projects, a large amount of protein sequence and structural data are accumulating. Since experimental characterisation of such huge amounts of data is not feasible, computational approaches that can predict protein functions are essential.

We propose a domain-based method for predicting protein functions, based on the subclassification of protein domain superfamilies in CATH-Gene3D into functional families (FunFams) using a new family identification protocol, FunFHMMer. It recognises highly conserved positions and specificity-determining positions in cluster alignments and uses this information to ensure the functional coherence of identified families. The functional purity of the families has been assessed using a set of manually curated, mechanistically diverse enzyme superfamilies, the consistency of EC annotations within the superfamilies, and an in-house residue enrichment analysis based on the percentage of conserved residues in a family that coincide with experimentally determined functional residues.

Query sequences are scanned against the library of CATH FunFam HMMs and assigned a single set of sequential FunFam assignments. The Gene Ontology annotations of the characterised sequences that make up the matching FunFams are then transferred to the query sequence with probability scores. The function annotations provided by the FunFams are shown to be more precise than annotations provided by other domain family resources using a rollback benchmark. FunFHMMer also performs competitively with the top function prediction methods in the Critical Assessment of Protein Function Annotation (CAFA) 2 experiment, 2013-2014 (preliminary results). The FunFHMMer web server can be accessed from:
N07 - An annotation system for non-coding RNAs from sequencing data
Agnes Hotz-Wagenblatt, German Cancer Research Center (DKFZ), Germany
Coral del Val, University of Granada, Spain
Christopher Previti, German Cancer Research Center (DKFZ), Germany
Karl-Heinz Glatting, German Cancer Research Center (DKFZ), Germany
Short Abstract: Non-coding RNAs, especially miRNAs, play a key role in several diseases and in cell differentiation. Profiling by small RNA sequencing is commonly used for their analysis. We developed a consensus database, derived from several publicly available non-coding RNA databases, for annotation. To avoid duplicate entries, we cluster identical sequences from the same organism. Two pipelines were created: one for preparing the sequences by filtering the reads according to different criteria and for genomic mapping (sRNAMapper), and one for mapping and annotating the sequence reads using our non-coding RNA database (ncRNAannotator). sRNAMapper performs a mapping against a genome using either Bowtie or BWA, and also performs a number of quality and filtering checks on the input reads. ncRNAannotator classifies small/non-coding RNA sequence reads by mapping them against our unified database of small/non-coding RNAs. The output shows the distribution of the reads and the coverage for the most common RNAs. Separate tab-separated output files for the different classes of non-coding RNAs are available for download. Both pipelines can be installed as stand-alone tools. We will show statistics about the database and the practical use of the pipelines in research with several examples.
N08 - Probing the role(s) of rare codon clusters in Escherichia coli
Athina Theodosiou, University of Cyprus, Cyprus
Vasilis Promponas, University of Cyprus, Cyprus
Short Abstract: Recently, experimental and computational advances have been made in quantifying translational pausing events caused by rare codon clusters (RCCs), demonstrating a possible relation to co-translational protein folding.
We used the recently developed standalone version of LaTcOm (Theodosiou and Promponas, 2012; Theodosiou and Promponas, in preparation), which offers access to different algorithmic approaches for detecting RCCs. We initially compared the algorithms implemented in LaTcOm, concluding that there is no clear evidence of concordance between the different approaches, except a good correlation found between %MinMax and MSS.
Moreover, RCCs were detected and analyzed for the complete E. coli protein-coding sequences, aiming to unravel possible roles for RCCs by combining functional and structural information. We demonstrate enrichment of RCCs at the termini and reveal that the existence of RCCs is associated with secreted, inner- and (surprisingly) outer-membrane proteins. Interestingly, we show that most of the sequences with no detected RCCs are found in the cytoplasm and are involved with the ribosome or with metabolic processes. In addition, we demonstrate that multidomain proteins are enriched in RCCs, and we show evidence that RCC coordinates can be used as indicators of domain boundaries.
Additionally, we correlated RCCs with topological and structural features of E. coli α-helical transmembrane proteins (αHTMPs) with experimentally determined atomic structures. We demonstrate (on a small dataset) a weak preference for the positioning of RCCs in periplasmic loops of αHTMPs, indicating that a coupling may exist between RCC-mediated ribosomal attenuation and the biogenesis of αHTMPs.
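As a rough illustration of RCC detection, a simplified sliding-window scheme is sketched below. This is neither %MinMax nor MSS as implemented in LaTcOm, and the codon-usage frequencies are invented; it merely shows the core idea of flagging stretches of low-usage codons:

```python
def rare_codon_clusters(codons, usage, window=5, threshold=0.2):
    """Flag window start positions whose mean codon-usage frequency
    falls below `threshold` (a crude RCC detector for illustration)."""
    scores = [usage.get(c, 0.0) for c in codons]
    hits = []
    for i in range(len(scores) - window + 1):
        if sum(scores[i:i + window]) / window < threshold:
            hits.append(i)
    return hits

# toy usage frequencies (hypothetical values, not real E. coli usage)
usage = {"AAA": 0.9, "CGG": 0.05, "CCC": 0.1}
codons = ["AAA", "AAA", "CGG", "CCC", "CGG", "CCC", "CGG", "AAA", "AAA", "AAA"]
clusters = rare_codon_clusters(codons, usage)
```

Here the run of rare codons in the middle of the toy sequence is reported as a single cluster starting at codon index 2.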
N09 - Tabu Search Method for RNA Partial Degradation Problem
Agnieszka Rybarczyk, Institute of Computing Science, Poznan University of Technology; Institute of Bioorganic Chemistry, Polish Academy of Sciences, Poland
Marta Kasprzak, Institute of Computing Science, Poznan University of Technology; Institute of Bioorganic Chemistry, Polish Academy of Sciences, Poland
Jacek Błażewicz, Institute of Computing Science, Poznan University of Technology; Institute of Bioorganic Chemistry, Polish Academy of Sciences, Poland
Short Abstract: In the last few years, there has been great interest in RNA research due to the discovery of the roles that RNA molecules play in biological systems. They not only serve as templates in protein synthesis or as adaptors in the translation process, but also influence and are involved in the regulation of gene expression. It has been demonstrated that most of them are produced from larger molecules by enzymatic cleavage or spontaneous degradation.
In this work, we present our recent results concerning the RNA degradation process. In our studies, we used artificial RNA molecules designed according to the rules of degradation developed by Kierzek and co-workers. On the basis of the results of their degradation, we have formulated the RNA Partial Degradation Problem (RNA PDP) and shown that the problem is strongly NP-complete. We propose a new, efficient heuristic algorithm based on the tabu search approach that reconstructs the cleavage sites of a given RNA molecule. The algorithm solves the problem with negative errors. Results of a computational experiment, which demonstrate the quality and usefulness of the proposed method, are presented.
N10 - Bioinformatic analysis of Magnaporthe oryzae polyadenylation sites from next generation sequencing data
Marco Marconi, CBGP-UPM, Spain
Julio Rodríguez-Romero, CBGP-UPM, Spain
Ane Sesma-Galarraga, CBGP-UPM, Spain
Mark Wilkinson, CBGP-UPM, Spain
Short Abstract: Background
Several proteins have been shown to regulate alternative polyadenylation (APA), including Cleavage Factor I (CFIm) in metazoans and Hrp1 in yeast. The ascomycetous fungus Magnaporthe oryzae, also known as rice blast, is a plant-pathogenic fungus that causes a serious disease of rice. Rbp35 is the functional M. oryzae equivalent of CFIm68. Δrbp35 and Δhrp1 mutants are viable, indicating that Rbp35 and Hrp1 are not essential components of the polyadenylation machinery in the rice blast fungus. However, Δhrp1 and Δrbp35 mutants show developmental and virulence defects.

We mapped the polyadenylation sites of M. oryzae in four different growing conditions and identified more than 14,000 high-confidence poly(A) sites, accounting for more than 7,000 protein-coding genes. 30% of M. oryzae genes are alternatively polyadenylated and fall into specific functional groups. We also identified the nucleotide context, protein-binding regions and RNA secondary structure of poly(A) sites, and demonstrated that these differ from budding yeast. Under carbon starvation, alternative poly(A) site usage is altered in more than 400 genes, producing longer 3'UTR isoforms. ~25% of the alternatively polyadenylated transcripts found in the wild type were affected in the Δrbp35 mutant, indicating that alternative site selection is Rbp35-dependent. Lack of Rbp35 affects poly(A) site selection by promoting proximal cuts, resulting in a global shortening of 3'UTRs. A UGUAH motif is enriched in Rbp35-dependent poly(A) sites, suggesting that these are the ribonucleotides recognized by Rbp35.
N11 - High-Dimensional Statistical Models for Fetal Fraction Estimation in Circulating Cell-Free DNA from Maternal Sequence Read Counts
Jennifer Geis, Sequenom Laboratories, United States
Sung Kim, Sequenom Laboratories, United States
Gregory Hannum, Sequenom Laboratories, United States
John Tynan, Sequenom Laboratories, United States
Grant Hogg, Sequenom Laboratories, United States
Chen Zhao, Sequenom Laboratories, United States
Taylor Jensen, Sequenom Laboratories, United States
Paul Oeth, Sequenom Laboratories, United States
Mathias Ehrich, Sequenom Laboratories, United States
Amin Mazloom, Sequenom Laboratories, United States
Dirk van den Boom, Sequenom Laboratories, United States
Cosmin Deciu, Sequenom Laboratories, United States
Tim Burcham, Sequenom Laboratories, United States
Short Abstract: In Non-Invasive Prenatal Testing (NIPT), the accuracy of fetal aneuploidy detection relies on an adequate signal from fetal DNA above the maternal background, also known as the fetal fraction, in a sample of circulating cell-free DNA (ccf DNA). Quantifying the fetal fraction is therefore a key component in NIPT assays. Direct estimators may be determined from Massively Parallel Sequencing (MPS) reads through the elevation of chromosome Y in pregnancies with male fetuses, or through the elevation or depletion of affected regions in pregnancies with fetal aneuploidies. Fetal fraction estimates for pregnancies with euploid female fetuses may only be found through an additional assay, e.g., a methylation-based or a SNP-based approach.

Recent observations made by Lo et al. using paired-end sequencing reads indicated that fetal read lengths vary from maternal read lengths in ccf DNA, shedding new light on the distributional differences. This underlying structural variation, plus the correlation of autosomal bins with read lengths, provides the basis for a novel fetal fraction estimator using autosomal bins only. This may be done by utilizing high-dimensional modeling techniques such as Reduced-Rank Regression and Elastic Net in order to create predictive models. We examine these models and contrast this estimator with other estimators, and offer insight into this novel approach that provides fetal fraction estimation for all types of
N12 - EAGER: Efficient Ancient Genome Reconstruction
Alexander Peltzer, University of Tuebingen, Germany
Short Abstract: Next-generation sequencing (NGS) technologies have enabled the reconstruction of whole genomes even from DNA of ancient organisms that lived thousands of years ago. A plethora of different tools have been developed for NGS data analysis; however, special bioinformatics skills are often required to apply these appropriately to reconstruct whole genomes. Furthermore, NGS data produced from ancient DNA have specific characteristics, which need to be addressed during data analysis. Finally, only a few automatic pipelines exist that can analyze multiple samples in parallel. Here, we introduce EAGER, a software solution offering a complete processing pipeline for a fully automatic reconstruction of genome sequences from NGS data. It includes specific tools for ancient DNA data analysis, as well as an enhanced mapping method for circular genomes. Furthermore, different read mapping and genotyping methods are integrated. The resulting VCF files can be directly used for downstream analyses such as population genetics or phylogenetic studies. EAGER significantly reduces data analysis time and offers reliable and comparable results when applied to multiple samples in parallel.
N13 - DNA barcodes for molecular biology
Leonid Bystrykh, senior researcher, Netherlands
Short Abstract: Synthetic DNA barcodes are useful elements in a great variety of experimental applications, for instance counting stem or other cells, tracing the origin of a cell type (lineage) or tissue, or counting particular DNA loci per single cell. Barcodes are also a useful tool for multiplexed DNA sequencing and for indexing DNA preps in large libraries. Depending on the application, the structure and design of the barcode can vary: it can be long or short, random or semi-random; it can be designed by an algorithm or by applying a set of selection rules. Several common problems arise when dealing with barcodes. One is the recognition of the barcode structure among all incidental DNA sequences in the mix. Further, there is the issue of labeling quality, for instance how unique the labeling by each barcode is within a mixture of barcodes. Finally, there is the issue of expected versus observed distances in barcoded subsets, which is also connected to the removal of sequencing noise. Recent work from our lab illustrates some aspects of these barcode applications. Technical challenges of those applications and future improvements are discussed.
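One of the distance questions raised above — how well separated the barcodes in a set are — reduces to the minimum pairwise Hamming distance. A minimal sketch (illustrative only; the barcode set below is invented):

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two equal-length barcodes differ."""
    return sum(x != y for x, y in zip(a, b))

def min_pairwise_distance(barcodes):
    """Smallest Hamming distance between any two barcodes; a set with
    minimum distance d can correct up to (d - 1) // 2 sequencing errors."""
    return min(hamming(a, b) for a, b in combinations(barcodes, 2))

codes = ["AACCGG", "TTGGCC", "ACACGT"]
d = min_pairwise_distance(codes)
```

Designing a barcode set by selection rules typically amounts to rejecting any candidate that would drop this minimum distance below a chosen error-correction budget.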
N14 - Using reference-free compressed data structures to analyse thousands of human genomes
Thomas Keane, Wellcome Trust Sanger Institute, United Kingdom
Zhicheng Liu, Wellcome Trust Sanger Institute, United Kingdom
Dirk Dominic-Dolle, Wellcome Trust Sanger Institute, United Kingdom
Shane McCarthy, Wellcome Trust Sanger Institute, United Kingdom
Richard Durbin, Wellcome Trust Sanger Institute, United Kingdom
Short Abstract: We are rapidly approaching the point where we have sequenced the genomes of hundreds of thousands of human individuals. The scale-up of human population sequencing has enabled us to detect sequence variants down to extremely low minor allele frequencies, explore ancient human lineages, and use genomics for screening of disease-causing mutations. The Burrows-Wheeler transform (BWT) and FM-index have been widely employed as a highly compressed searchable data structure used by read aligners and for de novo assembly. We sought to explore the use of BWTs to store and compress the raw sequencing reads of 26 human populations from 2535 individuals in the 1000 Genomes Project. We show that it is possible to achieve compression ratios of 0.09 bytes per bp (including sample meta-data), substantially better than any existing sequencing data format. A key feature of this population BWT is that, as more individuals are added to the structure, identical read sequences are observed and compression becomes ever more efficient. BWTs are inherently reference-free, so one can rapidly query all the raw sequencing data for non-reference haplotypes and viral integrations. We use the BWT to assess the support in the raw data for the predicted 1000 Genomes haplotypes, investigate the population support along different versions of the human reference genome, and evaluate sequence specific to versions of the reference with and without support in the population. We develop methods to derive accurate genotypes for both single-base variants and short indels, reference-free.
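The BWT underlying this approach can be demonstrated in a few lines. The sketch below uses a naive rotation-sort construction for illustration; production tools build the transform via suffix arrays or specialised BWT-construction algorithms:

```python
def bwt(s):
    """Naive Burrows-Wheeler transform via sorted rotations.
    '$' marks end-of-string and sorts before the other characters."""
    s += "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

transformed = bwt("GATTACA")
```

Because sorting groups identical contexts together, repeated sequences — such as identical reads from many individuals — produce long runs of the same character in the transform, which is why compression improves as more samples are added.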
N15 - Detecting small structural variants with SoftSV using soft-clipping information
Christoph Bartenhagen, Institute of Medical Informatics, Germany
Martin Dugas, Institute of Medical Informatics, Germany
Short Abstract: Numerous tools for the detection of structural variations (SVs) have been developed over recent years, including our own contribution, SoftSV. But there still remains a gap between small indels, which can be detected by gapped alignments, and large SVs (many hundreds or thousands of bp), which can be reconstructed from paired-end reads or read-depth information. Filling this gap remains difficult and often demands special algorithms for split-read alignments directly at the breakpoints, which only a few of the published tools provide for this range of SVs.

We initially developed SoftSV for large SVs and have now expanded our approach to small and medium-sized deletions, tandem duplications and inversions (starting at 20 bp). As with large rearrangements, we detect their exact breakpoints under the premise that no threshold filters out SVs with low support or reads with low mapping quality or ambiguous mappings. Our greedy approach exploits any kind of soft-clipped alignment and reconstructs the breakpoint sequence simply by comparing the soft-clipped reads at the start and end of an SV.

Using simulated and four real datasets from the 1000 Genomes Project, we evaluate the sensitivity and precision of SoftSV and four other tools. Our results show that sensitive and reliable SV detection is subject to many different factors like read length, coverage and SV type. SoftSV achieved sensitivities and PPVs between 80% and 100% consistently for all SV types on simulated datasets starting at 75bp reads and 10-15x sequence coverage, without requiring any parameter configuration by the user.

SoftSV is freely available at
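The soft-clip comparison described above can be sketched roughly as follows. This is a simplified exact-overlap join, not SoftSV's actual algorithm, and the sequences are invented; it illustrates how the clipped tail at one side of a putative SV and the clipped head at the other side can be merged into a breakpoint sequence:

```python
def reconstruct_breakpoint(left_clip, right_clip, min_overlap=5):
    """Join the soft-clipped tail at an SV's start with the soft-clipped
    head at its end by their longest exact overlap (simplified sketch)."""
    for k in range(min(len(left_clip), len(right_clip)), min_overlap - 1, -1):
        if left_clip[-k:] == right_clip[:k]:
            return left_clip + right_clip[k:]
    return None  # no sufficient overlap: breakpoint unsupported

# clipped sequences observed on either side of a putative deletion
junction = reconstruct_breakpoint("ACGTACGTTTGCA", "TTGCAGGCATCGA")
```

Real tools must additionally tolerate sequencing errors in the overlap and handle the reverse-complement orientation of inversions.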
N16 - StaRGeEn: a statistical reference-based genome compression tool
Farzad Farhadzadeh, Eindhoven University of Technology, Netherlands
Frans M.J. Willems, Eindhoven University of Technology, Netherlands
Short Abstract: Genome compression has been the subject of multiple studies in the past decade. Studies in this field can be grouped into two main categories: reference-based compression and individual compression. Individual genome compression techniques focus on the repetitive nature of genome sequences, such as approximate repeats, and describe them concisely.

Methods in the reference-based category obtain much better compression by exploiting the fact that the genomes of different individuals are more than 99% identical, encoding just the differences between the target genome and a reference genome.

Reference-based algorithms typically consist of two stages. First, a mapping from the reference genome to the target genome is generated. The second stage involves describing this mapping concisely. To reconstruct the target genome, both the short description of the mapping and the reference genome are needed.

These algorithms face two major problems. First, their efficiency relies on finding a correct mapping; if the specified mapping is not correct, the compression cannot be efficient. Second, not only must the sequences' variations be encoded, but extra bits also have to be allocated to specify the mapping itself. Hence, compression rates achieved with the current techniques become higher than necessary.

Here, we introduce StaRGeEn (Statistical Reference-based Genome Encoding), a new genome compression algorithm in which, instead of finding one specific mapping, we weight all possible mappings to evaluate the statistics necessary for encoding. StaRGeEn is thus an efficient single-stage compression method that is (a) robust to mapping errors and (b) achieves high compression rates.
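The core of the reference-based idea — encoding only the differences from a reference — can be illustrated in a few lines. This is a naive single-mapping sketch with invented toy sequences, precisely the kind of scheme whose fragility StaRGeEn addresses by weighting all mappings instead of committing to one:

```python
def encode_diffs(reference, target):
    """Encode a same-length target genome as (position, base) differences
    from the reference (the essence of reference-based compression)."""
    return [(i, t) for i, (r, t) in enumerate(zip(reference, target)) if r != t]

def decode_diffs(reference, diffs):
    """Reconstruct the target from the reference plus the difference list."""
    seq = list(reference)
    for i, base in diffs:
        seq[i] = base
    return "".join(seq)

ref = "ACGTACGTACGT"
tgt = "ACGAACGTACCT"
diffs = encode_diffs(ref, tgt)
```

For genomes that are >99% identical, the difference list is tiny compared to the sequence itself — but, as the abstract notes, a real encoder must also pay for describing the mapping, and any mapping error inflates the difference list.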
N17 - SOD: Sequence based outlier detection in multiple sequence alignments
Peter Jehl, University College Dublin, Ireland
Short Abstract: Multiple sequence alignments (MSA) are used in molecular biology in many different studies like phylogenetic analysis, finding conserved regions, protein function studies and structure prediction. The occurrence of an outlier, a sequence widely dissimilar to the rest of the alignment, can lead to misalignments or misinterpretations of the data.
The software takes alignments, distance matrices or raw sequences and finds sequences that are, with respect to a gap metric, different from the rest of the alignment. This is done in a statistical manner by computing a distance matrix (or using the matrix from the input) and applying bootstrapping or interquartile-range analysis to infer a robust estimate of the mean and standard deviation of the distances. Instances with a high distance from the mean are considered outliers. Computations to verify the predictive power of SOD were performed on datasets derived from the Pfam database with no outliers or with artificially introduced outliers. The results are shown as ROC, true/false-positive and precision/recall curves.
We were able to show that outlier sequences can be distinguished from non-outliers in a multiple sequence alignment and can be identified with high sensitivity and specificity. Removing outliers from alignments can increase the quality of the alignment and therefore the quality of subsequent studies where alignment quality is essential, like 3D structure prediction and discovery of linear motifs and functional domains.
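A rough sketch of the interquartile-range idea described above (hypothetical code, not the SOD implementation; the `k=1.5` Tukey factor is an assumption):

```python
import statistics

def find_outliers(names, dist_matrix, k=1.5):
    """Flag sequences whose mean distance to all others lies above
    Q3 + k*IQR of the per-sequence mean distances (Tukey's rule)."""
    means = [sum(row) / (len(row) - 1) for row in dist_matrix]  # self-distance is 0
    q1, _, q3 = statistics.quantiles(means, n=4)
    cutoff = q3 + k * (q3 - q1)
    return [name for name, m in zip(names, means) if m > cutoff]

# toy data: 8 mutually similar sequences plus one divergent sequence "out"
names = [f"s{i}" for i in range(8)] + ["out"]
n = len(names)
dist = [[0.0 if i == j else (0.9 if "out" in (names[i], names[j]) else 0.1)
         for j in range(n)] for i in range(n)]
find_outliers(names, dist)
```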
N18 - Frequency-based haplotype reconstruction from deep sequencing data of bacterial populations
Sergio Pulido Tamayo, KULeuven, Belgium
Aminael Sánchez-Rodríguez, Universidad Técnica Particular de Loja, Ecuador
Toon Swings, KULeuven, Belgium
Bram Van den Bergh, KULeuven, Belgium
Akanksa Dubey, KULeuven, Belgium
Hans Steenackers, KULeuven, Belgium
Jan Michiels, KULeuven, Belgium
Jan Fostier, UGent, Belgium
Kathleen Marchal, UGent, Belgium
Short Abstract: Clonal populations accumulate mutations over time, resulting in different haplotypes. Deep sequencing of such a population in principle provides information to reconstruct these haplotypes and the frequencies at which they occur. However, this reconstruction is technically not trivial, especially in clonal systems with a relatively low mutation frequency. The low number of segregating sites in those systems adds ambiguity to the haplotype phasing and thus precludes the reconstruction of genome-wide haplotypes based on sequence overlap information alone.

Therefore, we present EVORhA, a haplotype reconstruction method that complements the phasing information in non-empty read overlaps with frequency estimates of inferred local haplotypes. As shown with simulated data, as soon as read lengths and/or mutation rates become restrictive for state-of-the-art methods, the use of this additional frequency information allows EVORhA to still reliably reconstruct genome-wide haplotypes. On real data, we show the applicability of the method in reconstructing the population composition of evolved bacterial populations and in decomposing mixed bacterial infections from clinical samples.
N19 - Domain Annotation of Trimeric Autotransporter Adhesins 2
Jens Bassler, Max Planck Institute for Developmental Biology, Germany
Birte Hernandez Alvarez, Max Planck Institute for Developmental Biology, Germany
Marcus Hartmann, Max Planck Institute for Developmental Biology, Germany
Kotaro Koiwai, Department of Genome Medicine, Tokyo Metropolitan Institute of Medical Science, Japan
Katsutoshi Hori, Department of Biotechnology, Graduate School of Engineering, Nagoya University, Japan
Andrei Lupas, Max Planck Institute for Developmental Biology, Germany
Short Abstract: Trimeric Autotransporter Adhesins (TAAs) are a widespread family of filamentous outer membrane proteins in Gram-negative bacteria. As mediators of adhesion, they play a crucial role in the colonisation of biotic and abiotic surfaces, offering many attractive targets for medical and biotechnological applications.

Although highly variable, all TAAs arose from the mosaic-like rearrangement of a limited set of domains, whose structure is highly conserved even at low sequence identity. We therefore established a domain dictionary, consisting of bioinformatic sequence descriptors and representative crystal structures, in order to annotate newly sequenced TAAs and reconstruct the full fibres. In 2008, we implemented this approach in the web-based tool daTAA (domain annotation of TAAs).

Here we present daTAA2, a complete reimplementation of the daTAA functionality as a network of hidden Markov states. Its algorithm, which considers both sequence similarity and probabilistic rules of domain arrangement, allows the annotation of a TAA sequence as an array of consecutive states rather than as a collection of individual domains, thereby overcoming the minimum length limitation for recognisable features and yielding a significantly more detailed annotation.
N20 - STAC - A New Domain In Prokaryotic Transmembrane Signalling
Mateusz Korycinski, Max Planck Institute for Developmental Biology, Germany
Reinhard Albrecht, Max Planck Institute for Developmental Biology, Germany
Astrid Ursinus, Max Planck Institute for Developmental Biology, Germany
Marcus Hartmann, Max Planck Institute for Developmental Biology, Germany
Murray Coles, Max Planck Institute for Developmental Biology, Germany
Joerg Martin, Max Planck Institute for Developmental Biology, Germany
Stanislaw Dunin-Horkawicz, Max Planck Institute for Developmental Biology, Germany
Andrei Lupas, Max Planck Institute for Developmental Biology, Germany
Short Abstract: Transmembrane receptors are integral components of sensory pathways in prokaryotes. These receptors share a common dimeric architecture, consisting in its basic form of an N-terminal extracellular sensor, transmembrane helices, and an intracellular effector. As an exception, we have identified an archaeal receptor family – exemplified by Af1503 from Archaeoglobus fulgidus – which is C-terminally shortened, lacking a recognizable effector module. Here we examine the gene environment of Af1503-like receptors and identify a closely associated new protein domain family, which we characterize structurally and biochemically using Af1502 from A. fulgidus as a model system. Members of this family are found both as stand-alone proteins and as domains within extant receptors. Invariably, the latter appear as connectors between solute carrier (SLC) protein–like transmembrane domains and two-component signal transduction (TCST) domains. We propose that they mediate signal transduction in systems regulating transport processes, and name the domain STAC, for SLC and TCST Associated Component.
N21 - Leveraging transcript quantification for fast computation of alternative splicing profiles
Amadís Pagès, Centre for Genomic Regulation, Spain
Gael Pérez, Universitat Pompeu Fabra, Spain
Juan L. Trincado, Universitat Pompeu Fabra, Spain
Nicolás Bellora, INIBIOMA, Argentina
Eduardo Eyras, Catalan Institution for Research and Advanced Studies, Spain
Short Abstract: Alternative splicing plays an essential role in many cellular processes and bears major relevance to the understanding of multiple diseases, including cancer. High-throughput RNA sequencing allows genome-wide analyses of splicing across multiple conditions. However, the increasing number of available datasets represents a major challenge in terms of computation time and storage requirements. We describe SUPPA, a computational tool to calculate relative inclusion values of alternative splicing events, exploiting fast transcript quantification. SUPPA is more accurate than standard methods, as assessed on simulated as well as real RNA sequencing data against experimentally validated events, and using the complete set of transcripts per gene rather than a reduced one provides better estimates. SUPPA coupled with de novo transcript reconstruction methods does not achieve accuracies as high as when using quantification of known transcripts, but remains better than existing methods. Finally, we show that SUPPA is more than 1000 times faster than standard methods. Coupled with fast transcript quantification, SUPPA provides inclusion values at a much higher speed than existing methods without compromising accuracy, thereby facilitating the systematic splicing analysis of large datasets with limited computational resources. The software is implemented in Python 2.7 and is available under the MIT license at
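The relative inclusion value (PSI) of an event can be derived directly from transcript abundances, which is the essence of the quantification-based approach (a simplified illustration; the function and transcript names below are hypothetical):

```python
def psi(tpm, inclusion_transcripts, all_transcripts):
    """Event inclusion level: abundance of transcripts carrying the
    inclusion form divided by abundance of all transcripts of the event."""
    inc = sum(tpm[t] for t in inclusion_transcripts)
    tot = sum(tpm[t] for t in all_transcripts)
    return inc / tot if tot > 0 else float("nan")

# tx1 includes the alternative exon; tx2 and tx3 skip it
tpm = {"tx1": 30.0, "tx2": 10.0, "tx3": 60.0}
psi(tpm, ["tx1"], ["tx1", "tx2", "tx3"])  # → 0.3
```

Because the expensive step (transcript quantification) is done once, computing PSI values per event reduces to sums and a division, which explains the large speed-up over read-counting approaches.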
N22 - Alignment by numbers: sequence assembly using compressed numerical representations
Avraam Tapinos, The University of Manchester,
Bede Constantinides, The University of Manchester,
Douglas Kell, The University of Manchester,
David Robertson, The University of Manchester,
Short Abstract: This poster is based on Proceedings Submission 179
DNA sequencing instruments are enabling genomic analyses of unprecedented scope and scale, widening the gap between our abilities to generate and interpret sequence data. Established methods for computational sequence analysis generally consider the nucleotide-level resolution of sequences, and while these approaches are sufficiently accurate, increasingly ambitious and data-intensive analyses are rendering demanding genome and metagenome assemblies impractical. Comparable analytical challenges are encountered in other data-intensive fields involving sequential data, such as signal processing and time series analysis. By representing sequences of nucleotides as numerical sequences it is possible to apply dimensionality reduction methods from these fields to genetic data, enabling sequence analysis with much reduced complexity.

To explore the applicability of signal transformation and dimensionality reduction methods in sequence assembly, we implemented a short read aligner and evaluated its performance against simulated high-diversity viral sequences alongside four existing aligners. Using our sequence transformation and feature selection approach, alignment time was reduced by up to 14-fold compared with using the original, uncompressed sequences, without reducing alignment accuracy. Despite using heavily compressed sequence transformations, our implementation yielded alignments of similar overall accuracy to existing aligners, outperforming all other tools tested at high levels of sequence variation. Our approach was also applied to the de novo assembly of a simulated diverse viral population. We have demonstrated that full sequence resolution is not a prerequisite of accurate sequence alignment and that analytical performance may be retained or even enhanced through appropriate dimensionality reduction of sequences.
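The abstract does not state which transform is used; piecewise aggregate approximation (PAA), a standard time-series dimensionality reduction, illustrates the general idea of compressing a numerically encoded sequence (the base-to-number mapping here is an arbitrary choice):

```python
BASE_VALUES = {"A": 0.0, "C": 1.0, "G": 2.0, "T": 3.0}

def encode(seq):
    """Map a nucleotide string to a numerical signal."""
    return [BASE_VALUES[base] for base in seq]

def paa(signal, n_segments):
    """Piecewise Aggregate Approximation: compress the signal to
    n_segments values by averaging equal-width windows."""
    n = len(signal)
    out = []
    for i in range(n_segments):
        lo = i * n // n_segments
        hi = (i + 1) * n // n_segments
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out

sig = encode("AACCGGTTAACCGGTT")   # 16 values
compressed = paa(sig, 4)           # 4 values: a 4x reduction
```

Downstream comparisons (e.g. distance computations between reads) then operate on the short compressed vectors instead of full-resolution sequences.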
N23 - Increasing accuracy and precision of vector integration site identification of sequencing reads with a new bioinformatics framework
Stefano Brasca, San Raffaele Telethon Institute for Gene Therapy, Italy
Andrea Calabria, San Raffaele Telethon Institute for Gene Therapy, Italy
Giulio Spinozzi, San Raffaele Telethon Institute for Gene Therapy, Italy
Fabrizio Benedicenti, San Raffaele Telethon Institute for Gene Therapy, Italy
Erika Tenderini, San Raffaele Telethon Institute for Gene Therapy, Italy
Eugenio Montini, San Raffaele Telethon Institute for Gene Therapy, Italy
Short Abstract: Analysis of vector integration sites (ISs) in transduced cells from gene therapy patients is key to unraveling the safety and efficacy of the treatment. The increase in sequencing depth for IS mapping, however, is accompanied by an increase in false positives derived from both PCR artifacts and errors in read parsing and mapping on the reference genome. In several IS datasets we analyzed, >10% of sequences mapped in clumps spread ±4 bp around the real IS. To mitigate the consequent overestimation of ISs, we previously merged to a single position the sequencing reads mapping within a ±3 bp interval. However, a dedicated method and model to approach this issue is still missing.

We designed a new bioinformatics framework to identify ISs from mapped sequencing reads, as a pipeline post-processing plugin aimed at modeling and addressing the effects of stochastic errors. In the first of three steps, junction reads are packed into different arrangements of putative ISs, through an iterative approach based on local modes and Gaussian scores. Then, each arrangement drives Monte Carlo simulations of sequences, which are mapped to produce distributional profiles that are compared to the data by a Kolmogorov–Smirnov test: the choice is made by balancing the minimum number of input ISs against the maximum gain in profile likelihood (step-wise regression, forward selection).

This method effectively improved IS genomic placement and reduced overestimation, relying on statistical evaluations that could be adopted as a standard for future analyses in GT applications.
N24 - On the origin of folded proteins from ancient peptides
Vikram Alva Kullanja, Max Planck Institute for Developmental Biology, Germany
Johannes Soeding, Max Planck Institute for Biophysical Chemistry, Germany
Andrei Lupas, Max Planck Institute for Developmental Biology, Germany
Short Abstract: Contemporary proteins arose by combinatorial shuffling and differentiation from a basic set of domain prototypes, most of which were already established at the time of the Last Universal Common Ancestor (LUCA), 3.5 billion years ago. The origin of domains themselves, however, is poorly understood. We are pursuing the hypothesis that they arose by fusion, accretion, and repetition from an ancestral set of peptides active as co-factors of RNA-dependent replication and catalysis (the 'RNA world'). We reasoned that if this hypothesis is true, comparative studies should allow a description of this peptide set. To this end, we compared domains representative of known folds and identified 40 fragments that occur in domains of different folds, yet show significant similarities in sequence and structure. Based on their widespread occurrence in the most ancient folds (e.g. the P-loop NTPases and ribosomal proteins) and on their involvement in basal functions (e.g. nucleic acid-binding and metal-binding), we propose that these fragments represent the observable remnants of a primordial RNA-peptide world.
N25 - Site-specific evolution of selected post-synaptic protein complexes
Maciej Pajak, University of Edinburgh,
Clive R. Bramham, University of Bergen, Norway
T. Ian Simpson, University of Edinburgh,
Short Abstract: Sequence conservation analysis of proteins belonging to the post-synaptic proteome (PSP) has previously revealed that key synaptic protein classes are present in primitive organisms preceding the emergence of nervous systems.
Recent studies suggest that evolution of the PSP may be responsible for the emergence of complex neural system function and behaviour but these analyses assess evolution only at the whole protein level.

We have developed an analysis workflow that integrates codon-resolution selection pressure estimates with domain and motif data to allow refinement of our understanding of domain-centric functionalisation of the PSP.

We show the application of this workflow to the Activity-regulated cytoskeleton protein (Arc) complex, a set of 26 Arc interacting proteins. Arc is highly conserved among placental mammals and plays a significant role in the post-synaptic density as a major regulator of long-term synaptic plasticity, the presumed molecular correlate of memory and learning.

Maximum likelihood phylogenetic inference for proteins of the Arc interactome, followed by site-by-site selection pressure analysis using a fixed effect likelihood methodology reveals a small set of positively selected sites as well as many regions under strong negative selection pressure. Mapping of these sites onto both known and predicted binding domains and post-translational modification sites allows inference of key domain-level functionalisation events during Arc complex evolution and provides a rational basis for prioritising regions for functional studies.
N26 - IMDG-BLAST: a New Faster BLAST based on an In-memory Data Grid in the Cloud Environment
Jeongsu Oh, KRIBB, Korea, Rep
Chi-Hwan Choi, Chungbuk National University, Korea, Rep
Kyung Mo Kim, KRIBB, Korea, Rep
Short Abstract: Sequence similarity searching is one of the most important tasks in bioinformatics. Among the many tools available, the Basic Local Alignment Search Tool (BLAST) has been widely used for determining homologs between sequences. Recent advances in sequencing technologies (e.g., pyrosequencing, Illumina, PacBio) have resulted in a rapid increase in the number of sequences to be blasted. However, current versions of BLAST have difficulty processing large numbers of sequences since they only support parallel processing within a single computing node. Consequently, it is now necessary to develop a new BLAST version that can analyze big sequence data more efficiently. We therefore applied the in-memory data grid (IMDG) method to the BLAST application. The resulting program, IMDG-BLAST, is a distributed and parallel BLAST version based on IMDG that processes large amounts of sequence data rapidly. Even though parallel or distributed versions of BLAST already exist (e.g., mpiBLAST and Hadoop BLAST), IMDG-BLAST has the following additional novel features: (I) much faster response time thanks to the IMDG method; (II) linear speed-up as computing nodes increase; (III) easy execution without any complicated installation or system settings; (IV) platform-independent computing environments such as standalone, cluster and cloud environments (e.g., Amazon EC2). These features distinguish IMDG-BLAST from the existing parallel and distributed implementations of BLAST. IMDG-BLAST is written in Java and available upon users' request.
N27 - A method for detecting chromosomal translocations and identifying their structures from next generation sequencing data
Minho Kim, ETRI, Korea, Rep
Dae-Hee Kim, ETRI, Korea, Rep
Myung-Eun Lim, ETRI, Korea, Rep
Youngwoong Han, ETRI, Korea, Rep
Young-Won Kim, ETRI, Korea, Rep
Jae-Hoon Choi, ETRI, Korea, Rep
Ho-Youl Jung, ETRI, Korea, Rep
Short Abstract: The movement or interchange of a block of deoxyribonucleic acid (DNA) sequence between two chromosomes is called a chromosomal translocation. It is one type of structural variation.

Recently, many researchers have tried to detect chromosomal translocations in next generation sequencing (NGS) data due to their importance in genetics and medicine. An approach using read depth (RD) information in NGS data has typically been used to find copy number variants. This means that translocations with a neutral copy number cannot be detected. Another approach using paired-end discordant fragments (DFs) and/or split reads (SRs) can detect break points (BPs) of translocations regardless of their copy number, but it cannot tell whether a translocation exists on one or both of the two related chromosomes.

We present a method that can effectively detect translocations in diploid genomes such as human. It takes advantage of concordant fragments near the BPs of translocations and incorporates the strengths of the conventional RD- and DF-based methods. Our technique is capable of identifying the structure of a translocation: it can locate the BP of a translocation and tell whether the translocated sequence exists on one or both chromosomes, or on one side or both sides of a chromosome. It can also detect translocations with a neutral copy number as well as copy number variation. Additionally, our tool can significantly reduce the number of false-positive results.
N28 - MuffinInfo: HTML5 Information Extraction System from FastQ Data
Andy Alic, Universitat Politecnica de Valencia, Spain
Ignacio Blanquer, Universitat Politecnica de Valencia, Spain
Short Abstract: The analysis of the input data offers useful insights for any further processing step. Although FastQ information extraction systems exist, they depend on the platform or require servers. HTML5 is a web technology capable of providing great performance for offline desktop software development. We present a client-side-only FastQ information extractor (MuffinInfo), demonstrating the capabilities of HTML5. Our software focuses on efficiency and usability, including features such as the FileReader API to handle files locally, Web Workers to perform costly operations faster without freezing the UI, and SVG to seamlessly scale graphics across resolutions. MuffinInfo parses the input in chunks, keeping only the relevant information. This way, it can process large files locally, without fully loading them into RAM and without the need for a server. MuffinInfo presents the results in tabs, grouped by category. It lists general information such as the number of reads and number of bases. Our application also makes use of charts depicting quality scores by number of bases, the k-mer spectrum and the GC content. From a user's standpoint, there are no installers and no external requirements (libraries/virtual machines): users only have to visit the website using an HTML5-compliant browser. MuffinInfo can also run entirely offline. Additionally, the data and the information are kept private because the input is processed on the user's machine. In our experience, HTML5 is a viable solution for developing easy-to-use, fast software across a large range of devices and operating systems. Website (ongoing development):
N29 - Using De Novo Protein Structure Predictions to Benchmark Very Large Multiple Sequence Alignments
Gearoid Fox, University College Dublin, Ireland
Fabian Sievers, University College Dublin, Ireland
Desmond G Higgins, University College Dublin, Ireland
Short Abstract: Multiple sequence alignments with large numbers of sequences are now commonplace. However, current multiple alignment benchmarks are ill-suited for testing these types of alignments, as test cases either contain a very small number of sequences or are based purely on simulation rather than empirical data. We take advantage of recent developments in protein structure prediction methods to create a benchmark for protein multiple sequence alignments containing many thousands of sequences in each test case, and which is based on empirical biological data. We rank popular multiple sequence alignment methods using this benchmark and verify a recent result showing that "chained" guide trees increase the accuracy of progressive alignment packages on data sets with thousands of proteins.
N30 - IgTools: a toolkit for construction of antibody repertoire using immunosequencing data
Yana Safonova, Saint-Petersburg State University, Russia
Ekaterina Starostina, Saint-Petersburg State University, Russia
Short Abstract: The analysis of concentrations of circulating antibodies in serum (the antibody repertoire) is a fundamental, yet poorly studied, problem in immunoinformatics. Antibodies are not directly encoded in the germline but are extensively diversified by somatic recombination and mutations. These somatic processes make antibodies highly repetitive, which complicates reconstruction of the original repertoire from immunosequencing data. In particular, sequencing errors and natural variations look very similar, which makes standard error-correction tools inapplicable and poses new computational challenges. On the other hand, the antibody repertoire is unique to each individual, which results in a lack of gold-standard datasets for validating repertoire construction algorithms. To address these challenges, we developed the IgTools toolkit, which includes IgRepertoireConstructor (an algorithm for antibody repertoire construction), IgQUAST (a quality assessment tool for antibody repertoires), and IgSimulator (a versatile immunosequencing simulator).
N31 - Informed K-mer selection for de novo transcriptome assembly
Dilip Durai, Saarland University, Germany
Marcel H Schulz, Cluster of Excellence on Multimodal Computing and Interaction, Saarland University and Max Planck Institute for Informatics, Germany
Short Abstract: Motivation: The de Bruijn graph (DBG) is widely used as an underlying data structure for de novo transcriptome assembly. An important parameter that strongly influences the assembly is the length k of the exact words (k-mers) used. It has been shown that merging assemblies from different k-mers improves sensitivity compared to a single-k assembly. However, no studies have thoroughly investigated the problem of automatically learning at which k value to start and stop the assembly; instead, a suboptimal selection of k values is often used by practitioners.
Results: Here we investigate in detail the contribution of each k-mer in a multi k-mer based assembly. A k-mer length which provides the most distinct non-erroneous k-mers to the assembly is taken as the starting k-mer. We find that an analysis after transcript clustering of related assemblies allows estimating the importance of an additional k-mer assembly. We show that a model fit based approach with model selection works well for predicting the k value at which no further assemblies are necessary. Our approach is applicable to datasets with different coverage and read lengths.
Conclusion: We provide a parameter-free approach to predict an optimal k-mer range for multi-k-mer based de novo transcriptome assembly without compromising assembly quality. This is a step toward making multi-k-mer based methods more reliable.
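The "most distinct non-erroneous k-mers" criterion for choosing the starting k can be pictured with a simple counting sketch (a simplified illustration; the actual method's error model and clustering analysis are not reproduced here, and `min_count` is an assumed solidity threshold):

```python
from collections import Counter

def distinct_solid_kmers(reads, k, min_count=2):
    """Count k-mers across reads and return the number of distinct
    'solid' k-mers (seen at least min_count times), a simple proxy
    for non-erroneous k-mers."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return sum(1 for c in counts.values() if c >= min_count)

reads = ["ACGTACGTAC", "CGTACGTACG", "GTACGTACGA"]
scores = {k: distinct_solid_kmers(reads, k) for k in (3, 5, 7)}
best_k = max(scores, key=scores.get)  # candidate starting k
```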
N32 - Goldilocks: Locating genomic regions that are "just right"
Sam Nicholls, Aberystwyth University,
Short Abstract: We present Goldilocks, a Python package which provides users with functionality for locating "interesting" subregions within a genome for analysis. Goldilocks was developed to support our work in the investigation of quality control for genetic sequencing. It was used to quickly locate regions on the human genome that expressed a desired level of variability, which were "just right" for later variant calling and comparison.

To enhance Goldilocks, the package has since been made more flexible and can be used to find regions of interest based on any arbitrary criteria such as GC-content, density of target k-mers, defined confidence metrics and missing nucleotides. Once a census of all regions has been conducted, results may be sorted by mathematical operations such as min, max or distance from the median, mean or some other target. Goldilocks can be used to highlight regions of sequences that feature the most missing nucleotides, the lowest GC-content ratio or those containing a number of SNPs within some distance of the overall mean or median.

We give examples of how Goldilocks may also be used in the context of a metagenomics pipeline.

Goldilocks is freely available open-source software hosted at and documentation can be found via
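The census-then-sort workflow described above might look roughly like this (a hypothetical sketch, not the Goldilocks API; window size, step and the GC metric are illustrative):

```python
import statistics

def census(seq, size, step, metric):
    """Score every window of `size` bases, sliding by `step`."""
    return [(i, metric(seq[i:i + size]))
            for i in range(0, len(seq) - size + 1, step)]

def closest_to_median(regions):
    """Rank regions by distance from the median score ('just right')."""
    med = statistics.median(score for _, score in regions)
    return sorted(regions, key=lambda r: abs(r[1] - med))

gc = lambda s: (s.count("G") + s.count("C")) / len(s)
regions = census("ATATGCGCGCATATATGGCC", size=4, step=4, metric=gc)
ranked = closest_to_median(regions)  # first entry is nearest the median GC
```

Swapping in a different `metric` (missing nucleotides, target k-mer density, SNP counts) or a different sort target (min, max, distance from mean) covers the other use cases the abstract lists.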
N33 - Is “cheap” whole genome sequencing good enough for everyday drug discovery research?
Thomas Schlitt, Novartis, Switzerland
Robert Bruccoleri, Novartis, United States
Stine Buechmann-Moller, Novartis, Switzerland
Nicole Cheung, Novartis, Switzerland
Anita Fernandez, Novartis, Switzerland
Nicole Hartmann, Novartis, Switzerland
Yunsheng He, Novartis, United States
Xiaoyu Jian, Novartis, United States
Li Lei, Novartis, United States
Bolan Linghu, Novartis, United States
Thomas Morgan, , United States
Kevin Sloan, Novartis, United States
Jill Somers, Novartis, United States
Frank Staedtler, Novartis, Switzerland
Marc Sultan, Novartis, Switzerland
Joseph Szustakowsk, Novartis, United States
Marie Waldvogel, Novartis, Switzerland
Daniela Wieser, Novartis, Switzerland
Fan Yang, Novartis, United States
Xiaoujun Zhao, Cambridge, Switzerland
Short Abstract: With Illumina’s introduction of the HiSeq X sequencing platform, the cost of whole genome sequencing (WGS) is nearing $1000 per genome. Several providers have started offering sequencing services making it possible to perform WGS for (smaller) clinical study cohorts, thus supplementing and potentially replacing other genotyping platforms such as whole exome sequencing (WES) and array based genotyping. We have recently conducted a pilot study to compare the performance and genotyping quality of the HiSeq X platform against WES and array based genotyping with respect to a number of applications and scientific questions using samples from clinical studies.
In total we sequenced 115 individuals at 30x coverage; the WGS quality was very good and we observed high concordance of genotype calls across the different platforms (>99%). In addition, WGS enables deeper analysis than the other platforms, including identification of novel SNPs and indels outside of coding regions.
Data management and interpretation of the phenotypic effects of the sequence variants observed outside exons are still challenging, but given the increasing availability of affordable WGS, we expect to see rapid progress in probing the effects of non-coding variants.
N34 - The Comparison of Next Generation Sequencing Technologies for the application of 16S and shotgun Data
Adam Clooney, Alimentary Pharmabiotic Centre, Ireland
Marcus Claesson, Alimentary Pharmabiotic Centre, Ireland
Short Abstract: Next generation sequencing approaches have moved microbial ecology research from cell culture techniques to the characterisation of complex bacterial communities. Links between the microbiome and many disease states have been identified, including Inflammatory Bowel Disease, obesity and colorectal cancer, along with psychological disorders.
Studies investigating the composition of the gut microbiota have found clear compositional differences between individuals in healthy and diseased states. 16S analysis of complex microbial communities has proved very popular due to the results obtained (composition, alpha and beta diversity), as well as being cost effective, particularly when analysing a large number of samples. However, the 16S gene has nine different variable regions, and therefore a number of different primer targets. There are also a number of different next generation technologies available, including the Illumina MiSeq, Ion Torrent PGM and Roche 454.
This study compares, through the use of human and mock samples, the results of sequencing various 16S regions across the three technologies mentioned above. The study also investigates the use of the Illumina MiSeq and the PGM for shotgun metagenomics analysis, as well as comparing the 16S and shotgun microbial composition results.
N35 - Two-pass RNAseq alignment improves sensitivity
Brendan Veeneman, University of Michigan, United States
Arul Chinnaiyan, University of Michigan, United States
Alexey Nesvizhskii, University of Michigan, United States
Short Abstract: Analysis in most RNA sequencing projects begins with spliced alignment to the genome, and most spliced aligners permit use of a reference gene annotation database. Aligners use this gene annotation to require less stringent evidence for reads spanning known splice junctions, in contrast with novel, discovered splice junctions. This has the effect of allowing reads to align over known introns by fewer spanning nucleotides, and improves alignment sensitivity.

Recently, members of the field have suggested use of two-pass RNAseq alignment, in which splice junctions are discovered in a first alignment pass, and then used as annotation in a second alignment pass. This alignment framework is expected to improve alignment sensitivity over novel splice junctions, at the cost of some alignment errors, but these effects have not been quantitatively assessed.

Here we describe our work detailing the sensitivity and specificity effects of two-pass alignment. We used STAR to align 50nt and 100nt paired-end RNAseq libraries to the human genome, with and without gene annotation, and found that use of annotation improves read depth over unannotated splice junctions approximately two-fold in 50nt libraries, and about 25% in 100nt libraries. Alignment errors are introduced, but can be easily identified and eliminated by simple modeling. We conclude that two-pass alignment dramatically improves quantification of novel splice junctions.
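A toy model of the two-pass logic may help fix ideas (hypothetical thresholds and data layout; STAR's actual overhang parameters and alignment machinery are far richer):

```python
def spans_junction(read_pos, read_len, junction, overhang_needed):
    """A read supports a junction only if it extends at least
    `overhang_needed` bases on both sides of the splice site."""
    left = junction - read_pos
    right = read_pos + read_len - junction
    return left >= overhang_needed and right >= overhang_needed

def two_pass(reads, known_junctions, novel_overhang=8, annotated_overhang=3):
    """reads: (start, length, junction_position) tuples on one reference."""
    # pass 1: junctions discovered under the strict (novel) threshold
    discovered = {j for pos, length, j in reads
                  if spans_junction(pos, length, j, novel_overhang)}
    annotated = known_junctions | discovered
    # pass 2: re-align, requiring only the lenient overhang at
    # junctions that were annotated or discovered in pass 1
    return [(pos, length, j) for pos, length, j in reads
            if spans_junction(pos, length, j,
                              annotated_overhang if j in annotated
                              else novel_overhang)]

reads = [(10, 50, 46),   # 36 nt left / 14 nt right of the junction
         (0, 50, 46)]    # only 4 nt right: fails the strict threshold
supported = two_pass(reads, known_junctions=set())
```

In this sketch the first read discovers the junction in pass 1, and pass 2 then rescues the short-overhang second read, mirroring the read-depth gain the abstract reports over novel junctions.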
N36 - Simple Rapid RNA-seq Analysis with Unique Gapped q-Grams
Sven Rahmann, University of Duisburg-Essen, Germany
Short Abstract: We present a new simple approach to RNA-seq gene expression analysis that avoids separate read mapping and feature counting by constructing an index with the following property:
Each gene (exons and exon junctions) is represented by its q-grams (substrings of length q, e.g. q=16), or, more generally, by gapped q-grams with a given shape.
These sets of q-grams are reduced to gene-specific ones, i.e., all q-grams that occur in more than one gene are discarded.
Now each of the 4^q possible q-grams is either not present, specific for a gene, or present in more than one gene.
We build an index that recognises the specific q-grams and maps them to their respective genes.
Optimisation of the q-gram mask results in high sensitivity and specificity, as we show with several examples.

Read mapping becomes particularly simple:
We iterate over a read's q-grams and count the number of hits to each gene. Careful analysis allows us to pick the correct gene (or declare the read unmappable or ambiguous) at unprecedented speed.
We thus obtain raw gene counts in a much simpler and computationally less demanding way than with standard approaches.

Further analysis (e.g., differential expression, implicated pathways) can proceed as before.
The poster compares the running time and the obtained counts resulting from our method and standard methods, showing that we achieve equivalent results with much less computational work.

We also outline possible extensions of the approach, including variant-tolerance and fusion gene detection.
Software will be made available under the MIT license.
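The idea of the gene-specific q-gram index can be sketched in a few lines of Python. This is a toy illustration only (function names and the minimum-hit threshold are our own, not from the poster's software), using contiguous q-grams rather than gapped shapes:

```python
from collections import defaultdict

def build_unique_qgram_index(gene_seqs, q):
    """Map each q-gram to the single gene it occurs in; q-grams
    occurring in more than one gene are discarded."""
    owner = {}
    for gene, seq in gene_seqs.items():
        for i in range(len(seq) - q + 1):
            qgram = seq[i:i + q]
            if owner.get(qgram, gene) == gene:
                owner[qgram] = gene
            else:
                owner[qgram] = None   # seen in 2+ genes: not specific
    return {qg: g for qg, g in owner.items() if g is not None}

def assign_read(read, index, q, min_hits=2):
    """Count unique-q-gram hits per gene; return the winning gene,
    or None if the read is unmappable or ambiguous."""
    hits = defaultdict(int)
    for i in range(len(read) - q + 1):
        gene = index.get(read[i:i + q])
        if gene is not None:
            hits[gene] += 1
    if not hits:
        return None                   # unmappable
    best = max(hits, key=hits.get)
    ties = list(hits.values()).count(hits[best])
    return best if hits[best] >= min_hits and ties == 1 else None
```

Because the lookup is a single hash-map access per q-gram, read assignment costs a handful of dictionary operations per read, with no alignment step.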
N37 - Improved ab initio detection of approximate repeats in genomic sequences
Charlotte Schaeffer, Miami University, United States
Nathan Figueroa, Miami University, United States
Xiaolin Liu, Miami University, United States
John Karro, Miami University, United States
Short Abstract: Here we incorporate the spaced seed methodology into the RAIDER ab initio repeat-finding tool, improving RAIDER’s sensitivity to approximately repetitive sequences without significant runtime cost. The detection of repetitive DNA is a challenging problem, further complicated by the dynamic nature of the genome. Over long periods of time, originally identical repetitive sequences accumulate substitutions, creating approximate repeats - repeats that are similar but no longer identical. Approximate repeats better exemplify actual repetitive sequences, but they are harder to find. The task is further complicated if we wish to avoid working from a pre-constructed library (as used by tools such as RepeatMasker) – necessary in order to make possible the discovery of new repeat families.

Spaced seeds are patterns describing which positions in two sequences must match, and which positions are not so constrained, in order for them to be considered a “valid” match. Introduced by Ma, Tromp, and Li in their PatternHunter homology-search tool, the use of spaced seeds has been shown to considerably improve performance and result quality as compared to searching for short exact matches.

RAIDER displays comparable sensitivity to that of RepeatScout, while maintaining a significantly better runtime (e.g. 113 and 759 seconds, respectively, on human chromosome 22). We modified RAIDER’s original algorithm to incorporate the use of spaced seeds, motivated by the significant effect this approach had on sensitivity and speed in homology-search tools. The enhanced RAIDER tool retains the original’s time efficiency and displays a consistent improvement of 10-15% in sensitivity on the human genome.
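A spaced seed is easy to state concretely. The sketch below is our own toy illustration (not RAIDER code): a 0/1 seed mask where '1' positions must match and '0' positions are "don't care", used to scan a text for approximate occurrences of a pattern:

```python
def seed_match(s1, s2, seed):
    """True if s1 and s2 agree at every '1' position of the seed;
    '0' positions are the seed's don't-care positions."""
    return all(a == b for a, b, c in zip(s1, s2, seed) if c == "1")

def seed_hits(text, pattern, seed):
    """All offsets in text where pattern matches under the seed."""
    k = len(seed)
    return [i for i in range(len(text) - k + 1)
            if seed_match(text[i:i + k], pattern, seed)]
```

With seed `1101`, `ACGT` and `ACTT` match despite the substitution at the don't-care position, while the contiguous seed `1111` misses it — the effect that improves sensitivity to approximate repeats.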
N38 - GenomePatternScan: Computational identification of genome-wide binding sites for FOXD1
Clare Bates Congdon, Maine Medical Center Research Institute, United States
Samuel McFarland, University of Southern Maine, United States
Jennifer Fetting, Maine Medical Center Research Institute, United States
Craig R. Lessard, University of Southern Maine, United States
Jeffrey A Thompson, Dartmouth Medical School, United States
Christine W. Duarte, Maine Medical Center Research Institute, United States
Leif Oxburgh, Maine Medical Center Research Institute, United States
Short Abstract: We have developed GenomePatternScan (GPS), a computational tool that identifies the locations of a transcription factor binding site (or another DNA pattern, specified by the user) throughout a genome or other long genetic sequences. In this work, we use GPS to identify putative genes regulated by FOXD1, an important transcription factor in kidney development. As input, the program requires the pattern to search for and sequence data to search through. The transcription factor may be represented using the standard A, C, G, T abbreviations or the extended IUPAC notation. The program uses gene annotation data to identify where matches occur relative to known coding regions. As a default, genomes for human, rat, and mouse are provided, along with their annotations; other genomes can easily be added. Output from the system includes the genetic context of each candidate hit as well as a link into the UCSC Genome Browser to simplify further investigation of the genomic context. Recent additions to GPS include using synteny to look for hits in the same gene across different species and the ability to input a listing of particular genes of interest, such as the results from microarray experiments.

Using GPS, we identified 512 candidate locations of the FOXD1 binding site in the noncoding regions for the same genes in human, rat, and mouse. We further reduced this listing by cross-referencing with literature searches and are confirming the resulting short list of genes at the bench using qRT-PCR and chromatin immunoprecipitation (ChIP).
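The pattern-scanning core of such a tool can be sketched by translating extended IUPAC codes into regular-expression character classes. This is a minimal illustration, not the GPS implementation, and the pattern in the example is a hypothetical forkhead-like motif, not the exact FOXD1 binding site:

```python
import re

# Extended IUPAC nucleotide codes -> regex character classes
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "[AG]", "Y": "[CT]", "S": "[GC]", "W": "[AT]",
         "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
         "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]"}

def scan_for_site(sequence, pattern):
    """Return the 0-based start positions of every match of an
    IUPAC-encoded binding-site pattern in a sequence."""
    regex = re.compile("".join(IUPAC[c] for c in pattern.upper()))
    return [m.start() for m in regex.finditer(sequence)]
```

In a real scan the reported positions would then be cross-referenced against gene annotation to place each hit relative to known coding regions.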
N39 - Comparison of Variance Stabilization Methods for Kinome Microarray Data
Farhad Maleki, University of Saskatchewan , Canada
Brett Trost, Department of Computer Science, University of Saskatchewan, Canada
Anthony Kusalik, Department of Computer Science, University of Saskatchewan, Canada
Short Abstract: Phosphorylation and dephosphorylation are ubiquitous post-translational modifications of proteins. Kinome microarray technology allows measurement of the phosphorylation activity of hundreds of proteins in a single experiment. However, these measurements suffer from a heterogeneous variance, which is a formidable challenge for kinome microarray data analyses.

Variance stabilization methods are often used to alleviate heterogeneous variance. Log-transformation, variance-stabilizing normalization (VSN), and variance-stabilizing transformation (VST) are the most common variance stabilization methods used by the microarray community.

Kinome microarrays typically differ from DNA microarrays in terms of the number of within-array replicates, the number of probes, and the functional dependence of the biomolecules represented by those probes. Thus, the variance stabilization method that works best for DNA microarrays may not work best for kinome arrays. Here, we therefore investigated which variance stabilization method should be used to deal with heterogeneity of variance in kinome microarrays.

Since the main purpose of kinome microarray experiments is to detect differentially phosphorylated peptides, we were interested in variance stabilization methods that minimize the error in identifying differentially phosphorylated peptides. Thus, we used a set of synthetic kinome arrays for which we had a priori knowledge of their differentially phosphorylated peptides. We applied each variance stabilization method to each pair of synthetic arrays, and then used the Wilcoxon signed-rank test to detect differential phosphorylation. The specificity, sensitivity, and accuracy of differential phosphorylation detection were then calculated for each variance stabilization method. The results showed that VSN achieved the best result for all three criteria.
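Of the three methods, log-transformation is simple enough to sketch. The toy example below is our own illustration (not the poster's benchmark): two replicate probes whose noise scales with signal have very different raw spreads, which the log2 transform roughly equalizes:

```python
import math
from statistics import pstdev

def log_transform(intensities, pseudocount=1.0):
    """Simple log2 variance stabilization; VSN and VST, the other two
    methods compared on the poster, need dedicated packages."""
    return [math.log2(x + pseudocount) for x in intensities]

# Replicate spots with multiplicative noise (heterogeneous variance):
low  = [100, 110, 90, 105]          # low-intensity probe
high = [10000, 11000, 9000, 10500]  # high-intensity probe, 100x spread

raw_ratio = pstdev(high) / pstdev(low)                            # ~100
log_ratio = pstdev(log_transform(high)) / pstdev(log_transform(low))
```

After transformation the two probes have nearly equal spread (`log_ratio` close to 1), which is the precondition for tests like the Wilcoxon signed-rank comparison used in the study.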
N40 - A novel method for RNA secondary structure prediction based on evolutionary conservation
Josef Panek, MBU AV CR, Czech Republic
Jan Hajič, MFF UK, Czech Republic
David Hoksza, MFF UK, Czech Republic
Short Abstract: We employ information about the evolutionary conservation of RNAs for RNA secondary structure prediction. For evolutionarily related RNAs, we identify conserved structural segments using multiple sequence alignment, cut them from a known RNA structure and paste them into the predicted structure. The remaining structural segments, showing weak or no conservation, are predicted de novo using standard prediction algorithms. We present our novel method and demonstrate its efficiency and robustness by predicting the secondary structure of eukaryotic ribosomal RNAs (rRNAs). rRNAs are among the most essential biological molecules, as they form the structural core of the protein-synthesizing unit, the ribosome. Knowledge of rRNA structure is therefore necessary for understanding the mechanisms of gene translation. Experimental identification of rRNA structure is technically very difficult, and computational prediction is hindered by the extreme length of rRNA sequences. Thus, only a few eukaryotic rRNA structures are available so far. Here, alongside the novel prediction method, we show previously unknown structures of eukaryotic rRNAs predicted by it.
N41 - High-throughput Clinical NGS Data Analysis on the Cloud
Yassine Souilmi, Faculty of Sciences of Rabat, Morocco
Alex Lancaster, Harvard Medical School, United States
Jae-Yoon Jung, Harvard Medical School, United States
Ettore Rizzo, University of Pavia, Italy
Dennis Wall, Stanford University, United States
Peter Tonellato, Harvard Medical School, United States
Short Abstract: The dramatic fall in next generation sequencing (NGS) costs in recent years positions the price within the range of typical medical tests, and thus whole genome analysis (WGA) may be a viable clinical diagnostic tool. Modern sequencing platforms routinely generate petabytes of data. The current challenge lies in calling and analyzing this large-scale data, which has become the new time and cost rate-limiting step. To address the computational limitations and optimize cost, we have developed COSMOS, a scalable, parallelizable workflow management system running on cloud services. Using COSMOS, we have constructed an NGS analysis pipeline implementing the Genome Analysis Toolkit (GATK v3.1) best practice protocol, a widely accepted industry standard developed by the Broad Institute. COSMOS performs a thorough sequence analysis, including quality control, alignment, variant calling and an unprecedented level of annotation using a custom extension of ANNOVAR. COSMOS takes advantage of parallelization and the resources of a high-performance compute cluster, either local or in the cloud, to process datasets of up to petabyte scale, which is becoming standard in NGS. This approach enables timely and cost-effective NGS analysis, allowing it to be used in clinical settings and translational medicine. With COSMOS we reduced the cost of whole genome data analysis below the $100 barrier, placing it within a reimbursable cost point and in clinical time, significantly changing the landscape of genomic analysis and cementing the utility of cloud environments as a resource for petabyte-scale genomic research.
N42 - Computational identification and biological validation of FOXJ1 regulatory regions in Strongylocentrotus purpuratus
Craig Lessard, Maine Medical Center Research Institute , United States
Christopher M. McCarty, The Jackson Laboratory, United States
Jeffrey A. Thompson, Dartmouth College, United States
Samuel McFarland, University of Southern Maine, United States
James A. Coffman, Mount Desert Island Biological Laboratory, United States
Robert L. Morris, Wheaton College, United States
Clare Bates Congdon, Maine Medical Center Research Institute, United States
Short Abstract: In this work, we used computational methods to identify candidate regulatory regions in proximity to the FOXJ1 gene. We then investigated these candidates in the sea urchin Strongylocentrotus purpuratus using a DNA-tag reporter system. We found that 3/9 candidate regions were biologically active at the time points measured.

S. purpuratus is a model species for studying embryonic cilia development, and the study of S. purpuratus may lead to advances in treatment for cilia-related diseases in humans. The FOXJ1 gene is a member of the forkhead family of transcription factors, and presumably functions as a master regulator of ciliary genes. Our goal is to computationally identify candidate regulatory elements for the FOXJ1 gene and investigate the expression of these elements at 14, 16, and 18 hours post-fertilization, the period when cilia first form. We used GAMI (Genetic Algorithms for Motif Inference) to identify candidate regulatory regions in silico. GAMI uses a genetic algorithm search to identify motifs in common (putative conserved elements) across the noncoding DNA sequences of divergent species. GAMI evolves hundreds of candidate functional elements, which are assembled into candidate CRMs using the complementary tool GAMI-CRM. Here, we generated a listing of 12 candidate CRMs for the noncoding regions surrounding the FOXJ1 gene in S. purpuratus, which were then investigated in vivo. Of the 12 candidate CRMs, 9 were successfully amplified and ligated to reporter constructs, then microinjected into egg cytoplasm directly following fertilization. This assay revealed that 3 of the 9 candidate CRMs were biologically active at all three time points.
N43 - Reconstruction of B cell lineages
Barbera Van Schaik, Academic Medical Center, Netherlands
Marieke Doorenspleet, Academic Medical Center, Netherlands
Paul Klarenbeek, Academic Medical Center, Netherlands
Rebecca Esveldt, Academic Medical Center, Netherlands
Niek de Vries, Academic Medical Center, Netherlands
Antoine van Kampen, Academic Medical Center, Netherlands
Short Abstract: T and B cells of the adaptive immune system are highly specialized cells that orchestrate a targeted attack against pathogens. Variability in the T and B cell repertoire is introduced by somatic rearrangement of the receptor genes to recognize many antigens. In addition, B cells can increase their specificity by mutating their receptor to improve recognition of the antigen. Understanding this process might greatly enhance our understanding of immune responses in health and disease.

We will present a method to reconstruct the mutation events of B cell receptors (BCRs) from sequencing data. The resulting lineage tree is represented as a directed graph where the nodes represent unique BCR sequences and the edges parent-child relations. The properties of such trees have previously been shown to correlate with biological processes.

We have used the mutation positions within sequences to determine parent-child relationships and gathered more evidence for a relation by using every sequence as reference once. With our method we were able to obtain reconstructed trees that were 94-100% similar to simulated trees, which is a result we couldn't obtain when using distance based methods alone or by performing a single pass comparison between all sequences with an outgroup.
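As a minimal sketch of how mutation positions can order sequences into parent-child relationships (our own simplification, assuming a known germline reference and point mutations only — the actual method additionally re-uses every sequence as a reference):

```python
def mutations(seq, germline):
    """Positions where a BCR sequence differs from the germline."""
    return {i for i, (a, b) in enumerate(zip(seq, germline)) if a != b}

def lineage_edges(seqs, germline):
    """Draw a parent->child edge when the parent's mutation set is a
    proper subset of the child's (mutations accumulate along the
    lineage), choosing the closest available parent."""
    muts = {s: mutations(s, germline) for s in seqs}
    edges = []
    for child in seqs:
        # candidate parents: proper subsets of the child's mutation set
        parents = [p for p in seqs if p != child and muts[p] < muts[child]]
        if parents:
            # closest parent = the one sharing the most mutations
            edges.append((max(parents, key=lambda p: len(muts[p])), child))
    return edges
```

For the sequences AAAA (germline), CAAA, and CAGA, this recovers the chain germline → CAAA → CAGA rather than attaching both mutants directly to the germline.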

The tree reconstruction method will be used to analyze the B cell repertoire of patients with an auto-immune disease to see if the mutation process is different compared to healthy individuals. This method might also be useful for the analysis of other somatic mutation events, for example in tumor material.
N44 - Combining structure probing experiments on RNA mutants with evolutionary information reveals RNA-binding interfaces
Vladimir Reinharz, , Canada
Yann Ponty, École Polytechnique, France
Jérôme Waldispühl, McGill University, Canada
Short Abstract: Systematic structure probing experiments (e.g. SHAPE) of RNA mutants such as the mutate-and-map protocol give us a direct access to the genetic robustness of ncRNA structures. Comparative studies of homologous sequences provide a distinct, yet complementary, approach to analyze structural and functional properties of non-coding RNAs.
In this paper, we introduce a formal framework to combine the biochemical signal collected from mutate-and-map experiments with the evolutionary information available in multiple sequence alignments. We apply neutral theory principles to detect complex long-range dependencies between nucleotides of a single-stranded RNA, and implement this technology in a software tool called aRNhAck. More precisely, we first use mutate-and-map data to identify the mutations that most destabilize the native structure of the molecule (i.e. the mutations associated with the most divergent SHAPE profiles). Then, we retrieve from RNA multiple sequence alignments (Rfam database) homologous sequences containing those destabilizing mutations, and compare their nucleotide distribution to the background distribution observed in the Rfam multiple sequence alignment. Finally, the ensemble of positions with highest mutual information is used to reveal nucleotide networks of functional dependencies. We illustrate the biological significance of this signal and show that the nucleotide networks calculated with aRNhAck are correlated with nucleotides located in RNA-RNA and RNA-protein interfaces. aRNhAck is freely available at
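The mutual-information step can be made concrete. The snippet below is a generic MI computation over two alignment columns (not aRNhAck's code): it returns 1 bit for perfectly co-varying columns and 0 for independent ones:

```python
import math
from collections import Counter

def mutual_information(col_i, col_j):
    """MI between two alignment columns (lists of nucleotides, one per
    homologous sequence); high MI suggests co-varying, dependent
    positions such as base-paired or interface nucleotides."""
    n = len(col_i)
    pi, pj = Counter(col_i), Counter(col_j)
    pij = Counter(zip(col_i, col_j))
    return sum((c / n) * math.log2(c * n / (pi[a] * pj[b]))
               for (a, b), c in pij.items())
```

Columns AACC and GGTT always change together (MI = 1 bit), whereas AACC and GTGT carry no information about each other (MI = 0).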
N45 - Supercomputing for fusion gene detection on K computer
Satoshi ITO, The University of Tokyo, Japan
Yuichi SHIRAISHI, The University of Tokyo, Japan
Teppei SHIMAMURA, Nagoya University, Japan
Satoru MIYANO, The University of Tokyo, Japan
Short Abstract: Fusion genes play an important role in cancer development. Although recent advances in high-throughput sequencing technologies have made it possible to obtain a genome-wide landscape of fusion genes, sensitive and accurate identification remains a challenging task: misaligned reads arising from the ambiguity of genomic sequences generate numerous false positive detections. From a practical point of view, databases of sequenced DNA are growing so dramatically that parallel computation is inevitable, and its efficiency is one of the most important problems.
Shiraishi et al. developed Genomon-fusion, a fusion gene detection pipeline. Their approach characterizes each gene fusion at single-base resolution by effectively utilizing soft-clipped short reads, and reduces false positives by applying a number of filters.
In this study, we ported Genomon-fusion onto the K computer (GFK), the fastest supercomputer in Japan, installed at the RIKEN Advanced Institute for Computational Science. Such a large-scale supercomputer enables us to analyze very large numbers of samples from large databases such as CCLE and TCGA. Load balancing, OpenMP and MPI parallelization, and data handling between home and external storage are covered in this work. We performed fusion detection for 780 samples, comprising all RNA-seq data in CCLE. Results and a calculation summary will be demonstrated.
N46 - PeakRescue: a read rescue strategy for ambiguously mapped RNA-Seq reads
Christelle Robert, Roslin Institute,
Mick Watson, Roslin Institute,
Shriram Bhosle, Cancer Genome Project, Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton CB10 1SA, UK,
Short Abstract: Motivation: RNA sequencing produces millions of reads, a large proportion of which map to multiple genomic locations due to inherent genomic complexity. In addition, genes often overlap with one another, leading to ambiguously mapped reads at many genomic loci. Identifying the exact locations from which these reads originate is of major importance for maximising transcriptome coverage and accurately quantifying gene expression. Several approaches have been applied to rescue multimapped and ambiguously mapped reads during and after read mapping to a reference genome. Strategies include random assignment to one or a few genomic locations, uniform distribution across multiple loci, and probabilistic assignments. Additionally, in many studies, ambiguous multimapped reads are disregarded, thus limiting the library space to the uniquely mapped reads.
Results: We propose peakRescue, a novel read assignment method that uses uniquely mapped reads to disambiguate and rescue multimapped reads in RNA-Seq analysis. The peakRescue pipeline assigns ambiguously mapped reads to genes based on the highest unique per-base read coverage for a gene. This strategy accounts for non-uniform read coverage along the transcriptome, an observation intrinsic to RNA-Seq experiments. PeakRescue shows strong performance on simulated data, as measured with the F0.5 metric, as well as strong agreement with the expected expression trend when applied to a real human RNA-seq dataset. Additionally, peakRescue FPKM values are highly correlated with the expected expression values. When compared with other count-based read-to-gene summarisation methods, peakRescue performs best, with the highest Pearson correlation coefficients with the ground truth at various thresholds of ambiguous read proportions.
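The stated assignment rule can be sketched as follows. This is a heavily simplified illustration with invented names; the real pipeline operates on full per-base coverage profiles derived from alignments:

```python
def rescue_read(candidate_genes, unique_coverage):
    """Assign an ambiguously mapped read to the candidate gene whose
    per-base coverage from uniquely mapped reads has the highest peak.

    unique_coverage: dict mapping gene -> list of per-base depths
    computed from uniquely mapped reads only."""
    peaks = {g: max(unique_coverage[g], default=0)
             for g in candidate_genes}
    return max(peaks, key=peaks.get)
```

A read mapping ambiguously to two overlapping genes is thus credited to the gene with the stronger unique-read evidence, rather than being split uniformly or discarded.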
N47 - Deep Learning Based Sequence Feature Extraction and Its Application to Bias Correction for RNA-seq Data
Yaozhong Zhang, The University of Tokyo, Japan
Rui Yamaguchi, The University of Tokyo, Japan
Seiya Imoto, The University of Tokyo, Japan
Satoru Miyano, The University of Tokyo, Japan
Short Abstract: Feature extraction from biological data is an initial and fundamental step in many bioinformatics analyses. With recent developments in deep learning techniques, applying deep learning to biological data is gaining attention in the bioinformatics community.

In this work, we propose to use a Restricted Boltzmann Machine (RBM) for the problem of bias correction in RNA-seq data, which extracts features of genomic data in a non-linear space and reduces the dimensionality of input sequences. To detect and adjust positional biases in RNA-seq data, the probabilities of contextual sequences at the beginning and middle positions of RNA reads are calculated as the foreground and background information used for bias correction. Because the number of potential sequence combinations is enormous (4^|sequence length|), direct calculation for each sequence type under the foreground and background conditions suffers from data sparseness. Previous work, Cufflinks for example, used simplified model structures to reduce the number of model parameters. We provide an alternative solution to this problem, applying deep learning methods to extract compact encodings of the original sequences and estimate bias weights for each read.

We demonstrate the performance of our approach on simulated data and commonly used real RNA-seq data, and compare it with other bias correction methods.
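The foreground/background contrast can be illustrated by direct k-mer counting — the very baseline whose sparseness motivates the RBM encoding. Names and the toy k below are our own:

```python
from collections import Counter

def bias_weights(reads, k=3):
    """Contrast k-mer frequencies at read starts (foreground, affected
    by positional bias) with k-mers at read midpoints (background).
    The ratio background/foreground is a simple per-k-mer correction
    weight; with realistic k this table is sparse, which is what the
    RBM-based encoding is meant to overcome."""
    fg = Counter(r[:k] for r in reads)
    bg = Counter(r[len(r) // 2:len(r) // 2 + k] for r in reads)
    n_fg, n_bg = sum(fg.values()), sum(bg.values())
    return {kmer: (bg[kmer] / n_bg) / (fg[kmer] / n_fg)
            for kmer in fg if bg[kmer] > 0}
```

A weight above 1 indicates a k-mer under-represented at read starts relative to the background, and vice versa.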
N48 - Developing Next-Generation Sequencing Data Analysis S/W on HPC System
Ho-Youl Jung, Electronics and Telecommunications Research Institute, Korea, Rep
Dae-Hee Kim, Electronics and Telecommunications Research Institute, Korea, Rep
Minho Kim, Electronics and Telecommunications Research Institute, Korea, Rep
Jae-Hun Choi, Electronics and Telecommunications Research Institute, Korea, Rep
Short Abstract: We are developing genome analysis S/W optimized for performance on an HPC (High Performance Computing) system using heterogeneous computing resources: CPU (Central Processing Unit), GPGPU (General-Purpose Graphics Processing Unit), and Intel Xeon Phi. High-throughput sequencing (or next-generation sequencing) technologies produce enormous volumes of sequence data inexpensively, and an HPC system is needed to tackle such huge sequence data. We are therefore parallelizing the genome data analysis pipeline and developing novel applications for genome data analysis using heterogeneous computing resources. The genome data analysis S/W on the HPC system provides the following features:
- Sequence read data mapping using parallel computing resources (GPGPU, Intel Xeon Phi, multi-core processor)
- Genome variation detecting using large memory and parallel computing resources (GPGPU, Intel Xeon Phi, multi-core processor).
In order to parallelize the genome analysis pipeline efficiently under an HPC environment, we analyzed the CPU utilization pattern of each pipeline step. We found that sequence read mapping, especially sequence alignment, is compute-intensive and well suited for parallelization. We also found that manipulating SAM/BAM (Sequence Alignment Map/Binary sequence Alignment Map) files requires very large system memory resources. Finally, we parallelized the genome variation detection step, taking into account the characteristics of data partitioning for genome-wide data.
N49 - Container-based sequence data analysis workflow for reproducible research
Tazro Ohta, Database Center for Life Science, Japan
Osamu Ogasawara, DNA Data Bank of Japan, Japan
Short Abstract: Publishing raw data in public repositories enables researchers to reuse data and reproduce published results. However, the data analysis methods connecting raw data to published results are usually described in natural language in the materials and methods sections of articles, which often lack the information needed to execute the analysis workflow exactly as done in the original study. To achieve more accurate description and sharing of data analysis workflows for reproducible research, we developed a framework based on Docker, the container-based virtualization environment, and converted several high-throughput sequencing analysis workflows into sets of Docker containers to be executed on the framework. Apache Mesos is also deployed on our large-scale computing infrastructure as a resource manager, and we developed a job scheduler that communicates with Mesos to execute containers successively. The framework works in any computational environment where Docker and Mesos run, and helps users manage, share, and re-execute their tools and workflows. We will present the framework design and the challenges toward a better reproducible-research environment for computational biology.
N50 - Enhancing the biological and functional context of InterPro annotated proteins
Alex Mitchell, EMBL-EBI,
Hsin-Yu Chang, EMBL-EBI,
Matthew Fraser, EMBL-EBI,
Gift Nuka, EMBL-EBI,
Sebastien Pesseat, EMBL-EBI,
Amaia Sangrador-Vegas, EMBL-EBI,
Maxim Scheremetjew, EMBL-EBI,
Siew-Yit Yong , EMBL-EBI,
Robert Finn, EMBL-EBI,
Short Abstract: In the post-genomic era, the generation of sequence data is no longer a scientific bottleneck. As a result, proteins and their constituent domains are rarely considered in isolation. A wealth of data allows scientists to investigate wider biological contexts, such as the taxonomic distribution of proteins containing a set of domains, or the reconstruction of metabolic pathways in a particular proteome.

The InterPro database integrates diverse information about protein families, domains and functional sites. Based around an aggregation of eleven different databases, each InterPro entry (a collection of one or more member database signatures) is annotated with the biological relationships between entries, a descriptive abstract, database cross-references and Gene Ontology (GO) terms, which provide a controlled vocabulary with which to describe proteins from diverse environments.

A number of recent InterPro updates have aimed at broadening the biological and functional contexts of InterPro-derived data. These include the development of a domain architecture tool, allowing InterPro to be searched with a set of domains, returning all matching proteins. InterProScan, the tool that performs analysis of protein sequences against InterPro, has been improved through the addition of a sequence pre-filtering heuristic, resulting in a significant speed increase. InterProScan pathway cross-references have also been added, allowing metabolic pathways to be predicted for genomes/proteomes. Finally, substantial curation efforts have yielded increased InterPro coverage of sequence space.
N51 - A comprehensive survey of ncRNAs in the genome and transcriptome of the tropical parasite Trypanosoma cruzi
Mainá Lourenço, Universidade Federal do Rio de Janeiro, Brazil
Martin Smith, Garvan Institute for Medical Research, Australia
Dominik Kaczorowski, Garvan Institute for Medical Research, Australia
John Mattick, Garvan Institute for Medical Research, Australia
Gloria Franco, Universidade Federal de Minas Gerais, Brazil
Short Abstract: Trypanosoma cruzi is the etiologic agent of Chagas disease, a neglected tropical disease that mainly affects South America and leads to substantial socio-economic losses. This protozoan parasite was first reported in 1909, and in the following decades a variety of studies elucidated several unique aspects of its biology. In recent years the genomes of six T. cruzi strains have been sequenced, but their annotation remains very poor, especially regarding non-coding RNAs (ncRNAs). ncRNAs are known to have crucial roles in virtually all biological processes, and the constantly growing list of functions related to such molecules has influenced even the classical definition of a gene. To better elucidate the roles these molecules may play in a parasite with such a complex life cycle, we scanned a T. cruzi genome using similarity search methods to annotate ncRNA genes based on public databases. As a result, over 1500 candidates were identified, more than 40% of them new findings, many representing ncRNAs not previously explored in this parasite and thus worthy of further study. Publicly available RNA-Seq data have confirmed the expression of 300 of these candidates. We have since sequenced the set of T. cruzi RNAs both before and after gamma radiation exposure, and we are currently analyzing the differential expression patterns of the previously identified ncRNA candidates as well as newly identified genes. This work will shed light on the function of ncRNAs in the parasite's biology and possibly in its resistance to ionizing radiation.
N52 - Fast gene-expression quantification tool for massive RNA-sequence analysis
Yasunobu Okamura, , Japan
Kengo Kinoshita, Tohoku University, Japan
Short Abstract: RNA-seq is widely used to determine gene expression levels, and a large amount of RNA-seq data is now registered in the Sequence Read Archive. Reanalysis of these data is a promising approach to unraveling gene regulation and cellular systems, but since the number of registered datasets is rapidly increasing, the computational resources required for gene quantification are also growing. To address this problem, we developed a fast gene-expression quantification tool. Widely used quantification tools such as Cufflinks and eXpress are alignment-based and estimate transcript abundances, so they require a long time. Other tools, such as RNA-Skim and Sailfish, are alignment-free and faster than alignment-based methods, but they also estimate expression at the transcript level. In massive RNA-seq analyses, transcript-level expression data are not always required, because gene-level expression data have enough information to classify runs and calculate co-expression. We therefore developed a gene-level expression quantification tool using an alignment-free, N-gram hash map based method. We prepare unique N-grams for each gene; if a unique N-gram is found in an RNA-seq read, the read must come from the corresponding gene. Our method is 2 times faster than eXpress when processing Arabidopsis thaliana RNA-seq data.
N53 - SeqAn 2.0 -- The sequence analysis library
Hannes Hauswedell, Max Planck Institute for Molecular Genetics, Germany
Enrico Siragusa, Freie Universität Berlin, Germany
Knut Reinert, Freie Universität Berlin, Germany
Short Abstract: SeqAn is a free and open source C++ library of efficient algorithms and data structures with a focus on the analysis of next-generation sequencing data. Our library applies a unique generic design that guarantees high performance, platform independence, generality, extensibility, and easy integration with other libraries. SeqAn is accompanied by detailed API documentation and online tutorials.
By providing easy access to core bioinformatics algorithms and formats, SeqAn is well suited for the rapid development of novel methods. Its exceptional performance, on the other hand, has enabled it to serve as the basis of some of the fastest solutions to well-studied problems like read mapping and local alignment search. Furthermore, SeqAn-based applications can be used as nodes in the workflow system KNIME, allowing for the design of complex workflows by "drag'n'drop"!
Version 2.0 of the library has recently been released with countless fixes, many improvements and new modules. SeqAn is now supported by the CIBI (Center for Integrative Bioinformatics) which is part of the recently founded German Network for Bioinformatics Infrastructure (de.NBI).

For further information, please see:
N54 - Halvade: scalable sequence analysis with MapReduce
Jan Fostier, Ghent University, Belgium
Dries Decap, Ghent University-iMinds, Belgium
Joke Reumers, Janssen Pharmaceutica, Belgium
Charlotte Herzeel, Imec, Belgium
Pascal Costanza, Intel Corporation Belgium, Belgium
Jan Fostier, Ghent University - iMinds, Belgium
Short Abstract: Post-sequencing DNA analysis typically consists of read mapping followed by variant calling. Especially for whole genome sequencing, this computational step is very time-consuming, even when using multithreading on a multi-core machine. We present Halvade, a framework that enables sequencing pipelines to be executed in parallel on a multi-node and/or multi-core compute infrastructure in a highly efficient manner. As an example, a DNA sequencing analysis pipeline for variant calling has been implemented according to the GATK Best Practices recommendations, supporting both whole genome and whole exome sequencing. Using a 15-node computer cluster with 360 CPU cores in total, Halvade processes the NA12878 dataset (human, 100bp paired-end reads, 50x coverage) in less than 3 hours with very high parallel efficiency. Even on a single multi-core machine, Halvade attains a significant speedup compared to running the individual tools with multithreading. Halvade is written in Java and uses the Hadoop MapReduce 2.0 API. It supports a wide range of Hadoop distributions, including Cloudera and Amazon EMR. Its source is available under the GPL license at
N55 - Modeling Ribosome Profiling Data with Bayesian Hidden Markov Models
Brandon Malone, Max Planck Institute for Biology of Ageing, Germany
Florian Aeschimann, Friedrich Miescher Institute for Biomedical Research, Switzerland
Jieyi Xiong, Max Planck Institute for Biology of Ageing, Germany
Helge Grosshans, Friedrich Miescher Institute for Biomedical Research, Switzerland
Christoph Dieterich, Max Planck Institute for Biology of Ageing, Germany
Short Abstract: Ribosome profiling via high-throughput sequencing (ribo-seq) is a promising new technique for characterizing the occupancy of ribosomes on messenger RNA (mRNA) at base-pair resolution. The ribosome is responsible for translating mRNA into proteins, so information about its occupancy offers a detailed view of ribosome density and position which could be used to discover new upstream open reading frames, alternative start codons and new isoforms. Furthermore, this data allows the study of translational dynamics, such as decoding speed and ribosome pausing. Despite the wealth of information offered by ribo-seq, current analysis techniques have focused on coarse, gene-level statistics. In this work, we propose a hidden Markov model (HMM) approach to predict, at base-pair resolution, ribosome occupancy and translation. We use state-of-the-art learning algorithms to fit the parameters of our model, which correspond to biologically meaningful quantities, such as expected ribosome occupancy. Furthermore, we extend the model with Bayesian hyperparameters to quantify the uncertainty of the learned parameters. Preliminary evaluation shows that the HMM achieves a much higher true positive rate, and overall higher AUC, in identifying proteomics-verified coding regions compared to using the raw profile.
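As a toy illustration of HMM-based occupancy calling, the sketch below runs Viterbi decoding with a two-state model (untranslated vs. translated) and Poisson emissions over per-base read counts. The published model is richer (biologically meaningful parameters, Bayesian hyperparameters for uncertainty); the rates and switch probability here are invented for illustration:

```python
import math

def viterbi_occupancy(counts, rates=(0.1, 5.0), switch=0.01):
    """Most likely state path for per-base ribo-seq counts under a
    two-state HMM: state 0 = untranslated (low Poisson rate),
    state 1 = translated (high Poisson rate)."""
    def log_pois(k, lam):
        # Poisson log-likelihood of observing k reads at rate lam
        return k * math.log(lam) - lam - math.lgamma(k + 1)

    stay = math.log(1 - switch)
    move = math.log(switch)
    # initialise with a uniform prior over the two states
    score = [math.log(0.5) + log_pois(counts[0], r) for r in rates]
    back = []
    for k in counts[1:]:
        prev = score[:]
        ptr, score = [], []
        for s, lam in enumerate(rates):
            cands = [prev[t] + (stay if t == s else move) for t in (0, 1)]
            best = max((0, 1), key=lambda t: cands[t])
            ptr.append(best)
            score.append(cands[best] + log_pois(k, lam))
        back.append(ptr)
    # trace back the best path from the best final state
    state = max((0, 1), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]
```

A stretch of elevated counts flanked by near-zero coverage decodes as a contiguous translated segment, which is the base-pair-resolution behaviour the abstract describes.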
N56 - Ensemble Multiple Sequence Alignment via Advising
Dan DeBlasio, University of Arizona, United States
John Kececioglu, University of Arizona, United States
Short Abstract: The multiple sequence alignments computed by an aligner for different settings of its parameters, as well as the alignments computed by different aligners using their default settings, can differ markedly in accuracy. Parameter advising is the task of choosing a parameter setting for an aligner so as to maximize the accuracy of the resulting alignment. We extend parameter advising to aligner advising, which chooses among a set of aligners to maximize accuracy. In the context of aligner advising, default advising selects from a set of aligners that are using their default settings, while general advising chooses both the aligner and its parameter setting. We apply aligner advising for the first time to obtain a true "ensemble aligner." Through experiments on benchmark protein sequence alignments, we show that parameter advising for a fixed aligner gives a significant boost in accuracy over simply using its default setting, for all of the most accurate aligners currently used in practice. Furthermore, for ensemble alignment, default aligner advising gives a further boost in accuracy over parameter advising for any single aligner, and general aligner advising improves beyond default advising. Our new ensemble aligner that results from general aligner advising, when evaluated on standard suites of protein alignment benchmarks, and selecting from a set of four or more choices, is significantly more accurate than the best single default aligner.
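Advising as described reduces to running each candidate (aligner, parameter setting) and keeping the alignment that an accuracy estimator scores highest. The sketch below is schematic: the caller-supplied `estimate_accuracy` stands in for the learned estimator the authors use (true accuracy is unknown without reference alignments), and the aligner functions are placeholders:

```python
def advise(sequences, candidates, estimate_accuracy):
    """General aligner advising, schematically: `candidates` is a list of
    (align_fn, params) pairs covering aligners and their parameter settings;
    the pair whose output the estimator scores highest wins."""
    best_alignment, best_score = None, float("-inf")
    for align_fn, params in candidates:
        alignment = align_fn(sequences, **params)
        score = estimate_accuracy(alignment)
        if score > best_score:
            best_alignment, best_score = alignment, score
    return best_alignment, best_score
```

Default advising is the special case where every `params` dict is empty, i.e. each aligner runs with its default setting.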
N57 - Comprehensive genome-wide transcription factor analysis reveals that a combination of high affinity and low affinity DNA binding is needed for human gene regulation
Kirill Batmanov, Norwegian Radium Hospital, Norway
Junbai Wang, Norwegian Radium Hospital, Norway
Agnieszka Malecka, Norwegian Radium Hospital, Norway
Gunhild Trøen, Norwegian Radium Hospital, Norway
Jan Delabie, Norwegian Radium Hospital, Norway
Short Abstract: High-throughput in vivo protein-DNA interaction experiments are currently widely used in gene regulation studies. However, comprehensive data analysis remains a challenge, and for that reason most computational methods consider only the top few hundred or thousand strongest protein binding sites, whereas weak protein binding sites are completely ignored.

A new biophysical model of protein-DNA interactions, BayesPI2+, was developed to address the above-mentioned challenges. BayesPI2+ can be run in either a serial computation mode or a parallel ensemble learning framework. It allowed us to analyze all binding sites of the transcription factors, including weak binding sites that cannot be analyzed by other models. It was evaluated on both synthetic and real in vivo protein-DNA binding experiments. Analysing ESR1 and SPIB in breast carcinoma and activated B cell-like diffuse large B-cell lymphoma cell lines, respectively, revealed that concerted binding to high and low affinity sites correlates best with gene expression.

BayesPI2+ allows us to analyze transcription factor binding on a larger scale than hitherto achieved. By this analysis, we were able to demonstrate that genes are regulated by concerted binding to high and low affinity binding sites. The program and output results are publicly available at:
N58 - BioXSD — A universal data model for sequences, alignments, features, measured and inferred values
Matúš Kalaš, University of Bergen, Norway
Sveinung Gundersen, University of Oslo, Norway
László Kaján, unaffiliated (previously TU Munich), Germany
Jon Ison, Technical University of Denmark, Denmark
Steve Pettifer, University of Manchester,
Christophe Blanchet, L'Institut Français de Bioinformatique, France
Rodrigo Lopez, European Bioinformatics Institute,
Kristoffer Rapacki, Technical University of Denmark, Denmark
Inge Jonassen, University of Bergen, Norway
Short Abstract: BioXSD has been developed as a universal data model and exchange format for basic bioinformatics types of data: sequences, alignments, features, and related values, inferred or measured. The BioXSD model is rich enough to enable loss-less capture of diverse data that would otherwise require the use of multiple different formats, and often even the introduction of new formats for atypical features, classifications, or measured values. In BioXSD, an innovatively broad range of experimental data, annotations, and alignments can be recorded in an integrated chunk of data, together with provenance metadata, documentation, and semantic annotation with concepts from ontologies of the user's choice.

BioXSD has so far been released in the form of a machine-understandable XML Schema (XSD). Ongoing developments concentrate on providing BioXSD in the form of JSON Schema and XML Schema 1.1, which may in the future be supplemented by RelaxNG or even OWL. This will enable using BioXSD as a common data model supporting serialisation of bioinformatics data into XML, JSON, RDF, or binary (EXI and BSON) as desired, while maintaining consistent and smooth validation, conversion, and parsing into objects for programming. The semantics of BioXSD is defined via SAWSDL references to EDAM and a number of other main ontologies.

BioXSD is a participatory community effort dependent on contributors, fans, and their needs. BioXSD Schema, documentation, and examples are available at
N59 - De novo genome sequencing of various tardigrades with ultra low input
Kazuharu Arakawa, Institute for Advanced Biosciences, Keio University, Japan
Shinta Fujimoto, Department of Zoology, Division of Biological Science, Graduate School of Science, Kyoto University, Japan
Masaru Tomita, Institute for Advanced Biosciences, Keio University, Japan
Short Abstract: Limno-terrestrial tardigrades can withstand almost complete desiccation through a mechanism called anhydrobiosis, and several of these species have been shown to survive the most extreme environments, including exposure to space vacuum. The molecular mechanism of this tolerance has been studied in many anhydrobiotic metazoans, leading to the identification of several key factors, such as the accumulation and vitrification of trehalose and the expression of LEA proteins to prevent protein aggregation. On the other hand, a comprehensive understanding of the molecular mechanisms and regulatory machinery of metabolism during anhydrobiosis, as well as its evolutionary origin, is yet to be achieved. Molecular and genetic study of tardigrades has so far been limited due to their small size (about 100 µm in length) and the difficulty of culturing them in laboratory conditions. Here we report the de novo genome sequencing of multiple tardigrades from single individuals, using ultra-low-input and amplification technologies, including a marine heterotardigrade, Batillipes sp. The availability of genomic resources for Heterotardigrada provides insights into the evolution of tardigrades and their anhydrobiotic capabilities.
N60 - Scalable, fast, and accurate alignment with spaced seeds
Hamid Mohamadi, University Of British Columbia, Canada
Inanc Birol, BC Cancer Agency, Genome Sciences Centre, Canada
Short Abstract: Reads produced by new sequencing technologies such as Oxford Nanopore or Pacific Biosciences are increasing in length and are no longer short. These long reads have a significantly higher error rate, enriched for insertions and deletions (indels) rather than substitutions. Existing alignment methods such as BWA-SW and Bowtie2 use the Burrows-Wheeler Transform (BWT) and FM-index at the core of the alignment process, and are therefore insufficiently sensitive to align much longer reads with higher error rates, such as those generated by the Oxford Nanopore MinION or the Pacific Biosciences RS II. In this work, we present an accurate, fast, and scalable alignment method based on spaced seeds for long reads with higher error rates. Spaced seeds are the current state of the art for approximate string matching and have been increasingly used to improve the quality and sensitivity of searching. A spaced seed S is a string over the binary alphabet {1, *}, where 1 indicates a position in S where a match must occur, whereas * indicates a position where a match or mismatch is allowed. To handle indels in the alignment process, we use pairs of spaced seeds with variable-length gaps between seed pairs. The alignment method designed on this scheme approaches the perfect sensitivity of dynamic programming algorithms. In the implementation, we have utilized the new scaling support features of Intel's recently released Many Integrated Core (MIC) architecture to make our proposed tool scalable and fast, facilitating large alignment tasks.
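The {1, *} seed definition can be made concrete with a small sketch. This is a naive all-offsets scan for illustration only; it ignores the paired seeds, variable gaps, and MIC-based parallelism described in the abstract, and a real aligner would index the reference by the '1' positions rather than scan:

```python
def seed_match(seed, read, ref, i, j):
    """Check a spaced seed at read offset i vs. reference offset j:
    positions marked '1' must match; '*' positions are don't-cares."""
    for k, c in enumerate(seed):
        if c == "1" and read[i + k] != ref[j + k]:
            return False
    return True

def find_seed_hits(seed, read, ref):
    """Enumerate all (read_offset, ref_offset) pairs where the seed fires."""
    hits = []
    for i in range(len(read) - len(seed) + 1):
        for j in range(len(ref) - len(seed) + 1):
            if seed_match(seed, read, ref, i, j):
                hits.append((i, j))
    return hits
```

With seed `11*1`, a single substitution at the `*` position no longer prevents the seed from firing, which is why spaced seeds are more sensitive than contiguous k-mers at the same weight.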
N61 - Detecting signs of positive selection in long non-coding RNAs
Maria Walter, Bioinformatics Laboratory, Germany
Katja Nowick, University of Leipzig, Bioinformatics Group, Department of Computer Science, Interdisciplinary Center for Bioinformatics (IZBI), Germany
Short Abstract: Long non-coding RNAs (lncRNAs) are a diverse class of non-coding RNAs that has been discovered quite recently. They resemble mRNAs in size and structure but do not code for proteins. Rather, they function through their folded structures as functional RNAs by interacting with proteins and other RNAs in diverse complexes. LncRNAs are involved in several different biological processes, particularly by regulating gene expression. Interestingly, many lncRNAs are fast evolving. To achieve a better understanding of the evolution of lncRNAs, we developed a method for detecting signs of selection in lncRNAs. To this end we used a previously published database of about 15,000 families of lncRNAs and aligned human lncRNAs to their orthologs from chimpanzee, orangutan, gorilla, and rhesus macaque. Using these alignments, we identified the lncRNAs that contained the most human-specific mutations. We submitted the sequences to RNAsnp, a tool that reports the nucleotides that, when mutated, make a significant change in the RNA secondary structure. Similarly to the Ka/Ks ratio, which is used to investigate selection in protein-coding genes, we used the information from RNAsnp for each candidate to calculate the ratio between nucleotide changes leading to a structural change and nucleotide changes not leading to a structural change. LncRNAs for which this ratio is larger than 1 indicate an excess of nucleotide changes leading to structural changes and are therefore suggested to show signs of positive selection. We found 20 lncRNAs with a ratio greater than 1.5, which indicates they might be evolving under positive selection in humans.
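The Ka/Ks-like ratio described above is simple to compute once each human-specific mutation has been labeled as structure-changing or not (e.g. from RNAsnp output); the labeling itself is assumed done upstream in this sketch:

```python
def structural_selection_ratio(mutations):
    """mutations: list of booleans, one per human-specific change,
    True if the change significantly alters the predicted secondary
    structure. By analogy with Ka/Ks, a ratio > 1 suggests an excess
    of structure-changing substitutions."""
    changing = sum(mutations)
    neutral = len(mutations) - changing
    if neutral == 0:
        return float("inf") if changing else 0.0
    return changing / neutral
```

In practice a candidate lncRNA would be flagged when this ratio clears a threshold such as the 1.5 used in the abstract.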
N62 - Anchoring patterns and point mutations in pairwise alignments using AlignMe
Kamil Khafizov, Moscow Institute of Physics and Technology, Russia
Rene Staritzbichler, Leipzig University, Germany
Maxim Ivanov, Moscow Institute of Physics and Technology, Russia
Marcus Stamm, Max Planck Institute of Biophysics, Germany
Lucy Forrest, National Institutes of Health, United States
Short Abstract: X-ray crystal structures have revealed that numerous membrane proteins, such as GPCRs or some secondary transporters, share very similar 3D structures despite the lack of any detectable sequence similarity between them. Moreover, some proteins were originally categorized into different sequence families, and yet after their structural models became available, it was revealed that they may share the same evolutionary ancestor. One representative example is the LeuT-fold transporters. Their core structure consists of two units of five TM helices, and the fact that these two units are related implies that LeuT-like transporters evolved through gene-duplication and fusion events. However, the lack of significant sequence similarity requires sensitive sequence search methods for analyzing their evolution. To this end, we developed a software application called AlignMe, and subsequently a webserver, which can use various types of input information, such as residue hydrophobicity, to perform pairwise alignments of sequences and/or of hydropathy profiles of (membrane) proteins. Here, we describe a modification of the dynamic programming algorithm that allows positions in the input sequences to be constrained. This novel feature allows the user to define any number of so-called “anchors” with varying strength for improving the quality of pairwise alignments in challenging cases lacking notable sequence similarity. Information about possible anchors can be obtained from experimental studies, expert knowledge of specific motifs, or even from alignments of hydropathy profiles. There are manifold applications in homology modeling and in the context of mutagenesis experiments, all of which make the tool useful in the detection of structural similarity.
N63 - RiboGalaxy: development of a platform for the alignment, analysis and visualization of ribo-seq data.
Audrey Michel, University College Cork, Ireland
James P.A. Mullan, University College Cork, Ireland
Claire A. Donohue, University College Cork, Ireland
Vimalkumar Velayudhan, University College Cork, Ireland
Patrick B. O\'Connor, University College Cork, Ireland
Pavel V. Baranov, University College Cork, Ireland
Short Abstract: Ribosome profiling (ribo-seq) is a technique that uses high-throughput sequencing technology to reveal the exact locations and density of translating ribosomes across the entire transcriptome [1,2]. However, researchers who generate ribo-seq data often have to rely on bioinformaticians to process and analyse their data. We have developed RiboGalaxy, a freely available Galaxy-based web server specifically tailored for pre-processing, aligning and analysing ribo-seq data, with visualisation functionality provided by GWIPS-viz [3]. Researchers can take advantage of the published workflows, which reduce the multi-step alignment process to a minimum of inputs from the user. Researchers can also directly upload their ribosome profiles as custom tracks in GWIPS-viz and compare them to public ribo-seq tracks from multiple studies. As well as providing free computational infrastructure to researchers, RiboGalaxy allows users to analyse their ribo-seq data without the need for command-line tools or a bioinformatics background. RiboGalaxy is accompanied by extensive documentation and tips to help users. In addition, we provide a RiboGalaxy forum where we encourage users to post questions and feedback to improve the overall RiboGalaxy service.

1 Ribosome profiling: new views of translation, from single codons to genome scale.
Ingolia NT.
Nat Rev Genet. 2014 Mar;15(3):205-13.

2 Ribosome profiling: a Hi-Def monitor for protein synthesis at the genome-wide scale.
Michel AM, Baranov PV.
Wiley Interdiscip Rev RNA. 2013 Sep-Oct;4(5):473-90.

3 GWIPS-viz: development of a ribo-seq genome browser.
Michel AM, Fox G, M Kiran A, De Bo C, O'Connor PB, Heaphy SM, Mullan JP, Donohue CA, Higgins DG, Baranov PV.
Nucleic Acids Res. 2014 Jan;42(Database issue):D859-64.
N64 - Structural Variant Detection across Species Boundaries - Mapping-Based Horizontal Gene Transfer Detection from Sequencing Data
Kathrin Trappe, Robert Koch-Institute Berlin, Germany
Bernhard Y. Renard, Research Group Bioinformatics (NG 4), Robert Koch Institute, Germany
Tobias Marschall, Center for Bioinformatics, Saarland University, and Max Planck Institute for Informatics, Germany
Short Abstract: One of the major fields in the analysis of NGS data is the detection of structural variants (SVs).
The focus of SV detection has primarily been on human sequencing data, most prominently in cancer studies.
Horizontal gene transfer (HGT) can be seen as a special case of SV.
Through HGT, bacteria, among other organisms, are able to acquire novel genes from other, more distantly related bacteria or other species, which often brings new functions or properties such as antibiotic resistance (Syvanen, 2012).
A prominent example is the EHEC outbreak 2011 in Germany.
Hence, fast and reliable pathogen identification or detection of antibiotic resistance are of particular interest in clinical diagnostics (Byrd et al., 2014).

Integrated into an analysis pipeline, we use the split-read based SV detection tools Gustaf (Trappe et al., 2014) and LASER (Marschall et al., 2013) to identify possible breakpoints of HGT events.
We then create candidate regions based on these breakpoints and incorporate read coverage information and read-pair based evidence to support the most likely candidates.

We successfully evaluated our approach on two E. coli datasets, where we could detect breakpoints and create HGT candidates even in the presence of multiple splits, complex variants and longer gaps between split parts.

Transferring the concepts from SV detection methods to bacteria opens up new ways of diagnostics using NGS data, e.g. to distinguish parallel infections of multiple bacteria from single infections where the bacteria have acquired distinct DNA through HGT.
We therefore see great potential in applying SV detection approaches across species boundaries.
N65 - T2P - Transcriptome to Proteome: Improving translation in non-model organisms
Shifra Ben-Dor, Weizmann Institute of Science, Israel
Ester Feldmesser, Weizmann Institute of Science, Israel
Gilgi Friedlander, Weizmann Institute of Science, Israel
Irit Orr, Weizmann Institute of Science, Israel
Noa Wigoda, Hebrew University of Jerusalem, Agriculture Faculty, Israel
Short Abstract: High-throughput mass spectrometry (MS) is increasingly used to study the proteomes of biological systems. Identification of peptides and proteins requires a database for comparison of the masses. For the major model species, genomes, transcripts and protein sequences are readily found. However, in non-model organisms sequences are often not available. RNA-seq provides an affordable way to obtain a transcript collection, even for species lacking a genome sequence. However, little attention is generally paid to the accuracy of the transcript definition. The prevailing attitude is that if the resulting transcripts allow us to reach biological insight, the results are sufficient. This is not the case when they are used as input to MS analysis programs, where an accurate protein sequence is required to predict an accurate protein mass.
We present T2P (Transcriptome to Proteome), a suite of tools to perform the transition step from transcriptome definition to translation of relevant proteins that can be used for further annotation and/or mass spectrometry. The suite utilizes a sequence comparison based choice of the probable functional reading frame, allows detection and correction of frameshifts, detection and splitting of artificially joined transcripts, merger of transcript collections (for example, multiple transcriptomes or transcriptome and EST data), and detection of putative non-coding or pseudogene sequences. The tool was designed to allow biologists without bioinformatics knowledge as well as bioinformaticians to use it.
N66 - Analysis under control - comparing workflow management systems
Maciej Kandula, BOKU University Vienna, Austria
David Kreil, BOKU University Vienna, Austria
Ola Spjuth, Uppsala University, Sweden
Samuel Lampa, Uppsala University, Sweden
Short Abstract: With rapid developments in high-throughput technologies such as next-generation sequencing, scientists are faced with complex analyses of genome-scale data sets covering gene expression, microRNA activity, accumulation of somatic mutations, DNA methylation, and copy number variation. This complexity and the amount of data produced require ways to supervise, document and control data processing in order to facilitate error-free, reliable and reproducible research. These tasks have become a considerable challenge in bioinformatics data analysis, and software tools have been developed to automate them. A variety of workflow management systems now exist, with different mechanisms of operation, varying features, and sometimes different users in mind.

We perform a thorough review of selected systems and point to the challenges not yet addressed. We study a wide range of modern workflow management tools, including both GUI-based systems, like Galaxy, and text-based systems, such as Snakemake and Bpipe, covering user-administered as well as top-down-administered tools. We compare crucial features of the systems with regard to the workflow paradigm, triggering mechanism, audit/reproducibility capabilities, flexibility, modularity, HPC capabilities, and more. We give special consideration to a system's ability to reliably perform routine data analysis steps, as is common in industrial or facility settings, and to the flexibility required in scientific research: support for rapid prototyping, ad hoc analyses and version control. We also account for a system's ability to adapt to changing analysis stages and various resource environments, and for the computational literacy required of the user.
N67 - PolyMarker: A fast polyploid primer design pipeline
Ricardo Ramirez-Gonzalez, The Genome Analysis Centre,
Cristobal Uauy, John Innes Centre,
Mario Caccamo, The Genome Analysis Centre,
Short Abstract: The design of genetic markers is of particular relevance in crop breeding programs. Despite many economically important crops being polyploid organisms, current primer design tools are tailored for diploid species. Bread wheat, for instance, is a hexaploid comprising three related genomes, and the performance of genetic markers is diminished if the primers are not genome specific. PolyMarker is a pipeline that selects candidate primers for a specified genome using local alignments and standard primer design tools to test the viability of the primers. A command line tool and a web interface are available to the community.
N68 - Construction of microRNA web server for deep sequencing, expression profiling and mRNA targeting.
Byungwook Lee, KRIBB, Korea, Rep
Short Abstract: In the field of microRNA (miRNA) research, biogenesis and molecular function are two key subjects. Deep sequencing has become the principal technique for cataloging the miRNA repertoire and generating expression profiles in an unbiased manner. Here, we describe a miRNA web server that compiles the deep sequencing miRNA data available in public repositories and implements several novel tools to facilitate exploration of these massive data. We also developed the miR-seq browser, which supports users in examining short read alignments with the secondary structure and read count information available in concurrent windows. Features such as sequence editing, sorting, ordering, and import and export of user data are of great utility for studying isomiRs, miRNA editing and modifications. The miRNA-target relation is essential for understanding miRNA function. Based on miRNA-seq and RNA-seq data from the same sample, coexpression analysis between miRNAs and target mRNAs is visualized in heat-map and network views, where users can investigate the inverse correlation of gene expression and target relations compiled from various databases of predicted and validated targets.
N69 - antiSMASH 3 - User-Friendly Analysis Pipeline for Secondary Metabolites
Kai Blin, NovoNordisk Foundation Center for Biosustainability, Denmark
Marnix Medema, Wageningen University, Netherlands
Hyun Uk Kim, NovoNordisk Foundation Center for Biosustainability, Denmark
Tilmann Weber, NovoNordisk Foundation Center for Biosustainability, Denmark
Short Abstract: The secondary metabolism of microorganisms is a rich source of compounds and lead structures for the discovery of drugs for human, veterinary and agricultural applications. Since the “Golden Age” of antibiotics discovery in the 1960s and 70s, the discovery rate of novel substances has declined. With the rising awareness of the dangers posed by antibacterial resistance in clinically relevant pathogens and the recent surge in available genomic and proteomic data, the secondary metabolite field is rising in popularity again. Making modern prediction algorithms and the wealth of available data accessible to researchers is a challenge.
Since its initial publication in 2011, antiSMASH, the “antibiotics & secondary metabolites analysis shell”, has risen to this challenge and has become one of the leading applications in the field. antiSMASH is an analysis pipeline identifying secondary metabolite gene clusters from genomic data using Markov models of core biosynthetic enzyme sequences, with more elaborate product predictions for well-studied secondary metabolite cluster types. Curated databases of predicted and known secondary metabolite clusters aid in the dereplication of known compounds, allowing research to be focused on novel substances. With a public, easy-to-use web interface, antiSMASH offers a low barrier of entry to bench biologists. Published under an OSI-approved Open Source license, it can also be used and adapted in local installations. This poster will present the current version of antiSMASH, now supporting even more secondary metabolite cluster types, improved cluster identification and dereplication support, and homology-based metabolic modelling of producer strains.
N70 - Latest Developments of the Clustal Omega Multiple Sequence Alignment Program
Fabian Sievers, University College Dublin, Ireland
Des Higgins, University College Dublin, Ireland
Short Abstract: Multiple sequence alignments of large numbers of sequences are being used in many bioinformatics analyses, like secondary structure prediction, phylogenetic analyses and epigenetics, amongst others. We present the current version of Clustal Omega, which draws on the latest insights into clustering and guide-tree construction for overwhelmingly large numbers of sequences, that is, (i) optimised seed selection for the mBed algorithm, (ii) selection of a sufficiently small representative skeleton alignment and (iii) improved performance in building up the final alignment. We show that estimated phylogenetic trees in general do not produce the best alignments (for small numbers of sequences), but that a good improvement over traditional UPGMA trees can be achieved using imbalanced (chained) trees. We benchmark our implementation using traditional methods like total-column and sum-of-pairs scores, as well as a more modern method based on contact maps, which penalises over-alignment. We compare the performance and accuracy of Clustal Omega to a wide variety of alternative programs.
N71 - Structural clustering and characterization of O-GlcNAc-sites in their unmodified state.
Thiago Britto-Borges, University of Dundee,
Geoffrey Barton, University of Dundee,
Short Abstract: Protein phosphorylation and O-GlcNAcylation are dynamic, intracellular and substrate-specific post-translational modifications. While roughly 500 protein kinases drive phosphorylation, only the O-linked N-acetylglucosamine (O-GlcNAc) transferase (OGT) adds an O-GlcNAc moiety to serines and threonines. Although some kinase-specific phosphosites have consensus sequences, less than 25% of known O-GlcNAc sites match a clear amino acid pattern. In this work, we investigate the hypothesis that OGT recognises not only a sequence of consecutive amino acids but also a structural fingerprint. To describe the three-dimensional structures of the 7-residue-long segments targeted by OGT, structural data from 57 serines and threonines, reported as O-GlcNAc sites but without the modification group, were collected. Distinct instances of the same site, from multiple protein X-ray crystal structures or chains, had their backbone atoms clustered to determine structural uniqueness. Instances with the highest resolution were selected, and secondary structure analysis showed an increased occurrence of missing residues at the site of modification (19%) compared to other serines and threonines in the background group (11%). Intriguingly, these central residues were as accessible to the solvent as residues in the background. Accordingly, these results could partially explain the diversity of the primary sequence of O-GlcNAc sites by indicating a role for structural flexibility around the modification site.
N72 - Qualimap 2.0: advanced quality control of high throughput sequencing data
Konstantin Okonechnikov, , Germany
Ana Conesa, Collaboration, Spain
Fernando Garcia-Alcalde, Collaboration, Switzerland
Short Abstract: Detection of random errors and systematic biases is a crucial step in a robust pipeline for processing high-throughput sequencing (HTS) data. A number of bioinformatics software tools are capable of performing this task; some are suitable for general analysis of HTS data, while others target a specific sequencing technology. Qualimap 2.0 represents the next step in the QC analysis of HTS data. It is a multiplatform, user-friendly application with both graphical user and command line interfaces.

Qualimap includes four analysis modes: BAM QC, Counts QC, RNA-seq QC and Multi-sample BAM QC. Depending on the selected type of analysis, users provide input data in the form of a BAM/SAM alignment, a GTF/GFF/BED annotation file and/or a read counts table. The results of the QC analysis are presented as an interactive report in the GUI, as a static report in HTML or PDF format, and as a plain text file suitable for parsing and further processing. The latter two analysis modes, RNA-seq QC and Multi-sample BAM QC, are introduced in version 2.0. Multi-sample BAM QC allows combined quality control estimation for multiple alignment files: Qualimap combines BAM QC results from multiple samples and creates a number of plots summarizing the datasets. RNA-seq QC computes metrics specific to RNA-seq data, including per-transcript coverage, junction sequence distribution and the genomic localization of reads.

In addition, a large number of fixes and enhancements have been implemented since the first version of Qualimap was released. The application is freely available at
N73 - Simple Chained Guide Trees give High-Quality Protein Alignments
Kieran Boyce, UCD, Ireland
Fabian Sievers, UCD, Ireland
Des G. Higgins, UCD, Ireland
Short Abstract: Guide trees are used to decide the order of sequence alignment in progressive multiple sequence alignment. These guide trees are often the limiting factor in making large alignments, and considerable effort has been expended over the years on constructing them quickly or accurately. We have found that, at least for protein families with large numbers of sequences that can be benchmarked with known structures, simple chained guide trees give the most accurate alignments. They are also the computationally fastest and simplest guide trees to construct. Such guide trees have a striking effect on the accuracy of alignments produced by some of the most widely used alignment packages: there is a marked increase in accuracy and a marked decrease in computational time once the number of sequences goes much above a few hundred. This is true even if the order of sequences in the guide tree is random.
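A chained guide tree is simply a caterpillar tree in which sequences are joined one at a time, so the alignment order is the order of the input list. A minimal sketch in Newick format (the function name is illustrative; such a tree can typically be supplied to aligners that accept an external guide tree):

```python
def chained_guide_tree(names):
    """Build a chained (caterpillar) guide tree in Newick format.
    Each new sequence is joined to the subtree built so far, so sequences
    are aligned strictly in list order."""
    if len(names) == 1:
        return names[0] + ";"
    tree = "({},{})".format(names[0], names[1])
    for name in names[2:]:
        tree = "({},{})".format(tree, name)
    return tree + ";"

# A randomised input order (as the abstract notes, accuracy holds even then)
# just means shuffling `names` before calling the function.
```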
N74 - Improving Clustal Omega's sequence alignment accuracy with annotated profile Hidden Markov Models
Quan Le, University College Dublin, Ireland
Des Higgins, University College Dublin, Ireland
Short Abstract: Clustal Omega is the latest member of the Clustal sequence alignment program family; it allows the use of an additional profile HMM to improve the accuracy of the alignment. In this experiment, we use the tools HMMER 3.0 and pfam_scan to annotate each sequence in the set of sequences to be aligned with profile HMMs from the Pfam database; we then add the annotated profile HMMs as extra input to Clustal Omega to improve the alignment quality. Using one Pfam profile HMM per alignment, we obtain positive results on all 5 reference sets of the sequence alignment benchmark BAliBASE 3.0 (average total-column scores improve by 2.4% for reference 3 and by more than 20% for reference 1 version 1). For cases where multiple Pfam profile HMMs hit the sequences to be aligned, we are performing initial experiments using concatenated profile HMMs to further improve the alignment quality.
N75 - The Mitochondrial Genome Assembly of Buffalo from Marajó Island (Brazil)
Adonney Veras, For Personal Use, Brazil
Pablo Gomes de Sá, UFPA, Brazil
Adriana Carneiro, UFPA, Brazil
Artur Silva, UFPA, Brazil
Rommel Ramos, UFPA, Brazil
Short Abstract: The buffalo is of great importance to small farmers and to the economies of developing countries, being used as a source of meat, skin and milk, which can be converted into cream, butter and many types of yoghurt and cheese. These animals are usually found in tropical forests, wet meadows and swamps.
The elucidation of mitochondrial genomes has driven the growth of studies in this area of research, such as phylogenetic correlations between different species or breeds. An important characteristic of mitochondria is that a single cell contains many copies of the mitochondrial genome, in contrast with the nuclear genome, of which only two copies are present. This feature helps to identify the source of food products such as cheese.
Thus, this work aims to present the mitochondrial genome of a buffalo from the Marajó region, State of Pará, Brazil. The buffalo was sequenced using the Ion Proton platform with an Ion PI chip, which generated a throughput of 11.7 gigabases.
The computational pipeline MITObim 1.8 was used to assemble the genome, which was then annotated with the MITOS Web platform.
The result was a complete mitochondrial genome sequence of 16,578 base pairs with 166-fold coverage. The annotation process identified 18 protein-coding genes, 19 tRNAs and 2 rRNAs.
N76 - Back Translated Peptide K-mer Search in DNA Sequence Databases using BoND-tree
Sakti Pramanik, Michigan State University, United States
Short Abstract: In the past, genome sequence databases used main-memory indexing, such as the suffix tree, for fast sequence searches. With next-generation sequencing technologies, the amount of sequence data being generated is huge, and main-memory indexing is limited by the amount of memory available. K-mer-based techniques are becoming widely used for various genome sequence database applications, and k-mers can also provide an excellent basis for efficient disk-based indexing. In this paper, we propose a k-mer-based database searching tool using box queries on BoND-tree indexing [1]. The BoND-tree is quite efficient for indexing and searching in a Non-Ordered Discrete Data Space (NDDS). We have conducted experiments on searching DNA databases using back-translated protein query sequences and have compared the results with existing methods. The results are quite promising and justify the significance of the proposed approach.
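The back-translation step can be sketched as follows: each peptide k-mer expands into the set of DNA sequences that encode it under the standard genetic code, which can then serve as the dimensions of a box query. This is an illustrative sketch only; the BoND-tree indexing itself is described in the cited work.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip((a + b + c for a in BASES for b in BASES for c in BASES),
                       AMINO))

# Invert the table: amino acid -> list of codons encoding it.
BACK = {}
for codon, aa in CODON_TABLE.items():
    BACK.setdefault(aa, []).append(codon)

def back_translate(peptide):
    """Enumerate every DNA sequence encoding the given peptide k-mer.
    The number of products grows multiplicatively with codon degeneracy,
    which is why small k is preferred for k-mer indexing."""
    return ["".join(c) for c in product(*(BACK[aa] for aa in peptide))]
```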
N77 - Computational detection of DNA double-stranded breaks with nucleotide resolution using deep sequencing data
Malgorzata Rowicka, University of Texas, United States
Short Abstract: Double-stranded DNA breaks (DSBs) are a dangerous form of DNA damage. Despite many studies on the mechanisms of DSB formation, our knowledge of them is very incomplete. A main reason for our limited knowledge is that, to date, DSB formation has been extensively studied only at specific loci but remains largely unexplored at the genome-wide level.
We recently developed a method to label DSBs in situ followed by deep sequencing (BLESS), and used it to map DSBs in human cells [1] with a resolution 2-3 orders of magnitude better than previously achieved. Although our protocol detects free DNA ends with single-nucleotide precision, inferring the original positions of DSBs remains challenging. This problem is due to unavoidable sequencing of DNA repair intermediates (end resection), which effectively lowers our detection resolution by several orders of magnitude. Another challenge is that DSBs are rare events, and the sequencing signal originating from them is easily overpowered by background signal, such as copy number variation.
Here, we show how DSBs can be detected computationally with nucleotide resolution. First, we learned the characteristic sequencing read pattern in the vicinity of a DSB using experimental data with DSBs induced at pre-determined positions. On these data, by scanning the genome with our pattern, we were able to detect DSB locations with 2 nt positional accuracy and precision of over 90%. The pattern thus derived was then applied to the detection of DSBs and their relation to the corresponding DNA damage and repair mechanisms.
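The genome-scanning step described above can be sketched as a normalised template match over per-nucleotide read counts. The template values and function below are hypothetical placeholders for the pattern learnt from induced DSBs, not the authors' actual model:

```python
def best_match(signal, template):
    """Slide `template` along `signal` (per-nucleotide read counts) and return
    (offset, score) of the best cosine-similarity match. A stand-in for
    scanning the genome with a learnt read pattern around a DSB."""
    k = len(template)
    t_norm = sum(t * t for t in template) ** 0.5
    best = (0, float("-inf"))
    for i in range(len(signal) - k + 1):
        window = signal[i:i + k]
        w_norm = sum(w * w for w in window) ** 0.5 or 1.0  # avoid div by zero
        score = sum(w * t for w, t in zip(window, template)) / (w_norm * t_norm)
        if score > best[1]:
            best = (i, score)
    return best
```

A perfectly matching window scores 1.0, so a score threshold plays the role of the precision cut-off mentioned in the abstract.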
N78 - SplAdder: Integrated Quantification, Visualization and Differential Analysis of Alternative Splicing
Andre Kahles, Memorial Sloan-Kettering Cancer Center, United States
Cheng Soon Ong, NICTA, Canberra Research Laboratory, Australia
Gunnar Ratsch, Memorial Sloan-Kettering Cancer Center, United States
Short Abstract: Understanding alternative splicing (AS) is a key task towards explaining the complex transcriptomes of higher eukaryotes. With the advent of high-throughput sequencing of RNA (RNA-Seq), this diversity can be measured at an unprecedented depth. Although the catalog of known AS events has grown ever since, novel isoforms are commonly observed in less well annotated organisms, in the context of disease, or within large populations. Whereas identification of complete isoforms is challenging and expensive, focusing on single alternative splicing events as a proxy is fruitful for differential analysis.

We present SplAdder, a fully integrated analysis framework that can detect both known and novel AS events from RNA-Seq alignments, quantify them, and test them differentially between given sample sets. AS events are detected from a given annotation or added based on RNA-Seq evidence. The streamlined, highly efficient and parallelizable pipeline quantifies all events and provides counts as an interface to differential analysis with common tools such as the rDiff package or DEXSeq. SplAdder includes several visualization routines, producing publication-ready plots of the quantified splicing graph, displaying one or many events, or showing sashimi-like plots for different isoforms.
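Event detection from a splicing graph can be illustrated with a toy exon-skipping detector over splice junctions. This is a simplified sketch of the idea (a skip junction spanning an internal exon supported by two flanking junctions), not SplAdder's implementation:

```python
def exon_skips(junctions):
    """Detect candidate exon-skipping events from splice junctions, each given
    as (donor, acceptor) genomic coordinates. A skip is a junction (x, y)
    that jumps over an internal exon bounded by junctions (x, a) and (b, y)
    with a <= b. Returns (inclusion_left, inclusion_right, skip) triples."""
    jset = set(junctions)
    events = []
    for x, a in jset:
        for b, y in jset:
            if a <= b and (x, y) in jset and (x, a) != (x, y) != (b, y):
                events.append(((x, a), (b, y), (x, y)))
    return events
```

In a real pipeline the three junctions would each carry read counts, giving the inclusion/exclusion quantification used for differential testing.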

SplAdder can easily handle several thousand samples of high complexity and has been developed and tested on data from The Cancer Genome Atlas project and the International Cancer Genome Consortium. However, SplAdder is not limited to human data, and we demonstrate applications in the plant A. thaliana as well as the nematode C. elegans. The Python software is open source and available at and
