Poster numbers will be assigned May 30th.
If you cannot find your poster below, that probably means you have not yet confirmed that you will be attending ISMB/ECCB 2015. To confirm your poster, find the poster acceptance email, which contains a confirmation link. Click on it and follow the instructions.

If you need further assistance please contact and provide your poster title or submission ID.

Category N - 'Sequence Analysis'
N001 - Masai: Accurate SNP calling without quality values?
Short Abstract: In the first part of the presentation we will briefly present the algorithmic idea of Masai and discuss the impact of indices on the memory consumption and throughput. We will also compare the speed of Masai to other read mappers.
In the second part of the presentation, we will elaborate on the evaluation of accuracy. We will argue why the use of quality values in current read mappers does not, on average, lead to higher accuracy compared to edit distance methods. In particular, we will show that the use of quality values could help to resolve ambiguities caused by sequencing errors, while their use for mapping reads with SNVs to their correct genomic location is limited.
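As a point of reference for the "edit distance methods" mentioned above, the following is a minimal sketch of unit-cost edit distance (Levenshtein distance) between a read and a candidate reference substring. It is not Masai's algorithm or index structure; the function name is hypothetical and the example only illustrates that this kind of scoring ignores base quality values entirely.

```python
def edit_distance(read: str, ref: str) -> int:
    """Unit-cost edit distance (substitutions, insertions, deletions).

    Illustrative only: quality values play no role in this score.
    """
    # prev[j] holds the distance between the processed prefix of `read`
    # and the first j characters of `ref`.
    prev = list(range(len(ref) + 1))
    for i, r in enumerate(read, start=1):
        curr = [i]
        for j, g in enumerate(ref, start=1):
            cost = 0 if r == g else 1
            curr.append(min(prev[j] + 1,          # deletion from read
                            curr[j - 1] + 1,      # insertion into read
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]
```

A read carrying a single SNV relative to the reference scores 1 under this measure regardless of how confident the sequencer was about that base.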
N002 - Identification and Characterization of miRNA Transcriptome in Potato
Short Abstract: Potato is the third largest global food crop. Yet molecular events leading to tuberization and development are poorly understood. Micro RNAs (miRNAs) represent a class of short, non-coding, endogenous RNAs which play important roles in post-transcriptional regulation of gene expression.
The aim of our study is to explore the miRNA transcriptome of the tuber-bearing crop potato (Solanum tuberosum) in order to investigate the role of these small non-coding RNAs. We used high-throughput sequencing and bioinformatics tools to analyze miRNAs from leaf and stolon tissues. A pipeline was developed for the analysis of miRNA sequences in potato. Small RNA reads were first filtered in order to remove ribosomal RNAs, transfer RNAs, snRNAs and snoRNAs. The candidate precursors were excised from the potato genome at the mapping positions of these filtered reads. The candidate precursors were folded and miRNA predictions were made based on plant miRNA characteristics, such as minimum folding energy and the read distributions of mature miRNA and star sequences. Conserved and potato-specific miRNAs were identified and non-conserved miRNAs were validated experimentally. Additionally, we predicted targets for all the miRNAs using the potato transcriptome data.
As a result, 28 conserved miRNA families were found and potato-specific miRNAs were identified and validated by RNA gel blot hybridization. The size, origin and predicted targets of conserved and potato specific miRNAs are described. The large number of miRNAs and complex population of small RNAs in potato suggest important roles for these non-coding RNAs in diverse physiological and metabolic pathways.
N003 - Newborn screening for SCID identifies patients with ataxia telangiectasia
Short Abstract: Severe combined immunodeficiency (SCID) is characterized by failure of T lymphocyte development. Newborn screening to identify SCID is now performed in several states. In addition to infants with typical SCID, screening identifies infants with T lymphocytopenia who appear healthy and in whom a SCID diagnosis cannot be confirmed. Deep sequencing was employed to find causes of T lymphocytopenia in such infants. Whole exome sequencing and analysis were performed in infants and their parents. Upon finding deleterious mutations in the ataxia telangiectasia mutated (ATM) gene, we confirmed the diagnosis of ataxia telangiectasia (AT) in two infants. AT is usually not diagnosed until much later in life, after symptoms are manifest. Although there is no current cure for the progressive neurological impairment of AT, early detection permits avoidance of infectious complications, while providing information for families regarding reproductive recurrence risks and increased cancer risks in patients and carriers.
N004 - Identifying differentially expressed transcripts from RNA-seq data with biological variation
Short Abstract: Analysing RNA-seq data poses multiple challenges due to base mismatches, non-uniform read distribution, reads shared by multiple splice variants and other factors which make the expression analysis especially difficult. The BitSeq method uses a Bayesian approach to model the read generation and sequencing processes and infers expression estimates of individual transcripts. Transcript expression levels can be used to obtain more accurate gene expression estimates, in comparison to popular count based methods, or for identifying differentially expressed transcripts or genes. Our differential expression model combines the uncertainty of the expression estimates with variances estimated from biologically replicated experiments to identify significantly differentially expressed transcripts with improved precision.
We present advantages of using BitSeq in RNA-seq datasets dealing with multi-mapping reads and non-uniform read distribution. Experiments with real and synthetic datasets show that BitSeq produces state-of-the-art results in both expression estimation and differential expression analysis.
N005 - REDItools: a suite of python scripts for RNA editing detection by massive NGS data
Short Abstract: RNA editing is a post-transcriptional molecular phenomenon whereby a genetic message is modified from the corresponding DNA template by means of substitutions, insertions and/or deletions. In human it mainly involves the deamination of adenosines to inosines by the family of ADAR enzymes acting on double RNA strands. A-to-I RNA editing has a plethora of biological effects depending on the RNA region involved in the modification. Changes in UTRs can lead to altered expression, whereas modifications in coding protein regions can induce amino acid replacements with more or less severe functional consequences.
RNA editing events can be detected at genomic scale by RNA-Seq technology. Indeed, thousands of candidates have been recently identified and validated in human by direct comparison with whole genome sequencing data in order to skip out single nucleotide variations (SNPs).
Although several methodologies have been developed to explore RNA editing in eukaryotic transcriptomes, no comprehensive software for this aim has been released to date. For this reason we developed REDItools, a suite of python scripts aimed at the study of RNA editing at genomic scale using massive sequencing data. REDItools enable the genome-wide detection of RNA editing changes by using RNA-Seq and DNA-Seq data or RNA-Seq data alone. In addition, they implement effective filters to minimize false positives due to sequencing errors, mapping errors and SNPs.
The REDItools package and documentation is freely available at Google Code repository ( and released under the MIT license.
N006 - Exploiting Adaptive Bayesian Regression Shrinkage to identify Exome Sequence Variants associated with Gene Expression.
Short Abstract: Next-Generation exome sequencing identifies thousands of DNA sequence variants in each individual. Methods are needed that can effectively identify which of these variants are associated with changes in gene expression. As we expect only a few SNPs to be causal, we need methods that induce sparse models. The Normal-Gamma prior has been shown to induce adaptive shrinkage within the Bayesian linear model framework (large effects are shrunk proportionally less than small effects). Using simulated data we assess the efficacy and limitations of this Bayesian shrinkage method in comparison to other published methods (least squares, piMASS and HyperLasso) in parsimoniously identifying such sequence variants. The model is then validated using publicly available human and yeast datasets. We further develop the model to include the uncertainty in gene expression; SNP functional information obtained from online databases; and the uncertainty in the allele calls.

Our developments to the Normal-Gamma prior provide a suitable framework that has been shown via simulation to successfully identify causal DNA sequence variants (SNPs) affecting the gene expression level. Taking a fully Bayesian approach, permitted by the Normal-Gamma prior, allows the various sources of uncertainty to be incorporated in a coherent manner.
N007 - Analyzing molecular sequence data with planar split networks
Short Abstract: Tree-like models are widely used in the analysis and visualization of evolutionary relationships among a set of organisms and are commonly constructed from molecular sequence data. These models appeal to scientists because of their intuitive structure but may not reflect the inherent complexity of certain evolutionary scenarios. Split networks are a popular tool to provide a quick snapshot of the data which can then help, for example, to assess whether using a tree-like model is appropriate or not. In addition, these networks can also help to detect and visualize potential relationships between sequence evolution and other aspects of a data set such as, for example, the geographic distribution of the corresponding organisms. The application of split networks for such purposes, however, can be hampered by the fact that some methods tend to produce networks overly cluttered by overlapping branches or, to avoid this, constrain the structure of the resulting network more than necessary.

We present a software package that implements several new methods for generating, processing and visualizing so-called planar split networks. This type of split network is the most general one that avoids the problem with overlapping branches and, thus, offers as much flexibility as possible for visualizing evolutionary scenarios in this way. The software is open source and its current version is freely available online at We demonstrate the applicability of the new methods on several biological data sets.
N008 - Newtonian dynamics in the space of phylogenetic trees
Short Abstract: A classic phylogenetic tree is a simple directed graph. Edges are either present or absent and searching for a phylogenetic tree is a discrete optimisation problem. We have been developing an alternative view. Allowed trees are just points within a continuous space. Connections are continuous properties which behave like coordinates. If we know the similarities between objects like sequences, we can see how well the set of coordinates (connections) fits the experimental data. The greater the disagreement, the greater the force acting on the connections. This leads to a method for generating phylogenetic trees. One can perform classic, conservative Newtonian dynamics in the space which includes all possible trees.

At the moment, we are limited to distance-based phylogenies, but the method has an advantage over Monte Carlo methods: it uses gradient information, so sampling can be quite efficient. Like Monte Carlo methods, extensions such as simulated annealing or replica exchange are easy to implement. We see the long term benefit as a means of providing efficient sampling for seeding more sophisticated methods such as Bayesian inference.
N009 - Agricultural bio-information resources in NABIC
Short Abstract: NABIC (National Agricultural Biotechnology Information Center) has established an integrated management system for agricultural omics information to archive and analyze agricultural bio-information resources in Korea.
The amount of bio-information is increasing enormously due to the emergence of NGS (Next Generation Sequencing) technology. We are building, maintaining and providing agricultural bio-information databases and information services.
Various data types are available for submission, such as genome, proteome, transcriptome, metabolome and molecular marker data. We issue a submission confirmation, which can be used as evidence of research achievement. Currently, the amount of data submitted to our system is 14Tb. We also provide various analysis pipelines, such as NGS analysis (de novo, RNA-seq, reference assembly), gene annotation, GWAS, microbial community analysis and differential expression profiling analysis, using submitted data through the web.
NABIC System is available through web site(
N010 - Towards a high-quality barley genome reference sequence using deep multiplexing of BAC pools and whole-genome shotgun sequencing
Short Abstract: The barley genome is large (haploid size: 5.1 Gb) and contains a large fraction of repetitive DNA (~80%). A physical map and a reference sequence of the genome have been constructed by the International Barley Genome Sequencing Consortium (IBSC, Currently, ~80% (3.9 Gb) of the barley genome is represented by a genetically anchored physical map. To improve the quality of the barley whole-genome shotgun (WGS) reference sequence, sequencing of about 70,000 BAC clones derived from the minimum tiling path of the physical map has been launched, mostly by deep multiplexing on the Illumina platform. Furthermore, paired-end and mate-paired whole-genome shotgun sequencing data from genomic DNA were generated and de novo assembled using ALLPATHS-LG. Here we present first results from the sequencing and assembly approaches.
N011 - hCKSAAP_UbSite: a Web Server to Predict Ubiquitination Sites in the Proteome of Human
Short Abstract: As one of the most common post-translational modifications, ubiquitination regulates the quantity and function of a variety of proteins. Experimental and clinical investigations have also suggested crucial roles for ubiquitination in several human diseases. The complicated sequence context of human ubiquitination sites revealed by proteomic studies highlights the need for sophisticated computational strategies to predict human ubiquitination sites. Based on our ubiquitination site prediction method, a user-friendly web server called hCKSAAP_UbSite was constructed. Briefly, hCKSAAP_UbSite is a hybrid method which integrates the outputs of four complementary classifiers through a logistic regression model. In 5-fold cross-validation on a class-balanced training dataset, hCKSAAP_UbSite achieved an area under the ROC curve (AUC) of 0.770. The protein sequence input to our server should be in RAW or FASTA format. Only the 20 conventional amino acid symbols are supported. The prediction output consists of seven items: position, scores of four different predictors, the final score and ubiquitination site annotations (YES for ubiquitination sites, NO for non-ubiquitination sites). Additionally, graphical output is given on the result page to show the positions of the lysines of a query protein and the corresponding prediction scores. Given its promising performance on our large-scale dataset, this predictor has been made available at
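The integration step described above (four classifier outputs combined through a logistic regression model) can be sketched as follows. This is only an illustration of the general technique: the function name, the weights and the bias are hypothetical placeholders, since the abstract does not report the fitted coefficients or the exact form of the four predictors.

```python
import math
from typing import Sequence


def combine_scores(scores: Sequence[float],
                   weights: Sequence[float],
                   bias: float) -> float:
    """Combine per-classifier scores into one probability-like final score.

    Assumption: a plain logistic model, sigmoid(bias + sum(w_i * s_i)).
    The weights and bias are hypothetical, not the server's fitted values.
    """
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    z = bias + sum(w * s for w, s in zip(weights, scores))
    # Numerically stable sigmoid for large positive or negative z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


# Hypothetical scores of four predictors for one candidate lysine.
final_score = combine_scores([0.8, 0.6, 0.7, 0.4],
                             weights=[1.0, 1.0, 1.0, 1.0],
                             bias=-2.0)
```

In such a scheme a lysine would be annotated YES when the final score exceeds a chosen threshold; the threshold itself is another parameter the abstract does not specify.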
N012 - GAGE–B: Evaluation of Genome Assemblers for Bacterial Organisms
Short Abstract: A large and rapidly growing number of bacterial organisms have been sequenced by the newest sequencing technologies. Cheaper and faster sequencing technologies make it easy to generate very high coverage of bacterial genomes, but these advances mean that DNA preparation costs can exceed the cost of sequencing for small genomes. The need to contain costs often results in the creation of only a single sequencing library, which in turn introduces new challenges for genome assembly methods.

We evaluated the ability of several genome assembly programs to assemble bacterial genomes from a single, deep-coverage library. For our comparison, we chose eight bacterial species spanning a wide range of GC content, and measured the contiguity and accuracy of the resulting assemblies. We compared the assemblies produced by this very-high-coverage, one-library strategy to the best assemblies created by two-library sequencing, and found that remarkably good bacterial assemblies are possible with just one library. We also measured the effect of read length and depth of coverage on assembly quality and determined the values that provide the best results with current algorithms.
N013 - A Maximum-Flow Approach to Resolve Ambiguous RNA-Seq Read Mappings for Gene Prediction
Short Abstract: The advent of RNA-Seq, the high-throughput processing of RNA by next generation sequencing technologies, enabled gene prediction assisted by transcriptomic information. Although RNA-Seq reads provide an accurate view on splicing events and currently expressed genes, the correct assessment of RNA-Seq information poses difficult challenges. Frequently, reads map ambiguously to various positions in the reference and it is necessary to distinguish wrong from correct alignments. We present an integrative Maximum-Flow network formulation to resolve ambiguous mappings based on candidate gene identifications. This approach reassigns reads to their most likely position with regard to the likeliness of their corresponding initially defined expressed gene regions. After the optimization, the predicted candidate genes are verified based on the support by remaining mapping reads.
The approach is evaluated in several simulations on prokaryotic as well as eukaryotic species and compared to other gene prediction methods. Further it is applied to a real data set comprising reads from Saccharomyces cerevisiae, showing the ability to accurately resolve ambiguous mappings.
N014 - MicroRNA Bindings to Meta-stable RNA Secondary Structures
Short Abstract: Recent literature has raised the question about potential microRNA bindings to meta-stable RNA secondary structures that are close to minimum free energy conformations. We studied the problem in the context of single nucleotide polymorphism and mRNA concentration levels, i.e. whether or not the number of meta-stable structures and features of miRNA bindings to meta-stable conformations could provide additional information supporting the differences in expression levels of mRNAs and their corresponding SNP-infected sequences.

We focused on triples [mRNA/3'UTR;SNP;miRNA] that have been analysed in recent literature by methods including PCR and/or luciferase reporter assays. Due to computational limitations imposed by the huge number of secondary structures within even a small energy range above the mfe conformation, only mRNA/3'UTR sequences below approximately 1000nt could be included.

The ten cases we analyzed are
[LIG3;rs4796030;miR-221;L=124]; [CBR1;rs9024;miR-564;L=284];
[HLA-G;rs1063320;miR-148a-3p;L=386]; [PARP1;rs8679;miR-145-5p;L=769];
[IL-23R;rs10889677;let-7e;L=851]; [RYR3;rs1044129;miR-367;L=880];
[HOXB5;rs9299;miR-7-5p;L=952]; [RAD51;rs7180135;miR-197-3p;L=978];
[ORAI1;rs76753792;miR-519a-3p;L=1034]; [RAP1A;rs6573;miR-196a;L=1078].

The [miRNA;3'UTR] bindings were evaluated by using the latest version of STarMir (additional tools: RNAsubopt and Barriers from the Vienna RNA tool-set).

Due to the location of the binding region, there are identical data for both alleles for [RAP1A;rs6573;miR-196a;L=1078].

Assuming 100 copies as an upper bound for typical concentration levels, we obtained for the remaining nine cases stronger binding predictions for the ~100 meta-stable conformations with lowest free energy values for the allele with the stronger inhibitory effect.

Except for [RAD51;rs7180135;miR-197-3p;L=978], the total number of meta-stable conformations either differs at most by about 17% (avg 13.4%) for both alleles, or is significantly larger (>50%) for the allele with the stronger inhibitory effect (cases CBR1, PARP1, RYR3).
N015 - CMCompare webserver: Comparing RNA families via Covariance Models
Short Abstract: Novel non-coding RNAs can be identified by homology search with RNA family models.
These capture the common sequence and structure of the family (e.g. RNaseP)
and can be constructed with the INFERNAL tool-chain in the form of covariance models.

Models should only detect family members (specificity), but still find remote homologs
(sensitivity). Prior to inclusion into a database like Rfam they should be
checked for these features and uniqueness.

The CMCompare webserver makes it possible to address the question of specificity and
uniqueness by comparing user provided models against all models in Rfam.
It computes a pairwise linkscore that measures how strongly the two models

Models that are strongly linked by CMCompare can be explained either by a lack of
discriminatory power of at least one of the models or by a biological connection
between them. This is for example the case when both families are derived from a
common super-family.

The results are ranked by linkscore and can be further sorted and filtered
by several criteria to highlight the models of interest. A weighted graph
representation with models as nodes and their links as edges provides a
quick overview.

Moreover, comparison and visualization of the relationship within a set of uploaded models
also makes it possible to find groups of families, so-called clans.
They share a common ancestor, but are too divergent to be aligned or could be aligned,
but have distinct functions. Clan formation in Rfam has been a manual process up to now.

The CMCompare webserver is open and (GPL-3) freely available at:
N016 - 2D meets 4G: G-Quadruplexes in RNA Secondary Structure Prediction
Short Abstract: Guanosine-rich nucleic acid sequences fold into locally stable structural elements known as G-quadruplexes. Their stability results from pi-orbital interactions between stacked layers of G-quartets, i.e. planar assemblies composed of four Hoogsteen-bound guanines. Furthermore, a centrally located cation between two successive quartets increases the stability of the quadruplex even more.

In this work we extend the combinatorial theory of RNA structures and the dynamic programming algorithms for RNA secondary structure prediction to incorporate G-quadruplexes using a simple but plausible energy model. Our preliminary energy parameters were obtained from thermodynamic data based on UV-melting and circular dichroism spectroscopy of 3-layered RNA quadruplexes.

Benchmarking to assess the false positive rate was done with 45,511 sequences from 2185 Rfam seed alignments and resulted in an upper bound of 1.4% for single sequences and 0.7% for consensus structure prediction. A re-investigation of 17 mRNAs, for which the effect of a putative G-quadruplex on protein expression has been determined, shows almost complete agreement of our predictions with the experiment. In a large-scale scan of primate genomes taken from the Ensembl database, fly genomes taken from Flybase, worm genomes taken from Wormbase, and bacteria/archaea genomes from the NCBI database we find that the overwhelming majority of putative quadruplex forming sequences are more likely to fold into canonical secondary structures instead. However, stable G-quadruplexes are strongly enriched in the 5' UTR of protein coding mRNAs in primate genomes.
N017 - In silico identification of hidden human antimicrobial peptides
Short Abstract: Cationic AntiMicrobial Peptides (CAMPs) are small peptides which exert a direct microbicidal activity and constitute the most ancient arm of the innate immune system of multicellular eukaryotes. These peptides are promising therapeutic agents and our research group is focused on the development of new CAMPs against the most common pathogens in the lung infections of cystic fibrosis patients.
In recent years, several proteins which show antibacterial activity not correlated with their primary function have been discovered; these proteins seem to act as carriers, within their primary structure, of “cryptic” CAMPs that could be released by the action of human or bacterial proteases (1).
In order to identify potential CAMPs contained inside the primary structure of human proteins we have developed a scoring function based on a charge score, a hydrophobicity score and a plasticity score of short sequences (9-30 aa). The main novelty of our system is the possibility of tuning the scoring function in order to discover CAMPs which are in principle more active against a particular bacterial strain.
We performed an automated analysis of the human secretome in the UniProt database and, even applying stringent criteria, more than 300 peptides were identified from 3393 sequences. Among them, we found the most important known human CAMPs and an interesting number of potential new CAMPs inside proteins correlated with blood coagulation, immune processes and activities involving the extracellular matrix. The most promising peptides will be produced and characterized.

(1) D’Alessio G., 2011; FEBS Letters 585(15):2403-4.
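The window-scanning idea described in this abstract can be sketched as below. This is a hedged illustration, not the authors' scoring function: the plasticity score is omitted because the abstract does not define it, the Kyte-Doolittle hydropathy scale is swapped in as a stand-in for the hydrophobicity score, the charge score simply counts K/R as +1 and D/E as -1, and the weights `w_charge` and `w_hydro` are hypothetical tuning parameters.

```python
# Kyte-Doolittle hydropathy values (stand-in for the abstract's hydrophobicity score).
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}
# Simplified net charge at neutral pH (assumption: histidine counted as neutral).
CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}


def window_score(peptide, w_charge=1.0, w_hydro=1.0):
    """Weighted sum of net charge and mean hydropathy (hypothetical form)."""
    charge = sum(CHARGE.get(aa, 0) for aa in peptide)
    hydro = sum(KD.get(aa, 0.0) for aa in peptide) / len(peptide)
    return w_charge * charge + w_hydro * hydro


def best_windows(protein, min_len=9, max_len=30, top=5, **weights):
    """Score every 9-30 aa window of a protein and return the top hits."""
    scored = []
    for length in range(min_len, min(max_len, len(protein)) + 1):
        for start in range(len(protein) - length + 1):
            frag = protein[start:start + length]
            scored.append((window_score(frag, **weights), start, frag))
    scored.sort(key=lambda hit: hit[0], reverse=True)
    return scored[:top]
```

Changing the weights shifts which windows rank highest, which is the kind of tuning the abstract describes for targeting a particular bacterial strain.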
N018 - Context-based mapping of RNA-seq data with ContextMap 2.0
Short Abstract: Sequencing of RNA (RNA-seq) using next generation sequencing has become the standard approach for profiling the transcriptomic state of a cell. This requires mapping of the sequencing reads to determine their transcriptomic origin.
Recently, we developed a context-based mapping approach, ContextMap, which determines the most likely origin of a read by evaluating the context of the read in the form of alignments of other reads to the same genomic region. In the original implementation, the focus was on improving initial mappings provided by other mapping tools.
Here, we present ContextMap 2.0, an extension of the original ContextMap method, which can also be used as a standalone tool without relying on initial mappings by other tools. We show that it yields highly accurate read mappings and is very robust against sequencing errors. The design of ContextMap 2.0 allows for massively parallelized data processing, resulting in reasonable running times despite the higher complexity of the context-based approach.
N019 - Advancing on fusion gene detection from RNA-seq data
Short Abstract: Gene fusions are known to be related to cancer and can be interpreted as hallmarks for the corresponding disease. Therefore, an accurate detection of these translocations is crucial for cancer diagnostics. RNA-seq technology has been established as the most promising approach to detect fusions on a genome-wide scale. Unfortunately, RNA-seq data analysis can be challenging due to the complexity and amount of information to be processed and the intrinsic error rate of the technology.
Some computational methods have been proposed to deal with these challenges. However, recent studies have demonstrated the inconsistency of the results between commonly used approaches applied to the same validated datasets. Here we propose a novel approach for fusion gene detection, which allows a more efficient prediction of transcriptome rearrangements. Our algorithm uses both local alignment of reads and the paired-end information to search for alignments which can support a fusion. Found alignments are then efficiently clustered and scored to detect putative fusion events. Similarly to other existing methods, a number of filters are applied to these fusions to increase the specificity of the results. The significance of fusion event is estimated by taking into account the expression of the involved genes, coverage pattern of supporting reads, mappability of the genomic region and other factors.
We have validated the current implementation of our algorithm on simulated RNA-seq datasets and several public cancer datasets. In these tests, our method demonstrated prediction accuracy equal to or higher than that of other methods while providing better computational performance.
N020 - A hybrid approach to assemble complex transcriptomes
Short Abstract: A transcriptome assembly is usually performed de novo or using a reference genome. Each methodology was used successfully to assemble transcripts by aligning reads generated from various next generation sequencing (NGS) platforms. However, using a reference based approach to assemble transcriptomes of newly sequenced polyploid and heterozygous organisms such as fruit trees is tedious. This is mainly due to the presence of ambiguous regions containing a high number of repetitive regions or highly fragmented assembly of these genomes. We will report a hybrid approach used for transcriptome assembly that exploits both methodologies and uses reads generated by 454 and Illumina NGS technologies. A first de novo assembly step is performed using the longest, unique and high quality 454 reads. Assembled transcripts together with shorter 454 and Illumina reads are then aligned to the reference draft genome in order to detect protein coding genes. Assembled transcripts are then iteratively extended using neighbor reads. Finally, to assess the quality of the transcripts we performed an ORF prediction based on protein similarity obtained from public protein databases. Furthermore, our strategy enabled the rapid identification of proteins with unknown domains and non-coding RNAs with high confidence.
N021 - microRNA variability is likely an additional layer of control of gene expression
Short Abstract: High throughput sequencing has unveiled an unexpected landscape of miRNA sequence variants (isomiRs) affecting mainly the 3´ terminus, and to a lesser extent the 5´ terminus of annotated mature miRNAs. The least common type of variation is nucleotide substitution in the miRNA sequence. It is currently unknown how widespread isomiRs are, whether they are generated by the same mechanisms in different species and what role (if any) they have in development.

To study whether isomiRs have a role in development, we characterized isomiRs in publicly available data from human and macaque brains of different ages. We found that 29 miRNAs presented the same type of isomiRs with differential relative abundance at two ages (newborn and old), both in macaque and human brain. Functional enrichment analysis of reference miRNA targets points to energy metabolism-related pathways, while isomiR targets are more enriched in brain-specific genes. This suggests that miRNA sequence plasticity contributes to different aspects of brain development and aging.

We conclude that miRNA variability is likely an additional layer of control of mRNA targeting and silencing.
N022 - SHIFTDB: Database of hidden stop codons in frame-shifted translation in mitochondrial genomic sequences
Short Abstract: Reading frames play an important role in the process of translation of proteins from nucleotide sequences. Selection of a wrong reading frame could lead to the wrong protein products, which might have lethal results. Frame-shift mutation is a genetic mutation caused generally by indels, i.e. insertion and deletion of nucleotides. Coding sequences lack stop codons, but many stop codons appear off-frame. Off-frame stops, i.e. stop codons in +1 and -1 shifted reading frames, are termed hidden stop codons. A stop codon keeps a check on the translation process and hence controls the protein product. Frame-shifts lead to waste of energy, resources and activity of the biosynthetic machinery. In addition, some peptides synthesized after frame-shifts are probably cytotoxic and might even lead to severe diseases such as cancer. Mitochondrial genomic study may reveal significant insight into many aspects of genome evolution, such as gene rearrangements, gene regulation, and replication mechanisms. Here we present a database that was compiled by identifying hidden stop codons in mitochondrial genomic DNA sequences. We have checked hidden stops in both +1 and -1 frame-shifts with respect to the mitochondrial genetic code system. The database provides retrieval information for various categories of codons with their respective contribution to hidden stops. Further, it will also provide codon usage frequencies and the contribution of codons to hidden stops in the off-frame context. This database will help computational and evolutionary biologists in the analysis of frame-shifted translation in mitochondrial coding genomic sequences.
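A minimal sketch of how hidden stop codons could be located in a coding sequence is given below. It is not the SHIFTDB implementation. It assumes the vertebrate mitochondrial genetic code (NCBI translation table 2), in which TAA, TAG, AGA and AGG are stops and TGA is not, and it treats a -1 shift as equivalent to reading from offset 2 within the same sequence; the function name is hypothetical.

```python
# Stop codons of the vertebrate mitochondrial code (assumption: NCBI table 2).
MITO_STOPS = frozenset({"TAA", "TAG", "AGA", "AGG"})


def hidden_stops(cds, stops=MITO_STOPS):
    """Return 0-based start positions of off-frame stop codons.

    `cds` is assumed to start at the first base of the in-frame start codon.
    The +1 frame is read from offset 1; the -1 frame is read from offset 2,
    which is the same codon phase as shifting back by one nucleotide.
    """
    cds = cds.upper()
    result = {}
    for label, offset in (("+1", 1), ("-1", 2)):
        result[label] = [i for i in range(offset, len(cds) - 2, 3)
                         if cds[i:i + 3] in stops]
    return result


# In-frame codons ATG ATA GCC contain no stop, but the +1 frame reads TGA TAG.
example = hidden_stops("ATGATAGCC")
```

Counting the in-frame codons that overlap each reported position would give the per-codon contribution to hidden stops that the abstract mentions.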
N023 - RNA Bricks - a database of RNA interactions
Short Abstract: Non-coding RNAs (ncRNAs) were found to be involved in many cellular processes from the gene transcriptional regulation to the catalysis of chemical reactions. Many ncRNAs, including cis-regulatory elements, are modular biomolecules, composed largely of recurrent structural modules glued together via mutual interactions into a compact, functional, 3D structure.

The RNA Bricks database provides information about recurrent RNA modules and their interactions, both with themselves and with proteins. In contrast to other similar tools (RNA Frabase, Rloom), RNA modules are presented here in their natural environment, that is, in the neighborhood of other molecules in a solution or a crystal.

Two other features make RNA Bricks unique:

The first is a robust algorithm for 3D motif search and comparison. The algorithm compares the spatial positions of backbone atoms in a query and in a representative set of RNA modules, without using secondary or primary structure information. This enables the study of local structural similarities among distant RNA homologs and evolutionarily unrelated analogs.

The second is the availability of three structure-quality scores at single-nucleotide resolution. These scores facilitate the identification of regions with poor backbone geometry, severe steric clashes, and low real-space correlation coefficients for experimental diffraction data (if available). The local quality assessment enables the selection of good modules from any RNA structure, not only the ones believed to be most reliable (e.g. high-resolution crystallographic models). This is particularly important given the small number of unique RNA structures available in the PDB.
N024 - diChIPMunk: deriving dinucleotide TFBS models from ChIP-Seq data
Short Abstract: Analysis of regulatory sequence motifs is an essential component of studies focused on transcriptional regulation in higher eukaryotes. In particular, transcription factor binding sites (TFBS) models are often produced using motif discovery methods. One of the most widely used TFBS models is a mononucleotide positional weight matrix (PWM).

High-throughput sequencing methods, like ChIP-Seq, produce enough data to adequately train more advanced models with more parameters. Yet classic PWMs still compete successfully with more complex approaches, as shown in a recent benchmark published by the DREAM consortium [Weirauch et al. 2013]. In particular, it was shown that our previous algorithm, ChIPMunk, performed quite well with basic PWMs trained on ChIP-Seq data.

We present our new computational tool, diChIPMunk, which is able to construct dinucleotide PWMs accounting for dependencies between neighboring nucleotides within TFBS and for the dinucleotide composition of input sequences. diChIPMunk inherits all ChIPMunk advantages, including its ability to utilize ChIP-Seq base coverage profiles as prior knowledge on motif positioning.

Using several public ChIP-Seq datasets we show that dinucleotide PWMs from diChIPMunk clearly outperform ChIPMunk models and existing PWMs from published TFBS model databases.
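To make the notion of a dinucleotide PWM concrete, here is a minimal generic sketch that builds a log-odds matrix over overlapping dinucleotides from a set of aligned sites and scores a candidate sequence. It is not the diChIPMunk algorithm: motif discovery, ChIP-Seq coverage priors and the background dinucleotide composition are all omitted. The function names, the pseudocount, and the uniform background of 1/16 per dinucleotide are assumptions made for illustration.

```python
import math
from itertools import product

DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]


def build_dipwm(sites, pseudocount=1.0):
    """Build a dinucleotide log-odds matrix from equal-length aligned sites.

    Column i describes the dinucleotide at positions (i, i + 1), so a motif
    of length L gives L - 1 columns. Background is assumed uniform (1/16).
    """
    width = len(sites[0]) - 1
    matrix = []
    for pos in range(width):
        counts = {d: pseudocount for d in DINUCLEOTIDES}
        for site in sites:
            counts[site[pos:pos + 2]] += 1
        total = sum(counts.values())
        matrix.append(
            {d: math.log((c / total) / (1.0 / 16.0)) for d, c in counts.items()}
        )
    return matrix


def score(matrix, seq):
    """Sum the log-odds of the overlapping dinucleotides of seq."""
    return sum(col[seq[i:i + 2]] for i, col in enumerate(matrix))


if __name__ == "__main__":
    pwm = build_dipwm(["ACGT", "ACGT", "ACGA"])
    print(score(pwm, "ACGT"), score(pwm, "TTTT"))
```

Because each column covers two adjacent positions, the model can reward or penalize specific neighbor combinations that a mononucleotide PWM treats as independent.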
N025 - PhysBinder: improving the prediction of transcription factor binding sites by flexible inclusion of biophysical properties
Short Abstract: The most important mechanism in the regulation of transcription is the binding of a transcription factor (TF) to a DNA sequence called the transcription factor binding site (TFBS). Most binding sites are short and degenerate, which makes predictions based on their primary sequence alone somewhat unreliable. We present a new web tool that implements a flexible and extensible algorithm for predicting TFBS. The algorithm makes use of both direct (the sequence) and several indirect readout features of protein–DNA complexes (biophysical properties such as bendability or the solvent-excluded surface of the DNA). This algorithm significantly outperforms state-of-the-art approaches for in silico identification of transcription factor binding sites. Users can submit FASTA sequences for analysis in the PhysBinder integrative algorithm and choose from more than 60 different TF binding models. The results of this analysis can be used to plan and steer wet-lab experiments. The PhysBinder web tool is freely available at
N026 - Classification-Based Filtering of Contigs from de novo Assemblies
Short Abstract: Whereas the quality of mappings of next-generation sequencing (NGS) reads can be measured with respect to the reference, the evaluation of de novo assembled sequences is still a major challenge and often requires significant manual interaction.

Current approaches include single feature metrics such as the N50, the aggregation of features or likelihood based evaluations. What they have in common is that they evaluate complete assemblies.
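For readers unfamiliar with the N50, the sketch below computes it from a list of contig lengths: the length of the contig at which the cumulative sum of lengths, taken from longest to shortest, first reaches half of the total assembly length. The function name `n50` is chosen for illustration only.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")
    half_total = sum(lengths) / 2.0
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


if __name__ == "__main__":
    # Total is 290; 80 + 70 = 150 reaches half of it (145), so the N50 is 70.
    print(n50([80, 70, 50, 40, 30, 20]))
```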

However, several applications of de novo assembly can be performed with contigs directly, such as gene annotation. Since low quality contigs may compromise the results, such experiments benefit from a quality-filtered set of contigs.

We present a supervised learning classification approach using random forests for the selection of reliable contigs. Contigs are examined for several input features including contig length, contig quality values, read quality values, read count, coverage and relative N50, as well as more complex derived features. As target values, eleven quality metrics are calculated by aligning a contig to a known ground truth reference sequence. The target values are separated into positive and negative classes. After the classification of a contig for each target value, a voting procedure over all eleven target values determines the final result.

We evaluated our method with different collections of read sets with similar characteristics (organism and sequencer). Each collection is separated into a training and a test set. The preliminary results yield satisfactory sensitivity and specificity for data with similar conditions.
N027 - Evolutionary profile hidden Markov models; Application to remote homology detection
Short Abstract: We present closed-form analytical solutions for a probabilistic model
of biological sequence evolution that is compatible with the standard
affine insertion/deletion cost model used in sequence comparisons. By
being probabilistic and allowing affine insertions and deletions, this
new evolutionary model can be directly applied to standard profile and
pair hidden Markov models (HMMs) used for protein and DNA
homology. Under an evolutionary interpretation, each
Match/Delete/Insert "unit" of the HMM describes the possible fates
of an ancestral residue (whether substituted or deleted, and possibly
sustaining an arbitrary number of insertions before the next ancestral
unit). Under our evolutionary parameterization, the transition
probabilities of the HMM become analytic time-dependent functions that
depend on position-specific ("profiled") non-negative rate
parameters. For a given profile HMM parameterized as a model of a
particular alignment, one can construct an evolutionary version of the
HMM where its position-specific rates can be estimated under certain
assumptions without need for any additional information. A comparison
of the power to detect remote homologies by the original and the
time-dependent versions of the profile HMM is presented.
N028 - Horizontal gene transfer of plant-specific leucine-rich repeats between plants and bacteria
Short Abstract: Leucine rich repeats (LRRs) are present in over 14,000 proteins that have been identified in viruses, bacteria, archaea, and eukaryotes. Two to sixty-two LRRs occur in tandem forming an overall arc shaped domain. There are eight classes of LRRs. Plant specific LRRs (class: PS-LRR) had previously been recognized in only plant proteins. However, we find that PS-LRRs are also present in proteins from bacteria. We investigated the origin of bacterial PS-LRR domains. PS-LRR proteins are widely distributed in most plants; they are found in only a few bacterial species. There are no PS-LRR proteins from archaea. Bacterial PS-LRRs in twenty proteins from eleven bacterial species in the three phyla are significantly more similar to the PS-LRR class than to the other seven classes of LRR proteins. Not only amino acid sequences but also nucleotide sequences of the bacterial PS-LRR domains show highly significant similarity with those of many plant proteins. The EGID program predicts that Synechococcus sp. CYA_1022 came from another organism. Four bacterial PS-LRR proteins contain AhpC-TSA and IgA peptidase M64; these domains show no similarity with any eukaryotic (plant) proteins, in contrast to the similarities of their respective PS-LRRs. The present results indicate that horizontal gene transfer (HGT) of genes/gene fragments encoding PS-LRR domains occurred between bacteria and plants, and HGT among the eleven bacterial species, of the three phyla, as opposed to descent from a common ancestor. There is the possibility of the occurrence of one HGT event from plant to bacteria.
N029 - De novo transcriptome sequencing and assembly of tardigrade species from ultra low input mRNA-Seq
Short Abstract: Limno-terrestrial tardigrades can withstand almost complete desiccation through a mechanism called anhydrobiosis, and several of these species have been shown to survive the most extreme environments, including exposure to space vacuum. The molecular mechanism for this tolerance has so far been studied in many anhydrobiotic metazoans, leading to the identification of several key factors such as the accumulation and vitrification of trehalose as well as the expression of LEA proteins to prevent protein aggregation. On the other hand, the comprehensive molecular mechanisms and the regulation machinery of metabolism during anhydrobiosis are yet to be explored. One of the major hurdles in the analysis of these microscopic non-model organisms is the relatively large amount of biological sample required for high-throughput analyses. To this end, we have conducted a de novo transcriptome survey of multiple tardigrade species using ultra-low input mRNA-Seq with the SMARTer amplification method. By comparing the amplification-based method with draft genome annotations and with mRNA-Seq without amplification, we discuss the feasibility of such an amplification-based method, and suggest a working pipeline for de novo transcriptome sequence assembly and annotation as well as quantification for such purposes.
N030 - StatsDB: platform-agnostic storage and understanding of next generation sequencing run metrics
Short Abstract: Modern sequencing platforms generate enormous quantities of data in ever-decreasing amounts of time. With such data comes a significant challenge to understand its quality and to understand how quality and yield are changing across instruments and over time. As well as the desire to understand historical data, centres often have a duty to provide clear summaries of individual run performance to collaborators or customers.

We present StatsDB, an open-source software package for storage and analysis of next generation sequencing run metrics. The system has been designed for incorporation into the standard primary analysis pipeline of, for example, Illumina sequencers. Statistics are stored in an SQL database and an API provides the ability to store and access the data while abstracting the underlying database design. This abstraction allows simpler, wider querying across multiple fields than is possible by the manual steps and calculation required to dissect individual reports, e.g. "provide metrics about nucleotide bias in libraries using adaptor barcode X, across all runs on sequencer A, within the last month".

The software is supplied with modules for storage of statistics from FASTQC[1], a commonly used tool for analysis of sequence reads, but the open nature of the database schema means it can be easily adapted to other tools. Currently, reports can be produced as PDFs, or accessed through the MISO LIMS system[2], but the API makes it easy to develop custom reports and to interface with other packages.


N031 - Hierarchical non-parametric Bayesian clustering of digital gene expression data
Short Abstract: Next-generation sequencing is rapidly replacing microarrays for studying gene expression. When applied to an RNA sample, these technologies produce a library of millions of sequence tags, which constitute a discrete measure of gene expression. Typically, data of this type is characterised by over-dispersion and low numbers of biological replicates, which poses new challenges to their statistical analysis. An important aspect of this analysis is to identify interesting patterns in the data by partitioning it into different clusters. Existing clustering methods require a distance metric for measuring the similarity between different data points or a generative mixture model, where each mixture component corresponds to a different cluster. Both approaches are sensitive to noise, which is extensive in biological data.

We propose a novel non-parametric Bayesian clustering algorithm based on an infinite mixture model, which combines the Negative Binomial distribution for modelling overdispersed count data with the Hierarchical Dirichlet Process (HDP). The HDP permits information sharing within and between samples, thus compensating for the low number or even absence of replicates and making possible the robust estimation of the mean and dispersion parameters of each mixture component. Moreover, the use of an infinite mixture does not require any a priori information regarding the number of clusters, which is estimated together with the other model parameters. The algorithm is applied on several digital gene expression datasets, including CAGE data from different regions of the macaque brain, and it is shown to outperform several popular clustering algorithms, particularly at low numbers of replicates.
N032 - Gene-centered, condensed & time-efficient visualization of mRNA Seq Data
Short Abstract: The analysis of transcriptomes is important to understand the molecular mechanisms of diseases. mRNA-seq represents an unbiased genome-wide approach to study the expression and isoform distribution of genes in normal and diseased tissues. However, there is currently a lack of powerful tools to compute and visualize the coverage of unique and ambiguous reads in a time-efficient, gene-centered manner: many genes have multiple large introns, which makes them cumbersome to inspect with existing tools such as Integrative Genomics Viewer, UCSC Genome Browser or GBrowse.
Therefore, a novel workflow was implemented to generate visualizations on-the-fly and in reasonable time. Reads were mapped to the genome using the STAR aligner, which allows for spliced alignment of mRNA-seq reads. For a given input gene, the corresponding transcript positions are fetched from a local Ensembl database. The genomic coverage is then assembled using SAMtools mpileup from a single or multiple BAM files, transformed to absolute transcript positions and written into a flat file, which is then used for the visualization with ggplot2. Our approach displays the transcript-specific coverage as a stack of both unique and ambiguous alignments, which allows the user to focus on exonic gene regions. Each step in the workflow is parallelized. Pooling of multiple BAM files enables the user to aggregate the coverage on different levels. This allows a flexible extension of the sequencing depth for low-abundance genes like GPCRs. An additional feature is the optional extension of the transcript start and stop positions, which allows UTR refinement.
N033 - Accelerating the original profile kernel
Short Abstract: One of the most accurate multi-class protein classification systems continues to be the profile-based SVM kernel introduced by the Leslie group. Unfortunately, its CPU requirements render it too slow for practical applications of large-scale classification tasks. Here, we introduce several improvements that enable significant acceleration.
After testing the new implementation with non-redundant data sets and various parameters, creating a kernel matrix is now 5 times faster than before on average and reaches a maximum acceleration of 14-fold. Predicting the class of a query is sometimes over 200 times faster with a 66-fold acceleration on average. As an example, predicting a single query with a model consisting of 1,000 SVMs and 100,000 support vectors took 5 hours before; the same task now takes 4 minutes. All of these accelerations were achieved entirely without parallelization, but we also added the feature to parallelize kernel matrix computations on top. (None of our changes affects the output.)
We have integrated all our solutions into a single program that we distribute freely (academic license). It can create or apply profile kernel based models in a single run without requiring SVM or machine learning knowledge from the user. As a Debian package, it reaches high standards for software quality, from source code layout to versioning and documentation. It is package-manager compatible with all Debian-based Linux distributions (Ubuntu, Mint, …) and comes with a make-based installation for all other systems.
Installation instructions can be found at Bugs and other issues may be reported at
N034 - Scalability of Alignment Quality with Number of Sequences
Short Abstract: Multiple Sequence Alignments (MSAs) of >100,000 sequences are getting
commonplace. At present, there are no systematic analyses concerning
the scalability of the alignment quality as the number of aligned
sequences is increased. We benchmarked 20 widely used MSA packages
using protein families with known structures. We found that the
accuracy of alignments decreases as more sequences are added
indiscriminately; this is true for all packages and large numbers of
sequences. For small numbers of carefully selected sequences a modest
improvement is possible. This deterioration is mostly
due to 'attrition' during the profile alignment stage rather than
problems in guide-tree construction. This effect can be attenuated
through iteration or external profile alignment. This suggests that
the availability of high quality curated alignments will have to
complement algorithmic and/or software developments in the long-term.
N035 - aLib: A pipeline for the processing of high-throughput ancient DNA data
Short Abstract: Advances in molecular techniques, combined with the advent of high-throughput sequencing technologies, have made the sequencing of ancient DNA (aDNA) possible. The DNA extracted from ancient sources has a number of properties that distinguish it from modern DNA: the molecules are short and carry chemical damage, and often exist in a background of microbial contamination which complicates their identification. The large volumes of sequence data produced for whole genome sequencing projects from ancient samples necessitate the development of software tools that quickly and accurately perform the processing from raw intensities to accurate alignments of the short reads generated from aDNA samples.

We present aLib, a pipeline for the initial processing of both aDNA and modern DNA sequencing data from the Illumina platform. The pipeline includes base-calling, adapter sequence removal, sequence merging, identification of adapter chimeras, demultiplexing, flagging of low-quality data and quality assurance. Sequencing runs are typically base-called using freeIbis, which produces well-calibrated quality scores. The original molecule is then reconstructed based on the presence of adapter sequences and the overlap between corresponding read pairs. Barcoded reads are then demultiplexed using a likelihood-based approach, and a read group assignment quality score (RGQ) is computed. A filtering heuristic allows the flagging of sequences with a high probability of low-quality mismatch bases to the reference. Finally, different quality metrics are computed to evaluate the success of the sequencing run. Our pipeline is free, open-source and released under the GPL.
N036 - Transimulation - protein biosynthesis web service
Short Abstract: Due to experimental difficulties, the process of translation is poorly characterized at the level of individual genes. To overcome this problem, we developed Transimulation - a web service measuring and simulating translational activity of individual genes in three model organisms: Escherichia coli, Saccharomyces cerevisiae, and Homo sapiens. The simulation is based on a computational model of translation developed previously and data sets from several experiments that served as an input. It provides quantitative information on mean translation initiation and elongation times (expressed in SI units), and approximates the number of proteins produced per transcript during its lifespan. It also quantifies the number of ribosomes that typically occupy a transcript during translation, and simulates their average propagation on an mRNA molecule. The simulation of ribosomes' movement is interactive and allows modifying coding sequence on the fly. It also enables uploading any coding sequence and simulating its translation in one of the three model organisms, provided that translation initiation time is given. In such a case, ribosomes propagate according to mean codon elongation times of the host organism and its growth conditions, which may prove useful for heterologous expression of genes.
N037 - GPCRpipe: A pipeline for the detection of G-protein coupled receptors in proteomes
Short Abstract: G-protein coupled receptors (GPCRs) form the largest and most diverse superfamily of transmembrane receptors in eukaryotic cells. GPCRs play very important roles in health and disease and are main targets of drug development. Although GPCRs share a common topology, consisting of 7 transmembrane α-helices and an extracellular N-terminus, they show considerable diversity at the sequence level and are divided into families. This lack of sequence similarity makes their detection in proteomes difficult, especially the finding of novel members of the GPCR superfamily. In this work, we developed a pipeline for the accurate detection of GPCRs in proteomes.
GPCRpipe consists of two layers: 1. The GPCR detection layer, consisting of (a) an HMM especially designed for the detection of GPCRs and (b) a library of 35 GPCR-specific Pfam pHMMs, and 2. The annotation layer, offering information regarding the family (Pfam profiles), the topology (SignalP 4.0, HMM and Pfam profiles) and the coupling specificity (PRED-COUPLE2) of each receptor, using pre-existing tools.
GPCRpipe is a reliable method for the discrimination of GPCRs in proteomes and, compared to other available algorithms, shows comparable or better performance. GPCRpipe will become a useful tool in the hands of researchers working with this important superfamily of eukaryotic receptors and will also assist the annotation efforts of newly assembled proteomes.
This work was funded by the SYNERGASIA 2009 PROGRAMME, which is co-funded by the European Regional Development Fund and National resources (Project Code 09SYN-13-999), General Secretariat for Research and Technology of the Greek Ministry of Education and Religious Affairs, Culture and Sports.
N038 - PTree: Pattern-based, Stochastic Search for Maximum Parsimony Phylogenies.
Short Abstract: Phylogenetic reconstruction is vital to analyzing the evolutionary relationship of genes within and across populations of different species. Nowadays, with next generation sequencing technologies producing sets comprising thousands of sequences, robust identification of the tree topology, which is optimal according to standard criteria such as maximum parsimony, maximum likelihood or posterior probability, with phylogenetic inference methods is a computationally very demanding task. Here, we describe a stochastic search method for a maximum parsimony tree, implemented in a software package we named PTree. Our method is based on a new pattern-based technique that enables us to infer intermediate sequences efficiently where the incorporation of these sequences in the current tree topology yields a phylogenetic tree with a lower cost. Evaluation across multiple datasets showed that our method is comparable to the algorithms implemented in PAUP* or TNT, which are widely used by the bioinformatics community, in terms of topological accuracy and runtime. We show that our method can process large-scale datasets of 1,000-8,000 sequences. We believe that our novel pattern-based method enriches the current set of tools and methods for phylogenetic tree inference.
N039 - Give it AGO! MicroRNA-argonaute sorting in Arabidopsis thaliana revisited
Short Abstract: The specific recognition of miRNAs by Argonaute (AGO) proteins, the effector proteins of the RNA-induced silencing complex, constitutes the final step of the biogenesis of miRNAs and is crucial for their target interaction.
In the genome of Arabidopsis thaliana (Ath), 10 different AGO proteins are encoded and the sorting decision, which miRNA associates with which AGO protein, was reported to depend exclusively on the identity of the 5'-sequence position of mature miRNAs. Hence, with only four different bases possible, a 5'-position-only sorting signal would not suffice to specifically target all 10 different AGOs individually or would suggest redundant AGO action or yet unidentified sorting signals.

We analyzed a dataset comprising 117 Ath-miRNAs with clear sorting preference to either AGO1, AGO2, or AGO5 as identified in co-immunoprecipitation experiments combined with sequencing.
While mutual information analysis did not identify any other single position but the 5'-nucleotide to be informative for the sorting at sufficient statistical significance, significantly better than random classification results using Random Forests suggest that additional positions and combinations thereof also carry information with regard to the AGO sorting.
Furthermore, uracil bases at defined positions appear to be important for the sorting to AGO2 and AGO5, in particular. No predictive value was associated with miRNA length or base pair binding pattern in the miRNA:miRNA* duplex.
From inspecting available AGO gene expression data in Arabidopsis, we conclude that the temporal and spatial expression profile may also contribute to the fine-tuning of miRNA sorting and function.
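The mutual information analysis mentioned above can be sketched as follows: for a given sequence position, compute the mutual information (in bits) between the nucleotide found at that position and the AGO protein a miRNA is sorted to. This is a minimal illustration, not the authors' code; the function names are hypothetical, the toy sequences are invented, and no significance testing is included.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information in bits between two equal-length label sequences."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # I(X;Y) = sum p(x,y) * log2( p(x,y) / (p(x) p(y)) )
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def position_information(mirnas, ago_labels, position):
    """Mutual information between the base at a 0-based position and the AGO label."""
    bases = [seq[position] for seq in mirnas]
    return mutual_information(bases, ago_labels)


if __name__ == "__main__":
    # Toy data (invented): the 5' base fully determines the label here.
    seqs = ["UGAC", "UCAG", "AGAC", "ACAG"]
    labels = ["AGO1", "AGO1", "AGO2", "AGO2"]
    print(position_information(seqs, labels, 0))
    print(position_information(seqs, labels, 1))
```

In the toy data the first position carries one full bit about the label, while the second position carries none.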
N040 - RSVSim: an R/Bioconductor package for the simulation of structural variations
Short Abstract: The simulation of structural variations (SVs) is a powerful, quick and inexpensive approach to assess the performance and correctness of algorithms dealing with annotation, visualization, comparison and, most importantly, detection of SVs from next generation sequencing data. It can generate a base exact ground truth within a predefined and known setting. Furthermore, a comprehensive simulation of different SV types of various sizes combined with a variety of read simulations with different numbers of reads, insert-sizes or read lengths can give valuable information for the design of sequencing experiments.
We present RSVSim, which covers the simulation of deletions, inversions, insertions, tandem duplications and translocations within the hg19 or any other genome available as FASTA file. It uses publicly available data from the DGV and incorporates knowledge about SV formation around repeats and regions of high homology to realistically model the size and breakpoints for each type of SV individually. Hence, breakpoints can be correlated to LINEs, SINEs, Mini-/Microsatellites or segmental duplications and may have additional smaller mutations such as indels and SNPs nearby.
The user can submit their own coordinates to implement a fixed set of SVs, subset the reference genome (e.g. the exome only), or generate heterozygous SVs or genomes for paired samples, such as healthy/diseased.
The output is a FASTA file of the modified reference genome, suitable for subsequent read simulation, and the set of simulated SVs including the breakpoint sequences.
RSVSim is implemented in R and available through Bioconductor.
N041 - RAPL - A tool for the computational analysis of deep-sequencing based transcriptome data
Short Abstract: RNA-Seq - the qualitative and quantitative examination of RNA molecules by massively parallel sequencing of cDNA libraries - is a potent way to perform transcriptome analyses at single-nucleotide-resolution and with a high dynamic range. In order to extract information from the raw RNA-Seq data, several steps, which in part can be computationally intensive, have to be conducted. RAPL (RNA-Seq Analysis PipeLine) covers the crucial steps of such analyses and combines them into an easy-to-use Python tool with a consistent command line interface. It performs clipping and filtering of raw cDNA reads, aligning of them to reference sequences, coverage calculation, gene based quantification, and comparison of expression levels. Moreover, it provides several statistics about the mapping efficiency and generates files for visualization of the results in a genome browser. The work flow of RAPL is highly configurable in order to adapt it to the specific needs of the user. To leverage the full power of modern computers, most parts of RAPL offer parallel data processing. We have successfully used RAPL for the analyses of whole-transcriptome data from pro- and eukaryotes, RNAs isolated from co-immunoprecipitation experiments as well as other subclasses of RNA species. RAPL is publicly available under the ISC open source license.
N042 - High-throughput sequencing reveals influence of miRNA isoforms on the outcome of qPCR validation
Short Abstract: Context: Small RNA expression profiling platforms differ significantly in their sensitivity of detection for small RNA isoforms. As a consequence, results obtained from high-throughput sequencing (HTS) frequently fail to be reproduced by quantitative polymerase chain reaction (qPCR).
The aim of the study was to design and implement software that indicates and evaluates miRNA isoforms potentially interfering with the main miRNA isoform validation.
Material: 20 follicular thyroid tumor samples were analyzed with Illumina hiScan sequencing. Three miRNAs that statistically differentiate malignant from benign tumors were selected from reads-per-million and DESeq normalized data. The identified malignancy markers were tested with isoform-specific (custom) and standard Qiagen qPCR. Validation was carried out in the 20-sample library (method confirmation) and in 84 independent samples.
Results: The implemented R/Bioconductor software, miRNA FMG, allowed us to identify interfering isoforms that perturbed standard qPCR but did not influence custom validation. Furthermore, the software includes miRNA seed handling, t-test p-values, false discovery rates, and mean- and median-based fold changes for isoform-specific analysis.
Conclusion: The qPCR technique is limited in the validation of small RNA expression. Proper analysis combined with the FMG software leads to a proper design of validation experiments for follicular thyroid tumor differentiation and other miRNA HTS studies.

Funding: FNP MPD Program “Molecular Genomics, Transcriptomics and Bioinformatics in Cancer” (TS, BW)
N043 - On the validity of alignment free distance for DNA reads comparison
Short Abstract: Background and Methods
We propose a method for the analysis of next generation sequencing read pairs based on alignment free distance. The similarity of two reads is assessed by comparing the frequencies of their substrings (k-mers). We compare alignment free distance with the Needleman-Wunsch edit distance and the quality of the BLAST alignment. Our comparison is based on a very simple assumption: the most correct distance is that obtained by knowing in advance the reference sequence that we are trying to align. We compute the overlap between reads that is obtained once the reads have been aligned on the original DNA sequence, and use that as a reference; then, we verify how well the alignment free and the alignment based distances are able to reproduce this ideal distance. The capability of correctly reproducing this ideal distance is evaluated over samples of read pairs from Yeast, Escherichia coli, and Human genomic sequences. Comparisons are based on the correctness of threshold predictors and are measured and cross-validated over different samples from the same sets of reads.
We show that, for the considered sequences, the proposed alignment free distance performs as well as, or better than, the more time-consuming distances that require the alignment of the reads. This assessment is based on the fact that the distance predictors based on alignment free distances show higher prediction precision on both training and test sets.
We present computational results that show the efficacy of the alignment free distance in estimating a good read-to-read distance measure.
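The k-mer frequency comparison described above can be sketched in a few lines. This is only an illustration: the abstract does not say which distance is applied to the frequency vectors, so the Euclidean distance, the default k = 3 and the function names below are all assumptions.

```python
from collections import Counter
from math import sqrt


def kmer_frequencies(read: str, k: int) -> dict:
    """Return the relative frequency of every k-mer that occurs in the read."""
    total = len(read) - k + 1
    if total <= 0:
        return {}
    counts = Counter(read[i:i + k] for i in range(total))
    return {kmer: n / total for kmer, n in counts.items()}


def alignment_free_distance(read_a: str, read_b: str, k: int = 3) -> float:
    """Euclidean distance between the k-mer frequency vectors of two reads.

    The choice of Euclidean distance is an assumption made for this sketch.
    """
    freq_a = kmer_frequencies(read_a, k)
    freq_b = kmer_frequencies(read_b, k)
    kmers = set(freq_a) | set(freq_b)
    return sqrt(sum((freq_a.get(x, 0.0) - freq_b.get(x, 0.0)) ** 2 for x in kmers))
```

No alignment is computed at any point, which is why such a distance is cheap compared with Needleman-Wunsch or BLAST.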
N044 - PARSEC: a new web platform for the localization and characterization of genomic sites in complete eukaryotic genomes
Short Abstract: We present PARSEC (PAtteRn Search and Contextualization), a new open source platform for guided discovery that allows localization and biological characterization of short genomic sites in complete eukaryotic genomes. PARSEC can search for a genomic sequence or a degenerate pattern with a specified number of mismatches. The set of genomic sites can then be characterized in terms of (i) conservation in model organisms, (ii) genetic context (proximity to genes) and (iii) function of neighboring genes. These modules allow the user to explore, visualize, filter and extract biological knowledge from a set of short genomic regions such as transcription factor binding sites.
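To make the idea of searching for a degenerate pattern with a bounded number of mismatches concrete, here is a minimal sketch. It is not PARSEC's implementation: it is a naive scan of the forward strand only, and the function name and return format are assumptions. The IUPAC nucleotide codes themselves are standard.

```python
# Standard IUPAC nucleotide codes: each symbol stands for a set of bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}


def find_sites(genome: str, pattern: str, max_mismatches: int = 0) -> list:
    """Return (start, mismatches) for each window matching the degenerate pattern."""
    genome = genome.upper()
    pattern = pattern.upper()
    hits = []
    for start in range(len(genome) - len(pattern) + 1):
        mismatches = 0
        for offset, symbol in enumerate(pattern):
            if genome[start + offset] not in IUPAC[symbol]:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # too many mismatches, give up on this window
        else:
            # The inner loop finished without exceeding the mismatch budget.
            hits.append((start, mismatches))
    return hits
```

For example, `find_sites("ACGTACGA", "ACGW")` reports both windows that start with ACG followed by A or T. A genome-scale tool would use an index rather than this quadratic scan.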
N045 - Sequencing degraded RNA; consequences and solutions
Short Abstract: RNA sequencing has become widely used in gene expression profiling experiments. Prior to any RNA sequencing experiment the quality of the RNA must be measured to assess whether or not it can be used for further downstream analysis. The RNA integrity number (RIN) is a scale used to measure the quality of RNA that runs from 1 (completely degraded) to 10 (intact). Ideally, samples with high RIN (>8) are used in RNA sequencing experiments. RNA, however, is a fragile molecule which is susceptible to degradation and obtaining high quality RNA is often hard, or even impossible, when extracting RNA from certain clinical tissues. Thus, sometimes, working with low quality RNA is the only option the researcher has.

Here we investigate what effects the RIN has on RNA sequencing and suggest ways to handle low quality RNA data. Using a model cell line we generated and RNA-sequenced samples with varying RINs and illustrate what effect the RIN has on the basic procedure of RNA sequencing; both quality aspects and differential expression. We show that the RIN has systematic effects on duplicate reads, gene coverage and false positives in differential expression. We introduce a novel approach, 3'TagCounting (3TC), to estimate differential expression for samples with low RIN. We show that using the 3TC method in differential expression analysis significantly reduces false positives when comparing samples with different RIN, while retaining a reasonable sensitivity.
N046 - Contaminator – detect contaminating sequences in high throughput sequencing data
Short Abstract: One of the challenges in the analysis of data from next-generation instruments is sample contamination from non-sample sources like bacteria or viruses. Contaminator is a new pipeline for the analysis of sequence reads that could not be mapped to the corresponding genome. It can be used to find out whether there was substantial contamination in your sequencing experiment and which organism the reads come from.

The input for Contaminator is a BAM file, containing both mapped and unmapped sequencing reads. In a first step, Contaminator re-maps a random subset of the unmapped reads (1,000,000 by default) with Bowtie against the corresponding genome, using less restrictive parameter settings. In a second step, Contaminator randomly selects a certain number of reads (5,000 by default) to Blast them against genomic sequences from Refseq. The resulting sequence hits are sorted by organism and clustered into different taxonomic groups. The hits are classified either as "unique", if there is exactly one blast hit for a read fulfilling the threshold parameters, or as "ambiguous", if there are multiple hits with different organisms.
The output of Contaminator consists of four lists showing the blast result; a "unique" one and an "ambiguous" one, both for the organism and for the taxonomic group. The results are sorted by the number of hits against organism or group.
As examples, contamination found in human and mouse sequencing data will be shown.
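The "unique" versus "ambiguous" classification can be illustrated with a short sketch. It assumes the BLAST results have already been filtered by the threshold parameters and reduced to (read, organism) pairs; the function name and this input format are hypothetical. The abstract does not say how a read with several hits to the same organism is handled, so this sketch leaves such reads out of both lists.

```python
from collections import Counter, defaultdict


def classify_hits(hits):
    """Count 'unique' and 'ambiguous' reads per organism.

    `hits` is an iterable of (read_id, organism) pairs that already passed
    the threshold parameters (an assumed, simplified input format).
    """
    organisms_per_read = defaultdict(list)
    for read_id, organism in hits:
        organisms_per_read[read_id].append(organism)

    unique, ambiguous = Counter(), Counter()
    for organisms in organisms_per_read.values():
        if len(organisms) == 1:
            # Exactly one hit for this read.
            unique[organisms[0]] += 1
        elif len(set(organisms)) > 1:
            # Several hits against different organisms.
            for organism in set(organisms):
                ambiguous[organism] += 1
    # Sort each list by the number of hits, highest first.
    return unique.most_common(), ambiguous.most_common()
```

The same counting could be repeated with taxonomic groups instead of organisms to obtain the other two lists.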
N047 - Clustal Omega
Short Abstract: Clustal Omega

Multiple Sequence Alignments (MSA) are used to take a set of related DNA or protein sequences and line them up so as to make them easy to compare to each other. Most MSAs are made using a range of related heuristics that involve clustering the sequences and building an alignment that follows the clusters. These methods have served us well for the past 20 years but are now starting to creak. This poster describes a new program called Clustal Omega which can make alignments of any number of sequences. It gives good quality alignments in reasonable times and has extensive features for adding new sequences to or for exploiting information in existing alignments. It is available for download from in a command-line driven format (Linux style) for proteins only. It is also available for on-line use from the EBI and from Galaxy.
N048 - Unifying Alignment Algorithms Using Generic Meta-Programming
Short Abstract: SeqAn, an open source C++ software library, provides efficient algorithms and data structures for analyzing biological data. It supplies an alignment module supporting all important pairwise alignment algorithms in an easy to use interface. For software libraries, like SeqAn, managing those algorithms in a stand-alone fashion may become a major drawback in terms of maintainability and extendibility, possibly leading to unusable code.
We remodeled the Dynamic Programming (DP) approach so that all standard alignment algorithms use one unified DP algorithm. We first identified dependencies between alignment features, e.g. gap functions, recursion formulas, or different traceback strategies. Subsequently, we developed a generic meta model that configures the desired alignment strategy based on a given alignment profile. The result is a model which virtually splits the DP matrix into smaller sections, for which the identified properties can be configured independently at compile time.
Using this technique, we condensed 36 alignment algorithms and traceback implementations into one procedure, while increasing the number of possible configurations. The modularized structure facilitates high extensibility for existing alignments. For example, it is now possible to add "single instruction on multiple data" (SIMD) parallelization, based on the extended instruction sets Streaming SIMD Extensions (SSE) and Advanced Vector Extensions (AVX) available on modern Intel processors, to all alignment algorithms by making only minor modifications to the code. In addition, the overall runtime could be decreased, making the new SeqAn alignment module potentially faster than other solutions.
N049 - Experimental Design for Multiple Group RNA-Seq Data
Short Abstract: Next-generation sequencing techniques have become highly popular in recent years for characterizing the genome at base pair resolution. In particular, RNA-Seq experiments enable us to detect genes that are differentially expressed between different groups of samples, such as tumors at different stages. While a large number of measured samples for each of the considered groups is desirable for obtaining reliable results, the feasible number of measured samples is often limited by cost.
Thus, approaches for optimal allocation of measurements are needed. We propose an approach that starts with a small number of samples per group and indicates additional experiments for specific groups such that the true positive rate for finding differentially expressed genes is maximized. For focusing the experiment design on genes of interest, we explore the usage of clustering algorithms, leading to different experiments being optimal for different expression profiles of genes.
We present results from a simulation study, based on real RNA-Seq data from the Cancer Genome Atlas, that provide guidance on how to use clustering for identifying additional candidates for measurement and, more generally, for improving experiment design.
N050 - The Impact of Guide-Tree Topology on Multiple Sequence Alignment Quality and Speed
Short Abstract: Progressive multiple sequence alignment is a two-stage process: the construction of a guide tree based on a similarity measure between the sequences to be aligned, and the use of this guide tree to order the pairwise alignments of the sequences. While numerous approaches have been proposed to construct guide trees, little attention has been paid to the impact the topology of the guide tree has on an alignment.

We examine how different guide tree topologies affect both the speed and quality of the alignments produced by the main Multiple Sequence Alignment programs. We construct guide trees of varying degree of imbalance, and find that the quality of the output from all alignment programs is affected by this. In addition, the response time of certain programs scales unfavourably with the degree of imbalance while others are largely unaffected by it.

The findings of our study will have consequences for approaches adopted for constructing guide trees.
N051 - SeqAn - An efficient C++ template library for sequence analysis
Short Abstract: We present SeqAn, a large open source C++ library of efficient algorithms and data structures for the analysis of sequences, with a focus on biological data. Our library applies a unique generic design that guarantees high performance, generality, extensibility, and easy integration with other libraries. It is platform independent, easy to use and offers numerous alignment algorithms and a wide range of data structures, for instance journaled strings, or indices such as the enhanced suffix array index or the FM index. Further, SeqAn provides functionality to efficiently process gigabytes of data via memory mapped strings, and its file I/O supports a variety of commonly used formats, for example Fasta/Fastq, SAM/BAM and Gff/Gtf. For ease of use, tutorials provide usage examples for all components of the library. These tutorials also cover the setup of different integrated development environments, e.g. Eclipse, Xcode or Visual Studio, as well as the development of your own applications using SeqAn. Further, developer support is available for questions and difficulties. In addition, applications developed in SeqAn can be used in workflow engines, such as Knime, via an automatically generated XML description. Existing SeqAn applications, such as Stellar, Mason, RazerS 3 or Masai, show the usability of the library, which will be further extended by adapting SeqAn to run in parallel on clusters, on graphics cards and on other special hardware.
N052 - Neutral evolution of duplicated DNA: An evolutionary stick-breaking process causes scale-invariant behaviour
Short Abstract: Recently, an enrichment of identical matching sequences has been found in many eukaryotic genomes. Their length distribution exhibits a power law tail raising the question of what evolutionary mechanism or functional constraints would be able to shape this distribution. Here we introduce a simple and evolutionarily neutral model, which involves only point mutations and segmental duplications, and produces the same statistical features as observed for genomic data. Further, we extend a mathematical model for random stick breaking to analytically show that the exponent of the power law tail is -3 and universal as it does not depend on microscopic details of the model. Our model allows us to estimate the amount of segmental duplications in relation to point mutations. Fascinatingly, these two processes are balanced in the human lineage. We will discuss the evolutionary consequences of this finding.
N053 - Alternative splicing impact on long non-coding RNAs
Short Abstract: Notwithstanding the enormous interest recently raised in understanding long noncoding RNAs (lncRNAs) functions and biology, still the current knowledge of these important regulators of many cellular processes is remarkably incomplete. LncRNAs, which show heterogeneous characteristics regarding size, gene structure and expression, generally lack a thorough functional characterization. Arguably, one of the most puzzling features of lncRNAs is that they often present variants obtained by alternative splicing. We aimed at investigating how alternative splicing affects the lncRNA molecules, characterizing lncRNA alternative exons by the presence of known RNA functional motifs, protein-RNA binding sites, global folding, local stable secondary structures, and evolutionary conservation. Statistical models of lncRNA splicing were employed to assess correlation between lncRNA features and their splicing-mediated inclusion in mature transcripts. Splicing variant-specific expression patterns, as detected by next generation transcriptome sequencing, offered a detailed picture describing the cellular usage of different lncRNA variants carrying different sets of features, providing a comprehensive overview on how and why these molecules are shaped by splicing and offering clues about their evolution.
N054 - Sequence assembly and variation calling using multiple-dimension de Bruijn graphs
Short Abstract: Recent developments in sequencing technologies have brought a renewed impetus to the development of bioinformatics tools for sequence processing and analysis. Most of the current algorithms for de novo genome assembly are based on de Bruijn graphs, which provide an effective framework for aggregating next generation sequencing (NGS) data into a convenient structure. De Bruijn graphs, however, introduce an artificial parameter that can greatly affect the results: the dimension k giving rise to k-mer building blocks. We report on the development of a novel assembly algorithm with a new data structure designed to overcome some of the limitations of a de Bruijn graph approach with a single fixed k-mer size and to enable higher quality NGS data processing. Our approach structurally combines de Bruijn graphs for all possible dimensions k in one supergraph, leading to a flexible graph dimension. The algorithm, called StarK, is designed in such a way that it allows the assembler to dynamically adjust the de Bruijn graph dimension at any given nucleotide position. In addition to flexible k-mer lengths, the structure allows for simultaneous assembly of a consensus sequence and mutations/haplotypes directly from reads. The StarK graph uses localised coverage differences to guide the generation of connected subgraphs. This allows higher resolution of genomic differences and helps differentiate errors from potential variants within the sequencing sample.
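The sensitivity to k mentioned above is easy to see on a toy example. The sketch below builds an ordinary fixed-k de Bruijn graph; it is not the StarK data structure, and the function names are illustrative assumptions.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer node to the set of (k-1)-mer nodes it points to."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer is an edge from its prefix to its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


def branching_nodes(graph):
    """Nodes with more than one successor, i.e. points of ambiguity."""
    return sorted(node for node, successors in graph.items() if len(successors) > 1)
```

For the single read `ACGTACGA`, the graph with k = 3 has a branching node (`CG` is followed by both `GT` and `GA`), whereas the graph with k = 5 is a simple path. The same data therefore yields an ambiguous or an unambiguous structure depending only on k.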
N055 - GPU Technology for Speeding up Sequence Alignment Algorithms
Short Abstract: The local sequence alignment algorithm finds alignments of common substrings between two biological sequences and is used to determine the similarity between sequences. Alignment scores for these common substrings are determined using a substitution matrix such as BLOSUM50. The algorithm was developed to extract, from a biological database containing thousands of DNA/protein sequences, the sequence most similar to a query sequence based on the maximum score. We extend this algorithm so that it finds the two nearest sequences (ancestors) for a query sequence (offspring). First, it determines the nearest biological sequences, i.e. those with the highest alignment scores against the query, based on the BLOSUM50 substitution matrix for protein sequences. Second, from the first 100 of these sequences it finds the two that have the longest common substring with the query sequence. The substitution matrix used in this second step is called the coincidence matrix; it gives a score of +1 for identical nucleotides and -1 for different nucleotides. For a database containing 300,000 sequences and a query of at most 1024 nucleotides, the execution time of the program is 23000 seconds. We therefore use a Graphical Processing Unit (GPU) to accelerate the program by executing it in parallel. For the same database and a query of 1024 nucleotides, the execution time of the program is 189 seconds, so the GPU runs the program 140 times faster than the CPU.
N056 - Developing Sequence Alignment Algorithm for Tracing Genes
Short Abstract: Genes acquire many changes and modifications during evolution, producing many orthologs and paralogs. One of the main challenges for biologists is tracing back the origin of modified genes in the genomes of various organisms. Tracing gene modifications and developments from ancestors to offspring is, however, a massive task for biologists. Computer programs can help biologists track down those changes through sequence alignment. In this paper, we present an algorithm, called Gene Tracer and based on sequence alignment, that, given two ancestor sequences and their offspring sequence, tracks down gene modifications in the ancestor sequences and finds the related parts of each ancestor in the offspring sequence.
N057 - AfterGenBank: A Web Portal and Repository for Easy Retrieval of the Sequence Features and Annotations Derived from GenBank
Short Abstract: With recent advances in sequencing technologies, the sequences accumulated in GenBank are increasing faster than ever. This large volume of data is an important resource for the biomedical community. The feature table of each sequence entry describes the roles and locations of higher-order sequence domains and elements within the genome of an organism, such as mRNA, gene, transcript, exon, intron, ncRNA, 5’UTR, etc. However, there is no simple and flexible way to retrieve this information from the huge GenBank flat files by specified feature terms and taxonomic category.
By integrating big data, a system framework, self-developed programs/scripts and an intuitive interface, we constructed AfterGenBank, a web database with an automatic update mechanism, to retrieve annotated sequences with specific features from GenBank. Our team developed two major parallel programs, Features Extractor and Sequences Conjunctor, to extract and combine the meaningful fragments from a hundred million sequence records.
Presently, sixty-one feature terms from twelve sequence divisions (Bacterial, Plant, Primate, etc.) are available. Users can employ full-text searches or sequence-similarity searches to identify sequences with specific features and retrieve the results in FASTA or CSV format via the intuitive web interface. Depending on researchers’ interests, the datasets fetched from AfterGenBank can be combined with existing tools for various purposes, such as summarizing consensus sequence patterns, performing phylogenetic analysis, designing sequence-specific primers/probes, or re-formatting into a novel database of specific features. AfterGenBank and its related analysis services are freely accessible at
N058 - A new method for spectral characterization of protein families from sequence information using Fourier Transform
Short Abstract: The Resonant recognition model (RRM) provides a framework to identify, from sequence information, original features extracted through their numerical encoding and subsequent treatment by Fourier transform. These have been linked to functional properties of protein families through the calculation of family-based consensus spectra.
We provide a new method for calculating and characterizing the consensus spectra in the RRM method. Our methodology proceeds by calculating the geometric mean of the energy spectra of sequences from a functionally characterized family instead of calculating a cross spectrum by simply multiplying the spectra. The amplitude of the recurrent peaks is, in our method, independent of the number of sequences in the initial dataset. Hence, to evaluate the significance of a peak, we propose that the signal-to-noise (s/n) cut-off value used to determine characteristic peaks of the consensus spectrum be based on the standard deviation of the s/n values. To determine the statistical significance of each of these peaks, we set up a bootstrap strategy consisting of shuffling the amino acid order for each sequence of the family. We illustrate this new framework on the analysis of a highly divergent odorant binding protein family and of a more conserved sialidase enzyme family.
N059 - Mapping the UniProt human reference proteome to the reference genome and variation data
Short Abstract: UniProt has annotated the complete Homo sapiens proteome and approximately 20,000 protein coding genes are represented by a canonical protein sequence in UniProtKB/Swiss-Prot. Most of these protein sequences are now mapped to the reference genome assembly produced by the international Genome Reference Consortium (GRC). This mapping was possible through a process of aligning all human protein sequences in the UniProt Knowledgebase (UniProtKB) to the protein translations in Ensembl, based on 100% amino acid identity over the entire sequence. Where protein sequences were not found in UniProtKB, new records were added to the UniProtKB reference proteome data set.

UniProt uses the literature to manually annotate variant sequences with known functional consequences. Studies on sequence variation are increasing; projects like 1000 Genomes and the Cancer Genome Project are generating a vast amount of variant information that is, or will be, stored by Ensembl variation in their databases. This exponential growth of variation information makes it infeasible to rely on manual assessment alone for adding non-synonymous variants to UniProtKB. With human reference protein sequences mapped to the reference genome assembly, UniProt has developed a pipeline to import high-quality non-synonymous single amino acid variants from Ensembl variation. This poster describes the mapping procedure and the 389,935 human single amino acid variants that have been identified for import into the UniProt human reference proteome, and highlights the invaluable information that variation data provides to clinical medicine and biological science. UniProt intends to extend this Ensembl variant import pipeline to other species with a complete proteome.
N060 - Global identification of mRNA and small RNA targets of the RNA pyrophosphohydrolase RppH in Helicobacter pylori
Short Abstract: RNA-seq has been revolutionizing transcriptome analyses. Besides annotation of gene boundaries or novel transcripts, RNA-seq can also be utilized for gene-expression profiling. This requires an accurate assessment of transcript-associated reads and normalization among libraries.
The stabilization and degradation of RNAs constitutes an important layer of gene expression control in bacteria. Similar to the protection of eukaryotic mRNAs by the 5’ cap structure, prokaryotic primary transcripts are protected from decay by the triphosphate (PPP) at their 5’ ends. The bacterial Nudix hydrolase RppH has been shown to initiate RNA decay by removing the pyrophosphate from the 5’ end of mRNAs, and thereby generating a 5’-monophosphate. This modification results in an increased susceptibility to endo- or exonucleolytic decay. The Gram-negative Epsilonproteobacterium Helicobacter pylori carries one Nudix enzyme, which is a potential RppH homolog. We have studied the impact of RppH-induced transcript degradation in this major human pathogen on a transcriptome-wide scale by using RNA-seq analysis of wild-type, ∆rppH and complementation strains. Data processing of the different RNA-seq experiments was conducted using our RAPL pipeline and differential gene-expression was analyzed via DESeq.
Analysis of cDNA libraries specific for transcripts bearing a 5’ monophosphate and/or triphosphate facilitated the identification of a multitude of potential RppH targets in H. pylori. For a subset of these candidates, including several mRNAs and small RNAs, further experimental validation showed that their expression levels and transcript stabilities are affected by RppH. These findings suggest that RppH plays a role in posttranscriptional regulation of gene expression in H. pylori.
N061 - Upper Confidence Bounds applied to Trees (UCT) for next generation sequencing reads assembly
Short Abstract: Despite the large number of next generation sequencing assembly algorithms, the problem of constructing an accurate genome assembly remains unsolved. Existing methods fall into two major classes, built upon the overlap-layout-consensus paradigm or the de Bruijn graph approach, respectively. The aim of this work is to propose a new method for short read genome assembly based on a model called Upper Confidence Bounds applied to Trees (UCT). This model is a new artificial intelligence approach for computer game playing. It combines tree search with the Monte Carlo sampling technique, providing a good balance between exploitation and exploration in a large search space. Our method defines the problem of short read assembly as a single-player game and uses the UCT strategy to search for the optimal sequence of moves, i.e. the sequence of reads that leads to winning the game: an accurate assembly. The model was preliminarily tested on simulated sets of short reads obtained from the E. coli genome. The results show that this approach is able to reconstruct the accurate solution; however, it appears to be computationally intensive. Fortunately, this problem may be overcome by convenient parallelization of the UCT model. Thus, scaling the approach to larger, eukaryotic genomes is possible.
N062 - Phi29 and S1 nuclease treatment enhances multiple strand amplification DNA genome assembly
Short Abstract: We report a robust method for optimised Illumina sequencing from template that has been whole genome amplified with phi29. De-branched and S1 nuclease digested Bacillus subtilis WGA product assembled as effectively as genomic DNA after Illumina sequencing, without bias. This method should be of use in forensic analyses as well as for sequencing of difficult-to-culture organisms where obtaining large quantities of template DNA is problematic. We believe this is the first time that a WGA method has been assessed by next generation whole genome sequencing (as opposed to qPCR of selected genes or microarrays) and by analysis of sequencing and assembly efficiency with reference genome comparison software. We also report a method of filtering the genomic data of the fastidious bacterium Pasteuria penetrans from a bacterial metagenomic sample and perform a de novo assembly using a range of freely available algorithms.
N063 - How reliable are indels for phylogeny inference?
Short Abstract: It is often assumed that it is unlikely that the same insertion or deletion (indel) event occurred at the same position in two independent evolutionary lineages, i.e., indel-based inference of phylogeny should be less subject to homoplasy compared to standard inference based on substitution events. Indeed, indels were successfully used to solve debated evolutionary relationships among various taxonomical groups. However, indels are never directly observed but rather inferred from the alignment and thus indel-based inference may be sensitive to alignment errors. We hypothesized that phylogenetic reconstruction would be more accurate if it relied only on a subset of reliable indels instead of the entire indel data.

To test this hypothesis, we first determined the level of agreement among different alignment methods (MAFFT, PRANK, and ClustalW). We show that differences among alignments obtained from various methods suffice to generate different indel-based trees. We next developed a method to quantify the reliability of indel characters by measuring the frequency with which they appear in a set of alternative multiple sequence alignments. Our approach is based on the assumption that indels which are consistently present in most alternative alignments are more reliable than indels that appear only in a small subset of these alignments. Using simulations, we show that our method can significantly increase the fraction of correctly inferred indels. Furthermore, by comparing the reconstruction of a known phylogeny with and without filtering, we show that our method can increase the accuracy of indel-based phylogenetic reconstruction.
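The reliability score described above, the frequency with which an indel appears across alternative alignments, can be sketched as follows. The encoding of an indel character as a hashable tuple, the 0.5 cut-off and the function names are assumptions made for illustration; the abstract does not specify them.

```python
from collections import Counter


def indel_reliability(alignments):
    """Fraction of alternative alignments in which each indel character appears.

    `alignments` is a list with one collection of indel characters per
    alternative alignment; each indel is any hashable description,
    e.g. (sequence, start, end) in this sketch.
    """
    counts = Counter()
    for indels in alignments:
        counts.update(set(indels))  # count each indel once per alignment
    n = len(alignments)
    return {indel: c / n for indel, c in counts.items()}


def reliable_indels(alignments, min_fraction=0.5):
    """Keep only indels present in at least `min_fraction` of the alignments."""
    scores = indel_reliability(alignments)
    return {indel for indel, score in scores.items() if score >= min_fraction}
```

With three alternative alignments, an indel found in all three scores 1.0 and is kept, while one found in a single alignment scores 1/3 and is filtered out at the default cut-off.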
N064 - A comparison of whole-genome sequencing approaches in polyploid wheat
Short Abstract: The challenge of assembling complex plant genomes such as polyploid wheat using short-read whole-genome shotgun (WGS) approaches is well known. Graph-based genome assembly programs are confounded by large genome size, high repeat content and the presence of homoeologous chromosomes. Recent attempts to reduce the complexity of genome sequencing and assembly in hexaploid wheat have exploited flow-cytometry and aneuploid lines to separate individual chromosome arms, followed by whole-chromosome sequencing (WCS) and assembly. Analysis of WCS reads indicates that kmers from the non-repetitive parts of the genome could not easily be distinguished from sequencing errors or kmers from repetitive regions. This is likely to affect the quality of chromosome arm assemblies. WGS was performed to determine whether this property is due to the biological complexity of wheat or due to artifacts introduced during flow-sorting and sequencing.

We generated Illumina sequence reads for the tetraploid wheat Triticum turgidum ssp. durum. The kmers from non-repetitive parts of the genome could easily be identified in spite of the increased complexity of the sample being sequenced. The reads were first assembled using a de Bruijn graph approach. This assembly was analysed using arm-specific data to determine whether the homoeologous genomes assemble separately or generate chimeric contigs. In a second approach, the arm-specific data were then used to assign WGS reads to individual arms. The sets of reads for each arm were assembled separately and compared to the whole genome assembly. A comparison of the properties of assemblies with and without flow-sorted data will be presented.
N065 - Jaccard Index to Compare Position Weight Matrices of Transcription Factor Binding Sites
Short Abstract: Sets of transcription factor binding sites (TFBS) can be represented
as computational models. Positional weight matrices (PWMs) often serve
as such models able to assign quality scores to different DNA
fragments. A given score threshold is selected to distinguish binding
sites from background. Different experimental methods used to study
TFBS exhibit similar but not identical DNA sequence motifs. Similar
binding motifs are also recognized by TFs from the same structural
family. Thus a quantitative measure of similarity between models of
factor binding sites is highly important.

Large difference between two sets of PWM-predicted binding sites does
not directly require large difference between elements of two
corresponding PWMs and vice versa. Moreover, PWM score threshold has
high influence on the degree of similarity between sets of recognized
binding sites. This induces principal limitations for direct methods
based on comparison of matrix elements. Here we suggest a variant of
the Jaccard index to compare similarity of transcription factor
binding sites models, the position weight matrices with the given
score threshold levels. The presented measure defines a metric space
for TFBS models of all widths. MAtrix CompaRisOn by Approximate
P-value Estimation (MACRO-APE) software implements effective algorithm
to compare a pair of PWMs and search a given collection of PWMs for
matrices similar to a given query. Software is freely available at

We show how MACRO-APE can be applied to construct a hierarchical tree
of TFBS models and to show significant differences between existing
PWMs present in public databases.
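
As a rough illustration of the measure described above, the following Python sketch computes a Jaccard index between the sets of words recognized by two thresholded PWMs of equal width. It enumerates all 4^w words by brute force, which is only practical for short motifs and treats every word equally. It is not the MACRO-APE algorithm, and the function names and toy matrices are assumptions made for this example.

```python
from itertools import product

ALPHABET = "ACGT"

def score(pwm, word):
    """Sum the per-position weights of a word; pwm is a list of dicts, one per column."""
    return sum(column[base] for column, base in zip(pwm, word))

def recognized_words(pwm, threshold):
    """Return every word of the PWM's width whose score reaches the threshold."""
    width = len(pwm)
    return {
        "".join(word)
        for word in product(ALPHABET, repeat=width)
        if score(pwm, word) >= threshold
    }

def jaccard_similarity(pwm_a, threshold_a, pwm_b, threshold_b):
    """Jaccard index |A and B| / |A or B| of the two recognized word sets."""
    if len(pwm_a) != len(pwm_b):
        raise ValueError("this sketch only handles PWMs of equal width")
    words_a = recognized_words(pwm_a, threshold_a)
    words_b = recognized_words(pwm_b, threshold_b)
    union = words_a | words_b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(words_a & words_b) / len(union)

# Toy matrices (hypothetical): pwm_a accepts only "AA", pwm_b accepts "AA" and "AC".
pwm_a = [{"A": 1, "C": 0, "G": 0, "T": 0}, {"A": 1, "C": 0, "G": 0, "T": 0}]
pwm_b = [{"A": 1, "C": 0, "G": 0, "T": 0}, {"A": 1, "C": 1, "G": 0, "T": 0}]
similarity = jaccard_similarity(pwm_a, 2, pwm_b, 2)
```

Because the index depends on the thresholds as well as the matrices, the same pair of PWMs can look similar at one threshold and dissimilar at another, which is the point the abstract makes about threshold influence.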
N066 - Finding a needle in a “Hexaploid Haystack”
Short Abstract: The identification of disease resistance genes in crops is of vital importance to ensure food security. We are focused on resistance to yellow rust (YR15) in bread wheat, a hexaploid (2n=6x=42) organism with a highly repetitive genome for which there is no reference sequence yet. Our experiment consists of backcrosses of an isogenic Avocet line with a near-isogenic line of Avocet containing the YR15 resistance gene. Total RNA was extracted from leaves of a number of susceptible and resistant plants. We sequenced RNA-Seq libraries generated from the parental lines and six bulks obtained from the backcrossed progeny: three made of susceptible plants and three of resistant plants. All the libraries were barcoded and sequenced in four Hi-Seq2000 lanes.
Then, we align the RNA-Seq reads to the Unigene set v60, where the three haplotypes are merged in a single consensus gene collection. After cataloguing the differences in the consensus of the parents, we look at the ratio at which each base of the putative SNPs is expressed in each sample, and we compare the ratios of the resistant and the susceptible samples to calculate the Bulk Frequency Ratios (Trick, 2012). This approach allows us to characterise SNPs and potentially other genomic variants that are not accessible to traditional SNP callers, which are based on diploid genomes and assume less variation than is found in grasses.
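
The ratio comparison described above can be sketched in Python as follows. This is a minimal sketch that assumes the Bulk Frequency Ratio is the frequency of the informative base in the resistant bulk divided by its frequency in the susceptible bulk; the exact definition used in the study may differ, and the function names and read counts are hypothetical.

```python
import math

def allele_frequency(allele_reads, total_reads):
    """Fraction of reads at a SNP position that carry the allele of interest."""
    if total_reads <= 0:
        raise ValueError("position has no read coverage")
    return allele_reads / total_reads

def bulk_frequency_ratio(res_allele, res_total, sus_allele, sus_total):
    """Ratio of allele frequencies, resistant bulk over susceptible bulk (assumed definition)."""
    res_freq = allele_frequency(res_allele, res_total)
    sus_freq = allele_frequency(sus_allele, sus_total)
    if sus_freq == 0:
        if res_freq == 0:
            raise ValueError("allele is absent from both bulks")
        return math.inf  # allele seen only in the resistant bulk
    return res_freq / sus_freq

# Hypothetical counts: 80 of 100 reads in the resistant bulk, 20 of 100 in the susceptible bulk.
bfr = bulk_frequency_ratio(80, 100, 20, 100)
```

A ratio far from 1 marks a SNP whose allele is enriched in one bulk, which is what makes it a candidate for linkage to the resistance gene.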
N067 - The New Bit-Parallel Algorithm for the Merged Longest Common Subsequence Problem with Block Constraint
Short Abstract: The problem of finding interleaving relationship between sequences is crucial for comparison of genomic sequences. There are various similarity measures, e.g. the length of the longest common subsequence, sequence alignment, edit distance. With recent research on the whole genome duplication followed by a massive gene loss, a new problem, a merged longest common subsequence and its block variant, has been formulated. It measures the relationship among three sequences to verify if the event (whole genome duplication) occurred. The inputs are three sequences T, A, B, and the requested output is a longest sequence P that is a subsequence of T and can be split into two subsequences P′ and P′′ such that P′ is a subsequence of A and P′′ is a subsequence of B. In the block variant, merging of A and B is only allowed at the additional block boundaries.

To date, two algorithms for the problem have been published: the first employs dynamic programming, and the second is specialised for large alphabets, where the dynamic programming matrix is sparse. We propose the first bit-parallel algorithm for the block variant of the merged longest common subsequence problem. Practical experiments on real and simulated data show that our proposal is 10 to over 100 times faster than existing algorithms.
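
For orientation, the problem definition above (without the block constraint) can be solved by a straightforward dynamic program over prefixes of T, A and B. The Python sketch below shows that baseline recurrence only; it is not the bit-parallel algorithm proposed in the poster, it ignores block boundaries, and the function name is an assumption made for this example.

```python
def merged_lcs_length(t, a, b):
    """Length of the longest P that is a subsequence of t and splits into
    a subsequence of a and a subsequence of b (no block constraint)."""
    m, l = len(a), len(b)
    # prev[j][k] holds the answer for the previous prefix of t, a[:j] and b[:k].
    prev = [[0] * (l + 1) for _ in range(m + 1)]
    for ch in t:
        cur = [[0] * (l + 1) for _ in range(m + 1)]
        for j in range(m + 1):
            for k in range(l + 1):
                best = prev[j][k]  # skip the current character of t
                if j > 0:
                    best = max(best, cur[j - 1][k])  # skip a[j-1]
                    if ch == a[j - 1]:
                        best = max(best, prev[j - 1][k] + 1)  # match t with a
                if k > 0:
                    best = max(best, cur[j][k - 1])  # skip b[k-1]
                    if ch == b[k - 1]:
                        best = max(best, prev[j][k - 1] + 1)  # match t with b
                cur[j][k] = best
        prev = cur
    return prev[m][l]

# "abcd" interleaves "ac" (from A) and "bd" (from B) completely.
length = merged_lcs_length("abcd", "ac", "bd")
```

The table has (|T|+1)(|A|+1)(|B|+1) cells, so the running time is cubic in the sequence lengths. That cost is the motivation for faster approaches such as the bit-parallel one.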

Acknowledgements. This work was supported by the European Union from the European Social Fund (grant agreement number: UDA-POKL.04.01.01-00-106/09) and by Polish National Science Centre upon decision DEC-2011/03/B/ST6/01588.
N068 - MISTIC: a Mutual Information Server to Infer Coevolution
Short Abstract: Identifying functionally important residues of proteins has long been investigated using multiple sequence alignments. Besides studying the conserved residues, another information rich approach is to examine the correlated mutational pattern between columns of the alignment. Mutual information (MI) from information theory can be used to analyze this type of covariation. We have previously developed an MI algorithm that corrects several known biases in the MI signal, such as phylogeny, sequence redundancy and low counts.

Here we present MISTIC, a publicly available Mutual Information Server To Infer Coevolution that brings covariation analysis closer to non-computational biologists with a user-friendly interface and an interactive view of the results.

MISTIC uses a multiple sequence alignment as the basis for calculating mutual information and several residue-based scores, and can also integrate structural data when a PDB structure is provided. The server outputs a graph with several information-related quantities using a circos representation. This offers an integrated view of the information contained in the multiple sequence alignment in terms of mutual information between residue-pairs, sequence conservation and cumulative and proximity MI scores. Furthermore, analysis and characterization of the MI network can be done using an interactive interface which includes network-to-structure mapping and several other network oriented tools.
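
To make the central quantity concrete, the following Python sketch computes plain mutual information (in bits) between two alignment columns. It is a minimal sketch of the uncorrected textbook formula and deliberately omits the phylogeny, redundancy and low-count corrections mentioned above. The function name and the choice to skip gapped pairs are assumptions made for this example, not a description of the MISTIC implementation.

```python
from collections import Counter
from math import log2

def mutual_information(column_i, column_j, gap="-"):
    """Uncorrected mutual information between two alignment columns, in bits."""
    if len(column_i) != len(column_j):
        raise ValueError("columns must come from the same alignment")
    # Keep only sequences that have a residue in both columns.
    pairs = [(x, y) for x, y in zip(column_i, column_j) if x != gap and y != gap]
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)
    marginal_i = Counter(x for x, _ in pairs)
    marginal_j = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        p_x = marginal_i[x] / n
        p_y = marginal_j[y] / n
        mi += p_xy * log2(p_xy / (p_x * p_y))
    return mi

# Perfectly covarying columns: knowing one residue determines the other.
covarying = mutual_information("AACC", "GGTT")
# Independent columns: every residue pair is equally likely.
independent = mutual_information("AACC", "GTGT")
```

Raw values like these are inflated by shared ancestry and small sample sizes, which is why corrected scores are preferred in practice.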

We have developed a web server to study coevolution in protein families using mutual information. The different ways of visualizing the results complement each other well, and the interactive and user-friendly interface makes MISTIC a novel tool among other coevolution web-servers. MISTIC is available at
N069 - Sequence analysis-driven identification and experimental validation of novel selective autophagy receptors in Drosophila melanogaster
Short Abstract: Macroautophagy is a conserved catabolic process in eukaryotes that sequesters and transports cytoplasmic material to the lysosome for degradation. It was initially considered a non-selective bulk degradation process, but recent evidence suggests that autophagy can be selective, being mediated by specific proteins (selective autophagy receptors - SARs) linking specific cargos with the core autophagic machinery. The function of SARs is facilitated by a short linear amino acid motif (LIR/LRS motif), mediating their direct interaction with Atg8 family proteins. A recently described regular expression for the LIR-motif is problematic, in the sense that it cannot identify all proteins currently known to contain a functional LIR-motif.

In this work, our aim is to redefine a selective LIR-motif regular expression for identifying novel SARs in Drosophila melanogaster. Along these lines, we collected known proteins containing the LIR-motif among diverse eukaryotes, and identified their (remote) homologs. We used the CAST algorithm for filtering compositionally biased regions and performed sensitive BLASTP/PSI-BLAST searches against the NCBI non-redundant database, manually validating the resulting alignments for discriminating genuine homologs with a (possibly divergent) LIR-motif. We redefined the LIR-motif based on carefully crafted multiple sequence alignments and scanned the complement of proteins encoded in the Drosophila genome, thus identifying a number of novel candidate SARs. This set of protein sequences is further narrowed down by searching for the coexistence of autophagy-related domains/motifs already present in known LIR-containing proteins. We are currently validating these putative SARs in laboratory settings using cell biological and biochemical approaches.
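
Scanning protein sequences with a motif regular expression can be sketched as below. The pattern used here is the widely cited canonical LIR core, [W/F/Y]-x-x-[L/I/V], chosen only for illustration. It is neither the redefined expression developed in this work nor the recently described one criticised above, and the function name and the example sequence are hypothetical.

```python
import re

# Canonical LIR core motif: aromatic residue, any two residues, then L, I or V.
# The lookahead lets overlapping occurrences be reported as well.
CANONICAL_LIR = re.compile(r"(?=([WFY]..[LIV]))")

def find_lir_candidates(protein_sequence):
    """Return (0-based start, matched tetrapeptide) for every candidate motif."""
    sequence = protein_sequence.upper()
    return [(m.start(), m.group(1)) for m in CANONICAL_LIR.finditer(sequence)]

# Hypothetical sequence containing one candidate motif, "FEVL", at position 4.
candidates = find_lir_candidates("MSDDFEVLAA")
```

A pattern this short matches a large fraction of all proteins by chance, which is why candidate hits need further filtering of the kind described above, such as co-occurring autophagy-related domains.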
N070 - nhmmer : DNA-to-DNA database search in HMMER3.1
Short Abstract: Sequence database searches are an essential part of molecular biology, providing information about the function and evolutionary history of proteins, RNA molecules, and DNA sequence elements. We present a tool, called nhmmer, which applies probabilistic inference methods based on hidden Markov models (HMMs) to the problem of DNA homology search. It is available in the new HMMER3.1 release.

The tool nhmmer has been used to improve sensitivity of search for ancient homologs to conserved non-coding elements, and acts as an acceleration filter to the RNA homology search tool, Infernal. It has also been used, in conjunction with the transposable element (TE) database Dfam, to annotate the TE content of the human genome, increasing by nearly 150Mb the amount of the genome that is reliably annotated as derived from TEs.
N071 - Genomes are surprisingly diverse on the protein level
Short Abstract: Today, the (nearly) complete sequences of the human genome and of numerous model organisms are available, but the corresponding proteomes remain ambiguous. In part this is due to the high number of proteins that have not been experimentally observed, but are predicted from nucleotide sequences. Nevertheless, with this information in hand, large-scale comparative analysis has become feasible.

Here, we present a sequence-level comparison between the human proteome and those of several model organisms. For the comparison, we chose EBI reference proteomes - complete non-redundant proteome sets provided by the "Quest for Orthologs" group - and performed an all-against-all PSI-BLAST search. To assess similarity we used different levels of required alignment length and sequence identity. Even for organisms closely related to human we observe a surprising diversity on the protein level. For example, while mouse has an ortholog for the majority of human proteins, only about 7.5% of the proteins are close to identical.
N072 - A feature analysis for membrane protein target selection in the NYCOMPS structural genomics pipeline
Short Abstract: The New York Consortium on Membrane Protein Structure (NYCOMPS) is one of several structural genomics centers aiming specifically at membrane protein structures as part of the Protein Structure Initiative. Due to the intricate nature of membrane proteins, these centers face large obstacles in the selection of promising targets.

A large percentage of targets selected for the NYCOMPS pipeline cannot be expressed in the quantity needed for structure determination through X-ray crystallography. While methods exist to predict crystallization success for globular proteins, for membrane proteins the first bottleneck lies in the expression.

More than 12,500 targets have already been cloned with comparable parameters in the NYCOMPS pipeline. Tapping into this immense data repository, we identified several features with impact on target expression and are aiming to utilize these findings in a machine learning predictor.
N073 - FACET: a feature-based accuracy-estimation tool for protein multiple sequence alignments
Short Abstract: Selecting an aligner -- and parameter values for the aligner’s scoring function -- to obtain a quality alignment of a specific set of sequences can be challenging. Different aligners and different parameter values can produce vastly different alignments of the same sequences. In principle, a user could simply try various aligners and parameter settings, and choose the one that yields the most accurate alignment -- except that in practice, the accuracy of an alignment cannot be measured (since the correct alignment is not known). We overcome this obstacle by combining efficiently-computable, real-valued features of an alignment into an estimator of its accuracy that is suitable for choosing both aligners and parameter settings.

FACET (“Feature-based ACcuracy EsTimator”) is an easy-to-use, open-source utility for estimating the accuracy of a protein multiple sequence alignment, available at FACET can be readily applied to both parameter advising (choosing good parameter values) and aligner advising (choosing a good aligner). The tool provides optimized default coefficients for its linear estimator that are best on average (coefficients may also be specified manually), and can be run as a stand-alone tool, or included in any pre-existing Java application. The FACET website also provides pre-computed parameter sets (substitution matrices and affine gap penalties) that are optimal for boosting aligner accuracy via parameter advising.

We show experimental results on applying FACET to parameter advising and aligner advising that improves alignment accuracy by as much as 27% on the most challenging sets of sequences.
N074 - Checking your Blind Spots: An Assessment of Genomic Regions with Poor Mappability in Whole Genome and Targeted Sequencing Data
Short Abstract: As genome sequencing exponentially decreases in cost while advancing in throughput and accuracy, many attempts are being made to utilize next-generation sequencing in clinical settings. Such attempts often leverage targeted sequencing approaches to characterize variants in genes that may yield clinically actionable information. The results of sequencing analysis pipelines are often a list of variants that pass quality control steps. However, such results do not typically indicate regions that are missed due to low sequencing coverage. Regions that are not reported in the variant file may refer to either reference sequence or "blind spots" that are not covered well by sequencing reads, which can lead to incorrect interpretation without additional investigation. In a set of 100 samples sequenced using ultra high throughput targeted sequencing (mean target coverage 823x), we observed 1.5-1.9kb of low mappability across a targeted 420kb. We then investigated “blind spot” regions of poor mappability across public and internal samples that have been whole genome or whole exome sequenced (100 exomes captured with Agilent SureSelect Exome v4 Plus, 100 exomes captured with Nimblegen SeqCap EZ v3). As a case study, we characterize variations in mappability regions using the CEPH sample NA12878, and assess whether increased coverage in targeted sequencing can rescue genomic "blind spots". We also propose methods for reporting this information to end users.
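
One simple way to report such regions to end users is to list the intervals whose per-base depth falls below a chosen cutoff. The Python sketch below does this for a list of per-base depths. It is an illustration only, assuming depth is the sole criterion. The function name, the cutoff and the depth values are hypothetical, and the sketch does not reproduce the mappability analysis of the study.

```python
def low_coverage_intervals(depths, min_depth):
    """Return half-open (start, end) intervals where depth is below min_depth."""
    intervals = []
    start = None
    for position, depth in enumerate(depths):
        if depth < min_depth:
            if start is None:
                start = position  # a low-coverage run begins here
        elif start is not None:
            intervals.append((start, position))  # the run ended just before this base
            start = None
    if start is not None:
        intervals.append((start, len(depths)))  # run extends to the end of the target
    return intervals

# Hypothetical per-base depths across a six-base target, with a cutoff of 10 reads.
blind_spots = low_coverage_intervals([30, 30, 2, 1, 30, 0], min_depth=10)
```

Reporting these intervals next to the variant list lets a reader tell a position that matches the reference apart from one that was never adequately covered.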
N075 - miRGator v3.0: a microRNA portal for deep sequencing, expression profiling, and mRNA targeting
Short Abstract: Deep sequencing has become the principal technique for cataloguing the miRNA repertoire and generating expression profiles in an unbiased manner. Here, we describe the miRGator v3.0 update, which compiles the publicly available deep sequencing miRNA data and implements several novel tools to facilitate exploration of massive data. The miR-seq browser lets users examine short read alignments with the secondary structure and read count information available in concurrent windows. Features such as sequence editing, sorting, ordering, and import and export of user data should be of great utility for studying iso-miRs, miRNA editing and modifications. Coexpression analysis of miRNAs and target mRNAs, based on miRNA-seq and RNA-seq data from the same sample, is visualized in heat-map and network views where users can investigate the inverse correlation of gene expression and target relations, compiled from various databases of predicted and validated targets. By keeping datasets and analytic tools up-to-date, miRGator, available at http://mirgator., should continue to serve as an integrated resource for biogenesis and functional investigation of miRNAs.
N076 - COIF: Using assembly metadata to improve prediction and scaffolding of mobile genetic elements in unfinished genomes
Short Abstract: Mobile genetic elements (MGEs) enable the lateral genetic transfer (LGT) of DNA between bacterial cells. LGT contributes to the rapid adaptation of bacteria to different environments and plays an important role in bacterial virulence and antibiotic resistance. MGEs (including plasmids, phage, transposons and genomic islands) make up a significant proportion of intraspecies variation and present a serious informatic challenge due to their repetitive nature, which leads to fragmentation in draft genome assemblies.

We present COIF, a tool that creates and uses a contiguous sequence adjacency graph to improve identification and scaffolding of MGE contigs. Adjacencies are identified using a De Bruijn graph. MGE contigs are predicted by traditional methods (such as sequence annotation). Contigs are then added or removed from the predicted set based on proximity in the graph to other predicted contigs.

Plasmid contigs are usually identified in draft genome data by identifying plasmid specific annotations and by identifying contigs with higher than average coverage. Whereas the former method results in a large number of false negative predictions as plasmid specific genes are only present on a small number of contigs, coverage-based methods are not always feasible for large plasmids that may be low copy number within the genome. We show how using contig adjacency metadata with COIF significantly improves both the sensitivity and specificity of finding plasmid sequences in draft genome data. We also show how COIF has been used to completely assemble large (>100 kilobase) antibiotic resistance plasmids in Illumina paired-end assemblies of Escherichia coli genomes.
N077 - The Evolution of Abandoned Genes: Conservation Analysis of Metazoan Remaining Enzymes from Essential Amino Acids Biosynthetic Pathways
Short Abstract: Introduction

Metazoans have the distinct phenotype of being heterotrophs. This phenotype was caused by the loss of several genes responsible for the biosynthesis of essential
amino acids (EAA). Interestingly, we have recently found that some metazoan
EAA biosynthetic enzymes escaped deletion, since 8 of them are still present in our


To study the consequences of the loss of partners on the evolution of genes, in
particular, of the remaining metazoans genes of EAA biosynthetic pathways.


All 49 yeast genes coding for EAA biosynthesis were used as queries to perform
BLAST searches against 10 metazoan genomes in the refseq NCBI database. Using
this approach, 8 homologs were found in most metazoan genomes. Of those 8
enzymes, 6 were found to take part in other pathways. Only 2 of them, acetolactate
synthase (ALS) and betaine--homocysteine S-methyltransferase (BHMT), were found
to have no known function besides EAA biosynthesis. Using a new method to visualize
conserved regions in multi-clade alignments, we found closer similarity between fungal
and plant enzymes for ALS and BHMT, but not for the other 6 enzymes. Those
2 genes also have a tree topology that diverges from the Tree of Life, with fungi and
plants being clustered together and metazoans positioned as an outgroup.


These results suggest that the metazoan ALS and BHMT have undergone divergent
evolution and may have acquired new functions after the loss of their pathway partners.
This work opens new possibilities for the study and comparison of distant homologs.
N078 - Genome wide analysis of E. coli RNAP binding in open or closed complexes
Short Abstract: Regulation of genes in E. coli can occur at every step on the pathway to gene expression, but transcription initiation is the most frequently regulated step. The goal of our study is to determine the nature of RNA Polymerase (RNAP) binding, specifically whether its binding to promoters in open or closed complexes can be distinguished on a genome-wide scale. We clustered σ-70 ChIP-exo signals from E. coli that were proximal (within -80 to +20) to 3746 transcription start sites (TSSs) [Kim et al. 2012] and obtained 3 clusters. One cluster consisted of promoter regions that had no or very low ChIP-exo signal, representing the set of promoters that did not recruit RNAP. The signal of the second cluster tended to have ChIP-exo peaks upstream of the TSS, between -80 and 0, while the third cluster had ChIP-exo peaks that clustered downstream of the TSS, between -20 and +20. We hypothesize that the crosslinking of RNAP to these different regions near the TSS is indicative of promoters bound by RNAP in closed vs. open complexes. Knowledge of predominant RNAP binding in the closed complex vs. open complex forms is highly correlated with the eventual expression of the genes; such a view is consistent with known regulation of the closed to open complex transition as a mechanism to regulate gene expression. We will present the results of this analysis as well as experimental methods performed to verify this relationship.
N079 - A comparative ab initio method for predicting RNA-RNA interactions
Short Abstract: Proteins have traditionally been thought to be the primary regulators of gene expression, and only recently have we begun to uncover and appreciate the complexity of RNA-regulated expression in the cell. Central to nearly all discovered classes of regulatory RNAs is their ability to form specific intermolecular base-pairs with target RNAs, such as those in miRNA-mRNA interactions. Accurate in silico prediction of these intermolecular base-pairs or RNA-RNA interactions (RRI) could thus greatly aid the annotation of the overwhelming number of novel proposed non-protein-coding RNAs.

Given two RNA sequences, existing computational RRI prediction methods rely on estimating the energetic stability of the hybridized duplex. A few methods have recently improved upon these algorithms by adding evolutionary conservation into their predictions. However, the comparative metric used in these hybrid programs often results either in a significant increase in run-time or in fast performance at the expense of an oversimplified evolutionary model.

We here propose a new, hybrid ab initio RRI prediction method that maintains a full evolutionary model while remaining fast and tractable for longer transcripts. By using a full evolutionary model, our method maintains a robust performance when given alignments containing highly related or more divergent species. We evaluate the performance accuracy of our method on a large test set of known interactions and find that our method is more accurate than existing RRI methods relying on only energies and hybrids with additional comparative information.
N080 - Landscape of double-stranded DNA breaks in human genome: mapping and predictive modeling
Short Abstract: We have developed a genome-wide approach to map DNA double-strand breaks (DSBs) at nucleotide resolution by a method we termed BLESS (direct in situ breaks labeling, enrichment on streptavidin and next-generation sequencing). We validated and tested BLESS using human and mouse cells and different DSB-inducing agents and sequencing platforms. Our method is suitable for genome-wide mapping of DSBs in various cells and experimental conditions, with a specificity and resolution unachievable by current techniques. We characterized the genomic landscape of sensitivity to replication stress in human cells, and we identified >2,000 nonuniformly distributed aphidicolin-sensitive regions (ASRs) overrepresented in genes and enriched in satellite repeats. ASRs were also enriched in regions rearranged in human cancers, with many cancer-associated genes exhibiting high sensitivity to replication stress. Our data also show that double-strand breaks occur more often in transcribed regions, thus supporting the hypothesis that collisions between transcription and replication machineries contribute substantially to DNA double-strand break formation. Therefore, we model in silico the movements and collisions of transcription and replication complexes. We discuss how well our model predicts experimentally observed patterns in different situations.
N. Crosetto, A. Mitra, M. J. Silva, M. Bienko, N. Dojer, Q. Wang, E. Karaca, R. Chiarle, M. Skrzypczak, K. Ginalski, P. Pasero, M. Rowicka & Ivan Dikic: Nucleotide-resolution DNA double-strand break mapping by next-generation sequencing, Nature Methods, (2013) doi:10.1038/nmeth.2408
N081 - proovread: Third generation sequencing length with next generation accuracy
Short Abstract: With read lengths of several kilobases, third generation single molecule real-time sequencing outperforms all other currently available sequencing techniques. Thus, its reads have the potential to substantially improve assemblies, in particular their contiguity. However, they suffer from a per-base error rate of up to 15%. Thus, their usability for scaffolding is limited and direct assembly is not feasible. To address this challenge, we have developed the correction pipeline proovread, which generates accurate long reads by integrating third and second generation sequencing data. First, short high quality Illumina reads are mapped onto low quality Pacific Biosciences long reads. Second, a consensus sequence is calculated for the long reads, including a confidence score for each individual base call. Finally, the corrected reads are trimmed and filtered. An iterative setup for the procedure facilitates increased sensitivity and decreased run-time simultaneously. The quality of the pipeline was evaluated on real and simulated transcriptomic and genomic read sets. Overall, the corrected reads had more than 99.5% base-call accuracy and the number of chimeric reads was significantly reduced. Still, the length of the original PacBio reads was preserved. Thus, proovread makes it possible to gain the best from both worlds: Illumina read accuracy and PacBio read length.
N082 - Accurate alignment of thousands of barcoding sequences using the Multiple Alignment of Coding SEquences toolkit (MACSE-tk)
Short Abstract: MACSE is a multiple sequence alignment software able to align protein-coding nucleotide sequences based on their amino acid translations while accounting for possible reading frame modifications (due to biological frameshifts or sequence errors). The original alignment strategy of MACSE is time consuming when handling very large data sets. We developed various tools based on the MACSE core algorithm in order to enable the efficient alignment of thousands of coding sequences as typically used in barcoding projects. We show that using a dedicated pipeline based on MACSE-tk allows producing high quality alignments while detecting sequencing errors and overlooked pseudogenes in barcode data sets.
Our alignment strategy alleviates the risk of incorrect species delimitation due to the incorporation of sequencing errors or undetected pseudogenes of nuclear origin, e.g. numts for cytochrome c oxidase I (COI). By applying this pipeline to the mammal and plant sections of the Barcode of Life Database (BOLD), we highlight several cases of sequencing errors and provide reference alignments including thousands of barcoding sequences. Alignment accuracy is assessed by studying the distribution of very low frequency variants across official barcode segments. These reference alignments are made available and can be accessed through a dedicated website. We anticipate our approach to be particularly useful for next-generation barcoding studies in which thousands of new sequences need to be compared to a reference database for taxonomic assignment in large-scale biodiversity assessment and diet characterization studies.
N083 - P-value based regulatory motif discovery using positional weight matrices
Short Abstract: The rapid progress in high-throughput sequencing is transforming the way in which we study genomes and their role in regulating cellular and developmental processes. Genome-wide experiments need to be analysed with respect to the motifs that protein and ncRNA factors bind to. Most tools for the computational discovery of motifs use a simple pattern representation, because evaluating the statistical significance of enrichment of positional weight matrices (PWMs) has been too time-consuming. XXmotif is the first motif discovery tool that directly optimizes the statistical significance of enrichment for PWMs. On ChIP-chip/seq, miRNA knock-down, and co-expression datasets, XXmotif improves upon state-of-the-art tools in numbers of correctly identified motifs and in the quality of PWMs. In human core promoters, XXmotif reports most previously described motifs and eight novel elements sharply peaked around the transcription start site, among them an Initiator motif similar to the fly and yeast versions. Web server and source code:
N084 - DuplexSVM: a miRNA duplex prediction methodology
Short Abstract: We address the problem of predicting the miRNA:miRNA* duplex stemming from a microRNA (miRNA) hairpin precursor and we present an SVM-based methodology, named DuplexSVM, to address it. Predicting the miRNA:miRNA* duplex is a first step towards identifying the mature miRNA, suggesting possible miRNA targets and, ultimately, reducing experimentation effort, time, and cost. We measure the error in terms of the absolute difference between the true and predicted locations of all four ends of the duplex and/or of each end separately. Our mean absolute error over all ends is 1.61 ± 2.24 nts as measured on a hold-out set of 220 miRNA hairpin precursor sequences. In addition, our tool precisely predicts (with 0 nt deviation) the starting position for 57% and 52% of the miRNAs in the 5’ and 3’ strands of the same dataset, significantly outperforming the state-of-the-art tool MaturePred which achieves 18% and 12%, respectively, on the same task. Overall, our method accurately identifies not only the starting nucleotide of novel miRNA:miRNA* duplexes – and thus individual miRNAs - but also their length, while outperforming the current state-of-the-art tool.
N085 - A deterministic analysis of genome integrity during neoplastic growth in Drosophila using long overlapping paired-end reads
Short Abstract: The development of cancer has been associated with the gradual acquisition of genetic alterations leading to a progressive increase in malignancy. In various cancer types this process is enabled and accelerated by genome instability. While genome sequencing-based analysis of tumor genomes is increasingly becoming a standard procedure in human cancer research, the potential necessity of genome instability for tumorigenesis in Drosophila melanogaster has, to our knowledge, never been determined at the DNA sequence level. Therefore, we induced formation of tumors by depletion of the Drosophila tumor suppressor Polyhomeotic and subjected them to genome sequencing. To achieve a highly resolved delineation of the genome structure we developed the Deterministic Structural Variation Detection (DSVD) algorithm, which identifies structural variations (SVs) with high accuracy and at single base resolution. The employment of long overlapping paired-end reads enables DSVD to perform a deterministic, i.e. fragment size distribution independent, identification of a large size spectrum of SVs. Application of DSVD and other algorithms to our sequencing data reveals substantial genetic variation with respect to the reference genome, reflecting temporal separation of the reference and laboratory strains. The majority of SVs, constituted by small insertions/deletions, is potentially caused by erroneous replication or transposition of mobile elements. Nevertheless, the tumor did not show a loss of genome integrity compared to the control. Altogether, our results demonstrate that genome stability is not inevitably affected during sustained tumor growth in Drosophila, implying that tumorigenesis, in this model organism, can occur irrespective of genome instability and the accumulation of specific genetic alterations.
N086 - Nucleotide-resolution DNA double-strand breaks mapping by next-generation sequencing
Short Abstract: We present a genome-wide method to map DNA double-strand breaks (DSBs) at nucleotide resolution by direct in situ breaks labeling, enrichment on streptavidin, and next-generation sequencing (BLESS). We comprehensively validated and tested BLESS using different mammalian cell models, DSB-inducing agents, and sequencing platforms. BLESS was able to detect telomere ends, Sce endonuclease-induced DSBs, and complex genome-wide DSB landscapes. As a proof of principle, we characterized the genomic landscape of sensitivity to replication stress in human cells, and identified over 1,000 non-uniformly distributed aphidicolin-sensitive regions (ASRs) overrepresented in genes and enriched in satellite repeats. ASRs were also enriched in regions rearranged in human cancers, with many cancer-associated genes exhibiting high sensitivity to replication stress. Our method is suitable for genome-wide mapping of DSBs in various cells and experimental conditions with a specificity and resolution unachievable by current techniques.
N087 - Phyloclassifiers based on tRNA signatures are robust to compositional convergence and horizontal gene transfer
Short Abstract: Phylogenomic approaches are subject to systematic bias from convergence in macromolecular compositions. Such bias may underlie recent contradictory results on the phylogeny of the SAR11 group of alphaproteobacteria, notable for their ecological dominance and extreme genome streamlining. Here we develop a novel approach to phyloclassification based on signatures derived from tRNA Class-Informative Features (CIFs). Although tRNA CIFs are bioinformatically defined, they are enriched for features that underlie tRNA-protein interactions. We developed a simple tRNA-CIF-based phyloclassifier for alphaproteobacterial genomes that recapitulates the results of complex and meticulous whole proteome-based phylogenomic analyses. Our results reject monophyly of the SAR11 group and affiliation of most strains with Rickettsiales. Instead, most strains tree with other free-living alphaproteobacteria. The CIF-based phylogenomic signature is remarkably robust even though base contents of SAR11 and Rickettsiales tRNAs have both converged towards higher A+T contents. Also, given the notoriously promiscuous horizontal gene transfer (HGT) of aminoacyl-tRNA synthetase genes, our tRNA CIF-based phyloclassifier appears at least partly robust to HGT of interacting components. We describe how unique features of the tRNA-protein interaction network facilitate mining of traits governing macromolecular interactions from genomic data, and discuss why interaction-governing traits may be especially useful in resolving deep relationships in the Tree of Life.
N088 - BESST - Scaffolding large fragmented assemblies efficiently
Short Abstract: Motivation: Using short reads from High Throughput Sequencing (HTS) techniques for de novo assembly is now commonplace. However, obtaining contiguous assemblies from short reads is difficult, which makes scaffolding an important step in the assembly pipeline. Different algorithms have been proposed, but we found the available scaffolders lacking in quality and usability. We noted that published scaffolders are only evaluated on small datasets using output from only one assembler. There are two effects to this. Firstly, most of the available tools are not well suited for, or even able to run on, more complex genomes. Secondly, these evaluations provide little support for inferring a software’s general performance. Additionally, we have previously shown that model errors exist within state-of-the-art scaffolders, rendering scaffolding prone to errors when using libraries with large insert size variation.

Results: We propose a new algorithm, implemented in a tool called BESST, that works well on genomes of all sizes and complexities. We evaluated BESST against the most popular stand-alone scaffolders on a large variety of data sets in what we believe is the most comprehensive comparison of scaffolders yet. Our results confirm that popular scaffolders are not always able to run on complex data sets. Furthermore, no single scaffolder outperforms the others on all data sets, and each scaffolder has at least one data set where it gives the best result. BESST, however, gives better results than the other scaffolders when the library insert size distribution is wide.
N089 - Detecting differential expression in de novo assembled transcriptomes
Short Abstract: With next generation sequencing it has become possible to analyse the transcriptome of non-model organisms by performing a de novo assembly of RNA-seq reads. In particular, differential expression analysis can be undertaken without the need for a reference genome or annotation. While a number of studies have compared the relative merits of different transcriptome assembly programs, less attention has been given to the methodology for performing a differential expression analysis after the transcriptome has been assembled.

Here we assess different strategies for taking de novo assembled transcripts and producing a list of differentially expressed genes. Differential expression analysis on a de novo assembly suffers from several challenges including mapping reads to transcripts, clustering similar transcripts and producing a summary of read counts for statistical testing.

We demonstrate that clustering transcripts into loci improves the interpretability of results and increases statistical power, but that results are very dependent on the choice of clustering. Most clustering tools are not optimised for de novo assembled sequences, and to address this we have developed a method that uses hierarchical clustering to group transcripts based on shared reads. We also explore possible choices for mapping and summarising read counts to gene clusters. We combine these results and offer a pipeline for going from assembled transcripts to gene-level counts data and differential expression results. We show that our pipeline offers superior results in all cases we have tested.
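The idea of grouping transcripts by shared reads can be illustrated with a small sketch. This is not the authors' pipeline: the function name `cluster_by_shared_reads`, the overlap measure (shared reads divided by the smaller read set) and the `min_shared_fraction` cut-off are assumptions made for illustration. Linking every pair above the cut-off and taking connected groups is equivalent to cutting a single-linkage hierarchical clustering at that threshold.

```python
from itertools import combinations


def cluster_by_shared_reads(reads_by_transcript, min_shared_fraction=0.3):
    """Group transcripts whose mapped read sets overlap (single linkage).

    reads_by_transcript: dict mapping transcript name -> set of read IDs.
    min_shared_fraction: hypothetical cut-off on shared / smaller read set.
    """
    names = sorted(reads_by_transcript)
    parent = {name: name for name in names}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(names, 2):
        reads_a, reads_b = reads_by_transcript[a], reads_by_transcript[b]
        smaller = min(len(reads_a), len(reads_b))
        if smaller == 0:
            continue  # a transcript without reads cannot be linked
        if len(reads_a & reads_b) / smaller >= min_shared_fraction:
            parent[find(a)] = find(b)  # merge the two groups

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(clusters.values())


# Toy input: t1 and t2 share two reads, t2 and t3 share one, t4 is isolated.
toy_reads = {
    "t1": {1, 2, 3, 4},
    "t2": {3, 4, 5},
    "t3": {5, 6},
    "t4": {10, 11},
}
toy_clusters = cluster_by_shared_reads(toy_reads)
```

In this toy input t1 and t3 share no reads but still end up in the same group through t2, which is the chaining behaviour typical of single linkage; other linkage rules would split them.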
N090 - TIGAR: transcript isoform abundance estimation method with gapped alignment of RNA-Seq data by variational Bayesian inference
Short Abstract: Motivation: Many human genes express multiple transcript isoforms through alternative splicing, which greatly increases diversity of protein function. Although RNA sequencing (RNA-Seq) technologies have been widely used in measuring amounts of transcribed mRNA, accurate estimation of transcript isoform abundances from RNA-Seq data is challenging, because reads often map to more than one transcript isoform or paralog whose sequences are very similar to each other.
Results: We propose a statistical method to estimate transcript isoform abundances from RNA-Seq data that can handle gapped alignments of reads against reference sequences so that it allows insertion or deletion errors within reads. The proposed method optimizes the number of transcript isoforms by variational Bayesian inference through an iterative procedure, and its convergence is guaranteed under a stopping criterion. On simulated data sets, our method outperformed the comparable quantification methods in inferring transcript isoform abundances, and at the same time its convergence speed under a stopping criterion was faster than that of the expectation maximization (EM) algorithm. We also applied our method to RNA-Seq data of human cell line samples, and showed that our prediction result was more consistent among technical replicates than those of other methods.
N091 - PacBio RS long Read Applications in Plant Genomics
Short Abstract: Many plant species with high economic value have large and repetitive genomes, which hamper construction of high quality assemblies and their subsequent exploitation. The currently available next-generation sequencing technologies, such as Illumina, are high-throughput and low cost, but produce relatively short reads (<~250 bases). As a result, most sequenced plant genomes consist of tens of thousands of scaffolds containing many gaps. The PacBio RS sequencing technology can generate long reads (≥4 kb on average), which can be used to generate high quality assemblies and improve draft genomes. Here we present several examples of high quality assemblies and improved draft genomes generated using PacBio RS reads.
N092 - How UniProtKB Maps Genomes And Variants and Provides This Information
Short Abstract: The Universal Protein Resource (UniProt), a collaboration between the European Bioinformatics Institute (EBI), the Swiss Institute of Bioinformatics (SIB), and the Protein Information Resource (PIR), is comprised of four databases, each optimized for different uses. The UniProt Knowledgebase (UniProtKB) is the central access point for extensively curated protein information, including function, classification and cross-references. The UniProt Reference Clusters (UniRef) combines closely related sequences into a single record to speed up sequence similarity searches. The UniProt Archive (UniParc) is a comprehensive repository of all protein sequences, consisting only of unique identifiers and sequences. The UniProt Metagenomic and Environmental Sequences (UniMES) database is a repository specifically developed for metagenomic and environmental data.
UniProt introduced the concept of complete proteomes in order to provide proteins related to complete genomes and to re-organize protein space. A complete proteome consists of the set of proteins thought to be expressed by an organism, whose genome has been sequenced completely. A reference proteome is the complete proteome of a representative, well-studied model organism or an organism of interest for biomedical research.
The exponential growth of variation information is making it infeasible to expect non-synonymous variants to be only manually assessed and added to UniProtKB as it is done currently. UniProt has developed a pipeline to import high-quality 1000 Genomes and COSMIC non-synonymous single amino acid variants from Ensembl variation. 389,935 single amino acid variants have been identified for import into the UniProt human reference proteome.
N093 - ARH-seq: Differential Splicing Prediction for RNA-seq Data
Short Abstract: Splicing is a fundamental mechanism to generate multiple proteins from a single RNA template and impacts differentiation and disease. The computational prediction of alternative splicing from high-throughput sequencing data is inherently difficult, because the differential splicing signal is overlaid by influence factors, such as gene expression differences, expression of multiple isoforms and others and, thus, necessitates robust statistical measures. Furthermore, the lack of benchmark data for validating differential splicing is still a bottleneck.
Here, we describe ARH-seq, a prediction method for differential splicing that is based on the information-theoretic concept of entropy. ARH-seq is an extension of the ARH method that was originally developed for exon microarrays. It uses the known exon and junction annotation and transforms the sequence read counts to a probability distribution that is subsequently evaluated with entropy. ARH-seq is embedded within a computational workflow for RNA-seq data analysis. We show that the method has inherent features, such as independence of transcript exon number and independence of differential expression, among others, that make it particularly suited for detecting alternative splicing events from sequencing data. We challenged the method with benchmark data on twenty human tissues, and performed a comparison with alternative methods. In order to judge the performance of the methods we constructed a benchmark data set of true positive splicing events between different tissues agglomerated from public databases. Using this setup, we show that ARH-seq is a highly performing method for differential splicing prediction that can be used efficiently in the context of RNA-seq data analysis.
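The step of turning read counts into a probability distribution and scoring it with entropy can be sketched as follows. This is an illustrative simplification and not the published ARH-seq formula: the function name `exon_deviation_entropy`, the pseudocount, and the use of the median log-ratio as the gene-level shift are all assumptions. The sketch expects at least one exon per gene.

```python
import math
from statistics import median


def exon_deviation_entropy(counts_a, counts_b, pseudocount=1.0):
    """Entropy (in bits) of per-exon deviations from the gene-level change.

    counts_a, counts_b: read counts per exon of one gene in two conditions.
    pseudocount: hypothetical smoothing value that avoids log of zero.
    """
    log_ratios = [
        math.log2((a + pseudocount) / (b + pseudocount))
        for a, b in zip(counts_a, counts_b)
    ]
    # Subtracting the median removes a uniform expression change of the gene.
    gene_shift = median(log_ratios)
    deviations = [abs(r - gene_shift) for r in log_ratios]
    total = sum(deviations)
    if total == 0:
        # No exon stands out: treat as uniform, i.e. maximal entropy.
        return math.log2(len(deviations))
    probs = [d / total for d in deviations]
    return sum(-p * math.log2(p) for p in probs if p > 0)


# Gene expressed twice as high in condition A, all exons changed alike.
uniform_change = exon_deviation_entropy([199, 199, 199, 199], [99, 99, 99, 99])
# Same gene, but only the last exon differs between conditions.
single_exon = exon_deviation_entropy([99, 99, 99, 399], [99, 99, 99, 99])
```

Under these assumptions a gene whose exons all change together keeps the maximal entropy of log2(number of exons), while a deviation concentrated in one exon drives the entropy towards zero, so low values flag candidate splicing differences independently of overall expression change.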
N094 - GPCRsort: a new technique for the prediction of G protein-coupled receptor classes using only structural region lengths
Short Abstract: Being the largest and most divergent receptor family, GPCRs require an efficient classification for grouping the members according to their functions. With the current developments in genome sequencing and emerging number of orphan GPCRs, it is required to classify the receptors into corresponding classes so that researchers could use known features for the classes in further evaluations. As current classification tools are inadequate and slow, a new classification tool is required to catch up with the increasing number of sequenced genomes and orphan receptors.
This study summarizes the development of a new classification tool for GPCRs. Using the structural features derived from the primary sequence, the proposed method, GPCRsort, is able to classify uncharacterized GPCRs with great accuracy in a very short time.
N095 - Statistical association of gp120 positions with HIV-1 co-receptor tropism
Short Abstract: For entry into cells, HIV-1 contacts the cellular receptor CD4 and a
co-receptor, mostly chemokine receptors CCR5 ("R5") or CXCR4 ("X4"). The
co-receptor chosen by HIV-1 (“tropism”) is largely determined by the V3
loop of HIV-1 envelope protein gp120. Since V3 is part of a large gp120
trimer, its function may depend on the rest of this molecular machine,
and changes in V3 may be accompanied by changes elsewhere in gp120. We
have therefore asked: Are there gp120 positions or position tuples that
are significantly associated with R5- or X4-tropism?

We have developed the R-package EPIC for effective and user-friendly
analysis of such questions. The EPIC input is an aligned set of nucleic
acid or amino acid sequences, each annotated with a factor (e.g.
R5-tropic or X4-tropic). EPIC applies Fisher's exact test to each column
(or tuple of columns) of the alignment to identify statistical
associations of monomer frequencies in the columns with the factor. We
used EPIC to analyse 4,362 gp120 sequences from the Los Alamos HIV
database, previously predicted as R5- or X4-tropic by T-CUP. We
recovered known associations of tropism with V3, but we also detected
new highly significant associations (p < 0.0001, corrected for multiple
testing) elsewhere in gp120. An analysis of pairs of alignment columns
yielded 585 pairs that are highly significantly associated with tropism
(corrected p < 0.0001), 163 inside V3, and 129 completely outside V3. We
discuss the location of these positions in gp120 and functional
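The column-wise testing described above can be sketched in a few lines. The abstract's tool is an R package; the Python below is only an illustration and does not reproduce EPIC's interface. The function names, the reduction of each column to one 2x2 table per residue (residue present or absent versus tropism label), and the toy sequences are assumptions, and no multiple-testing correction is applied here.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Unnormalised hypergeometric weights for every table with these margins.
    weights = {x: comb(row1, x) * comb(row2, col1 - x) for x in range(lo, hi + 1)}
    observed = weights[a]
    # Sum all tables that are at most as probable as the observed one.
    extreme = sum(w for w in weights.values() if w <= observed)
    return extreme / comb(row1 + row2, col1)


def column_associations(sequences, labels, group="X4"):
    """Test each (column, residue) pair for association with `group`.

    sequences: aligned strings of equal length; labels: one label per sequence.
    Returns a list of (column index, residue, p-value) tuples.
    """
    results = []
    for col in range(len(sequences[0])):
        for residue in sorted({seq[col] for seq in sequences}):
            a = b = c = d = 0
            for seq, label in zip(sequences, labels):
                has_residue, in_group = seq[col] == residue, label == group
                if has_residue and in_group:
                    a += 1
                elif has_residue:
                    b += 1
                elif in_group:
                    c += 1
                else:
                    d += 1
            results.append((col, residue, fisher_exact_two_sided(a, b, c, d)))
    return results


# Toy alignment (made up): column 0 is invariant, column 1 tracks the label.
toy_results = column_associations(["AK", "AK", "AR", "AR"], ["X4", "X4", "R5", "R5"])
```

With only four toy sequences even a perfectly separating column cannot reach a small p-value; the significance levels quoted in the abstract rely on thousands of sequences and on correction for multiple testing.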
N096 - Phylogenomics of fungal polyketide biosynthesis - reconciliation, synteny and function
Short Abstract: We conducted a phylogenomic analysis of iterative fungal polyketide synthases and their genomic context based on over 140 model fungal genomes. The phylogeny reconstruction was carried out based on conserved ketoacyl synthase domain signature. By reconciliation with species tree based on 59 conserved, single copy orthologs, we were able to postulate sources of present day variability as a result of duplication, speciation and horizontal transfer (among eukaryotic hosts). The assembled results have enabled relative dating of expansions and contractions affecting the evolutionary history of polyketide biosynthetic repertoire.
Further inquiry has enabled us to correlate genomic context with gene phylogeny (co-occurrence of conserved orthologs of given function in the environment of biosynthetic genes, preservation of exon-intron structure). We were able to extract and analyse both already known and novel groups of genes likely involved in processing, regulation and transport of polyketide metabolites.
As an example case, we provide a context for previously reported relationships among non-reducing polyketide synthases, as progression of different cyclisation specificities (C2-C7, C4-C9, C6-C11). We also showcase evolutionary divergence of clades corresponding to novel termination mechanisms (split of thioesterase-dependent and NAD-dependent reductive mechanisms, introduction of accessory Mn-dependent lactamases as novel termination mechanism).
N097 - Noncoding RNA alignment and homology search using extended RNA secondary structure
Short Abstract: Non-coding RNAs have low sequence conservation and therefore HMM-based alignment and homology search methods often do not perform well. Their base-paired secondary structures on the other hand are usually highly conserved. RNA sequence alignment and homology search algorithms have been developed to use structure information. These algorithms usually use a simplified RNA secondary structure consisting of 1-diagrams where each nucleotide interacts with at most one partner and only canonical base pairs are allowed. Many nucleotides however form non-canonical base pairs and interact with two other nucleotides forming base-triplets. Recent attempts to include this information include RNAwolf which applies a base-triplet grammar combined with a simple version of the standard Turner energy model to single-sequence non-coding RNA structure prediction. Here we describe an application of a probabilistic base-triplet grammar to alignment and homology search. The grammar was implemented in Haskell using Algebraic Dynamic Programming and the transition and emission probabilities were estimated using FR3D annotations from PDB. The grammar was then used to generate covariance models for a set of Rfam families. These models were used to filter the results from an Infernal genome search for the same families. The results show that the filtering produces a lower false negative rate which can be attributed to the additional restrictions imposed on the structure.
N098 - Reduced cost single cell genome sequencing and de novo assembly of microbial communities
Short Abstract: Identification of all species in a microbial sample is an important and challenging task with crucial applications. It is challenging because there are typically millions of cells in a microbial sample, the vast majority of which elude cultivation. The most accurate method to date is exhaustive single cell sequencing using multiple displacement amplification, which is simply intractable for a large number of cells. However, there is hope for breaking this barrier as the number of different species is usually much smaller than the number of cells. Here, we present a novel divide-and-conquer method to sequence and de novo assemble the genomes of all of the different species present in a microbial sample with a sequencing cost and computational complexity proportional to the number of species, not the number of cells. The method is implemented in a tool called Squeezambler. We evaluated Squeezambler on simulated data. The proposed divide-and-conquer method successfully reduces the cost of sequencing in comparison with the naive exhaustive approach.
N099 - Data Management and Bioinformatics of High Throughput Sequencing Data from Prospective Cohort Study at Tohoku Medical Megabank Project
Short Abstract: Tohoku University Tohoku Medical Megabank Organization (ToMMo) was founded to establish an advanced medical system to foster the reconstruction from the Great East Japan Earthquake. The organization will develop a biobank that combines medical and genome information during the process of rebuilding the community medical system and supporting health and welfare in the Tohoku area.
The blueprint for Tohoku University Tohoku Medical Megabank Organization is a ten-year project including three main activities: a biobank combining medical and genome information; an online platform for the coordination of community medical information; and a training program designed for a variety of highly specialized professionals and experts such as bioinformatics researchers and science communicators. The biobank to be developed will be utilized to analyze the local heredity information so that it can establish an advanced medical system based on genome information with cutting-edge information and communication technology.
The first goal of ToMMo is to understand the detailed genetic population background in this area, including rare variants. Thus, this project applies deep coverage whole genome sequencing of thousands of people who joined this prospective genome cohort project within years. This poster presents the part organized by the oral presenter, “the data management and the bioinformatics analysis of massive amounts of high throughput sequencing data”, and the research positions available in this very exciting project for graduate students and research staff.
N100 - A statistical variant calling approach using pedigree information
Short Abstract: Due to the progress of next-generation sequencing technologies, whole genome sequencing for each individual becomes possible in practical time and with reasonable cost for the identification of disease associated mutations. Also, individual genome data from case-control study or study with pedigree analysis contribute to the elucidation of unknown disease mechanisms. Since accurate variant detection is required for the analysis of these genomes in a reliable manner, the development of accurate variant callers is demanded. In most of the variant callers, variants such as single nucleotide polymorphisms, insertions, and deletions are detected at each position in a reference genome from the information of mapped sequence reads. Since there exist positions with insufficient coverage of reads due to bias in the library preparation and mapping failures at short tandem repeat polymorphic sites or variable number of tandem repeat sites, reliable variant detection is challenging at these sites. Hence, variant callers that are robust to those errors even with low coverage data are demanded. We propose a new variant calling approach that considers pedigree information. Unlike variant callers considering individuals independently, our approach can use sequence read information of each individual for variant calling on other individuals by connecting read generation models of individuals based on pedigree information. Therefore, accurate variant calling and genotyping is expected even in insufficient read coverage sites. In performance evaluation with the HapMap CEU parent-offspring trio sequencing data, our approach outperformed existing approaches in accuracy on the agreement with SNP array genotyping results.
N101 - Counting RNAseq reads: which way is better?
Short Abstract: RNAseq presented a revolution in mRNA expression analysis. Microarrays were considered doomed, and early experimental validation of RNAseq analysis findings has largely endorsed the common sense view that this technology would become the de facto standard in gene expression analysis. Some RNAseq data, however, seem to be very sensitive to the method used in their analysis. In this work we show the variation in results we found while working with ~1 billion Illumina reads from a drought tolerant Sorghum bicolor genotype in the presence and absence of the stress, and compare the results found for key genes already characterized.
N102 - In silico screening for Antimicrobial Resistance genes in NGS Sequenced Bacterial Strains
Short Abstract: Royal DSM is a global science-based company active in health, nutrition and materials and a prominent player in industrial biotechnology.
We have developed a dual approach to detect Antimicrobial Resistance (ARes) genes in bacterial genome sequences. The first part of the method implements a homology search at the DNA level which relies on mapping of the sequencing reads to a collection of known ARes genes. It is an update of a method previously described by Bennedsen et al. (2011), accounting for the 4-fold longer reads (150 bp) that are typically obtained today by next-generation sequencing machines like the Illumina MiSeq. In the second part of the method, the collection of ARes genes is searched at the protein level in the de novo assembly of the genome using translated BLAST.
Because de novo assemblies built from NGS data today typically are not closed but consist of many contigs (in the order of 100 in our cases), the combination of both approaches also allows assessment of genes that might be absent in the draft de novo assembly and takes advantage of the increased sensitivity of homology searches at the protein instead of DNA level.
We have applied our method to Lactobacillus bulgaricus strains; results will be given on the poster.
N103 - Assembling the 20 Gb White Spruce Genome
Short Abstract: In this work we report the largest genome assembled using a pure whole genome shotgun sequencing approach. Estimated at 20 Gb, sequencing and assembly of the white spruce (Picea glauca) genome presented unique challenges. Our strategy involved preparation and sequencing of multiple libraries on two Illumina sequencing platforms. We complemented the high coverage data from HiSeq 2000 with low coverage longer reads from MiSeq to a total of 68-fold raw coverage. To provide long-range sequence linkage information, we sequenced nine long fragment libraries.
To build the white spruce genome we used the ABySS de novo assembly tool on a high performance computer cluster. The assembly was performed in three stages: unitig, contig and scaffold. The unitig stage used a distributed de Bruijn graph representation of read-to-read overlaps for sequence extensions, and was executed on 1,560 CPU cores over 41 hours. The contig stage used read pairs to resolve sequence extension ambiguities, and was executed on 12 CPU cores over 102 hours. Finally the scaffold stage used linkage information from short and long fragment libraries to resolve repeats, and was executed on 12 CPU cores over 98 hours. The assembled white spruce scaffolds captured over half of the genome in sequences 20 kb or longer.
We measured the completeness of the assembly using 13,036 full-length cDNA clones from a closely related species, and observed that 83% of them had at least a partial representation. Thus we conclude that this assembly will be a valuable resource to develop new forest management methods.
N104 - Exact dynamics of duplications and mutations in whole genome sequences
Short Abstract: We introduce a family of models of sequence duplication and mutation that
can reproduce steady state duplication length distributions of natural genomes,
both eukaryotic and prokaryotic.

For naturally occurring genomes, whole genome alignment and intersection yield length
distributions of exact sequence duplications that typically exhibit an algebraic form, with an exponent of around -3.

It has been demonstrated that a dynamics consisting solely of duplications can generate algebraic
duplication length distributions. Exponents and amplitudes of duplication length distributions from
natural genomes are also reproduced by a model combining random duplication with point substitution.

Discrete dynamical equations for our models were derived and solved
analytically, demonstrating excellent correspondence with numerical simulations and
a mapping onto fragmentation dynamics. Continuum generalisations of these models
exhibit an asymptotic regime of duplication and point substitution where the exponent of the duplication length
distribution takes the value -3 irrespective of details of the source distribution.
N105 - A Genome Scale Computational Approach for Identifying Neuropeptides
Short Abstract: The evolution of multicellular organisms with complex behavior is associated with increasingly complex sensation and behavior. This is achieved by a rich repertoire of bioactive peptides called neuropeptides (NPs) that act extracellularly by activating receptors, channels and signaling cascades. Due to the broad impact of NPs on physiology and homeostasis, they are attractive targets for many fields, including drug development and neuroscience related research. Active peptides are produced from a larger precursor that undergoes a set of regulated cleavages and post-translational modifications (PTMs). Herein we describe NeuroPID, a machine-learning Support Vector Machine (ML-SVM) protocol that identifies NP precursor genes from sequences of metazoan genomes. NeuroPID was trained on known NP precursors from insects and mammals. Hundreds of features were extracted from the sequences. NeuroPID reached 83% success in identifying NPs from unseen genomes such as the butterfly and several ant genomes. NeuroPID serves as a discovery tool for transcriptomes such as primary RNA-Seq data. Maximal impact on the performance is attributed to features that captured the entropy for several amino acids and the abundance of signature-motifs for cleavage sites. With the growth of unprocessed data from NGS and Mass Spectrometry, the need for reliable predictors is crucial to close the gap between genome annotations and functional discovery. To this end, we provide a catalog of top candidates for NPs whose functions govern sensation, homeostasis and behavior.
N106 - A general secretion signal for the mycobacterial type VII secretion pathway
Short Abstract: Mycobacterial pathogens, causing TB, use specialized type VII secretion (T7S) systems to transport virulence factors into infected host cells. These virulence factors lack classical secretion signals and their recognition by the secretion systems is not well understood. Here we demonstrate that the T7S substrates PE25/PPE41, which form a heterodimer, are targeted to the T7S pathway ESX-5 by a signal YxxxD/E located C-terminally in PE25. Pathogenic mycobacteria have several different T7S systems and we found a PE protein that is secreted by the ESX-1 system and contains a functionally equivalent C-terminal signal to that of PE25. Swapping the C-terminal motifs between these PE proteins preserved secretion, but each PE protein remained secreted via its own secretion system, indicating that additional signal(s) must provide system specificity. Searching with a refined YxxxD/E motif enriched with structural information confirmed that the YxxxD/E secretion signal appears present in all known mycobacterial T7S substrates.
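A plain pattern search for the basic YxxxD/E signal can be sketched as below. This covers only the unrefined sequence motif, not the structure-enriched version mentioned in the abstract. The function name `find_c_terminal_motifs`, the `window` parameter that restricts the search to the C-terminal end, and the toy sequence are assumptions for illustration.

```python
import re

# Y, any three residues, then D or E. The lookahead also reports overlapping hits.
YXXXDE = re.compile(r"(?=(Y.{3}[DE]))")


def find_c_terminal_motifs(sequence, window=30):
    """Return (position, match) pairs for YxxxD/E in the last `window` residues.

    Positions are 0-based indices into the full sequence.
    window: hypothetical cut-off for what counts as C-terminal.
    """
    start = max(0, len(sequence) - window)
    tail = sequence[start:]
    return [(start + m.start(), m.group(1)) for m in YXXXDE.finditer(tail)]


# Made-up toy sequence, not a real PE protein.
toy_protein = "MAAAYGGGDKKKYAAAELL"
all_hits = find_c_terminal_motifs(toy_protein)
tail_hits = find_c_terminal_motifs(toy_protein, window=10)
```

A five-residue pattern of this kind will match many unrelated proteins by chance, which is presumably why the abstract refines the motif with structural information before searching.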
