19th Annual International Conference on
Intelligent Systems for Molecular Biology and
10th European Conference on Computational Biology


Accepted Posters

Category 'U' - Sequence analysis
Poster U01
Improved modeling of protein solubility using k-mer and sequence similarity based parzen window approach

Pawel Smialowski Technical University Munich
Dmitrij Frishman (Technical University Munich, Department of genome oriented bioinformatics); Gero Doose (Technical University Munich, Department of genome oriented bioinformatics); Philipp Torkler (Technical University Munich, Department of genome oriented bioinformatics); Stefanie Kaufmann (Technical University Munich, Department of genome oriented bioinformatics);
 
Short Abstract: Many fields of science (structural biology, protein biochemistry) and industry (enzyme and antibody production) depend on the efficient production of active protein using heterologous expression in E. coli. Solubility upon expression is an individual trait of proteins which, under a given set of experimental conditions, is determined by their sequence. Predicting solubility from sequence is instrumental for many applications, including setting priorities on targets in large-scale projects and optimizing the efficiency of protein production. We aimed to build a server predicting which proteins have the best/worst chances of being soluble once heterologously expressed in E. coli. We developed a machine-learning model called PROSO II based on amino acid k-mer frequencies and sequence similarity. It is organized as a two-layered structure in which the outputs of a primary logistic classifier of amino acid k-mer composition and of a Parzen window model for sequence similarity serve as input for a second-level logistic classifier. Experimental progress information from the pepcDB and PDB databases was used as the source of data. Compared with previously published methods, our model is trained on the largest dataset to date, consisting of 82,000 proteins. When tested on a separate holdout set not used at any point of method development, our server attained the following performance values: MCC=0.39, sensitivity=0.731, specificity=0.759, precision(soluble)=0.377, gain(soluble)=2.263, precision(insoluble)=0.934, and gain(insoluble)=1.121. These results are statistically significantly different (p-value 2*10^-13) from and superior to those of our previous predictor, PROSO (MCC=0.224, sensitivity=0.415, specificity=0.813, precision(soluble)=0.328, gain(soluble)=1.973, precision(insoluble)=0.877, and gain(insoluble)=1.052).
In summary, the PROSO II server provides more accurate solubility predictions. It is available at http://mips.helmholtz-muenchen.de/prosoII/prosoII.seam
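
To make the two-layer architecture concrete, a minimal sketch of a PROSO II-style stacked model follows. This is an illustrative assumption using scikit-learn and synthetic data, not the authors' implementation; feature construction, dimensions and the Parzen-window bandwidth are invented.

```python
# Sketch of a two-layer classifier in the spirit of PROSO II (illustrative
# only, synthetic data): layer 1 = logistic regression on k-mer composition
# plus a Parzen-window (kernel density) similarity score; layer 2 = logistic
# regression on the two layer-1 outputs.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
n, d = 200, 20                       # proteins, k-mer feature dimension
X = rng.random((n, d))               # stand-in for k-mer frequency vectors
y = (X[:, 0] + 0.1 * rng.standard_normal(n) > 0.5).astype(int)  # soluble?

# Layer 1a: logistic classifier on k-mer composition
l1 = LogisticRegression().fit(X, y)
p_kmer = l1.predict_proba(X)[:, 1]

# Layer 1b: Parzen window -- log-density ratio of each protein against the
# soluble and insoluble training sets
kd_pos = KernelDensity(bandwidth=0.5).fit(X[y == 1])
kd_neg = KernelDensity(bandwidth=0.5).fit(X[y == 0])
p_sim = kd_pos.score_samples(X) - kd_neg.score_samples(X)

# Layer 2: second-level logistic classifier on the two layer-1 outputs
meta = np.column_stack([p_kmer, p_sim])
l2 = LogisticRegression().fit(meta, y)
print("training accuracy:", l2.score(meta, y))
```

The design point of such stacking is that the second layer learns how much weight to give composition versus similarity evidence.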
 
Poster U02
Getting maximum information from your paired-end reads: A package for accurate detection of copy number alterations.

Valentina Boeva Institut Curie
Bruno Zeitouni (Institut Curie, Centre de Recherche, Bioinformatics); Kevin Bleakley (Institut Curie, Centre de Recherche, Bioinformatics); Andrei Zinovyev (Institut Curie, Centre de Recherche, Bioinformatics); Jean-Philippe Vert (Institut Curie, Centre de Recherche, Bioinformatics); Isabelle Janoueix-Lerosey (Institut Curie, U830); Olivier Delattre (Institut Curie, U830); Emmanuel Barillot (Institut Curie, Centre de Recherche, Bioinformatics);
 
Short Abstract: In many studies that apply deep sequencing to cancer genomes, one has to calculate copy number profiles (CNPs) and predict regions of gain and loss. There exist two frequent obstacles in the analysis of cancer genomes: absence of an appropriate control sample for normal tissue and possible polyploidy. For various reasons, sequencing of an appropriate control sample is not always possible. We therefore developed a bioinformatics tool, called FREEC, able to automatically detect copy number alterations (CNAs) without use of a control dataset. FREEC normalizes copy number profiles using read mappability and GC-content and then applies a LASSO-based segmentation procedure to the normalized profiles to predict CNAs.
For mate-pair/paired-end mapping (PEM) data, one can complement the information about CNAs (i.e., the output of FREEC) with the predictions of structural variants (SVs) made by another tool that we developed, SVDetect. SVDetect finds clusters of ‘discordant’ PEMs and uses the characteristics of the reads inside the clusters (orientation, order and clone insert size) to identify the SV type. SVDetect allows identification of a large spectrum of rearrangements including large insertions-deletions, duplications, inversions and balanced/unbalanced intra/inter-chromosomal translocations.
Both SVDetect and FREEC are compatible with the SAM alignment format and provide output files for graphical visualization of predicted genomic rearrangements.
Here we present a package for automatic intersection of FREEC and SVDetect outputs that allows one to (1) refine coordinates of CNAs using PEM data and (2) improve confidence in calling true positive rearrangements (particularly, in ambiguous satellite/repetitive regions).
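
As a rough illustration of the control-free normalization idea (GC-bias correction by regression), here is a toy sketch on synthetic windows. All numbers are invented; the real tool also corrects for mappability and applies LASSO-based segmentation, both omitted here.

```python
# Toy sketch of control-free GC normalization in the spirit of FREEC
# (synthetic data; mappability correction and segmentation omitted).
import numpy as np

rng = np.random.default_rng(1)
n = 1000
gc = rng.uniform(0.3, 0.7, n)                       # GC fraction per window
cn = np.repeat([2.0, 3.0, 2.0], [400, 200, 400])    # hidden copy number: one gain
bias = 1.5 - 4.0 * (gc - 0.5) ** 2                  # unimodal GC bias curve
reads = rng.poisson(100 * (cn / 2.0) * bias)        # observed read counts

# Fit the expected count as a polynomial of GC over all windows, then divide
# it out; the copy-number gain stands out in the normalized ratio profile.
coef = np.polyfit(gc, reads, 2)
ratio = reads / np.polyval(coef, gc)
print(round(ratio[:400].mean(), 2), round(ratio[400:600].mean(), 2))
```

Because the fit is made over all windows without a control sample, the ratios are relative to the genome-wide average, which is why ploidy must be handled separately.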
 
Poster U03
Accurate quantification of transcriptome from RNA-Seq data by effective length normalization

Sanghyuk Lee Korea Research Institute of Bioscience and Biotechnology
Soohyun Lee (Cornell University, Division of nutritional science); Chaehwa Seo (Korea Research Institute of Bioscience and Biotechnology, Korean Bioinformation Center); Byungho Lim (Korea Advanced Institute of Science and Technology, Department of Biological Sciences); Jin Ok Yang (Korea Research Institute of Bioscience and Biotechnology, Bioinformation Center); Jeongsu Oh (Korea Research Institute of Bioscience and Biotechnology, Korean Bioinformation Center); Minjin Kim (Korea Advanced Institute of Science and Technology, Department of Biological Sciences); Sooncheol Lee (Korea Advanced Institute of Science and Technology, Department of Biological Sciences); Byungwook Lee (Korea Research Institute of Bioscience and Biotechnology, Korean Bioinformation Center); Changwon Kang (Korea Advanced Institute of Science and Technology, Department of Biological Sciences);
 
Short Abstract: We introduce an approach for estimating mRNA abundances from RNA-Seq data. Our method, NEUMA (Normalization by Expected Uniquely Mappable Area), is based on effective length normalization using the uniquely mappable areas of genes and mRNA isoforms. NEUMA achieves correlation coefficients higher than 0.95 between RNA-Seq-based abundance estimates and quantitative PCR in single samples. Making the best use of known transcriptome sequence information, NEUMA pre-computes the numbers of all possible gene-wise and isoform-wise informative reads: the former common to all mRNA isoforms of a single gene and the latter unique to a single mRNA isoform. These numbers are used as the effective lengths of genes and transcripts, after taking the experimental distribution of fragment sizes into consideration. Stand-alone and web-server NEUMA programs are available at http://neuma.kobic.re.kr. NEUMA may also be combined with other RNA-Seq analysis platforms as a component for quantifying gene transcript levels.
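
The core of effective-length normalization can be sketched in a few lines. The function below is a hypothetical simplification (an RPKM-like measure computed over the uniquely mappable length only), not the NEUMA code, and the numbers are invented.

```python
# Hypothetical sketch of effective-length normalization: divide the count of
# uniquely mappable ("informative") reads by the transcript's uniquely
# mappable length instead of its full length (an RPKM-like measure).
def neuma_like_abundance(unique_reads, unique_mappable_length, total_reads):
    effective_kb = unique_mappable_length / 1000.0
    millions = total_reads / 1e6
    return unique_reads / effective_kb / millions

# A transcript whose sequence is only half uniquely mappable gets the same
# estimate as a fully unique transcript receiving twice the reads:
a = neuma_like_abundance(500, 1000, 10_000_000)    # 1 kb uniquely mappable
b = neuma_like_abundance(1000, 2000, 10_000_000)   # 2 kb, fully unique
print(a, b)  # both 50.0
```

Dividing by the mappable rather than the full length is what removes the downward bias for transcripts sharing sequence with their isoforms.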


Poster U04
Application of whole genome resequencing to detect sequence polymorphisms underlying boar taint

Rahul Agarwal Norwegian University of Life Sciences
Eli Grindflek (NORSVIN); Maren van Son (NORSVIN); Tina Graceline (Norwegian University of Life Sciences and University of Oslo, Centre for Integrative Genetics (CIGENE) and Centre for Ecological and Evolutionary Synthesis (CEES)); Sigbjørn Lien (Norwegian University of Life Sciences, Department of Animal and Aquacultural Sciences and Centre for Integrative Genetics (CIGENE)); Matthew Kent (Norwegian University of Life Sciences, Department of Animal and Aquacultural Sciences and Centre for Integrative Genetics (CIGENE));
 
Short Abstract: Boar taint, a variable trait characterised by an objectionable odour/flavour in the meat from intact male pigs, is caused primarily by high levels of androstenone and skatole in adipose tissue. Traditionally, the pig industry has used castration as a tool for preventing boar taint; however, this is becoming less acceptable to consumers and leads to both reduced growth efficiency and excess deposition of fat. Selective breeding may serve as an alternative to castration, but effective implementation requires the identification of genetic variation (single nucleotide polymorphisms, SNPs) associated with the phenotype, in order to select for low levels of boar taint without simultaneously affecting levels of sex steroids. Here we present a pipeline for detecting genetic variants underlying boar taint by whole genome resequencing. This approach detects genetic variants in multiple samples simultaneously while maintaining information about each breed and individual that carries the different variants. Two Norwegian Landrace and Duroc pigs with high and low androstenone levels were sequenced using a Genome Analyzer IIx to an average coverage of 4X. BWA was used to map these short reads of 108 nucleotides to Sus scrofa Build 10, and SAMtools was used to detect SNPs and insertion-deletion polymorphisms (InDels). We detected 7,313,798 SNPs and 1,011,236 InDels, filtering SNPs for a minimum quality of 20 and allele coverage of at least 2 reads. These will be compared to existing QTL information in an attempt to find strong candidate markers and perhaps functional polymorphisms.
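
The SNP filtering step described above (minimum quality 20, allele covered by at least two reads) could be sketched as follows. The records are invented for illustration; this is not the actual pipeline code.

```python
# Sketch of the described SNP filter: keep calls with quality >= 20 whose
# allele is covered by at least 2 reads (example records are invented).
def passes_filter(qual, allele_depth, min_qual=20, min_depth=2):
    return qual >= min_qual and allele_depth >= min_depth

calls = [
    # (chrom, pos, ref, alt, quality, allele depth)
    ("chr6", 1204511, "A", "G", 35, 7),
    ("chr6", 1207902, "C", "T", 12, 9),   # fails: quality < 20
    ("chr11", 845330, "G", "A", 48, 1),   # fails: only 1 supporting read
]
kept = [c for c in calls if passes_filter(c[4], c[5])]
print(len(kept))  # 1
```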
 
Poster U05
High-throughput protein analysis in the cloud

Guy Yachdav Columbia University
Laszlo Kajan (Technische Universität München, Dept. for Bioinformatics and Computational Biology); Burkhard Rost (Technische Universität München, Dept. for Bioinformatics and Computational Biology);
 
Short Abstract: In this poster we report the opening of a new ‘app store’ in the form of a Debian/Ubuntu repository for the most popular prediction methods from the Rost Lab, including PredictProtein. We also report the release of a new protein feature prediction machine image for high-throughput cloud computing. Protein secondary structure, disordered regions, effects of amino acid substitutions, protein-protein interaction sites and sub-cellular localisation are among the features predicted by software available in the repository and on the machine image. This image is bootable from a USB stick or an SD card and ready to predict in a real or a virtual machine. It is also transferable to and bootable on server instances offered by cloud infrastructure services such as the Amazon Elastic Compute Cloud. After booting, a friendly message at the login prompt offers usage tips and directions to documentation. The image is updated monthly and is freely available to non-profit organizations after registration. We plan to release another image based on Ubuntu with support for the web-based graphical interface of PredictProtein.

Availability and Implementation: The Debian/Ubuntu package repository is freely available on the web at http://rostlab.org/debian. An overview of the contents of the repository is available at https://rostlab.org/owiki/index.php/Packages. The PredictProtein machine image is freely available for academic users after registration at https://predictprotein.org/download.
 
Poster U06
Evolutionary Conservation and Genetic Variability of Hepatitis C Virus Core Protein (HCVc)

Muhammad Ilyas National Centre of Excellence in Molecular Biology, University of the Punjab, Lahore, Pakistan
Syyada Samra Jafri (National Centre of Excellence in Molecular Biology, University of the Punjab, Lahore, Pakistan, Molecular Biology); Muhammad Saqib Shahzad (National Centre of Excellence in Molecular Biology, University of the Punjab, Lahore, Pakistan, Molecular Biology); Ziaur Rahman (National Centre of Excellence in Molecular Biology, University of the Punjab, Lahore, Pakistan, Molecular Biology); Gulzar Ahmad Niazi (National Centre of Excellence in Molecular Biology, University of the Punjab, Lahore, Pakistan, Molecular Biology); Tayyab Husnain (National Centre of Excellence in Molecular Biology, University of the Punjab, Lahore, Pakistan, Molecular Biology);
 
Short Abstract: The immunological and structural properties of the Hepatitis C virus core protein (HCVc) have been studied in detail. However, the evolutionary forces shaping sequence variation within the core are not completely understood.
A large portion of the HCV core protein sequence is non-polymorphic, with polymorphism confined to a few amino acids of HCVc. Phylogenetic analysis suggests that core-specific forces confine its diversity within the context of whole HCV genome evolution; consequently, virus genotype is not well predicted by the core protein. The structural requirements of the capsid play a major role in limiting the diversity of HCVc. Phylogenetic analysis also suggests that immunological selection plays no major role in driving HCVc diversity.
 
Poster U07
A method to separate function- and fold-specific residues in a protein

Mohd Rehan Jawaharlal Nehru University
Andrew Lynn (Jawaharlal Nehru University)
 
Short Abstract: According to the neutral theory of molecular evolution, once a protein has evolved to a useful level of functionality, the majority of mutations are selectively neutral at the molecular level and do not affect the function and folding of the protein, whereas deleterious mutations provide selection pressure for residue conservation. Conservation thus indicates the importance of a residue for the structure and function of the protein. Mutations at a conserved site after gene duplication often lead to functional divergence. Residues at these sites, called function-specific residues, change the protein's function when mutated. For a protein class, sequences are further grouped into subtypes containing subtype-conserved residues indicative of functional differences among the subtypes. Here we present a method for finding function-specific and fold-specific residues in a protein family pre-classified into subtypes based on function. Our method is based on RE (Relative Entropy) and a newly derived MI (Mutual Information) score. RE measures conservation relative to the background distribution of amino acids, whereas the derived MI captures differential conservation among subtypes. The newly derived MI and the RE are combined into a new score, MIRE, to rank residues for fold and function. The methodology is implemented using HMMs, validated on AGC protein kinases and G-protein coupled receptors (GPCRs), and compared with other existing methods. The kinases have different peptide substrates but a common protein kinase function, whereas the GPCRs share a similar scaffold with different substrate binding sites.
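
The relative-entropy component of such conservation scoring can be illustrated directly. The toy four-letter background below is an assumption for brevity; the MI term and the MIRE combination are not reproduced here.

```python
# Relative entropy (Kullback-Leibler divergence) of an alignment column's
# residue distribution against a background distribution -- the conservation
# score described above. Toy 4-letter background for brevity.
import math

def relative_entropy(column_freqs, background):
    return sum(p * math.log2(p / background[a])
               for a, p in column_freqs.items() if p > 0)

bg = {"A": 0.25, "G": 0.25, "K": 0.25, "D": 0.25}
conserved = {"K": 1.0}     # fully conserved column: maximal RE
variable = dict(bg)        # column matching the background: RE = 0
print(relative_entropy(conserved, bg), relative_entropy(variable, bg))
```

A fully conserved column scores log2 of the alphabet size (here 2.0 bits), while a column distributed like the background scores zero.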
 
Poster U09
A web-based program for detecting SNP substitutions and heterozygosity

Daesang Lee Korea Bio Polytechnic College
 
Short Abstract: Single-nucleotide polymorphisms (SNPs) are DNA sequence differences among individuals of the same species at the nucleotide level and are widely applied in clinical fields such as personalized medicine. The routine, labor-intensive way to determine SNPs is to perform a sequence homology search using BLAST and to navigate the traces of chromatogram files generated by high-throughput DNA sequencing machines using the Chromas program. In this poster, we present SNPchaser, a web-based program for detecting SNP substitutions and heterozygosity, developed to improve on this labor-intensive approach. SNPchaser performs sequence alignment and visualizes suspected SNP regions using the user's reference sequence, AB1 files, and positional information of SNPs. It simultaneously provides the user with the sequence alignment and the chromatogram of the relevant SNP region. In addition, SNPchaser can easily determine the existence of heterozygosity in a SNP region. SNPchaser is freely accessible via the web site http://www.bioinformatics.ac.kr/SNPchaser.
 
Poster U10
EMBOSS: European Molecular Biology Open Software Suite

Peter Rice European Bioinformatics Institute
Alan Bleasby (European Bioinformatics Institute, Services Division); Mahmut Uludag (European Bioinformatics Institute, Services Division); Jon Ison (European Bioinformatics Institute, Services Division);
 
Short Abstract: EMBOSS is a mature open source suite with extensive library functions written in C.

The new release (July 2011) includes access to over 500 public data resources, extensive annotation of tools and resources using the EDAM ontology, new data types (ontologies, taxonomy, data resources, mapped assemblies, general text), and new data access methods (DAS, BioMart, Ensembl, SOAP web services, etc.)
 
Poster U11
SKM: A String Kernel Motifs Classifier for Nucleic Acid and Protein Sequences

Hugh Shanahan Royal Holloway, University Of London
Piotr Byzia (Royal Holloway, University of London, Computer Science);
 
Short Abstract: The use of string kernels and Support Vector Machines in sequence classification has reached a degree of maturity in a number of areas of Computational Biology. However, no generic tool with an intuitive interface for wet-lab researchers is available. SKM provides an easy-to-use interface to (a) train a classifier based on sets of nucleic acid or protein sequences and (b) apply the classifier to previously unclassified sequences. The underlying kernel is based either on frequencies of words of a user-defined length or on frequencies of sets of words (of the same length) that differ by one letter (the mismatch kernel). While SKM is entirely generic, the focus of the package is the classification of transcription factor binding sites, and benchmarking has been carried out using the HTPSELEX data set. SKM is freely available and can be downloaded from http://gene.cs.rhul.ac.uk/SKM. The package has been tested on Linux, Mac OS X and Windows environments and is written in Python and Java.
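
The word-frequency (spectrum) kernel underlying such classifiers can be written in a few lines. This is a generic sketch, not the SKM code; the mismatch variant would additionally credit words differing by one letter.

```python
# Generic k-spectrum kernel sketch: the kernel value between two sequences is
# the dot product of their k-mer count vectors (the mismatch kernel would
# also match k-mers differing by one letter).
from collections import Counter

def kmer_counts(seq, k):
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(s1, s2, k=3):
    c1, c2 = kmer_counts(s1, k), kmer_counts(s2, k)
    return sum(c1[w] * c2[w] for w in c1)

print(spectrum_kernel("GATTACAGATTACA", "GATTACA"))  # 10
```

A precomputed matrix of such kernel values over the training sequences is what an SVM with a custom kernel consumes.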
 
Poster U12
P-value based motif identification using positional weight matrices

Holger Hartmann Ludwig Maximilians University Munich
Eckhart Guthöhrlein (Ludwig Maximilians University Munich, Gene Center); Johannes Söding (Ludwig Maximilians University Munich, Gene Center);
 
Short Abstract: Transcription factors are key regulators in every biological network. To understand the regulatory mechanisms, knowledge of the intrinsic affinities of these factors is crucial. Many computational approaches have been developed to reveal these affinities from experimental data; however, none has combined the high descriptive power of positional weight matrices (PWMs) with a solid statistical framework.

We have developed XXmotif (eXhaustive identification of matriX motifs), which calculates the significance of any candidate PWM to be overrepresented in a set of sequences compared to a reference set. Due to a fast exhaustive search of candidate motifs and a P-value driven PWM refinement step, the affinity matrices of all significant motifs in the set can be found in a single run.

In benchmarks on ChIP-chip data, XXmotif outperforms state-of-the-art tools (e.g. MEME, Weeder, PRIORITY, cERMIT) in both the number of correctly identified motifs and the quality of the motif PWMs. On a set of segmentation modules in D. melanogaster, XXmotif detects most of the key regulators, as well as some new motifs that might be important in fly segmentation.
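
A drastically simplified version of overrepresentation scoring is a one-sided binomial test on how often a motif occurs in the input set versus its background probability; XXmotif's actual P-value machinery for PWMs is more involved, and the counts below are invented.

```python
# Simplified overrepresentation test: P(X >= k) for k motif-containing
# sequences out of n given background hit probability p (one-sided binomial
# tail; a stand-in for the PWM P-values computed by tools like XXmotif).
from math import comb

def binom_sf(k, n, p):
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# motif found in 40 of 100 input sequences vs. background probability 0.2
pval = binom_sf(40, 100, 0.2)
print(f"{pval:.2e}")
```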
 
Poster U13
Alignment of membrane protein sequences with AlignMe

Marcus Stamm Max Planck Institute of Biophysics
Lucy Forrest (Max Planck Institute of Biophysics) Rene Staritzbichler (Max Planck Institute of Biophysics, Computational Structural Biology); Kamil Khafizov (Max Planck Institute of Biophysics, Computational Structural Biology); Lucy R Forrest (Max Planck Institute of Biophysics, Computational Structural Biology);
 
Short Abstract: Accurate alignments between two protein sequences are crucial for identifying relationships and for generating high-quality homology models. The latter is a particular issue for membrane proteins, for which homology models are often required because of the difficulty of resolving their structures experimentally. Due to their hydrophobic membrane environment, integral membrane proteins have properties distinct from globular proteins, including hydrophobicity, amino acid distribution and evolutionary substitution rates. However, the majority of existing sequence alignment methods were optimized on data sets of globular proteins. Here we present a pairwise sequence alignment program called AlignMe, developed and optimized specifically for the alignment of transmembrane proteins. Optimization was carried out against structure-based reference alignments of α-helical membrane proteins from the HOMEP2010 (HOmologous MEmbrane Proteins) dataset. To determine the accuracy of an alignment, we introduce two novel scoring schemes that consider the number of correctly aligned residues as well as the shift of amino acids. A number of biologically relevant input parameters were tested for the construction of the alignments. We found that a combination of a substitution matrix with a hydrophobicity scale and a secondary structure prediction resulted in the most accurate sequence alignments, especially if gaps in hydrophobic (and presumably evolutionarily conserved) segments were treated differently from those in hydrophilic segments. AlignMe alignments contain significantly smaller shifts than those from other alignment programs, and are therefore particularly useful for homology modelling. AlignMe can be used locally or via a user-friendly webserver, available at: www.forrestlab.org/AlignMe.
 
Poster U14
Toolbox for the analysis of Roche 454 Sequencing data

Christian Ruckert University of Münster
Hans-Ulrich Klein (University of Münster, Institute of Medical Informatics); Christoph Bartenhagen (University of Münster, Institute of Medical Informatics);
 
Short Abstract: R/Bioconductor provides many packages for sequence analysis. However, methods for data from Roche's Genome Sequencer FLX system are rare. We therefore developed a freely available toolbox of methods to read in raw data, control quality and annotate structural variants. Special attention was paid to reproducing and complementing the quality control features of the GS Run Browser, adding functionality to assess long-term quality over a series of runs.

The first step of the pipeline is to read in Roche's binary Standard Flowgram Format (SFF) files, which contain flowgrams for individual reads, the basecalled read sequences and per-base quality scores. These files are processed using C subroutines for speed purposes and are afterwards converted to the standard R/Bioconductor data structures for sequencing data. Next, quality analysis is carried out comprising summary statistics like average read length, base quality or GC-content as well as statistics for each base position summarized over all reads of one lane. Furthermore, sequence complexity and different kinds of duplications are reported. The function to annotate structural variants creates a report in HTML format.

Our toolbox allows quality reports to be stored in a PostgreSQL database. To give researchers without R programming skills easy access to these reports, a Google Web Toolkit based front end was implemented that allows users to monitor and compare the quality of different sequencing runs. In addition, our software is useful for researchers without access to the Roche software who have their specimens sequenced in out-of-house facilities.
 
Poster U15
Comrad: Detection of expressed rearrangements by integrated analysis of RNA-Seq and low coverage genome sequence data

Andrew McPherson Simon Fraser University
Chunxiao Wu (Vancouver Prostate Centre, Laboratory for Advanced Genome Analysis); Iman Hajirasouliha (Simon Fraser University, Computing Science); Fereydoun Hormozdiari (Simon Fraser University, Computing Science); Faraz Hach (Simon Fraser University, Computing Science); Anna Lapuk (Vancouver Prostate Centre, Laboratory for Advanced Genome Analysis); Stanislav Volik (Vancouver Prostate Centre, Laboratory for Advanced Genome Analysis); Sohrab Shah (BC Cancer Agency, Centre for Translational and Applied Genomics); Colin Collins (Vancouver Prostate Centre, Laboratory for Advanced Genome Analysis); S. Cenk Sahinalp (Simon Fraser University, Computing Science);
 
Short Abstract: Both paired end whole transcriptome shotgun sequencing (RNA-Seq), and paired end Whole Genome Shotgun Sequencing (WGSS), have been used to discover rearrangements in tumour genomes. However, to date no method exists that leverages both RNA-Seq and WGSS data for accurate discovery of rearrangements and their associated fusion transcripts.

We present Comrad, a novel algorithmic framework for the integrated analysis of RNA-Seq and WGSS data for the purposes of discovering genomic rearrangements and aberrant transcripts. The Comrad framework leverages the advantages of both RNA-Seq and WGSS data, providing accurate classification of rearrangements as expressed or not expressed and accurate classification of the genomic or non-genomic origin of aberrant transcripts. A major benefit of Comrad is its ability to accurately identify aberrant transcripts and associated rearrangements using low coverage genome data.

We have applied Comrad to the discovery of gene fusions and read-throughs in prostate cancer cell line C4-2, a derivative of the LNCaP cell line with androgen-independent characteristics. As a proof of concept we have rediscovered in the C4-2 data 4 of the 6 fusions previously identified in LNCaP. We also identified 6 novel fusion transcripts and associated genomic breakpoints, and verified their existence in LNCaP, suggesting that Comrad may be more sensitive than previous methods that have been applied to fusion discovery in LNCaP. We show that many of the gene fusions discovered using Comrad would be difficult to identify using currently available techniques.
 
Poster U16
The Interlaboratory Robustness of Next-Generation Sequencing (IRON) Study

Hans-Ulrich Klein University of Münster
Katja Hebestreit (University of Münster, Institute of Medical Informatics); Alexander Kohlmann (MLL, Munich Leukemia Laboratory); Sandra Weissmann (MLL, Munich Leukemia Laboratory); Silvia Bresolin (University of Padova, Department of Pediatrics); Tracy Chaplin (Queen Mary University, Barts Cancer Centre); Harry Cuppens (University Hospital Leuven, Center for Human Genetics); Bernardo Garicochea (HSL, Hospital São Lucas da PUCRS); Vera Grossmann (MLL, Munich Leukemia Laboratory); Bozena Hanczaruk (Roche, 454 Life Sciences); Katja Hofer (Österreichisches Rotes Kreuz, Blutzentrale Linz); Ilaria Iacobucci (University of Bologna, Department of Hematology and Oncologic Sciences); Joop Jansen (Radboud University, Medical Centre); Truus te Kronnie (University of Padova, Department of Pediatrics); Louis van de Locht (Radboud University, Medical Centre); Giovanni Martinelli (University of Bologna, Department of Hematology and Oncologic Sciences); Kim McGowan (Roche, 454 Life Sciences); Stephanie Stabentheiner (Österreichisches Rotes Kreuz, Blutzentrale Linz); Bernd Timmermann (MPI, Max-Planck Institut für molekulare Genetik); Peter Vandenberghe (University Hospital Leuven, Center for Human Genetics); Bryan Young (Queen Mary University, Barts Cancer Centre); Torsten Haferlach (MLL, Munich Leukemia Laboratory); Martin Dugas (University of Münster, Institute of Medical Informatics);
 
Short Abstract: Massively parallel pyrosequencing enables deep-sequencing to detect molecular aberrations. However, only limited data is available on the precision and robustness of this technology. Here, we studied the reproducibility of Roche 454 Sequencing across 10 laboratories from 8 countries.

The participating laboratories received specimens from 18 chronic myelomonocytic leukemia patients. 31 amplicons covering exonic regions of three different genes were selected for sequencing. Sample processing and sequencing were performed separately by each laboratory. Bioinformatic analyses were performed by an independent centre. Methods for quality assessment and comparison of the results were implemented in R and published as a Bioconductor package.

First, the achieved amplicon coverages were compared. The median coverage across all 31 amplicons differed significantly between laboratories and ranged from 541-fold to 872-fold. Second, we focused on the measured relative frequencies of detected mutations. In total, 92 mutations were observed among all 18 samples. An equivalence test revealed that there was no considerable bias in the measured relative frequencies of these mutations between any centres (p<0.05). The 95% confidence interval of the standard deviation of the relative frequencies was [2.9%, 3.2%].

The different coverages observed between laboratories may be caused by varying experience with the sequencing technology. However, we could not detect a bias in the measured relative frequencies of the mutations and the estimated standard deviation was only 3%. Hence, this technology together with standardized bioinformatic analyses is suitable for a clinical diagnostic setting.
 
Poster U17
A systematic and agnostic learning method for identifying sequence motifs relevant to protein subcellular localization

Kuo-Bin Li National Yang-Ming University
Shou-Cheng Yen (National Yang-Ming University, Institute of Biomedical Informatics);
 
Short Abstract: This poster describes a novel method for identifying sequence motifs to predict protein subcellular localizations. Most existing methods rely either on prior knowledge about protein targeting signals or on sophisticated residue compositions that often do not provide clear insight. We propose a systematic approach to identify signature motifs without using prior knowledge. We concentrated on the localizations that are traditionally more difficult to predict. For proteins within those localizations, we investigated all sequence motifs (length < 8) represented by a reduced amino acid alphabet. Each motif was then subjected to a statistical test to determine whether it has a distinct occurrence frequency for proteins in a specific localization. The identified sequence motifs were further extended on both ends to increase their length, resulting in eight motifs for five localizations. Three of the motifs have never been applied to the prediction of localization: (1) the [WFY][AVLI][AVLI]KNS[WFY] motif, a lysosome-specific motif found in the cathepsin protease active site; (2) a RERXXER motif exclusive to peroxisomal proteins; and (3) an enriched CGHC motif present exclusively in ER proteins. These results facilitate the implementation of more accurate prediction tools for lysosomal, peroxisomal and ER proteins, three challenging localizations. By extending to proteins located in other subcellular compartments and using a wider range of physicochemical properties, our discovery-oriented approach fills the gaps left by current studies in this field.
 
Poster U18
A dotplot method for motif discovery: the importance of outlier alignments in the score-abundance relationship.

Kazuhito Shida Tohoku University
 
Short Abstract: De novo motif discovery algorithms essentially seek gapless alignments whose alignment scores are disproportionately high relative to the statistical abundance expected under a given background model.
However, precisely determining the maximally "disproportionate" alignment requires calculating precise P-values under background models for numerous alignments, which is rarely feasible.

An extensive numerical experiment reveals that a particular definition of score and abundance yields a simpler relationship than the definitions used in typical motif analysis.
Under this definition, the mean alignment score is roughly proportional to the alignment abundance, which means the most statistically unusual alignment (possibly the biologically correct motif) can be approximately detected by a basic linear classifier.

This discovery method is further simplified, implemented in Perl 5, and tested on the human portion of the benchmark proposed by Tompa et al.
After preprocessing with RepeatMasker, every W-mer in the input is used as a "seed" for a heuristic method that obtains locally optimal alignments under a ZOOPS-like occurrence model.
Then, using the above-mentioned special definitions, abundance and alignment score are evaluated for these selected alignments, which are organized into a 2D dot plot.
Finally, the outlier dot toward the upper-left corner is manually picked as the end result.
Surprisingly, this very primitive method performs at least as accurately as the best reported algorithms (MEME3, Weeder, YMF, etc.).

Also being developed are automated outlier detection, automated setting of parameters (e.g. W), and a stochastic version (re-weighted Gibbs sampling) of this method. An open-source release and a manuscript are also in preparation.
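As a sketch of the dot-plot step, assuming the near-proportional score-abundance relationship described above, the most "disproportionate" alignment can be picked as the point with the largest positive residual from a line fitted through the origin (all numbers below are hypothetical):

```python
# Hypothetical (abundance, mean alignment score) pairs for candidate
# alignments.  Under the special definitions, score grows roughly
# linearly with abundance, so the motif stands out as the point
# farthest above the fitted line (the upper-left outlier).
points = [
    (10, 12.1), (20, 24.3), (30, 29.8), (40, 41.0),
    (15, 30.0),   # candidate motif: unusually high score for its abundance
    (50, 51.2),
]

# Least-squares fit of score = a * abundance through the origin.
a = sum(x * y for x, y in points) / sum(x * x for x, _ in points)

# Pick the alignment with the largest positive residual.
outlier = max(range(len(points)), key=lambda i: points[i][1] - a * points[i][0])
print(outlier)  # 4: the planted outlier
```

The poster performs this final pick manually; the automated variant mentioned above would replace the `max` over residuals with a proper outlier test.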
 
Poster U19
Zebu genome sequencing and analysis using second generation technology

Guilherme Oliveira FIOCRUZ
Adhemar Zerlotini (FIOCRUZ, CEBio); Flávio Araújo (FIOCRUZ, LPCM); Betânia Drumond (UFJF, Virology); Izinara Rosse (UFMG, Genetics); Sara Cuadros-Orellana (FIOCRUZ, CEBio); Beatriz Lopes (ABCZ, ABCZ); Elizângela Guedes (EMBRAPA, CNPGL); Wagner Arbex (EMBRAPA, CNPGL); Marco Antônio Machado (EMBRAPA, CNPGL); Maria Gabriela Peixoto (EMBRAPA, CNPGL); Rui Verneque (EMBRAPA, CNPGL); Marta Guimarães (EMBRAPA, CNPGL); Roney Coimbra (FIOCRUZ, CEBio); Maria Raquel Carvalho (UFMG, Genetics); Marcos Vinícius Silva (EMBRAPA, CNPGL); Guilherme Oliveira (FIOCRUZ, CEBio);
 
Short Abstract: The Brazilian cattle herd is composed mainly of zebu breeds and of crossbreeds between taurine and zebu breeds. During the last decades in Brazil, traditional genetic evaluations have guaranteed considerable gains in dairy production and in resistance to diseases and parasites. Although effective, these methods do not shed light on the biological processes underlying the observed results. The inclusion of genetic markers in breeding programs would produce 30 to 50 percent more genetic gain than a traditional progeny test system at a similar or lower cost. Aiming to identify genetic polymorphisms in the zebu genome of the dairy Gyr breed, we constructed mate-paired genomic libraries with 1-2 kb inserts. Fifty bp-long reads produced with the SOLiD V3 platform were mapped to the reference genome of a female Bos taurus (NCBI Project ID: 10708) using BioScope. SAMtools was used to generate the consensus sequence and to identify SNPs. The first two SOLiD v.3 runs yielded 204 million 50 bp-long reads, representing ~1.18X observed coverage of the reference genome. An initial comparative analysis was performed for six genes related to dairy production, and we observed low identity in coding regions compared to their Bos taurus orthologs. Twenty-one new SNPs were identified on the chromosomes that harbor these genes, especially on chromosome 6 (19 SNPs), mainly in non-coding regions, except for one within the Fibulin-7 gene on chromosome 11. The identification of SNPs in dairy Gyr cattle may improve the efficiency of the next version of genotyping chips to be used for dairy zebu genomic selection in Brazil.
 
Poster U20
Improved performance of sequence search algorithms in remote homology detection

Adwait Joshi National Centre for Biological Sciences (TIFR)
Ramanathan Sowdhamini (National Centre for Biological Sciences (TIFR))
 
Short Abstract: Remote homology detection from sequence information alone is applicable to the analysis of genome databases, but is highly challenging due to sequence dispersion within protein families. Iterative profile-based sequence search algorithms and methods that employ Hidden Markov Models are quite effective in detecting remote homologues; however, they seldom achieve full coverage. In this study, we have compared two such methods: an iterative profile-based search, Position Specific Iterative BLAST (PSI-BLAST), and a motif-initiated constrained profile-based search, Pattern Hit Initiated BLAST (PHI-BLAST). We have integrated various strategies for achieving high coverage, including multiple queries (for PSI-BLAST) and multiple motifs (for PHI-BLAST). We tested the strategies on 12 protein structural superfamilies from the PASS2 database (Bhaduri and coworkers, BMC Bioinformatics, 5, 35), which corresponds directly to SCOP but includes members with <40% mutual sequence identity. After these searches, followed by validation using a Hidden Markov Model library of PASS2 superfamily members, coverage at the superfamily level was analyzed. Sequence searches driven by multiple motifs per query through PHI-BLAST clearly outperform PSI-BLAST, suggesting its utility for better coverage in remote homology detection. The findings reveal saturation of the number of homologues obtained for the multiple-query, multiple-motif approach through PHI-BLAST compared with the multiple-query approach of PSI-BLAST, followed by the best-performing query in both methods. Although this would be the best approach, owing to its computational expense a best-performing query with its multiple motifs employed through PHI-BLAST can mitigate the trade-off between coverage and computational time.
 
Poster U21
New method of detecting pathogenic viruses using high-throughput sequencing data

Daisuke Komura The University of Tokyo
Shumpei Ishikawa (The University of Tokyo)
 
Short Abstract: Massively parallel sequencing technology enables us to efficiently discover new viruses, such as the clonal infection of polyomavirus in Merkel cell carcinoma. Conventional methods assume that such viruses belong to a known virus family and therefore share closely related sequences; however, these methods will overlook novel pathogenic viruses that do not share such sequences.
We present here a method to discover novel pathogenic viruses from sequencing data by searching for sequences homologous to human genes. The method is based on the fact that some pathogenic viruses have genes homologous to human genes, and such genes play important roles in pathogenesis. For example, some retroviral and human oncogenes are homologous to each other, and the viral IL-6 protein encoded by human herpesvirus 8 facilitates neoplastic cell replication in Kaposi sarcoma. Simply searching for human-homologous sequences may lead to many false positives due to sequencing or mapping errors. In order to avoid this, we utilize information on pathogenic context, pathways, and protein domains. Our method will enable us to identify new pathogenic viruses and the genes related to pathogenesis simultaneously. We evaluate the usefulness of our approach on actual data sets from human disease tissues.
 
Poster U22
Perfect Hamming Code as Hash key for Fast Genome Mapping

Yoichi Takenaka Osaka University
Shigeto Seno (Osaka University, Bioinformatic Engineering); Hideo Matsuda (Osaka University, Bioinformatic Engineering);
 
Short Abstract: This poster is based on Proceedings Submission 61.

With the advent of next-generation sequencers, growing demand to map short DNA sequences to a genome has promoted the development of fast algorithms and tools. The tools commonly used today are based on either a hash table or the suffix array/Burrows-Wheeler transform (BWT). These algorithms are best suited to finding the genome positions of exactly matching short reads. However, they have limited capacity to handle mismatches: to find n mismatches, they require O(2^n) times the computation time of exact matching. Therefore, acceleration techniques are required.

We propose a hash-based method for genome mapping that reduces the number of hash references needed to find mismatches without increasing the size of the hash table. The method regards DNA subsequences as words over the Galois extension field GF(4), and each word is encoded as a code word of a perfect Hamming code. The perfect Hamming code defines equivalence classes of DNA subsequences: each equivalence class has a representative subsequence, and all 1-mismatch subsequences from the representative belong to the class.

The code word is used as a hash key to store these subsequences in a hash table.
Specifically, the method reduces by about 70% the number of hash keys necessary for searching the genome positions of all 2-mismatches of a 21-base-long DNA subsequence. The method can also be applied to BWT-based genome mapping.
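The poster's construction works over GF(4); as a stdlib-only illustration of the same idea, the binary perfect Hamming(7,4) code below partitions all 7-bit words into classes consisting of one codeword plus its seven 1-bit neighbours, so the decoded codeword serves as a shared hash key for a representative and all of its 1-mismatch variants:

```python
def decode_hamming74(bits):
    """Syndrome-decode a 7-bit word to its nearest Hamming(7,4)
    codeword (bits[0] corresponds to position 1)."""
    syndrome = 0
    for pos, b in enumerate(bits, start=1):
        if b:
            syndrome ^= pos            # XOR of the positions holding a 1
    corrected = list(bits)
    if syndrome:                       # nonzero syndrome names the bad bit
        corrected[syndrome - 1] ^= 1
    return tuple(corrected)

codeword = (1, 0, 1, 1, 0, 1, 0)       # syndrome 0, i.e. a valid codeword
keys = set()
for i in range(7):                     # every 1-mismatch neighbour...
    mutant = list(codeword)
    mutant[i] ^= 1
    keys.add(decode_hamming74(tuple(mutant)))
print(keys == {codeword})  # True: one hash key covers the whole class
```

Looking up a read under its decoded codeword therefore retrieves the representative and its entire 1-mismatch class with a single hash reference, which is the saving the method exploits.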
 
Poster U23
Patome: a database for biological sequence annotation in patents

Byungwook Lee Korea Research Institute of Bioscience and Biotechnology
 
Short Abstract: Patent biosequences (the PAT division) account for 5% of GenBank, making it the third largest division by number of entries. However, they have attracted relatively little attention compared to other major sequence resources. We have built a database server called Patome, which contains annotation and analysis information for patent biosequences from GenBank, the European Molecular Biology Laboratory (EMBL), and the DNA Data Bank of Japan (DDBJ). The aims of Patome are to assign biological functions to the patented biosequences and to provide information on the patent relationships of a particular gene or disease. The patented biosequences were annotated with RefSeq. Two kinds of analysis maps, gene-patent and disease-patent maps, were built from the annotated data and gene information of major model organisms. We also present a classification of genes associated with patent biosequences according to the hierarchical structure of the Gene Ontology (GO). The map information can be used to determine whether a particular gene, disease, or GO term is patented or patent-related. Patome is available at http://www.patome.org/; the information is updated bimonthly.
 
Poster U24
fastapl -- a utility for versatile and easy processing of multifasta format data

Paul Horton AIST, Computational Biology Research Center
 
Short Abstract: Processing of sequence and annotation data in multifasta format is a common task in bioinformatics. Existing resources to facilitate this task either provide a predefined set of configurable scripts to cover many common cases (e.g. EMBOSS tools) or provide libraries (e.g. BioPerl) which can be called from user programs.

We present a new and complementary approach, motivated by the observation that the tasks users require are often simple but ad hoc, and would be almost trivial to do with standard Linux tools such as grep, sed, wc, etc., except that these tools are line based rather than fasta record based.

Our tool fastapl (FASTA Perl Loop, pronounced like "fasta apple") provides functionality analogous to perl invoked with its -n and related switches, but looping over fasta records instead of lines. This is most easily described by listing examples:

Reformat sequence lines to have max line length 100.
% fastapl -p -l 100

Truncate sequences to maximum sequence length of 39.
% fastapl -p -e '$seq = substr( $seq, 0, 39 )'

Reverse complement DNA sequences.
% fastapl -p -e '$seq = reverse $seq; $seq =~ tr/acgtACGT/tgcaTGCA/'

Print records of sequences not starting with methionine.
% fastapl -g -e '$seq !~ /^M/'

Randomly shuffle sequences.
% fastapl -M 'List::Util qw(shuffle)' -p -e '$seq = join( "", shuffle(@seq) )'

Sort records by id.
% fastapl --sort -e '$id1 cmp $id2'


More examples, documentation, and the source code are available at http://seq.cbrc.jp/fastapl.
 
Poster U25
Behind the significance in NGS reads assembly

Antonio Muñoz University of Málaga
Javier Rios (University of Málaga, Computer Architecture); Hicham Benzekri (University of Málaga, Computer Architecture); Oswaldo Trelles (University of Málaga, Computer Architecture);
 
Short Abstract: Sequence comparison has traditionally been the best way to determine the relationship between two nucleotide or protein sequences and to classify them on the basis of their homology. Using this strategy, we have developed a system to infer the relative location of two reads coming from Next Generation Sequencing (NGS), that is, whether the two sequences should be assembled together to form a longer sequence. To determine the significance of each sequence pair, we reproduced the experiments developed by Rost, Schneider, and Altschul, in which the similarity threshold depends on sequence length, but applied them to a completely different scenario. In this case we work with nucleotide sequences instead of protein sequences, the sequences come from a single organism instead of different ones, and we look for similarities belonging to the same region, so the sequences must be identical except for SNPs, SSRs and sequencing errors. Establishing the significance curve parameters allows us to identify efficiently whether two sequences belong to the same location during assembly, which reduces computation time. Based on the estimated curve parameters, we propose a system that reduces CPU time and memory requirements in the sequence assembly process by shrinking the computational space. Using this quick procedure, our experimental results with a set of 1000 reads show that 86% of the comparisons are not actually needed to reproduce accepted results in a 10x coverage assembly.
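The fitted curve parameters are not given in the abstract; the sketch below shows how a length-dependent significance threshold could gate read comparisons. The functional form and the parameters a, b, c are placeholders, not the estimated values:

```python
def identity_threshold(length, a=75.0, b=200.0, c=0.5):
    """Hypothetical significance curve: the %identity required for an
    overlap of `length` bases to be considered the same genomic
    location.  Shorter overlaps must be more identical."""
    return min(100.0, a + b * length ** (-c))

def same_location(identity_pct, overlap_len):
    """Accept a read pair for assembly only above the curve."""
    return identity_pct >= identity_threshold(overlap_len)

print(same_location(99.0, 400))  # True: long, near-identical overlap
print(same_location(90.0, 50))   # False: short overlaps need more identity
```

Comparisons falling clearly below the curve can be discarded before any expensive alignment, which is the source of the reported reduction in computational space.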
 
Poster U26
Computational Analysis of DNA Replicases in Double-Stranded DNA Viruses: Relationship with the Genome Size

Darius Kazlauskas Vilnius University
Ceslovas Venclovas (Vilnius University, Institute of Biotechnology);
 
Short Abstract: Fast and accurate genome duplication in free-living cellular organisms depends on processive DNA synthesis. This task is performed by DNA replicases, which always include a replicative DNA polymerase, a DNA sliding clamp and a clamp loader. However, double-stranded (ds) DNA viruses often lack one or more components of a typical DNA replicase. For viruses with tiny genomes this may be the result of limited genome coding capacity. Therefore, as increasing viral genome size alleviates coding capacity constraints, it might be expected that replicase processivity properties become important. To address this hypothesis we performed a detailed computational analysis of DNA replicases in the available genomes of dsDNA viruses. Using sensitive homology detection methods we identified putative DNA replicase components. During this step we discovered highly divergent B-family DNA polymerases in phiKZ-like phages, previously thought to depend entirely on host replication proteins. Additionally, we identified previously unknown remote sliding clamp homologs in the Ascoviridae family and the Ma-LMM01 phage. We found a clear dependency between the nature of DNA replicases and viral genome size. Once viral genomes become relatively large, all viruses encode their own DNA polymerases. With further increases in genome size we observed a higher frequency of both known and novel evolutionary solutions to the DNA polymerase processivity problem. The detected distribution patterns of DNA replicase components may be useful for the analysis of new viral genomes, while the observed properties of these components may also be relevant for studies of the corresponding cellular counterparts.
 
Poster U27
A novel miRNA prediction method based on next generation sequencing technology

Kui Qian University of Helsinki
Eeva Auvinen (University of Helsinki, Haartman Institute); Dario Greco (Karolinska Institutet, Department of Bioscience and Nutrition at Novum); Petri Auvinen (University of Helsinki, Institute of Biotechnology);
 
Short Abstract: MicroRNAs (miRNAs) are short single-stranded RNA molecules that have been demonstrated to play an important role in the regulation of gene expression in many organisms. With the advantages of next-generation sequencing, new opportunities have arisen for identifying and quantifying miRNAs and investigating their functions.
We will discuss a novel miRNA prediction method that utilizes miRNA-sequencing results from both the Illumina and SOLiD sequencing platforms. The method uses both the mapped reads and reference sequences to generate candidate miRNA stem-loop sequences. Because of the varied nature of miRNAs in different species (e.g., the lengths of the stem-loop sequences, or the complementarity between mature and star miRNAs), prediction requires customized parameters; our method is flexible and can be fitted to the needs of the user by adjusting these parameters. Furthermore, biological annotation information and read counts for each candidate are included in the output of the analysis. Since the method is executable in R, it is not dependent on the operating system. The outputs are in fasta format and can be used as input for other RNA structure prediction software, such as RNAfold. The method provides a user-friendly way to check miRNA-seq results and is free for academic use.
 
Poster U28
Sequence alignment of coiled coil proteins: adapting to restricted sequence variation

Michael Kuhn Technische Universität Dresden
Andreas Beyer (Technische Universität Dresden, Biotec); Anthony A. Hyman (Max Planck Institute for Molecular Cell Biology and Genetics)
 
Short Abstract: Coiled coil proteins are an important part of the cytoskeleton and the centrosome. In addition, coiled coil domains have recently been implicated in the aggregation of proteins leading to neurodegenerative diseases. Due to the coiled coils' biased amino acid composition, traditional sequence alignment programs fail to correctly identify homologous proteins: helices that form coiled coils have a regular, repeating pattern of hydrophobic, charged, and hydrophilic amino acids. This so-called heptad repeat of seven residues causes traditional sequence alignment algorithms to over-estimate the significance of the observed sequence similarity, leading to incorrectly predicted homologous proteins. Here, we present a method that takes the space of possible substitutions into account, thereby greatly reducing the number of false positives. To align two sequences, we predict the location and register of coiled coil helices and use register-specific scoring matrices. For benchmarking, we use the manually curated KOG database. In order to simulate proteins with little conservation outside coiled coil domains, we truncate proteins to their coiled coil helices plus a linker of ten residues. Of the reciprocal best hits detected by our method, 83% are homologous, compared to 72% for BLAST. At the same time, the sensitivity is almost the same (24% vs. 25% of all possible reciprocal best hits between homologous proteins). We identify homologous proteins across 53 metazoan species using the eggNOG pipeline, significantly improving the study of the evolutionary history of coiled coil proteins.
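A toy sketch of register-specific scoring (the weights are illustrative, not the poster's actual matrices): identities at the hydrophobic core positions a and d occur in essentially all coiled coils, so they should count for less than identities at surface positions:

```python
HEPTAD = "abcdefg"
POSITION_WEIGHT = {"a": 0.5, "d": 0.5,            # core: down-weighted
                   "b": 1.0, "c": 1.0, "e": 1.0, "f": 1.0, "g": 1.0}

def cc_score(seq1, seq2, register_start=0):
    """Score an ungapped alignment of two coiled-coil segments whose
    first column has heptad register HEPTAD[register_start]."""
    score = 0.0
    for i, (x, y) in enumerate(zip(seq1, seq2)):
        pos = HEPTAD[(register_start + i) % 7]
        score += POSITION_WEIGHT[pos] * (1.0 if x == y else -1.0)
    return score

# A perfect 7-residue match scores 6.0, not 7.0: the two core
# identities (positions a and d) are discounted.
print(cc_score("LAALKEK", "LAALKEK"))  # 6.0
```

The method described above replaces these scalar weights with full register-specific substitution matrices, but the principle, discounting similarity that the heptad repeat makes inevitable, is the same.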
 
Poster U29
Computational prediction of transposon insertion sites in bacterial genomes

Louisa Roselius TU Braunschweig
Michael Steinert (TU Braunschweig) Olga Shevchuk (TU-Braunschweig, Microbiology); Richard Münch (TU-Braunschweig, Microbiology); Dieter Jahn (TU-Braunschweig, Microbiology); Johannes Klein (TU-Braunschweig, Microbiology); Svitlana Yarmolinetz (TU-Braunschweig, Microbiology);
 
Short Abstract: Transposon (Tn) mutagenesis is an important tool that facilitates gene function studies in genetics.
The crucial step of this approach is the localization of the Tn-insertion site, which is time consuming and needs intensive optimization.
To improve the efficiency of these experiments, we developed the bioinformatics tool 'InFiRe' (Insertion Finder with Restriction enzymes).

InFiRe allows the accurate computational prediction of the Tn-insertion site in sequenced genomes.

The method is based on simple restriction digestions of genomic DNA in combination with Southern blotting.
In the first step, the sizes of fragments carrying the transposon insertion are identified by hybridization with a transposon-specific probe.
In the second step, the most probable genomic position of the transposon is calculated using the derived fragment size pattern. The theoretical number of different digestions and restriction enzymes required is estimated in a statistical approach.

We successfully demonstrate the application of the algorithm in a case study using mini-Tn10 transposon-mutagenized Legionella pneumophila.
We show that the outlined approach significantly accelerates the transposon identification step after screening of transposon libraries.
The method opens the possibility of scaling this procedure up to a new high-throughput technology.

The software is implemented as a stand-alone R package. In addition, a web interface to the program is available at http://www.infire.tu-bs.de/.
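A minimal, stdlib-only sketch of the underlying search (hypothetical data and tolerances; InFiRe's actual statistics are more involved): a candidate insertion position must lie, for every digest, in a genomic fragment whose size plus the transposon length matches the observed Southern-blot fragment size:

```python
import bisect

def fragment_size_at(cut_sites, pos):
    """Size of the genomic restriction fragment containing `pos`
    (cut_sites sorted and including 0 and the genome length)."""
    i = bisect.bisect_right(cut_sites, pos)
    return cut_sites[i] - cut_sites[i - 1]

def candidate_positions(digests, tn_len, genome_len, tol=50, step=100):
    """Scan genome positions; keep those consistent with every digest.
    digests: (sorted cut sites incl. 0/genome_len, observed size) pairs."""
    hits = []
    for pos in range(0, genome_len, step):
        if all(abs(fragment_size_at(cuts, pos) + tn_len - size) <= tol
               for cuts, size in digests):
            hits.append(pos)
    return hits

# Hypothetical example: 10 kb genome, 1.5 kb transposon, two digests.
d1 = ([0, 2000, 5000, 10000], 4500)   # fits a 3000 bp fragment + 1500 bp Tn
d2 = ([0, 4000, 7000, 10000], 4500)
print(candidate_positions([d1, d2], 1500, 10000))  # [4000, 4100, ..., 4900]
```

Each additional digest intersects the candidate set further, which is why the number of digestions needed can be estimated statistically as described above.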
 
Poster U30
Improving SNV calling sensitivity by probabilistically combining SNVs reported by different calling tools

Alejandro Balbin University of Michigan
Alexey Nesvizhskii (Assistant Professor/University of Michigan, Pathology); Arul Chinnaiyan (S.P. Hicks Endowed Professor/University of Michigan, Pathology); Sameek Roychowdhury (Research Fellow/University of Michiga, Pathology); Xuhong Xao (Research Staff/University of Michigan, Pathology); Dan Robinson (Research Fellow/University of Michigan, Pathology); Yi-Mi Wu (Research Fellow/University of Michigan, Pathology); Matthew Iyer (University of Michigan, Pathology);
 
Short Abstract: Exome sequencing has taken a central role in personalized medicine by identifying "driver" mutations that cause disease. Consequently, several computational tools have been developed for calling single nucleotide variants (SNVs). One pressing computational challenge is the large number of false-positive calls produced by these tools, and the need to define a small set of high-confidence SNVs. The algorithms rely on pre-processing of aligned sequences and filtering of called SNVs according to heuristics such as SNV quality score and depth of coverage. Remarkably, yet expectedly, the number and quality of SNVs predicted by different tools is variable. Therefore, large projects such as 1000 Genomes and GENCODE have increasingly relied on the use of multiple SNV calling algorithms and the application of a majority rule. In this work, we analyzed sets of SNVs generated by applying two widely employed algorithms, GATK and SAMtools/BCFtools, on a cohort of six deeply sequenced exomes. The analysis revealed important discrepancies in the number of false positives produced by each algorithm, and showed that SNV calls are significantly affected by differences in the realignment and recalibration procedures implemented in these tools. Although invoking a naïve majority rule, such as selecting only SNVs called by both algorithms, improved the rate of true positives, it still left out a considerable proportion of true positives that were called by only one of the methods. This led us to further explore the data and investigate new approaches for improving SNV calling sensitivity by probabilistically combining SNVs reported by different calling methodologies.
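One simple way to go beyond the intersection rule is sketched below under a naive independence assumption; the thresholds and the missing-call penalty are illustrative, not the poster's model:

```python
def phred_to_perror(q):
    """Convert a Phred-scaled quality to an error probability."""
    return 10 ** (-q / 10)

def combined_confidence(quals, missed_penalty=0.5):
    """quals: per-tool Phred quality of a call, or None if the tool
    did not call the site.  Returns P(variant is real) assuming the
    tools err independently; a missing call contributes a fixed
    `missed_penalty` error probability instead of vetoing the site."""
    p_all_wrong = 1.0
    for q in quals:
        p_all_wrong *= phred_to_perror(q) if q is not None else missed_penalty
    return 1.0 - p_all_wrong

print(combined_confidence([30, 25]) > 0.999)   # both tools: very confident
print(combined_confidence([40, None]) > 0.999) # one confident call passes
print(combined_confidence([10, None]) > 0.999) # one weak call does not
```

Unlike the strict intersection, a site reported by a single tool survives here if that single call is confident enough, which is the behaviour needed to recover the singleton true positives described above.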
 
Poster U31
Integrating Protein Domain Prediction in Multiple Sequence Alignment

Layal Al-Ait Georg August University, Goettingen
Eduardo Corel (Georg August University, Goettingen, Bioinformatics);
 
Short Abstract: Most algorithms for multiple protein alignment are based on primary-sequence information alone. However, with the increasing amount of biological data that is available now, it is possible to use additional sources of information for improved multiple alignment.
In this study, we use the PFAM database of protein domains to improve the performance of the program DIALIGN. Prior to aligning a set of protein sequences, we search them against PFAM to find matching protein domains. The idea is that positions of the input sequences matched to the same position in a PFAM domain are likely to be homologous. By identifying segments of the input sequences that match the same segment of a PFAM domain, we obtain local alignments of the input sequences that should be part of a final multiple alignment of these sequences. Currently, we restrict ourselves to finding gap-free local pairwise alignments.
In general, it is not possible to include all local alignments found by our PFAM searches into a single multiple alignment. We therefore have to select a consistent subset of these local alignments, i.e. a set of alignments that fit into one output alignment. Currently, we solve this consistency problem by using a previously developed method which selects a consistent subset of our PFAM-based local alignments. These are afterwards given to DIALIGN as anchor points in order to include them in the final multiple alignment.
Testing carried out on BAliBASE 3 shows a considerable increase in the accuracy of our PFAM-based alignments, compared to the default version of DIALIGN.
 
Poster U32
A new algorithm for sequencing by hybridization based on non-classical approach

Marcin Radom Poznan University of Technology
Piotr Formanowicz (Poznan University of Technology, Institute of Computing Science);
 
Short Abstract: This poster is based on Proceedings Submission 82

Sequencing by hybridization (SBH) is one of the methods used to sequence DNA fragments. Despite the rapid development of next-generation sequencing technologies, methods like SBH could find applications in some areas of sequencing, e.g. in medical diagnostics, where information about long DNA sequences is not necessary. Moreover, resequencing of whole genomes is also an area where improvements to the SBH method may contribute to obtaining interesting and important biological results. Since SBH was proposed in 1989, many modifications and new algorithms have been introduced to enhance the method's capabilities. One such approach is to use universal probes/degenerate bases which, combined with normal nucleotides, define a set of oligonucleotides in each probe of the chip, in contrast to classical SBH, where each probe contains only one type of oligonucleotide.
In this work we propose a new algorithm that can reconstruct DNA fragments on the basis of a non-classical hybridization spectrum coming from a chip called Gapped. The chip uses unspecific bases in half of its probes in order to both enhance sequencing capabilities (especially the length of DNA fragments) and reduce the number of ambiguous solutions produced in the computation phase of the SBH method.
The main goal of this work is to describe the proposed algorithm for the Gapped Chip and, more importantly, to present the results of computational experiments in which the algorithm was tested on unambiguous reconstruction of long DNA fragments.
 
Poster U33
InterPro: New developments

Antony Quinn European Bioinformatics Institute
Alex Mitchell (European Bioinformatics Institute) Craig McAnulla (European Bioinformatics Institute, InterPro); Amaia Sangrador (European Bioinformatics Institute, InterPro); Sarah Burge (European Bioinformatics Institute, InterPro); Sarah Hunter (European Bioinformatics Institute, InterPro); Siew-Yit Yong (European Bioinformatics Institute, InterPro); Prudence Mutowo (European Bioinformatics Institute, InterPro); David Lonsdale (European Bioinformatics Institute, InterPro); Phil Jones (European Bioinformatics Institute, InterPro); Matthew Fraser (European Bioinformatics Institute, InterPro); Sebastien Pesseat (European Bioinformatics Institute, InterPro); Chris Hunter (European Bioinformatics Institute, InterPro); Maxim Scheremetjew (European Bioinformatics Institute, InterPro);
 
Short Abstract: InterPro is a database of signatures, or predictive models, which can be used to classify sequences into protein families and predict the presence of domains and sites. InterPro integrates signatures from eleven different databases (referred to as member databases), which have a range of biological focuses. By uniting the member databases, InterPro capitalises on their individual strengths, producing a powerful integrated diagnostic resource.

InterPro's web interface allows users to explore the database online, examining the biological entities represented by the entries, the UniProtKB proteins they match, the GO terms associated with them, and the way they relate to other entries within the database (the different domains associated with a particular protein family, for example). The interface also allows sequences to be searched against the database directly, using the InterProScan analysis pipeline.

Here we present the latest developments with the database and its interface.
 
Poster U34
Computational challenges in the production and analysis of Illumina Genome Analyzer data

Martin Kircher Max Planck Institute for evolutionary Anthropology
Janet Kelso (Max Planck Institute for evolutionary Anthropology, Evolutionary Genetics/Bioinformatics);
 
Short Abstract: Sequencing technologies, including the new high-throughput sequencing instruments, have specific limitations which affect data production and analysis. We analyzed Illumina sequencing data sets and illustrate different sources of computational challenges.

The quality of sequencing libraries, preparation of the sequencing chemistry and instrument handling strongly affect data quality. Particles like chemistry lumps, dust and lint in the sequencing chemistry can cause pseudo-sequence signals which result in artifactual reads or distorted sequence readouts. Additionally, reflections, air bubbles, uneven application of oil and an imperfectly calibrated instrument cause the quality of the data to vary considerably between sequencing runs and even between lanes of the same run. To account for these differences, the quality of each run needs to be assessed and quality scores calibrated. Identification of adapter and chimera sequences is not part of the standard processing and often hampered by higher error rates at the read ends as well as by short reads showing only a few adapter bases. Reads including adapter sequence are frequently excluded in read alignment; however some fraction is reported with incorrect alignment coordinates.

We present the following principles for good analysis practice: First, indexing/tagging, as well as filtering of low-complexity sequences, removes most non-adapter-related sequencing artifacts, and indexing can exclude library cross-contamination. Second, protocol-specific library artifacts such as adapter chimeras should be identified for each preparation protocol; we recommend using these for filtering library artifacts and for trimming starting adapters prior to downstream analyses. Third, calibrated Phred-like base quality scores should be used for quality-based read filtering or propagated through downstream analyses.
 
Poster U35
Simulating Quality and Depth of Sequence for Genotype Accuracy

Iain Bancarz Illumina UK
 
Short Abstract: The depth of aligned coverage and the quality scores of individual bases are vitally important to the accuracy of genome builds. Current practice for a normal diploid human genome is to sequence to a mean depth of 30. With >80% of bases having a predicted error probability of 0.1% or less (Phred quality score >= Q30), this allows high consensus accuracy.

Improved coverage uniformity and high basecall quality will allow depth to be reduced, enabling genomes to be read faster and at lower cost. Furthermore, data storage requirements could be substantially lowered by using fewer scoring bins. It is desirable to discover how these results can be achieved with minimal impact on genotype accuracy. This motivates an extensive investigation using Monte-Carlo sampling of simulated “stacks” of aligned basecalls and quality scores, which are input to a Bayesian genotype caller based on the SNP caller in the CASAVA 1.8 package.

With ideal Poisson distribution of coverage and realistic distribution of basecall quality, genotype error for both homozygous and heterozygous loci decays exponentially as depth increases, with errors lower in homozygous positions by orders of magnitude. The benefit of increased quality at reduced coverage is measured, for hypothetical quality distributions up to Q50. It is demonstrated that the set of quality bins can be reduced from a full phred scale to 16 or fewer while preserving high accuracy. Finally, non-uniformity of coverage across the genome is modelled as a linear combination of Poisson distributions.
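A stripped-down version of the simulation loop, as a sketch rather than CASAVA's actual caller: Poisson-distributed depth, Phred-derived basecall errors, and a naive maximum-likelihood diploid genotype model:

```python
import math, random

def poisson(lam, rng):
    """Draw from Poisson(lam) (Knuth's method; fine for small lam)."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p < L:
            return k
        k += 1

def simulate_stack(truth, mean_depth, q, rng):
    """Simulate the aligned basecalls at one site: Poisson depth, and
    each call wrong with the Phred-derived probability."""
    e = 10 ** (-q / 10)
    calls = []
    for _ in range(poisson(mean_depth, rng)):
        base = rng.choice(truth)               # het site: 50/50 allele pick
        if rng.random() < e:                   # sequencing error
            base = rng.choice([b for b in "ACGT" if b != base])
        calls.append(base)
    return calls, e

def call_genotype(calls, e, ref="A", alt="C"):
    """Maximum-likelihood diploid genotype over hom-ref/het/hom-alt."""
    def loglik(gt):
        return sum(math.log(sum((1 - e) if b == g else e / 3 for g in gt) / 2)
                   for b in calls)
    return max([ref + ref, ref + alt, alt + alt], key=loglik)

rng = random.Random(0)
trials = 200
correct = sum(call_genotype(*simulate_stack("AC", 30, 30, rng)) == "AC"
              for _ in range(trials))
print(correct / trials > 0.95)  # heterozygotes recovered reliably at 30x/Q30
```

Sweeping `mean_depth` and `q` in such a loop reproduces the qualitative behaviour described above: genotype error decays rapidly with depth, and faster for homozygous than heterozygous sites.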
 
Poster U36
CloVR: A platform for automated and portable sequence analysis using virtual machines and cloud computing

Malcolm Matalka Institute For Genome Sciences
Samuel Angiuoli (Institute For Genome Sciences, Bioinformatics); Florian Fricke (Institute For Genome Sciences, Bioinformatics); Cesar Arze (Institute For Genome Sciences, Bioinformatics); Kevin Galens (Institute For Genome Sciences, Bioinformatics); James White (Institute For Genome Sciences, Bioinformatics); Mahesh Vangala (Institute For Genome Sciences, Bioinformatics); Owen White (Institute For Genome Sciences, Bioinformatics);
 
Short Abstract: Next-generation sequencing platforms (e.g. 454, Illumina) have increased the accessibility and affordability of genomics to the broader research community. However, demands for computational resources and challenges in using analysis tools on large data sets are increasingly becoming a bottleneck in sequence analysis. We present the Cloud Virtual Resource (CloVR), a desktop application that takes advantage of virtual machines and cloud computing to provide analytical tools for the analysis of next-generation sequencing data. CloVR bundles pre-installed and pre-configured bioinformatics tools into automated pipelines within a virtual machine that executes on a local personal computer with minimal installation and configuration. In addition, CloVR can automatically access scalable on-demand cloud computing platforms to perform CPU-intensive tasks seamlessly over the Internet. We describe the system architecture of CloVR and a case study evaluating the cost and resources for microbial sequence analysis using four analysis protocols: BLAST searches (CloVR-Search), single microbial whole-genome shotgun (WGS) assembly and annotation (CloVR-Microbe), metagenomic WGS gene prediction and BLAST comparison (CloVR-Metagenomics), and comparative 16S rRNA sequence analysis (CloVR-16S). The CloVR VM and associated architecture lower the barrier of entry for sequence analysis on a personal computer and for the utilization of cloud computing systems for high-throughput data processing.
 
Poster U37
Using next-generation sequencing “dark matter” to identify a new transposable element location in the mouse genome.

James Cavalcoli University of Michigan
Christopher Vlangos (University of Michigan Medical School, Pediatrics); Catherine Keegan (University of Michigan School of Medicine, Pediatrics);
 
Short Abstract: High-throughput sequencing of portions of genomes from representative individuals or strains can reveal small deviations from the reference sequence (insertions, deletions and point mutations); however, larger changes or variations may be missed if they are absent from the reference genome, or are present as heterozygous copies (one in the reference and one novel). DNA from the Danforth's short tail mouse strain was isolated and enriched for a 1.5 Mb region of Chromosome 2, and a paired-end library was sequenced using the Illumina GAII. The reads were aligned to the mouse reference genome, with ~90% of read pairs mapping to the genome. After analysis of the mapped reads did not reveal any apparent mutations, we developed custom methods to examine the unmapped reads. Individual ends of each pair were aligned to the genome separately, and reads where one end mapped to the region of chr2 and the other end failed to align were filtered for failed reads and low-complexity sequences and screened for repetitive elements. We discovered a region in which one end of at least 5 reads mapped to a consistent chr2 position and orientation while the other end of the pairs mapped to retrotransposon LTR sequences. Since the DNA was heterozygous, we also identified read pairs for the other allele showing no variation from the reference genome. This retrotransposon insertion was confirmed by long-range PCR. Thus we demonstrate the value of examining the unmapped reads (dark matter) from genome sequencing experiments.
 
Poster U38
Accessibility of Compositionally Biased Regions in PDB structures

Stella Tamana University of Cyprus
Ioannis Kirmitzoglou (University of Cyprus, Department of Biological Sciences); Vasilis Promponas (University of Cyprus, Department of Biological Sciences);
 
Short Abstract: In this work we aim to inspect structural aspects of Compositionally Biased Regions (CBRs) under the prism of two of the most widely used CBR detection algorithms, namely SEG and CAST, using different detection criteria.
A non-redundant dataset of high-resolution protein structures (resolution ≤ 1.6 Å, R-factor ≤ 0.5 and less than 30% pairwise similarity) solved by X-ray crystallography was selected to represent our current knowledge of the protein structural universe. CBR detection results (computed under different schemes) were subsequently mapped to the respective Relative Accessible Surface Area values using a suite of custom Perl scripts. Moreover, a specialized module was developed for visualizing this type of data.
Interesting patterns of presence (or absence) of CBR types were detected. Our data illustrate that, by altering the parameters of the CBR detection methods, very different fractions of the dataset are detected as CBRs.
It is worth mentioning that SEG results are significantly correlated with the global residue composition, whereas CAST mainly identifies CBRs rich in Ala, Glu, Ser, Gly and Thr. Remarkably, Ile-rich CBRs (Ile: approx. 5% of the database) are absent from CAST output even with the most permissive thresholds. Regarding the structural analysis, we observe that Asp-, Lys- and Ser-rich CBRs tend to lie on the surface of protein subunits, while Gly-, Val- and in some cases Leu-rich CBRs are preferentially buried in the hydrophobic core, with high statistical significance.
 
Poster U39
A Conditional Random Fields Model for RNA Conformation Sampling

Zhiyong Wang Toyota Technological Institute at Chicago
Jinbo Xu (Toyota Technological Institute at Chicago)
 
Short Abstract: Accurate tertiary structures are very important for the functional study of non-coding RNA molecules. However, predicting RNA tertiary structures is extremely challenging because of the large conformation space to be explored and the lack of an accurate scoring function differentiating the native structure from decoys. Fragment-based conformation assembly methods suffer from the shortcoming that the limited size of a fragment library makes it infeasible to represent all possible conformations well. Assembling RNA structures from fragments is also extremely time-consuming for large RNA molecules with more than 100 nucleotides. A recent dynamic Bayesian network method overcomes the issues of the fragment assembly methods, but cannot make use of sequence information in sampling conformations. Here, we present a new probabilistic graphical model, Conditional Random Fields (CRFs), for RNA conformation sampling, together with a novel tree-guided sampling scheme. Our method uses a CRF model to capture the RNA sequence-structure relationship and estimates the probability of a conformation from sequence. In addition, our probabilistic model enables us to sample real-valued instead of discrete-valued bond torsion angles. Experimental results indicate that our method can efficiently sample RNA conformations from sequence. Tested on 11 RNA molecules, our method generates a higher percentage of native-like decoys than the fragment-assembly method FARNA and the dynamic Bayesian network method BARNACLE, although we use a very simple energy function based only upon base-pairing information. Compared with MC-Sym, our method is much faster, which enables the investigation of large RNA molecules.
 
Poster U40
A new approach to suboptimal pairwise sequence alignment

Feng Lou Laboratoire de Recherche en Informatique
Peter Clote (Boston College, Biology Department); Alain Denise (Laboratoire de Recherche en Informatique, Computer Department);
 
Short Abstract: We present a new algorithm for the analysis of protein sequences, motivated by the goal of improving pairwise sequence alignment quality. Protein sequence alignments have a myriad of applications in bioinformatics, including secondary and tertiary structure prediction, homology modeling, and phylogeny. Unfortunately, mathematically optimal alignments do not always properly align active-site residues or well-recognized structural elements. Waterman and Eggert therefore constructed suboptimal alignments by a heuristic that masks a portion of the optimal alignment. Given the biological importance of constructing suboptimal alignments, we present a new method that relies on a rigorous mathematical framework. Given any initial alignment A0 of two nucleic acid or amino acid sequences, our algorithm SubOptAlign computes, in cubic time and space, the optimal alignment Ak at distance k from A0, simultaneously for all k.
We use the benchmark dataset BAliBASE, a collection of manually refined multiple sequence alignments specifically designed for the evaluation and comparison of sequence alignment programs. A comparison between SubOptAlign and the Needleman–Wunsch algorithm shows that our algorithm can find reference alignments that Needleman–Wunsch cannot, and that the optimal alignment Ak is provably more similar to the reference alignment.
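The Needleman–Wunsch baseline against which SubOptAlign is compared is the classical global-alignment dynamic program; a minimal score-only sketch (the unit match/mismatch/gap scores are illustrative assumptions, not the scoring used in the study):

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score by the classical O(nm) dynamic program."""
    n, m = len(a), len(b)
    # dp[i][j] = best score aligning a[:i] with b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + sub,  # (mis)match
                           dp[i - 1][j] + gap,      # gap in b
                           dp[i][j - 1] + gap)      # gap in a
    return dp[n][m]
```

This recurrence returns a single mathematically optimal score; the point of the abstract is precisely that such an optimum may miss the biologically correct (reference) alignment, motivating the enumeration of suboptimal alignments at distance k from it.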
 
Poster U41
Support Vector Machines for finding deletions and short insertions using paired-end short reads

Dominik Grimm Max Planck Institutes Tübingen
Jörg Hagmann (Max Planck Institute for Developmental Biology, Department of Molecular Biology); Daniel Koenig (Max Planck Institute for Developmental Biology, Department of Molecular Biology); Detlef Weigel (Max Planck Institute for Developmental Biology, Department of Molecular Biology); Karsten Borgwardt (Max Planck Institutes Tübingen, Machine Learning and Computational Biology Research Group);
 
Short Abstract: Next Generation Sequencing (NGS) techniques generate millions of short reads. Assembly methods can reconstruct the original sequence but suffer from a high error rate due to the 30 bp to 80 bp length of the reads. A more accurate method is to align these reads against a reference genome using tools like SHORE or SSAHA2. In both strategies, it remains a difficult task to detect insertions and deletions, because sequencing is not error-free and because reads can often be mapped to more than one position in the genome.
Here we present an approach to detect deletions and short insertions using paired-end short reads from the Illumina sequencing platform. To test the accuracy of our approach, we used sequencing data from the 1001 Genomes Project in Arabidopsis thaliana. For validation, we performed Sanger sequencing for a random set of 192 candidate deletions. Out of the 192 Sanger sequences, we extracted 122 sequences of high quality, which confirmed 69 true candidate deletions and 53 false candidate deletions. We developed scores to determine the quality of an indel candidate. Based on this validated training dataset and these quality scores, we trained a Support Vector Machine (SVM) to predict true indels across the whole genome. We achieved an accuracy of 91.985%±1.7769% (precision = 91.48%±2.19%, recall = 96%±1.48%). Furthermore, we found 419,119 deletion candidates and 34,471 insertion candidates across 67 ecotypes of Arabidopsis thaliana.
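The reported accuracy, precision and recall follow the usual confusion-matrix definitions; a minimal sketch (the counts below are synthetic illustrations, not the validation data from the abstract):

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,    # fraction of all calls correct
        "precision": tp / (tp + fp),      # fraction of positive calls correct
        "recall": tp / (tp + fn),         # fraction of true indels recovered
    }

# Synthetic example with 122 validated candidates (counts are illustrative).
m = classification_metrics(tp=60, fp=5, fn=4, tn=53)
```

High precision with high recall, as reported above, means the SVM both rarely promotes false candidates and rarely discards true indels.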
 
Poster U42
A combination of amino acid composition and matching for conotoxin superfamily classification

Piers Campbell United Arab Emirates University
Nazar Zaki (United Arab Emirates University)
 
Short Abstract: This poster is based on Proceedings Submission 29.

Conotoxin classification could assist in the study of the structure-function relationship of ion channels and receptors, as well as in identifying potential therapeutics in the treatment of a wide variety of diseases such as schizophrenia, chronic pain, and cardiovascular and bladder dysfunction. The traditional BLASTP tool for searching homologues has been shown to be unsuitable for hypervariable conotoxins; it is therefore imperative to use a superior classification technique. In this study, we introduce a novel method (Toxin-AAM) for conotoxin superfamily classification. Toxin-AAM incorporates evolutionary information using a powerful means of pairwise sequence comparison and amino acid composition knowledge.

The combination of the sequential model and the discrete model makes the Toxin-AAM method superior in classifying conotoxin superfamily when compared to the existing conotoxin superfamily classification techniques.
 
Poster U43
Ranking and filtering translocations in paired-end sequencing data using BLAT realignment

Christoph Bartenhagen University of Münster
Martin Dugas (University of Münster , Institute of Medical Informatics);
 
Short Abstract: The alignment of paired-end sequencing data, in combination with algorithms for the detection of structural variations, reveals large-scale mutations such as insertions and deletions, inversions and translocations. Inter-chromosomal translocations leading to fusion genes are associated, for example, with tumor growth and development in various cancers. An unbiased detection of novel translocations from whole-genome sequencing data, however, is often complicated by a major fraction of false positive calls due to misalignments of the paired reads. Small insertions of parts of other chromosomes, duplications, repetitive regions or wrong base calls during sequencing can lead to misplaced read pairs on two different chromosomes. Since laboratory validation of variants is limited by time, money and material, it is a crucial task to prioritize a set of detections by means of a confidence score indicating which translocations are more likely to be artifacts.

We propose a scoring system based on realignment of reads accounting for possible translocations. After the alignment of paired-end data with BWA, we first use GASV to compute possible breakpoint regions and cluster read pairs by mapping coordinates. Then, we realign each cluster individually with BLAT and compute a score based on high-quality alternative alignments and their agreement between both ends of a read pair. Together with mapping qualities and read depth at every breakpoint region, we demonstrate a way of filtering and ranking translocations detected in human Illumina paired-end sequencing data.
 
Poster U44
Devising strategies for quantifying false positive and false negative rates in RNA-Seq studies and steps towards the analysis of RNA-seq data from genetically complex mice backgrounds

Priscila Darakjian Oregon Health and Science University
Daniel Bottomly (Oregon Health and Science University , Oregon Clinical and Translational Research Institute ); Nicole Walter (Oregon Health and Science University , Behavioral Neuroscience); Robert Searles (Oregon Health and Science University, Massively Parallel Sequencing Shared Resource); Kari Buck (Oregon Health and Science University, Behavioral Neuroscience); Shannon McWeeney (Oregon Health and Science University, OHSU Knight Cancer Institute and Oregon Clinical and Translational Research Institute ); Robert Hitzemann (Oregon Health and Science University, Behavioral Neuroscience);
 
Short Abstract: This poster is based on Proceedings Submission 20720.

C57BL/6J (B6) and DBA/2J (D2) are two of the most commonly used inbred mouse strains in neuroscience research. The emergence of next-generation sequencing (NGS) and RNA-Seq allows assessment of gene expression with the potential to concurrently assess sample specific transcript sequences. We wished to explore this potential and determine optimal analysis techniques leveraging known sequence variations of the B6 and D2 strains, in order to apply this to more complex backgrounds like those of the Heterogeneous Stock (HS).

Single-end Illumina GAIIx reads derived from the striatum brain region of 3 B6 and 4 D2 mice, with total read counts ranging from 27,742,742 to 31,437,071 for B6 and from 30,117,036 to 31,953,011 for D2, were generated and aligned using Bowtie (Langmead et al., 2009) separately for each sample. Genotypes were called using SAMtools (Li et al., 2009). For determining Single Nucleotide Polymorphisms (SNPs), we set stringent quality thresholds (consensus quality and SNP quality at Phred 40 and minimum number of reads set to 5) and estimated the false positive rate for the B6 samples using the known reference sequence (B6). We found that 0.01% of B6 bases (from 27 million high-quality bases) and 0.04% of D2 bases (from 29 million high-quality bases) varied from the reference. Based on our findings we discuss strategies for estimating false positive and false negative rates in sequence detection and the implications of this study for the analysis of more genetically complex backgrounds like the Heterogeneous Stock Collaborative Cross (HSCC).
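The false-positive estimate above is a simple proportion: when B6 reads are called against the B6 reference, every variant call must be an error. A sketch of that arithmetic (the absolute call counts below are back-calculated illustrations consistent with the quoted percentages, not figures from the study):

```python
def percent_variant(variant_calls, high_quality_bases):
    """Variant calls as a percentage of high-quality bases examined."""
    return 100.0 * variant_calls / high_quality_bases

# Against the matching B6 reference, every variant call is a false positive:
# ~2,700 calls over 27 million high-quality bases reproduces the quoted 0.01%.
b6_fp_percent = percent_variant(2_700, 27_000_000)
```

The same calculation applied to the D2 samples against the B6 reference mixes true strain differences with errors, which is why the D2 rate (0.04%) is higher.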
 
Poster U45
Distributed Web Services for Bioinformatics: Multiple Sequence Alignment

Peter Troshin University of Dundee
James Procter (University of Dundee, College of Life Sciences); Geoffrey Barton (University of Dundee, College of Life Sciences);
 
Short Abstract: Despite the widespread use of web services in bioinformatics, some issues persist. For example, the execution limits of public web services are often too restrictive, and the data privacy of their users cannot be guaranteed. In addition, the accessibility of web services, as well as the frequency of their interface changes, is outside of the users' control. Finally, many web services provide very limited access to the command-line parameters of the tools they wrap.
Here we present the JAva Bioinformatics Analyses Web Services (JABAWS) system, a collection of SOAP web services which overcomes these problems. JABAWS lets an average user deploy, operate and configure webservices on their own hardware. JABAWS can run on a single computer, or a cluster managed by Oracle Grid Engine, Load Sharing Facility and other systems via DRMAA. Configuration files specify how jobs should be routed to the local machine or cluster depending on load or size of job.
JABAWS currently provides programmatic access to the common multiple sequence alignment methods Probcons (Do et al. 2005), T-coffee (Notredame et al. 2000), Muscle (Edgar 2004), Mafft (Katoh & Toh 2008), and ClustalW (Larkin et al. 2007) and can be accessed from the popular Jalview multiple alignment analysis workbench (Waterhouse et al. 2009) or the command-line client that is distributed with JABAWS. Although currently focusing on multiple alignment, the JABAWS framework is flexible and can be easily extended with other web services. We plan to add disorder, conservation and secondary structure prediction web services to JABAWS soon.
 
Poster U46
Pythoscape: A Software Framework for Generation of Large Protein Similarity Networks

Alan Barber University of California, San Francisco
Patricia Babbitt (University of California, San Francisco , Department of Bioengineering and Therapeutic Sciences);
 
Short Abstract: Due to the rapidly increasing size of biological data available to computational scientists, new methods are necessary that can accommodate and organize ever larger data sets. Protein similarity networks leverage biological databases to enable hypothesis creation about sequence-structure-function relationships using datasets that are too large and unwieldy for many other methods. We present Pythoscape, an interface and set of plug-ins implemented in Python that serves as a framework for fast and efficient generation of large protein similarity networks that can be exported and visualized in other software packages (e.g. Cytoscape). As an example case study, we have generated sequence similarity networks of the glutathione transferase (GST) superfamily, an enzyme superfamily whose members function in cellular chemical detoxification pathways. These networks demonstrate the utility of Pythoscape to manage and visualize large datasets and highlight technical challenges for correct classification of the enzymatic functions of GST superfamily members.
 
Poster U47
LaTcOm: A web server for visualizing rare codon clusters

Athina Theodosiou University of Cyprus
Vasilis Promponas (University of Cyprus, Biological Sciences);
 
Short Abstract: Clusters of unusual codon composition have been the focus of several recent research efforts, especially as indicators of an extra level of cell regulation. A number of different definitions and algorithms have been proposed for identifying, so called, 'rare codon clusters' (RCC) in coding sequences. We present a new web tool named LaTcOm, which offers several alternative methods for RCC identification from a single and simple graphical user interface.

Using as input coding DNA sequences in FASTA format, LaTcOm facilitates rapid RCC identification.
Three core RCC detection algorithms are currently implemented in the server, namely the recently described %MINMAX and sliding window approaches, along with a linear-time algorithm for detection of maximally scoring segments tailored to the RCC detection problem.

Among a number of user tunable parameters, several RCC-relevant scales are available, including pre-calculated tRNA abundance values and several codon usage tables from an array of genomes or gene subsets. Furthermore, scale transformations may be performed upon user request (e.g., linear, sigmoid).

Users may choose to visualize RCC positions within the submitted sequences either as graphical representations in PNG format, or in textual form for further processing. LaTcOm has been implemented as a Perl CGI application, utilizing the CGI.pm module and BioPerl, and will be freely available via http://troodos.biol.ucy.ac.cy/latcom.html.
 
Poster U48
SRS 3D - Integrating structures, sequences and annotations

Andrea Schafferhans Technische Universität München
Seán O'Donoghue (EMBL, Structural Biology); Burkhard Rost (Technische Universität München, Bioinformatics and Computational Biology);
 
Short Abstract: We present SRS 3D, a tool for rapid access to and visualisation of all related structures for a given target sequence. The underlying HSSP/PSSH family of databases stores pre-compiled high-quality structure-to-sequence alignments from an all-against-all comparison of UniProt and PDB. The Structure Viewer displays the aligned sequences, sequence features and the PDB structure in a tripartite window. The user interface has been designed to help non-specialist users understand the relation between sequence features and structure. This open-source Java applet can be used to visualise sequence annotations from any source in a structural context, e.g. for the visualisation of results from the PredictProtein annotation pipeline. The SRS 3D tools are used in the srs3d.org website. Searches for a keyword or identifier from UniProt give a result table showing all sequences matching the query. An icon indicates sequence entries with structural information. Clicking on the icon opens the corresponding PSSH entry, which shows a graphical summary of all 3D structures with significant homology to the target protein. It displays the match region along with sequence annotations and the structure description, thus helping to identify the most relevant structures. Selecting a structure opens an HSSPalign entry, with sequences, alignments, the PDB structure and features from UniProt, all displayed in the Structure Viewer. Furthermore, SRS 3D considerably enhances the view of PDB entries, since UniProt and PDB features are mapped onto the structure, and the user can easily navigate to related sequences and structures.
 
Poster U49
Multiobjective sequence alignment: Formulation and Algorithms

Maryam Abbasi University of Coimbra
Luís Paquete (University of Coimbra, CISUC, Department of Informatics Engineering); Arnaud Liefooghe (Université Lille 1, UFR IEEA - FIL); Maria Celeste Dias (University of Aveiro, CESAM, Department of Biology);
 
Short Abstract: Recently, there has been growing interest in the multiobjective formulation of optimization problems in computational biology. In this work, the multiobjective formulation of the pairwise sequence alignment problem is considered, where a vector score function takes into account the occurrence of matches, indels, gaps or mismatches separately. An alignment is efficient if it is maximal with respect to the component-wise ordering of the scores of all alignments. Of particular interest is to find the image of all efficient alignments in the space of the vector score function. This formulation has interesting properties: i) an optimal alignment for the parametric score function with positive parameters is an efficient alignment; ii) there may be efficient alignments that are not optimal for any combination of parameters; iii) the image of the optimal alignments has tractable size. Therefore, multiobjective sequence alignment brings advantages to the practitioner, since it allows one to dispense with parameters and to explore a tractable set of alignments that are not reachable by any other method. In this work, explicit recurrence equations for a number of score functions are given and two approaches are compared: a dynamic programming algorithm that extends classical algorithms for this problem, and an epsilon-constraint algorithm that solves a series of constrained sequence alignment problems. Both approaches improve previous results in the literature in terms of time complexity. Numerical results are shown, indicating that the exact solution of this problem is feasible in practice. Extensions for multiple sequence alignment are also discussed.
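The efficiency notion used above — an alignment is efficient if no other alignment scores at least as well in every component and strictly better in some component — is the usual Pareto-maximality test. A minimal sketch over arbitrary score vectors (the example vectors are illustrative, not from the study):

```python
def dominates(u, v):
    """u dominates v if u >= v component-wise and u > v in some component."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))

def efficient_set(scores):
    """Return the score vectors not dominated by any other vector."""
    return [u for u in scores
            if not any(dominates(v, u) for v in scores if v is not u)]

# Hypothetical alignment score vectors, e.g. (matches, -mismatches, -gaps).
points = [(5, -1, -2), (4, -1, -1), (5, -2, -1), (3, -3, -3)]
```

Here the first three vectors are mutually non-dominated (each trades one objective for another) and only the last is discarded, illustrating why the efficient set, rather than a single parametric optimum, is the object of interest.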
 
Poster U50
An improved RIP-Seq Toolkit to Reveal Genome-wide Exon Junction Complex Landscape

Alper Kucukural University of Massachusetts Medical School
Guramrit Singh (University of Massachusetts Medical School, Biochemistry and Molecular Pharmacology); Can Cenik (Harvard Medical School, Biochemistry and Molecular Pharmacology); Zhiping Weng (University of Massachusetts Medical School, Biochemistry and Molecular Pharmacology); Melissa Moore (University of Massachusetts Medical School, Biochemistry and Molecular Pharmacology);
 
Short Abstract: With the advent of massively parallel high-throughput sequencing of short DNA fragments, RIP-Seq has become a highly attractive method for detecting binding sites of RNA-binding proteins (RBPs) on a genome-wide scale. It involves immunoprecipitation of RNA-protein complexes using antibodies against target proteins of interest, followed by detection of RNA binding sites by high-throughput sequencing. The millions of reads obtained are analyzed using computational tools to reveal the binding sites. Any computational pipeline for analysis of high-throughput sequencing data to reveal binding sites of RBPs faces the challenge of minimizing noise that arises from experimental methods or sequencing quality. Furthermore, background noise is a direct function of the expression levels of individual genes. To account for these issues, we have developed a new adaptive peak finding algorithm that computes a background for any expressed exon using its expression level. Such background definitions allow us to detect peaks even in genes with low expression levels. We have used this algorithm to detect genome-wide binding sites of a multi-protein complex (the Exon Junction Complex, EJC) that is deposited on mRNA 20-24 nucleotides upstream of exon-exon junctions during splicing of pre-mRNAs. Although much has been learned about EJC assembly and deposition, many fundamental questions remain unanswered. Are EJCs deposited on all junctions? Are EJCs always deposited at the same location? Using a RIP-Seq approach to sequence RNAs that purify as EJC footprints from human cells (HEK293), we have started to address these questions. As a result, we found that the EJC occupies regions centered ~24 nucleotides upstream of junctions in human cells.
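The idea of an expression-scaled background can be caricatured with a Poisson model: estimate a per-base background rate from an exon's own read count, then call a peak only where coverage is improbably high under that rate. This is a toy sketch under stated assumptions (the rate estimate and cutoff are illustrative, not the published algorithm):

```python
import math

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), via the complement of the CDF."""
    p = math.exp(-lam)
    cdf = 0.0
    for i in range(k):
        cdf += p
        p *= lam / (i + 1)
    return max(0.0, 1.0 - cdf)

def call_peaks(coverage, exon_reads, exon_len, alpha=1e-3):
    """Positions whose coverage is improbably high under an exon-specific
    Poisson background whose rate reflects the exon's expression level."""
    lam = exon_reads / exon_len  # expected background coverage per base
    return [i for i, c in enumerate(coverage) if poisson_sf(c, lam) < alpha]
```

Because the background rate scales with the exon's read count, the same absolute coverage spike is significant in a lowly expressed gene but part of the background in a highly expressed one, which is the property the adaptive algorithm exploits.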
 
Poster U51
SEED: Efficient Clustering of Next Generation Sequences

Thomas Girke University of California, Riverside
Ergude Bao (University of California, Riverside, Computer Sciences); Tao Jiang (University of California, Riverside, Computer Sciences);
 
Short Abstract: This poster is based on Proceedings Submission 215

Efficient similarity clustering of next generation sequences (NGS) is an important computational problem to study the population sizes of short DNA/RNA molecules and to reduce the redundancies in NGS data sets. Currently, most algorithms available for sequence clustering are limited by their speed and scalability, and thus cannot handle modern NGS data with tens of millions of reads.
In this paper, we introduce SEED - a new algorithm that is able to cluster very large NGS sets. It joins highly similar sequences into clusters that can differ by up to three mismatches and three overhanging residues. The method is based on a block spaced seed approach. Its clustering component operates on hash tables by first identifying virtual center sequences and then finding all their neighboring sequences that meet the similarity parameters. SEED can cluster 100 million sequences of 40 bp in length in less than four hours, with linear time and memory performance for increasing numbers of sequences. When using SEED as a preprocessing tool on real genome and transcriptome assembly data, it was able to reduce the memory requirements of the Velvet/Oases assembler by 22-41%. In addition, the assemblies contained longer contigs than with non-preprocessed data, as indicated by 10-20% higher N50 values. Most of SEED's utility falls into the data preprocessing area of NGS. However, our tests also demonstrate its efficiency as a standalone tool for discovering clusters of small RNA sequences in NGS data sets from unsequenced organisms.
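The spaced-seed idea can be sketched as follows: hash each read through a fixed 0/1 mask so that mismatches at masked ('0') positions do not change the key, then group reads sharing a key as candidate cluster members. The mask and reads below are illustrative; the actual SEED algorithm additionally handles overhangs and selects virtual center sequences:

```python
from collections import defaultdict

def spaced_key(read, mask):
    """Project a read onto the '1' positions of a spaced-seed mask, so
    mismatches falling on '0' positions leave the key unchanged."""
    return "".join(base for base, m in zip(read, mask) if m == "1")

def candidate_clusters(reads, mask):
    """Bucket reads by spaced-seed key; each bucket is a candidate cluster."""
    buckets = defaultdict(list)
    for r in reads:
        buckets[spaced_key(r, mask)].append(r)
    return list(buckets.values())

mask = "110110110110"  # '0' columns tolerate mismatches
reads = ["ACGTACGTACGT",
         "ACTTACGTACGT",   # mismatch at a masked position -> same key
         "TTTTACGTACGT"]   # mismatches at unmasked positions -> new key
```

Hashing makes candidate lookup constant-time per read, which is what gives this family of methods its linear scaling in the number of sequences.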
 
Poster U52
Para-mugsy: Distributed whole genome multiple alignment

Samuel Angiuoli University of Maryland
Malcolm Matalka (University of Maryland, Institute for Genome Sciences);
 
Short Abstract: The increased ability to perform whole-genome sequencing demands the ability to compare hundreds or even thousands of genomes efficiently. Extensive runtime or memory consumption can limit the number of genomes that are feasible to compare with whole-genome alignment tools. To address these challenges, we describe a new tool, Para-Mugsy, that implements a distributed algorithm for multiple whole-genome alignment that can execute on cloud computing platforms. Para-Mugsy utilizes Nucmer for pairwise alignment, Mugsy for identification of collinear regions, and Seqan::TCoffee for multiple alignment. Parallel processing of intermediate alignments is used at several steps to increase throughput using multiple CPUs. We describe the algorithm and evaluate the performance of Para-Mugsy on the multiple alignment of over one hundred E. coli genomes.
 
Poster U53
Estimating the Accuracy of Protein Multiple Sequence Alignments

Daniel DeBlasio University of Arizona
Travis Wheeler (University of Arizona, Computer Science); Vladimir Filkov (University of California, Davis, Computer Science); John Kececioglu (University of Arizona, Computer Science);
 
Short Abstract: Estimating the accuracy of a computed multiple sequence alignment without knowledge of the correct alignment is an important problem. A good accuracy estimator has broad utility, from building a meta-aligner that selects the best output of a collection of aligners, to boosting the accuracy of a single aligner by choosing the best values for alignment parameters. Accuracy is conventionally measured with respect to a reference alignment; we estimate this accuracy without knowing the reference by learning a function that combines multiple, easily-computable features of an alignment into a single estimate.

For protein alignments, we consider percent identity, average substitution score, gap density, information content, and measures of conserved predicted secondary structure. We also introduce new features that measure consistency with an unknown phylogeny. We then learn an accuracy estimation function that is a quadratic combination of the feature functions, whose coefficients we determine by minimizing the L2-norm between the estimated accuracy and the true accuracy on a training set of benchmark reference alignments.

For evaluation, we apply the learned accuracy estimation function to the task of determining the best affine gap penalties to use in the multiple alignment tool Opal (Wheeler and Kececioglu, 2007). Given a collection of alignments generated with different parameters, selecting the alignment with the highest estimated accuracy determines the optimal parameter choice. On structurally aligned benchmarks (Edgar, 2009), compared to the competing accuracy estimation approaches of MOS (Lassmann and Sonnhammer, 2002) and NorMD (Thompson, et al., 2001), our estimator achieves an 11.4% boost in accuracy on the hardest-to-align sequences.
 
Poster U54
Visualizing the next generation of sequencing data with GenomeView

Thomas Abeel Broad Institute of MIT and Harvard
Yves Van de Peer (VIB, PSB); James Galagan (Boston University, Biomedical engineering);
 
Short Abstract: Advances in DNA sequencing methods result in billions of nucleotide sequences on a daily basis. The ability to visually explore sequencing data is extremely valuable. Visualization is useful at any stage of data analysis, enabling sanity checks on the data and on analysis results. Eye-balling the data is a quick way to get a good idea of what came out of an experiment; this can be leveraged to generate new hypotheses and to fine-tune analysis parameters. The right image often makes the solution obvious, or at the very least helps us reason about these complex data. We present GenomeView, a tool specifically designed to visualize and manipulate a multitude of genomics data. GenomeView enables users to dynamically browse high volumes of aligned short read data, with dynamic navigation and semantic zooming. At the same time, the tool enables visualization of whole genome alignments of dozens of genomes relative to a reference sequence. GenomeView is unique in its capability to interactively handle huge data sets consisting of dozens of aligned genomes, thousands of annotation features and millions of mapped short reads, both as viewer and editor.
 
Poster U55
SlideSort: Fast and exact algorithm for Next Generation Sequencing data analysis

Kana Shimizu National Institute of Advanced Industrial Science and Technology
Koji Tsuda (National Institute of Advanced Industrial Science and Technology, Computational Biology Research Center);
 
Short Abstract: Next Generation Sequencing (NGS) technology calls for fast and accurate algorithms that can evaluate sequence similarity for a huge amount of data. In this study, we designed and implemented an exact algorithm, SlideSort, that finds all similar pairs whose edit distance does not exceed a given threshold in NGS data, which helps many important analyses such as de novo genome assembly, identification of frequently appearing sequence patterns, and accurate clustering.
Using an efficient pattern growth algorithm, SlideSort discovers chains of common k-mers to narrow down the search. Compared to existing methods based on a single k-mer, our method is more effective in reducing the number of edit-distance calculations. In comparison to state-of-the-art methods, our method is much faster in finding remote matches, scaling easily to tens of millions of sequences. Our software has an additional function for single-link clustering, which is useful in summarizing NGS data for further processing.
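The core idea, a shared-k-mer filter followed by exact edit-distance verification, can be illustrated with a naive all-pairs sketch. SlideSort itself grows chains of common k-mers to avoid this quadratic scan; reads and parameters below are illustrative:

```python
from itertools import combinations

def edit_distance(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # match/substitution
        prev = cur
    return prev[-1]

def shares_kmer(a, b, k):
    """Pigeonhole filter: two reads within edit distance d must share an
    exact k-mer whenever k <= len(read) // (d + 1)."""
    kmers = {a[i:i + k] for i in range(len(a) - k + 1)}
    return any(b[i:i + k] in kmers for i in range(len(b) - k + 1))

def similar_pairs(reads, d, k):
    """All read pairs within edit distance d, filtered by shared k-mers."""
    return [(x, y) for x, y in combinations(reads, 2)
            if shares_kmer(x, y, k) and edit_distance(x, y) <= d]

reads = ["ACGTACGT", "ACGTTCGT", "TTTTAAAA", "ACGAACGT"]
print(similar_pairs(reads, d=1, k=4))
```

With d=1 and k=4 on reads of length 8 the pigeonhole condition holds, so the filter loses no true pairs while skipping most edit-distance computations.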


Poster U56
Who & How: Investigating miRNA Regulation in Ranked Gene Lists using miRnaPursuit.

Teresa Colombo 'Sapienza' University of Rome
Giuseppe Macino ('Sapienza' University of Rome, Dept. of Cellular Biotechnologies and Hematology);
 
Short Abstract: MicroRNAs (miRNAs) are small non-coding RNAs well known to be crucial negative regulators of gene expression in the most varied cellular processes. Identification of miRNA target genes is instrumental in understanding miRNA function in a given biological context of interest. However, only a few miRNA target genes have been validated so far, and experimental testing of additional functional interactions is a low-throughput step which necessarily relies on bioinformatics and statistics to prioritize gene candidates. Differential analysis of gene expression together with statistical analysis of 3'UTR sequence motif frequency can provide a useful approach to recognize a relevant miRNA and its putative targets. The rationale for the above is twofold. First, miRNAs can decrease levels of target mRNAs. Second, this effect appears chiefly mediated by a short stretch of consecutive nucleotides at the 5' end of the mature miRNA sequence, termed the seed. Indeed, biological experiments very often involve a contrast between large-scale gene expression profiles taken under two different conditions to select top-ranking experiment responders.
Here we present 'miRnaPursuit', a web server tool devised to receive such a ranked gene list as input and infer from it possible evidence of miRNA regulation, based on statistical analysis of the over-representation of seed-complementary motifs in the corresponding set of 3'UTR sequences as a function of rank position.
miRnaPursuit is especially meant for non-expert users, offering a simple and intuitive web interface as well as an exhaustive and well-documented set of output information.
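One plausible form of a rank-based over-representation test is a hypergeometric tail on the top-N ranked genes. The abstract does not specify the server's exact statistic, so the following is an assumed illustration with toy 3'UTRs:

```python
from math import comb

def seed_enrichment_p(ranked_utrs, seed_match, top_n):
    """Hypergeometric tail P(X >= x): the chance of seeing at least x
    seed-complementary 3'UTRs among the top_n ranked genes by chance."""
    hits = [seed_match in utr for utr in ranked_utrs]
    N, K = len(hits), sum(hits)    # genes overall / genes carrying the motif
    x = sum(hits[:top_n])          # motif-carrying genes in the top ranks
    return sum(comb(K, i) * comb(N - K, top_n - i)
               for i in range(x, min(K, top_n) + 1)) / comb(N, top_n)

# Both motif-bearing UTRs rank at the top of this toy list.
utrs = ["AAACCC", "AAACCC", "GGGTTT", "GGGTTT"]
print(seed_enrichment_p(utrs, "CCC", top_n=2))   # 1/6
```

Scanning top_n over the whole ranking would give enrichment as a function of rank position, as the abstract describes.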
 
Poster U57
HHblits: Lightning-fast iterative sequence searching by HMM-HMM comparison

Michael Remmert Gene Center Munich
Andreas Biegert (LMU Munich, Gene Center); Andreas Hauser (LMU Munich, Gene Center); Johannes Söding (LMU Munich, Gene Center);
 
Short Abstract: Most sequence-based methods for protein structure or function prediction construct a multiple sequence alignment of homologs as a first step. The standard search tool to generate multiple sequence alignments is PSI-BLAST (>30000 citations), an extension of BLAST to profile-sequence comparison. PSI-BLAST owes its sensitivity to its iterative search scheme. Significant sequence hits are added to the evolving multiple alignment from which a profile is generated for the next search iteration. A further improvement would be possible if iterative profile-sequence comparison could be extended to iterative profile-profile searches.

We have developed HHblits, the first iterative search method based on the pairwise comparison of profile Hidden Markov Models (HMMs). HHblits achieves better runtimes than PSI-BLAST or HMMER3 by using a fast prefilter based on profile-profile comparison. Furthermore, it greatly improves upon PSI-BLAST and HMMER3 in terms of sensitivity/selectivity and alignment quality. On a standard SCOP20 ROC benchmark (SCOP 1.75 proteins filtered to 20% maximum sequence identity), HHblits detects twice as many true positives as PSI-BLAST and 73% more than HMMER3 at a 1% error rate in the first iteration. Two HHblits search iterations detect significantly more true positives than five PSI-BLAST iterations.

Our HHblits webserver can be accessed at http://toolkit.lmb.uni-muenchen.de/hhblits.
 
Poster U58
Single Nucleotide Polymorphisms (SNPs) in the Complex Allotetraploid Genome of Tobacco

Nicolas Sierro Philip Morris International R&D
Nikolai Ivanov (Philip Morris International R&D)
 
Short Abstract: Tobacco (Nicotiana tabacum) is an important agricultural crop which harbors ~4.5 gigabases of genomic DNA and is highly homogeneous due to self-pollination and human cultivation. Its allotetraploid genome is thought to be a hybrid of the ancestral species of N. sylvestris (S-genome) and N. tomentosiformis (T-genome). SNP discovery in the tobacco genome is a significant challenge due to its polyploidy, large size, and high repeat content. Preliminary experiments targeting selected gene-containing regions within dozens of tobacco varieties showed evidence of low polymorphism. We have performed EcoRI-based reduced-representation sequencing of the whole genomes of two tobacco varieties (Hicks Broadleaf and Red Russian), known to be genetically distant and previously used to generate a microsatellite-based genetic map. Because of the allotetraploidy, it was possible to assign short sequencing reads to one of the two ancestral genomes (S- or T-) and to use this information to pre-validate the discovered SNPs in silico. Our results indicate that, depending on the level of pre- and post-filtering in the SNP calling process, 10,000 to 15,000 SNPs could be obtained. Since the reduced genomes represent about 4.5% of the whole genomes, 220,000 to 330,000 SNPs, that is one SNP every 13 to 20 kb on average, can be expected between these two varieties at the whole-genome level. Thus, based on only two varieties, tobacco shows a higher rate of polymorphism at the whole-genome level than initially expected. We therefore believe that, with the inclusion of additional divergent varieties, the construction of a reliable and functional SNP array is feasible.
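The whole-genome extrapolation is simple proportional scaling from the ~4.5% reduced representation, and it reproduces the figures quoted above:

```python
genome_size = 4.5e9        # ~4.5 Gb tobacco genome
reduced_fraction = 0.045   # EcoRI-reduced representation, ~4.5% of the genome

for snps_reduced in (10_000, 15_000):
    snps_genome = snps_reduced / reduced_fraction
    spacing_kb = genome_size / snps_genome / 1_000
    print(f"{snps_reduced:,} SNPs in the reduced genomes -> "
          f"~{snps_genome:,.0f} genome-wide (one per ~{spacing_kb:.1f} kb)")
```

This yields roughly 222,000 to 333,000 genome-wide SNPs, one every 13.5 to 20.3 kb, matching the 220,000-330,000 and 13-20 kb figures in the abstract.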
 
Poster U59
Analysing allele-specific NGS datasets using ASAP

Felix Krueger The Babraham Institute
 
Short Abstract: A well studied epigenetic example for the functional discordance of the two copies of a gene in the diploid genome is genomic imprinting, in which gene expression depends on the parental origin of the allele. Mechanisms establishing and maintaining imprinted expression are tightly controlled during development and defects have been linked to various diseases.
Imprinting studies routinely use crosses of mice from different genetic backgrounds, together with SNPs in the region of interest, to determine the parental origin of an allele. We have developed an allele-specific alignment pipeline (ASAP) to analyse allelic differences in NGS data. ASAP takes a sequencing file as input and performs alignments to two reference genomes in parallel using Bowtie. For this allele-specific analysis method to work, details about both genomes have to be known, i.e. a genome-wide set of SNPs or alternative chromosome sequences must be available. All alignments are analysed as they are being generated to decide whether an alignment is specific to one of the genomes or whether a sequence aligns to both genomes equally well (e.g. in the absence of SNPs in the read).
In its initial version, ASAP was intended to explore allelic differences in ChIP-Seq data. However, it can be applied to any experiment involving the analysis of sequencing samples with two distinct genetic origins. Thus, ASAP could in principle also be used to analyse differential allelic expression in RNA-Seq data. Other applications of ASAP could include determining allele-specific genomic interactions (4C) or separating a mixture of DNA from two sources.
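The per-read decision, genome-specific versus common, might be sketched as a comparison of best-alignment mismatch counts against the two genomes. This is a hypothetical simplification; ASAP's actual criteria are not detailed in the abstract:

```python
def assign_allele(m1, m2):
    """Classify a read given its best-alignment mismatch count against
    genome 1 (m1) and genome 2 (m2); None means no alignment found."""
    if m1 is None and m2 is None:
        return "unmapped"
    if m2 is None or (m1 is not None and m1 < m2):
        return "genome1"
    if m1 is None or m2 < m1:
        return "genome2"
    return "common"   # equal scores, e.g. no discriminating SNP under the read

print(assign_allele(0, 1))   # read spans a SNP matching genome 1
print(assign_allele(2, 2))   # no discriminating SNP: counted as common
```

In an imprinting cross, "genome1" and "genome2" would correspond to the two parental backgrounds, and the "common" fraction reflects reads without an informative SNP.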
 
Poster U60
Performing Functional Site Prediction on Protein Sequences Generated by De Novo Transcriptome Assembly

Chien-Yu Chen National Taiwan University
Ting-Ying Chien (National Taiwan University, Computer Science and Information Engineering);
 
Short Abstract: Recently, research projects aimed at analyzing transcriptome data using massively parallel sequencing have accumulated an abundance of partial gene sequences for non-model organisms. In general, sequence alignment is good at searching for potential homologues for each of the assembled sequences. However, functional site annotation on such sequences is considerably challenging since only partial sequences are available. Without a good reference, it is hard to tell whether the sequence of interest is long enough to fold into a complete functional site. In this study, we present a motif-based approach for annotating protein sequences that answers whether the assembled sequences are long enough to constitute a functional site. The proposed method was realized on the problem of predicting catalytic sites of enzymes. First, all the enzyme sequences curated in the UniProt database were collected. Motif discovery was conducted on each enzyme family by invoking WildSpan, a sequential pattern mining tool for discovering structured motifs (called W-patterns). It was demonstrated in our recently published paper that W-patterns outperform PROSITE patterns on protein family prediction in terms of both sensitivity and specificity. Though the pattern components are largely separated in sequence, they form the catalytic site when the sequence is folded. In this regard, it is expected that matching one of the W-patterns answers whether the assembled sequences are long enough for further investigation of enzyme behavior. We conclude that the proposed method serves as a good complementary tool to sequence alignment for annotating protein sequences generated by de novo transcriptome assembly.
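A structured motif of the kind described, ordered sequence blocks separated by flexible gaps that come together in the fold, can be approximated with a gapped regular expression. The motif blocks and gap limit below are made up for illustration; WildSpan's actual W-pattern model is richer:

```python
import re

def matches_structured_motif(components, max_gap, protein):
    """True if the protein contains all motif components in order,
    with at most max_gap residues between consecutive components."""
    pattern = (".{0,%d}" % max_gap).join(re.escape(c) for c in components)
    return re.search(pattern, protein) is not None

# Hypothetical three-block motif whose parts assemble in the folded structure
print(matches_structured_motif(["GHS", "DW", "HG"], 20,
                               "MKTAYGHSLLVVAAADWQQQQQQHGITR"))   # True
```

A partial assembly that is missing the last block, or presents the blocks out of order, fails the match, which is exactly the "long enough for a complete functional site" question.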
 
Poster U61
MEGA5: Molecular Evolutionary Genetics Analysis using Maximum Likelihood, Evolutionary Distance, and Maximum Parsimony Methods

Sudhir Kumar The Biodesign Institute, Arizona State University
 
Short Abstract: Comparative analysis of molecular sequence data is essential for reconstructing the evolutionary histories of species and inferring the nature and extent of selective forces shaping the evolution of genes and species. We have developed MEGA5 (Molecular Evolutionary Genetics Analysis version 5), user-friendly software for mining online databases, building sequence alignments and phylogenetic trees, and applying methods of evolutionary bioinformatics in basic biology, biomedicine, and evolution. The newest addition in MEGA5 is a collection of Maximum Likelihood (ML) analyses for inferring evolutionary trees, selecting best-fit substitution models (nucleotide or amino acid), inferring ancestral states and sequences (along with probabilities), and estimating evolutionary rates site-by-site. In computer simulation analyses, ML tree inference algorithms in MEGA5 compared favorably with other software packages in terms of computational efficiency and the accuracy of the estimates of phylogenetic trees, substitution parameters, and rate variation among sites. The MEGA user interface has now been enhanced to be activity-driven, making it easier to use for both beginners and experienced scientists. This version of MEGA is intended for the Windows platform, and it has been configured for effective use on Mac OS X and Linux desktops. It is available free of charge from www.megasoftware.net.
 
Poster U62
Twenty-Five Fold Diversity Overestimation Revealed by Analyzing a MOCK Sample using 454 Amplicon Pyrosequencing

Sebastian Jünemann University of Bielefeld
Dag Harmsen (University Hospital of Münster, Department of Periodontology); Karola Prior (University Hospital of Münster, Department of Periodontology); Alexander Goesmann (University of Bielefeld, Center for Biotechnology); Jens Stoye (University of Bielefeld, Center for Biotechnology);
 
Short Abstract: Introduction: Analysis based on 16S rDNA pyrotag sequencing constitutes a common technique in metagenomic studies but is subject to as-yet insufficiently examined side effects. In this study we utilized new software for creating and executing flexible pipelines and analyzed a constructed metagenome based on a known MOCK community with respect to the detection of PCR and sequencing errors.

Methods: DNA of 26 16S ribosomal RNA operons from different whole-genome-sequenced GenBank entries of type strains was extracted and subsequently amplified by PCR targeting the V5 and V6 hypervariable regions. Multiplex identifiers (MIDs) were ligated to both the forward and the reverse primers. Two replicates of this MOCK sample were sequenced in parallel on the 454 GS FLX Titanium platform. Using the Conveyor engine [Linke et al. 2011], a collection of flexible pipelines was developed covering typical bioinformatic workflows, i.e. primer trimming, sequence denoising, quality filtering, OTU clustering, taxonomic classification, and ecological statistics. Computationally intensive routines, such as the chimera detection algorithms Pintail, Ccode, ChimeraSlayer and Perseus, were parallelized with the aid of the Conveyor engine.

Results: About 20% of the obtained reads showed a mismatch between the forward and reverse MID, explainable only by chimeric formation during the emPCR. Chimeras were identified in 10% of the sequencing data by each algorithm individually, but surprisingly their intersection reduced this to less than 1%. The OTU clustering resulted in a 25-fold overestimation of the species diversity, with a strong bias towards specific organisms in the abundance profile, which could be reduced to less than five-fold after sequence denoising.
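The drop from ~10% chimera calls per algorithm to under 1% in their intersection is a plain set intersection over the flagged reads. With toy read IDs and illustrative (not measured) per-algorithm call sets:

```python
reads = set(range(1000))   # toy read IDs

# Each algorithm flags ~10% of reads as chimeric, but the call sets
# overlap only partially (numbers are illustrative).
pintail        = set(range(0, 100))
ccode          = set(range(90, 190))
chimera_slayer = set(range(95, 195))
perseus        = set(range(98, 198))

consensus = pintail & ccode & chimera_slayer & perseus
print(f"per algorithm: 10.0%, consensus: {100 * len(consensus) / len(reads):.1f}%")
```

This kind of disagreement between detectors is what makes the intersection so much smaller than any individual call rate.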
 
Poster U63
Effect of adding homologs in multiple sequence alignment and phylogenetic analysis

Kazutaka Katoh National Institute of Advanced Industrial Science and Technology
Christian Ledergerber (ETH Zurich, Computer Science); Christophe Dessimoz (ETH Zurich, Computer Science); Manuel Gil (ETH Zurich, Computer Science);
 
Short Abstract: The inclusion of additional homologous sequences is generally believed to improve the accuracy of phylogenetic inference and multiple sequence alignment (MSA). Building upon a phylogeny-based benchmarking approach introduced recently, we quantitatively examined the validity of this belief by evaluating both MSA and phylogenetic tree inference.

In order to clarify the effects of homologs at the MSA step and at the tree inference step separately, two types of tests, (1) Enriched and (2) Impoverished, were performed. In the (1) Enriched test, the entire MSA, containing additional homologs, was used to infer a tree. Its result reflects the total effect of homologs on the MSA step and the tree inference step. In the (2) Impoverished test, the additional homologs were included in the MSA step but excluded from the tree inference step. Its result is expected to reflect the effect of homologs specifically on the MSA calculation. In addition, the effect of homologs specifically on the tree inference step was assessed by using (3) the difference between the results of Enriched and Impoverished.

We examined several combinations of different MSA methods and tree inference methods. The results suggest that additional homologs do not improve the quality of MSA in general, but improve the resulting tree in most cases. This benchmark also provides practical guidelines, for example, an appropriate similarity level of homologs to be included into a phylogenetic analysis.
 
Poster U64
ADDAPTS: A Data-Driven Automated Pipeline and Tracking System

Risha Narayan University of Cambridge (UK)
Kim Rutherford (University of Cambridge, Department of Plant Sciences); Rishi Nag (University of Cambridge, Department of Plant Sciences); Krys Kelly (University of Cambridge, Department of Plant Sciences);
 
Short Abstract: Much effort goes into handling the large amount of data produced by current sequencing technologies. In order to manage the raw data and the metadata, we saw a need for a database-backed automated processing pipeline with a web front-end. We are developing ADDAPTS, a Data-Driven Automated Pipeline and Tracking System.

The ADDAPTS relational database stores metadata about the samples and their associated data files. A controller process monitors the database and starts new pipeline jobs when appropriate. The dependencies between pipeline tasks are configured in the database itself, and can be modified without pausing the pipeline.

The pipeline uses raw sequencing output files in FASTQ format as input; the full analysis results in alignments viewable in the GBrowse genome viewer. The pipeline performs the following processes: it de-multiplexes files from multiplexed sequencing runs; removes small RNA adapters or clips sequence reads as appropriate; filters reads by size; generates output files and statistics for each stage of the analysis; aligns the reads against an appropriate reference genome; converts the resulting alignment files into several formats, such as GFF3, SAM and BAM; and generates BAM indexes for GBrowse. It supports data generated using Illumina and 454 sequencing technologies, and various sequencing applications (e.g. small RNA, RNA-seq, ChIP-seq, genomic DNA).
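A database of task dependencies driving execution order, as described above, amounts to a topological sort over the pipeline steps. A minimal sketch with hypothetical task names (not the actual ADDAPTS schema):

```python
from graphlib import TopologicalSorter

# Task dependencies as they might be stored in the ADDAPTS database:
# each task maps to the tasks that must complete before it can start.
deps = {
    "demultiplex": [],
    "trim_adapters": ["demultiplex"],
    "filter_by_size": ["trim_adapters"],
    "align": ["filter_by_size"],
    "convert_formats": ["align"],        # GFF3 / SAM / BAM
    "index_for_gbrowse": ["convert_formats"],
}

# The controller could start jobs in any order consistent with this.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

Because the dependencies live in data rather than code, editing the `deps` table changes the pipeline without pausing it, matching the design described above.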

The tracking system provides a web front-end for entering, viewing and editing metadata. It also generates charts and statistics for each sample, provides links to the files generated from the pipeline analysis, and generates global reports for all samples.
 
Poster U65
Fast and Accurate RNA-Seq alignments with PALMapper

Geraldine Jean Max Planck Society
Gunnar Raetsch (Max Planck Society, Friedrich Miescher Laboratory); Andre Kahles (Max Planck Society, Friedrich Miescher Laboratory); Soeren Sonnenburg (Berlin Institute of Technology, Machine Learning Group); Korbinian Schneeberger (Max Planck Institute for Plant Breeding Research, Plant Developmental Biology); Joerg Hagmann (Max Planck Institute for Developmental Biology, Molecular Biology); Fabio De Bona (Max Planck Society, Friedrich Miescher Laboratory); Detlef Weigel (Max Planck Institute for Developmental Biology, Molecular Biology);
 
Short Abstract: Short mRNA sequences produced by RNA-Seq enhance transcriptome analysis and promise great opportunities for the discovery of new genes and the identification of alternative transcripts. However, the sheer amount of high throughput sequencing data requires efficient methods for accurate spliced alignment of reads against the reference genome, which is further challenged by the size and quality of the sequence reads. We present an original RNA-Seq read mapper, called PALMapper, that combines a faster extension of the highly accurate alignment method QPALMA with the fast short read aligner GenomeMapper. PALMapper quickly carries out an initial read mapping which then guides a banded semi-global alignment algorithm that allows for long gaps corresponding to introns. PALMapper drastically improves the speed of QPALMA (around 50 times faster) and still computes both spliced and unspliced alignments at high accuracy by taking advantage of base quality information and computational splice site predictions. Moreover, PALMapper is under active development and offers a growing pool of features, such as polyA trimming or non-canonical splice site support, which can further improve alignment accuracy for specific downstream studies. Finally, PALMapper does not rely on any annotation but is able to remap reads against an inferred splice junction database. This strategy, applied to simulated data from C. elegans, increases the number of correct spliced alignments from 89% to 92% while the number of incorrect alignments decreases by 27%. On the same dataset, we show that PALMapper outperforms GSNAP and TopHat, two other widely used alignment tools.
 
Poster U66
Comparison of short read aligners for ABI SOLiD platform

Viktor Stranecky Charles University - 1st Medical Faculty
 
Short Abstract: Exome sequencing is a rapidly expanding technique. However, the analysis of these datasets still poses particular challenges, and analytical strategies are under development. We evaluated the performance, specificity and sensitivity of frequently used aligners and SNP/indel callers on ABI SOLiD datasets.
 

Attention Poster Authors: The ideal poster size should be max. 1.30 m (130 cm) high x 0.90 m (90 cm) wide. Fasteners (Velcro / double-sided tape) will be provided at the site; please DO NOT bring tape, tacks or pins. View a diagram of the poster board here




Posters Display Schedule:

Odd Numbered posters:
  • Set-up timeframe: Sunday, July 17, 7:30 a.m. - 10:00 a.m.
  • Author poster presentations: Monday, July 18, 12:40 p.m. - 2:30 p.m.
  • Removal timeframe: Monday, July 18, 2:30 p.m. - 3:30 p.m.*
Even Numbered posters:
  • Set-up timeframe: Monday, July 18, 3:30 p.m. - 4:30 p.m.
  • Author poster presentations: Tuesday, July 19, 12:40 p.m. - 2:30 p.m.
  • Removal timeframe: Tuesday, July 19, 2:30 p.m. - 4:00 p.m.*
* Posters that are not removed by the designated time may be taken down by the organizers and discarded. Please be sure to remove your poster within the stated timeframe.

Delegate Posters Viewing Schedule

Odd Numbered posters:
On display Sunday, July 17, 10:00 a.m. through Monday, July 18, 2:30 p.m.
Author presentations will take place Monday, July 18: 12:40 p.m.-2:30 p.m.

Even Numbered posters:
On display Monday, July 18, 4:30 p.m. through Tuesday, July 19, 2:30 p.m.
Author presentations will take place Tuesday, July 19: 12:40 p.m.-2:30 p.m.





Want to print a poster in Vienna? Try these options:

Repacopy, next to the congress venue [MAP]

Also at Karlsplatz, in the Ring Center, Kärntner Str. 42 [MAP]

If you need your poster on thicker material, you may also use a plotter service next to Karlsplatz: http://schiessling.at/portfolio/


