19th Annual International Conference on
Intelligent Systems for Molecular Biology and
10th European Conference on Computational Biology


Accepted Posters

Category 'J' - Genomics
Poster J01
Microsoft Biology Initiative

Beatriz Diaz Acosta Microsoft Research
 
Short Abstract: The Microsoft Biology Initiative (MBI) is an effort in Microsoft Research to bring new technology and tools to the area of bioinformatics and biology, and comprises two primary components: the Microsoft Biology Foundation (MBF), an open source .NET library of bioinformatics functions, and the Microsoft Biology Tools (MBT).


Poster J02
NUPTs and NUMTs from algae to land plants - a genomic analysis in 12 different species

Noa Sela Ludwig Maximilians University Munich
Dario Leister (Ludwig Maximilians University Munich, Botanic);
 
Short Abstract: Genomic insertion of organellar DNA into the nuclear genome is an ongoing evolutionary process. In this work we present an analysis of these insertions in 12 plant species spanning the evolutionary tree from algae to land plants. Our analysis indicates their importance in creating transcriptomic novelties in plants.


Poster J03
Dissecting a massive parallel sequencing workflow for quantitative miRNA expression analysis

Raffaele Calogero Università di Torino
Francesca Cordero (University of Torino, Dept. of Clinical and Biological Sciences); Marco Beccuti (University of Torino, Dept. of Informatics); Susanna Donatelli (University of Torino, Dept. of Informatics); Maddalena Arigoni (University of Torino, Molecular Biotechnology Center);
 
Short Abstract: Background
Massively parallel sequencing (MPS) methods can extend and improve the knowledge obtained by conventional microarray technology, both for mRNAs and short non-coding RNAs. The processing methods used to extract and interpret the information are an important aspect of dealing with the vast amounts of data generated from short read sequencing. The mapping, counting and characterization of the short sequence reads constitute a bottleneck in data analysis. Although the number of computational tools for MPS data analysis is constantly growing, their strengths and weaknesses as parts of a complex analytical pipeline have not yet been well investigated.
Results
A benchmark MPS miRNA dataset, resembling a situation in which microRNAs are spiked into biological replication experiments, was assembled. Using this dataset, we highlighted the strengths and weaknesses of the major steps (i.e. choice of reference sequence set, alignment to the reference, and differential expression detection) in the MPS analysis workflow for detecting differential expression of microRNAs.
Conclusions
The use of a precursor microRNAs set as a reference sequence for short reads alignment reduces computation time and performs better in read counting compared to when the whole unmasked genome is used. Moreover, the use of the precursor microRNAs set as reference sequence does not require application of any peak segmentation algorithm. Within the investigated alignment tools, SHRiMP and MicroRazerS show the highest detection rate for spike-in microRNAs. Among the methods developed for digital sequence differential expression detection, the statistics implemented in the baySeq package show the best detection performance.
 
Poster J04
Boosting the Principal Component Analysis for explorative analysis of genome-wide expressions reveals a common developmental pattern in human and mouse datasets

Carlo Cannistraci King Abdullah University of Science and Technology (KAUST)
Carlo Vittorio Cannistraci (King Abdullah University of Science and Technology (KAUST), Division of Chemical & Life Sciences and Engineering; Division of Applied Mathematics and Computer Sciences); Timothy Ravasi (King Abdullah University of Science and Technology (KAUST), Division of Chemical & Life Sciences and Engineering; Division of Applied Mathematics and Computer Sciences);
 
Short Abstract: Machine-learning methods are widely employed to mine genome-wide expression data. However, linear approaches such as principal component analysis (PCA) have shown limitations in revealing the information hidden in several datasets. Although partial solutions can be adopted, such as integrating multiple sources of biological information with the expression data and/or using different nonlinear machine-learning methods, these might introduce new costs and difficulties.
We encourage the computational biology community to consider this issue from a different angle: is it possible to squeeze out more information using just the expression data and applying an unsupervised approach that is (like PCA) parameter-free? And to the machine-learning community we propose ‘the boosting PCA conjecture’: is it possible to define classes of unsupervised and parameter-free algorithms or procedures that boost the performance of PCA?
A first algorithm, discrete cosine transform PCA (DPCA), and its promising performance are discussed. We performed data-mining analyses of 5 datasets (3 human and 2 mouse) composed of genome-wide expression measurements across different tissues.
DPCA attained tissue developmental germ-layer discrimination (ectoderm, mesoderm, endoderm) with higher than 80% accuracy (evaluated by clustering in the bi-dimensional visualization space) in each of the 5 datasets, suggesting the presence of a common developmental pattern in human and mouse gene expression.
DPCA was also efficient in compressing the information, allowing the identification of a cohort of MES1/HOX transcription-factor genes essential for human germ-layer discrimination and fundamental in normal and cancer development. Interestingly, the MES1/HOX cohort contains and largely extends the one identified in a previous study using protein-protein interaction network knowledge, indicating the efficiency of DPCA in extracting hidden biological information starting only from the expression measurements.
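The two parameter-free steps named above (a discrete cosine transform of each expression profile, followed by ordinary PCA) can be sketched as follows. This is an illustrative reconstruction with toy data, not the authors' implementation:

```python
# Hypothetical sketch of a DCT-then-PCA pipeline; toy data only.
import numpy as np
from scipy.fftpack import dct
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5000))   # toy data: 60 tissue samples x 5000 genes

# Step 1: discrete cosine transform of each sample's expression profile
# (parameter-free, like PCA itself).
X_dct = dct(X, type=2, norm='ortho', axis=1)

# Step 2: ordinary PCA on the transformed profiles, keeping two
# components for a bi-dimensional visualization space.
coords = PCA(n_components=2).fit_transform(X_dct)
print(coords.shape)  # (60, 2): one 2-D point per sample for clustering/plotting
```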
 
Poster J05
Bio-NGS: BioRuby plugin to conduct programmable workflows for Next Generation Sequencing data

Francesco Strozzi Parco Tecnologico Padano
Raoul Bonnal (Istituto Nazionale Genetica Molecolare (INGM), Integrative Biology Program); Toshiaki Katayama (Human Genome Center University of Tokyo, Laboratory of Genome Database); Alessandra Stella (Parco Tecnologico Padano, CeRSA - Bioinformatics Group); Massimiliano Pagani (Istituto Nazionale Genetica Molecolare (INGM), Integrative Biology Program);
 
Short Abstract: BioRuby is a well-established bioinformatics library for the Ruby programming language. Here we present a new package for BioRuby to perform Next Generation Sequencing (NGS) analyses, based on a recently introduced plugin system which allows users to extend the core library with new functionality. Tools and libraries have been written in other languages using different approaches, but Ruby facilitates the development of a light, flexible and customizable solution to face the new challenges of NGS data analysis. The NGS plugin provides a common interface to standard software such as BWA, Bowtie, TopHat, Cufflinks, SAMtools and many others. The plugin provides a task management framework, which encapsulates command line options and dependencies for the tasks to be executed, including pre- and post-processing tasks such as quality control of input data and visualization of the results. Users are not limited to the predefined tasks but can also define their own analysis workflows. For this purpose, Ruby is perfectly suited as a domain specific language (DSL) given its flexible and clean syntax. Users can easily develop their custom workflows with popular Ruby DSLs like Rake and Thor. Every operation submitted is recorded using a built-in history manager that stores all the settings and parameters used for a given procedure. Tasks can also be submitted as parallel jobs in different environments such as computer clusters and multi-core machines. Bio-NGS provides a reusable yet customizable command-line system for demanding NGS data analysis and workflow management.
 
Poster J06
Towards validating GWAS findings using orthologous genotype and phenotype data

Tim Beck University of Leicester
Gudmundur Thorisson (University of Leicester, Department of Genetics); Anthony Brookes (University of Leicester, Department of Genetics);
 
Short Abstract: Translational medical research increasingly focuses upon model organisms to gain insight into the etiology of human disease – for example, the use of mouse phenotyping studies to validate Genome Wide Association Studies (GWAS). To support this line of work, we have developed a ‘one-click’ informatics pipeline that provides scientists with a starting point for addressing human and mouse orthologous genotype and phenotype questions.

Recently, there has been a rapid increase in the number of GWAS being performed and published. The most recent release of GWAS Central (http://www.gwascentral.org) provides data for over 700 studies where disease and genetic marker associations are available. In parallel, the amount of mouse genotype-phenotype data has dramatically increased with the success of high-throughput phenotyping programmes. Large-scale international informatics projects, such as GEN2PHEN (http://www.gen2phen.org), emphasise concerted efforts to make phenotypic and genetic data freely available and interoperable through the promotion of community data standards. A key informatics challenge is now to tap into these large, well-organised, and accessible genotype and phenotype datasets, to facilitate cross-species biological knowledge discovery.

Starting from a list of genes of interest, the pipeline we have developed queries remote datasets and returns: orthologous genes; orthologous diseases and phenotypes; and occurrences in GWAS. The pipeline will have applications in the validation of GWAS in model organisms, the identification of candidate disease-related genes and phenotypes, and the prioritization of genes for further investigation.

Funded by the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement 200754, GEN2PHEN.
 
Poster J07
Genome wave adjustment using a statistical model

Jiqiu Cheng Katholieke Universiteit Leuven
Peter Konings (Katholieke Universiteit Leuven, Electrical Engineering department); Bernard Thienpont (Babraham Institute, Laboratory of Molecular Signalling & Laboratory of Developmental Genetics and Imprinting); Joris Vermeesch (Katholieke Universiteit Leuven, Center for Human Genetics); Yves Moreau (Katholieke Universiteit Leuven, Electrical Engineering);
 
Short Abstract: Chromosomal aneuploidy acquired in single cells of human cleavage-stage embryos plays an important role in low human fecundity and constitutional chromosomal disorders. Array CGH technologies have been employed to detect chromosomal aneuploidies. However, single-cell genome profiles frequently show a strong wave pattern, especially at the two ends of a chromosome. This genome wave pattern obscures the real chromosomal aneuploidies and yields a high rate of false detections.

The aim of this study was to present a statistical model which efficiently adjusts for the undesired genome wave. A weighted regression model was set up with the array CGH log2-ratio as the response variable and the corresponding GC percentage as the independent variable. Larger weights were assigned to GC-rich genome regions.

The algorithm was applied to a single-cell oligo-array CGH experiment for copy number variation (CNV) detection. The experiment consisted of 8 single Epstein-Barr virus (EBV) transformed lymphoblastoid cells run on Agilent 244K arrays. The Circular Binary Segmentation algorithm was used to compare CNV detection before and after genome profile correction. The results showed that the genome wave patterns were markedly reduced after wave correction. The True Positive Rate (TPR) and the False Positive Rate (FPR) were calculated across the 8 EBV cells before and after wave correction. The FPR was significantly reduced, from a median of 0.13 to 0.02, at the 0.001 significance level. This result indicates that our algorithm efficiently removes genomic wave patterns without losing real biological aberrations.
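A minimal sketch of such a GC-based wave correction, assuming a linear weighted least-squares fit and a simple GC-proportional weighting (the abstract does not specify the exact weight function):

```python
# Illustrative sketch of wave correction by weighted regression of
# log2-ratios on GC content; the weighting scheme is an assumption.
import numpy as np

def correct_wave(log2_ratio, gc_percent):
    """Return log2-ratios with the GC-driven wave component removed."""
    # Larger weights for GC-rich probes, as described in the abstract;
    # the exact weight function is a guess.
    w = gc_percent / gc_percent.sum()
    # Weighted least-squares fit: log2_ratio ~ a + b * gc_percent
    A = np.column_stack([np.ones_like(gc_percent), gc_percent])
    W = np.diag(w)
    beta = np.linalg.solve(A.T @ W @ A, A.T @ W @ log2_ratio)
    return log2_ratio - A @ beta   # residuals = wave-corrected profile

rng = np.random.default_rng(1)
gc = rng.uniform(30, 70, size=1000)             # toy GC percentages per probe
wave = 0.02 * (gc - gc.mean())                  # synthetic GC-dependent wave
signal = wave + rng.normal(scale=0.1, size=1000)
corrected = correct_wave(signal, gc)
print(round(np.corrcoef(corrected, gc)[0, 1], 3))  # close to 0: wave removed
```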
 
Poster J08
Sequence assembly in the clouds: Distributed assembly of the transcribed part of the genome of bread wheat

Andreas Schreiber University of Adelaide
Delphine Fleury (University of Adelaide, Australian Centre for Plant Functional Genomics); Peter Langridge (University of Adelaide, Australian Centre for Plant Functional Genomics); Ute Baumann (University of Adelaide, Australian Centre for Plant Functional Genomics); Matthew Hayden (Department of Primary Industries Victoria, Victorian AgriBiosciences Centre); Kerrie Forrest (Department of Primary Industries Victoria, Victorian AgriBiosciences Centre); Stephan Kong (Department of Primary Industries Victoria, Victorian AgriBiosciences Centre);
 
Short Abstract: Rapidly declining costs of DNA sequencing technology have led to a spectacular increase in the number of species for which complete genome assemblies are available. However, because of their complexity, the reliable assembly of plant genomes remains a major challenge. Bread wheat, for example, has a genome of about 15 Gb with an estimated repetitive content of 80-90%. Moreover, the genome is hexaploid, derived from three closely related diploid ancestors, and thus contains three so-called ‘homoeologous’ copies of each chromosome. While a multinational effort to sequence its chromosome arms one by one is slowly starting to yield results, we have assembled the entire transcribed portion of the complete hexaploid wheat genome in one pass. Using 454 as well as Illumina paired-end sequencing, we performed a two-stage assembly: first, we used the Velvet/Oases algorithms to group reads into tens of thousands of clusters representing individual gene families and homoeologous gene copies. We show that, while fast and relatively memory-efficient, these algorithms are not sufficiently accurate to disentangle the homoeologous copies of individual genes. In the second stage, we therefore assembled each read cluster individually using the high-fidelity, but memory-intensive, assembly algorithm MIRA on a compute cloud. We show that this yields a homoeolog-specific assembly for more than 90% of the transcribed genes. We conclude that this two-step procedure provides a reliable method for assembling even the most complicated polyploid eukaryotic transcriptomes.
 
Poster J09
Rice-Map: a new-generation rice genome browser

Ge Gao Peking University
Jun Wang (Peking University, Center for Bioinformatics); Lei Kong (Peking University, Center for Bioinformatics);
 
Short Abstract: Based on next-generation web technologies and high-throughput experimental data, we have developed Rice-Map, a novel genome browser for researchers to navigate, analyze and annotate the genomes of the two rice subspecies (Oryza sativa L. ssp. japonica and Oryza sativa L. ssp. indica) interactively. By smoothly scrolling, dragging and zooming, users can browse various genomic features simultaneously at multiple scales.

More than one hundred annotation tracks (81 for japonica and 82 for indica) have been compiled and loaded into Rice-Map, covering gene models, transcript evidence, expression data, epigenetics data, cross-species homologies, genetic markers and other genomic features. Besides these pre-computed tracks, Rice-Map provides a User-Defined Annotation mechanism for users to add their own annotations instantly and share them with the research community, effectively supporting collaborative work. In addition to annotation browsing, on-the-fly analysis of selected entries can be performed through dedicated bioinformatic analysis platforms. Furthermore, Rice-Map offers a BioMart-powered data warehouse for advanced users to fetch bulk datasets based on complex criteria.

Rice-Map delivers abundant, up-to-date japonica and indica annotations, providing a valuable resource for both computational and bench biologists. Rice-Map is publicly accessible at http://www.ricemap.org/, with all data freely available for download.
 
Poster J10
Modular evolution of proteins – an integrated case study in a dense taxonomic group of 20 arthropods

Erich Bornberg-Bauer University of Muenster
Sonja Grath (University of Muenster) Andrew Moore (University of Muenster, Biology); Andreas Schueler (University of Muenster, Biology);
 
Short Abstract: It is well established that proteins evolve by a series of domain-wise events involving fusion, fission and terminal loss. To date, most studies have measured rates of domain-wise evolution across large datasets. This complicates the inference of exact rates of rearrangement events, as the taxonomic resolution is low. Furthermore, past studies have focused on specific aspects of modular evolution, complicating the integration of mechanisms and rates into a holistic view. With the completed sequencing of 12 Drosophila genomes, and the further addition of other arthropod genomes, an exceptionally well-resolved dataset has arisen, allowing detailed modeling of modular domain-wise evolution. Here, we reconstruct ancestral proteomes and quantify the events that facilitate modular protein evolution across a well-resolved tree of 20 arthropods and 2 outgroups. We quantify the number of competing solutions for novel arrangements, establish an upper boundary for the convergent formation of protein domain arrangements within the tree, and explore the functional impact of different events.
 
Poster J11
The immense genomic variability of human olfactory receptor genes

Tsviya Olender The Weizmann Institute
Ifat Keydar (The Weizmann Institute, Molecular Genetics); Sebastian Waszak (Swiss Federal Institute of Technology Lausanne, Laboratory of Systems Biology and Genetics); Miriam Khen (The Weizmann Institute, Molecular Genetics); Noam Nativ (The Weizmann Institute, Molecular Genetics); Edna Ben-Asher (The Weizmann Institute, Molecular Genetics); Doron Lancet (The Weizmann Institute, Molecular Genetics);
 
Short Abstract: Olfactory receptors (ORs) constitute the largest gene family in the mammalian genome, and underlie odorant perception. Humans have 855 OR coding regions, of which 55% are pseudogenes. These are available at our Human Olfactory Receptor Data Explorer (HORDE, http://bioportal.weizmann.ac.il/HORDE), along with complete OR repertoires from other vertebrates. The new version 43 uses an object/relational CakePHP-based architecture, with a novel user interface and more powerful search, allowing more effective navigation of the OR universe. A much broader compendium of human OR genomic variations is provided, including >6000 SNPs and ~630 CNVs, with integration of several public sources including dbSNP, the 1000 Genomes Project (1000GP), the Database of Genomic Variants, and several sequenced individual genomes. A major source of CNVs is the CopySeq algorithm (Waszak et al., PLoS Comp Biol 2010), applied to 1000GP data. For olfactory functional studies we focus on deleterious variations: 285 frame-disrupting SNPs in 210 OR genes, and 72 deletion CNVs in intact ORs, stemming from our recent extensive genomic variation survey. Twenty-five of the deleterious SNPs are annotated as pseudogenes in the reference genome, and are “resurrected” by our discovery of an intact allele. The above variations are excellent functional candidates, and hence are used in genetic association studies currently performed in available human cohorts phenotyped for odorant-related phenotypes, aimed at discovering the odorant specificity of individual OR proteins. Altogether, we find that ~50% of OR genes segregate in the human population between an intact and a disrupted allele, an unprecedented degree of functional variability in a multi-gene superfamily.
 
Poster J12
ChIP-Seq Peak Prioritization Pipeline

Yu-Hsuan Lin University of Michigan
Jaeseok Han (University of Michigan, Biological Chemistry); Maureen Sartor (University of Michigan, Biostatistics);
 
Short Abstract: Chromatin immunoprecipitation followed by deep sequencing (ChIP-Seq) is increasingly employed to identify genome-wide in vivo protein-DNA interactions or histone modifications. While most peak calling programs report the statistical significance that a region is bound by the protein of interest, little effort is devoted to prioritizing peaks based on how likely the binding is to be functional. A growing number of software applications have been developed and shown to successfully identify bound regions from ChIP-Seq experiments. While they generally agree closely for strong binding signals, they can vary substantially for weaker binding signals. This may partially be because most applications neither model variation among replicates/samples nor exploit additional external sources of information.
Motivated by this issue, we developed a ChIP-Seq analysis pipeline that uses a sliding window approach with a negative binomial model to capture variation among replicates. We demonstrate here that our pipeline's performance is substantially improved by taking the variance among replicates into account, as opposed to concatenating reads. Using ChIP-Seq data from wildtype and knockout MEF samples for the transcription factor ATF4, we show that our pipeline identified more peaks than an alternative peak-finding package, ERANGE, without sacrificing the percentage of peaks that contained the canonical motif. While accounting for sample-to-sample variation may help to prioritize functional sites, several additional genomic annotations and external knowledge sources may also help. We show how we are expanding our pipeline to prioritize peaks based on additional sources of information using Metropolis-Hastings sampling.
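As an illustration of the windowed negative binomial idea (not the authors' pipeline), one can estimate NB parameters per window from control replicates by the method of moments and score the treatment counts against them:

```python
# Toy sketch: score sliding windows with a negative binomial model
# estimated from replicate counts. Data and thresholds are synthetic.
import numpy as np
from scipy.stats import nbinom

def window_pvalues(control_reps, treatment):
    """control_reps: (n_reps, n_windows) counts; treatment: (n_windows,)."""
    m = control_reps.mean(axis=0)
    v = control_reps.var(axis=0, ddof=1)
    v = np.maximum(v, m + 0.5)        # crude floor: NB requires var > mean
    p = m / v                          # method-of-moments NB parameters
    n = m * m / (v - m)
    # Upper-tail p-value: probability of seeing >= the observed count
    return nbinom.sf(treatment - 1, n, p)

rng = np.random.default_rng(2)
ctrl = rng.poisson(10, size=(3, 500))  # 3 replicates, 500 windows
treat = rng.poisson(10, size=500)
treat[42] = 60                         # one strongly "bound" window
pvals = window_pvalues(ctrl, treat)
print(int(pvals.argmin()))             # -> 42, the spiked window
```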
 
Poster J13
A simple and fast approach to evaluate de novo transcriptome assembly

Cory Brouwer University of North Carolina at Charlotte
Raad Gharaibeh (University of North Carolina at Charlotte, Bioinformatics and Genomics); Ann Loraine (University of North Carolina at Charlotte, Bioinformatics and Genomics);
 
Short Abstract: De novo assembly of an organism’s transcriptome is a challenging task. The primary problem lies at the heart of the assembly process itself: treating transcriptome assembly as a genomic assembly problem violates the assumptions of most assemblers. The fact that transcriptome datasets usually contain genomic contamination, alternative splice sites, repetitive sequences, paralogous genes and isoforms, and have uneven coverage, complicates the assembly process and presents a serious challenge to many of the available assemblers. In the absence of a comprehensive EST database, reference genome or transcriptome, assessing the quality of de novo transcriptome assembly for non-model organisms is an important issue that needs to be addressed systematically. While a genome assembly can be quickly assessed with well-developed metrics (N50, number of contigs, total size of contigs), the applicability of these metrics to transcriptome assembly is still a matter of debate, and there is no consensus on how to evaluate a de novo transcriptome assembly. Here, we present a method to assess de novo transcriptome assembly that requires minimal effort, relies on existing software and does not involve a database system. It enables researchers to find an optimal assembly or a set of optimal assemblies (out of many) to consider for further optimization or analysis. We demonstrate the applicability of our method to assessing de novo transcriptome assembly for both model (Arabidopsis thaliana) and non-model (Vaccinium corymbosum, Southern Highbush Blueberry) organisms.
 
Poster J14
A File System Method For High Performance Sequence Interval Queries of Very Large Datasets

Steve Karcz Agriculture and Agri-Food Canada
Steven Karcz (Agriculture and Agri-Food Canada) Matthew Links (Agriculture and Agri-Food Canada, Genomics and Bioinformatics); Isobel Parkin (Agriculture and Agri-Food Canada, Genomics and Bioinformatics);
 
Short Abstract: Studies of genome variation have accelerated with the application of new DNA sequencing technologies. These studies benefit enormously from the ability to execute arbitrary sequence interval queries across reference genomes. However, this capability is in jeopardy as repositories of sequence variation grow, largely because SQL-based sequence interval queries perform poorly once the number of variant features reaches the tens of millions. Such large numbers of variant features are already appearing in "1000 genomes" projects in Arabidopsis and humans. Therefore, new methods are required to enable the scientific community to utilize these rich sources of genome variation data for new discovery.
We have developed a file system methodology that allows high-performance sequence interval queries to be run on datasets of up to 1 billion features on commodity hardware. Performance comparisons show that the file system method returns SNP feature data 10-20 fold faster than a MySQL-based feature store over a range of sequence interval lengths (1-100 kb) and dataset sizes (32-256 million features). The file system can also be populated with feature data 5-fold faster than the MySQL sequence feature database. Furthermore, rapid SNP visualization can be easily implemented using the open source genome browser JBrowse, running on top of our file system, which functions as a dynamic drop-in data source. Taken together, these performance characteristics suggest that our method is suitable for the scalability challenges of large genome variation studies requiring high-performance sequence interval queries.
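The abstract does not describe the on-disk layout; a toy illustration of the general principle — binning features into one flat file per fixed-size genomic bin so that a query touches only a few files instead of an SQL index — might look like this (paths, bin width and format are all hypothetical):

```python
# Toy binned, file-system-backed interval store; not the authors' design.
import os

BIN = 100_000  # bin width in bp (an assumption)

def store(root, chrom, start, end, name):
    """Append a feature to every bin file its interval overlaps."""
    os.makedirs(os.path.join(root, chrom), exist_ok=True)
    for b in range(start // BIN, end // BIN + 1):
        with open(os.path.join(root, chrom, f"{b}.tsv"), "a") as fh:
            fh.write(f"{start}\t{end}\t{name}\n")

def query(root, chrom, start, end):
    """Return features overlapping [start, end), reading only the bins
    the query interval touches."""
    hits = []
    for b in range(start // BIN, end // BIN + 1):
        path = os.path.join(root, chrom, f"{b}.tsv")
        if not os.path.exists(path):
            continue
        with open(path) as fh:
            for line in fh:
                s, e, name = line.rstrip("\n").split("\t")
                if int(s) < end and int(e) > start:
                    hits.append((int(s), int(e), name))
    return sorted(set(hits))  # de-duplicate features spanning several bins

store("/tmp/featstore", "chr1", 150_200, 150_201, "snp_a")
store("/tmp/featstore", "chr1", 950_000, 950_001, "snp_b")
print(query("/tmp/featstore", "chr1", 150_000, 151_000))
# -> [(150200, 150201, 'snp_a')]
```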
 
Poster J15
MetaBinG: Using GPUs to Accelerate Metagenomic Sequence Classification

Peng Jia Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences
Lei Liu (Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences) Chaochun Wei (Shanghai Jiao Tong University, School of Life Sciences and Biotechnology);
 
Short Abstract: Metagenome sequence classification, or binning, is the procedure of assigning sequences to their source genomes. It is one of the important steps in metagenome sequencing data analysis. Although many methods exist, classifying high-throughput metagenome sequence data within a limited time is still a challenge. We present here an ultra-fast metagenomic sequence classification system (MetaBinG) using graphics processing units (GPUs). The accuracy of MetaBinG is comparable to the best existing systems, and it can classify a million 454 reads within ten minutes, more than 300 times faster than existing systems.
 
Poster J16
Interactive Microbial Analysis and Visualization with GView and GView Server

Aaron Petkau Public Health Agency of Canada - National Microbiology Laboratory
Matthew Stuart-Edwards (Public Health Agency of Canada - National Microbiology Laboratory, Bioinformatics Core); Franklin Bristow (Public Health Agency of Canada - National Microbiology Laboratory, Bioinformatics Core); Tom Matthews (Public Health Agency of Canada - National Microbiology Laboratory, Bioinformatics Core); Eric Marinier (Public Health Agency of Canada - National Microbiology Laboratory, Bioinformatics Core); Gary Van Domselaar (Public Health Agency of Canada - National Microbiology Laboratory, Bioinformatics Core); Paul Stothard (University of Alberta, Department of Agricultural, Food and Nutritional Science);
 
Short Abstract: As sequencing capabilities continue to advance, there is an increasing need to analyze, compare, view and interact with whole genome sequences using software that is intuitive and accessible to biologists with limited computing skills. The GView software package and the associated GView Server web application work together to provide biologists with an easy-to-use platform for performing sophisticated comparative genomics of microbial sequence data, and an interactive viewer for visualization and customization of the resulting data sets. GView Server uses a wizard-style web interface that is customized for each type of comparative analysis: BLAST atlas, core genome, accessory genome, signature genes, and COG categories. Users provide input data sets, as standard genomic sequence files, and customizations for the analysis in a stepwise fashion under the guidance of the GView Server wizard. Reports are provided in the form of static images or as an interactive map rendered using GView. GView provides multiple options for viewing and interacting with the results. Users are able to pan, zoom, and select genomic features as well as customize the appearance of the selections and other map elements. The continuing advance in sequencing capabilities has created new opportunities to gain biological insight from the large-scale analysis of genomic sequence data; however, realizing these opportunities requires increasingly complex computational tasks for comparative analysis and visualization. GView and GView Server are designed as simple and easy-to-use tools for performing these tasks. GView Server is available for public use at http://server.gview.ca; the standalone GView software is available at http://www.gview.ca.
 
Poster J17
Head-to-head genes play key roles in human gene coexpression network through spatial chromosome interaction and transcriptional control

Yun-Qin Chen Shanghai Center for Bioinformation Technology (SCBIT)
Yuan-Yuan Li (Shanghai Center for Bioinformation Technology (SCBIT)) Hui Yu (Shanghai Center for Bioinformation Technology (SCBIT), Systems Biology Group); Yi-Xue Li (Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Key Lab of Systems Biology);
 
Short Abstract: A head-to-head (h2h) gene pair is defined as a genomic locus in which two adjacent genes are divergently transcribed from opposite strands of DNA, with the region between the two transcription start sites (TSSs) termed the bidirectional promoter. We have previously found that h2h gene organization is ancient and conserved, and that it subjects functionally related genes to co-regulation, probably by means of shared transcription factors. Genes regulated by different bidirectional promoters were also found to be significantly coexpressed. However, the detailed transcriptional regulation mechanism of h2h genes and their role in the genome-scale regulatory network are still open to exploration.
In this work, we show that h2h genes are more central in a human gene coexpression network than random genes, and that they are strongly linked to each other. Interestingly, the coexpression neighbours of h2h genes tend to be hub nodes, and the modules composed of h2h genes and their neighbours are more compact. H2h genes and their neighbours are rich in chimeric EST data, indicating the involvement of spatial chromosome interaction during transcription, a newly discovered expression regulation mechanism. Furthermore, the binding site of a relevant transcription factor, CTCF, is over-represented in the promoters of h2h genes and their neighbours. These data suggest that h2h genes and their coexpression neighbours play crucial roles in the genome-scale coexpression network, probably as a result of both spatial chromosome interaction and transcriptional control.
 
Poster J18
MetaBinG: an ultra-fast metagenome binning system for high-throughput sequence data

Peng Jia Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences
Lei Liu (Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Key Laboratory of Systems Biology); Chaochun Wei (Shanghai Jiao Tong University, School of Life Sciences and Biotechnology);
 
Short Abstract: This poster is based on Proceedings Submission 057.
Background: Culture-independent metagenomics methods aim to sequence all genetic material recovered directly from an environment. This approach has the potential to provide a global view of a microbial community. However, one of the challenging tasks is to assign the reads or assembled contigs to classes corresponding to their source genomes. Next-generation sequencing platforms can produce millions of short reads in a single run, which poses a challenge for existing classification systems.
Result: We present a fast binning system (MetaBinG) that uses the power of GPUs. MetaBinG is at least 300-fold faster than the leading tool Phymm, with comparable accuracy.
Conclusion: MetaBinG is an ultra-fast metagenome binning system for high-throughput sequence data. As sequencing technologies progress, throughput is increasing and read lengths are growing, so the demand for fast tools to analyze huge amounts of metagenome sequence keeps rising. MetaBinG can therefore be a useful tool for metagenomic sequence classification.
 
Poster J19
SARUMAN - Using GPU programming for short read mapping

Jochen Blom Bielefeld University
Tobias Jakobi (Bielefeld University, Computational Genomics); Jens Stoye (Bielefeld University, Genome Informatics, Faculty of Technology); Alexander Goesmann (Bielefeld University, Computational Genomics);
 
Short Abstract: Motivation:
The introduction of next generation sequencing techniques, and especially the high-throughput systems Solexa (Illumina Inc.) and SOLiD (ABI), has made the mapping of short reads to reference sequences a standard application in modern bioinformatics. Short read alignment is needed for reference-based re-sequencing of complete genomes as well as for gene expression analysis based on transcriptome sequencing. Several approaches have been developed in recent years that allow fast alignment of short sequences to a given template. Methods available to date use heuristic techniques to speed up the alignments, thereby missing possible alignment positions. Furthermore, most approaches return only one best hit for every query sequence, thus losing the potentially valuable information of alternative alignment positions with identical scores.
Results:
We developed SARUMAN (Semiglobal Alignment of short Reads Using CUDA and NeedleMAN-Wunsch), a mapping approach that returns all possible alignment positions of a read in a reference sequence under a given error threshold, together with one optimal alignment for each of these positions. Alignments are computed in parallel on graphics hardware, providing a considerable speedup of this normally time-consuming step. Combining our filter algorithm with CUDA-accelerated alignments, we were able to align reads to microbial genomes in a time comparable to, or even faster than, all published approaches, while still providing an exact, complete and optimal result. At the same time, SARUMAN runs on any standard Linux PC with a CUDA-compatible graphics accelerator.
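For illustration, the semiglobal Needleman-Wunsch recurrence at the core of such an aligner (reference overhangs free, read aligned end-to-end) can be written in a few lines of plain Python; SARUMAN's CUDA kernels and filter algorithm are of course not reproduced here, and the scoring scheme is arbitrary:

```python
# Plain-Python sketch of semiglobal Needleman-Wunsch alignment:
# leading/trailing reference positions are free, read gaps are penalized.
def semiglobal(read, ref, match=1, mismatch=-1, gap=-2):
    n, m = len(read), len(ref)
    prev = [0] * (m + 1)            # row 0: free leading reference overhang
    for i in range(1, n + 1):
        cur = [prev[0] + gap]       # consuming read chars alone is penalized
        for j in range(1, m + 1):
            s = match if read[i - 1] == ref[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s,    # (mis)match
                           prev[j] + gap,      # gap opposite a read char
                           cur[j - 1] + gap))  # gap opposite a ref char
        prev = cur
    # free trailing reference overhang: best score anywhere in the last row
    best = max(range(m + 1), key=lambda j: prev[j])
    return prev[best], best         # score and reference end position

print(semiglobal("ACGT", "TTACGTTT"))  # -> (4, 6): perfect hit ending at 6
```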
 
Poster J20
Chipster genome browser: Smooth and interactive visualization of next generation sequencing data

Eija Korpelainen CSC - The Finnish IT Center for Science
Aleksi Kallio (CSC - The Finnish IT Center for Science) Massimiliano Gentile (CSC - The Finnish IT Center for Science, Application services); Taavi Hupponen (CSC - The Finnish IT Center for Science, Application services); Petri Klemelä (CSC - The Finnish IT Center for Science, Application services); Rimvydas Naktinis (CSC - The Finnish IT Center for Science, Application services);
 
Short Abstract: The Chipster genome browser enables users to view next generation sequencing (NGS) reads and results in their genomic context smoothly and interactively, using Ensembl annotations. A good user experience is achieved with Java-based standalone software, and memory requirements are kept to a minimum by real-time indexing and summarizing.

Chipster allows zooming down to nucleotide level, highlighting of SNPs and paired-end anomalies, as well as viewing spliced reads. Coverage and quality coverage are automatically calculated and users can view both total and strand-specific coverage. The supported file formats are SAM, BAM and BED. SAM files are converted to BAM, sorted and indexed, and BED files are sorted. All this is accomplished during the data import step with Chipster’s preprocessor.

In order to move to a genomic location, one can either enter a gene symbol or base pair location, or click on the chromosome image. Navigation uses common conventions: one can zoom with the mouse wheel and move sideways by dragging with the mouse. Alternatively, the arrow keys can be used for these tasks. It is possible to have several viewing panels open at the same time and, for example, view the same data in a sortable spreadsheet in another panel.

Chipster genome browser is open source, and it is freely available in Chipster Viewer (http://chipster.csc.fi/beta/). Importantly, it is also integrated with a rich collection of NGS data analysis tools in Chipster 2.0, allowing seamless viewing of analysis results.
 
Poster J21
Genome-wide chromatin remodeling at G/C-rich long nucleosome-free regions containing SP1 and MAZ binding sites

Karin Schwarzbauer Johannes Kepler University Linz
Ulrich Bodenhofer (Johannes Kepler University Linz, Institute of bioinformatics); Sepp Hochreiter (Johannes Kepler University Linz, Institute of bioinformatics);
 
Short Abstract: To gain deeper insights into the principles of cell biology, it is essential to understand how cells reorganize their genomes to induce fundamental changes in their transcriptional states. Chromatin remodeling is one of the major mechanisms for reorganizing the genome, for example when activating T cells.

We analyzed chromatin remodeling in next generation sequencing (NGS) data from resting and activated T cells. NGS makes it possible to determine a whole-genome chromatin remodeling landscape, so our analysis is not limited to promoter regions; in fact, the majority of remodeling sites were found outside promoter regions. We consider chromatin remodeling in terms of nucleosome repositioning, which can be observed most robustly in long nucleosome-free regions (LNFRs) that are occupied by nucleosomes after remodeling.

LNFR sequences were either A/T-rich or G/C-rich, with nucleosome repositioning observed in 74% of A/T-rich and in 91% of G/C-rich LNFRs. Using support vector machines with string kernels, we identified DNA sequence patterns indicating nucleosome repositioning. G/C-rich LNFRs found in the chromatin of resting T cells showed the most specific and most prominent repositioning patterns, suggesting that chromatin remodeling in resting T cells is primarily cis-based. The most indicative patterns coincide with Sp1 and Maz binding patterns such as GGGCGGG, GGGAGGG and GGGTGGG. The transcription factor Sp1 has already been associated with chromatin remodeling in promoter regions. Our results, however, attribute to Sp1 and Maz a novel role as proteins participating in DNA conformation change on a whole-genome scale.
 
Poster J22
Associating Chromatin Features with Gene Expression

Thomas Sakoparnig Swiss Federal Institute of Technology Zurich
Moritz Gerstung (Swiss Federal Institute of Technology Zurich, Biosystems Science and Engineering); Tobias Kockmann (Swiss Federal Institute of Technology Zurich, Biosystems Science and Engineering); Christian Beisel (Swiss Federal Institute of Technology Zurich, Biosystems Science and Engineering); Renato Paro (Swiss Federal Institute of Technology Zurich, Biosystems Science and Engineering); Niko Beerenwinkel (Swiss Federal Institute of Technology Zurich, Biosystems Science and Engineering);
 
Short Abstract: Epigenetic control of gene expression is key for stable cell differentiation. Recently, unsupervised learning techniques were used to identify chromatin states linked to various levels of gene expression. However, these models can neither quantitatively predict gene expression nor reveal detailed mechanistic insight into the chromatin-based control of gene expression. In this study, we quantitatively link chromatin protein binding and histone modification profiles to gene expression in Drosophila melanogaster on a genome-wide scale.
A set of one RNA-seq profile and eleven ChIP-seq profiles of chromatin proteins, DNA binding factors, and histone modifications (referred to as features) was complemented with 25 ChIP-chip profiles from the modENCODE data warehouse and subsequently used to train classification and regression models.
Classification models had a 10-fold cross-validated error rate of less than 10% on a set of over 9000 genes. Regression models achieved R^2 values around 0.7. The well-performing non-linear models, such as Random Forests and Support Vector Machines, were hard to interpret, so in a second step we looked explicitly for combinatorial effects. We computed interaction terms (pairs and triplets of features) and applied a stability selection approach. The top-ranked predictors were very stable and showed consistent effects in the regression models.
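A hedged sketch of this style of analysis, fitting a Random Forest to synthetic chromatin-feature data (with a built-in interaction term, mimicking the combinatorial effects probed in the second step) and reporting a cross-validated R^2; all numbers and data are illustrative, not the modENCODE profiles:

```python
# Toy regression of expression on chromatin features, with 10-fold CV R^2.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
n_genes, n_features = 2000, 36               # e.g. ChIP-seq + ChIP-chip tracks
X = rng.normal(size=(n_genes, n_features))
# Synthetic target with a pairwise interaction term between two features.
y = X[:, 0] + 0.5 * X[:, 1] * X[:, 2] + rng.normal(scale=0.5, size=n_genes)

model = RandomForestRegressor(n_estimators=100, random_state=0)
scores = cross_val_score(model, X, y, cv=10, scoring="r2")
print(round(scores.mean(), 2))               # 10-fold cross-validated R^2
```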
 
Poster J23
Whole genome sequencing, assembly and annotation of the Saccharomyces cerevisiae strain CEN.PK 113-7D

Dick De Ridder Delft University of Technology
Dick de Ridder (Delft University of Technology) Marcel van den Broek (Delft University of Technology, Dept. of Biotechnology, Faculty of Applied Sciences); Erwin Datema (Wageningen University and Research Centre, Plant Research International); Roeland van Ham (Wageningen University and Research Centre, Plant Research International); Marcel Reinders (Delft University of Technology, Faculty of Electrical Engineering, Mathematics and Computer Science); Jurgen Nijkamp (Delft University of Technology, Faculty of Electrical Engineering, Mathematics and Computer Science); Jean-Marc Daran (Delft University of Technology, Dept. of Biotechnology, Faculty of Applied Sciences);
 
Short Abstract: The Saccharomyces cerevisiae (baker’s yeast) CEN.PK strain is extensively used for systems biology and fermentation studies, as it is robust and highly relevant to industry. In biotechnology, metabolic engineering efforts are enhanced by evolutionary engineering to improve strain performance by selecting for certain phenotypes. Mutations arising during the evolutionary process can be deduced using next-generation sequencing and then linked to the observed phenotypes, in order to reverse-engineer the resulting strains. This requires a high-quality CEN.PK reference genome.

To construct this, two sequencing libraries were obtained and assembled: a 454 shotgun and an Illumina paired-end dataset. The resulting contig set was scaffolded using the well-known S. cerevisiae S288c genome. Non-unique sites in the genome are not correctly assembled when their length exceeds the read length. A method was therefore developed to mine such non-unique sites from the contig graph produced by the assembler, assigning them locations in the assembly using S288c as a reference.

The assembled genome was annotated using a combination of tools. ORFs were identified using three ab initio and two comparative gene model predictors. The predicted gene models were validated with a multi-condition RNAseq dataset. The genome was then compared to the known yeast genomes S288c, RM11-1A and YJM789, and a number of CEN.PK-specific genes and gene variants were catalogued. The resulting annotated genome sequence not only forms a solid basis for further work in evolutionary engineering, but also serves as a reference point for genomic studies in CEN.PK-derived strains.
 
Poster J24
Genometa - Accurate, flexible and fast taxonomic assignment for ultra-short metagenomic reads

Colin Davenport Hannover Medical School
Svea Kokott (Hannover Medical School, Clinical Research Group OE 6710); Jens Neugebauer (Hannover Medical School, Clinical Research Group OE 6710); Jens Klockgether (Hannover Medical School, Clinical Research Group OE 6710); Burkhard Tümmler (Hannover Medical School, Clinical Research Group OE 6710); Volker Ahlers (Hannover University of Applied Sciences, Department IV Economics and Informatics); Frauke Sprengel (Hannover University of Applied Sciences, Department IV Economics and Informatics);
 
Short Abstract: In the modern metagenomic era, massively parallel sequencing technologies are rapidly increasing the availability of sequence information from environmental sources. Over 300 projects with deep sequencing of prokaryotic communities are listed in the Genomes Online database, allowing access to novel biological information from environmental prokaryotes. Ultra-short reads are not yet widely used in metagenomics due to a lack of computational analysis tools suitable for aligning millions of reads quickly and precisely. However, Illumina and SOLiD next generation sequencers produce the highest volume of reads at the lowest cost per base pair, and are thus particularly appropriate for metagenomics.
Our user-friendly approach with a graphical interface, Genometa, exploits the speed of modern short read mapping tools. Millions of short reads can be accurately mapped onto a curated reference dataset of microbial species within 20 minutes. Mapped reads are then counted, taxonomy is assigned, and the interface of the Integrated Genome Browser is used to show read distributions. This method demonstrably allows strain-level identification using complex artificial and real metagenomic datasets. Using real Illumina reads from simulated metagenomes and a human gut microbiome project, we demonstrate the taxonomic fidelity of short reads and show that false positive assignments are minimal. Based on the distribution of read hits to a reference sequence, the presence of whole genomes or smaller elements can be discerned. Alternative reference sequences can be easily integrated into Genometa. Genometa is open source and freely available at http://genomics1.mh-hannover.de/genometa/.
 
Poster J25
Comparing long maximal exact repeats in four human genome assemblies

Sara Garcia University of Aveiro
João Rodrigues (University of Aveiro, IEETA/DETI); Armando Pinho (University of Aveiro, IEETA/DETI); Paulo Ferreira (University of Aveiro, IEETA/DETI);
 
Short Abstract: We compare the content of long (length >= 1 kb) maximal exact repeats in four human genome assemblies. Maximal exact repeats are perfect repeats that cannot be further extended without loss of similarity. They are important for seeding the alignment of sequence reads in genome assembly, and as anchor points in comparisons of closely related genomes.

We compare the human reference GRCh37 assembly, sequenced with capillary-based technologies (Lander et al. 2001), to three de novo human assemblies: the HuRef assembly of the genome of J. Craig Venter, sequenced with capillary-based technologies and assembled with the Celera Assembler (Levy et al. 2007); the NA12878 assembly from cell line GM12878, sequenced with massively parallel technologies and assembled with the ALLPATHS-LG assembler (Gnerre et al. 2011); and the YH assembly of the genome of a Han Chinese individual, sequenced with massively parallel technologies and assembled with the SOAPdenovo assembler (Li et al. 2010).

We find the content in long maximal exact repeats to be the largest in the reference GRCh37 assembly, representing ~0.96% of the genome size (A, C, G, T bases). The percentage of the genome represented by long maximal exact repeats in the NA12878 assembly is ~0.3%, whereas that value decreases to ~0.04% in the HuRef assembly. The presence of long maximal exact repeats in the YH assembly is negligible. These results further highlight the challenges in obtaining high-quality de novo draft assemblies for large and repeat-rich genomes, and the promises of recent efforts.
 
Poster J26
Revealing multiple nucleotide polymorphisms and indels in homopolymeric regions in human exome data

Tibor Nagy Omixon
Szilveszter Juhos (Omixon) Alessandro Guffanti (Genomnia, n.a.); Kenneth McElreavey (Institut Pasteur, Paris, Human Developmental Genetics);
 
Short Abstract: While detecting single nucleotide polymorphisms (SNPs) is the common goal of many studies, multiple nucleotide polymorphisms (MNPs, i.e. mutations of two or three adjacent nucleotides) are generally less studied. With Next-Generation sequencing it is relatively difficult to detect MNPs, as it is for insertions and deletions (indels) in homopolymeric or repetitive regions. Analyzing human exome data, we present examples where MNPs occur in genes and cause mutations responsible for significant changes in enzyme structure. Furthermore, we present other examples where indels were found in homopolymeric regions or in special symmetric sequences where detecting indels is hard for computer algorithms. We display examples where the alterations found (MNPs or indels) can be related to X-ray structures deposited in the PDB. We also present statistics on newly found MNPs and indels that give rise to frameshifts. Special attention is paid to deletions/insertions where whole codons are inserted or deleted. Four human samples were sequenced using the Agilent full exome enrichment kit and SOLiD NGS sequencing. The resulting 4x160M color-space reads were aligned with a sensitive method, described elsewhere, that is better suited to finding DNPs and indels. A comparison with alternative SNP, indel and MNP mutation identification is also illustrated.
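Per the definition above (two or three adjacent substituted nucleotides), adjacent single-base calls can be collapsed into candidate MNPs with a simple merge; the toy helper below is illustrative only and unrelated to the sensitive aligner the abstract refers to:

```python
# Collapse adjacent single-base substitutions into candidate MNPs.
def group_mnps(variants):
    """variants: position-sorted list of (position, ref_base, alt_base).
    Returns (start, ref, alt) tuples with adjacent substitutions merged;
    a stricter tool would split runs longer than three bases."""
    merged = []
    for pos, ref, alt in variants:
        if merged and pos == merged[-1][0] + len(merged[-1][1]):
            start, r, a = merged[-1]
            merged[-1] = (start, r + ref, a + alt)   # extend the run
        else:
            merged.append((pos, ref, alt))
    return merged

calls = [(100, "A", "G"), (101, "C", "T"),
         (250, "G", "A"), (251, "G", "C"), (252, "T", "A")]
print(group_mnps(calls))
# -> [(100, 'AC', 'GT'), (250, 'GGT', 'ACA')]: one DNP and one triple MNP
```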
 
Poster J27
Double nucleotide polymorphisms in strains of the human pathogen Propionibacterium acnes

Krisztina Rigo Omixon
Emese Szabo (Omixon) Judit Hunyadkürti (Baygen Institute for Plant Genomics, Human Biotechnology and Bioenergy, Research Group of Innate Immunity); Istvan Nagy (Baygen Institute for Plant Genomics, Human Biotechnology and Bioenergy, Research Group of Innate Immunity);
 
Short Abstract: One of the main goals of re-sequencing is to find SNPs in the target organism, but finding double nucleotide polymorphisms (DNPs) or even multiple nucleotide polymorphisms (MNPs) is sometimes difficult, and these mutations are often neglected. With NGS data, a rather sensitive read-alignment method is needed to detect these relatively rare mutations. The NCBI SNP database contains very few references for prokaryote MNPs, and the human SNPdb contains only about 50,000 possible MNP coordinates next to its 23 million SNPs. We used SOLiD sequencing to re-sequence two strains of the human pathogen P. acnes and analysed the short read data with a sensitive alignment method described elsewhere. After aligning reads to the 2.6 Mbp reference, we found 62031 SNPs and 1406 DNPs in strain A. Similarly, strain B contained 50101 SNPs and 1341 DNPs. To validate these results, a few regions close to genes were selected and confirmed by Sanger sequencing. Our results demonstrate that DNPs are more frequent in bacterial genomes than their representation in the SNPdb suggests, and that they likely contain functionally important mutations.
 
Poster J28
Detecting homogeneous methylation domains within whole genome methylation maps

Michael Hackenberg University of Granada
Pedro Carpena (University of Malaga, Applied Physics II); Pedro Bernaola-Galván (University of Malaga, Applied Physics II); Guillermo Barturen (University of Granada, Genetics); José L. Oliver (University of Granada, Genetics);
 
Short Abstract: With the advent of high-throughput sequencing techniques, the generation of single base resolution methylation maps has become feasible. The detection of regions with homogeneous methylation levels is an important goal, as: 1) CpG islands should manifest as homogeneously hypomethylated regions within these maps, and 2) partially methylated regions have recently been described in differentiated cells, showing clear biological features. Whole genome methylation maps can be viewed as arrays (one per chromosome) of numerical values varying between 0 (completely unmethylated) and 1 (completely methylated). To detect segments or regions with homogeneous methylation levels, we applied a numerical segmentation method previously shown to unveil the compositional structure of DNA sequences. When analyzing segment length versus segment methylation, we obtain 3 clearly distinguishable segment types: (i) short (mean length around 1 kb), hypomethylated segments (methylation levels <= 0.1); (ii) partially methylated segments, with levels between approximately 0.35 and 0.6, in differentiated tissues (these segments are absent in undifferentiated cells); and (iii) very long (around 100 kb on average, depending on the tissue) and strongly methylated (>= 0.8) segments. We observed that the short, hypomethylated regions have clearly higher CpG densities than the partially methylated segments, suggesting that these regions correspond mainly to CpG islands. These results offer the possibility to detect CpG islands in a tissue-specific manner by means of numerical segmentation of whole-genome methylation maps, as well as to analyze further the biological meaning of partially methylated regions.
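Once segments and their mean methylation levels are available, the three reported classes follow directly from the thresholds quoted above; the segmentation algorithm itself is not reproduced in this illustrative snippet:

```python
# Classify a segment by its mean methylation level, using the thresholds
# reported in the abstract (the function name and labels are illustrative).
def classify_segment(mean_methylation):
    if mean_methylation <= 0.1:
        return "hypomethylated (CpG-island-like)"
    if 0.35 <= mean_methylation <= 0.6:
        return "partially methylated (differentiated tissues)"
    if mean_methylation >= 0.8:
        return "strongly methylated"
    return "unclassified"

for level in (0.05, 0.45, 0.9, 0.7):
    print(level, "->", classify_segment(level))
```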
 
Poster J29
Improvement of RNA-Seq precision

Paweł Łabaj Boku University Vienna
German G. Leparc (Boku University Vienna, Chair of Bioinformatics); Bryan E. Linggi (Pacific Northwest National Laboratory, Environmental Molecular Sciences Laboratory); Lye Meng Markillie (Pacific Northwest National Laboratory, Environmental Molecular Sciences Laboratory); H. Steven Wiley (Pacific Northwest National Laboratory, Environmental Molecular Sciences Laboratory); David P. Kreil (Boku University Vienna, Chair of Bioinformatics);
 
Short Abstract: This poster is based on Proceedings Submission 237.

Measurement precision determines the power of any subsequent analysis, such as the sensitive detection of differentially expressed transcripts, independent of whether replicates are employed in this analysis or not. Replicates are, however, required for a systematic analysis of this precision at the level of individual transcripts. Large RNA-Seq datasets featuring technical replicates now allow a gene-by-gene examination of the reliability of expression-level estimates from massively parallel sequencing.

Here we report a comprehensive study of target coverage and measurement precision, including their dependence on transcript expression levels, read depth, and other parameters. On the one hand, with 330 million 50bp reads, an impressive target coverage of 84% of the estimated true transcript population could be achieved, with diminishing returns from increased sequencing depths. On the other hand, the majority of the measurement power (75% of all reads) is spent on only 7% of the known transcriptome, making less strongly expressed transcripts harder to measure. Consequently, only ~39,000 known transcripts could be quantified reliably with a relative error < 20%. Profiling transcripts de novo yielded even fewer precise measurements (~35,000).

Building on established tools, we introduce a new approach for mapping and analysing sequencing reads, increasing the number of reliably measured known transcripts by 50% to ~57,000. As this is still lower than what can be achieved with standard microarrays, extrapolations to higher sequencing depths highlight the need for more efficient experimental strategies. Further improvements in comprehensive quantitative profiling of the transcriptome need to exploit the respective strengths of complementary technologies.
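The reliability criterion quoted above (relative error < 20% across technical replicates) can be illustrated as follows; defining relative error as the replicate standard deviation over the mean is an assumption here, not necessarily the authors' exact formula:

```python
# Flag transcripts whose expression estimates are reproducible across
# technical replicates. Toy FPKM-like values; not the authors' pipeline.
import numpy as np

def reliable(replicate_estimates, max_rel_error=0.20):
    est = np.asarray(replicate_estimates, dtype=float)
    m = est.mean(axis=1)
    # Relative error: dispersion of replicate estimates around their mean.
    rel_err = est.std(axis=1, ddof=1) / np.where(m > 0, m, np.nan)
    return rel_err < max_rel_error

rng = np.random.default_rng(4)
strong = rng.normal(100, 5, size=(3, 4))   # 3 transcripts, 4 tech replicates
weak = rng.normal(2, 1, size=(3, 4))       # noisy low-expression transcripts
print(reliable(np.vstack([strong, weak]))) # strong ones pass, weak mostly fail
```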
 
Poster J30
Next generation sequencing and its application to study breast tumor subtypes

Krishna Rani Kalari Mayo Clinic
Krishna Kalari (Mayo Clinic) High Seng Chai (Mayo Clinic, Department of Health Sciences Research); Yan Asmann (Mayo Clinic, Department of Health Sciences Research); Asha Nair (Mayo Clinic, Department of Health Sciences Research); Saurabh Baheti (Mayo Clinic, Department of Health Sciences Research); Tiffany Baker (Mayo Clinic, Department of Cancer Biology); Jennifer Carr (Mayo Clinic, Department of Cancer Biology); Asif Hossain (Mayo Clinic, Department of Health Sciences Research); Zhifu Sun (Mayo Clinic, Department of Health Sciences Research); Sumit Middha (Mayo Clinic, Department of Health Sciences Research); Jean-Pierre Kocher (Mayo Clinic, Department of Health Sciences Research); Edith Perez (Mayo Clinic, Department of Internal Medicine); David Rossell (Institute for Research in Biomedicine, Department of Biomedicine); Aubrey Thompson (Mayo Clinic, Department of Cancer Biology);
 
Short Abstract: Breast tumors are classified into several categories based on the expression of genes such as ER, PR and ERBB2. Expression of these genes determines breast tumor subtype and therapeutic response. Many high-throughput experiments have generated multiple candidate genes from independent analyses of genomic features such as SNPs, CNVs and gene expression for each breast tumor subtype. The objective of the present study is to perform integrative analysis of multiple genomic features to identify key pathways specific to each breast tumor subtype using next generation sequencing technology. We performed 50nt paired-end RNA-sequencing of 24 breast adenocarcinoma samples (8 HER2 positive, 8 ER positive and 8 triple negative) and 8 normal HMEC cell lines. Mayo Bioinformatics Core workflows, along with a variety of computational methods and software tools, were used to carry out the analyses. Differential gene expression and splicing analyses were performed between each breast tumor subtype and normal. Integration analyses of the differential gene expression data at a p-value cutoff of 0.05 identified 1859, 1908 and 1257 genes specific to the HER2 positive, ER positive and triple negative tumor subtypes, respectively. Integration analyses of the alternative splice form data at a p-value cutoff of 0.0001 identified 1138, 571 and 388 splicing variants specific to the HER2 positive, ER positive and triple negative tumor subtypes, respectively. We are in the process of identifying genes carrying coding nucleotide polymorphisms specific to each breast tumor subtype. The combination of all the genes obtained from each genomic feature analysis will be useful to identify biological processes for each tumor subtype.
 
Poster J31
Bayesian Evaluation of de novo Genome Assembly

Sergey Nikolenko St. Petersburg Academic University
Alexander Sirotkin (St. Petersburg Academic University, Algorithmic Biology Lab); Max Alekseyev (University of South Carolina, Department of Computer Science & Engineering);
 
Short Abstract: An important problem of genome assembly is evaluation of the assembly results. Existing metrics include contig size, N50, and their close relatives; see the Assemblathon review (Korf et al., 2011). When a reference genome is given, these metrics provide good evaluation by counting only "correct" contigs. However, for de novo assembly, a sufficiently long random string would also produce good N50-like metrics. Moreover, users of existing assemblers have begun to criticize their results for producing assemblies that are too short (Alkan et al., 2011), which is, however, expected from the formulation of genome assembly as the Shortest Superstring Problem. A unified method for comparing the quality of genome assemblies should take into account how well the input reads fit the results.

We rank assemblies according to their likelihood with respect to a generative model of the reads. In our model, each (non-paired) read is taken from a uniformly selected position in the genome. Errors in each nucleotide are independent, and their distribution is a property of the sequencing process (Phred or Solexa quality scores, indel probabilities). We compute the total log-likelihood of a read with weighted local matching algorithms (assuming that the coverage distribution is known in advance, e.g., is uniform). The assembled contigs can then be compared by the total log-likelihood of the entire set of input reads.

Our approach can compare genomic sequences of different length since the final log-likelihood formula includes a "model complexity" term, penalizing longer strings. The model can be extended to handle prior information about genome length and structure.
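A minimal sketch of this ranking (a toy version with exhaustive placement and a single substitution rate eps in place of per-base quality scores; the description-length term below is a hypothetical stand-in for the model complexity penalty):

    import math

    def read_loglik(read, assembly, eps=0.01):
        # log P(read | assembly): start position uniform over the
        # assembly; independent per-base errors at rate eps, with the
        # three wrong bases equally likely (eps / 3 each).
        k = len(read)
        n_pos = len(assembly) - k + 1
        total = 0.0
        for s in range(n_pos):
            p = 1.0
            for a, b in zip(read, assembly[s:s + k]):
                p *= (1.0 - eps) if a == b else eps / 3.0
            total += p
        return math.log(total / n_pos)

    def assembly_score(reads, assembly, eps=0.01):
        # Total log-likelihood minus a description-length penalty that
        # disfavours spuriously long assemblies.
        ll = sum(read_loglik(r, assembly, eps) for r in reads)
        return ll - len(assembly) * math.log(4)

    reads = ["ACGT", "CGTA", "GTAC"]
    print(assembly_score(reads, "ACGTAC") >
          assembly_score(reads, "ACGTACGTAC"))  # True: shorter fits better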
 
Poster J32
Extracting relationships from integrated functional genomics data with the Ontological Discovery Environment

Jeremy Jay The Jackson Laboratory
Erich Baker (Baylor University, Engineering and Computer Science); Elissa Chesler (The Jackson Laboratory)
 
Short Abstract: High-throughput experimental data has become ubiquitous in functional genomics, providing a wealth of discrete gene-function associations through microarrays, GWAS, QTLs, and RNA-seq experiments. To integrate these data we are curating billions of discrete functional genomic associations and storing them in an open-access collaborative environment, where researchers can apply modular tools based on bipartite graph algorithms to explore relations through custom workflows.

The Ontological Discovery Environment (freely available at http://ontologicaldiscovery.org) has amassed a large database of diverse data obtained from Mouse Genome Informatics, Neuroscience Information Framework, Allen Brain Atlas, and Comparative Toxicogenomics databases. Multiple species are supported and all data is linked using homology to enable comparative functional genomics. User-submitted data can be privately stored or made open-access, and all data can be selectively used with our integrated analysis and visualization pipeline.

Navigating intersections enables discovery of similarity among biological characters or gene functions. Using the collected relationships and similarities found within the data, we have developed a gene-seeded search for co-represented genes in similar experimental data, which enables rapid discovery and prioritization of candidate genes through integration of many gene sets. Boolean gene set logic enables operations including union or intersection of large groups of gene sets, useful to distill multiple sets of positional candidates regulating a single phenotype. These and other ODE tools provide a simple approach to expand or filter a list of interest and discover new or under-studied relationships for follow-up, allowing the user to get past the obvious genes and generate new hypotheses from existing data.
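As a toy illustration of the Boolean gene-set logic described above (hypothetical gene symbols; ODE itself operates on curated sets at scale):

    # Three hypothetical gene sets, e.g. positional candidates from two
    # QTLs plus a differential-expression set for the same phenotype.
    gene_sets = [
        {"Gad1", "Comt", "Drd2", "Slc6a4"},
        {"Comt", "Drd2", "Bdnf"},
        {"Drd2", "Comt", "Htr1a"},
    ]

    # Intersection distills candidates supported by every set...
    print(set.intersection(*gene_sets))   # {'Comt', 'Drd2'}
    # ...while union pools them into a permissive follow-up list.
    print(sorted(set.union(*gene_sets)))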

Supported by NIH-AA18776.
 
Poster J33
Give Cancer Archival Tissues a New Life: Unlock with High-throughput Technologies

Lan Hu Dana-Farber Cancer Institute
Howie Goodell (Dana-Farber Cancer Institute, Biostatistics and Computational Biology); Fieda Abderazzaq (Dana-Farber Cancer Institute, Biostatistics and Computational Biology); Renee Rubio (Dana-Farber Cancer Institute, Biostatistics and Computational Biology); Mick Correll (Dana-Farber Cancer Institute, Biostatistics and Computational Biology); John Quackenbush (Dana-Farber Cancer Institute, Biostatistics and Computational Biology);
 
Short Abstract: Formalin-Fixed Paraffin-Embedded (FFPE) archival tissue blocks are a rich resource for retrospective discovery studies. A great advantage of FFPE blocks is that in many cases tumor and normal tissues of the same patient, together with the clinical data, are available, which allows us to build more precise models to identify the relevant gene signatures and genomic variations in cancers. At the same time, the fragmented and cross-linked nature of mRNA/DNA in FFPE blocks presents a challenge to successfully 'unlocking' these archival tissues.
Our group has deployed both the DASL (cDNA-mediated Annealing, Selection, Extension, and Ligation) assay and next-generation sequencing (NGS) to profile gene expression in cancer FFPE blocks. We used the DASL assay to identify a sample extraction method that achieves both quality and speed, to make full use of the large repository of normal-tumor paired FFPE blocks in breast cancer. We demonstrate that, with a good yield of recovered mRNA, the coring method is largely comparable to the traditional, labor-intensive, and slow sectioning method. We also sequenced a small number of FFPE blocks from different stages of the cancer using RNA-seq. The subsequent analysis, although preliminary, revealed gene expression patterns in the context of disease stage.
 
Poster J34
Role of Conserved 3’-UTR Cis-elements in the Regulation of Human MECP2 and MAPT Genes Involved in Mental Disorders

Joetsaroop Bagga John P. Stevens High School
Lawrence D'Antonio (Ramapo College of New Jersey, TAS-Bioinformatics); Paramjeet Bagga (Ramapo College of New Jersey, TAS-Bioinformatics);
 
Short Abstract: Mutations in the MECP2 (Methyl CpG binding Protein-2) and MAPT genes, which are differentially expressed during neuronal development, have been linked to Rett Syndrome, X-linked mental retardation, autism, Alzheimer's disease and dementia. The goal of this project has been to study the role of cis-elements in regulating the post-transcriptional gene expression of these genes in human and mouse.
We have used a bioinformatics approach to analyze phylogenetically conserved cis-regulatory elements in the 3’-UTRs of alternatively spliced isoforms of the MECP2 and MAPT genes.
We discovered phylogenetically conserved overlapping AU-rich elements (AREs) and miR-148/152 target sites in the 3'-UTR of the longer mRNA isoform of the human MECP2 gene, suggesting cooperation between miRNAs and ARE-binding proteins in regulating the decay of the large MECP2 mRNA isoform. Selective association of conserved G-quadruplex motifs with one of the polyA sites suggests a role in alternative polyadenylation. The long MECP2 isoform is expressed during early neuronal development. The shorter adult-brain isoform lacked miRNA target sites, AREs or G-quadruplexes. Our results suggest a regulatory role for these cis-elements in post-transcriptional MECP2 expression during early brain development.
A conserved miR-563 target site, G-quadruplexes, and an ARE in the 3'-UTR were analyzed in the vicinity of the CDS in one of the MAPT mRNA isoforms. We believe that interplay between G-quadruplex-binding proteins, such as the Fragile X Mental Retardation Protein, miRNAs, and ARE-BPs could help regulate MAPT expression.
Our studies help provide better insights into the post-transcriptional regulation of MECP2 and MAPT genes involved in mental disorders.
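As an illustrative aside, the two element classes analysed above have widely used sequence signatures; the toy scanner below uses the AUUUA pentamer (DNA sense: ATTTA) for class I AREs and the common G3+N1-7 pattern for G-quadruplex motifs. The poster's phylogenetic conservation filtering is not shown.

    import re

    ARE = re.compile(r"ATTTA")   # class I ARE core, DNA sense
    G4 = re.compile(r"G{3,}\w{1,7}G{3,}\w{1,7}G{3,}\w{1,7}G{3,}")

    utr = "GGGAGGGTGGGAAGGGCCATTTATTTAGG"          # made-up 3'-UTR snippet
    print([m.start() for m in ARE.finditer(utr)])  # ARE hit positions
    print([m.group() for m in G4.finditer(utr)])   # candidate quadruplexes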
 
Poster J35
GARM - Genome Assembly, Reconciliation and Merging

Alejandro Sanchez-Flores Wellcome Trust Sanger Institute
Fidel Sanchez-Flores (Wellcome Trust Sanger Institute); Matt Berriman (Wellcome Trust Sanger Institute, Parasite Genomics);
 
Short Abstract: Genome assembly remains a challenging problem in genome sequencing projects. Progress in DNA sequencing gave birth to second-generation technologies that changed the paradigm of genome assembly, giving rise to new algorithms and software. Several programs can take reads from different technologies and perform a hybrid assembly, but differences in read lengths and error models can affect the result. Another common scenario is a genome project that started with one sequencing technology but turned to new technologies as they became available, in the end requiring a way of combining them all. We present GARM (Genome Assembly, Reconciliation and Merging), software to merge and reconcile assemblies produced by different algorithms or from different sequencing technologies. It relies on the principle that a combination of sequencing methods will work better to reconstruct a genome in its totality, overcoming the limitations of each technology and assembly method used in isolation. The pipeline is based on Perl scripts and third-party software such as AMOS and MUMmer. The method merges contigs and scaffolds from different assemblers using the same or different sequencing technologies. When scaffolds are provided, a search for compression and extension (CE) problems in the genome can be performed; scaffolds can also be reconstructed by recalculating the gaps between contigs. GARM has been tested with three major sequencing technologies (Sanger, 454 and Illumina), using simulated reads as well as real sequencing data for organisms where a reference genome is available or where manual post-assembly improvement is being undertaken.
 
Poster J36
Application of Restriction Site Associated DNA (RAD) linkage map in Brassica oleracea genome assembling

Chu Shin Koh NRC Plant Biotechnology Institute (NRC-PBI)
Christine Sidebottom (NRC Plant Biotechnology Institute (NRC-PBI), DNA Technologies Laboratory); Carling Tallon (NRC Plant Biotechnology Institute (NRC-PBI), DNA Technologies Laboratory); Wayne Clarke (Agriculture and Agri-Food Canada, Saskatoon Research Centre); Matthew Links (Agriculture and Agri-Food Canada, Saskatoon Research Centre); Erin Higgins (Agriculture and Agri-Food Canada, Saskatoon Research Centre); Jacek Nowak (NRC Plant Biotechnology Institute (NRC-PBI), Bioinformatics Laboratory); Faouzi Bekkaoui (NRC Plant Biotechnology Institute (NRC-PBI)); Isobel Parkin (Agriculture and Agri-Food Canada, Saskatoon Research Centre); Andrew Sharpe (NRC Plant Biotechnology Institute (NRC-PBI), DNA Technologies Laboratory);
 
Short Abstract: Restriction-site associated DNA (RAD) tags are short DNA fragments that immediately flank either side of a particular restriction enzyme recognition site. RAD tags provide a reduced representation of a target genome, and highly multiplexed sequencing of these tags allows rapid detection of a large number of Single Nucleotide Polymorphisms (SNPs) at a reasonable cost (Baird et al. 2008). Here we describe Illumina sequencing of multiplexed RAD tags for use in the development of a publicly available Brassica oleracea (cabbage) genome assembly. RAD tags were generated from genomic DNA from the two parents, TO1000 and Early Big, and 94 lines of the doubled haploid TO1000 x Early Big B. oleracea core genetic mapping population. The use of EcoRI for DNA digestion and barcoded adapters allows for efficient use of sequencing resources to achieve good coverage of EcoRI sites within the genome assembly at sufficient read depth. This methodology allows for simultaneous high-density SNP discovery and genetic mapping. The resulting polymorphic SNPs were integrated with existing SSR and SNP data to create a dense genetic map, which facilitated genetic anchoring of B. oleracea assembly scaffolds. The results will be presented.
 
Poster J37
Comparison of bacterial replication termination models by simulation of genomic mutations

Nobuaki Kono Institute for Advanced Biosciences, Keio University
Kazuharu Arakawa (Keio University, Graduate School of Media and Governance); Masaru Tomita (Keio University, Faculty of Environment and Information Studies);
 
Short Abstract: Bacterial circular chromosomes exhibit a compositional bias towards G over C in the leading strand, commonly calculated as the GC skew: (C-G)/(C+G). The shift-points of GC skew polarity therefore correlate with the positions of the replication origin and terminus. While the positions of replication origins are well defined, the exact position and mechanism of replication termination in bacteria remain controversial. The dif-termination model proposes replication termination at the chromosome dimer resolution site dif, based on its bioinformatically optimized positioning; the fork-trap model is premised on in vivo and in vitro evidence of Tus/Ter blockage of replication fork progression; and the fork-collision model defines replication termination at a region opposite the origin, where the two replication forks meet. Here we test these replication termination models by simulating genomic mutation under the different termination mechanisms. By altering the replication termination positions, both by probabilistic means and by using the known positions of Ter and dif sequences, we show that only models with a defined termination site can reproduce the compositional bias of the original genomes. These results suggest coordination of the replication termination machinery with cell division mechanisms such as the chromosome dimer resolution system.
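A minimal sketch of the statistic quoted above (windowed (C-G)/(C+G) skew on a toy genome; extrema of the cumulative skew are a standard proxy for the origin/terminus shift-points):

    def gc_skew(seq, window=1000):
        # Windowed skew (C - G) / (C + G), as defined in the abstract.
        skews = []
        for i in range(0, len(seq) - window + 1, window):
            w = seq[i:i + window]
            c, g = w.count("C"), w.count("G")
            skews.append((c - g) / (c + g) if c + g else 0.0)
        return skews

    def cumulative(skews):
        # Polarity shift-points appear as extrema of this curve.
        out, total = [], 0.0
        for s in skews:
            total += s
            out.append(total)
        return out

    genome = "C" * 5000 + "G" * 5000        # toy pair of replichores
    print(cumulative(gc_skew(genome)))      # peaks at the polarity switch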
 
Poster J38
A-Bruijn graph approach to de novo genome assembly

Mikhail Dvorkin St. Petersburg Academic University
Alexander Kulikov (St. Petersburg Academic University, Algorithmic Biology Laboratory); Max Alekseyev (University of South Carolina, Department of Computer Science & Engineering);
 
Short Abstract: A common approach to assembling a genome from short reads is to construct the de Bruijn graph on k-mers from the reads and find a traversal of its edges. We propose a new approach that decreases the graph size without losing information from the input data.

We select only a small fraction of all k-mers such that each read contains at least two selected k-mers. Selected k-mers are then represented by vertices, so that each read is represented by a line graph on these vertices. Gluing together vertices corresponding to the same k-mer results in an A-Bruijn graph, in which each read is present as a path. After simplifications, non-branching paths in the A-Bruijn graph reveal the genome contigs as sequences of reads; the contig content can then be found by consensus over the corresponding reads.

The A-Bruijn graph has a number of advantages over the traditional de Bruijn graph: it has fewer vertices and thus reduces memory usage (particularly important for large genomes); it captures the repeat structure of the genome being assembled well; and, since it is based on only a small fraction of all k-mers, it is less sensitive to errors in the reads.

Many algorithms (graph simplification, mate pairs analysis etc.) that work on de Bruijn graphs can be adapted to work on A-Bruijn graphs.
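A toy illustration of the k-mer thinning step (greedy anchor selection; the real construction, vertex gluing and graph simplification are more involved):

    def kmers(read, k):
        return [read[i:i + k] for i in range(len(read) - k + 1)]

    def select_kmers(reads, k):
        # Greedily keep a sparse k-mer subset such that every read
        # still contains at least two kept k-mers (here: its ends).
        kept = set()
        for r in reads:
            ks = kmers(r, k)
            if sum(x in kept for x in ks) < 2:
                kept.update({ks[0], ks[-1]})
        return kept

    def read_path(read, k, kept):
        # The read's path in the A-Bruijn graph: its kept k-mers in order.
        return [x for x in kmers(read, k) if x in kept]

    reads = ["ACGTACCA", "GTACCAGT", "ACCAGTTG"]
    kept = select_kmers(reads, 4)
    for r in reads:
        print(r, "->", read_path(r, 4, kept))

Gluing the k-mers shared across read paths (e.g. ACCA and CAGT above) yields the graph's vertices.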
 
Poster J39
BLESS: mapping DNA double-strand breaks by next-generation sequencing

Maga Rowicka University of Texas Medical Branch at Galveston
Nicola Crosetto (Institute of Biochemistry II, Molecular Signalling Group); Abhishek Mitra (University of Texas Medical Branch at Galveston, Institute for Translational Sciences);
 
Short Abstract: We present a method to map DNA double-strand breaks (DSBs) by in situ labeling, enrichment, and next-generation sequencing (BLESS). BLESS was comprehensively validated and tested in a genome-wide screen of aphidicolin-induced common fragile sites (CFSs) in HeLa cells. 191 CFSs (45 previously unknown) were identified, and 13 aphidicolin-sensitive hotspots (2 in chromosomal regions considered non-fragile) were mapped with a resolution of a few kilobases. Our method is suitable for genome-wide, high-resolution mapping of DSBs in various cells and experimental conditions.
 
Poster J40
PhyloPTE: Genotype/Phenotype Association with Reference to Phylogeny

Samuel Handelman The Ohio State University
Joseph Verducci (The Ohio State University, Statistics); Jesse Kwiek (The Ohio State University, Center for Microbial Interface Biology); Surender Kumar (The Ohio State University, Veterinary Biosciences); Daniel Janies (The Ohio State University, Biomedical Informatics);
 
Short Abstract: When genome sequences are obtained from organisms with different associated phenotypes, it should be possible to identify the sequence properties that confer a given phenotype. However, the evolutionary relationships between organisms lead to non-independence between sequence properties. For example, the HIV-1 virus has a population structure reflecting both transmission between individuals and evolution of the HIV-1 quasispecies within each patient. This non-independence can introduce interdependence between unrelated mutations, giving a false appearance of causation. These evolutionary relationships are an issue even in HIV-1, where recombination is rapid, and are pervasive in humans, where linkage disequilibrium is extensive. In human disease studies, this can sometimes be overcome by comparing siblings: alleles common only in "sick" siblings are likely true causative alleles. PhyloPTE identifies, in a phylogenetic reconstruction, sibling lineages where the phenotype varies. Then, PhyloPTE uses modified proportional hazard models to identify causal polymorphisms. PhyloPTE's advantages include: speed practical for high-throughput sequence data, estimates of the relative strength or speed of different effects, and improved precision even compared with other tree-based methods: 50%-300% improvement in precision at the same recall, whether predicting experimental correlations (obtained from STRING: http://string-db.org/) or in simulations under biologically reasonable parameters on HIV quasispecies sequence trees.
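As an illustrative aside (a toy version of the sibling-contrast idea, not PhyloPTE's proportional-hazards machinery), the sketch below walks a phylogeny, finds sibling leaves whose phenotypes differ, and reports the sequence positions separating them. Note that the position-0 difference separating the two clades is shared ancestry and is correctly ignored, while position 3 tracks the phenotype within a sibling pair.

    def is_leaf(node):
        # A leaf is a (sequence, phenotype) pair of strings.
        return isinstance(node[0], str)

    def sibling_contrasts(node, hits=None):
        # Collect, for each sibling-leaf pair with differing phenotype,
        # the sequence positions that separate the two siblings.
        if hits is None:
            hits = []
        if is_leaf(node):
            return hits
        left, right = node
        if is_leaf(left) and is_leaf(right) and left[1] != right[1]:
            hits.append([i for i, (a, b) in
                         enumerate(zip(left[0], right[0])) if a != b])
        sibling_contrasts(left, hits)
        sibling_contrasts(right, hits)
        return hits

    tree = ((("ACGA", "sick"), ("ACGT", "well")),
            (("TCGT", "well"), ("TCGA", "well")))
    print(sibling_contrasts(tree))  # [[3]] -> position 3 is the candidate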
 
Poster J41
Identifying the Dynamic States of the 3D Genome Organization

Andrzej Kudlicki University of Texas Medical Branch
Dirar Homouz (Khalifa University of Science, Technology & Research, Physics); Gang Chen (Central South University, Changsha, CS);
 
Short Abstract: Chromatin capture experiments (4C, Hi-C) allow genome-wide mapping of physical interactions between chromosomal loci. We discuss the impact that multiple or non-homogeneous subpopulations of cells in an experimental sample may have on the results of a chromatin capture experiment. We propose a method to identify this phenomenon by analyzing statistical and geometrical properties of chromatin capture measurements. By applying the algorithm to published experimental data, we demonstrate that subpopulations with different chromatin conformations are indeed present and that their influence on the results is significant. Finally, we present an algorithm that reconstructs the chromatin conformation in each subpopulation by applying graph-theoretic considerations. We demonstrate that the results are consistent with the presence of several subpopulations of cells, in each of which the chromatin has a different conformation; functional annotation of the 3D contacts between genes shows that each subpopulation executes a significantly different transcriptional program.
 
Poster J42
Instant-Seq - an integrated and fast Web-based program for analysis of (timecourse) ChIP-Seq data

Abhishek Mitra University of Texas Medical Branch at Galveston
Maga Rowicka (University of Texas Medical Branch at Galveston, Biochemistry and Molecular Biology);
 
Short Abstract: The rapid development of next-generation sequencing has made this technology very popular and has enabled more sophisticated sequencing experiments, such as time series. Our software focuses on comprehensive analysis of ChIP-Seq-like data, also supports non-standard experiments, e.g. detection of genomic fragile sites, and, to the best of our knowledge, is the only tool available for the analysis of time-course DNA-sequencing data. Time-course analysis includes finding all genomic regions exhibiting a user-defined temporal trend with a given statistical significance.
In all cases, our tool, starting from the raw sequencing data, performs an integrated and comprehensive analysis requiring no user interaction beyond answering a few simple questions about the experimental design. Our software pre-processes the data and assesses its quality, including automatic barcode detection, searching for sequencing artifacts, and evaluating data quality based on Phred scores (using FastQC). Read alignment with BWA or Bowtie follows. If a binding site is expected, peak calling is performed with MACS, followed by motif detection with MEME and in-house software. In other cases, regions of interest are identified by non-model-based in-house algorithms.
As part of the analysis, our tool produces a histogram of the peak distribution with respect to transcription start sites, genome browser tracks and sequence logos of motifs, and a list of regulated genes ranked by their significance scores together with their enriched and depleted GO categories. Comparisons with user-provided gold-standard gene sets and selection of targets for qPCR validation are also supported.
 
Poster J43
Statistical Tests for Detecting Differential RNA-Transcript Expression from Read Counts

Philipp Drewe Friedrich Miescher Laboratory of the Max Planck Society
Oliver Stegle (Max Planck Institute for Biological Cybernetics, Machine Learning and Computational Biology); Regina Bohnert (Friedrich Miescher Laboratory of the Max Planck Society, Machine Learning in Biology); Karsten Borgwardt (Max Planck Institute for Biological Cybernetics, Machine Learning and Computational Biology);
 
Short Abstract: As a fruit of the current revolution in sequencing technology, transcriptomes can now be analysed at an unprecedented level of detail. But especially for the analysis of alternative splicing, there is still a lack of statistically robust methods to detect differentially spliced genes in RNA-Seq experiments.
In this work, we present two novel statistical tests to address this important methodological gap: a ‘gene-structure-sensitive’ Negative Binomial (NB) test that can be used to detect differential transcript expression when the gene structure is known, and a non-parametric kernel-based test, called Maximum Mean Discrepancy (MMD), for cases when the gene structure is incomplete or unknown. Both methods can also cope with multiple replicates and account for biological variance. Furthermore, they can efficiently use paired-end read information.
We analysed both proposed methods on simulated read data, as well as on real reads generated by the Illumina Genome Analyzer for two A. thaliana samples. Our analysis shows that the NB test identifies genes with differential transcript expression considerably better than approaches based on transcript quantification, such as rQuant and Cuffdiff. Even more surprisingly, we found that the MMD test performs as well as existing methods in the absence of any knowledge of the annotated transcripts. This method is therefore well suited to analysing RNA-Seq experiments where other approaches fail, namely when the genome annotations are incomplete, false or entirely missing.
The software is available as a Galaxy package and as a standalone version.
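A minimal sketch of the kernel two-sample statistic behind the MMD test (the generic unbiased MMD² with a Gaussian kernel on read positions; the authors' kernel and implementation may differ, and in practice a permutation test over condition labels yields the p-value):

    import math, random

    def kern(x, y, sigma=50.0):
        # Gaussian kernel on read positions.
        return math.exp(-((x - y) ** 2) / (2 * sigma ** 2))

    def mmd2(xs, ys, sigma=50.0):
        # Unbiased estimate of the squared Maximum Mean Discrepancy.
        m, n = len(xs), len(ys)
        xx = sum(kern(a, b, sigma) for i, a in enumerate(xs)
                 for j, b in enumerate(xs) if i != j) / (m * (m - 1))
        yy = sum(kern(a, b, sigma) for i, a in enumerate(ys)
                 for j, b in enumerate(ys) if i != j) / (n * (n - 1))
        xy = sum(kern(a, b, sigma) for a in xs for b in ys) / (m * n)
        return xx + yy - 2 * xy

    random.seed(0)
    cond_a = [random.gauss(100, 20) for _ in range(50)]   # read positions
    cond_b = [random.gauss(100, 20) for _ in range(50)]   # same isoform use
    cond_c = [random.gauss(160, 20) for _ in range(50)]   # shifted usage
    print(mmd2(cond_a, cond_b))   # near zero
    print(mmd2(cond_a, cond_c))   # clearly positive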
 
Poster J44
The Importance of Being Moderate: Bias towards Optimal Stability of Codon-Anticodon Interactions and its Effect on Translation

Naama Wald Hebrew University
Maya Alroy (Hebrew University, Microbiology and Molecular Genetics); Maya Botzman (Hebrew University, Microbiology and Molecular Genetics); Hanah Margalit (Hebrew University, Microbiology and Molecular Genetics);
 
Short Abstract: Synonymous codons are unevenly distributed among genes, a phenomenon termed codon usage bias. Understanding the forces shaping codon bias is a major step towards elucidating the adaptive advantage codon choice can confer at the level of individual genes and organisms. Stability of codon-anticodon interactions was previously suggested to be one of the forces underlying codon bias. The model of optimal stability suggests that too stable and too unstable codon-anticodon interactions reduce the efficiency of translation compared to moderate interactions. Codons forming the latter are termed optimally stable. Here we assess the contribution of optimal stability considerations to codon bias, using gene sequences of ribosomal proteins in hundreds of prokaryotic genomes. Controlling for tRNA abundance, we show that codons conforming to the optimal stability considerations are statistically significantly over-represented in these genes, supporting their role towards improved translation. We substantiate the direct effect that optimally stable codons might have on translational efficiency by demonstrating their association with higher expression levels within a set of synonymous GFP constructs. We also provide supporting evidence for the effect of these codons on translation accuracy, by showing their overabundance in positions sensitive to errors. We demonstrate that the effect of optimal stability considerations on codon usage is correlated with the organisms’ level of translation selection.
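As an illustrative aside, analyses of this kind rest on simple codon-count bookkeeping; the sketch below computes relative synonymous codon usage (RSCU, observed over expected-under-even-usage) for two of leucine's codons on a made-up in-frame sequence. Per-codon stability classes would then be tested against such frequencies.

    from collections import Counter

    def rscu(seq, synonyms):
        # RSCU: observed codon count divided by the count expected if
        # all synonymous codons were used evenly.
        codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
        counts = Counter(c for c in codons if c in synonyms)
        total = sum(counts.values())
        expect = total / len(synonyms) if total else 0
        return {c: (counts[c] / expect if expect else 0.0)
                for c in synonyms}

    leu2 = ["TTA", "TTG"]        # two of leucine's six codons, for brevity
    gene = "TTATTGTTGTTGTTATTG"  # toy in-frame coding sequence
    print(rscu(gene, leu2))      # TTA ~0.67, TTG ~1.33: TTG favoured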
 
Poster J45
Multi-Image Genome Viewer (MIG)

Stephen Taylor University of Oxford
Simon McGowan (WIMM, CBRG); Jim Hughes (WIMM, Molecular Haemotology);
 
Short Abstract: We have developed a software tool to visualise genomic data, based on open-source genome browsers such as UCSC and GBrowse. While these web browsers excel at displaying disparate data at a single locus, it can be time-consuming to compare these data with those of other genomic loci. The Multi-Image Genome (MIG) viewer that we have developed allows the user to conveniently compare data from many loci. Primarily aimed at high-throughput sequencing experiments where enrichment is detected at multiple locations across the genome, MIG allows the user to visualise these regions in their local genome context. In addition, MIG allows the regions of interest to be filtered on user-defined criteria and viewed automatically in a user-defined order.
 
Poster J46
Profiling transcription initiation in human aged brain using deep-CAGE

Margherita Francescatto Vrije Universiteit Medical Center and Instituto de Ciências Biomédicas Abel Salazar
Luba Pardo (Vrije Universiteit Medical Center, Medical Genomics); Patrizia Rizzu (Vrije Universiteit Medical Center, Medical Genomics); Morana Vitezic (RIKEN Yokohama Institute and Karolinska Institute, RIKEN Omics Science Center and Department of Cell and Molecular Biology); Javier Simón-Sánchez (Vrije Universiteit Medical Center, Medical Genomics); Hazuki Takahashi (RIKEN Yokohama Institute, RIKEN Omics Science Center); Carsten Daub (RIKEN Yokohama Institute, RIKEN Omics Science Center); Piero Carninci (RIKEN Yokohama Institute, RIKEN Omics Science Center); Peter Heutink (Vrije Universiteit Medical Center, Medical Genomics);
 
Short Abstract: The aim of this study was to characterize transcription start sites (TSSs) in different areas of the aged human brain. Given its ability to profile TSSs at high resolution and at a genome-wide level, we used Cap Analysis of Gene Expression (CAGE) combined with high-throughput sequencing (deep-CAGE) to collect TSSs. We present here our findings on alternative promoters and antisense transcription. Post-mortem tissue from 5 brain regions was collected from 5 human donors and used to prepare 25 libraries.

On average, 2 million CAGE tags were sequenced for each sample. Mapping, expression normalization and clustering of the tags were carried out using automated pipelines. Core promoters were defined by merging tags within 300 base pairs; a threshold on expression was used to reduce background noise. According to this definition we found 22,023 promoters, 50% of which mapped to either the annotated promoter region or the 5' UTR of RefSeq transcripts. We found that ca. 32% of the genes expressed in our data use more than one promoter. A promoter was considered preferentially expressed (PEP) in one of the regions if at least 50% of its total expression was derived from that region. Around 30% of the alternative promoters identified were PEPs. Ca. 15% of the promoters found were either part of a bi-directionally transcribed pair or antisense to an annotated RefSeq gene.
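A minimal sketch of the promoter definition just described (single-linkage merging of tag positions within 300 bp plus an expression floor; toy positions and thresholds, and real pipelines normalise expression across libraries first):

    def cluster_tss(positions, gap=300, min_tags=5):
        # Merge sorted tag positions whenever the gap to the previous
        # tag is <= 300 bp; discard clusters below the noise floor.
        positions = sorted(positions)
        clusters, current = [], [positions[0]]
        for p in positions[1:]:
            if p - current[-1] <= gap:
                current.append(p)
            else:
                clusters.append(current)
                current = [p]
        clusters.append(current)
        return [c for c in clusters if len(c) >= min_tags]

    tags = [100, 120, 130, 140, 150, 900, 5000, 5010, 5020, 5040, 5100]
    print(cluster_tss(tags))  # two promoters; the lone tag at 900 is dropped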

This study confirms deep-CAGE as a suitable approach to profile transcription, even in the challenging context given by the use of post-mortem tissue from aged human brains.
 
Poster J47
Resource-Efficient de novo Assembly of Eukaryotic Genomes

Thomas Conway NICTA/University of Melbourne
Bryan Beresford-Smith (NICTA/University of Melbourne); Andrew Bromage (NICTA/University of Melbourne, Computer Science); Justin Zobel (NICTA/University of Melbourne, Computer Science);
 
Short Abstract: Assembly of genomes from short sequence reads is a core problem in bioinformatics. Assembly algorithms based on de Bruijn graphs have been very successful, particularly with sequence data characterised by large numbers (millions or more) of short (30–500 nucleotide) reads. However, a serious practical impediment to the assembly of larger genomes with current de Bruijn graph implementations is the amount of memory required to represent the graph and the sequence reads, with accurate assembly of genomes larger than a few million base pairs being infeasible on commodity servers. We have developed and implemented an assembler using a practical and efficient representation of the de Bruijn graph, recently published [1], which exploits the succinct data structures developed by Okanohara & Sadakane and by Raman, Raman & Rao. With our implementation, gossamer, we can, for example, build the de Bruijn graph for a human genome from 1.6×10⁹ Illumina 75bp reads using a single server with only 32GB of memory. The assembly required 51 hours.

Importantly, our solution requires a sub-linear increase in space requirements as the amount of sequence data (and therefore also sequencing errors) increases. Since the publication [1], we have also developed a technique for threading reads through the assembly graph requiring space logarithmic in the volume of read data. We are also making use of the space efficiency of gossamer for the de novo assembly of transcriptomes and metagenomes where deep sequencing can increase the sensitivity and dynamic range of the analysis.

[1] Conway, T., Bromage, A.: Succinct Data Structures for Assembling Large Genomes. Bioinformatics 2011 27 (3).
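To make the memory argument concrete, an illustrative back-of-the-envelope contrast (our assumed figures, not gossamer's actual layout) between a naive hash table of packed k-mers and a succinct representation charging only a few bits per k-mer:

    # Assumed figures for illustration only: ~3x10^9 distinct k-mers for
    # a human-scale graph; 8-byte packed k-mer plus ~8 bytes of hash
    # overhead in the naive layout; ~4 bits per k-mer when succinct.
    n_kmers = 3e9
    naive = n_kmers * (8 + 8)          # bytes
    succinct = n_kmers * 4 / 8         # bytes
    print(f"naive:    {naive / 2**30:6.1f} GiB")     # ~44.7 GiB
    print(f"succinct: {succinct / 2**30:6.1f} GiB")  # ~ 1.4 GiB

Construction buffers and the reads themselves add to this, but the gap suggests how a 32GB server becomes sufficient.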
 
Poster J48
The SMALT Software for the Efficient and Accurate Alignment of DNA Sequencing Reads

Hannes Ponstingl Wellcome Trust Sanger Institute
Zemin Ning (Wellcome Trust Sanger Institute, Sequencing Informatics);
 
Short Abstract: We present new results on SMALT, a computer program we developed for the efficient and accurate alignment of DNA sequencing reads with a reference genome. Reads from most types of sequencing platforms, for example Illumina, Roche-454, PacBio, or ABI capillary sequencers, can be processed.

The software employs a hash index of short words sampled at equidistant steps along the genomic reference sequences. For each read, potentially matching segments in the reference are identified from seed matches in the index and subsequently aligned with the read using a banded Smith-Waterman algorithm. A score for the reliability of the mapping is assigned to the best match. The length and spacing of the hashed words are adjustable. A range of different output formats, multi-threading, and the detection of split reads are supported.
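A toy version of this indexing scheme (sampled-word hash plus diagonal voting on seed matches; the banded Smith-Waterman extension and mapping-reliability scoring are omitted):

    def build_index(ref, w=4, s=2):
        # Hash words of length w sampled every s bases along the reference.
        index = {}
        for i in range(0, len(ref) - w + 1, s):
            index.setdefault(ref[i:i + w], []).append(i)
        return index

    def seed_hits(read, index, w=4):
        # Every read word that hits the index votes for a diagonal
        # (reference offset); the best diagonal seeds the alignment.
        hits = {}
        for j in range(len(read) - w + 1):
            for i in index.get(read[j:j + w], []):
                hits[i - j] = hits.get(i - j, 0) + 1
        return hits

    ref = "ACGTACGTGGACCTAGGT"
    print(seed_hits("GGACCTAG", build_index(ref)))  # {8: 3}: offset 8 wins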

We compared the performance of SMALT to 4 popular read mappers for reads generated computationally from the sequence of the entire human genome. Single base changes and short insertions and deletions of up to 14 nucleotides were uniformly distributed at rates ranging from 0.5% to 5%, with every 5th variation an insertion or deletion. We found the accuracy of SMALT for single reads superior at comparable sensitivity, speed and memory requirements. For paired-end reads it matched the most accurate software at superior speed. The speed combined with high sensitivity and low error rates makes SMALT very useful for a broad range of genomic re-sequencing projects and sequencing platforms.

The software is available from http://www.sanger.ac.uk/resources/software/smalt.
 
Poster J49
SeqGI: Sequence Read Enrichment at Genomic Intervals

Tom Carroll Medical Research Council
Ines de Santiago (MRC, Clinical Sciences Centre, London); Ana Pombo (MRC, Clinical Sciences Centre, London);
 
Short Abstract: The visualisation and statistical evaluation of read profiles over genomic features are core components in the interpretation of high-throughput sequencing data. These processes have largely remained disparate, leading to the use of multiple software packages and inter-conversion between differing file formats. Furthermore, the increasing use of multiple biological samples in ChIP-Seq studies demands statistical and computational methods suitable for the assessment of biological variation. SeqGI provides a GUI framework for the simultaneous visualisation and testing of sequence read distributions both between and within classes of user-defined genomic features. SeqGI can intersect BED, WIG or output files from standard aligners with a set of dictated genomic features in order to calculate and illustrate the read density at these genomic intervals or at some distance from known regions/features. Profile plots and heatmaps are used to visualise single or multiple read distributions across features, whereas scatter and box plots allow for the identification of differential read densities between individual genomic features or classes of features, as well as between conditions. Alongside normalisation, transformation and classical parametric and non-parametric tests, SeqGI's statistical framework allows for the analysis of differential read densities as count data using methodology implemented in the DESeq Bioconductor package. SeqGI provides users with an intuitive graphical interface combining both visualisation and statistical tools, and so assists in the rapid interpretation of sequencing data.
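A minimal sketch of the core computation (binned read density around a set of features; SeqGI layers normalisation, statistics and plotting on top of this, and the positions below are made up):

    def density_profile(read_starts, features, flank=200, bins=8):
        # Count read starts falling into fixed-width bins spanning
        # feature - flank .. feature + flank, summed over all features.
        width = 2 * flank // bins
        profile = [0] * bins
        for f in features:
            for r in read_starts:
                off = r - (f - flank)
                if 0 <= off < 2 * flank:
                    profile[off // width] += 1
        return profile

    reads = [950, 980, 1000, 1005, 1020, 1190, 2990, 3010, 3015]
    tss = [1000, 3000]
    print(density_profile(reads, tss))  # [0, 0, 0, 3, 5, 0, 0, 1]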
 
Poster J50
Genome3D: A viewer-model framework for integrating and visualizing multi-scale epigenomic information within a three-dimensional genome

Wenjin Zheng Medical University of South Carolina
Thomas Asbury (MUSC, Biochemistry & Molecular Biology); Matt Mitman (University of South Carolina, Computer Science); Jijun Tang (University of South Carolina, Computer Science);
 
Short Abstract: New technologies are enabling the measurement of many types of genomic and epigenomic information at scales ranging from the atomic to nuclear. Much of this new data is increasingly structural in nature, and is often difficult to coordinate with other data sets. There is a legitimate need for integrating and visualizing these disparate data sets to reveal structural relationships not apparent when looking at these data in isolation.

We have applied object-oriented technology to develop a downloadable visualization tool, Genome3D, for integrating and displaying epigenomic data within a prescribed three-dimensional physical model of the human genome. In order to integrate and visualize large volumes of data, novel statistical and mathematical approaches have been developed to reduce the size of the data. To our knowledge, this is the first such tool that can visualize the human genome in three dimensions.

Genome3D is a software visualization tool that explores a wide range of structural genomic and epigenetic data. Data from various sources of differing scales can be integrated within a hierarchical framework that is easily adapted to new developments concerning the structure of the physical genome. In addition, our tool has a simple annotation mechanism to incorporate non-structural information. Genome3D is unique in its ability to manipulate large amounts of multi-resolution data from diverse sources to uncover complex and new structural relationships within the genome.
 
Poster J51
GeneCards: The Human Gene ReSearcher

Irina Dalah Weizmann Institute of Science
Irina Dalah (Weizmann Institute of Science); Gil Stelzer (Weizmann Institute of Science, Molecular Genetics); Noam Nativ (Weizmann Institute of Science, Molecular Genetics); Tsippi Iny-Stein (Weizmann Institute of Science, Molecular Genetics); Naomi Rosen (Weizmann Institute of Science, Molecular Genetics); Yigeal Satanower (Weizmann Institute of Science, Molecular Genetics); Frida Belinky (Weizmann Institute of Science, Molecular Genetics); Hagit Krug (Weizmann Institute of Science, Molecular Genetics); Marilyn Safran (Weizmann Institute of Science, Molecular Genetics); Doron Lancet (Weizmann Institute of Science, Molecular Genetics);
 
Short Abstract: GeneCards (www.genecards.org) is a comprehensive, searchable compendium of annotative information about human genes, used worldwide for over 15 years (Safran et al., Database, Aug. 5:baq020, 2010). Its gene-centric content is automatically mined and integrated from over 90 digital sources. The collated data are presented in a web-based card for each of ~67,400 gene entries, including ~21,300 protein-coding and ~14,700 RNA genes. The broad range of data sources allows researchers to glean extensive systems-oriented insights regarding basic biological mechanisms and their medical significance. This is made possible by an extensive relational database structure with >100 tables, and a powerful Lucene-based index and advanced search engine, enabling a new level of meta-annotation. Score-prioritized search results show the retrieved genes as expandable structured minicards, with section-specific search-string highlights. GeneCards features unique classification tools, such as GIFtS (Harel et al., BMC Bioinformatics 23;10:348, 2009), which represents functional annotation levels and allows one to pinpoint searches to a desired segment of the gene annotation landscape. GeneCards provides strong ties to the mined resources, including extensive deep-links and clear demarcation of source-of-origin for card-displayed information. Further reSearch can be done by exporting an editable list of gene symbol hits to GeneALaCart, our batch query-and-display tool, as well as to GeneDecks, our gene set-analysis suite (Stelzer et al., OMICS 13:477-87, 2009). The latter includes Partner Hunter, retrieving a like-me set of genes based on diverse user-selected annotation attributes, and Set Distiller, outputting additional enriched descriptors for a search-retrieved gene set.
Supported by a grant from Xennex http://www.xennexinc.com/
 



Attention Poster Authors: The ideal poster size is max. 1.30 m (130 cm) high x 0.90 m (90 cm) wide. Fasteners (Velcro / double-sided tape) will be provided at the site; please DO NOT bring tape, tacks or pins.




Posters Display Schedule:

Odd Numbered posters:
  • Set-up timeframe: Sunday, July 17, 7:30 a.m. - 10:00 a.m.
  • Author poster presentations: Monday, July 18, 12:40 p.m. - 2:30 p.m.
  • Removal timeframe: Monday, July 18, 2:30 p.m. - 3:30 p.m.*
Even Numbered posters:
  • Set-up timeframe: Monday, July 18, 3:30 p.m. - 4:30 p.m.
  • Author poster presentations: Tuesday, July 19, 12:40 p.m. - 2:30 p.m.
  • Removal timeframe: Tuesday, July 19, 2:30 p.m. - 4:00 p.m.*
* Posters that are not removed by the designated time may be taken down by the organizers and discarded. Please be sure to remove your poster within the stated timeframe.

Delegate Posters Viewing Schedule

Odd Numbered posters:
On display Sunday, July 17, 10:00 a.m. through Monday, July 18, 2:30 p.m.
Author presentations will take place Monday, July 18: 12:40 p.m.-2:30 p.m.

Even Numbered posters:
On display Monday, July 18, 4:30 p.m. through Tuesday, July 19, 2:30 p.m.
Author presentations will take place Tuesday, July 19: 12:40 p.m.-2:30 p.m.





Want to print a poster in Vienna? Try these options:

Repacopy, next to the congress venue [MAP]

Also at Karlsplatz, in the Ring Center, Kärntner Str. 42 [MAP]


If you need your poster on thicker material, you may also use a plotter service next to Karlsplatz: http://schiessling.at/portfolio/


