19th Annual International Conference on
Intelligent Systems for Molecular Biology and
10th European Conference on Computational Biology

Accepted Posters

Category 'I' - Genome Annotation
Poster I01
Integrated prediction of protein functions in Arabidopsis thaliana by Bayesian Markov Random Fields

Yiannis Kourmpetis Wageningen University
Aalt Jan van Dijk (Plant Research International, Applied Bioinformatics); Roeland van Ham (Plant Research International, Applied Bioinformatics); Cajo ter Braak (Wageningen University and Research Centre, Biometris);
Short Abstract: Arabidopsis thaliana is a widely used model species that belongs to the Brassicaceae family. Unraveling the biological processes of this plant is of great importance for understanding plant biology and for transferring this knowledge to other species. Essential to this aim is the functional annotation of Arabidopsis proteins. Determining protein functions entirely by experiment is not feasible on a large scale. Targeted experiments are time consuming and presuppose a specific hypothesis about the protein's function, which is not available for every protein. High-throughput experiments give only a general view of protein function. Therefore, computational methods are needed that either predict protein functions accurately on a large scale or at least provide general but valuable information for the design of targeted experiments.
We developed a method for protein function prediction that combines network data (i.e. protein-protein interactions, co-expression) with sequence-derived features (i.e. functional domains). Our method is based on Markov Random Fields and uses an adaptive Markov Chain Monte Carlo algorithm to simultaneously predict protein functions and estimate the model parameters, including the weighting of the data types (network or sequence).
We applied our method to Arabidopsis data and obtained Gene Ontology term predictions for the biological roles of more than 23,000 proteins. Several predictions were confirmed by the recent literature. We further focused on the proteins that are predicted to be involved in the flowering and floral organ development of Arabidopsis.
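As an illustration only, the following Python sketch shows the general idea of Gibbs sampling on a Markov Random Field over a protein network for a single function term. The toy network, the fixed parameter values (`alpha`, `beta`) and the function name `gibbs_mrf` are assumptions made for this example; the adaptive parameter estimation and the sequence-derived features described in the abstract are not reproduced.

```python
import math
import random

def gibbs_mrf(neighbors, known, alpha=-1.0, beta=1.5, sweeps=2000, burn_in=500, seed=0):
    """Estimate P(protein has the function) for unannotated proteins.

    neighbors: dict protein -> list of interacting proteins
    known: dict protein -> 0/1 label for annotated proteins (kept fixed)
    alpha, beta: hypothetical fixed parameters (bias and neighbour weight)
    """
    rng = random.Random(seed)
    labels = dict(known)
    unknown = [p for p in neighbors if p not in known]
    for p in unknown:
        labels[p] = rng.randint(0, 1)  # random initial state
    counts = {p: 0 for p in unknown}
    for sweep in range(sweeps):
        for p in unknown:
            # Conditional log-odds depend on neighbours with / without the function.
            pos = sum(labels[q] for q in neighbors[p])
            neg = len(neighbors[p]) - pos
            logit = alpha + beta * (pos - neg)
            prob = 1.0 / (1.0 + math.exp(-logit))
            labels[p] = 1 if rng.random() < prob else 0
        if sweep >= burn_in:
            for p in unknown:
                counts[p] += labels[p]
    kept = sweeps - burn_in
    return {p: counts[p] / kept for p in unknown}

# Toy interaction network: A and B are annotated with the function, F is not.
network = {
    "A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B", "D"],
    "D": ["C", "E"], "E": ["D", "F"], "F": ["E"],
}
annotated = {"A": 1, "B": 1, "F": 0}
posterior = gibbs_mrf(network, annotated)
```

In this toy network, protein C (two annotated neighbours) receives a higher posterior probability than protein E (one neighbour known to lack the function), which is the qualitative behaviour a network-based predictor is expected to show.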
Poster I02
Glycine max and Zea mays Genome Annotation with Gnomon

Alexandre Souvorov National Center for Biotechnology Information
Alexandre Souvorov (National Center for Biotechnology Information); Tatiana Tatusova (National Center for Biotechnology Information, IEB); Leonid Zaslasky (National Center for Biotechnology Information, IEB); Brian Smith-White (National Center for Biotechnology Information, IEB);
Short Abstract: The NCBI gene prediction method uses a combination of homology searches and ab initio models. The general philosophy behind this approach is to use available experimental information as much as possible. The NCBI annotation pipeline has already been used successfully for many plant and animal genomes. More recently we have annotated the Glycine max and Zea mays genomes.
Both genomes have a considerable amount of cDNA data: ~1,500,000 ESTs for Glycine max and ~2,000,000 ESTs for Zea mays. Zea mays also has ~170,000 full-length mRNAs. These cDNAs were used along with proteins from other plants for gene prediction. We predicted ~51,000 genes in Glycine max and ~42,000 genes in Zea mays.
The predicted proteins were clustered with other plant proteins from Arabidopsis thaliana, Arabidopsis lyrata, Vitis vinifera, Sorghum bicolor, Populus trichocarpa, Oryza sativa, and Ricinus communis. The clustering produced ~27,000 clusters. The clusters were used to derive putative protein functions.
The comparison of the GC content distribution along the coding regions of the orthologs from the nine genomes showed that in the monocot group the GC content at the 5’ end of the coding region is as high as 70% and drops to about 50% after 500-1000 bp. There is no such GC content gradient in the coding regions of the dicot genomes. This effect is similar to the one found by J. Yu et al. (Science, p79, v296, 2002) for orthologs of Oryza sativa and Arabidopsis thaliana.
The analysis of predicted gene models and comparison to alternative annotation will be presented.
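A GC content gradient of the kind described above can be measured with a simple windowed profile. The sketch below is a minimal illustration; the function name `gc_profile`, the window size and the synthetic sequence are assumptions for the example and do not come from the poster's data.

```python
def gc_profile(cds, window=100, step=100):
    """Return (start, GC fraction) for successive windows along a coding sequence."""
    cds = cds.upper()
    profile = []
    for start in range(0, len(cds) - window + 1, step):
        chunk = cds[start:start + window]
        gc = sum(chunk.count(base) for base in "GC") / window
        profile.append((start, gc))
    return profile

# Synthetic sequence: a GC-rich 5' end (200 bp) followed by 400 bp at 50% GC.
toy_cds = "GC" * 100 + "GCAT" * 100
profile = gc_profile(toy_cds)
```

For the synthetic sequence the profile drops from 1.0 in the first two windows to 0.5 in the remaining four; on real orthologs the same profile would be averaged per position across genes before comparing monocots and dicots.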
Poster I03
Expansion of Genome-wide Annotation in COSMIC (the Catalogue Of Somatic Mutations In Cancer)

Mingming Jia The Wellcome Trust Sanger Institute
Simon Forbes (The Wellcome Trust Sanger Institute, Cancer Genome Project); Rebecca Shepherd (The Wellcome Trust Sanger Institute, Cancer Genome Project); Nidhi Bindal (The Wellcome Trust Sanger Institute, Cancer Genome Project); David Beare (The Wellcome Trust Sanger Institute, Cancer Genome Project); Chai Yin Kok (The Wellcome Trust Sanger Institute, Cancer Genome Project); Prasad Gunasekaran (The Wellcome Trust Sanger Institute, Cancer Genome Project); Kenric Leung (The Wellcome Trust Sanger Institute, Cancer Genome Project); Andrew Menzies (The Wellcome Trust Sanger Institute, Cancer Genome Project); Sally Bamford (The Wellcome Trust Sanger Institute, Cancer Genome Project); Charlotte Cole (The Wellcome Trust Sanger Institute, Cancer Genome Project); Sari Ward (The Wellcome Trust Sanger Institute, Cancer Genome Project); Jon Teague (The Wellcome Trust Sanger Institute, Cancer Genome Project); Adam Butler (The Wellcome Trust Sanger Institute, Cancer Genome Project); Andrew Futreal (The Wellcome Trust Sanger Institute, Cancer Genome Project); Michael Stratton (The Wellcome Trust Sanger Institute, Cancer Genome Project);
Short Abstract: Cancer is driven by the accumulation of somatically acquired mutations. COSMIC (www.sanger.ac.uk/cosmic/) is a resource offering a comprehensive catalogue of these mutational events in human cancer. The database currently describes over 42,000 different coding mutations identified from more than 577,000 tumour samples (v51, Jan 2011). COSMIC is populated with manually curated data from scientific literature including large-scale systematic screens, together with data from the Cancer Genome Project at the Sanger Institute, UK (CGP), The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC). Results of full-genome screening experiments are held on 383 tumours, with mutation annotations provided by the VAGrENT annotation tool (www.sanger.ac.uk/resources/software/vagrent) developed by our group. COSMIC provides a graphical web-based user interface for browsing the database. A GBrowse instance has been set up, which permits the visualisation of the cancer mutations in a genomic (GRCh37) context (www.sanger.ac.uk/fgb2/gbrowse/cosmic), as well as a BioMart for data mining and linking to additional external resources such as Ensembl (www.sanger.ac.uk/genetics/CGP/cosmic/biomart/martview). COSMIC somatic mutation data have also been integrated into Ensembl, which allows the mutations to be viewed within the Ensembl Genome Browser. In addition to somatic mutation data, we plan to integrate the data from the Genomics of Drug Sensitivity in Cancer Project, which is screening a wide range of anti-cancer therapeutics against over 1,000 genetically-characterised human cancer cell lines. The sensitivity data and genetic correlations are currently available through a dedicated website (www.sanger.ac.uk/genetics/CGP/translation).
Poster I04
Bayesian Methods of Genome Annotation

Vasilis Christaras University of Glamorgan
Tatiana TATARINOVA (University of Glamorgan, Faculty of Advanced Technology);
Short Abstract: The introduction of next generation sequencing methods has presented an opportunity to revisit the problem of accurate identification of transcription start sites (TSS). This is an important problem, since the position of a regulatory element with respect to the TSS affects gene regulation, and the performance of promoter motif-finding methods depends on correct identification of the TSS. This poster presents a statistical analysis of promoter regions of the model dicot plant Arabidopsis thaliana. Important statistical features of promoters, such as the frequency and location of transcription factor binding sites, the length of the untranslated region, codon usage bias in the upstream gene and the distribution of positions of expressed sequence tags, are analysed and a coherent overview is presented. Based on these features, we propose a Bayesian method to evaluate the collections of evidence and an aggregate scoring function. By incorporating additional evidence we can improve the accuracy of TSS prediction, expand recognition capabilities to multiple TSS per locus and enhance the understanding of alternative splicing mechanisms.
We discuss performance of our new method in application to various model organisms.
Poster I05
A Multi-objective approach for the prediction of bacterial sRNAs

Javier Arnedo Granada University
Coral del Val (Granada University); Rocio Romero-Zaliz (Granada University, Computer Science); Igor Zwir (Granada University, Computer Science);
Short Abstract: Bacterial small non-coding RNAs (sRNAs) are recognized as novel, widespread regulators of gene expression in response to environmental signals. Despite their important role in regulation, their accurate prediction in genome sequences has proven more difficult than that of protein-coding genes. Several algorithms are available to identify and score potential sRNA secondary structures based on thermodynamic stability of the secondary structure, conservation, and/or covariance in sequence alignments, among other criteria. However, the principal drawback of all these existing methods is the vast number of false positives that they generate. In this work we propose a multi-objective methodology that combines available algorithms into an aggregation scheme in order to obtain optimal aggregations of methods that reduce the number of false positive predictions and increase the sensitivity. We applied our methodology to the genome of Salmonella typhimurium LT2. Comparative sequence data were obtained from four related organisms (Escherichia coli K12, Klebsiella pneumoniae, Xylella fastidiosa and Yersinia pestis). As complementary predictive tools to identify conserved and stable secondary structures corresponding to putative non-coding RNAs, the following programs were selected: zMFold, QRNA, RNAz, Alifoldz, Dynalign, TransTermHP, MSARi and vsFold.
The optimization was carried out with a multi-objective evolutionary optimization algorithm, NSGA-II. This algorithm tries to maximize sensitivity and specificity, two conflicting measures of the quality of an aggregation. The results show a major improvement in the specificity and sensitivity of our strategy compared to the performance of the individual methods. Moreover, the methodology proposed here could be seen as an automatic method generator.
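Multi-objective optimizers such as NSGA-II rank candidate solutions by Pareto dominance rather than by a single score. The sketch below shows only that dominance test on two objectives; it is not NSGA-II itself, and the aggregation names and scores are made up for illustration.

```python
def dominates(a, b):
    """True if a is at least as good as b on both objectives and better on one."""
    return a[0] >= b[0] and a[1] >= b[1] and a != b

def pareto_front(candidates):
    """candidates: dict name -> (sensitivity, specificity), both to be maximized."""
    return {
        name: score
        for name, score in candidates.items()
        if not any(dominates(other, score) for other in candidates.values())
    }

# Made-up scores for hypothetical aggregations of predictors.
aggregations = {
    "agg1": (0.90, 0.40),
    "agg2": (0.70, 0.70),
    "agg3": (0.60, 0.65),  # worse than agg2 on both objectives
    "agg4": (0.30, 0.95),
}
front = pareto_front(aggregations)
```

Here `agg3` is discarded because `agg2` is better on both objectives, while the remaining three represent different trade-offs between sensitivity and specificity.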
Poster I06
Nonparametric Approach to Analysis of Data-Rich Biological Problems

Tatiana TATARINOVA University of Glamorgan
Hannah Garbett (University of Glamorgan, Faculty of Advanced Technology); Alona Chubatiuk (University of Southern California, Mathematics); Alan Schumitzky (University of Southern California, Mathematics); Farzana Rahman (University of Glamorgan, Faculty of Advanced Technology);
Short Abstract: Exponential growth of experimental data almost inevitably leads to multiplicity of observation values. Some values can be repeated tens or even hundreds of times, while the number of distinct observations may be limited. A brute-force approach to model-based clustering is not computationally efficient for such datasets. In this poster we present an alternative procedure for the treatment of multiplicity of observations. The main reason we consider the nonparametric Bayesian approach is its ability to calculate Bayesian credibility intervals regardless of sample size. This is not possible with other methods, such as the nonparametric maximum likelihood approach.

We discuss how the Gibbs sampler is used for the calculations required in the context of Dirichlet Process mixture models. The traditional implementation of the Gibbs sampler in this setting is computationally prohibitive, since the initial number of support points is usually set equal to the number of observations. Our approach is based on reducing the number of support points to the number of distinct observations and modifying the Gibbs sampler to take into account the multiplicity of each distinct value. We tested our approach on standard benchmark datasets and compared its performance with existing parametric and nonparametric clustering methods. In this poster we describe the application of our method to the analysis of genome-wide Copy Number Variation data, generated by The Wellcome Trust Case-Control Consortium, and to the problem of gene prediction using Expressed Sequence Tag data from different libraries.
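The saving from working with distinct values can be illustrated on a single likelihood evaluation. The sketch below is not the modified Gibbs sampler from the poster; it only shows, under the assumption of a normal component with hypothetical parameters, that weighting each distinct value by its multiplicity gives the same log-likelihood as visiting every observation.

```python
import math
from collections import Counter

def normal_logpdf(x, mean, sd):
    return -0.5 * math.log(2 * math.pi * sd ** 2) - (x - mean) ** 2 / (2 * sd ** 2)

def loglik_full(observations, mean, sd):
    """Log-likelihood computed by visiting every observation."""
    return sum(normal_logpdf(x, mean, sd) for x in observations)

def loglik_distinct(observations, mean, sd):
    """Same quantity, visiting each distinct value once, weighted by its multiplicity."""
    multiplicity = Counter(observations)
    return sum(n * normal_logpdf(x, mean, sd) for x, n in multiplicity.items())

# Toy data: 500 observations but only 3 distinct values.
data = [2] * 300 + [3] * 150 + [7] * 50
full = loglik_full(data, mean=3.0, sd=1.5)
fast = loglik_distinct(data, mean=3.0, sd=1.5)
```

Both calls return the same value up to floating-point rounding, but the second evaluates the density three times instead of 500.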
Poster I07
GenDB3 - A novel genome annotation system for eukaryotes and prokaryotes

Lukas Jelonek Bielefeld University
Burkhard Linke (Bielefeld University, CeBiTec); Oliver Rupp (Bielefeld University, CeBiTec); Alexander Goesmann (Bielefeld University, CeBiTec);
Short Abstract: GenDB is an open-source, web-based genome annotation system for prokaryotes that has been used in hundreds of genome sequencing projects. It allows automatic annotations based on different pipelines as well as manual annotations and uses a role-based project management system for authentication and authorization of users.
To cope with today’s demands in data size, scalability and usability, GenDB3 is a redesign of GenDB that is implemented in the Java programming language. The core of the system is a unified application programming interface (API) defining the basic data concepts in genome annotation of prokaryotes as well as eukaryotes. Based on this API a hybrid multitiered architecture was implemented that consists of a database-adapter to existing GenDB2 databases, a server application with a RESTful webservice interface, a client application and all legacy pipelines from GenDB2.
The client is based on the NetBeans rich client platform, which easily enables the development of modular applications. Through different implementations of the GenDB3-API it can connect to different data sources. Currently it has a fully featured implementation for the GenDB3-RESTful webservice and a simple implementation for viewing EMBL or Genbank files based on Biojava. The highly interactive visualization components are based on the zoomable user interface (ZUI) framework Piccolo2D, enabling a new browsing user experience for existing GenDB users. New client functionality can be implemented in the form of plugins.
GenDB3 is a work in progress, and a first public version comprising the annotation of yeasts, fungi and plants is in preparation.
Poster I08
Genome Scale Annotation Using Rosetta De Novo Structure Predictions

Kevin Drew New York University
Patrick Winters (New York University, Biology); Glenn L. Butterfoss (New York University, Biology); Viktors Berstis (IBM, Austin); Keith Uplinger (IBM, Austin); Jonathan Armstrong (IBM, Austin); Michael Riffle (University of Washington, Biochemistry / Genome Sciences); Erik Schweighofer (Institute of Systems Biology, IT); Bill Bovermann (IBM, Austin); David R. Goodlett (University of Washington, Medicinal Chemistry); Trisha N. Davis (University of Washington, Biochemistry / Genome Sciences); Dennis Shasha (New York University, Computer Science); Lars Malmström (ETH Zurich, Institute of Molecular Systems Biology); Richard Bonneau (New York University, Biology / Computer Science);
Short Abstract: The incompleteness of genome function annotation is a critical problem for all biologists and, in particular, severely limits interpretation of high-throughput and next generation experiments. Protein structure is highly valuable for the description of a protein’s function but the difficulty and expense of experimentally solving protein structures has limited its application to genome annotation. We have used our Rosetta de novo structure prediction algorithm to computationally predict structures for over 57,000 protein domains without known structure from nearly 100 organisms (including Human, Arabidopsis, Rice, Mouse, Fly, Yeast, E. coli and Worm). To increase the accuracy of structure predictions, Rosetta was distributed on a grid of over 1.5 million CPUs worldwide (World Community Grid) which allowed increased sampling of the protein fold space. We then used these structure predictions to classify proteins into SCOP superfamilies of related proteins. Additionally, we predict GO Molecular Function annotations by integrating our structure predictions with available GO Biological Process and Cellular Component annotations. We provide multiple interfaces to this database of structure and function results at: http://www.yeastrc.org/pdr/.
Poster I09
From next generation sequencing to knowledge: A case study of functional genomics annotation in prokaryotes

Carlos Prieto Biotechnology Institute of Leon
Carlos Barreiro (Biotechnology Institute of Leon, INBIOTEC); Antonio Rodríguez-García (Biotechnology Institute of Leon, INBIOTEC); Alberto Sola-Landa (Biotechnology Institute of Leon, INBIOTEC); Miriam Martínez-Castro (Biotechnology Institute of Leon, INBIOTEC); Carlos García-Estrada (Biotechnology Institute of Leon, INBIOTEC); Rosario Pérez-Redondo (Biotechnology Institute of Leon, INBIOTEC); Lorena Fernández-Martínez (Biotechnology Institute of Leon, INBIOTEC); Elena Solera (Biotechnology Institute of Leon, INBIOTEC); Katarina Kosalková (Biotechnology Institute of Leon, INBIOTEC); Jesús Aparicio (Biotechnology Institute of Leon, INBIOTEC); Juan F. Martín (Biotechnology Institute of Leon, INBIOTEC);
Short Abstract: Next generation sequencing techniques have greatly increased the number of genome sequencing projects. In the life cycle of these projects, bioinformatics analyses are crucial for the success and further exploitation of sequenced genomes. There is a wide range of necessary tasks, from assembly to final publication. Fortunately, a large number of software tools have been developed to make these tasks easier. However, each task requires an appropriate choice of tools and methods to be completed successfully.
In this communication we describe the variety of bioinformatics tools, methods and approaches that have been used in the life cycle of our sequencing project. The whole process has been divided into phases: Scaffold Reordering, Gap Filling, Gene Prediction, Gene Validation, Manual Annotation, Functional Annotation, Genome Visualization and Publication.
This project has focused on prokaryote sequencing; in particular, we have sequenced the genome of Streptomyces tsukubaensis. This organism is an actinobacterium, that is, a gram-positive bacterium with a high G+C content in its DNA. The species is of interest because its secondary metabolism produces drugs with high added value, and an important part of the project will focus on identifying the gene clusters that produce these metabolites.
With this poster we aim not only to disseminate our scientific work but also to seek the active collaboration of conference participants in proposing new approaches that could be applied to upcoming sequencing projects.
Poster I10
Human annotation in Ensembl

Amonida Zadissa Wellcome Trust Sanger Institute
Tim Hubbard (Wellcome Trust Sanger Institute, Ensembl); Steve Searle (Wellcome Trust Sanger Institute, Ensembl); Bronwen Aken (Wellcome Trust Sanger Institute, Ensembl); Susan Fairley (Wellcome Trust Sanger Institute, Ensembl); Amy Tang (Wellcome Trust Sanger Institute, Ensembl); Simon White (Wellcome Trust Sanger Institute, Ensembl); Magali Ruffier (Wellcome Trust Sanger Institute, Ensembl); Thibaut Hourlier (Wellcome Trust Sanger Institute, Ensembl); Daniel Barrell (Wellcome Trust Sanger Institute, Ensembl); Michael Schuster (European Bioinformatics Institute, Ensembl);
Short Abstract: Ensembl provides integrated genome annotation for Human and other vertebrate species, including coding and non-coding gene annotation, multiple species alignments, functional genomics, variation resources, and tools such as the Variant Effect Predictor.

The Ensembl human gene set is the official gene set used by the ENCODE project. Human annotation was recently updated using the Ensembl automatic annotation pipeline. We also merged Havana manual annotation for human into the Ensembl annotation to provide a more comprehensive gene set for our users.

The human gene set on GRCh37 (Ensembl release 62) contains 51,123 genes of which 20,684 are classified as protein-coding. Currently 78% of these protein-coding genes are "merged" (containing Ensembl-Vega merged transcripts). The remaining non-merged protein-coding genes are mainly contributed by Ensembl's genome-wide annotation in regions where Havana have not yet provided manual annotation.

Ensembl collaborates with HAVANA, NCBI and UCSC in the Consensus Coding Sequence (CCDS) project which aims at identifying a core set of protein-coding regions that are consistently annotated by these groups and of high quality. Our updated gene set will be used for a new round of comparison; results of this comparison will be presented.

The Genome Reference Consortium releases regular updates to the human reference assembly GRCh37 in the form of 'patches'. We import these patches and provide basic annotation on them. To date, there have been 53 'novel' patches (new haplotypes) and 31 'fix' patches (improvements to problematic regions). Fix patches improve the underlying reference assembly and thereby enable better gene annotation on these regions.
Poster I11
Collaborative action between Spain and Austria to enhance the application of High Performance Computing to High throughput technologies

Pau Corral University of Málaga, Bioinformatics and Information Technologies Laboratory
Noura Chelbat (Bioinformatik Institute, Johannes Kepler University); Antonio Muñoz (University of Málaga, Bioinformatics and Information Technologies Laboratory); Johan Karlsson (University of Málaga, Bioinformatics and Information Technologies Laboratory); Alfredo Martínez (University of Málaga, Bioinformatics and Information Technologies Laboratory); Noura Chelbat (Bioinformatik Institute, Johannes Kepler University, Bioinformatics); Günter Klambauer (Bioinformatik Institute, Johannes Kepler University, Bioinformatics); Sepp Hochreiter (Bioinformatik Institute, Johannes Kepler University, Bioinformatics); Oswaldo Trelles (University of Málaga, Bioinformatics and Information Technologies Laboratory); Ulrich Bodenhofer (Bioinformatik Institute, Johannes Kepler University, Bioinformatics);
Short Abstract: Next generation sequencing (NGS) has shifted the bottleneck in sequence knowledge acquisition from sequencing to data interpretation, data integration and experimental design. A few years ago sequencing was tedious work, and researchers had considerable difficulty managing all the data generated during genome sequencing and microarray analysis. Today the rapid technological growth in whole-genome sequencing has created the need for specialized software to store, handle and extract information from this volume of data as it is produced, in order to avoid data accumulation.
This bilateral action between Spain and Austria proposes an exchange of knowledge and expertise on two of today's most important next generation sequencing platforms.
Poster I12
CHIPEAN – CHIP Seq Peak Annotation

Agnes Hotz-Wagenblatt German Cancer Research Center
Kai Sommer (German Cancer Research Center, Bioinformatics W180); Maximilian Rueppell (German Cancer Research Center, Bioinformatics W180); Karl-Heinz Glatting (German Cancer Research Center, Bioinformatics W180);
Short Abstract: Chromatin immunoprecipitation followed by massively parallel sequencing (ChIP-seq) is one of the new technologies used for gaining insight into DNA-protein regulation. To investigate this regulation, our pipeline for the annotation of ChIP-seq peaks reveals the targeted genes and their GO and pathway enrichments. Additionally, the known transcription factor binding sites in those regions are mapped using the Transfac database. Novel motifs are found by a de novo motif search using MEME and the Gibbs Sampler program. As input, the pipeline needs a BED file containing the peaks, the range that should be added when looking up neighboring genes, and the organism/assembly version. GO and pathway annotation of the genes is used for enrichment tests.

The output is a list of known transcription factor binding sites ranked according to their enrichment, together with a list of novel motifs found. The second section shows the targeted genes together with the enriched GO terms and pathways. The pipeline has been developed in the W3H pipeline system. We will analyse a public ChIP-Seq data set with CHIPEAN and compare the result with other tools.
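The gene look-up step described above (peaks from a BED file, extended by a user-supplied range) can be sketched as a simple interval overlap. This is a minimal illustration, not the CHIPEAN implementation; the function name `genes_near_peaks`, the `flank` parameter and all coordinates are assumptions for the example.

```python
def genes_near_peaks(peaks, genes, flank=1000):
    """Map each peak to genes overlapping the peak extended by `flank` bp.

    peaks: list of (chrom, start, end) as in a BED file (0-based, half-open)
    genes: list of (name, chrom, start, end) in the same coordinate system
    """
    hits = {}
    for chrom, start, end in peaks:
        lo, hi = max(0, start - flank), end + flank
        hits[(chrom, start, end)] = [
            name for name, g_chrom, g_start, g_end in genes
            if g_chrom == chrom and g_start < hi and g_end > lo
        ]
    return hits

# Hypothetical peaks and gene coordinates.
peaks = [("chr1", 5000, 5200), ("chr2", 100, 300)]
genes = [
    ("geneA", "chr1", 5900, 8000),    # 700 bp downstream of the first peak
    ("geneB", "chr1", 20000, 22000),  # too far away
    ("geneC", "chr2", 0, 150),        # overlaps the second peak
]
targets = genes_near_peaks(peaks, genes, flank=1000)
```

The resulting gene lists would then feed the GO and pathway enrichment tests; for genome-scale inputs an interval index would replace the nested scan used here.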
Poster I13
Comprehensive Homeodomain Gene Annotation in Human and Mouse

Veronika Boychenko Wellcome Trust Sanger Research Institute
Laurens Wilming (Wellcome Trust Sanger Research Institute, Informatics (team 119, HAVANA)); Gloria Despacio-Reyes (Wellcome Trust Sanger Research Institute, Informatics (team 119, HAVANA)); Jennifer Harrow (Wellcome Trust Sanger Research Institute, Informatics (team 119, HAVANA));
Short Abstract: The HAVANA (Human and Vertebrate Analysis and Annotation) team provides manual annotation for the ENCODE and CCDS projects, amongst others, which is available via the VEGA genome browser, as well as Ensembl and the UCSC Genome Browser. We work on a gene-by-gene basis using the vertebrate transcriptome as our main evidence and strive to comprehensively annotate all gene loci, including non-coding RNAs, pseudogenes and splice variants. Recently we added RNAseq and protein mass spectrometry data to our annotation data sets.
One area where manual annotation has an important impact is the annotation of duplicated gene clusters. Using HomeoDB (homeodb.cbi.pku.edu.cn) as our source, we recently found some discrepancies in homeodomain gene annotation relative to RefSeq regarding exact gene structures, lack of transcriptional support and coding versus pseudogene status. RNAseq and protein mass-spec data will help us resolve these issues.
We have also developed guidelines for annotating long non-coding RNA in collaboration with HGNC and John Mattick's group at the University of Queensland. Antennapedia class homeodomain genes were shown to be strikingly enriched with antisense ncRNA loci, presumably having an important role in single gene or even gene cluster regulation. Of the annotated human set of 119 Antennapedia genes, 63% of homeodomain coding genes have an antisense locus, while of the 102 orthologous mouse genes only 46% of loci have an antisense locus (this includes the experimentally verified HOTAIR ncRNA and its mouse orthologue).
Poster I14
APPRIS - Canonical Isoforms for the Human Genome

Jose Manuel Rodriguez Centro Nacional de Investigaciones Oncologicas
Michael Tress (Centro Nacional de Investigaciones Oncologicas); Iakes Ezkurdia (Spanish National Cancer Research Centre, Structural and Computational Biology Programme); Gonzalo Lopez (Spanish National Cancer Research Centre, Structural and Computational Biology Programme); Paolo Maietta (Spanish National Cancer Research Centre, Structural and Computational Biology Programme); Alfonso Valencia (Spanish National Cancer Research Centre, Structural and Computational Biology Programme); Michael Tress (Spanish National Cancer Research Centre, Structural and Computational Biology Programme);
Short Abstract: Alternative splicing has the potential to expand the cellular protein repertoire by altering the biological function of the expressed proteins. Recent studies have estimated that almost 100% of multi-exon human genes produce differently spliced mRNAs.

However, the role played by alternative splicing in the modulation of cellular function is still far from clear. Recent work has suggested that many alternative isoforms are likely to have cellular functions that are substantially different from their constitutively spliced counterparts.

Given the likely ubiquity in the cell of alternative isoforms, the role of these alternatively spliced gene products is becoming an increasingly important question. Determining principal functional variants is a critical first step in the study of the implications of alternative splicing. The annotation of a variant as the principal isoform enables research groups to concentrate experiments on this main functional variant and allows bioinformatics groups to make reliable predictions of changes in structure and function for alternative variants.

We have annotated principal functional variants for the entire human genome using APPRIS in collaboration with the GENCODE consortium that is compiling the definitive annotation of the human genome as part of the ENCODE project. APPRIS deploys a range of computational methods to locate principal isoforms, including the conservation of exonic structure, the conservation of protein structure and function and a measure of non-neutral evolution of exons. These annotations are continually updated and freely available to the scientific community.
Poster I15
Refining the prediction of transported proteins by analysis of signal peptide prediction in orthologous groups

Armando Neto Centro de Pesquisas Rene Rachou - FIOCRUZ
Antônio Rezende (Centro de Pesquisas Rene Rachou - FIOCRUZ, Cellular and Molecular Parasitology Laboratory); Denise Alvarenga (Centro de Pesquisas Rene Rachou - FIOCRUZ, Malaria Laboratory); Ricardo Ribeiro (Centro de Pesquisas Rene Rachou - FIOCRUZ, Malaria Laboratory); Cristiana Brito (Centro de Pesquisas Rene Rachou - FIOCRUZ, Malaria Laboratory);
Short Abstract: Proteins from the same ortholog group are expected to agree on the presence or absence of signal peptides (SP), since divergences would entail profoundly distinct biological outcomes. Orthologous groups with diverging SP predictions could be explained by: 1) inaccurate protein sequence annotation; 2) limitations of the predicting algorithms; 3) biological variability. Therefore, we propose to combine orthology information with signal peptide prediction and develop strategies to discriminate between these three scenarios, in order to improve the overall accuracy of the prediction of transported proteins. Plasmodium proteins were retrieved (PlasmoDB), submitted to SP prediction (SignalP), organized into orthologous groups (OrthoMCL) and aligned (MAFFT). Groups presenting diverging SP predictions were selected, and a subset (43%) was manually inspected. Whenever possible, sequences were reannotated based on comparative analyses to better identify the protein N-terminus. Changed and Unchanged groups were comparatively analysed and a set of distinguishing metrics was derived. An SVM-based classifier was built to identify groups containing putative misannotated proteins. Of 1935 analysed genes, 308 were reannotated, and 217 (70%) had their SP predictions altered, resulting in a substantially modified list of exported/secreted proteins in Plasmodium species. Eleven statistically significant parameters for distinguishing between Changed and Unchanged groups were implemented in a preliminary SVM classifier. The training set was composed of 73 positive and 248 negative examples. In a test against a set of known groups, specificity and sensitivity for the SVM were 95.3% and 62.2%, respectively. The classifier identified 242 groups containing misannotated proteins among the 415 that were not inspected previously.
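The selection of orthologous groups with diverging signal peptide predictions can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `diverging_groups`, the group and protein identifiers, and the boolean input format are hypothetical and do not reflect the actual output formats of the tools named in the abstract.

```python
def diverging_groups(groups, has_signal_peptide):
    """Return ortholog groups whose members disagree on signal peptide prediction.

    groups: dict group id -> list of protein ids
    has_signal_peptide: dict protein id -> bool (e.g. parsed from predictor output)
    """
    flagged = {}
    for group_id, members in groups.items():
        calls = [has_signal_peptide[p] for p in members]
        if any(calls) and not all(calls):
            flagged[group_id] = sum(calls) / len(calls)  # fraction predicted SP-positive
    return flagged

# Hypothetical groups and predictions.
groups = {
    "OG1": ["p1", "p2", "p3"],
    "OG2": ["p4", "p5"],
    "OG3": ["p6", "p7", "p8", "p9"],
}
sp_calls = {
    "p1": True, "p2": True, "p3": True,
    "p4": False, "p5": False,
    "p6": True, "p7": True, "p8": True, "p9": False,
}
flagged = diverging_groups(groups, sp_calls)
```

Only `OG3` is flagged, since three of its four members are predicted to carry a signal peptide and one is not; such groups are the candidates for manual inspection or for a downstream classifier.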
Poster I16
Accurate RNA-seq-based de-novo annotation using mGene.ngs

Jonas Behr Friedrich-Miescher-Laboratory of the Max Planck Society
Regina Bohnert (Friedrich-Miescher-Laboratory of the Max Planck Society, -); Andre Kahles (Friedrich-Miescher-Laboratory of the Max Planck Society, -); Gabriele Schweikert (Friedrich-Miescher-Laboratory of the Max Planck Society, -); Georg Zeller (Friedrich-Miescher-Laboratory of the Max Planck Society, -); Lisa Hartmann (Friedrich-Miescher-Laboratory of the Max Planck Society, -); Gunnar Raetsch (Friedrich-Miescher-Laboratory of the Max Planck Society, -);
Short Abstract: The model organism Caenorhabditis elegans is one of the most important subjects for studying cell fate and the regulation of apoptosis. To gain a deeper understanding of regulatory mechanisms in C. elegans, its nearby evolutionary context was explored and the genomes of five closely related nematodes were sequenced. So far, a major limitation in analyzing these genomes has been the lack of an accurate transcriptome annotation. In this project we sequenced the transcriptomes (RNA-Seq) of all five nematodes and C. elegans using the Illumina sequencing platform (~300M reads, strand specific, paired-end, 76bp). Based on the RNA-Seq data we annotated all six nematodes using the newly developed de-novo gene finding system mGene.ngs. mGene.ngs combines features from the RNA-Seq data and the genomic DNA sequence already at the learning stage. The system can be trained on a set of highly expressed protein coding and non coding genes, whose structure can be directly inferred from the RNA-Seq data. The training was done independently for all six organisms. Predictions include alternative isoforms supported by spliced reads as well as non coding genes and transcripts. One benefit of this approach is that the predictions are not biased towards the C. elegans genome annotation, but it can take advantage of this high quality annotation to estimate the performance of the prediction. Therefore, we can account for differences in transcriptome structure between the organisms, most evident for P. pacificus, which has considerably more exons per gene (10.2) than C. elegans (8.1). We observe that the prediction accuracy in terms of coding transcript level sensitivity (56.1%) and specificity (62.7%) compares very favourably to the well known de-novo transcriptome recognition system Cufflinks (sensitivity 49.9%, specificity 49.5%).
Poster I17
BioTracks: A genome browser extension plugin for the BioGPS gene portal system

Sergey Batalov Novartis (GNF)
Short Abstract: BioGPS [1] is an easily extensible and customizable gene portal. BioGPS enables users to easily aggregate data on a gene by gene basis from more than 350 external sources and to personalize their gene report using BioGPS layouts.

Much of the information for a gene is better understood with genomic context taken into account. To bridge gene-centric and genome-centric information, we aimed to create a template for a genomic viewer plugin that would 1) allow visualization of users' data as well as most publicly available genome annotation tracks in a genomic interval, 2) be easy to install and maintain, 3) allow for non-standard visual custom tracks. After evaluating a dozen existing genomic browsers, we found the well-established UCSC Genome Browser the easiest to embed and extend. We have implemented a simple proxy for the UCSC-based content with free-content custom tracks added and served locally. This solution may appeal to small labs, as it doesn’t require the UCSC GB database installation and maintenance.

[1] Wu C, et al. BioGPS: an extensible and customizable portal for querying and organizing gene annotation resources. Genome Biol. 2009;10(11):R130.
Poster I18
Automated curation and assignment of massive DNA sequences

Duong Vu CBS-KNAW Fungal Biodiversity Centre
Vincent Robert (Dr., Bioinformatics group);
Short Abstract: There is an explosion of genomic data being generated. The number of sequence submissions to central databases is growing significantly. However, many of the sequences have been submitted without an identification or with a wrong one. Once the original sequence has been annotated incorrectly, the error can be propagated through the databases. Thus there is a need for automated tools for curation and assignment of DNA sequences. Clustering techniques play an important role in the identification of biological data. Most clustering approaches require a similarity matrix of the sequences. Currently, this is not feasible when dealing with a massive number of DNA sequences, such as millions. In this research, we propose a new approach that can group homologous sequences and is able to handle large sequence databases.
Poster I19
Characterization of Long Noncoding RNAs

Andrea Tanzer Centre de Regulacio Genomica
Thomas Derrien (Université de Rennes1, Génétique et Développement); Roderic Guigó (Centre de Regulacio Genomica, Bioinformatics and Genomics);
Short Abstract: Intergenic regions, often described as 'the dark matter of the genome', turned out to be populated by numerous noncoding RNAs. Long noncoding RNAs represent one such class of functional RNAs. They are more than 200nt in length and share several features with mRNAs (e.g. PolyA+, intron-exon structure), but do not encode proteins. Individual members have been characterized to some extent; however, our understanding of fundamental principles such as transcriptional regulation, functional domains, and evolutionary conservation is still rudimentary.

We present a comprehensive analysis of basic characteristics of lncRNAs based on the GENCODE gene annotation. Unlike other resources, GENCODE explicitly attempts to include novel models for non-protein coding genes and thus provides a rich source of about 10,000 annotated long noncoding transcripts (GENCODE v3c). Based on expression data (NGS and microarrays), chromatin marks, and sequence conservation we characterize and in part subdivide this most likely heterogeneous class of RNAs.

Promoter regions of lncRNAs are well conserved and chromatin marks at transcriptional start sites (TSS) show patterns similar to those of protein-coding genes. In addition, many lncRNAs display differential tissue expression. Expression levels, however, are often much lower than those observed for protein coding genes.

Numerous lncRNAs overlap with small ncRNAs, like miRNAs and snoRNAs. In comparison to protein coding genes, lncRNAs are enriched for small ncRNAs lacking their own promoters. Analyzing co-expression of such pairs suggests that a subclass of lncRNAs serves as host genes providing a transcriptionally active environment required for the expression of small regulatory RNAs.
Poster I20
Annotating High Confidence Consensus Coding Sequences (CCDS) and Gencode Processed Pseudogenes

Rachel Harte University of California Santa Cruz
Mark Diekhans (University of California Santa Cruz, Center for Biomolecular Science and Engineering); Robert Baertsch (University of California Santa Cruz, Center for Biomolecular Science and Engineering); Kim Pruitt (National Institutes of Health , National Center for Biotechnology Information (NCBI)); Catherine Farrell (National Institutes of Health , National Center for Biotechnology Information (NCBI)); Craig Wallin (National Institutes of Health , National Center for Biotechnology Information (NCBI)); Jennifer Harrow (Wellcome Trust Sanger Institute, Vertebrate Genome Analysis); Jane Loveland (Wellcome Trust Sanger Institute, Vertebrate Genome Analysis); Steve Searle (Wellcome Trust Sanger Institute, Vertebrate Genome Analysis); Bronwen Aken (Wellcome Trust Sanger Institute, Vertebrate Genome Analysis); Daniel Barrell (Wellcome Trust Sanger Institute, Vertebrate Genome Analysis); Suganthi Balasubramanian (Yale University, Department of Molecular Biophysics and Biochemistry); Mark Gerstein (Yale University, Department of Molecular Biophysics and Biochemistry); David Haussler (University of California Santa Cruz, Center for Biomolecular Science and Engineering); Tim Hubbard (Wellcome Trust Sanger Institute, Vertebrate Genome Analysis);
Short Abstract: The CCDS project, a collaboration of the RefSeq group at the National Center for Biotechnology Information (NCBI), the Wellcome Trust Sanger Institute (WTSI) and the University of California, Santa Cruz (UCSC), was initiated to identify identical protein-coding regions based on the genomic annotations of the RefSeq group and the Ensembl and HAVANA groups. Distinguishing between trustworthy and erroneous data and correctly identifying functional protein-coding genes are vital to providing annotations of high confidence. Periodic automated builds are performed. Between updates, each of the three groups must agree to any changes proposed for the underlying manual annotations from RefSeq or HAVANA; otherwise the CCDS is withdrawn. To ensure annotation consistency, an important aspect of CCDS curation, the three groups agreed upon a set of guidelines for choosing an appropriate translation start site and for identifying potential nonsense-mediated decay candidates. An update of the CCDS project status and curation examples are described.
UCSC performs quality control for the CCDS builds. This involves both the assessment of CDS regions for protein-coding potential and the removal of putative pseudogenes predicted by UCSC’s RetroFinder pipeline.
The Gencode project belongs to the ENCyclopedia Of DNA Elements (ENCODE) consortium (data repository at UCSC); it incorporates annotation on the human genome from both HAVANA and Ensembl so CCDS regions are included in this gene set. Gencode includes a high confidence set of processed pseudogenes based on overlap among RetroFinder and Yale PseudoPipe predictions and HAVANA manual annotations. Pseudogenes found only by RetroFinder and PseudoPipe are included as predictions.
Poster I21
YGAP: the Yeast automatic Genome Annotation Pipeline

Estelle Proux Wera Trinity College Dublin
David Armisén (Trinity College Dublin, Smurfit Institute of Genetics); Kenneth H. Wolfe (Trinity College Dublin, Smurfit Institute of Genetics);
Short Abstract: Yeasts provide a unique opportunity to explore the mechanism of eukaryotic genome evolution by comparative genomics. Recently, the number of genome sequencing projects has increased drastically, due to the development of next-generation sequencing technologies. However, the annotation of genomes remains a challenge, as it still relies on a largely manual process. Here we present YGAP, the Yeast Genome Annotation Pipeline. YGAP uses information from the Yeast Gene Order Browser (YGOB) to do automatic de novo annotation, based on the homology relationship and the syntenic gene content in other annotated yeast species. Additionally, the pipeline is able to detect probable frameshift sequencing errors and can propose corrections for them. It can also look for the presence of introns, and detect tRNAs and Ty(-like) elements. We tested the pipeline on the genome of Saccharomyces cerevisiae, and compared our results, and those obtained with another annotation program (Augustus) trained on S. cerevisiae, to the curated annotation of the organism in YGOB. YGAP was able to predict more correct genes and fewer incorrect elements than the other method. We also resequenced the genome of Naumovozyma castellii using 454 technology, automatically reannotated it and compared our annotation with the previous manual one. Our pipeline was able to predict most of the expected genes, and improved the quality of the manual annotation. YGAP has a user-friendly interface and is freely available on our lab page.
Poster I22
A genome-wide, functional analysis of transcription initiation and promoter architecture in the budding yeast Saccharomyces cerevisiae

R. Taylor Raborn PhD Candidate/University of Iowa
Michael Grace (University of Alberta, Faculty of Medicine and Dentistry); Robert E. Malone (Professor/University of Iowa, Department of Biology); John M. Logsdon, Jr. (Associate Professor/University of Iowa, Department of Biology);
Short Abstract: High-throughput sequencing of transcriptomes has generated extensive annotations of the constellation of RNAs produced within the cells of numerous eukaryotes. Among these, approaches that identify transcription start sites (TSS) for mRNAs are of particular interest because each TSS provides genomic positional information for a gene’s core promoter. Recent work demonstrates that the positions of TSS for a gene and their boundaries within the promoter are variable and unpredictable. Additionally, the distribution or shape of TSSs produced by a gene can classify its promoter and is associated with specific core promoter motifs. While comprehensive identification of TSSs is complete for many species of eukaryotes, outside of several metazoans little is known about patterns of promoter architecture or its overall functional and regulatory significance. A more complete understanding of eukaryotic promoter architecture will contribute to our understanding of gene regulation and genome evolution. In this work, we analyze TSS in two separate conditions for the budding yeast Saccharomyces cerevisiae. Employing a novel TSS clustering algorithm, we define the major positions of transcription initiation for a large fraction of the genes in the budding yeast genome and characterize promoter architecture using measures of breadth and shape. Our initial findings suggest that canonical ‘broad’ and ‘peaked’ patterns of promoter shape also exist in budding yeast, and report some evidence of a relationship between promoter architecture and gene function. This study also revealed a suite of genes that demonstrate alternative promoter usage during sporulation, the meiotic cycle of budding yeast.
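The abstract does not describe the clustering algorithm itself, so the following is only a generic illustration of the idea of grouping TSS positions and measuring promoter breadth. It uses a simple gap-based (single-linkage) grouping, which is a swapped-in technique and not the authors' novel method; the `max_gap` value, the function name `cluster_tss`, and the example tag counts are all assumptions.

```python
def cluster_tss(tags, max_gap=20):
    """Group TSS positions on one strand into clusters.

    tags: dict mapping genomic position -> number of supporting tags.
    Positions closer than max_gap (an arbitrary value for this sketch)
    to the end of the current cluster are merged into it.
    """
    clusters = []
    for pos in sorted(tags):
        if clusters and pos - clusters[-1]["end"] <= max_gap:
            cluster = clusters[-1]
            cluster["end"] = pos
            cluster["total"] += tags[pos]
            # Track the position with the highest tag count (the mode).
            if tags[pos] > tags[cluster["mode"]]:
                cluster["mode"] = pos
        else:
            clusters.append(
                {"start": pos, "end": pos, "mode": pos, "total": tags[pos]}
            )
    for cluster in clusters:
        # Breadth in bases: a crude proxy for 'peaked' versus 'broad' shape.
        cluster["breadth"] = cluster["end"] - cluster["start"] + 1
    return clusters

# Made-up tag counts: one narrow cluster and one wider, flatter cluster.
tags = {100: 2, 103: 40, 105: 3, 400: 5, 415: 6, 430: 4}
clusters = cluster_tss(tags)
```

In this toy input the first cluster is narrow and dominated by a single position, while the second spreads its tags over a wider interval, which is the kind of contrast the 'peaked' and 'broad' labels refer to.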
Poster I23
Human transcriptome data in Ensembl

Thibaut Hourlier Wellcome Trust Sanger
Jan Vogel (Wellcome Trust Sanger, Ensembl); Amy Tang (Wellcome Trust Sanger, Ensembl); Amonida Zadissa (Wellcome Trust Sanger, Ensembl); Bronwen Aken (Wellcome Trust Sanger, Ensembl); Daniel Barrell (Wellcome Trust Sanger, Ensembl); Magali Ruffier (Wellcome Trust Sanger, Ensembl); Susan Fairley (Wellcome Trust Sanger, Ensembl); Steve Searle (Wellcome Trust Sanger, Ensembl); Tim Hubbard (Wellcome Trust Sanger, Ensembl);
Short Abstract: New sequencing technologies are creating large transcriptome datasets for many species. Here we present an overview of how we have used short read transcriptome data for the de-novo construction of tissue-specific gene models for human.

We have used RNA-Seq data from the Illumina BodyMap 2.0 project to generate transcript models for 16 tissues including adrenal, adipose, brain, breast, colon, heart, kidney, liver, lung, lymph, ovary, prostate, skeletal muscle, testes, thyroid, and white blood cells.

To generate these data, single and paired-end short reads were aligned to the genome to determine the transcribed exon regions (alignment blocks). Mate-pair information was used to group the alignment blocks into approximate transcript structures called proto-transcripts. The short reads were mapped a second time, using a splice-aware alignment model, this time to the proto-transcripts. Spliced reads that spanned introns of the proto-transcript were used to identify the boundaries for introns and generate possible transcript variants for a gene.
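The intron identification step can be illustrated with a small sketch that counts how many spliced reads support each intron. The representation of a spliced alignment as a list of half-open `(start, end)` blocks, the function name `intron_support`, and the coordinates are assumptions made for this example; it is not the Ensembl pipeline code.

```python
from collections import Counter

def intron_support(spliced_reads):
    """Count reads supporting each intron.

    spliced_reads: list of alignments, each a list of (start, end) blocks
    in half-open coordinates (a hypothetical layout for this sketch).
    The gap between two consecutive blocks of one read is taken as an intron.
    """
    counts = Counter()
    for blocks in spliced_reads:
        ordered = sorted(blocks)
        for (_, left_end), (right_start, _) in zip(ordered, ordered[1:]):
            if right_start > left_end:
                counts[(left_end, right_start)] += 1
    return counts

# Made-up reads: two support the intron 150-300, one supports 150-400.
reads = [
    [(100, 150), (300, 326)],
    [(120, 150), (300, 346)],
    [(130, 150), (400, 446)],
]
support = intron_support(reads)
```

Counts of this kind correspond to the amount of read support per intron, and competing introns that share a boundary (here both start at 150) hint at alternative splicing events.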

In Ensembl release 62, users can visualise results from all 16 tissues by selecting them via the configuration panel. Each tissue has two tracks. The first track displays the transcript variant with the most short read support. The second track displays all reads that splice across an intron, and therefore all alternate splicing events for each tissue. The height of each intron alignment block indicates the amount of read support for that intron.

BAM and BigWig files can be uploaded by users and viewed alongside this and other Ensembl-generated annotation.
Poster I24
Extending Metagenomic Functional Annotation Through Functional Inference

Gregory Vey University of Waterloo
Gabriel Moreno-Hagelsieb (Wilfrid Laurier University, Biology);
Short Abstract: Here we show that we can predict networks of functional interactions that help expand the functional annotations of prokaryotic metagenomic sequences, and help find both new genes working in already known functional categories, as well as groups of genes probably working in yet-to-be-described functions. With the explosion in metagenomics projects, and their abundance in unknown genes, it becomes important to devise and test methods for annotating genes beyond what is possible based on direct homology alone. Predicted operons have provided large networks of functional interactions that can complement functional annotations of unknown genes in genome sequences. Here we test the extension of homology-based functional annotation attainable using predicted operons in metagenomes. Our results show that predicted operons based on intergenic distances can increase the number of genes annotated with a potential function by up to 31% compared to using homology-based annotation methods alone. We conclude with a demonstration of practicality by functionally annotating novel cellulase candidates within metagenomic data.
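Operon prediction from intergenic distances can be sketched as below: adjacent genes on the same strand are grouped when the gap between them is small. The 50 bp threshold, the gene tuple layout, and the function name `predict_operons` are assumptions for illustration; the abstract does not state the distance model or cutoff that was actually used.

```python
def predict_operons(genes, max_gap=50):
    """Group adjacent same-strand genes separated by short intergenic gaps.

    genes: list of (name, start, end, strand) on a single contig.
    max_gap: maximum intergenic distance in bp (an arbitrary value here).
    Returns a list of gene-name lists; single genes form their own group.
    """
    operons = []
    current = []
    for gene in sorted(genes, key=lambda g: g[1]):
        if current:
            previous = current[-1]
            same_strand = previous[3] == gene[3]
            gap = gene[1] - previous[2]
            if same_strand and gap <= max_gap:
                current.append(gene)
                continue
            operons.append([g[0] for g in current])
        current = [gene]
    if current:
        operons.append([g[0] for g in current])
    return operons

# Made-up genes: a and b are close on the same strand; c is far; d is on the other strand.
genes = [
    ("a", 1, 900, "+"),
    ("b", 920, 1500, "+"),
    ("c", 1900, 2500, "+"),
    ("d", 2510, 3000, "-"),
]
operons = predict_operons(genes)
```

An unannotated gene that falls in the same predicted operon as an annotated gene can then be assigned a potential function, which is the kind of extension beyond direct homology that the abstract describes.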
Poster I25
The Reconciliation of Data and Function

Jamie MacPherson University of Manchester
Ryan Ames (University of Manchester, Faculty of Life Sciences); John Pinney (Imperial College London, Theoretical Systems Biology); Simon Lovell (University of Manchester, Faculty of Life Sciences); David Robertson (University of Manchester, Faculty of Life Sciences);
Short Abstract: Introduction
The aim of systems biology is to model processes using experimentally derived data to gain understanding of cellular function. Saccharomyces cerevisiae is the best-studied model of the eukaryotic cell. Clustering approaches are frequently used in an attempt to identify functional units from interaction networks. Likewise, genetic interactions are often studied, both in terms of network properties and biological annotation. In this work we integrate a variety of network and functional data and establish what functional subsystems can be captured from biological networks using clustering.

Three interaction networks were constructed using: (i) protein, (ii) genetic and (iii) gene coregulation interaction data. In addition, a combined network was created by integrating interactions from these networks. Each network was exhaustively clustered using graph partitioning. Biological functions represented by each cluster were identified using Gene Ontology (GO) annotation. A Voronoi tree mapping method was used to visualise annotation coverage by cluster sets. Our results show striking differences in both the ability of different networks to capture certain areas of biological function and also total functional space that is covered by each network.

Our findings indicate that each interaction network specialises in describing different areas of biological function. In addition, we identify transcendent functions that require integrated information in order to be accurately identified. We conclude that capture of biological function by network clustering is heavily influenced by the choice of biological interaction data set and that integration of interactions from multiple sources is essential to attain a global capture of biological function.

Attention Poster Authors: The ideal poster size should be max. 1.30 m (130 cm) high x 0.90 m (90 cm) wide. Fasteners (Velcro / double sided tape) will be provided at the site; please DO NOT bring tape, tacks or pins.

Posters Display Schedule:

Odd Numbered posters:
  • Set-up timeframe: Sunday, July 17, 7:30 a.m. - 10:00 a.m.
  • Author poster presentations: Monday, July 18, 12:40 p.m. - 2:30 p.m.
  • Removal timeframe: Monday, July 18, 2:30 p.m. - 3:30 p.m.*
Even Numbered posters:
  • Set-up timeframe: Monday, July 18, 3:30 p.m. - 4:30 p.m.
  • Author poster presentations: Tuesday, July 19, 12:40 p.m. - 2:30 p.m.
  • Removal timeframe: Tuesday, July 19, 2:30 p.m. - 4:00 p.m.*
* Posters that are not removed by the designated time may be taken down by the organizers and discarded. Please be sure to remove your poster within the stated timeframe.

Delegate Posters Viewing Schedule

Odd Numbered posters:
On display Sunday, July 17, 10:00 a.m. through Monday, July 18, 2:30 p.m.
Author presentations will take place Monday, July 18: 12:40 p.m.-2:30 p.m.

Even Numbered posters:
On display Monday, July 18, 4:30 p.m. through Tuesday, July 19, 2:30 p.m.
Author presentations will take place Tuesday, July 19: 12:40 p.m.-2:30 p.m.

Want to print a poster in Vienna? Try these options:

Repacopy, next to the congress venue

Also at Karlsplatz, in the Ring Center, Kärntner Str. 42

If you need your poster on a thicker material, you may also use a plotter service next to Karlsplatz: http://schiessling.at/portfolio/
