Poster numbers will be assigned May 30th.
If you cannot find your poster below, that probably means you have not yet confirmed that you will be attending ISMB/ECCB 2015. To confirm your poster, find the poster acceptance email; it contains a confirmation link. Click on it and follow the instructions.

If you need further assistance please contact and provide your poster title or submission ID.

Category F - 'Genome Organization and Annotation'
F01 - Assessing the limits of restraint-based 3D modeling of genomes and genomic domains
Marie Jeanne Trussart, Center for Genomic Regulation, Spain
François Serra, Gene Regulation. Stem Cells and Cancer Program. Center for Genomic Regulation (CRG), Barcelona. Spain; Genome Biology Group. Centre Nacional d'Anàlisi Genòmica (CNAG), Barcelona. Spain, Spain
Davide Baù, Gene Regulation. Stem Cells and Cancer Program. Center for Genomic Regulation (CRG), Barcelona. Spain; Genome Biology Group. Centre Nacional d'Anàlisi Genòmica (CNAG), Barcelona. Spain, Spain
Ivan Junier, Gene Regulation. Stem Cells and Cancer Program. Center for Genomic Regulation (CRG), Barcelona. Spain, Spain
Luis Serrano, EMBL/CRG Systems Biology Research Unit. Centre for Genomic Regulation (CRG), Barcelona. Spain ; Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona. Spain, Spain
Marc A. Marti-Renom, Gene Regulation. Stem Cells and Cancer Program. Center for Genomic Regulation (CRG), Barcelona. Spain ; Genome Biology Group. Centre Nacional d'Anàlisi Genòmica (CNAG), Barcelona. Spain ; Institució, Spain
Short Abstract: Restraint-based modeling of genome and genomic domains has been recently explored with the advent of high-throughput Chromosome Conformation Capture (3C-based) experiments. We previously developed a mean-field restraint-based reconstruction method to resolve the 3D architecture of both prokaryotic and eukaryotic genomes using 3C-based data with the Integrative Modeling Platform (IMP). These models were congruent with fluorescent imaging validation. However, the limits of methods for reconstructing 3D genomes have not been systematically assessed. Here we propose the first generic evaluation of a mean-field restraint-based reconstruction of genomes and genomic domains by considering diverse chromosome architectures and different levels of data noise and structural variability. The results of our analysis show that: first, current scoring functions for 3D reconstruction correlate with the final accuracy of the models; second, reconstructed models are robust to experimental noise but sensitive to the structural variability in the sample; third, the local structure organization of genomes, such as the existence of Topologically Associating Domains, results in more accurate models; fourth, to a certain extent, the spatially restrained models capture the intrinsic structural variability in the input matrices; and fifth, the accuracy of the models can be a priori predicted by analyzing the properties of the input interaction matrices. In summary, our work provides a systematic analysis of the limitations of a mean-field restraint-based method for reconstructing 3D genomes, which could be taken into consideration in further development of methods as well as their applications.
F02 - Combinatorial identification of broad association regions using NGS data
Jieun Jeong, University of Pennsylvania School of Medicine, United States
Mudit Gupta, University of Pennsylvania School of Medicine, United States
Andrey Poleshko, University of Pennsylvania School of Medicine, United States
Jonathan Epstein, University of Pennsylvania School of Medicine, United States
Short Abstract: This poster is based on Proceedings Submission 21.
Motivation: Differentiation of cells into different cell types involves many types of modifications of the chromatin, and mapping of those modifications is a key computational task as researchers uncover different aspects of that process. Modifications associated with the formation of heterochromatin pose new challenges in this context because we need to define very broad regions that have only moderately stronger signal than the remainder of the chromatin. LADs, lamin associated domains, are a prime example of such regions.
Results: We present CIBAR (Combinatorial Identification of Broad Association Regions), a new method to identify this type of broad region. CIBAR is based on an efficient solution to a natural combinatorial problem, adapts to widely variable yields of NGS reads (ChIP and input), and performs competitively with previous methods, including DamID, which was used in many publications on LADs but cannot be applied in most in vivo situations.
F03 - Revising human protein coding gene numbers
Alfonso Valencia, Spanish National Cancer Research Centre (CNIO), Spain
Michael Tress, Spanish National Cancer Research Centre (CNIO), Spain
David Juan, Spanish National Cancer Research Centre (CNIO), Spain
Iakes Ezkurdia, Centro Nacional de Investigaciones Cardiovasculares, CNIC, Spain
Jesus Vazquez, Centro Nacional de Investigaciones Cardiovasculares, CNIC, Spain
Adam Frankish, Wellcome Trust Sanger Institute, United Kingdom
Jennifer Harrow, Wellcome Trust Sanger Institute, United Kingdom
Mark Diekhans, University of California Santa Cruz (UCSC), United States
Jose Manuel Rodriguez, Spanish National Cancer Research Centre (CNIO), Spain
Short Abstract: In this paper we mapped peptides from 7 large-scale proteomics studies to protein coding genes from the human genome. While we identified peptides for more than 96% of genes that evolved before bilateria, we did not find peptides for primate-specific genes, for genes without protein-like features or for genes with poor cross-species conservation. We described a set of 2,001 genes that were potentially non-coding based on features such as weak conservation, a lack of protein features, or ambiguous annotations from major databases, all of which correlated with low peptide detection across the seven experiments. We show that many of these genes behave more like non-coding genes than protein-coding genes, and suggest that many may not code for proteins. Their inclusion in the human protein coding gene catalogue is being revised as part of the ongoing human genome annotation effort.
F04 - The future of reference genome assemblies
James Torrance, Sanger Institute, United Kingdom
Kerstin Howe, Sanger Institute, United Kingdom
Short Abstract: The Genome Reference Consortium (GRC) is the international collaboration responsible for continually improving the quality of the assemblies of the human, mouse, and zebrafish reference genomes that are deposited with the INSDC. The GRC is also responsible for improving these assemblies by closing remaining gaps, correcting sequencing errors, and better representing variation. The GRC is expanding its coverage, with new species as well as additional strains and subspecies being taken under the GRC umbrella due to the demand for high quality reference genomes.

We are constantly expanding our tool kit for identifying and addressing assembly issues, making use of upcoming new sequencing and mapping technologies. Long-read sequencing data (such as that produced by Pacific Biosciences instruments) allows us to improve the scaffolding of existing assemblies; optical mapping data generated on different commercial platforms is extremely valuable for identifying large-scale assembly issues.

We are sharing our genome annotation via a Track Hub, which can be used to display this information in various genome browsers. We will continue to release improved assemblies both as major releases that alter coordinates and as minor patch updates, and we provide "analysis sets" intended for convenient use with read alignment software. As well as improving the reference sequence, the GRC is working with the Global Alliance for Genomics and Health to coordinate future development of tools and data formats, including graph-based representations of genome variation.
F05 - Identification of small deletions within human exons in Asian and European genomes using transcriptome data.
Gabriel Wajnberg, Fiocruz, Brazil
Nicole de Miranda Scherer, Bioinformatic Unit, Clinical Research Coordination, Instituto Nacional de Câncer (INCA), Brazil
Carlos Gil Ferreira, Clinical Research Coordination, Instituto Nacional de Câncer (INCA), Brazil
Fabio Passetti, Laboratory of Functional Genomics and Bioinformatics, Oswaldo Cruz Institute, FIOCRUZ, Brazil
Short Abstract: Insertions and deletions (INDEL) are examples of alterations in the DNA sequence. If one INDEL is located within the coding region, it can produce transcripts with modifications in splice sites, the encoded amino acids or frame shift. The 1000 genomes project can be used as source of data to search for INDELs in different populations. We developed an innovative method to search for small deletions up to 99 nucleotides in length: usage of transcriptome data to identify deletions within human coding exons. Here, we present preliminary data from the analysis of 12 genomes from the 1000 genomes: 6 East Asian Ancestry (EAS) and 6 European Ancestry (EUR). A total of 232,973 small deletions were identified and 138,452 may cause frameshift. We detected deletions previously identified only by the 1000 genomes project (N=57), only annotated in the dbSNP (N=67) and by both approaches (N=5). For example, we detected previously annotated deletions in the dbSNP in the following human genes: DHFR (rs144629981) in EUR, TIFA (rs5861095) in EAS and DNAI2 (rs140867882) in both EUR and EAS. We were also able to detect novel unannotated small deletions: 93 in all 6 EAS genomes and 679 in all EUR genomes. In conclusion, we present preliminary data in which we used transcriptome data to identify small deletions previously described in the dbSNP and the 1000 genomes project, and to detect novel unannotated deletions in human genes. Financial support: INCA/MS, FIOCRUZ, CAPES, Fundação do Câncer, FAPERJ and CNPq.
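The frameshift count in the abstract above rests on a simple rule: a deletion lying entirely within a coding exon shifts the reading frame when its length is not a multiple of three. A minimal sketch of that rule follows; the function name and the example lengths are hypothetical and are not taken from the poster.

```python
def causes_frameshift(deletion_length: int) -> bool:
    """Return True if a coding-exon deletion of this length shifts the reading frame.

    Assumes the deletion lies entirely within the coding sequence.
    """
    if deletion_length <= 0:
        raise ValueError("deletion length must be positive")
    return deletion_length % 3 != 0


# Illustrative lengths only (not data from the poster); the method covers up to 99 nt.
example_lengths = [1, 3, 4, 99]
frameshifting = [n for n in example_lengths if causes_frameshift(n)]
print(frameshifting)  # [1, 4]
```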
F06 - Hierarchical organization of chromosome folding during mammalian cell differentiation
Markus Schueler, Berlin Institute for Medical Systems Biology - MDC Berlin, Germany
James Fraser, Department of Biochemistry and Goodman Cancer Centre, McGill University, Canada
Carmelo Ferrai, Berlin Institute for Medical Systems Biology, MDC Berlin, Germany
Andrea Maria Chiariello, Dipartimento di Fisica, Università di Napoli Federico II, and INFN Napoli, Italy
Giovanni Laudanno, Dipartimento di Fisica, Università di Napoli Federico II, and INFN Napoli, Italy
Tiago Rito, Berlin Institute for Medical Systems Biology, MDC Berlin, Germany
Mariano Barbieri, Berlin Institute for Medical Systems Biology, MDC Berlin, Germany
Stuart Aitken, MRC IGMM, University of Edinburgh, United Kingdom
Kelly J Morris, Berlin Institute for Medical Systems Biology, MDC Berlin, Germany
Masayoshi Itoh, RIKEN Preventive Medicine and Diagnosis Innovation Program, Japan
Hideya Kawaji, RIKEN Preventive Medicine and Diagnosis Innovation Program, Japan
Ines Jaeger, MRC Clinical Sciences Centre, Imperial College London, United Kingdom
Yoshihide Hayashizaki, RIKEN Preventive Medicine and Diagnosis Innovation Program, Japan
Piero Carninci, RIKEN Center for Life Science Technologies, Division of Genomic Technologies, Japan
Alistair R. R. Forrest, RIKEN Center for Life Science Technologies, Division of Genomic Technologies, Japan
Colin Semple, MRC IGMM, University of Edinburgh, United Kingdom
Josée Dostie, Department of Biochemistry and Goodman Cancer Centre, McGill University, Canada
Mario Nicodemi, Dipartimento di Fisica, Università di Napoli Federico II, and INFN Napoli, Italy
Ana Pombo, Berlin Institute for Medical Systems Biology, MDC Berlin, Germany
Short Abstract: Chromosomes have a complex spatial organization within the cell nucleus. They are folded into an array of megabase-sized regions, known as topologically associated domains (TADs), marked by locally enriched chromatin interactions. Multiple studies have now established that TADs have highly significant internal sub-structures, but it remains elusive whether higher-order levels of TAD organization exist and how they might be established. Here, we investigate interactions between TADs and find that, far from being isolated structures, they form a functional hierarchy of domains-within-domains (metaTADs), which extends across genomic scales up to entire chromosomes. We map chromatin contacts with Hi-C along a differentiation time-course from proliferating murine embryonic stem cells, through neuronal precursor cells, to terminally differentiated neurons. We find that TAD-TAD interactions generate a hierarchical structure (metaTAD trees) that is independent of cell type, reflecting a general organizational principle of the mammalian genome. We explore the hierarchy in more detail and find that metaTAD trees correlate with genetic, epigenetic and expression features. We also find that changes in transcriptional state measured by CAGE relate to changes in the tree architecture, highlighting a functional role for hierarchical chromatin organization far beyond simple packing of chromosomes.
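One way to picture a tree of domains-within-domains is to merge neighbouring domains repeatedly, always joining the adjacent pair with the strongest contact. The sketch below does this with average inter-domain contact as the merge score. The abstract does not describe the construction procedure, so the linkage rule, the function name and the toy matrix are all assumptions made for illustration.

```python
def build_domain_tree(contacts):
    """Build a binary tree over linearly ordered domains.

    contacts[i][j] is a symmetric contact score between domains i and j.
    Only adjacent clusters may merge, so every subtree spans a contiguous
    genomic interval. Average linkage is an assumption of this sketch.
    """
    # Each cluster is (subtree, member domain indices).
    clusters = [(i, [i]) for i in range(len(contacts))]
    while len(clusters) > 1:
        best_k, best_score = 0, float("-inf")
        for k in range(len(clusters) - 1):
            left, right = clusters[k][1], clusters[k + 1][1]
            total = sum(contacts[a][b] for a in left for b in right)
            score = total / (len(left) * len(right))
            if score > best_score:
                best_k, best_score = k, score
        (left_tree, left_members) = clusters[best_k]
        (right_tree, right_members) = clusters[best_k + 1]
        merged = ((left_tree, right_tree), left_members + right_members)
        clusters[best_k:best_k + 2] = [merged]
    return clusters[0][0]


# Toy matrix (hypothetical): domains 0-1 and 2-3 interact strongly.
toy_contacts = [
    [0, 9, 1, 1],
    [9, 0, 2, 1],
    [1, 2, 0, 8],
    [1, 1, 8, 0],
]
tree = build_domain_tree(toy_contacts)
print(tree)  # ((0, 1), (2, 3))
```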
F07 - Simulation-based benchmarking of 4C-seq analysis strategies
Carolin Walter, University of Muenster, Germany
Daniel Schützmann, University of Muenster, Germany
Frank Rosenbauer, University of Muenster, Germany
Martin Dugas, University of Muenster, Germany
Short Abstract: Circular chromosome conformation capture combined with high-throughput sequencing (4C-seq) allows valuable insight into the three-dimensional structure of the genome, but interpreting the raw signal data is no trivial task. A number of strategies have been published, but comparing the performance of these algorithms or optimizing their parameters is a challenge in itself. Simulated 4C-seq data with a specified ground truth offers a possible basis for benchmarking, with less need for biologically validated interactions.
We present a simulator for 4C-seq data that is based on read distributions of published data and can generate the typical data structure of a 4C-seq experiment, e.g. the characteristic overrepresentation of the viewpoint region. The addition of variable levels of background noise and certain forms of bias is possible. Fragment characteristics and expected variations in signal strength are respected, and different interaction structures supported. Alternatively, our package can insert sets of simulated interactions into published 4C-seq data.
We use both simulated 4C-seq data and published data sets to evaluate current 4C-seq algorithms with different parameter settings in terms of sensitivity and specificity, and consider the results of 4C-seq specific quality metrics. Due to the expected difference in signal strength, the viewpoint region, the viewpoint chromosome and the remainder of the genome are analysed separately. Regions with many non-unique or short fragments are of special interest, since they provide challenges for the alignment and analysis of the data.
Our benchmarking shows clear distinctions between the algorithms and underlines the importance of filter steps and sensible parameter choices.
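With a simulated ground truth, sensitivity and specificity can be computed directly from sets of fragments. The following is a minimal sketch under that assumption. The function name and the toy fragment identifiers are hypothetical, and the poster's own evaluation may define the units of comparison differently.

```python
def sensitivity_specificity(called, truth, universe):
    """Compare called interacting fragments with simulated ground truth.

    called, truth: sets of fragment identifiers
    universe: set of all fragments that were evaluated
    """
    called = called & universe
    truth = truth & universe
    tp = len(called & truth)
    fn = len(truth - called)
    fp = len(called - truth)
    tn = len(universe - called - truth)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy example: 10 fragments, 4 truly interacting, 3 called by an algorithm.
all_fragments = set(range(10))
true_interactions = {1, 2, 3, 4}
called_interactions = {3, 4, 5}
sens, spec = sensitivity_specificity(called_interactions, true_interactions, all_fragments)
print(sens, spec)
```

Running the same function separately on the viewpoint region, the viewpoint chromosome and the remainder of the genome would mirror the stratified analysis described in the abstract.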
F08 - Novel brain-specific miRNA discovery using small RNA sequencing in post-mortem human brain
Christian Wake, Boston University, United States
Adam Labadorf, Boston University, United States
Alexandra Dumitriu, Boston University, United States
Andrew Hoss, Boston University, United States
Richard Myers, Boston University, United States
Short Abstract: MicroRNAs (miRNA) are short non-coding RNAs that regulate gene expression mainly through translational repression of target mRNA molecules. More than 2700 human miRNAs have been identified and some are known to display tissue-specific patterns of expression. Here, we use high-throughput small RNA sequencing to discover novel and possibly brain-specific miRNAs in 94 human post-mortem prefrontal cortex samples from patients with Huntington's disease and Parkinson's disease and normal neuropathology. Using a custom analysis pipeline, we identified 66 novel miRNA candidates that originate in both intergenic and intragenic regions of the genome. 21 of the candidate miRNAs show sequence similarity with known mature miRNA sequences and may be novel members of known miRNA families, while the remaining 45 may constitute previously undiscovered families of miRNAs that are specific to the brain. In a small number of these novel miRNAs, preliminary differential expression analysis between neurodegenerative disease and normal samples identified differences in expression. These results suggest that a portion of these novel miRNAs may not only be unique to brain, but may have a role in the neurodegenerative disease processes.
F09 - An integrated pipeline and monitoring system for de novo genome analysis
Junhyung Park, Insilicogen, Inc., Korea, Rep
SeungJae Noh, Insilicogen, Inc., Korea, Rep
Kyuyeol Lee, Insilicogen, Inc., Korea, Rep
Yeonkyung Kang, Insilicogen, Inc., Korea, Rep
Myunghee Jung, Insilicogen, Inc., Korea, Rep
Short Abstract: De novo genome assembly projects have been performed using NGS sequencers. Assembling raw sequence data into a draft genome remains a complex, multi-step process that includes sequence data cleaning, error correction, assembly and quality control. Despite many efforts, the problem of genome assembly is still not solved. Analyzing this data is difficult, in part because assembly algorithms have many parameters that are not easily optimized. We present an assembly pipeline that simplifies the entire genome assembly process by automating these steps and by integrating several previously published algorithms with new algorithms for quality control and automated assembly. The genome assembly pipeline includes three main steps: 1) pre-processing, 2) building contigs, and 3) scaffolding contigs. Pre-processing consists of read error correction; for this step we use tools from the FASTQC package and clc_trimmer to correct reads for sequencing errors. In the contig-building step, error-corrected reads are assembled into contigs using assembler programs such as clc_assembler, ALLPATHS-LG and MIRA. After contig building, paired/mated libraries can be used to scaffold contigs together with the scaffolder SSPACE. We demonstrate that the pipeline produces quality assemblies without any prior knowledge of the particular genome and without extensive parameter tuning. Several plant and insect draft genomes were annotated using this integrated pipeline. Our pipeline will assist researchers in selecting a well-suited assembler and offer essential information. This work was carried out with the support of "Cooperative Research Program for Agriculture Science & Technology Development (PJ010343)" Rural Development Administration, Republic of Korea.
F10 - Computationally efficient approach for novel transcript discovery across large RNA-seq dataset reveals glioblastoma-associated lncRNAs
Maria Laaksonen, BioMediTech, University of Tampere, Finland
Antti Ylipää, 1) BioMediTech, University of Tampere 2) Department of Signal Processing, Tampere University of Technology, Finland
Janne Seppälä, BioMediTech, University of Tampere, Finland
Tommi Rantapero, BioMediTech, University of Tampere, Finland
Kirsi Granberg, 1) BioMediTech, University of Tampere 2) Department of Signal Processing, Tampere University of Technology, Finland
Matti Nykter, BioMediTech, University of Tampere, Finland
Short Abstract: Availability of RNA-sequencing data from human tumors and normal tissues has resulted in the discovery of hundreds of tissue-specific transcripts. Uncovering novel transcripts typically requires computationally expensive de novo transcriptome assembly, and combining assemblies across samples has proven challenging. To be able to search for new transcripts in large RNA-seq cohorts, we developed a computational approach that directly identifies unannotated genomic loci that are variably expressed within a sample set, or differentially expressed between two sample sets. These loci are then subjected to gene structure analysis, allowing identification of full transcript structures in a data-driven manner. Our approach was validated by re-discovering a set of well-annotated genes. We were able to correctly re-build known gene structures and identify the typical structural features of protein coding genes even when only a single exon of the gene was given as input.

We applied our approach to RNA-seq data of 169 primary glioblastoma samples from The Cancer Genome Atlas (TCGA). We identified 53 unannotated transcripts that did not contain good quality open reading frames, indicating that they were lncRNAs. The expression of 20 out of 22 high confidence lncRNAs was validated by PCR in at least one glioblastoma cell line. Clinical association analyses in the TCGA glioma cohort revealed that a subset of lncRNA expression profiles associates with patient survival, tumor grade and/or IDH1 mutation status. The functional analysis of lncRNA knockdowns was performed in glioblastoma cells to evaluate their significance in disease aggressiveness.
F11 - ContiBAIT: An R Package for Genome Finishing Using Strand-seq
Kieran O’Neill, British Columbia Cancer Agency, Canada
Mark Hills, British Columbia Cancer Agency, Canada
Peter Lansdorp, British Columbia Cancer Agency, Canada
Ryan Brinkman, British Columbia Cancer Agency, Canada
Short Abstract: Strand-seq is a method for directional, low-coverage sequencing of DNA
template strands in single cells. Taken together, strand-seq data from
cells from the same organism provide genomic distance information.
This can be used to improve the quality of early-build reference
genomes made up of many contigs with no bridging sequence, firstly by
grouping contigs from the same chromosome together, and secondly by
ordering contigs within chromosomes. We present ContiBAIT, an R
package for performing these tasks.

For grouping contigs into chromosomes, ContiBAIT uses a custom
clustering method based on a Chinese restaurant process. Contigs are
then reoriented using a greedy algorithm which optimises for global
inter-contig distance. Contig groups showing close strand similarity
following reorientation are merged.

For ordering contigs within a putative chromosome, ContiBAIT computes
the strand distance between all pairs of contigs. The problem then
becomes one of finding the lowest-weight Hamiltonian path over the
contigs, which can be reformulated into a travelling salesman problem.
ContiBAIT then finds the best ordering of contigs using the TSP

To validate contig clustering, we applied ContiBAIT to an early build
of the mouse genome (mm2), with coordinates lifted over to mm10.
ContiBAIT was able to assign most contigs with sufficient read depth
for strand-seq analysis to the correct chromosome (median

To validate contig ordering, we applied ContiBAIT to artificial
contigs sampled from mm10, of sizes 1MB, 500kB and 250kB. Some
chromosomes were well-ordered (Pearson's rho=0.99), while others had
large sections locally well-ordered but incorrectly ordered relative
to each other.
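The contig-ordering step described above turns a lowest-weight Hamiltonian path problem into a travelling salesman problem. One standard reduction adds a dummy node at distance zero from every contig, solves the closed tour, and cuts the tour at the dummy node. The sketch below illustrates that reduction with a brute-force search, which is only feasible for a handful of contigs. It is not ContiBAIT's implementation: the function name is hypothetical and the toy distances stand in for strand distances.

```python
from itertools import permutations


def shortest_hamiltonian_path(dist):
    """Lowest-weight Hamiltonian path via a TSP tour through a dummy node.

    dist: symmetric matrix of pairwise distances between contigs.
    Brute force over all orderings; suitable for tiny examples only.
    """
    n = len(dist)
    dummy = n
    # Augment the matrix with a dummy node at distance 0 from every contig.
    aug = [list(row) + [0.0] for row in dist] + [[0.0] * (n + 1)]
    best_cost, best_order = float("inf"), None
    for perm in permutations(range(n)):
        tour = (dummy,) + perm  # the tour starts and ends at the dummy node
        cost = sum(aug[tour[i]][tour[(i + 1) % (n + 1)]] for i in range(n + 1))
        if cost < best_cost:
            best_cost, best_order = cost, perm
    # Cutting the tour at the dummy node leaves an open path over the contigs.
    return list(best_order), best_cost


# Toy example: four contigs at (hypothetical) positions along one chromosome.
positions = [30, 0, 20, 10]
toy_dist = [[abs(a - b) for b in positions] for a in positions]
order, cost = shortest_hamiltonian_path(toy_dist)
print(order, cost)
```

The recovered order lists the contigs by position, up to reversal, because a path has no preferred direction.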
F12 - EuGene-PP: a next-generation automated annotation pipeline for prokaryotic genomes
Erika Sallet, INRA-CNRS, France
Jerome Gouzy, INRA-CNRS, France
Thomas Schiex, INRA, France
Short Abstract: It is now easy and increasingly usual to produce oriented RNA-Seq data as a prokaryotic genome is being sequenced. However, this information is usually just used for expression quantification.

We designed EuGene-PP, a fully automated pipeline for structural annotation of prokaryotic genomes integrating protein similarities, output of existing CDS and ncRNA predictors, intrinsic information provided by coding potential, and any oriented expression information (RNA-Seq or tiling arrays) through a variety of file formats. This enables the prediction of a qualitatively enriched annotation including coding sequences (CDSs), untranslated regions, transcription start sites (TSSs) and (possibly antisense) non-coding genes. It can run using just FASTA genomic sequences and expression data, and has no parameter to tune (by default). Training procedures required for gene finding are performed inside EuGene-PP. The pipeline is able to manage genomes with peculiar replicons (e.g., strong GC% bias compared to the rest of the genome).

EuGene-PP is an open-source software based on the gene finder EuGene integrating a Galaxy configuration. EuGene-PP can be downloaded at
F13 - Translation-dependent classifiers enhance prediction of long non-coding RNAs
Seo-Won Choi, Hanyang Univ., Korea, Rep
Jin-Wu Nam, Hanyang University, Korea, Rep
Short Abstract: Long non-coding RNAs (lncRNAs) are emerging as key regulatory factors in various biological processes, the majority of which are annotated through de novo assembly followed by coding/non-coding classifiers. Despite the ongoing efforts to identify lncRNAs, those strongly associated with scanning ribosomes and recently evolved protein-coding genes are often incorrectly classified. Current lncRNA annotations also encompass incomplete, non-coding contaminants derived from 3’UTR fragments. Here, we developed two novel lncRNA classifiers that distinguish lncRNAs from coding transcripts and 3’UTR fragments by protein associations that occur during translation in vivo. Ribosome-protected fragment signal bias (RPS) aims to detect whether a transcript displays tri-nucleotide periodicity of ribosome footprints, which results from codon base-shifting of ribosomes during translation. We compared RPSs of expressed mRNAs to those of manually curated lncRNAs with Kullback-Leibler divergence and confirmed that RPS is able to re-classify RNAs with strong ribosome association correctly. The second classifier is capable of sorting out misclassified 3’UTR fragments of mRNAs using translation-dependent binding of UPF1, a nonsense-mediated decay mediator, to 3’UTRs. We named this classifier translation indication by UPF1 association (TITAN) and compared the abundance of UPF1 binding to randomly fragmented 3’UTRs of mRNAs with that of lncRNAs. The results clearly indicated that the classifier is capable of filtering out untranslated fragments of coding transcripts. Taken together, the proposed new classifiers enhanced the sensitivity and specificity in separating lncRNAs from protein-coding genes as well as 3’UTR fragments.
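Tri-nucleotide periodicity can be quantified by tallying footprint positions by reading frame and measuring how far the resulting distribution departs from a flat one with Kullback-Leibler divergence. The sketch below is an illustration under stated assumptions, not the authors' RPS definition: the function names, the pseudocount, the uniform reference distribution and the toy positions are all hypothetical.

```python
import math


def frame_distribution(footprint_starts, pseudocount=1.0):
    """Fraction of footprints falling in each of the three reading frames.

    footprint_starts: footprint positions relative to a chosen origin.
    A pseudocount keeps every frame probability strictly positive.
    """
    counts = [pseudocount] * 3
    for pos in footprint_starts:
        counts[pos % 3] += 1
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


uniform = [1 / 3] * 3

# Toy positions (hypothetical): mostly frame 0, as expected for a translated ORF.
periodic = frame_distribution([0, 3, 6, 9, 12, 15, 1, 4])
# Toy positions spread evenly over frames, as expected without translation.
flat = frame_distribution([0, 1, 2, 3, 4, 5])

print(kl_divergence(periodic, uniform))  # clearly above zero
print(kl_divergence(flat, uniform))      # approximately zero
```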
F14 - 3DNOME: 3D NucleOme Multiscale Engine for data-driven modeling of three-dimensional genome architecture
Przemyslaw Szalaj, Medical University of Bialystok, Hasselt University, Poland
Zhonghui Tang, The Jackson Laboratory for Genomic Medicine, United States
Oskar Luo, The Jackson Laboratory for Genomic Medicine, United States
Paul Michalski, The Jackson Laboratory for Genomic Medicine, United States
Yijun Ruan, The Jackson Laboratory for Genomic Medicine, United States
Dariusz Plewczynski, Centre of New Technologies, Warsaw University, Poland
Short Abstract: The human genome is folded into three-dimensional structures. The 3D organization of the genome is thought to facilitate compartmentalization, chromatin organization and spatial interaction of genes and their regulatory elements. The recently developed high-throughput ChIA-PET method allows us to capture the genome-wide map of physical contacts between distal genomic loci.

We present 3DNOME, a multiscale computational engine we developed to model the 3D organization of the genome. Our approach allows us to model chromatin folding at the level of whole chromosomes as well as single topological domains, including modeling of individual chromatin loops and their mutual interactions. In our modeling we consider CTCF (which has long been known to be responsible for chromatin weaving) and RNAPII (which activates gene transcription) interactions. Taken together, these two protein factors provide a comprehensive map of human genome interactions.

We describe how our hierarchical model is constructed and how the structures on every scale are obtained. We also highlight the main advantages of our approach compared to existing methods for genome architecture modeling.
F15 - Causes for bias in 3C-like data
Yannick Spill, Structural Genomics group, Spain
Marc Marti-Renom, Structural Genomics group, Spain
Short Abstract: Chromatin structure determination is a fast evolving field. It recently emerged with the invention of 3C-like experiments, in particular 3C [1] and Hi-C [2].
These experiments make it possible to probe the spatial distance between two genetic loci. Yet, these experiments do not provide the distances themselves, but a contact frequency, which is prone to be biased by a number of genomic factors. It is therefore crucial to de-bias the data for any practical application.

A number of methods have been proposed to de-bias these datasets, but none so far has reached an overall state of acceptance by the community. ICE [3] provides the best treatment, but is limited to whole genomes. It can therefore only be applied to Hi-C data. It also makes a number of assumptions, which we review in detail in this poster. HiCNorm [4] is based on the method by Yaffe and Tanay [5], and its normalization is less widely accepted than that of ICE. It is, however, much more flexible, and can be applied to a large variety of 3C-like datasets. Extending this probabilistic approach, we present a causal graph which represents the sources of bias in 3C-like data. We show the risks associated with using the local GC content to de-bias the data, and propose better-suited metrics.
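The core idea behind iterative correction, as used in ICE, is matrix balancing: each bin's row sum is treated as an estimate of its bias, the matrix is divided by the product of the two bins' biases, and the process repeats until all rows have equal coverage. The sketch below shows only that balancing loop in simplified form. The function name and toy matrix are hypothetical, and real implementations also filter low-coverage bins and handle the diagonal, which is omitted here.

```python
def iterative_correction(contacts, n_iter=200, tol=1e-10):
    """Balance a symmetric contact matrix so that all row sums are equal.

    Returns the corrected matrix and the accumulated per-bin bias.
    Simplified sketch: no bin filtering, no special diagonal handling.
    """
    n = len(contacts)
    m = [[float(v) for v in row] for row in contacts]
    bias = [1.0] * n
    for _ in range(n_iter):
        row_sums = [sum(row) for row in m]
        covered = [s for s in row_sums if s > 0]
        if not covered:
            break
        mean_sum = sum(covered) / len(covered)
        # Relative coverage of each bin; empty bins are left untouched.
        scale = [s / mean_sum if s > 0 else 1.0 for s in row_sums]
        for i in range(n):
            for j in range(n):
                m[i][j] /= scale[i] * scale[j]
        bias = [b * s for b, s in zip(bias, scale)]
        if max(abs(s - 1.0) for s in scale) < tol:
            break
    return m, bias


# Toy example: a balanced "true" matrix distorted by multiplicative bin biases.
true_matrix = [[2, 1, 1], [1, 2, 1], [1, 1, 2]]
true_bias = [1.0, 2.0, 0.5]
observed = [[true_matrix[i][j] * true_bias[i] * true_bias[j] for j in range(3)]
            for i in range(3)]
corrected, estimated_bias = iterative_correction(observed)
print([round(sum(row), 6) for row in corrected])  # equal row sums
```

This model assumes that bias factorizes into one multiplicative term per bin and that every bin should have equal total coverage, which is the kind of assumption the abstract says it reviews.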

[1] Dekker J et al. Science (2002) 295(5558):1306-1311
[2] Belton JM et al. Methods (2012),
[3] Imakaev M et al. Nature Methods (2012) 9:999-1003
[4] Hu M et al. Bioinformatics (2012) 28(23):3131-3133
[5] Yaffe E, Tanay A. Nature Genetics (2011) 43:1059-1065
F16 - Alternatively spliced homologous exons have ancient origins and are highly expressed at the protein level
Michael Liam Tress, Spanish National Cancer Research Centre (CNIO), Spain
Michael Tress, Spanish National Cancer Research Centre (CNIO), Spain
Federico Abascal, Spanish National Cancer Research Centre (CNIO), Spain
Alfonso Valencia, Spanish National Cancer Research Centre (CNIO), Spain
Juan Rodriguz, Spanish National Cancer Research Centre (CNIO), Spain
Jose Manuel Rodriguez, Spanish National Cancer Research Centre (CNIO), Spain
Iakes Ezkurdia, Centro Nacional de Investigaciones Cardiovasculares, CNIC, Spain
Jesus Vazquez, Centro Nacional de Investigaciones Cardiovasculares, CNIC, Spain
Angela del Pozo, Instituto de Genetica Medica y Molecular, Hospital Universitario La Paz, Spain
Short Abstract: Alternative splicing of messenger RNA can generate a wide variety of mature RNA transcripts, and these transcripts may produce protein isoforms with diverse cellular functions. While there is much supporting evidence for the expression of alternative transcripts, the same is not true for the alternatively spliced protein products. Although large-scale mass spectrometry experiments have identified evidence of alternative splicing at the protein level, results have been contradictory.

Here we carried out a rigorous analysis of the peptide evidence from eight large-scale proteomics experiments to assess the scale of alternative splicing detectable by high-resolution mass spectrometry. While we identified peptides for almost 64% of human protein coding genes, we detected just 282 splice events. We demonstrate that this is fewer splice events than would be expected, and show that most genes have a single dominant isoform at the protein level.
The most striking result was that more than 20% of the splice isoforms we identified were generated by substituting one homologous exon for another. This is significantly more than would be expected from their frequency in the genome. These homologous exon substitution events were remarkably conserved - all the homologous exons we identified evolved over 460 million years ago - and eight of the fourteen tissue-specific splice isoforms we identified were generated from homologous exons. The combination of proteomics evidence, ancient origin and tissue-specific splicing is a clear indication that isoforms generated from homologous exons may have important cellular roles.
F17 - Sequencing and de-novo Assembly of six grapevine cultivars
Michele Vidotto, University of Udine, Italy
Davide Scaglione, IGA Technology Services, Italy
Gabriele Magris, Department of Agricultural and Environmental Science, University of Udine; Institute of Applied Genomics Udine, Italy
Sara Pinosio, CNR, Istituto di Bioscienze e Biorisorse, Sesto Fiorentino; Institute of Applied Genomics Udine, Italy
Giusi Zaina, University of Udine, Italy
Fabio Marroni, Department of Agricultural and Environmental Science, University of Udine; Institute of Applied Genomics Udine, Italy
Gabriele Di Gaspero, Department of Agricultural and Environmental Science, University of Udine; Institute of Applied Genomics Udine, Italy
Michele Morgante, Department of Agricultural and Environmental Science, University of Udine; Institute of Applied Genomics Udine, Italy
Short Abstract: Motivation: Plant genomes are characterized by high levels of structural variation, consisting of insertions/deletions, mostly due to recent insertions of transposable elements. Next-generation sequencing (NGS) allows re-sequencing of the whole genome of several subjects to produce catalogues of structural variants (SVs), ultimately defining a species' Dispensable Genome (DG), composed of partially shared and/or non-shared DNA sequence elements. To detect those portions of the DG that are not present in the Vitis vinifera reference genome (PN40024, 485Mb) but that may be present in one or more individuals, we performed sequencing and de-novo assembly of six grapevine cultivars: Cabernet Franc, Gouais Blanc, Kishmish Vatkana, Rkatsiteli, Sangiovese and Traminer.

Methods: Raw reads were cleaned with cutadapt and ERNE-FILTER. The overall quality of libraries was estimated by k-mer distribution analysis with Jellyfish. The ALLPATHS-LG algorithm was chosen for assembling the six cultivars. The k-mer spectra from the contigs and the starting reads were compared with the Kmer Analysis Toolkit (KAT). The contigs were aligned to the V. vinifera reference using DENOM. An in-house script was developed to place scaffolds on the reference. We estimated the fraction of genes, exons and repeats annotated in the V. vinifera reference that are present in the placed scaffolds.

Results: We assembled de-novo the genomes of six heterozygous grape cultivars, obtaining high accuracy and good assembly statistics even in complex genomic regions. The assemblies will be used to define the extent and composition of regions belonging to the dispensable genome.
F18 - Knowledge-based modelling of Arabidopsis thaliana genome
Marco Di Stefano, Parc Científic de Barcelona - CNAG, Spain
Short Abstract: The spatial organisation of the Arabidopsis thaliana genome has been widely studied with imaging techniques, which show distinctive features of its large-scale organisation. The preferential positioning of the chromocenters at the nuclear periphery, the proximity of the telomeres to the nucleolus, and the presence of chromatin loops protruding from the chromocenters all seem to play a major role in the spatial organisation of Arabidopsis thaliana chromosomes. Taking advantage of these experimental data, we used coarse-grained models of the chromatin fiber and knowledge-based molecular dynamics to test how the large-scale spatial organisation affects the three-dimensional genome arrangement at kilobase-level resolution. We describe the chromatin fiber as a chain of beads with excluded volume and bending rigidity, and enforce spatial restraints in the simulation to match the known large-scale chromosomal arrangements. To validate our models, we analyzed recently published Hi-C data with TADbit, an advanced bioinformatics tool developed by our group, and compared them with the contact frequency matrices computed on our models. The generated models allowed us to gain insights into the structural properties of the Arabidopsis thaliana genome.
F19 - Extending and validating the mouse lncRNA gene catalogue using RNAseq data
Jose Gonzalez, W. T. Sanger Institute, United Kingdom
Electra Tapanari, Wellcome Trust Sanger Institute, United Kingdom
Thibaut Hourlier, EMBL-EBI, United Kingdom
Carlos Garcia-Giron, EMBL-EBI, United Kingdom
Rory Johnson, Centre de Regulacio Genomica, Spain
Barbara Uszczynska, Centre de Regulacio Genomica, Spain
James Wright, Wellcome Trust Sanger Institute, United Kingdom
Bronwen Aken, EMBL-EBI, United Kingdom
Jyoti Choudhary, Wellcome Trust Sanger Institute, United Kingdom
Roderic Guigo, Centre de Regulacio Genomica, Spain
Jen Harrow, Wellcome Trust Sanger Institute, United Kingdom
Short Abstract: Many groups are generating and mining publicly available Illumina RNAseq data to identify thousands of novel long non-coding RNAs from different organisms. The reliability of these models is variable and can depend on the length and quality of the input data and on the algorithms used.

With the aim of discovering novel mouse lncRNA genes, we built transcript models from ENCODE mouse Illumina data across ten different tissues using the Ensembl RNAseq pipeline. This pipeline first makes rough exon models with BWA, then defines splice sites with Exonerate, and finally combines exon and intron features to build the best-supported isoform as indicated by read depth. In comparison with transcripts built with Cufflinks, this pipeline seems more conservative and produces a smaller proportion of single-exon loci.

Transcripts built from this pipeline were subsequently filtered using coding potential assessment tools like PhyloCSF and CPAT. We found both putative coding and non-coding novel loci, with most of them predicted to be non-coding. For some tissues, 80% of predicted non-coding loci mapped outside the current GENCODE annotation. Consequently, these putative lncRNA loci are now being manually assessed by HAVANA annotators for incorporation into the GENCODE geneset.

Only around 20% of the lncRNAs are supported by CAGE data or polyA features indicating that they are full-length. Therefore we have selected 5300 of these predicted lncRNAs to be extended using CaptureSeq, a strategy based on targeted RNA capture followed by PacBio sequencing. We are also using proteogenomic techniques to identify any peptides encoded by small ORFs in the novel lncRNAs.
F20 - Immunogenomics of the Egyptian Fruit Bat, An Important Viral Reservoir
Stephanie D'Souza, Boston University School of Medicine, United States
Chandri Yandava, Boston University School of Medicine, United States
Sean Lovett, United States Army Research Institute of Infectious Diseases, United States
Galina Koroleva, United States Army Research Institute of Infectious Diseases, United States
Elyse Nagle, United States Army Research Institute of Infectious Diseases, United States
Albert Lee, Columbia University College of Physicians and Surgeons, United States
Raul Rabadan, Columbia University College of Physicians and Surgeons, United States
Mariano Sanchez-Lockhart, United States Army Research Institute of Infectious Diseases, United States
Johnathan Towner, Centers for Disease Control and Prevention, United States
Gustavo Palacios, United States Army Research Institute of Infectious Diseases, United States
Thomas Kepler, Boston University School of Medicine, United States
Short Abstract: The Egyptian fruit bat (Rousettus aegyptiacus) is the suspected reservoir host for Marburg virus, and there is mounting evidence for the long-term circulation and evolution of the virus in these bats. Currently, there is no available reference genome for the Rousettus bat, limiting the ability to study this virus in its natural host. The lack of genomic data for this bat also prevents detailed study of the molecular mechanisms and genetic changes that allow bats to coexist with Marburg and other highly pathogenic viruses. To address this need, we are putting together a high quality, annotated genome of R. aegyptiacus with a hybrid assembly approach. Using a combination of short and long read data, we have produced a draft genome of 99,254 scaffolds with 11,314 scaffolds representing 50% of the assembly. We are testing a few hybrid assembly pipelines to improve our current assembly. For automated annotation of the whole genome, we are using ab initio software trained with paired-end RNA-Seq data from ten tissues including brain, heart, lungs and liver. We are simultaneously annotating key immune loci in both innate and adaptive immune systems. Thus far, we have found evidence for seven families of immunoglobulin heavy chain variable genes, and an expanded family of Type I Interferons, an important component of innate antiviral immunity. These reference annotations will open a new suite of tools in bat immunology and will be valuable for assessing how the Rousettus bat hosts a deadly virus without any major pathology.
F21 - From scaffold to submission in a day: a new software pipeline for rapid genome annotation and analysis
Sascha Steinbiss, Wellcome Trust Sanger Institute, United Kingdom
Fatima Silva, University of Liverpool, United Kingdom
Brian Brunk, University of Pennsylvania, United States
Bernardo Foth, Wellcome Trust Sanger Institute, United Kingdom
Christiane Hertz-Fowler, University of Liverpool, United Kingdom
Matt Berriman, Wellcome Trust Sanger Institute, United Kingdom
Thomas Dan Otto, Wellcome Trust Sanger Institute, United Kingdom
Short Abstract: Currently available sequencing technologies allow for quick and cheap sequencing of many new parasite species for subsequent comparative analysis. However, such analyses depend on the availability of high quality annotations of the newly sequenced genomes with regard to manually curated references.
We present a new full-stack software pipeline for eukaryotic genome annotation. It accepts input in various states of assembly, covering all stages from pseudochromosome contiguation and gene finding to function assignment, and includes specific support for partial genes likely to be found in highly fragmented assemblies. State-of-the-art workflow and software deployment technologies are used to make it scalable and portable for use on powerful PCs as well as large compute clusters with a minimum of effort. In addition, we have created a web-accessible annotation interface to the pipeline, allowing researchers from the parasitology community to run annotation jobs and comparative analysis tasks on user-provided genomes as well as visualize the results. We exemplify the use of the pipeline to annotate kinetoplastid parasite genomes.
