Accepted Posters

Attention Conference Presenters - please review the Speaker Information Page available here.

If you need assistance please contact submissions@iscb.org and provide your poster title or submission ID.

Category G - 'Genetic Variation Analysis'

G01 - A Hierarchical Hidden Markov Model for the Annotation of Chromatin States

Eugenio Marco, Dana-Farber Cancer Institute, United States

Short Abstract: Eugenio Marco, Wouter Meuleman, Luca Pinello, Manolis Kellis, Guo-Cheng Yuan. Epigenetic mechanisms play an important role in many diseases, but our mechanistic understanding of epigenetic regulation is still incomplete. One major difficulty is that chromatin forms complex three-dimensional structures, and studies have so far failed to map genome-wide chromatin interactions with enough resolution. On the other hand, genome-wide distributions of first-order chromatin structure, such as histone modifications, have been characterized at increasingly high resolution. In order to identify multi-layer chromatin structures simultaneously, we have developed a Hierarchical Hidden Markov Model (HHMM) with two layers of chromatin states, which we call domain-level and nucleosome-level states. Using this method, we analyzed a ChIP-seq dataset of 9 histone marks in H1 (human embryonic stem cells), GM12878 (lymphoblastoid cells) and K562 (erythroleukemia cells), and identified a number of chromatin domains that can be validated by independent studies. At the same time, nucleosome-level states detected variations in histone modification patterns at high resolution. Our new HHMM approach has uncovered higher-order chromatin states and provides novel insights into epigenetic regulation in normal development and disease.

G02 - Exploring the Genomic Architecture and Chromatin Structure of uncharted regions of the Human Genome using existing ChIP-seq, RNA-seq and Hi-C data sets

Sofia Barreira, National University of Ireland Galway, Ireland

Short Abstract: Nucleolar Organizer Regions (NORs), positioned on the short arms of the five human acrocentric chromosomes (13, 14, 15, 21 and 22) and containing tandem arrays of ribosomal genes, are responsible for forming the nucleolus, a major functional domain of the nucleus dedicated to ribosome biogenesis (1). Evidence suggests that sequences adjacent to the rDNA repeats are involved in the regulation of nucleoli (2). The entire short arms of these chromosomes are missing from the current human genome assembly. The identification and characterization of these sequences is of critical importance, as nucleoli have a central role in growth regulation and a long-established connection to tumorigenesis. To date, we have determined nearly 600 kb of novel sequences neighboring ribosomal genes, including 380 kb on the distal side towards the telomere, termed the distal junction (DJ) (2).
My work focuses on rDNA organization, extending and characterizing the distal sequences of the NORs and establishing the organization/structure of DJ chromatin.
Using available Hi-C datasets combined with 454 sequencing reads of nucleolar DNA we have identified a BAC that maps on the distal side of rDNA and closer to the telomere on all acrocentric chromosomes. At low resolution Hi-C analysis confirms the relative spatial positioning of the rDNA repeats and the DJ in interphase cells, and at high resolution Hi-C reveals a striking chromatin feature centered over a large inverted repeat that might play a role in nucleolar function.

(1) Grob et al, 2014, Genes Dev. 28, 220–230.
(2) Floutsakou et al, 2013, Genome Res. 23, 2003–2012.

G03 - Assessment of comparative functional annotation propagation in mouse

Li Ni, The Jackson Laboratory, United States

Short Abstract: Mouse Genome Informatics (MGI) has long exploited orthologous mammalian relationships to infer function of mouse genes from experimentally determined knowledge about human and rat genes. Although one-to-one orthology assertions between mouse/human/rat genes still hold for 90% of protein-coding genes, MGI can now represent N-to-M cases such as the Serpina1a class, where phylogenetic analysis shows 5 mouse genes, 1 human gene, and 1 rat gene in the same homology class.

The Gene Ontology (GO) supports the use of shared semantics for functional annotation, facilitating comparative genomics endeavors that will lead to a better understanding of human biology and disease. Annotations curated from the literature by domain experts are considered the most valuable component of this effort, but manual curation is very labor intensive compared with semi-automated methods for assignment of functional annotation. MGI’s use of new N:M orthology sets includes the refinement of rules for semi-automated annotation propagation.

Since genes that share close evolutionary relationships are likely to function in similar ways, many applications leverage phylogenetic relationships to propagate functional annotation from related genes. This process involves two distinct steps: (1) the assertion of orthology, and, (2) since function is not necessarily conserved across speciation and gene duplication events, the determination that annotation propagation is sound.

We assess both the quantity and quality of various methods of automated propagation of functional annotations. As more genomes are available, such automated methods for annotation propagation will become more important. We hope this work will contribute to maintaining the high quality of functional annotation sets.

G04 - Integrating Genomic and Image Data using Biological Database of Images and Genomes

Andrew Oberlin, Miami University, United States

Short Abstract: Genomic data and biomedical imaging data are undergoing exponential growth. However, our understanding of the phenotype-genotype connection linking the two types of data is lagging behind. While there are many types of software that enable the manipulation and analysis of image data and genomic data as separate entities, there has been no established framework for linking the two. We present an update on the generic set of software tools, BioDIG, that allows linking of image data to genomic data. BioDIG tools can be applied to a wide range of research problems that require linking images to genomes. BioDIG features the following: rapid construction of web-based workbenches, community-based annotation, user management, and web services. By using BioDIG to create websites, researchers and curators can rapidly annotate a large number of images with genomic information. BioDIG not only provides a structure for annotating images and linking genomic information, but also provides tools for users to collaborate on projects and work together seamlessly. Here we present the BioDIG software tools, which include an image module, a genome module and a user management module. We also introduce a BioDIG-based website, MyDIG, which is being used to annotate images of Mycoplasma. This software is available under an open source license via http://biodig.org. In this iteration, a focus on maintainable code backed by a RESTful web API has resulted in a scalable and stable code base. This will make BioDIG a reliable MOD on top of Chado and GBrowse for future users.

G05 - Unraveling higher order chromatin structure

Benjamin Moore, University of Edinburgh

Short Abstract: Recent advances in chromosome capture technology have permitted genome-wide assessment of higher order chromatin structure in a variety of cell types. For example, Hi-C pairwise interactions can be statistically analyzed to reveal broad facets of genome organization, such as self-interacting topologically associating domains (TADs) and larger heterochromatic or euchromatic compartments. This structural information, in conjunction with comprehensive ChIP-seq datasets produced by the ENCODE consortium, offers an unprecedented opportunity to quantitatively investigate the relationship between locus level chromatin features (such as histone modifications and transcription factor binding) and higher order chromatin organisation.

We have built genome-wide, quantitative models describing higher order chromatin structure based on the underlying constellations of locus level features. In three very different cell types, Random Forest regression models achieved high predictive accuracy and were able to generalize across the different cell lines. We find that both TAD and compartment calls are highly reproducible between these human cell types, and those regions that vary are enriched for cell type specific enhancers and actively transcribed regions. In addition, we show that compartments, like TADs, show an enrichment of bound factors at their boundaries.

G06 - GRCh38: resources, analyses and future directions for the new version of the human reference genome sequence

James Torrance, Wellcome Trust Sanger Institute

Short Abstract: The Genome Reference Consortium (GRC) released a new human reference assembly, GRCh38, at the end of 2013. The new assembly has already been adopted by the Ensembl and UCSC Genome Browsers and is also available in the NCBI Mapviewer. The GRC provides resources for remapping annotation onto GRCh38, whether from earlier versions of the reference assembly or from other human genomes.

The GRC has compared GRCh38 with the preceding assembly (GRCh37) with regard to assembly quality, gene representation, and the extent to which reads map. In order to aid read mapping, the GRC has created "analysis sets": GRCh38 sequence collections in a convenient format for use by genome alignment pipelines, removing the need for additional decoy sequences.

The GRC continues to work on the human reference genome and will produce regular minor patch releases. Annotation regarding this work in progress on the genome is available through a Track Hub, which can be used to display this information in various genome browsers. As well as improving the reference sequence, the GRC is working with the Global Alliance to coordinate future development of tools and data formats. This will lead to more sophisticated methods of representing and querying human sequence variation.

G07 - Basic4Cseq: an R-package for analyzing 4C-seq data

Carolin Walter, University of Münster, Germany

Short Abstract: Circular chromosome conformation capture combined with high-throughput sequencing (4C-seq) identifies chromosomal interactions between one predefined interaction partner (viewpoint) and virtually any other part of the genome, without prior knowledge of potential interaction partners. Since 4C-seq data sets are complex in nature and prone to bias, filtering and differentiating between signal and background noise are crucial to achieve reliable results. The existing techniques for the analysis of 4C-seq data choose different strategies to achieve these goals, but without a number of biologically validated interaction sites, assessing the sensitivity and specificity of the algorithms is difficult. Simulating 4C-seq data with known interaction sites provides an alternative.
Our R package, Basic4Cseq, offers routines for filtering, analysis and near-cis visualization of 4C-seq data. Virtual fragment libraries can be created that provide chromosome position and further information on 4C-seq fragments. Filtering options allow users to include or remove potentially biased fragment types (e.g. fragments lacking a secondary restriction site), and the effects can be visualized and compared at the profile level for the viewpoint region. Quality controls based on the read distribution are included, and fragment statistics for specified regions of interest can be exported. In addition, Basic4Cseq can simulate 4C-seq reads that respect both the fragment structure and the power-law distribution typically found around an experiment's viewpoint, and model biases introduced by fragment properties. Different types and shapes of interaction sites can be simulated, forming a base for the comparison of the existing 4C-seq analysis algorithms.
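To make the simulation idea concrete, the following is a minimal Python sketch of drawing read counts per fragment with a power-law decay around the viewpoint and optional planted interaction sites. It is not Basic4Cseq's implementation or API (the package is written in R); the function name, parameters, and default exponent are assumptions chosen for illustration only.

```python
import random


def simulate_fragment_counts(fragment_mids, viewpoint, total_reads,
                             exponent=1.0, interactions=None, seed=0):
    """Distribute reads over fragments with power-law decay from the viewpoint.

    fragment_mids: fragment midpoints (bp) on the viewpoint chromosome.
    interactions: optional {fragment_index: fold_enrichment} for planted sites.
    All names and defaults here are illustrative assumptions.
    """
    rng = random.Random(seed)
    interactions = interactions or {}
    weights = []
    for index, mid in enumerate(fragment_mids):
        # Clamp the distance to 1 bp so the viewpoint fragment has finite weight.
        distance = max(abs(mid - viewpoint), 1)
        weight = distance ** (-exponent)
        # Planted interaction sites receive a multiplicative enrichment.
        weight *= interactions.get(index, 1.0)
        weights.append(weight)

    counts = [0] * len(fragment_mids)
    fragment_indices = range(len(fragment_mids))
    for index in rng.choices(fragment_indices, weights=weights, k=total_reads):
        counts[index] += 1
    return counts
```

Because the planted sites are known, the output of any peak-calling method run on such counts can be scored directly for sensitivity and specificity.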

G08 - 3D Chromosome Structure Modeling Based on HiC Data

Dariusz Plewczynski, University of Warsaw, Poland

Short Abstract: Although chromosomes are tightly packed inside the cell nucleus and exhibit a complex spatial organization, they are often treated as linear, one-dimensional entities. This is partly due to a lack of proper tools for studying their spatial conformation. Techniques such as FISH, which enable us to observe the physical locations of specified loci and to measure distances between them, are impractical for genome-wide studies. Chromosome conformation capture (3C)-based methods have allowed us to quantify interchromosomal interactions and to infer chromatin structure, leading to a better understanding of relations between genomic loci. It has been shown that chromatin inside the cell nucleus is well organized and that two compartments may be distinguished, clearly differing in chromatin density, gene richness, chromatin marks and other genomic and epigenetic features.
This project's goal is to model 3D chromosome structure and to analyze its relation to functional features of the genome, with emphasis on structural variations such as copy number variations (CNVs). The general outline of the project pipeline is as follows. First, reads from the HiC experiments are processed and a contact heatmap is generated. Based on the heatmap, the 3D structure is inferred (since HiC data are population-based, this structure represents an ensemble of structures rather than a single chromosomal structure). Finally, genomic features of interest are mapped to the resulting structure and analyzed.
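The first pipeline step, turning mapped read pairs into a heatmap, can be sketched as follows. This is a simplified Python illustration for a single chromosome, not the project's actual code; the function name, the input format (pairs of mapped positions in bp), and the fixed bin size are all assumptions.

```python
def contact_matrix(read_pairs, chrom_length, bin_size):
    """Bin intrachromosomal read pairs into a symmetric contact-count matrix.

    read_pairs: iterable of (pos_a, pos_b) mapped positions in bp (assumed input).
    Returns a list of lists where entry [i][j] counts contacts between bins i and j.
    """
    n_bins = (chrom_length + bin_size - 1) // bin_size  # ceiling division
    matrix = [[0] * n_bins for _ in range(n_bins)]
    for pos_a, pos_b in read_pairs:
        i, j = pos_a // bin_size, pos_b // bin_size
        matrix[i][j] += 1
        if i != j:
            # Contacts are unordered, so mirror off-diagonal counts.
            matrix[j][i] += 1
    return matrix
```

In practice the raw counts would still need normalization for coverage and other biases before a 3D structure is inferred; that step is omitted here.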

G09 - MITIE: Simultaneous RNA-Seq-based Transcript Inference and Quantification in Multiple Samples

Jonas Behr, Swiss Federal Institute of Technology Zurich, Switzerland

Short Abstract: High-throughput sequencing of mRNA (RNA-Seq) raised expectations of tremendous improvements in the detection of expressed genes and transcripts. However, the immense dynamic range of gene expression, biases from sequencing, library preparation and read mapping, and the unexpected complexity of the transcriptional landscape cause profound computational challenges. The latter can lead to a combinatorial explosion of the number of potential transcripts that can qualitatively explain the observed read data. To find the correct set of transcripts, long-range dependencies have to be resolved. Based on simple toy examples, we show that state-of-the-art tools fail to resolve these dependencies even if sufficient information is provided.

By treating the transcript recognition problem as a combinatorial optimization problem, we open up a large arsenal of techniques that cannot be applied in continuous optimization settings. Firstly, a set of up to k transcripts that gives the optimal quantitative explanation for the observed RNA-Seq reads can be computed without enumerating all possible transcripts. Secondly, sparsity can be enforced by penalizing the number of transcripts needed to quantitatively explain the reads. Thirdly, we can share information among multiple RNA-Seq samples and thereby provably increase the power to resolve long-range dependencies.

These conceptual improvements translate to substantial gains in transcript recognition performance, which we show on carefully simulated reads for the human genome and in a retrospective study on the Drosophila modENCODE data set consisting of 53 RNA-Seq samples.

G10 - iPlant: Tools For Large Scale Assembly and Annotation of Plant Genomes

Joshua Stein, Cold Spring Harbor Laboratory, United States

Short Abstract: The promise of genome research depends on our ability to accurately assemble, annotate, and derive meaning from sequence data. However, extremes of genome size, polyploidy, diversity, and repeat content push the limits of the current algorithms, expertise, and computational power available to today’s researchers. In response, iPlant is fostering a community effort to identify best practices and state-of-the-art tools, install and optimize their performance on the nation’s most powerful supercomputers, and make these available as a free on-line resource. Over the last two years the iPlant Discovery Environment (DE) has matured to provide a comprehensive set of tools and services for sequence handling, read alignment, RNA-seq profiling, and de novo genome and transcriptome assembly. To extend these capabilities we are working to incorporate MAKER-P, a standardized, portable and easy-to-use plant-genome annotation engine with built-in methods for quality control. As part of this effort MAKER-P was specifically optimized to take advantage of the parallel computing environment of the TACC Lonestar cluster and is now a supported module. Performance testing showed that MAKER-P can run high-quality, full-fledged annotation pipelines on even the largest plant genomes in a matter of hours. Incorporation of this resource into the DE fits into an overall strategy that includes downstream functional annotation of protein coding genes and visualization. The iPlant Collaborative is funded by NSF #DBI-0735191.

G11 - cisExpress: Novel method for discovery of position specific regulatory elements

Martin Triska, Children's Hospital Los Angeles, United States

Short Abstract: One of the major challenges for contemporary bioinformatics is the analysis and accurate annotation of genomic datasets to enable extraction of useful information about the functional role of DNA sequences. Our method introduces a novel genome-wide statistical approach to the detection of position specific DNA sequence motifs based on similarities between the promoters of similarly expressed genes. This new tool, cisExpress, is especially designed for use with large datasets, such as those generated by publicly accessible whole genome and transcriptome projects. In order to make cisExpress highly efficient and to take advantage of currently available computational resources, cisExpress is implemented in a highly parallel fashion using C++/OpenMPI. We demonstrated the robust nature and validity of the proposed method in a paper published in Bioinformatics (Oxford Journals) in Sept 2013. It is applicable for use with a wide range of genomic databases for any species of interest.
AVAILABILITY: cisExpress is available at www.cisexpress.org.

G12 - NPEST: a nonparametric method and a database for transcription start site prediction

Alona Kryshchenko, Children’s Hospital Los Angeles and Keck School of Medicine, University of Southern California, United States

Short Abstract: In this poster we present NPEST, a novel tool for the analysis of expressed sequence tag (EST) distributions and transcription start site (TSS) prediction. This method estimates an unknown probability distribution of ESTs using a maximum likelihood (ML) approach, which is then used to predict TSS positions. Accurate identification of TSSs is an important genomics task, since the position of regulatory elements with respect to the TSS can have large effects on gene regulation, and the performance of promoter motif-finding methods depends on correct identification of TSSs. Our probabilistic approach expands recognition capabilities to multiple TSSs per locus, which may be useful for enhancing the understanding of alternative splicing mechanisms. This poster presents analysis of simulated data as well as statistical analysis of promoter regions of the model dicot plant Arabidopsis thaliana. Using our statistical tool we analyzed 16520 loci and developed a database of TSSs, which is now publicly available at www.glacombio.net/NPEST.

G13 - Identification of long-range regulatory elements and their target genes

Yih-Chii Hwang, University of Pennsylvania, United States

Short Abstract: GWAS have shown that the majority of disease-associated DNA variations lie within non-coding regions. One class of non-coding regulatory DNA elements is enhancers. Because an enhancer can be linearly distal to, and orientation-independent of, the gene it regulates, probing all possible pairs of enhancer–target gene contacts in space would be extremely laborious, and the problem remains unsolved.
To systematically uncover all enhancers and the genes they regulate, we reanalyzed Hi-C datasets of human and mouse cells with raw read counts ranging from 60M to 1,612M. By comparing the read depth of specific and non-specific read pairs, we identified DNA–DNA interactions by extracting restriction fragments with more reads than expected. We then classified these restriction fragments as candidate enhancer elements if they overlap known enhancer-associated histone modification regions and contact a promoter element.
Additionally, we streamlined the analysis procedure into a pipeline so that enhancer–target gene prediction can be performed efficiently on future Hi-C datasets. The pipeline takes Hi-C FASTQ files as input and ultimately identifies enhancer–target gene pairs by incorporating sequencing analysis packages such as BWA, samtools, and bedtools. It is designed to run on high-performance computing clusters.
We have identified 2,540 to 13,867 enhancer–target gene interactions throughout the human and mouse genomes. These enhancers can be 17-fold enriched in p300 binding sites and their target promoters are 1.2-fold more likely to be near RNA polymerase II binding sites. This comprehensive enhancer–target gene repertoire will allow us to identify disease-linked polymorphisms that lie within enhancer elements, and study the evolutionary conservation of enhancer–target gene pairs.

G14 - Assembling the mitochondrial minicircle genome in Trypanosoma brucei

Michael Quintin, Boston University, United States

Short Abstract: Kinetoplastids are a group of flagellated protozoans which cause parasitic diseases in many developing countries. Parasites of the genus Trypanosoma give rise to African trypanosomiasis, or sleeping sickness - a disease currently classified by the WHO as uncontrolled and re-emerging in many areas of the world. Trypanosomatids are characterized by the presence of a kinetoplast - a dense, DNA-containing structure located within their mitochondrion. This genome (kDNA) is organized into a catenated network of circular structures comprising around 10,000 minicircles (~1 kb) and a few dozen maxicircles (~20 kb). Remarkably, most of the protein-coding genes are encrypted - their transcripts can be translated only after uridine insertion/deletion via mRNA-editing. This editing process is directed by minicircle-encoded guide RNAs (gRNAs). While past studies have attempted to sequence individual minicircles and identify their conserved domains, obtaining a complete and reliable catalog of all existing minicircle sequences remains a challenging task. Obtaining such a reference could prove vital in our understanding of gRNA transcription and RNA-editing mechanisms in kinetoplastids. Here, we analyzed next-generation sequencing data from mitochondrial DNA and small RNAs isolated from T. brucei. We explored cluster-based classification of minicircles based on conserved sequence regions, total sequence homology, and gRNA gene content in order to describe minicircle sequence diversity. Sequence clusters generated this way will be utilized to arrive at consensus sequences and generate a complete map of the T. brucei mitochondrial genome.

G15 - HGNChelper: identification and correction of invalid human gene symbols

Jasmine Abdelnabi, CUNY School of Public Health, United States

Short Abstract: Gene symbols are meant to provide unique and meaningful names for each gene in the human genome; however, aliases for the same gene are common. Various data entry methods, including the use of Excel spreadsheets, can also introduce mislabeling, as certain gene symbols are converted to date formats (Zeeberg et al., BMC Bioinformatics 2004). Many bioinformaticians incorrectly assume that these problems can be avoided by using annotations directly from public databases and avoiding Excel. We analyzed all Homo sapiens platform annotations stored in the Gene Expression Omnibus (GEO) in order to quantify the extent of invalid and out-of-date gene symbols, and found that some platform annotations contained up to 15% invalid gene symbols. The HUGO Gene Nomenclature Committee (HGNC) maintains the official database of gene symbols and their aliases; however, it cannot be accessed programmatically and does not correct symbols that have been mogrified by Excel. We present the HGNChelper software, which identifies known aliases and outdated symbols from the HGNC database, in addition to common date modifications introduced by Excel. HGNChelper is implemented as an R package for high-throughput programmatic correction, and as a web page for simple interactive usage. Use of HGNChelper can both improve the mapping of published gene signatures and avoid the embarrassment of publishing “DEC-1” as a candidate disease-associated gene.
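The date-conversion problem can be illustrated with a small Python sketch that reverses date-like strings such as "1-Dec" back into a gene-symbol form. This is not HGNChelper's code or API (HGNChelper is an R package); the function name and the three-entry month table are assumptions for illustration, and the prefixes reflect symbols in use when this abstract was written rather than an authoritative or current HGNC mapping.

```python
import re

# Hypothetical, deliberately incomplete table: month abbreviation -> symbol prefix.
MONTH_TO_PREFIX = {"MAR": "MARCH", "SEP": "SEPT", "DEC": "DEC"}

# Matches both "1-Dec" (day first) and "Dec-1" (month first).
_DATE_LIKE = re.compile(r"^(?:(\d{1,2})-([A-Za-z]{3})|([A-Za-z]{3})-(\d{1,2}))$")


def repair_symbol(symbol):
    """Return a repaired symbol if the input looks like an Excel date, else the input."""
    match = _DATE_LIKE.match(symbol.strip())
    if not match:
        return symbol
    if match.group(1):
        day, month = match.group(1), match.group(2)
    else:
        day, month = match.group(4), match.group(3)
    prefix = MONTH_TO_PREFIX.get(month.upper())
    if prefix is None:
        # Date-like, but no known gene family for this month: leave untouched.
        return symbol
    return f"{prefix}{int(day)}"
```

For example, `repair_symbol("1-Dec")` yields `"DEC1"`, while an ordinary symbol such as `"TP53"` passes through unchanged. A real tool would additionally look up aliases and withdrawn symbols, which this sketch does not attempt.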

G16 - Experiences developing Genome Annotation workflows within Galaxy

Iyad Kandalaft, Agriculture and Agri-Food Canada, Canada

Short Abstract: Current DNA sequencing technologies enable routine generation of microbial genomes. While automated annotation tools exist, they are subject to regular improvement, and the effort required to install, update, and utilize these tools in a systematic way imposes a barrier for many researchers. To facilitate rigorous and reproducible annotation efforts at Agriculture and Agri-Food Canada, we enabled a suite of annotation tools and integrated them into workflows for the Galaxy platform. In the process, we documented proposed IT best practices for sustainable tool and workflow development within Galaxy. The resulting workflows leverage Galaxy’s ability to install and systematically execute automated annotation software to 1) heuristically and empirically annotate genes using Maker, 2) identify gene clusters using AntiSMASH 2, 3) identify gene function using InterProScan 5, 4) locate microsatellites using MISA, 5) detect SNPs using Mauve, and 6) design primers flanking microsatellites and SNP-rich regions using Primer3. Conversion steps in the workflows utilize the Generic Feature Format (GFF3) as the unified file type that permits pushing annotations to genome browsers for manual curation and publication. The Galaxy-driven tools-and-workflows approach ensures minimal-effort reproducibility by tracking provenance data with regard to tool versions and analysis steps. The workflow and proposed IT best practices are presented to the Galaxy community herein for discussion and evaluation, and the tool wrappers will be released to the community shortly.

G17 - HIPPO: A graph-based approach to constructing quality meta-assemblies

Aaron Steele, University of Notre Dame, United States

Short Abstract: The ability to extract meaningful results from genomic data often depends on access to a well-constructed genome assembly. Because of limitations of time and money, however, manual finishing and validation are not performed. This issue is compounded by the fallibility of assemblers, which struggle with repeats, chimeric reads, and contaminants and can vary widely in the caliber of their assemblies. Further, some assemblers fare better than their competitors on specific datasets. Deciding which of them is best can be daunting, as heuristics like N50 or number of contigs are too broad to capture fine-grained quality information of a specific assembly.
HIPPO is designed to alleviate these problems by automatically merging multiple assemblies based on their sequence quality. The measure of quality derives from previous work in assembly validation, where the correctness of an assembled sequence is derived from a vector of quantifiable values for consecutive windows across the assembly. The multi-step approach begins with a full genome alignment that is transformed into a bipartite graph. Weights are then assigned to edges in the graph using size, similarity, and local assembly quality of each alignment. A modified maximum-flow algorithm identifies optimal sections for improvement. These sections are woven together by a simplified de Bruijn path process, producing a meta-assembly consisting of only the highest-quality sections. The tool extends, fills gaps in, and replaces dubious sections of sequence. We conclude that HIPPO allows users to combine multiple assemblies and incorporate high-quality supplemental regions, such as fosmids.

G18 - Detecting ancient elements of extreme conservation in eukaryotic genomes using hash mapping and cache-aware in-memory computing

Andi Dhroso, University of Missouri, United States

Short Abstract: Genomics is one of the first life science disciplines to enter the era of Big Data, facing challenges in all three dimensions: volume, variety, and velocity. Yet, in spite of a plethora of sequencing data, we are still far from creating a complete encyclopedia of functional and structural elements of the genome. An example of this knowledge gap emerged in 2004, when Bejerano and Haussler discovered 481 DNA elements in syntenic positions of the human, mouse and rat genomes that were 100% identical, called ultraconserved elements (UCEs). Recently, using an advanced data-mining alignment-free approach, we have shown that this phenomenon exists beyond the animal kingdom and outside the regions of synteny.

Our ultimate goal is to provide a comprehensive atlas of the regions of extreme conservation in higher eukaryotes, which may shed light on the structural organization, function and evolution of these elements. However, this task of all-against-all comparison of dozens of eukaryotic genomes may not be feasible using current approaches. Here, we present a hybrid approach that integrates the ideas of hash mapping and cache-aware in-memory computing. Our algorithm leverages the concept of “help-me-help-you”, where the data structures are tailored to maximize cache hits while minimizing cache misses. As a result, our hybrid algorithm is almost 300 times faster than the current state-of-the-art method and scales to unassembled genomes. It has been applied to detect the earliest evidence of extreme conservation by including recently sequenced genomes of coelacanth, elephant shark, and lamprey in the large-scale analysis.
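As a rough illustration of the hash-mapping idea only, the Python sketch below indexes the k-mers of one sequence in a hash table and extends shared seeds into maximal exact matches against a second sequence. It is not the authors' algorithm: it ignores cache-aware data layout, reverse-complement matches, and genome-scale memory constraints, and the function names and parameters are assumptions.

```python
from collections import defaultdict


def index_kmers(seq, k):
    """Map every k-mer in seq to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def shared_exact_matches(seq_a, seq_b, k):
    """Return maximal exact matches of length >= k as (start_a, start_b, length)."""
    index = index_kmers(seq_a, k)
    matches = set()
    for j in range(len(seq_b) - k + 1):
        for i in index.get(seq_b[j:j + k], ()):
            # Skip seeds that sit inside a longer match starting further left.
            if i > 0 and j > 0 and seq_a[i - 1] == seq_b[j - 1]:
                continue
            length = k
            # Extend the seed to the right while both sequences agree.
            while (i + length < len(seq_a) and j + length < len(seq_b)
                   and seq_a[i + length] == seq_b[j + length]):
                length += 1
            matches.add((i, j, length))
    return sorted(matches)
```

Hash lookups of this kind replace pairwise alignment with constant-time seed discovery, which is the general reason such approaches scale to all-against-all comparisons.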

G19 - High-throughput genome scaffolding from in vivo DNA interaction frequency

Noam Kaplan, University of Massachusetts Medical School, United States

Short Abstract: Despite the advancement of DNA sequencing technologies, assembly of complex genomes remains a major challenge. Surprisingly, the quality of published complex genomes has decreased, due to the growing use of short read sequencing.

We have developed a high-throughput scaffolding approach, based on the notion that loci that are near each other in the genomic sequence have a high probability of interacting with each other. We demonstrate that genome-wide in vivo chromatin interaction frequency measurements can be used as genomic distance proxies to accurately detect the positions of contigs over large distances without requiring any sequence overlap. Furthermore, we demonstrate our approach can karyotype and scaffold an entire genome de novo. Applying our approach to incomplete regions of the human genome, we predict the positions of 65 previously unplaced contigs, in agreement with alternative methods. Our approach can theoretically bridge any gap size, is simple, robust, scalable and applicable to any species.

G20 - De novo Assembly of the North American Bullfrog Transcriptome with Trans-ABySS

Bahar Behsaz, Canada's Michael Smith Genome Sciences Centre, Canada

Short Abstract: Whole transcriptome shotgun sequencing (RNA-seq) provides the ability to perform efficient and accurate transcriptome analysis and profiling. However, non-uniform coverage of transcripts in RNA-seq data, caused by transcript expression levels that vary by up to six orders of magnitude, has been a computational challenge for de novo assembly and analysis of RNA-seq data. Here, we report our updates on the transcriptome assembly algorithm Trans-ABySS and its application in a de novo assembly project to reconstruct the North American Bullfrog (Rana catesbeiana) transcriptome. We assessed our results with the CEGMA (Core Eukaryotic Gene Mapping Approach) tool, which showed reconstruction of transcripts associated with 100% of 248 highly conserved core eukaryotic genes. We were able to map more than 95% of the original reads back to this assembled transcriptome. We used assemblies of RNA-seq data from different tissues to perform differential expression analysis. Certain genes were expected to respond differently under different biological conditions. We observed that de novo transcriptome assemblies were effective in identifying those genes and estimating their expression levels, which correlated well with qPCR validation experiments. The results demonstrate that Trans-ABySS is a valuable tool for assembling transcriptomes of non-model organisms.

G21 - Vervet Monkey Gene Annotation in Ensembl

Rishi Nag, EMBL-EBI,

Short Abstract: Ensembl (www.ensembl.org) provides automatic genome annotation for over 60 vertebrate species including vervet monkey. New multiple species alignments, gene trees and variation features are made freely available online in each release. These data can be accessed via the Ensembl Browser, MySQL databases, a Perl API and the BioMart query system.

Chlorocebus sabaeus (vervet monkey) is a critical non-human primate model system for biomedical research, employed for investigations of brain and behaviour, metabolism and immunity. It is the most abundant natural host of simian immunodeficiency virus (SIV).

The latest vervet monkey assembly, Chlorocebus_sabeus 1.1, was produced in April 2014 by the Vervet Genomics Consortium. The gene annotation is planned for Ensembl release 76 (June 2014). The 2.8Gb high-coverage vervet monkey genome was annotated using the Ensembl gene annotation pipeline, incorporating RNA-Seq data. The initial set of analyses masked 49% of the genome as repeat features.

Subsequent analyses on the genome included the prediction of CpG islands and transcript start sites. Ab initio gene predictions by Genscan were also included. Protein, cDNA and EST sequences from UniProt, RefSeq and ENA were aligned to the genome. This led to the construction of an Ensembl gene set for vervet monkey, following the standard gene annotation process whereby sequence alignments from vervet monkey and other species were used to support the gene models.

Evidence from vervet monkey proteins, cDNA, EST and RNA-Seq data was combined with vertebrate proteins from UniProt and human Ensembl Longest Translation models to contribute to the final gene set.

G22 - A spectral clustering approach to investigate specificity of chromosomal organization

Alireza Fotuhi Siahpirani, University of Wisconsin-Madison, United States

Short Abstract: The three-dimensional organization of the genome is emerging as a major determinant of gene regulation. Advances in chromosome conformation capture (3C) methods are expanding our repertoire of data sets measuring the three-dimensional organization for multiple cell types and organisms. An important challenge is to develop computational methods to interrogate these data to reveal the three-dimensional organization of the genome and how these maps change across different cell types. In this work we introduce a clustering-based approach to analyze genome-wide contact maps and to compare them across different cell types. We examine different clustering algorithms, including hierarchical, k-means and spectral clustering, using different cluster evaluation criteria. We find that spectral clustering, which is based on a graph-based representation of the contact maps, is significantly better than a simple hierarchical or k-means clustering approach. We apply our clustering method to published genome-wide 3C data (Hi-C) from six human and two mouse cell lines. We recover clusters representing different levels of chromosomal organization, ranging from entire chromosomes within one cluster, to chromosomes split into multiple clusters, to clusters capturing more than one chromosome. Several of our clusters are enriched for genomic features such as gene content, activating and repressive chromatin marks and DNaseI hypersensitive sites. Finally, we use our clusters to define inter-chromosomal interactions in human and mouse ES cells and observe significant conservation between these interactions. In summary, our clustering-based approach offers a promising way to systematically compare Hi-C contact count maps across multiple cell types and organisms.
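
To make the graph-based view concrete, here is a minimal sketch of spectral clustering in its simplest two-cluster form, written with only the Python standard library. It is an illustration under stated assumptions and not the authors' pipeline: the contact map is treated as a weighted graph, the second eigenvector of the normalized adjacency matrix is found by power iteration, and loci are split by its sign. The function name, the toy matrix, and the iteration count are all hypothetical.

```python
import math
import random


def spectral_bipartition(contacts, iterations=200, seed=0):
    """Split loci into two clusters from a symmetric, non-negative contact matrix.

    Uses S = D^-1/2 W D^-1/2, whose leading eigenvector is proportional to
    sqrt(degree). Power iteration on (S + I) / 2 with that eigenvector
    projected out converges to the second eigenvector; its signs give the split.
    """
    n = len(contacts)
    degree = [sum(row) for row in contacts]
    if any(d <= 0 for d in degree):
        raise ValueError("every locus needs at least one contact")
    inv_sqrt = [1.0 / math.sqrt(d) for d in degree]
    total = math.sqrt(sum(degree))
    top = [math.sqrt(d) / total for d in degree]  # unit-length leading eigenvector

    rng = random.Random(seed)
    v = [rng.random() for _ in range(n)]
    for _ in range(iterations):
        # Remove the component along the leading eigenvector.
        dot = sum(a * b for a, b in zip(v, top))
        v = [a - dot * b for a, b in zip(v, top)]
        # Multiply by (S + I) / 2, which has eigenvalues in [0, 1].
        sv = [
            inv_sqrt[i] * sum(contacts[i][j] * inv_sqrt[j] * v[j] for j in range(n))
            for i in range(n)
        ]
        v = [0.5 * (a + b) for a, b in zip(sv, v)]
        length = math.sqrt(sum(a * a for a in v))
        v = [a / length for a in v]
    return [1 if inv_sqrt[i] * v[i] >= 0 else 0 for i in range(n)]


# Toy contact map: two groups of three loci, dense within and sparse between.
toy = [
    [0, 10, 10, 1, 1, 1],
    [10, 0, 10, 1, 1, 1],
    [10, 10, 0, 1, 1, 1],
    [1, 1, 1, 0, 10, 10],
    [1, 1, 1, 10, 0, 10],
    [1, 1, 1, 10, 10, 0],
]
print(spectral_bipartition(toy))
```

Which group receives label 0 or 1 depends on the sign of the eigenvector and is arbitrary. Recovering more than two clusters would typically use several eigenvectors followed by a second clustering step, which this sketch omits.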

G23 - Chromatin states and activity patterns of regulatory regions across 111 human epigenomes

Wouter Meuleman, Massachusetts Institute of Technology, United States

Short Abstract: Large scale epigenomics profiling studies such as the Roadmap Epigenomics Project have generated a wealth of genome-wide datasets. In this study, we make use of genome-wide epigenomic maps across 111 epigenomes, spanning more than 100 distinct human tissues and cell types. We use these data to define a high-resolution map of diverse classes of regulatory regions and their activity patterns across cells.
We systematically delineate regulatory regions using DNaseI hypersensitive data, and annotate them using chromatin state models derived from genome-wide maps of a variety of histone modifications. We study dynamic patterns of histone modifications across tissue/cell types to detect regions of the human genome that exhibit signatures of regulatory potential, revealing ~2.3M putative enhancer regions and ~80k promoter regions based on their chromatin signatures across cell types.
We examine the activity profiles of these regions across 111 epigenomes to characterize their dynamics and further delineate them into likely units of autonomous functionality and regulation. This allows us to derive non-overlapping activity-based modules. The validity and biological relevance of these modules are supported by functional enrichment analyses of nearby genes and significant overlap with positively validated known human enhancers, enabling us to annotate candidate functions associated with each activity module and the individual enhancer elements that it contains.

G24 - Reference Proteomes in UniProtKB – Responding to Challenges in the Post Genomic Era

Benoit Bely, EMBL-EBI,

Short Abstract: The UniProt Knowledgebase (UniProtKB) is a central resource for high-quality, consistent and richly annotated protein sequence and functional information. UniProtKB has witnessed a vast increase in the number of submissions of multiple genomes for the same or closely related organisms. To keep up with this rapid growth and avoid sequence redundancy, in addition to complete proteomes that are based on translations of completely sequenced genomes, we offer a selected subset of reference proteomes constituting a representative cross-section of the taxonomic diversity found within UniProtKB. These reference proteomes are the focus of both manual and automatic annotation, aiming to provide the best annotated protein sequence sets for the selected species. High-throughput annotations from variation projects (e.g. 1000 Genomes, Cancer Genome Project, etc.) and proteomics experiments are used in the annotations of these species. We are working in close collaboration with the INSDC, Ensembl and RefSeq to map all UniProtKB proteins to the underlying genomic assemblies and to offer a consistent set of complete and reference genomes/proteomes to the user community. Additionally, we will discuss the development of pan-proteome data sets within taxonomic groups that will capture unique sequences not found in reference proteomes and aid in phylogenetic comparisons. To further reduce redundancy within UniProt, a gene-centric view of complete proteomes will be implemented. This will bring together canonical and variant protein sequences into gene-based clusters that will offer a single reference protein for each gene.
All of these future plans will be presented with a particular focus on the microbial context.

G25 - Recognizing Structured RNA by Teaching Machines to Consider Evolutionary Neutrality

Shermin Pei, Boston College, United States

Short Abstract: Strategies for de novo discovery of structured RNAs rely heavily on identification of sequences with a common secondary structure. Alignments of homologous RNAs are critical for providing convincing evidence of a shared structure in the form of co-varying mutations, where aligned sequences vary, but predicted base-pairings are conserved. Such conservation patterns are the result of evolutionary forces to maintain functional structures. Like protein structures, structured RNAs can be considered robust to point mutations. In such a context, an RNA sequence is considered robust if its neutrality (determined as the average similarity of the minimum free energy (MFE) structure for the sequence and those of sequences that differ by a single point mutation, or 1-mutant neighbors), is greater than that expected for an artificial sequence with the same MFE structure. In this work, we evaluate neutrality as a feature to detect naturally evolved RNA structures. We use existing methods to evaluate neutrality, and introduce our own measure, sampled ensemble neutrality (SEN). SEN measures maintenance of the core RNA structure by determining the extent to which base-pairs present in the original structure are preserved in structures sampled from the structural ensembles of 1-mutant neighbors. We find that neutrality is effective at separating true structured RNAs from several different backgrounds. Furthermore, as an independent feature classifier to identify structured RNAs, the SEN yields comparable performance to current techniques that consider stability and sequence identity. Finally, the SEN outperforms other measures of neutrality at detecting mutational robustness in bacterial cis-regulatory RNA structures.
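
The MFE-based neutrality described above can be sketched in a few lines of Python. This is a hedged illustration of that baseline measure, not of SEN itself: similarity is assumed here to be one minus the base-pair distance divided by sequence length, the folding routine is passed in by the caller, and `toy_fold` is a hypothetical stand-in used only so the example runs without an RNA folding library.

```python
def one_mutant_neighbors(seq, alphabet="ACGU"):
    """Yield every sequence that differs from seq at exactly one position."""
    for i, base in enumerate(seq):
        for alt in alphabet:
            if alt != base:
                yield seq[:i] + alt + seq[i + 1:]


def base_pairs(structure):
    """Convert a dot-bracket string into a set of (i, j) paired positions."""
    stack, pairs = [], set()
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            pairs.add((stack.pop(), i))
    return pairs


def neutrality(seq, fold):
    """Average structural similarity between seq and its 1-mutant neighbors.

    fold is any function mapping a sequence to a dot-bracket structure.
    Similarity is assumed to be 1 - (base-pair distance / length).
    """
    reference = base_pairs(fold(seq))
    total, count = 0.0, 0
    for mutant in one_mutant_neighbors(seq):
        distance = len(reference ^ base_pairs(fold(mutant)))
        total += 1.0 - distance / len(seq)
        count += 1
    return total / count


def toy_fold(seq):
    """Hypothetical folder: pairs the two end bases if they are Watson-Crick."""
    watson_crick = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A")}
    if len(seq) >= 5 and (seq[0], seq[-1]) in watson_crick:
        return "(" + "." * (len(seq) - 2) + ")"
    return "." * len(seq)


print(round(neutrality("GAAAC", toy_fold), 2))  # 0.92
```

In practice `fold` would wrap a real MFE predictor. A SEN-style variant would instead sample structures from each neighbor's ensemble and score how many of the original base-pairs they retain.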
