Posters

Poster numbers will be assigned May 30th.
If you cannot find your poster below, you have probably not yet confirmed that you will be attending ISMB/ECCB 2015. To confirm your poster, find the poster acceptance email, which contains a confirmation link. Click on it and follow the instructions.

If you need further assistance please contact submissions@iscb.org and provide your poster title or submission ID.

Category G - 'Genetic Variation Analysis'

G01 - The three-dimensional genome conformation of Mycoplasma pneumoniae

Short Abstract: A recent study of Mycoplasma pneumoniae, a genome-reduced bacterium and the smallest self-replicating organism known to date, revealed impressive transcriptome complexity.
The recent Hi-C method, in which ligation products are purified and then subjected to massively parallel sequencing, allows unbiased identification of chromatin interactions across an entire genome.
We are seeking to build a 3D model of the genome conformation of Mycoplasma pneumoniae using Hi-C data.
Integrating ChIP-seq data and chromatin structure information reveals numerous features of genomic organization that change from the exponential to the stationary phase of growth. Additional RNA expression data will allow us to identify gene interactions that could affect supercoiling. Together, these results should help us to understand the complex transcriptional regulation of prokaryotes.

G02 - Retroposition of non-coding RNA sequences in human, mouse, and rat genomes

Short Abstract: Insertion of DNA segments is one mechanism by which genomes evolve. The bulk of genomic segments are now known to be transcribed into long and short noncoding RNAs (ncRNAs), promoter-associated transcripts, and enhancer-templated transcripts. These various ncRNAs are thought to be dispersed in the human and other genomes by retroposition. In this study, we report clear evidence for dissemination of ncRNAs by retroposition. We used highly stringent conditions to find recently retroposed ncRNAs that had a poly(A) tract and were flanked by a target site duplication. We identified a total of several tens of instances of retroposition in the human, mouse, and rat. The inserted segments, in some cases, served as a novel exon or promoter for the associated gene, resulting in novel transcript variants. Some disseminated sequences showed sequence conservation across animals, implying a possible regulatory role. Our results indicate that retroposition is one of the mechanisms for dispersion of ncRNAs. We propose that these newly inserted segments may play a role in genome evolution by potentially functioning as novel exons, promoters, or enhancers.

G03 - Genome-wide Analysis of Transcriptional Regulators in Human HSCs Reveals a Densely Interconnected Network of Coding and Non-coding Genes

Short Abstract: Combinatorial transcription factor (TF) interactions regulate hematopoietic stem cell (HSC) formation, maintenance and differentiation, and are recognized as drivers of stem cell signatures in cancer. However, genome-wide combinatorial binding patterns for key regulators have not been reported for primary human hematopoietic stem/progenitor cells (HSPCs), which has constrained analysis of the global architecture of the molecular circuits controlling these cells. Here we provide high-resolution genome-wide binding maps for a heptad of key TFs (FLI1, ERG, GATA2, RUNX1, SCL, LYL1 and LMO2) in human CD34+ HSPCs together with quantitative RNA and microRNA expression profiles. We catalogue binding of TFs at coding genes and microRNA promoters and report that combinatorial binding of all seven TFs is favored and is associated with differential expression of genes and microRNA in HSPCs. We also uncover a hitherto unrecognized association between FLI1 and RUNX1 pairing in HSPCs, establish a correlation between the density of histone modifications that mark active enhancers and the number of overlapping TFs at a peak, and identify complex relationships between specific miRNAs and coding genes regulated by the heptad. Taken together, these data reveal the power of integrating multifactor ChIP-seq with coding and non-coding gene expression to identify regulatory circuits controlling cell identity.

G04 - Identification of Endogenous Retroviruses using a Novel Screening Technique

Short Abstract: Numerous endogenous retroviruses (ERVs) are found in all mammalian genomes; for example, they are the source of approximately 8% of human and chimpanzee genetic material. They have many effects on their hosts: they can be co-opted for functional roles, they provide regions of sequence similarity where mispairing can occur, their insertion can disrupt genes, and they provide regulatory elements for existing genes. Accurate annotation and characterisation of these regions is therefore an important step in interpreting the huge amount of genetic information available for increasing numbers of organisms.

ERVs are repetitive, degenerate and numerous, so specialised techniques are needed to identify them. We have developed a novel screening technique that uses a comprehensive database of 1408 known endogenous and exogenous retroviruses and the Exonerate sequence comparison algorithm to allow any genome to be quickly and exhaustively screened for regions resembling retroviral genes.

Our pipeline has been used successfully to characterise the ERV content of the horse, dog and chicken. We are now using it to annotate ERV sites in fifteen species of primate. Over 40,000 ERV-like regions have been identified so far in these genomes, approximately 2,500 per genome. These include many groups of novel, previously uncharacterised ERVs. For example, only four groups of ERVs have previously been identified in prosimian primates. Here, in one species, 1898 regions were identified with at least one ERV-like component.

G05 - HiTC: exploration of high-throughput 'C' experiments

Short Abstract: The three-dimensional organization of chromosomes and the physical interactions occurring along and between them play an important role in the regulation of gene activity. Over the last ten years, the development of chromosome conformation capture (3C)-based techniques has changed our view of nuclear organization. With the emergence of next-generation sequencing, high-throughput conformation capture techniques such as 3C Carbon-Copy (5C) or, more recently, Hi-C have been developed to study the physical interactions between many loci in parallel.

The HiTC R/Bioconductor package [1] has been developed to offer a bioinformatic environment to explore high-throughput 'C' data. It provides new classes and methods to import and process the data. Quality control and normalization functions are provided to check the prevalence of intra/inter-chromosomal interactions and to estimate the interaction counts one would expect if the signal depended only on the genomic distance between interacting loci. The current version of the package has been improved to increase its interoperability with the rest of the Bioconductor project. New visualization functions dedicated to Hi-C are now available, as well as annotation methods to integrate high-throughput 'C' data with additional epigenetic marks. Recent methods for topological domain identification and normalization have also been integrated into the package.

The set of functionalities offered by the HiTC package makes it a reference tool for high-throughput 'C' analyses and visualization within the Bioconductor framework, and thus offers new opportunities for future development in this field.

[1] Servant N. et al. (2012) HiTC: exploration of high-throughput 'C' experiments. Bioinformatics 28(21):2843-4

G06 - MOST+: A de novo motif finding approach combining genomic sequence and heterogeneous genome-wide signatures

Short Abstract: De novo discovery of regulatory elements, such as transcription factor binding sites (TFBSs), has long been both a hot topic and a major challenge in gaining insight into mechanisms of gene regulation. Recent advances in ChIP-seq/ChIP-chip experiments allow biologists to generate genome-wide maps of transcription factor binding sites and of signatures such as histone modification positions. This allows scientists to develop better computational methods for motif discovery. However, existing motif-finding methods that combine this information suffer from high false positive rates, slow speed and limits on dataset size.
Here we present MOST+, a MOtif finder that uses Suffix Trees to represent genomic sequences and integrates different types of genome-wide signatures, such as histone modification marks and DNase I hypersensitivity. MOST+ can detect motifs in each ChIP-seq peak region of the mouse embryonic stem cell and human lymphoblastic cell in a few minutes. In addition, some novel co-factors and motifs have been found by MOST+. Compared with existing systems, it is fast and accurate, and it can deal with datasets larger than 100 Mbp.

G07 - APApred – a new predictor of polyadenylation sites in genomic sequences

Short Abstract: Polyadenylation (pA) is the process by which a newly transcribed RNA is cleaved and then has a run of adenosine bases attached. The site of polyadenylation is determined in part by short sequences, such as AAUAAA, known as polyadenylation signals (PAS), which are recognized by the cleavage and polyadenylation complex that cleaves the RNA and adds the poly(A) tail. Previous pA site predictors rely heavily on the location of PAS, but the full suite of PAS has yet to be characterized even in human or model organisms, thus limiting detection accuracy. In contrast, APApred is trained on genome-wide data for pA sites obtained by analysis of direct RNA sequencing (DRS) reads from the human transcriptome. The 3' ends of these reads are exquisitely aligned at the pA sites in the genome, resulting in a very precisely located, unbiased, positive training set. APApred considers the nucleotide composition of the genomic sequence from 40 bp upstream to 20 bp downstream of a pA site, without requiring the presence of any known PAS within this region. APApred was designed using the artificial neural network package SNNS. Positions from the DRS data supplied the positive training set, and randomly chosen intergenic positions from the human genome provided the negative training set. Blind tests of the trained network show that APApred has excellent sensitivity and specificity (90% and 84%, respectively), surpassing the previous best pA site predictor, Polyar (84% and 82%, respectively, on the same blind test data).

G08 - Genome of Venturia inaequalis – the causal agent of apple scab

Short Abstract: The hemibiotrophic fungus Venturia inaequalis is the causal agent of apple scab, one of the most deleterious diseases in apple industries worldwide. Heavy infections cause blossom and fruit drop, and even tree defoliation; unsightly lesions render fresh fruits unmarketable and provide entry ports for storage pathogens. Apple production relies heavily on scab-resistant cultivars and fungicide treatments, but the pathogen is quick to overcome these protective measures. To identify genes and mechanisms involved in pathogenicity and adaptability, we sequenced the genome of V. inaequalis using Illumina. The current genome assembly consists of 3106 contigs (> 1 kb) covering 39.7 Mb. We predicted and annotated 11,076 protein-coding genes, with most housekeeping pathways covered to near completion. Considering that the number of genes predicted on this assembly is comparable to those in other filamentous fungi, and that the assembly accounts for only half of the currently estimated genome size (~ 80 Mb), it appears likely that this fungus has a high proportion of low-complexity DNA. Our analyses of overrepresented gene families show that V. inaequalis has 28 cutinases, which may be involved in host colonization. Furthermore, we discuss our results on V. inaequalis effectors that may potentially be involved in R-Avr apple-scab interactions.

G09 - An integrated approach to understanding apicomplexan metabolism from their genomes

Short Abstract: The Apicomplexa is a large phylum of intracellular parasites and includes the causative agents of malaria, toxoplasmosis and theileriosis, diseases of huge economic and social impact. A number of these genomes have been sequenced, but functional annotation remains challenging due to their divergence from model species. We have utilised an approach called 'metabolic reconstruction', in which genes are systematically assigned to functions within pathways/networks. Functional annotation and metabolic reconstruction were carried out using a semi-automatic approach, integrating genomic information with biochemical evidence from the literature. The functions were automatically assigned using a sequence similarity-based approach and protein motif information. Experimental evidence was also taken into account in the confirmation of functions and the grouping of genes into metabolic pathways. Functions that are required to complete metabolic pathways but are missing from the gene models were also identified. A web database named Library of Apicomplexan Metabolic Pathways (LAMP, http://www.llamp.net) was developed and contains the near-complete mapping of genes to metabolic functions for Toxoplasma gondii, Neospora caninum, Cryptosporidium and Theileria species and Babesia bovis. Each metabolic pathway page contains an interactive metabolic pathway map, gene annotations hyperlinked to external resources and detailed information about the metabolic capabilities. We also carried out a comparative analysis of the overall metabolic capabilities of apicomplexan species in terms of their ability to synthesise each metabolite or their dependence on the host for it. We expect the LAMP database will become a valuable resource for the Apicomplexa community for both fundamental and applied research.

G10 - Tandem repeats classification and annotation in eukaryotic genomes

Short Abstract: Tandemly repeated DNA represents a significant portion of eukaryotic genomes. Large tandem repeats, including satellite DNA, are the main component of centromeric and pericentromeric regions, which are mostly unassembled. The incomplete characterization of large tandem repeats and satellite DNA limits experimental studies. Here, we present a workflow for classification and annotation of large tandem repeats using both genome assemblies and unassembled reads. A non-redundant set of tandem repeats found with TRF in assembled sequences is divided into the following types: microsatellites, perfect minisatellites, minisatellites, tandem repeats related to mobile elements, and large tandem repeats including satellite DNA. We suggest two approaches to computing the distance between arrays: a distance based on pairwise sequence alignment and a distance based on the number of k-mers shared by two arrays. Using these distances, we constructed a tandem repeat similarity graph that allows families and subfamilies of large tandem repeats to be defined, with each family named according to a proposed uniform nomenclature. To check the assembly quality of tandem repeat arrays and to estimate their copy number in the genome, we use the coverage of each array by unassembled raw reads. The annotation step includes prediction of the position in the reference genome, the estimated copy number, the presence of known DNA motifs (e.g. CENP-B box, G-quadruplex, or pJalpha), the presence of higher-order repeats, and the predicted chromosome specificity. Annotated tandem repeats could be an important resource for further characterization and overall understanding of eukaryotic genomes.
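
As an illustration of the k-mer-based distance mentioned in the G10 abstract, the following minimal Python sketch compares two repeat arrays by the k-mers they share. The abstract does not specify the exact formula, so the Jaccard form used here, the function names and the default k are assumptions made for illustration only, not the authors' implementation.

```python
def kmer_set(seq, k):
    """Return the set of all k-mers in a sequence (upper-cased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_distance(array_a, array_b, k=5):
    """Jaccard distance (an assumed choice): 1 - shared k-mers / all k-mers."""
    a, b = kmer_set(array_a, k), kmer_set(array_b, k)
    if not a and not b:
        return 0.0  # two sequences shorter than k are treated as identical
    return 1.0 - len(a & b) / len(a | b)


# Two short toy arrays that share only the k-mer "ACGT" (k = 4).
print(kmer_distance("ACGTACGT", "ACGTTTTT", k=4))
```

This sketch ignores reverse complements and k-mer multiplicity; a real workflow for tandem repeats would probably need to account for both.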

G11 - Improving genome annotation using next generation high-throughput data

Short Abstract: The GENCODE consortium provides the reference gene annotation for the ENCODE project. Its gene set comprises a merge of HAVANA manual annotation and Ensembl gene predictions. Although the number of protein-coding genes has stabilized, the increasing number of new splice variants and long non-coding RNA (lncRNA) genes indicates that there is still more to discover. GENCODE has started to incorporate novel high-throughput transcriptomics and proteomics data sources to further improve the annotation of the human genome. For example, RNA-seq data obtained from a broad range of tissues and cell types, such as Illumina BodyMap, facilitates the identification of new loci and splice variants. CAGE tags and polyA-seq data help to refine the annotation of transcript boundaries and discover new single-exon loci. Proteomics data help to unveil previously unknown coding regions within a transcript. The HAVANA group is also producing reference annotation for the mouse and zebrafish genomes, for which there is still a limited variety of novel data sources. In the case of mouse, we are building transcript models from the ENCODE RNA-seq data using the Ensembl pipeline. We have compared these predictions against the publicly available ENCODE transcript set built using Cufflinks. Using comparative genomics methods such as PhyloCSF, we have evaluated the coding potential of these predicted transcripts so that we can discriminate between protein-coding and lncRNA candidates. In an alternative approach, we are also using high-depth RNA-seq data to build introns with the aim of discovering new splicing variants in the zebrafish genome.

G12 - FIDEA: a server for the Functional Interpretation of Differential Expression Analysis

Short Abstract: Differential expression analyses typically end up with hundreds or thousands of differentially expressed genes. This is clearly not a final result, because it has to be interpreted in light of the biology of the specific system under study. One way to do this is to shift attention from the differentially expressed genes to the functional categories they belong to, such as KEGG pathways, InterPro families, Gene Ontology Molecular Function, Biological Process or Cellular Component. This can be done by identifying the functional categories in which differentially expressed genes are significantly over-represented, an analysis that is more effectively performed by scientists who are well acquainted with the biological problem. We present the web server FIDEA, which aims to let experimentalists "play with" their data in an easy and at the same time exhaustive fashion within a single tool. FIDEA starts directly from the results of the differential expression analysis, accepting as input the output of the most widely used tools in RNA-Seq or microarray analysis. The user can immediately see preliminary statistics on the fold change distribution and interactively filter differentially expressed genes according to their fold change value. Over-representation statistics are calculated by analyzing down-regulated and up-regulated differentially expressed genes separately or together as a single set. The results of the analysis are provided as interactive graphs and tables, modifiable by the user, and also as publication-ready plots. We developed a tool that allows experimentalists to explore their data, facilitating the interpretation of the biological significance of their results.
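
The over-representation statistics described in the G12 abstract can be sketched with a one-sided hypergeometric test, a common choice for this kind of analysis. The abstract does not name the test FIDEA uses, so both the test and the function below (a hypothetical name, written with only the Python standard library) are assumptions for illustration.

```python
from math import comb


def enrichment_p_value(n_background, n_in_category, n_selected, n_overlap):
    """P(X >= n_overlap) when n_selected genes are drawn without replacement.

    n_background  : all genes considered
    n_in_category : background genes annotated to the functional category
    n_selected    : differentially expressed genes
    n_overlap     : differentially expressed genes in the category
    """
    upper = min(n_in_category, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(
        comb(n_in_category, k) * comb(n_background - n_in_category, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Toy numbers: 40 of 200 differentially expressed genes fall in a category
# that holds 500 of 10,000 background genes.
print(enrichment_p_value(10_000, 500, 200, 40))
```

In practice the p-values for many categories would also need correction for multiple testing, which this sketch leaves out.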

G13 - Rgb : a native genome browser for R

Short Abstract: The growing demand from the biology community for statistically robust approaches has made the R statistics-oriented scripting language an essential part of the bioinformatics toolbox. Its graphical capabilities make it a valuable tool for producing complex, publication-grade figures, while its computational efficiency allows it to handle huge datasets, as currently required in fields like transcriptomics or next generation sequencing (NGS). These qualities come with open-source licensing and ports to various operating systems that make it available virtually everywhere.

Thanks to the Bioconductor initiative, a large amount of software is freely available as R packages for tasks like microarray processing, feature annotation, parallel computing or sequence analysis. While most of the available software would benefit greatly from "genome browser" representations, a native and efficient R solution is still lacking.

The open-source package proposed here implements an interactive genome browser in R, whose advantages over available software have been demonstrated in several areas. Because it is implemented in an object-oriented paradigm, it is highly extensible, and it is properly documented. Because it relies on C-level indexing, it handles genomic features significantly faster than available solutions. Its Tcl-Tk interface allows users unfamiliar with the R language to browse their data, while several lower-level functions allow it to be included in scripts to produce automatic representations. Finally, its graphical parameter handling system allows precise control of the graphical output at both levels.

Its capabilities will be demonstrated on two typical use cases, with unpublished Comparative Genomic Hybridization and NGS datasets.

G14 - Bacterial Genome Annotation Calling using Semantic Similarity to Aggregate Hits from Multiple Databases

Short Abstract: Automatic genome annotation systems work by transitively applying annotations to novel genes by selecting the best homolog from a similarity search. When they fail, human experts must manually examine the list of homologs found by similarity search. Much of the work human experts perform is cross-referencing information gathered from multiple sources to find support for a common annotation.
We attempt to replicate the work of genome annotation experts by using gene observations generated by an automatic annotation pipeline and calculating the semantic similarity of the descriptions in relevant records. Using cosine similarity, relevant records are placed into groups describing similar gene products. Each group is then given a confidence score by cross-referencing low BLAST e-values with HMM database hits whose score surpasses the trusted cut-off. Within the highest-ranking group, the hit taxonomically closest to the target organism is selected for annotation transfer.
Our method was evaluated by comparing our annotations to those produced by GenDB and BASys on Streptococcus milleri. This genome was manually annotated by experts using exactly the same observation set used by GenDB and our annotation system. Our method showed a 7% increase in accuracy over GenDB (86% total accuracy), although we tended to assign annotations to genes that should have been left annotated as hypothetical proteins. This was a perfect real-world example of how an improved annotation calling system can save time and money, as all gene observations in this evaluation were generated before annotations could be propagated to public databases.
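
The grouping step in the G14 abstract, placing records into groups that describe similar gene products by cosine similarity, might look roughly like the Python sketch below. The bag-of-words representation, the greedy grouping strategy, the 0.5 threshold and all function names are assumptions made for illustration; the abstract does not describe these details.

```python
from collections import Counter
from math import sqrt


def cosine_similarity(text_a, text_b):
    """Cosine similarity between word-count vectors of two descriptions."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def group_descriptions(descriptions, threshold=0.5):
    """Greedy grouping: join the first group whose first member is similar enough."""
    groups = []
    for text in descriptions:
        for group in groups:
            if cosine_similarity(text, group[0]) >= threshold:
                group.append(text)
                break
        else:
            groups.append([text])  # no similar group found, start a new one
    return groups


hits = [
    "DNA polymerase III subunit alpha",
    "DNA polymerase III alpha subunit",
    "hypothetical protein",
]
print(group_descriptions(hits))
```

The confidence scoring and taxonomic selection described in the abstract would then operate on the resulting groups.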

G15 - SPAT: Searching for Poly(A) Tails in RNA-Seq de novo Assemblies

Short Abstract: In cancers, alternative splicing and alternative polyadenylation (APA) can affect transcript stability, transport and translation, and can change a transcript's translated sequence. APA can be characterized with short-read sequencing using specialized library construction methods. However, for large-scale disease studies it is desirable to characterize a range of splicing and APA events from a single library construction, sequencing run, and analysis toolset.

In the work reported here, we describe SPAT, an analysis tool that uses de novo assembly of RNA-seq data to Search for Poly(A) Tails. SPAT is designed to accept contigs from a range of de novo transcriptome assemblers. When used with Trans-ABySS, the overall pipeline reports alternative splicing and a range of types of APA.

We demonstrate SPAT with Trans-ABySS, using an RNA-Seq library constructed from Stratagene's Universal Human Reference RNA and sequenced on the HiSeq 2000 and MiSeq Illumina platforms. By analyzing the HiSeq data with SPAT and validating the results with the added length of MiSeq reads, we show that SPAT detects 88% of all annotated 3’ UTR cleavage sites when there is more than 15x MiSeq read coverage.

Using poly(A)-selected transcriptomes and matched genomes from patients with acute myeloid leukemia (AML), we distinguish RNA sequence variants that have genomic support (i.e. SNVs) from variants that lack such support, many of which have characteristics of RNA edits. We show that both types of variants can influence APA by creating or modifying polyadenylation signal motifs in 3’ UTRs and introns in genes like EIF2A, CCDC25 and RBMXL1.

G16 - 454 and SOLiD complementary used to filter and build the draft genome of highly abundant cyanobacteria in biological desert crust

Short Abstract: Over the past few years, massively parallel DNA sequencing platforms have become widely available, reducing the cost of DNA sequencing by over five orders of magnitude. We used two of these methods to reveal the genome sequence of a highly abundant cyanobacterium in biological desert crust. These rapidly evolving next-generation sequencing technologies pose challenges for bioinformaticians in terms of sequence quality scoring, alignment, assembly and more, which makes de novo assembly difficult.
We are working on solving the genome of a desert cyanobacterium from biological sand crusts. Biological sand crusts are found in many deserts around the world. They play an important role in stabilizing sandy areas and affect the vegetation composition. The crusts are formed by adhesion of the sand to extracellular polysaccharides excreted mainly by filamentous cyanobacteria. Their destruction by human activities is considered an important promoter of desertification.
Using the SOLiD™ System, we were able to recover most of the genes in the genome; however, the short reads produced by the SOLiD™ technique assembled into short contigs, which failed to assemble into scaffolds, and the draft output was highly fragmented. By adding a second sequencing method, 454, we were able to obtain much longer contigs, which assembled into scaffolds. However, the 454 method introduced sequence contamination. The fragmented SOLiD™ data helped us to filter out these sequence contaminants. Only the combination of the two methods enabled us to produce a pure draft genome. The identity and uniqueness of the microbe will be shown.

G17 - Prediction of genome-wide in vivo transcription factor binding using factor-specific DNase footprinting models

Short Abstract: The identification of DNase I hypersensitive sites and of DNase footprints are well-established methods for identifying genomic regulatory regions and DNA-protein interactions, respectively. Using data generated by high-throughput DNase-seq assays, we propose models to identify binding locations of transcription factors in different cell lines in a genome-wide manner by modeling each factor's unique DNase footprint. Contrary to most existing approaches, our model aims to represent the footprint shape in detail while trying to account for the contribution of overall DNase hypersensitivity around a binding site, so as to assess the accuracy of the footprints by themselves, a necessary feature for identifying specific sites bound under different conditions. We model each transcription factor's footprint using two features: the distribution of DNase-seq reads at each base and the DNase-seq coverage. Transcription factor binding predictions are validated rigorously using ChIP-seq assays from the ENCODE consortium. We achieve a mean AUC value of 95% for 20 transcription factors. We find that AUC values tend to depend on the quality of the motif associated with the transcription factor and on the transcription factor's structural family. For each transcription factor, we show that some ChIP-seq peaks do not overlap with a DNase footprint and characterize such peaks according to ChIP-seq signal intensity and co-binding proteins.

G18 - Chromosomal organization of the yeast recombinosome is explained by convergent and divergent genes

Short Abstract: The dynamics of the formation of the synaptonemal complex in Saccharomyces cerevisiae during meiosis are not known. We show that the binding to DNA of the recombinosome proteins that induce this three-dimensional chromosomal structure in yeast can be described simply by genomic signals.
Convergent and divergent pairs of genes along the chromosomes characterize the main localization hotspots for the proteins forming the axis, such as Hop1, and for those causing double-strand breaks, such as Spo11. Gene length, GC-rich regions and intergenic region length turn out to be key elements for accurate modeling of the experimental data.
Based on these elements, the descriptive model that we propose demonstrates that the dynamics of synaptonemal complex formation are encoded in the organization of genes along the DNA, and highlights that it is the orientation of the genes that influences the position and the length of the loops within the 3D structure.

G19 - Regulatory Network of TB: Transcription factor binding distribution and properties

Short Abstract: In order to study the initiation and establishment of an M. tuberculosis infection, we successfully conducted chromatin immunoprecipitation experiments (ChIP-Seq) for over 80 transcription factors. We detected over 10,000 binding locations with high resolution using a novel computational pipeline we developed and built a regulatory network for MTB.
We combined binding data with overexpression microarrays for over 150 TFs. We could determine a target gene and assign a potential regulatory role to ~25% of binding sites after correction for multiple testing. Interestingly, target genes regulated by multiple TFs had less evidence for regulation than genes regulated by one TF.
A significant number of validated binding sites were located within coding regions (~70%) or located at a large distance from the target (~60%) which indicated potential long-distance interactions and cooperative binding. Stronger binding sites were more often associated with regulation than weaker sites, while clusters of binding sites (strong and weak) had a more significant regulatory role than singletons.
Most transcription factors we analyzed had a conserved DNA binding motif; however, less than half of the computationally predicted motif instances appeared experimentally bound. Moreover, we detected distinct areas of the genome that were either bound by an unusually high number of TFs or depleted of binding even though conserved binding motifs were present. This accessibility property of DNA could be partially explained by binding of nucleoid-associated proteins (LSR2, H-NS). However, the variety of binding site locations and properties also suggested that some transcription factors play an important role in modulating DNA structure.

G20 - GRCh38: a new version of the human reference genome sequence

Short Abstract: The Genome Reference Consortium (GRC, genomereference.org) will release a new human reference assembly, GRCh38, in autumn 2013. Gaps have been bridged, alternate representations have been added for variant regions, and issues have been fixed both within individual clones and at the level of the ordering of clones and contigs. Some of these improvements have already been made available as optional patches for GRCh37. These patches are divided into two types: fix patches represent corrections to the primary assembly, whereas novel patches represent alternative possibilities to the primary assembly in variable regions. As of patch release 10, there are 111 fix patches adding over 5 Mb of novel sequence, and 71 novel patches adding over 800 kb of novel sequence; together, patches of both types affect 2.9% of the sequence. GRCh38 integrates fix patches into the primary assembly, whereas novel patches are retained as alternatives to the primary assembly, and designated alternate loci. GRCh38 also adds improvements beyond those contained in existing patch releases, including the use of data from the 1000 Genomes Project: sequence has been added corresponding to the decoy used in the 1000 Genomes Project, and reference bases have been replaced if 1000 Genomes data suggests that they are erroneous. Centromere sequences are represented in GRCh38 by chromosome-specific models. A draft build of GRCh38 is being used to test the impact of these changes.

G21 - biomvRhsmm: Genomic segmentation and copy number variation analysis with Hidden-semi Markov model

Short Abstract: With high-throughput experiments such as tiling arrays and massively parallel sequencing, large-scale genomic data are growing at an unprecedented rate. Researchers applying these experiments frequently search these genome-wide data for continuous homogeneous segments or signal peaks, which may represent chromatin states, methylation ratios, transcripts, or genomic regions of deletion or amplification. In the R/Bioconductor package biomvRCNS, we implement a novel hidden semi-Markov model (HSMM), biomvRhsmm, which is specifically designed to serve as a general segmentation tool for multiple genomic profiles.

As a generalization of hidden Markov models (HMMs), HSMMs allow the sojourn distribution (the probability distribution of the time spent in the same state) to be specified as something other than the geometric distribution implicitly used in a common HMM. In the package, several types of sojourn distribution are implemented. Besides the flat prior commonly used in Bayesian inference, prior information for the sojourn density can be estimated from annotation or previous studies and then used together with positional information of features to guide the estimation of the most likely state sequence. Within its fully probabilistic model, various emission densities are provided, enabling the model to handle normally distributed data from traditional array platforms as well as count data from sequencing experiments. The proposed model has been tested against the well-studied aCGH dataset from Coriell cell lines and RNA-seq data generated by the ENCODE project to show both its functionality and reliability.
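To illustrate the point about sojourn distributions, the following minimal Python sketch computes the geometric duration distribution that a plain HMM state implies through its self-transition probability, and contrasts it with an explicitly specified duration distribution of the kind an HSMM permits. The function names and the example weights are hypothetical and are not taken from the biomvRCNS package.

```python
def geometric_sojourn_pmf(self_transition, max_duration):
    """Duration distribution implied by a plain HMM state.

    A state with self-transition probability p stays for exactly d steps
    with probability p**(d - 1) * (1 - p), for d = 1, 2, ...
    Returns the probabilities for d = 1..max_duration.
    """
    if not 0.0 <= self_transition < 1.0:
        raise ValueError("self_transition must be in [0, 1)")
    return [
        self_transition ** (d - 1) * (1.0 - self_transition)
        for d in range(1, max_duration + 1)
    ]


def explicit_sojourn_pmf(weights):
    """Duration distribution chosen freely, as an HSMM allows.

    `weights` are hypothetical non-negative scores for d = 1..len(weights),
    for example derived from the lengths of annotated features.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [w / total for w in weights]


# The geometric form always peaks at d = 1, whereas an explicit
# distribution can favour longer segments.
hmm_durations = geometric_sojourn_pmf(0.5, 3)
hsmm_durations = explicit_sojourn_pmf([1, 2, 1])
```

The geometric distribution is always monotonically decreasing in the duration, which is why an explicit sojourn distribution can be useful when typical segment lengths are known in advance.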

G23 - Updating RNA-Seq analyses after re-annotation

Short Abstract: The estimation of isoform abundances from RNA-Seq data requires a time-intensive step of mapping reads to either an assembled or a previously annotated transcriptome, followed by an optimization procedure for deconvolution of multi-mapping reads. These procedures are essential for downstream analyses such as differential expression. In cases where it is desirable to adjust the underlying annotation, for example upon the discovery of novel isoforms or errors in existing annotations, current pipelines must be rerun from scratch. This makes it difficult to update abundance estimates after re-annotation, or to explore the effect of changes in the transcriptome on analyses.

We present a novel efficient algorithm for updating abundance estimates from RNA-Seq experiments upon re-annotation that does not require re-analysis of the entire dataset. Our approach is based on a fast partitioning algorithm for identifying transcripts whose abundances may depend on the added or deleted isoforms, and on a fast follow-up approach to re-estimating abundances for all transcripts. We demonstrate the effectiveness of our methods by showing how to synchronize RNA-Seq abundance estimates with the daily RefSeq incremental updates. Thus, we provide a practical approach to maintaining relevant databases of RNA-Seq derived abundance estimates even as annotations are being constantly revised. Our methods are implemented in software called ReXpress and are freely available, together with source code, at http://bio.math.berkeley.edu/ReXpress/
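One way to picture the partitioning idea is sketched below in Python. It assumes, as an illustration only, that transcripts are linked whenever a read maps to both of them, so that the transcripts whose abundances may depend on a changed isoform are those in the same connected component. The function name and the input layout are hypothetical, and this is not the ReXpress implementation.

```python
def affected_transcripts(read_to_transcripts, changed):
    """Return transcripts connected to any changed isoform via shared reads.

    read_to_transcripts: dict mapping a read id to the transcripts it maps to
                         (hypothetical input layout).
    changed: iterable of added or deleted isoform ids.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Transcripts that share a multi-mapping read end up in one component.
    for transcripts in read_to_transcripts.values():
        transcripts = list(transcripts)
        for t in transcripts:
            find(t)
        for other in transcripts[1:]:
            union(transcripts[0], other)

    changed_roots = {find(t) for t in changed}
    return {t for t in list(parent) if find(t) in changed_roots}


reads = {
    "r1": ["A", "B"],
    "r2": ["B", "C"],
    "r3": ["D"],
    "r4": ["D", "E"],
}
to_reestimate = affected_transcripts(reads, {"C"})
```

In this toy input, changing isoform C touches only the component containing A, B and C, so transcripts D and E would not need to be re-estimated under this assumption.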

G24 - Parseq: Transcriptional Landscape reconstruction from RNA-Seq data

Short Abstract: The most common RNA-Seq strategy is based on random shearing, amplification, and high-throughput sequencing of the RNA fraction. This produces millions of sequence reads that serve to characterize whole-genome transcriptional profiles. Significant effort has been made to reconstruct transcript structures from these data by read assembly. However, there is also a need for methods to analyze, from read count profiles, the full complexity of the transcription level variations along the genome. We developed a statistical approach for transcription level inference at base-pair resolution. A State Space Model serves to describe, in terms of abrupt shifts and more progressive drifts, the dynamics of the transcription level along the genome. Alongside variations of transcription level, our model also incorporates a component of short-range variation to separate out local artifacts causing correlated overdispersion. Reconstruction of the transcription level relies on a Sequential Monte Carlo approach known as Particle Gibbs that is combined with parameter estimation in a Markov chain Monte Carlo algorithm. The approach makes it possible to estimate the local transcription level, to call transcribed regions, and to identify transcript borders. Evaluation was carried out on synthetic and real data sets, and the approach was shown to outperform traditional strategies based on read overlapping.

G25 - A Simple and Effective Technique for Assisted Genome Assembly

Short Abstract: While allowing for high-quality genome assembly, we aim at reducing the number of reads, and therefore the costs, required to obtain the genome of an organism. For this purpose, we devise a new approach based on the paradigm of assisted genome assembly, i.e., we are given the genome of a genetically related species (reference genome), and we aim at exploiting this information during the process of genome assembly.

Here, we propose a simple but surprisingly effective technique for assisted genome assembly. We generate artificial reads from the reference genome, and add them to the input dataset of the de novo assembler Velvet.
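The read-generation step can be sketched as follows in Python. The abstract does not state how the artificial reads are sampled, so the fixed read length, the sliding-window step, and the function and record names here are all assumptions chosen for illustration. The FASTA output is simply one plausible way to hand the reads to an assembler alongside the real reads.

```python
def artificial_reads(reference, read_length, step):
    """Cut overlapping fixed-length reads from a reference sequence.

    Windows start every `step` bases; a trailing window shorter than
    `read_length` is dropped. Both parameters are hypothetical choices.
    """
    if read_length <= 0 or step <= 0:
        raise ValueError("read_length and step must be positive")
    last_start = len(reference) - read_length
    return [
        reference[i:i + read_length]
        for i in range(0, last_start + 1, step)
    ]


def to_fasta(reads, prefix="ref_read"):
    """Format reads as FASTA text with hypothetical record names."""
    return "".join(
        f">{prefix}_{index}\n{read}\n" for index, read in enumerate(reads)
    )


reads = artificial_reads("ACGTACGT", read_length=4, step=2)
fasta_text = to_fasta(reads)
```

A smaller step yields denser coverage of the reference, at the cost of a larger artificial read set.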

We show that our approach allows for a substantial reduction in the number of reads needed for an accurate reconstruction, and therefore reduces the costs. In particular, while using an order of magnitude fewer reads, the contigs produced by our approach have approximately the same quality as the contigs of the non-assisted de novo assembly. Furthermore, we show that the contigs produced by our simple approach from the reduced sample are of better quality than those produced by Amos, one of the state-of-the-art assisted assembly tools, in terms of coverage of the genome, accuracy of the base pairs and completeness of the assembly (N50).
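For reference, N50 is commonly defined as the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. The Python sketch below computes it from a list of contig lengths. The function name is illustrative and the example lengths are made up.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("contig_lengths must not be empty")


example_n50 = n50([2, 3, 4, 5, 6])
```

Here the total length is 20, and the two longest contigs (6 and 5) already cover at least 10 bases, so the N50 is 5.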

G26 - Ensembl Chicken Gene Annotation

Short Abstract: Ensembl (www.ensembl.org) provides automatic genome annotation for over 50 vertebrate species including chicken. New multiple species alignments, gene trees and variation features are made freely available online in each release. These data can be accessed via the Ensembl Browser, MySQL databases, a Perl API and the BioMart query system.

Gallus gallus (chicken) is a model organism for birds and for development in vertebrates, as chickens produce large, robust embryos whose development occurs outside the body of the mother.

The Gallus_gallus-4.0 assembly of the chicken genome was produced in November 2011 by the International Chicken Genome Consortium. The gene annotation was made public in Ensembl release 71 (April 2013). The high-coverage chicken genome was annotated using the Ensembl genebuild pipeline, incorporating RNA-seq data from 10 different tissues.

The initial set of analyses masked 12% of the genome as repeat features. Subsequent analyses on the masked genome included identification of CpG islands and transcript start sites, together with ab initio gene predictions and alignment of protein, cDNA and EST sequences from UniProt, RefSeq and ENA. This led to the construction of a gene set following a process in which various levels of evidence, parameters and filtering steps are taken into account.

Evidence from chicken proteins, cDNA and RNA-seq data, together with vertebrate proteins, contributed to the final gene set of 17,108 genes. Of these, 15,508 are protein-coding genes, 1,558 are short non-coding RNA genes and 42 are pseudogenes.

G27 - Deciphering the association between gene function and spatial gene-gene interactions in 3D genome conformation

Short Abstract: The three-dimensional (3D) genome conformation plays an important role in regulating gene expression. Chromosome regions in spatial proximity (i.e. contacts) can be readily detected by chromosome conformation capture techniques such as Hi-C. We analyze the inter- and intra-chromosome contact data of primary tumor B-cells determined by the Hi-C technique. From the chromosome contact data, we derive the spatial gene-gene interactions across the entire human genome, in order to investigate whether spatially interacting genes have similar functions. We compare the functional similarity of gene pairs that do not spatially interact with that of gene pairs that have substantial interactions (i.e. >= 18 spatial contacts observed in the Hi-C data). Comparing the distributions of function similarity for the two groups of gene pairs reveals that, although a large portion of the interacting genes have a function similarity distribution similar to that of the non-interacting genes, a sub-group of interacting genes tend to have the same functions. We carry out further analysis to identify other determinants that can define this sub-group of interacting genes with high function similarity. We find that gene pairs with >= 100 spatial contacts tend to have the same function or high function similarity; interacting gene pairs with high function similarity tend to have short genomic distances; and interacting gene pairs with >30% sequence identity often have the same gene function. The results suggest that gene pairs with substantial spatial interactions tend to have higher function similarity, and that spatial interaction information is complementary to genomic distance and sequence identity.

G28 - Enhancer activity patterns and interactions in 100 epigenomes

Short Abstract: The NIH Roadmap Epigenomics Program has generated a large resource of epigenomic marks, including histone modification patterns in both primary human tissues and human cell lines, with the goal of creating global reference maps of regulatory elements and studying their biological roles.

These data can be used to generate chromatin state maps by learning combinations of histone modification patterns indicative of different functional classes. We find that these result in ~500,000 active and poised enhancer regions, and ~120,000 active and poised promoter regions across cell types, as well as strongly and weakly transcribed regions, repressed regions, and heterochromatic regions.

We use these epigenomic maps to cluster regulatory regions into 55 enhancer modules and 70 promoter modules of coordinated activity across cell types. We find that the vast majority of enhancers and promoters are cell type restricted, and highly enriched in developmental processes. Surprisingly, only a small percentage of promoter regions are constitutively active, suggesting a higher similarity between enhancer and promoter regions than previously recognized.

It is known that spatial proximity of genomic loci in 3D is highly correlated with co-ordinated activity. We are making use of epigenomic chromatin state maps to derive statistical prediction rules for inter-locus interactions. We show that we can predict these 3D genomic interactions using a combination of sequence-based information as well as chromatin state make-up and transcription factor binding behavior of interacting fragments.

G29 - On the discrepancies between mRNA and genome sequences in human reference databases.

Short Abstract: In living organisms, mRNA sequences are transcribed copies of genomic DNA sequences, so the two are expected to match, except for minor RNA modifications such as RNA editing. Reference sequences are currently publicly available both for mRNAs (e.g. RefSeq) and for genomes (e.g. GRCh37), but these two types of reference sequences are known to differ from each other for some mRNA sequences. In this report we performed a comprehensive comparison between mRNA and genome sequences in human reference databases and identified the discrepancies between them. We found that more than 2,000 mRNA sequences include mismatching bases and more than 200 include in-dels, compared with the corresponding sequences in the reference genome. Some of the mRNAs contained regions of more than 100 bases that are missing in the reference genome, but when we searched for these regions in the genomic sequences of other assemblies, most of them could be found. We compared the mRNA sequences with reference genomes of different versions from 2004 to 2009, but the decrease in mismatches over time was not substantial. These discrepancies may be due to the fact that the reference genome sequence was based on a small number of individuals and does not always include all the mRNA sequences that can be found in humans.

G30 - Large-scale analysis of transcription initiation uncovers widespread tissue-specific alternative promoter usage in human

Short Abstract: High-throughput characterization of transcriptomes reveals that the landscape of transcription is more complex than previously thought. Application of cap-based, 5’ end-centered RNA tagging methods (i.e. CAGE), coupled with NGS platforms has enabled the massive identification of TSSs for a number of model systems, including human. Observed patterns of TSSs indicate that positions of transcription initiation in metazoans are variable, typically forming distributions at the promoter, which we define as the transcription start region (TSR). Due to the variability of TSSs within putative promoters and the presence of numerous transcripts of unannotated function observed within transcriptomes, precisely identifying TSRs is challenging.
To address this, we developed a computational methodology and tool, TSRchitect, that identifies putative promoters on a genome-wide basis from large-scale TSS profiling information. We applied TSRchitect to CAGE data derived from multiple human cell types and annotated the TSRs of expressed genes. Overall, we find that a sizable fraction of human genes utilize multiple promoters according to our criteria. Considering the positions of annotated TSRs across the surveyed cell types, we also observe extensive tissue-specific alternative and differential promoter usage, consistent with other recent work (Batut et al., 2012). These results suggest widespread alternative 5’ exon use, which may alter the coding potential of the associated transcripts. In future work, we will seek to experimentally confirm selected results and initiate a comparative genomic approach in mouse. We will also further evaluate cases of putative alternative promoter usage of biological interest, particularly those relating to disease-associated processes and genes.
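The notion of grouping variable TSS positions into transcription start regions can be illustrated with a simple gap-based clustering in Python. This is only a sketch under the assumption that nearby TSSs on one strand belong to the same region. The abstract does not describe how TSRchitect defines TSRs, and the function name and the `max_gap` parameter are hypothetical.

```python
def cluster_tss(positions, max_gap):
    """Group TSS positions on one strand into candidate regions.

    Consecutive sorted positions separated by at most `max_gap` bases are
    merged. Returns a list of (start, end, tss_count) tuples.
    """
    regions = []
    for pos in sorted(positions):
        if regions and pos - regions[-1][1] <= max_gap:
            # Extend the current region and count the extra TSS.
            regions[-1][1] = pos
            regions[-1][2] += 1
        else:
            # Start a new region at this position.
            regions.append([pos, pos, 1])
    return [tuple(region) for region in regions]


candidate_regions = cluster_tss([100, 105, 103, 500, 510, 900], max_gap=20)
```

With these made-up positions the sketch yields three candidate regions, which is how a single gene could be assigned more than one putative promoter.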

G31 - Accelerated BLAST: a comparison of FPGA versus GPU and CPU implementations

Short Abstract: A number of technologies have emerged for accelerating similarity search algorithms in bioinformatics, including FPGA, GPU, and standard CPU clusters. We present Tera-BLAST, an FPGA-accelerated implementation of the BLAST algorithm, and compare its performance to GPU-accelerated BLAST and the industry-standard NCBI BLAST+ on high-performance computers. Tera-BLAST, running on the TimeLogic J-series FPGA Similarity Search Engine, performs hundreds of times faster than BLAST running on generic Tesla M2090 GPU cards or standard multi-core CPUs.
