20th Annual International Conference on
Intelligent Systems for Molecular Biology


Poster numbers will be assigned May 30th.
If you cannot find your poster below, you probably have not yet confirmed that you will be attending ISMB/ECCB 2015. To confirm your poster, find the poster acceptance email; it contains a confirmation link. Click on it and follow the instructions.

If you need further assistance, please contact submissions@iscb.org and provide your poster title or submission ID.

Category J - 'Pathogen informatics'
J01 - Comparing maximal exact repeats in human genome assemblies using a normalized compression distance
Short Abstract: Maximal exact repeats are perfect repeats that cannot be further extended without loss of similarity, and which are important for seeding alignment of sequence reads in genome assembly and as anchor points in comparisons of closely related genomes. We compare the content in maximal exact repeats in four human genome assemblies using a normalized compression distance. The normalized compression distance is an information theory-based similarity measure quantifying how much is gained in coding a data set after knowing a different data set. We use multiple competing Markov models for sequence compression. We find greater similarity between assemblies sequenced with capillary-based technologies than between assemblies sequenced with massively parallel technologies. We also find the variation in maximal exact repeats to be more significant for large and very large maximal exact repeats, where the biases of sequencing and assembly become more pronounced. Our results reveal the well-known difficulty in disambiguating repetitive sequences in the de novo assembly of large and repeat-rich genomes, as well as the dependencies between smaller and larger maximal exact repeats within and between assemblies.
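The normalized compression distance the abstract describes can be sketched in a few lines. The sketch below substitutes zlib for the compressor purely for illustration; the poster's multiple competing Markov-model compressors are not reproduced here.

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: how much coding one data set
    is helped by already knowing the other. Values near 0 indicate
    near-identical information content; values near 1, unrelated data."""
    cx = len(zlib.compress(x, 9))
    cy = len(zlib.compress(y, 9))
    cxy = len(zlib.compress(x + y, 9))
    return (cxy - min(cx, cy)) / max(cx, cy)

# A sequence compared with itself should score much closer to 0
# than two unrelated sequences would.
repeat = b"ACGTACGGT" * 300
```

A genome-scale comparison would replace zlib with a DNA-aware compressor such as the Markov-model mixtures mentioned in the abstract; the distance formula itself stays the same.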
J02 - Genome-wide total RNA and MBD-sequencing in HCT116 and DKO cells as a global re-expression model
Short Abstract: Background: It has been shown that gene promoter methylation suppresses expression of the downstream gene. This mechanism is involved in a number of diseases, such as cancer, where tumor suppressor genes can be methylated and thus suppressed.
Methodology: We performed a genome-wide methylation analysis of a colon cancer cell line (HCT116) using Methyl-Binding-Domain (MBD) capture followed by next-generation sequencing. We also performed a genome-wide total RNA (mRNA, miRNA, snoRNA, ...) analysis of HCT116 and of an HCT116 line with both DNMT1 and DNMT3B knocked out (DKO) using directional next-generation RNA sequencing. Both the methylation and RNA data were analysed using Illumina next-generation sequencing technology.
Results: Using computational analysis, comparing the expression in HCT116 and DKO, and taking the methylation profile into account, we were able to identify pathways that are directly regulated by methylation and pathways that are regulated in an indirect fashion. Furthermore, we were able to confirm known methylation-regulated protein-coding genes, miRNAs (e.g. hsa-mir-146a) and snoRNAs (e.g. SNORD116) in a single comprehensive sequencing experiment. We could also identify exons and alternative transcripts whose expression is regulated by methylation, confirming the link between alternative splicing and methylation.
J03 - Toward genomic data sharing directly at the lab site in the new agent-based infrastructure OpenKnowledge
Short Abstract: Over the next decade, functional genomics and metagenomics will produce increasingly massive data sets by the thousands. Such data volumes will preclude centralized, repository-based access to the raw data as the sole solution. The Cloud will help, but storage and transfer costs make it likely that the producing laboratories will only upload selected data. The early-stage, peer-to-peer infrastructure OpenKnowledge (www.openk.org) could enable controlled access to data sets which might otherwise remain hidden, through searches directly at the lab sites. This is in addition to searching other data sets at centralized repositories; i.e. the currently prevailing access option is retained by the new infrastructure. Data-sharing interactions in OpenKnowledge (OK) are implemented as simple query-answering protocols written in OK's interaction language, LCC (Lightweight Coordination Calculus). While LCC protocols and OK can also support more advanced, customized data analyses, this poster focuses on raw data sharing using OK. Two sharing scenarios have been implemented to date, for genomic and proteomic data. Though the search and user interface software are customized for these scenarios, the underlying interaction protocols can be reused to share data of any type between labs. For example, sharing raw Sanger sequencing "trace" chromatograms using new software, SeqDoC+, helps investigate polymorphic sites in closely related bacterial strains. In this new OK demo we envision a researcher who searches a remote lab site for DNA sequences similar to a query sequence, and then compares the corresponding traces directly using an analysis pipeline based on the SeqDoC algorithm (research.imb.uq.edu.au/seqdoc/).
J04 - Massively parallel transcriptome sequencing reveals specific mRNA splicing patterns in stage 4 and stage 4S Neuroblastoma controlled by MYCN regulated RNA binding splicing factors
Short Abstract: Neuroblastoma (NB) is the most common solid extracranial tumor in children. Stage 4 NB is characterized by its clinically heterogeneous outcome. The special stage 4S subgroup (2-5% of all NB) has a cure rate above 90%. On the other hand, MYCN amplification (25-30% of all NB) is associated with poor outcome (26% event-free survival rate for patients younger than 18 months). Alternative splicing events have crucial roles in tumorigenesis and progression. High-throughput transcriptome sequencing experiments allow us to quantify all genes and their isoforms across samples. By sequencing analysis of 29 stage 4 neuroblastoma tumors, we discovered an alternative splicing signature for MYCN-amplified stage 4 tumors compared to MYCN non-amplified 4S tumors. Pathway analysis demonstrates that splicing-perturbed genes have enriched roles in cancer hallmark biological functions. Interestingly, splicing factors from the RBFOX, CELF, and hnRNP families are differentially expressed between tumor subgroups. The regulatory motif sequences of these splicing factors are enriched in introns adjacent to alternatively spliced exons. Moreover, the expression patterns of the splicing factors RBFOX1, RBFOX3, PTBP1, CELF2 and CELF6 are highly correlated with the expression of MYCN. MYCN induction and knockdown experiments on NB cell lines show evidence of MYCN regulation of these splicing factors. This study systematically examines and compares the splicing programs in stage 4 NB subgroups. Moreover, it demonstrates that in addition to MYCN's well-characterized role in transcriptional regulation, it indirectly regulates splicing events by controlling the expression of splicing factors.
J05 - Comparing methods detecting differentially expressed genes in RNA-seq data
Short Abstract: Second Generation Sequencing (SGS) paved the way for a deeper understanding of cellular processes characterized by the expression of specific genes. RNA-seq, as an SGS-based approach, is evolving into the go-to technique for analyzing the transcriptome. One of the main goals in transcriptomics is to identify differentially expressed genes across different conditions.

In this work we compare the most commonly used RNA-seq methods that detect differentially expressed genes and provide a more extensive comparison than previously published. Toward this end we define evaluation criteria that take into account all thresholds for considering a gene as differentially expressed. These measures are well established in the machine learning community and provide a more appropriate performance measure than fixing an arbitrary threshold for viewing a gene as differentially expressed. Both the sensitivity and the specificity are combined into one value using these measures. Simulated datasets allow us to assess the performance of the competitors for various coverages and fold changes while knowing the ground truth.
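One well-known family of threshold-free measures that combines sensitivity and specificity over all cutoffs is the area under the ROC curve. The minimal rank-based computation below (the Mann-Whitney formulation) illustrates the idea; it is not necessarily the exact criterion the poster uses.

```python
def roc_auc(scores_pos, scores_neg):
    """Area under the ROC curve via the Mann-Whitney U statistic:
    the probability that a truly differentially expressed gene
    receives a higher score than a non-differential one
    (ties count one half). No threshold needs to be fixed."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))
```

An AUC of 1.0 means every truly differential gene outranks every non-differential one; 0.5 is no better than chance.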
J06 - Chromatin state dynamics in genomic regulatory blocks across six human cell lines
Short Abstract: Genomic regulatory blocks (GRBs) are chromosomal blocks spanned by highly conserved non-coding elements (HCNEs), the majority of which serve as regulatory inputs for one target gene in the region. Target genes are often transcription factors implicated in embryonic development, and are often located tens of kilobases away from their regulatory inputs. It has previously been shown that target genes have certain sequence characteristics that discriminate them from other sets of genes. In this study we used 15-state chromatin data from six human cell lines in combination with transcription data to uncover the chromatin state dynamics across different subsets of genes and cell lines. We classified the genes in the human genome into four groups, namely GRB target genes, bystander genes, CpG genes, and non-CpG genes. Each group was further divided into 'not expressed', 'low expressed', 'medium expressed' and 'highly expressed' subsets based on expression data. We found that the predominant differences across gene sets lie in chromatin state 3 (poised promoter) and state 13 (Polycomb repressed).
J07 - The Encrypted Germline Genome of the Ciliate Oxytricha
Short Abstract: The unicellular ciliate Oxytricha trifallax exhibits an unusual genomic organization with two types of nuclei: a transcriptionally active somatic macronucleus and a germline micronucleus. After sexual conjugation, the somatic nucleus develops from a copy of the germline nucleus through a process that involves elimination of 95% of the ~1Gb germline DNA, including a large number of transposons, satellite repeats and noncoding sequences that interrupt gene segments, as well as massive chromosome fragmentation producing gene-sized ‘nanochromosomes’. In addition, not only are germline copies of genes interrupted by noncoding elements, but thousands of germline loci also have a scrambled structure, with gene segments present in a different order relative to their order in the soma. Therefore, somatic nuclear development also requires rearrangement of the gene fragments by inversions or permutations to decrypt and assemble functional genes.
While the assembly of Oxytricha's somatic genome is complete (Swart et al., in preparation), much less is known about the germline genome, with its scrambled gene patterns and vast excess of noncoding DNA. We present our efforts to sequence and assemble the germline genome of Oxytricha. Analysis of the content and architecture of this encrypted genome has revealed intriguing new mysteries. By comparing the somatic and germline genomes, we provide novel insights into the processes of genome rearrangement, such as internal deletion of noncoding sequences, unscrambling of gene segments and chromosome fragmentation.
J08 - Gene Annotator Tool
Short Abstract: The Gene Annotator Tool is a one-stop functional analysis tool for human, rat and mouse genes. The user can upload lists of gene symbols, Ensembl IDs, NCBI RefSeq IDs, GenBank sequence IDs, or Rat Genome Database gene IDs for any of the three organisms to retrieve comprehensive functional data and multiple identifiers with links to data at various sources. Once a list of genes has been submitted, the user can customize desired results through an interface. Gene Ontology annotations, Mammalian Phenotype annotations, Disease annotations, NeuroBehavioral Ontology annotations and Pathway Ontology annotations are included, and there are 45 identifier/link types for gene, sequence, protein, pathway, disease and interactions. Users may retrieve data from only one organism or from the indicated gene and corresponding orthologs. Results are available in individual gene reports or as a downloadable file. The genes can also be viewed in a Genome Viewer which allows users to upload additional tracks such as QTL, markers and ontology annotations. The Genome Viewer also allows the user to navigate to the GBrowse tool, providing the ability to view genes within a more detailed context including QTL, SNPs and other variations, transcripts, disease tracks and others. The Gene Annotator tool provides a quick and easy way to retrieve comprehensive functional data from multiple organisms.
J09 - RNA-seq and Exome-seq analysis of two Ewing's sarcoma cell lines that show differential drug sensitivity
Short Abstract: Ewing's sarcoma is a rare form of malignant bone tumor that strikes children and adolescents. Understanding the underlying mechanisms involved in the development of Ewing's sarcoma forms the key step in the design and development of new therapeutic interventions. Genomic alterations have been known to play an important role in oncogenesis, disease progression and the response of tumors to different therapies. The advent of next generation sequencing-based methods provides unprecedented capabilities to scan genomes for changes such as mutations, insertions/deletions, copy number variations, differential gene expression and new gene fusions. In this study, we performed RNA-seq and exome-capture-seq analyses of two Ewing cell lines, RDES and SKES1, which show differential sensitivity to natural product extracts. We studied the gene expression profiles from RNA-seq reads and identified approximately 250 differentially expressed genes between drug-sensitive and insensitive cell lines. Gene fusion products were also identified from the RNA-seq analysis. Both cell lines were found to harbor the well-known Ewing's sarcoma gene fusion EWSR1-FLI1. In addition to EWSR1-FLI1, we also discovered some new and unique gene fusions in both cell lines. Furthermore, we used exome-capture sequence analysis to identify sequence variation and mutations in these two cell lines. Known mutations reported in the COSMIC database and SNPs that are not found in normal populations from the 1000 Genomes Project were carefully studied. Interestingly, some of the non-synonymous changes identified map to kinases and other known cancer-related genes, suggesting candidate targets for drug resistance/sensitivity studies.
J10 - Whole genome bioinformatics analysis of glycoengineered Pichia pastoris strains with next generation sequencing technologies
Short Abstract: Glycoengineered yeast Pichia pastoris provides a novel platform for producing therapeutic biologics. To further improve strain attributes, we have identified several mutant strains displaying improved cell robustness or reduced O-glycosylation. Whole genome sequencing of those strains will help us understand the genomic sequence changes driving desirable phenotypes, allowing translation of these properties to next-generation production strains. Genomic DNA samples were sequenced using next-generation sequencing (NGS) technologies, generating millions of short reads for each strain. To address the requirements of sequence analysis, we developed a Perl-based computational pipeline running on Merck's high-performance computer cluster. The pipeline integrates many public or open-source analysis algorithms and tools, such as the Burrows-Wheeler Aligner and the Genome Analysis Toolkit. The data generated are also exported into formats that can be easily visualized in the graphical viewers of the University of California Santa Cruz Genome Browser and the Broad Institute's Integrative Genomics Viewer. We have located the mutations, insertions and deletions of 22 selected Pichia strains so far. We identified causal mutations for temperature tolerance, establishing the Acquired Temperature Tolerance (ATT1) gene as a key regulator of bioreactor robustness, and have filed a patent application. In summary, we have evaluated the quality of NGS and demonstrated its utility for glycoengineered yeast strain optimization. Specifically, this novel approach is already leading to an understanding of specific genetic changes underlying observed phenotypic variation in these Pichia strains, and is already having significant impact on several important biologics programs.
J11 - A Novel Algorithm for Accurate and Efficient Alignment of DNA Sequencing Reads
Short Abstract: The length of DNA sequencing reads has increased as next-generation sequencing technology advances, creating a need to handle many mismatches and long structural variations, including insertions, deletions and inversions, in the reads efficiently. We developed an efficient and accurate alignment algorithm that meets those needs. The algorithm employs a special form of hash table that resembles an adjacency matrix. A reference genome sequence and reads can be represented as graph data structures: each non-overlapping k-mer of the reference and read sequences forms a node, and two consecutive nodes form a directed edge. Each cell of the adjacency matrix contains several starting bases of an edge. For each n-base read, our algorithm performs ⌈n/k⌉ matrix lookup operations, each of which takes O(1) time. Theoretically, up to 5 mismatches and structural variations of up to 70 bases are allowed while preserving the mapping speed, when the read length is 100 and k is 6. Our algorithm was tested on simulated reads from chromosomes 1 and 19 generated with SAMtools (Li et al., 2009), and showed better mapping accuracy and speed than three popular read alignment algorithms.
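A toy version of the k-mer lookup idea illustrates why mapping costs roughly ⌈n/k⌉ constant-time operations per read. The sketch below indexes every k-mer of the reference in a plain hash table and handles exact matches only; the poster's adjacency-matrix structure, mismatch tolerance and structural-variation handling are not reproduced.

```python
from collections import defaultdict

K = 6  # k-mer size, as in the poster's example parameters

def index_reference(ref: str) -> dict:
    """Map each k-mer to the reference offsets where it starts.
    A toy stand-in for the poster's adjacency-matrix hash table."""
    table = defaultdict(list)
    for i in range(len(ref) - K + 1):
        table[ref[i:i + K]].append(i)
    return table

def map_read(read: str, table: dict) -> list:
    """Look up ceil(n/k) non-overlapping k-mers of the read (O(1) each)
    and vote for candidate start positions; exact matches only here."""
    votes = defaultdict(int)
    for j in range(0, len(read) - K + 1, K):
        for pos in table.get(read[j:j + K], []):
            votes[pos - j] += 1
    # Candidates sorted by number of supporting k-mers, best first.
    return sorted(votes, key=votes.get, reverse=True)

# Demo: a read copied from reference offset 7 should map back there.
ref = "ACGTAGCTTGCAATCCGGATTACGGTTACAGGCTTAAGTC"
read = ref[7:25]
```

With an 18-base read and k = 6, `map_read` performs exactly three table lookups.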
J12 - Entropy based assessment of cell type-specific and frequent histone modification
Short Abstract: Histones that are modified only in certain cell types, or that are modified in most cell types, are linked to cell type-specific or shared cellular functions, respectively. To evaluate variation of histone modification patterns across cell types, we introduced two measurements (Hs and Hu). They are based on Shannon entropy, which has been applied to the analysis of the cell type specificity of gene expression and DNA methylation patterns. Unlike previous definitions of entropy, Hs efficiently deals with 0, which corresponds to no modified histones, an observation very common in histone modification data. In addition, Hu can evaluate both variances and frequencies; the two measures are specialized to assess the cell type specificity and the frequency of histone modifications, respectively. Applying them to ChIP-Seq data obtained in seven human cell types, we extracted cell type-specific and shared modifications, which were subsequently inspected visually. Comparing entropies of gene expression patterns with Hs of histone modifications around genes, we found a correlation between the specificity of histone modifications and gene expression patterns. Interestingly, we found differences in the histone modification specificities around enhancers and TSSs for some modification types, and genes with frequently modified histones were associated with CpG islands around TSSs compared to genes with cell type-specific histone modifications. These two measurements therefore enabled us to extract novel characteristics of histone modifications. We believe that in the future these measures will be useful for epigenetic research to uncover associations between epigenetic regulation and cellular functions.
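The idea of scoring cell type specificity with Shannon entropy can be sketched generically. The function below illustrates the principle, not the poster's exact Hs or Hu definitions: low entropy flags a modification concentrated in few cell types, high entropy one shared across most of them, and zeros are handled by the standard convention 0 * log(0) = 0.

```python
from math import log2

def specificity_entropy(levels):
    """Shannon entropy of histone-modification levels across cell types.
    `levels` holds one non-negative signal value per cell type.
    Low entropy = cell type-specific; high entropy = shared.
    (Generic illustration; the poster's Hs refines the zero handling.)"""
    total = sum(levels)
    if total == 0:
        return 0.0  # no modification anywhere
    h = 0.0
    for x in levels:
        if x > 0:  # convention: 0 * log(0) = 0
            p = x / total
            h -= p * log2(p)
    return h
```

A modification present in only one of seven cell types scores 0; one uniformly present in all seven scores log2(7) ≈ 2.81.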
J13 - Primer3 - New Capabilities and Interfaces
Short Abstract: Polymerase chain reaction (PCR) is a basic molecular biology technique with a multiplicity of uses, including DNA cloning and sequencing, functional analysis of genes, diagnosis of diseases, genotyping, and discovery of genetic variants. PCR has also recently become important for template preparation for next-generation sequencing and for the technical confirmation of genetic variants detected using this technology. Reliable primer design is crucial for the success of PCR, and for over a decade, the free and open-source Primer3 software (http://primer3.sourceforge.net/) has been widely used for primer design and has been incorporated into numerous web services. During this period, we have greatly expanded Primer3’s functionality. The most notable enhancements incorporate more accurate thermodynamic models in the primer design process, both to improve melting temperature prediction and to reduce the likelihood that primers will form hairpins or dimers or will hybridize to unintended sites in the template. Additional enhancements include more precise control of primer placement—a change motivated partly by opportunities to use whole-genome sequences to improve primer specificity. We also added features to increase ease of use, including the ability to save and re-use parameter settings and the ability to require that individual primers not be used in more than one primer pair. We have made the core code more modular and provided cleaner programming interfaces to further ease integration with other software. A substantially improved web interface, Primer3Plus, now offers simpler ways to accomplish specific primer design tasks. These improvements position Primer3 for continued use in the decade ahead.
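As a toy illustration of why melting-temperature prediction matters for primer design, the classic Wallace rule estimates Tm from base composition alone. This is emphatically not Primer3 code: Primer3's nearest-neighbor thermodynamic models are far more accurate than this rule of thumb.

```python
def wallace_tm(primer: str) -> float:
    """Crude melting-temperature estimate by the Wallace rule:
    2 degrees C per A/T and 4 degrees C per G/C. Useful only as a
    rough guide for short oligos; Primer3 uses nearest-neighbor
    thermodynamics instead."""
    p = primer.upper()
    at = p.count("A") + p.count("T")
    gc = p.count("G") + p.count("C")
    return 2.0 * at + 4.0 * gc
```

Even this crude formula shows why GC-rich primers melt at higher temperatures, the effect Primer3's thermodynamic models capture precisely.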
J14 - Comprehensive analysis of RNA and DNA differences in non-small cell lung cancer
Short Abstract: Recent reports of widespread differences between DNA and RNA sequences drew significant attention to the role of RNA editing. Our experimental data can serve as an independent survey of RNA-DNA difference (RDD) events, since all three tiers of whole genome, transcriptome, and proteome data are available for identical samples at high coverage. We set out to test the validity of RDD events and investigate their meaning in cancer biology. A principal concern in bioinformatics analysis of RDD events is the reliability of the mapping program: accurate alignment of reads to the reference genome is a prerequisite for identifying RDD events as well as mutations. We tested two programs (BWA and Bowtie) for mapping the whole genome and RNA sequencing data. Since the list of RDD candidates was highly dependent on the choice of transcriptome mapping program, we fixed the transcriptome mapper as Bowtie and then used both BWA and Bowtie for mapping genomic sequences. We obtained many RDD candidate sites and performed Sanger sequencing and LC-MS/MS peptide sequencing to confirm the RDD events. The proteome data were thoroughly examined for peptides from the candidate RDD sites. We checked for the possible presence of synonymous RDDs in the neighboring region, and found that most nonsynonymous RDD sites are accompanied by neighboring synonymous RDD sites, both in the predictions and in the Sanger sequencing data. Interestingly, most RDD-affected genes were single-exon genes. This suggests that extraordinarily long single-exon genes might be the primary target of an RNA editing process of unknown mechanism.
J15 - DNA methylome analysis using reduced representation bisulfite sequencing data
Short Abstract: Reduced representation bisulfite sequencing (RRBS) is a cost-efficient method for DNA methylation profiling, providing absolute levels of DNA methylation at single-nucleotide resolution [1]. CpG-rich regions (CpG islands and promoter regions) are enriched by selecting 40-220-base-pair fragments from MspI-digested DNA. For each covered CpG, this method provides the number of methylated and the number of unmethylated reads spanning that site. So far, only a few approaches to process and analyze this kind of data have been published.

We propose an algorithm to detect differentially methylated regions (DMRs) in cancer versus control samples. This algorithm is divided into four steps: 1.) Find clusters of covered CpGs considering all samples. 2.) Predict methylation levels for a set of grid points in these CpG clusters for each sample [2]. 3.) Estimate the group effect along the predicted sites. 4.) Define the DMRs according to the significance and the level of group effects.
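Step 1 of the algorithm can be sketched as a simple gap-based grouping of covered CpG positions. The `max_gap` and `min_sites` values below are illustrative defaults, not the parameters used in the poster.

```python
def cpg_clusters(positions, max_gap=300, min_sites=3):
    """Group sorted CpG positions (covered in any sample) into clusters
    whenever consecutive sites lie at most `max_gap` bp apart; clusters
    with fewer than `min_sites` CpGs are discarded. Thresholds here are
    illustrative choices only."""
    clusters, current = [], [positions[0]]
    for pos in positions[1:]:
        if pos - current[-1] <= max_gap:
            current.append(pos)
        else:
            if len(current) >= min_sites:
                clusters.append(current)
            current = [pos]
    if len(current) >= min_sites:
        clusters.append(current)
    return clusters
```

Steps 2-4 would then smooth methylation levels within each cluster, estimate the cancer-versus-control group effect along the smoothed sites, and call DMRs from the significant, sufficiently large effects.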

We applied this algorithm to RRBS data of 18 patients with acute promyelocytic leukemia and 16 controls and detected 1,458 DMRs in 26,849 CpG clusters with at least 30% methylation difference while accounting for gender effects. An analysis of the presence of binding sites in the detected DMRs revealed under- and overrepresentation of several binding sites, e.g. c-Myc and SUZ12.

[1] Meissner A et al. Genome-scale DNA methylation maps of pluripotent and differentiated cells. Nature 2008.
[2] Hansen KD et al. Increased methylation variation in epigenetic domains across cancer types. Nature Genetics 2011.
J16 - Integrative analysis of histone ChIP-seq and gene expression microarray data using Bayesian mixture models
Short Abstract: Histone modifications are an epigenetic key mechanism to activate or repress the expression of genes. Several data sets consisting of matched microarray expression data and histone modification data measured by ChIP-seq have been published. Here, we present a novel bioinformatic approach to detect genes that are differentially expressed between two conditions due to an altered histone modification.

First, both data types are matched by assigning the number of ChIP-seq reads aligning within the promoter region of a gene to the normalized expression value of that gene. Next, quantile normalization is applied to ChIP-seq values. Then, a score is calculated for each gene by multiplying the standardized difference of ChIP-seq values by the standardized difference of expression values. Finally, a Bayesian mixture model with eleven normal components is fitted to this score by means of posterior simulation using MCMC methods. The implicit assignment of genes to mixture components is used to classify genes into the groups (i) equally directed differences in both data sets, (ii) reversely directed differences and (iii) no differences.

We applied the method to anti-H3K4me3 ChIP-seq data of CEBPA knock-out and wild-type samples from mice together with matched expression data. The distribution of our score was slanted towards positive values, since H3K4me3 is an activating histone mark. Each mixture component could be clearly identified with one of the groups (i)-(iii), when appropriate prior distributions for the model's parameters were used. Overall, 331 (102) genes were classified into group (i) ((ii)). The classification was remarkably stable when using replicates.
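The per-gene score described above (the standardized ChIP-seq difference multiplied by the standardized expression difference) can be sketched as follows. Standardizing each difference vector as a plain z-score across genes is our assumption here; the abstract does not specify the standardization.

```python
from statistics import mean, pstdev

def gene_scores(chip_diff, expr_diff):
    """Per-gene score: standardized ChIP-seq difference times
    standardized expression difference. Positive scores flag equally
    directed changes (group i), negative scores reversely directed
    ones (group ii); scores near zero suggest no difference (iii)."""
    def z(values):
        m, s = mean(values), pstdev(values)
        return [(v - m) / s for v in values]
    return [c * e for c, e in zip(z(chip_diff), z(expr_diff))]
```

In the full method this score is not thresholded directly; a Bayesian mixture model is fitted to it, and genes are classified by their mixture-component assignments.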
J17 - Positioning of synonymous codon and gene translation efficiency among bacteria
Short Abstract: Amino acids in a protein are encoded by codons, each consisting of three DNA bases. While each amino acid is delivered by a tRNA recognizing its codon, most amino acids have multiple tRNA partners. Synonymous codons do not occur with equal frequency, and codon usage bias is known to influence gene transcription, translation, and expression levels. A recent study of Saccharomyces cerevisiae and other eukaryotes showed that the closest occurrence of the same amino acid within a coding sequence favors codons recognized by the same tRNA. This codon co-occurrence correlated strongly with gene translation speed, consistent with a tRNA recycling model during translation. In this poster, we examine codon co-occurrence across 696 bacterial genomes. In an exhaustive analysis using the UniProt complete proteome set, high codon correlation was observed in most of the bacterial genomes. The tendency toward codon co-occurrence was observed not only for frequently used codons but also for rarely used ones. By comparing the GCSI (GC Skew Index) and the mean TPI (tRNA Pairing Index) of each genome, codon correlation was significant in species with GC skew spectrum intensity GCSI > 0.1. Our results indicate the possibility of tRNA recycling during translation in bacteria, and give new insight into the relationship between gene expression and a gene's coding sequence.
J18 - Multiscale Representation of Genomic Signals
Short Abstract: In the genome, information is encoded on a wide range of spatial scales. Functional genomic regions can be on the order of base pairs (bps), e.g. transcription factor binding sites, up to Mbps for nuclear lamina-associated domains. As a consequence, measurements derived from the genome will exhibit structure at different spatial scales, a fact that should be taken into account when analyzing such data. In this work, we present a fundamentally new approach to analyzing genomic signals at different spatial scales. Genomic signals are defined as quantitative measurements as a function of genomic position and include DNA sequence-based data, such as GC-content, as well as (epi-)genomic measurements, such as ChIP-seq data.
We developed a multiscale segmentation method to obtain the multiscale representation (MSR) of a genomic signal. The MSR is a representation of signal enrichment and depletion as a function of spatial scale and genomic position. We applied this approach to a variety of genomic signals in the mouse, including intra-species sequence conservation data, GC-content and ChIP-seq data of TFs, RNA polymerase II and epigenetic marks. The MSR offers a novel way to summarize and visualize the information content across spatial scales. Using correlation analysis and genomic annotation, we demonstrate that a genomic signal indeed contains functional information at multiple scales. This multiscale information can be employed to accurately predict gene expression and function. Using a machine learning framework, we show substantially improved prediction accuracy when compared to approaches that analyze genomic signals at a single scale.
J19 - Direct RNA sequencing of a plant transcriptome highlights complexity of RNA 3’-end formation
Short Abstract: Correct RNA 3'-end formation and the addition of the poly(A) tail are critical to the generation of mature messenger RNA. Previous reports have shown the importance of alternative polyadenylation sites in modulating the function of the transcript or gene product. Here, we examined polyadenylation genome-wide in Arabidopsis thaliana by applying the single-molecule direct RNA sequencing technology (DRS) from Helicos Biosciences to polyadenylated transcripts. DRS produces tens of millions of short reads, requiring extensive computational analysis. Extreme heterogeneity of RNA 3'-ends was revealed, with up to 23 alternative poly(A) sites per gene. DRS allowed ranking of the sites within each 3' UTR by their expression. Examination of sequence features around the sites determined poly(A) sequence signals for the first time in A. thaliana coding genes, and the significance of these signals for sites with respect to their rank. We developed a new method of determining new poly(A) signals by clustering positional profiles of nucleotide hexamers around the cleavage sites. The DRS data allowed re-annotation (by elongation) of 3'-ends for 70% of the coding genes with aligned DRS reads. Novel peak finding and annotated gene elongation algorithms were devised for this genome-wide re-annotation. Our findings and algorithms show the power of direct RNA single-molecule sequencing and will be important for a clearer understanding of the genome structure of A. thaliana in particular, and of plants in general.
J20 - Semi-supervised learning based workflow for predicting recurrence of breast cancer
Short Abstract: Building an accurate classifier for identifying a type of disease or predicting disease outcomes is an important issue in bioinformatics and has been extensively studied. Recently, there have been several attempts to build classifiers incorporating large sets of unlabeled microarray samples; these semi-supervised learning approaches were reported to yield more robust and accurate classifiers. We propose a novel workflow to predict the recurrence risk of breast cancer patients based on a semi-supervised learning approach. First, we identify strongly influential gene pairs that can classify recurrence versus non-recurrence status by measuring the correlation between the two genes of every possible pair in the labeled samples. Gene pairs are selected according to the difference between their correlations in recurrence and non-recurrence samples, and the genes of the selected pairs become the labeled nodes of a core gene network. Second, the remaining genes become unlabeled nodes and are connected to the existing genes of the core network by measuring correlation; including these unlabeled genes yields the entire gene network. Third, we calculate the probabilistic transition matrix P for all genes of the network and apply a label propagation algorithm as the learning framework. After the propagation completes, we obtain a larger, more accurate gene network containing recurrence- and non-recurrence-related genes. We predicted the labels of unknown samples using the edges of this network by comparing correlations, achieving high accuracy.
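The third step, label propagation over the gene network, can be sketched with a clamped iterative update. The specific transition matrix and the clamp-the-labeled-nodes variant below are illustrative assumptions; the abstract does not spell out its exact update rule.

```python
def label_propagation(P, labels, iters=100):
    """Propagate labels over a network. P is a row-stochastic
    transition matrix (n x n nested lists); `labels` maps node index
    to +1 (e.g. recurrence-related) or -1 (non-recurrence-related)
    for the labeled core-network genes. Labeled nodes are clamped
    back to their known value after every round, a common variant
    of the algorithm."""
    n = len(P)
    f = [labels.get(i, 0.0) for i in range(n)]
    for _ in range(iters):
        f = [sum(P[i][j] * f[j] for j in range(n)) for i in range(n)]
        for i, y in labels.items():  # clamp known labels
            f[i] = y
    return f
```

An unlabeled gene connected equally to a +1 and a -1 neighbor ends up with a score of 0, i.e. no label preference.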
J21 - Improving Color Call Accuracy for Next Generation Sequencing using SVMs
Short Abstract: Here, we show how the accuracy of next-generation DNA sequencing machines is significantly improved using multi-class SVMs. Improving the accuracy of next-generation sequencers directly lowers sequencing costs, making genome-based diagnostics more affordable [1].

Noise in sequencing is due to the imperfect nature of the chemical processes involved. Specifically, incomplete cleavage of bases from previous cycles results in residual signal. Also, signal strength diminishes along the sequence due to depletion of chemicals. We model these error sources explicitly through multi-class SVMs. We use SVMlight multi-class [2] with a linear kernel and margin rescaling, with one SVM per cycle (to account for depletion). We use features from the previous and current sequence positions (to account for residual signal) to predict the base at the current sequence position.

Our method is demonstrated on the SOLiD 5500, a widely used next-generation sequencer, which encodes base space into color space [3]. Ours is the first method to obtain performance improvements in color space using SVMs. For the E. coli genome, we used a small training set of 7000 reads per lane and outperformed the current SOLiD platform in terms of the percentage of reads correctly mapped to the genome and the number of error-free reads. The method is fast, with training times of 1-2 min per lane. Incorporating this method would considerably improve next-generation sequencers.
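The per-cycle modeling idea can be illustrated on synthetic data. The sketch below substitutes a least-squares one-vs-rest linear classifier for the multi-class SVM (to stay dependency-light) and simulates residual signal from incomplete cleavage; all signal levels and feature layouts are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def one_vs_rest_fit(X, y, n_classes, lam=1e-3):
    """Least-squares one-vs-rest linear classifier: a dependency-light
    stand-in for the multi-class SVM described in the abstract."""
    Y = -np.ones((len(y), n_classes))
    Y[np.arange(len(y)), y] = 1
    Xb = np.hstack([X, np.ones((len(X), 1))])            # bias column
    return np.linalg.solve(Xb.T @ Xb + lam * np.eye(Xb.shape[1]), Xb.T @ Y)

def predict(W, X):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return (Xb @ W).argmax(axis=1)

# Simulate two cycles of 4-channel color calls, where incomplete cleavage
# leaks residual signal from the previous cycle into the current one.
n = 2000
colors = rng.integers(0, 4, size=(n, 2))                  # true colors per cycle

def intensities(c_prev, c_cur):
    x = rng.normal(0, 0.3, 4)                             # channel noise
    x[c_cur] += 1.0                                       # current-cycle signal
    if c_prev >= 0:
        x[c_prev] += 0.4                                  # residual signal
    return x

X1 = np.array([intensities(-1, c[0]) for c in colors])    # cycle 1 features
X2 = np.array([intensities(c[0], c[1]) for c in colors])  # cycle 2 features

# One classifier per cycle; the cycle-2 classifier also sees cycle-1 features,
# letting it model (and subtract) the residual-signal error source.
W2 = one_vs_rest_fit(np.hstack([X1, X2]), colors[:, 1], 4)
acc = (predict(W2, np.hstack([X1, X2])) == colors[:, 1]).mean()
print(acc > 0.9)
```

Training one classifier per cycle mirrors the abstract's handling of chemical depletion, while the cross-cycle features mirror its handling of residual signal.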


1. Ledergerber, C., Dessimoz, C. Briefings in Bioinformatics 12(5), 489-497 (2011)
2. Joachims, T. Advances in Kernel Methods - Support Vector Learning (1999)
3. Breu, H. Applied Biosystems (2010) 139WP01-02 CO13982
J22 - A Robust Linear Framework for Transcript Quantification using MultiSplice Features
Short Abstract: The advent of high-throughput RNA-seq technology allows deep sampling of the transcriptome, making it possible to characterize both the diversity and the abundance of transcript isoforms. Accurate abundance estimation, or transcript quantification, of isoforms is critical for downstream differential analysis (e.g. healthy vs. diseased cells) but remains a challenging problem for several reasons. First, while various types of algorithms have been developed for abundance estimation, short reads often do not uniquely identify the transcript isoform from which they were sampled, so that in some conditions the quantification problem is not identifiable, i.e. lacks a unique solution. We develop a generalized linear model for transcript quantification that leverages reads spanning multiple splice junctions to mitigate this non-identifiability. Second, RNA-seq reads sampled from the transcriptome exhibit unknown position-specific and sequence-specific biases. We extend our method to learn bias parameters simultaneously during transcript quantification to improve accuracy. Third, transcript quantification is often provided with a candidate set of isoforms, not all of which are likely to be significantly expressed in a given condition. By solving the linear system with the LASSO, our approach can infer an accurate set of dominantly expressed transcripts, whereas existing methods tend to assign positive expression to every candidate isoform. On simulated RNA-seq datasets, our method demonstrates better quantification accuracy than existing methods. Application of our method to real data demonstrates that accurate transcript quantification is effective for differential analysis of transcriptomes.
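The LASSO selection step can be sketched with a tiny linear system, assuming a hypothetical feature-by-isoform compatibility matrix; a proximal-gradient (ISTA) solver with a non-negativity constraint stands in for the paper's actual solver.

```python
import numpy as np

def lasso_nonneg(A, b, lam=0.5, n_iter=5000):
    """Non-negative LASSO via projected proximal gradient (ISTA):
    minimizes 0.5*||Ax - b||^2 + lam*sum(x) subject to x >= 0."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        g = A.T @ (A @ x - b)              # gradient of the smooth part
        x = np.maximum(x - (g + lam) / L, 0.0)
    return x

# Rows: sequence features (exons/junctions); columns: candidate isoforms.
# The third candidate is spurious: the data are explained by the first two.
A = np.array([[1, 0, 1],
              [1, 0, 1],
              [0, 1, 1],
              [1, 1, 0]], dtype=float)
b = A[:, :2] @ np.array([10.0, 5.0])       # only isoforms 0 and 1 expressed
x = lasso_nonneg(A, b)
print(np.round(x, 1))                      # ~ [9.9 4.8 0.]: isoform 2 zeroed
```

The L1 penalty slightly shrinks the true abundances but drives the spurious candidate exactly to zero, which is the sparsity behaviour contrasted with methods that assign positive expression to every candidate.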
J23 - DiffSplice: the Genome-Wide Detection of Differential Splicing Events with RNA-seq
Short Abstract: The RNA transcriptome varies in response to cellular differentiation as well as environmental factors, and can be characterized by the diversity and abundance of transcript isoforms. The availability of high-throughput short-read RNA sequencing technologies provides in-depth sampling of the transcriptome, making it possible to accurately detect the differences between transcriptomes. We present a new method for the detection and visualization of differential transcription. Our approach does not depend on transcript or gene annotations. It also circumvents the need for full transcript inference and quantification, which is a challenging problem due to short read lengths as well as various sampling biases. Instead, our method takes a divide-and-conquer approach to localize the difference between transcriptomes in the form of Alternative Splicing Modules (ASMs) where transcript isoforms diverge. Our approach starts with the identification of ASMs from the splice graph, constructed directly from the exons and introns predicted from RNA-seq read alignments. The abundance of alternative splicing isoforms residing in each ASM is estimated for each sample and is compared across sample groups. A non-parametric statistical test is applied to each ASM to detect significant differential transcription with a controlled false discovery rate. The sensitivity and specificity of the method have been assessed using simulated datasets and compared with other state-of-the-art approaches. qRT-PCR experiments confirmed differential expression for a selected set of genes in a lung differentiation study and a breast cancer dataset, demonstrating the utility of the approach on experimental biological datasets. Software download URL: http://www.netlab.uky.edu/p/bioinfo/DiffSplice.
J24 - FASTG: Representing the True Information Content of a Genome Assembly
Short Abstract: Genome assemblies have typically been represented linearly, as sequences of bases recorded in FASTA files. This makes sense when one has complete knowledge of the genome. However, excepting small, haploid genomes, all assemblies contain errors and omissions which can result in incorrect biological inferences. In most cases these assemblies do not represent polymorphism at all.

Today, using high-coverage data, assembly algorithms ‘see’ almost all bases of the genome. Thus errors in the assemblies result primarily from defects in the algorithms and in assembly representation. Where a particular locus in an assembly is inaccurate, it is often the case that the assembly algorithm could have described the data more accurately: ‘there are between 14 and 16 Ts here’. However, such ambiguities are precluded by the current linear representation. Similarly, polymorphisms cannot be represented, as in ‘the two alleles here are A and T’ or ‘one allele has this 330 base insertion’.

Genome assemblies should come with structures that capture the uncertainties in our knowledge and the variability in the data. Here, in collaboration with the Assemblathon group, we propose a representation, FASTG, that accomplishes this. FASTG is FASTA—thus allowing existing tools to run and providing coordinates that facilitate computation—with additional global and local layers of ‘markup’. The global markup encapsulates a graph structure that is essential in cases where long perfect repeats exceed the resolving power of the data. The local markup represents ambiguities, including the description of captured gaps by unresolved graphs.
J25 - RSEM-EVAL: A Probabilistic Transcriptome Assembly Evaluator
Short Abstract: RNA sequencing (RNA-Seq) provides us with a great opportunity to study the transcriptomes of species without sequenced genomes. The first step in such studies is to perform a transcriptome assembly. There are several de novo transcriptome assemblers available, such as trans-ABySS, Trinity and Oases. However, it is not clear how to assess the quality of the assemblies obtained with these methods, especially if the true transcript set is unknown. To address this issue, we propose a probabilistic-model-based method to evaluate assemblies that depends only on RNA-Seq data. Building on RSEM, our methodology for quantification with RNA-Seq, we model a generative process of both assemblies and RNA-Seq reads. We then assess the quality of an assembly by computing its posterior probability under our model, given the RNA-Seq reads. Because calculating exact posterior probabilities is computationally infeasible, we propose several tractable approximations. The primary advantage of our proposed method is that no ground truth on real transcripts is required. It has a broad range of potential applications. For example, current de novo transcriptome assemblers have many parameters, some of which are fixed in an ad hoc way. Using our method, it is possible to find the optimal set of parameters for each RNA-Seq data set. Our method can also be used to select the best assembler for a given data set and may be used in designing a meta-assembler that combines the outputs of several existing assemblers. We present our progress on this project and initial experimental results.
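The core scoring idea, preferring the assembly under which the reads are more probable, can be sketched in a few lines. This toy version (not RSEM-EVAL itself) uses a best-placement read likelihood with an i.i.d. per-base error rate and a simple per-base prior on assembly length; all parameters are illustrative.

```python
import math

def read_loglik(read, assembly, err=0.01):
    """Log-likelihood of a read at its best gapless placement, assuming
    i.i.d. per-base errors (a toy stand-in for the full generative model)."""
    best = -math.inf
    for i in range(len(assembly) - len(read) + 1):
        ll = sum(math.log(1 - err) if read[j] == assembly[i + j]
                 else math.log(err / 3) for j in range(len(read)))
        best = max(best, ll)
    return best

def assembly_score(reads, assembly, prior_per_base=math.log(0.25)):
    """Log 'posterior' up to a constant: assembly prior + read likelihoods."""
    return (len(assembly) * prior_per_base
            + sum(read_loglik(r, assembly) for r in reads))

truth = "ACGTACGGTTCA"
reads = [truth[i:i + 6] for i in range(0, 7, 2)]     # 6-mers tiling the locus
good, bad = truth, "ACGTACGATTCA"                    # 'bad' has one error
print(assembly_score(reads, good) > assembly_score(reads, bad))   # True
```

Because the reads disagree with the erroneous base, the correct assembly receives the higher score without any reference to a ground-truth transcript set.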
J26 - forestSV: supervised learning for genomic structural variant discovery
Short Abstract: Structural variants (SVs) are a major form of genetic variation, and their detection has provided key insights into the genetic basis of common human disease -- most notably in neuropsychiatric disorders, where it has been well established that rare and de novo mutations confer significant risk.

We have brought the SV discovery problem into a statistical learning paradigm, allowing us to adapt proven methodology to facilitate improved discovery of SVs. Using data from the 1000 Genomes Project, we trained a Random Forest (RF) classifier to discriminate known SVs (deletions and duplications) from invariant regions and false positives called as SVs by other methods. When presented with new data, the classifier was able to identify SVs at high sensitivity and specificity.
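A minimal sketch of this training setup on synthetic feature windows: normalized read depth and discordant mate-pair fraction are examples of the kinds of metrics such callers use, not the forestSV feature set, and all numbers are invented.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic windows: [normalized read depth, discordant mate-pair fraction].
# Deletions show ~half depth and many discordant pairs; invariant regions
# show depth ~1 and few discordant pairs.
def windows(n, deletion):
    depth = rng.normal(0.5 if deletion else 1.0, 0.1, n)
    disc = rng.normal(0.30 if deletion else 0.02, 0.05, n)
    return np.column_stack([depth, disc])

X = np.vstack([windows(500, True), windows(500, False)])
y = np.array([1] * 500 + [0] * 500)            # 1 = deletion, 0 = invariant

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
new = np.array([[0.48, 0.28], [1.02, 0.01]])   # unseen windows
print(clf.predict(new))                        # [1 0]
```

Validated calls can simply be appended to (X, y) and the forest refit, which is the retraining loop the abstract highlights as an advantage of the supervised setting.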

The classifier can be viewed as a single method that embodies the strengths of the various calling methods that provided training calls, meaning the user spends less time reconciling calls from multiple methods. Additionally, the method benefits from the advantages of supervised learning. As SV calls are experimentally validated or invalidated, they may be added to the training data, leading to a new classifier that has learned from its previous successes and failures.

We have bundled the classifier and related tools into an R package called forestSV and have made it available on our website at http://sebatlab.ucsd.edu/software.
J27 - Graph rigidity reveals non-deformable collections of chromosome conformation data
Short Abstract: Recent chromosome conformation capture (3C) experiments have been used to
construct three-dimensional models of genomic regions, chromosomes, and entire
genomes. These models can be used to understand long-range gene regulation,
chromosome rearrangements, and the relationships between sequence and spatial location. However, it is unclear whether 3C pairwise distance constraints provide sufficient information to embed chromatin in three dimensions. A priori, it is possible that an infinite number of embeddings are consistent with the measurements due to a lack of constraints between some regions.

We present a new method based on graph rigidity to assess the suitability of 3C
experiments for constructing plausible three-dimensional models of chromatin
structure. Underlying this analysis is a new, efficient, and accurate algorithm
for finding sufficiently constrained (rigid) collections of constraints in 3D, a problem for which there is no known efficient algorithm. Without performing an embedding or creating a frequency-to-distance mapping, our proposed approach establishes which substructures are supported by an appropriate framework of interactions.
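The brute-force counterpart of this test is easy to state: a framework on n generic points in 3D is infinitesimally rigid iff its rigidity matrix has rank 3n - 6 (the 6 accounts for rigid motions of space). The poster's contribution is an efficient algorithm for finding rigid components; the sketch below merely checks the definition on toy frameworks.

```python
import numpy as np

def is_rigid_3d(points, edges):
    """Infinitesimal rigidity test: a framework on n generic points in 3D
    is rigid iff its rigidity matrix has rank 3n - 6."""
    n = len(points)
    R = np.zeros((len(edges), 3 * n))
    for k, (i, j) in enumerate(edges):
        d = points[i] - points[j]
        R[k, 3*i:3*i+3] = d                # each distance constraint restricts
        R[k, 3*j:3*j+3] = -d               # the motion of its two endpoints
    return np.linalg.matrix_rank(R) == 3 * n - 6

rng = np.random.default_rng(0)
pts = rng.random((4, 3))                       # four generic points ("loci")
tetra = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]   # fully constrained
chain = [(0, 1), (1, 2), (2, 3)]                           # under-constrained
print(is_rigid_3d(pts, tetra), is_rigid_3d(pts, chain))    # True False
```

The chain admits infinitely many 3D embeddings consistent with its pairwise distances, which is exactly the ambiguity the method is designed to detect before any embedding is attempted.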

Applying the method to four recent 3C experiments, we find that, for even
stringently filtered constraints, a large rigid component spans most of the
measured region. Filtering highlights higher-confidence regions, and we find
that the organization of these regions depends crucially on short-range
interactions. We also find that rigid component boundaries are associated with
areas of low nucleosome density. Pre-processing experimentally observed
interactions with this method before relating chromatin structure to biological phenomena will ensure that hypothesized correlations are not driven by the arbitrary choice of a particular unconstrained embedding.
J28 - Iterative De Novo Assembly of High Quality Transcripts from Paired-end RNA-Seq Data in Non-model Organisms
Short Abstract: The massive amount of data generated from RNA sequencing opens up wide avenues for discovering variants and novel transcript forms and for measuring gene expression. Yet for non-model organisms, particularly in mammals, the lack of a good-quality transcriptome/genome reference hinders such studies. De novo assembly of the transcriptome, while valuable, has obvious limitations: the data set size the de novo assembly tool can take in, the hardware resources it demands, and, most importantly, the quality control of the assembled transcripts at the end.
Paired-end (PE) sequencing provides the advantage of spanning cDNA fragments whose lengths peak at 250-270 bases, significantly longer than the read length (e.g. 100 bases from an Illumina HiSeq 2000). The spanning distance between read pairs harbors information on transcript structure in individual samples. To fully utilize the distance information between PE reads from RNA sequencing, we propose an iterative 'hybrid' de novo assembly process in mammals to build a set of high-quality transcripts for genes of interest. The process starts from a few 'seed' sequences taken from a model organism, e.g. human, as the initial reference. At each iteration, the paired mates of the 'broken pairs', i.e. pairs with only one read mapped to the reference, are recruited into the read pool for de novo assembly to build a new reference. The iteration ends when the number of 'broken pairs' reaches a certain threshold.
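The iteration can be illustrated with a toy reconstruction, where distinct symbols stand in for an unambiguous genome, read mapping is plain substring matching, and a naive suffix-overlap extension stands in for the de novo assembly step (all simplifications; none of this is the authors' pipeline).

```python
def maps(read, ref):
    """Toy read mapping: exact substring match."""
    return read in ref

def extend(ref, reads, min_ov=5):
    """Greedily extend the reference with reads that overlap its end --
    a naive stand-in for a real de novo assembly step."""
    grown = True
    while grown:
        grown = False
        for r in reads:
            for ov in range(len(r) - 1, min_ov - 1, -1):
                if ref.endswith(r[:ov]) and not maps(r, ref):
                    ref += r[ov:]
                    grown = True
                    break
    return ref

# Distinct symbols stand in for an unambiguous genome.
genome = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
frag, rlen = 14, 8                          # fragment span and read length
pairs = [(genome[i:i+rlen], genome[i+frag-rlen:i+frag])
         for i in range(0, len(genome) - frag + 1, 2)]

ref = genome[:12]                           # 'seed' from a model organism
while True:
    broken = [p for p in pairs if maps(p[0], ref) != maps(p[1], ref)]
    if not broken:                          # stop: no more broken pairs
        break
    ref = extend(ref, [r for p in broken for r in p if not maps(r, ref)])
print(ref == genome)                        # True: the seed grew to the locus
```

Each round, the unmapped mates of broken pairs reach just beyond the current reference, so recruiting them lets the assembly grow until no informative pairs remain, which mirrors the stopping criterion in the abstract.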
J29 - Microarray or RNA-seq: which one to believe?
Short Abstract: The mRNA population in the transcriptome reflects protein-coding genes that are actively expressed at any given point in time. The ability to simultaneously quantify the mRNA for a vast number of genes facilitates the analysis of global gene expression patterns. This information helps in interpreting the functional elements and vital biological processes in any given organism. Several methodologies have emerged for genome-wide transcriptome studies; among them, microarray and RNA-seq are the two most popular. For more than a decade, hybridization-based microarray technology has been instrumental in genome-wide transcriptome studies, and through long use it has now almost reached maturity. Recently, RNA-seq, a next-generation sequencing-based technology, has been introduced. Although this methodology is in its infancy, it is already rapidly gaining popularity. Several cross-platform comparison studies have been carried out to infer how the two methodologies perform in whole-transcriptome profiling studies.
However, these studies have focused on eukaryotes, with hardly any based on prokaryotes. In this study, using a microbial model system, the plant-pathogenic bacterium Xanthomonas citri subsp. citri, we compared the performance of the RNA-seq and microarray platforms. We validated the expression levels using real-time quantitative reverse-transcription PCR (qRT-PCR). Our analysis revealed that the two methods agree qualitatively for ~60% and quantitatively for ~30% of the expression estimates. The implications of the differences and consensus between the two methodologies are presented.
J30 - Using Semantic Workflows for Genome-Scale Analysis
Short Abstract: The complexity of genomic data analysis arises not just from the sophisticated statistical techniques required but from the sheer scale of the datasets, from dozens to thousands of individuals and from gigabytes to soon terabytes of data of varying quality. Automation and assistance are key to the rapid utilization of genomic research and its potential impact on healthcare and beyond. We are developing new capabilities to assist scientists with the complexity of data analysis in population genomics and next-generation sequencing. Building on existing workflow technologies for managing the execution of distributed scalable scientific data analyses, we use semantic workflows to capture the know-how required to manage the complexity of genomic data analysis and to provide assistance to scientists in setting up and validating analyses. First, by enforcing the correct use of methods, the system validates analyses and avoids common errors. Second, by capturing the process that led to new results and retaining the rationale for the system's validation, it facilitates reproducibility and verifiability. We have developed an initial collection of workflows for inter-family and intra-family population genomics and next-generation sequencing. It includes workflows for association tests, copy number variation detection, transmission disequilibrium tests, and SNV/indel discovery from resequencing of genomic DNA and RNA sequencing. The workflows include software components from popular packages: Plink, PennCNV and Gnosis, Allegro and FastLink, Burrows-Wheeler Aligner and SAMtools, and Structure. We show their use in two replication studies for previously published results.
J31 - Organization of mammalian genomes into transcriptionally coherent gene clusters
Short Abstract: We used genome-wide histone profiling to generate a map of chromatin states in murine embryonic stem cells and found long segments of the genome that contain gene clusters with similar transcriptional activity. Surprisingly, one-third of all genes occur in these multigenic segments with coherent transcriptional states. This organization of the genome into transcriptionally coherent clusters also exists in human embryonic stem cells, and genes in clusters show conservation of transcriptional state between the two organisms. We find that genes within clusters are more likely to be co-regulated through development than other genes. Taken together, our results suggest that transcriptionally coherent clusters represent a prevalent form of genome organization with implications for transcriptional regulation.
J32 - Genome-Wide Chromatin Remodeling Identified at GC-Rich Long Nucleosome-Free Regions of
Short Abstract: To gain deeper insights into principles of cell biology, it is essential to understand how cells reorganize their genomes by chromatin remodeling. We analyzed chromatin remodeling in next-generation sequencing data from resting and activated T cells to determine a whole-genome chromatin remodeling landscape. We consider chromatin remodeling in terms of nucleosome repositioning, which can be observed most robustly in long nucleosome-free regions (LNFRs) that are occupied by nucleosomes in another cell state. We found that LNFR sequences are either AT-rich or GC-rich, and that nucleosome repositioning was observed much more prominently in GC-rich LNFRs - a considerable proportion of them outside promoter regions. Using support vector machines with string kernels, we identified DNA sequence patterns indicating loci of nucleosome repositioning. GC-rich LNFRs found in resting T cells showed the most specific and most prominent repositioning patterns. The patterns most indicative of chromatin remodeling are GGGGTGGGG and GGGGCGGGG. The first pattern, GGGGTGGGG, is significantly enriched in remodeled LNFRs of resting T cells, regardless of whether these LNFRs are in promoter regions or not. The second pattern, GGGGCGGGG, however, is only indicative of chromatin remodeling outside of promoters. Both patterns thus hint at hitherto unknown genome-wide mechanisms of chromatin remodeling. Comparisons of the patterns to known binding-site patterns suggest the involvement of a zinc finger protein.
J33 - Enabling Genome Assembly Validation Through Ensembles and Unsupervised Learning
Short Abstract: Recent years have seen the emergence of the “post-genomics” era, with dozens of completed genomes becoming more readily available. Research increasingly relies on both direct and indirect inferences drawn from these genomes. For example, orthology information, synteny, gene duplication and phylogenetic organization are contingent upon the accuracy of the underlying genome. Unfortunately, improvement of assemblies often involves only filling in gaps rather than evaluating the quality of existing sequences. Given that assembly validation is crucial to ensuring data integrity, we propose a novel machine learning-based approach that will facilitate genome quality assessment by improving an existing supervised implementation and adding a novel unsupervised learning approach. Current assembly-validation applications rely on metrics describing mate-pair placement, read coverage, clone coverage, and SNP placement; however, none of these approaches incorporates all available metrics, and the established machine-learning approach relies on seldom-known misassemblies as a training set. Our approach, on the other hand, incorporates all currently used metrics and further improves on them by using ensembles to boost classifier performance. Finally, we circumvent the limitations of classification by implementing a clustering approach to differentiate between correctly assembled and mis-assembled regions. This combined approach will allow for accurate, cost-effective genome validation as the number of genomes continues to grow.
J34 - Analysis of a high coverage genome of a Denisovan individual
Short Abstract: The draft genome sequences of two extinct archaic hominins, Neandertals and Denisovans, were published with 1.3-fold and 1.9-fold shotgun coverage in 2010. These draft genomes enabled the analysis of the genetic relationship between the archaic individuals and present-day humans, but were not of sufficient quality to determine with confidence the exact state of the archaic sequences at the majority of sites in the human genome. Using improved methods we have now produced ~30x coverage of the Denisova genome, as well as comparable coverage for 11 present-day humans. For sequence analysis, we rigorously updated previous strategies and adopted standard formats and algorithms for genotype calling.
Comparisons to the present-day humans show that the Denisova genome is of comparable, if not higher, quality. We detect a reduced number of changes in the Denisovan lineage, reflecting its shorter evolutionary history. This suggests that an alternative, sequence-based approach to dating fossils may be feasible. With the high-coverage data, we also identified heterozygous positions and compared the level of heterozygosity to present-day humans. We observe that the genetic diversity in Denisovans is reduced to 1/5 of that in Africans and 1/4-1/3 of that in Eurasians. This new genome enables us to obtain a near-complete catalogue of fixed genetic changes specific to the recent human lineage and we can, for the first time, also explore Denisova-specific changes. The analysis of this archaic hominin genome is no longer limited by the quality of the obtained sequences, but rather by read alignment issues and missing Denisova population data.
J35 - Repeat reduction in NGS libraries prior to sequencing
Short Abstract: Current DNA sequencing technologies provide opportunities to generate massive amounts of sequence data. Analyses of large plant and animal genomes have been complicated by the presence of repetitive sequences of varying degrees of complexity and sequence divergence. Several uses of sequence data, such as gene and SNP discovery as well as genotyping, would benefit from libraries with reduced abundance of repeated sequences. We refined a method for reducing the high-copy components in libraries prior to sequencing using Illumina Genome Analyzer or HiSeq systems. DNA libraries are denatured to single strands and then allowed to partially reanneal. Treatment with a thermostable duplex-specific nuclease (DSN) after an appropriate reannealment period results in the selective destruction of the more rapidly re-annealing high-copy sequences, leaving the low-copy component to be amplified and sequenced. As part of the Compositae Genome Project (http://compgenomics.ucdavis.edu/), the lettuce transcriptome and gene space have been sequenced using this repeat-reduction approach, assembled and analyzed using the CLC Genomics Workbench. Experiments were designed to investigate the consequences of variables in the DSN protocol. These demonstrate that 2- to 3-fold enrichment of gene space can be achieved for large plant genomes such as lettuce (2.7 Gb) that comprise more than 70% repeated sequences.
J36 - Bioinformatics Training in the –Omics Era: The Canadian Bioinformatics Workshop Series
Short Abstract: The research environment today has fully embraced high-throughput technologies. Terms like transcriptome, genome, proteome, and metabolome among others have become commonplace, and their experimental outputs have changed our research possibilities. Yet with these global analyses comes an ever-greater need for bioinformatics skills to visualize, statistically evaluate, interrogate and integrate data. The result is a demand for bioinformatics skills that address –omics-era research questions. However, finding adequate bioinformatics training programs can be challenging.

The Canadian Bioinformatics Workshops (CBW), a national training program in computational biology established in 1999, continues to be successful in providing such advanced training opportunities. Being responsive to the research community is key. Community surveys and expert input, taken annually, highlight the constantly evolving needs within the –omics landscape. In response, the CBW constantly updates and develops short two-day training sessions on specialized, advanced-level topics. Topics covered thus remain relevant for –omics-era research, such as how to annotate a gene list and evaluate pathways (Pathway and Network Analysis from –omics Data), and how to examine biological data and conduct essential statistical analyses (Exploratory Analysis of Biological Data). Other topic areas have been added as certain fields have matured, such as metabolomics (Informatics and Statistics for Metabolomics) or microarrays (Microarray Data Analysis), while others meet the demand created by new technologies, such as how to manage and interpret HT sequencing data (Informatics on High-Throughput Sequencing Data), or serve unique topic areas with their own informatics challenges (Bioinformatics for Cancer Genomics). Workshop specifics can be found at bioinformatics.ca
J37 - Tumor Suppressor Status in Cancer Cell Line Encyclopedia
Short Abstract: Tumor suppressors play a major role in cancer biology. In order to have a tumor-promoting effect, in the majority of cases a complete loss of function of both alleles is required. There are several different mechanisms for inactivation of tumor suppressors, which can be divided into three categories. The first category includes inactivation of both alleles by genetic alterations, such as copy number loss, loss of heterozygosity and mutations. The second category includes inactivation of one allele by a mechanism from the first category and inactivation of the second allele by an epigenetic mechanism, such as promoter methylation and possibly histone modifications. The third category includes inactivation of both alleles by an epigenetic mechanism. In order to determine tumor suppressor status on a sample-by-sample basis, a comprehensive computational framework was constructed. The framework systematically checks for the three major mechanisms of inactivation of tumor suppressors, utilizing the Cancer Cell Line Encyclopedia (CCLE) data generated by a collaboration between the Broad Institute and the Novartis Institutes for BioMedical Research. The CCLE provides mRNA expression, Affymetrix SNP 6.0 profiles, OncoMap mutation screening and exome sequencing data for approximately 1,000 cancer cell lines. The framework uses the 799 cell lines for which all of the above data types are available and for which it is therefore possible to comprehensively determine tumor suppressor status. The poster describes the approach used to determine tumor suppressor status, including important technical and biological nuances, and provides a summary of results for well-known and putative tumor suppressors.
J38 - SOAP3-dp: Fast and sensitive short read alignment via index-assisted dynamic programming on GPU
Short Abstract: Existing short read alignment tools either are not fast enough or cannot handle INDELs well. By a skillful exploitation of whole genome indexing and dynamic programming on a GPU, we devised a GPU-based tool called SOAP3-dp that can find alignments involving mismatches and INDELs, and it achieves a drastic improvement in speed and better sensitivity over all existing tools. In our experiments with paired-end reads, SOAP3-dp is 10 to 15 times faster than BWA and the newly released Bowtie2, and aligns 2-4 percent more reads with high identity rate (average over 88%). Compared to its predecessor SOAP3 which tolerates mismatches only, SOAP3-dp aligns 10% more reads and is 1.6 times faster, showing that GPU-based dynamic programming coupled with indexing gives a more efficient and effective alignment tool.

SOAP3-dp maintains a full-text index (2BWT) of the reference sequence in the GPU. It makes use of this mismatch-tolerant index in a multi-resolution manner to quickly identify the candidate regions containing the reads, and then uses dynamic programming to align the reads onto the candidate regions with possible mismatches and INDELs. The highly uniform nature of dynamic programming fits well with the SIMT model of the GPU's multi-cores, but parallelism is hindered by the bandwidth needed to access the dynamic-programming tables of different reads in global memory. SOAP3-dp resolves this bottleneck by carefully optimizing its GPU kernel for dynamic programming so that each DP table entry is read only once, keeping accesses to global memory to a minimum.
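The DP stage itself is a standard semi-global alignment recurrence. A plain-Python sketch (sequential, with illustrative scores, whereas SOAP3-dp evaluates the recurrence in parallel on the GPU) shows the mismatch and INDEL handling:

```python
def align(read, region, match=1, mismatch=-1, gap=-2):
    """Semi-global DP: align the full read against any substring of the
    candidate region, allowing mismatches and INDELs. Sequential sketch of
    the kind of recurrence SOAP3-dp evaluates on the GPU."""
    m, n = len(read), len(region)
    prev = [0] * (n + 1)                   # row 0: free leading region gaps
    for i in range(1, m + 1):
        cur = [gap * i] + [0] * n          # unaligned read prefix is penalized
        for j in range(1, n + 1):
            sub = prev[j-1] + (match if read[i-1] == region[j-1] else mismatch)
            cur[j] = max(sub, prev[j] + gap, cur[j-1] + gap)
        prev = cur
    return max(prev)                       # free trailing region gaps

print(align("ACGT", "TTACGTTT"))           # 4: exact hit inside the region
print(align("ACGTT", "GGACGATTCC"))        # 3: best hit needs a mismatch/INDEL
```

Each cell depends only on three neighbours, which is why the computation is so uniform across reads and maps well onto SIMT hardware once the memory-access pattern is controlled.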

Availability: http://www.cs.hku.hk/2bwt-tools/SOAP3-dp
J39 - Characterization of DNA-methylation around the promoter region
Short Abstract: Background: Methylation of cytosines is an important transcriptional regulatory feature in eukaryotic genomes and fundamental for cellular differentiation processes. Next Generation Sequencing technologies can provide us with better insights into DNA methylation patterns.

Methodology: We have performed a genome-wide analysis of DNA methylation on 80 samples across tissue types, between cancer and normal using Methyl-Binding-Domain capturing followed by next-generation sequencing (Methylcap-seq). We analyzed CpG island distribution, average methylation distribution and average binary entropy for more than 50000 transcripts in each of the 80 samples.

Results: We found that average methylation properties vary around the TSS. Our study shows highly variable methylation levels in the intergenic regions, followed by lower, stable methylation in the promoter region from -750 bp to the transcription start site. The densest methylation occurs at the beginning of the first exon, followed by lower, stable methylation in intron 1.
Interestingly, the promoter is distinguishable from the intergenic region, from around -750/-500 bp upstream to the first exon at the TSS, by lower methylation levels and lower variability. Overall, we found clear differences in methylation around the transcription start site, indicating epigenetically distinct properties for intergenic, promoter, exonic and intronic regions. These findings could be useful for gene annotation, finding alternative transcription start sites and oligo design.
J40 - Clear and biologist-friendly analysis software for RNA-, miRNA-, ChIP-, methyl- and CNA-seq data
Short Abstract: The open source Chipster software (http://chipster.csc.fi) provides a clear and biologist-friendly interface to analysis tools for RNA-, miRNA-, ChIP-, methyl- and CNA-seq data. Users can easily save and share analysis workflows, and the built-in genome browser allows seamless viewing of reads and results.

Users can perform their whole data analysis in Chipster, from quality control to downstream applications such as pathway enrichment and motif discovery. Popular tools such as FastQC, FASTX, PRINSEQ, SAMtools, BEDtools, Bowtie, BWA, TopHat, HTSeq, Cufflinks and MACS are included, and care has been taken to serve them in a biologist-friendly manner. Several R/Bioconductor packages have also been integrated, including edgeR, DESeq and MEDIPS.

Chipster’s built-in genome browser allows seamless visualization of reads and results in their genomic context using Ensembl annotations. Users can zoom in to nucleotide level, highlight SNPs and view the automatically calculated coverage. Cross-talk between the genome browser and BED files allows users to quickly inspect genomic regions by simply clicking on the data row of interest.

Technically, Chipster is a Java-based client-server system. It is open source, and a virtual machine distribution is available (http://chipster.sourceforge.net/). New analysis tools can easily be added using a simple markup language. We are working with the Hadoop MapReduce framework so that large jobs can be run in the cloud. Taken together, Chipster provides an easy way to serve NGS data analysis tools in a biologist-friendly manner.

1. Kallio et al. (2011) BMC Genomics 12:507
2. Niemenmaa et al. (2012) Bioinformatics 28(6):876
J41 - Barcoding-free BAC pooling enables combinatorial selective sequencing of the barley gene space
Short Abstract: We propose a sequencing protocol that combines recent advances in
combinatorial pooling design and second-generation sequencing
technology to efficiently approach de novo selective genome
sequencing. We show that combinatorial pooling is a cost-effective
and practical alternative to exhaustive DNA barcoding when dealing
with hundreds or thousands of DNA samples, such as, in this case,
gene-rich BAC clones tiling the genome. The novelty of the protocol
hinges on the computational ability to efficiently compare hundreds of
millions of short reads and assign them to the correct BAC clones so that the
assembly can be carried out clone-by-clone. Experimental results on
simulated data from the rice genome show that the deconvolution is
extremely accurate (99.57% of the deconvoluted reads are assigned to
the correct BAC), and the resulting BAC assemblies have very high
quality (BACs are covered by contigs over about 77% of their length,
on average). Experimental results on real data for a gene-rich subset of
the barley genome confirm that the deconvolution is accurate (more than
70% of left/right pairs in paired-end reads are assigned to the same
BAC, despite being processed independently) and the BAC assemblies have
good quality (the average sum of all assembled contigs is 76-88%
of the estimated BAC length). We also compare against an assembly of
whole-genome shotgun sequencing data of the barley genome at 31x sequencing depth:
the sum of all assembled contigs covers about 30% of the genome, with an N50 of 2.9 kb.
In contrast, the average BAC assembly covers about 76-88% of each BAC, with an
average N50 of 6-7.2 kb. BAC assemblies and the whole-genome shotgun assembly
of barley are available at http://www.harvest-web.org/.
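The deconvolution step described above (assigning reads back to individual BACs from pooled sequencing) can be sketched in miniature: each BAC has a known pool signature, and a read is assigned to the BAC whose signature is contained in the set of pools the read was observed in. The pool layout below is a made-up example, not the actual barley design:

```python
# Toy combinatorial-pooling deconvolution: a read is assigned to the unique
# BAC whose pool signature is contained in the pools the read occurs in.
# The signatures below are invented for illustration.

def deconvolve(read_pools, signatures):
    """signatures: BAC name -> frozenset of pool ids."""
    hits = [bac for bac, sig in signatures.items() if sig <= read_pools]
    return hits[0] if len(hits) == 1 else None  # keep unambiguous hits only

signatures = {
    "BAC_A": frozenset({1, 4, 7}),
    "BAC_B": frozenset({2, 4, 8}),
    "BAC_C": frozenset({3, 5, 7}),
}
print(deconvolve(frozenset({1, 4, 7}), signatures))        # BAC_A
print(deconvolve(frozenset({1, 2, 4, 7, 8}), signatures))  # ambiguous -> None
```

Discarding ambiguous assignments, as in the second call, is one simple way to keep the reported per-BAC accuracy high at the cost of some reads.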
J42 - Bioinformatics analysis reveals the diverse and highly complex spinal cord regeneration transcriptome of Xenopus laevis
Short Abstract: While vertebrates are capable of some types of tissue regeneration, most, including humans, have lost the ability to regenerate whole structures such as limbs (epimorphic regeneration); amphibians, in contrast, are exceptionally good at it. Epimorphic regeneration is a complex response involving several different pathways to accomplish the restoration of all sorts of tissues, and is therefore difficult to analyze. In order to provide a richer view of the regeneration process, we aimed to unravel the transcriptome of spinal cord regeneration in Xenopus laevis. We performed high-throughput mRNA sequencing of the spinal cord transcriptome at days two (early) and six (late) after injury, in Xenopus regenerative (st. 50) and non-regenerative (st. 66) stages. Our results allowed us to characterize up to twenty-seven thousand transcripts across all samples, of which about 2,000 showed an expression difference of at least 2.8-fold, suggesting a putative role in the spinal cord regenerative response. In the regenerative stage, a preliminary GO analysis shows genes related to the cell cycle, consistent with the fact that after injury new cells must replace damaged ones. In the non-regenerative stage, on the other hand, we observe activity of transcripts related to the immune response. The complexity and variety of these two thousand transcripts make a full understanding of the regenerative process difficult. However, our research will contribute to a better comprehension of mammalian regenerative capacity and may eventually help enhance and/or develop therapies that promote healing and regeneration in humans.
J43 - This déjà vu feeling: Analysis of multidomain protein evolution in eukaryotic genomes
Short Abstract: Background. Evolutionary innovation in eukaryotes, and especially in animals, is at least partially driven by genome rearrangements and emergence of proteins with new domain architectures, and thus novel functionality. Given the random nature of such rearrangements, one could expect that proteins with particularly useful multidomain combinations may have been rediscovered multiple times by parallel evolution. However, existing reports suggest a minimal role of this phenomenon in the overall evolution of eukaryotic proteomes.

Results. We assembled a collection of 172 complete eukaryotic genomes that is not only the largest, but also the most taxonomically diverse set of genomes analyzed so far. By employing a maximum parsimony approach to compare repertoires of protein domains and their combinations, we show that independent evolution of domain combinations is significantly more prevalent than previously thought. Our results indicate that about 25% of all currently observed domain combinations have evolved independently multiple times. Yet at the level of individual species this percentage is much higher; for instance, 70% of the domain combinations found in the human genome have evolved independently in other species at least once. The process of independent domain combination formation is not only widespread, but also affects domains of all functional categories.

Conclusions. The surprisingly large contribution of parallel evolution to the development of the repertoire of multidomain protein architectures in extant genomes has profound consequences for our understanding of the evolution of pathways and cellular processes in eukaryotes and for comparative functional genomics.
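The maximum parsimony reasoning above can be illustrated for a single binary character (presence/absence of a domain combination) with Fitch's algorithm: a parsimony score above one for a combination absent at the root implies multiple independent gains. The tree and leaf states below are invented for illustration:

```python
# Minimal Fitch parsimony for one binary character on a species tree,
# returning the minimum number of state changes (gains + losses).
# Tree topology and leaf states are invented, not the study's data.

def fitch(tree, states):
    """tree: nested 2-tuples of leaf names; returns (state set, change count)."""
    if isinstance(tree, str):                       # leaf node
        return {states[tree]}, 0
    (l_set, l_ch) = fitch(tree[0], states)
    (r_set, r_ch) = fitch(tree[1], states)
    inter = l_set & r_set
    if inter:
        return inter, l_ch + r_ch
    return l_set | r_set, l_ch + r_ch + 1           # union step costs one change

# a combination present only in human and fly: two changes, most
# parsimoniously two independent gains
tree = ((("human", "mouse"), "fish"), (("fly", "worm"), "yeast"))
states = {"human": 1, "mouse": 0, "fish": 0, "fly": 1, "worm": 0, "yeast": 0}
print(fitch(tree, states)[1])  # 2
```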
J44 - An Assessment of Protein Non-folding in Protozoan Genomes in a Database Framework
Short Abstract: Intrinsically disordered proteins can be defined as a class of proteins which fail to form rigid three-dimensional structures under physiological conditions, either along their entire lengths or only in localized regions. Because several different disorder prediction algorithms exist and there is no consensus definition of protein disorder, a wide range of methodologies is in use. Our main goal is to develop a computational pipeline that combines key features from different methodologies with a user-defined level of sensitivity and specificity for high-throughput disorder prediction. In this context, we are using information from protozoan proteomes and developing a database framework aiming to establish biological correlations between protein structural disorder and host-parasite interactions. The computational framework adopted was divided into four steps: 1) prediction of IUPs by means of several algorithms; 2) integrative analysis of the data in a MySQL database; 3) improvement of the predictions and analyses through the development of parsers; and 4) functional annotation with motifs based on Pfam, PRINTS, ProDom, PROSITE, InterPro and Gene Ontology terms. Comparative results on the natively unfolded protein content of five Leishmania species, together with functional annotation, GO assignments, and the developed pipeline and database, will be presented.
J45 - WebMGA: Customizable Service for Fast Metagenomic Sequence Analysis
Short Abstract: With the advances in next-generation sequencing techniques, researchers are facing tremendous challenges in metagenomic data analysis. Metagenomic analysis often involves a large variety of software tools, which are difficult for common users to install and maintain, especially on a computer cluster. The few metagenomic analysis servers, such as MG-RAST and CAMERA, require little work from users, but they have various constraints such as login requirements, long waiting times, and inability to configure pipelines. Here we developed WebMGA, a customizable web service for fast metagenomic analysis. WebMGA includes over 20 commonly used tools for tasks such as ORF calling, sequence clustering, quality control of raw reads, removal of sequencing artifacts and contamination, taxonomic analysis, and functional annotation. WebMGA provides users with rapid metagenomic data analysis using fast and effective algorithms. All tools behind WebMGA were implemented to run in parallel on our local computer cluster. Users can access the WebMGA server through web browsers, or use scripts to perform individual analyses or to configure and run customized pipelines through RESTful web services. WebMGA is freely available at http://weizhongli-lab.org/metagenomic-analysis.
J46 - CD-HIT-OTU: Rapid and Accurate Identification of Microbial Diversity from rRNA Tags
Short Abstract: Accurate identification of microbial diversity in environmental samples using ribosomal RNA tags remains challenging, especially for the emerging large-scale Illumina-based sequences (i.e., iTags). Intrinsic sequencing errors, artifacts and very large data sizes all contribute to the complexity of diversity analysis. CD-HIT-OTU is a new tool we developed with ultra-fast clustering algorithms and effective methods for identifying chimeric reads and eliminating erroneous tags. The tool accurately identifies Operational Taxonomic Units from single- or paired-end Illumina reads. It took just a few minutes to process mock datasets with millions of reads; for real environmental datasets with thousands of species, our method used only about one hour. For paired-end reads, our error-tolerating assembly method assembles reads with mismatches and therefore utilizes more raw data than existing methods. CD-HIT-OTU is available through our CD-HIT package.
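The core clustering idea can be sketched as greedy identity-threshold clustering in the spirit of CD-HIT: reads join the first cluster whose representative they match at or above the cutoff, otherwise they seed a new cluster. The naive per-position identity below is a simplified stand-in for CD-HIT's short-word filtering and is shown for equal-length tags only:

```python
# Greedy identity-threshold OTU clustering sketch. The identity function is
# a naive stand-in for CD-HIT's word-counting heuristics; cutoff and reads
# below are invented examples.

def identity(a, b):
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

def cluster_otus(reads, cutoff=0.97):
    reps, clusters = [], []
    for read in sorted(reads, key=len, reverse=True):  # longest first
        for i, rep in enumerate(reps):
            if identity(read, rep) >= cutoff:
                clusters[i].append(read)               # join existing cluster
                break
        else:
            reps.append(read)                          # seed a new cluster
            clusters.append([read])
    return clusters

reads = ["ACGTACGTAC", "ACGTACGTAT", "TTTTGGGGCC", "TTTTGGGGCC"]
print(len(cluster_otus(reads, cutoff=0.9)))  # 2 OTUs
```

Each resulting cluster corresponds to one OTU, with the seed read as its representative.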
J47 - Simultaneous RNA-Seq-based Transcript Inference and Quantification in Multiple Samples
Short Abstract: High-throughput sequencing of mRNA (RNA-Seq) led to the expectation of tremendous improvements
in the detection of expressed genes and transcripts. However, the immense dynamic range
of gene expression, biases from sequencing, library preparation and read mapping, and
the unexpected complexity of the transcriptional landscape pose profound computational
challenges. The latter can lead to a combinatorial explosion in the number of
potential transcripts that can qualitatively explain the observed read data. To find the
correct set of transcripts, long-range dependencies have to be resolved.
Using simple toy examples, we show that state-of-the-art tools fail to resolve
these dependencies even when sufficient information is provided.

By treating the transcript recognition problem as a combinatorial optimization problem, we
unlock a great arsenal of techniques that cannot be applied in continuous optimization.
Firstly, a set of up to k transcripts which gives the optimal quantitative explanation for
the observed RNA-Seq reads can be computed without enumerating all possible
transcripts. Secondly, sparsity can be enforced by penalizing the number of transcripts
needed to quantitatively explain the reads.
Thirdly, we can share information among multiple RNA-Seq samples and thereby provably increase
the power to resolve long-range dependencies.

These conceptual improvements translate to substantial gains in transcript recognition
on artificial reads for the human genome. We are currently applying our system to the Drosophila modENCODE data, consisting of 53 RNA-Seq data sets with several samples each, and intend to present results at the meeting.
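The subset-selection-with-sparsity idea above can be illustrated in miniature: pick at most k transcripts (exon-inclusion vectors) whose summed abundances best explain observed exon coverage, with a penalty per transcript used. The exhaustive search over a small abundance grid below is a toy stand-in for the paper's optimization machinery, and all numbers are invented:

```python
# Toy sparse transcript selection: minimize squared residual between
# predicted and observed exon coverage plus a per-transcript penalty.
# Exhaustive search over a tiny abundance grid; invented example data.
from itertools import combinations, product

def best_explanation(candidates, coverage, k=2, penalty=1.0, grid=(1, 2, 3)):
    best = (float("inf"), None)
    for r in range(1, k + 1):                       # number of transcripts used
        for subset in combinations(range(len(candidates)), r):
            for abund in product(grid, repeat=r):   # abundance per transcript
                pred = [sum(a * candidates[t][e] for t, a in zip(subset, abund))
                        for e in range(len(coverage))]
                cost = sum((p - c) ** 2 for p, c in zip(pred, coverage))
                cost += penalty * r                 # sparsity penalty
                if cost < best[0]:
                    best = (cost, [(candidates[t], a) for t, a in zip(subset, abund)])
    return best

# three exons; coverage explained by (1,1,0) at 2x plus (0,1,1) at 1x
candidates = [(1, 1, 0), (0, 1, 1), (1, 1, 1)]
coverage = (2, 3, 1)
cost, chosen = best_explanation(candidates, coverage)
print(chosen)
```

The penalty term plays the role of the sparsity enforcement described above: an extra transcript is accepted only if it reduces the residual by more than the penalty.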
