
Proceedings Track Presentations

All Highlights and Proceedings Track presentations are organized by scientific area as part of the combined Paper Presentation schedule.


Alternative polyadenylation
Bioinformatics
Cancer
Cryo-electron tomography
de Bruijn sequence
DNA binding specificity
dose-response analysis
Drug-Target Interaction
Feature selection
Gene Ontology
Haematopoietic stem cell
Haplotype
Host-pathogen protein interaction
Identity-by-Descent
Image processing
Metabolic network
Metabolic networks
Metabolic pathways
microRNA
Network topology
Nonparametric sparse Bayesian factor analysis
Parameter estimation
Pathway inference
PCA
Poly(A) motif
Population genetics
PPI-Network
PRM: Protein Recognition module
Protein contact map prediction
Protein interaction evolution
Protein structure prediction
Protein Threading
Pseudogene
RNA
RNA structure prediction
Sequence Analysis
Sequencing
Short read alignment
Text mining
Transcriptome assembling
Other


Proceedings Track: Alternative polyadenylation
Presenting author: Dina Hafez, Duke University, United States
Sunday, July 21: 3:40 p.m. - 4:05 p.m. Room: Hall 14.2

Area Session Chair: Cenk Sahinalp

Presentation Overview:
Motivation: Pre-mRNA cleavage and polyadenylation is an essential step for 3' end maturation and for the subsequent stability and degradation of mRNAs. This process is highly controlled by cis-regulatory elements surrounding the cleavage site (polyA site), which are frequently constrained by sequence content and position. More than 50% of human transcripts have multiple functional polyA sites, and the specific use of alternative polyA sites (APA) results in isoforms with varying 3'UTRs, thus affecting gene regulation. Elucidating the regulatory mechanisms underlying differential polyA preferences in multiple cell types has been hindered both by the lack of suitable data on the precise location of cleavage sites and by the lack of appropriate tests for determining APAs with significant differences across multiple libraries. Results: We applied a tailored paired-end RNA-seq protocol to specifically probe the position of polyA sites in three adult cell types. We specified a linear effects regression model to identify tissue-specific biases indicating regulated alternative polyadenylation; the significance of differences between cell types was assessed by an appropriately designed permutation test. This combination allowed us to identify highly specific subsets of APA events in the individual cell types. Predictive models successfully classified constitutive polyA sites from a biologically relevant background (auROC = 99.6%), as well as tissue-specific regulated sets from each other. We found that the main cis-regulatory elements described for polyadenylation are a strong, and highly informative, hallmark for constitutive sites only. Tissue-specific regulated sites were found to contain other regulatory motifs, with the canonical PAS signal being nearly absent at brain-specific polyA sites. Together, our results contribute to the understanding of the diversity of post-transcriptional gene regulation.
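As a rough illustration of the kind of test involved (not the authors' exact design, which models biases across multiple libraries), a label-permutation test on hypothetical polyA-site usage values between two cell types might look like:

```python
import random

def permutation_test(a, b, n_perm=10000, seed=0):
    """Two-sample permutation test on the difference of means.

    Returns the fraction of label permutations whose absolute mean
    difference is at least as extreme as the observed one."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = a + b
    n_a = len(a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:n_a], pooled[n_a:]
        diff = abs(sum(perm_a) / n_a - sum(perm_b) / len(perm_b))
        if diff >= observed:
            hits += 1
    return hits / n_perm

# Hypothetical relative usage of a distal polyA site in two cell types.
brain = [0.82, 0.91, 0.88, 0.85, 0.90]
liver = [0.41, 0.38, 0.45, 0.40, 0.44]
p = permutation_test(brain, liver)
```

With completely separated groups like these, the permuted differences rarely reach the observed one, so the estimated p-value is small.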

Proceedings Track: Bioinformatics
Presenting author: Sarah Aerni, Stanford University, United States
Monday, July 22: 2:10 p.m. - 2:35 p.m. Room: Hall 7

Area Session Chair: Stefan Kramer

Presentation Overview:
Motivation: Advances in high-resolution microscopy have recently made possible the analysis of gene expression at the level of individual cells. The fixed lineage of cells in the adult worm C. elegans makes this organism an ideal model for studying complex biological processes like development and aging. However, annotating individual cells in images of adult C. elegans typically requires expertise and significant manual effort. Automation of this task is therefore critical to enabling high-resolution studies of a large number of genes. Results: In this paper, we describe an automated method for annotating a subset of 154 cells (including various muscle, intestinal, and hypodermal cells) in high-resolution images of adult C. elegans. We formulate the task of labeling cells within an image as a combinatorial optimization problem, where the goal is to minimize a scoring function that compares cells in a test input image with cells from a training atlas of manually annotated worms according to various spatial and morphological characteristics. We propose an approach for solving this problem based on reduction to minimum-cost maximum flow and apply a cross-entropy based learning algorithm to tune the weights of our scoring function. We achieve 84% median accuracy across a set of 154 cell labels in this highly variable system. These results demonstrate the feasibility of the automatic annotation of microscopy-based images in adult C. elegans.
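The labeling objective can be illustrated on a toy instance. The sketch below solves the same assignment problem by brute force rather than the min-cost max-flow reduction used in the paper, and all costs are invented:

```python
from itertools import permutations

def best_labeling(cost):
    """Exhaustively find the assignment of atlas labels to cells that
    minimizes total matching cost (rows: cells, cols: atlas labels).
    Brute force stands in for the min-cost max-flow reduction, which
    is what makes the real 154-label problem tractable."""
    n = len(cost)
    best, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        c = sum(cost[i][perm[i]] for i in range(n))
        if c < best_cost:
            best_cost, best = c, perm
    return best, best_cost

# Hypothetical spatial/morphological dissimilarities between 3 cells
# in a test image and 3 atlas cells.
cost = [
    [1.0, 4.0, 5.0],
    [6.0, 1.5, 4.0],
    [5.0, 6.0, 2.0],
]
assignment, total = best_labeling(cost)
```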

Proceedings Track: Cancer
Presenting author: Andrew J. Sedgewick, University of Pittsburgh, United States
Tuesday, July 23: 11:30 a.m. - 11:55 a.m. Room: Hall 14.2

Area Session Chair: Lonnie Welch

Presentation Overview:
High-dimensional “-omics” profiling provides a detailed molecular view of individual cancers, however understanding the mechanisms by which tumors evade cellular defenses requires deep knowledge of the underlying cellular pathways within each cancer sample. We extended the PARADIGM algorithm (Vaske et al., 2010), a pathway analysis method for combining multiple “-omics” data types, to learn the strength and direction of 9139 gene and protein interactions curated from the literature. Using genomic and mRNA expression data from 1936 samples in The Cancer Genome Atlas (TCGA) cohort, we learned interactions that provided support for and relative strength of 7138 (78%) of the curated links. Gene set enrichment found that genes involved in the strongest interactions were significantly enriched for transcriptional regulation, apoptosis, cell cycle regulation, and response to tumor cells. Within the TCGA breast cancer cohort we assessed different interaction strengths between breast cancer subtypes, and found interactions associated with the MYC pathway and the ER alpha network to be among the most differential between basal and luminal A subtypes. PARADIGM with the Naive Bayesian assumption produced gene activity predictions that, when clustered, found groups of patients with better separation in survival than both the original version of PARADIGM and a version without the assumption. We found that this Naive Bayes assumption was valid for the vast majority of co-regulators, indicating that most co-regulators act independently on their shared target. Availability: http://paradigm.five3genomics.com

Proceedings Track: Cryo-electron tomography
Presenting author: Min Xu, University of Southern California, United States
Tuesday, July 23: 3:10 p.m. - 3:35 p.m. Room: ICC Lounge 81

Area Session Chair: Donna Slonim

Presentation Overview:
Motivation: Cryo-electron tomography allows the imaging of macromolecular complexes in near-living conditions. To enhance the nominal resolution of a structure it is necessary to align and average individual subtomograms, each containing identical complexes. However, if the sample of complexes is heterogeneous, it is necessary to first classify subtomograms into groups of identical complexes. This task becomes challenging when tomograms contain mixtures of unknown complexes extracted from a crowded environment. Two main challenges must be overcome: First, classification of subtomograms must be performed without knowledge of template structures. However, most alignment methods are too slow to perform reference-free classification of a large number (e.g. tens of thousands) of subtomograms. Second, subtomograms extracted from crowded cellular environments often contain fragments of other structures besides the target complex. However, alignment methods generally assume that each subtomogram only contains one complex. Automatic methods are needed to identify the target complexes in a subtomogram even when its shape is unknown. Results: In this paper, we propose an automatic and systematic method for the isolation and masking of target complexes in subtomograms extracted from crowded environments. Moreover, we also propose a fast alignment method using fast rotational matching in real space. Our experiments show that, compared to our previously proposed fast alignment method in reciprocal space, our new method significantly improves the alignment accuracy for highly distorted and especially crowded subtomograms. Such improvements are important for achieving successful and unbiased high-throughput reference-free structural classification of complexes inside whole cell tomograms.

Proceedings Track: de Bruijn sequence
Presenting author: Yaron Orenstein, Tel-Aviv University, Israel
Tuesday, July 23: 11:30 a.m. - 11:55 a.m. Room: Hall 4/5

Area Session Chair: Debra Goldberg

Presentation Overview:
Novel technologies can generate large sets of short double-stranded DNA sequences that can be used to measure their regulatory effects. Microarrays can measure in vitro the binding intensity of a protein to thousands of probes. Synthetic enhancer sequences inserted into an organism's genome allow us to measure in vivo the effect of such sequences on the phenotype. In both applications, by using sequence probes that cover all k-mers, a comprehensive picture of the effect of all possible short sequences on gene regulation is obtained. The value of k that can be used in practice is, however, severely limited by cost and space considerations. A key challenge is therefore to cover all k-mers with a minimal number of probes. The standard way to do this uses the de Bruijn sequence of length 4^k. However, since probes are double stranded, when a k-mer is included in a probe, its reverse complement k-mer is accounted for as well. Here we show how to efficiently create a shortest possible sequence with the property that it contains each k-mer or its reverse complement, but not necessarily both. The length of the resulting sequence approaches half that of the de Bruijn sequence as k increases. By reducing the total sequence length, experimental limitations can be overcome; alternatively, additional sequences with redundant k-mers of interest can be added.
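For very small k, a shortest sequence covering every k-mer or its reverse complement can be found by exhaustive search, which makes the problem statement concrete. The paper's contribution is an efficient construction that scales; this brute-force breadth-first sketch does not:

```python
from collections import deque
from itertools import product

COMP = {"A": "T", "C": "G", "G": "C", "T": "A"}

def canonical(kmer):
    """Representative of the {k-mer, reverse complement} pair."""
    rc = "".join(COMP[b] for b in reversed(kmer))
    return min(kmer, rc)

def shortest_rc_cover(k):
    """Exhaustive BFS for a shortest DNA sequence containing every
    k-mer or its reverse complement.  Feasible only for tiny k."""
    target = frozenset(canonical("".join(p))
                       for p in product("ACGT", repeat=k))
    queue = deque()
    seen = set()
    for p in product("ACGT", repeat=k - 1):
        suffix = "".join(p)
        queue.append((suffix, frozenset(), suffix))
        seen.add((suffix, frozenset()))
    while queue:
        suffix, covered, seq = queue.popleft()
        if covered == target:
            return seq  # BFS guarantees this is a shortest cover
        for b in "ACGT":
            kmer = suffix + b
            state = (kmer[1:], covered | {canonical(kmer)})
            if state not in seen:
                seen.add(state)
                queue.append(state + (seq + b,))
    return None

seq = shortest_rc_cover(2)
```

For k = 2 the full de Bruijn sequence has linear length 4^2 + 1 = 17, while the reverse-complement-aware cover found here is shorter, illustrating the saving described above.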

Proceedings Track: DNA binding specificity
Presenting author: Fantine Mordelet, Duke University, United States
Sunday, July 21: 11:00 a.m. - 11:25 a.m. Room: Hall 4/5

Area Session Chair: Erik Bongcam-Rudloff

Presentation Overview:

Motivation: The DNA binding specificity of a transcription factor (TF) is typically represented using a position weight matrix (PWM) model, which implicitly assumes that individual bases in a TF binding site contribute independently to the binding affinity, an assumption that does not always hold. For this reason, more complex models of binding specificity have been developed. However, these models have their own caveats: they typically have a large number of parameters, which makes them hard to learn and interpret. Results: We propose novel regression-based models of TF-DNA binding specificity, trained using high resolution in vitro data from custom protein binding microarray (PBM) experiments. Our PBMs are specifically designed to cover a large number of putative DNA binding sites for the TFs of interest (yeast TFs Cbf1 and Tye7, and human TFs c-Myc, Max, and Mad2) in their native genomic context. These high-throughput, quantitative data are well suited for training complex models that take into account not only independent contributions from individual bases, but also contributions from di- and trinucleotides at various positions within or near the binding sites. To ensure that our models remain interpretable, we use feature selection to identify a small number of sequence features that accurately predict TF-DNA binding specificity. To further illustrate the accuracy of our regression models, we show that even in the case of paralogous TFs with highly similar PWMs, our new models can distinguish the specificities of individual factors. Thus, our work represents an important step towards better sequence-based models of individual TF-DNA binding specificity.

Availability: Our code is available at http://genome.duke.edu/labs/gordan/ISMB2013. The PBM data used in this paper are available in the Gene Expression Omnibus under accession number GSE44604.
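The kind of positional mono- and dinucleotide feature space such regression models draw on can be sketched as follows (a simplified illustration, not the authors' exact encoding):

```python
from itertools import product

def positional_features(seq, max_k=2):
    """Encode a putative binding site as binary positional k-mer
    features (k = 1 and 2 here), the kind of feature space a
    regression model with feature selection can draw on.  Returns a
    dict mapping feature names like 'pos1_AC' to 0/1."""
    feats = {}
    for k in range(1, max_k + 1):
        for i in range(len(seq) - k + 1):
            for kmer in ("".join(p) for p in product("ACGT", repeat=k)):
                feats["pos%d_%s" % (i, kmer)] = int(seq[i:i + k] == kmer)
    return feats

# E-box site CACGTG, the motif bound by c-Myc/Max.
x = positional_features("CACGTG")
```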



Proceedings Track: dose-response analysis
Presenting author: Russell Schwartz, Carnegie Mellon University, United States
Tuesday, July 23: 2:40 p.m. - 3:05 p.m. Room: ICC Lounge 81

Area Session Chair: Donna Slonim

Presentation Overview:
Motivation: Development and progression of solid tumors can be attributed to a process of mutations, which typically includes changes in the number of copies of genes or genomic regions. Although comparisons of cells within single tumors show extensive heterogeneity, recurring features of their evolutionary process may be discerned by comparing multiple regions or cells of a tumor. A particularly useful source of data for studying likely progression of individual tumors is fluorescence in situ hybridization (FISH), which allows one to count copy numbers of several genes in hundreds of single cells. Novel algorithms for interpreting such data phylogenetically are needed, however, to reconstruct likely evolutionary trajectories from the states of single cells and to facilitate their analysis. Results: In this paper, we develop phylogenetic methods to infer likely models of tumor progression using FISH copy number data and apply them to a study of FISH data from two cancer types. Statistical analyses of topological characteristics of the tree-based model provide insights into likely tumor progression pathways consistent with the prior literature. Furthermore, tree statistics from the resulting phylogenies can be used as features for prediction methods. This results in improved accuracy, relative to unstructured gene copy number data, at predicting tumor state and future metastasis. Availability: A package of source code for FISH tree building (FISHtrees) and the data on cervical cancer and breast cancer examined here are publicly available at the site ftp://ftp.ncbi.nlm.nih.gov/pub/FISHtrees.
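A minimal stand-in for tree building over single-cell copy-number profiles is a minimum spanning tree under rectilinear (L1) distance; FISHtrees uses more elaborate models, and the profiles below are invented:

```python
def l1(a, b):
    """Rectilinear (L1) distance between two copy-number profiles."""
    return sum(abs(x - y) for x, y in zip(a, b))

def mst_edges(profiles):
    """Prim's algorithm over single-cell copy-number profiles; a
    simplified stand-in for the tree building in FISHtrees."""
    n = len(profiles)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        best = min((l1(profiles[i], profiles[j]), i, j)
                   for i in in_tree for j in range(n) if j not in in_tree)
        edges.append((best[1], best[2]))
        in_tree.add(best[2])
    return edges

# Hypothetical copy counts of 4 probes in 4 cells (diploid root first);
# each step gains one or two copies, so the MST recovers the chain.
cells = [(2, 2, 2, 2), (3, 2, 2, 2), (3, 3, 2, 2), (3, 3, 4, 2)]
tree = mst_edges(cells)
```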

Proceedings Track: Drug-Target Interaction
Presenting author: Jianyang Zeng, Tsinghua University, China
Monday, July 22: 11:30 a.m. - 11:55 a.m. Room: Hall 14.2

Area Session Chair: Serafim Batzoglou

Presentation Overview:
Motivation: In silico prediction of drug-target interactions plays an important role in identifying and developing new uses of existing or abandoned drugs. Network-based approaches have recently become a popular tool for discovering new drug-target interactions. Unfortunately, most of these network-based approaches can only predict binary interactions between drugs and targets, and information about different types of interactions has not been well exploited for drug-target interaction prediction in previous studies. On the other hand, incorporating additional information about drug-target relationships or drug modes of action can improve prediction of drug-target interactions. Furthermore, the predicted types of drug-target interactions can broaden our understanding of the molecular basis of drug action. Results: We propose the first machine learning approach to integrate multiple types of drug-target interactions and predict unknown drug-target relationships or drug modes of action. We cast the new drug-target interaction prediction problem into a two-layer graphical model, called a restricted Boltzmann machine (RBM), and apply a practical learning algorithm to train our model and make predictions. Tests on two public databases show that our RBM model can effectively capture the latent features of a drug-target interaction network and achieve excellent performance on predicting different types of drug-target interactions, with an area under the precision-recall curve (AUPR) of up to 89.6. In addition, we demonstrate that integrating multiple types of drug-target interactions can significantly outperform predictions that either simply mix multiple types of interactions without distinction or use only a single interaction type. Further tests show that our approach can infer a high fraction of novel drug-target interactions that have been validated by known experiments in the literature or other databases.
These results indicate that our approach is of high practical relevance to drug-target interaction prediction and drug repositioning, and hence can advance the drug discovery process. Availability: Software and datasets are available upon request.
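A generic binary RBM trained with one-step contrastive divergence (CD-1) conveys the model class. This sketch is not the published method, which adds visible units encoding each interaction type; it also uses mean-field probabilities in place of sampled states, so it runs deterministically:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class RBM:
    """Minimal binary restricted Boltzmann machine trained with CD-1."""

    def __init__(self, n_visible, n_hidden, seed=0):
        rng = random.Random(seed)
        self.w = [[rng.gauss(0, 0.1) for _ in range(n_hidden)]
                  for _ in range(n_visible)]
        self.b_v = [0.0] * n_visible
        self.b_h = [0.0] * n_hidden

    def hidden_probs(self, v):
        return [sigmoid(self.b_h[j] +
                        sum(v[i] * self.w[i][j] for i in range(len(v))))
                for j in range(len(self.b_h))]

    def visible_probs(self, h):
        return [sigmoid(self.b_v[i] +
                        sum(h[j] * self.w[i][j] for j in range(len(h))))
                for i in range(len(self.b_v))]

    def cd1_update(self, v0, lr=0.1):
        # one up-down-up pass; probabilities stand in for samples
        h0 = self.hidden_probs(v0)
        v1 = self.visible_probs(h0)
        h1 = self.hidden_probs(v1)
        for i in range(len(v0)):
            for j in range(len(h0)):
                self.w[i][j] += lr * (v0[i] * h0[j] - v1[i] * h1[j])
        return v1

# Hypothetical drug-target incidence rows (1 = observed interaction).
data = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1]]
rbm = RBM(n_visible=4, n_hidden=2)
for _ in range(50):
    for v in data:
        rbm.cd1_update(v)
probs = rbm.hidden_probs(data[0])
```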

Proceedings Track: Feature selection
Presenting author: Chloé-Agathe Azencott, Max-Planck-Institutes Tübingen, Germany
Monday, July 22: 12:00 p.m. - 12:25 p.m. Room: Hall 4/5

Area Session Chair: Russell Schwartz

Presentation Overview:
As an increasing number of genome-wide association studies reveal the limitations of the attempt to explain phenotypic heritability by single genetic loci, there is a recent focus on associating complex phenotypes with sets of genetic loci. While several methods for multi-locus mapping have been proposed, it is often unclear how to relate the detected loci to the growing knowledge about gene pathways and networks. The few methods that take biological pathways or networks into account are either restricted to investigating a limited number of predetermined sets of loci, or do not scale to genome-wide settings. We present SConES, a new efficient method to discover sets of genetic loci that are maximally associated with a phenotype, while being connected in an underlying network. Our approach is based on a minimum cut reformulation of the problem of selecting features under sparsity and connectivity constraints, which can be solved exactly and rapidly. SConES outperforms state-of-the-art competitors in terms of runtime, scales to hundreds of thousands of genetic loci and exhibits higher power in detecting causal SNPs in simulation studies than other methods. On flowering time phenotypes and genotypes from Arabidopsis thaliana, SConES detects loci that enable accurate phenotype prediction and that are supported by the literature.
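The min-cut reformulation can be sketched with a standard graph-cut construction: each feature pays either its association deficit or a sparsity penalty, and network neighbors that disagree pay a smoothness penalty. Parameter names and the toy data are invented, and the exact SConES objective differs; this is only a sketch of the general technique:

```python
from collections import deque, defaultdict

def max_flow_min_cut(cap, s, t):
    """Edmonds-Karp max flow; returns the set of nodes on the source
    side of a minimum s-t cut."""
    adj = defaultdict(set)
    for u in list(cap):
        for v in cap[u]:
            adj[u].add(v)
            adj[v].add(u)
    flow = defaultdict(float)  # keyed by (u, v)

    def residual(u, v):
        return cap.get(u, {}).get(v, 0.0) - flow[(u, v)] + flow[(v, u)]

    while True:
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v in adj[u]:
                if v not in parent and residual(u, v) > 1e-12:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            break
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(residual(u, v) for u, v in path)
        for u, v in path:
            flow[(u, v)] += push
    # source side = nodes still reachable in the residual graph
    side, queue = {s}, deque([s])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in side and residual(u, v) > 1e-12:
                side.add(v)
                queue.append(v)
    return side

def select_features(scores, edges, sparsity, smoothness):
    """Select features by a single s-t min cut: association score c_i
    vs. a sparsity penalty on each feature, plus a smoothness penalty
    on each network edge whose endpoints disagree."""
    cap = defaultdict(dict)
    for i, c in enumerate(scores):
        if c > sparsity:
            cap["s"][i] = c - sparsity
        else:
            cap[i]["t"] = sparsity - c
    for i, j in edges:
        cap[i][j] = cap[i].get(j, 0.0) + smoothness
        cap[j][i] = cap[j].get(i, 0.0) + smoothness
    return max_flow_min_cut(cap, "s", "t") - {"s"}

# Path network 0-1-2-3; feature 3 is weakly associated and cheap to drop.
selected = select_features([5.0, 5.0, 5.0, 0.1], [(0, 1), (1, 2), (2, 3)],
                           sparsity=1.0, smoothness=0.2)
```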

Proceedings Track: Gene Ontology
Presenting author: Wyatt Clark, Indiana University, United States
Tuesday, July 23: 2:40 p.m. - 3:05 p.m. Room: Hall 4/5

Area Session Chair: Reinhard Schneider

Presentation Overview:
The development of effective methods for the prediction of ontological annotations is an important goal in computational biology, with protein function prediction and disease gene prioritization gaining wide recognition. While various algorithms have been proposed for these tasks, evaluating their performance is difficult due to problems caused both by the structure of biomedical ontologies and biased or incomplete experimental annotations of genes and gene products. In this work, we propose an information-theoretic framework to evaluate the performance of computational protein function prediction. We use a Bayesian network, structured according to the underlying ontology, to model the prior probability of a protein's function. We then define two concepts, misinformation and remaining uncertainty, that can be seen as information-theoretic analogs of precision and recall. Finally, we propose a single statistic, referred to as semantic distance, that can be used to rank or train classification models. We evaluate our approach by analyzing the performance of three protein function predictors of Gene Ontology terms and provide evidence that we address several weaknesses of currently used metrics. We believe this framework provides useful insights into the performance of protein function prediction tools.
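On a toy ontology, remaining uncertainty and misinformation reduce to sums of information content over set differences of ancestor closures; the terms and information-content values below are invented:

```python
def ancestors(term, parents):
    """All ancestors of a term (including itself) in the ontology DAG."""
    out, stack = {term}, [term]
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in out:
                out.add(p)
                stack.append(p)
    return out

def semantic_distance(true_terms, pred_terms, parents, ic):
    """Remaining uncertainty (ru), misinformation (mi) and their sum,
    computed over ancestor closures of the two annotation sets.
    ic maps each term to its information content."""
    T = set().union(*(ancestors(t, parents) for t in true_terms))
    P = set().union(*(ancestors(t, parents) for t in pred_terms))
    ru = sum(ic[t] for t in T - P)   # true but not predicted
    mi = sum(ic[t] for t in P - T)   # predicted but not true
    return ru, mi, ru + mi

# Toy ontology: root -> {binding, catalysis}, binding -> dna_binding.
parents = {"binding": ["root"], "catalysis": ["root"],
           "dna_binding": ["binding"]}
ic = {"root": 0.0, "binding": 1.0, "catalysis": 1.5, "dna_binding": 2.3}
ru, mi, sd = semantic_distance({"dna_binding"}, {"catalysis"}, parents, ic)
```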

Proceedings Track: Haematopoietic stem cell
Presenting author: Nicola Bonzanni, VU University Amsterdam, Netherlands
Tuesday, July 23: 12:00 p.m. - 12:25 p.m. Room: Hall 14.2

Area Session Chair: Lonnie Welch

Presentation Overview:
Motivation: Combinatorial interactions of transcription factors with cis-regulatory elements control the dynamic progression through successive cellular states and thus underpin all metazoan development. The construction of network models of cis-regulatory elements therefore has the potential to generate fundamental insights into cellular fate and differentiation. Haematopoiesis has long served as a model system to study mammalian differentiation, yet modelling based on experimentally informed cis-regulatory interactions has so far been restricted to pairs of interacting factors. Here we have generated a Boolean network model based on detailed cis-regulatory functional data connecting 11 haematopoietic stem/progenitor cell (HSPC) regulator genes. Results: Despite its apparent simplicity, the model exhibits surprisingly complex behaviour that we charted using strongly connected components and shortest-path analysis in its Boolean state space. This analysis of our model predicts that HSPCs display heterogeneous expression patterns and possess many intermediate states that can act as 'stepping stones' for the HSPC to achieve a final differentiated state. Importantly, an external perturbation or 'trigger' is required to exit the stem cell state, with distinct triggers characterising maturation into the various different lineages. By focusing on intermediate states occurring during erythrocyte differentiation, we predicted from our model a novel negative regulation of Fli1 by Gata1, which we confirmed experimentally, thus validating our model. In conclusion, we demonstrate that an advanced mammalian regulatory network model based on experimentally validated cis-regulatory interactions has allowed us to make novel, experimentally testable hypotheses about transcriptional mechanisms that control differentiation of mammalian stem cells.
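Exhaustive state-space enumeration is feasible for small Boolean networks. The toy two-gene cross-antagonism below is loosely inspired by the Gata1/Fli1 relationship discussed in the abstract; it is not the published 11-gene model:

```python
from itertools import product

def fixed_points(update, n_genes):
    """Enumerate the full Boolean state space and return the states
    that map to themselves (attractor fixed points)."""
    return [state for state in product((0, 1), repeat=n_genes)
            if update(state) == state]

# Toy mutual repression: each gene is ON iff the other is OFF.
def update(state):
    g1, g2 = state
    return (int(not g2), int(not g1))

fps = fixed_points(update, 2)
```

The two fixed points correspond to the two mutually exclusive "differentiated" states of the toy switch, while (0, 0) and (1, 1) oscillate between each other.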

Proceedings Track: Haplotype
Presenting author: Derek Aguiar, Brown University, United States
Monday, July 22: 2:10 p.m. - 2:35 p.m. Room: Hall 14.2

Area Session Chair: Sean O'Donoghue

Presentation Overview:
Motivation: Genome-wide haplotype reconstruction from sequence data, or haplotype assembly, is at the center of major challenges in molecular biology and life sciences. For complex eukaryotic organisms like humans, the genome is vast and the population samples are growing so rapidly that algorithms processing these high-throughput sequencing data must scale favorably in terms of both accuracy and computational efficiency. Furthermore, current models and methodologies for haplotype assembly (1) do not consider individuals sharing haplotypes jointly, which reduces the size and accuracy of assembled haplotypes, and (2) are unable to model genomes having more than two sets of homologous chromosomes (polyploidy). In particular, polyploid organisms are becoming the target of many research groups interested in the genomics of disease, phylogenetics, botany, and evolution, but there is an absence of theory and methods for polyploid haplotype reconstruction. Results: In this work, we present a number of results, extensions, and generalizations of Compass graphs and our HapCompass framework (Aguiar et al. 2012). We prove the theoretical complexity of two haplotype assembly optimizations, thereby motivating the use of heuristics. We present graph theory-based algorithms for the problem of haplotype assembly from sequencing data using our previously developed HapCompass framework for (1) novel implementations of haplotype assembly optimizations (minimum error correction), (2) assembly of a pair of individuals sharing a tract identical by descent, and (3) assembly of polyploid genomes. We demonstrate the accuracy of each method on the 1000 Genomes Project, Pacific Biosciences, and simulated sequence data. HapCompass is available for download at http://www.brown.edu/Research/Istrail_Lab/
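The minimum error correction (MEC) objective mentioned above can be computed directly for a candidate haplotype pair on a toy diploid instance (reads and haplotypes invented):

```python
def mec_score(reads, h1, h2):
    """Minimum error correction (MEC) score of a candidate haplotype
    pair: each read is charged the smaller of its mismatch counts
    against the two haplotypes (None marks uncovered positions)."""
    def mismatches(read, hap):
        return sum(1 for r, h in zip(read, hap)
                   if r is not None and r != h)
    return sum(min(mismatches(r, h1), mismatches(r, h2)) for r in reads)

# Hypothetical reads over 4 heterozygous sites (alleles coded 0/1).
reads = [
    (0, 0, None, None),
    (None, 0, 0, 1),   # one sequencing error vs. haplotype 1
    (1, 1, 1, None),
    (None, None, 1, 0),  # one error however it is assigned
]
h1 = (0, 0, 0, 0)
h2 = (1, 1, 1, 1)
score = mec_score(reads, h1, h2)
```

The assembly problem is to search over haplotype pairs (or read partitions) minimizing this score, which the abstract proves is hard in general.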

Proceedings Track: Host-pathogen protein interaction
Presenting author: Meghana Kshirsagar, Carnegie Mellon University, United States
Sunday, July 21: 2:40 p.m. - 3:05 p.m. Room: Hall 4/5

Area Session Chair: Olga Vitek

Presentation Overview:
Motivation: An important aspect of infectious disease research involves understanding the differences and commonalities in the infection mechanisms underlying various diseases. Systems biology based approaches study infectious diseases by analyzing the interactions between the host species and the pathogen organisms. This work aims to combine the knowledge from experimental studies of host-pathogen interactions in several diseases in order to build stronger predictive models. Our approach is based on a formalism from machine learning called 'multi-task learning', which considers the problem of building models across tasks that are related to each other. A 'task' in our scenario is the set of host-pathogen protein interactions involved in one disease. To integrate interactions from several tasks (i.e., diseases), our method exploits the similarity in the infection process across the diseases. In particular, we use the biological hypothesis that similar pathogens target the same critical biological processes in the host, in defining a common structure across the tasks. Results: Our current work on host-pathogen protein interaction prediction focuses on human as the host, and four bacterial species as pathogens. The multi-task learning technique we develop uses a task-based regularization approach. We find that the resulting optimization problem is a difference of convex (DC) functions. To optimize, we implement a Convex-Concave procedure based algorithm. We compare our integrative approach to baseline methods that build models on a single host-pathogen protein interaction dataset. Our results show that our approach outperforms the baselines on the training data. We further analyse the protein interaction predictions generated by the models, and find some interesting insights.

Proceedings Track: Identity-by-Descent
Presenting author: Dan He, IBM T.J. Watson, United States
Monday, July 22: 11:00 a.m. - 11:25 a.m. Room: Hall 4/5

Area Session Chair: Russell Schwartz

Presentation Overview:
Detecting Identity-by-Descent (IBD) is a very important problem in genetics. Most of the existing methods focus on detecting pairwise IBDs, which have relatively low power to detect short IBDs. Methods that detect IBDs among multiple individuals simultaneously, or group-wise IBDs, have better performance for short IBD detection. Group-wise IBDs can also be applied to a wide range of applications such as disease mapping and pedigree reconstruction. The existing group-wise IBD detection method is computationally inefficient and is only able to handle small data sets, such as 20-30 individuals with hundreds of SNPs. It also requires a prior specification of the number of IBD groups, which may not be realistic in many cases, and due to scalability issues it can only handle a small number of IBD groups, such as two or three. Moreover, it does not take linkage disequilibrium (LD) into consideration. In this work, we developed a very efficient method, IBD-Groupon, which detects group-wise IBDs based on pairwise IBD relationships and addresses all the drawbacks mentioned above. To our knowledge, our method is the first group-wise IBD detection method that scales to very large data sets, for example hundreds of individuals with thousands of SNPs, while remaining powerful enough to detect short IBDs. Our method does not need a prespecified number of IBD groups, which are instead detected automatically, and it takes LD into consideration, since LD can easily be incorporated into the underlying pairwise IBDs.
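One way to move from pairwise to group-wise IBD calls is to find maximal cliques in the pairwise IBD graph at a locus. This is a simplified sketch, not the IBD-Groupon algorithm itself, and the pairwise calls below are invented:

```python
def maximal_cliques(adj):
    """Bron-Kerbosch enumeration of maximal cliques; groups of
    individuals that are pairwise IBD at a locus are candidate
    group-wise IBD clusters."""
    cliques = []
    def bk(r, p, x):
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            bk(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}
    bk(set(), set(adj), set())
    return cliques

# Hypothetical pairwise IBD calls among 5 individuals at one locus.
adj = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1},   # individuals 0-2 mutually IBD
    3: {4}, 4: {3},
}
groups = [g for g in maximal_cliques(adj) if len(g) >= 2]
```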

Proceedings Track: Image processing
Presenting author: Saket Navlakha, Carnegie Mellon University, United States
Monday, July 22: 3:40 p.m. - 4:05 p.m. Room: Hall 7

Area Session Chair: Stefan Kramer

Presentation Overview:
Motivation: Synaptic connections underlie learning and memory in the brain and are dynamically formed and eliminated during development and in response to stimuli. Quantifying changes in overall density and strength of synapses is an important prerequisite for studying connectivity and plasticity in these cases or in diseased conditions. Unfortunately, most techniques to detect such changes are either low-throughput (e.g. electrophysiology), prone to error and difficult to automate (e.g. standard electron microscopy), or too coarse (e.g. MRI) to provide accurate and large-scale measurements. Results: To facilitate high-throughput analyses, we used a 50-year-old experimental technique to selectively stain for synapses in electron microscopy (EM) images, and we developed a machine learning framework to automatically detect synapses in these images. To validate our method we experimentally imaged brain tissue of the somatosensory cortex in six mice. We detected thousands of synapses in these images and demonstrate the accuracy of our approach using cross-validation with manually labeled data and by comparing against existing algorithms and against tools that process standard EM images. We also used a semi-supervised algorithm that leverages unlabeled data to overcome sample heterogeneity and improve performance. Our algorithms are highly efficient and scalable and are freely available for others to use.

Proceedings Track: Metabolic network
Presenting author: Masaaki Kotera, Kyoto University, Japan
Monday, July 22: 11:30 a.m. - 11:55 a.m. Room: ICC Lounge 81

Area Session Chair: Hagit Shatkay

Presentation Overview:
Motivation: The metabolic pathway is an important biochemical reaction network involving enzymatic reactions among chemical compounds. However, it is assumed that a large number of metabolic pathways remain unknown, and many reactions are still missing even in known pathways. Therefore, the most important challenge in metabolomics is the automated de novo reconstruction of metabolic pathways, which includes the elucidation of previously unknown reactions to bridge the metabolic gaps. Results: In this paper we develop a novel method to reconstruct metabolic pathways from a large compound set in the reaction-filling framework. We define feature vectors representing the chemical transformation patterns of compound-compound pairs in enzymatic reactions using chemical fingerprints. We apply a sparsity-induced classifier to learn what we refer to as "enzymatic-reaction likeness", i.e., whether or not compound pairs are possibly converted to each other by enzymatic reactions. The originality of our method lies in the search for potential reactions among many compounds at a time, in the extraction of reaction-related chemical transformation patterns, and in the large-scale applicability owing to the computational efficiency. In the results, we demonstrate the usefulness of our proposed method on the de novo reconstruction of 134 metabolic pathways in KEGG. Our comprehensively predicted reaction networks of 15,698 compounds enable us to suggest many potential pathways and to increase research productivity in metabolomics.

Proceedings Track: Metabolic networks
Presenting author: John Pinney, Imperial College London, United Kingdom
Sunday, July 21: 10:30 a.m. - 10:55 a.m. Room: Hall 4/5

Area Session Chair: Erik Bongcam-Rudloff

Presentation Overview:
Motivation: Misannotation in sequence databases is an important obstacle for automated tools for gene function annotation, which rely extensively on comparison to sequences with known function. To improve current annotations and prevent future propagation of errors, sequence-independent tools are therefore needed to assist in the identification of misannotated gene products. In the case of enzymatic functions, each functional assignment implies the existence of a reaction within the organism’s metabolic network; a first approximation to a genome-scale metabolic model can be obtained directly from an automated genome annotation. Any obvious problems in the network, such as dead-end or disconnected reactions, can therefore be strong indications of misannotation. Results: We demonstrate that a machine learning approach using only network topological features can successfully predict the validity of enzyme annotations. The predictions are tested at 3 different levels. A random forest using topological features of the metabolic network and trained on curated sets of correct and incorrect enzyme assignments was found to have an accuracy of up to 86% in 5-fold cross validation experiments. Further cross validation against unseen enzyme superfamilies indicates that this classifier can successfully extrapolate beyond the classes of enzyme present in the training data. The random forest model was applied to several automated genome annotations, achieving an accuracy of 60% in most cases when validated against recent genome-scale metabolic models. We also observe that when applied to draft metabolic networks for multiple species, a clear negative correlation is observed between predicted annotation quality and phylogenetic distance to the major model organism for biochemistry (Escherichia coli for prokaryotes and Homo sapiens for eukaryotes). Contact: j.pinney@imperial.ac.uk
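Dead-end detection, one of the topological danger signs mentioned above, is straightforward to sketch. Real networks additionally need exchange/boundary metabolites whitelisted, and the toy reactions below are invented:

```python
def dead_end_metabolites(reactions):
    """Flag metabolites that are produced but never consumed, or
    consumed but never produced, in a draft metabolic network --
    one of the topological signs of a possibly misannotated enzyme.
    reactions: list of (substrates, products) set pairs."""
    consumed, produced = set(), set()
    for subs, prods in reactions:
        consumed.update(subs)
        produced.update(prods)
    return (produced - consumed) | (consumed - produced)

# Toy draft network: C is produced but never consumed; A and E are
# only consumed (in practice, medium/exchange inputs like A would be
# whitelisted before flagging).
reactions = [
    ({"A"}, {"B"}),
    ({"B"}, {"C"}),
    ({"E"}, {"B"}),
]
suspects = dead_end_metabolites(reactions)
```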

Proceedings Track: Metabolic pathways
Presenting author: Cesim Erten, Kadir Has University
Tuesday, July 23: 2:40 p.m. - 3:05 p.m. Room: Hall 7

Area Session Chair: Alfonso Valencia

Presentation Overview:
Given a pair of metabolic pathways, an alignment of the pathways corresponds to a mapping between similar substructures of the pair. Successful alignments may find applications in phylogenetic tree reconstruction and drug design, and may enhance our overall understanding of cellular metabolism. We consider the problem of providing one-to-many alignments of reactions in a pair of metabolic pathways. We first provide a constrained alignment framework applicable to the problem. We show that the constrained alignment problem, even in a very primitive setting, is computationally intractable, which justifies efforts for designing efficient heuristics. We present our Constrained Alignment of Metabolic Pathways (CAMPWays) algorithm designed for this purpose. Through extensive experiments involving a large pathway database we demonstrate that, when compared to a state-of-the-art alternative, the CAMPWays algorithm provides better alignment results on metabolic networks as far as measures based on same-pathway inclusion are concerned. The execution speed of our algorithm constitutes yet another important improvement over alternative algorithms.
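The notion of a one-to-many reaction alignment can be sketched very simply (Python; this toy greedy thresholding is not the CAMPWays algorithm, and the reaction identifiers and similarity scores are invented):

```python
# Sketch: a toy one-to-many alignment between two pathways. Each
# reaction of pathway A is mapped to every reaction of pathway B whose
# similarity score clears a threshold. The real constrained-alignment
# problem is intractable in general; this is only the output shape.

def one_to_many_align(sim, threshold=0.5):
    """sim: {(a, b): score}. Returns {a: sorted list of matched b's}."""
    mapping = {}
    for (a, b), score in sim.items():
        if score >= threshold:
            mapping.setdefault(a, []).append(b)
    return {a: sorted(bs) for a, bs in mapping.items()}

sim = {("a1", "b1"): 0.9, ("a1", "b2"): 0.6, ("a2", "b3"): 0.4}
print(one_to_many_align(sim))  # a2 falls below the threshold
```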

Proceedings Track: microRNA
Presenting author: Hai-Son Le, Carnegie Mellon, United States
Tuesday, July 23: 2:40 p.m. - 3:05 p.m. Room: Hall 14.2

Area Session Chair: Ralf Zimmer

Presentation Overview:
Motivation: MicroRNAs (miRNAs) are small non-coding RNAs that regulate gene expression post-transcriptionally. MiRNAs were shown to play an important role in development and disease, and accurately determining the networks regulated by these miRNAs in a specific condition is of great interest. Early work on miRNA target prediction has focused on utilizing static sequence information. More recently, researchers have combined sequence and expression data to identify such targets in various conditions. Results: Here we propose a regression-based probabilistic method that integrates sequence, expression and interaction data to identify modules of mRNAs controlled by small sets of miRNAs. We formulate an optimization problem and develop a learning framework to determine the module regulation and membership. Applying our method to cancer data, we show that by adding protein interaction data and modeling combinatorial regulation our method can accurately identify both miRNAs and their targets, improving upon prior methods. We next used our method to jointly analyze a number of different types of cancers and identified both common and cancer-type-specific miRNA regulators.

Proceedings Track: Network topology
Presenting author: Carlo Vittorio Cannistraci, King Abdullah University of Science and Technology, Saudi Arabia
Monday, July 22: 11:00 a.m. - 11:25 a.m. Room: ICC Lounge 81

Area Session Chair: Hagit Shatkay

Presentation Overview:
Motivation: Most functions within the cell emerge thanks to protein-protein interactions (PPIs), yet their experimental determination is both expensive and time consuming. PPI networks present significant levels of noise and incompleteness. Prediction of interactions using solely PPI network topology (topological prediction) is difficult but essential when biological prior knowledge is absent or unreliable. Methods: Network embedding emphasizes relations between network proteins embedded in a low-dimensional space, where protein pairs closer to each other represent potential candidate interactions to predict. Network denoising, which boosts the prediction performance, is here achieved by minimum curvilinear embedding (MCE), combined with the shortest path (SP) adopted in the reduced space for assigning likelihood scores to candidate interactions. Furthermore, we introduce: (i) a new valid variation of MCE named non-centred MCE (ncMCE); (ii) two automatic strategies for the selection of the appropriate embedding dimension; (iii) two new randomised procedures for prediction evaluation. Results: We compared our method against several unsupervised and supervised embedding approaches, and node-neighbourhood techniques. Despite its computational simplicity, ncMCE-SP was the overall leader, outperforming the current methods for topological link prediction. Conclusion: Minimum curvilinearity is a valuable nonlinear framework, which we successfully applied in embedding of protein networks for unsupervised prediction of novel PPIs. The rationale is that biological and evolutionary prior information is imprinted in the nonlinear patterns hidden behind the protein network topology, and can be exploited for prediction of new protein links. The predicted PPIs represent good candidates to test in high-throughput experiments or to exploit in systems biology tools such as those used for network-based inference and prediction of disease-related functional modules.
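The core idea of scoring candidate links by proximity can be sketched as follows (Python; as a heavily simplified stand-in, this scores non-adjacent pairs directly by shortest-path length in the raw graph rather than after the ncMCE embedding step, and the toy network is invented):

```python
# Sketch: topological link prediction. Candidate (non-adjacent) node
# pairs are ranked by their shortest-path distance: closer pairs are
# more likely to be missing interactions. BFS suffices on an
# unweighted toy graph.
from collections import deque

def sp_len(adj, s, t):
    seen, q = {s}, deque([(s, 0)])
    while q:
        u, d = q.popleft()
        if u == t:
            return d
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                q.append((v, d + 1))
    return float("inf")

def rank_candidates(adj):
    nodes = sorted(adj)
    cands = [(u, v) for i, u in enumerate(nodes) for v in nodes[i + 1:]
             if v not in adj[u]]
    return sorted(cands, key=lambda p: sp_len(adj, *p))

# Toy path graph A - B - C - D.
adj = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
print(rank_candidates(adj))  # (A,C) and (B,D) rank above (A,D)
```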

Proceedings Track: Nonparametric sparse Bayesian factor analysis
Presenting author: Iulian Pruteanu-Malinici, Duke University, United States
Monday, July 22: 3:10 p.m. - 3:35 p.m. Room: Hall 7

Area Session Chair: Stefan Kramer

Presentation Overview:
Motivation: Computational approaches for the annotation of phenotypes from image data have shown promising results across many applications, and provide rich and valuable information for studying gene function and interactions. While data are often available both at high spatial resolution and across multiple time points, phenotypes are frequently annotated independently, for individual time points only. In particular, for the analysis of developmental gene expression patterns, it is biologically sensible to account for images across multiple time points jointly, so that spatial and temporal dependencies are captured simultaneously. Methods: We describe a discriminative, undirected graphical model to label gene-expression time-series image data, with an efficient training and decoding method based on the junction tree algorithm. The approach is based on an effective feature selection technique, consisting of a nonparametric sparse Bayesian factor analysis model. The result is a flexible framework, which can handle large-scale data with noisy, incomplete samples, i.e. it can tolerate data missing from individual time points. Results: Using the annotation of gene expression patterns across stages of Drosophila embryonic development as an example, we demonstrate that our method achieves superior accuracy, gained by jointly annotating phenotype sequences, when compared to previous models that annotate each stage in isolation. The experimental results on missing data indicate that our joint learning method successfully annotates genes for which no expression data are available for one or more stages.

Proceedings Track: Parameter estimation
Presenting author: Xin Gao, King Abdullah University of Science and Technology, Saudi Arabia
Monday, July 22: 12:00 p.m. - 12:25 p.m. Room: ICC Lounge 81

Area Session Chair: Hagit Shatkay

Presentation Overview:
Motivation: Systematic and scalable parameter estimation is key to constructing complex gene regulatory models and ultimately to facilitating an integrative systems biology approach to quantitatively understand the molecular mechanisms underpinning gene regulation. Results: Here, we report a novel framework for efficient and scalable parameter estimation that focuses specifically on modeling of gene circuits. Exploiting the structure commonly found in gene circuit models, this framework decomposes a system of coupled rate equations into individual ones and efficiently integrates them separately to reconstruct the mean time evolution of the gene products. The accuracy of the parameters is refined by iteratively increasing the accuracy of numerical integration using the model structure. As a case study, we applied our framework to four gene circuit models with complex dynamics based on three synthetic data sets and one time-series microarray data set. We compared our framework to three state-of-the-art parameter estimation methods and found that our approach consistently generated higher quality parameter solutions efficiently. While many general-purpose parameter estimation methods have been applied for modeling of gene circuits, our results suggest that the use of more tailored approaches to employ domain specific information may be a key to reverse-engineering of complex biological systems. Availability: http://sfb.kaust.edu.sa/Pages/Software.aspx
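The decompose-integrate-fit loop can be sketched for a single decoupled equation (Python; the rate law dx/dt = a - k·x, the Euler step, the grid search, and all numeric values are invented illustrations, far simpler than the paper's iterative refinement):

```python
# Sketch: fit one parameter of a decoupled gene-circuit rate equation.
# The equation dx/dt = a - k*x (production a, degradation k) is
# Euler-integrated on its own, and k is recovered by minimizing the
# squared error against observed data over a coarse grid.

def integrate(a, k, x0=0.0, dt=0.01, steps=500):
    x, traj = x0, []
    for _ in range(steps):
        x += dt * (a - k * x)   # forward Euler step
        traj.append(x)
    return traj

def fit_k(data, a, grid):
    def err(k):
        sim = integrate(a, k)
        return sum((s - d) ** 2 for s, d in zip(sim, data))
    return min(grid, key=err)

true_k = 0.8
data = integrate(a=1.0, k=true_k)          # synthetic "observations"
grid = [round(0.1 * i, 1) for i in range(1, 21)]  # 0.1 .. 2.0
print(fit_k(data, a=1.0, grid=grid))       # recovers 0.8
```

A real pipeline would refine the integration accuracy and search iteratively rather than use one fixed grid.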

Proceedings Track: Pathway inference
Presenting author: Anthony Gitter, Carnegie Mellon University, United States
Monday, July 22: 2:10 p.m. - 2:35 p.m. Room: Hall 4/5

Area Session Chair: Reinhard Schneider

Presentation Overview:
Several types of studies, including genome-wide association studies and RNA interference screens, strive to link genes to diseases. Although these approaches have had some success, genetic variants are often only present in a small subset of the population and screens are noisy with low overlap between experiments in different labs. Neither provides a mechanistic model explaining how identified genes impact the disease of interest or the dynamics of the pathways those genes regulate. Such mechanistic models could be used to accurately predict downstream effects of knocking down pathway members and allow comprehensive exploration of the effects of targeting pairs or higher-order combinations of genes. We developed methods to model the activation of signaling and dynamic regulatory networks involved in disease progression. Our model, SDREM, integrates static and time series data to link proteins and the pathways they regulate in these networks. SDREM utilizes prior information about proteins' likelihood of involvement in a disease (e.g. from screens) to improve the quality of the predicted signaling pathways. We used our algorithms to study the human immune response to H1N1 influenza infection. The resulting networks correctly identified many of the known pathways and transcriptional regulators of this disease. Furthermore, they accurately predict RNA interference effects and can be used to infer genetic interactions, greatly improving over other methods suggested for this task. Applying our method to the more pathogenic H5N1 influenza allowed us to identify several strain-specific targets of this infection.

Proceedings Track: PCA
Presenting author: Noa Liscovitch, Bar Ilan University, Israel
Monday, July 22: 2:40 p.m. - 3:05 p.m. Room: Hall 7

Area Session Chair: Stefan Kramer

Presentation Overview:
High spatial resolution imaging datasets of mammalian brains have recently become available in unprecedented amounts. Images now reveal highly complex patterns of gene expression varying on multiple scales. The challenge in analyzing these images is both in extracting the patterns that are most relevant functionally, and in providing a meaningful representation that allows neuroscientists to interpret the extracted patterns. Here we present FuncISH – a method to learn functional representations of neural in situ hybridization (ISH) images. We represent images using a histogram of local descriptors (SIFT) in several scales, and use this representation to learn detectors of functional (GO) categories for every image. As a result, each image is represented as a point in a low dimensional space whose axes correspond to meaningful functional annotations. The resulting representations define similarities between ISH images that can be easily explained by functional categories. We applied our method to the genomic set of mouse neural ISH images available at the Allen Brain Atlas, finding that the majority of GO biological processes can be inferred from spatial expression patterns with high accuracy. Using functional representations, we predict several gene interaction properties such as protein-protein interactions and cell type specificity more accurately than competing methods based on global correlations. We used FuncISH to identify similar expression patterns of GABAergic neuronal markers that were not previously identified, and to infer new gene function based on image-image similarities.

Proceedings Track: Poly(A) motif
Presenting author: Bo Xie, Georgia Institute of Technology, United States
Sunday, July 21: 3:10 p.m. - 3:35 p.m. Room: Hall 14.2

Area Session Chair: Cenk Sahinalp

Presentation Overview:
Motivation: Polyadenylation is the addition of a poly(A) tail to an RNA molecule. Identifying DNA sequence motifs that signal the addition of poly(A) tails is essential to improved genome annotation and better understanding of the regulatory mechanisms and stability of mRNA. Existing poly(A) motif predictors demonstrate that information extracted from the surrounding nucleotide sequences of candidate poly(A) motifs can differentiate true motifs from the false ones to a great extent. A variety of sophisticated features has been explored, including sequential, structural, statistical, thermodynamic and evolutionary properties. However, most of these methods involve extensive manual feature engineering, which can be time-consuming and can require in-depth domain knowledge. Results: We propose a novel machine learning method for poly(A) motif prediction by marrying generative learning (hidden Markov models) and discriminative learning (support vector machines). Generative learning provides a rich palette on which the uncertainty and diversity of sequence information can be handled, while discriminative learning allows the performance of the classification task to be directly optimized. Here, we employed hidden Markov models for fitting the DNA sequence dynamics, and developed an efficient spectral algorithm for extracting latent variable information from these models. These spectral latent features were then fed into support vector machines to fine tune the classification performance. We evaluated our proposed method on a comprehensive human poly(A) dataset that consists of 14,740 samples from 12 of the most abundant variants of human poly(A) motifs. Compared with one of the previous state-of-the-art methods in the literature (the random forest model with expert-crafted features), our method reduces the average error rate, false negative rate and false positive rate by 26%, 15% and 35%, respectively.
Meanwhile, our method made about 30% fewer error predictions relative to the other string kernels. Furthermore, our method can be used to visualize the importance of oligomers and positions in predicting poly(A) motifs, from which we can observe a number of characteristics in the surrounding regions of true and false motifs that have not been reported before. Availability: http://sfb.kaust.edu.sa/Pages/Software.aspx
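The generative-features-into-discriminative-classifier pattern can be illustrated with a heavily simplified stand-in (Python; 2-mer counts replace the spectral HMM features, a perceptron replaces the SVM, and the toy sequences and labels are invented):

```python
# Sketch: classify candidate motifs from surrounding-sequence features.
# Real pipeline: spectral HMM latent features -> SVM. Toy stand-in:
# 2-mer count features -> perceptron.
from itertools import product

KMERS = ["".join(p) for p in product("ACGT", repeat=2)]

def features(seq):
    return [sum(seq[i:i + 2] == k for i in range(len(seq) - 1))
            for k in KMERS]

def train(samples, epochs=20):
    w = [0.0] * (len(KMERS) + 1)          # last weight is the bias
    for _ in range(epochs):
        for seq, y in samples:            # y is +1 (true) or -1 (false)
            x = features(seq) + [1.0]
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1
            if pred != y:
                w = [wi + y * xi for wi, xi in zip(w, x)]
    return w

def predict(w, seq):
    x = features(seq) + [1.0]
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1

train_set = [("AATAAA", 1), ("AATAAA", 1), ("GCGCGC", -1), ("CCGGCC", -1)]
w = train(train_set)
print(predict(w, "TAATAA"), predict(w, "GGCCGG"))  # 1 -1
```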

Proceedings Track: Population genetics
Presenting author: Pier Francesco Palamara, Columbia University, United States
Monday, July 22: 11:30 a.m. - 11:55 a.m. Room: Hall 4/5

Area Session Chair: Russell Schwartz

Presentation Overview:
Pairs of individuals from a study cohort will often share long-range haplotypes identical-by-descent (IBD). Such haplotypes are transmitted from common ancestors that lived tens to hundreds of generations in the past, and can now be efficiently detected in high-resolution genomic datasets, providing a novel source of information in several domains of genetic analysis. Recently, haplotype sharing distributions were studied in the context of demographic inference, and were used to reconstruct recent demographic events in several populations. We here extend this framework to handle demographic models that contain multiple demes interacting through migration. We extensively test our formalism in several demographic scenarios, and provide a freely available software tool for demographic inference.

Proceedings Track: PPI-Network
Presenting author: Alex Lan, Ben Gurion University, Israel
Tuesday, July 23: 3:40 p.m. - 4:05 p.m. Room: Hall 7

Area Session Chair: Alfonso Valencia

Presentation Overview:
A major challenge in systems biology is to reveal the cellular pathways that give rise to specific phenotypes and behaviours. Current techniques often rely on a network representation of molecular interactions, where each node represents a protein or a gene and each interaction is assigned a single static score. However, the use of single interaction scores fails to capture the tendency of proteins to favour different partners under distinct cellular conditions. Here we propose a novel context-sensitive network model, in which genes and protein nodes are assigned multiple contexts based on their gene ontology annotations, and their interactions are associated with multiple context-sensitive scores. Using this model we developed a new approach and a corresponding tool, ContextNet, based on a dynamic programming algorithm for identifying signalling paths linking proteins to their downstream target genes. ContextNet finds high-ranking context-sensitive paths in the interactome, thereby revealing the intermediate proteins in the path and their path-specific contexts. We validated the model using 18,348 manually-curated cellular paths derived from the SPIKE database. We next applied our framework to elucidate the responses of human primary lung cells to influenza infection. Top-ranking paths were much more likely to contain infection-related proteins, and this likelihood was highly correlated with path score. Moreover, the contexts assigned by the algorithm pointed to putative as well as previously known responses to viral infection. Thus context-sensitivity is an important extension to current network biology models and can be efficiently used to elucidate cellular response mechanisms. ContextNet is publicly available at http://netbio.bgu.ac.il/ContextNet.
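Finding a high-scoring path under a chosen context can be sketched as follows (Python; this is not the ContextNet dynamic program — Dijkstra on negative-log scores is a generic stand-in, and the toy network, contexts, and scores are invented):

```python
# Sketch: context-sensitive path finding. Each edge carries one score
# per context; for a fixed context we run Dijkstra on -log(score), so
# the returned path maximises the product of context-specific scores.
import heapq
import math

def best_path(edges, src, dst, context):
    adj = {}
    for (u, v), scores in edges.items():
        adj.setdefault(u, []).append((v, scores[context]))
    dist, prev = {src: 0.0}, {}
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == dst:
            break
        if d > dist.get(u, math.inf):
            continue                       # stale queue entry
        for v, s in adj.get(u, []):
            nd = d - math.log(s)
            if nd < dist.get(v, math.inf):
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    path, node = [dst], dst
    while node != src:
        node = prev[node]
        path.append(node)
    return path[::-1]

edges = {("EGFR", "GRB2"): {"infection": 0.9, "growth": 0.2},
         ("GRB2", "TP53"):  {"infection": 0.8, "growth": 0.3},
         ("EGFR", "TP53"):  {"infection": 0.1, "growth": 0.9}}
print(best_path(edges, "EGFR", "TP53", "infection"))
```

Note how the same endpoints yield different intermediate proteins in different contexts, which is the point of the multi-score model.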

Proceedings Track: PRM: Protein Recognition module
Presenting author: Kousik Kundu, University of Freiburg, Germany
Sunday, July 21: 11:30 a.m. - 11:55 a.m. Room: Hall 4/5

Area Session Chair: Erik Bongcam-Rudloff

Presentation Overview:
State-of-the-art experimental data for determining binding specificities of peptide recognition modules (PRMs) is obtained by high-throughput approaches like peptide arrays. Most prediction tools applicable to this kind of data are based on an initial multiple alignment of the peptide ligands. Building an initial alignment can be error-prone, especially in the case of the proline-rich peptides bound by the SH3 domains. Here we present a machine learning approach based on an efficient graph-kernel technique to predict the specificity of a large set of 70 human SH3 domains, which are a very important class of PRMs. The graph-kernel strategy allows us to 1) integrate several types of physico-chemical information for each amino acid, 2) consider high order correlations between these features and 3) eliminate the need for an initial peptide alignment. We build specialized models for each human SH3 domain and achieve competitive predictive performance of 0.73 area under precision-recall curve (AUC PR), compared to 0.27 AUC PR for state-of-the-art methods based on position weight matrices. We show that better models can be obtained when we use information on the non-interacting peptides (negative examples), which is currently not used by the state-of-the-art approaches based on position-weight matrices. To this end, we analyze two strategies to identify subsets of high confidence negative data. The techniques introduced here are more general and hence can also be used for any other protein domains which interact with short peptides (i.e., other PRMs).

Proceedings Track: Protein contact map prediction
Presenting author: Jinbo Xu, Toyota Technological Institute at Chicago, United States
Monday, July 22: 10:30 a.m. - 10:55 a.m. Room: Hall 7

Area Session Chair: Alex Bateman

Presentation Overview:
Motivation: Protein contact map describes the pairwise spatial and functional relationship of residues in a protein and contains key information for protein 3D structure prediction. Although studied extensively, it remains very challenging to predict contact map using only sequence information. Most existing methods predict the contact map matrix element-by-element, ignoring correlation among contacts and physical feasibility of the whole contact map. A couple of recent methods predict contact map by using mutual information (MI) and enforcing a sparsity restraint (i.e., the contact matrix shall be very sparse), but these methods demand a very large number of sequence homologs and the resultant contact map may still be physically infeasible. Results: This paper presents a novel method for contact map prediction, integrating both evolutionary and physical restraints by machine learning and integer linear programming (ILP). The evolutionary restraints are much more informative than MI and the physical restraints specify more concrete relationship among contacts than the sparsity restraint. As such, our method greatly reduces the solution space of the contact map matrix and thus, significantly improves prediction accuracy. Experimental results show that our method outperforms currently popular methods no matter how many sequence homologs are available for the protein under consideration.
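The role of physical restraints can be sketched with a toy selection step (Python; the paper solves this with integer linear programming, whereas this stand-in is a greedy heuristic, and the minimum-separation and per-residue-degree constraints plus all scores are invented simplifications of real physical feasibility conditions):

```python
# Sketch: turn per-pair contact scores into a physically plausible
# contact set. Greedily keep the highest-scoring pairs subject to
# (a) a minimum sequence separation and (b) a cap on contacts per
# residue -- toy proxies for the physical restraints.

def select_contacts(scores, min_sep=3, max_per_res=2):
    """scores: {(i, j): score} with i < j. Returns chosen pairs, sorted."""
    degree, chosen = {}, []
    for (i, j), s in sorted(scores.items(), key=lambda kv: -kv[1]):
        if j - i < min_sep:
            continue                       # too close along the chain
        if degree.get(i, 0) >= max_per_res or degree.get(j, 0) >= max_per_res:
            continue                       # residue already saturated
        chosen.append((i, j))
        degree[i] = degree.get(i, 0) + 1
        degree[j] = degree.get(j, 0) + 1
    return sorted(chosen)

scores = {(1, 2): 0.99,                    # rejected: separation 1
          (1, 5): 0.9, (1, 8): 0.8, (1, 11): 0.7,  # third hits the cap
          (2, 9): 0.6}
print(select_contacts(scores))
```

An ILP would optimize all choices jointly instead of greedily, which is exactly what the abstract argues buys accuracy.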

Proceedings Track: Protein interaction evolution
Presenting author: Robert Patro, Carnegie Mellon University, United States
Monday, July 22: 3:10 p.m. - 3:35 p.m. Room: ICC Lounge 81

Area Session Chair: Burkhard Rost

Presentation Overview:
Motivation: Reconstruction of the network-level evolutionary history of protein-protein interactions provides a principled way to relate interactions in several present-day networks. Here, we present a general framework for inferring such histories and demonstrate how it can be used to determine what interactions existed in the ancestral networks, which present-day interactions we should expect to exist based on evolutionary evidence, and what information extant networks contain about the order of ancestral protein duplications. Results: Our framework characterizes the space of likely parsimonious network histories. It results in a structure that can be used to find probabilities for a number of events associated with the histories. The framework is based on a directed hypergraph formulation of dynamic programming that we extend to enumerate many optimal and near-optimal solutions. The algorithm is applied to reconstructing ancestral interactions among bZIP transcription factors, imputing missing present-day interactions among the bZIPs and among proteins from 5 herpes viruses, and determining relative protein duplication order in the bZIP family. Our approach more accurately reconstructs ancestral interactions compared with existing approaches. In cross-validation tests, we find that our approach ranks the majority of the left-out present-day interactions among the top 2% and 17% of possible edges for the bZIP and herpes networks, respectively, making it a competitive approach for edge imputation. It also estimates, from interaction data alone, relative bZIP protein duplication orders that are significantly correlated with sequence-based estimates. Availability: The algorithm is implemented in C++, is open source, and available at http://www.cs.cmu.edu/~ckingsf/software/parana2. Contact: robp@cs.cmu.edu and carlk@cs.cmu.edu

Proceedings Track: Protein structure prediction
Presenting author: Zhidong Xue, University of Michigan, United States
Monday, July 22: 11:00 a.m. - 11:25 a.m. Room: Hall 7

Area Session Chair: Alex Bateman

Presentation Overview:
Motivation: Protein domains are subunits that can fold and function independently. Identification of domain boundary locations is often the first step in protein folding and function annotations. Most of the current methods deduce domain boundaries by sequence-based analysis, where accuracy is low. There is no efficient method for predicting discontinuous domains that consist of segments from separated sequences. Since template-based methods are most efficient for protein 3D structure modeling, combining multiple threading alignment information should increase the accuracy and reliability of computational domain predictions. Results: We develop a new domain predictor, ThreaDom, which deduces protein domain boundary locations based on multiple threading alignments. The core of the method development is the derivation of a domain conservation score that combines composite information from template domain structures and terminal and internal alignment gaps. Tested on 630 non-redundant sequences, without using homologous templates, ThreaDom generates correct single- and multi-domain classifications in 81% of cases, where 78% have the domain linker location assigned within 20 residues. In a second test on 486 proteins with discontinuous domains, ThreaDom achieves an average precision of 84% and a recall of 65% in domain boundary prediction. Finally, ThreaDom was examined on 56 targets from CASP8 and had domain overlap rates of 73%, 87% and 85% with the target structure for Free Modeling, Hard multiple-domain and discontinuous domain proteins, respectively, which are significantly higher than most of the domain predictors in the CASP8 experiment.
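The use of alignment gaps as a boundary signal can be sketched in a few lines (Python; this toy uses only per-position gap frequency, whereas ThreaDom's conservation score also incorporates template domain structures, and the alignments are invented):

```python
# Sketch: a toy domain-boundary profile from multiple alignments.
# Per position, compute the fraction of template alignments that have
# a gap ('-') there, and call the boundary at the gappiest position --
# internal gaps tend to pile up at domain linkers.

def gap_profile(alignments):
    n = len(alignments[0])
    return [sum(a[i] == "-" for a in alignments) / len(alignments)
            for i in range(n)]

def boundary(alignments):
    prof = gap_profile(alignments)
    return max(range(len(prof)), key=prof.__getitem__)

alns = ["AAAA--BBBB",
        "AAAA-BBBBB",
        "AAAAA-BBBB"]
print(boundary(alns))  # linker region around positions 4-5
```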

Proceedings Track: Protein Threading
Presenting author: Sheng Wang, Toyota Technological Institute at Chicago, United States
Monday, July 22: 11:30 a.m. - 11:55 a.m. Room: Hall 7

Area Session Chair: Alex Bateman

Presentation Overview:
Motivation: Template-based modeling (TBM), including homology modeling and protein threading, is the most reliable method for protein 3D structure prediction. However, alignment errors and template selection are still the main bottleneck for current TBM methods, especially when proteins under consideration are distantly related. Results: We present a novel context-specific alignment potential for protein threading, including alignment and template selection. Our alignment potential measures the log odds ratio of one alignment being generated from two related proteins to being generated from two unrelated proteins, by integrating both local and global context-specific information. The local alignment potential quantifies how well one sequence residue can be aligned to one template residue based upon context-specific information of the residues. The global alignment potential quantifies how well two sequence residues can be placed into two template positions at a given distance, again based upon context-specific information. By accounting for correlation among a variety of protein features and making use of context-specific information, our alignment potential is much more sensitive than the widely used context-independent or profile-based scoring function. Experimental results confirm that our method generates significantly better alignments and threading results than the best profile-based methods on several very large benchmarks. Our method works particularly well for distantly-related proteins or proteins with sparse sequence profiles due to the effective integration of context-specific, structure and global information.

Proceedings Track: Pseudogene
Presenting author: Wei Wang, UCLA, United States
Sunday, July 21: 2:40 p.m. - 3:05 p.m. Room: Hall 14.2

Area Session Chair: Cenk Sahinalp

Presentation Overview:
Motivation: RNA-seq techniques provide an unparalleled means for exploring a transcriptome with deep coverage and base pair level resolution. Various analysis tools have been developed to align and assemble RNA-seq data, such as the widely used TopHat/Cufflinks pipeline. A common observation is that a sizable fraction of the fragments/reads align to multiple locations of the genome. These multiple alignments pose substantial challenges to existing RNA-seq analysis tools. Inappropriate treatment may result in reporting spurious expressed genes (false positives), and missing the real expressed genes (false negatives). Such errors impact the subsequent analysis, such as differential expression analysis. In our study, we observe that about 3.5% of transcripts reported by the TopHat/Cufflinks pipeline correspond to annotated nonfunctional pseudogenes. Moreover, about 10.0% of reported transcripts are not annotated in the Ensembl database. These genes could be either novel expressed genes or false discoveries. Results: We examine the underlying genomic features that lead to multiple alignments and investigate how they generate systematic errors in RNA-seq analysis. We develop a general tool, GeneScissors, which exploits machine learning techniques guided by biological knowledge to detect and correct spurious transcriptome inference by existing RNA-seq analysis methods. In our simulated study, GeneScissors can predict spurious transcriptome calls due to misalignment with an accuracy close to 90%. It provides substantial improvement over the widely used TopHat/Cufflinks or MapSplice/Cufflinks pipelines in both precision and F-measure. On real data, GeneScissors reports 53.6% fewer pseudogenes and 0.97% more expressed and annotated transcripts, when compared with the TopHat/Cufflinks pipeline. In addition, among the 10.0% unannotated transcripts reported by TopHat/Cufflinks, GeneScissors finds that more than 16.3% of them are false positives.
Availability: The software can be downloaded at http://csbio.unc.edu/genescissors/

Proceedings Track: RNA
Presenting author: Vladimir Reinharz, McGill University, Canada
Tuesday, July 23: 3:40 p.m. - 4:05 p.m. Room: Hall 14.2

Area Session Chair: Ralf Zimmer

Presentation Overview:
Motivation: The design of RNA sequences folding into predefined secondary structures is a milestone for many synthetic biology and gene therapy studies. Most current software tools use similar local search strategies (i.e. a random seed is progressively adapted to acquire the desired folding properties) and, more importantly, do not allow the user to explicitly control the nucleotide distribution, such as the GC-content, of their sequences. However, the latter is an important criterion for large-scale applications, as it could presumably be used to design sequences with better transcription rates and/or structural plasticity. Results: In this paper, we introduce IncaRNAtion, a novel algorithm to design RNA sequences folding into target secondary structures with a predefined nucleotide distribution. IncaRNAtion uses a global sampling approach and weighted sampling techniques. We show that our approach is fast (i.e. running time comparable to or better than local search methods), seed-less (we remove the bias of the seed in local search heuristics), and successfully generates high-quality sequences (i.e. thermodynamically stable) for any GC-content. To complete this study, we develop a hybrid method combining our global sampling approach with local search strategies. Remarkably, our glocal methodology outperforms both local and global approaches.
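The weighted-sampling idea behind GC-content control can be sketched in isolation (Python; this toy ignores the target secondary structure entirely and only shows how a per-nucleotide weight, tuned by bisection, steers the expected GC-content; the weighting scheme and numbers are invented):

```python
# Sketch: sample RNA sequences with a controlled GC-content. G and C
# get weight w, A and U weight 1; w is tuned by bisection so that the
# expected GC fraction 2w / (2w + 2) matches the target.
import random

def gc_probs(w):
    total = 2 * w + 2
    return [w / total, w / total, 1 / total, 1 / total]  # G, C, A, U

def weight_for_gc(target, lo=1e-3, hi=1e3, iters=60):
    for _ in range(iters):
        mid = (lo + hi) / 2
        gc = 2 * mid / (2 * mid + 2)
        lo, hi = (mid, hi) if gc < target else (lo, mid)
    return (lo + hi) / 2

def sample(n, target_gc, rng):
    p = gc_probs(weight_for_gc(target_gc))
    return "".join(rng.choices("GCAU", weights=p, k=n))

rng = random.Random(0)                # fixed seed for reproducibility
seq = sample(10000, 0.7, rng)
print((seq.count("G") + seq.count("C")) / len(seq))  # close to 0.7
```

The real algorithm applies such weights inside a structure-aware Boltzmann sampler rather than position-independently.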

Proceedings Track: RNA structure prediction
Presenting author: Hamidreza Chitsaz, Wayne State University, United States
Tuesday, July 23: 3:10 p.m. - 3:35 p.m. Room: Hall 14.2

Area Session Chair: Ralf Zimmer

Presentation Overview:
Motivation: Computational RNA structure prediction is a mature, important problem which has received a new wave of attention with the discovery of regulatory non-coding RNAs and the advent of high-throughput transcriptome sequencing. Despite nearly two score years of research on RNA secondary structure and RNA-RNA interaction prediction, the accuracy of the state-of-the-art algorithms is still far from satisfactory. So far, researchers have proposed increasingly complex energy models and improved parameter estimation methods, experimental and/or computational, in anticipation of endowing their methods with enough power to solve the problem. The output has disappointingly been only modest improvements, not matching the expectations. Even recent massively featured machine learning approaches were not able to break the barrier. Why is that? Approach: The first step towards high accuracy structure prediction is to pick an energy model that is inherently capable of predicting each and every one of known structures to date. In this paper, we introduce the notion of learnability of the parameters of an energy model as a measure of such an inherent capability. We say that the parameters of an energy model are learnable iff there exists at least one set of such parameters that renders every known RNA structure to date the minimum free energy structure. We derive a necessary condition for the learnability and give a dynamic programming algorithm to assess it. Our algorithm computes the convex hull of the feature vectors of all feasible structures in the ensemble of a given input sequence. Interestingly, that convex hull coincides with the Newton polytope of the partition function as a polynomial in energy parameters. To the best of our knowledge, this is the first approach towards computing the RNA Newton polytope and a systematic assessment of the inherent capabilities of an energy model. The worst-case complexity of our algorithm is exponential in the number of features.
However, one could employ dimensionality reduction techniques to avoid the curse of dimensionality.
Results: We demonstrated the application of our theory to a simple energy model consisting of a weighted count of A-U, C-G and G-U base pairs. Our results show that this simple energy model satisfies the necessary condition for more than half (55%) of the unpseudoknotted sequence-structure pairs chosen from the RNA STRAND v2.0 database, and severely violates the condition for about 13%, providing a set of hard cases that require further investigation. Of 1350 RNA strands, the observed three-dimensional feature vector for 749 strands lies on the surface of the computed polytope. For 289 RNA strands, the observed feature vector is not on the boundary of the polytope, but its distance from the boundary is at most one; a distance of one essentially means one base pair of difference between the observed structure and the closest point on the boundary of the polytope, which need not itself be the feature vector of a structure. For 171 sequences this distance is larger than 2, and for only 11 sequences is it larger than 5.
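The learnability condition described above can be illustrated on the toy three-feature energy model (weighted counts of A-U, C-G and G-U pairs). The sketch below is not the paper's polytope algorithm: it enumerates all pseudoknot-free structures of a short sequence, collects their feature vectors, and brute-forces a small weight grid to test whether a target vector can be made the unique minimum free energy structure. All function names and the grid bounds are invented for illustration.

```python
from itertools import product

# Canonical and wobble pairs for the toy 3-feature energy model
PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C"), ("G", "U"), ("U", "G")}

def structures(seq, i, j):
    """Enumerate all pseudoknot-free structures of seq[i..j] (minimum
    hairpin loop of 3 unpaired bases), each as a frozenset of (i, j) pairs."""
    if j - i < 4:
        return [frozenset()]
    out = list(structures(seq, i + 1, j))          # position i unpaired
    for k in range(i + 4, j + 1):                  # i paired with k
        if (seq[i], seq[k]) in PAIRS:
            for left in structures(seq, i + 1, k - 1):
                for right in structures(seq, k + 1, j):
                    out.append(left | right | {(i, k)})
    return out

def features(seq, struct):
    """Feature vector: counts of (A-U, C-G, G-U) pairs."""
    index = {frozenset("AU"): 0, frozenset("CG"): 1, frozenset("GU"): 2}
    f = [0, 0, 0]
    for i, j in struct:
        f[index[frozenset((seq[i], seq[j]))]] += 1
    return tuple(f)

def learnable(target, fvs, grid=range(-3, 4)):
    """Crude feasibility check: search a small weight grid for w such that
    target strictly minimizes w . f over all other feature vectors.
    (The paper characterizes this geometrically via the Newton polytope
    instead of brute force.)"""
    others = [f for f in fvs if f != target]
    for w in product(grid, repeat=3):
        t = sum(wi * fi for wi, fi in zip(w, target))
        if all(t < sum(wi * fi for wi, fi in zip(w, f)) for f in others):
            return w
    return None
```

On a short sequence, the set of distinct feature vectors is exactly the point set whose convex hull the paper's dynamic program computes; a structure whose feature vector is a vertex of that hull is learnable under some weights.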

Proceedings Track: Sequence Analysis
Presenting author: Noah Daniels, Tufts University, United States
Monday, July 22: 2:40 p.m. - 3:05 p.m. Room: Hall 4/5

Area Session Chair: Reinhard Schneider

Presentation Overview:
Motivation: The exponential growth of protein sequence databases has increasingly made the fundamental question of searching for homologs a computational bottleneck. The amount of unique data, however, is not growing nearly as fast; we can exploit this fact to greatly accelerate homology search. Accelerating the popular PSI/DELTA-BLAST family of tools will not only speed up homology search directly, but also the huge collection of other current programs that interact with large protein databases primarily via precisely these tools.
Results: We introduce a suite of homology search tools, powered by compressively-accelerated protein BLAST (CaBLASTP), which are significantly faster than, and comparably accurate to, state-of-the-art tools including HHblits, DELTA-BLAST and PSI-BLAST. Further, our tools are implemented in a manner that allows direct substitution into existing analysis pipelines. The key idea is a local similarity-based compression scheme that allows us to operate directly on the compressed data. Importantly, CaBLASTP's runtime scales almost linearly in the amount of unique data, as opposed to current BLASTP variants, which scale linearly in the size of the full protein database being searched. Our compressive algorithms will speed up many tasks, such as protein structure prediction and orthology mapping, which rely heavily on homology search.
Availability: CaBLASTP is available under the GNU Public License at http://cablastp.csail.mit.edu/
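The core idea of similarity-based compression can be sketched in miniature: near-duplicate sequences are stored as links to a representative, and search runs only over the "coarse" database of representatives before expanding hits back to all members. This toy uses exact substring matching as a stand-in for BLAST-style alignment, and the function names, the Hamming-distance linking rule, and the mismatch threshold are all invented for illustration; it is not CaBLASTP's actual scheme, which stores edit scripts and re-verifies hits at fine granularity.

```python
def compress(db, max_mismatch=2):
    """Link each sequence to an earlier equal-length representative
    within `max_mismatch` substitutions; otherwise make it a new
    representative in the coarse database."""
    coarse, links = [], []          # unique reps; (rep_index, original_id)
    for sid, seq in db:
        for ri, rep in enumerate(coarse):
            if len(rep) == len(seq) and \
               sum(a != b for a, b in zip(rep, seq)) <= max_mismatch:
                links.append((ri, sid))
                break
        else:
            coarse.append(seq)
            links.append((len(coarse) - 1, sid))
    return coarse, links

def search(query, coarse, links):
    """Scan only the coarse database, then expand hits via the links.
    Runtime grows with the amount of *unique* data, not total data."""
    hit_reps = {ri for ri, rep in enumerate(coarse) if query in rep}
    return sorted(sid for ri, sid in links if ri in hit_reps)
```

A real implementation would re-align the query against each linked member's reconstructed sequence, since a member may differ from its representative exactly where the query matched.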

Proceedings Track: Sequencing
Presenting author: David Golan, Tel Aviv University, Israel
Monday, July 22: 2:40 p.m. - 3:05 p.m. Room: Hall 14.2

Area Session Chair: Sean O'Donoghue

Presentation Overview:
Motivation: The importance of fast and affordable DNA sequencing methods for present-day life sciences, medicine and biotechnology is hard to overstate. A major player is Ion Torrent, a pyrosequencing-like technology that produces flowgrams (sequences of incorporation values), which are converted into nucleotide sequences by a base-calling algorithm. Because of its exploitation of ubiquitous semiconductor technology and innovation in chemistry, Ion Torrent has been gaining popularity since its debut in 2011. Despite these advantages, however, Ion Torrent read accuracy remains a significant concern.
Results: We present FlowgramFixer, a new algorithm for converting flowgrams into reads. Our key observation is that the incorporation signals of neighboring flows, even after normalization and phase correction, carry considerable mutual information and are important in making the correct base-call. We therefore propose that base-calling of flowgrams should be done at the read-wide level, rather than one flow at a time. We show that this can be done in linear time by combining a state machine with a Viterbi algorithm to find the nucleotide sequence that maximizes the likelihood of the observed flowgram. FlowgramFixer is applicable to any flowgram-based sequencing platform. We demonstrate FlowgramFixer's superior performance on Ion Torrent E. coli data, with a 4.8% improvement in the number of high-quality mapped reads and a 7.1% improvement in the number of uniquely mappable reads.
Availability: Binaries and source code of FlowgramFixer are freely available at http://www.cs.tau.ac.il/~davidgo5/flowgramfixer.html
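The read-wide Viterbi idea can be sketched on a toy noise model. Here each flow's state is its homopolymer count, and a fraction of the previous flow's signal "leaks" into the current one, so the best call at one flow depends on its neighbor; dynamic programming then finds the jointly most likely count sequence. The Gaussian noise model, the `carry` leakage parameter and all constants are invented for illustration and are much simpler than FlowgramFixer's actual likelihood.

```python
import math

FLOW_ORDER = "TACG"  # nucleotide flowed at each cycle position

def viterbi_basecall(flowgram, max_hp=8, sigma=0.3, carry=0.05):
    """Toy read-wide base-caller: pick homopolymer counts maximizing a
    Gaussian log-likelihood in which a fraction `carry` of the previous
    flow's count leaks into the current flow's expected signal."""
    n = len(flowgram)
    states = range(max_hp + 1)
    # dp[c] = best log-likelihood with count c at the current flow
    dp = {c: -0.5 * ((flowgram[0] - c) / sigma) ** 2 for c in states}
    back = []
    for t in range(1, n):
        ndp, bp = {}, {}
        for c in states:
            best, arg = -math.inf, 0
            for p in states:
                mean = c + carry * p        # leakage from previous flow
                ll = dp[p] - 0.5 * ((flowgram[t] - mean) / sigma) ** 2
                if ll > best:
                    best, arg = ll, p
            ndp[c], bp[c] = best, arg
        dp, back = ndp, back + [bp]
    # traceback from the best final state
    c = max(dp, key=dp.get)
    counts = [c]
    for bp in reversed(back):
        c = bp[c]
        counts.append(c)
    counts.reverse()
    return "".join(FLOW_ORDER[t % 4] * counts[t] for t in range(n))
```

Because each flow keeps only one best score per state, the run time is linear in the number of flows for a fixed maximum homopolymer length, matching the linear-time claim in the abstract.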

Proceedings Track: Short read alignment
Presenting author: Victoria Popic, Stanford University, United States
Tuesday, July 23: 11:00 a.m. - 11:25 a.m. Room: Hall 4/5

Area Session Chair: Debra Goldberg

Presentation Overview:
The increasing availability of high-throughput sequencing technologies has led to thousands of human genomes being sequenced in recent years. Efforts such as the 1000 Genomes Project further add to the availability of human genome variation data. To date, however, there is no method that can map the reads of a newly sequenced human genome to a large collection of genomes; instead, methods rely on aligning reads to a single reference genome, which leads to inherent biases and lower accuracy. To tackle this problem, this paper introduces a new alignment tool, BWBBLE. We (1) introduce a new compressed representation of a collection of genomes, which explicitly encodes the genomic variation observed at every position, and (2) design a new alignment algorithm based on the Burrows-Wheeler transform that maps short reads from a newly sequenced genome to an arbitrary collection of two or more (up to millions of) genomes with high accuracy and no inherent bias towards any one genome.
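One way to represent a genome collection compactly, sketched below, is to collapse the genomes column-by-column into a single string over the IUPAC ambiguity alphabet, so a SNP position like G/A becomes the single code R; a read then matches wherever each of its bases is contained in the code at that position. This is only a toy, handling SNPs in pre-aligned equal-length genomes with a linear scan, whereas BWBBLE builds a Burrows-Wheeler index over its compressed representation; the function names are invented.

```python
# Standard IUPAC nucleotide ambiguity codes
IUPAC = {frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G",
         frozenset("T"): "T", frozenset("AG"): "R", frozenset("CT"): "Y",
         frozenset("AC"): "M", frozenset("GT"): "K", frozenset("AT"): "W",
         frozenset("CG"): "S", frozenset("ACG"): "V", frozenset("ACT"): "H",
         frozenset("AGT"): "D", frozenset("CGT"): "B", frozenset("ACGT"): "N"}
CODE = {v: k for k, v in IUPAC.items()}   # code letter -> allowed bases

def build_multigenome(genomes):
    """Collapse pre-aligned, equal-length genomes into one string with
    an ambiguity code wherever the column contains more than one base."""
    return "".join(IUPAC[frozenset(col)] for col in zip(*genomes))

def align(read, ref):
    """Report all positions where every read base is contained in the
    ambiguity code at the corresponding reference position."""
    return [i for i in range(len(ref) - len(read) + 1)
            if all(b in CODE[ref[i + j]] for j, b in enumerate(read))]
```

Because the multi-genome is barely longer than a single genome, reads carrying either allele at a SNP map to the same locus without bias towards the genome that happened to be chosen as the reference.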

Proceedings Track: Text mining
Presenting author: Sophia Ananiadou, The University of Manchester
Tuesday, July 23: 3:40 p.m. - 4:05 p.m. Room: Hall 4/5

Area Session Chair: Reinhard Schneider

Presentation Overview:
Motivation: In order to create, verify and maintain pathway models, curators must discover and assess knowledge distributed over the vast body of biological literature. Methods supporting these tasks must understand both the pathway model representations and the natural language in the literature, and should identify and rank documents by relevance to any given pathway reaction. No existing system has addressed all aspects of this challenge.
Method: We present novel methods for associating pathway model reactions with relevant publications. Our approach extracts the reactions directly from the models and then turns them into queries for three text-mining-based MEDLINE literature search systems. These queries are executed, and the resulting documents are combined and ranked according to their relevance to the reactions of interest. We manually annotate document-reaction pairs with the relevance of the document to the reaction, and use this annotation to study several ranking methods, using various heuristic and machine-learning approaches.
Results: Our evaluation shows that the annotated document-reaction pairs can be used to create a rule-based document ranking system, and that machine learning can be used to rank documents by their relevance to pathway reactions. We find that a Support Vector Machine-based system outperforms several baselines and matches the performance of the rule-based system. The query extraction and ranking methods have been used to update our existing pathway search system, PathText.
Availability: An online demonstration of PathText 2 and the annotated corpus are available for research purposes at http://www.nactem.ac.uk/pathtext2/.
Contact: makoto.miwa@manchester.ac.uk
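The reaction-to-query-to-ranking pipeline can be sketched with a minimal heuristic ranker: pull the participant names out of a reaction record, then score each document by the rarity-weighted overlap with those terms. The reaction dictionary layout, the function names and the log-idf weighting are all invented stand-ins for the paper's query extraction and its heuristic/SVM rankers.

```python
import math

def reaction_to_query(reaction):
    """Turn a reaction record (hypothetical dict layout) into query terms:
    its reactants, products and any modifiers such as catalysts."""
    return (reaction["reactants"] + reaction["products"]
            + reaction.get("modifiers", []))

def rank(query_terms, docs):
    """Rank document indices by the sum of inverse-document-frequency
    weights of the query terms they contain (a crude relevance score)."""
    n = len(docs)
    tokenized = [set(d.lower().split()) for d in docs]
    idf = {t: math.log(n / (1 + sum(t.lower() in toks for toks in tokenized)))
           for t in query_terms}
    scores = [sum(idf[t] for t in query_terms if t.lower() in toks)
              for toks in tokenized]
    return sorted(range(n), key=lambda i: -scores[i])
```

A learned ranker such as the paper's SVM would replace the hand-set idf score with a weight vector trained on the annotated document-reaction pairs, using features like these overlap counts.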

Proceedings Track: Transcriptome assembling
Presenting author: Henry C.M. Leung, The University of Hong Kong
Tuesday, July 23: 10:30 a.m. - 10:55 a.m. Room: Hall 4/5

Area Session Chair: Debra Goldberg

Presentation Overview:
Motivation: RNA sequencing based on next-generation sequencing technology is an effective approach for analyzing transcriptomes. Like de novo genome assembly, de novo transcriptome assembly does not rely on a reference genome or additional annotation, but the transcriptome assembly problem is well known to be more difficult. In particular, isoforms can have very uneven expression levels (e.g. 1:100), which makes it very difficult to identify low-expressed isoforms. Technically, a core issue is to remove erroneous vertices/edges with high multiplicity (produced by high-expressed isoforms) in the de Bruijn graph without removing correct vertices/edges of more modest multiplicity corresponding to low-expressed isoforms. Failing to do so results either in the loss of low-expressed isoforms or in complicated subgraphs in which transcripts of different genes are mixed together by the erroneous vertices and edges.
Contributions: Unlike existing tools, which usually remove erroneous vertices/edges whose multiplicities fall below a global threshold, we developed a probabilistic progressive approach with local thresholds to iteratively remove erroneous vertices/edges. This enables us to decompose the graph into disconnected components, each of which contains a few genes, if not a single gene, while keeping many correct vertices/edges of low-expressed isoforms. Combined with existing techniques, IDBA-Tran is able to assemble both high-expressed and low-expressed transcripts, and outperforms existing assemblers in terms of sensitivity and specificity on both simulated and real data.
Availability: http://www.cs.hku.hk/~alse/idba_tran
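The difference between a global and a local threshold can be shown on a toy de Bruijn edge table, where each edge maps a (k-mer, next k-mer) pair to its multiplicity. A local rule keeps an edge whenever it carries a reasonable fraction of its source vertex's outgoing coverage, so a low-expressed isoform's edges survive while a sequencing-error branch next to a high-expressed isoform is removed. The 0.05 ratio and the fixed cutoff are illustrative only; IDBA-Tran applies its local thresholds probabilistically and progressively rather than in one pass.

```python
from collections import defaultdict

def prune_global(mult, cutoff=3):
    """Global rule: drop any edge with multiplicity below `cutoff`.
    This also destroys correct edges of low-expressed isoforms."""
    return {e: m for e, m in mult.items() if m >= cutoff}

def prune_local(mult, ratio=0.05):
    """Local rule: drop an edge only if its multiplicity is a tiny
    fraction of the total multiplicity leaving its source vertex."""
    out_total = defaultdict(int)
    for (u, v), m in mult.items():
        out_total[u] += m
    return {e: m for e, m in mult.items() if m >= ratio * out_total[e[0]]}
```

In the toy table below, the multiplicity-2 error branch beside a multiplicity-200 edge is pruned locally, while an isolated multiplicity-2 edge from a low-expressed isoform is kept, exactly the behavior a global cutoff cannot achieve.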