ISMB 2008 ISCB


















Accepted Papers


Protein Structure and Function
Databases and Ontologies
Gene Regulation and Transcriptomics
Protein Interactions and Molecular Networks
Population Genomics
Other Bioinformatics Applications and Methods
Evolution and Phylogeny
Sequence Analysis and Alignment
Bioinformatics of Disease
Comparative Genomics


Bioinformatics of Disease



Author(s):
Irene M. Kaplow, Massachusetts Institute of Technology, United States
Marina Sirota, Stanford University, United States
Chuong B. Do, Stanford University, United States
Atul J. Butte, Stanford University School of Medicine, United States
Serafim Batzoglou*, Stanford University, United States

Presenting Author: Marc Schaub, Stanford University

Genome-wide association studies are commonly used to identify possible associations between genetic variations and diseases. These studies mainly focus on identifying individual single nucleotide polymorphisms potentially linked with one disease of interest. In this work, we introduce a novel methodology that identifies similarities between diseases using information from a large number of single nucleotide polymorphisms. We separate the diseases for which we have individual genotype data into one reference disease and several query diseases. We train a classifier that distinguishes between individuals that have the reference disease and a set of control individuals. This classifier is then used to classify the individuals that have the query diseases. We can then rank query diseases according to the average classification of the individuals in each disease set, and identify which of the query diseases are more similar to the reference disease. We repeat these classification and comparison steps so that each disease is used once as reference disease. We apply this approach using a decision tree classifier to the genotype data of seven common diseases and two shared control sets provided by the Wellcome Trust Case Control Consortium. We show that this approach identifies the known genetic similarity between type 1 diabetes and rheumatoid arthritis, and identifies a new putative similarity between bipolar disease and hypertension.




Author(s):
Alexander Schoenhuth, Simon Fraser University, Canada
Christoph Hafemeister, Max Planck Institute for Molecular Genetics, Germany
Alexander Schliep, Max Planck Institute for Moelcular Genetics, Germany

Presenting Author: Ivan Gesteira Costa, Center of Informatics, Federal University of Pernambuco

Motivation: Personalized medicine based on molecular views of diseases, predominantly supported by gene expression profiling methods, has become increasingly popular. However, when analyzing clinical gene expression data one faces multiple challenges; most of the well-known theoretical issues such as high-dimensional feature spaces, few examples, noise and missing data apply. Special care is needed when designing classification procedures that support personalized diagnosis and choice of treatment. Here, we particularly focused on classification of interferon-(IFN) treatment response in Multiple Sclerosis (MS) patients which has attracted special attention in the recent past. While still being standard, about half of the patients remains unaffected by IFN treatment. In such cases, treatment should be timely ceased to mitigate the side effects. Results: We propose constrained estimation of mixtures of hidden Markov models as a methodology to classify patient response to IFN treatment. The advantages of our approach are to take the temporal nature of the data into account and its robustness with respect to noise, missing data and mislabeled samples. Moreover, mixture estimation enables to explore the presence of response sub-groups of patients on the transcriptional level. We clearly outperformed all prior approaches in terms of prediction accuracy, raising it, for the first time, above 90%. Moreover, we were able to identify potentially mislabeled samples and to subdivide the good responders into two subgroups that exhibited different transcriptional response programs. This is supported by recent findings on MS pathology and therefore may raise interesting clinical follow-up questions.




Author(s):
Stefan Wuchty, Northwestern Institute on Complexity, Northwestern University, United States
Michael Ferdig, Eck Institute for Global Health, Department of Biological Sciences, University of Notre Dame, United States
Teresa Przytycka*, NIH, United States

Presenting Author: Yang Huang, NCBI/NLM/NIH

Analysis of expression quantitative trait loci (eQTL) significantly contributes to the determination of gene regulation programs. However, the discovery and analysis of associations of gene expression levels and their underlying sequence polymorphisms still pose many challenges. Methods are limited in their ability to illuminate the full structure of the eQTL data. Most rely on an exhaustive, genome scale search that considers all possible locus-gene pairs and tests the linkage between each locus and gene. To analyze eQTLs in a more comprehensive and efficient way, we developed the Graph based eQTL Decomposition method (GeD) that allows us to model genotype and expression data using an eQTL association graph. Through graph-based heuristics, GeD identifies dense subgraphs in the eQTL association graph. By identifying eQTL association cliques that expose the hidden structure of genotype and expression data, GeD effectively filters out most locus-gene pairs that are unlikely to have significant linkage. We apply GeD on eQTL data from Plasmodium falciparum, the human malaria parasite, and show that GeD reveals the structure of the relationship between all loci and all genes on a whole genome level. Furthermore, GeD allows us to uncover many more eQTLs with lower FDR than were previously observed, providing an important complement to traditional eQTL analysis methods.




Author(s):
K-John Cheung Jr., British Columbia Cancer Agency, Canada
Nathalie Johnson, British Columbia Cancer Agency, Canada
Randy Gascoyne, British Columbia Cancer Agency, Canada
Douglas Horsman, British Columbia Cancer Agency, Canada
Raymond Ng, Dept. of Computer Science, University of British Columbia, Canada
Kevin Murphy, University of British Columbia, Canada:

Presenting Author: Sohrab Shah, University of British Columbia/BC Cancer Agency

Analysis of array comparative genomic hybridization (aCGH) data for recurrent DNA copy number alterations from a cohort of patients can yield distinct sets of molecular signatures or profiles. This can be due to the presence of heterogeneous cancer subtypes within a supposedly homogeneous population. We propose a novel statistical method for automatically detecting such subtypes or clusters. Our approach is model-based: each cluster is defined in terms of a sparse profile, which contains the locations of unusually frequent alterations. The profile is represented as a hidden Markov model (HMM). Samples are assigned to clusters based on their similarity to the cluster s profile. We simultaneously infer the cluster assignments and the cluster profiles using an expectation maximization-like algorithm. We show, using a realistic simulation study, that our method is significantly more accurate than rival clustering techniques. We then apply our method to two clinical data sets. In particular, we examine previously reported aCGH data from a cohort of 106 follicular lymphoma patients, and discover clusters that are known to correspond to clinically relevant subgroups. In addition, we examine a cohort of 92 diffuse large B-cell lymphoma (DLBCL) patients, and discover previously unreported clusters of biological interest which have inspired followup clinical research on an independent cohort.




Author(s):

Presenting Author: Yoram Louzoun,

Viruses employ various means to evade immune detection. One common evasion strategy is the removal of CD8 cytotoxic T-lymphocyte (CTL) epitopes. We here use a combination of multiple bioinformatic tools and large amount of genomic data to compute the epitope repertoire presented by over 1,300 viruses in multiple HLA alleles. We define the "Size of Immune Repertoire (SIR) score," which represents the ratio between the epitope density within a protein and the expected density. We show that viral proteins in general have a higher epitope density than human proteins. This difference is due to a good fit of the human MHC molecules to the typical amino acid usage of viruses. Among different viruses, viruses infecting humans present less epitopes than non-human viruses. This selection is not at the amino acid usage level, but through the removal of specific epitopes. Within a single virus, not all the proteins express the same epitopes density. Proteins expressed early in the viral life cycle have a lower epitope density than late proteins. Such a difference is not observed in non-human viruses. The removal of early epitopes and the targeting of the CTL immune response to late viral proteins, allow the virus time interval to propagate before its host cells are destroyed by CTLs.



Comparative Genomics



Author(s):
Dannie Durand*, Carnegie Mellon University, United States

Presenting Author: Jacob Joseph, Carnegie Mellon University

Motivation: Classification of gene and protein sequences into homologous families is an essential step in comparative genomic analyses. This is typically achieved by construction of a sequence homology network, followed by clustering to identify dense subgraphs corresponding to families. Single domain families can now be classified accurately due to major algorithmic advances in remote homology detection and graph clustering. However, classification of multidomain families remains a significant challenge. The presence of promiscuous domains in sequences that do not share common ancestry introduces false edges in the homology network that link unrelated families and stymy clustering algorithms. Results: Here we investigate a network rewiring strategy designed to eliminate edges due to promiscuous domains. We show that this strategy can reduce noise in and restore structure to artificial networks with simulated noise, as well as to the yeast genome homology network. We further evaluate this approach on a hand-curated set of multidomain sequences in mouse and human and demonstrate that classification using the rewired network delivers dramatic improvement in Precision and Recall, compared with the homology network inferred using sequence comparison. Families in our test set exhibit a broad range of domain architectures and sequence conservation, demonstrating that our method is flexible, robust and suitable for high-throughput, automated processing of heterogeneous, genome-scale data.




Author(s):

Presenting Author: Xiaohui Xie, University of California Irvine

Motivation: Comparing the genomes from closely related species provides a powerful tool for detecting and interpreting functional elements encoded in the genomes. Many methods have been developed to identify conserved sequence across species; however, to date all methods have focused on conservation as a decrease in the rate of mutation and have ignored selection acting on the pattern of mutations. Results: We present a new approach that takes advantage of deeply sequenced clades to identify evolutionary selection by uncovering not only signatures of rate-based conservation but also substitution patterns characteristic of sequence undergoing natural selection. We describe a new statistical method for modeling biased nucleotide substitutions, a learning algorithm for inferring site-specific substitution biases directly from sequence alignments, and a hidden Markov model for detecting constrained elements characterized by biased substitutions. The algorithms are implemented in a freely available Java software package, called SIPHY. We show that this new approach can identify significantly more degenerate constrained sequences than the rate-based methods. Applying SIPHY to the ENCODE regions, we identify as much as 10.2% of the sequences in the regions as evolutionarily constrained.



Databases and Ontologies



Author(s):

Presenting Author: Stefan Schulz,

ABSTRACT Motivation: For many years, the UMLS Semantic Network (SN) has been used as an upper-level semantic framework for the categorization of ter-minological resources in biomedicine. BioTop has recently been developed as an upper-level ontology for the biomedical domain. In contrast to the SN, it is founded upon strict ontological principles, using OWL-DL a formal representation language, which has become popular through the Semantic Web. In order to make logic-based reasoning available for the resources annotated or classified by the SN, a mapping ontology was developed aligning the SN with Bio-Top. Methods: The theoretical foundations and the practical realization of the alignment are described, with a focus on the design decisions taken, the problems encountered, and the adaptations of BioTop that be-came necessary. For evaluation, UMLS concept pairs obtained from MEDLINE abstract sentences by a named entity recognition system were tested for possible semantic relationships. Furthermore, all semantic type combinations which occur in the UMLS Metathesau-rus were checked for consistency. Results: The effort-intensive alignment process required major design changes and enhancements of BioTop and brought up several de-sign errors that could be fixed. A comparison between a human curator and the ontology yielded only a low agreement. Ontology reasoning was also used to successfully identify 89 inconsistent semantic type combinations. Availability: BioTop, the OWL-DL representation of the UMLS SN, and the map-ping ontology are available at http://purl.org/biotop/




Author(s):
Gang Feng, Northwestern University, United States
Pan Du, Northwestern University, United States
Michelle Holko, Northwestern University, United States
Warren Kibbe, Northwestern University, United States
Jie Song, University of Chicago, United States

Presenting Author: Warren Kibbe, Northwestern University

Subjective methods have been reported to adapt a general-purpose ontology for a specific application. For instance, GO Slim was created from Gene Ontology (GO) to generate a highly aggregated report of the human-genome annotation. We propose statistical methods to adapt the general-purpose, OBO Foundry Disease Ontology (DO) for the identification of gene-disease associations. Thus, we need a simplified definition of disease categories derived from implicated genes. Based on the assumption that the DO terms having similar associated genes are closely related, we group the DO terms based on the similarity of gene-to-DO mapping profiles. Two types of binary distance metrics are defined to measure the overall and subset similarity between DO terms. A compactness-scalable fuzzy clustering method is then applied to group similar DO terms. To reduce false clustering, the semantic similarities between DO terms are also used to constrain clustering results. As such, the DO terms are aggregated and the redundant DO terms are largely removed. Using these methods, we constructed a simplified vocabulary list from the DO called Disease Ontology Lite (DOLite). We demonstrated that DOLite results in more interpretable results than DO for gene-disease association tests. The resultant DOLite has been used in the Functional Disease Ontology (FunDO) web application at http://projects.bioinformatics.northwestern.edu/fundo.




Author(s):
Daniel Dvorkin, U. Colorado Denver, United States
K. Bretonnel Cohen, University of Colorado School of Medicine, United States
Lawrence Hunter, UC Denver, United States

Presenting Author: Karin Verspoor, U. Colorado Denver

Motivation: It is important for the quality of biological ontologies that similar concepts be expressed consistently, or univocally. Univocality is relevant for the usability of the ontology for humans, as well as for computational tools that rely on regularity in the structure of terms. However, in practice terms are not always expressed consistently, and we must develop methods for identifying terms that are not univocal so that they can be corrected. Results: We developed an automated transformation-based clustering methodology for detecting terms that use different linguistic conventions for expressing similar semantics. These term sets represent occurrences of univocality violations. Our method was able to identify 67 examples of univocality violations in the Gene Ontology.



Evolution and Phylogeny



Author(s):
Regula Rupp, Tuebingen University, Germany
Vincent Berry, LIRMM, France
Philippe Gambette, LIRMM - Universite Montpellier II, C.N.R.S., France
Christophe Paul, CNRS - LIRMM, France

Presenting Author: Daniel Huson, University of Tuebingen

Motivation: Developing methods for computing phylogenetic networks from biological data is an important problem posed by molecular evolution and much work is currently being undertaken in this area. Although promising approaches exist, there are no tools available that biologists could easily and routinely use to compute rooted phylogenetic networks on real datasets containing tens or hundreds of taxa. The problem of computing an optimal rooted phylogenetic network from a set of clusters, and the problem of just determining whether a given network contains a given cluster, are both hard, in general. Hence, some researchers have focused on topologically restricted classes of networks, such as galled trees and level-k networks, that are more tractable, but have the practical draw-back that a given set of clusters will usually not possess such a representation. Results: In this paper we argue that galled networks provide a good trade-off between level of generality and tractability. Any set of clusters can be represented by some galled network and the question whether a cluster is contained in such a network is easy to solve. Although the computation of an optimal galled network involves solving instances of two different NP-complete problems, in practice our algorithm does so on large datasets containing hundreds of taxa and many reticulations in seconds, as illustrated on a dataset containing 279 prokaryotes. Availability: We provide a fast, robust and easy-to-use implementation of this work in version 2.0 of our tree-handling software Dendroscope,freely available from http://www.dendroscope.org.




Author(s):
Nicholas Mundy, Department of Zoology, University of Cambridge, United Kingdom

Presenting Author: Timothy O'Connor, University of Cambridge

Motivation: Mapping between genotype and phenotype is one of the primary goals of evolutionary genetics but one that has received little attention at the interspecies level. Recent developments in phylogenetics and statistical modelling have typically been used to examine molecular and phenotypic evolution separately. We have used this background to develop phylogenetic substitution models to test for associations between evolutionary rate of genotype and phenotype. We do this by creating hybrid rate matrices between genotype and phenotype. Results: Simulation results show our models to be accurate in detecting genotype-phenotype associations and robust for various factors that typically affect maximum likelihood methods, such as number of taxa, level of relevant signal, proportion of sites effected, and length of evolutionary divergence. Further, simulations show that our method is robust to homogeneity assumptions. We apply the models to data sets of male reproductive system genes in relation to mating systems of primates. We show that evolution of semenogelin I, semenogelin II and zonadhesin are significantly associated with mating systems whereas two negative control genes (cytochrome b and peptidase inhibitor 3) show no significant association. This provides the first hybrid substitution model of which we are aware to directly test the association between genotype and phenotype using a phylogenetic framework. Availability: Perl and HYPHY scripts are available upon request from the authors.



Gene Regulation and Transcriptomics



Author(s):
Aaron Vollrath, University of Wisconsin, Madison, United States
Christopher Bradfield, University of Wisconsin, Madison, United States
Mark Craven, University of Wisconsin, United States

Presenting Author: Adam Smith, University of Wisconsin, Madison

Motivation: Characterizing and comparing temporal gene expression responses is an important computational task for answering a variety of questions in biological studies. Algorithms for aligning time series represent a valuable approach for such analyses. However, previous approaches to aligning gene-expression time series have assumed that all genes should share the same alignment. Our work is motivated by the need for methods that identify sets of genes that differ in similar ways between two time series, even when their expression profiles are quite different. Results: We present a novel algorithm that calculates clustered alignments; the method finds clusters of genes such that the genes within a cluster share a common alignment, but each cluster is aligned independently of the others. We also present an efficient new segment-based alignment algorithm for time series called SCOW. We evaluate our methods by assessing the accuracy of alignments computed with sparse time series from a toxicogenomics data set. The results of our evaluation indicate that our clustered alignment approach and SCOW provide more accurate alignments than previous approaches. Additionally, we apply our clustered alignment approach to characterize the effects of a conditional Mop3 knockout in mouse liver. Availability: Source code is available at http://www.biostat.wisc.edu/~aasmith/catcode. Contact: aasmith@cs.wisc.edu




Author(s):
Naoki Abe, ERROR, United States
Yan Liu, IBM, United States
Saharon Rosset, Tel Aviv University, Israel

Presenting Author: Saharon Rosset, IBM Research

We consider the problem of discovering gene regulatory networks from time-series microarray data. Recently, Graphical Granger modeling has gained considerable attention as a promising direction for addressing this problem. These methods apply graphical modeling methods on time series data and invoke the notion of "Granger causality" to make assertions on causality through inference on time lagged effects. Existing algorithms, however, have neglected an important aspect of the problem-- the group structure among the lagged temporal variables naturally imposed by the time series they belong to. Specifically, existing methods in computational biology share this shortcoming, as well as additional computational limitations, prohibiting their effective applications to large data sets including a large number of genes and many data points. In the present paper we propose a novel methodology which we term "grouped graphical Granger modeling method", which overcomes the limitations mentioned above by applying a regression method suited for high dimensional and large data, and by leveraging the group structure among the lagged temporal variables according to the time series they belong to. We demonstrate the effectiveness of the proposed methodology on both simulated and actual gene expression data, specifically the human cancer cell (HeLa S3) cycle data. The simulation results show that the proposed methodology generally exhibits higher accuracy in recovering the underlying causal structure. Those on the gene expression data demonstrate that it leads to improved accuracy with respect to prediction of known links, and also uncovers additional causal relationships uncaptured by earlier works.




Author(s):

Presenting Author: Le Song,

Motivation: Gene regulatory networks underlying temporal processes, such as the cell cycle or the life cycle of an organism, can exhibit significant topological changes to facilitate the underlying dynamic regulatory functions. Thus it is essential to develop methods that capture the temporal evolution of the regulatory networks. These methods will be an enabling first step for studying the driving forces underlying the dynamic gene regulation circuity, and predicting the future network structures in response to internal and extenal stimuli. Results: We introduce a kernel reweighted logistic regression for reverse engineering the dynamic interactions between genes based on their time series of expression values. We applied the proposed method to estimate the latent sequence of temporal rewiring networks of 588 genes involved in the developmental process during the life cycle of Drosophila melanogaster. Our results offer the first glimpse into the temporal evolution of gene networks in a living organism during its full developmental course. Our results also show that many genes exhibit distintive functions at different stages along the developmental cycle. Availability: All source code and data sets will be made available at http://www.cs.cmu.edu/~lesong/code/kernelw Contact: epxing@cs.cmu.edu




Author(s):
Kartik Mohanram, Rice University, United States
Alessandro Di Cara, Merck Serono, Switzerland
Ioannis Xenarios, Swiss Institute of Bioinformatic - Vital-IT, Switzerland
Giovanni De Micheli, Ecole Polytechnique Federale de Lausanne, Switzerland

Presenting Author: Abhishek Garg, Ecole Polytechnique Federale de Lausanne

Motivation: Understanding gene regulation in biological processes and modeling the robustness of underlying regulatory networks is a crucial problem that is currently being addressed by computational systems biologists. Lately, there has been a renewed interest in Boolean modeling techniques for Gene Regulatory Networks (GRNs). However, due to their deterministic nature, it is often difficult to identify whether these modeling approaches are robust to the addition of stochastic noise that is widespread in gene regulatory processes. Stochasticity in Boolean models of GRNs has been addressed relatively sparingly in the past, mainly by flipping the expression of genes between different expression levels with a predefined probability. This Stochasticity In Nodes (SIN) model leads to over-representation of noise in GRNs and hence non-correspondence with biological observations. Results: In this paper, we introduce the Stochasticity In Functions (SIF) model for simulating stochasticity in Boolean models of GRNs. By providing biological motivation behind the use of the SIF model and applying the model on T-Helper and T-Cell activation networks, we show that SIF provides more biologically robust results than the existing SIN model of stochasticity in GRNs. Availability: Algorithms are made available under our Boolean modeling toolbox, GenYsis. The software binaries can be downloaded from http://si2.epfl.ch/~garg/genysis.html.




Author(s):
Andreas Beyer, Group Leader, Germany
Srinivasan Parthasarathy, Department of Computer Science and Engineering, The Ohio State University, United States
Christopher Workman*, DTU, Denmark

Presenting Author: Duygu Ucar, ohio state univ.

Chromatin immunoprecipitation (ChIP-chip) experiments enable capturing physical interactions between regulatory proteins and DNA in vivo. However, measurement of chromatin binding alone is not sufficient to detect regulatory interactions. A detected binding event may not be biologically relevant, or a known regulatory interaction might not be observed under the growth conditions tested so-far. To correctly identify physical interactions between TFs and genes and to determine their regulatory implications under various experimental conditions, we integrated ChIP-chip data with motif binding sites, nucleosome occupancy and mRNA expression datasets within a probabilistic framework. This framework was specifically tailored for the identification of functional and non-functional DNA binding events. Using this, we estimate that only 50% of condition specific protein-DNA binding in budding yeast is functional. We further investigated the molecular factors determining the functionality of protein-DNA interactions under diverse growth conditions. Our analysis suggests that the functionality of binding is highly condition specific and highly dependent on the presence of specific co-factors. Hence, the joint analysis of both, functional and non-functional DNA binding may lend important new insights into transcriptional regulation.



Other Bioinformatics Applications and Methods



Author(s):
Gregory Batt*, INRIA Paris-Rocquencourt, France
Fran?ßois Fages, INRIA Rocquencourt, France
Sylvain Soliman, INRIA, France

Presenting Author: Aurelien Rizk, INRIA Paris-Rocquencourt

Motivation: Robustness is the capacity of a system to maintain a function in the face of perturbations. It is essential for the correct functioning of natural and engineered biological systems. Robustness is generally defined in an ad-hoc, problem-dependent manner, thus hampering the fruitful developement of a theory of biological robustness, advocated by Kitano [Mol Syst Biol, 3:137, 2007]. Results: In this paper, we propose a general definition of robustness that applies to any biological function expressible in temporal logic LTL, and to broad model classes and perturbation types. Moreover, we propose a computational approach and an implementation in BIOCHAM 2.8 for the automated estimation of the robustness of a given behavior with respect to a given set of perturbations. The applicability and biological relevance of our approach is demonstrated by testing and improving the robustness of the timed behavior of a synthetic transcriptional cascade that could be used as a biological timer for synthetic biology applications. Availability: Version 2.8 of BIOCHAM and the transcriptional cascade model are available at http://contraintes.inria.fr/BIOCHAM/




Author(s):

Presenting Author: Yoshihiro Yamanishi,

Motivation: The IUBMBÕs Enzyme Nomenclature system, commonly known as the EC (Enzyme Commission) numbers, plays key roles in classifying enzymatic reactions and in linking the enzyme genes or proteins to reactions in metabolic pathways. There are numerous reactions known to be present in various pathways but without any official EC numbers, most of which have no hope to be given ones because of the lack of the published articles on enzyme assays. Results: In this paper we propose a new method to predict the potential EC numbers to given reactant pairs (substrates and products) or uncharacterized reactions, and a web-server named e-zyme as an application. This technology is based on our original biochemical transformation pattern which we call an "RDM pattern", and consists of three steps: (1) graph alignment of a query reactant pair (substrates and products) for computing the query RDM pattern, (2) multi-layered partial template matching by comparing the query RDM pattern with template patterns related with known EC numbers, and (3) weighted major voting scheme for selecting appropriate EC numbers. As the result, cross-validation experiments show that the proposed method achieves both high coverage and high prediction accuracy at a practical level, and consistently outperforms the previous method. The e-zyme system is implemented as a rapid and high performance tool for chemical annotation on a large scale.




Author(s):
Lodewyk Wessels, Division of Molecular Biology, The Netherlands Cancer Institute, Amsterdam, NL, Netherlands
Marcel Reinders, Delft University of Technology, Netherlands
Ilya Shmulevich, Institute for Systems Biology, United States

Presenting Author: Theo Knijnenburg, Institute for Systems Biology

Motivation Permutation tests have become a standard tool to assess the statistical significance of an event under investigation. The statistical significance, as expressed in a P-value, is calculated as the fraction of permutation values that are at least as extreme as the original statistic, which was derived from non-permuted data. This empirical method directly couples both the minimal obtainable P-value and the resolution of the P-value to the number of permutations. Thereby, it imposes upon itself the need for a very large number of permutations when small P-values are to be accurately estimated. This is computationally expensive and often infeasible. Results A method of computing P-values based on tail approximation is presented. The tail of the distribution of permutation values is approximated by a generalized Pareto distribution. A good fit and thus accurate P-value estimates can be obtained with a drastically reduced number of permutations when compared to the standard empirical way of computing P-values.




Author(s):
Nils Gehlenborg, European Bioinformatics Institute, European Bioinformatics Institute
Ali Faisal, Helsinki University of Technology, Computer and Information Science
Alvis Brazma, European Bioinformatics Institute, European Bioinformatics Institute
Samuel Kaski, Helsinki University of Technology, Computer and Information Science

Presenting Author: Jose Caldas, Helsinki University of Technology

Motivation: As ArrayExpress and other repositories of genome-wide experiments are reaching a mature size, it is becoming more meaningful to search for related experiments, given a particular study. We introduce methods that allow for the search to be based upon measurement data, instead of the more customary annotation data. The goal is to retrieve experiments in which the same biological processes are activated. This can be due either to experiments targeting the same biological question, or to as-yet unknown relationships. Results: We use a combination of existing and new probabilistic machine learning techniques to extract information about the biological processes differentially activated in each experiment, to retrieve earlier experiments where the same processes are activated, and to visualize and interpret the retrieval results. Case studies on a subset of ArrayExpress show that, with a sufficient amount of data, our method indeed finds experiments relevant to particular biological questions. Results can be interpreted in terms of biological processes using the visualization techniques. Availability: The code is available from http://www.cis.hut.fi/projects/mi/software/ismb09




Author(s):
Ruedi Aebersold, Institute of Molecular Systems Biology, ETH Zurich and Institute for Systems Biology, Seattle WA and Faculty of Science, University of Zurich, Switzerland
Joachim M. Buhmann*, Swiss Federal Institute of Technology ETHZ, Switzerland

Presenting Author: Manfred Claassen, ETH Zürich

Liquid chromatography tandem mass spectrometry (LC-MS/MS) is the predominant method to comprehensively characterize complex protein mixtures such as samples from pre-fractionated or complete proteomes. In order to maximize proteome coverage for the studied sample, i.e. identify as many traceable proteins as possible, LC-MS/MS experiments are typically repeated extensively and the results combined. Proteome coverage prediction is the task of estimating the number of peptide discoveries of future LC-MS/MS experiments. Proteome coverage prediction is important to enhance the design of efficient proteomics studies. To date, there does not exist any method to reliably estimate the increase of proteome coverage at an early stage. We propose an extended infinite Markov model DiriSim to extrapolate the progression of proteome coverage based on a small number of already performed LC-MS/MS experiments. The method explicitly accounts for the uncertainty of peptide identifications. We tested DiriSim on a set of 37 LC-MS/MS experiments of a complete proteome sample and demonstrated that DiriSim correctly predicts the coverage progression already from a small subset of experiments. The predicted progression enabled us to specify maximal coverage for the test sample. We demonstrated that quality requirements on the final proteome map impose an upper bound on the number of useful experiment repetitions and limit the achievable proteome coverage.



Population Genomics



Author(s):
Elena Helman, Brown University, United States
Ali Bashir, UCSD, United States
Benjamin Raphael*, Brown University, United States

Presenting Author: Suzanne Sindi, Brown University

Motivation: Structural variants, including duplications, insertions, deletions, and inversions of large blocks of DNA sequence, are an important contributor to human genome variation. Measuring structural variants in a genome sequence is significantly more challenging than measuring single nucleotide changes. Current approaches for structural variant identification, including array comparative genomic hybridization (aCGH) and paired-end DNA sequencing, do not identify the boundaries of variants precisely. Consequently, most reported human structural variants are poorly defined and not readily compared across different studies and measurement techniques. Results: We introduce a geometric approach for identification, classification, and comparison of structural variants. Our approach represents the uncertainty in measurement of a structural variant as a polygon in the plane, and identifies measurements supporting the same variant by computing intersections of polygons. We derive a computational geometry algorithm to efficiently identify all such intersections. We apply this algorithm to recent sequencing data from 9 individual human genomes and several cancer genomes. Our method provides improved localization of the boundaries of structural variants, distinguishes genetic from somatic structural variants in cancer genomes, and integrates aCGH and paired-end sequencing measurements of structural variants in cancer genomes. This work presents the first general framework for comparing structural variants across multiple samples and measurement techniques, and will be useful for studies of both genetic structural variants and somatic rearrangements in cancer.




Author(s):
Kyung-Ah Sohn, Carnegie Mellon University, United States
Eric Xing*, Carnegie Mellon University, United States

Presenting Author: Seyoung Kim, Carnegie Mellon University

Motivation: Many complex disease syndromes such as asthma consist of a large number of highly related, rather than independent, clinical phenotypes, raising a new technical challenge in identifying genetic variations associated simultaneously with correlated traits. Although a causal genetic variation may influence a group of highly correlated traits jointly, most of the previous association analyses considered each phenotype separately, or combined results from a set of single-phenotype analyses. Results: We propose a new statistical framework called graph-guided fused lasso (GFlasso) to address this issue in a principled way. Our approach explicitly represents the dependency structure among the quantitative traits as a network, and leverages this trait network to encode structured regularizations in a multivariate regression model over the genotypes and traits, so that the genetic markers that jointly influence subgroups of highly correlated traits can be detected with high sensitivity and specificity. While most of the traditional methods examined each phenotype independently, our approach analyzes all of the traits jointly in a single statistical method to discover the genetic markers that perturb a subset of correlated triats jointly rather than a single trait. Using simulated datasets based on the HapMap consortium data and an asthma dataset, we compare the performance of our method with the single-marker analysis, and other sparse regression methods that do not use any structural information in the traits. Our results show that there is a significant advantage in detecting the true causal SNPs when we incorporate the correlation pattern in traits using our proposed methods.




Author(s):
Sriram Sankararaman*, UC Berkeley, United States
Gad Kimmel*, International Computer Science Institute, United States
Eran Halperin*, International Computer Science Institute, Berkeley, United States

Presenting Author: Sriram Sankararaman, U C Berkeley

A characterization of the genetic variation of recently admixed populations may reveal historical population events, and is useful for the detection of SNPs associated with diseases through association studies and admixture mapping. Inference of locus-specific ancestry is key to our understanding of the genetic variation of such populations.While a number of methods for the inference of locus-specific ancestry are accurate when the ancestral populations are quite distant (e.g. African-Americans), current methods incur a large error rate when inferring the locus-specific ancestry in admixed populations where the ancestral populations are closely related (e.g., Americans of European descent). Results: In this work, we extend previous methods for the inference of locus-specific ancestry by the incorporation of a refined model of recombination events. We present an efficient dynamic programming algorithm to infer the locus-specific ancestries in this model, resulting in a method that attains improved accuracies; the improvement is most significant when the ancestral populations are closely related. An evaluation on a wide range of scenarios, including admixtures of the 52 population groups from the Human Genome Diversity Project demonstrates that locus-specific ancestry can indeed be accurately inferred in these admixtures using our method. Finally, we demonstrate that imputation methods can be improved by the incorporation of locus-specific ancestry, when applied to admixed populations.




Author(s):
Michael Jordan*, UC Berkeley, United States
Yun Song*, UC Berkeley, United States

Presenting Author: Junming Yin, UC Berkeley

Motivation: Two known types of meiotic recombination are crossovers and gene conversions. Although they leave behind different footprints in the genome, it is a challenging task to tease apart their relative contributions to the observed genetic variation. In particular, for a given population SNP data set, the joint estimation of the crossover rate, the gene conversion rate, and the mean conversion tract length is widely viewed as a very difficult problem. Results: In this paper, we devise a full-likelihood method based on an interleaved hidden Markov model (HMM) that can jointly estimate the aforementioned three parameters fundamental to recombination. Our method significantly improves upon a recently proposed method based on a factorial HMM. We show that modeling overlapping gene conversions is crucial for improving the joint estimation of the gene conversion rate and the mean conversion tract length. We test the performance of our method on simulated data. We then apply our method to analyze real biological data from the telomere of the X chromosome of Drosophila melanogaster, and show that the ratio of the gene conversion rate to the crossover rate for the region may not be nearly as high as previously claimed. Availability: A software implementation of the algorithms discussed in this paper will be made publicly available.




Author(s):
Yun Song*, UC Berkeley, United States

Presenting Author: Anand Bhaskar, UC Berkeley

Motivation: A fundamental problem in population genetics, which being also of importance to forensic science, is to compute the match probability (MP) that two individuals randomly chosen from a population have identical alleles at a collection of loci. At present, 11 to 13 unlinked autosomal microsatellite loci are typed for forensic use. In a finite population, the genealogical relationships of individuals can create statistical non-independence of alleles at unlinked loci. However, the so-called product rule, which is used in courts in the US, computes the MP for multiple unlinked loci by assuming statistical independence, multiplying the one-locus MPs at those loci. Analytically testing the accuracy of the product rule for more than 5 loci has hitherto remained an open problem. Results: In this paper, we adopt a flexible graphical framework to compute multi-locus MPs analytically. We consider two standard models of random mating, namely the Wright-Fisher and Moran models. We succeed in computing haplotypic MPs for up to 10 loci in the Wright-Fisher model, and up to 13 loci in the Moran model. For a finite population, we show that the MPs for a large number of loci predicted by the product rule are highly sensitive to mutation rates in the range of interest, while the true multi-locus MPs are not. Furthermore, we show that the Wright-Fisher and Moran models may produce drastically different MPs for a finite population, and that this difference grows with the number of loci and mutation rates. Although the two models converge to the same coalescent or diffusion limit, in which the population size approaches infinity, we demonstrate that, when multiple loci are considered, the rate of convergence in the Moran model is significantly slower than that in the Wright-Fisher model. Availability: Our C++ program for computing multi-locus MPs will be made publicly available.




Author(s):
Christopher Meek*, Microsoft Research, United States
Ydo Wexler*, Microsoft Research, United States

Presenting Author: Ydo Wexler, Computer Science Dept., Technion, Israel

We develop an HMM based algorithm for computing exact parametric and non-parametric linkage scores in larger pedigrees than was possible before. The algorithm is applicable whenever there are chains of persons in the pedigree with no genetic measurements and with unknown affection status. The algorithm is based on shrinking the state space of the HMM considerably using such chains. In a two g-degree cousins pedigree the reduction drops the state space from being exponential in g to being linear in g. For a Finnish family in which two affected children suffer from a rare cold-inducing sweating syndrome we were able to reduce the state space by more than 5 orders of magnitude from 2^{50} to 2^{32}. In another pedigree of state space size of 2^{27}, used for a study of pituitary adenoma, the state space reduced by a factor of 8.5 and consequently exact linkage scores can now be computed, rather than approximated.



Protein Interactions and Molecular Networks



Author(s):

Presenting Author: Xin Guo,

Recent advances in high-throughput experimental techniques have yielded a large amount of data on protein-protein interaction (PPI) networks, naturally leading to research into network alignment: the comparative analysis of PPI networks across species to detect conserved functional modules. Most conventional network alignment algorithms use a node-then-edge-alignment paradigm: they first identify homologous proteins across networks and then consider interactions among them to construct network alignments. Here, we present DNAlign, a domain-oriented network aligner that adopts an alternative direct-edge-alignment paradigm. Specifically, DNAlign does not require the explicit identification of homologous proteins but instead directly searches for equivalent interactions across networks by comparing the domain interaction contents of PPIs. We apply our approach to identify conserved protein complexes in yeast-fly and yeast-worm PPI networks, and show that DNAlign outperforms two recent network aligners.




Author(s):
Francis Bach, INRIA - Ecole Normale Superieure, France
Jean-Philippe Vert*, Ecole des Mines de Paris, France

Presenting Author: Mikhail Zaslavskiy, Mines ParisTech

Motivation: Aligning protein-protein interaction (PPI) networks of different species has drawn a considerable interest recently. This problem is important to investigate evolutionary conserved pathways or protein complexes across species, and to help in the identification of functional orthologs through the detection of conserved interactions. It is however a difficult combinatorial problem, for which only heuristic methods have been proposed so far. Results: We reformulate the PPI alignment as a graph matching problem, and investigate how state-of-the-art graph matching algorithms can be used for that purpose. We differentiate between two alignment problems, depending on whether strict constraints on protein matches are given, based on sequence similarity, or whether the goal is instead to find an optimal compromise between sequence similarity and interaction conservation in the alignment. We propose new methods for both cases, and assess their performance on the alignment of the yeast and fly PPI networks. The new methods consistently outperform state-of-the-art algorithms, retrieving in particular 78% more conserved interactions than IsoRank for a given level of sequence similarity. Availability: All data and codes are freely and publicly available from the authors.




Author(s):

Presenting Author: Michael Baym,

Motivation: With the increasing availability of large protein-protein interaction networks, the question of protein network alignment is becoming central to systems biology. Network alignment is further delineated into two sub-problems: local alignment, to find small conserved motifs across networks, and global alignment, which attempts to find a best mapping between all nodes of the two networks. In this paper, our aim is to improve upon existing global alignment results. Better network alignment will enable, among other things, more accurate identification of functional orthologs across species. Results: We introduce IsoRankN (IsoRank-Nibble) a global multiple-network alignment tool based on spectral clustering on the induced graph of pairwise alignment scores. IsoRankN outperforms existing algorithms for global network alignment in coverage and consistency on multiple alignments of the five available eukaryotic networks. Being based on spectral methods, IsoRankN is both error-tolerant and computationally efficient. Availability: Our software is available freely for non-commercial purposes on request from: http://isorank.csail.mit.edu/




Author(s):

Presenting Author: Shira Mintz-Oron,

Revealing the subcellular localization of proteins within membrane-bound compartments is of a major importance for inferring protein function. Though current high-throughput localization experiments provide valuable data, they are costly and time-consuming, and due to technical difficulties not readily applicable for many Eukaryotes. Physical characteristics of proteins, such as sequence targeting signals and amino acid composition are commonly used to predict sub-cellular localizations using computational approaches. Recently it was shown that protein-protein interaction (PPI) networks can be used to significantly improve the prediction accuracy of protein sub-cellular localization. However, as high-throughput PPI data depend on costly high-throughput experiments and are currently available for only a few organisms, the scope of such methods is yet limited. This study presents a novel constraint-based method for predicting sub-cellular localization of enzymes based on their embedding metabolic network, relying on a parsimony principle of a minimal number of cross-membrane metabolite transporters. In a cross-validation test of predicting known sub-cellular localization of yeast enzymes, the method is shown to be markedly robust, providing accurate localization predictions even when only 20% of the known enzyme localizations are given as input. It is shown to outperform pathway enrichment-based methods both in terms of prediction accuracy and in its ability to predict the sub-cellular localization of entire metabolic pathways when no apriori pathway-specific localization data is available (and hence enrichment methods are bound to fail). With the number of available metabolic networks already reaching more than 600 and growing fast, the new method may significantly contribute to the identification of enzyme localizations in many different organisms.



Protein Structure and Function



Author(s):
Fa Zhang, Institute of Computing Technology, China
Gongming Wang, Institute of Computing Technology, Chinese Academy of Sciences, China
Zhiyong Liu, Institute of Computing Technology, Chinese Academy of Scinece, China

Presenting Author: Liya Fan, Institute of Computing Technology, Chinese Academy of Sciences

A particle reclustering framework (PRF) is introduced to refine clusters produced during the model refining stage of EMAN, one of the most popular software packages for single particle reconstruction. The purpose is more accurately determining orientations of molecule particle images in the clusters, and thereby achieving higher resolutions of consequent 3-D structures. The framework consists of 3 components. Each of them is responsible for one of the basic tasks of PRF: normalization, threshold determination, and reclustering. Our implementation is also described and proved to meet the constraints proposed by PRF. Experiments revealed that our implementation improved resolutions of consequent structures for most cases, but only a little extra execution time was incurred. Therefore it is practical to incorporate PRF in EMAN to improve qualities of generated 3-D structures.




Author(s):
Li Xie, University of California, San Diego, United States
Philip Bourne, UCSD, United States

Presenting Author: Lei Xie, University of California, San Diego

Functional relationships between proteins that do not share global sequence or structure similarity can be established by detecting their ligand binding site similarity. For a large scale comparison it is critical to accurately and efficiently assess the statistical significance of this similarity. Most existing statistical models only take into account the matching vertices between two sites that are defined by a fixed number of points. In reality, the boundary of the binding site is not known or is dependent on the bound ligand making these approaches limited. To address these shortcomings and to perform binding site mapping on a genome-wide scale, we developed a sequence-order independent profile-profile alignment (SOIPPA) algorithm that is able to detect local similarity between unknown binding sites a prior. The SOIPPA scoring integrates geometric, evolutionary and physical information into a unified framework. However, this imposes a significant challenge in assessing the statistical significance of the similarity. Here we report a new statistical model that is similar to that used for global sequence or structure comparison. We find that scores for binding site matching by SOIPPA follow an extreme value distribution (EVD). Benchmark studies show that the EVD model performs at least two-orders faster and is more accurate than the non-parametric statistical method. Efficient statistical analysis makes it possible to apply SOIPPA to genome-based drug discovery. Consequently, we have applied the approach to the structural genome of M.tuberculosis to construct a protein-ligand interaction network. The network reveals highly connected proteins which represent suitable targets for promiscuous drugs.




Author(s):
Xin Gao, University of Waterloo, Canada
Emre Karakoc, University of Waterloo, Canada
Logan Donaldson, York University, Canada
Aleks Gutmanas, University of Toronto, Canada
Cheryl Arrowsmith, University of Toronto, Canada
Ming Li*, Univ of Waterloo, Canada:

Presenting Author: Babak Alipanahi Ramandi, University of Waterloo

Picking peaks from experimental NMR spectra is a key unsolved problem for automated NMR protein structure determination. Such a process is a prerequisite for resonance assignment, NOE distance restraint assignment, and structure calculation tasks. Manual or semi-automatic peak picking, which is currently the prominent way used in NMR labs, is tedious, time-consuming, and costly. We introduce new ideas, including noise level estimation, component forming and sub-division, singular value decomposition (SVD)-based peak picking, and peak pruning and refinement. PICKY is developed as an automated peak picking method. Different from the previous research on peak picking, we provide a systematic study of the proposed method. PICKY is tested on 32 real 2D and 3D spectra of eight target proteins, and achieves an average of 88% recall and 74% precision. More important than these numbers, PICKY actually works in practice. We feed peak lists generated by PICKY to IPASS for resonance assignment, feed IPASS assignment to SPARTA for fragments generation, and feed SPARTA fragments to FALCON for structure calculation. This results in high-resolution structures of several proteins, for example, TM1112, at 1.25A. PICKY is available upon request, and the web server for PICKY is under construction.




Author(s):
Carol A. Rohl, Rosetta Inpharmatics LLC, United States
Kevin Karplus, Biomolecular Engineering Department, University of California, Santa Cruz, United States

Presenting Author: Firas Khatib, University of Washington

Our focus has been on detecting topological properties that are rare in real proteins, but occur more frequently in models generated by protein structure prediction methods such as Rosetta. We previously created the Knotfind algorithm, successfully decreasing the frequency of knotted Rosetta models during CASP 6. We observed an additional class of knot-like loops that appeared to be equally un-protein-like and yet do not contain a mathematical knot. These topological features are commonly referred to as slip-knots and are caused by the same mechanisms that result in knotted models. Slip-knots are undetectable by the original Knotfind algorithm. We have generalized our algorithm to detect them, and analyzed CASP 6 models built using the Rosetta loop modeling method. After analyzing known protein structures in the PDB, we found that slip-knots do occur in certain proteins, but are rare and fall into a small number of specific classes. Our group used this new Pokefind algorithm to distinguish between these rare real slip-knots and the numerous classes of slip-knots that we discovered in Rosetta models and CASP 7 server models. The goal of this work is to improve future models created by protein structure prediction methods. Both algorithms are able to detect un-protein-like features that current metrics such as GDT are unable to identify, so these topological filters can also be used as additional assessment tools.




Author(s):
Ryan Lilien*, University of Toronto, Canada

Presenting Author: Izhar Wallach, University of Toronto

Motivation: The ability to predict binding profiles for an arbitrary protein can significantly improve the areas of drug discovery, lead optimization, and protein function prediction. At present, there are no successful algorithms capable of predicting binding profiles for novel proteins. Existing methods typically rely on manually curated templates or entire active site comparison. Consequently, they perform best when analyzing proteins sharing significant structural similarity with known proteins (i.e., proteins resulting from divergent evolution). These methods fall short when asked to characterize the binding profile of a novel active site or one for which a template is not available. In contrast to previous approaches, our method characterizes the binding preferences of sub-cavities within the active site by exploiting a large set of known protein-ligand complexes. The uniqueness of our approach lies not only in the consideration of sub-cavities, but also in the more complete structural representation of these sub-cavities, their parameterization, and the method by which they are compared. By only requiring local structural similarity, we are able to leverage previously unused structural information and perform binding inference for proteins that do not share significant structural similarity with known systems. Results: Our algorithm demonstrates the ability to accurately cluster similar sub-cavities and to predict binding patterns across a diverse set of protein--ligand complexes. When applied to two high-profile drug targets, our algorithm successfully generates a binding profile that is consistent with known inhibitors. The results suggest that our algorithm should be useful in structure-based drug discovery and lead optimization.




Author(s):

Presenting Author: Silvio Tosatto,

Motivation: Proteins with solenoid repeats evolve more quickly than non-repetitive ones, and their periodicity may be rapidly hidden at sequence level, while still evident in structure. In order to identify these repeats, we propose here a novel method based on a metric characterizing amino acid properties (polarity, secondary structure, molecular volume, codon diversity, electric charge) using five previously derived numerical functions. Results: The five spectra of the candidate sequences coding for structural repeats, obtained by Discrete Fourier Transform (DFT), show common features allowing determination of repeat periodicity with excellent results. Moreover it is possible to introduce a phase space parametrized by two quantities related to the Fourier spectra which allow for a clear distinction between a non-homologous set of globular proteins and proteins with solenoid repeats. The DFT method is shown to be competitive with other state of the art methods in the detection of solenoid structures, while improving its performance especially in the identification of periodicities, since it is able to recognize the actual repeat length in most cases. Moreover it highlights the relevance of local structural propensities in determining solenoid repeats. A web tool implementing the algorithm presented in the paper (REPETITA) is available with additional details on the data sets at the URL: http://protein.bio.unipd.it/repetita/.



Sequence Analysis and Alignment



Author(s):

Presenting Author: HamidReza Chitsaz,

Recent interests, such as RNA interference and antisense RNA regulation, strongly motivate the problem of predicting whether two nucleic acid strands interact. Motivation: Regulatory non-coding RNAs (ncRNAs) such as microRNAs play an important role in gene regulation. Studies on both prokaryotic and eukaryotic cells show that such ncRNAs usually bind to their target mRNA to regulate the translation of corresponding genes. The specifity of these interactions depends on the stability of intermolecular and intramolecular base pairing. While methods like deep sequencing allow to discover an ever increasing set of ncRNAs, there are no high-throughput methods available to detect their associated targets. Hence, there is an increasing need for precise computational target prediction. In order to predict base-pairing probability of any two bases in interacting nucleic acids, it is necessary to compute the interaction partition function over the whole ensemble. The partition function is a scalar value from which various thermodynamic uantities can be derived. For example, the equilibrium concentration of each complex nucleic acid species and also the melting temperature of interacting nucleic acids can be calculated based on the partition function of the complex. Results: We present a model for analyzing the thermodynamics of two interacting nucleic acid strands considering the most general type of interactions studied in the literature. We also present a corresponding dynamic programming algorithm which computes the partition function over (almost) all physically possible joint secondary structures formed by two interacting nucleic acids in O(n^6) time. We verify the predictive power of our algorithm by computing (1) the melting temperature for interacting RNA pairs studied in the literature and (2) the equilibrium concentration for several variants of the OxyS-fhlA complex. In both experiments our algorithm shows high accuracy and outperforms competitors.




Author(s):
Charles Grant, University of Washington, United States
William Noble, University of Washington, United States
Timothy L. Bailey*, University of Queensland, Australia

Presenting Author: John Hawkins, Institute for Molecular Bioscience, University of Queensland

A variety of algorithms have been developed to predict transcription factor binding sites within the genome by exploiting the evolutionary information implicit in multiple alignments of the genomes of related species. One such approach uses an extension of the standard position-specific motif model that incorporates phylogenetic information via a phylogenetic tree and a model of evolution. However, these phylogenetic motif models have never been rigorously benchmarked in order to determine whether they lead to better prediction of transcription factor binding sites than obtained using simple position weight matrix scanning. Results: We evaluate three phylogenetic motif model-based prediction algorithms, each of which uses a different treatment of gapped alignments, and we compare their prediction accuracy with that of a non-phylogenetic motif scanning approach. Surprisingly, all of these algorithms appear to be inferior to simple motif scanning, when accuracy is measured using a gold standard of validated yeast transcription factor binding sites. However, the phylogenetic motif model scanners perform much better than simple motif scanning when we abandon the gold standard and consider the number of statistically significant sites predicted, using column-shuffled ``random motifs to measure significance. These results suggest that the common practice of measuring the accuracy of binding site predictors using collections of known sites may be dangerously misleading since such collections may be missing ``weak sites, which are exactly the type of sites needed to discriminate among predictors. We then extend our previous theoretical model of the statistical power of phylogenetic motif model-based prediction algorithms to allow for loss of binding sites during evolution, and show that it gives a more accurate upper bound on scanner accuracy. Finally, utilizing our theoretical model, we introduce a new method for predicting the number of real binding sites in a genome. The results suggest that the number of true sites for a yeast transcription factor is in general several times greater than the number of known sites listed in the Saccharomyces cerevisiae database (SCPD). Among the three scanning algorithms that we test, the MONKEY algorithm has the highest accuracy for predicting yeast transcription factor binding sites.




Author(s):
Pradipta Ray*, Carnegie Mellon University, United States
Eric Xing*, Carnegie Mellon University, United States

Presenting Author: Pradipta Ray, Carnegie Mellon University

Identifying transcription factor binding sites (TFBS) encoding com- plex regulatory signals in metazoan genomes remains a challenging problem in computational genomics. Due to degeneracy of nucleo- tide content among binding site instances or motifs, and intricate “grammatical organization” of motifs within cis-regulatory modules, extant pattern-matching based in-silico motif search methods often suffer from impractically high false positive rates, especially in the con- text of analyzing large genomic datasets, and noisy position weight matrices which characterize binding sites. Here, we try to address this problem by using a framework to maximally utilize the informa- tion content of the genomic DNA in the region of query, taking cues from values of various biologically meaningful genetic and epigene- tic factors in the query region such as clade-specific evolutionary parameters, presence / absence of nearby coding regions, etc We present a new method for TFBS prediction in metazoan genomes which utilizes both the cis-regulatory module (CRM) architecture of sequences and a variety of features of individual motifs. Our propo- sed approach is based on a discriminative probabilistic model known as conditional random fields that explicitly optimizes the predictive probability of motif presence in large sequences, based on the joint effect of all such features. This model overcomes weaknesses in ear- lier methods based on less effective statistical formalisms that are sensitive to spurious signals in the data. We evaluate our method on both simulated CRMs and real Drosophila sequences in compa- rison with a wide spectrum of existing models, and outperform the state of the art by 22% in F1-score. The code is publicly available at http://www.sailing.cs.cmu.edu/discover.html.




Author(s):
Sven Rahmann*, TU Dortmund, Germany

Presenting Author: Tobias Marschall, TU Dortmund

Motivation: The motif discovery problem consists of finding over-represented patterns in a collection of biosequences. It is one of the classical sequence analysis problems, but still has not been satisfactorily solved in an exact and efficient manner. Thisis partly due to the large number of possibilities of defining the motif space to be searched and the notion of over-representation. Even for well-defined formalizations, the problem is frequently solved in an ad-hoc manner with heuristics that do not guarantee to find the best motif. Results: We show how to solve the motif discovery problem (almost) exactly on a practially relevant space of IUPAC generalized string patterns while using the p-value with respect to an i.i.d. model or a Markov model as the measure of over-representation. In particular, (1) we use a highly accurate compound Poisson approximation for the null distribution of the number of motif occurrences. We show how to compute the exact clump size distribution using a recently introduced device called probabilistic arithmetic automaton (PAA). (2) We define two p-value scores for over-representation, the first one based on the total number of motif occurrences, the second one based on the number of sequences in a collection with at least one occurrence. (3) We describe an algorithm to discover the optimal pattern with respect to either of the scores. The method exploits monotonicity properties of the compound Poisson approximation and is by orders of magnitude faster than exhaustive enumeration of IUPAC strings (11.8 hours compared to an extrapolated runtime of 4.8 years). (4) We justify the use of the proposed scores for motif discovery by showing our method to outperform other motif discovery algorithms (e.g. MEME, Weeder) on benchmark datasets. We also propose new motifs on M. tuberculosis.




Author(s):
Eran Segal*, Weizmann Institute of Science, Israel

Presenting Author: Shai Lubliner, Weizmann Institute of Science

Motivation: Understanding the mechanisms that govern nucleosome positioning over genomes in-vivo is essential for unraveling the role of chromatin organization in transcriptional regulation. Until now, models for predicting genome-wide nucleosome occupancy have assumed that the DNA associations of neighboring nucleosomes on the genome are independent. We present a new model that relaxes this independence assumption by modeling interactions between adjacent nucleosomes. Results: We show that modeling interactions between adjacent nucleosomes improves genome-wide nucleosome occupancy predictions in an in vitro system that includes only nucleosomes and purified DNA, where the resulting model has a preference for short spacings (linkers) of less than 20bp in length between neighboring nucleosomes. Since nucleosome occupancy in vitro depends only on properties intrinsic to nucleosomes, these results suggest that the interactions we find are intrinsic to nucleosomes and do not depend on other factors, such as transcription factors and chromatin remodelers. We also show that modeling these intrinsic interactions significantly improves genome-wide predictions of nucleosome occupancy in-vivo.




Author(s):
Linda Payet, University of Cambridge, UK, United Kingdom
Jean-Louis Mergny, INSERM & Mus?©um National d Histoire Naturelle, Paris, France, France
Julian Huppert*, University of Cambridge, United Kingdom

Presenting Author: Oliver Stegle, University of Cambridge

section{Motivation:} G-quadruplexes are stable four-stranded guanine-rich structures which can form from DNA and RNA. They are an important component of human telomeres and play a role in the regulation of transcription and translation. The biological significance of a G-quadruplex is crucially linked with their thermodynamic stability. Hence the prediction of G-quadruplex stability is of vital interest. section{Results:} In this paper we present a novel Bayesian prediction framework based on Gaussian process regression to determine the thermodynamic stability of previously unmeasured G-quadruplex forming sequences from the sequence information alone. We benchmark our approach on a large G-quadruplex dataset and compare our method to standard approaches. Furthermore we propose an active learning procedure which can be used to iteratively acquire data in an optimal fashion. Lastly, we demonstrate the utility of our procedure on a genome-wide study of quadruplexes in the human genome. section{Availability:} The Gaussian process prediction software will be made available on request. section{Contact:} os252@cam.ac.uk, jlh29@cam.ac.uk




Author(s):
Kengo Sato, Japan Biological Informatics Consortium, Japan
Hisanori Kiryu, National Institute of Advanced Industrial Science and Technology (AIST), Japan
Toutai Mituyama, Computational Biology Research Center, National Institute of Advanced Industrial Science and Technology, Japan
Kiyoshi Asai, University of Tokyo, Japan

Presenting Author: Michiaki Hamada, Mizuho Information & Research Institute, Inc.

Motivation: Secondary structure prediction of RNA sequences is an important problem. There have been progresses in this area, but the accuracy of prediction from an RNA sequence is still limited. In many cases, however, homologous RNA sequences are available with the target RNA sequence whose secondary structure is to be predicted. Results: In this paper, we propose a new method for secondary structure predictions of individual RNA sequences by taking the information of their homologous sequences into account without assuming the common secondary structure of the entire sequences. The proposed method is based on posterior decoding techniques, which consider all the suboptimal secondary structures of the target and homologous sequences and all the suboptimal alignments between the target sequence and each of the homologous sequences. In our computational experiments, the proposed method provides better predictions than those performed only on the basis of the formation of individual RNA sequences and those performed by using methods for predicting the common secondary structure of the homologous sequences. Remarkably, we found that the common secondary predictions sometimes give worse predictions for the secondary structure of a target sequence than the predictions from the individual target sequence, while the proposed method always gives good predictions for the secondary structure of target sequences in all tested cases.




Author(s):
Yves Van de Peer, Ghent University, Belgium
Yvan Saeys*, Department of Plant Systems Biology, Flanders Institute for Biotechnology (VIB), Ghent University, Belgium

Presenting Author: Thomas Abeel, Department of Plant Systems Biology, Flanders Institute for Biotechnology (VIB), Ghent University

Motivation: Promoter prediction is an important task in genome annotation projects, and during the past years many new promoter prediction programs have emerged. However, many of these programs are compared inadequately to other programs. In most cases, only a small portion of the genome is used to evaluate the program, which is not a realistic setting for whole genome annotation projects. In addition, a common evaluation design to properly compare promoter prediction programs is still lacking. Results: We present a large-scale benchmarking study of 17 state-of-the-art promoter prediction programs. A multi-faceted evaluation strategy is proposed that can be used as a gold standard for promoter prediction evaluation, allowing authors of promoter prediction software to compare their method to existing methods in a proper way. This evaluation strategy is subsequently used to compare the chosen promoter predictors, and an in-depth analysis on predictive performance, promoter class specificity, overlap between predictors and positional bias of the predictions is conducted. Availability: We provide the implementations of the four protocols, as well as the data sets required to perform the benchmarks to the academic community free of charge on request.