16th Annual
International Conference
Intelligent Systems
for Molecular Biology

Metro Toronto Convention Centre (South Building)
Toronto, Canada



ISMB 2008 Accepted Papers

Bioinformatics of Disease
Comparative Genomics
Databases and Ontologies
Evolution and Phylogeny
Gene Regulation and Transcriptomics
Protein Interactions and Molecular Networks
Protein Structure and Function
Sequence Analysis and Alignment
Text Mining


Differential variability analysis of gene expression and its application to human diseases
Joshua Ho, School of Information Technologies, The University of Sydney
Maurizio Stefani, Muscle Research Unit, Bosch Institute, The University of Sydney
Cristobal dos Remedios, Muscle Research Unit, Bosch Institute, The University of Sydney
Michael Charleston, School of Information Technologies, The University of Sydney

Motivation: Current microarray analyses focus on identifying sets of genes that are differentially expressed (DE) or differentially coexpressed (DC) in different biological states (e.g., diseased vs. non-diseased). We observed that in many human diseases, some genes show a significant increase or decrease in expression variability (variance). As these observed changes in expression variability may be caused by alteration of the underlying expression dynamics, such differential variability (DV) patterns are also biologically interesting. Results: Here we propose a novel analysis of changes in gene expression variability between groups of samples, which we call differential variability analysis. We introduce the concept of differential variability (DV) and present a simple procedure for identifying DV genes from microarray data. Our procedure is evaluated with simulated and real microarray datasets. The effect of data preprocessing methods on the identification of DV genes is investigated. The biological significance of DV analysis is demonstrated with four human disease datasets. The relationships among DV, DE and DC genes are investigated. The results suggest that changes in expression variability are associated with changes in coexpression pattern, which implies that DV is not merely stochastic noise but an informative signal.

Keywords: differential variability; human disease; microarray analysis
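The abstract describes the DV idea but gives no formula; as a minimal, hypothetical illustration (not the authors' actual procedure), a per-gene variance-ratio score between two sample groups could be computed as:

```python
def variance(xs):
    # unbiased sample variance
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def dv_score(expr_a, expr_b):
    # ratio of expression variances between two conditions;
    # values far from 1 flag a candidate DV gene
    return variance(expr_a) / variance(expr_b)
```

In practice one would pair such a score with a proper statistical test and multiple-testing correction, which is the kind of evaluation the paper reports.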


Selecting anti-HIV therapies based on a variety of genomic and clinical factors
Michal Rosen-Zvi, IBM
Andre Altmann, Max Planck Institute for Informatics, Saarbrücken, Germany
Mattia Prosperi, University of Rome TRE, Italy
Ehud Aharoni, Machine-Learning group HRL IBM
Hani Neuvirth, Machine-Learning group HRL IBM
Anders Sönnerborg, Division of Infectious Diseases, Department of Medicine, Karolinska Institute
Eugen Schülter, University of Cologne, Cologne, Germany
Daniel Struck, Retrovirology Laboratory, CRP-Santé, Luxembourg
Yardena Peres, HC&LS HRL IBM
Francesca Incardona, Informa srl, Rome, Italy
Rolf Kaiser, University of Cologne, Cologne, Germany
Maurizio Zazzi, Department of Molecular Biology, University of Siena, Italy
Thomas Lengauer, MPI for Informatics

Motivation: Optimizing HIV therapies is crucial since the virus rapidly develops mutations to evade drug pressure. Recent studies have shown that genotypic information might not be sufficient for the design of therapies and that other clinical and demographic factors may play a role in therapy failure. This study is designed to assess the improvement in prediction achieved when such information is taken into account. We use these factors to generate a prediction engine using a variety of machine-learning methods and to determine which clinical conditions are most misleading in terms of predicting the outcome of a therapy. Results: Three different machine-learning techniques were used: a generative-discriminative method, regression with derived evolutionary features, and regression with a mixture of effects. All three methods had similar performance, with an area under the ROC curve (AUC) of 0.77. A set of three similar engines limited to genotypic information only achieved an AUC of 0.75. A straightforward combination of the three engines consistently improves the prediction, with significantly better prediction when the full set of features is employed. The combined engine improves on predictions obtained from an on-line state-of-the-art resistance interpretation system. Moreover, the engines tend to disagree more on the outcome of failed therapies than on successful ones. Careful analysis of the differences between the engines revealed the mutations and drugs most closely associated with uncertainty of the therapy outcome. Availability: The combined prediction engine will be available at the time of publication.

Keywords: HIV resistance; antiretroviral therapy; decision support system; machine learning


Reversible Jump MCMC Approach for Peak Identification for Stroke SELDI Mass Spectrometry Using Mixture Model
Yuan Wang, The Methodist Hospital, Weill Cornell Medical College
Xiaobo Zhou, The Methodist Hospital, Weill Cornell Medical College
Honghui Wang, National Institutes of Health
King Li, The Methodist Hospital, Weill Cornell Medical College
Lixiu Yao, Shanghai Jiao Tong University
Stephen T.C. Wong, The Methodist Hospital, Weill Cornell Medical College

Stroke is the leading cause of disability and the third leading cause of death in the United States. Mass spectrometry has shown great potential in detecting disease-related biomarkers for early diagnosis of stroke. To discover potential biomarkers from large volumes of noisy mass spectrometry data, peak detection must be performed first. This paper proposes a novel automatic peak detection method for stroke mass spectrometry data. In this method, a mixture model is proposed to model the spectrum. A Bayesian approach is used to estimate the parameters of the mixture model, and a Markov chain Monte Carlo method is employed to perform Bayesian inference. By introducing a reversible jump method, we can automatically estimate the number of peaks in the model. Instead of separating peak detection into sub-steps, the proposed method performs baseline correction, denoising and peak identification simultaneously, thereby minimizing the risk of introducing irrecoverable bias and errors from each sub-step. In addition, this peak detection method does not require a manually selected denoising threshold. Experimental results on both a simulated dataset and a stroke mass spectrometry dataset show that the proposed method not only detects peaks with small signal-to-noise ratios, but also greatly decreases the false detection rate while maintaining the same sensitivity.

Keywords: mass spectrometry; mixture model; peak detection; reversible jump Markov chain Monte Carlo


MACHOS: Markov Clusters of Homologous Subsequences
Simon Wong, The University of Queensland
Mark A. Ragan, The University of Queensland

Motivation: The classification of proteins into homologous groups (families) allows their structure and function to be analysed and compared in an evolutionary context. The modular nature of eukaryotic proteins presents a considerable challenge to the delineation of families, as different local regions within a single protein may share common ancestry with distinct, even mutually exclusive, sets of homologs, thereby creating an intricate web of homologous relationships if full-length sequences are taken as the unit of comparison. We attempt to disentangle this web by developing a fully automated pipeline to delineate protein subsequences that represent sensible units for homology inference, and clustering them into putatively homologous families using the Markov clustering algorithm. Using six eukaryotic proteomes as input, we clustered 162,349 protein sequences into 19,697 to 77,415 subsequence families, depending on the granularity of clustering. We validated these Markov clusters of homologous subsequences (MACHOS) against the manually curated Pfam domain families, using a quality measure to assess overlap. Our subsequence families correspond well to known domain families and achieve higher quality scores than do groups generated by fully automated domain family classification methods. We illustrate our approach by analysis of a group of proteins that contain the glutamyl/glutaminyl-tRNA synthetase domain, and conclude that our method can produce high-coverage decomposition of protein sequence space into precise homologous families in a way that takes the modularity of eukaryotic proteins into account. This approach allows for a fine-scale examination of evolutionary histories of proteins encoded in eukaryotic genomes.

Keywords: Markov clustering algorithm; glutamyl/glutaminyl-tRNA synthetase domain; homologous subsequences; protein domain families; protein domains


Classification and Feature Selection Algorithms for Multi-class CGH data
Jun Liu, PhD student
Sanjay Ranka, Professor
Tamer Kahveci, Professor

Recurrent chromosomal alterations provide cytological and molecular positions for the diagnosis and prognosis of cancer. Comparative Genomic Hybridization (CGH) has been useful in understanding these alterations in cancerous cells. CGH datasets consist of samples that are represented by large-dimensional arrays of intervals. Each sample consists of long runs of intervals with losses and gains. In this paper, we develop novel SVM-based methods for classification and feature selection of CGH data. For classification, we developed a novel similarity kernel that is shown to be more effective than the standard linear kernel used in SVM. For feature selection, we propose a novel method based on the new kernel that iteratively selects features that provide the maximum benefit for classification. We compared our methods against the best wrapper-based and filter-based approaches that have been used for feature selection of large-dimensional biological data. Our results on datasets generated from the Progenetix database suggest that our methods are considerably superior to existing methods.

Keywords: CGH; classification; feature selection
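The paper's similarity kernel is not reproduced in the abstract; the following is a hypothetical cosine-style similarity over per-interval gain/loss calls, included only to make the CGH data representation concrete:

```python
def cgh_kernel(x, y):
    # x, y: per-interval aberration calls: +1 gain, -1 loss, 0 no change
    agree = sum(1 for a, b in zip(x, y) if a == b and a != 0)
    # normalize by the geometric mean of the number of aberrant intervals
    norm = (sum(a != 0 for a in x) * sum(b != 0 for b in y)) ** 0.5
    return agree / norm if norm else 0.0
```

Unlike a plain linear kernel, this kind of measure rewards shared gains/losses while ignoring the (dominant) no-change intervals.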


Assessment and Improvement of Plasmodium yoelii yoelii Genome Annotation through Comparative Analysis
Ashley Vaughan, Seattle Biomedical Research Institute
Sum-Ying Chiu, Department of Biology, University of Washington
Gowthaman Ramasamy, Seattle Biomedical Research Institute
Ling Li, Seattle Biomedical Research Institute
Malcolm Gardner, Seattle Biomedical Research Institute
Alice Tarun, Seattle Biomedical Research Institute
Stefan Kappe, Seattle Biomedical Research Institute & Department of Pathobiology, University of Washington
Xinxia Peng, Seattle Biomedical Research Institute

Motivation: The sequencing of the genome of Plasmodium yoelii, the model malaria parasite, has greatly facilitated research into the development of new drugs and vaccines against malaria. Unfortunately, only preliminary gene models were annotated on the partially sequenced genome, mostly by in silico gene prediction, and there has been no major improvement since. Results: Here we report a systematic assessment of the accuracy of the genome annotation based on a detailed analysis of a comprehensive set of cDNA sequences and proteomics data. We found that the coverage of the current annotation tends to be biased toward genes expressed in the blood stages of the parasite life cycle. Based on the proteomic analysis, we estimated that about 15% of the liver-stage proteome is absent from the current annotation. Through comparative analysis we identified ~500 P. yoelii genes which have clear orthologs in the genome of P. falciparum, the human malaria parasite, but are absent from or incorrectly annotated in the current databases. These genes were manually curated. This study suggests that improvement of the P. yoelii genome annotation needs to focus on genes expressed in stages other than the blood stages. Comparative analysis will be critically helpful for the re-annotation. The addition of newly annotated genes will greatly facilitate the use of P. yoelii as a model system for studying human malaria. Contact: Supplementary information: Supplementary data are available at Bioinformatics online.

Keywords: comparative analysis; genome annotation; malaria; proteomics


Guided genome halving: hardness, heuristics and the history of the Hemiascomycetes
Chunfang Zheng, University of Ottawa
Qian Zhu, University of Ottawa
Zaky Adam, University of Ottawa
David Sankoff, Department of Mathematics and Statistics, University of Ottawa

Motivation: Some present-day species have incurred a whole genome doubling event in their evolutionary history, and this is reflected today in patterns of duplicated segments scattered throughout their chromosomes. These duplications may be used as data to "halve" the genome, i.e., to reconstruct the ancestral genome at the moment of doubling, but the solution is often highly non-unique. To resolve this problem, we take account of outgroups, external reference genomes, to guide and narrow down the search. Results: We improve on a previous, computationally costly, "brute force" method by adapting the genome halving algorithm of El-Mabrouk and Sankoff so that it rapidly and accurately constructs an ancestor close to the outgroups, prior to a local optimization heuristic. We apply this to reconstruct the pre-doubling ancestor of S. cerevisiae and C. glabrata, guided by the genomes of three other yeasts that diverged before the genome doubling event. We analyze the results in terms of 1) the minimum evolution criterion, 2) how close the genome halving result is to the final (local) minimum, and 3) how close the final result is to an ancestor manually constructed by an expert with access to additional information. We also visualize the set of reconstructed ancestors using classic multidimensional scaling to see what aspects of the doubling descendants and outgroups influence the similarities and differences among the reconstructions.

Keywords: algorithms; genome halving; rearrangements; whole genome duplication; yeast


A High Accurate Model of the Individual Haplotyping Problem Based on Weighted SNP Fragments and Genotype with Errors
Minzhu Xie, Central South University
Jianxin Wang, Central South University
Jianer Chen, Central South University

In genetic studies of complex diseases, haplotypes provide more information than genotypes, but with biological techniques, haplotyping is much more difficult than genotyping. Effective computational techniques are therefore in demand. The individual haplotyping problem is the computational problem of inducing a pair of haplotypes from an individual's aligned SNP fragments. Many models for the problem have been proposed, based on various optimality criteria and incorporating different kinds of extra information; however, a model with higher accuracy in haplotype reconstruction is desirable. The current paper proposes a highly accurate model of the individual haplotyping problem based on weighted fragments and genotypes with errors. The model is proved to be NP-hard even with gapless fragments. To solve this model, based on the characteristics of SNP fragments, the paper introduces a parameterized algorithm of time complexity O(nk_2·2^{k_2} + m log m + mk_1), where m is the number of fragments, n is the number of SNP sites, k_1 is the maximum number of SNP sites that a fragment covers (no more than n and usually smaller than 10), and k_2 is the maximum number of fragments covering an SNP site (usually no more than 19). Extensive experiments show that this model is more accurate in haplotype reconstruction than other models. The program implementing the parameterized algorithm can be obtained by sending an email to the corresponding author.

Keywords: Genotype with Errors; Individual Haplotyping Problem; Weighted SNP Fragments; parameterized algorithm
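The parameterized algorithm itself is involved; as a toy illustration of the underlying objective — minimum error correction over SNP fragments — here is a hypothetical brute-force solver for the all-heterozygous case (exponential in the number of sites, quite unlike the paper's parameterized algorithm):

```python
from itertools import product

def mec_score(h, fragments):
    # h: candidate haplotype over '0'/'1'; its complement is the second haplotype
    comp = ''.join('1' if c == '0' else '0' for c in h)
    total = 0
    for f in fragments:  # '-' marks an SNP site the fragment does not cover
        dist = lambda hap: sum(a != '-' and a != b for a, b in zip(f, hap))
        total += min(dist(h), dist(comp))  # assign fragment to the closer haplotype
    return total

def best_haplotype(fragments):
    # exhaustive search over all 2^n haplotypes (toy sizes only)
    n = len(fragments[0])
    return min((''.join(bits) for bits in product('01', repeat=n)),
               key=lambda h: mec_score(h, fragments))
```

The paper's contribution is, in essence, solving this kind of objective in time exponential only in the fragment coverage k_2, not in the number of sites n.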


The EXACT description of biomedical protocols
Larisa Soldatova, Aberystwyth University
Wayne Aubrey, Aberystwyth University
Ross King, Aberystwyth University
Amanda Clare, Aberystwyth University

Motivation: Many published manuscripts contain experimental protocols which are poorly described or deficient in information. This means that the published results are very hard or impossible to repeat. This problem is being made worse by the increasing complexity of high throughput/automated methods. There is therefore a growing need to represent experimental protocols in an efficient and unambiguous way. Results: We have developed the EXACT ontology as the basis of a method of representing biological laboratory protocols. We provide example protocols that have been formalised using EXACT, and demonstrate the advantages and opportunities created by using this formalisation. We argue that the use of EXACT will result in the publication of protocols with increased clarity and usefulness to the scientific community. Availability: The ontology, examples and code can be downloaded from Contact: Larisa Soldatova

Keywords: biomedical experiment protocols formalisation ontology


Towards the use of Argumentation in Bioinformatics: A Gene Expression Case Study
Kenneth McLeod, Heriot-Watt University
Albert Burger, MRC, Human Genetics Unit

Motivation: Due to different experimental setups and various interpretations of results, the data contained in online bioinformatics resources can be inconsistent, making it more difficult for users of these resources to assess the suitability and correctness of the answers to their queries. This work investigates the role of argumentation systems in helping users evaluate such answers. More specifically, it looks closely at a gene expression case study, creating an appropriate representation of the underlying data and a series of rules that are used by a third-party argumentation engine to reason over the query results provided by the mouse gene expression database EMAGE. Results: A prototype using the ASPIC argumentation engine has been implemented and a preliminary evaluation has been carried out. This evaluation suggested that argumentation can be used to deal with inconsistent data in biological resources. Availability: The ASPIC argumentation engine is available from EMAGE gene expression data can be obtained from The argumentation rules for the gene expression example are available from the lead author upon request.

Keywords: data integration; gene expression; non-monotonic reasoning


The Ontology of Biological Taxa
Stefan Schulz, Freiburg University Hospital
Holger Stenzhorn, University Medical Center Freiburg
Martin Boeker, University Freiburg Medical Center

Motivation: The classification of biological entities in terms of species and taxa is an important endeavor in biology. Although many of the statements encoded in current biomedical ontologies are taxon-dependent, there is no obvious or standard way of introducing taxon information into an integrative ontology architecture, because of ongoing controversies about the ontological nature of species and taxa. Results: In this article, we discuss different approaches to representing biological taxa using existing standards for biomedical ontologies such as the description logic OWL DL and the OBO Relation Ontology. We demonstrate how hidden ambiguities of the species concept can be dealt with and existing controversies can be overcome. A novel approach is to envisage taxon information as qualities that inhere in biological organisms, organism parts, and populations. Availability: The presented methodology has been implemented in the domain top-level ontology BioTop, accessible at BioTop may help to improve the logical and ontological rigor of biomedical ontologies and further provides a clear architectural principle to deal with biological taxa information.

Keywords: Biomedical Ontologies; Species; Taxa


GenoQuery: a new querying module for functional annotation in a genomic warehouse

Motivation: We have to cope both with a deluge of new genome sequences and with the huge amounts of data produced by the high-throughput approaches used to exploit these genomic features. Crossing and comparing such heterogeneous and disparate data will help improve the functional annotation of genomes. This requires designing elaborate integration systems, such as warehouses, for storing and querying these data. Results: We have designed a relational genomic warehouse with an original multi-layer architecture made of a database layer and an entities layer. We describe a new querying module, GenoQuery, which is based on this architecture. We use the entity layer to define mixed queries. These mixed queries allow searching for instances of biological entities and their properties in the different databases, without specifying in which database they have to be found. Accordingly, we further introduce the central notion of alternative queries. Such queries have the same meaning as the original mixed queries, while exploiting the complementarities yielded by the various integrated databases of the warehouse. We explain how GenoQuery computes all the alternative queries of a given mixed query. We illustrate the usefulness of this querying module by means of a thorough example. Availability: ~lemoine/GenoQuery/

Keywords: databases; integration systems


Estimating True Evolutionary Distances under the DCJ Model
Yu Lin, Swiss Federal Institute of Technology (EPFL)
Bernard Moret, Ecole Polytechnique Federale de Lausanne

Modern techniques can yield the ordering and strandedness of genes on each chromosome of a genome; such data already exist for hundreds of organisms. The evolutionary mechanisms through which the set of genes of an organism is altered and reordered are of great interest to phylogeneticists, evolutionary biologists, comparative genomicists, and biomedical researchers. Perhaps the most basic concept in this area is that of the evolutionary distance between two genomes: under a given model of genomic evolution, how many events most likely took place to account for the difference between the two genomes? We present a method to estimate the true evolutionary distance between two genomes under the "double-cut-and-join" (DCJ) model of genome rearrangement, a model under which a single multichromosomal operation accounts for all genomic rearrangement events: inversion, transposition, translocation, block interchange, and chromosomal fusion and fission. Our method relies on a simple structural characterization of a genome pair and is both analytically and computationally tractable. We provide analytical results describing the asymptotic behavior of genomes under the DCJ model, as well as experimental results on a wide variety of genome structures to exemplify the very high accuracy (and low variance) of our estimator.

Keywords: evolutionary distance; genome rearrangement; rearrangement model
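For context, the minimum DCJ distance itself has a closed form derived from the adjacency graph: for single circular chromosomes it is the number of genes minus the number of alternating cycles. A sketch under that simplifying assumption (unichromosomal circular genomes, identical gene content; the paper's estimator of the *true* distance builds on this kind of structural characterization, not on the minimum distance alone):

```python
def dcj_distance(g1, g2):
    # g1, g2: signed gene orders of one circular chromosome each, e.g. [1, -2, 3]
    def partners(order):
        m, n = {}, len(order)
        for i in range(n):
            a, b = order[i], order[(i + 1) % n]
            x = (abs(a), 'h' if a > 0 else 't')  # extremity ending gene a
            y = (abs(b), 't' if b > 0 else 'h')  # extremity starting gene b
            m[x], m[y] = y, x                    # record the adjacency both ways
        return m
    pa, pb = partners(g1), partners(g2)
    seen, cycles = set(), 0
    for start in pa:
        if start in seen:
            continue
        cycles += 1
        cur, use_a = start, True  # walk the cycle, alternating A/B adjacencies
        while cur not in seen:
            seen.add(cur)
            cur = (pa if use_a else pb)[cur]
            use_a = not use_a
    return len(g1) - cycles  # d_DCJ = n - c for circular genomes
```

A single inversion creates exactly one extra breakpoint-graph cycle, so it costs one DCJ operation.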


Efficient Inference of Bacterial Strain Trees From Genome-scale Multi-locus Data
Cuong Than, Rice University
Ryuichi Sugino, Graduate University for Advanced Studies
Hideki Innan, Graduate University for Advanced Studies
Luay Nakhleh, Rice University

Motivation: In bacterial evolution, inferring a strain tree, which is the evolutionary history of different strains of the same bacterium, plays a major role in analyzing and understanding the evolution of strongly isolated populations, population divergence, and various evolutionary events, such as horizontal gene transfer and recombination. Inferring a strain tree from multi-locus data of these strains is exceptionally hard since, at this scale of evolution, stochastic effects of the coalescent result in a very high degree of gene tree incongruence. Results: In this paper we present a novel computational method for inferring the strain tree from multi-locus data under the coalescent model. Our method operates in three phases: in phase I a set of candidate strain tree topologies is computed using the maximal cliques concept, in phase II divergence times for each of the topologies are estimated using mixed integer linear programming (MILP), and in phase III the optimal tree (or trees) is selected based on an optimality criterion. We have analyzed 1898 genes from nine strains of the bacterium Staphylococcus aureus, and identified a fully resolved (binary) strain tree with estimated divergence times, despite the high degrees of sequence identity at the nucleotide level and gene tree incongruence. Our method's efficiency makes it particularly suitable for analysis of genome-scale data sets, including those of strongly isolated populations which are usually very challenging to analyze. Availability: We have implemented the algorithms in the PhyloNet software package, which is available publicly at

Keywords: bacterial evolution; coalescent; gene tree incongruence; phylogeny; strain tree


The Multiple Gene Duplication Problem Revisited
Mukul S. Bansal, Department of Computer Science, Iowa State University, Ames, IA, USA
Oliver Eulenstein, Department of Computer Science, Iowa State University, Ames, IA, USA

Motivation: Deciphering the location of gene duplications and multiple gene duplication episodes on the Tree of Life is fundamental to understanding the way gene families and genomes evolve. The multiple gene duplication problem provides a framework for placing gene duplication events onto nodes of a given species tree, and detecting episodes of multiple gene duplication. One version of the multiple gene duplication problem was defined by Guigo et al. in 1996. Several heuristic solutions have since been proposed for this problem, but no exact algorithms were known. Results: In this paper we solve this longstanding open problem by providing the first exact and efficient solution. We also demonstrate the improvement offered by our algorithm over the best heuristic approaches, by applying it to several simulated as well as empirical datasets. Contact:

Keywords: Algorithms; Multiple gene duplication; Phylogenetics


Nucleosome Positioning from Tiling Microarray Data
Moran Yassour, The Hebrew University of Jerusalem
Tommy Kaplan, The Hebrew University of Jerusalem
Ariel Jaimovich, The Hebrew University of Jerusalem
Nir Friedman, The Hebrew University of Jerusalem

The packaging of DNA around nucleosomes in eukaryotic cells plays a crucial role in the regulation of gene expression and other DNA-related processes. To better understand the regulatory role of nucleosomes, it is therefore important to pinpoint their positions at high (5-10 bp) resolution. Toward this end, several recent works used dense tiling arrays to map nucleosomes in a high-throughput manner. These data were then parsed and hand-curated, and the positions of nucleosomes were assessed. In this manuscript, we present a fully automated algorithm to analyze such data and predict the exact locations of nucleosomes. We introduce a method, based on a probabilistic graphical model, to increase the resolution of our predictions even beyond that of the microarray used. We show how to build such a model and how it can be compiled into a simple HMM, allowing for fast and accurate inference of nucleosome positions. We applied our model to nucleosomal data from mid-log yeast cells reported by Yuan et al., and compared our predictions to those of the original paper; to a more recent method that uses five times denser tiling arrays (Lee et al.); and to a curated set of literature-based nucleosome positions. Our results suggest that by applying our algorithm to the same data used by Yuan et al., our fully automated model traced 13% more nucleosomes and increased the overall accuracy by about 20%. We believe that such an improvement opens the way for a better understanding of the regulatory mechanisms controlling gene expression, and of how they are encoded in the DNA.

Keywords: Chromatin; Hidden Markov models; Nucleosomes; Tiling arrays; graphical models


Alignment and Classification of Time Series Gene Expression in Clinical Studies
Tien-ho Lin, Carnegie Mellon University
Naftali Kaminski, University of Pittsburgh Medical School
Ziv Bar-Joseph, Carnegie Mellon University

Motivation: Classification of tissues using static gene expression data has received considerable attention. Recently, a growing number of expression datasets have been measured as time series. Methods that are specifically designed for this temporal data can both utilize its unique features (temporal evolution of profiles) and address its unique challenges (different response rates of patients in the same class). Results: We present a method that utilizes hidden Markov models (HMMs) for the classification task. We use HMMs with fewer states than time points, leading to an alignment of the different patient response rates. To focus on the differences between the two classes, we develop a discriminative HMM classifier. Unlike the traditional generative HMM, a discriminative HMM can use examples from both classes when learning the model for a specific class. We have tested our method on both simulated and real time series expression data. As we show, our method improves upon prior methods and can suggest markers for specific disease and response stages that are not found when using traditional classifiers. Availability: A Matlab implementation is available from Contact:

Keywords: clinical studies; hidden Markov models; response rate; time series gene expression
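As a generic illustration of the alignment idea — an HMM with fewer states than time points, so several consecutive measurements can map onto one response stage — here is a plain Viterbi decoder (hypothetical code; it shows standard decoding, not the authors' discriminative training):

```python
import math

def viterbi(obs, n_states, log_init, log_trans, log_emit):
    # obs: observation indices per time point; log_*: nested lists of log-probs
    T = len(obs)
    dp = [[-math.inf] * n_states for _ in range(T)]
    back = [[0] * n_states for _ in range(T)]
    for s in range(n_states):
        dp[0][s] = log_init[s] + log_emit[s][obs[0]]
    for t in range(1, T):
        for s in range(n_states):
            p = max(range(n_states), key=lambda q: dp[t - 1][q] + log_trans[q][s])
            dp[t][s] = dp[t - 1][p] + log_trans[p][s] + log_emit[s][obs[t]]
            back[t][s] = p
    # backtrack the most likely state (stage) per time point
    path = [max(range(n_states), key=lambda s: dp[T - 1][s])]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]
```

With a left-to-right transition matrix and self-loops, two patients with different response rates can linger in a stage for different numbers of time points yet still be aligned to the same stage sequence.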


Inferring Differentiation Pathways from Gene Expression
Ivan Gesteira Costa, Max Planck Institute for Molecular Genetics
Stefan Roepcke, Max Planck Institute for Molecular Genetics
Christoph Hafemeister, Max Planck Institute for Molecular Genetics
Alexander Schliep, Max Planck Institute for Molecular Genetics

Motivation: The regulation of proliferation and differentiation of embryonic and adult stem cells into mature cells is central to developmental biology. Gene expression measured in distinguishable developmental stages helps to elucidate underlying molecular processes. In previous work we showed that functional gene modules, which act distinctly in the course of development, can be represented by a mixture of trees. In general, the similarities in the gene expression programs of cell populations reflect the similarities in the differentiation path. Results: We propose a novel model for gene expression profiles and an unsupervised learning method to estimate developmental similarity and infer differentiation pathways. We assess the performance of our model on simulated data and compare it, with favorable results, to related methods. We also infer differentiation pathways and predict functional modules in gene expression data of lymphoid development. Conclusions: We demonstrate for the first time how, in principle, the incorporation of structural knowledge about the dependence structure helps to reveal differentiation pathways and potentially relevant functional gene modules from microarray datasets. Our method applies in any area of developmental biology where it is possible to obtain cells of distinguishable differentiation stages.

Keywords: cell differentiation; dependence trees; development; gene expression; mixture models


Predicting functional transcription factor binding through alignment-free, affinity-based analysis of orthologous promoter sequences
Lucas Ward, Columbia University
Harmen Bussemaker, Columbia University

Motivation: The identification of regulatory protein binding sites and the transcriptional circuitry that they define is currently an area of intense research. Data from whole-genome chromatin immunoprecipitation (ChIP-chip), whole-genome expression microarrays, and sequencing of multiple closely related genomes have all proven useful. By and large, existing methods treat the interpretation of functional data as a classification problem (between bound and unbound DNA), and the analysis of comparative data as a problem of local alignment (to recover phylogenetic footprints of presumably functional elements). Both of these approaches suffer from the inability to model and detect low-affinity binding sites, which have recently been shown to be abundant and functional. Results: We have developed a method that discovers functional regulatory targets of Saccharomyces cerevisiae transcription factors by predicting the total affinity of each promoter for those factors and then comparing that affinity across orthologous promoters in three closely related yeast species. At each promoter, we consider the minimum affinity among orthologs to be the fraction of the affinity that is functional in S. cerevisiae. Because we calculate the affinity of the entire promoter, our method is independent of local alignment. By comparing with functional annotation information and gene expression data, we have validated that this biophysically motivated use of evolutionary conservation gives rise to dramatic improvement in prediction of regulatory connectivity and factor-factor interactions compared to the use of a single genome. Using this method, we propose novel biological functions for several factors. Our alternative approach towards comparative genomics may allow a more quantitative analysis of the principles governing the evolution of noncoding DNA.

Keywords: comparative genomics regulatory networks transcriptional regulation
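As a rough illustration of the affinity-based idea above (a sketch, not the authors' implementation), the code below scores a promoter by summing Boltzmann weights over every sequence window under a small hypothetical energy matrix, then takes the minimum total affinity across orthologous promoters as the conserved fraction. The 3-bp ENERGY table and all parameter values are invented for demonstration.

```python
import math

# Hypothetical 3-bp binding-site energy matrix (units of kT); 0 marks the
# consensus base at each position. Purely illustrative numbers.
ENERGY = [
    {"A": 0.0, "C": 2.0, "G": 2.0, "T": 1.0},
    {"A": 2.0, "C": 0.0, "G": 1.5, "T": 2.0},
    {"A": 1.0, "C": 2.0, "G": 0.0, "T": 2.0},
]

def total_affinity(promoter):
    """Total promoter affinity: sum of Boltzmann weights exp(-E) over all
    windows, so low-affinity sites contribute instead of being thresholded
    away, and no local alignment of site positions is needed."""
    k = len(ENERGY)
    return sum(
        math.exp(-sum(ENERGY[i][promoter[j + i]] for i in range(k)))
        for j in range(len(promoter) - k + 1)
    )

def functional_affinity(orthologous_promoters):
    # The conserved, presumably functional fraction of the affinity is
    # taken to be the minimum total affinity across orthologs.
    return min(total_affinity(p) for p in orthologous_promoters)
```

Because the score aggregates over whole promoters, orthologous sequences can be compared without aligning individual sites.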


Combinatorial influence of environmental parameters on transcription factor activity
Theo Knijnenburg, Delft Technical University
Lodewyk F.A. Wessels, Delft University of Technology
Marcel J.T. Reinders, Delft University of Technology

Motivation: Cells receive a wide variety of environmental signals, which are often processed combinatorially to generate specific genetic responses. Changes in transcript levels, as observed across different environmental conditions, can, to a large extent, be attributed to changes in the activity of transcription factors (TFs). However, in unraveling these transcription regulation networks, the actual environmental signals are often not incorporated into the model, simply because they have not been measured. The unquantified heterogeneity of the environmental parameters across microarray experiments frustrates regulatory network inference. Results: We propose an inference algorithm that models the influence of environmental parameters on gene expression. The approach is based on a yeast microarray compendium of chemostat steady-state experiments. Chemostat cultivation enables the accurate control and measurement of many of the key cultivation parameters, such as nutrient concentrations, growth rate and temperature. The observed transcript levels are explained by inferring the activity of TFs in response to combinations of cultivation parameters. The interplay between activated enhancers and repressors that bind a gene promoter determines the possible up- or downregulation of the gene. The model is translated into a linear integer optimization problem. The resulting regulatory network identifies the combinatorial effects of environmental parameters on TF activity and gene expression.

Keywords: chemostat gene expression data combinatorial cultivation parameters linear integer optimization transcription factor activity
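The paper formulates the inference as a linear integer optimization; for a toy network the same integer search can be illustrated by brute-force enumeration over binary TF activity states. The signed NET matrix and gene calls below are hypothetical:

```python
from itertools import product

# Toy signed regulatory network: +1 activator edge, -1 repressor edge,
# 0 no edge. Rows are genes, columns are TFs (hypothetical example).
NET = [
    [+1,  0],   # gene0 activated by TF0
    [+1, -1],   # gene1 activated by TF0, repressed by TF1
    [ 0, -1],   # gene2 repressed by TF1
]

def infer_tf_activity(observed):
    """observed[g] in {-1, 0, +1}: measured down/flat/up call for gene g.
    Enumerate binary TF activity vectors (the integer variables of the
    optimization) and return the assignment whose predicted gene response
    best matches the observed calls."""
    n_tf = len(NET[0])

    def predict(act):
        # Sign of the summed influence of active TFs on each gene.
        out = []
        for row in NET:
            s = sum(w * a for w, a in zip(row, act))
            out.append((s > 0) - (s < 0))
        return out

    return min(
        (acts for acts in product([0, 1], repeat=n_tf)),
        key=lambda acts: sum(p != o for p, o in zip(predict(acts), observed)),
    )
```

For realistic numbers of TFs and conditions this enumeration is infeasible, which is why the paper resorts to integer programming.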


Classification of arrayCGH data using a fused SVM
Franck Rapaport, Institut Curie / ENSMP
Emmanuel Barillot, Institut CURIE
Jean-Philippe Vert, Ecole des Mines de Paris

Motivation: Array-based comparative genomic hybridization (arrayCGH) has recently become a popular tool to identify DNA copy number variations along the genome. These profiles are starting to be used as markers to improve prognosis or diagnosis of cancer, which implies that methods for automated supervised classification of arrayCGH data are needed. Like gene expression profiles, arrayCGH profiles are characterized by a large number of variables usually measured on a limited number of samples. However, arrayCGH profiles have a particular structure of correlations between variables, due to the spatial organization of BACs along the genome. This suggests that classical classification methods, often based on the selection of a small number of discriminative features, may not be the most accurate methods and may not produce easily interpretable prediction rules. Results: We propose a new method for supervised classification of arrayCGH data. The method is a variant of support vector machine (SVM) that incorporates the biological specificities of DNA copy number variations along the genome as prior knowledge. The resulting classifier is a sparse linear classifier based on a limited number of regions automatically selected on the chromosomes, leading to easy interpretation and identification of discriminative regions of the genome. We test this method on three classification problems for bladder and uveal cancer, involving both diagnosis and prognosis. We demonstrate that the introduction of the new prior on the classifier leads not only to more accurate predictions, but also to the identification of known and new regions of interest in the genome.

Keywords: arrayCGH cancer supervised classification
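The "fused" prior can be illustrated by the penalty it places on the classifier's weight vector: a sparsity term plus a fusion term that encourages equal weights on neighboring BACs, yielding piecewise-constant, interpretable profiles along the chromosome. A minimal sketch; the parameter names lam and mu are illustrative, not the paper's notation:

```python
def fused_penalty(w, lam=1.0, mu=1.0):
    """Fused regularizer on a weight vector w ordered along the genome:
    lam * sum|w_i| promotes sparsity, and mu * sum|w_{i+1} - w_i| promotes
    piecewise-constant weights over spatially adjacent BACs."""
    sparsity = sum(abs(x) for x in w)
    fusion = sum(abs(b - a) for a, b in zip(w, w[1:]))
    return lam * sparsity + mu * fusion
```

A contiguous block of equal weights is cheaper than scattered weights of the same magnitude, which is what drives the selection of whole discriminative regions rather than isolated probes.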


Modelling peptide fragmentation with dynamic Bayesian networks for peptide identification
Aaron Klammer, Dept of Genome Sciences University of Washington
Sheila Reynolds, Department of Electrical Engineering, University of Washington
Jeff Bilmes, University of Washington, Seattle
Michael MacCoss, Department of Genome Sciences, University of Washington
William Noble, Department of Genome Sciences, University of Washington

Motivation: Tandem mass spectrometry (MS/MS) is an indispensable technology for identification of proteins from complex mixtures. Proteins are digested to peptides that are then identified by their fragmentation patterns in the mass spectrometer. Thus, at its core, MS/MS protein identification relies on the relative predictability of peptide fragmentation. Unfortunately, peptide fragmentation is complex and not fully understood, and what is understood is not fully exploited by modern peptide identification algorithms. Results: We use a hybrid dynamic Bayesian network (DBN) / support vector machine (SVM) approach to address these two problems. We train a set of DBNs on high-confidence spectrum matches. These trained DBNs, known collectively as Riptide, comprise a probabilistic model of peptide fragmentation chemistry. We show that examining the learned Riptide model allows us to identify new trends, such as prevalent a-ion fragmentation at peptide cleavage sites C-terminal to hydrophobic residues. In addition, Riptide can be used to produce likelihood scores that indicate whether a given peptide-spectrum match is correct. A vector of such scores is then given to an SVM, which produces a final score to be used in peptide identification. Using Riptide in this way yields improved discrimination when compared to other state-of-the-art MS/MS identification algorithms, increasing the number of positive identifications by as much as 12% at a 1% false discovery rate.

Keywords: dynamic bayesian networks machine learning peptide fragmentation peptide identification proteomics


Matching Isotopic Distributions from Metabolically Labeled Samples
Sean McIlwain, Department of Computer Sciences, University of Wisconsin
David Page, Department of Computer Sciences and Department of Biostatistics, University of Wisconsin, Madison, WI
Edward Huttlin, Department of Biochemistry, University of Wisconsin, Madison, WI
Michael Sussman, Department of Biochemistry, University of Wisconsin, Madison, WI

Motivation: In recent years stable isotopic labeling has become a standard approach for quantitative proteomic analyses. Among the many available isotopic labeling strategies, metabolic labeling is attractive for the excellent internal control it provides. However, analysis of data from metabolic labeling experiments can be complicated because the spacing between labeled and unlabeled forms of each peptide depends on its sequence, and is thus variable from analyte to analyte. As a result, one generally needs to know the sequence of a peptide to identify its matching isotopic distributions in an automated fashion. In some experimental situations it would be necessary or desirable to match pairs of labeled and unlabeled peaks from peptides of unknown sequence. This paper addresses this largely overlooked problem in the analysis of quantitative mass spectrometry data by presenting an algorithm that not only identifies isotopic distributions within a mass spectrum, but also annotates matches between natural abundance light isotopic distributions and their metabolically labeled counterparts. This algorithm is designed in two stages: first we annotate the isotopic peaks using a modified version of the IDM algorithm described last year (McIlwain et al. (2007)); then we use a probabilistic classifier that is supplemented by dynamic programming to find the metabolically labeled matched isotopic pairs. Such a method is needed for high-throughput quantitative proteomic / metabolomic experiments measured via mass spectrometry. Results: The primary result of this paper is that the dynamic programming approach performs well given perfect isotopic distribution annotations. Our algorithm achieves a true positive rate of 99% and a false positive rate of 1% using perfect isotopic distribution annotations. When the isotopic distributions are annotated given “expert” selected peaks, the same algorithm gets a true positive rate of 77% and a false positive rate of 1%. 
Finally, when annotating using “machine” selected peaks, which may contain noise, the dynamic programming algorithm gives a true positive rate of 36% and a false positive rate of 1%. It is important to mention that these rates arise from the requirement of exact annotations of both the light and heavy isotopic distributions. In our evaluations, a match is considered “entirely incorrect” if it misses even one peak or contains an extraneous peak. Using a looser criterion that requires only that the “mono-isotopic” peaks exist within the two matched distributions, our algorithm obtains a true positive rate of 45% and a false positive rate of 1% on the “machine” selected data. Changes to the algorithm's scoring function and training example generation improve our “mono-isotopic” peak score true positive rate to 65% while obtaining a false positive rate of 2%. All results were obtained within ten-fold cross-validation of 41 mass spectra with a mass-to-charge range of 800-4000 m/z. There are a total of 713 isotopic distributions and 255 matched isotopic pairs that were hand-annotated for this study. Availability: Programs are available via McIlwain, S., Page, D., Huttlin, E. L., and Sussman, M. R. (2007). Using dynamic programming to create isotopic distribution maps from mass spectra. Bioinformatics, 23(13), i328–336.

Keywords: Dynamic Programming Mass Spectrometry Metabolic Labeling Proteomics


Multi-Spectra Peptide Sequencing and its Applications to Multistage Mass Spectrometry
Nuno Bandeira, University of California, San Diego
Jesper Olsen, Max-Planck Institute for Biochemistry
Matthias Mann, Max-Planck-Institute for Biochemistry
Pavel Pevzner, University of California, San Diego

Despite a recent surge of interest in database-independent peptide identifications, accurate de novo peptide sequencing remains an elusive goal. While the recently introduced spectral network approach resulted in accurate peptide sequencing in low-complexity samples, its success depends on the chance presence of spectra from overlapping peptides. On the other hand, while multistage mass spectrometry (collecting multiple MS/MS/MS spectra from each MS/MS spectrum) can be applied to all spectra in a complex sample, there are currently no software tools for de novo peptide sequencing by multistage mass spectrometry. We describe a rigorous probabilistic framework for analyzing spectra of overlapping peptides and show how to apply it to multistage mass spectrometry. Our software results in both accurate de novo peptide sequencing from multistage mass spectra (despite the inferior quality of MS/MS/MS spectra) and improved interpretation of spectral networks. We further study the problem of de novo peptide sequencing with accurate parent mass (but inaccurate fragment masses), a protocol that may soon become the dominant mode of spectral acquisition. Most existing peptide sequencing algorithms (based on the spectrum graph approach) do not track the accurate parent mass and are thus not equipped for solving this problem. We describe a de novo peptide sequencing algorithm aimed at this experimental protocol and show that it improves the accuracy of both tandem and multistage mass spectrometry. The open-source implementation of our software is available at

Keywords: MS/MS/MS de novo peptide sequencing tandem mass spectrometry


Assessing the functional structure of genomic data
Curtis Huttenhower, Princeton University
Olga Troyanskaya, Princeton University

Motivation: The availability of genome-scale data has enabled an abundance of novel analysis techniques for investigating gene function and relationships. As thousands of such datasets become available, they also provide an opportunity to study the higher-level interactions between cellular pathways and processes. This also allows the exploration of shared functional enrichments between diverse biological datasets, and it serves to direct experimenters to areas of low data coverage or with high probability of new discoveries. Results: We analyze the functional structure of S. cerevisiae datasets from over 950 publications in the context of over 140 biological processes. This includes a coverage analysis of biological processes given current high-throughput data, a data-driven map of similarities between processes, and a measure of functional similarity between arbitrary genome-scale datasets. This uncovers subtle gene expression similarities in three otherwise disparate microarray datasets due to a shared strain background. We also provide several means of predicting areas of yeast biology likely to benefit from additional high-throughput experimental screens.

Keywords: data integration functional relationships gene function interaction networks systems biology


A Maximum Common Substructure-Based Algorithm for Searching and Predicting Drug-Like Compounds
Yiqun Cao, Department of Computer Science & Engineering, University of California, Riverside
Tao Jiang, Department of Computer Science & Engineering, University of California, Riverside
Thomas Girke, Department of Botany and Plant Sciences, University of California, Riverside

The prediction of biologically active compounds is of great importance for High Throughput Screening (HTS) approaches in drug discovery and chemical genomics. Many computational methods in this area focus on measuring the structural similarities between chemical structures. However, traditional similarity measures are often too rigid or consider only global similarities between structures. The Maximum Common Substructure (MCS) approach provides a more promising and flexible alternative for predicting bioactive compounds. In this paper, a new backtracking algorithm for MCS is proposed and compared to global similarity measurements. Our algorithm provides high flexibility in the matching process, and it is very efficient in identifying local structural similarities. To predict and cluster biologically active compounds more efficiently, the concept of basis compounds is proposed that enables researchers to easily combine the MCS-based and traditional similarity measures with modern machine learning techniques. Support Vector Machines (SVMs) are used to test how the MCS-based similarity measure and the basis compound vectorization method perform on two empirically tested datasets. The test results show that MCS complements the well-known atom pair descriptor-based similarity measure. By combining these two measures, our SVM-based model predicts the biological activities of chemical compounds with higher specificity and sensitivity.

Keywords: SVM branch and bound chemical compound library chemical compound structural similarity maximum common subgraph


BLASTing Small Molecules - Statistics and Extreme Statistics of Chemical Similarity Scores
Pierre Baldi, University of California, Irvine
Ryan Benz, University of California, Irvine

Motivation: Small organic molecules, from nucleotides and amino acids to metabolites and drugs, play a fundamental role in chemistry, biology, and medicine. As databases of small molecules continue to grow and become more open, it is important to develop the tools to search them efficiently. In order to develop a BLAST-like tool for small molecules, one must first understand the statistical behavior of molecular similarity scores. Results: We develop a new detailed theory of molecular similarity scores that can be applied to a variety of molecular representations and similarity measures. For concreteness, we focus on the most widely used measure - the Tanimoto measure applied to chemical fingerprints. In both the case of empirical fingerprints and fingerprints generated by several stochastic models, we derive very accurate approximations for both the distribution and extreme value distribution of similarity scores. These approximations are derived using a ratio of correlated Gaussians approach. The theory enables the calculation of significance scores, such as Z-scores and p-values, and the estimation of the top hits list size. Empirical results obtained using both the random models and real data from the ChemDB database are given to corroborate the theory and show how it can be applied to mine chemical space.

Keywords: Chemical Similarity Chemoinformatics Fingerprints Statistical Significance Tanimoto Distribution
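For concreteness, the Tanimoto measure on binary fingerprints, together with a simple empirical Z-score against a background sample of similarity scores, can be sketched as follows (the closed-form distributions derived in the paper are not reproduced here; the empirical background is a stand-in):

```python
import math

def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two binary fingerprints represented as
    sets of set-bit positions: |A & B| / |A | B|."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 1.0

def z_score(s, background):
    """Standardize a similarity score against an empirical background
    sample of scores; the paper instead derives analytical approximations
    of this null distribution for significance estimation."""
    m = sum(background) / len(background)
    var = sum((x - m) ** 2 for x in background) / len(background)
    return (s - m) / math.sqrt(var) if var else 0.0
```

A BLAST-like search would report, for each hit, its Tanimoto score together with a significance value derived from such a null model.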


Identifying Functional Modules in Protein-Protein Interaction Networks: An Integrated Exact Approach
Marcus T. Dittrich, University of Würzburg
Gunnar W. Klau, Freie Universität Berlin
Andreas Rosenwald, University of Würzburg
Thomas Dandekar, University of Würzburg
Tobias Müller, University of Würzburg

Motivation: With the exponential growth of expression and protein-protein interaction (PPI) data, the frontier of research in systems biology shifts more and more to the integrated analysis of these large datasets. Of particular interest is the identification of functional modules in PPI networks, sharing common cellular function beyond the scope of classical pathways, by means of detecting jointly differentially expressed regions in PPI networks. This requires on the one hand an adequate scoring of the nodes in the network to be identified and on the other hand the availability of an effective algorithm to find the maximally scoring network regions. Various heuristic approaches have been proposed in the literature. Results: Here we present an exact integer linear programming solution for this problem, which is based on its connection to the well-known prize-collecting Steiner tree problem from Operations Research. Despite the NP-hardness of the underlying combinatorial problem, our method typically computes provably optimal subnetworks in large PPI networks in a few minutes. An essential ingredient of our approach is a scoring function defined on network nodes. We propose a new additive score with two desirable properties: (i) it is scalable by a statistically interpretable parameter and (ii) it allows a smooth integration of data from various sources. We apply our method to the well-established lymphoma microarray dataset in combination with associated survival data and the large interaction network of HPRD to identify functional modules by computing optimal-scoring subnetworks. In particular, we find a functional interaction module associated with proliferation over-expressed in the aggressive ABC subtype as well as modules derived from non-malignant bystander cells. Availability: Our software is available freely for non-commercial purposes at Contact:

Keywords: Discrete Optimization Distribution of Optimal Subgraph Score Integer Linear Programming Microarray Protein-Protein Interaction Network
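A sketch of an additive, p-value-based node score of the kind described, with a threshold-like parameter tau that controls the sign of individual scores and hence the size of high-scoring subnetworks. The parameter values here are hypothetical; the paper derives its score from a beta-uniform mixture fit to the p-value distribution with FDR control:

```python
import math

def node_score(p, a=0.5, tau=0.01):
    """Additive p-value-based node score of the form (a-1)*(log p - log tau):
    positive when p < tau, negative otherwise, so tau tunes how large the
    optimal subnetwork gets. Values of a and tau are illustrative only."""
    return (a - 1.0) * (math.log(p) - math.log(tau))

def subnetwork_score(pvalues):
    # Additivity: a candidate module is scored by summing its member
    # genes' node scores, which is what connects the problem to the
    # prize-collecting Steiner tree formulation.
    return sum(node_score(p) for p in pvalues)
```

Raising tau makes more nodes positively scored, so the optimal connected subnetwork grows; lowering it yields smaller, more significant modules.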


Prediction of drug-target interaction networks from the integration of chemical and genomic spaces
Yoshihiro Yamanishi, Centre for Computational Biology, Ecole des Mines de Paris
Michihiro Araki, Human Genome Center, Institute of Medical Science, University of Tokyo
Alex Gutteridge, Bioinformatics Center, Institute for Chemical Research, Kyoto University
Wataru Honda, Bioinformatics Center, Institute for Chemical Research, Kyoto University
Minoru Kanehisa, Bioinformatics Center, Institute for Chemical Research, Kyoto University

The identification of interactions between drugs and target proteins is a key area in genomic drug discovery. There is therefore a strong incentive to develop new methods capable of detecting these potential drug-target interactions efficiently. In this paper, we characterize four classes of drug-target interaction networks in humans involving enzymes, ion channels, GPCRs, and nuclear receptors, and reveal significant correlations between drug structure similarity, target sequence similarity, and the drug-target interaction network topology. We then develop new statistical methods to predict unknown drug-target interaction networks from chemical structure and genomic sequence information simultaneously on a large scale. The originality of the proposed method lies in the formalization of the drug-target interaction inference as a supervised learning problem for a bipartite graph, the lack of need for 3D structure information of the target proteins, and in the integration of chemical and genomic spaces into a unified space that we call "pharmacological space". In the results, we demonstrate the usefulness of our proposed method for the prediction of the four classes of drug-target interaction networks. Our comprehensively predicted drug-target interaction networks enable us to suggest many potential drug-target interactions and to increase research productivity toward genomic drug discovery.

Keywords: chemical genomics compound-protein interaction drug-target interaction network kernel method network inference


Biomolecular Network Motif Counting and Discovery by Color Coding
Noga Alon, Schools of Mathematics and Computer Science, Tel Aviv University
Phuong Dao, Simon Fraser University
Iman Hajirasouliha, Simon Fraser University
Fereydoun Hormozdiari, Simon Fraser University
Cenk Sahinalp, Simon Fraser University

Protein protein interaction (PPI) networks of many organisms share global topological features such as degree distribution, k-hop reachability, betweenness and closeness. Yet some of these networks can differ significantly from the others in terms of local structures: e.g. the number of specific network motifs can vary significantly among PPI networks. Counting the number of network motifs poses a major challenge in comparing biomolecular networks. Recently developed algorithms have been able to count the number of induced occurrences of subgraphs with k <= 7 vertices. Yet no practical algorithm exists for counting non-induced occurrences, or counting subgraphs with k >= 8 vertices. Counting non-induced occurrences of network motifs is not only challenging but also quite desirable, as available PPI networks include several false interactions and miss many others. In this paper we show how to apply the "color coding" technique to counting non-induced occurrences of subgraph topologies in the form of trees and bounded treewidth subgraphs. Our algorithm can count all occurrences of a motif G' with k vertices in a network G with n vertices in time polynomial in n, provided k = O(log n). We use our algorithm to obtain "treelet" distributions for k <= 10 of available PPI networks of unicellular organisms (S. cerevisiae, E. coli and H. pylori), which are all quite similar, and a multicellular organism (C. elegans), which is significantly different. Furthermore, the treelet distributions of the unicellular organisms are similar to that obtained by the "duplication model" but quite different from that of the "preferential attachment model". The treelet distribution is robust w.r.t. sparsification with bait/edge coverage of 70%, but differences can be observed when bait/edge coverage drops to 50%.

Keywords: Network motifs Protein protein interaction networks
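The color-coding idea can be illustrated on the simplest tree motif, a simple path on k vertices: color the vertices uniformly at random, count "colorful" paths (all colors distinct) by dynamic programming over (vertex, color-subset) states, and correct by the probability k!/k^k that any fixed path is colorful. A small-scale sketch, not the paper's implementation:

```python
import random

def count_colorful_paths(adj, k, trials=200, seed=0):
    """Color-coding estimate of the number of simple k-vertex paths in an
    undirected graph given as adjacency lists. Colorful walks necessarily
    have distinct vertices, so the DP over color subsets counts exactly
    the colorful simple paths; averaging over recolorings and dividing by
    k!/k**k recovers the total path count in expectation."""
    n = len(adj)
    fact = 1
    for i in range(2, k + 1):
        fact *= i
    p_colorful = fact / k ** k
    total = 0.0
    rng = random.Random(seed)
    for _ in range(trials):
        color = [rng.randrange(k) for _ in range(n)]
        # dp[v]: map from frozenset of colors -> #colorful paths ending at v
        dp = [{frozenset([color[v]]): 1} for v in range(n)]
        for _size in range(2, k + 1):
            new = [dict() for _ in range(n)]
            for v in range(n):
                for u in adj[v]:
                    for S, cnt in dp[u].items():
                        if color[v] not in S:
                            T = S | {color[v]}
                            new[v][T] = new[v].get(T, 0) + cnt
            dp = new
        colorful = sum(cnt for d in dp for cnt in d.values())
        total += colorful / 2  # each undirected path is counted at both ends
    return total / trials / p_colorful
```

The state space is n * 2^k rather than n^k, which is what makes k = O(log n) tractable; extending the DP from paths to general trees and bounded-treewidth subgraphs is the subject of the paper.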


Protein Complex Identification by Supervised Graph Local Clustering
Yanjun Qi, Carnegie Mellon University
Fernanda Balem, University of Pittsburgh School of Medicine
Christos Faloutsos, Carnegie Mellon University, School of Computer Science
Judith Klein-Seetharaman, University of Pittsburgh School of Medicine
Ziv Bar-Joseph, Carnegie Mellon University

Motivation: Protein complexes integrate multiple gene products to coordinate many biological functions. Given a graph representing pairwise protein interaction data, one can search for subgraphs representing protein complexes. Previous methods for performing such search relied on the assumption that complexes form a clique in that graph. While this assumption is true for some complexes, it does not hold for many others. New algorithms are required in order to recover complexes with other types of topological structure. Results: We present an algorithm for inferring protein complexes from weighted interaction graphs. By using graph topological patterns and biological properties as features, we model each complex subgraph by a probabilistic Bayesian network (BN). We use a training set of known complexes to learn the parameters of this BN model. The log-likelihood ratio derived from the BN is then used to score subgraphs in the protein interaction graph and identify new complexes. We applied our method to protein interaction data in yeast. As we show, our algorithm achieves a considerable improvement over clique-based algorithms in terms of its ability to recover known complexes. We discuss some of the new complexes predicted by our algorithm and determine that they likely represent true complexes. Availability: Matlab implementation is available on the supporting website:

Keywords: Graph Clustering Protein Complex Identification Protein-Protein Interaction Network Supervised Graph Mining


Designing Succinct Structural Alphabets
Shuai Cheng Li, University of Waterloo
Dongbo Bu, Institute of Computing Technology, Chinese Academy of Sciences
Xin Gao
Jinbo Xu, Toyota Technological Institute at Chicago
Ming Li, University of Waterloo

The three-dimensional structure of a protein sequence can be assembled from the substructures corresponding to small segments of this sequence (Levitt, 2002). For each small sequence segment, there are only a few likely substructures, which we call the "structural alphabet" for this segment. Classical approaches such as ROSETTA used only sequence profiles and secondary structure (Simons et al., 1997) to predict structural fragments. In contrast, we utilize more structural information, including solvent accessibility and contact capacity, for finding structural fragments. An integer linear programming technique is applied to derive the best combination of these sequence and structural features. This approach generates significantly more accurate and succinct structural alphabets, improving on previous accuracies by more than 50%. With these novel structural alphabets, we are able to construct more accurate protein structures than state-of-the-art ab initio protein structure prediction programs such as ROSETTA. We are also able to reduce the size of Kolodny's fragment library by a factor of 8 at the same accuracy.

Keywords: Integer Linear Programming Protein Structure Prediction Structural Alphabet


Predicting protein thermostability changes from sequence upon multiple mutations
Ludovica Montanucci, University of Bologna, Dept. of Biology
Piero Fariselli, Dept. of Biology, University of Bologna, Italy
Pier Luigi Martelli, Dept. of Biology, University of Bologna, Italy
Rita Casadio, Dept of Biology UNIBO

Motivation: A basic question in protein science is to what extent mutations affect protein thermostability. This knowledge would be particularly relevant for engineering thermostable enzymes. In several experimental approaches this issue has been addressed serendipitously. It would therefore be convenient to provide a computational method that predicts when a given protein mutant is more thermostable than its corresponding wild type. Results: We present a new method based on Support Vector Machines that is able to predict whether a set of mutations (including insertions and deletions) can enhance the thermostability of a given protein sequence. When trained and tested on a redundancy-reduced dataset, our predictor achieves 88% accuracy and a correlation coefficient equal to 0.75. Our predictor also correctly classifies 12 out of 14 experimentally characterized protein mutants with enhanced thermostability. Finally, it correctly detects all the 11 mutated proteins whose increase in stability temperature is greater than 10°C. Availability: The data set and the list of protein clusters adopted for the SVM cross-validation are available at the web site

Keywords: Mutations Support Vector Machines Thermostability Thermostability enhancement prediction


Algorithm for Backrub Motions in Protein Design
Ivelin Georgiev, Duke University Computer Science Department
Daniel Keedy, Department of Biochemistry, Duke University
Jane Richardson, Department of Biochemistry, Duke University
David Richardson, Department of Biochemistry, Duke University
Bruce Randall Donald, Duke University

The Backrub is a small but kinematically-efficient side-chain-coupled local backbone motion frequently observed in atomic-resolution crystal structures of proteins. A backrub shifts the Cα-Cβ orientation of a given side-chain by rigid-body dipeptide rotation plus smaller individual rotations of the two peptides, with virtually no change in the rest of the protein. Backrubs can therefore provide a biophysically-realistic model of local backbone flexibility for structure-based protein design. Previously, however, backrub motions were applied via manual interactive model-building, so their incorporation into a protein design algorithm (a simultaneous search over mutation and backbone/side-chain conformation space) was infeasible. We present a combinatorial search algorithm for protein design that incorporates an automated procedure for local backbone flexibility via backrub motions. We further derive a Dead-End Elimination (DEE)-based criterion for pruning candidate rotamers that, in contrast to previous DEE algorithms, is provably accurate with backrub motions. Our backrub-based algorithm successfully predicts alternate side-chain conformations from four <= 0.9 Å-resolution structures, confirming the suitability of the automated backrub procedure. Finally, the application of our algorithm to redesign two different proteins is shown to identify a large number of lower-energy conformations and mutation sequences that would have been ignored by a rigid-backbone model.

Keywords: Dead-End Elimination backbone flexibility backrub motions protein design provably-accurate algorithms
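For context, the classic rigid-backbone Goldstein DEE criterion that the paper generalizes can be sketched as follows. The data layout (E_self[pos][rot] for self energies, E_pair[pos][rot][other_pos] as a sequence of pair energies over the other position's rotamers) is invented for illustration:

```python
def dee_prunable(E_self, E_pair, pos, r):
    """Goldstein DEE: rotamer r at position pos cannot be part of the
    global minimum-energy conformation if some competitor rotamer t at
    the same position satisfies
        E(r) - E(t) + sum_j [ min_s E(r, j_s) - max_s E(t, j_s) ] > 0,
    i.e. even under the most favorable pairings r is beaten by t. The
    paper's contribution is a criterion that remains provably correct
    when backrub backbone motions are allowed; this sketch is the rigid
    case only."""
    for t in range(len(E_self[pos])):
        if t == r:
            continue
        bound = E_self[pos][r] - E_self[pos][t]
        for j in range(len(E_self)):
            if j == pos:
                continue
            bound += min(E_pair[pos][r][j]) - max(E_pair[pos][t][j])
        if bound > 0:
            return True
    return False
```

Pruned rotamers can be removed before the combinatorial conformation search, shrinking the design space without sacrificing provable optimality.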


Contact Replacement for NMR Resonance Assignment
Fei Xiong, Dartmouth College
Gopal Pandurangan, Purdue University
Chris Bailey-Kellogg, Dartmouth Computer Science

Complementing its traditional role in structural studies of proteins, nuclear magnetic resonance (NMR) spectroscopy is playing an increasingly important role in functional studies. NMR dynamics experiments characterize motions involved in target recognition, ligand binding, etc., while NMR chemical shift perturbation experiments identify and localize protein-protein and protein-ligand interactions. The key bottleneck in these studies is to determine the backbone resonance assignment, which allows spectral peaks to be mapped to specific atoms. This paper develops a novel approach to address that bottleneck, exploiting an available x-ray structure or homology model to assign the entire backbone from a set of relatively fast and cheap NMR experiments. We formulate contact replacement for resonance assignment as the problem of computing correspondences between a contact graph representing the structure and an NMR graph representing the data; the NMR graph is a significantly corrupted, ambiguous version of the contact graph. We first show that by combining connectivity and amino acid type information, and exploiting the random structure of the noise, one can provably determine unique correspondences in polynomial time with high probability, even in the presence of significant noise (a constant number of noisy edges per vertex). We then detail an efficient randomized algorithm and show that, over a variety of experimental and synthetic datasets, it is robust to typical levels of structural variation (1-2 Å), noise (250-600%) and missings (10-40%). Our algorithm achieves very good overall assignment accuracy, above 80% in alpha helices, 70% in beta sheets, and 60% in loop regions.

Keywords: Backbone Resonance Assignment Contact Graph Nuclear Magnetic Resonance (NMR) spectroscopy Protein Structure Random Graph Model


A Computational Framework to Empower Probabilistic Protein Design
Menachem Fromer, The Hebrew University of Jerusalem
Chen Yanover
Motivation: The task of engineering a protein to perform a target biological function is known as protein design. A commonly used paradigm casts this functional design problem as a structural one, assuming a fixed backbone. In probabilistic protein design, positional amino acid probabilities are used to create a random library of sequences to be simultaneously screened for biological activity. Clearly, certain choices of probability distributions will be more successful in yielding functional sequences. However, since the number of sequences is exponential in protein length, computational optimization of the distribution is difficult. Results: In this paper, we develop a computational framework for probabilistic protein design following the structural paradigm. We formulate the distribution of sequences for a structure using the Boltzmann distribution over their free energies. The corresponding probabilistic graphical model is constructed, and we apply belief propagation (BP) to calculate marginal amino acid probabilities. We test this method on a large structural dataset and demonstrate the superiority of BP over previous methods. Nevertheless, since the results obtained by BP are far from optimal, we thoroughly assess the paradigm using high-quality experimental data. We demonstrate that, for small-scale sub-problems, BP attains results identical to those produced by exact inference on the paradigmatic model. However, quantitative analysis shows that the predicted distributions differ significantly from the experimental data. These findings, along with the excellent performance we observed using BP on the smaller problems, suggest potential shortcomings of the paradigm. We conclude with a discussion of how it may be improved in the future.

Keywords: Evolutionary Divergence, Free Energy, Graphical Models, Protein Design
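The inference step can be illustrated on a tiny chain model: sum-product message passing yields per-position marginals under a Boltzmann distribution, and on a tree it agrees exactly with brute-force enumeration, mirroring the small-scale comparison described above. The two-letter alphabet and the pairwise energies here are invented.

```python
import itertools, math

# Sum-product (belief propagation) on a 3-position chain with a 2-letter
# alphabet. Pairwise energies are invented; potentials use the Boltzmann
# form exp(-E). On a chain (a tree) BP marginals are exact, which the
# brute-force enumeration at the bottom confirms.
states = ['A', 'V']
E01 = {('A','A'): 0.0, ('A','V'): 1.0, ('V','A'): 1.0, ('V','V'): 0.5}
E12 = {('A','A'): 0.5, ('A','V'): 0.0, ('V','A'): 1.5, ('V','V'): 0.2}
psi01 = {k: math.exp(-v) for k, v in E01.items()}
psi12 = {k: math.exp(-v) for k, v in E12.items()}

def normalise(d):
    z = sum(d.values())
    return {k: v / z for k, v in d.items()}

# messages passed inward from both ends of the chain toward position 1
m0_to_1 = {b: sum(psi01[(a, b)] for a in states) for b in states}
m2_to_1 = {b: sum(psi12[(b, c)] for c in states) for b in states}
bp_marginal_1 = normalise({b: m0_to_1[b] * m2_to_1[b] for b in states})

# brute force: sum Boltzmann weights of all 2**3 sequences
tot = {b: 0.0 for b in states}
for a, b, c in itertools.product(states, repeat=3):
    tot[b] += psi01[(a, b)] * psi12[(b, c)]
brute_marginal_1 = normalise(tot)
```

The point of BP in the paper is that the message-passing computation stays tractable on loopy models where the brute-force sum over all sequences does not.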


POIMs: Positional Oligomer Importance Matrices --- Understanding Support Vector Machine Based Signal Detectors
Soeren Sonnenburg, Fraunhofer FIRST
Alexander Zien, Friedrich Miescher Laboratory and Fraunhofer FIRST
Petra Philips, Friedrich Miescher Laboratory of the Max Planck Society
Gunnar Rätsch, Friedrich Miescher Laboratory of the Max Planck Society

Motivation: At the heart of many important bioinformatics problems, such as gene finding and function prediction, is the classification of biological sequences. Frequently the most accurate classifiers are obtained by training support vector machines (SVMs) with complex sequence kernels. However, a cumbersome shortcoming of SVMs is that their learned decision rules are very hard for humans to understand and cannot easily be related to biological facts. Results: To make SVM-based sequence classifiers more accessible and profitable, we introduce the concept of positional oligomer importance matrices (POIMs) and propose an efficient algorithm for their computation. In contrast to the raw SVM feature weighting, POIMs take into account the underlying correlation structure of k-mer features induced by overlaps of related k-mers. POIMs can be seen as a powerful generalization of sequence logos: they make it possible to capture and visualize sequence patterns that are relevant for the investigated biological phenomena. Availability: All source code, data sets, tables and figures will be made available at

Keywords: DNA Signal Detection, Splice Site, TSS, Sequence Classification, String Kernel, Support Vector Machine, Visualization and Understanding
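A heavily simplified flavour of the idea: for a linear scorer with one weight per (position, k-mer), an importance value can be obtained by centring each positional weight against its background expectation. The real POIM computation additionally handles the correlation structure from overlapping k-mers, which this sketch deliberately ignores; the weights below are invented.

```python
# Centre each positional k-mer weight against a uniform-background mean.
# This captures only the "importance relative to expectation" flavour of
# POIMs, without the overlapping-k-mer corrections of the real method.

def positional_importance(weights):
    """weights: list over positions of {kmer: weight} dicts.
    Returns the same shape with background-centred importances."""
    poims = []
    for w in weights:
        background = sum(w.values()) / len(w)   # uniform-background mean
        poims.append({kmer: s - background for kmer, s in w.items()})
    return poims
```

Per position, the centred importances sum to zero, so positive entries mark k-mers that push the decision above the background expectation.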


The effectiveness of position- and composition-specific gap costs for protein homology search
Aleksandar Stojmirovic, NCBI/NLM/NIH
E. Michael Gertz, NCBI/NLM/NIH
Stephen F. Altschul, NCBI/NLM/NIH

The flexibility in gap costs enjoyed by hidden Markov models (HMMs) is expected to lead to better retrieval efficiency than position-specific scoring matrices (PSSMs) allow. We attempt to quantify the effect of more general gap parameters by separately examining the influence of position- and composition-specific gap scores, and by comparing the retrieval effectiveness of PSSMs constructed using an iterative procedure against that of the HMMs provided by Pfam and SUPERFAMILY, curated ensembles of multiple alignments. While we found that position-specific gap penalties do gain appreciably over constant gap costs, we did not explore the avenue of optimizing gap costs per query. For Pfam, PSSMs iteratively constructed from seeds based on HMM consensus sequences show performance equivalent to that of HMMs that were adjusted to have constant gap transition probabilities, albeit with much larger variance. We observed no effect of composition-specific gap costs on retrieval performance.

Keywords: HMM, PSSM, composition-specific gap, position-specific gap
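The kind of flexibility under comparison can be made concrete with a small dynamic program that aligns a sequence to a PSSM using per-column (position-specific) linear gap costs; setting all costs equal recovers the constant-cost case. The scores and penalties below are invented for illustration.

```python
# Global alignment of a sequence against a PSSM with position-specific
# linear gap costs. Illustrative sketch only; scoring values are made up.

def align_pssm(seq, pssm, del_cost, ins_cost):
    """pssm: list of {residue: match_score}; del_cost[i]: cost of skipping
    column i; ins_cost[i]: cost of inserting a residue near column i."""
    m, n = len(pssm), len(seq)
    NEG = float('-inf')
    dp = [[NEG] * (n + 1) for _ in range(m + 1)]
    dp[0][0] = 0.0
    for i in range(m + 1):
        for j in range(n + 1):
            if dp[i][j] == NEG:
                continue
            if i < m and j < n:   # align column i with residue seq[j]
                dp[i+1][j+1] = max(dp[i+1][j+1],
                                   dp[i][j] + pssm[i].get(seq[j], -1.0))
            if i < m:             # delete column i, position-specific cost
                dp[i+1][j] = max(dp[i+1][j], dp[i][j] - del_cost[i])
            if j < n:             # insert residue seq[j]
                dp[i][j+1] = max(dp[i][j+1],
                                 dp[i][j] - ins_cost[min(i, m - 1)])
    return dp[m][n]
```

Making `del_cost`/`ins_cost` vary by column is exactly the per-position freedom whose retrieval benefit the abstract measures.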


ProSOM: Core promoter prediction based on unsupervised clustering of DNA physical profiles
Thomas Abeel, Department of Plant Systems Biology, Flanders Institute for Biotechnology (VIB), Ghent University
Yvan Saeys, Department of Plant Systems Biology, Flanders Institute for Biotechnology (VIB), Ghent University
Pierre Rouzé, Ghent University
Yves Van de Peer, Ghent University

Motivation: More and more genomes are being sequenced, and to keep up with the pace of sequencing projects, automated annotation techniques are required. One of the most challenging problems in genome annotation is the identification of the core promoter. Because the identification of the transcription initiation region is such a challenging problem, it is not yet common practice to integrate transcription start site prediction in genome annotation projects. Nevertheless, better core promoter prediction can improve genome annotation and can be used to guide experimental work. Results: Comparing the average structural profile of transcribed, promoter and intergenic sequences demonstrates that the core promoter has unique features that cannot be found in other sequences. We show that unsupervised clustering using self-organizing maps can clearly distinguish between the structural profiles of promoter sequences and other genomic sequences. An implementation of this promoter prediction program, called ProSOM, is available and has been compared with the state-of-the-art. We propose an objective, accurate and biologically sound validation scheme for core promoter predictors. ProSOM performs at least as well as current software, but our technique is more balanced in terms of the number of predicted sites and the number of false predictions, thus providing a better all-round performance. Additional tests on the ENCODE regions of the human genome show that 98% of all predictions made by ProSOM can be associated with transcriptionally active regions, thus demonstrating high precision. Availability: Predictions for the human genome and the program (ProSOM) are available upon request.

Keywords: DNA physical property, core promoter, promoter prediction, transcription start site
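The clustering machinery can be sketched with a bare-bones one-dimensional self-organizing map. The toy two-dimensional "profiles", unit count, and training schedule are all invented; units are seeded from evenly spaced data points to keep the run deterministic.

```python
import math

# Minimal 1-D self-organizing map in the spirit of clustering structural
# profiles into promoter-like vs other classes. Toy data and parameters.

def train_som(data, n_units, epochs=50):
    # seed units from evenly spaced data points (deterministic)
    units = [list(data[i * len(data) // n_units]) for i in range(n_units)]
    for epoch in range(epochs):
        lr = 0.5 * (1 - epoch / epochs)                 # decaying rate
        radius = max(0.1, (n_units / 2) * (1 - epoch / epochs))
        for x in data:
            bmu = min(range(n_units),                   # best matching unit
                      key=lambda u: sum((w - xi) ** 2
                                        for w, xi in zip(units[u], x)))
            for u in range(n_units):                    # neighborhood update
                h = math.exp(-((u - bmu) ** 2) / (2 * radius ** 2))
                units[u] = [w + lr * h * (xi - w)
                            for w, xi in zip(units[u], x)]
    return units

def assign(x, units):
    return min(range(len(units)),
               key=lambda u: sum((w - xi) ** 2 for w, xi in zip(units[u], x)))
```

After training, each input profile is assigned to its nearest unit, which is the unsupervised cluster label the abstract relies on.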


Optimal pooling for genome re-sequencing with ultra-high-throughput short-read technologies
Iman Hajirasouliha, Simon Fraser University
Fereydoun Hormozdiari, Simon Fraser University
S. Cenk Sahinalp, Simon Fraser University
Inanc Birol, BC Genome Sciences Centre

New generation sequencing technologies offer unique opportunities and challenges for re-sequencing studies. In this paper, we focus on re-sequencing experiments using the Solexa technology, based on bacterial artificial chromosome (BAC) clones, and address an experimental design problem. In these specific experiments, approximate coordinates of the BACs on a reference genome are known, and fine-scale differences between the BAC sequences and the reference are of interest. The high-throughput characteristics of the sequencing technology make it possible to multiplex BAC sequencing experiments by pooling BACs for a cost-effective operation. However, the way BACs are pooled in such re-sequencing experiments has an effect on the downstream analysis of the generated data, mostly due to subsequences common to multiple BACs. The experimental design strategy we develop in this paper offers combinatorial solutions based on approximation algorithms for the well-known max n-cut problem and the related max n-section problem on hypergraphs. Our algorithms, when applied to a number of sample cases, give a more than 2-fold performance improvement over random partitioning.

Keywords: Approximation algorithms, Genome resequencing, Hypergraph partitioning, New generation sequencing technologies, Short read assembly
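A greedy local-search baseline conveys the objective: place BACs into n pools so that pairs sharing sequence (edge weights) land in different pools, i.e. approximately maximise the cut between pools. The paper's approximation algorithms for max n-cut / max n-section on hypergraphs go further; this sketch handles only pairwise overlaps, with invented weights.

```python
# Greedy local search for the pooling objective over pairwise overlaps.

def pool_bacs(n_pools, overlap):
    """overlap: {(bac_a, bac_b): shared_weight}. Returns {bac: pool}."""
    bacs = sorted({b for pair in overlap for b in pair})
    pool = {b: i % n_pools for i, b in enumerate(bacs)}  # arbitrary start

    def intra(b, p):  # overlap weight b would share within pool p
        return sum(w for (x, y), w in overlap.items()
                   if b in (x, y) and pool[y if x == b else x] == p)

    improved = True
    while improved:                 # each move strictly lowers intra-pool
        improved = False            # weight, so the loop terminates
        for b in bacs:
            best = min(range(n_pools), key=lambda p: intra(b, p))
            if intra(b, best) < intra(b, pool[b]):
                pool[b] = best
                improved = True
    return pool
```

Local search of this kind only guarantees a local optimum, which is why the paper resorts to approximation algorithms with provable bounds.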


Efficient algorithms for exact hierarchical clustering of huge datasets: Tackling the entire protein space
Yaniv Loewenstein, The Hebrew University of Jerusalem
Elon Portugaly, The Hebrew University of Jerusalem
Menachem Fromer, The Hebrew University of Jerusalem
Michal Linial, The Hebrew University of Jerusalem

Motivation: UPGMA (average-linkage clustering) is probably the most popular algorithm for hierarchical data clustering, especially in computational biology. UPGMA, however, requires all edges between data points to be held in memory. Due to this prohibitive memory requirement, UPGMA is not scalable to very large datasets. Results: We present novel memory-constrained UPGMA (MCUPGMA) algorithms. Given a constrained memory size, our algorithm guarantees the exact same UPGMA clustering solution, without explicitly holding all edges in memory. Our algorithms are general, and applicable to any dataset. We present a theoretical characterization of the algorithm's efficiency and its hardness for various data. We show the performance of our algorithm under restricted memory constraints. The presented concepts are applicable to any agglomerative clustering formulation. We apply our algorithm to the entire collection of protein sequences, to automatically build a novel evolutionary tree of all proteins using no prior knowledge. We show that the newly created tree captures protein families better than state-of-the-art large-scale methods such as CluSTr, ProtoNet4, or single-linkage clustering. The robustness of UPGMA improves significantly on existing methods, especially for multi-domain proteins, and for large or divergent families. Our algorithm is scalable to any feasible increase in sequence database sizes. Availability: The evolutionary tree of all proteins in the entire UniProt set, together with navigation and classification tools, will be made available as part of the ProtoNet service. A C++ implementation of the algorithm, suitable for any type or size of data, is available.

Keywords: Clustering, Protein function, Sequence alignment, protein families
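For reference, plain in-memory UPGMA looks as follows; the point of the memory-constrained variant is to reproduce exactly this output without keeping the full distance matrix resident. The toy distances in the test are invented.

```python
# Reference in-memory UPGMA (average linkage with size-weighted updates).

def upgma(dist, n):
    """dist: {(i, j): distance} over leaves 0..n-1 with i < j.
    Returns the merge list [(a, b, height, new_cluster_id), ...]."""
    d = {tuple(sorted(k)): v for k, v in dist.items()}
    size = {i: 1 for i in range(n)}
    active = set(range(n))
    merges, nxt = [], n
    while len(active) > 1:
        # closest pair of active clusters
        a, b = min((k for k in d if k[0] in active and k[1] in active),
                   key=d.get)
        for c in active - {a, b}:   # size-weighted average-linkage update
            dac = d[tuple(sorted((a, c)))]
            dbc = d[tuple(sorted((b, c)))]
            d[(c, nxt)] = (size[a] * dac + size[b] * dbc) / (size[a] + size[b])
        size[nxt] = size[a] + size[b]
        merges.append((a, b, d[tuple(sorted((a, b)))] / 2.0, nxt))
        active -= {a, b}
        active.add(nxt)
        nxt += 1
    return merges
```

Holding `d` for all pairs is exactly the quadratic memory cost that MCUPGMA avoids while guaranteeing the same merges.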


MiRNA Prediction with a Novel Ranking Algorithm Based on Random Walks
Xuefeng Zhou, Washington University in St. Louis
Yunpeng Xu, Washington University in St. Louis
Weixiong Zhang, Washington University in St. Louis

MicroRNAs (miRNAs) play essential roles in post-transcriptional gene regulation in animals and plants. Several existing computational approaches have been developed to complement experimental methods in the discovery of miRNAs that are expressed restrictively in specific environmental conditions or nonabundant cell types. These computational methods require a sufficient number of characterized miRNAs as training samples, and rely on genome annotation to reduce the number of predicted putative miRNAs. However, most sequenced genomes have not been well annotated and many of them have very few experimentally characterized miRNAs. As a result, the existing methods are not effective, or even feasible, for identifying miRNAs in these genomes. Aiming to identify miRNAs in genomes with few known miRNAs and/or little annotation, we propose and develop a novel miRNA prediction method based on our new random-walk-based ranking algorithm. We first tested our method on the H. sapiens genome; using only a few known human miRNAs as samples, our method achieved a prediction accuracy greater than 95%. We then applied our method to predict 200 miRNAs in A. gambiae, which is the most important vector of malaria in Africa. Our further study showed that 78 of the 200 putative miRNA precursors encode mature miRNAs that are conserved in at least one other animal species. These conserved putative miRNAs are good candidates for further experimental study to understand malaria infection.

Keywords: miRNA prediction, random walks, ranking
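A random-walk-with-restart sketch conveys the ranking idea: score nodes of a similarity graph by their proximity to a small seed set of known miRNAs. The graph, weights, and restart probability below are invented, and the paper's ranking algorithm differs in its details.

```python
# Random walk with restart from a seed set; nodes are ranked by their
# stationary visiting probability. Toy graph; not the paper's exact method.

def rwr_rank(adj, seeds, restart=0.3, iters=200):
    """adj: {node: {neighbour: weight}}; seeds: set of known positives."""
    p = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in adj}
    for _ in range(iters):
        q = {v: 0.0 for v in adj}
        for v in adj:                       # spread mass along edges
            out = sum(adj[v].values())
            for u, w in adj[v].items():
                q[u] += (1 - restart) * p[v] * w / out
        for s in seeds:                     # restart at the seed set
            q[s] += restart / len(seeds)
        p = q
    return sorted(adj, key=p.get, reverse=True)
```

Because the walk restarts at the seeds, candidates well connected to known miRNAs rise to the top even when the seed set is tiny, which matches the few-known-miRNAs setting the abstract targets.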


A Robust Framework for Detecting Structural Variations in a Genome
Seunghak Lee, University of Toronto
Elango Cheran, University of Toronto
Michael Brudno, University of Toronto

Motivation: Recently, structural genomic variants have come to the forefront as a significant source of variation in the human population, but the identification of these variants in a large genome remains a challenge. The complete sequencing of a human individual is prohibitive at current costs, while current polymorphism detection technologies, such as SNP arrays, are not able to identify many of the large scale events. One of the most promising methods to detect such variants is the computational mapping of clone-end sequences to a reference genome. Results: Here, we present a probabilistic framework for the identification of structural variants using clone-end sequencing. Unlike previous methods, our approach does not rely on an a priori determined mapping of all reads to the reference. Instead, we build a framework for finding the most probable assignment of sequenced clones to potential structural variants based on the other clones. We compare our predictions with the structural variants identified in three previous studies. While there is a statistically significant correlation between the predictions, we also find a significant number of previously uncharacterized structural variants. Furthermore, we identify a number of putative cross-chromosomal events, primarily located proximally to the centromeres of the chromosomes. Availability: Our dataset, results, and source code are available at

Keywords: Polymorphism, Rearrangements, Structural variation
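The simplest form of the clone-end signal can be shown in a few lines: a clone whose mapped end-to-end span deviates strongly from the expected insert size suggests a deletion or an insertion in the donor relative to the reference. The numbers are illustrative, and the paper's probabilistic assignment of clones to variants goes far beyond this z-score filter.

```python
# Flag clones whose mapped span is inconsistent with the insert-size
# distribution. Illustrative thresholds; real callers model this
# probabilistically and combine evidence across clones.

def flag_discordant(clones, mean_insert, sd, z=3.0):
    """clones: {name: (left_coord, right_coord)} on the reference."""
    calls = {}
    for name, (lo, hi) in clones.items():
        span = hi - lo
        if span > mean_insert + z * sd:
            calls[name] = 'deletion'    # reference has extra sequence
        elif span < mean_insert - z * sd:
            calls[name] = 'insertion'   # donor carries extra sequence
        else:
            calls[name] = 'concordant'
    return calls
```

The abstract's framework replaces this per-clone decision with a joint, most-probable assignment of all clones to candidate variants.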


A max-margin model for efficient simultaneous alignment and folding of RNA sequences
Chuong Do, Stanford University
Chuan-Sheng Foo, Stanford University
Serafim Batzoglou, Stanford University

From a biological perspective, the need for accurate and efficient tools for computational analysis of RNA secondary structure has become increasingly apparent over the last several years: RNA folding algorithms underlie numerous applications in bioinformatics, ranging from microarray probe selection to de novo computational noncoding RNA gene prediction. Previously, we introduced CONTRAfold, a novel algorithm for RNA secondary structure prediction which demonstrated for the first time that a fully automated statistical learning approach to single-sequence RNA secondary structure prediction could achieve accuracies exceeding those of the best thermodynamics-based methods. In this work, we present RAF (RNA Alignment and Folding), a new algorithm which extends the CONTRAfold methodology to the problem of simultaneous alignment and consensus folding of unaligned RNA sequences. Results: Algorithmically, RAF combines efficient sparse dynamic programming approaches for RNA secondary structure prediction with discriminative machine learning approaches. On carefully cross-validated benchmark tests, RAF achieves accuracy competitive with the current best approaches for RNA multiple sequence secondary structure prediction. However, RAF also requires nearly an order of magnitude less time than other methods, thus making it especially appropriate for high-throughput studies.

Keywords: RNA folding, sequence analysis
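As a folding-only baseline (RAF couples folding with alignment, which this sketch does not attempt), the classic Nussinov dynamic program maximises the number of nested base pairs subject to a minimum hairpin-loop size:

```python
# Nussinov base-pair maximisation: dp[i][j] = max pairs in seq[i..j],
# with at least min_loop unpaired bases inside any hairpin.

def nussinov(seq, min_loop=3):
    pairs = {('A','U'), ('U','A'), ('G','C'), ('C','G'), ('G','U'), ('U','G')}
    n = len(seq)
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]                 # j unpaired
            for k in range(i, j - min_loop):    # pair k with j
                if (seq[k], seq[j]) in pairs:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1]
```

Real scoring, as in CONTRAfold's learned model or thermodynamic energies, replaces the "+1 per pair" objective; the sparse dynamic programming in RAF prunes the analogous but much larger joint alignment-and-folding recursion.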


Annotation-based inference of transporter function
Thomas Lee, SRI International
Ian Paulsen, Macquarie University, Chemistry & Biomedical Sciences
Peter Karp, SRI International

Motivation: We present a method for inferring and constructing transport reactions for transporter proteins based primarily on the analysis of the names of individual proteins in the genome annotation of an organism. Transport reactions are declarative descriptions of transporter activities, and thus can be manipulated computationally, unlike free-text protein names. Once transporter activities are encoded as transport reactions, a number of computational analyses are possible, including database queries by transporter activity; inclusion of transporters in an automatically generated metabolic-map diagram that can be painted with omics data to aid in their interpretation; detection of anomalies in the metabolic and transport networks, such as substrates that are transported into the cell but are not inputs to any metabolic reaction or pathway; and comparative analyses of the transport capabilities of different organisms. Results: On randomly selected organisms, the method achieves precision and recall rates of 0.93 and 0.90, respectively, in identifying transporter proteins by name within the complete genome. The method obtains 67.5% accuracy in predicting complete transport reactions; if allowance is made for predictions that are overly general yet not incorrect, reaction prediction accuracy is 82.5%. Availability: The method is implemented as part of PathoLogic, the inference component of the Pathway Tools software. Pathway Tools is freely available to researchers at noncommercial institutions, including source code; a fee applies to commercial institutions. Contact:

Keywords: genome analysis, protein function prediction, transport reactions
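A toy version of name-driven inference: pull a substrate out of a transporter protein name with simple patterns and emit a declarative transport reaction. The patterns and the compartment convention here are invented for illustration; PathoLogic's actual name analysis is far richer.

```python
import re

# Hypothetical name patterns; real annotations use many more forms.
PATTERNS = [
    re.compile(r'^(?P<sub>[a-z-]+) (ABC )?transporter', re.IGNORECASE),
    re.compile(r'^(?P<sub>[a-z-]+)[ /-]proton symporter', re.IGNORECASE),
]

def transport_reaction(protein_name):
    """Return a declarative reaction string for a transporter name,
    or None when the name carries no transporter evidence."""
    for pattern in PATTERNS:
        m = pattern.match(protein_name)
        if m:
            sub = m.group('sub').lower()
            return f'{sub}[out] -> {sub}[in]'
    return None
```

Turning free-text names into structured reactions like this is what enables the downstream queries, anomaly detection, and cross-organism comparisons the abstract lists.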


Detection of IUPAC and IUPAC-like Chemical Names
Roman Klinger, Fraunhofer Institute for Algorithms and Scientific Computing
Corinna Kolarik, Fraunhofer Institute for Algorithms and Scientific Computing
Juliane Fluck, Fraunhofer Institute for Algorithms and Scientific Computing
Martin Hofmann-Apitius, Fraunhofer Society and University of Bonn
Christoph M. Friedrich, Fraunhofer Institute for Algorithms and Scientific Computing

Motivation: Chemical compounds like small signal molecules or other biologically active chemical substances are an important entity class in life science publications and patents. Several representations and nomenclatures for chemicals, like SMILES, InChI, IUPAC or trivial names, exist. Only SMILES and InChI names allow a direct structure search, but in biomedical texts trivial names and IUPAC-like names are used more frequently. While trivial names can be found with a dictionary-based approach and in this way mapped to their corresponding structures, it is not possible to enumerate all possible IUPAC names. In this work we present a new machine learning approach based on conditional random fields (CRFs) to find mentions of IUPAC and IUPAC-like names in scientific text, together with its evaluation and the conversion rate achieved with available name-to-structure tools. Results: We present an IUPAC name recognizer with an F1 measure of 85.6% on a MEDLINE corpus. The evaluation of different CRF orders and offset conjunction orders demonstrates the importance of these parameters. An evaluation of hand-selected patent sections containing large enumerations and terms with mixed nomenclature shows good performance on these cases (F1 measure 81.5%). The remaining recognition problems lie in detecting the correct borders of the typically long terms, especially when they occur in parentheses or enumerations. We demonstrate the scalability of our implementation by providing results from a full MEDLINE run. Availability: We plan to publish the corpora, annotation guidelines, and the conditional random field model as a UIMA component.

Keywords: Chemical Entity Recognition, Chemical Nomenclature, Conditional Random Fields, Information Retrieval, Named Entity Recognition


Identifying Gene-Disease Associations using Centrality on a Literature Mined Gene Interaction Network
Arzucan Ozgur, University of Michigan
Thuy Vu, University of Michigan
Gunes Erkan, University of Michigan
Dragomir Radev, U. Michigan

Motivation: Understanding the role of genetics in diseases is one of the most important aims of the biological sciences. The completion of the Human Genome Project has led to a rapid increase in the number of publications in this area. However, the coverage of curated databases that provide information manually extracted from the literature is limited. Another challenge is that determining disease-related genes requires laborious experiments. Therefore, predicting good candidate genes before experimental analysis will save time and effort. We introduce an automatic approach based on text mining and network analysis to predict gene-disease associations. We collected an initial set of known disease-related genes and built an interaction network by automatic literature mining based on dependency parsing and support vector machines. Our hypothesis is that the central genes in this disease-specific network are likely to be related to the disease. We used the degree, eigenvector, betweenness, and closeness centrality metrics to rank the genes in the network. Results: The proposed approach can be used to extract known and to infer unknown gene-disease associations. We evaluated the approach for prostate cancer. Eigenvector and degree centrality achieved high accuracy. 95% of the top 20 genes ranked by these methods are confirmed to be related to prostate cancer. On the other hand, betweenness and closeness centrality predicted more genes whose relation to the disease is currently unknown and are candidates for experimental study. Availability: A web-based system for browsing the disease-specific gene interaction networks is available at: Contact:

Keywords: gene interaction network, gene-disease associations, graph centrality metrics, text mining
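Degree and eigenvector centrality, the two metrics the abstract reports as most accurate, can be computed on a toy gene interaction network in a few lines. The graph below is invented; eigenvector centrality uses plain power iteration.

```python
# Degree centrality: number of interaction partners.
def degree_centrality(adj):
    return {v: len(nbrs) for v, nbrs in adj.items()}

# Eigenvector centrality via power iteration on the adjacency structure;
# each node's score is repeatedly replaced by the sum of its neighbours'
# scores, normalised by the maximum.
def eigenvector_centrality(adj, iters=200):
    x = {v: 1.0 for v in adj}
    for _ in range(iters):
        y = {v: sum(x[u] for u in adj[v]) for v in adj}
        top = max(y.values())
        x = {v: s / top for v, s in y.items()}
    return x
```

Under the abstract's hypothesis, genes scoring highest on these metrics in a disease-specific network are the strongest disease-association candidates.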


Integrating High Dimensional Bi-directional Parsing Models for Gene Mention Tagging
Chun-Nan Hsu, Institute of Information Science, Academia Sinica, Taiwan
Yu-Ming Chang, Institute of Information Science, Academia Sinica, Taiwan
Cheng-Ju Kuo, Institute of Information Science, Academia Sinica, Taiwan
Yu-Shi Lin, Institute of Information Science, Academia Sinica, Taiwan
Han-Shen Huang, Institute of Information Science, Academia Sinica, Taiwan
I-Fang Chung, Institute of Biomedical Informatics, National Yang-Ming University, Taiwan

Motivation: Tagging gene and gene product mentions in scientific text is an important initial step of literature mining. In this paper, we describe in detail our gene mention tagger, which participated in the BioCreative 2 challenge, and analyze what contributes to its good performance. Our tagger is interesting because it is based on the conditional random fields (CRF) model, the most prevalent method for the gene mention tagging task in BioCreative 2. Our tagger achieved the highest F-score among the CRF-based entries and the second highest overall. Moreover, we accomplished our results mostly by applying open source packages, making it easy to duplicate our results. Results: We first describe in detail how we developed our CRF-based tagger. We designed a very high-dimensional feature set that includes most of the information that may possibly be relevant. We trained bi-directional CRF models with the same set of features, one applying forward parsing and the other backward, and integrated the two models based on their likelihood scores and dictionary filtering. One of the most prominent factors contributing to the good performance of our tagger is the integration of the additional backward parsing model. However, from the definition of CRF, it appears that a CRF model is symmetric and that bi-directional parsing models should produce the same results. We show that, due to different feature settings, a CRF model can be asymmetric; the feature setting of our tagger in BioCreative 2 not only produces different results but also gives backward parsing models a slight but consistent advantage over forward parsing models for gene mention tagging. To fully explore the potential of integrating bi-directional parsing models, we applied different feature settings to generate many bi-directional parsing models and integrated them based on their likelihood scores. Experimental results show that this integrated model can achieve an even higher F-score based solely on the training corpus for gene mention tagging.

Keywords: BioCreative 2, Conditional Random Fields, Gene Mention Tagging, Model Integration, Text Mining