Motivation: Integration of heterogeneous data in life sciences is a growing and recognized challenge. The problem is not only to enable the study of such data within the context of a biological question, but also more fundamentally how to represent the available knowledge and make it accessible for mining.
Results: Our integration approach is based on the premise that relationships between biological entities can be represented as a complex network. The context dependency is achieved by a judicious use of distance measures on these networks. The biological entities and the distances between them are mapped for the purpose of visualization into the lower-dimensional space using the Sammon’s mapping. The system implementation is based on a multi-tier architecture using a native XML database and a software tool for querying and visualizing complex biological networks. The functionality of our system is demonstrated with two examples: (1) A multiple pathway retrieval, in which, given a pathway name, the system finds all the relationships related to the query by checking available metabolic pathway, transcriptional, signaling, protein-protein interaction, and ontology annotation resources and (2) A protein neighborhood search, in which given a protein name, the system finds all its connected entities within a specified depth. These two examples show that our system is able to conceptually traverse different databases to produce testable hypotheses and lead towards answers to complex biological questions.
2. A Motif-based Framework for Recognizing Sequence Families
Motivation: Many signals in biological sequences are based on the presence or absence of base signals and their spatial combinations. One of the best known examples in this regard is the signal identifying a core promoter---the site at which the basal transcription machinery starts the transcription of a gene. Our goal is a fully automatic pattern recognition system for a family of sequences that simultaneously discovers the base signals, their spatial relationships and a classifier based upon them.
Results: In this paper we present a general method for characterizing a set of Sequences by their recurrent motifs. Our approach relies on novel probabilistic models for DNA binding sites and modules of binding sites, on algorithms to learn them from data, and on a support vector machine that uses the learned models to classify a set of sequences. We demonstrate the applicability of our approach to diverse instances, ranging from families of promoter sequences to a data set of intronic sequences flanking alternatively spliced exons. On a core promoter data set our results are comparable to the state-of-the-art McPromoter. On a data set of alternatively spliced exons we outperform a previous approach. We also achieve high success rates in recognizing cell cycle regulated genes. These results demonstrate that a fully automatic pattern recognition algorithm can meet or exceed the performance of hand-crafted approaches.
Availability: The software and data sets are available from the authors upon request.
3. A Procedure for Assessing GO Annotation Consistency
Authors:Dolan, Ni, Camon, Blake
Motivation: The Gene Ontology [GO] is widely used to annotate molecular attributes of genes and gene products. Multiple groups undertaking functional annotations of genomes contribute their annotation sets to the GO database resource and these data are subsequently used in comparative functional analysis research. Although GO curators adhere to the same protocols and standards while assigning GO annotations, the specific procedure followed by each annotation group can vary. Since differences in application of annotation standards would dilute the effectiveness of comparative analysis, methods for assessing annotation consistency are essential. The development of methodologies that are broadly applicable for the assessment of GO annotation consistency is an important issue for the comparative genomics community.
Results: We have developed a methodology for assessing the consistency of GO annotations provided by different annotation groups. The method is completely general and can be applied to compare any two sets of GO annotations. This is the first attempt to assess cross-species GO annotation consistency. Our method compares annotation sets utilizing the hierarchical structure of the GO to compare GO annotations between orthologous gene pairs. The method produces a report on the annotation consistency and inconsistency for each orthologous pair. We present results obtained by comparing GO annotations for mouse and human gene sets.
4. An HMM Posterior Decoder for Sequence Feature Prediction that Includes Homology Information
Authors:Käll, Krogh, Sonnhammer
When predicting sequence features like transmembrane topology, signal peptides, coil-coil structures, protein secondary structure, or genes, extra support can be gained from homologs. We here present a general HMM decoding algorithm that combines probabilities for sequence features of homologs by considering the average of the posterior label probability of each position in a global sequence alignment. The algorithm is an extension of the previously described "optimal accuracy" decoder, allowing homology information to be used. It was benchmarked using an HMM for transmembrane topology and signal peptide prediction, Phobius. We found that performance was substantially increased when incorporating information from homologs.
5. YeastHub: A Semantic Web Use Case for Integrating Data in the Life Sciences Domain
Motivation: As the semantic web technology is maturing and the need for life sciences data integration over the web is growing, it is important to explore how data integration needs can be addressed by the semantic web. The main problem that we face in data integration is a lack of widely-accepted standards for expressing the syntax and semantics of the data. We address this problem by exploring the use of semantic web technologies — including Resource Description Framework (RDF), RDF Site Summary (RSS), relational-database-to-RDF mapping (D2RQ), and native RDF data repository — to represent, store, and query both metadata and data across life sciences datasets.
Results: As many biological datasets are presently available in tabular format, we introduce an RDF structure into which they can be converted. Also, we develop a prototype web-based application called YeastHub that demonstrates how a life sciences data warehouse can be built using a native RDF data store (Sesame). This data warehouse allows integration of different types of yeast genome data provided by different resources in different formats including the tabular and RDF formats. Once the data are loaded into the data warehouse, RDF-based queries can be formulated to retrieve and query the data in an integrated fashion.
6. Improved Detection of DNA Motifs Using a Self-organized Clustering of Familial Binding Profiles
Authors:Mahony, Golden, Smith, Benos
Motivation: One of the limiting factors in deciphering transcriptional regulatory networks is the effectiveness of motif-finding software. An emerging avenue for improving motif-finding accuracy aims to incorporate generalised binding constraints of related transcription factors (TFs), named familial binding profiles (FBPs), as priors in motif identification methods. A motif-finder can thus be “biased” towards finding motifs from a particular TF family. However, current motif-finders allow only a single FBP to be used as a prior in a given motif-finding run. In addition, current FBP construction methods are based on manual clustering of position specific scoring matrices (PSSMs) according to the known structural properties of the TF proteins. Manual clustering assumes that the binding preferences of structurally similar TFs will also be similar. This assumption is not true, at least not for some TF families. Automatic PSSM clustering methods are thus required for augmenting the usefulness of FBPs.
Results: A novel method is developed for automatic clustering of PSSM models. The resulting FBPs are incorporated into the SOMBRERO motif-finder, significantly improving its performance when finding motifs related to those that have been incorporated. SOMBRERO is thus the only existing de novo motif-finder that can incorporate knowledge of all known PSSMs in a given motif-finding run.
7. High-recall Protein Entity Recognition Using a Dictionary
Authors:Kou, Cohen, Murphy
Protein name extraction is an important step in mining biological literature. We describe two new methods for this task: semiCRFs and dictionary HMMs. SemiCRFs are a recently-proposed extension to conditional random fields that enables more effective use of dictionary information as features. Dictionary HMMs are a technique in which a dictionary is converted to a large HMM that recognizes phrases from the dictionary, as well as variations of these phrases. Standard training methods for HMMs can be used to learn which variants should be recognized. We compared the performance of our new approaches to that of Maximum Entropy (MaxEnt) and normal CRFs on three datasets, and improvement was obtained for all four methods over the best published results for two of the datasets. CRFs and semiCRFs achieved the highest overall performance according to the widely-used F-measure, while the dictionary HMMs performed the best at finding entities that actually appear in the dictionary—the measure of most interest in our intended application.
8. Statistics of Local Multiple Alignments
BLAST [Altschul et al., 1990] statistics have been shown to be extremely useful for searching for significant similarity hits, for amino acid and nucleotide sequences. While these statistics are well understood for pairwise comparisons, there has been little success developing statistical scores for multiple alignments. In particular, there is no score for multiple alignment that is well founded and treated as a standard. We extend the BLAST theory to multiple alignments. Following some simple assumptions, we present and justify a significance score for multiple segments of a local multiple alignment. We demonstrate its usefulness in distinguishing high and moderate quality multiple alignments from low quality ones, with supporting experiments on orthologous vertebrate promoter sequences.
9. Beyond The Clause: Extraction of Phosphorylation Information from Medline Abstracts
Authors:Narayanaswamy, Ravikumar, Vijay-Shanker
Phosphorylation is an important biochemical reaction that plays a critical Role in signal transduction pathways and cell-cycle processes. A text mining system to extract the phosphorylation relation from the literature is reported. The focus of this paper is on the new methods developed and implemented to connect and merge pieces of information about phosphorylation mentioned in different sentences in the text. The effectiveness and accuracy of the system as a whole as well as that of the methods for extraction beyond a clause/sentence is evaluated using an independently annotated dataset, the Phospho.ELM database. The new methods developed to merge pieces of information from different sentences are shown to effective in significantly raising the recall without much difference in precision.
10. Computing the P-value of the Information Content from an Alignment of Multiple Sequences
Authors:Nagarajan, Jones, Keich
Motivation: The efficient and accurate computation of p-values is an essential requirement for motif-finding and alignment tools. We show that the approximation algorithms used in two popular motif-finding programs, MEME and Consensus, can fail to accurately compute the p-value.
Results: We present two new algorithms: one for the evaluation of the p-values of a range of motif scores, and a faster one for the evaluation of the p-value of a single motif score. Both exhibit more reliability than existing algorithms, and the latter algorithm is comparable in speed to the fastest existing method.
11. Validation of Qualitative Models of Genetic Regulatory Networks by Model Checking: Analysis of the Nutritional Stress Response in Escherichia coli
Authors:Batt, Ropers, de Jong, Geiselmann, Mateescu, Page, Schneider
Motivation: The modeling and simulation of genetic regulatory networks has created the need for tools for model validation. The main challenges of model validation are the achievement of a match between the precision of model predictions and experimental data, as well as the efficient and reliable comparison of the predictions and observations.
Results: We present an approach towards the validation of models of genetic regulatory networks addressing the above challenges. It combines a method for qualitative modeling and simulation with techniques for model checking, and is supported by a new version of the computer tool GNA. The model-validation approach has been applied to the analysis of the network controlling the nutritional stress response in Escherichia coli. Availability: GNA and the model of the stress response network are available at http://www-helix.inrialpes.fr/gna
12. Identification of Nine Human-specific Frameshift Mutations by Comparative Analysis of the Human and the Chimpanzee Genome Sequences
Motivation: The recent release of the draft sequence of the chimpanzee genome is an invaluable resource for finding genome-wide genetic differences that might explain phenotypic differences between humans and chimpanzees.
Results: In this paper, we describe a simple procedure to identify potential human-specific frameshift mutations that occurred after the divergence of human and chimpanzee. The procedure involves collecting human coding exons bearing insertions or deletions compared to the chimpanzee genome and identification of homologs from other species supporting that the mutations are human-specific. Using this procedure, we identified nine genes, BASE, DNAJB3, FLJ33674, HEJ1, NTSR2, RPL13AP, SCGB1D4, WBSCR27, and ZCCHC13, that show human-specific alterations including truncations of the C-terminus. In some cases, the frameshift mutation results in gene inactivation or decay. In other cases, the altered protein seems to be functional. This study demonstrates that even the unfinished chimpanzee genome sequence can be useful in identifying modification of genes that are specific to the human lineage and therefore could potentially be relevant to the study of the acquisition of human-specific traits.
13. Kernel Methods for Predicting Protein-protein Interactions
Authors:Ben-Hur, Noble Paper
Despite advances in high throughput methods for discovering protein-protein interactions, the interaction networks of even well-studied model organisms are sketchy at best, highlighting the continued need for computational methods to help direct experimentalists in the search for novel interactions.
We present a kernel method for predicting protein-protein interactions using a combination of data sources, including protein sequences, Gene Ontology annotations, local properties of the network, and homologous interactions in other species. Whereas protein kernels proposed in the literature provide a similarity between single proteins, prediction of interactions requires a kernel between pairs of proteins. We propose a pairwise kernel that converts a kernel between single proteins into a kernel between pairs of proteins, and we illustrate the kernel's effectiveness in conjunction with a support vector machine classifier. Furthermore, we obtain improved performance by combining several sequence-based kernels based on k-mer frequency, motif and domain content and by further augmenting the pairwise sequence kernel with features that are based on other sources of data.
We apply our method to predict physical interactions in yeast using data from the BIND database. At a false positive rate of 1% the classifier retrieves close to 80% of a set of trusted interactions. We thus demonstrate the ability of our method to make accurate predictions despite the sizeable fraction of false positives that are known to exist in interaction databases.
14. Maximum Likelihood of Evolutionary Trees: Hardness and Approximation
Motivation: Maximum likelihood (ML) is an increasingly popular optimality criterion for selecting evolutionary trees (Felsenstein, 1981). Yet the computational complexity of ML was open for over 20 years, and only recently resolved by the authors (2005) for the Jukes Cantor model of substitution and its generalizations. It was proved that reconstructing the ML tree is computationally intractable (NP hard). In this work we explore three directions, which extend that result.
We show that ML under the assumption of molecular clock is still computationally intractable (NP hard).
We show that not only is it computationally intractable to find the exact ML tree, even approximating the logarithm of the maximum likelihood for any multiplicative factor smaller than 1.00175 is computationally intractable.
We develop an algorithm for approximating log-likelihood under the condition that the input sequences are sparse. It employs any approximation algorithm for parsimony, and asymptotically achieves the same approximation ratio. We note that ML reconstruction for sparse inputs is still hard under this condition, and furthermore many real datasets satisfy it.
15. Automatic Detection of Subsystem/Pathway Variants in Genome Analysis
Authors:Ye, Osterman, Overbeek, Godzik
Motivation: Proteins work together in pathways and networks, collectively comprising the cellular machinery. A subsystem (a generalization of pathway concept) is a group of related functional roles (such as enzymes) jointly involved in a specific aspect of the cellular machinery. Subsystems provide a natural framework for comparative genome analysis and functional annotation. A subsystem may be implemented in a number of different functional variants in individual species. In order to reliably project functional assignments across multiple genomes, we have to be able to identify the variants implemented in each genome. The analysis of such variants across diverse species is an interesting problem by itself and may provide new evolutionary insights. However, no computational techniques are presently available for an automated detection and analysis of subsystem variants.
Results: Here we formulate the subsystem variant detection problem as finding the minimum number of subgraphs of a subsystem, which is represented as a graph, and solve the optimization problem by integer programming approach. The performance of our method was tested on subsystems encoded in the SEED, a genomic integration platform developed by the Fellowship for Interpretation of Genomes (FIG) as a component of a large-scale effort on comparative analysis and annotation of multiple diverse genomes. Here we illustrate the results obtained for two expert-encoded subsystems of the biosynthesis of Coenzyme A and FMN/FAD cofactors. Applications of variant detection, to support genomic annotations and to assess divergence of species, are briefly discussed in the context of these universally conserved and essential metabolic subsystems.
16. Detecting Coevolving Amino Acid sites using Bayesian Mutational Mapping
Authors:Dimmic, Hubisz, Bustamante, Nielsen
Motivation: The evolution of protein sequences is constrained by complex interactions between amino acid residues. Because harmful substitutions may be compensated by other substitutions at neighboring sites, residues can coevolve. We describe a Bayesian phylogenetic approach to the detection of coevolving residues in protein families. This method, Bayesian mutational mapping (BMM), assigns mutations to the branches of the evolutionary tree stochastically, and then test statistics are calculated to determine whether a coevolutionary signal exists in the mapping. Posterior predictive P-values provide an estimate of significance, and specificity is maintained by integrating over uncertainty in the estimation of the tree topology, branch lengths, and substitution rates. A coevolutionary Markov model for codon substitution is also described, and this model is used as the basis of several test statistics.
Results: Results on simulated coevolutionary data indicate that the BMM method can successfully detect nearly all coevolving sites when the model has been correctly specified, and that nonparametric statistics such as mutual information are generally less powerful than parametric statistics. On a dataset of eukaryotic proteins from the phosphoglycerate kinase (PGK) family, interdomain site contacts yield a significantly greater coevolutionary signal than interdomain non-contacts, an indication that the method provides information about interacting sites. Failure to account for the heterogeneity in rates across sites in PGK resulted in a less discriminating test, yielding a marked increase in the number of reported positives at both contact and non-contact sites.
17. Modeling the Organization of the WUSCHEL Expression Domain in the Shoot Apical Meristem
Motivation: The above ground tissues of higher plants are generated from a small region of cells situated at the plant apex called the shoot apical meristem. An important genetic control circuit modulating the size of the Arabidopsis thaliana meristem is a feed-back network between the CLAVATA3 and WUSCHEL genes. Although the expression patterns for these genes do not overlap, WUSCHEL activity is both necessary and sufficient (when expressed ectopically) for the induction of CLAVATA3 expression. However, upregulation of CLAVATA3 in conjunction with the receptor kinase CLAVATA1 results in the down regulation of WUSCHEL. Despite much work, experimental data for this network is incomplete and additional hypotheses are needed to explain the spatial locations and dynamics of these expression domains. Predictive mathematical models describing the system should provide a useful tool for investigating and discriminating among possible hypotheses, by determining which hypotheses best explain observed gene expression dynamics.
Results: We are developing a method using in vivo live confocal microscopy to capture quantitative gene expression data and create templates for computational models. We present two models accounting for the organization of the WUSCHEL expression domain. Our preferred model uses a reaction-diffusion mechanism, in which an activator induces WUSCHEL expression. This model is able to organize the WUSCHEL expression domain. In addition, the model predicts the dynamical reorganization seen in experiments where cells, including the WUSCHEL domain, are ablated, and also predicts the spatial expansion of the WUSCHEL domain resulting from removal of the CLAVATA3 signal.
18. Efficient Computation of Close Lower and Upper Bounds on the Minimum Number of Recombinations in Biological Sequence Evolution
Authors:Song, Wu, Gusfield
Results: In this paper, we present efficient, practical methods for computing both upper and lower bounds on the minimum number of needed recombinations, and for constructing evolutionary histories that explain the input sequences. We extensively study the efficiency and accuracy of these algorithms on both simulated and real data sets. The algorithms produce very close upper and lower bounds, which match exactly in a surprisingly wide range of data. Thus, with the use of new, very effective lower bounding methods and an efficient algorithm for computing upper bounds, this approach allows the efficient exact computation of the minimum number of needed recombinations, with high frequency in a large range of data. When upper and lower bounds match, evolutionary histories found by our algorithm correspond to most parsimonious histories.
The classification of protein sequences obtained from patients with various immunoglobulin-related conformational diseases may provide insight into structural correlates of pathogenicity. However, clinical data are very sparse and, in the case of antibody-related proteins, the collected sequences have large variability with only a small subset of variations relevant to the protein pathogenicity (function). On this basis, these sequences represent a model system for development of strategies to recognize the small subset of function-determining variations among the much larger number of primary structure diversifications introduced during evolution. Under such conditions, most protein classification algorithms have limited accuracy. To address this problem, we propose a Support Vector Machine (SVM)-based classifier that combines sequence and 3D structural averaging information. Each amino acid in the sequence is represented by a set of six physicochemical properties: hydrophobicity, hydrophilicity, volume, surface area, bulkiness, and refractivity. Each position in the sequence is described by the properties of the amino acid at that position and the properties of its neighbors in 3D space or its neighbors in the sequence. A structure template is selected to determine neighbors in 3D space and a window size is used to determine the neighbors in the sequence. The test data consists of 209 proteins of human antibody immunoglobulin light chains, each represented by aligned sequences of 120 amino acids. The methodology is applied to the classification of protein sequences collected from patients with and without amyloidosis, and indicates that the proposed modified classifiers are more robust to sequence variability than standard SVM classifiers, improving classification error between 5-25% and sensitivity between 9-17%. The classification results might also suggest possible mechanisms for the propensity of immunoglobulin light chains to amyloid formation.
20. A Statistical Method for Detecting Splice Variation from Expression Data
Authors:Cline, Blume, Cawley, Clark, Hu, Lu, Salomonis, Wang, Williams
Motivation: Many or most mammalian genes undergo alternative splicing, generating a variety of transcripts from a single gene. New information on splice variation is becoming available through technology for measuring expression levels of several exons or splice junctions per gene. We have developed a statistical method, ANOSVA, to detect alternative splicing from expression data. ANOSVA requires no transcript information, so can be applied when the level of annotation data is poor. When validated against spiked clone data, it generated no false positives and few false negatives. We demonstrated ANOSVA with data from a prototype mouse alternative splicing array, yielding a set of genes with evidence of tissue-specific splice variation. These results are available at http://bioinfo.affymetrix.com/Papers/ANOSVA.
Motivation: Computational approaches to protein function prediction infer protein function by finding proteins with similar sequence, structure, surface clefts, chemical properties, amino acid motifs, interaction partners or phylogenetic profiles. We present a new approach that combines sequential, structural and chemical information into one graph model of proteins. We predict functional class membership of enzymes and non-enzymes using graph kernels and Support Vector Machine classification on these protein graphs.
Results: Our graph model, derivable from protein sequence and structure only, is competitive with vector models that require additional protein information such as the size of surface pockets. If we include this extra information into our graph model, our classifier yields significantly higher accuracy levels than the vector models. Hyperkernels allow us to select and to optimally combine the most relevant node attributes in our protein graphs. We have laid the foundation for a protein function prediction system that integrates protein information from various sources efficiently and effectively.
Availability: More information available via www.dbs.ifi.lmu.de/Mitarbeiter/borgwardt.html.
22. Robust Classification Modeling on Microarray Data Using Misclassification Penalized Posterior
Authors:Soukup, Lee, Cho
Genome-wide microarray data are often used in challenging classification problems of subtypes of human diseases. However, the identification of a parsimonious robust prediction model that performs consistently well on future independent data has not been successful due to the biased model selection from an extremely large number of candidate models during the classification model search and construction. Furthermore, common criteria of prediction model performance such as classification error rates do not provide a sensitive measure for valuating performance of such astronomic competing models. Also, even though several different classification approaches have been utilized to tackle such classification problems, no direct comparison on these methods have been made. We introduce a novel measure for assessing the performance of a prediction model, the misclassification-penalized posterior (MiPP), the sum of the posterior classification probabilities penalized by the number of incorrectly classified samples. Using MiPP, we implement a forward step-wise cross-validated procedure to find our optimal prediction models with different numbers of features on a training set. Our final robust classification model and its dimension are determined based on a completely independent test data set. This MiPP-based classification modeling approach enables us to identify the most parsimonious robust prediction models only with two or three features on well-known microarray data sets. These models show superior performance to other models in the literature that often have more than 40--100 features in their model construction. Our classification algorithm is available at the Bioconductor web site as the MiPP package.
23. CaSPredictor: A new Computer-based Tool for Caspase Substrate Prediction
Motivation: The in vitro studies have shown that the most remarkable catalytic features of caspases, a family of cys-teineproteases, are their stringent specificity to Asp (D) in the S1 subsite and at least four amino acid to the left of scis-sile bound. However, there is little information about the substraterecognition patterns in vivo. The prediction and-characterization of proteolytic cleavage sites in natural sub-strates could be useful for uncovering these structural rela-tionships.
Results: PEST-like sequences rich in the amino acids Ser (S), Thr (T), Pro (P), Glu or Asp (E/D), including Asn (N) and Gln (Q) are adjacent structural/sequential elements in the majority of cleavage site regions of the natural caspase sub-strates described in the literature, supporting its possible implication in the substrate selection by caspases. We de-veloped CaSPredictor, a software which incorporated a PEST-like index and the position-dependent amino acid ma-trices for prediction of caspase cleavage sites in individual proteins and protein datasets. The program predicted suc-cessfully 81% (111/137) of the cleavage sites in experimen-tally verified caspase substrates not annotated in its internal data file. Its accuracy and confidence was estimated as 80% using ROC methodology. The program was much more effi-cient to predict caspase substrates when compared to Pep-tideCutter and PEPS softwares. Finally, the program de-tected potential cleavage sites in the primary sequences of 1644 proteins in a dataset containing 9986 protein entries.
24. Experimental Design for Three-Color and Four-Color Gene Expression Microarrays
Authors:Woo, Krueger, Kaur, Churchill
Motivation: Compared to two-color microarrays, three-color microarrays can increase design efficiency and power to detect differential expression without additional samples and arrays. Furthermore, three-color microarray technology is currently available at a reasonable cost. Despite the potential advantages, clear guidelines for designing and analyzing three-color experiments do not exist.
Results: We propose a three- and four-color cyclic design (loop) and a complementary graphical representation to help design experiments that are balanced, efficient, and robust to hybridization failures. In theory, three-color loop designs are more efficient than two-color loop designs. Experiments using both two- and three-color platforms were performed in parallel and its outputwere analyzed using linearmixedmodel analysis in R/MAANOVA. These results demonstrate that three-color experiments using the same number of samples (and fewer arrays) will perform no less efficiently than two-color experiments. The improved efficiency of the design is somewhat offset by a reduced dynamic range and increased variability in the three-color experimental system. This result suggests that, with minor technological improvements, three-color microarrays using loop designs could detect differential expression more efficiently than two-color loop designs.
25. Families of Membranous Proteins can be Characterized by the Amino Acid Composition of their Transmembrane Domains
In eukaryotes, membranous proteins account for 20-30% of the proteome. Most of these proteins contain one or more transmembrane (TM) domains. These are short segments that transverse the bilayer lipid membrane. Various properties of the TM domains, such as their number, their topology and their arrangement within the membrane are closely related to the protein’s cellular functions. Properties of the TM domains also determine the cellular targeting and localization of these proteins. It is not known, however, whether the information encoded by TM domains suffices for the purpose of classifying proteins into their functional families. This is the question we address here. We introduce an algorithm that creates a profile of each functional family of membranous proteins based only on the amino-acid composition of their TM domains. This is complemented by a classifier program for each such family (to determine whether a given protein belongs to it or not). We find that in most instances TM domains contain enough information to allow an accurate discrimination of ~80% sensitivity and ~90% specificity among unrelated polytopic functional families with the same number of TM domains.
26. A Hidden Markov Model for Analyzing ChIP-chip Experiments on Genome Tiling Arrays and its Application to p53 Binding Sequences
Authors:Li, Meyer, Liu
Motivation: Transcription factors (TFs) regulate gene ex-pression by recognizing and binding to specific regulatory regions on the genome, which in higher eukaryotes can oc-cur far away from the regulated genes. Recently Affymetrix developed the high-density oligonucleotide arrays that tile all the non-repetitive sequences of the human genome at 35-bp resolution. This new array platform allows for the unbiased mapping of in vivo TF binding sequences (TFBSs) using Chromatin ImmunoPrecipitation followed by microarray ex-periments (ChIP-chip). The massive data set generated from these experiments pose great challenges for data analysis.
Results: We developed a fast, scalable and sensitive method to extract TFBSs from ChIP-chip experiments on genome tiling arrays. Our method takes advantage of tiling array data from many experiments to normalize and model the behavior of each individual probe, and identifies TFBSs using a Hidden Markov Model (HMM). When applied to the data of p53 ChIP-chip experiments (Cawley et al., 2004), our method discovered many new high confidence p53 targets including all the regions verified by quantitative PCR . Using a de novo motif finding algorithm MDscan (Liu et al., 2002), we also recovered the p53 motif from our HMM identified p53 target regions. Furthermore, we found substantial p53 motif enrichment in these regions comparing with both ge-nomic background and the TFBSs identified by Cawley et al (2004). Several of the newly identified p53 TFBSs are in known genes’ promoter regions or associated with previ-ously characterized p53-responsive genes.
27. High-Throughput Inference of Protein-Protein Interfaces from Unassigned NMR Data
Authors:Mettu, Lilien, Donald
We cast the problem of identifying protein-protein interfaces, using only unassigned NMR spectra, into a geometric clustering problem. Identifying protein-protein interfaces is critical to understanding inter- and intra-cellular communication, and NMR allows the study of protein interaction in solution. However it is often the case that NMR studies of a protein complex are very time-consuming, mainly due to the bottleneck in assigning the chemical shifts, even if the apo structures of the constituent proteins are known. We study whether it is possible, in a high-throughput manner, to identify the interface region of a protein complex using only unassigned chemical shift and residual dipolar coupling (RDC) data.
We introduce a geometric optimization problem where we must cluster the cells in an arrangement on the boundary of a 3-manifold, where the arrangement is induced by a spherical quadratic form. We show that this formalism derives directly from the physics of RDCs. We present an optimal algorithm for this problem that runs in O(n^3 log n) time for an n-residue protein. We then use this clustering algorithm as a subroutine in a practical algorithm for identifying the interface region of a protein complex from unassigned NMR data. We present the results of our algorithm on NMR data for 7 proteins from 5 protein complexes and show that our approach is useful for high-throughput applications in which we seek to rapidly identify the interface region of a protein complex.
28. Multi-way Clustering of Microarray Data using Probabilistic Sparse Matrix Factorization
Authors:Dueck, Morris, Frey
We address the problem of multi-way clustering of microarray data using a generative model. Our algorithm, probabilistic sparse matrix factorization (PSMF), is a probabilistic extension of a previous hard-decision algorithm for this problem. PSMF allows for varying levels of sensor noise in the data, uncertainty in the hidden prototypes used to explain the data, and uncertainty as to which prototypes are selected to explain each data vector. We present experimental results demonstrating that our method can better recover functionally-relevant clusterings in mRNA expression data than standard clustering techniques, including hierarchical agglomerative clustering, and we show that by computing probabilities instead of point estimates, our method avoids converging to poor solutions.
29. Tag SNP Selection in Genotype Data for Maximizing SNP Prediction Accuracy
Authors:Halperin, Kimmel, Shamir
Motivation: The search for genetic regions associated with complex disease, such as cancer or Alzheimer's disease, is an important challenge that may lead to better diagnosis and treatment. The existence of millions of DNA variations, primarily single nucleotide polymorphisms (SNPs), may allow the fine dissection of such associations. However, studies seeking disease association are limited by the cost of genotyping SNPs. Therefore, it is essential to find a small subset of informative SNPs (tag SNPs) that may be used as good representatives of the rest of the SNPs.
Results: We define a new natural measure for evaluating the prediction accuracy of a set of tag SNPs, and use it to develop a new method for tag SNPs selection. Our method is based on a novel algorithm that predicts the values of the rest of the SNPs given the tag SNPs. In contrast to most previous methods, our prediction algorithm uses the genotype information and not the haplotype information of the tag SNPs. Our method is very efficient, and it does not rely on having a block partition of the genomic region.
We compared our method to two state of the art tag SNP selection algorithms on 58 different genotype data sets from four different sources. Our method consistently found tag SNPs with considerably better prediction ability than the other methods.
30. Three-Stage Prediction of Protein Beta-Sheets by Neural Networks, Alignments, and Graph Algorithms
Motivation: Protein beta-sheets play a fundamental role in protein structure, function, evolution, and bio-engineering. Accurate prediction and assembly of protein beta-sheets, however, remains challenging because protein beta-sheets require formation of hydrogen bonds between linearly distant residues. Previous approaches for predicting beta-sheet topological features, such as beta-strand alignments, in general have not exploited the global covariation and constraints characteristic of beta-sheet architectures.
Results: We propose a modular approach to the problem of predicting/assembling protein beta-sheets in a chain by integrating both local and global constraints in three steps. The first step uses recursive neural networks to predict pairing probabilities for all pairs of inter-strand beta-residues from profile, secondary structure, and solvent accessibility information. The second step applies dynamic programming techniques to these probabilities to derive binding pseudo-energies and optimal alignments between all pairs of beta-strands. Finally, the third step, uses graph matching algorithms to predict the beta-sheet architecture of the protein by optimizing the global pseudo-energy while enforcing strong global beta-strand pairing constraints. The approach is evaluated using cross-validation methods on a large non-homologous dataset and yields significant improvements over previous methods.
31. Mining ChIP-chip Data for Transcription Factor and Cofactor Binding Sites
Authors:Smith, Sumazin, Das, Zhang
We describe methodology to identify de novo individual and interacting pairs of binding site motifs from ChIP-chip data, using an algorithm that integrates localization data directly into the motif discovery process. We combine matrix-enumeration-based motif discovery with multi-variate regression to evaluate candidate motifs and identify motif interactions. When applied to the HNF localization data of Odom et al. (2004) in liver and pancreatic islets, our methods produce motifs that are either novel or improved known motifs. All motif pairs identified to predict localization are further evaluated according to how well they predict expression in liver and islets, and according to how conserved are the relative positions of their occurrences. We find that interaction models of HNF1 and CDP motifs provide excellent prediction of both HNF1 localization and gene expression in liver. Our results demonstrate that ChIP-chip data can be used to identify interacting binding site motifs.
32. Improving Protein Structure Prediction With Model-Based Search
De novo protein structure prediction can be formulated as search in a high-dimensional space. One of the most frequently used computational tools to solve such search problems is the Monte Carlo method. We present a novel search technique, called model-based search. This method samples the high-dimensional search space to build an approximate model of the underlying function. This model is incrementally refined in areas of interest, while areas that are not of interest are excluded from further exploration. Model-based search derives its efficiency from the fact that the information obtained during the exploration of the search space is used to guide further exploration. In contrast, Monte Carlo-based techniques are memory-less and exploration is performed based on random walks, ignoring the information obtained in previous steps. Model-based search is applied to protein structure prediction, where search is employed to find the global minimum of the protein's energy landscape. We show that model-based search uses computational resources more efficiently to find lower-energy conformations of proteins when compared to one of the leading protein structure prediction methods, which relies on a tailored Monte Carlo method to perform search. The performance improvements become more pronounced as the dimensionality of the search space increases. We show that model-based search enables more accurate protein structure prediction than previously possible. Furthermore, we believe that similar performance improvements can be expected in other problems that are currently solved using Monte Carlo-based search methods.
33. GenXHC: A Probabilistic Generative Model for Cross-hybridization Compensation in High-density Genome-wide Microarray Data
Authors:Huang, Morris, Frey Paper
Microarray designs containing millions to hundreds of millions of probes that tile entire genomes are currently being released. Within the next 2 months, our group will release a microarray data set containing over 12,000,000 microarray measurements taken from 37 mouse tissues. A problem that will become increasingly significant in the upcoming era of genome-wide exon-tiling microarray experiments is the removal of cross-hybridization noise. We present a probabilistic generative model for cross-hybridization in microarray data and a corresponding variational learning method for cross-hybridization compensation, GenXHC, that reduces cross-hybridization noise by taking into account multiple sources for each mRNA expression level measurement and prior knowledge of hybridization similarities between the nucleotide sequences of microarray probes and their target cDNAs. The algorithm is applied to a subset of an exon-resolution genome-wide Agilent microarray data set for chromosome 16 of Mus musculus and is found to produce statistically significant reductions in cross-hybridization noise. The denoised data is found to produce enrichment in multiple Gene Ontology-Biological Process (GO-BP) functional groups. The algorithm is found to outperform Robust Multi-array Analysis, another method for cross-hybridization compensation.
34. Prediction of Active Sites for Protein Structures from Computed Chemical Properties
Authors:Ko, Murga, Ondrechen
Identification of functional information for a protein from its three-dimensional structure is a major challenge in genomics. The power of theoretical microscopic titration curves (THEMATICS), when coupled with a statistical analysis, provides a method for high-throughput screening for identification of catalytic sites and binding sites with high accuracy and precision. The method requires only the 3D structure of the query protein as input, but performs as well as other methods that depend on sequence alignments and structural similarities.
35. RASE: Recognition of Alternatively Spliced Exons in C. elegans
Authors:Raetsch, Sonnenburg, Schoelkopf
Pre-mRNA alternative splicing greatly increases the complexity of gene expression. Estimates show that more than half of the human genes and at least a third of the genes of less complex organisms are alternatively spliced. In this work we consider one major form of alternative splicing: the exclusion of exons from the transcript. It was shown that alternatively spliced exons have properties that distinguish them from other exons. While most computational studies on alternative splicing only apply to exons which are conserved among species, our method only uses information available to the splicing machinery. We employ advanced machine learning techniques in order to answer two questions: a) Is a certain exon alternatively spliced? b) How to identify yet unidentified exons within verified introns?
We designed a SVM kernel well suited for the task of classifying sequences with motifs having positional preferences. In order to solve the task a), we combine the kernel with additional local sequence information. The resulting SVM based classifier achieves a true positive rate of 48,5% at a false positive rate of 1%. By scanning over EST-confirmed exons we identified 215 potentially alternatively spliced exons. For 10 randomly selected exons we successfully performed biological verification experiments and confirmed 3 novel alternatively spliced exons. To answer question b), we used SVM based predictions to recognize acceptor and donor splice sites. Combined with the abovementioned features we were able to identify 85,2% of skipped exons within verified introns at a false positive rate of 1%.
36. In silico Identification of Functional Regions in Proteins
Authors:Nimrod, Glaser, Steinberg, Ben-Tal, Pupko
Motivation: In silico prediction of functional regions on protein surfaces, i.e., sites of interaction with DNA, ligands, substrates and other proteins, is of utmost importance in various applications in the emerging fields of proteomics and structural genomics. When a sufficient number of homologs is found, powerful prediction schemes can be based on the observation that evolutionarily conserved regions are often functionally important. However, it is challenging to unambiguously identify the boundaries of the functional regions; typically, only the principal functionally-important region of the protein is detected, while secondary functional regions with weaker conservation signals are overlooked.
Methods: We present a new methodology, called PatchFinder, that automatically identifies patches of conserved residues that are located in close proximity to each other on the protein surface. PatchFinder is based on the following steps: (1) Assignment of conservation scores to each amino acid position on the protein surface. (2) Assignment of a score to each putative patch, based on its likelihood to be functionally important. The patch of maximum likelihood is considered to be the main functionally-important region, and the search is continued for non-overlapping patches of secondary importance. Results: We examined the accuracy of the method using the IGPS enzyme, the SH2 domain and a benchmark set of 112 proteins. These examples demonstrated that PatchFinder is capable of identifying both the main and secondary functional patches.
37. Enhanced Position Weight Matrices using Mixture Models
Motivation: Positional weight matrix (PWM) is derived from a set of experimentally determined binding sites. Here we explore whether there exist subclasses of binding sites, and if the mixture of these subclass-PWMs can improve the binding site prediction. Intuitively, the subclasses correspond to either distinct binding preference of the same transcription factor in different contexts or distinct subtypes of the transcription factor.
Result: We report an Expectation Maximization algorithm adapting the mixture model of Baily and Elkan. We assessed the relative merit of using two subclass-PWMs. The resulting PWMs were evaluated with respect to preferred conservation (relative to mouse) of potential sites in human promoters and expression coherence of the potential target genes. Based on 64 JASPAR vertebrate PWMs, 61%-81% of the cases resulted in a higher conservation using the mixture model. Also in 98% of the cases the expression coherence was higher for the target genes of one of the subclass-PWMs. Our analysis of REB1 sites is consistent with previously discovered subtypes using independent methods. Additionally application of our method to mutated sites for transcription factor LEU3 reveals subclasses that segregate into strongly binding and weakly binding sites with p-value of 0.008. This is the first study which attempts to quantify the subtly different binding specificities of a transcription factor at a large scale and suggests the use of a mixture of PWMs instead of the current practice of using a single PWM for a transcription factor.
38. How Old is Your Fold?
Authors:Winstanley, Abeln, Deane
Motivation: At present there exists no age estimate for the different protein structures found in nature. It has become clear from occurrence studies that different folds arose at different points in evolutionary time. An estimation of the age of different folds would be a starting point for many investigations into protein structure evolution: how we arrived at the set of folds we see today. It would also be a powerful tool in protein structure classification allowing us to reassess the available hierarchical methods and perhaps suggest improvements.
Results: We have created the first relative age estimation technique for protein folds. Our method is based on constructing parsimonious scenarios which can describe occurrence patterns in a phylogeny of species. The ages presented are shown to be robust to the different trees or data types used for their generation. They show correlations with other previously used protein age estimators, but appear to be far more discriminating than any previously suggested technique. The age estimates given are not absolutes but they already offer intriguing insights, like the very different age patterns of alpha/beta folds compared to small folds. The alpha/beta folds appear on average to be far older than their small fold counterparts.
39. De Novo Identification of Repeat Families in Large Genomes
Authors:Price, Jones, Pevzner
De novo repeat family identification is a challenging algorithmic problem of great practical importance. As the number of genome sequencing projects increases, there is a pressing need to identify the repeat families present in large, newly sequenced genomes. We develop a new method for de novo identification of repeat families via extension of consensus seeds; our method enables a rigorous definition of repeat boundaries, a key issue in repeat analysis.
Our RepeatScout algorithm is more sensitive and orders of magnitude faster than RECON, the dominant tool for de novo repeat family identification in newly sequenced genomes. Using RepeatScout, we estimate that roughly 2% of the human genome and 4% of mouse and rat genomes consist of previously unannotated repetitive sequence.
40. Search for Folding Nuclei in Native Protein Structures
The problem of finding folding nuclei (a set of native contacts that play an important role in folding) along with identifying folding pathways (a time ordered sequence of folding events) of proteins is one of the most important problems in protein chemistry. Here we propose a novel and simple approach to address this problem as follows: given the topology of the native state, identify native contacts that form folding nuclei based on the graph theoretical approach that considers effective contact order (effective loop closure) as its objective function.
Motivation: A number of computational methods for the prediction of folding nuclei already exists in the literature, but most of them rely on restrictive assumptions about the nature of nuclei or the process of folding. Our motivation is to develop a simple, efficient and robust algorithm to find an ensemble of pathways with the lowest effective contact order and to identify contacts that are crucial for folding.
Results: Our approach is different from the previously used methods in that it uses efficient graph algorithms, and does not formulate restrictive assumptions about folding nuclei. Our predictions provide more details concerning the protein folding pathway than most other methods in literature. We demonstrate the success of our approach by predicting folding nuclei for a data set of proteins for which experimental kinetic data is available. We show that our method compares favourably with other methods in literature and its results agree with experimental results.
41. Automatic Interpretation of MS/MS Spectra of Oligosaccharides
Authors:Tang, Mechref, Novotny
The emerging glycomics and glycoproteomics projects aim to characterize all forms of glycoproteins in different tissues and organisms. Tandem mass spectrometry is the key experimental methodology for high throughput glycan identification and characterization. Fragmentation of glycans from high energy collision-induced dissociation (CID) generates ions from glycosidic as well as internal cleavages. The cross-ring ions resulting from internal cleavages provide additional information that is important to reveal the type of linkage between monosaccharides. This information, however, is not incorporated by the current programs for analyzing glycan mass spectra. As a result, they can rarely distinguish isomeric oligosaccharides from the mass spectra, which have the same saccharide composition but different types of sequences, branches or linkages.
In this paper, we describe a novel algorithm for glycan characterization using tandem mass spectrometry (MS/MS). This algorithm consists of three steps. First we develop a scoring scheme to identify potential bond linkages between monosaccharides, based on the appearance pattern of cross-ring ions. Next, we use a dynamic programming algorithm to determine the most probable oligosaccharide structures from the mass spectrum. Finally, we re-evaluate these oligosaccharide structures, taking into account the double fragmentation ions. We also show the preliminary results of testing our algorithm on several MS/MS spectra of oligosaccharides.
42. A Path Planning Approach for Computing Large-Amplitude Motions of Flexible Molecules
Authors:Cortes, Simeon, Ruiz de Angulo, Guieysse, Remaud-Simeon, Tran
Motivation: Motion is inherent in molecular interactions. Molecular flexibility must be taken into account in order to develop accurate computational techniques for predicting interactions. Energy-based methods currently used in molecular modeling (i.e. molecular dynamics, Monte Carlo algorithms) are, in practice, only able to compute local motions while accounting for molecular flexibility. However, large-amplitude motions often occur in biological processes. We investigate the application of geometric path planning algorithms to compute such large motions in flexible molecular models. Our purpose is to exploit the efficacy of a geometric conformational search as a filtering stage before subsequent energy refinements.
Results: Two kinds of large-amplitude motion are treated in this paper: protein loop conformational changes (involving protein backbone flexibility) and ligand trajectories to deep active sites in proteins (involving ligand and protein side-chain flexibility). First studies performed using our two-stage approach (geometric search followed by energy refinements) show that, compared to classical molecular modeling methods, quite similar results can be obtained with a performance gain of several orders of magnitude. Furthermore, our results also indicate that the geometric stage can provide by itself highly valuable information to biologists.
Availability: The algorithms have been implemented in the general-purpose motion planning software Move3D (Simeon et al., 2001). We are currently working on an optimized stand-alone library that will be available to the scientific community.
43. Clustering Short Time Series Gene Expression Data
Authors:Ernst, Nau, Bar-Joseph
Motivation: Time series expression experiments are used to study a wide range of biological systems. More than 80% of all time series expression datasets are short (8 time points or fewer). These datasets present unique challenges. Due to the large number of genes profiled (often tens of thousands) and the small number of time points many patterns are expected to arise at random. Most clustering algorithms are unable to distinguish between real and random patterns.
Results: We present an algorithm specifically designed for clustering short time series expression data. Our algorithm works by assigning genes to a pre-defined set of model profiles that capture the potential distinct patterns that can be expected from the experiment. We discuss how to obtain such a set of profiles and how to determine the significance of each of these profiles. Significant profiles are retained for further analysis and can be combined to form clusters. We tested our method on both simulated and real biological data. Using immune response data we show that our algorithm can correctly detect the temporal profile of relevant functional categories. Using GO analysis we show that our algorithm outperforms both general clustering algorithms and algorithms designed specifically for clustering time series gene expression data.
44. PILER: Identification and Classification of Genomic Repeats
Repeated elements such as satellites and transposons are ubiquitous in eukaryotic genomes. De novo computational identification and classification of such elements is a challenging problem, so repeat annotation of sequenced genomes has historically largely relied on sequence similarity to hand-curated libraries of known repeat families. We present a new approach to de novo repeat annotation that exploits characteristic patterns of local alignments induced by certain classes of repeats. We describe PILER, a package of efficient search algorithms for identifying such patterns. Novel repeats found using PILER are reported for H. sapiens, A. thalania and D. melanogaster. The software is freely available at http://www.drive5.com/piler.
45. Predicting the in vivo Signature of Human Gene Regulatory Sequences
In the living cell nucleus, genomic DNA is packaged into chromatin. DNA sequences that regulate transcription and other chromosomal processes are associated with local disruptions, or "openings," in chromatin structure caused by the cooperative action of regulatory proteins. Such perturbations are extremely specific for em cis-regulatory elements, and occur over short stretches of DNA (typically approximately 250 bp). They can be detected experimentally as DNaseI hypersensitive sites (HSs) in vivo, though the process is extremely laborious and costly. The ability to discriminate DNaseI HSs computationally would have a major impact on the annotation and utilization of the human genome.
We found that a supervised pattern recognition algorithm, trained using a set of 280 DNaseI HS and 737 non-HS control sequences from erythroid cells, was capable of de novo prediction of HSs across the human genome with surprisingly high accuracy determined by prospective in vivo validation. Systematic application of this computational approach will greatly facilitate discovery and analysis of functional non-coding elements in the human and other complex
46. The Precise Positioning of Genomic Landmarks Can Aid in the Identification of Regulatory Elements
Motivation: The transcription start site (TSS) has been located for an increasing number of genes across several organisms. Statistical tests have shown that some cis-acting regulatory elements have positional preferences with respect to the TSS, but few strategies have emerged for locating elements by their positional preferences. This paper elaborates such a strategy. First, we align promoter regions without gaps, anchoring the alignment on each promoter’s TSS. Second, we apply a novel word-specific mask. Third, we apply a clustering test related to gapless BLAST statistics. The test examines whether any specific word is placed unusually consistently with respect to the TSS. Finally, our program A-GLAM, an extension of the GLAM program, uses significant word positions as new “anchors” to realign the sequences. A Gibbs sampling algorithm then locates putative cis-acting regulatory elements. Usually, Gibbs sampling requires a preliminary masking step, to avoid convergence onto a dominant but uninteresting signal from a DNA repeat. The positional anchors focus A-GLAM on the motif of interest, however, so masking DNA repeats during Gibbs sampling becomes unnecessary.
Results: In a set of human DNA sequences with experimentally characterized TSSs, the placement of 801 octonucleotide words was unusually consistent (multiple-test corrected p < 0.05). Alignments anchored on these words sometimes located statistically significant motifs inaccessible to GLAM or AlignACE.
47. Predicting Protein-Protein Interaction by Searching Evolutionary Tree Automorphism Space
Authors:Jothi, Kann, Przytycka
Motivation: Uncovering protein-protein interaction network is a fundamental step in the quest for understanding the molecular machinery of a cell. This motivates the search for efficient computational methods for predicting such interactions. Among the available predictors are those that are based on the co-evolution hypothesis "evolutionary trees of protein families (that are known to interact) are expected to have similar topologies." Many of these methods are limited by the fact that they can only handle a small number of protein sequences. Also, details on evolutionary tree topology are missing as they use similarity matrices in lieu of the trees.
Results: We introduce MORPH, a new algorithm for predicting protein interaction partners between members of two protein families that are known to interact. Our approach can also be seen as a new method for searching the best superposition of the corresponding evolutionary trees based on tree automorphism group. We discuss relevant facts related to predictability of protein-protein interaction based on their co-evolution. When compared to related computational approaches, our method reduces the search space by about $3\times 10^5$-fold, and at the same time increases the prediction accuracy of correct binding partners.
48. Reversal Distance for Partially Ordered Genomes
Authors:Zheng, Lenert, Sankoff
Motivation: The total order of the genes or markers on a chromosome inherent in its representation as a signed permutation must often be weakened to a partial order in the case of real data. This is due to lack of resolution, where several genes are mapped to the same chromosomal position, to missing data from some of the datasets used to compile a gene order, and to conflicts between these datasets. The available genome rearrangement algorithms, however, require total orders as input. A more general ap-proach is needed to handle rearrangements of gene partial orders.
Results: We formalize the uncertainty in gene order data by representing a chromosome from each genome as a partial order, summarized by a directed acyclic graph (DAG). The rearrangement problem is then to infer a minimal sequence of reversals for transforming any topological sort of one DAG to any one of the other DAG. Each topological sort represents a possible linearization compatible with all the datasets on the chromosome. The set of all possible topological sorts is embedded in each DAG by appropriately augmenting the edge set, so that it becomes a general directed graph (DG). The DGs representing chromosomes of two genomes are combined to produce a bicoloured graph from which we extract a maximal decomposition into alternating coloured cycles, from which an optimal sequence of reversals can usually be identified. We test this approach on simulated incomplete comparative maps and on cereal chromosomal maps drawn from the Gramene browser.
49. Supervised Enzyme Network Inference from the Integration of Genomic Data and Chemical Information
Authors:Yamanishi, Vert, Kanehisa
The metabolic network is an important biological network which relates enzyme proteins and chemical compounds. A large number of metabolic pathways remain unknown nowadays, and many enzymes are missing even in known metabolic pathways. There is therefore an incentive to develop methods to reconstruct the unknown parts of the metabolic network, and to identify genes coding for missing enzymes.
This paper presents new methods to infer enzyme networks from the integration of multiple genomic data and chemical information, in the framework of supervised graph inference. The originality of the methods is the introduction of chemical compatibility as a constraint for refining the network predicted by the network inference engine. The chemical compatibility between two enzymes is obtained automatically from the information encoded by their EC numbers. The proposed method are tested and compared on their ability to infer the enzyme network of the yeast Saccharomyces cerevisiae from four datasets for enzymes with assigned EC numbers: gene expression data, protein localization data, phylogenetic profiles, and chemical compatibility information. It is shown that the prediction accuracy of the network reconstruction consistently improves due to the introduction of chemical constraints, the use of a supervised approach, and the weighted integration of multiple datasets. Finally, we conduct a comprehensive prediction of a global enzyme network consisting all enzyme candidate proteins of the yeast to obtain new biological findings.
50. ExonHunter: A Comprehensive Approach to Gene Finding
Authors:Brejova, Brown, Li, Vinar
We present ExonHunter, a new and comprehensive gene finding system that outperforms existing systems and features several new ideas and approaches. Our system combines numerous sources of information (genomic sequences, ESTs, and protein databases of related species) into a gene finder based on a hidden Markov model in a novel and systematic way. In our framework, various sources of information are expressed as partial probabilistic statements about positions in the sequence and their annotation. We then combine these into the final prediction via a quadratic programming method, which we show is an extension of existing methods. Allowing only partial statements is key to our transparent handling of missing information and coping with the heterogeneous character of individual sources of information. As well, we give a new method for modeling the length distribution of intergenic regions in hidden Markov models. Results: On a commonly used test set, ExonHunter performs significantly better than the existing gene finders ROSETTA, SLAM, or TWINSCAN, with more than two thirds of genes predicted completely correctly.
51. Whole-proteome Prediction of Protein Function via Graph-theoretic Analysis of Interaction Maps
Authors:Nabieva, Jim, Agarwal, Chazelle, Singh
Motivation: Determining protein function is one of the most important problems in the post-genomic era. For the typical proteome, there are no functional annotations for one-third or more of its proteins. Recent high-throughput experiments have determined proteome-scale protein physical interaction maps for several organisms. These physical interactions are complemented by an abundance of data about other types of functional relationships between proteins, including genetic interactions, knowledge about co-expression, and shared evolutionary history. Taken together, these pairwise linkages can be used to build whole-proteome protein interaction maps.
Results: We develop a network-flow based algorithm, FunctionalFlow, that exploits the underlying structure of protein interaction maps in order to predict protein function. In cross-validation testing on the yeast proteome, we show that FunctionalFlow has improved performance over previous methods in predicting the function of proteins with few (or no) annotated protein neighbors. By comparing several methods that use protein interaction maps to predict protein function, we demonstrate that FunctionalFlow performs well because it takes advantage of both network topology and some measure of locality. Finally, we show that performance can be improved substantially as we consider multiple data sources and use them to create weighted interaction networks.
52. Conservative Extraction of Over-represented Extensible Motifs
Authors:Apostolico, Comin, Parida
The discovery of motifs in biosequences is frequently torn between the rigidity of the model on the one hand and the abundance of candidates on the other. Part of the problem seems rooted, in the various characterizations offered for the notion of a motif, that are typically based either on syntax or on statistics alone. It seems worthwhile to consider alternatives that result from a prudent combination of these two aspects in the model. We introduce and study a notion of extensible motif in a sequence which tightly combines the structure of the motif pattern, as described by its syntactic specification, with the statistical measure of its occurrence count. We show that a combination of appropriate saturation conditions (expressed in terms of minimum number of don't cares compatible with a given list of of occurrences) and the monotonicity of probabilistic scores over regions of constant frequency afford us significant parsimony in the generation and testing of candidate over-represented motifs. The merits of the method are documented by results obtained in implementation, which specifically targeted protein sequence families. In all cases tested, the motif reported in PROSITE as most important in terms of functional/structural relevance emerges among the top thirty extensible motifs returned by our algorithm, often right at the top. Of equal importance seems the fact that the sets of all surprising motifs returned in each experiment are extracted faster and come in much more manageable sizes than would be obtained in the absence of saturation constrains.
53. Mining Coherent Dense Subgraphs Across Massive Biological Networks for Functional Discovery
Authors:Hu, Yan, Huang, Han, Zhou
Motivation: The rapid accumulation of biological network data translates into an urgent need for computational meth-ods for graph pattern mining. One important problem is to identify recurrent patterns across multiple networks to dis-cover biological modules. However, existing algorithms for frequent pattern mining become very costly in time and space as the pattern sizes and network numbers increase. Currently, no efficient algorithm is available for mining re-current patterns across large collections of genome-wide networks.
Results: We developed a novel algorithm, CODENSE, to efficiently mine frequent coherent dense subgraphs across large numbers of massive graphs. Compared to previous methods, our approach is scalable in the number and size of the input graphs, and adjustable in terms of exact or ap-proximate pattern mining. Applying CODENSE to 39 co-expression networks derived from microarray data sets, we discovered a large number of functionally homogenous clusters and made functional predictions for 169 uncharac-terized yeast genes.
54. A Systematic Approach for Comprehensive T-cell Epitope Discovery Using Peptide Libraries
Authors:Beissbarth, Tye-Din, Smyth, Speed, Anderson
Motivation: T-cell response to peptides bound on MHC Class~I or Class~II molecules is essential for immune recognition of pathogens. T-cells are activated by specific peptide epitopes that are determined within the antigen processing pathways and presented on the surface of other cells bound to MHC molecules. To determine which part of allergenic or pathogenic proteins can stimulate T-cells is important for the treatment of diseases. We sought to take advantage of the falling cost of synthetic, screening grade peptides, and devise a comprehensive, non-hypothesis-driven screen for T-cell epitopes. We were interested in the study of celiac disease (CD) and used the ELISPOT technique to perform a large number of T-cell assays. We therefore needed to compensate for the lack of statistical data analysis methods for ELISPOT assays.
Results: We describe a method to comprehensively screen for T-cell epitopes within a family or group of proteins. We have implemented an algorithm to generate a set of unique short peptide sequences that incorporate all possible epitopes within a group of proteins. T-cell assays were performed in 96 well plates using the ELISPOT assay to screen for responses in CD patients against any epitopes in glutens. We describe a statistical model to fit the data and an EM (Expectation Maximization) algorithm to estimate the parameters of interest and analyze the resulting data.
55. Bayesian Neural Network Approaches to Ovarian Cancer Identification from High-resolution Mass Spectrometry Data
The classification of high-dimensional data is always a challenge to statistical machine learning. We propose a novel method named shallow feature selection (SFS) that assigns each feature a probability of being selected based on the structure of training data itself. Independent of particular classifiers, the high dimension of biodata can be fleetly reduced to an applicable case for consequential processings. Moreover, to improve both efficiency and performance of classification, these prior probabilities will be further used to specify the distributions of top-level hyperparameters in hierarchical models of Bayesian neural network (BNN), as well as the parameters in Gaussian process models. Three BNN approaches are derived and then applied to identify ovarian cancer from NCI's high-resolution mass spectrometry data, which yielded an excellent performance in 1,000 independent k-fold cross validations (k=2,...,10). For instance, an average sensitivity and specificity of 98.56% and 98.42% have been achieved in the 2-fold cross validations, furthermore, only one control and one cancer are misclassified in the leave-one-out cross validation. Some other popular classifiers are also tested as a comparison.
56. Kernels for Small Molecules and the Prediction of Mutagenicity, Toxicity, and Anti-Cancer Activity
Motivation: Small molecules play a fundamental role in organic chemistry and biology. They can be used to probe biological systems and to discover new drugs and other useful compounds. As large datasets of small molecules become increasingly available, it is necessary to develop computational methods that can deal with molecules of variable size and structure and predict their chemical, physical, or biological properties. Results: Here we develop several new classes of kernels for small molecules using their 1D, 2D, and 3D representations. In 1D, we consider string kernels based on SMILES strings. In 2D, we introduce several similarity kernels based on conventional or generalized fingerprints. Generalized fingerprints are derived by counting in different ways sub-paths contained in the graph of bonds, using depth-first searches. In 3D, we consider similarity measures between histograms of pairwise distances between atom classes. These kernels can be computed efficiently and are applied to problems of classification and prediction of mutagenicity, toxicity, and anti-cancer activity on three publicly available datasets. The results derived using cross validation methods are state-of-the-art. Tradeoffs between various kernels are briefly discussed.