  |                          |                                                              |                   |                   |                                               |    |                                                             |                        |                                                                                                                                                                                                           Authors:                                 Gopalacharyulu, Lindfors,                                 Bounsaythip, Kivioja, Yetukuri, Hollmen, Oresic                                                              Motivation: Integration                                 of heterogeneous data in life sciences is a growing                                 and recognized challenge. The problem is not only                                 to enable the study of such data within the context                                 of a biological question, but also more fundamentally                                 how to represent the available knowledge and make                                 it accessible for mining.                                Results: Our integration                                 approach is based on the premise that relationships                                 between biological entities can be represented                                 as a complex network. The context dependency is                                 achieved by a judicious use of distance measures                                 on these networks. The biological entities and                                 the distances between them are mapped for the                                 purpose of visualization into the lower-dimensional                                 space using the Sammon’s mapping. The system                                 implementation is based on a multi-tier architecture                                 using a native XML database and a software tool                                 for querying and visualizing complex biological                                 networks. The functionality of our system is demonstrated                                 with two examples: (1) A multiple pathway retrieval,                                 in which, given a pathway name, the system finds                                 all the relationships related to the query by                                 checking available metabolic pathway, transcriptional,                                 signaling, protein-protein interaction, and ontology                                 annotation resources and (2) A protein neighborhood                                 search, in which given a protein name, the system                                 finds all its connected entities within a specified                                 depth. These two examples show that our system                                 is able to conceptually traverse different databases                                 to produce testable hypotheses and lead towards                                 answers to complex biological questions.                               Contact: matej.oresic@vtt.fi |                                                                                   |   |                                                                                                                                                                     Authors:                                 Sharan, Myers                                Motivation: Many                                 signals in biological sequences are based on the                                 presence or absence of base signals and their                                 spatial combinations. One of the best known examples                                 in this regard is the signal identifying a core                                 promoter---the site at which the basal transcription                                 machinery starts the transcription of a gene.                                 Our goal is a fully automatic pattern recognition                                 system for a family of sequences that simultaneously                                 discovers the base signals, their spatial relationships                                 and a classifier based upon them.                               Results: In this                                 paper we present a general method for characterizing                                 a set of Sequences by their recurrent motifs.                                 Our approach relies on novel probabilistic models                                 for DNA binding sites and modules of binding sites,                                 on algorithms to learn them from data, and on                                 a support vector machine that uses the learned                                 models to classify a set of sequences. We demonstrate                                 the applicability of our approach to diverse instances,                                 ranging from families of promoter sequences to                                 a data set of intronic sequences flanking alternatively                                 spliced exons. On a core promoter data set our                                 results are comparable to the state-of-the-art                                 McPromoter. On a data set of alternatively spliced                                 exons we outperform a previous approach. We also                                 achieve high success rates in recognizing cell                                 cycle regulated genes. These results demonstrate                                 that a fully automatic pattern recognition algorithm                                 can meet or exceed the performance of hand-crafted                                 approaches.                               Availability: The                                 software and data sets are available from the                                 authors upon request.                               Contact: roded@tau.ac.il |                                                                                   |   |                                                                                                                                                                     Authors:                                 Dolan, Ni, Camon, Blake                                                              Motivation: The                                 Gene Ontology [GO] is widely used to annotate                                 molecular attributes of genes and gene products.                                 Multiple groups undertaking functional annotations                                 of genomes contribute their annotation sets to                                 the GO database resource and these data are subsequently                                 used in comparative functional analysis research.                                 Although GO curators adhere to the same protocols                                 and standards while assigning GO annotations,                                 the specific procedure followed by each annotation                                 group can vary. Since differences in application                                 of annotation standards would dilute the effectiveness                                 of comparative analysis, methods for assessing                                 annotation consistency are essential. The development                                 of methodologies that are broadly applicable for                                 the assessment of GO annotation consistency is                                 an important issue for the comparative genomics                                 community.                               Results: We have                                 developed a methodology for assessing the consistency                                 of GO annotations provided by different annotation                                 groups. The method is completely general and can                                 be applied to compare any two sets of GO annotations.                                 This is the first attempt to assess cross-species                                 GO annotation consistency. Our method compares                                 annotation sets utilizing the hierarchical structure                                 of the GO to compare GO annotations between orthologous                                 gene pairs. The method produces a report on the                                 annotation consistency and inconsistency for each                                 orthologous pair. We present results obtained                                 by comparing GO annotations for mouse and human                                 gene sets.                                Availability:                               The complete current MGI_GOA GO annotation consistency                               report is available online at http://www.spatial.maine.edu/~mdolan. |                                                                                   |   |                                                                                                                                                                     Authors:                                 Käll, Krogh, Sonnhammer                                                              When predicting sequence features                                 like transmembrane topology, signal peptides,                                 coil-coil structures, protein secondary structure,                                 or genes, extra support can be gained from homologs.                                 We here present a general HMM decoding algorithm                                 that combines probabilities for sequence features                                 of homologs by considering the average of the                                 posterior label probability of each position in                                 a global sequence alignment. The algorithm is                                 an extension of the previously described "optimal                                 accuracy" decoder, allowing homology information                                 to be used. It was benchmarked using an HMM for                                 transmembrane topology and signal peptide prediction,                                 Phobius. We found that performance was substantially                                 increased when incorporating information from                                 homologs.                               |                                                                                   |   |                                                                                                                                                                     Authors:                                 Cheung, Yip, Smith, deKnikker,                                 Masiar, Gerstein                               Motivation: As                                 the semantic web technology is maturing and the                                 need for life sciences data integration over the                                 web is growing, it is important to explore how                                 data integration needs can be addressed by the                                 semantic web. The main problem that we face in                                 data integration is a lack of widely-accepted                                 standards for expressing the syntax and semantics                                 of the data. We address this problem by exploring                                 the use of semantic web technologies — including                                 Resource Description Framework (RDF), RDF Site                                 Summary (RSS), relational-database-to-RDF mapping                                 (D2RQ), and native RDF data repository —                                 to represent, store, and query both metadata and                                 data across life sciences datasets.                               Results: As many                                 biological datasets are presently available in                                 tabular format, we introduce an RDF structure                                 into which they can be converted. Also, we develop                                 a prototype web-based application called YeastHub                                 that demonstrates how a life sciences data warehouse                                 can be built using a native RDF data store (Sesame).                                 This data warehouse allows integration of different                                 types of yeast genome data provided by different                                 resources in different formats including the tabular                                 and RDF formats. Once the data are loaded into                                 the data warehouse, RDF-based queries can be formulated                                 to retrieve and query the data in an integrated                                 fashion.                               Availability: The                                 YeastHub web site is accessible via the following                                 URL: http://yeasthub.gersteinlab.org                               Contact: kei.cheung@yale.edu  |                                                                                   |   |                                                                                                                                                                     Authors:                                 Mahony, Golden, Smith, Benos                               Motivation: One                                 of the limiting factors in deciphering transcriptional                                 regulatory networks is the effectiveness of motif-finding                                 software. An emerging avenue for improving motif-finding                                 accuracy aims to incorporate generalised binding                                 constraints of related transcription factors (TFs),                                 named familial binding profiles (FBPs), as priors                                 in motif identification methods. A motif-finder                                 can thus be “biased” towards finding                                 motifs from a particular TF family. However, current                                 motif-finders allow only a single FBP to be used                                 as a prior in a given motif-finding run. In addition,                                 current FBP construction methods are based on                                 manual clustering of position specific scoring                                 matrices (PSSMs) according to the known structural                                 properties of the TF proteins. Manual clustering                                 assumes that the binding preferences of structurally                                 similar TFs will also be similar. This assumption                                 is not true, at least not for some TF families.                                 Automatic PSSM clustering methods are thus required                                 for augmenting the usefulness of FBPs.                               Results: A novel                                 method is developed for automatic clustering of                                 PSSM models. The resulting FBPs are incorporated                                 into the SOMBRERO motif-finder, significantly                                 improving its performance when finding motifs                                 related to those that have been incorporated.                                 SOMBRERO is thus the only existing de novo motif-finder                                 that can incorporate knowledge of all known PSSMs                                 in a given motif-finding run.                               Availability: The                                 methods outlined will be incorporated into the                                 next release of SOMBRERO, which is available from                                 http://bioinf.nuigalway.ie/sombrero.  |                                                                                   |   |                                                                                                                                                                     Authors:                                 Kou, Cohen, Murphy                               Protein name extraction is an important                                 step in mining biological literature. We describe                                 two new methods for this task: semiCRFs and dictionary                                 HMMs. SemiCRFs are a recently-proposed extension                                 to conditional random fields that enables more                                 effective use of dictionary information as features.                                 Dictionary HMMs are a technique in which a dictionary                                 is converted to a large HMM that recognizes phrases                                 from the dictionary, as well as variations of                                 these phrases. Standard training methods for HMMs                                 can be used to learn which variants should be                                 recognized. We compared the performance of our                                 new approaches to that of Maximum Entropy (MaxEnt)                                 and normal CRFs on three datasets, and improvement                                 was obtained for all four methods over the best                                 published results for two of the datasets. CRFs                                 and semiCRFs achieved the highest overall performance                                 according to the widely-used F-measure, while                                 the dictionary HMMs performed the best at finding                                 entities that actually appear in the dictionary—the                                 measure of most interest in our intended application.  |                                                                                   |   |                                                                                                                                                                     Authors:                                 Prakash, Tompa                               BLAST [Altschul et al., 1990] statistics                                 have been shown to be extremely useful for searching                                 for significant similarity hits, for amino acid                                 and nucleotide sequences. While these statistics                                 are well understood for pairwise comparisons,                                 there has been little success developing statistical                                 scores for multiple alignments. In particular,                                 there is no score for multiple alignment that                                 is well founded and treated as a standard. We                                 extend the BLAST theory to multiple alignments.                                 Following some simple assumptions, we present                                 and justify a significance score for multiple                                 segments of a local multiple alignment. We demonstrate                                 its usefulness in distinguishing high and moderate                                 quality multiple alignments from low quality ones,                                 with supporting experiments on orthologous vertebrate                                 promoter sequences.  |                                                                                   |   |                                                                                                                                                                     Authors:                                 Narayanaswamy, Ravikumar, Vijay-Shanker                               Phosphorylation is an important                                 biochemical reaction that plays a critical Role                                 in signal transduction pathways and cell-cycle                                 processes. A text mining system to extract the                                 phosphorylation relation from the literature is                                 reported. The focus of this paper is on the new                                 methods developed and implemented to connect and                                 merge pieces of information about phosphorylation                                 mentioned in different sentences in the text.                                 The effectiveness and accuracy of the system as                                 a whole as well as that of the methods for extraction                                 beyond a clause/sentence is evaluated using an                                 independently annotated dataset, the Phospho.ELM                                 database. The new methods developed to merge pieces                                 of information from different sentences are shown                                 to effective in significantly raising the recall                                 without much difference in precision.  |                                                                                   |   |                                                                                                                                                                     Authors:                                 Nagarajan, Jones, Keich                               Motivation: The                                 efficient and accurate computation of p-values                                 is an essential requirement for motif-finding                                 and alignment tools. We show that the approximation                                 algorithms used in two popular motif-finding programs,                                 MEME and Consensus, can fail to accurately compute                                 the p-value.                               Results: We present                                 two new algorithms: one for the evaluation of                                 the p-values of a range of motif scores, and a                                 faster one for the evaluation of the p-value of                                 a single motif score. Both exhibit more reliability                                 than existing algorithms, and the latter algorithm                                 is comparable in speed to the fastest existing                                 method.                               Availability: The                                 algorithms described in this paper are available                                 from http://www.cs.cornell.edu/~keich.                               Contact: keich@cs.cornell.edu  |                                                                                   |   |                                                                                                                                                                     Authors:                                 Batt, Ropers, de Jong, Geiselmann,                                 Mateescu, Page, Schneider                                Motivation: The                                 modeling and simulation of genetic regulatory                                 networks has created the need for tools for model                                 validation. The main challenges of model validation                                 are the achievement of a match between the precision                                 of model predictions and experimental data, as                                 well as the efficient and reliable comparison                                 of the predictions and observations.                               Results: We present                                 an approach towards the validation of models of                                 genetic regulatory networks addressing the above                                 challenges. It combines a method for qualitative                                 modeling and simulation with techniques for model                                 checking, and is supported by a new version of                                 the computer tool GNA. The model-validation approach                                 has been applied to the analysis of the network                                 controlling the nutritional stress response in                                 Escherichia coli. Availability: GNA and the model                                 of the stress response network are available at                                 http://www-helix.inrialpes.fr/gna  |                                                                                   |   |                                                                                                                                                                     Authors:                                 Hahn, Lee                                Motivation: The                                 recent release of the draft sequence of the chimpanzee                                 genome is an invaluable resource for finding genome-wide                                 genetic differences that might explain phenotypic                                 differences between humans and chimpanzees.                               Results: In this                                 paper, we describe a simple procedure to identify                                 potential human-specific frameshift mutations                                 that occurred after the divergence of human and                                 chimpanzee. The procedure involves collecting                                 human coding exons bearing insertions or deletions                                 compared to the chimpanzee genome and identification                                 of homologs from other species supporting that                                 the mutations are human-specific. Using this procedure,                                 we identified nine genes, BASE, DNAJB3, FLJ33674,                                 HEJ1, NTSR2, RPL13AP, SCGB1D4, WBSCR27, and ZCCHC13,                                 that show human-specific alterations including                                 truncations of the C-terminus. In some cases,                                 the frameshift mutation results in gene inactivation                                 or decay. In other cases, the altered protein                                 seems to be functional. This study demonstrates                                 that even the unfinished chimpanzee genome sequence                                 can be useful in identifying modification of genes                                 that are specific to the human lineage and therefore                                 could potentially be relevant to the study of                                 the acquisition of human-specific traits.  |                                                                                   |   |                                                                                                                                                                     Authors:                                 Ben-Hur, Noble Paper                               Despite advances in high throughput                                 methods for discovering protein-protein interactions,                                 the interaction networks of even well-studied                                 model organisms are sketchy at best, highlighting                                 the continued need for computational methods to                                 help direct experimentalists in the search for                                 novel interactions.                               We present a kernel method for predicting                                 protein-protein interactions using a combination                                 of data sources, including protein sequences,                                 Gene Ontology annotations, local properties of                                 the network, and homologous interactions in other                                 species. Whereas protein kernels proposed in the                                 literature provide a similarity between single                                 proteins, prediction of interactions requires                                 a kernel between pairs of proteins. We propose                                 a pairwise kernel that converts a kernel between                                 single proteins into a kernel between pairs of                                 proteins, and we illustrate the kernel's effectiveness                                 in conjunction with a support vector machine classifier.                                 Furthermore, we obtain improved performance by                                 combining several sequence-based kernels based                                 on k-mer frequency, motif and domain content and                                 by further augmenting the pairwise sequence kernel                                 with features that are based on other sources                                 of data.                                We apply our method to predict physical                                 interactions in yeast using data from the BIND                                 database. At a false positive rate of 1% the classifier                                 retrieves close to 80% of a set of trusted interactions.                                 We thus demonstrate the ability of our method                                 to make accurate predictions despite the sizeable                                 fraction of false positives that are known to                                 exist in interaction databases.                                |                                                                                   |   |                                                                                                                                                                     Authors:                                 Tuller, Chor                               Motivation: Maximum                                 likelihood (ML) is an increasingly popular optimality                                 criterion for selecting evolutionary trees (Felsenstein,                                 1981). Yet the computational complexity of ML                                 was open for over 20 years, and only recently                                 resolved by the authors (2005) for the Jukes Cantor                                 model of substitution and its generalizations.                                 It was proved that reconstructing the ML tree                                 is computationally intractable (NP hard). In this                                 work we explore three directions, which extend                                 that result.                               Results:                                                                - We show that ML under the assumption of molecular                                   clock is still computationally intractable (NP                                   hard).
                                 -  We show that not only is it computationally                                   intractable to find the exact ML tree, even                                   approximating the logarithm of the maximum likelihood                                   for any multiplicative factor smaller than 1.00175                                   is computationally intractable.
                                 -  We develop an algorithm for approximating                                   log-likelihood under the condition that the                                   input sequences are sparse. It employs any approximation                                   algorithm for parsimony, and asymptotically                                   achieves the same approximation ratio. We note                                   that ML reconstruction for sparse inputs is                                   still hard under this condition, and furthermore                                   many real datasets satisfy it.
                                 |                                                                                   |   |                                                                                                                                                                     Authors:                                 Ye, Osterman, Overbeek, Godzik                               Motivation: Proteins                                 work together in pathways and networks, collectively                                 comprising the cellular machinery. A subsystem                                 (a generalization of pathway concept) is a group                                 of related functional roles (such as enzymes)                                 jointly involved in a specific aspect of the cellular                                 machinery. Subsystems provide a natural framework                                 for comparative genome analysis and functional                                 annotation. A subsystem may be implemented in                                 a number of different functional variants in individual                                 species. In order to reliably project functional                                 assignments across multiple genomes, we have to                                 be able to identify the variants implemented in                                 each genome. The analysis of such variants across                                 diverse species is an interesting problem by itself                                 and may provide new evolutionary insights. However,                                 no computational techniques are presently available                                 for an automated detection and analysis of subsystem                                 variants.                                Results: Here we                                 formulate the subsystem variant detection problem                                 as finding the minimum number of subgraphs of                                 a subsystem, which is represented as a graph,                                 and solve the optimization problem by integer                                 programming approach. The performance of our method                                 was tested on subsystems encoded in the SEED,                                 a genomic integration platform developed by the                                 Fellowship for Interpretation of Genomes (FIG)                                 as a component of a large-scale effort on comparative                                 analysis and annotation of multiple diverse genomes.                                 Here we illustrate the results obtained for two                                 expert-encoded subsystems of the biosynthesis                                 of Coenzyme A and FMN/FAD cofactors. Applications                                 of variant detection, to support genomic annotations                                 and to assess divergence of species, are briefly                                 discussed in the context of these universally                                 conserved and essential metabolic subsystems.  |                                                                                   |   |                                                                                                                                                                     Authors:                                 Dimmic, Hubisz, Bustamante,                                 Nielsen                               Motivation: The                                 evolution of protein sequences is constrained                                 by complex interactions between amino acid residues.                                 Because harmful substitutions may be compensated                                 by other substitutions at neighboring sites, residues                                 can coevolve. We describe a Bayesian phylogenetic                                 approach to the detection of coevolving residues                                 in protein families. This method, Bayesian mutational                                 mapping (BMM), assigns mutations to the branches                                 of the evolutionary tree stochastically, and then                                 test statistics are calculated to determine whether                                 a coevolutionary signal exists in the mapping.                                 Posterior predictive P-values provide an estimate                                 of significance, and specificity is maintained                                 by integrating over uncertainty in the estimation                                 of the tree topology, branch lengths, and substitution                                 rates. A coevolutionary Markov model for codon                                 substitution is also described, and this model                                 is used as the basis of several test statistics.                               Results: Results                                 on simulated coevolutionary data indicate that                                 the BMM method can successfully detect nearly                                 all coevolving sites when the model has been correctly                                 specified, and that nonparametric statistics such                                 as mutual information are generally less powerful                                 than parametric statistics. On a dataset of eukaryotic                                 proteins from the phosphoglycerate kinase (PGK)                                 family, interdomain site contacts yield a significantly                                 greater coevolutionary signal than interdomain                                 non-contacts, an indication that the method provides                                 information about interacting sites. Failure to                                 account for the heterogeneity in rates across                                 sites in PGK resulted in a less discriminating                                 test, yielding a marked increase in the number                                 of reported positives at both contact and non-contact                                 sites.  |                                                                                   |   |                                                                                                                                                                     Authors:                                 Jönsson, Heisler, Reddy,                                 Agrawal, Gor, Shapiro, Mjolsness, Meyerowitz                                                              Motivation: The                                 above ground tissues of higher plants are generated                                 from a small region of cells situated at the plant                                 apex called the shoot apical meristem. An important                                 genetic control circuit modulating the size of                                 the Arabidopsis thaliana meristem is a feed-back                                 network between the CLAVATA3 and WUSCHEL genes.                                 Although the expression patterns for these genes                                 do not overlap, WUSCHEL activity is both necessary                                 and sufficient (when expressed ectopically) for                                 the induction of CLAVATA3 expression. However,                                 upregulation of CLAVATA3 in conjunction with the                                 receptor kinase CLAVATA1 results in the down regulation                                 of WUSCHEL. Despite much work, experimental data                                 for this network is incomplete and additional                                 hypotheses are needed to explain the spatial locations                                 and dynamics of these expression domains. Predictive                                 mathematical models describing the system should                                 provide a useful tool for investigating and discriminating                                 among possible hypotheses, by determining which                                 hypotheses best explain observed gene expression                                 dynamics.                                Results: We are                                 developing a method using in vivo live confocal                                 microscopy to capture quantitative gene expression                                 data and create templates for computational models.                                 We present two models accounting for the organization                                 of the WUSCHEL expression domain. Our preferred                                 model uses a reaction-diffusion mechanism, in                                 which an activator induces WUSCHEL expression.                                 This model is able to organize the WUSCHEL expression                                 domain. In addition, the model predicts the dynamical                                 reorganization seen in experiments where cells,                                 including the WUSCHEL domain, are ablated, and                                 also predicts the spatial expansion of the WUSCHEL                                 domain resulting from removal of the CLAVATA3                                 signal.                                 |                                                                                   |   |                                                                                                                                                                     Authors:                                 Song, Wu, Gusfield                               Results: In this                                 paper, we present efficient, practical methods                                 for computing both upper and lower bounds on the                                 minimum number of needed recombinations, and for                                 constructing evolutionary histories that explain                                 the input sequences. We extensively study the                                 efficiency and accuracy of these algorithms on                                 both simulated and real data sets. The algorithms                                 produce very close upper and lower bounds, which                                 match exactly in a surprisingly wide range of                                 data. Thus, with the use of new, very effective                                 lower bounding methods and an efficient algorithm                                 for computing upper bounds, this approach allows                                 the efficient exact computation of the minimum                                 number of needed recombinations, with high frequency                                 in a large range of data. When upper and lower                                 bounds match, evolutionary histories found by                                 our algorithm correspond to most parsimonious                                 histories.                               Availability: HapBound                                 and SHRUB, programs implementing the new algorithms                                 discussed in this paper, are available at http://wwwcsif.cs.ucdavis.edu/~gusfield/lu.html.  |                                                                                   |   |                                                                                                                                                                     Authors:                                 Yu, Zaveljevski, Stevens,                                 Yackovich, Reifman                               The classification of protein sequences                                 obtained from patients with various immunoglobulin-related                                 conformational diseases may provide insight into                                 structural correlates of pathogenicity. However,                                 clinical data are very sparse and, in the case                                 of antibody-related proteins, the collected sequences                                 have large variability with only a small subset                                 of variations relevant to the protein pathogenicity                                 (function). On this basis, these sequences represent                                 a model system for development of strategies to                                 recognize the small subset of function-determining                                 variations among the much larger number of primary                                 structure diversifications introduced during evolution.                                 Under such conditions, most protein classification                                 algorithms have limited accuracy. To address this                                 problem, we propose a Support Vector Machine (SVM)-based                                 classifier that combines sequence and 3D structural                                 averaging information. Each amino acid in the                                 sequence is represented by a set of six physicochemical                                 properties: hydrophobicity, hydrophilicity, volume,                                 surface area, bulkiness, and refractivity. Each                                 position in the sequence is described by the properties                                 of the amino acid at that position and the properties                                 of its neighbors in 3D space or its neighbors                                 in the sequence. A structure template is selected                                 to determine neighbors in 3D space and a window                                 size is used to determine the neighbors in the                                 sequence. The test data consists of 209 proteins                                 of human antibody immunoglobulin light chains,                                 each represented by aligned sequences of 120 amino                                 acids. The methodology is applied to the classification                                 of protein sequences collected from patients with                                 and without amyloidosis, and indicates that the                                 proposed modified classifiers are more robust                                 to sequence variability than standard SVM classifiers,                                 improving classification error between 5-25% and                                 sensitivity between 9-17%. The classification                                 results might also suggest possible mechanisms                                 for the propensity of immunoglobulin light chains                                 to amyloid formation.  |                                                                                   |   |                                                                                                                                                                     Authors:                                 Cline, Blume, Cawley, Clark,                                 Hu, Lu, Salomonis, Wang, Williams                                Motivation: Many                                 or most mammalian genes undergo alternative splicing,                                 generating a variety of transcripts from a single                                 gene. New information on splice variation is becoming                                 available through technology for measuring expression                                 levels of several exons or splice junctions per                                 gene. We have developed a statistical method,                                 ANOSVA, to detect alternative splicing from expression                                 data. ANOSVA requires no transcript information,                                 so can be applied when the level of annotation                                 data is poor. When validated against spiked clone                                 data, it generated no false positives and few                                 false negatives. We demonstrated ANOSVA with data                                 from a prototype mouse alternative splicing array,                                 yielding a set of genes with evidence of tissue-specific                                 splice variation. These results are available                                 at http://bioinfo.affymetrix.com/Papers/ANOSVA.  |                                                                                   |   |                                                                                                                                                                     Authors:                                 Borgwardt, Ong, Schoenauer,                                 Vishwanathan, Smola, Kriegel                               Motivation: Computational                                 approaches to protein function prediction infer                                 protein function by finding proteins with similar                                 sequence, structure, surface clefts, chemical                                 properties, amino acid motifs, interaction partners                                 or phylogenetic profiles. We present a new approach                                 that combines sequential, structural and chemical                                 information into one graph model of proteins.                                 We predict functional class membership of enzymes                                 and non-enzymes using graph kernels and Support                                 Vector Machine classification on these protein                                 graphs.                                Results: Our graph                                 model, derivable from protein sequence and structure                                 only, is competitive with vector models that require                                 additional protein information such as the size                                 of surface pockets. If we include this extra information                                 into our graph model, our classifier yields significantly                                 higher accuracy levels than the vector models.                                 Hyperkernels allow us to select and to optimally                                 combine the most relevant node attributes in our                                 protein graphs. We have laid the foundation for                                 a protein function prediction system that integrates                                 protein information from various sources efficiently                                 and effectively.                               Availability: More                                 information available via www.dbs.ifi.lmu.de/Mitarbeiter/borgwardt.html.                               Contact: borgwardt@dbs.ifi.lmu.de  |                                                                                   |   |                                                                                                                                                                     Authors:                                 Soukup, Lee, Cho                               Genome-wide microarray data are                                 often used in challenging classification problems                                 of subtypes of human diseases. However, the identification                                 of a parsimonious robust prediction model that                                 performs consistently well on future independent                                 data has not been successful due to the biased                                 model selection from an extremely large number                                 of candidate models during the classification                                 model search and construction. Furthermore, common                                 criteria of prediction model performance such                                 as classification error rates do not provide a                                 sensitive measure for valuating performance of                                 such astronomic competing models. Also, even though                                 several different classification approaches have                                 been utilized to tackle such classification problems,                                 no direct comparison on these methods have been                                 made. We introduce a novel measure for assessing                                 the performance of a prediction model, the misclassification-penalized                                 posterior (MiPP), the sum of the posterior classification                                 probabilities penalized by the number of incorrectly                                 classified samples. Using MiPP, we implement a                                 forward step-wise cross-validated procedure to                                 find our optimal prediction models with different                                 numbers of features on a training set. Our final                                 robust classification model and its dimension                                 are determined based on a completely independent                                 test data set. This MiPP-based classification                                 modeling approach enables us to identify the most                                 parsimonious robust prediction models only with                                 two or three features on well-known microarray                                 data sets. These models show superior performance                                 to other models in the literature that often have                                 more than 40--100 features in their model construction.                                 Our classification algorithm is available at the                                 Bioconductor web site as the MiPP package.  |                                                                                   |   |                                                                                                                                                                     Authors:                                 Garay-Malpartida, Occhiucci, Alves,                                 Belizário                                Motivation: The                                 in vitro studies have shown that the most remarkable                                 catalytic features of caspases, a family of cys-teineproteases,                                 are their stringent specificity to Asp (D) in                                 the S1 subsite and at least four amino acid to                                 the left of scis-sile bound. However, there is                                 little information about the substraterecognition                                 patterns in vivo. The prediction and-characterization                                 of proteolytic cleavage sites in natural sub-strates                                 could be useful for uncovering these structural                                 rela-tionships.                               Results: PEST-like                                 sequences rich in the amino acids Ser (S), Thr                                 (T), Pro (P), Glu or Asp (E/D), including Asn                                 (N) and Gln (Q) are adjacent structural/sequential                                 elements in the majority of cleavage site regions                                 of the natural caspase sub-strates described in                                 the literature, supporting its possible implication                                 in the substrate selection by caspases. We de-veloped                                 CaSPredictor, a software which incorporated a                                 PEST-like index and the position-dependent amino                                 acid ma-trices for prediction of caspase cleavage                                 sites in individual proteins and protein datasets.                                 The program predicted suc-cessfully 81% (111/137)                                 of the cleavage sites in experimen-tally verified                                 caspase substrates not annotated in its internal                                 data file. Its accuracy and confidence was estimated                                 as 80% using ROC methodology. The program was                                 much more effi-cient to predict caspase substrates                                 when compared to Pep-tideCutter and PEPS softwares.                                 Finally, the program de-tected potential cleavage                                 sites in the primary sequences of 1644 proteins                                 in a dataset containing 9986 protein entries.  |                                                                                   |   |                                                                                                                                                                     Authors:                                 Woo, Krueger, Kaur, Churchill                               Motivation: Compared                                 to two-color microarrays, three-color microarrays                                 can increase design efficiency and power to detect                                 differential expression without additional samples                                 and arrays. Furthermore, three-color microarray                                 technology is currently available at a reasonable                                 cost. Despite the potential advantages, clear                                 guidelines for designing and analyzing three-color                                 experiments do not exist.                               Results: We propose                                 a three- and four-color cyclic design (loop) and                                 a complementary graphical representation to help                                 design experiments that are balanced, efficient,                                 and robust to hybridization failures. In theory,                                 three-color loop designs are more efficient than                                 two-color loop designs. Experiments using both                                 two- and three-color platforms were performed                                 in parallel and its outputwere analyzed using                                 linearmixedmodel analysis in R/MAANOVA. These                                 results demonstrate that three-color experiments                                 using the same number of samples (and fewer arrays)                                 will perform no less efficiently than two-color                                 experiments. The improved efficiency of the design                                 is somewhat offset by a reduced dynamic range                                 and increased variability in the three-color experimental                                 system. This result suggests that, with minor                                 technological improvements, three-color microarrays                                 using loop designs could detect differential expression                                 more efficiently than two-color loop designs.                                                              Availability: http://www.jax.org/staff/churchill/labsite/software                               Supplementary Information:                                 Multi-color cyclic design construction                                 methods and examples along with additional results                                 of the experiment are provided at:                                 http://www.jax.org/staff/churchill/labsite/pubs/yong.                                                              Keywords: three-color                                 microarray, four-color microarray, cyclic design,                                 directed graph, efficiency, Fs-statistics                                Contact: garyc@jax.org  |                                                                                   |   |                                                                                                                                                                     Authors:                                 Sadka, Linial                               In eukaryotes, membranous                                 proteins account for 20-30% of the proteome. Most                                 of these proteins contain one or more transmembrane                                 (TM) domains. These are short segments that transverse                                 the bilayer lipid membrane. Various properties                                 of the TM domains, such as their number, their                                 topology and their arrangement within the membrane                                 are closely related to the protein’s cellular                                 functions. Properties of the TM domains also determine                                 the cellular targeting and localization of these                                 proteins. It is not known, however, whether the                                 information encoded by TM domains suffices for                                 the purpose of classifying proteins into their                                 functional families. This is the question we address                                 here. We introduce an algorithm that creates a                                 profile of each functional family of membranous                                 proteins based only on the amino-acid composition                                 of their TM domains. This is complemented by a                                 classifier program for each such family (to determine                                 whether a given protein belongs to it or not).                                 We find that in most instances TM domains contain                                 enough information to allow an accurate discrimination                                 of ~80% sensitivity and ~90% specificity among                                 unrelated polytopic functional families with the                                 same number of TM domains.  |                                                                                   |   |                                                                                                                                                                     Authors:                                 Li, Meyer, Liu                               Motivation: Transcription                                 factors (TFs) regulate gene ex-pression by recognizing                                 and binding to specific regulatory regions on                                 the genome, which in higher eukaryotes can oc-cur                                 far away from the regulated genes. Recently Affymetrix                                 developed the high-density oligonucleotide arrays                                 that tile all the non-repetitive sequences of                                 the human genome at 35-bp resolution. This new                                 array platform allows for the unbiased mapping                                 of in vivo TF binding sequences (TFBSs) using                                 Chromatin ImmunoPrecipitation followed by microarray                                 ex-periments (ChIP-chip). The massive data set                                 generated from these experiments pose great challenges                                 for data analysis.                                Results: We developed                                 a fast, scalable and sensitive method to extract                                 TFBSs from ChIP-chip experiments on genome tiling                                 arrays. Our method takes advantage of tiling array                                 data from many experiments to normalize and model                                 the behavior of each individual probe, and identifies                                 TFBSs using a Hidden Markov Model (HMM). When                                 applied to the data of p53 ChIP-chip experiments                                 (Cawley et al., 2004), our method discovered many                                 new high confidence p53 targets including all                                 the regions verified by quantitative PCR . Using                                 a de novo motif finding algorithm MDscan (Liu                                 et al., 2002), we also recovered the p53 motif                                 from our HMM identified p53 target regions. Furthermore,                                 we found substantial p53 motif enrichment in these                                 regions comparing with both ge-nomic background                                 and the TFBSs identified by Cawley et al (2004).                                 Several of the newly identified p53 TFBSs are                                 in known genes’ promoter regions or associated                                 with previ-ously characterized p53-responsive                                 genes.  |                                                                                   |   |                                                                                                                                                                     Authors:                                 Mettu, Lilien, Donald                               We cast the problem of identifying                                 protein-protein interfaces, using only unassigned                                 NMR spectra, into a geometric clustering problem.                                 Identifying protein-protein interfaces is critical                                 to understanding inter- and intra-cellular communication,                                 and NMR allows the study of protein interaction                                 in solution. However it is often the case that                                 NMR studies of a protein complex are very time-consuming,                                 mainly due to the bottleneck in assigning the                                 chemical shifts, even if the apo structures of                                 the constituent proteins are known. We study whether                                 it is possible, in a high-throughput manner, to                                 identify the interface region of a protein complex                                 using only unassigned chemical shift and residual                                 dipolar coupling (RDC) data.                                We introduce a geometric optimization                                 problem where we must cluster the cells in an                                 arrangement on the boundary of a 3-manifold, where                                 the arrangement is induced by a spherical quadratic                                 form. We show that this formalism derives directly                                 from the physics of RDCs. We present an optimal                                 algorithm for this problem that runs in O(n^3                                 log n) time for an n-residue protein. We then                                 use this clustering algorithm as a subroutine                                 in a practical algorithm for identifying the interface                                 region of a protein complex from unassigned NMR                                 data. We present the results of our algorithm                                 on NMR data for 7 proteins from 5 protein complexes                                 and show that our approach is useful for high-throughput                                 applications in which we seek to rapidly identify                                 the interface region of a protein complex.  |                                                                                   |   |                                                                                                                                                                     Authors:                                 Dueck, Morris, Frey                               We address the problem of multi-way                                 clustering of microarray data using a generative                                 model. Our algorithm, probabilistic sparse matrix                                 factorization (PSMF), is a probabilistic extension                                 of a previous hard-decision algorithm for this                                 problem. PSMF allows for varying levels of sensor                                 noise in the data, uncertainty in the hidden prototypes                                 used to explain the data, and uncertainty as to                                 which prototypes are selected to explain each                                 data vector. We present experimental results demonstrating                                 that our method can better recover functionally-relevant                                 clusterings in mRNA expression data than standard                                 clustering techniques, including hierarchical                                 agglomerative clustering, and we show that by                                 computing probabilities instead of point estimates,                                 our method avoids converging to poor solutions.  |                                                                                   |   |                                                                                                                                                                     Authors:                                 Halperin, Kimmel, Shamir                               Motivation: The                                 search for genetic regions associated with complex                                 disease, such as cancer or Alzheimer's disease,                                 is an important challenge that may lead to better                                 diagnosis and treatment. The existence of millions                                 of DNA variations, primarily single nucleotide                                 polymorphisms (SNPs), may allow the fine dissection                                 of such associations. However, studies seeking                                 disease association are limited by the cost of                                 genotyping SNPs. Therefore, it is essential to                                 find a small subset of informative SNPs (tag SNPs)                                 that may be used as good representatives of the                                 rest of the SNPs.                               Results: We define                                 a new natural measure for evaluating the prediction                                 accuracy of a set of tag SNPs, and use it to develop                                 a new method for tag SNPs selection. Our method                                 is based on a novel algorithm that predicts the                                 values of the rest of the SNPs given the tag SNPs.                                 In contrast to most previous methods, our prediction                                 algorithm uses the genotype information and not                                 the haplotype information of the tag SNPs. Our                                 method is very efficient, and it does not rely                                 on having a block partition of the genomic region.                               We compared our method to two state                                 of the art tag SNP selection algorithms on 58                                 different genotype data sets from four different                                 sources. Our method consistently found tag SNPs                                 with considerably better prediction ability than                                 the other methods.  |                                                                                   |   |                                                                                                                                                                     Authors:                                  Cheng, Baldi                               Motivation: Protein                                 beta-sheets play a fundamental role in protein                                 structure, function, evolution, and bio-engineering.                                 Accurate prediction and assembly of protein beta-sheets,                                 however, remains challenging because protein beta-sheets                                 require formation of hydrogen bonds between linearly                                 distant residues. Previous approaches for predicting                                 beta-sheet topological features, such as beta-strand                                 alignments, in general have not exploited the                                 global covariation and constraints characteristic                                 of beta-sheet architectures.                               Results: We propose                                 a modular approach to the problem of predicting/assembling                                 protein beta-sheets in a chain by integrating                                 both local and global constraints in three steps.                                 The first step uses recursive neural networks                                 to predict pairing probabilities for all pairs                                 of inter-strand beta-residues from profile, secondary                                 structure, and solvent accessibility information.                                 The second step applies dynamic programming techniques                                 to these probabilities to derive binding pseudo-energies                                 and optimal alignments between all pairs of beta-strands.                                 Finally, the third step, uses graph matching algorithms                                 to predict the beta-sheet architecture of the                                 protein by optimizing the global pseudo-energy                                 while enforcing strong global beta-strand pairing                                 constraints. The approach is evaluated using cross-validation                                 methods on a large non-homologous dataset and                                 yields significant improvements over previous                                 methods.                               Availability: http://www.igb.uci.edu/servers/psss.html  |                                                                                   |   |                                                                                                                                                                     Authors:                                 Smith, Sumazin, Das, Zhang                                                              We describe methodology to identify                                 de novo individual and interacting pairs of binding                                 site motifs from ChIP-chip data, using an algorithm                                 that integrates localization data directly into                                 the motif discovery process. We combine matrix-enumeration-based                                 motif discovery with multi-variate regression                                 to evaluate candidate motifs and identify motif                                 interactions. When applied to the HNF localization                                 data of Odom et al. (2004) in liver and pancreatic                                 islets, our methods produce motifs that are either                                 novel or improved known motifs. All motif pairs                                 identified to predict localization are further                                 evaluated according to how well they predict expression                                 in liver and islets, and according to how conserved                                 are the relative positions of their occurrences.                                 We find that interaction models of HNF1 and CDP                                 motifs provide excellent prediction of both HNF1                                 localization and gene expression in liver. Our                                 results demonstrate that ChIP-chip data can be                                 used to identify interacting binding site motifs.                                 |                                                                                   |   |                                                                                                                                                                     Authors:                                 Brunette, Brock                               De novo protein structure prediction                                 can be formulated as search in a high-dimensional                                 space. One of the most frequently used computational                                 tools to solve such search problems is the Monte                                 Carlo method. We present a novel search technique,                                 called model-based search. This method samples                                 the high-dimensional search space to build an                                 approximate model of the underlying function.                                 This model is incrementally refined in areas of                                 interest, while areas that are not of interest                                 are excluded from further exploration. Model-based                                 search derives its efficiency from the fact that                                 the information obtained during the exploration                                 of the search space is used to guide further exploration.                                 In contrast, Monte Carlo-based techniques are                                 memory-less and exploration is performed based                                 on random walks, ignoring the information obtained                                 in previous steps. Model-based search is applied                                 to protein structure prediction, where search                                 is employed to find the global minimum of the                                 protein's energy landscape. We show that model-based                                 search uses computational resources more efficiently                                 to find lower-energy conformations of proteins                                 when compared to one of the leading protein structure                                 prediction methods, which relies on a tailored                                 Monte Carlo method to perform search. The performance                                 improvements become more pronounced as the dimensionality                                 of the search space increases. We show that model-based                                 search enables more accurate protein structure                                 prediction than previously possible. Furthermore,                                 we believe that similar performance improvements                                 can be expected in other problems that are currently                                 solved using Monte Carlo-based search methods.  |                                                                                   |   |                                                                                                                                                                     Authors:                                 Huang, Morris, Frey Paper                                                              Microarray designs containing millions                                 to hundreds of millions of probes that tile entire                                 genomes are currently being released. Within the                                 next 2 months, our group will release a microarray                                 data set containing over 12,000,000 microarray                                 measurements taken from 37 mouse tissues. A problem                                 that will become increasingly significant in the                                 upcoming era of genome-wide exon-tiling microarray                                 experiments is the removal of cross-hybridization                                 noise. We present a probabilistic generative model                                 for cross-hybridization in microarray data and                                 a corresponding variational learning method for                                 cross-hybridization compensation, GenXHC, that                                 reduces cross-hybridization noise by taking into                                 account multiple sources for each mRNA expression                                 level measurement and prior knowledge of hybridization                                 similarities between the nucleotide sequences                                 of microarray probes and their target cDNAs. The                                 algorithm is applied to a subset of an exon-resolution                                 genome-wide Agilent microarray data set for chromosome                                 16 of Mus musculus and is found to produce statistically                                 significant reductions in cross-hybridization                                 noise. The denoised data is found to produce enrichment                                 in multiple Gene Ontology-Biological Process (GO-BP)                                 functional groups. The algorithm is found to outperform                                 Robust Multi-array Analysis, another method for                                 cross-hybridization compensation.                                  |                                                                                   |   |                                                                                                                                                                     Authors:                                 Ko, Murga, Ondrechen                               Identification of functional information                                 for a protein from its three-dimensional structure                                 is a major challenge in genomics. The power of                                 theoretical microscopic titration curves (THEMATICS),                                 when coupled with a statistical analysis, provides                                 a method for high-throughput screening for identification                                 of catalytic sites and binding sites with high                                 accuracy and precision. The method requires only                                 the 3D structure of the query protein as input,                                 but performs as well as other methods that depend                                 on sequence alignments and structural similarities.  |                                                                                   |   |                                                                                                                                                                     Authors:                                 Raetsch, Sonnenburg, Schoelkopf                                                              Pre-mRNA alternative splicing greatly                                 increases the complexity of gene expression. Estimates                                 show that more than half of the human genes and                                 at least a third of the genes of less complex                                 organisms are alternatively spliced. In this work                                 we consider one major form of alternative splicing:                                 the exclusion of exons from the transcript. It                                 was shown that alternatively spliced exons have                                 properties that distinguish them from other exons.                                 While most computational studies on alternative                                 splicing only apply to exons which are conserved                                 among species, our method only uses information                                 available to the splicing machinery. We employ                                 advanced machine learning techniques in order                                 to answer two questions: a) Is a certain exon                                 alternatively spliced? b) How to identify yet                                 unidentified exons within verified introns?                                We designed a SVM kernel well suited                                 for the task of classifying sequences with motifs                                 having positional preferences. In order to solve                                 the task a), we combine the kernel with additional                                 local sequence information. The resulting SVM                                 based classifier achieves a true positive rate                                 of 48,5% at a false positive rate of 1%. By scanning                                 over EST-confirmed exons we identified 215 potentially                                 alternatively spliced exons. For 10 randomly selected                                 exons we successfully performed biological verification                                 experiments and confirmed 3 novel alternatively                                 spliced exons. To answer question b), we used                                 SVM based predictions to recognize acceptor and                                 donor splice sites. Combined with the abovementioned                                 features we were able to identify 85,2% of skipped                                 exons within verified introns at a false positive                                 rate of 1%.  |                                                                                   |   |                                                                                                                                                                     Authors:                                 Nimrod, Glaser, Steinberg, Ben-Tal, Pupko                               Motivation: In                                 silico prediction of functional regions on                                 protein surfaces, i.e., sites of interaction with                                 DNA, ligands, substrates and other proteins, is                                 of utmost importance in various applications in                                 the emerging fields of proteomics and structural                                 genomics. When a sufficient number of homologs                                 is found, powerful prediction schemes can be based                                 on the observation that evolutionarily conserved                                 regions are often functionally important. However,                                 it is challenging to unambiguously identify the                                 boundaries of the functional regions; typically,                                 only the principal functionally-important region                                 of the protein is detected, while secondary functional                                 regions with weaker conservation signals are overlooked.                               Methods: We present                                 a new methodology, called PatchFinder, that automatically                                 identifies patches of conserved residues that                                 are located in close proximity to each other on                                 the protein surface. PatchFinder is based on the                                 following steps: (1) Assignment of conservation                                 scores to each amino acid position on the protein                                 surface. (2) Assignment of a score to each putative                                 patch, based on its likelihood to be functionally                                 important. The patch of maximum likelihood is                                 considered to be the main functionally-important                                 region, and the search is continued for non-overlapping                                 patches of secondary importance.                                  Results: We examined the accuracy of the method                                 using the IGPS enzyme, the SH2 domain and a benchmark                                 set of 112 proteins. These examples demonstrated                                 that PatchFinder is capable of identifying both                                 the main and secondary functional patches.                               Availability: The                                 PatchFinder program is available at: http://ashtoret.tau.ac.il/~nimrodg                               Contact: NirB@tauex.tau.ac.il  |                                                                                    |                                                                                                                                                                     Authors:                                 Hannenhalli, Wang                               Motivation: Positional                                 weight matrix (PWM) is derived from a set of experimentally                                 determined binding sites. Here we explore whether                                 there exist subclasses of binding sites, and if                                 the mixture of these subclass-PWMs can improve                                 the binding site prediction. Intuitively, the                                 subclasses correspond to either distinct binding                                 preference of the same transcription factor in                                 different contexts or distinct subtypes of the                                 transcription factor.                                Result: We report                                 an Expectation Maximization algorithm adapting                                 the mixture model of Baily and Elkan. We assessed                                 the relative merit of using two subclass-PWMs.                                 The resulting PWMs were evaluated with respect                                 to preferred conservation (relative to mouse)                                 of potential sites in human promoters and expression                                 coherence of the potential target genes. Based                                 on 64 JASPAR vertebrate PWMs, 61%-81% of the cases                                 resulted in a higher conservation using the mixture                                 model. Also in 98% of the cases the expression                                 coherence was higher for the target genes of one                                 of the subclass-PWMs. Our analysis of REB1 sites                                 is consistent with previously discovered subtypes                                 using independent methods. Additionally application                                 of our method to mutated sites for transcription                                 factor LEU3 reveals subclasses that segregate                                 into strongly binding and weakly binding sites                                 with p-value of 0.008. This is the first study                                 which attempts to quantify the subtly different                                 binding specificities of a transcription factor                                 at a large scale and suggests the use of a mixture                                 of PWMs instead of the current practice of using                                 a single PWM for a transcription factor.  |                                                                                    |                                                                                                                                                                     Authors:                                 Winstanley, Abeln, Deane                               Motivation: At                                 present there exists no age estimate for the different                                 protein structures found in nature. It has become                                 clear from occurrence studies that different folds                                 arose at different points in evolutionary time.                                 An estimation of the age of different folds would                                 be a starting point for many investigations into                                 protein structure evolution: how we arrived at                                 the set of folds we see today. It would also be                                 a powerful tool in protein structure classification                                 allowing us to reassess the available hierarchical                                 methods and perhaps suggest improvements.                               Results: We have                                 created the first relative age estimation technique                                 for protein folds. Our method is based on constructing                                 parsimonious scenarios which can describe occurrence                                 patterns in a phylogeny of species. The ages presented                                 are shown to be robust to the different trees                                 or data types used for their generation. They                                 show correlations with other previously used protein                                 age estimators, but appear to be far more discriminating                                 than any previously suggested technique. The age                                 estimates given are not absolutes but they already                                 offer intriguing insights, like the very different                                 age patterns of alpha/beta folds compared to small                                 folds. The alpha/beta folds appear on average                                 to be far older than their small fold counterparts.                                                               Availability:                                 Example trees and additional material are available                                 at:                                 http://www.stats.ox.ac.uk/~abeln/foldage                               Contact: deane@stats.ox.ac.uk  |                                                                                    |                                                                                                                                                                     Authors:                                 Price, Jones, Pevzner                               De novo repeat family identification                                 is a challenging algorithmic problem of great                                 practical importance. As the number of genome                                 sequencing projects increases, there is a pressing                                 need to identify the repeat families present in                                 large, newly sequenced genomes. We develop a new                                 method for de novo identification of repeat families                                 via extension of consensus seeds; our method enables                                 a rigorous definition of repeat boundaries, a                                 key issue in repeat analysis.                               Our RepeatScout algorithm is more                                 sensitive and orders of magnitude faster than                                 RECON, the dominant tool for de novo repeat family                                 identification in newly sequenced genomes. Using                                 RepeatScout, we estimate that roughly 2% of the                                 human genome and 4% of mouse and rat genomes consist                                 of previously unannotated repetitive sequence.  |                                                                                    |                                                                                                                                                                     Authors:                                 Shmygelska                               The problem of finding folding nuclei                                 (a set of native contacts that play an important                                 role in folding) along with identifying folding                                 pathways (a time ordered sequence of folding events)                                 of proteins is one of the most important problems                                 in protein chemistry. Here we propose a novel                                 and simple approach to address this problem as                                 follows: given the topology of the native state,                                 identify native contacts that form folding nuclei                                 based on the graph theoretical approach that considers                                 effective contact order (effective loop closure)                                 as its objective function.                               Motivation:                                 A number of computational methods for the                                 prediction of folding nuclei already exists in                                 the literature, but most of them rely on restrictive                                 assumptions about the nature of nuclei or the                                 process of folding. Our motivation is to develop                                 a simple, efficient and robust algorithm to find                                 an ensemble of pathways with the lowest effective                                 contact order and to identify contacts that are                                 crucial for folding.                                                                   Results: Our approach is different                                 from the previously used methods in that it uses                                 efficient graph algorithms, and does not formulate                                 restrictive assumptions about folding nuclei.                                 Our predictions provide more details concerning                                 the protein folding pathway than most other methods                                 in literature. We demonstrate the success of our                                 approach by predicting folding nuclei for a data                                 set of proteins for which experimental kinetic                                 data is available. We show that our method compares                                 favourably with other methods in literature and                                 its results agree with experimental results.  |                                                                                    |                                                                                                                                                                     Authors:                                 Tang, Mechref, Novotny                               The emerging glycomics and glycoproteomics                                 projects aim to characterize all forms of glycoproteins                                 in different tissues and organisms. Tandem mass                                 spectrometry is the key experimental methodology                                 for high throughput glycan identification and                                 characterization. Fragmentation of glycans from                                 high energy collision-induced dissociation (CID)                                 generates ions from glycosidic as well as internal                                 cleavages. The cross-ring ions resulting from                                 internal cleavages provide additional information                                 that is important to reveal the type of linkage                                 between monosaccharides. This information, however,                                 is not incorporated by the current programs for                                 analyzing glycan mass spectra. As a result, they                                 can rarely distinguish isomeric oligosaccharides                                 from the mass spectra, which have the same saccharide                                 composition but different types of sequences,                                 branches or linkages.                               In this paper, we describe a novel                                 algorithm for glycan characterization using tandem                                 mass spectrometry (MS/MS). This algorithm consists                                 of three steps. First we develop a scoring scheme                                 to identify potential bond linkages between monosaccharides,                                 based on the appearance pattern of cross-ring                                 ions. Next, we use a dynamic programming algorithm                                 to determine the most probable oligosaccharide                                 structures from the mass spectrum. Finally, we                                 re-evaluate these oligosaccharide structures,                                 taking into account the double fragmentation ions.                                 We also show the preliminary results of testing                                 our algorithm on several MS/MS spectra of oligosaccharides.                                |                                                                                    |                                                                                                                                                                     Authors:                                 Cortes, Simeon, Ruiz de Angulo,                                 Guieysse, Remaud-Simeon, Tran                               Motivation: Motion                                 is inherent in molecular interactions. Molecular                                 flexibility must be taken into account in order                                 to develop accurate computational techniques for                                 predicting interactions. Energy-based methods                                 currently used in molecular modeling (i.e. molecular                                 dynamics, Monte Carlo algorithms) are, in practice,                                 only able to compute local motions while accounting                                 for molecular flexibility. However, large-amplitude                                 motions often occur in biological processes. We                                 investigate the application of geometric path                                 planning algorithms to compute such large motions                                 in flexible molecular models. Our purpose is to                                 exploit the efficacy of a geometric conformational                                 search as a filtering stage before subsequent                                 energy refinements.                               Results: Two kinds                                 of large-amplitude motion are treated in this                                 paper: protein loop conformational changes (involving                                 protein backbone flexibility) and ligand trajectories                                 to deep active sites in proteins (involving ligand                                 and protein side-chain flexibility). First studies                                 performed using our two-stage approach (geometric                                 search followed by energy refinements) show that,                                 compared to classical molecular modeling methods,                                 quite similar results can be obtained with a performance                                 gain of several orders of magnitude. Furthermore,                                 our results also indicate that the geometric stage                                 can provide by itself highly valuable information                                 to biologists.                               Availability: The                                 algorithms have been implemented in the general-purpose                                 motion planning software Move3D (Simeon et al.,                                 2001). We are currently working on an optimized                                 stand-alone library that will be available to                                 the scientific community.                               Contact: {jcortes,simeon}@laas.fr                                 |                                                                                    |                                                                                                                                                                     Authors:                                 Ernst, Nau, Bar-Joseph                               Motivation: Time                                 series expression experiments are used to study                                 a wide range of biological systems. More than                                 80% of all time series expression datasets are                                 short (8 time points or fewer). These datasets                                 present unique challenges. Due to the large number                                 of genes profiled (often tens of thousands) and                                 the small number of time points many patterns                                 are expected to arise at random. Most clustering                                 algorithms are unable to distinguish between real                                 and random patterns.                                Results: We present                                 an algorithm specifically designed for clustering                                 short time series expression data. Our algorithm                                 works by assigning genes to a pre-defined set                                 of model profiles that capture the potential distinct                                 patterns that can be expected from the experiment.                                 We discuss how to obtain such a set of profiles                                 and how to determine the significance of each                                 of these profiles. Significant profiles are retained                                 for further analysis and can be combined to form                                 clusters. We tested our method on both simulated                                 and real biological data. Using immune response                                 data we show that our algorithm can correctly                                 detect the temporal profile of relevant functional                                 categories. Using GO analysis we                                 show that our algorithm outperforms both general                                 clustering algorithms and algorithms designed                                 specifically for clustering time series gene expression                                 data.                                Availability: Information                                 on obtaining a Java implementation with a Graphical                                 User Interface (GUI) is available from http://www.cs.cmu.edu/~jernst/st.                                                              Contact: jernst@cs.cmu.edu                                |                                                                                    |                                                                                                                                                                     Authors:                                 Edgar, Myers                               Repeated elements such as satellites                                 and transposons are ubiquitous in eukaryotic genomes.                                 De novo computational identification and classification                                 of such elements is a challenging problem, so                                 repeat annotation of sequenced genomes has historically                                 largely relied on sequence similarity to hand-curated                                 libraries of known repeat families. We present                                 a new approach to de novo repeat annotation that                                 exploits characteristic patterns of local alignments                                 induced by certain classes of repeats. We describe                                 PILER, a package of efficient search algorithms                                 for identifying such patterns. Novel repeats found                                 using PILER are reported for H. sapiens, A. thalania                                 and D. melanogaster. The software is freely available                                 at http://www.drive5.com/piler.  |                                                                                    |                                                                                                                                                                     Authors:                                 Noble, Kuehn, Thurman, Yu,                                 Stamatoyannopoulos                               In the living cell nucleus, genomic                                 DNA is packaged into chromatin. DNA sequences                                 that regulate transcription and other chromosomal                                 processes are associated with local disruptions,                                 or "openings," in chromatin structure                                 caused by the cooperative action of regulatory                                 proteins. Such perturbations are extremely specific                                 for em cis-regulatory elements, and occur over                                 short stretches of DNA (typically approximately                                 250 bp). They can be detected experimentally as                                 DNaseI hypersensitive sites (HSs) in vivo, though                                 the process is extremely laborious and costly.                                 The ability to discriminate DNaseI HSs computationally                                 would have a major impact on the annotation and                                 utilization of the human genome.                               We found that a supervised pattern                                 recognition algorithm, trained using a set of                                 280 DNaseI HS and 737 non-HS control sequences                                 from erythroid cells, was capable of de novo prediction                                 of HSs across the human genome with surprisingly                                 high accuracy determined by prospective in vivo                                 validation. Systematic application of this computational                                 approach will greatly facilitate discovery and                                 analysis of functional non-coding elements in                                 the human and other complex                                |                                                                                    |                                                                                                                                                                     Authors:                                 Tharakaraman, Marino-Ramirez, Sheetlin, Landsman,                                 Spouge                               Motivation: The                                 transcription start site (TSS) has been located                                 for an increasing number of genes across several                                 organisms. Statistical tests have shown that some                                 cis-acting regulatory elements have positional                                 preferences with respect to the TSS, but few strategies                                 have emerged for locating elements by their positional                                 preferences. This paper elaborates such a strategy.                                 First, we align promoter regions without gaps,                                 anchoring the alignment on each promoter’s                                 TSS. Second, we apply a novel word-specific mask.                                 Third, we apply a clustering test related to gapless                                 BLAST statistics. The test examines whether any                                 specific word is placed unusually consistently                                 with respect to the TSS. Finally, our program                                 A-GLAM, an extension of the GLAM program, uses                                 significant word positions as new “anchors”                                 to realign the sequences. A Gibbs sampling algorithm                                 then locates putative cis-acting regulatory elements.                                 Usually, Gibbs sampling requires a preliminary                                 masking step, to avoid convergence onto a dominant                                 but uninteresting signal from a DNA repeat. The                                 positional anchors focus A-GLAM on the motif of                                 interest, however, so masking DNA repeats during                                 Gibbs sampling becomes unnecessary.                                Results: In a set                                 of human DNA sequences with experimentally characterized                                 TSSs, the placement of 801 octonucleotide words                                 was unusually consistent (multiple-test corrected                                 p < 0.05). Alignments anchored on these words                                 sometimes located statistically significant motifs                                 inaccessible to GLAM or AlignACE.                                Availability: The                                 A-GLAM program and a list of statistically significant                                 words are available at: ftp://ftp.ncbi.nih.gov/pub/spouge/papers/archive/AGLAM.  |                                                                                    |                                                                                                                                                                     Authors:                                 Jothi, Kann, Przytycka                               Motivation: Uncovering                                 protein-protein interaction network is a fundamental                                 step in the quest for understanding the molecular                                 machinery of a cell. This motivates the search                                 for efficient computational methods for predicting                                 such interactions. Among the available predictors                                 are those that are based on the co-evolution hypothesis                                 "evolutionary trees of protein families (that                                 are known to interact) are expected to have similar                                 topologies." Many of these methods are limited                                 by the fact that they can only handle a small                                 number of protein sequences. Also, details on                                 evolutionary tree topology are missing as they                                 use similarity matrices in lieu of the trees.                                                              Results: We introduce                                 MORPH, a new algorithm for predicting protein                                 interaction partners between members of two protein                                 families that are known to interact. Our approach                                 can also be seen as a new method for searching                                 the best superposition of the corresponding evolutionary                                 trees based on tree automorphism group. We discuss                                 relevant facts related to predictability of protein-protein                                 interaction based on their co-evolution. When                                 compared to related computational approaches,                                 our method reduces the search space by about $3\times                                 10^5$-fold, and at the same time increases the                                 prediction accuracy of correct binding partners.                                                               Contact:                                 przytyck@mail.nih.gov                                                                  |                                                                                    |                                                                                                                                                                     Authors:                                 Zheng, Lenert, Sankoff                               Motivation: The                                 total order of the genes or markers on a chromosome                                 inherent in its representation as a signed permutation                                 must often be weakened to a partial order in the                                 case of real data. This is due to lack of resolution,                                 where several genes are mapped to the same chromosomal                                 position, to missing data from some of the datasets                                 used to compile a gene order, and to conflicts                                 between these datasets. The available genome rearrangement                                 algorithms, however, require total orders as input.                                 A more general ap-proach is needed to handle rearrangements                                 of gene partial orders.                                Results: We formalize                                 the uncertainty in gene order data by representing                                 a chromosome from each genome as a partial order,                                 summarized by a directed acyclic graph (DAG).                                 The rearrangement problem is then to infer a minimal                                 sequence of reversals for transforming any topological                                 sort of one DAG to any one of the other DAG. Each                                 topological sort represents a possible linearization                                 compatible with all the datasets on the chromosome.                                 The set of all possible topological sorts is embedded                                 in each DAG by appropriately augmenting the edge                                 set, so that it becomes a general directed graph                                 (DG). The DGs representing chromosomes of two                                 genomes are combined to produce a bicoloured graph                                 from which we extract a maximal decomposition                                 into alternating coloured cycles, from which an                                 optimal sequence of reversals can usually be identified.                                 We test this approach on simulated incomplete                                 comparative maps and on cereal chromosomal maps                                 drawn from the Gramene browser.  |                                                                                    |                                                                                                                                                                     Authors:                                 Yamanishi, Vert, Kanehisa                               The metabolic network is an important                                 biological network which relates enzyme proteins                                 and chemical compounds. A large number of metabolic                                 pathways remain unknown nowadays, and many enzymes                                 are missing even in known metabolic pathways.                                 There is therefore an incentive to develop methods                                 to reconstruct the unknown parts of the metabolic                                 network, and to identify genes coding for missing                                 enzymes.                               This paper presents new methods                                 to infer enzyme networks from the integration                                 of multiple genomic data and chemical information,                                 in the framework of supervised graph inference.                                 The originality of the methods is the introduction                                 of chemical compatibility as a constraint for                                 refining the network predicted by the network                                 inference engine. The chemical compatibility between                                 two enzymes is obtained automatically from the                                 information encoded by their EC numbers. The proposed                                 method are tested and compared on their ability                                 to infer the enzyme network of the yeast Saccharomyces                                 cerevisiae from four datasets for enzymes with                                 assigned EC numbers: gene expression data, protein                                 localization data, phylogenetic profiles, and                                 chemical compatibility information. It is shown                                 that the prediction accuracy of the network reconstruction                                 consistently improves due to the introduction                                 of chemical constraints, the use of a supervised                                 approach, and the weighted integration of multiple                                 datasets. Finally, we conduct a comprehensive                                 prediction of a global enzyme network consisting                                 all enzyme candidate proteins of the yeast to                                 obtain new biological findings.  |                                                                                    |                                                                                                                                                                     Authors:                                 Brejova, Brown, Li, Vinar                               We present ExonHunter, a new and                                 comprehensive gene finding system that outperforms                                 existing systems and features several new ideas                                 and approaches. Our system combines numerous sources                                 of information (genomic sequences, ESTs, and protein                                 databases of related species) into a gene finder                                 based on a hidden Markov model in a novel and                                 systematic way. In our framework, various sources                                 of information are expressed as partial probabilistic                                 statements about positions in the sequence and                                 their annotation. We then combine these into the                                 final prediction via a quadratic programming method,                                 which we show is an extension of existing methods.                                 Allowing only partial statements is key to our                                 transparent handling of missing information and                                 coping with the heterogeneous character of individual                                 sources of information. As well, we give a new                                 method for modeling the length distribution of                                 intergenic regions in hidden Markov models.                                  Results: On a commonly used test set, ExonHunter                                 performs significantly better than the existing                                 gene finders ROSETTA, SLAM, or TWINSCAN, with                                 more than two thirds of genes predicted completely                                 correctly.                                                                   Availability: Supplementary material                                 available at                                 http://www.bioinformatics.uwaterloo.ca/supplements/05eh                               Contact: bbrejova@uwaterloo.ca                                |                                                                                    |                                                                                                                                                                     Authors:                                 Nabieva, Jim, Agarwal, Chazelle,                                 Singh                               Motivation: Determining                                 protein function is one of the most important                                 problems in the post-genomic era. For the typical                                 proteome, there are no functional annotations                                 for one-third or more of its proteins. Recent                                 high-throughput experiments have determined proteome-scale                                 protein physical interaction maps for several                                 organisms. These physical interactions are complemented                                 by an abundance of data about other types of functional                                 relationships between proteins, including genetic                                 interactions, knowledge about co-expression, and                                 shared evolutionary history. Taken together, these                                 pairwise linkages can be used to build whole-proteome                                 protein interaction maps.                               Results: We develop                                 a network-flow based algorithm, FunctionalFlow,                                 that exploits the underlying structure of protein                                 interaction maps in order to predict protein function.                                 In cross-validation testing on the yeast proteome,                                 we show that FunctionalFlow has improved performance                                 over previous methods in predicting the function                                 of proteins with few (or no) annotated protein                                 neighbors. By comparing several methods that use                                 protein interaction maps to predict protein function,                                 we demonstrate that FunctionalFlow performs well                                 because it takes advantage of both network topology                                 and some measure of locality. Finally, we show                                 that performance can be improved substantially                                 as we consider multiple data sources and use them                                 to create weighted interaction networks.                               Availability: http://compbio.cs.princeton.edu/function.                               Contact: msingh@princeton.edu  |                                                                                    |                                                                                                                                                                     Authors:                                 Apostolico, Comin, Parida                                                              The discovery of motifs in biosequences                                 is frequently torn between the rigidity of the                                 model on the one hand and the abundance of candidates                                 on the other. Part of the problem seems rooted,                                 in the various characterizations offered for the                                 notion of a motif, that are typically based either                                 on syntax or on statistics alone. It seems worthwhile                                 to consider alternatives that result from a prudent                                 combination of these two aspects in the model.                                 We introduce and study a notion of extensible                                 motif in a sequence which tightly combines the                                 structure of the motif pattern, as described by                                 its syntactic specification, with the statistical                                 measure of its occurrence count. We show that                                 a combination of appropriate saturation conditions                                 (expressed in terms of minimum number of don't                                 cares compatible with a given list of of occurrences)                                 and the monotonicity of probabilistic scores over                                 regions of constant frequency afford us significant                                 parsimony in the generation and testing of candidate                                 over-represented motifs. The merits of the method                                 are documented by results obtained in implementation,                                 which specifically targeted protein sequence families.                                 In all cases tested, the motif reported in PROSITE                                 as most important in terms of functional/structural                                 relevance emerges among the top thirty extensible                                 motifs returned by our algorithm, often right                                 at the top. Of equal importance seems the fact                                 that the sets of all surprising motifs returned                                 in each experiment are extracted faster and come                                 in much more manageable sizes than would be obtained                                 in the absence of saturation constrains.                                |                                                                                    |                                                                                                                                                                     Authors:                                 Hu, Yan, Huang, Han, Zhou                                                               Motivation:                                 The rapid accumulation of biological network data                                 translates into an urgent need for computational                                 meth-ods for graph pattern mining. One important                                 problem is to identify recurrent patterns across                                 multiple networks to dis-cover biological modules.                                 However, existing algorithms for frequent pattern                                 mining become very costly in time and space as                                 the pattern sizes and network numbers increase.                                 Currently, no efficient algorithm is available                                 for mining re-current patterns across large collections                                 of genome-wide networks.                               Results: We developed                                 a novel algorithm, CODENSE, to efficiently mine                                 frequent coherent dense subgraphs across large                                 numbers of massive graphs. Compared to previous                                 methods, our approach is scalable in the number                                 and size of the input graphs, and adjustable in                                 terms of exact or ap-proximate pattern mining.                                 Applying CODENSE to 39 co-expression networks                                 derived from microarray data sets, we discovered                                 a large number of functionally homogenous clusters                                 and made functional predictions for 169 uncharac-terized                                 yeast genes.                                 Availability:                                 http://zhoulab.usc.edu/CODENSE                               Contact: xjzhou@usc.edu  |                                                                                    |                                                                                                                                                                     Authors:                                 Beissbarth, Tye-Din, Smyth, Speed, Anderson                                                              Motivation: T-cell                                 response to peptides bound on MHC Class~I or Class~II                                 molecules is essential for immune recognition                                 of pathogens. T-cells are activated by specific                                 peptide epitopes that are determined within the                                 antigen processing pathways and presented on the                                 surface of other cells bound to MHC molecules.                                 To determine which part of allergenic or pathogenic                                 proteins can stimulate T-cells is important for                                 the treatment of diseases. We sought to take advantage                                 of the falling cost of synthetic, screening grade                                 peptides, and devise a comprehensive, non-hypothesis-driven                                 screen for T-cell epitopes. We were interested                                 in the study of celiac disease (CD) and used the                                 ELISPOT technique to perform a large number of                                 T-cell assays. We therefore needed to compensate                                 for the lack of statistical data analysis methods                                 for ELISPOT assays.                               Results: We describe                                 a method to comprehensively screen for T-cell                                 epitopes within a family or group of proteins.                                 We have implemented an algorithm to generate a                                 set of unique short peptide sequences that incorporate                                 all possible epitopes within a group of proteins.                                 T-cell assays were performed in 96 well plates                                 using the ELISPOT assay to screen for responses                                 in CD patients against any epitopes in glutens.                                 We describe a statistical model to fit the data                                 and an EM (Expectation Maximization) algorithm                                 to estimate the parameters of interest and analyze                                 the resulting data.                               Availability: Implementations                                 of our algorithms in R or Perl are available at                                 http://bioinf.wehi.edu.au/folders/immunol.                                |                                                                                    |                                                                                                                                                                     Authors:                                 Yu, Chen                               The classification of high-dimensional                                 data is always a challenge to statistical machine                                 learning. We propose a novel method named shallow                                 feature selection (SFS) that assigns each feature                                 a probability of being selected based on the structure                                 of training data itself. Independent of particular                                 classifiers, the high dimension of biodata can                                 be fleetly reduced to an applicable case for consequential                                 processings. Moreover, to improve both efficiency                                 and performance of classification, these prior                                 probabilities will be further used to specify                                 the distributions of top-level hyperparameters                                 in hierarchical models of Bayesian neural network                                 (BNN), as well as the parameters in Gaussian process                                 models. Three BNN approaches are derived and then                                 applied to identify ovarian cancer from NCI's                                 high-resolution mass spectrometry data, which                                 yielded an excellent performance in 1,000 independent                                 k-fold cross validations (k=2,...,10). For instance,                                 an average sensitivity and specificity of 98.56%                                 and 98.42% have been achieved in the 2-fold cross                                 validations, furthermore, only one control and                                 one cancer are misclassified in the leave-one-out                                 cross validation. Some other popular classifiers                                 are also tested as a comparison.  |                                                                                    |                                                                                                                                                                     Authors:                                 Ralaivola, Chen, Phung, Bruand, Swamidass,                                 Baldi                               Motivation: Small                                 molecules play a fundamental role in organic chemistry                                 and biology. They can be used to probe biological                                 systems and to discover new drugs and other useful                                 compounds. As large datasets of small molecules                                 become increasingly available, it is necessary                                 to develop computational methods that can deal                                 with molecules of variable size and structure                                 and predict their chemical, physical, or biological                                 properties. Results: Here we develop several new                                 classes of kernels for small molecules using their                                 1D, 2D, and 3D representations. In 1D, we consider                                 string kernels based on SMILES strings. In 2D,                                 we introduce several similarity kernels based                                 on conventional or generalized fingerprints. Generalized                                 fingerprints are derived by counting in different                                 ways sub-paths contained in the graph of bonds,                                 using depth-first searches. In 3D, we consider                                 similarity measures between histograms of pairwise                                 distances between atom classes. These kernels                                 can be computed efficiently and are applied to                                 problems of classification and prediction of mutagenicity,                                 toxicity, and anti-cancer activity on three publicly                                 available datasets. The results derived using                                 cross validation methods are state-of-the-art.                                 Tradeoffs between various kernels are briefly                                 discussed.                               Availability: Datasets                                 available upon request.                               Contact: pfbaldi@ics.uci.edu  |                                                                                    |                                                      |                                                          |                              |                      |       |