ISMB 2006

ISCB

Hosted by

Embrapa

LNCC

Accepted Papers

 

 

Comparative Genomics

Hairpins in a haystack: recognizing microRNA precursors in comparative genomics data
Author(s):
Jana Hertel, Bioinformatics Group, Department of Computer Science, University of Leipzig, Germany
Peter F. Stadler, Bioinformatics Group, Department of Computer Science, University of Leipzig, Germany; Institute for Theoretical Chemistry, University of Vienna, Austria; Santa Fe Institute, New Mexico, USA

Recently, genome wide surveys for non-coding RNAs have provided evidence for tens of thousands of previously undescribed evolutionary conserved RNAs with distinctive secondary structures. The annotation of these putative ncRNAs, however, remains a difficult problem. Here we describe an SVM-based approach that, in conjunction with a non-stringent filter for consensus secondary structures, is capable of efficiently recognizing microRNA precursors in multiple sequence alignments. The software was applied to recent genome-wide RNAz surveys of mammals, urochordates, and nematodes.

Keywords: miRNA, support vector machine, non-coding RNA

Comparative genomics reveals unusually long motifs in mammalian genomes
Author(s):
Neil Jones, University of California San Diego, United States
Pavel Pevzner, University of California San Diego, United States

Motivation:
The recent discovery of the first small modulatory RNA (smRNA) presents the challenge of finding other molecules of similar length and conservation level. Unlike for short interfering RNA (siRNA) and micro-RNA (miRNA), no effective computational or experimental screening methods are currently known for this species of RNA molecule, and the discovery of the one known example was partly fortuitous because it happened to be complementary to a well-studied DNA binding motif (the Neuron Restrictive Silencer Element).

Results:
The existing comparative genomics approaches (e.g., phylogenetic footprinting) rely on alignments of orthologous regions across multiple genomes. This approach, while extremely valuable, is not suitable for finding motifs with highly diverged "non-alignable" flanking regions. Here we show that several unusually long and well conserved motifs can be discovered de novo through a comparative genomics approach that does not require an alignment of orthologous upstream regions. These motifs, including the Neuron Restrictive Silencer Element, were missed in recent comparative genomics studies that rely on phylogenetic footprinting. While the functions of these motifs remain unknown, we argue that some may represent biologically important sites.

Availability:
Our comparative genomics software, a web-accessible database of our results and a compilation of experimentally validated binding sites for NRSE can be found at http://www.cse.ucsd.edu/groups/bioinformatics.

Contact: ppevzner@cs.ucsd.edu.

Relative contributions of structural designability and functional diversity in fixation of gene duplicates
Author(s):
Boris Shakhnovich, Boston University, USA

Elucidating the governing laws, or even identifying the predominant trends, in gene family or protein evolution has been a formidable challenge in post-genomic biology. While the skewed distribution of folds and families was previously described, the key genetic mechanisms or family-specific characteristics that influence the generation of this distribution are as yet unknown. Furthermore, the extent of evolutionary pressure on duplicate genes, most often credited with the generation of new genetic material and family members, is hotly debated. In this paper we present evidence that duplicate genes have a variable probability of locus fixation correlated with the strength of selection. In turn, evolutionary pressure is influenced by innate characteristics of structural designability (e.g. the potential for sequence entropy) of the protein family. We further show that variability of pseudogene formation from gene duplicates can be directly tied to the size and designability of the family to which the genes belong.

Automatic clustering of orthologs and inparalogs shared by multiple proteomes
Author(s):
Andrey Alexeyenko, Stockholm Bioinformatics Center, Albanova, Stockholm University, Sweden
Ivica Tamas, Department of Molecular Biology & Functional Genomics, Stockholm University, Sweden
Gang Liu, Center for Genomics and Bioinformatics, Karolinska Institutet, Sweden
Erik Sonnhammer, Center for Genomics and Bioinformatics, Karolinska Institutet, Sweden

The complete sequencing of many genomes has made it possible to identify orthologous genes descending from a common ancestor. However, reconstruction of evolutionary history over long time periods faces many challenges due to gene duplications and losses. Identification of orthologous groups shared by multiple proteomes therefore becomes a clustering problem in which an optimal compromise between conflicting pieces of evidence needs to be found.
Here we present a new proteome-scale analysis program called MultiParanoid that can automatically find orthology relationships between proteins in multiple proteomes. The software is an extension of the InParanoid program that identifies orthologs and inparalogs in pairwise proteome comparisons. MultiParanoid applies a clustering algorithm to merge multiple pairwise ortholog groups from InParanoid into multi-species ortholog groups. To avoid outparalogs in the same cluster, MultiParanoid only combines species that share the same last ancestor.

To validate the clustering technique, we compared the results to a reference set obtained by manual phylogenetic analysis. We further compared the results to ortholog groups in KOGs and OrthoMCL, which revealed that MultiParanoid produces substantially fewer outparalogs than these resources.
MultiParanoid is a freely available standalone program that enables efficient orthology analysis much needed in the post-genomic era. A web-based service providing access to the original datasets, the resulting groups of orthologs, and the source code of the program can be found at http://multiparanoid.cgb.ki.se.

Keywords: orthology, paralogy, inparalog, outparalog, clustering, algorithm, last common ancestor, comparative genomics, Homo sapiens, C. elegans, D. melanogaster.

A Sequence-based filtering method for ncRNA identification and its application to searching for Riboswitch Elements
Author(s):
Shaojie Zhang, Department of Computer Science and Engineering, University of California, San Diego, U.S.A.
Ilya Borovok, Department of Molecular Microbiology and Biotechnology, Tel-Aviv University, Israel
Yair Aharonowitz, Department of Molecular Microbiology and Biotechnology, Tel-Aviv University, Israel
Roded Sharan, School of Computer Science, Tel-Aviv University, Israel
Vineet Bafna, Department of Computer Science and Engineering, University of California, San Diego, U.S.A.

Recent studies have uncovered an "RNA world", in which non-coding RNA (ncRNA) sequences play a central role in the regulation of gene expression. Computational studies on ncRNA have been directed toward developing detection methods for ncRNAs. State-of-the-art methods for the problem, like covariance models, suffer from high computational cost, underscoring the need for efficient filtering approaches that can identify promising sequence segments and accelerate the detection process. In this paper we make several contributions toward this goal. First, we formalize the concept of a filter and provide figures of merit that allow filters to be compared. Second, we design efficient sequence-based filters that dominate the current state-of-the-art HMM filters. Third, we provide a new formulation of the covariance model that allows speeding up RNA alignment. We demonstrate the power of our approach on both synthetic data and real bacterial genomes. We then apply our algorithm to the detection of novel riboswitch elements from whole bacterial and archaeal genomes. Our results point to a number of novel riboswitch candidates, and include genomes that were not previously known to contain riboswitches.

Finding novel genes in bacterial communities isolated from the environment
Author(s):
Lutz Krause, Bielefeld University, Center for Biotechnology (CeBiTec), Germany
Naryttza N. Diaz, Bielefeld University, Center for Biotechnology (CeBiTec), Germany
Daniela Bartels, Bielefeld University, Center for Biotechnology (CeBiTec), D-33594 Bielefeld, Germany
Robert A. Edwards, Fellowship for Interpretation of Genomes, Burr Ridge IL, United States
Alfred Pühler, Universität Bielefeld, Lehrstuhl für Genetik, Fakultät für Biologie D-33594 Bielefeld, Germany
Forest Rohwer, Department of Biology, San Diego State University, San Diego, CA, United States
Folker Meyer, Bielefeld University, Center for Biotechnology (CeBiTec), Germany
Jens Stoye, Universität Bielefeld, Technische Fakultät D-33594 Bielefeld, Germany

Motivation:
Novel sequencing techniques can give access to organisms that are difficult to cultivate using conventional methods. For example, the 454 pyrosequencing method can generate a large amount of data in a short time and at a low cost. When applied to environmental samples, the data generated have some drawbacks, e.g. short length of assembled contigs, in-frame stop codons and frame shifts. Unfortunately, current gene finders cannot circumvent these difficulties. On the other hand, high-throughput methods are needed to investigate special attributes of microbial communities. Some metagenomics analyses have already revealed interesting findings in the diversity and evolution of complex microbial communities. Therefore, the automated prediction of genes in the increasing amount of genomic sequence is a prerequisite for progress in metagenomics.

Results:
We introduce a novel gene finding algorithm that incorporates features overcoming the short length of the assembled contigs from environmental data, in-frame stop codons as well as frame shifts contained in bacterial sequences. The results show that by searching for sequence similarities in an environmental sample our algorithm is capable of detecting a high fraction of its gene content, depending on the species composition and the overall size of the sample. Therefore, the method is valuable for hunting novel unknown genes that may be specific for the habitat where the sample is taken. Finally, we show that our algorithm can even exploit the limited information contained in the short reads generated by the 454 technology for the prediction of protein coding genes.

Databases & Data Integration

An experimental metagenome data management and analysis system
Author(s):
Victor Markowitz, Biological Data Management and Technology Center, Lawrence Berkeley National Lab, USA
Natalia Ivanova, Genome Biology Program, Joint Genome Institute, USA
Krishna Palaniappan, Biological Data Management and Technology Center, Lawrence Berkeley National Lab, USA
Ernest Szeto, Biological Data Management and Technology Center, Lawrence Berkeley National Lab, USA
Frank Korzeniewski, Biological Data Management and Technology Center, Lawrence Berkeley National Lab, USA
Athanasios Lykidis, Genome Biology Program, Joint Genome Institute, USA
Iain Anderson, Genome Biology Program, Joint Genome Institute, USA
Konstantinos Mavrommatis, Genome Biology Program, Joint Genome Institute, USA
Victor Kunin, Microbial Ecology Program, Joint Genome Institute, USA
Hector Garcia Martin, Microbial Ecology Program, Joint Genome Institute, USA
Inna Dubchak, Genomics Division, Lawrence Berkeley National Lab, USA
Phil Hugenholtz, Microbial Ecology Program, Joint Genome Institute, USA
Nikos Kyrpides, Genome Biology Program, Joint Genome Institute, USA

The application of shotgun sequencing to environmental samples has revealed a new universe of microbial community genomes (metagenomes) involving previously uncultured organisms. Metagenome analysis, which is expected to provide a comprehensive picture of the gene functions and metabolic capacity of a microbial community, needs to be conducted in the context of a comprehensive data management and analysis system. In this paper we present IMG/M, an experimental metagenome data management and analysis system that is based on the Integrated Microbial Genomes (IMG) system. IMG/M provides tools and viewers for analyzing both metagenomes and isolate genomes individually or in a comparative context.

Distance based algorithms for small biomolecule classification and structural similarity search
Author(s):
Emre Karakoc, Simon Fraser University, Canada
Artem Cherkasov, University of British Columbia, Canada
S. Cenk Sahinalp, Simon Fraser University, Canada

Structural similarity search among small molecules is a standard tool used in molecular classification and in-silico drug discovery. The effectiveness of this general approach depends on how well two problems are addressed. First, the notion of similarity should be chosen to provide the highest level of discrimination of compounds with respect to the bioactivity of interest. Second, the data structure for performing the search should be very efficient, as the molecular databases of interest include several million compounds.

In this paper we focus on the k-nearest-neighbor (k-nn) search method, which, until recently, was not considered for small molecule classification. The few recent applications of k-nn to compound classification focus on selecting the most relevant set of chemical descriptors, which are then compared under the standard Minkowski distance L_p. Here we show how to computationally design the optimal "weighted" Minkowski distance wL_p for maximizing the discrimination between active and inactive compounds with respect to the bioactivities of interest. We then show how to construct pruning-based k-nn search data structures for any wL_p distance that minimize similarity search time.
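
As a rough illustration of the idea of k-nn classification under a weighted Minkowski distance, the sketch below may help. It is not the authors' implementation: the distance form (sum of w_i * |x_i - y_i|^p, raised to 1/p), the function names, the hand-picked weights and the toy descriptor vectors are all assumptions made for this example. The abstract's weight optimization and pruning-based search structures are not shown; this sketch uses a brute-force scan and a simple majority vote.

```python
from collections import Counter


def weighted_minkowski(x, y, weights, p=2.0):
    """Assumed form of wL_p: (sum_i w_i * |x_i - y_i|^p)^(1/p)."""
    total = sum(w * abs(a - b) ** p for w, a, b in zip(weights, x, y))
    return total ** (1.0 / p)


def knn_classify(query, training, weights, k=3, p=2.0):
    """Brute-force k-nn majority vote.

    training is a list of (descriptor_vector, label) pairs (hypothetical data).
    """
    # Sort all training compounds by their weighted distance to the query.
    ranked = sorted(
        training,
        key=lambda item: weighted_minkowski(query, item[0], weights, p),
    )
    # Vote among the k closest compounds.
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy data: two descriptors per compound, labels are illustrative only.
training = [
    ((0.0, 0.0), "inactive"),
    ((0.1, 5.0), "inactive"),
    ((1.0, 0.2), "active"),
    ((1.1, 4.0), "active"),
    ((0.9, 9.0), "active"),
]

# Weights (1.0, 0.0) make the first descriptor the only one that counts.
prediction = knn_classify((0.95, 0.0), training, weights=(1.0, 0.0), k=3)
```

Changing the weights changes which compounds count as neighbours, which is why the choice of weights matters for discrimination between active and inactive compounds.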

The accuracy achieved by our classifier is better than the alternative LDA and MLR approaches and is comparable to the ANN methods. In terms of running time, our classifier is considerably faster than the ANN approach especially when large data sets are used. Furthermore, our classifier quantifies the level of bioactivity rather than returning a binary decision and thus is more informative than the ANN approach.

springScape: Visualisation of microarray and contextual bioinformatic data using spring embedding and an information landscape
Author(s):
Timothy Ebbels, Department of Computer Science, University College London, UK
Bernard Buxton, Department of Computer Science, University College London, UK
David Jones, Department of Computer Science, University College London, UK

The interpretation of microarray and other high-throughput data is highly dependent on the biological context of experiments. However, standard analysis packages are poor at simultaneously presenting both the array and related bioinformatic data. We have addressed this challenge by developing a system, springScape, based on ‘spring embedding’ and an ‘information landscape’, allowing several related data sources to be dynamically combined while highlighting one particular feature.

Each data source is represented as a network of nodes connected by weighted edges. The networks are combined and embedded in the 2-D plane by spring embedding such that nodes with a high similarity are drawn close together. Complex relationships can be discovered by varying the weight of each data source and observing the dynamic response of the spring network. By modifying Procrustes analysis, we find that the visualizations have an acceptable degree of reproducibility. The ‘information landscape’ highlights one particular data source, displaying it as a smooth surface whose height is proportional to both the information being viewed and the density of nodes. The algorithm is demonstrated using several microarray data sets in combination with protein-protein interaction data and GO annotations. Among the features revealed are the spatio-temporal profile of gene expression and the identification of GO terms correlated with gene expression and protein interactions. The power of this combined display lies in its interactive feedback and exploitation of human visual pattern recognition. Overall, springScape shows promise as a tool for the interpretation of microarray data in the context of relevant bioinformatic information.

SNP Function Portal: a web database for exploring the function implication of SNP alleles
Author(s):
Pinglang Wang, University of Michigan, United States
Manhong Dai, University of Michigan, United States
Weijian Xuan, University of Michigan, United States
Richard C McEachin, University of Michigan, United States
Anne U Jackson, University of Michigan, United States
Laura J Scott, University of Michigan, United States
Brian Athey, University of Michigan, United States
Stanley J. Watson, University of Michigan, United States
Fan Meng, University of Michigan, United States

Motivation:
Finding the potential functional significance of SNPs is a major bottleneck in understanding genome-wide SNP scanning results, as the related functional data are distributed across many different databases. The SNP Function Portal is designed to be a clearing house for all public domain SNP function annotation data, as well as in-house functional annotations derived from data from different sources. It currently contains SNP function annotations in six major categories: genomic elements, transcription regulation, protein function, pathway, disease and population genetics. Besides extensive SNP function annotations, the SNP Function Portal includes a powerful search engine that accepts different types of genetic markers as input and identifies all genetically related SNPs based on HapMap II data as well as the relationship of different markers to known genes. As a result, our system allows users to search the potential biological impact of any genetic marker(s) and investigate complex relationships among genetic markers and genes, and it greatly facilitates the understanding of genome-wide SNP scanning results.

Availability:
http://brainarray.mbni.med.umich.edu/Brainarray/Database/SearchSNP/snpfunc.aspx

Contact: mengf@umich.edu

Integrating structured biological data by kernel Maximum Mean Discrepancy
Author(s):
Karsten Borgwardt, University of Munich, Germany
Arthur Gretton, MPI Tuebingen, Germany
Malte Rasch, TU Graz, Austria
Hans-Peter Kriegel, University of Munich, Germany
Bernhard Schoelkopf, MPI Tuebingen, Germany
Alex Smola, National ICT Australia, Canberra, Australia

Motivation:
Many problems in data integration in bioinformatics can be posed as one common question: Are two sets of observations generated by the same distribution? We propose a kernel-based statistical test for this problem, based on the fact that two distributions are different if and only if there exists at least one function having different expectation on the two distributions. Consequently, we use the maximum discrepancy between function means as the basis of a test statistic. The Maximum Mean Discrepancy (MMD) can take advantage of the kernel trick, which allows us to apply it not only to vectors, but also to strings, sequences, graphs, and other common structured data types arising in molecular biology.
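
To make the statistic concrete, here is a minimal sketch of a (biased) empirical estimate of the squared MMD between two samples. It is only an illustration under stated assumptions: the Gaussian kernel, the bandwidth `sigma`, the function names and the toy samples are choices made for this example and are not taken from the paper. The sketch computes the statistic only; the significance threshold that turns it into a test is not shown.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel on numeric tuples; sigma is an assumed bandwidth."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def mean_kernel(xs, ys, kernel):
    """Average kernel value over all pairs (x, y) with x in xs and y in ys."""
    return sum(kernel(x, y) for x in xs for y in ys) / (len(xs) * len(ys))


def mmd_squared(xs, ys, kernel=gaussian_kernel):
    """Biased estimate: mean k(x, x') + mean k(y, y') - 2 * mean k(x, y)."""
    return (
        mean_kernel(xs, xs, kernel)
        + mean_kernel(ys, ys, kernel)
        - 2.0 * mean_kernel(xs, ys, kernel)
    )


# Toy one-dimensional samples (illustrative values only).
sample_a = [(0.0,), (0.1,), (-0.1,)]
sample_b = [(3.0,), (3.1,), (2.9,)]

same = mmd_squared(sample_a, sample_a)       # identical samples
different = mmd_squared(sample_a, sample_b)  # clearly shifted sample
```

Because only kernel evaluations are needed, swapping `gaussian_kernel` for a string or graph kernel would, in principle, extend the same computation to structured data.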

Results:
We study the practical feasibility of an MMD-based test on three central data integration tasks: Testing cross-platform comparability of microarray data, cancer diagnosis, and data-content based schema matching for two different protein function classification schemas. In all of these experiments, including high-dimensional ones, MMD is very accurate in finding samples that were generated from the same distribution, and outperforms or is as good as its best competitors.

Conclusions:
We have defined a novel statistical test of whether two samples are from the same distribution, compatible with both multivariate and structured data, that is fast, easy to implement, and works well, as confirmed by our experiments.

Availability: http://www.dbs.ifi.lmu.de/~borgward/MMD

Evolution and Phylogeny

Constructing near-perfect phylogenies with multiple homoplasy events
Author(s):
Ravi Vijaya Satya, School of EECS, University of Central Florida, Orlando, FL, USA
Amar Mukherjee, School of EECS, University of Central Florida, Orlando, FL, USA
Gabriela Alexe, IBM T.J. Watson Research Center, Yorktown Heights, NY, USA
Laxmi Parida, IBM T.J. Watson Research Center, Yorktown Heights, NY, USA
Gyan Bhanot, IBM T.J. Watson Research Center, Yorktown Heights, NY, USA

Motivation:
We explore the problem of constructing near-perfect phylogenies on bi-allelic haplotypes, where the deviation from perfect phylogeny is entirely due to homoplasy events. We present polynomial-time algorithms for restricted versions of the problem. We show that these algorithms can be extended to genotype data, in which case the problem is called the near-perfect phylogeny haplotyping (NPPH) problem. We present a near-optimal algorithm for the H1-NPPH problem, which is to determine if a given set of genotypes admits a phylogeny with a single homoplasy event. The time complexity of our algorithm for the H1-NPPH problem is O(m^2(n+m)), where n is the number of genotypes and m is the number of SNP sites. This is a significant improvement over the earlier O(n^4) algorithm.

We also introduce generalized versions of the problem. The H(1,q)-NPPH problem is to determine if a given set of genotypes admits a phylogeny with q homoplasy events, such that all the homoplasy events occur in a single site. We present an O(m^{q+1}(n+m)) algorithm for the H(1,q)-NPPH problem.

Results:
We present results on simulated data, which demonstrate that the accuracy of our algorithm for the H1-NPPH problem is comparable to that of the existing methods, while being orders of magnitude faster.

Availability:
The implementation of our algorithm for the H1-NPPH problem is available upon request.

Contact: rvijaya@cs.ucf.edu

BNTagger: Improved tagging SNP selection using Bayesian networks
Author(s):
Phil Hyoun Lee, School of Computing, Queen's University, Canada
Hagit Shatkay, School of Computing, Queen's University, Canada

Genetic variation analysis holds much promise as a basis for disease-gene association. However, due to the tremendous number of candidate single nucleotide polymorphisms (SNPs), there is a clear need to expedite genotyping by selecting and considering only a subset of all SNPs. This process is known as tagging SNP selection. Several methods for tagging SNP selection have been proposed, and have shown promising results. However, most of them rely on strong assumptions such as prior block-partitioning, bi-allelic SNPs, or a fixed number or location of tagging SNPs.

We introduce BNTagger, a new method for tagging SNP selection, based on conditional independencies among SNPs. Using the formalism of Bayesian networks (BNs), our system aims to select a subset of independent and highly predictive SNPs. Similar to previous prediction-based methods, we aim to maximize the prediction accuracy of tagging SNPs, but unlike them, we neither fix the number nor the location of predictive tagging SNPs, nor require SNPs to be bi-allelic. In addition, for newly genotyped samples, BNTagger directly uses genotype data as input, while producing as output haplotype data of all SNPs.

Using three public data sets, we compare the prediction performance of our method to that of three state-of-the-art tagging SNP selection methods. The results demonstrate that our method consistently improves upon previous methods in terms of prediction accuracy. Moreover, our method retains its good performance even when a very small number of tagging SNPs are used.

Mutation parameters from DNA sequence data using graph theoretic measures on lineage trees
Author(s):
Reuma Magori Cohen, Bar Ilan University, Israel
Yoram Louzoun, Bar Ilan University, Israel
Steven Kleinstein, Princeton University, USA

Motivation:
B cells responding to antigenic stimulation can fine-tune their binding properties through a process of affinity maturation, composed of somatic hypermutation, affinity-selection and clonal expansion. The mutation rate of the B cell receptor DNA sequence, and the effect of these mutations on affinity and specificity are of critical importance for understanding immune and autoimmune processes. Unbiased estimates of these properties are currently lacking due to the short time-scales involved and the small numbers of sequences available.

Results:
We have developed a bioinformatic method based on a maximum likelihood analysis of phylogenetic lineage trees to estimate the parameters of a B cell clonal expansion model, which includes somatic hypermutation with the possibility of lethal mutations. Lineage trees are created from clonally related B cell receptor DNA sequences. Important links between tree shapes and underlying model parameters are identified using mutual information. Parameters are estimated using a likelihood function based on the joint distribution of several tree shapes, without requiring a priori knowledge of the number of generations in the clone (which is not available for rapidly dividing populations in vivo). A systematic validation on synthetic trees produced by a mutating birth-death process simulation shows that our estimates are precise and robust to several underlying assumptions. These methods are applied to experimental data from autoimmune mice to demonstrate the existence of hypermutating B cells in an unexpected location in the spleen.

Human Health

Predicting the prognosis of breast cancer by integrating clinical and microarray data with Bayesian networks
Author(s):
Olivier Gevaert, Department of Electrical Engineering ESAT-SCD Katholieke Universiteit Leuven, Belgium
Frank De Smet, Department of Electrical Engineering ESAT-SCD Katholieke Universiteit Leuven, Belgium
Dirk Timmerman, Department of obstetrics and gynecology, University Hospital Gasthuisberg, Katholieke Universiteit Leuven, Belgium
Yves Moreau, Department of Electrical Engineering ESAT-SCD Katholieke Universiteit Leuven, Belgium
Bart De Moor, Department of Electrical Engineering ESAT-SCD Katholieke Universiteit Leuven, Belgium

Motivation:
Clinical data, such as patient history, laboratory analysis and ultrasound parameters, which are the basis of day-to-day clinical decision support, are often neglected in guiding the clinical management of cancer when microarray data are present. We propose a strategy based on Bayesian networks to treat clinical and microarray data on an equal footing. The main advantage of this probabilistic model is that it allows these data sources to be integrated in several ways and that it allows the model structure and parameters to be investigated and understood. Furthermore, using the concept of a Markov blanket, we can identify all the variables that shield off the class variable from the influence of the remaining network. Therefore, Bayesian networks automatically perform feature selection by identifying the (in)dependency relationships with the class variable.
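
The Markov blanket of a node in a Bayesian network consists of its parents, its children, and the other parents of its children. The short sketch below computes it for a small directed acyclic graph. The network structure and the variable names (`prognosis`, `grade`, `gene_A`, and so on) are hypothetical and chosen only for illustration; they are not the network learned in the paper.

```python
def markov_blanket(parents, node):
    """Return the Markov blanket of node.

    parents maps every node of a DAG to the set of its parent nodes.
    """
    # Children are the nodes that list `node` among their parents.
    children = {child for child, ps in parents.items() if node in ps}
    # Co-parents are the other parents of those children.
    co_parents = {p for child in children for p in parents[child]}
    blanket = set(parents.get(node, set())) | children | co_parents
    blanket.discard(node)
    return blanket


# Hypothetical network mixing clinical and expression variables.
network = {
    "age": set(),
    "grade": {"age"},
    "prognosis": {"grade"},
    "gene_B": set(),
    "gene_A": {"prognosis", "gene_B"},
    "gene_C": {"gene_A"},
}

selected = markov_blanket(network, "prognosis")
```

In this toy network, `age` and `gene_C` fall outside the blanket of `prognosis`, which mirrors how the blanket acts as a feature selector for the class variable.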

Results:
We evaluated three methods for integrating clinical and microarray data (decision integration, partial integration and full integration) and used them to classify breast cancer patients from publicly available data into a poor and a good prognosis group. The partial integration method is most promising and has an independent test set area under the ROC curve of 0.845. After choosing an operating point, the classification performance is better than frequently used indices.

AClAP, Autonomous hierarchical agglomerative Cluster Analysis based Protocol to partition conformational datasets
Author(s):
Giovanni Bottegoni, Dept. of Pharmaceutical Sciences - University of Bologna, Italy
Walter Rocchia, NEST - Scuola Normale Superiore of Pisa, Italy
Maurizio Recanatini, Dept. of Pharmaceutical Sciences - University of Bologna, Italy
Andrea Cavalli, Dept. of Pharmaceutical Sciences - University of Bologna, Italy

Motivation:
Sampling the conformational space is a fundamental step for both ligand- and structure-based drug design. However, the rational organization of different molecular conformations still remains a challenge. In fact, for drug design applications, the sampling process provides a redundant conformation set whose thorough analysis can be intensive, or even prohibitive. We propose a statistical approach based on cluster analysis aimed at rationalizing the output of methods such as Monte Carlo, genetic, and reconstruction algorithms. Although some software already implements clustering procedures, at present, a universally accepted protocol is still missing.

Results:
We integrated hierarchical agglomerative cluster analysis with a clusterability assessment method and a user independent cutting rule, to form a global protocol that we implemented in a MATLAB metalanguage program (AClAP). We tested it on the conformational space of a quite diverse set of drugs generated via Metropolis Monte Carlo simulation, and on the poses we obtained by reiterated docking runs performed by four widespread programs. In our tests, AClAP proved to remarkably reduce the dimensionality of the original datasets at a negligible computational cost. Moreover, when applied to the outcomes of many docking programs together, it was able to point to the crystallographic pose.

Availability:
AClAP is available at the “AClAP” section of the website http://www.scfarm.unibo.it

Contact: andrea.cavalli@unibo.it

Supplementary Information:
The complete series of AClAP results is available in the “services” section of the website http://www.scfarm.unibo.it.

Integrating copy number polymorphisms into array CGH analysis using a robust HMM
Author(s):
Sohrab Shah, University of British Columbia, Canada
Xiang Xuan, University of British Columbia, Canada
Ron DeLeeuw, British Columbia Cancer Research Centre, Canada
Mehrnoush Khojasteh, British Columbia Cancer Research Centre, Canada
Wan Lam, British Columbia Cancer Research Centre, Canada
Raymond Ng, University of British Columbia, Canada
Kevin Murphy, University of British Columbia, Canada

Array comparative genomic hybridization (aCGH) is a pervasive technique used to identify chromosomal aberrations in human diseases, including cancer. Aberrations are defined as regions of increased or decreased copy number, relative to a normal sample. Accurately identifying the locations of these aberrations has many important medical applications. Unfortunately, the observed copy number changes are often corrupted by various sources of noise, making the boundaries hard to detect. One popular current technique uses hidden Markov models (HMMs) to segment the signal into regions of constant copy number; a subsequent classification phase labels each region as a gain, a loss or neutral. Unfortunately, standard HMMs are sensitive to outliers, causing oversegmentation. We propose a simple modification that makes the HMM more robust to such single clone outliers. More importantly, this modification allows us to exploit prior knowledge about the likely location of such "outliers", which are often due to copy number polymorphisms (CNPs). By "explaining away" these outliers, we can focus attention on more interesting aberrated regions. We show significant improvements over the current state-of-the-art technique (DNAcopy with MergeLevels) on some previously used synthetic data, augmented with outliers. We also show modest gains on the well-studied H526 lung cancer cell line data, and argue why we expect more substantial gains on other data sets in the future.

Source code written in Matlab is available from http://www.cs.ubc.ca/~sshah/acgh

Decoding non-unique oligonucleotide hybridization experiments of targets related by a phylogenetic tree
Author(s):
Alexander Schliep, Max Planck Institute for Molecular Genetics, Germany
Sven Rahmann, Bielefeld University, Germany

Motivation:
The reliable identification of the presence or absence of biological agents ("targets"), such as viruses or bacteria, is crucial for many applications from health care to biodiversity. If the genomic sequences of the targets are known, hybridization reactions between oligonucleotide probes and targets performed on suitable DNA microarrays allow presence or absence to be inferred from the observed pattern of hybridization. Targets, for example all known strains of HIV, are often closely related, and finding unique probes becomes impossible. Using non-unique oligonucleotides with more advanced decoding techniques from statistical group testing allows known targets to be detected with great success. Of great relevance, however, is the problem of identifying the presence of previously unknown targets or of targets that evolve rapidly.

Results:
We present the first approach to decode hybridization experiments using non-unique probes when targets are related by a phylogenetic tree. By use of a Bayesian framework and a Markov chain Monte Carlo approach we are able to identify over 95% of known targets and assign up to 70% of unknown targets to their correct clade in hybridization simulations on biological and simulated data.

Availability:
Software implementing the method described in this paper and datasets are available from http://algorithmics.molgen.mpg.de/probetrees.

Keywords: virus detection, probe design, phylogeny, MCMC

Molecular and Supramolecular Dynamics

DynaPred: A structure and sequence based method for the prediction of MHC class I binding peptide sequences and conformations
Author(s):
Iris Antes, MPI fuer Informatik, Stuhlsatzenhausweg 85, D-66123 Saarbruecken, Germany
Shirley Siu, Universitaet des Saarlandes, D-66041 Saarbruecken, Germany
Thomas Lengauer, MPI fuer Informatik, Stuhlsatzenhausweg 85, D-66123 Saarbruecken, Germany

We developed an SVM-trained, quantitative matrix-based method for the prediction of MHC class I binding peptides, in which the features of the scoring matrix are energy terms retrieved from molecular dynamics simulations. At the same time we use the equilibrated structures obtained by the same simulations in a simple and efficient docking procedure. Our method consists of two steps: first, we predict potential binders from sequence data alone; second, we construct protein-peptide complexes for the predicted binders. So far, we have tested our approach on the HLA-A0201 allele. We constructed two prediction models, using local, position-dependent (DynaPredPOS) and global, position-independent (DynaPred) features. The former model outperformed two sequence-based methods used in the evaluation; the latter showed a slightly lower performance (5% less accuracy), but a much higher generalizability towards other alleles than the position-dependent models. The constructed peptide conformations can be refined within seconds to structures with an average RMSD from the corresponding experimental structures of 1.53 Å for the peptide backbone and 1.1 Å for buried side chain atoms.

Ontologies

A top-level ontology of functions and its application in the Open Biomedical Ontologies
Author(s):
Patryk Burek, University of Leipzig, Germany
Robert Hoehndorf, University of Leipzig, Max Planck Institute for Evolutionary Anthropology, Germany
Frank Loebe, University of Leipzig, Germany
Johann Visagie, Max Planck Institute for Evolutionary Anthropology, Germany
Heinrich Herre, University of Leipzig, Germany
Janet Kelso, Max Planck Institute for Evolutionary Anthropology, Germany

Motivation:
A clear understanding of functions in biology is a key component in accurate modelling of molecular, cellular and organismal biology. Using the existing biomedical ontologies it has been impossible to capture the complexity of the community's knowledge about biological functions.

Results:
We present here a top-level ontological framework for representing knowledge about biological functions. This framework lends greater accuracy, power and expressiveness to biomedical ontologies by providing a means to capture existing functional knowledge in a more formal manner. An initial major application of the ontology of functions is the provision of a principled way in which to curate functional knowledge and annotations in biomedical ontologies. Further potential applications include the facilitation of ontology interoperability and automated reasoning. A major advantage of the proposed implementation is that it is an extension to existing biomedical ontologies, and can be applied without substantial changes to these domain ontologies.

Availability:
The Ontology of Functions (OF) can be downloaded in OWL format from http://onto.eva.mpg.de/. Additionally, a UML profile and supplementary information and guides for using the OF can be accessed from the same website.

Contact: bioonto@lists.informatik.uni-leipzig.de

Keywords: knowledge representation, ontology, top-level ontology, biological function

An ontology for a robot scientist
Author(s):
Larisa Soldatova, The University of Wales, Aberystwyth, UK
Amanda Clare, The University of Wales, Aberystwyth, UK
Andrew Sparkes, The University of Wales, Aberystwyth, UK
Ross King, The University of Wales, Aberystwyth, UK

Motivation:
A Robot Scientist is a physically implemented robotic system that can automatically carry out cycles of scientific experimentation. We are commissioning a new Robot Scientist designed to investigate gene function in S. cerevisiae. This Robot Scientist will be capable of initiating >1,000 experiments, and making >200,000 observations a day. Robot Scientists provide a unique test bed for the development of methodologies for the curation and annotation of scientific experiments: because the experiments are conceived and executed automatically by computer, it is possible to completely capture and digitally curate all aspects of the scientific process. This new ability brings with it significant technical challenges. To meet these we apply an ontology-driven approach to the representation of all the Robot Scientist's data and metadata.

Results:
We demonstrate the utility of developing an ontology for the new Robot Scientist. This ontology is based on a general ontology of experiments. The ontology aids the curation and annotation of the experimental data and metadata and of the equipment metadata, and it supports the design of database systems to hold the data and metadata.

Availability:
EXPO in XML and OWL formats is at: http://sourceforge.net/projects/expo/.
All materials about the Robot Scientist project are available at: www.aber.ac.uk/compsci/Research/bio/robotsci/.

Protein classification using ontology classification
Author(s):
Katherine Wolstencroft, University of Manchester, UK
Phillip Lord, University of Newcastle, UK
Lydia Tabernero, University of Manchester, UK
Andy Brass, University of Manchester, UK
Robert Stevens, University of Manchester, UK

Motivation:
The classification of proteins expressed by an organism is an important step in understanding the molecular biology of that organism. Traditionally, this classification has been performed by human experts. Human knowledge can recognise the functional properties that are sufficient to place an individual gene product into a particular protein family group. Automation of this task usually fails to meet the 'gold standard' of the human annotator because of the difficult recognition stage. The growing number of genomes, the rapid changes in knowledge and the central role of classification in the annotation process, however, motivate the need to automate this process.

Results:
We capture human understanding of how to recognise members of the protein phosphatase family by domain architecture as an ontology. By describing protein instances in terms of the domains they contain, it is possible to use description logic reasoners and our ontology to assign those proteins to a protein family class. We have tested our system on classifying the protein phosphatases of the human and Aspergillus fumigatus genomes and found that our knowledge-based, automatic classification matches, and sometimes surpasses, that of the human annotators. We have made the classification process fast and reproducible and, where appropriate knowledge is available, the method can potentially be generalised for use with any protein family.
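
The idea of assigning a protein to a family from the domains it contains can be sketched in a few lines of Python. This is only an illustration under stated assumptions: plain set comparison stands in for a description logic reasoner, and the family names, domain names, and definitions below are hypothetical simplifications, not the authors' ontology.

```python
# Hypothetical, simplified family definitions: each family lists domains that
# must be present ("required") and domains that must be absent ("forbidden").
FAMILY_DEFINITIONS = {
    "receptor tyrosine phosphatase": {
        "required": {"tyrosine phosphatase catalytic", "transmembrane"},
        "forbidden": set(),
    },
    "non-receptor tyrosine phosphatase": {
        "required": {"tyrosine phosphatase catalytic"},
        "forbidden": {"transmembrane"},
    },
}

def classify(domains):
    """Return every family whose definition the given domain set satisfies."""
    domains = set(domains)
    return sorted(
        name
        for name, definition in FAMILY_DEFINITIONS.items()
        if definition["required"] <= domains
        and not (definition["forbidden"] & domains)
    )

print(classify(["tyrosine phosphatase catalytic", "transmembrane", "fibronectin"]))
print(classify(["tyrosine phosphatase catalytic"]))
print(classify(["kinase"]))
```

A real reasoner works under the open-world assumption, so stating that a domain is absent needs explicit closure axioms in the ontology; the "forbidden" sets above only imitate that effect.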

Proteomics

Semi-Supervised LC/MS alignment for differential proteomics
Author(s):
Bernd Fischer, ETH Zurich / Institute of Computational Science, Switzerland
Jonas Grossmann, ETH Zurich / Institute of Plant Biotechnology, Switzerland
Volker Roth, ETH Zurich / Institute of Computational Science, Switzerland
Wilhelm Gruissem, ETH Zurich / Institute of Plant Biotechnology, Switzerland
Sacha Baginsky, ETH Zurich / Institute of Plant Biotechnology, Switzerland
Joachim M. Buhmann, ETH Zurich / Institute of Computational Science, Switzerland

Motivation:
Mass spectrometry (MS) combined with high-performance liquid chromatography (LC) has received considerable attention for high-throughput analysis of the proteome. Isotopic labeling techniques such as ICAT have been successfully applied to derive differential quantitative information for two protein samples, but at the price of significantly increased complexity of the experimental setup. To overcome these limitations, we consider a label-free setting where correspondences between elements of two samples have to be established prior to the comparative analysis. The alignment between samples is achieved by nonlinear robust ridge regression. The correspondence estimates are guided in a semi-supervised fashion by prior information which is derived from sequenced tandem mass spectra.

Results:
The semi-supervised method for finding correspondences is successfully applied to aligning highly complex protein samples, even if they exhibit large variations due to different experimental conditions. A large-scale experiment clearly demonstrates that the proposed method bridges the gap between statistical data analysis and label-free quantitative differential proteomics.

Keywords: Semi-Supervised Learning, Alignment, Differential Proteomics

Annotating proteins by mining protein interaction networks
Author(s):
Mustafa Kirac, Case Western Reserve University, US
Gultekin Ozsoyoglu, Case Western Reserve University, US
Jiong Yang, Case Western Reserve University, US

Motivation:
In general, the most accurate gene/protein annotations are provided by curators. Although computationally derived annotations carry weaker evidence, computational methods are indispensable for fast, a priori discovery of protein function annotations. This paper considers the problem of assigning Gene Ontology (GO) annotations to partially annotated or newly discovered proteins.

Results:
We present a data mining technique that computes the probabilistic relationships between GO annotations of proteins on protein-protein interaction data, and assigns highly correlated GO terms of annotated proteins to non-annotated proteins in the target set. In comparison with other techniques, the probabilistic suffix tree and correlation mining techniques produce the highest prediction accuracy: 81% precision at 45% recall.

Availability:
Code is available upon request. Results and the materials used are available online at http://kirac.case.edu/PROTAN
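
The general idea of transferring GO terms across interactions can be sketched as follows. This is a deliberately simple stand-in that uses co-occurrence counts over interacting pairs; it is not the probabilistic suffix tree or the correlation mining technique of the paper, and the protein names, term labels, and threshold are hypothetical toy values.

```python
from collections import Counter, defaultdict

def term_pair_probabilities(edges, annotations):
    """Estimate P(term b on one partner | term a on the other) from annotated pairs."""
    pair_counts = defaultdict(Counter)
    term_counts = Counter()
    for p, q in edges:
        # Count each interaction in both directions.
        for x, y in ((p, q), (q, p)):
            if x in annotations and y in annotations:
                for a in annotations[x]:
                    term_counts[a] += 1
                    for b in annotations[y]:
                        pair_counts[a][b] += 1
    return {a: {b: c / term_counts[a] for b, c in counts.items()}
            for a, counts in pair_counts.items()}

def predict_terms(protein, edges, annotations, probs, threshold=0.5):
    """Suggest terms for a protein from the terms of its interaction partners."""
    neighbours = ({q for p, q in edges if p == protein}
                  | {p for p, q in edges if q == protein})
    scores = defaultdict(float)
    for n in neighbours:
        for a in annotations.get(n, ()):
            for b, pr in probs.get(a, {}).items():
                scores[b] = max(scores[b], pr)
    return sorted(t for t, s in scores.items() if s >= threshold)

# Toy data: P7 is unannotated and interacts with P1.
annotations = {"P1": {"GO:A"}, "P2": {"GO:B"}, "P3": {"GO:A"},
               "P4": {"GO:B"}, "P5": {"GO:A"}, "P6": {"GO:C"}}
edges = [("P1", "P2"), ("P3", "P4"), ("P5", "P6"), ("P7", "P1")]

probs = term_pair_probabilities(edges, annotations)
print(predict_terms("P7", edges, annotations, probs))  # ['GO:B']
```

In the toy data, partners of proteins carrying GO:A carry GO:B in two of three interactions, so GO:B passes the 0.5 threshold for P7 while GO:C does not.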

A model-based approach for mining membrane protein crystallization trials
Author(s):
Sitaram Asur, Department of Computer Science, Ohio State University, USA
Srinivasan Parthasarathy, Department of Computer Science, Ohio State University, USA
Pichai Raman, Department of Biophysics, Ohio State University, USA
Matthew Eric Otey, Department of Computer Science, Ohio State University, USA

Crystallization has been proven to be an essential step in macromolecular structure verification. Unfortunately, the bottleneck is that the crystallization process is quite complex. It can take any time from weeks to years to obtain diffraction-quality crystals, even under the right conditions. Other issues include the time and cost involved in taking trials and the presence of very few positive samples in a wide and largely undetermined parameter space.

Any help in directing scientists' attention to the hot spots in the conceptual crystallization space would lead to increased efficiency in crystallization trials. This work is an application case study on mining membrane protein crystallization trials to predict novel conditions with a high likelihood of leading to crystallization. We use suitable supervised learning algorithms to model the d