Hairpins
in a haystack: recognizing microRNA precursors in
comparative genomics data
Author(s):
Jana Hertel, Bioinformatics Group, Department
of Computer Science, University of Leipzig, Germany
Peter F. Stadler, Bioinformatics Group, Department
of Computer Science, University of Leipzig, Germany;
Institute for Theoretical Chemistry, University of
Vienna, Austria; Santa Fe Institute, New Mexico, USA
Recently, genome-wide surveys for non-coding RNAs
have provided evidence for tens of thousands of
previously undescribed evolutionarily conserved RNAs
with distinctive secondary structures. The annotation
of these putative ncRNAs, however, remains a difficult
problem. Here we describe an SVM-based approach
that, in conjunction with a non-stringent filter
for consensus secondary structures, is capable of
efficiently recognizing microRNA precursors in multiple
sequence alignments. The software was applied to
recent genome-wide RNAz surveys of mammals, urochordates,
and nematodes.
Keywords: miRNA, support vector
machine, non-coding RNA
Comparative
genomics reveals unusually long motifs in mammalian
genomes
Author(s):
Neil Jones, University of California San Diego,
United States
Pavel Pevzner, University of California San Diego,
United States
Motivation:
The recent discovery of the first small modulatory
RNA (smRNA) presents the challenge of finding other
molecules of similar length and conservation level.
Unlike for short interfering RNA (siRNA) and micro-RNA
(miRNA), effective computational and experimental
screening methods are not currently known for this
species of RNA molecule, and the discovery of the
one known example was partly fortuitous because
it happened to be complementary to a well-studied
DNA binding motif (the Neuron Restrictive Silencer
Element).
Results:
The existing comparative genomics approaches (e.g.,
phylogenetic footprinting) rely on alignments of
orthologous regions across multiple genomes. This
approach, while extremely valuable, is not suitable
for finding motifs with highly diverged "non-alignable"
flanking regions. Here we show that several unusually
long and well conserved motifs can be discovered
de novo through a comparative genomics approach
that does not require an alignment of orthologous
upstream regions. These motifs, including Neuron
Restrictive Silencer Element, were missed in recent
comparative genomics studies that rely on phylogenetic
footprinting. While the functions of these motifs
remain unknown, we argue that some may represent
biologically important sites.
Availability:
Our comparative genomics software, a web-accessible
database of our results and a compilation of experimentally
validated binding sites for NRSE can
be found at http://www.cse.ucsd.edu/groups/bioinformatics.
Contact: ppevzner@cs.ucsd.edu.
Relative
contributions of structural designability and functional
diversity in fixation of gene duplicates
Author(s):
Boris Shakhnovich, Boston University,
USA
Elucidating the governing laws, or even identifying
predominant trends in gene family or protein evolution
has been a formidable challenge in post-genomic
biology. While the skewed distribution of folds
and families was previously described, the key genetic
mechanisms or family specific characteristics that
influence the generation of this distribution are
as yet unknown. Furthermore, the extent of evolutionary
pressure on duplicate genes, most often credited
with generation of new genetic material and family
members, is hotly debated. In this paper we present
evidence that duplicate genes have variable probability
of locus fixation correlated with strength of selection.
In turn evolutionary pressure is influenced by innate
characteristics of structural designability (e.g.
the potential for sequence entropy) of the protein
family. We further show that variability of pseudogene
formation from gene duplicates can be directly tied
to the size and designability of the family to which
the genes belong.
Automatic
clustering of orthologs and inparalogs shared by multiple
proteomes
Author(s):
Andrey Alexeyenko, Stockholm Bioinformatics Center,
Albanova, Stockholm University, Sweden
Ivica Tamas, Department of Molecular Biology &
Functional Genomics, Stockholm University, Sweden
Gang Liu, Center for Genomics and Bioinformatics,
Karolinska Institutet, Sweden
Erik Sonnhammer, Center for Genomics and Bioinformatics,
Karolinska Institutet, Sweden
The complete sequencing of many genomes has made
it possible to identify orthologous genes descending
from a common ancestor. However, reconstruction
of evolutionary history over long time periods faces
many challenges due to gene duplications and losses.
Identification of orthologous groups shared by multiple
proteomes therefore becomes a clustering problem
in which an optimal compromise between conflicting
evidence needs to be found.
Here we present a new proteome-scale analysis program
called MultiParanoid that can automatically find
orthology relationships between proteins in multiple
proteomes. The software is an extension of the InParanoid
program that identifies orthologs and inparalogs
in pairwise proteome comparisons. MultiParanoid
applies a clustering algorithm to merge multiple
pairwise ortholog groups from InParanoid into multi-species
ortholog groups. To avoid outparalogs in the same
cluster, MultiParanoid only combines species that
share the same last ancestor.
To validate the clustering technique, we compared
the results to a reference set obtained by manual
phylogenetic analysis. We further compared the results
to ortholog groups in KOGs and OrthoMCL, which revealed
that MultiParanoid produces substantially fewer
outparalogs than these resources.
MultiParanoid is a freely available standalone program
that enables efficient orthology analysis much needed
in the post-genomic era. A web-based service providing
access to the original datasets, the resulting groups
of orthologs, and the source code of the program
can be found at http://multiparanoid.cgb.ki.se.
Keywords: orthology, paralogy,
inparalog, outparalog, clustering, algorithm, last
common ancestor, comparative genomics, Homo sapiens,
C. elegans, D. melanogaster.
A
Sequence-based filtering method for ncRNA identification
and its application to searching for Riboswitch Elements
Author(s):
Shaojie Zhang, Department of Computer Science
and Engineering, University of California, San Diego,
U.S.A.
Ilya Borovok, Department of Molecular Microbiology
and Biotechnology, Tel-Aviv University, Israel
Yair Aharonowitz, Department of Molecular Microbiology
and Biotechnology, Tel-Aviv University, Israel
Roded Sharan, School of Computer Science, Tel-Aviv
University, Israel
Vineet Bafna, Department of Computer Science and
Engineering, University of California, San Diego,
U.S.A.
Recent studies have uncovered an "RNA world",
in which non-coding RNA (ncRNA) sequences play a
central role in the regulation of gene expression.
Computational studies on ncRNA have been directed
toward developing detection methods for ncRNAs.
State-of-the-art methods for the problem, like covariance
models, suffer from high computational cost, underscoring
the need for efficient filtering approaches that
can identify promising sequence segments and accelerate
the detection process. In this paper we make several
contributions toward this goal. First, we formalize
the concept of a filter and provide figures of merit
that allow comparing between filters. Second, we
design efficient sequence based filters that dominate
the current state-of-the-art HMM filters. Third,
we provide a new formulation of the covariance model
that allows speeding up RNA alignment. We demonstrate
the power of our approach on both synthetic data
and real bacterial genomes. We then apply our algorithm
to the detection of novel riboswitch elements from
the whole bacterial and archaeal genomes. Our results
point to a number of novel riboswitch candidates,
and include genomes that were not previously known
to contain riboswitches.
Finding
novel genes in bacterial communities isolated from
the environment
Author(s):
Lutz Krause, Bielefeld University, Center for
Biotechnology (CeBiTec), Germany
Naryttza N. Diaz, Bielefeld University, Center
for Biotechnology (CeBiTec), Germany
Daniela Bartels, Bielefeld University, Center
for Biotechnology (CeBiTec) D-33594 Bielefeld
Robert A. Edwards, Fellowship for Interpretation
of Genomes, Burr Ridge IL, United States
Alfred Pühler, Universität Bielefeld,
Lehrstuhl für Genetik, Fakultät für
Biologie D-33594 Bielefeld, Germany
Forest Rohwer, Department of Biology, San Diego
State University, San Diego, CA, United States
Folker Meyer, Bielefeld University, Center for
Biotechnology (CeBiTec), Germany
Jens Stoye, Universität Bielefeld, Technische
Fakultät D-33594 Bielefeld, Germany
Motivation:
Novel sequencing techniques can give access to organisms
that are difficult to cultivate using conventional
methods. For example, the 454 pyrosequencing method
can generate a large amount of data in a short time
and at a low cost. When applied to environmental
samples, the data generated has some drawbacks,
e.g. short length of assembled contigs, in-frame
stop codons and frame shifts. Unfortunately, current
gene finders cannot circumvent these difficulties.
On the other hand, high throughput methods are needed
to investigate special attributes of microbial communities.
Some metagenomics analyses have already revealed
interesting findings in diversity and evolution
of complex microbial communities. Therefore, the
automated prediction of genes is a prerequisite
for the increasing amount of genomic sequences to
ensure progress in metagenomics.
Results:
We introduce a novel gene finding algorithm that
incorporates features overcoming the short length
of the assembled contigs from environmental data,
in-frame stop codons as well as frame shifts contained
in bacterial sequences. The results show that by
searching for sequence similarities in an environmental
sample our algorithm is capable of detecting a high
fraction of its gene content, depending on the species
composition and the overall size of the sample.
Therefore, the method is valuable for hunting novel
unknown genes that may be specific for the habitat
where the sample is taken. Finally, we show that
our algorithm can even exploit the limited information
contained in the short reads generated by the 454
technology for the prediction of protein coding
genes.
An
experimental metagenome data management and analysis
system
Author(s):
Victor Markowitz, Biological Data Management and
Technology Center, Lawrence Berkeley National Lab,
USA
Natalia Ivanova, Genome Biology Program, Joint
Genome Institute, USA
Krishna Palaniappan, Biological Data Management
and Technology Center, Lawrence Berkeley National
Lab, USA
Ernest Szeto, Biological Data Management and Technology
Center, Lawrence Berkeley National Lab, USA
Frank Korzeniewski, Biological Data Management
and Technology Center, Lawrence Berkeley National
Lab, USA
Athanasios Lykidis, Genome Biology Program, Joint
Genome Institute, USA
Iain Anderson, Genome Biology Program, Joint Genome
Institute, USA
Konstantinos Mavrommatis, Genome Biology Program,
Joint Genome Institute, USA
Victor Kunin, Microbial Ecology Program, Joint
Genome Institute, USA
Hector Garcia Martin, Microbial Ecology Program,
Joint Genome Institute, USA
Inna Dubchak, Genomics Division, Lawrence Berkeley
National Lab, USA
Phil Hugenholtz, Microbial Ecology Program, Joint
Genome Institute, USA
Nikos Kyrpides, Genome Biology Program, Joint
Genome Institute, USA
The application of shotgun sequencing to environmental
samples has revealed a new universe of microbial
community genomes (metagenomes) involving previously
uncultured organisms. Metagenome analysis, which
is expected to provide a comprehensive picture of
the gene functions and metabolic capacity of a microbial
community, needs to be conducted in the context
of a comprehensive data management and analysis
system. We present in this paper IMG/M, an experimental
metagenome data management and analysis system that
is based on the Integrated Microbial Genomes (IMG)
system. IMG/M provides tools and viewers for analyzing
both metagenomes and isolate genomes individually
or in a comparative context.
Distance
based algorithms for small biomolecule classification
and structural similarity search
Author(s):
Emre Karakoc, Simon Fraser University, Canada
Artem Cherkasov, University of British Columbia,
Canada
S. Cenk Sahinalp, Simon Fraser University, Canada
Structural similarity search among small molecules
is a standard tool used in molecular classification
and in-silico drug discovery. The effectiveness
of this general approach depends on how well the
following problems are addressed.
The notion of similarity should be chosen for providing
the highest level of discrimination of compounds
with respect to the bioactivity of interest. The data structure
for performing search should be very efficient as
the molecular databases of interest include several
millions of compounds.
In this paper we focus on the k-nearest-neighbor
search method, which, until recently, was not considered
for small molecule classification. The few recent
applications of k-nn to compound classification
focus on selecting the most relevant set of chemical
descriptors which are then compared under standard
Minkowski distance L_p. Here we show how to computationally
design the optimal "weighted" Minkowski
distance wL_p for maximizing the discrimination
between active and inactive compounds with respect to bioactivities
of interest. We then show how to construct pruning
based k-nn search data structures for any wL_p distance
that minimize similarity search time.
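As a rough illustration of the idea (not the authors' implementation), the sketch below classifies a query compound by brute-force k-nn under one common form of weighted Minkowski distance, wL_p(x, y) = (sum_i w_i |x_i - y_i|^p)^(1/p). The weight optimization and the pruning-based search structure described above are not reproduced here; the function names, descriptor values, weights and labels are all made up for the example.

```python
from collections import Counter


def weighted_minkowski(x, y, weights, p=2.0):
    """Assumed form of wL_p: (sum_i w_i * |x_i - y_i|^p)^(1/p)."""
    return sum(w * abs(a - b) ** p for w, a, b in zip(weights, x, y)) ** (1.0 / p)


def knn_classify(query, data, labels, weights, k=3, p=2.0):
    """Brute-force k-nn majority vote under wL_p (no pruning structure)."""
    ranked = sorted(
        range(len(data)),
        key=lambda i: weighted_minkowski(query, data[i], weights, p),
    )
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-descriptor compounds and activity labels.
compounds = [(0.1, 5.0), (0.2, 1.0), (0.9, 4.0), (1.0, 0.5)]
activity = ["inactive", "inactive", "active", "active"]
query = (0.8, 1.0)

# Weighting only the first descriptor separates the two classes;
# weighting only the second one does not.
by_first = knn_classify(query, compounds, activity, weights=(1.0, 0.0), k=1)
by_second = knn_classify(query, compounds, activity, weights=(0.0, 1.0), k=1)
print(by_first, by_second)
```

In this toy data the first descriptor carries the class signal, so the choice of weights changes the predicted label, which is the effect the weighted distance is meant to exploit.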
The accuracy achieved by our classifier is better
than the alternative LDA and MLR approaches and
is comparable to the ANN methods. In terms of running
time, our classifier is considerably faster than
the ANN approach especially when large data sets
are used. Furthermore, our classifier quantifies
the level of bioactivity rather than returning a
binary decision and thus is more informative than
the ANN approach.
springScape:
Visualisation of microarray and contextual bioinformatic
data using spring embedding and an information landscape
Author(s):
Timothy Ebbels, Department of Computer Science,
University College London, UK
Bernard Buxton, Department of Computer Science,
University College London, UK
David Jones, Department of Computer Science, University
College London, UK
The interpretation of microarray and other high-throughput
data is highly dependent on the biological context
of experiments. However, standard analysis packages
are poor at simultaneously presenting both the array
and related bioinformatic data. We have addressed
this challenge by developing a system, springScape,
based on 'spring embedding' and an 'information
landscape', allowing several related data sources
to be dynamically combined while highlighting one
particular feature.
Each data source is represented as a network of
nodes connected by weighted edges. The networks
are combined and embedded in the 2-D plane by spring
embedding such that nodes with a high similarity
are drawn close together. Complex relationships
can be discovered by varying the weight of each
data source and observing the dynamic response of
the spring network. By modifying Procrustes analysis,
we find that the visualizations have an acceptable
degree of reproducibility. The 'information
landscape' highlights one particular data source,
displaying it as a smooth surface whose height is
proportional to both the information being viewed
and the density of nodes. The algorithm is demonstrated
using several microarray data sets in combination
with protein-protein interaction data and GO annotations.
Among the features revealed are the spatio-temporal
profile of gene expression and the identification
of GO terms correlated with gene expression and
protein interactions. The power of this combined
display lies in its interactive feedback and exploitation
of human visual pattern recognition. Overall, springScape
shows promise as a tool for the interpretation of
microarray data in the context of relevant bioinformatic
information.
SNP
Function Portal: a web database for exploring the
function implication of SNP alleles
Author(s):
Pinglang Wang, University of Michigan, United
States
Manhong Dai, University of Michigan, United States
Weijian Xuan, University of Michigan, United States
Richard C McEachin, University of Michigan, United
States
Anne U Jackson, University of Michigan, United
States
Laura J Scott, University of Michigan, United
States
Brian Athey, University of Michigan, United States
Stanley J. Watson, University of Michigan, United
States
Fan Meng, University of Michigan, United States
Motivation:
Finding the potential functional significance of
SNPs is a major bottleneck in understanding
genome-wide SNP scanning results, as the related
functional data are distributed across many different
databases. The SNP Function Portal is designed to
be a clearing house for all public domain SNP function
annotation data, as well as in-house functional
annotations derived from data from different sources.
It currently contains SNP function annotations in
six major categories including genomic elements,
transcription regulation, protein function, pathway,
disease and population genetics. Besides extensive
SNP function annotations, the SNP Function Portal
includes a powerful search engine that accepts different
types of genetic markers as input and identifies
all genetically related SNPs based on HapMap II
data as well as the relationship of different markers
to known genes. As a result, our system allows users
to search the potential biological impact of any
genetic marker(s), investigate complex relationships
among genetic markers and genes, and greatly facilitates
the understanding of genome-wide SNP scanning results.
Availability:
http://brainarray.mbni.med.umich.edu/Brainarray/Database/SearchSNP/snpfunc.aspx
Contact: mengf@umich.edu
Integrating
structured biological data by kernel Maximum Mean
Discrepancy
Author(s):
Karsten Borgwardt, University of Munich, Germany
Arthur Gretton, MPI Tuebingen, Germany
Malte Rasch, TU Graz, Austria
Hans-Peter Kriegel, University of Munich, Germany
Bernhard Schoelkopf, MPI Tuebingen, Germany
Alex Smola, National ICT Australia, Canberra,
Australia
Motivation:
Many problems in data integration in bioinformatics
can be posed as one common question: Are two sets
of observations generated by the same distribution?
We propose a kernel-based statistical test for this
problem, based on the fact that two distributions
are different if and only if there exists at least
one function having different expectation on the
two distributions. Consequently we use the maximum
discrepancy between function means as the basis
of a test statistic. The Maximum Mean Discrepancy
(MMD) can take advantage of the kernel trick, which
allows us to apply it not only to vectors, but also to strings,
sequences, graphs, and other common structured data
types arising in molecular biology.
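To make the statistic concrete, here is a minimal sketch of the standard biased empirical estimate of squared MMD, mean k(x, x') + mean k(y, y') - 2 mean k(x, y). It is an illustration under assumptions, not the authors' code: the Gaussian bandwidth, the 2-mer spectrum kernel used for strings, and the toy samples are arbitrary choices, and the acceptance threshold needed to turn the statistic into a test is not included.

```python
import math
from collections import Counter


def gaussian_kernel(a, b, sigma=1.0):
    """Gaussian (RBF) kernel on real numbers; sigma is an assumed bandwidth."""
    return math.exp(-((a - b) ** 2) / (2.0 * sigma ** 2))


def spectrum_kernel(s, t, k=2):
    """k-mer spectrum kernel on strings: dot product of k-mer count vectors."""
    cs = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    ct = Counter(t[i:i + k] for i in range(len(t) - k + 1))
    return float(sum(cs[m] * ct[m] for m in cs))


def mmd_squared(xs, ys, kernel=gaussian_kernel):
    """Biased empirical estimate of squared MMD between samples xs and ys."""
    kxx = sum(kernel(a, b) for a in xs for b in xs) / len(xs) ** 2
    kyy = sum(kernel(a, b) for a in ys for b in ys) / len(ys) ** 2
    kxy = sum(kernel(a, b) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2.0 * kxy


# Toy vector-valued samples (made-up numbers).
near = mmd_squared([0.0, 0.1, 0.2], [0.05, 0.15, 0.25])
far = mmd_squared([0.0, 0.1, 0.2], [5.0, 5.1, 5.2])

# The same estimator applied to strings by swapping in a string kernel.
seq_mmd = mmd_squared(["ACGT", "ACGA"], ["TTTT", "TTTA"], kernel=spectrum_kernel)
print(near, far, seq_mmd)
```

Only the kernel changes between the numeric and the string example, which is the point of the kernel trick mentioned above.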
Results:
We study the practical feasibility of an MMD-based
test on three central data integration tasks: Testing
cross-platform comparability of microarray data,
cancer diagnosis, and data-content based schema
matching for two different protein function classification
schemas. In all of these experiments, including
high-dimensional ones, MMD is very accurate in finding
samples that were generated from the same distribution,
and outperforms or is as good as its best competitors.
Conclusions:
We have defined a novel statistical test of whether
two samples are from the same distribution, compatible
with both multivariate and structured data, that
is fast, easy to implement, and works well, as confirmed
by our experiments.
Availability: http://www.dbs.ifi.lmu.de/~borgward/MMD
Constructing
near-perfect phylogenies with multiple homoplasy events
Author(s):
Ravi Vijaya Satya, School of EECS, University
of Central Florida, Orlando, FL, USA
Amar Mukherjee, School of EECS, University
of Central Florida, Orlando, FL, USA
Gabriela Alexe, IBM T.J. Watson Research Center,
Yorktown Heights, NY, USA
Laxmi Parida, IBM T.J. Watson Research Center,
Yorktown Heights, NY, USA
Gyan Bhanot, IBM T.J. Watson Research Center,
Yorktown Heights, NY, USA
Motivation:
We explore the problem of constructing near-perfect
phylogenies on bi-allelic haplotypes, where the
deviation from perfect phylogeny is entirely due
to homoplasy events. We present polynomial-time
algorithms for restricted versions of the problem.
We show that these algorithms can be extended to
genotype data, in which case the problem is called
the near-perfect phylogeny haplotyping (NPPH) problem.
We present a near-optimal algorithm for the H1-NPPH
problem, which is to determine if a given set of
genotypes admits a phylogeny with a single homoplasy
event. The time-complexity of our algorithm for
the H1-NPPH problem is O(m^2(n+m)), where
n is the number of genotypes and m is the number
of SNP sites. This is a significant improvement
over the earlier O(n^4) algorithm.
We also introduce generalized versions of the problem.
The H(1,q)-NPPH problem is to determine if a given
set of genotypes admits a phylogeny with q homoplasy
events, so that all the homoplasy events occur in
a single site. We present an O(m^(q+1)(n+m))
algorithm for the H(1,q)-NPPH problem.
Results:
We present results on simulated data, which demonstrate
that the accuracy of our algorithm for the H1-NPPH
problem is comparable to that of the existing methods,
while being orders of magnitude faster.
Availability:
The implementation of our algorithm for the H1-NPPH
problem is available upon request.
Contact: rvijaya@cs.ucf.edu
BNTagger:
Improved tagging SNP selection using Bayesian networks
Author(s):
Phil Hyoun Lee, School of Computing, Queen's University,
Canada
Hagit Shatkay, School of Computing, Queen's University,
Canada
Genetic variation analysis holds much promise as
a basis for disease-gene association. However, due
to the tremendous number of candidate single nucleotide
polymorphisms (SNPs), there is a clear need to expedite
genotyping by selecting and considering only a subset
of all SNPs. This process is known as tagging SNP
selection. Several methods for tagging SNP selection
have been proposed, and have shown promising results.
However, most of them rely on strong assumptions
such as prior block-partitioning, bi-allelic SNPs,
or a fixed number or location of tagging SNPs.
We introduce BNTagger, a new method for tagging
SNP selection, based on conditional independencies
among SNPs. Using the formalism of Bayesian networks
(BNs), our system aims to select a subset of independent
and highly predictive SNPs. Similar to previous
prediction-based methods, we aim to maximize the
prediction accuracy of tagging SNPs, but unlike
them, we neither fix the number or the location
of predictive tagging SNPs, nor require SNPs to
be bi-allelic. In addition, for newly-genotyped
samples, BNTagger directly uses genotype data as
input, while producing as output haplotype data
of all SNPs.
Using three public data sets, we compare the prediction
performance of our method to that of three state-of-the-art
tagging SNP selection methods. The results demonstrate
that our method consistently improves upon previous
methods in terms of prediction accuracy. Moreover,
our method retains its good performance even when
a very small number of tagging SNPs are used.
Mutation
parameters from DNA sequence data using graph theoretic
measures on lineage trees
Author(s):
Reuma Magori Cohen, Bar Ilan University, Israel
Yoram Louzoun, Bar Ilan University, Israel
Steven Kleinstein, Princeton University, USA
Motivation:
B cells responding to antigenic stimulation can
fine-tune their binding properties through a process
of affinity maturation, composed of somatic hypermutation,
affinity-selection and clonal expansion. The mutation
rate of the B cell receptor DNA sequence, and the
effect of these mutations on affinity and specificity
are of critical importance for understanding immune
and autoimmune processes. Unbiased estimates of
these properties are currently lacking due to the
short time-scales involved and the small numbers
of sequences available.
Results:
We have developed a bioinformatic method based on
a maximum likelihood analysis of phylogenetic lineage
trees to estimate the parameters of a B cell clonal
expansion model, which includes somatic hypermutation
with the possibility of lethal mutations. Lineage
trees are created from clonally related B cell receptor
DNA sequences. Important links between tree shapes
and underlying model parameters are identified using
mutual information. Parameters are estimated using
a likelihood function based on the joint distribution
of several tree shapes, without requiring a priori
knowledge of the number of generations in the clone
(which is not available for rapidly dividing populations
in vivo). A systematic validation on synthetic trees
produced by a mutating birth-death process simulation
shows that our estimates are precise and robust
to several underlying assumptions. These methods
are applied to experimental data from autoimmune
mice to demonstrate the existence of hypermutating
B cells in an unexpected location in the
spleen.
Predicting
the prognosis of breast cancer by integrating clinical
and microarray data with Bayesian networks
Author(s):
Olivier Gevaert, Department of Electrical Engineering
ESAT-SCD Katholieke Universiteit Leuven, Belgium
Frank De Smet, Department of Electrical Engineering
ESAT-SCD Katholieke Universiteit Leuven, Belgium
Dirk Timmerman, Department of Obstetrics and Gynecology,
University Hospital Gasthuisberg, Katholieke Universiteit
Leuven, Belgium
Yves Moreau, Department of Electrical Engineering
ESAT-SCD Katholieke Universiteit Leuven, Belgium
Bart De Moor, Department of Electrical Engineering
ESAT-SCD Katholieke Universiteit Leuven, Belgium
Motivation:
Clinical data, such as patient history, laboratory
analysis and ultrasound parameters, which are the
basis of day-to-day clinical decision support,
are often neglected in guiding the clinical management
of cancer when microarray data are available. We
propose a strategy based on Bayesian networks to
treat clinical and microarray data on an equal footing.
The main advantage of this probabilistic model is
that it allows these data sources to be integrated in
several ways and the model structure and parameters
to be investigated and understood. Furthermore,
using the concept of a Markov Blanket we can identify
all the variables that shield off the class variable
from the influence of the remaining network. Therefore
Bayesian networks automatically perform feature
selection by identifying the (in)dependency relationships
with the class variable.
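The Markov blanket of a node in a Bayesian network consists of its parents, its children, and the other parents of its children. The sketch below computes it for a small directed acyclic graph; the network structure and the variable names (age, grade, gene_A and so on) are invented for illustration and are not taken from the study.

```python
def markov_blanket(parents, node):
    """Return the Markov blanket of `node` in a DAG.

    `parents` maps every node to the list of its parent nodes.
    The blanket is: parents, children, and the children's other parents.
    """
    children = {c for c, ps in parents.items() if node in ps}
    co_parents = {p for c in children for p in parents[c]}
    return (set(parents[node]) | children | co_parents) - {node}


# Hypothetical network: node -> list of parents.
network = {
    "age": [],
    "grade": ["age"],
    "prognosis": ["grade"],
    "gene_C": [],
    "gene_B": ["gene_C"],
    "gene_A": ["prognosis", "gene_B"],
}

blanket = markov_blanket(network, "prognosis")
print(sorted(blanket))
```

In this toy network, age and gene_C fall outside the blanket of the class variable, so they would be dropped as features once the blanket variables are known.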
Results:
We evaluated three methods for integrating clinical
and microarray data: decision integration, partial
integration and full integration and used them to
classify breast cancer patients from publicly available data
into a poor and a good prognosis group. The partial
integration method is most promising and has an
independent test set area under the ROC curve of
0.845. After choosing an operating point the classification
performance is better than frequently used indices.
AClAP,
Autonomous hierarchical agglomerative Cluster Analysis
based Protocol to partition conformational datasets
Author(s):
Giovanni Bottegoni, Dept. of Pharmaceutical Sciences
- University of Bologna, Italy
Walter Rocchia, NEST - Scuola Normale Superiore
of Pisa, Italy
Maurizio Recanatini, Dept. of Pharmaceutical Sciences
- University of Bologna, Italy
Andrea Cavalli, Dept. of Pharmaceutical Sciences
- University of Bologna, Italy
Motivation:
Sampling the conformational space is a fundamental
step for both ligand- and structure-based drug design.
However, the rational organization of different
molecular conformations still remains a challenge.
In fact, for drug design applications, the sampling
process provides a redundant conformation set whose
thorough analysis can be intensive, or even prohibitive.
We propose a statistical approach based on cluster
analysis aimed at rationalizing the output of methods
such as Monte Carlo, genetic, and reconstruction
algorithms. Although some software already implements
clustering procedures, at present, a universally
accepted protocol is still missing.
Results:
We integrated hierarchical agglomerative cluster
analysis with a clusterability assessment method
and a user independent cutting rule, to form a global
protocol that we implemented in a MATLAB metalanguage
program (AClAP). We tested it on the conformational
space of a quite diverse set of drugs generated
via Metropolis Monte Carlo simulation, and on the
poses we obtained by reiterated docking runs performed
by four widespread programs. In our tests, AClAP
proved to remarkably reduce the dimensionality of
the original datasets at a negligible computational
cost. Moreover, when applied to the outcomes of
many docking programs together, it was able to point
to the crystallographic pose.
Availability:
AClAP is available at the “AClAP” section
of the website http://www.scfarm.unibo.it
Contact: andrea.cavalli@unibo.it
Supplementary Information:
The complete series of AClAP results is available
in the “services” section of the website
http://www.scfarm.unibo.it.
Integrating
copy number polymorphisms into array CGH analysis
using a robust HMM
Author(s):
Sohrab Shah, University of British Columbia, Canada
Xiang Xuan, University of British Columbia, Canada
Ron DeLeeuw, British Columbia Cancer Research
Centre, Canada
Mehrnoush Khojasteh, British Columbia Cancer Research
Centre, Canada
Wan Lam, British Columbia Cancer Research Centre,
Canada
Raymond Ng, University of British Columbia, Canada
Kevin Murphy, University of British Columbia,
Canada
Array comparative genomic hybridization (aCGH)
is a pervasive technique used to identify chromosomal
aberrations in human diseases, including cancer.
Aberrations are defined as regions of increased
or decreased copy number, relative to a normal sample.
Accurately identifying the locations of these aberrations
has many important medical applications. Unfortunately,
the observed copy number changes are often corrupted
by various sources of noise, making the boundaries
hard to detect. One popular current technique uses
hidden Markov models (HMMs) to segment the signal
into regions of constant copy number; a subsequent
classification phase labels each region as a gain,
a loss or neutral. Unfortunately, standard HMMs
are sensitive to outliers, causing oversegmentation.
We propose a simple modification that makes the
HMM more robust to such single clone outliers. More
importantly, this modification allows us to exploit
prior knowledge about the likely location of such
"outliers", which are often due to copy number
polymorphisms (CNPs). By "explaining away" these
outliers, we can focus attention on more interesting
aberrated regions. We show significant improvements
over the current state-of-the-art technique (DNAcopy
with MergeLevels) on some previously used synthetic
data, augmented with outliers. We also show modest
gains on the well-studied H526 lung cancer cell
line data, and argue why we expect more substantial
gains on other data sets in the future.
Source code written in Matlab is available from
http://www.cs.ubc.ca/~sshah/acgh
Decoding
non-unique oligonucleotide hybridization experiments
of targets related by a phylogenetic tree
Author(s):
Alexander Schliep, Max Planck Institute for Molecular
Genetics, Germany
Sven Rahmann, Bielefeld University, Germany
Motivation:
The reliable identification of presence or absence
of biological agents ("targets"), such
as viruses or bacteria, is crucial for many applications
from health care to biodiversity. If genomic sequences
of targets are known, hybridization reactions between
oligonucleotide probes and targets performed on
suitable DNA microarrays make it possible to infer presence
or absence from the observed pattern of hybridization.
Targets, for example all known strains of HIV, are
often closely related, so finding unique probes
becomes impossible. The use of non-unique oligonucleotides
with more advanced decoding techniques from statistical
group testing allows known targets to be detected with
great success. Of great relevance, however, is the
problem of identifying the presence of previously
unknown targets or of targets that evolve rapidly.
Results:
We present the first approach to decode hybridization
experiments using non-unique probes when targets
are related by a phylogenetic tree. By use of a
Bayesian framework and a Markov chain Monte Carlo
approach we are able to identify over 95% of known
targets and assign up to 70% of unknown targets
to their correct clade in hybridization simulations
on biological and simulated data.
Availability:
Software implementing the method described in this
paper and datasets are available from http://algorithmics.molgen.mpg.de/probetrees.
Keywords: virus detection, probe
design, phylogeny, MCMC
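To make the decoding problem concrete, the following sketch computes posterior presence probabilities for a handful of targets from non-unique probe signals. It deliberately swaps in exhaustive enumeration for the Markov chain Monte Carlo approach named in the abstract, and it leaves out the phylogenetic tree entirely. The independent prior, the error rates and the function name `decode_posterior` are assumptions made for illustration.

```python
from itertools import product

def decode_posterior(hyb, observed, prior=0.1, false_pos=0.05, false_neg=0.05):
    """Posterior presence probability per target, by full enumeration.

    hyb[i][j] is 1 if probe i hybridizes to target j; observed[i] is the
    binary signal of probe i. Only feasible for a small number of targets.
    """
    n_probes, n_targets = len(hyb), len(hyb[0])
    marginals = [0.0] * n_targets
    total = 0.0
    for present in product((0, 1), repeat=n_targets):
        # Independent prior over presence of each target.
        weight = 1.0
        for p in present:
            weight *= prior if p else 1.0 - prior
        # A probe is expected positive if any present target binds it.
        for i in range(n_probes):
            expected = any(hyb[i][j] and present[j] for j in range(n_targets))
            if expected:
                weight *= (1.0 - false_neg) if observed[i] else false_neg
            else:
                weight *= false_pos if observed[i] else (1.0 - false_pos)
        total += weight
        for j, p in enumerate(present):
            if p:
                marginals[j] += weight
    return [m / total for m in marginals]

# Three targets sharing probes: target 0 binds probes 0 and 1,
# target 1 binds probes 1 and 2, target 2 binds probes 2 and 3.
hyb = [[1, 0, 0], [1, 1, 0], [0, 1, 1], [0, 0, 1]]
observed = [1, 1, 0, 0]
posterior = decode_posterior(hyb, observed)
```

Here no probe is unique to target 0 alone, yet the pattern of positive and negative signals still singles it out. Enumeration grows as 2 to the number of targets, which is one reason a sampling approach becomes attractive for realistic target sets.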
DynaPred:
A structure and sequence based method for the prediction
of MHC class I binding peptide sequences and conformations
Author(s):
Iris Antes, MPI fuer Informatik, Stuhlsatzenhausweg
85, D-66123 Saarbruecken, Germany
Shirley Siu, Universitaet des Saarlandes, D-66041
Saarbruecken, Germany
Thomas Lengauer, MPI fuer Informatik, Stuhlsatzenhausweg
85, D-66123 Saarbruecken, Germany
We developed an SVM-trained, quantitative matrix-based
method for the prediction of MHC class I binding
peptides, in which the features of the scoring matrix
are energy terms retrieved from molecular dynamics
simulations. At the same time we use the equilibrated
structures obtained by the same simulations in a
simple and efficient docking procedure. Our method
consists of two steps: First, we predict potential
binders from sequence data alone and second, we
construct protein-peptide complexes for the predicted
binders. So far, we tested our approach on the HLA-A0201
allele. We constructed two prediction models, using
local, position-dependent (DynaPredPOS) and global,
position-independent (DynaPred) features. The former
model outperformed two sequence-based methods used
in the evaluation; the latter showed a slightly
lower performance (5% less accuracy), but a much
higher generalizability towards other alleles than
the position-dependent models. The constructed peptide
conformations can be refined within seconds to structures
with an average RMSD from the corresponding experimental
structures of 1.53 Å for the peptide backbone
and 1.1 Å for buried side chain atoms.
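The difference between position-dependent and position-independent features can be shown with a minimal scoring sketch. The weights below are hypothetical placeholders, not the energy terms or SVM-trained values of DynaPred, and the function names are illustrative only.

```python
def score_position_dependent(peptide, matrix):
    """Sum one weight per (position, residue) pair; matrix is a list of dicts."""
    return sum(matrix[i].get(aa, 0.0) for i, aa in enumerate(peptide))

def score_position_independent(peptide, weights):
    """Sum one weight per residue type, regardless of where it occurs."""
    return sum(weights.get(aa, 0.0) for aa in peptide)

# Hypothetical weights for 9-mers: only positions 2 and 9 carry signal here.
toy_matrix = [{} for _ in range(9)]
toy_matrix[1] = {"L": 1.0, "M": 0.8}
toy_matrix[8] = {"V": 1.0, "L": 0.9}

# Hypothetical residue-level weights shared by all positions.
toy_weights = {"L": 0.5, "V": 0.4, "M": 0.3}

peptide = "GLCTLVAML"   # a 9-mer chosen for illustration
dependent_score = score_position_dependent(peptide, toy_matrix)
independent_score = score_position_independent(peptide, toy_weights)
reversed_score = score_position_independent(peptide[::-1], toy_weights)
```

The position-independent score is unchanged when the peptide is reversed, which illustrates why such a model carries less allele-specific positional information but may transfer more readily to other alleles.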
A
top-level ontology of functions and its application
in the Open Biomedical Ontologies
Author(s):
Patryk Burek, University of Leipzig, Germany
Robert Hoehndorf, University of Leipzig, Max Planck
Institute for Evolutionary Anthropology, Germany
Frank Loebe, University of Leipzig, Germany
Johann Visagie, Max Planck Institute for Evolutionary
Anthropology, Germany
Heinrich Herre, University of Leipzig, Germany
Janet Kelso, Max Planck Institute for Evolutionary
Anthropology, Germany
Motivation:
A clear understanding of functions in biology is
a key component in accurate modelling of molecular,
cellular and organismal biology. Using the existing
biomedical ontologies it has been impossible to
capture the complexity of the community's knowledge
about biological functions.
Results:
We present here a top-level ontological framework
for representing knowledge about biological functions.
This framework lends greater accuracy, power and
expressiveness to biomedical ontologies by providing
a means to capture existing functional knowledge
in a more formal manner. An initial major application
of the ontology of functions is the provision of
a principled way in which to curate functional knowledge
and annotations in biomedical ontologies. Further
potential applications include the facilitation
of ontology interoperability and automated reasoning.
A major advantage of the proposed implementation
is that it is an extension to existing biomedical
ontologies, and can be applied without substantial
changes to these domain ontologies.
Availability:
The Ontology of Functions (OF) can be downloaded
in OWL format from http://onto.eva.mpg.de/.
Additionally, a UML profile and supplementary information
and guides for using the OF can be accessed from
the same website.
Contact: bioonto@lists.informatik.uni-leipzig.de
Keywords: knowledge representation,
ontology, top-level ontology, biological function
An
ontology for a robot scientist
Author(s):
Larisa Soldatova, The University of Wales, Aberystwyth,
UK
Amanda Clare, The University of Wales, Aberystwyth,
UK
Andrew Sparkes, The University of Wales, Aberystwyth,
UK
Ross King, The University of Wales, Aberystwyth,
UK
Motivation:
A Robot Scientist is a physically implemented robotic
system that can automatically carry out cycles of
scientific experimentation. We are commissioning
a new Robot Scientist designed to investigate gene
function in S. cerevisiae. This Robot Scientist
will be capable of initiating >1,000 experiments,
and making >200,000 observations a day. Robot
Scientists provide a unique test bed for the development
of methodologies for the curation and annotation
of scientific experiments: because the experiments
are conceived and executed automatically by computer,
it is possible to completely capture and digitally
curate all aspects of the scientific process. This
new ability brings with it significant technical
challenges. To meet these we apply an ontology-driven
approach to the representation of all the Robot
Scientist's data and metadata.
Results:
We demonstrate the utility of developing an ontology
for the new Robot Scientist. This ontology is based
on a general ontology of experiments. The ontology
aids the curation and annotation of the experimental
data and metadata and of the equipment metadata, and supports
the design of database systems to hold the data
and metadata.
Availability:
EXPO in XML and OWL formats is at: http://sourceforge.net/projects/expo/.
All materials about the Robot Scientist project
are available at: www.aber.ac.uk/compsci/Research/bio/robotsci/.
Protein
classification using ontology classification
Author(s):
Katherine Wolstencroft, University of Manchester,
UK
Phillip Lord, University of Newcastle, UK
Lydia Tabernero, University of Manchester, UK
Andy Brass, University of Manchester, UK
Robert Stevens, University of Manchester, UK
Motivation:
The classification of proteins expressed by an organism
is an important step in understanding the molecular
biology of that organism. Traditionally, this classification
has been performed by human experts. Human knowledge
can recognise the functional properties that are
sufficient to place an individual gene product
into a particular protein family group. Automation
of this task usually fails to meet the 'gold
standard' of the human annotator because of the
difficult recognition stage. The growing number
of genomes, the rapid changes in knowledge and the
central role of classification in the annotation
process, however, motivate the need to automate
this process.
Results:
We capture human understanding of how to recognise
members of the protein phosphatases family by domain
architecture as an ontology. By describing protein
instances in terms of the domains they contain,
it is possible to use description logic reasoners
and our ontology to assign those proteins to a
protein family class. We have tested our system
on classifying the protein phosphatases of the
human and Aspergillus fumigatus genomes and found
that our knowledge-based, automatic classification
matches, and sometimes surpasses, that of the human
annotators. We have made the classification process
fast and reproducible and, where appropriate knowledge
is available, the method can potentially be generalised
for use with any protein family.
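The principle of assigning a protein to a family class from its domain content can be sketched in plain Python. This is only a stand-in for a description logic reasoner over an OWL ontology: each class is reduced to sets of required and forbidden domains, and the class and domain names below are illustrative assumptions, not the authors' ontology.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FamilyClass:
    """A family defined by domains that must and must not be present."""
    name: str
    required: frozenset
    forbidden: frozenset = frozenset()

def classify(domains, classes):
    """Return the names of all classes whose definition the protein satisfies."""
    present = set(domains)
    return [c.name for c in classes
            if c.required <= present and not (c.forbidden & present)]

# Illustrative class definitions (hypothetical, simplified).
classes = [
    FamilyClass("receptor_tyrosine_phosphatase",
                frozenset({"tyrosine_phosphatase_catalytic", "transmembrane"})),
    FamilyClass("non_receptor_tyrosine_phosphatase",
                frozenset({"tyrosine_phosphatase_catalytic"}),
                frozenset({"transmembrane"})),
]

receptor = classify({"tyrosine_phosphatase_catalytic", "transmembrane"}, classes)
cytosolic = classify({"tyrosine_phosphatase_catalytic"}, classes)
unrelated = classify({"kinase"}, classes)
```

A real reasoner additionally handles class hierarchies, cardinality restrictions and open-world semantics, none of which this set-based check attempts.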
Semi-Supervised
LC/MS alignment for differential proteomics
Author(s):
Bernd Fischer, ETH Zurich / Institute of Computational
Science, Switzerland
Jonas Grossmann, ETH Zurich / Institute of Plant
Biotechnology, Switzerland
Volker Roth, ETH Zurich / Institute of Computational
Science, Switzerland
Wilhelm Gruissem, ETH Zurich / Institute of Plant
Biotechnology, Switzerland
Sacha Baginsky, ETH Zurich / Institute of Plant
Biotechnology, Switzerland
Joachim M. Buhmann, ETH Zurich / Institute of
Computational Science, Switzerland
Motivation:
Mass spectrometry (MS) combined with high-performance
liquid chromatography (LC) has received considerable
attention for high-throughput analysis of the proteome.
Isotopic labeling techniques such as ICAT have been
successfully applied to derive differential quantitative
information for two protein samples, albeit at
the price of significantly increased complexity
of the experimental setup. To overcome these limitations,
we consider a label-free setting where correspondences
between elements of two samples have to be established
prior to the comparative analysis. The alignment
between samples is achieved by nonlinear robust
ridge regression. The correspondence estimates are
guided in a semi-supervised fashion by prior information
which is derived from sequenced tandem mass spectra.
Results:
The semi-supervised method for finding correspondences
is successfully applied to aligning highly complex
protein samples, even if they exhibit large variations
due to different experimental conditions. A large-scale
experiment clearly demonstrates that the proposed
method bridges the gap between statistical data
analysis and label-free quantitative differential
proteomics.
Keywords: Semi-Supervised Learning,
Alignment, Differential Proteomics
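A rough sketch of aligning retention times between two LC/MS runs with a robust, nonlinear ridge regression is given below. The abstract does not specify the basis functions or the robust loss, so the polynomial basis, the Huber-style reweighting and all names and parameter values here are assumptions. The anchor pairs stand in for correspondences derived from sequenced tandem mass spectra.

```python
import numpy as np

def fit_robust_ridge(t_a, t_b, degree=2, lam=1e-6, delta=0.5, iters=20):
    """Fit t_b ~ poly(t_a) by ridge regression with Huber-style reweighting."""
    t_a = np.asarray(t_a, dtype=float)
    t_b = np.asarray(t_b, dtype=float)
    X = np.vander(t_a, degree + 1, increasing=True)   # columns 1, t, t^2, ...
    w = np.ones(len(t_a))
    for _ in range(iters):
        # Weighted ridge solution for the current weights.
        XtW = X.T * w
        beta = np.linalg.solve(XtW @ X + lam * np.eye(X.shape[1]), XtW @ t_b)
        # Down-weight anchors whose residual exceeds delta.
        resid = np.abs(t_b - X @ beta)
        w = delta / np.maximum(resid, delta)
    return beta

def predict(beta, t):
    """Map retention times from run A onto the time axis of run B."""
    t = np.atleast_1d(np.asarray(t, dtype=float))
    return np.vander(t, len(beta), increasing=True) @ beta

# Synthetic anchors: smooth drift between runs plus one wrong correspondence.
t_a = np.linspace(0.0, 10.0, 21)
t_b = 1.0 + 1.1 * t_a + 0.01 * t_a ** 2
t_b[10] += 20.0                       # a grossly mismatched anchor pair

robust_beta = fit_robust_ridge(t_a, t_b)
plain_beta = fit_robust_ridge(t_a, t_b, iters=1)   # single pass: ordinary ridge
true_value = 1.0 + 1.1 * 5.0 + 0.01 * 25.0
robust_error = abs(predict(robust_beta, 5.0)[0] - true_value)
plain_error = abs(predict(plain_beta, 5.0)[0] - true_value)
```

In this synthetic case the reweighted fit stays close to the underlying drift near the mismatched anchor, while the single-pass fit is pulled noticeably toward it.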
Annotating
proteins by mining protein interaction networks
Author(s):
Mustafa Kirac, Case Western Reserve University,
US
Gultekin Ozsoyoglu, Case Western Reserve University,
US
Jiong Yang, Case Western Reserve University, US
Motivation:
In general, the most accurate gene/protein annotations
are provided by curators. Although computationally derived
annotations carry weaker evidence, computational
methods are indispensable for fast, a priori discovery of protein
function annotations. This paper considers the problem
of assigning Gene Ontology (GO) annotations to partially
annotated or newly discovered proteins.
Results:
We present a data mining technique that computes
the probabilistic relationships between GO annotations
of proteins on protein-protein interaction data,
and assigns highly correlated GO terms of annotated
proteins to non-annotated proteins in the target
set. In comparison with other techniques, probabilistic
suffix tree and correlation mining techniques produce
the highest prediction accuracy: 81% precision
at 45% recall.
Availability:
Code is available upon request. Results and the materials
used are available online at http://kirac.case.edu/PROTAN
A
model-based approach for mining membrane protein crystallization
trials
Author(s):
Sitaram Asur, Department of Computer Science,
Ohio State University, USA
Srinivasan Parthasarathy, Department of Computer
Science, Ohio State University, USA
Pichai Raman, Department of Biophysics, Ohio State
University, USA
Matthew Eric Otey, Department of Computer Science,
Ohio State University, USA
Crystallization has been proven to be an essential
step in macromolecular structure verification.
Unfortunately, the crystallization process is quite
complex and forms a bottleneck. It can take anywhere from
weeks to years to obtain diffraction-quality crystals,
even under the right conditions. Other issues include
the time and cost involved in taking trials and
the presence of very few positive samples in a wide
and largely undetermined parameter space.
Any help in directing scientists' attention to
the hot spots in the conceptual crystallization
space would lead to increased efficiency in crystallization
trials. This work is an application case study on
mining membrane protein crystallization trials to
predict novel conditions with a high likelihood
of leading to crystallization. We use suitable supervised
learning algorithms to model the d | |