Motivation: Integration
of heterogeneous data in life sciences is a growing
and recognized challenge. The problem is not only
to enable the study of such data within the context
of a biological question, but also more fundamentally
how to represent the available knowledge and make
it accessible for mining.
Results: Our integration
approach is based on the premise that relationships
between biological entities can be represented
as a complex network. The context dependency is
achieved by a judicious use of distance measures
on these networks. The biological entities and
the distances between them are mapped for the
purpose of visualization into the lower-dimensional
space using the Sammon’s mapping. The system
implementation is based on a multi-tier architecture
using a native XML database and a software tool
for querying and visualizing complex biological
networks. The functionality of our system is demonstrated
with two examples: (1) A multiple pathway retrieval,
in which, given a pathway name, the system finds
all the relationships related to the query by
checking available metabolic pathway, transcriptional,
signaling, protein-protein interaction, and ontology
annotation resources and (2) A protein neighborhood
search, in which given a protein name, the system
finds all its connected entities within a specified
depth. These two examples show that our system
is able to conceptually traverse different databases
to produce testable hypotheses and lead towards
answers to complex biological questions.
2.
A Motif-based Framework for Recognizing Sequence
Families
Authors:Sharan, Myers
Motivation: Many
signals in biological sequences are based on the
presence or absence of base signals and their
spatial combinations. One of the best known examples
in this regard is the signal identifying a core
promoter---the site at which the basal transcription
machinery starts the transcription of a gene.
Our goal is a fully automatic pattern recognition
system for a family of sequences that simultaneously
discovers the base signals, their spatial relationships
and a classifier based upon them.
Results: In this
paper we present a general method for characterizing
a set of Sequences by their recurrent motifs.
Our approach relies on novel probabilistic models
for DNA binding sites and modules of binding sites,
on algorithms to learn them from data, and on
a support vector machine that uses the learned
models to classify a set of sequences. We demonstrate
the applicability of our approach to diverse instances,
ranging from families of promoter sequences to
a data set of intronic sequences flanking alternatively
spliced exons. On a core promoter data set our
results are comparable to the state-of-the-art
McPromoter. On a data set of alternatively spliced
exons we outperform a previous approach. We also
achieve high success rates in recognizing cell
cycle regulated genes. These results demonstrate
that a fully automatic pattern recognition algorithm
can meet or exceed the performance of hand-crafted
approaches.
Availability: The
software and data sets are available from the
authors upon request.
3.
A Procedure for Assessing GO Annotation Consistency
Authors:Dolan, Ni, Camon, Blake
Motivation: The
Gene Ontology [GO] is widely used to annotate
molecular attributes of genes and gene products.
Multiple groups undertaking functional annotations
of genomes contribute their annotation sets to
the GO database resource and these data are subsequently
used in comparative functional analysis research.
Although GO curators adhere to the same protocols
and standards while assigning GO annotations,
the specific procedure followed by each annotation
group can vary. Since differences in application
of annotation standards would dilute the effectiveness
of comparative analysis, methods for assessing
annotation consistency are essential. The development
of methodologies that are broadly applicable for
the assessment of GO annotation consistency is
an important issue for the comparative genomics
community.
Results: We have
developed a methodology for assessing the consistency
of GO annotations provided by different annotation
groups. The method is completely general and can
be applied to compare any two sets of GO annotations.
This is the first attempt to assess cross-species
GO annotation consistency. Our method compares
annotation sets utilizing the hierarchical structure
of the GO to compare GO annotations between orthologous
gene pairs. The method produces a report on the
annotation consistency and inconsistency for each
orthologous pair. We present results obtained
by comparing GO annotations for mouse and human
gene sets.
4.
An HMM Posterior Decoder for Sequence Feature Prediction
that Includes Homology Information
Authors:Käll, Krogh, Sonnhammer
When predicting sequence features
like transmembrane topology, signal peptides,
coil-coil structures, protein secondary structure,
or genes, extra support can be gained from homologs.
We here present a general HMM decoding algorithm
that combines probabilities for sequence features
of homologs by considering the average of the
posterior label probability of each position in
a global sequence alignment. The algorithm is
an extension of the previously described "optimal
accuracy" decoder, allowing homology information
to be used. It was benchmarked using an HMM for
transmembrane topology and signal peptide prediction,
Phobius. We found that performance was substantially
increased when incorporating information from
homologs.
5.
YeastHub: A Semantic Web Use Case for Integrating
Data in the Life Sciences Domain
Motivation: As
the semantic web technology is maturing and the
need for life sciences data integration over the
web is growing, it is important to explore how
data integration needs can be addressed by the
semantic web. The main problem that we face in
data integration is a lack of widely-accepted
standards for expressing the syntax and semantics
of the data. We address this problem by exploring
the use of semantic web technologies — including
Resource Description Framework (RDF), RDF Site
Summary (RSS), relational-database-to-RDF mapping
(D2RQ), and native RDF data repository —
to represent, store, and query both metadata and
data across life sciences datasets.
Results: As many
biological datasets are presently available in
tabular format, we introduce an RDF structure
into which they can be converted. Also, we develop
a prototype web-based application called YeastHub
that demonstrates how a life sciences data warehouse
can be built using a native RDF data store (Sesame).
This data warehouse allows integration of different
types of yeast genome data provided by different
resources in different formats including the tabular
and RDF formats. Once the data are loaded into
the data warehouse, RDF-based queries can be formulated
to retrieve and query the data in an integrated
fashion.
6.
Improved Detection of DNA Motifs Using a Self-organized
Clustering of Familial Binding Profiles
Authors:Mahony, Golden, Smith, Benos
Motivation: One
of the limiting factors in deciphering transcriptional
regulatory networks is the effectiveness of motif-finding
software. An emerging avenue for improving motif-finding
accuracy aims to incorporate generalised binding
constraints of related transcription factors (TFs),
named familial binding profiles (FBPs), as priors
in motif identification methods. A motif-finder
can thus be “biased” towards finding
motifs from a particular TF family. However, current
motif-finders allow only a single FBP to be used
as a prior in a given motif-finding run. In addition,
current FBP construction methods are based on
manual clustering of position specific scoring
matrices (PSSMs) according to the known structural
properties of the TF proteins. Manual clustering
assumes that the binding preferences of structurally
similar TFs will also be similar. This assumption
is not true, at least not for some TF families.
Automatic PSSM clustering methods are thus required
for augmenting the usefulness of FBPs.
Results: A novel
method is developed for automatic clustering of
PSSM models. The resulting FBPs are incorporated
into the SOMBRERO motif-finder, significantly
improving its performance when finding motifs
related to those that have been incorporated.
SOMBRERO is thus the only existing de novo motif-finder
that can incorporate knowledge of all known PSSMs
in a given motif-finding run.
Availability: The
methods outlined will be incorporated into the
next release of SOMBRERO, which is available from
http://bioinf.nuigalway.ie/sombrero.
7.
High-recall Protein Entity Recognition Using a Dictionary
Authors:Kou, Cohen, Murphy
Protein name extraction is an important
step in mining biological literature. We describe
two new methods for this task: semiCRFs and dictionary
HMMs. SemiCRFs are a recently-proposed extension
to conditional random fields that enables more
effective use of dictionary information as features.
Dictionary HMMs are a technique in which a dictionary
is converted to a large HMM that recognizes phrases
from the dictionary, as well as variations of
these phrases. Standard training methods for HMMs
can be used to learn which variants should be
recognized. We compared the performance of our
new approaches to that of Maximum Entropy (MaxEnt)
and normal CRFs on three datasets, and improvement
was obtained for all four methods over the best
published results for two of the datasets. CRFs
and semiCRFs achieved the highest overall performance
according to the widely-used F-measure, while
the dictionary HMMs performed the best at finding
entities that actually appear in the dictionary—the
measure of most interest in our intended application.
8.
Statistics of Local Multiple Alignments
Authors:Prakash, Tompa
BLAST [Altschul et al., 1990] statistics
have been shown to be extremely useful for searching
for significant similarity hits, for amino acid
and nucleotide sequences. While these statistics
are well understood for pairwise comparisons,
there has been little success developing statistical
scores for multiple alignments. In particular,
there is no score for multiple alignment that
is well founded and treated as a standard. We
extend the BLAST theory to multiple alignments.
Following some simple assumptions, we present
and justify a significance score for multiple
segments of a local multiple alignment. We demonstrate
its usefulness in distinguishing high and moderate
quality multiple alignments from low quality ones,
with supporting experiments on orthologous vertebrate
promoter sequences.
9.
Beyond The Clause: Extraction of Phosphorylation
Information from Medline Abstracts
Authors:Narayanaswamy, Ravikumar, Vijay-Shanker
Phosphorylation is an important
biochemical reaction that plays a critical Role
in signal transduction pathways and cell-cycle
processes. A text mining system to extract the
phosphorylation relation from the literature is
reported. The focus of this paper is on the new
methods developed and implemented to connect and
merge pieces of information about phosphorylation
mentioned in different sentences in the text.
The effectiveness and accuracy of the system as
a whole as well as that of the methods for extraction
beyond a clause/sentence is evaluated using an
independently annotated dataset, the Phospho.ELM
database. The new methods developed to merge pieces
of information from different sentences are shown
to effective in significantly raising the recall
without much difference in precision.
10.
Computing the P-value of the Information Content
from an Alignment of Multiple Sequences
Authors:Nagarajan, Jones, Keich
Motivation: The
efficient and accurate computation of p-values
is an essential requirement for motif-finding
and alignment tools. We show that the approximation
algorithms used in two popular motif-finding programs,
MEME and Consensus, can fail to accurately compute
the p-value.
Results: We present
two new algorithms: one for the evaluation of
the p-values of a range of motif scores, and a
faster one for the evaluation of the p-value of
a single motif score. Both exhibit more reliability
than existing algorithms, and the latter algorithm
is comparable in speed to the fastest existing
method.
11.
Validation of Qualitative Models of Genetic Regulatory
Networks by Model Checking: Analysis of the Nutritional
Stress Response in Escherichia coli
Authors:Batt, Ropers, de Jong, Geiselmann,
Mateescu, Page, Schneider
Motivation: The
modeling and simulation of genetic regulatory
networks has created the need for tools for model
validation. The main challenges of model validation
are the achievement of a match between the precision
of model predictions and experimental data, as
well as the efficient and reliable comparison
of the predictions and observations.
Results: We present
an approach towards the validation of models of
genetic regulatory networks addressing the above
challenges. It combines a method for qualitative
modeling and simulation with techniques for model
checking, and is supported by a new version of
the computer tool GNA. The model-validation approach
has been applied to the analysis of the network
controlling the nutritional stress response in
Escherichia coli. Availability: GNA and the model
of the stress response network are available at
http://www-helix.inrialpes.fr/gna
12.
Identification of Nine Human-specific Frameshift
Mutations by Comparative Analysis of the Human and
the Chimpanzee Genome Sequences
Authors:Hahn, Lee
Motivation: The
recent release of the draft sequence of the chimpanzee
genome is an invaluable resource for finding genome-wide
genetic differences that might explain phenotypic
differences between humans and chimpanzees.
Results: In this
paper, we describe a simple procedure to identify
potential human-specific frameshift mutations
that occurred after the divergence of human and
chimpanzee. The procedure involves collecting
human coding exons bearing insertions or deletions
compared to the chimpanzee genome and identification
of homologs from other species supporting that
the mutations are human-specific. Using this procedure,
we identified nine genes, BASE, DNAJB3, FLJ33674,
HEJ1, NTSR2, RPL13AP, SCGB1D4, WBSCR27, and ZCCHC13,
that show human-specific alterations including
truncations of the C-terminus. In some cases,
the frameshift mutation results in gene inactivation
or decay. In other cases, the altered protein
seems to be functional. This study demonstrates
that even the unfinished chimpanzee genome sequence
can be useful in identifying modification of genes
that are specific to the human lineage and therefore
could potentially be relevant to the study of
the acquisition of human-specific traits.
13.
Kernel Methods for Predicting Protein-protein Interactions
Authors:Ben-Hur, Noble Paper
Despite advances in high throughput
methods for discovering protein-protein interactions,
the interaction networks of even well-studied
model organisms are sketchy at best, highlighting
the continued need for computational methods to
help direct experimentalists in the search for
novel interactions.
We present a kernel method for predicting
protein-protein interactions using a combination
of data sources, including protein sequences,
Gene Ontology annotations, local properties of
the network, and homologous interactions in other
species. Whereas protein kernels proposed in the
literature provide a similarity between single
proteins, prediction of interactions requires
a kernel between pairs of proteins. We propose
a pairwise kernel that converts a kernel between
single proteins into a kernel between pairs of
proteins, and we illustrate the kernel's effectiveness
in conjunction with a support vector machine classifier.
Furthermore, we obtain improved performance by
combining several sequence-based kernels based
on k-mer frequency, motif and domain content and
by further augmenting the pairwise sequence kernel
with features that are based on other sources
of data.
We apply our method to predict physical
interactions in yeast using data from the BIND
database. At a false positive rate of 1% the classifier
retrieves close to 80% of a set of trusted interactions.
We thus demonstrate the ability of our method
to make accurate predictions despite the sizeable
fraction of false positives that are known to
exist in interaction databases.
14.
Maximum Likelihood of Evolutionary Trees: Hardness
and Approximation
Authors:Tuller, Chor
Motivation: Maximum
likelihood (ML) is an increasingly popular optimality
criterion for selecting evolutionary trees (Felsenstein,
1981). Yet the computational complexity of ML
was open for over 20 years, and only recently
resolved by the authors (2005) for the Jukes Cantor
model of substitution and its generalizations.
It was proved that reconstructing the ML tree
is computationally intractable (NP hard). In this
work we explore three directions, which extend
that result.
Results:
We show that ML under the assumption of molecular
clock is still computationally intractable (NP
hard).
We show that not only is it computationally
intractable to find the exact ML tree, even
approximating the logarithm of the maximum likelihood
for any multiplicative factor smaller than 1.00175
is computationally intractable.
We develop an algorithm for approximating
log-likelihood under the condition that the
input sequences are sparse. It employs any approximation
algorithm for parsimony, and asymptotically
achieves the same approximation ratio. We note
that ML reconstruction for sparse inputs is
still hard under this condition, and furthermore
many real datasets satisfy it.
15.
Automatic Detection of Subsystem/Pathway Variants
in Genome Analysis
Authors:Ye, Osterman, Overbeek, Godzik
Motivation: Proteins
work together in pathways and networks, collectively
comprising the cellular machinery. A subsystem
(a generalization of pathway concept) is a group
of related functional roles (such as enzymes)
jointly involved in a specific aspect of the cellular
machinery. Subsystems provide a natural framework
for comparative genome analysis and functional
annotation. A subsystem may be implemented in
a number of different functional variants in individual
species. In order to reliably project functional
assignments across multiple genomes, we have to
be able to identify the variants implemented in
each genome. The analysis of such variants across
diverse species is an interesting problem by itself
and may provide new evolutionary insights. However,
no computational techniques are presently available
for an automated detection and analysis of subsystem
variants.
Results: Here we
formulate the subsystem variant detection problem
as finding the minimum number of subgraphs of
a subsystem, which is represented as a graph,
and solve the optimization problem by integer
programming approach. The performance of our method
was tested on subsystems encoded in the SEED,
a genomic integration platform developed by the
Fellowship for Interpretation of Genomes (FIG)
as a component of a large-scale effort on comparative
analysis and annotation of multiple diverse genomes.
Here we illustrate the results obtained for two
expert-encoded subsystems of the biosynthesis
of Coenzyme A and FMN/FAD cofactors. Applications
of variant detection, to support genomic annotations
and to assess divergence of species, are briefly
discussed in the context of these universally
conserved and essential metabolic subsystems.
16.
Detecting Coevolving Amino Acid sites using Bayesian
Mutational Mapping
Authors:Dimmic, Hubisz, Bustamante,
Nielsen
Motivation: The
evolution of protein sequences is constrained
by complex interactions between amino acid residues.
Because harmful substitutions may be compensated
by other substitutions at neighboring sites, residues
can coevolve. We describe a Bayesian phylogenetic
approach to the detection of coevolving residues
in protein families. This method, Bayesian mutational
mapping (BMM), assigns mutations to the branches
of the evolutionary tree stochastically, and then
test statistics are calculated to determine whether
a coevolutionary signal exists in the mapping.
Posterior predictive P-values provide an estimate
of significance, and specificity is maintained
by integrating over uncertainty in the estimation
of the tree topology, branch lengths, and substitution
rates. A coevolutionary Markov model for codon
substitution is also described, and this model
is used as the basis of several test statistics.
Results: Results
on simulated coevolutionary data indicate that
the BMM method can successfully detect nearly
all coevolving sites when the model has been correctly
specified, and that nonparametric statistics such
as mutual information are generally less powerful
than parametric statistics. On a dataset of eukaryotic
proteins from the phosphoglycerate kinase (PGK)
family, interdomain site contacts yield a significantly
greater coevolutionary signal than interdomain
non-contacts, an indication that the method provides
information about interacting sites. Failure to
account for the heterogeneity in rates across
sites in PGK resulted in a less discriminating
test, yielding a marked increase in the number
of reported positives at both contact and non-contact
sites.
17.
Modeling the Organization of the WUSCHEL Expression
Domain in the Shoot Apical Meristem
Motivation: The
above ground tissues of higher plants are generated
from a small region of cells situated at the plant
apex called the shoot apical meristem. An important
genetic control circuit modulating the size of
the Arabidopsis thaliana meristem is a feed-back
network between the CLAVATA3 and WUSCHEL genes.
Although the expression patterns for these genes
do not overlap, WUSCHEL activity is both necessary
and sufficient (when expressed ectopically) for
the induction of CLAVATA3 expression. However,
upregulation of CLAVATA3 in conjunction with the
receptor kinase CLAVATA1 results in the down regulation
of WUSCHEL. Despite much work, experimental data
for this network is incomplete and additional
hypotheses are needed to explain the spatial locations
and dynamics of these expression domains. Predictive
mathematical models describing the system should
provide a useful tool for investigating and discriminating
among possible hypotheses, by determining which
hypotheses best explain observed gene expression
dynamics.
Results: We are
developing a method using in vivo live confocal
microscopy to capture quantitative gene expression
data and create templates for computational models.
We present two models accounting for the organization
of the WUSCHEL expression domain. Our preferred
model uses a reaction-diffusion mechanism, in
which an activator induces WUSCHEL expression.
This model is able to organize the WUSCHEL expression
domain. In addition, the model predicts the dynamical
reorganization seen in experiments where cells,
including the WUSCHEL domain, are ablated, and
also predicts the spatial expansion of the WUSCHEL
domain resulting from removal of the CLAVATA3
signal.
18.
Efficient Computation of Close Lower and Upper Bounds
on the Minimum Number of Recombinations in Biological
Sequence Evolution
Authors:Song, Wu, Gusfield
Results: In this
paper, we present efficient, practical methods
for computing both upper and lower bounds on the
minimum number of needed recombinations, and for
constructing evolutionary histories that explain
the input sequences. We extensively study the
efficiency and accuracy of these algorithms on
both simulated and real data sets. The algorithms
produce very close upper and lower bounds, which
match exactly in a surprisingly wide range of
data. Thus, with the use of new, very effective
lower bounding methods and an efficient algorithm
for computing upper bounds, this approach allows
the efficient exact computation of the minimum
number of needed recombinations, with high frequency
in a large range of data. When upper and lower
bounds match, evolutionary histories found by
our algorithm correspond to most parsimonious
histories.
The classification of protein sequences
obtained from patients with various immunoglobulin-related
conformational diseases may provide insight into
structural correlates of pathogenicity. However,
clinical data are very sparse and, in the case
of antibody-related proteins, the collected sequences
have large variability with only a small subset
of variations relevant to the protein pathogenicity
(function). On this basis, these sequences represent
a model system for development of strategies to
recognize the small subset of function-determining
variations among the much larger number of primary
structure diversifications introduced during evolution.
Under such conditions, most protein classification
algorithms have limited accuracy. To address this
problem, we propose a Support Vector Machine (SVM)-based
classifier that combines sequence and 3D structural
averaging information. Each amino acid in the
sequence is represented by a set of six physicochemical
properties: hydrophobicity, hydrophilicity, volume,
surface area, bulkiness, and refractivity. Each
position in the sequence is described by the properties
of the amino acid at that position and the properties
of its neighbors in 3D space or its neighbors
in the sequence. A structure template is selected
to determine neighbors in 3D space and a window
size is used to determine the neighbors in the
sequence. The test data consists of 209 proteins
of human antibody immunoglobulin light chains,
each represented by aligned sequences of 120 amino
acids. The methodology is applied to the classification
of protein sequences collected from patients with
and without amyloidosis, and indicates that the
proposed modified classifiers are more robust
to sequence variability than standard SVM classifiers,
improving classification error between 5-25% and
sensitivity between 9-17%. The classification
results might also suggest possible mechanisms
for the propensity of immunoglobulin light chains
to amyloid formation.
20.
A Statistical Method for Detecting Splice Variation
from Expression Data
Authors:Cline, Blume, Cawley, Clark,
Hu, Lu, Salomonis, Wang, Williams
Motivation: Many
or most mammalian genes undergo alternative splicing,
generating a variety of transcripts from a single
gene. New information on splice variation is becoming
available through technology for measuring expression
levels of several exons or splice junctions per
gene. We have developed a statistical method,
ANOSVA, to detect alternative splicing from expression
data. ANOSVA requires no transcript information,
so can be applied when the level of annotation
data is poor. When validated against spiked clone
data, it generated no false positives and few
false negatives. We demonstrated ANOSVA with data
from a prototype mouse alternative splicing array,
yielding a set of genes with evidence of tissue-specific
splice variation. These results are available
at http://bioinfo.affymetrix.com/Papers/ANOSVA.
Motivation: Computational
approaches to protein function prediction infer
protein function by finding proteins with similar
sequence, structure, surface clefts, chemical
properties, amino acid motifs, interaction partners
or phylogenetic profiles. We present a new approach
that combines sequential, structural and chemical
information into one graph model of proteins.
We predict functional class membership of enzymes
and non-enzymes using graph kernels and Support
Vector Machine classification on these protein
graphs.
Results: Our graph
model, derivable from protein sequence and structure
only, is competitive with vector models that require
additional protein information such as the size
of surface pockets. If we include this extra information
into our graph model, our classifier yields significantly
higher accuracy levels than the vector models.
Hyperkernels allow us to select and to optimally
combine the most relevant node attributes in our
protein graphs. We have laid the foundation for
a protein function prediction system that integrates
protein information from various sources efficiently
and effectively.
Availability: More
information available via www.dbs.ifi.lmu.de/Mitarbeiter/borgwardt.html.
22.
Robust Classification Modeling on Microarray Data
Using Misclassification Penalized Posterior
Authors:Soukup, Lee, Cho
Genome-wide microarray data are
often used in challenging classification problems
of subtypes of human diseases. However, the identification
of a parsimonious robust prediction model that
performs consistently well on future independent
data has not been successful due to the biased
model selection from an extremely large number
of candidate models during the classification
model search and construction. Furthermore, common
criteria of prediction model performance such
as classification error rates do not provide a
sensitive measure for valuating performance of
such astronomic competing models. Also, even though
several different classification approaches have
been utilized to tackle such classification problems,
no direct comparison on these methods have been
made. We introduce a novel measure for assessing
the performance of a prediction model, the misclassification-penalized
posterior (MiPP), the sum of the posterior classification
probabilities penalized by the number of incorrectly
classified samples. Using MiPP, we implement a
forward step-wise cross-validated procedure to
find our optimal prediction models with different
numbers of features on a training set. Our final
robust classification model and its dimension
are determined based on a completely independent
test data set. This MiPP-based classification
modeling approach enables us to identify the most
parsimonious robust prediction models only with
two or three features on well-known microarray
data sets. These models show superior performance
to other models in the literature that often have
more than 40--100 features in their model construction.
Our classification algorithm is available at the
Bioconductor web site as the MiPP package.
23.
CaSPredictor: A new Computer-based Tool for Caspase
Substrate Prediction
Motivation: The
in vitro studies have shown that the most remarkable
catalytic features of caspases, a family of cys-teineproteases,
are their stringent specificity to Asp (D) in
the S1 subsite and at least four amino acid to
the left of scis-sile bound. However, there is
little information about the substraterecognition
patterns in vivo. The prediction and-characterization
of proteolytic cleavage sites in natural sub-strates
could be useful for uncovering these structural
rela-tionships.
Results: PEST-like
sequences rich in the amino acids Ser (S), Thr
(T), Pro (P), Glu or Asp (E/D), including Asn
(N) and Gln (Q) are adjacent structural/sequential
elements in the majority of cleavage site regions
of the natural caspase sub-strates described in
the literature, supporting its possible implication
in the substrate selection by caspases. We de-veloped
CaSPredictor, a software which incorporated a
PEST-like index and the position-dependent amino
acid ma-trices for prediction of caspase cleavage
sites in individual proteins and protein datasets.
The program predicted suc-cessfully 81% (111/137)
of the cleavage sites in experimen-tally verified
caspase substrates not annotated in its internal
data file. Its accuracy and confidence was estimated
as 80% using ROC methodology. The program was
much more effi-cient to predict caspase substrates
when compared to Pep-tideCutter and PEPS softwares.
Finally, the program de-tected potential cleavage
sites in the primary sequences of 1644 proteins
in a dataset containing 9986 protein entries.
24.
Experimental Design for Three-Color and Four-Color
Gene Expression Microarrays
Authors:Woo, Krueger, Kaur, Churchill
Motivation: Compared
to two-color microarrays, three-color microarrays
can increase design efficiency and power to detect
differential expression without additional samples
and arrays. Furthermore, three-color microarray
technology is currently available at a reasonable
cost. Despite the potential advantages, clear
guidelines for designing and analyzing three-color
experiments do not exist.
Results: We propose
a three- and four-color cyclic design (loop) and
a complementary graphical representation to help
design experiments that are balanced, efficient,
and robust to hybridization failures. In theory,
three-color loop designs are more efficient than
two-color loop designs. Experiments using both
two- and three-color platforms were performed
in parallel and its outputwere analyzed using
linearmixedmodel analysis in R/MAANOVA. These
results demonstrate that three-color experiments
using the same number of samples (and fewer arrays)
will perform no less efficiently than two-color
experiments. The improved efficiency of the design
is somewhat offset by a reduced dynamic range
and increased variability in the three-color experimental
system. This result suggests that, with minor
technological improvements, three-color microarrays
using loop designs could detect differential expression
more efficiently than two-color loop designs.
25.
Families of Membranous Proteins can be Characterized
by the Amino Acid Composition of their Transmembrane
Domains
Authors:Sadka, Linial
In eukaryotes, membranous
proteins account for 20-30% of the proteome. Most
of these proteins contain one or more transmembrane
(TM) domains. These are short segments that transverse
the bilayer lipid membrane. Various properties
of the TM domains, such as their number, their
topology and their arrangement within the membrane
are closely related to the protein’s cellular
functions. Properties of the TM domains also determine
the cellular targeting and localization of these
proteins. It is not known, however, whether the
information encoded by TM domains suffices for
the purpose of classifying proteins into their
functional families. This is the question we address
here. We introduce an algorithm that creates a
profile of each functional family of membranous
proteins based only on the amino-acid composition
of their TM domains. This is complemented by a
classifier program for each such family (to determine
whether a given protein belongs to it or not).
We find that in most instances TM domains contain
enough information to allow an accurate discrimination
of ~80% sensitivity and ~90% specificity among
unrelated polytopic functional families with the
same number of TM domains.
26.
A Hidden Markov Model for Analyzing ChIP-chip Experiments
on Genome Tiling Arrays and its Application to p53
Binding Sequences
Authors:Li, Meyer, Liu
Motivation: Transcription
factors (TFs) regulate gene ex-pression by recognizing
and binding to specific regulatory regions on
the genome, which in higher eukaryotes can oc-cur
far away from the regulated genes. Recently Affymetrix
developed the high-density oligonucleotide arrays
that tile all the non-repetitive sequences of
the human genome at 35-bp resolution. This new
array platform allows for the unbiased mapping
of in vivo TF binding sequences (TFBSs) using
Chromatin ImmunoPrecipitation followed by microarray
ex-periments (ChIP-chip). The massive data set
generated from these experiments pose great challenges
for data analysis.
Results: We developed
a fast, scalable and sensitive method to extract
TFBSs from ChIP-chip experiments on genome tiling
arrays. Our method takes advantage of tiling array
data from many experiments to normalize and model
the behavior of each individual probe, and identifies
TFBSs using a Hidden Markov Model (HMM). When
applied to the data of p53 ChIP-chip experiments
(Cawley et al., 2004), our method discovered many
new high confidence p53 targets including all
the regions verified by quantitative PCR . Using
a de novo motif finding algorithm MDscan (Liu
et al., 2002), we also recovered the p53 motif
from our HMM identified p53 target regions. Furthermore,
we found substantial p53 motif enrichment in these
regions comparing with both ge-nomic background
and the TFBSs identified by Cawley et al (2004).
Several of the newly identified p53 TFBSs are
in known genes’ promoter regions or associated
with previ-ously characterized p53-responsive
genes.
27.
High-Throughput Inference of Protein-Protein Interfaces
from Unassigned NMR Data
Authors:Mettu, Lilien, Donald
We cast the problem of identifying
protein-protein interfaces, using only unassigned
NMR spectra, into a geometric clustering problem.
Identifying protein-protein interfaces is critical
to understanding inter- and intra-cellular communication,
and NMR allows the study of protein interaction
in solution. However it is often the case that
NMR studies of a protein complex are very time-consuming,
mainly due to the bottleneck in assigning the
chemical shifts, even if the apo structures of
the constituent proteins are known. We study whether
it is possible, in a high-throughput manner, to
identify the interface region of a protein complex
using only unassigned chemical shift and residual
dipolar coupling (RDC) data.
We introduce a geometric optimization
problem where we must cluster the cells in an
arrangement on the boundary of a 3-manifold, where
the arrangement is induced by a spherical quadratic
form. We show that this formalism derives directly
from the physics of RDCs. We present an optimal
algorithm for this problem that runs in O(n^3
log n) time for an n-residue protein. We then
use this clustering algorithm as a subroutine
in a practical algorithm for identifying the interface
region of a protein complex from unassigned NMR
data. We present the results of our algorithm
on NMR data for 7 proteins from 5 protein complexes
and show that our approach is useful for high-throughput
applications in which we seek to rapidly identify
the interface region of a protein complex.
28.
Multi-way Clustering of Microarray Data using Probabilistic
Sparse Matrix Factorization
Authors:Dueck, Morris, Frey
We address the problem of multi-way
clustering of microarray data using a generative
model. Our algorithm, probabilistic sparse matrix
factorization (PSMF), is a probabilistic extension
of a previous hard-decision algorithm for this
problem. PSMF allows for varying levels of sensor
noise in the data, uncertainty in the hidden prototypes
used to explain the data, and uncertainty as to
which prototypes are selected to explain each
data vector. We present experimental results demonstrating
that our method can better recover functionally-relevant
clusterings in mRNA expression data than standard
clustering techniques, including hierarchical
agglomerative clustering, and we show that by
computing probabilities instead of point estimates,
our method avoids converging to poor solutions.
29.
Tag SNP Selection in Genotype Data for Maximizing
SNP Prediction Accuracy
Authors:Halperin, Kimmel, Shamir
Motivation: The
search for genetic regions associated with complex
disease, such as cancer or Alzheimer's disease,
is an important challenge that may lead to better
diagnosis and treatment. The existence of millions
of DNA variations, primarily single nucleotide
polymorphisms (SNPs), may allow the fine dissection
of such associations. However, studies seeking
disease association are limited by the cost of
genotyping SNPs. Therefore, it is essential to
find a small subset of informative SNPs (tag SNPs)
that may be used as good representatives of the
rest of the SNPs.
Results: We define
a new natural measure for evaluating the prediction
accuracy of a set of tag SNPs, and use it to develop
a new method for tag SNPs selection. Our method
is based on a novel algorithm that predicts the
values of the rest of the SNPs given the tag SNPs.
In contrast to most previous methods, our prediction
algorithm uses the genotype information and not
the haplotype information of the tag SNPs. Our
method is very efficient, and it does not rely
on having a block partition of the genomic region.
We compared our method to two state
of the art tag SNP selection algorithms on 58
different genotype data sets from four different
sources. Our method consistently found tag SNPs
with considerably better prediction ability than
the other methods.
30.
Three-Stage Prediction of Protein Beta-Sheets by
Neural Networks, Alignments, and Graph Algorithms
Authors:Cheng, Baldi
Motivation: Protein
beta-sheets play a fundamental role in protein
structure, function, evolution, and bio-engineering.
Accurate prediction and assembly of protein beta-sheets,
however, remains challenging because protein beta-sheets
require formation of hydrogen bonds between linearly
distant residues. Previous approaches for predicting
beta-sheet topological features, such as beta-strand
alignments, in general have not exploited the
global covariation and constraints characteristic
of beta-sheet architectures.
Results: We propose
a modular approach to the problem of predicting/assembling
protein beta-sheets in a chain by integrating
both local and global constraints in three steps.
The first step uses recursive neural networks
to predict pairing probabilities for all pairs
of inter-strand beta-residues from profile, secondary
structure, and solvent accessibility information.
The second step applies dynamic programming techniques
to these probabilities to derive binding pseudo-energies
and optimal alignments between all pairs of beta-strands.
Finally, the third step, uses graph matching algorithms
to predict the beta-sheet architecture of the
protein by optimizing the global pseudo-energy
while enforcing strong global beta-strand pairing
constraints. The approach is evaluated using cross-validation
methods on a large non-homologous dataset and
yields significant improvements over previous
methods.
31.
Mining ChIP-chip Data for Transcription Factor and
Cofactor Binding Sites
Authors:Smith, Sumazin, Das, Zhang
We describe methodology to identify
de novo individual and interacting pairs of binding
site motifs from ChIP-chip data, using an algorithm
that integrates localization data directly into
the motif discovery process. We combine matrix-enumeration-based
motif discovery with multi-variate regression
to evaluate candidate motifs and identify motif
interactions. When applied to the HNF localization
data of Odom et al. (2004) in liver and pancreatic
islets, our methods produce motifs that are either
novel or improved known motifs. All motif pairs
identified to predict localization are further
evaluated according to how well they predict expression
in liver and islets, and according to how conserved
are the relative positions of their occurrences.
We find that interaction models of HNF1 and CDP
motifs provide excellent prediction of both HNF1
localization and gene expression in liver. Our
results demonstrate that ChIP-chip data can be
used to identify interacting binding site motifs.
32.
Improving Protein Structure Prediction With Model-Based
Search
Authors:Brunette, Brock
De novo protein structure prediction
can be formulated as search in a high-dimensional
space. One of the most frequently used computational
tools to solve such search problems is the Monte
Carlo method. We present a novel search technique,
called model-based search. This method samples
the high-dimensional search space to build an
approximate model of the underlying function.
This model is incrementally refined in areas of
interest, while areas that are not of interest
are excluded from further exploration. Model-based
search derives its efficiency from the fact that
the information obtained during the exploration
of the search space is used to guide further exploration.
In contrast, Monte Carlo-based techniques are
memory-less and exploration is performed based
on random walks, ignoring the information obtained
in previous steps. Model-based search is applied
to protein structure prediction, where search
is employed to find the global minimum of the
protein's energy landscape. We show that model-based
search uses computational resources more efficiently
to find lower-energy conformations of proteins
when compared to one of the leading protein structure
prediction methods, which relies on a tailored
Monte Carlo method to perform search. The performance
improvements become more pronounced as the dimensionality
of the search space increases. We show that model-based
search enables more accurate protein structure
prediction than previously possible. Furthermore,
we believe that similar performance improvements
can be expected in other problems that are currently
solved using Monte Carlo-based search methods.