PRESENTATION 2

TITLE:
Mining viral protease data to extract cleavage knowledge

AUTHORS:
A. Narayanan, X. Wu and Z. Rong Yang

ABSTRACT:
Motivation: We aim to identify, through machine learning
techniques, specific patterns in HIV and HCV viral polyprotein
amino acid residues at which the viral protease cleaves the
polyprotein as it leaves the ribosome. An understanding of viral
protease specificity may
help the development of future anti-viral drugs involving protease
inhibitors by identifying specific features of protease activity
for further experimental investigation. While viral sequence
information is growing at a fast rate, there is still
comparatively little understanding of how viral polyproteins are
cut into their functional unit lengths. The aim of the work
reported here is to investigate whether it is possible to
generalise from known cleavage sites to unknown cleavage sites for
two specific viruses - HIV and HCV. An understanding of
proteolytic activity for specific viruses will contribute to our
understanding of viral protease function in general, thereby
leading to a greater understanding of protease families and their
substrate characteristics.
Results: Our results show that artificial neural networks and
symbolic learning techniques (See5) capture some fundamental and
new substrate attributes, but neural networks outperform their
symbolic counterpart. Publicly available software was used:
- Stuttgart Neural Network Simulator:
  http://www-ra.informatik.uni-tuebingen.de/SNNS/
- See5:
  http://www.rulequest.com

AVAILABILITY:
The datasets used (HIV, HCV) for See5 are available at:
http://www.dcs.ex.ac.uk/~anarayan/bioinf/ismbdatasets/

CONTACT:
a.narayanan@ex.ac.uk
z.r.yang@ex.ac.uk

PRESENTATION 3

TITLE:
The metric space of proteins - comparative study of clustering
algorithms

AUTHORS:
O. Sasson, N. Linial, M. Linial

ABSTRACT:
Motivation: A large fraction of biological research concentrates
on individual proteins and on small families of proteins. One of
the current major challenges in bioinformatics is to extend our
knowledge to very large sets of proteins. Several major
projects have tackled this problem. Such undertakings usually
start with a process that clusters all known proteins or large
subsets of this space. Some work in this area is carried out
automatically, while other attempts incorporate expert advice and
annotation.
Results: We propose a novel technique that automatically clusters
protein sequences. We consider all proteins in SWISS-PROT, and
carry out an all-against-all BLAST similarity test among them.
With this similarity measure in hand we proceed to perform a
continuous bottom-up clustering process by applying alternative
rules for merging clusters. The outcome of this clustering process
is a classification of the input proteins into a hierarchy of
clusters of varying degrees of granularity. Here we compare the
clusters that result from alternative merging rules, and validate
the results against InterPro.
Our preliminary results show that clusters that are consistent
with several rather than a single merging rule tend to comply with
InterPro annotation. This is an affirmation of the view that the
protein space consists of families that differ markedly in their
evolutionary conservation.
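
The bottom-up merging process described above can be sketched in a few lines. This is a toy illustration, not the authors' implementation: single-, average- and complete-linkage serve as stand-ins for the alternative merging rules, and the distance matrix is hand-made rather than BLAST-derived.

```python
def cluster_distance(rule, ca, cb, d):
    """Distance between clusters ca and cb under a merging rule."""
    pairs = [d[i][j] for i in ca for j in cb]
    if rule == "single":
        return min(pairs)
    if rule == "complete":
        return max(pairs)
    return sum(pairs) / len(pairs)  # average linkage

def bottom_up(d, rule, stop_at=1):
    """Greedily merge the two closest clusters until stop_at remain."""
    clusters = [[i] for i in range(len(d))]
    history = []
    while len(clusters) > stop_at:
        a, b = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda p: cluster_distance(rule, clusters[p[0]],
                                           clusters[p[1]], d),
        )
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
        history.append(merged)
    return clusters, history

# Toy symmetric distance matrix over four "proteins":
d = [[0.0, 0.1, 0.9, 0.8],
     [0.1, 0.0, 0.7, 0.9],
     [0.9, 0.7, 0.0, 0.2],
     [0.8, 0.9, 0.2, 0.0]]
clusters, history = bottom_up(d, "average", stop_at=2)
```

In the spirit of the abstract, clusters that survive under several such rules rather than a single one would be the candidates most likely to agree with InterPro annotation.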

AVAILABILITY:
The outcome of these investigations can be viewed in an interactive Web site at:
http://www.protonet.cs.huji.ac.il
Supplementary information: biological examples for comparing the
performance of the different algorithms used for classification
are presented at:
http://www.protonet.cs.huji.ac.il/examples.html

CONTACT:
ori@cs.huji.ac.il

PRESENTATION 4

TITLE:
DNA sequence and structure:
direct and indirect recognition in protein-DNA binding

AUTHORS:
N.F. Steffen, S.D. Murphy, L. Tolleri, G.W. Hatfield, R.H. Lathrop

ABSTRACT:
Motivation: Direct recognition, or
direct readout, of DNA bases by a DNA-binding protein involves
amino acids that interact directly with features specific to each
base. Experimental evidence also shows that in many cases the
protein achieves partial sequence specificity by indirect
recognition, i.e., by recognizing structural properties of the
DNA. (1) Could threading a DNA sequence onto a crystal structure
of bound DNA help explain the indirect recognition component of
sequence specificity? (2) Might the resulting pure-structure
computational motif manifest itself in familiar sequence-based
computational motifs?
Results: The starting structure motif was a crystal structure of
DNA bound to the integration host factor protein (IHF) of
E. coli. IHF is known to exhibit both direct and indirect
recognition of its binding sites. (1) Threading DNA sequences
onto the crystal structure showed statistically significant
partial separation of 60 IHF binding sites from random and
intragenic sequences and was positively correlated with binding
affinity. (2) The crystal structure was shown to be equivalent to
a linear Markov network, and so, to a joint probability
distribution over sequences, computable in linear time. It was
transformed algorithmically into several common pure-sequence
representations, including (a) small sets of short exact strings,
(b) weight matrices, (c) consensus regular patterns, (d) multiple
sequence alignments, and (e) phylogenetic trees. In all cases the
pure-sequence motifs retained statistically significant partial
separation of the IHF binding sites from random and intragenic
sequences. Most exhibited positive correlation with binding
affinity. The multiple alignment showed some conserved columns,
and the phylogenetic tree partially mixed low-energy sequences
with IHF binding sites but separated high-energy sequences. The
conclusion is that deformation energy explains part of indirect
recognition, which explains part of IHF sequence-specific binding.
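
The abstract's observation that a linear Markov network defines a joint probability distribution over sequences computable in linear time can be illustrated with a first-order chain. The parameters below are invented for illustration, not derived from the IHF crystal structure.

```python
import math

# A first-order Markov chain over DNA bases: the joint probability
# of a sequence factorizes as P(s1) * prod_i P(s_i | s_{i-1}), so
# its log probability is a single linear-time pass over the sequence.

initial = {b: 0.25 for b in "ACGT"}
trans = {  # toy transition matrix; each row sums to 1
    "A": {"A": 0.4, "C": 0.2, "G": 0.2, "T": 0.2},
    "C": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "G": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "T": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}

def log_prob(seq):
    """Log joint probability of seq under the chain, in O(len(seq))."""
    lp = math.log(initial[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        lp += math.log(trans[prev][cur])
    return lp
```

The same factorization is what makes the transformation into weight matrices and other pure-sequence representations tractable.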

AVAILABILITY:
Code and data on request.

CONTACT:
Nick Steffen for code:
nsteffen@uci.edu
Lorenzo Tolleri for data:
Tolleri@chiron.it

PRESENTATION 5

TITLE:
Beyond tandem repeats: complex pattern structures and distant
regions of similarity

AUTHORS:
A.M. Hauth, D.A. Joseph

ABSTRACT:
Motivation: Tandem repeats (TRs) are
associated with human disease, play a role in evolution and are
important in regulatory processes. Despite their importance,
locating and characterizing these patterns within anonymous DNA
sequences remains a challenge. In part, the difficulty is due to
imperfect conservation of patterns and complex pattern
structures. We study recognition algorithms for two complex
pattern structures: variable length tandem repeats (VLTRs) and
multi-period tandem repeats (MPTRs).
Results: We extend previous algorithmic research to a class of
regular tandem repeats (RegTRs). We formally define RegTRs, as
well as two important subclasses: VLTRs and MPTRs. We present
algorithms for identification of TRs in these classes.
Furthermore, our algorithms identify degenerate VLTRs and MPTRs:
repeats containing substitutions, insertions and deletions. To
illustrate our work, we present results of our analysis of two
difficult regions in cattle and human sequences, which reflect
practical occurrences of these subclasses in GenBank sequence data. In
addition, we show the applicability of our algorithmic techniques
for identifying Alu sequences, gene clusters and other distant
regions of similarity. We illustrate this with an example from
yeast chromosome I.

AVAILABILITY:
Algorithms can be accessed at:
http://www.cs.wisc.edu/areas/theory

CONTACT:
Amy M. Hauth
kryder@cs.wisc.edu
608-831-2164
Deborah A. Joseph
joseph@cs.wisc.edu
608-262-8022

FAX: 608-262-9777

PRESENTATION 6

TITLE:
The POPPs: clustering and searching using peptide probability
profiles

AUTHORS:
M.J. Wise

ABSTRACT:
The POPPs is a suite of inter-related
software tools that allow the user to discover what is
statistically "unusual" in the composition of an unknown protein,
or to automatically cluster proteins into families based on
peptide composition. Finally, the user can search for related
proteins based on peptide composition. Statistically based
peptide composition provides a view of proteins that is, to some
extent, orthogonal to that provided by sequence. In a test study,
the POPP suite is able to regroup into their families sets of
approximately 100 randomised Pfam protein domains.
The POPPs
suite is used to explore the diverse set of late embryogenesis
abundant (LEA) proteins.

AVAILABILITY:
Contact the author.

CONTACT:
mw263@cam.ac.uk

PRESENTATION 7

TITLE:
Combining bioinformatics and biophysics to understand protein
structure and function

AUTHORS:
Barry Honig

ABSTRACT:
The increasing number of proteins whose
three-dimensional structures have been determined will have a major
impact on our ability to exploit genomic data. We have been
developing a series of computational tools with the goal of
detecting relationships among amino acid sequence, protein
structure and protein function. In this context, recent
computational advances in using structure to improve sequence
alignments, in homology model building and in the calculation of
binding affinities will be summarized, as will their combined use
toward an understanding of the principles of protein-protein
recognition.
Our basic approach involves calculating
protein folding and binding free energies as well as contributions
of individual amino acids to these free energies, and correlating
these energetic contributions with sequence patterns and with
physical and chemical properties of the protein. The factors that
determine binding free energies will be discussed with particular
emphasis on understanding how protein interfaces are designed, in
a structural sense, to exploit different combinations of
electrostatic and hydrophobic interactions to achieve both
affinity and specificity.
An important feature in any attempt to
compare the properties of different proteins is to obtain an
accurate sequence alignment and, when possible, an accurate
structure alignment as well. We have developed a novel hybrid
approach that uses both sequence and structural alignments to
generate high quality sequence profiles for protein families.
Application of such alignments, when combined with calculations of
binding affinities, yields a new method for predicting binding
specificity from sequence alone. An application to the peptide
binding specificity of SH2 domains will be described.
With regard to structure prediction, we
have made a number of advances that we believe will significantly
improve the accuracy of homology modeling. Specifically, we have
succeeded in predicting the conformation of buried side chains to
close to experimental accuracy, in obtaining extremely high loop
prediction accuracies and in refining modeled structures so as to
more closely resemble the correct structure than the original
template. In addition, our new alignment procedure and our
calculations of folding free energies have led to the development
of a new and accurate procedure to evaluate homology models. These
and other developments have been incorporated into new homology
modeling software that will be described.

PRESENTATION 8

TITLE:
A sequence-profile-based HMM for predicting and discriminating
beta barrel membrane proteins

AUTHORS:
P.L. Martelli, P. Fariselli, A. Krogh, R. Casadio

ABSTRACT:
Motivation:
Membrane proteins are an abundant and functionally relevant subset
of proteins, putatively comprising about 15 to 30% of the
proteomes of fully sequenced organisms. These estimates are mainly
computed on the basis of sequence comparison and membrane protein
prediction. It is therefore urgent to develop methods capable of
selecting membrane proteins, especially outer membrane proteins,
which are barely taken into consideration in proteome-wide
analyses. This will also help protein annotation
when no homologous sequence is found in the database. Outer
membrane proteins solved so far at atomic resolution interact with
the external membrane of bacteria with a characteristic beta barrel
structure comprising different even numbers of beta strands (beta barrel
membrane proteins). In this they differ from the membrane proteins
of the cytoplasmic membrane endowed with alpha helix bundles (all
alpha membrane proteins) and need specialised predictors.
Results:
We develop a hidden Markov model (HMM) that predicts the topology
of beta barrel membrane proteins, using evolutionary information
as input. The
model is cyclic with 6 types of states: two for the beta strand transmembrane core, one for the
beta strand cap on either side of the
membrane, one for the inner loop, one for the outer loop and one
for the globular domain state in the middle of each loop. The
development of a specific input for HMM based on multiple sequence
alignment is novel. The accuracy per residue of the model is 82%
when a jack-knife procedure is adopted. With a model optimisation
method using a dynamic programming algorithm, the topological
models of seven of the twelve proteins in the testing set are
also correctly predicted. When used as a discriminator, the model
is rather selective. At a fixed probability value, it retains 84%
of a non-redundant set comprising 145 sequences of well-annotated
outer membrane proteins. Concomitantly, it correctly rejects 90%
of a set of globular proteins including about 1200 chains with low
sequence identity (< 30%) and 90% of a set of all alpha membrane
proteins, including 188 chains.

AVAILABILITY:
The program will be available on request from the authors.

CONTACT:
gigi@lipid.biocomp.unibo.it
www.biocomp.unibo.it

PRESENTATION 9

TITLE:
Fully automated ab initio protein structure prediction using
I-SITES, HMMSTR and ROSETTA

AUTHORS:
C. Bystroff and Y. Shao

ABSTRACT:
Motivation:
The Monte Carlo fragment insertion method for protein tertiary
structure prediction (ROSETTA) of Baker and others, has been
merged with the I-SITES library of sequence structure motifs and
the HMMSTR model for local structure in proteins, to form a new
public server for the ab initio prediction of protein structure.
The server performs several tasks in addition to tertiary
structure prediction, including a database search, amino acid
profile generation, fragment structure prediction, and backbone
angle and secondary structure prediction. Meeting reasonable
service goals required improvements in efficiency, in
particular for the ROSETTA algorithm.
Results: The new server was
used for blind predictions of 40 protein sequences as part of the
CASP4 blind structure prediction experiment. The results for 31 of
those predictions are presented here. Overall, 61% of the residues
were found in topologically correct predictions, which are defined
as fragments of 30 residues or more with a root-mean-square
deviation in superimposed alpha carbons of less than 6 Å. HMMSTR
3-state secondary structure predictions were 73% correct overall.
Tertiary structure predictions did not improve the accuracy of
secondary structure prediction.

AVAILABILITY:
The server is accessible through the web at:
http://isites.bio.rpi.edu/hmmstr/index.html
Programs are available upon request for academics.
Licensing agreements are available for commercial interests.
Supplementary information:
http://isites.bio.rpi.edu
http://predictioncenter.llnl.gov/casp4/

CONTACT:
bystrc@rpi.edu
shaoy@rpi.edu

PRESENTATION 10

TITLE:
Prediction of contact maps by GIOHMMs
and recurrent neural networks using lateral propagation from all
four cardinal corners

AUTHORS:
G. Pollastri, P. Baldi

ABSTRACT:
Motivation:
Accurate prediction of protein contact maps is an important step
in computational structural proteomics. Because contact maps
provide a translation and rotation invariant topological
representation of a protein, they can be used as a fundamental
intermediary step in protein structure prediction.
Results: We
develop a new set of flexible machine learning architectures for
the prediction of contact maps, as well as other information
processing and pattern recognition tasks. The architectures can be
viewed as recurrent neural network parameterizations of a class of
Bayesian networks we call generalized input-output HMMs. For the
specific case of contact maps, contextual information is
propagated laterally through four hidden planes, one for each
cardinal corner. We show that these architectures can be trained
from examples and yield contact map predictors that outperform
previously reported methods. While several extensions and
improvements are in progress, the current version can accurately
predict 60.5% of contacts at a distance cutoff of 8 Å and 45% of
distant contacts at 10 Å, for proteins of length up to 300.
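
For readers unfamiliar with the representation, a contact map at the 8 Å cutoff used above can be computed directly from coordinates. The alpha-carbon positions below are toy values, not real protein data.

```python
import math

# A contact map is the translation- and rotation-invariant
# representation described in the abstract: entry (i, j) is 1 when
# residues i and j lie within a distance cutoff (8 Angstrom here).

def contact_map(coords, cutoff=8.0):
    n = len(coords)
    return [[1 if math.dist(coords[i], coords[j]) <= cutoff else 0
             for j in range(n)] for i in range(n)]

# Toy alpha-carbon coordinates along a line:
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0),
          (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
cm = contact_map(coords)
```

Because the map depends only on pairwise distances, translating or rotating all coordinates leaves it unchanged, which is precisely why it serves as an intermediary for structure prediction.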

AVAILABILITY:
The contact map predictor will be made available through
http://promoter.ics.uci.edu/BRNN-PRED/
as part of an existing suite of proteomics predictors.

CONTACT:
gpollast@ics.uci.edu
pfbaldi@ics.uci.edu

PRESENTATION 11

TITLE:
Rate4Site: an algorithmic tool for the identification of
functional regions on proteins by surface mapping of evolutionary
determinants within their homologues

AUTHORS:
T. Pupko, R.E. Bell, I. Mayrose, F. Glaser, N. Ben-Tal

ABSTRACT:
Motivation: A
number of proteins of known three-dimensional (3D) structure but
as yet unknown function exist. In light of the recent progress
in structure determination methodology, this number is likely to
increase rapidly. A novel method is presented here: "Rate4Site",
which maps the rate of evolution among homologous proteins onto
the molecular surface of one of the homologues whose 3D-structure
is known. Functionally important regions correspond to surface
patches of slowly evolving residues.
Results: Rate4Site estimates
the rate of evolution of amino acid sites using the maximum
likelihood (ML) principle. The ML estimate of the rates considers
the topology and branch lengths of the phylogenetic tree, as well
as the underlying stochastic process. To demonstrate its potency,
we study the Src SH2 domain. Like previously established methods,
Rate4Site detected the SH2 peptide-binding groove. Interestingly,
it also detected inter-domain interactions between the SH2 domain
and the rest of the Src protein that other methods failed to
detect.

AVAILABILITY:
Rate4Site can be downloaded at:
http://ashtoret.tau.ac.il/
Supplementary information: a multiple sequence alignment of
homologous domains from the SH2 protein family, the corresponding
phylogenetic tree and additional examples are available at:
http://ashtoret.tau.ac.il/~rebell

CONTACT:
tal@ism.ac.jp
rebell@ashtoret.tau.ac.il
fabian@ashtoret.tau.ac.il
bental@ashtoret.tau.ac.il

PRESENTATION 12

TITLE:
Inferring sub-cellular localization through automated lexical
analysis

AUTHORS:
R. Nair and B. Rost

ABSTRACT:
Motivation:
The SWISS-PROT sequence database contains keywords of functional
annotations for many proteins. In contrast, information about
sub-cellular localization is available for only a few proteins.
Experts can often infer localization from keywords describing
protein function. We developed LOCkey, a fully automated method
for lexical analysis of SWISS-PROT keywords that assigns
sub-cellular localization. With the rapid growth in sequence data,
the biochemical characterisation of sequences has been falling
behind. Our method may be a useful tool for supplementing
functional information already automatically available.
Results: The
method reached a level of more than 82% accuracy in a full
cross-validation test. Due to a lack of functional annotations, we
could infer localization for less than half of all proteins in
SWISS-PROT. We applied LOCkey to annotate five entirely sequenced
proteomes, namely Saccharomyces cerevisiae (yeast), Caenorhabditis
elegans (worm), Drosophila melanogaster (fly), Arabidopsis
thaliana (plant) and a subset of all human proteins. LOCkey found
about 8000 new annotations of sub-cellular localization for these
eukaryotes.

AVAILABILITY:
Annotations of localization for eukaryotes at:
http://cubic.bioc.columbia.edu/services/LOCkey

CONTACT:
rost@columbia.edu

PRESENTATION 14

TITLE:
Support vector regression applied to the determination of the
developmental age of a Drosophila embryo from its segmentation
gene expression patterns

AUTHORS:
E. Myasnikova, A. Samsonova, M. Samsonova and J. Reinitz

ABSTRACT:
Motivation: In
this paper we address the problem of the determination of
developmental age of an embryo from its segmentation gene
expression patterns in Drosophila.
Results: By applying support
vector regression we have developed a fast method for automated
staging of an embryo on the basis of its gene expression pattern.
Support vector regression is a statistical method for creating
regression functions of arbitrary type from a set of training
data. The training set is composed of embryos for which the
precise developmental age was determined by measuring the degree
of membrane invagination. Testing the quality of regression on
the training set showed good prediction accuracy. The optimal
regression function was then used to predict the
gene-expression-based age of embryos whose precise age had not
been measured by membrane morphology. Moreover, we show that the
same accuracy of prediction can be achieved when the
dimensionality of the feature vector is reduced by applying
factor analysis. The data reduction allowed us to avoid
over-fitting and to increase the efficiency of the algorithm.

AVAILABILITY:
This software may be obtained from the authors.

CONTACT:
samson@fn.csa.ru

PRESENTATION 15

TITLE:
Variance stabilization applied to microarray data calibration and
to the quantification of differential expression

AUTHORS:
W. Huber, A. von Heydebreck, H. Sueltmann, A. Poustka and M. Vingron

ABSTRACT:
We introduce a
statistical model for microarray gene expression data that
comprises data calibration, the quantification of differential
expression, and the quantification of measurement error. In
particular, we derive a transformation h for intensity
measurements, and a difference statistic Δh whose variance is
approximately constant along the whole intensity range. This
forms a basis for statistical inference from microarray data, and
provides a rational data pre-processing strategy for multivariate
analyses. For the transformation h, the parametric form h(x) =
arsinh(a+bx) is derived from a model of the variance-versus-mean
dependence for microarray intensity data, using the method of
variance-stabilizing transformations. For large intensities, h
coincides with the logarithmic transformation, and Δh with the
log-ratio. The parameters of h, together with those of the
calibration between experiments, are estimated with a robust
variant of maximum-likelihood estimation.
We demonstrate our
approach on data sets from different experimental platforms,
including two-color cDNA arrays and a series of Affymetrix
oligonucleotide arrays.
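
The large-intensity behaviour of the transformation h(x) = arsinh(a+bx) can be checked numerically. The parameters a and b below are toy values; in the method they are estimated from the data.

```python
import math

# h(x) = arsinh(a + b*x) from the abstract, with toy parameters.
a, b = 1.0, 0.5

def h(x):
    return math.asinh(a + b * x)

# For large x, arsinh(y) = log(y + sqrt(y*y + 1)) ~ log(2y), so h
# coincides with log(x) up to the constant log(2b) -- which happens
# to be 0 for these toy values:
big = [1e4, 1e5, 1e6]
diffs = [h(x) - math.log(x) for x in big]
```

Consequently the difference statistic Δh between two large intensities approaches the log-ratio, consistent with the statement in the abstract, while remaining well-behaved near zero where the log transform is not.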

AVAILABILITY:
Software is freely available for academic use as an R package at:
http://www.dkfz.de/abt0840/whuber

CONTACT:
w.huber@dkfz.de

PRESENTATION 16

TITLE:
A variance-stabilizing transformation for gene-expression
microarray data

AUTHORS:
B.P. Durbin, J.S. Hardin, D.M. Hawkins, D.M. Rocke

ABSTRACT:
Motivation:
Standard statistical techniques often assume that data are
normally distributed, with constant variance not depending on the
mean of the data. Data that violate these assumptions can often
be brought in line with the assumptions by application of a
transformation. Gene-expression microarray data have a
complicated error structure, with a variance that changes with the
mean in a non-linear fashion. Log transformations, which are
often applied to microarray data, can inflate the variance of
observations near background.
Results: We introduce a
transformation that stabilizes the variance of microarray data
across the full range of expression. Simulation studies also
suggest that this transformation approximately symmetrizes
microarray data.

AVAILABILITY:

CONTACT:
bpdurbin@wald.ucdavis.edu

PRESENTATION 17

TITLE:
Binary tree-structured vector quantization approach to clustering
and visualizing microarray data

AUTHORS:
M. Sultan, D. Wigle, C.A. Cumbaa, M. Maziarz, J. Glasgow,
M.S. Tsao and I. Jurisica

ABSTRACT:
Motivation:
With the increasing number of gene expression databases, the need
for more powerful analysis and visualization tools is growing.
Many techniques have successfully been applied to unravel latent
similarities among genes and/or experiments. Most of the current
systems for microarray data analysis use statistical methods,
hierarchical clustering, self-organizing maps, support vector
machines, or k-means clustering to organize genes or experiments
into "meaningful" groups. Without prior explicit bias, almost all
of these clustering methods, when applied to gene expression data,
not only produce different results but may also produce clusters
with little or no biological relevance. Of these methods, agglomerative
hierarchical clustering has been the most widely applied, although
many limitations have been identified.
Results: Starting with a
systematic comparison of the underlying theories behind clustering
approaches, we have devised a technique that combines
tree-structured vector quantization and partitive k-means
clustering (BTSVQ). This hybrid technique has revealed clinically
relevant clusters in three large publicly available data sets. In
contrast to existing systems, our approach is less sensitive to
data preprocessing and data normalization. In addition, the
clustering results produced by the technique have strong
similarities to those of self-organizing maps (SOMs). We discuss
the advantages and the mathematical reasoning behind our approach.

AVAILABILITY:
The BTSVQ system is implemented in Matlab R12 using the SOM
toolbox for the visualization and preprocessing of the data:
http://www.cis.hut.fi/projects/somtoolbox/
BTSVQ is available for non-commercial use:
http://www.uhnres.utoronto.ca/ta3/BTSVQ

CONTACT:
ij@uhnres.utoronto.ca

PRESENTATION 18

TITLE:
Linking gene expression data with patient survival times using
partial least squares

AUTHORS:
P.J. Park, L. Tian, I.S. Kohane

ABSTRACT:
There is an
increasing need to link the large amount of genotypic data,
gathered using microarrays for example, with various phenotypic
data from patients. The classification problem in which gene
expression data serve as predictors and a class label phenotype as
the binary outcome variable has been examined extensively, but
there has been less emphasis on dealing with other types of
phenotypic data. In particular, patient survival times with
censoring are often not used directly as a response variable due
to the complications that arise from censoring. We show that the
issues involving censored data can be circumvented by
reformulating the problem as a standard Poisson regression
problem. The procedure for solving the transformed problem is a
combination of two approaches: partial least squares, a regression
technique that is especially effective when there is severe
collinearity due to a large number of predictors, and generalized
linear regression, which extends standard linear regression to
deal with various types of response variables. The linear
combinations of the original variables identified by the method
are highly correlated with the patient survival times and at the
same time account for the variability in the covariates. The
algorithm is fast, as it does not involve any matrix
decompositions in the iterations.
We apply our method to data
sets from lung carcinoma and diffuse large B-cell lymphoma studies
to verify its effectiveness.
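
The reformulation at the heart of the method rests on a standard equivalence between censored survival likelihoods and Poisson likelihoods, sketched here with toy numbers under an exponential (constant-hazard) model; the abstract's full procedure (partial least squares combined with generalized linear regression) is not reproduced.

```python
import math

# Under an exponential model with hazard lam, a possibly censored
# observation (time t, event indicator delta) contributes
#   delta*log(lam) - lam*t
# to the log-likelihood. Treating delta as a Poisson response with
# mean mu = lam*t contributes delta*log(lam*t) - lam*t - log(delta!),
# which differs only by delta*log(t) - log(delta!) -- terms free of
# lam. Both are therefore maximized by the same lam, with log(t)
# entering the Poisson model as an offset.

def surv_loglik(lam, data):
    return sum(d * math.log(lam) - lam * t for t, d in data)

def pois_loglik(lam, data):
    return sum(d * math.log(lam * t) - lam * t - math.lgamma(d + 1)
               for t, d in data)

data = [(2.0, 1), (5.0, 0), (1.5, 1)]  # toy (time, event) pairs
gaps = [pois_loglik(lam, data) - surv_loglik(lam, data)
        for lam in (0.1, 0.3, 0.9)]
```

The gap between the two log-likelihoods is constant in the hazard parameter, which is what lets censored data be handled by off-the-shelf Poisson regression machinery.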

AVAILABILITY:

CONTACT:
peter_park@harvard.edu

PRESENTATION 19

TITLE:
Gene trees, genome trees and the universal tree

AUTHORS:
Ford Doolittle

ABSTRACT:
For several
decades, molecular phylogeneticists laboured under the illusion
that the topology of an accurately constructed tree for a single
gene, that encoding small-subunit ribosomal RNA (SSU rRNA), could
be taken as the topology of the "universal tree" relating all
organisms, back to a last common ancestor that might have lived
more than 3.5 billion years ago. In this they may have been naive,
since we now know that many protein-coding genes give trees
different from the SSU rRNA tree, or from each other. Sometimes
this is an artifact, but often it reflects genuinely different
histories, the
consequence of frequent lateral gene transfer. There may be a
small core of genes that have never been exchanged and thus show
the true organismal history, but this is very hard to prove. I
will review the practical and theoretical implications of this,
for the genomes of prokaryotes and simple eukaryotes.

PRESENTATION 20

TITLE:
Microarray synthesis through multiple-use PCR primer design

AUTHORS:
R.J. Fernandes, S.S. Skiena

ABSTRACT:
A substantial
percentage of the expense in constructing full-genome spotted
microarrays comes from the cost of synthesizing the PCR primers to
amplify the desired DNA. We propose a computational
method to substantially reduce this cost. Historically, PCR
primers are designed so that each primer occurs uniquely in the
genome. This condition is unnecessarily strong for selective
amplification, since only the primer pair associated with each
amplification need be unique.
We demonstrate that careful design
in a genome-level amplification project permits us to save the
cost of several thousand primers over conventional approaches.

AVAILABILITY:

CONTACT:
skiena@cs.sunysb.edu
rohan@cs.sunysb.edu

PRESENTATION 21

TITLE:
Discovering statistically significant biclusters in gene
expression data

AUTHORS:
A. Tanay, R. Sharan and R. Shamir

ABSTRACT:
In gene expression
data, a bicluster is a subset of the genes exhibiting consistent
patterns over a subset of the conditions. We propose a new method
to detect significant biclusters in large expression datasets. Our
approach is graph-theoretic, coupled with statistical modeling of
the data. Under plausible assumptions, our algorithm is
polynomial and is guaranteed to find the most significant biclusters.
We tested our method on a collection of yeast
expression profiles and on a human cancer dataset. Cross
validation results show high specificity in assigning function to
genes based on their biclusters, and we are able to annotate in
this way 196 uncharacterized yeast genes. We also demonstrate how
the biclusters lead to detecting new concrete biological
associations. In cancer data we are able to detect and relate
finer tissue types than was previously possible. We also show
that the method outperforms the biclustering algorithm of Cheng
and Church (2000).

AVAILABILITY:
www.cs.tau.ac.il/~rshamir/biclust.html

CONTACT:
amos@tau.ac.il
roded@tau.ac.il
rshamir@tau.ac.il

PRESENTATION 22

TITLE:
Co-clustering of biological networks and gene expression data

AUTHORS:
D. Hanisch, A. Zien, R. Zimmer, T. Lengauer

ABSTRACT:
Motivation:
Large scale gene expression data are often analyzed by clustering
genes based on gene expression data alone, though a priori
knowledge in the form of biological networks is available. The use
of this additional information promises to improve exploratory
analysis considerably.
Results: We propose to construct a distance
function which combines information from expression data and
biological networks. Based on this function, we compute a joint
clustering of genes and vertices of the network. This general
approach is elaborated for metabolic networks. We define a graph
distance function on such networks and combine it with a
correlation-based distance function for gene expression
measurements. A hierarchical clustering and an associated
statistical measure is computed to arrive at a reasonable number
of clusters. Our method is validated using expression data of the
yeast diauxic shift. The resulting clusters are easily
interpretable in terms of the biochemical network and the gene
expression data and suggest that our method is able to
automatically identify processes that are relevant under the
measured conditions. |
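The combined distance function described above can be sketched as a convex combination of a graph distance on the metabolic network and a correlation-based distance on expression profiles. The weight `alpha` and the helper names are illustrative assumptions, not the authors' exact formulation:

```python
def correlation_distance(x, y):
    """1 - Pearson correlation of two expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return 1.0 - cov / (vx * vy)

def combined_distance(expr_a, expr_b, graph_dist, alpha=0.5):
    """Weighted mix of network distance and expression distance."""
    return alpha * graph_dist + (1 - alpha) * correlation_distance(expr_a, expr_b)
```

Hierarchical clustering on such a distance then groups genes that are close both in the network and in expression.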
|
AVAILABILITY: |
|
CONTACT: |
Daniel.Hanisch@scai.fhg.de
|
|
PRESENTATION 23 |
|
TITLE: |
Statistical process control for large scale microarray experiments |
AUTHORS: |
F. Model, T. Konig, C. Piepenbrock, P. Adorjan |
ABSTRACT: |
Motivation:
Maintaining and controlling data quality is a key problem in large
scale microarray studies. Especially systematic changes in
experimental conditions across multiple chips can seriously affect
quality and even lead to false biological conclusions.
Traditionally, the influence of these effects can only be minimized
by expensive repeated measurements, because a detailed
understanding of all process relevant parameters seems
impossible.
Results: We introduce a novel method for microarray
process control that estimates quality solely based on the
distribution of the actual measurements without requiring repeated
experiments. A robust version of principal component analysis
detects single outlier microarrays and thereby enables the
use of techniques from multivariate statistical process control.
In particular, the T^2 control chart reliably tracks undesired
changes in process-relevant parameters. This can be used to
improve the microarray process itself, limit necessary
repetitions to affected samples, and thereby maintain
quality in a cost-effective way. We demonstrate the power of the
approach on 3 large sets of DNA methylation microarray data. |
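The control-chart idea can be illustrated as follows: after projecting each chip onto a few principal components, the Hotelling T^2 statistic of its score vector flags shifts in process-relevant parameters. This is a minimal numpy sketch that omits the robust-PCA step the abstract describes:

```python
import numpy as np

def hotelling_t2(scores):
    """Hotelling T^2 statistic per observation, given PCA score
    vectors as rows; large values indicate outlying chips."""
    centered = scores - scores.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    inv = np.linalg.inv(np.atleast_2d(cov))
    # quadratic form x_i^T S^{-1} x_i for every row i
    return np.einsum('ij,jk,ik->i', centered, inv, centered)
```

In practice the statistic would be compared against a control limit; chips exceeding it are candidates for repetition.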
|
AVAILABILITY: |
|
CONTACT: |
Fabian.Model@epigenomics.com |
|
PRESENTATION 24 |
|
TITLE: |
Evaluating
machine learning approaches for aiding probe selection for
gene-expression arrays |
AUTHORS: |
J.B. Tobler, M.N. Molla,
E.F.Nuwaysir, R.D.Green and J.W. Shavlik |
ABSTRACT: |
Motivation:
Microarrays are a fast and cost-effective method of performing
thousands of DNA hybridization experiments simultaneously. DNA
probes are typically used to measure the expression level of
specific genes. Because probes greatly vary in the quality of
their hybridizations, choosing good probes is a difficult task.
If one could accurately choose probes that are likely to hybridize
well, then fewer probes would be needed to represent each gene in
a gene-expression microarray, and, hence, more genes could be
placed on an array of a given physical size. Our goal is to
empirically evaluate how successfully three standard
machine-learning algorithms - naïve Bayes, decision trees, and
artificial neural networks - can be applied to the task of
predicting good probes. Fortunately it is relatively easy to get
training examples for such a learning task: place various probes
on a gene chip, add a sample where the corresponding genes are
highly expressed, and then record how well each probe measures the
presence of its corresponding gene. With such training examples,
it is possible that an accurate predictor of probe quality can be
learned.
Results: Two of the learning algorithms we investigate -
naïve Bayes and neural networks - learn to predict probe quality
surprisingly well. For example, in the top ten predicted probes
for a given gene not used for training, on average about five rank
in the top 2.5% of that gene's hundreds of possible probes.
Decision-tree induction and the simple approach of using predicted
melting temperature to rank probes perform significantly worse
than these two algorithms. The features we use to represent
probes are very easily computed and the time taken to score each
candidate probe after training is minor. Training the naïve Bayes
algorithm takes very little time, and while it takes over 10 times
as long to train a neural network, that time is still not very
substantial (on the order of a few hours on a desktop
workstation). We also report the information contained in the
features we use to describe the probes. We find the fraction of
cytosine in the probe to be the most informative feature. We also
find, not surprisingly, that the nucleotides in the middle of the
probe's sequence are more informative than those at the ends of the
sequence. |
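The most informative feature reported above, the fraction of cytosine in a probe, is trivial to compute; the sketch below also shows the kind of per-position encoding such learners consume. Both helpers are illustrative, not the authors' actual feature set:

```python
def cytosine_fraction(probe):
    """Fraction of 'C' nucleotides in the probe sequence."""
    probe = probe.upper()
    return probe.count('C') / len(probe)

def one_hot(probe):
    """Encode each nucleotide as four binary features (A, C, G, T)."""
    order = 'ACGT'
    return [[1 if base == nt else 0 for nt in order]
            for base in probe.upper()]
```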
|
AVAILABILITY: |
|
CONTACT: |
molla@cs.wisc.edu
|
PRESENTATION 25 |
|
ABSTRACT: |
A variety of measures have been proposed for assessing the
accuracy of sequence database search methods. One measure that has
gained wide use is the ROC score, derived from a graph of false
vs. true positives as the alignment cutoff score varies. An
interesting question is when the ROC scores of two different
methods can be said to differ significantly. Recent analytic
results concerning bootstrap resampling applied to ROC scores
provide one possible answer to this question.
We have used ROC analysis to assess a large number of possible
refinements of the original, 1997 version of PSI-BLAST. Several
modifications lead to significant or near-significant improvements
in program accuracy. The most important among these is the
incorporation of sequence-composition based statistics, which
substantially suppress the corruption of protein profiles by false
positive alignments. |
|
|
PRESENTATION 26 |
|
TITLE: |
The
degenerate primer design problem |
AUTHORS: |
C. Linhart and R. Shamir |
ABSTRACT: |
A PCR primer
sequence is called degenerate if some of its positions have
several possible bases. The degeneracy of the primer is the
number of unique sequence combinations it contains. We study the
problem of designing a pair of primers with prescribed degeneracy
that match a maximum number of given input sequences. Such
problems occur when studying a family of genes that is known only
in part, or is known in a related species. We prove that various
simplified versions of the problem are hard, show the
polynomiality of some restricted cases, and develop approximation
algorithms for one variant.
Based on these algorithms, we
implemented a program called HYDEN for designing highly-degenerate
primers for a set of genomic sequences. We report on the success
of the program in an experimental scheme for identifying all human
olfactory receptor (OR) genes. In that project, HYDEN was used to
design primers with degeneracies up to 10^10 that amplified with
high specificity many novel genes of that family, tripling the
number of OR genes known at the time. |
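The degeneracy defined above (the number of distinct sequences a degenerate primer encodes) is simply the product of the number of allowed bases at each position. A small sketch using the standard IUPAC ambiguity codes; this is an illustration, not part of HYDEN:

```python
# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    'A': 'A', 'C': 'C', 'G': 'G', 'T': 'T',
    'R': 'AG', 'Y': 'CT', 'S': 'CG', 'W': 'AT', 'K': 'GT', 'M': 'AC',
    'B': 'CGT', 'D': 'AGT', 'H': 'ACT', 'V': 'ACG', 'N': 'ACGT',
}

def degeneracy(primer):
    """Number of unique sequences the degenerate primer contains."""
    d = 1
    for code in primer.upper():
        d *= len(IUPAC[code])
    return d
```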
|
AVAILABILITY: |
Available on
request from the authors |
CONTACT: |
chaiml@tau.ac.il |
rshamir@tau.ac.il |
|
PRESENTATION 27 |
|
TITLE: |
Splicing
graphs and EST assembly problem |
AUTHORS: |
S. Heber, M. Alekseyev, S.-H. Sze, H. Tang and P.A. Pevzner |
ABSTRACT: |
Motivation:
The traditional approach to annotate alternative splicing is to
investigate every splicing variant of the gene in a case-by-case
fashion. This approach, while useful, has some serious
shortcomings. Recent studies indicate that alternative splicing
is more frequent than previously thought and some genes may
produce tens of thousands of different transcripts. A list of
alternatively spliced variants for such genes would be difficult
to build and hard to analyze. Moreover, such a list does not show
the relationships between different transcripts and does not show
the overall structure of all transcripts. A better approach would
be to represent all splicing variants for a given gene in a way
that captures the relationships between different splicing
variants.
Results: We introduce the notion of the splicing graph
that is a natural and convenient representation of all splicing
variants. The key difference with the existing approaches is that
we abandon the linear (sequence) representation of each transcript
and substitute it with a graph representation where each
transcript corresponds to a path in the graph. We further design
an algorithm to assemble EST reads into the splicing graph rather
than assembling them into each splicing variant in a case-by-case
fashion. |
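The graph representation described above can be sketched minimally: vertices are exons (or exon fragments), edges join consecutive exons observed in some transcript, and every transcript then corresponds to a path. The names below are illustrative, not the authors' implementation:

```python
def build_splicing_graph(transcripts):
    """transcripts: lists of exon labels. Returns an adjacency dict
    mapping each exon to the set of exons that may follow it."""
    graph = {}
    for exons in transcripts:
        for a, b in zip(exons, exons[1:]):
            graph.setdefault(a, set()).add(b)
            graph.setdefault(b, set())
        if exons:
            graph.setdefault(exons[0], set())
    return graph

def is_path(graph, exons):
    """Check that a transcript corresponds to a path in the graph."""
    return all(b in graph.get(a, set())
               for a, b in zip(exons, exons[1:]))
```

Two variants that share exons share vertices, so their relationship is visible in the graph rather than hidden in a flat list of sequences.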
|
AVAILABILITY: |
http://www-cse.ucsd.edu/groups/bioinformatics/software.html |
CONTACT: |
sheber@ucsd.edu |
|
PRESENTATION 28 |
|
TITLE: |
Exact genetic linkage computations for general pedigrees |
AUTHORS: |
M. Fishelson
and D. Geiger |
ABSTRACT: |
Motivation:
Genetic linkage analysis is a useful statistical tool for mapping
disease genes and for associating functionality of genes to their
location on the chromosome. There is a need for a program that
computes multipoint likelihood on general pedigrees with many
markers that also deals with two-locus disease models.
Results: In
this paper we present algorithms for performing exact multipoint
likelihood calculations on general pedigrees with a large number
of highly polymorphic markers, taking into account a variety of
disease models. We have implemented these algorithms in a new
computer program called SUPERLINK which outperforms leading
linkage software with regards to functionality, speed, memory
requirements and extensibility. |
|
AVAILABILITY: |
SUPERLINK is
available at:
http://bioinfo.cs.technion.ac.il/superlink |
|
CONTACT: |
fmaayan@cs.technion.ac.il |
dang@cs.technion.ac.il |
|
PRESENTATION 29 |
|
TITLE: |
Assigning
probes into a small number of pools separable by electrophoresis |
AUTHORS: |
T. Kivioja,
M. Arvas, K. Kataja, M. Penttila, H. Soderlund, E.
Ukkonen |
ABSTRACT: |
Motivation:
Measuring transcriptional expression levels (transcriptional
profiling) has become one of the most important methods in
functional genomics. Still, new measuring methods are needed to
obtain more reliable, quantitative data about transcription on a
genomic scale. In this paper we concentrate on certain
computational optimization problems arising in the design of one
such novel method. From a computational point of view the key
feature of the new method is that the hybridized probes are
distinguished from each other based on their different size.
Therefore the probes have to be assigned into pools such that the
probes in the same pool have unique sizes different enough from
each other. Identification of expressed RNA is given by probe pool
and probe size while quantification is given by the label of the
probe, e.g. fluorescence intensity.
Results: We show how to
computationally find the probes and assign them into pools for a
whole genome such that i) each gene has a specific probe suitable
for amplification and hybridization, and ii) the expression level
measurement can be done in a minimal number of pools separable by
electrophoresis in order to minimize the total experiment cost of
the measurement. Our main result is a polynomial-time
approximation algorithm for assigning the probes into pools. We
demonstrate the feasibility of the procedure by selecting probes
for the yeast genome and assigning them into less than 100 pools.
|
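The pooling constraint above can be illustrated with a simple greedy first-fit on sorted probe lengths: probes in the same pool must be separated by at least some minimum gap so that electrophoresis can resolve them. The paper's algorithm is a polynomial-time approximation; this sketch only conveys the constraint, with `min_gap` as an assumed parameter:

```python
def assign_pools(sizes, min_gap):
    """Greedily assign probe sizes to pools such that sizes within a
    pool differ by at least min_gap. Returns a list of pools."""
    pools = []
    for size in sorted(sizes):
        for pool in pools:
            if size - pool[-1] >= min_gap:  # pool is kept sorted
                pool.append(size)
                break
        else:
            pools.append([size])
    return pools
```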
|
AVAILABILITY: |
The probe
sequences and their assignment into pools are available for
academic research on request from the authors. |
|
CONTACT: |
Teemu.Kivioja@cs.Helsinki.FI |
|
PRESENTATION 30 |
|
TITLE: |
Representing genetic sequence data for pharmacogenomics: an
evolutionary approach using ontological and relational models |
AUTHORS: |
D.L. Rubin, F. Shafa,
D.E. Oliver, M. Hewett, T. Klein and R.
Altman |
ABSTRACT: |
Motivation:
The information model chosen to store biological data affects the
types of queries possible, database performance, and difficulty in
updating that information model. Genetic sequence data for
pharmacogenetics studies can be complex, and the best information
model to use may change over time. As experimental and analytical
methods change, and as biological knowledge advances, the data
storage requirements and types of queries needed may also change.
Results: We developed a model for genetic sequence and
polymorphism data, and used XML schema to specify the elements and
attributes required for this model. We implemented this model as
an ontology in a frame-based representation and as a relational
model in a database system. We collected genetic data from two pharmacogenetics resequencing studies, and formulated queries
useful for analyzing these data. We compared the ontology and
relational models in terms of query complexity, performance, and
difficulty in changing the information model.
Our results
demonstrate benefits of evolving the schema for storing pharmacogenetics data: ontologies perform well in early design
stages as the information model changes rapidly and simplify query
formulation, while relational models offer improved query speed
once the information model and types of queries needed stabilize. |
|
AVAILABILITY: |
Our ontology and
relational models are available at:
http://www.pharmgkb.org/data/schemas/ |
|
CONTACT:
|
rubin@smi.stanford.edu
russ.altman@stanford.edu |
help@pharmgkb.org |
|
PRESENTATION 32 |
|
TITLE: |
Evaluating
functional network inference using simulations of complex
biological systems |
AUTHORS: |
V.A. Smith, E.D. Jarvis and A.J. Hartemink |
ABSTRACT: |
Motivation:
Although many network inference algorithms have been presented in
the bioinformatics literature, no suitable approach has been
formulated for evaluating their effectiveness at recovering models
of complex biological systems from limited data. To overcome this
limitation, we propose an approach to evaluate network inference
algorithms according to their ability to recover a complex
functional network from biologically reasonable simulated data.
Results: We designed a simulator to generate data from a complex
biological system at multiple levels of organization: behavior,
neural anatomy, brain electrophysiology, and gene expression of
songbirds. About 90% of the simulated variables are unregulated
by other variables in the system and are included simply as
distracters. We sampled the simulated data at intervals as one
would sample from a biological system in practice, and then used
the sampled data to evaluate the effectiveness of an algorithm we
developed for functional network inference. We found that our
algorithm is highly effective at recovering the functional network
structure of the simulated system (including the irrelevance of
unregulated variables) from sampled data alone. To assess the
reproducibility of these results, we tested our inference
algorithm on 50 separately simulated sets of data and it
consistently recovered almost perfectly the complex functional
network structure underlying the simulated data. To our
knowledge, this is the first approach for evaluating the
effectiveness of functional network inference algorithms at
recovering models from limited data. Our simulation approach also
enables researchers to design, a priori, experiments and
data-collection protocols that are amenable to functional network
inference. |
|
AVAILABILITY: |
Source code and
simulated data are available upon request. |
CONTACT: |
asmith@neuro.duke.edu |
jarvis@neuro.duke.edu |
|
PRESENTATION 33 |
|
TITLE: |
The
Pathway
Tools software |
AUTHORS: |
P.D.
Karp, S. Paley, P. Romero |
ABSTRACT: |
Motivation:
Bioinformatics requires reusable software tools for creating
model-organism databases (MODs).
Results: The Pathway Tools is a
reusable, production-quality software environment for creating a
type of MOD called a Pathway/Genome Database (PGDB). A PGDB such
as EcoCyc (see ecocyc.org) integrates our evolving understanding
of the genes, proteins, metabolic network, and genetic network of
an organism. This paper provides an overview of the four main
components of the Pathway Tools: The PathoLogic component supports
creation of new PGDBs from the annotated genome of an organism.
The Pathway/Genome Navigator provides query, visualization, and
Web-publishing services for PGDBs. The Pathway/Genome Editors
support interactive updating of PGDBs. The Pathway Tools ontology
defines the schema of PGDBs. The Pathway Tools makes use of the
Ocelot object database system for data management services for
PGDBs. The Pathway Tools has been used to build PGDBs for 13
organisms within SRI and by external users. |
|
AVAILABILITY: |
The software is
freely available to academics and is available for a fee to
commercial institutions. Contact:
ptools-support@ai.sri.com (for information on obtaining
the software) |
|
CONTACT: |
pkarp@ai.sri.com
|
|
PRESENTATION 34 |
|
TITLE: |
Discovering
regulatory and signalling circuits in molecular interaction
networks |
AUTHORS: |
T. Ideker, O. Ozier, B. Schwikowski and A.F. Siegel |
ABSTRACT: |
Motivation: In
model organisms such as yeast, large databases of protein-protein
and protein-DNA interactions have become an extremely important
resource for the study of protein function, evolution, and gene
regulatory dynamics. In this paper we demonstrate that by
integrating these interactions with widely-available mRNA
expression data, it is possible to generate concrete hypotheses
for the underlying mechanisms governing the observed changes in
gene expression. To perform this integration systematically and at
large scale, we introduce an approach for screening a molecular
interaction network to identify active subnetworks, i.e.,
connected regions of the network that show significant changes in
expression over particular subsets of conditions. The method we
present here combines a rigorous statistical measure for scoring
subnetworks with a search algorithm for identifying subnetworks
with high score.
Results: We evaluated our procedure on a small
interaction network of 332 genes and a large network of 4160 genes
containing all 7462 protein-protein and protein-DNA interactions
in the yeast public databases. In the case of the small network,
we identified five significant subnetworks that covered 41 out of
77 (53%) of all significant changes in expression. Both network
analyses returned several top-scoring subnetworks with good
correspondence to known regulatory mechanisms in the literature.
These results demonstrate how large-scale genomic approaches may
be used to uncover signaling and regulatory pathways in a
systematic, integrative fashion. |
|
AVAILABILITY: |
The methods
presented in this paper are implemented in the Cytoscape software
package which is available to the academic community at
www.cytoscape.org |
|
CONTACT: |
trey@wi.mit.edu |
|
PRESENTATION 35 |
|
TITLE: |
Modelling
regulatory pathways in E.coli from time series expression profiles |
AUTHORS: |
I.M. Ong, J.D. Glasner
and D.
Page |
ABSTRACT: |
Motivation:
Cells continuously reprogram their gene expression network as they
move through the cell cycle or sense changes in their
environment. In order to understand the regulation of cells, time
series expression profiles provide a more complete picture than
single time point expression profiles. Few analysis techniques,
however, are well suited to modeling such time series
data.
Results: We describe an approach that naturally handles time
series data with the capabilities of modeling causality, feedback
loops, and environmental or hidden variables using a Dynamic
Bayesian network. We also present a novel way of combining prior
biological knowledge and current observations to improve the
quality of analysis and to model interactions between sets of
genes rather than individual genes. Our approach is evaluated on
time series expression data measured in response to physiological
changes that affect tryptophan metabolism in E. coli. Results
indicate that this approach is capable of finding correlations
between sets of related genes. |
|
AVAILABILITY: |
|
CONTACT: |
ong@cs.wisc.edu
|
|
PRESENTATION 36 |
|
TITLE: |
Of
truth
and pathways: chasing bits of information through myriads of
articles |
AUTHORS: |
M.
Krauthammer, P. Kra, I. Iossifov, S.M.Gomez, G. Hripcsak,
V.hatzivassiloglou, C. Friedman and A. Rzhetsky |
ABSTRACT: |
Knowledge on
interactions between molecules in living cells is indispensable
for theoretical analysis and practical applications in modern
genomics and molecular biology. Building such networks relies on
the assumption that the correct molecular interactions are known
or can be identified by reading a few research articles. However,
this assumption does not necessarily hold, as truth is rather an
emerging property based on many potentially conflicting facts.
This paper explores the processes of knowledge generation and
publishing in the molecular biology literature using modeling and
analysis of real molecular interaction data. The data analyzed in
this article was automatically extracted from 50,000 research
articles in molecular biology using a computer system called
GeneWays containing a natural language processing module.
The
paper indicates that truthfulness of statements is associated in
the minds of scientists with the relative importance
(connectedness) of substances under study, revealing a potential
selection bias in the reporting of research results. Aiming at
understanding the statistical properties of the life cycle of
biological facts reported in research articles, we formulate a
stochastic model describing generation and propagation of
knowledge about molecular interactions through scientific
publications. We hope that in the future such a model can be
useful for automatically producing consensus views of molecular
interaction data. |
|
AVAILABILITY: |
|
CONTACT: |
ar345@columbia.edu |
|
PRESENTATION 37 |
|
TITLE: |
Minreg: Inferring an active regulator
set |
AUTHORS: |
D. Pe'er, A. Regev
and A. Tanay |
ABSTRACT: |
Regulatory
relations between genes are an important component of molecular
pathways. Here, we devise a novel global method that uses a set of
gene expression profiles to find a small set of relevant active
regulators, identifies the genes that they regulate, and
automatically annotates them.
We show that our algorithm is
capable of handling a large number of genes in a short time and is
robust to a wide range of parameters. We apply our method to a
combined dataset of S. cerevisiae expression profiles, and
validate the resulting model of regulation by cross-validation and
extensive biological analysis of the selected regulators and their
derived annotations. |
|
AVAILABILITY: |
|
CONTACT: |
|
|
PRESENTATION 38 |
|
TITLE: |
Bioinformatic
analysis of a morphogenetic field in Drosophila |
AUTHORS: |
John Reinitz |
ABSTRACT: |
Bioinformatics and
functional genomics will ultimately involve the application of
genomic and informatics methods to the full range of biological
functions, including those that are properties of multicellular
organisms. This includes areas such as neurobiology, development,
macroevolution, and ecology. This talk will be concerned with the
bioinformatics of animal development. The central problem in
animal development is the generation of body form. This problem
was first considered by Aristotle, and in the nineteenth century
it was shown that basic body form is determined by interactions
among cells in a morphogenetic field. The determination of a
morphogenetic field in development involves the expression of
genes in spatial patterns. Spatially controlled gene expression
cannot as yet be assayed in microarrays, but certain special
properties of the fruit fly Drosophila which make it a
premier system for developmental genetics also enable it to be
used as a naturally grown differential display system for reverse
engineering networks of genes. In this system we can approach
fundamental scientific questions about development and
transcription as well as certain computational questions that
arise in the analysis of genomic level gene expression data.
Our approach is called the "gene circuit method", and it
consists of 4 components: (1) The formulation of a theoretical
model for gene regulation. (2) The acquisition of gene expression
data using fluorescently tagged antibodies. (3) The determination
of the values of parameters in the model, or the demonstration
that no such values exist, by numerical fits to data. The results
of (1), (2), and (3) are used (4) to validate the model by
comparison to the existing experimental data and by making further
predictions. Recent progress in all 4 of these areas will be
discussed. |
|
PRESENTATION 39 |
|
TITLE: |
Marginalized kernels for biological sequences |
AUTHORS: |
K. Tsuda,
T. Kin, K. Asai |
ABSTRACT: |
Motivation:
Kernel methods such as support vector machines require a kernel
function between objects to be defined a priori. Several works
have been done to derive kernels from probability distributions,
e.g. the Fisher kernel. However, a general methodology to design
a kernel is not fully developed.
Results: We propose a reasonable
way of designing a kernel when objects are generated from latent
variable models (e.g. HMM). First of all, a joint kernel is
designed for complete data which include both visible and hidden
variables. Then a marginalized kernel for visible data is
obtained by taking the expectation with respect to hidden
variables. We will show that the Fisher kernel is a special case
of marginalized kernels, which gives another viewpoint to the
Fisher kernel theory. Although our approach can be applied to any
object, we particularly derive several marginalized kernels useful
for biological sequences (e.g. DNA and proteins). The effectiveness
of marginalized kernels is illustrated in the task of classifying
bacterial gyrase subunit B (gyrB) amino acid sequences. |
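The marginalization step described above can be sketched in miniature: given posteriors p(h|x) over hidden states, the marginalized kernel is the posterior-weighted expectation of a joint kernel on complete data. A toy joint kernel (agreement of hidden states across positions) stands in here for the paper's count kernels; this is purely illustrative:

```python
def marginalized_kernel(post_x, post_y):
    """post_x[i][h]: posterior probability of hidden state h at
    position i of sequence x. Toy joint kernel: expected number of
    position pairs whose hidden states coincide."""
    k = 0.0
    for px in post_x:
        for py in post_y:
            k += sum(px[h] * py[h] for h in range(len(px)))
    return k
```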
|
AVAILABILITY: |
|
CONTACT: |
koji.tsuda@aist.go.jp |
|
PRESENTATION 40 |
|
TITLE: |
A tree
kernel to analyze phylogenetic profiles |
AUTHORS: |
J.-P. Vert |
ABSTRACT: |
Motivation:
The phylogenetic profile of a protein is a string that encodes the
presence or absence of the protein in every fully sequenced
genome. Because proteins that participate in a common structural
complex or metabolic pathway are likely to evolve in a correlated
fashion, the phylogenetic profiles of such proteins are often
"similar" or at least "related" to each other. The question we
address in this paper is the following: how to measure the
"similarity" between two profiles, in an evolutionarily relevant
way, in order to develop efficient function prediction
methods?
Results: We show how the profiles can be mapped to a
high-dimensional vector space which incorporates evolutionarily
relevant information, and we provide an algorithm to compute
efficiently the inner product in that space, which we call the
tree kernel. The tree kernel can be used by any kernel-based
analysis method for classification or data mining of phylogenetic
profiles. As an application a Support Vector Machine (SVM)
trained to predict the functional class of a gene from its
phylogenetic profile is shown to perform better with the tree
kernel than with a naive kernel that does not include any
information about the phylogenetic relationships among species.
Moreover a kernel principal component analysis (KPCA) of the
phylogenetic profiles illustrates the sensitivity of the tree
kernel to evolutionarily relevant variations. |
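The "naive kernel" baseline mentioned above amounts to an inner product of the 0/1 profile strings, i.e. the number of genomes in which both proteins are present; the tree kernel enriches this with phylogenetic structure. Only the naive version is sketched here:

```python
def naive_kernel(profile_a, profile_b):
    """profiles: strings of '0'/'1', one character per fully
    sequenced genome. Counts genomes containing both proteins."""
    return sum(int(a) * int(b) for a, b in zip(profile_a, profile_b))
```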
|
AVAILABILITY: |
All data and
software used are freely and publicly available upon request. |
CONTACT: |
Jean-Philippe.Vert@mines.org |
|
PRESENTATION 41 |
|
TITLE: |
Statistically
based postprocessing of phylogenetic analysis by clustering |
AUTHORS: |
C. Stockham,
L.-S. Wang and T. Warnow |
ABSTRACT: |
Motivation:
Phylogenetic analyses often produce thousands of candidate trees.
Biologists resolve the conflict by computing the consensus of
these trees. Single-tree consensus as postprocessing methods can
be unsatisfactory due to their inherent limitations.
Results: In
this paper we present an alternative approach by using clustering
algorithms on the set of candidate trees. We propose bicriterion
problems, in particular using the concept of information loss, and
new consensus trees called characteristic trees that minimize the
information loss. Our empirical study using four biological
datasets shows that our approach provides a significant
improvement in the information content, while adding only a small
amount of complexity. Furthermore, the consensus trees we obtain
for each of our large clusters are more resolved than the
single-tree consensus trees. We also provide some initial
progress on theoretical questions that arise in this context. |
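Two of the building blocks named in the availability note below are easy to state on trees represented as sets of clades: the Robinson-Foulds distance counts splits present in one tree but not the other, and a strict consensus keeps only the clades shared by every tree. An illustrative sketch, not the PAUP implementation the authors used:

```python
def robinson_foulds(clades_a, clades_b):
    """Symmetric-difference (RF) distance between two clade sets."""
    return len(clades_a ^ clades_b)

def strict_consensus(clade_sets):
    """Clades present in every input tree."""
    consensus = set(clade_sets[0])
    for clades in clade_sets[1:]:
        consensus &= clades
    return consensus
```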
|
AVAILABILITY: |
Software available
upon request from the authors. The agglomerative clustering is
implemented using Matlab (MathWorks, 2000) with the Statistics
Toolbox. The Robinson-Foulds distance matrices and the strict
consensus trees are computed using PAUP (Swofford, 2001) and
Daniel Huson's tree library on Intel Pentium workstations running
Debian Linux.
Supplementary Information:
http://www.cs.utexas.edu/users/lisan/ismb02/ |
|
CONTACT: |
lisan@cs.utexas.edu |
|
Telephone: +1 (512) 232-7455 |
Fax: +1 (512) 471-8885 |
|
PRESENTATION 42 |
|
TITLE: |
Efficiently
detecting polymorphisms during the fragment assembly process |
AUTHORS: |
D. Fasulo, A. Halpern, I.
Dew and C. Mobarry |
ABSTRACT: |
Motivation:
Current genomic sequence assemblers assume that the input data is
derived from a single, homogeneous source. However, recent
whole-genome shotgun sequencing projects have violated this
assumption, resulting in input fragments covering the same region
of the genome whose sequences differ due to polymorphic variation
in the population. While single-nucleotide polymorphisms (SNPs)
do not pose a significant problem to state-of-the-art assembly
methods, these methods do not handle insertion/deletion (indel)
polymorphisms of more than a few bases.
Results: This paper
describes an efficient method for detecting sequence discrepancies
due to polymorphism that avoids resorting to global use of more
costly, less stringent affine sequence alignments. Instead, the
algorithm uses graph-based methods to determine the small set of
fragments involved in each polymorphism and performs more
sophisticated alignments only among fragments in that set.
Results from the incorporation of this method into the Celera
Assembler are reported for the D. melanogaster, H. sapiens, and M.
musculus genomes. |
|
AVAILABILITY: |
The method
described herein does not constitute a stand-alone software
application, but is laid out in sufficient detail to be
implemented as a component of any genomic sequence assembler. |
|
CONTACT: |
daniel.fasulo@celera.com |
|
PRESENTATION 43 |
|
TITLE: |
Multiple
genome rearrangement: a general approach via the evolutionary genome
graph |
AUTHORS: |
D. Korkin and L. Goldfarb |
ABSTRACT: |
Motivation: In
spite of the well-known fact that genome rearrangements are supposed
to be viewed in the light of the evolutionary relationships within
and between the species involved, no formal underlying framework
based on the evolutionary considerations for treating the
questions arising in the area has been proposed. If such an
underlying framework is provided, all the basic questions in the
area can be posed in a biologically more appropriate and useful
form: e.g., the similarity between two genomes can then be
computed via the nearest ancestor, rather than "directly",
ignoring the evolutionary connections.
Results: We outline an
evolution-based general framework for answering questions related
to the multiple genome rearrangement. In the proposed model, the
evolutionary genome graph (EG-graph) encapsulates an evolutionary
history of a genome family. For a set of all EG-graphs, we
introduce a family of similarity measures each defined via a set
of genome transformations associated with a particular EG-graph.
Given a set of genomes and restricting ourselves to the
transpositions, an algorithm for constructing an EG-graph is
presented. We also present the experimental results in the form of
an EG-graph for a set of concrete genomes (for several species).
This EG-graph turns out to be very close to the corresponding
known phylogenetic tree. |
|
AVAILABILITY: |
|
CONTACT: |
dkorkin@unb.ca
|
|
PRESENTATION 44 |
|
TITLE: |
Efficient
multiple genome alignment |
AUTHORS: |
M. Hoehl,
S. Kurtz and E. Ohlebusch |
ABSTRACT: |
Motivation: To
allow a direct comparison of the genomic DNA sequences of
sufficiently similar organisms, there is an urgent need for
software tools that can align more than two genomic
sequences.
Results: We developed new algorithms and a software tool
"Multiple Genome Aligner" (MGA for short) that efficiently
computes multiple genome alignments of large, closely related DNA
sequences. For example, it can align 85% of the complete
genomes of six human adenoviruses (average length 35,305 bp.) in
159 seconds. An alignment of 74% of the complete genomes of three
strains of E. coli (lengths: 5,528,445; 5,498,450; 4,639,221 bp.)
is produced in 30 minutes. |
|
AVAILABILITY: |
The software
MGA is available free of charge for non-commercial research
institutions. For details see:
http://bibiserv.techfak.uni-bielefeld.de/mga/ |
|
CONTACT: |
kurtz@techfak.uni-bielefeld.de |
enno@techfak.uni-bielefeld.de |
|
PRESENTATION 46 |
|
TITLE: |
PseudoViewer: automatic visualization of RNA pseudoknots |
AUTHORS: |
K. Han,
Y. Lee and W. Kim |
ABSTRACT: |
Motivation:
Several algorithms have been developed for drawing RNA secondary
structures; however, none of these can be used to draw RNA
pseudoknot structures. In the sense of graph theory, a drawing of
RNA secondary structures is a tree, whereas a drawing of RNA
pseudoknots is a graph with inner cycles within a pseudoknot as
well as possible outer cycles formed between a pseudoknot and
other structural elements. Thus, RNA pseudoknots are more
difficult to visualize than RNA secondary structures. Since no
automatic method for drawing RNA pseudoknots exists, visualizing
RNA pseudoknots relies on a significant amount of manual work and
does not yield satisfactory results. The task of visualizing RNA
pseudoknots by hand becomes more challenging as the size and
complexity of the RNA pseudoknots increase.
Results: We have
developed a new representation and an algorithm for drawing H-type
pseudoknots with RNA secondary structures. Compared to existing
representations of H-type pseudoknots, the new representation
ensures uniform and clear drawings with no edge crossing for all
H-type pseudoknots. To the best of our knowledge, this is the
first algorithm for automatically drawing RNA pseudoknots with RNA
secondary structures. The algorithm has been implemented in a Java
program, which can be executed on any computing system.
Experimental results demonstrate that the algorithm generates an
aesthetically pleasing drawing of all H-type pseudoknots. The
results have also shown that the drawing has high readability,
enabling the user to quickly and easily recognize the whole RNA
structure as well as the pseudoknots themselves. |
|
AVAILABILITY: |
Available on
request from the corresponding author. |
CONTACT: |
khan@inha.ac.kr
|
|
PRESENTATION 47 |
|
TITLE: |
A powerful non-homology method for the prediction of operons in
prokaryotes |
AUTHORS: |
G.
Moreno-Hagelsieb, J. Collado-Vides |
ABSTRACT: |
Motivation:
The prediction of the transcription unit organization of genomes
is an important clue in the inference of functional relationships
of genes, the interpretation and evaluation of transcriptome
experiments, and the overall inference of the regulatory networks
governing the expression of genes in response to the environment.
Though several methods have been devised to predict operons, most
require extensive characterization of the genome being analyzed.
Log-likelihoods derived from inter-genic distance distributions
work surprisingly well to predict operons in Escherichia coli and
are available for any genome as soon as the gene sets are
predicted.
Results: Here we provide evidence that the very same
method is applicable to any prokaryotic genome. First, the method
has the same efficiency when evaluated using a collection of
experimentally known operons of Bacillus subtilis. Second, operons
in most, if not all, prokaryotes show the same tendency to keep
short distances between their genes, the most frequent distances
being overlaps of four and one base pairs. The
universality of this structural feature allows us to predict the
organization of transcription units in all prokaryotes. Third,
predicted operons contain a higher proportion of genes with
related phylogenetic profiles and conservation of adjacency than
predicted borders of transcription units. |
|
AVAILABILITY: |
Additional materials and graphs are available at:
http://www.cifn.unam.mx/moreno/pub/TUpredictions/ |
|
CONTACT: |
moreno@cifn.unam.mx |
|
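As a rough illustration of the intergenic-distance log-likelihood idea described in the abstract above (the distance bins, frequencies, and threshold below are invented placeholders, not values from the authors' model), a minimal sketch:

```python
import math

# Hypothetical frequencies of intergenic distances (bp) for adjacent gene
# pairs inside one operon vs. at transcription-unit (TU) borders; negative
# distances are overlaps. These numbers are illustrative only.
SAME_OPERON = {-4: 0.20, -1: 0.15, 10: 0.10, 50: 0.05, 200: 0.01}
DIFFERENT_TU = {-4: 0.01, -1: 0.02, 10: 0.04, 50: 0.08, 200: 0.15}

def log_likelihood(distance):
    """Log-likelihood ratio that a gene pair at this intergenic distance
    lies in the same operon; positive favours co-transcription."""
    nearest = min(SAME_OPERON, key=lambda d: abs(d - distance))  # snap to bin
    return math.log(SAME_OPERON[nearest] / DIFFERENT_TU[nearest])

def predict_operon_pairs(distances, threshold=0.0):
    """Classify each adjacent gene pair: True = predicted same operon."""
    return [log_likelihood(d) > threshold for d in distances]

# Overlaps of 4 and 1 bp score as operon pairs; a long gap does not.
print(predict_operon_pairs([-4, -1, 120]))  # → [True, True, False]
```

Because the score needs only predicted gene coordinates, such a classifier can be applied to any newly sequenced prokaryotic genome, which is the point the abstract makes.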
PRESENTATION 48 |
|
TITLE: |
Identifying
operons and untranslated regions of transcripts using Escherichia
coli RNA expression analysis |
AUTHORS: |
B. Tjaden, D.R. Haynor, S. Stolyar, C. Rosenow, E. Kolker |
ABSTRACT: |
Microarrays
traditionally have been used to assay the transcript expression of
coding regions of genes. Here, we use Escherichia coli
oligonucleotide microarrays to assay transcript expression of both
open reading frames (ORFs) and intergenic regions. We then use
hidden Markov models to analyze this expression data and estimate
transcription boundaries of genes.
This approach allows us to
identify 5' untranslated regions (UTRs) of transcripts as well as
genes which are likely to be operon members. The operon elements
we identify correspond to documented operons with 99% specificity
and 63% sensitivity. Similarly, we find that our 5' UTR results
accurately coincide with experimentally verified promoter regions
for most genes. |
|
AVAILABILITY: |
|
CONTACT: |
tjaden@cs.washington.edu
|
|
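A minimal sketch of the kind of model the abstract above describes, assuming a two-state HMM over consecutive probes with Viterbi decoding; the states, transition and emission probabilities here are illustrative assumptions, not the authors' trained model:

```python
import math

STATES = ("untranscribed", "transcribed")
START = {"untranscribed": 0.5, "transcribed": 0.5}
TRANS = {  # sticky transitions, so runs of one state delimit transcripts
    "untranscribed": {"untranscribed": 0.9, "transcribed": 0.1},
    "transcribed": {"untranscribed": 0.1, "transcribed": 0.9},
}
EMIT = {  # P(observed expression level | state), levels binned lo/hi
    "untranscribed": {"lo": 0.8, "hi": 0.2},
    "transcribed": {"lo": 0.2, "hi": 0.8},
}

def viterbi(observations):
    """Most likely state path for a sequence of 'lo'/'hi' probe levels."""
    v = [{s: math.log(START[s]) + math.log(EMIT[s][observations[0]])
          for s in STATES}]
    back = []
    for obs in observations[1:]:
        col, ptr = {}, {}
        for s in STATES:
            prev, score = max(
                ((p, v[-1][p] + math.log(TRANS[p][s])) for p in STATES),
                key=lambda t: t[1])
            col[s] = score + math.log(EMIT[s][obs])
            ptr[s] = prev
        v.append(col)
        back.append(ptr)
    state = max(v[-1], key=v[-1].get)  # trace back from best final state
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]

probes = ["lo", "lo", "hi", "hi", "hi", "lo"]
print(viterbi(probes))
```

The decoded boundaries between state runs stand in for transcription start and stop positions; in the paper these runs span both ORF and intergenic probes, which is what exposes UTRs and operon membership.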
PRESENTATION 49 |
|
TITLE: |
Detecting
recombination with MCMC |
AUTHORS: |
D. Husmeier and G. McGuire |
ABSTRACT: |
Motivation: We
present a statistical method for detecting recombination, whose
objective is to accurately locate the recombinant breakpoints in
DNA sequence alignments of small numbers of taxa (4 or 5). Our
approach explicitly models the sequence of phylogenetic tree
topologies along a multiple sequence alignment. Inference under
this model is done in a Bayesian way, using Markov chain Monte
Carlo (MCMC). The algorithm returns the site-dependent posterior
probability of each tree topology, which is used for detecting
recombinant regions and locating their breakpoints.
Results: The
method was tested on a synthetic and three real DNA sequence
alignments, where it was found to outperform the established
detection methods PLATO, RECPARS, and TOPAL. |
|
AVAILABILITY: |
The algorithm
has been implemented in the C++ program package BARCE, which is
freely available from
http://www.bioss.sari.ac.uk/~dirk/software |
|
CONTACT: |
dirk@bioss.ac.uk |
|
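A toy Metropolis sketch of the Bayesian breakpoint idea in the abstract above (this is not BARCE: the per-site log-likelihoods are fabricated stand-ins for phylogenetic likelihoods under two topologies, and a single breakpoint is assumed):

```python
import math
import random

random.seed(0)

# Fabricated per-site log-likelihoods under two competing tree topologies;
# in this toy alignment the recombinant breakpoint lies before site 5.
LL_A = [-1.0, -1.1, -0.9, -1.0, -1.2, -2.5, -2.6, -2.4, -2.5, -2.7]
LL_B = [-2.4, -2.6, -2.5, -2.3, -2.6, -1.0, -0.9, -1.1, -1.0, -1.2]

def log_post(b):
    """Log-posterior (flat prior) of a breakpoint before site b:
    sites < b follow topology A, sites >= b follow topology B."""
    return sum(LL_A[:b]) + sum(LL_B[b:])

def metropolis(n_iter=2000):
    """Random-walk Metropolis over the breakpoint; returns visit counts,
    which approximate the posterior over breakpoint positions."""
    b = len(LL_A) // 2
    counts = [0] * (len(LL_A) + 1)
    for _ in range(n_iter):
        prop = min(max(b + random.choice((-1, 1)), 1), len(LL_A) - 1)
        delta = log_post(prop) - log_post(b)
        if delta >= 0 or random.random() < math.exp(delta):
            b = prop
        counts[b] += 1
    return counts

counts = metropolis()
print(max(range(len(counts)), key=counts.__getitem__))  # posterior mode
```

The full method samples a sequence of topology assignments along the alignment rather than one breakpoint, but the accept/reject mechanics and the site-dependent posterior output are of this form.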
PRESENTATION 50 |
|
TITLE: |
Finding
composite regulatory patterns in DNA sequences |
AUTHORS: |
E. Eskin and P.A. Pevzner |
ABSTRACT: |
Motivation:
Pattern discovery in unaligned DNA sequences is a fundamental
problem in computational biology with important applications in
finding regulatory signals. Current approaches to pattern
discovery focus on monad patterns that correspond to relatively
short contiguous strings. However, many of the actual regulatory
signals are composite patterns that are groups of monad patterns
that occur near each other. A difficulty in discovering composite
patterns is that one or both of the component monad patterns in
the group may be "too weak". Since the traditional monad-based
motif finding algorithms usually output one (or a few) high
scoring patterns, they often fail to find composite regulatory
signals consisting of weak monad parts.
In this paper, we present
MITRA (MIsmatch TRee Algorithm), an approach for discovering
composite signals. We demonstrate that MITRA performs well for
both monad and composite patterns by presenting experiments over
biological and synthetic data. |
|
AVAILABILITY: |
MITRA is available
at:
http://www.cs.columbia.edu/compbio/mitra/ |
CONTACT: |
eeskin@cs.columbia.edu
|
|
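To make the composite (dyad) pattern objective above concrete, here is a brute-force sketch; the mismatch-tree pruning that makes MITRA efficient is not reproduced, and the function names, parameters, and toy sequences are illustrative:

```python
from itertools import product

def mismatches(a, b):
    """Hamming distance between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def occurs(seq, left, right, gap, d):
    """Does 'left <gap> right' occur in seq with <= d mismatches per monad?"""
    span = len(left) + gap + len(right)
    for i in range(len(seq) - span + 1):
        if (mismatches(seq[i:i + len(left)], left) <= d and
                mismatches(seq[i + len(left) + gap:i + span], right) <= d):
            return True
    return False

def best_dyad(seqs, k=3, gap=2, d=1):
    """Exhaustively score all dyads of two length-k monads over ACGT,
    returning the pair occurring in the most sequences."""
    best, best_count = None, -1
    monads = ["".join(p) for p in product("ACGT", repeat=k)]
    for left in monads:
        for right in monads:
            count = sum(occurs(s, left, right, gap, d) for s in seqs)
            if count > best_count:
                best, best_count = (left, right), count
    return best, best_count

# A dyad AAC..GGG planted in all three toy sequences.
seqs = ["TTAACGTGGGTT", "CCAACAAGGGCC", "GGAACTTGGGAA"]
print(best_dyad(seqs, k=3, gap=2, d=0))  # → (('AAC', 'GGG'), 3)
```

The exhaustive scan is exponential in the monad length; the point of the mismatch tree is to prune this search while still scoring each monad of a weak dyad jointly with its partner.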