PRESENTATION 2 |
| |
| TITLE:
|
Mining viral protease data to extract cleavage knowledge |
|
AUTHORS: |
A. Narayanan, X. Wu and Z. Rong Yang |
|
ABSTRACT: |
Motivation: We aim
to identify, through machine learning techniques,
specific patterns in HIV and HCV viral polyprotein amino acid
residues where viral protease cleaves the polyprotein as it leaves
the ribosome. An understanding of viral protease specificity may
help the development of future anti-viral drugs involving protease
inhibitors by identifying specific features of protease activity
for further experimental investigation. While viral sequence
information is growing at a fast rate, there is still
comparatively little understanding of how viral polyproteins are
cut into their functional unit lengths. The aim of the work
reported here is to investigate whether it is possible to
generalise from known cleavage sites to unknown cleavage sites for
two specific viruses - HIV and HCV. An understanding of
proteolytic activity for specific viruses will contribute to our
understanding of viral protease function in general, thereby
leading to a greater understanding of protease families and their
substrate characteristics.
Results: Our results show that artificial neural networks and
symbolic learning techniques (See5) capture some fundamental and
new substrate attributes, but neural networks outperform their
symbolic counterpart. Publicly available
software was used:
- Stuttgart Neural Network Simulator;
http://www-ra.informatik.uni-tuebingen.de/SNNS/
- See5;
http://www.rulequest.com |
| |
|
AVAILABILITY: |
The datasets used (HIV, HCV) for See5 are available
at:
http://www.dcs.ex.ac.uk/~anarayan/bioinf/ismbdatasets/ |
|
|
CONTACT: |
a.narayanan@ex.ac.uk |
z.r.yang@ex.ac.uk |
|
PRESENTATION 3 |
|
|
| TITLE: |
The metric space of proteins - comparative study of clustering
algorithms |
|
AUTHORS: |
O. Sasson, N. Linial, M. Linial |
|
ABSTRACT: |
Motivation: A large fraction of biological research concentrates
on individual proteins and on small families of proteins. One of
the current major challenges in bioinformatics is to extend our
knowledge also to very large sets of proteins. Several major
projects have tackled this problem. Such undertakings usually
start with a process that clusters all known proteins or large
subsets of this space. Some work in this area is carried out
automatically, while other attempts incorporate expert advice and
annotation.
Results: We propose a novel technique that automatically clusters
protein sequences. We consider all proteins in SWISSPROT, and
carry out an all-against-all BLAST similarity test among them.
With this similarity measure in hand we proceed to perform a
continuous bottom-up clustering process by applying alternative
rules for merging clusters. The outcome of this clustering process
is a classification of the input proteins into a hierarchy of
clusters of varying degrees of granularity. Here we compare the
clusters that result from alternative merging rules, and validate
the results against InterPro.
Our preliminary results show that clusters that are consistent
with several rather than a single merging rule tend to comply with
InterPro annotation. This is an affirmation of the view that the
protein space consists of families that differ markedly in their
evolutionary conservation. |
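The bottom-up clustering with alternative merging rules described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy similarity scores, the two merging rules (best cross-pair and average cross-pair) and the function names are assumptions chosen for illustration, and a real run would start from all-against-all BLAST scores.

```python
from itertools import combinations


def agglomerate(items, sim, rule):
    """Bottom-up clustering; returns the clusters in the order they are formed.

    sim maps frozenset({a, b}) to a similarity score (higher = more similar);
    pairs missing from sim are treated as similarity 0.
    rule reduces the list of cross-cluster pair scores to one number.
    """
    clusters = [frozenset([x]) for x in items]
    merges = []
    while len(clusters) > 1:
        best = None
        for c1, c2 in combinations(clusters, 2):
            scores = [sim.get(frozenset((a, b)), 0.0) for a in c1 for b in c2]
            score = rule(scores)
            if best is None or score > best[0]:
                best = (score, c1, c2)
        _, c1, c2 = best
        # Replace the two most similar clusters by their union.
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
        merges.append(c1 | c2)
    return merges


def mean(scores):
    return sum(scores) / len(scores)


# Hypothetical similarity scores between four proteins.
items = ["A", "B", "C", "D"]
sim = {
    frozenset("AB"): 0.9,
    frozenset("BC"): 0.7,
    frozenset("AC"): 0.1,
    frozenset("CD"): 0.5,
}

by_max = set(agglomerate(items, sim, max))    # merge on the best cross-pair
by_mean = set(agglomerate(items, sim, mean))  # merge on the average cross-pair
consensus = by_max & by_mean                  # clusters both rules agree on
print(sorted("".join(sorted(c)) for c in consensus))
```

On this toy input the two rules agree on the cluster {A, B} but disagree on where C belongs, which is the kind of rule-dependent cluster the abstract suggests treating with more caution.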
| |
|
AVAILABILITY: |
The outcome of these investigations
can be viewed in an interactive Web site at:
http://www.protonet.cs.huji.ac.il
Supplementary information: Biological examples for comparing the
performance of the different algorithms used for classification
are presented in:
http://www.protonet.cs.huji.ac.il/examples.html. |
| |
|
CONTACT: |
ori@cs.huji.ac.il |
|
PRESENTATION 4 |
| |
| TITLE: |
DNA sequence and structure:
direct and indirect recognition in protein-DNA binding |
|
AUTHORS: |
N.F. Steffen, S.D. Murphy, L. Tolleri, G.W. Hatfield, R.H. Lathrop |
|
ABSTRACT: |
Motivation: Direct recognition, or
direct readout, of DNA bases by a DNA-binding protein involves
amino acids that interact directly with features specific to each
base. Experimental evidence also shows that in many cases the
protein achieves partial sequence specificity by indirect
recognition, i.e., by recognizing structural properties of the
DNA. (1) Could threading a DNA sequence onto a crystal structure
of bound DNA help explain the indirect recognition component of
sequence specificity? (2) Might the resulting pure-structure
computational motif manifest itself in familiar sequence-based
computational motifs?
Results: The starting structure motif was a crystal structure of
DNA bound to the integration host factor protein (IHF) of
E. coli. IHF is known to exhibit both direct and indirect
recognition of its binding sites. (1) Threading DNA sequences
onto the crystal structure showed statistically significant
partial separation of 60 IHF binding sites from random and
intragenic sequences and was positively correlated with binding
affinity. (2) The crystal structure was shown to be equivalent to
a linear Markov network, and so, to a joint probability
distribution over sequences, computable in linear time. It was
transformed algorithmically into several common pure-sequence
representations, including (a) small sets of short exact strings,
(b) weight matrices, (c) consensus regular patterns, (d) multiple
sequence alignments, and (e) phylogenetic trees. In all cases the
pure-sequence motifs retained statistically significant partial
separation of the IHF binding sites from random and intragenic
sequences. Most exhibited positive correlation with binding
affinity. The multiple alignment showed some conserved columns,
and the phylogenetic tree partially mixed low-energy sequences
with IHF binding sites but separated high-energy sequences. The
conclusion is that deformation energy explains part of indirect
recognition, which explains part of IHF sequence-specific binding. |
| |
| AVAILABILITY: |
Code and data on request. |
|
|
CONTACT: |
Nick Steffen for code:
nsteffen@uci.edu |
Lorenzo Tolleri for data:
Tolleri@chiron.it |
| |
|
PRESENTATION 5 |
| |
| TITLE: |
Beyond
tandem repeats: complex pattern structures and distant regions of
similarity |
|
AUTHORS: |
A.M. Hauth, D.A. Joseph |
|
ABSTRACT: |
Motivation: Tandem repeats (TRs) are
associated with human disease, play a role in evolution and are
important in regulatory processes. Despite their importance,
locating and characterizing these patterns within anonymous DNA
sequences remains a challenge. In part, the difficulty is due to
imperfect conservation of patterns and complex pattern
structures. We study recognition algorithms for two complex
pattern structures: variable length tandem repeats (VLTRs) and
multi-period tandem repeats (MPTRs).
Results: We extend previous algorithmic research to a class of
regular tandem repeats (RegTRs). We formally define RegTRs, as
well as two important subclasses: VLTRs and MPTRs. We present
algorithms for identification of TRs in these classes.
Furthermore, our algorithms identify degenerate VLTRs and MPTRs:
repeats containing substitutions, insertions and deletions. To
illustrate our work, we present results of our analysis for two
difficult regions in cattle and human which reflect practical
occurrences of these subclasses in GenBank sequence data. In
addition, we show the applicability of our algorithmic techniques
for identifying Alu sequences, gene clusters and other distant
regions of similarity. We illustrate this with an example from
yeast chromosome I.
|
| |
|
AVAILABILITY: |
Algorithms can be accessed at:
http://www.cs.wisc.edu/areas/theory |
| |
|
CONTACT: |
Amy M. Hauth
kryder@cs.wisc.edu
608-831-2164 |
Deborah A. Joseph
joseph@cs.wisc.edu
608-262-8022 |
|
|
FAX: 608-262-9777 |
|
PRESENTATION 6 |
| |
| TITLE: |
The POPPs: clustering and searching using peptide probability
profiles |
|
AUTHORS: |
M.J. Wise |
|
ABSTRACT: |
The POPPs is a suite of inter-related
software tools which allow the user to discover what is
statistically "unusual" in the composition of an unknown protein,
or to automatically cluster proteins into families based on
peptide composition. Finally, the user can search for related
proteins based on peptide composition. Statistically based
peptide composition provides a view of proteins that is, to some
extent, orthogonal to that provided by sequence. In a test study,
the POPP suite is able to regroup into their families sets of
approximately 100 randomised Pfam protein domains.
The POPPs
suite is used to explore the diverse set of late embryogenesis
abundant (LEA) proteins. |
| |
|
AVAILABILITY: |
Contact the author |
|
CONTACT: |
mw263@cam.ac.uk |
| |
|
PRESENTATION 7 |
| |
| TITLE: |
Combining
bioinformatics and biophysics to understand protein
structure and function |
|
AUTHORS: |
Barry Honig |
| ABSTRACT: |
The increasing numbers of proteins whose
three-dimensional structures have been determined will have major
impact on the ability to exploit genomic data. We have been
developing a series of computational tools with the goal of
detecting relationships among amino acid sequence, protein
structure and protein function. In this context, recent
computational advances in using structure to improve sequence
alignments, in homology model building and in the calculation of
binding affinities will be summarized, as will their combined use
towards an understanding of the principles of protein-protein
recognition.
Our basic approach involves calculating
protein folding and binding free energies as well as contributions
of individual amino acids to these free energies, and correlating
these energetic contributions with sequence patterns and with
physical and chemical properties of the protein. The factors that
determine binding free energies will be discussed with particular
emphasis on understanding how protein interfaces are designed, in
a structural sense, to exploit different combinations of
electrostatic and hydrophobic interactions to achieve both
affinity and specificity.
An important feature in any attempt to
compare the properties of different proteins is to obtain an
accurate sequence alignment and, when possible, an accurate
structure alignment as well. We have developed a novel hybrid
approach that uses both sequence and structural alignments to
generate high quality sequence profiles for protein families.
Application of such alignments, when combined with calculations of
binding affinities, yields a new method for predicting binding
specificity from sequence alone. An application to the peptide
binding specificity of SH2 domains will be described.
With regard to structure prediction, we
have made a number of advances that we believe will significantly
improve the accuracy of homology modeling. Specifically, we have
succeeded in predicting the conformation of buried side chains to
close to experimental accuracy, in obtaining extremely high loop
prediction accuracies and in refining modeled structures so as to
more closely resemble the correct structure than the original
template. In addition, our new alignment procedure and our
calculations of folding free energies have led to the development
of a new and accurate procedure to evaluate homology models. These
and other developments have been incorporated into new homology
modeling software that will be described. |
| |
|
| |
|
PRESENTATION 8 |
| |
| TITLE: |
A sequence-profile-based HMM for predicting and discriminating beta barrel
membrane proteins |
|
AUTHORS: |
P.L. Martelli,
P. Fariselli, A. Krogh, R. Casadio |
|
ABSTRACT: |
Motivation:
Membrane proteins are an abundant and functionally relevant subset
of proteins that putatively include from about 15 up to 30% of the
proteome of organisms fully sequenced. These estimates are mainly
computed on the basis of sequence comparison and membrane protein
prediction. It is therefore urgent to develop methods capable of
selecting membrane proteins especially in the case of outer
membrane proteins, barely taken into consideration when proteome
wide analysis is performed. This will also help protein annotation
when no homologous sequence is found in the database. Outer
membrane proteins solved so far at atomic resolution interact with
the external membrane of bacteria with a characteristic beta barrel
structure comprising different even numbers of beta strands (beta barrel
membrane proteins). In this they differ from the membrane proteins
of the cytoplasmic membrane endowed with alpha helix bundles (all
alpha membrane proteins) and need specialised predictors.
Results:
We develop an HMM that can predict the topology of beta barrel
membrane proteins, using as input evolutionary information. The
model is cyclic with 6 types of states: two for the beta strand transmembrane core, one for the
beta strand cap on either side of the
membrane, one for the inner loop, one for the outer loop and one
for the globular domain state in the middle of each loop. The
development of a specific input for HMM based on multiple sequence
alignment is novel. The accuracy per residue of the model is 82%
when a jack knife procedure is adopted. With a model optimisation
method using a dynamic programming algorithm, seven topological
models out of the twelve proteins included in the testing set are
also correctly predicted. When used as a discriminator, the model
is rather selective. At a fixed probability value, it retains 84%
of a non-redundant set comprising 145 sequences of well-annotated
outer membrane proteins. Concomitantly, it correctly rejects 90%
of a set of globular proteins including about 1200 chains with low
sequence identity (< 30%) and 90% of a set of all alpha membrane
proteins, including 188 chains. |
| |
|
AVAILABILITY: |
The program will
be available on request from the authors. |
|
CONTACT: |
gigi@lipid.biocomp.unibo.it |
www.biocomp.unibo.it |
| |
|
PRESENTATION 9 |
| |
| TITLE: |
Fully
automated ab initio protein structure prediction using
I-SITES, HMMSTR and ROSETTA |
|
AUTHORS: |
C. Bystroff and Y. Shao |
|
ABSTRACT: |
Motivation:
The Monte Carlo fragment insertion method for protein tertiary
structure prediction (ROSETTA) of Baker and others has been
merged with the I-SITES library of sequence structure motifs and
the HMMSTR model for local structure in proteins, to form a new
public server for the ab initio prediction of protein structure.
The server performs several tasks in addition to tertiary
structure prediction, including a database search, amino acid
profile generation, fragment structure prediction, and backbone
angle and secondary structure prediction. Meeting reasonable
service goals required improvements in the efficiency, in
particular for the ROSETTA algorithm.
Results: The new server was
used for blind predictions of 40 protein sequences as part of the
CASP4 blind structure prediction experiment. The results for 31 of
those predictions are presented here. 61% of the residues overall
were found in topologically correct predictions, which are defined
as fragments of 30 residues or more with a root-mean-square
deviation in superimposed alpha carbons of less than 6 Å. HMMSTR
3-state secondary structure predictions were 73% correct overall.
Tertiary structure predictions did not improve the accuracy of
secondary structure prediction. |
| |
|
AVAILABILITY: |
The server is
accessible through the web at
http://isites.bio.rpi.edu/hmmstr/index.html
Programs are available upon request for academics.
Licensing agreements are available for commercial interests.
Supplementary information:
http://isites.bio.rpi.edu
http://predictioncenter.llnl.gov/casp4/ |
| |
|
CONTACT: |
bystrc@rpi.edu |
shaoy@rpi.edu |
| |
|
PRESENTATION 10 |
| |
| TITLE: |
Prediction of contact maps by GIOHMMs
and recurrent neural networks using lateral propagation from all
four cardinal corners |
|
AUTHORS: |
G.
Pollastri, P. Baldi |
|
ABSTRACT: |
Motivation:
Accurate prediction of protein contact maps is an important step
in computational structural proteomics. Because contact maps
provide a translation and rotation invariant topological
representation of a protein, they can be used as a fundamental
intermediary step in protein structure prediction.
Results: We
develop a new set of flexible machine learning architectures for
the prediction of contact maps, as well as other information
processing and pattern recognition tasks. The architectures can be
viewed as recurrent neural network parameterizations of a class of
Bayesian networks we call generalized input-output HMMs. For the
specific case of contact maps, contextual information is
propagated laterally through four hidden planes, one for each
cardinal corner. We show that these architectures can be trained
from examples and yield contact map predictors that outperform
previously reported methods. While several extensions and
improvements are in progress, the current version can accurately
predict 60.5% of contacts at a distance cutoff of 8 Å and 45% of
distant contacts at 10 Å, for proteins of length up to 300. |
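To make the notion of a contact map concrete, the sketch below builds one from 3D coordinates and checks the invariance property mentioned in the Motivation. It only illustrates the data structure being predicted, not the GIOHMM or recurrent network predictor. The coordinates are hypothetical, and using a single point per residue is an assumption; the 8 Å default cutoff is taken from the abstract.

```python
import math


def contact_map(coords, cutoff=8.0):
    """Boolean matrix: True where two residues lie within `cutoff` angstroms."""
    n = len(coords)
    return [
        [math.dist(coords[i], coords[j]) <= cutoff for j in range(n)]
        for i in range(n)
    ]


# Hypothetical coordinates (one point per residue), in angstroms.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
          (7.6, 3.8, 0.0), (20.0, 0.0, 0.0)]

cmap = contact_map(coords)

# Rigid motions leave all pairwise distances, and hence the map, unchanged.
translated = [(x + 10.0, y - 5.0, z + 2.0) for x, y, z in coords]
rotated = [(-y, x, z) for x, y, z in coords]  # 90 degrees about the z axis

print(contact_map(translated) == cmap, contact_map(rotated) == cmap)
```

Because the map depends only on pairwise distances, any translation or rotation of the structure yields the same matrix, which is why it can serve as an orientation-free intermediate target.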
| |
|
AVAILABILITY: |
The contact
map predictor will be made available through
http://promoter.ics.uci.edu/BRNN-PRED/
as part of an existing suite of proteomics predictors. |
| |
|
CONTACT: |
gpollast@ics.uci.edu
|
pfbaldi@ics.uci.edu |
| |
|
PRESENTATION 11 |
| |
| TITLE: |
Rate4Site:
an algorithmic tool for the identification of functional regions
on proteins by surface mapping of evolutionary determinants within
their homologues |
|
AUTHORS: |
T. Pupko, R.E. Bell, I. Mayrose, F. Glaser, N. Ben-Tal |
|
ABSTRACT: |
Motivation: A
number of proteins of known three-dimensional (3D) structure
exist, with yet unknown function. In light of the recent progress
in structure determination methodology, this number is likely to
increase rapidly. A novel method is presented here: "Rate4Site",
which maps the rate of evolution among homologous proteins onto
the molecular surface of one of the homologues whose 3D-structure
is known. Functionally important regions correspond to surface
patches of slowly evolving residues.
Results: Rate4Site estimates
the rate of evolution of amino acid sites using the maximum
likelihood (ML) principle. The ML estimate of the rates considers
the topology and branch lengths of the phylogenetic tree, as well
as the underlying stochastic process. To demonstrate its potency,
we study the Src SH2 domain. Like previously established methods,
Rate4Site detected the SH2 peptide-binding groove. Interestingly,
it also detected inter-domain interactions between the SH2 domain
and the rest of the Src protein that other methods failed to
detect. |
| |
|
AVAILABILITY: |
Rate4Site can be
downloaded at:
http://ashtoret.tau.ac.il/
Supplementary
Information: multiple sequence alignment of homologous domains
from the SH2 protein family, the corresponding phylogenetic tree
and additional examples are available at:
http://ashtoret.tau.ac.il/~rebell |
| |
|
CONTACT: |
tal@ism.ac.jp
rebell@ashtoret.tau.ac.il |
fabian@ashtoret.tau.ac.il
bental@ashtoret.tau.ac.il |
| |
|
PRESENTATION 12 |
| |
| TITLE: |
Inferring
sub-cellular localization through automated lexical analysis |
|
AUTHORS: |
R. Nair and B. Rost |
|
ABSTRACT: |
Motivation:
The SWISS-PROT sequence database contains keywords of functional
annotations for many proteins. In contrast, information about the
sub-cellular localization is only available for few proteins.
Experts can often infer localization from keywords describing
protein function. We developed LOCkey, a fully automated method
for lexical analysis of SWISS-PROT keywords that assigns
sub-cellular localization. With the rapid growth in sequence data,
the biochemical characterisation of sequences has been falling
behind. Our method may be a useful tool for supplementing
functional information already automatically available.
Results: The
method reached a level of more than 82% accuracy in a full
cross-validation test. Due to a lack of functional annotations, we
could infer localization for less than half of all proteins in
SWISS-PROT. We applied LOCkey to annotate five entirely sequenced
proteomes, namely Saccharomyces cerevisiae (yeast), Caenorhabditis
elegans (worm), Drosophila melanogaster (fly), Arabidopsis
thaliana (plant) and a subset of all human proteins. LOCkey found
about 8000 new annotations of sub-cellular localization for these
eukaryotes. |
| |
|
AVAILABILITY: |
Annotations of
localization for eukaryotes at:
http://cubic.bioc.columbia.edu/services/LOCkey |
|
|
CONTACT: |
rost@columbia.edu |
| |
|
PRESENTATION 14 |
| |
| TITLE: |
Support
vector regression applied to the determination of the
developmental age of a Drosophila embryo from its segmentation
gene expression patterns |
|
AUTHORS: |
E. Myasnikova, A. Samsonova, M. Samsonova and J. Reinitz |
|
ABSTRACT: |
Motivation: In
this paper we address the problem of the determination of
developmental age of an embryo from its segmentation gene
expression patterns in Drosophila.
Results: By applying support
vector regression we have developed a fast method for automated
staging of an embryo on the basis of its gene expression pattern.
Support vector regression is a statistical method for creating
regression functions of arbitrary type from a set of training
data. The training set is composed of embryos for which the
precise developmental age was determined by measuring the degree
of membrane invagination. Testing the quality of regression on
the training set showed good prediction accuracy. The optimal
regression function was then used for the prediction of the gene
expression based age of embryos in which the precise age has not
been measured by membrane morphology. Moreover, we show that the
same accuracy of prediction can be achieved when the
dimensionality of the feature vector was reduced by applying
factor analysis. The data reduction allowed us to avoid
over-fitting and to increase the efficiency of the algorithm. |
| |
|
AVAILABILITY: |
This software may be obtained from the authors.
|
|
CONTACT: |
samson@fn.csa.ru |
| |
|
PRESENTATION 15 |
| |
| TITLE: |
Variance
stabilization applied to microarray data calibration and to the
quantification of differential expression |
|
AUTHORS: |
W.
Huber, A. von Heydebreck, H. Sueltmann, A. Poustka and M. Vingron |
|
ABSTRACT: |
We introduce a
statistical model for microarray gene expression data that
comprises data calibration, the quantification of differential
expression, and the quantification of measurement error. In
particular, we derive a transformation h for intensity
measurements, and a difference statistic Δh whose variance is
approximately constant along the whole intensity range. This
forms a basis for statistical inference from microarray data, and
provides a rational data pre-processing strategy for multivariate
analyses. For the transformation h, the parametric form h(x) =
arsinh(a+bx) is derived from a model of the variance-versus-mean
dependence for microarray intensity data, using the method of
variance stabilizing transformations. For large intensities, h
coincides with the logarithmic transformation, and Δh with the
log-ratio. The parameters of h together with those of the
calibration between experiments are estimated with a robust
variant of maximum-likelihood estimation.
We demonstrate our
approach on data sets from different experimental platforms,
including two-color cDNA arrays and a series of Affymetrix
oligonucleotide arrays. |
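The large-intensity behaviour stated in this abstract follows from the identity arsinh(z) = log(z + sqrt(z^2 + 1)), which approaches log(2z) as z grows. The short numerical check below illustrates this; the parameter values a and b and the intensities are hypothetical, whereas in the abstract they are estimated from the data.

```python
import math


def h(x, a, b):
    """Transformation of the parametric form h(x) = arsinh(a + b*x)."""
    return math.asinh(a + b * x)


# Hypothetical calibration parameters and two large intensities.
a, b = 0.5, 0.01
x1, x2 = 5.0e5, 1.0e5

# For large arguments, arsinh(z) is close to log(2z).
gap_to_log = abs(h(x1, a, b) - math.log(2 * (a + b * x1)))

# The difference of transformed values is then close to the log-ratio.
delta_h = h(x1, a, b) - h(x2, a, b)
log_ratio = math.log(x1 / x2)

# Unlike the logarithm, h stays finite at zero intensity.
h_at_zero = h(0.0, a, b)

print(gap_to_log, delta_h, log_ratio, h_at_zero)
```

The agreement with the logarithm holds up to an additive constant, which cancels in the difference statistic; near zero intensity the two transformations differ, since the logarithm diverges there.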
| |
|
AVAILABILITY: |
Software is
freely available for academic use as an R package at
http://www.dkfz.de/abt0840/whuber |
| |
|
CONTACT: |
w.huber@dkfz.de
|
| |
|
PRESENTATION 16 |
| |
| TITLE: |
A variance-stabilizing transformation for gene-expression
microarray data |
|
AUTHORS: |
B.P.
Durbin, J.S. Hardin, D.M. Hawkins, D.M. Rocke |
|
ABSTRACT: |
Motivation:
Standard statistical techniques often assume that data are
normally distributed, with constant variance not depending on the
mean of the data. Data that violate these assumptions can often
be brought in line with the assumptions by application of a
transformation. Gene-expression microarray data have a
complicated error structure, with a variance that changes with the
mean in a non-linear fashion. Log transformations, which are
often applied to microarray data, can inflate the variance of
observations near background.
Results: We introduce a
transformation that stabilizes the variance of microarray data
across the full range of expression. Simulation studies also
suggest that this transformation approximately symmetrizes
microarray data. |
| |
|
AVAILABILITY: |
|
|
CONTACT: |
bpdurbin@wald.ucdavis.edu |
| |
|
PRESENTATION 17 |
| |
| TITLE: |
Binary
tree-structured vector quantization approach to clustering and
visualizing microarray data |
|
AUTHORS: |
M. Sultan, D.
Wigle, CA Cumbaa, M. Maziarz, J. Glasgow,
M.S. Tsao and I. Jurisica |
|
ABSTRACT: |
Motivation:
With the increasing number of gene expression databases, the need
for more powerful analysis and visualization tools is growing.
Many techniques have successfully been applied to unravel latent
similarities among genes and/or experiments. Most of the current
systems for microarray data analysis use statistical methods,
hierarchical clustering, self-organizing maps, support vector
machines, or k-means clustering to organize genes or experiments
into "meaningful" groups. Without prior explicit bias almost all
of these clustering methods applied to gene expression data not
only produce different results, but may also produce clusters with
little or no biological relevance. Of these methods, agglomerative
hierarchical clustering has been the most widely applied, although
many limitations have been identified.
Results: Starting with a
systematic comparison of the underlying theories behind clustering
approaches, we have devised a technique that combines
tree-structured vector quantization and partitive k-means
clustering (BTSVQ). This hybrid technique has revealed clinically
relevant clusters in three large publicly available data sets. In
contrast to existing systems, our approach is less sensitive to
data preprocessing and data normalization. In addition, the
clustering results produced by the technique have strong
similarities to those of self-organizing maps (SOMs). We discuss
the advantages and the mathematical reasoning behind our approach. |
| |
|
AVAILABILITY: |
The BTSVQ
system is implemented in Matlab R12 using the SOM toolbox for the
visualization and preprocessing of the data:
http://www.cis.hut.fi/projects/somtoolbox/
BTSVQ is available for non-commercial use:
http://www.uhnres.utoronto.ca/ta3/BTSVQ |
| |
|
CONTACT: |
ij@uhnres.utoronto.ca |
| |
|
PRESENTATION 18 |
| |
| TITLE: |
Linking
gene expression data with patient survival times using partial
least squares |
|
AUTHORS: |
P.J. Park, L. Tian, I.S. Kohane |
|
ABSTRACT: |
There is an
increasing need to link the large amount of genotypic data,
gathered using microarrays for example, with various phenotypic
data from patients. The classification problem in which gene
expression data serve as predictors and a class label phenotype as
the binary outcome variable has been examined extensively, but
there has been less emphasis on dealing with other types of
phenotypic data. In particular, patient survival times with
censoring are often not used directly as a response variable due
to the complications that arise from censoring. We show that the
issues involving censored data can be circumvented by
reformulating the problem as a standard Poisson regression
problem. The procedure for solving the transformed problem is a
combination of two approaches: partial least squares, a regression
technique that is especially effective when there is severe
collinearity due to a large number of predictors, and generalized
linear regression, which extends standard linear regression to
deal with various types of response variables. The linear
combinations of the original variables identified by the method
are highly correlated with the patient survival times and at the
same time account for the variability in the covariates. The
algorithm is fast, as it does not involve any matrix
decompositions in the iterations.
We apply our method to data
sets from lung carcinoma and diffuse large B-cell lymphoma studies
to verify its effectiveness. |
| |
|
AVAILABILITY: |
|
|
CONTACT: |
peter_park@harvard.edu
|
| |
|
PRESENTATION 19 |
| |
| TITLE: |
Gene
trees, genome trees and the universal tree. |
|
AUTHORS: |
Ford Doolittle |
|
ABSTRACT: |
For several
decades, molecular phylogeneticists laboured under the illusion
that the topology of an accurately constructed tree for a single
gene, that encoding small-subunit ribosomal RNA (SSU rRNA), could
be taken as the topology of the "universal tree" relating all
organisms, back to a last common ancestor that might have lived
more than 3.5 billion years ago. In this they may have been naive,
since we now know that many protein-coding genes give trees
different than SSU rRNA, or than each other. Sometimes this is
artifact, but often it reflects genuinely different histories, the
consequence of frequent lateral gene transfer. There may be a
small core of genes that have never been exchanged and thus show
the true organismal history, but this is very hard to prove. I
will review the practical and theoretical implications of this,
for the genomes of prokaryotes and simple eukaryotes.
|
| |
|
| |
|
PRESENTATION 20 |
| |
| TITLE: |
Microarray
synthesis through multiple-use PCR primer design |
|
AUTHORS: |
R.J.
Fernandes, S.S. Skiena |
|
ABSTRACT: |
A substantial
percentage of the expense in constructing full-genome spotted
microarrays comes from the cost of synthesizing the PCR primers to
amplify the desired DNA. We propose a computationally-based
method to substantially reduce this cost. Historically, PCR
primers are designed so that each primer occurs uniquely in the
genome. This condition is unnecessarily strong for selective
amplification, since only the primer pair associated with each
amplification need be unique.
We demonstrate that careful design
in a genome-level amplification project permits us to save the
cost of several thousand primers over conventional approaches. |
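The observation that only the primer pair needs to be unique can be illustrated with a toy example. This is a sketch under strong simplifying assumptions, not the authors' design method: it uses exact matching on one strand only, a made-up genome and primers, and a hypothetical maximum product length.

```python
def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_all(text, pattern):
    """All start positions of pattern in text (overlaps allowed)."""
    hits, start = [], text.find(pattern)
    while start != -1:
        hits.append(start)
        start = text.find(pattern, start + 1)
    return hits


def products(genome, fwd, rev, max_len=30):
    """(start, end) spans a forward/reverse primer pair would amplify."""
    rc = revcomp(rev)
    spans = []
    for i in find_all(genome, fwd):
        for j in find_all(genome, rc):
            end = j + len(rc)
            # The reverse site must lie downstream, within max_len bases.
            if j >= i + len(fwd) and end - i <= max_len:
                spans.append((i, end))
    return spans


# Hypothetical primers: one shared forward primer, two reverse primers.
F, R1, R2 = "ACGTAC", "GATTGC", "CCTGAA"
genome = (F + "TTTT" + revcomp(R1)) + "A" * 40 + (F + "TTTT" + revcomp(R2))

shared_sites = find_all(genome, F)   # F is NOT unique in the genome
target1 = products(genome, F, R1)    # yet each pair yields one product
target2 = products(genome, F, R2)
print(shared_sites, target1, target2)
```

Here three primers cover two targets instead of the four needed when every primer must be unique; this reuse is the source of the savings the abstract describes.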
| |
|
AVAILABILITY: |
|
|
CONTACT:
|
skiena@cs.sunysb.edu
|
rohan@cs.sunysb.edu |
| |
|
PRESENTATION 21 |
| |
| TITLE: |
Discovering statistically significant biclusters in gene
expression data |
|
AUTHORS: |
A. Tanay,
R.
Sharan and R. Shamir |
|
ABSTRACT: |
In gene expression
data, a bicluster is a subset of the genes exhibiting consistent
patterns over a subset of the conditions. We propose a new method
to detect significant biclusters in large expression datasets. Our
approach is graph theoretic coupled with statistical modeling of
the data. Under plausible assumptions, our algorithm is
polynomial and is guaranteed to find the most significant biclusters.
We tested our method on a collection of yeast
expression profiles and on a human cancer dataset. Cross
validation results show high specificity in assigning function to
genes based on their biclusters, and we are able to annotate in
this way 196 uncharacterized yeast genes. We also demonstrate how
the biclusters lead to detecting new concrete biological
associations. In cancer data we are able to detect and relate
finer tissue types than was previously possible. We also show
that the method outperforms the biclustering algorithm of Cheng
and Church (2000). |
| |
|
AVAILABILITY: |
www.cs.tau.ac.il/~rshamir/biclust.html |
|
CONTACT: |
amos@tau.ac.il
roded@tau.ac.il |
rshamir@tau.ac.il |
| |
|
PRESENTATION 22 |
| |
| TITLE: |
Co-clustering of biological networks and gene expression data |
|
AUTHORS: |
D. Hanisch, A. Zien, R. Zimmer, T. Lengauer |
|
ABSTRACT: |
Motivation:
Large scale gene expression data are often analyzed by clustering
genes based on gene expression data alone, though a-priori
knowledge in the form of biological networks is available. The use
of this additional information promises to improve exploratory
analysis considerably.
Results: We propose to construct a distance
function which combines information from expression data and
biological networks. Based on this function, we compute a joint
clustering of genes and vertices of the network. This general
approach is elaborated for metabolic networks. We define a graph
distance function on such networks and combine it with a
correlation-based distance function for gene expression
measurements. A hierarchical clustering and an associated
statistical measure are computed to arrive at a reasonable number
of clusters. Our method is validated using expression data of the
yeast diauxic shift. The resulting clusters are easily
interpretable in terms of the biochemical network and the gene
expression data and suggest that our method is able to
automatically identify processes that are relevant under the
measured conditions. |
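The idea of a distance function that draws on both data sources can be sketched as follows. The abstract does not state how the two distances are combined, so the weighted sum, the `weight` and `cap` parameters, and the toy network below are all assumptions chosen for illustration, not the authors' definition.

```python
from collections import deque
from math import sqrt


def correlation_distance(x, y):
    """1 - Pearson correlation of two expression profiles (range 0..2)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1.0 - cov / (norm_x * norm_y)


def graph_distance(adjacency, source, target):
    """Shortest path length in an unweighted network (breadth-first search)."""
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return float("inf")  # target not reachable from source


def combined_distance(expr_dist, net_dist, weight=0.5, cap=4):
    """Assumed combination: weighted sum of both distances scaled to 0..1."""
    scaled_expr = expr_dist / 2.0
    scaled_net = min(net_dist, cap) / cap
    return weight * scaled_expr + (1.0 - weight) * scaled_net


# Hypothetical linear pathway a - b - c.
network = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
d = combined_distance(
    correlation_distance([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]),
    graph_distance(network, "a", "c"),
)
```

The resulting pairwise values could then be passed to any hierarchical clustering routine; the choice of scaling matters because the two raw distances live on different ranges.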
| |
|
AVAILABILITY: |
|
|
CONTACT: |
Daniel.Hanisch@scai.fhg.de
|
| |
|
PRESENTATION 23 |
| |
| TITLE: |
Statistical process control for large scale microarray experiments |
|
AUTHORS: |
F. Model, T. Konig, C. Piepenbrock, P. Adorjan |
|
ABSTRACT: |
Motivation:
Maintaining and controlling data quality is a key problem in large
scale microarray studies. In particular, systematic changes in
experimental conditions across multiple chips can seriously affect
quality and even lead to false biological conclusions.
Traditionally the influence of these effects can only be minimized
by expensive repeated measurements, because a detailed
understanding of all process relevant parameters seems
impossible.
Results: We introduce a novel method for microarray
process control that estimates quality solely based on the
distribution of the actual measurements without requiring repeated
experiments. A robust version of principal component analysis
detects single outlier microarrays and only thereby enables the
use of techniques from multivariate statistical process control.
In particular, the T² control chart reliably tracks undesired
changes in process relevant parameters. This can be used to
improve the microarray process itself and to limit necessary
repetitions to affected samples, thereby maintaining
quality in a cost effective way. We demonstrate the power of the
approach on 3 large sets of DNA methylation microarray data. |
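As a rough illustration of the statistic behind such a control chart, the sketch below computes Hotelling's T-squared value for each chip from its principal component scores. It assumes the scores have already been obtained (the robust PCA step is not reproduced here), uses ordinary sample variances, and does not compute a control limit; the function name and input layout are hypothetical.

```python
from statistics import mean, variance


def t2_statistics(scores):
    """Hotelling's T-squared per chip from principal component scores.

    scores: one list of component scores per chip. Because principal
    component scores are uncorrelated, T-squared reduces to the sum of
    squared, variance-scaled deviations over the components.
    """
    n_components = len(scores[0])
    columns = [[row[k] for row in scores] for k in range(n_components)]
    means = [mean(col) for col in columns]
    variances = [variance(col) for col in columns]
    return [
        sum((row[k] - means[k]) ** 2 / variances[k]
            for k in range(n_components))
        for row in scores
    ]


# Hypothetical scores for six chips on one component; the last chip drifts.
chip_scores = [[0.1], [-0.1], [0.0], [0.2], [-0.2], [3.0]]
t2 = t2_statistics(chip_scores)
suspect = t2.index(max(t2))
```

Plotting these values in processing order and flagging chips above a chosen limit is the basic shape of a T-squared control chart.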
| |
|
AVAILABILITY: |
|
|
CONTACT: |
Fabian.Model@epigenomics.com |
| |
|
PRESENTATION 24 |
| |
| TITLE: |
Evaluating
machine learning approaches for aiding probe selection for
gene-expression arrays |
|
AUTHORS: |
J.B. Tobler, M.N. Molla,
E.F. Nuwaysir, R.D. Green and J.W. Shavlik |
|
ABSTRACT: |
Motivation:
Microarrays are a fast and cost-effective method of performing
thousands of DNA hybridization experiments simultaneously. DNA
probes are typically used to measure the expression level of
specific genes. Because probes greatly vary in the quality of
their hybridizations, choosing good probes is a difficult task.
If one could accurately choose probes that are likely to hybridize
well, then fewer probes would be needed to represent each gene in
a gene-expression microarray, and, hence, more genes could be
placed on an array of a given physical size. Our goal is to
empirically evaluate how successfully three standard
machine-learning algorithms - naïve Bayes, decision trees, and
artificial neural networks - can be applied to the task of
predicting good probes. Fortunately it is relatively easy to get
training examples for such a learning task: place various probes
on a gene chip, add a sample where the corresponding genes are
highly expressed, and then record how well each probe measures the
presence of its corresponding gene. With such training examples,
it is possible that an accurate predictor of probe quality can be
learned.
Results: Two of the learning algorithms we investigate -
naïve Bayes and neural networks - learn to predict probe quality
surprisingly well. For example, in the top ten predicted probes
for a given gene not used for training, on average about five rank
in the top 2.5% of that gene's hundreds of possible probes.
Decision-tree induction and the simple approach of using predicted
melting temperature to rank probes perform significantly worse
than these two algorithms. The features we use to represent
probes are very easily computed and the time taken to score each
candidate probe after training is minor. Training the naïve Bayes
algorithm takes very little time, and while it takes over 10 times
as long to train a neural network, that time is still not very
substantial (on the order of a few hours on a desktop
workstation). We also report the information contained in the
features we use to describe the probes. We find the fraction of
cytosine in the probe to be the most informative feature. We also
find, not surprisingly, that the nucleotides in the middle of the
probe sequence are more informative than those at the ends of the
sequence. |
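The abstract notes that the probe features are easy to compute and singles out the fraction of cytosine. A minimal sketch of that kind of feature is shown below; the function name and the example probe are hypothetical, and the full feature set used by the authors is not given in the abstract.

```python
def base_fractions(probe):
    """Fraction of each nucleotide in a probe sequence."""
    probe = probe.upper()
    return {base: probe.count(base) / len(probe) for base in "ACGT"}


# Hypothetical probe sequence.
features = base_fractions("ACCGTCCA")
cytosine_fraction = features["C"]
```

Features of this kind could be fed to any of the three learners named above, for example as inputs to a naïve Bayes classifier after discretisation.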
| |
|
AVAILABILITY: |
|
|
CONTACT: |
molla@cs.wisc.edu
|
|
ABSTRACT: |
A variety of measures have been proposed for assessing the
accuracy of sequence database search methods. One measure that has
gained wide use is the ROC score, derived from a graph of false
vs. true positives as the alignment cutoff score varies. An
interesting question is when the ROC scores of two different
methods can be said to differ significantly. Recent analytic
results concerning bootstrap resampling applied to ROC scores
provide one possible answer to this question.
We have used ROC analysis to assess a large number of possible
refinements of the original, 1997 version of PSI-BLAST. Several
modifications lead to significant or near-significant improvements
in program accuracy. The most important among these is the
incorporation of sequence-composition based statistics, which
substantially suppress the corruption of protein profiles by false
positive alignments. |
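For readers unfamiliar with ROC scores in database search, the sketch below computes a truncated ROC score (often written ROC_n) from a ranked hit list. The abstract does not say which variant was used, so the truncation at n false positives, the padding rule for short lists, and the function name are assumptions.

```python
def roc_n(ranked_labels, n):
    """Truncated ROC score of a ranked hit list.

    ranked_labels: True for a true positive, False for a false positive,
    ordered from best to worst alignment score.
    For each of the first n false positives, count the true positives
    ranked above it; normalise by n times the total number of true positives.
    """
    total_true = sum(1 for label in ranked_labels if label)
    if total_true == 0 or n <= 0:
        raise ValueError("need at least one true positive and n > 0")
    true_seen = 0
    false_seen = 0
    area = 0
    for label in ranked_labels:
        if label:
            true_seen += 1
        else:
            false_seen += 1
            area += true_seen
            if false_seen == n:
                break
    # If the list holds fewer than n false positives, treat the missing
    # ones as ranked below everything that was seen.
    area += (n - false_seen) * true_seen
    return area / (n * total_true)


# Hypothetical ranked search result.
score = roc_n([True, True, False, True, False], n=2)
```

Resampling the query set with replacement and recomputing the score for each resample is one way to obtain a bootstrap estimate of its variability.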
| |
| |
|
PRESENTATION 26 |
| |
| TITLE: |
The
degenerate primer design problem |
|
AUTHORS: |
C. Linhart and R. Shamir |
|
ABSTRACT: |
A PCR primer
sequence is called degenerate if some of its positions have
several possible bases. The degeneracy of the primer is the
number of unique sequence combinations it contains. We study the
problem of designing a pair of primers with prescribed degeneracy
that match a maximum number of given input sequences. Such
problems occur when studying a family of genes that is known only
in part, or is known in a related species. We prove that various
simplified versions of the problem are hard, show the
polynomiality of some restricted cases, and develop approximation
algorithms for one variant.
Based on these algorithms, we
implemented a program called HYDEN for designing highly-degenerate
primers for a set of genomic sequences. We report on the success
of the program in an experimental scheme for identifying all human
olfactory receptor (OR) genes. In that project, HYDEN was used to
design primers with degeneracies up to 10^10 that amplified with
high specificity many novel genes of that family, tripling the
number of OR genes known at the time. |
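The definition of degeneracy given above can be made concrete with a short sketch. It assumes primers are written with the standard IUPAC nucleotide codes; the function name and the example primers are hypothetical and unrelated to HYDEN's actual implementation.

```python
# Number of bases each IUPAC nucleotide code stands for.
IUPAC_BASE_COUNT = {
    "A": 1, "C": 1, "G": 1, "T": 1,
    "R": 2, "Y": 2, "S": 2, "W": 2, "K": 2, "M": 2,
    "B": 3, "D": 3, "H": 3, "V": 3,
    "N": 4,
}


def degeneracy(primer):
    """Number of distinct sequences a degenerate primer represents."""
    result = 1
    for code in primer.upper():
        result *= IUPAC_BASE_COUNT[code]
    return result


# "R" is A or G and "N" is any base, so this primer covers 2 * 4 sequences.
example = degeneracy("ACRTN")
```

Because the degeneracy is a product over positions, it grows exponentially with the number of ambiguous positions, which is why a prescribed bound on it makes the design problem non-trivial.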
| |
|
AVAILABILITY: |
Available on
request from the authors |
|
CONTACT: |
chaiml@tau.ac.il |
rshamir@tau.ac.il |
| |
|
PRESENTATION 27 |
| |
| TITLE: |
Splicing
graphs and EST assembly problem |
|
AUTHORS: |
S. Heber, M. Alekseyev, S.-H. Sze, H. Tang and P.A. Pevzner |
|
ABSTRACT: |
Motivation:
The traditional approach to annotating alternative splicing is to
investigate every splicing variant of the gene in a case-by-case
fashion. This approach, while useful, has some serious
shortcomings. Recent studies indicate that alternative splicing
is more frequent than previously thought and some genes may
produce tens of thousands of different transcripts. A list of
alternatively spliced variants for such genes would be difficult
to build and hard to analyze. Moreover, such a list does not show
the relationships between different transcripts and does not show
the overall structure of all transcripts. A better approach would
be to represent all splicing variants for a given gene in a way
that captures the relationships between different splicing
variants.
Results: We introduce the notion of the splicing graph,
a natural and convenient representation of all splicing
variants. The key difference with the existing approaches is that
we abandon the linear (sequence) representation of each transcript
and substitute it with a graph representation where each
transcript corresponds to a path in the graph. We further design
an algorithm to assemble EST reads into the splicing graph rather
than assembling them into each splicing variant in a case-by-case
fashion. |
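The representation described above, in which each transcript corresponds to a path in a graph, can be sketched in a few lines. This toy version assumes transcripts are already given as ordered lists of exon labels; it does not perform EST assembly, and the function names and exon labels are hypothetical.

```python
from collections import defaultdict


def build_splicing_graph(transcripts):
    """Directed graph whose vertices are exons and whose edges join
    exons that are adjacent in at least one transcript."""
    graph = defaultdict(set)
    for transcript in transcripts:
        for exon in transcript:
            graph[exon]  # make sure every exon appears as a vertex
        for upstream, downstream in zip(transcript, transcript[1:]):
            graph[upstream].add(downstream)
    return dict(graph)


def is_path(graph, transcript):
    """True if the exon list follows edges of the splicing graph."""
    return all(exon in graph for exon in transcript) and all(
        downstream in graph[upstream]
        for upstream, downstream in zip(transcript, transcript[1:])
    )


# Two hypothetical variants of one gene; the second skips exon e2.
variants = [["e1", "e2", "e3"], ["e1", "e3"]]
splicing_graph = build_splicing_graph(variants)
```

Both variants share the vertices e1 and e3, so the exon-skipping event shows up as a single extra edge rather than as a second, mostly redundant sequence. Note that such a graph can also contain paths that match no observed transcript.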
| |
|
AVAILABILITY: |
http://www-cse.ucsd.edu/groups/bioinformatics/software.html |
|
CONTACT: |
sheber@ucsd.edu |
|