PRESENTATION 1
 
TITLE: TBA
AUTHORS: Michael Ashburner
 

PRESENTATION 2

 
TITLE:

Mining viral protease data to extract cleavage knowledge

AUTHORS: A. Narayanan, X. Wu and Z. Rong Yang
ABSTRACT:

Motivation: The motivation is to identify, through machine learning techniques, specific patterns in HIV and HCV viral polyprotein amino acid residues where viral protease cleaves the polyprotein as it leaves the ribosome. An understanding of viral protease specificity may help the development of future anti-viral drugs involving protease inhibitors by identifying specific features of protease activity for further experimental investigation. While viral sequence information is growing at a fast rate, there is still comparatively little understanding of how viral polyproteins are cut into their functional unit lengths. The aim of the work reported here is to investigate whether it is possible to generalise from known cleavage sites to unknown cleavage sites for two specific viruses - HIV and HCV. An  understanding of proteolytic activity for specific viruses will contribute to our understanding of viral protease function in general, thereby leading to a greater understanding of protease families and their substrate characteristics.
Results:  Our results show that artificial neural networks and symbolic learning techniques (See5) capture some fundamental and new substrate attributes, but neural networks outperform their symbolic counterpart. Publicly available software was used:
- Stuttgart Neural Network Simulator;
http://www-ra.informatik.uni-tuebingen.de/SNNS/

- See5;
http://www.rulequest.com

 
AVAILABILITY: The datasets used (HIV, HCV) for See5 are available at:
http://www.dcs.ex.ac.uk/~anarayan/bioinf/ismbdatasets/
CONTACT:

a.narayanan@ex.ac.uk

z.r.yang@ex.ac.uk
 

PRESENTATION 3

 

TITLE:

The metric space of proteins - comparative study of clustering algorithms

AUTHORS: O. Sasson, N. Linial, M. Linial
ABSTRACT:

Motivation: A large fraction of biological research concentrates on individual proteins and on small families of proteins. One of the current major challenges in bioinformatics is to extend our knowledge also to very large sets of proteins. Several major projects have tackled this problem. Such undertakings usually start with a process that clusters all known proteins or large subsets of this space. Some work in this area is carried out automatically, while other attempts incorporate expert advice and annotation.
Results: We propose a novel technique that automatically clusters protein sequences. We consider all proteins in SWISSPROT, and carry out an all-against-all BLAST similarity test among them. With this similarity measure in hand we proceed to perform a continuous bottom-up clustering process by applying alternative rules for merging clusters. The outcome of this clustering process is a classification of the input proteins into a hierarchy of clusters of varying degrees of granularity. Here we compare the clusters that result from alternative merging rules, and validate the results against InterPro. Our preliminary results show that clusters that are consistent with several rather than a single merging rule tend to comply with InterPro annotation. This is an affirmation of the view that the protein space consists of families that differ markedly in their evolutionary conservation.
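The idea of comparing clusterings produced by alternative merging rules can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy scores, the protein names A to D, and the choice of arithmetic and geometric averaging as the two rules are hypothetical and are not taken from the abstract.

```python
from itertools import combinations
from statistics import mean, geometric_mean


def cluster(scores, rule):
    """Bottom-up clustering that records every merge.

    scores: dict mapping frozenset({a, b}) to a dissimilarity (smaller means
            more similar), standing in for an all-against-all BLAST score.
    rule:   function reducing the list of pairwise scores between two
            clusters to a single number (the merging rule).
    """
    clusters = {frozenset([p]) for pair in scores for p in pair}
    merges = []

    def linkage(pair):
        a, b = pair
        return rule([scores[frozenset([x, y])] for x in a for y in b])

    while len(clusters) > 1:
        # Sort first so that ties are broken the same way on every run.
        ordered = sorted(clusters, key=sorted)
        a, b = min(combinations(ordered, 2), key=linkage)
        clusters -= {a, b}
        clusters.add(a | b)
        merges.append(a | b)
    return merges


# Hypothetical toy scores for four proteins A-D (not real BLAST output).
toy = {"AB": 0.5, "AC": 1, "AD": 6, "BC": 16, "BD": 6, "CD": 7}
scores = {frozenset(k): v for k, v in toy.items()}

by_arithmetic = cluster(scores, mean)
by_geometric = cluster(scores, geometric_mean)
# Clusters that appear under both merging rules.
consistent = set(by_arithmetic) & set(by_geometric)
```

In this toy case the two rules disagree on the second merge, so only the pair {A, B} and the root cluster survive in `consistent`. Clusters of this kind are the ones the abstract reports as tending to agree with InterPro.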

 
AVAILABILITY:

The outcome of these investigations can be viewed in an interactive Web site at: http://www.protonet.cs.huji.ac.il
Supplementary information: Biological examples for comparing the performance of the different algorithms used for classification are presented in: http://www.protonet.cs.huji.ac.il/examples.html.

 
CONTACT:

ori@cs.huji.ac.il

 
PRESENTATION 4
 
TITLE:

DNA sequence and structure: direct and indirect recognition in protein-DNA binding

AUTHORS:

N.F. Steffen, S.D. Murphy, L. Tolleri, G.W. Hatfield, R.H. Lathrop

ABSTRACT:

Motivation: Direct recognition, or direct readout, of DNA bases by a DNA-binding protein involves amino acids that interact directly with features specific to each base.  Experimental evidence also shows that in many cases the protein achieves partial sequence specificity by indirect recognition, i.e., by recognizing structural properties of the DNA.  (1) Could threading a DNA sequence onto a crystal structure of bound DNA help explain the indirect recognition component of sequence specificity? (2) Might the resulting pure-structure computational motif manifest itself in familiar sequence-based computational motifs?
Results: The starting structure motif was a crystal structure of DNA bound to the integration host factor protein (IHF) of E. coli. IHF is known to exhibit both direct and indirect recognition of its binding sites. (1) Threading DNA sequences onto the crystal structure showed statistically significant partial separation of 60 IHF binding sites from random and intragenic sequences and was positively correlated with binding affinity. (2) The crystal structure was shown to be equivalent to a linear Markov network, and so, to a joint probability distribution over sequences, computable in linear time. It was transformed algorithmically into several common pure-sequence representations, including (a) small sets of short exact strings, (b) weight matrices, (c) consensus regular patterns, (d) multiple sequence alignments, and (e) phylogenetic trees. In all cases the pure-sequence motifs retained statistically significant partial separation of the IHF binding sites from random and intragenic sequences. Most exhibited positive correlation with binding affinity. The multiple alignment showed some conserved columns, and the phylogenetic tree partially mixed low-energy sequences with IHF binding sites but separated high-energy sequences. The conclusion is that deformation energy explains part of indirect recognition, which explains part of IHF sequence-specific binding.

 
AVAILABILITY: Code and data on request.
CONTACT:
Nick Steffen for code: nsteffen@uci.edu
Lorenzo Tolleri for data: Tolleri@chiron.it
 
PRESENTATION 5
 
TITLE:

Beyond tandem repeats: complex pattern structures and distant regions of similarity

AUTHORS: A.M. Hauth, D.A. Joseph
ABSTRACT:

Motivation:  Tandem repeats (TRs) are associated with human disease, play a role in evolution and are important in regulatory processes.  Despite their importance, locating and characterizing these patterns within anonymous DNA sequences remains a challenge.  In part, the difficulty is due to imperfect conservation of patterns and complex pattern structures.  We study recognition algorithms for two complex pattern structures: variable length tandem repeats (VLTRs) and multi-period tandem repeats (MPTRs).
Results: We extend previous algorithmic research to a class of regular tandem repeats (RegTRs). We formally define RegTRs, as well as two important subclasses: VLTRs and MPTRs. We present algorithms for identification of TRs in these classes. Furthermore, our algorithms identify degenerate VLTRs and MPTRs: repeats containing substitutions, insertions and deletions. To illustrate our work, we present results of our analysis for two difficult regions in cattle and human which reflect practical occurrences of these subclasses in GenBank sequence data. In addition, we show the applicability of our algorithmic techniques for identifying Alu sequences, gene clusters and other distant regions of similarity. We illustrate this with an example from yeast chromosome I.
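For orientation, the simplest case that these algorithms generalise, an exact tandem repeat with a single fixed period, can be located as sketched below in Python. This is not the authors' algorithm: it handles neither variable-length nor multi-period repeats, nor any substitutions, insertions or deletions, and the function name, parameters and example sequence are hypothetical.

```python
def tandem_repeats(seq, max_period=6, min_copies=2):
    """Return (start, period, copies) for exact tandem repeats in seq.

    A repeat with period p is a run in which seq[k] == seq[k + p].
    Units that are themselves repetitions of a shorter unit are skipped,
    so "ACACACAC" is reported with period 2 but not with period 4.
    """
    hits = []
    for period in range(1, max_period + 1):
        i = 0
        while i + 2 * period <= len(seq):
            # Extend while the sequence matches itself shifted by `period`.
            j = i
            while j + period < len(seq) and seq[j] == seq[j + period]:
                j += 1
            copies = (j - i) // period + 1
            if copies >= min_copies:
                unit = seq[i:i + period]
                # A string is a power of a shorter string exactly when it
                # occurs inside its own doubling with both ends trimmed.
                if unit not in (unit + unit)[1:-1]:
                    hits.append((i, period, copies))
                i = j + 1
            else:
                i += 1
    return hits


# Hypothetical example sequence: GG, then four copies of AC, then TTT.
found = tandem_repeats("GGACACACACTTT", max_period=4)
```

Each hit gives the zero-based start, the period and the number of complete copies. Extending this to degenerate repeats requires approximate matching, which is where the abstract's contribution lies.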
 

 
AVAILABILITY: Algorithms can be accessed at:
http://www.cs.wisc.edu/areas/theory
 
CONTACT: Amy M. Hauth
kryder@cs.wisc.edu
608-831-2164
Deborah A. Joseph
joseph@cs.wisc.edu
608-262-8022
FAX: 608-262-9777

 
PRESENTATION 6
 
TITLE:

The POPPs: clustering and searching using peptide probability profiles

AUTHORS:

M.J. Wise

ABSTRACT:

The POPPs is a suite of inter-related software tools which allow the user to discover what is statistically "unusual" in the composition of an unknown protein, or to automatically cluster proteins into families based on peptide composition.  Finally, the user can search for related proteins based on peptide composition.  Statistically based peptide composition provides a view of proteins that is, to some extent, orthogonal to that provided by sequence.  In a test study, the POPP suite is able to regroup into their families sets of approximately 100 randomised Pfam protein domains. 
The POPPs suite is used to explore the diverse set of late embryogenesis abundant (LEA) proteins.

 
AVAILABILITY: Contact the author
CONTACT: mw263@cam.ac.uk
 
PRESENTATION 7
 
TITLE: Combining bioinformatics and biophysics to understand protein structure and function
AUTHORS: Barry Honig
ABSTRACT: The increasing numbers of proteins whose three-dimensional structures have been determined will have a major impact on the ability to exploit genomic data. We have been developing a series of computational tools with the goal of detecting relationships among amino acid sequence, protein structure and protein function. In this context, recent computational advances in using structure to improve sequence alignments, in homology model building and in the calculation of binding affinities will be summarized, as will their combined use towards an understanding of the principles of protein-protein recognition.

Our basic approach involves calculating protein folding and binding free energies as well as contributions of individual amino acids to these free energies, and correlating these energetic contributions with sequence patterns and with physical and chemical properties of the protein. The factors that determine binding free energies will be discussed with particular emphasis on understanding how protein interfaces are designed, in a structural sense, to exploit different combinations of electrostatic and hydrophobic interactions to achieve both affinity and specificity.

An important feature in any attempt to compare the properties of different proteins is to obtain an accurate sequence alignment and, when possible, an accurate structure alignment as well. We have developed a novel hybrid approach that uses both sequence and structural alignments to generate high quality sequence profiles for protein families. Application of such alignments, when combined with calculations of binding affinities, yields a new method for predicting binding specificity from sequence alone. An application to the peptide binding specificity of SH2 domains will be described.

With regard to structure prediction, we have made a number of advances that we believe will significantly improve the accuracy of homology modeling. Specifically, we have succeeded in predicting the conformation of buried side chains to close to experimental accuracy, in obtaining extremely high loop prediction accuracies and in refining modeled structures so as to more closely resemble the correct structure than the original template. In addition, our new alignment procedure and our calculations of folding free energies have led to the development of a new and accurate procedure to evaluate homology models. These and other developments have been incorporated into new homology modeling software that will be described.

   
 
PRESENTATION 8
 
TITLE:

A sequence-profile-based HMM for predicting and discriminating beta barrel membrane proteins

AUTHORS:

P.L. Martelli, P. Fariselli, A. Krogh, R. Casadio

ABSTRACT:

Motivation: Membrane proteins are an abundant and functionally relevant subset of proteins that putatively include from about 15 up to 30% of the proteome of organisms fully sequenced. These estimates are mainly computed on the basis of sequence comparison and membrane protein prediction. It is therefore urgent to develop methods capable of selecting membrane proteins especially in the case of outer membrane proteins, barely taken into consideration when proteome wide analysis is performed. This will also help protein annotation when no homologous sequence is found in the database. Outer membrane proteins solved so far at atomic resolution interact with the external membrane of bacteria with a characteristic beta barrel structure comprising different even numbers of beta strands (beta barrel membrane proteins). In this they differ from the membrane proteins of the cytoplasmic membrane endowed with alpha helix bundles (all alpha membrane proteins) and need specialised predictors.
Results: We develop an HMM that uses evolutionary information as input to predict the topology of beta barrel membrane proteins. The model is cyclic with 6 types of states: two for the beta strand transmembrane core, one for the beta strand cap on either side of the membrane, one for the inner loop, one for the outer loop and one for the globular domain state in the middle of each loop. The development of a specific input for HMM based on multiple sequence alignment is novel. The accuracy per residue of the model is 82% when a jack knife procedure is adopted. With a model optimisation method using a dynamic programming algorithm, seven topological models out of the twelve proteins included in the testing set are also correctly predicted. When used as a discriminator, the model is rather selective. At a fixed probability value, it retains 84% of a non-redundant set comprising 145 sequences of well-annotated outer membrane proteins. Concomitantly, it correctly rejects 90% of a set of globular proteins including about 1200 chains with low sequence identity (< 30%) and 90% of a set of all alpha membrane proteins, including 188 chains.

 
AVAILABILITY: The program will be available on request from the authors.
CONTACT:

gigi@lipid.biocomp.unibo.it

www.biocomp.unibo.it
 
PRESENTATION 9
 
TITLE:

Fully automated ab initio protein structure prediction using I-SITES, HMMSTR and ROSETTA

AUTHORS:

C. Bystroff and Y. Shao

ABSTRACT:

Motivation: The Monte Carlo fragment insertion method for protein tertiary structure prediction (ROSETTA) of Baker and others has been merged with the I-SITES library of sequence structure motifs and the HMMSTR model for local structure in proteins to form a new public server for the ab initio prediction of protein structure. The server performs several tasks in addition to tertiary structure prediction, including a database search, amino acid profile generation, fragment structure prediction, and backbone angle and secondary structure prediction. Meeting reasonable service goals required improvements in efficiency, in particular for the ROSETTA algorithm.
Results: The new server was used for blind predictions of 40 protein sequences as part of the CASP4 blind structure prediction experiment. The results for 31 of those predictions are presented here.  61% of the residues overall were found in topologically correct predictions, which are defined as fragments of 30 residues or more with a root-mean-square deviation in superimposed alpha carbons of less than 6Å. HMMSTR 3-state secondary structure predictions were 73% correct overall. Tertiary structure predictions did not improve the accuracy of secondary structure prediction.
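The criterion for a topologically correct prediction (a fragment of 30 residues or more with an alpha-carbon RMSD below 6 Å) can be written down directly. The Python sketch below assumes that the two coordinate sets have already been optimally superimposed, which is the hard part and is not shown. The function names and the toy coordinates are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two already superimposed traces."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty traces of equal length")
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


def topologically_correct(predicted, native, min_length=30, max_rmsd=6.0):
    """Apply the fragment-length and RMSD thresholds quoted in the abstract."""
    return len(predicted) >= min_length and rmsd(predicted, native) < max_rmsd


# Hypothetical alpha-carbon traces: 30 residues on a line, 3.8 Å apart.
native = [(3.8 * i, 0.0, 0.0) for i in range(30)]
close = [(x, y + 3.0, z) for x, y, z in native]  # every atom 3 Å away
far = [(x, y + 7.0, z) for x, y, z in native]    # every atom 7 Å away
```

Here `close` passes the test and `far` does not. A 10-residue slice of `close` would also fail, because it is shorter than the minimum fragment length.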

 
AVAILABILITY:

The server is accessible through the web at
http://isites.bio.rpi.edu/hmmstr/index.html 
Programs are available upon request for academics. Licensing agreements are available for commercial interests.
Supplementary information:
http://isites.bio.rpi.edu
http://predictioncenter.llnl.gov/casp4/

 
CONTACT: bystrc@rpi.edu shaoy@rpi.edu
 
PRESENTATION 10
 
TITLE:

Prediction of contact maps by GIOHMMs and recurrent neural networks using lateral propagation from all four cardinal corners

AUTHORS:

G. Pollastri, P. Baldi

ABSTRACT:

Motivation: Accurate prediction of protein contact maps is an important step in computational structural proteomics. Because contact maps provide a translation and rotation invariant topological representation of a protein, they can be used as a fundamental intermediary step in protein structure prediction.
Results: We develop a new set of flexible machine learning architectures for the prediction of contact maps, as well as other information processing and pattern recognition tasks. The architectures can be viewed as recurrent neural network parameterizations of a class of Bayesian networks we call generalized input-output HMMs.  For the specific case of contact maps, contextual information is propagated laterally through four hidden planes, one for each cardinal corner. We show that these architectures can be trained from examples and yield contact map predictors that outperform previously reported methods. While several extensions and improvements are in progress, the current version can accurately predict 60.5% of contacts at a distance cutoff of 8 Å and 45% of distant contacts at 10 Å, for proteins of length up to 300.
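To make the prediction target concrete, the Python sketch below builds a contact map from three-dimensional coordinates at a distance cutoff and checks that it is unchanged by a rotation and translation of the structure. It shows only how a contact map is defined, not the GIOHMM predictor. The coordinates, the helper names and the strict "less than" comparison are assumptions.

```python
import math


def contact_map(coords, cutoff=8.0):
    """Boolean matrix: True where two residues lie closer than `cutoff` Å."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) < cutoff for j in range(n)]
            for i in range(n)]


def rotate_and_shift(coords, shift=(10.0, -5.0, 2.0)):
    """Rotate 90 degrees about the z axis, then translate by `shift`."""
    return [(-y + shift[0], x + shift[1], z + shift[2]) for x, y, z in coords]


# Hypothetical coordinates: four residues on a straight line, 3.8 Å apart.
residues = [(3.8 * i, 0.0, 0.0) for i in range(4)]
original = contact_map(residues)
moved = contact_map(rotate_and_shift(residues))
```

Because the map depends only on pairwise distances, `original` and `moved` are identical. This is the invariance that makes contact maps a convenient intermediate representation.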

 
AVAILABILITY:

The contact map predictor will be made available through
http://promoter.ics.uci.edu/BRNN-PRED/
as part of an existing suite of proteomics predictors.

 
CONTACT:  gpollast@ics.uci.edu  pfbaldi@ics.uci.edu
 
PRESENTATION 11
 
TITLE:

Rate4Site: an algorithmic tool for the identification of functional regions on proteins by surface mapping of evolutionary determinants within their homologues

AUTHORS:

T. Pupko, R.E. Bell, I. Mayrose, F. Glaser, N. Ben-Tal

ABSTRACT:

Motivation: A number of proteins of known three-dimensional (3D) structure exist, with yet unknown function. In light of the recent progress in structure determination methodology, this number is likely to increase rapidly. A novel method is presented here: "Rate4Site", which maps the rate of evolution among homologous proteins onto the molecular surface of one of the homologues whose 3D-structure is known. Functionally important regions correspond to surface patches of slowly evolving residues.
Results: Rate4Site estimates the rate of evolution of amino acid sites using the maximum likelihood (ML) principle. The ML estimate of the rates considers the topology and branch lengths of the phylogenetic tree, as well as the underlying stochastic process. To demonstrate its potency, we study the Src SH2 domain. Like previously established methods, Rate4Site detected the SH2 peptide-binding groove. Interestingly, it also detected inter-domain interactions between the SH2 domain and the rest of the Src protein that other methods failed to detect.

 
AVAILABILITY: Rate4Site can be downloaded at: http://ashtoret.tau.ac.il/
Supplementary Information: multiple sequence alignment of homologous domains from the SH2 protein family, the corresponding phylogenetic tree and additional examples are available at:
http://ashtoret.tau.ac.il/~rebell
 
CONTACT:

tal@ism.ac.jp
rebell@ashtoret.tau.ac.il

fabian@ashtoret.tau.ac.il
bental@ashtoret.tau.ac.il
 
PRESENTATION 12
 
TITLE:

Inferring sub-cellular localization through automated lexical analysis

AUTHORS:

R. Nair and B. Rost

ABSTRACT:

Motivation: The SWISS-PROT sequence database contains keywords of functional annotations for many proteins. In contrast, information about the sub-cellular localization is only available for few proteins. Experts can often infer localization from keywords describing protein function. We developed LOCkey, a fully automated method for lexical analysis of SWISS-PROT keywords that assigns sub-cellular localization. With the rapid growth in sequence data, the biochemical characterisation of sequences has been falling behind. Our method may be a useful tool for supplementing functional information already automatically available.
Results: The method reached a level of more than 82% accuracy in a full cross-validation test. Due to a lack of functional annotations, we could infer localization for less than half of all proteins in SWISS-PROT. We applied LOCkey to annotate five entirely sequenced proteomes, namely Saccharomyces cerevisiae (yeast), Caenorhabditis elegans (worm), Drosophila melanogaster (fly), Arabidopsis thaliana (plant) and a subset of all human proteins. LOCkey found about 8000 new annotations of sub-cellular localization for these eukaryotes.

 
AVAILABILITY:

Annotations of localization for eukaryotes at: 
 http://cubic.bioc.columbia.edu/services/LOCkey

CONTACT:

rost@columbia.edu

 
PRESENTATION 13
 
TITLE: Pattern Discovery
AUTHORS: Isidore Rigoutsos
 
PRESENTATION 14
 
TITLE:

Support vector regression applied to the determination of the developmental age of a Drosophila embryo from its segmentation gene expression patterns

AUTHORS:

E. Myasnikova, A. Samsonova, M. Samsonova and J. Reinitz

ABSTRACT:

Motivation: In this paper we address the problem of the determination of developmental age of an embryo from its segmentation gene expression patterns in Drosophila.
Results: By applying support vector regression we have developed a fast method for automated staging of an embryo on the basis of its gene expression pattern. Support vector regression is a statistical method for creating regression functions of arbitrary type from a set of training data. The training set is composed of embryos for which the precise developmental age was determined by measuring the degree of membrane invagination.  Testing the quality of regression on the training set showed good prediction accuracy.  The optimal regression function was then used for the prediction of the gene expression based age of embryos in which the precise age has not been measured by membrane morphology.  Moreover, we show that the same accuracy of prediction can be achieved when the dimensionality of the feature vector was reduced by applying factor analysis.  The data reduction allowed us to avoid over-fitting and to increase the efficiency of the algorithm.

 
AVAILABILITY: This software may be obtained from the authors.
CONTACT:

samson@fn.csa.ru

 
PRESENTATION 15
 
TITLE:

Variance stabilization applied to microarray data calibration and to the quantification of differential expression

AUTHORS:

W. Huber, A. von Heydebreck, H. Sueltmann, A. Poustka and M. Vingron

ABSTRACT:

We introduce a statistical model for microarray gene expression data that comprises data calibration, the quantification of differential expression, and the quantification of measurement error. In particular, we derive a transformation h for intensity measurements, and a difference statistic Δh whose variance is approximately constant along the whole intensity range. This forms a basis for statistical inference from microarray data, and provides a rational data pre-processing strategy for multivariate analyses. For the transformation h, the parametric form h(x) = arsinh(a+bx) is derived from a model of the variance-versus-mean dependence for microarray intensity data, using the method of variance stabilizing transformations. For large intensities, h coincides with the logarithmic transformation, and Δh with the log-ratio. The parameters of h together with those of the calibration between experiments are estimated with a robust variant of maximum-likelihood estimation.
We demonstrate our approach on data sets from different experimental platforms, including two-color cDNA arrays and a series of Affymetrix oligonucleotide arrays.
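The behaviour of the transformation h(x) = arsinh(a+bx) can be checked numerically. In the Python sketch below the parameter values a = 0 and b = 1 are placeholders chosen for illustration. In the abstract they are estimated from the data by robust maximum likelihood, which is not reproduced here.

```python
import math


def h(x, a=0.0, b=1.0):
    """Variance-stabilizing transformation h(x) = arsinh(a + b*x)."""
    return math.asinh(a + b * x)


def delta_h(x1, x2, a=0.0, b=1.0):
    """Difference statistic between two intensity measurements."""
    return h(x1, a, b) - h(x2, a, b)


# For large intensities the difference approaches the log-ratio, because
# arsinh(y) = ln(y + sqrt(y*y + 1)) is close to ln(2*y) when y is large.
high = delta_h(20000.0, 10000.0)
log_ratio = math.log(20000.0 / 10000.0)

# Near zero the two diverge: h stays finite at x = 0, where log does not,
# and the same two-fold change gives a smaller difference.
low = delta_h(2.0, 1.0)
```

With these placeholder parameters `high` agrees with `log_ratio` to better than six decimal places, while `low` is noticeably smaller than ln 2. The transformation compresses differences at low intensities, where measurement noise dominates.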

 
AVAILABILITY:

Software is freely available for academic use as an R package at
http://www.dkfz.de/abt0840/whuber

 
CONTACT: w.huber@dkfz.de
 
PRESENTATION 16
 
TITLE: A variance-stabilizing transformation for gene-expression microarray data
AUTHORS:

B.P. Durbin, J.S. Hardin, D.M. Hawkins, D.M. Rocke

ABSTRACT:

Motivation: Standard statistical techniques often assume that data are normally distributed, with constant variance not depending on the mean of the data.  Data that violate these assumptions can often be brought in line with the assumptions by application of a transformation.  Gene-expression microarray data have a complicated error structure, with a variance that changes with the mean in a non-linear fashion.  Log transformations, which are often applied to microarray data, can inflate the variance of observations near background.
Results: We introduce a transformation that stabilizes the variance of microarray data across the full range of expression.  Simulation studies also suggest that this transformation approximately symmetrizes microarray data.

 
AVAILABILITY:  
CONTACT:  bpdurbin@wald.ucdavis.edu
 
PRESENTATION 17
 
TITLE:

Binary tree-structured vector quantization approach to clustering and visualizing microarray data

AUTHORS:

M. Sultan, D. Wigle, C.A. Cumbaa, M. Maziarz, J. Glasgow, M.S. Tsao and I. Jurisica

ABSTRACT:

Motivation: With the increasing number of gene expression databases, the need for more powerful analysis and visualization tools is growing.  Many techniques have successfully been applied to unravel latent similarities among genes and/or experiments.  Most of the current systems for microarray data analysis use statistical methods, hierarchical clustering, self-organizing maps, support vector machines, or k-means clustering to organize genes or experiments into "meaningful" groups.  Without prior explicit bias almost all of these clustering methods applied to gene expression data not only produce different results, but may also produce clusters with little or no biological relevance. Of these methods, agglomerative hierarchical clustering has been the most widely applied, although many limitations have been identified.
Results: Starting with a systematic comparison of the underlying theories behind clustering approaches, we have devised a technique that combines tree-structured vector quantization and partitive k-means clustering (BTSVQ). This hybrid technique has revealed clinically relevant clusters in three large publicly available data sets.  In contrast to existing systems, our approach is less sensitive to data preprocessing and data normalization. In addition, the clustering results produced by the technique have strong similarities to those of self-organizing maps (SOMs).  We discuss the advantages and the mathematical reasoning behind our approach.

 
AVAILABILITY:

The BTSVQ system is implemented in Matlab R12 using the SOM toolbox for the visualization and preprocessing of the data:
http://www.cis.hut.fi/projects/somtoolbox/
BTSVQ is available for non-commercial use:
http://www.uhnres.utoronto.ca/ta3/BTSVQ

 
CONTACT: 

ij@uhnres.utoronto.ca

 
PRESENTATION 18
 
TITLE:

Linking gene expression data with patient survival times using partial least squares

AUTHORS:

P.J. Park, L. Tian, I.S. Kohane

ABSTRACT:

There is an increasing need to link the large amount of genotypic data, gathered using microarrays for example, with various phenotypic data from patients.  The classification problem in which gene expression data serve as predictors and a class label phenotype as the binary outcome variable has been examined extensively, but there has been less emphasis in dealing with other types of phenotypic data.  In particular, patient survival times with censoring are often not used directly as a response variable due to the complications that arise from censoring. We show that the issues involving censored data can be circumvented by reformulating the problem as a standard Poisson regression problem.  The procedure for solving the transformed problem is a combination of two approaches: partial least squares, a regression technique that is especially effective when there is severe collinearity due to a large number of predictors, and generalized linear regression, which extends standard linear regression to deal with various types of response variables.  The linear combinations of the original variables identified by the method are highly correlated with the patient survival times and at the same time account for the variability in the covariates.  The algorithm is fast, as it does not involve any matrix decompositions in the iterations. 
We apply our method to data sets from lung carcinoma and diffuse large B-cell lymphoma studies to verify its effectiveness.

 
AVAILABILITY:  
CONTACT:  peter_park@harvard.edu
 
PRESENTATION 19
 
TITLE: Gene trees, genome trees and the universal tree.
AUTHORS: Ford Doolittle
ABSTRACT:

For several decades, molecular phylogeneticists laboured under the illusion that the topology of an accurately constructed tree for a single gene, that encoding small-subunit ribosomal RNA (SSU rRNA), could be taken as the topology of the "universal tree" relating all organisms, back to a last common ancestor that might have lived more than 3.5 billion years ago. In this they may have been naive, since we now know that many protein-coding genes give trees different than SSU rRNA, or than each other. Sometimes this is artifact, but often it reflects genuinely different histories, the consequence of frequent lateral gene transfer. There may be a small core of genes that have never been exchanged and thus show the true organismal history, but this is very hard to prove. I will review the practical and theoretical implications of this, for the genomes of prokaryotes and simple eukaryotes.
 

   
 
PRESENTATION 20
 
TITLE:

Microarray synthesis through multiple-use PCR primer design

AUTHORS:

R.J. Fernandes, S.S. Skiena

ABSTRACT:

A substantial percentage of the expense in constructing full-genome spotted microarrays comes from the cost of synthesizing the PCR primers to amplify the desired DNA.  We propose a computationally-based method to substantially reduce this cost.  Historically, PCR primers are designed so that each primer occurs uniquely in the genome.  This condition is unnecessarily strong for selective amplification, since only the primer pair associated with each amplification need be unique. 
We demonstrate that careful design in a genome-level amplification project permits us to save the cost of several thousand primers over conventional approaches.

 
AVAILABILITY:  
CONTACT: skiena@cs.sunysb.edu rohan@cs.sunysb.edu
 
PRESENTATION 21
 
TITLE: Discovering statistically significant biclusters in gene expression data
AUTHORS: A. Tanay, R. Sharan and R. Shamir
ABSTRACT:

In gene expression data, a bicluster is a subset of the genes exhibiting consistent patterns over a subset of the conditions. We propose a new method to detect significant biclusters in large expression datasets. Our approach is graph theoretic coupled with statistical modeling of the data.  Under plausible assumptions, our algorithm is polynomial and is guaranteed to find the most significant biclusters. 
We tested our method on a collection of yeast expression profiles and on a human cancer dataset.  Cross validation results show high specificity in assigning function to genes based on their biclusters, and we are able to annotate in this way 196 uncharacterized yeast genes. We also demonstrate how the biclusters lead to detecting new concrete biological associations.  In cancer data we are able to detect and relate finer tissue types than was previously possible.  We also show that the method outperforms the biclustering algorithm of Cheng and Church (2000).

 
AVAILABILITY: www.cs.tau.ac.il/~rshamir/biclust.html
CONTACT: amos@tau.ac.il 
roded@tau.ac.il
rshamir@tau.ac.il
 
PRESENTATION 22
 
TITLE:

Co-clustering of biological networks and gene expression data

AUTHORS:

D. Hanisch, A. Zien, R. Zimmer, T. Lengauer

ABSTRACT:

Motivation: Large-scale gene expression data are often analyzed by clustering genes based on expression data alone, even though a priori knowledge in the form of biological networks is available. The use of this additional information promises to improve exploratory analysis considerably.
Results: We propose to construct a distance function which combines information from expression data and biological networks. Based on this function, we compute a joint clustering of genes and vertices of the network. This general approach is elaborated for metabolic networks. We define a graph distance function on such networks and combine it with a correlation-based distance function for gene expression measurements. A hierarchical clustering and an associated statistical measure are computed to arrive at a reasonable number of clusters. Our method is validated using expression data of the yeast diauxic shift. The resulting clusters are easily interpretable in terms of the biochemical network and the gene expression data, and suggest that our method is able to automatically identify processes that are relevant under the measured conditions.
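
One way to picture a distance that mixes network and expression information is the sketch below. The abstract does not say how the two distances are combined, so the weighted sum, the hop cap `max_hops`, the `weight` parameter, and all function names are assumptions chosen for illustration only.

```python
from collections import deque
from math import sqrt


def graph_distance(adj, source, target):
    """Breadth-first hop count between two vertices; None if unreachable."""
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, hops = queue.popleft()
        if node == target:
            return hops
        for neighbour in adj.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, hops + 1))
    return None


def correlation_distance(x, y):
    """1 - Pearson correlation, in [0, 2]. Profiles are assumed non-constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)


def combined_distance(adj, profiles, a, b, weight=0.5, max_hops=4):
    """Hypothetical combination: weighted sum of two distances scaled to [0, 1]."""
    hops = graph_distance(adj, a, b)
    network_part = 1.0 if hops is None else min(hops, max_hops) / max_hops
    expression_part = correlation_distance(profiles[a], profiles[b]) / 2.0
    return weight * network_part + (1.0 - weight) * expression_part


# Toy network and expression profiles (hypothetical).
adj = {"g1": ["g2"], "g2": ["g1", "g3"], "g3": ["g2"]}
profiles = {"g1": [1, 2, 3], "g2": [2, 4, 6], "g3": [3, 2, 1]}
print(combined_distance(adj, profiles, "g1", "g2"))
print(combined_distance(adj, profiles, "g1", "g3"))
```

A pairwise distance of this kind could then be passed to any standard hierarchical clustering routine; the cluster-number criterion mentioned in the abstract is not modelled here.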

 
AVAILABILITY:  
CONTACT: Daniel.Hanisch@scai.fhg.de
 
PRESENTATION 23
 
TITLE: Statistical process control for large scale microarray experiments
AUTHORS:

F. Model, T. Konig, C. Piepenbrock, P. Adorjan

ABSTRACT:

Motivation: Maintaining and controlling data quality is a key problem in large-scale microarray studies.  In particular, systematic changes in experimental conditions across multiple chips can seriously affect quality and even lead to false biological conclusions.  Traditionally, the influence of these effects can only be minimized by expensive repeated measurements, because a detailed understanding of all process-relevant parameters seems impossible.
Results: We introduce a novel method for microarray process control that estimates quality solely based on the distribution of the actual measurements, without requiring repeated experiments.  A robust version of principal component analysis detects single outlier microarrays, which is what makes techniques from multivariate statistical process control applicable.  In particular, the T² control chart reliably tracks undesired changes in process-relevant parameters.  This can be used to improve the microarray process itself, limit necessary repetitions to affected samples, and thereby maintain quality in a cost-effective way.  We demonstrate the power of the approach on three large sets of DNA methylation microarray data.
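
The statistic plotted on a T-squared control chart (Hotelling's T²) can be sketched for the simplest case of two quality features per chip. This is a minimal illustration under stated assumptions: the two-feature restriction, the function names, and the toy reference data are hypothetical, and the robust principal component step and the choice of a control limit described in the abstract are omitted.

```python
def mean_and_cov_2d(points):
    """Sample mean and covariance (sxx, sxy, syy) of 2-D reference points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def t_squared(point, mean, cov):
    """Hotelling's T^2: (x - mean)^T S^-1 (x - mean), written out for 2x2 S."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy  # assumed non-zero (non-degenerate reference set)
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det


# Reference chips measured under controlled conditions (toy values).
reference = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
mean, cov = mean_and_cov_2d(reference)

# Each new chip gets one T^2 value; plotting these in run order gives the chart.
for chip in [(1.0, 1.0), (3.0, 1.0)]:
    print(chip, t_squared(chip, mean, cov))
```

Chips whose T² value exceeds a chosen control limit would be flagged for repetition; how that limit is set is not specified in the abstract and is left out here.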

 
AVAILABILITY:  
CONTACT:

Fabian.Model@epigenomics.com

 
PRESENTATION 24
 
TITLE:

Evaluating machine learning approaches for aiding probe selection for gene-expression arrays

AUTHORS: J.B. Tobler, M.N. Molla, E.F. Nuwaysir, R.D. Green and J.W. Shavlik
ABSTRACT:

Motivation:  Microarrays are a fast and cost-effective method of performing thousands of DNA hybridization experiments simultaneously.  DNA probes are typically used to measure the expression level of specific genes.   Because probes greatly vary in the quality of their hybridizations, choosing good probes is a difficult task.    If one could accurately choose probes that are likely to hybridize well, then fewer probes would be needed to represent each gene in a gene-expression microarray, and, hence, more genes could be placed on an array of a given physical size.  Our goal is to empirically evaluate how successfully three standard machine-learning algorithms - naïve Bayes, decision trees, and artificial neural networks - can be applied to the task of predicting good probes.  Fortunately it is relatively easy to get training examples for such a learning task: place various probes on a gene chip, add a sample where the corresponding genes are highly expressed, and then record how well each probe measures the presence of its corresponding gene.  With such training examples, it is possible that an accurate predictor of probe quality can be learned.
Results:  Two of the learning algorithms we investigate - naïve Bayes and neural networks - learn to predict probe quality surprisingly well.  For example, in the top ten predicted probes for a given gene not used for training, on average about five rank in the top 2.5% of that gene's hundreds of possible probes.  Decision-tree induction and the simple approach of using predicted melting temperature to rank probes perform significantly worse than these two algorithms.  The features we use to represent probes are very easily computed and the time taken to score each candidate probe after training is minor.  Training the naïve Bayes algorithm takes very little time, and while it takes over 10 times as long to train a neural network, that time is still not very substantial (on the order of a few hours on a desktop workstation). We also report the information contained in the features we use to describe the probes.  We find the fraction of cytosine in the probe to be the most informative feature.  We also find, not surprisingly, that the nucleotides in the middle of the probe's sequence are more informative than those at the ends of the sequence.

 
AVAILABILITY:  
CONTACT: molla@cs.wisc.edu
 
PRESENTATION 25
 
TITLE: Assessing the accuracy of database search methods, and improving the performance of PSI-BLAST
AUTHORS: Stephen Altschul
ABSTRACT:

A variety of measures have been proposed for assessing the accuracy of sequence database search methods. One measure that has gained wide use is the ROC score, derived from a graph of false vs. true positives as the alignment cutoff score varies. An interesting question is when the ROC scores of two different methods can be said to differ significantly. Recent analytic results concerning bootstrap resampling applied to ROC scores provide one possible answer to this question.
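
As a rough illustration of the ROC score, the sketch below computes the normalized area under the curve of true positives against false positives, truncated at the first n false positives. The truncation (often written ROC_n), the padding rule for short lists, and the function name are assumptions; the abstract does not state which variant is used.

```python
def roc_n(ranked_labels, n):
    """ROC score truncated at n false positives.

    ranked_labels: booleans ordered from best to worst alignment score,
    True for a true positive and False for a false positive.
    """
    total_true = sum(ranked_labels)
    if total_true == 0 or n <= 0:
        raise ValueError("need at least one true positive and n >= 1")
    true_seen = 0
    false_seen = 0
    area = 0
    for is_true in ranked_labels:
        if is_true:
            true_seen += 1
        else:
            false_seen += 1
            area += true_seen  # true positives ranked above this false positive
            if false_seen == n:
                break
    # Assumption: if the list ends before n false positives are seen,
    # the missing ones are treated as ranking below every true positive.
    area += (n - false_seen) * total_true
    return area / (n * total_true)


# Toy ranked hit list (hypothetical): True marks a correct hit.
hits = [True, True, False, True, False, False]
print(roc_n(hits, 2))
print(roc_n(hits, 3))
```

A score of 1 means every true positive outranks the first n false positives, and 0 means none does. Comparing two methods would additionally need a significance test such as the bootstrap approach the abstract mentions, which is not shown.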

We have used ROC analysis to assess a large number of possible refinements of the original, 1997 version of PSI-BLAST. Several modifications lead to significant or near-significant improvements in program accuracy. The most important among these is the incorporation of sequence-composition based statistics, which substantially suppress the corruption of protein profiles by false positive alignments.

 
 
PRESENTATION 26
 
TITLE:

The degenerate primer design problem

AUTHORS:

C. Linhart and R. Shamir

ABSTRACT:

A PCR primer sequence is called degenerate if some of its positions have several possible bases.  The degeneracy of the primer is the number of unique sequence combinations it contains.  We study the problem of designing a pair of primers with prescribed degeneracy that match a maximum number of given input sequences. Such problems occur when studying a family of genes that is known only in part, or is known in a related species.  We prove that various simplified versions of the problem are hard, show the polynomiality of some restricted cases, and develop approximation algorithms for one variant. 
Based on these algorithms, we implemented a program called HYDEN for designing highly degenerate primers for a set of genomic sequences.  We report on the success of the program in an experimental scheme for identifying all human olfactory receptor (OR) genes.  In that project, HYDEN was used to design primers with degeneracies up to 10^10 that amplified with high specificity many novel genes of that family, tripling the number of OR genes known at the time.
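
The definition of degeneracy given above can be made concrete with a short sketch. It uses the standard IUPAC nucleotide ambiguity codes; the function names and example sequences are hypothetical, and this is not the HYDEN program or its algorithms.

```python
# Standard IUPAC ambiguity codes: each symbol stands for a set of bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def degeneracy(primer):
    """Number of distinct plain sequences the degenerate primer represents."""
    result = 1
    for symbol in primer:
        result *= len(IUPAC[symbol])
    return result


def matches(primer, site):
    """True if site is one of the sequences the primer represents."""
    return len(primer) == len(site) and all(
        base in IUPAC[symbol] for symbol, base in zip(primer, site)
    )


def coverage(primer, sites):
    """Count how many candidate binding sites the primer matches."""
    return sum(matches(primer, site) for site in sites)


print(degeneracy("ACGT"))  # no ambiguous positions
print(degeneracy("RYN"))   # 2 * 2 * 4 combinations
print(coverage("AYG", ["ACG", "ATG", "AAG"]))
```

The design problem in the abstract is the reverse direction: choose the primer so that coverage is as large as possible while degeneracy stays within a prescribed bound.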

 
AVAILABILITY: Available on request from the authors
CONTACT:  chaiml@tau.ac.il rshamir@tau.ac.il
 
PRESENTATION 27
 
TITLE:

Splicing graphs and EST assembly problem

AUTHORS:

S. Heber, M. Alekseyev, S.-H. Sze, H. Tang and P.A. Pevzner

ABSTRACT:

Motivation: The traditional approach to annotating alternative splicing is to investigate every splicing variant of the gene in a case-by-case fashion. This approach, while useful, has some serious shortcomings.  Recent studies indicate that alternative splicing is more frequent than previously thought and that some genes may produce tens of thousands of different transcripts.  A list of alternatively spliced variants for such genes would be difficult to build and hard to analyze. Moreover, such a list shows neither the relationships between different transcripts nor the overall structure of all transcripts. A better approach would be to represent all splicing variants for a given gene in a way that captures the relationships between them.
Results: We introduce the notion of the splicing graph, a natural and convenient representation of all splicing variants.  The key difference from existing approaches is that we abandon the linear (sequence) representation of each transcript and substitute it with a graph representation where each transcript corresponds to a path in the graph. We further design an algorithm to assemble EST reads into the splicing graph rather than assembling them into each splicing variant in a case-by-case fashion.
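
The idea that each transcript corresponds to a path in one shared graph can be sketched as follows. This is a simplified illustration that assumes transcripts are already given as ordered lists of exon labels; the function names and exon labels are hypothetical, and the authors' algorithm for assembling EST reads into the graph is not reproduced.

```python
def build_splicing_graph(transcripts):
    """Merge transcripts into one directed graph: exon -> set of next exons."""
    graph = {}
    for exons in transcripts:
        for exon in exons:
            graph.setdefault(exon, set())
        for left, right in zip(exons, exons[1:]):
            graph[left].add(right)
    return graph


def is_path(graph, exons):
    """True if the exon list follows existing vertices and edges of the graph."""
    if any(exon not in graph for exon in exons):
        return False
    return all(right in graph[left] for left, right in zip(exons, exons[1:]))


# Three splicing variants of one toy gene (hypothetical exon labels).
transcripts = [
    ["e1", "e2", "e3"],
    ["e1", "e3"],        # exon e2 skipped
    ["e1", "e2", "e4"],  # alternative last exon
]
graph = build_splicing_graph(transcripts)

for exon in sorted(graph):
    print(exon, "->", sorted(graph[exon]))
print(is_path(graph, ["e1", "e2", "e4"]))
print(is_path(graph, ["e1", "e4"]))
```

The graph stores each exon and each junction once, however many variants share them, which is what makes the relationships between variants visible in a single structure.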

 
AVAILABILITY: http://www-cse.ucsd.edu/groups/bioinformatics/software.html
CONTACT:  sheber@ucsd.edu
 
PRESENTATION 28