Partially supervised clustering of gene expression time course data

Alexander Schoenhuth1, Alexander Schliep2, Christine Steinhoff
1aschoen@zpr.uni-koeln.de, Center for Applied Computer Science, University Cologne; 2schliep@molgen.mpg.de, Max Planck Institute for Molecular Genetics, Berlin

Performing microarray experiments consecutively in time produces a time course of gene expression profiles. Model-based clustering methods are a recent approach to classifying these time courses. Statistical models represent the clusters, and cluster membership is decided by maximizing a data point's likelihood given a model (cluster). Model-based clustering accounts for horizontal dependencies between the expression levels at different time points and is therefore more suitable for classifying time courses than conventional, usually distance-based, methods. As the number of genes with known function grows, there is a need for classification methods that allow the use of prior knowledge. This can be realized by partially supervised clustering: models that represent, and are learnt from, labeled sets of genes with known function are added to the collection of clusters. In the iteration steps of the clustering algorithm, the labeled data may not be reassigned to other clusters. In our case, clusters are represented by Hidden Markov Models (HMMs). Besides their use in biological sequence analysis, HMMs have been successfully applied to the analysis of time course data in a wide range of problem domains. An initial collection of models is chosen that encompasses typical qualitative behavior such as up- or down-regulation. Models learnt from labeled genes are added. In each iteration step, new model parameters are computed using Baum-Welch training (Expectation-Maximization). Unlabeled genes are then reassigned to the model that maximizes their likelihood. This iterative procedure is carried out until the assignment converges. We apply the method to simulated data and to various published data sets and compare it with purely unsupervised and purely supervised methods. This poster refers to the paper 'Using Hidden Markov Models for analyzing gene expression time course data' by the same authors.
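
The iteration described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: to stay short, each cluster is modeled by an independent Gaussian per time point instead of an HMM, and a simple mean/variance refit stands in for Baum-Welch training, so the dependencies between time points that an HMM captures are ignored here. The function names, the variance floor, and the simulated data are all assumptions made for illustration.

```python
import numpy as np

def fit_model(profiles):
    """Refit one cluster model (stand-in for Baum-Welch training)."""
    mean = profiles.mean(axis=0)
    var = profiles.var(axis=0) + 1e-2  # assumed variance floor, avoids degenerate models
    return mean, var

def log_likelihood(profiles, model):
    """Log-likelihood of each profile under a per-time-point Gaussian model."""
    mean, var = model
    return -0.5 * (np.log(2 * np.pi * var) + (profiles - mean) ** 2 / var).sum(axis=1)

def partially_supervised_clustering(data, labels, prototypes, max_iter=100):
    """data: (genes, time points); labels: cluster index, or -1 if unlabeled."""
    labels = np.asarray(labels)
    fixed = labels >= 0
    n_time = data.shape[1]
    # Initial collection: prototype behaviors with unit variance.
    models = [(np.asarray(p, dtype=float), np.ones(n_time)) for p in prototypes]
    # Models learnt from labeled genes replace the prototype where labels exist.
    for k in range(len(models)):
        if np.any(labels == k):
            models[k] = fit_model(data[labels == k])
    assign = labels.copy()
    for _ in range(max_iter):
        ll = np.column_stack([log_likelihood(data, m) for m in models])
        new_assign = ll.argmax(axis=1)        # maximum-likelihood reassignment
        new_assign[fixed] = labels[fixed]     # labeled genes are never reassigned
        if np.array_equal(new_assign, assign):
            break                             # assignment has converged
        assign = new_assign
        # Re-estimate each model from its current members; keep it if empty.
        models = [fit_model(data[assign == k]) if np.any(assign == k) else models[k]
                  for k in range(len(models))]
    return assign

# Simulated example: 20 up-regulated and 20 down-regulated genes, 6 time points.
rng = np.random.default_rng(0)
up, down = np.linspace(0, 2, 6), np.linspace(0, -2, 6)
data = np.vstack([up + 0.2 * rng.standard_normal((20, 6)),
                  down + 0.2 * rng.standard_normal((20, 6))])
truth = np.repeat([0, 1], 20)
labels = np.full(40, -1)
labels[:3], labels[20:23] = 0, 1              # three labeled genes per cluster
assign = partially_supervised_clustering(data, labels, [up, down])
```

Replacing `fit_model` and `log_likelihood` with HMM training and HMM likelihood evaluation would bring the sketch closer to the method described above; the assignment loop and the constraint on labeled genes would stay the same.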