Hierarchical Clustering of Gene Expression Data with the Agglomerative Information-Bottleneck Method

Byoung-Hee Kim (1), Kyu-Baek Hwang (2), Jung-Ho Chang, Byoung-Tak Zhang
(1) bhkim@bi.snu.ac.kr, Biointelligence Lab, Seoul National University; (2) kbhwang@bi.snu.ac.kr, Biointelligence Lab, Seoul National University

Clustering is a widely used technique for analyzing gene expression data, and various clustering methods have been tested on such data. We applied 'double clustering with the agglomerative information-bottleneck method', which has led to significant performance improvements in information retrieval, to gene expression data. The information-bottleneck method is an information-theoretic approach to distributional clustering, in which one looks for a compact representation of one variable that preserves as much information as possible about another, relevant variable. The method provides a principled distributional similarity measure based on mutual information. Combining it with the double-clustering paradigm appears to reflect the inherent structure of the data better. Results of single and double clustering on samples of the NCI60 cancer cell lines supported the intuitive hypothesis that the ostensible origin of the tumors may cause similar gene expression patterns in cancer cells. Building on this result, we calculated the 'entropy' for several levels of gene clusters during double clustering and found that compressing the genes to about 100 clusters is acceptable. Plotting the mutual information and its change over the course of clustering showed a significant reduction of the inevitable noise and redundancy in the original data, and also allowed an appropriate number of clusters to be estimated.
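
The agglomerative step summarized above can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the function names and the toy joint distribution are hypothetical, and it assumes the expression matrix has already been normalized into a joint distribution p(t, y) over items t (for example genes) and a relevant variable y (for example samples). The merge cost is the standard agglomerative information-bottleneck criterion, the sum of the two cluster probabilities times their weighted Jensen-Shannon divergence, which equals the mutual information lost by the merge.

```python
import math


def mutual_information(joint):
    """I(T;Y) in bits for a joint distribution given as rows of p(t, y)."""
    row_sums = [sum(row) for row in joint]
    col_sums = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for row, pt in zip(joint, row_sums):
        for pty, py in zip(row, col_sums):
            if pty > 0:
                mi += pty * math.log2(pty / (pt * py))
    return mi


def merge_cost(row_i, row_j):
    """Mutual information lost by merging two clusters.

    Equals (p_i + p_j) times the weighted Jensen-Shannon divergence
    between p(y | t_i) and p(y | t_j).
    """
    p_i, p_j = sum(row_i), sum(row_j)
    total = p_i + p_j
    if total == 0:
        return 0.0
    cost = 0.0
    for a, b in zip(row_i, row_j):
        merged = (a + b) / total  # p(y | merged cluster)
        if a > 0:
            cost += a * math.log2((a / p_i) / merged)
        if b > 0:
            cost += b * math.log2((b / p_j) / merged)
    return cost


def agglomerative_ib(joint, n_clusters):
    """Greedily merge the cheapest pair until n_clusters remain.

    Returns the cluster memberships and the trace of I(T;Y) after
    each merge, which is the curve one would plot to pick a cut-off.
    """
    rows = [list(r) for r in joint]
    members = [[i] for i in range(len(rows))]
    history = [mutual_information(rows)]
    while len(rows) > n_clusters:
        _, i, j = min(
            (merge_cost(rows[i], rows[j]), i, j)
            for i in range(len(rows))
            for j in range(i + 1, len(rows))
        )
        rows[i] = [a + b for a, b in zip(rows[i], rows[j])]
        members[i] = members[i] + members[j]
        del rows[j], members[j]
        history.append(mutual_information(rows))
    return members, history


# Toy joint distribution (made up for illustration): items 0 and 1 share
# one profile over y, items 2 and 3 share another.
toy_joint = [
    [0.20, 0.05],
    [0.20, 0.05],
    [0.05, 0.20],
    [0.05, 0.20],
]
clusters, info_trace = agglomerative_ib(toy_joint, 2)
```

In a double-clustering setting, one would run such a procedure twice under the same assumptions: first on genes to obtain gene clusters, then on samples represented by their distribution over those gene clusters. The exhaustive pair search is quadratic per merge and is meant only to make the criterion explicit.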