Hierarchical-Partitioning: A New Clustering Framework for Gene Expression Data Analysis

Alan Wee-Chung Liew1, Hong Yan2, Lap Keung Szeto
1itwcliew@cityu.edu.hk, Dept of Computer Engineering and Information Technology, City University of Hong Kong; 2ityan@cityu.edu.hk, Dept of Computer Engineering and Information Technology, City University of Hong Kong

Clustering has been shown to be a useful tool for the exploration of large-scale gene expression profiles. Traditional clustering techniques can generally be classified into two categories, hierarchical and partitional. The former results in nested clusters, and the latter in non-nested clusters. Hierarchical clustering algorithm transforms a pairwise dissimilarity matrix of patterns into a sequence of nested partitions, such as a dendrogram. Partitional clustering, on the other hand, performs a partition of patterns into K clusters, such that patterns in a cluster are more similar to each other than to patterns in different clusters. Both categories of clustering algorithms have their merits and weaknesses and both have been used extensively in gene expression data study (see for example, the cluster analysis packages by Standford Microarray Database: http://genome-www5.stanford.edu/MicroArray/SMD/restech.html). In our recent work, we attempt to combine the features of both classes of algorithms using a novel hierarchical-partitioning framework which we called the binary hierarchical clustering. In essence, our algorithm performs a successive binary subdivision of the data into smaller and smaller partitions in a hierarchical manner, until any further splitting of a partition into two smaller partitions is insignificant anymore. The hierarchical structure is manifested in the binary tree structure of the clustering result, where a parent node gives rise to two children nodes if the projection onto the optimal fisher discriminant axis satisfies a certain threshold. The partitioning behavior of our algorithm is incorporated in the cluster splitting process, where the fuzzy C-means clustering algorithm is used to split a parent cluster into two children clusters. The novel structure of our algorithm enables the cluster number to be estimated based on a mathematically well-defined index. Moreover, the binary tree structure also allows a subjective judgment of the number of cluster similar to that found in conventional hierarchical clustering. Meaningful visualization of the clustering result is important for interpretation. Our clustering framework also leads to a tree structure representation, which display the nested relationship between cluster partitions of finer and finer resolution. We believe that our algorithm will add to the set of valuable data analysis tools for the biologists.