P-quasi complete linkage clustering method for gene-expression profiles based on distribution analysis

Shigeto Seno1, Reiji Teramoto2, Yoichi Takenaka, Hideo Matsuda
1s-senoo@ist.osaka-u.ac.jp, Department of Bioinformatic Engineering, Graduate School of Information Science and Technology, Osaka University; 2teramoto@sumitomopharm.co.jp, Genomic Science Laboratories, Research Division, Sumitomo Pharmaceuticals

Hierarchical clustering with the correlation coefficient is commonly used to infer the function of genes from gene-expression profiles. This method, however, has a serious limitation in its capability to represent relationships. The resulting dendrogram can represent only simple similarity relationships between genes. In other words, it loses much useful information other than the largest correlation coefficient score. To cope with this problem, we propose a new clustering method with the following two features. First, the proposed method uses a new similarity measure based on the distribution of gene expressions. This measure allows us to find weak relationships between pairs of genes that cannot be revealed by the correlation coefficient. Second, the proposed method uses the P-quasi complete linkage algorithm to describe clusters. A P-quasi complete linkage graph satisfies the condition that every member of a group has linkages to at least P% of all the members within the group. With this algorithm, members that do not all have sufficient similarity to one another can still be clustered together if each has linkages to at least P% of the members. This means that the algorithm helps us find relationships among multiple genes. Together, the two features provide more informative clustering than hierarchical clustering with the correlation coefficient. In the poster, we show the effectiveness and usefulness of the proposed clustering method through an analysis of gene-expression profiles of cancer patients.
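
The P-quasi complete linkage condition stated in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The function name, the toy gene labels, and the linkage set are hypothetical. The sketch also assumes that "all the members within the group" means the other members, excluding the gene itself, and that a linkage is an undirected pair whose similarity has already passed some threshold.

```python
def is_p_quasi_complete(group, links, p):
    """Check the P-quasi complete linkage condition for one group.

    group: iterable of member labels
    links: set of frozenset pairs, each an undirected linkage (assumed precomputed)
    p:     required percentage of linked members, 0 to 100

    Assumption: each member is compared against the OTHER members only.
    """
    members = set(group)
    if len(members) < 2:
        return True  # a single gene trivially satisfies the condition
    needed = p / 100.0 * (len(members) - 1)
    for g in members:
        # Count how many other members this gene is linked to.
        degree = sum(1 for h in members if h != g and frozenset((g, h)) in links)
        if degree < needed:
            return False
    return True


# Hypothetical toy example: four genes, five linkages (g2 and g4 are not linked).
links = {
    frozenset(pair)
    for pair in [("g1", "g2"), ("g2", "g3"), ("g3", "g4"), ("g4", "g1"), ("g1", "g3")]
}
group = ["g1", "g2", "g3", "g4"]

quasi_60 = is_p_quasi_complete(group, links, 60)    # each gene links to at least 2 of 3 others
quasi_100 = is_p_quasi_complete(group, links, 100)  # would require a complete graph
```

In this toy group, g2 and g4 have no direct linkage, so the group is not a complete-linkage cluster, yet it satisfies the condition at P = 60 because every gene is linked to at least two of the other three.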