Robust k-means Clustering of Gene Expression

Chris Benner1, Dimitri Petrov2, Yong-Chuan Tao, Karine G. Le Roch, Garret Hampton, Elizabeth A. Winzeler, Jiayu Liao, Guangzhou Zou, Peter Schultz, Yingyao Zhou
1cbenner@gnf.org; 2dpetrov@gnf.org

Cluster analysis is one of the most commonly used methods for analyzing gene expression data, because it allows unsupervised discovery of novel biological networks and assists in assigning biological function to uncharacterized genes. However, current clustering methods suffer greatly from uncertainty and ambiguity in their results. This study demonstrates that both variations in data sources and the intrinsic indeterminacy of clustering procedures can be overcome, and that reliable, informative, and optimal clustering results can be achieved. Using the methods introduced below, the robust k-means clustering method assigns a statistical robustness to each gene, leading to more interpretable and reliable biological conclusions. The robust k-means clustering algorithm was applied to gene expression profiles of the malaria erythrocyte cell cycle, and cell cycle regulation clusters of interest were identified for functional annotation.