An automatic and unbiased GA for finding the most discriminant gene sets on Microarray

Han-Yu Chuang, Hwa-Sheng Chiu, Huai-Kuang Tsai, Cheng-Yan Kao
Bioinfo Lab., Department of CSIE, National Taiwan University

Microarrays, a recent technology in experimental molecular biology, have been used to support the development of diagnostic tools and classification platforms in cancer research. Different approaches, including univariate and multivariate techniques, have been developed to select predictive genes in an expression dataset for classification. Univariate approaches may miss correlations between genes because they examine one gene at a time. Moreover, the assumptions of homogeneity within a class that these methods rely on often suffice only for binary or 3-class datasets. Several multivariate approaches, which identify genes that jointly discriminate between multiple classes of samples, overcome these drawbacks but may overfit to a particular objective because they use a specific classifier directly as the criterion for the goodness of a gene set. Furthermore, such systems require many parameters to be tuned to obtain good results. We propose a genetic algorithm (GA) based approach that combines univariate and multivariate techniques and uses the Gamma test [Stefansson et al., 1997] and Pearson correlation as evaluation functions to find optimal gene sets of minimum size for sample classification on gene expression data, automatically and without bias. Our approach is divided into two steps: (1) preprocessing by Threshold Number of Misclassification (TNoM) and permutation tests to filter out genes with non-informative patterns between classes, and (2) using a multi-objective genetic algorithm to find subsets of the significant genes from step 1 with lower correlation among gene patterns and higher correlation to the classes. The GA in step 2 has two major mechanisms to support local and global search: heterogeneous pairing selection (HpS) and family competition [Yang, 2001].
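
The filtering idea in step 1 can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes two classes coded as 0 and 1, scores each gene by the smallest number of errors any single expression threshold can achieve (the TNoM score), and estimates a p-value by shuffling the class labels. The function names, the significance level `alpha`, and the number of permutations are hypothetical choices made only for this illustration.

```python
import random


def tnom_score(values, labels):
    """Fewest misclassified samples over all single-threshold rules.

    values: expression levels of one gene; labels: 0/1 class per sample.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    total_pos = sum(label for _, label in pairs)
    # Trivial rules that assign every sample to one class.
    best = min(total_pos, n - total_pos)
    pos_left = 0
    for i, (value, label) in enumerate(pairs):
        pos_left += label
        # A threshold can only sit between two distinct expression values.
        if i + 1 < n and pairs[i + 1][0] == value:
            continue
        left = i + 1
        neg_left = left - pos_left
        pos_right = total_pos - pos_left
        neg_right = (n - left) - pos_right
        # Either "low -> class 0, high -> class 1" or the reverse rule.
        best = min(best, pos_left + neg_right, neg_left + pos_right)
    return best


def permutation_p_value(values, labels, n_permutations=1000, seed=0):
    """Share of label shufflings that score at least as well as observed."""
    rng = random.Random(seed)
    observed = tnom_score(values, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if tnom_score(values, shuffled) <= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_permutations + 1)


def filter_genes(expression, labels, alpha=0.05, n_permutations=1000):
    """Keep genes whose TNoM permutation p-value is at most alpha.

    expression: dict mapping a gene name to its list of expression values.
    """
    return [
        gene
        for gene, values in expression.items()
        if permutation_p_value(values, labels, n_permutations) <= alpha
    ]
```

For example, `filter_genes({"g1": [...], "g2": [...]}, labels)` would return the names of the genes that pass the filter; those genes would then be the candidates handed to the GA in step 2.
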
The objective function has two parts, and the GA minimizes the sum of their scores: (1) the Gamma test gives a data-derived estimate of the mean-squared error of classification without actually being a classifier; (2) Pearson correlation is used to obtain gene sets of smaller size with complementary patterns. We evaluate performance by the classification accuracy of a K-nearest neighbor classifier under leave-one-out cross-validation. In our preliminary study on the Colon cancer dataset, 56 genes remain after the filtering step, and 8 of them form the final predictive set. The classification accuracy is 95% when K = 3 and a majority rule is used.

1. Stefansson, A., Koncar, N., and Jones, A.J. (1997) A Note on the Gamma Test. Neural Comput. Applic., 5, 131-133.
2. Yang, J. M. (2001) A Family Competition Evolutionary Approach of Global Optimization in Neural Networks, Optical Thin-film Design, and Structure-based Drug Design. Ph.D. thesis, National Taiwan University, Taiwan.
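
The evaluation protocol named in the abstract (a K-nearest neighbor classifier with K = 3, a majority rule, and leave-one-out cross-validation) can be sketched as follows. This is an illustration under assumptions rather than the authors' code: it assumes Euclidean distance between expression profiles, which the abstract does not specify, and the function names are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k training samples closest to the query."""
    order = sorted(
        range(len(train_x)),
        key=lambda i: math.dist(train_x[i], query),  # Euclidean (assumed)
    )
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def loocv_accuracy(samples, labels, k=3):
    """Leave-one-out accuracy: each sample is predicted from all the others."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if knn_predict(train_x, train_y, samples[i], k) == labels[i]:
            correct += 1
    return correct / len(samples)
```

Here `samples` would hold one list per tissue sample, containing the expression values of the selected genes only, and `labels` the matching class of each sample. With two classes and an odd K the majority vote cannot tie.
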