Extracting informative genes with negative correlation for accurate cancer classification

Hong-Hee Won¹, Sung-Bae Cho²
¹cool@candy.yonsei.ac.kr, Yonsei University; ²sbcho@cs.yonsei.ac.kr, Yonsei University

Accurate classification of cancer is a very important issue for cancer treatment. Several conventional diagnostic techniques, however, are often incomplete or misleading. Molecular-level diagnostics based on microarray technologies can offer a methodology for precise, objective, and systematic cancer classification. Genome-wide expression patterns generally consist of a thousand genes or more. It is desirable to extract a few significant genes from this set because not all genes are related to cancer. Using the concept of negative correlation, we have defined two ideal gene vectors strongly related to a cancer: one is high in class A and low in class B, and the other is low in class A and high in class B. We have extracted two significant gene subsets (SGSs) based on similarity to the two ideal vectors. Since the ideal vectors are negatively correlated, the sets of genes similar to each ideal vector are also negatively correlated. The negatively correlated features represent two different aspects of the classification boundary for gene expression data, and combining them lets us search a much wider solution space. We have trained one neural network classifier on each SGS and combined the classifiers using a Bayesian approach. We have evaluated the proposed method on three benchmark datasets: Leukemia, Colon, and Lymphoma. Experimental results show that the ensemble classifier with negatively correlated gene subsets achieves the best recognition rate on all three datasets: 97.1% on Leukemia, 87.1% on Colon, and 92.0% on Lymphoma. Compared with the best recognition rates of the individual neural network classifiers (97.1%, 83.9%, and 88.0%, respectively), the ensemble matches the best individual classifier on Leukemia and outperforms it on Colon and Lymphoma. We have confirmed that negative correlation enables the ensemble classifier to work better by providing the neural network classifiers with sufficient information for classification.
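
The extraction of the two SGSs can be illustrated with a minimal sketch. The abstract does not fix the similarity measure, the subset size, or the label encoding, so the sketch assumes Pearson correlation, a hypothetical subset size k, and labels 0 (class A) and 1 (class B). The function names are illustrative only and are not taken from the original work.

```python
import numpy as np

def ideal_vectors(labels):
    """Build the two ideal gene vectors (assumed encoding: 0 = class A, 1 = class B)."""
    labels = np.asarray(labels)
    high_a = (labels == 0).astype(float)  # high in class A, low in class B
    high_b = (labels == 1).astype(float)  # low in class A, high in class B
    return high_a, high_b

def pearson_to_target(expr, target):
    """Pearson correlation of every gene (row of expr) with a target vector."""
    xc = expr - expr.mean(axis=1, keepdims=True)
    tc = target - target.mean()
    denom = np.linalg.norm(xc, axis=1) * np.linalg.norm(tc)
    denom = np.where(denom == 0, np.inf, denom)  # constant genes get correlation 0
    return xc @ tc / denom

def select_sgs(expr, labels, k):
    """Return indices of the k genes most similar to each ideal vector.

    Assumption: similarity is Pearson correlation and each SGS is a top-k list.
    """
    high_a, high_b = ideal_vectors(labels)
    sgs_a = np.argsort(-pearson_to_target(expr, high_a))[:k]
    sgs_b = np.argsort(-pearson_to_target(expr, high_b))[:k]
    return sgs_a, sgs_b

# Toy data: 4 genes (rows) x 6 samples (columns), three samples per class.
labels = np.array([0, 0, 0, 1, 1, 1])
expr = np.array([
    [5.0, 6.0, 5.0, 1.0, 0.0, 1.0],  # high in class A
    [0.0, 1.0, 0.0, 6.0, 5.0, 6.0],  # high in class B
    [3.0, 3.0, 3.0, 3.0, 3.0, 3.0],  # constant, uninformative
    [1.0, 2.0, 1.0, 2.0, 1.0, 2.0],  # weakly related to the classes
])
sgs_a, sgs_b = select_sgs(expr, labels, k=1)
```

Under these assumptions each subset would then feed its own classifier, with the two outputs combined afterwards. Because the second ideal vector is the complement of the first, the two rankings are exact mirror images under Pearson correlation; other similarity measures would not necessarily share this property.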