Decision-tree approach to the classification of prostate tissue samples using microarray gene expression data

Changqing Ma1, Rajiv Dhir2, Jianhua Luo, George Michalopoulos, Michael Becich, John Gilbertson
1chmst40@pitt.edu, University of Pittsburgh; 2dhirr@MSX.UPMC.EDU, University of Pittsburgh

Introduction The study of cancer through high-throughput gene-expression microarrays has stimulated the development of supervised learning methods for classifying tumor specimens from microarray results. In this paper we apply a decision-tree learning approach (based on the C4.5 algorithm) to microarray data and compare the effectiveness of the C4.5 algorithm with that of Support Vector Machines and Weighted Voting approaches.

Method Our initial dataset included three types of prostate tissue: prostate cancer (Tumor, N=62), normal-appearing prostate tissue adjacent to prostate cancer (NAT, N=64) and normal prostate tissue from cancer-free donor specimens (Donor, N=18). All specimens were run on the Affymetrix U95Av2 chip, which contains 12,600 probe sets. The C4.5 algorithm was then applied to this dataset to learn plausible models for distinguishing one type of tissue from another. The classification models were then tested against two additional published datasets. Feature reduction and boosting were combined within leave-one-out cross-validation (LOOCV) to select the C4.5 parameters that gave the most accurate classifier. Two other classification methods, support vector machines and weighted voting, were also used to discriminate between tissue types, and the classification results were compared.
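The LOOCV procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: it uses a common signal-to-noise formulation of weighted voting as the classifier because it fits in a few lines (the abstract does not say which weighted-voting variant was used), and a C4.5 learner would slot into the same loop. The function names, the toy expression values, and the candidate values of `k` are all hypothetical. The point it shows is that feature reduction is redone inside every fold, so the held-out sample never influences which features are kept.

```python
from statistics import mean, pstdev

def fit_weighted_voting(X, y, k):
    """Keep the k features with the largest |signal-to-noise| and store their vote parameters."""
    class_a, class_b = sorted(set(y))  # assumes exactly two classes in the training data
    stats = []
    for j in range(len(X[0])):
        a = [row[j] for row, lab in zip(X, y) if lab == class_a]
        b = [row[j] for row, lab in zip(X, y) if lab == class_b]
        # Positive score: feature j is higher on average in class_a.
        snr = (mean(a) - mean(b)) / (pstdev(a) + pstdev(b) + 1e-12)
        midpoint = (mean(a) + mean(b)) / 2
        stats.append((j, snr, midpoint))
    stats.sort(key=lambda s: abs(s[1]), reverse=True)
    return class_a, class_b, stats[:k]

def predict(model, row):
    """Each kept feature votes with weight snr * (distance from the class midpoint)."""
    class_a, class_b, features = model
    vote = sum(snr * (row[j] - midpoint) for j, snr, midpoint in features)
    return class_a if vote > 0 else class_b

def loocv_accuracy(X, y, k):
    """Leave-one-out accuracy with feature reduction repeated inside each fold."""
    correct = 0
    for i in range(len(X)):
        X_train, y_train = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        model = fit_weighted_voting(X_train, y_train, k)  # never sees sample i
        correct += int(predict(model, X[i]) == y[i])
    return correct / len(X)

# Hypothetical toy data: feature 0 separates the classes, features 1-4 are uninformative.
y = ["Donor"] * 4 + ["Tumor"] * 4
X = []
for i, lab in enumerate(y):
    informative = (5.0 if lab == "Tumor" else 0.0) + 0.1 * (i % 4)
    noise = [float((2 * i + 3 * j) % 5) for j in range(1, 5)]
    X.append([informative] + noise)

# Parameter selection: pick the feature count with the best LOOCV accuracy.
accuracy_by_k = {k: loocv_accuracy(X, y, k) for k in (1, 2, 3)}
best_k = max(accuracy_by_k, key=accuracy_by_k.get)
print(accuracy_by_k, best_k)
```

Ranking features once on the full dataset and then cross-validating would let the held-out sample leak into the model, which tends to inflate the estimated accuracy.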

Results Three models were generated for binary classification of the three classes of prostate specimens: Tumor v Donor tissue, Adjacent to Tumor v Donor tissue and Tumor v Adjacent to Tumor tissue. When C4.5 was applied to the initial dataset, these models yielded 94.87%, 93.83%, and 76.42% accuracy respectively in LOOCV. These results are comparable to the LOOCV results from SVM analysis (98.72%, 97.53%, and 69.92% accuracy for the same three models) and from the weighted voting approach (97.43%, 93.82%, and 71.54% respectively).

The three models generated by C4.5 were then tested on two independent, previously published prostate tissue datasets. The model for Tumor v Donor tissue classified all tumor samples from the independent datasets correctly. The model for Adjacent to Tumor v Donor tissue correctly categorized 94% of the normal samples from the independent datasets. The model for Tumor v Adjacent to Tumor tissue correctly classified 72.55% of the samples in one independent dataset (102 samples) and 96.97% in the other (33 samples).

It appears that the C4.5 decision-tree learning method can be used to classify tissues on the basis of microarray gene-expression results. Notably, the C4.5 results were as good as, if not better than, the classifications produced by more established microarray classification methods such as Support Vector Machines or Weighted Voting. Furthermore, the models generated by C4.5 have human-readable tree structures that can be summarized as simple rule sets and may suggest interactions among the genes used as features in C4.5 learning.
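To illustrate how a tree structure can be summarized as a rule set, the sketch below walks a small binary tree and emits one IF-THEN rule per leaf. The tree, the probe names `probe_A` and `probe_B`, and the thresholds are invented for illustration and are not taken from the study's models. The rule post-processing that C4.5 applies (dropping redundant conditions, ordering rules) is not reproduced here.

```python
def tree_to_rules(node, conditions=()):
    """Return one (conditions, label) pair per leaf of a binary decision tree."""
    if "label" in node:  # leaf: the path taken so far is the rule body
        return [(list(conditions), node["label"])]
    feature, threshold = node["feature"], node["threshold"]
    rules = tree_to_rules(node["low"], conditions + (f"{feature} <= {threshold}",))
    rules += tree_to_rules(node["high"], conditions + (f"{feature} > {threshold}",))
    return rules

def format_rule(conditions, label):
    """Render a rule as text; a single-leaf tree has no conditions at all."""
    return "IF " + (" AND ".join(conditions) or "TRUE") + " THEN " + label

# Hypothetical two-level tree over two made-up probe sets.
toy_tree = {
    "feature": "probe_A", "threshold": 250.0,
    "low": {"label": "Donor"},
    "high": {
        "feature": "probe_B", "threshold": 90.0,
        "low": {"label": "Tumor"},
        "high": {"label": "Donor"},
    },
}

rules = [format_rule(conds, label) for conds, label in tree_to_rules(toy_tree)]
for rule in rules:
    print(rule)
```

A rule that tests two probes on the same path, such as the second rule printed here, is the kind of pattern that could hint at an interaction between the corresponding genes, although a rule by itself does not establish one.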

These results also showed that feature reduction and boosting significantly improved classification accuracy. Finally, tissue samples that were misclassified by all three approaches were identified and examined for possible causes of the misclassification.