Correlated Feature Extraction for Classification of Microarray and Mass Spectroscopy Data

Christopher Bowman1, Richard Baumgartner2, Ray Somorjai
1Christopher.Bowman@nrc-cnrc.gc.ca, Institute for Biodiagnostics; 2Richard.Baumgartner@nrc-cnrc.gc.ca, Institute for Biodiagnostics

Data obtained through microarray and mass spectroscopy experiments are characterized by high dimensionality and a severely limited number of available samples. Statistically meaningful analysis of so few high-dimensional data points is a serious challenge because the available data are extremely sparse in these high-dimensional spaces. It is generally accepted in the pattern recognition community that robust classifier development requires 5-10 samples per feature (1), which is infeasible for either of these modalities, where the number of features measured per pattern (the gene expression levels or mass-to-charge ratios) runs to tens or hundreds of thousands. Some form of feature selection/extraction provides a natural way to address this problem. Feature selection/extraction is especially desirable in disease profiling applications, where the main interest lies in identifying discriminatory features (gene expression levels, or peaks in mass spectra). In both experimental modalities the features are highly redundant, suggesting that the data do not span the entire original high-dimensional space; instead, they lie on (or close to) some low-dimensional manifold. Neighbouring spectral features of mass spectra are highly correlated; in fact, they are almost identical and therefore form natural clusters. This high degree of correlation between neighbouring features does not exist in microarray data; nevertheless, it has been shown that gene expression patterns cluster into large groups of genes with similar expression patterns. Several supervised (2) and unsupervised (e.g., PCA) feature reduction methods have been applied to these data modalities. Supervised methods can be slow to train and are vulnerable to overtraining. Unsupervised methods such as principal component analysis are less prone to overtraining, but they "scramble" the features, because each component mixes many original features, making subsequent interpretation difficult.
We present a feature reduction method based on unsupervised clustering that exploits the high correlation among the features. We propose this technique as a preprocessing step for wrapper-based feature extraction procedures. We apply our algorithm to publicly available ovarian and prostate mass spectroscopy datasets from the NIH/FDA Clinical Proteomics Program Databank (http://clinicalproteomics.steem.com), as well as to the well-known ALL/AML microarray data.

1) Raudys S, Jain A. Small sample size effects in statistical pattern recognition: recommendations for practitioners. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(3), 252-264, 1991.
2) Petricoin E, Ardekani A, Hitt B, Levine P, Fusaro V, Steinberg S, Mills G, Simone C, Fishman D, Kohn EC, Liotta L. Use of proteomic patterns in serum to identify ovarian cancer. Lancet, 359(9306), 572-577, 2002.
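
The abstract does not specify the clustering algorithm, so the following is only a minimal sketch of one way that correlated neighbouring features (as in mass spectra) could be grouped and averaged before a wrapper-based step. It is not the authors' method. The function name `cluster_adjacent_features`, the Pearson-correlation criterion, and the `threshold` value are all assumptions made for illustration.

```python
import numpy as np


def cluster_adjacent_features(X, threshold=0.9):
    """Group runs of neighbouring columns that are highly correlated.

    Hypothetical illustration: a new cluster starts wherever the Pearson
    correlation between a column and its left neighbour falls below
    `threshold`. Each cluster is replaced by the mean of its columns, so
    every reduced feature still maps to a contiguous range of originals.
    """
    X = np.asarray(X, dtype=float)
    n_features = X.shape[1]

    # Centre and scale each column to unit length so that a dot product
    # between two columns equals their Pearson correlation.
    Xc = X - X.mean(axis=0)
    norms = np.linalg.norm(Xc, axis=0)
    norms[norms == 0] = 1.0  # constant columns get zero correlation
    Z = Xc / norms

    # Correlation of each column with its right-hand neighbour.
    adj_corr = np.sum(Z[:, :-1] * Z[:, 1:], axis=0)

    # Indices at which a new cluster begins.
    breaks = np.flatnonzero(adj_corr < threshold) + 1
    groups = np.split(np.arange(n_features), breaks)

    # One averaged feature per cluster.
    reduced = np.column_stack([X[:, g].mean(axis=1) for g in groups])
    return reduced, groups


# Toy data: columns 0-1 are perfectly correlated, as are columns 2-3,
# while the two pairs are only weakly correlated with each other.
a = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([1.0, 0.0, 1.0, 0.0, 1.0, 0.0])
X = np.column_stack([a, 2 * a + 1, b, 3 * b])

reduced, groups = cluster_adjacent_features(X, threshold=0.9)
print([g.tolist() for g in groups])  # [[0, 1], [2, 3]]
print(reduced.shape)                 # (6, 2)
```

Because each reduced feature is an average over a contiguous block of original features, it remains interpretable as a spectral region, unlike a principal component. Microarray features are not ordered by similarity, so a sketch like this would not apply to them directly; a general clustering of the feature correlation matrix would be needed instead.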