Overcoming Confounded Controls in the Analysis of Gene Expression Data from Microarray Experiments

Soumyaroop Bhattacharya1, Dang Duc Long2, James Lyons-Weiler
1bhattacharyas@msx.upmc.edu, Benedum Oncology Informatics Center, University of Pittsburgh; 2Dang_Long@student.uml.edu, Center for Bioinformatics and Computational Biology, University of Massachusetts Lowell

A potential limitation to the utility of microarray gene expression experiments is the use of control samples derived from tissues of origin that differ from those of the tumors being used to construct predictive classifications. To determine which gene expression differences reflect cancer vs. normal, the only statistically valid comparison is between tumor tissues and normal samples of the tissue from which the tumors themselves derive. A symptom of confounding tissue-of-origin differences in gene expression with tumor vs. normal differences is the robust clustering of some normal samples within tumor groups and of other normal samples in a separate, 'normal' group. We examine and overcome this problem in a published data set of gene expression values for 7464 genes from 22 normal and 40 colon tumor samples. Our approach uses the maximum difference subset (MDSS) algorithm, which calculates a test statistic with which the significance of the difference in mean expression between two groups (e.g., normal vs. tumor) is evaluated. Unlike other approaches, including those that use n-fold difference criteria and simple t-test comparisons, the test statistic employed in the MDSS algorithm compares group mean differences and uses the pooled variance error term. It thus takes into account both the difference between groups and the variability within groups. The pooled variance is required because gene expression profiles are correlated among similar tissue types. In the colon cancer data set, we found that many genes in the MDSS gene set switch dramatically in their ranking (based on the significance of the difference between normal and tumor) when all normal samples are used vs. when only epithelial-like normal samples are included. SM22, for example, ranked 4th in the muscle-like normal vs. tumor comparison but 1328th in the epithelial-derived normal vs. tumor comparison. A number of genes that indicate cancer vs. normal when the appropriate normals are used are masked when all normals are used. We identify a maximum difference gene subset that should provide a road map for further exploration of prevention and treatments for colon cancer. Remarkably, each of the top 44 genes was underexpressed in the tumor compared to the epithelial-like normal samples. Many of the genes in the MDSS list have previously been identified as important factors in the development and progression of colon cancer. We review these factors in detail. Guanylin (-T) ranks first and uroguanylin (-T) ranks 6th in the MDSS when the appropriate, homogeneous epithelial-like control group is used. Given that a recent experiment demonstrated that oral replacement of uroguanylin dramatically reduces the incidence of polyp formation in Min/+ mice (Shailubhai et al., Cancer Res 60:5151-5157), we conclude that similar experiments with other members of the MDSS should be conducted in search of additional, potentially synergistic, preventative dietary supplements.
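
The rank switching described above can be illustrated with a minimal sketch. The abstract does not give the MDSS formula, so the code below assumes a generic two-sample statistic (difference in group means divided by a pooled-variance standard error), which may differ from the statistic MDSS actually uses. The function names (pooled_stat, rank_genes), the gene labels, and all expression values are hypothetical and invented purely for illustration.

```python
from math import sqrt
from statistics import mean, variance


def pooled_stat(a, b):
    """Mean difference scaled by a pooled-variance standard error (assumed form)."""
    na, nb = len(a), len(b)
    # Pooled variance: each group's sample variance weighted by its degrees of freedom.
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(sp2 * (1 / na + 1 / nb))


def rank_genes(expr, normal_idx, tumor_idx):
    """Return gene names ordered by decreasing absolute value of the statistic."""
    scores = {
        gene: pooled_stat([vals[i] for i in normal_idx],
                          [vals[i] for i in tumor_idx])
        for gene, vals in expr.items()
    }
    return sorted(scores, key=lambda g: abs(scores[g]), reverse=True)


# Invented toy data: columns 0-1 are epithelial-like normals,
# columns 2-3 are muscle-like normals, columns 4-7 are tumors.
expr = {
    # Lower in tumors than in every normal sample.
    "gene_A": [9.0, 9.2, 9.1, 8.9, 4.0, 4.2, 3.9, 4.1],
    # High only in muscle-like normals: a tissue-of-origin effect, not a tumor effect.
    "gene_B": [5.0, 5.2, 12.0, 12.2, 5.1, 4.9, 5.0, 5.2],
}
tumors = [4, 5, 6, 7]

rank_muscle = rank_genes(expr, [2, 3], tumors)
rank_epithelial = rank_genes(expr, [0, 1], tumors)
rank_all = rank_genes(expr, [0, 1, 2, 3], tumors)

print("muscle-like normals vs. tumor:    ", rank_muscle)
print("epithelial-like normals vs. tumor:", rank_epithelial)
print("all normals vs. tumor:            ", rank_all)
```

With these invented values, gene_B heads the ranking against muscle-like normals but falls below gene_A against epithelial-like normals, mirroring in miniature the kind of rank change reported for SM22.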