Introduction From a machine learning point of view, classification of gene expression patterns is a very particular task. Typically, the training data consist of few samples (a small number of experiments) but many variables (the expression levels measured in each experiment). In this setting, classical machine learning methods run into several difficulties. For instance, statistical models, particularly those with many parameters, may overfit the training data: they adapt to noise in the data rather than learning the desired phenomenon. Moreover, common machine learning methods do not provide an intuitive and biologically meaningful explanation of their results, although such explanations help users trust a computational analysis. In the research presented here, we try to cope with these two problems in the context of medical diagnosis.
The Gene Ontology (GO) We conjecture that both problems can be tackled by giving the classifier a biologically meaningful structure, i.e., by dividing the classification task into subtasks according to biological criteria. Structuring biological knowledge is one of the central goals of the Gene Ontology database. Biological terms related to molecular functions, biological processes and cellular components are collected in a directed acyclic graph in which each node represents a term and child terms are either parts ("part of") or specializations ("is a") of their parent terms. Moreover, genes are attributed to GO nodes according to their functions, their involvement in biological processes and their localization within the cell. We propose using this structure in a classifier as follows.
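As an illustration of the graph structure described above, the following minimal Python sketch represents a toy ontology and its gene annotations. The term and gene names are hypothetical placeholders, not real GO identifiers, and the sketch is not the R implementation mentioned later; it only shows how a node with several parents fits into a directed acyclic graph and how a bottom-up visiting order can be derived.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical toy ontology: each key is a term, each value lists its child terms.
# "term_c" has two parents, which a DAG allows but a tree would not.
children = {
    "root": ["term_a", "term_b"],
    "term_a": ["term_c"],
    "term_b": ["term_c"],
    "term_c": [],
}

# Hypothetical gene annotations: term -> genes attributed directly to it.
annotations = {
    "term_a": ["gene1", "gene2"],
    "term_b": ["gene3"],
    "term_c": ["gene2", "gene4"],
}

# TopologicalSorter treats the listed nodes as predecessors, so passing the
# child lists yields every child before its parents (a bottom-up order).
# It raises graphlib.CycleError if the graph is not acyclic.
bottom_up = list(TopologicalSorter(children).static_order())

# Only terms with annotated genes would receive a classifier of their own.
terms_with_genes = [term for term in bottom_up if annotations.get(term)]

print(bottom_up)
print(terms_with_genes)
```

A bottom-up order of this kind is convenient because results of child terms must be available before a parent term can collect them.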
GO driven classifier For each GO node with annotated genes, one classifier is trained with a standard machine learning method on the expression data of those genes. Each node obtains a weight according to the deviance of its classifier, and the results of child nodes are combined in their parents by weighted sums. In this manner, probabilities for each class are computed in each node. The overall classification result is provided by the root node's classifier.
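The weighted combination can be sketched in Python as follows. Several details are assumptions made only for this illustration, since the text does not specify them: the mapping from deviance to weight (here exp(-deviance)), the normalization of the weighted sum by the total weight, the inclusion of a node's own classifier output next to its children's results, and the presence of a classifier at every node. The node names, probabilities and deviances are hypothetical.

```python
import math

def deviance_to_weight(deviance):
    """Assumed mapping (not from the source): lower deviance gives a larger weight."""
    return math.exp(-deviance)

def combine(node, children, local_probs, weights, cache=None):
    """Return class probabilities for `node` as a normalized weighted sum of its
    own classifier output and the combined outputs of its children."""
    if cache is None:
        cache = {}
    if node in cache:  # a child shared by several parents is evaluated only once
        return cache[node]
    terms = [(weights[node], local_probs[node])]
    for child in children.get(node, []):
        child_probs = combine(child, children, local_probs, weights, cache)
        terms.append((weights[child], child_probs))
    total = sum(w for w, _ in terms)
    n_classes = len(local_probs[node])
    result = [sum(w * p[k] for w, p in terms) / total for k in range(n_classes)]
    cache[node] = result
    return result

# Hypothetical toy graph: parent -> child terms.
children = {"root": ["term_a", "term_b"], "term_a": ["term_c"], "term_b": ["term_c"]}

# Hypothetical per-node classifier outputs for one sample: [P(class 1), P(class 2)].
local_probs = {
    "root": [0.5, 0.5],
    "term_a": [0.9, 0.1],
    "term_b": [0.4, 0.6],
    "term_c": [0.8, 0.2],
}

# Hypothetical training deviances, turned into weights by the assumed mapping.
deviances = {"root": 1.2, "term_a": 0.3, "term_b": 0.9, "term_c": 0.5}
weights = {node: deviance_to_weight(d) for node, d in deviances.items()}

root_probs = combine("root", children, local_probs, weights)
print(root_probs)
```

Because each combined result is a normalized weighted sum of probability vectors, the values at every node, including the root, again sum to one.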
Rationale of classification In this procedure, each classifier bases its decision only on information related to the biological aspect it represents. Therefore, the rationale behind an overall classification result can be deduced from the results of the individual classifiers. Moreover, the weights determined during training indicate which biological aspects are deemed important for the classification task. Finally, partitioning the input variables among many classifiers mitigates the overfitting problem mentioned above.
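One way such a rationale could be read off is sketched below in Python. The function name `explain`, the report layout, and all term names, weights and probabilities are hypothetical; the sketch only assumes that each node exposes a weight and a vector of class probabilities.

```python
def explain(node_probs, weights, class_names, top_k=3):
    """List the top_k highest-weighted nodes with the class each one favours."""
    ranked = sorted(weights, key=weights.get, reverse=True)[:top_k]
    report = []
    for node in ranked:
        probs = node_probs[node]
        best = max(range(len(probs)), key=probs.__getitem__)  # index of the favoured class
        report.append((node, weights[node], class_names[best], probs[best]))
    return report

# Hypothetical per-node results for one sample.
node_probs = {
    "term_a": [0.9, 0.1],
    "term_b": [0.4, 0.6],
    "term_c": [0.8, 0.2],
}
weights = {"term_a": 0.74, "term_b": 0.41, "term_c": 0.61}

report = explain(node_probs, weights, ["class_1", "class_2"], top_k=2)
for node, weight, label, prob in report:
    print(f"{node}: weight={weight:.2f}, favours {label} (p={prob:.2f})")
```

Sorting by weight puts the biological aspects the trained model relies on most at the top, so a user can check them against domain knowledge first.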
First results and experiences Using an implementation in R, we have evaluated the method on a large dataset from a study on acute lymphoblastic leukemia, with the recognition of leukemia translocations as the classification task. This task has been shown to be rather simple, yielding recognition rates of 96% to 100% with sophisticated feature selection and support vector machines. First tests with our classifier have shown comparable recognition accuracies. In these tests, many node classifiers yield average or weak results, and only a few pinpoint the biological aspects that are important for classification.