Understanding regulatory genetic networks is an important step towards characterizing the genetic mechanisms underlying complex diseases. In cancer research, for example, where the identification of oncogenes and tumor suppressor genes plays a key role, knowledge of new potential oncogenes and their interactions with other molecules can help reveal the basic principles that govern the transformation of normal cells into malignant cancer cells.

We show that our approach of Bayesian inverse modeling can detect genes with such an oncogenic characteristic solely by statistically analyzing gene-expression patterns measured with DNA microarrays. The underlying probabilistic model is a Bayesian network, which encodes the multivariate probability distribution of a set of variables as a set of conditional probability distributions; the statistical dependencies are encoded in a graph structure. The learning procedure uses Bayesian statistics to find the network structure and the corresponding model parameters that best describe the probability distribution underlying the dataset. For gene-expression analysis, the nodes of the Bayes net represent genes and the edges represent causal relationships among them. We trained a Bayes net on a microarray dataset of different pediatric acute lymphoblastic leukemia (ALL) subtypes. The trained model can be used to generate new artificial microarray datasets. Moreover, by intervening in the model, for example by clamping one gene at a certain expression state and then sampling data from the model, we can simulate the effect of the intervention on the expression of all other genes. That is, we can predict the effect of the expression of a few genes on the global gene-expression pattern, which is related to cellular behavior.
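The clamping-and-sampling step can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the three-gene chain A -> B -> C, the binary expression states, and the conditional probabilities are hypothetical and are not taken from the trained ALL model. Clamped genes are held at their fixed state and ignore their parents, while all other genes are sampled from their conditional distributions in topological order.

```python
import random

# Hypothetical toy network A -> B -> C; each gene is 0 (low) or 1 (high).
# p_high maps a tuple of parent states to P(gene = 1 | parents).
# All numbers are illustrative assumptions, not learned values.
NETWORK = {
    "A": {"parents": (), "p_high": {(): 0.3}},
    "B": {"parents": ("A",), "p_high": {(0,): 0.1, (1,): 0.9}},
    "C": {"parents": ("B",), "p_high": {(0,): 0.2, (1,): 0.8}},
}
ORDER = ("A", "B", "C")  # a topological order of the graph


def sample_profile(rng, clamp=None):
    """Draw one artificial expression profile, holding clamped genes fixed."""
    clamp = clamp or {}
    state = {}
    for gene in ORDER:
        if gene in clamp:
            # Intervention: the gene ignores its parents and keeps the fixed state.
            state[gene] = clamp[gene]
            continue
        node = NETWORK[gene]
        key = tuple(state[p] for p in node["parents"])
        state[gene] = 1 if rng.random() < node["p_high"][key] else 0
    return state


def sample_dataset(n, clamp=None, seed=0):
    """Generate n artificial profiles, optionally under an intervention."""
    rng = random.Random(seed)
    return [sample_profile(rng, clamp) for _ in range(n)]


if __name__ == "__main__":
    free = sample_dataset(5000)
    clamped = sample_dataset(5000, clamp={"A": 1})
    print("fraction of C high, no intervention:", sum(r["C"] for r in free) / 5000)
    print("fraction of C high, A clamped high: ", sum(r["C"] for r in clamped) / 5000)
```

Comparing the two printed fractions shows how fixing one upstream gene shifts the expression of a downstream gene in the generated data.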

Bayesian inverse modeling can be defined as finding those genes that, when fixed at a certain expression level, affect the model such that the generated artificial microarray dataset shows the same properties as a measured cancer-specific dataset. In statistical terms, we estimate the probability that our model generates cancer-characteristic data given the fixed expression state of one or more genes; a high probability predicts the fixed genes to be oncogenic. For example, clamping the gene PBX1 to the overexpressed state leads our model to generate, with a probability of 0.96, a dataset that is characteristic of the ALL B-lineage subtype E2A/PBX1, which could indicate an oncogenic role of this gene in causing that subtype. Indeed, PBX1 is known to be converted into a potent oncogene by a chromosomal translocation, causing leukemia subtype E2A/PBX1. Besides PBX1, we found other genes that are either known to be oncogenes or involved in critical biological processes, such as ADPRT and PSMD10, which are both involved in DNA repair. Thus, with our generative model we are able to predict genes that have a potentially oncogenic characteristic.
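The inverse step can be sketched as a loop over candidate interventions, each scored by a Monte Carlo estimate of the probability that the clamped model generates subtype-like data. This is an illustrative sketch under stated assumptions, not the procedure used for the ALL data: the genes X, Y and Z, their probabilities, and the predicate `is_subtype_like` (a stand-in for "cancer-characteristic") are all hypothetical.

```python
import random

# Hypothetical toy model: gene X drives Y and Z; states are 0 (low) or 1 (high).
# All probabilities are illustrative assumptions.
PARENTS = {"X": (), "Y": ("X",), "Z": ("X",)}
P_HIGH = {
    "X": {(): 0.2},
    "Y": {(0,): 0.1, (1,): 0.9},
    "Z": {(0,): 0.1, (1,): 0.9},
}
ORDER = ("X", "Y", "Z")  # a topological order of the graph


def sample(rng, clamp):
    """Draw one profile with the genes in `clamp` held at fixed states."""
    state = {}
    for gene in ORDER:
        if gene in clamp:
            state[gene] = clamp[gene]
        else:
            key = tuple(state[p] for p in PARENTS[gene])
            state[gene] = int(rng.random() < P_HIGH[gene][key])
    return state


def is_subtype_like(profile):
    """Assumed stand-in for 'cancer-characteristic': Y and Z are both high."""
    return profile["Y"] == 1 and profile["Z"] == 1


def score_interventions(n=5000, seed=0):
    """Estimate P(subtype-like data | gene clamped to level) for each candidate."""
    rng = random.Random(seed)
    scores = {}
    for gene in ORDER:
        for level in (0, 1):
            hits = sum(is_subtype_like(sample(rng, {gene: level})) for _ in range(n))
            scores[(gene, level)] = hits / n
    return scores


if __name__ == "__main__":
    scores = score_interventions()
    for (gene, level), p in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(f"clamp {gene}={level}: estimated probability {p:.2f}")
```

In this toy setting the highest-scoring intervention is clamping the upstream gene X to the high state, which mirrors the idea that a high probability marks the clamped gene as a candidate driver of the characteristic pattern.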

Furthermore, since the graph structure of the model can be interpreted causally, it provides information about the interactions between potential oncogenes and other genes, which in turn can be interpreted as oncogenic regulation. The structure around PBX1 shows that it is a dominant gene that influences many others but is itself regulated by only one or a few other genes. This is again consistent with known biology, since PBX1 acts as a potent transcriptional activator, activating genes that are normally either not expressed or expressed at low levels.
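One simple way to make the notion of a dominant gene concrete is to compare, for each node of the directed graph, the number of children (genes it influences) with the number of parents (genes that regulate it). The sketch below assumes a hypothetical edge list and hypothetical thresholds; it does not reproduce the learned network around PBX1.

```python
from collections import Counter

# Hypothetical directed edges (regulator, target); not the learned network.
EDGES = [("R", "H"), ("H", "T1"), ("H", "T2"), ("H", "T3"), ("T1", "T2")]


def degree_table(edges):
    """Return {gene: (number of parents, number of children)}."""
    out_deg = Counter(src for src, _ in edges)
    in_deg = Counter(dst for _, dst in edges)
    genes = sorted({gene for edge in edges for gene in edge})
    return {gene: (in_deg[gene], out_deg[gene]) for gene in genes}


def dominant_genes(edges, min_children=3, max_parents=1):
    """Genes with many children and few parents (thresholds are assumptions)."""
    table = degree_table(edges)
    return [
        gene
        for gene, (n_parents, n_children) in table.items()
        if n_children >= min_children and n_parents <= max_parents
    ]


if __name__ == "__main__":
    print(degree_table(EDGES))
    print(dominant_genes(EDGES))
```

Here the node H has one parent and three children, so it is the only gene flagged as dominant under the assumed thresholds.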

Consequently, we show that our statistical, data-driven approach of Bayesian inverse modeling can be an efficient way to infer the pathogenic impact of individual genes and to reveal their interactions with other genes.