Computational Discovery of Gene Modules and Regulatory Networks

Georg K. Gerber1, Ziv Bar-Joseph2, Tong Ihn Lee, François Robert, D. Benjamin Gordon, Ernest Fraenkel, Itamar Simon, Tommi S. Jaakkola, Richard A. Young, David K. Gifford
1georg@mit.edu, Massachusetts Institute of Technology, Laboratory for Computer Science; 2georg@mit.edu, Massachusetts Institute of Technology, Laboratory for Computer Science

We introduce a new algorithm that is both efficient in combining information from large complementary data sets and robust, since it makes few assumptions about the underlying data. Our algorithm uses genomic expression data and transcription factor protein-DNA binding data to discover abstractions that we call modules. A module is a group of genes that are both co-expressed and bound by the same set of transcription factors. Importantly, the algorithm performs an efficient exhaustive search over all possible combinations of transcription factors implied by the protein-DNA interaction data, using a stringent criterion for determining binding. Once a set of genes bound by a common set of transcription factors is found, the algorithm finds a smaller subset of those genes that are co-expressed, which serves as a “seed” for the module. The algorithm then seeks to add further genes to the module that are similarly expressed and would be considered bound by the same set of transcription factors if a more relaxed binding criterion were used. We applied our module discovery algorithm to a collection of genomic binding experiments profiling 106 Saccharomyces cerevisiae transcription factors in rich media conditions and to a second data set of over 500 expression experiments profiling yeast cells under a variety of conditions. We use the discovered modules to build a regulatory network of transcription factors and modules, to label transcription factors as activators or repressors, and to identify patterns of combinatorial regulation. Further, we present a method for using modules to automatically build genetic regulatory sub-networks for specific biological processes, and use it to accurately reconstruct key elements of the rapamycin response and the cell cycle in yeast. Finally, we validate the quality of the results obtained with our module discovery algorithm by performing analyses using four independent data sources.
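
The three steps described above (a strict-threshold search over transcription factor combinations, selection of a co-expressed seed, and expansion under a relaxed binding threshold) can be illustrated with a minimal Python sketch. This is not the implementation used in this work: the function names, the p-value thresholds (0.001 and 0.01), the Pearson correlation cutoff, the cap on combination size, the use of the mean profile to pick the seed, and the toy data are all assumptions made for illustration only.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return 0.0 if vx == 0 or vy == 0 else cov / sqrt(vx * vy)


def mean_profile(genes, expr):
    """Average expression profile of a group of genes."""
    return [sum(col) / len(genes) for col in zip(*(expr[g] for g in genes))]


def discover_modules(binding_p, expr, strict=0.001, relaxed=0.01,
                     min_corr=0.8, max_tfs=2, min_seed=2):
    """Hypothetical sketch; all thresholds are illustrative assumptions."""
    tfs = sorted({tf for pvals in binding_p.values() for tf in pvals})
    modules = []
    # Step 1: enumerate TF combinations (capped at max_tfs in this sketch).
    for k in range(1, max_tfs + 1):
        for combo in combinations(tfs, k):
            bound = [g for g in binding_p
                     if all(binding_p[g].get(tf, 1.0) <= strict for tf in combo)]
            if len(bound) < min_seed:
                continue
            # Step 2: keep the co-expressed subset as the seed. Correlation
            # with the mean profile is a simple stand-in for seed selection.
            center = mean_profile(bound, expr)
            seed = [g for g in bound if pearson(expr[g], center) >= min_corr]
            if len(seed) < min_seed:
                continue
            # Step 3: add similarly expressed genes that pass the relaxed
            # binding threshold for every factor in the combination.
            center = mean_profile(seed, expr)
            members = set(seed)
            for g in binding_p:
                if g not in members \
                        and all(binding_p[g].get(tf, 1.0) <= relaxed for tf in combo) \
                        and pearson(expr[g], center) >= min_corr:
                    members.add(g)
            modules.append((combo, sorted(members)))
    return modules


# Toy data (invented): binding p-values per gene and factor, expression profiles.
binding_p = {
    "g1": {"TF_A": 0.0005, "TF_B": 0.0002},
    "g2": {"TF_A": 0.0008, "TF_B": 0.0009},
    "g3": {"TF_A": 0.004, "TF_B": 0.003},    # passes only the relaxed threshold
    "g4": {"TF_A": 0.0003, "TF_B": 0.0004},  # bound, but not co-expressed
    "g5": {"TF_A": 0.5, "TF_B": 0.6},        # co-expressed, but not bound
}
expr = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [1.1, 2.1, 2.9, 4.2],
    "g3": [0.9, 1.8, 3.2, 3.9],
    "g4": [4.0, 3.0, 2.0, 1.0],
    "g5": [1.0, 2.0, 3.0, 4.0],
}
modules = discover_modules(binding_p, expr)
```

On this toy input, g1 and g2 form the seed for each factor combination, g3 is admitted only because the relaxed binding threshold is applied together with the expression check, g4 is dropped despite strong binding because its profile is anti-correlated, and g5 is never added because it fails both binding thresholds. Requiring both lines of evidence at every step is the point of the sketch.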