A Statistical Model of Protein Sequences in Interaction Networks and Its Solution via Gibbs Sampling

David J. Reiss1, Benno Schwikowski2, Andrew F. Siegel, Stanley Fields
1dreiss@systemsbiology.org, Institute for Systems Biology; 2benno@systemsbiology.org, Institute for Systems Biology

Many cellular pathways utilize a set of compact binding domains that each bind to a particular structural motif on the surface of their ligands. These small "peptide recognition modules" have evolved to recognize their ligand peptides with a high degree of specificity. Several authors have attempted to predict such interactions computationally, with varying degrees of success. The difficulty lies in the fact that the network of interactions between these modules and their ligands may be complex. Moreover, the motifs describing the ligand peptides are often only loosely conserved as a whole, and do not strictly conform to any one prototypical consensus. This situation was observed in high-throughput two-hybrid and phage display experiments on SH3 domains in yeast (Tong et al, 2001).

While confirming that this computational task would be difficult, these types of high-throughput experiments still provide valuable information that can be used to assist the computational identification of binding sites and the prediction of interactions through sequence analysis. With this goal in mind, we have devised a statistical model that incorporates experimental interaction data and sequence data, and that utilizes informed priors, mixture models, and discrimination techniques. We use a Gibbs sampling algorithm to train the model on an experimentally-derived yeast two-hybrid SH3 binding domain interaction network (Tong et al, 2001). Our results reveal that the observed interaction networks can, to a large degree, be explained by our model, and we directly compare the results of our algorithm to a similar network derived via combinatorial chemistry methods (Tong et al, 2001). The performance of our method, as evaluated by cross-validation, is similar to that obtained by the phage display experiments.

Our technique simultaneously identifies the specific peptide motifs that show high affinity to each domain, as well as the most likely binding sites on the domain ligands. Once trained on a set of sequences and interactions, the resulting model may be used to derive statistically-robust estimates of the likelihood of binding between domains and new potential interactors. Such predictions may be used to further expand the networks of interactions, to suggest potential drug targets, and to more specifically direct further experiments, such as co-immunoprecipitation or protein mutagenesis.
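
To give a concrete sense of what Gibbs sampling over binding sites can look like, the following is a minimal sketch of a generic motif site sampler in Python. It is an illustration only and not the model described above: it assumes exactly one site of fixed width per ligand sequence, a uniform background over the 20 amino acids, and a flat pseudocount in place of informed priors, and it ignores the interaction network, the mixture components, and the discrimination step. All names (`build_pwm`, `window_weights`, `gibbs_site_sampler`) and parameter values are hypothetical.

```python
import math
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def build_pwm(sequences, sites, width, skip, pseudocount=1.0):
    """Per-position residue frequencies from every current site except `skip`."""
    counts = [{a: pseudocount for a in ALPHABET} for _ in range(width)]
    for i, (seq, start) in enumerate(zip(sequences, sites)):
        if i == skip:
            continue  # leave out the sequence whose site is being resampled
        for j in range(width):
            counts[j][seq[start + j]] += 1.0
    pwm = []
    for col in counts:
        total = sum(col.values())
        pwm.append({a: c / total for a, c in col.items()})
    return pwm


def window_weights(seq, pwm, width, background):
    """Motif-versus-background likelihood ratio for every window of `seq`."""
    weights = []
    for start in range(len(seq) - width + 1):
        log_ratio = sum(
            math.log(pwm[j][seq[start + j]] / background) for j in range(width)
        )
        weights.append(math.exp(log_ratio))
    return weights


def gibbs_site_sampler(sequences, width, n_sweeps=200, seed=0):
    """Resample one site per sequence, conditioned on all the other sites."""
    rng = random.Random(seed)
    background = 1.0 / len(ALPHABET)  # assumed uniform background
    sites = [rng.randrange(len(s) - width + 1) for s in sequences]
    for _ in range(n_sweeps):
        for i, seq in enumerate(sequences):
            pwm = build_pwm(sequences, sites, width, skip=i)
            weights = window_weights(seq, pwm, width, background)
            # Draw the new start position in proportion to its weight.
            sites[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return sites
```

Calling `gibbs_site_sampler(ligand_sequences, width=4)` on a list of uppercase peptide strings returns one sampled start position per sequence. Because the sampler is stochastic, a single run yields one draw rather than a point estimate; site probabilities would normally be estimated from many sweeps or several independent chains.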