Protein Superfamily Clustering using Biomedical Text Mining via the Information Bottleneck Method

Sahng-Joon Auh1, Jae-Hong Eom2, Byoung-Hee Kim, Byoung-Tak Zhang
1sjauh@bi.snu.ac.kr, Biointelligence Laboratory; 2jheom@bi.snu.ac.kr, Biointelligence Laboratory

We present a novel implementation of protein superfamily clustering from biomedical literature (MEDLINE abstracts) using the recently introduced information bottleneck method, which has shown good performance in document clustering. We first define 156 event verbs relating to gene/protein interaction. Then, given a joint empirical distribution of subject-proteins and object-proteins in the texts, p(X,Y), we cluster the object-proteins, Y, so that the resulting object-protein clusters, Y’, maximally preserve the information about the subject-proteins. The resulting joint distribution, p(X,Y’), contains most of the original information about the subject-proteins, I(X;Y’) ≈ I(X;Y), but is much less sparse and noisy. Using the same procedure, we then cluster the subject-proteins, X, so that the information about the object-protein clusters is preserved. Thus, we first find object-protein clusters that capture most of the mutual information about the set of subject-proteins, and then find subject-protein clusters that preserve the information about the object-protein clusters. We test this procedure on 1866 Saccharomyces cerevisiae proteins in the COGs (Clusters of Orthologous Groups of proteins) database of the NCBI (National Center for Biotechnology Information). The results are assessed by calculating the correlation between the protein clusters and the correct labels for these proteins. Our experiments suggest that this clustering method may also identify previously unknown protein groups beyond those recorded in the NCBI COGs data.
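
The two-stage procedure described above can be illustrated with a minimal sketch. It assumes a greedy agglomerative variant of the information bottleneck, in which the pair of clusters whose merge loses the least mutual information is merged at each step; the abstract does not state which variant was actually used. The function names (mutual_information, merge_columns, double_cluster) and the toy co-occurrence counts are hypothetical and serve only as illustration.

```python
import numpy as np

def mutual_information(p_xy):
    """Return I(X;Y) in bits for a joint distribution given as a 2-D array."""
    p_xy = np.asarray(p_xy, dtype=float)
    p_xy = p_xy / p_xy.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal over rows (X)
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal over columns (Y)
    nz = p_xy > 0                           # skip zero cells (0 * log 0 = 0)
    return float((p_xy[nz] * np.log2(p_xy[nz] / (p_x @ p_y)[nz])).sum())

def merge_columns(p_xy, n_clusters):
    """Greedily merge columns until n_clusters remain.

    At each step, the pair of columns whose merge keeps the most
    mutual information with the rows is merged (assumed greedy variant).
    """
    p = np.asarray(p_xy, dtype=float)
    p = p / p.sum()
    clusters = [[j] for j in range(p.shape[1])]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                merged = np.delete(p, b, axis=1)
                merged[:, a] = p[:, a] + p[:, b]
                info = mutual_information(merged)
                if best is None or info > best[0]:
                    best = (info, a, b, merged)
        _, a, b, p = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return p, clusters

def double_cluster(p_xy, n_y_clusters, n_x_clusters):
    """Cluster Y (columns) first, then cluster X (rows) against the Y clusters."""
    p_x_yc, y_clusters = merge_columns(p_xy, n_y_clusters)
    p_yc_xc, x_clusters = merge_columns(p_x_yc.T, n_x_clusters)
    return x_clusters, y_clusters, p_yc_xc.T

# Hypothetical co-occurrence counts: rows are subject-proteins,
# columns are object-proteins.
counts = np.array([
    [5, 4, 0, 0],
    [4, 5, 0, 0],
    [0, 0, 3, 6],
    [0, 0, 6, 3],
])
x_clusters, y_clusters, p_reduced = double_cluster(counts, 2, 2)
info_before = mutual_information(counts)
info_after = mutual_information(p_reduced)
```

In this toy case, the reduced 2 x 2 distribution is much denser than the original 4 x 4 table, and info_after can be compared with info_before to see how much information the clustering retains. A practical implementation would update the merge cost incrementally rather than recompute the full mutual information for every candidate pair.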