Use of hidden Markov models and phylogenetic algorithms to predict functionally distinct subclasses of chromodomains in different families of chromatin-modifying proteins

Khairina Tajul-Arifin (1), Rohan Teasdale (2), John S. Mattick
(1) k.tajularifin@imb.uq.edu.au, IMB, UQ; (2) r.teasdale@imb.uq.edu.au, IMB, UQ

The chromatin organization modifier domain, or chromodomain (CD), is a conserved domain found in a number of eukaryotic proteins involved in chromosome maintenance and remodelling. The CD is 60-90 amino acids long and has been proposed to act as a protein-protein interaction module, an RNA- and/or DNA-binding domain, and a recognition module for histone tail modifications. In this study, CD sequences from publicly available databases were clustered using phylogenetic methods, and distinct clusters were observed. A hidden Markov model (HMM) consensus sequence was created for each cluster and used to identify CD-containing proteins in several model eukaryotes. These CD-containing proteins can be classified into families according to the other domains they contain in addition to the CD, for example the histone acetyltransferase family, the histone methyltransferase family and the Retinoblastoma-binding protein family (which contains TUDOR and BRIGHT domains). We also found that the CDs of proteins within the same family are almost invariably of the same subclass and that, in most cases, each family contains a specific subclass of CD, indicating a strong correlation between CD subclass and the specific function of the proteins within the family. This demonstrates the utility of bioinformatics for identifying subtle variations within large datasets of protein domain sequences, which can be used to predict differences in domain function and/or specificity that are not obvious in existing domain databases.
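The idea of building one model per cluster and assigning a domain sequence to the best-scoring subclass can be illustrated with a minimal sketch. It is not the method used in this study: it swaps the profile HMM for a much simpler position-specific log-odds profile (no insert or delete states), assumes gap-free alignments of equal length and a uniform background distribution, and uses invented six-residue toy sequences. The names `build_profile`, `score_sequence`, `assign_subclass`, `subclass_A` and `subclass_B` are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background frequencies


def build_profile(aligned_seqs, pseudocount=1.0):
    """Build per-column log-odds scores from gap-free, equal-length sequences.

    A simplified stand-in for a profile HMM: match-state emissions only.
    """
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("sequences must be aligned to the same length")
    total = len(aligned_seqs) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in aligned_seqs)
        profile.append({
            aa: math.log(((counts[aa] + pseudocount) / total) / BACKGROUND)
            for aa in AMINO_ACIDS
        })
    return profile


def score_sequence(profile, seq):
    """Sum the log-odds scores of each residue at its aligned position."""
    if len(seq) != len(profile):
        raise ValueError("query length must match profile length")
    return sum(column[aa] for column, aa in zip(profile, seq))


def assign_subclass(profiles, seq):
    """Return the best-scoring subclass name and all subclass scores."""
    scores = {name: score_sequence(p, seq) for name, p in profiles.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical toy clusters; real CD alignments are far longer.
clusters = {
    "subclass_A": ["KVEYVL", "KVEFVL", "RVEYVL"],
    "subclass_B": ["GWDTSE", "GWDSSE", "GWETSE"],
}
profiles = {name: build_profile(seqs) for name, seqs in clusters.items()}

best, scores = assign_subclass(profiles, "KVEYIL")
print(best, scores)
```

In practice, a dedicated profile HMM package would handle insertions, deletions and sequence weighting, which this sketch ignores. The sketch only shows why a sequence resembling one cluster scores higher against that cluster's model than against the others.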