An Automated Procedure to Create a Protein Structure Family Database and Application to Whole-Genome Annotation

Kenneth J Kelly1
1kjk@chemcomp.com, Chemical Computing Group Inc

It has become evident that pre-clustered family databases of protein sequences and structures offer substantial advantages in homology identification and genome annotation. Such databases provide pre-built family profiles, such as those that assist PSI-BLAST, and they make it practical to use Z-scores to test homology hypotheses with confidence in the twilight zone of sequence identity by exploiting tentative transitivity assumptions. Given the ever-increasing pace of genome sequencing, it has become important to find a methodology capable of building protein classification databases from all of the latest data, both public and private, in a timely and unsupervised manner. The automated clustering method described here can create a Protein Family Database from the contents of the Protein Data Bank, along with non-redundant sequence data from the PIR-NREF database, in less than a week. The method was implemented in MOE, and the cited performance was achieved on a modestly sized computing cluster using MOE's built-in scalable multiple-processing (SMP) functions. Whole-genome annotation tests on several completely sequenced genomes have produced results superior to those obtained with standard annotation tools (e.g., BLAST), and at least as good as those obtained with PSI-BLAST. The database and the source code required to rebuild and maintain it are part of the Molecular Operating Environment, available from the Chemical Computing Group, Inc. (info@chemcomp.com).
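As an illustration of the Z-score idea mentioned above, the following is a minimal sketch of one common way to estimate a Z-score for a pairwise comparison: score the pair, then compare that score with the distribution of scores obtained against shuffled copies of the target. It is not the procedure implemented in MOE, which the abstract does not specify. The function names `identity_score` and `z_score`, the toy ungapped identity score, and the default of 200 shuffles are all assumptions made for this example; a real pipeline would substitute a proper alignment score.

```python
import random
import statistics


def identity_score(a: str, b: str) -> int:
    """Toy similarity score (an assumption for this sketch): the number of
    identical residues when the two sequences are compared position by
    position from the left, without gaps."""
    return sum(x == y for x, y in zip(a, b))


def z_score(query: str, target: str, score=identity_score,
            n_shuffles: int = 200, seed: int = 0) -> float:
    """Estimate how many standard deviations the observed score lies above
    the mean score against composition-preserving shuffles of the target."""
    rng = random.Random(seed)  # fixed seed keeps the estimate reproducible
    observed = score(query, target)

    letters = list(target)
    background = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # same residue composition, random order
        background.append(score(query, "".join(letters)))

    mu = statistics.mean(background)
    sigma = statistics.stdev(background)
    if sigma == 0:
        # Every shuffle scored the same, so no Z-score can be estimated.
        return 0.0
    return (observed - mu) / sigma


if __name__ == "__main__":
    seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
    print(z_score(seq, seq))
```

A score that is high in absolute terms but also high against shuffled sequences (for example, because of biased composition) yields a low Z-score, which is why Z-scores are useful where sequence identity alone is ambiguous.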