New Datasets for Structural Data Mining Studies.

Carmen K. Chu¹, Merridee A. Wouters²
¹cchu@cse.unsw.edu.au, Computational Biology and Bioinformatics Program, Victor Chang Cardiac Research Institute; ²m.wouters@victorchang.unsw.edu.au, Computational Biology and Bioinformatics Program, Victor Chang Cardiac Research Institute

Structural data mining studies often attempt to deduce general principles of protein structure from solved structures deposited in the Protein Data Bank (PDB). The entire database is unsuitable for such studies because it is not representative of the ensemble of protein folds. Given that novel folds continue to be unearthed, some folds are currently unrepresented in the PDB, while others are overrepresented. Overrepresentation can easily be avoided by filtering the dataset. In the past, representative subsets of the PDB have been derived by sequence comparison. Specifically, structures whose sequences exhibit a pairwise sequence identity above a threshold value are weeded from the dataset. Although the length criteria for pairwise alignments have a structural basis, this automated method of pruning is essentially sequence-based and has been criticized on the grounds that some folds remain overrepresented. Here we investigate this claim by comparing the sequence-derived dataset PDB_SELECT with the structural database SCOP. We show that some folds do remain overrepresented in the PDB_SELECT dataset. By filtering the dataset, we obtain a new subset, approximately one quarter the size of the original PDB_SELECT list, whose members share less than 25% pairwise sequence identity and are each the unique representative of their protein fold. We also discuss the possibility of using unique representatives of SCOP at the fold level as a representative dataset.
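The fold-level filtering described above can be illustrated with a minimal sketch. It assumes each entry in a sequence-filtered list has already been assigned a fold label, and it keeps the first entry seen for each fold. The identifiers, fold labels, and function names below are hypothetical placeholders, and the "first encountered" rule is an assumption for illustration, not the selection criterion used in this work.

```python
from collections import Counter


def fold_counts(entries):
    """Count how many entries in the list map to each fold label."""
    return Counter(fold for _, fold in entries)


def one_per_fold(entries):
    """Keep only the first entry encountered for each fold label.

    Assumption: "first encountered" stands in for whatever quality
    criterion would be used to choose a representative in practice.
    """
    seen = set()
    kept = []
    for identifier, fold in entries:
        if fold not in seen:
            seen.add(fold)
            kept.append((identifier, fold))
    return kept


# Hypothetical (identifier, fold label) pairs; not real PDB or SCOP data.
entries = [
    ("chainA", "fold1"),
    ("chainB", "fold1"),
    ("chainC", "fold2"),
    ("chainD", "fold3"),
    ("chainE", "fold1"),
]

# Folds that appear more than once are overrepresented in the input list.
overrepresented = {f: n for f, n in fold_counts(entries).items() if n > 1}

# The filtered list has exactly one entry per fold.
representatives = one_per_fold(entries)
```

In this toy input, one fold label occurs three times, so the filtered list is shorter than the input while still covering every fold present.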