A hybrid clustering approach to genome-scale recognition of protein families

Timothy J. Harlow1, J. Peter Gogarten2, Mark A. Ragan
1t.harlow@imb.uq.edu.au, Institute for Molecular Bioscience, University of Queensland; 2gogarten@uconn.edu, University of Connecticut

The comprehensive, automated classification of proteins into similarity groups is an important but difficult challenge in post-genomic bioinformatics. Here we present a hybrid approach to recognizing protein families in very large (multi-genomic) datasets, based on successive Markov and single-linkage clustering of normalized pairwise BLASTP bit scores. Family members too divergent to be recognized by pairwise sequence comparison will not be grouped by this approach. For family members similar enough to be recognized by sequence comparison, our approach retains the advantages of single-linkage clustering (e.g. threshold-ordered information on edge strength and cluster membership) while using the power of Markov clustering to avoid indiscriminate clustering. We present results based on all conceptually translated proteins from 114 microbial genomes, and demonstrate the utility of the method in recognizing orthologs and paralogs of rotary motor ATP synthetase F1 subunit proteins.
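
To illustrate what "threshold-ordered information on edge strength and cluster membership" can mean in practice, the following is a minimal sketch of the single-linkage step alone, not the implementation used in this work. It assumes the pairwise scores have already been normalized to the range 0 to 1 (the normalization itself is not shown), it omits the Markov clustering step entirely, and the protein labels, scores, and function names (`single_linkage`, `clusters_by_threshold`) are hypothetical.

```python
def single_linkage(edges, threshold):
    """Group proteins linked by any chain of edges scoring >= threshold.

    `edges` is a list of (protein_a, protein_b, score) tuples, where
    `score` is assumed to be an already-normalized similarity score.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, score in edges:
        # Register both proteins even if the edge is too weak to merge them.
        root_a, root_b = find(a), find(b)
        if score >= threshold and root_a != root_b:
            parent[root_a] = root_b

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)
    # Largest clusters first, ties broken alphabetically.
    return sorted((sorted(c) for c in clusters.values()),
                  key=lambda c: (-len(c), c))


def clusters_by_threshold(edges, thresholds):
    """Cluster membership at each threshold, from strictest to most lenient."""
    return {t: single_linkage(edges, t)
            for t in sorted(thresholds, reverse=True)}


# Hypothetical normalized scores for five made-up proteins.
edges = [
    ("A", "B", 0.9),
    ("B", "C", 0.6),
    ("C", "D", 0.2),
    ("D", "E", 0.8),
]

for threshold, clusters in clusters_by_threshold(edges, [0.7, 0.5, 0.1]).items():
    print(threshold, clusters)
```

In this toy example, lowering the threshold from 0.7 to 0.5 merges C into the A/B group, and at 0.1 the weak C-D edge chains all five proteins into one cluster. That chaining through a single weak edge is the kind of indiscriminate clustering that single-linkage alone is prone to.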