Protein domain extraction by quasi-convex set functions

HwaSeob Yun1, Casimir Kulikowski2, Ilya Muchnik
1seabee@cs.rutgers.edu, Rutgers University; 2kulikows@cs.rutgers.edu, Rutgers University

We describe a fast and fully automatic procedure to extract protein domains from single query sequences. Unlike PFAM and PRODOM, which require pre-calculation of domain statistics, our method searches large databases of functionally annotated proteins by BLAST, and uses its hits to segment sub-molecular fragments with an efficient dynamic programming procedure. We associate a domain with a subset of hit images having the longest pair-wise overlaps as a group. This task of domain identification can be formulated as the optimization of a quasi-convex function. The criterion value for a subset is the minimum of estimates calculated for every image in the subset. The estimate is the sum of lengths of the image overlaps with other images drawn from the same subset. The subset which yields the maximum of the function we denote as a cluster-domain. After a cluster-domain is found, its elements are removed from the set and the procedure is performed iteratively on the remaining elements of the set until all remaining elements fall into the same cluster. Every cluster-domain determines an intersection of its hit images. This intersection, or the union of overlapping intersections originating from several clusters, is called a covered domain. The sequence fragments between covered domains are the uncovered domains. From a biological point of view, covered domains represent the most common and conserved regions of the protein sequence and the uncovered ones are the most unique and, probably, the least conservative domains. Length constraints must also be added to filter out those which are biologically less plausible. Calculation of the criterion is simple and fast, and because the function is quasi-convex, its optimal value can be found very efficiently. Our procedure also finds the total number of clusters automatically, which is one of the hardest challenges for clustering.