Identifying Bacterial Outer Membrane Proteins using Frequent Subsequences - A Data Mining Approach

Rong She1, Fei Chen2, Ke Wang, Martin Ester, Jennifer L. Gardy, Fiona S.L. Brinkman
1rshe@cs.sfu.ca, School of Computing Science, Simon Fraser University; 2fchena@cs.sfu.ca, School of Computing Science, Simon Fraser University

Outer membrane proteins (OMPs) are a class of proteins that reside in the outer membrane of Gram-negative bacterial cells. Identifying OMPs is of medical importance because they are exposed at the bacterial surface and are therefore the most accessible drug targets. Because studying such proteins in the lab is time-consuming, OMP predictors need to be more precise than currently available methods. We adopted a data mining approach and developed OMP predictors based on subsequences (runs of consecutive amino acids) that appear frequently in OMPs. Two methods are used to automatically search for combinations of frequent subsequences that best distinguish OMPs from non-OMPs. The first, based on support vector machines (SVM), aims at high precision in OMP prediction. The second, based on association-rule-based classification (ARC), yields OMP patterns that can be used for biological analysis while still achieving reasonably high precision. In SVM classification, frequent subsequences are used as features; in ARC classification, frequent patterns of the form *X*X* (where each X is a frequent subsequence and each * is a gap standing for one or more amino acids) are used as classification rules. We created the largest dataset of Gram-negative bacterial proteins with experimentally determined subcellular localizations (available at http://www.psort.org/dataset) and used it to evaluate our methods. Both algorithms outperform the current state-of-the-art OMP predictor in the biological domain, which is based on a hidden Markov model (HMM). Our SVM classifier achieved the best performance, with a precision of 98% and a recall of 81%, whereas the HMM obtained a precision of 64% and a recall of 71% on the same dataset. Our ARC classifier obtained a precision of 90% and a recall of 60%, and it provides biologists with explicit descriptions of patterns that will be useful for further biological analysis of OMPs.
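
To make the two building blocks above concrete, the following is a minimal sketch, not the authors' implementation. It assumes fixed-length subsequences, a support threshold expressed as a fraction of sequences, and a regular-expression encoding of the *X*X* pattern form; the function names and toy sequences are hypothetical and invented for illustration only. The SVM training and ARC rule-selection steps are not shown.

```python
import re
from collections import Counter


def frequent_subsequences(sequences, k, min_support):
    """Return length-k substrings found in at least `min_support`
    (a fraction between 0 and 1) of the given sequences.

    Assumption: a fixed length k keeps the sketch short; a real miner
    would typically consider many lengths.
    """
    counts = Counter()
    for seq in sequences:
        # Use a set so each substring is counted once per sequence.
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    threshold = min_support * len(sequences)
    return {sub for sub, n in counts.items() if n >= threshold}


def compile_gap_pattern(subsequences):
    """Compile a pattern of the form *X1*X2*...*Xn*.

    Each * gap is encoded as `.+`, i.e. one or more amino acids,
    following the gap definition in the abstract.
    """
    body = ".+".join(re.escape(x) for x in subsequences)
    return re.compile(".+" + body + ".+")


def matches(pattern, seq):
    """True if the whole sequence fits the gapped pattern."""
    return pattern.fullmatch(seq) is not None


# Toy sequences, invented for illustration (not real proteins).
toy_positive = ["MKAGFTYVNL", "QAGFSSYVNA", "TAGFKLYVNR"]

frequent = frequent_subsequences(toy_positive, k=3, min_support=1.0)
rule = compile_gap_pattern(["AGF", "YVN"])
```

Used as an ARC-style rule, a sequence matching `rule` would be predicted as an OMP; used as SVM features, each member of `frequent` would become one binary (present or absent) dimension of a sequence's feature vector. Both readings are illustrative interpretations of the abstract rather than details it confirms.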