EPP: Eukaryotic Promoter Prediction system using an efficient training approach

Sang-Soo Yeo1, Sung-Kwon Kim2, Jung-Won Rhee, Kyoung-Rak Na
1ssyeo@alg.cse.cau.ac.kr, Chung-Ang University, Seoul, Republic of Korea; 2skkim@cau.ac.kr, Chung-Ang University, Seoul, Republic of Korea

EPP is a promoter prediction system for eukaryotic genes. Various promoter prediction systems have been developed so far; however, most of them target vertebrates only rather than eukaryotes in general, because a broader prediction domain usually degrades the performance of a prediction system. EPP is designed around a new training approach. Before the prediction modules are trained, the promoter dataset (positive training set) is divided into many clusters (30 to 40), and the non-promoter dataset (negative training set) is likewise divided into clusters. Each cluster is trained separately to produce its own decision model, so the prediction module of EPP ultimately consists of many smaller decision models. This approach raises the sensitivity of promoter prediction and allows the prediction domain to be broadened to eukaryotes. It also improves the specificity, that is, the capability of separating out non-promoter sequences, and thus reduces false positives. The positive training set of EPP consists of 2997 promoter sequences from EPD Release 74, and the negative set consists of about 3000 non-overlapping exon and intron sequences from GenBank. EPP can be accessed freely at http://epp.cau.ac.kr
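The cluster-then-train idea described above can be illustrated with a minimal sketch. The abstract does not specify the sequence features, the clustering algorithm, or the type of decision model, so everything below is an assumption made for illustration: dinucleotide frequencies as features, a plain k-means for clustering, and a nearest-centroid rule standing in for each per-cluster decision model. The function names (`kmer_profile`, `kmeans`, `train_cluster_models`, `is_promoter`) and the toy sequences are hypothetical and are not taken from EPP.

```python
import random
from itertools import product


def kmer_profile(seq, k=2):
    """Normalized k-mer frequency vector over the ACGT alphabet (assumed feature)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip words containing N or other symbols
            counts[word] += 1
    total = sum(counts.values()) or 1
    return [counts[w] / total for w in kmers]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, n_clusters, iters=20, seed=0):
    """Plain Lloyd's k-means; returns the cluster centroids."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, n_clusters)]
    for _ in range(iters):
        groups = [[] for _ in centroids]
        for v in vectors:
            nearest = min(range(len(centroids)),
                          key=lambda j: sq_dist(v, centroids[j]))
            groups[nearest].append(v)
        for j, members in enumerate(groups):
            if members:  # keep the old centroid if a cluster went empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def train_cluster_models(positives, negatives, n_pos, n_neg):
    """Cluster each training set separately; one small model per cluster.

    Each 'model' here is just (centroid, label), a stand-in for whatever
    decision model is actually trained on a cluster.
    """
    pos_vecs = [kmer_profile(s) for s in positives]
    neg_vecs = [kmer_profile(s) for s in negatives]
    models = [(c, True) for c in kmeans(pos_vecs, n_pos)]
    models += [(c, False) for c in kmeans(neg_vecs, n_neg)]
    return models


def is_promoter(seq, models):
    """Label a sequence with the class of the closest per-cluster model."""
    vec = kmer_profile(seq)
    _, label = min(models, key=lambda m: sq_dist(vec, m[0]))
    return label


# Toy data for illustration only; not real EPD or GenBank sequences.
toy_promoters = ["TATAAAAGGC", "TATATAAACG", "CCAATCTATA", "GGCCAATCAG"]
toy_background = ["GCGCGGCGCC", "CGCGCCGGCG", "ACGTACGTAC", "TGCATGCATG"]
toy_models = train_cluster_models(toy_promoters, toy_background, n_pos=2, n_neg=2)
print(is_promoter("TTATAAAAGC", toy_models))
```

With training sets of the size reported above, `n_pos` would be set somewhere in the stated 30 to 40 range. The point of the sketch is only the structure: the final predictor is a collection of small models, each fitted to one homogeneous cluster, rather than a single model fitted to the whole heterogeneous set.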