An evaluation of new criteria for CpG islands in the human genome as gene markers

Patrick Yong Wang1, Frederick C. Leung2
1wangyong@hkusua.hku.hk, HKU, Dept of Zoology; 2fcleung@hkucc.hku.hk, HKU, Dept of Zoology

Recently, more stringent criteria for CpG islands (length ≥ 500 bp, G + C content ≥ 55%, and CpG o/e ratio ≥ 0.65) have been introduced. Using these new criteria, we investigated several types of association between CpG islands and genes to further establish the importance of CpG islands as gene markers. CpG islands in the human contigs were identified with a Java program, CpGIE (www.hku.hk/zoology/fc_leung). Genes labeled with different evidence codes were located in the contigs, and their association with the CpG islands was checked. According to our results, more than 70% of the identified CpG islands were associated with genes. Furthermore, our investigation of genes with an evidence code of C (confirmed gene model) showed that 56.1% of the genes were associated with CpG islands in association type A0 (a CpG island overlaps the promoter of a gene), and that up to 73% of the genes had a CpG island. These results demonstrate that CpG islands are valuable gene markers, and that their presence is also a very reliable indicator of the location of gene promoters. For genes in the evidence code groups C or ?, association type A0 was identified more frequently than the other association types. In the evidence code groups P and PE (predicted gene model with GenomeScan), on the other hand, the genes were not obviously associated with CpG islands, suggesting that the GenomeScan program failed to exclude many false-positive predictions when the sequence similarity was not high.
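
The three thresholds quoted above can be illustrated with a minimal sketch that tests a single candidate sequence. This is not the CpGIE algorithm, which the abstract does not describe. The function names (cpg_stats, meets_criteria) are hypothetical. The o/e ratio is assumed to follow the commonly used definition (number of CpG dinucleotides × sequence length) / (number of C × number of G), and each threshold is assumed to be an inclusive lower bound.

```python
def cpg_stats(seq):
    """Return (length, G+C fraction, CpG observed/expected ratio) for a DNA string."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so this counts every CpG
    gc_fraction = (c + g) / n if n else 0.0
    # Assumed definition: o/e = (#CpG * N) / (#C * #G); 0.0 when undefined
    oe_ratio = (cpg * n) / (c * g) if c and g else 0.0
    return n, gc_fraction, oe_ratio


def meets_criteria(seq, min_len=500, min_gc=0.55, min_oe=0.65):
    """Check a candidate region against the three thresholds (assumed inclusive)."""
    n, gc_fraction, oe_ratio = cpg_stats(seq)
    return n >= min_len and gc_fraction >= min_gc and oe_ratio >= min_oe


if __name__ == "__main__":
    # Synthetic sequences for illustration only, not real genomic data
    print(meets_criteria("CG" * 250))              # long, GC-rich, CpG-rich
    print(meets_criteria("C" * 250 + "G" * 250))   # GC-rich but almost no CpG
```

A genome-wide search would also need a strategy for sliding and extending windows along each contig, which this sketch deliberately leaves out.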