On the Sequence Pattern Distribution in Splice Junctions: An Analysis Using Information Theoretic and Machine Learning Approaches

Christina Zheng¹, Virginia R de Sa², Michael Gribskov, T. Murlidharan Nair
¹nair@sdsc.edu, UCSD SDSC; ²desa@cogsci.ucsd.edu, UCSD

Recognizing precise splice junctions is a challenge in the analysis of newly sequenced genomes. The challenge is compounded by the fact that the distribution of sequence patterns in these regions is not always distinct. To understand the sequence signatures at splice junctions, a neural network based calliper randomization approach and an information theoretic feature selection approach were used to analyze the sequences in this region. The aim was to identify the regions that harbor information content and to extract elements that are relevant for splice site prediction. Results: Analysis of the splice junction sequences with the neural network based calliper randomization approach reveals the regions that are important in the internal representation of the network model. Analysis of the same region with the feature selection approach reveals a subset of features in which the information is concentrated. Comparing the results of the two methods helps to infer the kind of information present in the region.
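
As a rough illustration of what "regions that harbor information content" can mean, the sketch below computes a per-position Shannon information profile over a set of aligned sequence windows. It is only an assumed, generic measure (maximum entropy of the alphabet minus the observed column entropy, with no small-sample correction), not necessarily the measure or feature selection procedure used in this work. The function name `position_information` and the toy sequences are hypothetical.

```python
import math
from collections import Counter


def position_information(seqs, alphabet="ACGT"):
    """Return the information content in bits for each aligned column.

    Assumption: information = log2(|alphabet|) - Shannon entropy of the
    column, with no small-sample correction. Symbols outside `alphabet`
    are ignored.
    """
    if not seqs:
        raise ValueError("need at least one sequence")
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must all have the same length")

    max_bits = math.log2(len(alphabet))
    profile = []
    for i in range(length):
        counts = Counter(s[i] for s in seqs)
        total = sum(counts[b] for b in alphabet)
        if total == 0:
            # No usable symbols in this column: report no information.
            profile.append(0.0)
            continue
        entropy = 0.0
        for b in alphabet:
            if counts[b]:
                p = counts[b] / total
                entropy -= p * math.log2(p)
        profile.append(max_bits - entropy)
    return profile


# Hypothetical toy windows, made up for illustration only (not real data).
toy = ["AGGTAAGT", "CGGTGAGT", "TGGTAAGA", "AGGTCAGC"]
profile = position_information(toy)
for pos, bits in enumerate(profile):
    print(f"position {pos}: {bits:.2f} bits")
```

Columns where every window carries the same base reach the 2-bit maximum for a four-letter alphabet, while variable columns score lower. A profile of this kind is one simple way to see where sequence information concentrates around a junction.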