Detection of Program Source Code Plagiarism Using Genomic Sequence Alignments Methodology

Eun-Mi Kang [1], Hwan-Gue Cho [2], Young-Min Kang
[1] emkang@pearl.cs.pusan.ac.kr, Pusan National University; [2] hgcho@pusan.ac.kr, Pusan National University

The syntactic and semantic characteristics of a computer program can be represented by the keyword sequence extracted from its source code. The similarity and difference between two programs can therefore be determined by comparing the keyword sequences obtained from their source codes. Methods for measuring the similarity of two sequences have already been studied intensively in bioinformatics, where they are used to analyze genomic sequences. In this paper, we propose a new method that measures the similarity of two source codes and detects plagiarism by exploiting sequence alignment techniques. The proposed method detects the similarity of two keyword sequences as follows: 1) The system extracts keyword sequences from the source codes by considering their function call structures. 2) Using local alignment techniques, the system then recursively finds aligned substrings whose lengths exceed a given threshold. The local alignment is performed with a score matrix that considers the property of each keyword. 3) In addition to the alignment-based similarity computation, the system computes the structural similarity between the two source codes by taking their function call structures into account. The system exploits this additional similarity information to improve the precision of plagiarism detection.

To evaluate the performance of the proposed method, we experimented with hundreds of program source codes submitted by 70 students attending the ‘Data Structure’ course at Pusan National University. In our experiments, 215 source codes were given to the system: 38 were intentionally plagiarized programs and the other 177 were original. The proposed system detected 26 of the 38 plagiarized codes (about 68%) as obvious plagiarism, and flagged only one of the 177 original codes as plagiarism. The experimental results show that the proposed method is more efficient and powerful than the common fingerprinting method, which cannot detect partial plagiarism.
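The keyword extraction and local alignment steps described above can be illustrated with a minimal Python sketch. It is not the system evaluated in this paper. The keyword set `KEYWORDS`, the weights in `score`, the gap penalty `GAP`, and the function names are all assumptions chosen for illustration. The sketch uses a standard Smith-Waterman local alignment, ignores function call structure, and finds only the single best aligned region rather than recursively collecting every region above a threshold.

```python
import re

# Hypothetical keyword classes; the paper's actual keyword list is not given here.
CONTROL = {"if", "else", "for", "while", "do", "switch", "case",
           "break", "continue", "return"}
KEYWORDS = CONTROL | {"int", "char", "float", "double", "void", "struct"}

GAP = -1  # assumed gap penalty


def extract_keywords(source):
    """Return the keywords of `source` in order of appearance."""
    return [tok for tok in re.findall(r"[A-Za-z_]\w*", source)
            if tok in KEYWORDS]


def score(a, b):
    """Assumed score matrix: control keywords count more than type keywords."""
    if a != b:
        return -1
    return 2 if a in CONTROL else 1


def local_align(x, y):
    """Smith-Waterman local alignment of two keyword sequences.

    Returns (best score, aligned region of x, aligned region of y).
    """
    rows, cols = len(x) + 1, len(y) + 1
    h = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            h[i][j] = max(0,
                          h[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                          h[i - 1][j] + GAP,
                          h[i][j - 1] + GAP)
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    end_i, end_j = best_pos
    i, j = best_pos
    while i > 0 and j > 0 and h[i][j] > 0:
        if h[i][j] == h[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
            i, j = i - 1, j - 1
        elif h[i][j] == h[i - 1][j] + GAP:
            i -= 1
        else:
            j -= 1
    return best, x[i:end_i], y[j:end_j]


if __name__ == "__main__":
    a = "int f(int n){ int s = 0; for(;;){ if(n) s++; else break; } return s; }"
    b = "void g(){ char c; for(;;){ if(c) c++; else break; } return; }"
    print(local_align(extract_keywords(a), extract_keywords(b)))
```

In this sketch the two functions differ in names and declared types, yet their shared control-flow keywords still form one high-scoring aligned region. Detecting copied fragments inside otherwise different programs in this way is what local alignment offers over whole-file fingerprinting.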