Measuring similarity between Karel programs using character and word n-grams
We present a method for measuring the similarity between source code programs. We approach this task from a machine learning perspective, using character and word n-grams as features and comparing several machine learning algorithms. We also examine how much latent semantic analysis contributes to the task. To evaluate the proposed approach, we developed a corpus of around 10,000 programs written in the Karel programming language that solve 100 different tasks. The results show that the highest classification accuracy is achieved by a Support Vector Machine classifier combined with latent semantic analysis and word trigrams as features.
Keywords: machine learning, similarity, Karel programming language, character n-grams, word n-grams, SVM, LSA
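To make the feature types named in the abstract concrete, the following is a minimal sketch of character and word n-gram extraction, with a plain cosine similarity between the resulting count vectors. It is not the authors' pipeline: the abstract describes a classifier trained on these features (SVM with latent semantic analysis), whereas this sketch only compares two programs directly. The function names and the Karel-like snippets are assumptions made up for illustration.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n):
    """Count overlapping character n-grams in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def word_ngrams(text, n):
    """Count overlapping n-grams of whitespace-separated tokens."""
    tokens = text.split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical Karel-like snippets, invented for this example only.
prog_a = "move ; move ; turnleft ; move ; pickbeeper ; turnoff"
prog_b = "move ; move ; turnleft ; move ; putbeeper ; turnoff"
prog_c = "turnleft ; turnleft ; turnleft ; turnoff"

# Word trigrams: the feature type the abstract reports as most effective.
sim_ab = cosine(word_ngrams(prog_a, 3), word_ngrams(prog_b, 3))
sim_ac = cosine(word_ngrams(prog_a, 3), word_ngrams(prog_c, 3))

# Character trigrams over the same pair, for comparison.
char_sim_ab = cosine(char_ngrams(prog_a, 3), char_ngrams(prog_b, 3))

print(sim_ab, sim_ac, char_sim_ab)
```

In this toy case, the two programs that differ in a single instruction share most of their word trigrams and score higher than the unrelated pair. In a setup like the one the abstract describes, such count vectors would serve as classifier input rather than being compared pairwise.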