Programming and Computer Software, Volume 43, Issue 1, pp 47–50

Measuring similarity between Karel programs using character and word n-grams

  • G. Sidorov
  • M. Ibarra Romero
  • I. Markov
  • R. Guzman-Cabrera
  • L. Chanona-Hernández
  • F. Velásquez


We present a method for measuring similarity between source code programs. We approach this task from a machine learning perspective, using character and word n-grams as features and comparing different machine learning algorithms. We also examine how much latent semantic analysis (LSA) contributes to the task. To evaluate the proposed approach, we developed a corpus of around 10,000 programs written in the Karel programming language that solve 100 different tasks. The results show that the highest classification accuracy is achieved with a Support Vector Machines (SVM) classifier, with latent semantic analysis applied and word trigrams used as features.
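The two feature types named in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the Karel-like snippets, the whitespace tokenization, and the helper names `char_ngrams`, `word_ngrams`, and `cosine` are all assumptions made for this example, and plain cosine similarity over n-gram counts stands in for the classifiers and latent semantic analysis that the paper actually evaluates.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Count overlapping character n-grams in a string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def word_ngrams(text, n):
    """Count overlapping word n-grams, splitting on whitespace (an assumption)."""
    tokens = text.split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical Karel-like programs, invented for illustration only.
prog_a = "move ; move ; turnleft ; move ; pickbeeper ;"
prog_b = "move ; move ; turnleft ; move ; putbeeper ;"
prog_c = "turnleft ; turnleft ; turnleft ; turnoff ;"

# Word trigrams: the feature type the abstract reports as most accurate.
sim_ab = cosine(word_ngrams(prog_a, 3), word_ngrams(prog_b, 3))
sim_ac = cosine(word_ngrams(prog_a, 3), word_ngrams(prog_c, 3))

# Character trigrams over the same pair, for comparison.
sim_ab_chars = cosine(char_ngrams(prog_a, 3), char_ngrams(prog_b, 3))

print(sim_ab, sim_ac, sim_ab_chars)
```

In this sketch the two programs that differ in a single instruction score higher than the unrelated pair. In a setup like the one the abstract describes, such count vectors would instead be passed to a classifier, optionally after a dimensionality reduction step such as LSA.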


machine learning, similarity, Karel programming language, character n-grams, word n-grams, SVM, LSA





Copyright information

© Pleiades Publishing, Ltd. 2017

Authors and Affiliations

  1. Instituto Politécnico Nacional (IPN), Center for Computing Research (CIC), Mexico City, Mexico
  2. Engineering Division, University of Guanajuato, Campus Irapuato-Salamanca, Guanajuato, Mexico
  3. Instituto Politécnico Nacional, School of Mechanical and Electrical Engineering (ESIME), Mexico City, Mexico
  4. Polytechnic University of Queretaro, Queretaro, Mexico
