Knowledge and Information Systems, Volume 43, Issue 2, pp 445–472

Efficient clustering-based source code plagiarism detection using PIY

  • Tony Ohmann
  • Imad Rahal
Regular Paper


Vast amounts of information available online make plagiarism increasingly easy to commit, and this is particularly true of source code. The traditional approach to detecting copied work in a course setting is manual inspection. This is not only tedious but also typically misses code plagiarized from outside sources or even from an earlier offering of the course. Systems that automatically detect source code plagiarism exist but tend to focus on small submission sets. One such system, which has become the standard in automated source code plagiarism detection, is Measure of Software Similarity (MOSS) (Schleimer et al., in: Proceedings of the 2003 ACM SIGMOD international conference on management of data, ACM, San Diego, 2003). In this work, we present an approach called Program It Yourself (PIY), which is empirically shown to outperform MOSS in detection accuracy. By utilizing parallel processing and data clustering, PIY is also capable of maintaining detection accuracy and reasonable runtimes even when using extremely large data repositories.


Keywords: Plagiarism detection; Data clustering; \(k\)-Grams; Parallel computing; NUMA


  1. Bhattacharjee A, Jamil HM (2013) CodeBlast: a two-stage algorithm for improved program similarity matching in large software repositories. In: Proceedings of the 28th annual ACM symposium on applied computing. ACM, Coimbra, pp 846–852
  2. Chen X, Francia B, Li M et al (2004) Shared information and program plagiarism detection. IEEE Trans Inf Theory 50(7):1545–1551
  3. Choi Y, Park Y, Choi J et al (2013) RAMC: runtime abstract memory context based plagiarism detection in binary code. In: Proceedings of the 7th international conference on ubiquitous information management and communication. Kota Kinabalu, Malaysia, pp 67–73
  4. Chuda D, Navrat P, Kovacova B et al (2012) The issue of (software) plagiarism: a student view. IEEE Trans Educ 55(1):22–28
  5. Cosma G, Joy M (2008) Towards a definition of source-code plagiarism. IEEE Trans Educ 51(2):195–200
  6. El Bachir Menai M, Al-Hassoun NS (2010) Similarity detection in Java programming assignments. In: Proceedings of the 5th international conference on computer science and education (ICCSE). IEEE, Hefei, pp 356–361
  7. Ester M, Kriegel HP, Sander J et al (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the 2nd international conference on knowledge discovery and data mining. AAAI Press, Portland, pp 226–231
  8. Faidhi JA, Robinson S (1987) An empirical approach for detecting program similarity within a university programming environment. Comput Educ 11(1):11–19
  9. Flores E, Barrón-Cedeño A, Rosso P et al (2011) Towards the detection of cross-language source code reuse. In: Proceedings of the 16th international conference on applications of natural language to information systems. Springer, Salford, pp 250–253
  10. Karp RM, Rabin MO (1987) Efficient randomized pattern-matching algorithms. IBM J Res Dev 31(2):249–260
  11. Klieman AB, Kowaltowski T (2009) Qualitative analysis and comparison of plagiarism-detection systems in student programs. Instituto de Computacao, Universidade Estadual de Campinas (UNICAMP), Sao Paulo, Brazil
  12. Kaufman L, Rousseeuw PJ (1990) Finding groups in data: an introduction to cluster analysis. Wiley, New York
  13. Kuo JY, Huang FC, Hung C et al (2012) The study of plagiarism detection for object-oriented programming. In: Proceedings of the 6th international conference on genetic and evolutionary computing (ICGEC). IEEE, Kitakyushu, pp 188–191
  14. Lea D (2000) A Java fork/join framework. In: Proceedings of the ACM 2000 conference on Java Grande. ACM, San Francisco, pp 36–43
  15. Lin C, Snyder L (2008) Principles of parallel programming. Addison-Wesley, Boston
  16. Liu C, Chen C, Han J et al (2006) GPLAG: detection of software plagiarism by program dependence graph analysis. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining. ACM, Philadelphia, pp 872–881
  17. Marinescu D, Baicoianu A, Dimitriu S (2013) A plagiarism detection system in computer source code. Int J Comput Sci Res Appl 3(1):22–30
  18. Ng RT, Han J (1994) Efficient and effective clustering methods for spatial data mining. In: Proceedings of the 20th international conference on very large data bases. Morgan Kaufmann Publishers, San Francisco, pp 144–155
  19. Pang-Ning T, Steinbach M, Kumar V (2006) Introduction to data mining. Addison-Wesley, Boston
  20. Prechelt L, Malpohl G, Philippsen M (2002) Finding plagiarisms among a set of programs with JPLAG. J Univers Comput Sci 8:1016–1038
  21. Rahal I, Wang B, Schnepf J (2009) A primer on text-data analysis. In: Encyclopedia of information science and technology, 2nd edn. IGI Publishing, Hershey, pp 3111–3118
  22. Rosales F, García A, Rodríguez S et al (2008) Detection of plagiarism in programming assignments. IEEE Trans Educ 51(2):174–183
  23. Rousseeuw PJ (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20:53–65
  24. Schleimer S, Wilkerson DS, Aiken A (2003) Winnowing: local algorithms for document fingerprinting. In: Proceedings of the 2003 ACM SIGMOD international conference on management of data. ACM, San Diego, pp 76–85
  25. Van Rijsbergen CJ (1977) A theoretical basis for the use of co-occurrence data in information retrieval. J Doc 33(2):106–119
  26. Whale G (1990) Identification of program similarity in large populations. Comput J 33(2):140–146
  27. Wise MJ (1995) A system for comparing biological sequences using the running Karp-Rabin Greedy String-Tiling algorithm. In: Proceedings of the third international conference on intelligent systems for molecular biology. AAAI Press, Cambridge, pp 393–401
  28. Wise MJ (1996) YAP3: improved detection of similarities in computer program and other texts. In: Proceedings of the ACM SIGCSE technical symposium on computer science education. ACM, Philadelphia, pp 130–134

Copyright information

© Springer-Verlag London 2014

Authors and Affiliations

  1. School of Computer Science, University of Massachusetts, Amherst, USA
  2. Department of Computer Science, College of Saint Benedict, Saint John’s University, Collegeville, USA
