Efficient clustering-based source code plagiarism detection using PIY

Ohmann, Tony; Rahal, Imad

doi:10.1007/s10115-014-0742-2

Regular Paper
Published: 22 March 2014

Volume 43, pages 445–472, (2015)

Knowledge and Information Systems

Tony Ohmann¹ & Imad Rahal²

Abstract

Vast amounts of information available online make plagiarism increasingly easy to commit, and this is particularly true of source code. The traditional approach to detecting copied work in a course setting is manual inspection. This is not only tedious but also typically misses code plagiarized from outside sources or even from an earlier offering of the course. Systems that automatically detect source code plagiarism exist but tend to focus on small submission sets. One such system, which has become the standard in automated source code plagiarism detection, is Measure of Software Similarity (MOSS) (Schleimer et al. 2003). In this work, we present an approach called Program It Yourself (PIY), which is empirically shown to outperform MOSS in detection accuracy. By utilizing parallel processing and data clustering, PIY is also capable of maintaining detection accuracy and reasonable runtimes even when using extremely large data repositories.

References

Bhattacharjee A, Jamil HM (2013) CodeBlast: a two-stage algorithm for improved program similarity matching in large software repositories. In: Proceedings of the 28th annual ACM symposium on applied computing. ACM, Coimbra, pp 846–852
Chen X, Francia B, Li M et al (2004) Shared information and program plagiarism detection. IEEE Trans Inf Theory 50(7):1545–1551
Choi Y, Park Y, Choi J et al (2013) RAMC: runtime abstract memory context based plagiarism detection in binary code. In: Proceedings of the 7th international conference on ubiquitous information management and communication. Kota Kinabalu, Malaysia, pp 67–73
Chuda D, Navrat P, Kovacova B et al (2012) The issue of (software) plagiarism: a student view. IEEE Trans Educ 55(1):22–28
Cosma G, Joy M (2008) Towards a definition of source-code plagiarism. IEEE Trans Educ 51(2):195–200
El Bachir Menai M, Al-Hassoun NS (2010) Similarity detection in Java programming assignments. In: Proceedings of the 5th international conference on computer science and education (ICCSE). IEEE, Hefei, pp 356–361
Ester M, Kriegel HP, Sander J et al (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the 2nd international conference on knowledge discovery and data mining. AAAI Press, Portland, pp 226–231
Faidhi JA, Robinson S (1987) An empirical approach for detecting program similarity within a university programming environment. Comput Educ 11(1):11–19
Flores E, Barrón-Cedeño A, Rosso P et al (2011) Towards the detection of cross-language source code reuse. In: Proceedings of the 16th international conference on applications of natural language to information systems. Springer, Salford, pp 250–253
Karp RM, Rabin MO (1987) Efficient randomized pattern-matching algorithms. IBM J Res Dev 31(2):249–260
Klieman AB, Kowaltowski T (2009) Qualitative analysis and comparison of plagiarism-detection systems in student programs. Instituto de Computação, Universidade Estadual de Campinas (UNICAMP), São Paulo, Brazil
Kaufman L, Rousseeuw PJ (1990) Finding groups in data: an introduction to cluster analysis. Wiley, New York
Kuo JY, Huang FC, Hung C et al (2012) The study of plagiarism detection for object-oriented programming. In: Proceedings of the 6th international conference on genetic and evolutionary computing (ICGEC). IEEE, Kitakyushu, pp 188–191
Lea D (2000) A Java fork/join framework. In: Proceedings of the ACM 2000 conference on Java Grande. ACM, San Francisco, pp 36–43
Lin C, Snyder L (2008) Principles of parallel programming. Addison-Wesley, Boston
Liu C, Chen C, Han J et al (2006) GPLAG: detection of software plagiarism by program dependence graph analysis. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining. ACM, Philadelphia, pp 872–881
Marinescu D, Baicoianu A, Dimitriu S (2013) A plagiarism detection system in computer source code. Int J Comput Sci Res Appl 3(1):22–30
Ng RT, Han J (1994) Efficient and effective clustering methods for spatial data mining. In: Proceedings of the 20th international conference on very large data bases. Morgan Kaufmann Publishers, San Francisco, pp 144–155
Pang-Ning T, Steinbach M, Kumar V (2006) Introduction to data mining. Addison-Wesley, Boston
Prechelt L, Malpohl G, Philippsen M (2002) Finding plagiarisms among a set of programs with JPLAG. J Univers Comput Sci 8:1016–1038
Rahal I, Wang B, Schnepf J (2009) A primer on text-data analysis. In: Encyclopedia of information science and technology, 2nd edn. IGI Publishing, Hershey, pp 3111–3118
Rosales F, García A, Rodríguez S et al (2008) Detection of plagiarism in programming assignments. IEEE Trans Educ 51(2):174–183
Rousseeuw PJ (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20:53–65
Schleimer S, Wilkerson DS, Aiken A (2003) Winnowing: local algorithms for document fingerprinting. In: Proceedings of the 2003 ACM SIGMOD international conference on management of data. ACM, San Diego, pp 76–85
Van Rijsbergen CJ (1977) A theoretical basis for the use of co-occurrence data in information retrieval. J Doc 33(2):106–119
Whale G (1990) Identification of program similarity in large populations. Comput J 33(2):140–146
Wise MJ (1995) A system for comparing biological sequences using the running Karp-Rabin Greedy String-Tiling algorithm. In: Proceedings of the third international conference on intelligent systems for molecular biology. AAAI Press, Cambridge, pp 393–401
Wise MJ (1996) YAP3: improved detection of similarities in computer program and other texts. In: Proceedings of the ACM SIGCSE technical symposium on computer science education. ACM, Philadelphia, pp 130–134

Author information

Authors and Affiliations

School of Computer Science, University of Massachusetts, Amherst, MA 01060, USA
Tony Ohmann
Department of Computer Science, College of Saint Benedict, Saint John’s University, Collegeville, MN 56321, USA
Imad Rahal

Corresponding author

Correspondence to Tony Ohmann.

Additional information

Tony Ohmann completed the work at the College of Saint Benedict and Saint John’s University.

About this article

Cite this article

Ohmann, T., Rahal, I. Efficient clustering-based source code plagiarism detection using PIY. Knowl Inf Syst 43, 445–472 (2015). https://doi.org/10.1007/s10115-014-0742-2

Received: 15 April 2013
Revised: 01 February 2014
Accepted: 07 March 2014
Published: 22 March 2014
Issue Date: May 2015
DOI: https://doi.org/10.1007/s10115-014-0742-2
