Retrieving and classifying instances of source code plagiarism

Abstract

Automatic detection of source code plagiarism is an important research problem for both the commercial software industry and the research community. Existing plagiarism detection methods rely primarily on exhaustive pairwise document comparison, which does not scale to large software collections. To achieve scalability, we approach the problem from an information retrieval (IR) perspective: for each source code document in the collection, we construct a pseudo-query representation and retrieve a ranked list of candidate documents in response to it. The challenge in source code document retrieval is that the standard bag-of-words (BoW) representation is likely to retrieve many false positives, because documents written in the same programming language share identical language-specific constructs and keywords. To address this problem, we use an abstract syntax tree (AST) representation of the source code documents. While the IR approach is efficient, it is essentially unsupervised. To further improve its effectiveness, we apply a supervised classifier, pre-trained with features extracted from sample plagiarized source code pairs, to the top ranked retrieved documents. We report experiments on the SOCO-2014 dataset, which comprises 12K Java source files with almost 1M lines of code. Our experiments confirm that the AST based approach produces significantly better retrieval effectiveness than a standard BoW representation, i.e., it identifies a higher number of plagiarized source code documents at top ranks in response to a query source code document. The supervised classifier is shown to filter the ranked list of retrieved candidate plagiarized documents effectively and thus to improve it further.
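
The retrieval step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses Python's standard `ast` module on toy Python snippets in place of Java files parsed with a Java parser, plain cosine similarity over term counts in place of a search engine's ranking function, and it omits the supervised classifier stage entirely. The names `ast_terms`, `cosine`, `retrieve`, and `COLLECTION` are hypothetical.

```python
import ast
import math
from collections import Counter


def ast_terms(source):
    """Build a bag of 'field:term' features from the AST instead of raw text."""
    terms = Counter()
    for node in ast.walk(ast.parse(source)):
        # Structural feature: the node type itself (For, Return, AugAssign, ...).
        terms["node:" + type(node).__name__] += 1
        if isinstance(node, ast.FunctionDef):
            # Names of defined functions go into their own field.
            terms["def:" + node.name] += 1
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            # Names of directly called functions go into another field.
            terms["call:" + node.func.id] += 1
    return terms


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_id, collection, k=2):
    """Use one document as a pseudo-query and rank all other documents."""
    query = ast_terms(collection[query_id])
    scored = [
        (doc_id, cosine(query, ast_terms(src)))
        for doc_id, src in collection.items()
        if doc_id != query_id
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy collection (assumption): "renamed" copies "original" with new variable names.
COLLECTION = {
    "original": "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n",
    "renamed": "def total(values):\n    acc = 0\n    for v in values:\n        acc += v\n    return acc\n",
    "unrelated": "def greet(name):\n    print('hello ' + name)\n",
}

if __name__ == "__main__":
    for doc_id, score in retrieve("original", COLLECTION):
        print(doc_id, round(score, 3))
```

Because the features come from tree structure and selected fields rather than from every identifier in the text, renaming local variables leaves the representation of the copied snippet unchanged in this toy setting. Each document is scored once per query, so the sketch still performs a linear scan; an inverted index would be needed to obtain the scalability the abstract refers to.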

Notes

  1. Refer to the case of Oracle America, Inc. versus Google, Inc. for an example of possible source code plagiarism: https://www.theguardian.com/technology/2016/may/26/google-wins-copyright-lawsuit-oracle-java-code, http://www.potomaclaw.com/oracle-v-google-copyrightability-apis/.

  2. http://theory.stanford.edu/~aiken/moss/.

  3. The code fragments of Fig. 1 are motivating examples only, and do not form a part of our experimental dataset.

  4. According to Java programming conventions, an identifier name starting with an upper case letter denotes a ‘class’ name, whereas one beginning with a lower case denotes the name of a variable or a method, hence, “Add” and “add” have different semantic meanings.

  5. We refer to these simply as “functions” and do not make a distinction between a function and a class method.

  6. http://www.isical.ac.in/~fire/.

  7. http://users.dsic.upv.es/grupos/nle/soco/.

  8. https://code.google.com/codejam/contest/1460488/dashboard.

  9. http://code.google.com/p/javaparser/.

  10. https://lucene.apache.org/core/4_6_0/index.html.

  11. https://github.com/gdebasis/YASOCS.

  12. https://github.com/gdebasis/YASOCS/blob/master/javastopwords.txt.

  13. http://www.cs.waikato.ac.nz/ml/weka/.
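
The naming convention mentioned in note 4 implies that identifier terms should be compared case-sensitively. A minimal sketch of that idea, assuming a hypothetical helper `identifier_kind` that is not part of the original work, and ignoring exceptions to the convention such as all-uppercase constant names:

```python
def identifier_kind(name):
    """Guess an identifier's role from the case of its first letter.

    Follows the simplified convention from note 4: an upper case first
    letter suggests a class name, a lower case one a variable or method.
    All-uppercase constants (an exception) are not handled here.
    """
    if not name or not name[0].isalpha():
        raise ValueError("expected an identifier starting with a letter")
    return "class" if name[0].isupper() else "variable_or_method"


# "Add" and "add" are distinct terms; lowercasing would merge them.
for term in ("Add", "add"):
    print(term, "->", identifier_kind(term))
```

Lowercasing tokens before indexing, which is common for natural language text, would collapse these two terms into one and discard the distinction the note describes.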


Acknowledgements

The authors would like to thank Enrique Flores, Paolo Rosso and Lidia Moreno for providing us with important details regarding the participating systems in the SOCO 2014 shared task. The first two authors are supported by Science Foundation Ireland (SFI) as a part of the ADAPT Centre at DCU (Grant No.: 13/RC/2106). The work of the last three authors was partially funded by CONACyT under the Thematic Networks program (Language Technologies Thematic Network Project No. 260178, 271622). Additionally, they would like to thank UAM Cuajimalpa and SNI-CONACyT for their support.

Author information

Corresponding author

Correspondence to Debasis Ganguly.

About this article

Cite this article

Ganguly, D., Jones, G.J.F., Ramírez-de-la-Cruz, A. et al. Retrieving and classifying instances of source code plagiarism. Inf Retrieval J 21, 1–23 (2018). https://doi.org/10.1007/s10791-017-9313-y

  • DOI: https://doi.org/10.1007/s10791-017-9313-y

Keywords

  • Source code plagiarism detection
  • Field based indexing and retrieval
  • Lexical, structural and stylistic features
  • Document representation