Abstract
Automatic detection of source code plagiarism is an important research problem for both the commercial software industry and the research community. Existing methods of plagiarism detection primarily involve exhaustive pairwise document comparison, which does not scale well to large software collections. To achieve scalability, we approach the problem from an information retrieval (IR) perspective: we retrieve a ranked list of candidate documents in response to a pseudo-query representation constructed from each source code document in the collection. The challenge in source code document retrieval is that the standard bag-of-words (BoW) representation of such documents is likely to retrieve many false positives, because documents share identical programming-language-specific constructs and keywords. To address this problem, we make use of an abstract syntax tree (AST) representation of the source code documents. While the IR approach is efficient, it is essentially unsupervised. To further improve its effectiveness, we apply a supervised classifier (pre-trained with features extracted from sample plagiarized source code pairs) to the top-ranked retrieved documents. We report experiments on the SOCO-2014 dataset, which comprises 12K Java source files with almost 1M lines of code. Our experiments confirm that the AST-based approach produces significantly better retrieval effectiveness than a standard BoW representation, i.e., it identifies a higher number of plagiarized source code documents at top ranks in response to a query source code document. The supervised classifier is shown to filter the ranked list of retrieved candidate plagiarized documents effectively, and thus to improve it further.
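The contrast between a BoW representation and an AST representation can be illustrated with a minimal sketch. It is not the representation used in the paper, whose details are not given in the abstract. As assumptions for illustration only, it uses Python's standard `ast` module on small Python snippets (rather than the Java files of SOCO-2014), represents a document as a bag of AST node-type names, and compares documents by cosine similarity. The helper names `bow`, `ast_bag` and `cosine` are hypothetical.

```python
import ast
import math
import re
from collections import Counter


def bow(source):
    """Bag of words over raw word tokens (keywords and identifiers alike)."""
    return Counter(re.findall(r"[A-Za-z_]\w*", source))


def ast_bag(source):
    """Bag of AST node-type names; identifier spellings are ignored.

    Assumption: a stand-in for an AST-based representation, not the paper's.
    """
    return Counter(type(node).__name__ for node in ast.walk(ast.parse(source)))


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


original = """
def total(values):
    result = 0
    for v in values:
        result += v
    return result
"""

# Same structure as `original`, with every identifier renamed.
renamed = """
def acc(items):
    s = 0
    for x in items:
        s += x
    return s
"""

# A structurally different function.
unrelated = """
def greet(name):
    if name:
        print("hello", name)
    return None
"""

# Under BoW, the only overlap between `original` and `unrelated` is keywords.
shared_with_unrelated = set(bow(original)) & set(bow(unrelated))

bow_renamed = cosine(bow(original), bow(renamed))
bow_unrelated = cosine(bow(original), bow(unrelated))
ast_renamed = cosine(ast_bag(original), ast_bag(renamed))
ast_unrelated = cosine(ast_bag(original), ast_bag(unrelated))

print("BoW terms shared with unrelated code:", sorted(shared_with_unrelated))
print("BoW similarity, renamed copy vs unrelated:", bow_renamed, bow_unrelated)
print("AST similarity, renamed copy vs unrelated:", ast_renamed, ast_unrelated)
```

In this toy case the BoW overlap with the unrelated function consists only of language keywords, while the node-type bag gives the renamed copy the same representation as the original. A real system would need a parser for the target language and a richer representation than node-type counts.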

Notes
Refer to the case of Oracle America, Inc. versus Google, Inc. for an example of possible source code plagiarism: https://www.theguardian.com/technology/2016/may/26/google-wins-copyright-lawsuit-oracle-java-code, http://www.potomaclaw.com/oracle-v-google-copyrightability-apis/.
The code fragments of Fig. 1 are motivating examples only, and do not form a part of our experimental dataset.
According to Java programming conventions, an identifier name starting with an upper case letter denotes a ‘class’ name, whereas one beginning with a lower case letter denotes the name of a variable or a method; hence, “Add” and “add” have different semantic meanings.
We refer to these simply as “functions” and do not make a distinction between a function and a class method.
Acknowledgements
The authors would like to thank Enrique Flores, Paolo Rosso and Lidia Moreno for providing us with important details regarding the participating systems in the SOCO 2014 shared task. The first two authors are supported by Science Foundation Ireland (SFI) as a part of the ADAPT Centre at DCU (Grant No.: 13/RC/2106). The work of the last three authors was partially funded by CONACyT under the Thematic Networks program (Language Technologies Thematic Network Project No. 260178, 271622). Additionally, they would like to thank UAM Cuajimalpa and SNI-CONACyT for their support.
Cite this article
Ganguly, D., Jones, G.J.F., Ramírez-de-la-Cruz, A. et al. Retrieving and classifying instances of source code plagiarism. Inf Retrieval J 21, 1–23 (2018). https://doi.org/10.1007/s10791-017-9313-y
DOI: https://doi.org/10.1007/s10791-017-9313-y
Keywords
- Source code plagiarism detection
- Field based indexing and retrieval
- Lexical, structural and stylistic features
- Document representation