Retrieving Candidate Plagiarised Documents Using Query Expansion

  • Rao Muhammad Adeel Nawab
  • Mark Stevenson
  • Paul Clough
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7224)


External plagiarism detection systems compare suspicious texts against a reference collection to identify the original source document(s). The suspicious text may not contain a verbatim copy of text from the reference collection, since plagiarists often try to disguise their behaviour by altering the text. For large reference collections, such as those accessible via the internet, it is not practical to compare the suspicious text with every document in the collection. Consequently, many approaches to plagiarism detection begin by identifying a set of candidate documents from the reference collection. We report an IR-based approach to the candidate document selection problem that uses query expansion to identify candidates which have been altered. The reported system outperforms a previously reported approach and is also robust to changes in the reference collection text.
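
The candidate selection idea can be illustrated with a minimal sketch. It is not the authors' system: the abstract does not say which retrieval model or expansion source is used, so the TF-IDF-style scoring, the hand-written `synonyms` table, and the function names `tokenize`, `expand_query` and `rank_candidates` are all assumptions made for illustration. The suspicious text is treated as a query, related terms are added to it, and the reference documents are ranked so that only the top few need a detailed comparison.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on anything that is not a letter or digit."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return cleaned.split()


def expand_query(tokens, synonyms):
    """Append related terms for each query token (hypothetical synonym table)."""
    expanded = list(tokens)
    for token in tokens:
        expanded.extend(synonyms.get(token, []))
    return expanded


def rank_candidates(suspicious, collection, synonyms, top_k=2):
    """Return the top_k (doc_id, score) pairs for the expanded suspicious text.

    Scoring is a simple TF-IDF-weighted term overlap, chosen only to keep
    the sketch short; it is an assumption, not the model from the paper.
    """
    docs = {doc_id: Counter(tokenize(text)) for doc_id, text in collection.items()}
    n_docs = len(docs)

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for counts in docs.values():
        df.update(counts.keys())

    query = Counter(expand_query(tokenize(suspicious), synonyms))

    scored = []
    for doc_id, counts in docs.items():
        score = 0.0
        for term, q_tf in query.items():
            if term in counts:
                idf = math.log(1 + n_docs / df[term])
                score += q_tf * counts[term] * idf
        scored.append((doc_id, score))

    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy data, invented for this example.
collection = {
    "d2": "plants need sunlight and water",
    "d1": "the car was purchased by the student",
    "d3": "the river flooded the village",
}
synonyms = {"automobile": ["car"], "bought": ["purchased"], "pupil": ["student"]}
suspicious = "a pupil bought an automobile"

with_expansion = rank_candidates(suspicious, collection, synonyms)
without_expansion = rank_candidates(suspicious, collection, {})
```

In this toy example the suspicious sentence shares no words with its source, so the unexpanded query scores every document at zero, while the expanded query places the reworded source first. A real system would draw expansion terms from a lexical resource or from feedback documents rather than a hand-written table.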


Keywords: information retrieval, external plagiarism detection, query expansion





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Rao Muhammad Adeel Nawab (1)
  • Mark Stevenson (1)
  • Paul Clough (2)
  1. Department of Computer Science, University of Sheffield, UK
  2. Information School, University of Sheffield, UK
