Language Resources and Evaluation, Volume 45, Issue 1, pp 5–24

Developing a corpus of plagiarised short answers

  • Paul Clough
  • Mark Stevenson


Plagiarism is widely acknowledged to be a significant and increasing problem for higher education institutions (McCabe 2005; Judge 2008). A wide range of solutions, including several commercial systems, have been proposed to assist the educator in the task of identifying plagiarised work, or even to detect it automatically. Direct comparison of these systems is made difficult by the problems in obtaining genuine examples of plagiarised student work. We describe our initial experiences with constructing a corpus consisting of answers to short questions in which plagiarism has been simulated. This corpus is designed to represent types of plagiarism that are not included in existing corpora and will be a useful addition to the set of resources available for the evaluation of plagiarism detection systems.


Keywords: Plagiarism, Plagiarism detection, Corpus creation, Language resources



We thank James Gregory, Aman Brar, Chris Bishop, Saleh Al Belwi and Congyun Long for organising the data collection and all participants involved in generating examples for the corpus.


  1. Barzilay, R., & McKeown, K. (2001). Extracting paraphrases from a parallel corpus. In Proceedings of 39th annual meeting of the association for computational linguistics, association for computational linguistics (pp. 50–57). Toulouse, France.
  2. Brin, S., Davis, J., & Garcia-Molina, H. (1995). Copy detection mechanisms for digital documents. In Proceedings of the ACM SIGMOD international conference on management of data (pp. 398–409).
  3. Broder, A. Z. (1998). On the resemblance and containment of documents. In Compression and complexity of sequences. IEEE Computer Society.
  4. Bull, J., Collins, C., Coughlin, E., & Sharp, D. (2001). Technical review of plagiarism detection software report. Luton: Computer Assisted Assessment Centre.
  5. Callison-Burch, C., Koehn, P., & Osborne, M. (2006). Improved statistical machine translation using paraphrases. In Proceedings of the human language technology conference of the NAACL, main conference, association for computational linguistics (pp. 17–24). New York City, USA.
  6. Campbell, C. (1990). Writing with others’ words: Using background reading text in academic compositions. In B. Kroll (Ed.), Second language writing: Research insights for the classroom (pp. 211–230). Cambridge: Cambridge University Press.
  7. Cebrián, M., Alfonseca, M., & Ortega, A. (2007). Automatic generation of benchmarks for plagiarism detection tools using grammatical evolution. In GECCO ’07: Proceedings of the 9th annual conference on genetic and evolutionary computation (pp. 2253–2253).
  8. Clough, P. (2000). Plagiarism in natural and programming languages: An overview of current tools and technologies. Technical report on research memoranda: CS-00-05. Department of Computer Science, University of Sheffield (UK).
  9. Clough, P., Gaizauskas, R., Piao, S., & Wilks, Y. (2002). Measuring text reuse. In Proceedings of 40th annual meeting of the association for computational linguistics, association for computational linguistics (pp. 152–159). Philadelphia, Pennsylvania, USA.
  10. Cohn, T., Callison-Burch, C., & Lapata, M. (2008). Constructing corpora for development and evaluation of paraphrase systems. Computational Linguistics, 34(4), 597–614.
  11. Coleman, M., & Liau, T. (1975). A computer readability formula designed for machine scoring. Journal of Applied Psychology, 60, 283–284.
  12. Collberg, C., & Kobourov, S. (2005). Self-plagiarism in computer science. Communications of the ACM, 48(4), 88–94.
  13. Culwin, F., & Lancaster, T. (2001). Plagiarism issues for higher education. VINE, 31(2), 36–41.
  14. Dolan, B., Quirk, C., & Brockett, C. (2004). Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of 20th international conference on computational linguistics (Coling 2004) (pp. 350–356). Geneva, Switzerland.
  15. Faidhi, J. A. W., & Robinson, S. K. (1987). An empirical approach for detecting program similarity and plagiarism within a university programming environment. Journal of Computer Education, 11(1), 11–19.
  16. Finlay, S. (1999). Copycatch. Master’s thesis, University of Birmingham.
  17. Flesch, R. (1974). The art of readable writing. New York: Harper and Row.
  18. Gitchell, D., & Tran, N. (1999). Sim: A utility for detecting similarity in computer programs. In Proceedings of 13th SIGCSE technical symposium on computer science education (pp. 226–270).
  19. Grishman, R., & Sundheim, B. (1996). Message understanding conference-6: A brief history. In Proceedings of the 16th international conference on computational linguistics (COLING-96) (pp. 466–470). Copenhagen, Denmark.
  20. Guthrie, D., Guthrie, L., Allison, B., & Wilks, Y. (2007). Unsupervised anomaly detection. In Proceedings of the twentieth international joint conference on artificial intelligence (IJCAI 2007) (pp. 1626–1628). Hyderabad, India.
  21. Heckel, P. (1978). A technique for isolating differences between files. Communications of the ACM, 21(4), 264–268.
  22. Hislop, G. W. (1998). Analyzing existing software for software re-use. Journal of Systems and Software, 54(3), 203–215.
  23. Jing, H., & McKeown, K. (1999). The decomposition of human-written summary sentences. In Proceedings of SIGIR99 (pp. 129–136).
  24. Johns, A., & Myers, P. (1990). An analysis of summary protocols of university ESL students. Applied Linguistics, 11, 253–271.
  25. Joy, M., & Luck, M. (1999). Plagiarism in programming assignments. IEEE Transactions on Education, 42(2), 129–133.
  26. Judge, G. (2008). Plagiarism: Bringing economics and education together (with a little help from IT). Computers in Higher Education Economics Review, 20(1), 21–26.
  27. Juola, P. (2006). Authorship attribution. Foundations and Trends in Information Retrieval, 1(3), 233–334.
  28. Kang, N., Gelbukh, A., & Han, S. (2006). PPChecker: Plagiarism pattern checker in document copy detection. In Proceedings of the 2006 European conference on information retrieval (pp. 565–569).
  29. Keck, C. (2006). The use of paraphrase in summary writing: A comparison of L1 and L2 writers. Journal of Second Language Writing, 15, 261–278.
  30. Kelly, D. (2009). Methods for evaluating interactive information retrieval systems with users. Foundations and Trends in Information Retrieval, 3(1–2), 1–224.
  31. Korfhage, R. (1997). Information storage and retrieval. London: Wiley.
  32. Lyon, C., Malcolm, J., & Dickerson, B. (2001). Detecting short passages of similar text in large document collections. In Proceedings of the 2001 conference on empirical methods in natural language processing (EMNLP-2001) (pp. 118–125).
  33. Manber, U. (1994). Finding similar files in a large file system. In Proceedings of 1994 winter USENIX technical conference (pp. 1–10).
  34. Martin, B. (1994). Plagiarism: A misplaced emphasis. Journal of Information Ethics, 3(2), 36–47.
  35. McCabe, D. (2005). Research report of the center for academic integrity.
  36. McEnery, A. M., & Oakes, M. P. (2000). Authorship identification and computational stylometry. In R. Dale, H. Moisl, & H. Somers (Eds.), Handbook of natural language processing (pp. 545–562). New York.
  37. Mihalcea, R., Chklovski, T., & Kilgarriff, A. (2004). The Senseval-3 English lexical sample task. In Proceedings of Senseval-3: The third international workshop on the evaluation of systems for the semantic analysis of text. Barcelona, Spain.
  38. Morgan, G., Griego, O., & Gloeckner, G. (2001). SPSS for Windows: An introduction to use and interpretation in research. Mahwah, New Jersey: Lawrence Erlbaum Associates.
  39. Myers, E. (1986). An O(ND) difference algorithm and its variations. Algorithmica, 1(2), 251–266.
  40. Pang, B., Knight, K., & Marcu, D. (2003). Syntax-based alignment of multiple translations: Extracting paraphrases and generating new sentences. In Proceedings of the human language technology conference and the annual meeting of the North American chapter of the association for computational linguistics.
  41. Pinto, D., Civera, J., Barrón-Cedeño, A., Juan, A., & Rosso, P. (2009). A statistical approach to crosslingual natural language tasks. Journal of Algorithms, 64(1), 51–60.
  42. Potthast, M., Stein, B., Eiselt, A., Barrón-Cedeño, A., & Rosso, P. (2009). Overview of the 1st international competition on plagiarism detection. In B. Stein, P. Rosso, E. Stamatatos, M. Koppel, & E. Agirre (Eds.), Workshop on uncovering plagiarism, authorship, and social software misuse (PAN 09) (pp. 1–9).
  43. Sanderson, M. (1997). Duplicate detection in the Reuters collection. Technical Report (TR-1997-5), Department of Computing Science at the University of Glasgow.
  44. Shivakumar, N., & Garcia-Molina, H. (1996). Building a scalable and accurate copy detection mechanism. In Proceedings of 1st ACM conference on digital libraries DL’96.
  45. Stein, S. B., & zu Eissen, S. M. (2006). Near similarity search and plagiarism analysis. In Proceedings of the 29th annual conference of the GfKl.
  46. Voorhees, E., & Harman, D. (2005). TREC: Experiment and evaluation in information retrieval. Cambridge, Mass: MIT Press.
  47. Web Technology & Information Systems Group BUW. (2008). Plagiarism Corpus Webis-PC-08. In S. M. zu Eissen, B. Stein, & M. Kulig (Eds.).
  48. White, D., & Joy, M. (2004). Sentence-based natural language plagiarism detection. ACM Journal on Educational Resources in Computing, 4(4), 1–20.
  49. Wise, M. (1992). Detection of similarities in student programs: Yap’ing may be preferable to plague’ing. In Presented at 23rd SIGCSE technical symposium (pp. 268–271). Kansas City, USA.
  50. Woolls, D., & Coulthard, M. (1998). Tools for the trade. Forensic Linguistics, 5(1), 33–57.
  51. Zobel, J. (2004). Uni cheats racket: A case study in plagiarism investigation. In ACE ’04: Proceedings of the sixth conference on Australasian computing education (pp. 357–365). Darlinghurst, Australia: Australian Computer Society, Inc.
  52. zu Eissen, S. M., & Stein, B. (2006). Intrinsic plagiarism detection. In Proceedings of the ninth international conference on text, speech and dialogue (pp. 661–667).
  53. zu Eissen, S. M., Stein, B., & Kulig, M. (2007). Plagiarism detection without reference collections. In R. Decker & H. Lenz (Eds.), Advances in data analysis (pp. 359–366). London: Springer.

Copyright information

© Springer Science+Business Media B.V. 2010

Authors and Affiliations

  1. Department of Information Studies, University of Sheffield, Sheffield, UK
  2. Department of Computer Science, University of Sheffield, Sheffield, UK
