Developing a corpus of plagiarised short answers

Language Resources and Evaluation

Abstract

Plagiarism is widely acknowledged to be a significant and increasing problem for higher education institutions (McCabe 2005; Judge 2008). A wide range of solutions, including several commercial systems, has been proposed to assist educators in identifying plagiarised work, or even to detect it automatically. Direct comparison of these systems is made difficult by the problems in obtaining genuine examples of plagiarised student work. We describe our initial experiences with constructing a corpus consisting of answers to short questions in which plagiarism has been simulated. This corpus is designed to represent types of plagiarism that are not included in existing corpora and will be a useful addition to the set of resources available for the evaluation of plagiarism detection systems.

Notes

  1. http://www.webis.de/pan-09.

  2. http://www.dcs.shef.ac.uk/nlp/meter/Metercorpus/metercorpus.htm.

  3. http://research.microsoft.com/en-us/downloads/607D14D9-20CD-47E3-85BC-A2F65CD28042/default.aspx.

  4. http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2002T01.

  5. http://www.homepages.inf.ed.ac.uk/tcohn/paraphrase_corpus.html.

  6. http://www.wikipedia.com.

  7. For further examples see http://www.chem.uky.edu/Courses/common/plagiarism.html and http://www.yale.edu/bass/writing/sources/plagiarism/.

  8. A pilot study with a limited number of participants used a finer-grained distinction between types of plagiarism; however, we found that participants had difficulty distinguishing between them.

  9. The corpus can be downloaded from: http://www.ir.shef.ac.uk/cloughie/resources/plagiarism_corpus.html.

  10. The Wikipedia pages total 14,242 words after conversion to plaintext using lynx -dump and removal of URL references.

  11. Equivalent experiments were carried out in which the stopwords were removed before computing similarity. The pattern of results was similar to that reported here, so for brevity we do not report the results obtained with stopwords removed.

  12. The WEKA 3.2 implementation was used.

References

  • Barzilay, R., & McKeown, K. (2001). Extracting paraphrases from a parallel corpus. In Proceedings of 39th annual meeting of the association for computational linguistics, association for computational linguistics (pp. 50–57). Toulouse, France.

  • Brin, S., Davis, J., & Garcia-Molina, H. (1995). Copy detection mechanisms for digital documents. In Proceedings of the ACM SIGMOD international conference on management of data (pp. 398–409).

  • Broder, A. Z. (1998). On the resemblance and containment of documents. In Compression and complexity of sequences. IEEE Computer Society.

  • Bull, J., Collins, C., Coughlin, E., & Sharp, D. (2001). Technical review of plagiarism detection software report. Luton: Computer Assisted Assessment Centre. http://www.plagiarismadvice.org/documents/resources/Luton_TechnicalReviewofPDS.pdf.

  • Callison-Burch, C., Koehn, P., & Osborne, M. (2006). Improved statistical machine translation using paraphrases. In Proceedings of the human language technology conference of the NAACL, main conference, association for computational linguistics (pp. 17–24). New York City, USA.

  • Campbell, C. (1990). Writing with others’ words: Using background reading text in academic compositions. In B. Kroll (Ed.), Second language writing: Research insights for the classroom (pp. 211–230). Cambridge: Cambridge University Press.

  • Cebrián, M., Alfonseca, M., & Ortega, A. (2007). Automatic generation of benchmarks for plagiarism detection tools using grammatical evolution. In GECCO ’07: Proceedings of the 9th annual conference on genetic and evolutionary computation (pp. 2253–2253).

  • Clough, P. (2000). Plagiarism in natural and programming languages: An overview of current tools and technologies. Technical report on research memoranda: CS-00-05. Department of Computer Science, University of Sheffield (UK).

  • Clough, P., Gaizauskas, R., Piao, S., & Wilks, Y. (2002). Measuring text reuse. In Proceedings of 40th annual meeting of the association for computational linguistics, association for computational linguistics (pp. 152–159). Philadelphia, Pennsylvania, USA.

  • Cohn, T., Callison-Burch, C., & Lapata, M. (2008). Constructing corpora for development and evaluation of paraphrase systems. Computational Linguistics, 34(4), 597–614.

  • Coleman, M., & Liau, T. (1975). A computer readability formula designed for machine scoring. Journal of Applied Psychology, 60, 283–284.

  • Collberg, C., & Kobourov, S. (2005). Self-plagiarism in computer science. Communications of the ACM, 48(4), 88–94.

  • Culwin, F., & Lancaster, T. (2001). Plagiarism issues for higher education. VINE, 31(2), 36–41.

  • Dolan, B., Quirk, C., & Brockett, C. (2004). Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of 20th international conference on computational linguistics (Coling 2004) (pp. 350–356). Geneva, Switzerland.

  • Faidhi, J. A. W., & Robinson, S. K. (1987). An empirical approach for detecting program similarity and plagiarism within a university programming environment. Journal of Computer Education, 11(1), 11–19.

  • Finlay, S. (1999). Copycatch. Master’s thesis, University of Birmingham.

  • Flesch, R. (1974). The art of readable writing. New York: Harper and Row.

  • Gitchell, D., & Tran, N. (1999). Sim: A utility for detecting similarity in computer programs. In Proceedings of 13th SIGCSE technical symposium on computer science education (pp. 226–270).

  • Grishman, R., & Sundheim, B. (1996). Message understanding conference-6: A brief history. In Proceedings of the 16th international conference on computational linguistics (COLING-96) (pp. 466–470), Copenhagen, Denmark.

  • Guthrie, D., Guthrie, L., Allison, B., & Wilks, Y. (2007). Unsupervised anomaly detection. In Proceedings of the twentieth international joint conference on artificial intelligence (IJCAI 2007) (pp. 1626–1628). Hyderabad, India.

  • Heckel, P. (1978). A technique for isolating differences between files. Communications of the ACM, 21(4), 264–268.

  • Hislop, G. W. (1998). Analyzing existing software for software re-use. Journal of Systems and Software, 54(3), 203–215.

  • Jing, H., & McKeown, K. (1999). The decomposition of human-written summary sentences. In Proceedings of SIGIR99 (pp. 129–136).

  • Johns, A., & Myers, P. (1990). An analysis of summary protocols of university ESL students. Applied Linguistics, 11, 253–271.

  • Joy, M., & Luck, M. (1999). Plagiarism in programming assignments. IEEE Transactions on Education, 42(2), 129–133.

  • Judge, G. (2008). Plagiarism: Bringing economics and education together (with a little help from IT). Computers in Higher Education Economics Review, 20(1), 21–26.

  • Juola, P. (2006). Authorship attribution. Foundations and Trends in Information Retrieval, 1(3), 233–334.

  • Kang, N., Gelbukh, A., & Han, S. (2006). PPChecker: Plagiarism pattern checker in document copy detection. In Proceedings of the 2006 European conference on information retrieval (pp. 565–569).

  • Keck, C. (2006). The use of paraphrase in summary writing: A comparison of L1 and L2 writers. Journal of Second Language Writing, 15, 261–278.

  • Kelly, D. (2009). Methods for evaluating interactive information retrieval systems with users. Foundations and Trends in Information Retrieval, 3(1–2), 1–224.

  • Korfhage, R. (1997). Information storage and retrieval. London: Wiley.

  • Lyon, C., Malcolm, J., & Dickerson, B. (2001). Detecting short passages of similar text in large document collections. In Proceedings of the 2001 conference on empirical methods in natural language processing (EMNLP-2001) (pp. 118–125).

  • Manber, U. (1994). Finding similar files in a large file system. In Proceedings of 1994 winter USENIX technical conference (pp. 1–10).

  • Martin, B. (1994). Plagiarism: A misplaced emphasis. Journal of Information Ethics, 3(2), 36–47.

  • McCabe, D. (2005). Research report of the center for academic integrity. http://www.academicintegrity.org.

  • McEnery, A. M., & Oakes, M. P. (2000). Authorship identification and computational stylometry. In R. Dale, H. Moisl, & H. Somers (Eds.), Handbook of natural language processing (pp. 545–562). New York.

  • Mihalcea, R., Chklovski, T., & Kilgarriff, A. (2004). The Senseval-3 English lexical sample task. In Proceedings of Senseval-3: The third international workshop on the evaluation of systems for the semantic analysis of text. Barcelona, Spain.

  • Morgan, G., Griego, O., & Gloeckner, G. (2001). SPSS for Windows: An introduction to use and interpretation in research. Mahwah, New Jersey: Lawrence Erlbaum Associates.

  • Myers, E. (1986). An O(ND) difference algorithm and its variations. Algorithmica, 1(2), 251–266.

  • Pang, B., Knight, K., & Marcu, D. (2003). Syntax-based alignment of multiple translations: Extracting paraphrases and generating new sentences. In Proceedings of the human language technology conference and the annual meeting of the North American chapter of the association for computational linguistics.

  • Pinto, D., Civera, J., Barrón-Cedeño, A., Juan, A., & Rosso, P. (2009). A statistical approach to crosslingual natural language tasks. Journal of Algorithms, 64(1), 51–60.

  • Potthast, M., Stein, B., Eiselt, A., Barrón-Cedeño, A., & Rosso, P. (2009). Overview of the 1st international competition on plagiarism detection. In B. Stein, P. Rosso, E. Stamatatos, M. Koppel, & E. Agirre (Eds.), Workshop on uncovering plagiarism, authorship, and social software misuse (PAN 09) (pp. 1–9), CEUR-WS.org.

  • Sanderson, M. (1997). Duplicate detection in the Reuters collection. Technical Report (TR-1997-5), Department of Computing Science at the University of Glasgow.

  • Shivakumar, N., & Garcia-Molina, H. (1996). Building a scalable and accurate copy detection mechanism. In Proceedings of 1st ACM conference on digital libraries DL’96.

  • Stein, B., & zu Eissen, S. M. (2006). Near similarity search and plagiarism analysis. In Proceedings of the 29th annual conference of the GfKl.

  • Voorhees, E., & Harman, D. (2005). TREC: Experiment and evaluation in information retrieval. Cambridge, Mass: MIT Press.

  • Web Technology & Information Systems Group BUW. (2008). Plagiarism Corpus Webis-PC-08. In S. M. zu Eissen, B. Stein, & M. Kulig (Eds.). http://www.uni-weimar.de/medien/webis/research/corpora.

  • White, D., & Joy, M. (2004). Sentence-based natural language plagiarism detection. ACM Journal on Educational Resources in Computing, 4(4), 1–20.

  • Wise, M. (1992). Detection of similarities in student programs: Yap’ing may be preferable to plague’ing. Presented at 23rd SIGCSE technical symposium (pp. 268–271). Kansas City, USA.

  • Woolls, D., & Coulthard, M. (1998). Tools for the trade. Forensic Linguistics, 5(1), 33–57.

  • Zobel, J. (2004). Uni cheats racket: A case study in plagiarism investigation. In ACE ’04: Proceedings of the sixth conference on Australasian computing education (pp. 357–365). Darlinghurst, Australia: Australian Computer Society, Inc.

  • zu Eissen, S. M., & Stein, B. (2006). Intrinsic plagiarism detection. In Proceedings of the ninth international conference on text, speech and dialogue (pp. 661–667).

  • zu Eissen, S. M., Stein, B., & Kulig, M. (2007). Plagiarism detection without reference collections. In R. Decker & H. Lenz (Eds.), Advances in data analysis (pp. 359–366). London: Springer.

Acknowledgments

We thank James Gregory, Aman Brar, Chris Bishop, Saleh Al Belwi and Congyun Long for organising the data collection and all participants involved in generating examples for the corpus.

Author information

Corresponding author

Correspondence to Paul Clough.

About this article

Cite this article

Clough, P., Stevenson, M. Developing a corpus of plagiarised short answers. Lang Resources & Evaluation 45, 5–24 (2011). https://doi.org/10.1007/s10579-009-9112-1

  • DOI: https://doi.org/10.1007/s10579-009-9112-1