Developing a corpus of plagiarised short answers

Clough, Paul; Stevenson, Mark

doi:10.1007/s10579-009-9112-1

Developing a corpus of plagiarised short answers

Published: 16 January 2010

Volume 45, pages 5–24, (2011)
Cite this article

Language Resources and Evaluation Aims and scope Submit manuscript

Paul Clough¹ &
Mark Stevenson²

972 Accesses
71 Citations
Explore all metrics

Abstract

Plagiarism is widely acknowledged to be a significant and increasing problem for higher education institutions (McCabe 2005; Judge 2008). A wide range of solutions, including several commercial systems, have been proposed to assist the educator in the task of identifying plagiarised work, or even to detect them automatically. Direct comparison of these systems is made difficult by the problems in obtaining genuine examples of plagiarised student work. We describe our initial experiences with constructing a corpus consisting of answers to short questions in which plagiarism has been simulated. This corpus is designed to represent types of plagiarism that are not included in existing corpora and will be a useful addition to the set of resources available for the evaluation of plagiarism detection systems.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

The Use of Artificial Intelligence in Writing Scientific Review Articles

Article Open access 16 January 2024

How to avoid common errors in writing scientific manuscripts

Article 10 April 2018

An automated essay scoring systems: a systematic literature review

Article 23 September 2021

Notes

http://www.webis.de/pan-09.
http://www.dcs.shef.ac.uk/nlp/meter/Metercorpus/metercorpus.htm.
http://research.microsoft.com/en-us/downloads/607D14D9-20CD-47E3-85BC-A2F65CD28042/default.aspx.
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2002T01.
http://www.homepages.inf.ed.ac.uk/tcohn/paraphrase_corpus.html.
http://www.wikipedia.com.
For further examples see http://www.chem.uky.edu/Courses/common/plagiarism.html and http://www.yale.edu/bass/writing/sources/plagiarism/.
A pilot study with a limited number of participants used a finer grained distinction between types of plagiarism, however, we found it was difficult for participants to distinguish between them.
The corpus can be downloaded from: http://www.ir.shef.ac.uk/cloughie/resources/plagiarism_corpus.html.
The Wikipedia pages total 14,242 words after conversion to plaintext using lynx -dump and removal of URL references.
Equivalent experiments were carried out in which the stopwords were removed before computing similarity. We found a similar pattern of results to those reported here and do not report results when stopwords are removed for brevity.
The WEKA 3.2 implementation was used.

References

Barzilay, R., & McKeown, K. (2001). Extracting paraphrases from a parallel corpus. In Proceedings of 39th annual meeting of the association for computational linguistics, association for computational linguistics (pp. 50–57). Toulouse, France.
Brin, S., Davis, J., & Garcia-Molina, H. (1995). Copy detection mechanisms for digital documents. In Proceedings of the ACM SIGMOD international conference on management of data (pp. 398–409).
Broder, A. Z. (1998). On the resemblance and containment of documents. In Compression and complexity of sequences. IEEE Computer Society.
Bull, J., Collins, C., Coughlin, E., & Sharp, D. (2001). Technical review of plagiarism detection software report. Luton: Computer Assisted Assessment Centre. http://www.plagiarismadvice.org/documents/resources/Luton_TechnicalReviewofPDS.pdf.
Callison-Burch, C., Koehn, P., & Osborne, M. (2006). Improved statistical machine translation using paraphrases. In Proceedings of the human language technology conference of the NAACL, main conference, association for computational linguistics (pp. 17–24). New York City, USA.
Campbell, C. (1990). Writing with other’s words: Using background reading text in academic compositions. In B. Kroll (Ed.), Second language writing: Research insights for the classroom (pp. 211–230). Cambridge: Cambridge University Press.
Google Scholar
Cebrián, M., Alfonseca, M., & Ortega, A. (2007). Automatic generation of benchmarks for plagiarism detection tools using grammatical evolution. In GECCO ’07: Proceedings of the 9th annual conference on genetic and evolutionary computation (pp. 2253–2253).
Clough, P. (2000). Plagiarism in natural and programming languages: An overview of current tools and technologies. Technical report on research memoranda: CS-00-05. Department of Computer Science, University of Sheffield (UK).
Clough, P., Gaizauskas, R., Piao, S., & Wilks, Y. (2002). Measuring text reuse. In Proceedings of 40th annual meeting of the association for computational linguistics, association for computational linguistics (pp. 152–159). Philadelphia, Pennsylvania, USA.
Cohn, T., Callison-Burch, C., & Lapata, M. (2008). Constructing corpora for development and evaluation of paraphrase systems. Computational Lingustics, 34(4), 597–614.
Article Google Scholar
Coleman, M., & Liau, T. (1975). A computer readability formula designed for machine scoring. Journal of Applied Psychology, 60, 283–284.
Article Google Scholar
Collberg, C., & Kobourov, S. (2005). Self-plagiarism in computer science. Communications of the ACM, 48(4), 88–94.
Article Google Scholar
Culwin, F., & Lancaster, T. (2001). Plagiarism issues for higher education. VINE, 31(2), 36–41.
Article Google Scholar
Dolan, B., Quirk, C., & Brockett, C. (2004). Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of 20th internatonal conference on computational linguistics (Coling 2004) (pp. 350–356) Geneva, Switzerland.
Faidhi, J. A. W., & Robinson, S. K. (1987). An empirical approach for detecting program similarity and plagiarism within a university programming environment. Journal of Computer Education, 11(1), 11–19.
Article Google Scholar
Finlay, S. (1999). Copycatch. Master’s thesis, University of Birmingham.
Flesch, R. (1974). The art of readable writing. New York: Harper and Row.
Google Scholar
Gitchell, D., & Tran, N. (1999). Sim: A utility for detecting similarity in computer programs. In Proceedings of 13th SIGSCI technical symposium on computer science education (pp. 226–270).
Grishman, R., & Sundheim, B. (1996). Message understanding conference-6: A brief history. In Proceedings of the 16th international conference on computational linguistics (COLING-96) (pp. 466–470), Copenhagen, Denmark.
Guthrie, D., Guthrie, L., Allison, B., & Wilks, Y. (2007). Unsupervised anomaly detection. In Proceedings of the twentieth international joint conference on artificial intelligence (IJCAI 2007) (pp. 1626–1628). Hyderabad, India.
Heckel, P. (1978). A technique for isolating differences between files. Communications of the ACM, 21(4), 264–268.
Article Google Scholar
Hislop, G. W. (1998). Analyzing existing software for software re-use. Journal of Systems and Software, 54(3), 203–215.
Google Scholar
Jing, H., & McKeown, K. (1999). The decomposition of human-written summary sentences. In Proceedings of SIGIR99 (pp. 129–136).
Johns, A., & Myers, P. (1990). An analysis of summary protocols of university esl students. Applied Linguistics, 11, 253–271.
Article Google Scholar
Joy, M., & Luck, M. (1999). Plagiarism in programming assignments. IEEE Transactions of Education, 42(2), 129–133.
Article Google Scholar
Judge, G. (2008). Plagiarism: Bringing economics and education together (with a little help from it). Computers in Higher Education Economics Review, 20(1), 21–26.
Google Scholar
Juola, P. (2006). Authorship attribution. Foundations and Trends in Information Retrieval, 1(3), 233–334.
Article Google Scholar
Kang, N., Gelbukh, A., & Han, S. (2006). Ppchecker: Plagiarism pattern checker in document copy detection. In Proceedings of the 2006 European conference on information retrieval (pp. 565–569).
Keck, C. (2006). The use of paraphrase in summary writing: A comparison of l1 and l2 writers. Journal of Second Language Writing, 15, 261–278.
Article Google Scholar
Kelly, D. (2009). Methods for evaluating interactive information retrieval systems with users. Foundations and Trends in Information Retrieval, 3(1–2), 1–224.
Google Scholar
Korfhage, R. (1997). Information storage and retrieval. London: Wiley.
Google Scholar
Lyon, C., Malcolm, J., & Dickerson, B. (2001) Detecting short passages of similar text in large document collections. In Proceedings of the 2001 conference on empirical methods in natural language processing (EMNLP-2001) (pp. 118–125).
Manber, U. (1994). Finding similar files in a large file system. In Proceedings of 1994 winter usenix technical conference (pp. 1–10).
Martin, B. (1994). Plagiarism: A misplaced emphasis. Journal of Information Ethics, 3(2), 36–47.
Google Scholar
McCabe, D. (2005). Research report of the center for academic integrity. http://www.academicintegrity.org.
McEnery, A. M., & Oakes, M. P. (2000). Authorship identification and Computational Stylometry. In R. Dale, H. Moisl, & H. Somers (Eds.), Handbook of natural language processing (pp. 545–562). New York.
Mihalcea, R., Chklovski, T., & Kilgarriff, A. (2004). The senseval-3 English lexical sample task. In Proceedings of senseval-3: The third international workshop on the evaluation of systems for the semantic analysis of text. Barcelona, Spain.
Morgan, G., Griego, O., & Gloeckner, G. (2001). SPSS for windows: An introduction to use and interpretation in research. Mahwah New Jersey: Lawrence Erlbaum Associates.
Google Scholar
Myers, E. (1986). An O(ND) difference algorithm and its variations. Algorithmica, 1(2), 251–266.
Article Google Scholar
Pang, B., Knight, K., & Marcu, D. (2003). Syntax-based alignment of multiple translations: Extracting paraphrases and generating new sentences. In Proceedings of the human language technology conference and the annual meeting of the North American chapter of the association for computational linguistics.
Pinto, D., Civera, J., Barrón-Cedeño, A., Juan, A., & Rosso, P. (2009). A statistical approach to crosslingual natural language tasks. Journal of Algorithms, 64(1), 51–60.
Article Google Scholar
Potthast, M., Stein, B., Eiselt, A., Barrón-Cedeño, A., & Rosso, P. (2009). Overview of the 1st international competition on plagiarism detection. In B. Stein, P. Rosso, E. Stamatatos, M. Koppel, & E. Agirre (Eds.), Workshop on uncovering plagiarism, authorship, and social software misuse (PAN 09) (pp. 1–9), CEUR-WS.org.
Sanderson, M. (1997). Duplicate detection in the reuters collection. Technical Report (TR-1997-5), Department of Computing Science at the University of Glasgow.
Shivakumar, N., & Garcia-Molina, H. (1996). Building a scalable and accurate copy detection mechanism. In Proceedings of 1st ACM conference on digital libraries DL’96.
Stein, S. B., & zu Eissen S. M. (2006). Near similarity search and plagiarism analysis. In Proceedings of the 29th annual conference of the GfKl.
Voorhees, E., & Harman, D. (2005). TREC: Experiment and evaluation in information retrieval. Cambridge, Mass: MIT Press.
Google Scholar
Web Technology & Information Systems Group BUW. (2008). Plagiarism Corpus Webis-PC-08. In S. M. zu Eissen, B. Stein, & M. Kulig (Eds.). http://www.uni-weimar.de/medien/webis/research/corpora.
White, D., & Joy, M. (2004). Sentence-based natural language plagiarism detection. ACM Journal on Educational Resources in Computing, 4(4), 1–20.
Article Google Scholar
Wise, M. (1992). Detection of similarities in student programs: Yap’ing may be preferable to plague’ing. In Presented at 23rd SIGCSE technical symposium (pp. 268–271). Kansas City, USA.
Woolls, D., & Coulthard, M. (1998). Tools for the trade. Forensic Linguistics, 5(1), 33–57.
Article Google Scholar
Zobel, J. (2004). Uni cheats racket: A case study in plagiarism investigation. In: ACE ’04: Proceedings of the sixth conference on Australasian computing education (pp. 357–365). Darlinghurst, Australia: Australian Computer Society, Inc.
zu Eissen, S. M., & Stein, B. (2006). Intrinsic plagiarism detection. In Proceedings of the ninth international conference on text, speech and dialogue (pp. 661–667).
zu Eissen, S. M., Stein, B., & Kulig, M. (2007). Plagiarism detection without reference collections. In R. Decker & H. Lenz (Eds.), Advances in data analysis (pp. 359–366). London: Springer.
Chapter Google Scholar

Download references

Acknowledgments

We thank James Gregory, Aman Brar, Chris Bishop, Saleh Al Belwi and Congyun Long for organising the data collection and all participants involved in generating examples for the corpus.

Author information

Authors and Affiliations

Department of Information Studies, University of Sheffield, Regent Court, 211 Portobello, Sheffield, S1 4DP, UK
Paul Clough
Department of Computer Science, University of Sheffield, Regent Court, 211 Portobello, Sheffield, S1 4DP, UK
Mark Stevenson

Authors

Paul Clough
View author publications
You can also search for this author in PubMed Google Scholar
Mark Stevenson
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Paul Clough.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Clough, P., Stevenson, M. Developing a corpus of plagiarised short answers. Lang Resources & Evaluation 45, 5–24 (2011). https://doi.org/10.1007/s10579-009-9112-1

Download citation

Published: 16 January 2010
Issue Date: March 2011
DOI: https://doi.org/10.1007/s10579-009-9112-1

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Developing a corpus of plagiarised short answers

Abstract

Access this article

Similar content being viewed by others

The Use of Artificial Intelligence in Writing Scientific Review Articles

How to avoid common errors in writing scientific manuscripts

An automated essay scoring systems: a systematic literature review

Notes

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Developing a corpus of plagiarised short answers

Abstract

Access this article

Similar content being viewed by others

The Use of Artificial Intelligence in Writing Scientific Review Articles

How to avoid common errors in writing scientific manuscripts

An automated essay scoring systems: a systematic literature review

Notes

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation