Abstract
Plagiarism is widely acknowledged to be a significant and increasing problem for higher education institutions (McCabe 2005; Judge 2008). A wide range of solutions, including several commercial systems, have been proposed to assist the educator in identifying plagiarised work, or even to detect it automatically. Direct comparison of these systems is hampered by the difficulty of obtaining genuine examples of plagiarised student work. We describe our initial experiences with constructing a corpus consisting of answers to short questions in which plagiarism has been simulated. This corpus is designed to represent types of plagiarism that are not included in existing corpora and will be a useful addition to the set of resources available for the evaluation of plagiarism detection systems.
Notes
For further examples see http://www.chem.uky.edu/Courses/common/plagiarism.html and http://www.yale.edu/bass/writing/sources/plagiarism/.
A pilot study with a limited number of participants used a finer-grained distinction between types of plagiarism; however, we found that participants had difficulty distinguishing between them.
The corpus can be downloaded from: http://www.ir.shef.ac.uk/cloughie/resources/plagiarism_corpus.html.
The Wikipedia pages total 14,242 words after conversion to plaintext using lynx -dump and removal of URL references.
Equivalent experiments were carried out in which the stopwords were removed before computing similarity. The pattern of results was similar to that reported here, so for brevity we do not report the results obtained with stopwords removed.
The WEKA 3.2 implementation was used.
References
Barzilay, R., & McKeown, K. (2001). Extracting paraphrases from a parallel corpus. In Proceedings of 39th annual meeting of the association for computational linguistics (pp. 50–57). Toulouse, France.
Brin, S., Davis, J., & Garcia-Molina, H. (1995). Copy detection mechanisms for digital documents. In Proceedings of the ACM SIGMOD international conference on management of data (pp. 398–409).
Broder, A. Z. (1998). On the resemblance and containment of documents. In Compression and complexity of sequences. IEEE Computer Society.
Bull, J., Collins, C., Coughlin, E., & Sharp, D. (2001). Technical review of plagiarism detection software report. Luton: Computer Assisted Assessment Centre. http://www.plagiarismadvice.org/documents/resources/Luton_TechnicalReviewofPDS.pdf.
Callison-Burch, C., Koehn, P., & Osborne, M. (2006). Improved statistical machine translation using paraphrases. In Proceedings of the human language technology conference of the NAACL, main conference (pp. 17–24). New York City, USA.
Campbell, C. (1990). Writing with others’ words: Using background reading text in academic compositions. In B. Kroll (Ed.), Second language writing: Research insights for the classroom (pp. 211–230). Cambridge: Cambridge University Press.
Cebrián, M., Alfonseca, M., & Ortega, A. (2007). Automatic generation of benchmarks for plagiarism detection tools using grammatical evolution. In GECCO ’07: Proceedings of the 9th annual conference on genetic and evolutionary computation (pp. 2253–2253).
Clough, P. (2000). Plagiarism in natural and programming languages: An overview of current tools and technologies. Technical report on research memoranda: CS-00-05. Department of Computer Science, University of Sheffield (UK).
Clough, P., Gaizauskas, R., Piao, S., & Wilks, Y. (2002). Measuring text reuse. In Proceedings of 40th annual meeting of the association for computational linguistics (pp. 152–159). Philadelphia, Pennsylvania, USA.
Cohn, T., Callison-Burch, C., & Lapata, M. (2008). Constructing corpora for development and evaluation of paraphrase systems. Computational Linguistics, 34(4), 597–614.
Coleman, M., & Liau, T. (1975). A computer readability formula designed for machine scoring. Journal of Applied Psychology, 60, 283–284.
Collberg, C., & Kobourov, S. (2005). Self-plagiarism in computer science. Communications of the ACM, 48(4), 88–94.
Culwin, F., & Lancaster, T. (2001). Plagiarism issues for higher education. VINE, 31(2), 36–41.
Dolan, B., Quirk, C., & Brockett, C. (2004). Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of 20th international conference on computational linguistics (Coling 2004) (pp. 350–356). Geneva, Switzerland.
Faidhi, J. A. W., & Robinson, S. K. (1987). An empirical approach for detecting program similarity and plagiarism within a university programming environment. Journal of Computer Education, 11(1), 11–19.
Finlay, S. (1999). Copycatch. Master’s thesis, University of Birmingham.
Flesch, R. (1974). The art of readable writing. New York: Harper and Row.
Gitchell, D., & Tran, N. (1999). Sim: A utility for detecting similarity in computer programs. In Proceedings of 13th SIGCSE technical symposium on computer science education (pp. 226–270).
Grishman, R., & Sundheim, B. (1996). Message understanding conference-6: A brief history. In Proceedings of the 16th international conference on computational linguistics (COLING-96) (pp. 466–470), Copenhagen, Denmark.
Guthrie, D., Guthrie, L., Allison, B., & Wilks, Y. (2007). Unsupervised anomaly detection. In Proceedings of the twentieth international joint conference on artificial intelligence (IJCAI 2007) (pp. 1626–1628). Hyderabad, India.
Heckel, P. (1978). A technique for isolating differences between files. Communications of the ACM, 21(4), 264–268.
Hislop, G. W. (1998). Analyzing existing software for software re-use. Journal of Systems and Software, 54(3), 203–215.
Jing, H., & McKeown, K. (1999). The decomposition of human-written summary sentences. In Proceedings of SIGIR99 (pp. 129–136).
Johns, A., & Myers, P. (1990). An analysis of summary protocols of university ESL students. Applied Linguistics, 11, 253–271.
Joy, M., & Luck, M. (1999). Plagiarism in programming assignments. IEEE Transactions on Education, 42(2), 129–133.
Judge, G. (2008). Plagiarism: Bringing economics and education together (with a little help from IT). Computers in Higher Education Economics Review, 20(1), 21–26.
Juola, P. (2006). Authorship attribution. Foundations and Trends in Information Retrieval, 1(3), 233–334.
Kang, N., Gelbukh, A., & Han, S. (2006). PPChecker: Plagiarism pattern checker in document copy detection. In Proceedings of the 2006 European conference on information retrieval (pp. 565–569).
Keck, C. (2006). The use of paraphrase in summary writing: A comparison of L1 and L2 writers. Journal of Second Language Writing, 15, 261–278.
Kelly, D. (2009). Methods for evaluating interactive information retrieval systems with users. Foundations and Trends in Information Retrieval, 3(1–2), 1–224.
Korfhage, R. (1997). Information storage and retrieval. London: Wiley.
Lyon, C., Malcolm, J., & Dickerson, B. (2001). Detecting short passages of similar text in large document collections. In Proceedings of the 2001 conference on empirical methods in natural language processing (EMNLP-2001) (pp. 118–125).
Manber, U. (1994). Finding similar files in a large file system. In Proceedings of 1994 winter USENIX technical conference (pp. 1–10).
Martin, B. (1994). Plagiarism: A misplaced emphasis. Journal of Information Ethics, 3(2), 36–47.
McCabe, D. (2005). Research report of the center for academic integrity. http://www.academicintegrity.org.
McEnery, A. M., & Oakes, M. P. (2000). Authorship identification and computational stylometry. In R. Dale, H. Moisl, & H. Somers (Eds.), Handbook of natural language processing (pp. 545–562). New York.
Mihalcea, R., Chklovski, T., & Kilgarriff, A. (2004). The Senseval-3 English lexical sample task. In Proceedings of Senseval-3: The third international workshop on the evaluation of systems for the semantic analysis of text. Barcelona, Spain.
Morgan, G., Griego, O., & Gloeckner, G. (2001). SPSS for Windows: An introduction to use and interpretation in research. Mahwah, New Jersey: Lawrence Erlbaum Associates.
Myers, E. (1986). An O(ND) difference algorithm and its variations. Algorithmica, 1(2), 251–266.
Pang, B., Knight, K., & Marcu, D. (2003). Syntax-based alignment of multiple translations: Extracting paraphrases and generating new sentences. In Proceedings of the human language technology conference and the annual meeting of the North American chapter of the association for computational linguistics.
Pinto, D., Civera, J., Barrón-Cedeño, A., Juan, A., & Rosso, P. (2009). A statistical approach to crosslingual natural language tasks. Journal of Algorithms, 64(1), 51–60.
Potthast, M., Stein, B., Eiselt, A., Barrón-Cedeño, A., & Rosso, P. (2009). Overview of the 1st international competition on plagiarism detection. In B. Stein, P. Rosso, E. Stamatatos, M. Koppel, & E. Agirre (Eds.), Workshop on uncovering plagiarism, authorship, and social software misuse (PAN 09) (pp. 1–9), CEUR-WS.org.
Sanderson, M. (1997). Duplicate detection in the Reuters collection. Technical Report (TR-1997-5), Department of Computing Science at the University of Glasgow.
Shivakumar, N., & Garcia-Molina, H. (1996). Building a scalable and accurate copy detection mechanism. In Proceedings of 1st ACM conference on digital libraries DL’96.
Stein, B., & zu Eissen, S. M. (2006). Near similarity search and plagiarism analysis. In Proceedings of the 29th annual conference of the GfKl.
Voorhees, E., & Harman, D. (2005). TREC: Experiment and evaluation in information retrieval. Cambridge, Mass: MIT Press.
Web Technology & Information Systems Group BUW. (2008). Plagiarism Corpus Webis-PC-08. In S. M. zu Eissen, B. Stein, & M. Kulig (Eds.). http://www.uni-weimar.de/medien/webis/research/corpora.
White, D., & Joy, M. (2004). Sentence-based natural language plagiarism detection. ACM Journal on Educational Resources in Computing, 4(4), 1–20.
Wise, M. (1992). Detection of similarities in student programs: Yap’ing may be preferable to plague’ing. Presented at the 23rd SIGCSE technical symposium (pp. 268–271). Kansas City, USA.
Woolls, D., & Coulthard, M. (1998). Tools for the trade. Forensic Linguistics, 5(1), 33–57.
Zobel, J. (2004). Uni cheats racket: A case study in plagiarism investigation. In ACE ’04: Proceedings of the sixth conference on Australasian computing education (pp. 357–365). Darlinghurst, Australia: Australian Computer Society, Inc.
zu Eissen, S. M., & Stein, B. (2006). Intrinsic plagiarism detection. In Proceedings of the ninth international conference on text, speech and dialogue (pp. 661–667).
zu Eissen, S. M., Stein, B., & Kulig, M. (2007). Plagiarism detection without reference collections. In R. Decker & H. Lenz (Eds.), Advances in data analysis (pp. 359–366). London: Springer.
Acknowledgments
We thank James Gregory, Aman Brar, Chris Bishop, Saleh Al Belwi and Congyun Long for organising the data collection and all participants involved in generating examples for the corpus.
Cite this article
Clough, P., Stevenson, M. Developing a corpus of plagiarised short answers. Lang Resources & Evaluation 45, 5–24 (2011). https://doi.org/10.1007/s10579-009-9112-1
DOI: https://doi.org/10.1007/s10579-009-9112-1