Identifying Machine-Paraphrased Plagiarism

  • Conference paper
  • In: Information for a Better World: Shaping the Global Future (iConference 2022)

Abstract

Employing paraphrasing tools to conceal plagiarized text is a severe threat to academic integrity. To enable the detection of machine-paraphrased text, we evaluate the effectiveness of five pre-trained word embedding models combined with machine learning classifiers and state-of-the-art neural language models. We analyze preprints of research papers, graduation theses, and Wikipedia articles, which we paraphrased using different configurations of the tools SpinBot and SpinnerChief. The best-performing technique, Longformer, achieved an average F1 score of 80.99% (F1 = 99.68% for SpinBot and F1 = 71.64% for SpinnerChief cases), while human evaluators achieved F1 = 78.4% for SpinBot and F1 = 65.6% for SpinnerChief cases. We show that the automated classification alleviates shortcomings of widely used text-matching systems, such as Turnitin and PlagScan.
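The fine-tuned Longformer classifier is released on Hugging Face (note 6 below). As a minimal sketch, assuming the standard transformers sequence-classification interface and that label index 1 corresponds to "machine-paraphrased" (the label order is an assumption; consult the model card), a paragraph could be scored like this:

```python
# Hedged sketch: score a paragraph with the released Longformer classifier (note 6).
# The label order (index 1 = machine-paraphrased) is an assumption, not confirmed here.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "jpelhaw/longformer-base-plagiarism-detection"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
model.eval()

paragraph = (
    "Employing paraphrasing tools to conceal plagiarized text is a severe "
    "threat to academic integrity."
)

# Longformer handles long inputs; 4096 tokens is the usual limit for the base model.
inputs = tokenizer(paragraph, truncation=True, max_length=4096, return_tensors="pt")

with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1).squeeze()

print(f"P(original)            = {probs[0]:.3f}")  # assumed label order
print(f"P(machine-paraphrased) = {probs[1]:.3f}")
```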


Notes

  1. https://edintegrity.biomedcentral.com/mbp.
  2. https://arxiv.org.
  3. https://doi.org/10.5281/zenodo.3608000
  4. https://github.com/jpelhaW/ParaphraseDetection
  5. http://purl.org/spindetector
  6. https://huggingface.co/jpelhaw/longformer-base-plagiarism-detection
  7. https://spinbot.com/.
  8. https://paraphrasing-tool.com/.
  9. https://free-article-spinner.com/.
  10. http://www.spinnerchief.com/.
  11. https://en.wikipedia.org/wiki/Wikipedia:Content_assessment.
  12. https://kwarc.info/projects/arXMLiv/.
  13. https://nlp.stanford.edu/projects/glove/.
  14. https://code.google.com/archive/p/word2vec/.
  15. https://fasttext.cc/docs/en/english-vectors.html.
  16. https://radimrehurek.com/gensim/models/doc2vec.html.
  17. https://scikit-learn.org.
  18. 99.35% of the datasets’ text can be represented with less than 512 tokens.
  19. https://www.quiz-maker.com/.
  20. https://doi.org/10.5281/zenodo.3608000.
  21. https://github.com/jpelhaW/ParaphraseDetection.
  22. http://purl.org/spindetector.
  23. https://huggingface.co/jpelhaw/longformer-base-plagiarism-detection.
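Notes 13–17 above list the pre-trained embeddings (GloVe, word2vec, fastText, doc2vec) and the scikit-learn toolkit behind the classical baselines the abstract refers to. The following is an illustrative sketch, not the paper's exact configuration: it mean-pools pre-trained GloVe vectors over each document and trains a scikit-learn classifier; the pooling strategy, the SVM choice, and the load_spin_dataset helper for the Zenodo dataset (note 3) are assumptions.

```python
# Illustrative baseline (assumed configuration): mean-pooled GloVe vectors + SVM.
import numpy as np
import gensim.downloader as api
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

word_vectors = api.load("glove-wiki-gigaword-300")  # pre-trained GloVe embeddings

def embed(text: str) -> np.ndarray:
    """Average the vectors of all in-vocabulary tokens; zero vector if none match."""
    tokens = [t for t in text.lower().split() if t in word_vectors]
    if not tokens:
        return np.zeros(word_vectors.vector_size)
    return np.mean([word_vectors[t] for t in tokens], axis=0)

# Hypothetical loader: docs is a list of paragraphs, labels has 1 = machine-paraphrased.
docs, labels = load_spin_dataset()

X = np.vstack([embed(d) for d in docs])
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=0
)

clf = SVC(kernel="rbf").fit(X_train, y_train)
print("F1:", f1_score(y_test, clf.predict(X_test)))
```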


Author information


Corresponding author

Correspondence to Jan Philip Wahle.



Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Wahle, J.P., Ruas, T., Foltýnek, T., Meuschke, N., Gipp, B. (2022). Identifying Machine-Paraphrased Plagiarism. In: Smits, M. (eds) Information for a Better World: Shaping the Global Future. iConference 2022. Lecture Notes in Computer Science, vol 13192. Springer, Cham. https://doi.org/10.1007/978-3-030-96957-8_34


  • DOI: https://doi.org/10.1007/978-3-030-96957-8_34

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-96956-1

  • Online ISBN: 978-3-030-96957-8

  • eBook Packages: Computer Science, Computer Science (R0)
