Detecting Machine-Obfuscated Plagiarism

Foltýnek, Tomáš; Ruas, Terry; Scharpf, Philipp; Meuschke, Norman; Schubotz, Moritz; Grosky, William; Gipp, Bela

doi:10.1007/978-3-030-43687-2_68

Conference paper
First Online: 19 March 2020

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12051)

Abstract

Research on academic integrity has identified online paraphrasing tools as a severe threat to the effectiveness of plagiarism detection systems. To enable the automated identification of machine-paraphrased text, we make three contributions. First, we evaluate the effectiveness of six prominent word embedding models in combination with five classifiers for distinguishing human-written from machine-paraphrased text. The best-performing classification approach achieves an accuracy of 99.0% for documents and 83.4% for paragraphs. Second, we show that the best approach outperforms human experts and established plagiarism detection systems on these classification tasks. Third, we provide a Web application that uses the best-performing classification approach to indicate whether a text has undergone machine paraphrasing. The data and code of our study are openly available.
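
The general idea of pairing word embeddings with a classifier can be illustrated with a minimal sketch. This is not the study's implementation: the two-dimensional word vectors, the training texts, and the function names below are all hypothetical, and a simple nearest-centroid rule stands in for the classifiers, which the abstract does not name. The sketch averages the word vectors of a text and assigns the label whose class centroid is most similar by cosine similarity.

```python
from math import sqrt

# Hypothetical toy word vectors (assumption); a real system would load
# pretrained embeddings with hundreds of dimensions.
TOY_VECTORS = {
    "use": [1.0, 0.0],
    "help": [0.9, 0.1],
    "utilize": [0.0, 1.0],
    "assist": [0.1, 0.9],
}

# Hypothetical labeled training texts (assumption).
TRAIN = [
    ("use help", "human"),
    ("utilize assist", "paraphrased"),
]


def embed(text, vectors, dim):
    """Average the vectors of known words; zero vector if none are known."""
    known = [vectors[w] for w in text.lower().split() if w in vectors]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity; defined as 0.0 when either vector has zero length."""
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def train(examples, vectors, dim):
    """Compute one centroid (mean text vector) per label."""
    by_label = {}
    for text, label in examples:
        by_label.setdefault(label, []).append(embed(text, vectors, dim))
    return {
        label: [sum(col) / len(rows) for col in zip(*rows)]
        for label, rows in by_label.items()
    }


def predict(text, centroids, vectors, dim):
    """Return the label whose centroid is most similar to the text vector."""
    v = embed(text, vectors, dim)
    return max(centroids, key=lambda label: cosine(v, centroids[label]))


if __name__ == "__main__":
    centroids = train(TRAIN, TOY_VECTORS, 2)
    print(predict("assist", centroids, TOY_VECTORS, 2))
```

Averaging discards word order, which keeps the sketch short; the accuracies reported in the abstract come from the study's own models and data, not from anything resembling this toy setup.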

Author information

Authors and Affiliations

University of Wuppertal, Rainer-Gruenter-Str. 21, 42119, Wuppertal, Germany
Tomáš Foltýnek, Terry Ruas, Norman Meuschke, Moritz Schubotz & Bela Gipp
Mendel University in Brno, Zemědělská 1, 613 00, Brno, Czechia
Tomáš Foltýnek
University of Konstanz, Universitätsstraße 10, 78464, Konstanz, Germany
Philipp Scharpf & Norman Meuschke
University of Michigan-Dearborn, 4901 Evergreen Rd, Dearborn, 48128, USA
Terry Ruas & William Grosky

Corresponding author

Correspondence to Tomáš Foltýnek.

Editor information

Editors and Affiliations

OsloMet – Oslo Metropolitan University, Oslo, Norway
Anneli Sundqvist
OsloMet – Oslo Metropolitan University, Oslo, Norway
Gerd Berget
University of Boras, Boras, Sweden
Jan Nolin
OsloMet – Oslo Metropolitan University, Oslo, Norway
Kjell Ivar Skjerdingstad

About this paper

Cite this paper

Foltýnek, T. et al. (2020). Detecting Machine-Obfuscated Plagiarism. In: Sundqvist, A., Berget, G., Nolin, J., Skjerdingstad, K. (eds) Sustainable Digital Communities. iConference 2020. Lecture Notes in Computer Science, vol 12051. Springer, Cham. https://doi.org/10.1007/978-3-030-43687-2_68

DOI: https://doi.org/10.1007/978-3-030-43687-2_68
Published: 19 March 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-43686-5
Online ISBN: 978-3-030-43687-2
eBook Packages: Computer Science, Computer Science (R0)
