Abstract
Research on academic integrity has identified online paraphrasing tools as a severe threat to the effectiveness of plagiarism detection systems. To enable the automated identification of machine-paraphrased text, we make three contributions. First, we evaluate the effectiveness of six prominent word embedding models in combination with five classifiers for distinguishing human-written from machine-paraphrased text. The best-performing classification approach achieves an accuracy of 99.0% for documents and 83.4% for paragraphs. Second, we show that the best approach outperforms human experts and established plagiarism detection systems on these classification tasks. Third, we provide a Web application that uses the best-performing classification approach to indicate whether a text has been machine-paraphrased. The data and code of our study are openly available.
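The general idea of pairing an embedding model with a classifier can be illustrated with a minimal sketch. It is not the study's implementation: the three-dimensional word vectors are invented, the nearest-centroid classifier is a stand-in chosen for brevity rather than one of the evaluated classifiers, and the names `TOY_EMBEDDINGS`, `embed`, `train`, and `classify` are hypothetical.

```python
import math

# Invented 3-dimensional word vectors (an assumption for illustration only).
# A real system would load a pretrained embedding model instead.
TOY_EMBEDDINGS = {
    "students": [0.9, 0.1, 0.0],
    "write": [0.8, 0.2, 0.1],
    "essays": [0.7, 0.1, 0.2],
    "pupils": [0.1, 0.9, 0.1],
    "compose": [0.2, 0.8, 0.0],
    "papers": [0.1, 0.7, 0.3],
}
DIMENSIONS = 3


def embed(text):
    """Represent a text as the average of its known word vectors."""
    vectors = [TOY_EMBEDDINGS[w] for w in text.lower().split() if w in TOY_EMBEDDINGS]
    if not vectors:
        # No known words: fall back to the zero vector.
        return [0.0] * DIMENSIONS
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def train(labelled_texts):
    """Compute one centroid per label from (text, label) pairs."""
    by_label = {}
    for text, label in labelled_texts:
        by_label.setdefault(label, []).append(embed(text))
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in by_label.items()
    }


def classify(model, text):
    """Return the label whose centroid is closest (Euclidean) to the text vector."""
    vector = embed(text)
    return min(model, key=lambda label: math.dist(vector, model[label]))


if __name__ == "__main__":
    model = train([
        ("students write essays", "original"),
        ("pupils compose papers", "paraphrased"),
    ])
    print(classify(model, "students write"))
    print(classify(model, "pupils compose"))
```

Averaging word vectors is only one simple way to turn a text into a fixed-length feature vector; the sketch makes no claim about which embedding model or classifier performed best in the study.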
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Foltýnek, T. et al. (2020). Detecting Machine-Obfuscated Plagiarism. In: Sundqvist, A., Berget, G., Nolin, J., Skjerdingstad, K. (eds) Sustainable Digital Communities. iConference 2020. Lecture Notes in Computer Science, vol 12051. Springer, Cham. https://doi.org/10.1007/978-3-030-43687-2_68
DOI: https://doi.org/10.1007/978-3-030-43687-2_68
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-43686-5
Online ISBN: 978-3-030-43687-2
eBook Packages: Computer Science (R0)