Detecting Machine-Obfuscated Plagiarism

  • Tomáš Foltýnek
  • Terry Ruas
  • Philipp Scharpf
  • Norman Meuschke
  • Moritz Schubotz
  • William Grosky
  • Bela Gipp
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12051)


Research on academic integrity has identified online paraphrasing tools as a severe threat to the effectiveness of plagiarism detection systems. To enable the automated identification of machine-paraphrased text, we make three contributions. First, we evaluate the effectiveness of six prominent word embedding models in combination with five classifiers for distinguishing human-written from machine-paraphrased text. The best performing classification approach achieves an accuracy of 99.0% for documents and 83.4% for paragraphs. Second, we show that the best approach outperforms human experts and established plagiarism detection systems for these classification tasks. Third, we provide a Web application that uses the best performing classification approach to indicate whether a text underwent machine-paraphrasing. The data and code of our study are openly available.
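The general idea of pairing a word embedding model with a classifier can be sketched as follows. This is a minimal illustration under stated assumptions, not the approach evaluated in the paper: the two-dimensional word vectors, the training sentences, and the helper names (`embed`, `train`, `classify`) are all hypothetical, and the nearest-centroid rule is a stand-in chosen for brevity rather than one of the classifiers the study compares.

```python
import math

# Hypothetical 2-dimensional word vectors, invented for illustration only.
# A real experiment would load a pretrained embedding model instead.
TOY_VECTORS = {
    "study": (0.9, 0.1),
    "shows": (0.8, 0.2),
    "results": (0.9, 0.2),
    "examination": (0.2, 0.9),
    "demonstrates": (0.1, 0.8),
    "outcomes": (0.2, 0.8),
}
DIM = 2


def embed(text):
    """Represent a text as the average vector of its known words."""
    vectors = [TOY_VECTORS[w] for w in text.lower().split() if w in TOY_VECTORS]
    if not vectors:
        return (0.0,) * DIM
    return tuple(sum(v[i] for v in vectors) / len(vectors) for i in range(DIM))


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def train(examples):
    """Compute one centroid per label from (text, label) pairs."""
    grouped = {}
    for text, label in examples:
        grouped.setdefault(label, []).append(embed(text))
    return {
        label: tuple(sum(v[i] for v in vecs) / len(vecs) for i in range(DIM))
        for label, vecs in grouped.items()
    }


def classify(text, centroids):
    """Return the label whose centroid is most similar to the text vector."""
    vec = embed(text)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Hypothetical training data: original wording vs. synonym-substituted wording.
TRAINING = [
    ("the study shows results", "human"),
    ("the results of the study", "human"),
    ("the examination demonstrates outcomes", "machine-paraphrased"),
    ("outcomes of the examination", "machine-paraphrased"),
]
centroids = train(TRAINING)
print(classify("study results", centroids))
print(classify("examination outcomes", centroids))
```

In a realistic setting, the toy vectors would be replaced by a pretrained embedding model and the centroid rule by a trained classifier, with accuracy measured on held-out documents and paragraphs.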


Keywords: Paraphrase detection, Plagiarism detection, Document classification, Word embeddings


Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. University of Wuppertal, Wuppertal, Germany
  2. Mendel University in Brno, Brno, Czechia
  3. University of Konstanz, Konstanz, Germany
  4. University of Michigan-Dearborn, Dearborn, USA
