Detecting Machine-Obfuscated Plagiarism

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12051)

Abstract

Research on academic integrity has identified online paraphrasing tools as a severe threat to the effectiveness of plagiarism detection systems. To enable the automated identification of machine-paraphrased text, we make three contributions. First, we evaluate the effectiveness of six prominent word embedding models in combination with five classifiers for distinguishing human-written from machine-paraphrased text. The best-performing classification approach achieves an accuracy of 99.0% for documents and 83.4% for paragraphs. Second, we show that the best approach outperforms human experts and established plagiarism detection systems for these classification tasks. Third, we provide a Web application that uses the best-performing classification approach to indicate whether a text underwent machine paraphrasing. The data and code of our study are openly available.
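As a rough illustration of the general pipeline the abstract describes (embed a text, then classify the resulting vector), the following minimal sketch averages word vectors into a single text vector and assigns the label of the nearest class centroid. Everything in it is an assumption made for illustration: the three-dimensional `TOY_EMBEDDINGS` values are invented, the training texts are made up, and the nearest-centroid rule is a simple stand-in rather than one of the classifiers evaluated in the study, which the abstract does not name.

```python
from math import dist

DIM = 3

# Invented toy vectors (assumption): a real setup would load pretrained
# embeddings instead of hard-coding values like these.
TOY_EMBEDDINGS = {
    "plagiarism": [0.9, 0.1, 0.0],
    "detection": [0.8, 0.2, 0.1],
    "systems": [0.7, 0.1, 0.2],
    "copying": [0.1, 0.9, 0.0],
    "spotting": [0.2, 0.8, 0.1],
    "frameworks": [0.1, 0.7, 0.2],
}

# Made-up training texts (assumption), labelled by hand.
TRAINING = [
    ("plagiarism detection systems", "original"),
    ("detection systems", "original"),
    ("copying spotting frameworks", "paraphrased"),
    ("spotting frameworks", "paraphrased"),
]


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def embed(text):
    """Average the vectors of all known words; unknown words are skipped."""
    vectors = [TOY_EMBEDDINGS[w] for w in text.lower().split() if w in TOY_EMBEDDINGS]
    if not vectors:
        return [0.0] * DIM
    return mean_vector(vectors)


def train_centroids(samples):
    """Map each label to the mean embedding of its training texts."""
    grouped = {}
    for text, label in samples:
        grouped.setdefault(label, []).append(embed(text))
    return {label: mean_vector(vectors) for label, vectors in grouped.items()}


def classify(text, centroids):
    """Return the label whose centroid is closest in Euclidean distance."""
    vector = embed(text)
    return min(centroids, key=lambda label: dist(vector, centroids[label]))


if __name__ == "__main__":
    centroids = train_centroids(TRAINING)
    print(classify("plagiarism detection", centroids))
    print(classify("copying frameworks", centroids))
```

Averaging word vectors discards word order, so this sketch can only react to vocabulary shifts of the kind synonym-replacing tools produce; it says nothing about how the study's own feature extraction or classifiers were configured.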

Notes

  1. https://edintegrity.biomedcentral.com/mbp
  2. https://en.wikipedia.org/wiki/Wikipedia:Content_assessment
  3. https://spinbot.com/API
  4. https://paraphrasing-tool.com/
  5. https://free-article-spinner.com/
  6. https://nlp.stanford.edu/projects/glove/
  7. https://code.google.com/archive/p/word2vec/
  8. https://fasttext.cc/docs/en/english-vectors.html
  9. https://tfhub.dev/google/universal-sentence-encoder/2
  10. https://radimrehurek.com/gensim/models/doc2vec.html
  11. https://www.quiz-maker.com/
  12. http://www.randomtextgenerator.com/
  13. https://arxiv.org/
  14. http://qwone.com/~jason/20Newsgroups/
  15. https://seotoolscentre.com/article-rewriter-tool
  16. http://www.ezrewrite.com/
  17. http://www.spinnerchief.com/

Author information

Corresponding author

Correspondence to Tomáš Foltýnek.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Foltýnek, T. et al. (2020). Detecting Machine-Obfuscated Plagiarism. In: Sundqvist, A., Berget, G., Nolin, J., Skjerdingstad, K. (eds) Sustainable Digital Communities. iConference 2020. Lecture Notes in Computer Science, vol 12051. Springer, Cham. https://doi.org/10.1007/978-3-030-43687-2_68

  • DOI: https://doi.org/10.1007/978-3-030-43687-2_68

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-43686-5

  • Online ISBN: 978-3-030-43687-2

  • eBook Packages: Computer Science; Computer Science (R0)
