Plagiarism Detection in Texts Obfuscated with Homoglyphs

  • Conference paper
Advances in Information Retrieval (ECIR 2017)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10193)

Abstract

Homoglyphs can be used to disguise plagiarized text by replacing letters in source texts with visually identical letters from other scripts. Most current plagiarism detection systems cannot detect plagiarism in text that has been obfuscated with homoglyphs. In this work, we present two alternative approaches for detecting plagiarism in homoglyph-obfuscated texts. The first approach uses the Unicode list of confusables to replace homoglyphs with their visually identical letters, while the second approach uses a similarity score computed from the normalized Hamming distance to match homoglyph-obfuscated words with source words. Empirical testing on datasets from PAN-2015 shows that both approaches perform equally well for plagiarism detection in homoglyph-obfuscated texts.
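The two approaches described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `CONFUSABLES` table is a tiny hand-picked stand-in for the Unicode confusables list, the function names are hypothetical, and both the 0.5 threshold and the choice to score words of unequal length as 0.0 are assumptions made only for illustration.

```python
# Illustrative sketch only; names, mapping subset, and threshold are assumptions.

# Tiny hand-picked subset of homoglyph mappings (Cyrillic -> Latin).
# The real Unicode confusables list is far larger and is not loaded here.
CONFUSABLES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0456": "i",  # CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
}


def normalize_homoglyphs(text: str) -> str:
    """Approach 1: replace each known homoglyph with its Latin look-alike."""
    return "".join(CONFUSABLES.get(ch, ch) for ch in text)


def hamming_similarity(a: str, b: str) -> float:
    """Approach 2: 1 minus the normalized Hamming distance.

    Hamming distance is only defined for equal-length strings, so this
    sketch returns 0.0 for unequal lengths or empty input (an assumption).
    """
    if len(a) != len(b) or not a:
        return 0.0
    mismatches = sum(x != y for x, y in zip(a, b))
    return 1.0 - mismatches / len(a)


def matches_source(word: str, source_word: str, threshold: float = 0.5) -> bool:
    """Treat a word as a match when its similarity reaches the threshold.

    The threshold value is a placeholder, not a value taken from the paper.
    """
    return hamming_similarity(word, source_word) >= threshold


# "plagiarism" with both Latin 'a' characters swapped for Cyrillic 'а'.
obfuscated = "pl\u0430gi\u0430rism"
restored = normalize_homoglyphs(obfuscated)
score = hamming_similarity(obfuscated, "plagiarism")
```

In this example the obfuscated word differs from the source word in 2 of 10 positions, so `score` is 0.8 and the word still matches without any lookup table, whereas `restored` becomes an exact match only because both substituted characters appear in the mapping.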

Notes

  1. https://en.wikipedia.org/wiki/IDN_homograph_attack

Author information

Corresponding author

Correspondence to Faisal Alvi.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Alvi, F., Stevenson, M., Clough, P. (2017). Plagiarism Detection in Texts Obfuscated with Homoglyphs. In: Jose, J., et al. Advances in Information Retrieval. ECIR 2017. Lecture Notes in Computer Science, vol 10193. Springer, Cham. https://doi.org/10.1007/978-3-319-56608-5_64

  • DOI: https://doi.org/10.1007/978-3-319-56608-5_64

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-56607-8

  • Online ISBN: 978-3-319-56608-5

  • eBook Packages: Computer Science, Computer Science (R0)
