Abstract
Homoglyphs can be used to disguise plagiarized text by replacing letters in source texts with visually identical letters from other scripts. Most current plagiarism detection systems cannot detect plagiarism in text that has been obfuscated with homoglyphs. In this work, we present two alternative approaches for detecting plagiarism in homoglyph-obfuscated texts. The first approach uses the Unicode list of confusables to replace homoglyphs with visually identical letters, while the second approach uses a similarity score computed from the normalized Hamming distance to match homoglyph-obfuscated words with source words. Empirical testing on datasets from PAN-2015 shows that both approaches perform equally well for plagiarism detection in homoglyph-obfuscated texts.
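The two approaches named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `CONFUSABLES` table is a tiny hand-picked excerpt rather than the full Unicode confusables list, and the function names, the handling of unequal-length words, and the `threshold` value are all assumptions made for illustration.

```python
# Illustrative excerpt only: a few Cyrillic letters mapped to the Latin letters
# they resemble. The real Unicode confusables list is far larger.
CONFUSABLES = {
    "\u0430": "a",  # Cyrillic small letter a
    "\u0435": "e",  # Cyrillic small letter ie
    "\u043e": "o",  # Cyrillic small letter o
    "\u0440": "p",  # Cyrillic small letter er
    "\u0441": "c",  # Cyrillic small letter es
    "\u0445": "x",  # Cyrillic small letter ha
}


def replace_homoglyphs(text: str) -> str:
    """Approach 1 (sketch): swap each known homoglyph for its Latin look-alike."""
    return "".join(CONFUSABLES.get(ch, ch) for ch in text)


def hamming_similarity(a: str, b: str) -> float:
    """Approach 2 (sketch): 1 minus the normalized Hamming distance.

    Hamming distance is defined only for equal-length strings, so this
    sketch assumes a score of 0.0 for words of different lengths.
    """
    if len(a) != len(b) or not a:
        return 0.0
    mismatches = sum(x != y for x, y in zip(a, b))
    return 1.0 - mismatches / len(a)


def best_match(word, source_words, threshold=0.6):
    """Return the most similar source word, or None if no score reaches
    the threshold (the threshold value is a hypothetical choice)."""
    scored = [(hamming_similarity(word, s), s) for s in source_words]
    score, match = max(scored, default=(0.0, None))
    return match if score >= threshold else None


# "plagiarism" with both Latin "a" letters replaced by Cyrillic "а" (U+0430).
obfuscated = "pl\u0430gi\u0430rism"
print(replace_homoglyphs(obfuscated))  # plagiarism
print(best_match(obfuscated, ["plagiarism", "paraphrase", "detection"]))  # plagiarism
```

The lookup-based sketch depends on the homoglyph being present in the table, whereas the similarity-based sketch needs no table but tolerates only a limited share of substituted characters per word before the score falls below the chosen threshold.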
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Alvi, F., Stevenson, M., Clough, P. (2017). Plagiarism Detection in Texts Obfuscated with Homoglyphs. In: Jose, J., et al. Advances in Information Retrieval. ECIR 2017. Lecture Notes in Computer Science, vol 10193. Springer, Cham. https://doi.org/10.1007/978-3-319-56608-5_64
DOI: https://doi.org/10.1007/978-3-319-56608-5_64
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-56607-8
Online ISBN: 978-3-319-56608-5
eBook Packages: Computer Science, Computer Science (R0)