Abstract
Intrinsic plagiarism detection is the task of finding plagiarized sections in a text document without using a reference corpus. This paper describes a novel approach that analyzes the grammar used by authors and applies sliding windows to find significant differences in writing style. To find suspicious passages, the algorithm splits a document into single sentences, computes a syntax (grammar) tree for each sentence, and builds profiles based on frequently used grammar patterns. The text is then traversed with a sliding window, and each window is compared to the document profile using a distance metric. Finally, all sentences whose distance is significantly higher, as judged against a Gaussian normal distribution, are marked as suspicious. A preliminary evaluation of the algorithm shows very promising results.
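The pipeline outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the grammar patterns of each sentence have already been extracted from a parse tree and are given as plain strings, it uses an L1 distance between relative pattern frequencies as a stand-in for the paper's distance metric, and it flags whole windows (rather than single sentences) whose distance exceeds the mean by `k` standard deviations. The function names, the pattern labels, and the parameters `window_size`, `step`, and `k` are all hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def profile(pattern_lists):
    """Relative frequency of each grammar pattern across the given sentences."""
    counts = Counter(p for patterns in pattern_lists for p in patterns)
    total = sum(counts.values())
    return {p: c / total for p, c in counts.items()} if total else {}


def distance(window_profile, doc_profile):
    """Assumed metric: L1 distance between two frequency profiles."""
    keys = set(window_profile) | set(doc_profile)
    return sum(abs(window_profile.get(k, 0.0) - doc_profile.get(k, 0.0))
               for k in keys)


def suspicious_windows(sentence_patterns, window_size=1, step=1, k=1.5):
    """Return start indices of windows with distance > mean + k * std."""
    doc_profile = profile(sentence_patterns)
    last_start = max(len(sentence_patterns) - window_size, 0)
    starts = list(range(0, last_start + 1, step))
    dists = [distance(profile(sentence_patterns[s:s + window_size]), doc_profile)
             for s in starts]
    mu, sigma = mean(dists), pstdev(dists)
    if sigma == 0:
        return []  # every window looks the same, nothing stands out
    threshold = mu + k * sigma  # Gaussian-style cut-off (assumed form)
    return [s for s, d in zip(starts, dists) if d > threshold]


# Toy input: nine sentences share one style, sentence 5 uses other patterns.
example = [["NP-VP", "DT-NN"]] * 5 + [["SBAR-S", "WHNP-VP"]] + [["NP-VP", "DT-NN"]] * 4
```

Calling `suspicious_windows(example)` on the toy input flags the window that starts at sentence index 5. A real system would obtain the patterns from a syntactic parser and would need to map flagged windows back to individual sentences.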
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Tschuggnall, M., Specht, G. (2013). Using Grammar-Profiles to Intrinsically Expose Plagiarism in Text Documents. In: Métais, E., Meziane, F., Saraee, M., Sugumaran, V., Vadera, S. (eds) Natural Language Processing and Information Systems. NLDB 2013. Lecture Notes in Computer Science, vol 7934. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-38824-8_28
DOI: https://doi.org/10.1007/978-3-642-38824-8_28
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-38823-1
Online ISBN: 978-3-642-38824-8
eBook Packages: Computer Science, Computer Science (R0)