Syntactic Dependency-Based N-grams as Classification Features
In this paper we introduce the concept of syntactic n-grams (sn-grams). Sn-grams differ from traditional n-grams in which elements are considered neighbors. In sn-grams, neighbors are determined by following the syntactic relations in a syntactic tree rather than by the order in which words appear in the text. Dependency trees fit this idea directly, while constituency trees require some simple additional steps. Sn-grams can be applied in any NLP task where traditional n-grams are used. We describe how sn-grams were applied to authorship attribution, using an SVM classifier with several profile sizes. As a baseline we used traditional n-grams of words, POS tags, and characters. The results obtained with sn-grams are better than the baseline.
Keywords: syntactic n-grams, sn-grams, parsing, classification features, syntactic paths, authorship attribution
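The difference between the two kinds of neighborhood described in the abstract can be illustrated with a minimal sketch. It assumes that an sn-gram is a continuous head-to-dependent path of n words in a dependency tree, which is one reading of the abstract and not necessarily the exact construction used in the paper. The function names `sn_grams` and `word_n_grams`, the toy sentence, and its hand-written head indices are all hypothetical and serve only as illustration.

```python
from collections import Counter


def build_children(heads):
    """Map each head index to the list of its dependents' indices."""
    children = {}
    for dep, head in enumerate(heads):
        if head is not None:
            children.setdefault(head, []).append(dep)
    return children


def sn_grams(words, heads, n):
    """Collect every head-to-dependent path of n words in the tree.

    Assumption: an sn-gram is read off by following dependency arcs
    downward from a head, not by following the surface word order.
    """
    children = build_children(heads)
    result = []

    def extend(path):
        if len(path) == n:
            result.append(tuple(words[i] for i in path))
            return
        for child in children.get(path[-1], []):
            extend(path + [child])

    for start in range(len(words)):
        extend([start])
    return result


def word_n_grams(words, n):
    """Traditional n-grams: neighbors are adjacent words in the text."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


# Hypothetical toy sentence with hand-annotated heads (None marks the root).
words = ["the", "old", "dog", "barked", "loudly"]
heads = [2, 2, 3, None, 3]

print(sn_grams(words, heads, 2))
print(word_n_grams(words, 2))

# A frequency profile of sn-grams that could serve as classification features.
profile = Counter(sn_grams(words, heads, 2))
```

In this toy example the traditional bigrams include ("the", "old"), which are adjacent in the text but not syntactically linked, while the sn-gram bigrams include ("dog", "the"), which are linked in the tree but not adjacent. In practice the head indices would come from a parser, and the resulting counts would be passed to a classifier.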