Using Dependency-Based Annotations for Authorship Identification
Most statistical approaches to stylometry to date have focused on lexical methods, such as relative word frequencies or type-token ratios. Explicit attention to syntactic features has been comparatively rare. Those approaches that have used syntactic features typically either used very shallow features (such as parts of speech) or features based on phrase structure grammars. This paper investigates whether typed dependency grammars might yield useful stylometric features.
An experiment was conducted using a novel method of depicting information about typed dependencies. Each token in a text is replaced with a “DepWord,” which consists of a concise representation of the chain of grammatical dependencies from that token back to the root of the sentence. The resulting representation contains only syntactic information, with no lexical or orthographic information. These DepWords can then be used in place of the original words as the input for statistical language processing methods.
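As a rough illustration of the idea, the sketch below builds a DepWord-like string for each token by walking from the token up to the root of the sentence and collecting the dependency labels along the way. The abstract does not specify the exact encoding, so the input format (a list of `(head_index, relation)` pairs), the function name `depwords`, and the hyphen separator are all assumptions made for this example.

```python
from typing import List, Tuple


def depwords(parse: List[Tuple[int, str]], sep: str = "-") -> List[str]:
    """Return one DepWord-like string per token.

    parse[i] = (head_index, relation) for token i; the root token has
    head_index -1. This input format is an assumption for illustration.
    """
    result = []
    for i in range(len(parse)):
        chain = []
        seen = set()
        node = i
        # Follow head links until we step past the root.
        while node != -1:
            if node in seen:
                raise ValueError("cycle in dependency parse")
            seen.add(node)
            head, relation = parse[node]
            chain.append(relation)
            node = head
        result.append(sep.join(chain))
    return result


# Hand-written parse of "The dog barked":
# "The" is det of "dog", "dog" is nsubj of "barked", "barked" is the root.
example = [(1, "det"), (2, "nsubj"), (-1, "root")]
print(depwords(example))  # ['det-nsubj-root', 'nsubj-root', 'root']
```

Note that the output carries no trace of the original words: two sentences with the same dependency structure yield the same sequence of strings.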
I adapted a simple method of authorship attribution (nearest neighbor based on word frequency rankings) for use with DepWords and found that it performed comparably to the same technique trained on words or parts of speech, even outperforming lexical methods in some cases. This indicates that the grammatical dependency relations between words contain stylometric information sufficient for distinguishing authorship. These results suggest that further research into typed-dependency-based stylometry might prove fruitful.
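A minimal sketch of nearest-neighbor attribution over frequency rankings follows. It ranks the most frequent tokens in each text, measures how far apart two rankings are, and assigns the unknown text to the closest known author. The abstract does not give the exact distance measure, so the sum of absolute rank differences used here, the treatment of missing tokens, and the names `ranking`, `rank_distance`, and `attribute` are assumptions for illustration only. The tokens may be words, part-of-speech tags, or DepWords.

```python
from collections import Counter
from typing import Dict, List


def ranking(tokens: List[str], top_n: int) -> Dict[str, int]:
    """Map each of the top_n most frequent tokens to its rank (1 = most frequent)."""
    counts = Counter(tokens)
    # Sort by descending frequency; break ties alphabetically for determinism.
    ordered = sorted(counts, key=lambda t: (-counts[t], t))[:top_n]
    return {tok: rank for rank, tok in enumerate(ordered, start=1)}


def rank_distance(a: Dict[str, int], b: Dict[str, int], top_n: int) -> int:
    """Sum of absolute rank differences; absent tokens get rank top_n + 1 (an assumption)."""
    missing = top_n + 1
    return sum(abs(a.get(t, missing) - b.get(t, missing)) for t in set(a) | set(b))


def attribute(unknown: List[str], known: Dict[str, List[str]], top_n: int = 50) -> str:
    """Return the known author whose ranking is nearest to that of the unknown text."""
    target = ranking(unknown, top_n)
    return min(
        known,
        key=lambda author: rank_distance(target, ranking(known[author], top_n), top_n),
    )


# Toy data: each "text" is just a short token list.
known_texts = {
    "A": ["x", "x", "x", "y", "y", "z"],
    "B": ["z", "z", "z", "y", "y", "x"],
}
print(attribute(["x", "x", "y", "x", "z", "y"], known_texts))
```

Because the method only needs token frequencies, swapping words for DepWords requires no change to the classifier itself, only to the tokens fed into it.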
Keywords: stylometry, authorship attribution, syntax, dependency grammar, DepWords