Abstract
The most important approaches to computer-assisted authorship attribution are exclusively based on lexical measures that either represent the vocabulary richness of the author or simply comprise frequencies of occurrence of common words. In this paper we present a fully-automated approach to the identification of the authorship of unrestricted text that excludes any lexical measure. Instead we adapt a set of style markers to the analysis of the text performed by an already existing natural language processing tool using three stylometric levels, i.e., token-level, phrase-level, and analysis-level measures. The latter represent the way in which the text has been analyzed. The presented experiments on a Modern Greek newspaper corpus show that the proposed set of style markers is able to distinguish reliably the authors of a randomly-chosen group and performs better than a lexically-based approach. However, the combination of these two approaches provides the most accurate solution (i.e., 87% accuracy). Moreover, we describe experiments on various sizes of the training data as well as tests dealing with the significance of the proposed set of style markers.
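As a rough illustration of what a non-lexical, token-level measure might look like, the sketch below computes two simple ratios from raw text without consulting any word frequencies or vocabulary statistics. The function name `token_level_markers`, the regular-expression tokenizer, the naive sentence counting, and the two ratios are all assumptions made for illustration. The abstract does not list the markers actually used, which are derived from the output of an existing natural language processing tool rather than from a regular expression.

```python
import re


def token_level_markers(text: str) -> dict:
    """Hypothetical token-level style markers (illustrative only).

    Neither ratio depends on which words occur, only on how many
    words, punctuation marks, and sentences the text contains.
    """
    # Naive tokenization: runs of word characters, or single symbols.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    words = [t for t in tokens if re.fullmatch(r"\w+", t)]
    punctuation = [t for t in tokens if not re.fullmatch(r"\w+", t)]

    # Naive sentence detection: count sentence-final marks
    # (an assumption; a real tool would disambiguate boundaries).
    sentences = max(1, sum(1 for t in punctuation if t in ".!?"))

    return {
        "words_per_sentence": len(words) / sentences,
        "punctuation_per_token": len(punctuation) / max(1, len(tokens)),
    }


if __name__ == "__main__":
    sample = "The cat sat. The dog ran, then slept!"
    print(token_level_markers(sample))
```

Vectors of such ratios, one per text, could then be passed to any standard classifier. The choice of classifier is left open here, since the abstract does not specify it.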
Cite this article
Stamatatos, E., Fakotakis, N. & Kokkinakis, G. Computer-Based Authorship Attribution Without Lexical Measures. Computers and the Humanities 35, 193–214 (2001). https://doi.org/10.1023/A:1002681919510