Computer-Based Authorship Attribution Without Lexical Measures

Abstract

The most important approaches to computer-assisted authorship attribution are exclusively based on lexical measures that either represent the vocabulary richness of the author or simply comprise frequencies of occurrence of common words. In this paper we present a fully-automated approach to the identification of the authorship of unrestricted text that excludes any lexical measure. Instead we adapt a set of style markers to the analysis of the text performed by an already existing natural language processing tool using three stylometric levels, i.e., token-level, phrase-level, and analysis-level measures. The latter represent the way in which the text has been analyzed. The presented experiments on a Modern Greek newspaper corpus show that the proposed set of style markers is able to distinguish reliably the authors of a randomly-chosen group and performs better than a lexically-based approach. However, the combination of these two approaches provides the most accurate solution (i.e., 87% accuracy). Moreover, we describe experiments on various sizes of the training data as well as tests dealing with the significance of the proposed set of style markers.
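As a rough illustration of what a non-lexical, token-level style marker might look like, the sketch below computes two ratios from raw text. The function name `token_level_markers`, the two ratios, and the naive sentence splitting are assumptions made for this example; the abstract does not list the actual markers, and the paper derives its measures from the output of an existing natural language processing tool rather than from regular expressions.

```python
import re


def token_level_markers(text):
    """Compute two illustrative token-level ratios (hypothetical, not the paper's set)."""
    # Naive sentence split on runs of '.', '!' or '?'; a real system would
    # use a proper sentence boundary detector.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Words are runs of word characters; punctuation is any single character
    # that is neither a word character nor whitespace.
    words = re.findall(r"\w+", text)
    punctuation = re.findall(r"[^\w\s]", text)

    if not words or not sentences:
        return {"words_per_sentence": 0.0, "punctuation_per_word": 0.0}

    return {
        "words_per_sentence": len(words) / len(sentences),
        "punctuation_per_word": len(punctuation) / len(words),
    }


if __name__ == "__main__":
    print(token_level_markers("One two three. Four five, six!"))
```

Neither ratio depends on which words the author chose, only on how the text is segmented, which is the sense in which such markers avoid lexical information. Vectors of this kind could then be passed to any standard classifier.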

About this article

Cite this article

Stamatatos, E., Fakotakis, N. & Kokkinakis, G. Computer-Based Authorship Attribution Without Lexical Measures. Computers and the Humanities 35, 193–214 (2001). https://doi.org/10.1023/A:1002681919510

  • DOI: https://doi.org/10.1023/A:1002681919510

Keywords

  • Computational Linguistic
  • Important Approach
  • Common Word
  • Newspaper Corpus
  • Authorship Attribution