Computer-Based Authorship Attribution Without Lexical Measures

Stamatatos, E.; Fakotakis, N.; Kokkinakis, G.

doi:10.1023/A:1002681919510

Published: May 2001

Computers and the Humanities, Volume 35, pages 193–214 (2001)

Abstract

The most important approaches to computer-assisted authorship attribution are exclusively based on lexical measures that either represent the vocabulary richness of the author or simply comprise frequencies of occurrence of common words. In this paper we present a fully-automated approach to the identification of the authorship of unrestricted text that excludes any lexical measure. Instead we adapt a set of style markers to the analysis of the text performed by an already existing natural language processing tool using three stylometric levels, i.e., token-level, phrase-level, and analysis-level measures. The latter represent the way in which the text has been analyzed. The presented experiments on a Modern Greek newspaper corpus show that the proposed set of style markers is able to distinguish reliably the authors of a randomly-chosen group and performs better than a lexically-based approach. However, the combination of these two approaches provides the most accurate solution (i.e., 87% accuracy). Moreover, we describe experiments on various sizes of the training data as well as tests dealing with the significance of the proposed set of style markers.
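
To make the idea of a non-lexical, token-level style marker concrete, the following is a minimal sketch in Python. It is not the authors' tool: the abstract states that the markers are derived from an existing natural language processing tool for Modern Greek, which is not reproduced here. The function name `token_level_markers`, the two ratios it returns, and the regex-based sentence and token splitting are all illustrative assumptions.

```python
import re


def token_level_markers(text: str) -> dict[str, float]:
    """Compute two illustrative token-level style markers.

    Assumption: these ratios are examples of measures that ignore
    vocabulary content; they are not the paper's exact marker set.
    """
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Tokens are either runs of word characters or single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    words = [t for t in tokens if re.fullmatch(r"\w+", t)]
    punctuation = [t for t in tokens if not re.fullmatch(r"\w+", t)]
    n_words = len(words) or 1  # avoid division by zero on empty input
    return {
        "sentences_per_word": len(sentences) / n_words,
        "punctuation_per_word": len(punctuation) / n_words,
    }


if __name__ == "__main__":
    print(token_level_markers("Hello, world. This is a test!"))
```

Neither ratio depends on which words an author uses, only on how the text is segmented, which is the sense in which such measures are non-lexical. Vectors of this kind, computed per text, could then be passed to any standard classifier.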

Author information

Authors and Affiliations

Dept. of Electrical and Computer Engineering, University of Patras, 265 00 Patras, Greece
E. Stamatatos, N. Fakotakis & G. Kokkinakis

About this article

Cite this article

Stamatatos, E., Fakotakis, N. & Kokkinakis, G. Computer-Based Authorship Attribution Without Lexical Measures. Computers and the Humanities 35, 193–214 (2001). https://doi.org/10.1023/A:1002681919510

Issue Date: May 2001
DOI: https://doi.org/10.1023/A:1002681919510
