Automatic Authorship Investigation


This chapter discusses authorship investigation studies, such as author verification, author attribution and author profiling, by means of automatic procedures. These procedures generally consist of two stages. In the first, we extract large numbers of counts of linguistic constructs (‘features’). In the second, we compare these counts between samples. On the basis of this comparison, we attempt to answer the authorship questions at hand. We describe various types of features and comparison methods, and the methodology for using them in studies. Then we present a case study, demonstrating that author verification with very high-quality results is possible, at least for book-length same-genre texts. Furthermore, we show that quality improves if rare features are included in the comparison data and that end-to-end deep learning systems already show high quality, but not yet at the level attained by traditional methods.


  • Author recognition
  • Author verification
  • Author attribution
  • Author profiling
  • Deep learning
  • Machine learning
  • Text classification
  • Text features
  • Siamese network
  • Stylome

DOI: 10.1007/978-3-030-84330-4_8
Chapter length: 37 pages
Fig. 8.1


  1.

    The concept of idiolect is discussed more extensively in Chap. 7.

  2.

    Compression methods such as ZIP operate by replacing repeated character sequences with pointers to an earlier occurrence of those sequences. If two authors prefer different language use, and therefore different character sequences, compression will work better on single-author texts than on mixed-author texts.
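
The compression argument in note 2 can be tried out with a minimal sketch. It assumes Python's standard `zlib` module as a stand-in for ZIP; the helper names `compressed_size` and `extra_cost` and the toy texts are invented for illustration and are not taken from the chapter.

```python
import zlib


def compressed_size(text: str) -> int:
    """Size in bytes of the text after DEFLATE compression (as used in ZIP)."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def extra_cost(known: str, unknown: str) -> int:
    """Additional bytes needed for `unknown` when it is appended to `known`."""
    return compressed_size(known + unknown) - compressed_size(known)


# Invented toy texts, for illustration only.
author_a = "whilst the matter was, nevertheless, wholly unresolved. " * 20
author_b = "gonna grab some coffee and then head out, you know? " * 20
disputed = "nevertheless, the matter was wholly unresolved whilst we spoke. "

cost_a = extra_cost(author_a, disputed)
cost_b = extra_cost(author_b, disputed)
print(cost_a, cost_b)  # the smaller cost points to the more similar author
```

Appending the disputed text to material with the same habitual character sequences should cost fewer extra bytes, because more of it can be encoded as pointers to earlier text.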

  3.

    IDF stands for Inverse Document Frequency: the logarithm of the total number of documents divided by the number of documents containing the word. Words occurring everywhere, like function words, have very low IDF, but rare and topic-specific tokens have high IDF.
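
As a small illustration of note 3, the sketch below computes an unsmoothed IDF with the natural logarithm. The function name and the toy documents are assumptions; the chapter may well use a smoothed or differently scaled variant.

```python
import math


def inverse_document_frequency(documents: list[list[str]]) -> dict[str, float]:
    """Map each token to log(N / df), with N documents and df documents containing it."""
    n_docs = len(documents)
    doc_freq: dict[str, int] = {}
    for tokens in documents:
        # Count each token once per document, however often it occurs there.
        for token in set(tokens):
            doc_freq[token] = doc_freq.get(token, 0) + 1
    return {token: math.log(n_docs / df) for token, df in doc_freq.items()}


# Invented toy documents, already tokenised.
docs = [["the", "cat"], ["the", "dog"], ["the", "stylome"]]
idf = inverse_document_frequency(docs)
```

Here `the` occurs in every document and gets an IDF of zero, while `stylome` occurs in only one and gets the highest value.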

  4.

    In the XML version of the BNC, texts are split into sentences. We removed the markers but used the sentence split to create 200 text samples of between 1950 and 2050 words. This was done by including full sentences until a size of 2000 words was reached; if the size exceeded 2050, we removed the last included sentence and checked that the size was still at least 1950; if not, the current sample was discarded.
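
The sampling procedure in note 4 might be sketched as follows. The function name `make_samples` and its parameters are hypothetical. The sketch also assumes that a sentence removed because of overshoot is dropped rather than carried over to the next sample, and that an unfinished sample at the end of the text is discarded; the note does not specify either point.

```python
def make_samples(sentences, target=2000, low=1950, high=2050):
    """Group tokenised sentences into samples of roughly `target` words.

    `sentences` is a list of token lists. Returns a list of samples,
    each sample being a list of sentences.
    """
    samples, current, size = [], [], 0
    for sentence in sentences:
        current.append(sentence)
        size += len(sentence)
        if size < target:
            continue  # keep adding full sentences
        if size > high:
            # Overshoot: remove the last included sentence again.
            current.pop()
            size -= len(sentence)
        if size >= low:
            samples.append(current)
        # Otherwise the sample is too short and is discarded.
        current, size = [], 0
    return samples
```

Every accepted sample then holds between 1950 and 2050 words and consists of whole sentences only.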

  5.

    Looking at this from the perspective of correctly marked cases, the true accept rate (TAR) is the fraction of samples from a Howard book in the test set that have feature values in the range of the current training book by Howard. The true reject rate (TRR) is the fraction of samples from a non-Howard book in the test set that have feature values outside that range. Similar to the calculation of the F-score from precision and recall, we calculate the OCSR as (2∙TAR∙TRR)/(TAR+TRR).
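
The OCSR formula in note 5 is the harmonic mean of TAR and TRR. A minimal sketch, in which the function name and the zero-denominator convention are assumptions:

```python
def ocsr(tar: float, trr: float) -> float:
    """Harmonic mean of the true accept rate and the true reject rate."""
    if tar + trr == 0:
        return 0.0  # assumed convention when both rates are zero
    return (2 * tar * trr) / (tar + trr)
```

Like the F-score, this measure is only high when both rates are high: a system that accepts everything has a TAR of 1 but a TRR of 0, and so scores 0.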

  6.

    N-gram is the term for n adjacent items in the text, typically n characters or n tokens. For example, the word ‘character’ contains the 4-gram ‘ract’. N is often kept low, and there are separate terms for 1-gram (‘unigram’), 2-gram (‘bigram’) and 3-gram (‘trigram’).
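
The definition in note 6 can be made concrete with a short sketch; the two function names are chosen here for illustration.

```python
def char_ngrams(text: str, n: int) -> list[str]:
    """All sequences of n adjacent characters in `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def token_ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """All sequences of n adjacent tokens in `tokens`."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


four_grams = char_ngrams("character", 4)
bigrams = token_ngrams(["the", "human", "stylome"], 2)
```

For the word ‘character’, `four_grams` contains ‘ract’ among its six entries, and `bigrams` holds the two token pairs of the three-word example.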

  7.

    A newline character (code 10 in ASCII and Unicode) indicates to the computer that the text should be continued on a new line.

  8.

    POS stands for Part Of Speech. A POS tag contains morpho-syntactic properties of a token (in its current context). Apart from the major word class, such as Noun or Preposition, it may contain additional information, such as number or tense.

  9.

    Stanford CoreNLP is just one of many options to obtain an automatic syntactic analysis. Some well-known alternatives are NLTK (Bird et al., 2009) and spaCy (Honnibal & Montani, 2017).

  10.

    For the computationally minded, the formula for entropy is $H(X) = -\sum_i P(x_i) \log P(x_i)$.
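
The entropy formula in note 10 can be evaluated on observed counts as sketched below. The note does not state the base of the logarithm, so base 2 (entropy in bits) is an assumption here, as is the function name.

```python
import math
from collections import Counter


def entropy(items, base: float = 2.0) -> float:
    """Entropy of the empirical distribution of `items`."""
    counts = Counter(items)
    total = sum(counts.values())
    # P(x_i) is estimated as the relative frequency of each distinct item.
    return -sum((c / total) * math.log(c / total, base) for c in counts.values())


uniform = entropy("aabb")  # two equally likely symbols
constant = entropy("aaaa")  # a single symbol, no uncertainty
```

Two equally frequent symbols yield one bit of entropy, and a sequence with only one symbol yields zero.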

  11.

    This section contains many technical details that are of little interest to the reader without experience in computational linguistics and machine learning. However, they are of vital importance for researchers who want to replicate the analysis.

  12.

    We would like to thank Benedikt Bönninghoff for providing us with useful assistance in using his software and adapting its functioning to our specific data and task.

  13.

    In principle, we can see that all distributions are bimodal. Nevertheless, using this fact in the alignment would mean an unfair comparison, as we would use the knowledge about the number of positive test samples present. Assuming that we mostly have negative samples, we can use the leftmost mode as reference, and then use only the samples with even lower values to estimate a ‘standard deviation’ for calculating the z-score.
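
One possible reading of the alignment in note 13 is sketched below. It assumes that the leftmost mode has already been located and is passed in as `mode`, and that the ‘standard deviation’ is the root mean squared distance to the mode of the samples below it. Both the function name and this estimator are assumptions, not the chapter's exact procedure.

```python
import math


def one_sided_z_scores(scores: list[float], mode: float) -> list[float]:
    """Z-scores relative to `mode`, with spread estimated from lower values only."""
    left = [s for s in scores if s < mode]
    if not left:
        raise ValueError("no scores below the mode to estimate the spread from")
    # Spread of the (presumably negative) samples to the left of the mode.
    sigma = math.sqrt(sum((s - mode) ** 2 for s in left) / len(left))
    return [(s - mode) / sigma for s in scores]


# Invented scores: mostly negatives around 0 and one clear positive at 10.
z = one_sided_z_scores([-2.0, -1.0, 0.0, 1.0, 2.0, 10.0], mode=0.0)
```

Because the spread is estimated only from values below the mode, the positive samples on the right do not inflate it, and no knowledge about their number is used.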


  • Aarts, J., van Halteren, H., & Oostdijk, N. (1998). The linguistic annotation of corpora: The TOSCA analysis system. International Journal of Corpus Linguistics, 3(2), 189–210.

  • Ainsworth, J., & Juola, P. (2019). Who wrote this?: Modern forensic authorship analysis as a model for valid forensic science. Washington University Law Review, 96(5), 1161–1189.

  • Baayen, F. H., van Halteren, H., & Tweedie, F. (1996). Outside the cave of shadows: Using syntactic annotation to enhance authorship attribution. Literary and Linguistic Computing, 11(3), 121–132.

  • Bird, S., Klein, E., & Loper, E. (2009). Natural language processing with Python. O’Reilly Media Inc.

  • Benedetto, D., Caglioti, E., & Loreto, V. (2002). Language trees and zipping. Physical Review Letters, 88(4), 048702.

  • BNC Consortium. (2007). The British national corpus, v3 (BNC XML Edition). Distributed by Bodleian Libraries, University of Oxford, on behalf of the BNC Consortium.

  • Bönninghoff, B., Nickel, R. M., Zeiler, S., & Kolossa, D. (2019). Similarity Learning for Authorship Verification in Social Media. In 2019 IEEE International Conference on Acoustics, Speech and Signal Processing: Proceedings: May 12-17, 2019, Brighton Conference Centre, Brighton, United Kingdom. IEEE. 2457-2461. Retrieved from

  • Bönninghoff, B., Rupp, J., Nickel, R.M., & Kolossa, D. (2020). Deep Bayes factor scoring for authorship verification. Notebook for PAN at CLEF 2020. In CLEF 2020 Labs and Workshops, Notebook Papers. Retrieved from

  • Chang, C.-C., & Lin, C. J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(27), 1–27.

  • Covington, M. A., & McFall, J. D. (2010). Cutting the Gordian knot: The Moving-average type–token ratio (MATTR). Journal of Quantitative Linguistics, 17(2), 94–100.

  • Daelemans, W., Van Den Bosch, A., & Zavrel, J. (1999). Forgetting exceptions is harmful in language learning. Machine Learning, 34(1-3), 11–41.

  • Honnibal, M., & Montani, I. (2017). spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing.

  • Juola, P. (2008). Authorship attribution. Foundations and Trends® in Information Retrieval, 1(3), 233–334.

  • Koppel, M., Akiva, N., & Dagan, I. (2006). Feature instability as a criterion for selecting potential style markers. Journal of the American Society for Information Science and Technology, 57(11), 1519–1525.

  • Lutosławski, W. (1890). Principes de stylométrie. Revue des études grecques, 41, 61–81.

  • Ma, W., Liu, R., Wang, L., & Vosoughi, S. (2020). Towards improved model design for authorship identification: A survey on writing style understanding. Retrieved from arXiv:2009.14445

  • Manning, C. D., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S.J., & McClosky, D. (2014). The Stanford CoreNLP natural language processing toolkit. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations. Association for Computational Linguistics. 55-60. Retrieved from

  • Mosteller, F., & Wallace, D. L. (1964). Inference and disputed authorship: The Federalist. Addison-Wesley.

  • Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., & Chintala, S. (2019). PyTorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, & R. Garnett (Eds.), Advances in Neural Information Processing Systems, 32, 8024–8035.

  • Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.

  • Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3), 538–556.

  • Tweedie, F., & Baayen, R. H. (1998). How variable may a constant be? Measures of lexical richness in perspective. Computers and the Humanities, 32, 323–352.

  • Valla, L. (1439-1440). De falso credita et ementita Constantini Donatione declamatio. Retrieved from

  • van Halteren, H. (2019). Benchmarking author recognition systems for forensic application. Linguistic Evidence in Security, Law and Intelligence (LESLI) Journal, 3. Retrieved from

  • van Halteren, H., Baayen, R. H., Tweedie, F. J., Haverkort, M., & Neijt, A. (2005). New machine learning methods demonstrate the existence of a human stylome. Journal of Quantitative Linguistics, 12(1), 65–77.

  • van Halteren, H., van Hout, R., & Roumans, R. (2018). Tweet geography. Tweet based mapping of dialect features in Dutch Limburg. Computational Linguistics in the Netherlands Journal, 8, 138–162. Retrieved from

Author information

Corresponding author

Correspondence to Hans van Halteren.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter

Cite this chapter

van Halteren, H. (2022). Automatic Authorship Investigation. In: Guillén-Nieto, V., Stein, D. (eds) Language as Evidence. Palgrave Macmillan, Cham.

  • Publisher Name: Palgrave Macmillan, Cham

  • Print ISBN: 978-3-030-84329-8

  • Online ISBN: 978-3-030-84330-4

  • eBook Packages: Social Sciences, Social Sciences (R0)