Computers and the Humanities, Volume 32, Issue 5, pp 323–352

How Variable May a Constant be? Measures of Lexical Richness in Perspective

  • Fiona J. Tweedie
  • R. Harald Baayen


A well-known problem in the domain of quantitative linguistics and stylistics concerns the evaluation of the lexical richness of texts. Since the most obvious measure of lexical richness, the vocabulary size (the number of different word types), depends heavily on the text length (measured in word tokens), a variety of alternative measures has been proposed which are claimed to be independent of the text length. This paper has a threefold aim. Firstly, we have investigated to what extent these alternative measures are truly textual constants. We have observed that in practice all measures vary substantially and systematically with the text length. We also show that in theory, only three of these measures are truly constant or nearly constant. Secondly, we have studied the extent to which these measures tap into different aspects of lexical structure. We have found that there are two main families of constants, one measuring lexical richness and one measuring lexical repetition. Thirdly, we have considered to what extent these measures can be used to investigate questions of textual similarity between and within authors. We propose to carry out such comparisons by means of the empirical trajectories of texts in the plane spanned by the dimensions of lexical richness and lexical repetition, and we provide a statistical technique for constructing confidence intervals around the empirical trajectories of texts. Our results suggest that the trajectories tap into a considerable amount of authorial structure without, however, guaranteeing that spatial separation implies a difference in authorship.

Keywords: lexical statistics, Monte Carlo methods, vocabulary richness





Copyright information

© Kluwer Academic Publishers 1998

Authors and Affiliations

  • Fiona J. Tweedie (1)
  • R. Harald Baayen (2)
  1. University of Glasgow, United Kingdom
  2. Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands
