
How Variable May a Constant be? Measures of Lexical Richness in Perspective

Computers and the Humanities

Abstract

A well-known problem in the domain of quantitative linguistics and stylistics concerns the evaluation of the lexical richness of texts. Since the most obvious measure of lexical richness, the vocabulary size (the number of different word types), depends heavily on the text length (measured in word tokens), a variety of alternative measures has been proposed which are claimed to be independent of the text length. This paper has a threefold aim. Firstly, we have investigated to what extent these alternative measures are truly textual constants. We have observed that in practice all measures vary substantially and systematically with the text length. We also show that in theory, only three of these measures are truly constant or nearly constant. Secondly, we have studied the extent to which these measures tap into different aspects of lexical structure. We have found that there are two main families of constants, one measuring lexical richness and one measuring lexical repetition. Thirdly, we have considered to what extent these measures can be used to investigate questions of textual similarity between and within authors. We propose to carry out such comparisons by means of the empirical trajectories of texts in the plane spanned by the dimensions of lexical richness and lexical repetition, and we provide a statistical technique for constructing confidence intervals around the empirical trajectories of texts. Our results suggest that the trajectories tap into a considerable amount of authorial structure without, however, guaranteeing that spatial separation implies a difference in authorship.
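The dependence of vocabulary size on text length, and the idea of an empirical trajectory, can be illustrated with a minimal sketch. The abstract does not name the measures used for the two dimensions, so the choices here are assumptions made for illustration only: the number of word types V(N) stands in for lexical richness, and Yule's characteristic K, computed as 10^4 (Σ i² V(i, N) − N) / N² where V(i, N) is the number of types occurring i times among N tokens, stands in for lexical repetition. The function names `yule_k` and `trajectory` are hypothetical, and the sketch does not construct the confidence intervals the abstract mentions.

```python
from collections import Counter


def yule_k(tokens):
    """Yule's characteristic K for a list of word tokens (assumed repetition measure)."""
    n = len(tokens)
    if n == 0:
        raise ValueError("need at least one token")
    type_freqs = Counter(tokens)             # frequency of each word type
    spectrum = Counter(type_freqs.values())  # V(i, N): number of types occurring i times
    sum_sq = sum(i * i * v for i, v in spectrum.items())
    return 1e4 * (sum_sq - n) / (n * n)


def trajectory(tokens, n_points=20):
    """Return (N, V(N), K(N)) at roughly equally spaced prefix lengths N.

    Each tuple is one point of an empirical trajectory: how the type count
    and the repetition measure develop as more of the text is read.
    """
    total = len(tokens)
    points = []
    last_m = 0
    for k in range(1, n_points + 1):
        m = total * k // n_points
        if m == last_m:          # skip empty or repeated prefixes in short texts
            continue
        last_m = m
        prefix = tokens[:m]
        points.append((m, len(set(prefix)), yule_k(prefix)))
    return points


if __name__ == "__main__":
    text = "the cat sat on the mat and the dog sat on the cat"
    for n, v, k in trajectory(text.split(), n_points=4):
        print(n, v, round(k, 1))
```

Plotting the second and third components against each other for several texts gives one curve per text in a richness/repetition plane. If a measure were a true textual constant, its value would stay flat along the first component, which is the property the paper tests.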




Cite this article

Tweedie, F.J., Baayen, R.H. How Variable May a Constant be? Measures of Lexical Richness in Perspective. Computers and the Humanities 32, 323–352 (1998). https://doi.org/10.1023/A:1001749303137


  • DOI: https://doi.org/10.1023/A:1001749303137
