Abstract
A well-known problem in the domain of quantitative linguistics and stylistics concerns the evaluation of the lexical richness of texts. Since the most obvious measure of lexical richness, the vocabulary size (the number of different word types), depends heavily on the text length (measured in word tokens), a variety of alternative measures has been proposed which are claimed to be independent of the text length. This paper has a threefold aim. Firstly, we have investigated to what extent these alternative measures are truly textual constants. We have observed that in practice all measures vary substantially and systematically with the text length. We also show that in theory, only three of these measures are truly constant or nearly constant. Secondly, we have studied the extent to which these measures tap into different aspects of lexical structure. We have found that there are two main families of constants, one measuring lexical richness and one measuring lexical repetition. Thirdly, we have considered to what extent these measures can be used to investigate questions of textual similarity between and within authors. We propose to carry out such comparisons by means of the empirical trajectories of texts in the plane spanned by the dimensions of lexical richness and lexical repetition, and we provide a statistical technique for constructing confidence intervals around the empirical trajectories of texts. Our results suggest that the trajectories tap into a considerable amount of authorial structure without, however, guaranteeing that spatial separation implies a difference in authorship.
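The dependence of such measures on text length can be illustrated with a minimal sketch that tracks a text's trajectory over growing prefixes. It is not the procedure used in the paper: the function name `richness_trajectory`, the whitespace tokenisation, and the choice of the type-token ratio and Yule's K as stand-ins for "richness" and "repetition" are all assumptions made for illustration.

```python
from collections import Counter


def richness_trajectory(tokens, step):
    """Return (N, V, TTR, K) tuples at prefix lengths N = step, 2*step, ...

    N   : text length in word tokens
    V   : vocabulary size (number of distinct word types)
    TTR : type-token ratio V / N (illustrative richness measure)
    K   : Yule's K, 10^4 * (sum_i i^2 V(i, N) - N) / N^2
          (illustrative repetition measure; an assumption of this sketch)
    """
    counts = Counter()
    points = []
    for n, tok in enumerate(tokens, start=1):
        counts[tok] += 1
        # Record a point at every multiple of `step` and at the end of the text.
        if n % step == 0 or n == len(tokens):
            v = len(counts)
            # Summing f^2 over types equals summing i^2 * V(i, N) over frequencies i.
            sum_sq = sum(f * f for f in counts.values())
            k = 1e4 * (sum_sq - n) / (n * n)
            points.append((n, v, v / n, k))
    return points


# Hypothetical toy text; real studies would use texts of thousands of tokens.
text = "the cat sat on the mat and the dog sat on the log"
tokens = text.split()
for n, v, ttr, k in richness_trajectory(tokens, step=4):
    print(n, v, round(ttr, 3), round(k, 1))
```

Plotting the (TTR, K) pairs for successive values of N gives one empirical trajectory per text. Even in this toy example both values change from one prefix to the next, which is the kind of length dependence the abstract describes.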
Cite this article
Tweedie, F.J., Baayen, R.H. How Variable May a Constant be? Measures of Lexical Richness in Perspective. Computers and the Humanities 32, 323–352 (1998). https://doi.org/10.1023/A:1001749303137