How Variable May a Constant be? Measures of Lexical Richness in Perspective

Tweedie, Fiona J.; Baayen, R. Harald

doi:10.1023/A:1001749303137

Published: September 1998

Volume 32, pages 323–352, (1998)
Computers and the Humanities

Abstract

A well-known problem in the domain of quantitative linguistics and stylistics concerns the evaluation of the lexical richness of texts. Since the most obvious measure of lexical richness, the vocabulary size (the number of different word types), depends heavily on the text length (measured in word tokens), a variety of alternative measures has been proposed which are claimed to be independent of the text length. This paper has a threefold aim. Firstly, we have investigated to what extent these alternative measures are truly textual constants. We have observed that in practice all measures vary substantially and systematically with the text length. We also show that in theory, only three of these measures are truly constant or nearly constant. Secondly, we have studied the extent to which these measures tap into different aspects of lexical structure. We have found that there are two main families of constants, one measuring lexical richness and one measuring lexical repetition. Thirdly, we have considered to what extent these measures can be used to investigate questions of textual similarity between and within authors. We propose to carry out such comparisons by means of the empirical trajectories of texts in the plane spanned by the dimensions of lexical richness and lexical repetition, and we provide a statistical technique for constructing confidence intervals around the empirical trajectories of texts. Our results suggest that the trajectories tap into a considerable amount of authorial structure without, however, guaranteeing that spatial separation implies a difference in authorship.
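
The dependence on text length described in the abstract can be illustrated with a minimal sketch. It assumes four commonly cited measures (the type-token ratio, Guiraud's R, Herdan's C and Yule's K) in their textbook forms; whether these match the exact definitions used in the article is an assumption, and the function names `lexical_measures` and `trajectory` are hypothetical. The sketch does not implement the confidence-interval technique the abstract mentions.

```python
import math
from collections import Counter


def lexical_measures(tokens):
    """Compute a few textbook lexical measures for a list of word tokens."""
    n = len(tokens)  # text length N in tokens
    if n == 0:
        raise ValueError("tokens must not be empty")
    freqs = Counter(tokens)  # frequency of each word type
    v = len(freqs)  # vocabulary size V(N): number of distinct types
    # Frequency spectrum: how many types occur exactly i times.
    spectrum = Counter(freqs.values())
    sum_i2 = sum(i * i * count for i, count in spectrum.items())
    return {
        "N": n,
        "V": v,
        # Type-token ratio V/N (a richness-type measure).
        "TTR": v / n,
        # Guiraud's R = V / sqrt(N).
        "Guiraud_R": v / math.sqrt(n),
        # Herdan's C = log V / log N; undefined for a one-token text.
        "Herdan_C": math.log(v) / math.log(n) if n > 1 else float("nan"),
        # Yule's K = 10^4 * (sum_i i^2 V(i, N) - N) / N^2 (a repetition-type measure).
        "Yule_K": 1e4 * (sum_i2 - n) / (n * n),
    }


def trajectory(tokens, steps=5):
    """Evaluate the measures on growing prefixes of the text.

    Each prefix gives one point of an empirical trajectory, showing how a
    supposed constant moves as the text length increases.
    """
    n = len(tokens)
    points = []
    for k in range(1, steps + 1):
        cut = n * k // steps
        if cut > 0:
            points.append(lexical_measures(tokens[:cut]))
    return points
```

Plotting one richness-type value against one repetition-type value for each prefix, for example `Guiraud_R` against `Yule_K`, gives a rough picture of a text's trajectory in a plane of the kind the abstract describes.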

Author information

Authors and Affiliations

University of Glasgow, United Kingdom
Fiona J. Tweedie
Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands
R. Harald Baayen

About this article

Cite this article

Tweedie, F.J., Baayen, R.H. How Variable May a Constant be? Measures of Lexical Richness in Perspective. Computers and the Humanities 32, 323–352 (1998). https://doi.org/10.1023/A:1001749303137

Issue Date: September 1998
DOI: https://doi.org/10.1023/A:1001749303137
