Statistical Models of Language Use
An oft-cited model of word frequency in natural language is the Zipf distribution, which some hold to explain, through a principle of “least effort”, why common words are by and large shorter than rare ones. It is beginning to re-emerge as a model of artificial language too, for example of command usage in computer systems. However, it has been established that Zipf’s law is very easily produced by simple random processes. This paper examines random models of both word and letter production, derives the associated rank/frequency relationships, and compares them with those found in naturally occurring English text. The results show that Zipf’s distribution can arise from purely random sources, and they call into question interpretations of observed hyperbolic rank/frequency distributions as manifestations of purposeful, or even evolutionary, behavior.
Keywords: Word Frequency, Random Model, Space Character, Alphabet Size, Zipf Distribution
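The claim that a purely random source can yield a Zipf-like rank/frequency curve can be illustrated with a small simulation. The sketch below is not the paper's own model or derivation: it assumes the simplest random letter process, in which each keystroke is a space with probability `p_space` and otherwise a letter drawn uniformly from `alphabet`. The function names, the parameter values, and the least-squares fit on log-log axes are all illustrative assumptions.

```python
import math
import random
from collections import Counter


def random_text_word_counts(n_chars, alphabet="abcd", p_space=0.2, seed=0):
    """Count the 'words' produced by a random keystroke source.

    Assumed model (illustrative only): each keystroke is a space with
    probability p_space, otherwise a uniformly chosen letter. Runs of
    letters between spaces are treated as words; empty runs are ignored.
    """
    rng = random.Random(seed)
    counts = Counter()
    word = []
    for _ in range(n_chars):
        if rng.random() < p_space:
            if word:
                counts["".join(word)] += 1
                word = []
        else:
            word.append(rng.choice(alphabet))
    return counts


def rank_frequency(counts):
    """Return (rank, relative frequency) pairs, most frequent word first."""
    total = sum(counts.values())
    ordered = counts.most_common()  # sorted by descending count
    return [(rank, c / total) for rank, (_, c) in enumerate(ordered, start=1)]


def loglog_slope(pairs):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(r) for r, _ in pairs]
    ys = [math.log(f) for _, f in pairs]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


if __name__ == "__main__":
    counts = random_text_word_counts(200_000)
    pairs = rank_frequency(counts)
    print("distinct words:", len(pairs))
    print("top ranks:", pairs[:5])
    print("log-log slope:", loglog_slope(pairs))
```

Under this assumed model, all words of the same length are equally likely, so the rank/frequency curve is a staircase whose steps fall roughly along a straight line on log-log axes. A slope near or somewhat below -1 is the hyperbolic behavior the abstract refers to. Varying `alphabet` and `p_space` shows how alphabet size and the space probability affect the curve.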