Statistical Models of Language Use

  • Ian H. Witten
  • Timothy C. Bell


An oft-cited model of word frequency in natural language is the Zipf distribution, which some hold to explain, through a principle of “least effort”, why common words are by and large shorter than rare ones. It is beginning to re-emerge as a model of artificial language use too, for example of command usage in computer systems. However, it has been established that Zipf’s law is very easily achieved by simple random processes. This paper examines random models of both word and letter production, derives the associated rank/frequency relationships, and compares them with those found in naturally occurring English text. The results show that Zipf’s distribution arises from purely random sources, and call into question interpretations of observed hyperbolic rank/frequency distributions as manifestations of purposeful, or even evolutionary, behavior.
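The claim that a purely random source can yield a hyperbolic rank/frequency curve can be explored with a small simulation. What follows is a minimal sketch of a random letter-production model, not the authors' own formulation: symbols are typed independently, a space ends a word, and the resulting words are ranked by frequency. The alphabet, the space probability `p_space`, the seed, and all function names are illustrative assumptions.

```python
import math
import random
from collections import Counter


def random_words(n_chars, alphabet="abcde", p_space=0.2, seed=0):
    """Type n_chars symbols at random and return the words between spaces.

    Assumption: each keystroke is a space with probability p_space,
    otherwise a letter drawn uniformly from `alphabet`.
    """
    rng = random.Random(seed)
    chars = []
    for _ in range(n_chars):
        if rng.random() < p_space:
            chars.append(" ")
        else:
            chars.append(rng.choice(alphabet))
    # split() with no argument drops the empty words between repeated spaces
    return "".join(chars).split()


def rank_frequency(words):
    """Return word counts sorted from most to least frequent (rank 1 first)."""
    return sorted(Counter(words).values(), reverse=True)


def loglog_slope(freqs):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


words = random_words(200_000)
freqs = rank_frequency(words)
slope = loglog_slope(freqs)
print(f"{len(words)} words, {len(freqs)} distinct, log-log slope {slope:.2f}")
```

A negative slope on the log-log scale indicates that frequency falls off with rank in a roughly hyperbolic way even though no purposeful behavior is involved. The fitted value depends on the alphabet size, the space probability, and how the long tail of once-only words is treated, so it should be read as indicative only.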


Keywords (added by machine, not by the authors): Word Frequency, Random Model, Space Character, Alphabet Size, Zipf Distribution





Copyright information

© Plenum Press, New York 1990

Authors and Affiliations

  • Ian H. Witten, Department of Computer Science, University of Calgary, Calgary, Canada
  • Timothy C. Bell, Department of Computer Science, University of Canterbury, Christchurch 1, New Zealand
