Statistical Laws in Linguistics

Part of the Lecture Notes in Morphogenesis book series (LECTMORPH)

Abstract

Zipf’s law is just one of many universal laws proposed to describe statistical regularities in language. Here we review and critically discuss how these laws can be statistically interpreted, fitted, and tested (falsified). The modern availability of large databases of written text allows for tests with unprecedented statistical accuracy and also for a characterization of the fluctuations around the typical behavior. We find that fluctuations are usually much larger than expected based on simplifying statistical assumptions (e.g., independence and lack of correlations between observations). These simplifications also appear in the usual statistical tests, so that the large fluctuations can be erroneously interpreted as a falsification of the law. Instead, here we argue that linguistic laws are only meaningful (falsifiable) if accompanied by a model for which the fluctuations can be computed (e.g., a generative model of the text). The large fluctuations we report show that the constraints imposed by linguistic laws on the creative process of text generation are not as tight as one could expect.

Keywords

  • Frequent Word
  • Null Model
  • Independence Assumption
  • Text Generation
  • General Entropy Maximization

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Notes

  1. While some of the laws clearly intend to speak about the language as a whole, in practice they are tested and motivated by observations in specific texts which are thus implicitly or explicitly assumed to reflect the language as a whole.

  2. A relative of Maxwell’s Daemon known from Thermodynamics.

References

  1. Herdan, G.: Quantitative Linguistics. Butterworth Press, Oxford (1964)

  2. Zipf, G.K.: The Psycho-Biology of Language. Routledge, London (1936). Id., Human Behavior and the Principle of Least Effort. Addison-Wesley Press, Oxford (1949)

  3. Köhler, R., Altmann, G., Piotrowski, R.G. (eds.): Quantitative Linguistik. Ein internationales Handbuch. Quantitative Linguistics. An international Handbook. (=HSK27). de Gruyter, Berlin (2005)

  4. Köhler, R., Altmann, G., Grzybek, P. (eds.): Quantitative Linguistics. De Gruyter Mouton. www.degruyter.com/view/serial/35295. Accessed 6 Feb 2015

  5. Glottopedia: the free encyclopedia of linguistics. http://www.glottopedia.org/index.php/Laws. Accessed 17 Dec 2014

  6. Encyclopedia entry: laws in quantitative linguistics. http://lql.uni-trier.de. Accessed 3 Dec 2014

  7. Baayen, R.H.: Word Frequency Distributions. Kluwer Academic Publishers, Dordrecht (2001)

  8. Zanette, D.H.: Statistical patterns in written language (2014). arXiv:1412.3336

  9. Barbieri, G., Pachet, F., Roy, P., Degli Esposti, M.: Markov constraints for generating lyrics with style. In: 20th European Conference on Artificial Intelligence – ECAI, IOS Press, Amsterdam (2012)

  10. Mitzenmacher, M.: A brief history of generative models for power law and lognormal distributions. Internet Math. 1, 226 (2004)

  11. Newman, M.E.J.: Power laws, Pareto distributions and Zipf’s law. Contemp. Phys. 46, 323 (2005)

  12. Mandelbrot, B.: On the theory of word frequencies and on related Markovian models of discourse. In: Structure of Language and Its Mathematical Aspects: Proceedings of Symposia in Applied Mathematics, vol. XII. American Mathematical Society, Providence (1961)

  13. Altmann, G.: Prolegomena to Menzerath’s law. Glottometrika 2, 1 (1980)

  14. Cramer, I.: The parameters of the Altmann-Menzerath law. J. Quant. Linguist. 12, 41 (2005)

  15. Egghe, L.: Untangling Herdan’s law and Heaps’ law: mathematical and informetric arguments. J. Am. Soc. Inf. Sci. Technol. 58, 702 (2007)

  16. Simon, H.A.: On a class of skew distribution functions. Biometrika 42, 425 (1955)

  17. Li, W.: Zipf’s law everywhere. Glottometrics 5, 14 (2002)

  18. Zanette, D., Montemurro, M.: Dynamics of text generation with realistic Zipf’s distribution. J. Quant. Linguist. 12, 29 (2005)

  19. Piantadosi, S.T.: Zipf’s word frequency law in natural language: a critical review and future directions. Psychon. Bull. Rev. 21, 1112 (2014)

  20. Lü, L., Zhang, Z.-K., Zhou, T.: Zipf’s law leads to Heaps’ law: analyzing their relation in finite-size systems. PLOS One 5, e14139 (2010)

  21. Petersen, A.M., Tenenbaum, J.N., Havlin, S., Stanley, H.E., Perc, M.: Languages cool as they expand: allometric scaling and the decreasing need for new words. Sci. Rep. 2, 943 (2012)

  22. Gerlach, M., Altmann, E.G.: Stochastic model for the vocabulary growth in natural languages. Phys. Rev. X 3, 021006 (2013)

  23. Font-Clos, F., Boleda, G., Corral, A.: A scaling law beyond Zipf’s law and its relation to Heaps’ law. New J. Phys. 15(9), 093033 (2013)

  24. Gerlach, M., Altmann, E.G.: Scaling laws and fluctuations in the statistics of word frequencies. New J. Phys. 16, 113010 (2014)

  25. Altmann, E.G., Pierrehumbert, J.B., Motter, A.E.: Beyond word frequency: bursts, lulls, and scaling in the temporal distributions of words. PLOS One 4, e7678 (2009)

  26. Corral, A., Ferrer-i-Cancho, R., Boleda, G., Diaz-Guilera, A.: Universal complex structures in written language. arXiv:0901.2924

  27. Lijffijt, J., Papapetrou, P., Puolamäki, K., Mannila, H.: Analyzing word frequencies in large text corpora using inter-arrival times and bootstrapping. Machine Learning and Knowledge Discovery in Databases. Lecture Notes in Computer Science, vol. 6912, p. 341. Springer, Berlin (2011)

  28. Damerau, F.J., Mandelbrot, B.: Tests of the degree of word clustering in samples of written English. Linguistics 102, 58–72 (1973)

  29. Schenkel, A., Zhang, J., Zhang, Y.: Long range correlation in human writings. Fractals 1, 47 (1993)

  30. Altmann, E.G., Cristadoro, G., Degli Esposti, M.: On the origin of long-range correlations in texts. PNAS 109, 11582 (2012)

  31. Ebeling, W., Pöschel, T.: Entropy and long-range correlations in literary English. Europhys. Lett. 26, 24 (1994)

  32. Debowski, L.: On Hilberg’s law and its links with Guiraud’s law. J. Quant. Linguist. 13, 81–109 (2006)

  33. Piantadosi, S.T., Tily, H., Gibson, E.: Word lengths are optimized for efficient communication. PNAS 108, 3526 (2011)

  34. Solé, R.V., Corominas-Murtra, B., Valverde, S., Steels, L.: Language networks: their structure, function and evolution. Complexity 15, 20 (2009)

  35. Choudhury, M., Mukherjee, A.: The structure and dynamics of linguistic networks. Dynamics On and Of Complex Networks, pp. 145–166. Springer, Boston (2009)

  36. Baronchelli, A., Ferrer-i-Cancho, R., Pastor-Satorras, R., Chater, N., Christiansen, M.H.: Networks in cognitive science. Trends Cogn. Sci. 17, 348 (2013)

  37. Cong, J., Liu, H.: Approaching human language with complex networks. Phys. Life Rev. 11, 598 (2014)

  38. Constrained writing, in Wikipedia. http://en.wikipedia.org/wiki/Constrained_writing. Accessed 3 Dec 2014

  39. Benford, F.: The law of anomalous numbers. Proc. Am. Philos. Soc. 78, 551 (1938)

  40. Main, I.G., Li, L., McCloskey, J., Naylor, M.: Effect of the Sumatran mega-earthquake on the global magnitude cut-off and event rate. Nat. Geosci. 1, 142 (2008)

  41. Amancio, D.R., Altmann, E.G., Rybski, D., Oliveira Jr., O.N., Costa, L.D.F.: Probing the statistical properties of unknown texts: application to the Voynich manuscript. PLOS One 8, e67310 (2013)

  42. Febres, G., Jaffé, K., Gershenson, C.: Complexity measurement of natural and artificial languages. Complexity (2014). doi:10.1002/cplx.21529

  43. Bernhardsson, S., da Rocha, L.E.C., Minnhagen, P.: Size-dependent word frequencies and translational invariance of books. Phys. A 389, 330 (2010)

  44. Williams, J.R., Bagrow, J.P., Danforth, C.M., Dodds, P.S.: Text mixing shapes the anatomy of rank-frequency distributions: a modern Zipfian mechanics for natural language (2014). arXiv:1409.3870

  45. Baixeries, J., Elvevag, B., Ferrer-i-Cancho, R.: The evolution of the exponent of Zipf’s law in language ontogeny. PLOS One 8, e53227 (2013)

  46. Jäger, G.: Power laws and other heavy-tailed distribution in linguistic typology. Adv. Compl. Syst. 15, 1150019 (2012)

  47. Ferrer-i-Cancho, R., Elvevag, B.: Random texts do not exhibit the real Zipf’s law-like rank distribution. PLOS One 5, e9411 (2010)

  48. Corominas-Murtra, B., Fortuny, J., Solé, R.V.: Emergence of Zipf’s law in the evolution of communication. Phys. Rev. E 83, 036115 (2011)

  49. Ferrer-i-Cancho, R.: Optimization models of natural communication (2014). arXiv:1412.2486

  50. Marsili, M., Mastromatteo, I., Roudi, Y.: On sampling and modeling complex systems. J. Stat. Mech. 2013, P09003 (2013)

  51. Peterson, J., Dixit, P.D., Dill, K.: A maximum entropy framework for nonexponential distributions. PNAS 110, 20380 (2013)

  52. Goldstein, M.L., Morris, S.A., Yen, G.G.: Problems with fitting to the power-law distribution. Eur. Phys. J. B 41, 255–258 (2004)

  53. Bauke, H.: Parameter estimation for power-law distributions by maximum likelihood methods. Eur. Phys. J. B 58, 167–173 (2007)

  54. Clauset, A., Shalizi, C.R., Newman, M.E.J.: Power-law distributions in empirical data. SIAM Rev. 51, 661–703 (2009)

  55. Deluca, A., Corral, A.: Fitting and goodness-of-fit test of non-truncated and truncated power-law distributions. Acta Geophys. 61, 1351–1394 (2013)

  56. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning. Springer, New York (2009)

  57. Burnham, K.P., Anderson, D.R.: Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. Springer, New York (2002)

  58. Akaike, H.: A new look at the statistical model identification. IEEE Trans. Autom. Control 19, 716–723 (1974)

  59. Kass, R.E., Raftery, A.E.: Bayes factors. J. Am. Stat. Assoc. 90, 773–795 (1995)

  60. Grünwald, P.D.: Minimum Description Length Principle. MIT Press, Cambridge (2007)

  61. Jaynes, E.T.: Probability Theory: The Logic of Science. Oxford University Press, Oxford (2003)

  62. Günther, R., Levitin, L., Schapiro, B., Wagner, P.: Zipf’s law and the effect of ranking on probability distributions. Int. J. Theor. Phys. 35, 395 (1996)

  63. Cristelli, M., Batty, M., Pietronero, L.: There is more than a power law in Zipf. Sci. Rep. 2, 812 (2012)

  64. Stumpf, M.P.H., Porter, M.A.: Critical truths about power laws. Science 335, 665–666 (2012)

  65. Weiss, M.S.: Modification of the Kolmogorov-Smirnov statistic for use with correlated data. J. Am. Stat. Assoc. 73, 872–875 (1978)

  66. Chicheportiche, R., Bouchaud, J.-P.: Goodness-of-fit tests with dependent observations. J. Stat. Mech.: Theory Exp. 2011, P09003 (2011)

  67. Serrano, M.A., Flammini, A., Menczer, F.: Modeling statistical properties of written text. PLOS One 4, e5372 (2009)

  68. Eisler, Z., Bartos, I., Kertész, J.: Fluctuation scaling in complex systems: Taylor’s law and beyond. Adv. Phys. 57, 89–142 (2008)

  69. Louf, R., Barthelemy, M.: Scaling: lost in the smog. Environ. Plan. B: Plan. Des. 41, 767 (2014)

    CrossRef  Google Scholar 

Acknowledgments

We thank A. Corral, A. Deluca, R. Ferrer-i-Cancho, F. Font-Clos, and R. Guimerá for insightful discussions.

Author information

Correspondence to Eduardo G. Altmann.

Appendix

The books listed in Table 3 were obtained from Project Gutenberg (http://www.gutenberg.org). The books and the data filtering are the same as those used in Ref. [30] (see the Supplementary Information of that paper for further details). We removed capitalization and all symbols except the letters “a–z”, the digits “0–9”, the apostrophe, and the blank space. A string of symbols between two consecutive blank spaces was considered to be a word.
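The book filtering described above could be sketched in Python roughly as follows. This is an illustration only, not the authors' code: the function name `tokenize_book` is hypothetical, and two details that the text leaves open are assumptions here, namely that line breaks and tabs count as blank spaces and that disallowed symbols are deleted outright rather than replaced by a blank.

```python
import re

# Symbols kept after lowercasing: letters a-z, digits 0-9, apostrophe, blank.
_DISALLOWED = re.compile(r"[^a-z0-9' ]")
_WHITESPACE = re.compile(r"\s+")


def tokenize_book(text):
    """Return the list of word tokens of a raw book text (hypothetical helper)."""
    lowered = text.lower()                  # remove capitalization
    spaced = _WHITESPACE.sub(" ", lowered)  # assumption: newlines/tabs act as blanks
    cleaned = _DISALLOWED.sub("", spaced)   # assumption: other symbols are deleted
    return cleaned.split()                  # words are the strings between blanks


tokens = tokenize_book("Don't panic, it's 1851!\nCall me Ishmael.")
```

Under these assumptions a hyphenated or dash-joined expression collapses into a single token, so the choice between deleting a symbol and replacing it with a blank affects the resulting vocabulary.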

The English Wikipedia data was obtained from Wikimedia dumps (http://dumps.wikimedia.org/). The filtering was the same as the one used in Ref. [24], in which we removed capitalization and kept only those words (i.e., sequences of symbols separated by blank space) which consisted exclusively of the letters “a–z” and the apostrophe.
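The Wikipedia filter differs from the book filter in that whole words are discarded instead of individual symbols. A minimal sketch, assuming that words are split on any whitespace and that a word with attached punctuation or digits is dropped entirely (one reading of "kept only those words"); the name `filter_wikipedia_words` is hypothetical:

```python
import re

# A word is kept only if it consists exclusively of a-z and the apostrophe.
_VALID_WORD = re.compile(r"[a-z']+")


def filter_wikipedia_words(text):
    """Lowercase, split on whitespace, keep only purely alphabetic words."""
    words = text.lower().split()
    return [w for w in words if _VALID_WORD.fullmatch(w)]


kept = filter_wikipedia_words("The year 1851 saw Moby-Dick, don't you know")
```

In this reading, "1851" and "moby-dick," are both removed, while "don't" survives because the apostrophe is allowed.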

The computation of the Menzerath–Altmann law appearing in Figs. 1, 2, and Table 2 started from the unique words (word types) in the databases discussed in the previous paragraphs. For each word w we applied the following steps:

  1. Lemmatize using the WordNetLemmatizer of the NLTK Python package (http://www.nltk.org/), which is based on WordNet (http://wordnet.princeton.edu).

  2. Count the number of syllables \(x_w\) based on the Moby Hyphenation List by Grady Ward, available at http://www.gutenberg.org/ebooks/3204.

  3. Count the number of phonemes \(z_w\) based on The CMU Pronouncing Dictionary, version 0.7b, available at www.speech.cs.cmu.edu/cgi-bin/cmudict.
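The three steps above can be sketched as a small lookup pipeline. This is a hedged illustration rather than the authors' implementation: the function names, the toy tables, and the choice to look up the lemma (rather than the original word form) in both resources are assumptions, and the toy entries merely imitate the style of the hyphenation list and the pronouncing dictionary. Words missing from either table are skipped, which is what limits the coverage.

```python
def measure_word(word, lemmatize, syllable_table, phoneme_table):
    """Return (lemma, x_w, z_w), or None if either lookup fails."""
    lemma = lemmatize(word)                  # step 1: lemmatize
    syllables = syllable_table.get(lemma)    # step 2: syllables of the lemma
    phonemes = phoneme_table.get(lemma)      # step 3: phonemes of the lemma
    if syllables is None or phonemes is None:
        return None
    return lemma, len(syllables), len(phonemes)


def measure_vocabulary(word_types, lemmatize, syllable_table, phoneme_table):
    """Map each measurable word type to (x_w, z_w) and report the coverage."""
    results = {}
    for w in word_types:
        measured = measure_word(w, lemmatize, syllable_table, phoneme_table)
        if measured is not None:
            results[w] = measured[1:]
    coverage = len(results) / len(word_types) if word_types else 0.0
    return results, coverage


# Toy stand-ins (hypothetical data, for illustration only).
TOY_LEMMAS = {"whales": "whale", "ships": "ship"}
TOY_SYLLABLES = {"whale": ["whale"], "captain": ["cap", "tain"]}
TOY_PHONEMES = {"whale": ["W", "EY1", "L"],
                "captain": ["K", "AE1", "P", "T", "AH0", "N"]}


def toy_lemmatize(word):
    return TOY_LEMMAS.get(word, word)


results, coverage = measure_vocabulary(
    ["whales", "captain", "ships", "zzyzx"],
    toy_lemmatize, TOY_SYLLABLES, TOY_PHONEMES)
```

With NLTK installed and its WordNet data downloaded, `toy_lemmatize` could be replaced by `WordNetLemmatizer().lemmatize`; passing the lemmatizer and the tables as arguments keeps the counting logic independent of how the resources are loaded.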

For the book Moby Dick by H. Melville, this procedure allowed us to compute \(x_w\) and \(z_w\) for 11,595 words, \(66\,\%\) of the total number of words (before lemmatization). For Wikipedia, we obtained 60,749 words, \(1.7\,\%\) of the total number. The low success rate for Wikipedia is due to the size of the database (a large number of rare words), and the results depend more strongly on the procedure described above than on the database itself.

Copyright information

© 2016 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Altmann, E.G., Gerlach, M. (2016). Statistical Laws in Linguistics. In: Degli Esposti, M., Altmann, E., Pachet, F. (eds) Creativity and Universality in Language. Lecture Notes in Morphogenesis. Springer, Cham. https://doi.org/10.1007/978-3-319-24403-7_2

  • DOI: https://doi.org/10.1007/978-3-319-24403-7_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-24401-3

  • Online ISBN: 978-3-319-24403-7

  • eBook Packages: Social Sciences (R0)