Statistical Laws in Linguistics

  • Eduardo G. Altmann
  • Martin Gerlach
Part of the Lecture Notes in Morphogenesis book series (LECTMORPH)

Abstract

Zipf’s law is just one of many universal laws proposed to describe statistical regularities in language. Here we review and critically discuss how these laws can be statistically interpreted, fitted, and tested (falsified). The modern availability of large databases of written text allows for tests with unprecedented statistical accuracy and also for a characterization of the fluctuations around the typical behavior. We find that fluctuations are usually much larger than expected based on simplifying statistical assumptions (e.g., independence and lack of correlations between observations). These simplifications also appear in the usual statistical tests, so that the large fluctuations can be erroneously interpreted as a falsification of the law. Instead, we argue here that linguistic laws are only meaningful (falsifiable) if accompanied by a model for which the fluctuations can be computed (e.g., a generative model of the text). The large fluctuations we report show that the constraints imposed by linguistic laws on the creative process of text generation are not as tight as one could expect.
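
To make the role of the independence assumption concrete, the following is a minimal sketch, not the authors' procedure. It assumes a rank-frequency law of the form f(r) ∝ r^(-γ) on a finite vocabulary, draws word tokens independently, fits γ by maximum likelihood on a grid, and uses a parametric bootstrap to estimate how much the fitted exponent would fluctuate if tokens really were independent. All names (`zipf_pmf`, `fit_gamma`, `sample_counts`), the vocabulary size, the sample size, and the grid are illustrative assumptions. For simplicity, counts are indexed by the true rank rather than by the empirical rank.

```python
import numpy as np

def zipf_pmf(gamma, vocab_size):
    """Normalized Zipf probabilities p(r) ~ r^(-gamma) for ranks 1..vocab_size."""
    ranks = np.arange(1, vocab_size + 1, dtype=float)
    weights = ranks ** (-gamma)
    return weights / weights.sum()

def log_likelihood(gamma, counts):
    """Multinomial log-likelihood (up to a constant) of counts under the Zipf pmf."""
    pmf = zipf_pmf(gamma, len(counts))
    return float(np.sum(counts * np.log(pmf)))

def fit_gamma(counts, grid=None):
    """Maximum-likelihood exponent, found by a simple grid search (an assumption)."""
    if grid is None:
        grid = np.linspace(0.5, 2.0, 301)
    lls = [log_likelihood(g, counts) for g in grid]
    return float(grid[int(np.argmax(lls))])

def sample_counts(gamma, vocab_size, n_tokens, rng):
    """Word counts for n_tokens drawn independently from the Zipf pmf."""
    return rng.multinomial(n_tokens, zipf_pmf(gamma, vocab_size))

rng = np.random.default_rng(0)
true_gamma, vocab_size, n_tokens = 1.0, 1000, 50_000

# Synthetic "text": independent tokens, i.e. exactly the simplifying assumption.
counts = sample_counts(true_gamma, vocab_size, n_tokens, rng)
gamma_hat = fit_gamma(counts)

# Parametric bootstrap: spread of the fitted exponent under the independent null.
boot = [fit_gamma(sample_counts(gamma_hat, vocab_size, n_tokens, rng))
        for _ in range(20)]
print(f"fitted gamma = {gamma_hat:.3f}, iid bootstrap std = {np.std(boot):.4f}")
```

The bootstrap spread here is the fluctuation expected under independence. If the exponents fitted to real texts of the same length scatter far more widely than this, that alone does not falsify the law; it may only indicate that the independence assumption behind the null model does not hold.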

Keywords

Frequent Word; Null Model; Independence Assumption; Text Generation; General Entropy Maximization
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Notes

Acknowledgments

We thank A. Corral, A. Deluca, R. Ferrer-i-Cancho, F. Font-Clos, and R. Guimerá for insightful discussions.

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. Max Planck Institute for the Physics of Complex Systems, Dresden, Germany
