Abstract
Zipf’s law is just one of many universal laws proposed to describe statistical regularities in language. Here we review and critically discuss how these laws can be statistically interpreted, fitted, and tested (falsified). The modern availability of large databases of written text allows for tests with unprecedented statistical accuracy and also for a characterization of the fluctuations around the typical behavior. We find that fluctuations are usually much larger than expected based on simplifying statistical assumptions (e.g., independence and lack of correlations between observations). These simplifications also underlie the usual statistical tests, so that the large fluctuations can be erroneously interpreted as a falsification of the law. Instead, we argue here that linguistic laws are only meaningful (falsifiable) if accompanied by a model for which the fluctuations can be computed (e.g., a generative model of the text). The large fluctuations we report show that the constraints imposed by linguistic laws on the creative process of text generation are not as tight as one could expect.
Notes
1. While some of the laws clearly intend to speak about the language as a whole, in practice they are tested and motivated by observations in specific texts which are thus implicitly or explicitly assumed to reflect the language as a whole.
2. A relative of Maxwell’s Daemon known from Thermodynamics.
References
Herdan, G.: Quantitative Linguistics. Butterworth Press, Oxford (1964)
Zipf, G.K.: The Psycho-Biology of Language. Routledge, London (1936). Id., Human Behavior and the Principle of Least Effort. Addison-Wesley Press, Oxford (1949)
Köhler, R., Altmann, G., Piotrowski, R.G. (eds.): Quantitative Linguistik. Ein internationales Handbuch. Quantitative Linguistics. An international Handbook. (=HSK27). de Gruyter, Berlin (2005)
Köhler, R., Altmann, G., Grzybek, P. (eds.): Quantitative Linguistics, De Gruyter Mouton. www.degruyter.com/view/serial/35295. Accessed 6 Feb 2015
Glottopedia: the free encyclopedia of linguistics. http://www.glottopedia.org/index.php/Laws. Accessed 17 Dec 2014
Encyclopedia entry: laws in quantitative linguistics. http://lql.uni-trier.de. Accessed 3 Dec 2014
Baayen, R.H.: Word Frequency Distributions. Kluwer Academic Publishers, Dordrecht (2001)
Zanette, D.H.: Statistical patterns in written language (2014). arXiv:1412.3336
Barbieri, G., Pachet, F., Roy, P., Degli Esposti, M.: Markov constraints for generating lyrics with style. In: 20th European Conference on Artificial Intelligence – ECAI, IOS Press, Amsterdam (2012)
Mitzenmacher, M.: A brief history of generative models for power law and lognormal distributions. Internet Math. 1, 226 (2004)
Newman, M.E.J.: Power laws, Pareto distributions and Zipf’s law. Contemp. Phys. 46, 323 (2005)
Mandelbrot, B.: On the theory of word frequencies and on related Markovian models of discourse. In: Structure of Language and Its Mathematical Aspects: Proceedings of Symposia in Applied Mathematics, vol. XII. American Mathematical Society, Providence (1961)
Altmann, G.: Prolegomena to Menzerath’s law. Glottometrika 2, 1 (1980)
Cramer, I.: The parameters of the Altmann-Menzerath law. J. Quant. Linguist. 12, 41 (2005)
Egghe, L.: Untangling Herdan’s law and Heaps’ law: mathematical and informetric arguments. J. Am. Soc. Inf. Sci. Technol. 58, 702 (2007)
Simon, H.A.: On a class of skew distribution functions. Biometrika 42, 425 (1955)
Li, W.: Zipf’s law everywhere. Glottometrics 5, 14 (2002)
Zanette, D., Montemurro, M.: Dynamics of text generation with realistic Zipf’s distribution. J. Quant. Linguist. 12, 29 (2005)
Piantadosi, S.T.: Zipf’s word frequency law in natural language: a critical review and future directions. Psychon. Bull. Rev. 21, 1112 (2014)
Lü, L., Zhang, Z.-K., Zhou, T.: Zipf’s law leads to Heaps’ law: analyzing their relation in finite-size systems. PLOS One 5, e14139 (2010)
Petersen, A.M., Tenenbaum, J.N., Havlin, S., Stanley, H.E., Perc, M.: Languages cool as they expand: allometric scaling and the decreasing need for new words. Sci. Rep. 2, 943 (2012)
Gerlach, M., Altmann, E.G.: Stochastic model for the vocabulary growth in natural languages. Phys. Rev. X 3, 021006 (2013)
Font-Clos, F., Boleda, G., Corral, A.: A scaling law beyond Zipf’s law and its relation to Heaps’ law. New J. Phys. 15(9), 093033 (2013)
Gerlach, M., Altmann, E.G.: Scaling laws and fluctuations in the statistics of word frequencies. New J. Phys. 16, 113010 (2014)
Altmann, E.G., Pierrehumbert, J.B., Motter, A.E.: Beyond word frequency: bursts, lulls, and scaling in the temporal distributions of words. PLOS One 4, e7678 (2009)
Corral, A., Ferrer-i-Cancho, R., Boleda, G., Diaz-Guilera, A.: Universal complex structures in written language. arXiv:0901.2924
Lijffijt, J., Papapetrou, P., Puolamäki, K., Mannila, H.: Analyzing word frequencies in large text corpora using inter-arrival times and bootstrapping. Machine Learning and Knowledge Discovery in Databases. Lecture Notes in Computer Science, vol. 6912, p. 341. Springer, Berlin (2011)
Damerau, F.J., Mandelbrot, B.: Tests of the degree of word clustering in samples of written English. Linguistics 102, 58–72 (1973)
Schenkel, A., Zhang, J., Zhang, Y.: Long range correlation in human writings. Fractals 1, 47 (1993)
Altmann, E.G., Cristadoro, G., Degli Esposti, M.: On the origin of long-range correlations in texts. PNAS 109, 11582 (2012)
Ebeling, W., Pöschel, T.: Entropy and long-range correlations in literary English. Europhys. Lett. 26, 24 (1994)
Debowski, L.: On Hilberg’s law and its links with Guiraud’s law. J. Quant. Linguist. 13, 81–109 (2006)
Piantadosi, S.T., Tily, H., Gibson, E.: Word lengths are optimized for efficient communication. PNAS 108, 3526 (2011)
Solé, R.V., Corominas-Murtra, B., Valverde, S., Steels, L.: Language networks: their structure, function and evolution. Complexity 15, 20 (2009)
Choudhury, M., Mukherjee, A.: The structure and dynamics of linguistic networks. Dynamics On and Of Complex Networks, pp. 145–166. Springer, Boston (2009)
Baronchelli, A., Ferrer-i-Cancho, R., Pastor-Satorras, R., Chater, N., Christiansen, M.H.: Networks in cognitive science. Trends Cogn. Sci. 17, 348 (2013)
Cong, J., Liu, H.: Approaching human language with complex networks. Phys. Life Rev. 11, 598 (2014)
Constrained writing, in Wikipedia. http://en.wikipedia.org/wiki/Constrained_writing. Accessed 3 Dec 2014
Benford, F.: The law of anomalous numbers. Proc. Am. Philos. Soc. 78, 551 (1938)
Main, I.G., Li, L., McCloskey, J., Naylor, M.: Effect of the Sumatran mega-earthquake on the global magnitude cut-off and event rate. Nat. Geosci. 1, 142 (2008)
Amancio, D.R., Altmann, E.G., Rybski, D., Oliveira Jr., O.N., Costa, L.D.F.: Probing the statistical properties of unknown texts: application to the Voynich manuscript. PLOS One 8, e67310 (2013)
Febres, G., Jaffé, K., Gershenson, C.: Complexity measurement of natural and artificial languages. Complexity (2014). doi:10.1002/cplx.21529
Bernhardsson, S., da Rocha, L.E.C., Minnhagen, P.: Size-dependent word frequencies and translational invariance of books. Phys. A 389, 330 (2010)
Williams, J.R., Bagrow, J.P., Danforth, C.M., Dodds, P.S.: Text mixing shapes the anatomy of rank-frequency distributions: a modern Zipfian mechanics for natural language (2014). arXiv:1409.3870
Baixeries, J., Elvevag, B., Ferrer-i-Cancho, R.: The evolution of the exponent of Zipf’s law in language ontogeny. PLOS One 8, e53227 (2013)
Jäger, G.: Power laws and other heavy-tailed distribution in linguistic typology. Adv. Compl. Syst. 15, 1150019 (2012)
Ferrer-i-Cancho, R., Elvevag, B.: Random texts do not exhibit the real Zipf’s law-like rank distribution. PLOS One 5, e9411 (2010)
Corominas-Murtra, B., Fortuny, J., Solé, R.V.: Emergence of Zipf’s law in the evolution of communication. Phys. Rev. E 83, 036115 (2011)
Ferrer-i-Cancho, R.: Optimization models of natural communication (2014). arXiv:1412.2486
Marsili, M., Mastromatteo, I., Roudi, Y.: On sampling and modeling complex systems. J. Stat. Mech. 2013, P09003 (2013)
Peterson, J., Dixit, P.D., Dill, K.: A maximum entropy framework for nonexponential distributions. PNAS 110, 20380 (2013)
Goldstein, M.L., Morris, S.A., Yen, G.G.: Problems with fitting to the power-law distribution. Eur. Phys. J. B 41, 255–258 (2004)
Bauke, H.: Parameter estimation for power-law distributions by maximum likelihood methods. Eur. Phys. J. B 58, 167–173 (2007)
Clauset, A., Shalizi, C.R., Newman, M.E.J.: Power-law distributions in empirical data. SIAM Rev. 51, 661–703 (2009)
Deluca, A., Corral, A.: Fitting and goodness-of-fit test of non-truncated and truncated power-law distributions. Acta Geophys. 61, 1351–1394 (2013)
Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning. Springer, New York (2009)
Burnham, K.P., Anderson, D.R.: Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. Springer, New York (2002)
Akaike, H.: A new look at the statistical model identification. IEEE Trans. Autom. Control 19, 716–723 (1974)
Kass, R.E., Raftery, A.E.: Bayes factors. J. Am. Stat. Assoc. 90, 773–795 (1995)
Grünwald, P.D.: Minimum Description Length Principle. MIT Press, Cambridge (2007)
Jaynes, E.T.: Probability Theory: The Logic of Science. Oxford University Press, Oxford (2003)
Günther, R., Levitin, L., Schapiro, B., Wagner, P.: Zipf’s law and the effect of ranking on probability distributions. Int. J. Theor. Phys. 35, 395 (1996)
Cristelli, M., Batty, M., Pietronero, L.: There is more than a power law in Zipf. Sci. Rep. 2, 812 (2012)
Stumpf, M.P.H., Porter, M.A.: Critical truths about power laws. Science 335, 665–666 (2012)
Weiss, M.S.: Modification of the Kolmogorov-Smirnov statistic for use with correlated data. J. Am. Stat. Assoc. 73, 872–875 (1978)
Chicheportiche, R., Bouchaud, J.-P.: Goodness-of-fit tests with dependent observations. J. Stat. Mech.: Theory Exp. 2011, P09003 (2011)
Serrano, M.A., Flammini, A., Menczer, F.: Modeling statistical properties of written text. PLOS One 4, e5372 (2009)
Eisler, Z., Bartos, I., Kertész, J.: Fluctuation scaling in complex systems: Taylor’s law and beyond. Adv. Phys. 57, 89–142 (2008)
Louf, R., Barthelemy, M.: Scaling: lost in the smog. Environ. Plan. B: Plan. Des. 41, 767 (2014)
Acknowledgments
We thank A. Corral, A. Deluca, R. Ferrer-i-Cancho, F. Font-Clos, and R. Guimerá for insightful discussions.
Appendix
The books listed in Table 3 were obtained from Project Gutenberg (http://www.gutenberg.org). The books and data filtering are the same as the ones used in Ref. [30] (see the Supplementary Information of that paper for further details). We removed capitalization and all symbols except the letters “a–z”, the digits “0–9”, the apostrophe, and the blank space. A string of symbols between two consecutive blank spaces was considered to be a word.
The English Wikipedia data was obtained from Wikimedia dumps (http://dumps.wikimedia.org/). The filtering was the same as the one used in Ref. [24], in which we removed capitalization and kept only those words (i.e., sequences of symbols separated by blank space) which consisted exclusively of the letters “a–z” and the apostrophe.
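The two filters described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for the chapter: the function names `tokenize_book` and `tokenize_wikipedia` are hypothetical, newlines and tabs are assumed to count as blank spaces, and removed symbols are assumed to be deleted rather than replaced by a space (so a hyphenated compound collapses into a single word).

```python
import re

def tokenize_book(text):
    """Book filter: lowercase, then drop every symbol except a-z, 0-9,
    the apostrophe, and the blank space."""
    text = text.lower()
    # Assumption: any whitespace character (newline, tab) acts as a blank space.
    text = re.sub(r"\s", " ", text)
    # Assumption: disallowed symbols are deleted, not replaced by a space.
    text = re.sub(r"[^a-z0-9' ]", "", text)
    # A word is a string of symbols between consecutive blank spaces.
    return text.split()

# Words made exclusively of the letters a-z and the apostrophe.
WIKI_WORD = re.compile(r"[a-z']+")

def tokenize_wikipedia(text):
    """Wikipedia filter: lowercase, split on blank space, and keep only
    the words consisting exclusively of a-z and the apostrophe."""
    return [w for w in text.lower().split() if WIKI_WORD.fullmatch(w)]

sample = "Don't stop, 42 times"
print(tokenize_book(sample))       # ["don't", 'stop', '42', 'times']
print(tokenize_wikipedia(sample))  # ["don't", 'times']
```

The example shows how the two filters differ: the book filter cleans each word by deleting unwanted symbols, whereas the Wikipedia filter discards any word that contains a symbol outside the allowed set (here "stop," and "42").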
The computation of the Menzerath–Altmann law appearing in Figs. 1, 2, and Table 2 was done starting from the unique words (word types) in the databases discussed in the previous paragraphs. For each word w we applied the following steps:
1. Lemmatize using the WordNetLemmatizer (http://wordnet.princeton.edu) in the NLTK Python package (http://www.nltk.org/).
2. Count the number of syllables \(x_w\) based on the Moby Hyphenation List by Grady Ward, available at http://www.gutenberg.org/ebooks/3204.
3. Count the number of phonemes \(z_w\) based on The CMU Pronouncing Dictionary, version 0.7b, available at www.speech.cs.cmu.edu/cgi-bin/cmudict.
For the book Moby Dick by H. Melville, this procedure allowed us to compute \(x_w\) and \(z_w\) for 11,595 words, \(66\,\%\) of the total number of words (before lemmatization). For Wikipedia, we obtained 60,749 words, \(1.7\,\%\) of the total number. The low success rate for Wikipedia is due to the size of the database (a large number of rare words), and the results depend more strongly on the procedure described above than on the database itself.
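The three-step procedure described above, together with the success rate it yields, can be sketched as follows. This is a minimal illustration and not the authors' code. The function name, the toy lemmatizer, and the two small lookup tables are hypothetical stand-ins: in practice the lemmatizer could be `nltk.stem.WordNetLemmatizer().lemmatize`, and the tables would be built from the Moby Hyphenation List and the CMU Pronouncing Dictionary. Keying the result by the original word type is also an assumption.

```python
def syllables_and_phonemes(word_types, lemmatize, hyphenation, pronunciation):
    """Return {word type: (x_w, z_w)} for every word type whose lemma is
    found in both lookup tables; all other word types are dropped."""
    results = {}
    for w in word_types:
        lemma = lemmatize(w)                  # step 1: lemmatize
        if lemma not in hyphenation or lemma not in pronunciation:
            continue                          # word cannot be resolved
        x_w = len(hyphenation[lemma])         # step 2: number of syllables
        z_w = len(pronunciation[lemma])       # step 3: number of phonemes
        results[w] = (x_w, z_w)
    return results

# Hypothetical stand-in for a real lemmatizer: strip a trailing "s".
def toy_lemmatize(word):
    return word[:-1] if word.endswith("s") else word

# Illustrative toy entries only, mimicking the shape of the two resources:
# a list of syllables per word and a list of phonemes per word.
hyphenation = {"whale": ["whale"], "harpoon": ["har", "poon"]}
pronunciation = {
    "whale": ["W", "EY1", "L"],
    "harpoon": ["HH", "AA0", "R", "P", "UW1", "N"],
}

word_types = ["whales", "harpoon", "zyxq"]
res = syllables_and_phonemes(word_types, toy_lemmatize, hyphenation, pronunciation)

# Fraction of word types (before lemmatization) that could be resolved.
coverage = len(res) / len(word_types)
print(res)
print(f"coverage: {coverage:.1%}")
```

Words missing from either resource are silently skipped, which is why the coverage falls sharply for a database dominated by rare words.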
Copyright information
© 2016 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Altmann, E.G., Gerlach, M. (2016). Statistical Laws in Linguistics. In: Degli Esposti, M., Altmann, E., Pachet, F. (eds) Creativity and Universality in Language. Lecture Notes in Morphogenesis. Springer, Cham. https://doi.org/10.1007/978-3-319-24403-7_2
DOI: https://doi.org/10.1007/978-3-319-24403-7_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-24401-3
Online ISBN: 978-3-319-24403-7
eBook Packages: Social Sciences, Social Sciences (R0)