Statistical Laws in Linguistics

Altmann, Eduardo G.; Gerlach, Martin

doi:10.1007/978-3-319-24403-7_2

Part of the book series: Lecture Notes in Morphogenesis ((LECTMORPH))

Abstract

Zipf’s law is just one of many universal laws proposed to describe statistical regularities in language. Here we review and critically discuss how these laws can be statistically interpreted, fitted, and tested (falsified). The modern availability of large databases of written text allows for tests with unprecedented statistical accuracy and also for a characterization of the fluctuations around the typical behavior. We find that fluctuations are usually much larger than expected based on simplifying statistical assumptions (e.g., independence and lack of correlations between observations). These simplifications also appear in the usual statistical tests, so that the large fluctuations can be erroneously interpreted as a falsification of the law. Instead, here we argue that linguistic laws are only meaningful (falsifiable) if accompanied by a model for which the fluctuations can be computed (e.g., a generative model of the text). The large fluctuations we report show that the constraints imposed by linguistic laws on the creative process of text generation are not as tight as one could expect.

Notes

1. While some of the laws clearly intend to speak about the language as a whole, in practice they are tested and motivated by observations in specific texts which are thus implicitly or explicitly assumed to reflect the language as a whole.
2. A relative of Maxwell’s Daemon known from Thermodynamics.

References

1. Herdan, G.: Quantitative Linguistics. Butterworth Press, Oxford (1964)
2. Zipf, G.K.: The Psycho-Biology of Language. Routledge, London (1936). Id., Human Behavior and the Principle of Least Effort. Addison-Wesley Press, Oxford (1949)
3. Köhler, R., Altmann, G., Piotrowski, R.G. (eds.): Quantitative Linguistik. Ein internationales Handbuch. Quantitative Linguistics. An international Handbook. (=HSK27). de Gruyter, Berlin (2005)
4. Köhler, R., Altmann, G., Grzybek, P. (eds.): Quantitative Linguistics, De Gruyter Mouton. www.degruyter.com/view/serial/35295. Accessed 6 Feb 2015
5. Glottopedia: the free encyclopedia of linguistics. http://www.glottopedia.org/index.php/Laws. Accessed 17 Dec 2014
6. Encyclopedia entry: laws in quantitative linguistics. http://lql.uni-trier.de. Accessed 3 Dec 2014
7. Baayen, R.H.: Word Frequency Distributions. Kluwer Academic Publishers, Dordrecht (2001)
8. Zanette, D.H.: Statistical patterns in written language (2014). arXiv:1412.3336
9. Barbieri, G., Pachet, F., Roy, P., Degli Esposti, M.: Markov constraints for generating lyrics with style. In: 20th European Conference on Artificial Intelligence – ECAI, IOS Press, Amsterdam (2012)
10. Mitzenmacher, M.: A brief history of generative models for power law and lognormal distributions. Internet Math. 1, 226 (2004)
11. Newman, M.E.J.: Power laws, Pareto distributions and Zipf’s law. Contemp. Phys. 46, 323 (2005)
12. Mandelbrot, B.: On the theory of word frequencies and on related Markovian models of discourse. In: Structure of Language and Its Mathematical Aspects: Proceedings of Symposia in Applied Mathematics, vol. XII. American Mathematical Society, Providence (1961)
13. Altmann, G.: Prolegomena to Menzerath’s law. Glottometrika 2, 1 (1980)
14. Cramer, I.: The parameters of the Altmann-Menzerath law. J. Quant. Linguist. 12, 41 (2005)
15. Egghe, L.: Untangling Herdan’s law and Heaps’ law: mathematical and informetric arguments. J. Am. Soc. Inf. Sci. Technol. 58, 702 (2007)
16. Simon, H.A.: On a class of skew distribution functions. Biometrika 42, 425 (1955)
17. Li, W.: Zipf’s law everywhere. Glottometrics 5, 14 (2002)
18. Zanette, D., Montemurro, M.: Dynamics of text generation with realistic Zipf’s distribution. J. Quant. Linguist. 12, 29 (2005)
19. Piantadosi, S.T.: Zipf’s word frequency law in natural language: a critical review and future directions. Psychon. Bull. Rev. 21, 1112 (2014)
20. Lü, L., Zhang, Z.-K., Zhou, T.: Zipf’s law leads to Heaps’ law: analyzing their relation in finite-size systems. PLOS One 5, e14139 (2010)
21. Petersen, A.M., Tenenbaum, J.N., Havlin, S., Stanley, H.E., Perc, M.: Languages cool as they expand: allometric scaling and the decreasing need for new words. Sci. Rep. 2, 943 (2012)
22. Gerlach, M., Altmann, E.G.: Stochastic model for the vocabulary growth in natural languages. Phys. Rev. X 3, 021006 (2013)
23. Font-Clos, F., Boleda, G., Corral, A.: A scaling law beyond Zipf’s law and its relation to Heaps’ law. New J. Phys. 15(9), 093033 (2013)
24. Gerlach, M., Altmann, E.G.: Scaling laws and fluctuations in the statistics of word frequencies. New J. Phys. 16, 113010 (2014)
25. Altmann, E.G., Pierrehumbert, J.B., Motter, A.E.: Beyond word frequency: bursts, lulls, and scaling in the temporal distributions of words. PLOS One 4, e7678 (2009)
26. Corral, A., Ferrer-i-Cancho, R., Boleda, G., Diaz-Guilera, A.: Universal complex structures in written language. arXiv:0901.2924
27. Lijffijt, J., Papapetrou, P., Puolamäki, K., Mannila, H.: Analyzing word frequencies in large text corpora using inter-arrival times and bootstrapping. Machine Learning and Knowledge Discovery in Databases. Lecture Notes in Computer Science, vol. 6912, p. 341. Springer, Berlin (2011)
28. Damerau, F.J., Mandelbrot, B.: Tests of the degree of word clustering in samples of written English. Linguistics 102, 58–72 (1973)
29. Schenkel, A., Zhang, J., Zhang, Y.: Long range correlation in human writings. Fractals 1, 47 (1993)
30. Altmann, E.G., Cristadoro, G., Degli Esposti, M.: On the origin of long-range correlations in texts. PNAS 109, 11582 (2012)
31. Ebeling, W., Pöschel, T.: Entropy and long-range correlations in literary English. Europhys. Lett. 26, 24 (1994)
32. Debowski, L.: On Hilberg’s law and its links with Guiraud’s law. J. Quant. Linguist. 13, 81–109 (2006)
33. Piantadosi, S.T., Tily, H., Gibson, E.: Word lengths are optimized for efficient communication. PNAS 108, 3526 (2011)
34. Solé, R.V., Corominas-Murtra, B., Valverde, S., Steels, L.: Language networks: their structure, function and evolution. Complexity 15, 20 (2009)
35. Choudhury, M., Mukherjee, A.: The structure and dynamics of linguistic networks. Dynamics On and Of Complex Networks, pp. 145–166. Springer, Boston (2009)
36. Baronchelli, A., Ferrer-i-Cancho, R., Pastor-Satorras, R., Chater, N., Christiansen, M.H.: Networks in cognitive science. Trends Cogn. Sci. 17, 348 (2013)
37. Cong, J., Liu, H.: Approaching human language with complex networks. Phys. Life Rev. 11, 598 (2014)
38. Constrained writing, in Wikipedia. http://en.wikipedia.org/wiki/Constrained_writing. Accessed 3 Dec 2014
39. Benford, F.: The law of anomalous numbers. Proc. Am. Philos. Soc. 78, 551 (1938)
40. Main, I.G., Li, L., McCloskey, J., Naylor, M.: Effect of the Sumatran mega-earthquake on the global magnitude cut-off and event rate. Nat. Geosci. 1, 142 (2008)
41. Amancio, D.R., Altmann, E.G., Rybski, D., Oliveira Jr., O.N., Costa, L.D.F.: Probing the statistical properties of unknown texts: application to the Voynich manuscript. PLOS One 8, e67310 (2013)
42. Febres, G., Jaffé, K., Gershenson, C.: Complexity measurement of natural and artificial languages. Complexity (2014). doi:10.1002/cplx.21529
43. Bernhardsson, S., da Rocha, L.E.C., Minnhagen, P.: Size-dependent word frequencies and translational invariance of books. Phys. A 389, 330 (2010)
44. Williams, J.R., Bagrow, J.P., Danforth, C.M., Dodds, P.S.: Text mixing shapes the anatomy of rank-frequency distributions: a modern Zipfian mechanics for natural language (2014). arXiv:1409.3870
45. Baixeries, J., Elvevag, B., Ferrer-i-Cancho, R.: The evolution of the exponent of Zipf’s law in language ontogeny. PLOS One 8, e53227 (2013)
46. Jäger, G.: Power laws and other heavy-tailed distribution in linguistic typology. Adv. Compl. Syst. 15, 1150019 (2012)
47. Ferrer-i-Cancho, R., Elvevag, B.: Random texts do not exhibit the real Zipf’s law-like rank distribution. PLOS One 5, e9411 (2010)
48. Corominas-Murtra, B., Fortuny, J., Solé, R.V.: Emergence of Zipf’s law in the evolution of communication. Phys. Rev. E 83, 036115 (2011)
49. Ferrer-i-Cancho, R.: Optimization models of natural communication (2014). arXiv:1412.2486
50. Marsili, M., Mastromatteo, I., Roudi, Y.: On sampling and modeling complex systems. J. Stat. Mech. 2013, P09003 (2013)
51. Peterson, J., Dixit, P.D., Dill, K.: A maximum entropy framework for nonexponential distributions. PNAS 110, 20380 (2013)
52. Goldstein, M.L., Morris, S.A., Yen, G.G.: Problems with fitting to the power-law distribution. Eur. Phys. J. B 41, 255–258 (2004)
53. Bauke, H.: Parameter estimation for power-law distributions by maximum likelihood methods. Eur. Phys. J. B 58, 167–173 (2007)
54. Clauset, A., Shalizi, C.R., Newman, M.E.J.: Power-law distributions in empirical data. SIAM Rev. 51, 661–703 (2009)
55. Deluca, A., Corral, A.: Fitting and goodness-of-fit test of non-truncated and truncated power-law distributions. Acta Geophys. 61, 1351–1394 (2013)
56. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning. Springer, New York (2009)
57. Burnham, K.P., Anderson, D.R.: Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. Springer, New York (2002)
58. Akaike, H.: A new look at the statistical model identification. IEEE Trans. Autom. Control 19, 716–723 (1974)
59. Kass, R.E., Raftery, A.E.: Bayes factors. J. Am. Stat. Assoc. 90, 773–795 (1995)
60. Grünwald, P.D.: Minimum Description Length Principle. MIT Press, Cambridge (2007)
61. Jaynes, E.T.: Probability Theory: The Logic of Science. Oxford University Press, Oxford (2003)
62. Günther, R., Levitin, L., Schapiro, B., Wagner, P.: Zipf’s law and the effect of ranking on probability distributions. Int. J. Theor. Phys. 35, 395 (1996)
63. Cristelli, M., Batty, M., Pietronero, L.: There is more than a power law in Zipf. Sci. Rep. 2, 812 (2012)
64. Stumpf, M.P.H., Porter, M.A.: Critical truths about power laws. Science 335, 665–666 (2012)
65. Weiss, M.S.: Modification of the Kolmogorov-Smirnov statistic for use with correlated data. J. Am. Stat. Assoc. 73, 872–875 (1978)
66. Chicheportiche, R., Bouchaud, J.-P.: Goodness-of-fit tests with dependent observations. J. Stat. Mech.: Theory Exp. 2011, P09003 (2011)
67. Serrano, M.A., Flammini, A., Menczer, F.: Modeling statistical properties of written text. PLOS One 4, e5372 (2009)
68. Eisler, Z., Bartos, I., Kertész, J.: Fluctuation scaling in complex systems: Taylor’s law and beyond. Adv. Phys. 57, 89–142 (2008)
69. Louf, R., Barthelemy, M.: Scaling: lost in the smog. Environ. Plan. B: Plan. Des. 41, 767 (2014)

Acknowledgments

We thank A. Corral, A. Deluca, R. Ferrer-i-Cancho, F. Font-Clos, and R. Guimerá for insightful discussions.

Author information

Authors and Affiliations

Max Planck Institute for the Physics of Complex Systems, Dresden, Germany
Eduardo G. Altmann & Martin Gerlach

Corresponding author

Correspondence to Eduardo G. Altmann.

Editor information

Editors and Affiliations

Dipartimento di Matematica, Università di Bologna, Bologna, Italy
Mirko Degli Esposti
Max Planck Institute for the Physics of Complex Systems, Dresden, Germany
Eduardo G. Altmann
Sony Computer Science Laboratory, Paris, France
François Pachet

Appendix

The books listed in Table 3 were obtained from Project Gutenberg (http://www.gutenberg.org). The books and data filtering are the same as the ones used in Ref. [30] (see the Supplementary Information of that paper for further details). We removed capitalization and all symbols except the letters “a–z”, the digits “0–9”, the apostrophe, and the blank space. A string of symbols between two consecutive blank spaces was considered to be a word.

The English Wikipedia data was obtained from Wikimedia dumps (http://dumps.wikimedia.org/). The filtering was the same as the one used in Ref. [24], in which we removed capitalization and kept only those words (i.e., sequences of symbols separated by blank space) which consisted exclusively of the letters “a–z” and the apostrophe.
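The two filters described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the chapter: the function names are hypothetical, and two details are assumptions that the text does not settle, namely that line breaks and tabs count as blank space, and that disallowed symbols in the books are deleted rather than replaced by a blank.

```python
import re


def tokenize_book(text: str) -> list[str]:
    """Book filter: lowercase, keep only a-z, 0-9, apostrophe and blank space."""
    lowered = text.lower()
    # Assumption: newlines and tabs are treated as blank space.
    spaced = re.sub(r"\s+", " ", lowered)
    # Assumption: every other disallowed symbol is deleted, so that
    # "well-known" becomes the single word "wellknown".
    cleaned = re.sub(r"[^a-z0-9' ]", "", spaced)
    # A word is a string of symbols between two blank spaces.
    return cleaned.split()


def tokenize_wikipedia(text: str) -> list[str]:
    """Wikipedia filter: lowercase, then keep only the blank-separated
    tokens made exclusively of a-z and the apostrophe."""
    allowed = re.compile(r"[a-z']+")
    return [tok for tok in text.lower().split() if allowed.fullmatch(tok)]


if __name__ == "__main__":
    sample = "Don't PANIC, 42!\nOK"
    print(tokenize_book(sample))       # symbols are stripped from tokens
    print(tokenize_wikipedia(sample))  # tokens containing symbols are dropped
```

The difference between the two filters is that the book filter edits tokens (symbols are stripped from them), whereas the Wikipedia filter discards any token that contains a disallowed symbol.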

The computation of the Menzerath–Altmann law appearing in Figs. 1 and 2 and in Table 2 started from the unique words (word types) in the databases discussed in the previous paragraphs. For each word w we applied the following steps:

1. Lemmatize using the WordNetLemmatizer (http://wordnet.princeton.edu) in the NLTK Python package (http://www.nltk.org/).
2. Count the number of syllables \(x_w\) based on the Moby Hyphenation List by Grady Ward, available at http://www.gutenberg.org/ebooks/3204.
3. Count the number of phonemes \(z_w\) based on The CMU Pronouncing Dictionary, version 0.7b, available at www.speech.cs.cmu.edu/cgi-bin/cmudict.
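The three steps above can be sketched as follows. This is a hedged illustration under stated assumptions rather than the original pipeline: all function names are hypothetical, the hyphenation entries are assumed to be one word per line with a syllable separator that the caller supplies, and the pronouncing dictionary is assumed to have one `WORD PH1 PH2 ...` entry per line, with `;;;` comment lines and alternative pronunciations marked as `WORD(1)`. The lemmatizer is passed in as a callable; the identity function used by default is a stand-in, and the `lemmatize` method of NLTK's `WordNetLemmatizer` could be supplied instead.

```python
def syllable_counts(hyphenation_lines, separator):
    """Map each word to its number of syllables x_w.

    Assumption: one entry per line, syllables joined by `separator`.
    """
    counts = {}
    for line in hyphenation_lines:
        entry = line.strip()
        if not entry:
            continue
        word = entry.replace(separator, "").lower()
        counts[word] = entry.count(separator) + 1
    return counts


def phoneme_counts(dictionary_lines):
    """Map each word to its number of phonemes z_w.

    Assumption: lines look like 'WORD PH1 PH2 ...'; only the first
    pronunciation of a word is used.
    """
    counts = {}
    for line in dictionary_lines:
        if not line.strip() or line.startswith(";;;"):
            continue
        word, *phonemes = line.split()
        if "(" in word:  # alternative pronunciation such as WORD(1)
            continue
        counts[word.lower()] = len(phonemes)
    return counts


def word_measures(word_types, syllables, phonemes, lemmatize=lambda w: w):
    """Return {word: (x_w, z_w)} for words whose lemma is in both resources."""
    measures = {}
    for word in word_types:
        lemma = lemmatize(word)  # step 1
        if lemma in syllables and lemma in phonemes:
            measures[word] = (syllables[lemma], phonemes[lemma])  # steps 2, 3
    return measures


if __name__ == "__main__":
    # Toy in-memory data, invented for illustration only.
    hyph = ["whale", "cap-tain", "har-poon-er"]
    cmu = [";;; comment", "WHALE  W EY1 L", "CAPTAIN  K AE1 P T AH0 N",
           "CAPTAIN(1)  K AE1 P T IH0 N"]
    types = ["whale", "captain", "harpooner", "queequeg"]
    result = word_measures(types, syllable_counts(hyph, "-"), phoneme_counts(cmu))
    print(result, len(result) / len(types))
```

A word contributes only if its lemma is found in both resources, so the fraction `len(result) / len(types)` measures how much of the vocabulary the procedure covers.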

For the book Moby Dick by H. Melville, this procedure allowed us to compute \(x_w\) and \(z_w\) for 11,595 words, \(66\,\%\) of the total number of words (before lemmatization). For Wikipedia, we obtained 60,749 words, \(1.7\,\%\) of the total number. The low success rate for Wikipedia is due to the size of the database (large number of rare words), and the results depend more strongly on the procedure described above than on the database itself.

About this chapter

Cite this chapter

Altmann, E.G., Gerlach, M. (2016). Statistical Laws in Linguistics. In: Degli Esposti, M., Altmann, E., Pachet, F. (eds) Creativity and Universality in Language. Lecture Notes in Morphogenesis. Springer, Cham. https://doi.org/10.1007/978-3-319-24403-7_2

DOI: https://doi.org/10.1007/978-3-319-24403-7_2
Published: 19 May 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-24401-3
Online ISBN: 978-3-319-24403-7
eBook Packages: Social Sciences (R0)