Abstract
As shown at the end of the previous chapter, the rank-frequency relation of Moby Dick almost follows a power law, with an η value close to 1. The goal of this chapter is to see how well Zipf’s law holds among various kinds of texts and data. A text is typically written by a single author, but other corpora consist of collections (e.g., newspapers, collections of literary texts). Analyses like the one conducted here have also been reported beyond written texts, for speech (including infant utterances) and even program source code and music.
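As a rough illustration of the rank-frequency relation discussed here, the following sketch ranks word frequencies and estimates η as the negated slope of a least-squares line on log-log axes. The function names, the synthetic sample, and the least-squares fit are assumptions made for illustration only. This is not the fitting procedure used in the book, which evaluates fit quality via LL.

```python
import math
from collections import Counter

def rank_frequency(tokens):
    """Return frequencies in descending order; index i holds the frequency of rank i + 1."""
    return sorted(Counter(tokens).values(), reverse=True)

def estimate_eta(freqs):
    """Negated least-squares slope of log(frequency) against log(rank).

    Assumes at least two ranks; this is a simple stand-in, not a maximum-likelihood fit.
    """
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx

# Synthetic sample (hypothetical data): the word of rank r occurs about 1000 / r times.
tokens = []
for r in range(1, 51):
    tokens.extend([f"w{r}"] * round(1000 / r))

freqs = rank_frequency(tokens)
eta = estimate_eta(freqs)
print(f"estimated eta = {eta:.3f}")
```

On real text, `tokens` would come from a tokenizer of choice, and the estimate depends on how the head and tail of the distribution are treated.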
Notes
- 1.
Project Gutenberg includes few Japanese texts, so more were separately acquired from Aozora Bunko.
- 2.
Precisely, texts were included when they listed a single author. Even when a single author was listed, some texts, such as The Traditional Text of the Holy Gospels by John William Burgon, could have merely been collected by a single person, making it doubtful whether they are really single-authored.
- 3.
Specifically, the average LL value was 6.57, with a standard deviation of 0.30. This mean value of LL is close to that reported for Moby Dick at the end of the previous section. As mentioned previously, the definition of LL appears in Sect. 21.1. LL becomes small when the fit is good.
- 4.
LL = 5.817 for TED, LL = 6.070 for Beethoven, and LL = 7.155 for Haskell.
- 5.
There is no conventional way to transform music to a sequence of words. Therefore, for this book, MIDI data that encodes a performance was transcribed to a text by a software application called SMF2MML. This produces a sequence of sounds in a text format. Every encoded note is considered a word in this analysis. There could be many other ways to analyze music, and this way is one possible crude starting point.
- 6.
The disjunctions were deemed to represent the differences in behavior among the different categories of words in source code, namely literals, operators, reserved words, and identifiers.
- 7.
The vocabulary size for an individual has been analyzed in the cognitive linguistics field (Aitchison, 1987).
- 8.
Precisely, Bernhardsson et al. (2009) studied the density function of a vocabulary population, which takes a different perspective from that of a rank-frequency distribution. They reported that ζ in the density function, when defined as given in formula (6.3) in the next chapter, decreases as the text length increases. Theoretically, this corresponds roughly to the increase in η, as the next chapter will explain via formula (6.2). They showed this change by taking portions of literary texts.
- 9.
Baixeries et al. (2013) indicated how the overall η value decreases with age.
- 10.
As explained in Sect. 22.4, the results presented for characters in the first two panels of Fig. 5.3 exclude all symbols, such as spaces, punctuation marks, and so on. The analysis was conducted only on the set of characters for each language, because there is a clear, available definition of the set of characters for each script, based on Unicode.
- 11.
Because an exponential function presents a convex tendency, such plots are often roughly called “exponential.” The rank-frequency plots of English and Arabic do present a sort of rough linear tendency on semi-log axes, but with some disjunctions among the points. Whether these plots are really exponential is an issue that will require future work.
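The character-level counting described in note 10 can be sketched as follows. This is a minimal illustration that assumes letters can be identified by the Unicode general category (any category beginning with "L"). The book's actual per-script character sets may be defined differently, and the function name and sample string are hypothetical.

```python
import unicodedata
from collections import Counter

def letter_frequencies(text):
    """Rank characters by frequency, keeping only Unicode letters.

    Spaces, punctuation marks, digits, and other symbols are skipped because
    their general category does not start with "L" (an assumed criterion).
    """
    letters = (ch for ch in text if unicodedata.category(ch).startswith("L"))
    return Counter(letters).most_common()

sample = "Call me Ishmael. Some years ago, never mind how long precisely."
ranked = letter_frequencies(sample.lower())

for rank, (ch, freq) in enumerate(ranked[:5], start=1):
    print(rank, ch, freq)
```

Plotting the resulting frequencies against rank on semi-log axes is one way to inspect the rough linear tendency mentioned in note 11.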
References
Aitchison, Jean (1987). Words in the mind: an introduction to the mental lexicon. Basil Blackwell Ltd.
Allahverdyan, Armen E., Deng, Weibing, and Wang, Qiuping A. (2013). Explaining Zipf’s law via a mental lexicon. Physical Review E, 88:062804.
Baayen, R. Harald (2001). Word Frequency Distributions. Springer.
Baixeries, Jaume, Elvevåg, Brita, and Ferrer-i-Cancho, Ramon (2013). The evolution of the exponent of Zipf’s law in language ontogeny. PLoS One, 8(3):e53227.
Bell, Timothy C., Cleary, John G., and Witten, Ian H. (1990). Text Compression. Prentice Hall.
Bernhardsson, Sebastian, da Rocha, Luis E. C., and Minnhagen, Petter (2009). The meta book and size-dependent properties of written language. New Journal of Physics, 11(12):123015.
Coulmas, Florian (1996). The Blackwell Encyclopedia of Writing Systems. Blackwell Publishers Ltd.
Daniels, Peter T. and Bright, William, editors (1996). The World’s Writing Systems. Oxford University Press.
Deng, Weibing, Allahverdyan, Armen E., Li, Bo, and Wang, Qiuping A. (2014). Rank-frequency relation for Chinese characters. The European Physical Journal B, 87:47.
Gerlach, Martin and Altmann, Eduardo G. (2013). Stochastic model for the vocabulary growth in natural languages. Physical Review X, 3(2):21006.
Good, Ian J. (1953). The population frequencies of species and the estimation of population parameters. Biometrika, 40, 237–264.
Lieven, Elena, Salomo, Dorothé, and Tomasello, Michael (2009). Two-year-old children’s production of multiword utterances: A usage-based analysis. Cognitive Linguistics, 20(3), 481–507.
Lü, Linyuan, Zhang, Zi-Ke, and Zhou, Tao (2013). Deviation of Zipf’s and Heaps’ laws in human languages with limited dictionary sizes. Scientific Reports, 1082.
Mandelbrot, Benoit B. (1953). An informational theory of the statistical structure of language. In Proceedings of Symposium of Applications of Communication theory, pages 486–502.
Mandelbrot, Benoit B. (1965). Information Theory and Psycholinguistics. Scientific Psychology, pages 250–368.
Montemurro, Marcelo A. (2001). Beyond the Zipf-Mandelbrot law in quantitative linguistics. Physica A: Statistical Mechanics and its Applications, 300, 567–678.
Nabeshima, Terutaka and Gunji, Yukio-Pegio (2004). Zipf’s law in phonograms and Weibull distribution in ideograms: comparison of English with Japanese. Biosystems, 73(2), 131–139.
van Egmond, Mariolein (2018). On the topic of Zipf’s law in people with schizophrenic disorders. Ph.D. thesis.
Zipf, George K. (1949). Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. Addison-Wesley Press.
© 2021 The Author(s)
About this chapter
Cite this chapter
Tanaka-Ishii, K. (2021). Bias in Rank-Frequency Relation. In: Statistical Universals of Language. Mathematics in Mind. Springer, Cham. https://doi.org/10.1007/978-3-030-59377-3_5
DOI: https://doi.org/10.1007/978-3-030-59377-3_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-59376-6
Online ISBN: 978-3-030-59377-3
eBook Packages: Mathematics and Statistics (R0)