Bias in Rank-Frequency Relation

Tanaka-Ishii, Kumiko

doi:10.1007/978-3-030-59377-3_5

Part of the book series: Mathematics in Mind (MATHMIN)

Abstract

As shown at the end of the previous chapter, the rank-frequency relation of Moby Dick almost follows a power law, with an η value close to 1. The goal of this chapter is to see how well Zipf’s law holds among various kinds of texts and data. A text is typically written by a single author, but other corpora consist of collections (e.g., newspapers, collections of literary texts). Analyses like the one conducted here have also been reported beyond written texts, for speech (including infant utterances) and even program source code and music.
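To make the rank-frequency relation concrete, the following is a minimal sketch that counts words, sorts their frequencies by rank, and estimates η as the negated slope of a least-squares line on log-log axes. The function names `rank_frequency` and `estimate_eta`, the simple regular-expression tokenizer, and the least-squares fit are all assumptions made for illustration; they are not the fitting procedure used in the book, which evaluates fits through the LL value defined in Sect. 21.1.

```python
from collections import Counter
import math
import re


def rank_frequency(text):
    """Return word frequencies in descending order, so index 0 is rank 1.

    Tokenization is a crude assumption: lowercase runs of a-z and apostrophes.
    """
    words = re.findall(r"[a-z']+", text.lower())
    return sorted(Counter(words).values(), reverse=True)


def estimate_eta(freqs):
    """Estimate eta in f(r) ~ r^(-eta) by least squares on log-log axes.

    Needs at least two ranks. This is a simple illustrative estimator,
    not the fitting method of the book.
    """
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx  # eta is the negated slope


if __name__ == "__main__":
    # Synthetic frequencies that follow f(r) = 1000 / r exactly.
    synthetic = [1000.0 / r for r in range(1, 201)]
    print(round(estimate_eta(synthetic), 6))
```

On real texts a least-squares fit weights the sparse low-frequency tail heavily, so the estimate should be read as a rough indication only.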

Notes

1. Project Gutenberg includes few Japanese texts, so more were acquired separately from Aozora Bunko.
2. Precisely, texts were included when they listed a single author. Even when a single author was listed, some texts, such as The Traditional Text of the Holy Gospels by John William Burgon, could have merely been collected by a single person, making it doubtful whether they are really single-authored.
3. Specifically, the average LL value was 6.57, with a standard deviation of 0.30. This mean value of LL is very similar to that reported for Moby Dick at the end of the previous section. As mentioned previously, the definition of LL appears in Sect. 21.1. LL becomes small when the fit is good.
4. LL = 5.817 for TED, LL = 6.070 for Beethoven, and LL = 7.155 for Haskell.
5. There is no conventional way to transform music into a sequence of words. Therefore, for this book, MIDI data encoding a performance was transcribed to text by a software application called SMF2MML, which produces a sequence of sounds in a text format. Every encoded note is considered a word in this analysis. There could be many other ways to analyze music; this is one possible, crude starting point.
6. The disjunctions were deemed to represent the differences in behavior among the different categories of words in source code, namely literals, operators, reserved words, and identifiers.
7. The vocabulary size for an individual has been analyzed in the cognitive linguistics field (Aitchison, 1987).
8. Precisely, Bernhardsson et al. (2009) studied the density function of a vocabulary population, which takes a different perspective from that of a rank-frequency distribution. They reported that ζ in the density function, when defined as given in formula (6.3) in the next chapter, decreases as the text length increases. Theoretically, this corresponds roughly to an increase in η, as the next chapter will explain via formula (6.2). They showed this change by taking portions of literary texts.
9. Baixeries et al. (2013) indicated how the overall η value decreases with age.
10. As explained in Sect. 22.4, the results presented for characters in the first two panels of Fig. 5.3 exclude all symbols, such as spaces and punctuation marks. The analysis was conducted only on the set of characters for each language, because a clear definition of the set of characters for each script is available, based on Unicode.
11. Because an exponential function presents a convex tendency, such plots are often roughly called “exponential.” The rank-frequency plots of English and Arabic do present a rough linear tendency on semi-log axes, but with some disjunctions among the points. Whether these plots are really exponential is an issue that will require future work.
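The character-level analysis described in notes 10 and 11 can be sketched as follows. This is only an illustration under stated assumptions: characters are kept when their Unicode general category is a letter (a stand-in for the per-script character sets of Sect. 22.4, which are not reproduced here), text is lowercased, and the "linear tendency" on semi-log versus log-log axes is summarized by the R² of a least-squares line. The names `char_frequencies`, `r_squared`, and `compare_fits` are hypothetical.

```python
from collections import Counter
import math
import unicodedata


def char_frequencies(text):
    """Descending frequencies of letters only.

    Spaces, digits, punctuation, and other symbols are dropped by keeping
    characters whose Unicode general category starts with "L" (an assumption).
    """
    letters = [c for c in text.lower() if unicodedata.category(c).startswith("L")]
    return sorted(Counter(letters).values(), reverse=True)


def r_squared(xs, ys):
    """R^2 of the least-squares line through the points (xs, ys).

    Requires variation in both xs and ys, otherwise a division by zero occurs.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return (sxy * sxy) / (sxx * syy)


def compare_fits(freqs):
    """Return (semi-log R^2, log-log R^2) for a rank-frequency list."""
    ranks = list(range(1, len(freqs) + 1))
    log_f = [math.log(f) for f in freqs]
    semi_log = r_squared(ranks, log_f)                         # exponential-like
    log_log = r_squared([math.log(r) for r in ranks], log_f)   # power-law-like
    return semi_log, log_log


if __name__ == "__main__":
    freqs = char_frequencies("The quick brown fox jumps over the lazy dog, twice!")
    print(compare_fits(freqs))
```

A higher semi-log R² only suggests a roughly exponential shape; as note 11 states, whether such plots are really exponential remains open, and R² does not capture the disjunctions among the points.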

References

Aitchison, Jean (1987). Words in the mind: an introduction to the mental lexicon. Basil Blackwell Ltd.
Allahverdyan, Armen E., Deng, Weibing, and Wang, Qiuping A. (2013). Explaining Zipf’s law via a mental lexicon. Physical Review E, 88:062804.
Baayen, R. Harald (2001). Word Frequency Distributions. Springer.
Baixeries, Jaume, Elvevåg, Brita, and Ferrer-i-Cancho, Ramon (2013). The evolution of the exponent of Zipf’s law in language ontogeny. PLoS One, 8(3):e53227.
Bell, Timothy C., Cleary, John G., and Witten, Ian H. (1990). Text Compression. Prentice Hall.
Bernhardsson, Sebastian, da Rocha, Luis E. C., and Minnhagen, Petter (2009). The meta book and size-dependent properties of written language. New Journal of Physics, 11(12):123015.
Coulmas, Florian (1996). The Blackwell Encyclopedia of Writing Systems. Blackwell Publishers Ltd.
Daniels, Peter T. and Bright, William, editors (1996). The World’s Writing Systems. Oxford University Press.
Deng, Weibing, Allahverdyan, Armen E., Li, Bo, and Wang, Qiuping A. (2014). Rank-frequency relation for Chinese characters. The European Physical Journal B, 87:47.
Gerlach, Martin and Altmann, Eduardo G. (2013). Stochastic model for the vocabulary growth in natural languages. Physical Review X, 3(2):21006.
Good, Ian J. (1953). The population frequencies of species and the estimation of population parameters. Biometrika, 40, 237–264.
Lieven, Elena, Salomo, Dorothé, and Tomasello, Michael (2009). Two-year-old children’s production of multiword utterances: A usage-based analysis. Cognitive Linguistics, 20(3), 481–507.
Lü, Linyuan, Zhang, Zi-Ke, and Zhou, Tao (2013). Deviation of Zipf’s and Heaps’ laws in human languages with limited dictionary sizes. Scientific Reports, 1082.
Mandelbrot, Benoit B. (1953). An informational theory of the statistical structure of language. In Proceedings of Symposium of Applications of Communication theory, pages 486–502.
Mandelbrot, Benoit B. (1965). Information Theory and Psycholinguistics. Scientific Psychology, pages 250–368.
Montemurro, Marcelo A. (2001). Beyond the Zipf-Mandelbrot law in quantitative linguistics. Physica A: Statistical Mechanics and its Applications, 300, 567–578.
Nabeshima, Terutaka and Gunji, Yukio-Pegio (2004). Zipf’s law in phonograms and Weibull distribution in ideograms: comparison of English with Japanese. Biosystems, 73(2), 131–139.
van Egmond, Mariolein (2018). On the topic of Zipf’s law in people with schizophrenic disorders. Ph.D. thesis.
Zipf, George K. (1949). Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. Addison-Wesley Press.

Author information

Authors and Affiliations

Research Center for Advanced Science and Technology (RCAST), The University of Tokyo, Tokyo, Japan
Kumiko Tanaka-Ishii

About this chapter

Cite this chapter

Tanaka-Ishii, K. (2021). Bias in Rank-Frequency Relation. In: Statistical Universals of Language. Mathematics in Mind. Springer, Cham. https://doi.org/10.1007/978-3-030-59377-3_5

DOI: https://doi.org/10.1007/978-3-030-59377-3_5
Published: 02 April 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-59376-6
Online ISBN: 978-3-030-59377-3
eBook Packages: Mathematics and Statistics (R0)
