Bias in Rank-Frequency Relation

Statistical Universals of Language

Part of the book series: Mathematics in Mind (MATHMIN)

Abstract

As shown at the end of the previous chapter, the rank-frequency relation of Moby Dick approximately follows a power law with an η value close to 1. The goal of this chapter is to see how well Zipf’s law holds across various kinds of texts and data. A text is typically written by a single author, but other corpora consist of collections (e.g., newspapers, collections of literary texts). Analyses like the one conducted here have also been reported beyond written texts: for speech (including infant utterances) and even for program source code and music.
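
As a rough illustration of the rank-frequency relation described above, the following sketch counts token frequencies, orders them by rank, and estimates η in f(r) ∝ r^(−η) with a least-squares fit on log-log axes. The function names, the toy input, and the least-squares estimator are assumptions made for illustration; the book reports fit quality with its own LL measure, which is not reproduced here.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return frequencies sorted in descending order (index 0 is rank 1)."""
    return sorted(Counter(tokens).values(), reverse=True)


def estimate_eta(frequencies):
    """Estimate eta in f(r) ~ r**(-eta) by least squares on log-log axes.

    Illustrative estimator only; it is not the fitting method of the book.
    """
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(freq) for freq in frequencies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx  # the fitted slope is -eta


# Hypothetical toy data whose frequencies are exactly 60 / rank.
tokens = ["a"] * 60 + ["b"] * 30 + ["c"] * 20 + ["d"] * 15 + ["e"] * 12
freqs = rank_frequency(tokens)
eta = estimate_eta(freqs)
```

For real texts, a least-squares fit of this kind is sensitive to the long tail of rare words, so the resulting η should be read as a rough indication rather than a careful estimate.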

Notes

  1. Project Gutenberg includes few Japanese texts, so more were separately acquired from Aozora Bunko.

  2. More precisely, texts were included when they listed a single author. Even then, some texts, such as The Traditional Text of the Holy Gospels by John William Burgon, could have merely been compiled by a single person, so it is doubtful whether they are really single-authored.

  3. Specifically, the average LL value was 6.57, with a standard deviation of 0.30. This mean is close to the value reported for Moby Dick at the end of the previous section. As mentioned previously, the definition of LL appears in Sect. 21.1. A smaller LL indicates a better fit.

  4. LL = 5.817 for TED, LL = 6.070 for Beethoven, and LL = 7.155 for Haskell.

  5. There is no conventional way to transform music into a sequence of words. Therefore, for this book, MIDI data encoding a performance was transcribed to text by a software application called SMF2MML, which produces a sequence of sounds in a text format. Every encoded note is treated as a word in this analysis. There could be many other ways to analyze music; this is one possible, crude starting point.

  6. The disjunctions were deemed to represent the differences in behavior among the different categories of words in source code, namely literals, operators, reserved words, and identifiers.

  7. The vocabulary size for an individual has been analyzed in the cognitive linguistics field (Aitchison, 1987).

  8. More precisely, Bernhardsson et al. (2009) studied the density function of a vocabulary population, which takes a different perspective from that of a rank-frequency distribution. They reported that ζ in the density function, when defined as given in formula (6.3) in the next chapter, decreases as the text length increases. Theoretically, this corresponds roughly to an increase in η, as the next chapter will explain via formula (6.2). They showed this change by taking portions of literary texts.

  9. Baixeries et al. (2013) indicated how the overall η value decreases with age.

  10. As explained in Sect. 22.4, the results presented for characters in the first two panels of Fig. 5.3 exclude all symbols, such as spaces and punctuation marks. The analysis was conducted only on the set of characters for each language, because a clear definition of the character set for each script is available, based on Unicode.

  11. Because an exponential function presents a convex tendency, such plots are often loosely called “exponential.” The rank-frequency plots of English and Arabic do present a roughly linear tendency on semi-log axes, but with some disjunctions among the points. Whether these plots are really exponential is an issue that will require future work.
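
Note 10 describes restricting the character analysis to the characters of each script and excluding symbols such as spaces and punctuation. A minimal sketch of such filtering follows. It assumes that Unicode general categories beginning with "L" (letters) approximate the intended character set, which may differ from the per-script definition used in the book; the function name and the sample string are hypothetical.

```python
import unicodedata
from collections import Counter


def letter_rank_frequency(text):
    """Return (character, count) pairs by descending count, letters only.

    Assumption: a character is kept when its Unicode general category
    starts with "L"; spaces, punctuation, digits, and symbols are dropped.
    """
    letters = [ch for ch in text if unicodedata.category(ch).startswith("L")]
    return Counter(letters).most_common()


# Mixed-script sample: the space, the period, and the exclamation mark are excluded.
ranked = letter_rank_frequency("Call me Ishmael. 白鯨!")
```

Uppercase and lowercase letters are counted separately here; whether to fold case is a separate design choice that this sketch leaves open.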

References

  • Aitchison, Jean (1987). Words in the mind: an introduction to the mental lexicon. Basil Blackwell Ltd.

  • Allahverdyan, Armen E., Deng, Weibing, and Wang, Qiuping A. (2013). Explaining Zipf’s law via a mental lexicon. Physical Review E, 88:062804.

  • Baayen, R. Harald (2001). Word Frequency Distributions. Springer.

  • Baixeries, Jaume, Elvevåg, Brita, and Ferrer-i-Cancho, Ramon (2013). The evolution of the exponent of Zipf’s law in language ontogeny. PLoS One, 8(3):e53227.

  • Bell, Timothy C., Cleary, John G., and Witten, Ian H. (1990). Text Compression. Prentice Hall.

  • Bernhardsson, Sebastian, da Rocha, Luis E. C., and Minnhagen, Petter (2009). The meta book and size-dependent properties of written language. New Journal of Physics, 11(12):123015.

  • Coulmas, Florian (1996). The Blackwell Encyclopedia of Writing Systems. Blackwell Publishers Ltd.

  • Daniels, Peter T. and Bright, William, editors (1996). The World’s Writing Systems. Oxford University Press.

  • Deng, Weibing, Allahverdyan, Armen E., Li, Bo, and Wang, Qiuping A. (2014). Rank-frequency relation for Chinese characters. The European Physical Journal B, 87:47.

  • Gerlach, Martin and Altmann, Eduardo G. (2013). Stochastic model for the vocabulary growth in natural languages. Physical Review X, 3(2):021006.

  • Good, Ian J. (1953). The population frequencies of species and the estimation of population parameters. Biometrika, 40, 237–264.

  • Lieven, Elena, Salomo, Dorothé, and Tomasello, Michael (2009). Two-year-old children’s production of multiword utterances: A usage-based analysis. Cognitive Linguistics, 20(3), 481–507.

  • Lü, Linyuan, Zhang, Zi-Ke, and Zhou, Tao (2013). Deviation of Zipf’s and Heaps’ laws in human languages with limited dictionary sizes. Scientific Reports, 3:1082.

  • Mandelbrot, Benoit B. (1953). An informational theory of the statistical structure of language. In Proceedings of Symposium of Applications of Communication Theory, pages 486–502.

  • Mandelbrot, Benoit B. (1965). Information Theory and Psycholinguistics. Scientific Psychology, pages 250–368.

  • Montemurro, Marcelo A. (2001). Beyond the Zipf-Mandelbrot law in quantitative linguistics. Physica A: Statistical Mechanics and its Applications, 300, 567–578.

  • Nabeshima, Terutaka and Gunji, Yukio-Pegio (2004). Zipf’s law in phonograms and Weibull distribution in ideograms: comparison of English with Japanese. Biosystems, 73(2), 131–139.

  • van Egmond, Mariolein (2018). On the topic of Zipf’s law in people with schizophrenic disorders. Ph.D. thesis.

  • Zipf, George K. (1949). Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. Addison-Wesley Press.

© 2021 The Author(s)

Cite this chapter

Tanaka-Ishii, K. (2021). Bias in Rank-Frequency Relation. In: Statistical Universals of Language. Mathematics in Mind. Springer, Cham. https://doi.org/10.1007/978-3-030-59377-3_5
