
Returns

Chapter in Statistical Universals of Language

Part of the book series: Mathematics in Mind (MATHMIN)


Abstract

Part II investigated the population of words, but the book thus far has not considered the properties underlying a sequence of words. Language forms a sequence, and this sequential nature is part of what characterizes it. Indeed, Sect. 4.4 showed that the n-gram subsequences of natural language texts present a different nature from those of random sequences, even from the population viewpoint. This provides evidence that language has a kind of memory, meaning that a word in one part of a text influences words in other parts of the text.


Notes

  1.

    Many continuous natural phenomena have a spectrum showing 1/f noise. Here, 1/f noise denotes a frequency spectrum that decays in inverse proportion to the frequency, i.e., following the function y = 1/x. This book, however, is about a discrete sequence, whereas 1/f noise concerns a continuous time series. A clear relation between the two would require defining what a "spectrum" signifies for a discrete sequence. There is interesting work on this question for DNA (Voss, 1992).
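
    One concrete way to give a discrete sequence a spectrum, in the spirit of Voss's analysis of DNA, is to map the symbols to a numeric indicator series and take its periodogram. The sketch below is mine, not the book's; the indicator mapping and function name are assumptions for illustration.

```python
import numpy as np

def indicator_spectrum(symbols, target):
    """Power spectrum of the 0/1 indicator series of one symbol type.

    Mapping a discrete sequence to a numeric indicator series is one
    conventional way to make a 'spectrum' meaningful for discrete data,
    in the spirit of Voss (1992) for DNA bases.
    """
    x = np.array([1.0 if s == target else 0.0 for s in symbols])
    x -= x.mean()                       # remove the zero-frequency offset
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x))
    return freqs[1:], power[1:]         # drop the zero frequency

# 1/f noise would appear as power roughly proportional to 1/freqs
# on a log-log plot.
```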

  2.

    Repetitions in language within a close range of sounds or words are studied in the field of linguistics under the term reduplication. For an introduction, see Rubino (2021). Texts rich in reduplication would appear different under the statistical analysis described here.

  3.

    In this chapter, the relative frequency is considered in two ways: with respect to a specific interval sequence Q_w for a word w, or with respect to all intervals. In the former case, P(q) = #q/|Q_w|, where #q denotes the number of occurrences of intervals of length q. In the latter case, every occurrence of a word type after its first one can be associated with an interval, so the accumulated frequency equals m − v, and P(q) can then be estimated as the relative frequency P(q) = #q/(m − v). Section 7.3 and Fig. 7.7 consider the case of all intervals, whereas the rest of the chapter considers a specific interval sequence Q_w.
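
    To make the two normalizations concrete, the following is a minimal sketch (mine, not the book's; the function names are hypothetical) that extracts Q_w for one word and estimates P(q) in both ways, with m tokens and v word types as above.

```python
from collections import Counter

def intervals_for_word(tokens, w):
    """Interval sequence Q_w: gaps between successive occurrences of w."""
    positions = [i for i, t in enumerate(tokens) if t == w]
    return [b - a for a, b in zip(positions, positions[1:])]

def p_q_single(tokens, w):
    """P(q) = #q / |Q_w|: relative frequency within one word's intervals."""
    q_w = intervals_for_word(tokens, w)
    counts = Counter(q_w)
    return {q: c / len(q_w) for q, c in counts.items()}

def p_q_all(tokens):
    """P(q) = #q / (m - v): relative frequency pooled over all word types.

    Every occurrence after a word's first one yields one interval, so
    the total number of intervals across all types is m - v.
    """
    m, v = len(tokens), len(set(tokens))
    last, counts = {}, Counter()
    for i, t in enumerate(tokens):
        if t in last:
            counts[i - last[t]] += 1
        last[t] = i
    return {q: c / (m - v) for q, c in counts.items()}
```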

  4.

    The cumulative distribution function Cum(q) of P(q) is defined as follows:

    $$\displaystyle \begin{aligned} Cum(q) \equiv \int_0^{q} P(q')\,dq'. \end{aligned} $$
    (7.3)
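
    Since interval lengths q are discrete in practice, the empirical counterpart of Eq. (7.3) is a cumulative sum rather than an integral; this restatement is mine, not the book's:

    $$\displaystyle \begin{aligned} Cum(q) = \sum_{q' \leq q} P(q'). \end{aligned} $$
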
  5.

    Altmann et al. (2009) also showed that a renewal process produces this stretched exponential function. It is not obvious, however, how to integrate the population characteristic seen in Part II into the renewal process to form a sequence.
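
    For intuition, such a renewal process can be simulated by drawing independent Weibull waiting times, whose survival function is exactly the stretched exponential exp(−(q/κ)^θ). The sketch below is mine, with arbitrary parameter values; it is not the construction of Altmann et al. (2009).

```python
import numpy as np

rng = np.random.default_rng(0)

def renewal_event_positions(n_events, theta=0.7, kappa=50.0):
    """Simulate a renewal process whose waiting times q satisfy
    P(Q > q) = exp(-(q / kappa) ** theta), a Weibull distribution.

    A renewal process draws each interval independently, so it can
    reproduce a stretched-exponential interval law while carrying no
    memory beyond the previous event.
    """
    gaps = kappa * rng.weibull(theta, size=n_events)
    return np.cumsum(gaps)              # event positions along the sequence

positions = renewal_event_positions(10_000)
```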

  6.

    The points here were fitted by the least-squares method (cf. Sect. 21.1), with ε = 0.0226 for Moby Dick and ε = 0.00929 for the shuffled text.
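
    A fit of this kind can be reproduced with standard tools; the sketch below is mine (the book's own fitting procedure is defined in Sect. 21.1), and reporting ε as the root-mean-square residual is an assumption for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

def stretched_exp(q, kappa, theta):
    # Stretched-exponential form for the interval distribution.
    return np.exp(-(q / kappa) ** theta)

def fit_and_error(q, y):
    """Least-squares fit of the stretched exponential to points (q, y)."""
    (kappa, theta), _ = curve_fit(stretched_exp, q, y, p0=(10.0, 0.5))
    eps = np.sqrt(np.mean((stretched_exp(q, kappa, theta) - y) ** 2))
    return kappa, theta, eps
```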

  7.

    Chapter 17 introduces this idea; in brief, it is to treat the rare words of a corpus as a single word, as suggested in Mikolov et al. (2010).

  8.

    Tanaka-Ishii and Bunde (2016) investigated the effects of different values: ψ = 1, 2, 4, 8, 16, 32, 64. For a large ψ, the interval sequence becomes short, whereas for a small ψ, W_ψ starts to include functional words. Therefore, this book uses ψ = 16 as a moderate value throughout.
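
    Constructing the rare-word interval sequence can be sketched as follows. The rarity criterion here (taking the least frequent types until they cover roughly a 1/ψ fraction of all tokens) is my assumption for illustration, chosen only to be consistent with the behavior described above; the precise definition of W_ψ is given in Tanaka-Ishii and Bunde (2016).

```python
from collections import Counter

def rare_word_intervals(tokens, psi=16):
    """Intervals between successive occurrences of any rare word.

    ASSUMPTION (for illustration): W_psi is the set of least frequent
    word types that together cover about a 1/psi fraction of tokens,
    so psi = 1 would include even functional words, while a large psi
    leaves only a short interval sequence.
    """
    freq = Counter(tokens)
    budget = len(tokens) / psi
    rare, covered = set(), 0
    for w, c in sorted(freq.items(), key=lambda kv: kv[1]):
        if covered + c > budget:
            break
        rare.add(w)
        covered += c
    positions = [i for i, t in enumerate(tokens) if t in rare]
    return [b - a for a, b in zip(positions, positions[1:])]
```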

  9.

    ε = 0.00787 for Moby Dick, and ε = 0.00521 for the shuffled text.

  10.

    Tanaka-Ishii and Bunde (2016) indicated that κ is in fact a function of θ, so the stretched exponential function can ultimately be described by a single parameter. Moreover, by using P(q), they formulated a probability function for the occurrences of rare words, Q_ψ.

  11.

    ε = 0.0245, 0.0199, 0.0260 for Moby Dick, the shuffled text, and the monkey text, respectively.

References

  • Altmann, Eduardo G., Pierrehumbert, Janet B., and Motter, Adilson E. (2009). Beyond word frequency: Bursts, lulls, and scaling in the temporal distributions of words. PLoS One, 4(11):e7678.

  • Blender, Richard, Raible, Christoph C., and Lunkeit, Frank (2014). Non-exponential return time distributions for vorticity extremes explained by fractional Poisson processes. Quarterly Journal of the Royal Meteorological Society, 141:249–257.

  • Bogachev, Mikhail I., Eichner, Jan F., and Bunde, Armin (2007). Effect of nonlinear correlations on the statistics of return intervals in multifractal data sets. Physical Review Letters, 99(24):240601.

  • Bunde, Armin, Eichner, Jan F., Kantelhardt, Jan W., and Havlin, Shlomo (2005). Long-term memory: A natural mechanism for the clustering of extreme events and anomalous residual times in climate records. Physical Review Letters, 94(4):048701.

  • Corral, Álvaro (2004). Long-term clustering, scaling, and universality in the temporal occurrence of earthquakes. Physical Review Letters, 92(10):108501.

  • Corral, Álvaro (2005). Renormalization-group transformations and correlations of seismicity. Physical Review Letters, 95:028501.

  • Ebeling, Werner and Neiman, Alexander (1995). Long-range correlations between letters and sentences in texts. Physica A, 215:233–241.

  • Gerlach, Martin and Altmann, Eduardo G. (2013). Stochastic model for the vocabulary growth in natural languages. Physical Review X, 3(2):021006.

  • Mikolov, Tomáš, Karafiát, Martin, Burget, Lukáš, Černocký, Jan H., and Khudanpur, Sanjeev (2010). Recurrent neural network based language model. In Proceedings of the 11th Annual Conference of the International Speech Communication Association, pages 1045–1048.

  • Nabeshima, Terutaka and Gunji, Yukio-Pegio (2004). Zipf's law in phonograms and Weibull distribution in ideograms: Comparison of English with Japanese. Biosystems, 73(2):131–139.

  • Rubino, Carl (2021). Reduplication. https://wals.info/chapter/27, accessed in 2021.

  • Santhanam, M. S. and Kantz, Holger (2005). Long-range correlations and rare events in boundary layer wind fields. Physica A, 345:713–721.

  • Tanaka-Ishii, Kumiko and Bunde, Armin (2016). Long-range memory in literary texts: On the universal clustering of the rare words. PLoS One, 11(11):e0164658.

  • Turcotte, Donald L. (1997). Fractals and Chaos in Geology and Geophysics. Cambridge University Press.

  • Voss, Richard F. (1992). Evolution of long-range fractal correlations and 1/f noise in DNA base sequences. Physical Review Letters, 68(25):3805–3808.

  • Yamasaki, Kazuko, Muchnik, Lev, Havlin, Shlomo, Bunde, Armin, and Stanley, H. Eugene (2005). Scaling and memory in volatility return intervals in financial markets. Proceedings of the National Academy of Sciences, 102(26):9424–9428.

  • Zipf, George K. (1949). Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. Addison-Wesley Press.

© 2021 The Author(s)

Tanaka-Ishii, K. (2021). Returns. In: Statistical Universals of Language. Mathematics in Mind. Springer, Cham. https://doi.org/10.1007/978-3-030-59377-3_7