Returns

Tanaka-Ishii, Kumiko

doi:10.1007/978-3-030-59377-3_7

Part of the book series: Mathematics in Mind (MATHMIN)

Abstract

Part II investigated the population of words, but the book thus far has not considered the properties underlying a sequence of words. Language forms a sequence, which characterizes what language is. Indeed, Sect. 4.4 showed that, for n-grams, the subsequences of natural language texts present a different nature from those of random sequences, even from the population viewpoint. This provides evidence that language has a kind of memory, meaning that a word in one part of a text influences words in other parts of the text.

Notes

1. Many continuous natural phenomena commonly have a spectrum showing 1∕f noise. Here, 1∕f noise is defined as the frequency spectrum of a continuous time series following the function y = 1∕x. This book is about a discrete sequence, however, whereas 1∕f noise concerns a continuous time series. A clear relation between them would require defining what a “spectrum” signifies in a discrete sequence. There is an interesting work on this question for DNA (Voss, 1992).
2. Repetitions in language within a close range of sounds or words are studied in the field of linguistics and referred to by the term reduplication. For an introduction, see Rebino (2021). Texts with reduplication would appear different if they were statistically analyzed as mentioned here.
3. In this chapter, there are two cases for considering the relative frequency: with respect to a specific interval sequence Q_w for a word w, or with respect to all intervals. In the former case, P(q) = #q∕|Q_w|, where #q denotes the number of occurrences of intervals of length q. In the latter case, for every type of word, each occurrence after the first one can be associated with an interval; thus, the accumulated frequency equals m − v. Therefore, P(q) can be estimated as a relative frequency, i.e., P(q) = #q∕(m − v). Section 7.3 and Fig. 7.7 consider the case of all intervals, whereas the rest of the chapter considers the case of a specific interval sequence Q_w.
4. The cumulative distribution function Cum(q) of P(q) is defined as follows:
$$ \mathrm{Cum}(q) \equiv \int_0^{q} P(q')\, dq'. \tag{7.3} $$
5. Altmann et al. (2009) also showed that a renewal process produces this stretched exponential function. It is not obvious, however, how to integrate the population characteristic seen in Part II into the renewal process to form a sequence.
6. The points here were fitted by the least-squares method (cf. Sect. 21.1), with ε = 0.0226 for Moby Dick and ε = 0.00929 for the shuffled text.
7. Chapter 17 introduces that idea; in brief, it treats the rare words of a corpus as a single word, as suggested in Mikolov et al. (2010).
8. Tanaka-Ishii and Bunde (2016) investigated the effects of different values: ψ = 1, 2, 4, 8, 16, 32, 64. For a large ψ, the interval sequence becomes short, whereas for a small ψ, W_ψ starts to include functional words. Therefore, this book uses ψ = 16 as a moderate value throughout.
9. ε = 0.00787 for Moby Dick, and ε = 0.00521 for the shuffled text.
10. Tanaka-Ishii and Bunde (2016) indicated that κ is in fact a function of θ, and therefore, the stretched exponential function can ultimately be described by one parameter. Moreover, by using P(q), they formulated a probability function for the occurrences of rare words, Q_ψ.
11. ε = 0.0245, 0.0199, 0.0260 for Moby Dick, the shuffled text, and the monkey text, respectively.
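
The two relative-frequency estimates in note 3 and the cumulative distribution in note 4 can be illustrated with a short sketch. This is not the book's implementation: the function names are hypothetical, an interval is assumed to be the difference between the positions of successive occurrences of a word, m and v are assumed to denote the number of tokens and the number of word types, and the integral in Cum(q) is replaced by a running sum because the intervals here are discrete.

```python
from collections import Counter


def return_intervals(tokens, word):
    """Interval sequence Q_w: gaps between successive positions of `word`.

    Assumption: an interval is the difference between two positions.
    """
    positions = [i for i, t in enumerate(tokens) if t == word]
    return [b - a for a, b in zip(positions, positions[1:])]


def all_intervals(tokens):
    """Intervals pooled over every word type.

    Each occurrence after the first of its type yields one interval,
    so the result has m - v entries (m tokens, v types).
    """
    last_seen = {}
    pooled = []
    for i, t in enumerate(tokens):
        if t in last_seen:
            pooled.append(i - last_seen[t])
        last_seen[t] = i
    return pooled


def interval_distribution(intervals):
    """Relative frequency P(q) = #q / (total number of intervals)."""
    counts = Counter(intervals)
    total = len(intervals)
    return {q: c / total for q, c in sorted(counts.items())}


def cumulative(dist):
    """Discrete counterpart of Cum(q): running sum of P(q') for q' <= q."""
    running = 0.0
    out = {}
    for q in sorted(dist):
        running += dist[q]
        out[q] = running
    return out


# Toy text, used only to exercise the functions.
tokens = "a b a c a b b a".split()

q_a = return_intervals(tokens, "a")    # intervals of the word "a"
p_a = interval_distribution(q_a)       # P(q) = #q / |Q_w|
cum_a = cumulative(p_a)

pooled = all_intervals(tokens)         # all intervals, m - v of them
p_all = interval_distribution(pooled)  # P(q) = #q / (m - v)
```

For the toy text, "a" occurs at positions 0, 2, 4, and 7, so Q_w is 2, 2, 3, and the pooled sequence has 8 − 3 = 5 intervals. Real texts would need tokenization and a choice of word w (or of the rare-word set) before calling these functions.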

References

Altmann, Eduardo G., Pierrehumbert, Janet B., and Motter, Adilson E. (2009). Beyond word frequency: Bursts, lulls, and scaling in the temporal distributions of words. PLoS One, 4(11):e7678.
Blender, Richard, Raible, Christoph C., and Lunkeit, Frank (2014). Non-exponential return time distributions for vorticity extremes explained by fractional Poisson processes. Quarterly Journal of the Royal Meteorological Society, 141, 249–257.
Bogachev, Mikhail I., Eichner, Jan F., and Bunde, Armin (2007). Effect of nonlinear correlations on the statistics of return intervals in multifractal data sets. Physical Review Letters, 99(24):240601.
Bunde, Armin, Eichner, Jan F., Kantelhardt, Jan W., and Havlin, Shlomo (2005). Long-term memory: A natural mechanism for the clustering of extreme events and anomalous residual times in climate records. Physical Review Letters, 94(4):048701.
Corral, Álvaro (2004). Long-term clustering, scaling, and universality in the temporal occurrence of earthquakes. Physical Review Letters, 92(10):108501.
Corral, Álvaro (2005). Renormalization-group transformations and correlations of seismicity. Physical Review Letters, 95:028501.
Ebeling, Werner and Neiman, Alexander (1995). Long-range correlations between letters and sentences in texts. Physica A, 215, 233–241.
Gerlach, Martin and Altmann, Eduardo G. (2013). Stochastic model for the vocabulary growth in natural languages. Physical Review X, 3(2):21006.
Mikolov, Tomáš, Karafiát, Martin, Burget, Lukáš, Černocký, Jan H., and Khudanpur, Sanjeev (2010). Recurrent neural network based language model. In Proceedings of the 11th Annual Conference of the International Speech Communication Association, pages 1045–1048.
Nabeshima, Terutaka and Gunji, Yukio-Pegio (2004). Zipf’s law in phonograms and Weibull distribution in ideograms: comparison of English with Japanese. Biosystems, 73(2), 131–139.
Rebino, Carl (2021). Reduplication. https://wals.info/chapter/27, accessed in 2021.
Santhanam, M. S. and Kantz, Holger (2005). Long-range correlations and rare events in boundary layer wind fields. Physica A, 345, 713–721.
Tanaka-Ishii, Kumiko and Bunde, Armin (2016). Long-range memory in literary texts: On the universal clustering of the rare words. PLoS One, 11(11), e0164658.
Turcotte, Donald L. (1997). Fractals and Chaos in Geology and Geophysics. Cambridge University Press.
Voss, Richard F. (1992). Evolution of long-range fractal correlations and 1/f noise in DNA base sequences. Physical Review Letters, 68(25), 3805–3808.
Yamasaki, Kazuko, Muchnik, Lev, Havlin, Shlomo, Bunde, Armin, and Stanley, H. Eugene (2005). Scaling and memory in volatility return intervals in financial markets. Proceedings of the National Academy of Sciences, 102(26), 9424–9428.
Zipf, George K. (1949). Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. Addison-Wesley Press.

Author information

Authors and Affiliations

Research Center for Advanced Science and Technology (RCAST), The University of Tokyo, Tokyo, Japan
Kumiko Tanaka-Ishii

About this chapter

Cite this chapter

Tanaka-Ishii, K. (2021). Returns. In: Statistical Universals of Language. Mathematics in Mind. Springer, Cham. https://doi.org/10.1007/978-3-030-59377-3_7

DOI: https://doi.org/10.1007/978-3-030-59377-3_7
Published: 02 April 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-59376-6
Online ISBN: 978-3-030-59377-3
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)