Abstract
Part II investigated the population of words, but the book thus far has not considered the properties underlying a sequence of words. Language forms a sequence, which characterizes what language is. Indeed, Sect. 4.4 showed that, for n-grams, the subsequences of natural language texts present a different nature from those of random sequences, even from the population viewpoint. This provides evidence that language has a kind of memory, meaning that a word in one part of a text influences words in other parts of the text.
Notes
- 1.
Many continuous natural phenomena commonly have a spectrum showing 1∕f noise. Here, 1∕f noise is defined as the frequency spectrum of a continuous time series following the function y = 1∕x. This book is about a discrete sequence, however, whereas 1∕f noise concerns a continuous time series. A clear relation between them would require defining what a “spectrum” signifies in a discrete sequence. There is an interesting work on this question for DNA (Voss, 1992).
- 2.
Repetitions in language within a close range of sounds or words are studied in the field of linguistics and referred to by the term reduplication. For an introduction, see Rebino (2021). Texts with reduplication would appear different if they were statistically analyzed as mentioned here.
- 3.
In this chapter, there are two cases for considering the relative frequency: with respect to a specific interval sequence Q_w for a word w, or with respect to all intervals. In the former case, P(q) = #q∕|Q_w|, where #q denotes the number of occurrences of intervals of length q. In the latter case, for every type of word, each occurrence after the first one can be associated with an interval; thus, the accumulated frequency equals m − v. Therefore, P(q) can be estimated as a relative frequency, i.e., P(q) = #q∕(m − v). Section 7.3 and Fig. 7.7 consider the case of all intervals, whereas the rest of the chapter considers the case of a specific interval sequence Q_w.
- 4.
The cumulative distribution function Cum(q) of P(q) is defined as follows:
$$\displaystyle \begin{aligned} Cum(q) \equiv \int_0^{q} P(q)\, dq. \end{aligned} $$ (7.3)
- 5.
Altmann et al. (2009) also showed that a renewal process produces this stretched exponential function. It is not obvious, however, how to integrate the population characteristic seen in Part II into the renewal process to form a sequence.
- 6.
The points here were fitted by the least-squares method (cf. Sect. 21.1), with ε = 0.0226 for Moby Dick and ε = 0.00929 for the shuffled text.
- 7.
- 8.
Tanaka-Ishii and Bunde (2016) investigated the effects of different values: ψ = 1, 2, 4, 8, 16, 32, 64. For a large ψ, the interval sequence becomes short, whereas for a small ψ, W_ψ starts to include functional words. Therefore, this book uses ψ = 16 as a moderate value throughout.
- 9.
ε = 0.00787 for Moby Dick, and ε = 0.00521 for the shuffled text.
- 10.
Tanaka-Ishii and Bunde (2016) indicated that κ is in fact a function of θ, and therefore, the stretched exponential function can ultimately be described by one parameter. Moreover, by using P(q), they formulated a probability function for the occurrences of rare words, Q_ψ.
- 11.
ε = 0.0245, 0.0199, 0.0260 for Moby Dick, the shuffled text, and the monkey text, respectively.
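The relative frequencies described in Note 3 and the cumulative distribution in Note 4 can be illustrated with a minimal Python sketch. Several things here are assumptions made for illustration and are not taken from the chapter: the function names, the toy word sequence, the choice to measure an interval as the difference between token positions of successive occurrences, and the use of a running sum as a discrete stand-in for the integral in Eq. (7.3).

```python
from collections import Counter, defaultdict


def interval_sequences(words):
    """Map each word w to its interval sequence Q_w.

    Assumption: an interval is the difference between the token
    positions of two successive occurrences of the same word.
    """
    last_seen = {}
    intervals = defaultdict(list)
    for position, word in enumerate(words):
        if word in last_seen:
            intervals[word].append(position - last_seen[word])
        last_seen[word] = position
    return intervals


def relative_frequency(interval_list):
    """Estimate P(q) = #q / (total number of intervals)."""
    counts = Counter(interval_list)
    total = len(interval_list)
    return {q: counts[q] / total for q in sorted(counts)}


def cumulative(p):
    """Discrete analogue of Cum(q): running sum of P over lengths <= q."""
    running = 0.0
    cum = {}
    for q in sorted(p):
        running += p[q]
        cum[q] = running
    return cum


# Hypothetical toy sequence with m = 9 tokens and v = 3 word types.
words = "a b a c a b b c a".split()
m, v = len(words), len(set(words))

seqs = interval_sequences(words)

# Case 1: a specific interval sequence Q_w (here w = "a").
p_a = relative_frequency(seqs["a"])

# Case 2: all intervals pooled; their total count equals m - v.
all_intervals = [q for qs in seqs.values() for q in qs]
p_all = relative_frequency(all_intervals)
cum_all = cumulative(p_all)
```

In the pooled case, every occurrence of a word after its first contributes exactly one interval, which is why `len(all_intervals)` equals `m - v` in this sketch.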
References
Altmann, Eduardo G., Pierrehumbert, Janet B., and Motter, Adilson E. (2009). Beyond word frequency: Bursts, lulls, and scaling in the temporal distributions of words. PLoS One, 4(11):e7678.
Blender, Richard, Raible, Christoph C., and Lunkeit, Frank (2014). Non-exponential return time distributions for vorticity extremes explained by fractional Poisson processes. Quarterly Journal of the Royal Meteorological Society, 141, 249–257.
Bogachev, Mikhail I., Eichner, Jan F., and Bunde, Armin (2007). Effect of nonlinear correlations on the statistics of return intervals in multifractal data sets. Physical Review Letters, 99(24):240601.
Bunde, Armin, Eichner, Jan F., Kantelhardt, Jan W., and Havlin, Shlomo (2005). Long-term memory: A natural mechanism for the clustering of extreme events and anomalous residual times in climate records. Physical Review Letters, 94(4):048701.
Corral, Álvaro (2004). Long-term clustering, scaling, and universality in the temporal occurrence of earthquakes. Physical Review Letters, 92(10):108501.
Corral, Álvaro (2005). Renormalization-group transformations and correlations of seismicity. Physical Review Letters, 95:028501.
Ebeling, Werner and Neiman, Alexander (1995). Long-range correlations between letters and sentences in texts. Physica A, 215, 233–241.
Gerlach, Martin and Altmann, Eduardo G. (2013). Stochastic model for the vocabulary growth in natural languages. Physical Review X, 3(2):21006.
Mikolov, Tomáš, Karafiát, Martin, Burget, Lukáš, Černocký, Jan H., and Khudanpur, Sanjeev (2010). Recurrent neural network based language model. In Proceedings of the 11th Annual Conference of the International Speech Communication Association, pages 1045–1048.
Nabeshima, Terutaka and Gunji, Yukio-Pegio (2004). Zipf’s law in phonograms and Weibull distribution in ideograms: comparison of English with Japanese. Biosystems, 73(2), 131–139.
Rebino, Carl (2021). Reduplication. https://wals.info/chapter/27, accessed in 2021.
Santhanam, M. S. and Kantz, Holger (2005). Long-range correlations and rare events in boundary layer wind fields. Physica A, 345, 713–721.
Tanaka-Ishii, Kumiko and Bunde, Armin (2016). Long-range memory in literary texts: On the universal clustering of the rare words. PLoS One, 11(11), e0164658.
Turcotte, Donald L. (1997). Fractals and Chaos in Geology and Geophysics. Cambridge University Press.
Voss, Richard F. (1992). Evolution of long-range fractal correlations and 1/f noise in DNA base sequences. Physical Review Letters, 68(25), 3805–3808.
Yamasaki, Kazuko, Muchnik, Lev, Havlin, Shlomo, Bunde, Armin, and Stanley, H. Eugene (2005). Scaling and memory in volatility return intervals in financial markets. Proceedings of the National Academy of Sciences, 102(26), 9424–9428.
Zipf, George K. (1949). Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. Addison-Wesley Press.
Copyright information
© 2021 The Author(s)
About this chapter
Cite this chapter
Tanaka-Ishii, K. (2021). Returns. In: Statistical Universals of Language. Mathematics in Mind. Springer, Cham. https://doi.org/10.1007/978-3-030-59377-3_7
DOI: https://doi.org/10.1007/978-3-030-59377-3_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-59376-6
Online ISBN: 978-3-030-59377-3
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)