Abstract
We now have a rough overview of the most important statistical universals underlying language. Taken as a whole, is there any way to examine how complex language is, and what characteristic underlies this complexity?
Notes
- 1.
Note that the conditional entropy is given in general as
$$\displaystyle \begin{aligned} \mathrm{H}(X|Y) = -\sum_{x,y} P(X=x,Y=y)\log P(X=x|Y=y). \end{aligned} $$ (10.3)
(A small numerical sketch of this quantity appears after these notes.)
- 2.
In this book, ⇒ indicates convergence.
- 3.
See Sect. 17.2 for the concepts of language models and their training.
- 4.
The PPM code uses an n-gram-based language modeling method (Bell et al., 1990) that applies variable-length n-grams and arithmetic coding. The PPM code is guaranteed to be universal when the length of the n-gram is considered up to infinity (Ryabko, 2010). Among state-of-the-art compressors, 7-zip PPMd was used for the PPM code. PPM was used because it adheres to the theory more closely than many other compression methods, such as zip, lzh, and tar.xz (Takahira et al., 2016). (A rough compression-based sketch of this kind of estimation appears after these notes.)
- 5.
I hereby thank Ryosuke Takahira and Shuntaro Takahashi for generating this figure for the purpose of this book, by reusing code used to conduct the study reported in Takahira et al. (2016).
- 6.
- 7.
- 8.
Whether this is true would require more fundamental research. Above all, β might not characterize natural language at all, given that a shuffled text’s β was quite close to that of natural language here. One possible way to verify the universality of the β value would be to conduct a statistical test, as was done for the Taylor analysis, with many data sets such as those introduced in the previous section. The problem in doing so is that the texts must be very large to estimate a credible β for acquiring the entropy rate. At the same time, larger texts suffer from the limitation of self-similarity, as discussed in Chap. 5. Therefore, clarifying whether β is universal would require a completely different approach.
- 9.
Section 21.8 explains perplexity in relation to the entropy rate and the cross entropy.
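To make Note 1 concrete, here is a minimal numerical sketch of the conditional entropy in Eq. (10.3). The small joint distribution P(X, Y) and the choice of log base 2 (bits) are illustrative assumptions, not part of the original text.

```python
import numpy as np

# Hypothetical joint distribution P(X, Y) over two binary variables,
# rows indexed by x, columns by y.
P_xy = np.array([[0.3, 0.2],
                 [0.1, 0.4]])

P_y = P_xy.sum(axis=0)       # marginal P(Y = y)
P_x_given_y = P_xy / P_y     # conditional P(X = x | Y = y), columnwise

# H(X|Y) = - sum_{x,y} P(x, y) log P(x | y)   (Eq. 10.3), here in bits
H_X_given_Y = -np.sum(P_xy * np.log2(P_x_given_y))
print(f"H(X|Y) = {H_X_given_Y:.4f} bits")
```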
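Note 4 describes estimating the entropy rate by compressing text with the PPM code. Because PPMd (as used in 7-zip) has no standard Python binding, the sketch below substitutes the standard-library lzma compressor to show the general procedure only: the compressed size in bits divided by the number of characters gives a rough upper estimate of the entropy rate at a given text length. The file name corpus.txt and the prefix lengths are hypothetical, and this is not the setup of Takahira et al. (2016), which used PPMd and extrapolated over much larger corpora.

```python
import lzma

def compressed_bits_per_char(text: str) -> float:
    """Compressed size in bits divided by the number of characters:
    a rough upper estimate of the entropy rate at this text length."""
    data = text.encode("utf-8")
    compressed = lzma.compress(data, preset=9)
    return 8 * len(compressed) / len(text)

# Hypothetical usage: measure the estimate at increasing prefix lengths.
with open("corpus.txt", encoding="utf-8") as f:  # hypothetical corpus file
    text = f.read()

for n in (10_000, 100_000, 1_000_000):
    if n <= len(text):
        print(f"{n:>9} chars: {compressed_bits_per_char(text[:n]):.3f} bits/char")
```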
References
Bell, Timothy C., Cleary, John G., and Witten, Ian H. (1990). Text Compression. Prentice Hall.
Berger, Toby (1968). Rate distortion theory for sources with abstract alphabets and memory. Information and Control, 13, 254–273.
Brown, Peter F., Della-Pietra, Stephen A., Della-Pietra, Vincent J., Lai, Jennifer C., and Mercer, Robert L. (1992). An estimate of an upper bound for the entropy of English. Computational Linguistics, 18(1), 31–40.
Cover, Thomas M. and King, Roger C. (1978). A convergent gambling estimate of the entropy of English. IEEE Transactions on Information Theory, 24(4), 413–421.
Cover, Thomas M. and Thomas, Joy A. (1991). Elements of Information Theory. John Wiley & Sons, Inc.
Crutchfield, James P. and Feldman, David P. (2003). Regularities unseen, randomness observed: Levels of entropy convergence. Chaos, 13, 25–54.
Dai, Zihang, Yang, Zhilin, Yang, Yiming, Carbonell, Jaime, Le, Quoc V., and Salakhutdinov, Ruslan (2019). Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2978–2988.
Dȩbowski, Łukasz (2015). The relaxed Hilberg conjecture: A review and new experimental support. Journal of Quantitative Linguistics, 22(4), 311–337.
Dȩbowski, Łukasz (2020). Information Theory Meets Power Laws: Stochastic Processes and Language Models. Wiley.
Ebeling, Werner and Nicolis, G. (1991). Entropy of symbolic sequences: The role of correlations. Europhysics Letters, 14(3), 191–196.
Ferrer-i-Cancho, Ramon, Dȩbowski, Łukasz, and Moscoso del Prado Martin, Fermin (2013). Constant conditional entropy and related hypotheses. Journal of Statistical Mechanics: Theory and Experiment, 2013, L07001.
Genzel, Dmitriy and Charniak, Eugene (2002). Entropy rate constancy in text. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 199–206.
Hilberg, Wolfgang (1990). Der bekannte Grenzwert der redundanzfreien Information in Texten – eine Fehlinterpretation der Shannonschen Experimente? Frequenz, 44, 243–248.
Hockett, Charles F. (1958). A Course in Modern Linguistics. The Macmillan Company.
Levy, Roger and Jaeger, T. Florian (2006). Speakers optimize information density through syntactic reduction. In Advances in Neural Information Processing Systems 19, pages 849–856. MIT Press.
Manning, Christopher D. and Schütze, Hinrich (1999). Foundations of Statistical Natural Language Processing. The MIT Press.
Moradi, Hamid, Grzymala-Busse, Jerzy W., and Roberts, James A. (1998). Entropy of English text: Experiments with humans and a machine learning system based on rough sets. Information Sciences, 104, 31–47.
Ren, Geng, Takahashi, Shuntaro, and Tanaka-Ishii, Kumiko (2019). Entropy rate estimation for English via a large cognitive experiment using Mechanical Turk. Entropy, 21(12):1201.
Ryabko, Boris (2010). Applications of universal source coding to statistical analysis of time series. In I. Woungang, S. Misra, and S. C. Misra, editors, Selected Topics in Information and Coding Theory, Series on Coding and Cryptology, pages 289–338. World Scientific Publishing.
Schürmann, Thomas and Grassberger, Peter (1996). Entropy estimation of symbol sequences. Chaos, 6(3), 414–427.
Shannon, Claude E. (1948). A mathematical theory of communication. The Bell System Technical Journal, 27, 379–423, 623–656.
Shannon, Claude E. (1951). Prediction and entropy of printed English. The Bell System Technical Journal, 30, 50–64.
Shannon, Claude E. (1959). Coding theorems for a discrete source with a fidelity criterion. IRE National Convention Record, 4, 142–163.
Takahira, Ryosuke, Tanaka-Ishii, Kumiko, and Dȩbowski, Łukasz (2016). Entropy rate estimates for natural language—a new extrapolation of compressed large-scale corpora. Entropy, 18(10):364.