Complexity

Chapter in Statistical Universals of Language

Part of the book series: Mathematics in Mind (MATHMIN)

Abstract

We now have a rough overview of the most important statistical universals underlying language. Taking language as a whole, is there any way to examine how complex it is? What characteristic underlies this complexity?


Notes

  1.

    Note that the conditional entropy is given in general as

    $$\displaystyle \begin{aligned} \mathrm{H}(X|Y) = -\sum_{x,y} P(X=x, Y=y)\log P(X=x \mid Y=y). \end{aligned} $$
    (10.3)
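
    As an illustrative sketch only (not code from the book), the following Python snippet estimates H(X|Y) from sampled pairs using plug-in probabilities; the "abracadabra" example is a toy, not data used in this chapter.

```python
import math
from collections import Counter

def conditional_entropy(pairs):
    """Estimate H(X|Y) in bits from a list of (x, y) samples,
    using plug-in estimates of P(X=x, Y=y) and P(X=x | Y=y)."""
    joint = Counter(pairs)                      # counts of (x, y)
    marginal_y = Counter(y for _, y in pairs)   # counts of y
    n = len(pairs)
    h = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n                      # P(X=x, Y=y)
        p_x_given_y = count / marginal_y[y]   # P(X=x | Y=y)
        h -= p_xy * math.log2(p_x_given_y)
    return h

# Toy usage: X is a character, Y is the character preceding it.
text = "abracadabra"
pairs = list(zip(text[1:], text[:-1]))
print(conditional_entropy(pairs))  # H(next character | previous character)
```
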
  2.

    In this book, ⇒ indicates convergence.

  3.

    See Sect. 17.2 for the concepts of language models and their training.

  4.

    PPM is an n-gram-based language modeling method (Bell et al., 1990) that combines variable-length n-grams with arithmetic coding. The PPM code is guaranteed to be universal when the n-gram length is taken to infinity (Ryabko, 2010). Among state-of-the-art compressors, 7-zip PPMd was used as the PPM implementation here, because PPM follows the theory better than many other compression methods, such as zip, lzh, and tar.xz (Takahira et al., 2016).
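
    As a minimal sketch of the general compression-based estimate (an upper bound on the entropy rate: compressed size in bits divided by text length), the Python snippet below uses the standard lzma module as a stand-in compressor, not the 7-zip PPMd program used in the study; the file name corpus.txt is hypothetical.

```python
import lzma

def bits_per_character(text: str) -> float:
    """Upper-bound estimate of the entropy rate in bits per character:
    size of the compressed text (in bits) divided by its length."""
    data = text.encode("utf-8")
    compressed = lzma.compress(data, preset=9)
    return 8 * len(compressed) / len(text)

# Estimates for growing prefixes of a corpus; for natural language these
# typically decrease slowly with text size, which is what motivates the
# extrapolation discussed in this chapter.
with open("corpus.txt", encoding="utf-8") as f:  # hypothetical corpus file
    text = f.read()
for n in (10**4, 10**5, 10**6):
    if n <= len(text):
        print(n, bits_per_character(text[:n]))
```
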

  5.

    I hereby thank Ryosuke Takahira and Shuntaro Takahashi for generating this figure for the purpose of this book, by reusing code used to conduct the study reported in Takahira et al. (2016).

  6.

    For fitting Fig. 6.3, the least-squares method was applied (cf. Sect. 21.1). ε = 0.0175 for the New York Times, ε = 0.00606 for the shuffled text, and ε = 0.00295 for the monkey text.

  7.

    This figure was adapted from Takahira et al. (2016) by applying the extrapolation function of formula (10.5) to the data listed in the first block of Table 1 in that article, which consisted only of results obtained from clean newspaper data.
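
    Formula (10.5) is not reproduced in these notes. As an illustration of the least-squares fitting mentioned in notes 6 and 7, the sketch below assumes an ansatz of the form f(n) = A n^(β-1) + h, where h plays the role of the asymptotic entropy rate; both this functional form (a stand-in for formula (10.5)) and the data points are assumptions made for the example, not values from the book.

```python
import numpy as np
from scipy.optimize import curve_fit

def ansatz(n, A, beta, h):
    """Assumed extrapolation function f(n) = A * n**(beta - 1) + h."""
    return A * n ** (beta - 1) + h

# Hypothetical measurements: text sizes (in characters) and the
# corresponding compression rates in bits per character.
sizes = np.array([1e4, 1e5, 1e6, 1e7, 1e8])
rates = np.array([3.1, 2.6, 2.2, 1.9, 1.7])

# Least-squares fit of the three parameters A, beta, h.
params, _ = curve_fit(ansatz, sizes, rates, p0=[10.0, 0.9, 1.0])
A, beta, h = params
print(f"A = {A:.3f}, beta = {beta:.3f}, extrapolated entropy rate h = {h:.3f}")
```
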

  8.

    Whether this is true would require more fundamental research. Above all, it might not be the case that β characterizes natural language, given that the shuffled text's β here was quite close to that of natural language. One possible path to verifying the β value's universality would be to conduct a statistical test, as was done for the Taylor analysis, with many datasets such as those introduced in the previous section. The problem in doing so is that the texts must be very large to estimate a credible β and thus acquire the entropy rate. At the same time, larger texts run into the limitation of self-similarity discussed in Chap. 5. Therefore, clarifying whether β is universal would require a completely different approach.

  9.

    Section 21.8 explains perplexity in relation to the entropy rate and cross entropy.
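
    For reference, the standard relation is as follows (the notation here is generic and not necessarily that of Sect. 21.8): for a language model q evaluated on a text x_1, ..., x_N,

    $$\displaystyle \begin{aligned} H = -\frac{1}{N}\sum_{i=1}^{N} \log_2 q(x_i \mid x_1, \ldots, x_{i-1}), \qquad \mathrm{perplexity} = 2^{H}, \end{aligned} $$

    where H is the per-symbol cross entropy in bits. Since the cross entropy cannot fall below the entropy rate for a stationary ergodic source, the perplexity of a good model gives an upper-bound estimate of two to the power of the entropy rate.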

References

  • Bell, Timothy C., Cleary, John G., and Witten, Ian H. (1990). Text Compression. Prentice Hall.

  • Berger, Toby (1968). Rate distortion theory for sources with abstract alphabets and memory. Information and Control, 13, 254–273.

  • Brown, Peter F., Della-Pietra, Stephan A., Della-Pietra, Vincent J., Lai, Jennifer C., and Mercer, Robert L. (1992). An estimate of an upper bound for the entropy of English. Computational Linguistics, 18(1), 31–40.

  • Cover, Thomas M. and King, Roger C. (1978). A convergent gambling estimate of the entropy of English. IEEE Transactions on Information Theory, 24(4), 413–421.

  • Cover, Thomas M. and Thomas, Joy A. (1991). Elements of Information Theory. John Wiley & Sons, Inc.

  • Crutchfield, J. P. and Feldman, D. P. (2003). Regularities unseen, randomness observed: Levels of entropy convergence. Chaos, 13, 25–54.

  • Dai, Zihang, Yang, Zhilin, Yang, Yiming, Carbonell, Jaime, Le, Quoc V., and Salakhutdinov, Ruslan (2019). Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2978–2988.

  • Dȩbowski, Łukasz (2015). The relaxed Hilberg conjecture: A review and new experimental support. Journal of Quantitative Linguistics, 22(4), 311–337.

  • Dȩbowski, Łukasz (2020). Information Theory Meets Power Laws: Stochastic Processes and Language Models. Wiley.

  • Ebeling, Werner and Nicolis, G. (1991). Entropy of symbolic sequences: The role of correlations. Europhysics Letters, 14(3), 191–196.

  • Ferrer-i-Cancho, Ramon, Dȩbowski, Łukasz, and Moscoso del Prado Martin, Fermin (2013). Constant conditional entropy and related hypotheses. Journal of Statistical Mechanics: Theory and Experiment, 2013, L07001.

  • Genzel, Dmitriy and Charniak, Eugene (2002). Entropy rate constancy in text. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 199–206.

  • Hilberg, Wolfgang (1990). Der bekannte Grenzwert der redundanzfreien Information in Texten — eine Fehlinterpretation der Shannonschen Experimente? [The well-known limit of redundancy-free information in texts: a misinterpretation of Shannon's experiments?] Frequenz, 44, 243–248.

  • Hockett, Charles F. (1958). A Course in Modern Linguistics. Macmillan Company.

  • Levy, Roger and Jaeger, T. Florian (2006). Speakers optimize information density through syntactic reduction. In Advances in Neural Information Processing Systems 19, pages 849–856. MIT Press.

  • Manning, Christopher D. and Schütze, Hinrich (1999). Foundations of Statistical Natural Language Processing. The MIT Press.

  • Moradi, Hamid, Grzymala-Busse, Jerzy W., and Roberts, James A. (1998). Entropy of English text: Experiments with humans and a machine learning system based on rough sets. Information Sciences, 104, 31–47.

  • Ren, Geng, Takahashi, Shuntaro, and Tanaka-Ishii, Kumiko (2019). Entropy rate estimation for English via a large cognitive experiment using Mechanical Turk. Entropy, 21(12):1201.

  • Ryabko, Boris (2010). Applications of universal source coding to statistical analysis of time series. In I. Woungang, S. Misra, and S. C. Misra, editors, Selected Topics in Information and Coding Theory, Series on Coding and Cryptology, pages 289–338. World Scientific Publishing.

  • Schürmann, Thomas and Grassberger, Peter (1996). Entropy estimation of symbol sequences. Chaos, 6(3), 414–427.

  • Shannon, Claude E. (1948). A mathematical theory of communication. The Bell System Technical Journal, 27, 379–423, 623–656.

  • Shannon, Claude E. (1951). Prediction and entropy of printed English. The Bell System Technical Journal, 30, 50–64.

  • Shannon, Claude E. (1959). Coding theorems for a discrete source with a fidelity criterion. IRE National Convention Record, 4, 142–163.

  • Takahira, Ryosuke, Tanaka-Ishii, Kumiko, and Dȩbowski, Łukasz (2016). Entropy rate estimates for natural language—a new extrapolation of compressed large-scale corpora. Entropy, 18(10):364.

Copyright information

© 2021 The Author(s)

Cite this chapter

Tanaka-Ishii, K. (2021). Complexity. In: Statistical Universals of Language. Mathematics in Mind. Springer, Cham. https://doi.org/10.1007/978-3-030-59377-3_10
