Negative correlation of word rank sequence in written texts

  • Regular Article – Statistical and Nonlinear Physics
The European Physical Journal B

This article has been updated

Abstract

The structure of written texts is analyzed by focusing on word sequences. Each text is transformed into a sequence of ranks, where the rank of a word is determined by its occurrence frequency in the text, and return maps of this sequence are drawn. Features of the word sequence are extracted by comparison with surrogate data, i.e., a sequence in which all the words are randomly rearranged. A total of 140 written texts in ten languages are analyzed. To characterize the distribution in the return map quantitatively, two quantities are defined: the distance between the original and surrogate distributions, and the correlation coefficient of adjacent word ranks. The results show that the ranks of adjacent words are negatively correlated in almost all of the languages, and that return maps of texts in the same language have similar features. A clustering structure is observed that suggests a relation to language (sub)family. A mathematical model is proposed to reproduce the features of the return map for multiple languages, and numerical simulations of the model give results quantitatively similar to those of the real data.
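
The rank transformation, return map, surrogate sequence and adjacent-rank correlation described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code. It assumes whitespace tokenization, raw (not logarithmic) ranks, ties between equally frequent words broken by first appearance, and the Pearson coefficient as the correlation measure. All function names are hypothetical.

```python
import random
from collections import Counter


def rank_sequence(words):
    """Replace each word by the rank of its occurrence frequency (1 = most frequent)."""
    counts = Counter(words)
    # most_common() sorts by descending count; equal counts keep first-seen order
    # (an assumption of this sketch, the abstract does not specify tie handling).
    rank_of = {w: r for r, (w, _) in enumerate(counts.most_common(), start=1)}
    return [rank_of[w] for w in words]


def return_map(ranks):
    """Points (r_t, r_{t+1}) formed by adjacent ranks."""
    return list(zip(ranks[:-1], ranks[1:]))


def adjacent_correlation(ranks):
    """Pearson correlation coefficient between r_t and r_{t+1}.

    Assumes the sequence is not constant (non-zero variance).
    """
    x, y = ranks[:-1], ranks[1:]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def surrogate(ranks, seed=0):
    """Randomly rearranged copy of the sequence.

    Shuffling ranks is equivalent to shuffling words, because word
    frequencies (and hence ranks) do not change under rearrangement.
    """
    shuffled = list(ranks)
    random.Random(seed).shuffle(shuffled)
    return shuffled


if __name__ == "__main__":
    # Toy input for illustration only; far too short for meaningful statistics.
    words = "the cat saw the dog and the dog saw the cat and the bird".split()
    ranks = rank_sequence(words)
    print("ranks:", ranks)
    print("return map:", return_map(ranks)[:3])
    print("original r:", adjacent_correlation(ranks))
    print("surrogate r:", adjacent_correlation(surrogate(ranks)))
```

On a real text, the coefficient of the original sequence would be compared with the values obtained from many surrogate sequences generated with different seeds.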

Data Availability Statement

The manuscript has associated data in a data repository. [Authors’ comment: This manuscript has data included as electronic supplementary material. The online version of this article contains supplementary material, which is available to authorized users.]

Change history

  • 17 October 2021

    The update to the electronic supplementary material (ESM) had been missed.

Acknowledgements

The authors would like to thank Prof. Suzuki for useful advice on linguistic typology. T. M. would like to thank Prof. Horita for useful discussions.

Author information

Contributions

The negative correlation of rank in Moby Dick was found by S. Y. The analysis of real texts under a common condition was performed by T. Y. The model was proposed and analyzed by T. Y. and T. M. The paper was written by T. Y. and T. M.

Corresponding author

Correspondence to Tsuyoshi Mizuguchi.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 698 KB)

Cite this article

Yamamoto, T., Yamada, S. & Mizuguchi, T. Negative correlation of word rank sequence in written texts. Eur. Phys. J. B 94, 200 (2021). https://doi.org/10.1140/epjb/s10051-021-00210-y

  • DOI: https://doi.org/10.1140/epjb/s10051-021-00210-y
