Rank-frequency relation for Chinese characters

  • Weibing Deng
  • Armen E. Allahverdyan
  • Bo Li
  • Qiuping A. Wang
Regular Article

Abstract

We show that Zipf’s law for Chinese characters holds perfectly for sufficiently short texts (a few thousand different characters). The scenario of its validity is similar to that of Zipf’s law for words in short English texts. For long Chinese texts (or for mixtures of short Chinese texts), rank-frequency relations for Chinese characters display a two-layer, hierarchical structure that combines a Zipfian power-law regime for frequent characters (first layer) with an exponential-like regime for less frequent characters (second layer). For these two layers we provide different (though related) theoretical descriptions that include the range of low-frequency characters (hapax legomena). We suggest that this hierarchical structure of the rank-frequency relation is connected to semantic features of Chinese characters (number of different meanings and homographies). The comparative analysis of rank-frequency relations for Chinese characters versus English words illustrates the extent to which characters play the same role for Chinese writers as words do for those writing in alphabetical systems.

Keywords

Statistical and Nonlinear Physics 

Copyright information

© EDP Sciences, SIF, Springer-Verlag Berlin Heidelberg 2014

Authors and Affiliations

  1. Laboratoire de Physique Statistique et Systèmes Complexes, ISMANS, LUNAM Université, Le Mans, France
  2. Complexity Science Center and Institute of Particle Physics, Hua-Zhong Normal University, Wuhan, P.R. China
  3. IMMM, UMR CNRS 6283, Université du Maine, Le Mans, France
  4. Yerevan Physics Institute, Yerevan, Armenia
  5. Department of Chinese Literature, University of Heilongjiang, Harbin, P.R. China