Comparing Entropies within the Chinese Language

  • Benjamin K. Tsou
  • Tom B. Y. Lai
  • Ka-po Chow
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3248)

Abstract

Using a large synchronous Chinese corpus, we show how word and character entropies vary over time and across different Chinese speech communities. We find that word entropy values are affected by the quality of the segmentation process. We also note that word entropies can be affected by proper nouns, which form the most volatile segment of the otherwise stable lexicon of the language. Our word and character entropy results provide an interesting comparison with earlier results. The average joint character entropies (a.k.a. entropy rates) of Chinese that we compute up to order 20 indicate that the limits of the conditional character entropies of Chinese for the different speech communities should be about 1 (or less). This raises the question of whether early convergence of character entropies would also entail convergence of word entropies.
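
The quantities named in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure: it assumes simple plug-in (maximum-likelihood) estimates over overlapping character n-grams, and the function names `block_entropy`, `average_joint_entropy`, and `conditional_entropy` are hypothetical. Here the average joint entropy of order n is taken to be the n-gram block entropy divided by n, and the conditional entropy of order n is the difference between the block entropies of orders n and n-1.

```python
import math
from collections import Counter


def block_entropy(text, n):
    """Plug-in Shannon entropy (in bits) of the overlapping character n-grams in text."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    counts = Counter(grams)
    total = len(grams)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def average_joint_entropy(text, n):
    """Block entropy per character: H(X_1..X_n) / n."""
    return block_entropy(text, n) / n


def conditional_entropy(text, n):
    """H(X_n | X_1..X_{n-1}) estimated as H_n - H_{n-1}, taking H_0 = 0."""
    if n == 1:
        return block_entropy(text, 1)
    return block_entropy(text, n) - block_entropy(text, n - 1)


if __name__ == "__main__":
    sample = "abababababab"  # toy input; a real study would use a large corpus
    for order in (1, 2, 3):
        print(order, average_joint_entropy(sample, order), conditional_entropy(sample, order))
```

Plug-in estimates of this kind tend to underestimate entropy at higher orders when the corpus is small relative to the number of possible n-grams, so values at orders near 20 would only be meaningful on a very large corpus or with a smoothed estimator.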

Keywords

Language Model; Word Type; Conditional Entropy; Hong Kong; Word Segmentation
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Benjamin K. Tsou (1)
  • Tom B. Y. Lai (1)
  • Ka-po Chow (1)
  1. Language Information Sciences Research Centre, City University of Hong Kong, Hong Kong
