Word Frequency Approximation for Chinese Without Using Manually-Annotated Corpus

  • Maosong Sun
  • Zhengcao Zhang
  • Benjamin Ka-Yin T’sou
  • Huaming Lu
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3878)

Abstract

Word frequencies play an important role in a variety of NLP applications. Estimating word frequencies for Chinese is challenging because of characteristics of the language, in particular word formation and word segmentation. This paper addresses word frequency estimation under the condition that only a Chinese wordlist and a raw Chinese corpus of arbitrarily large size are available, and no manual annotation is performed on the corpus. Several realistic schemes for approximating word frequencies under the frameworks of STR (the frequency of a string of characters as an approximation of word frequency) and MM (maximal matching) are presented. Large-scale experiments indicate that the proposed scheme, MinMaxMM, can significantly benefit the estimation of word frequencies, though its performance is still not very satisfactory in some cases.
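The abstract names the STR and MM frameworks without giving details, so the following is only a minimal, generic sketch of the two underlying ideas and not the paper's MinMaxMM scheme. It assumes forward maximal matching as the MM variant; the function names, the toy wordlist, and the toy corpus are all hypothetical choices made for illustration.

```python
from collections import Counter


def str_counts(corpus, wordlist):
    """STR idea: count each wordlist entry as a raw character string,
    including occurrences that overlap other entries."""
    counts = Counter()
    for word in wordlist:
        start = 0
        while True:
            idx = corpus.find(word, start)
            if idx == -1:
                break
            counts[word] += 1
            start = idx + 1  # step by one character to allow overlaps
    return counts


def fmm_segment(text, wordlist):
    """Forward maximal matching (assumed MM variant): at each position take
    the longest wordlist entry, falling back to a single character."""
    words = set(wordlist)
    max_len = max((len(w) for w in words), default=1)
    tokens = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in words:
                tokens.append(candidate)
                i += length
                break
    return tokens


def mm_counts(corpus, wordlist):
    """MM idea: count only the wordlist entries that survive segmentation."""
    words = set(wordlist)
    return Counter(t for t in fmm_segment(corpus, wordlist) if t in words)


# Hypothetical toy data; a plausible reading is 研究 / 生命 / 起源.
wordlist = ["研究", "研究生", "生命", "命", "起源"]
corpus = "研究生命起源"

print(fmm_segment(corpus, wordlist))         # ['研究生', '命', '起源']
print(str_counts(corpus, wordlist)["研究生"])  # 1
print(mm_counts(corpus, wordlist)["研究"])    # 0
```

On this toy input the string count credits every entry once, including 研究生, while the maximal-matching count misses 研究 and 生命 entirely. Both kinds of error are what a frequency approximation built on an unannotated corpus has to contend with.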

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Maosong Sun (1)
  • Zhengcao Zhang (1)
  • Benjamin Ka-Yin T’sou (2)
  • Huaming Lu (3)

  1. The State Key Laboratory of Intelligent Technology and Systems, Department of Computer Science and Technology, Tsinghua University, Beijing, China
  2. Language Information Sciences Research Center, City University of Hong Kong
  3. School of Business, Beijing Institute of Machinery, Beijing, China
