Word Frequency Approximation for Chinese Using Raw, MM-Segmented and Manually Segmented Corpora

  • Wei Qiao
  • Maosong Sun
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4285)


Word frequencies play an important role in many NLP applications. Estimating word frequencies for Chinese remains a major challenge because of the characteristics of the language. The underlying difficulty is that no perfectly word-segmented Chinese corpus exists. What is currently available is threefold: raw corpora, which can be arbitrarily large; automatically word-segmented corpora derived from raw corpora; and a number of relatively small manually word-segmented corpora, developed by different researchers under various word segmentation standards. In this paper we propose a new scheme for word frequency approximation that combines these three kinds of resources. Experiments indicate that the scheme benefits word frequency estimation in most cases, though in other cases its performance is still not very satisfactory.
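To make the contrast between the first two resources concrete, the following is a minimal illustrative sketch, not the scheme proposed in the paper. It assumes that "MM" in the title stands for maximum matching segmentation and uses a simple forward variant. The lexicon, the sample sentence, and the function names `raw_frequency` and `fmm_segment` are hypothetical and chosen only for illustration.

```python
from collections import Counter

def raw_frequency(corpus, word):
    """Count every occurrence of `word` as a character string in raw text,
    ignoring word boundaries (overlapping matches are counted)."""
    count = 0
    start = corpus.find(word)
    while start != -1:
        count += 1
        start = corpus.find(word, start + 1)
    return count

def fmm_segment(text, lexicon, max_len=4):
    """Forward maximum matching: at each position take the longest
    lexicon entry; fall back to a single character if none matches."""
    words = []
    i = 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if n == 1 or piece in lexicon:
                words.append(piece)
                i += n
                break
    return words

# Hypothetical toy lexicon and sentence (assumptions for illustration only).
lexicon = {"研究", "研究生", "生命", "起源"}
sentence = "研究生命起源"

segmented = fmm_segment(sentence, lexicon)
mm_counts = Counter(segmented)

print(segmented)                          # words chosen by forward maximum matching
print(raw_frequency(sentence, "生命"))     # string count in the raw text
print(mm_counts["生命"])                   # count in the MM-segmented text
```

In this toy case the raw string count credits every lexicon entry that appears as a substring, while the greedy segmentation commits to the longest match at the start of the sentence and therefore assigns no count to a word that overlaps it. The two estimates disagree in opposite directions, which is the kind of discrepancy that a combined estimate would need to reconcile.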


word frequency estimation; raw corpus; automatically word-segmented corpus; manually word-segmented corpus





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Wei Qiao (1)
  • Maosong Sun (1)
  1. National Lab. of Intelligent Technology & Systems, Department of Computer Sci. & Tech., Tsinghua University, Beijing, China
