A Local Generative Model for Chinese Word Segmentation

  • Kaixu Zhang
  • Maosong Sun
  • Ping Xue
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6458)


This paper presents a local generative model for Chinese word segmentation that learns faster than discriminative models and supports unsupervised learning. It can also make use of larger resources. In this model, four successive characters are used to determine whether the interval between two characters is a word boundary. Gibbs sampling, together with three additional rules, is applied for unsupervised learning. Besides words, the word candidates generated by our model can improve the performance of Chinese information retrieval. Experiments show that, in supervised learning, our method outperforms a language-model-based method, and its performance on one corpus is better than the best result reported in the SIGHAN Bakeoff 2005. In unsupervised learning, our method achieves performance comparable to the state-of-the-art method.
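The idea of deciding each character interval from a window of four successive characters can be illustrated with a minimal supervised sketch. It is not the paper's model: the count-based estimate, the add-`alpha` smoothing, the 0.5 threshold, the `PAD` symbol, and all function names are assumptions made for illustration, and the Gibbs sampling step for unsupervised learning is not shown.

```python
from collections import Counter

PAD = "#"  # hypothetical padding symbol for sentence edges (assumption)


def intervals(chars):
    """Yield (i, window) for each gap between chars[i-1] and chars[i].

    The window holds the two characters before and the two after the gap.
    """
    padded = [PAD, *chars, PAD]
    for i in range(1, len(chars)):
        yield i, tuple(padded[i - 1:i + 3])


def train(segmented_sentences):
    """Count how often each four-character window is or is not a boundary."""
    counts = Counter()  # (window, is_boundary) -> count
    for words in segmented_sentences:
        chars = [c for w in words for c in w]
        boundaries, pos = set(), 0
        for w in words[:-1]:
            pos += len(w)
            boundaries.add(pos)
        for i, window in intervals(chars):
            counts[window, i in boundaries] += 1
    return counts


def boundary_probability(window, counts, alpha=1.0):
    """Smoothed estimate of P(boundary | window); alpha is an assumed prior."""
    yes = counts[window, True] + alpha
    no = counts[window, False] + alpha
    return yes / (yes + no)


def segment(text, counts, alpha=1.0):
    """Split text wherever the local boundary probability exceeds 0.5."""
    chars = list(text)
    if not chars:
        return []
    words, start = [], 0
    for i, window in intervals(chars):
        if boundary_probability(window, counts, alpha) > 0.5:
            words.append("".join(chars[start:i]))
            start = i
    words.append("".join(chars[start:]))
    return words


# Toy training data: each sentence is a list of words (placeholder tokens).
toy_counts = train([["ab", "cd"], ["ab", "e"]])
toy_result = segment("abcd", toy_counts)
```

Because each decision depends only on a local window, training in this sketch reduces to a single counting pass over the corpus. Windows never seen in training receive probability 0.5 and are left unsplit, which is a simplification of this sketch rather than a claim about the original method.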


Keywords: probability model, natural language processing, Chinese word segmentation





Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Kaixu Zhang (1)
  • Maosong Sun (1)
  • Ping Xue (2)
  1. State Key Lab of Intelligent Technology and Systems, Tsinghua National Laboratory for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, Beijing, P.R. China
  2. The Boeing Company, USA
