A Study of Chinese Word Segmentation Based on the Characteristics of Chinese

  • Aaron Li-Feng Han
  • Derek F. Wong
  • Lidia S. Chao
  • Liangye He
  • Ling Zhu
  • Shuo Li
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8105)

Abstract

This paper studies Chinese word segmentation (CWS). Segmenting Chinese text is difficult because written Chinese has no explicit word boundaries and because several kinds of ambiguity can yield different segmentations. Conventional research usually emphasizes the algorithms employed and the workflow designed, and contributes less to the discussion of the fundamental problems of CWS. In contrast, this paper first analyzes the characteristics of Chinese and several categories of ambiguity in Chinese in order to explore potential solutions. The selected conditional random field models are trained with a quasi-Newton algorithm to perform sequence labeling. To capture as much contextual information as possible, an augmented and optimized feature set is developed. The experiments show promising evaluation scores compared with some related work.
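
To illustrate what treating CWS as sequence labeling can look like, the following is a minimal sketch and not the authors' implementation. It assumes the common B/M/E/S character tag scheme (begin, middle, end, single) and a symmetric character window as context features; the abstract does not state the tag set or the feature templates used in the paper, so the function names, the window size, and the `<PAD>` symbol are all hypothetical.

```python
from typing import Dict, List


def words_to_tags(words: List[str]) -> List[str]:
    """Convert a segmented sentence into one B/M/E/S tag per character.

    The B/M/E/S scheme is an assumption made for illustration.
    """
    tags: List[str] = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # first character, any interior characters, last character
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Rebuild words from characters and their predicted tags."""
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends at this character
            words.append(current)
            current = ""
    if current:  # keep trailing characters if the tag sequence is malformed
        words.append(current)
    return words


def window_features(chars: str, i: int, size: int = 2) -> Dict[str, str]:
    """Collect the characters in a window around position i (hypothetical template)."""
    feats: Dict[str, str] = {}
    for offset in range(-size, size + 1):
        j = i + offset
        feats[f"c[{offset}]"] = chars[j] if 0 <= j < len(chars) else "<PAD>"
    return feats


words = ["我", "喜欢", "自然语言"]
tags = words_to_tags(words)
sentence = "".join(words)
restored = tags_to_words(sentence, tags)
```

In a setup like this, a sequence labeler such as a conditional random field would be trained to predict the tag of each character from its features, and `tags_to_words` would turn the predicted tags back into a segmentation.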

Keywords

Natural language processing; Chinese word segmentation; Characteristics of Chinese; Optimized features

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Aaron Li-Feng Han (1)
  • Derek F. Wong (1)
  • Lidia S. Chao (1)
  • Liangye He (1)
  • Ling Zhu (1)
  • Shuo Li (1)

  1. Department of Computer and Information Science, University of Macau, Macau, China