Disyllabic Chinese Word Extraction Based on Character Thesaurus and Semantic Constraints in Word-Formation

  • Sun Maosong
  • Xu Dongliang
  • Benjamin K. Y. T’sou
  • Lu Huaming
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5246)


This paper presents a novel approach to Chinese disyllabic word extraction based on semantic information of characters. Two thesauri of Chinese characters, one manually crafted and one machine-generated, are constructed. A Chinese wordlist of 63,738 two-character words, together with the character thesauri, is used to learn semantic constraints between characters in Chinese word-formation, resulting in two types of semantic-tag-based HMM. Experiments show that: (1) both schemes outperform their character-based counterpart; (2) the machine-generated thesaurus outperforms the hand-crafted one to some extent in word extraction; and (3) a proper combination of semantic-tag-based and character-based methods could benefit word extraction.
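
The abstract does not spell out the model, so the following is only a minimal sketch of how a semantic-tag-based score for a two-character candidate might be computed. It assumes that semantic tags act as hidden states, that a tag-pair distribution and per-tag character emissions are estimated from the wordlist with uniform fractional counts (no Baum-Welch re-estimation), and that a candidate is scored by summing over its possible tag pairs. The thesaurus entries, tag names, wordlist, and the functions `train` and `score` are all hypothetical illustrations, not the authors' resources or method.

```python
from collections import defaultdict

# Hypothetical toy character thesaurus: character -> set of semantic tags.
THESAURUS = {
    "江": {"NATURE"}, "河": {"NATURE"}, "山": {"NATURE"}, "水": {"NATURE"},
    "吃": {"ACTION"}, "喝": {"ACTION"},
}

# Hypothetical toy wordlist of known two-character words.
WORDLIST = ["江河", "山水", "江山", "河山", "吃喝"]


def train(wordlist, thesaurus):
    """Estimate P(t1, t2) and P(char | tag) from a two-character wordlist.

    Assumption: when a character has several tags, each tag pair of the
    word receives an equal fractional count.
    """
    pair_counts = defaultdict(float)   # (tag1, tag2) -> count
    emit_counts = defaultdict(float)   # (tag, char) -> count
    tag_counts = defaultdict(float)    # tag -> count
    for word in wordlist:
        c1, c2 = word
        tags1, tags2 = thesaurus[c1], thesaurus[c2]
        weight = 1.0 / (len(tags1) * len(tags2))
        for t1 in tags1:
            for t2 in tags2:
                pair_counts[(t1, t2)] += weight
                emit_counts[(t1, c1)] += weight
                emit_counts[(t2, c2)] += weight
                tag_counts[t1] += weight
                tag_counts[t2] += weight
    total = sum(pair_counts.values())
    tag_pair_prob = {pair: n / total for pair, n in pair_counts.items()}
    emit_prob = {(t, c): n / tag_counts[t] for (t, c), n in emit_counts.items()}
    return tag_pair_prob, emit_prob


def score(candidate, tag_pair_prob, emit_prob, thesaurus):
    """Sum P(t1, t2) * P(c1 | t1) * P(c2 | t2) over all tag pairs."""
    c1, c2 = candidate
    total = 0.0
    for t1 in thesaurus.get(c1, ()):
        for t2 in thesaurus.get(c2, ()):
            total += (tag_pair_prob.get((t1, t2), 0.0)
                      * emit_prob.get((t1, c1), 0.0)
                      * emit_prob.get((t2, c2), 0.0))
    return total


if __name__ == "__main__":
    tag_pair_prob, emit_prob = train(WORDLIST, THESAURUS)
    for candidate in ["河水", "吃山"]:
        print(candidate, score(candidate, tag_pair_prob, emit_prob, THESAURUS))
```

In this toy setting the candidate 河水 does not occur in the wordlist, yet it receives a non-zero score because its tag pair has been observed in other words, whereas 吃山 scores zero because its tag pair has not. A purely character-based model trained on the same wordlist would have no evidence for either candidate, which is the kind of generalization the semantic constraints are meant to provide.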





Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Sun Maosong (1)
  • Xu Dongliang (1)
  • Benjamin K. Y. T’sou (2)
  • Lu Huaming (3)
  1. The State Key Laboratory of Intelligent Technology and Systems, Tsinghua National Laboratory for Information Science and Technology, Dept. of Computer Sci. & Tech., Tsinghua University, Beijing, China
  2. Language Information Sciences Research Center, City University of Hong Kong
  3. Beijing Information Science and Technology University, Beijing, China
