Disyllabic Chinese Word Extraction Based on Character Thesaurus and Semantic Constraints in Word-Formation

  • Sun Maosong
  • Xu Dongliang
  • Benjamin K. Y. T’sou
  • Lu Huaming
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5246)


This paper presents a novel approach to Chinese disyllabic word extraction based on semantic information of characters. Two thesauri of Chinese characters, one manually crafted and one machine-generated, are constructed. A Chinese wordlist of 63,738 two-character words, together with the character thesauri, is used to learn semantic constraints between characters in Chinese word-formation, resulting in two types of semantic-tag-based HMM. Experiments show that: (1) both schemes outperform their character-based counterpart; (2) the machine-generated thesaurus outperforms the hand-crafted one to some extent in word extraction; and (3) a proper combination of semantic-tag-based and character-based methods could benefit word extraction.
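The abstract does not give the model's equations, so the following is only a rough sketch of the general idea of scoring a two-character candidate through semantic tags rather than through the characters themselves. The thesaurus entries, tag names, wordlist, and the function names `train` and `score` are all invented for illustration, and the uniform fractional counting used here is a simplifying assumption, not the training procedure of the paper.

```python
from collections import defaultdict
from itertools import product

# Hypothetical toy thesaurus: character -> semantic tags (invented for illustration).
THESAURUS = {
    "山": ["NATURE"],
    "河": ["NATURE"],
    "水": ["NATURE", "LIQUID"],
    "酒": ["LIQUID"],
    "杯": ["CONTAINER"],
    "瓶": ["CONTAINER"],
}

# Hypothetical toy wordlist of known two-character words.
WORDLIST = ["山河", "山水", "河水", "酒杯", "水杯", "酒瓶"]


def train(wordlist, thesaurus):
    """Estimate tag-pair and emission probabilities from fractional counts."""
    pair_counts = defaultdict(float)  # (tag1, tag2) -> count
    emit_counts = defaultdict(float)  # (tag, char) -> count
    tag_counts = defaultdict(float)   # tag -> count
    for word in wordlist:
        c1, c2 = word
        pairs = list(product(thesaurus[c1], thesaurus[c2]))
        weight = 1.0 / len(pairs)  # assumption: spread one count uniformly over tag pairs
        for t1, t2 in pairs:
            pair_counts[(t1, t2)] += weight
            for tag, char in ((t1, c1), (t2, c2)):
                emit_counts[(tag, char)] += weight
                tag_counts[tag] += weight
    total = sum(pair_counts.values())
    pair_prob = {k: v / total for k, v in pair_counts.items()}
    emit_prob = {(t, c): v / tag_counts[t] for (t, c), v in emit_counts.items()}
    return pair_prob, emit_prob


def score(candidate, thesaurus, pair_prob, emit_prob):
    """Sum P(t1, t2) * P(c1 | t1) * P(c2 | t2) over all tag pairs of the candidate."""
    c1, c2 = candidate
    return sum(
        pair_prob.get((t1, t2), 0.0)
        * emit_prob.get((t1, c1), 0.0)
        * emit_prob.get((t2, c2), 0.0)
        for t1, t2 in product(thesaurus.get(c1, []), thesaurus.get(c2, []))
    )


pair_prob, emit_prob = train(WORDLIST, THESAURUS)
for candidate in ["水瓶", "杯山"]:
    print(candidate, score(candidate, THESAURUS, pair_prob, emit_prob))
```

In this toy setting, a candidate that is absent from the wordlist can still receive a non-zero score when its tag pair has been observed in other words, which is the kind of generalisation a purely character-based count cannot provide.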


Hidden Markov Model; Chinese Character; Semantic Category; Associative Strength; Semantic Constraint
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Sun Maosong (1)
  • Xu Dongliang (1)
  • Benjamin K. Y. T’sou (2)
  • Lu Huaming (3)
  1. The State Key Laboratory of Intelligent Technology and Systems, Tsinghua National Laboratory for Information Science and Technology, Dept. of Computer Sci. & Tech., Tsinghua University, Beijing, China
  2. Language Information Sciences Research Center, City University of Hong Kong
  3. Beijing Information Science and Technology University, Beijing, China
