Word Extraction Based on Semantic Constraints in Chinese Word-Formation

  • Maosong Sun
  • Shengfen Luo
  • Benjamin K T’sou
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3406)


This paper presents a novel approach to Chinese word extraction based on the semantic information of characters. A thesaurus of Chinese characters is constructed. A Chinese lexicon of 63,738 two-character words, together with the character thesaurus, is used to learn semantic constraints between characters in Chinese word-formation, yielding a semantic-tag-based HMM. The Baum-Welch re-estimation scheme is then used to train the HMM parameters in an unsupervised manner. Various statistical measures for estimating the likelihood that a character string is a word are also tested. Large-scale experiments show promising results: the F-score of this word extraction method reaches 68.5%, whereas its counterpart, the character-based mutual information method, reaches only 47.5%.
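
As a rough illustration of the character-based mutual information baseline mentioned above, the following sketch scores adjacent character pairs by pointwise mutual information, log2(P(xy) / (P(x)P(y))). The abstract does not specify the exact estimator, so the probability estimates, the function names `char_bigram_pmi` and `extract_candidates`, and the threshold parameter are all assumptions made for illustration, not the authors' implementation.

```python
import math
from collections import Counter


def char_bigram_pmi(text):
    """Return {bigram: PMI in bits} for every adjacent character pair in text.

    Probabilities are plain relative frequencies (an assumption; the paper
    may use a different estimator or smoothing).
    """
    chars = Counter(text)
    bigrams = Counter(text[i:i + 2] for i in range(len(text) - 1))
    n_chars = sum(chars.values())
    n_bigrams = sum(bigrams.values())
    scores = {}
    for bigram, count in bigrams.items():
        p_xy = count / n_bigrams
        p_x = chars[bigram[0]] / n_chars
        p_y = chars[bigram[1]] / n_chars
        scores[bigram] = math.log2(p_xy / (p_x * p_y))
    return scores


def extract_candidates(text, threshold):
    """Keep bigrams whose PMI meets a hypothetical threshold, highest first."""
    scores = char_bigram_pmi(text)
    kept = [(bg, s) for bg, s in scores.items() if s >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy input only; a real experiment would use a large corpus.
    for bigram, score in extract_candidates("abcabc", threshold=1.0):
        print(bigram, round(score, 3))
```

A measure of this kind sees only character co-occurrence counts, which is the contrast the abstract draws with the semantic-tag-based HMM.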


Hidden Markov Model; Chinese Character; Associative Strength; Chinese Word; Unknown Word
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Maosong Sun (1)
  • Shengfen Luo (1)
  • Benjamin K T’sou (2)
  1. National Lab. of Intelligent Tech. & Systems, Tsinghua University, Beijing, China
  2. Language Information Sciences Research Centre, City University of Hong Kong
