Word Extraction Based on Semantic Constraints in Chinese Word-Formation

  • Maosong Sun
  • Shengfen Luo
  • Benjamin K T’sou
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3406)


This paper presents a novel approach to Chinese word extraction based on the semantic information of characters. A thesaurus of Chinese characters is constructed. A Chinese lexicon of 63,738 two-character words is then explored, together with the character thesaurus, to learn the semantic constraints between characters in Chinese word-formation, yielding a semantic-tag-based HMM. The Baum-Welch re-estimation scheme is then used to train the HMM parameters in an unsupervised manner. Various statistical measures for estimating the likelihood that a character string is a word are further tested. Large-scale experiments show promising results: the F-score of this word extraction method reaches 68.5%, whereas its counterpart, the character-based mutual information method, reaches only 47.5%.
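
To make the baseline mentioned above concrete, the following is a minimal sketch of character-based mutual information scoring for two-character candidates. It assumes the standard pointwise mutual information definition, log2(P(xy) / (P(x) P(y))), estimated from raw corpus counts; the abstract does not give the exact formulation used in the paper, and the function names `char_bigram_pmi` and `extract_candidates` and the threshold parameter are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def char_bigram_pmi(corpus):
    """Return {bigram: PMI} for every adjacent character pair in `corpus`.

    `corpus` is a list of strings. Probabilities are plain relative
    frequencies (an assumption; no smoothing is applied).
    """
    unigrams = Counter()
    bigrams = Counter()
    for line in corpus:
        unigrams.update(line)  # count single characters
        bigrams.update(line[i:i + 2] for i in range(len(line) - 1))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for bg, count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[bg[0]] / n_uni
        p_y = unigrams[bg[1]] / n_uni
        scores[bg] = math.log2(p_xy / (p_x * p_y))
    return scores


def extract_candidates(corpus, threshold):
    """Return, in sorted order, the bigrams whose PMI is at least `threshold`."""
    scores = char_bigram_pmi(corpus)
    return sorted(bg for bg, s in scores.items() if s >= threshold)
```

Calling `extract_candidates(lines, threshold)` on a list of unsegmented text lines returns the two-character strings that this baseline would propose as words. The semantic-tag-based HMM described in the abstract replaces these character-level statistics with statistics over the semantic classes of the characters, which this sketch does not attempt to reproduce.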





Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Maosong Sun (1)
  • Shengfen Luo (1)
  • Benjamin K T’sou (2)
  1. National Lab. of Intelligent Tech. & Systems, Tsinghua University, Beijing, China
  2. Language Information Sciences Research Centre, City University of Hong Kong
