Disyllabic Chinese Word Extraction Based on Character Thesaurus and Semantic Constraints in Word-Formation
This paper presents a novel approach to Chinese disyllabic word extraction based on the semantic information of characters. Two thesauri of Chinese characters, one manually crafted and one machine-generated, are constructed. A Chinese wordlist of 63,738 two-character words, together with the character thesauri, is used to learn semantic constraints between characters in Chinese word-formation, resulting in two types of semantic-tag-based HMM. Experiments show that: (1) both schemes outperform their character-based counterpart; (2) the machine-generated thesaurus outperforms the hand-crafted one to some extent in word extraction; and (3) a proper combination of semantic-tag-based and character-based methods could benefit word extraction.
Keywords: Hidden Markov Model, Chinese Character, Semantic Category, Associative Strength, Semantic Constraint
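To make the idea of a semantic-tag-based model concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical toy thesaurus that assigns exactly one semantic tag to each character, a hypothetical five-word wordlist, and add-one smoothing. The tag names and the function names `train` and `score` are all illustrative. A two-character candidate is scored as P(t1) · P(t2 | t1) · P(c1 | t1) · P(c2 | t2), where t1 and t2 are the tags of characters c1 and c2.

```python
from collections import Counter

# Hypothetical toy thesaurus: character -> semantic tag.
# Assumption: one tag per character (real thesauri may allow several).
THESAURUS = {
    "山": "NATURE", "水": "NATURE", "河": "NATURE", "火": "NATURE",
    "车": "ARTIFACT", "船": "ARTIFACT",
}

# Hypothetical toy wordlist of two-character words.
WORDLIST = ["山水", "河水", "火山", "火车", "水车"]


def train(wordlist, thesaurus):
    """Count tag starts, tag-to-tag transitions, and tag-to-character emissions."""
    start, trans, emit = Counter(), Counter(), Counter()
    for word in wordlist:
        c1, c2 = word
        t1, t2 = thesaurus[c1], thesaurus[c2]
        start[t1] += 1
        trans[(t1, t2)] += 1
        emit[(t1, c1)] += 1
        emit[(t2, c2)] += 1
    return start, trans, emit


def score(candidate, model, thesaurus):
    """Add-one-smoothed probability of a two-character candidate under the tag model."""
    start, trans, emit = model
    c1, c2 = candidate
    t1, t2 = thesaurus[c1], thesaurus[c2]
    n_tags = len(set(thesaurus.values()))
    n_chars = len(thesaurus)

    # Total emission count per tag, used as the emission denominator.
    tag_total = Counter()
    for (tag, _char), n in emit.items():
        tag_total[tag] += n

    p_start = (start[t1] + 1) / (sum(start.values()) + n_tags)
    # Each word contributes one start and one transition, so start[t1]
    # is also the number of transitions leaving t1.
    p_trans = (trans[(t1, t2)] + 1) / (start[t1] + n_tags)
    p_emit1 = (emit[(t1, c1)] + 1) / (tag_total[t1] + n_chars)
    p_emit2 = (emit[(t2, c2)] + 1) / (tag_total[t2] + n_chars)
    return p_start * p_trans * p_emit1 * p_emit2


model = train(WORDLIST, THESAURUS)
seen_pattern = score("河船", model, THESAURUS)    # NATURE -> ARTIFACT, pattern seen in the wordlist
unseen_pattern = score("船河", model, THESAURUS)  # ARTIFACT -> NATURE, pattern never seen
```

Neither candidate occurs in the toy wordlist, yet the tag-level statistics rank the one whose tag pattern was observed above the one whose pattern was not. A purely character-based bigram model would have no evidence for either string, which is the kind of generalization semantic tags are meant to provide.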