From Phoneme to Morpheme: Another Verification Using a Corpus
We scientifically test Harris’s hypothesis that morpheme and word boundaries can be detected from changes in the complexity of phoneme sequences. We reformulate the hypothesis from an information-theoretic viewpoint and use a corpus to test whether it holds. We find that the hypothesis holds for morphemes, with an F-score of about 80%, in both English and Chinese. For word boundaries, however, English and Chinese gave contrary results, which reflects a difference in the nature of the two languages.
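One way to make "changes in the complexity of phoneme sequences" concrete is branching entropy: the Shannon entropy of the distribution of symbols that follow a given context. The abstract does not spell out the exact formulation, so the following is only a minimal sketch under stated assumptions: characters stand in for phonemes, the context is a fixed window of at most `max_n` symbols, and a boundary is proposed wherever the entropy rises from one position to the next. The function names are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def successor_counts(corpus, max_n):
    """Count which symbol follows each context of length 1..max_n."""
    counts = defaultdict(Counter)
    for seq in corpus:
        for i in range(len(seq)):
            for n in range(1, max_n + 1):
                if i + n < len(seq):
                    counts[seq[i:i + n]][seq[i + n]] += 1
    return counts


def branching_entropy(counts, context):
    """Shannon entropy (in bits) of the symbols observed after `context`."""
    followers = counts.get(context)
    if not followers:
        return 0.0  # assumption: unseen contexts are treated as fully predictable
    total = sum(followers.values())
    return -sum((c / total) * math.log2(c / total) for c in followers.values())


def boundaries(seq, counts, max_n):
    """Propose a boundary after the first k symbols when entropy rises at k."""
    # h[k - 1] is the entropy of what follows the first k symbols,
    # using only the last max_n of them as context.
    h = [
        branching_entropy(counts, seq[max(0, k - max_n):k])
        for k in range(1, len(seq))
    ]
    return [k for k in range(2, len(seq)) if h[k - 1] > h[k - 2]]


# Toy corpus: characters stand in for phonemes (an assumption for illustration).
toy_corpus = ["thedog", "thecat", "thesun", "thepig"]
toy_counts = successor_counts(toy_corpus, max_n=3)
proposed = boundaries("thedog", toy_counts, max_n=3)
```

In this toy corpus the successor of "the" is highly uncertain while the successors of "t" and "th" are fixed, so the only rise in entropy falls after the third symbol. A real experiment would use phoneme transcriptions and would likely add a threshold on the size of the rise, which this sketch omits.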
Keywords: Chinese Word, Word Boundary, Word Segmentation, Meaningful Unit, Phoneme Sequence
- 1. Harris, Z.S.: From phoneme to morpheme. Language, 190–222 (1955)
- 2. Imai, K.: Dictionary of Chomsky. Taishukan (1986) (in Japanese)
- 3. Martinet, A.: Elements de linguistique generale. Colin (1960)
- 4. Jin, Z., Tanaka-Ishii, K.: Unsupervised segmentation of Chinese text by use of branching entropy. In: COLING/ACL (2006)
- 5. Huang, H., Powers, D.: Chinese word segmentation based on contextual entropy. In: Pacific Asian Conference on Language, Information and Computation (2003)
- 6. Frantzi, T., Ananiadou, S.: Extracting nested collocations. In: 16th COLING, pp. 41–46 (1996)
- 7. Tanaka-Ishii, K., Nakagawa, H.: A multilingual usage consultation tool based on internet searching: More than a search engine, less than QA. In: WWW Conference, pp. 363–371 (2005)
- 9. Carnegie Mellon University: CMU pronouncing dictionary version 0.6 (2006) (visited 2006), http://www.speech.cs.cmu.edu/cgi-bin/cmudict
- 10. SIL: PC-KIMMO version 2, a morphological parser (1995), http://www.sil.org/pckimmo/
- 11. ICL: People’s Daily corpus, Beijing University (1999), http://www.icl.pku.edu.cn/icl_res/
- 12. NJStar Software Corp: NJStar, Chinese word processing software (2006), http://www.njstar.com