From Phoneme to Morpheme: Another Verification Using a Corpus

  • Kumiko Tanaka-Ishii
  • Zhihui Jin
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4285)


We empirically test Harris’s hypothesis that morpheme/word boundaries can be detected from changes in the complexity of phoneme sequences. We reformulate his hypothesis from a more information-theoretic viewpoint and use a corpus to test whether it holds. We found that his hypothesis holds for morphemes, with an F-score of about 80%, in both English and Chinese. However, we obtained contrary results for English and Chinese with regard to word boundaries; this reflects a difference in the nature of the two languages.
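
The abstract does not spell out the information-theoretic formulation, so the following is only a minimal sketch of one plausible reading: measure the entropy of the next symbol after a context, and propose a boundary wherever that entropy rises as the context grows by one symbol. The function names, the use of characters as stand-ins for phonemes, the toy corpus, and the "rise relative to the shorter context" criterion are all assumptions made for illustration, not the authors' confirmed method.

```python
import math
from collections import Counter, defaultdict


def successor_counts(sequences, max_n):
    """Count which symbol follows every context of length 1..max_n."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(len(seq)):
            for n in range(1, max_n + 1):
                if i + n < len(seq):
                    counts[tuple(seq[i:i + n])][seq[i + n]] += 1
    return counts


def branching_entropy(counts, context):
    """Entropy (in bits) of the symbol that follows `context` in the corpus."""
    successors = counts.get(tuple(context))
    if not successors:
        return 0.0  # unseen context: treated as zero entropy (an assumption)
    total = sum(successors.values())
    return -sum((c / total) * math.log2(c / total) for c in successors.values())


def rising_entropy_boundaries(seq, counts, max_n):
    """Propose a boundary before position k whenever the entropy after a
    context ending at k exceeds the entropy after the one-symbol-shorter
    context with the same start."""
    boundaries = set()
    for start in range(len(seq)):
        prev = None
        for end in range(start + 1, min(start + max_n, len(seq)) + 1):
            h = branching_entropy(counts, seq[start:end])
            if prev is not None and h > prev:
                boundaries.add(end)
            prev = h
    return sorted(boundaries)


# Toy corpus: characters stand in for phonemes (hypothetical data).
corpus = ["thedog", "thecat", "thepig", "adog", "acat"]
counts = successor_counts(corpus, max_n=3)
proposed = rising_entropy_boundaries("thedog", counts, max_n=3)
```

In this toy corpus the symbol after "th" is fully predictable, while three different symbols follow "the", so the entropy rises at that point and position 3 (between "the" and "dog") is among the proposed boundaries. Comparing such proposals against gold-standard morpheme or word boundaries is what an F-score like the one reported above would summarize.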


Keywords: Chinese Word; Word Boundary; Word Segmentation; Meaningful Unit; Phoneme Sequence
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.




References

  1. Harris, Z.S.: From phoneme to morpheme. Language, 190–222 (1955)
  2. Imai, K.: Dictionary of Chomsky. Taishukan (1986) (in Japanese)
  3. Martinet, A.: Éléments de linguistique générale. Colin (1960)
  4. Jin, Z., Tanaka-Ishii, K.: Unsupervised segmentation of Chinese text by use of branching entropy. In: COLING/ACL (2006)
  5. Huang, H., Powers, D.: Chinese word segmentation based on contextual entropy. In: Pacific Asia Conference on Language, Information and Computation (2003)
  6. Frantzi, T., Ananiadou, S.: Extracting nested collocations. In: 16th COLING, pp. 41–46 (1996)
  7. Tanaka-Ishii, K., Nakagawa, H.: A multilingual usage consultation tool based on internet searching: More than a search engine, less than QA. In: WWW Conference, pp. 363–371 (2005)
  8. Tanaka-Ishii, K.: Entropy as an indicator of context boundaries: an experiment using a web search engine. In: Dale, R., Wong, K.-F., Su, J., Kwong, O.Y. (eds.) IJCNLP 2005. LNCS, vol. 3651, pp. 93–105. Springer, Heidelberg (2005)
  9. Carnegie Mellon University: CMU pronouncing dictionary version 0.6 (2006) (visited 2006)
  10. SIL: PC-KIMMO version 2, a morphological parser (1995)
  11. ICL: People’s Daily corpus, Beijing University (1999)
  12. NJStar Software Corp: NJStar, Chinese word processing software (2006)

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Kumiko Tanaka-Ishii (1)
  • Zhihui Jin (1)
  1. Graduate School of Information Science and Technology, University of Tokyo
