
Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4285)

Abstract

We scientifically test Harris’s hypothesis that morpheme/word boundaries can be detected from changes in the complexity of phoneme sequences. We reformulate his hypothesis from a more information-theoretic viewpoint and use a corpus to test whether it holds. We found that the hypothesis holds for morphemes, with an F-score of about 80%, in both English and Chinese. However, we obtained contrary results for English and Chinese with regard to word boundaries; this reflects a difference in the nature of the two languages.
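
The abstract does not spell out the complexity measure, so the following is only a minimal sketch of one way the idea could be made concrete. It assumes that complexity is measured as the entropy of the next symbol given the prefix seen so far, and that a boundary is proposed wherever that entropy rises. The function names, the toy corpus, and the use of letters as stand-ins for phonemes are all assumptions for illustration, not the authors' implementation.

```python
import math
from collections import Counter, defaultdict


def successor_entropy(corpus):
    """Map each prefix to the entropy (in bits) of the symbol that follows it.

    `corpus` is an iterable of strings (or tuples) of symbols. Letters are
    used here as a stand-in for phonemes (an assumption for illustration).
    """
    followers = defaultdict(Counter)
    for seq in corpus:
        for i in range(1, len(seq)):
            # Count which symbol follows the prefix seq[:i].
            followers[seq[:i]][seq[i]] += 1
    entropy = {}
    for prefix, counts in followers.items():
        total = sum(counts.values())
        entropy[prefix] = -sum(
            (c / total) * math.log2(c / total) for c in counts.values()
        )
    return entropy


def boundaries(seq, entropy):
    """Return positions i where the successor entropy rises at prefix seq[:i].

    The 'rise' criterion is an assumed decision rule; prefixes never seen
    with a follower are treated as having entropy 0.
    """
    cuts = []
    for i in range(2, len(seq)):
        prev = entropy.get(seq[:i - 1], 0.0)
        cur = entropy.get(seq[:i], 0.0)
        if cur > prev:
            cuts.append(i)
    return cuts


# Hypothetical toy corpus, not data from the paper.
toy_corpus = ["playing", "played", "player", "plays", "walking", "walked"]
H = successor_entropy(toy_corpus)
print(boundaries("playing", H))  # proposes a cut between "play" and "ing"
```

On this toy corpus the next symbol is fully predictable inside "play" and becomes uncertain right after it, so the sketch proposes a single cut at position 4. A real experiment would need a phoneme-transcribed corpus and a gold-standard segmentation to compute an F-score.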

References

  1. Harris, Z.S.: From phoneme to morpheme. Language, 190–222 (1955)

  2. Imai, K.: Dictionary of Chomsky. Taishukan (1986) (in Japanese)

  3. Martinet, A.: Éléments de linguistique générale. Colin (1960)

  4. Jin, Z., Tanaka-Ishii, K.: Unsupervised segmentation of Chinese text by use of branching entropy. In: COLING/ACL (2006)

  5. Huang, H., Powers, D.: Chinese word segmentation based on contextual entropy. In: Pacific Asia Conference on Language, Information and Computation (2003)

  6. Frantzi, T., Ananiadou, S.: Extracting nested collocations. In: 16th COLING, pp. 41–46 (1996)

  7. Tanaka-Ishii, K., Nakagawa, H.: A multilingual usage consultation tool based on internet searching: more than a search engine, less than QA. In: WWW Conference, pp. 363–371 (2005)

  8. Tanaka-Ishii, K.: Entropy as an indicator of context boundaries: an experiment using a web search engine. In: Dale, R., Wong, K.-F., Su, J., Kwong, O.Y. (eds.) IJCNLP 2005. LNCS, vol. 3651, pp. 93–105. Springer, Heidelberg (2005)

  9. Carnegie Mellon University: CMU Pronouncing Dictionary version 0.6 (2006) (visited 2006), http://www.speech.cs.cmu.edu/cgi-bin/cmudict

  10. SIL: PC-KIMMO version 2, a morphological parser (1995), http://www.sil.org/pckimmo/

  11. ICL: People’s Daily corpus, Beijing University (1999), http://www.icl.pku.edu.cn/icl_res/

  12. NJStar Software Corp: NJStar, Chinese word processing software (2006), http://www.njstar.com

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Tanaka-Ishii, K., Jin, Z. (2006). From Phoneme to Morpheme: Another Verification Using a Corpus. In: Matsumoto, Y., Sproat, R.W., Wong, KF., Zhang, M. (eds) Computer Processing of Oriental Languages. Beyond the Orient: The Research Challenges Ahead. ICCPOL 2006. Lecture Notes in Computer Science, vol 4285. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11940098_25

  • DOI: https://doi.org/10.1007/11940098_25

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-49667-0

  • Online ISBN: 978-3-540-49668-7

  • eBook Packages: Computer Science, Computer Science (R0)
