A Generalized Approach to Word Segmentation Using Maximum Length Descending Frequency and Entropy Rate

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4394)

Abstract

In this paper, we formulate a generalized method of automatic word segmentation. The method uses corpus type frequency information to choose the type with maximum length and frequency from “desegmented” text. It also uses a modified forward-backward matching technique using maximum length frequency and entropy rate if any non-matching portions of the text exist. The method is also extendible to a dictionary-based or hybrid method with some additions to the algorithms. Evaluation results show that our method outperforms several competing methods.
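
The abstract gives only an outline of the method, so the following is a minimal sketch of one possible reading of the "maximum length, descending frequency" idea, not the authors' algorithm. It assumes a whitespace-tokenized training corpus and ranks word types by length (longest first), breaking ties by corpus frequency (highest first). The names `rank_types` and `segment` are hypothetical, and the modified forward-backward matching and entropy-rate step for non-matching portions is not reproduced: unmatched residue is simply returned as is.

```python
from collections import Counter


def rank_types(type_freq):
    """Order word types by length (longest first), then by frequency (highest first)."""
    # Empty strings are skipped because they would match everywhere.
    return sorted((t for t in type_freq if t),
                  key=lambda t: (-len(t), -type_freq[t]))


def segment(text, ranked):
    """Split `text` around the best-ranked type it contains, then recurse on both sides."""
    if not text:
        return []
    for t in ranked:
        pos = text.find(t)
        if pos != -1:
            left = segment(text[:pos], ranked)
            right = segment(text[pos + len(t):], ranked)
            return left + [t] + right
    # Assumption: no known type occurs in this span, so it is kept unsegmented.
    return [text]


# Toy corpus, used only for illustration.
corpus_tokens = "the cat sat on the mat".split()
ranked = rank_types(Counter(corpus_tokens))

print(segment("thecatsatonthemat", ranked))
print(segment("thedogsat", ranked))
```

In this toy run, "dog" is not a known type, so it survives as an unmatched span; that is the kind of residue the abstract says is handled by the matching step based on maximum length frequency and entropy rate.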

Editor information

Alexander Gelbukh

Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Islam, M.A., Inkpen, D., Kiringa, I. (2007). A Generalized Approach to Word Segmentation Using Maximum Length Descending Frequency and Entropy Rate. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2007. Lecture Notes in Computer Science, vol 4394. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-70939-8_16

  • DOI: https://doi.org/10.1007/978-3-540-70939-8_16

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-70938-1

  • Online ISBN: 978-3-540-70939-8

  • eBook Packages: Computer Science, Computer Science (R0)
