A Generalized Approach to Word Segmentation Using Maximum Length Descending Frequency and Entropy Rate

  • Md. Aminul Islam
  • Diana Inkpen
  • Iluju Kiringa
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4394)


In this paper, we formulate a generalized method of automatic word segmentation. The method uses corpus type frequency information to choose the type with maximum length and frequency from “desegmented” text. If any portions of the text remain unmatched, it applies a modified forward-backward matching technique based on maximum length frequency and entropy rate. With some additions to the algorithms, the method can also be extended to a dictionary-based or hybrid method. Evaluation results show that our method outperforms several competing methods.
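The type-selection step described above can be illustrated with a minimal sketch. It assumes one possible reading of "maximum length, descending frequency": candidate types are tried longest first, ties are broken by higher corpus frequency, and each accepted occurrence claims its characters so later types cannot overlap it. The function name `segment_mldf`, the toy corpus, and the treatment of unmatched runs (returned as raw chunks) are assumptions made for illustration. The forward-backward matching and entropy-rate steps are not implemented here, and this should not be read as the authors' algorithm.

```python
from collections import Counter


def segment_mldf(text, type_freq):
    """Greedy segmentation sketch: longer types first, then more frequent ones.

    text      -- string with word boundaries removed
    type_freq -- mapping from word type to its corpus frequency (assumed given)
    """
    # Order candidate types by length (descending), then frequency (descending).
    ordered = sorted(type_freq, key=lambda t: (-len(t), -type_freq[t]))

    claimed = [False] * len(text)  # characters already assigned to a word
    spans = []                     # (start, end) of accepted words

    for t in ordered:
        if not t:
            continue  # skip empty types, which would match everywhere
        start = text.find(t)
        while start != -1:
            end = start + len(t)
            if not any(claimed[start:end]):
                # Accept this occurrence and mark its characters as used.
                for i in range(start, end):
                    claimed[i] = True
                spans.append((start, end))
                start = text.find(t, end)
            else:
                # Overlaps an earlier (longer or more frequent) match; move on.
                start = text.find(t, start + 1)

    # Rebuild the output in text order; unmatched runs become raw chunks.
    spans.sort()
    out, pos = [], 0
    for s, e in spans:
        if s > pos:
            out.append(text[pos:s])
        out.append(text[s:e])
        pos = e
    if pos < len(text):
        out.append(text[pos:])
    return out


# Toy corpus used only to build a frequency table for the example.
type_freq = Counter("the cat sat on the mat the cat ran".split())
print(segment_mldf("thecatsatonthemat", type_freq))
# → ['the', 'cat', 'sat', 'on', 'the', 'mat']
```

In this sketch, any stretch of text that no known type covers is left as a single chunk. That is the point at which the method described in the abstract would apply its modified forward-backward matching.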


Keywords: Greedy Algorithm; Natural Language Processing; Minimum Description Length; Word Boundary; Word Segmentation
These keywords were added by machine and not by the authors.





Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Md. Aminul Islam (1)
  • Diana Inkpen (1)
  • Iluju Kiringa (1)
  1. School of Information Technology and Engineering, University of Ottawa, Ottawa, ON, K1N 6N5, Canada
