A Generalized Approach to Word Segmentation Using Maximum Length Descending Frequency and Entropy Rate

Islam, Md. Aminul; Inkpen, Diana; Kiringa, Iluju

doi:10.1007/978-3-540-70939-8_16

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4394)

Abstract

In this paper, we formulate a generalized method of automatic word segmentation. The method uses corpus type frequency information to choose the type with maximum length and frequency from “desegmented” text. It also uses a modified forward-backward matching technique using maximum length frequency and entropy rate if any non-matching portions of the text exist. The method is also extendible to a dictionary-based or hybrid method with some additions to the algorithms. Evaluation results show that our method outperforms several competing methods.
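
As a rough illustration of the "maximum length, descending frequency" idea named in the abstract, the sketch below greedily matches corpus types against desegmented text, trying longer types first and breaking ties by higher frequency. It is only one reading of the abstract, not the authors' algorithm: the function name `segment`, the toy frequency table, and the tie-breaking rule are all assumptions, and the forward-backward matching and entropy-rate step for non-matching portions is omitted (unmatched text is simply returned as-is).

```python
from collections import Counter


def segment(text, type_freq):
    """Greedy segmentation sketch (hypothetical, not the paper's algorithm).

    Candidate types are tried longest first; among equal lengths,
    the more frequent type is tried first.
    """
    # Order candidates by length (descending), then frequency (descending).
    candidates = sorted(type_freq, key=lambda w: (-len(w), -type_freq[w]))

    claimed = [False] * len(text)  # True once a character belongs to a match
    spans = []                     # (start, end) of every accepted match

    for word in candidates:
        start = text.find(word)
        while start != -1:
            end = start + len(word)
            if not any(claimed[start:end]):
                # Accept the match only if none of its characters are taken.
                for i in range(start, end):
                    claimed[i] = True
                spans.append((start, end))
                start = text.find(word, end)
            else:
                start = text.find(word, start + 1)

    # Emit matches and any unmatched residue in text order.
    spans.sort()
    pieces, pos = [], 0
    for start, end in spans:
        if start > pos:
            pieces.append(text[pos:start])  # non-matching portion, left as-is
        pieces.append(text[start:end])
        pos = end
    if pos < len(text):
        pieces.append(text[pos:])
    return pieces


# Toy type frequencies; a real system would count these from a corpus.
toy_freq = Counter({"the": 50, "on": 30, "at": 20, "he": 15,
                    "cat": 10, "sat": 8, "mat": 5})
print(segment("thecatsatonthemat", toy_freq))
```

With the toy table, the longer types `the`, `cat`, `sat`, and `mat` claim their characters before the shorter but more frequent `at` and `he` are considered, which is the point of ordering by length first.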

Author information

Authors and Affiliations

School of Information Technology and Engineering, University of Ottawa, Ottawa, ON, K1N 6N5, Canada
Md. Aminul Islam, Diana Inkpen & Iluju Kiringa

Editor information

Alexander Gelbukh

About this paper

Cite this paper

Islam, M.A., Inkpen, D., Kiringa, I. (2007). A Generalized Approach to Word Segmentation Using Maximum Length Descending Frequency and Entropy Rate. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2007. Lecture Notes in Computer Science, vol 4394. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-70939-8_16

DOI: https://doi.org/10.1007/978-3-540-70939-8_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-70938-1
Online ISBN: 978-3-540-70939-8
eBook Packages: Computer Science, Computer Science (R0)
