Abstract
We formulate a generalized method for automatic word segmentation. The method uses type frequency information from a corpus to select, from “desegmented” text (text whose word boundaries have been removed), the type with maximum length and frequency. If any portions of the text remain unmatched, it applies a modified forward-backward matching technique that uses maximum length frequency and entropy rate. With some additions to the algorithms, the method can also be extended to a dictionary-based or hybrid method. Evaluation results show that our method outperforms several competing methods.
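The abstract gives only a high-level description, so the following is a minimal sketch of one plausible reading of the first step, not the authors' algorithm. It assumes that candidate types are tried in order of descending length and then descending corpus frequency, and that each type claims any still-unclaimed occurrence in the desegmented text. The function name `mldf_segment` and the toy frequency table are hypothetical, and the forward-backward matching and entropy-rate steps for unmatched portions are not modeled; unmatched runs are simply returned as they are.

```python
def mldf_segment(text, type_freq):
    """Greedy sketch: longer types first, ties broken by higher corpus frequency.

    text      -- desegmented input string (no word boundaries)
    type_freq -- dict mapping each known type to its corpus frequency (assumed input)
    """
    # Assumed ordering: length descending, then frequency descending,
    # then alphabetical so the result is deterministic.
    ordered = sorted(type_freq, key=lambda t: (-len(t), -type_freq[t], t))

    # owner[i] holds the start index of the word covering position i,
    # or None while position i is still unclaimed.
    owner = [None] * len(text)

    for t in ordered:
        if not t:
            continue  # ignore an empty type, which would match everywhere
        start = text.find(t)
        while start != -1:
            end = start + len(t)
            if all(o is None for o in owner[start:end]):
                # Claim this occurrence and continue searching after it.
                for i in range(start, end):
                    owner[i] = start
                start = text.find(t, end)
            else:
                # Overlaps an earlier (longer or more frequent) match; skip ahead.
                start = text.find(t, start + 1)

    # Read off tokens: each claimed span is one word,
    # each run of unclaimed characters is kept as one unmatched chunk.
    tokens = []
    i = 0
    while i < len(text):
        j = i + 1
        while j < len(text) and owner[j] == owner[i]:
            j += 1
        tokens.append(text[i:j])
        i = j
    return tokens


# Toy frequency table, invented purely for illustration.
toy_freq = {"the": 10, "cat": 5, "sat": 4, "at": 7, "these": 1}
print(mldf_segment("thecatsat", toy_freq))
print(mldf_segment("thexcat", toy_freq))
```

In this sketch the shorter type "at" never wins inside "cat" or "sat", because the longer types are placed first. The character "x" in the second call matches no type and is returned as an unmatched chunk, which is the kind of portion the abstract says is handled by the forward-backward matching step.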
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Islam, M.A., Inkpen, D., Kiringa, I. (2007). A Generalized Approach to Word Segmentation Using Maximum Length Descending Frequency and Entropy Rate. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2007. Lecture Notes in Computer Science, vol 4394. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-70939-8_16
DOI: https://doi.org/10.1007/978-3-540-70939-8_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-70938-1
Online ISBN: 978-3-540-70939-8
eBook Packages: Computer Science, Computer Science (R0)