A Generalized Approach to Word Segmentation Using Maximum Length Descending Frequency and Entropy Rate
- Cite this paper as:
- Islam M.A., Inkpen D., Kiringa I. (2007) A Generalized Approach to Word Segmentation Using Maximum Length Descending Frequency and Entropy Rate. In: Gelbukh A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2007. Lecture Notes in Computer Science, vol 4394. Springer, Berlin, Heidelberg
In this paper, we formulate a generalized method of automatic word segmentation. The method uses corpus type frequency information to choose the type with maximum length and frequency from “desegmented” text. For any portions of the text that remain unmatched, it applies a modified forward-backward matching technique that uses maximum length frequency and entropy rate. The method can also be extended to a dictionary-based or hybrid method with some additions to the algorithms. Evaluation results show that our method outperforms several competing methods.
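
As a rough illustration of the first idea in the abstract, the sketch below orders corpus types by length and then by frequency (both descending) and greedily claims their occurrences in unsegmented text. It is not the authors' algorithm: the function name `segment`, the `min_len` parameter, and the toy corpus are assumptions made for this example, and unmatched runs are simply returned as-is rather than resolved with forward-backward matching and entropy rate.

```python
from collections import Counter


def segment(text, type_freq, min_len=2):
    """Greedy sketch: claim the longest, then most frequent, types first.

    `type_freq` maps each corpus type (word form) to its frequency.
    `min_len` is a hypothetical knob that ignores very short types.
    """
    # Order candidate types by length (descending), then frequency (descending).
    ordered = sorted(
        (t for t in type_freq if len(t) >= min_len),
        key=lambda t: (-len(t), -type_freq[t]),
    )

    # owner[i] holds the start index of the match covering position i, or None.
    owner = [None] * len(text)
    for word in ordered:
        pos = text.find(word)
        while pos != -1:
            end = pos + len(word)
            if all(o is None for o in owner[pos:end]):
                # The span is free, so this type claims it.
                for i in range(pos, end):
                    owner[i] = pos
                pos = text.find(word, end)
            else:
                # Overlaps an earlier (longer or more frequent) match; move on.
                pos = text.find(word, pos + 1)

    # Emit matched spans and contiguous unmatched runs in text order.
    segments = []
    i = 0
    while i < len(text):
        j = i + 1
        while j < len(text) and owner[j] == owner[i]:
            j += 1
        segments.append(text[i:j])
        i = j
    return segments


# Toy corpus (an assumption for illustration only).
corpus = "the cat sat on the mat the cat ran"
freq = Counter(corpus.split())

print(segment("thecatsatonthemat", freq))
# ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(segment("thexyzcat", freq))
# ['the', 'xyz', 'cat']
```

In the second call, `xyz` matches no known type and comes back as a single unmatched run; according to the abstract, such portions are where the modified forward-backward matching step would apply.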