A Generalized Approach to Word Segmentation Using Maximum Length Descending Frequency and Entropy Rate

  • Md. Aminul Islam
  • Diana Inkpen
  • Iluju Kiringa
Conference paper

DOI: 10.1007/978-3-540-70939-8_16

Part of the Lecture Notes in Computer Science book series (LNCS, volume 4394)
Cite this paper as:
Islam M.A., Inkpen D., Kiringa I. (2007) A Generalized Approach to Word Segmentation Using Maximum Length Descending Frequency and Entropy Rate. In: Gelbukh A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2007. Lecture Notes in Computer Science, vol 4394. Springer, Berlin, Heidelberg

Abstract

In this paper, we formulate a generalized method of automatic word segmentation. The method uses corpus type frequency information to choose the type with maximum length and frequency from “desegmented” text. For any portions of the text left unmatched, it applies a modified forward-backward matching technique based on maximum length frequency and entropy rate. With some additions to the algorithms, the method can also be extended to a dictionary-based or hybrid method. Evaluation results show that our method outperforms several competing methods.
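
The first step described in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm, which the abstract does not specify in detail; it is one plausible reading, assuming that known types are tried in order of descending length and then descending corpus frequency, and that each type claims any occurrence in the unsegmented text that does not overlap an earlier match. The function name `segment` and the `type_freq` mapping are hypothetical, and the forward-backward matching and entropy-rate step for unmatched portions is deliberately left out.

```python
from collections import Counter


def segment(text, type_freq):
    """Greedy sketch: longer, then more frequent, types claim text first.

    `type_freq` maps each known word type to its corpus frequency.
    Illustrative assumption only, not the procedure from the paper.
    """
    # Order types by length (descending), then frequency (descending).
    ordered = sorted(
        (t for t in type_freq if t),  # skip empty strings
        key=lambda t: (-len(t), -type_freq[t]),
    )
    claimed = [False] * len(text)  # characters already assigned to a word
    spans = []                     # (start, end) of every accepted match

    for t in ordered:
        start = text.find(t)
        while start != -1:
            end = start + len(t)
            if not any(claimed[start:end]):
                spans.append((start, end))
                claimed[start:end] = [True] * len(t)
                start = text.find(t, end)
            else:
                # Overlaps an earlier match; try the next occurrence.
                start = text.find(t, start + 1)

    # Rebuild the text in order; unmatched runs are kept as single chunks,
    # since the second-stage matching is not sketched here.
    spans.sort()
    words, pos = [], 0
    for start, end in spans:
        if start > pos:
            words.append(text[pos:start])
        words.append(text[start:end])
        pos = end
    if pos < len(text):
        words.append(text[pos:])
    return words


# Toy corpus used only for illustration.
freqs = Counter("the cat sat on the mat".split())
print(segment("thecatsatonthemat", freqs))
```

In this sketch, any stretch of text that no known type covers is returned unsplit; in the method the abstract describes, such stretches would instead be handled by the modified forward-backward matching step.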


Copyright information

© Springer Berlin Heidelberg 2007

Authors and Affiliations

  • Md. Aminul Islam
    • 1
  • Diana Inkpen
    • 1
  • Iluju Kiringa
    • 1
  1. School of Information Technology and Engineering, University of Ottawa, Ottawa, ON, K1N 6N5, Canada
