Chapter

Computational Linguistics and Intelligent Text Processing

Volume 4394 of the series Lecture Notes in Computer Science, pp. 175-185

A Generalized Approach to Word Segmentation Using Maximum Length Descending Frequency and Entropy Rate

  • Md. Aminul Islam, School of Information Technology and Engineering, University of Ottawa, Ottawa, ON, K1N 6N5
  • Diana Inkpen, School of Information Technology and Engineering, University of Ottawa, Ottawa, ON, K1N 6N5
  • Iluju Kiringa, School of Information Technology and Engineering, University of Ottawa, Ottawa, ON, K1N 6N5

Abstract

In this paper, we formulate a generalized method of automatic word segmentation. The method uses corpus type frequency information to choose the type with maximum length and frequency from “desegmented” text. It also uses a modified forward-backward matching technique using maximum length frequency and entropy rate if any non-matching portions of the text exist. The method is also extendible to a dictionary-based or hybrid method with some additions to the algorithms. Evaluation results show that our method outperforms several competing methods.
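
The abstract gives no algorithmic detail, so the following is only a rough sketch of the general idea of matching the longest known type in unsegmented text in both directions and using corpus frequency to choose between the results. It is not the authors' algorithm: the function names, the toy frequency table, and the scoring rule (fewer tokens first, then higher total frequency) are all assumptions made for illustration, and the entropy-rate step mentioned in the abstract is left out entirely.

```python
def forward_match(text, freq, max_len):
    """Scan left to right, taking the longest known type at each position.
    A single character is emitted when nothing in `freq` matches."""
    tokens, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + n]
            if cand in freq or n == 1:
                tokens.append(cand)
                i += n
                break
    return tokens


def backward_match(text, freq, max_len):
    """Same idea as forward_match, but scanning right to left."""
    tokens, j = [], len(text)
    while j > 0:
        for n in range(min(max_len, j), 0, -1):
            cand = text[j - n:j]
            if cand in freq or n == 1:
                tokens.append(cand)
                j -= n
                break
    tokens.reverse()
    return tokens


def segment(text, freq):
    """Run both passes and keep the better one.
    Assumed scoring rule: fewer tokens wins; ties go to the
    segmentation whose types have the higher total frequency."""
    max_len = max(map(len, freq), default=1)
    candidates = (
        forward_match(text, freq, max_len),
        backward_match(text, freq, max_len),
    )

    def score(tokens):
        return (-len(tokens), sum(freq.get(t, 0) for t in tokens))

    return max(candidates, key=score)


# Made-up type frequencies, standing in for counts taken from a corpus.
toy_freq = {
    "the": 50, "table": 8, "down": 12, "there": 20,
    "theta": 1, "bled": 1, "own": 6,
}

print(forward_match("thetabledownthere", toy_freq, 5))
# ['theta', 'bled', 'own', 'there']
print(segment("thetabledownthere", toy_freq))
# ['the', 'table', 'down', 'there']
```

In this toy case the forward pass is misled by the longer type "theta", while the backward pass recovers "the table down there". Both candidates have four tokens, so the frequency totals decide between them.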