A Hybrid Approach to Word Segmentation of Vietnamese Texts

  • Lê Hồng Phương
  • Nguyễn Thị Minh Huyền
  • Azim Roussanaly
  • Hồ Tường Vinh
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5196)

Abstract

We present in this article a hybrid approach to the automatic tokenization of Vietnamese text. The approach combines a finite-state automata technique and regular-expression parsing with a maximal-matching strategy, augmented by statistical methods to resolve segmentation ambiguities. The Vietnamese lexicon in use is compactly represented by a minimal finite-state automaton. A text to be tokenized is first parsed into lexical phrases and other patterns using predefined regular expressions. The automaton is then deployed to build linear graphs corresponding to the phrases to be segmented. Applying the maximal-matching strategy to a graph yields all candidate segmentations of a phrase, and an ambiguity resolver, which uses a smoothed bigram language model, chooses the most probable one. The hybrid approach is implemented to create vnTokenizer, a highly accurate tokenizer for Vietnamese texts.
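To make the pipeline concrete, here is a minimal Python sketch of the stages the abstract describes: lexicon lookup, candidate generation by maximal matching, and bigram-based disambiguation. It is an illustration under simplifying assumptions, not the authors' vnTokenizer: a plain set stands in for the minimal finite-state automaton, whitespace splitting stands in for the regular-expression pass, and add-one smoothing stands in for the paper's smoothed bigram model. All identifiers and counts (LEXICON, UNIGRAMS, BIGRAMS, best_segmentation, ...) are hypothetical.

    import math

    # Toy lexicon; a real system stores it in a minimal acyclic finite-state
    # automaton, but a set gives the same membership test for illustration.
    # Multi-syllable entries are Vietnamese compound words.
    LEXICON = {"là", "học", "sinh", "học sinh", "giỏi"}
    MAX_SYLLABLES = 3  # longest lexicon entry, bounds the matching window

    # Hypothetical counts standing in for a model trained on a segmented corpus.
    UNIGRAMS = {"là": 8, "học": 2, "sinh": 2, "học sinh": 10, "giỏi": 6}
    BIGRAMS = {("là", "học sinh"): 4, ("học sinh", "giỏi"): 5}

    def bigram_logprob(prev, word):
        """Add-one smoothed bigram log-probability (a simple stand-in for the
        smoothed model described in the abstract)."""
        num = BIGRAMS.get((prev, word), 0) + 1
        den = UNIGRAMS.get(prev, 0) + len(UNIGRAMS)
        return math.log(num / den)

    def segmentations(syllables):
        """Enumerate candidate segmentations: from each position, try every
        window of syllables that forms a lexicon entry. This plays the role
        of maximal matching over the linear graph of a phrase."""
        if not syllables:
            yield []
            return
        for n in range(min(MAX_SYLLABLES, len(syllables)), 0, -1):
            word = " ".join(syllables[:n])
            if word in LEXICON or n == 1:  # unknown single syllables pass through
                for rest in segmentations(syllables[n:]):
                    yield [word] + rest

    def best_segmentation(phrase):
        """Score each candidate with the bigram model; keep the most probable."""
        syllables = phrase.split()  # the real regex pass also isolates numbers, dates, ...
        def score(seg):
            return sum(bigram_logprob(p, w) for p, w in zip(["<s>"] + seg, seg))
        return max(segmentations(syllables), key=score)

    print(best_segmentation("là học sinh giỏi"))  # "is a good student"
    # expected: ['là', 'học sinh', 'giỏi'] -- the compound 'học sinh' stays whole

Note that a segmentation into longer words contributes fewer bigram terms to the score, so the scoring shares maximal matching's preference for compounds such as 'học sinh' ('student'), while the corpus counts decide genuinely ambiguous cases.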

Keywords

Hybrid Approach · Compound Word · Smoothing Technique · Word Segmentation · Lexical Unit

Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Lê Hồng Phương (1)
  • Nguyễn Thị Minh Huyền (2)
  • Azim Roussanaly (1)
  • Hồ Tường Vinh (3)
  1. LORIA, Nancy, France
  2. Vietnam National University, Hanoi, Vietnam
  3. IFI, Hanoi, Vietnam