A Hybrid Approach to Word Segmentation of Vietnamese Texts

Hông Phuong, Lê; Thi Minh Huyên, Nguyên; Roussanaly, Azim; Vinh, Hô Tuòng

doi:10.1007/978-3-540-88282-4_23

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 5196)

Included in the following conference series:

International Conference on Language and Automata Theory and Applications

Abstract

We present a hybrid approach to automatically tokenizing Vietnamese text. The approach combines finite-state automata, regular expression parsing, and a maximal-matching strategy augmented by statistical methods that resolve segmentation ambiguities. The Vietnamese lexicon in use is compactly represented by a minimal finite-state automaton. A text to be tokenized is first parsed into lexical phrases and other patterns using predefined regular expressions. The automaton is then used to build linear graphs corresponding to the phrases to be segmented. Applying a maximal-matching strategy to a graph yields all candidate segmentations of a phrase. An ambiguity resolver, which uses a smoothed bigram language model, then chooses the most probable segmentation of the phrase. The hybrid approach is implemented in vnTokenizer, a highly accurate tokenizer for Vietnamese texts.
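
The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the vnTokenizer implementation and rests on several assumptions: a Python set stands in for the minimal finite-state automaton, the regular-expression pre-parsing step is replaced by whitespace splitting, "maximal matching" is read as keeping the graph paths with the fewest words, and the smoothed bigram model is a simple linear interpolation. The lexicon, the counts, and all function names (build_graph, shortest_segmentations, log_prob, segment) are hypothetical.

```python
import math
from collections import defaultdict

# Hypothetical toy lexicon; a set stands in for the minimal automaton.
LEXICON = {"học", "sinh", "học sinh", "sinh học"}

# Hypothetical word and word-pair counts for the bigram model.
UNIGRAM = {"học sinh": 5, "học": 4, "sinh học": 3}
BIGRAM = {("học sinh", "học"): 3, ("học", "sinh học"): 2}


def build_graph(syllables, lexicon, max_len=4):
    """Linear graph over syllable boundaries: edge i -> j exists when
    syllables[i:j] is a lexicon word, or is a single syllable."""
    n = len(syllables)
    edges = defaultdict(list)
    for i in range(n):
        for j in range(i + 1, min(n, i + max_len) + 1):
            if j == i + 1 or " ".join(syllables[i:j]) in lexicon:
                edges[i].append(j)
    return edges


def shortest_segmentations(syllables, edges):
    """Return every path from node 0 to node n that uses the fewest edges,
    i.e. every segmentation with the fewest words."""
    n = len(syllables)
    best = [math.inf] * (n + 1)  # best[i]: fewest words covering syllables[i:]
    best[n] = 0
    for i in range(n - 1, -1, -1):
        best[i] = 1 + min(best[j] for j in edges[i])

    def expand(i):
        if i == n:
            yield []
            return
        for j in edges[i]:
            if best[j] == best[i] - 1:  # stay on a shortest path
                for rest in expand(j):
                    yield [" ".join(syllables[i:j])] + rest

    return list(expand(0))


def log_prob(words, lam=0.7):
    """Log-probability under an interpolated bigram model:
    P(w | prev) = lam * c(prev, w) / c(prev) + (1 - lam) * P_unigram(w),
    with add-one smoothing on the unigram term."""
    total = sum(UNIGRAM.values())
    vocab = len(UNIGRAM) + 1  # one extra slot for unseen words
    score, prev = 0.0, "<s>"
    for w in words:
        p_uni = (UNIGRAM.get(w, 0) + 1) / (total + vocab)
        c_prev = UNIGRAM.get(prev, 0)
        p_bi = BIGRAM.get((prev, w), 0) / c_prev if c_prev else 0.0
        score += math.log(lam * p_bi + (1 - lam) * p_uni)
        prev = w
    return score


def segment(phrase):
    """Segment one phrase: build the graph, list candidates, pick the best."""
    syllables = phrase.split()
    candidates = shortest_segmentations(syllables, build_graph(syllables, LEXICON))
    return max(candidates, key=log_prob)


if __name__ == "__main__":
    print(segment("học sinh học sinh học"))
```

For the phrase "học sinh học sinh học", the graph admits three three-word candidates, and with these toy counts the resolver selects "học sinh | học | sinh học". Different counts or a different interpolation weight could change that choice.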

Author information

Authors and Affiliations

LORIA, Nancy, France
Lê Hông Phuong & Azim Roussanaly
Vietnam National University, Hanoi, Vietnam
Nguyên Thi Minh Huyên
IFI, Hanoi, Vietnam
Hô Tuòng Vinh

Editor information

Editors and Affiliations

Research Group on Mathematical Linguistics, Rovira i Virgili University, Plaza Imperial Tàrraco 1, 43005, Tarragona, Spain
Carlos Martín-Vide
Fachbereich Elektrotechnik/Informatik, Universität Kassel, Wilhelmshöher Allee 73, 34121, Kassel, Germany
Friedrich Otto
Fachbereich 4, Abteilung Informatik/Wirtschaftsinformatik, Universität Trier, Campus II, Gebäude H, 54286, Trier, Germany
Henning Fernau

Cite this paper

Hông Phuong, L., Thi Minh Huyên, N., Roussanaly, A., Vinh, H.T. (2008). A Hybrid Approach to Word Segmentation of Vietnamese Texts. In: Martín-Vide, C., Otto, F., Fernau, H. (eds) Language and Automata Theory and Applications. LATA 2008. Lecture Notes in Computer Science, vol 5196. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-88282-4_23

DOI: https://doi.org/10.1007/978-3-540-88282-4_23
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-88281-7
Online ISBN: 978-3-540-88282-4
eBook Packages: Computer Science, Computer Science (R0)
