Abstract
We present a hybrid approach to the automatic tokenization of Vietnamese text. The approach combines finite-state automata, regular expression parsing, and a maximal-matching strategy augmented by statistical methods to resolve segmentation ambiguities. The Vietnamese lexicon is compactly represented by a minimal finite-state automaton. A text to be tokenized is first parsed into lexical phrases and other patterns using predefined regular expressions. The automaton is then used to build linear graphs corresponding to the phrases to be segmented. Applying the maximal-matching strategy to a graph yields all candidate segmentations of a phrase. An ambiguity resolver, which uses a smoothed bigram language model, then chooses the most probable segmentation of the phrase. The hybrid approach is implemented in vnTokenizer, a highly accurate tokenizer for Vietnamese texts.
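The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the vnTokenizer implementation, and everything in it is an assumption made for illustration: a plain set (`LEXICON`) stands in for the minimal finite-state automaton, the regular expression `CHUNK_RE` is a hypothetical stand-in for the predefined patterns, maximal matching is read here as "keep the paths with the fewest words" in the linear graph, and the bigram model is smoothed by simple linear interpolation with an add-one unigram estimate trained on three made-up sentences.

```python
import math
import re
import unicodedata
from collections import Counter


def nfc(s):
    """Normalize to NFC so accented Vietnamese letters compare consistently."""
    return unicodedata.normalize("NFC", s)


# Hypothetical toy lexicon; a plain set stands in for the minimal automaton.
LEXICON = {nfc(w) for w in ["học", "sinh", "học sinh", "sinh học", "bài", "tôi", "đi"]}

# Hypothetical pre-segmented training sentences for the bigram model.
TRAINING = [[nfc(w) for w in s] for s in [
    ["học sinh", "học", "bài"],
    ["tôi", "học", "sinh học"],
    ["học sinh", "đi", "học"],
]]

# Numbers/dates, runs of alphabetic syllables (phrases), or single symbols.
CHUNK_RE = re.compile(r"\d+(?:[.,/]\d+)*|[^\W\d_]+(?:\s+[^\W\d_]+)*|[^\w\s]|_")


def candidate_segmentations(syllables, lexicon):
    """Build the linear graph over syllable positions and return its shortest paths."""
    n = len(syllables)
    # Edge i -> j exists if syllables[i:j] is a lexicon word or a single syllable.
    edges = {
        i: [j for j in range(i + 1, n + 1)
            if j == i + 1 or " ".join(syllables[i:j]) in lexicon]
        for i in range(n)
    }
    paths = []

    def walk(i, words):
        if i == n:
            paths.append(words)
            return
        for j in edges[i]:
            walk(j, words + [" ".join(syllables[i:j])])

    walk(0, [])
    fewest = min(len(p) for p in paths)
    return [p for p in paths if len(p) == fewest]


class BigramModel:
    """Bigram model with linear interpolation (weight lam on the bigram term)."""

    def __init__(self, sentences, lam=0.7):
        self.lam = lam
        self.unigrams = Counter()
        self.bigrams = Counter()
        for sent in sentences:
            tokens = ["<s>"] + sent + ["</s>"]
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.total = sum(self.unigrams.values())
        self.vocab = len(self.unigrams)

    def prob(self, prev, word):
        # Add-one unigram estimate keeps unseen words at a non-zero probability.
        p_uni = (self.unigrams[word] + 1) / (self.total + self.vocab + 1)
        seen_prev = self.unigrams[prev]
        p_bi = self.bigrams[(prev, word)] / seen_prev if seen_prev else 0.0
        return self.lam * p_bi + (1 - self.lam) * p_uni

    def log_prob(self, words):
        tokens = ["<s>"] + words + ["</s>"]
        return sum(math.log(self.prob(a, b)) for a, b in zip(tokens, tokens[1:]))


def tokenize(text, lexicon, model):
    """Split text into chunks by regex, then segment each phrase chunk."""
    tokens = []
    for chunk in CHUNK_RE.findall(nfc(text)):
        if chunk[0].isalpha():
            candidates = candidate_segmentations(chunk.split(), lexicon)
            tokens.extend(max(candidates, key=model.log_prob))
        else:
            tokens.append(chunk)
    return tokens


MODEL = BigramModel(TRAINING)
print(tokenize("học sinh học sinh học, 2008.", LEXICON, MODEL))
```

For the ambiguous phrase "học sinh học sinh học", the graph step yields three three-word candidates, and with this toy training data the resolver prefers "học sinh | học | sinh học". A real system would replace the set with an automaton lookup and train the model on a large segmented corpus.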
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Lê, H.P., Nguyễn, T.M.H., Roussanaly, A., Hồ, T.V. (2008). A Hybrid Approach to Word Segmentation of Vietnamese Texts. In: Martín-Vide, C., Otto, F., Fernau, H. (eds) Language and Automata Theory and Applications. LATA 2008. Lecture Notes in Computer Science, vol 5196. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-88282-4_23
DOI: https://doi.org/10.1007/978-3-540-88282-4_23
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-88281-7
Online ISBN: 978-3-540-88282-4
eBook Packages: Computer Science, Computer Science (R0)