An Unsupervised Learning and Statistical Approach for Vietnamese Word Recognition and Segmentation

  • Conference paper
Intelligent Information and Database Systems (ACIIDS 2010)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5991)

Abstract

This paper addresses two main topics: (i) recognizing Vietnamese words and segmenting sentences into words by using probabilistic models; and (ii) constructing the optimal probabilistic model through an unsupervised learning process. For each probabilistic model, new words are recognized and their syllables are linked together. The syllable-linking process improves the accuracy of the statistical functions, which in turn improves the recognition of new words. Hence, the probabilistic model converges to the optimal one.

Our experimental corpus is built from about 250,000 online news articles, which contain about 19,000,000 sentences. The accuracy of the segmentation algorithm is over 90%. Our Vietnamese word and phrase dictionary contains more than 150,000 entries.
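
The iterative process described in the abstract (estimate statistics, link the syllables of newly recognized words, then re-estimate) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the statistical functions, so pointwise mutual information (PMI) between adjacent units is assumed here as a stand-in, and the function names `link_pass` and `learn`, the `threshold` and `min_count` parameters, and the toy corpus are all hypothetical.

```python
import math
from collections import Counter


def link_pass(sentences, threshold, min_count):
    """One pass: count units and adjacent pairs, then link strongly associated pairs."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
    total = sum(unigrams.values())

    def pmi(a, b):
        # ASSUMPTION: PMI stands in for the paper's unspecified statistical functions.
        # Rough estimate: pair and unit counts share one normaliser.
        p_ab = bigrams[(a, b)] / total
        return math.log(p_ab / ((unigrams[a] / total) * (unigrams[b] / total)))

    linked = {
        pair for pair, n in bigrams.items()
        if n >= min_count and pmi(*pair) > threshold
    }

    relinked = []
    for sent in sentences:
        out, i = [], 0
        while i < len(sent):
            # Greedy left-to-right linking: on overlap, the first matching pair wins.
            if i + 1 < len(sent) and (sent[i], sent[i + 1]) in linked:
                out.append(sent[i] + "_" + sent[i + 1])
                i += 2
            else:
                out.append(sent[i])
                i += 1
        relinked.append(out)
    return relinked, linked


def learn(sentences, threshold=1.0, min_count=2, max_iters=10):
    """Repeat link passes until no new pair is linked or max_iters is reached."""
    lexicon = set()
    for _ in range(max_iters):
        sentences, linked = link_pass(sentences, threshold, min_count)
        if not linked:  # no new words recognized: the model has stopped changing
            break
        lexicon |= {a + "_" + b for a, b in linked}
    return sentences, lexicon


# Hypothetical toy corpus: each sentence is a list of syllables.
corpus = [
    ["học", "sinh", "đi", "học"],
    ["học", "sinh", "ăn", "cơm"],
    ["tôi", "đi", "ăn", "cơm"],
]
segmented, lexicon = learn(corpus)
print(segmented)
print(sorted(lexicon))
```

Because linked units are fed back into the counts, a later pass can join an already linked unit with a neighbouring syllable, which is one way longer words and phrases could emerge. On a corpus of realistic size, the threshold and minimum count would need tuning against held-out data.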

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Le Trung, H., Le Anh, V., Le Trung, K. (2010). An Unsupervised Learning and Statistical Approach for Vietnamese Word Recognition and Segmentation. In: Nguyen, N.T., Le, M.T., Świątek, J. (eds) Intelligent Information and Database Systems. ACIIDS 2010. Lecture Notes in Computer Science, vol 5991. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12101-2_21

  • DOI: https://doi.org/10.1007/978-3-642-12101-2_21

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-12100-5

  • Online ISBN: 978-3-642-12101-2

  • eBook Packages: Computer Science, Computer Science (R0)
