Part-of-Speech Induction for Vietnamese

Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 245)


This paper presents a method for automatically inducing the parts of speech of the Vietnamese language from a large text corpus. We first build a class-based bigram language model using several statistical algorithms that assign words to classes based on their ability to combine with neighbouring words. We then show that this model is able to extract word classes that have the flavor of either syntactically based or semantically based groupings of Vietnamese words, two approaches that have long been disputed within the Vietnamese linguistics community. Finally, we evaluate the quality of the word clusters quantitatively by using word cluster features to improve the accuracy of a statistical part-of-speech tagger for Vietnamese.
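As a rough illustration of what assigning words to classes by their ability to combine with neighbouring words can look like, the sketch below scores a word-to-class assignment by the average mutual information (AMI) between the classes of adjacent tokens, then greedily merges classes while keeping that score as high as possible, in the spirit of Brown et al. (1992). It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not say which algorithms were used, the function names are hypothetical, the toy corpus is invented, the input is assumed to be already word-segmented, and sentence boundaries are ignored.

```python
import math
from collections import Counter
from itertools import combinations


def average_mutual_information(tokens, word_to_class):
    """AMI (in nats) between the classes of adjacent tokens."""
    classes = [word_to_class[w] for w in tokens]
    bigrams = Counter(zip(classes, classes[1:]))
    total = sum(bigrams.values())
    if total == 0:
        return 0.0
    left, right = Counter(), Counter()
    for (c1, c2), n in bigrams.items():
        left[c1] += n   # how often c1 occurs as the first class of a bigram
        right[c2] += n  # how often c2 occurs as the second class of a bigram
    ami = 0.0
    for (c1, c2), n in bigrams.items():
        # p(c1,c2) * log( p(c1,c2) / (p(c1) * p(c2)) ), written with raw counts
        ami += (n / total) * math.log(n * total / (left[c1] * right[c2]))
    return ami


def merge_once(tokens, word_to_class):
    """Merge the pair of classes whose merge leaves the highest AMI."""
    labels = sorted(set(word_to_class.values()))
    best = None
    for a, b in combinations(labels, 2):
        candidate = {w: (a if c == b else c) for w, c in word_to_class.items()}
        score = average_mutual_information(tokens, candidate)
        if best is None or score > best[0]:
            best = (score, candidate)
    return best[1] if best else dict(word_to_class)


def induce_classes(tokens, num_classes):
    """Start with one class per word type and merge down to num_classes."""
    if num_classes < 1:
        raise ValueError("num_classes must be at least 1")
    mapping = {w: w for w in set(tokens)}
    while len(set(mapping.values())) > num_classes:
        mapping = merge_once(tokens, mapping)
    return mapping


# Invented toy corpus of pre-segmented tokens (illustration only).
tokens = "tôi đọc sách anh mua báo tôi mua sách anh đọc báo".split()
classes = induce_classes(tokens, 3)
```

This brute-force version recomputes the score for every candidate merge, so it is only practical for toy vocabularies; a real corpus would need the incremental bookkeeping that published clustering tools use.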


Keywords: Average Mutual Information; Word Segmentation; Word Class; Word Cluster; Large Text Corpus





Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  1. University of Science, Vietnam National University, Hanoi, Vietnam
