A lexicon for Vietnamese language processing

Language Resources and Evaluation

Abstract

Vietnamese researchers have only recently become involved in Natural Language Processing (NLP). Because there is no published work in formal linguistics and no recognized standard for Vietnamese word definition and word categories, fundamental tasks in automatic Vietnamese language processing, such as part-of-speech tagging and parsing, are very difficult for computer scientists. The fact that each research team has to build all the necessary linguistic resources from scratch is a real obstacle to the development of Vietnamese language processing. The aim of our projects is thus to build a common linguistic database that is freely and easily exploitable for the automatic processing of Vietnamese. In this paper, we present our work on creating a Vietnamese lexicon for NLP applications, with an emphasis on the standardization of the lexicon representation. In particular, we propose an extensible set of Vietnamese syntactic descriptions that can be used for tagset definition and morphosyntactic analysis. These descriptors are designed to serve as a proposed reference set for Vietnamese in the context of ISO subcommittee TC 37/SC 4 (Language Resource Management).

Notes

  1. Laboratoire Lorrain de Recherche en Informatique et ses Applications http://www.led.loria.fr/outils.php

  2. However, due to copyright restrictions, we cannot publish other information from the print dictionary, such as the definitions, examples, etc.

  3. cf. the project forum at http://www.viettreebank.co.

Acknowledgements

This work would not have been possible without the enthusiastic collaboration of the linguists at the Vietnam Lexicography Centre, especially Hoàng Thị Tuyền Linh, Ðặng Thanh Hoà, Ðào Minh Thu and Phạm Thị Thuỷ, to whom we are very grateful. We also thank Nguyễn Thành Bôn for his contribution to the development of the various tools.

Corresponding author

Correspondence to Thị Minh Huyền Nguyễn.

About this article

Cite this article

Nguyễn, T.M.H., Romary, L., Rossignol, M. et al. A lexicon for Vietnamese language processing. Lang Resources & Evaluation 40, 291–309 (2006). https://doi.org/10.1007/s10579-007-9034-8

  • DOI: https://doi.org/10.1007/s10579-007-9034-8
