Vietnamese treebank construction and entropy-based error detection

Abstract

Treebanks, especially the Penn Treebank for English, play an essential role in both research into and the application of natural language processing (NLP). However, many languages still lack treebanks, and building one can be complicated and difficult. This work has a twofold objective. The first is to share our results in constructing a large Vietnamese treebank (VTB) with three levels of annotation: word segmentation, part-of-speech tagging, and syntactic analysis. The major steps in the construction process are described with particular regard to properties specific to Vietnamese, such as the lack of word delimiters and the isolating nature of the language. These properties make sentences highly syntactically ambiguous, so it is difficult to ensure a high level of agreement among annotators. Various studies of Vietnamese syntax were employed not only to define the annotations but also to deal with ambiguities systematically. Annotators were supported by automatic labelling tools, based on statistical machine learning methods, for sentence pre-processing, and by a tree editor for manual annotation. As a result, an annotation agreement of around 90 % was achieved. Our second objective is to present our method for automatically finding errors and inconsistencies in treebank corpora and its application to the construction of the VTB. The method uses the Shannon entropy measure on the premise that the more the entropy of a treebank is reduced, the more errors have been corrected. It ranks error candidates using a scoring function based on conditional entropy. Our experiments showed that the method detected high-error-density subsets of the original error candidate sets, and that the corpus entropy was significantly reduced after error correction. These subsets were only about one third the size of the full candidate sets, yet they contained 80–90 % of the total errors. The method can also be applied to languages similar to Vietnamese.
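To make the idea of ranking error candidates by conditional entropy concrete, the following is a minimal sketch and not the scoring function used in this work, whose exact form is not given in the abstract. It assumes that an error candidate is an n-gram observed with more than one label, and that each candidate is scored by its contribution \(p(\text{n-gram}) \cdot H(\text{label} \mid \text{n-gram})\) to the conditional entropy of labels given n-grams. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def label_entropy(label_counts):
    """Shannon entropy (in bits) of a label distribution given as counts."""
    total = sum(label_counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in label_counts.values())


def rank_variation_ngrams(instances):
    """Rank n-grams that carry more than one label.

    instances: iterable of (ngram, label) pairs, one per corpus occurrence.
    Returns a list of (ngram, score) sorted by decreasing score, where
    score = p(ngram) * H(label | ngram). This scoring is an assumption
    made for illustration only.
    """
    by_ngram = defaultdict(Counter)
    for ngram, label in instances:
        by_ngram[ngram][label] += 1

    total = sum(sum(counts.values()) for counts in by_ngram.values())
    scored = []
    for ngram, counts in by_ngram.items():
        if len(counts) < 2:
            continue  # always labelled the same way: not a candidate
        p_ngram = sum(counts.values()) / total
        scored.append((ngram, p_ngram * label_entropy(counts)))

    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored


# Hypothetical toy corpus: "a b" is labelled X or Y equally often,
# "c d" is mostly X, and "e f" is always X.
toy = ([("a b", "X")] * 2 + [("a b", "Y")] * 2 +
       [("c d", "X")] * 3 + [("c d", "Z")] +
       [("e f", "X")] * 2)

for ngram, score in rank_variation_ngrams(toy):
    print(ngram, round(score, 4))
```

In this sketch the evenly split n-gram ranks above the mostly consistent one, and the consistently labelled n-gram is never proposed. Correcting a genuine error in a high-ranked candidate lowers the summed score, which is the sense in which entropy reduction tracks error correction.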

Notes

  1. Multi-version treebank publishing has several purposes: error correction, annotation scheme modification, and data addition. For example, the major changes in the upgrade of the Penn English Treebank (PTB; Marcus et al. 1993) from version I to version II include POS tagging error correction and predicate-argument structure labelling. In the upgrade from version II to version III, more data was added.

  2. This choice emphasizes the similarity between Chinese and other languages.

  3. JJ: adjective, NN: noun.

  4. Note that before Dickinson, van Halteren (2000) pointed out that POS taggers can be used to enforce consistency.

  5. ADVP: adverbial phrase, RB: adverb.

  6. Steedman et al. (2003) showed that a training set of around 10,000 syntactic trees was sufficient for English parsing, since larger training sets yielded only small improvements in parsing performance (as tested on Collins’ parser).

  7. http://vlsp.vietlp.org:8080/demo/

  8. This term has the same meaning as the term ‘variation nuclei’ in Dickinson and Meurers (2003). In our paper, a variation n-gram is an n-gram whose labelling varies because of ambiguity or annotation error. Contextual information, such as surrounding words, is not included in an n-gram.

  9. Online versions at: http://ir.library.osaka-u.ac.jp/metadb/up/LIBRIWLK01/riwl_001_019.pdf; http://www.sealang.net/archives/mks/THOMPSONLaurenceC.htm

  10. They may have a meaning (‘’, ‘hàn\(_{cold}\)’) or not (‘lẽo’, ‘nhánh’).

  11. The other approach is joint processing, in which all tasks are carried out simultaneously.

  12. This classification is widely accepted in the Vietnamese linguistic community.

  13. This term comes from the fact that the design of the Penn Treebank tag set was based on a simplification of the Brown Corpus tag set.

  14. http://www.cis.upenn.edu/dbikel/software.html

  15. http://vlsp.vietlp.org:8080/demo/

  16. The two points nearest to the vertical axis give the number of variation n-grams that have no erroneous instances.

  17. Using \(p(x_{1}, x_{2}, \ldots , x_{K})=Freq(x_{1}, x_{2}, \ldots , x_{K})/L\), the value of empirical entropy reduction was 173.49 on the word-segmented data set.
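The relative-frequency estimate in the last note can be written out as follows. This is a sketch under stated assumptions: that "empirical entropy" means Shannon entropy computed from these relative frequencies, and that "reduction" means the difference between the values before and after error correction. The symbols \(\hat{H}\) and \(\Delta\hat{H}\), and any scaling or choice of logarithm base behind the reported figure, are not specified in this excerpt and are introduced here only for illustration.

\[
\begin{aligned}
p(x_{1}, \ldots , x_{K}) &= \frac{Freq(x_{1}, \ldots , x_{K})}{L},\\
\hat{H} &= -\sum_{x_{1}, \ldots , x_{K}} p(x_{1}, \ldots , x_{K}) \log p(x_{1}, \ldots , x_{K}),\\
\Delta\hat{H} &= \hat{H}_{\text{before}} - \hat{H}_{\text{after}}.
\end{aligned}
\]

Under this reading, a positive \(\Delta\hat{H}\) indicates that the corrected corpus is labelled more consistently than the original.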

References

  1. Awate, S. P., & Whitaker, R. T. (2006). Unsupervised, information-theoretic, adaptive image filtering for image restoration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28, 364–376.

  2. Berger, A., Pietra, S. D., & Pietra, V. D. (1996). A maximum entropy approach to natural language processing. Computational Linguistics, 22(1), 39–71.

  3. Black, E., Abney, S., Flickenger, D., Gdaniec, C., Grishman, R., Harrison, P., et al. (1991). A procedure for quantitatively comparing the syntactic coverage of English grammars. In Proceedings of DARPA speech and natural language workshop.

  4. Cao, X.-H. (2007). The Vietnamese language: Phonetics, syntax, and semantics [in Vietnamese]. Cambridge: Education Press.

  5. Chiang, D., & Bikel, D. M. (2002). Recovering latent information in treebanks. In Proceedings of COLING.

  6. Collins, M. (1999). Head-driven statistical models for natural language parsing. Ph.D. thesis, University of Pennsylvania.

  7. Cover, T. M., & Thomas, J. A. (2006). Elements of information theory. New York: Wiley.

  8. Dickinson, M., & Meurers, W. D. (2003). Detecting errors in part-of-speech annotation. In Proceedings of EACL.

  9. Dickinson, M. (2006). From detecting errors to automatically correcting them. In Proceedings of EACL.

  10. Dickinson, M. (2008). Ad hoc treebank structures. In Proceedings of ACL.

  11. Diep, Q.-B. (2005). Vietnamese syntax [in Vietnamese]. Cambridge: Education Press.

  12. Han, C., Han, N., Ko, E., & Palmer, M. (2002). Development and evaluation of a Korean treebank and its application to NLP. In Proceedings of LREC.

  13. Johnson, M. (1998). PCFG models of linguistic tree representation. Computational Linguistics, 24, 613–632.

  14. Jurafsky, D., & Martin, J. H. (2009). Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition. New Jersey: Prentice Hall.

  15. Klein, D., & Manning, C. D. (2003). Accurate unlexicalized parsing. In Proceedings of ACL.

  16. Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of ICML.

  17. Marcus, M. P., Marcinkiewicz, M. A., & Santorini, B. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19, 313–330.

  18. Mitchell, T. M. (1997). Machine learning. Maidenhead: McGraw-Hill.

  19. Miyao, Y., & Tsujii, J. (2008). Feature forest models for probabilistic HPSG parsing. Computational Linguistics, 34, 35–80.

  20. Nguyen, V.-H. (2009). Vietnamese syntax [in Vietnamese]. Cambridge: Education Press.

  21. Nguyen, T.-M.-H., Vu, X.-L., & Le, H.-P. (2003). A case study of the probabilistic tagger QTAG for tagging Vietnamese texts [in Vietnamese]. In Proceedings of ICT.rda.

  22. Nguyen, T.-C. (2004). Vietnamese syntax [in Vietnamese]. Hanoi: Vietnam National University Press.

  23. Nguyen, P.-T., Vu, X. L., Nguyen, T. M. H., Nguyen, V. H., & Le, H. P. (2009). Building a large syntactically-annotated corpus of Vietnamese. In Proceedings of LAW-3, ACL-IJCNLP.

  24. Nguyen, V.-H. (2009). The history of approaches in describing Vietnamese syntax. Journal of the Research Institute for World Languages, (1), 19–34.

  25. Novak, V., & Razimova, M. (2009). Unsupervised detection of annotation inconsistencies using apriori algorithm. In Proceedings of LAW-3, ACL-IJCNLP.

  26. Pajas, P., & Stepanek, J. (2008). Recent advances in a feature-rich framework for treebank annotation. In Proceedings of COLING.

  27. Phuong, L. H., Huyen, N. T. M., Azim, R., & Vinh, H. T. (2008). A hybrid approach to word segmentation of Vietnamese texts. In Proceedings of the 2nd international conference on language and automata theory and applications. Springer LNCS 5196, Tarragona, Spain, 2008.

  28. Rambow, O. (2010). The simple truth about dependency and phrase structure representations: An opinion piece. In Proceedings of NAACL.

  29. Santorini, B. (1990). Part-of-speech tagging guidelines for the Penn Treebank Project. In Treebank-3 Documents. Linguistic Data Consortium.

  30. Sciullo, A. M. D., & Williams, E. (1987). On the definition of word. Cambridge: The MIT Press.

  31. Steedman, M., Osborne, M., Sarkar, A., Clark, S., Hwa, R., Hockenmaier, J., et al. (2003). Bootstrapping statistical parsers from small datasets. In Proceedings of EACL.

  32. Thompson, L. C. (1987). A Vietnamese reference grammar. Hawaii: University of Hawaii Press.

  33. van Halteren, H. (2000). The detection of inconsistency in manually tagged text. In Proceedings of LINC.

  34. Xue, N., Xia, F., Chiou, F.-D., & Palmer, M. (2005). The Penn Chinese TreeBank: Phrase structure annotation of a large corpus. Natural Language Engineering, 11, 207–238.

  35. Yamada, H., & Matsumoto, Y. (2003). Statistical dependency analysis with support vector machines. In Proceedings of IWPT.

  36. Yates, A., Schoenmackers, S., & Etzioni, O. (2006). Detecting parser errors using web-based semantic filters. In Proceedings of EMNLP.

Acknowledgments

This paper is supported by the project QGTĐ.12.21 funded by Vietnam National University, Hanoi. We would like to express special thanks to the other members of the treebank development team, Xuan-Luong Vu and Dr. Thi-Minh-Huyen Nguyen, and to the linguistic annotators Minh-Thu Dao, Thi-Minh-Ngoc Nguyen, Kim-Ngan Le, and Mai-Van Nguyen, for their effective cooperation. We would also like to thank Assoc. Prof. Dinh Dien for his comments and discussions during the early stages of the treebank development.

Author information

Corresponding author

Correspondence to Phuong-Thai Nguyen.

About this article

Cite this article

Nguyen, PT., Le, AC., Ho, TB. et al. Vietnamese treebank construction and entropy-based error detection. Lang Resources & Evaluation 49, 487–519 (2015). https://doi.org/10.1007/s10579-015-9308-5

Keywords

  • Treebank
  • Error detection
  • Entropy