Language Resources and Evaluation, Volume 50, Issue 2, pp 245–261

A comparative study of dictionaries and corpora as methods for language resource addition

Original Paper

Abstract

In this paper, we investigate the relative effect of two strategies for adding language resources to a Japanese morphological analyzer, which performs word segmentation and part-of-speech tagging jointly. The first strategy is to add entries to the dictionary; the second is to add annotated sentences to the training corpus. Our experiments show that adding annotated sentences to the training corpus outperforms adding entries to the dictionary. Adding annotated sentences is especially efficient when new words are added together with the contexts of several real occurrences as partially annotated sentences, i.e. sentences in which only some words are annotated with word boundary information. Based on this finding, we performed real annotation experiments on invention disclosure texts and measured word segmentation accuracy. Finally, we investigated various cases of language resource addition and introduced the notions of non-maleficence, asymmetry, and additivity of language resources for a task. In the word segmentation case, we found that language resource addition is non-maleficent (adding new resources causes no harm in other domains) and sometimes additive (adding new resources helps other domains). We conclude that it is reasonable for NLP tool providers such as ourselves to distribute only one general-domain model trained on all the language resources available to us.
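To make the idea of partial annotation concrete, the following is a minimal sketch in Python of how such data could feed a pointwise word-boundary classifier: each gap between adjacent characters carries a label of 1 (boundary), 0 (no boundary), or None (unannotated), and only the labeled gaps generate training instances. The feature templates, the three-valued encoding, and the example sentence are illustrative assumptions, not the representation actually used in the paper.

    # Minimal sketch, not the paper's implementation: assumes a pointwise
    # word-boundary classifier trained from character-window features.
    from typing import List, Optional, Tuple

    def training_instances(chars: str,
                           gaps: List[Optional[int]],
                           window: int = 2) -> List[Tuple[List[str], int]]:
        """Convert one partially annotated sentence into (features, label) pairs.

        gaps[i] labels the gap between chars[i] and chars[i+1]:
          1 = word boundary, 0 = no boundary, None = unannotated.
        Only labeled gaps yield training instances; None gaps are skipped,
        which is what makes partial annotation cheap to produce.
        """
        padded = "#" * window + chars + "#" * window
        instances = []
        for i, label in enumerate(gaps):
            if label is None:
                continue
            left = padded[i + 1:i + 1 + window]                # characters left of the gap
            right = padded[i + 1 + window:i + 1 + 2 * window]  # characters right of the gap
            instances.append(([f"L:{left}", f"R:{right}", f"LR:{left}|{right}"], label))
        return instances

    # Hypothetical example: only the new word "実施例" (embodiment) is annotated,
    # with a boundary before it and no boundaries inside it; all other gaps stay None.
    sentence = "本発明の実施例"
    gaps = [None, None, None, 1, 0, 0]
    for feats, label in training_instances(sentence, gaps):
        print(label, feats)

Because unlabeled gaps are simply skipped, an annotator can add a new domain word with the context of a few real occurrences without segmenting the rest of each sentence.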

Keywords

Partial annotation · Domain adaptation · Dictionary · Word segmentation · POS tagging · Non-maleficence of language resources

Copyright information

© Springer Science+Business Media Dordrecht 2016

Authors and Affiliations

  1. Academic Center for Computing and Media Studies, Kyoto University, Kyoto, Japan
  2. Nara Institute of Science and Technology, Ikoma, Japan