A comparative study of dictionaries and corpora as methods for language resource addition

Abstract

In this paper, we investigate the relative effect of two strategies of language resource addition for Japanese morphological analysis, a joint task of word segmentation (WS) and part-of-speech tagging: the first strategy is adding entries to the dictionary, and the second is adding annotated sentences to the training corpus. The experimental results showed that adding annotated sentences to the training corpus is better than adding entries to the dictionary. Adding annotated sentences is especially efficient when new words are added together with the contexts of several real occurrences as partially annotated sentences, i.e. sentences in which only some of the words are annotated with word boundary information. Based on this finding, we performed real annotation experiments on invention disclosure texts and measured the word segmentation accuracy. Finally, we investigated various cases of language resource addition and introduced the notions of non-maleficence, asymmetricity, and additivity of language resources for a task. In the WS case, we found that language resource addition is non-maleficent (adding new resources causes no harm in other domains) and sometimes additive (adding new resources helps other domains). We conclude that it is reasonable for us, as NLP tool providers, to distribute only one general-domain model trained from all the language resources we have.

Notes

  1. In the first check process, the annotator focused on words appearing only in the newly annotated 5000 sentences. In the second process, we divided the annotated sentences into several parts, and the annotator checked the differences between the manual annotations of each part and the decisions of a model trained on the corpus including the other parts, similarly to cross-validation.

  2. We ran some preliminary experiments. BCCWJ consists of six domains. We split each domain into a test set and a training set, built a model from the training sets of five domains, and tested it on the test set of the remaining domain. When Yahoo! QA is the test domain, the WS and MA accuracies are 98.64 % and 97.78 %, respectively, so the WS errors are 61.3 % of the MA errors. When the test domain is Yahoo! blogs, the most difficult of the six domains, the accuracies are 96.98 % and 95.77 %, so the WS errors are 71.4 % of the MA errors (the arithmetic is spelled out after these notes).

  3. For example, the entry フランス語 (French language) is a combination of フランス (France) and 語 (language).

  4. Note that it is also possible to learn sequence-based models from partial annotations (Tsuboi et al. 2008; Yang and Vozila 2014), which may provide an increase in accuracy at the cost of an increase in training time (the total time for training CRFs on partially annotated data scales with the number of words in sentences containing at least one annotation, in contrast to the pointwise approach, which scales with the number of annotated words). A comparison between these two methods is orthogonal to our present goal of comparing dictionary and corpus addition, and thus we use pointwise predictors in our experiments (a toy sketch of the pointwise approach is given after these notes).

  5. It should be noted that there has been a recently proposed method to loosen this restriction, although this adds some complexity to the decoding process and reduces speed somewhat (Kaji and Kitsuregawa 2013).

  6. More fine-grained POS tags have provided small boosts in accuracy in previous research (Kudo et al. 2004), but these increase the annotation burden, which is contrary to our goal.

  7. Dictionary features for word segmentation are active if the string exists in the original unsegmented input, regardless of whether it is segmented as a single word in \(\varvec{w}_1^J\), and thus they can be calculated without the word segmentation result (see the sketch after these notes).

  8. We did not tune the parameters precisely, so there may still be room for further improvement.

  9. http://mecab.sourceforge.net/dic.html.

  10. KyTea requires re-training.

  11. As we can see in Table 4, renewing CRF parameters decreased the accuracy.

  12. The expected frequency of a word candidate is its frequency as a string in the raw corpus multiplied by the word likelihood, which is estimated by comparing the distribution of the candidate with that of known words; see Mori and Nagao (1996) for more detail (the formula is sketched after these notes).

  13. We borrow this terminology from medicine, where non-maleficence indicates the property of “doing no harm.”

  14. A very slight degradation is observed in the case of recipe WS with the model trained from patent texts (from 95.56 % to 95.54 %). This difference is not statistically significant.

  15. The only exception is that the model adapted to the patent domain and tested on the general domain is better than the others (from 99.01 % to 99.02 %). The change is, however, not significant.

  16. ML technologies have the potential to adapt the model to unexpected inputs automatically.
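
As a worked example of the error-rate ratios reported in footnote 2 (a restatement of the arithmetic above, not additional results):

\[
\frac{100 - 98.64}{100 - 97.78} = \frac{1.36}{2.22} \approx 0.613,
\qquad
\frac{100 - 96.98}{100 - 95.77} = \frac{3.02}{4.23} \approx 0.714,
\]

i.e. the WS error rate is roughly 61.3 % of the MA error rate on Yahoo! QA and 71.4 % on Yahoo! blogs.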
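
The following is a minimal sketch of the pointwise approach referred to in footnotes 4 and 7, assuming scikit-learn as an off-the-shelf classifier. The feature templates, the toy dictionary, the tiny training examples, and all identifiers are illustrative assumptions for exposition; this is not the KyTea feature set or implementation used in our experiments.

```python
# Toy pointwise word segmentation: each character boundary is an independent
# binary decision (1 = word boundary, 0 = no boundary). Illustrative only.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

DICTIONARY = {"研究", "する"}  # toy dictionary; entries are plain strings

def boundary_features(chars, i):
    """Features for the boundary between chars[i] and chars[i + 1]."""
    s = "".join(chars)
    feats = {}
    # Character unigrams and a bigram around the boundary.
    for off in (-1, 0, 1):
        if 0 <= i + off < len(chars):
            feats["uni_%d" % off] = chars[i + off]
    feats["bi_0"] = s[i:i + 2]
    # Dictionary features (cf. footnote 7): active if a dictionary string
    # occurs in the raw, unsegmented input spanning this boundary,
    # regardless of how the sentence is finally segmented.
    for w in DICTIONARY:
        for start in range(max(0, i - len(w) + 1), i + 1):
            if s[start:start + len(w)] == w and start + len(w) > i + 1:
                feats["dict_len_%d" % len(w)] = 1
    return feats

# Tiny training data: (characters, boundary labels); None = unannotated point
# (partial annotation), which a pointwise learner simply skips.
train = [
    (list("研究する"), [0, 1, 0]),        # 研究 | する
    (list("解析する"), [None, 1, None]),  # only one boundary is annotated
]

X, y = [], []
for chars, labels in train:
    for i, lab in enumerate(labels):
        if lab is None:
            continue
        X.append(boundary_features(chars, i))
        y.append(lab)

vec = DictVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(X), y)

test = list("研究した")
X_test = vec.transform([boundary_features(test, i) for i in range(len(test) - 1)])
print(clf.predict(X_test))  # one 0/1 decision per character boundary
```

Because each labeled boundary is an independent training instance, partially annotated sentences contribute only their annotated points, which is why the training cost of the pointwise approach scales with the number of annotated words (footnote 4).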
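
For footnote 12, the relation between the quantities can be written compactly; the symbols \(\tilde{f}\), \(f_{\mathrm{raw}}\), and \(P(\mathrm{word}\mid w)\) are notation introduced here for illustration only and do not appear in Mori and Nagao (1996):

\[
\tilde{f}(w) = f_{\mathrm{raw}}(w) \cdot P(\mathrm{word} \mid w),
\]

where \(f_{\mathrm{raw}}(w)\) is the frequency of the candidate \(w\) as a string in the raw corpus and \(P(\mathrm{word}\mid w)\) is the estimated word likelihood.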

References

  • Brown, P. F., Pietra, V. J. D., deSouza, P. V., Lai, J. C., & Mercer, R. L. (1992). Class-based \(n\)-gram models of natural language. Computational Linguistics, 18(4), 467–479.

  • Goto, I., Lu, B., Chow, K.P., Sumita, E., & Tsou, B.K. (2011). Overview of the patent machine translation task at the NTCIR-9 workshop. In Proceedings of NTCIR-9 workshop meeting (pp. 559–578).

  • Kaji, N., & Kitsuregawa, M. (2013). Efficient word lattice generation for joint word segmentation and POS tagging in Japanese. In Proceedings of the sixth international joint conference on natural language processing, Nagoya, Japan (pp. 153–161).

  • Kruengkrai, C., Uchimoto, K., Kazama, J., Wang, Y., Torisawa, K., & Isahara, H. (2009). An error-driven word-character hybrid model for joint Chinese word segmentation and POS tagging. In Proceedings of the 47th annual meeting of the association for computational linguistics.

  • Kudo, T., Yamamoto, K., & Matsumoto, Y. (2004). Applying conditional random fields to Japanese morphological analysis. In Proceedings of the conference on empirical methods in natural language processing (pp. 230–237).

  • Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the eighteenth ICML (pp. 282–289).

  • Liang, P., Daumé III, H., & Klein, D. (2008). Structure compilation: Trading structure for features. In Proceedings of the 25th ICML.

  • Maekawa, K. (2008). Balanced corpus of contemporary written Japanese. In Proceedings of the 6th workshop on Asian language resources (pp. 101–102).

  • Mori, S., & Kurata, G. (2005). Class-based variable memory length Markov model. In Proceedings of InterSpeech 2005 (pp. 13–16).

  • Mori, S., & Nagao, M. (1996). Word extraction from corpora and its part-of-speech estimation using distributional analysis. In Proceedings of the 16th international conference on computational linguistics (pp. 1119–1122).

  • Mori, S., & Neubig, G. (2014). Language resource addition: Dictionary or corpus? In Proceedings of the ninth international conference on language resources and evaluation (pp. 1631–1636).

  • Mori, S., & Oda, H. (2009). Automatic word segmentation using three types of dictionaries. In Proceedings of the eighth international conference of the Pacific Association for Computational Linguistics (pp. 1–6).

  • Mori, S., Maeta, H., Yamakata, Y., & Sasada, T. (2014). Flow graph corpus from recipe texts. In Proceedings of the ninth international conference on language resources and evaluation (pp. 2370–2377).

  • Nagata, M. (1994). A stochastic Japanese morphological analyzer using a forward-DP backward-A\(^{*}\) n-best search algorithm. In Proceedings of the 15th international conference on computational linguistics (pp. 201–207).

  • Nakagawa, T. (2004). Chinese and Japanese word segmentation using word-level and character-level information. In Proceedings of the 20th international conference on computational linguistics.

  • Nanba, H., Fujii, A., Iwayama, M., & Hashimoto, T. (2011). Overview of the patent mining task at the NTCIR-8 workshop. In Proceedings of NTCIR-8 workshop meeting (pp. 293–302).

  • Neubig, G., & Mori, S. (2010). Word-based partial annotation for efficient corpus construction. In Proceedings of the seventh international conference on language resources and evaluation (pp. 2723–2727).

  • Neubig, G., Nakata, Y., & Mori, S. (2011). Pointwise prediction for robust, adaptable Japanese morphological analysis. In Proceedings of the 49th annual meeting of the association for computational linguistics (pp. 529–533).

  • Ng, H. T., & Low, J. K. (2004). Chinese part-of-speech tagging: One-at-a-time or all-at-once? Word-based or character-based? In Proceedings of the conference on empirical methods in natural language processing.

  • Peng, F., Feng, F., & McCallum, A. (2004). Chinese segmentation and new word detection using conditional random fields. In Proceedings of the 20th international conference on computational linguistics.

  • Ron, D., Singer, Y., & Tishby, N. (1996). The power of amnesia: Learning probabilistic automata with variable memory length. Machine Learning, 25, 117–149.

  • Sassano, M. (2002). An empirical study of active learning with support vector machines for Japanese word segmentation. In Proceedings of the 40th annual meeting of the association for computational linguistics (pp. 505–512).

  • Settles, B., Craven, M., & Friedland, L. (2008). Active learning with real annotation costs. In NIPS workshop on cost-sensitive learning.

  • Tomanek, K., & Hahn, U. (2009). Semi-supervised active learning for sequence labeling. In Proceedings of the 47th annual meeting of the association for computational linguistics (pp. 1039–1047).

  • Tsuboi, Y., Kashima, H., Mori, S., Oda, H., & Matsumoto, Y. (2008). Training conditional random fields using incomplete annotations. In Proceedings of the 22nd international conference on computational linguistics (pp. 897–904).

  • Wang, L., Li, Q., Li, N., Dong, G., & Yang, Y. (2008). Substructure similarity measurement in Chinese recipes. In Proceedings of the 17th international conference on World Wide Web (pp. 978–988).

  • Yamakata, Y., Imahori, S., Sugiyama, Y., Mori, S., & Tanaka, K. (2013). Feature extraction and summarization of recipes using flow graph. In Proceedings of the 5th international conference on social informatics, LNCS 8238 (pp. 241–254).

  • Yang, F., & Vozila, P. (2014). Semi-supervised Chinese word segmentation using partial-label learning with conditional random fields. In Proceedings of the 2014 conference on empirical methods in natural language processing (pp. 90–98).

Acknowledgments

This work was supported by JSPS Grants-in-Aid for Scientific Research, Grant Numbers 26280084 and 26540190, and by an NTT agreement dated 05/23/2013.

Corresponding author

Correspondence to Shinsuke Mori.

Additional information

The current paper describes and extends the language resource creation activities, experimental results, and findings that previously appeared in an LREC paper (Mori and Neubig 2014).

Cite this article

Mori, S., Neubig, G. A comparative study of dictionaries and corpora as methods for language resource addition. Lang Resources & Evaluation 50, 245–261 (2016). https://doi.org/10.1007/s10579-016-9354-7
