Detecting Japanese idioms with a linguistically rich dictionary

Language Resources and Evaluation

Abstract

Detecting idioms in a sentence is important to sentence understanding. This paper discusses the linguistic knowledge for idiom detection. The challenges are that idioms can be ambiguous between literal and idiomatic meanings, and that they can be “transformed” when expressed in a sentence. However, there has been little research on Japanese idiom detection with its ambiguity and transformations taken into account. We propose a set of linguistic knowledge for idiom detection that is implemented in an idiom dictionary. We evaluated the linguistic knowledge by measuring the performance of an idiom detector that exploits the dictionary. As a result, more than 90% of the idioms are detected with 90% accuracy.

Notes

  1. Some idioms have two or three idiomatic meanings, but we only check whether a phrase is used as an idiom or not.

  2. For a detailed discussion of what constitutes the notion of (Japanese) idiom, see Miyaji (1982), which details usages of commonly used Japanese idioms.

  3. In fact, the idiom has no literal interpretation.

  4. A bunsetu is a syntactic unit in Japanese, consisting of one independent word and zero or more ancillary words.

  5. One can devise a context that makes the literal interpretation of those Classes possible. However, virtually no phrase of Class A or B is interpreted literally in real texts, and we think our generalization safely captures the reality of idioms.

  6. There were many more variations in the internal structure of idioms than we had expected. Clarifying what internal structures Japanese idioms have requires careful investigation, which we could not carry out in this study.

  7. “Volitional Modality” covers verbal expressions of order, request, permission, prohibition, and volition.

  8. It might seem infeasible to compile a large-scale idiom dictionary equipped with the lexical knowledge described so far. In fact, only Class C requires detailed linguistic information (the disambiguation knowledge), which must be described by relying on native speakers’ intuition, while the lexical knowledge of Classes A and B (two-thirds of all idioms) is compiled automatically. In this study, the disambiguation knowledge for Class C was compiled based on the authors’ intuition, and we found far fewer disagreements about the judgments than we had expected.

  9. The 100 most frequently used idioms in Kindaichi and Ikeda (1989) cover 53.49% of all idiom tokens in 10 years of the Mainichi newspaper. Thus, our dictionary accounts for approximately half of all idiom tokens in a corpus.

  10. One rejection was caused by a dependency analysis error.

  11. Semantic compositionality does not play an important role in idiom detection, although most papers concerning MWEs are preoccupied with it.

  12. The notion of decomposability of Sag et al. (2002) and Nunberg et al. (1994) is independent of ambiguity. In fact, ambiguous idioms are either decomposable (hara-ga kuroi (belly-nom black) “black-hearted”) or non-decomposable (hiza-o utu (knee-acc hit) “have a brainwave”). Also, unambiguous idioms are either decomposable (hara-o yomu (belly-acc read) “fathom someone’s thinking”) or non-decomposable (saba-o yomu (chub.mackerel-acc read) “cheat in counting”).

References

  • Fazly, A., & Stevenson, S. (2006). Automatically constructing a lexicon of verb phrase idiomatic combinations. In Proceedings of the 11th conference of the European Chapter of the Association for Computational Linguistics (EACL-2006), pp. 337–344.

  • Hashimoto, C., Sato, S., & Utsuro, T. (2006). Japanese idiom recognition: Drawing a line between literal and idiomatic meanings. In COLING/ACL 2006, Sydney, pp. 353–360.

  • Ikehara, S., Miyazaki, M., Shirai, S., Yokoo, A., Nakaiwa, H., Ogura, K., Ooyama, Y., & Hayashi, Y. (1997). Goi-Taikei: A Japanese Lexicon. Iwanami Shoten.

  • Ishida, P. (2000). Doushi Kanyouku-ni taisuru Tougoteki Sousa-no Kaisou Kankei (On the hierarchy of syntactic operations applicable to verb idioms). Nihongo Kagaku (Japanese Linguistics), 7, 24–43.

  • Katz, G., & Giesbrecht, E. (2006). Automatic identification of non-compositional multi-word expressions using latent semantic analysis. In Proceedings of the workshop, COLING/ACL 2006, multiword expressions: Identifying and exploiting underlying properties, pp. 12–19.

  • Kindaichi, H., & Ikeda, Y. (Eds.). (1989). Gakken Kokugo Daijiten (Gakken’s Dictionary). Gakushu Kenkyu-sha.

  • Kudo, T., & Matsumoto, Y. (2002). Japanese dependency analysis using cascaded chunking. In Proceedings of the 6th conference on natural language learning (CoNLL-2002), pp. 63–69.

  • Matsumoto, Y., Kitauchi, A., Yamashita, T., Hirano, Y., Matsuda, H., Takaoka, K., & Asahara, M. (2000). Morphological analysis system ChaSen version 2.2.1 manual. Nara Institute of Science and Technology.

  • Miyaji, Y. (1982). Kanyouku-no Imi-to Youhou (Usage and semantics of idioms). Meiji Shoin.

  • Nunberg, G., Sag, I. A., & Wasow, T. (1994). Idioms. Language, 70, 491–538.

  • Oku, M. (1990). Nihongo-bun Kaiseki-ni-okeru Jutsugo Soutou-no Kanyouteki Hyougen-no Atsukai (Treatments of predicative idiomatic expressions in parsing Japanese). Journal of Information Processing Society of Japan, 31(12), 1727–1734.

  • Rohde, D. L. T. (2005). TGrep2 User Manual version 1.15. Massachusetts Institute of Technology. http://www.tedlab.mit.edu/~dr/Tgrep2.

  • Sag, I. A., Baldwin, T., Bond, F., Copestake, A., & Flickinger, D. (2002). Multiword expressions: A pain in the neck for NLP. In Computational linguistics and intelligent text processing: Third international conference, pp. 1–15.

  • Shudo, K., Tanabe, T., Takahashi, M., & Yoshimura, K. (2004). MWEs as non-propositional content indicators. In the 2nd ACL workshop on multiword expressions: Integrating processing, pp. 32–39.

  • Tanaka, Y. (1997). Collecting idioms and their equivalents. In IPSJ SIGNL 1997-NL-121.

Acknowledgements

Special thanks go to Gakushu Kenkyu-sha, also known as Gakken, who permitted us to use Gakken’s Dictionary for our research.

Author information

Corresponding author

Correspondence to Chikara Hashimoto.

About this article

Cite this article

Hashimoto, C., Sato, S. & Utsuro, T. Detecting Japanese idioms with a linguistically rich dictionary. Lang Resources & Evaluation 40, 243–252 (2006). https://doi.org/10.1007/s10579-007-9024-x
