Compilation of an idiom example database for supervised idiom identification

Article

Abstract

Some phrases can be interpreted either idiomatically (figuratively) or literally, depending on context. Identifying idioms precisely is essential for full-fledged natural language processing. To this end, the authors have created an idiom corpus for Japanese. This paper reports on the corpus and on the results of an idiom identification experiment conducted with it. The corpus targets 146 ambiguous idioms and consists of 102,856 examples, each annotated with a literal/idiomatic label. All sentences were collected from the World Wide Web. For idiom identification, 90 of the 146 idioms were targeted, and a word sense disambiguation (WSD) method was adopted that uses both common WSD features and idiom-specific features. As far as can be determined, the corpus and the experiment are both the largest of their kinds. A standard supervised WSD method was found to work well for idiom identification, achieving accuracies of 89.25% and 88.86% with and without idiom-specific features, respectively. The most effective idiom-specific feature was the one that involves the adjacency of idiom constituents.
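
The abstract names two kinds of features: common WSD features and an idiom-specific feature based on the adjacency of idiom constituents. The sketch below illustrates one way such features could be computed for a single tokenized example. It is not the authors' implementation: the function names, the context window size, the romanized tokens, and the greedy in-order matching of constituents are all assumptions made for illustration.

```python
from typing import Dict, List, Optional


def find_constituents(tokens: List[str], constituents: List[str]) -> Optional[List[int]]:
    """Return the positions of the idiom constituents, matched greedily in order.

    Returns None if any constituent is missing. (Hypothetical helper.)
    """
    positions = []
    start = 0
    for word in constituents:
        try:
            idx = tokens.index(word, start)
        except ValueError:
            return None
        positions.append(idx)
        start = idx + 1
    return positions


def extract_features(tokens: List[str], constituents: List[str], window: int = 2) -> Dict[str, int]:
    """Build a sparse feature dict for one example (illustrative only)."""
    positions = find_constituents(tokens, constituents)
    if positions is None:
        return {}

    features: Dict[str, int] = {}

    # Common WSD-style feature: bag of words near the phrase, including any
    # words that intervene between the constituents.
    lo = max(0, positions[0] - window)
    hi = min(len(tokens), positions[-1] + window + 1)
    for i in range(lo, hi):
        if i not in positions:
            features[f"ctx={tokens[i]}"] = 1

    # Idiom-specific feature: 1 if the constituents are strictly adjacent.
    adjacent = all(b - a == 1 for a, b in zip(positions, positions[1:]))
    features["adjacent"] = int(adjacent)
    return features


# Illustrative romanized tokens for the phrase "hone o oru".
idiom = ["hone", "o", "oru"]
contiguous = extract_features(["kare", "wa", "hone", "o", "oru"], idiom)
interrupted = extract_features(["hone", "o", "hidoku", "oru"], idiom)
```

Feature dicts of this shape, paired with the literal/idiomatic labels, could then be passed to any supervised classifier.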

Keywords

Japanese idiom, Corpus, Idiom identification, Language resources

Copyright information

© Springer Science+Business Media B.V. 2009

Authors and Affiliations

  1. National Institute of Information and Communications Technology, Kyoto, Japan
