Language Resources and Evaluation (2009) 43:355

Compilation of an idiom example database for supervised idiom identification

Authors

  • Chikara Hashimoto, National Institute of Information and Communications Technology
  • Daisuke Kawahara, National Institute of Information and Communications Technology

DOI: 10.1007/s10579-009-9104-1

Cite this article as:
Hashimoto, C. & Kawahara, D. Lang Resources & Evaluation (2009) 43: 355. doi:10.1007/s10579-009-9104-1

Abstract

Some phrases can be interpreted in their context either idiomatically (figuratively) or literally. The precise identification of idioms is essential for full-fledged natural language processing. To this end, the authors have constructed an idiom corpus for Japanese. This paper reports on the corpus itself and on an idiom identification experiment conducted using it. The corpus targets 146 ambiguous idioms and consists of 102,856 examples, each of which is annotated with a literal/idiomatic label. All sentences were collected from the World Wide Web. For idiom identification, 90 of the 146 idioms were targeted, and a word sense disambiguation (WSD) method was adopted that uses both common WSD features and idiom-specific features. The corpus and the experiment are both, as far as can be determined, the largest of their kind. It was found that a standard supervised WSD method works well for idiom identification, achieving accuracies of 89.26 and 88.87%, with and without idiom-specific features, respectively. It was also found that the most effective idiom-specific feature is the one that involves the adjacency of idiom constituents.

Keywords

Japanese idiom · Corpus · Idiom identification · Language resources

1 Introduction

For some phrases, such as kick the bucket, the meaning is ambiguous in terms of whether the phrase has a literal or idiomatic meaning in a certain context. It is necessary to resolve this ambiguity in the same manner as for the ambiguous words that have been dealt with in the WSD literature. Hereafter, literal/idiomatic ambiguity resolution is referred to as idiom (token) identification.

Idiom identification is classified into two categories: one for idiom types and the other for idiom tokens. The former is used to find phrases that can be interpreted as idioms in text corpora, typically for compiling idiom dictionaries, while the latter identifies whether a phrase in context is a true idiom or a phrase that should be interpreted literally (henceforth, a literal phrase). This paper deals primarily with the latter, i.e., idiom token identification.

Despite the recent enthusiasm for multiword expressions (MWEs) (Villavicencio et al. 2005; Rayson et al. 2006, 2008; Moirón et al. 2006; Grégoire et al. 2007, 2008), idiom token identification is at an early stage of development. Given that many natural language processing (NLP) tasks, such as machine translation or parsing, have been developed thanks to the availability of language resources, idiom token identification should also be developed when adequate idiom resources are provided. It is for this purpose that the authors of this paper have constructed a Japanese idiom corpus. An idiom identification experiment has also been conducted using the corpus, which is expected to become a good reference point for future studies on the subject. A standard WSD framework was drawn from with machine learning that exploited features used commonly in WSD studies and features that were idiom-specific.

This paper reports the corpus and the results of the experiment in detail. It is important to note here that the corpus and the experiment are believed to be the largest of their kind in existence.

While only the ambiguity between literal and idiomatic interpretations is dealt with here, some phrases are ambiguous among two or more idiomatic meanings. For example, the Japanese idiom te-o dasu (hand-acc stretch) can be interpreted as "punch," "steal," or "make moves on." This kind of ambiguity is not addressed in this paper and is left to future work. Note that acc indicates the accusative case marker in this paper; likewise, the following notation is used hereafter: nom for the nominative case marker, dat for the dative case marker, gen for the genitive case marker, ins for the instrumental case marker, only for the restrictive postposition, pass for the passive morpheme, and caus for the causative morpheme. from and to stand for the Japanese counterparts of from and to, and neg represents a verbal negation morpheme.

The problem of what constitutes the notion of “idiom” is not addressed here. Only phrases listed in Sato (2007) are regarded as idioms in this paper. Sato (2007) consulted five books in order to compile Japanese idiom lists. Among these five books, Miyaji (1982) provides a relatively in-depth discussion of the notion of “idiom.” In short, Miyaji (1982) defines idioms as phrases that (i) consist of more than one word that tend to behave as a single syntactic unit and (ii) take on a fixed, conventional meaning. The idioms dealt with here fall within the definition of Miyaji (1982). A further discussion of Japanese idioms will be presented in Sect. 3.1.

The remainder of this paper is organized as follows. Related work is presented in Sect. 2, while Sect. 3 describes the target idioms. The idiom corpus is described in Sect. 4, after which the idiom identification method and the experiment are detailed in Sect. 5. Finally, Sect. 6 concludes the paper. 1

2 Related work

Few works on the construction of idiom corpora have been carried out to date; Birke and Sarkar (2006) and Cook et al. (2008) are notable exceptions.

Birke and Sarkar (2006) automatically constructed a corpus of ~50 English idiomatic expressions (words that can be used non-literally), and ~6,600 examples thereof. This corpus, referred to as TroFi Example Base, is available on the Web. 2

Cook et al. (2008) compiled a corpus of English verb-noun combination (VNC) tokens, which deals with 53 VNC expressions, consists of about 3,000 sample sentences, and is also available on the Web. 3 As with our corpus, theirs assigns each example a label indicating whether the expression in it is used literally or idiomatically. Our corpus can be regarded as the Japanese counterpart of these works, although it should be noted that it targets 146 idioms and consists of 102,856 example sentences.

Another exception is MUST, a database of Japanese compound functional expressions that was constructed manually by Tsuchiya et al. (2006) and is available online. 4 Some compound functional expressions in Japanese are, like idioms, ambiguous. 5

The SAID dataset (Kuiper et al. 2003) provides data about the syntactic flexibility of English idioms. 6 It does not deal with idiom token identification, but as in Hashimoto et al. (2006a) and Cook et al. (2007), among others, the syntactic behavior of idioms is an important clue to idiom token identification.

While previous studies have focused mainly on idiom type identification (Lin 1999; Krenn and Evert 2001; Baldwin et al. 2003; Shudo et al. 2004; Fazly and Stevenson 2006), there has been a recent growing interest in idiom token identification (Katz and Giesbrecht 2006; Hashimoto et al. 2006a, b; Birke and Sarkar 2006; Cook et al. 2007).

Katz and Giesbrecht (2006) manually annotated the 67 occurrences of a German MWE with literal/idiom labels, from which they built LSA (Latent Semantic Analysis) vectors for the two usages. They used these two vectors and the cosine similarity metric to identify tokens of the German MWE as either literal or idiomatic.

Hashimoto et al. (2006a, b) (henceforth, HSU) focused on the differences in grammatical constraints imposed on idioms and their literal counterparts, such as the possibility of passivization, and developed a set of rules for Japanese idiom identification. Although their task was identical to that of the present study, and our work draws on the grammatical knowledge they provided, the scale of their experiment was relatively small, using only 108 sentences for idiom identification. Further, unlike HSU, this study employs mature WSD technology. A more detailed description of HSU is provided in Sect. 5.1.2.

Cook et al. (2007) (CFS henceforth) propose an unsupervised method for English based on the observation that idioms tend to be expressed in a small number of fixed forms.

While these studies mainly exploited the characteristics of idioms, our study employed a WSD method, for which many studies and mature technologies exist, in addition to the characteristics of idioms. While Birke and Sarkar (2006) also used WSD, they employed an unsupervised method, whereas a fully supervised one is used in this study.

A supervised method was adopted in order to learn how accurately idioms could be identified if a sufficient amount of training data was available. Supervised methods do, admittedly, have scalability problems, so an unsupervised method, like that of CFS, therefore needs to be developed. Nevertheless, revealing the supervised accuracy is helpful for clarifying the accuracy of an unsupervised method. In other words, the experimental results obtained in this study are expected to serve as a reference point for future studies.

Apart from idioms, Uchiyama et al. (2005) conducted the token classification of Japanese compound verbs by employing a supervised method.

With regard to the idiom identification method adopted in our study, it is also worth mentioning Lee and Ng (2002) (hereafter, LN). Our study drew heavily on LN, which evaluated a variety of knowledge sources (part-of-speech of neighboring words, content words in the context, N-grams surrounding the target word, and syntactically related words) and supervised learning algorithms for word sense disambiguation. Their results showed that the best performance was provided by a combination of all the knowledge sources and support vector machines (SVM). Our study, in turn, used the best performing combination for the idiom identification task. A more detailed description of LN will be provided in Sect. 5.1.1.

3 Target idioms

This section describes the characteristics of Japanese idioms (see Sect. 3.1) and how certain target idioms were selected from among them for this study (see Sect. 3.2).

3.1 Overview of Japanese idioms

In order to achieve an overall perspective of Japanese idioms, their distribution was investigated with regard to their morpho-syntactic structures, as follows.
1. A total of 926 idioms were extracted from Sato (2007). Sato compiled about 3,600 basic Japanese idioms, which are available on the Web, 7 from five books: two elementary school dictionaries (Kindaichi and Kindaichi 2005; Kindaichi 2006), two idiom dictionaries (Yonekawa and Ohtani 2005; Kindaichi 2005), and one monograph on idioms (Miyaji 1982). Those idioms that were described in more than two of the five books were extracted. Accordingly, it can be assumed that the 926 idioms are a fundamental part of the Japanese idioms used in daily life.

2. These 926 idioms were parsed using JUMAN (Kurohashi et al. 1994), 8 a morphological analyzer of Japanese, and KNP (Kurohashi-Nagao Parser; Kurohashi and Nagao 1994), 9 a Japanese dependency parser, which provided the morpho-syntactic structures of the idioms.

3. Two native speakers of Japanese (a member of Group B, who will be mentioned in Sect. 3.2, and one of the authors) corrected the parsed results manually. 10

As a result, we obtained the distribution illustrated in Table 1, which shows the five most prevalent morpho-syntactic structures. 11 The sequences of symbols in the first column, such as "(N-P V)" and "(N-P (N-P V))", indicate the morpho-syntactic structures of idioms, where N, P, V, and A stand for noun, postposition, verb, and adjective, respectively, and parentheses indicate dependency units. 12
Table 1

Five most prevalent morpho-syntactic structures among the 926 idioms

Structure     | P | %               | Example
(N-P V)       | V | 57.24 (530/926) | goma-o suru (sesame-acc crush) "flatter"
(N-P N)       | N | 6.05 (56/926)   | mizu-to abura (water-and oil) "oil and water"
(N-P A)       | A | 4.64 (43/926)   | hana-ga takai (nose-nom high) "proud"
(N-P (N-P V)) | V | 2.59 (24/926)   | ashi-ga bou-ni naru (leg-nom stick-dat become) "feet get stiff"
(N-P V-S)     | V | 2.48 (23/926)   | kubi-ga mawara-nai (neck-nom turn-neg) "up to one's neck"

In addition, note that "S" in "(N-P V-S)" in the last row indicates verbal suffixes, such as the negation, passive, and causative morphemes.

The second column indicates the part-of-speech (i.e., the verb, noun, adjective, etc.) to which the idioms as a whole correspond. For example, the (N-P V) idiom goma-o suru is a verbal idiom.

As can be seen, more than half are of the (N-P V) type, which is consistent with the observation made by HSU (Hashimoto et al. 2006b, Sect. 3.3). The 530 idioms of (N-P V) type can further be classified on the basis of which postposition (case marker) they contain, as in Table 2. More than half of the idioms of this type contain the accusative (acc) case marker, followed by the dative (dat), the nominative (nom), and the instrumental (ins) case markers. This distribution is consistent with the observation made by Yonekawa and Ohtani (2005) (p. 549).
Table 2

Postpositional distribution of the (N-P V) idioms

Postp | %               | Example
acc   | 63.40 (336/530) | goma-o suru (sesame-acc crush) "flatter"
dat   | 18.68 (99/530)  | tyuu-ni uku (midair-dat float) "be up in the air"
nom   | 17.55 (93/530)  | abura-ga noru (fat-nom put.on) "warm up to one's work"
ins   | 0.38 (2/530)    | ago-de tukau (jaw-ins use) "have someone at one's beck and call"

3.2 Selection of target idioms

One hundred and forty-six idioms were selected for this study using the following procedure.
1. The 926 basic idioms were extracted from Sato (2007), as mentioned in Sect. 3.1.

2. From these, all the ambiguous idioms were picked out, which amounted to 146, based only on native speakers' intuition. 13
As for step 2, it is not trivial to determine whether a phrase is ambiguous, since one meaning of a phrase is sometimes so much more common and familiar than the other meaning(s), if any, that the phrase can be regarded as unambiguous. Our efforts are concentrated on evenly ambiguous idioms, the disambiguation of which will surely contribute to the development of NLP.

Two native speakers of Japanese (Group A) were then asked to classify the 926 idioms into two categories: (1) (evenly) ambiguous ones and (2) unambiguous ones. On the basis of this classification, one of the authors made the final judgments. 14

For example, the phrase goma-o suru (sesame-acc crush) is ambiguous in terms of its literal (“crushing sesame”) and idiomatic (“flattering people”) meanings. On the other hand, the phrase saba-o yomu (chub.mackerel-acc read) is an unambiguous idiom that means “cheating in counting.” Unambiguous idioms in this study include those that might be interpreted literally in some artificial contexts but in real life are mostly used idiomatically. The phrase kage-ga usui (shadow-nom blurred) is mostly used in real life as an idiom that means “low profile” and is thus regarded as unambiguous in this study. A context could be devised in which one is drawing a picture of a building and believes that the shadow of the building in the picture should have been thicker and sharper. In this artificial context, the phrase kage-ga usui (shadow-nom blurred) might be used literally, though native speakers of Japanese may believe that kage-no iro-ga usui (shadow-gen color-nom blurred) sounds more natural in the context.

To verify the stability of these ambiguity judgments, two more native speakers of Japanese (Group B) were asked to perform the same task, and the Kappa statistic between the two speakers was then calculated. One hundred and one idioms were sampled from the 926 chosen earlier, and the two members of Group B classified the 101 sampled idioms into the two classes. The Kappa statistic was found to be 0.6576 (the observed agreement was 0.7723), which indicates moderate stability.
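The agreement figures reported here and in Sect. 4.3 are instances of Cohen's kappa for two annotators. As a minimal sketch of the computation (the label values in the usage line are hypothetical, not drawn from the actual annotation data):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators who labeled the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's marginal label distribution.
    dist_a, dist_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(dist_a[label] * dist_b[label] for label in dist_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical labels: 'a' = ambiguous, 'u' = unambiguous.
print(cohens_kappa(['a', 'u', 'a', 'u'], ['a', 'u', 'u', 'u']))
```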

All of the idioms on which the two members of Group B disagreed were finally judged as unambiguous by one of the authors (in that only the idiomatic interpretation is possible), though they might be interpreted literally if some artificial and unlikely context were provided. In other words, they can be described as borderline cases. 15

Table 3 shows the five most prevalent morpho-syntactic structures among the 146 selected idioms. 16 The tendency for the (N-P V) type to prevail is even clearer here; this type comprised 78.23% of the 146 idioms.
Table 3

Five most prevalent morpho-syntactic structures of the 146 idioms

Structure   | P | %               | Example
(N-P V)     | V | 78.23 (115/146) | goma-o suru (sesame-acc crush) "flatter"
(N-P A)     | A | 7.48 (11/146)   | hana-ga takai (nose-nom high) "proud"
(N-P N)     | N | 2.72 (4/146)    | mizu-to abura (water-and oil) "oil and water"
(N-P V-S)   | V | 2.72 (4/146)    | kubi-ga mawara-nai (neck-nom turn-neg) "up to one's neck"
(N-P V-Aux) | A | 2.04 (3/146)    | hi-ga kieta-you (fire-nom go.out-seem) "stagnate"

Note that "Aux" in "(N-P V-Aux)" in the last row indicates auxiliary morphemes. The auxiliary morpheme you in that example attaches to a verb and changes it into an adjective.

Table 4 shows the breakdown of the 115 idioms of (N-P V) type in terms of their postpositions (case markers). Again, the observed distribution is mostly the same as Table 2, with the difference being that the acc type is slightly more pervasive, and the nom type comes second.
Table 4

Postpositional distribution of the (N-P V) idioms among those selected

Postp | %              | Example
acc   | 67.83 (78/115) | goma-o suru (sesame-acc crush) "flatter"
nom   | 19.13 (22/115) | abura-ga noru (fat-nom put.on) "warm up to one's work"
dat   | 13.04 (15/115) | tyuu-ni uku (midair-dat float) "be up in the air"
Table 5 lists 90 out of the 146 target idioms that were used for the experiment. 17
Table 5

Idioms used for the experiment

ID   | Type                                                       | I:L
0016 | (blue.vein-acc emerge) "burst a blood vessel"              | 286:57
0035 | (sit cross-legged) "rest on one's laurels"                 | 587:353
0056 | (leg-nom attach) "find a clue to solving a case"           | 184:478
0057 | (leg-nom go.out) "run over budget"                         | 188:651
0079 | (one's feet-acc look.down) "see someone coming"            | 420:310
0080 | (leg-acc wash) "wash one's hands of ..."                   | 632:291
0088 | (leg-acc stretch) "go a little further"                    | 727:179
0098 | (head-nom ache) "harass oneself about ..."                 | 158:217
0107 | (head-acc fold) "tear one's hair out"                      | 796:116
0114 | (head-acc lift) "rear its head"                            | 804:163
0150 | (fat-nom put.on) "warm up to one's work"                   | 196:1006
0151 | (oil-acc sell) "shoot the breeze"                          | 507:78
0152 | (oil-acc squeeze) "rake someone over the coals"            | 69:139
0161 | (net-acc spread) "wait expectantly"                        | 366:858
0198 | (breath-nom choke.up) "stifling"                           | 681:270
0262 | (one-from ten-to) "all without exception"                  | 770:67
0390 | (color-acc lose) "turn pale"                               | 262:720
0436 | (arm-nom go.up) "develop one's skill"                      | 481:362
0648 | (tail-acc pull) "have a lasting effect"                    | 843:118
0689 | (face-acc present) "show up"                               | 697:128
0756 | (shoulder-acc juxtapose) "on par with"                     | 842:100
0773 | (corner-nom remove) "become mature"                        | 370:274
1100 | (lip-acc bite) "bite one's lip"                            | 587:241
1107 | (mouth-acc cut) "break the ice"                            | 210:223
1120 | (mouth-acc sharpen) "pout"                                 | 663:105
1141 | (neck-nom turn-neg) "up to one's neck"                     | 619:310
1146 | (neck-acc cut) "give the axe"                              | 449:384
1153 | (neck-acc twist) "think hard"                              | 885:65
1379 | (thing-dat depend) "perhaps"                               | 231:113
1417 | (sesame-acc crush) "flatter"                               | 87:88
1738 | (back-acc train) "turn one's back on"                      | 597:298
1897 | (blood-nom flow) "humane"                                  | 422:419
1947 | (midair-dat float) "be up in the air"                      | 382:529
1988 | (dirt-nom attach) "be defeated in sumo wrestling"          | 70:186
2032 | (hand-nom reach) "afford" "reach an age" "attentive"       | 470:112
2033 | (hand-nom there.isn't) "have no remedy"                    | 799:120
2037 | (hand-nom get.away) "get one's work done"                  | 360:414
2075 | (hand-dat ride) "fall into someone's trap"                 | 372:583
2101 | (hand-dat insert) "obtain"                                 | 373:328
2105 | (hand-acc hang) "give a lot of care"                       | 241:578
2108 | (hand-acc cut) "break away"                                | 468:341
2121 | (hand-acc take) "give every possible help (to learn)"      | 91:728
2122 | (hand-acc grasp) "conclude an alliance"                    | 73:696
2125 | (hand-acc stretch) "extend one's business"                 | 95:814
2128 | (hand-acc open.up) "extend one's business"                 | 579:242
2130 | (hand-acc turn) "take measures"                            | 246:544
2166 | (mountain.pass-acc go.over) "get over the hump"            | 685:264
2264 | (mud-acc daub) "drag someone through the mud"              | 543:187
2341 | (wave-dat ride) "catch a wave"                             | 783:125
2459 | (heat-nom get.cool) "fever goes down"                      | 890:100
2463 | (heat-acc raise) "be enthusiastic"                         | 903:73
2464 | (heat-acc feed.in) "enthuse"                               | 723:127
2473 | (root-acc take.down) "take root"                           | 824:136
2475 | (root-acc spread) "take root"                              | 564:376
2555 | (bus-dat miss) "miss the boat"                             | 199:665
2580 | (baton-acc give) "have someone succeed a position"         | 471:250
2581 | (nasal.breathing-nom heavy) "full of big talk"             | 286:256
2584 | (nose-nom high) "proud"                                    | 659:652
2615 | (nose-acc break) "humble (someone)"                        | 69:90
2621 | (nose-acc make.a.sound) "make light of ..."                | 536:426
2677 | (belly-acc cut) "have a heart-to-heart talk"               | 1265:58
2684 | (teeth-acc clench) "grit one's teeth"                      | 194:102
2770 | (human-acc eat) "look down upon someone"                   | 727:243
2785 | (spark-acc spread) "fight heatedly"                        | 728:230
2860 | (painting.brush-acc add) "correct (writings or paintings)" | 213:68
2878 | (ship-acc row) "nod"                                       | 167:162
2937 | (bone-nom break) "have difficulty"                         | 575:348
2947 | (bone-acc bury) "make it one's final home"                 | 757:157
2949 | (bone-acc break) "make efforts"                            | 350:545
2967 | (curtain-nom open) "start"                                 | 533:425
3018 | (right-from left) "passing through without staying"        | 794:2246
3037 | (water-and oil) "oil and water"                            | 1053:839
3039 | (water-dat flush) "forgive and forget"                     | 652:320
3069 | (body-dat put.on) "learn"                                  | 725:78
3078 | (ear-nom ache) "make one's ears burn"                      | 333:489
3084 | (ear-dat insert) "get word of ..."                         | 501:168
3132 | (fruit-acc bear) "bear fruit"                              | 826:98
3164 | (chest-nom ache) "suffer heartache"                        | 876:60
3173 | (chest-nom expand) "feel one's heart leap"                 | 338:423
3193 | (chest-acc hit) "impress"                                  | 801:66
3231 | (germ-nom come.out) "close to making the top"              | 377:491
3236 | (eye-nom there.isn't) "have a passion for ..."             | 829:74
3256 | (scalpel-acc insert) "take drastic measures"               | 741:92
3279 | (eye-dat enter) "catch sight of ..."                       | 623:112
3318 | (eye-acc cover) "be in a shambles"                         | 725:106
3327 | (eye-acc awake) "snap out of ..."                          | 118:587
3338 | (eye-acc close) "turn a blind eye"                         | 533:227
3350 | (eye-acc thin) "one's eyes light up"                       | 115:132
3468 | (finger-acc suck) "look enviously"                         | 876:71
3471 | (bow-acc draw) "defy"                                      | 138:1018

4 Idiom corpus

4.1 Corpus specification

The corpus is designed for the idiom token identification task. That is, each example sentence in the corpus is annotated with a label that indicates whether the corresponding phrase in the example is used as an idiom or a literal phrase. The former is referred to as the idiomatic example and the latter is called the literal example. More specifically, the corpus consists of lines that each represent one example. A line consists of four fields, as follows.
  • Label indicates whether the example is idiomatic or literal. Label i is used for idiomatic examples and l for literal ones.

  • ID denotes the idiom that is included in the example. In this study, each idiom has a unique number, which is based on Sato (2007).

  • Lemma also shows the idiom in the example. Each idiom was assigned its canonical (or standard) form (orthography) on the basis of Sato (2007).

  • Example is the sentence itself containing the idiom.

Below is a sample literal example of goma-o suru (sesame-acc crush) "flatter."

(1) l 1417 ごまをすり すり鉢でごまをすり…

The first field is the label (l for literal), the second is the ID, the third is the lemma of the idiom, and the last is the example itself, which reads "crushing sesame in a mortar..."
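Given this line format, reading the corpus is straightforward. Below is a minimal sketch of a reader; it assumes the four fields are whitespace-separated, that the example field may itself contain spaces, and the EUC-JP encoding mentioned in Sect. 4.4 (the file name is hypothetical, and both assumptions should be checked against the distributed file):

```python
def read_idiom_corpus(path):
    """Yield (label, idiom_id, lemma, example) tuples from the corpus.

    Assumes one example per line with four whitespace-separated fields,
    e.g. "l 1417 ごまをすり すり鉢でごまをすり…", and EUC-JP encoding
    (see Sect. 4.4).
    """
    with open(path, encoding='euc-jp') as f:
        for line in f:
            # maxsplit=3 keeps any spaces inside the example field intact.
            label, idiom_id, lemma, example = line.rstrip('\n').split(maxsplit=3)
            yield label, idiom_id, lemma, example

# Hypothetical usage:
for label, idiom_id, lemma, example in read_idiom_corpus('idiom_corpus.txt'):
    print(label, idiom_id, lemma, example)
```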

Before working on the corpus construction, a reference was prepared by which human annotators could consistently distinguish between the literal and figurative meanings of idioms. More precisely, this reference specified literal and idiomatic meanings for each idiom, similar to the way it is done in dictionaries. For example, the entry for goma-o suru in the reference reads as follows.

Idiom: To flatter people.

Literal: To crush sesame.

As for the corpus size, examples were annotated for each idiom, regardless of the proportion of idiomatic and literal examples, until the total number of examples for the idiom reached 1,000. 18 In the case of a shortage of original data, as many examples as possible were annotated.

The original data was sourced from the Japanese Web corpus (Kawahara and Kurohashi 2006). Kawahara and Kurohashi (2006) collected Web pages using a Web crawler (Takahashi et al. 2002). From these, pages written in Japanese were extracted by checking either the character encoding information obtained from HTML source files or the number of Japanese postpositions that existed in the pages. The pages were then split into sentences based on periods and HTML tags such as <br> and <p>.
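To illustrate this preprocessing step, a rough sketch of such a sentence splitter is given below. It is only an approximation under stated assumptions (boundaries at <br> and <p> tags and at the Japanese full stop 。), not the actual implementation of Kawahara and Kurohashi (2006):

```python
import re

def split_sentences(html):
    """Split Japanese page text into sentences on periods and on
    HTML tags such as <br> and <p> (a rough approximation only)."""
    # Treat <br> and <p> (including closing/self-closing forms) as boundaries.
    text = re.sub(r'(?i)<\s*/?\s*(?:br|p)\s*/?\s*>', '\n', html)
    # Drop any remaining tags.
    text = re.sub(r'<[^>]+>', '', text)
    # End a sentence at the Japanese full stop, keeping the stop attached.
    text = text.replace('。', '。\n')
    return [s.strip() for s in text.split('\n') if s.strip()]

print(split_sentences('<p>すり鉢でごまをすります。おいしいです。</p>'))
# ['すり鉢でごまをすります。', 'おいしいです。']
```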

4.2 Corpus construction

The corpus was constructed in the following manner.
1. From the Web corpus mentioned above, example sentences were collected that contained one of our target idioms in either of its meanings (idiomatic or literal). Specifically, sentences in which the constituent words of one of our targets appeared in a canonical dependency relationship were collected automatically using KNP, taking into account the morphological inflection and non-adjacency of idiom constituents. The canonical forms (orthography) of idioms provided by Sato (2007) were used, while the character variations of Japanese (Hiragana, Katakana, or Chinese characters) were not taken into account.

2. Of all the collected examples, 102,856 were classified as either idiomatic or literal. This classification was performed by human annotators on the basis of the reference that distinguishes the two meanings. Longer examples were given higher priority for annotation than shorter ones. Examples that were collected by mistake due to dependency parsing errors were discarded, as were those that lacked the context needed for a correct interpretation. 19 The annotators worked with the context of a single sentence containing an idiom.

The classification of the 102,856 examples was performed by the two members of Group A and took a total of 230 h.

The classification decisions could have been more reliable if additional context, such as entire documents, had been provided. This was not done, however, because the Web corpus adopted as the original data sometimes lacks the sentences before and/or after an example sentence, and consulting the neighboring sentences would have cost too much when labeling more than 100,000 examples.

This relates to the policy (step 2 above) of giving higher priority to longer sentences for annotation: the aim was to give annotators sufficient context (one long sentence) to make the idiomatic/literal annotations easier and more reliable without relying on neighboring sentences.

4.3 Status of corpus

The corpus consists of 102,856 examples, one per line. Note that the figures reported in this subsection are those of the 2008-06-25 version of the corpus, which was used for the experiment in this paper. 20 The total number of idiomatic examples is 68,239, and that of literal examples is 34,617. Table 5 shows the number of idiomatic and literal examples for each individual idiom that was used for the experiment in Sect. 5. Figure 1 shows the distribution of the number of examples. For 68 idioms, more than 1,000 examples were annotated. However, fewer than 100 examples were annotated for 17 idioms due to a lack of original data.
Fig. 1

Distribution of the number of examples

The average number of words in a sentence is 46. The "Idiom" curve in Fig. 2 shows the distribution of sentence length (the number of words) in the corpus, while "Web" and "News" indicate the sentence lengths in the Web corpus and a newspaper corpus, respectively. The figures for the Web corpus and the newspaper corpus are drawn from Kawahara and Kurohashi (2006). It is noticeable that our corpus contains a larger number of long sentences; this is because longer sentences were given priority for annotation, as stated in Sect. 4.2. Figure 3 shows the longest and shortest examples of both the idiomatic and literal meanings of goma-o suru drawn from the corpus.
Fig. 2

Distribution of sentence length

Fig. 3

The longest and shortest examples of the idiomatic and literal meanings of goma-o suru

To determine the consistency of the idiomatic/literal annotation between different human annotators, 1,421 examples were sampled from the corpus. The two members of Group B were asked to perform the same annotation, and the Kappa statistic between the two was calculated. The value was 0.8519 (the observed agreement was 0.9247), which indicates a very high level of agreement.

4.4 Distribution of corpus

The corpus is available online. 21 Figure 4 is a screenshot of the corpus' Website, where the download instructions can be found. The corpus is distributed under the BSD license. The archive (.tar.bz2) is 5.7 MB, and the corpus is encoded in EUC-JP.
Fig. 4

Website of the corpus

Prior to the corpus being distributed, any examples that were overly sexual or discriminatory were removed by referring to a dictionary that listed 257 common sexual/discriminatory expressions. 22 The distributable corpus contains 101,500 examples, among which 67,575 are idiomatic and 33,925 are literal.

Anyone wishing to access the complete corpus that was used for the experiment in this study may do so via the contact information provided on the Website.

In order to make it easy to browse the corpus, an online browser has been developed. 23 This browser makes it possible to (i) highlight certain constituents of idioms (recognized automatically by KNP) and (ii) display examples either in full or in the keyword-in-context (KWIC) format, which can be either left-aligned or right-aligned. The context length can also be specified as a number of characters.

Figure 5 shows a screenshot of the online browser. In the figure, examples of the idiom goma-o suru (sesame-acc crush) “flatter” are displayed in the left-aligned KWIC format.
Fig. 5

The corpus browser

5 Idiom identification experiment

5.1 Method of idiom identification

A standard WSD method was adopted using machine learning, specifically, an SVM (Vapnik 1995) with a quadratic kernel as implemented in TinySVM. 24 The knowledge sources used are classified into those that are commonly used in WSD, along the lines of Lee and Ng (2002) (LN), and those that have been designed for Japanese idiom identification, as proposed by HSU. 25
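For illustration, the learning setup (a binary classifier with a degree-2 polynomial, i.e. quadratic, kernel over feature vectors) can be sketched as follows; scikit-learn is used here purely as a stand-in for TinySVM, and the feature vectors and labels are toy values:

```python
from sklearn.svm import SVC

# Toy binary feature vectors (one row per example) and labels
# (1 = idiomatic, 0 = literal); both are hypothetical.
X = [[1, 0, 1, 0],
     [0, 1, 0, 1],
     [1, 1, 0, 0],
     [0, 0, 1, 1]]
y = [1, 0, 1, 0]

# Degree-2 polynomial kernel, mirroring the quadratic kernel used
# with TinySVM in the paper.
clf = SVC(kernel='poly', degree=2)
clf.fit(X, y)
print(clf.predict([[1, 0, 1, 1]]))
```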

The next two subsections describe the features developed by LN and HSU.

5.1.1 Features of Lee and Ng (2002)

For WSD, LN considered four kinds of features: the part-of-speech (POS) of neighboring words, single words in the surrounding context, local collocations, and syntactic relations.

Part-of-speech of neighboring words: LN used the POS of three words that preceded/followed a target word of WSD, as well as the POS of the target word itself. The neighboring words were within the same sentence as the target.

Single words in the surrounding context: All single words (unigrams) in the surrounding context of the target word were used, and the surrounding context could be up to a few sentences in length.

Local collocations: These are 11 n-grams around the target word: \(C_{-1,-1}\), \(C_{1,1}\), \(C_{-2,-2}\), \(C_{2,2}\), \(C_{-2,-1}\), \(C_{-1,1}\), \(C_{1,2}\), \(C_{-3,-1}\), \(C_{-2,1}\), \(C_{-1,2}\), and \(C_{1,3}\). \(C_{i,j}\) refers to the ordered sequence of tokens in the local context of the target; i and j denote the start and end positions of the sequence relative to the target, where a negative (positive) offset refers to a token to its left (right).
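A minimal sketch of extracting these eleven n-grams from a tokenized sentence is given below; it pads positions outside the sentence with the symbol ϕ and collapses the whole target into a single placeholder token, following the worked example (6) in Sect. 5.1.3:

```python
# The 11 (i, j) offset pairs of LN's local collocations C_{i,j}.
OFFSETS = [(-1, -1), (1, 1), (-2, -2), (2, 2), (-2, -1), (-1, 1),
           (1, 2), (-3, -1), (-2, 1), (-1, 2), (1, 3)]

def local_collocations(tokens, target):
    """Return the 11 local collocations around tokens[target] as a dict
    keyed by (i, j). Positions outside the sentence yield the padding
    symbol 'ϕ'; the target itself is assumed to be a single placeholder
    token such as '(idiom)'."""
    def tok(pos):
        return tokens[pos] if 0 <= pos < len(tokens) else 'ϕ'
    return {(i, j): ' '.join(tok(target + k) for k in range(i, j + 1))
            for (i, j) in OFFSETS}

# Tokens of example (6a), with the idiom collapsed into '(idiom)'.
tokens = ['tyousyu', 'no', '(idiom)', 'utukusi', 'uta']
colls = local_collocations(tokens, target=2)
print(colls[(-3, -1)])  # 'ϕ tyousyu no'
print(colls[(-1, 1)])   # 'no (idiom) utukusi'
```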

Syntactic relations: If the target word was a noun, this knowledge source included its parent headword (h), the POS of h, the voice of h (active, passive, or 0 if h is not a verb), and the relative position of h from the target (left or right). If the target was a verb, LN used six clues: (1) the nearest word l to the left of the target, such that the target was the parent headword of l, (2) the nearest word r to the right of the target, such that the target was the parent headword of r, (3) the POS of l, (4) the POS of r, (5) the POS of the target, and (6) the voice of the target. If the target was an adjective, the target’s parent headword h and the POS of h were used.

With these features, LN were able to achieve a higher level of accuracy than the best official scores on both SENSEVAL-2 (Edmonds and Cotton 2001) and SENSEVAL-1 (Kilgarriff and Palmer 2000) test data.

In short, LN used several kinds of contextual information regarding the target word of WSD, as has often been used for many sense-oriented natural language tasks.

5.1.2 Features of Hashimoto et al. (2006a, b)

Based on Miyaji (1982), Morita (1985) and Ishida (2000), HSU proposed the following linguistic knowledge to identify idioms.
1. Adnominal modification constraints
   (a) Relative clause prohibition
   (b) Genitive phrase prohibition
   (c) Adnominal word prohibition

2. Topic/restrictive postposition constraints

3. Voice constraints
   (a) Passivization prohibition
   (b) Causativization prohibition

4. Modality constraints
   (a) Negation prohibition
   (b) Volitional modality prohibition 26

5. Detachment constraint

6. Selectional restriction
For example, the idiom hone-o oru (bone-acc break) "make an effort" does not allow adnominal modification by a genitive phrase. It is therefore only possible to interpret (2) literally.
(2) kare-no hone-o oru
    he-gen bone-acc break
    "(Someone) breaks his bone."

That is, the above genitive phrase prohibition is in effect for the idiom. In other words, the idiom is lexicalized in such a way that it resists modification of its nominal part alone.

Likewise, the idiom does not allow its postposition o (acc) to be substituted with a restrictive postposition such as dake (only); therefore, (3) has only a literal meaning.
(3) hone-dake oru
    bone-only break
    "(Someone) breaks only some bones."

This means that the restrictive postposition constraint above is also in effect, which also occurs as a result of its lexicalization; the idiom resists the topicalization of its nominal part.

(4) is an example of the passivization prohibition of the voice constraints.
(4) hone-ga or-areru
    bone-nom break-pass
    "A bone is broken."

That is, because of the syntactic unity of the idiom hone-o oru, it cannot be passivized, unlike its literal counterpart.

Idioms that are subject to modality constraints cannot be negated and/or take on volitional modality. This is believed to be caused by the semantic irregularity of idioms.

The detachment constraint states that the constituents of some idioms cannot be separated from one another; that is, such idioms do not allow intervening words or phrases, such as adverbs, among their constituents.

Although HSU did not implement it, selectional restriction makes use of the semantic restriction on the syntactic arguments of an idiom. For example, if tyuu-ni uku (midair-dat float) is used idiomatically (meaning “to be up in the air”), it should take an abstract thing, such as yosanan (budget plan) as its nominative argument. On the other hand, if the phrase is used literally (meaning “to float in midair”), its nominative argument should be a concrete thing, such as bôru (ball).

Note that the linguistic constraints above (1-6) are not always in effect for an idiom. For instance, the causativization prohibition is invalid for the idiom hone-o oru. In fact, (5a) can be interpreted both literally and idiomatically.
(5) a. kare-ni hone-o or-aseru
       he-dat bone-acc break-caus

    b. "(Someone) makes him break a bone."

    c. "(Someone) makes him make an effort."

Based on these linguistic knowledge sources, HSU achieved an F-measure of 0.800 in identifying idioms (the Class C idioms of HSU). 27

The intuition behind the linguistic knowledge of HSU is that, in general, usages that are applicable to idioms (such as adnominal modification or passivization) are also applicable to literal phrases, but the reverse is not always true (Fig. 6). HSU then attempted to find usages that were applicable only to literal phrases, which correspond to the shaded area in Fig. 6, based on the observations in Miyaji (1982), Morita (1985) and Ishida (2000).
Fig. 6

Difference of applicable usages

In short, HSU’s linguistic knowledge captures the intolerance of idioms for certain syntactic and semantic operations, such as adnominal modification, passivization, or the detachment of constituents.

5.1.3 The proposed features

This paper proposes a set of features that combines those of LN and HSU, as below.
  • Common WSD Features
    • f1: Part-of-Speech of Neighboring Words

    • f2: Single Words in the Surrounding Context

    • f3: Local Collocations

    • f4a: Lemma of the Rightmost Word among those Words that are the Dependents of the Leftmost Constituent Word of the Idiom 28

    • f4b: POS of the Rightmost Word among those Words that are the Dependents of the Leftmost Constituent Word of the Idiom

    • f5a: Lemma of the Word which is the Parent Headword of the Rightmost Constituent Word of the Idiom

    • f5b: POS of the Word which is the Parent Headword of the Rightmost Constituent Word of the Idiom

    • f6: Hypernyms of Words in the Surrounding Context

• f7: Domains of Words (Hashimoto and Kurohashi 2007, 2008) in the Surrounding Context

  • Idiom-Specific Features
    • f8: Adnominal Modification Flag

    • f9: Topic Case Marking Flag

    • f10: Voice Alternation Flag

    • f11: Negation Flag

    • f12: Volitional Modality Flag

    • f13: Adjacency Flag

JUMAN and KNP were used to extract these features.

f1, f2 and f3 are mostly the same as those described in LN. The differences between them and the corresponding features of LN are as follows. Unlike LN, the POS of the target itself was not used for f1, since the targets of this study (idioms) are not single words, and two or more POSs would have to be posited for one target. For f2, the sentence containing a target was used as the context, unlike LN, who used up to a few sentences; this is due to the restriction on corpus construction (described in Sect. 4), whereby some sentences are collected from the Web in isolation, without information about the sentences that precede or follow them. For f3, words or phrases between the constituents of a target idiom were not considered part of a local collocation, since this feature was intended to be as close as possible to that of LN. (6b) illustrates the values of f1, f2, and f3 for the example (6a). The target idiom is mune-o utu (chest-acc hit) "impress."

(6) a. tyousyu-no mune-o utu utukusi uta
       audience-gen chest-acc hit beautiful song
       "A beautiful song that impresses the audience"

    b. Tokens: tyousyu | no | (idiom) | utukusi | uta
       f1: N | P | (idiom) | A | N
       f2: tyousyu, utukusi, uta
       f3: C_{-1,-1}: no
           C_{1,1}: utukusi
           C_{-2,-2}: tyousyu
           C_{2,2}: uta
           C_{-2,-1}: tyousyu no
           C_{-1,1}: no (idiom) utukusi
           C_{1,2}: utukusi uta
           C_{-3,-1}: ϕ tyousyu no
           C_{-2,1}: tyousyu no (idiom) utukusi
           C_{-1,2}: no (idiom) utukusi uta
           C_{1,3}: utukusi uta ϕ

More precisely, f1 is <ϕ, N, P, (idiom), A, N, ϕ>. f2 is the sparse vector in which all values except those for tyousyu, utukusi, and uta are zero; note that f2 deals only with content words, and no in (6b) is a postposition and is therefore not considered for this feature. f3 consists of the eleven n-grams listed in (6b).

f4 and f5 correspond roughly to the syntactic relations of LN. The difference between this study and LN's is that this study considered only the POS and the lemma of the syntactic child of the leftmost constituent and those of the syntactic parent of the rightmost constituent. This is because idioms have a more complicated internal structure than single words. In other words, the intention was to keep features f4 and f5 simple, while preserving the intuition of the original features posited by LN. In the example (6a) of mune-o utu (chest-acc hit) "impress," f4 is the POS and lemma of tyousyu and f5 corresponds to those of uta. 29

f6 and f7 are available from JUMAN’s output. For example, the hypernym of tyousyu (audience) is human and its domain is culture/media. Those of uta (song) are abstract-thing and culture/recreation. Although they are not used in LN, they are known to be useful for WSD (Tanaka et al. 2007; Magnini et al. 2002).

f8 indicates whether the nominal constituent of an idiom, if any, undergoes adnominal modification. This corresponds to HSU's adnominal modification constraints; in order to avoid data sparseness, however, the present study did not distinguish the sub-constraints (the relative clause prohibition, the genitive phrase prohibition, and the adnominal word prohibition). KNP, which robustly detects adnominal modification structures in an input sentence, was used to extract this feature.

f9 indicates whether one of the Japanese topic markers is attached to a nominal constituent of an idiom; this corresponds to HSU's topic/restrictive postposition constraints.

f10 is turned on when a passive or causative suffix is attached to a verbal constituent of an idiom. This is the counterpart of HSU’s voice constraints, but the sub-constraints, the passivization prohibition and the causativization prohibition were not distinguished, so as to avoid data sparseness. KNP’s output was used to see if a target idiom is passivized or causativized. 30

f11 and f12 are similar to f10. The former is used for negated forms and the latter for volitional modality suffixes of the predicate part of an idiom. 31 f11 and f12 jointly correspond to HSU's modality constraints. A wide range of modality expressions can be reliably recognized by KNP, and its output was used to obtain the values of these features.

Finally, f13 indicates whether the constituents of an idiom are adjacent to one another, and thus corresponds to HSU's detachment constraint.
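Each of f8 through f13 is a binary flag read off the morphological and dependency analysis. As a simplified illustration, the adjacency flag alone can be sketched as follows, assuming the token positions of the idiom's constituents are already known (in the actual system these are obtained from JUMAN/KNP output, and this helper is hypothetical):

```python
def adjacency_flag(constituent_positions):
    """f13: 1 if all constituents of the idiom are adjacent in the
    token sequence, 0 if any word intervenes (a simplification; the
    real feature is read off KNP's analysis)."""
    positions = sorted(constituent_positions)
    return int(all(b - a == 1 for a, b in zip(positions, positions[1:])))

# 'hone o oru' occurring contiguously vs. with an intervening word:
print(adjacency_flag([3, 4, 5]))  # 1: constituents adjacent
print(adjacency_flag([3, 4, 6]))  # 0: something intervenes
```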

5.2 Experimental condition

Ninety idioms were considered in the experiment, for which more than 50 examples of both idiomatic and literal usages were available. 32 The 90 idioms are shown in Table 5. The column “I:L” indicates the number of idiomatic and literal example sentences used for the experiment. Experiments were conducted for each idiom.

The performance measure is accuracy:
$$ \text{Accuracy} = \frac{\#\text{ of examples identified correctly}}{\#\text{ of all examples}} $$
The baseline system uniformly regards all examples as either idiomatic or literal, depending on which is more dominant in the idiom corpus. Naturally, this baseline is prepared for each idiom:
$$ \text{Baseline} = \frac{\max(\#\text{ of idiomatic examples},\ \#\text{ of literal examples})}{\#\text{ of all examples}} $$

The accuracy and baseline accuracy for each idiom are calculated by ten-fold cross-validation; the examples of an idiom were split randomly into ten parts prior to the experiment.

The overall accuracy and baseline accuracy are then calculated from the individual results: the accuracy scores of all 90 idioms are summed and the total is divided by 90, which is called the macro-average. The same calculation was performed for the baseline accuracy.

Another performance measure is the relative error reduction (RER):
$$ \text{RER} = \frac{\text{ER of baseline} - \text{ER of system}}{\text{ER of baseline}} $$
ER stands for error rate, which is defined as 1 − accuracy.

Using the above formula, the overall RER is calculated based on the overall accuracy and the overall baseline accuracy.
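Putting the three measures together, the per-idiom scores and the macro-averaged overall scores can be sketched as follows (the per-idiom numbers in the usage lines are hypothetical):

```python
def baseline_accuracy(num_idiomatic, num_literal):
    """Accuracy of always choosing the dominant label for an idiom."""
    return max(num_idiomatic, num_literal) / (num_idiomatic + num_literal)

def relative_error_reduction(system_acc, baseline_acc):
    """RER, with the error rate (ER) defined as 1 - accuracy."""
    return ((1 - baseline_acc) - (1 - system_acc)) / (1 - baseline_acc)

# Hypothetical per-idiom (system accuracy, baseline accuracy) pairs.
results = [(0.93, 0.72), (0.86, 0.60), (0.90, 0.85)]
macro_acc = sum(acc for acc, _ in results) / len(results)
macro_base = sum(base for _, base in results) / len(results)
print(macro_acc, macro_base, relative_error_reduction(macro_acc, macro_base))
```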

In addition, the effectiveness of each idiom-specific feature was investigated by measuring performance without the use of one of the idiom features.

5.3 Experimental result

Table 6 shows the overall performance. The first column is the baseline accuracy (%) and the second column is the accuracy (%) and relative error reduction (%) of the system without the idiom-specific features. The third column is the accuracy (%) and relative error reduction (%) of the system with the idiom features.
Table 6

Overall results

Base  | w/o I (RER)   | w/ I (RER)
72.92 | 88.87 (58.90) | 89.26 (60.35)

Tables 7 and 8 show the individual results of the 90 idioms. The first column shows the target idioms and the second column shows the baseline accuracy (%). The accuracy (%) and relative error reduction (%) of the system without the idiom-specific features is described in the third column. The fourth column shows those of the system with the idiom features. Bold face indicates a set of features with better performance (either w/o I or w/ I).
Table 7

Individual results (1/2)

ID   | Base  | w/o I (RER)   | w/ I (RER)
0016 | 83.38 | 86.03 (15.91) | 86.61 (19.45)
0035 | 62.45 | 92.98 (81.30) | 92.98 (81.30)
0056 | 72.21 | 77.05 (17.41) | 79.02 (24.50)
0057 | 77.59 | 92.49 (66.47) | 93.08 (69.13)
0079 | 57.53 | 86.03 (67.10) | 85.21 (65.16)
0080 | 68.47 | 92.54 (76.33) | 92.43 (76.00)
0088 | 80.24 | 95.26 (76.03) | 95.15 (75.47)
0098 | 57.87 | 83.40 (60.61) | 83.40 (60.61)
0107 | 87.28 | 91.46 (32.85) | 91.57 (33.72)
0114 | 83.14 | 93.61 (62.06) | 93.71 (62.68)
0150 | 83.69 | 93.02 (57.21) | 93.02 (57.21)
0151 | 86.67 | 92.63 (44.70) | 92.63 (44.70)
0152 | 66.83 | 84.64 (53.71) | 86.14 (58.23)
0161 | 70.10 | 81.53 (38.23) | 81.20 (37.13)
0198 | 71.61 | 79.61 (28.17) | 79.40 (27.43)
0262 | 92.00 | 93.48 (18.51) | 93.48 (18.51)
0390 | 73.32 | 84.34 (41.29) | 84.44 (41.68)
0436 | 57.06 | 84.36 (63.59) | 88.75 (73.80)
0648 | 87.72 | 93.14 (44.15) | 93.35 (45.84)
0689 | 84.35 | 88.24 (24.88) | 88.13 (24.17)
0756 | 89.38 | 93.20 (35.97) | 93.10 (34.97)
0773 | 57.45 | 78.20 (48.76) | 77.73 (47.66)
1100 | 70.89 | 78.41 (25.82) | 79.36 (29.10)
1107 | 51.50 | 84.83 (68.73) | 83.90 (66.81)
1120 | 86.33 | 88.01 (12.28) | 87.48 (8.43)
1141 | 66.63 | 86.42 (59.31) | 86.11 (58.39)
1146 | 53.90 | 90.05 (78.41) | 90.28 (78.92)
1153 | 93.16 | 94.11 (13.85) | 93.89 (10.77)
1379 | 67.15 | 96.50 (89.35) | 97.35 (91.94)
1417 | 50.29 | 92.75 (85.42) | 91.58 (83.06)
1738 | 66.70 | 88.72 (66.13) | 88.84 (66.47)
1897 | 50.18 | 82.65 (65.18) | 83.12 (66.13)
1947 | 58.07 | 88.15 (71.73) | 88.58 (72.77)
1988 | 72.66 | 79.08 (23.51) | 78.76 (22.33)
2032 | 80.76 | 88.00 (37.64) | 88.00 (37.64)
2033 | 86.94 | 92.50 (42.54) | 92.83 (45.06)
2037 | 53.49 | 92.24 (83.32) | 92.36 (83.57)
2075 | 61.05 | 92.76 (81.41) | 93.60 (83.57)
2101 | 53.21 | 93.58 (86.29) | 93.73 (86.59)
2105 | 70.57 | 91.05 (69.58) | 91.30 (70.42)
2108 | 57.85 | 91.08 (78.83) | 91.20 (79.12)
2121 | 88.89 | 92.74 (34.67) | 92.74 (34.67)
2122 | 90.51 | 95.30 (50.54) | 95.04 (47.77)
2125 | 89.55 | 93.90 (41.62) | 94.00 (42.59)
2128 | 70.52 | 89.41 (64.09) | 90.27 (66.99)
2130 | 68.86 | 93.16 (78.05) | 94.18 (81.30)
2166 | 72.18 | 89.08 (60.73) | 89.50 (62.26)

Note: The bold values indicate which system showed a superior performance between w/o I and w/I

Table 8

Individual results (2/2)

ID   | Base  | w/o I (RER)   | w/ I (RER)
2264 | 74.38 | 91.64 (67.38) | 91.78 (67.91)
2341 | 86.23 | 93.05 (49.55) | 92.72 (47.13)
2459 | 89.90 | 92.02 (21.00) | 92.12 (22.00)
2463 | 92.52 | 94.50 (26.45) | 94.71 (29.21)
2464 | 85.06 | 90.82 (38.58) | 91.88 (45.67)
2473 | 85.83 | 93.33 (52.94) | 93.33 (52.94)
2475 | 60.00 | 87.55 (68.88) | 87.87 (69.68)
2555 | 76.97 | 90.61 (59.24) | 92.24 (66.31)
2580 | 65.33 | 81.84 (47.63) | 82.81 (50.41)
2581 | 52.77 | 75.15 (47.38) | 76.63 (50.51)
2584 | 50.27 | 81.08 (61.96) | 81.92 (63.65)
2615 | 56.60 | 69.58 (29.91) | 74.92 (42.20)
2621 | 55.72 | 80.79 (56.62) | 81.00 (57.09)
2677 | 95.62 | 96.68 (24.16) | 96.68 (24.16)
2684 | 65.54 | 71.97 (18.66) | 72.32 (19.66)
2770 | 74.95 | 87.01 (48.15) | 86.91 (47.74)
2785 | 75.99 | 89.47 (56.13) | 89.57 (56.56)
2860 | 75.80 | 83.63 (32.37) | 84.70 (36.79)
2878 | 50.76 | 75.82 (50.88) | 76.68 (52.65)
2937 | 62.30 | 94.03 (84.17) | 93.93 (83.89)
2947 | 82.82 | 90.06 (42.13) | 90.39 (44.03)
2949 | 60.89 | 92.85 (81.72) | 92.85 (81.72)
2967 | 55.64 | 86.43 (69.41) | 86.43 (69.41)
3018 | 73.88 | 90.03 (61.84) | 90.13 (62.22)
3037 | 55.66 | 83.19 (62.10) | 86.10 (68.66)
3039 | 67.08 | 86.11 (57.81) | 89.51 (68.12)
3069 | 90.29 | 96.51 (64.11) | 96.39 (62.82)
3078 | 59.49 | 88.81 (72.38) | 89.18 (73.29)
3084 | 74.89 | 89.65 (58.80) | 90.68 (62.88)
3132 | 89.39 | 95.79 (60.33) | 95.68 (59.31)
3164 | 93.59 | 95.93 (36.46) | 96.03 (38.14)
3173 | 55.58 | 94.21 (86.97) | 94.48 (87.57)
3193 | 92.39 | 96.45 (53.34) | 96.56 (54.87)
3231 | 56.57 | 91.34 (80.06) | 91.67 (80.82)
3236 | 91.81 | 95.58 (46.12) | 95.25 (42.05)
3256 | 88.96 | 96.28 (66.30) | 96.28 (66.30)
3279 | 84.76 | 90.35 (36.69) | 91.16 (41.97)
3318 | 87.24 | 91.57 (33.94) | 92.30 (39.61)
3327 | 83.26 | 88.21 (29.56) | 88.92 (33.82)
3338 | 70.13 | 90.13 (66.96) | 90.53 (68.28)
3350 | 53.44 | 75.20 (46.74) | 74.69 (45.64)
3468 | 92.50 | 95.90 (45.25) | 95.80 (43.93)
3471 | 88.06 | 95.51 (62.41) | 95.51 (62.41)

Note: The bold values indicate which system showed a superior performance between w/o I and w/I

All in all, although relatively high baseline performances can be observed, both systems outperformed the baseline. In particular, the system without the idiom-specific features has a noticeable lead over the baseline, which shows that WSD technologies are effective in idiom identification. Incorporating the idiom features into the system improved the overall performance, which is statistically significant (McNemar test, p < 0.01). 33 However, there were some cases in which the individual performances of some idioms were slightly degraded by the incorporation of the idiom features.

Table 9 shows the overall results without using one of the idiom features. 34
Table 9

Overall results without using one of the idiom features

Feature type                          | Acc (%)
All                                   | 89.264
−f8 (w/o Adnominal modification flag) | 89.258
−f9 (w/o Topic case marking flag)     | 89.232
−f10 (w/o Voice alternation flag)     | 89.160
−f11 (w/o Negation flag)              | 89.182
−f12 (w/o Volitional modality flag)   | 89.213
−f13 (w/o Adjacency flag)             | 89.090

It can be seen that the adjacency flag (f13) makes the greatest contribution to idiom identification. 35 The adnominal modification flag (f8), meanwhile, makes only a slight contribution to the task. 36 All of the degradations in the table are statistically significant (McNemar test, p < 0.01) except for that of the adnominal modification flag (p = 0.1589).

Tables 10 and 11 show the individual results [accuracy (%)] obtained without using one of the idiom features. Bold face indicates the lowest accuracy. As expected, the contribution of idiom features varied depending on the idioms to be identified, and in some cases the addition of certain idiom features even degraded the accuracy of their identification. 37
Table 10

Individual results without using one of the idiom features (1/2)

ID   | All   | −f8   | −f9   | −f10  | −f11  | −f12  | −f13
0016 | 86.61 | 86.61 | 86.61 | 86.61 | 86.61 | 86.32 | 86.03
0035 | 92.98 | 92.98 | 92.98 | 92.98 | 92.98 | 92.98 | 92.98
0056 | 79.02 | 79.17 | 79.02 | 79.02 | 78.86 | 79.17 | 77.20
0057 | 93.08 | 93.32 | 93.08 | 93.08 | 93.08 | 93.20 | 92.85
0079 | 85.21 | 85.48 | 85.07 | 85.21 | 85.62 | 85.48 | 85.62
0080 | 92.43 | 92.65 | 92.43 | 92.32 | 92.43 | 92.65 | 92.43
0088 | 95.15 | 95.15 | 95.15 | 95.15 | 95.15 | 95.15 | 95.26
0098 | 83.40 | 83.40 | 83.40 | 83.40 | 83.40 | 83.40 | 83.40
0107 | 91.57 | 91.46 | 91.57 | 91.57 | 91.57 | 91.57 | 91.57
0114 | 93.71 | 93.61 | 93.71 | 93.71 | 93.71 | 93.71 | 93.71
0150 | 93.02 | 93.02 | 93.02 | 93.02 | 92.94 | 92.86 | 92.86
0151 | 92.63 | 92.45 | 92.63 | 92.63 | 92.63 | 92.63 | 92.63
0152 | 86.14 | 86.14 | 86.14 | 85.14 | 86.14 | 86.50 | 86.14
0161 | 81.20 | 81.04 | 81.20 | 81.20 | 80.96 | 81.20 | 80.96
0198 | 79.40 | 79.71 | 79.40 | 79.40 | 79.50 | 79.40 | 79.29
0262 | 93.48 | 93.72 | 93.72 | 93.48 | 93.48 | 93.48 | 93.48
0390 | 84.44 | 84.44 | 84.44 | 84.44 | 84.23 | 84.44 | 84.54
0436 | 88.75 | 88.87 | 88.28 | 88.75 | 84.36 | 88.04 | 87.68
0648 | 93.35 | 93.25 | 93.35 | 93.35 | 93.35 | 93.35 | 93.35
0689 | 88.13 | 88.13 | 88.13 | 88.13 | 88.13 | 88.24 | 88.13
0756 | 93.10 | 93.20 | 93.10 | 93.10 | 93.20 | 93.20 | 93.10
0773 | 77.73 | 77.73 | 77.73 | 77.73 | 77.89 | 77.57 | 77.89
1100 | 79.36 | 79.36 | 79.36 | 79.49 | 79.49 | 79.61 | 78.53
1107 | 83.90 | 83.92 | 83.90 | 83.90 | 84.15 | 82.97 | 84.37
1120 | 87.48 | 87.48 | 87.48 | 87.61 | 87.48 | 87.48 | 88.01
1141 | 86.11 | 86.11 | 86.01 | 86.11 | 86.22 | 86.11 | 86.53
1146 | 90.28 | 90.28 | 90.28 | 90.16 | 90.16 | 90.29 | 90.04
1153 | 93.89 | 94.11 | 93.89 | 93.89 | 93.89 | 93.89 | 94.11
1379 | 97.35 | 96.50 | 97.35 | 97.35 | 97.35 | 97.35 | 97.35
1417 | 91.58 | 92.17 | 91.58 | 91.58 | 91.58 | 92.75 | 92.17
1738 | 88.84 | 88.84 | 88.84 | 88.84 | 88.72 | 88.84 | 88.84
1897 | 83.12 | 83.12 | 83.12 | 83.12 | 83.24 | 83.36 | 82.89
1947 | 88.58 | 88.58 | 88.58 | 88.15 | 88.58 | 88.58 | 88.47
1988 | 78.76 | 79.08 | 78.76 | 78.76 | 78.76 | 78.76 | 78.76
2032 | 88.00 | 88.00 | 88.00 | 88.00 | 87.83 | 88.00 | 88.00
2033 | 92.83 | 92.83 | 92.73 | 92.83 | 92.72 | 92.83 | 93.05
2037 | 92.36 | 92.23 | 92.36 | 92.36 | 92.37 | 92.36 | 92.36
2075 | 93.60 | 94.02 | 93.49 | 93.60 | 93.17 | 93.49 | 93.71
2101 | 93.73 | 93.58 | 93.73 | 93.73 | 93.58 | 93.87 | 93.58
2105 | 91.30 | 91.30 | 91.30 | 91.30 | 90.81 | 91.30 | 91.30
2108 | 91.20 | 91.20 | 91.20 | 91.09 | 91.33 | 91.19 | 91.09
2121 | 92.74 | 92.74 | 92.74 | 92.74 | 92.74 | 92.74 | 92.74
2122 | 95.04 | 95.30 | 95.04 | 95.04 | 95.04 | 95.04 | 95.17
2125 | 94.00 | 94.11 | 94.00 | 94.00 | 93.89 | 93.80 | 94.01
2128 | 90.27 | 90.27 | 90.27 | 90.27 | 89.90 | 90.39 | 89.78
2130 | 94.18 | 94.18 | 94.18 | 94.18 | 94.18 | 94.18 | 93.29
2166 | 89.50 | 89.49 | 89.39 | 89.50 | 89.50 | 89.49 | 89.30

Table 11
Individual results without using one of the idiom features (2/2)

ID     All    −f8    −f9    −f10   −f11   −f12   −f13
2264   91.78  91.78  91.78  91.64  91.64  91.78  91.64
2341   92.72  93.05  92.83  92.83  92.94  92.83  92.83
2459   92.12  92.12  92.22  92.12  92.12  92.12  92.02
2463   94.71  94.81  94.71  94.71  94.71  94.71  94.50
2464   91.88  91.88  91.88  91.88  91.06  91.88  91.65
2473   93.33  93.33  93.33  93.33  93.33  93.33  93.33
2475   87.87  87.87  87.87  87.66  87.77  87.87  87.77
2555   92.24  92.36  92.47  92.24  92.59  90.50  92.24
2580   82.81  82.81  82.81  81.83  82.67  81.15  82.95
2581   76.63  76.63  77.00  76.63  76.44  76.44  75.33
2584   81.92  81.92  82.07  81.92  81.77  81.85  81.24
2615   74.92  72.92  74.92  72.25  74.92  74.92  70.92
2621   81.00  80.89  81.00  81.00  81.00  80.89  80.58
2677   96.68  96.60  96.68  96.68  96.68  96.68  96.68
2684   72.32  71.97  72.32  72.32  72.32  71.97  71.97
2770   86.91  86.80  86.91  86.91  87.11  86.91  86.91
2785   89.57  89.68  89.57  89.57  89.47  89.57  89.47
2860   84.70  84.35  84.70  84.70  85.06  84.70  82.92
2878   76.68  76.37  76.68  75.75  76.68  76.68  75.75
2937   93.93  93.71  93.93  93.93  93.93  93.82  93.92
2947   90.39  90.28  90.39  90.39  90.28  90.17  90.49
2949   92.85  92.96  92.85  92.63  92.85  92.85  92.85
2967   86.43  86.43  86.43  86.42  86.32  86.32  86.32
3018   90.13  90.03  90.10  90.13  90.13  90.13  90.10
3037   86.10  86.10  83.19  86.10  86.10  86.10  85.94
3039   89.51  89.40  89.51  87.45  89.51  88.99  89.51
3069   96.39  96.39  96.39  96.39  96.51  96.39  96.39
3078   89.18  89.18  89.18  89.18  88.93  89.18  89.18
3084   90.68  90.68  90.53  90.68  90.66  90.68  90.26
3132   95.68  95.68  95.68  95.79  95.68  95.68  95.68
3164   96.03  96.03  96.03  96.03  96.03  96.03  95.82
3173   94.48  94.48  94.48  94.48  94.35  94.48  94.35
3193   96.56  96.56  96.56  96.56  96.56  96.56  96.56
3231   91.67  91.78  91.67  91.67  91.22  91.67  91.67
3236   95.25  95.47  95.36  95.25  95.25  95.25  95.70
3256   96.28  96.28  96.28  96.28  96.28  96.28  96.28
3279   91.16  91.16  91.16  91.16  91.29  91.16  90.49
3318   92.30  92.30  92.30  92.18  92.06  92.06  91.81
3327   88.92  88.92  88.92  88.50  88.92  88.50  88.92
3338   90.53  90.53  90.53  90.53  90.26  90.39  90.53
3350   74.69  74.69  74.69  74.37  74.69  75.20  74.78
3468   95.80  96.21  95.80  95.80  95.80  95.80  96.00
3471   95.51  95.43  95.43  95.51  95.51  95.51  95.43
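Read together with footnote 35 below, these ablation columns quantify each feature's contribution: the larger the drop from the All column when a feature is removed, the more that feature contributes for a given idiom. The following minimal sketch illustrates this bookkeeping; the dict layout and helper function are ours (hypothetical), while the three sample rows are copied from Table 11.

```python
# Sketch: rank the idiom-specific features by how much accuracy drops
# when each one is ablated (cf. footnote 35: a greater drop indicates
# a greater contribution). The data layout below is an assumption for
# illustration; the three sample rows are taken from Table 11.

FEATURES = ["f8", "f9", "f10", "f11", "f12", "f13"]

# idiom ID -> (accuracy with all features, [accuracy without f8, ..., without f13])
rows = {
    "2615": (74.92, [72.92, 74.92, 72.25, 74.92, 74.92, 70.92]),
    "2878": (76.68, [76.37, 76.68, 75.75, 76.68, 76.68, 75.75]),
    "3039": (89.51, [89.40, 89.51, 87.45, 89.51, 88.99, 89.51]),
}

def mean_drop(feature_index: int) -> float:
    """Average accuracy drop over all idioms when one feature is removed."""
    drops = [all_acc - ablated[feature_index] for all_acc, ablated in rows.values()]
    return sum(drops) / len(drops)

for i, name in enumerate(FEATURES):
    print(f"-{name}: mean drop = {mean_drop(i):+.2f} points")
```

For idiom 2615, for example, removing f13 costs 74.92 − 70.92 = 4.00 points, far more than any other feature, so f13 contributes most for that idiom.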

6 Conclusion

This paper has reported on the idiom corpus that the authors constructed and on the idiom identification experiment conducted using the corpus.

As mentioned in Sect. 3, some idioms lack sufficient examples in the current idiom corpus; accordingly, we intend to collect more examples by searching for spellings in different character systems. Japanese has three basic character systems: Hiragana, Katakana, and Chinese characters. This means that an idiom can be written in several ways; for example, mune-o utu (chest-acc hit) “impress” can be spelled either with Chinese characters or entirely in Hiragana.

Despite these imperfections, much can be learned from the corpus about idiom identification. As far as can be determined, it is the largest of its kind, as is the idiom identification experiment reported in Sect. 5.

This paper has also shown that a standard supervised WSD method works well for idiom identification. Our system achieved accuracies of 89.25% and 88.86% with and without idiom-specific features, respectively.

This study dealt with 90 idioms, but practical NLP systems are required to deal with many more. In order to achieve scalable idiom identification, it is necessary to develop an unsupervised or semi-supervised method. One possibility would be to follow the unsupervised method of Birke and Sarkar (2006) using the Japanese WordNet (Isahara et al. 2008), while the language-independent unsupervised method proposed by CFS could also be of help.

In any case, this idiom corpus will play an important role in the development of unsupervised and semi-supervised methods, and the experimental results obtained in this study will provide a good reference point for evaluating those methods.

Footnotes
1

A preliminary version of this study was presented in Hashimoto and Kawahara (2008). This paper extends that work in several respects: it compares this study with many more previous studies; adds an extensive characterization of Japanese idioms; describes the updated version of our idiom corpus and a newly developed online browser for the corpus; discusses the full details of the features used in the experiment, which could not be presented in the previous paper due to page limitations; and presents additional experimental results, namely the individual results obtained without using each of the idiom features.

 
5

For example, (something)-ni-atatte ((something)-dat-run.into) means either “to run into (something)” or “on the occasion of (something),” with the former being the literal interpretation and the latter being the idiomatic interpretation of the compound functional expression.

 
10

This was done collaboratively; in the case of any disagreement on the morpho-syntactic status of an idiom, the two native speakers discussed the case in question and reached an agreement.

 
11

Those of the other 250 idioms (27.1%) are infrequent miscellaneous structures like (N-P ((N-P V) (N-P V))) of the idiom ato-ha no-to nare yama-to nare (future-top field-dat become mountain-dat become) “I don’t care what happens afterwards.”

 
12

The arrows indicate dependency relations.

 
13

Note that some idioms, such as by and large and saba-o yomu (chub.mackerel-acc read) “cheating in counting,” do not have a literal meaning. They are not dealt with in this paper.

 
14

It may be difficult to determine some interpretations (literal or idiomatic) and such a decision may only be possible by looking at token usages of candidate phrases. However, such a token usage-based decision for classifying idiom types was not used because of the prohibitive cost involved.

 
15

For example, hara-o kimeru (belly-acc decide) “to make up one’s mind” was judged as ambiguous by one of the Group B members. Its literal interpretation would be “decide on which belly to (do something),” which sounds unnatural regardless of the context.

 
16

Those of the other nine idioms (6.8%) are infrequent miscellaneous structures like (V-Aux V-Aux) of the idiom nessi-yasuku same-yasui (heat-easy.to cool.down-easy.to) “tend to be enthusiastic (about something) but also tend to be tired (of it).”

 
17

The way in which the 90 idioms were selected is described in Sect. 5.2.

 
18

For the idioms sampled for preliminary annotation, through which annotation issues were identified and the annotation specifications were established, more than 1,000 examples were annotated.

 
19

Among the 107,598 examples worked on by the annotators, 258 examples had been collected as a result of parser errors and 4,484 examples lacked sufficient context for the target phrases to be interpreted correctly. Discarding these left the 102,856 examples that constitute the corpus (107,598 − 258 − 4,484 = 102,856). Decisions regarding whether an example should be discarded were made by the annotator in charge and one of the authors.

 
20

The current release of the corpus, which is now available, is described in Sect. 4.4.

 
22

Although the dictionary has been carefully constructed by hand, the corpus may still contain some problematic examples. The removal of any such examples is the subject of a future project.

 
25

Bear in mind that HSU implemented them as handcrafted rules, whereas in this study they were adapted to a machine learning framework.

 
26

“Volitional modality” represents verbal expressions of order, request, permission, prohibition, and volition.

 
27

The F-Measure of HSU’s baseline system was 0.734.

 
28

Note that Japanese is a head final language.

 
29

Functional words attached to either the f4 word or the f5 word are ignored. In the example, no (gen) is ignored.

 
30

Passivization is indicated by the suffix (r)are in Japanese, but the same suffix is also used for honorification, potentials and spontaneous potentials. These were not distinguished, as doing so is beyond the capabilities of current technology.

 
31

Note that f10, f11 and f12 are applied only to those idioms that can be used as predicates.

 
32

Ninety examples were unavailable due to feature extraction failure. This was caused by KNP’s inability to handle very long sentences; it gives up parsing when the size of the CKY table exceeds a hard-coded threshold. Thus, fewer examples were used for the experiment than were included in the corpus.

 
33

The McNemar test was conducted on the ratio of correct and incorrect idiom example classifications between the two groups, “with idiom features” and “without idiom features.” The idiom examples used for the test were all of the data described in Table 5, and thus were identical across the two groups.
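Concretely, the McNemar test described here reduces to a 2 × 2 table of the two systems’ paired per-example outcomes, in which only the discordant pairs enter the statistic. A minimal sketch follows; the counts are invented for illustration and are not the paper’s.

```python
# Sketch of the McNemar test on paired classifier outputs, as described
# in footnote 33. Only the discordant pairs matter: b = examples that
# only the "with idiom features" system classified correctly, c = the
# reverse. These counts are hypothetical, not taken from the paper.
from scipy.stats import chi2

b = 412  # correct with idiom features, wrong without (hypothetical)
c = 308  # wrong with idiom features, correct without (hypothetical)

# Chi-squared statistic with the standard continuity correction.
statistic = (abs(b - c) - 1) ** 2 / (b + c)
p_value = chi2.sf(statistic, df=1)  # survival function, 1 degree of freedom

print(f"McNemar statistic = {statistic:.3f}, p = {p_value:.4f}")
```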

 
34

For ease of reference, the first row shows the result with all of the idiom features used.

 
35

Note that a greater performance drop indicates a greater contribution.

 
36

This result is inconsistent with that obtained in HSU, in which it was reported that grammatical constraints involving adnominal modification were most effective. The present study suspects that HSU’s observation is not particularly reliable because only 15 test sentences were considered when investigating the best performing grammatical constraint (Hashimoto et al. 2006a, Sect. 4.3).

 
37

It might be argued that different feature sets should have been used for different idioms in order to obtain better results. However, doing this would be unrealistic when dealing with so many more idioms, since it would mean that the best feature sets would need to be carefully examined for each idiom.

 

Acknowledgments

This work was conducted as part of the collaborative research project of Kyoto University and NTT Communication Science Laboratories. The work was supported by NTT Communication Science Laboratories and JSPS Grants-in-Aid for Young Scientists (B) 19700141. We would like to thank the members of the collaborative research group of Kyoto University and NTT Communication Science Laboratories and Dr. Francis Bond for their stimulating discussion. Thanks are also due to Prof. Satoshi Sato, who kindly provided us with the list of basic Japanese idioms.

Copyright information

© Springer Science+Business Media B.V. 2009