
Automatically learning semantic knowledge about multiword predicates

Language Resources and Evaluation

Abstract

Highly frequent and highly polysemous verbs, such as give, take, and make, pose a challenge to automatic lexical acquisition methods. These verbs widely participate in multiword predicates (such as light verb constructions, or LVCs), in which they contribute a broad range of figurative meanings that must be recognized. Here we focus on two properties that are key to the computational treatment of LVCs. First, we consider the degree of figurativeness of the semantic contribution of such a verb to the various LVCs it participates in. Second, we explore the patterns of acceptability of LVCs, and their productivity over semantically related combinations. To assess these properties, we develop statistical measures of figurativeness and acceptability that draw on linguistic properties of LVCs. We demonstrate that these corpus-based measures correlate well with human judgments of the relevant property. We also use the acceptability measure to estimate the degree to which a semantic class of nouns can productively form LVCs with a given verb. The linguistically-motivated measures outperform a standard measure for capturing the strength of collocation of these multiword expressions.

Notes

  1. Throughout this paper, we use the term verb phrase to refer to a syntactic combination of a verb and its arguments. We use the term multiword predicate (MWP) to refer to a verb phrase that has been lexicalized.

  2. Our first approach to address the class-based pattern of LVC formation is described in Stevenson et al. (2004). The material in Sects. 5–6 of this article is an updated presentation of that in Fazly et al. (2006).

  3. It is important to note that these judgments are subject to individual differences. The point here is that the patterns marked “?” and “??” (and to some extent those marked “*”) are less preferred for the given expression. We do not claim that these patterns are impossible, but rather that they are expected to be less natural, and less common, than the preferred pattern(s).

  4. We do not include get, have, and do because of their frequent use as auxiliaries. We also exclude make from this experiment because, compared to give and take, it seemed more difficult to distinguish between literal and figurative usages of this verb. Our ongoing work focuses on expanding the set of verbs (see Fazly 2007).

  5. We also evaluated our figurativeness measure, Figness, using web data as in our experiments for acceptability presented in Sect. 6. We found that since the estimation of Figness requires more sophisticated linguistic knowledge, using a smaller but cleaner corpus (i.e., the parsed BNC) provides substantially better results.

  6. Note that since the initial sets were missing expressions that were rated as “literal” by the human annotators, the distributions of figurative and literal expressions in them were not representative of their “true” distribution.

  7. In order to maintain simplicity of both the questions and the process of translating their answers to numerical ratings, some fine-grained distinctions were lost. For example, under this scheme, give an idea and give a speech would receive the same rating. To distinguish such cases, we could also ask judges about the possibility of paraphrasing a given expression with a verb morphologically related to the noun constituent, which is a strong indicator of an LVC.

  8. We realize that a kappa value of .34 (for expressions with give) is low. In the future, we intend to address this problem, e.g., by providing the judges with more training or with more appropriate questions. The fact that expressions with take, which were annotated after those with give, have a much higher kappa suggests that more training may lead to more consistent annotations, and hence to higher interannotator agreement.

  9. PMI is known to be unreliable when used with low frequency data. Nonetheless, in our preliminary experiments on development data, we found that PMI performed better than two other association measures, Dice and Log Likelihood. Other research has also shown that PMI performs better than or comparable to many other association measures (Inkpen 2003; Mohammad and Hirst 2006). We also alleviate the problem of sparse data by: (i) using large corpora, the 100-million-word BNC and the Web, and (ii) focusing on expressions with a minimum frequency of 5 (Dunning 1993).
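
For concreteness, the following is a minimal sketch of the standard PMI computation over verb–noun co-occurrence counts, using hypothetical counts rather than the article's data (the article's own PMI-based measure is defined over its particular extraction patterns and corpora):

```python
import math

def pmi(count_vn, count_v, count_n, corpus_size):
    """Pointwise mutual information of a verb-noun pair:
    PMI(v, n) = log2( P(v, n) / (P(v) * P(n)) )."""
    p_vn = count_vn / corpus_size
    p_v = count_v / corpus_size
    p_n = count_n / corpus_size
    return math.log2(p_vn / (p_v * p_n))

# Hypothetical counts for a pair like (take, walk) in a 100M-word corpus.
MIN_FREQ = 5  # frequency cutoff used in the article to reduce sparse-data noise
count_vn, count_v, count_n, N = 250, 400_000, 12_000, 100_000_000
if count_vn >= MIN_FREQ:
    print(f"PMI = {pmi(count_vn, count_v, count_n, N):.2f}")  # ~2.38
```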

  10. The use of an indefinite determiner or no determiner in an LVC relates to semantic characteristics such as the aspectual properties of the state or event expressed by the predicative noun (Wierzbicka 1982). A detailed discussion of these differences, however, is outside the scope of this study.

  11. This was an observation made by the judges who later rated the acceptability of the experimental expressions as LVCs. The extent to which this observation holds for make or for other verbs in general is outside the scope of this study.

  12. This is clearly not the case for the estimate of the corpus size, since “the” likely occurs frequently within each page. However, in our formulas, this value appears as a constant, thus all scores are equally affected.
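
To make this point concrete, take PMI as a representative web-based measure (the article's own formulas may differ in detail): the estimated corpus size N enters only as an additive constant after taking logs,

$$
\mathrm{PMI}(v,n) \;=\; \log\frac{f(v,n)\,N}{f(v)\,f(n)} \;=\; \log f(v,n) - \log f(v) - \log f(n) + \log N,
$$

so replacing N by cN (for any constant c > 0) adds \(\log c\) to every expression's score, leaving all rankings and rank correlations unchanged.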

  13. Because our ratings are skewed toward low values, slight changes in observed agreement cause large swings in kappa values (the “paradox” of low kappa scores with high observed agreement; Feinstein and Cicchetti 1990). Since we are concerned with comparison to a baseline, observed agreement better reveals the patterns.
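
This “paradox” is easy to reproduce with unweighted Cohen's kappa (the same effect arises for the weighted variant): when ratings are heavily skewed toward one category, expected chance agreement is high, so kappa stays low even when observed agreement is high. A sketch with a hypothetical 2×2 confusion matrix:

```python
def cohens_kappa(matrix):
    """Unweighted Cohen's kappa from a square confusion matrix
    (rows: judge A's labels, columns: judge B's labels)."""
    n = sum(sum(row) for row in matrix)
    size = len(matrix)
    p_o = sum(matrix[i][i] for i in range(size)) / n           # observed agreement
    row_marg = [sum(matrix[i]) / n for i in range(size)]
    col_marg = [sum(matrix[i][j] for i in range(size)) / n for j in range(size)]
    p_e = sum(row_marg[i] * col_marg[i] for i in range(size))  # chance agreement
    return p_o, (p_o - p_e) / (1 - p_e)

# Hypothetical, heavily skewed ratings: both judges give most items the same low rating.
m = [[90, 4],
     [4, 2]]
p_o, kappa = cohens_kappa(m)
print(f"observed agreement = {p_o:.2f}, kappa = {kappa:.2f}")  # 0.92 vs. ~0.29
```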

References

  • Alba-Salas, J. (2002). Light verb constructions in Romance: A syntactic analysis. PhD thesis, Cornell University.

  • Baldwin, T., Bannard, C., Tanaka, T., & Widdows, D. (2003). An empirical model of multiword expression decomposability. In Proceedings of the ACL-SIGLEX Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, pp. 89–96.

  • Baldwin, T., & Villavicencio, A. (2002). Extracting the unextractable: A case study on verb-particles. In Proceedings of the Sixth Conference on Computational Natural Language Learning (CoNLL’02), pp. 98–104.

  • Bannard, C., Baldwin, T., & Lascarides, A. (2003). A statistical approach to the semantics of verb-particles. In Proceedings of the ACL-SIGLEX Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, pp. 65–72.

  • BNC Reference Guide (2000). Reference guide for the British National Corpus (World Edition). Second edition.

  • Brinton, L. J., & Akimoto, M. (Eds.) (1999). Collocational and idiomatic aspects of composite predicates in the history of English. John Benjamins Publishing Company.

  • Butt, M. (2003). The light verb jungle. Manuscript.

  • Cacciari, C. (1993). The place of idioms in a literal and metaphorical world. In C. Cacciari & P. Tabossi (Eds.), Idioms: Processing, structure, and interpretation (pp. 27–53). Lawrence Erlbaum Associates.

  • Church, K., Gale, W., Hanks, P., & Hindle, D. (1991). Using statistics in lexical analysis. In U. Zernik (Ed.), Lexical acquisition: Exploiting on-line resources to build a lexicon (pp. 115–164). Lawrence Erlbaum.

  • Claridge, C. (2000). Multi-word verbs in early modern English: A corpus-based study. Amsterdam, Atlanta: Rodopi B.V.

  • Cohen, J. (1968). Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70, 213–220.


  • Collins, M. (1999). Head-driven statistical models for natural language parsing. PhD thesis, University of Pennsylvania.

  • Cruse, D. A. (1986). Lexical semantics. Cambridge University Press.

  • Desbiens, M. C., & Simon, M. (2003). Déterminants et locutions verbales. Manuscript.

  • Dras, M., & Johnson, M. (1996). Death and lightness: Using a demographic model to find support verbs. In Proceedings of the Fifth International Conference on the Cognitive Science of Natural Language Processing.

  • Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1), 61–74.


  • Fazly, A. (2007). Automatic acquisition of lexical knowledge about multiword predicates. PhD thesis, University of Toronto.

  • Fazly, A., North, R., & Stevenson, S. (2005). Automatically distinguishing literal and figurative usages of highly polysemous verbs. In Proceedings of the ACL’05 Workshop on Deep Lexical Acquisition, pp. 38–47.

  • Fazly, A., North, R., & Stevenson, S. (2006). Automatically determining allowable combinations of a class of flexible multiword expressions. In Proceedings of the International Conference on Intelligent Text Processing and Computational Linguistics (CICLing’06), pp. 81–92.

  • Fazly, A., & Stevenson, S. (2006). Automatically constructing a lexicon of verb phrase idiomatic combinations. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL’06), pp. 337–344.

  • Feinstein, A. R., & Cicchetti, D. V. (1990). High agreement but low kappa: I. The problems of two paradoxes. Journal of Clinical Epidemiology, 43(6), 543–549.


  • Fellbaum, C. (Ed.) (1998). WordNet, an electronic lexical database. The MIT Press.

  • Gibbs, R. W. (1993). Why idioms are not dead metaphors. In C. Cacciari & P. Tabossi (Eds.), Idioms: Processing, structure, and interpretation (pp. 57–77). Lawrence Erlbaum Associates.

  • Gibbs, R., & Nayak, N. P. (1989). Psycholinguistic studies on the syntactic behaviour of idioms. Cognitive Psychology, 21, 100–138.


  • Glucksberg, S. (1993). Idiom meanings and allusional content. In C. Cacciari & P. Tabossi (Eds.), Idioms: Processing, structure, and interpretation (pp. 3–26). Lawrence Erlbaum Associates.

  • Grefenstette, G., & Teufel, S. (1995). Corpus-based method for automatic identification of support verbs for nominalization. In Proceedings of the Seventh Meeting of the European Chapter of the Association for Computational Linguistics (EACL’95).

  • Inkpen, D. (2003). Building a lexical knowledge-base of near-synonym differences. PhD thesis, University of Toronto.

  • Johnson, M. (1987). The body in the mind: The bodily basis of meaning, imagination, and reason. The University of Chicago Press.

  • Karimi, S. (1997). Persian complex verbs: Idiomatic or compositional? Lexicology, 3(1), 273–318.


  • Kearns, K. (2002). Light verbs in English. Manuscript.

  • Keller, F., & Lapata, M. (2003). Using the web to obtain frequencies for unseen bigrams. Computational Linguistics, 29, 459–484.


  • Krenn, B., & Evert, S. (2001). Can we do better than frequency? A case study on extracting PP-verb collocations. In Proceedings of the ACL’01 Workshop on Collocations, pp. 39–46.

  • Lakoff, G., & Johnson, M. (1980). Metaphors we live by. The University of Chicago Press.

  • Levin, B. (1993). English verb classes and alternations: A preliminary investigation. The University of Chicago Press.

  • Lin, D. (1999). Automatic identification of non-compositional phrases. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL’99), pp. 317–324.

  • Lin, T.-H. (2001). Light verb syntax and the theory of phrase structure. PhD thesis, University of California, Irvine.

  • McCarthy, D., Keller, B., & Carroll, J. (2003). Detecting a continuum of compositionality in phrasal verbs. In Proceedings of the ACL-SIGLEX Workshop on Multiword Expressions: Analysis, Acquisition and Treatment.

  • Melamed, I. D. (1997). Automatic discovery of non-compositional compounds in parallel data. In Proceedings of the Second Conference on Empirical Methods for Natural Language Processing (EMNLP’97).

  • Miyamoto, T. (2000). The light verb construction in Japanese: The role of the verbal noun. John Benjamins Publishing Company.

  • Mohammad, S., & Hirst, G. (2006). Determining word sense dominance using a thesaurus. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL’06), pp. 121–128.

  • Moirón, M. B. V. (2004). Discarding noise in an automatically acquired lexicon of support verb constructions. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC).

  • Moon, R. (1998). Fixed expressions and idioms in English: A corpus-based approach. Oxford University Press.

  • Newman, J. (1996). Give: A cognitive linguistic study. Mouton de Gruyter.

  • Newman, J., & Rice, S. (2004). Patterns of usage for English SIT, STAND, and LIE: A cognitively inspired exploration in corpus linguistics. Cognitive Linguistics, 15(3), 351–396.


  • Nunberg, G., Sag, I. A., & Wasow, T. (1994). Idioms. Language, 70(3), 491–538.


  • Pauwels, P. (2000). Put, set, lay and place: A cognitive linguistic approach to verbal meaning. LINCOM EUROPA.

  • Pustejovsky, J. (1995). The generative lexicon. MIT Press.

  • Quirk, R., Greenbaum, S., Leech, G., & Svartvik, J. (1985). A comprehensive grammar of the English language. Longman.

  • Rohde, D. L. T. (2004). TGrep2 User Manual.

  • Sag, I. A., Baldwin, T., Bond, F., Copestake, A., & Flickinger, D. (2002). Multiword expressions: A pain in the neck for NLP. In Proceedings of the Third International Conference on Intelligent Text Processing and Computational Linguistics (CICLing’02), pp. 1–15.

  • Seretan, V., Nerima, L., & Wehrli, E. (2003). Extraction of multi-word collocations using syntactic bigram composition. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP’03).

  • Stevenson, S., Fazly, A., & North, R. (2004). Statistical measures of the semi-productivity of light verb constructions. In Proceedings of the ACL’04 Workshop on Multiword Expressions: Integrating Processing, pp. 1–8.

  • Turney, P. D. (2001). Mining the Web for synonyms: PMI-IR versus LSA on TOEFL. In Proceedings of the 12th European Conference on Machine Learning (ECML’01), pp. 491–502.

  • Uchiyama, K., Baldwin, T., & Ishizaki, S. (2005). Disambiguating Japanese compound verbs. Computer Speech and Language, 19, 497–512.


  • Venkatapathy, S., & Joshi, A. (2005). Measuring the relative compositionality of verb-noun (V-N) collocations by integrating features. In Proceedings of the Joint Conference on Human Language Technology and Empirical Methods for Natural Language Processing (HLT-EMNLP’05), pp. 899–906.

  • Villavicencio, A. (2003). Verb-particle constructions and lexical resources. In Proceedings of the ACL-SIGLEX Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, pp. 57–64.

  • Villavicencio, A. (2005). The availability of verb-particle constructions in lexical resources: How much is enough? Computer Speech and Language, 19, 415–432.


  • Wanner, L. (2004). Towards automatic fine-grained semantic classification of verb-noun collocations. Natural Language Engineering, 10(2), 95–143.


  • Wermter, J., & Hahn, U. (2005). Paradigmatic modifiability statistics for the extraction of complex multi-word terms. In Proceedings of the Joint Conference on Human Language Technology and Empirical Methods for Natural Language Processing (HLT-EMNLP’05), pp. 843–850.

  • Wierzbicka, A. (1982). Why can you have a drink when you can’t *have an eat? Language, 58(4), 753–799.



Acknowledgements

We thank Anne-Marie Brousseau for the enlightening discussions regarding the human judgments on figurativeness; Eric Joanis for providing us with NP-head extraction software; and our judges, who made the evaluation of our ideas possible. We are also grateful to the Natural Sciences and Engineering Research Council of Canada (NSERC), the Ontario Graduate Scholarship program (OGS), and the University of Toronto for financial support.

Author information

Correspondence to Afsaneh Fazly.

Appendix

This appendix contains information on the procedure for interpreting the human judgments for the development and test expressions used in the experiments of Sect. 4.3. It also contains the numerical r_s values of the results presented in Sect. 6.3.1.

Tables 10 and 11 show how the judges’ answers to the questions (given in Table 3) are translated into numerical ratings ranging from 0 to 4. Higher numerical ratings express higher degrees of literalness, and hence lower degrees of figurativeness. Expressions for which no numerical rating is listed in the tables were removed from the final set of experimental expressions; these were expressions that a majority of the annotators considered unacceptable or ambiguous. (This resulted in the removal of 11 expressions in total.) Table 12 contains the correlation scores (r_s) for Accept_LVC and PMI_LVC across the three verbs (take, give, and make) and the Levin and WordNet test classes. (These are the numbers used in creating the greyscale representation shown in Fig. 3.)

Table 10 Interpretation of answers to the questions for expressions with give
Table 11 Interpretation of answers to the questions for expressions with take
Table 12 Correlation scores corresponding to Fig. 3
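
The r_s values reported in Table 12 (and throughout Sect. 6.3.1) are Spearman rank correlations between an automatic measure's scores and the human ratings. A minimal sketch of computing such a score, with hypothetical ratings (scipy's spearmanr assigns tied items their average rank):

```python
from scipy.stats import spearmanr

# Hypothetical data: human ratings (on the 0-4 scale of Tables 10-11) for eight
# expressions, paired with an automatic measure's scores for the same items.
human_ratings = [0, 1, 1, 2, 3, 4, 4, 2]
measure_scores = [0.1, 0.3, 0.2, 0.5, 0.7, 0.9, 0.8, 0.4]

r_s, p_value = spearmanr(human_ratings, measure_scores)
print(f"r_s = {r_s:.2f} (p = {p_value:.3f})")
```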


Cite this article

Fazly, A., Stevenson, S. & North, R. Automatically learning semantic knowledge about multiword predicates. Lang Resources & Evaluation 41, 61–89 (2007). https://doi.org/10.1007/s10579-007-9017-9
