Language Resources and Evaluation, Volume 48, Issue 2, pp 249–278

A Hebrew verb–complement dictionary

Original Paper

Abstract

We present a verb–complement dictionary of Modern Hebrew, automatically extracted from text corpora. Carefully examining a large set of examples, we defined ten types of verb complements that cover the vast majority of the occurrences of verb complements in the corpora. We explored several collocation measures as indicators of the strength of the association between the verb and its complement. We then used these measures to automatically extract verb complements from corpora. The result is a wide-coverage, accurate dictionary that lists not only the likely complements for each verb, but also the likelihood of each complement. We evaluated the quality of the extracted dictionary both intrinsically and extrinsically. Intrinsically, we showed high precision and recall on randomly (but systematically) selected verbs. Extrinsically, we showed that using the extracted information is beneficial for two applications, prepositional phrase attachment disambiguation and Arabic-to-Hebrew machine translation.
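
To make the idea of a collocation measure concrete, the following is a minimal sketch that scores verb and complement pairs by pointwise mutual information (PMI). The abstract does not say which measures were explored, so PMI is chosen here only as a common illustration; the function name `pmi_scores`, the input format, and the toy English observations are all assumptions, not the authors' method or data.

```python
import math
from collections import Counter


def pmi_scores(pairs):
    """Score (verb, complement) pairs by pointwise mutual information.

    pairs: iterable of (verb, complement) observations, one per occurrence.
    Returns a dict mapping (verb, complement) -> PMI in bits.
    """
    pairs = list(pairs)
    total = len(pairs)
    pair_counts = Counter(pairs)
    verb_counts = Counter(verb for verb, _ in pairs)
    comp_counts = Counter(comp for _, comp in pairs)

    scores = {}
    for (verb, comp), n in pair_counts.items():
        # PMI = log2( P(v, c) / (P(v) * P(c)) ), estimated from raw counts.
        ratio = n * total / (verb_counts[verb] * comp_counts[comp])
        scores[(verb, comp)] = math.log2(ratio)
    return scores


# Invented toy observations (hypothetical, for illustration only).
observations = [
    ("rely", "on"), ("rely", "on"), ("rely", "on"),
    ("sit", "on"),
    ("sit", "in"), ("sit", "in"),
    ("look", "at"), ("look", "at"),
]
scores = pmi_scores(observations)

# Rank complements of one verb from strongest to weakest association.
ranked_sit = sorted(
    ((comp, s) for (verb, comp), s in scores.items() if verb == "sit"),
    key=lambda item: item[1],
    reverse=True,
)
```

A positive score means the pair co-occurs more often than chance would predict, and a negative score means less often. A real extraction pipeline would also need frequency thresholds, since PMI is known to overrate rare pairs.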

Keywords

Verb subcategorization, Hebrew, Lexicography

Copyright information

© Springer Science+Business Media Dordrecht 2013

Authors and Affiliations

  1. Department of Computer Science, Technion, Haifa, Israel
  2. Department of Computer Science, University of Haifa, Haifa, Israel
