Valence extraction using EM selection and co-occurrence matrices

Abstract

This paper presents two new procedures for extracting verb valences from raw text, with an application to Polish. The first, the EM selection algorithm, performs unsupervised disambiguation of valence frame forests, which are obtained by applying a non-probabilistic deep grammar parser and some post-processing to the text. The second concerns the filtering of incorrect frames detected in the parsed text and is motivated by the observation that verbs which take similar arguments tend to have similar frames. This phenomenon is described in terms of newly introduced co-occurrence matrices. Using co-occurrence matrices, we split filtering into two steps: the list of valid arguments is first determined for each verb, and the pattern according to which the arguments are combined into frames is computed in the second step. Our best extracted dictionary reaches an F-score of 45%, compared to an F-score of 39% for the standard frame-based BHT filtering.
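
The disambiguation idea can be illustrated with a minimal sketch. The abstract does not state the update rule, so the code below is only one plausible reading: each parsed sentence yields a set of alternative frames, frame probabilities are re-estimated with a generic EM update, and the most probable alternative is then selected. The function names, the iteration count, and the frame labels such as "np(acc)" are hypothetical and chosen for illustration only.

```python
from collections import defaultdict


def em_selection(candidate_sets, iterations=50):
    """Estimate frame probabilities from sets of alternative frames.

    Assumption (not taken from the paper): each sentence contributes one
    non-empty set of mutually exclusive candidate frames.
    """
    frames = sorted({f for s in candidate_sets for f in s})
    # Start from a uniform distribution over all observed frames.
    prob = {f: 1.0 / len(frames) for f in frames}
    for _ in range(iterations):
        expected = defaultdict(float)
        # E-step: share one unit of count per sentence among its
        # candidates, in proportion to the current probabilities.
        for candidates in candidate_sets:
            total = sum(prob[f] for f in candidates)
            for f in candidates:
                expected[f] += prob[f] / total
        # M-step: renormalise expected counts into probabilities.
        prob = {f: expected[f] / len(candidate_sets) for f in frames}
    return prob


def select(candidates, prob):
    """Pick the most probable frame among the alternatives."""
    return max(sorted(candidates), key=lambda f: prob[f])


# Hypothetical data: one unambiguous sentence and two ambiguous ones.
candidate_sets = [
    {"np(acc)"},
    {"np(acc)", "np(dat)"},
    {"np(acc)", "np(dat)"},
]
prob = em_selection(candidate_sets)
chosen = select({"np(acc)", "np(dat)"}, prob)
```

In this toy input the unambiguous sentence pulls probability mass toward the frame it supports, so the ambiguous sentences are resolved in favour of that frame. The filtering step described in the abstract is a separate stage and is not modelled here.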

Keywords

Verb valence extraction, EM algorithm, Co-occurrence matrices, Polish language

Copyright information

© Springer Science+Business Media B.V. 2009

Authors and Affiliations

  1. Instytut Podstaw Informatyki PAN, Warszawa, Poland
  2. Centrum Wiskunde and Informatica, Amsterdam, The Netherlands
