Language Resources and Evaluation, Volume 44, Issue 1–2, pp 137–158

Lexical association measures and collocation extraction

Article

Abstract

We present an extensive empirical evaluation of collocation extraction methods based on lexical association measures and their combination. The experiments are performed on three sets of collocation candidates extracted from the Prague Dependency Treebank with manual morphosyntactic annotation and from the Czech National Corpus with automatically assigned lemmas and part-of-speech tags. The collocation candidates were manually labeled as collocational or non-collocational. The evaluation is based on measuring the quality of ranking the candidates according to their chance to form collocations. Performance of the methods is compared by precision-recall curves and mean average precision scores. The work is focused on two-word (bigram) collocations only. We experiment with bigrams extracted from sentence dependency structure as well as from surface word order. Further, we study the effect of corpus size on the performance of the individual methods and their combination.
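The ranking-based evaluation described in the abstract can be illustrated with a minimal sketch. It assumes pointwise mutual information (Church and Hanks 1990) as a single example association measure and the standard information-retrieval definition of average precision; the article's exact scoring procedure may differ. The counts, labels, and the names `pmi`, `average_precision`, and `candidates` are hypothetical and chosen only for illustration.

```python
import math


def pmi(f_xy, f_x, f_y, n):
    """Pointwise mutual information (base 2) of a bigram from raw counts.

    f_xy: bigram frequency, f_x / f_y: component word frequencies,
    n: total number of bigram tokens in the corpus.
    """
    return math.log2((f_xy * n) / (f_x * f_y))


def average_precision(ranked_labels):
    """Mean of the precision values measured at the rank of each true collocation."""
    hits = 0
    precisions = []
    for rank, is_collocation in enumerate(ranked_labels, start=1):
        if is_collocation:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


# Hypothetical toy data (not from the article):
# (bigram, f_xy, f_x, f_y, manual label: True = collocational)
N = 10_000
candidates = [
    ("red tape", 20, 100, 50, True),
    ("of the", 300, 2000, 3000, False),
    ("strong tea", 8, 80, 40, True),
    ("new house", 10, 400, 200, False),
    ("blue lamp", 2, 4, 5, False),  # rare pair, scored highly by PMI
]

# Rank candidates by descending association score.
ranked = sorted(candidates, key=lambda c: pmi(c[1], c[2], c[3], N), reverse=True)
labels = [c[4] for c in ranked]
ap = average_precision(labels)

for bigram, f_xy, f_x, f_y, label in ranked:
    print(f"{bigram:12s} PMI={pmi(f_xy, f_x, f_y, N):6.2f} collocation={label}")
print(f"average precision = {ap:.3f}")
```

Averaging such scores over several data folds or candidate sets gives a mean average precision; plotting precision against recall at each rank gives the corresponding precision-recall curve.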

Keywords

Lexical association measures, Collocations, Multiword expressions, Evaluation

Notes

Acknowledgments

This is a revised and extended version of our previous work (Pecina and Schlesinger 2006). Details on the reference data sets are described in (Pecina 2008a). Experiments that are performed on other data sets and confirm good results of our combination methods are presented in (Pecina 2008b). This work was supported by the Ministry of Education of the Czech Republic project MSM 0021620838.

References

  1. Bartsch, S. (2004). Structural and functional properties of collocations in English. A corpus study of lexical and pragmatic constraints on lexical co-occurrence. Tübingen: Gunter Narr Verlag.
  2. Berry-Rogghe, G. L. (1973). The computation of collocations and their relevance in lexical studies. In The computer and literary studies (pp. 103–112). Edinburgh, New York: University Press.
  3. Choueka, Y. (1988). Looking for needles in a haystack or locating interesting collocational expressions in large textual databases. In Proceedings of the RIAO.
  4. Choueka, Y., Klein, S., & Neuwitz, E. (1983). Automatic retrieval of frequent idiomatic and collocational expressions in a large corpus. Journal of the Association for Literary and Linguistic Computing, 4(1), 34–38.
  5. Church, K., & Hanks, P. (1990). Word association norms, mutual information and lexicography. Computational Linguistics, 16(1), 22–29.
  6. Conger, A. J. (1980). Integration and generalisation of Kappas for multiple raters. Psychological Bulletin, 88, 322–328.
  7. Daille, B. (1996). Study and implementation of combined techniques for automatic extraction of terminology. In J. L. Klavans & P. Resnik (Eds.), The balancing act (Chap. 3, pp. 49–66). Cambridge, MA: MIT Press.
  8. Evert, S. (2004). The statistics of word cooccurrences: Word pairs and collocations. PhD Thesis, University of Stuttgart.
  9. Evert, S., & Krenn, B. (2001). Methods for the qualitative evaluation of lexical association measures. In Proceedings of the 39th annual meeting of the association for computational linguistics (pp. 188–195).
  10. Fawcett, T. (2003). ROC graphs: Notes and practical considerations for data mining researchers. Technical Report, HPL 2003–4. Palo Alto, CA: HP Laboratories.
  11. Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76, 378–382.
  12. Hajič, J. (2004). Disambiguation of rich inflection (computational morphology of Czech) (Vol. 1). Prague: Charles University Press.
  13. Hastie, T., Rosset, S., Tibshirani, R., & Zhu, J. (2004). The entire regularization path for the support vector machine. Journal of Machine Learning Research, 5, 1391–1415.
  14. Hull, D. (1993). Using statistical testing in the evaluation of retrieval experiments. In Proceedings of the 16th annual international ACM SIGIR conference on research and development in information retrieval, New York, NY.
  15. Inkpen, D., & Hirst, G. (2002). Acquiring collocations for lexical choice between near synonyms. In SIGLEX workshop on unsupervised lexical acquisition, 40th meeting of the ACL, Philadelphia.
  16. Kita, K., Kato, Y., Omoto, T., & Yano, Y. (1994). A comparative study of automatic extraction of collocations from corpora: Mutual information vs. cost criteria. Journal of Natural Language Processing, 1(1), 21–33.
  17. Krenn, B. (2000). The usual suspects: Data-oriented models for identification and representation of lexical collocations. PhD Thesis, Saarland University.
  18. Manning, C. D., & Schütze, H. (1999). Foundations of statistical natural language processing. Cambridge, Massachusetts: The MIT Press, Chap. 5. Collocations.
  19. Moore, R. C. (2004). On log-likelihood-ratios and the significance of rare events. In Proceedings of the 2004 conference on EMNLP. Barcelona, Spain.
  20. Palmer, H. E. (1938). A grammar of English words. London: Longman.
  21. PDT (2006). Prague dependency treebank 2.0. Institute of Formal and Applied Linguistics.
  22. Pearce, D. (2002). A comparative evaluation of collocation extraction techniques. In Third international conference on language resources and evaluation. Las Palmas, Spain.
  23. Pecina, P. (2008a). Machine learning approach to multiword expression extraction. In Proceedings of the sixth international conference on language resources and evaluation workshop: Towards a shared task for multiword expressions (MWE 2008). Marrakech, Morocco.
  24. Pecina, P. (2008b). Reference data for Czech collocation extraction. In Proceedings of the sixth international conference on language resources and evaluation workshop: Towards a shared task for multiword expressions (MWE 2008). Marrakech, Morocco.
  25. Pecina, P., & Schlesinger, P. (2006). Combining association measures for collocation extraction. In Proceedings of the 21st international conference on computational linguistics and 44th annual meeting of the association for computational linguistics (COLING/ACL 2006). Sydney, Australia.
  26. Shimohata, S., Sugio, T., & Nagata, J. (1997). Retrieving collocations by co-occurrences and word order constraints. In Proceedings of the 35th meeting of ACL/EACL (pp. 476–481). Madrid, Spain.
  27. Smadja, F. (1993). Retrieving collocations from text: Xtract. Computational Linguistics, 19, 143–177.
  28. Stevens, M. E., Giuliano, V. E., & Heilprin, L. B. (Eds.). (1965). Proceedings of the symposium on statistical association methods for mechanized documentation (Vol. 269). Washington, DC: National Bureau of Standards Miscellaneous Publication.
  29. Venables, W. N., & Ripley, B. (2002). Modern applied statistics with S (4th ed.). New York: Springer.
  30. Zhai, C. (1997). Exploiting context to identify lexical atoms: A statistical view of linguistic context. In International and interdisciplinary conferences on modeling and using context.

Copyright information

© Springer Science+Business Media B.V. 2009

Authors and Affiliations

  1. Institute of Formal and Applied Linguistics, Charles University, Prague, Czech Republic
