Lexical association measures and collocation extraction
- Pavel Pecina
- … show all 1 hide
Rent the article at a discountRent now
* Final gross prices may vary according to local VAT.Get Access
We present an extensive empirical evaluation of collocation extraction methods based on lexical association measures and their combination. The experiments are performed on three sets of collocation candidates extracted from the Prague Dependency Treebank with manual morphosyntactic annotation and from the Czech National Corpus with automatically assigned lemmas and part-of-speech tags. The collocation candidates were manually labeled as collocational or non-collocational. The evaluation is based on measuring the quality of ranking the candidates according to their chance to form collocations. Performance of the methods is compared by precision-recall curves and mean average precision scores. The work is focused on two-word (bigram) collocations only. We experiment with bigrams extracted from sentence dependency structure as well as from surface word order. Further, we study the effect of corpus size on the performance of the individual methods and their combination.
- Bartsch, S. (2004). Structural und functional properties of collocations in English. A corpus study of lexical and pragmatic constraints on lexical co-occurrence. Tübingen: Gunter Narr Verlag.
- Berry-Rogghe, G. L. (1973). The computation of collocations and their relevance in lexical studies. In The computer and literal studies (pp. 103–112). Edinburgh, New York: University Press.
- Choueka, Y. (1988). Looking for needles in a haystack or locating interesting collocational expressions in large textual databases. In Proceedings of the RIAO.
- Choueka, Y., Klein, S., & Neuwitz, E. (1983). Automatic retrieval of frequent idiomatic and collocational expressions in a large corpus. Journal of the Association for Literary and Linguistic Computing, 4(1), 34–38.
- Church, K., & Hanks, P. (1990). Word association norms, mutual information and lexicography. Computational Linguistics, 16(1), 22–29.
- Conger, A. J. (1980). Integration and generalisation of Kappas for multiple raters. Psychological Bulletin, 88, 322–328 CrossRef
- Daille, B. (1996). Study and implementation of combined techniques for automatic extraction of terminology. In J. L. Klavans & P. Resnik (Eds.), The balancing act (Chap. 3, pp. 49–66). Cambridge, MA: MIT Press.
- Evert, S. (2004). The statistics of word cooccurrences: Word pairs and collocations. PhD Thesis, University of Stuttgart.
- Evert, S., & Krenn, B. (2001). Methods for the qualitative evaluation of lexical association measures. In Proceedings of the 39th annual meeting of the association for computational linguistics (pp. 188–195).
- Fawcett, T. (2003). ROC graphs: Notes and practical considerations for data mining researchers. Technical Report, HPL 2003–4. Palo Alto CA: HP Laboratories.
- Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin 76, 378–382. CrossRef
- Hajič, J. (2004). Disambiguation of rich inflection (computational morphology of Czech) (Vol. 1). Prague: Charles University Press.
- Hastie, T., Rosset, S., Tibshirani, R., & Zhu, J. (2004). The entire regularization path for the support vector machine. Journal of Machine Learning Research, 5, 1391–1415.
- Hull, D. (1993). Using statistical testing in the evaluation of retrieval experiments. In Proceedings of the 16th annual international ACM SIGIR conference on research and development in information retrieval, New York, NY.
- Inkpen, D., & Hirst, G. (2002). Acquiring collocations for lexical choice between near synonyms. In SIGLEX workshop on unsupervised lexical acquisition, 40th meeting of the ACL, Philadelphia.
- Kita, K., Kato, Y., Omoto, T., & Yano, Y. (1994). A comparative study of automatic extraction of collocations from corpora: Mutual information vs. cost criteria. Journal of Natural Language Processing, 1(1), 21–33.
- Krenn, B. (2000). The usual suspects: Data-oriented models for identification and representation of lexical collocations. PhD Thesis, Saarland University.
- Manning, C. D., & Schütze, H. (1999). Foundations of statistical natural language processing. Cambridge, Massachusetts: The MIT Press, Chap. 5. Collocations.
- Moore, R. C. (2004). On log-likelihood-ratios and the significance of rare events. In Proceedings of the 2004 conference on EMNLP. Barcelona, Spain
- Palmer, H. E. (1938). A grammar of English words. London: Longman
- PDT (2006). Prague dependency treebank 2.0. Institute of Formal and Applied Lingustics.
- Pearce, D. (2002) A comparative evaluation of collocation extraction techniques. In Third international conference on language resources and evaluation. Spain, Las Palmas.
- Pecina, P. (2008a). Machine learning approach to mutliword expression extraction. In Proceedings of the sixth international conference on language resources and evaluation workshop: Towards a shared task for multiword expressions (MWE 2008), Marrakech, Morocco.
- Pecina, P. (2008b). Reference data for Czech collocation extraction. In Proceedings of the sixth international conference on language resources and evaluation workshop: Towards a shared task for multiword expressions (MWE 2008). Marrakech, Morocco.
- Pecina, P., & Schlesinger, P. (2006) Combining association measures for collocation extraction. In Proceedings of the 21th international conference on computational linguistics and 44th annual meeting of the association for computational linguistics (COLING/ACL 2006). Sydney, Australia.
- Shimohata, S., Sugio, T., Nagata, J. (1997). Retrieving collocations by co-occurrences and word order constraints. In Proceedings of the 35th meeting of ACL/EACL (pp. 476–481). Madrid, Spain.
- Smadja, F. (1993). Retrieving collocations from text: Xtract. Computational Linguistics, 19, 143–177
- Stevens, M. E., Giuliano, V. E., & Heilprin, L. B. (Eds.), (1965). Proceedings of the symposium on statistical association methods for mechanized documentation (Vol. 269). Washington, DC: National Bureau of Standards Miscellaneous Publication.
- Venables, W. N., & Ripley, B. (2002). Modern applied statistics with S (4th ed.). New York: Springer.
- Zhai, C. (1997). Exploiting context to identify lexical atoms: A statistical view of linguistic context. In International and interdisciplinary conferences on modeling and using context.
- Lexical association measures and collocation extraction
Language Resources and Evaluation
Volume 44, Issue 1-2 , pp 137-158
- Cover Date
- Print ISSN
- Online ISSN
- Springer Netherlands
- Additional Links
- Lexical association measures
- Multiword expressions
- Industry Sectors
- Pavel Pecina (1)
- Author Affiliations
- 1. Institute of Formal and Applied Linguistics, Charles University, Prague, Czech Republic