Lexical association measures and collocation extraction

Language Resources and Evaluation

Abstract

We present an extensive empirical evaluation of collocation extraction methods based on lexical association measures and their combination. The experiments are performed on three sets of collocation candidates extracted from the Prague Dependency Treebank with manual morphosyntactic annotation and from the Czech National Corpus with automatically assigned lemmas and part-of-speech tags. The collocation candidates were manually labeled as collocational or non-collocational. The evaluation measures the quality of ranking the candidates by how likely they are to form collocations. The performance of the methods is compared using precision-recall curves and mean average precision scores. The work focuses on two-word (bigram) collocations only. We experiment with bigrams extracted from sentence dependency structure as well as from surface word order. Further, we study the effect of corpus size on the performance of the individual methods and their combination.
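The ranking-based evaluation described in the abstract can be illustrated with a minimal sketch. It is not the setup used in the experiments: it assumes pointwise mutual information (Church and Hanks 1990) as the only association measure, plain average precision as the score, and a hypothetical toy list of candidates whose counts and labels are invented purely for illustration. The names `pmi`, `average_precision`, and `candidates` are likewise assumptions.

```python
import math


def pmi(f_xy, f_x, f_y, n):
    """Pointwise mutual information of a bigram, computed from raw counts.

    f_xy: bigram frequency, f_x and f_y: component word frequencies,
    n: total number of bigram tokens in the (hypothetical) corpus.
    """
    return math.log2((f_xy * n) / (f_x * f_y))


def average_precision(ranked_labels):
    """Mean of the precision values at the rank of each true collocation."""
    hits = 0
    precisions = []
    for rank, is_collocation in enumerate(ranked_labels, start=1):
        if is_collocation:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


# Hypothetical toy data: (bigram, f_xy, f_x, f_y, manual label).
N = 10_000
candidates = [
    ("red tape", 20, 50, 40, True),
    ("of the", 300, 2000, 3000, False),
    ("strong tea", 8, 100, 40, True),
    ("new car", 10, 400, 200, False),
]

# Rank candidates by descending association score, then score the ranking
# against the manual labels.
ranked = sorted(candidates, key=lambda c: pmi(c[1], c[2], c[3], N), reverse=True)
labels = [c[4] for c in ranked]
ap = average_precision(labels)
print([c[0] for c in ranked], ap)
```

Any other association measure can be swapped in for `pmi` without changing the evaluation code, which is what makes a ranking-based comparison of many measures practical.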

Notes

  1. An agreement measure for any number of annotators (Fleiss 1971): \(\kappa = \frac{P_o - P_e}{1 - P_e},\) where \(P_o\) is the relative observed agreement among annotators and \(P_e\) is the theoretical probability of chance agreement (each annotator randomly choosing each category). The factor \(1 - P_e\) then corresponds to the level of agreement achievable above chance, and \(P_o - P_e\) is the level of agreement actually achieved above chance. For two annotators, the exact Fleiss’ \(\kappa\) reduces to the well-known Cohen’s \(\kappa\) (Conger 1980).
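As a worked illustration of the formula in this note, the following sketch computes Fleiss' \(\kappa\) from a table of label counts. It assumes that every item is labeled by the same number of annotators and that chance agreement \(P_e\) is estimated from the pooled category proportions, as in Fleiss (1971). The function name `fleiss_kappa` and the toy ratings are hypothetical.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a table of label counts.

    ratings[i][j] is the number of annotators who assigned item i to
    category j; every row must sum to the same number of annotators.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_categories = len(ratings[0])
    total = n_items * n_raters

    # P_e: chance agreement from the pooled proportion of each category.
    proportions = [
        sum(row[j] for row in ratings) / total for j in range(n_categories)
    ]
    p_e = sum(p * p for p in proportions)

    # P_o: mean over items of the fraction of agreeing annotator pairs.
    p_o = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ) / n_items

    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy data: 4 candidates, 3 annotators, categories
# (collocational, non-collocational).
toy_ratings = [[3, 0], [0, 3], [2, 1], [3, 0]]
kappa = fleiss_kappa(toy_ratings)
print(kappa)  # 0.625 for this toy table
```

Here \(P_o = 5/6\) and \(P_e = 5/9\), so \(\kappa = (5/6 - 5/9)/(1 - 5/9) = 5/8\).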

References

  • Bartsch, S. (2004). Structural and functional properties of collocations in English. A corpus study of lexical and pragmatic constraints on lexical co-occurrence. Tübingen: Gunter Narr Verlag.

  • Berry-Rogghe, G. L. (1973). The computation of collocations and their relevance in lexical studies. In The computer and literary studies (pp. 103–112). Edinburgh, New York: University Press.

  • Choueka, Y. (1988). Looking for needles in a haystack or locating interesting collocational expressions in large textual databases. In Proceedings of the RIAO.

  • Choueka, Y., Klein, S., & Neuwitz, E. (1983). Automatic retrieval of frequent idiomatic and collocational expressions in a large corpus. Journal of the Association for Literary and Linguistic Computing, 4(1), 34–38.

  • Church, K., & Hanks, P. (1990). Word association norms, mutual information and lexicography. Computational Linguistics, 16(1), 22–29.

  • Conger, A. J. (1980). Integration and generalisation of Kappas for multiple raters. Psychological Bulletin, 88, 322–328.

  • Daille, B. (1996). Study and implementation of combined techniques for automatic extraction of terminology. In J. L. Klavans & P. Resnik (Eds.), The balancing act (Chap. 3, pp. 49–66). Cambridge, MA: MIT Press.

  • Evert, S. (2004). The statistics of word cooccurrences: Word pairs and collocations. PhD Thesis, University of Stuttgart.

  • Evert, S., & Krenn, B. (2001). Methods for the qualitative evaluation of lexical association measures. In Proceedings of the 39th annual meeting of the association for computational linguistics (pp. 188–195).

  • Fawcett, T. (2003). ROC graphs: Notes and practical considerations for data mining researchers. Technical Report, HPL 2003–4. Palo Alto CA: HP Laboratories.

  • Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76, 378–382.

  • Hajič, J. (2004). Disambiguation of rich inflection (computational morphology of Czech) (Vol. 1). Prague: Charles University Press.

  • Hastie, T., Rosset, S., Tibshirani, R., & Zhu, J. (2004). The entire regularization path for the support vector machine. Journal of Machine Learning Research, 5, 1391–1415.

  • Hull, D. (1993). Using statistical testing in the evaluation of retrieval experiments. In Proceedings of the 16th annual international ACM SIGIR conference on research and development in information retrieval, New York, NY.

  • Inkpen, D., & Hirst, G. (2002). Acquiring collocations for lexical choice between near synonyms. In SIGLEX workshop on unsupervised lexical acquisition, 40th meeting of the ACL, Philadelphia.

  • Kita, K., Kato, Y., Omoto, T., & Yano, Y. (1994). A comparative study of automatic extraction of collocations from corpora: Mutual information vs. cost criteria. Journal of Natural Language Processing, 1(1), 21–33.

  • Krenn, B. (2000). The usual suspects: Data-oriented models for identification and representation of lexical collocations. PhD Thesis, Saarland University.

  • Manning, C. D., & Schütze, H. (1999). Foundations of statistical natural language processing. Cambridge, Massachusetts: The MIT Press, Chap. 5. Collocations.

  • Moore, R. C. (2004). On log-likelihood-ratios and the significance of rare events. In Proceedings of the 2004 conference on EMNLP. Barcelona, Spain.

  • Palmer, H. E. (1938). A grammar of English words. London: Longman.

  • PDT (2006). Prague dependency treebank 2.0. Institute of Formal and Applied Linguistics.

  • Pearce, D. (2002). A comparative evaluation of collocation extraction techniques. In Third international conference on language resources and evaluation. Spain, Las Palmas.

  • Pecina, P. (2008a). Machine learning approach to multiword expression extraction. In Proceedings of the sixth international conference on language resources and evaluation workshop: Towards a shared task for multiword expressions (MWE 2008). Marrakech, Morocco.

  • Pecina, P. (2008b). Reference data for Czech collocation extraction. In Proceedings of the sixth international conference on language resources and evaluation workshop: Towards a shared task for multiword expressions (MWE 2008). Marrakech, Morocco.

  • Pecina, P., & Schlesinger, P. (2006). Combining association measures for collocation extraction. In Proceedings of the 21st international conference on computational linguistics and 44th annual meeting of the association for computational linguistics (COLING/ACL 2006). Sydney, Australia.

  • Shimohata, S., Sugio, T., & Nagata, J. (1997). Retrieving collocations by co-occurrences and word order constraints. In Proceedings of the 35th meeting of ACL/EACL (pp. 476–481). Madrid, Spain.

  • Smadja, F. (1993). Retrieving collocations from text: Xtract. Computational Linguistics, 19, 143–177.

  • Stevens, M. E., Giuliano, V. E., & Heilprin, L. B. (Eds.). (1965). Proceedings of the symposium on statistical association methods for mechanized documentation (Vol. 269). Washington, DC: National Bureau of Standards Miscellaneous Publication.

  • Venables, W. N., & Ripley, B. (2002). Modern applied statistics with S (4th ed.). New York: Springer.

  • Zhai, C. (1997). Exploiting context to identify lexical atoms: A statistical view of linguistic context. In International and interdisciplinary conferences on modeling and using context.

Acknowledgments

This is a revised and extended version of our previous work (Pecina and Schlesinger 2006). The reference data sets are described in detail in Pecina (2008b). Experiments performed on other data sets, which confirm the good results of our combination methods, are presented in Pecina (2008a). This work was supported by the Ministry of Education of the Czech Republic, project MSM 0021620838.

Author information

Correspondence to Pavel Pecina.

Appendix

Table 3 The inventory of lexical association measures used for collocation extraction in our experiments

Cite this article

Pecina, P. Lexical association measures and collocation extraction. Lang Resources & Evaluation 44, 137–158 (2010). https://doi.org/10.1007/s10579-009-9101-4
