Comparative Evaluation and Integration of Collocation Extraction Metrics
The paper deals with collocation extraction from corpus data. Numerous formulae have been devised to combine the different factors that determine the association between the components of a collocation. We describe experiments whose objective was to study collocation extraction based on statistical association measures. The work focuses on bigram collocations. The precision data obtained for the measures indicate, to some degree, that some measures are more precise than others. No measure is ideal, which is why various ways of integrating them are desirable and useful. We propose several parameters for ranking collocates in a combined list, namely an average rank, a normalized rank, and an optimized rank.
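As an illustration of combining measures through an average rank, the following minimal Python sketch scores bigrams with two standard association measures (MI and t-score), ranks the bigrams under each measure, and orders them by the mean of those ranks. The abstract does not say which measures were used or how the ranks are defined, so the choice of measures, the function names, and the handling of ties are assumptions made for illustration only. The normalized and optimized ranks are not sketched because they are not defined here.

```python
import math
from collections import Counter


def bigram_scores(tokens):
    """Score every adjacent word pair with MI and t-score (illustrative choice)."""
    n = len(tokens) - 1  # number of bigram positions in the token sequence
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), observed in bigrams.items():
        # Expected co-occurrence count if the two words were independent.
        expected = unigrams[w1] * unigrams[w2] / n
        scores[(w1, w2)] = {
            "MI": math.log2(observed / expected),
            "t-score": (observed - expected) / math.sqrt(observed),
        }
    return scores


def ranks(scores, measure):
    """Rank bigrams by one measure, best score first (rank 1)."""
    # Assumption: ties are broken by insertion order, not given shared ranks.
    ordered = sorted(scores, key=lambda b: scores[b][measure], reverse=True)
    return {bigram: position + 1 for position, bigram in enumerate(ordered)}


def average_rank(scores, measures=("MI", "t-score")):
    """Order bigrams by the mean of their ranks across the given measures."""
    per_measure = [ranks(scores, m) for m in measures]
    averaged = {
        bigram: sum(r[bigram] for r in per_measure) / len(per_measure)
        for bigram in scores
    }
    return sorted(averaged.items(), key=lambda item: item[1])


if __name__ == "__main__":
    toy_corpus = "strong tea is strong tea and weak tea is weak".split()
    for bigram, rank in average_rank(bigram_scores(toy_corpus)):
        print(bigram, rank)
```

Averaging ranks rather than raw scores sidesteps the fact that different measures produce values on incomparable scales. It also treats every measure as equally reliable, which is presumably one reason to consider weighted or otherwise adjusted variants.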
Keywords: Collocation extraction, Association measures, Evaluation, Ranking, Average rank, Normalized rank, Optimized rank
This work was partly supported by a grant from the Russian Foundation for Humanities (research project No. 16-04-12019).