Abstract
This paper deals with collocation extraction from corpus data. Numerous formulae have been proposed to integrate the different factors that determine the association between the components of a collocation. We describe experiments whose objective was to study collocation extraction based on statistical association measures. The work focuses on bigram collocations. The data obtained on the precision of the measures indicate, to some degree, that some measures are more precise than others. No measure is ideal, which is why various ways of integrating them are desirable and useful. We propose a number of parameters that make it possible to rank collocates in a combined list, namely an average rank, a normalized rank, and an optimized rank.
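As an illustration of the idea of a combined list, the following minimal sketch ranks bigrams by two common association measures and then orders them by their average rank. It is not the authors' implementation: the choice of measures (MI and t-score in their usual textbook form), the function names, and the toy token list are all assumptions made for the example. The normalized and optimized ranks are not defined in the abstract and are therefore not sketched here.

```python
import math
from collections import Counter


def association_scores(tokens):
    """Score every adjacent word pair with MI and t-score (assumed measures)."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), observed in bigrams.items():
        # Expected co-occurrence frequency if the two words were independent.
        expected = unigrams[w1] * unigrams[w2] / n
        scores[(w1, w2)] = {
            "mi": math.log2(observed / expected),
            "t_score": (observed - expected) / math.sqrt(observed),
        }
    return scores


def ranks(scores, measure):
    """Rank bigrams by one measure; rank 1 is the strongest association."""
    ordered = sorted(scores, key=lambda b: scores[b][measure], reverse=True)
    return {bigram: position for position, bigram in enumerate(ordered, start=1)}


def average_rank(scores, measures=("mi", "t_score")):
    """Combine the per-measure rankings into one list sorted by mean rank."""
    per_measure = [ranks(scores, m) for m in measures]
    combined = {
        bigram: sum(r[bigram] for r in per_measure) / len(per_measure)
        for bigram in scores
    }
    return sorted(combined.items(), key=lambda item: item[1])


# Hypothetical toy input; a real experiment would use corpus frequencies.
tokens = "strong tea and strong tea with strong coffee and weak tea".split()
for bigram, rank in average_rank(association_scores(tokens)):
    print(bigram, rank)
```

Averaging ranks rather than raw scores sidesteps the fact that different measures live on different scales, which is one plausible motivation for rank-based integration.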
Acknowledgments
This work was partly supported by a grant from the Russian Foundation for Humanities (research project No. 16-04-12019).
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Zakharov, V. (2017). Comparative Evaluation and Integration of Collocation Extraction Metrics. In: Ekštein, K., Matoušek, V. (eds) Text, Speech, and Dialogue. TSD 2017. Lecture Notes in Computer Science, vol 10415. Springer, Cham. https://doi.org/10.1007/978-3-319-64206-2_29
DOI: https://doi.org/10.1007/978-3-319-64206-2_29
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-64205-5
Online ISBN: 978-3-319-64206-2
eBook Packages: Computer Science, Computer Science (R0)