Language Resources and Evaluation, Volume 43, Issue 1, pp 71–85

Multilingual collocation extraction with a syntactic parser


DOI: 10.1007/s10579-008-9075-7

Cite this article as:
Seretan, V. & Wehrli, E. Lang Resources & Evaluation (2009) 43: 71. doi:10.1007/s10579-008-9075-7


An impressive amount of work has been devoted to collocation extraction over the past few decades. The state of the art shows a sustained interest in the morphosyntactic preprocessing of texts in order to better identify candidate expressions; in most cases, however, the treatment performed is limited (lemmatization, POS-tagging, or shallow parsing). This article presents a collocation extraction system based on the full parsing of source corpora, which supports four languages: English, French, Spanish, and Italian. The performance of the system is compared against that of the standard mobile-window method. The evaluation experiment investigates several levels of the significance lists, uses a fine-grained annotation schema, and covers all the languages supported. Consistent results were obtained for these languages: parsing, even if imperfect, leads to a significant improvement in the quality of results, in terms of collocational precision (between 16.4 and 29.7%, depending on the language; 20.1% overall), multi-word expression (MWE) precision (between 19.9 and 35.8%; 26.1% overall), and grammatical precision (between 47.3 and 67.4%; 55.6% overall). This positive result is highly important, especially in view of the subsequent integration of extraction results into other NLP applications.
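For readers unfamiliar with the baseline, the following is a minimal sketch of window-based candidate extraction and of a precision computation over an annotated set. It is illustrative only and is not the system evaluated in the article: the function names, the window size, and the toy sentence are all assumptions, and the ranking of candidates by an association measure is omitted.

```python
from collections import Counter


def window_candidates(tokens, window=5):
    """Count ordered word pairs that co-occur within `window` tokens.

    Hypothetical helper: the window size used in the article is not
    stated in the abstract, so 5 is only a placeholder default.
    """
    pairs = Counter()
    for i, left in enumerate(tokens):
        # Pair the current token with each of the next `window` tokens.
        for right in tokens[i + 1 : i + 1 + window]:
            pairs[(left, right)] += 1
    return pairs


def precision(extracted, gold):
    """Share of extracted candidates that also appear in the gold set."""
    if not extracted:
        return 0.0
    return sum(1 for cand in extracted if cand in gold) / len(extracted)


# Toy input (assumed, not taken from the article's corpora).
tokens = "the committee reached a decision and reached an agreement".split()
counts = window_candidates(tokens, window=2)
```

A window-based extractor of this kind pairs words purely by linear distance, so it also yields pairs such as ("the", "committee") that stand in no collocational relation. A parser-based extractor instead restricts candidates to word pairs linked by a syntactic relation, which is the contrast the evaluation measures.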


Keywords: Collocation extraction, Evaluation, Hybrid methods, Multilingual issues, Syntactic parsing

Copyright information

© Springer Science+Business Media B.V. 2008

Authors and Affiliations

  1. Language Technology Laboratory (LATL), University of Geneva, Geneva, Switzerland