Language Resources and Evaluation, Volume 43, Issue 1, pp 71–85

Multilingual collocation extraction with a syntactic parser



A considerable amount of work has been devoted to collocation extraction over the past few decades. The state of the art shows a sustained interest in the morphosyntactic preprocessing of texts in order to better identify candidate expressions; in most cases, however, this preprocessing is limited to lemmatization, POS-tagging, or shallow parsing. This article presents a collocation extraction system based on the full parsing of source corpora, which supports four languages: English, French, Spanish, and Italian. The performance of the system is compared against that of the standard mobile-window method. The evaluation experiment investigates several levels of the significance lists, uses a fine-grained annotation schema, and covers all the supported languages. Consistent results were obtained across these languages: parsing, even if imperfect, leads to a significant improvement in the quality of results, in terms of collocational precision (between 16.4 and 29.7%, depending on the language; 20.1% overall), MWE precision (between 19.9 and 35.8%; 26.1% overall), and grammatical precision (between 47.3 and 67.4%; 55.6% overall). This positive result is highly important, especially in view of the subsequent integration of extraction results into other NLP applications.
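As a rough illustration of the kind of baseline the abstract refers to, the sketch below pairs each word with the words that follow it inside a fixed-size window, and computes precision against a set of manually validated pairs. The abstract does not give the window size, the filtering steps, or the association measure used in the article, so the window of 5 tokens and the names `window_candidates` and `precision` are assumptions made for this example only.

```python
from collections import Counter


def window_candidates(tokens, window=5):
    """Count ordered word pairs that co-occur within `window` tokens.

    Assumption: a plain sliding window with no POS filtering; the article's
    actual baseline may differ.
    """
    pairs = Counter()
    for i, left in enumerate(tokens):
        # Pair the current token with each of the next `window - 1` tokens.
        for right in tokens[i + 1 : i + window]:
            pairs[(left, right)] += 1
    return pairs


def precision(candidates, gold):
    """Share of extracted candidates that appear in the validated set `gold`."""
    if not candidates:
        return 0.0
    return sum(1 for c in candidates if c in gold) / len(candidates)


tokens = "the committee took a difficult decision".split()
counts = window_candidates(tokens, window=5)
```

In this toy sentence the pair `("took", "decision")` is proposed because the two words fall inside the same window, but so are many unrelated pairs such as `("the", "took")`. A parser-based extractor would instead propose only pairs that stand in a syntactic relation, which is the intuition behind the precision gains reported above.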


Keywords: Collocation extraction, Evaluation, Hybrid methods, Multilingual issues, Syntactic parsing



This work was supported in part by Swiss National Science Foundation grant no. 101412-103999. We wish to thank Jorge Antonio Leoni de León, Yves Scherrer and Vincenzo Pallotta for participating in the annotation task, as well as Stephanie Durrleman-Tame for proofreading the article. We are very grateful to the anonymous reviewers, whose comments and suggestions helped us to improve this paper.



Copyright information

© Springer Science+Business Media B.V. 2008

Authors and Affiliations

  1. Language Technology Laboratory (LATL), University of Geneva, Geneva, Switzerland
