A Multilingual Integrated Framework for Processing Lexical Collocations

  • Violeta Seretan
Part of the Studies in Computational Intelligence book series (SCI, volume 458)


Lexical collocations are typical combinations of words, such as heavy rain, close collaboration, or to meet a deadline. Pervasive in language, they are a key issue for NLP systems because, like other types of multi-word expressions such as idioms, they do not allow for word-by-word processing. We present a multilingual framework that emphasizes the accurate acquisition of collocational knowledge from corpora and its exploitation in two large-scale applications (parsing and machine translation), as well as for lexicographic support and reading assistance. The underlying methodology departs from mainstream approaches by relying on deep parsing to cope with the high morphosyntactic flexibility of collocations. We review theoretical claims and contrast them with practical work, showing our efforts to model collocations in an adequate and comprehensive way. Experimental results show the efficiency of our approach and the impact of collocational knowledge on the performance of parsing and machine translation.


Keywords: Natural Language Processing; Machine Translation; Parse Tree; Statistical Machine Translation; Computational Linguistics
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  1. Department of Translation Technology, Faculty of Translation and Interpreting, University of Geneva, Geneva, Switzerland
