A Multilingual Integrated Framework for Processing Lexical Collocations

  • Chapter

Part of the book series: Studies in Computational Intelligence (SCI, volume 458)

Abstract

Lexical collocations are typical combinations of words, such as heavy rain, close collaboration, or to meet a deadline. Pervasive in language, they are a key issue for NLP systems since, like other types of multi-word expressions such as idioms, they do not allow for word-by-word processing. We present a multilingual framework that lays emphasis on the accurate acquisition of collocational knowledge from corpora and its exploitation in two large-scale applications (parsing and machine translation), as well as for lexicographic support and for reading assistance. The underlying methodology departs from mainstream approaches by relying on deep parsing to cope with the high morphosyntactic flexibility of collocations. We review theoretical claims and contrast them with practical work, showing our efforts to model collocations in an adequate and comprehensive way. Experimental results show the efficiency of our approach and the impact of collocational knowledge on the performance of parsing and machine translation.

Author information

Corresponding author

Correspondence to Violeta Seretan.

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Seretan, V. (2013). A Multilingual Integrated Framework for Processing Lexical Collocations. In: Przepiórkowski, A., Piasecki, M., Jassem, K., Fuglewicz, P. (eds) Computational Linguistics. Studies in Computational Intelligence, vol 458. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-34399-5_5

  • DOI: https://doi.org/10.1007/978-3-642-34399-5_5

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-34398-8

  • Online ISBN: 978-3-642-34399-5

  • eBook Packages: Engineering, Engineering (R0)
