A Multilingual Integrated Framework for Processing Lexical Collocations

Seretan, Violeta

doi:10.1007/978-3-642-34399-5_5

Part of the book series: Studies in Computational Intelligence (SCI, volume 458)

Abstract

Lexical collocations are typical combinations of words, such as heavy rain, close collaboration, or to meet a deadline. Pervasive in language, they are a key issue for NLP systems because, like other types of multi-word expressions such as idioms, they do not allow for word-by-word processing. We present a multilingual framework that emphasizes the accurate acquisition of collocational knowledge from corpora and its exploitation in two large-scale applications (parsing and machine translation), as well as in lexicographic support and reading assistance. The underlying methodology departs from mainstream approaches by relying on deep parsing to cope with the high morphosyntactic flexibility of collocations. We review theoretical claims and contrast them with practical work, showing our efforts to model collocations in an adequate and comprehensive way. Experimental results show the efficiency of our approach and the impact of collocational knowledge on the performance of parsing and machine translation.

Author information

Authors and Affiliations

Department of Translation Technology, Faculty of Translation and Interpreting, University of Geneva, 40 bd. du Pont-d’Arve, 1211, Geneva, Switzerland
Violeta Seretan

Corresponding author

Correspondence to Violeta Seretan.

Editor information

Editors and Affiliations

Institute of Computer Science, Polish Academy of Sciences, ul. Ordona 21, Warsaw, 01-237, Poland
Adam Przepiórkowski
Institute of Informatics, Wroclaw University of Technology, ul. Wybrzeże Wyspiańskiego 27, Wroclaw, 50-370, Poland
Maciej Piasecki
Faculty of Mathematics and Computer Scie, Adam Mickiewicz University, ul. Umultowska 87, Poznań, 61-614, Poland
Krzysztof Jassem
TiP Sp. z o. o., Francuska 35/37, Katowice, 40-027, Poland
Piotr Fuglewicz

About this chapter

Cite this chapter

Seretan, V. (2013). A Multilingual Integrated Framework for Processing Lexical Collocations. In: Przepiórkowski, A., Piasecki, M., Jassem, K., Fuglewicz, P. (eds) Computational Linguistics. Studies in Computational Intelligence, vol 458. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-34399-5_5

DOI: https://doi.org/10.1007/978-3-642-34399-5_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-34398-8
Online ISBN: 978-3-642-34399-5
eBook Packages: Engineering, Engineering (R0)
