Language Resources and Evaluation, Volume 44, Issue 1–2, pp 59–77

Alignment-based extraction of multiword expressions

  • Helena Medeiros de Caseli
  • Carlos Ramisch
  • Maria das Graças Volpe Nunes
  • Aline Villavicencio


Due to idiosyncrasies in their syntax, semantics or frequency, Multiword Expressions (MWEs) have received special attention from the NLP community, as the methods and techniques developed for the treatment of simplex words are not necessarily suitable for them. This is certainly the case for the automatic acquisition of MWEs from corpora. Much effort has been directed to the task of automatically identifying them, with considerable success. In this paper, we propose an approach for the identification of MWEs in a multilingual context, as a by-product of a word alignment process, that not only deals with the identification of possible MWE candidates, but also associates some multiword expressions with semantics. The results obtained indicate the feasibility of this approach and its low cost in terms of the tools and resources it demands, which could, for example, facilitate and speed up lexicographic work.
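
As a rough illustration of how MWE candidates can fall out of a word alignment, the sketch below collects runs of two or more adjacent source tokens that are all linked to the same target token. It is a minimal sketch under stated assumptions, not the procedure used in the paper: the function name `extract_mwe_candidates`, the `(source_index, target_index)` link format, the contiguity requirement, and the Portuguese–English example sentence are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def extract_mwe_candidates(src_tokens, tgt_tokens, links):
    """Return (source n-gram, target token) pairs with n >= 2.

    `links` is a list of (source_index, target_index) alignment pairs
    (an assumed input format). A candidate is a run of two or more
    adjacent source tokens that are all linked to the same target token.
    """
    # Group the source positions linked to each target position.
    by_target = defaultdict(set)
    for s, t in links:
        by_target[t].add(s)

    candidates = []

    def flush(run, t):
        # Keep only runs that span at least two source tokens.
        if len(run) >= 2:
            ngram = " ".join(src_tokens[i] for i in run)
            candidates.append((ngram, tgt_tokens[t]))

    for t in sorted(by_target):
        run = []
        for s in sorted(by_target[t]):
            if run and s != run[-1] + 1:
                # Gap in the source positions: close the current run.
                flush(run, t)
                run = []
            run.append(s)
        flush(run, t)
    return candidates


# Made-up example data: three source tokens aligned to one target token.
src = ["ela", "tem", "dor", "de", "cabeça"]
tgt = ["she", "has", "a", "headache"]
links = [(0, 0), (1, 1), (2, 3), (3, 3), (4, 3)]
print(extract_mwe_candidates(src, tgt, links))
```

Pairing each candidate with the target token it was aligned to is one simple way to keep a translation alongside the expression. In practice the links would come from an automatic word aligner, and the candidates would still need filtering.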


Keywords: Automatic identification, Word alignment, Machine translation, Terminology, Multiword expressions, Lexical acquisition, Statistical methods



We are grateful for the financial support of the Brazilian agencies FAPESP (02/13207-8), CNPq (550388/2005-2), SEBRAE/FINEP (1194/07) and CAPES (CAPES/COFECUB 548/07). We also thank Mônica Saddy Martins for helping in the evaluation process, and the anonymous reviewers for their useful comments.



Copyright information

© Springer Science+Business Media B.V. 2009

Authors and Affiliations

  • Helena Medeiros de Caseli (1)
  • Carlos Ramisch (2)
  • Maria das Graças Volpe Nunes (3)
  • Aline Villavicencio (2, 4)

  1. NILC, Department of Computer Science, Federal University of São Carlos, São Carlos, Brazil
  2. Institute of Informatics, Federal University of Rio Grande do Sul, Porto Alegre, Brazil
  3. NILC, ICMC, University of São Paulo, São Carlos, Brazil
  4. Department of Computer Science, University of Bath, Bath, UK
