A Hybrid Approach for Multiword Expression Identification

  • Carlos Ramisch
  • Helena de Medeiros Caseli
  • Aline Villavicencio
  • André Machado
  • Maria José Finatto
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6001)


Considerable attention has been given to the identification and treatment of Multiword Expressions (MWEs), since handling them well improves the quality of NLP tasks such as parsing and generation. Statistical methods have often been employed for MWE identification, as an inexpensive and language-independent way of finding co-occurrence patterns. On the other hand, more linguistically motivated methods, which employ information such as POS filters and lexical alignment between languages, can produce more targeted candidate lists. In this paper we propose a hybrid approach that uses a machine learning algorithm to combine the strengths of these different sources of information and produce more robust and precise results. Automatic evaluation on gold standards shows that our hybrid method outperforms both the statistical and the alignment-based MWE extraction approaches taken individually, for Portuguese and for English. The method can be used to aid lexicographic work by providing a more targeted MWE candidate list.
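As a rough illustration of the statistical side described above, the following sketch scores adjacent word pairs with pointwise mutual information (PMI), one common association measure for finding co-occurrence patterns. It is not the authors' implementation: the abstract does not specify which measures, features, or learner the paper uses, and the function names `bigram_pmi` and `rank_candidates`, the `min_count` threshold, and the probability estimates are all assumptions made for this example.

```python
import math
from collections import Counter


def bigram_pmi(tokens, min_count=1):
    """Return {(w1, w2): PMI} for adjacent word pairs in a token list.

    Hypothetical helper: PMI = log2( P(w1, w2) / (P(w1) * P(w2)) ),
    with probabilities estimated by simple relative frequency.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)  # number of adjacent pairs
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_joint = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_joint / (p_w1 * p_w2))
    return scores


def rank_candidates(tokens, min_count=2):
    """Return MWE candidates sorted by descending PMI (assumed ranking)."""
    scores = bigram_pmi(tokens, min_count)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    text = "new york is big and new york is old"
    for pair, score in rank_candidates(text.split()):
        print(pair, round(score, 3))
```

In a hybrid setup of the kind the abstract describes, a score like this would be only one input among several (for example, alongside alignment-based evidence) to a classifier, rather than a final decision on its own.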


Keywords: Bayesian Network; Machine Translation; Candidate List; Parallel Corpus; Pointwise Mutual Information
These keywords were added by machine and not by the authors.





Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Carlos Ramisch (1, 2)
  • Helena de Medeiros Caseli (3)
  • Aline Villavicencio (2, 4)
  • André Machado (2)
  • Maria José Finatto (5)

  1. GETALP/LIG, University of Grenoble (France)
  2. Institute of Informatics, Federal University of Rio Grande do Sul (Brazil)
  3. Department of Computer Science, Federal University of São Carlos (Brazil)
  4. Department of Computer Sciences, Bath University (UK)
  5. Institute of Language and Linguistics, Federal University of Rio Grande do Sul (Brazil)
