Alignment-based extraction of multiword expressions

Language Resources and Evaluation

Abstract

Due to idiosyncrasies in their syntax, semantics or frequency, Multiword Expressions (MWEs) have received special attention from the NLP community, as the methods and techniques developed for the treatment of simplex words are not necessarily suitable for them. This is certainly the case for the automatic acquisition of MWEs from corpora. A lot of effort has been directed to the task of automatically identifying them, with considerable success. In this paper, we propose an approach for the identification of MWEs in a multilingual context, as a by-product of a word alignment process, that not only deals with the identification of possible MWE candidates, but also associates some multiword expressions with semantics. The results obtained indicate the feasibility and low costs in terms of tools and resources demanded by this approach, which could, for example, facilitate and speed up lexicographic work.
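
As a rough illustration of how MWE candidates can fall out of a word alignment, the sketch below collects contiguous source-language word sequences that are all linked to a single target-language word. It is a minimal sketch and not the procedure evaluated in the paper: the function name `mwe_candidates`, the representation of links as `(source_index, target_index)` pairs, the contiguity filter and the hand-made alignment are all assumptions made for this example. In practice the links would come from an automatic word aligner.

```python
from collections import defaultdict


def mwe_candidates(src_tokens, tgt_tokens, links):
    """Return (source phrase, target word) pairs in which two or more
    contiguous source tokens are linked to the same target token.

    `links` is an iterable of (source_index, target_index) pairs.
    This many-to-one heuristic is an assumption of the sketch.
    """
    # Group source positions by the target position they are linked to.
    sources_by_target = defaultdict(set)
    for src_idx, tgt_idx in links:
        sources_by_target[tgt_idx].add(src_idx)

    candidates = []
    for tgt_idx in sorted(sources_by_target):
        src_idxs = sorted(sources_by_target[tgt_idx])
        if len(src_idxs) < 2:
            continue  # one-to-one link: not a multiword candidate
        if src_idxs[-1] - src_idxs[0] != len(src_idxs) - 1:
            continue  # skip discontiguous source sequences
        phrase = " ".join(src_tokens[i] for i in src_idxs)
        candidates.append((phrase, tgt_tokens[tgt_idx]))
    return candidates


# Hand-made English-Portuguese example (illustrative, not corpus data).
src = ["they", "clean", "up", "the", "room"]
tgt = ["eles", "limpam", "a", "sala"]
links = [(0, 0), (1, 1), (2, 1), (3, 2), (4, 3)]

print(mwe_candidates(src, tgt, links))
```

Each candidate is paired with the target word it aligns to, which is one simple way an alignment can attach a rough meaning to an extracted expression.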

Notes

  1. Pesquisa FAPESP is available at http://www.revistapesquisa.fapesp.br.

  2. Apertium is an open-source machine translation engine and toolbox available at: http://www.apertium.org.

  3. http://www-igm.univ-mlv.fr/~unitex/.

  4. For example: “artesian wells”, “black hole” and “botanical gardens” are found in CIDE, “clean up”, “consist of” and “depend on” are found in CIDPV.

  5. Evert and Krenn (2005) give a detailed description of standard measures and their application to MWE identification, and more material may also be found on http://www.collocations.de.

References

  • Armentano-Oller, C., Carrasco, R. C., Corbí-Bellot, A. M., Forcada, M. L., Ginestí-Rosell, M., Ortiz-Rojas, S., Pérez-Ortiz, J. A., Ramírez-Sánchez, G., Sánchez-Martínez, F., & Scalco, M. A. (2006). Open-source Portuguese–Spanish machine translation. In Proceedings of the VII Encontro para o Processamento Computacional da Língua Portuguesa Escrita e Falada (PROPOR-2006), Itatiaia-RJ, Brazil (pp. 50–59).

  • Baldwin, T., & Villavicencio, A. (2002). Extracting the unextractable: A case study on verb–particles. In Proceedings of the 6th conference on natural language learning (CoNLL-2002), Taipei, Taiwan.

  • Briscoe, T., & Carroll, J. (2002). Robust accurate statistical annotation of general text. In Proceedings of LREC-2002.

  • Brown, P., Della-Pietra, V., Della-Pietra, S., & Mercer, R. (1993). The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics 19(2), 263–312.

  • Burnard, L. (2000). User Reference Guide for the British National Corpus. Technical report. Oxford, UK: Oxford University Computing Services.

  • Carletta, J. (1996). Assessing agreement on classification tasks: The kappa statistic. Computational Linguistics 22(2), 249–254.

  • Caseli, H. M., Nunes, M. G. V., & Forcada, M. L. (2006). Automatic induction of bilingual resources from aligned parallel corpora: Application to shallow-transfer machine translation. Machine Translation 20, 227–245.

  • Caseli, H. M., Silva, A. M. P., & Nunes, M. G. V. (2004). Evaluation of methods for sentence and lexical alignment of Brazilian Portuguese and English parallel texts. In Proceedings of the SBIA 2004 (LNAI), Berlin, Heidelberg (pp. 184–193).

  • Evert, S., & Krenn, B. (2005). Using small random samples for the manual evaluation of statistical association measures. Computer Speech and Language 19(4), 450–466.

  • Fazly, A., & Stevenson, S. (2007). Distinguishing subtypes of multiword expressions using linguistically-motivated statistical measures. In Proceedings of the workshop on a broader perspective on multiword expressions, Prague (pp. 9–16).

  • Hofland, K. (1996). A program for aligning English and Norwegian sentences. In S. Hockey, N. Ide, & G. Perissinotto (Eds.), Research in humanities computing (pp. 165–178). Oxford: Oxford University Press.

  • Jackendoff, R. (1997). ‘Twistin’ the night away. Language 73, 534–559.

  • Melamed, I. D. (1997). Automatic discovery of non-compositional compounds in parallel data. eprint arXiv:cmp-lg/9706027.

  • Och, F. J., & Ney, H. (2000a). A comparison of alignment models for statistical machine translation. In Proceedings of the 18th international conference on computational linguistics (COLING-2000), Saarbrücken, Germany (pp. 1086–1090).

  • Och, F. J., & Ney, H. (2000b). Improved statistical alignment models. In Proceedings of the 38th annual meeting of the ACL, Hong Kong, China (pp. 440–447).

  • Och, F. J., & Ney, H. (2003). A systematic comparison of various statistical alignment models. Computational Linguistics 29(1), 19–51.

  • Pearce, D. (2002). A comparative evaluation of collocation extraction techniques. In Proceedings of the third international conference on language resources and evaluation, Las Palmas, Canary Islands, Spain (pp. 1–7).

  • Piao, S. S. L., Sun, G., Rayson, P., & Yuan, Q. (2006). Automatic extraction of Chinese multiword expressions with a statistical tool. In Proceedings of the workshop on multi-word-expressions in a multilingual context (EACL-2006), Trento, Italy (pp. 17–24).

  • Procter, P. (1995). Cambridge international dictionary of English. Cambridge: Cambridge University Press.

  • Sag, I. A., Baldwin, T., Bond, F., Copestake, A., & Flickinger, D. (2002). Multiword expressions: A pain in the neck for NLP. In Proceedings of the third international conference on computational linguistics and intelligent text processing (CICLing-2002), Lecture Notes in Computer Science, London, UK, Vol. 2276 (pp. 1–15).

  • Van de Cruys, T., & Villada Moirón, B. (2007). Semantics-based multiword expression extraction. In Proceedings of the workshop on a broader perspective on multiword expressions, Prague (pp. 25–32).

  • Villada Moirón, B., & Tiedemann, J. (2006). Identifying idiomatic expressions using automatic word-alignment. In Proceedings of the workshop on multi-word-expressions in a multilingual context (EACL-2006), Trento, Italy (pp. 33–40).

  • Villavicencio, A. (2005). The availability of verb–particle constructions in lexical resources: How much is enough? Journal of Computer Speech and Language Processing 19, 415–432.

  • Villavicencio, A., Kordoni, V., Zhang, Y., Idiart, M., & Ramisch, C. (2007). Validation and evaluation of automatically acquired multiword expressions for grammar engineering. In Proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning, Prague (pp. 1034–1043).

  • Vogel, S., Ney, H., & Tillmann, C. (1996). HMM-based word alignment in statistical translation. In Proceedings of the 16th international conference on computational linguistics (COLING-1996), Copenhagen (pp. 836–841).

  • Zhang, Y., Kordoni, V., Villavicencio, A., & Idiart, M. (2006). Automated multiword expression prediction for grammar engineering. In Proceedings of the workshop on multiword expressions: Identifying and exploiting underlying properties, Sydney, Australia (pp. 36–44).

Acknowledgements

We gratefully acknowledge the financial support of the Brazilian agencies FAPESP (02/13207-8), CNPq (550388/2005-2), SEBRAE/FINEP (1194/07) and CAPES (CAPES/COFECUB 548/07). We also thank Mônica Saddy Martins for helping with the evaluation process, and the anonymous reviewers for their useful comments.

Author information

Corresponding author

Correspondence to Aline Villavicencio.

About this article

Cite this article

de Caseli, H.M., Ramisch, C., das Graças Volpe Nunes, M. et al. Alignment-based extraction of multiword expressions. Lang Resources & Evaluation 44, 59–77 (2010). https://doi.org/10.1007/s10579-009-9097-9

  • DOI: https://doi.org/10.1007/s10579-009-9097-9
