
Joining Forces for Multiword Expression Identification


Part of the Lecture Notes in Computer Science book series (LNAI, volume 9727)


Multiword Expressions (MWEs) display linguistic and statistical markedness that may influence the effectiveness of techniques for identifying them automatically in texts. Parsing-based techniques for MWE identification are considered better at handling long-distance dependencies, passivization and internal modification, whereas statistics-based techniques use association measures to detect statistical markedness regardless of syntactic form. In this paper we compare these two approaches, focusing on nominal compounds in Portuguese. We compare the accuracy of each method and propose combining the strengths of both for increased accuracy.
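As a rough illustration of what an association measure does, the sketch below scores adjacent word pairs by pointwise mutual information (PMI). The choice of PMI, the function name `pmi_scores` and the toy corpus are assumptions made for illustration only; the abstract does not say which measures or tools were used.

```python
import math
from collections import Counter


def pmi_scores(tokens):
    """Score each adjacent word pair by pointwise mutual information.

    Illustrative sketch only: PMI stands in for the unspecified
    association measures mentioned in the abstract.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)

    scores = {}
    for (w1, w2), count in bigrams.items():
        p_pair = count / n_bigrams
        p_w1 = unigrams[w1] / n_unigrams
        p_w2 = unigrams[w2] / n_unigrams
        # PMI compares the observed pair probability with the probability
        # expected if the two words occurred independently.
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


# Hypothetical toy corpus; real candidates would come from a large corpus.
toy_tokens = "o buraco negro fica longe de o sol e de o buraco negro".split()
ranked = sorted(pmi_scores(toy_tokens).items(), key=lambda item: -item[1])
```

A measure of this kind looks only at co-occurrence counts, which is why it applies regardless of syntactic form. PMI is known to favour rare pairs, so in practice a frequency threshold is often applied first.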


  • Multiword Expressions
  • Basic Statistical Techniques
  • Statistical Association Measures
  • Long-distance Dependencies
  • Automatic Validation

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

  • DOI: 10.1007/978-3-319-41552-9_23
  • Chapter length: 6 pages


  1. Parsing increased the number of tokens from 49 million to 67 million because the parser expands contractions, such as those combining a preposition and a determiner (e.g. do (of the) is expanded to de + o).

  2. Among the top 100 candidates from each process, some had already been validated in the automatic validation step. Manual validation was applied only to those MWE candidates that had not been prevalidated.
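The effect described in the first note, where expanding contractions raises the token count, can be sketched as follows. The lookup table and the function `expand_contractions` are hypothetical and deliberately partial; they do not reproduce how the parser used in the paper handles contractions.

```python
# Hypothetical, partial table of Portuguese preposition + determiner
# contractions; a real parser would cover many more forms and use context.
CONTRACTIONS = {
    "do": ["de", "o"],
    "da": ["de", "a"],
    "dos": ["de", "os"],
    "das": ["de", "as"],
    "no": ["em", "o"],
    "na": ["em", "a"],
}


def expand_contractions(tokens):
    """Replace each known contraction with its two component tokens."""
    expanded = []
    for token in tokens:
        # Unknown tokens are kept unchanged.
        expanded.extend(CONTRACTIONS.get(token.lower(), [token]))
    return expanded


before = "o fim do mundo".split()
after = expand_contractions(before)
# Every expanded contraction adds one token to the count.
added_tokens = len(after) - len(before)
```

Because each contraction becomes two tokens, a corpus with many such forms grows noticeably after parsing, which is consistent with the increase reported in the note.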




This research was partially developed in the context of the project Text Simplification of Complex Expressions, sponsored by Samsung Eletrônica da Amazônia Ltda., in the terms of the Brazilian law n. 8.248/91. This work was also partly supported by CNPq (482520/2012-4, 312114/2015-0) and FAPERGS.


Corresponding author

Correspondence to Rodrigo Wilkens.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Zilio, L., Wilkens, R., Möllmann, L., Wehrli, E., Cordeiro, S., Villavicencio, A. (2016). Joining Forces for Multiword Expression Identification. In: Silva, J., Ribeiro, R., Quaresma, P., Adami, A., Branco, A. (eds) Computational Processing of the Portuguese Language. PROPOR 2016. Lecture Notes in Computer Science, vol 9727. Springer, Cham.

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-41551-2

  • Online ISBN: 978-3-319-41552-9

  • eBook Packages: Computer Science, Computer Science (R0)