Using a Database of Multiword Expressions in Dependency Parsing

  • Conference paper
Text, Speech, and Dialogue (TSD 2019)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11697)

Abstract

Identifying and correctly handling multiword expressions (MWEs) is critical both for understanding a language system and for building NLP tools that function properly. This paper presents a database of MWEs that we are building for Czech; it currently contains more than 7,000 entries. The database records detailed properties of each MWE, such as its idiomaticity and variability, together with manually verified dependency structures. We show one possible use of the database: identifying and correcting parsing errors in sentences that contain MWEs.
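
The use case named in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the `MweEntry` record, the function names, the contiguous fixed-order lemma matching, and the toy entry for zaplést se do intrik are all assumptions made for illustration, and the head indices are illustrative rather than taken from the database. The idea is only that, once an MWE is located in a sentence, the parser's internal attachments can be compared with a verified structure and overwritten where they differ.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class MweEntry:
    """Hypothetical database record (not the real schema)."""
    lemmas: tuple[str, ...]
    # heads[i] is the MWE-internal index of the head of token i.
    # None marks the token whose head lies outside the MWE.
    heads: tuple[int | None, ...]


def find_span(sentence_lemmas: list[str], entry: MweEntry) -> int | None:
    """Return the start index of a contiguous lemma match, or None.

    Assumption: the MWE occurs contiguously and in fixed order.
    Variable MWEs would need a more flexible matcher.
    """
    n = len(entry.lemmas)
    for start in range(len(sentence_lemmas) - n + 1):
        if tuple(sentence_lemmas[start:start + n]) == entry.lemmas:
            return start
    return None


def correct_mwe_heads(
    sentence_lemmas: list[str],
    parsed_heads: list[int],
    entry: MweEntry,
) -> tuple[list[int], list[int]]:
    """Compare parser output with the stored structure.

    parsed_heads uses 0-based token indices, with -1 for the sentence root.
    Returns the corrected head list and the positions that were changed.
    """
    corrected = list(parsed_heads)
    changed: list[int] = []
    start = find_span(sentence_lemmas, entry)
    if start is None:
        return corrected, changed
    for offset, rel_head in enumerate(entry.heads):
        if rel_head is None:
            continue  # external attachment is left to the parser
        position = start + offset
        expected = start + rel_head
        if corrected[position] != expected:
            corrected[position] = expected
            changed.append(position)
    return corrected, changed


# Toy entry: the preposition "do" heads the noun, the verb heads the rest.
entry = MweEntry(lemmas=("zaplést", "se", "do", "intrika"),
                 heads=(None, 0, 0, 2))

sentence = ["chtít", "zaplést", "se", "do", "intrika"]
# Simulated parser output: "intrika" is wrongly attached to the verb.
parsed = [-1, 0, 1, 1, 1]

corrected, changed = correct_mwe_heads(sentence, parsed, entry)
print(corrected, changed)
```

In this toy run only the attachment of the last token differs from the stored structure, so it is the only position reported as changed. A realistic version would also compare dependency labels and handle inflected, discontinuous, or reordered occurrences.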

Notes

  1. The annotation manual of the PDT states: abstract or metaphorical meanings tend to be determined as Obj (see zaplést se do intrik ‘to get involved in intrigues’: this is not a location). https://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/a-layer/html/ch03s02x05.html.

  2. See the annotation manual of the PDT: There exist, however, collocations the syntactic structure of which is so doubtful and unclear that it is impossible to propose a reasonable syntactic representation. In such cases we take recourse to a technical representation. This is really the last resort, used only in cases when all other attempts at a meaningful representation fail. https://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/a-layer/html/ch03s06x23.html.

  3. See also http://wiki.korpus.cz/doku.php/en:cnk:syn.

  4. See also http://wiki.korpus.cz/doku.php/en:cnk:syn.

  5. In PDT, the conjunction jako ‘as’ can be assigned two significantly different structures depending on its semantics, see https://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/a-layer/html/ch03s06x26.htm and https://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/a-layer/html/ch03s02x07.html#auxyjakol. This is confusing for the parser.

Acknowledgments

This paper, the creation of the database, and the experiments on which the paper is based were supported by the Ministry of Education of the Czech Republic through the project Czech National Corpus (no. LM2015044) and by the Grant Agency of the Czech Republic through grant 16-07473S (Between lexicon and grammar).

Author information

Corresponding author

Correspondence to Tomáš Jelínek.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Jelínek, T. (2019). Using a Database of Multiword Expressions in Dependency Parsing. In: Ekštein, K. (ed.) Text, Speech, and Dialogue. TSD 2019. Lecture Notes in Computer Science, vol 11697. Springer, Cham. https://doi.org/10.1007/978-3-030-27947-9_2

  • DOI: https://doi.org/10.1007/978-3-030-27947-9_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-27946-2

  • Online ISBN: 978-3-030-27947-9

  • eBook Packages: Computer Science, Computer Science (R0)
