Abstract
Identifying and correctly handling multiword expressions (MWEs) is critical both for understanding a language system and for building properly functioning NLP tools. This paper presents a database of MWEs that we have built for Czech, which currently contains more than 7,000 entries. Each entry records detailed information about the properties of the MWE, e.g. its idiomaticity and variability. The database also contains manually verified dependency structures of MWEs. We show one possible use of the database: identifying and correcting parsing errors in sentences containing MWEs.
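As a rough illustration of how verified dependency structures could be used to spot and repair parser errors, the following minimal Python sketch compares the arcs a parser produced inside an MWE with the arcs stored for that MWE. It is not the procedure used in the paper: the `Token` and `MweEntry` types, the `correct_parse` function, and the example entry (with PDT-style labels) are all hypothetical, and the matching assumes a contiguous, fixed-order occurrence, which ignores the variability the database records.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Token:
    lemma: str
    head: int    # index of the governing token in the sentence, -1 for the root
    deprel: str  # dependency label assigned by the parser


@dataclass
class MweEntry:
    """Hypothetical database entry; the real schema is not given in the text."""
    lemmas: List[str]
    # For each component: index (within the MWE) of its verified head,
    # or None if the component attaches outside the MWE.
    heads: List[Optional[int]]
    # Verified label of each internal arc (None for the external attachment).
    deprels: List[Optional[str]]


def find_occurrences(sentence: List[Token], entry: MweEntry) -> List[int]:
    """Start indices of contiguous lemma matches (a deliberate simplification)."""
    n = len(entry.lemmas)
    return [
        i for i in range(len(sentence) - n + 1)
        if [t.lemma for t in sentence[i:i + n]] == entry.lemmas
    ]


def correct_parse(sentence: List[Token], entry: MweEntry) -> List[Tuple[int, int, int]]:
    """Overwrite MWE-internal arcs that differ from the verified structure.

    Returns (token_index, old_head, new_head) for every token that was changed.
    """
    fixes = []
    for start in find_occurrences(sentence, entry):
        for offset, rel_head in enumerate(entry.heads):
            if rel_head is None:
                continue  # external attachment: left to the parser
            tok = sentence[start + offset]
            expected_head = start + rel_head
            expected_label = entry.deprels[offset]
            if tok.head != expected_head or tok.deprel != expected_label:
                fixes.append((start + offset, tok.head, expected_head))
                tok.head, tok.deprel = expected_head, expected_label
    return fixes


# Illustrative entry for "zaplést se do intrik"; the structure is an assumption.
entry = MweEntry(
    lemmas=["zaplést", "se", "do", "intrika"],
    heads=[None, 0, 0, 2],
    deprels=[None, "AuxT", "AuxP", "Obj"],
)
# Simulated parser output in which the noun is wrongly attached to the verb.
sentence = [
    Token("zaplést", -1, "Pred"),
    Token("se", 0, "AuxT"),
    Token("do", 0, "AuxP"),
    Token("intrika", 0, "Adv"),
]
print(correct_parse(sentence, entry))  # prints [(3, 0, 2)]
```

Overwriting only the MWE-internal arcs keeps the parser's decision about how the expression attaches to the rest of the sentence, which a lexicon entry alone cannot determine.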
Notes
- 1. The annotation manual of the PDT states: abstract or metaphorical meanings tend to be determined as Obj (see zaplést se do intrik ‘to get involved in intrigues’: this is not a location). https://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/a-layer/html/ch03s02x05.html.
- 2. See the annotation manual of the PDT: There exist, however, collocations the syntactic structure of which is so doubtful and unclear that it is impossible to propose a reasonable syntactic representation. In such cases we take recourse to a technical representation. This is really the last resort, used only in cases when all other attempts at a meaningful representation fail. https://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/a-layer/html/ch03s06x23.html.
- 3.
- 4.
- 5. In PDT, the conjunction jako ‘as’ can be assigned two significantly different structures depending on its semantics; see https://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/a-layer/html/ch03s06x26.htm and https://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/a-layer/html/ch03s02x07.html#auxyjakol. This is confusing for the parser.
Acknowledgments
This paper, the creation of the database and the experiments on which the paper is based have been supported by the Ministry of Education of the Czech Republic, through the project Czech National Corpus, no. LM2015044 and by the Grant Agency of the Czech Republic through the grant 16-07473S (Between lexicon and grammar).
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Jelínek, T. (2019). Using a Database of Multiword Expressions in Dependency Parsing. In: Ekštein, K. (ed.) Text, Speech, and Dialogue. TSD 2019. Lecture Notes in Computer Science, vol 11697. Springer, Cham. https://doi.org/10.1007/978-3-030-27947-9_2
DOI: https://doi.org/10.1007/978-3-030-27947-9_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-27946-2
Online ISBN: 978-3-030-27947-9
eBook Packages: Computer Science (R0)