Language Resources and Evaluation, Volume 44, Issue 1–2, pp 7–21

Annotation of multiword expressions in the Prague dependency treebank

  • Eduard Bejček
  • Pavel Straňák


We describe the annotation of multiword expressions (MWEs) in the Prague dependency treebank, which is supported by several automatic pre-annotation steps. Representations of the MWEs are stored in a dictionary as subtrees of the treebank's tectogrammatical tree structures, and these subtrees are used to pre-annotate subsequent occurrences automatically. We also show a way to measure the reliability of this type of annotation.


Multiword expressions, Treebanks, Annotation, Inter-annotator agreement, Named entities



This work has been supported by grants 1ET201120505 and 1ET100300517 of the Grant Agency of the Academy of Science of the Czech Republic, by projects MSM0021620838 and LC536 of the Ministry of Education and 201/05/H014 of the Czech Science Foundation, and by grant GAUK 4307/2009 of the Grant Agency of Charles University in Prague.



Copyright information

© Springer Science+Business Media B.V. 2009

Authors and Affiliations

  1. Institute of Formal and Applied Linguistics, Charles University in Prague, Prague, Czech Republic
