International Workshop on Systems and Frameworks for Computational Morphology

Systems and Frameworks for Computational Morphology, pp. 1–26

Morphology Within the Multi-layered Annotation Scenario of the Prague Dependency Treebank

Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 537)

Abstract

Morphological annotation constitutes a separate layer in the multi-layered annotation scenario of the Prague Dependency Treebank. At this layer, morphological categories expressed by a word form are captured in a positional part-of-speech tag. According to the Praguian approach based on the relation between form and function, functions (meanings) of morphological categories are represented as well, namely as grammateme attributes at the deep-syntactic (tectogrammatical) layer of the treebank.

In the present paper, we first describe the role of morphology in the Prague Dependency Treebank and then outline several recent topics based on Praguian morphology: named entity recognition in Czech, formeme attributes encoding morpho-syntactic information in a dependency-based machine translation system, and the development of a lexical database of derivational relations based partially on information provided by the morphological analyser.

Keywords

Annotation; Deep syntax; Lemma; Morphology; Multi-layered scenario; Part-of-speech tag; Surface syntax; Tagging

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics, Charles University in Prague, Prague, Czech Republic
