Language Resources and Evaluation, Volume 48, Issue 4, pp 601–637

HamleDT: Harmonized multi-language dependency treebank

  • Daniel Zeman
  • Ondřej Dušek
  • David Mareček
  • Martin Popel
  • Loganathan Ramasamy
  • Jan Štěpánek
  • Zdeněk Žabokrtský
  • Jan Hajič
Original Paper

Abstract

We present HamleDT—a HArmonized Multi-LanguagE Dependency Treebank. HamleDT is a compilation of existing dependency treebanks (or dependency conversions of other treebanks), transformed so that they all conform to the same annotation style. In the present article, we provide a thorough investigation and discussion of a number of phenomena that are comparable across languages, though their annotation in treebanks often differs. We claim that transformation procedures can be designed to automatically identify most such phenomena and convert them to a unified annotation style. This unification is beneficial both to comparative corpus linguistics and to machine learning of syntactic parsing.

Keywords

Dependency treebank, Annotation scheme, Harmonization

Copyright information

© Springer Science+Business Media Dordrecht 2014

Authors and Affiliations

  • Daniel Zeman (1)
  • Ondřej Dušek (1)
  • David Mareček (1)
  • Martin Popel (1)
  • Loganathan Ramasamy (1)
  • Jan Štěpánek (1)
  • Zdeněk Žabokrtský (1)
  • Jan Hajič (1)

  1. Faculty of Mathematics and Physics, ÚFAL, Charles University in Prague, Prague, Czech Republic