The PROIEL treebank family: a standard for early attestations of Indo-European languages

  • Hanne Eckhoff
  • Kristin Bech
  • Gerlof Bouma
  • Kristine Eide
  • Dag Haug
  • Odd Einar Haugen
  • Marius Jøhndal
Original Paper

Abstract

This article describes a family of dependency treebanks of early attestations of Indo-European languages. The family originates in the parallel treebank built by the members of the project Pragmatic Resources in Old Indo-European Languages (PROIEL). The treebanks share a set of open-source software tools, including a web annotation interface, and a set of annotation schemes and guidelines developed especially for the project languages. They use an enriched dependency grammar scheme complemented by detailed morphological tags, which have proved sufficient to give detailed descriptions of these richly inflected languages, and which have been easy to adapt to new languages. We describe the tools and annotation schemes and discuss some challenges posed by the various languages that have been annotated. We also discuss problems with tokenisation, sentence division and lemmatisation, which are commonly encountered in ancient and mediaeval texts, as well as challenges associated with low levels of standardisation and ongoing morphological and syntactic change.

Keywords

Treebank, Dependency grammar, Indo-European, Greek, Latin, Romance, Germanic, Slavic, Armenian

Copyright information

© Springer Science+Business Media Dordrecht 2017

Authors and Affiliations

  1. UiT The Arctic University of Norway, Tromsø, Norway
  2. University of Oslo, Oslo, Norway
  3. University of Gothenburg, Gothenburg, Sweden
  4. University of Bergen, Bergen, Norway
