The PROIEL treebank family: a standard for early attestations of Indo-European languages

Abstract

This article describes a family of dependency treebanks of early attestations of Indo-European languages, originating in the parallel treebank built by the members of the project Pragmatic Resources in Old Indo-European Languages (PROIEL). The treebanks all share a set of open-source software tools, including a web annotation interface, and a set of annotation schemes and guidelines developed especially for the project languages. The treebanks use an enriched dependency grammar scheme complemented by detailed morphological tags, which have proved sufficient to give detailed descriptions of these richly inflected languages and have been easy to adapt to new languages. We describe the tools and annotation schemes and discuss some challenges posed by the various languages that have been annotated. We also discuss problems with tokenisation, sentence division and lemmatisation, which are commonly encountered in ancient and mediaeval texts, and challenges associated with low levels of standardisation and ongoing morphological and syntactic change.

Notes

  1. Work by Gerlof Bouma was supported by the Marcus and Amalia Wallenberg Foundation (MAW 2012.0146: MAÞiR).

  2. http://www.hf.uio.no/ifikk/english/research/projects/proiel/, http://proiel.github.io/.

  3. http://foni.uio.no:3000

  4. Information Structure and Word Order Change in Germanic and Romance Languages, http://www.hf.uio.no/ilos/english/research/projects/iswoc/.

  5. http://iswoc.github.io/, also hosted with the PROIEL treebank at http://foni.uio.no:3000.

  6. The project has also made use of Menotec’s Old Norwegian treebank and PROIEL’s Gothic treebank.

  7. http://torottreebank.github.io/, https://nestor.uit.no.

  8. http://www.menota.org/menotec.xml.

  9. Hosted with the PROIEL treebank at http://foni.uio.no:3000 and also accessible through the INESS portal at http://clarino.uib.no/iness (select the treebanks for Old Norse).

  10. http://www.menota.org.

  11. http://bragi.info/greinir/.

  12. https://spraakbanken.gu.se/mathir.

  13. http://proiel.github.io/framework.

  14. The reviewers have generally been senior project members or very experienced annotators. The number of corrections made by reviewers varies considerably depending on the accuracy and experience of the annotator as well as on the complexity of the text. For instance, the PROIEL New Testament texts generally have few corrections, both because this text is extremely well supported by translations and exegesis and because analyses could be compared across languages during annotation (0.1–3.5% of the tokens were corrected for morphology or lemmatisation errors, and 1.5–11.8% of the sentences were corrected for syntactic attachment or label errors). More complicated and less well-supported texts had considerably more corrections. For instance, Herodotus’ Histories (Ancient Greek, PROIEL) had 9% of its tokens corrected for morphology or lemmatisation, while 65.5% of its sentences were syntactically corrected. Very similarly, the Russkaja pravda (Old East Slavic, TOROT) had 10.8% of its tokens corrected for morphology or lemmatisation, and 65.1% of its sentences were syntactically corrected.

  15. Currently, the PROIEL, Menotec, ISWOC and TOROT treebanks are also available for syntactic query in the INESS treebank facility, http://clarino.uib.no/iness/page.

  16. All examples are given with a text reference and, if they are publicly available, a sentence ID in the relevant treebank.

  17. https://github.com/morphgnt.

  18. http://www.wulfila.be/gothic/.

  19. A further 7.5% of the tokens had off-by-one errors, i.e. only one of the ten morphological fields had the wrong value.

  20. For a fuller description of the scheme, see Haug et al. (2009). For exhaustive documentation of the scheme, see the PROIEL guidelines for syntactic annotation (http://folk.uio.no/daghaug/syntactic_guidelines.pdf). For documentation of the application of the scheme to Slavic, see the TOROT guidelines (http://folk.uio.no/hanneme/torot.pdf). For documentation of the application of the scheme to Old Norwegian, see Haugen and Øverland (2014). For documentation of the application of the scheme to Old English, see http://folk.uio.no/krisbec/OE_guidelines.pdf.

  21. https://ufal.mff.cuni.cz/pdt2.0/.

  22. https://perseusdl.github.io/treebank_data/.

  23. http://itreebank.marginalia.it/.

  24. http://ufal.mff.cuni.cz/project/pdt2.0/doc/manuals/en/t-layer/html/index.html.

  25. For a discussion of dependency parsing with empty nodes, see Seeker et al. (2012). For an experiment using MaltParser on OCS data from TOROT, see Berdičevskis (2015); for a pre-parsing experiment on TOROT data, see Eckhoff and Berdičevskis (2016).

  26. For further discussion and motivation of the differences between the two schemes, see Haug and Jøhndal (2008) and Haug et al. (2009).

  27. In PDT-style treebanks, the dependent infinitive would be analysed as a subject. In the PROIEL scheme, argument infinitives are never analysed as SUBs unless they are nominalised by means of a definite article, since such structures are often ambiguous. Instead, they are analysed as COMP or XOBJ depending on whether they have an external subject.

  28. Note that the secondary dependency from on ‘in’ to synt ‘are’ indicates that the dependent shares its subject with its head verb, but that this subject is not overtly expressed. Moreover, the dependent’s subject may be any (non-overt) argument of the verb.

  29. Note that the APOS relation label is used not only for the usual type of nominal appositions, but also to mark non-restrictive relative clauses, as seen in the tree for example (11). Restrictive relative clauses are ATR dependents of their antecedents. In this respect the PROIEL scheme differs from the annotation in PDT-style treebanks, where restrictive and non-restrictive relative clauses are not distinguished.

  30. As seen in the tree for example (12), the relation XOBJ is also used for nominal predicates in copular constructions. The reasoning is that nominal predicates are arguments of the copula and have external subjects which are identical with the copula’s subject. In this respect the PROIEL scheme deviates from the PDT scheme, which has a separate label PNOM for nominal predicates. PNOMs are also deemed to be dependents of the copula, but there is no direct indication of the external subject.

  31. Note that, unlike in the PDT-based schemes, subjunctions are given the dependency label of the whole subordinate clause (in this case COMP). This is a general principle in the PROIEL scheme: the head of a subtree should always carry the relation label of the whole subtree, regardless of its form. In the PDT-based schemes, the subjunction is also the head of the subordinate clause, but it is labelled AuxC, and its dependent verb carries the relation label of the whole subordinate clause.

  32. An additional confounding factor is that conditional clauses, a common environment for the indefinite pronoun čьto, often lack subordinators in Middle Russian texts, which can cause even more ambiguities.

  33. Example (20a) also illustrates how predicate identity (PID) and shared dependents can be indicated by way of secondary dependencies.

  34. The Penn Parsed Corpus of Historical Greek (PPCHiG) used a constituency-based annotation, but is no longer in active development.

  35. https://github.com/biblicalhumanities/greek-new-testament/tree/master/syntax-trees/sblgnt.

  36. http://www.dh.uni-leipzig.de/wo/projects/ancient-greek-and-latin-dependency-treebank-2-0/.

  37. http://www.sfs.uni-tuebingen.de/en/ascl/resources/corpora/index-thomisticus-treebank.html.

  38. Note that AGDT also annotates punctuation.

  39. http://universaldependencies.org/.

References

  1. Adesam, Y., & Bouma, G. (2016). Part-of-speech tagging Old Swedish. In Proc of language technology for cultural heritage, social sciences, and humanities. Berlin.

  2. Andrews, A. D. (1971). Case agreement of predicate modifiers in Ancient Greek. Linguistic Inquiry, 2(2), 127–151.

  3. Andrews, A. D. (1982). Long distance agreement in Modern Icelandic. In P. Jacobson & G. K. Pullum (Eds.), The nature of syntactic representation (pp. 1–33). Dordrecht: D. Reidel.

  4. Bamman, D., Crane, G., Passarotti, M., & Raynaud, S. (2007). Guidelines for the syntactic annotation of Latin treebanks. Technical report. Boston: Tufts Digital Library.

  5. Berdičevskis, A. (2015). Estimating grammeme redundancy by measuring their importance for syntactic parser performance. In Proceedings of the 6th workshop on cognitive aspects of computational language learning (pp. 65–73). Association for Computational Linguistics.

  6. Berdičevskis, A., Eckhoff, H., & Gavrilova, T. (2016). The beginning of a beautiful friendship: Rule-based and statistical analysis of Middle Russian. In Computational linguistics and intellectual technologies. Proceedings of Dialogue 16. Moscow.

  7. Birnbaum, D., & Eckhoff, H. Machine-assisted multilingual alignment of the Codex Suprasliensis (manuscript).

  8. Bouma, G., & Adesam, Y. (2013). Experiments on sentence segmentation in Old Swedish editions. In Þ. Eyþórsson, L. Borin, D. Haug & E. Rögnvaldsson (Eds.), Proceedings of the workshop on computational historical linguistics at NODALIDA 2013 (pp. 11–26). Oslo. http://www.ep.liu.se/ecp/article.asp?issue=87&article=2.

  9. Brants, T. (2000). TnT—A statistical part-of-speech tagger. In Proceedings of the 6th Applied Natural Language Processing Conference ANLP-2000. Seattle, WA.

  10. Delsing, L.-O. (2002). Fornsvenska textbanken. In S. Lagman, S. Ö. Ohlsson & V. Voodla (Eds.), Svenska språkets historia i Östersjöområdet (pp. 149–156). Tartu.

  11. Eckhoff, H. M. (2011). Old Russian possessive constructions: A construction grammar approach. Berlin: Mouton de Gruyter.

  12. Eckhoff, H. M. (2015). Animacy and differential object marking in Old Church Slavonic. Russian Linguistics, 39(2), 233–254.

  13. Eckhoff, H., & Berdičevskis, A. (2015). Linguistics vs. digital editions: The Tromsø Old Russian and OCS Treebank. Scripta and e-Scripta, 14–15, 9–25.

  14. Eckhoff, H., & Berdičevskis, A. (2016). Automatic parsing as an efficient pre-annotation tool for historical texts. In Proceedings of the Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH, held in conjunction with COLING). http://de.clarin.eu/en/current-issues/lt4dh/lt4dh-proceedings.

  15. Eckhoff, H., & Haug, D. (2015). Aspect and prefixation in Old Church Slavonic. Diachronica, 32(2), 186–230.

  16. Fort, K., & Sagot, B. (2010). Influence of pre-annotation on POS-tagged corpus development. In Proceedings of the 4th linguistic annotation workshop (pp. 56–63). Uppsala: ACL.

  17. Haug, D. (2011). From dependency structures to LFG representations. In M. Butt & T. Holloway King (Eds.), Proceedings of LFG12 (pp. 271–291). Stanford: CSLI Publications.

  18. Haug, D., Eckhoff, H., & Welo, E. (2014). The theoretical foundations of givenness annotation. In K. Bech & K. Eide (Eds.), Information structure and syntactic change in Germanic and Romance languages. Amsterdam: John Benjamins.

  19. Haug, D. T. T., & Jøhndal, M. L. (2008). Creating a parallel treebank of the old Indo-European Bible translations. In Proceedings of the 6th international language resources and evaluation (LREC’08). European Language Resources Association (ELRA).

  20. Haug, D. T. T., Jøhndal, M., Eckhoff, H. M., Welo, E., Hertzenberg, M. J. B., & Müth, A. (2009). Computational and linguistic issues in designing a syntactically annotated parallel corpus of Indo-European languages. Traitement Automatique des Langues, 50, 17–45.

  21. Haugen, O. E., & Øverland, F. Th. (2014). Guidelines for morphological and syntactic annotation of Old Norwegian texts. Bergen Language and Linguistics Studies 4(2). https://bells.uib.no/bells/issue/view/158.

  22. Haugland, K. E. (2007). Old English impersonal constructions and the use and non-use of nonreferential pronouns. Ph.D. dissertation, University of Bergen.

  23. Hertzenberg, M. J. B. (2011). Classical and Romance usages of ipse in the Vulgate. Oslo Studies in Language (OSLa), 3(3), 173–188.

  24. Hertzenberg, M. J. B. (2014). “The valley” or “that valley”? Ille and ipse in the Itinerarium Egeriae. In P. Molinelli, P. Cuzzolin, & C. Fedriani (Eds.), Latin vulgaire – Latin tardif X. Actes du Xe colloque international sur le latin vulgaire et tardif. Bergamo, 5–9 septembre 2012. Bergamo: Sestante edizioni.

  25. Jøhndal, M. (2012). Non-finiteness in Latin. Ph.D. dissertation, University of Cambridge.

  26. König, E., & Lezius, W. (2003). The TIGER language – A description language for syntax graphs, formal definition. Technical report. IMS, University of Stuttgart.

  27. Lee, J., & Haug, D. (2010). Porting an Ancient Greek and Latin treebank. In Proc. conference on language resources and evaluation (LREC).

  28. Lindberg, R. (2013). Definiteness in Old Church Slavonic: A study of how long and short form in adjectives reflect information status. Master’s thesis, University of Oslo.

  29. Mitchell, B. (1985). Old English syntax (Vol. 2). Oxford: Clarendon.

  30. Müth, A. (2015). Indefiniteness, animacy and object marking: A quantitative study based on the Classical Armenian Gospel translation. Ph.D. thesis, University of Oslo.

  31. Seeker, W., Farkas, R., Bohnet, B., Schmid, H., & Kuhn, J. (2012). Data-driven dependency parsing with empty heads. In M. Kay & C. Boitet (Eds.), Proceedings of COLING 2012: Posters (pp. 1081–1090). Mumbai. http://www.aclweb.org/anthology/C12-2105.

  32. Skjærholt, A. (2011). More, faster: Accelerated corpus annotation with statistical taggers. Journal for Language Technology and Computational Linguistics, 26(2), 151–163.

  33. Söderwall, K. F. (1884–1918). Ordbok över svenska medeltids-språket. Samlingar utgivna av Svenska fornskriftsällskapet, Serie 1, Svenska skrifter 27.

  34. Traugott, E. C. (1992). Syntax. In R. M. Hogg (Ed.), The Cambridge history of the English language, vol. 1: The beginnings to 1066 (pp. 168–289). Cambridge: Cambridge University Press.

Author information

Corresponding author

Correspondence to Hanne Eckhoff.

Cite this article

Eckhoff, H., Bech, K., Bouma, G. et al. The PROIEL treebank family: a standard for early attestations of Indo-European languages. Lang Resources & Evaluation 52, 29–65 (2018). https://doi.org/10.1007/s10579-017-9388-5

Keywords

  • Treebank
  • Dependency grammar
  • Indo-European
  • Greek
  • Latin
  • Romance
  • Germanic
  • Slavic
  • Armenian