Creation of an annotated corpus of Old and Middle Hungarian court records and private correspondence

Abstract

The paper introduces a novel annotated corpus of Old and Middle Hungarian (16th–18th centuries), whose texts were selected to approximate the vernacular of the given historical periods as closely as possible. The corpus consists of testimonies of witnesses in trials and samples of private correspondence. The texts are not only analyzed morphologically; each file also contains metadata that facilitates sociolinguistic research. The texts were segmented into clauses, manually normalized, and morphosyntactically annotated using an annotation system consisting of the PurePos PoS tagger and the Hungarian morphological analyzer HuMor, which was originally developed for Modern Hungarian but was adapted to analyze Old and Middle Hungarian morphological constructions. The automatically disambiguated morphological annotation was manually checked and corrected using an easy-to-use web-based manual disambiguation interface. The normalization process and the manual validation of the annotation required extensive teamwork and provided continuous feedback for the refinement of the computational morphology and the iterative retraining of the statistical models of the tagger. The paper discusses some of the typical problems that occurred during the normalization procedure and their tentative solutions. We also describe the automatic annotation tools, the process of semi-automatic disambiguation, and the query interface, a special function of which also makes it possible to correct the annotation. Displaying the original, the normalized, and the parsed versions of the selected texts, the beta version of the first fully normalized and annotated historical corpus of Hungarian is freely accessible at http://tmk.nytud.hu/.

Notes

  1. http://omagyarkorpusz.nytud.hu/en-descr.html.

  2. http://www.helsinki.fi/varieng/CoRD/corpora/CED/index.html.

  3. http://www.helsinki.fi/varieng/domains/CEEC.html.

  4. http://ps.clul.ul.pt.

  5. Although the size of corpora is in general given in tokens, we also provide this data in characters, as this would facilitate the comparison of corpora, the token number being partly determined by language type (synthetic vs. analytic languages). Punctuation marks are also often counted as independent tokens. In our case, the provided token count is the number of analyzed words in the normalized version of the corpus. This also differs from raw word count in the original texts both due to differences in orthography and because it does not include the count of foreign-language (mostly Latin) tokens in the corpus.

  6. A follow-up project of the team was started in September 2015. The focus of this project is corpus-based research on historical morphology, syntax, and, above all, variation. This means that we will mainly exploit the corpus, but part of the resources is allocated to enlarging it further.

  7. We used FineReader, which makes full customization of glyph models possible, including the total exclusion of out-of-the-box models.

  8. I.e., the chance that the next token in the corpus differs from all previous tokens is, and remains, much higher at any corpus size.

  9. In addition, to facilitate text input, we allowed the use of asterisks, which are easier to type; these were subsequently converted to flying accents before morphological analysis.

  10. Measured on newswire text.

  11. This ambiguity is absent from modern standard Hungarian because the passive is no longer used.

  12. Asynchronous JavaScript and XML (Ajax) is a client-side scripting technique in which the browser communicates with a server/database without the need for a complete web page refresh.

  13. Five-fold cross-validation is an evaluation technique in which the corpus is divided into five roughly equal-sized parts. In each of the five rounds of evaluation, four parts are used as the training corpus and the remaining part is used for testing. The results of the five evaluations are averaged.

  14. The following is an example of a regular-expression-based substitution expression that we used to correct word forms and their analyses in which some form of the word gyermek ‘child’ was overnormalized to a corresponding form of gyerek ‘child’:

    #gyerek > gyermek

    /((?:^|\s)\S+erm\S+%=[Gg]yer)(ek\S*\{\{\*gyer)(?=ek)/$1m$2m/.

  15. Manual validation of the annotations was not performed within the framework of the Old Hungarian corpus project, but the disambiguated morphological annotations were taken from the Computational Database for Historical Linguistics (CDHL), see below.

  16. The original morphological annotations in CDHL are encoded in a hard-to-read numerical format, which was occasionally incorrect and often incomplete, lacking some rather relevant distinctions (e.g. infinitives and all types of participles were collapsed into a single category in the original CDHL annotation). Because of this, the original form often needed to be taken into account in addition to the morphological annotation when generating the normalized version of the corpus, and morphological analysis subsequently added the missing morphological features to the annotation automatically.

  17. The CoNLL-U format is generally used to store treebanks containing dependency annotation.

  18. https://www.ling.upenn.edu/hist-corpora/.

  19. http://www.tycho.iel.unicamp.br/corpus/en/index.html.

  20. http://www.rhyddiaithganoloesol.caerdydd.ac.uk/en/.

  21. http://www.voies.uottawa.ca/corpus_pg_en.html.

  22. http://linguist.is/icelandic_treebank/Icelandic_Parsed_Historical_Corpus_(IcePaHC).

  23. http://www.dias.ie/index.php?option=com_content&view=article&id=6586&Itemid=224&lang=en.

  24. https://enhgcorpus.wikispaces.com/.

  25. http://www.ling.upenn.edu/~janabeck/greek-corpora.html.

  26. The set of distinct dependency relations is very much streamlined in that corpus version as well. The annotation of the rather frequent elliptic structures is also very controversial.
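
The evaluation scheme described in note 13 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the evaluation code used for the corpus: the names `k_fold_splits`, `cross_validate`, and `lookup_accuracy`, the toy word/tag pairs, and the contiguous (unshuffled) partitioning are all hypothetical choices.

```python
from statistics import mean


def k_fold_splits(items, k=5):
    """Yield (train, test) pairs; each of the k parts serves as the test set once."""
    n = len(items)
    # Contiguous part boundaries; part sizes differ by at most one item.
    bounds = [i * n // k for i in range(k + 1)]
    for i in range(k):
        lo, hi = bounds[i], bounds[i + 1]
        test = list(items[lo:hi])
        train = list(items[:lo]) + list(items[hi:])
        yield train, test


def cross_validate(items, evaluate, k=5):
    """Average the score returned by evaluate(train, test) over the k rounds."""
    return mean(evaluate(train, test) for train, test in k_fold_splits(items, k))


def lookup_accuracy(train, test):
    """Toy scorer (hypothetical): tag each test word with the tag seen in training."""
    seen = dict(train)
    return mean(1.0 if seen.get(word) == tag else 0.0 for word, tag in test)


if __name__ == "__main__":
    # Made-up (word, tag) pairs, only to make the sketch runnable.
    toy = [("a", "Det"), ("b", "N"), ("c", "V"), ("a", "Det"), ("b", "N"),
           ("c", "V"), ("a", "Det"), ("b", "N"), ("c", "V"), ("d", "Adj")]
    print(cross_validate(toy, lookup_accuracy))
```

Any scoring function that takes a training part and a test part can be passed as `evaluate`; for a tagger it would typically train a model on the first argument and return its accuracy on the second.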

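The substitution quoted in note 14 can be tried out with Python's `re` module. The pattern below is copied from the note, with `$1`/`$2` rewritten as `\1`/`\2`. Everything else is an assumption made for illustration: the function name `restore_gyermek` is hypothetical, and the sample strings only imitate the `original%=normalized{{*lemma` layout that the pattern implies (whatever follows the lemma in the real files is not shown in the note, so a bare `}}` is used as a stand-in).

```python
import re

# Pattern from note 14: an original form containing "erm", followed by "%=",
# a normalized form starting with "gyerek"/"Gyerek", and a lemma starting
# with "gyerek". The lookahead keeps the final "ek" of the lemma out of the match.
GYEREK_TO_GYERMEK = re.compile(
    r"((?:^|\s)\S+erm\S+%=[Gg]yer)(ek\S*\{\{\*gyer)(?=ek)"
)


def restore_gyermek(line: str) -> str:
    """Insert the missing 'm' into the normalized form and into the lemma."""
    return GYEREK_TO_GYERMEK.sub(r"\1m\2m", line)


if __name__ == "__main__":
    # Hypothetical sample lines; the text after the lemma is a stand-in.
    samples = [
        "gyermeket%=gyereket{{*gyerek}}",  # overnormalized: gets corrected
        "gyereket%=gyereket{{*gyerek}}",   # original has no "erm": left alone
    ]
    for sample in samples:
        print(sample, "->", restore_gyermek(sample))
```

Requiring `erm` in the original word form is what restricts the correction to tokens that were actually written with *gyermek*; forms that already had *gyerek* in the source text are left untouched.
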
Acknowledgements

The project Morphologically analysed corpus of Old and Middle Hungarian texts representative of informal language use was funded by the Hungarian Scientific Research Fund (OTKA), Project Grant No. OTKA 81189. The participants of the project are mainly historical linguists working at the Department of Finno-Ugric and Historical Linguistics of the Research Institute for Linguistics of the Hungarian Academy of Sciences, but the OTKA funding also made it possible to involve MA and doctoral students and to secure the participation of the computational linguist of the team. The project greatly benefited from regular consultations with experts in etymology (László Horváth) and historical syntax (Lea Haader). The follow-up project Competing structures in the Middle Hungarian vernacular: a variationist approach has been funded by the Hungarian Scientific Research Fund, Project Grant No. OTKA K 116217.

Author information

Corresponding author

Correspondence to Attila Novák.

About this article

Cite this article

Novák, A., Gugán, K., Varga, M. et al. Creation of an annotated corpus of Old and Middle Hungarian court records and private correspondence. Lang Resources & Evaluation 52, 1–28 (2018). https://doi.org/10.1007/s10579-017-9393-8

  • DOI: https://doi.org/10.1007/s10579-017-9393-8

Keywords

  • Historical corpus
  • Corpus annotation
  • Morphological analysis
  • PoS tagging
  • Middle Hungarian
  • Old Hungarian
  • Corpus query tool