Creation of an annotated corpus of Old and Middle Hungarian court records and private correspondence


The paper introduces a novel annotated corpus of Old and Middle Hungarian (16th–18th centuries), the texts of which were selected to approximate the vernacular of the given historical periods as closely as possible. The corpus consists of witness testimonies from trials and samples of private correspondence. The texts are not only morphologically analyzed; each file also contains metadata that facilitate sociolinguistic research. The texts were segmented into clauses, manually normalized, and morphosyntactically annotated using an annotation system consisting of the PurePos PoS tagger and the Hungarian morphological analyzer HuMor, which was originally developed for Modern Hungarian but adapted to analyze Old and Middle Hungarian morphological constructions. The automatically disambiguated morphological annotation was manually checked and corrected using an easy-to-use web-based manual disambiguation interface. The normalization process and the manual validation of the annotation required extensive teamwork and provided continuous feedback for the refinement of the computational morphology and the iterative retraining of the statistical models of the tagger. The paper discusses some of the typical problems that occurred during the normalization procedure and their tentative solutions. We also describe the automatic annotation tools, the process of semi-automatic disambiguation, and the query interface, a special function of which also makes it possible to correct the annotation. Displaying the original, the normalized and the parsed versions of the selected texts, the beta version of the first fully normalized and annotated historical corpus of Hungarian is freely accessible at the address

  5. Although the size of corpora is in general given in tokens, we also provide it in characters, since this facilitates the comparison of corpora: the token count is partly determined by language type (synthetic vs. analytic languages). Punctuation marks are also often counted as independent tokens. In our case, the token count provided is the number of analyzed words in the normalized version of the corpus. This differs from the raw word count of the original texts both because of differences in orthography and because it excludes the foreign-language (mostly Latin) tokens in the corpus.

  6. A follow-up project of the team was started in September 2015. The focus of this project is corpus-based research on historical morphology, syntax, and, above all, variation. We will therefore mainly exploit the existing corpus, but part of the resources is allocated to enlarging it further.

  7. We used FineReader, which makes full customization of glyph models possible, including the total exclusion of out-of-the-box models.

  8. That is, the chance that the next token in the corpus differs from all previous tokens is and remains much higher for any corpus size.

  9. In addition, to facilitate text input, we allowed the use of asterisks, which are easier to type and which were converted to flying accents before morphological analysis.

  10. Measured on newswire text.

  11. This ambiguity is absent from modern standard Hungarian because the passive is not used any more.

  12. Asynchronous JavaScript and XML (Ajax) is a client-side scripting technique that lets a page running in the browser communicate with a server or database without the need for a complete web page refresh.

  13. Five-fold cross-validation is an evaluation technique in which the corpus is divided into five roughly equal-sized parts. In each of the five rounds of evaluation, four parts are used as the training corpus and the remaining part is used for testing. The results of the five evaluations are averaged.

  14. The following is an example of a regular-expression-based substitution expression that we used to correct word forms and their analyses in which some form of the word gyermek ‘child’ was overnormalized to a corresponding form of gyerek ‘child’:

    #gyerek > gyermek


  15. Manual validation of the annotations was not performed within the framework of the Old Hungarian corpus project; instead, the disambiguated morphological annotations were taken from the Computational Database for Historical Linguistics (CDHL; see below).

  16. The original morphological annotations in CDHL are encoded in a hard-to-read numerical format, which was occasionally incorrect and often incomplete, lacking some rather relevant distinctions (e.g. infinitives and all types of participles were collapsed into a single category in the original CDHL annotation). For this reason, the original form often had to be taken into account in addition to the morphological annotation when generating the normalized version of the corpus; morphological analysis then automatically added the missing morphological features to the annotation.

  17. The CoNLL-U format is generally used to store treebanks containing dependency annotation.









  26. The set of distinct dependency relations is also considerably streamlined in that corpus version. The annotation of the rather frequent elliptic structures is also very controversial.
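
The five-fold cross-validation procedure described in note 13 can be sketched as follows. This is only an illustration of the general technique, not the evaluation code used in the project: the function names, the random shuffling, and the `train_and_score` callback are assumptions introduced here.

```python
import random
from statistics import mean


def five_fold_splits(items, k=5, seed=0):
    """Yield (train, test) pairs; each of the k parts serves as the test set once."""
    shuffled = list(items)
    # Assumption: items are shuffled before splitting; a fixed seed keeps runs repeatable.
    random.Random(seed).shuffle(shuffled)
    # Striding produces k parts whose sizes differ by at most one item.
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


def cross_validate(items, train_and_score, k=5, seed=0):
    """Average the scores returned by train_and_score(train, test) over the k rounds."""
    return mean(
        train_and_score(train, test)
        for train, test in five_fold_splits(items, k, seed)
    )
```

Here `train_and_score` stands for any hypothetical function that trains a model on the first argument and returns an accuracy figure measured on the second.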

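The kind of regular-expression-based substitution mentioned in note 14 might look like the sketch below. The pattern and the function name are hypothetical and do not reproduce the expression format used in the project. The sketch also rewrites every word-initial gyerek- stem unconditionally, whereas a real correction would presumably be restricted to tokens whose original form was a form of gyermek.

```python
import re

# Hypothetical pattern: a word-initial stem "gyerek" (upper or lower case),
# followed by any suffixes. Compounds such as "kisgyerek" are not matched,
# because the stem is not at a word boundary there.
GYEREK = re.compile(r"\b([Gg]y)erek(\w*)")


def restore_gyermek(text):
    """Replace overnormalized gyerek- forms with the corresponding gyermek- forms."""
    # Group 1 keeps the original capitalization, group 2 keeps the suffixes.
    return GYEREK.sub(r"\1ermek\2", text)
```

Because the suffixes are carried over unchanged, inflected forms such as gyereket map to gyermeket without listing each form separately.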

The project Morphologically analysed corpus of Old and Middle Hungarian texts representative of informal language use was funded by the Hungarian Scientific Research Fund (OTKA), Project Grant No. OTKA 81189. The participants of the project are mainly historical linguists working at the Department of Finno-Ugric and Historical Linguistics of the Research Institute for Linguistics of the Hungarian Academy of Sciences, but the OTKA funding also made it possible to involve MA and doctoral students as well as the computational linguist of the team. The project greatly benefited from regular consultations with experts in etymology (László Horváth) and historical syntax (Lea Haader). The follow-up project Competing structures in the Middle Hungarian vernacular: a variationist approach has been funded by the Hungarian Scientific Research Fund, Project Grant No. OTKA K 116217.

Novák, A., Gugán, K., Varga, M. et al. Creation of an annotated corpus of Old and Middle Hungarian court records and private correspondence. Lang Resources & Evaluation 52, 1–28 (2018).

