Advancing Hungarian Text Processing with HuSpaCy: Efficient and Accurate NLP Pipelines

Orosz, György; Szabó, Gergő; Berkecz, Péter; Szántó, Zsolt; Farkas, Richárd

doi:10.1007/978-3-031-40498-6_6

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14102)

Included in the following conference series:

International Conference on Text, Speech, and Dialogue

Abstract

This paper presents a set of industrial-grade text processing models for Hungarian that achieve near state-of-the-art performance while balancing resource efficiency and accuracy. The models are implemented in the spaCy framework and extend the HuSpaCy toolkit with several improvements to its architecture. In contrast to existing NLP tools for Hungarian, every one of our pipelines covers all basic text processing steps, including tokenization, sentence-boundary detection, part-of-speech tagging, morphological feature tagging, lemmatization, dependency parsing and named entity recognition, with high accuracy and throughput. We thoroughly evaluated the proposed enhancements, compared the pipelines with state-of-the-art tools, and demonstrated the competitive performance of the new models in all text preprocessing steps. All experiments are reproducible and the pipelines are freely available under a permissive license.
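
As a rough illustration of the processing steps listed in the abstract, the following Python sketch shows how the output of such a pipeline might be read through spaCy's standard `Doc` interface. The model name `hu_core_news_lg` and the helper names `token_rows`, `entity_rows` and `annotate` are assumptions made for this example and are not taken from the paper.

```python
def token_rows(doc):
    """Collect per-token annotations from a spaCy-style Doc as plain tuples.

    Each tuple holds: surface form, lemma, part-of-speech tag,
    morphological features, and dependency relation.
    """
    return [
        (tok.text, tok.lemma_, tok.pos_, str(tok.morph), tok.dep_)
        for tok in doc
    ]


def entity_rows(doc):
    """Collect named entity spans as (text, label) pairs."""
    return [(ent.text, ent.label_) for ent in doc.ents]


def annotate(text, model_name="hu_core_news_lg"):
    """Run a pipeline over `text` and gather all annotation layers.

    `model_name` is an assumed default; substitute whichever
    Hungarian pipeline is actually installed.
    """
    # Imported lazily so the helpers above can be used without spaCy installed.
    import spacy

    nlp = spacy.load(model_name)
    doc = nlp(text)
    return {
        "sentences": [sent.text for sent in doc.sents],
        "tokens": token_rows(doc),
        "entities": entity_rows(doc),
    }
```

Calling `annotate` requires spaCy and a Hungarian pipeline to be installed beforehand. The two helper functions only rely on the attribute names of spaCy's `Token` and `Span` objects, so they can be tried on any object that exposes the same attributes.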

G. Orosz, G. Szabó and P. Berkecz—These authors contributed equally to this work.

Notes

1. https://spacy.io/
2. These steps are usually referred to as the Tok2Vec layers.
3. https://explosion.ai/blog/floret-vectors
4. https://explosion.ai/blog/edit-tree-lemmatizer
5. Experiments are performed at the v2.10 revision.

Acknowledgments

The authors would like to thank Gábor Berend for his valuable suggestions. HuSpaCy research and development is supported by the European Union project RRF-2.3.1-21-2022-00004 within the framework of the Artificial Intelligence National Laboratory.

Author information

Authors and Affiliations

Institute of Informatics, University of Szeged, 2. Árpád tér, Szeged, Hungary
György Orosz, Gergő Szabó, Péter Berkecz, Zsolt Szántó & Richárd Farkas

Corresponding author

Correspondence to György Orosz.

Editor information

Editors and Affiliations

University of West Bohemia, Pilsen, Czech Republic
Kamil Ekštein
University of West Bohemia, Pilsen, Czech Republic
František Pártl
University of West Bohemia, Pilsen, Czech Republic
Miloslav Konopík

About this paper

Cite this paper

Orosz, G., Szabó, G., Berkecz, P., Szántó, Z., Farkas, R. (2023). Advancing Hungarian Text Processing with HuSpaCy: Efficient and Accurate NLP Pipelines. In: Ekštein, K., Pártl, F., Konopík, M. (eds) Text, Speech, and Dialogue. TSD 2023. Lecture Notes in Computer Science, vol 14102. Springer, Cham. https://doi.org/10.1007/978-3-031-40498-6_6

DOI: https://doi.org/10.1007/978-3-031-40498-6_6
Published: 23 August 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-40497-9
Online ISBN: 978-3-031-40498-6
eBook Packages: Computer Science; Computer Science (R0)
