Skip to main content

Advancing Hungarian Text Processing with HuSpaCy: Efficient and Accurate NLP Pipelines

  • Conference paper
  • First Online:
Text, Speech, and Dialogue (TSD 2023)

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 14102))

Included in the following conference series:

Abstract

This paper presents a set of industrial-grade text processing models for Hungarian that achieve near state-of-the-art performance while balancing resource efficiency and accuracy. Models have been implemented in the spaCy framework, extending the HuSpaCy toolkit with several improvements to its architecture. Compared to existing NLP tools for Hungarian, all of our pipelines feature all basic text processing steps including tokenization, sentence-boundary detection, part-of-speech tagging, morphological feature tagging, lemmatization, dependency parsing and named entity recognition with high accuracy and throughput. We thoroughly evaluated the proposed enhancements, compared the pipelines with state-of-the-art tools and demonstrated the competitive performance of the new models in all text preprocessing steps. All experiments are reproducible and the pipelines are freely available under a permissive license.

G. Orosz, G. Szabó and P. Berkecz—These authors contributed equally to this work.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 59.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 74.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

  1. 1.

    https://spacy.io/.

  2. 2.

    These steps are usually referred to as the Tok2Vec layers.

  3. 3.

    https://explosion.ai/blog/floret-vectors.

  4. 4.

    https://explosion.ai/blog/edit-tree-lemmatizer.

  5. 5.

    Experiments are performed at the v2.10 revision.

References

  1. Altıntaş, M., Tantuğ, A.C.: Improving the performance of graph based dependency parsing by guiding bi-affine layer with augmented global and local features. Intell. Syst. Appl. 18, 200190 (2023)

    Google Scholar 

  2. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. 5, 135–146 (2017)

    Article  Google Scholar 

  3. Conneau, A., et al.: Unsupervised cross-lingual representation learning at scale. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 8440–8451 (2020)

    Google Scholar 

  4. Csendes, D., Csirik, J., Gyimóthy, T.: The szeged corpus: a POS tagged and syntactically annotated Hungarian natural language corpus. In: Sojka, P., Kopeček, I., Pala, K. (eds.) TSD 2004. LNCS (LNAI), vol. 3206, pp. 41–47. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-30120-2_6

    Chapter  Google Scholar 

  5. Csendes, D., Csirik, J., Gyimóthy, T., Kocsor, A.: The szeged treebank. In: Matoušek, V., Mautner, P., Pavelka, T. (eds.) TSD 2005. LNCS (LNAI), vol. 3658, pp. 123–131. Springer, Heidelberg (2005). https://doi.org/10.1007/11551874_16

    Chapter  Google Scholar 

  6. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL-HLT, pp. 4171–4186 (2019)

    Google Scholar 

  7. Dozat, T., Manning, C.D.: Deep biaffine attention for neural dependency parsing. In: International Conference on Learning Representations (2017)

    Google Scholar 

  8. Enevoldsen, K., Hansen, L., Nielbo, K.: DaCy: a unified framework for Danish NLP. arXiv preprint arXiv:2107.05295 (2021)

  9. Honnibal, M.: Introducing spaCy (2015). https://explosion.ai/blog/introducing-spacy

  10. Honnibal, M., Goldberg, Y., Johnson, M.: A non-monotonic arc-eager transition system for dependency parsing. In: Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pp. 163–172. Association for Computational Linguistics, Sofia (2013)

    Google Scholar 

  11. Indig, B., Sass, B., Simon, E., Mittelholcz, I., Vadász, N., Makrai, M.: One format to rule them all - the emtsv pipeline for Hungarian. In: Proceedings of the 13th Linguistic Annotation Workshop, pp. 155–165. Association for Computational Linguistics, Florence (2019)

    Google Scholar 

  12. Kondratyuk, D., Straka, M.: 75 languages, 1 model: parsing universal dependencies universally. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 2779–2795 (2019)

    Google Scholar 

  13. Lecun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)

    Article  Google Scholar 

  14. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space (2013)

    Google Scholar 

  15. Müller, T., Cotterell, R., Fraser, A., Schütze, H.: Joint lemmatization and morphological tagging with lemming. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 2268–2274. Association for Computational Linguistics, Lisbon (2015)

    Google Scholar 

  16. Nemeskey, D.M.: Egy emBERT próbáló feladat. In: XVI. Magyar Számítógépes Nyelvészeti Konferencia (MSZNY2020), pp. 409–418. Szeged (2020)

    Google Scholar 

  17. Nemeskey, D.M.: Natural language processing methods for language modeling. Ph.D. thesis, Eötvös Loránd University (2020)

    Google Scholar 

  18. Nivre, J., et al.: Universal dependencies v2: an evergrowing multilingual treebank collection. In: Proceedings of the Twelfth Language Resources and Evaluation Conference, pp. 4034–4043. European Language Resources Association, Marseille (2020)

    Google Scholar 

  19. Novák, A.: A new form of humor – mapping constraint-based computational morphologies to a finite-state representation. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014), pp. 1068–1073. European Language Resources Association (ELRA), Reykjavik (2014)

    Google Scholar 

  20. Novák, A., Siklósi, B., Oravecz, C.: A new integrated open-source morphological analyzer for Hungarian. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pp. 1315–1322. European Language Resources Association (ELRA), Portorož (2016)

    Google Scholar 

  21. Orosz, G., Novák, A.: PurePos 2.0: a hybrid tool for morphological disambiguation. In: Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013, pp. 539–545. INCOMA Ltd., Shoumen, BULGARIA, Hissar (2013)

    Google Scholar 

  22. Orosz, G., Szántó, Z., Berkecz, P., Szabó, G., Farkas, R.: HuSpaCy: an industrial-strength Hungarian natural language processing toolkit. In: XVIII. Magyar Számítógépes Nyelvészeti Konferencia (2022)

    Google Scholar 

  23. Qi, P., Zhang, Y., Zhang, Y., Bolton, J., Manning, C.D.: Stanza: a Python natural language processing toolkit for many human languages. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations (2020)

    Google Scholar 

  24. Simon, E., Lendvai, P., Németh, G., Olaszy, G., Vicsi, K.: A Magyar Nyelv a Digitális Korban - the Hungarian Language in the Digital Age. Georg Rehm and Hans Uszkoreit (Series Editors): META-NET White Paper Series. Springer, Heidelberg (2012)

    Google Scholar 

  25. Simon, E., Indig, B., Kalivoda, Á., Mittelholcz Iván, S.B., Vadász, N.: Újabb fejlemények az e-magyar háza táján. In: Berend, G., Gosztolya, G., Vincze, V. (eds.) XVI. Magyar Számítógépes Nyelvészeti Konferencia, pp. 29–42. Szegedi Tudományegyetem Informatikai Tanszékcsoport, Szeged (2020)

    Google Scholar 

  26. Simon, E., Vadász, N.: Introducing NYTK-NerKor, a gold standard Hungarian named entity annotated corpus. In: Ekštein, K., Pártl, F., Konopík, M. (eds.) TSD 2021. LNCS (LNAI), vol. 12848, pp. 222–234. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-83527-9_19

    Chapter  Google Scholar 

  27. Simon, E., Vadász, N., Lévai, D., Dávid, N., Orosz, G., Szántó, Z.: Az NYTK-NerKor több szempontú kiértékelése. XVIII. Magyar Számítógépes Nyelvészeti Konferencia (2022)

    Google Scholar 

  28. Straka, M.: UDPipe 2.0 prototype at CoNLL 2018 UD shared task. In: Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pp. 197–207. Association for Computational Linguistics, Brussels (2018)

    Google Scholar 

  29. Szarvas, György., Farkas, Richárd, Kocsor, András: A multilingual named entity recognition system using boosting and C4.5 decision tree learning algorithms. In: Todorovski, Ljupčo, Lavrač, Nada, Jantke, Klaus P.. (eds.) DS 2006. LNCS (LNAI), vol. 4265, pp. 267–278. Springer, Heidelberg (2006). https://doi.org/10.1007/11893318_27

    Chapter  Google Scholar 

  30. Van Nguyen, M., Lai, V., Veyseh, A.P.B., Nguyen, T.H.: Trankit: a light-weight transformer-based toolkit for multilingual natural language processing. EACL 2021, 80 (2021)

    Google Scholar 

  31. Váradi, T., et al.: E-magyar - a digital language processing system. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). European Language Resources Association (ELRA), Miyazaki (2018)

    Google Scholar 

  32. Vincze, V., Simkó, K., Szántó, Z., Farkas, R.: Universal dependencies and morphology for Hungarian - and on the price of universality. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pp. 356–365. Association for Computational Linguistics, Valencia (2017)

    Google Scholar 

  33. Zsibrita, J., Vincze, V., Farkas, R.: magyarlanc: a toolkit for morphological and dependency parsing of Hungarian. In: Proceedings of Recent Advances in Natural Language Processing 2013, pp. 763–771. Association for Computational Linguistics, Hissar (2013)

    Google Scholar 

Download references

Acknowledgments

The authors would like to thank Gábor Berend for his valuable suggestions. HuSpaCy research and development is supported by the European Union project RRF-2.3.1-21-2022-00004 within the framework of the Artificial Intelligence National Laboratory.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to György Orosz .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Orosz, G., Szabó, G., Berkecz, P., Szántó, Z., Farkas, R. (2023). Advancing Hungarian Text Processing with HuSpaCy: Efficient and Accurate NLP Pipelines. In: Ekštein, K., Pártl, F., Konopík, M. (eds) Text, Speech, and Dialogue. TSD 2023. Lecture Notes in Computer Science(), vol 14102. Springer, Cham. https://doi.org/10.1007/978-3-031-40498-6_6

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-40498-6_6

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-40497-9

  • Online ISBN: 978-3-031-40498-6

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics