Abstract
One of the crucial problems in natural language processing for languages such as Ukrainian is the lack of datasets, both unlabeled (for pretraining word embeddings or large deep learning models) and labeled (for benchmarking existing approaches).
In this paper we describe a framework for creating simple classification datasets with minimal labeling effort. We create a dataset for Ukrainian news classification and compare several pretrained models for the Ukrainian language in different training settings.
We show that ukr-RoBERTa, ukr-ELECTRA, and XLM-R tend to achieve the highest performance, with XLM-R performing better on longer texts and ukr-RoBERTa substantially better on shorter sequences.
We publish this dataset on Kaggle (https://www.kaggle.com/c/ukrainian-news-classification/) and suggest using it for further comparison of approaches to Ukrainian text classification.
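The abstract does not detail the labeling framework itself, so the snippet below is purely an illustrative assumption: one common way to build a news classification dataset with minimal labeling effort is to reuse the rubric (section) under which each outlet published an article, so that only a small rubric-to-label mapping is written by hand. The column names and rubric values are hypothetical.

```python
# Illustrative sketch only; the paper's actual framework may differ.
# Assumption: scraped articles carry the outlet's own rubric/section.
import pandas as pd

articles = pd.DataFrame({
    "title": ["Верховна Рада ухвалила закон", "Динамо перемогло Шахтар"],
    "text": ["...", "..."],
    "rubric": ["політика", "спорт"],  # hypothetical outlet rubrics
})

# Only this mapping is written by hand, keeping labeling effort minimal.
RUBRIC_TO_LABEL = {"політика": "politics", "спорт": "sports"}
articles["label"] = articles["rubric"].map(RUBRIC_TO_LABEL)

# Rows whose rubric is not in the mapping stay unlabeled and can be dropped.
labeled = articles.dropna(subset=["label"])
print(labeled[["title", "label"]])
```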
This work builds on the results of the “Ukrainian News Classification” contest [1] hosted by TechTalents.
References
Vaswani, A., et al.: Attention is all you need. In: Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS 2017). Curran Associates Inc., Red Hook, NY, USA, pp. 6000–6010 (2017)
Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding (2019)
Radchenko, V.: We trained the Ukrainian language model. https://youscan.io/blog/ukrainian-language-model/
Schweter, S.: Ukrainian ELECTRA model. https://github.com/stefan-it/ukrainian-electra, https://doi.org/10.5281/zenodo.4267880
Babenko, D.: Determining sentiment and important properties of Ukrainian-language user reviews. Master's thesis, supervisor V. Dyomkin, Ukrainian Catholic University, Department of Computer Sciences, Lviv, 35 p. (2020)
Babenko, D., Dyomkin, V.: Determining sentiment and important properties of Ukrainian language user reviews. In: CEUR Workshop Proceedings, vol. 2566, MS-AMLV 2019 (2019). http://ceur-ws.org/Vol-2566/MS-AMLV-2019-paper39-p106.pdf
NER annotation corpus. https://lang.org.ua/en/corpora/
Conneau, A., et al.: Unsupervised cross-lingual representation learning at scale (2019). CoRR, abs/1911.02116
Conneau, A., et al.: XNLI: Evaluating cross-lingual sentence representations. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics (2018)
Ying, S., et al.: Improving medical short text classification with semantic expansion using word-cluster embedding (2018). arXiv preprint, abs/1812.01885
Zhang, Y., Jin, R., Zhou, Z.-H.: Understanding bag-of-words model: a statistical framework. Int. J. Mach. Learn. Cybern. 1, 43–52 (2010). https://doi.org/10.1007/s13042-010-0001-0
Kaufman, S., Rosset, S., Perlich, C.: Leakage in data mining: formulation, detection, and avoidance. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, vol. 6, pp. 556–563 (2011). https://doi.org/10.1145/2020408.2020496
Norvig, P.: How to write a spelling corrector. http://norvig.com/spell-correct.html
Shuyo, N.: Language detection library for Java (2010)
TF-IDF. In: Sammut, C., Webb, G.I. (eds.) Encyclopedia of Machine Learning. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-30164-8
Arkhipov, M., Trofimova, M., Kuratov, Y., Sorokin, A.: Tuning multilingual transformers for language-specific named entity recognition. In: Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing, pp. 89–93. Association for Computational Linguistics (2019)
Wang, S., Manning, C.D.: Baselines and bigrams: simple, good sentiment and topic classification. In: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, vol. 2: Short Papers, pp. 90–94. Association for Computational Linguistics (2012)
Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach (2019). CoRR, abs/1907.11692
Ortiz Suárez, P., Sagot, B., Romary, L.: Asynchronous pipeline for processing huge corpora on medium to low resource infrastructures. In: 7th Workshop on the Challenges in the Management of Large Corpora (CMLC-7). Leibniz-Institut für Deutsche Sprache (2019)
Clark, K., Luong, M.T., Le, Q.V., Manning, C.D.: ELECTRA: Pre-training text encoders as discriminators rather than generators. In: International Conference on Learning Representations (2020)
Acknowledgments
We thank the multi-university education platform TechTalents for hosting the contest [1], our funding partner AltexSoft, and our partner CloudWorks.