Abstract
Natural language processing (NLP) has made significant progress for well-resourced languages such as English but has lagged behind for low-resource languages like Setswana. This paper addresses this gap by presenting PuoBERTa, a customised masked language model trained specifically for Setswana. We cover how we collected, curated, and prepared diverse monolingual texts to generate a high-quality corpus for PuoBERTa's training. Building upon previous efforts in creating monolingual resources for Setswana, we evaluated PuoBERTa across several NLP tasks, including part-of-speech (POS) tagging, named entity recognition (NER), and news categorisation. Additionally, we introduced a new Setswana news categorisation dataset and provided initial benchmarks using PuoBERTa. Our work demonstrates the efficacy of PuoBERTa in fostering NLP capabilities for understudied languages like Setswana and paves the way for future research directions.
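For readers unfamiliar with masked language model pretraining, the corruption step can be sketched as follows. This is an illustrative BERT-style 80/10/10 scheme, not the authors' exact training recipe; the function name, tokenisation, and vocabulary handling are simplified assumptions.

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", vocab=None, p=0.15, seed=0):
    """BERT-style corruption: select ~p of the tokens; of those, replace
    80% with the mask token, 10% with a random vocabulary token, and
    leave 10% unchanged. Returns (corrupted, labels), where labels holds
    the original token at selected positions and None elsewhere, so the
    loss is computed only on the selected positions."""
    rng = random.Random(seed)
    vocab = vocab or tokens  # fallback vocabulary for the sketch
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < p:
            labels.append(tok)           # model must predict this token
            r = rng.random()
            if r < 0.8:
                corrupted.append(mask_token)
            elif r < 0.9:
                corrupted.append(rng.choice(vocab))
            else:
                corrupted.append(tok)    # kept, but still predicted
        else:
            labels.append(None)          # position excluded from the loss
            corrupted.append(tok)
    return corrupted, labels
```

During pretraining the model sees the corrupted sequence and is trained to recover the original tokens at the selected positions.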
References
Adebara, I., Elmadany, A., Abdul-Mageed, M., Alcoba Inciarte, A.: SERENGETI: massively multilingual language models for Africa. In: Findings of the Association for Computational Linguistics: ACL 2023, pp. 1498–1537. Association for Computational Linguistics, Toronto, Canada (2023). https://doi.org/10.18653/v1/2023.findings-acl.97
Adelani, D., et al.: A few thousand translations go a long way! leveraging pre-trained models for African news translation. In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 3053–3070. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.223
Adelani, D., et al.: MasakhaNER 2.0: Africa-centric transfer learning for named entity recognition. In: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 4488–4508 (2022)
Agić, Ž., Vulić, I.: JW300: a wide-coverage parallel corpus for low-resource languages. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3204–3210. Association for Computational Linguistics, Florence, Italy (2019). https://doi.org/10.18653/v1/P19-1310
Alabi, J.O., Adelani, D.I., Mosbach, M., Klakow, D.: Adapting pre-trained language models to African languages via multilingual adaptive fine-tuning. In: Proceedings of the 29th International Conference on Computational Linguistics, pp. 4336–4349. International Committee on Computational Linguistics, Gyeongju, Republic of Korea (2022)
Armengol-Estapé, J., et al.: Are multilingual models the best choice for moderately under-resourced languages? A comprehensive assessment for Catalan. In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp. 4933–4946. Association for Computational Linguistics, Online (2021). https://doi.org/10.18653/v1/2021.findings-acl.437
Aulamo, M., Sulubacak, U., Virpioja, S., Tiedemann, J.: OpusTools and parallel corpus diagnostics. In: Proceedings of The 12th Language Resources and Evaluation Conference, pp. 3782–3789. European Language Resources Association (2020)
Baziotis, C., Zhang, B., Birch, A., Haddow, B.: When does monolingual data help multilingual translation: the role of domain and model scale. arXiv preprint arXiv: 2305.14124 (2023)
Burlot, F., Yvon, F.: Using monolingual data in neural machine translation: a systematic study. In: Proceedings of the Third Conference on Machine Translation: Research Papers, pp. 144–155. Association for Computational Linguistics, Brussels, Belgium (2018). https://doi.org/10.18653/v1/W18-6315
Leipzig Corpora Collection: Tswana web text corpus (South Africa) based on material from 2019. https://corpora.uni-leipzig.de/en?corpusId=tsn_community_2017. Accessed 22 Aug 2023
Dione, C.M.B., et al.: MasakhaPOS: part-of-speech tagging for typologically diverse African languages. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, vol. 1 Long Papers, pp. 10883–10900. Association for Computational Linguistics, Toronto, Canada (2023). https://doi.org/10.18653/v1/2023.acl-long.609
Doddapaneni, S., et al.: Towards leaving no Indic language behind: building monolingual corpora, benchmark and models for Indic languages. In: Annual Meeting of the Association for Computational Linguistics (2022). https://doi.org/10.18653/v1/2023.acl-long.693
Dossou, B.F., et al.: AfroLM: a self-active learning-based multilingual pretrained language model for 23 African languages. SustaiNLP 2022, 52 (2022)
Eiselen, R.: Nchlt Setswana roberta language model (2023). https://hdl.handle.net/20.500.12185/641
Eiselen, R., Puttkammer, M.J.: Developing text resources for ten South African languages. In: LREC, pp. 3698–3703 (2014)
Fan, A., et al.: Beyond English-centric multilingual machine translation (2020)
Goldhahn, D., Eckart, T., Quasthoff, U.: Building large monolingual dictionaries at the Leipzig corpora collection: from 100 to 200 languages. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012), pp. 759–765. European Language Resources Association (ELRA), Istanbul, Turkey (2012)
Gyasi, F., Schlippe, T.: Twi machine translation. Big Data Cogn. Comput. 7(2), 114 (2023). https://doi.org/10.3390/bdcc7020114
Haddow, B., Bawden, R., Miceli Barone, A.V., Helcl, J., Birch, A.: Survey of low-resource machine translation. Comput. Linguist. 48(3), 673–732 (2022)
Adelani, D.I., et al.: MasakhaNEWS: news topic classification for African languages. arXiv e-prints, arXiv-2304 (2023)
Lastrucci, R., et al.: Preparing the Vuk'uzenzele and ZA-gov-multilingual South African multilingual corpora. In: Fourth Workshop on Resources for African Indigenous Languages (RAIL), p. 18 (2023)
Limisiewicz, T., Malkin, D., Stanovsky, G.: You can have your data and balance it too: towards balanced and efficient multilingual models. In: Proceedings of the 5th Workshop on Research in Computational Linguistic Typology and Multilingual NLP, pp. 1–11. Association for Computational Linguistics, Dubrovnik, Croatia (2023)
Litre, G., et al.: Participatory detection of language barriers towards multilingual sustainability(ies) in Africa. Sustainability 14(13), 8133 (2022). https://doi.org/10.3390/su14138133
Liu, Y., et al.: Multilingual denoising pre-training for neural machine translation. arXiv preprint arXiv: 2001.08210 (2020)
Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9(11), 2579–2605 (2008)
Marivate, V., Mots’Oehli, M., Wagner, V., Lastrucci, R., Dzingirai, I.: PuoBERTa + PuoBERTa Setswana language models (2023). https://doi.org/10.5281/zenodo.8434795
Marivate, V., et al.: Investigating an approach for low resource language dataset creation, curation and classification: Setswana and Sepedi. In: Proceedings of the first workshop on Resources for African Indigenous Languages, pp. 15–20. European Language Resources Association (ELRA), Marseille, France (2020)
Meyer, F., Buys, J.: Subword segmental language modelling for Nguni languages. In: Conference On Empirical Methods In Natural Language Processing (2022). https://doi.org/10.48550/arXiv.2210.06525
Motsoehli, M.: TswanaBERT (2020). https://huggingface.co/MoseliMotsoehli/TswanaBert
Nekoto, W., et al.: Participatory research for low-resourced machine translation: a case study in African languages. In: Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 2144–2160. Association for Computational Linguistics, Online (2020). https://doi.org/10.18653/v1/2020.findings-emnlp.195
Ogueji, K., Zhu, Y., Lin, J.: Small data? no problem! exploring the viability of pretrained multilingual language models for low-resourced languages. In: Proceedings of the 1st Workshop on Multilingual Representation Learning, pp. 116–126. Association for Computational Linguistics, Punta Cana, Dominican Republic (2021)
Palai, E.B., O’Hanlon, L.: Word and phoneme frequency of occurrence in conversational Setswana: a clinical linguistic application. South. Afr. Linguist. Appl. Lang. Stud. 22(3–4), 125–142 (2004)
Ragni, A., Knill, K.M., Rath, S.P., Gales, M.J.: Data augmentation for low resource languages. In: INTERSPEECH 2014: 15th Annual Conference of the International Speech Communication Association, pp. 810–814. International Speech Communication Association (ISCA) (2014)
Ranathunga, S., Lee, E.S.A., Prifti Skenduli, M., Shekhar, R., Alam, M., Kaur, R.: Neural machine translation for low-resource languages: a survey. arXiv e-prints, arXiv-2106 (2021)
Scao, T.L., et al.: What language model to train if you have one million GPU hours? In: Conference On Empirical Methods in Natural Language Processing (2022). https://doi.org/10.48550/arXiv.2210.15424
de Souza, L.R., Nogueira, R., Lotufo, R.: On the ability of monolingual models to learn language-agnostic representations. arXiv preprint arXiv: 2109.01942 (2021)
de Vries, W., Bartelds, M., Nissim, M., Wieling, M.: Adapting monolingual models: data can be scarce when language similarity is high. In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp. 4901–4907. Association for Computational Linguistics (2021). https://doi.org/10.18653/v1/2021.findings-acl.433
Xue, L., et al.: ByT5: towards a token-free future with pre-trained byte-to-byte models. arXiv preprint arXiv: 2105.13626 (2021)
Xue, L., et al.: mT5: a massively multilingual pre-trained text-to-text transformer. arXiv preprint arXiv: 2010.11934 (2020)
Acknowledgement
We want to acknowledge the feedback received from colleagues at DSFSI at the University of Pretoria and at Lelapa AI, which improved this paper. We would like to acknowledge funding from the ABSA Chair of Data Science, Google, and an NVIDIA Corporation hardware grant. We are thankful to the anonymous reviewers for their feedback.
Appendix: Data Statement for Daily News - Dikgang Categorised News Corpus
1.1 A. CURATION RATIONALE
The motivation for building this dataset was to provide one of the few annotated news categorisation datasets for Setswana. The task required identifying a high-quality Setswana news source, collecting the data, and then annotating it using the International Press Telecommunications Council (IPTC) News Categories (or codes)Footnote 16. The identified source was the Daily NewsFootnote 17 (Dikgang section) from the Botswana Government. All copyright for the news content belongs to Daily News. We collected 5000 Setswana news articles. The distribution of final categories for the dataset is shown in Fig. 1.
1.2 B. LANGUAGE VARIETY
The language of this data set is Setswana (primarily from Botswana).
1.3 C. SPEAKER DEMOGRAPHIC
Setswana is a Bantu language spoken in Botswana as well as in several regions of South Africa [32].
1.4 D. ANNOTATOR DEMOGRAPHIC
Two annotators labelled the news articles from the Daily News - Dikgang section. Their demographic information is shown in Table 7.
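The data statement does not report an agreement score, but with two annotators labelling the same items, agreement is commonly quantified with Cohen's kappa. The sketch below is illustrative background, not a computation from the paper; the example labels are hypothetical.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same items:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators labelled the same.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: expected overlap given each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(counts_a[c] * counts_b[c]
              for c in set(counts_a) | set(counts_b)) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```

Kappa of 1.0 indicates perfect agreement; values near 0 indicate agreement no better than chance given each annotator's label distribution.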
1.5 E. PROVENANCE APPENDIX
The original data is from the Daily News news service from the Botswana Government.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Marivate, V., Mots’Oehli, M., Wagner, V., Lastrucci, R., Dzingirai, I. (2023). PuoBERTa: Training and Evaluation of a Curated Language Model for Setswana. In: Pillay, A., Jembere, E., Gerber, A.J. (eds) Artificial Intelligence Research. SACAIR 2023. Communications in Computer and Information Science, vol 1976. Springer, Cham. https://doi.org/10.1007/978-3-031-49002-6_17
DOI: https://doi.org/10.1007/978-3-031-49002-6_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-49001-9
Online ISBN: 978-3-031-49002-6