Abstract
Natural language processing (NLP) has made significant progress for well-resourced languages such as English but has lagged behind for low-resource languages like Setswana. This paper addresses this gap by presenting PuoBERTa, a customised masked language model trained specifically for Setswana. We cover how we collected, curated, and prepared diverse monolingual texts to generate a high-quality corpus for PuoBERTa's training. Building upon previous efforts in creating monolingual resources for Setswana, we evaluated PuoBERTa across several NLP tasks, including part-of-speech (POS) tagging, named entity recognition (NER), and news categorisation. Additionally, we introduced a new Setswana news categorisation dataset and provided initial benchmarks using PuoBERTa. Our work demonstrates the efficacy of PuoBERTa in fostering NLP capabilities for understudied languages like Setswana and paves the way for future research directions.
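For readers unfamiliar with masked language model pretraining, the corruption step can be sketched as follows. This is an illustrative BERT-style 80/10/10 scheme, not the authors' exact training recipe; the function name, tokenisation, and vocabulary handling are simplified assumptions.

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", vocab=None, p=0.15, seed=0):
    """BERT-style corruption: select ~p of the tokens; of those, replace
    80% with the mask token, 10% with a random vocabulary token, and
    leave 10% unchanged. Returns (corrupted, labels), where labels holds
    the original token at selected positions and None elsewhere, so the
    loss is computed only on the selected positions."""
    rng = random.Random(seed)
    vocab = vocab or tokens  # fallback vocabulary for the sketch
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < p:
            labels.append(tok)           # model must predict this token
            r = rng.random()
            if r < 0.8:
                corrupted.append(mask_token)
            elif r < 0.9:
                corrupted.append(rng.choice(vocab))
            else:
                corrupted.append(tok)    # kept, but still predicted
        else:
            labels.append(None)          # position excluded from the loss
            corrupted.append(tok)
    return corrupted, labels
```

During pretraining the model sees the corrupted sequence and is trained to recover the original tokens at the selected positions.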
References
Adebara, I., Elmadany, A., Abdul-Mageed, M., Alcoba Inciarte, A.: SERENGETI: massively multilingual language models for Africa. In: Findings of the Association for Computational Linguistics: ACL 2023, pp. 1498–1537. Association for Computational Linguistics, Toronto, Canada (2023). https://doi.org/10.18653/v1/2023.findings-acl.97
Adelani, D., et al.: A few thousand translations go a long way! leveraging pre-trained models for African news translation. In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 3053–3070. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.223
Adelani, D., et al.: MasakhaNER 2.0: Africa-centric transfer learning for named entity recognition. In: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 4488–4508 (2022)
Agić, Ž., Vulić, I.: JW300: a wide-coverage parallel corpus for low-resource languages. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3204–3210. Association for Computational Linguistics, Florence, Italy (2019). https://doi.org/10.18653/v1/P19-1310
Alabi, J.O., Adelani, D.I., Mosbach, M., Klakow, D.: Adapting pre-trained language models to African languages via multilingual adaptive fine-tuning. In: Proceedings of the 29th International Conference on Computational Linguistics, pp. 4336–4349. International Committee on Computational Linguistics, Gyeongju, Republic of Korea (2022)
Armengol-Estapé, J., et al.: Are multilingual models the best choice for moderately under-resourced languages? A comprehensive assessment for Catalan. In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp. 4933–4946. Association for Computational Linguistics, Online (2021). https://doi.org/10.18653/v1/2021.findings-acl.437
Aulamo, M., Sulubacak, U., Virpioja, S., Tiedemann, J.: OpusTools and parallel corpus diagnostics. In: Proceedings of The 12th Language Resources and Evaluation Conference, pp. 3782–3789. European Language Resources Association (2020)
Baziotis, C., Zhang, B., Birch, A., Haddow, B.: When does monolingual data help multilingual translation: the role of domain and model scale. arXiv preprint arXiv: 2305.14124 (2023)
Burlot, F., Yvon, F.: Using monolingual data in neural machine translation: a systematic study. In: Proceedings of the Third Conference on Machine Translation: Research Papers, pp. 144–155. Association for Computational Linguistics, Brussels, Belgium (2018). https://doi.org/10.18653/v1/W18-6315
Leipzig Corpora Collection: Tswana web text corpus (South Africa) based on material from 2019. https://corpora.uni-leipzig.de/en?corpusId=tsn_community_2017. Accessed 22 Aug 2023
Dione, C.M.B., et al.: MasakhaPOS: part-of-speech tagging for typologically diverse African languages. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, vol. 1 Long Papers, pp. 10883–10900. Association for Computational Linguistics, Toronto, Canada (2023). https://doi.org/10.18653/v1/2023.acl-long.609
Doddapaneni, S., et al.: Towards leaving no Indic language behind: building monolingual corpora, benchmark and models for Indic languages. In: Annual Meeting of the Association for Computational Linguistics (2022). https://doi.org/10.18653/v1/2023.acl-long.693
Dossou, B.F., et al.: AfroLM: a self-active learning-based multilingual pretrained language model for 23 African languages. SustaiNLP 2022, 52 (2022)
Eiselen, R.: Nchlt Setswana roberta language model (2023). https://hdl.handle.net/20.500.12185/641
Eiselen, R., Puttkammer, M.J.: Developing text resources for ten South African languages. In: LREC, pp. 3698–3703 (2014)
Fan, A., et al.: Beyond English-centric multilingual machine translation (2020)
Goldhahn, D., Eckart, T., Quasthoff, U.: Building large monolingual dictionaries at the Leipzig corpora collection: from 100 to 200 languages. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012), pp. 759–765. European Language Resources Association (ELRA), Istanbul, Turkey (2012)
Gyasi, F., Schlippe, T.: Twi machine translation. Big Data Cogn. Comput. 7(2), 114 (2023). https://doi.org/10.3390/bdcc7020114
Haddow, B., Bawden, R., Miceli Barone, A.V., Helcl, J., Birch, A.: Survey of low-resource machine translation. Comput. Linguist. 48(3), 673–732 (2022)
Adelani, D.I., et al.: MasakhaNEWS: news topic classification for African languages. arXiv e-prints, arXiv-2304 (2023)
Lastrucci, R., et al.: Preparing the Vuk'uzenzele and ZA-gov-multilingual South African multilingual corpora. In: Fourth Workshop on Resources for African Indigenous Languages (RAIL), p. 18 (2023)
Limisiewicz, T., Malkin, D., Stanovsky, G.: You can have your data and balance it too: towards balanced and efficient multilingual models. In: Proceedings of the 5th Workshop on Research in Computational Linguistic Typology and Multilingual NLP, pp. 1–11. Association for Computational Linguistics, Dubrovnik, Croatia (2023)
Litre, G., et al.: Participatory detection of language barriers towards multilingual sustainability(ies) in Africa. Sustainability 14(13), 8133 (2022). https://doi.org/10.3390/su14138133
Liu, Y., et al.: Multilingual denoising pre-training for neural machine translation. arXiv preprint arXiv: 2001.08210 (2020)
Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9(11), 2579–2605 (2008)
Marivate, V., Mots’Oehli, M., Wagner, V., Lastrucci, R., Dzingirai, I.: PuoBERTa + PuoBERTa Setswana language models (2023). https://doi.org/10.5281/zenodo.8434795
Marivate, V., et al.: Investigating an approach for low resource language dataset creation, curation and classification: Setswana and Sepedi. In: Proceedings of the first workshop on Resources for African Indigenous Languages, pp. 15–20. European Language Resources Association (ELRA), Marseille, France (2020)
Meyer, F., Buys, J.: Subword segmental language modelling for Nguni languages. In: Conference On Empirical Methods In Natural Language Processing (2022). https://doi.org/10.48550/arXiv.2210.06525
Motsoehli, M.: TswanaBERT (2020). https://huggingface.co/MoseliMotsoehli/TswanaBert
Nekoto, W., et al.: Participatory research for low-resourced machine translation: a case study in African languages. In: Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 2144–2160. Association for Computational Linguistics, Online (2020). https://doi.org/10.18653/v1/2020.findings-emnlp.195
Ogueji, K., Zhu, Y., Lin, J.: Small data? no problem! exploring the viability of pretrained multilingual language models for low-resourced languages. In: Proceedings of the 1st Workshop on Multilingual Representation Learning, pp. 116–126. Association for Computational Linguistics, Punta Cana, Dominican Republic (2021)
Palai, E.B., O’Hanlon, L.: Word and phoneme frequency of occurrence in conversational Setswana: a clinical linguistic application. South. Afr. Linguist. Appl. Lang. Stud. 22(3–4), 125–142 (2004)
Ragni, A., Knill, K.M., Rath, S.P., Gales, M.J.: Data augmentation for low resource languages. In: INTERSPEECH 2014: 15th Annual Conference of the International Speech Communication Association, pp. 810–814. International Speech Communication Association (ISCA) (2014)
Ranathunga, S., Lee, E.S.A., Prifti Skenduli, M., Shekhar, R., Alam, M., Kaur, R.: Neural machine translation for low-resource languages: a survey. arXiv e-prints, arXiv-2106 (2021)
Scao, T.L., et al.: What language model to train if you have one million GPU hours? In: Conference On Empirical Methods in Natural Language Processing (2022). https://doi.org/10.48550/arXiv.2210.15424
de Souza, L.R., Nogueira, R., Lotufo, R.: On the ability of monolingual models to learn language-agnostic representations. arXiv preprint arXiv: 2109.01942 (2021)
de Vries, W., Bartelds, M., Nissim, M., Wieling, M.: Adapting monolingual models: data can be scarce when language similarity is high. In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp. 4901–4907. Association for Computational Linguistics (2021). https://doi.org/10.18653/v1/2021.findings-acl.433
Xue, L., et al.: ByT5: towards a token-free future with pre-trained byte-to-byte models. arXiv preprint arXiv: 2105.13626 (2021)
Xue, L., et al.: mT5: a massively multilingual pre-trained text-to-text transformer. arXiv preprint arXiv: 2010.11934 (2020)
Acknowledgement
We want to acknowledge the feedback received from colleagues at DSFSI at the University of Pretoria and at Lelapa AI, which improved this paper. We would like to acknowledge funding from the ABSA Chair of Data Science, Google, and an NVIDIA Corporation hardware grant. We are thankful to the anonymous reviewers for their feedback.
Appendix: Data Statement for Daily News - Dikgang Categorised News Corpus
1.1 A. CURATION RATIONALE
The motivation for building this dataset was to provide one of the few annotated news categorisation datasets for Setswana. The task required identifying a high-quality Setswana news source, collecting the data, and then annotating it using the International Press Telecommunications Council (IPTC) News Categories (or codes)Footnote 16. The identified source was the Daily NewsFootnote 17 (Dikgang section) from the Botswana Government. All copyright for the news content belongs to Daily News. We collected 5000 Setswana news articles. The distribution of final categories for the dataset is shown in Fig. 1.
1.2 B. LANGUAGE VARIETY
The language of this data set is Setswana (primarily from Botswana).
1.3 C. SPEAKER DEMOGRAPHIC
Setswana is a Bantu language spoken in Botswana as well as in several regions of South Africa [32].
1.4 D. ANNOTATOR DEMOGRAPHIC
Two annotators labelled the news articles from the Daily News - Dikgang section. Their demographic information is shown in Table 7.
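The data statement does not report an agreement score, but with two annotators labelling the same items, agreement is commonly quantified with Cohen's kappa. The sketch below is illustrative background, not a computation from the paper; the example labels are hypothetical.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same items:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators labelled the same.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: expected overlap given each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(counts_a[c] * counts_b[c]
              for c in set(counts_a) | set(counts_b)) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```

Kappa of 1.0 indicates perfect agreement; values near 0 indicate agreement no better than chance given each annotator's label distribution.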
1.5 E. PROVENANCE APPENDIX
The original data is from the Daily News news service from the Botswana Government.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Marivate, V., Mots’Oehli, M., Wagner, V., Lastrucci, R., Dzingirai, I. (2023). PuoBERTa: Training and Evaluation of a Curated Language Model for Setswana. In: Pillay, A., Jembere, E., Gerber, A.J. (eds) Artificial Intelligence Research. SACAIR 2023. Communications in Computer and Information Science, vol 1976. Springer, Cham. https://doi.org/10.1007/978-3-031-49002-6_17
DOI: https://doi.org/10.1007/978-3-031-49002-6_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-49001-9
Online ISBN: 978-3-031-49002-6