
PuoBERTa: Training and Evaluation of a Curated Language Model for Setswana

  • Conference paper
  • In: Artificial Intelligence Research (SACAIR 2023)

Abstract

Natural language processing (NLP) has made significant progress for well-resourced languages such as English but has lagged behind for low-resource languages like Setswana. This paper addresses this gap by presenting PuoBERTa, a customised masked language model trained specifically for Setswana. We cover how we collected, curated, and prepared diverse monolingual texts to generate a high-quality corpus for PuoBERTa’s training. Building upon previous efforts to create monolingual resources for Setswana, we evaluated PuoBERTa across several NLP tasks, including part-of-speech (POS) tagging, named entity recognition (NER), and news categorisation. Additionally, we introduced a new Setswana news categorisation dataset and provided initial benchmarks using PuoBERTa. Our work demonstrates the efficacy of PuoBERTa in fostering NLP capabilities for understudied languages like Setswana and paves the way for future research directions.
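
The released checkpoint is published on Hugging Face (see footnote 12). The snippet below is a minimal sketch of how it could be queried for masked-token prediction with the transformers library; it assumes the checkpoint exposes a standard RoBERTa-style fill-mask head, and the Setswana input sentence is only a placeholder to be replaced with real text.

    # Minimal sketch: querying PuoBERTa's masked-language-modelling head.
    # Assumes the dsfsi/PuoBERTa checkpoint (footnote 12) works with the
    # standard fill-mask pipeline; the input sentence is a placeholder only.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model="dsfsi/PuoBERTa")

    # One <mask> token per input; replace with a real Setswana sentence.
    for prediction in fill_mask("Setswana ke <mask> ya Aforika Borwa."):
        print(f"{prediction['token_str']:>15}  score={prediction['score']:.3f}")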


Notes

  1. http://www.rma.nwu.ac.za/handle/20.500.12185/641.

  2. https://huggingface.co/MoseliMotsoehli/TswanaBert.

  3. https://zenodo.org/record/5674236.

  4. https://corpora.uni-leipzig.de/.

  5. http://www.rma.nwu.ac.za/handle/20.500.12185/641.

  6. https://www.justice.gov.za/constitution/SAConstitution-web-set.pdf.

  7. https://huggingface.co/datasets/dsfsi/vukuzenzele-monolingual.

  8. https://huggingface.co/datasets/dsfsi/gov-za-monolingual.

  9. https://v-sdlr-lnx1.nwu.ac.za/handle/20.500.12185/641?show=full.

  10. https://github.com/dsfsi/PuoData, https://huggingface.co/datasets/dsfsi/PuoData.

  11. https://github.com/dsfsi/PuoBERTa.

  12. https://huggingface.co/dsfsi/PuoBERTa.

  13. https://dailynews.gov.bw/.

  14. https://iptc.org/standards/newscodes/.

  15. https://github.com/dsfsi/PuoBERTa.

  16. https://iptc.org/standards/newscodes/.

  17. https://dailynews.gov.bw/.

References

  1. Adebara, I., Elmadany, A., Abdul-Mageed, M., Alcoba Inciarte, A.: SERENGETI: massively multilingual language models for Africa. In: Findings of the Association for Computational Linguistics: ACL 2023, pp. 1498–1537. Association for Computational Linguistics, Toronto, Canada (2023). https://doi.org/10.18653/v1/2023.findings-acl.97

  2. Adelani, D., et al.: A few thousand translations go a long way! Leveraging pre-trained models for African news translation. In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 3053–3070. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.naacl-main.223

  3. Adelani, D., et al.: MasakhaNER 2.0: Africa-centric transfer learning for named entity recognition. In: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 4488–4508 (2022)

  4. Agić, Ž., Vulić, I.: JW300: a wide-coverage parallel corpus for low-resource languages. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3204–3210. Association for Computational Linguistics, Florence, Italy (2019). https://doi.org/10.18653/v1/P19-1310

  5. Alabi, J.O., Adelani, D.I., Mosbach, M., Klakow, D.: Adapting pre-trained language models to African languages via multilingual adaptive fine-tuning. In: Proceedings of the 29th International Conference on Computational Linguistics, pp. 4336–4349. International Committee on Computational Linguistics, Gyeongju, Republic of Korea (2022)

  6. Armengol-Estapé, J., et al.: Are multilingual models the best choice for moderately under-resourced languages? A comprehensive assessment for Catalan. In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp. 4933–4946. Association for Computational Linguistics, Online (2021). https://doi.org/10.18653/v1/2021.findings-acl.437

  7. Aulamo, M., Sulubacak, U., Virpioja, S., Tiedemann, J.: OpusTools and parallel corpus diagnostics. In: Proceedings of the 12th Language Resources and Evaluation Conference, pp. 3782–3789. European Language Resources Association (2020)

  8. Baziotis, C., Zhang, B., Birch, A., Haddow, B.: When does monolingual data help multilingual translation: the role of domain and model scale. arXiv preprint arXiv:2305.14124 (2023)

  9. Burlot, F., Yvon, F.: Using monolingual data in neural machine translation: a systematic study. In: Proceedings of the Third Conference on Machine Translation: Research Papers, pp. 144–155. Association for Computational Linguistics, Brussels, Belgium (2018). https://doi.org/10.18653/v1/W18-6315

  10. Leipzig Corpora Collection: Tswana web text corpus (South Africa) based on material from 2019. https://corpora.uni-leipzig.de/en?corpusId=tsn_community_2017. Accessed 22 Aug 2023

  11. Dione, C.M.B., et al.: MasakhaPOS: part-of-speech tagging for typologically diverse African languages. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, vol. 1: Long Papers, pp. 10883–10900. Association for Computational Linguistics, Toronto, Canada (2023). https://doi.org/10.18653/v1/2023.acl-long.609

  12. Doddapaneni, S., et al.: Towards leaving no Indic language behind: building monolingual corpora, benchmark and models for Indic languages. In: Annual Meeting of the Association for Computational Linguistics (2022). https://doi.org/10.18653/v1/2023.acl-long.693

  13. Dossou, B.F., et al.: AfroLM: a self-active learning-based multilingual pretrained language model for 23 African languages. SustaiNLP 2022, 52 (2022)

  14. Eiselen, R.: NCHLT Setswana RoBERTa language model (2023). https://hdl.handle.net/20.500.12185/641

  15. Eiselen, R., Puttkammer, M.J.: Developing text resources for ten South African languages. In: LREC, pp. 3698–3703 (2014)

  16. Fan, A., et al.: Beyond English-centric multilingual machine translation (2020)

  17. Goldhahn, D., Eckart, T., Quasthoff, U.: Building large monolingual dictionaries at the Leipzig Corpora Collection: from 100 to 200 languages. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012), pp. 759–765. European Language Resources Association (ELRA), Istanbul, Turkey (2012)

  18. Gyasi, F., Schlippe, T.: Twi machine translation. Big Data Cogn. Comput. 7(2), 114 (2023). https://doi.org/10.3390/bdcc7020114

  19. Haddow, B., Bawden, R., Miceli Barone, A.V., Helcl, J., Birch, A.: Survey of low-resource machine translation. Comput. Linguist. 48(3), 673–732 (2022)

  20. Adelani, D.I., et al.: MasakhaNEWS: news topic classification for African languages. arXiv e-prints, arXiv-2304 (2023)

  21. Lastrucci, R., et al.: Preparing the Vuk’uzenzele and ZA-gov-multilingual South African multilingual corpora. In: Fourth Workshop on Resources for African Indigenous Languages (RAIL), p. 18 (2023)

  22. Limisiewicz, T., Malkin, D., Stanovsky, G.: You can have your data and balance it too: towards balanced and efficient multilingual models. In: Proceedings of the 5th Workshop on Research in Computational Linguistic Typology and Multilingual NLP, pp. 1–11. Association for Computational Linguistics, Dubrovnik, Croatia (2023)

  23. Litre, G., et al.: Participatory detection of language barriers towards multilingual sustainability(ies) in Africa. Sustainability 14(13), 8133 (2022). https://doi.org/10.3390/su14138133

  24. Liu, Y., et al.: Multilingual denoising pre-training for neural machine translation. arXiv preprint arXiv:2001.08210 (2020)

  25. Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9(11), 2579–2605 (2008)

  26. Marivate, V., Mots’Oehli, M., Wagner, V., Lastrucci, R., Dzingirai, I.: PuoBERTa + PuoBERTa Setswana language models (2023). https://doi.org/10.5281/zenodo.8434795

  27. Marivate, V., et al.: Investigating an approach for low resource language dataset creation, curation and classification: Setswana and Sepedi. In: Proceedings of the First Workshop on Resources for African Indigenous Languages, pp. 15–20. European Language Resources Association (ELRA), Marseille, France (2020)

  28. Meyer, F., Buys, J.: Subword segmental language modelling for Nguni languages. In: Conference on Empirical Methods in Natural Language Processing (2022). https://doi.org/10.48550/arXiv.2210.06525

  29. Motsoehli, M.: TswanaBert (2020). https://huggingface.co/MoseliMotsoehli/TswanaBert

  30. Nekoto, W., et al.: Participatory research for low-resourced machine translation: a case study in African languages. In: Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 2144–2160. Association for Computational Linguistics, Online (2020). https://doi.org/10.18653/v1/2020.findings-emnlp.195

  31. Ogueji, K., Zhu, Y., Lin, J.: Small data? No problem! Exploring the viability of pretrained multilingual language models for low-resourced languages. In: Proceedings of the 1st Workshop on Multilingual Representation Learning, pp. 116–126. Association for Computational Linguistics, Punta Cana, Dominican Republic (2021)

  32. Palai, E.B., O’Hanlon, L.: Word and phoneme frequency of occurrence in conversational Setswana: a clinical linguistic application. South. Afr. Linguist. Appl. Lang. Stud. 22(3–4), 125–142 (2004)

  33. Ragni, A., Knill, K.M., Rath, S.P., Gales, M.J.: Data augmentation for low resource languages. In: INTERSPEECH 2014: 15th Annual Conference of the International Speech Communication Association, pp. 810–814. International Speech Communication Association (ISCA) (2014)

  34. Ranathunga, S., Lee, E.S.A., Prifti Skenduli, M., Shekhar, R., Alam, M., Kaur, R.: Neural machine translation for low-resource languages: a survey. arXiv e-prints, arXiv-2106 (2021)

  35. Scao, T.L., et al.: What language model to train if you have one million GPU hours? In: Conference on Empirical Methods in Natural Language Processing (2022). https://doi.org/10.48550/arXiv.2210.15424

  36. de Souza, L.R., Nogueira, R., Lotufo, R.: On the ability of monolingual models to learn language-agnostic representations. arXiv preprint arXiv:2109.01942 (2021)

  37. de Vries, W., Bartelds, M., Nissim, M., Wieling, M.: Adapting monolingual models: data can be scarce when language similarity is high. In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp. 4901–4907. Association for Computational Linguistics (2021). https://doi.org/10.18653/v1/2021.findings-acl.433

  38. Xue, L., et al.: ByT5: towards a token-free future with pre-trained byte-to-byte models. arXiv preprint arXiv:2105.13626 (2021)

  39. Xue, L., et al.: mT5: a massively multilingual pre-trained text-to-text transformer. arXiv preprint arXiv:2010.11934 (2020)


Acknowledgement

We acknowledge the feedback from colleagues at DSFSI at the University of Pretoria and at Lelapa AI that improved this paper. We acknowledge funding from the ABSA Chair of Data Science, Google, and the NVIDIA Corporation hardware grant. We are thankful to the anonymous reviewers for their feedback.

Author information

Corresponding author

Correspondence to Vukosi Marivate.

Appendix: Data Statement for Daily News - Dikgang Categorised News Corpus


1.1 A. CURATION RATIONALE

The motivation for building this dataset was to provide one of the few annotated news categorisation datasets for Setswana. The task required identifying a high-quality Setswana news source, collecting the data, and then annotating the articles using the International Press Telecommunications Council (IPTC) News Categories, or codes (see footnote 16). The identified source was the Daily News (see footnote 17), Dikgang section, from the Botswana Government. All copyright for the news content belongs to the Daily News. We collected 5000 Setswana news articles. The distribution of final categories for the dataset is shown in Fig. 1.
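
To illustrate how PuoBERTa can be applied to this categorisation task, the sketch below fine-tunes the released checkpoint as a sequence classifier with the Hugging Face transformers and datasets libraries. It is a minimal illustration, not the paper's exact training setup: the CSV file names, the "text"/"label" column names, and the hyperparameters are assumptions, and the label set must match whatever IPTC-derived categories the released dataset uses.

    # Minimal sketch: fine-tuning PuoBERTa for Setswana news categorisation.
    # Not the paper's exact setup; file names, column names and hyperparameters
    # below are illustrative assumptions only.
    from datasets import load_dataset
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    # Hypothetical local CSV export of the Daily News - Dikgang corpus.
    data = load_dataset("csv", data_files={"train": "dikgang_train.csv",
                                           "validation": "dikgang_dev.csv"})
    labels = sorted(set(data["train"]["label"]))
    label2id = {name: i for i, name in enumerate(labels)}

    tokenizer = AutoTokenizer.from_pretrained("dsfsi/PuoBERTa")

    def encode(batch):
        # Tokenise the article text and map string categories to integer ids.
        enc = tokenizer(batch["text"], truncation=True, max_length=512)
        enc["labels"] = [label2id[l] for l in batch["label"]]
        return enc

    data = data.map(encode, batched=True)

    model = AutoModelForSequenceClassification.from_pretrained(
        "dsfsi/PuoBERTa", num_labels=len(labels))

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="puoberta-dikgang",
                               num_train_epochs=3,
                               per_device_train_batch_size=16),
        train_dataset=data["train"],
        eval_dataset=data["validation"],
        tokenizer=tokenizer,
    )
    trainer.train()
    print(trainer.evaluate())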

1.2 B. LANGUAGE VARIETY

The language of this data set is Setswana (primarily from Botswana).

1.3 C. SPEAKER DEMOGRAPHIC

Setswana is a Bantu language spoken in Botswana as well as in several regions of South Africa [32].

1.4 D. ANNOTATOR DEMOGRAPHIC

Two annotators labelled the news articles from the Daily News - Dikgang section. Their demographic information is shown in Table 7.

Table 7. Annotator demographic

1.5 E. PROVENANCE APPENDIX

The original data comes from the Daily News service of the Botswana Government.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Marivate, V., Mots’Oehli, M., Wagner, V., Lastrucci, R., Dzingirai, I. (2023). PuoBERTa: Training and Evaluation of a Curated Language Model for Setswana. In: Pillay, A., Jembere, E., Gerber, A.J. (eds) Artificial Intelligence Research. SACAIR 2023. Communications in Computer and Information Science, vol 1976. Springer, Cham. https://doi.org/10.1007/978-3-031-49002-6_17


  • DOI: https://doi.org/10.1007/978-3-031-49002-6_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-49001-9

  • Online ISBN: 978-3-031-49002-6

  • eBook Packages: Computer Science, Computer Science (R0)
