
TunBERT: Pretraining BERT for Tunisian Dialect Understanding

  • Original Research
  • Published in: SN Computer Science

Abstract

Language models have proved to achieve high performance and outperform state-of-the-art results in Natural Language Processing. More specifically, Bidirectional Encoder Representations from Transformers (BERT) has become the state-of-the-art model for such tasks. Most available language models have been trained on Indo-European languages, and these models are known to require huge training datasets. However, only a few studies have focused on under-represented languages and dialects. In this work, we describe the pretraining of a customized model based on Google's TensorFlow implementation of BERT (named TunBERT-T) and of a PyTorch BERT language model based on NVIDIA's implementation (named TunBERT-P) for the Tunisian dialect. We describe the process of creating the training dataset: collecting a Common-Crawl-based corpus, then filtering and pre-processing the data. We describe the training setup and detail the fine-tuning of the TunBERT-T and TunBERT-P models on three NLP downstream tasks. We challenge the assumption that a large amount of training data is needed, and explore the effectiveness of training a monolingual Transformer-based language model for low-resourced languages, taking the Tunisian dialect as a use case. Our results indicate that a comparatively small Common-Crawl-based dataset (500K sentences, 67.2 MB) leads to performance comparable to that obtained using costly larger datasets (from 24 GB to 128 GB of text). We demonstrate that, with the use of newly created datasets, our proposed TunBERT-P model achieves comparable or better performance on three downstream tasks: Sentiment Analysis, Language Identification and Reading Comprehension Question-Answering. We release both pretrained models along with all the datasets used for fine-tuning.
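To make the downstream setup concrete, the following is a minimal sketch of fine-tuning a TunBERT checkpoint for sentiment classification using the Hugging Face Transformers API. This is not the authors' exact pipeline (which used the TensorFlow and NVIDIA NeMo BERT implementations noted below); the checkpoint path and the toy labeled batch are placeholders, and the released models are linked from the repository cited in the Notes.

    # Hedged fine-tuning sketch for Tunisian-dialect sentiment analysis.
    # Assumptions: a local TunBERT checkpoint already converted to the Hugging Face
    # format (the path below is a placeholder, not an official model ID).
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    checkpoint = "path/to/tunbert"  # placeholder; see https://github.com/iCompass-ai/TunBERT
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

    texts = ["example positive sentence", "example negative sentence"]  # illustrative only
    labels = torch.tensor([1, 0])                                       # 1 = positive, 0 = negative

    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

    model.train()
    outputs = model(**batch, labels=labels)  # returns cross-entropy loss and logits
    outputs.loss.backward()                  # one illustrative optimization step
    optimizer.step()

In practice, the same classification head setup can be reused for the Language Identification task, while the Question-Answering task uses a span-prediction head instead.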


Notes

  1. https://github.com/iCompass-ai/TunBERT.

  2. https://github.com/NVIDIA/NeMo.

  3. https://github.com/google-research/bert.

  4. The word "Symptoms" is the English translation.
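The abstract describes building the pretraining corpus by collecting Common-Crawl-based text and then filtering and pre-processing it. The sketch below illustrates that kind of cleaning step under assumed rules (length bounds, a script check for Arabic or Latin "Arabizi" characters, exact de-duplication); these thresholds are assumptions for illustration, not the authors' published filtering criteria.

    # Hypothetical corpus-filtering sketch: keep reasonably sized sentences that
    # contain Arabic-script or Latin letters, and drop exact duplicates.
    import re

    def keep_sentence(s, min_tokens=3, max_tokens=128):
        n = len(s.split())
        if n < min_tokens or n > max_tokens:
            return False
        # Require some alphabetic content (Arabic Unicode block or Latin letters).
        return re.search(r"[\u0600-\u06FFA-Za-z]", s) is not None

    def build_corpus(raw_lines):
        seen, kept = set(), []
        for line in raw_lines:
            s = " ".join(line.split())  # normalize whitespace
            if s and s not in seen and keep_sentence(s):
                seen.add(s)
                kept.append(s)
        return kept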


Author information


Corresponding author

Correspondence to Hatem Haddad.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Ethics approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


This article is part of the topical collection “Recent Trends on Machine Learning & Intelligent Systems” guest edited by Akram Bennour, Tolga Ensari and Abdel-Badeeh Salem.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Haddad, H., Rouhou, A.C., Messaoudi, A. et al. TunBERT: Pretraining BERT for Tunisian Dialect Understanding. SN COMPUT. SCI. 4, 194 (2023). https://doi.org/10.1007/s42979-022-01541-y

