
TunBERT: Pretrained Contextualized Text Representation for Tunisian Dialect

  • Conference paper
Intelligent Systems and Pattern Recognition (ISPR 2022)

Abstract

Pre-trained models have achieved high performance since the introduction of Transformer architectures such as the Bidirectional Encoder Representations from Transformers (BERT). Nevertheless, most of these models have been trained on well-represented languages (English, French, German, etc.), and few target under-represented languages and dialects.

This work presents a feasibility study of pre-training Transformer-based language models on the Tunisian dialect, an under-represented language. The resulting Tunisian language model is evaluated on three tasks: dialect identification, sentiment analysis, and reading-comprehension question answering. Results demonstrate that, rather than datasets from traditional sources (Wikipedia, articles, etc.), noisy web-crawled data is better suited to an under-represented language such as the Tunisian dialect. Experiments further show that a reasonably small-scale dataset leads to results similar to or better than those obtained with a large-scale dataset, and that TunBERT matches or improves on the state of the art in all three downstream tasks. The pre-trained model, named TunBERT, and the datasets used for fine-tuning are publicly released.
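Although the abstract gives no code, the released checkpoint can in principle be fine-tuned on any of the three downstream tasks with standard tooling. The sketch below illustrates this for the sentiment-analysis task using Hugging Face Transformers; the checkpoint path, CSV files, and hyperparameters are illustrative assumptions, not the authors' exact setup (their own pre-training relies on NVIDIA's BERT implementation and NeMo, see the notes below).

```python
# Minimal sketch: fine-tuning a TunBERT-style checkpoint for Tunisian
# sentiment analysis. Paths and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Placeholder checkpoint: a local directory or hub ID holding the released
# TunBERT weights (not a confirmed identifier).
checkpoint = "path/to/tunbert"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Assumed data layout: CSV files with "text" and "label" columns,
# e.g. a positive/negative sentiment dataset such as TUNIZI.
data = load_dataset("csv", data_files={"train": "train.csv", "validation": "dev.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

data = data.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="tunbert-sentiment",
    per_device_train_batch_size=16,
    num_train_epochs=3,
    evaluation_strategy="epoch",
)

Trainer(
    model=model,
    args=args,
    train_dataset=data["train"],
    eval_dataset=data["validation"],
).train()
```

The same pattern applies to dialect identification (a different label set) and to question answering (swapping in a span-prediction head).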


Notes

  1. https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT.

  2. https://github.com/iCompass-ai/TunBERT.

  3. https://github.com/NVIDIA/NeMo.

  4. The word “Question” is the English translation.


Author information


Correspondence to Abir Messaoudi, Ahmed Cheikhrouhou, Hatem Haddad, Nourchene Ferchichi, Moez BenHajhmida, Abir Korched, Malek Naski, Faten Ghriss or Amine Kerkeni.



Copyright information

© 2022 Springer Nature Switzerland AG

About this paper


Cite this paper

Messaoudi, A. et al. (2022). TunBERT: Pretrained Contextualized Text Representation for Tunisian Dialect. In: Bennour, A., Ensari, T., Kessentini, Y., Eom, S. (eds) Intelligent Systems and Pattern Recognition. ISPR 2022. Communications in Computer and Information Science, vol 1589. Springer, Cham. https://doi.org/10.1007/978-3-031-08277-1_23


  • DOI: https://doi.org/10.1007/978-3-031-08277-1_23

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-08276-4

  • Online ISBN: 978-3-031-08277-1

  • eBook Packages: Computer Science (R0)
