Abstract
Pre-trained models have achieved strong performance since the introduction of Transformer-based architectures such as BERT (Bidirectional Encoder Representations from Transformers). Nevertheless, most of these models have been trained on well-represented languages (English, French, German, etc.), and few target under-represented languages and dialects.
This work presents a feasibility study of pre-training Transformer-based language models on the Tunisian dialect, an under-represented language. The resulting Tunisian language model is evaluated on three tasks: dialect identification, sentiment analysis, and reading-comprehension question answering. Results demonstrate that, rather than datasets from traditional sources (Wikipedia, articles, etc.), noisy web-crawled data is better suited to an under-represented language such as the Tunisian dialect. Additionally, experiments show that a reasonably small-scale dataset leads to results similar to or better than those obtained with a large-scale dataset, and that the model's performance matches or improves on the state of the art in all three downstream tasks. The pre-trained model, named TunBERT, and the datasets used for fine-tuning are publicly released.
Notes
- The word “Question” is the English translation.
© 2022 Springer Nature Switzerland AG
Cite this paper
Messaoudi, A. et al. (2022). TunBERT: Pretrained Contextualized Text Representation for Tunisian Dialect. In: Bennour, A., Ensari, T., Kessentini, Y., Eom, S. (eds) Intelligent Systems and Pattern Recognition. ISPR 2022. Communications in Computer and Information Science, vol 1589. Springer, Cham. https://doi.org/10.1007/978-3-031-08277-1_23
DOI: https://doi.org/10.1007/978-3-031-08277-1_23
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-08276-4
Online ISBN: 978-3-031-08277-1