Abstract
Pre-trained models have achieved strong performance since the introduction of Transformer-based architectures such as BERT (Bidirectional Encoder Representations from Transformers). Nevertheless, most of these models have been trained on well-represented languages (English, French, German, etc.), and few target under-represented languages and dialects.
This work presents a feasibility study of pre-training Transformer-based language models on the Tunisian dialect, an under-represented language. The resulting Tunisian language model is evaluated on three tasks: dialect identification, sentiment analysis, and reading-comprehension question answering. Results demonstrate that, rather than datasets from traditional sources (Wikipedia, articles, etc.), noisy web-crawled data is better suited to an under-represented language such as the Tunisian dialect. Additionally, experiments show that a reasonably small-scale dataset leads to results similar to or better than those obtained with a large-scale dataset, and that the model's performance matches or improves on the state of the art in all three downstream tasks. The pre-trained model, named TunBERT, and the datasets used for fine-tuning are publicly released.
Notes
- The word “Question” is the English translation.
© 2022 Springer Nature Switzerland AG
Cite this paper
Messaoudi, A. et al. (2022). TunBERT: Pretrained Contextualized Text Representation for Tunisian Dialect. In: Bennour, A., Ensari, T., Kessentini, Y., Eom, S. (eds) Intelligent Systems and Pattern Recognition. ISPR 2022. Communications in Computer and Information Science, vol 1589. Springer, Cham. https://doi.org/10.1007/978-3-031-08277-1_23
DOI: https://doi.org/10.1007/978-3-031-08277-1_23
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-08276-4
Online ISBN: 978-3-031-08277-1