Abstract
Transformer-based language models deliver state-of-the-art results on numerous downstream tasks. However, these models are often either confined to high-resource languages or designed with a multilingual focus. Models dedicated to Arabic dialects are scarce, and those that exist primarily cater to dialects written in Arabic script. This study presents the first BERT models for the Moroccan Arabic dialect, also known as Darija: DarijaBERT, DarijaBERT-arabizi, and DarijaBERT-mix. These models are trained on the largest Arabic monodialectal corpus and support both the Arabic and Latin character representations of the Moroccan dialect. They are evaluated against existing multidialectal and multilingual models on four distinct downstream tasks, where they achieve state-of-the-art results. The data collection methodology and pre-training process are described, and the Moroccan Topic Classification Dataset (MTCD) is introduced as the first dataset for topic classification in the Moroccan Arabic dialect. The pre-trained models and MTCD are available to the scientific community.
Notes
No copyright claims have been made by the anonymous writers or organizers, and we explicitly state that we do not assert any copyrights to the text. The collected data is exclusively utilized for training models and is not disseminated or shared in any manner.
كاتشوف, كيضحك, زوينة, كتبكي, مزيان, داكشي, كيشوف, كتشوف, واخا, كيزيدو, دابا, ديال, الجلاخة, تبوگيصة, مكلخ, حشومة, منبقاوش, شلاهبية, تخربيق, كايدوي, برهوش, كاندوي, يسيفطوه, يصيفطوه, السماسرية, مكاينش, مزيانين, الفقصة, زوينين, سيمانة, الدراري. English translation: you see, he’s laughing, beautiful, she’s crying/you’re crying, well, that, he’s watching, she’s watching/you’re watching, okay, they’re adding, now, of, disgusting person, beauty, stupid, shame, we don’t stay anymore/we won’t continue to, thugs, gibberish, he speaks, little kid, I speak, they send him, they send him, the commercial intermediaries, there is no, good, frustration, beautiful, a week, the children/the boys.
Some Algerian/Moroccan words: كارطون, ماقراتش, ندير.
Some short keywords: زم، فك، رشّ، هزّ
Acknowledgements
We thank the Google TRC program for giving us access to their Cloud TPUs.
Author information
Contributions
KG, AMN, AA and Imade Benelallam (IB) contributed equally to this paper. KG, AMN and AA participated in the data collection and cleaning. KG was responsible for the development and implementation of the models on GCP. All authors discussed and reviewed the results. KG was responsible for writing the paper. IB supervised the whole work.
Ethics declarations
Conflicts of interest
The authors state that they have no conflicts of interest.
About this article
Cite this article
Gaanoun, K., Naira, A.M., Allak, A. et al. DarijaBERT: a step forward in NLP for the written Moroccan dialect. Int J Data Sci Anal (2024). https://doi.org/10.1007/s41060-023-00498-2
DOI: https://doi.org/10.1007/s41060-023-00498-2