DarijaBERT: a step forward in NLP for the written Moroccan dialect

  • Regular Paper
International Journal of Data Science and Analytics

Abstract

Transformer-based language models deliver state-of-the-art results on numerous downstream tasks. However, most of these models are either confined to high-resource languages or designed with a multilingual focus. Models dedicated to Arabic dialects are scarce, and those that exist primarily cater to dialects written in Arabic script. This study presents the first BERT models for the Moroccan Arabic dialect, also known as Darija: DarijaBERT, DarijaBERT-arabizi, and DarijaBERT-mix. These models are trained on the largest Arabic monodialectal corpus, supporting both Arabic and Latin character representations of the Moroccan dialect. Their performance is evaluated against existing multidialectal and multilingual models on four distinct downstream tasks, where they achieve state-of-the-art results. The data collection methodology and pre-training process are described, and the Moroccan Topic Classification Dataset (MTCD) is introduced as the first dataset for topic classification in the Moroccan Arabic dialect. The pre-trained models and the MTCD dataset are available to the scientific community.

Notes

  1. https://9isas.modareb.info/.

  2. No copyright claims have been made by the anonymous writers or organizers, and we explicitly state that we do not assert any copyrights to the text. The collected data is exclusively utilized for training models and is not disseminated or shared in any manner.

  3. https://socialblade.com/.

  4. https://hypeauditor.com/.

  5. كاتشوف, كيضحك, زوينة, كتبكي, مزيان, داكشي, كيشوف, كتشوف, واخا, كيزيدو, دابا, ديال, الجلاخة, تبوگيصة, مكلخ, حشومة, منبقاوش, شلاهبية, تخربيق, كايدوي, برهوش, كاندوي, يسيفطوه, يصيفطوه, السماسرية, مكاينش, مزيانين, الفقصة, زوينين, سيمانة, الدراري. English translation: you see, he’s laughing, beautiful, she’s crying/you’re crying, well, that, he’s watching, she’s watching/you’re watching, okay, they’re adding, now, of, disgusting person, beauty, stupid, shame, we don’t stay anymore/we won’t continue to, thugs, gibberish, he speaks, little kid, I speak, they send him, they send him, the commercial intermediaries, there is no, good, frustration, beautiful, a week, the children/the boys.

  6. Some Algerian/Moroccan words: كارطون, ماقراتش, ندير.

  7. Some short keywords: زم، فك، رشّ، هزّ

  8. https://sites.research.google/trc/about/.

  9. https://huggingface.co/SI2M-Lab/DarijaBERT.

  10. https://huggingface.co/SI2M-Lab/DarijaBERT-arabizi.

  11. https://huggingface.co/SI2M-Lab/DarijaBERT-mix.

  12. https://github.com/AIOXLABS/DBert.
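
The checkpoints listed in notes 9 to 11 can be loaded with the Hugging Face `transformers` library. The following is a minimal sketch, not the authors' code. It assumes that DarijaBERT targets Arabic-script text, DarijaBERT-arabizi targets Latin-script text, and DarijaBERT-mix targets text that mixes both; this mapping is inferred from the model names and the abstract. The helpers `detect_script` and `load_fill_mask` are hypothetical names introduced for illustration, and calling `load_fill_mask` requires `transformers` to be installed and network access to download the weights.

```python
import unicodedata

# Hub identifiers taken from notes 9 to 11. The script-to-model mapping
# is an assumption made for this sketch.
MODEL_IDS = {
    "arabic": "SI2M-Lab/DarijaBERT",
    "latin": "SI2M-Lab/DarijaBERT-arabizi",
    "mixed": "SI2M-Lab/DarijaBERT-mix",
}


def detect_script(text: str) -> str:
    """Return 'arabic', 'latin', or 'mixed' by counting letters per script.

    Digits, punctuation, and combining marks are ignored, so Arabizi
    tokens that use digits as letters still count as Latin-script text.
    Text with no Arabic or Latin letters falls back to 'mixed'.
    """
    arabic = latin = 0
    for ch in text:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        if name.startswith("ARABIC"):
            arabic += 1
        elif name.startswith("LATIN"):
            latin += 1
    if arabic and not latin:
        return "arabic"
    if latin and not arabic:
        return "latin"
    return "mixed"


def load_fill_mask(text: str):
    """Build a fill-mask pipeline for the checkpoint matching the text's script.

    The import is deferred so that the script detection above can be used
    without installing transformers.
    """
    from transformers import pipeline

    return pipeline("fill-mask", model=MODEL_IDS[detect_script(text)])
```

Routing by script keeps each input on a checkpoint whose vocabulary was built for that writing system; when the script cannot be determined, the sketch falls back to the mixed checkpoint.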

Acknowledgements

We thank the Google TRC program for giving us access to its Cloud TPUs.

Author information

Contributions

KG, AMN, AA and Imade Benelallam (IB) contributed equally to this paper. KG, AMN and AA participated in the data collection and cleaning. KG was responsible for the development and implementation of the models on GCP. All authors discussed and reviewed the results. KG was responsible for writing the paper. IB supervised the whole work.

Corresponding author

Correspondence to Kamel Gaanoun.

Ethics declarations

Conflicts of interest

The authors state that they have no conflicts of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Gaanoun, K., Naira, A.M., Allak, A. et al. DarijaBERT: a step forward in NLP for the written Moroccan dialect. Int J Data Sci Anal (2024). https://doi.org/10.1007/s41060-023-00498-2

  • DOI: https://doi.org/10.1007/s41060-023-00498-2
