DarijaBERT: a step forward in NLP for the written Moroccan dialect

Gaanoun, Kamel; Naira, Abdou Mohamed; Allak, Anass; Benelallam, Imade

doi:10.1007/s41060-023-00498-2

Regular Paper
Published: 23 January 2024

International Journal of Data Science and Analytics

Kamel Gaanoun^1, Abdou Mohamed Naira^1,2, Anass Allak^1,2 & Imade Benelallam^1,2

Abstract

The established performance of existing transformer-based language models, delivering state-of-the-art results on numerous downstream tasks, is noteworthy. However, these models often face limitations, being either confined to high-resource languages or designed with a multilingual focus. The availability of models dedicated to Arabic dialects is scarce, and even those that do exist primarily cater to dialects written in Arabic script. This study presents the first BERT models for Moroccan Arabic dialect, also known as Darija, called DarijaBERT, DarijaBERT-arabizi, and DarijaBERT-mix. These models are trained on the largest Arabic monodialectal corpus, supporting both Arabic and Latin character representations of the Moroccan dialect. Their performance is thoroughly evaluated and compared to existing multidialectal and multilingual models across four distinct downstream tasks, showcasing state-of-the-art results. The data collection methodology and pre-training process are described, and the Moroccan Topic Classification Dataset (MTCD) is introduced as the first dataset for topic classification in the Moroccan Arabic dialect. The pre-trained models and MTCD dataset are available to the scientific community.
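As an illustration of how the three released checkpoints might be used side by side, the sketch below picks a model by the script of the input text. It assumes, as the model names suggest, that DarijaBERT targets Arabic-script Darija, DarijaBERT-arabizi targets Latin-script Darija, and DarijaBERT-mix covers both. The hub identifiers are those listed in the Notes; the helper names `script_counts` and `pick_model` and the 0.9 dominance threshold are hypothetical choices made for this example and are not part of the paper.

```python
import unicodedata

# Hub identifiers as listed in the article's Notes section.
MODEL_IDS = {
    "arabic": "SI2M-Lab/DarijaBERT",
    "latin": "SI2M-Lab/DarijaBERT-arabizi",
    "mixed": "SI2M-Lab/DarijaBERT-mix",
}


def script_counts(text):
    """Count Arabic-script and Latin-script letters in `text`."""
    counts = {"arabic": 0, "latin": 0}
    for ch in text:
        if not ch.isalpha():
            # Skip digits, spaces, and punctuation (Arabizi often uses digits).
            continue
        name = unicodedata.name(ch, "")
        if name.startswith("ARABIC"):
            counts["arabic"] += 1
        elif name.startswith("LATIN"):
            counts["latin"] += 1
    return counts


def pick_model(text, dominance=0.9):
    """Return a hub identifier based on the dominant script (hypothetical rule)."""
    counts = script_counts(text)
    total = counts["arabic"] + counts["latin"]
    if total == 0:
        # No Arabic or Latin letters found: fall back to the mixed model.
        return MODEL_IDS["mixed"]
    if counts["arabic"] / total >= dominance:
        return MODEL_IDS["arabic"]
    if counts["latin"] / total >= dominance:
        return MODEL_IDS["latin"]
    return MODEL_IDS["mixed"]
```

The returned identifier could then be passed to a loader such as `AutoTokenizer.from_pretrained` in the Hugging Face `transformers` library, assuming the checkpoints remain published under these names.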

Notes

https://9isas.modareb.info/.
No copyright claims have been made by the anonymous writers or organizers, and we explicitly state that we do not assert any copyrights to the text. The collected data is exclusively utilized for training models and is not disseminated or shared in any manner.
https://socialblade.com/.
https://hypeauditor.com/.
كاتشوف, كيضحك, زوينة, كتبكي, مزيان, داكشي, كيشوف, كتشوف, واخا, كيزيدو, دابا, ديال, الجلاخة, تبوگيصة, مكلخ, حشومة, منبقاوش, شلاهبية, تخربيق, كايدوي, برهوش, كاندوي, يسيفطوه, يصيفطوه, السماسرية, مكاينش, مزيانين, الفقصة, زوينين, سيمانة, الدراري. English translation: you see, he’s laughing, beautiful, she’s crying/you’re crying, well, that, he’s watching, she’s watching/you’re watching, okay, they’re adding, now, of, disgusting person, beauty, stupid, shame, we don’t stay anymore/we won’t continue to, thugs, gibberish, he speaks, little kid, I speak, they send him, they send him, the commercial intermediaries, there is no, good, frustration, beautiful, a week, the children/the boys.
Some Algerian/Moroccan words: كارطون, ماقراتش, ندير.
Some short keywords: زم، فك، رشّ، هزّ
https://sites.research.google/trc/about/.
https://huggingface.co/SI2M-Lab/DarijaBERT.
https://huggingface.co/SI2M-Lab/DarijaBERT-arabizi.
https://huggingface.co/SI2M-Lab/DarijaBERT-mix.
https://github.com/AIOXLABS/DBert.

Acknowledgements

We thank the Google TRC program for giving us access to their Cloud TPUs.

Author information

Authors and Affiliations

SI2M Lab, INSEA, Rabat-Instituts, Rabat, Morocco
Kamel Gaanoun, Abdou Mohamed Naira, Anass Allak & Imade Benelallam
AIOX LABS, rue Honain, Rabat, Morocco
Abdou Mohamed Naira, Anass Allak & Imade Benelallam

Contributions

KG, AMN, AA and IB contributed equally to this paper. KG, AMN and AA participated in the data collection and cleaning. KG was responsible for the development and implementation of the models on GCP. All authors discussed and reviewed the results. KG was responsible for writing the paper. IB supervised the whole work.

Corresponding author

Correspondence to Kamel Gaanoun.

Ethics declarations

Conflicts of interest

The authors state that they have no conflicts of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Gaanoun, K., Naira, A.M., Allak, A. et al. DarijaBERT: a step forward in NLP for the written Moroccan dialect. Int J Data Sci Anal (2024). https://doi.org/10.1007/s41060-023-00498-2

Received: 07 February 2023
Accepted: 12 December 2023
Published: 23 January 2024
DOI: https://doi.org/10.1007/s41060-023-00498-2