Algeria’s socio-linguistic situation is complex, shaped by several historical, cultural and technological factors. Three languages are mainly spoken in Algeria (Arabic, Tamazight and French), and they can be mixed within the same sentence (code-switching). Moreover, there are several dialect varieties that differ from one region to another and sometimes within the same region. This paper presents a new multi-purpose parallel corpus, the DZDC12 corpus, which will serve as a testbed for various natural language processing and information retrieval applications. In particular, it can be a useful resource for studying the Arabic–French code-switching phenomenon, Algerian Romanized Arabic (Arabizi), the different Algerian sub-dialects, sentiment analysis, gender writing style, machine translation, abuse detection, etc. To the best of our knowledge, the proposed corpus is the first of its kind in which the texts are written in Latin script and crawled from Facebook. More specifically, the corpus is organised by gender, region and city, and is transliterated into Arabic script and translated into Modern Standard Arabic. In addition, it is annotated for emotion detection and abuse detection, and is also annotated at the word level. This article focuses in particular on Algeria’s socio-linguistic situation and the effect of social media networks. Furthermore, it describes the general guidelines for the design of the DZDC12 corpus as well as the clustering of the dialects over the map.
Cite this article
Abainia, K. DZDC12: a new multipurpose parallel Algerian Arabizi–French code-switched corpus. Lang Resources & Evaluation 54, 419–455 (2020). https://doi.org/10.1007/s10579-019-09454-8
Keywords
- Parallel corpus
- Algerian dialects
- Text categorization
- Machine translation