Towards Gulf Emirati Dialect Corpus from Social Media

AlAzzam, Bayan A.; Alkhatib, Manar; Shaalan, Khaled

doi:10.1007/978-3-031-56121-4_27

Part of the book series: Lecture Notes in Civil Engineering (LNCE, volume 473)

Abstract

Purpose: This paper discusses the need for a corpus of Emirati traditional phrases and idioms in natural language processing (NLP) for the Gulf Emirati dialect and its potential applications in fields like voice recognition, machine translation, and sentiment analysis.

Methodology: The researchers collected a corpus of more than 3000 traditional Emirati words and idioms by gathering data from several social media platforms, such as forums, YouTube, and Emirati radio stations. In addition, the researchers used web scraping techniques to collect suitable resources, subsequently cleaning and organising the gathered material to ensure accuracy and consistency. A pilot investigation involving a native speaker of Emirati Arabic was undertaken to verify the accuracy of the dataset.

Findings: The researchers successfully compiled a substantial dataset of traditional Emirati phrases and idioms, thereby enabling future investigations into Arabic dialects, specifically Gulf Arabic dialects such as the Emirati dialect.

Implications: The compilation of Emirati traditional idioms and words presented in this study has potential practical applications in several domains, such as medicine, education, and business. These implications mostly revolve around enhancing communication among and with speakers of the Emirati dialect.

Originality/Value: This study distinguishes itself by concentrating on the compilation of an NLP corpus comprising traditional Emirati phrases and idioms, with a specific emphasis on the Gulf Emirati dialect. The dataset generated as a result of this effort may prove indispensable for further studies into Arabic dialects.

1 Introduction

Natural language processing (NLP) is a branch of computer science that allows humans to communicate with computers in natural language, even though machines natively process only programming or machine languages. Arabic dialects have their own vocabulary and syntax, whereas Modern Standard Arabic (MSA), which is not spoken natively, is used in formal situations. NLP resources are needed for speech recognition, machine translation, and sentiment analysis. Corpora, which are collections of text or speech data, are used to train and evaluate language models and help researchers understand a dialect’s linguistic characteristics. NLP has gained importance because it improves machine interpretation of language and serves a wide range of applications, including discourse analysis, multilingual data frameworks, sentiment analysis, chatbots, and voice recognition (Alkhatib and Shaalan, 2018). As NLP technology improves, new applications will emerge. NLP could significantly affect daily life by enabling computers and people to communicate in natural languages such as English, Arabic, and Chinese.

NLP improves usability and productivity in several applications (Alkhatib and Shaalan, 2018).

Language corpora comprise texts in one or more languages (El-khair and Ibrahim, 2003). A corpus is one of the most important data sources in information retrieval (IR), NLP, and computational linguistics (CL). It can represent written language as it is actually used and supports tasks such as opinion processing and related applications. Arabic NLP practitioners suffer from underfunded and under-researched corpora compared with English (Al-Thubaity, 2014). Arabic has three main varieties: Classical Arabic (CA), Modern Standard Arabic (MSA), and Dialectal Arabic (DA). DA is geographically specific, while MSA is used in news. Arabic morphology is complex and highly inflected, with many subcategories (Alruily, 2020).

Thus, many researchers are now creating large Arabic corpora. Since most Arabic NLP developments have focused on CA and MSA, they perform poorly on DA data (Habash et al., 2012). DA is morphologically rich: in colloquial usage, a DA lemma may have hundreds of surface forms (Al-Twairesh et al., 2017). DA lacks standard orthographies and differs from MSA in syntax, morphology, and pronunciation (Habash, 2010). As noted in the literature, a single NLP solution for all Arabic dialects is almost impossible (Farghaly, n.d.). Although MSA is the most extensively used written variety, DA is now written more often than before (Al-Twairesh et al., 2018; Shaalan et al., 2018). Therefore, a corpus that accurately represents contemporary usage of the Arabic language is of significant importance (Sawalha et al., 2019).

1.1 Arabic Language

Arabic is one of the five most widely spoken languages in the world; it is spoken by about 422 million people and used by over 1.5 billion Muslims. It is an example of diglossia, with Modern Standard Arabic (MSA) serving as the written language of government. Arabic varies across countries, from Morocco to Oman, and is used in education, media, and culture (Elsherif and Soomro, 2017).

Arabs share MSA as a written standard, but dialectal Arabic varies significantly across the Arab world. Social media has brought many dialects into written social interaction; these dialects differ phonologically, lexically, and morphologically. The main dialect groups are Maghrebi, Egyptian, Levantine, Iraqi, and Gulf (Zbib et al., 2012).

1.2 Dialect

Dialectal Arabic is infrequently employed in formal contexts such as media and education, where Modern Standard Arabic (MSA) is widely used (Bouamor et al., 2019). However, native Arabic speakers rarely use MSA when interacting with one another. The development of linguistic resources for dialects plays a crucial role in improving the functionality and effectiveness of Dialectal Arabic applications, such as dialect-specific machine translation for Arabic dialects including Tunisian, Algerian, Lebanese, Syrian, Jordanian, Kuwaiti, and Qatari (Al-Mulla and Zaghouani, 2020).

In this paper, we focus on the Gulf dialect spoken in the United Arab Emirates. Our study addresses the Emirati Arabic dialect using text extracted from social media and from Emirati dialect radio channels on YouTube. The remainder of this work is structured as follows: Sect. 2 reviews relevant literature; Sect. 3 briefly describes the basic features of Gulf Emirati; Sect. 4 describes the dataset; Sect. 5 discusses the strategy and resources for gathering and pre-processing the corpus; and Sect. 6 concludes the paper.

2 Related Work

Arabic NLP resources have grown rapidly, particularly for Modern Standard Arabic (MSA) (Rosso et al., 2018). Interest is growing in projects that gather user-generated Arabic content from social media (Rangel et al., 2019), aided by resources such as MADAR (Bouamor et al., 2019) and ARAP-Tweet (Zaghouani and Charfi, 2018). Dialectal Arabic is also gaining traction, as seen in a comprehensive Gulf Arabic dictionary (Al-Twairesh et al., 2017) and the Lahajat directory. Al-Badawi’s site showcases Arabic phrases from various origins (Al-Badawi, 2013). Al-Malki (2015) presents camel-related idioms, while Al-Kuwari (2014) covers pearl diving and marine life terms. Georgetown University’s “Qatari phrasebook” app offers 1,500 Qatari terms with English translations. The Saudi Twitter Corpus (Assiri et al., 2016) supports sentiment analysis with 4,700 Saudi dialect (SD) tweets across disciplines. BRAD (Elnagar and Einea, 2016) offers sentiment analysis material from 156,506 annotated reviews. A Saudi sentiment corpus (Al-Twairesh et al., 2017) includes 2.2M tweets labeled for sentiment. The Habibi corpus (El-Haj, 2020) holds 500K phrases from Arabic songs in six dialects across 18 nations. The Arabic Sentiment Analysis Dataset (Alyami and Olatunji, 2020) covers Saudi societal concerns with 15,149 words, supporting sentiment classification in the context of Saudi Vision 2030.

Al Shamsi & Abdallah (2022a) conducted a systematic review of sentiment analysis of Arabic dialects in social media. The authors highlighted that the majority of research papers did not indicate the dialect type, although several explicitly investigated the Saudi dialect. Around half of the datasets used in the reviewed studies were created by their authors, with sizes ranging from 10,000 to 50,000 words. Machine learning techniques were the most commonly used approach for sentiment analysis, with different classifiers producing the best results in different studies. Twitter was the most popular source for building datasets of Arabic dialect texts.

Al Shamsi & Abdallah (2022b) developed a curated dataset for sentiment analysis of the Emirati dialect as used on the social media platform Instagram. The dataset contained a total of 70,000 comments, most of which were written in the Emirati dialect. The quality of the corpus was assessed using Cohen’s kappa coefficient and was found to be high. The corpus was evaluated using eight different machine learning approaches, with TF-IDF used for text vectorization. The results showed that the corpus is suitable for this task, with several approaches attaining an accuracy of more than 70% (the highest accuracy achieved being 80%).
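For readers unfamiliar with the agreement measure mentioned above, Cohen’s kappa compares the observed agreement between two annotators with the agreement expected by chance. The following is a minimal Python sketch of the standard formula; the function name and the toy labels are illustrative assumptions and are not taken from the cited study.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items on which both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions, summed.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy example with invented sentiment labels.
annotator_1 = ["pos", "pos", "neg", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Values close to 1 indicate strong agreement, while values near 0 indicate agreement no better than chance.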

This work aims to construct a corpus specifically dedicated to the Emirati dialect, as there is a dearth of electronic resources for this dialect. The absence of a dedicated Emirati dialect corpus for NLP applications underscores the significance of this research.

3 Gulf Emirati (Background, Orthography and Morphology)

Emirati Arabic, spoken by Emiratis in the United Arab Emirates, is a group of closely related Arabic dialects with similar phonological, lexical, and morphosyntactic structures. Within the UAE, there are regional variations, classified into three sub-varieties based on their distribution in the Northern, Eastern, and Western Emirates.

Emirati Arabic varies in pronunciation and spelling. For instance, the word “mob”, which means “not”, can be pronounced as “mesh” in Abu Dhabi, “mob” in the Northern Emirates, and “ma” on the East Coast. In Emirati Arabic, several sounds are also replaced by similar ones: “j” becomes “y”, “k” changes to “ch”, and “q” changes to “g” or “j”.

Emirati Arabic speakers distinguish their dialect from nearby dialects based on a number of phonological, morphological, and syntactic characteristics (Leung et al., 2021).

Unlike Modern Standard Arabic, Emirati dialects have no fixed orthographic conventions. This causes inconsistent spelling, even among authors writing the same word. Spelling variants may stem from differences in pronunciation or from the way words are written when derived from Modern Standard Arabic orthography. For instance, the word for “truth” is written “sedq” (in transliteration) in Modern Standard Arabic, while most Emirati dialect speakers spell it “sedg” (Al-Twairesh et al., 2018).
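One way to cope with such spelling variation when matching words is to enumerate the plausible variants of a transliterated form. The sketch below is an illustration only, not the procedure used in this study: it applies the sound correspondences listed in this section (“j” to “y”, “k” to “ch”, “q” to “g” or “j”) to Latin transliterations, and the names SUBSTITUTIONS and spelling_variants are hypothetical.

```python
from itertools import product

# Sound correspondences described in this section, in Latin transliteration.
# Each source letter maps to itself plus its possible dialectal realisations.
SUBSTITUTIONS = {
    "j": ["j", "y"],
    "k": ["k", "ch"],
    "q": ["q", "g", "j"],
}


def spelling_variants(word):
    """Return every spelling obtained by optionally applying each substitution."""
    options = [SUBSTITUTIONS.get(ch, [ch]) for ch in word]
    return {"".join(combo) for combo in product(*options)}


variants = spelling_variants("sedq")
# The dialectal spelling "sedg" from the example above is among the variants.
found = "sedg" in variants
```

A lookup that compares variant sets rather than exact strings would treat “sedq” and “sedg” as the same entry, at the cost of occasional false matches between unrelated words.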

4 Dataset Description

The project aimed to create a comprehensive dataset for natural language processing applications, specifically for the Gulf Emirati dialect. The dataset was gathered from various internet resources, including specialized websites, social media platforms, Emirati language blogs, and Emirati radio channels on YouTube.

We combined manual collection with web crawling for relevant pages, extracting comments and news items from social media sites such as Facebook and from Emirati dialect radio channels on YouTube, to ensure a wide variety of phrase lengths and coverage of different domains and topics. Over the past decade, social media platforms have grown in popularity and variety, and we use them to acquire the data in question. The goal of data collection is to extract and organize terms (Rekik et al., 2018).

The main source of data for this study was forums devoted to Emirati dialects, which included conversations on a range of topics, including word definitions, conversational use, problems with Emirati terminology, and storytelling in the dialect. About 1,800 words were acquired from these forums, where data collection was comparatively easy.

We also looked at radio channels on YouTube that broadcast content in the Emirati Gulf dialect and cover a wide range of subjects. We used specialist speech-to-text websites to convert the spoken input into written form, and then took a selection of words from the transcriptions. To remove duplicates and improve data quality, the resulting corpus underwent stringent cleaning. We ended up with a collection of 3000 words. To ensure the reliability and quality of the collected dataset, we enlisted an expert in the Emirati dialect, whose knowledge was essential in validating the dataset and identifying possible flaws or discrepancies.
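The duplicate-removal step can be illustrated with a short Python sketch. It is a minimal example under stated assumptions rather than the procedure actually used: entries are assumed to be plain strings, and two entries are treated as duplicates when they are equal after Unicode normalisation, whitespace collapsing, and case folding. The function name deduplicate is hypothetical.

```python
import unicodedata


def deduplicate(entries):
    """Drop empty and repeated entries while keeping first-seen order."""
    seen = set()
    unique = []
    for entry in entries:
        # Comparison key: NFKC-normalised, single-spaced, case-folded text.
        key = " ".join(unicodedata.normalize("NFKC", entry).split()).casefold()
        if key and key not in seen:
            seen.add(key)
            # Keep the entry itself, with surrounding and repeated spaces removed.
            unique.append(" ".join(entry.split()))
    return unique


raw_entries = ["yalla", " Yalla ", "khallas", "", "mafi  mushkila", "mafi mushkila"]
clean_entries = deduplicate(raw_entries)
```

Keeping the first-seen order makes the output reproducible, which helps when a native speaker later reviews the list entry by entry.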

To ensure clarity and consistency, this dataset, which includes over 3000 traditional Emirati phrases and idioms, has undergone extensive processing, cleaning, and classification. Future research on Arabic dialects, especially Gulf Arabic dialects like the Emirati dialect, would benefit substantially from its availability. Researchers can use this dataset for tasks such as voice recognition, machine translation, and sentiment analysis targeting Gulf Emirati speech. Furthermore, this dataset can improve communication in a range of contexts, including corporate, medical, and educational environments. It may support successful communication by promoting a deeper awareness of the nuances and special characteristics of the Emirati dialect.

The dataset produced is a significant advancement in the study of the Gulf Emirati Arabic dialect and the development of natural language processing tools, potentially simplifying future research and advancing the growth of NLP applications for Gulf Arabic dialects by addressing resource scarcity.

5 Data Pre-processing

The research utilized a combination of human and automated data collection methods to gather social media data in the Gulf Emirati dialect, aiming to create a dataset for developing dialectal Arabic speech recognition software and applications for the dialect.

The study team first created precise search keywords and queries targeting social media posts written in the Gulf Emirati dialect. The search keywords included words and phrases such as “yalla” (let’s go), “shlonik” (how are you), and “inshallah” (God willing).
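Keyword-based selection of this kind can be sketched in a few lines of Python. The example below is an assumption-laden illustration, not the authors’ tooling: it uses the transliterated keywords quoted above (the actual searches were presumably run with Arabic-script forms as well), matches whole words case-insensitively, and the names SEED_KEYWORDS, build_pattern, and filter_posts are hypothetical.

```python
import re

# Seed keywords quoted in the text, in Latin transliteration.
SEED_KEYWORDS = ["yalla", "shlonik", "inshallah"]


def build_pattern(keywords):
    """Compile one case-insensitive pattern matching any keyword as a whole word."""
    alternatives = "|".join(re.escape(k) for k in keywords)
    return re.compile(rf"(?<!\w)(?:{alternatives})(?!\w)", re.IGNORECASE)


def filter_posts(posts, keywords=SEED_KEYWORDS):
    """Keep only the posts that contain at least one seed keyword."""
    pattern = build_pattern(keywords)
    return [post for post in posts if pattern.search(post)]


posts = ["Yalla, see you there", "Good morning everyone", "shlonik today?"]
candidates = filter_posts(posts)
```

Posts selected this way are only candidates; they would still need the manual review described in this section before entering the dataset.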

Since speakers of the Gulf Emirati dialect often use social media sites like Facebook and Emirati radio channels on YouTube, the researchers used these sites to gather data. They manually looked for posts that satisfied the requirements for inclusion in the dataset, such as those that included idioms and terms common in the Emirati dialect.

The researchers employed automated data gathering methods in addition to manual collection. They automated the collection of data from social media networks using speech-to-text techniques.

The study presents a method for identifying the Gulf Emirati dialect in online conversations by utilizing a dataset of idioms and phrases specific to the Emirati dialect, aiming to enhance the development of Arabic speech and NLP technologies.

The data collection process involved multiple phases, including locating relevant web resources, Emirati-focused websites, forums, and social media networks in order to gather relevant social media posts in the Gulf Emirati dialect (Alkhair et al., 2019).

Next, we applied a number of preprocessing steps to clean the social media posts and eliminate noise and unwanted symbols. For instance, we removed dates, times, Arabic and English digits, URLs, and special characters such as the @ (mention sign). We also removed emoticons, stop words, and extra punctuation and symbols from the comments. In addition, we removed elongation by reducing repeated characters to a single occurrence (Nerabie et al., 2021). Our dataset contains the following traditional phrases and idioms from the Emirati dialect:

A phrase for expressing appreciation or respect for someone or something is “ya zain”.

“Khallas” is a phrase often used to denote the conclusion or completion of a task.

“Mafi mushkila” is a saying that means there is no issue or problem.

“Yallah” is an expression of encouragement or urging.
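The cleaning steps described at the start of this section can be sketched as a small Python function. This is a minimal illustration under stated assumptions, not the authors’ implementation: the regular expressions, the three-or-more threshold for elongation, and the tiny stop-word set are all hypothetical choices made for the example.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
# ASCII digits plus Arabic-Indic digits (U+0660 to U+0669).
DIGIT_RE = re.compile(r"[0-9\u0660-\u0669]+")
# Anything that is neither a word character nor whitespace:
# punctuation, symbols, and emoji.
SYMBOL_RE = re.compile(r"[^\w\s]")
# Assumption: a character repeated three or more times counts as elongation.
ELONGATION_RE = re.compile(r"(.)\1{2,}")
# Tatweel (U+0640), the Arabic stretching character.
TATWEEL = "\u0640"
# Illustrative stop words only; the list used in the study is not given.
STOP_WORDS = {"في", "من", "على"}


def clean_post(text, stop_words=STOP_WORDS):
    """Remove URLs, mentions, digits, symbols, elongation, and stop words."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    # Removing digits and then symbols also strips dates and times.
    text = DIGIT_RE.sub(" ", text)
    text = text.replace(TATWEEL, "")
    text = SYMBOL_RE.sub(" ", text)
    # Collapse each elongated run to a single occurrence.
    text = ELONGATION_RE.sub(r"\1", text)
    tokens = [tok for tok in text.split() if tok not in stop_words]
    return " ".join(tokens)


cleaned = clean_post("@user yallaaaa!!! 12/05/2021 10:30 https://example.com/x")
```

URLs and mentions are removed before the generic symbol pass so that their fragments do not survive as stray tokens. The threshold of three repetitions is chosen so that legitimate double letters, as in “yalla”, are left intact.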

In summary, the research methodology involved identifying online sources, preprocessing social media posts, and annotating data to determine whether it was in the Gulf Emirati dialect. A dataset of traditional Emirati phrases and idioms was compiled to improve dialectal Arabic speech and natural language processing specific to the Gulf Emirati dialect (Fig. 1).

6 Conclusion

In summary, this study has introduced an innovative methodology for identifying the Gulf Emirati dialect on online platforms using a collection of traditional phrases and idiomatic expressions. The absence of standardized orthographies in Arabic dialects is a significant obstacle for researchers, making the creation of language datasets arduous and time-consuming. We argue that the dataset we have produced will contribute to research on dialectal Arabic speech and NLP methods and applications, with a special focus on the Gulf Emirati dialect.

Furthermore, we have surveyed previous Arabic corpora and presented the methodology used to process the Emirati Gulf dialect dataset, which has resulted in the extraction of 3000 Emirati dialect words from social media. It is anticipated that this research will encourage further exploration of social media as a means of constructing language datasets, thereby contributing to more precise and efficient language resources tailored to the Gulf Emirati dialect.

References

Al Shamsi, A.A., Abdallah, S.: A systematic review for sentiment analysis of Arabic Dialect texts researches. In: Al-Emran, M., Al-Sharafi, M.A., Al-Kabi, M.N., Shaalan, K. (eds.) ICETIS 2021. LNNS, vol. 322, pp. 291–309. Springer, Cham (2022a). https://doi.org/10.1007/978-3-030-85990-9_25
Al Shamsi, A.A., Abdallah, S.: Sentiment analysis of Emirati dialect. Big Data and Cogn. Comput. 6(2), 57 (2022b)
Al-Badawi, K.: Turkish words exotic to the Arabic language. WWW Document (2013). http://www.m.ahewar.org/s.asp
Alkhair, M., Meftouh, K., Smaïli, K., Othman, N.: An Arabic corpus of fake news: collection, analysis and classification. In: Smaïli, K. (ed.) ICALP 2019. CCIS, vol. 1108, pp. 292–302. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-32959-4_21
Alkhatib, M., Shaalan, K.: The key challenges for Arabic machine translation. In: Shaalan, K., Hassanien, A.E., Tolba, F. (eds.) Intelligent Natural Language Processing: Trends and Applications. SCI, vol. 740, pp. 139–156. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-67056-0_8
Al-Kuwari, R.: The dictionary of Pearl diving and marine life terms in the Gulf (2014)
Al-Malki, A.: Camels in Qatar. Dar for Qatari Books, Doha (2015)
Al-Mulla, S., Zaghouani, W.: Building a corpus of Qatari Arabic expressions, pp. 11–16 (2020)
Alruily, M.: Issues of dialectal Saudi Twitter corpus. Int. Arab J. Inf. Technol. 17, 367–374 (2020)
Al-Thubaity, A.O.: A 700M+ Arabic corpus: KACST Arabic corpus design and construction. Lang. Resour. Eval. 49(3), 721–751 (2014). https://doi.org/10.1007/s10579-014-9284-1
Al-Twairesh, N., Al-Khalifa, H., Al-Salman, A., Al-Ohali, Y.: AraSenTi-Tweet: a corpus for Arabic sentiment analysis of Saudi Tweets. Procedia Comput. Sci. 117, 63–72 (2017)
Al-Twairesh, N., et al.: SUAR: towards building a corpus for the Saudi dialect. Procedia Comput. Sci. 142, 72–82 (2018)
Alyami, S.N., Olatunji, S.O.: Application of support vector machine for Arabic sentiment classification using Twitter-based dataset (2020). https://doi.org/10.1142/S0219649220400183
Assiri, A., Emam, A., Al-Dossari, H.: Saudi Twitter corpus for sentiment analysis. Int. J. Comput. Inf. Eng. 10(2), 272–275 (2016)
Bouamor, H., et al.: The MADAR Arabic dialect corpus and lexicon. In: LREC 2018 - 11th International Conference on Language Resources and Evaluation, pp. 3387–3396 (2019)
El-Haj, M.: Habibi-a multi dialect multi national Arabic song lyrics corpus. eprints.lancs.ac.uk (2020)
El-Khair, I.: Abu El-Khair corpus: a modern standard Arabic corpus. Int. J. Recent Trends Eng. Res. 2(11), 5–13 (2003)
Elnagar, A., Einea, O.: BRAD 1.0: book reviews in Arabic dataset. In: Proceedings of IEEE/ACS International Conference on Computer Systems and Applications, AICCSA (2016)
Elsherif, H.M., Soomro, T.R.: Perspectives of Arabic machine translation. J. Eng. Sci. Technol. 12, 2315–2332 (2017)
Farghaly, A.: Arabic natural language processing: challenges and solutions (n.d.)
Habash, N., Eskander, R., Hawwari, A.: A morphological analyzer for Egyptian Arabic, pp. 1–9 (2012)
Habash, N.Y.: Introduction to Arabic Natural Language Processing. Synthesis Lectures on Human Language Technologies (2010)
Leung, T.-C., Ntelitheos, D., Al Kaabi, M.: Emirati Arabic: A Comprehensive Grammar (2021)
Nerabie, A.M., AlKhatib, M., Mathew, S.S., Barachi, M.E., Oroumchian, F.: The impact of Arabic part of speech tagging on sentiment analysis: a new corpus and deep learning approach. Procedia Comput. Sci. 184, 148–155 (2021)
Rangel, F., Rosso, P., Charfi, A., Zaghouani, W.: Detecting deceptive tweets in Arabic for cyber-security. In: 2019 IEEE International Conference on Intelligence and Security Informatics, ISI 2019, pp. 86–91 (2019)
Rekik, A., et al.: Building an Arabic social corpus for dangerous profile extraction on social networks. Computación y Sistemas 22, 1337–1346 (2018)
Rosso, P., Rangel, F., Farías, I.H., Cagnina, L., Zaghouani, W., Charfi, A.: A survey on author profiling, deception, and irony detection for the Arabic language. Lang. Linguist. Compass. 12, e12275 (2018)
Sawalha, M., Alshargi, F., Alshdaifat, A., Yagi, S., Qudah, M.A.: Construction and annotation of the Jordan comprehensive contemporary Arabic corpus (JCCA), pp. 148–157 (2019)
Shaalan, K., Siddiqui, S., Alkhatib, M., Abdel Monem, A.: Challenges in Arabic natural language processing. In: Computational Linguistics, Speech and Image Processing for Arabic Language, pp. 59–83. World Scientific (2018)
Zaghouani, W., Charfi, A.: Arap-Tweet: a large multi-dialect Twitter corpus for gender, age and language variety identification. arXiv preprint arXiv:1808.07674 (2018)
Zbib, R., et al.: Machine translation of Arabic dialects. In: Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 49–59 (2012)

Author information

Authors and Affiliations

IT and Engineering, British University in Dubai, Dubai, UAE
Bayan A. AlAzzam, Manar Alkhatib & Khaled Shaalan

Corresponding author

Correspondence to Bayan A. AlAzzam.

Editor information

Editors and Affiliations

The British University in Dubai, Dubai, United Arab Emirates
Khalid Al Marri
The British University in Dubai, Dubai, United Arab Emirates
Farzana Asad Mir
The British University in Dubai, Dubai, United Arab Emirates
Solomon Arulraj David
The British University in Dubai, Dubai, United Arab Emirates
Mostafa Al-Emran

Rights and permissions

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

About this paper

Cite this paper

AlAzzam, B.A., Alkhatib, M., Shaalan, K. (2024). Towards Gulf Emirati Dialect Corpus from Social Media. In: Al Marri, K., Mir, F.A., David, S.A., Al-Emran, M. (eds) BUiD Doctoral Research Conference 2023. Lecture Notes in Civil Engineering, vol 473. Springer, Cham. https://doi.org/10.1007/978-3-031-56121-4_27

DOI: https://doi.org/10.1007/978-3-031-56121-4_27
Published: 29 March 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-56120-7
Online ISBN: 978-3-031-56121-4
eBook Packages: Engineering, Engineering (R0)
