1 Introduction

In computer science, natural language processing (NLP) is an essential technique that allows humans to communicate with computers in natural language, even though machines natively process only programming or machine languages. Arabic dialects each have their own vocabulary and syntax, while Modern Standard Arabic (MSA), a variety that is not natively spoken, is used in formal situations. NLP resources are needed for tasks such as speech recognition, machine translation, and sentiment analysis. Corpora, which are collections of text or voice data, are used to train and evaluate language models and help researchers understand a dialect’s linguistic characteristics. NLP has gained importance because it improves machine interpretation of language and serves a variety of applications, including discourse analysis, multilingual data frameworks, sentiment analysis, chatbots, and voice recognition (Alkhatib and Shaalan, 2018). As NLP technology improves, new applications will emerge. By enabling computers and people to communicate in natural languages such as English, Arabic, and Chinese, NLP could significantly affect daily life.

NLP improves usability and productivity in several applications (Alkhatib and Shaalan, 2018).

Language corpora include texts in many languages (El-khair and Ibrahim, 2003). A corpus is one of the most important data sources in information retrieval (IR), NLP, and computational linguistics (CL). It can represent written language and support opinion processing and other applications. Arabic corpora are underfunded and under-researched compared to English corpora, which hampers Arabic NLP practitioners (Al-Thubaity, 2014). Arabic has three main varieties: Classical Arabic (CA), Modern Standard Arabic (MSA), and Dialectal Arabic (DA). DA is geographically specific, while MSA is used in news. Arabic’s morphology is complex and inflected, with diverse subcategories (Alruily, 2020).

Thus, many researchers are now creating large Arabic corpora. Since most Arabic NLP developments have focused on CA and MSA, they perform poorly on DA data (Habash et al., 2012). DA is morphologically rich: in colloquial usage, a DA lemma may have hundreds of surface forms (Al-Twairesh et al., 2017). DA lacks standard orthographies and differs from MSA in syntax, morphology, and pronunciation (Habash, 2010). As noted in the literature, a single NLP solution for all Arabic dialects is almost impossible (Farghaly, n.d.). Although MSA is the most extensively used written variety, DA is now increasingly written as well (Al-Twairesh et al., 2018; Shaalan et al., 2018). Therefore, a corpus that accurately represents contemporary language usage is of significant importance (Sawalha et al., 2019).

1.1 Arabic Language

Arabic, one of the top five languages, is spoken by over 1.5 billion Muslims and 422 million non-Muslims. It is an example of diglossia, with Modern Standard Arabic (MSA) serving as the official written language. Arabic varies across countries, including Morocco and Oman, and is used in education, media, and culture (Elsherif and Soomro, 2017).

Arabs use MSA, but dialectal Arabic varies significantly across the Arab world. Social media has brought various dialects into written social interaction; these dialects differ phonologically, lexically, and morphologically. The main dialect groups are Maghrebi, Egyptian, Levantine, Iraqi, and Gulf (Zbib et al., 2012).

1.2 Dialect

Dialectal Arabic is infrequently employed in formal contexts such as media and education, whereas Modern Standard Arabic (MSA) is widely encountered and used there (Bouamor et al., 2019). However, native Arabic speakers rarely use MSA when interacting with one another. The development of linguistic resources for dialects plays a crucial role in improving the functionality and effectiveness of Dialectal Arabic applications, such as dialect-specific machine translation for several Arabic dialects, including Tunisian, Algerian, Lebanese, Syrian, Jordanian, Kuwaiti, and Qatari (Al-Mulla and Zaghouani, 2020).

In this paper, we focus on the Gulf dialect spoken in the United Arab Emirates. Our study addresses the Emirati Arabic dialect using text extracted from social media and from Emirati dialect radio channels on YouTube. The remainder of this work is structured as follows: Sect. 2 reviews relevant literature; Sect. 3 briefly describes the basic features of Gulf Emirati; Sect. 4 describes the dataset; Sect. 5 discusses the strategy and resources for gathering and preprocessing the corpus; and Sect. 6 concludes the paper.

2 Related Work

Arabic NLP resources have grown rapidly, particularly for Modern Standard Arabic (MSA) (Rosso et al., 2018). Interest is growing in projects that gather user-generated Arabic content from social media (Rangel et al., 2019), aided by resources such as MADAR (Bouamor et al., 2019) and ARAP-Tweet (Zaghouani and Charfi, 2018). Dialectal Arabic is also gaining traction, as seen in a comprehensive Gulf Arabic dictionary (Al-Twairesh et al., 2017) and Lahajat’s directory. Al-Badawi’s site showcases Arabic phrases from various origins (Al-Badawi, 2013). Al-Malki (2015) presents camel-related idioms, while Al-Kuwari (2014) researches marine life. Georgetown University’s “Qatari phrasebook” app offers 1,500 Qatari terms with English translations. A Saudi Twitter corpus (Assiri et al., 2016) supports sentiment analysis with 4,700 SD tweets across disciplines. BRAD (Elnagar and Einea, 2016) offers sentiment analysis material from 156,506 annotated reviews. The Saudi sentiment corpus (Al-Twairesh et al., 2017) includes 2.2M tweets labeled for sentiment. The Habibi corpus (El-Haj, 2020) holds 500K phrases from Arabic songs in six dialects across 18 nations. The Arabic Sentiment Analysis Dataset (Alyami and Olatunji, 2020) covers Saudi societal concerns with 15,149 words, supporting sentiment classification in the context of Saudi Vision 2030.

Al Shamsi & Abdallah (2022a, b) focused on sentiment analysis of Arabic dialects in social media. The authors highlighted that the majority of research papers did not indicate the dialect type, although several expressly investigated the Saudi dialect. Around half of the datasets used in the studies were created by their authors, with sizes ranging from 10,000 to 50,000 words. Machine learning techniques were the most commonly used approach for sentiment analysis, with different classifiers producing the best results. Twitter was the most popular platform for building datasets of Arabic dialect texts.

Al Shamsi & Abdallah (2022a, b) developed a carefully curated dataset for sentiment analysis of the Emirati dialect as used on Instagram. The dataset contained 70,000 comments, the bulk of which were written in the Emirati dialect. The quality of the corpus was assessed using Cohen’s kappa coefficient and was judged to be high. The corpus was evaluated using eight machine learning approaches, with TF-IDF used for text vectorization. The results showed that the corpus is suitable for the task, with several approaches attaining an accuracy of more than 70% (the highest being 80%).
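To make the agreement measure mentioned above concrete, the following is a minimal sketch of Cohen’s kappa for two annotators. It is not the cited authors’ code; the function name and the example labels are hypothetical and serve only as an illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each annotator's
    label distribution.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label; agreement is trivial.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical sentiment labels from two annotators.
annotator_1 = ["pos", "pos", "neg", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Values close to 1 indicate strong agreement beyond chance, while values near 0 indicate agreement no better than chance.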

This work aims to construct a corpus dedicated to the Emirati dialect, as there is a dearth of electronic resources for this dialect. The absence of a dedicated Emirati dialect corpus for NLP applications underlines the significance of this research.

3 Gulf Emirati (Background, Orthography and Morphology)

Emirati Arabic, spoken by Emiratis in the United Arab Emirates, is a group of closely related Arabic dialects with similar phonological, lexical, and morphosyntactic structures. Within the UAE, there are regional variations, classified into three sub-varieties based on their distribution in the Northern, Eastern, and Western Emirates.

There are differences in the pronunciation and spelling of Emirati Arabic. For instance, the word “mob”, which means “not”, can be pronounced as “mesh” in Abu Dhabi, “mob” in the Northern Emirates, and “ma” on the East Coast. In Emirati Arabic, several letters are also replaced by ones with a similar sound: “j” becomes “y”, “k” changes to “ch”, and “q” changes to “g” or “j”.

Emirati Arabic speakers distinguish their dialect from nearby dialects based on a number of phonological, morphological, and syntactic characteristics (Leung et al., 2021).

Unlike Modern Standard Arabic, the Emirati dialect has no fixed orthographic conventions. This causes inconsistent spelling, even across authors. Spelling variation may stem from pronunciation or from the way words are written when derived from Modern Standard Arabic orthography. For instance, the word “truth” is written “sedq” in Modern Standard Arabic, while most Emirati dialect speakers spell it “sedg” (Al-Twairesh et al., 2018).

4 Dataset Description

The project aimed to create a comprehensive dataset for natural language processing applications, specifically for the Gulf Emirati dialect. The dataset was gathered from various internet resources, including specialized websites, social media platforms, Emirati language blogs, and Emirati radio channels on YouTube.

We used manual data gathering techniques, such as crawling the web for relevant pages and extracting comments and news items from social media sites like Facebook and from Emirati dialect radio channels on YouTube, to ensure a wide variety of phrase lengths and coverage of different domains and topics. Over the past decade, social media platforms have grown in popularity and variety, and we use them to acquire the data in question. The purpose of the data collection is to extract and organize terms (Rekik et al., 2018).

The main source of data for this study was forums devoted to Emirati dialects, which included conversations on a range of topics, such as word definitions, conversational use, problems with Emirati terminology, and storytelling in the dialect. About 1800 words were acquired from these forums, where data collection was comparatively easy.

We also examined YouTube radio channels that broadcast content in the Emirati Gulf dialect on a wide range of subjects. We used specialist speech-to-text websites to convert the spoken input into written form and then took a selection of words from the transcriptions. To remove duplicates and improve data quality, the resulting corpus underwent stringent cleaning. We ended up with a collection of 3000 words. To ensure the reliability and quality of the collected dataset, we enlisted a specialist in the Emirati dialect, whose knowledge was essential in validating the dataset and identifying possible flaws or discrepancies.
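The duplicate-removal step can be sketched as follows. This is a minimal illustration under our own assumptions (entries are compared after whitespace normalization, and the first occurrence is kept); the function name is hypothetical, and the sketch does not reproduce the exact cleaning procedure applied to the corpus.

```python
def deduplicate(entries):
    """Return entries without duplicates, keeping first-seen order.

    Assumption: two entries count as duplicates when they are identical
    after collapsing runs of whitespace and trimming both ends.
    """
    seen = set()
    unique = []
    for entry in entries:
        key = " ".join(entry.split())  # normalize internal and outer whitespace
        if key and key not in seen:    # skip empty strings and repeats
            seen.add(key)
            unique.append(key)
    return unique


# Hypothetical transliterated entries.
cleaned = deduplicate(["yalla", " yalla ", "khallas", "", "mafi  mushkila", "yalla"])
```

Keeping the first occurrence preserves the original collection order, which makes it easier for a dialect specialist to review the list against its sources.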

To guarantee clarity and consistency, this dataset, which includes over 3000 traditional Emirati phrases and idioms, has undergone extensive processing, cleaning, and classification. Future research on Arabic dialects, especially Gulf Arabic dialects like the Emirati dialect, would benefit substantially from its availability. Researchers can use this dataset for various purposes, particularly tasks such as voice recognition, machine translation, and sentiment analysis targeting Gulf Emirati speech. Furthermore, this dataset can support communication in a range of contexts, including corporate, medical, and educational environments. By promoting a deeper awareness of the nuances and special characteristics of the Emirati dialect, it may assist successful communication in these areas.

The dataset is a significant step forward in the study of the Gulf Emirati Arabic dialect and the development of natural language processing tools. By addressing resource scarcity, it may simplify future research and advance NLP applications for Gulf Arabic dialects.

5 Data Pre-processing

The research utilized a combination of human and automated data collection methods to gather social media data in the Gulf Emirati dialect, aiming to create a dataset for developing dialectal Arabic speech recognition software and applications for the dialect.

The study team first created precise search keywords and queries targeting social media posts written in the Gulf Emirati dialect. The search keywords included “yalla” (let’s go), “shlonik” (how are you), and “inshallah” (God willing).
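A keyword-based search of this kind can be sketched as a simple filter over candidate posts. The sketch below is illustrative only: it uses the transliterated keywords listed above rather than their Arabic-script forms, and the function names and example posts are hypothetical.

```python
import re

# Transliterated seed keywords taken from the text; in practice the
# Arabic-script forms would be searched as well (assumption).
SEED_KEYWORDS = {"yalla", "shlonik", "inshallah"}


def contains_seed(post, seeds=SEED_KEYWORDS):
    """Return True if any word in the post is one of the seed keywords."""
    tokens = re.findall(r"\w+", post.lower())
    return any(token in seeds for token in tokens)


def filter_posts(posts, seeds=SEED_KEYWORDS):
    """Keep only the posts that contain at least one seed keyword."""
    return [post for post in posts if contains_seed(post, seeds)]


# Hypothetical example posts.
posts = ["Yalla, let's go!", "Good morning", "see you tomorrow inshallah"]
candidates = filter_posts(posts)
```

Matching whole tokens instead of substrings avoids false hits inside longer words, at the cost of missing spelling variants, which matter in a dialect without fixed orthography.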

Since speakers of the Gulf Emirati dialect often use social media sites such as Facebook and Emirati radio channels on YouTube, the researchers used these sites to gather data. They manually looked for posts that satisfied the requirements for inclusion in the dataset, such as those that included idioms and terms common to the Emirati dialect.

The researchers employed automated data gathering methods in addition to manual data collection. They automated the collection of data from social media networks using speech-to-text techniques.

The study presents a method for identifying the Gulf Emirati dialect in online conversations by using a dataset of idioms and phrases specific to the Emirati dialect, aiming to support the development of Arabic speech and NLP technologies.

The data collection process involved multiple phases, including locating relevant web resources, Emirati-friendly websites, forums, and social media networks to gather relevant social media postings in the Gulf Emirati dialect (Alkhair et al., 2019).

Next, we applied a number of preprocessing steps to clean the social media posts and eliminate noise and undesirable symbols. For instance, we removed dates, times, Arabic and English digits, URLs, and special characters such as the @ (mention sign). We also removed emoticons, stop words, and extra punctuation and symbols from the comments. In addition, we eliminated elongation by reducing repeated characters to a single occurrence (Nerabie et al., 2021). Our dataset contains the following classic phrases and idioms from the Emirati dialect:

“Ya zain” is a phrase for expressing appreciation or respect for someone or something.

“Khallas” is a phrase often used to denote the conclusion or completion of a task.

“Mafi mushkila” is a saying that means there isn’t an issue or a challenge.

“Yallah” is an expression of encouragement or urging.
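The cleaning steps described in this section can be sketched as a single function. This is a minimal sketch rather than the exact pipeline used for the corpus: the regular expressions, the stop-word list, and the rule that runs of three or more identical characters count as elongation are all assumptions made for illustration.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
DATE_RE = re.compile(r"\b\d{1,4}[/-]\d{1,2}[/-]\d{1,4}\b")
TIME_RE = re.compile(r"\b\d{1,2}:\d{2}(?::\d{2})?\b")
DIGIT_RE = re.compile(r"[0-9\u0660-\u0669]+")  # English and Arabic-Indic digits
NON_WORD_RE = re.compile(r"[^\w\s]|_")         # punctuation, symbols, emoji
ELONGATION_RE = re.compile(r"(.)\1{2,}")       # assumption: 3+ repeats = elongation
TATWEEL = "\u0640"                             # Arabic elongation character

# Hypothetical placeholder list; a real stop-word list would be far larger.
STOP_WORDS = {"\u0641\u064a", "\u0645\u0646", "\u0639\u0644\u0649"}


def clean_post(text, stop_words=STOP_WORDS):
    """Apply the cleaning steps in order and return the cleaned text."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = DATE_RE.sub(" ", text)
    text = TIME_RE.sub(" ", text)
    text = DIGIT_RE.sub(" ", text)
    text = text.replace(TATWEEL, "")
    # Side effect: non-alphanumeric marks such as Arabic diacritics are
    # also dropped here, which the text does not state explicitly.
    text = NON_WORD_RE.sub(" ", text)
    text = ELONGATION_RE.sub(r"\1", text)  # "yallaaaa" -> "yalla"
    tokens = [token for token in text.split() if token not in stop_words]
    return " ".join(tokens)


# Hypothetical example post.
example = clean_post("@user yallaaaa!!! see https://example.com on 12/05/2021 at 10:30")
```

URLs and mentions are removed before punctuation so that their fragments do not survive as stray tokens, and dates and times are removed before the generic digit pattern so that separators such as "/" and ":" are stripped together with the numbers.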

In summary, the research methodology involved identifying online sources, preprocessing social media posts, and annotating data to determine whether it was in the Gulf Emirati dialect. A dataset of traditional Emirati phrases and idioms was compiled to support dialectal Arabic speech and natural language processing for the Gulf Emirati dialect (Fig. 1).

Fig. 1. Cleaning of corpus data and preprocessing

6 Conclusion

In summary, this study has introduced a methodology for identifying the Gulf Emirati dialect on online platforms by using a collection of conventional phrases and idiomatic expressions. The absence of standardized orthographies in Arabic dialects is a significant obstacle for researchers and makes creating language datasets arduous and time-consuming. We argue that the dataset we have produced will make a valuable contribution to research on dialectal Arabic speech and NLP methods and applications, with a special focus on the Gulf Emirati dialect.

Furthermore, we have surveyed previous Arabic corpora and presented the methodology used to process the Emirati Gulf dialect dataset, which resulted in the extraction of 3000 Emirati dialect words from social media. We anticipate that this research will encourage further exploration of social media as a means of constructing language datasets, thereby contributing to more precise and efficient language resources tailored to the Gulf Emirati dialect.