A Semi-supervised Approach for Sentiment Analysis of Arab(ic+izi) Messages: Application to the Algerian Dialect

In this paper, we propose a semi-supervised approach for sentiment analysis of Arabic and its dialects. The approach is based on a sentiment corpus that is constructed automatically and then reviewed manually by native speakers of the Algerian dialect. It consists of constructing and applying a set of deep learning algorithms to classify the sentiment of Arabic messages as positive or negative. It was applied to Facebook messages written in Modern Standard Arabic (MSA) as well as in the Algerian dialect (DALG, a low-resourced dialect spoken by more than 40 million people), in both Arabic and Arabizi scripts. To handle Arabizi, we consider two options: transliteration (widely used in the research literature for handling Arabizi) and translation (never used in the research literature for handling Arabizi). To highlight the effectiveness of the semi-supervised approach, we carried out different experiments using both corpora for training (i.e. the corpus constructed automatically and the one reviewed manually). The experiments were conducted on several test corpora dedicated to MSA/DALG that were proposed and evaluated in the research literature. Both shallow and deep learning classifiers are used, such as Random Forest (RF), Logistic Regression (LR), Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM). These classifiers are combined with word embedding models such as Word2vec and fastText. Experimental results (F1 score up to 95% for intrinsic experiments and up to 89% for extrinsic experiments) show that the proposed system outperforms existing state-of-the-art methodologies (the best improvement is up to 25%).


Introduction
Sentiment analysis (SA) helps to analyse people's opinions, sentiments, appraisals, attitudes, and emotions towards entities such as products, services, organisations, individuals, issues, events, topics [1]. Two main approaches are commonly used to determine the valence of documents (i.e. positive or negative): lexicon-based approach [2] and machine learning-based approach (ML) [3]. English has the greatest number of sentiment analysis studies, while research is more limited for other languages, including Arabic [4][5][6]. There are three different variants of Arabic: Classical Arabic (CA, for Quran), Modern Standard Arabic (MSA, used in formal exchange) and Arabic dialects (AD, used in informal exchange).
Moreover, Arabic can be written in two scripts: Arabic and Arabizi (Arabic written with Latin letters, numerals and punctuation [7,8]). One of the main issues in processing Arabic and its dialects is the lack of resources. Another dominant problem is that Arabizi, the romanisation that Arabic speakers often use in social media, is not standardised. For example, the word "mli7", which combines Latin letters and a numeral, is the romanised form of the Arabic word meaning "good". Because of the challenges related to transliteration, most ongoing research focuses on sentiment analysis of Arabic written in Arabic script. To the best of our knowledge, only a few works have been presented in the literature on Arabizi sentiment analysis [9,10] or on combined Arabic and Arabizi sentiment analysis [11,12]. Furthermore, some dialects (such as Egyptian, Gulf or Iraqi, which belong to the Mashreq dialects) are more studied than others; only a few works have been conducted on Maghrebi dialects, such as the Moroccan or Algerian dialects.
To address the challenges mentioned earlier, this paper proposes a semi-supervised sentiment analysis approach for Arabic messages extracted from social media (i.e. Facebook). The main idea behind this approach is to construct the sentiment corpus automatically and review it manually. The interesting aspect of this approach is that it considers both Arabic and Arabizi. For transforming Arabizi into Arabic, we consider two options, transliteration and translation. The impact of both techniques on sentiment analysis is shown in order to identify the most suitable approach for handling Arabizi. The proposed approach consists of four main steps: (1) corpus extraction, (2) Arabizi transliteration, (3) Arabizi translation and (4) Arabic sentiment analysis. At the end of this paper, we aim to answer a set of research questions, where each answer opens the door to a research perspective. The research questions addressed in this paper are the following:
1. What is the best option for handling Arabizi (i.e. transliteration or translation)?
2. How could we improve the transliteration approach that we use?
3. How could we improve the proposed translation approach?
4. What is the best technique for Arabic sentiment analysis (i.e. a supervised or a semi-supervised approach)?
5. How could we improve the automatic annotation approach?
This paper is organised as follows. The next section presents the different challenges related to Arabic sentiment analysis. It is followed by the related work on sentiment corpus construction and the new trends in Arabic SA. The subsequent section presents the methodology that we follow. We then present the experiments that we carried out and compare our results to those obtained in the research literature. Before the concluding section, a discussion and an error analysis are presented. We conclude with a synthesis and some directions for future work.

Arab(ic+izi) Sentiment Analysis: Challenges
Most of the works on short text sentiment classification concentrate on Twitter [13][14][15][16][17]. Facebook has more than one billion users, who spend approximately 120 minutes communicating with family and friends [18]. Although Facebook is the biggest social network, only a few approaches targeting Facebook posts and comments have been proposed. This is mainly due to the lack of labelled datasets for such a purpose. Facebook is also a popular social media platform in Arabic countries, where users typically write in Arabic and its dialects. Table 1 shows sample messages that illustrate the main challenges:
- Some messages are written using Arabic script (messages 1, 2, 3, and 4) and others using Arabizi (messages 5 and 6).
- Inappropriate use of punctuation, spaces, exaggeration and links, as text in social media is known to be unstructured (messages 3, 4, 6, and 10).
- Code-switching between languages. The combined use of Arabic and English can be seen in Mashreq countries such as Egypt and the Gulf. The combined use of Arabic and French can be seen in Maghreb countries such as Tunisia and Algeria (message 10).
- Code-switching between scripts, where some messages are written using both Arabizi and Arabic (message 9).
- A massive use of emoticons, a convenient way to express opinions, sentiments and emotions (message 4).
In the context of this paper, we are focusing on four of the presented challenges, which are Arabizi, code-switching between Arabic and Arabizi, inappropriate use of punctuation and the extensive use of emoticons.
As presented above, one of the most important challenges in Arabic sentiment analysis is the use of Arabizi. The difficulty with Arabizi is that the same word can take many forms. For example, Cottrell et al. [19] argued that the expression إن شاء الله (meaning "if God is willing") could be written in 69 different manners. Another challenge is related to the annotation process. Almost all the works presented in the research literature rely on manual annotation of the sentiment corpus used in the training phase [11, 20-22]. However, manual annotation is time and effort consuming. Some works dedicated to English and Dutch [15,23,24] present approaches that use emoticons to automatically tag a large corpus. However, relying on emoticons alone leads to many errors, because some users express contradictory sentiments between the text and the emoticons that they use. More recently, Gamal et al. [25] presented a large sentiment corpus dedicated to MSA and the Egyptian dialect. They relied on a sentiment lexicon for the automatic annotation. However, they used only the sentiment score for the annotation. Also, they carried out only intrinsic experiments (i.e. the constructed corpus was split into train and test corpora). The challenge in automatic annotation is to propose an approach that combines emoticons, text and other features to increase the annotation precision. To validate the constructed corpus, it is better to choose an external test corpus, which shows the efficiency of the training corpus on real-world examples.
SN Computer Science
The Algerian dialect (DALG) is a Maghrebi dialect, primarily used in informal communication, including social media [26,27]. DALG is not used in school education or in television news. It is used in everyday life, music and broadcast series. This dialect is considered a low-variety language, meaning that DALG is only weakly standardised and normalised. DALG has been enriched by the languages of the countries that colonised Algeria, among them Turkish, Italian and, more recently, French. Hence, DALG results from different languages, including MSA (which represents the major part of this dialect). The challenge with DALG is the lack of works and resources. To the best of our knowledge, in addition to the corpora that we presented for DALG (and that we present in more detail in the experimentation part), only three corpora are publicly available for DALG. The first one is Cottrell's corpus [19], which is an Algerian Arabizi corpus extracted from Facebook. The second one is the PADIC corpus [28], which is a parallel corpus between MSA and many dialects, including DALG. The last one is SANA_Alg [29], a recent annotated sentiment corpus (which we use in our experiments to evaluate the proposed approach on a test corpus presented in the literature).

Arabic Sentiment Analysis
The classification of Arabic messages into two or three main classes (i.e. positive/negative or positive/negative/neutral) is done using two main approaches: the lexicon-based approach and the corpus-based approach. Both approaches require annotated data. The lexicon-based approach requires an annotated lexicon where each word is annotated as positive, negative or neutral. Some lexicons also contain a sentiment orientation score (generally a number from 0 to 5) estimating the strength of the sentiment. Corpus-based approaches require an annotated corpus where each sentence carries a label (defining whether it is positive, negative or neutral). For constructing both lexicons and corpora, three trends are emerging: (1) manual construction, (2) automatic construction and (3) semi-automatic construction.

Manual Construction
Only a few lexicons have been constructed manually [30,31]. The work in [30] described the process of creating SIFAAT, a manually created lexicon of 3325 Arabic adjectives labelled with one of the following tags: positive (Pos), negative (Neg), neutral (Obj). The adjectives in SIFAAT pertain to the newswire domain and were extracted from the first four parts of the Penn Arabic Treebank [32]. In [31], the authors focused on the Algerian dialect by constructing three lexicons: (1) a keyword lexicon; (2) a negation word lexicon; and (3) an intensification word lexicon. All these lexicons were constructed manually using existing MSA and Egyptian lexicons. The translation from MSA and Egyptian to the Algerian dialect was done manually. The resulting lexicon contains 3093 words, of which 2380 are positive and 713 are negative. In contrast, almost all the corpora were constructed manually [11, 20-22, 33, 34]. In the majority of cases, the annotation is done by native annotators. In [20], the authors presented OCA, which contains 500 movie reviews collected from different Arabic web pages and blogs (250 positive and 250 negative). The reviews were also manually pre-processed and segmented, and roots were extracted. In [21], the authors presented AWATIF, a multi-genre corpus containing 10,723 Arabic sentences from three sources, namely the Penn Arabic Treebank (ATB) [32], Wikipedia talk pages, and web forums. The sentences are manually annotated as objective or subjective, and subjective sentences are annotated as positive or negative. The authors of [11] presented TSAC (Tunisian Sentiment Analysis Corpus). It contains 17,060 Tunisian Facebook comments. These comments were manually annotated, and they include 8215 positive and 8845 negative statements. This corpus was collected from comments written on the official pages of Tunisian radio stations and TV channels. In [22], the authors constructed ASTD, an Arabic Sentiment Tweets Dataset. This corpus contains 10,000 Arabic tweets that are annotated using Amazon Mechanical Turk as objective, subjective positive, subjective negative, or subjective mixed. The corpus presented in [33] is composed of DARDASHA (2798 chat messages from Maktoob), TAGREED (3,015 Arabic tweets), TAHRIR (3,008 sentences from Wikipedia Talk Pages), and MONTADA (3,097 web forum sentences). Two native speakers manually annotated these corpora. The corpus used in [34] contains 2300 tweets that are manually annotated.

Automatic Construction
Almost all the lexicons presented in the literature were constructed automatically. To automatically construct an Arabic sentiment lexicon, three tendencies have emerged: (1) construction based on automatic translation [30, 35-39]; (2) construction based on resource linking [40][41][42][43]; and (3) construction based on both translation and resource linking [44,45]. The main idea behind construction by automatic translation is to start with an English sentiment lexicon (e.g. Bing Liu lexicon [46], SentiWordnet [47], SentiStrength [48], etc.) and translate it using Google Translate. Some translations are done using an Arabic/English dictionary [45]. In resource linking, different existing English/Arabic resources such as SentiWordnet, Arabic WordNet [49] and Arabic Morphological Analyzer [50, 51] are combined. The main idea behind the construction combining automatic translation and resource linking is to use a reduced seed set of English sentiment words, translate them to Arabic and expand them using Arabic WordNet or Arabic synonym dictionaries.
Only a few works have addressed the automatic construction of corpora, and two techniques have been used: (1) using rating reviews [52,53] and (2) using sentiment lexicons [12]. In the context of using rating reviews, [52] presents LABR, which contains 63,257 book reviews, each rated on a scale from 1 to 5 stars. The authors considered reviews with 4 or 5 stars as positive, those with 1 or 2 stars as negative, and those with 3 stars as neutral. In [53], the authors follow the same annotation principle used in [52] for constructing 7 data sets (ATT, HTL, MOV, PROD, RES1, RES2, RES). ATT is a dataset of attraction reviews scraped from TripAdvisor.com, containing 2154 reviews. HTL is a dataset of hotel reviews, also scraped from TripAdvisor.com, containing 15,572 reviews. MOV is a dataset of movie reviews scraped from elcinema.com, containing 1524 reviews. PROD is a dataset of product reviews scraped from souq.com, containing 4272 reviews. RES1 is a dataset of restaurant reviews scraped from qaym.com, containing 8364 reviews. RES2 is a dataset of restaurant reviews scraped from tripadvisor.com, containing 2642 reviews, and RES is a combination of RES1 and RES2; it contains 10,970 reviews. In the context of using a lexicon, the work in [12] creates and uses an Algerian sentiment lexicon for tagging a large set of MSA and Algerian messages. However, these authors concentrate on a reduced annotated corpus containing only 8000 messages (4000 in Arabic and 4000 in Arabizi).
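The rating-based annotation rule described above for LABR can be sketched in a few lines of Python. This is only a minimal illustration of the stated thresholds; the function name `label_from_stars` is hypothetical and does not come from the cited works.

```python
def label_from_stars(stars: int) -> str:
    """Map a 1-5 star rating to a sentiment label (thresholds as described for LABR)."""
    if stars not in (1, 2, 3, 4, 5):
        raise ValueError(f"expected a rating between 1 and 5, got {stars!r}")
    if stars >= 4:
        return "positive"   # 4 or 5 stars
    if stars <= 2:
        return "negative"   # 1 or 2 stars
    return "neutral"        # 3 stars


# Example: label a small, made-up list of ratings.
labels = [label_from_stars(s) for s in (5, 3, 1)]
```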

Semi-automatic Construction
Only a few works have been done on semi-automatic construction of both kinds of resources (lexicons and corpora) [54][55][56]. [54] presents NileULex, an Arabic sentiment lexicon composed of 45% Egyptian (EGY) and 55% MSA terms. This lexicon contains 5953 unique terms. [55] presents SANA, a large-scale, multi-genre, multi-dialectal and multi-lingual lexical resource for subjectivity and sentiment analysis of Arabic and its dialects. In addition to MSA, SANA also covers both EGY and LEV, and provides English glosses. A significant portion of SANA entries is also augmented with POS, diacritics, gender and number. SANA is developed both manually and automatically, and it contains 224,564 entries. Finally, in [56] the authors present a Saudi corpus. This corpus contains 17,573 Saudi tweets that were manually reviewed and assigned to four classes: positive, negative, neutral and mixed. To construct this corpus, the authors target a set of sentiment words and use them to extract tweets containing these words. After the cleaning and processing phase, they ask native speakers of Arabic/Saudi to review the constructed corpus manually.
After analysing the presented works using the constructed resources, we conclude that:
1. Corpus-based approaches give better results than lexicon-based approaches. Also, almost all recent works rely on a corpus-based approach.
2. The resources constructed manually give the best results for both lexicon-based and corpus-based approaches. However, the size of the resource is a crucial factor in the quality of the results.
3. Voluminous resources give the best results (mainly when the resources were constructed manually). However, manual construction is time and effort consuming.
4. Semi-automatic construction seems to be the solution to both problems: precision and time/effort consumption. However, only a few approaches have been proposed in this category.
5. Almost all recent works in the research literature rely on word embedding and deep learning approaches (detailed in the following parts).

Word Embedding and Deep Learning Approaches
In the supervised approach (corpus-based approach), the text is represented as a feature vector. A bag-of-words (BOW) representation is commonly used, mainly due to its simplicity as well as its efficiency [57]. Despite its popularity, this approach has two significant weaknesses: (1) the loss of word order in the sentence, and (2) the loss of word semantics [58]. Moreover, the application of this approach may require additional preprocessing of the data and an appropriate word feature extraction technique [58,59]. More recently, word and document embeddings have emerged as an alternative representation, with the most used methods being those presented in [58][59][60][61]. Al-Azani and El-Alfy [59] and Altowayan et al. [61] relied on large Arabic corpora to train word2vec models [62] to improve sentiment analysis. They generated features and used these features for training different classifiers. Barhoumi in [58] applied the doc2vec model [63] for the sentiment classification of the LABR corpus [52]. El Mahdaouy et al. [60] affirm that using document embeddings improves text classification. All these works are based on Word2vec and Doc2vec. More recently, another algorithm has appeared, namely fastText [64]. As for Word2vec, fastText models are based on either the skip-gram (SG) or the continuous bag-of-words (CBOW) architecture. fastText is often compared to Word2vec for the classification task [65,66]. However, to the best of our knowledge, fastText has not been used for Arabic classification or sentiment analysis.
Recently, deep learning algorithms such as the convolutional neural network (CNN), long short-term memory (LSTM) and bidirectional LSTM (Bi-LSTM) have taken an essential place in sentiment classification. In this context, [67] presents a scheme for Arabic sentiment classification, which evaluates and detects the sentiment polarity of Arabic reviews. The authors used Word2vec for feature extraction (with both the CBOW and skip-gram architectures). A CNN was trained on top of pre-trained Arabic word embeddings for sentiment classification. For the CNN, the authors used the same architecture defined in [68], relying on one channel that allows the adaptation of pre-trained vectors for each task. They applied their approach to different corpora presented in the literature, such as LABR, ASTD, ATT, HTL, and MOV. More recently, [69] presents a language-independent model for multi-class sentiment analysis using a simple neural network architecture of different layers. The advantage of the proposed model is that it does not rely on language-specific features such as ontologies, dictionaries, or morphological or syntactic pre-processing. The authors applied their model to three languages: English, German and Arabic. For Arabic, they relied on the ASTD corpus constructed in [22].

Arabizi Sentiment Analysis
Few works have been conducted on Arabizi sentiment analysis [9,10,12]. In [9], the authors present a transliteration step before proceeding to the sentiment classification. However, their approach has two major drawbacks: (1) they relied on a basic mapping table for the passage from Arabizi to Arabic, which cannot handle Arabizi ambiguities; (2) they manually constructed a small annotated corpus (containing 3026 messages). This corpus contains Arabizi messages, which were then transliterated into Arabic. In [12], the authors automatically constructed an annotated Arabizi sentiment corpus and directly applied sentiment classification without any transliteration or translation process. However, the authors were confronted with several ambiguity problems, which resulted in a low F1 score of 66%. The same test corpus used in [12] was also used in [10], where the authors improved the results by adding a transliteration step. The authors used a large sentiment corpus constructed automatically by relying on a sentiment lexicon (also constructed automatically [39]). The results were up to 76% for automatic transliteration and up to 78% for manual transliteration.
Hence, it can be seen that two trends are emerging for handling Arabizi: (1) considering Arabizi as a language in its own right and relying on an annotated Arabizi corpus; (2) transliterating Arabizi into Arabic and relying on the transliterated annotated corpus. Many works have been proposed to transliterate Arabizi into Arabic. Some of them consider a set of rules [10, 70,71]. Others rely on a parallel corpus (Arabizi/Arabic) and consider the transliteration task as a translation task at the character level [7,72,73]. The usefulness of transliteration has been shown and illustrated in different studies. Almost all the annotated sentiment corpora are in Arabic (not in Arabizi), so the aim of transliteration is to transform Arabizi into Arabic. However, another way could also lead to Arabic: translation. Although translation allows us to transform Arabizi into Arabic, no research work considers this option. In this paper, we consider this new perspective for handling Arabizi, which involves machine translation. Although no work has been proposed for Arabizi sentiment analysis after translation, many works have been proposed for Arabic machine translation. Some works also consider the effect of translation on the sentiment analysis results. The following part briefly describes some of these works.

Arabic Translation and Sentiment Analysis
During the last decades, several approaches have been proposed for translating Arabic to and from other spoken languages [74][75][76]. Arabic is also used as a pivot in many works concentrating on dialectal Arabic [77]. The proximity of dialectal Arabic to MSA makes the mapping easier than direct MT, and several researchers have explored this direction [77,78]. The main challenge in developing any MT system is the lack of data. This challenge is accentuated in the case of Arabic and its dialects, where parallel corpora are rarely publicly available. Some dialects suffer more from this lack than others. For example, for the Algerian dialect, only one parallel corpus is publicly available (PADIC) [28], which contains 6,412 sentences translated from the Algerian dialect to MSA. Some work has been done on Arabizi translation [70,73,79,80]. However, these works apply transliteration before the translation step.
The idea of analysing sentiments after the automatic translation of messages has been explored in many works [81,82]. However, to the best of our knowledge, only two works have been done on Arabic [36,83]. Rafaee et al. [83] presented a sentiment analysis approach that uses freely available MT systems to translate Arabic tweets to English, which the authors then label for sentiment using a state-of-the-art English SA system. The authors of the cited work affirm that MT-based SA is a cheap and effective alternative to building a complete SA system when dealing with under-resourced languages. Salameh et al. [36] achieved competitive results even with automatic translation. Both papers present the same idea: translating Arabic messages into English and then using English resources to determine the sentiment.
However, both papers concentrate on Arabic only (omitting its dialects and especially Arabizi). Table 2 summarises and classifies the main works and resources presented in this section.

Figure 1 summarises the main steps of the proposed approach:
- Corpus extraction
- Arabizi transliteration
- Arabizi translation
- Arabic sentiment analysis

Corpus Extraction
Text messages written in MSA/DALG are extracted from Facebook using two methods. In the first method, we extract the comments from 226 popular Algerian pages such as Ooredoo, HamoudBoualem, and Ruiba (pages that belong to commercial companies, the press, and public personalities). The most popular Facebook pages are identified using the statistics offered by the SocialBakers website. In the second method, Facebook content is searched using the Facebook Rest API with MSA/DALG words. The DALG terms are obtained from two sources. The first source is PADIC, which is a parallel multi-dialectal corpus containing parallel DALG-MSA pairs [28]. The second one is our translated lexicon, described in the Lexicon Construction and Review section. Using both methods, a corpus containing 15,407,910 messages is collected. After filtering out non-Arabic messages, 7,926,504 messages are retained. To extract the Arabic message, the [...]

Arabizi Transliteration
For Arabizi transliteration, we rely on the approach proposed by Guellil et al. in [71] and later used for sentiment analysis by Guellil et al. The messages are first preprocessed. Afterwards, a set of passage rules is applied (e.g. the letter 'a' can be replaced by several Arabic letters, or by no letter at all when it represents a diacritic). By applying the different replacements and rules, each Arabizi word is mapped to several candidate words in Arabic. For example, the word "kraht" (meaning "I hate") generates 32 possible candidates, only one of which is the correct transliteration. To extract the best candidate for the transliteration of a given Arabizi word into Arabic, a language model is constructed and applied.
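The candidate-generation and selection idea can be sketched as follows. This is a toy illustration under loud assumptions: the rule table `PASSAGE_RULES`, the counts in `TOY_COUNTS` and both function names are hypothetical, and the unigram frequency lookup merely stands in for the language model, whose exact form is not specified here. With this reduced table, the number of candidates differs from the 32 reported above.

```python
from itertools import product

# Hypothetical, heavily simplified passage rules (NOT the rule set of [71]).
# An empty string means the Latin letter is dropped (it stood for a diacritic).
PASSAGE_RULES = {
    "k": ["ك"],
    "r": ["ر"],
    "a": ["ا", "ع", ""],
    "h": ["ه", "ح"],
    "t": ["ت", "ط"],
}


def generate_candidates(word):
    """Return every Arabic-script candidate obtained by combining the rules."""
    options = [PASSAGE_RULES.get(ch, [ch]) for ch in word.lower()]
    return sorted({"".join(parts) for parts in product(*options)})


def best_candidate(word, unigram_counts):
    """Pick the candidate with the highest count in a toy unigram model."""
    candidates = generate_candidates(word)
    return max(candidates, key=lambda cand: unigram_counts.get(cand, 0))


# Made-up counts standing in for a language model trained on Arabic text.
TOY_COUNTS = {"كرهت": 12, "كراهت": 1}

candidates = generate_candidates("kraht")
chosen = best_candidate("kraht", TOY_COUNTS)
```

A real system would score candidates in context (for instance with an n-gram model over the whole message) rather than word by word.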

Arabizi Translation
From the corpus automatically extracted from Facebook, 2,924 comments were randomly selected. These comments were manually translated into Arabic (MSA). Table 3 presents a set of samples included in our parallel corpus. Our parallel corpus covers the pair Arabizi/MSA; the English translation is added to the table only for clarity. Inspired by the work presented in [28,77,78] on statistical machine translation of Arabic and its dialects, we propose three main steps: (1) language model training, (2) alignment, and (3) tuning. For training the language model, the large Arabic-script corpus extracted from Facebook is used. The parallel corpus was divided into two parts. The first one, containing 90% of the whole corpus (2,632 comments), is used for training. The second one, containing 10% of the corpus (292 comments), is used for validation. Subsequently, the alignment model and tuning methods are used to select the best translation. Inspired by [26], we used the open-source Moses toolkit [84] to build a phrase-based MT system with default settings: bidirectional phrase and lexical translation probabilities, a distortion model, a word and a phrase penalty, and a trigram language model. We used GIZA++ [85] for alignment and KenLM [86] to compute trigram language models.

Lexicon Construction and Review
For lexicon construction, we rely on the approach proposed by Guellil et al. [39]. The main idea behind this construction is to automatically translate an existing English lexicon to DALG and MSA using the Glosbe API. In this work, we automatically translate the SOCAL lexicon (containing 6769 terms among adjectives, verbs, nouns, and adverbs) [2]. Each translated word is assigned the score of the English word from which it is translated. For example, all the translations of the English word 'excellent', which has a score of +5, such as باهي (bAhy, meaning brilliant), لطيف (lTyf, meaning nice) and مليح (mlyH, meaning good), are assigned a score of +5. Since some Arabic words result from different English words having different sentiment scores, an average score is assigned to such Arabic words. For example, the word مليح (mlyH, meaning good) can be the translation of the English term 'excellent' (with an associated score of +5); however, it can also be translated from the English term 'good' (with an associated score of +3). Hence, the Arabic term is associated with the average of all the sentiment scores of the English terms it is translated from.
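The score-propagation rule can be illustrated with a short Python sketch. It reuses the 'excellent'/'good' example from the text, with target words written in the transliterated forms given above; the function name `build_target_lexicon` and the dictionary layout are assumptions made for illustration, and the translation step (done with an external service in this work) is replaced by a hard-coded toy mapping.

```python
from collections import defaultdict


def build_target_lexicon(english_scores, translations):
    """Give each translated word the average score of its English sources."""
    collected = defaultdict(list)
    for english_word, score in english_scores.items():
        for target_word in translations.get(english_word, []):
            collected[target_word].append(score)
    return {word: sum(scores) / len(scores) for word, scores in collected.items()}


# Toy data taken from the example in the text.
english_scores = {"excellent": 5, "good": 3}
translations = {
    "excellent": ["bAhy", "lTyf", "mlyH"],
    "good": ["mlyH"],
}

lexicon = build_target_lexicon(english_scores, translations)
# "mlyH" comes from both 'excellent' (+5) and 'good' (+3), so it gets (5 + 3) / 2.
```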
The lexicon resulting from this approach contains 2,384 entries. Afterwards, we manually review this lexicon and delete ambiguous words in order to increase the annotation precision. Finally, we obtain a sentiment lexicon containing 1745 terms in both MSA and DALG, of which 968 are negative, 771 are positive, and 6 are neutral.

Corpus Construction and Review
The constructed lexicon is used to automatically provide a sentiment score for DALG utterances. This process provides a baseline for different experiments. The lexicon is then used to build a large sentiment corpus. To calculate the score, we consider (1) opposition, which is generally expressed in DALG with the keyword بصح (bSH, meaning "but"); (2) multi-word expressions, because the constructed lexicon contains multi-word entries; (3) DALG morphology, by employing a simple rule-based light stemmer that handles DALG prefixes and suffixes; and (4) negation, which can reverse polarity. Negation in DALG is usually expressed as an attached prefix, suffix, or a combination of both. To score a message, the sentiment scores of all the words in the message are averaged. Finally, a balanced dataset is constructed by keeping the same number of positive and negative messages. The resulting corpus contains 255,008 messages (127,504 positive and 127,504 negative).
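A minimal sketch of lexicon-based message scoring with negation handling is given below. Several points are assumptions made only for illustration: the toy lexicon and its scores, the negation affixes (`mA` as prefix and `$` as suffix, in the same transliteration scheme as the examples above), and the choice to average over the words found in the lexicon. Opposition, multi-word expressions and the light stemmer are deliberately left out.

```python
# Hypothetical toy lexicon (transliterated forms, made-up scores).
TOY_LEXICON = {"mlyH": 3.0, "bAhy": 5.0, "krht": -4.0}

# Assumed negation affixes; the actual rules of the approach are not reproduced here.
NEG_PREFIX = "mA"
NEG_SUFFIX = "$"


def word_score(token, lexicon):
    """Return the score of a token, reversing polarity if it carries a negation affix."""
    if token in lexicon:
        return lexicon[token]
    stem, negated = token, False
    if stem.startswith(NEG_PREFIX):
        stem, negated = stem[len(NEG_PREFIX):], True
    if stem.endswith(NEG_SUFFIX):
        stem, negated = stem[:-len(NEG_SUFFIX)], True
    if negated and stem in lexicon:
        return -lexicon[stem]
    return None  # unknown word: carries no sentiment information


def score_message(message, lexicon):
    """Average the scores of the words that are found in the lexicon."""
    scores = [word_score(token, lexicon) for token in message.split()]
    scores = [s for s in scores if s is not None]
    return sum(scores) / len(scores) if scores else 0.0


def label(score):
    """Turn a numeric score into a class label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

For instance, `score_message("bAhy mlyH", TOY_LEXICON)` averages +5 and +3, while the negated form `mAmlyH$` receives the reversed score of `mlyH`.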
By analysing the automatically annotated corpus, we observe that some messages were wrongly annotated. For example, a message meaning "Djabou, the excellency of the name is sufficient" was annotated as negative although it is positive. Another example is a message meaning "guide the play, we hope God brings the good things" ("we hope God brings the good things" is an expression used to speak about bad things), which was wrongly annotated as positive. To construct the reviewed corpus, the messages that were correctly annotated were kept, and those that were wrongly annotated were corrected. Also, some objective messages (not holding a sentiment) were deleted. The resulting corpus contains 3048 messages (1488 positive and 1560 negative). To the best of our knowledge, this is the first manually checked annotated sentiment corpus that handles DALG as well as MSA. We also use it for evaluating our automatic annotation. Among the 3048 messages that were manually reviewed, 2596 messages (85.17%) were correctly annotated.

Sentiment Classification
For classification, we use two kinds of algorithms, shallow and deep. For both, we extract features with word embedding techniques. For shallow classification, we use the Word2vec algorithm, while for deep classification we use both Word2vec and fastText.

Word2Vec + Classical Machine Learning Algorithms
For Word2vec, we used a context window of 10 words to produce representations of length 300 for both CBOW and SG. We trained the Word2vec models on the messages that appear in the training sets. In this work, we used the model presented by Altowayan et al. [61]. However, that work relies only on the CBOW representation, whereas we rely on both the CBOW and SG representations. For classification, we use five algorithms: GaussianNB (GNB), LogisticRegression (LR), RandomForest (RF), SGDClassifier (SGD, with loss='log' and penalty='l1') and LinearSVC (LSVC, with C=1e1).
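A shallow classifier needs one fixed-length vector per message. The text does not state how the word vectors are pooled, so the sketch below assumes simple averaging, which is a common choice; the function name `message_vector` and the toy 3-dimensional vectors are hypothetical (the paper uses 300 dimensions).

```python
def message_vector(tokens, embeddings, dim):
    """Average the vectors of the tokens found in `embeddings`.

    Out-of-vocabulary tokens are skipped; a message with no known
    token is mapped to the zero vector.
    """
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Toy embeddings standing in for a trained CBOW or SG model.
EMBEDDINGS = {"mlih": [1.0, 0.0, 2.0], "bezaf": [3.0, 2.0, 0.0]}
features = message_vector(["mlih", "bezaf", "xyz"], EMBEDDINGS, dim=3)
```

Vectors built this way could then be passed to any of the five classifiers listed above.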

Word2Vec/fastText + Deep Learning Algorithms
Three deep learning classifiers were used: CNN, LSTM and Bi-LSTM. Each model has six layers. The first layer is a randomly-initialised word embedding layer that turns the words of a sentence into a feature map. The weights of embedding_matrix are calculated using Word2vec and fastText (with both the SG and CBOW implementations). This layer is followed by a CNN, LSTM or Bi-LSTM layer (depending on the model) that scans the feature map. These layers are used with 300 filters and a width of 7, which means that each filter is trained to detect a particular pattern in a 7-gram window of words. Global max-pooling is then applied to the output of the CNN/LSTM/Bi-LSTM layer to take the maximum score of each pattern. The main function of the pooling layer is to reduce the dimensionality of the CNN/LSTM/Bi-LSTM representations by downsampling the output and keeping the maximum value. To reduce over-fitting by preventing complex co-adaptations on the training data, a Dropout layer with a probability of 0.5 is added. The obtained scores are then fed to a single feed-forward (fully connected) layer with ReLU activation. Finally, the output of that layer goes through a sigmoid layer that predicts the output class. For all the models, we used the Adam optimiser with 100 epochs and an early_stopping parameter that stops training in the absence of improvement.
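To make the roles of the filter width and of global max-pooling concrete, the following dependency-free sketch applies two hand-set filters of width 2 to a toy sentence of four 2-dimensional word vectors. It is only an illustration: the paper uses 300 learned filters of width 7 over 300-dimensional embeddings, and the function names `conv1d` and `global_max_pool`, the weights and the inputs are hypothetical.

```python
def conv1d(sequence, kernel, bias=0.0):
    """Slide one filter over a sequence of word vectors (no padding).

    sequence: list of T word vectors of size d
    kernel:   list of w weight vectors of size d (w is the filter width)
    Assumes T >= w, i.e. messages are padded to at least the filter width.
    """
    width = len(kernel)
    feature_map = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        activation = bias
        for word_vec, kernel_vec in zip(window, kernel):
            activation += sum(x * k for x, k in zip(word_vec, kernel_vec))
        feature_map.append(activation)
    return feature_map


def global_max_pool(feature_maps):
    """Keep the maximum response of each filter over all window positions."""
    return [max(feature_map) for feature_map in feature_maps]


sentence = [[1.0, 0.0], [0.0, 3.0], [2.0, 2.0], [0.0, 0.0]]
filters = [
    [[1.0, 0.0], [0.0, 1.0]],  # responds to the pattern (dim 0, then dim 1)
    [[0.0, 1.0], [1.0, 0.0]],  # responds to the pattern (dim 1, then dim 0)
]
feature_maps = [conv1d(sentence, f) for f in filters]
pooled = global_max_pool(feature_maps)
```

Each filter produces one value per window position, and pooling reduces this to one value per filter, so the pooled vector has a fixed length regardless of message length.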

Dataset
For evaluating the proposed approach, different corpora were constructed and used:
-An Arabizi sentiment corpus [12], transliterated automatically in [10] with an accuracy of 72.05% and containing 500 Facebook comments (250 positive and 250 negative).
-Test_Ar_Tr_manu, which is the same Arabizi sentiment corpus [12], transliterated manually in [10].
-Test_Ar_Translation_auto, which is an Arabizi sentiment corpus, first used in [12] and translated automatically (the BLEU score of the automatic translation is up to 8.13).
-Test_Ar_Translation_manu, which is the same Arabizi sentiment corpus [12], translated manually.
-SANA_Alg, an Algerian sentiment corpus containing 513 messages (236 positive, 194 negative and 83 neutral) extracted from news, political, religion, sports and society articles selected from Algerian Arabic newspaper websites.

Metrics
In total, five metrics are used for evaluating the proposed system. To evaluate the transliteration module, the Accuracy (A) is used. Accuracy, as shown in Eq. 1, represents the number of words correctly transliterated divided by the total number of words. To evaluate the translation module, the BLEU score is used [87]. The BLEU score, as shown in Eq. 2, represents the geometric mean of the modified n-gram precision scores over the test corpus, multiplied by an exponential brevity penalty factor. To evaluate the sentiment analysis module, three metrics are used: Precision (P), Recall (R) and F1 score (F1). Precision, as shown in Eq. 3, represents the number of sentiments correctly labelled as belonging to the positive class divided by the total number of sentiments labelled as belonging to the positive class. Recall, as shown in Eq. 4, represents the number of true positives divided by the total number of opinions that belong to the positive class. Finally, the F1 score, as shown in Eq. 5, represents the harmonic mean of precision and recall [88].

$$A = \frac{\mathrm{NB\_Correct}}{\mathrm{NB\_Total}} \tag{1}$$

$$\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right) \tag{2}$$

$$P = \frac{TP}{TP + FP} \tag{3}$$

$$R = \frac{TP}{TP + FN} \tag{4}$$

$$F1 = \frac{2 \cdot P \cdot R}{P + R} \tag{5}$$

$$\mathrm{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{(1 - r/c)} & \text{if } c \le r \end{cases} \tag{6}$$

where NB_Correct represents the number of words correctly transliterated and NB_Total represents the total number of words. In Eq. 2, p_n is the modified n-gram precision, w_n is the weight given to n-grams of order n, and N is the maximum n-gram order. BP, as shown in Eq. 6, represents the brevity penalty comparing the length of the candidate translation c and the effective reference corpus length r. TP represents true positives (i.e. manually annotated as positive and predicted by the model as positive). TN represents true negatives (i.e. manually annotated as negative and predicted by the model as negative). FP represents false positives (i.e. manually annotated as negative and predicted by the model as positive). FN represents false negatives (i.e. manually annotated as positive and predicted by the model as negative).
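The metric definitions above translate directly into code. The sketch below follows those definitions; the function names and the convention of returning 0.0 when a denominator is zero are assumptions of this example, and the brevity penalty assumes a non-empty candidate translation (c > 0).

```python
import math


def accuracy(nb_correct, nb_total):
    """Share of correctly handled items (Eq. 1)."""
    return nb_correct / nb_total


def brevity_penalty(c, r):
    """BLEU brevity penalty for candidate length c and reference length r (Eq. 6)."""
    return 1.0 if c > r else math.exp(1 - r / c)


def precision_recall_f1(gold, predicted, positive="positive"):
    """Precision, recall and F1 with respect to the positive class (Eqs. 3 to 5)."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


gold = ["positive", "positive", "positive", "negative", "negative"]
pred = ["positive", "positive", "negative", "positive", "negative"]
p, r, f1 = precision_recall_f1(gold, pred)
```

Applied to the manual review reported earlier, `accuracy(2596, 3048)` reproduces the 85.17% of correctly annotated messages.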

Experimental Results
The aim of these experiments is first to synthesise and compare the results obtained using both training corpora (i.e. the one constructed automatically and the one reviewed manually). Second, the sentiment analysis results obtained using both techniques for handling Arabizi (i.e. transliteration and translation) need to be compared. Third, the best model for extracting features (i.e. Word2vec or fastText) needs to be identified. Fourth, the most suitable family of classification algorithms (classical or deep learning) needs to be determined. Finally, the most suitable deep learning algorithm for Arabic sentiment analysis (i.e. CNN, LSTM or Bi-LSTM) needs to be highlighted.

Results Using Word2Vec + Classical Machine Learning Algorithms
Both the SG and CBOW models were used. However, the CBOW model gives the best results when it is associated with the classical algorithms.

Results Using Word2Vec/fastText + Deep Learning Algorithms
As in the previous experiments, both the CBOW and SG models were used. However, we now present the results obtained using the SG model because it outperforms the CBOW model. It can be seen from Table 5 that manually reviewing the automatic annotation improves performance. The results obtained with Test_SentiAlg are up to 0.82, whereas they are up to 0.89 with Corpus_manu.
Concerning the feature extraction models, Word2vec and fastText, it can be seen that the best results on Corpus_auto were obtained with both models. However, fastText outperforms Word2vec on Corpus_manu. It can also be seen from Table 5 that CNN outperforms all the other classifiers on Corpus_manu, while both CNN and Bi-LSTM give remarkable results on Corpus_auto. Finally, Table 5 shows that the results for Arabizi after transliteration (up to 0.71/0.80 for the automatic transliteration and up to 0.74/0.80 for the manual transliteration) are more promising than those obtained after translation (up to 0.61/0.69 for the automatic translation and up to 0.66/0.79 for the manual translation). However, the difference between automatic and manual transliteration is smaller than the difference between automatic and manual translation. The results for manual transliteration and manual translation are almost the same for Corpus_manu, the corpus constructed in a semi-supervised way. This highlights the effectiveness of both techniques for handling Arabizi. However, the translation approach requires many improvements, starting with enriching the parallel corpus.

Discussion
To show the efficiency of our approach and corpus, we carried out many experiments on several test corpora previously used in the research literature. The Senti_Alg corpora (i.e. Senti_Alg_test_Arabic, Senti_Alg_test_trauto and Senti_Alg_test_trmanu) were presented and used in several research papers [10, 12, 39]. The results related to Test_SentiAlg_Arabic (up to 87.77%) are very encouraging. These results were obtained using the CBOW model associated [...] [29] was up to 75%. Hence, our approach and corpus lead to an improvement of 6% on this corpus.
Finally, our corpus and approach were also evaluated on MSA and another dialect (Egyptian dialect) using the ASTD/QCRI/ArTwitter corpus, which was used by Altowayan et al. [61]. In [61], the corpus was classified into two classes only (i.e. positive and negative). As we also focus on binary classification, it was more practical to compare our results to those obtained by these authors [61] rather than to the results obtained for each corpus separately. The best F1 score obtained by Altowayan et al. [61] is 79.62%, whereas the best F1 score that we obtained is 80.58%. Moreover, this corpus is dedicated to MSA with a focus on the Egyptian dialect (for ASTD). Hence, our approach and corpus, which are dedicated to the Algerian dialect, outperform the results presented for corpora dedicated to MSA and the Egyptian dialect.

Analysis
After presenting and comparing all the results related to the presented approach, we can answer the research questions presented in the Introduction. The answers are given below.
1. What is the best option for handling Arabizi (i.e. transliteration or translation)? From the presented results, it can be seen that transliteration is more suitable for Arabizi sentiment analysis. However, the poor results associated with translation are not due to the technique itself but to the proposed approach, which relies principally on a small parallel corpus of only 2,924 parallel sentences.
2. How could we improve the transliteration approach used? The principal error in the transliteration process is related to the technique for choosing the best candidate. The idea of the language model is to select the candidate with the highest number of occurrences. However, in some cases, this technique returns an incorrect candidate. For example, the word "rakom", meaning "you are", is transliterated to a word meaning "a number" rather than to the correct form. A solution to this problem is to integrate other parameters for determining the best candidate, such as distance.
3. How could we improve the proposed translation approach? To improve the translation results, particular attention should first be given to the construction and enrichment of the parallel corpus. 2,924 parallel sentences are not enough for training a statistical machine translation system. Relying on neural machine translation is also expected to improve the results. However, neural network models require large corpora for the training phase.
4. What is the best technique for Arabic sentiment analysis (i.e. supervised or semi-supervised approach)? The presented results highlight that the corpus constructed semi-automatically outperforms the corpus constructed purely automatically. Hence, a semi-supervised approach requires less effort and time than a manual one, and its results are better than those of an automatic one.
5. How could we improve the automatic annotation approach? Some sentiment classification errors are due to transliteration errors for Arabizi. For example, "khlwiya", meaning excellent and quiet, is wrongly transliterated to a word meaning "empty" rather than to the correct form. Improving transliteration will therefore improve sentiment classification. Also, the automatic annotation is based on a reduced lexicon of only 1,745 terms, so the vocabulary on which we based our annotation is relatively small. The manually reviewed corpus is also small (only 3,048 messages), whereas most corpora in the literature contain more than 10,000 messages. Hence, enriching both the sentiment lexicon and the annotated sentiment corpus should improve the results.
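The suggestion made in the answer on transliteration, combining occurrence counts with a distance measure when choosing the best candidate, could be sketched as follows. Everything specific here is hypothetical: candidates are written as Latin placeholders rather than Arabic script, the counts are invented, the distance is a plain Levenshtein distance between the Arabizi input and a romanised form of each candidate, and the scoring formula is only one possible way of combining the two signals.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def best_candidate(source, candidates, alpha=1.0):
    """Pick the candidate with the best frequency/distance trade-off.

    candidates: dict mapping a romanised candidate to its corpus frequency.
    alpha:      weight of the distance penalty (hypothetical parameter).
    """
    def score(item):
        form, frequency = item
        return frequency / (1.0 + alpha * levenshtein(source, form))
    return max(candidates.items(), key=score)[0]


# Invented counts: the frequent form means "a number", the rarer one "you are".
candidates = {"rakm": 120, "rakom": 80}
by_frequency = max(candidates, key=candidates.get)
by_combined_score = best_candidate("rakom", candidates)
```

In this toy setting, frequency alone selects the form meaning "a number", while the combined score selects the form that matches the input more closely.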

Open Issues
From this study and its results, we have identified several research directions that deserve a more in-depth study.

Proposing a Statistical Machine Transliteration System
To improve our transliteration approach, we plan to use the presented system to automatically transliterate an Arabizi corpus and then to manually review the transliteration pairs. Our system gave a precision of more than 70%. Hence, correcting the 30% of wrongly transliterated messages requires less effort than constructing 100% of a parallel corpus from scratch. We could then treat transliteration as a translation task.

Proposing an Arabizi Identification Module
One of the issues related to Arabizi processing is the confusion between Arabizi, English and French. In this paper, we assume that our input consists of messages written in Arabic, Arabizi or both scripts. However, in real life, this is not the only case: the system could be asked to transliterate a message written in French or English. To resolve this problem, we plan to work on an identification system. We previously constructed a bilingual lexicon [5] and used it to propose a rule-based identification system [89]. We plan to improve this identification system by treating identification as a classification problem with three classes: (1) messages written in Arabizi, (2) messages written in French and (3) messages written in English. Hence, we could use machine learning algorithms to detect Arabizi messages.
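As an illustration of treating identification as a three-class problem, the sketch below trains a small multinomial Naive Bayes classifier over character trigrams. The choice of classifier, the trigram features, the class name `NgramNaiveBayes` and the six training sentences are assumptions made for this example; the paper does not prescribe a particular algorithm, and a realistic system would need far more training data.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Lower-cased character n-grams, spaces included."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NgramNaiveBayes:
    """Multinomial Naive Bayes over character n-grams with add-one smoothing."""

    def fit(self, texts, labels):
        self.ngram_counts = defaultdict(Counter)
        self.doc_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.ngram_counts[label].update(char_ngrams(text))
        self.vocabulary = {g for c in self.ngram_counts.values() for g in c}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.ngram_counts.items():
            total = sum(counts.values())
            score = math.log(self.doc_counts[label] / total_docs)
            for gram in char_ngrams(text):
                score += math.log((counts[gram] + 1) / (total + len(self.vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Tiny hypothetical training set, two messages per class.
texts = [
    "wach rak khoya", "rani mlih hamdoullah",
    "bonjour comment allez vous", "je suis content de vous voir",
    "hello how are you", "i am very happy today",
]
labels = ["arabizi", "arabizi", "french", "french", "english", "english"]
identifier = NgramNaiveBayes().fit(texts, labels)
```

A message such as `identifier.predict("wach rakom khoya")` would then be routed to the transliteration module only if it is identified as Arabizi.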

Enriching the Proposed Lexicon Automatically
Our proposed lexicon was constructed automatically by translating an existing English lexicon and was then manually reviewed. The resulting lexicon contains only 1,745 entries. The problem with such a reduced lexicon is that it does not cover the whole vocabulary, so it cannot analyse the sentiment of all messages. The proposed lexicon could be enriched using Word2vec, which can return the words that are semantically closest to a given word (i.e. the words that have similar vectors). However, the problem with this technique is that the two words "good" and "bad" are returned together, as their vectors are very close. This is understandable, since these two words frequently appear in the same contexts. Hence, our main challenge in handling this issue is to resolve the "good/bad" situation.
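The "good/bad" situation can be reproduced with a few hand-made vectors. The sketch below ranks neighbours by cosine similarity, which is the usual similarity measure for word embeddings; the 3-dimensional vectors are invented so that "good" and "bad" share most of their context dimensions, and do not come from any trained model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=2):
    """Return the `topn` words whose vectors are closest to that of `word`."""
    ranked = sorted(
        ((cosine(vectors[word], vec), other)
         for other, vec in vectors.items() if other != word),
        reverse=True,
    )
    return [other for _, other in ranked[:topn]]


# Invented vectors: "good" and "bad" occur in nearly the same contexts.
VECTORS = {
    "good": [0.9, 0.8, 0.1],
    "bad": [0.8, 0.9, 0.1],
    "great": [0.9, 0.7, 0.2],
    "table": [0.1, 0.0, 0.9],
}
neighbours = most_similar("good", VECTORS, topn=2)
```

Here the two nearest neighbours of "good" include "bad", so adding neighbours to the lexicon with the polarity of the seed word would introduce wrongly labelled entries.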

The Application of This Approach to More Dialects and Languages
It can be seen from the obtained results that our approach outperforms the results presented in the research literature, even with an Algerian corpus. To perform multi-dialect sentiment analysis, we need a training corpus for each dialect. To obtain these training corpora, we propose to extend this approach to other dialects. This approach could also be applied to other languages. Moreover, it could be employed for other NLP problems requiring a training corpus (especially when the training corpus is used for classification).

Using This Approach in a Real-Life Application Case
Many recent applications and problems need sentiment analysis, for example hate-speech detection. According to Nockleby, hate speech is commonly defined as any communication that disparages or defames a person or a group based on some characteristic such as race, colour, ethnicity, gender, sexual orientation, nationality, religion, or other characteristics. Many approaches have been proposed for hate-speech detection. Some researchers consider hate speech as a strong negative sentiment. Hence, we could use a part of our corpus in the context of hate-speech detection. We could also use the same approach to automatically construct a corpus dedicated to hate-speech detection.
To sum up, this paper handles sentiment analysis of Arabic and its dialects by focusing on both scripts: Arabic and Arabizi. It proposes new techniques and approaches for handling Arabizi and for constructing resources with a minimum of effort. It also shows that reviewing a resource constructed automatically is better than constructing it from scratch in terms of effort, time and results. However, this approach, like all the approaches presented in the research literature, is not perfect. Some issues were observed, and handling them requires developing other approaches related to other NLP fields. We therefore join Erik Cambria in describing sentiment analysis as a big suitcase of natural language processing (NLP) problems. Sentiment analysis has long been mistaken for the task of polarity detection. However, polarity detection is just one of the many NLP problems that need to be solved to achieve human-like performance in sentiment analysis [90].

Conclusion and Perspectives
In this paper, we proposed a sentiment analysis approach dedicated to Arabic and its dialects, and we applied it to DALG/MSA Facebook messages. The principal strengths of this approach are that we automatically constructed a sentiment corpus that we reviewed manually (to increase the classification precision) and that we handle both scripts, Arabic and Arabizi. Another important aspect is that we relied on different word embedding models and different deep learning classifiers (to compare the results). The obtained results are very encouraging (F1 up to 89% for extrinsic experiments using CNN), and they outperform the results reported in the research literature (with a difference of up to 25%). Also, for handling Arabizi, both transliteration and translation were used.
After analysing the different classification errors, we highlighted different issues that we plan to address in our future work by integrating the following points:
-Proposing a transliteration system based on a corpus-based approach.
-Enriching the parallel corpora and proposing a neural machine translation system.
-Extending the constructed lexicon using Word2vec.
-Extending the constructed annotated corpus.
-Proposing classifiers which combine between different models.
-Extending the approach to other dialects by starting with Maghrebi dialect which shares many characteristics with DALG.

Compliance with Ethical Standards
Conflict of interest The authors declare that they have no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.