Comparative Performance of Ensemble Machine Learning for Arabic Cyberbullying and Offensive Language Detection

In recent years, research on abusive language and cyberbullying detection has gained a great deal of interest, since such content affects both individual victims and societies. Hateful communications, bullying, sexism, racism, aggressive content, harassment, toxic comments, and other forms of abuse have all increased dramatically due to the ease of access to social media platforms such as Facebook, Instagram, Twitter, and others. As a result, there is a significant need to identify, manage, and restrict the spread of offensive content on social networking sites, prompting us to perform this study to automate the detection of offensive language and cyberbullying. Since a balanced dataset tends to yield higher-accuracy models, we built a new balanced Arabic dataset to be used in offensive language detection. Lately, ensemble machine learning has been used to enhance the performance of single classifiers. The aim of this study is to compare the performance of different single and ensemble machine learning algorithms in detecting Arabic text containing cyberbullying and offensive language. For this purpose, we chose three machine learning classifiers and three ensemble models and applied them to three Arabic datasets: two publicly available offensive language datasets and a third one that we constructed. The results show that the ensemble machine learning methodology outperforms the single learner machine learning approach. Voting performs best among the trained ensemble machine learning classifiers, with accuracy scores of 71.1%, 76.7%, and 98.5% on the three datasets respectively, exceeding the scores obtained by the best single learner classifiers (65.1%, 76.2%, and 98%) on the same datasets. Finally, we applied hyperparameter tuning on the Arabic cyberbullying dataset to optimize the performance of the voting technique.


Introduction
For many users, online social networks (OSNs) are becoming the most prevalent and interactive media.
The majority of individuals use social media without considering the impact these networks have on our lives, whether beneficial or harmful. However, along with valuable and interesting content, these networks can also broadcast inappropriate or harmful content, such as cyberbullying, hate speech, and insults [22][23]. The detection of such language is essential because it may cause emotional distress and affect the mental health of social media users [1].
The Arabic language is the fifth most spoken in the world, with more than 420 million speakers [2]. The use of the Arabic language in social media is widespread and continually increasing. As of 2017, the Arabic Social Media Report estimates that Facebook users from the Arab region constitute 8.4% of all Facebook users, which is more than 150 million Arab users [3]. Classifying Arabic social media texts is a difficult task because the linguistic format can be sophisticated or slang. The Arabic language has multiple dialects with different lexicons and structures, making high-performance classification difficult.
Ensemble Machine Learning is a machine learning methodology that integrates multiple distinct prediction models into a single model to improve performance. It should be considered whenever good predictive accuracy is demanded [4]. In addition, ensemble classifiers have been shown to be more effective than data resampling techniques at enhancing classification performance on imbalanced data [25]. A balanced dataset tends to yield models with higher accuracy, higher balanced accuracy, and a balanced detection rate. The results obtained by Qiong W. et al. [26] demonstrate that using balanced training data (50% neutral and 50% deleterious) results in the highest balanced accuracy (the average of the True Positive Rate and the True Negative Rate). Hence, it is important to have a balanced dataset for a classification model, and we therefore built a new balanced Arabic dataset to be used in offensive language detection. We also apply ensemble machine learning models to detect Arabic offensive posts on both balanced and imbalanced datasets, and we use several machine learning models and feature extraction techniques to compare the performance of single ML classifiers and ensemble ML in Arabic abusive language and cyberbullying detection.
The remainder of the paper is structured as follows: Section 2 discusses related work. Section 3 provides background on ensemble machine learning models. Section 4 presents the proposed method for Arabic abusive language and cyberbullying detection. Finally, results and discussion are presented in Section 5.

Related Work
Research on Arabic abusive language detection has recently drawn much attention. Hamdy et al. have made many contributions in the Arabic language field, especially in offensive language detection. In [6], they show how to use popular trends in offensive and rude communications to build and extend a list of offensive words and hashtags using an automated tool. Twitter users were also ranked based on whether or not they use any of these offensive terms in their tweets. Using this classification, they expand the list of bad words and present the results on a newly created dataset of Arabic tweets labeled as obscene, offensive, or clean. They also publicly released a large corpus of classified user comments that had been removed from a famous Arabic news site for violating its rules and guidelines. In [14], they present a rapid way of creating a training dataset for identifying offensive tweets using a seed list of offensive words.
They trained a deep learning classifier based on character n-grams that can efficiently classify tweets with a 90% F1 score. They recently made public a new dialectal Arabic news comment dataset, collected from a variety of social media platforms, including Twitter, Facebook, and YouTube. In abusive comments, they investigate the unique lexical content in connection with the use of emojis. The results show that the multi-platform news comment dataset can capture diversity across various dialects and domains. In addition to evaluating the models' generalization power, they also presented an in-depth analysis of emoji usage in offensive comments. The findings suggest that emojis in the animal category are exploited in offensive comments, consistent with the lexical observations [15].
Abozinadah E. et al. [7] evaluate various machine learning algorithms for detecting abusive accounts with Arabic tweets. The dataset for this analysis was collected based on the top five Arabic swear words, yielding 255 unique users. A Naïve Bayes (NB) classifier with 10 tweets and 100 features had the best performance, with a 90% accuracy rate. An Arabic word correction method was also suggested in [8] to tackle internet censorship systems and content-filtering vulnerabilities. This method achieved an accuracy of 96.5%. A statistical learning approach was used in [9] to detect adult accounts with the Arabic language in social media. The uses of obscenity, vulgarity, slang, and swear words in Arabic content on Twitter were examined in order to identify abusive accounts. With this statistical method, a predictive precision of 96% was achieved, and the limitations of the bag-of-words (BOW) approach were overcome.
Alakrot et al. [10] built an Arabic dataset of YouTube comments to detect offensive language in a machine learning context. The data were collected following the principles of availability, variety, representativeness, and balance, thus ensuring that predictive analytical models for identifying abusive language in online communication in Arabic can be trained on it. Recently, Haddad B. et al. [11] addressed the issue of abusive language and hate speech detection. They suggest a method for data pre-processing and rebalancing, and then apply bidirectional Gated Recurrent Unit (GRU) and Convolutional Neural Network (CNN) models. The bidirectional GRU model augmented with an attention layer generated the best results among the proposed models on a labeled dataset of Arabic tweets, achieving an F1 score of 85.9% for offensive language detection and 75% for hate speech detection.
Ensemble machine learning is a powerful machine learning algorithm, the result obtained from an ensemble, a combination of machine learning models can be more accurate than any single member of the group [5].
Regarding the use of ensemble machine learning methods in Arabic offensive language and cyberbullying detection, Haidar et al. [12] present ensemble machine learning as another solution for Arabic cyberbullying detection to enhance their previous work, accomplishing an improvement in Precision, Recall, and F1-score. Also, Fatemah, H. [13] examined the impact of applying a single learner machine learning approach (SVM, logistic regression, and decision tree) and an ensemble machine learning approach (Bagging, AdaBoost, and random forest) to Arabic offensive language detection. The study shows that applying ensemble machine learning techniques rather than single learner machine learning approaches has a significant effect. With an F1 of 88%, which exceeds the best single learner classifier score by 6 percentage points, Bagging performs the best among the qualified ensemble machine learning classifiers in offensive language detection.

Ensemble Machine Learning
Ensemble Machine Learning is a strong machine learning technique used by data science practitioners in industry, as it is considered the state-of-the-art solution for many machine learning problems [16]. The result obtained from an ensemble, a combination of machine learning models, can be more accurate than any single member of the group [4][5]. See Fig. 1.
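The benefit of combining models can be illustrated with a simple majority-vote calculation: if the members of an ensemble err independently, the ensemble is correct whenever a majority of its members are. A minimal sketch (the 70% base accuracy is an illustrative assumption, not a figure from this study):

```python
from math import comb

def majority_accuracy(p, n):
    """Probability that a majority vote of n independent classifiers,
    each correct with probability p, gives the right answer."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

# Three independent 70%-accurate classifiers, combined by majority vote:
print(round(majority_accuracy(0.70, 3), 3))  # → 0.784, better than any member alone
```

In practice the members' errors are correlated, so the real gain is smaller, which is why this study compares the ensembles against the single learners empirically.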

Proposed Method
In this section, the proposed method is discussed in detail with a diagram. The datasets, the preprocessing, the classification methods used, and the performance measures are described below. Figure 2 illustrates the method that we follow in this study.

Dataset Description
In this paper, three Arabic datasets are used: two are offensive language datasets that are publicly available at [18], and the third is a balanced dataset that we decided to collect. We used a set of offensive keywords that represent different types of offensive and cyberbullying meanings to search for tweets on Twitter and posts on Facebook. We then developed a web crawler to collect the Facebook search results automatically, and on Twitter we used the Twitter API to collect the tweets returned by the search. Finally, all the collected posts and tweets were stored in a text file without repetition.
Non-Arabic letters, URLs, and emoji are examples of non-useful text that have a negative impact on classification performance. As a result, the gathered data must be cleaned and filtered; the filtering phase removes all non-useful text. For the newly collected dataset, we manually labeled the filtered posts with 1 for cyberbullying and 0 for non-cyberbullying posts. At the end of the preprocessing, the final dataset contained 6,000 non-cyberbullying instances and 6,000 cyberbullying ones. See Fig. 3. After the annotation process was finished, the data were tokenized and stemmed. Table 1 gives the original distribution of the used datasets in terms of the source, the size of the dataset, the number of majority and minority instances, and the imbalance ratio (IR).
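As an illustration of the filtering phase, a minimal cleaning function might look like the following sketch; the exact rules (the regular expressions, and how digits and punctuation are treated) are assumptions, since the paper does not spell them out:

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Keep only characters in the basic Arabic Unicode block plus whitespace;
# this drops Latin letters, digits, and emoji in one pass.
NON_ARABIC_RE = re.compile(r"[^\u0600-\u06FF\s]")

def clean_post(text):
    """Remove URLs, non-Arabic letters, and emoji from a post."""
    text = URL_RE.sub(" ", text)
    text = NON_ARABIC_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()
```

After cleaning, each post would still need the tokenization and stemming steps described above.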

Ensemble machine learning approach (EML)
For this approach, we selected three models, each using a different ensemble machine learning method (bagging, voting, and boosting). For bagging we selected the random forest model, and for boosting we selected the AdaBoost method. For voting, we used the three single learners from the first approach. The random forest was trained with a maximum of 100 trees. Hard voting was used to create the ensemble because it can validate the result with more confidence than individually applied algorithms [19].
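A sketch of this setup in scikit-learn is shown below. The three single learners, the 100-tree random forest, AdaBoost, and hard voting follow the description above; the TF-IDF feature extraction and the toy training texts are illustrative assumptions, since the real models were trained on the Arabic datasets:

```python
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# The three single learners from the first approach, reused for voting.
single_learners = [
    ("svc", LinearSVC()),
    ("lr", LogisticRegression(max_iter=1000)),
    ("knn", KNeighborsClassifier()),
]

bagging = RandomForestClassifier(n_estimators=100)         # bagging: random forest, 100 trees
boosting = AdaBoostClassifier()                            # boosting: AdaBoost
voting = VotingClassifier(single_learners, voting="hard")  # hard (majority) voting

# Illustrative fit on toy English data; real training used the Arabic datasets.
model = make_pipeline(TfidfVectorizer(), voting)
texts = ["good nice", "great nice", "good great",
         "bad awful", "terrible bad", "awful terrible"]
labels = [0, 0, 0, 1, 1, 1]
model.fit(texts, labels)
preds = model.predict(["good nice", "bad awful"])
```

Hard voting uses each member's predicted class label rather than its probabilities, which is what allows LinearSVC (which has no probability estimates by default) to participate.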

Performance Measures
The metrics used to analyze the performance of each classifier are Accuracy, Precision, Recall, and F1-score. The definitions of these metrics are as follows:

A. Accuracy: the percentage of instances that were correctly classified into their respective classes. It is also called sample accuracy.

Accuracy = (TP + TN) / (TP + TN + FP + FN)

B. Precision: the proportion of instances predicted as positive that are truly positive.

Precision = TP / (TP + FP)

C. Recall: the proportion of truly positive instances that are correctly predicted as positive.

Recall = TP / (TP + FN)

D. F-measure (or F-score): measures the accuracy of the test by considering both precision and recall in computing the score. It conveys a balance between precision and recall, reaching its best value at 1 and its worst value at 0.

F1 = 2 × (Precision × Recall) / (Precision + Recall)
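These metrics are readily available in scikit-learn; the toy predictions below are illustrative and simply confirm the formulas (here TP = 3, TN = 3, FP = 1, FN = 1):

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # 1 = cyberbullying, 0 = non-cyberbullying
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # one false negative and one false positive

acc = accuracy_score(y_true, y_pred)    # (TP + TN) / (TP + TN + FP + FN) = 6/8
prec = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3/4
rec = recall_score(y_true, y_pred)      # TP / (TP + FN) = 3/4
f1 = f1_score(y_true, y_pred)           # 2 * prec * rec / (prec + rec) = 3/4
```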

Experimental Results And Discussion
This section presents the results of the experiments, which were carried out to compare the performance of different single and ensemble machine learning algorithms in detecting Arabic messages containing cyberbullying and abusive language. From Table 2 we can see that Voting outperforms all of the single and ensemble classifiers used on all performance metrics. Logistic Regression (LR) outperforms the other single classifiers with an accuracy of 65.1%, followed by Linear SVC and KNeighbors, each with an accuracy of 64.2%.
The Voting EML classifier outperforms the other EML classifiers with an accuracy of 71.1%, followed by the Bagging EML with an accuracy of 60.5% and the AdaBoost EML with 56.4%. These findings demonstrate the efficiency of using ensemble machine learning methods: accuracy rises from 65.1% using the best SML classifier to 71.1% using the best EML model. Our results are inconsistent with [13], where Bagging was the best performer. Table 3 shows the performance metrics of the single and ensemble ML models for the second Arabic dataset. From Table 3 we can see that Voting again outperforms all of the single and ensemble classifiers used on all performance measures. From Table 1 we can see that this dataset is partially imbalanced, so we use the F1-score to evaluate performance instead of accuracy, as accuracy can be misleading.
Among the SML classifiers, Linear SVC outperforms the others with an F1-score of 75.2%, followed by KNeighbors (KNN) with an F1-score of 74.5% and Logistic Regression (LR) with an F1-score of 74.4%.
The Voting classifier outperforms the other EML classifiers with an F1-score of 75.8%, followed by the Bagging EML with an F1-score of 72.8% and the Boosting EML with 70.2%. These findings demonstrate the efficiency of using ensemble machine learning methods: the F1-score rises from 75.2% using the best SML model to 75.8% using the best EML model. Our results are inconsistent with [13], where Bagging was the best performer. Table 4 shows the performance metrics of the single and ensemble ML models for the new Arabic dataset.
Our new dataset is balanced, so we consider accuracy when comparing the results. From Table 4 we can see that Voting outperforms all of the single and ensemble classifiers used on all performance metrics.
Linear SVC outperforms the other SML classifiers with an accuracy of 98%, followed by Logistic Regression (LR) with an accuracy of 97.4% and KNeighbors (KNN) with 96.3%. The Voting classifier outperforms the other EML classifiers with an accuracy of 98.5%, followed by Bagging with an accuracy of 96.1% and Boosting with 94.8%. These findings demonstrate the efficiency of using ensemble machine learning methods: accuracy rises from 98% using the best SML model to 98.5% using the best EML model. Our results are inconsistent with [13], where Bagging was the best performer. Our findings partly support the hypothesis that ensemble models naturally do better than single classifiers, but not in all cases: a single classifier such as Logistic Regression can sometimes achieve better results than ensemble classifiers (Bagging and Boosting). The outcome depends on the characteristics of the dataset being examined.

Hyperparameter tuning
To optimize the previous results, which were obtained using the default parameters of each classifier, we applied hyperparameter tuning to the best-performing model (Voting). The best parameters, such as C = 20 for the Linear SVC, are shown in Table 5, and the optimized performance is shown in Table 6.
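A sketch of how such tuning can be done with scikit-learn's GridSearchCV is shown below; the parameter grid and the synthetic data are illustrative assumptions (the study reports C = 20 as the best value found for the Linear SVC inside the voting ensemble):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Synthetic stand-in for the TF-IDF features of the Arabic dataset.
X, y = make_classification(n_samples=200, random_state=0)

param_grid = {"C": [0.1, 1, 10, 20, 50]}  # candidate regularization strengths
search = GridSearchCV(LinearSVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_)  # the C value with the best cross-validated accuracy
```

The same pattern extends to the other members of the voting ensemble by nesting their parameter names in the grid.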

Conclusion
Ensemble machine learning is a meta-learning method that aims to improve on a single learner classifier's performance by combining the predictions of multiple single learner classifiers. In this study, we investigate the effect of applying three single learner machine learning approaches (Linear SVC, Logistic Regression, and KNeighbors) and three ensemble machine learning approaches (Bagging with random forest, Voting, and Boosting with AdaBoost) to offensive language and cyberbullying detection for the Arabic language.
The ensemble machine learning methodology outperforms the single learner machine learning approach. Among the trained ensemble machine learning classifiers, Voting performs the best in offensive language and cyberbullying detection, with accuracy scores of 71.1%, 76.7%, and 98.5% for the three datasets respectively, exceeding the scores obtained by the best single learner classifiers (65.1%, 76.2%, and 98%) on the same datasets. Finally, we used hyperparameter tuning to optimize the performance of the voting approach, and its accuracy rose to 98.6%.

Declarations
Funding: Not applicable.
Availability of data and material and code: https://github.com/omammar167/Arabic-Abusive-Datasets
Disclosure of potential Conflict of Interest: The authors declare that they have no conflict of interest.
Ethical Statement: "All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards." Consent Statement: "Informed consent was obtained from all individual participants included in the study."