Abstract
With the increased use of social network sites, such as Twitter, attackers exploit these platforms to spread counterfeit content. Such content can be fake advertisements or illegal content. Classifying such content is a challenging task, especially in Arabic. The Arabic language has a complex structure and makes classification tasks more difficult. This paper presents an approach to classifying Arabic tweets using classical machine learning (non-deep machine learning) and deep learning techniques. Tweets corpus were collected through Twitter API and labelled manually to get a reliable dataset. For an efficient classifier, feature extraction is applied to the corpus dataset. Then, two learning techniques are used for each feature extraction technique on the created dataset using N-gram models (uni-gram, bi-gram, and char-gram). The applied classical machine learning algorithms are support vector machines, neural networks, logistics regression, and naïve Bayes. Global vector (GloVe) and fastText learning models are utilised for the deep learning approaches. The Precision, Recall, and F1-score are the suggested performance measures calculated in this paper. Afterwards, the dataset is increased using the synthetic minority oversampling technique class to create a balanced dataset. After applying the classical machine learning models, the experimental results show that the neural network algorithm outperforms the other algorithms. Moreover, the GloVe outperforms the fastText model for the deep learning approach.
Similar content being viewed by others
Data availability
The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.
References
Alharbi AR, Aljaedi A (2019) Predicting rogue content and arabic spammers on twitter. Future Internet 11(11):229
Benevenuto F, Magno G, Rodrigues T and Almeida V (2010) Detecting spammers on twitter. In: Collaboration, electronic messaging, anti-abuse and spam conference (CEAS) (Vol. 6, No. 2010, p. 12)
Wang AH (2010) Don't follow me: Spam detection in twitter. In: 2010 international conference on security and cryptography (SECRYPT), pp 1–10. IEEE
Wu T, Wen S, Xiang Y, Zhou W (2018) Twitter spam detection: survey of new approaches and comparative study. Comput Secur 76:265–284
Kaddoura S, Chandrasekaran G, Elena Popescu D, Duraisamy JH (2022) A systematic literature review on spam content detection and classification. PeerJ Comput Sci 8:830. https://doi.org/10.7717/peerj-cs.830
Kaddoura S, Alfandi O and Dahmani N (2020) A spam email detection mechanism for english language text emails using deep learning approach. In: 2020 IEEE 29th international conference on enabling technologies: infrastructure for collaborative enterprises (WETICE). IEEE, pp 193–198. https://doi.org/10.1109/WETICE49692.2020.00045
Kaddoura S (2021) Classification of malicious and benign websites by network features using supervised machine learning algorithms. In: 2021 5th Cyber security in networking conference (CSNet). IEEE, pp 36–40. https://doi.org/10.1109/CSNet52717.2021.9614273
Kaddoura S, Arid AE and Moukhtar M (2021) Evaluation of supervised machine learning algorithms for multi-class intrusion detection systems. In: Proceedings of the future technologies conference. Springer, Cham, pp 1–16. https://doi.org/10.1007/978-3-030-89912-7_1
Ahmed I, Aljahdali S, Khan MS, Kaddoura S (2022) Classification of parkinson disease based on patient’s voice signal using machine learning. Intell Autom Soft Comput 32(2):705–722. https://doi.org/10.32604/iasc.2022.022037
Mubarak, H., Abdelali, A., Hassan, S. and Darwish, K., 2020, October. Spam detection on arabic twitter. In International Conference on Social Informatics (pp. 237–251). Springer, Cham.
Saeed RM, Rady S, Gharib TF (2022) An ensemble approach for spam detection in Arabic opinion texts. J King Saud University-Comput Inf Sci 34(1):1407–1416
Kaddoura S, Itani M, Roast C (2021) Analyzing the effect of negation in sentiment polarity of facebook dialectal arabic text. Appl Sci 11(11):4768. https://doi.org/10.3390/app11114768
Kaddoura S, Ahmed DR (2022) A comprehensive review on Arabic word sense disambiguation for natural language processing applications. Wiley Interdisciplinary Rev Data Mining Knowl Discov 12:e1447. https://doi.org/10.1002/widm.1447
Ekmekcioglu FC, Lynch MF, Willett P (1996) Stemming and n-gram matching for term conflation in Turkish texts. Inf Res 2(2):2–2
Daneshvar S and Inkpen D (2018) Gender identification in twitter using n-grams and lsa. In: Proceedings of the Ninth international conference of the CLEF association (CLEF 2018).
Pennington J, Socher R, Manning CD (2014) Glove: global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp 1532–1543
Bojanowski P, Grave E, Joulin A, Mikolov T (2017) Enriching word vectors with subword information. Trans Assoc Comput Linguist 5:135–146
Ma Y, He H (Eds.). (2013) Imbalanced learning: foundations, algorithms, and applications
Markines B, Cattuto C, Menczer F (2009) Social spam detection. In: Proceedings of the 5th international workshop on adversarial information retrieval on the web, pp 41–48)
Wang AH (2010) Machine learning for the detection of spam in twitter networks. In: International conference on e-business and telecommunications, pp 319–333. Springer, Berlin, Heidelberg
Wang AH (2010) Don't follow me: Spam detection in twitter. In: 2010 international conference on security and cryptography (SECRYPT), pp 1–10. IEEE
Mccord M, Chuah M (2011) Spam detection on twitter using traditional classifiers. In: international conference on Autonomic and trusted computing, pp 175–186. Springer, Berlin, Heidelberg
Shirani-Mehr H (2013) SMS spam detection using machine learning approach. unpublished) http://cs229stanford.edu/proj2013/ShiraniMehr-SMSSpamDetectionUsingMachineLearningApproach.pdf
Meda C, Bisio F, Gastaldo P, Zunino R (2014) A machine learning approach for Twitter spammers detection. In: 2014 international carnahan conference on security technology (iccst), pp 1–6. IEEE
Chen C, Zhang J, Xie Y, Xiang Y, Zhou W, Hassan MM, Alrubaian M (2015) A performance evaluation of machine learning-based streaming spam tweets detection. IEEE Trans Comput Soc Syst 2(3):65–76
Chen C, Zhang J, Chen X, Xiang Y, Zhou W (2015) 6 million spam tweets: a large ground truth for timely Twitter spam detection. In: 2015 IEEE international conference on communications (ICC), pp 7065–7070. IEEE
Trivedi SK (2016) A study of machine learning classifiers for spam detection. In: 2016 4th international symposium on computational and business Intelligence (ISCBI), pp 176–180. IEEE
Chen C, Wang Y, Zhang J, Xiang Y, Zhou W, Min G (2016) Statistical features-based real-time detection of drifted twitter spam. IEEE Trans Inf Forens Secur 12(4):914–925
Wu T, Liu S, Zhang J, Xiang Y (2017) Twitter spam detection based on deep learning. In: Proceedings of the australasian computer science week multiconference, pp 1–8
Mateen M, Iqbal MA, Aleem M, Islam MA (2017) A hybrid approach for spam detection for Twitter. In: 2017 14th international bhurban conference on applied sciences and technology (IBCAST), pp 466–471. IEEE
Liu S, Wang Y, Zhang J, Chen C, Xiang Y (2017) Addressing the class imbalance problem in twitter spam detection using ensemble learning. Comput Secur 69:35–49
Li C, Liu S (2018) A comparative study of the class imbalance problem in Twitter spam detection. Concurr Comput Pract Exp 30(5):e4281
Gupta H, Jamal MS, Madisetty S, Desarkar MS (2018) A framework for real-time spam detection in Twitter. In: 2018 10th international conference on communication systems & networks (COMSNETS), pp 380–383. IEEE
Madisetty S, Desarkar MS (2018) A neural network-based ensemble approach for spam detection in Twitter. IEEE Trans Comput Soc Syst 5(4):973–984
Itani M (2018) Sentiment analysis and resources for informal Arabic text on social media, Doctoral dissertation, Sheffield Hallam University.
Falak A, Ghous H, Malik M (2021) Twitter spam detection using machine learning. Int J Sci Eng Res, 12(2)
Ding Z, Xia R, Yu J, Li X, Yang J (2018) Densely connected bidirectional lstm with applications to sentence classification. In: Natural language processing and chinese computing: 7th CCF international conference, NLPCC 2018, Hohhot, China, August 26–30, 2018, Proceedings, Part II 7, pp 278–287. Springer International Publishing.
Rojas RF, Romero J, Lopez-Aparicio J, Ou KL (2021) Pain assessment based on fnirs using bi-lstm rnns. In: 2021 10th international IEEE/EMBS conference on neural engineering (NER, pp 399–402). IEEE
Jaihuni M, Basak JK, Khan F, Okyere FG, Sihalath T, Bhujel A, Kim HT (2022) A novel recurrent neural network approach in forecasting short term solar irradiance. ISA transactions 121:63–74
Sunny MAI, Maswood MMS, Alharbi AG (2020) Deep learning-based stock price prediction using LSTM and bi-directional LSTM model. In: 2020 2nd novel intelligent and leading emerging sciences conference (NILES), pp 87–92. IEEE
Hegde A, Coelho S, Shashirekha H (2022) MUCS@ DravidianLangTech@ ACL2022: ensemble of logistic regression penalties to identify Emotions in Tamil Text. In: Proceedings of the second workshop on speech and language technologies for Dravidian languages, (pp 145–150
Liu J, Rong Y, Takáč M, Huang J (2019) Accelerating distributed stochastic L-BFGS by sampled 2nd Order Information. Beyond first order methods in ML@ NeurIPS.
Koh K, Kim SJ, Boyd S (2007) A Method for large-scale l~ 1-regularized logistic regression. In: AAAI, pp 565–571
Zhang P, Shen C (2019) Choice of the number of hidden layers for back propagation neural network driven by stock price data and application to price prediction. In: Journal of physics: conference series (Vol. 1302, No. 2, p. 022017). IOP Publishing
Zhang C, Woodland PC (2015) Parameterised sigmoid and ReLU hidden activation functions for DNN acoustic modelling. In: Sixteenth annual conference of the international speech communication association
Huda NS, Mubarok MS (2019) A multi-label classification on topics of quranic verses (english translation) using backpropagation neural network with stochastic gradient descent and adam optimiser. In: 2019 7th International conference on information and communication technology (ICoICT), pp 1–5. IEEE
Goel A, and Srivastava SK (2016) Role of kernel parameters in performance evaluation of SVM. In: 2016 Second international conference on computational Intelligence & communication technology (CICT), pp 166–169. IEEE
Xu PF, Cheng C, Cheng HX, Shen YL, Ding YX (2020) Identification-based 3 DOF model of unmanned surface vehicle using support vector machines enhanced by cuckoo search algorithm. Ocean Eng 197:106898
Wang H, and Hu D (2005) Comparison of SVM and LS-SVM for regression. In: 2005 International conference on neural networks and brain (Vol. 1, pp. 279–283). IEEE
Vergara D, Hernández S, Jorquera F (2016) Multinomial Naive Bayes for real-time gender recognition. In: 2016 XXI Symposium on signal processing, images and artificial vision (STSIVA), pp 1–6. IEEE
Alkadri AM, Elkorany A, Ahmed C (2022) Enhancing detection of arabic social spam using data augmentation and machine learning. Appl Sci 12(22):11388
Al-Azani S, El-Alfy ESM (2018) Detection of arabic spam tweets using word embedding and machine learning. In: 2018 international conference on innovation and intelligence for informatics, computing, and technologies (3ICT), pp 1–5. IEEE
Kardaş Berk et al. (2021) Detecting spam tweets using machine learning and effective preprocessing. In: Proceedings of the 2021 IEEE/ACM international conference on advances in social networks analysis and mining
Alom Z, Carminati B, Ferrari E (2018) Detecting spam accounts on Twitter. In: 2018 IEEE/ACM international conference on advances in social networks analysis and mining (ASONAM), pp 1191–1198. IEEE
Mostafa M, Abdelwahab A, Sayed HM (2020) Detecting spam campaign in twitter with semantic similarity. In: Journal of physics: conference series (Vol. 1447, No. 1, p. 012044). IOP Publishing
Ahmad SBS, Rafie M, Ghorabie SM (2021) Spam detection on Twitter using a support vector machine and users’ features by identifying their interactions. Multimed Tools Appl 80(8):11583–11605
Ban X, Chen C, Liu S, Wang Y, Zhang J (2018) Deep-learnt features for Twitter spam detection. In: 2018 International symposium on security and privacy in social networks and big data (SocialSec), pp 208–212. IEEE.
Funding
This work was funded by Zayed University—Start-up research grant [Grant Number R20081].
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that there is no conflict of interest regarding the publication of this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Kaddoura, S., Alex, S.A., Itani, M. et al. Arabic spam tweets classification using deep learning. Neural Comput & Applic 35, 17233–17246 (2023). https://doi.org/10.1007/s00521-023-08614-w
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00521-023-08614-w