Abstract
With the accelerated growth of the internet, vast repositories of unstructured textual data have emerged, necessitating automated categorization algorithms for organization and insight extraction. Arabic poses particular challenges because of its highly inflected morphology, large vocabulary, and varying forms. This study targets the development of robust automated classification systems for Arabic text, a language increasingly used online. We compare four prevalent pre-trained word embeddings: Word2Vec (represented by AraVec), GloVe, FastText, and BERT (represented by ARBERTv2), on the widely adopted SANAD dataset of Arabic news articles. To keep the comparison fair, we apply the same fixed deep learning architecture to all four embeddings. The motivation is to bridge the knowledge gap observed in the usage of popular word embeddings for Arabic news classification: despite the state-of-the-art results of transformer models, a significant inclination towards older methodologies persists, so we aim to highlight the efficiencies of modern techniques. Results indicate that ARBERTv2 outperforms the other embeddings, achieving 95.81%, 98.68%, and 99.30% accuracy on the Akhbarona, Alkhaleej, and Alarabiya subsets of SANAD, respectively. Despite its large number of parameters, ARBERTv2’s context-based word embeddings appear to offer superior performance. Among the non-contextualized embeddings, FastText performed best, owing to its ability to capture morphological similarities and handle out-of-vocabulary words; GloVe followed closely, and then AraVec.
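The remark about FastText and out-of-vocabulary words can be illustrated with a minimal sketch of character n-gram extraction, the subword idea FastText builds on. This is not the authors' code or the FastText library: the function names `char_ngrams` and `ngram_overlap` are hypothetical, the n-gram range of 3 to 6 is assumed for illustration, and the real model additionally hashes n-grams into buckets and sums learned vectors, which is omitted here.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of `word`, wrapped in boundary markers.

    The '<' and '>' markers let prefixes and suffixes be told apart from
    word-internal character sequences.
    """
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


def ngram_overlap(a, b, n_min=3, n_max=6):
    """Jaccard overlap between the n-gram sets of two words (illustrative only)."""
    grams_a = set(char_ngrams(a, n_min, n_max))
    grams_b = set(char_ngrams(b, n_min, n_max))
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# "kitab" (book) and "al-kitab" (the book) share several n-grams, so a
# subword model can relate them even if one form was never seen in training.
book = "كتاب"
the_book = "الكتاب"
school = "مدرسة"

related = ngram_overlap(book, the_book)
unrelated = ngram_overlap(book, school)
```

In this sketch the inflected form shares n-grams with its base form while the unrelated word shares none, which is the intuition behind composing a vector for an unseen word from its subword units.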
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Khaled, M.M., Al-Barham, M., Alomari, O.A., Elnagar, A. (2024). Arabic News Articles Classification Using Different Word Embeddings. In: García Márquez, F.P., Jamil, A., Hameed, A.A., Segovia Ramírez, I. (eds) Emerging Trends and Applications in Artificial Intelligence. ICETAI 2023. Lecture Notes in Networks and Systems, vol 960. Springer, Cham. https://doi.org/10.1007/978-3-031-56728-5_11
DOI: https://doi.org/10.1007/978-3-031-56728-5_11
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-56727-8
Online ISBN: 978-3-031-56728-5
eBook Packages: Engineering, Engineering (R0)