Arabic Text Classification Based on Word and Document Embeddings

  • Abdelkader El Mahdaouy
  • Eric Gaussier
  • Saïd Ouatik El Alaoui
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 533)


Word embeddings have recently been introduced as a major breakthrough in Natural Language Processing (NLP): they learn useful representations of linguistic items from contextual information and/or word co-occurrence. In this paper, we investigate Arabic document classification using word and document embeddings as the representational basis, rather than relying on text preprocessing and a bag-of-words representation. We demonstrate that document embeddings, whether learned with Doc2Vec or built by simply averaging word vectors, outperform text preprocessing techniques. Moreover, the results show that classification accuracy is relatively insensitive to the parameters used to learn the word and document vectors.
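
The averaging approach mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the exact construction, so this assumes an unweighted mean over the vectors of in-vocabulary tokens, with unknown tokens skipped. The toy vectors, the transliterated tokens, and the function name `average_embedding` are hypothetical and serve only as illustration; in practice the vectors would come from a trained Skip-gram, CBOW, or GloVe model.

```python
from typing import Dict, List, Sequence

# Toy 3-dimensional word vectors (made-up values, transliterated tokens).
# Real vectors would be loaded from a trained word embedding model.
TOY_VECTORS: Dict[str, List[float]] = {
    "kura": [1.0, 0.0, 2.0],
    "qadam": [3.0, 2.0, 0.0],
    "iqtisad": [0.0, 4.0, 4.0],
}


def average_embedding(
    tokens: Sequence[str], vectors: Dict[str, List[float]], dim: int
) -> List[float]:
    """Return the unweighted mean of the vectors of in-vocabulary tokens.

    Tokens missing from `vectors` are skipped (an assumption of this sketch).
    A document with no known tokens maps to the zero vector.
    """
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


# "unknown" is out of vocabulary, so only two vectors are averaged.
doc_vector = average_embedding(["kura", "qadam", "unknown"], TOY_VECTORS, dim=3)
print(doc_vector)  # [2.0, 1.0, 1.0]
```

The resulting fixed-length vector could then be passed to any standard classifier in place of a bag-of-words feature vector.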


Arabic text classification; Arabic natural language processing; Document embeddings; Word embeddings; Skip-gram; Continuous Bag-of-Words; GloVe; Doc2Vec


Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Abdelkader El Mahdaouy (1, 2)
  • Eric Gaussier (1)
  • Saïd Ouatik El Alaoui (2)
  1. Grenoble Alpes University, CNRS-LIG/AMA, Grenoble, France
  2. Department of Computer Science, LIM, FSDM, USMBA, Fez, Morocco
