Stemming Algorithm for Arabic Text Using a Parallel Data Processing

  • Marieme Bougar
  • El Houssaine Ziyati
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 797)


The fast-growing volume of data generated over networks confronts data mining algorithms with major difficulties, namely storing the data and handling the computational challenges related to data volume and scalability. Our interest is in analyzing data sets in Arabic, a language with morphological complexity and dialectal variety; these specificities require advanced preprocessing steps, typically word stemming. In this paper, to support successful Arabic information retrieval, we use the MapReduce model and perform experiments on rankings generated by our optimized stemming algorithm, which is based on the Khoja algorithm, a popular algorithm for stemming Arabic words. We propose a structure based on key-value pairs to speed up the stemming phase and parallelize the process using the MapReduce mechanism.
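As a rough illustration of the key-value idea described above, the following minimal sketch simulates a map, shuffle, and reduce pass in plain Python. It is not the authors' implementation and not the Khoja stemmer: `toy_stem`, the affix lists, and the function names `map_phase`, `shuffle`, `reduce_phase`, and `run_job` are all hypothetical, and the stemmer only strips one short prefix and one short suffix. In a real Hadoop job the shuffle step would be performed by the framework rather than by a local dictionary.

```python
from collections import defaultdict

# Hypothetical toy affix lists (assumption for illustration only); the Khoja
# stemmer relies on much richer affix, pattern, and root dictionaries.
PREFIXES = ("ال", "و")
SUFFIXES = ("ات", "ون", "ين", "ة")


def toy_stem(word):
    """Strip at most one prefix and one suffix, keeping at least 3 letters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= 3:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[:-len(suffix)]
            break
    return word


def map_phase(doc_id, text):
    """Map step: emit one (stem, doc_id) key-value pair per token."""
    for token in text.split():
        yield toy_stem(token), doc_id


def shuffle(pairs):
    """Group values by key, as the MapReduce framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(stem, doc_ids):
    """Reduce step: count how often the stem occurs in each document."""
    counts = defaultdict(int)
    for doc_id in doc_ids:
        counts[doc_id] += 1
    return stem, dict(counts)


def run_job(docs):
    """Run map, shuffle, and reduce sequentially over a dict of documents."""
    pairs = [
        pair
        for doc_id, text in docs.items()
        for pair in map_phase(doc_id, text)
    ]
    return dict(reduce_phase(stem, ids) for stem, ids in shuffle(pairs).items())
```

Keying the intermediate pairs by stem means each reducer can process its stems independently of the others, which is what makes the stemming and counting work parallelizable across nodes.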


Information retrieval, Hadoop, MapReduce, Big data, Khoja stemmer, Arabic text



Copyright information

© Springer Nature Singapore Pte Ltd. 2019

Authors and Affiliations

  1. National School of Electricity and Mechanics, Casablanca, Morocco
