
A Rule-Based Subject-Correlated Arabic Stemmer

  • Research Article - Computer Engineering and Computer Science
  • Arabian Journal for Science and Engineering

Abstract

Arabic is a derivational language: words are formed from roots, the basic forms that encapsulate a word's linguistic features, and the set of roots is limited. Knowing how frequently each root occurs is a valuable additional feature, especially when the frequencies are tied to a specific topic. This paper exploits the collisions that arise during stemming, where two or more words may map to the same root. It uses root frequencies to minimize the number of roots extracted within a specific subject and explores the effect of this reduction on disambiguating words that have multiple candidate roots.
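
The idea of using subject-level root frequencies to choose among several candidate roots can be illustrated with a minimal Python sketch. It is not the paper's algorithm, whose rules are not given in the abstract. It assumes that a stemmer has already produced a list of candidate roots per word, that frequencies are counted only from unambiguous words, and that ties go to the first candidate. The function names and the transliterated roots are hypothetical placeholders.

```python
from collections import Counter


def build_root_frequencies(candidates_per_word):
    """Count roots over the words of one subject.

    Assumption: only words with exactly one candidate root contribute,
    so the counts are not distorted by unresolved ambiguity.
    """
    freq = Counter()
    for candidates in candidates_per_word:
        if len(candidates) == 1:
            freq[candidates[0]] += 1
    return freq


def pick_root(candidates, freq):
    """Return the candidate root seen most often in the subject.

    Ties (including roots never seen) keep the earliest candidate,
    because max() returns the first maximal element.
    """
    return max(candidates, key=lambda root: freq[root])


# Toy, hypothetical stemmer output for one subject (transliterated roots).
subject_words = [["ktb"], ["ktb"], ["drs"], ["kbt", "ktb"]]
frequencies = build_root_frequencies(subject_words)
resolved = [pick_root(c, frequencies) for c in subject_words]
distinct_roots = set(resolved)
```

In this toy input the ambiguous word is resolved to a root that already occurs in the subject, so the number of distinct roots extracted for the subject stays small, which is the kind of reduction the abstract describes.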



Author information

Corresponding author

Correspondence to Nahla A. Belal.

About this article

Cite this article

El-Defrawy, M., El-Sonbaty, Y. & Belal, N.A. A Rule-Based Subject-Correlated Arabic Stemmer. Arab J Sci Eng 41, 2883–2891 (2016). https://doi.org/10.1007/s13369-016-2029-2

  • DOI: https://doi.org/10.1007/s13369-016-2029-2
