A Language Independent Approach to Develop Urdu Stemmer

Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 178)


In the last few years, a wide range of information in Indian regional languages such as Hindi, Urdu, Bengali, Tamil and Telugu has become available on the web as e-data. Access to these data repositories remains low, however, because few efficient search engines or retrieval systems support these languages. Automatic information processing and retrieval has therefore become an urgent requirement. This paper presents an unsupervised approach to developing an Urdu stemmer. The system is trained on a dataset of 111,887 words taken from CRULP [22]. Two approaches are proposed for generating suffix rules: frequency-based stripping and length-based stripping. The evaluation was carried out on 1,200 words extracted from the Emille corpus. The experimental results show that the two algorithms achieve accuracies of 85.36% and 79.76%.
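The abstract does not spell out how the two stripping strategies work, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that candidate suffixes are collected by splitting every training word at each position, that frequency-based stripping tries the most frequent learned suffix first, and that length-based stripping tries the longest learned suffix first. The function names, the `min_stem_len` and `min_count` thresholds, and the Latin-script toy words are all hypothetical placeholders; the toy words are not drawn from the CRULP or Emille data.

```python
from collections import Counter


def learn_suffixes(words, min_stem_len=3, min_count=2):
    """Count candidate suffixes over the distinct training words.

    Every split that leaves at least `min_stem_len` characters of stem
    contributes one candidate suffix. Suffixes seen fewer than
    `min_count` times are discarded. Both thresholds are assumptions.
    """
    counts = Counter()
    for word in set(words):
        for i in range(min_stem_len, len(word)):
            counts[word[i:]] += 1
    return {s: c for s, c in counts.items() if c >= min_count}


def stem(word, suffixes, order, min_stem_len=3):
    """Strip the first learned suffix that matches, in the chosen order.

    order="frequency": most frequent suffix first (ties: longer first).
    order="length":    longest suffix first (ties: more frequent first).
    """
    if order == "frequency":
        ranked = sorted(suffixes, key=lambda s: (-suffixes[s], -len(s), s))
    elif order == "length":
        ranked = sorted(suffixes, key=lambda s: (-len(s), -suffixes[s], s))
    else:
        raise ValueError("order must be 'frequency' or 'length'")
    for suffix in ranked:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word  # no learned suffix applies


# Hypothetical toy training list (placeholder words, not Urdu data).
training_words = [
    "walking", "walked", "walks",
    "talking", "talked", "talks",
    "jumping", "jumped", "jumps",
]
suffixes = learn_suffixes(training_words)

print(stem("walking", suffixes, "frequency"))  # walk
print(stem("walking", suffixes, "length"))     # wal
```

On this toy list the two orderings disagree for "walking": the frequency ordering strips "ing", while the length ordering strips the longer but rarer "king". The sketch only illustrates that the two rule orderings can produce different stems; it says nothing about which ordering corresponds to which reported accuracy.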


Keywords: Stemmer, Morphological Analysis, Information Retrieval, Unsupervised Stemming




  1. Rizvi, J., et al.: Modeling case marking system of Urdu-Hindi languages by using semantic information. In: Proceedings of the IEEE International Conference on Natural Language Processing and Knowledge Engineering, IEEE NLP-KE 2005 (2005)
  2. Butt, M., King, T.: Non-Nominative Subjects in Urdu: A Computational Analysis. In: Proceedings of the International Symposium on Non-nominative Subjects, Tokyo, pp. 525–548 (December 2001)
  3. Savoy, J.: Stemming of French words based on grammatical categories. Journal of the American Society for Information Science 44(1), 1–9 (1993)
  4. Chen, A., Gey, F.: Building an Arabic Stemmer for Information Retrieval. In: Proceedings of the Text Retrieval Conference, p. 47 (2002)
  5. Mokhtaripour, A., Jahanpour, S.: Introduction to a New Farsi Stemmer. In: Proceedings of CIKM, Arlington, VA, USA, pp. 826–827 (2006)
  6. Wicentowski, R.: Multilingual Noise-Robust Supervised Morphological Analysis using the Word Frame Model. In: Proceedings of Seventh Meeting of the ACL Special Interest Group on Computational Phonology (SIGPHON), pp. 70–77 (2004)
  7. Rizvi, Hussain, M.: Analysis, Design and Implementation of Urdu Morphological Analyzer. In: SCONEST, pp. 1–7 (2005)
  8. Krovetz, R.: Viewing Morphology as an Inference Process. In: The Proceedings of 5th International Conference on Research and Development in Information Retrieval (1993)
  9. Porter, M.: An Algorithm for Suffix Stripping. Program 14(3), 130–137 (1980)
  10. Thabet, N.: Stemming the Qur’an. In: The Proceedings of the Workshop on Computational Approaches to Arabic Script-based Languages (2004)
  11. Paik, Pauri: A Simple Stemmer for Inflectional Languages. In: FIRE 2008 (2008)
  12. Sharifloo, A.A., Shamsfard, M.: A Bottom up Approach to Persian Stemming. In: IJCNLP (2008)
  13. Croft, Xu: Corpus-Based Stemming Using Co-occurrence of Word Variants. ACM Transactions on Information Systems, 61–81 (1998)
  14. Kumar, A., Siddiqui, T.: An Unsupervised Hindi Stemmer with Heuristics Improvements. In: Proceedings of the Second Workshop on Analytics for Noisy Unstructured Text Data (2008)
  15. Kumar, M.S., Murthy, K.N.: Corpus Based Statistical Approach for Stemming Telugu. In: Creation of Lexical Resources for Indian Language Computing and Processing (LRIL), C-DAC, Mumbai, India (2007)
  16. Akram, Q.-U.-A., Naseer, A., Hussain, S.: Assas-Band, an Affix-Exception-List Based Urdu Stemmer. In: Proceedings of ACL-IJCNLP 2009 (2009)
  17.
  18.
  19.
  20. Siddiqui, T., Tiwary, U.S.: Natural Language Processing and Information Retrieval
  21. Frakes, W.B., Baeza-Yates, R.: Information Retrieval: Data Structures and Algorithms
  22.

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Mohd. Shahid Husain (1)
  • Faiyaz Ahamad (2)
  • Saba Khalid (2)
  1. Department of Information Technology, Integral University, Lucknow, India
  2. Department of Computer Science & Engineering, Integral University, Lucknow, India
