Effect of stopwords in Indian language IR

Sādhanā

Abstract

We explore and evaluate the effect of stopwords on retrieval performance for four Indian languages: Marathi, Bengali, Gujarati and Sanskrit. We investigate three questions. Does non-corpus-based stopword removal affect retrieval in the chosen languages and, if so, to what extent? Can we recommend, based on experiments, a stopword-list size for each language that is good enough for retrieval? Is there a relationship between the effect of stopwords and average document length? We observe that stopword removal generally improves mean average precision (MAP) significantly compared with no stopword removal. For each language, we evaluate stopword lists of different lengths and suggest an optimal length. We also study how the effect of stopwords on retrieval performance varies with document length. The effect is generally found to be quite low for short documents compared with long ones across the four languages.
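
To make the evaluation measure named in the abstract concrete, the following is a minimal Python sketch of stopword filtering and of mean average precision (MAP) over ranked result lists. The function names, the toy stopword set and the dictionary layout for rankings and relevance judgments are assumptions made for this illustration; it is not the authors' implementation or experimental setup.

```python
from typing import Dict, List, Set


def remove_stopwords(tokens: List[str], stopwords: Set[str]) -> List[str]:
    """Drop every token that appears in the (hypothetical) stopword set."""
    return [t for t in tokens if t not in stopwords]


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Sum precision@k at each rank k holding a relevant document,
    then divide by the total number of relevant documents."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for k, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / k
    return precision_sum / len(relevant)


def mean_average_precision(
    runs: Dict[str, List[str]], qrels: Dict[str, Set[str]]
) -> float:
    """Average the per-query AP values over all judged queries."""
    if not qrels:
        return 0.0
    total = sum(average_precision(runs.get(q, []), rel) for q, rel in qrels.items())
    return total / len(qrels)


# Toy data (invented for illustration): two queries, ranked document ids.
toy_qrels = {"q1": {"d1", "d3"}, "q2": {"d2"}}
toy_run = {"q1": ["d1", "d2", "d3"], "q2": ["d1", "d2"]}
toy_map = mean_average_precision(toy_run, toy_qrels)
```

Comparing the MAP of a run indexed with stopword removal against a run indexed without it, over the same queries and relevance judgments, gives the kind of contrast the abstract reports.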

Notes

  1. https://www.nltk.org/

  2. http://cltk.org/

  3. https://scikit-learn.org/

  4. http://terrier.org/

  5. https://trec.nist.gov/

  6. http://fire.irsi.res.in/fire/static/data

  7. https://github.com/gujarati-ir/Gujarati-Stop-Words

  8. https://gist.github.com/Akhilesh28/sanskrit stopwords

  9. https://github.com/stopwords-iso/stopwords-bn.txt

Acknowledgements

This research work was supported by IIT (BHU), Varanasi, India.

Author information

Correspondence to Siba Sankar Sahu or Sukomal Pal.

Appendix

Table 15 Examples from the non-corpus-based stopword lists used in the experiments

About this article

Cite this article

Sahu, S.S., Pal, S. Effect of stopwords in Indian language IR. Sādhanā 47, 17 (2022). https://doi.org/10.1007/s12046-021-01731-z

  • DOI: https://doi.org/10.1007/s12046-021-01731-z
