Effect of stopwords in Indian language IR

Sahu, Siba Sankar; Pal, Sukomal

doi:10.1007/s12046-021-01731-z

Published: 10 January 2022

Volume 47, article number 17 (2022)

Sādhanā

Abstract

We explore and evaluate the effect of stopwords on retrieval performance for four Indian languages: Marathi, Bengali, Gujarati and Sanskrit. We investigate the issue from three viewpoints. Does non-corpus-based stopword removal affect retrieval in the chosen languages and, if so, to what extent? Can we recommend, based on experiments, a number of stopwords for each language that is good enough from a retrieval point of view? Is there any relationship between stopwords and average document length from a retrieval perspective? We observe that stopword removal generally improves mean average precision (MAP) significantly compared with no removal. For each language, we explore and evaluate stopword lists of different lengths, which leads us to suggest an optimal length. We also study how the effect of stopwords on retrieval performance varies with document length. Across the four languages, the effect of stopwords is generally found to be quite low in short documents compared with long ones.
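As a rough illustration of the two operations the abstract relies on, the following sketch filters tokens against a stopword list and computes MAP over a set of queries. It is a minimal sketch, not the authors' experimental setup: the function names, the toy English tokens and the document ids are all hypothetical, and the paper does not state which retrieval toolkit or tokenizer was used.

```python
from __future__ import annotations

from typing import Iterable, Sequence


def remove_stopwords(tokens: Iterable[str], stopwords: set[str]) -> list[str]:
    """Drop every token that appears in the stopword list (hypothetical helper)."""
    return [t for t in tokens if t not in stopwords]


def average_precision(ranked_doc_ids: Sequence[str], relevant: set[str]) -> float:
    """Average precision of one ranked list against the relevant doc ids."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at this rank, counted only where a relevant doc occurs.
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(
    runs: dict[str, Sequence[str]], qrels: dict[str, set[str]]
) -> float:
    """MAP: mean of per-query average precision over all judged queries."""
    if not qrels:
        return 0.0
    total = sum(average_precision(runs.get(q, []), rel) for q, rel in qrels.items())
    return total / len(qrels)


if __name__ == "__main__":
    # Toy data, assumed for illustration only.
    print(remove_stopwords(["the", "effect", "of", "stopwords"], {"the", "of"}))
    runs = {"q1": ["d1", "d2", "d3", "d4"], "q2": ["d5", "d6"]}
    qrels = {"q1": {"d1", "d3"}, "q2": {"d6"}}
    print(mean_average_precision(runs, qrels))
```

Comparing the MAP of a run indexed with stopword removal against one indexed without it, or against runs built from stopword lists of different lengths, is one way to frame the comparisons the abstract describes.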

Acknowledgements

This research work was supported by IIT (BHU), Varanasi, India.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Indian Institute of Technology (BHU), Varanasi, India
Siba Sankar Sahu & Sukomal Pal

Corresponding authors

Correspondence to Siba Sankar Sahu or Sukomal Pal.

Appendix

Table 15 Examples of the non-corpus-based stopword lists used in the experiments

About this article

Cite this article

Sahu, S.S., Pal, S. Effect of stopwords in Indian language IR. Sādhanā 47, 17 (2022). https://doi.org/10.1007/s12046-021-01731-z

Received: 01 November 2020
Revised: 06 July 2021
Accepted: 25 August 2021
Published: 10 January 2022
DOI: https://doi.org/10.1007/s12046-021-01731-z
