Abstract
We evaluate the effect of stopwords on retrieval performance for four Indian languages: Marathi, Bengali, Gujarati and Sanskrit. We investigate the issue from three viewpoints. First, does non-corpus-based stopword removal affect retrieval for the chosen languages and, if so, to what extent? Second, can we recommend, based on experiments, a number of stopwords for each language that is good enough from a retrieval point of view? Third, is there a relationship between stopwords and average document length from a retrieval perspective? We observe that stopword removal generally improves mean average precision (MAP) significantly compared with retrieval without stopword removal. For each language, we evaluate stopword lists of different lengths and, on that basis, suggest an optimal length. We also study how the effect of stopwords on retrieval performance varies with document length: across the four languages, the effect is generally much lower for short documents than for long ones.
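To make the evaluation measure concrete, the following is a minimal Python sketch of stopword filtering and of MAP as it is commonly defined. It is an illustration only, not the authors' evaluation code: the function names, the toy rankings and the relevance judgments are all hypothetical.

```python
from typing import Dict, List, Set


def remove_stopwords(tokens: List[str], stopwords: Set[str]) -> List[str]:
    """Drop every token that appears in the (hypothetical) stopword list."""
    return [t for t in tokens if t not in stopwords]


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Sum precision@k at each rank k holding a relevant document,
    then divide by the total number of relevant documents."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for k, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant)


def mean_average_precision(
    runs: Dict[str, List[str]], qrels: Dict[str, Set[str]]
) -> float:
    """Mean of per-query average precision over all judged queries."""
    if not qrels:
        return 0.0
    scores = [average_precision(runs.get(q, []), rel) for q, rel in qrels.items()]
    return sum(scores) / len(scores)


# Toy data (assumed, not from the paper): two queries with ranked results.
runs = {"q1": ["d1", "d2", "d3"], "q2": ["d5", "d4"]}
qrels = {"q1": {"d1", "d3"}, "q2": {"d4"}}
map_score = mean_average_precision(runs, qrels)
```

Comparing `map_score` for a run indexed with stopword removal against a run indexed without it, on the same queries and judgments, gives the kind of contrast the abstract describes.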
Acknowledgements
This research work was supported by IIT (BHU), Varanasi, India.