Social Choice Theory Based Domain Specific Hindi Stop Words List Construction and Its Application in Text Mining

Rani, Ruby; Lobiyal, D. K.

doi:10.1007/978-3-030-04021-5_12

Ruby Rani & D. K. Lobiyal

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11278)

Included in the following conference series:

International Conference on Intelligent Human Computer Interaction

Abstract

In this paper, we attempt to create a domain-specific Hindi stop words list using statistical and knowledge-based techniques applied to prepared textual corpora from different domains. To remove the ranking bias of each individual technique, Borda's rule, a vote ranking method, is employed to construct an unbiased stop words list. We also propose a novel approach called netting ranked performance evaluation (NRPE) to evaluate the prepared stop words lists, in which stop words are removed in leading and trailing fashion based on ascending and descending order of terms. Further, using combined band net (CBN) performance, we demonstrate the ability of each technique to identify candidate stop words, followed by the selection of features for text mining models. The experimental results show that a technique that selects good features for classification or clustering does not necessarily find good stop words. The results also show that the final Borda lists give normalized performance relative to the individual techniques. This approach guarantees candidate stop word removal, minimal information dissipation, and improved text mining model performance.
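
The rank aggregation step mentioned in the abstract can be illustrated with a minimal sketch of a standard Borda count. The abstract does not state which scoring variant or which individual techniques the authors use, so the point scheme (n - 1 points for first place down to 0 for last), the tie-breaking rule, the function name `borda_aggregate`, and the transliterated example terms and technique names below are all assumptions made for illustration, not the paper's implementation.

```python
from collections import defaultdict


def borda_aggregate(rankings):
    """Combine several ranked term lists into one list using a Borda count.

    Each ranking lists terms from most to least stop-word-like. With n
    distinct candidate terms overall, a term at 0-based position i in a
    ranking earns n - 1 - i points; a term absent from a ranking earns
    nothing from it. (Assumed scoring scheme, not taken from the paper.)
    """
    candidates = {term for ranking in rankings for term in ranking}
    n = len(candidates)
    scores = defaultdict(int)
    for ranking in rankings:
        for position, term in enumerate(ranking):
            scores[term] += n - 1 - position
    # Highest total score first; ties broken alphabetically so the
    # output is deterministic.
    return sorted(candidates, key=lambda term: (-scores[term], term))


# Hypothetical rankings produced by three different techniques.
by_frequency = ["hai", "ka", "aur", "mein"]
by_entropy = ["ka", "hai", "mein", "aur"]
by_idf = ["ka", "aur", "hai", "mein"]

combined = borda_aggregate([by_frequency, by_entropy, by_idf])
print(combined)  # ['ka', 'hai', 'aur', 'mein']
```

Here "ka" collects 8 points, "hai" 6, "aur" 3 and "mein" 1, so no single technique's ordering dictates the final list.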

Acknowledgements

This work has been partially supported by the UPE-II grant received from JNU. The authors would like to thank the anonymous reviewers for their kind comments.

Author information

Authors and Affiliations

Jawaharlal Nehru University, New Delhi, India
Ruby Rani & D. K. Lobiyal

Authors

Ruby Rani
D. K. Lobiyal

Corresponding author

Correspondence to Ruby Rani.

Editor information

Editors and Affiliations

Indian Institute of Information Technology, Allahabad, India
Uma Shanker Tiwary

About this paper

Cite this paper

Rani, R., Lobiyal, D.K. (2018). Social Choice Theory Based Domain Specific Hindi Stop Words List Construction and Its Application in Text Mining. In: Tiwary, U. (eds) Intelligent Human Computer Interaction. IHCI 2018. Lecture Notes in Computer Science, vol 11278. Springer, Cham. https://doi.org/10.1007/978-3-030-04021-5_12

DOI: https://doi.org/10.1007/978-3-030-04021-5_12
Published: 10 November 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-04020-8
Online ISBN: 978-3-030-04021-5
eBook Packages: Computer Science, Computer Science (R0)
