Abstract
With the Internet now easily accessible, online data has become one of the main sources of information, and large amounts of it are updated daily. Although this data may be useful for research purposes, it cannot be used in its raw form. Unstructured text generally contains many common words that do not add to the semantic meaning of a document. These words are known as stop words, and removing them is an important requirement for efficient text processing in information retrieval systems and other natural language processing applications. A significant amount of research has been done on removing stop words in languages such as English, Chinese, Urdu, Arabic, and Hindi. However, comparatively little work has addressed stop word removal for the Punjabi language. Most of the available works use corpus-based methods for removing stop words, which tend to be time-consuming. This paper proposes a method for removing stop words in Punjabi using finite automata. The performance of the proposed method is compared with the classical method of stop word removal. The implementation results show that the proposed algorithm performs better in terms of execution time.
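To make the idea of automaton-based stop word removal concrete, the following is a minimal sketch and not the authors' implementation. It assumes the automaton is a deterministic one built as a character trie over a fixed stop list, and that tokens are separated by whitespace. The class `StopWordDFA`, the function `remove_stop_words`, and the short list `SAMPLE_STOP_WORDS` are hypothetical names and data chosen for illustration only.

```python
from typing import Dict, Iterable, List, Set

# Illustrative sample only; NOT the stop list used in the paper.
SAMPLE_STOP_WORDS = ["ਦਾ", "ਦੇ", "ਦੀ", "ਹੈ", "ਅਤੇ"]


class StopWordDFA:
    """Deterministic finite automaton stored as a trie over stop words."""

    def __init__(self, stop_words: Iterable[str]) -> None:
        # transitions[state][char] -> next state; state 0 is the start state.
        self.transitions: List[Dict[str, int]] = [{}]
        self.accepting: Set[int] = set()
        for word in stop_words:
            self._add(word)

    def _add(self, word: str) -> None:
        """Extend the automaton so that it accepts `word`."""
        state = 0
        for ch in word:
            nxt = self.transitions[state].get(ch)
            if nxt is None:
                # Create a fresh state for an unseen prefix.
                self.transitions.append({})
                nxt = len(self.transitions) - 1
                self.transitions[state][ch] = nxt
            state = nxt
        self.accepting.add(state)

    def accepts(self, token: str) -> bool:
        """Return True only if the whole token ends in an accepting state."""
        state = 0
        for ch in token:
            nxt = self.transitions[state].get(ch)
            if nxt is None:
                return False  # no transition: the token is not a stop word
            state = nxt
        return state in self.accepting


def remove_stop_words(text: str, dfa: StopWordDFA) -> List[str]:
    """Split on whitespace and keep the tokens the automaton rejects."""
    return [tok for tok in text.split() if not dfa.accepts(tok)]


dfa = StopWordDFA(SAMPLE_STOP_WORDS)
print(remove_stop_words("ਪੰਜਾਬ ਦਾ ਇਤਿਹਾਸ ਹੈ", dfa))
```

Each token is checked in time proportional to its own length, independent of the size of the stop list. Gurmukhi text can encode the same visible word with different code point sequences, so normalizing both the stop list and the input before matching is likely worthwhile; that step is omitted from this sketch.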
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Kochhar, T.S., Goyal, G. (2022). Design and Implementation of Stop Words Removal Method for Punjabi Language Using Finite Automata. In: Verma, P., Charan, C., Fernando, X., Ganesan, S. (eds) Advances in Data Computing, Communication and Security. Lecture Notes on Data Engineering and Communications Technologies, vol 106. Springer, Singapore. https://doi.org/10.1007/978-981-16-8403-6_8
DOI: https://doi.org/10.1007/978-981-16-8403-6_8
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-8402-9
Online ISBN: 978-981-16-8403-6
eBook Packages: Intelligent Technologies and Robotics (R0)