
Understanding of Data Preprocessing for Dimensionality Reduction Using Feature Selection Techniques in Text Classification

  • Conference paper
  • Published in: Intelligent Computing and Innovation on Data Science

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 248)

Abstract

The volume of textual data in digital form grows every day, and text classification is widely used to organize it. Effective text classification depends on data preprocessing, which prepares the text for machine learning models. Text classification, however, suffers from the high dimensionality of its feature space. Feature selection is a data preprocessing technique widely applied to such high-dimensional data: it reduces the dimensionality of the feature space and thereby improves classification efficiency. Feature selection addresses how a subset of features should be chosen for building text classification models. Its goals include reducing dimensionality, removing uninformative features, shrinking the amount of data the classifier must learn from, and enhancing the classifier's predictive performance. This paper surveys the main feature selection methods and discusses their advantages and limitations.
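To make the idea concrete, here is a minimal sketch of one filter-style feature selection method mentioned in this survey's context, information gain: each term is scored by how much knowing its presence reduces uncertainty about the class label, and only the top-scoring terms are kept. The toy corpus, function names, and scores below are illustrative assumptions, not material from the paper.

```python
import math
from collections import Counter

# Toy labeled corpus (illustrative only -- not from the paper).
docs = [
    ("cheap meds buy now", "spam"),
    ("meeting schedule today", "ham"),
    ("buy cheap now", "spam"),
    ("project meeting notes", "ham"),
]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(docs, term):
    """IG(term) = H(C) - H(C | term), using binary term presence."""
    present = [label for text, label in docs if term in text.split()]
    absent = [label for text, label in docs if term not in text.split()]
    n = len(docs)
    conditional = (len(present) / n) * entropy(present) \
                + (len(absent) / n) * entropy(absent)
    return entropy([label for _, label in docs]) - conditional

def select_features(docs, k):
    """Keep the k terms with the highest information gain."""
    vocab = {t for text, _ in docs for t in text.split()}
    return sorted(vocab, key=lambda t: information_gain(docs, t),
                  reverse=True)[:k]

# Terms that perfectly separate the classes get the maximum score.
print(select_features(docs, 4))
```

On this corpus, terms occurring only in one class (e.g. "buy", "meeting") receive the maximum gain of 1 bit, while terms seen in a single document score lower; a classifier would then be trained on only the retained columns, which is the dimensionality reduction the abstract describes.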



Author information

Correspondence to Sahil Verma.


Copyright information

© 2021 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.


Cite this paper

Dogra, V., Singh, A., Verma, S., Kavita, Jhanjhi, N.Z., Talib, M.N. (2021). Understanding of Data Preprocessing for Dimensionality Reduction Using Feature Selection Techniques in Text Classification. In: Peng, SL., Hsieh, SY., Gopalakrishnan, S., Duraisamy, B. (eds) Intelligent Computing and Innovation on Data Science. Lecture Notes in Networks and Systems, vol 248. Springer, Singapore. https://doi.org/10.1007/978-981-16-3153-5_48
