Abstract
Feature selection methods have been developed for data classification because redundant and irrelevant features degrade it. Such features slow overall system performance, and wrong decisions become more likely on large data sets. Several methods have been used to solve the feature selection problem for classification, but most are tailored to a particular data set. This paper therefore proposes a wide-ranging approach intended to solve a wide range of feature selection problems across data sets. The proposed algorithm analytically chooses the optimal features for classification using mutual information (MI) and linear correlation coefficients (LCC), so that both linearly and nonlinearly dependent features are considered. It suggests the features used to build a substantial feature subset for classification, effectively reducing irrelevant features. Three data sets are used to evaluate the proposed algorithm with classifiers that require high-quality features to reach better accuracy at a lower computational cost. In the experiments, features were selected using a probability value threshold (p value < 0.05), which yielded 7, 5, and 6 features for the mobile, heart, and diabetes data sets, respectively. Accuracy was measured with several classifiers; for example, the Nearest_Neighbors classifier achieved accuracies of 0.92225, 0.88333, and 0.86250 on the mobile, heart, and diabetes data sets, respectively. The evaluation on these real-world data sets indicates that the proposed model is adequate.
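The combination of LCC and MI described above can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the function name `select_features`, the union rule (keep a feature if either criterion passes), the MI cut-off `mi_min`, and the synthetic data are all assumptions made for illustration. Only the p value threshold of 0.05 comes from the abstract. The sketch uses `scipy.stats.pearsonr` for the linear test and scikit-learn's `mutual_info_classif` for the nonlinear one.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.feature_selection import mutual_info_classif


def select_features(X, y, alpha=0.05, mi_min=0.1, random_state=0):
    """Keep a feature if it passes the LCC test or the MI threshold.

    The union rule and the mi_min cut-off are illustrative assumptions,
    not the selection rule of the paper.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    # Linear dependence: Pearson correlation between each feature and the
    # class label, keeping the two-sided p value of each test.
    p_values = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        _, p_values[j] = pearsonr(X[:, j], y)

    # Nonlinear dependence: nearest-neighbour MI estimate (in nats).
    mi = mutual_info_classif(X, y, random_state=random_state)

    keep = (p_values < alpha) | (mi > mi_min)
    return np.flatnonzero(keep), mi, p_values


# Synthetic demo (hypothetical data): column 0 is linearly tied to the label,
# column 1 only nonlinearly (its magnitude encodes the class, its sign is
# random), and column 2 is pure noise.
rng = np.random.default_rng(0)
n = 500
x0 = rng.normal(size=n)
y = (x0 > 0).astype(int)
sign = rng.choice([-1.0, 1.0], size=n)
x1 = sign * (1.0 + y) + rng.normal(scale=0.1, size=n)
x2 = rng.normal(size=n)
X = np.column_stack([x0, x1, x2])

selected, mi, p_values = select_features(X, y)
print("selected columns:", selected)
print("MI estimates:", mi)
print("LCC p values:", p_values)
```

Column 1 shows why the two measures are complementary: its correlation with the label is close to zero by construction, so a linear test alone would tend to miss it, while its MI estimate is large.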
Data availability
The data that support the findings of this study are available from the first author upon reasonable request.
Code availability
The code is available from the first author upon reasonable request.
Funding
Not applicable.
Ethics declarations
Conflicts of interest/competing interests
The authors declare no conflict of interest.
Appendix
About this article
Cite this article
Bhuyan, H.K., Saikiran, M., Tripathy, M. et al. Wide-ranging approach-based feature selection for classification. Multimed Tools Appl 82, 23277–23304 (2023). https://doi.org/10.1007/s11042-022-14132-z
DOI: https://doi.org/10.1007/s11042-022-14132-z