Abstract
Clustering is regarded as one of the most difficult tasks because of the large search space that must be explored. Feature selection reduces the dimensionality of data and thereby simplifies subsequent processing. The feature subset produced by a feature selection method should improve classification accuracy by removing redundant features. To this end, this paper proposes a new model, Best Clustering Normalized Mutual Information Quantile (BC-NMIQ), which ranks the best features using a square root threshold. BC-NMIQ is then improved by selecting the optimal set of features automatically with the Incremental Association Markov Blanket (IAMB) feature selection method. The resulting hybrid, BC-NMIQ-IAMB, combines the proposed filter method (BC-NMIQ) with the existing automatic filter feature selection approach (IAMB). The measurement criteria are applied to BC-NMIQ-IAMB as the main proposed method and to BC-NMIQ as a subsidiary proposed method. BC-NMIQ-IAMB is compared with several algorithms recently proposed in the literature. Experiments on ten benchmark high-dimensional medical datasets (both binary and multi-class) confirmed that BC-NMIQ-IAMB increases the average accuracy of existing binary and multi-class algorithms to 0.92 and 0.94, respectively.
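To make the ranking idea concrete, the following is a minimal sketch of a mutual-information filter with a square root cutoff. It is not the BC-NMIQ algorithm: the abstract does not specify the clustering, quantile, or IAMB steps, so they are omitted here. The sketch assumes discrete features, assumes that "square root threshold" means keeping the top ceil(sqrt(n)) of n ranked features, and assumes the normalization I(X;Y) / sqrt(H(X) H(Y)). The function names are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def normalized_mutual_information(x, y):
    """NMI between two discrete sequences.

    Assumption: normalization by sqrt(H(X) * H(Y)); other normalizations
    (for example min(H(X), H(Y))) are also in common use.
    """
    hx, hy = entropy(x), entropy(y)
    if hx == 0.0 or hy == 0.0:
        # A constant variable carries no information about anything.
        return 0.0
    hxy = entropy(list(zip(x, y)))  # joint entropy H(X, Y)
    mutual_information = hx + hy - hxy
    return mutual_information / math.sqrt(hx * hy)


def rank_features_sqrt(columns, labels):
    """Rank features by NMI with the class labels and keep the top ceil(sqrt(n)).

    `columns` maps a feature name to its list of discrete values.
    Assumption: the cutoff ceil(sqrt(n)) is one reading of a square root
    threshold; the paper's exact rule may differ.
    """
    scores = {
        name: normalized_mutual_information(values, labels)
        for name, values in columns.items()
    }
    ranked = sorted(scores, key=scores.get, reverse=True)
    keep = math.ceil(math.sqrt(len(ranked)))
    return ranked[:keep], scores


if __name__ == "__main__":
    labels = [0, 0, 0, 0, 1, 1, 1, 1]
    columns = {
        "exact": [0, 0, 0, 0, 1, 1, 1, 1],  # identical to the labels
        "noisy": [0, 0, 0, 1, 1, 1, 1, 0],  # agrees with labels on 6 of 8 rows
        "indep": [0, 1, 0, 1, 0, 1, 0, 1],  # independent of the labels
        "const": [1, 1, 1, 1, 1, 1, 1, 1],  # constant feature
    }
    selected, scores = rank_features_sqrt(columns, labels)
    print(selected)
```

With four candidate features the cutoff keeps two. A subsequent Markov blanket step such as IAMB would then prune the retained features further, which this sketch does not attempt.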
Data availability
The datasets analyzed during the current study and the related implementation are available from the corresponding author on reasonable request.
Ethics declarations
Competing interests
The authors have no competing interests to declare that are relevant to the content of this article.
About this article
Cite this article
Asghari, S., Nematzadeh, H., Akbari, E. et al. Mutual information-based filter hybrid feature selection method for medical datasets using feature clustering. Multimed Tools Appl 82, 42617–42639 (2023). https://doi.org/10.1007/s11042-023-15143-0
DOI: https://doi.org/10.1007/s11042-023-15143-0