Abstract
Mining knowledge from imbalanced data is challenging due to the uneven class distribution and the increasing dimensionality of data accumulated from real-life applications. Selecting informative features from imbalanced data is especially important for building an effective learning method, and global redundancy and the effect of the imbalanced distribution must be considered simultaneously. In this study, a feature selection method that accounts for the imbalanced class distribution is investigated by embedding a weighted constraint on the majority class into the global redundancy minimization (GRM) framework. Global redundancy minimization is achieved through an objective function that combines a feature redundancy matrix with feature scores. A new form of regularization for the within-class scatter matrix is first presented, which emphasizes the minority class and replaces the redundancy measurement approach. Then, by employing this regularized within-class scatter matrix in GRM and taking the between-class distance as the GRM input score, a GRM-based discriminant feature selection algorithm (GRM-DFS) is proposed. Comparative studies of within-class scatter matrices under different forms of regularization indicate that the proposed form is effective for imbalanced data, and experiments on public imbalanced datasets show that GRM-DFS is effective.
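The two ingredients the abstract names, a minority-emphasized within-class scatter and a between-class distance used as the GRM input score, can be sketched in a few lines of NumPy. The sketch below is illustrative only: the inverse-frequency class weighting, the parameter `alpha`, and the greedy redundancy-penalized selection are assumptions standing in for the paper's exact regularization and GRM solver.

```python
import numpy as np

def weighted_scatter_scores(X, y, alpha=0.5):
    """Per-feature discriminant scores with a minority-emphasized
    within-class scatter. The inverse-frequency weights and the
    regularizer `alpha` are illustrative assumptions."""
    classes, counts = np.unique(y, return_counts=True)
    # Weight each class inversely to its size, so the minority class
    # contributes more to the within-class scatter (diagonal only).
    weights = counts.sum() / (len(classes) * counts)
    mu = X.mean(axis=0)
    s_w = np.zeros(X.shape[1])
    s_b = np.zeros(X.shape[1])
    for c, n_c, w in zip(classes, counts, weights):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        s_w += w * ((Xc - mu_c) ** 2).sum(axis=0)   # weighted within-class scatter
        s_b += n_c * (mu_c - mu) ** 2               # between-class distance
    return s_b / (s_w + alpha)                      # alpha regularizes s_w

def grm_select(X, scores, k, lam=1.0):
    """Greedy redundancy-minimizing selection: prefer high-score features
    that are weakly correlated with those already chosen. This greedy loop
    approximates the GRM objective; it is not the authors' solver."""
    R = np.abs(np.corrcoef(X, rowvar=False))        # feature redundancy matrix
    selected = [int(np.argmax(scores))]
    while len(selected) < k:
        penalty = R[:, selected].mean(axis=1)       # average redundancy w.r.t. chosen set
        gain = scores - lam * penalty
        gain[np.asarray(selected)] = -np.inf        # never re-pick a feature
        selected.append(int(np.argmax(gain)))
    return selected
```

On a toy imbalanced problem where only one feature separates the minority class, the minority-emphasized score should rank that feature first, and the greedy step then fills the remaining slots with low-redundancy features.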
This work was supported by the National Natural Science Foundation of China (Nos. 61976182, 62076171, 61876157), the Key Program for International S&T Cooperation of Sichuan Province (2019YFH0097), and the Sichuan Key R&D Project (2020YFG0035).
Cite this article
Huang, S., Chen, H., Li, T. et al. Feature selection via minimizing global redundancy for imbalanced data. Appl Intell 52, 8685–8707 (2022). https://doi.org/10.1007/s10489-021-02855-9