Abstract
Mining knowledge from imbalanced data is challenging due to the uneven class distribution and the increasing dimensionality of data accumulated from real-life applications. Selecting informative features from imbalanced data is especially important for building an effective learning method, and global redundancy and the effect of the imbalanced distribution must be considered simultaneously. In this study, a feature selection method that accounts for the imbalanced class distribution is investigated by embedding a weighted constraint on the majority class into the global redundancy minimization (GRM) framework. Global redundancy minimization is achieved through an objective function that combines a feature redundancy matrix with feature scores. A new form of regularization for the within-class scatter matrix is first presented, which emphasizes the minority class and replaces the redundancy measurement approach. Then, by employing this regularized within-class scatter matrix in GRM and taking the between-class distance as the GRM input score, a GRM-based discriminant feature selection algorithm (GRM-DFS) is proposed. Comparative studies of within-class scatter matrices under different forms of regularization indicate that the proposed form is effective for imbalanced data, and experiments on public imbalanced datasets show that GRM-DFS is effective.
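The two ingredients the abstract names, a minority-emphasized within-class scatter and a between-class distance used as the GRM input score, can be sketched in a few lines of NumPy. The sketch below is illustrative only: the inverse-frequency class weighting, the parameter `alpha`, and the greedy redundancy-penalized selection are assumptions standing in for the paper's exact regularization and GRM solver.

```python
import numpy as np

def weighted_scatter_scores(X, y, alpha=0.5):
    """Per-feature discriminant scores with a minority-emphasized
    within-class scatter. The inverse-frequency weights and the
    regularizer `alpha` are illustrative assumptions."""
    classes, counts = np.unique(y, return_counts=True)
    # Weight each class inversely to its size, so the minority class
    # contributes more to the within-class scatter (diagonal only).
    weights = counts.sum() / (len(classes) * counts)
    mu = X.mean(axis=0)
    s_w = np.zeros(X.shape[1])
    s_b = np.zeros(X.shape[1])
    for c, n_c, w in zip(classes, counts, weights):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        s_w += w * ((Xc - mu_c) ** 2).sum(axis=0)   # weighted within-class scatter
        s_b += n_c * (mu_c - mu) ** 2               # between-class distance
    return s_b / (s_w + alpha)                      # alpha regularizes s_w

def grm_select(X, scores, k, lam=1.0):
    """Greedy redundancy-minimizing selection: prefer high-score features
    that are weakly correlated with those already chosen. This greedy loop
    approximates the GRM objective; it is not the authors' solver."""
    R = np.abs(np.corrcoef(X, rowvar=False))        # feature redundancy matrix
    selected = [int(np.argmax(scores))]
    while len(selected) < k:
        penalty = R[:, selected].mean(axis=1)       # average redundancy w.r.t. chosen set
        gain = scores - lam * penalty
        gain[np.asarray(selected)] = -np.inf        # never re-pick a feature
        selected.append(int(np.argmax(gain)))
    return selected
```

On a toy imbalanced problem where only one feature separates the minority class, the minority-emphasized score should rank that feature first, and the greedy step then fills the remaining slots with low-redundancy features.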
This work was supported by the National Natural Science Foundation of China (Nos. 61976182, 62076171, 61876157), the Key Program for International S&T Cooperation of Sichuan Province (2019YFH0097), and the Sichuan Key R&D Project (2020YFG0035).
Cite this article
Huang, S., Chen, H., Li, T. et al. Feature selection via minimizing global redundancy for imbalanced data. Appl Intell 52, 8685–8707 (2022). https://doi.org/10.1007/s10489-021-02855-9