Research on classification method of high-dimensional class-imbalanced datasets based on SVM

  • Chunkai Zhang
  • Ying Zhou
  • Jianwei Guo
  • Guoquan Wang
  • Xuan Wang
Original Article


High-dimensional data degrade classification because some combinations of features have an adverse effect on the classifier, while class imbalance biases the classifier toward the majority class and away from the minority class, because majority-class samples outnumber minority-class samples. Classification problems that are both high-dimensional and class-imbalanced arise in many fields, such as bioinformatics and healthcare. Many researchers have studied either the high-dimensional problem or the class-imbalanced problem and proposed a series of algorithms, but they have overlooked the new problem that arises when the two occur together: high dimensionality affects the sampling process, while class imbalance interferes with feature selection. This paper first analyses the new problem arising from the mutual influence of the two problems, then introduces the SVM and analyses its advantages in dealing with high-dimensional and class-imbalanced problems. Next, it proposes a new algorithm, BRFE-PBKS-SVM, aimed at high-dimensional class-imbalanced datasets. The algorithm improves SVM-RFE by taking class imbalance into account during feature selection, and improves SMOTE so that over-sampling can work in the Hilbert space with an adaptive over-sampling rate chosen by PSO. Finally, experimental results show the performance of the algorithm.
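For orientation, the over-sampling step that the proposed method builds on can be sketched as follows. This is a minimal sketch of plain SMOTE in the input space, not the BRFE-PBKS-SVM procedure: it does not work in the Hilbert space, it does not restrict itself to boundary samples, and the over-sampling amount is fixed by hand rather than tuned by PSO. The function name `smote`, its parameters and the toy data are assumptions made for illustration.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Plain SMOTE: create n_new synthetic points, each interpolated between
    a minority sample and one of its k nearest minority neighbours
    (Euclidean distance in the input space)."""
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    if k < 1:
        raise ValueError("need at least two minority samples")
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest other minority samples.
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        gap = rng.random()  # position on the segment from base to other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Toy data (hypothetical): 4 minority samples against 10 majority samples.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
n_majority = 10
# Assumption: balance the classes exactly; the paper instead adapts the rate with PSO.
new_points = smote(minority, n_new=n_majority - len(minority))
```

Every synthetic point lies on a line segment between two existing minority samples, so the method depends on distances between samples. The abstract identifies this sampling step as the one that high dimensionality affects.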


High-dimensional, Class-imbalanced, Feature selection, Boundary samples, Over-sampling



This work is supported by the National Key Research and Development Program of China (No. 2016YFB0800900).



Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  • Chunkai Zhang (1)
  • Ying Zhou (1)
  • Jianwei Guo (1)
  • Guoquan Wang (1)
  • Xuan Wang (1)
  1. Department of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China
