Overlap-Based Undersampling for Improving Imbalanced Data Classification

  • Pattaramon Vuttipittayamongkol
  • Eyad Elyan
  • Andrei Petrovski
  • Chrisina Jayne
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11314)


Classification of imbalanced data remains an important field in machine learning. Several methods have been proposed to address the class imbalance problem, including data resampling, adaptive learning and cost-adjusting algorithms. Data resampling methods are widely used due to their simplicity and flexibility. Most existing resampling techniques aim to rebalance the class distribution. However, class imbalance is not the only factor that affects the performance of the learning algorithm. Class overlap has been shown to have a higher impact on the classification of imbalanced datasets than the dominance of the negative class. In this paper, we propose a new undersampling method that eliminates negative instances from the overlapping region and hence improves the visibility of the minority instances. Testing and evaluating the proposed method on 36 public imbalanced datasets showed statistically significant improvements in classification performance.
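
The idea of removing negative (majority) instances from the overlapping region can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes, based only on the "Fuzzy C-means" keyword, that overlap is estimated from fuzzy membership degrees with two clusters, and that a negative instance counts as overlapping when its membership in the minority-dominated cluster exceeds a cutoff. The function names `fuzzy_c_means` and `overlap_undersample` and the default `threshold=0.4` are hypothetical choices made for illustration.

```python
import numpy as np


def fuzzy_c_means(X, c=2, m=2.0, n_iter=200, tol=1e-6, seed=0):
    """Minimal fuzzy c-means. Returns (centers, U) with U of shape (n, c)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Random initial memberships; each row sums to 1.
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        W = U ** m
        # Cluster centers are membership-weighted means of the data.
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        dist = np.maximum(dist, 1e-12)  # avoid division by zero
        # Standard membership update: u_ij proportional to d_ij^(-2/(m-1)).
        inv = dist ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return centers, U


def overlap_undersample(X, y, minority_label=1, threshold=0.4, seed=0):
    """Drop majority instances whose membership in the minority-dominated
    cluster exceeds `threshold` (an assumed, illustrative cutoff)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    _, U = fuzzy_c_means(X, c=2, seed=seed)
    is_min = y == minority_label
    # The cluster in which minority instances have the highest mean membership.
    pos_cluster = int(np.argmax(U[is_min].mean(axis=0)))
    # Majority instances that look like they sit in the minority region.
    overlapping = (~is_min) & (U[:, pos_cluster] > threshold)
    keep = ~overlapping
    return X[keep], y[keep]
```

Calling `overlap_undersample(X, y)` returns the data with every minority instance retained and only the majority instances flagged as overlapping removed. Unlike rebalancing methods, this sketch makes no attempt to equalise class sizes, and the number of removed instances depends entirely on the chosen threshold.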


Keywords: Undersampling, Overlap, Imbalanced data, Classification, Fuzzy C-means, Resampling


Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Pattaramon Vuttipittayamongkol (1), email author
  • Eyad Elyan (1)
  • Andrei Petrovski (1)
  • Chrisina Jayne (2)
  1. Robert Gordon University, Aberdeen, UK
  2. Oxford Brookes University, Oxford, UK
