HCAB-SMOTE: A Hybrid Clustered Affinitive Borderline SMOTE Approach for Imbalanced Data Binary Classification

Abstract

A binary dataset is considered imbalanced when one of its two classes (the minority class) holds fewer than 40% of the data instances. Existing classification algorithms are biased when applied to imbalanced binary datasets, as they tend to misclassify minority-class instances. Many techniques have been proposed to reduce this bias and increase classification accuracy. The Synthetic Minority Oversampling Technique (SMOTE) is a well-known approach to this problem: it generates new synthetic data instances to balance the dataset. Unfortunately, it generates these instances randomly, which produces useless new instances and wastes time and memory. Several SMOTE derivatives, such as Borderline SMOTE, were proposed to overcome this problem, yet they changed the number of generated instances only slightly. To address this, the paper proposes a novel approach for generating synthetic data instances, Hybrid Clustered Affinitive Borderline SMOTE (HCAB-SMOTE), which minimizes the number of generated instances while increasing classification accuracy. It combines undersampling, to remove majority-class noise instances, with oversampling, to enhance the density of the borderline area. It applies k-means clustering to the borderline area and identifies which clusters to oversample to achieve better results. Experimental results show that HCAB-SMOTE outperformed SMOTE, Borderline SMOTE, and the intermediate AB-SMOTE and CAB-SMOTE approaches developed on the way to HCAB-SMOTE, providing the highest classification accuracy with the fewest generated instances.
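Two of the ideas named in the abstract, the 40% imbalance criterion and oversampling restricted to borderline minority instances, can be illustrated with a minimal NumPy sketch. This is not the HCAB-SMOTE procedure itself, whose details are not given in this preview. It assumes the classic Borderline-SMOTE rule (a minority instance is borderline when at least half, but not all, of its k nearest neighbours belong to the majority class) and plain SMOTE interpolation. The function names, the default k = 5, and the Euclidean distance are illustrative assumptions, and the undersampling and k-means steps are omitted.

```python
import numpy as np


def is_imbalanced(y, threshold=0.40):
    """True when the smaller class holds less than `threshold` of all instances."""
    _, counts = np.unique(y, return_counts=True)
    return counts.min() / counts.sum() < threshold


def borderline_minority(X, y, minority=1, k=5):
    """Indices of minority instances whose k nearest neighbours are mostly,
    but not entirely, majority instances (assumed Borderline-SMOTE rule)."""
    selected = []
    for i in np.flatnonzero(y == minority):
        dist = np.linalg.norm(X - X[i], axis=1)
        dist[i] = np.inf  # exclude the instance itself
        neighbours = np.argsort(dist)[:k]
        n_majority = np.sum(y[neighbours] != minority)
        if k / 2 <= n_majority < k:  # all-majority neighbours would mean noise
            selected.append(i)
    return np.array(selected, dtype=int)


def smote_interpolate(X_seed, X_min, n_new, k=5, rng=None):
    """Create n_new synthetic points, each on the segment between a random
    seed and one of that seed's k nearest minority neighbours."""
    rng = np.random.default_rng(rng)
    synthetic = np.empty((n_new, X_min.shape[1]))
    for j in range(n_new):
        seed = X_seed[rng.integers(len(X_seed))]
        dist = np.linalg.norm(X_min - seed, axis=1)
        order = np.argsort(dist)
        neighbours = order[dist[order] > 0][:k]  # skip the seed itself
        if len(neighbours) == 0:
            synthetic[j] = seed  # degenerate case: no distinct neighbour
            continue
        neighbour = X_min[rng.choice(neighbours)]
        synthetic[j] = seed + rng.random() * (neighbour - seed)
    return synthetic
```

A typical use would be to call `borderline_minority(X, y)` to pick the seeds, pass `X[seeds]` and the minority rows `X[y == 1]` to `smote_interpolate`, and append the result to the training set. Because every synthetic point is a convex combination of two minority instances, it always lies inside the minority class's bounding box.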

Author information

Correspondence to Hisham Al Majzoub.

About this article

Cite this article

Al Majzoub, H., Elgedawy, I., Akaydın, Ö. et al. HCAB-SMOTE: A Hybrid Clustered Affinitive Borderline SMOTE Approach for Imbalanced Data Binary Classification. Arab J Sci Eng (2020) doi:10.1007/s13369-019-04336-1

Keywords

  • Imbalanced data
  • Borderline SMOTE
  • Oversampling
  • SMOTE
  • AB-SMOTE
  • k-means clustering