Abstract
A binary dataset is considered imbalanced when one of its two classes, the minority class, accounts for less than 40% of the data instances. Existing classification algorithms are biased when applied to imbalanced binary datasets, as they tend to misclassify instances of the minority class. Many techniques have been proposed to reduce this bias and to increase classification accuracy. The Synthetic Minority Oversampling Technique (SMOTE) is a well-known approach to this problem: it generates new synthetic data instances to balance the dataset. Unfortunately, it generates these instances randomly, which produces useless new instances and wastes time and memory. Several SMOTE derivatives, such as Borderline SMOTE, were proposed to overcome this problem, yet they changed the number of generated instances only slightly. To address this, the paper proposes a novel approach for generating synthesized data instances, known as Hybrid Clustered Affinitive Borderline SMOTE (HCAB-SMOTE). It minimizes the number of generated instances while increasing classification accuracy. It combines undersampling, which removes majority noise instances, with oversampling, which increases the density of the borderline. It applies k-means clustering to the borderline area and identifies which clusters to oversample to achieve better results. Experimental results show that HCAB-SMOTE outperformed SMOTE, Borderline SMOTE, AB-SMOTE and CAB-SMOTE, the latter two being approaches developed on the way to HCAB-SMOTE, as it provided the highest classification accuracy with the fewest generated instances.
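To make the baseline concrete, the following is a minimal sketch of the 40% imbalance rule stated in the abstract and of plain SMOTE-style interpolation between a minority instance and one of its nearest minority neighbours. It is not the HCAB-SMOTE procedure itself: the function names (`is_imbalanced`, `smote`), the default `k=5`, and the use of Euclidean distance are assumptions made for illustration only.

```python
import math
import random


def is_imbalanced(labels, threshold=0.40):
    """Apply the 40% rule: return (flag, minority_label) for a binary label list."""
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    minority = min(counts, key=counts.get)
    return counts[minority] / len(labels) < threshold, minority


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points by plain SMOTE-style interpolation.

    `minority` is a list of equal-length numeric tuples. Each synthetic point
    lies on the segment between a randomly chosen minority point and one of
    its k nearest minority neighbours (Euclidean distance, an assumption).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority instances to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        # Candidate neighbours: every other minority point, closest first.
        others = [q for j, q in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda q: math.dist(x, q))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # random position along the segment x -> nb
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic


if __name__ == "__main__":
    labels = [0] * 7 + [1] * 3
    print(is_imbalanced(labels))
    print(smote([(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)], n_new=4, k=2))
```

Because the seed point and the neighbour are both drawn at random from the whole minority class, many synthetic points land far from the decision boundary. That is the inefficiency the abstract attributes to SMOTE, and the one that the borderline and clustering steps are meant to reduce.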
About this article
Cite this article
Al Majzoub, H., Elgedawy, I., Akaydın, Ö. et al. HCAB-SMOTE: A Hybrid Clustered Affinitive Borderline SMOTE Approach for Imbalanced Data Binary Classification. Arab J Sci Eng 45, 3205–3222 (2020). https://doi.org/10.1007/s13369-019-04336-1
DOI: https://doi.org/10.1007/s13369-019-04336-1