Abstract
Class imbalance is one of the most challenging problems in supervised learning and has attracted considerable research attention in recent years. In an imbalanced dataset, the minority classes contain very few samples while the remaining classes contain many, and this skew degrades the predictive performance of machine learning models. Three approaches currently exist for dealing with class imbalance: algorithm-level, data-level, and ensemble-based. Of these, data-level approaches are the most widely used and fall into three sub-categories: under-sampling, oversampling, and hybrid techniques. Oversampling techniques balance a dataset by generating synthetic samples for the minority class. However, existing oversampling approaches have no strategy for handling noise samples in imbalanced and noisy datasets, which reduces the predictive performance of machine learning models. This study therefore proposes a noise-adaptive synthetic oversampling technique (NASOTECH) to address the class imbalance problem in imbalanced and noisy datasets. First, the noise-adaptive synthetic oversampling (NASO) strategy is introduced; based on the concept of the noise ratio, it determines the number of synthetic samples to generate for each sample in the minority class. Next, the NASOTECH algorithm, built on the NASO strategy, is proposed to handle the class imbalance problem in imbalanced and noisy datasets. Finally, empirical experiments are conducted on several synthetic and real datasets to verify the effectiveness of the proposed approach. The results confirm that NASOTECH outperforms three state-of-the-art oversampling techniques in terms of accuracy and geometric mean (G-mean) on imbalanced and noisy datasets.
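The abstract does not define the NASO noise ratio precisely, so the full NASOTECH algorithm is not reproduced here. As an illustration of the general family it belongs to (per-sample weighted interpolation in the style of SMOTE/ADASYN, where each minority sample's synthetic-sample count is driven by a local neighborhood statistic), the sketch below assigns each minority point a weight equal to the fraction of opposite-class points among its nearest neighbors and interpolates toward minority-class neighbors in proportion to that weight. The function name `weighted_oversample` and the specific weighting choice are illustrative assumptions, not the paper's NASO definition.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def weighted_oversample(X, y, minority=1, k=5, rng=None):
    """ADASYN-style sketch (not NASOTECH): oversample the minority class,
    weighting each minority sample by the fraction of opposite-class
    points among its k nearest neighbors in the full dataset."""
    rng = np.random.default_rng(rng)
    X_min = X[y == minority]
    n_needed = (y != minority).sum() - len(X_min)  # samples to reach balance
    # Neighborhood over the whole dataset to estimate each minority
    # sample's local difficulty/"noise" (the +1 skips the point itself).
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X_min)
    ratio = (y[idx[:, 1:]] != minority).mean(axis=1)
    if ratio.sum() == 0:                 # all neighborhoods pure: spread evenly
        ratio = np.ones(len(X_min))
    counts = np.floor(ratio / ratio.sum() * n_needed).astype(int)
    # Neighbors within the minority class, used for interpolation.
    k_min = min(k, len(X_min) - 1)
    _, midx = NearestNeighbors(n_neighbors=k_min + 1).fit(X_min).kneighbors(X_min)
    synth = []
    for i, c in enumerate(counts):
        for _ in range(c):
            j = rng.choice(midx[i, 1:])  # random minority neighbor
            gap = rng.random()           # random point on the segment
            synth.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    if not synth:
        return X, y
    X_new = np.vstack([X, synth])
    y_new = np.concatenate([y, np.full(len(synth), minority)])
    return X_new, y_new
```

Because the counts are floored, the result approaches but never overshoots a balanced class distribution; a noise-aware method such as NASOTECH would additionally suppress generation around samples it identifies as noise rather than merely reweighting them.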
Vo, M.T., Nguyen, T., Vo, H.A. et al. Noise-adaptive synthetic oversampling technique. Appl Intell 51, 7827–7836 (2021). https://doi.org/10.1007/s10489-021-02341-2