
Noise-adaptive synthetic oversampling technique

Published in Applied Intelligence

Abstract

In supervised learning, class imbalance is one of the most difficult problems and has attracted a great deal of research attention in recent years. In an imbalanced dataset, the minority classes contain very few samples, while the remaining classes contain very many. This imbalance reduces the predictive performance of machine learning models. Three approaches currently exist for dealing with the class imbalance problem: algorithm-level, data-level, and ensemble-based. Of these, data-level approaches are the most widely used and comprise three sub-categories: under-sampling, oversampling, and hybrid techniques. Oversampling techniques generate synthetic samples for the minority class to balance an imbalanced dataset. However, existing oversampling approaches have no strategy for handling noise samples in imbalanced and noisy datasets, which reduces the predictive performance of machine learning models. This study therefore proposes a noise-adaptive synthetic oversampling technique (NASOTECH) to deal with the class imbalance problem in imbalanced and noisy datasets. The noise-adaptive synthetic oversampling (NASO) strategy is first introduced; based on the concept of the noise ratio, it determines the number of synthetic samples to generate for each sample in the minority class. Next, the NASOTECH algorithm, built on the NASO strategy, is proposed to handle the class imbalance problem in imbalanced and noisy datasets. Finally, empirical experiments are conducted on several synthetic and real datasets to verify the effectiveness of the proposed approach. The results confirm that NASOTECH outperforms three state-of-the-art oversampling techniques in terms of accuracy and geometric mean (G-mean) on imbalanced and noisy datasets.
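The general idea of noise-ratio-guided oversampling described above can be illustrated with a minimal sketch. This is not the authors' NASOTECH algorithm, whose details are not given in the abstract: it assumes the noise ratio of a minority sample is the fraction of majority-class points among its k nearest neighbors, gives noisier samples a smaller share of the synthetic budget, and fills that budget with SMOTE-style linear interpolation. The names `noise_ratio` and `oversample` are illustrative only.

```python
import random
from math import dist

def noise_ratio(x, minority, majority, k=5):
    """Fraction of majority-class points among the k nearest neighbors of x
    (assumed definition; the paper's exact formula may differ)."""
    pool = [(p, 0) for p in minority if p != x] + [(p, 1) for p in majority]
    pool.sort(key=lambda pl: dist(pl[0], x))
    return sum(label for _, label in pool[:k]) / k

def oversample(minority, majority, k=5, seed=0):
    """Generate synthetic minority samples until the classes are balanced,
    allocating fewer synthetic points to samples with higher noise ratios."""
    rng = random.Random(seed)
    n_new = len(majority) - len(minority)            # samples needed to balance
    weights = [1.0 - noise_ratio(x, minority, majority, k) for x in minority]
    total = sum(weights) or 1.0                      # guard: all-noise minority
    synthetic = []
    for x, w in zip(minority, weights):
        budget = round(n_new * w / total)            # per-sample share
        neigh = sorted((p for p in minority if p != x),
                       key=lambda p: dist(p, x))[:k]
        for _ in range(budget):
            nb = rng.choice(neigh)                   # random minority neighbor
            lam = rng.random()                       # SMOTE-style interpolation
            synthetic.append(tuple(a + lam * (b - a) for a, b in zip(x, nb)))
    return synthetic
```

With a clean minority cluster far from the majority class, every minority point has the same noise ratio, so the synthetic budget is split evenly and all generated points fall inside the minority cluster's convex hull.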



Corresponding author

Correspondence to Tuong Le.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article

Cite this article

Vo, M.T., Nguyen, T., Vo, H.A. et al. Noise-adaptive synthetic oversampling technique. Appl Intell 51, 7827–7836 (2021). https://doi.org/10.1007/s10489-021-02341-2
