Abstract
Class imbalance is one of the most challenging problems in supervised learning and has attracted considerable research attention in recent years. In an imbalanced dataset, the minority classes contain very few samples while the remaining classes contain many, and this skew degrades the predictive performance of machine learning models. Three approaches currently exist for dealing with class imbalance: algorithm-level, data-level, and ensemble-based. Of these, data-level approaches are the most widely used and fall into three sub-categories: under-sampling, oversampling, and hybrid techniques. Oversampling techniques balance a dataset by generating synthetic samples for the minority class. However, existing oversampling approaches have no strategy for handling noise samples in imbalanced and noisy datasets, which reduces the predictive performance of machine learning models. This study therefore proposes a noise-adaptive synthetic oversampling technique (NASOTECH) to address the class imbalance problem in imbalanced and noisy datasets. First, the noise-adaptive synthetic oversampling (NASO) strategy is introduced; based on the concept of the noise ratio, it determines the number of synthetic samples to generate for each sample in the minority class. Next, the NASOTECH algorithm, built on the NASO strategy, is proposed to handle the class imbalance problem in imbalanced and noisy datasets. Finally, empirical experiments are conducted on several synthetic and real datasets to verify the effectiveness of the proposed approach. The results confirm that NASOTECH outperforms three state-of-the-art oversampling techniques in terms of accuracy and geometric mean (G-mean) on imbalanced and noisy datasets.
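The abstract does not define the NASO noise ratio precisely, so the full NASOTECH algorithm is not reproduced here. As an illustration of the general family it belongs to (per-sample weighted interpolation in the style of SMOTE/ADASYN, where each minority sample's synthetic-sample count is driven by a local neighborhood statistic), the sketch below assigns each minority point a weight equal to the fraction of opposite-class points among its nearest neighbors and interpolates toward minority-class neighbors in proportion to that weight. The function name `weighted_oversample` and the specific weighting choice are illustrative assumptions, not the paper's NASO definition.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def weighted_oversample(X, y, minority=1, k=5, rng=None):
    """ADASYN-style sketch (not NASOTECH): oversample the minority class,
    weighting each minority sample by the fraction of opposite-class
    points among its k nearest neighbors in the full dataset."""
    rng = np.random.default_rng(rng)
    X_min = X[y == minority]
    n_needed = (y != minority).sum() - len(X_min)  # samples to reach balance
    # Neighborhood over the whole dataset to estimate each minority
    # sample's local difficulty/"noise" (the +1 skips the point itself).
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X_min)
    ratio = (y[idx[:, 1:]] != minority).mean(axis=1)
    if ratio.sum() == 0:                 # all neighborhoods pure: spread evenly
        ratio = np.ones(len(X_min))
    counts = np.floor(ratio / ratio.sum() * n_needed).astype(int)
    # Neighbors within the minority class, used for interpolation.
    k_min = min(k, len(X_min) - 1)
    _, midx = NearestNeighbors(n_neighbors=k_min + 1).fit(X_min).kneighbors(X_min)
    synth = []
    for i, c in enumerate(counts):
        for _ in range(c):
            j = rng.choice(midx[i, 1:])  # random minority neighbor
            gap = rng.random()           # random point on the segment
            synth.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    if not synth:
        return X, y
    X_new = np.vstack([X, synth])
    y_new = np.concatenate([y, np.full(len(synth), minority)])
    return X_new, y_new
```

Because the counts are floored, the result approaches but never overshoots a balanced class distribution; a noise-aware method such as NASOTECH would additionally suppress generation around samples it identifies as noise rather than merely reweighting them.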
Vo, M.T., Nguyen, T., Vo, H.A. et al. Noise-adaptive synthetic oversampling technique. Appl Intell 51, 7827–7836 (2021). https://doi.org/10.1007/s10489-021-02341-2