Abstract
This review examines the state of the art in semi-supervised learning (SSL) techniques for class-imbalanced data. Class imbalance is inherent in many real-world applications and has been extensively investigated in supervised classification. In the semi-supervised setting the problem is even more interesting, because two outcomes are possible: errors caused by the imbalance can propagate to the unlabeled data and worsen the final performance, or the unlabeled data can help represent the minority class and improve the results. However, as far as we know, no survey organizes the semi-supervised approaches to class imbalance. Our goal is to fill this gap with a systematic review, for which we retrieved 444 articles published over five years (2017–2021) from the ACM Digital Library, IEEE Xplore, Elsevier, Springer, and Google Scholar. After applying exclusion criteria, we selected 47 articles and present them in more detail. We collect information to answer four research questions, covering the existence of pre/post-processing techniques, the applications, the data sets explored, the metrics used to evaluate the approaches, and the techniques developed to deal with class imbalance. We propose eight categories (balancing, graph-based, loss, self-training, ensemble, active learning, post-processing, and other types of learning) to organize the methodological approaches in the papers. Finally, we present a discussion and future trends in the area. Our review aims to provide an understanding of the most prominent and currently relevant work employing SSL for class imbalance.
Author information
Contributions
Willian D. G. de Oliveira: Methodology, Data curation, Writing - original draft, Figures. Lilian Berton: Conceptualization, Writing - original draft, Writing - review & editing, Supervision.
Ethics declarations
Conflict of interest
We declare that this work does not have competing interests.
About this article
Cite this article
de Oliveira, W.D.G., Berton, L. A systematic review for class-imbalance in semi-supervised learning. Artif Intell Rev 56 (Suppl 2), 2349–2382 (2023). https://doi.org/10.1007/s10462-023-10579-0
DOI: https://doi.org/10.1007/s10462-023-10579-0