Abstract
In data mining, classification is the task of assigning each instance in a dataset to one of a set of predefined classes. In real-life applications, traditional classifiers perform poorly on imbalanced datasets, i.e., datasets in which one class (the minority class) contains far fewer data points than the other class(es) (the majority class(es)). This skewed class distribution is termed the class imbalance problem (CIP). Researchers have examined the effects of CIP on classifier performance and proposed various techniques to handle it. In the literature, these techniques are broadly grouped into three levels: data-level approaches (or pre-processing techniques), algorithm-level approaches, and ensemble-level approaches. The sampling-based (data-level) approaches are further subdivided into three categories: oversampling techniques, undersampling techniques, and hybrid sampling (undersampling + oversampling) techniques. In this paper, we propose three hybrid sampling techniques (Bor-SMOTE+TL, TL+C-SMOTE, and SL-SMOTE+TL) that combine Tomek links (an undersampling technique) with oversampling techniques. Experiments on real-life imbalanced datasets show the usefulness of the proposed techniques compared with existing sampling techniques.
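To make the idea of hybrid sampling concrete, the following is a minimal sketch, not the authors' implementation. It pairs plain SMOTE-style interpolation with Tomek-link cleaning; the Borderline, C-SMOTE, and Safe-Level variants named above are not reproduced. All function names (`tomek_links`, `remove_tomek_majority`, `smote`, `hybrid_sample`) are hypothetical, and the ordering (oversample first, then clean) and the choice to drop only the majority member of each link are assumptions made for illustration.

```python
import math
import random


def nearest(i, X, candidates):
    """Index in `candidates` (other than i) closest to X[i] in Euclidean distance."""
    return min((j for j in candidates if j != i), key=lambda j: math.dist(X[i], X[j]))


def tomek_links(X, y):
    """Pairs (i, j), i < j, of opposite-class points that are mutual nearest neighbours."""
    idx = range(len(X))
    nn = [nearest(i, X, idx) for i in idx]
    return [(i, j) for i, j in enumerate(nn) if i < j and nn[j] == i and y[i] != y[j]]


def remove_tomek_majority(X, y, majority):
    """Undersampling step: drop the majority-class member of every Tomek link."""
    drop = {k for pair in tomek_links(X, y) for k in pair if y[k] == majority}
    keep = [k for k in range(len(X)) if k not in drop]
    return [X[k] for k in keep], [y[k] for k in keep]


def smote(X, y, minority, n_new, k=3, seed=0):
    """Oversampling step: interpolate between a minority point and one of its
    k nearest minority neighbours. Assumes at least two minority points."""
    rng = random.Random(seed)
    mins = [i for i, label in enumerate(y) if label == minority]
    synthetic = []
    for _ in range(n_new):
        i = rng.choice(mins)
        neighbours = sorted((j for j in mins if j != i),
                            key=lambda j: math.dist(X[i], X[j]))[:k]
        j = rng.choice(neighbours)
        t = rng.random()  # position along the segment from X[i] to X[j]
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(X[i], X[j])))
    return list(X) + synthetic, list(y) + [minority] * n_new


def hybrid_sample(X, y, minority, majority, k=3, seed=0):
    """Oversample the minority class up to the majority count, then clean with Tomek links."""
    n_new = max(y.count(majority) - y.count(minority), 0)
    X_os, y_os = smote(X, y, minority, n_new, k, seed)
    return remove_tomek_majority(X_os, y_os, majority)
```

Here `X` is a list of numeric tuples and `y` a list of class labels. Because the cleaning step in this sketch removes only majority points, the minority count after `hybrid_sample` equals the original majority count, while the majority count can only shrink.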
© 2021 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Gosain, A., Gupta, A., Singh, D. (2021). Hybrid Data-Level Techniques for Class Imbalance Problem. In: Gupta, D., Khanna, A., Bhattacharyya, S., Hassanien, A.E., Anand, S., Jaiswal, A. (eds) International Conference on Innovative Computing and Communications. Advances in Intelligent Systems and Computing, vol 1165. Springer, Singapore. https://doi.org/10.1007/978-981-15-5113-0_95
DOI: https://doi.org/10.1007/978-981-15-5113-0_95
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-5112-3
Online ISBN: 978-981-15-5113-0
eBook Packages: Intelligent Technologies and Robotics, Intelligent Technologies and Robotics (R0)