Abstract
Machine learning (ML) is accurate and reliable in solving supervised problems such as classification when training is performed appropriately for the predefined classes. In real-world scenarios, class imbalance may arise during dataset creation, where one class has a very large number of instances while the other has very few. In other words, the class distribution is not equal. Such scenarios result in anomalous predictions. Handling the imbalanced dataset is therefore required so that predictions take all classes into account in equal proportion. This paper surveys external and internal techniques for balancing datasets found in the literature and presents an experimental analysis of four datasets from different domains: medical, mining, security, and finance. The experiments are carried out in Python. Four external balancing techniques are used to balance the datasets: two oversampling techniques, SMOTE and ADASYN, and two undersampling techniques, Random Undersampling and Near Miss. The datasets are used for a binary classification task. Three machine learning classification algorithms (logistic regression, random forest, and decision tree) are applied to the imbalanced and balanced datasets to compare and contrast their performance. Results are reported for both the balanced and the imbalanced versions. It is found that undersampling discards many important data points and therefore predicts with low accuracy. For all the datasets, the oversampling technique ADASYN is observed to give reasonably good predictions with appropriate balance.
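To make the two families of external balancing concrete, the following is a minimal standard-library sketch, not the authors' code. It assumes a binary task with numeric features stored as lists of floats. The function names `random_undersample` and `smote_like_oversample`, the toy data, and the parameters `k` and `seed` are all hypothetical, introduced only for illustration. The second function is a simplified SMOTE-style interpolation and does not implement ADASYN's density-based weighting or Near Miss's distance-based selection.

```python
import math
import random
from collections import Counter


def random_undersample(X, y, seed=0):
    """Randomly drop rows until every class has as many rows as the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    X_out, y_out = [], []
    for label in counts:
        rows = [x for x, lab in zip(X, y) if lab == label]
        # Keep a random subset of size `target`; the remaining rows are discarded.
        for x in rng.sample(rows, target):
            X_out.append(x)
            y_out.append(label)
    return X_out, y_out


def smote_like_oversample(X, y, k=3, seed=0):
    """Add synthetic minority rows by interpolating towards nearby minority rows."""
    rng = random.Random(seed)
    counts = Counter(y)
    minority = min(counts, key=counts.get)
    target = max(counts.values())
    minority_rows = [x for x, lab in zip(X, y) if lab == minority]
    if len(minority_rows) < 2:
        raise ValueError("need at least two minority rows to interpolate")
    X_out, y_out = list(X), list(y)
    for _ in range(target - len(minority_rows)):
        base = rng.choice(minority_rows)
        # The k nearest minority neighbours of the chosen row (excluding itself).
        neighbours = sorted(
            (r for r in minority_rows if r is not base),
            key=lambda r: math.dist(base, r),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()
        # The synthetic row lies on the segment between `base` and `other`.
        X_out.append([b + gap * (o - b) for b, o in zip(base, other)])
        y_out.append(minority)
    return X_out, y_out


# Toy imbalanced data (assumed for illustration): 16 majority rows, 4 minority rows.
X = [[float(i), float(i % 3)] for i in range(20)]
y = [0] * 16 + [1] * 4

X_under, y_under = random_undersample(X, y)
X_over, y_over = smote_like_oversample(X, y)
print(Counter(y), Counter(y_under), Counter(y_over))
```

In this toy case undersampling keeps only 8 of the 20 rows, which illustrates the loss of data points the abstract attributes to undersampling, while oversampling keeps every original row and adds synthetic minority rows.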
Copyright information
© 2021 The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Goswami, T., Roy, U.B. (2021). Classification Accuracy Comparison for Imbalanced Datasets with Its Balanced Counterparts Obtained by Different Sampling Techniques. In: Kumar, A., Mozar, S. (eds) ICCCE 2020. Lecture Notes in Electrical Engineering, vol 698. Springer, Singapore. https://doi.org/10.1007/978-981-15-7961-5_5
DOI: https://doi.org/10.1007/978-981-15-7961-5_5
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-7960-8
Online ISBN: 978-981-15-7961-5
eBook Packages: Engineering, Engineering (R0)