Classification Accuracy Comparison for Imbalanced Datasets with Its Balanced Counterparts Obtained by Different Sampling Techniques

Goswami, Tilottama; Roy, Uponika Barman

doi:10.1007/978-981-15-7961-5_5


Part of the book series: Lecture Notes in Electrical Engineering (LNEE, volume 698)

Abstract

Machine learning (ML) is accurate and reliable in solving supervised problems such as classification when training is performed appropriately for the predefined classes. In real-world scenarios, class imbalance may arise during dataset creation, where one class has a very large number of instances while the other has very few. In other words, the class distribution is not equal. Such scenarios result in anomalous predictions. Handling an imbalanced dataset is therefore required in order to make correct predictions that consider all classes in equal ratio. The paper surveys various external and internal techniques for balancing datasets found in the literature and presents an experimental analysis of four datasets from different domains: medical, mining, security, and finance. The experiments are done using Python. External balancing techniques are used to balance the datasets: two oversampling techniques, SMOTE and ADASYN, and two undersampling techniques, Random Undersampling and Near Miss. These datasets are used for a binary classification task. Three machine learning classification algorithms, namely logistic regression, random forest, and decision tree, are applied to the imbalanced and balanced datasets to compare and contrast their performance. Comparisons between the balanced and imbalanced datasets are reported. It has been found that the undersampling techniques lose many important data points and thereby predict with low accuracy. For all the datasets, it is observed that the oversampling technique ADASYN makes decent predictions with appropriate balance.
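As a rough illustration of the external balancing idea described in the abstract, the following standard-library Python sketch shows random undersampling and a simplified SMOTE-style oversampling on a toy binary dataset. It is not the authors' code: the function names, the toy data, and the parameter k are assumptions made for illustration, the oversampler only captures the core interpolation step of SMOTE, and it assumes the minority class has at least two rows. ADASYN and Near Miss are not shown.

```python
import random
from collections import Counter


def random_undersample(X, y, seed=0):
    """Randomly keep only as many rows per class as the smallest class has."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)
    target = min(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label, rows in by_class.items():
        for row in rng.sample(rows, target):
            X_out.append(row)
            y_out.append(label)
    return X_out, y_out


def smote_like_oversample(X, y, k=3, seed=0):
    """Add synthetic minority rows by interpolating between a minority row
    and one of its k nearest minority neighbours (simplified SMOTE idea)."""
    rng = random.Random(seed)
    counts = Counter(y)
    minority = min(counts, key=counts.get)
    n_new = max(counts.values()) - counts[minority]
    minority_rows = [row for row, label in zip(X, y) if label == minority]
    X_out, y_out = list(X), list(y)
    for _ in range(n_new):
        base = rng.choice(minority_rows)
        # Nearest minority neighbours by squared Euclidean distance,
        # excluding the base row itself.
        neighbours = sorted(
            (r for r in minority_rows if r is not base),
            key=lambda r: sum((a - b) ** 2 for a, b in zip(base, r)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        X_out.append([a + gap * (b - a) for a, b in zip(base, other)])
        y_out.append(minority)
    return X_out, y_out


# Hypothetical toy data: 90 majority rows (label 0), 10 minority rows (label 1).
data_rng = random.Random(42)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(90)]
X += [[data_rng.gauss(3, 1), data_rng.gauss(3, 1)] for _ in range(10)]
y = [0] * 90 + [1] * 10

X_under, y_under = random_undersample(X, y)
X_over, y_over = smote_like_oversample(X, y)

print("original:    ", dict(Counter(y)))
print("undersampled:", dict(Counter(y_under)))
print("oversampled: ", dict(Counter(y_over)))
```

On this toy data, undersampling reduces 100 rows to 20, which illustrates how much majority-class information an undersampling technique can discard, while the oversampler keeps every original row and grows the minority class to match the majority.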

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Anurag Group of Institutions, Hyderabad, 500009, Telangana, India
Tilottama Goswami
Tata Consultancy Services, Whitefield, Bangalore, 560066, Karnataka, India
Uponika Barman Roy

Corresponding author

Correspondence to Tilottama Goswami.

Editor information

Editors and Affiliations

BioAxis DNA Research Centre (P) Ltd., Hyderabad, India
Amit Kumar
Dynexsys, Sydney, NSW, Australia
Stefan Mozar

About this paper

Cite this paper

Goswami, T., Roy, U.B. (2021). Classification Accuracy Comparison for Imbalanced Datasets with Its Balanced Counterparts Obtained by Different Sampling Techniques. In: Kumar, A., Mozar, S. (eds) ICCCE 2020. Lecture Notes in Electrical Engineering, vol 698. Springer, Singapore. https://doi.org/10.1007/978-981-15-7961-5_5

DOI: https://doi.org/10.1007/978-981-15-7961-5_5
Published: 12 October 2020
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-7960-8
Online ISBN: 978-981-15-7961-5
eBook Packages: Engineering, Engineering (R0)
