Classification Accuracy Comparison for Imbalanced Datasets with Its Balanced Counterparts Obtained by Different Sampling Techniques

  • Conference paper
ICCCE 2020

Part of the book series: Lecture Notes in Electrical Engineering (LNEE, volume 698)

Abstract

Machine learning (ML) is accurate and reliable in solving supervised problems such as classification when training is performed appropriately for the predefined classes. In real-world scenarios, class imbalance may arise during dataset creation: one class has a very large number of instances while the other has very few. In other words, the class distribution is not equal. Such scenarios result in anomalous predictions. Handling the imbalance is therefore required so that predictions account for all classes in an equal ratio. The paper surveys external and internal balancing techniques found in the literature and presents an experimental analysis of four datasets from different domains: medical, mining, security, and finance. The experiments are done in Python. Four external balancing techniques are applied to the datasets: two oversampling techniques, SMOTE and ADASYN, and two undersampling techniques, Random Undersampling and Near Miss. The datasets are used for binary classification. Three machine learning classification algorithms (logistic regression, random forest, and decision tree) are applied to the imbalanced and balanced datasets to compare and contrast their performance. Results for both the balanced and the imbalanced datasets are reported. Undersampling is found to discard many important data points and therefore predicts with low accuracy. For all the datasets, the oversampling technique ADASYN is observed to make decent predictions with appropriate balance.
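The two families of external balancing named in the abstract can be illustrated with a minimal, standard-library-only Python sketch. It is not the authors' code: the abstract does not say which library or parameters were used. The function names `random_undersample` and `smote_like_oversample`, the toy data, and the neighbour count `k=3` are all assumptions made for illustration. The oversampler follows the general SMOTE idea (interpolating between a minority point and one of its nearest minority neighbours) in simplified form, and ADASYN and Near Miss are not implemented here.

```python
import random
from collections import Counter


def random_undersample(X, y, seed=0):
    """Randomly drop majority-class rows until both classes are the same size.

    Hypothetical helper for a binary problem; not taken from the paper.
    """
    rng = random.Random(seed)
    counts = Counter(y)
    minority = min(counts, key=counts.get)
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    majority_idx = [i for i, label in enumerate(y) if label != minority]
    # Keep every minority row plus an equally sized random majority subset.
    keep = sorted(minority_idx + rng.sample(majority_idx, len(minority_idx)))
    return [X[i] for i in keep], [y[i] for i in keep]


def smote_like_oversample(X, y, k=3, seed=0):
    """Add synthetic minority rows by interpolating toward a near neighbour.

    Simplified SMOTE-style sketch; assumes at least two minority rows.
    """
    rng = random.Random(seed)
    counts = Counter(y)
    minority = min(counts, key=counts.get)
    n_needed = max(counts.values()) - counts[minority]
    pts = [x for x, label in zip(X, y) if label == minority]
    X_new, y_new = list(X), list(y)
    for _ in range(n_needed):
        base = rng.choice(pts)
        # k nearest minority neighbours of `base` by squared Euclidean distance.
        neighbours = sorted(
            (p for p in pts if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        X_new.append([a + gap * (b - a) for a, b in zip(base, other)])
        y_new.append(minority)
    return X_new, y_new


# Toy imbalanced data (assumed): 20 majority rows, 4 minority rows.
data_rng = random.Random(42)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(20)]
X += [[data_rng.gauss(3, 0.5), data_rng.gauss(3, 0.5)] for _ in range(4)]
y = [0] * 20 + [1] * 4

X_under, y_under = random_undersample(X, y)
X_over, y_over = smote_like_oversample(X, y)

print("original:    ", dict(Counter(y)))
print("undersampled:", dict(Counter(y_under)))
print("oversampled: ", dict(Counter(y_over)))
```

On this toy set, undersampling keeps only 8 of the 24 rows, which is the loss of data points the abstract attributes to undersampling, while oversampling keeps every original row and adds synthetic minority rows.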

Author information

Corresponding author

Correspondence to Tilottama Goswami.

Copyright information

© 2021 The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Goswami, T., Roy, U.B. (2021). Classification Accuracy Comparison for Imbalanced Datasets with Its Balanced Counterparts Obtained by Different Sampling Techniques. In: Kumar, A., Mozar, S. (eds) ICCCE 2020. Lecture Notes in Electrical Engineering, vol 698. Springer, Singapore. https://doi.org/10.1007/978-981-15-7961-5_5

  • DOI: https://doi.org/10.1007/978-981-15-7961-5_5

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-15-7960-8

  • Online ISBN: 978-981-15-7961-5

  • eBook Packages: Engineering, Engineering (R0)
