An Application of Oversampling, Undersampling, Bagging and Boosting in Handling Imbalanced Datasets

  • Conference paper
Proceedings of the First International Conference on Advanced Data and Information Engineering (DaEng-2013)

Abstract

Most classifiers work well when the class distribution of the response variable is well balanced; problems arise when the dataset is imbalanced. This paper applied four methods, Oversampling, Undersampling, Bagging and Boosting, to handle imbalanced datasets. The cardiac surgery dataset has a binary response variable (1 = Died, 0 = Alive). The sample size is 4976 cases, of which 4.2 % are Died and 95.8 % are Alive. CART, C5 and CHAID were chosen as the classifiers. In classification problems, the accuracy rate of the predictive model is not an appropriate measure when the classes are imbalanced, because it is biased towards the majority class. Thus, the performance of the classifier is measured using sensitivity and precision. Oversampling and undersampling are found to work well in improving decision tree classification for the imbalanced dataset. Meanwhile, boosting and bagging did not improve the decision tree performance.
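The point about accuracy can be illustrated with a small sketch. It is not the authors' procedure: the class counts are only approximated from the percentages quoted in the abstract, the feature rows are placeholders, and the helper names (`random_oversample`, `random_undersample`) are hypothetical. It shows that a model predicting "Alive" for every case scores about 95.8 % accuracy yet has zero sensitivity, and how simple random oversampling and undersampling would balance the two classes.

```python
import random


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) with `positive` treated as the class of interest."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def sensitivity(y_true, y_pred):
    """Share of actual positives that were predicted positive (recall)."""
    tp, _, fn, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if tp + fn else 0.0


def precision(y_true, y_pred):
    """Share of predicted positives that are actual positives.
    Reported as 0.0 when nothing is predicted positive (a convention chosen here)."""
    tp, fp, _, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fp) if tp + fp else 0.0


def _split_indices(labels, minority):
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    return minority_idx, majority_idx


def random_oversample(rows, labels, minority=1, seed=0):
    """Hypothetical helper: duplicate minority rows at random until classes match."""
    rng = random.Random(seed)
    minority_idx, majority_idx = _split_indices(labels, minority)
    extra = [rng.choice(minority_idx)
             for _ in range(len(majority_idx) - len(minority_idx))]
    keep = majority_idx + minority_idx + extra
    rng.shuffle(keep)
    return [rows[i] for i in keep], [labels[i] for i in keep]


def random_undersample(rows, labels, minority=1, seed=0):
    """Hypothetical helper: drop majority rows at random until classes match."""
    rng = random.Random(seed)
    minority_idx, majority_idx = _split_indices(labels, minority)
    keep = minority_idx + rng.sample(majority_idx, len(minority_idx))
    rng.shuffle(keep)
    return [rows[i] for i in keep], [labels[i] for i in keep]


# Assumption: counts are reconstructed from the abstract's 4.2 % / 95.8 % split.
N_TOTAL = 4976
N_DIED = round(0.042 * N_TOTAL)
labels = [1] * N_DIED + [0] * (N_TOTAL - N_DIED)
rows = list(range(N_TOTAL))  # placeholder for the real feature rows

# A trivial model that predicts "Alive" (0) for every case.
always_alive = [0] * N_TOTAL
print("accuracy:   ", accuracy(labels, always_alive))
print("sensitivity:", sensitivity(labels, always_alive))
print("precision:  ", precision(labels, always_alive))

_, over_labels = random_oversample(rows, labels)
_, under_labels = random_undersample(rows, labels)
print("oversampled class sizes: ", over_labels.count(1), over_labels.count(0))
print("undersampled class sizes:", under_labels.count(1), under_labels.count(0))
```

In practice the resampling would be applied to the training partition only, so that sensitivity and precision are still estimated on data with the original class distribution.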

Corresponding author

Correspondence to Bee Wah Yap.

Copyright information

© 2014 Springer Science+Business Media Singapore

About this paper

Cite this paper

Yap, B.W., Rani, K.A., Rahman, H.A.A., Fong, S., Khairudin, Z., Abdullah, N.N. (2014). An Application of Oversampling, Undersampling, Bagging and Boosting in Handling Imbalanced Datasets. In: Herawan, T., Deris, M., Abawajy, J. (eds) Proceedings of the First International Conference on Advanced Data and Information Engineering (DaEng-2013). Lecture Notes in Electrical Engineering, vol 285. Springer, Singapore. https://doi.org/10.1007/978-981-4585-18-7_2

  • DOI: https://doi.org/10.1007/978-981-4585-18-7_2

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-4585-17-0

  • Online ISBN: 978-981-4585-18-7

  • eBook Packages: Engineering, Engineering (R0)
