An Application of Oversampling, Undersampling, Bagging and Boosting in Handling Imbalanced Datasets

Yap, Bee Wah; Rani, Khatijahhusna Abd; Rahman, Hezlin Aryani Abd; Fong, Simon; Khairudin, Zuraida; Abdullah, Nik Nik

doi:10.1007/978-981-4585-18-7_2

Part of the book series: Lecture Notes in Electrical Engineering (LNEE, volume 285)

Abstract

Most classifiers work well when the class distribution of the response variable is well balanced; problems arise when the dataset is imbalanced. This paper applied four methods, Oversampling, Undersampling, Bagging and Boosting, in handling imbalanced datasets. The cardiac surgery dataset has a binary response variable (1 = Died, 0 = Alive). The sample size is 4976 cases, of which 4.2% died and 95.8% were alive. CART, C5 and CHAID were chosen as the classifiers. In classification problems, the accuracy rate of the predictive model is not an appropriate measure when the data are imbalanced, because it is biased towards the majority class. Thus, the performance of the classifiers is measured using sensitivity and precision. Oversampling and undersampling are found to work well in improving decision tree classification for the imbalanced dataset. Meanwhile, boosting and bagging did not improve the decision tree performance.
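
The following minimal Python sketch illustrates why accuracy misleads at the class ratio quoted in the abstract, and what random oversampling and undersampling do to the class counts. It is not the authors' implementation. The exact count of deaths is an assumption obtained by rounding 4.2% of 4976, the "always predict Alive" classifier is a hypothetical baseline, and all function names are illustrative.

```python
import random

def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn), treating `positive` as the class of interest."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        else:
            if t == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn

def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)

def sensitivity(y_true, y_pred):
    """Share of true positives (Died) that the model finds: tp / (tp + fn)."""
    tp, _, fn, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0

def precision(y_true, y_pred):
    """Share of predicted positives that are correct: tp / (tp + fp).
    Returns 0.0 when nothing is predicted positive (a convention assumed here)."""
    tp, fp, _, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fp) if (tp + fp) else 0.0

def random_oversample(labels, minority=1, seed=0):
    """Indices with minority cases duplicated at random until classes match."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    extra = rng.choices(minority_idx, k=len(majority_idx) - len(minority_idx))
    return majority_idx + minority_idx + extra

def random_undersample(labels, minority=1, seed=0):
    """Indices with majority cases randomly dropped until classes match."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    return rng.sample(majority_idx, len(minority_idx)) + minority_idx

# Class ratio from the abstract; the exact number of deaths is an assumption.
n_total = 4976
n_died = round(0.042 * n_total)
labels = [1] * n_died + [0] * (n_total - n_died)

# Hypothetical baseline that predicts "Alive" for every patient.
always_alive = [0] * n_total
print("accuracy   :", round(accuracy(labels, always_alive), 3))
print("sensitivity:", sensitivity(labels, always_alive))
print("precision  :", precision(labels, always_alive))

# Rebalanced training indices.
over_idx = random_oversample(labels)
under_idx = random_undersample(labels)
print("oversampled size :", len(over_idx))
print("undersampled size:", len(under_idx))
```

The baseline scores high accuracy while detecting no deaths at all, which is why sensitivity and precision are the more informative measures here. In practice, resampling would be applied to the training partition only, so that the test set keeps the original class distribution.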

Author information

Authors and Affiliations

Faculty of Computer and Mathematical Sciences, Universiti Teknologi MARA, Shah Alam, Selangor, Malaysia
Bee Wah Yap, Khatijahhusna Abd Rani, Hezlin Aryani Abd Rahman & Zuraida Khairudin
Faculty of Science and Technology, University of Macau, Macau, China
Simon Fong
Faculty of Medicine, Universiti Teknologi MARA, Shah Alam, Selangor, Malaysia
Nik Nik Abdullah

Corresponding author

Correspondence to Bee Wah Yap.

Editor information

Editors and Affiliations

Faculty of Computer Science and Information Technology, University of Malaya, Kuala Lumpur, Malaysia
Tutut Herawan
Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn Malaysia, Batu Pahat, Malaysia
Mustafa Mat Deris
School of Information Technology, Deakin University, Burwood, Victoria, Australia
Jemal Abawajy

Cite this paper

Yap, B.W., Rani, K.A., Rahman, H.A.A., Fong, S., Khairudin, Z., Abdullah, N.N. (2014). An Application of Oversampling, Undersampling, Bagging and Boosting in Handling Imbalanced Datasets. In: Herawan, T., Deris, M., Abawajy, J. (eds) Proceedings of the First International Conference on Advanced Data and Information Engineering (DaEng-2013). Lecture Notes in Electrical Engineering, vol 285. Springer, Singapore. https://doi.org/10.1007/978-981-4585-18-7_2

DOI: https://doi.org/10.1007/978-981-4585-18-7_2
Published: 15 December 2013
Publisher Name: Springer, Singapore
Print ISBN: 978-981-4585-17-0
Online ISBN: 978-981-4585-18-7
eBook Packages: Engineering, Engineering (R0)
