An Application of Oversampling, Undersampling, Bagging and Boosting in Handling Imbalanced Datasets

  • Bee Wah Yap
  • Khatijahhusna Abd Rani
  • Hezlin Aryani Abd Rahman
  • Simon Fong
  • Zuraida Khairudin
  • Nik Nik Abdullah
Conference paper
Part of the Lecture Notes in Electrical Engineering book series (LNEE, volume 285)

Abstract

Most classifiers work well when the class distribution of the response variable in the dataset is well balanced. Problems arise when the dataset is imbalanced. This paper applied four methods (Oversampling, Undersampling, Bagging and Boosting) in handling imbalanced datasets. The cardiac surgery dataset has a binary response variable (1 = Died, 0 = Alive). The sample size is 4976 cases, of which 4.2% are Died and 95.8% are Alive. CART, C5 and CHAID were chosen as the classifiers. In classification problems, the accuracy rate of the predictive model is not an appropriate measure when the data are imbalanced, because it is biased towards the majority class. Thus, the performance of the classifier is measured using sensitivity and precision. Oversampling and undersampling are found to work well in improving the classification for the imbalanced dataset using decision trees. Meanwhile, boosting and bagging did not improve the decision tree performance.
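The abstract's point about accuracy can be illustrated with a small sketch. It assumes about 209 Died cases (4.2% of 4976, rounded; the exact count is not given in the abstract) and uses hypothetical helper names. The random oversampling and undersampling functions are generic textbook versions, not necessarily the procedure or software used in the paper.

```python
import random

def accuracy(y_true, y_pred):
    """Fraction of cases where prediction equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def sensitivity_precision(y_true, y_pred, positive=1):
    """Sensitivity = TP/(TP+FN); precision = TP/(TP+FP). Returns 0.0 when undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sens = tp / (tp + fn) if (tp + fn) else 0.0
    prec = tp / (tp + fp) if (tp + fp) else 0.0
    return sens, prec

def random_oversample(labels, minority=1, seed=0):
    """Indices with minority cases redrawn (with replacement) up to the majority count."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    extra = rng.choices(minority_idx, k=len(majority_idx) - len(minority_idx))
    return majority_idx + minority_idx + extra

def random_undersample(labels, minority=1, seed=0):
    """Indices with majority cases drawn (without replacement) down to the minority count."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    return rng.sample(majority_idx, k=len(minority_idx)) + minority_idx

# Assumed class counts reconstructed from the percentages in the abstract.
N_TOTAL = 4976
N_DIED = round(0.042 * N_TOTAL)               # approximate, not a reported figure
labels = [1] * N_DIED + [0] * (N_TOTAL - N_DIED)

# A trivial model that predicts "Alive" for every patient.
all_alive = [0] * N_TOTAL
acc = accuracy(labels, all_alive)
sens, prec = sensitivity_precision(labels, all_alive)
print(f"accuracy={acc:.3f} sensitivity={sens:.3f} precision={prec:.3f}")

# Class balance after each resampling scheme.
over = random_oversample(labels)
under = random_undersample(labels)
print(len(over), sum(labels[i] for i in over))
print(len(under), sum(labels[i] for i in under))
```

Under these assumed counts, the trivial model scores an accuracy close to 95.8% while detecting none of the Died cases, which is why sensitivity and precision are the more informative measures here. In practice, resampling is normally applied to the training partition only, so that evaluation reflects the original class distribution.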

Keywords

Bagging, Boosting, Oversampling, Undersampling, Imbalanced data

Copyright information

© Springer Science+Business Media Singapore 2014

Authors and Affiliations

  • Bee Wah Yap (1)
  • Khatijahhusna Abd Rani (1)
  • Hezlin Aryani Abd Rahman (1)
  • Simon Fong (2)
  • Zuraida Khairudin (1)
  • Nik Nik Abdullah (3)

  1. Faculty of Computer and Mathematical Sciences, Universiti Teknologi MARA, Shah Alam, Malaysia
  2. Faculty of Science and Technology, University of Macau, Macau, China
  3. Faculty of Medicine, Universiti Teknologi MARA, Shah Alam, Malaysia
