A review of boosting methods for imbalanced data classification

  • Survey
  • Published in: Pattern Analysis and Applications

Abstract

Recently, the problem of imbalanced data classification has drawn significant interest from academia, industry, and government funding agencies. The fundamental issue is that imbalanced data substantially degrades the performance of most standard learning algorithms, which assume or expect a balanced class distribution or equal misclassification costs. Boosting is a meta-technique applicable to most learning algorithms. This paper reviews boosting methods for imbalanced data classification, denoted IDBoosting (imbalanced-data boosting), into which conventional learning algorithms can be integrated without further modification. The main focus is on the intrinsic mechanisms rather than implementation details. Existing methods are catalogued, and each class is examined in detail in terms of design criteria, typical algorithms, and performance analysis. The essence of two IDBoosting methods is uncovered and supported by experimental evidence, and useful reference points for future research are given.
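To make the meta-technique concrete, the sketch below shows plain AdaBoost over decision stumps with a per-example `costs` factor marking where imbalance-aware variants hook into the weight update. This is an illustrative toy, not any specific algorithm from the review: the `costs` vector and the simple cost-scaled exponent are assumptions for exposition; published cost-sensitive boosters such as AdaCost use more carefully designed update rules.

```python
import numpy as np

def best_stump(X, y, w):
    """Exhaustively pick the axis-aligned stump with lowest weighted error."""
    best_err, best, best_pred = np.inf, None, None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(X[:, j] <= thr, sign, -sign)
                err = w[pred != y].sum()
                if err < best_err:
                    best_err, best, best_pred = err, (j, thr, sign), pred
    return best, best_pred

def train(X, y, rounds=10, costs=None):
    """Minimal AdaBoost; labels y must be in {-1, +1}. `costs` is a
    hypothetical per-example factor: giving minority examples costs > 1
    makes their weights grow faster when they are misclassified."""
    n = len(y)
    w = np.ones(n) / n
    costs = np.ones(n) if costs is None else costs
    ensemble = []
    for _ in range(rounds):
        stump, pred = best_stump(X, y, w)
        err = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)   # classifier weight
        w *= np.exp(-alpha * costs * y * pred)  # cost-scaled reweighting
        w /= w.sum()
        ensemble.append((alpha, stump))
    return ensemble

def predict(ensemble, X):
    """Weighted vote of the stumps."""
    score = np.zeros(len(X))
    for alpha, (j, thr, sign) in ensemble:
        score += alpha * np.where(X[:, j] <= thr, sign, -sign)
    return np.where(score >= 0, 1, -1)

# Toy imbalanced set: five majority (-1) examples, one minority (+1).
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [9.0]])
y = np.array([-1, -1, -1, -1, -1, 1])
costs = np.where(y == 1, 3.0, 1.0)  # up-weight the minority class
model = train(X, y, rounds=5, costs=costs)
print((predict(model, X) == y).all())  # True on this separable toy set
```

Because boosting touches the base learner only through the sample weights `w`, any learner that accepts weighted data (here, the stump search) can be plugged in unchanged, which is the property the IDBoosting framing relies on.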



Acknowledgments

The work was partially supported by the Natural Science Foundation of China under Grant Nos. 60974129 and 70931002, and by the NJUST Research Fund under Grant No. 2011YBXM119.

Author information


Corresponding author

Correspondence to Qiujie Li.


Cite this article

Li, Q., Mao, Y. A review of boosting methods for imbalanced data classification. Pattern Anal Applic 17, 679–693 (2014). https://doi.org/10.1007/s10044-014-0392-8

