Abstract
Recently, the problem of imbalanced data classification has drawn significant interest from academia, industry, and government funding agencies. The fundamental issue is that imbalanced data degrades the performance of most standard learning algorithms, which assume a balanced class distribution or equal misclassification costs. Boosting is a meta-technique applicable to most learning algorithms. This paper reviews boosting methods for imbalanced data classification, denoted IDBoosting (imbalanced-data boosting), in which conventional learning algorithms can be integrated without further modification. The main focus is on intrinsic mechanisms rather than implementation details. Existing methods are catalogued, and each class is presented in detail in terms of design criteria, typical algorithms, and performance analysis. The essence of two IDBoosting methods is analysed, supported by experimental evidence, and useful reference points for future research are given.
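To make the idea concrete, the sketch below shows one common IDBoosting pattern the review covers: a cost-sensitive variant of the AdaBoost loop (in the spirit of AdaCost-style methods) where a conventional base learner, here a one-feature decision stump, is used unmodified and only the weight update is made asymmetric, so misclassified minority examples receive a larger weight increase. The function names, the cost factor `c_pos`, and the toy data are illustrative assumptions, not code from the paper.

```python
import math

def stump_train(X, y, w):
    """Fit the best single-feature threshold stump under example weights w."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({x[j] for x in X}):
            for sign in (1, -1):
                pred = [sign if x[j] >= thr else -sign for x in X]
                err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best  # (weighted_error, feature, threshold, sign)

def stump_predict(stump, x):
    _, j, thr, sign = stump
    return sign if x[j] >= thr else -sign

def boost_cost_sensitive(X, y, rounds=10, c_pos=2.0):
    """Boosting loop where minority-class (+1) errors cost c_pos times more."""
    n = len(y)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        stump = stump_train(X, y, w)
        err = max(stump[0], 1e-10)
        if err >= 0.5:          # base learner no better than chance: stop
            break
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        for i in range(n):
            margin = y[i] * stump_predict(stump, X[i])
            # asymmetric update: only minority mistakes get the extra cost factor
            cost = c_pos if (y[i] == 1 and margin < 0) else 1.0
            w[i] *= math.exp(-alpha * margin * cost)
        s = sum(w)
        w = [wi / s for wi in w]  # renormalise to a distribution
    return ensemble

def predict(ensemble, x):
    score = sum(a * stump_predict(st, x) for a, st in ensemble)
    return 1 if score >= 0 else -1

# Toy imbalanced data: 2 minority (+1) vs 6 majority (-1) points on a line.
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [0.8], [0.9]]
y = [-1, -1, -1, -1, -1, -1, 1, 1]
model = boost_cost_sensitive(X, y, rounds=5, c_pos=3.0)
print([predict(model, x) for x in X])
```

This illustrates the review's central point: the base learner is untouched, and the imbalance is handled entirely inside the boosting meta-loop via the weight update (other families surveyed instead resample the data each round, as in SMOTEBoost or RUSBoost).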
Acknowledgments
This work was partially supported by the Natural Science Foundation of China under Grant Nos. 60974129 and 70931002, and by the NJUST Research Fund under Grant No. 2011YBXM119.
Li, Q., Mao, Y. A review of boosting methods for imbalanced data classification. Pattern Anal Applic 17, 679–693 (2014). https://doi.org/10.1007/s10044-014-0392-8