Abstract
Real-world data mining applications must address the issue of learning from imbalanced data sets. The problem occurs when the instances of one class greatly outnumber the instances of the other class. Such data sets often cause a default classifier to be built, one that assigns every instance to the majority class, because of skewed vector spaces or a lack of information. Common approaches to the class imbalance problem modify either the data distribution or the classifier. In this work, we combine both approaches. We use support vector machines with soft margins as the base classifier to solve the skewed vector spaces problem. We then counter the excessive bias introduced by this approach with a boosting algorithm. We found that this ensemble of SVMs makes an impressive improvement in prediction performance, not only for the majority class, but also for the minority class.
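To make the two ideas in the abstract concrete, the following minimal sketch shows (a) why a default classifier can look accurate on imbalanced data while missing the minority class entirely, and (b) how one boosting round shifts weight toward the misclassified minority instances. The abstract does not say which boosting variant is used, so the standard AdaBoost weight update is assumed here. The function name `adaboost_reweight`, the toy labels, and the stand-in predictions are hypothetical and are not taken from the paper; no SVM is trained in this sketch.

```python
import math

def adaboost_reweight(weights, y_true, y_pred):
    """One AdaBoost round (assumed variant): return (alpha, new_weights)."""
    # Weighted error: total weight of the misclassified instances.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    if not 0.0 < err < 0.5:
        raise ValueError("weighted error must lie strictly between 0 and 0.5")
    alpha = 0.5 * math.log((1.0 - err) / err)
    # With -1/+1 labels, t * p is +1 for a correct prediction and -1 for a mistake,
    # so mistakes are scaled up by exp(alpha) and correct ones down by exp(-alpha).
    raw = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(raw)
    return alpha, [r / total for r in raw]

# Toy imbalanced set: 8 majority (-1) and 2 minority (+1) instances.
y_true = [-1] * 8 + [+1] * 2
# Stand-in for a default classifier: it labels everything as the majority class.
y_pred = [-1] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
minority_recall = sum(t == p == 1 for t, p in zip(y_true, y_pred)) / y_true.count(1)

# Start from uniform instance weights and apply one boosting round.
weights = [1.0 / len(y_true)] * len(y_true)
alpha, new_weights = adaboost_reweight(weights, y_true, y_pred)
minority_weight = sum(w for w, t in zip(new_weights, y_true) if t == 1)

print(f"accuracy={accuracy:.2f}, minority recall={minority_recall:.2f}")
print(f"alpha={alpha:.3f}, minority weight after one round={minority_weight:.2f}")
```

In this toy case the default classifier reaches 80% accuracy with zero minority recall, and after one round the two minority instances carry half of the total weight, so the next base classifier in the ensemble is pushed to attend to them.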
Cite this article
Wang, B.X., Japkowicz, N. Boosting support vector machines for imbalanced data sets. Knowl Inf Syst 25, 1–20 (2010). https://doi.org/10.1007/s10115-009-0198-y