Boosting support vector machines for imbalanced data sets

  • Regular Paper
Knowledge and Information Systems

Abstract

Real-world data mining applications must address the issue of learning from imbalanced data sets. The problem occurs when the instances of one class greatly outnumber those of the other. Such data sets often lead to a default classifier, one that assigns every instance to the majority class, because of skewed vector spaces or a lack of information. Common approaches to the class imbalance problem modify either the data distribution or the classifier. In this work, we combine both approaches. We use support vector machines with soft margins as the base classifier to address the skewed vector spaces problem, and then counter the excessive bias introduced by this approach with a boosting algorithm. We found that this ensemble of SVMs makes an impressive improvement in prediction performance, not only for the majority class, but also for the minority class.
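The two ideas in the abstract, the default classifier and the corrective effect of boosting, can be illustrated with a small sketch. It is not the authors' method: the data set (95 majority and 5 minority instances), the helper names `recall` and `adaboost_reweight`, and the use of the standard discrete AdaBoost weight update are all assumptions made for illustration, and the base learner here is a trivial majority-class predictor rather than a soft-margin SVM.

```python
import math


def recall(y_true, y_pred, label):
    """Fraction of instances with true class `label` that were predicted as `label`."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    total = sum(1 for t in y_true if t == label)
    return hits / total


def adaboost_reweight(weights, y_true, y_pred):
    """One round of the standard discrete AdaBoost weight update (assumed variant).

    Requires a weighted error strictly between 0 and 1.
    """
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    alpha = 0.5 * math.log((1.0 - err) / err)
    # Misclassified instances are scaled up, correctly classified ones scaled down.
    updated = [
        w * math.exp(alpha if t != p else -alpha)
        for w, t, p in zip(weights, y_true, y_pred)
    ]
    total = sum(updated)
    return [w / total for w in updated], alpha


# Hypothetical imbalanced data: 95 majority (-1) and 5 minority (+1) instances.
y_true = [-1] * 95 + [1] * 5
# A default classifier assigns every instance to the majority class.
y_pred = [-1] * 100

accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
majority_recall = recall(y_true, y_pred, -1)
minority_recall = recall(y_true, y_pred, 1)
# The geometric mean of per-class recalls exposes what accuracy hides.
g_mean = math.sqrt(majority_recall * minority_recall)

# Start from uniform weights and apply one boosting round.
weights = [1.0 / len(y_true)] * len(y_true)
new_weights, alpha = adaboost_reweight(weights, y_true, y_pred)
minority_mass = sum(new_weights[95:])

print(f"accuracy={accuracy:.2f} minority_recall={minority_recall:.2f} g_mean={g_mean:.2f}")
print(f"minority weight before={sum(weights[95:]):.2f} after={minority_mass:.2f}")
```

In this toy setting the default classifier scores high accuracy while never detecting the minority class, and a single re-weighting round moves half of the total weight onto the misclassified minority instances, so the next base learner can no longer ignore them.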

Author information

Corresponding author

Correspondence to Nathalie Japkowicz.

About this article

Cite this article

Wang, B.X., Japkowicz, N. Boosting support vector machines for imbalanced data sets. Knowl Inf Syst 25, 1–20 (2010). https://doi.org/10.1007/s10115-009-0198-y

  • DOI: https://doi.org/10.1007/s10115-009-0198-y
