
Knowledge and Information Systems, Volume 25, Issue 1, pp 1–20

Boosting support vector machines for imbalanced data sets

  • Benjamin X. Wang
  • Nathalie Japkowicz
Regular Paper

Abstract

Real-world data mining applications must address the issue of learning from imbalanced data sets, which arises when the instances of one class greatly outnumber those of the other. Such data sets often cause a default classifier to be built, due to skewed vector spaces or a lack of information. Common approaches to the class imbalance problem modify either the data distribution or the classifier. In this work, we combine both approaches: we use support vector machines with soft margins as the base classifier to address the skewed vector spaces, and then counter the excessive bias introduced by this step with a boosting algorithm. We found that this ensemble of SVMs yields an impressive improvement in prediction performance, not only on the majority class but also on the minority class.
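The combination described above can be sketched with a generic AdaBoost-style loop over soft-margin SVM base learners. This is a minimal illustration of the general idea, assuming scikit-learn and a synthetic imbalanced data set; it is not the authors' exact algorithm, and the kernel, `C`, and number of rounds are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic imbalanced data: roughly 90% majority / 10% minority.
X, y = make_classification(n_samples=400, weights=[0.9, 0.1], random_state=0)
y_pm = np.where(y == 1, 1, -1)  # boosting works with {-1, +1} labels

T = 10                      # boosting rounds
n = len(X)
w = np.full(n, 1.0 / n)     # uniform initial example weights
learners, alphas = [], []

for t in range(T):
    # Soft-margin SVM as the base learner; sample_weight carries the
    # boosting distribution, class_weight counters the imbalance.
    clf = SVC(kernel="rbf", C=1.0, class_weight="balanced")
    clf.fit(X, y_pm, sample_weight=w * n)
    pred = clf.predict(X)
    err = np.sum(w[pred != y_pm])  # weighted training error
    if err <= 0 or err >= 0.5:
        if not learners:           # keep at least one learner
            learners.append(clf)
            alphas.append(1.0)
        break
    alpha = 0.5 * np.log((1 - err) / err)
    # Reweight: increase weight on misclassified examples, renormalize.
    w *= np.exp(-alpha * y_pm * pred)
    w /= w.sum()
    learners.append(clf)
    alphas.append(alpha)

def ensemble_predict(X):
    """Weighted vote of the boosted SVMs."""
    score = sum(a * clf.predict(X) for a, clf in zip(alphas, learners))
    return np.sign(score)
```

Keeping the soft margin in the base SVM means each round still tolerates some misclassified minority examples; the boosting reweighting then forces later rounds to concentrate on exactly those examples.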

Keywords

Imbalanced data sets · Support vector machines · Boosting



Copyright information

© Springer-Verlag London Limited 2009

Authors and Affiliations

  1. Datalong Technology Ltd., Wuhan, Hubei, China
  2. School of Information Technology and Engineering, University of Ottawa, Ottawa, Canada
