Boosting support vector machines for imbalanced data sets

Wang, Benjamin X.; Japkowicz, Nathalie

doi:10.1007/s10115-009-0198-y

Regular Paper
Published: 05 March 2009

Volume 25, pages 1–20 (2010)
Knowledge and Information Systems

Benjamin X. Wang¹ &
Nathalie Japkowicz²

1339 Accesses
173 Citations

Abstract

Real-world data mining applications must address the issue of learning from imbalanced data sets. The problem occurs when the instances of one class greatly outnumber those of the other class. Such data sets often cause a default classifier to be built, either because the vector space is skewed or because there is too little information about the minority class. Common approaches for dealing with the class imbalance problem involve modifying the data distribution or modifying the classifier. In this work, we combine both approaches. We use support vector machines with soft margins as the base classifier to solve the skewed vector spaces problem. We then counter the excessive bias introduced by this approach with a boosting algorithm. We found that this ensemble of SVMs substantially improves prediction performance, not only for the majority class, but also for the minority class.
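
The general idea in the abstract, soft-margin SVMs used as base classifiers inside a boosting loop, can be illustrated with a minimal sketch. This is not the authors' algorithm, whose details are not given on this page. It assumes a linear SVM trained by batch subgradient descent and plain discrete AdaBoost reweighting; the function names, the synthetic data, and all parameter values (C, epochs, lr, rounds) are hypothetical choices made only for illustration.

```python
import math
import random


def train_linear_svm(X, y, dist, C=1.0, epochs=300, lr=0.01):
    """Fit a weighted soft-margin linear SVM by batch subgradient descent.

    Minimises 0.5*||w||^2 + C * n * sum_i dist[i] * hinge_i, which reduces to
    the usual soft-margin objective when dist is uniform (dist[i] = 1/n).
    """
    n, dim = len(X), len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = list(w), 0.0  # gradient of the regulariser is w
        for xi, yi, di in zip(X, y, dist):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1.0:  # inside the margin or misclassified
                scale = C * n * di * yi
                for j in range(dim):
                    grad_w[j] -= scale * xi[j]
                grad_b -= scale
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def svm_predict(model, x):
    """Return +1 or -1 for a single example."""
    w, b = model
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else -1


def fit_boosted_svm(X, y, rounds=5, C=1.0):
    """Discrete AdaBoost with the weighted linear SVM as base learner."""
    n = len(X)
    dist = [1.0 / n] * n  # start from a uniform example distribution
    ensemble = []
    for _ in range(rounds):
        model = train_linear_svm(X, y, dist, C=C)
        preds = [svm_predict(model, xi) for xi in X]
        err = sum(di for di, pi, yi in zip(dist, preds, y) if pi != yi)
        if err >= 0.5:  # no better than chance on the weighted data
            break
        err = max(err, 1e-10)  # avoid division by zero for a perfect learner
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, model))
        if err <= 1e-10:  # nothing left to reweight
            break
        # Increase the weight of misclassified examples, then renormalise.
        dist = [di * math.exp(-alpha * yi * pi)
                for di, yi, pi in zip(dist, y, preds)]
        total = sum(dist)
        dist = [di / total for di in dist]
    return ensemble


def boosted_predict(ensemble, x):
    """Weighted vote of the base SVMs; an empty ensemble predicts -1."""
    score = sum(alpha * svm_predict(model, x) for alpha, model in ensemble)
    return 1 if score > 0 else -1


# Synthetic 9:1 imbalanced data (an assumption, not data from the paper).
rng = random.Random(0)
majority = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(45)]
minority = [[rng.gauss(1.5, 1.0), rng.gauss(1.5, 1.0)] for _ in range(5)]
X = majority + minority
y = [-1] * 45 + [1] * 5

ensemble = fit_boosted_svm(X, y, rounds=5)
preds = [boosted_predict(ensemble, xi) for xi in X]
minority_recall = sum(p == 1 for p, t in zip(preds, y) if t == 1) / 5
majority_recall = sum(p == -1 for p, t in zip(preds, y) if t == -1) / 45
print("base learners:", len(ensemble))
print("minority recall:", minority_recall, "majority recall:", majority_recall)
```

Reporting recall per class rather than overall accuracy matters here: on a 9:1 data set a classifier that always predicts the majority class already reaches 90% accuracy while finding no minority instances at all.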

Author information

Authors and Affiliations

Datalong technology Ltd., 430074, Wuhan, Hubei, China
Benjamin X. Wang
School of Information Technology and Engineering, University of Ottawa, 800 King Edward Ave., P.O. Box 450 Stn.A, Ottawa, ON, K1N 6N5, Canada
Nathalie Japkowicz

Corresponding author

Correspondence to Nathalie Japkowicz.

About this article

Cite this article

Wang, B.X., Japkowicz, N. Boosting support vector machines for imbalanced data sets. Knowl Inf Syst 25, 1–20 (2010). https://doi.org/10.1007/s10115-009-0198-y

Received: 03 September 2008
Revised: 07 January 2009
Accepted: 26 January 2009
Published: 05 March 2009
Issue Date: October 2010
DOI: https://doi.org/10.1007/s10115-009-0198-y
