Applying Support Vector Machines to Imbalanced Datasets

  • Rehan Akbani
  • Stephen Kwek
  • Nathalie Japkowicz
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3201)


Support Vector Machines (SVMs) have been extensively studied and have shown remarkable success in many applications. However, the success of SVMs is very limited when they are applied to imbalanced datasets in which negative instances heavily outnumber positive instances (e.g. in gene profiling and credit card fraud detection). This paper discusses the factors behind this failure and explains why the common strategy of undersampling the training data may not be the best choice for SVMs. We then propose an algorithm for overcoming these problems, based on a variant of the SMOTE algorithm by Chawla et al. combined with the different error costs algorithm of Veropoulos et al. We compare the performance of our algorithm against these two algorithms, as well as against undersampling and regular SVM, and show that our algorithm outperforms all of them.
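
The two ingredients named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify their SMOTE variant or how the error costs are chosen, so the function names `smote_oversample` and `error_costs`, the toy data, and the heuristic of setting the cost ratio to the inverse class ratio are all assumptions made for illustration. The first function generates synthetic minority points by interpolating between a minority point and one of its nearest minority neighbours (the basic SMOTE idea); the second derives a larger misclassification cost for the positive class.

```python
import math
import random


def smote_oversample(minority, n_synthetic, k=5, seed=0):
    """Create synthetic minority points by interpolating between a
    minority point and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority points to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority points by Euclidean distance to the base point.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Pick a random point on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


def error_costs(n_pos, n_neg, base_cost=1.0):
    """Return (C_pos, C_neg). Assumed heuristic, not taken from the paper:
    the cost ratio C_pos / C_neg equals the class ratio n_neg / n_pos."""
    return base_cost * n_neg / n_pos, base_cost


# Hypothetical toy data: 3 positive (minority) and 30 negative instances.
positives = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1)]
negatives = [(float(i % 5), 3.0 + i // 5) for i in range(30)]

synthetic = smote_oversample(positives, n_synthetic=6, k=2)
balanced_positives = positives + synthetic

# Whatever imbalance remains after oversampling is handled by the error costs.
c_pos, c_neg = error_costs(len(balanced_positives), len(negatives))
print(len(balanced_positives), c_pos, c_neg)
```

In practice the oversampled training set and the two costs would be handed to an SVM trainer that supports per-class penalties; scikit-learn's `SVC`, for example, exposes this through its `class_weight` parameter.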


Keywords: Support Vector Machine; Minority Class; Positive Instance; Positive Class; Negative Instance


References

  1. Aha, D.: Tolerating noisy, irrelevant and novel attributes in instance-based learning algorithms. International Journal Man-Machine Studies 36, 267–287 (1992)
  2. Chawla, N., Bowyer, K., Hall, L., Kegelmeyer, W.: SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research 16, 321–357 (2002)
  3. Cortes, C.: Prediction of Generalisation Ability in Learning Machines. PhD thesis, Department of Computer Science, University of Rochester (1995)
  4. Cristianini, N., Kandola, J., Elisseeff, A., Shawe-Taylor, J.: On kernel target alignment. Journal Machine Learning Research 1 (2002)
  5. Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines and other kernel-based learning methods. Cambridge University Press, Cambridge (2000)
  6. Japkowicz, N.: The Class Imbalance Problem: Significance and Strategies. In: Proceedings of the 2000 International Conference on Artificial Intelligence: Special Track on Inductive Learning, Las Vegas, Nevada (2000)
  7. Joachims, T.: Text Categorization with SVM: Learning with Many Relevant Features. In: Nédellec, C., Rouveirol, C. (eds.) ECML 1998. LNCS, vol. 1398, Springer, Heidelberg (1998)
  8. Kubat, M., Holte, R., Matwin, S.: Learning when Negative Examples Abound. In: van Someren, M., Widmer, G. (eds.) ECML 1997. LNCS, vol. 1224, Springer, Heidelberg (1997)
  9. Kubat, M., Matwin, S.: Addressing the Curse of Imbalanced Training Sets: One-Sided Selection. In: Proceedings of the 14th International Conference on Machine Learning (1997)
  10. Ling, C., Li, C.: Data Mining for Direct Marketing Problems and Solutions. In: Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, New York (1998)
  11. Provost, F., Fawcett, T.: Robust Classification for Imprecise Environments. Machine Learning 42/3, 203–231 (2001)
  12. Tong, S., Chang, E.: Support Vector Machine Active Learning for Image Retrieval. In: Proceedings of ACM International Conference on Multimedia, pp. 107–118 (2001)
  13. Vapnik, V.: The Nature of Statistical Learning Theory. Springer, NY (1995)
  14. Veropoulos, K., Campbell, C., Cristianini, N.: Controlling the sensitivity of support vector machines. In: Proceedings of the International Joint Conference on AI, pp. 55–60 (1999)
  15. Wu, G., Chang, E.: Class-Boundary Alignment for Imbalanced Dataset Learning. In: ICML 2003 Workshop on Learning from Imbalanced Data Sets II, Washington, DC (2003)

Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Rehan Akbani (1)
  • Stephen Kwek (1)
  • Nathalie Japkowicz (2)
  1. Department of Computer Science, University of Texas at San Antonio, San Antonio, USA
  2. School of Information Technology & Engineering, University of Ottawa, Ottawa, Canada
