Boosting Prediction Accuracy on Imbalanced Datasets with SVM Ensembles

  • Yang Liu
  • Aijun An
  • Xiangji Huang
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3918)

Abstract

Learning from imbalanced datasets is inherently difficult due to lack of information about the minority class. In this paper, we study the performance of SVMs, which have gained great success in many real applications, in the imbalanced data context. Through empirical analysis, we show that SVMs suffer from biased decision boundaries, and that their prediction performance drops dramatically when the data is highly skewed. We propose to combine an integrated sampling technique with an ensemble of SVMs to improve the prediction performance. The integrated sampling technique combines both over-sampling and under-sampling techniques. Through empirical study, we show that our method outperforms individual SVMs as well as several other state-of-the-art classifiers.
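
The abstract does not spell out the sampling or combination details, so the following is only a minimal sketch of the general idea under stated assumptions: the minority class is over-sampled by random duplication, the majority class is under-sampled by splitting it into disjoint subsets, one SVM is trained per subset, and the members are combined by majority vote. The function names, the `oversample_factor` and `n_subsets` parameters, the 0/1 label encoding, and the use of scikit-learn's `SVC` are all illustrative choices, not the authors' implementation.

```python
import random
from collections import Counter


def integrated_sample(majority, minority, n_subsets, oversample_factor, seed=0):
    """Build n_subsets training sets from one imbalanced dataset.

    Over-sampling (assumed form): grow the minority class to
    oversample_factor times its size by random duplication.
    Under-sampling (assumed form): shuffle the majority class and split it
    into n_subsets disjoint parts, one per training set.
    """
    rng = random.Random(seed)
    n_extra = len(minority) * (oversample_factor - 1)
    boosted_minority = list(minority) + [rng.choice(minority) for _ in range(n_extra)]
    shuffled = list(majority)
    rng.shuffle(shuffled)
    parts = [shuffled[i::n_subsets] for i in range(n_subsets)]
    return [(part, boosted_minority) for part in parts]


def train_svm_ensemble(training_sets, **svc_params):
    """Fit one SVM per training set; majority = label 0, minority = label 1."""
    from sklearn.svm import SVC  # lazy import: only this step needs scikit-learn

    models = []
    for negatives, positives in training_sets:
        X = list(negatives) + list(positives)
        y = [0] * len(negatives) + [1] * len(positives)
        models.append(SVC(**svc_params).fit(X, y))
    return models


def ensemble_predict(models, X):
    """Majority vote over the members; ties go to the minority class (1)."""
    votes = [model.predict(X) for model in models]
    predictions = []
    for sample_votes in zip(*votes):
        counts = Counter(int(v) for v in sample_votes)
        predictions.append(1 if counts[1] >= counts[0] else 0)
    return predictions
```

A typical call sequence would be `sets = integrated_sample(neg, pos, n_subsets=5, oversample_factor=3)`, then `models = train_svm_ensemble(sets, kernel="rbf")`, then `ensemble_predict(models, X_test)`. Because each member sees the full boosted minority class but only a slice of the majority class, every individual SVM trains on a far less skewed set than the original data.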

Keywords

Decision Boundary, Minority Class, Karush Kuhn Tucker, Class Imbalance, Positive Instance
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Yang Liu (1)
  • Aijun An (1)
  • Xiangji Huang (1)
  1. Department of Computer Science and Engineering, York University, Toronto, Canada