Boosting Prediction Accuracy on Imbalanced Datasets with SVM Ensembles
Learning from imbalanced datasets is inherently difficult because little information is available about the minority class. In this paper, we study the performance of SVMs, which have been highly successful in many real-world applications, in the imbalanced-data context. Through empirical analysis, we show that SVMs suffer from biased decision boundaries and that their prediction performance drops dramatically when the data are highly skewed. We propose combining an integrated sampling technique with an ensemble of SVMs to improve prediction performance. The integrated sampling technique combines over-sampling and under-sampling. Through an empirical study, we show that our method outperforms individual SVMs as well as several other state-of-the-art classifiers.
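The abstract does not spell out the sampling procedure, so the following is only a minimal sketch of one plausible reading, not the authors' algorithm. It assumes that over-sampling means randomly duplicating minority rows (a simpler stand-in for synthetic over-sampling), that under-sampling means drawing a fresh random majority subset for each ensemble member, and that the members are combined by majority vote. The names `integrated_samples`, `train_ensemble`, `predict_vote`, `over_factor`, and `make_model` are hypothetical. The base learner is left pluggable: any object with `fit(X, y)` and `predict(X)` methods can be used.

```python
import random
from collections import Counter


def integrated_samples(majority, minority, n_models, over_factor=2, seed=0):
    """Build one training set per ensemble member.

    Assumption: over-sampling duplicates minority rows at random, and
    under-sampling draws a random majority subset no larger than the
    over-sampled minority class.
    """
    rng = random.Random(seed)
    # Over-sample: keep every minority row, then add random duplicates.
    n_extra = len(minority) * (over_factor - 1)
    boosted_minority = list(minority) + [rng.choice(minority) for _ in range(n_extra)]
    k = min(len(boosted_minority), len(majority))
    datasets = []
    for _ in range(n_models):
        # Under-sample: a different random majority subset for each member.
        sub_majority = rng.sample(majority, k)
        rows = [(x, 0) for x in sub_majority] + [(x, 1) for x in boosted_minority]
        rng.shuffle(rows)
        datasets.append(rows)
    return datasets


def train_ensemble(datasets, make_model):
    """Fit one base learner per sampled dataset."""
    models = []
    for rows in datasets:
        X = [x for x, _ in rows]
        y = [label for _, label in rows]
        model = make_model()
        model.fit(X, y)
        models.append(model)
    return models


def predict_vote(models, X):
    """Majority vote over the members; use an odd n_models to avoid ties."""
    votes = [list(m.predict(X)) for m in models]
    return [Counter(column).most_common(1)[0][0] for column in zip(*votes)]
```

With scikit-learn installed, passing `make_model=lambda: SVC(kernel="rbf")` (after `from sklearn.svm import SVC`) would turn this into an SVM ensemble. The kernel and the value of `over_factor` are illustrative choices here, not settings taken from the paper.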
Keywords: Decision Boundary, Minority Class, Karush Kuhn Tucker, Class Imbalance, Positive Instance