SMOTEBoost: Improving Prediction of the Minority Class in Boosting
Abstract
Many real-world data mining applications involve learning from imbalanced data sets. Learning from data sets that contain very few instances of the minority (or interesting) class usually produces biased classifiers that have higher predictive accuracy on the majority class(es) but poorer predictive accuracy on the minority class. SMOTE (Synthetic Minority Over-sampling TEchnique) is specifically designed for learning from imbalanced data sets. This paper presents a novel approach for learning from imbalanced data sets, based on a combination of the SMOTE algorithm and the boosting procedure. Unlike standard boosting, where all misclassified examples are given equal weights, SMOTEBoost creates synthetic examples from the rare or minority class, thus indirectly changing the updating weights and compensating for skewed distributions. Applied to several highly and moderately imbalanced data sets, SMOTEBoost improves prediction performance on the minority class as well as the overall F-values.
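The sketch below illustrates the idea described in the abstract: at each boosting round, synthetic minority-class examples are generated (via SMOTE) and added to the training sample before the weak learner is fit, while example weights are maintained over the original data only. This is a minimal, simplified illustration, not the exact AdaBoost.M2-based procedure of the paper; it assumes scikit-learn and imbalanced-learn are available, and the weak learner, the per-round amount of oversampling, and the weight-update rule are illustrative choices.

```python
# Minimal SMOTEBoost-style sketch (binary classification).
# Assumptions: `minority_label` marks the rare class; scikit-learn and
# imbalanced-learn are installed. The weight update follows a simple
# "boost the misclassified" rule rather than the paper's AdaBoost.M2 update.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import SMOTE


def smoteboost_sketch(X, y, minority_label=1, n_rounds=10, n_synthetic=100):
    n = len(y)
    weights = np.full(n, 1.0 / n)      # weights over the ORIGINAL examples
    learners, alphas = [], []

    for _ in range(n_rounds):
        # 1) Oversample the minority class with SMOTE; the synthetic
        #    examples are used only to train this round's weak learner.
        target = int((y == minority_label).sum()) + n_synthetic
        X_res, y_res = SMOTE(
            sampling_strategy={minority_label: target}
        ).fit_resample(X, y)

        # 2) Fit a weak learner (a decision stump here) on the augmented sample.
        stump = DecisionTreeClassifier(max_depth=1).fit(X_res, y_res)

        # 3) Weighted error and learner weight computed on the original data,
        #    as in standard boosting.
        pred = stump.predict(X)
        miss = (pred != y)
        err = np.clip(np.dot(weights, miss), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)

        # 4) Increase the weights of misclassified original examples and renormalize.
        weights *= np.exp(alpha * miss)
        weights /= weights.sum()

        learners.append(stump)
        alphas.append(alpha)

    return learners, alphas
```

Prediction would then be a weighted vote of the stored learners using the `alphas`. The key difference from plain boosting is step 1: instead of relying solely on reweighting to focus on the rare class, each round's training sample is enriched with synthetic minority examples.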
Keywords
Intrusion Detection, Minority Class, Class Imbalance, Weak Learner, Network Intrusion Detection