SMOTEBoost: Improving Prediction of the Minority Class in Boosting

  • Nitesh V. Chawla
  • Aleksandar Lazarevic
  • Lawrence O. Hall
  • Kevin W. Bowyer
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2838)

Abstract

Many real-world data mining applications involve learning from imbalanced data sets. Learning from data sets that contain very few instances of the minority (or interesting) class usually produces biased classifiers with higher predictive accuracy on the majority class(es) but poorer predictive accuracy on the minority class. SMOTE (Synthetic Minority Over-sampling Technique) is specifically designed for learning from imbalanced data sets. This paper presents a novel approach for learning from imbalanced data sets, based on a combination of the SMOTE algorithm and the boosting procedure. Unlike standard boosting, where all misclassified examples are given equal weights, SMOTEBoost creates synthetic examples from the rare or minority class, thus indirectly changing the updating weights and compensating for skewed distributions. Applied to several highly and moderately imbalanced data sets, SMOTEBoost improves prediction performance on the minority class and yields higher overall F-values.
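The interpolation step at the heart of SMOTE, as described in the abstract, can be sketched as follows. This is a minimal illustration under our own assumptions, not the authors' implementation: the function name, the brute-force nearest-neighbour search, and the parameter choices are ours.

```python
import numpy as np

def smote(X_min, n_synthetic, k=5, rng=None):
    """Generate synthetic minority-class samples by interpolating
    between each minority example and one of its k nearest
    minority-class neighbours (the core idea of SMOTE)."""
    rng = np.random.default_rng(rng)
    n, d = X_min.shape
    k = min(k, n - 1)
    # Brute-force pairwise distances among minority examples only.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)          # exclude self-matches
    neighbours = np.argsort(dists, axis=1)[:, :k]  # k nearest per example
    synthetic = np.empty((n_synthetic, d))
    for i in range(n_synthetic):
        j = rng.integers(n)                       # pick a minority example
        nb = X_min[rng.choice(neighbours[j])]     # one of its k neighbours
        gap = rng.random()                        # interpolation factor in [0, 1)
        # New sample lies on the segment between the example and its neighbour.
        synthetic[i] = X_min[j] + gap * (nb - X_min[j])
    return synthetic
```

In a SMOTEBoost-style setting, such synthetic minority examples would be generated inside each boosting round before the weak learner is trained, so the base classifier sees a less skewed distribution while the boosting weight updates proceed as usual.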

Keywords

Intrusion Detection, Minority Class, Class Imbalance, Weak Learner, Network Intrusion Detection

Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • Nitesh V. Chawla (1)
  • Aleksandar Lazarevic (2)
  • Lawrence O. Hall (3)
  • Kevin W. Bowyer (4)
  1. Business Analytic Solutions, Canadian Imperial Bank of Commerce (CIBC), BCE Place, Toronto, Canada
  2. Department of Computer Science, University of Minnesota, Minneapolis, USA
  3. Department of Computer Science and Engineering, University of South Florida, Tampa, USA
  4. Department of Computer Science and Engineering, University of Notre Dame, USA