Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning

  • Hui Han
  • Wen-Yuan Wang
  • Bing-Huan Mao
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3644)

Abstract

In recent years, mining imbalanced data sets has received increasing attention in both theory and practice. This paper introduces the importance of imbalanced data sets and their broad application domains in data mining, and then summarizes the evaluation metrics and the existing methods for evaluating and solving the imbalance problem. The synthetic minority over-sampling technique (SMOTE) is one of the over-sampling methods that address this problem. Building on SMOTE, this paper presents two new minority over-sampling methods, borderline-SMOTE1 and borderline-SMOTE2, in which only the minority examples near the borderline are over-sampled. For the minority class, experiments show that our approaches achieve a better TP rate and F-value than SMOTE and random over-sampling.
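
The abstract states only that minority examples near the borderline are over-sampled; it does not spell out how the borderline is found. The following Python sketch shows one plausible reading in the spirit of borderline-SMOTE1. It assumes a nearest-neighbour rule: a minority example counts as borderline when at least half, but not all, of its m nearest neighbours belong to the majority class. The function names, the parameters m, k and per_sample, and the toy data are all hypothetical and chosen for illustration.

```python
import math
import random


def nearest(point, candidates, k):
    """Return positions of the k candidates closest to point (Euclidean)."""
    order = sorted(range(len(candidates)),
                   key=lambda i: math.dist(point, candidates[i]))
    return order[:k]


def danger_indices(X, y, minority_label, m=5):
    """Indices of minority samples assumed to lie near the borderline:
    at least half, but not all, of their m nearest neighbours are majority."""
    danger = []
    for i, (x, label) in enumerate(zip(X, y)):
        if label != minority_label:
            continue
        others = [j for j in range(len(X)) if j != i]
        neigh = nearest(x, [X[j] for j in others], m)
        n_majority = sum(1 for n in neigh if y[others[n]] != minority_label)
        if m / 2 <= n_majority < m:  # all-majority neighbours are treated as noise
            danger.append(i)
    return danger


def borderline_oversample(X, y, minority_label, m=5, k=3, per_sample=2, seed=0):
    """Create synthetic minority points by SMOTE-style interpolation,
    starting only from the borderline samples."""
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label == minority_label]
    synthetic = []
    for i in danger_indices(X, y, minority_label, m):
        peers = [j for j in minority if j != i]
        if not peers:
            continue
        neigh = nearest(X[i], [X[j] for j in peers], min(k, len(peers)))
        for _ in range(per_sample):
            q = X[peers[rng.choice(neigh)]]   # a random minority neighbour
            gap = rng.random()                # interpolation factor in [0, 1)
            synthetic.append([a + gap * (b - a) for a, b in zip(X[i], q)])
    return synthetic


# Toy data: six majority points on the left, six minority points on the right.
X = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 0], [2, 1],
     [3, 0], [3, 1], [6, 0], [6, 1], [7, 0], [7, 1]]
y = [0] * 6 + [1] * 6

print(danger_indices(X, y, minority_label=1, m=3))  # -> [6, 7]
print(len(borderline_oversample(X, y, minority_label=1, m=3)))  # -> 4
```

In this toy set only the two minority points adjacent to the majority cluster are selected, so the synthetic examples concentrate near the class boundary, whereas plain SMOTE would interpolate from every minority example.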

Keywords

Minority Class; Data Mining Algorithm; Imbalance Problem; Minimax Probability Machine; Traditional Data Mining
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Hui Han (1)
  • Wen-Yuan Wang (1)
  • Bing-Huan Mao (2)

  1. Department of Automation, Tsinghua University, Beijing, P. R. China
  2. Department of Statistics, Central University of Finance and Economics, Beijing, P. R. China
