Margin-Based Over-Sampling Method for Learning from Imbalanced Datasets

  • Xiannian Fan
  • Ke Tang
  • Thomas Weise
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6635)

Abstract

Learning from imbalanced datasets has drawn increasing attention from both theoretical and practical perspectives. Over-sampling is a popular and simple method for imbalanced learning. In this paper, we show that over-sampling algorithms carry an inherent risk in terms of the large margin principle. We then propose a new synthetic over-sampling method, named Margin-guided Synthetic Over-sampling (MSYN), to reduce this risk. MSYN improves learning with respect to the data distributions, guided by a margin-based rule. An empirical study verifies the efficacy of MSYN.
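The abstract does not spell out the MSYN algorithm, so the following is not the authors' method. It is only a minimal sketch of the two ideas the abstract names: synthetic over-sampling, shown here as SMOTE-style interpolation between minority neighbours, and a margin, shown here as the 1-nearest-neighbour hypothesis margin (half the difference between the distance to the nearest point of another class and the distance to the nearest point of the same class). The function names, the toy data, and the choice of this particular margin definition are all assumptions made for illustration.

```python
import math
import random


def hypothesis_margin(x, label, data):
    """1-NN hypothesis margin of x: 0.5 * (dist to nearest miss - dist to nearest hit).

    `data` is a list of (point, label) pairs; x itself is skipped when it
    appears in `data`. This margin definition is an illustrative assumption.
    """
    hits = [math.dist(x, p) for p, y in data if y == label and p != x]
    misses = [math.dist(x, p) for p, y in data if y != label]
    return 0.5 * (min(misses) - min(hits))


def interpolate_minority(minority, n_new, rng):
    """SMOTE-style sketch: place each new point on the segment between a
    randomly chosen minority point and its nearest minority neighbour."""
    synthetic = []
    for _ in range(n_new):
        a = rng.choice(minority)
        b = min((p for p in minority if p != a), key=lambda p: math.dist(a, p))
        t = rng.random()  # position along the segment from a to b
        synthetic.append(tuple(ai + t * (bi - ai) for ai, bi in zip(a, b)))
    return synthetic


# Hypothetical toy data: four majority points (label 0), three minority points (label 1).
majority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
minority = [(3.0, 3.0), (4.0, 3.0), (3.0, 4.0)]
data = [(p, 0) for p in majority] + [(p, 1) for p in minority]

rng = random.Random(0)
synthetic = interpolate_minority(minority, 5, rng)
margin = hypothesis_margin((3.0, 3.0), 1, data)
synthetic_margins = [hypothesis_margin(s, 1, data) for s in synthetic]
```

In a sketch like this, a synthetic point with a small or negative margin sits close to the other class, which is the kind of situation a margin-guided rule could be used to detect or avoid.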

Keywords

imbalance learning, over-sampling, over-fitting, large margin theory, generalization

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Xiannian Fan (1)
  • Ke Tang (1)
  • Thomas Weise (1)
  1. Nature Inspired Computational and Applications Laboratory, School of Computer Science and Technology, University of Science and Technology of China, China
