
Margin-Based Over-Sampling Method for Learning from Imbalanced Datasets

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2011)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6635)

Included in the following conference series: Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD)

Abstract

Learning from imbalanced datasets has drawn increasing attention from both theoretical and practical perspectives. Over-sampling is a popular and simple method for imbalanced learning. In this paper, we show that there is an inherent potential risk associated with over-sampling algorithms in terms of the large margin principle. We then propose a new synthetic over-sampling method, named Margin-guided Synthetic Over-sampling (MSYN), to reduce this risk. MSYN improves learning by adjusting the data distribution according to a margin-based rule. An empirical study verifies the efficacy of MSYN.
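The full MSYN algorithm is defined in the paper itself; purely as an illustration of the idea outlined in the abstract, the sketch below generates SMOTE-style synthetic minority candidates and keeps only those with the largest Relief-style hypothesis margin (distance to the nearest point of the other class minus distance to the nearest point of the same class), so that new points are placed where they least disturb large-margin behaviour. This is an assumption-laden sketch, not the authors' implementation: the function names (smote_candidates, hypothesis_margin, margin_guided_oversample) and the specific margin score are hypothetical stand-ins for the paper's margin-based rule.

```python
# Sketch of margin-guided synthetic over-sampling (NOT the authors' MSYN code).
# The margin score is a simple Relief-style hypothesis margin used as a
# stand-in for the paper's margin-based selection rule.
import numpy as np

def smote_candidates(X_min, n_candidates, k=5, rng=None):
    """Generate SMOTE-style candidates by interpolating between each minority
    point and one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(X_min)
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)          # exclude self-matches
    order = np.argsort(dist, axis=1)        # neighbour indices, nearest first
    k_eff = min(k, n - 1)
    out = []
    for _ in range(n_candidates):
        i = int(rng.integers(n))
        j = int(order[i, rng.integers(k_eff)])
        lam = rng.random()                  # interpolation factor in [0, 1)
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.asarray(out)

def hypothesis_margin(x, X, y, label):
    """Relief-style hypothesis margin of a point x assumed to carry `label`:
    distance to its nearest miss minus distance to its nearest hit."""
    d = np.linalg.norm(X - x, axis=1)
    return d[y != label].min() - d[y == label].min()

def margin_guided_oversample(X, y, minority_label, n_new, pool_factor=3, seed=0):
    """Draw pool_factor * n_new SMOTE candidates and keep the n_new with the
    largest hypothesis margin, i.e. those lying deepest in minority territory."""
    rng = np.random.default_rng(seed)
    X_min = X[y == minority_label]
    cand = smote_candidates(X_min, pool_factor * n_new, rng=rng)
    scores = np.array([hypothesis_margin(c, X, y, minority_label) for c in cand])
    keep = cand[np.argsort(scores)[-n_new:]]
    X_aug = np.vstack([X, keep])
    y_aug = np.concatenate([y, np.full(n_new, minority_label, dtype=y.dtype)])
    return X_aug, y_aug
```

For example, margin_guided_oversample(X, y, minority_label=1, n_new=100) would return a training set augmented with 100 margin-screened synthetic minority points; the pool_factor and the choice of margin score are the tunable assumptions here.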





Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Fan, X., Tang, K., Weise, T. (2011). Margin-Based Over-Sampling Method for Learning from Imbalanced Datasets. In: Huang, J.Z., Cao, L., Srivastava, J. (eds.) Advances in Knowledge Discovery and Data Mining. PAKDD 2011. Lecture Notes in Computer Science (LNAI), vol. 6635. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20847-8_26


  • DOI: https://doi.org/10.1007/978-3-642-20847-8_26

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-20846-1

  • Online ISBN: 978-3-642-20847-8

  • eBook Packages: Computer Science (R0)
