
Margin-Based Over-Sampling Method for Learning from Imbalanced Datasets

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2011)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6635)

Included in the following conference series: Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD)

Abstract

Learning from imbalanced datasets has drawn increasing attention from both theoretical and practical perspectives. Over-sampling is a popular and simple method for imbalanced learning. In this paper, we show that there is an inherent potential risk associated with over-sampling algorithms in terms of the large margin principle. We then propose a new synthetic over-sampling method, named Margin-guided Synthetic Over-sampling (MSYN), to reduce this risk. MSYN improves learning by adjusting the data distribution according to a margin-based rule. An empirical study verifies the efficacy of MSYN.
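The full MSYN algorithm is defined in the paper itself; purely as an illustration of the idea outlined in the abstract, the sketch below generates SMOTE-style synthetic minority candidates and keeps only those with the largest Relief-style hypothesis margin (distance to the nearest point of the other class minus distance to the nearest point of the same class), so that new points are placed where they least disturb large-margin behaviour. This is an assumption-laden sketch, not the authors' implementation: the function names (smote_candidates, hypothesis_margin, margin_guided_oversample) and the specific margin score are hypothetical stand-ins for the paper's margin-based rule.

```python
# Sketch of margin-guided synthetic over-sampling (NOT the authors' MSYN code).
# The margin score is a simple Relief-style hypothesis margin used as a
# stand-in for the paper's margin-based selection rule.
import numpy as np

def smote_candidates(X_min, n_candidates, k=5, rng=None):
    """Generate SMOTE-style candidates by interpolating between each minority
    point and one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(X_min)
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)          # exclude self-matches
    order = np.argsort(dist, axis=1)        # neighbour indices, nearest first
    k_eff = min(k, n - 1)
    out = []
    for _ in range(n_candidates):
        i = int(rng.integers(n))
        j = int(order[i, rng.integers(k_eff)])
        lam = rng.random()                  # interpolation factor in [0, 1)
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.asarray(out)

def hypothesis_margin(x, X, y, label):
    """Relief-style hypothesis margin of a point x assumed to carry `label`:
    distance to its nearest miss minus distance to its nearest hit."""
    d = np.linalg.norm(X - x, axis=1)
    return d[y != label].min() - d[y == label].min()

def margin_guided_oversample(X, y, minority_label, n_new, pool_factor=3, seed=0):
    """Draw pool_factor * n_new SMOTE candidates and keep the n_new with the
    largest hypothesis margin, i.e. those lying deepest in minority territory."""
    rng = np.random.default_rng(seed)
    X_min = X[y == minority_label]
    cand = smote_candidates(X_min, pool_factor * n_new, rng=rng)
    scores = np.array([hypothesis_margin(c, X, y, minority_label) for c in cand])
    keep = cand[np.argsort(scores)[-n_new:]]
    X_aug = np.vstack([X, keep])
    y_aug = np.concatenate([y, np.full(n_new, minority_label, dtype=y.dtype)])
    return X_aug, y_aug
```

For example, margin_guided_oversample(X, y, minority_label=1, n_new=100) would return a training set augmented with 100 margin-screened synthetic minority points; the pool_factor and the choice of margin score are the tunable assumptions here.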





Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Fan, X., Tang, K., Weise, T. (2011). Margin-Based Over-Sampling Method for Learning from Imbalanced Datasets. In: Huang, J.Z., Cao, L., Srivastava, J. (eds.) Advances in Knowledge Discovery and Data Mining. PAKDD 2011. Lecture Notes in Computer Science (LNAI), vol. 6635. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20847-8_26


  • DOI: https://doi.org/10.1007/978-3-642-20847-8_26

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-20846-1

  • Online ISBN: 978-3-642-20847-8

  • eBook Packages: Computer Science (R0)
