A Membership Probability–Based Undersampling Algorithm for Imbalanced Data

  • Gilseung Ahn
  • You-Jin Park
  • Sun Hur

Abstract

Classifiers trained on a highly imbalanced dataset tend to be biased toward the majority class and, as a result, minority class samples are often misclassified as majority class. One remedy is an undersampling technique that removes some of the majority samples. We propose a simple and efficient undersampling method for imbalanced datasets and show, through several illustrative experiments, that it outperforms other methods on four different performance measures, especially for highly imbalanced datasets.
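For readers unfamiliar with undersampling, the general idea of removing majority samples can be sketched as below. This is plain random undersampling, a common baseline, and not the membership probability-based method the authors propose, whose details are not given in this abstract. The function name `random_undersample`, the toy data, and the choice to shrink every class to the size of the rarest one are illustrative assumptions.

```python
import random
from collections import Counter


def random_undersample(features, labels, seed=0):
    """Baseline random undersampling (NOT the paper's method).

    Randomly keeps only as many samples of each class as the
    rarest class has, so the returned dataset is balanced.
    """
    rng = random.Random(seed)
    target = min(Counter(labels).values())  # size of the rarest class

    # Group sample indices by class label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    # Draw `target` indices per class without replacement.
    kept = []
    for indices in by_class.values():
        kept.extend(rng.sample(indices, target))
    kept.sort()  # preserve the original sample order

    return [features[i] for i in kept], [labels[i] for i in kept]


# Hypothetical toy data: 5 minority samples (label 1), 95 majority (label 0).
features = [[float(i)] for i in range(100)]
labels = [1] * 5 + [0] * 95

X_bal, y_bal = random_undersample(features, labels)
print(Counter(y_bal))
```

Random removal discards majority samples without regard to how informative they are, which is the kind of information loss that the keywords suggest the proposed method is concerned with.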

Keywords

Imbalanced class problem; Undersampling; Membership probability; Information loss

Notes

Funding Information

This work has been supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (2019R1A2C1088255), and by the Research Project Program for Newly-Recruited Personnel funded by the Ministry of Science and Technology of Taiwan, R.O.C. (MOST 108-2218-E-027-008-MY2).

Copyright information

© The Classification Society 2020

Authors and Affiliations

  1. Department of Industrial and Management Engineering, Hanyang University, Ansan, South Korea
  2. Department of Industrial Engineering and Management, National Taipei University of Technology, Taipei, R.O.C.
