Issues and challenges of class imbalance problem in classification

  • Prabhjot Kaur
  • Anjana Gosain
Original Research

Abstract

The class imbalance problem arises in classification when we seek out exceptional cases using traditional classification algorithms. Traditional classification algorithms are designed for classes that are either large or of similar size. When used to identify a smaller class in the data, these algorithms either fail to detect it or give erroneous results. Researchers have addressed this problem using various concepts and logics, or by modifying existing classification algorithms. This paper discusses existing research trends used to solve the class imbalance problem. It also highlights the issues and gaps related to this problem.

Keywords

Classification, Class imbalance problem, Data level techniques, Ensemble methods, Algorithm level methods

Copyright information

© Bharati Vidyapeeth's Institute of Computer Applications and Management 2018

Authors and Affiliations

  1. Department of IT, MSIT, New Delhi, India
  2. USICT, GGSIP University, New Delhi, India
