Evaluating the Performance of Data Level Methods Using KEEL Tool to Address Class Imbalance Problem

Upadhyay, Kamlesh; Kaur, Prabhjot; Verma, Deepak Kumar

doi:10.1007/s13369-021-06377-x

Research Article-Computer Engineering and Computer Science
Published: 29 November 2021

Volume 47, pages 9741–9754, (2022)
Arabian Journal for Science and Engineering

Abstract

The class imbalance problem (CIP) has become a prominent topic in machine learning in recent years. As the application areas of technology increase, the size and variety of data also increase. Most real-world raw data is imbalanced by nature; examples include credit card fraud, fraudulent telephone calls, shuttle system failures, text classification, nuclear explosions, oil spill detection, and the detection of brain tumors in images. Classification algorithms are not able to classify imbalanced data accurately, and their results are biased toward the larger class. This is known as the class imbalance problem. This paper assesses various data-level methods that are used to balance the data before classification. It also discusses the characteristics of data that affect the class imbalance problem and the reasons why traditional classification algorithms are not able to tackle it. In addition, it discusses other data abnormalities that make the CIP more critical, such as the size of the data, overlapping classes, the presence of noise, and the data distribution within each class. The paper empirically compares 20 data-level methods on 44 real imbalanced datasets from the UCI repository, with imbalance ratios ranging from 1.82 to 129.44, using the KEEL tool. The performance of the methods is assessed using the AUC, F-measure, and G-mean metrics, and the results are analyzed and presented graphically.
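The quantities named in the abstract can be illustrated with a short sketch. It is not taken from the paper or from KEEL; the function names `imbalance_ratio` and `binary_metrics` are hypothetical, the imbalance ratio is assumed to be the majority-class count divided by the minority-class count, and the AUC is assumed to be the single-operating-point form (1 + TPR - FPR) / 2 that applies when a classifier outputs hard labels rather than scores.

```python
from collections import Counter
from math import sqrt


def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count (assumed definition)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, F-measure, G-mean and single-point AUC for hard binary predictions.

    `positive` marks the minority class. All names here are illustrative.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    # Guard every ratio against an empty denominator.
    tpr = tp / (tp + fn) if (tp + fn) else 0.0        # recall on the minority class
    tnr = tn / (tn + fp) if (tn + fp) else 0.0        # recall on the majority class
    precision = tp / (tp + fp) if (tp + fp) else 0.0

    f_measure = (2 * precision * tpr / (precision + tpr)) if (precision + tpr) else 0.0
    g_mean = sqrt(tpr * tnr)
    auc = (1 + tpr - (1 - tnr)) / 2                   # assumes one ROC operating point
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0

    return {
        "accuracy": accuracy,
        "f_measure": f_measure,
        "g_mean": g_mean,
        "auc": auc,
    }


# Toy data: 10 minority (1) and 90 majority (0) examples.
y_true = [1] * 10 + [0] * 90

# A classifier that always predicts the majority class.
always_majority = [0] * 100

# A classifier that finds 8 of 10 minority examples at the cost of 9 false alarms.
balanced_guess = [1] * 8 + [0] * 2 + [1] * 9 + [0] * 81

ir = imbalance_ratio(y_true)
trivial = binary_metrics(y_true, always_majority)
better = binary_metrics(y_true, balanced_guess)
```

On this toy data the always-majority classifier reaches 0.9 accuracy while its F-measure and G-mean are both 0 and its AUC is 0.5, which is the bias toward the larger class that the abstract describes. The second classifier has the same accuracy (0.89 versus 0.9 is nearly identical) but much higher G-mean and AUC, which is why such metrics are preferred over accuracy for imbalanced data.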

Acknowledgements

The authors would like to thank Prof. SVAV Prasad for his valuable contribution to the improvement of this paper.

Author information

Authors and Affiliations

Lingayas Vidyapeeth, Faridabad, India
Kamlesh Upadhyay
Department of Information Technology, Maharaja Surajmal Institute of Technology, C-4, Janakpuri, New Delhi, 110058, India
Prabhjot Kaur
Lingayas Vidyapeeth, Faridabad, India
Deepak Kumar Verma

Corresponding author

Correspondence to Prabhjot Kaur.

Ethics declarations

Conflict of interest

The authors have no conflict of interest relevant to this article.

About this article

Cite this article

Upadhyay, K., Kaur, P. & Verma, D.K. Evaluating the Performance of Data Level Methods Using KEEL Tool to Address Class Imbalance Problem. Arab J Sci Eng 47, 9741–9754 (2022). https://doi.org/10.1007/s13369-021-06377-x

Received: 06 July 2021
Accepted: 08 November 2021
Published: 29 November 2021
Issue Date: August 2022
DOI: https://doi.org/10.1007/s13369-021-06377-x
