Abstract
Learning from imbalanced datasets is a challenging problem in machine learning research because traditional classifiers are biased towards the Majority class, which results in a low Minority class prediction rate. The inherent assumptions of equal class distribution and accuracy-driven evaluation are the identified reasons behind this degraded performance. Further, false negatives carry a higher penalty than false positives. A simple, logical way to mitigate this issue is to construct a balanced training set from the imbalanced one. However, many balanced training sets can be formed from a given imbalanced set, and an optimal one has to be selected from among them. This search is computationally intractable and prone to getting trapped in local optima. To address these issues, a Simulated Annealing-based Under Sampling (SAUS) method is proposed. Simulated annealing is a popular meta-heuristic search algorithm; SAUS uses it with a novel cost function defined in terms of Balanced Error Rate. This cost function strikes a balance between the Sensitivity and Specificity measures when evaluating the solution at each iteration of the subsampling process, and the method is free from the local-optimum trap. Experimental results show that SAUS improves the average Sensitivity on the test set from 0.68 to 0.86, demonstrating its efficacy in tackling the imbalance issue in the dataset. Area Under the ROC Curve (AUC) results also show that SAUS outperforms several popular undersampling methods. SAUS performs on par with state-of-the-art solutions for the class imbalance problem.
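The idea described above can be illustrated with a minimal sketch; it is not the authors' implementation. It assumes the standard definition of Balanced Error Rate, BER = 1 - (Sensitivity + Specificity) / 2, since the abstract does not give the exact cost function. The nearest-centroid classifier on one-dimensional features, the single-swap neighbourhood, the geometric cooling schedule, the held-out validation set, and all function and parameter names are hypothetical choices made only to keep the example self-contained.

```python
import math
import random


def balanced_error_rate(y_true, y_pred, positive=1):
    """Standard BER: 1 - (sensitivity + specificity) / 2."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return 1.0 - (sensitivity + specificity) / 2.0


def centroid_predict(train_x, train_y, xs):
    """Hypothetical stand-in classifier: nearest class mean on 1-D features."""
    means = {}
    for label in set(train_y):
        vals = [x for x, y in zip(train_x, train_y) if y == label]
        means[label] = sum(vals) / len(vals)
    return [min(means, key=lambda c: abs(x - means[c])) for x in xs]


def sa_undersample(maj_x, min_x, val_x, val_y,
                   steps=200, t0=1.0, alpha=0.98, seed=0):
    """Search for a majority subset of minority size that minimises BER.

    Majority samples are labelled 0 and minority samples 1. Assumes
    len(maj_x) >= len(min_x) >= 1.
    """
    rng = random.Random(seed)
    k = len(min_x)  # a balanced set keeps as many majority as minority samples

    def cost(subset):
        train_x = [maj_x[i] for i in subset] + list(min_x)
        train_y = [0] * len(subset) + [1] * len(min_x)
        return balanced_error_rate(val_y, centroid_predict(train_x, train_y, val_x))

    current = rng.sample(range(len(maj_x)), k)
    current_cost = cost(current)
    best, best_cost = list(current), current_cost
    temp = t0
    for _ in range(steps):
        outside = [i for i in range(len(maj_x)) if i not in current]
        if not outside:
            break  # every majority sample is already selected
        # Neighbour solution: swap one selected sample for an unselected one.
        candidate = list(current)
        candidate[rng.randrange(k)] = rng.choice(outside)
        cand_cost = cost(candidate)
        delta = cand_cost - current_cost
        # Accept improvements always; accept worse moves with a probability
        # that shrinks as the temperature drops (escape from local optima).
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, current_cost = candidate, cand_cost
            if current_cost < best_cost:
                best, best_cost = list(current), current_cost
        temp *= alpha  # geometric cooling
    return best, best_cost
```

Calling `sa_undersample(maj_x, min_x, val_x, val_y)` returns the indices of the selected majority samples together with the BER they achieve on the validation data. Because BER weights Sensitivity and Specificity equally, a subset that helps only the Majority class does not lower the cost.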
About this article
Cite this article
Chennuru, V.K., Timmappareddy, S.R. Simulated annealing based undersampling (SAUS): a hybrid multi-objective optimization method to tackle class imbalance. Appl Intell 52, 2092–2110 (2022). https://doi.org/10.1007/s10489-021-02369-4
DOI: https://doi.org/10.1007/s10489-021-02369-4