
Simulated annealing based undersampling (SAUS): a hybrid multi-objective optimization method to tackle class imbalance


Abstract

Learning from imbalanced datasets is a challenging problem in machine learning research, since traditional classifiers are biased towards the majority class and therefore achieve a low prediction rate on the minority class. The inherent assumption of an equal class distribution and the use of accuracy-driven evaluation are the main reasons for this degraded performance; moreover, false negatives typically carry a higher penalty than false positives. A straightforward way to mitigate the issue is to construct a balanced training set from the imbalanced one. However, many balanced training sets can be formed from a given imbalanced set, and an optimal one must be found among them; this search is computationally intractable and prone to local optima. To address these issues, a Simulated Annealing-based Under Sampling (SAUS) method is proposed. Simulated annealing, a popular meta-heuristic search algorithm, is used here with a novel cost function based on the Balanced Error Rate. This cost function balances Sensitivity and Specificity when evaluating the solution at each iteration of the subsampling process, and the annealing schedule helps the search escape local optima. Experimental results show that SAUS improves the average Sensitivity on the test set from 0.68 to 0.86, demonstrating its efficacy in tackling class imbalance. Area Under the ROC Curve (AUC) results also show that SAUS outperforms several popular undersampling methods and performs on par with state-of-the-art solutions for the class imbalance problem.
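
To make the search loop described in the abstract concrete, the sketch below shows one way such a procedure could look. It is a minimal illustration, not the published SAUS implementation: the base classifier (a decision tree), the swap-based neighbour move, the geometric cooling schedule, and helper names such as ber_cost and saus_undersample are all assumptions introduced for this example. The cost being minimised is the Balanced Error Rate, 1 - (Sensitivity + Specificity)/2, evaluated on a held-out validation split.

# Minimal sketch of simulated-annealing-based undersampling (illustrative only,
# not the authors' reference implementation). It searches for a balanced subset
# of majority-class samples whose Balanced Error Rate (BER) on a validation
# split is lowest. The classifier choice, neighbour move, and helper names
# below are assumptions.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

def ber_cost(clf, X_val, y_val):
    """Balanced Error Rate = 1 - (Sensitivity + Specificity) / 2."""
    tn, fp, fn, tp = confusion_matrix(y_val, clf.predict(X_val), labels=[0, 1]).ravel()
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return 1.0 - (sensitivity + specificity) / 2.0

def saus_undersample(X, y, X_val, y_val, T0=1.0, alpha=0.95, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    maj_idx = np.flatnonzero(y == 0)   # majority class assumed to be label 0
    min_idx = np.flatnonzero(y == 1)   # minority class assumed to be label 1
    # Start from a random balanced subset of the majority class.
    current = rng.choice(maj_idx, size=len(min_idx), replace=False)

    def evaluate(subset):
        sel = np.concatenate([subset, min_idx])
        clf = DecisionTreeClassifier(random_state=seed).fit(X[sel], y[sel])
        return ber_cost(clf, X_val, y_val)

    cost, T = evaluate(current), T0
    best, best_cost = current.copy(), cost
    for _ in range(iters):
        # Neighbour move: swap one selected majority sample for an unselected one.
        candidate = current.copy()
        out = rng.integers(len(candidate))
        pool = np.setdiff1d(maj_idx, candidate)
        candidate[out] = rng.choice(pool)
        cand_cost = evaluate(candidate)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if cand_cost < cost or rng.random() < np.exp((cost - cand_cost) / T):
            current, cost = candidate, cand_cost
            if cost < best_cost:
                best, best_cost = current.copy(), cost
        T *= alpha   # geometric cooling schedule
    # Indices of the resulting balanced training set.
    return np.concatenate([best, min_idx])

Because the acceptance rule occasionally keeps worse subsets while the temperature is high, the search can escape locally optimal subsets, which is the property the abstract refers to as avoiding the local trap.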




Author information


Corresponding author

Correspondence to Venkata Krishnaveni Chennuru.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Chennuru, V.K., Timmappareddy, S.R. Simulated annealing based undersampling (SAUS): a hybrid multi-objective optimization method to tackle class imbalance. Appl Intell 52, 2092–2110 (2022). https://doi.org/10.1007/s10489-021-02369-4

