
Soft Computing, Volume 23, Issue 21, pp 10755–10767

TLUSBoost algorithm: a boosting solution for class imbalance problem

  • Sujit Kumar
  • Saroj Kr. Biswas (corresponding author)
  • Debashree Devi
Methodologies and Application

Abstract

It is commonly assumed that the training sets used for learning are balanced. This assumption does not always hold in real-world applications, where traditional data mining algorithms tend to build suboptimal classification models that are biased towards the overrepresented class. The class imbalance problem arises in many application domains such as data mining, machine learning and pattern recognition, and several techniques have been proposed to alleviate it. RUSBoost is an ensemble learning approach that combines random undersampling (RUS) for data resampling with the AdaBoost technique for boosting as a solution to class imbalance. However, RUS may discard significant information from the dataset. This paper therefore proposes the Tomek-link undersampling-based boosting (TLUSBoost) algorithm, which uses Tomek-linked and redundancy-based undersampling (TLRUS) for data resampling and the AdaBoost technique for boosting. TLRUS carefully identifies outliers using the Tomek-link concept and then eliminates some of the probably redundant instances among them. The algorithm thus reduces information loss and conserves the characteristics of the dataset, helping the classifier to be trained appropriately. TLUSBoost is validated on 16 benchmark datasets and compared with the EasyEnsemble, BalanceCascade, SMOTEBoost and RUSBoost algorithms. Tenfold cross-validation is used to measure the overall accuracy and F-measure of the models. Experimental results show that the proposed model outperforms EasyEnsemble, BalanceCascade, SMOTEBoost and RUSBoost on both overall accuracy and F-measure.
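
To make the resampling-plus-boosting idea concrete, the following is a minimal sketch in Python: Tomek-link-based majority undersampling followed by standard AdaBoost (scikit-learn). It is not the authors' TLRUS/TLUSBoost procedure; in particular, the redundancy-elimination step is omitted, the resampling is applied once before boosting rather than within the boosting loop, and the helper name tomek_link_undersample is an assumption made here for illustration.

    # Illustrative sketch only, not the TLUSBoost algorithm of the paper.
    import numpy as np
    from sklearn.neighbors import NearestNeighbors
    from sklearn.ensemble import AdaBoostClassifier

    def tomek_link_undersample(X, y, majority_label):
        """Remove majority-class members of Tomek-link pairs.

        A Tomek link is a pair of opposite-class samples that are each
        other's nearest neighbours."""
        nn = NearestNeighbors(n_neighbors=2).fit(X)
        # Column 0 is the point itself, column 1 its nearest neighbour.
        neighbours = nn.kneighbors(X, return_distance=False)[:, 1]
        to_drop = set()
        for i, j in enumerate(neighbours):
            mutual = neighbours[j] == i      # mutual nearest neighbours
            opposite = y[i] != y[j]          # from different classes
            if mutual and opposite:
                # Drop only the majority-class member of the Tomek link.
                if y[i] == majority_label:
                    to_drop.add(i)
                if y[j] == majority_label:
                    to_drop.add(int(j))
        keep = np.array([k for k in range(len(y)) if k not in to_drop])
        return X[keep], y[keep]

    if __name__ == "__main__":
        # Toy imbalanced dataset: 900 majority vs 100 minority samples.
        rng = np.random.default_rng(0)
        X = np.vstack([rng.normal(0.0, 1.0, size=(900, 2)),
                       rng.normal(1.5, 1.0, size=(100, 2))])
        y = np.array([0] * 900 + [1] * 100)

        X_res, y_res = tomek_link_undersample(X, y, majority_label=0)
        clf = AdaBoostClassifier(n_estimators=50).fit(X_res, y_res)
        print("kept samples:", len(y_res),
              "minority fraction:", round(float(np.mean(y_res == 1)), 3))

In the actual RUSBoost-style design that the paper follows, the undersampling would be repeated on the weighted training set at each boosting iteration; the single-pass version above is kept only for brevity.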

Keywords

Undersampling · Boosting · Data mining · Class imbalance problem · Tomek-link pair

Notes

Compliance with ethical standards

Conflict of interest

The authors declare that they have no conflict of interest.

Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  • Sujit Kumar (1)
  • Saroj Kr. Biswas (1), corresponding author
  • Debashree Devi (1)

  1. Department of Computer Science and Engineering, NIT Silchar, Silchar, India
