Abstract
Classification of imbalanced data sets is a significant problem in machine learning and data mining. Traditional classifiers usually produce suboptimal results on imbalanced data sets. This study proposes a bi-objective hybrid algorithm for classifying binary imbalanced data sets that contain noisy and borderline examples. The algorithm hybridizes two metaheuristics, cuckoo search and the covariance matrix adaptation evolution strategy. The hybrid algorithm was validated in terms of its Pareto fronts. It was then used within a proposed methodology for classifying binary imbalanced data sets. The methodology estimates the probabilities for both classes (majority and minority) of a data set using the normal distribution, and the parameters of the normal distribution are optimized with the proposed algorithm. Simulated, noisy borderline and real data sets were used. Four well-known classifiers, combined with a preprocessing algorithm, were used for comparison. All classifiers were evaluated with three measures: sensitivity, G-mean and F-measure. The proposed methodology showed promising performance.
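As a rough illustration of the ideas in the abstract, the sketch below scores a sample under one normal distribution per class and then computes sensitivity, G-mean and F-measure. It is not the authors' implementation. The independent-feature (diagonal) Gaussian model, the hand-picked means and standard deviations in `params`, the toy data, and the function names are all assumptions made for this example; in the paper the distribution parameters are tuned by the bi-objective hybrid algorithm, which is not reproduced here. The G-mean and F-measure formulas are the commonly used definitions and may differ in detail from those in the paper.

```python
import math


def gaussian_logpdf(x, mean, std):
    """Log density of a univariate normal distribution at x."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std * math.sqrt(2.0 * math.pi))


def predict(sample, params):
    """Pick the class whose normal model gives the sample the highest
    log density, treating features as independent (an assumption)."""
    scores = {}
    for label, (means, stds) in params.items():
        scores[label] = sum(
            gaussian_logpdf(v, m, s) for v, m, s in zip(sample, means, stds)
        )
    return max(scores, key=scores.get)


def imbalance_metrics(y_true, y_pred, positive=1):
    """Sensitivity, G-mean and F-measure, with the minority class as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0

    g_mean = math.sqrt(sensitivity * specificity)
    denom = precision + sensitivity
    f_measure = 2.0 * precision * sensitivity / denom if denom else 0.0
    return {"sensitivity": sensitivity, "g_mean": g_mean, "f_measure": f_measure}


# Hypothetical parameters: label -> (per-feature means, per-feature stds).
# 0 is the majority class, 1 is the minority class.
params = {
    0: ([0.0, 0.0], [1.0, 1.0]),
    1: ([3.0, 3.0], [1.0, 1.0]),
}

# Toy data; the fifth sample is a borderline minority example.
X = [[0.1, -0.2], [2.9, 3.1], [0.5, 0.4], [2.0, 2.2], [1.0, 1.2], [-1.0, 0.3]]
y_true = [0, 1, 0, 1, 1, 0]

y_pred = [predict(x, params) for x in X]
metrics = imbalance_metrics(y_true, y_pred)
print(y_pred)
print(metrics)
```

In this toy run the borderline minority sample is assigned to the majority class, which lowers sensitivity and G-mean while leaving specificity untouched. That is the kind of trade-off a bi-objective search over the distribution parameters would be expected to navigate.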
Cite this article
Saeed, S., Ong, H.C. A bi-objective hybrid algorithm for the classification of imbalanced noisy and borderline data sets. Pattern Anal Applic 22, 979–998 (2019). https://doi.org/10.1007/s10044-018-0693-4
DOI: https://doi.org/10.1007/s10044-018-0693-4