Abstract
This paper addresses learning in complex scenarios involving class imbalance and class overlap. We propose a novel measure, the Augmented R-value, for estimating the level of overlap in the data. It improves on an existing model-based measure by incorporating data imbalance into the estimation process. We provide both a theoretical demonstration and empirical validation of the new metric's efficacy in estimating the overlap level. A second contribution is a collection of meta-features to be used with a meta-learning strategy for predicting the most suitable classifier for a given problem. Evaluations on a well-known collection of benchmark problems show that the meta-learning approach achieves better results than the manual classifier selection normally carried out by data scientists. Analysis of the meta-feature selection step confirms the power of the Augmented R-value in predicting the expected performance of classifiers in such complex classification scenarios. We also found that overlap affects classifier performance more severely than imbalance.
Cite this article
Borsos, Z., Lemnaru, C. & Potolea, R. Dealing with overlap and imbalance: a new metric and approach. Pattern Anal Applic 21, 381–395 (2018). https://doi.org/10.1007/s10044-016-0583-6
DOI: https://doi.org/10.1007/s10044-016-0583-6