
Dealing with overlap and imbalance: a new metric and approach

  • Theoretical Advances
Pattern Analysis and Applications

Abstract

This paper addresses learning in complex scenarios that involve both class imbalance and class overlap. We propose a novel measure, the Augmented R-value, for estimating the level of overlap in the data. It improves on an existing model-based measure by incorporating the data imbalance into the estimation process. We provide both a theoretical demonstration and empirical validation of the new metric’s efficacy in estimating the overlap level. A second contribution is a collection of meta-features to be used with a meta-learning strategy for predicting the most suitable classifier for a given problem. Evaluations on a well-known collection of benchmark problems show that the meta-learning approach achieves better results than the manual classifier selection normally carried out by data scientists. Analysis of the meta-feature selection step confirms the power of the Augmented R-value in predicting the expected performance of classifiers in such complex classification scenarios. We also found that overlap affects classifier performance more severely than imbalance.
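To make the idea of an imbalance-aware overlap estimate concrete, the sketch below computes a k-nearest-neighbour overlap score per class and then combines the per-class scores in two ways: weighted by class frequency, and weighted inversely to class size. The abstract does not state the Augmented R-value formula, so this is not the authors' definition; the inverse-size weighting, the function names, and the parameters `k` and `theta` are all assumptions chosen for illustration.

```python
import numpy as np

def per_class_overlap(X, y, k=3, theta=1):
    """Per class: fraction of instances whose k nearest neighbours contain
    more than `theta` instances of a different class (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise squared Euclidean distances; an instance is not its own neighbour.
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]            # indices of the k nearest neighbours
    foreign = (y[nn] != y[:, None]).sum(axis=1)  # neighbours from other classes
    in_overlap = foreign > theta
    return {c.item(): float(in_overlap[y == c].mean()) for c in np.unique(y)}

def plain_overlap(X, y, k=3, theta=1):
    """Share of all instances lying in overlap (classes weighted by frequency)."""
    y = np.asarray(y)
    r = per_class_overlap(X, y, k, theta)
    return float(sum(r[c] * np.mean(y == c) for c in r))

def imbalance_weighted_overlap(X, y, k=3, theta=1):
    """ASSUMPTION, not the paper's formula: weight each class's overlap
    inversely to its size, so minority-class overlap counts for more."""
    y = np.asarray(y)
    r = per_class_overlap(X, y, k, theta)
    w = {c: 1.0 / np.sum(y == c) for c in r}
    total = sum(w.values())
    return float(sum(r[c] * w[c] / total for c in r))

# Toy data: 8 majority points on a line, 2 minority points placed among them.
X = np.array([0, 1, 2, 3, 4, 5, 6, 7, 3.4, 3.6]).reshape(-1, 1)
y = np.array([0] * 8 + [1] * 2)
print(per_class_overlap(X, y))
print(plain_overlap(X, y), imbalance_weighted_overlap(X, y))
```

On this toy set the frequency-weighted score is dominated by the majority class, while the inverse-size weighting reports a much higher overlap because every minority instance sits inside the majority region. That gap is the kind of effect an imbalance-aware measure is meant to expose.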

Author information

Corresponding author

Correspondence to Zalán Borsos.

Cite this article

Borsos, Z., Lemnaru, C. & Potolea, R. Dealing with overlap and imbalance: a new metric and approach. Pattern Anal Applic 21, 381–395 (2018). https://doi.org/10.1007/s10044-016-0583-6

  • DOI: https://doi.org/10.1007/s10044-016-0583-6
