Dealing with overlap and imbalance: a new metric and approach

Theoretical Advances

Abstract

This paper addresses learning in complex scenarios involving class imbalance and overlap. We propose a novel measure, the Augmented R-value, for estimating the level of overlap in the data; it improves an existing model-based measure by incorporating the class imbalance into the estimation process. We provide both a theoretical demonstration and an empirical validation of the new metric's efficacy in estimating the overlap level. A further contribution of the paper is a collection of meta-features to be used, together with a meta-learning strategy, for predicting the most suitable classifier for a given problem. Evaluations performed on a well-known collection of benchmark problems show that the meta-learning approach achieves better results than the manual classifier selection normally carried out by data scientists. The analysis of the meta-feature selection step confirms the power of the Augmented R-value in predicting the expected performance of classifiers in such complex classification scenarios. We also found that overlap affects classifier performance more severely than imbalance does.
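The paper itself provides no code; as a rough illustration of the kind of overlap estimation the abstract refers to, the sketch below computes a kNN-based overlap rate (in the spirit of the model-based R-value the authors build on) and averages it per class so that a large majority class does not mask minority-class overlap. The function name, the k and theta parameters, and the per-class averaging are assumptions made for this example, not the paper's actual Augmented R-value definition.

```python
# Illustrative sketch only: a kNN-based overlap estimate with a simple
# imbalance-aware per-class averaging. The Augmented R-value defined in the
# paper may differ; k, theta and the averaging scheme are assumptions.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def overlap_estimate(X, y, k=7, theta=3):
    """Fraction of instances whose k-neighbourhood is dominated by other
    classes, averaged per class so the minority class is not drowned out."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)                    # idx[:, 0] is the point itself
    foreign = (y[idx[:, 1:]] != y[:, None]).sum(axis=1)
    in_overlap = foreign > theta                 # "overlapped" instances
    # Unweighted mean of per-class overlap rates: each class counts equally.
    rates = [in_overlap[y == c].mean() for c in np.unique(y)]
    return float(np.mean(rates))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X_maj = rng.normal(0.0, 1.0, size=(900, 2))  # majority class
    X_min = rng.normal(1.0, 1.0, size=(100, 2))  # overlapping minority class
    X = np.vstack([X_maj, X_min])
    y = np.array([0] * 900 + [1] * 100)
    print(f"estimated overlap: {overlap_estimate(X, y):.3f}")
```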

Keywords

Imbalance · Overlap · Augmented R-value · Meta-classification

Copyright information

© Springer-Verlag London 2016

Authors and Affiliations

  • Zalán Borsos (1)
  • Camelia Lemnaru (1)
  • Rodica Potolea (1)

  1. Department of Computer Science, Technical University of Cluj-Napoca, Cluj-Napoca, Romania
