Dealing with overlap and imbalance: a new metric and approach

Borsos, Zalán; Lemnaru, Camelia; Potolea, Rodica

doi:10.1007/s10044-016-0583-6

Theoretical Advances
Published: 27 September 2016

Volume 21, pages 381–395, (2018)
Pattern Analysis and Applications

Zalán Borsos¹,
Camelia Lemnaru¹ &
Rodica Potolea¹

Abstract

This paper addresses learning in complex scenarios involving class imbalance and class overlap. We propose a novel measure, the Augmented R-value, for estimating the level of overlap in the data. It improves on an existing model-based measure by including the data imbalance in the estimation process. We provide both a theoretical demonstration and an empirical validation of the new metric's efficacy in estimating the overlap level. A second contribution is a collection of meta-features to be used with a meta-learning strategy for predicting the most suitable classifier for a given problem. Evaluations on a well-known collection of benchmark problems show that the meta-learning approach achieves better results than the manual classifier selection normally carried out by data scientists. Analysis of the meta-feature selection step confirms the power of the Augmented R-value in predicting the expected performance of classifiers in such complex classification scenarios. We also found that overlap affects classifier performance more seriously than imbalance.
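The abstract does not define the Augmented R-value, so the following is only a rough illustration of the general idea it describes: a neighbourhood-based overlap estimate that also takes class sizes into account. It is a sketch under stated assumptions, not the authors' formula. The function names, the parameters `k` and `theta`, and the weighting scheme in `imbalance_weighted_overlap` are all hypothetical.

```python
from collections import Counter
from math import dist


def knn_overlap_rates(X, y, k=3, theta=1):
    """Per-class fraction of points having more than `theta` of their
    k nearest neighbours (Euclidean) in a different class."""
    counts = Counter(y)
    overlapped = Counter()
    for i, (xi, yi) in enumerate(zip(X, y)):
        # Indices of the k points closest to xi, excluding xi itself.
        neighbours = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: dist(xi, X[j]),
        )[:k]
        foreign = sum(1 for j in neighbours if y[j] != yi)
        if foreign > theta:
            overlapped[yi] += 1
    rates = {c: overlapped[c] / counts[c] for c in counts}
    return rates, dict(counts)


def pooled_overlap(rates, counts):
    """Pooled estimate: every point counts equally, so the majority
    class dominates the result."""
    total = sum(counts.values())
    return sum(rates[c] * counts[c] for c in counts) / total


def imbalance_weighted_overlap(rates, counts):
    """ASSUMPTION (hypothetical weighting, two classes only): weight each
    class rate by the size of the other class, so that overlap in the
    minority class contributes more to the estimate."""
    a, b = counts  # the two class labels
    total = counts[a] + counts[b]
    return (rates[a] * counts[b] + rates[b] * counts[a]) / total


# Toy 1-D data: 8 majority points, 4 minority points, one of which
# (3.4) sits inside the majority region.
X = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,), (6.0,), (7.0,),
     (3.4,), (20.0,), (21.0,), (22.0,)]
y = [0] * 8 + [1] * 4

rates, counts = knn_overlap_rates(X, y, k=3, theta=1)
print(rates, pooled_overlap(rates, counts),
      imbalance_weighted_overlap(rates, counts))
```

On this toy data only the minority class has overlapped points, so the weighted estimate comes out higher than the pooled one. This illustrates why an imbalance-aware measure can flag a difficulty that a pooled estimate understates.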

Author information

Authors and Affiliations

Department of Computer Science, Technical University of Cluj-Napoca, 26 Baritiu Street, Room C9, 400027, Cluj-Napoca, Romania
Zalán Borsos, Camelia Lemnaru & Rodica Potolea

Corresponding author

Correspondence to Zalán Borsos.

About this article

Cite this article

Borsos, Z., Lemnaru, C. & Potolea, R. Dealing with overlap and imbalance: a new metric and approach. Pattern Anal Applic 21, 381–395 (2018). https://doi.org/10.1007/s10044-016-0583-6

Received: 28 October 2015
Accepted: 08 September 2016
Published: 27 September 2016
Issue Date: May 2018
DOI: https://doi.org/10.1007/s10044-016-0583-6
