Abstract
Obtaining good probability estimates is imperative for many applications. The increased uncertainty and typically asymmetric costs surrounding rare events increase this need. Experts (and classification systems) often rely on probabilities to inform decisions. However, we demonstrate that class probability estimates obtained via supervised learning in imbalanced scenarios systematically underestimate the probabilities for minority class instances, despite ostensibly good overall calibration. To our knowledge, this problem has not previously been explored. We propose a new metric, the stratified Brier score, to capture class-specific calibration, analogous to the per-class metrics widely used to assess the discriminative performance of classifiers in imbalanced scenarios. We propose a simple, effective method to mitigate the bias of probability estimates for imbalanced data that bags estimators independently calibrated over balanced bootstrap samples. This approach drastically improves performance on the minority instances without greatly affecting overall calibration. We extend our previous work in this direction by providing ample additional empirical evidence for the utility of this strategy, using both support vector machines and boosted decision trees as base learners. Finally, we show that additional uncertainty can be exploited via a Bayesian approach by considering posterior distributions over bagged probability estimates.
This is a preview of subscription content,
to check access.







Notes
Somewhat confusingly, the term ‘calibration’ is often used both to refer to the process of calibrating classifiers and to measure the accuracy of probability estimates.
We used the somewhat arbitrary relative weight of 10.
See Table 1.
This limit is undefined in general; here \(\tilde{\pi }\) is coming from the positive side.
Recall that the training time of SVMs scales quadratically with the number of instances.
References
Breiman L (1996) Bagging predictors. Mach Learn 24(2):123–140
Breslow NE, Day NE (1980) Statistical methods in cancer research, vol. 1. The analysis of case-control studies, vol 1. Distributed for IARC by WHO, Geneva, Switzerland
Brier GW (1950) Verification of forecasts expressed in terms of probability. Mon Weather Rev 78(1):1–3
Chawla NV (2010) Data mining for imbalanced datasets: an overview. Data Mining Knowledge Discovery Handbook, pp 875–886
Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357
Chawla NV, Cieslak DA, Hall LO, Joshi A (2008) Automatically countering imbalance and its empirical relationship to cost. Data Min Knowl Discov 17(2):225–252
Cieslak D, Chawla N (2008) Analyzing pets on imbalanced datasets when training and testing class distributions differ. Adv Knowl Discov Data Min 5012: 519–526
Cohen AM, Hersh WR, Peterson K, Yen PY (2006a) Reducing workload in systematic review preparation using automated citation classification. J Am Med Inform Assoc 13(2):206–219
Cohen G, Hilario M, Sax H, Hugonnet S, Geissbuhler A (2006b) Learning from imbalanced data in surveillance of nosocomial infection. Artif Intell Med 37(1):7–18
Cohen I, Goldszmidt M (2004) Properties and benefits of calibrated classifiers. In: 15th European conference on machine learning (ECML04) and 8th European conference on principles and practice of knowledge discovery in databases (PKDD04), pp 125–136
Elkan C (2001) The foundations of cost-sensitive learning. In: Proceedings of the 7th international joint conference on, artificial intelligence (IJCAI01), vol 17, pp 973–978
Firth D (1993) Bias reduction of maximum likelihood estimates. Biometrika 80(1):27–38
Foster DP, Stine RA (2004) Variable selection in data mining: building a predictive model for bankruptcy. J Am Stat Assoc 99(466):303–313
Guo X, Yin Y, Dong C, Yang G, Zhou G (2008) On the class imbalance problem. In: 4th International conference on natural computation (ICNC08), pp 192–201
Haibo H, Garcia EA (2009) Learning from imbalanced data. IEEE Trans Knowl Data Eng 21(9):1263–1284
Hastie T, Tibshirani R (1998) Classification by pairwise coupling. Ann Stat 26(2):451–471
Japkowicz N, Stephen S (2002) The class imbalance problem: a systematic study. Intell Data Anal 6(5):429–449
King G, Zeng L (2001) Logistic regression in rare events data. Political Anal 9(2):137–163
Lin HT, Lin CJ, Weng RC (2007) A note on platt’s probabilistic outputs for support vector machines. Mach Learn 68(3):267–276
McCullagh P, Nelder JA (1987) Generalized linear models. Chapman and Hall, London
Niculescu-Mizil A, Caruana R (2005a) Obtaining calibrated probabilities from boosting. In: Proceedings of the 21st conference on uncertainty in artificial intelligence (UAI05), pp 413–420
Niculescu-Mizil A, Caruana R (2005b) Predicting good probabilities with supervised learning. In: Proceedings of the 22nd international conference on machine learning (ICML05). ACM, pp 625–632
Platt J (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Adv Large Margin Classif 10(3):61–74
Provost F (2000) Machine learning from imbalanced data sets 101 (invited talk). In: Proceedings of the AAAI workshop on learning from imbalanced data sets (AAAI00)
Rabe-Hesketh S, Skrondal A (2008) Multilevel and longitudinal modeling using Stata. Stata Corp, College Station, Texas
Van Hulse J, Khoshgoftaar TM, Napolitano A (2007) Experimental perspectives on learning from imbalanced data. In: Proceedings of the 24th international conference on machine learning (ICML07), pp 935–942
Wallace BC, Dahabreh IJ (2012) Class probability estimates are unreliable for imbalanced data (and how to fix them). In: Proceedings of the 12th international conference on data mining (ICDM12). IEEE, pp 695–704
Wallace BC, Small K, Brodley CE, Trikalinos TA (2011) Class imbalance, redux. In: Proceedings of the 11th international conference on data mining (ICDM11). IEEE, pp 754–763
Wallace BC, Trikalinos TA, Lau J, Brodley C, Schmid CH (2010) Semi-automated screening of biomedical citations for systematic reviews. BMC Bioinform 11(1):55
Walter SD (1985) Small sample estimation of log odds ratios from logistic regression and fourfold tables. Stat Med 4(4):437–444
Yang W, Wu X (2006) 10 challenging problems in data mining research. Int J Inf Technol Dec Making 5(4):597–604
Zadrozny B, Elkan C (2002) Transforming classifier scores into accurate multiclass probability estimates. In: Proceedings of the 8th ACM SIGKDD international conference on knowledge discovery and data mining (KDD02). ACM, pp 694–699
Zhu J, Hovy E, (2007) Active learning for word sense disambiguation with methods for addressing the class imbalance problem. In: Proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural, language learning, pp 783–790
Author information
Authors and Affiliations
Corresponding author
Additional information
This article is an extended version of our ICDM 2012 paper [27]. This work was supported in part by a grant from the Agency for Healthcare Research and Quality (AHRQ, grant HS018494-01). The findings and conclusions in this paper are those of the authors, who are responsible for its content, and do not necessarily represent the views of the AHRQ. No statement in this report should be construed as an official position of the AHRQ or of the U.S. Department of Health and Human Services.
Rights and permissions
About this article
Cite this article
Wallace, B.C., Dahabreh, I.J. Improving class probability estimates for imbalanced data. Knowl Inf Syst 41, 33–52 (2014). https://doi.org/10.1007/s10115-013-0670-6
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10115-013-0670-6