
A power-controlled reliability assessment for multi-class probabilistic classifiers

Advances in Data Analysis and Classification


In multi-class classification, the output of a probabilistic classifier is a probability distribution over the classes. In this work, we focus on a statistical assessment of the reliability of probabilistic classifiers for multi-class problems. Our approach computes a Pearson \(\chi ^2\) statistic based on the k-nearest neighbors in the prediction space. Further, we develop a Bayesian approach for estimating the expected power of the reliability test, which can be used to choose an appropriate sample size k. We propose a sampling algorithm and demonstrate that it yields a valid prior distribution. The effectiveness of the proposed reliability test and expected power is evaluated through a simulation study. We also illustrate the proposed methods with practical applications.
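The core idea of a k-nearest-neighbor Pearson \(\chi ^2\) reliability check can be sketched as follows. This is a minimal illustration rather than the paper's exact procedure: the function name, the Euclidean metric on the prediction space, and the \(\chi ^2_{c-1}\) reference distribution are assumptions made for this sketch.

```python
import numpy as np
from scipy.stats import chi2

def knn_chi2_reliability(p_hat, predictions, labels, k, alpha=0.05):
    """Pearson chi-squared reliability check at a user-chosen probability
    vector p_hat, using the k nearest predictions (Euclidean distance in
    the prediction space) and their observed class labels.

    predictions: (n, c) array of predicted class-probability vectors.
    labels:      (n,) array of observed class indices in {0, ..., c-1}.
    """
    c = len(p_hat)
    # indices of the k predictions closest to p_hat
    dist = np.linalg.norm(predictions - np.asarray(p_hat), axis=1)
    nn = np.argsort(dist)[:k]
    # observed class counts among the neighbours vs. counts expected under p_hat
    observed = np.bincount(np.asarray(labels)[nn], minlength=c)
    expected = k * np.asarray(p_hat)
    stat = np.sum((observed - expected) ** 2 / expected)
    # compare against a chi-squared distribution with c - 1 degrees of freedom
    p_value = chi2.sf(stat, df=c - 1)
    return stat, p_value, p_value < alpha
```

The k neighbors of the user-chosen vector supply observed class counts, which are compared with the counts expected if the classifier's stated probabilities were reliable at that point.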




  1. Because \(\hat{\textbf{p}}\) is a user-defined vector, one can choose \(\hat{\textbf{p}}\) to meet the necessary conditions. Another solution to ensure that \(p_j - \epsilon >0\) is to merge classes with low probabilities.

  2. The number of clusters was set to six to illustrate diverse reliability test results without being redundant.

  3. In this section, the true difference between each representative pattern and the corresponding underlying probability vector was used to empirically demonstrate the effectiveness of the proposed expected power compared with the actual rejection rate.
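Footnote 1's suggestion of merging low-probability classes can be sketched as below. The function name and the rule of pooling every class with \(p_j \le \epsilon \) into a single class are illustrative assumptions; note that the pooled class may itself need further treatment if its total probability still does not exceed \(\epsilon \).

```python
def merge_small_classes(p, eps):
    """Pool all classes with probability <= eps into one merged class,
    so that every retained class satisfies p_j - eps > 0.

    Returns the merged probability vector, the indices of the classes
    that were kept, and the indices that were pooled.
    """
    keep = [j for j, pj in enumerate(p) if pj > eps]
    small = [j for j, pj in enumerate(p) if pj <= eps]
    merged = [p[j] for j in keep]
    if small:
        # one pooled class holding the combined low-probability mass
        merged.append(sum(p[j] for j in small))
    return merged, keep, small
```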




We gratefully acknowledge funding from grant RGPIN-2022-04698 (PI: Gweon) from the Natural Sciences and Engineering Research Council of Canada (NSERC).

Author information



Corresponding author

Correspondence to Hyukjun Gweon.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A Proof of Theorem 1


We show that the total area under \(f_{\textbf{r}}(r_1,\ldots ,r_c)\) equals 1. Because there are \(\left( {\begin{array}{c}c\\ h\end{array}}\right) \) ways in which exactly h of the values \(r_i\) \((i=1,\ldots ,c)\) are negative, the total area can be expressed as

$$\begin{aligned} \idotsint f_{\textbf{r}}(r_1,r_2,\ldots ,r_c) \,dr_1 \ldots \,dr_c = \sum _{h=1}^{c-1} \left( {\begin{array}{c}c\\ h\end{array}}\right) \text {A}_h, \end{aligned}$$

where \(\text {A}_h\) denotes the probability of the event that

$$\begin{aligned} \sum _{i=1}^{h} r_i = -\frac{\epsilon }{2}, \quad \sum _{i=h+1}^{c} r_i = \frac{\epsilon }{2}, \quad r_1,\ldots ,r_h < 0, \text { and} \quad r_{h+1},\ldots ,r_c > 0 \end{aligned}$$

and thus the support of \((r_1,\ldots ,r_c)\) becomes

$$\begin{aligned}&r_1 \in \left( -\frac{\epsilon }{2},0\right) , \\&r_2 \in \left( -\frac{\epsilon }{2}- r_1,0\right) , \\&... \\&r_{h-1} \in \left( -\frac{\epsilon }{2}- r_1...-r_{h-2},0\right) , \\&r_h = -\frac{\epsilon }{2}- r_1...-r_{h-1}, \\&r_{h+1} \in \left( 0,\frac{\epsilon }{2}\right) , \\&r_{h+2} \in \left( 0,\frac{\epsilon }{2}-r_{h+1}\right) , \\&... \\&r_{c-1} \in \left( 0,\frac{\epsilon }{2}-r_{h+1} ... - r_{c-2}\right) , \text { and}\\&r_c = \frac{\epsilon }{2}-r_{h+1} ... - r_{c-1}. \end{aligned}$$

Using a change of variables, we define \(w_i = - \epsilon /2 - \sum _{j=0}^{i}r_{h-j}\) (\(i=0,\ldots ,h-2)\) and \(v_i = \epsilon /2 - \sum _{j=0}^{i}r_{c-j}\) (\(i=0,\ldots ,c-h-2)\). Then, we have

$$\begin{aligned} \text {A}_h&= \int _{-\frac{\epsilon }{2}}^{0} \int _{-\frac{\epsilon }{2}}^{w_{h-2}} \ldots \int _{-\frac{\epsilon }{2}}^{w_1} \int _{0}^{\frac{\epsilon }{2}} \int _{v_{c-h-2}}^{\frac{\epsilon }{2}} \ldots \int _{v_1}^{\frac{\epsilon }{2}} \frac{1}{\left( {\begin{array}{c}c\\ h\end{array}}\right) } \frac{(c-2)!}{\epsilon ^{c-2}} \vert J \vert \,dw_0 \ldots \,dv_{c-h-2} \\&=\frac{1}{\left( {\begin{array}{c}c\\ h\end{array}}\right) } \frac{(c-2)!}{\epsilon ^{c-2}} \left[ \int _{0}^{\frac{\epsilon }{2}} \ldots \int _{v_{1}}^{\frac{\epsilon }{2}} \left[ \int _{-\frac{\epsilon }{2}}^{0} \ldots \int _{-\frac{\epsilon }{2}}^{w_1} \,dw_0 \ldots \,dw_{h-2} \right] \,dv_0 \ldots \,dv_{c-h-2} \right] , \end{aligned}$$

where \(\vert J \vert = 1\) because the Jacobian matrix of the transformation is triangular, so its determinant is the product of the diagonal entries, each equal to \(\pm 1\).

We first prove by induction that

$$\begin{aligned} \int _{-\frac{\epsilon }{2}}^{w_n} \ldots \int _{-\frac{\epsilon }{2}}^{w_1} \,dw_0 \ldots \,dw_{n-1} = \sum _{j=0}^{n} \left( \frac{\epsilon }{2}\right) ^j \frac{1}{j!} \frac{1}{(n-j)!} w_n^{n-j} \end{aligned}$$

for any positive integer n. When \(n=1\), we have

$$\begin{aligned} \int _{-\frac{\epsilon }{2}}^{w_1} \,dw_0 = w_1 + \frac{\epsilon }{2} = \sum _{j=0}^{1} \left( \frac{\epsilon }{2}\right) ^j \frac{1}{j!} \frac{1}{(1-j)!} w_1^{1-j}. \end{aligned}$$

Assuming that

$$\begin{aligned} \int _{-\frac{\epsilon }{2}}^{w_k} \ldots \int _{-\frac{\epsilon }{2}}^{w_1} \,dw_0 \ldots \,dw_{k-1} = \sum _{j=0}^{k} \left( \frac{\epsilon }{2}\right) ^j \frac{1}{j!} \frac{1}{(k-j)!} w_k^{k-j}, \end{aligned}$$

we have

$$\begin{aligned} \int _{-\frac{\epsilon }{2}}^{w_{k+1}} \ldots \int _{-\frac{\epsilon }{2}}^{w_1} \,dw_0 \ldots \,dw_{k}&= \int _{-\frac{\epsilon }{2}}^{w_{k+1}} \left( \sum _{j=0}^{k} \left( \frac{\epsilon }{2}\right) ^j \frac{1}{j!} \frac{1}{(k-j)!} w_k^{k-j} \right) \,dw_k \\&= \sum _{j=0}^{k} \left( \frac{\epsilon }{2}\right) ^j \frac{1}{j!} \frac{1}{(k-j)!} \left( \int _{-\frac{\epsilon }{2}}^{w_{k+1}} w_k^{k-j} \,dw_k \right) \\&= \sum _{j=0}^{k} \left( \frac{\epsilon }{2}\right) ^j \frac{1}{j!} \frac{1}{(k-j+1)!} w_{k+1}^{k-j+1} \\&\quad - \sum _{j=0}^{k} \left( \frac{\epsilon }{2}\right) ^j \left( - \frac{\epsilon }{2} \right) ^{k-j+1} \frac{1}{j!} \frac{1}{(k-j+1)!}. \end{aligned}$$

Applying the binomial theorem to \(0 = \left( \frac{\epsilon }{2} - \frac{\epsilon }{2} \right) ^{k+1}\) gives

$$\begin{aligned} \sum _{j=0}^{k+1} \left( \frac{\epsilon }{2}\right) ^j \left( - \frac{\epsilon }{2} \right) ^{k-j+1} \frac{(k+1)!}{j! (k-j+1)!} = 0, \end{aligned}$$

we have

$$\begin{aligned} \sum _{j=0}^{k} \left( \frac{\epsilon }{2}\right) ^j \left( - \frac{\epsilon }{2} \right) ^{k-j+1} \frac{1}{j!} \frac{1}{(k-j+1)!} = -\frac{1}{(k+1)!} \left( \frac{\epsilon }{2} \right) ^{k+1} \end{aligned}$$

and therefore
$$\begin{aligned} \int _{-\frac{\epsilon }{2}}^{w_{k+1}} \ldots \int _{-\frac{\epsilon }{2}}^{w_1} \,dw_0 \ldots \,dw_{k}&= \sum _{j=0}^{k} \left( \frac{\epsilon }{2}\right) ^j \frac{1}{j!} \frac{1}{(k-j+1)!} w_{k+1}^{k-j+1} + \frac{1}{(k+1)!} \left( \frac{\epsilon }{2} \right) ^{k+1} \\&= \sum _{j=0}^{k+1} \left( \frac{\epsilon }{2}\right) ^j \frac{1}{j!} \frac{1}{(k-j+1)!} w_{k+1}^{k-j+1}. \end{aligned}$$
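As a sanity check, the induction identity above can be verified symbolically for small n; the sketch below assumes the sympy library is available.

```python
import sympy as sp

eps = sp.symbols('epsilon', positive=True)
w = sp.symbols('w0:5')  # symbols w0, ..., w4

def iterated_integral(n):
    # I_n(w_n) = int_{-eps/2}^{w_n} ... int_{-eps/2}^{w_1} dw_0 ... dw_{n-1}
    expr = sp.Integer(1)
    for i in range(n):
        expr = sp.integrate(expr, (w[i], -eps / 2, w[i + 1]))
    return expr

def closed_form(n):
    # claimed value: sum_{j=0}^{n} (eps/2)^j / (j! (n-j)!) * w_n^{n-j}
    return sum((eps / 2) ** j / (sp.factorial(j) * sp.factorial(n - j)) * w[n] ** (n - j)
               for j in range(n + 1))

# the two expressions agree for n = 1, ..., 4
for n in (1, 2, 3, 4):
    assert sp.simplify(iterated_integral(n) - closed_form(n)) == 0
```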

Then, using the result in Eq. (A1) with \(n = h-2\),

$$\begin{aligned} \int _{-\frac{\epsilon }{2}}^{0} \int _{-\frac{\epsilon }{2}}^{w_{h-2}} \ldots \int _{-\frac{\epsilon }{2}}^{w_1} \,dw_0 \ldots \,dw_{h-2}&= \int _{-\frac{\epsilon }{2}}^{0} \left( \sum _{j=0}^{h-2} \left( \frac{\epsilon }{2}\right) ^j \frac{1}{j!} \frac{1}{(h-2-j)!} w_{h-2}^{h-2-j} \right) \,dw_{h-2} \\&= \frac{1}{(h-1)!} \left( \frac{\epsilon }{2}\right) ^{h-1}. \end{aligned}$$

Similarly, we can show that

$$\begin{aligned} \int _{0}^{\frac{\epsilon }{2}} \int _{v_{c-h-2}}^{\frac{\epsilon }{2}} \ldots \int _{v_{1}}^{\frac{\epsilon }{2}} \,dv_0 \ldots \,dv_{c-h-2} = \frac{1}{(c-h-1)!} \left( \frac{\epsilon }{2}\right) ^{c-h-1}. \end{aligned}$$

Combining these results, we obtain
$$\begin{aligned} \idotsint f_{\textbf{r}}(r_1,\ldots ,r_c) \,dr_1 \ldots \,dr_c&= \sum _{h=1}^{c-1} \left( {\begin{array}{c}c\\ h\end{array}}\right) \text {A}_h \\&= \frac{(c-2)!}{\epsilon ^{c-2}} \sum _{h=1}^{c-1} \left( \frac{(\epsilon /2)^{h-1}}{(h-1)!} \frac{(\epsilon /2)^{c-h-1}}{(c-h-1)!} \right) \\&= \frac{1}{2^{c-2}} \sum _{h=1}^{c-1} \frac{(c-2)!}{(h-1)!(c-h-1)!} \\&= \frac{1}{2^{c-2}} \sum _{b=0}^{c-2} \frac{(c-2)!}{b!(c-b-2)!}= 1, \end{aligned}$$

which completes the proof.
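The two combinatorial facts used in the final step, the binomial-theorem identity and the collapse of the sum over h to \(2^{c-2}\), can be checked numerically with the standard library; the value of eps below is an arbitrary positive test value.

```python
from math import comb

# binomial identity: sum_{j=0}^{k+1} (eps/2)^j (-eps/2)^{k-j+1} C(k+1, j) = 0,
# since it expands (eps/2 - eps/2)^{k+1}
eps = 0.8
for k in range(1, 6):
    s = sum((eps / 2) ** j * (-eps / 2) ** (k - j + 1) * comb(k + 1, j)
            for j in range(k + 2))
    assert abs(s) < 1e-12

# final step: sum_{h=1}^{c-1} C(c-2, h-1) = 2^{c-2}, so the total area is 1
for c in range(2, 10):
    total = sum(comb(c - 2, h - 1) for h in range(1, c)) / 2 ** (c - 2)
    assert total == 1.0
```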

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Gweon, H. A power-controlled reliability assessment for multi-class probabilistic classifiers. Adv Data Anal Classif 17, 927–949 (2023).


