Properization: constructing proper scoring rules via Bayes acts

Abstract

Scoring rules serve to quantify predictive performance. A scoring rule is proper if truth telling is an optimal strategy in expectation. Subject to customary regularity conditions, every scoring rule can be made proper, by applying a special case of the Bayes act construction studied by Grünwald and Dawid (Ann Stat 32:1367–1433, 2004) and Dawid (Ann Inst Stat Math 59:77–93, 2007), to which we refer as properization. We discuss examples from the recent literature and apply the construction to create new types, and reinterpret existing forms, of proper scoring rules and consistent scoring functions. In an abstract setting, we formulate sufficient conditions under which Bayes acts exist and scoring rules can be made proper.

Notes

  1. As noted by Parry (2016), the improper score \({S}_1\) shares its (concave) expected score function \(P \mapsto {S}_1(P,P)\) with the proper Brier score. This illustrates the importance of the second condition in Theorem 1 of Gneiting and Raftery (2007): For a scoring rule \({S}\), the (strict) concavity of the expected score function \( G(P) := {S}(P,P)\) is equivalent to the (strict) propriety of \({S}\) only if, furthermore, \(- {S}(P,\cdot )\) is a subtangent of \(- G\) at P.

  2. See, e.g., http://www.fharrell.com/post/class-damage/ and http://www.fharrell.com/post/classification/.

References

  1. Aliprantis, C. D., Border, K. C. (2006). Infinite dimensional analysis (3rd ed.). Berlin: Springer.

  2. Christensen, H. M., Moroz, I. M., Palmer, T. N. (2014). Evaluation of ensemble forecast uncertainty using a new proper score: Application to medium-range and seasonal forecasts. Quarterly Journal of the Royal Meteorological Society, 141, 538–549.

  3. Dawid, A. P. (1986). Probability forecasting. In S. Kotz, N. L. Johnson, C. B. Read (Eds.), Encyclopedia of statistical sciences, Vol. 7, pp. 210–218. New York: Wiley.

  4. Dawid, A. P. (2007). The geometry of proper scoring rules. Annals of the Institute of Statistical Mathematics, 59, 77–93.

  5. Dawid, A. P., Musio, M. (2014). Theory and applications of proper scoring rules. Metron, 72, 169–183.

  6. Diks, C., Panchenko, V., van Dijk, D. (2011). Likelihood-based scoring rules for comparing density forecasts in tails. Journal of Econometrics, 163, 215–230.

  7. Ebert, E., Brown, B., Göber, M., Haiden, T., Mittermaier, M., Nurmi, P., Wilson, L., Jackson, S., Johnston, P., Schuster, D. (2018). The WMO challenge to develop and demonstrate the best new user-oriented forecast verification metric. Meteorologische Zeitschrift, 27, 435–440.

  8. Ebert, E., Wilson, L., Weigel, A., Mittermaier, M., Nurmi, P., Gill, P., Göber, M., Joslyn, S., Brown, B., Fowler, T., Watkins, A. (2013). Progress and challenges in forecast verification. Meteorological Applications, 20, 130–139.

  9. Ehm, W., Gneiting, T., Jordan, A., Krüger, F. (2016). Of quantiles and expectiles: Consistent scoring functions, Choquet representations and forecast rankings. Journal of the Royal Statistical Society Series B. Statistical Methodology, 78, 505–562.

  10. Ferguson, T. S. (1967). Mathematical statistics: A decision theoretic approach. Probability and mathematical statistics, Vol. 1. New York: Academic Press.

  11. Ferri, C., Hernández-Orallo, J., Modroiu, R. (2009). An experimental comparison of performance measures for classification. Pattern Recognition Letters, 30, 27–38.

  12. Ferro, C. A. T. (2017). Measuring forecast performance in the presence of observation error. Quarterly Journal of the Royal Meteorological Society, 143, 2665–2676.

  13. Fissler, T., Ziegel, J. F. (2016). Higher order elicitability and Osband’s principle. The Annals of Statistics, 44, 1680–1707.

  14. Friederichs, P., Thorarinsdottir, T. L. (2012). Forecast verification for extreme value distributions with an application to probabilistic peak wind prediction. Environmetrics, 23, 579–594.

  15. Gelfand, A. E., Ghosh, S. K. (1998). Model choice: A minimum posterior predictive loss approach. Biometrika, 85, 1–11.

  16. Gelman, A., Hwang, J., Vehtari, A. (2014). Understanding predictive information criteria for Bayesian models. Statistics and Computing, 24, 997–1016.

  17. Gneiting, T. (2011). Making and evaluating point forecasts. Journal of the American Statistical Association, 106, 746–762.

  18. Gneiting, T., Katzfuss, M. (2014). Probabilistic forecasting. Annual Review of Statistics and Its Application, 1, 125–151.

  19. Gneiting, T., Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102, 359–378.

  20. Gneiting, T., Ranjan, R. (2011). Comparing density forecasts using threshold- and quantile-weighted scoring rules. Journal of Business & Economic Statistics, 29, 411–422.

  21. Granger, C. W., Machina, M. J. (2006). Forecasting and decision theory. In G. Elliott, C. Granger, A. Timmermann (Eds.), Handbook of economic forecasting, Vol. 1, pp. 81–98. Amsterdam: Elsevier.

  22. Granger, C. W. J., Pesaran, M. H. (2000). Economic and statistical measures of forecast accuracy. Journal of Forecasting, 19, 537–560.

  23. Grünwald, P. D., Dawid, A. P. (2004). Game theory, maximum entropy, minimum discrepancy and robust Bayesian decision theory. The Annals of Statistics, 32, 1367–1433.

  24. Harrell, F. E., Jr. (2015). Regression modeling strategies (2nd ed.). Springer series in statistics. Cham: Springer.

  25. Holzmann, H., Klar, B. (2017). Focusing on regions of interest in forecast evaluation. The Annals of Applied Statistics, 11, 2404–2431.

  26. Laud, P. W., Ibrahim, J. G. (1995). Predictive model selection. Journal of the Royal Statistical Society Series B. Methodological, 57, 247–262.

  27. M4 Team. (2018). M4 competitor’s guide: Prizes and rules. Available online at https://www.m4.unic.ac.cy/wp-content/uploads/2018/03/M4-Competitors-Guide.pdf. Accessed 13 Dec 2018.

  28. Makridakis, S., Spiliotis, E., Assimakopoulos, V. (2018). The M4 competition: Results, findings, conclusion and way forward. International Journal of Forecasting, 34, 802–808.

  29. Müller, W. A., Appenzeller, C., Doblas-Reyes, F. J., Liniger, M. A. (2005). A debiased ranked probability skill score to evaluate probabilistic ensemble forecasts with small ensemble sizes. Journal of Climate, 18, 1513–1523.

  30. Parry, M. (2016). Linear scoring rules for probabilistic binary classification. Electronic Journal of Statistics, 10, 1596–1607.

  31. Reid, M. D., Williamson, R. C. (2010). Composite binary losses. Journal of Machine Learning Research, 11, 2387–2422.

  32. van Erven, T., Reid, M. D., Williamson, R. C. (2012). Mixability is Bayes risk curvature relative to log loss. Journal of Machine Learning Research, 13, 1639–1663.

  33. Werner, D. (2018). Funktionalanalysis (8th ed.). Berlin: Springer.

  34. Williamson, R. C., Vernet, E., Reid, M. D. (2016). Composite multiclass losses. Journal of Machine Learning Research, 17, 1–52.

  35. Wilson, L. J., Burrows, W. R., Lanzinger, A. (1999). A strategy for verification of weather element forecasts from an ensemble prediction system. Monthly Weather Review, 127, 956–970.

  36. Zamo, M., Naveau, P. (2018). Estimation of the continuous ranked probability score with limited information and applications to ensemble weather forecasts. Mathematical Geosciences, 50, 209–234.

Acknowledgements

Tilmann Gneiting is grateful for funding by the Klaus Tschira Foundation and by the European Union Seventh Framework Programme under grant agreement 290976. Part of his research leading to these results has been done within subproject C7 “Statistical postprocessing and stochastic physics for ensemble predictions” of the Transregional Collaborative Research Center SFB / TRR 165 “Waves to Weather” (www.wavestoweather.de) funded by the German Research Foundation (DFG). Jonas Brehmer gratefully acknowledges support by DFG through Research Training Group RTG 1953. We thank Tobias Fissler, Rafael Frongillo, Alexander Jordan, and Matthew Parry for instructive discussions, and we are grateful to the editor and two anonymous referees for thoughtful comments and suggestions.

Author information

Corresponding author

Correspondence to Jonas R. Brehmer.

Appendix: Proofs

Here, we present detailed arguments for the technical claims in Examples 5, 6, 7, and 9 as well as the proofs of Theorems 2 and 3.

Details for Example 5

We fix some distribution P and start with the case \(\alpha > 1\). An application of Fubini’s theorem gives

$$\begin{aligned} {S}_\alpha (Q,P) = \int \int \vert Q(x) - \mathbb {1}\left( y \le x\right) \vert ^\alpha \,\mathrm{d}P(y) \,\mathrm{d}x . \end{aligned} \tag{8}$$

Given \(x \in \mathbb {R}\), we seek the value \(Q(x) \in [0,1]\) that minimizes the inner integral in (8). If x is such that \(P(x) \in \lbrace 0, 1 \rbrace \), the equality \(\mathbb {1}\left( y \le x\right) = P(x)\) holds for P-almost all y, hence \(Q(x)= P(x)\) is the unique minimizer. If x satisfies \(P(x) \in (0,1)\), define the function

$$\begin{aligned} g_{x,P}(q) := \int \vert q - \mathbb {1}\left( y \le x\right) \vert ^\alpha \,\mathrm{d}P(y) = (1-P(x)) q^\alpha + P(x) (1-q)^\alpha , \end{aligned}$$

which is strictly convex in \(q \in (0,1)\) with derivative

$$\begin{aligned} g_{x,P}'(q) = \alpha (1- P(x)) q^{\alpha - 1} - \alpha P(x) (1- q)^{\alpha - 1} \end{aligned}$$

and a unique minimum at \(q = q_{x,P}^* \in (0,1)\). As a consequence, the minimizing value Q(x) is given by

$$\begin{aligned} Q(x) = q_{x,P}^* = \left( 1 + \left( \frac{1-P(x)}{P(x)} \right) ^{1/(\alpha - 1)} \right) ^{-1}. \end{aligned}$$

The function Q defined by the minimizers Q(x), \(x \in \mathbb {R}\), is a minimizer of \({S}_\alpha ( \cdot ,P)\), and if \({S}_\alpha (Q, P)\) is finite, it is unique Lebesgue almost everywhere. Since \(\alpha >1\), the function Q has the properties of a distribution function, and hence, \(P^*\) defined by (4) is a Bayes act for P. Moreover, Eq. (4) shows that the relation between P and \(P^*\) is one-to-one.
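The closed form for the pointwise minimizer can be checked numerically. The following is a minimal sketch, not part of the original argument: the function names `g`, `q_star`, and `grid_argmin` are hypothetical, `p` stands for the value \(P(x) \in (0,1)\), and the brute-force grid search is only an illustrative stand-in for the calculus above.

```python
def g(q, p, alpha):
    """Inner integral g_{x,P}(q), with p playing the role of P(x)."""
    return (1.0 - p) * q**alpha + p * (1.0 - q)**alpha


def q_star(p, alpha):
    """Closed-form minimizer of g for alpha > 1 and 0 < p < 1."""
    return 1.0 / (1.0 + ((1.0 - p) / p) ** (1.0 / (alpha - 1.0)))


def grid_argmin(p, alpha, n=20001):
    """Brute-force minimizer of g over an equispaced grid on [0, 1]."""
    grid = [i / (n - 1) for i in range(n)]
    return min(grid, key=lambda q: g(q, p, alpha))


# Compare the closed form with the grid search for a few (alpha, p) pairs.
for alpha in (1.5, 2.0, 3.0):
    for p in (0.1, 0.5, 0.8):
        print(alpha, p, q_star(p, alpha), grid_argmin(p, alpha))
```

For \(\alpha = 2\) the closed form reduces to \(q^* = P(x)\), so the sketch returns `p` itself in that case.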

It remains to be checked under which conditions the properization of \({S}_\alpha \) is not only proper but strictly proper. The representation (4) along with two Taylor expansions implies that \(P^*\) behaves like \(P^{1/(\alpha -1)}\) in the tails. This has two consequences. First, the above arguments show that for \({S}_\alpha (P^*, P)\) to be finite, \(x \mapsto g_{x,P} (P^*(x))\) has to be integrable with respect to Lebesgue measure. Hence, the tail behavior of \(P^*\) and the inequality \(\alpha /(\alpha - 1) > 1\) for \(\alpha > 1\) show that \({S}_\alpha (P^*, P)\) is finite for \(P \in \mathscr {P}_1\). Second, \(P^*\) has a lighter tail than P for \(\alpha \in (1,2)\) and a heavier tail for \(\alpha > 2\). In the latter case, \(P \in \mathscr {P}_1\) does not necessarily imply \(P^* \in \mathscr {P}_1\). Hence, without additional assumptions, strict propriety of the properized score (3) can only be ensured relative to \(\mathscr {P}_\mathrm {c}\) for \(\alpha > 2\) and relative to the class \(\mathscr {P}_1\) for \(\alpha \in (1, 2]\).

We now turn to \(\alpha \in (0,1)\). In this case, the function \(g_{x,P}\) is strictly concave, and its unique minimum is at \(q = 0\) for \(P(x) < \frac{1}{2}\) and at \(q = 1\) for \(P(x) > \frac{1}{2}\). If \(P(x) = \frac{1}{2}\), then both 0 and 1 are minimizers. Arguing as above, every Bayes act \(P^*\) is a Dirac measure at a median of P.

Finally, \(\alpha = 1\) implies that \(g_{x,P}\) is linear; thus, as for \(\alpha \in (0,1)\), every Dirac measure at a median of P is a Bayes act. The only difference from the case \(\alpha \in (0,1)\) is that if there is more than one median, there are Bayes acts other than Dirac measures, since \(g_{x,P}\) is constant for all x satisfying \(P(x) = \frac{1}{2}\).
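The behavior for \(\alpha \le 1\) can be illustrated in the same way. This is a sketch under stated assumptions: `grid_minimizers` is a hypothetical helper that returns all grid points attaining the minimum of \(g_{x,P}\) up to a small tolerance, and `p` again stands for \(P(x)\).

```python
def g(q, p, alpha):
    """Inner integral g_{x,P}(q), with p playing the role of P(x)."""
    return (1.0 - p) * q**alpha + p * (1.0 - q)**alpha


def grid_minimizers(p, alpha, n=2001, tol=1e-12):
    """All points of an equispaced grid on [0, 1] that minimize g up to tol."""
    grid = [i / (n - 1) for i in range(n)]
    values = [g(q, p, alpha) for q in grid]
    best = min(values)
    return [q for q, v in zip(grid, values) if v - best <= tol]


# Strictly concave case alpha = 0.5: the minimum sits at an endpoint.
print(grid_minimizers(0.3, 0.5))       # P(x) < 1/2
print(grid_minimizers(0.7, 0.5))       # P(x) > 1/2
print(grid_minimizers(0.5, 0.5))       # P(x) = 1/2

# Linear case alpha = 1 with P(x) = 1/2: g is constant on [0, 1].
print(len(grid_minimizers(0.5, 1.0)))
```

In the linear case with \(P(x) = \frac{1}{2}\), every grid point is returned, which mirrors the existence of Bayes acts other than Dirac measures.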

Details for Example 6

Let P, Q and \(\varPhi \) be distribution functions. By the definition of the convolution operator

$$\begin{aligned} \int \mathbb {1}\left( y \le x\right) \,\mathrm{d}(Q * \varPhi ) (y) = \int \varPhi (x-y) \,\mathrm{d}Q(y) \end{aligned}$$

holds for \(x \in \mathbb {R}\). Using this identity and Fubini’s theorem leads to

$$\begin{aligned} {S}_\varPhi (P,Q)&= \int \! \int \left( P(x)^2 - 2 P(x) \varPhi (x-y) + \varPhi (x-y)^2 \right) \,\mathrm{d}Q(y) \,\mathrm{d}x \\&= \int \! \int \left( P(x)^2 - 2 P(x) \mathbb {1}\left( y \le x\right) + \mathbb {1}\left( y \le x\right) \right) \,\mathrm{d}(Q * \varPhi )(y) \,\mathrm{d}x \\&\quad + \int \! \int \varPhi (x-y) (\varPhi (x-y) - 1) \,\mathrm{d}Q(y) \,\mathrm{d}x \\&= \int \! \int (P(x) - \mathbb {1}\left( y \le x\right) )^2 \,\mathrm{d}x \,\mathrm{d}(Q * \varPhi )(y) - \int \varPhi (x) (1- \varPhi (x)) \,\mathrm{d}x, \end{aligned}$$

which verifies equality in (5). Moreover, the strict propriety of the CRPS relative to the class \(\mathscr {P}_1\) gives \({S}_\varPhi (P, Q) < \infty \) for \(P, Q, \varPhi \in \mathscr {P}_1\), thereby demonstrating that the Bayes act is unique in this situation.
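The decomposition derived above can be checked numerically for a concrete choice of distributions. The sketch below is illustrative only and rests on several assumptions that are not in the original: P is a Gaussian distribution function with arbitrary parameters, Q is a two-point distribution with hypothetical atoms, \(\varPhi \) is the standard normal distribution function, and integrals over x are approximated by a Riemann sum on a truncated grid.

```python
import math


def norm_cdf(x, mu=0.0, sigma=1.0):
    """Gaussian distribution function via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def P(x):
    """Forecast distribution function (hypothetical choice)."""
    return norm_cdf(x, mu=0.3, sigma=1.5)


def Phi(x):
    """Error distribution function (assumed standard normal here)."""
    return norm_cdf(x)


# Discrete Q given by (location, weight) pairs; hypothetical values.
atoms = [(-1.0, 0.4), (2.0, 0.6)]


def conv_cdf(x):
    """Distribution function of Q * Phi, i.e. int Phi(x - y) dQ(y)."""
    return sum(w * Phi(x - y) for y, w in atoms)


def integrate(f, lo=-40.0, hi=40.0, n=16001):
    """Riemann sum on an equispaced grid; integrands vanish near lo and hi."""
    h = (hi - lo) / (n - 1)
    return h * sum(f(lo + i * h) for i in range(n))


# Left-hand side: int int (P(x) - Phi(x - y))^2 dQ(y) dx.
lhs = integrate(lambda x: sum(w * (P(x) - Phi(x - y)) ** 2 for y, w in atoms))

# Right-hand side: CRPS-type term under Q * Phi minus int Phi (1 - Phi) dx.
crps_term = integrate(lambda x: P(x) ** 2 - 2.0 * P(x) * conv_cdf(x) + conv_cdf(x))
correction = integrate(lambda x: Phi(x) * (1.0 - Phi(x)))
rhs = crps_term - correction

print(lhs, rhs)
```

For a standard normal \(\varPhi \), the correction term \(\int \varPhi (x) (1- \varPhi (x)) \,\mathrm{d}x\) equals \(1/\sqrt{\pi }\), which gives an independent check on the quadrature.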

Details for Example 7

For distributions \(P, Q \in \mathscr {P}\) and \(c > 0\), the Fubini–Tonelli theorem and the definition of the convolution operator give

$$\begin{aligned} {S}^\varphi (P,Q)&= \int \int \varphi (x-y) {S}(P,x) \,\mathrm{d}Q(y) \,\mathrm{d}x \\&= \int \int \varphi (x-y) \,\mathrm{d}Q(y) \, {S}(P,x) \,\mathrm{d}x = {S}(P, Q * \varPhi ), \end{aligned}$$

so the stated (unique) Bayes act under \({S}^\varphi \) follows from the (strict) propriety of \({S}\). Proceeding as in the details for Example 6, we verify identity (6).

For \(P \in \mathscr {L}\), the same calculations as above show that the probability score satisfies

$$\begin{aligned} \mathrm {PS}_c(P,Q) = 2c \int \frac{Q(x + c) - Q(x - c)}{2c} \, \mathrm {LinS}(P,x) \,\mathrm{d}x, \end{aligned}$$

where \(\mathrm {LinS}(P,y) = - p(y)\) is the linear score. Consequently, to demonstrate that Theorem 1 is neither applicable to \(\mathrm {PS}_c\) nor to \(\mathrm {LinS}\), it suffices to show that there is a distribution Q such that \(P \mapsto \mathrm {LinS}(P,Q)\) does not have a minimizer. We use an argument that generalizes the construction in Section 4.1 of Gneiting and Raftery (2007) who show that \(\mathrm {LinS}\) is improper. Let q be a density, symmetric around zero and strictly increasing on \((-\infty , 0)\). Let \(\epsilon > 0\) and define the interval \(I_k := ((2k - 1) \epsilon , (2k + 1) \epsilon ]\) for \(k \in \mathbb {Z}\). Suppose p is a density with positive mass on some interval \(I_k\) for \(k \ne 0\). Due to the properties of q, the score \(\mathrm {LinS}(P,Q)\) can be reduced by substituting the density defined by

$$\begin{aligned} {\tilde{p}}(x) := p(x) - \mathbb {1}\left( x \in I_k\right) \, p(x) + \mathbb {1}\left( x + 2k \epsilon \in I_k\right) \, p(x + 2k \epsilon ) \end{aligned}$$

for p, i.e., by shifting the entire probability mass from \(I_k\) to the modal interval \(I_0\). Repeating this argument for any \(\epsilon > 0\) shows that no density p can be a minimizer of the expected score \(\mathrm {LinS}(P,Q)\). Note that the assumptions on q are stronger than necessary in order to facilitate the argument. They can be relaxed at the cost of a more elaborate proof.
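The absence of a minimizer can be made concrete in a special case. The sketch below assumes that q is the standard normal density, which is symmetric around zero and strictly increasing on \((-\infty , 0)\), and that p is the uniform density on \((-\epsilon , \epsilon ]\); both choices and the function name are hypothetical. The expected linear score is then \(-\int p(x) q(x) \,\mathrm{d}x = -(Q(\epsilon ) - Q(-\epsilon )) / (2 \epsilon )\), which has a closed form in terms of the error function.

```python
import math


def expected_lin_score_uniform(eps):
    """Expected linear score -int p q dx for p uniform on (-eps, eps]
    and q the standard normal density (both are illustrative choices)."""
    return -math.erf(eps / math.sqrt(2.0)) / (2.0 * eps)


# -q(0): a lower bound that no density p attains, since q < q(0) off zero.
lower_bound = -1.0 / math.sqrt(2.0 * math.pi)

widths = [2.0, 1.0, 0.5, 0.1, 0.01]
scores = [expected_lin_score_uniform(eps) for eps in widths]

for eps, score in zip(widths, scores):
    print(eps, score, score - lower_bound)
```

Shrinking \(\epsilon \) keeps lowering the expected score toward \(-q(0)\) without reaching it, in line with the argument that any candidate density can be improved by concentrating its mass further around the mode of q.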

Details for Example 9

For any probability distribution P and \(x \in \mathbb {R}\), we obtain

$$\begin{aligned} s(x,P) = \int \frac{\vert x - y \vert }{\vert x \vert + \vert y \vert } \mathbb {1}\left( x \ne y\right) \,\mathrm{d}P(y) , \end{aligned}$$

which immediately gives \(s(0,P) = P(\mathbb {R}\backslash \lbrace 0 \rbrace )\). This representation together with the dominated convergence theorem implies the continuity of \(x \mapsto s(x,P)\) in \(\mathbb {R}\backslash \lbrace 0 \rbrace \) as well as the limits given in (7).
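For a discrete distribution P, the integral above becomes a finite sum and the stated value at zero is easy to verify. The following sketch uses a hypothetical three-point distribution with an atom at zero; the function name `s` follows the notation above, while the atoms and weights are assumptions made for illustration.

```python
def s(x, atoms):
    """Expected score s(x, P) for a discrete P given as (y, weight) pairs.
    Terms with y == x are skipped, matching the indicator 1(x != y)."""
    return sum(w * abs(x - y) / (abs(x) + abs(y)) for y, w in atoms if y != x)


# Hypothetical distribution with mass 0.2 at zero.
atoms = [(0.0, 0.2), (1.0, 0.5), (-3.0, 0.3)]

mass_off_zero = sum(w for y, w in atoms if y != 0.0)

print(s(0.0, atoms), mass_off_zero)   # s(0, P) versus P(R \ {0})
print(s(1e-9, atoms))                 # value just to the right of zero
print(s(1e9, atoms))                  # value far out in the right tail
```

In this example the value just next to zero differs from \(s(0,P)\) because P has an atom at zero, which is consistent with continuity being claimed only on \(\mathbb {R}\backslash \lbrace 0 \rbrace \). Far in the tail each ratio in the sum is close to 1, so the value is close to the total mass.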

Proof of Theorem 2

Let \((a_n)_{n \in \mathbb {N}} \subset \mathscr {A}\) be a sequence with \( a := \lim _{n \rightarrow \infty } a_n\). Since s is lower semicontinuous in its first component and uniformly bounded from below by g, Fatou’s lemma gives

$$\begin{aligned} \liminf _{n \rightarrow \infty } \int s(a_n, \omega ) \,\mathrm{d}P(\omega ) \ge \int \liminf _{n \rightarrow \infty } s(a_n,\omega ) \,\mathrm{d}P(\omega ) \ge s(a,P) \end{aligned}$$

for any \(P \in \mathscr {P}\). Hence, \(a \mapsto s(a, P)\) is a lower semicontinuous function for any \(P \in \mathscr {P}\) and due to the assumed compactness of \(\mathscr {A}\), the result now follows from Theorem 2.43 in Aliprantis and Border (2006).\(\square \)

Proof of Theorem 3

The same arguments as in the proof of Theorem 2 show that \(a \mapsto s(a, P)\) is a weakly lower semicontinuous function for any \(P \in \mathscr {P}\). If \(P \in \mathscr {P}\) is such that this function is also coercive, we conclude by proceeding as in the proof of Satz III.5.8 in Werner (2018): In case \(\inf _{a \in \mathscr {A}} s(a, P) = \infty \), there is nothing to prove. Otherwise, if \((a_n)_{n \in \mathbb {N}} \subset \mathscr {A}\) is a sequence such that \(\lim _{n \rightarrow \infty } s(a_n, P) = \inf _{a \in \mathscr {A}} s(a, P)\) holds, the coercivity of \(a \mapsto s(a,P)\) implies that this sequence is bounded. Together with the assumption that \(\mathscr {A}\) is a subset of a reflexive Banach space, we obtain a subsequence \((a_{n_k})_{k \in \mathbb {N}}\) of \((a_n)_{n \in \mathbb {N}}\) which weakly converges to some element \(a^*\); see, e.g., Theorem 6.25 in Aliprantis and Border (2006). Since \(\mathscr {A}\) is weakly closed, it contains \(a^*\) and weak lower semicontinuity gives \(s(a^*,P) \le \lim _{k \rightarrow \infty } s(a_{n_k}, P) = \inf _{a \in \mathscr {A}} s(a, P)\), concluding the proof.\(\square \)

About this article

Cite this article

Brehmer, J.R., Gneiting, T. Properization: constructing proper scoring rules via Bayes acts. Ann Inst Stat Math 72, 659–673 (2020). https://doi.org/10.1007/s10463-019-00705-7

Keywords

  • Bayes act
  • Consistent scoring function
  • Forecast evaluation
  • Misclassification error
  • Proper scoring rule