Abstract
Scoring rules serve to quantify predictive performance. A scoring rule is proper if truth telling is an optimal strategy in expectation. Subject to customary regularity conditions, every scoring rule can be made proper, by applying a special case of the Bayes act construction studied by Grünwald and Dawid (Ann Stat 32:1367–1433, 2004) and Dawid (Ann Inst Stat Math 59:77–93, 2007), to which we refer as properization. We discuss examples from the recent literature and apply the construction to create new types, and reinterpret existing forms, of proper scoring rules and consistent scoring functions. In an abstract setting, we formulate sufficient conditions under which Bayes acts exist and scoring rules can be made proper.
Notes
As noted by Parry (2016), the improper score \({S}_1\) shares its (concave) expected score function \(P \mapsto {S}_1(P,P)\) with the proper Brier score. This illustrates the importance of the second condition in Theorem 1 of Gneiting and Raftery (2007): For a scoring rule \({S}\), the (strict) concavity of the expected score function \( G(P) := {S}(P,P)\) is equivalent to the (strict) propriety of \({S}\) only if, furthermore, \(- {S}(P,\cdot )\) is a subtangent of \(- G\) at P.
References
Aliprantis, C. D., Border, K. C. (2006). Infinite dimensional analysis (3rd ed.). Berlin: Springer.
Christensen, H. M., Moroz, I. M., Palmer, T. N. (2014). Evaluation of ensemble forecast uncertainty using a new proper score: Application to medium-range and seasonal forecasts. Quarterly Journal of the Royal Meteorological Society, 141, 538–549.
Dawid, A. P. (1986). Probability forecasting. In S. Kotz, N. L. Johnson, C. B. Read (Eds.), Encyclopedia of statistical sciences, Vol. 7, pp. 210–218. New York: Wiley.
Dawid, A. P. (2007). The geometry of proper scoring rules. Annals of the Institute of Statistical Mathematics, 59, 77–93.
Dawid, A. P., Musio, M. (2014). Theory and applications of proper scoring rules. Metron, 72, 169–183.
Diks, C., Panchenko, V., van Dijk, D. (2011). Likelihood-based scoring rules for comparing density forecasts in tails. Journal of Econometrics, 163, 215–230.
Ebert, E., Brown, B., Göber, M., Haiden, T., Mittermaier, M., Nurmi, P., Wilson, L., Jackson, S., Johnston, P., Schuster, D. (2018). The WMO challenge to develop and demonstrate the best new user-oriented forecast verification metric. Meteorologische Zeitschrift, 27, 435–440.
Ebert, E., Wilson, L., Weigel, A., Mittermaier, M., Nurmi, P., Gill, P., Göber, M., Joslyn, S., Brown, B., Fowler, T., Watkins, A. (2013). Progress and challenges in forecast verification. Meteorological Applications, 20, 130–139.
Ehm, W., Gneiting, T., Jordan, A., Krüger, F. (2016). Of quantiles and expectiles: Consistent scoring functions, Choquet representations and forecast rankings. Journal of the Royal Statistical Society Series B. Statistical Methodology, 78, 505–562.
Ferguson, T. S. (1967). Mathematical statistics: A decision theoretic approach. Probability and mathematical statistics, Vol. 1. New York: Academic Press.
Ferri, C., Hernández-Orallo, J., Modroiu, R. (2009). An experimental comparison of performance measures for classification. Pattern Recognition Letters, 30, 27–38.
Ferro, C. A. T. (2017). Measuring forecast performance in the presence of observation error. Quarterly Journal of the Royal Meteorological Society, 143, 2665–2676.
Fissler, T., Ziegel, J. F. (2016). Higher order elicitability and Osband’s principle. The Annals of Statistics, 44, 1680–1707.
Friederichs, P., Thorarinsdottir, T. L. (2012). Forecast verification for extreme value distributions with an application to probabilistic peak wind prediction. Environmetrics, 23, 579–594.
Gelfand, A. E., Ghosh, S. K. (1998). Model choice: A minimum posterior predictive loss approach. Biometrika, 85, 1–11.
Gelman, A., Hwang, J., Vehtari, A. (2014). Understanding predictive information criteria for Bayesian models. Statistics and Computing, 24, 997–1016.
Gneiting, T. (2011). Making and evaluating point forecasts. Journal of the American Statistical Association, 106, 746–762.
Gneiting, T., Katzfuss, M. (2014). Probabilistic forecasting. Annual Review of Statistics and Its Application, 1, 125–151.
Gneiting, T., Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102, 359–378.
Gneiting, T., Ranjan, R. (2011). Comparing density forecasts using threshold- and quantile-weighted scoring rules. Journal of Business & Economic Statistics, 29, 411–422.
Granger, C. W., Machina, M. J. (2006). Forecasting and decision theory. In G. Elliott, C. Granger, A. Timmermann (Eds.), Handbook of economic forecasting, Vol. 1, pp. 81–98. Amsterdam: Elsevier.
Granger, C. W. J., Pesaran, M. H. (2000). Economic and statistical measures of forecast accuracy. Journal of Forecasting, 19, 537–560.
Grünwald, P. D., Dawid, A. P. (2004). Game theory, maximum entropy, minimum discrepancy and robust Bayesian decision theory. The Annals of Statistics, 32, 1367–1433.
Harrell, F. E., Jr. (2015). Regression modeling strategies (2nd ed.). Springer series in statistics. Cham: Springer.
Holzmann, H., Klar, B. (2017). Focusing on regions of interest in forecast evaluation. The Annals of Applied Statistics, 11, 2404–2431.
Laud, P. W., Ibrahim, J. G. (1995). Predictive model selection. Journal of the Royal Statistical Society Series B. Methodological, 57, 247–262.
M4 Team. (2018). M4 competitor’s guide: Prizes and rules. Available online at https://www.m4.unic.ac.cy/wp-content/uploads/2018/03/M4-Competitors-Guide.pdf. Accessed 13 Dec 2018.
Makridakis, S., Spiliotis, E., Assimakopoulos, V. (2018). The M4 competition: Results, findings, conclusion and way forward. International Journal of Forecasting, 34, 802–808.
Müller, W. A., Appenzeller, C., Doblas-Reyes, F. J., Liniger, M. A. (2005). A debiased ranked probability skill score to evaluate probabilistic ensemble forecasts with small ensemble sizes. Journal of Climate, 18, 1513–1523.
Parry, M. (2016). Linear scoring rules for probabilistic binary classification. Electronic Journal of Statistics, 10, 1596–1607.
Reid, M. D., Williamson, R. C. (2010). Composite binary losses. Journal of Machine Learning Research, 11, 2387–2422.
van Erven, T., Reid, M. D., Williamson, R. C. (2012). Mixability is Bayes risk curvature relative to log loss. Journal of Machine Learning Research, 13, 1639–1663.
Werner, D. (2018). Funktionalanalysis (8th ed.). Berlin: Springer.
Williamson, R. C., Vernet, E., Reid, M. D. (2016). Composite multiclass losses. Journal of Machine Learning Research, 17, 1–52.
Wilson, L. J., Burrows, W. R., Lanzinger, A. (1999). A strategy for verification of weather element forecasts from an ensemble prediction system. Monthly Weather Review, 127, 956–970.
Zamo, M., Naveau, P. (2018). Estimation of the continuous ranked probability score with limited information and applications to ensemble weather forecasts. Mathematical Geosciences, 50, 209–234.
Acknowledgements
Tilmann Gneiting is grateful for funding by the Klaus Tschira Foundation and by the European Union Seventh Framework Programme under grant agreement 290976. Part of his research leading to these results has been done within subproject C7 “Statistical postprocessing and stochastic physics for ensemble predictions” of the Transregional Collaborative Research Center SFB / TRR 165 “Waves to Weather” (www.wavestoweather.de) funded by the German Research Foundation (DFG). Jonas Brehmer gratefully acknowledges support by DFG through Research Training Group RTG 1953. We thank Tobias Fissler, Rafael Frongillo, Alexander Jordan, and Matthew Parry for instructive discussions, and we are grateful to the editor and two anonymous referees for thoughtful comments and suggestions.
Appendix: Proofs
Here, we present detailed arguments for the technical claims in Examples 5, 6, 7, and 9 as well as the proofs of Theorems 2 and 3.
1.1 Details for Example 5
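We fix some distribution P and start with the case \(\alpha > 1\). An application of Fubini’s theorem gives
\[ {S}_\alpha (Q, P) = \int _{\mathbb {R}} \int _{\mathbb {R}} \left| Q(x) - \mathbb {1}\left( y \le x\right) \right| ^\alpha \, \mathrm {d}P(y) \, \mathrm {d}x. \qquad (8) \]
Given \(x \in \mathbb {R}\), we seek the value \(Q(x) \in [0,1]\) that minimizes the inner integral in (8). If x is such that \(P(x) \in \lbrace 0, 1 \rbrace \), the equality \(\mathbb {1}\left( y \le x\right) = P(x)\) holds for P-almost all y, hence \(Q(x)= P(x)\) is the unique minimizer. If x satisfies \(P(x) \in (0,1)\), define the function
\[ g_{x,P}(q) := (1 - P(x)) \, q^\alpha + P(x) \, (1 - q)^\alpha , \]
which is strictly convex in \(q \in (0,1)\) with derivative
\[ g_{x,P}'(q) = \alpha (1 - P(x)) \, q^{\alpha - 1} - \alpha P(x) \, (1 - q)^{\alpha - 1} \]
and a unique minimum at \(q = q_{x,P}^* \in (0,1)\). As a consequence, the minimizing value Q(x) is given by
\[ Q(x) = \begin{cases} P(x), & P(x) \in \lbrace 0, 1 \rbrace , \\ q_{x,P}^* = \left( 1 + \left( \frac{1 - P(x)}{P(x)} \right) ^{1/(\alpha - 1)} \right) ^{-1}, & P(x) \in (0,1). \end{cases} \]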
The function Q defined by the minimizers Q(x), \(x \in \mathbb {R}\), is a minimizer of \({S}_\alpha ( \cdot ,P)\), and if \({S}_\alpha (Q, P)\) is finite, it is unique Lebesgue almost everywhere. Since \(\alpha >1\), the function Q has the properties of a distribution function, and hence, \(P^*\) defined by (4) is a Bayes act for P. Moreover, Eq. (4) shows that the relation between P and \(P^*\) is one-to-one.
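The threshold-wise minimization above can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes that the inner expected score at a threshold x has the form \(g(q) = (1-p) q^\alpha + p (1-q)^\alpha \) with \(p = P(x)\), and that the zero of its derivative is \(q^* = (1 + ((1-p)/p)^{1/(\alpha -1)})^{-1}\). The function names are hypothetical.

```python
def g(q, p, alpha):
    """Assumed inner expected score at a threshold x, where p = P(x)."""
    return (1.0 - p) * q**alpha + p * (1.0 - q) ** alpha


def closed_form_minimizer(p, alpha):
    """Zero of the derivative of g in q, valid for alpha > 1 and 0 < p < 1."""
    return 1.0 / (1.0 + ((1.0 - p) / p) ** (1.0 / (alpha - 1.0)))


def grid_minimizer(p, alpha, n=20_001):
    """Brute-force minimizer of g over an evenly spaced grid on [0, 1]."""
    grid = [i / (n - 1) for i in range(n)]
    return min(grid, key=lambda q: g(q, p, alpha))


if __name__ == "__main__":
    for alpha in (1.5, 2.0, 3.0):
        for p in (0.1, 0.3, 0.7):
            print(alpha, p, closed_form_minimizer(p, alpha), grid_minimizer(p, alpha))
```

For \(\alpha = 2\), the closed form reduces to \(q^* = p\), in line with the propriety of the CRPS. For other \(\alpha > 1\), the minimizer is a strictly increasing transformation of p, which is why the map from P to \(P^*\) is one-to-one.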
It remains to check under which conditions the properization of \({S}_\alpha \) is not only proper but strictly proper. The representation (4) along with two Taylor expansions implies that \(P^*\) behaves like \(P^{1/(\alpha -1)}\) in the tails. This has two consequences. First, the above arguments show that for \({S}_\alpha (P^*, P)\) to be finite, \(x \mapsto g_{x,P} (P^*(x))\) has to be integrable with respect to Lebesgue measure. Hence, the tail behavior of \(P^*\) and the inequality \(\alpha /(\alpha - 1) > 1\) for \(\alpha > 1\) show that \({S}_\alpha (P^*, P)\) is finite for \(P \in \mathscr {P}_1\). Second, \(P^*\) has a lighter tail than P for \(\alpha \in (1,2)\) and a heavier tail for \(\alpha > 2\). In the latter case, \(P \in \mathscr {P}_1\) does not necessarily imply \(P^* \in \mathscr {P}_1\). Hence, without additional assumptions, strict propriety of the properized score (3) can only be ensured relative to \(\mathscr {P}_\mathrm {c}\) for \(\alpha > 2\) and relative to the class \(\mathscr {P}_1\) for \(\alpha \in (1, 2]\).
We now turn to \(\alpha \in (0,1)\). In this case, the function \(g_{x,P}\) is strictly concave on [0, 1], and its unique minimizer is \(q = 0\) for \(P(x) < \frac{1}{2}\) and \(q = 1\) for \(P(x) > \frac{1}{2}\). If \(P(x) = \frac{1}{2}\), then both 0 and 1 are minimizers. Arguing as above, every Bayes act \(P^*\) is a Dirac measure in a median of P.
Finally, \(\alpha = 1\) implies that \(g_{x,P}\) is linear; thus, as for \(\alpha \in (0,1)\), every Dirac measure in a median of P is a Bayes act. The only difference from the case \(\alpha \in (0,1)\) is that if there is more than one median, there are Bayes acts other than Dirac measures, since \(g_{x,P}\) is constant for all x satisfying \(P(x) = \frac{1}{2}\).
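The endpoint behavior for \(\alpha \le 1\) can be illustrated with a small grid search. This is a sketch under the same assumption as before, namely that the threshold-wise expected score is \(g(q) = (1-p) q^\alpha + p (1-q)^\alpha \) with \(p = P(x)\). The helper name `argmin_set` is hypothetical.

```python
def g(q, p, alpha):
    """Assumed inner expected score at a threshold x, where p = P(x)."""
    return (1.0 - p) * q**alpha + p * (1.0 - q) ** alpha


def argmin_set(p, alpha, n=1001, tol=1e-12):
    """Grid points in [0, 1] at which g attains its grid minimum (up to tol)."""
    grid = [i / (n - 1) for i in range(n)]
    values = [g(q, p, alpha) for q in grid]
    best = min(values)
    return [q for q, v in zip(grid, values) if v - best <= tol]


if __name__ == "__main__":
    print(argmin_set(0.3, 0.5))       # concave case, P(x) < 1/2
    print(argmin_set(0.7, 0.5))       # concave case, P(x) > 1/2
    print(argmin_set(0.5, 0.5))       # concave case, P(x) = 1/2
    print(len(argmin_set(0.5, 1.0)))  # linear case, P(x) = 1/2
```

Choosing \(Q(x) = 0\) where \(P(x) < \frac{1}{2}\) and \(Q(x) = 1\) where \(P(x) > \frac{1}{2}\) yields the distribution function of a point mass at a median. In the linear case with \(P(x) = \frac{1}{2}\), every grid point is a minimizer, which is the source of the additional Bayes acts when the median is not unique.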
1.2 Details for Example 6
Let P, Q and \(\varPhi \) be distribution functions. By the definition of the convolution operator
holds for \(x \in \mathbb {R}\). Using this identity and Fubini’s theorem leads to
which verifies equality in (5). Moreover, the strict propriety of the CRPS relative to the class \(\mathscr {P}_1\) gives \({S}_\varPhi (P, Q) < \infty \) for \(P, Q, \varPhi \in \mathscr {P}_1\), thereby demonstrating that the Bayes act is unique in this situation.
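As a numerical illustration of the convolution of distribution functions used here, the sketch below evaluates \((P * \varPhi )(x) = \int \varPhi (x - y) \, \mathrm {d}P(y)\), the standard definition, by a midpoint rule. The choice of centered normal distributions for P and \(\varPhi \) is an assumption made only for this example, since the convolution of two centered normals is again normal with the variances added, which provides a closed form to compare against.

```python
import math


def normal_cdf(x, sd=1.0):
    """Distribution function of a centered normal with standard deviation sd."""
    return 0.5 * (1.0 + math.erf(x / (sd * math.sqrt(2.0))))


def normal_pdf(y, sd=1.0):
    """Density of a centered normal with standard deviation sd."""
    return math.exp(-0.5 * (y / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def convolved_cdf(x, sd_p, sd_phi, half_width=12.0, n=20_000):
    """Midpoint-rule approximation of (P * Phi)(x) = int Phi(x - y) dP(y)."""
    lo = -half_width * sd_p
    h = 2.0 * half_width * sd_p / n
    total = 0.0
    for i in range(n):
        y = lo + (i + 0.5) * h
        total += normal_cdf(x - y, sd_phi) * normal_pdf(y, sd_p) * h
    return total


if __name__ == "__main__":
    sd_sum = math.sqrt(1.0**2 + 0.5**2)
    for x in (-1.0, 0.0, 0.5, 2.0):
        print(x, convolved_cdf(x, 1.0, 0.5), normal_cdf(x, sd_sum))
```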
1.3 Details for Example 7
For distributions \(P, Q \in \mathscr {P}\) and \(c > 0\), the Fubini–Tonelli theorem and the definition of the convolution operator give
so the stated (unique) Bayes act under \({S}^\varphi \) follows from the (strict) propriety of \({S}\). Proceeding as in the details for Example 6, we verify identity (6).
For \(P \in \mathscr {L}\), the same calculations as above show that the probability score satisfies
where \(\mathrm {LinS}(P,y) = - p(y)\) is the linear score. Consequently, to demonstrate that Theorem 1 is neither applicable to \(\mathrm {PS}_c\) nor to \(\mathrm {LinS}\), it suffices to show that there is a distribution Q such that \(P \mapsto \mathrm {LinS}(P,Q)\) does not have a minimizer. We use an argument that generalizes the construction in Section 4.1 of Gneiting and Raftery (2007) who show that \(\mathrm {LinS}\) is improper. Let q be a density, symmetric around zero and strictly increasing on \((-\infty , 0)\). Let \(\epsilon > 0\) and define the interval \(I_k := ((2k - 1) \epsilon , (2k + 1) \epsilon ]\) for \(k \in \mathbb {Z}\). Suppose p is a density with positive mass on some interval \(I_k\) for \(k \ne 0\). Due to the properties of q, the score \(\mathrm {LinS}(P,Q)\) can be reduced by substituting the density defined by
for p, i.e., by shifting the entire probability mass from \(I_k\) to the modal interval \(I_0\). Repeating this argument for any \(\epsilon > 0\) shows that no density p can be a minimizer of the expected score \(\mathrm {LinS}(P,Q)\). Note that the assumptions on q are stronger than necessary in order to facilitate the argument. They can be relaxed at the cost of a more elaborate proof.
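The non-existence of a minimizer can be made concrete with a simple family of forecasts. The sketch below is an illustration rather than the construction used in the argument above: it assumes that Q is standard normal and that P is uniform on \([-\epsilon , \epsilon ]\), so that \(\mathrm {LinS}(P,Q) = - \int p(y) q(y) \, \mathrm {d}y\) equals minus the Q-probability of the interval divided by its length. Concentrating P ever more tightly around the mode lowers the expected score toward \(-q(0)\), a bound that no density attains because \(q < q(0)\) away from zero.

```python
import math


def lin_score_uniform(eps):
    """Expected linear score -int p(y) q(y) dy for P uniform on [-eps, eps]
    and Q standard normal (both choices are assumptions of this sketch)."""
    mass = math.erf(eps / math.sqrt(2.0))  # Q-probability of [-eps, eps]
    return -mass / (2.0 * eps)


# Minus the standard normal density at its mode; a lower bound on LinS(P, Q)
# over all densities p, since int p(y) q(y) dy <= sup q = q(0).
LOWER_BOUND = -1.0 / math.sqrt(2.0 * math.pi)


if __name__ == "__main__":
    for eps in (1.0, 0.5, 0.1, 0.01):
        print(eps, lin_score_uniform(eps), LOWER_BOUND)
```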
1.4 Details for Example 9
For any probability distribution P and \(x \in \mathbb {R}\), we obtain
which immediately gives \(s(0,P) = P(\mathbb {R}\backslash \lbrace 0 \rbrace )\). This representation together with the dominated convergence theorem imply the continuity of \(x \mapsto s(x,P)\) in \(\mathbb {R}\backslash \lbrace 0 \rbrace \) as well as the limits given in (7).
1.5 Proof of Theorem 2
Let \((a_n)_{n \in \mathbb {N}} \subset \mathscr {A}\) be a sequence with \( a := \lim _{n \rightarrow \infty } a_n\). Since s is lower semicontinuous in its first component and uniformly bounded from below by g, Fatou’s lemma gives
\[ \liminf _{n \rightarrow \infty } s(a_n, P) = \liminf _{n \rightarrow \infty } \int s(a_n, y) \, \mathrm {d}P(y) \ge \int \liminf _{n \rightarrow \infty } s(a_n, y) \, \mathrm {d}P(y) \ge \int s(a, y) \, \mathrm {d}P(y) = s(a, P) \]
for any \(P \in \mathscr {P}\). Hence, \(a \mapsto s(a, P)\) is a lower semicontinuous function for any \(P \in \mathscr {P}\), and due to the assumed compactness of \(\mathscr {A}\), the result now follows from Theorem 2.43 in Aliprantis and Border (2006).\(\square \)
1.6 Proof of Theorem 3
The same arguments as in the proof of Theorem 2 show that \(a \mapsto s(a, P)\) is a weakly lower semicontinuous function for any \(P \in \mathscr {P}\). If \(P \in \mathscr {P}\) is such that this function is also coercive, we conclude by proceeding as in the proof of Satz III.5.8 in Werner (2018): In case \(\inf _{a \in \mathscr {A}} s(a, P) = \infty \), there is nothing to prove. Otherwise, if \((a_n)_{n \in \mathbb {N}} \subset \mathscr {A}\) is a sequence such that \(\lim _{n \rightarrow \infty } s(a_n, P) = \inf _{a \in \mathscr {A}} s(a, P)\) holds, the coercivity of \(a \mapsto s(a,P)\) implies that this sequence is bounded. Together with the assumption that \(\mathscr {A}\) is a subset of a reflexive Banach space, we obtain a subsequence \((a_{n_k})_{k \in \mathbb {N}}\) of \((a_n)_{n \in \mathbb {N}}\) which weakly converges to some element \(a^*\); see, e.g., Theorem 6.25 in Aliprantis and Border (2006). Since \(\mathscr {A}\) is weakly closed, it contains \(a^*\) and weak lower semicontinuity gives \(s(a^*,P) \le \lim _{k \rightarrow \infty } s(a_{n_k}, P) = \inf _{a \in \mathscr {A}} s(a, P)\), concluding the proof.\(\square \)
About this article
Cite this article
Brehmer, J.R., Gneiting, T. Properization: constructing proper scoring rules via Bayes acts. Ann Inst Stat Math 72, 659–673 (2020). https://doi.org/10.1007/s10463-019-00705-7