Scoring rules and the evaluation of probabilities

Winkler, R. L.; Muñoz, Javier; Cervera, José L.; Bernardo, José M.; Blattenberger, Gail; Kadane, Joseph B.; Lindley, Dennis V.; Murphy, Allan H.; Oliver, Robert M; Ríos-Insua, David

doi:10.1007/BF02562681

Published: June 1996

Volume 5, pages 1–60 (1996)

Summary

In Bayesian inference and decision analysis, inferences and predictions are inherently probabilistic in nature. Scoring rules, which involve the computation of a score based on probability forecasts and what actually occurs, can be used to evaluate probabilities and to provide appropriate incentives for “good” probabilities. This paper reviews scoring rules and some related measures for evaluating probabilities, including decompositions of scoring rules and attributes of “goodness” of probabilities, comparability of scores, and the design of scoring rules for specific inferential and decision-making problems.
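
As a minimal illustration of what "a score based on probability forecasts and what actually occurs" can look like, the sketch below evaluates two commonly used rules for a binary event, a quadratic (Brier-type) penalty and a logarithmic penalty, and checks numerically that the expected penalty is smallest when the reported probability equals the forecaster's actual belief. The summary does not specify these rules or any notation; the function names, the penalty orientation (lower is better), and the grid search are assumptions made purely for illustration.

```python
import math


def quadratic_score(forecast, outcome):
    """Quadratic (Brier-type) penalty for a binary event; lower is better.

    `forecast` is the stated probability of the event, `outcome` is 1 if it
    occurred and 0 otherwise. (Illustrative helper, not from the paper.)
    """
    return (forecast - outcome) ** 2


def logarithmic_score(forecast, outcome):
    """Logarithmic penalty: minus the log of the probability given to what occurred."""
    p_observed = forecast if outcome == 1 else 1.0 - forecast
    return -math.log(p_observed)


def expected_score(rule, report, belief):
    """Expected penalty of stating `report` when the event has probability `belief`."""
    return belief * rule(report, 1) + (1.0 - belief) * rule(report, 0)


def best_report(rule, belief, grid_size=99):
    """Grid search over reports 0.01, 0.02, ..., 0.99 for the lowest expected penalty."""
    grid = [i / (grid_size + 1) for i in range(1, grid_size + 1)]
    return min(grid, key=lambda report: expected_score(rule, report, belief))


if __name__ == "__main__":
    belief = 0.7  # hypothetical forecaster belief
    for rule in (quadratic_score, logarithmic_score):
        print(rule.__name__, "best report:", best_report(rule, belief))
```

Under these assumptions, both rules reward honest reporting in expectation, which is one way to read the summary's phrase about providing appropriate incentives for "good" probabilities.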

Author information

Authors and Affiliations

Fuqua School of Business and Institute of Statistics and Decision Sciences, Duke University, 27708-0120, Durham, NC, USA
R. L. Winkler
Generalitat Valenciana, Spain
Javier Muñoz (Presidència)
Instituto Nacional de Estadística, Spain
José L. Cervera
Universitat de València, Spain
José M. Bernardo
University of Utah, USA
Gail Blattenberger
Carnegie Mellon University, USA
Joseph B. Kadane
Minehead, UK
Dennis V. Lindley
Prediction and Evaluation Systems, USA
Allan H. Murphy
University of California at Berkeley, USA
Robert M Oliver
Universidad Politécnica de Madrid, Madrid, Spain
David Ríos-Insua

Additional information

Read before the Spanish Statistical Society at a meeting organized by the Universitat de València on Tuesday, April 23, 1996.

About this article

Cite this article

Winkler, R.L., Muñoz, J., Cervera, J.L. et al. Scoring rules and the evaluation of probabilities. Test 5, 1–60 (1996). https://doi.org/10.1007/BF02562681

Received: 15 March 1996
Accepted: 15 April 1996
Issue Date: June 1996
DOI: https://doi.org/10.1007/BF02562681