Volume 5, Issue 1, pp 1–60

Scoring rules and the evaluation of probabilities

  • R. L. Winkler
  • Javier Muñoz
  • José L. Cervera
  • José M. Bernardo
  • Gail Blattenberger
  • Joseph B. Kadane
  • Dennis V. Lindley
  • Allan H. Murphy
  • Robert M Oliver
  • David Ríos-Insua


In Bayesian inference and decision analysis, inferences and predictions are inherently probabilistic. Scoring rules, which compute a score from probability forecasts and what actually occurs, can be used to evaluate probabilities and to provide appropriate incentives for “good” probabilities. This paper reviews scoring rules and some related measures for evaluating probabilities, including decompositions of scoring rules and attributes of “goodness” of probabilities, comparability of scores, and the design of scoring rules for specific inferential and decision-making problems.
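As a minimal illustration of how a scoring rule turns a probability forecast and an observed outcome into a score, the sketch below uses two standard rules for a binary event: the quadratic (Brier-type) score and the logarithmic score. The abstract does not state which rules or sign conventions the paper adopts, so the penalty orientation (lower is better), the function names, and the grid search are all assumptions made for illustration. The final loop checks the incentive property mentioned above: under either rule, a forecaster whose actual probability is 0.7 minimizes the expected penalty by reporting 0.7.

```python
import math


def quadratic_score(forecast: float, outcome: int) -> float:
    """Quadratic (Brier-type) penalty for a binary event; lower is better.

    Assumed convention: outcome is 1 if the event occurred, else 0.
    """
    return (forecast - outcome) ** 2


def logarithmic_score(forecast: float, outcome: int) -> float:
    """Logarithmic penalty: minus the log of the probability given to what occurred."""
    p_observed = forecast if outcome == 1 else 1.0 - forecast
    return -math.log(p_observed)


def expected_score(rule, report: float, belief: float) -> float:
    """Expected penalty of reporting `report` when the actual probability is `belief`."""
    return belief * rule(report, 1) + (1.0 - belief) * rule(report, 0)


def best_report(rule, belief: float) -> float:
    """Grid search over reports 0.01, 0.02, ..., 0.99 for the lowest expected penalty."""
    grid = [i / 100 for i in range(1, 100)]
    return min(grid, key=lambda r: expected_score(rule, r, belief))


if __name__ == "__main__":
    belief = 0.7  # hypothetical forecaster's actual probability
    for rule in (quadratic_score, logarithmic_score):
        # For both rules the minimizing report coincides with the belief.
        print(rule.__name__, best_report(rule, belief))
```

Rules with this property are usually called proper. A rule such as the absolute error |forecast − outcome| would not pass the same check, since it rewards pushing the report toward 0 or 1.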


Attributes of “good” probabilities; decomposition of expected scores; evaluation of probabilities; probability assessment; probability forecasts; scoring rules





Copyright information

© SEIO 1996

Authors and Affiliations

  • R. L. Winkler (1)
  • Javier Muñoz (2)
  • José L. Cervera (3)
  • José M. Bernardo (4)
  • Gail Blattenberger (5)
  • Joseph B. Kadane (6)
  • Dennis V. Lindley (7)
  • Allan H. Murphy (8)
  • Robert M Oliver (9)
  • David Ríos-Insua (10)

  1. Fuqua School of Business and Institute of Statistics and Decision Sciences, Duke University, Durham, USA
  2. Generalitat Valenciana, Spain
  3. Instituto Nacional de Estadística, Spain
  4. Universitat de València, Spain
  5. University of Utah, USA
  6. Carnegie Mellon University, USA
  7. Minehead, UK
  8. Prediction and Evaluation Systems, USA
  9. University of California at Berkeley, USA
  10. Universidad Politécnica de Madrid, Madrid, Spain
