
Scoring, truthlikeness, and value

  • Original Research
  • Synthese

Abstract

There is an ongoing debate about which rule we ought to use for scoring probability estimates. Much of this debate has been premised on scoring-rule monism, according to which there is exactly one best scoring rule. In previous work, I have argued against this position. The argument given there was based on purely a priori considerations, notably the intuition that scoring rules should be sensitive to truthlikeness relations if, and only if, such relations are present among whichever hypotheses are at issue. The present paper offers a new, quasi-empirical argument against scoring-rule monism. This argument uses computational simulations to show that different scoring rules can have different economic consequences, depending on the context of use.


Notes

  1. For other dissenters, see Joyce (2009), Levinstein (2017), and Schurz (2019).

  2. See, for instance, Good (1952), Bernardo (1979), Bernardo and Smith (2000), Bickel (2007), and McCutcheon (2019). See also Fallis (2007) and Fallis and Lewis (2016).

  3. See, for instance, Brier (1950), Rosenkrantz (1981), and Selten (1998). Joyce (1998) also advocates the Brier score as the one true scoring rule, but he abandons scoring-rule monism in his (2009).

  4. It is entirely consistent with everything said in the present paper, or in Douven (2020), that there are still other purposes that scoring rules can serve, and that some of those may again require propriety. For instance, Roche and Shogenji (2018) argue that we should measure informativeness in terms of inaccuracy reduction, and that inaccuracy should then be measured by a proper scoring rule. There is no conflict here with the claim made in Douven (2020), which after all is merely that scoring rules can also serve purposes which do not require propriety.

  5. Thanks to Ilkka Niiniluoto for bringing this to my attention.

  6. An anonymous referee disagreed at this point, maintaining that we are to measure distance from the truth here in terms of the difference in goals scored. In my opinion, it is more reasonable to look at how different the various mentioned non-actual worlds (the world in which the match ends in a 0 : 1 win, the world in which the match ends in a 0 : 4 win, and so on) are from the actual world. And given what we know about the teams, a world in which the match ends 0 : 4 would, as mentioned, have to be rather different from the actual world, while a world in which it ends 0 : 1 would not have to be very different. By contrast, for the match to have ended 0 : 21, something entirely out of the ordinary would have had to occur, and whatever that would have been, it would have been about equally compatible with a 0 : 24 end result. For instance, if all players who normally play for the home team had been suspended, and the coach of that team had to line up their most inexperienced players, then a devastating loss would be explainable—but the explanation would be about as good in the case of a 0 : 21 end result as it would be in the case of a 0 : 24 end result. (Thanks to Theo Kuipers and Ilkka Niiniluoto for helpful discussion here.)

  7. The referee helpfully provided a proof: Suppose we have worlds \(\{w_1, w_2, w_3\}\), where \(w_2\) and \(w_3\) are H-worlds and \(w_1\), the only \(\lnot H\)-world, is actual. Now compare probability assignments p and \(p^*\) to these worlds: \(p(w_1) = p(w_3) = 0\) and \(p(w_2) = 1\); \(p^*(w_1) = 0\), \(p^*(w_2) = 1 - x\), and \(p^*(w_3) = x\). Furthermore, let the distances among the worlds be given simply by their ordering in the set. Then if your current degrees of belief are given by p, you incur a VS score of \(\omega _{11} + \omega _{21}\), while if they are given by \(p^*\), your VS score equals \(\omega _{11} + \omega _{21}(1 - x)^2 + \omega _{31}x^2\). And \(\bigl (\omega _{11} + \omega _{21}(1 - x)^2 + \omega _{31}x^2\bigr ) - \bigl (\omega _{11} + \omega _{21}\bigr ) = -\omega _{21} + \omega _{21}(1 - x)^2 + \omega _{31}x^2\), which is negative for small values of x. Hence, your VS score can go up by becoming certain of the closest H-world, while previously you were only certain that H.
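The referee's calculation can be checked numerically. The following sketch (in Python; the paper's own simulation code, available in the Supplementary Materials, is in R) assumes a Brier-style distance-weighted score of the form \(\sum _i \omega _{i1}\bigl (p(w_i)-v(w_i)\bigr )^2\), where v is the indicator of the actual world; the concrete weights are illustrative choices, not taken from the paper:

```python
# Sketch, not the paper's code: check that the VS score difference
# -w21 + w21*(1-x)^2 + w31*x^2 is negative for small x, assuming a
# Brier-style score sum_i w_i1 * (p(w_i) - v(w_i))^2 with weights
# growing in the distance of w_i from the actual world w1.

def vs_score(p, actual, weights):
    """Distance-weighted quadratic score; lower is better."""
    return sum(w * (pi - (1.0 if i == actual else 0.0)) ** 2
               for i, (pi, w) in enumerate(zip(p, weights)))

# Worlds w1, w2, w3; w1 (index 0) is the actual, only non-H world;
# distances are given by the ordering. Illustrative omega_{11..31}:
weights = [1.0, 2.0, 3.0]

x = 0.01                      # small x, as in the footnote
p      = [0.0, 1.0, 0.0]      # certain of w2, the closest H-world
p_star = [0.0, 1.0 - x, x]    # shifts a little mass to the farther w3

diff = vs_score(p_star, 0, weights) - vs_score(p, 0, weights)
print(diff)  # negative: p* scores better, so moving from p* to p worsens you
```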

  8. Incidentally, this is not a reason to think the RPS rule is more attractive after all, because that rule fails to satisfy Oddie’s Proximity condition as well. To see this, consider a set of worlds \(\{w_1, w_2, w_3, w_4\}\), where \(w_1\), \(w_3\), \(w_4\) are H-worlds and \(w_2\) is the only \(\lnot H\)-world and is actual. Again, the distances among the four worlds are given by their order in the set, meaning that \(w_1\) and \(w_3\) are the H-worlds closest to the actual world. Now let p and \(p^*\) be such that \(p(w_1) = p(w_3) = .5\) and \(p(w_2) = p(w_4) = 0\), while \(p^*(w_1) = 1\) and \(p^*(w_i) = 0\) for \(i \in \{2, 3, 4\}\). Suppose your current degrees of belief are given by p. Then if a scoring rule is to satisfy Proximity, it should not make you come out less accurate if you replace those degrees of belief by ones given by \(p^*\). But according to the RPS rule, doing so would make you less accurate. Given your current degrees of belief, the rule assigns you a penalty of 1/6. But if you switch to \(p^*\!\), your RPS penalty doubles, becoming 1/3. Oddie (2019) proves that, given certain mild restrictions on the semantics, any additive scoring rule that satisfies his condition is improper, where a scoring rule is additive if it can be written as the sum of local inaccuracies, which look only at what probability is assigned to a world and whether or not that world is actual. (For a similar result, see Levinstein 2019.) That the RPS rule does not satisfy Proximity, as just shown, might be thought to follow already from Oddie’s formal result. That is not so, however. For although the RPS rule is proper, it is not additive: it is not enough to know of each world whether it is actual and what probability it gets assigned; its place in the ordering, and the probabilities assigned to the other worlds, matter, too.
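The two RPS penalties just mentioned can be reproduced with a short Python sketch, assuming the standard definition of the ranked probability score as a sum of squared differences between cumulative forecast and cumulative outcome, normalized by \(n-1\) (on this convention the figures 1/6 and 1/3 come out as stated):

```python
from itertools import accumulate

def rps(p, actual):
    """Ranked probability score over ordered categories, normalized by
    n - 1; lower is better. `actual` is the index of the true category."""
    n = len(p)
    cum_p = list(accumulate(p))                            # forecast CDF
    cum_o = [1.0 if k >= actual else 0.0 for k in range(n)]  # outcome CDF
    return sum((cp - co) ** 2 for cp, co in zip(cum_p, cum_o)) / (n - 1)

# Worlds w1..w4 in their given order; w2 (index 1) is actual.
p      = [0.5, 0.0, 0.5, 0.0]  # split over w1 and w3
p_star = [1.0, 0.0, 0.0, 0.0]  # certain of w1, a closest H-world

print(rps(p, 1))       # 1/6, as in the footnote
print(rps(p_star, 1))  # 1/3: switching to p* doubles the penalty
```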

  9. Schoenfield (2021) proposes a number of principles weaker than Proximity but still strong enough—according to her—to capture truthcloseness intuitions. McCutcheon (2021) argues for a stronger replacement for Proximity that he calls “Proxvexity.” The referee who brought to my attention that VS rules fail to satisfy Proximity also pointed out to me that they do satisfy McCutcheon’s Proxvexity condition. It is worth noting that McCutcheon (2021) proposes a set of scoring rules that satisfy the same condition but that in addition are proper. However, McCutcheon’s rules build on the log score and share with it the unboundedness problem (Carvalho 2016, p. 226), which some may find serious enough to reject those rules.

  10. We are dividing by \(\mathrm {pdf}(0.5\,|\, 0.5,0.075)\) to make 1 the maximum value of the function, so as to make it more easily comparable to the value functions in the other examples. Most probably, to obtain the monetary value of the commodities figuring in our examples, each of r, s, t, and u would have to be multiplied by a different constant. Doing so would be immaterial to the results of the simulations, however, given that these concern correlations, and given that correlations are unaffected by linear transformations of the variables.
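As an illustration of this rescaling (assuming, as the notation \(\mathrm {pdf}(0.5\,|\,0.5,0.075)\) suggests, a normal density with mean 0.5 and standard deviation 0.075), a minimal Python sketch:

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution (the pdf assumed here)."""
    return (math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))
            / (sigma * math.sqrt(2 * math.pi)))

def value(x, mu=0.5, sigma=0.075):
    """Value function rescaled by pdf(mu | mu, sigma) so its maximum is 1,
    making it comparable to the value functions in the other examples."""
    return normal_pdf(x, mu, sigma) / normal_pdf(mu, mu, sigma)

print(value(0.5))   # 1.0: the peak after rescaling
print(value(0.35))  # values away from 0.5 fall below the maximum
```

Note that multiplying such a function by a constant, as contemplated in the footnote, would indeed leave all reported correlations unchanged.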

  11. To sample a probability distribution on an n-element hypothesis partition from a uniform Dirichlet distribution essentially means that each point in the \((n-1)\)-dimensional probability simplex has the same chance of being selected.
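Such sampling can be sketched in a few lines (in Python rather than the R of the Supplementary Materials), using the standard fact that normalizing i.i.d. Gamma(1, 1)—i.e., Exp(1)—draws yields a sample from the uniform Dirichlet:

```python
import random

def uniform_simplex_point(n):
    """Sample a probability distribution over an n-element hypothesis
    partition from the uniform (all-ones) Dirichlet distribution:
    normalize n i.i.d. Exp(1) = Gamma(1, 1) draws."""
    g = [random.expovariate(1.0) for _ in range(n)]
    total = sum(g)
    return [x / total for x in g]

p = uniform_simplex_point(11)
print(len(p), sum(p))  # 11 numbers summing to 1 (up to rounding)
```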

  12. The procedure described in Meng, Rosenthal, and Rubin (1992) allows one to test for differences among so-called correlated correlations, which are correlations between pairs of variables where one of the variables is shared by the pairs. For instance, we can test whether the correlation of the Brier scores for a given \(n\in \{11,21,51,101\}\) with the \(\Delta \)-values for that n and a given value function differs significantly from the correlation of the ranked probability scores for that n with the same \(\Delta \)-values. Using this procedure, it was found that, for all n and each of r, s, and t, the correlations for the ranked probability scores as well as for the VS scores are significantly higher (at \(\alpha =.0001\)) than those for the Brier and log scores, and the correlations for the ranked probability scores are also significantly higher (at the same \(\alpha \) level) than those for the VS scores. In the case of value function u, the differences among the correlations are not significant for the cases \(n=11\) and \(n=21\), but for the remaining cases the correlations for the ranked probability scores are significantly higher (at \(\alpha =.001\)) than those for the other rules.
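For readers who want to apply this test themselves, here is a sketch of the Meng–Rosenthal–Rubin z-statistic as I understand their published formula; the input numbers below are made-up illustrations, not the correlations from the simulations reported in the paper:

```python
import math

def fisher_z(r):
    """Fisher's r-to-z transformation."""
    return 0.5 * math.log((1 + r) / (1 - r))

def meng_test(r1, r2, rx, n):
    """Meng, Rosenthal & Rubin's (1992) test for two correlated
    correlations r1 = corr(y, x1) and r2 = corr(y, x2), where
    rx = corr(x1, x2) and n is the sample size.
    Returns (z, two-sided normal p-value)."""
    rbar2 = (r1 ** 2 + r2 ** 2) / 2
    f = min((1 - rx) / (2 * (1 - rbar2)), 1.0)   # f is capped at 1
    h = (1 - f * rbar2) / (1 - rbar2)
    z = (fisher_z(r1) - fisher_z(r2)) * math.sqrt(
        (n - 3) / (2 * (1 - rx) * h))
    p = math.erfc(abs(z) / math.sqrt(2))          # two-sided p-value
    return z, p

# Hypothetical numbers: do two rules' correlations with the Delta-values
# differ, given that the two rules' scores themselves correlate .6?
z, p = meng_test(0.85, 0.75, 0.60, 1000)
print(round(z, 2), p < 0.0001)
```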

  13. Looking at the shapes of the other value functions, we can also understand the rest of the results reported in Table 3. In particular, it is easy to see why the correlations between ranked probability scores and divergences are particularly high for t, which has a more or less steady slope of about \(-1\) across its entire domain.

  14. As do the rules proposed in McCutcheon (2021). And there also exist weighted versions of the Brier score that deliver at least some aspects of the said desideratum. See Greaves and Wallace (2006), Dunn (2018), and Schoenfield (2021).

  15. I am greatly indebted to Christopher von Bülow and to two anonymous referees for valuable comments on previous versions.

References

  • Bernardo, J. M. (1979). Expected information as expected utility. Annals of Statistics, 7, 686–690.

  • Bernardo, J. M., & Smith, A. F. M. (2000). Bayesian theory. New York: Wiley.

  • Bickel, J. E. (2007). Some comparisons between quadratic, spherical, and logarithmic scoring rules. Decision Analysis, 4, 49–65.

  • Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78, 1–3.

  • Carvalho, A. (2016). An overview of applications of proper scoring rules. Decision Analysis, 13, 223–242.

  • Cooke, R. M. (1991). Experts in uncertainty. Oxford: Oxford University Press.

  • Douven, I. (2020). Scoring in context. Synthese, 197, 1565–1580.

  • Douven, I. (2021). The art of abduction. Cambridge, MA: MIT Press, in press.

  • Douven, I., Wenmackers, S., Jraissati, Y., & Decock, L. (2017). Measuring graded membership: The case of color. Cognitive Science, 41, 686–722.

  • Dunn, J. (2018). Accuracy, verisimilitude and scoring rules. Australasian Journal of Philosophy, 97, 151–166.

  • Epstein, E. S. (1969). A scoring system for probability forecasts of ranked categories. Journal of Applied Meteorology, 8, 985–987.

  • Fairchild, M. D. (2013). Color appearance models. Chichester, UK: Wiley.

  • Fallis, D. (2007). Attitudes toward epistemic risk and the value of experiments. Studia Logica, 86, 215–246.

  • Fallis, D., & Lewis, P. J. (2016). The Brier rule is not a good measure of epistemic utility (and other useful facts about epistemic betterness). Australasian Journal of Philosophy, 94, 576–590.

  • Gärdenfors, P. (2000). Conceptual spaces. Cambridge, MA: MIT Press.

  • Good, I. J. (1952). Rational decisions. Journal of the Royal Statistical Society, B14, 107–114.

  • Greaves, H., & Wallace, D. (2006). Justifying conditionalization: Conditionalization maximizes expected epistemic utility. Mind, 115, 607–632.

  • Joyce, J. M. (1998). A nonpragmatic vindication of probabilism. Philosophy of Science, 65, 575–603.

  • Joyce, J. M. (2009). Accuracy and coherence: Prospects for an alethic epistemology of partial belief. In F. Huber & C. Schmidt-Petri (Eds.), Degrees of belief (pp. 263–297). Dordrecht: Springer.

  • Kuipers, T. A. F. (2000). From instrumentalism to constructive realism. Dordrecht: Kluwer.

  • Kuipers, T. A. F. (2001). Structures in science. Dordrecht: Kluwer.

  • Kuipers, T. A. F. (2019). Nomic truth approximation revisited. Basel: Springer.

  • Levinstein, B. A. (2017). A pragmatist’s guide to epistemic utility. Philosophy of Science, 84, 613–638.

  • Levinstein, B. A. (2019). An objection of varying importance to epistemic utility theory. Philosophical Studies, 176, 2919–2931.

  • Machery, E., Mallon, R., Nichols, S., & Stich, S. P. (2004). Semantics, cross-cultural style. Cognition, 92, B1–B12.

  • McCutcheon, R. G. (2019). In favor of logarithmic scoring. Philosophy of Science, 86, 286–303.

  • McCutcheon, R. G. (2021). A note on verisimilitude and accuracy. British Journal for the Philosophy of Science, in press.

  • Meng, X. L., Rosenthal, R., & Rubin, D. B. (1992). Comparing correlated correlation coefficients. Psychological Bulletin, 111, 172–175.

  • Murphy, A. H. (1969). On the ranked probability score. Journal of Applied Meteorology, 8, 988–989.

  • Murphy, A. H. (1993). What is a good forecast? An essay on the nature of goodness in weather forecasting. Weather Forecasting, 8, 281–293.

  • Niiniluoto, I. (1984). Is science progressive? Dordrecht: Reidel.

  • Niiniluoto, I. (1987). Truthlikeness. Dordrecht: Reidel.

  • Niiniluoto, I. (1998). Verisimilitude: The third period. British Journal for the Philosophy of Science, 49, 1–29.

  • Niiniluoto, I. (1999). Critical scientific realism. Oxford: Oxford University Press.

  • Oddie, G. (2019). What accuracy could not be. British Journal for the Philosophy of Science, 70, 551–580.

  • Roche, W., & Shogenji, T. (2018). Information and inaccuracy. British Journal for the Philosophy of Science, 69, 577–604.

  • Rosenkrantz, R. D. (1981). Foundations and applications of inductive probability. Atascadero, CA: Ridgeview.

  • Schoenfield, M. (2021). Accuracy and verisimilitude: The good, the bad and the ugly. British Journal for the Philosophy of Science, in press.

  • Schurz, G. (1987). A new definition of verisimilitude and its applications. In P. Weingartner & G. Schurz (Eds.), Logic, philosophy of science and epistemology (pp. 177–184). Vienna: Hölder-Pichler-Tempsky.

  • Schurz, G. (1991). Relevant deduction. Erkenntnis, 35, 391–437.

  • Schurz, G. (2011). Verisimilitude and belief revision: With a focus on the relevant element account. Erkenntnis, 75, 203–221.

  • Schurz, G. (2019). Hume’s problem solved: The optimality of meta-induction. Cambridge, MA: MIT Press.

  • Selten, R. (1998). Axiomatic characterization of the quadratic scoring rule. Experimental Economics, 1, 43–62.

  • Shepard, R. N. (1987). Toward a universal law of generalization for psychological science. Science, 237, 1317–1323.

  • Weinberg, J. M., Gonnerman, C., Buckner, C., & Alexander, J. (2010). Are philosophers expert intuiters? Philosophical Psychology, 23, 331–355.


Author information

Correspondence to Igor Douven.


The Supplementary Materials for this paper, containing the R code used for the simulations to be reported, can be retrieved from this repository: https://osf.io/n8e2g/?view_only=72d468e2eabe4100b9409985c4a10950

This article belongs to the topical collection on Approaching Probabilistic Truths, edited by Theo Kuipers, Ilkka Niiniluoto, and Gustavo Cevolani.

Cite this article

Douven, I. Scoring, truthlikeness, and value. Synthese 199, 8281–8298 (2021). https://doi.org/10.1007/s11229-021-03162-z
