Abstract
There is an ongoing debate about which rule we ought to use for scoring probability estimates. Much of this debate has been premised on scoring-rule monism, according to which there is exactly one best scoring rule. In previous work, I have argued against this position. The argument given there was based on purely a priori considerations, notably the intuition that scoring rules should be sensitive to truthlikeness relations if, and only if, such relations are present among whichever hypotheses are at issue. The present paper offers a new, quasi-empirical argument against scoring-rule monism. This argument uses computational simulations to show that different scoring rules can have different economic consequences, depending on the context of use.
Notes
It is entirely consistent with everything said in the present paper, or in Douven (2020), that there are still other purposes that scoring rules can serve, and that some of those may again require propriety. For instance, Roche and Shogenji (2018) argue that we should measure informativeness in terms of inaccuracy reduction, and that inaccuracy should then be measured by a proper scoring rule. There is no conflict here with the claim made in Douven (2020), which after all is merely that scoring rules can also serve purposes which do not require propriety.
Thanks to Ilkka Niiniluoto for bringing this to my attention.
An anonymous referee disagreed at this point, maintaining that we are to measure distance from the truth here in terms of the difference in goals scored. In my opinion, it is more reasonable to look at how different the various mentioned non-actual worlds (the world in which the match ends in a 0 : 1 win, the world in which the match ends in a 0 : 4 win, and so on) are from the actual world. And given what we know about the teams, our world would, as mentioned, have to be rather different from the actual world for the match to end 0 : 4 while it would not have to be very different for the match to end 0 : 1. By contrast, for the match to have ended 0 : 21, something entirely out of the ordinary would have had to occur, and whatever that would have been, it would have been about equally compatible with a 0 : 24 end result. For instance, if all players who normally play for the home team had been suspended, and the coach of that team had to line up their most inexperienced players, then a devastating loss would be explainable—but the explanation would be about as good in the case of a 0 : 21 end result as it would be in the case of a 0 : 24 end result. (Thanks to Theo Kuipers and Ilkka Niiniluoto for helpful discussion here.)
The referee helpfully provided a proof: Suppose we have worlds \(\{w_1, w_2, w_3\}\), where \(w_2\) and \(w_3\) are H-worlds and \(w_1\), the only \(\lnot H\)-world, is actual. Now compare probability assignments p and \(p^*\) to these worlds: \(p(w_1) = p(w_3) = 0\) and \(p(w_2) = 1\); \(p^*(w_1) = 0\), \(p^*(w_2) = 1 - x\), and \(p^*(w_3) = x\). Furthermore, let the distances among the worlds be given simply by their ordering in the set. Then if your current degrees of belief are given by p, you incur a VS score of \(\omega _{11} + \omega _{21}\), while if they are given by \(p^*\), your VS score equals \(\omega _{11} + \omega _{21}(1 - x)^2 + \omega _{31}x^2\). And \(\bigl (\omega _{11} + \omega _{21}(1 - x)^2 + \omega _{31}x^2\bigr ) - \bigl (\omega _{11} + \omega _{21}\bigr ) = -\omega _{21} + \omega _{21}(1 - x)^2 + \omega _{31}x^2\), which is negative for small values of x. Hence, your VS score can go up by becoming certain of the closest H-world, while previously you were only certain that H.
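The referee's calculation is easy to check numerically. The sketch below uses illustrative weights (the \(\omega \)'s in the proof are left abstract; here they simply grow with distance from the actual world, which is all the argument needs beyond \(\omega _{21} > 0\)) and confirms that, for small x, the VS score of \(p^*\) is lower, i.e., \(p^*\) counts as more accurate even though it moves probability away from the closest H-world:

```python
# Numerical check of the referee's proof that VS rules violate Proximity.
# Three worlds w1, w2, w3; w1 is actual; distances given by position in the
# list. The weights omega[i] (the omega_{i1} of the proof) are illustrative.

def vs_score(p, omega, actual=0):
    """Distance-weighted Brier score of distribution p when world `actual`
    obtains: sum_i omega_i * (1[i == actual] - p_i)^2."""
    return sum(w * ((1.0 if i == actual else 0.0) - pi) ** 2
               for i, (w, pi) in enumerate(zip(omega, p)))

omega = [1.0, 2.0, 3.0]      # illustrative: weight grows with distance from w1
x = 0.05                     # a small shift of probability toward w3

p      = [0.0, 1.0, 0.0]     # certain of w2, the closest H-world
p_star = [0.0, 1.0 - x, x]   # some probability moved to the farther H-world w3

score_p      = vs_score(p, omega)       # = omega_1 + omega_2
score_p_star = vs_score(p_star, omega)  # = omega_1 + omega_2*(1-x)^2 + omega_3*x^2

# For small x the difference is negative: p* counts as MORE accurate.
print(score_p, score_p_star, score_p_star - score_p)
```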
Incidentally, this is not a reason to think the RPS rule is more attractive after all, because that rule fails to satisfy Oddie’s Proximity condition as well. To see this, consider a set of worlds \(\{w_1, w_2, w_3, w_4\}\), where \(w_1\), \(w_3\), \(w_4\) are H-worlds and \(w_2\) is the only \(\lnot H\)-world and is actual. Again, the distances among the four worlds are given by their order in the set, meaning that \(w_1\) and \(w_3\) are the H-worlds closest to the actual world. Now let p and \(p^*\) be such that \(p(w_1) = p(w_3) = .5\) and \(p(w_2) = p(w_4) = 0\), while \(p^*(w_1) = 1\) and \(p^*(w_i) = 0\) for \(i \in \{2, 3, 4\}\). Suppose your current degrees of belief are given by p. Then if a scoring rule is to satisfy Proximity, it should not make you come out less accurate if you replace those degrees of belief by ones given by \(p^*\). But according to the RPS rule, doing so would make you less accurate. Given your current degrees of belief, the rule assigns you a penalty of 1/6. But if you switch to \(p^*\!\), your RPS penalty doubles, becoming 1/3. Oddie (2019) proves that, given certain mild restrictions on the semantics, any additive scoring rule that satisfies his condition is improper, where a scoring rule is additive if it can be written as the sum of local inaccuracies, which look only at what probability is assigned to a world and whether or not that world is actual. (For a similar result, see Levinstein 2019.) That the RPS rule does not satisfy Proximity, as just shown, might be thought to follow already from Oddie’s formal result. That is not so, however. For although the RPS rule is proper, it is not additive: it is not enough to know of each world whether it is actual and what probability it gets assigned; its place in the ordering, and the probabilities assigned to the other worlds, matter, too.
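The penalties of 1/6 and 1/3 can be reproduced with a standard formulation of the ranked probability score, on which the penalty is the average squared difference between the cumulative forecast probabilities and the cumulative probabilities of the point mass on the actual world, normalized by the number of worlds minus one (the normalization is what yields the note's exact figures). A minimal Python sketch:

```python
# RPS counterexample to Proximity: four worlds, ordered by distance, w2 actual.

def rps(p, actual):
    """Ranked probability score, normalized by n - 1."""
    n = len(p)
    cum_p, cum_e, total = 0.0, 0.0, 0.0
    for k in range(n - 1):  # the n-th cumulative terms are both 1, so skip them
        cum_p += p[k]
        cum_e += 1.0 if k == actual else 0.0
        total += (cum_p - cum_e) ** 2
    return total / (n - 1)

p      = [0.5, 0.0, 0.5, 0.0]  # split between the two closest H-worlds w1, w3
p_star = [1.0, 0.0, 0.0, 0.0]  # certain of the single closest H-world w1

print(rps(p, actual=1))        # 1/6
print(rps(p_star, actual=1))   # 1/3: moving toward the closest H-world worsens
                               # the score, so RPS violates Proximity
```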
Schoenfield (2021) proposes a number of principles weaker than Proximity but still strong enough—according to her—to capture truthcloseness intuitions. McCutcheon (2021) argues for a stronger replacement for Proximity that he calls “Proxvexity.” The referee who brought to my attention that VS rules fail to satisfy Proximity also pointed out to me that they do satisfy McCutcheon’s Proxvexity condition. It is worth noting that McCutcheon (2021) proposes a set of scoring rules that satisfy the same condition and that in addition are proper. However, McCutcheon’s rules build on the log score and share with it the unboundedness problem (Carvalho 2016, p. 226), which some may find serious enough to reject those rules.
We are dividing by \(\mathrm {pdf}(0.5\,|\, 0.5,0.075)\) to make 1 the maximum value of the function, so as to make it more easily comparable to the value functions in the other examples. Most probably, to obtain the monetary value of the commodities figuring in our examples, each of r, s, t, and u would have to be multiplied by a different constant. Doing so would be immaterial to the results of the simulations, however, given that these concern correlations, and given that correlations are unaffected by linear transformations of the variables.
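Assuming \(\mathrm {pdf}(x\,|\,0.5, 0.075)\) denotes the normal density with mean 0.5 and standard deviation 0.075 (an assumption suggested by the notation), dividing by the density's value at its mode rescales the peak to exactly 1, and the value function reduces to a Gaussian bump:

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of the normal distribution with mean mu and sd sigma."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def value_fn(x, mu=0.5, sigma=0.075):
    # Dividing by pdf(0.5 | 0.5, 0.075), the maximum of the density,
    # rescales the function so that its peak value is exactly 1.
    return normal_pdf(x, mu, sigma) / normal_pdf(mu, mu, sigma)

print(value_fn(0.5))   # 1.0: the maximum of the rescaled function
```

Since correlations are unaffected by multiplying a variable by a positive constant, this rescaling (like the per-commodity constants mentioned above) leaves the reported results unchanged.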
To sample a probability distribution on an n-element hypothesis partition from a uniform Dirichlet distribution essentially means that each point in the \((n-1)\)-dimensional probability simplex has the same chance of being selected.
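Such a sample can be drawn with nothing beyond the standard library: take n independent Gamma(1)-distributed (i.e., exponential) variates and normalize them, which yields a draw from the uniform Dirichlet distribution. The sketch below is illustrative; the simulations themselves were run in R (see the Supplementary Materials):

```python
import random

def uniform_dirichlet(n):
    """Sample a probability distribution over an n-element partition
    uniformly from the (n-1)-dimensional probability simplex."""
    gs = [random.gammavariate(1.0, 1.0) for _ in range(n)]
    total = sum(gs)
    return [g / total for g in gs]

p = uniform_dirichlet(11)
print(sum(p))   # 1.0, up to floating-point error
```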
The procedure described in Meng, Rosenthal, and Rubin (1992) allows one to test for differences among so-called correlated correlations, which are correlations between pairs of variables where one of the variables is shared by the pairs. For instance, we can test whether the correlation of the Brier scores for a given \(n\in \{11,21,51,101\}\) with the \(\Delta \)-values for that n and a given value function differs significantly from the correlation of the ranked probability scores for that n with the same \(\Delta \)-values. Using this procedure, it was found that, for all n and each of r, s, and t, the correlations for the ranked probability scores as well as for the VS scores are significantly higher (at \(\alpha =.0001\)) than those for the Brier and log scores, and the correlations for the ranked probability scores are also significantly higher (at the same \(\alpha \) level) than those for the VS scores. In the case of value function u, the differences among the correlations are not significant for the cases \(n=11\) and \(n=21\), but for the remaining cases the correlations for the ranked probability scores are significantly higher (at \(\alpha =.001\)) than those for the other rules.
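For reference, the Meng, Rosenthal, and Rubin (1992) test compares two correlations \(r_1 = \mathrm {corr}(y, x_1)\) and \(r_2 = \mathrm {corr}(y, x_2)\) that share the variable y, taking into account the correlation \(r_x\) between \(x_1\) and \(x_2\). A Python sketch of the test statistic (the numbers in the example call are purely illustrative, not values from the simulations):

```python
import math

def fisher_z(r):
    """Fisher's r-to-z transformation."""
    return 0.5 * math.log((1 + r) / (1 - r))

def meng_test(r1, r2, rx, n):
    """z statistic of Meng, Rosenthal & Rubin (1992) for comparing two
    correlated correlations: r1 = corr(y, x1), r2 = corr(y, x2),
    rx = corr(x1, x2), n = sample size."""
    rbar2 = (r1 ** 2 + r2 ** 2) / 2
    f = min((1 - rx) / (2 * (1 - rbar2)), 1.0)   # f is capped at 1
    h = (1 - f * rbar2) / (1 - rbar2)
    return (fisher_z(r1) - fisher_z(r2)) * math.sqrt((n - 3) / (2 * (1 - rx) * h))

# e.g., comparing a correlation of .80 with one of .65, with the two
# score variables themselves correlating at .70, based on 1000 cases:
print(meng_test(0.80, 0.65, 0.70, 1000))
```

The resulting z is referred to the standard normal distribution; equal correlations give z = 0 by construction.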
Looking at the shapes of the other value functions, we can also understand the rest of the results reported in Table 3. In particular, it is easy to see why the correlations between ranked probability scores and divergences are particularly high for t, which has a more or less steady slope of about \(-1\) across its entire domain.
I am greatly indebted to Christopher von Bülow and to two anonymous referees for valuable comments on previous versions.
References
Bernardo, J. M. (1979). Expected information as expected utility. Annals of Statistics, 7, 686–690.
Bernardo, J. M., & Smith, A. F. M. (2000). Bayesian theory. New York: Wiley.
Bickel, J. E. (2007). Some comparisons between quadratic, spherical, and logarithmic scoring rules. Decision Analysis, 4, 49–65.
Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78, 1–3.
Carvalho, A. (2016). An overview of applications of proper scoring rules. Decision Analysis, 13, 223–242.
Cooke, R. M. (1991). Experts in uncertainty. Oxford: Oxford University Press.
Douven, I. (2020). Scoring in context. Synthese, 197, 1565–1580.
Douven, I. (2021). The art of abduction. Cambridge, MA: MIT Press, in press.
Douven, I., Wenmackers, S., Jraissati, Y., & Decock, L. (2017). Measuring graded membership: The case of color. Cognitive Science, 41, 686–722.
Dunn, J. (2018). Accuracy, verisimilitude and scoring rules. Australasian Journal of Philosophy, 97, 151–166.
Epstein, E. S. (1969). A scoring system for probability forecasts of ranked categories. Journal of Applied Meteorology, 8, 985–987.
Fairchild, M. D. (2013). Color appearance models. Chichester, UK: Wiley.
Fallis, D. (2007). Attitudes toward epistemic risk and the value of experiments. Studia Logica, 86, 215–246.
Fallis, D., & Lewis, P. J. (2016). The Brier rule is not a good measure of epistemic utility (and other useful facts about epistemic betterness). Australasian Journal of Philosophy, 94, 576–590.
Gärdenfors, P. (2000). Conceptual spaces. Cambridge, MA: MIT Press.
Good, I. J. (1952). Rational decisions. Journal of the Royal Statistical Society, B14, 107–114.
Greaves, H., & Wallace, D. (2006). Justifying conditionalization: Conditionalization maximizes expected epistemic utility. Mind, 115, 607–632.
Joyce, J. M. (1998). A nonpragmatic vindication of probabilism. Philosophy of Science, 65, 575–603.
Joyce, J. M. (2009). Accuracy and coherence: Prospects for an alethic epistemology of partial belief. In F. Huber & C. Schmidt-Petri (Eds.), Degrees of belief (pp. 263–297). Dordrecht: Springer.
Kuipers, T. A. F. (2000). From instrumentalism to constructive realism. Dordrecht: Kluwer.
Kuipers, T. A. F. (2001). Structures in science. Dordrecht: Kluwer.
Kuipers, T. A. F. (2019). Nomic truth approximation revisited. Basel: Springer.
Levinstein, B. A. (2017). A pragmatist’s guide to epistemic utility. Philosophy of Science, 84, 613–638.
Levinstein, B. A. (2019). An objection of varying importance to epistemic utility theory. Philosophical Studies, 176, 2919–2931.
Machery, E., Mallon, R., Nichols, S., & Stich, S. P. (2004). Semantics, cross-cultural style. Cognition, 92, B1–B12.
McCutcheon, R. G. (2019). In favor of logarithmic scoring. Philosophy of Science, 86, 286–303.
McCutcheon, R. G. (2021). A note on verisimilitude and accuracy. British Journal for the Philosophy of Science, in press.
Meng, X. L., Rosenthal, R., & Rubin, D. B. (1992). Comparing correlated correlation coefficients. Psychological Bulletin, 111, 172–175.
Murphy, A. H. (1969). On the ranked probability score. Journal of Applied Meteorology, 8, 988–989.
Murphy, A. H. (1993). What is a good forecast? An essay on the nature of goodness in weather forecasting. Weather and Forecasting, 8, 281–293.
Niiniluoto, I. (1984). Is science progressive? Dordrecht: Reidel.
Niiniluoto, I. (1987). Truthlikeness. Dordrecht: Reidel.
Niiniluoto, I. (1998). Verisimilitude: The third period. British Journal for the Philosophy of Science, 49, 1–29.
Niiniluoto, I. (1999). Critical scientific realism. Oxford: Oxford University Press.
Oddie, G. (2019). What accuracy could not be. British Journal for the Philosophy of Science, 70, 551–580.
Roche, W., & Shogenji, T. (2018). Information and inaccuracy. British Journal for the Philosophy of Science, 69, 577–604.
Rosenkrantz, R. D. (1981). Foundations and applications of inductive probability. Atascadero, CA: Ridgeview.
Schoenfield, M. (2021). Accuracy and verisimilitude: The good, the bad and the ugly. British Journal for the Philosophy of Science, in press.
Schurz, G. (1987). A new definition of verisimilitude and its applications. In P. Weingartner & G. Schurz (Eds.), Logic, Philosophy of Science and Epistemology (pp. 177–184). Vienna: Hölder-Pichler-Tempsky.
Schurz, G. (1991). Relevant deduction. Erkenntnis, 35, 391–437.
Schurz, G. (2011). Verisimilitude and belief revision: With a focus on the relevant element account. Erkenntnis, 75, 203–221.
Schurz, G. (2019). Hume’s problem solved: The optimality of meta-induction. Cambridge, MA: MIT Press.
Selten, R. (1998). Axiomatic characterization of the quadratic scoring rule. Experimental Economics, 1, 43–62.
Shepard, R. N. (1987). Toward a universal law of generalization for psychological science. Science, 237, 1317–1323.
Weinberg, J. M., Gonnerman, C., Buckner, C., & Alexander, J. (2010). Are philosophers expert intuiters? Philosophical Psychology, 23, 331–355.
The Supplementary Materials for this paper, containing the R code used for the simulations to be reported, can be retrieved from this repository: https://osf.io/n8e2g/?view_only=72d468e2eabe4100b9409985c4a10950
This article belongs to the topical collection on Approaching Probabilistic Truths, edited by Theo Kuipers, Ilkka Niiniluoto, and Gustavo Cevolani.
Douven, I. Scoring, truthlikeness, and value. Synthese 199, 8281–8298 (2021). https://doi.org/10.1007/s11229-021-03162-z