Abstract
There is an ongoing debate about which rule we ought to use for scoring probability estimates. Much of this debate has been premised on scoring-rule monism, according to which there is exactly one best scoring rule. In previous work, I have argued against this position. The argument given there was based on purely a priori considerations, notably the intuition that scoring rules should be sensitive to truthlikeness relations if, and only if, such relations are present among whichever hypotheses are at issue. The present paper offers a new, quasi-empirical argument against scoring-rule monism. This argument uses computational simulations to show that different scoring rules can have different economic consequences, depending on the context of use.
Notes
It is entirely consistent with everything said in the present paper, or in Douven (2020), that there are still other purposes that scoring rules can serve, and that some of those may again require propriety. For instance, Roche and Shogenji (2018) argue that we should measure informativeness in terms of inaccuracy reduction, and that inaccuracy should then be measured by a proper scoring rule. There is no conflict here with the claim made in Douven (2020), which after all is merely that scoring rules can also serve purposes which do not require propriety.
Thanks to Ilkka Niiniluoto for bringing this to my attention.
An anonymous referee disagreed at this point, maintaining that we are to measure distance from the truth here in terms of the difference in goals scored. In my opinion, it is more reasonable to look at how different the various mentioned non-actual worlds (the world in which the match ends in a 0 : 1 win, the world in which the match ends in a 0 : 4 win, and so on) are from the actual world. And given what we know about the teams, our world would, as mentioned, have to be rather different from the actual world for the match to end 0 : 4 while it would not have to be very different for the match to end 0 : 1. By contrast, for the match to have ended 0 : 21, something entirely out of the ordinary would have had to occur, and whatever that would have been, it would have been about equally compatible with a 0 : 24 end result. For instance, if all players who normally play for the home team had been suspended, and the coach of that team had to line up their most inexperienced players, then a devastating loss would be explainable—but the explanation would be about as good in the case of a 0 : 21 end result as it would be in the case of a 0 : 24 end result. (Thanks to Theo Kuipers and Ilkka Niiniluoto for helpful discussion here.)
The referee helpfully provided a proof: Suppose we have worlds \(\{w_1, w_2, w_3\}\), where \(w_2\) and \(w_3\) are H-worlds and \(w_1\), the only \(\lnot H\)-world, is actual. Now compare probability assignments p and \(p^*\) to these worlds: \(p(w_1) = p(w_3) = 0\) and \(p(w_2) = 1\); \(p^*(w_1) = 0\), \(p^*(w_2) = 1 - x\), and \(p^*(w_3) = x\). Furthermore, let the distances among the worlds be given simply by their ordering in the set. Then if your current degrees of belief are given by p, you incur a VS score of \(\omega _{11} + \omega _{21}\), while if they are given by \(p^*\), your VS score equals \(\omega _{11} + \omega _{21}(1 - x)^2 + \omega _{31}x^2\). And \(\bigl (\omega _{11} + \omega _{21}(1 - x)^2 + \omega _{31}x^2\bigr ) - \bigl (\omega _{11} + \omega _{21}\bigr ) = -\omega _{21} + \omega _{21}(1 - x)^2 + \omega _{31}x^2\), which is negative for small values of x. Hence, your VS score can go up by becoming certain of the closest H-world, while previously you were only certain that H.
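The referee's calculation is easy to check numerically. The sketch below uses illustrative weights (the \(\omega \)'s in the proof are left abstract; here they simply grow with distance from the actual world, which is all the argument needs beyond \(\omega _{21} > 0\)) and confirms that, for small x, the VS score of \(p^*\) is lower, i.e., \(p^*\) counts as more accurate even though it moves probability away from the closest H-world:

```python
# Numerical check of the referee's proof that VS rules violate Proximity.
# Three worlds w1, w2, w3; w1 is actual; distances given by position in the
# list. The weights omega[i] (the omega_{i1} of the proof) are illustrative.

def vs_score(p, omega, actual=0):
    """Distance-weighted Brier score of distribution p when world `actual`
    obtains: sum_i omega_i * (1[i == actual] - p_i)^2."""
    return sum(w * ((1.0 if i == actual else 0.0) - pi) ** 2
               for i, (w, pi) in enumerate(zip(omega, p)))

omega = [1.0, 2.0, 3.0]      # illustrative: weight grows with distance from w1
x = 0.05                     # a small shift of probability toward w3

p      = [0.0, 1.0, 0.0]     # certain of w2, the closest H-world
p_star = [0.0, 1.0 - x, x]   # some probability moved to the farther H-world w3

score_p      = vs_score(p, omega)       # = omega_1 + omega_2
score_p_star = vs_score(p_star, omega)  # = omega_1 + omega_2*(1-x)^2 + omega_3*x^2

# For small x the difference is negative: p* counts as MORE accurate.
print(score_p, score_p_star, score_p_star - score_p)
```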
Incidentally, this is not a reason to think the RPS rule is more attractive after all, because that rule fails to satisfy Oddie’s Proximity condition as well. To see this, consider a set of worlds \(\{w_1, w_2, w_3, w_4\}\), where \(w_1\), \(w_3\), \(w_4\) are H-worlds and \(w_2\) is the only \(\lnot H\)-world and is actual. Again, the distances among the four worlds are given by their order in the set, meaning that \(w_1\) and \(w_3\) are the H-worlds closest to the actual world. Now let p and \(p^*\) be such that \(p(w_1) = p(w_3) = .5\) and \(p(w_2) = p(w_4) = 0\), while \(p^*(w_1) = 1\) and \(p^*(w_i) = 0\) for \(i \in \{2, 3, 4\}\). Suppose your current degrees of belief are given by p. Then if a scoring rule is to satisfy Proximity, it should not make you come out less accurate if you replace those degrees of belief by ones given by \(p^*\). But according to the RPS rule, doing so would make you less accurate. Given your current degrees of belief, the rule assigns you a penalty of 1/6. But if you switch to \(p^*\!\), your RPS penalty doubles, becoming 1/3. Oddie (2019) proves that, given certain mild restrictions on the semantics, any additive scoring rule that satisfies his condition is improper, where a scoring rule is additive if it can be written as the sum of local inaccuracies, which look only at what probability is assigned to a world and whether or not that world is actual. (For a similar result, see Levinstein 2019.) That the RPS rule does not satisfy Proximity, as just shown, might be thought to follow already from Oddie’s formal result. That is not so, however. For although the RPS rule is proper, it is not additive: it is not enough to know of each world whether it is actual and what probability it gets assigned; its place in the ordering, and the probabilities assigned to the other worlds, matter, too.
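The penalties of 1/6 and 1/3 can be reproduced with a standard formulation of the ranked probability score, on which the penalty is the average squared difference between the cumulative forecast probabilities and the cumulative probabilities of the point mass on the actual world, normalized by the number of worlds minus one (the normalization is what yields the note's exact figures). A minimal Python sketch:

```python
# RPS counterexample to Proximity: four worlds, ordered by distance, w2 actual.

def rps(p, actual):
    """Ranked probability score, normalized by n - 1."""
    n = len(p)
    cum_p, cum_e, total = 0.0, 0.0, 0.0
    for k in range(n - 1):  # the n-th cumulative terms are both 1, so skip them
        cum_p += p[k]
        cum_e += 1.0 if k == actual else 0.0
        total += (cum_p - cum_e) ** 2
    return total / (n - 1)

p      = [0.5, 0.0, 0.5, 0.0]  # split between the two closest H-worlds w1, w3
p_star = [1.0, 0.0, 0.0, 0.0]  # certain of the single closest H-world w1

print(rps(p, actual=1))        # 1/6
print(rps(p_star, actual=1))   # 1/3: moving toward the closest H-world worsens
                               # the score, so RPS violates Proximity
```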
Schoenfield (2021) proposes a number of principles weaker than Proximity but still strong enough—according to her—to capture truthcloseness intuitions. McCutcheon (2021) argues for a stronger replacement for Proximity that he calls “Proxvexity.” The referee who brought to my attention that VS rules fail to satisfy Proximity also pointed out to me that they do satisfy McCutcheon’s Proxvexity condition. It is worth noting that McCutcheon (2021) proposes a set of scoring rules that satisfy the same condition and that in addition are proper. However, McCutcheon’s rules build on the log score and share with it the unboundedness problem (Carvalho 2016, p. 226), which some may find serious enough to reject those rules.
We are dividing by \(\mathrm {pdf}(0.5\,|\, 0.5,0.075)\) to make 1 the maximum value of the function, so as to make it more easily comparable to the value functions in the other examples. Most probably, to obtain the monetary value of the commodities figuring in our examples, each of r, s, t, and u would have to be multiplied by a different constant. Doing so would be immaterial to the results of the simulations, however, given that these concern correlations, and given that correlations are unaffected by linear transformations of the variables.
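Assuming \(\mathrm {pdf}(x\,|\,0.5, 0.075)\) denotes the normal density with mean 0.5 and standard deviation 0.075 (an assumption suggested by the notation), dividing by the density's value at its mode rescales the peak to exactly 1, and the value function reduces to a Gaussian bump:

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of the normal distribution with mean mu and sd sigma."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def value_fn(x, mu=0.5, sigma=0.075):
    # Dividing by pdf(0.5 | 0.5, 0.075), the maximum of the density,
    # rescales the function so that its peak value is exactly 1.
    return normal_pdf(x, mu, sigma) / normal_pdf(mu, mu, sigma)

print(value_fn(0.5))   # 1.0: the maximum of the rescaled function
```

Since correlations are unaffected by multiplying a variable by a positive constant, this rescaling (like the per-commodity constants mentioned above) leaves the reported results unchanged.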
To sample a probability distribution on an n-element hypothesis partition from a uniform Dirichlet distribution essentially means that each point in the \((n-1)\)-dimensional probability simplex has the same chance of being selected.
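Such a sample can be drawn with nothing beyond the standard library: take n independent Gamma(1)-distributed (i.e., exponential) variates and normalize them, which yields a draw from the uniform Dirichlet distribution. The sketch below is illustrative; the simulations themselves were run in R (see the Supplementary Materials):

```python
import random

def uniform_dirichlet(n):
    """Sample a probability distribution over an n-element partition
    uniformly from the (n-1)-dimensional probability simplex."""
    gs = [random.gammavariate(1.0, 1.0) for _ in range(n)]
    total = sum(gs)
    return [g / total for g in gs]

p = uniform_dirichlet(11)
print(sum(p))   # 1.0, up to floating-point error
```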
The procedure described in Meng, Rosenthal, and Rubin (1992) allows one to test for differences among so-called correlated correlations, which are correlations between pairs of variables where one of the variables is shared by the pairs. For instance, we can test whether the correlation of the Brier scores for a given \(n\in \{11,21,51,101\}\) with the \(\Delta \)-values for that n and a given value function differs significantly from the correlation of the ranked probability scores for that n with the same \(\Delta \)-values. Using this procedure, it was found that, for all n and each of r, s, and t, the correlations for the ranked probability scores as well as for the VS scores are significantly higher (at \(\alpha =.0001\)) than those for the Brier and log scores, and the correlations for the ranked probability scores are also significantly higher (at the same \(\alpha \) level) than those for the VS scores. In the case of value function u, the differences among the correlations are not significant for the cases \(n=11\) and \(n=21\), but for the remaining cases the correlations for the ranked probability scores are significantly higher (at \(\alpha =.001\)) than those for the other rules.
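For reference, the Meng, Rosenthal, and Rubin (1992) test compares two correlations \(r_1 = \mathrm {corr}(y, x_1)\) and \(r_2 = \mathrm {corr}(y, x_2)\) that share the variable y, taking into account the correlation \(r_x\) between \(x_1\) and \(x_2\). A Python sketch of the test statistic (the numbers in the example call are purely illustrative, not values from the simulations):

```python
import math

def fisher_z(r):
    """Fisher's r-to-z transformation."""
    return 0.5 * math.log((1 + r) / (1 - r))

def meng_test(r1, r2, rx, n):
    """z statistic of Meng, Rosenthal & Rubin (1992) for comparing two
    correlated correlations: r1 = corr(y, x1), r2 = corr(y, x2),
    rx = corr(x1, x2), n = sample size."""
    rbar2 = (r1 ** 2 + r2 ** 2) / 2
    f = min((1 - rx) / (2 * (1 - rbar2)), 1.0)   # f is capped at 1
    h = (1 - f * rbar2) / (1 - rbar2)
    return (fisher_z(r1) - fisher_z(r2)) * math.sqrt((n - 3) / (2 * (1 - rx) * h))

# e.g., comparing a correlation of .80 with one of .65, with the two
# score variables themselves correlating at .70, based on 1000 cases:
print(meng_test(0.80, 0.65, 0.70, 1000))
```

The resulting z is referred to the standard normal distribution; equal correlations give z = 0 by construction.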
Looking at the shapes of the other value functions, we can also understand the rest of the results reported in Table 3. In particular, it is easy to see why the correlations between ranked probability scores and divergences are particularly high for t, which has a more or less steady slope of about \(-1\) across its entire domain.
I am greatly indebted to Christopher von Bülow and to two anonymous referees for valuable comments on previous versions.
References
Bernardo, J. M. (1979). Expected information as expected utility. Annals of Statistics, 7, 686–690.
Bernardo, J. M., & Smith, A. F. M. (2000). Bayesian theory. New York: Wiley.
Bickel, J. E. (2007). Some comparisons between quadratic, spherical, and logarithmic scoring rules. Decision Analysis, 4, 49–65.
Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78, 1–3.
Carvalho, A. (2016). An overview of applications of proper scoring rules. Decision Analysis, 13, 223–242.
Cooke, R. M. (1991). Experts in uncertainty. Oxford: Oxford University Press.
Douven, I. (2020). Scoring in context. Synthese, 197, 1565–1580.
Douven, I. (2021). The art of abduction. Cambridge, MA: MIT Press, in press.
Douven, I., Wenmackers, S., Jraissati, Y., & Decock, L. (2017). Measuring graded membership: The case of color. Cognitive Science, 41, 686–722.
Dunn, J. (2018). Accuracy, verisimilitude and scoring rules. Australasian Journal of Philosophy, 97, 151–166.
Epstein, E. S. (1969). A scoring system for probability forecasts of ranked categories. Journal of Applied Meteorology, 8, 985–987.
Fairchild, M. D. (2013). Color appearance models. Chichester, UK: Wiley.
Fallis, D. (2007). Attitudes toward epistemic risk and the value of experiments. Studia Logica, 86, 215–246.
Fallis, D., & Lewis, P. J. (2016). The Brier rule is not a good measure of epistemic utility (and other useful facts about epistemic betterness). Australasian Journal of Philosophy, 94, 576–590.
Gärdenfors, P. (2000). Conceptual spaces. Cambridge, MA: MIT Press.
Good, I. J. (1952). Rational decisions. Journal of the Royal Statistical Society, B14, 107–114.
Greaves, H., & Wallace, D. (2006). Justifying conditionalization: Conditionalization maximizes expected epistemic utility. Mind, 115, 607–632.
Joyce, J. M. (1998). A nonpragmatic vindication of probabilism. Philosophy of Science, 65, 575–603.
Joyce, J. M. (2009). Accuracy and coherence: Prospects for an alethic epistemology of partial belief. In F. Huber & C. Schmidt-Petri (Eds.), Degrees of belief (pp. 263–297). Dordrecht: Springer.
Kuipers, T. A. F. (2000). From instrumentalism to constructive realism. Dordrecht: Kluwer.
Kuipers, T. A. F. (2001). Structures in science. Dordrecht: Kluwer.
Kuipers, T. A. F. (2019). Nomic truth approximation revisited. Basel: Springer.
Levinstein, B. A. (2017). A pragmatist’s guide to epistemic utility. Philosophy of Science, 84, 613–638.
Levinstein, B. A. (2019). An objection of varying importance to epistemic utility theory. Philosophical Studies, 176, 2919–2931.
Machery, E., Mallon, R., Nichols, S., & Stich, S. P. (2004). Semantics, cross-cultural style. Cognition, 92, B1–B12.
McCutcheon, R. G. (2019). In favor of logarithmic scoring. Philosophy of Science, 86, 286–303.
McCutcheon, R. G. (2021). A note on verisimilitude and accuracy. British Journal for the Philosophy of Science, in press.
Meng, X. L., Rosenthal, R., & Rubin, D. B. (1992). Comparing correlated correlation coefficients. Psychological Bulletin, 111, 172–175.
Murphy, A. H. (1969). On the ranked probability score. Journal of Applied Meteorology, 8, 988–989.
Murphy, A. H. (1993). What is a good forecast? An essay on the nature of goodness in weather forecasting. Weather and Forecasting, 8, 281–293.
Niiniluoto, I. (1984). Is science progressive? Dordrecht: Reidel.
Niiniluoto, I. (1987). Truthlikeness. Dordrecht: Reidel.
Niiniluoto, I. (1998). Verisimilitude: The third period. British Journal for the Philosophy of Science, 49, 1–29.
Niiniluoto, I. (1999). Critical scientific realism. Oxford: Oxford University Press.
Oddie, G. (2019). What accuracy could not be. British Journal for the Philosophy of Science, 70, 551–580.
Roche, W., & Shogenji, T. (2018). Information and inaccuracy. British Journal for the Philosophy of Science, 69, 577–604.
Rosenkrantz, R. D. (1981). Foundations and applications of inductive probability. Atascadero, CA: Ridgeview.
Schoenfield, M. (2021). Accuracy and verisimilitude: The good, the bad and the ugly. British Journal for the Philosophy of Science, in press.
Schurz, G. (1987). A new definition of verisimilitude and its applications. In P. Weingartner & G. Schurz (Eds.), Logic, Philosophy of Science and Epistemology (pp. 177–184). Vienna: Hölder-Pichler-Tempsky.
Schurz, G. (1991). Relevant deduction. Erkenntnis, 35, 391–437.
Schurz, G. (2011). Verisimilitude and belief revision: With a focus on the relevant element account. Erkenntnis, 75, 203–221.
Schurz, G. (2019). Hume’s problem solved: The optimality of meta-induction. Cambridge, MA: MIT Press.
Selten, R. (1998). Axiomatic characterization of the quadratic scoring rule. Experimental Economics, 1, 43–62.
Shepard, R. N. (1987). Toward a universal law of generalization for psychological science. Science, 237, 1317–1323.
Weinberg, J. M., Gonnerman, C., Buckner, C., & Alexander, J. (2010). Are philosophers expert intuiters? Philosophical Psychology, 23, 331–355.
The Supplementary Materials for this paper, containing the R code used for the simulations to be reported, can be retrieved from this repository: https://osf.io/n8e2g/?view_only=72d468e2eabe4100b9409985c4a10950
This article belongs to the topical collection on Approaching Probabilistic Truths, edited by Theo Kuipers, Ilkka Niiniluoto, and Gustavo Cevolani.
Douven, I. Scoring, truthlikeness, and value. Synthese 199, 8281–8298 (2021). https://doi.org/10.1007/s11229-021-03162-z