Reliability for degrees of belief


We often evaluate belief-forming processes, agents, or entire belief states for reliability. This is normally done with the assumption that beliefs are all-or-nothing. How does such evaluation go when we’re considering beliefs that come in degrees? I consider a natural answer to this question that focuses on the degree of truth-possession had by a set of beliefs. I argue that this natural proposal is inadequate, but for an interesting reason. When we are dealing with all-or-nothing belief, high reliability leads to high levels of truth-possession. However, when it comes to degrees of belief, reliability and truth-possession part ways. The natural answer thus fails to be a good way to evaluate degrees of belief for reliability. I propose and develop an alternative method, based on the notion of calibration suggested by Frank Ramsey, which does not have this problem, and I consider why we should care about such assessments of reliability even if they are not tied directly to truth-possession.


  1. For early work on the former, see Goldman (1979); for early work on the latter, see Armstrong (1973).

  2. Sometimes withholding is thought of as the failure to take a doxastic attitude with respect to P; sometimes it is thought of as a third kind of doxastic attitude.

  3. I don’t want to minimize the importance of disputes about how exactly to construct these sets. For example, there has been considerable debate about whether the truth-ratio for a process should be calculated based on the set of beliefs produced by that process in the actual world, in all possible worlds, or in some special subset of the possible worlds (see, for instance, Goldman 1986, 1988). But these disputes take place against broad general agreement about how to understand reliability. For an alternative view about reliability, which is critical of the appeal to truth-ratios, see Baumann (2009).

  4. This kind of proposal has roots in the scoring rules of Brier (1950), De Finetti (1972), and Savage (1971).

  5. Very roughly, a process is well-calibrated if it produces beliefs in P to degree \(n\) and a proportion \(n\) of the propositions like P are true (for example, credence 0.7 with 70% of such propositions true).

  6. Although perhaps to be rational they must. A set of credences is probabilistically coherent just in case it satisfies the standard Kolmogorov axioms:

    (1) For all \(P\), \(c(P) \ge 0\),

    (2) For all logical truths \(\top\), \(c(\top) = 1\), and

    (3) \(c(P \vee Q) = c(P) + c(Q)\) for all \(P\), \(Q\) such that \((P \wedge Q)\) is contradictory.

  7. The literature here is vast. For early work, see Brier (1950), De Finetti (1972), and Savage (1971). More recent philosophical work on this topic has been done by Joyce (1998, 2009), Greaves and Wallace (2006), Gibbard (2008), and Leitgeb and Pettigrew (2010a, b).

  8. See Seidenfeld (1985, p. 285).

  9. Seidenfeld (1985, pp. 285–286) calls this the Quadratic Loss Function. It is also very similar to De Finetti’s S score (De Finetti 1972, p. 30).

  10. The truth-possession accounts that Goldman has defended (Goldman and Shaked 1991; Goldman 1999, 2010), whether one reads them as analyzing reliability or not, all appeal to the Linear Score. The reason given here for not using a Linear Score in an analysis of reliability applies equally to Goldman’s use of it.

  11. DePaul (2004) criticizes Goldman (1999) for construing the epistemic good as truth-possession. His strongest criticism relies on the fact that Goldman’s proposal requires agents to move their degrees of belief to extreme values in something like the way illustrated here (DePaul calls this ‘epistemic swashbuckling’). However, DePaul doesn’t attribute the problem to the scoring rule being improper, nor does he consider the modification of using a proper scoring rule to get around this problem.
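The contrast drawn in notes 10 and 11 can be illustrated numerically. The following is a minimal sketch, not the paper's own formalism: it assumes the Linear Score penalizes a credence by its absolute distance from the truth value, treats both scores as penalties (lower is better), and uses hypothetical function names. Under those assumptions, a proposition that is true with probability 0.7 is best served, by the linear penalty's lights, by the extreme credence 1, whereas the Brier penalty is minimized by credence 0.7.

```python
def expected_linear(p, x):
    """Expected linear penalty |truth - x| when the proposition is true with probability p.
    Assumed form of the Linear Score; the source does not spell it out here."""
    return p * abs(1 - x) + (1 - p) * abs(0 - x)


def expected_brier(p, x):
    """Expected quadratic (Brier) penalty (truth - x)^2 under the same assumption."""
    return p * (1 - x) ** 2 + (1 - p) * (0 - x) ** 2


def best_credence(expected_score, p, steps=100):
    """Grid-search the credence in [0, 1] that minimizes the expected penalty."""
    grid = [i / steps for i in range(steps + 1)]
    return min(grid, key=lambda x: expected_score(p, x))


p = 0.7  # illustrative probability of truth
linear_optimum = best_credence(expected_linear, p)
brier_optimum = best_credence(expected_brier, p)
print(linear_optimum)  # 1.0: the linear penalty rewards moving to the extreme
print(brier_optimum)   # 0.7: the Brier penalty rewards matching the probability
```

The grid search is only a device for making the point visible; the same result follows analytically, since the expected linear penalty is linear in the credence and the expected Brier penalty is a parabola with its minimum at \(p\).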

  12. It is worth noting that there are many proper scoring rules (Savage 1971; Seidenfeld 1985; Gibbard 2008). So, although the Brier Score offers one way of pursuing the truth-possession approach to reliability, it is not the only way. Joyce (2009) offers some arguments on behalf of the Brier Score being a privileged choice, but I’ll leave this issue to the side, focusing instead on a complication that arises for any approach to reliability evaluation that appeals to truth-possession.

  13. Note that this description does not fully determine an evaluative scheme. In particular, one could take the average Brier Score for a set of propositions by averaging the Brier Score of all the atomic propositions, or by averaging the Brier Score of all the elements/worlds of the probability space, or indeed by averaging the Brier Score of some other way of cutting up the probability space. How one decides to compute these average scores can make a difference to the overall verdict. Throughout the body of the paper, I will assume that there is some non-arbitrary way of figuring out what the atomic propositions are and that scores are assigned to the atomic propositions produced by a process. Since I will argue that there is a problem with truth-possession proposals independent of this issue, I don’t consider it any further.

  14. In general, Process \(n\) outperforms Process \(k\) (\(k\ne n\)) whenever the proportion of true propositions in P is \(n\).

  15. I’ll assume that such a view says that the more reliable a process that produces a belief, the more justified it is. This is a plausible thing for reliabilism about justification to say, but it is not universally endorsed. Though it makes the argument easier to present, it is ultimately inessential to it.

  16. Given how Process a works, we know that 95% of the propositions in \({\mathbf{A}}\) will be true. We can thus calculate the average Brier Score for Process a: \(0.95 \times (1 - 0.95)^2 + 0.05 \times (0 - 0.95)^2 = 0.0475\). Similar calculations yield the scores for the other processes.
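The arithmetic in note 16 can be checked with a short sketch. The function name and its two-parameter form are assumptions made for illustration: it covers only the simple case where a process assigns one and the same credence to every proposition in a set with a known truth-ratio.

```python
def average_brier(credence, truth_ratio):
    """Average Brier Score when every proposition receives the same credence
    and a fraction `truth_ratio` of the propositions are true."""
    penalty_if_true = (1 - credence) ** 2
    penalty_if_false = (0 - credence) ** 2
    return truth_ratio * penalty_if_true + (1 - truth_ratio) * penalty_if_false


# Process a: credence 0.95 in each proposition, 95% of which are true.
score_a = average_brier(credence=0.95, truth_ratio=0.95)
print(round(score_a, 4))  # 0.0475
```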

  17. This argument is analogous to Cohen’s (1984) “new evil demon problem”. Internalists about justification maintain that justification supervenes only on internal features of agents. Externalists about justification deny this. In Cohen’s scenario you are to imagine an internal twin of yourself who is living in a demon world. Just like you, your twin believes that he has a hand. But your twin’s belief is not produced by a reliable process since the demon is constantly deceiving him. This puts pressure on the externalist to admit that an unreliable process can produce justified beliefs. Why? Because your twin seems to see his hand and in virtue of that believes he has a hand. There seems to be no other belief that your twin could form that would be more justified. Thus, his unreliably formed belief is justified. Similarly, I maintain that since Process b is making the best response it can, the credence produced is justified.

  18. This point connects up in interesting ways with recent work on epistemic value. In his contribution to a recent book, Duncan Pritchard (Pritchard et al. 2010) considers (without endorsing) the following view:

    • Epistemic Value T-Monism: True belief is the sole fundamental epistemic good.

    As stated, Epistemic Value T-Monism doesn’t say anything about degrees of belief. But it is not implausible to extend the view as follows:

    • Epistemic Value T-Monism (degrees): Truth-possession is the sole fundamental epistemic good, the more the better.

    Notice that if you are committed to Epistemic Value T-Monism (degrees), then you have to say that from the perspective of epistemic value, Process a* is better than Process b. To the extent that one is convinced by my arguments here, one has reason to maintain that Process a* is less reliable than Process b. This either shows that greater reliability need not go together with greater epistemic value, or that Epistemic Value T-Monism (degrees) is mistaken. The latter option, in turn, puts pressure on one to reject Epistemic Value T-Monism. In section 4.2 I consider how one might resolve this.

  19. See Blattenberger and Lad (1985) for a graphical representation of the relationship between calibration and the Brier Score, which demonstrates how one can trade off calibration for an increased Brier Score.

  20. As I discuss shortly, this basic idea is considered, though rejected, in Goldman (1986). The introduction to Goldman (2012), however, suggests that Goldman is now sympathetic to such a view. Bas van Fraassen (1983) considers the notion of calibration for credences, arguing that when a credence is calibrated with the frequencies, then it is vindicated. This is similar to how an all-or-nothing belief is vindicated when it is true. Alan Hájek (unpublished) also proposes that it is good for credences to be calibrated, but he suggests that they should be calibrated to the objective chances rather than to the frequencies. Neither van Fraassen nor Hájek, however, argues that calibration is related to reliability. Instead, both of them think of calibration as replacing degree of truth-possession as the overarching epistemic goal. Marc Lange (1999) also investigates the notion of calibration; however, he focuses on agents who believe they are calibrated, not those who actually are.

  21. See also DeGroot and Fienberg (1983) and Blattenberger and Lad (1985). For a philosophical treatment of this, see Joyce (2009). Schmitt (2000, pp. 265–268) has an informal discussion of some of this material specifically related to Goldman’s social epistemology.

  22. As is well known, the Brier Score is not the only proper scoring rule. In general, any proper scoring rule has a decomposition into a calibration component and a refinement component, as illustrated in the text for the Brier Score (DeGroot and Fienberg 1983, pp. 19–21). I focus in the text on the Brier Score, but I offer no arguments here for its advantages over other proper scoring rules. The primary claim I wish to defend is that calibration rather than truth-possession is a good measure of reliability. This point is independent of the choice of scoring rule. It’s worth noting that all proper scores will agree on perfect reliability.
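For reference, the decomposition mentioned in note 22 can be written out for the Brier Score as follows. The notation is an assumption chosen to match the later notes rather than a quotation from the text: \({\mathbf{P}}_x\) is the set of propositions assigned credence \(x\), \(r_x\) is the truth-ratio of \({\mathbf{P}}_x\), and \({\mathbf{P}}\) is the whole set of propositions scored. Within a single cell, the average squared error splits into two parts:

\[
r_x (1 - x)^2 + (1 - r_x)(0 - x)^2 = (x - r_x)^2 + r_x (1 - r_x).
\]

Weighting each cell by its share of the propositions then gives the average Brier Score as a calibration term plus a refinement term:

\[
\text{Brier} = \sum_x \frac{|{\mathbf{P}}_x|}{|{\mathbf{P}}|} (x - r_x)^2 + \sum_x \frac{|{\mathbf{P}}_x|}{|{\mathbf{P}}|}\, r_x (1 - r_x).
\]

The first sum vanishes exactly when every credence matches the truth-ratio of its cell; the second depends only on how the truth-ratios are spread, not on the credences themselves.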

  23. ARI is the principle that expresses this. Goldman writes: “ARI A J-rule system R is right if and only if R permits certain (basic) psychological processes, and the instantiation of these processes would result in a truth-ratio of beliefs that meets some specified high threshold.” (Goldman 1986, p. 106).

  24. Since Goldman (1986) only defines perfect calibration, he doesn’t take a stand on how to measure the distance from perfect calibration, and so doesn’t take a stand on the precise form of the scoring rule.

  25. Seidenfeld (1985) notes this problem with calibration, too, though not specifically with respect to Goldman.

  26. To determine the reliability of process \(c\), we have one partition that includes all 100 propositions. Its score is \((0.7 - 0.7)^2 = 0\), which means it is perfectly reliable. For Process \(d\), we first partition the 100 propositions into two sets: a set of those propositions assigned 0.5 credence and a set of propositions assigned 0.9 credence. We then work out the score for each set (in this case, the first set’s score is \((0.5 - 0.5)^2 = 0\) and the second set’s score is \((0.9 - 0.9)^2 = 0\)). To work out the reliability of \(d\), we then take the weighted average of these scores, which is 0. \(d\) is thus assessed as perfectly reliable, too.
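The procedure described in note 26 can be sketched in a few lines. Two things here are assumptions rather than details from the note: the function name, and the even 50/50 split of Process \(d\)'s propositions between the two credence levels. The split is chosen for concreteness; the verdict does not depend on it, since each cell's score is 0.

```python
from collections import defaultdict


def calibration_score(beliefs):
    """Weighted average, over cells of equal credence, of (credence - truth_ratio)^2.
    `beliefs` is a list of (credence, is_true) pairs."""
    cells = defaultdict(list)
    for credence, is_true in beliefs:
        cells[credence].append(is_true)
    total = len(beliefs)
    score = 0.0
    for credence, truth_values in cells.items():
        truth_ratio = sum(truth_values) / len(truth_values)
        weight = len(truth_values) / total
        score += weight * (credence - truth_ratio) ** 2
    return score


# Process c: credence 0.7 in all 100 propositions, 70 of which are true.
process_c = [(0.7, True)] * 70 + [(0.7, False)] * 30
# Process d: 50 propositions at credence 0.5 (25 true), 50 at credence 0.9 (45 true).
process_d = ([(0.5, True)] * 25 + [(0.5, False)] * 25
             + [(0.9, True)] * 45 + [(0.9, False)] * 5)

score_c = calibration_score(process_c)
score_d = calibration_score(process_d)
print(score_c, score_d)  # 0.0 0.0: both processes count as perfectly calibrated
```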

  27. I address this kind of case in more detail in Section 4.2.

  28. I do not mean to commit myself to a controversial thesis about the relation between graded belief and binary belief. I simply mean that in evaluating processes for reliability, it is harmless to think of a binary belief as corresponding to a graded belief of 1. The ‘1’ simply indicates that the proposition assigned 1 is believed.

  29. The reliability verdicts given by Calibration Reliability to processes in binary models are ordinally equivalent to those given by standard truth-ratio based approaches. As with Brier Reliability, however, there will be some cardinal differences between the verdicts given by truth-ratio based approaches and those given by Calibration Reliability. This is due to the fact that both the Brier and the Calibration Scores are quadratic rules, which square the error term.

  30. This follows from the fact that \({\mathbf{P}}\) is constructed so that exactly half of its members are true.

  31. Such a decision depends on the probability that there is the relevant shadow given that there is a burglar, and the probability that there is the relevant sound given that there is a burglar. But if these are roughly the same, then \(f\) does seem preferable to \(e\).

  32. Thanks to an anonymous referee for giving this objection.

  33. If we stick to the 100 propositions actually assigned credence, \(R(d) = 0.17\) while \(R(c) = 0.21\).

  34. One might note that sometimes mid-level credences are informative. For example, suppose there are 11 possible answers to a question among which the inquirer is indifferent. After applying the process in question one answer is assigned credence 0.5 and the rest are assigned credence 0.05. This seems to be informative. Does this show that the Refinement Score doesn’t really measure informativeness? No. To see this note that there will be two cells in the partition for this process. In the first, \({\mathbf{P}}_{0.05}\), are all the propositions assigned credence 0.05. This set is 10 times larger than the set of propositions, \({\mathbf{P}}_{0.5}\), which are assigned credence 0.5. If we assume that the process is perfectly reliable (that is, perfectly calibrated), then the truth-ratios for these sets of propositions are \(r_{0.05} = 0.05\) and \(r_{0.5} = 0.5\). Thus, the Refinement Score is approximately 0.066 for this process, which is fairly good. Why is this? Although the credence of 0.5 is not all that informative, most propositions are assigned the informative credence of 0.05. Thus, in a simple case where we think a mid-level credence is informative, the Refinement Score tracks this.
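The figures in notes 33 and 34 can be reproduced with a small sketch. It assumes the Refinement Score is the size-weighted average of \(r(1 - r)\) over the cells of the partition, which is the form that yields the numbers reported in the notes; the function name is hypothetical, and the even split of Process \(d\)'s 100 propositions between credences 0.5 and 0.9 is likewise an assumption that matches \(R(d) = 0.17\).

```python
def refinement_score(cells):
    """Size-weighted average of r * (1 - r) over cells.
    `cells` is a list of (number_of_propositions, truth_ratio) pairs."""
    total = sum(count for count, _ in cells)
    return sum(count / total * ratio * (1 - ratio) for count, ratio in cells)


# Note 33: Process c has one cell; Process d is assumed to have two equal cells.
r_c = refinement_score([(100, 0.7)])
r_d = refinement_score([(50, 0.5), (50, 0.9)])

# Note 34: ten answers at credence 0.05 and one at 0.5, perfectly calibrated.
r_note34 = refinement_score([(10, 0.05), (1, 0.5)])

print(round(r_c, 2), round(r_d, 2), round(r_note34, 3))  # 0.21 0.17 0.066
```

Lower values indicate truth-ratios closer to 0 or 1, which is why the large cell at 0.05 pulls the score in note 34 down despite the single mid-level credence.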

  35. Where \(c(\cdot )\) is an agent’s current credence function, and \(c^E(\cdot )\) is that agent’s credence function after learning evidence \(E\) (and nothing else), Conditionalization says that the following should hold:

    COND: For all A, \(E\), \(c^{E}(A) = c(A|E)\) (so long as \(c(E) \ne 0\)).
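COND can be illustrated on a small finite model. This is a sketch under stated assumptions: credences are represented as a dictionary from hypothetical worlds to probabilities, propositions as sets of worlds, and all names and numbers are invented for illustration.

```python
def prob(credence, proposition):
    """Credence in a proposition, modeled as a set of worlds."""
    return sum(p for world, p in credence.items() if world in proposition)


def conditionalize(credence, evidence):
    """Return the credence function c^E obtained by conditionalizing on `evidence`."""
    c_e = prob(credence, evidence)
    if c_e == 0:
        raise ValueError("COND is undefined when c(E) = 0")
    return {world: (p / c_e if world in evidence else 0.0)
            for world, p in credence.items()}


prior = {"w1": 0.2, "w2": 0.3, "w3": 0.5}  # illustrative prior over three worlds
E = {"w1", "w2"}                           # evidence learned
A = {"w2", "w3"}                           # proposition of interest

posterior = conditionalize(prior, E)
conditional = prob(prior, A & E) / prob(prior, E)  # c(A|E) by the ratio formula
print(round(prob(posterior, A), 10), round(conditional, 10))  # 0.6 0.6
```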

  36. See, for instance, Maher (1996), Williamson (2000), Bird (2004), Silins (2005), Neta (2008), Dunn (2012).

  37. Jonathan Weisberg (2009) has dubbed this problem the Inputs Problem. Williamson (2000) rests his rejection of Jeffrey’s framework on the difficulty he sees in solving—or even saying anything useful about—the Inputs Problem. See Weisberg (2011) for further discussion of why the Inputs Problem is important.


  • Armstrong, D. (1973). Belief, truth and knowledge. Cambridge: Cambridge University Press.

  • Baumann, P. (2009). Reliabilism—modal, probabilistic, or contextualist. Grazer Philosophische Studien, 79, 77–89.

  • Bird, A. (2004). Is evidence non-inferential? The Philosophical Quarterly, 54, 252–265.

  • Blattenberger, G., & Lad, F. (1985). Separating the Brier score into calibration and refinement components: A graphical exposition. American Statistician, 39, 26–32.

  • Brier, G. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1), 1–3.

  • Cohen, S. (1984). Justification and truth. Philosophical Studies, 46, 279–295.

  • De Finetti, B. (1972). Probability, induction and statistics: The art of guessing. New York: Wiley.

  • DeGroot, M., & Fienberg, S. (1983). The comparison and evaluation of forecasters. Journal of the Royal Statistical Society D (The Statistician), 32(1/2), 12–22.

  • DePaul, M. (2004). Truth consequentialism: Withholding and proportioning belief to the evidence. Philosophical Issues, 14(1), 91–112.

  • Dunn, J. (2012). Evidential externalism. Philosophical Studies, 158(3), 435–455.

  • Gibbard, A. (2008). Rational credence and the value of truth. In T. Gendler & J. Hawthorne (Eds.), Oxford studies in epistemology (Vol. 2). Oxford: Oxford University Press.

  • Goldman, A. (1979). What is justified belief? In G. Pappas (Ed.), Justification and knowledge (pp. 1–23). Dordrecht: D Reidel.

  • Goldman, A. (1986). Epistemology and cognition. Cambridge, MA: Harvard University Press.

  • Goldman, A. (1988). Strong and weak justification. Philosophical Perspectives, 2, 51–69.

  • Goldman, A. (1999). Knowledge in a social world. Oxford: Oxford University Press.

  • Goldman, A. (2010). Epistemic relativism and reasonable disagreement. In R. Feldman & T. Warfield (Eds.), Disagreement (pp. 187–215). Oxford: Oxford University Press.

  • Goldman, A. (2012). Reliabilism and contemporary epistemology. Oxford: Oxford University Press.

  • Goldman, A., & Shaked, M. (1991). An economic model of scientific activity and truth acquisition. Philosophical Studies, 63(1), 31–55.

  • Greaves, H., & Wallace, D. (2006). Justifying conditionalization: Conditionalization maximizes expected epistemic utility. Mind, 115(459), 607–632.

  • Greco, J. (1999). Agent reliabilism. Noûs, 33, 273–296.

  • Hájek, A. (unpublished). A puzzle about degree of belief. Retrieved from

  • Jeffrey, R. (1965). The logic of decision. Chicago: University of Chicago Press.

  • Joyce, J. (1998). A nonpragmatic vindication of probabilism. Philosophy of Science, 65(4), 575–603.

  • Joyce, J. (2009). Accuracy and coherence: Prospects for an alethic epistemology of partial belief. In F. Huber & C. Schmidt-Petri (Eds.), Degrees of belief (pp. 263–297). New York: Springer.

  • Lange, M. (1999). Calibration and the epistemological role of Bayesian conditionalization. The Philosophical Review, 96(6), 292–324.

  • Leitgeb, H., & Pettigrew, R. (2010a). An objective justification of Bayesianism I: Measuring inaccuracy. Philosophy of Science, 77(2), 201–235.

  • Leitgeb, H., & Pettigrew, R. (2010b). An objective justification of Bayesianism II: The consequences of minimizing inaccuracy. Philosophy of Science, 77(2), 236–272.

  • Maher, P. (1996). Subjective and objective confirmation. Philosophy of Science, 63, 149–174.

  • Murphy, A. (1972). Scalar and vector partitions of the probability score: Part I. Two-state situation. Journal of Applied Meteorology, 11, 273–282.

  • Murphy, A. (1973). A new vector partition of the probability score. Journal of Applied Meteorology, 12, 595–600.

  • Neta, R. (2008). What evidence do you have? British Journal for the Philosophy of Science, 59, 89–119.

  • Pritchard, D., Millar, A., & Haddock, A. (2010). The nature and value of knowledge: Three investigations. Oxford: Oxford University Press.

  • Ramsey, F. P. (1931). Truth and probability. In R. B. Braithwaite (Ed.), The foundations of mathematics and other logical essays (pp. 156–198). London: Routledge.

  • Sanders, F. (1963). On subjective probability forecasting. Journal of Applied Meteorology, 2(2), 191–201.

  • Savage, L. (1971). Elicitation of personal probabilities and expectations. Journal of the American Statistical Association, 66(336), 783–801.

  • Schmitt, F. F. (2000). Veritistic value. Social Epistemology, 14(4), 259–280.

  • Seidenfeld, T. (1985). Calibration, coherence, and scoring rules. Philosophy of Science, 52, 274–294.

  • Silins, N. (2005). Deception and evidence. Philosophical Perspectives, 19, 375–404.

  • van Fraassen, B. (1983). Calibration: A frequency justification for personal probability. Boston Studies in the Philosophy of Science, 76, 295–319.

  • Weisberg, J. (2009). Commutativity or holism? A dilemma for conditionalizers. The British Journal for the Philosophy of Science, 60(4), 793.

  • Weisberg, J. (2011). Varieties of Bayesianism. In D. Gabbay, S. Hartmann, & J. Woods (Eds.), Handbook of the history of logic (Vol. 10). New York: Elsevier.

  • Williamson, T. (2000). Knowledge and its limits. Oxford: Oxford University Press.


Earlier versions of this paper were given at the Fall 2011 Meeting of the Indiana Philosophical Association, the 2012 Central APA, and at Western Michigan University. Thanks to all participants there. Thanks especially to Erik Wielenberg, Jennifer Lackey, Lara Buchak, Chris Meacham, James Joyce, Ethan Brauer, and anonymous reviewers for Philosophical Studies for very helpful comments.

Author information

Corresponding author

Correspondence to Jeff Dunn.

About this article

Cite this article

Dunn, J. Reliability for degrees of belief. Philos Stud 172, 1929–1952 (2015).


  • Goldman
  • Reliabilism
  • Reliability
  • Scoring rules
  • Credences
  • Degrees of belief
  • Bayesian
  • Calibration
  • Refinement
  • Power