1 Introduction

The received model of degrees of belief represents them as probabilities. Over the last half century, however, many philosophers have been convinced that this model fails because it cannot allow that degrees of belief respect the evidence—and in its place they have advocated a model that represents degrees of belief using imprecise probabilities.Footnote 1 This paper presents a model of degrees of belief based on Dempster–Shafer belief functions and argues that it does better than the imprecise probability model at accounting for the idea that degrees of belief should respect the evidence.

The paper proceeds as follows. Belief functions have not received a lot of attention in the philosophical literature, so Sect. 2 presents them in detail: not just the formal details but also a new way of picturing belief functions that should make them more accessible to intuition. After that, I present arguments for belief functions over imprecise probabilities as a model of evidence-respecting degrees of belief. The arguments focus on three kinds of issue. First, theoretical virtues. Section 3 argues that belief functions generally provide simpler models than imprecise probabilities. Section 4 argues that, while the imprecise probability framework lacks a workable story about how to extract information about an agent’s belief state from the formal model—about how the model relates to the phenomena—there is, in contrast, a simple and direct interpretation of belief function models. Section 5 argues that the belief function framework has the flexibility to model an important view about how degrees of belief can respond to certain kinds of evidence—whereas the imprecise probability model cannot faithfully model this view. Second, motivations. In the subsequent two sections I critique arguments for the imprecise probability framework. I discern two main lines of argument in the literature. Section 6 argues that the first fails. Section 7 argues that the second lends at least as much support to the belief function view. Third, problem cases. Section 8 considers two well-known problems for the imprecise probability view—dilation and belief inertia—and shows that the belief function view avoids these problems.

2 Belief functions

Van Fraassen (1989, p. 161) writes:

To depict your state of opinion, you can use a model which I call the Muddy Venn Diagram. Just represent the propositions you have an opinion about by areas on the usual Venn diagram, and represent your personal probability by means of some quantity of mud heaped on them...Call the total mud present one unit always; the proportion of mud on an area equals the probability of the proposition represented by that area.

Fig. 1: An exploded Venn diagram

We can picture degrees of belief modelled by belief functions in a similar way but with one crucial difference: instead of a Venn diagram we picture what I’ll call an exploded Venn diagram. Where van Fraassen has us spread our unit of belief over a Venn diagram such as the one at the centre of Fig. 1, my new picture involves our apportioning our unit of belief to its non-empty subsets considered individually—that is, the black shaded regions. (It is because I consider the subsets individually, rather than fitted together into the Venn diagram, that I use the term ‘exploded’—by analogy with exploded parts diagrams of the kind found in technical manuals.)

When we move to exploded Venn diagrams, the idea of representing the unit of belief by something spreadable like mud loses much of its point. We may just as well think of it as something rigid like a metal rod of unit mass which can be crosscut into sections of arbitrary length, each of which is then like a weight of the kind used on pan balances. The image then is that we distribute slices of the rod to the exploded black regions. Indeed you can even picture the lines connecting the black regions to the central Venn diagram as scale arms, and the black regions as pans, so that the exploded Venn diagram is like a set of scales with one pan per nonempty subset. You distribute a rod of unit mass—cut into cross-sections of whatever length you please—to these pans. You can put the whole rod (or, if you still prefer, all of the mud) on one region—or a bit here and a bit there.

Venn diagrams are useful when the space of possible outcomes is generated by a finite set of basic (logically atomic) propositions (or propositional variables). Starting with n such propositions generates \(2^n\) state descriptions: one per assignment of truth values to the basic propositions. These constitute the cells of a partition of the space: they correspond to the simple (not further subdividable) regions shown in the Venn diagram.Footnote 2 But the core idea also applies when the space of possible outcomes is given in some other way: for example, when it is partitioned into six cells, each corresponding to one of the possible ways a die could fall; or three cells, each corresponding to one of the suspects in a case being considered by a detective. In the latter case, van Fraassen’s idea would extend to having us spread mud over the (non-Venn) diagram at the centre of Fig. 2, while my new proposal would have us place weights on the black shaded regions.

Fig. 2: An exploded non-Venn diagram

The reader may object that it would be too mentally taxing to have to consider (say) the 63 separate nonempty subsets of a six-element sample space and apportion weight to each of them—whereas one can easily spread mental mud over a six-section grid. But this is a mistake. Suppose the evidence warrants assigning a certain weight to each specific outcome: say \(\frac{1}{6}\) to each section of the six-section grid.Footnote 3 Then one assigns \(\frac{1}{6}\) to each one-element subset and 0 to everything else: this is just as easy in the exploded case as it is in the unexploded case. In general, if the evidence warrants a particular probabilistic spreading of mud with amount \(a_i\) going to outcome \(A_i\), then in the exploded case one simply places weight \(a_i\) on each singleton subset \(\{A_i\}\) and places nothing on any other subset.Footnote 4 But suppose the evidence tells us only, say, that the perpetrator is twice as likely to be male as female. Then in the exploded case one can just assign \(\frac{2}{3}\) of the weight to the subset containing male suspects and \(\frac{1}{3}\) to the subset containing female suspects: simple! Whereas in the probability case you cannot assign \(\frac{2}{3}\) to the set of males without also assigning a specific amount to each male (and similarly for females). But you may have no evidence at all that tells you how to do that.Footnote 5 So it is never harder, and often easier, to place weights on the pans in the exploded diagram than to spread mud over the unexploded diagram.

The story about distributing parts of a unit weight to an exploded diagram gives an intuitive picture of belief functions. Now for the formal details. We begin with a space \(\Omega \) of mutually exclusive and jointly exhaustive possible outcomes. I shall take \(\Omega \) to be finite. This is not mathematically essential (see e.g. Shafer (1979)) but it is simpler. Yet simplicity is not my reason for assuming finitude. Rather, it seems that \(\Omega \) must be finite when it is supposed—as it is in this paper—to distinguish possible outcomes insofar as you could have evidence for one but not the other, or would potentially act differently if you took one rather than the other to be the case. The argument for this conclusion is essentially the same as Turing’s (1936, §1, §9) argument that a person engaged in mechanical computation can make use of only finitely many symbols and can be in one of only finitely many different internal states. While from a metaphysical point of view there are presumably infinitely many possible worlds, from the point of view of an agent gathering and weighing evidence and deciding how to act, the relevant space \(\Omega \) of possible outcomes will be only finitely differentiable. If, for example, the agent is offered an infinite lottery, or faces a decision situation in which the outcomes vary according to the value of a real-valued variable such as temperature, the relevant sample space will be an (extremely large) finite one that ‘chunks’ the possibilities at a point where the agent is no longer able to tell which one has occurred (she cannot measure temperature with infinite precision) or to care which one has occurred (she does not have the mental capacity to desire differently every one of an infinite number of possible dollar amounts of prize money).Footnote 6 Following Shafer (1976, p. 36) I call \(\Omega \) the frame of discernment.

A basic mass assignmentFootnote 7 is a function \(m:2^\Omega \rightarrow [0,1]\) satisfying two conditions:Footnote 8

$$\begin{aligned} \begin{matrix} 1. &{} \quad m(\emptyset )=0 \\ 2. &{} \quad \sum _{A\subseteq \Omega } m(A)=1 \end{matrix} \end{aligned}$$
(1)

The exploded diagram image depicts m. m(A) is the amount of mass placed on subset A and the entire unit mass (= condition 2) must be distributed among the nonempty (= condition 1) subsets.Footnote 9

Given a basic mass assignment m the corresponding belief function \(Bel:2^\Omega \rightarrow [0,1]\) is defined as follows (for all \(A\subseteq \Omega \)):

$$\begin{aligned} Bel(A)=\sum _{B\subseteq A}m(B) \end{aligned}$$
(2)

Bel is the proposed model of the agent’s degrees of belief. The degree of belief in a subset A of the frame of discernment is the sum of the basic masses assigned to all subsets of A (including A itself). In terms of the exploded diagram image, it is the sum of the masses assigned to shapes that do not protrude outside the A shape when the exploded pieces are assembled.
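To make the recipe concrete, here is a minimal computational sketch (Python is used purely for illustration; the representation of subsets as frozensets, the suspects frame, and the weight 0.6 are illustrative choices of mine, not part of the formalism). It checks the two conditions of Eq. (1) and computes degrees of belief directly from Eq. (2):

```python
OMEGA = frozenset({"Alice", "Bob", "Carol", "Dave", "Edwina", "Frank"})

def is_basic_mass_assignment(m):
    """Conditions 1 and 2 of Eq. (1): no mass on the empty set, total mass 1."""
    return m.get(frozenset(), 0) == 0 and abs(sum(m.values()) - 1) < 1e-9

def bel(m, a):
    """Eq. (2): the degree of belief in a is the sum of the masses on subsets of a."""
    return sum(mass for b, mass in m.items() if b <= a)

# Illustrative evidence: a witness reports that a woman left the building,
# warranting (say) mass 0.6 on the set of female suspects and the remaining 0.4 on OMEGA.
women = frozenset({"Alice", "Carol", "Edwina"})
m = {women: 0.6, OMEGA: 0.4}

assert is_basic_mass_assignment(m)
print(bel(m, women))                 # 0.6 -- a woman did it
print(bel(m, OMEGA - women))         # 0.0 -- no mass sits on any subset of the men
print(bel(m, frozenset({"Alice"})))  # 0.0 -- no mass bears on Alice specifically
print(bel(m, OMEGA))                 # 1.0
```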

The sets to which m assigns a nonzero value are called the focal sets of m. If the focal sets of m are all singletons, the belief function Bel corresponding to m is a probability function—that is:

$$\begin{aligned} \begin{matrix} 1. &{} \quad Bel(\emptyset )=0 \\ 2. &{} \quad Bel(\Omega )=1 \\ 3. &{} \quad Bel(A\cup B)=Bel(A)+Bel(B) \text { whenever } A\cap B=\emptyset \end{matrix} \end{aligned}$$
(3)

A belief function that satisfies these three conditions is called Bayesian.

The difference between the probability and belief function frameworks is thus as follows. In both cases one assigns a number between 0 and 1 to each subset A of the sample space, representing one’s degree of belief that the true outcome is in A. In the probability case these assignments are all determined by an assignment of portions of a unit of mass to singleton subsets (or equivalently elements) of the sample space. The degree of belief of an arbitrary subset A is then the sum of the masses assigned to A’s singleton subsets (or members). The belief function approach can be described by removing the word ‘singleton’. We start with an assignment of portions of a unit of mass to subsets of the sample space. The final or overall degree of belief in a subset A is then the sum of the masses assigned to A’s subsets. Both the probability and belief function frameworks thus involve a distinction between fundamental and derived quantities. The fundamental probabilities are the assignments to singleton sets (or equivalently elements). The derived probabilities are the assignments to arbitrary subsets (derived from the fundamental probabilities by addition). In the belief function framework, the basic masses (the values of the basic mass assignment m) are the fundamental assignments. The degrees of belief (the values of the belief function Bel) are the derived assignments (derived from the basic assignments by Eq. (2)).Footnote 10

The idea is that a basic mass assignment is warranted or induced by a body of evidence. For example, if \(\Omega \) is a set of suspects {Alice, Bob, Carol, Dave, Edwina, Frank} then the testimony of a witness that a woman left the building will warrant a basic mass assignment that assigns some positive mass p to {Alice, Carol, Edwina} and the rest (i.e. \(1-p\)) to \(\Omega \).Footnote 11 The testimony of a witness that Alice left the building will warrant an assignment of q to {Alice} and \(1-q\) to \(\Omega \). The discovery that the suspects rolled a fair die to determine who would commit the crime leads to an assignment of \(\frac{1}{6}\) to each singleton subset of \(\Omega \) (which induces a Bayesian belief function). A completely empty body of evidence induces the assignment of 1 to \(\Omega \); and so on. The belief function generated from the mass function then represents overall degree of belief. The overall degree of belief that the perpetrator was a woman, say, will be not only the mass assigned to {Alice, Carol, Edwina} directly on the basis of a given body of evidence, but also the masses so assigned to {Alice} and all other subsets of the set of women. This is because if Alice did it, then a woman did it: so direct evidence for Alice increases the overall degree of belief that the perpetrator was a woman. If we think of subsets of the frame of discernment as propositions then the idea is that the overall degree of belief in A is the sum of all the masses assigned to propositions that imply A: propositions such that if any of them is true, then A is true.

Now what happens when we want to combine evidence from different sources—say from independent witnesses? Different mass assignments, induced by distinct bodies of evidence, are combined by Dempster’s rule of combination. Suppose that mass assignment \(m_1\) has focal sets \(A_1,\ldots ,A_k\) and mass assignment \(m_2\) has focal sets \(B_1,\ldots ,B_l\) (and—this will be explained shortly—suppose \(\sum _{i,j: A_i\cap B_j=\emptyset } m_1(A_i)m_2(B_j)<1\)). Then their combination is the mass assignment m defined by \(m(\emptyset )=0\) and for all nonempty \(A\subseteq \Omega \):

$$\begin{aligned} m(A)=\frac{\sum _{i,j: A_i\cap B_j=A} m_1(A_i)m_2(B_j)}{1-\sum _{i,j: A_i\cap B_j=\emptyset } m_1(A_i)m_2(B_j)} \end{aligned}$$
(4)

The core idea here is that the combination of the evidence that induces \(m_1\) and the evidence that induces \(m_2\) warrants assigning mass to a set A if \(m_1\) and \(m_2\) assign mass to sets whose intersection is A. Specifically, it warrants assigning the product of the masses assigned to these sets by \(m_1\) and \(m_2\). Of course there may be more than one pair of focal sets of \(m_1\) and \(m_2\) whose intersection is A, so we sum across all such pairs. That gives the numerator of (4). The reason for the inclusion of the denominator is that there may be pairs of focal sets of \(m_1\) and \(m_2\) whose intersection is the empty set. But one of the defining conditions of a basic mass assignment (1 in (1)) is that the empty set be assigned 0. So we ‘confiscate’ this mass from the empty set and then—in order that the assignments made by m should sum to 1 (this being the other defining condition of a basic mass assignment: 2 in (1))—add it to the assignments to non-empty sets. Specifically, we renormalise by dividing each such assignment by 1 minus the total amount that would have gone to the empty set—the denominator of (4)—thus proportionally inflating each assignment so that together they sum to 1. The reason for the parenthetical condition above can now be seen. If it is not met, then (4) is mathematically undefined (it involves division by 0). In this case \(m_1\) and \(m_2\) flatly contradict each other and cannot be combined.Footnote 12
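Eq. (4) is equally straightforward to implement. The sketch below (again only an illustration, with assumed witness weights \(p=0.6\) and \(q=0.5\)) combines the two witness reports about the suspects from the example above; since no pair of focal sets here has empty intersection, no mass is confiscated and no renormalisation occurs:

```python
from itertools import product

def combine(m1, m2):
    """Dempster's rule, Eq. (4): multiply the masses of intersecting focal sets,
    confiscate any mass that would land on the empty set, and renormalise."""
    raw, conflict = {}, 0.0
    for (a, x), (b, y) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            raw[inter] = raw.get(inter, 0.0) + x * y
        else:
            conflict += x * y
    if conflict >= 1.0:
        raise ValueError("flat contradiction: the assignments cannot be combined")
    return {s: mass / (1.0 - conflict) for s, mass in raw.items()}

OMEGA = frozenset({"Alice", "Bob", "Carol", "Dave", "Edwina", "Frank"})
women = frozenset({"Alice", "Carol", "Edwina"})

m1 = {women: 0.6, OMEGA: 0.4}                 # witness 1: a woman left (weight p = 0.6)
m2 = {frozenset({"Alice"}): 0.5, OMEGA: 0.5}  # witness 2: Alice left (weight q = 0.5)

for s, mass in combine(m1, m2).items():
    print(set(s), round(mass, 3))
# {'Alice'} gets 0.5, the female suspects get 0.3, and OMEGA keeps 0.2
```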

We have seen how a basic mass assignment generates a belief function via Eq. (2). The notion of a belief function can also be defined abstractly (i.e. conditions given such that any function that meets them is called a ‘belief function’) and any such function Bel can then be shown to induce a function m that satisfies the defining conditions in (1) of a basic mass assignment (which in turn generates Bel again via (2)). Talk in terms of basic mass assignments and talk in terms of belief functions is thus intertranslatable.Footnote 13 For example, I presented Dempster’s rule of combination as a means for combining basic mass assignments but it can also be presented in a translated form as a means for combining belief functions. Nevertheless, belief functions that come from nowhere are of little interest to us here. The aim is to show that belief functions provide a better account than imprecise probabilities of how degrees of belief can be appropriately responsive to evidence. Accordingly, our chief concern is with belief functions that come from a mass function m which is induced by a body of evidence.

To conclude this section, consider two ways of characterising belief functions that, while not incorrect, fail to get to the heart of the matter. First, it is common to characterise the difference between probability functions and belief functions in terms of additivity: probability functions are additive and (in general) belief functions are not. Of course this difference is a genuine one—but it is simply a symptom of the really crucial difference. Both the probability model and the belief function model of degrees of belief share a common starting point. Evidence can warrant greater or lesser degrees of belief in different propositions. We model this with the idea of a unit mass of belief which is to be distributed among propositions (or subsets of the sample space) with the mass assigned to P representing the degree to which the evidence supports P (specifically). We then calculate final credence values for sets by summing the masses assigned to their subsets. The crucial difference is that in the probability model, the initial mass must be fully distributed amongst singleton subsets—it is part of the setup that the force of the evidence must ultimately bear on maximally specific propositions—whereas in the belief function model this need not be the case. This then generates additivity of final values in the probability model, but not in the belief function case. In both models, belief mass accumulates up the food chain, so to speak, as we ascend from subsets to supersets. What distinguishes them is that in the belief function model, mass can be attached directly to higher-level sets—thus generating super-additive behaviour—whereas in the probability model, basic masses must all be attached at the bottom level. So additivity is merely a symptom: the heart of the matter is whether evidence can warrant attaching some mass of belief directly to a non-singleton subset of outcomes.Footnote 14
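For a concrete illustration of the super-additive behaviour, take a three-element frame \(\{a,b,c\}\) with \(m(\{a,b\})=0.6\) and \(m(\{c\})=0.4\). Then, by Eq. (2), \(Bel(\{a\})=Bel(\{b\})=0\) while \(Bel(\{a,b\})=0.6\): the mass attached directly to \(\{a,b\}\) lifts the final value of the union above the sum of the final values of its parts. Force that same 0.6 down onto the singletons—say 0.3 each—and additivity is restored: the resulting belief function is Bayesian.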

Second, note that there is an injective mapping from belief functions to sets of probabilities. A belief function Bel generated (via Eq. (2)) by a basic mass assignment m is mapped to the set of Bayesian belief functions (i.e. probability functions) where each \(Bel'\) in the set is generated by a mass assignment \(m'\) that arises from m by dividing basic masses assigned by m to non-singleton focal sets, amongst the singleton subsets of those focal sets.Footnote 15 The lower envelope of this set of probability functions (i.e. the function that assigns to each subset S of the sample space, the infimum of the assignments made to S by the probability functions in the set) is then Bel itself (Halpern, 2017, p. 37). This means that it is not incorrect to think of a belief function as a lower bound on a set of probabilities.Footnote 16 Nevertheless this should not be our sole or even primary way of thinking about belief functions.Footnote 17 For thinking this way makes belief functions seem derivative from and more complex than probability functions: we go from one probability function, to a set of them, to a belief function as lower bound. However there is a clear sense in which belief functions are simpler and more fundamental than probability functions.Footnote 18 A probability function is determined by an assignment of a unit of mass subject to the constraint that positive mass may be assigned only to singleton subsets of the sample space \(\Omega \). A belief function is determined by an assignment that removes this constraint and is subject only to the minimal requirement that \(\emptyset \) be assigned 0. We may therefore regard a belief function as a more or less minimal formal representation of the very idea of (numerical) degree of belief: an assignment of numbers to propositions or subsets—directly modelling strength of belief in them—determined by an assignment of a unit of belief mass subject only to more or less minimal structural conditions.Footnote 19 This idea is conceptually prior to—not derivative from—the idea of a probability function. We get from belief functions to probability functions by adding constraints and (in that sense) increasing complexity.
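This lower-envelope fact can be checked by brute force on small examples. The sketch below (an illustration only, with an arbitrary three-element frame and arbitrary masses) enumerates the probability functions obtained by pushing each focal set's entire mass onto a single one of its elements—these are among the redistributions just described, and minimising over them already attains the lower envelope—and confirms that the minimum coincides with Bel on every subset:

```python
from itertools import combinations, product

def bel(m, a):
    """Eq. (2): the sum of the masses on subsets of a."""
    return sum(mass for b, mass in m.items() if b <= a)

def corner_redistributions(m):
    """Probability functions got by pushing each focal set's entire mass onto a
    single one of its elements: one choice of element per focal set."""
    focal = list(m.items())
    for choice in product(*(sorted(s) for s, _ in focal)):
        p = {}
        for (_, mass), elem in zip(focal, choice):
            p[elem] = p.get(elem, 0.0) + mass
        yield p

OMEGA = frozenset({"a", "b", "c"})
m = {frozenset({"a"}): 0.5, frozenset({"b", "c"}): 0.3, OMEGA: 0.2}  # illustrative masses

corners = list(corner_redistributions(m))
for r in range(len(OMEGA) + 1):
    for a in map(frozenset, combinations(sorted(OMEGA), r)):
        lower = min(sum(p.get(e, 0.0) for e in a) for p in corners)
        assert abs(lower - bel(m, a)) < 1e-9  # the lower envelope coincides with Bel(a)
print("lower envelope agrees with Bel on all", 2 ** len(OMEGA), "subsets")
```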

3 Simplicity

The traditional model of degrees of belief represents them as probabilities. A powerful and widespread line of objection to this model holds that precise probabilistic degrees of belief cannot, in general, be appropriately responsive to evidence:

it is sometimes rational to make no determinate probability judgment...refusal to make a determinate probability judgment...may derive from a very clear and cool judgment that on the basis of the available evidence, making a numerically determinate judgment would be unwarranted and arbitrary. (Levi, 1985, p. 396) (cf. also Levi, 1974, pp. 394–395)

The moral would seem to be that if we want to give evidence its due—if we want to maintain that a rational investigator ought to adopt only those states of opinion she has good reason to adopt—we had better conclude that Immodest Connectedness is not a principle you should want to satisfy. (Kaplan, 1996, pp. 27–28) (cf. also p. 24, p. 29)

precise degrees of belief are the wrong response to the sorts of evidence that we typically receive...since the data we receive is often incomplete, imprecise or equivocal, the epistemically right response is often to have opinions that are similarly incomplete, imprecise or equivocal. ...Precise credences...always commit a believer to extremely definite beliefs...even when the evidence comes nowhere close to warranting such beliefs (Joyce, 2010, p. 283, p. 285)

When there is little or no relevant evidence, the probability model should be highly imprecise or vacuous. More generally, the precision of probability models should match the amount of information on which they are based...‘Vagueness’ or lack of information should be reflected in imprecise probabilities (Walley, 1991, p. 34, p. 246) (cf. also p. 7)

When evidence is essentially sharp, it warrants a sharp or exact attitude; when evidence is essentially fuzzy—as it is most of the time—it warrants at best a fuzzy attitude. (Sturgeon, 2008, p. 159)

The kind of evidential situation that Walley is talking about when he speaks of there being “little or no relevant evidence” has been widely thought to show that we must replace precise probabilities with imprecise probabilities. I agree that it poses a severe problem for probabilism—but shall argue that if we want a formal model of evidence-respecting degrees of belief then we have better reasons to employ belief functions than imprecise probabilities.

Let’s start with the problem for the traditional view. Suppose that Max has absolutely no evidence about the bias/fairness of a given coin. Given probabilism, he must assign a degree of belief to the proposition that the coin will land heads that is representable as a real number between 0 and 1. Different versions of precise probabilism will make different recommendations—for example, Max might apply the principle of insufficient reason and assign 0.5, or he might follow a whim and assign (say) 0.3—but the point to be made below goes through in any case. Now suppose Max has the coin examined by a laboratory which reports that it is balanced in such a way as to give a chance of 0.5 (or 0.3) of heads. Max’s degree of belief remains 0.5 (or 0.3).Footnote 20 But then that belief state is not appropriately responsive to the evidence. In one case Max has no evidence but assigns degree of belief 0.5 (or 0.3); in the other case he has lots of evidence but still assigns degree of belief 0.5 (or 0.3).

By contrast, the situation is nicely handled by the belief function model. Taking the frame of discernment to be \(\{H, T\}\), where H (T) is the situation in which the coin comes up heads (tails), the case of total lack of evidence is modelled by a basic mass assignment that assigns 1 to \(\{H, T\}\). The case of strong evidence for a 0.5 (or 0.3) chance of heads is modelled by a basic mass assignment that assigns 0.5 each to \(\{H\}\) and \(\{T\}\) (or 0.3 to \(\{H\}\) and 0.7 to \(\{T\}\)—either way, this assignment generates a Bayesian belief function). As desired, the agent’s belief state when he has no evidence is different from his belief state when he has definitive evidence of a specific bias. Furthermore, the belief states are not simply different: the second arises from the first in light of the new evidence. Max begins with no evidence and hence assigns his entire belief mass to the whole frame of discernment (for he has no evidence on the basis of which to make any more specific assignment). He then gets evidence about the coin and on its basis he reassigns his belief mass accordingly.Footnote 21 Thus the model captures the idea of degrees of belief that respond to or respect the evidence.Footnote 22
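In terms of Eq. (2): under the first assignment \(Bel(\{H\})=Bel(\{T\})=0\) while \(Bel(\{H,T\})=1\)—full confidence that the coin will land one way or the other, but no positive degree of belief in either specific outcome; under the second, \(Bel(\{H\})=0.5\) and \(Bel(\{T\})=0.5\) (or 0.3 and 0.7).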

What about the imprecise probability model? Like the belief function model—and unlike the precise probability model—it offers different representations of Max’s belief states when he has no evidence and when he has definitive evidence of a specific bias. As in the belief function model, the case of strong evidence for a 0.5 (or 0.3) chance of heads is modelled by a single probability function that assigns 0.5 each to \(\{H\}\) and \(\{T\}\) (or 0.3 to \(\{H\}\) and 0.7 to \(\{T\}\)). The case of total lack of evidence is modelled differently: by the set of all probability functions over the sample space.

The argument for the imprecise probability model over the precise probability model is that the former allows us to distinguish cases that the latter forces us to conflate.Footnote 23 My argument for the belief function model over the imprecise probability model does not—indeed cannot—take the same form. The imprecise probability model can distinguish any cases that the belief function model can distinguish, because there is an injective mapping from belief functions to sets of probabilities (see Sect. 2). However this still leaves room to argue that the belief function model is superior to the imprecise probability model in other ways. In general, there is an abundance of well-studied formal structures, together with well-known methods for encoding the objects in certain structures into those in others. Given a formal model that can mark the distinctions we regard as important, we can get another formal model that is also adequate in this respect by encoding the objects used in the first modelling system into a different bunch of objects.Footnote 24 Yet we do not regard all such models equally favourably. Having the resources to mark intuitively important distinctions (such as that between having no evidence, and having strong evidence for no bias) is a necessary condition of adequacy on a formal model but there are also other desiderata. One of them is simplicity—or at least, lack of needless complexity. This is the first respect in which the belief function model shows its superiority.

Consider the situation where the agent has no evidence at all. Having absolutely no evidence is an incredibly simple kind of situation to be in. The belief function model represents it in a correspondingly simple way: an assignment of 1 to the frame of discernment. The imprecise probability model, by contrast, represents it in a highly complex way: an infinite set containing every possible probability function over the sample space. This is not only complex but also indirect: when one wades through this infinite set of precise assignments of this or that value to this or that subset of the sample space, one finds that the information it is supposed to encode is, essentially, nothing. The representation is like a politician’s speech: an overabundance of words masking a lack of content. The problem is not the lack of content: the point is to represent an ‘empty’ belief state—the result of a total lack of evidence. Rather, the complaint concerns the indirection and complexity of the representation. If you have nothing to present, better to do so straight up—as in the belief function representation—than wrapped in infinite layers like a giant pass-the-parcel with no central prize.

The core of the issue here is that the imprecise probability view has no simple or direct way of representing lack of evidence for P. This becomes a problem when the task is precisely to represent the belief states of agents with a paucity of evidence. In the belief function framework, having no evidence for P is represented by assigning 0 to P. This is as simple and direct a representation as one could hope for: no evidence, zero degree of belief. The guiding idea—the design brief—was to model evidence-respecting or evidence-responsive degrees of belief. And in this model, if there is no evidential pressure on the accelerator pedal, then indeed the engine of belief is not revving. Or to vary the image, picture evidence-responsive degrees of belief, as modelled by belief functions, in terms of an evidenceometer: strong evidence deflects the belief needle significantly, up towards 1; weak evidence deflects it a little bit, not so far above 0; and if there is no evidence then the needle does not move at all and just sits on 0. But this idea—that having no evidence for P is represented by assigning 0 degree of belief to P—does not work in the probabilistic framework, because 0 probability for P goes hand in hand with probability 1 for \(\lnot P\). So if you have no evidence at all (for P or \(\lnot P\)) then you cannot assign 0 to either of them (for then you would have to assign 1 to the other). So the imprecise probabilist represents a lack of evidence for P by an infinite set of probability functions which between them make all possible assignments to P. And while this captures the distinction between lack of evidence for P and \(\lnot P\), versus evidence that they are equally likely, it is nevertheless a roundabout and complex representation of the former situation—compared to the simple and direct representation in the belief function framework.Footnote 25

The situation is similar when there is some lack, but not a total lack of evidence. Consider Ellsberg’s (1961, p. 653) “urn known to contain 30 red balls and 60 black and yellow balls, the latter in unknown proportion.” Here we have evidence regarding the number of black-or-yellow balls but no evidence regarding the number of black (specifically) or yellow (specifically) balls.Footnote 26 The belief function representation is simple and direct. Let the frame of discernment be \(\{R,B,Y\}\), where R/B/Y is the situation in which a red/black/yellow ball is drawn. The basic mass assignment assigns \(\frac{1}{3}\) to \(\{R\}\) and \(\frac{2}{3}\) to \(\{B,Y\}\). So we have zero degree of belief in \(\{B\}\) and in \(\{Y\}\): zero evidence for black (specifically) so zero degree of belief (and similarly for yellow). As before, the imprecise probabilist cannot take this simple route. She represents the situation as an infinite set of probabilities, each of which assigns \(\frac{1}{3}\) to \(\{R\}\), and numbers x and y to \(\{B\}\) and \(\{Y\}\) such that \(x+y=\frac{2}{3}\). The imprecise probability framework can certainly distinguish between the Ellsberg case and the case in which we know that there are (say) 40 black balls and 20 yellow ones—but its representation of the former situation is more complex and less direct than the belief function representation. The latter is totally transparent: you have evidence regarding the number of black-or-yellow balls so you assign \(\{B\}\cup \{Y\}\) a positive degree of belief; you have no evidence regarding the number of black (specifically) or yellow (specifically) balls so you assign 0 to \(\{B\}\) and 0 to \(\{Y\}\).

When one compares the route from precise probabilism to belief functions with the route from precise probabilism to imprecise probabilism, it isn’t surprising that imprecise probability models end up being more complex and less transparent. The imprecise probability strategy for moving away from precise probabilities takes something that does not, in general, provide a good model of evidence-responsive belief states—a probability function—and multiplies it, to give a (generally infinite) set of such things. The belief function strategy takes the probability function and simplifies it, by removing one constraint from its definition. Instead of requiring basic belief masses to be distributed amongst singleton subsets of the sample space, they may be distributed amongst any subsets.Footnote 27 The imprecise probability model multiplies a constrained thing. The belief function model stays with a single thing but removes a constraint. The former moves in the direction of increasing complexity. The latter moves in the opposite direction.

4 Interpreting the models

While simplicity is desirable in a formal model, another feature is essential: there must be a clear and coherent story about how the models relate to the phenomena. In other work, I argue that the imprecise probability model of degrees of belief lacks such a story. That is, the model lacks a workable interpretation: a viable account of how we are to extract information about an agent’s belief state from a set of probabilities. I shall not repeat those arguments here—but I do need to show that the belief function model does have a workable interpretation.

In the traditional probabilistic model of degrees of belief, the interpretative story is simple and direct. A probability function assigns numbers in the interval [0, 1] to propositions, or subsets of the sample space—and those numbers directly represent strength of belief. The belief function model inherits this story. It too assigns numbers to propositions, or subsets of the sample space—and those numbers directly represent strength of belief. 0 means the weakest possible strength of belief. 1 means maximum strength belief. In between, bigger numbers mean stronger degrees of belief and smaller numbers mean weaker degrees of belief. Indeed the story about how the belief function model relates to the phenomena modelled is even more simple and direct than the story about the traditional probabilistic model, for the reason touched on in Sect. 3. In the belief function model, the number assigned to P represents the degree of belief in P that is warranted by the agent’s evidence: from 0 (no support) to 1 (complete support). In the probabilistic case the account is not quite so simple because of the bleed-through between the stories about P and \(\lnot P\). Assigning 0 to P does not simply indicate that the evidence provides no support for P: it goes hand in hand with assigning 1 to \(\lnot P\) and hence indicates that the evidence warrants such a degree of belief.

For certain sets of probability functions, their lower envelope is a belief function.Footnote 28 So an imprecise probabilist could, in theory, respond as follows: “I only countenance sets of probability functions whose lower envelopes are belief functions, and the only thing about such a set that I take to be meaningful—to represent something about degrees of belief—is its lower envelope. As for my interpretation of this lower envelope: it is exactly the same as your interpretation of a belief function, which I simply co-opt.” But then this imprecise probabilist would be better off jettisoning the set of probabilities—which by her own lights is meaningless baggage—and just taking the lower envelope as her model of degrees of belief. This would simplify her model, without removing anything she regards as important—as conveying information about degrees of belief. But of course it would also render her a belief function theorist.

5 Asymmetric evidence

Consider a detective who starts out with no evidence at all and then gains some evidence for (say) the proposition W that a woman did it. Her degree of belief in W will go up—but what happens to her degree of belief in the proposition M that a man did it? The belief function model has the capacity to say two different things here, depending on the nature of the evidence. First, suppose the situation is as follows. The detective has a bugging device, and she hears a woman’s voice on it—but there is a programming error in the bugging system, so that the central recorder is picking up signals from multiple bugs without logging which is which. So there is only a 10% chance that the recording of a woman’s voice comes from the scene of the crime (rather than one of the other nine locations she has bugged).Footnote 29 In the belief function framework this situation can be modelled as follows.Footnote 30 The detective starts with a basic mass assignment \(m_1\) with \(m_1(\Omega )=1\), and ends with a basic mass assignment \(m_2\) with \(m_2(W)=0.1\) and \(m_2(\Omega )=0.9\). Second, suppose the situation is different: there is no bugging device. Instead, the detective learns that a crime syndicate chose which of its operatives would do the job by flipping a coin: heads a woman, tails a man. The detective also learns that the coin used was biased, with a 10% chance of landing heads. In the belief function framework this situation can be modelled as follows. The detective again starts with a basic mass assignment \(m_1\) with \(m_1(\Omega )=1\). This time she ends with a basic mass assignment \(m_3\) with \(m_3(W)=0.1\) and \(m_3(M)=0.9\).

In both cases the detective moves from having zero degree of belief in W to having degree of belief 0.1. The difference between the two cases lies in the treatment of the remaining 0.9 belief mass. In the first case \(m_2\) assigns it to the sample space; in the second case \(m_3\) assigns it to M. Thus in the first case the detective’s degree of belief in M does not change: it remains at zero. In the second case the detective’s degree of belief in M goes up: from 0 to 0.9. The rationale for the different treatment of the two cases is as follows. In the first case, if the evidence ends up being good (i.e. the bug was from the relevant crime scene) then a woman did it. But if the evidence ends up being bad, that does not mean a man did it. It means the detective is back to square one, with no evidence. So the 90% chance that the evidence is bad does not contribute a weight of 0.9 to M: it remains at the level of the sample space. This aspect of belief functions can take a bit of getting used to, for those used to thinking in terms of probability functions—so here’s a way of thinking that might be helpful. The notions of evidence and proof are related: proof is the maximum level that evidence can attain. It is a familiar fact (and a significant one in, for example, the context of intuitionist logic) that absence of a proof of P is not necessarily the same as a proof of \(\lnot P\). This structure carries over to certain kinds of evidence, as modelled by belief functions. When you have (say) 0.7 degree of belief that P, based on evidence E, this does not automatically mean that the 0.3 degree to which E falls short of conclusively supporting P gets applied to \(\lnot P\): not being evidence for P is not necessarily the same as being evidence for \(\lnot P\). It can be that the 0.3 goes neither towards P nor towards \(\lnot P\), but gets applied to the background space of possibilities. In the second version of the detective case however, \(m_3\) does assign the remaining 0.9 to M, rather than to the sample space. This is because the evidence is different in this case. The coin had a 10% chance of landing heads. If it landed heads, a woman did it.Footnote 31 But furthermore, if it did not land heads, it landed tails. There is a 90% chance of that. And if that happened, a man did it. So 0.1 goes to W and 0.9 to M.
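In code (a sketch only; a two-cell frame and the labels ‘woman’/‘man’ suffice for the illustration, and the helper simply implements Eq. (2)), the asymmetry is immediate:

```python
def bel(m, a):
    """Eq. (2): the sum of the masses on subsets of a."""
    return sum(mass for b, mass in m.items() if b <= a)

OMEGA = frozenset({"woman", "man"})
W, M = frozenset({"woman"}), frozenset({"man"})

m2 = {W: 0.1, OMEGA: 0.9}  # the bugged recording: 10% chance the evidence bears on this case
m3 = {W: 0.1, M: 0.9}      # the syndicate's biased coin: heads a woman, tails a man

print(bel(m2, W), bel(m2, M))  # 0.1 0.0 -- belief in W rises, belief in M stays at zero
print(bel(m3, W), bel(m3, M))  # 0.1 0.9 -- both move, and they sum to 1
```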

The imprecise probability framework, on the other hand, cannot model the first kind of response to the evidence: an asymmetric response in which there is a change in the degree of belief in some proposition P (in this case W) but no change in the degree of belief in \(\lnot P\) (in this case M).Footnote 32 In the imprecise probability framework there will always be a symmetry between the treatments of P and \(\lnot P\), inherited from the underlying fact about probability functions that the probabilities of P and of \(\lnot P\) must sum to 1. If the probability of P is spread between m and n, then the probability of \(\lnot P\) must be spread between \(1-n\) and \(1-m\). If the minimum value assigned to P by probability functions in the set goes up (down), then the maximum value assigned to \(\lnot P\) goes down (up). If the maximum value assigned to P goes up (down), then the minimum value assigned to \(\lnot P\) goes down (up). In general, the treatment of \(\lnot P\) will always be the mirror image of the treatment of P, reflected around 0.5. The imprecise probabilist could reply that she cares only about minimum values (or maximum values—but one or the other, not both). For the minimum value assigned to P could move without the minimum value assigned to \(\lnot P\) moving. However imprecise probabilists tend to be keen to distance themselves even from the view that the only information conveyed by a set of probabilities is the lowest and highest values assigned to each proposition—let alone the view just mooted—precisely on the grounds that sets of probabilities convey more information.Footnote 33 An alternative line of reply for probabilists might be to try to deny the intuition that certain bodies of evidence can partially support P, without providing any support to \(\lnot P\).Footnote 34 But this would be a hard row to hoe, for the intuition is both long-lasting and widespread. Its force has been felt right back to the early days of probability theory, by such luminaries as Leibniz and Jacques Bernoulli.Footnote 35 And more recently, it has been expressed not only by historians of those early daysFootnote 36—nor only by proponents of belief functions such as Shafer (1976)—but also by proponents of distinct models, for example Cohen (1977) on Baconian (aka inductive) probability.

6 The compatibility argument

I have argued that the belief function framework is superior to the imprecise probability framework in several respects. Now I shall strengthen the relative position of the case for belief functions vis-à-vis the case for imprecise probabilities not by bolstering the former but by undercutting the latter. I identify two main kinds of argument for imprecise probabilities as a model of evidence-respecting degrees of belief. This section argues that the first kind fails. The next section argues that the second provides at least as much support for the belief function view.

The first kind of argument parallels the plurivaluationist approach to the semantics of vague discourse.Footnote 37 The classical approach to the logical formalisation of a discourse (exemplified for instance in the first-order formalisation of arithmetic) defines a formal language, and associates statements of the discourse with well formed formulas (wffs) of the language. It then captures the idea that some statements in the discourse are simply true by distinguishing, amongst the infinity of interpretations of the formal language—on some of which the wff associated with a given statement is true and on some of which it is false—a unique intended interpretation.Footnote 38 Truth (simpliciter) of a statement in the discourse becomes (model-theoretic) truth of the associated wff on the intended interpretation. The plurivaluationist approach to the semantics of vague discourse differs from this classical approach in only one respect: in place of a unique intended classical interpretation, it posits multiple acceptable classical interpretations.

There are two different kinds of thought that can be taken to motivate the plurivaluationist approach: one turning on the notion of precisification, the other on the notion of constraint. Starting with the former, the thought is that as it stands, ordinary vague language is not well modelled by a classical interpretation. (For example, our usage of the term ‘tall’ does not divide people neatly into two camps—the tall and the non-tall—whereas a classical interpretation assigns each predicate as extension a crisp subset of the domain, which divides it cleanly into two parts.) However, we could in principle completely precisify our language and then a classical interpretation would be an appropriate model of its semantics. But there are many different admissible ways in which we could precisify the language (e.g. we could set the cutoff for tall at \(6'\), or at \(6'1''\)). So we take as our model of the semantics of vague language the set of all interpretations, each of which would be the unique correct interpretation were we to precisify the language completely in one of the many admissible ways in which this could be done. The latter thought is a bit different. The idea is that usage places constraints on the correct interpretation of our language. For example, when we (the speakers) all deem a certain person to be clearly tall, we thereby rule out as incorrect an interpretation in which this person is not in the extension of (the predicate of the formal language associated with the word) ‘tall’. Where language is vague, however, these constraints fail to winnow down to a unique intended interpretation. Rather, we are left with a set of equally acceptable interpretations—and we take this set as our model of the semantics of vague language. Note that where the former thought gives a central place to the notion of precisification and to what would be a correct interpretation of the language were we to precisify it, the latter thought makes no reference to precisification: it considers interpretations that are compatible with (not ruled out by) our actual usage of vague language.

We can find a similar division of views in the imprecise probability literature. On the precisification side we have Jeffrey:Footnote 39

In practice, we achieve coherence only for preference rankings that involve small numbers of propositions: for small fragments of what our total preference rankings would be, had we the time, intelligence, experience, sensitivity, and patience to work them out. (Jeffrey, 1965, p. 533)

I do not take belief states to be determined by full preference rankings of rich Boolean algebras of propositions, for our actual preference rankings are fragmentary, i.e., they are rankings of various subsets of the full algebras. Then even if my theory were like Savage’s in that full rankings of whole algebras always determine unique probability functions, the actual, partial rankings that characterize real people would determine belief states that are infinite sets of probability functions on the full algebras...The role of definite probability measures in probability logic as I see it is the same as the role of maximal consistent sets of sentences in deductive logic...[they] play the role of unattainable completions of consistent human belief states. (Jeffrey, 1983, pp. 139–140)

On the constraint side we have Levi, Kaplan, van Fraassen and Joyce:

My thesis is that X’s credal state relative to K—i.e., the set of Q-functions seriously permissible according to X relative to K—should consist of all those Q-functions which X has no warrant for ruling out as impermissible relative to K. (Levi, 1980, p. 89)

giving evidence its due requires that you rule out as too high, or too low, only those values of con which the evidence gives you reason to consider too high or too low. As for the values of con not thus ruled out, you should remain undecided as to which to assign. (Kaplan, 1983, p. 570)

a finite and even small number of judgements may convey all there is to our opinion. But then there is a large class of probability functions which satisfy just those judgements, hence which are compatible with the person’s state of opinion. Call that his or her representor (class). (van Fraassen, 1990, p. 347)

Elements of C are, intuitively, probability functions that the believer takes to be compatible with her total evidence. (Joyce, 2010, p. 288)

Either way, to get an argument for imprecise probabilism we need a further step: an argument, in effect, as to why probability functions should be in the picture at all. Suppose we accept (a) that there are facts about an agent’s belief state that definitely rule out modelling it as certain probability functions, (b) that there are multiple probability functions that are not definitely ruled incorrect in this way, but (c) that nevertheless there are reasons against picking a single one of them as the model of the agent’s belief state. It simply does not follow that we should take the set of not-definitely-incorrect probability functions as our model of the agent’s belief state. For everything we have said is also perfectly compatible with taking a single belief function as our model of the agent’s belief state. (The definitely incorrect probability functions would then be seen as the ones that are not compatible with the belief function: i.e. not in the set of probabilities to which the belief function is mapped by the injective mapping discussed in Sect. 2.)Footnote 40

The point becomes very clear if we consider the analogous case of plurivaluationism about vagueness. Suppose we accept (a) that there are facts about (the usage of speakers of) the vague discourse that definitely rule out modelling it using certain classical interpretations, (b) that there are multiple classical interpretations that are not definitely ruled incorrect in this way, but (c) that nevertheless there are reasons against picking a single one of them as the model of the semantics of the discourse. (For example, we might agree with an epistemicist along the lines of Williamson (1994) that meaning facts must be determined by usage facts, but disagree that the usage facts suffice to determine a single correct classical interpretation.) It simply does not follow that we should take the set of not-definitely-incorrect classical interpretations as our model of the semantics of the vague discourse. For everything we have said is also perfectly compatible with taking a single non-classical interpretation as our model of the semantics: for example, a supervaluationist interpretation or a fuzzy interpretation.Footnote 41 Plurivaluationists need a further argument for why we should only countenance classical interpretations—and such arguments can indeed be found in the literature. Consider for example Lewis (1986, p. 212):

The only intelligible account of vagueness locates it in our thought and language. The reason it’s vague where the outback begins is not that there’s this thing, the outback, with imprecise borders; rather, there are many things, with different borders, and nobody’s been fool enough to try to enforce a choice of one of them as the official referent of the word ‘outback’.

There is a (more or less implicit) argument here for the plurivaluationist approach, which models vagueness using a multiplicity of classical interpretations rather than one nonclassical interpretation. The argument is that the world itself is precise: it offers up only precise objects and precise properties. So a model of how we talk about the world must be classical: it must represent us as picking out precise objects with our names and precise properties with our predicates. An interpretation in which names pick out unique vague objects and predicates pick out unique but inherently vague (or inherently fuzzy, or inherently partial or gappy) properties is a non-starter. Classical interpretations are the only game in town—so if one such interpretation won’t do (i.e. if a purely classical approach to vagueness fails) we’ll need to appeal to many of them at once.

I am not endorsing this argument for the plurivaluationist approach to vagueness—but at least it is an argument. In the probability case there is just as much need for an argument as to why probability functions are the only game in town and so if we cannot model our belief states using a unique probability function then we must model them using sets of such functions. For there is a completely different way of generalising the precise probability model: instead of many precise probability functions, we could appeal to a single non-probabilistic function. (Just as, in the vagueness case, instead of appealing to many classical interpretations, we could appeal to a single non-classical interpretation—this is the kind of approach Lewis is ruling out above.) That is what the belief function model does. To motivate the imprecise probability approach, one therefore needs an argument as to why one must stick with probability functions. And yet no clear or compelling argument along these lines has been given in the literature. Consider Kaplan (1989, pp. 57–58), who after saying that the set of probabilities model has “undeniable attractions”, sets out two of them—and then comes to the third:

Finally, the proposal still manages to keep faith with its orthodox predecessor: it requires that each degree of confidence assignment in your set of assignments be an orthodox bayesian degree of confidence assignment (that each of your many minds be an orthodox bayesian mind) and that, as a regulative ideal, each of these assignments obey the axioms of the probability calculus.

This is significant because it makes explicit what otherwise seems to be a pervasive but unspoken assumption in the imprecise probability literature: that probability functions should remain in the picture, even when we abandon traditional precise probabilism. (Note for example that Levi (1980) assumes a setup in which the agent eliminates probability functions insofar as this is warranted. He gives no explicit motivation for why probability functions are part of the setup in the first place, with a right to be there unless their elimination is warranted. Similar remarks apply to Kaplan (1996).)

Even in Kaplan, however, it is still an assumption that probabilities should be retained: the passage is quite clearly not an argument for retaining them. (The argument is that imprecise probabilism is good because it keeps faith with its orthodox predecessor by retaining probability functions.) I can see two ways to turn this into an argument—but both fail. First, one might argue that in theory change, there is a general methodological principle of minimal disruption: if the received view has problems, mutilate it as minimally as you can manage. But even if we accept this, we need to ask ourselves which is the more minimal change: that from a single probability function to a generally infinite set of them; or that from a single probability function to a single function which is just like a probability function except that in its definition we remove one occurrence of the word ‘singleton’?Footnote 42 I’d argue that the latter change is more minimal. Second, one might argue that mathematical probability theory is rich and mature and, again as a methodological point, we do not want to deny ourselves access to this body of theory. But again this does not favour imprecise probabilism over the belief function approach: both still make ample room for the applicability of probability theory. Recall that a belief function generates a set of compatible probability functions: so probability theory gets ample purchase on belief functions via this associated set.Footnote 43 Modelling degrees of belief as belief functions in no way cuts us off from the mathematical riches of probability theory. Compare Gödel numbering, which allows us to apply arithmetical techniques to logic. To do this, we do not literally need to identify formulas with numbers—we only need the right kind of mapping between them.

To sum up, there is a line of thought in the imprecise probability literature that says that facts about an agent’s belief state are compatible with multiple probability functions (not just one)—and so we should model the belief state by the set of those functions. I have pointed out that this line of argument is analogous to a line of argument in the vagueness literature for a plurivaluationist approach to the semantics of vagueness over a purely classical approach (such as epistemicism). I have then noted that (as consideration of the vagueness case makes clear) this line of argument requires as a lemma that probability functions (or classical interpretations in the vagueness case) are the only game in town: that the final model must employ them in a central role. I cannot, however, find a convincing argument for this lemma in the imprecise probability literature—and the arguments that I can imagine are not convincing. I conclude that this way of motivating imprecise probabilism fails.

7 The matching argument

Turning now to the second line of argument that I discern in the literature: in the quotes from Joyce, Walley and Sturgeon at the beginning of Sect. 3 we find the following thought. Beliefs should match evidence in kind; evidence is sometimes imprecise or equivocal or vague or fuzzy; so in those cases beliefs should be too. Consider also the following remarks from Levi (1980):

An ideally rational agent should sometimes suspend judgment by embracing a credal state which is indeterminate in the sense that more than one probability measure is considered permissible for the purpose of computing expected values in deliberation. [10]

X should not rule out any logically permissible Q-functions. He should endorse a credal state which avoids such elimination of Q-functions without warrant by suspending judgment between them. [89]

one should not reach unwarranted and arbitrary conclusions. It is preferable to remain in suspense. [95]

Kaplan (1989) says that such an agent “can be said to be of more than one mind” [57] and characterises the imprecise probability view as requiring “that each of your many minds be an orthodox bayesian mind” [58].Footnote 44 The idea is that some kinds of evidence warrant the agent’s making a judgement—being in a single mind—while other kinds of evidence do not. In these cases the agent should suspend judgement, should be in many minds. So here too we have the thought that sometimes the evidence is sharp and warrants a definite belief state—and sometimes it is not and warrants an indeterminate state.

We can agree with the sentiment that beliefs should match evidence in kind. Indeed I take this to be a facet of the more general idea that degrees of belief should respect the evidence. But there is no good argument here for the imprecise probability view. If we make a certain assumption, then the argument is compelling. But the assumption begs the question, and if we remove it the argument fails to convince. I'll explain this via consideration of the kinds of cases that Walley et al. have in mind: the kinds imprecise probabilists see as involving evidence that is imprecise or fuzzy. At the extreme end we have cases where you have no evidence at all. For an example of a less extreme kind of case, suppose a reliable source tells you that a woman did it. ("That's all I am able to tell you in the circumstances," she says: "but believe me, I know about the case"—and let's suppose you do believe her.) Now suppose we make the assumption that probability functions must be involved in the model of belief. Then we do indeed start to see in these cases indeterminacy, vagueness, suspension of judgement, being in many minds and so on. For undoubtedly, in these cases the evidence rules out some belief states characterised by probability functions, and does not rule in a single such belief state. So there is a set of probabilistic belief states that are not ruled out. Now if we think that probabilities must be a central part of the model of belief states, then it will be natural to see here a kind of vagueness or indeterminacy in the evidence. The vagueness or indeterminacy consists exactly in not ruling in a single (probabilistic) belief state. And the natural response will be to take the appropriate belief state to be the one characterised by the whole set of not-ruled-out probability functions. Rather than plump for a single (probabilistic) belief state as a traditional subjective or personalist Bayesian would, we should suspend judgement.

Now that story is OK if we assume that probability functions must be involved in the model of belief. But this assumption begs the question in a debate between the imprecise probability and belief function views. And if we drop it and look with clear eyes—unclouded by the thought that probability functions have to be used somewhere—then the cases start to look rather different.Footnote 45 First, there doesn't on the face of it seem to be anything vague or fuzzy or indeterminate about having no evidence at all. On the contrary, this is perhaps the simplest and most determinate evidential situation one can be in. Compare: if you have a middling amount of money then it might be a very difficult question whether to spend it on A or B or C. But if you have absolutely no money at all then your situation is quite simple and definite: you cannot buy anything. Similarly here: the guiding idea was to model evidence-responsive degrees of belief. So you should not have an elevated degree of belief in a proposition unless there is a reason, based on your evidence, for doing so. Thus when you have no evidence at all, you should not have an elevated degree of belief in any proposition. The belief function model captures this maximally simple evidential situation in a simple and direct way. It represents your belief state by an assignment of 1 to the entire sample space—and hence 0 to every contentful or nontrivial proposition.

The situation is similar if we turn to the less extreme example of the source who tells you that a woman did it. This evidence is not vague or fuzzy: it is simply nonspecific. It quite definitely and determinately lends weight to the proposition that a woman did it, while lending no weight to the proposition that Anne in particular did it, no weight to the proposition that Betty (specifically) did it, and so on. This evidence simply does not speak to those specific propositions. After all, it would be absurd to go to Anne, or Betty, or any other particular woman, and say: I have evidence that you did it! So on the basis of this evidence, you should not be lending elevated weight to these propositions. (Or to vary the example, suppose a parent learns simply that someone at the school has been caught smoking, and when his child returns home, confronts her and says: I have reason to believe you were caught smoking today! This is unfair because it is completely unjustified, on the basis of the available evidence.)Footnote 46 The belief function model captures this nicely. On the basis of this evidence, you assign some positive weight k to the set of women and 0 to the singleton of each individual woman.
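
To make this concrete, here is a minimal sketch in Python of the two basic mass assignments just described and of the belief function each induces. The suspect names, the particular trust level k = 0.8, the choice to leave the remaining 1 − k on the whole frame, and the helper names are all mine, purely for illustration; nothing in the argument depends on them.

    def belief(m, proposition):
        # Bel(A) is the total mass of the focal sets contained in A.
        return sum(mass for focal, mass in m.items() if focal <= frozenset(proposition))

    # Frame of discernment: the possible culprits (illustrative names only).
    frame = frozenset({'Anne', 'Betty', 'Carl', 'Dave'})
    women = frozenset({'Anne', 'Betty'})

    # No evidence at all: the whole unit of belief sits on the frame itself.
    m_vacuous = {frame: 1.0}
    print(belief(m_vacuous, frame))        # 1.0
    print(belief(m_vacuous, {'Anne'}))     # 0: no nontrivial proposition gets any belief

    # Testimony that a woman did it, trusted to degree k = 0.8 (an assumed figure):
    # mass k goes to the set of women; the remainder stays on the frame.
    m_testimony = {women: 0.8, frame: 0.2}
    print(belief(m_testimony, women))      # 0.8: the nonspecific proposition is supported
    print(belief(m_testimony, {'Anne'}))   # 0: no weight falls through to any individual woman

The point of the sketch is simply that the support for 'a woman did it' does not show up as support for any particular woman.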

So simply accepting that belief should match evidence in character does not get us to the conclusion that imprecise probabilism is the right model of the kinds of case considered above. Adding the assumption that probability functions must play a central role in the formal model of belief does push us in that direction: but the assumption begs the question. And without it, it is plausible to see the kinds of evidential situation that motivate moving away from precise probabilism as involving not vague or fuzzy evidence, but nonspecific evidence—and to see belief functions as an appropriate model that captures this nicely. The picture is as follows. Sometimes the evidence lends weight to the hypothesis that Anne in particular did it. Sometimes the evidence lends weight to the hypothesis that a woman did it while lending no weight at all to the hypothesis that Anne in particular did it, or to the hypothesis that Betty in particular did it, and so on. In both cases the evidence warrants a definite judgement: there is no vagueness, no suspension of judgement, no being in many minds about whether to attach this weight or that to Anne etc. The difference is simply whether the weight gets attached to less specific propositions or more specific ones. From this point of view, the problem with a probability function when the evidence is nonspecific isn’t that it is too precise: it’s that it is too constrained. It forces belief mass to be distributed among singletons in a way that the evidence simply does not warrant. What we need in such cases is not something more vague or fuzzy. We need something less constrained, less specific—and the belief function model provides this.

A probability function cannot capture the idea of evidence-justified degree of belief attaching to less specific propositions without attaching to any more specific propositions that imply them. The probability model forces weight assigned to larger propositions to ‘fall through’ them, all the way down to their singleton subsets—hence the appropriateness of van Fraassen’s muddy diagrams.Footnote 47 But opponents of traditional precise probabilism think that sometimes the evidence does not warrant beliefs of this sort. By contrast, a belief function can be nonspecific: it can assign positive weight to a set and zero weight to its subsets. In Sect. 2 I said that additivity is merely a symptom of the core difference between belief functions and probability functions. The heart of the matter is whether evidence can warrant attaching some mass of belief directly to a non-singleton subset of outcomes. What the cases of allegedly vague, imprecise and equivocal evidence show is that evidence can do this.

It is important to note here that evidence can also warrant probabilistic degrees of belief, in some situations: notably, when the evidence consists in observations of frequencies or facts about the objective chances of outcomes of some randomising device (e.g. a coin, die or roulette wheel). Probability theory is purpose-built to model chances and frequencies and it models them well.Footnote 48 This is no problem for the belief function view because probability functions are special cases of belief functions. Thus belief functions can model situations in which the evidence does warrant probabilistic degrees of belief, as well as situations in which it does not.Footnote 49 Indeed we have here a spectrum. At one end the agent has no evidence at all, and all the weight of belief attaches to the sample space. At the other end the agent has very specific evidence which warrants distributing all the weight of belief among specific outcomes, generating a Bayesian belief function. In between we have cases such as the testimony that the perpetrator was a woman. For each of these kinds of case the belief function framework provides a simple, direct model in which the belief state matches the evidence in its level of specificity.

8 Dilation and inertia

This section concludes the case for belief functions as a model of evidence-respecting degrees of belief by showing that they readily handle two kinds of situation that imprecise probabilities (notoriously) cannot handle.Footnote 50

First, dilation. Consider the following case:

You haven’t a clue as to whether p. But you know that I know whether p. I agree to write ‘p’ on one side of a fair coin, and ‘\(\lnot p\)’ on the other, with whichever one is true going on the heads side (I paint over the coin so that you can’t see which sides are heads and tails). We toss the coin and observe that it happens to land on ‘p’. (White, 2010, p. 175)Footnote 51

If we analyse this case using imprecise probabilities, we get an unacceptable result. As you start out with absolutely no evidence for or against p, your belief in p starts out maximally indeterminate: your representor includes probability functions that assign p any value between 0 and 1.Footnote 52 You know the coin is fair, however, so these functions all assign 0.5 to h and to t (the propositions that the coin lands with its heads—resp. tails—side up). Now consider what happens when the coin is tossed and you observe that it lands with ‘p’ showing. Given that ‘p’ is showing and that the true claim was written on the heads side, out of the original four possibilities (\(ph, pt, \bar{p}h, \bar{p}t\)) there are now two left open (\(ph, \bar{p}t\)).Footnote 53 So you conditionalise each probability function in your representor on the set \(\{ph, \bar{p}t\}\). The result is that your representor includes probability functions that assign h (i.e. \(\{ph, \bar{p}h\}\)) any value between 0 and 1: your belief in h has gone from a sharp 0.5 to maximally indeterminate. The same point applies to t. So your beliefs are not respecting the evidence. Nothing happened to undermine your evidence that the coin is fair or to provide further evidence regarding h or t, so your degree of belief in h and t should still be 0.5.
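
To make the dilation explicit: on the standard way of setting up the case, each member of your representor treats p as independent of the outcome of the fair toss, so a member with \(P(p) = x\) assigns \(P(ph) = 0.5x\) and \(P(\bar{p}t) = 0.5(1-x)\). Conditionalising on \(\{ph, \bar{p}t\}\) then gives \( P(h \mid \{ph, \bar{p}t\}) = \frac{P(ph)}{P(ph) + P(\bar{p}t)} = \frac{0.5x}{0.5x + 0.5(1-x)} = x \). Each member's updated credence in h simply equals its prior credence in p, and since those priors sweep the whole unit interval, so do the updated credences in h.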

Joyce (2010, 299ff.) argues that the new information is evidentially relevant to h. However, when followed through closely, his argument consists in showing how this is so within the imprecise probability model. That is, Joyce explains in detail how the model leads to this result—not why the result is the right one. The problem is that the information that the coin lands with ‘p’ showing is clearly not evidence for or against h, by any reasonable pretheoretical reckoning—given that you know that the true one of ‘p’ and ‘\(\lnot p\)’ is on the heads side but have absolutely no evidence for or against p. So Joyce’s argument does not avoid the conclusion that the imprecise probability model fails to provide an adequate account of evidence-respecting degrees of belief. Note that if you were sure that p (or \(\lnot p\)) then seeing ‘p’ come up would tell you for sure that the coin landed heads (or tails). If you thought that p was quite/very un/likely, then seeing ‘p’ come up would tell you that heads is quite/very un/likely—and so on. But the whole point is that you have no evidence at all about p: not that it is very or a little bit likely or unlikely or anything else. In that situation, seeing the coin land ‘p’ tells you nothing about whether it landed heads. As we’ll see in a moment, this is exactly the result that the belief function story gives. But the imprecise probability story cannot yield this result. Each member of the ‘credal committee’Footnote 54 cannot avoid taking the initial probability assigned to p as a judgement that p is very (a little bit) un/likely or whatever—depending on whether the probability is very (a little bit) low/high, etc. Hence each committee member cannot avoid taking the observation of ‘p’ as evidence for or against h.Footnote 55 Thus the committee as a whole goes from being clustered at 0.5 on the question of h to scattering across the unit interval.

The belief function model, by contrast, handles the case nicely. Before the toss, your good evidence about the fairness of the coin (and complete lack of evidence about p) warrants a basic mass assignment \(m_1\) that assigns 0.5 to h (i.e. \(\{ph, \bar{p}h\}\)) and 0.5 to t (i.e. \(\{pt, \bar{p}t\}\)). Observing the coin land with ‘p’ showing gives you conclusive evidence (as explained above) that the actual situation is ph or \(\bar{p}t\). This evidence warrants a basic mass assignment \(m_2\) that assigns 1 to \(\{ph, \bar{p}t\}\). Combining \(m_1\) and \(m_2\) using Dempster’s rule (see Sect. 2) gives a basic mass assignment \(m_3\) that assigns 0.5 to \(\{ph\}\) and 0.5 to \(\{\bar{p}t\}\). Now \(m_3\)—like the initial \(m_1\)—generates a belief function that assigns 0.5 to h and 0.5 to t. So there is no loss of determinacy in your degrees of belief in heads and tails.
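
For readers who prefer to see the combination computed, here is a minimal sketch in Python. The function names are mine; the rule itself is just the normalised combination described in Sect. 2, and the labels nh and nt stand for the \(\bar{p}h\) and \(\bar{p}t\) outcomes.

    def dempster_combine(m1, m2):
        # Multiply the masses of every pair of focal sets, pool the products on
        # their intersections, discard what lands on the empty set (conflict)
        # and renormalise by the remaining mass. Assumes the two bodies of
        # evidence are not flatly contradictory (some intersection is non-empty).
        combined, conflict = {}, 0.0
        for a, x in m1.items():
            for b, y in m2.items():
                inter = a & b
                if inter:
                    combined[inter] = combined.get(inter, 0.0) + x * y
                else:
                    conflict += x * y
        return {s: v / (1.0 - conflict) for s, v in combined.items()}

    def belief(m, prop):
        # Bel(A) is the total mass of the focal sets contained in A.
        return sum(v for s, v in m.items() if s <= frozenset(prop))

    ph, pt, nh, nt = 'ph', 'pt', 'nh', 'nt'

    m1 = {frozenset({ph, nh}): 0.5,     # the coin is fair: 0.5 to heads...
          frozenset({pt, nt}): 0.5}     # ...and 0.5 to tails
    m2 = {frozenset({ph, nt}): 1.0}     # the coin is seen to land showing 'p'

    m3 = dempster_combine(m1, m2)       # {frozenset({'ph'}): 0.5, frozenset({'nt'}): 0.5}

    print(belief(m3, {ph, nh}))         # 0.5: belief in heads is unchanged
    print(belief(m3, {pt, nt}))         # 0.5: belief in tails is unchanged
    print(belief(m3, {ph, pt}))         # 0.5: belief in p (the objection discussed next)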

One objection that someone might raise to this treatment of the case concerns the fact that your belief in p after observing the toss is now 0.5 (and likewise for \(\lnot p\)). That is, \(m_3\)—unlike the initial \(m_1\)—generates a belief function that assigns 0.5 to \(\{ph, pt\}\) and 0.5 to \(\{\bar{p}h, \bar{p}t\}\). This will seem the wrong result if one thinks that the toss provides no evidence concerning p. And indeed White (2010, p. 176) claims that “Seeing the coin land ‘p’ should have no affect on your credence in p.” But this is incorrect: the toss does provide evidence about p. Consider the person I (i.e. White’s “I”) who knows whether p and writes on the coin. If I told you ‘p’ (or ‘\(\lnot p\)’) and you took him to be reliable in his beliefs about p and speaking sincerely, then this would give you strong evidence for p (or \(\lnot p\)). If I wrote the true one of ‘p’ or ‘\(\lnot p\)’ on both sides of the coin (and you knew this), observing the toss land ‘p’ (or ‘\(\lnot p\)’) would again give you strong evidence for p (or \(\lnot p\)): just as strong as in the direct testimony case (for this is simply a roundabout way of giving testimony). If I wrote the true one of ‘p’ and ‘\(\lnot p\)’ on one face of a 100-sided die and the false one on all the other faces, observing the roll would again give you strong evidence relevant to p: whatever you see come up (‘p’ or ‘\(\lnot p\)’) has a 1% chance of being true and so there is a 99% chance that the one you don’t see is true. You should then change your belief in p to 0.01 or 0.99 depending on what you see. This is a form of randomised or probabilistic testimony: I tells you whether p in such a way that you have a 1% chance of receiving the true message (and a 99% chance of receiving a false message). Note that I can control the chance of transmission of the true message by selecting a die with a certain number of sides and writing the true message on a certain number of them (and the false message on the others). White’s case is simply the one where the die has just two sides (a coin): the true one of ‘p’ and ‘\(\lnot p\)’ goes on one face and the false one on the other. This certainly provides evidence relevant to p, just like all the other cases of randomised testimony: in this case, evidence that warrants setting your degree of belief in p to 0.5. (Think of it this way: you receive a report that p and there is a 50% chance that it is true.)

Second, inertia. Consider the following case:

There are two kinds of chemicals used in this lab, X and Y. We know nothing about how common either chemical is or how this test tube came to be filled. But we have no evidence either way bearing on whether it contains X or Y...We should change our opinion in the light of new evidence. Let’s run a test that has a 90% accuracy rate (i.e. whether it contains X or Y, the test has a 90% chance of accurately reporting whether it is X or Y). When we get a result, how confident should we be that it is accurate? Obviously 90%. (White, 2010, p. 184)

The imprecise probability view cannot yield this result. Given that an agent knows only that the test tube contains X or Y and has no evidence at all bearing on which one of these two chemicals it is, for every \(0\le x\le 1\) her representor should include probability functions that assign x to the proposition that the tube contains X (and \(1-x\) to the proposition that it contains Y). But then, as White and Joyce (2010, pp. 290–291) show, whatever evidence the agent gets, her probabilities for X and Y will remain maximally spread out.Footnote 56 This is known as the problem of belief inertia. It is a devastating objection to imprecise probabilities as a model of evidence-responsive degrees of belief. Such degrees of belief should not sharpen in the absence of any evidence—hence they should start maximally spread out in this case—but they should then sharpen when evidence comes in.
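
The computation behind the inertia result is worth making explicit (it is just the Bayes-theorem step each member of the representor performs, and adds nothing to what White and Joyce show). A member with prior \(P(T_X) = x\), upon seeing the test read ‘X’, updates to credence \( \frac{0.9x}{0.9x + 0.1(1-x)} \) in the proposition that the tube contains X. As x ranges over the unit interval this posterior also ranges over the whole unit interval (members at the endpoints stay at the endpoints). So however the test comes out, the representor’s values for that proposition remain maximally spread.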

One response to this problem on the part of imprecise probabilists is to model the agent’s initial state—when she has no evidence—by some more restricted set of probabilities. Joyce (2010, pp. 290–291) outlines a specific proposal along these lines but Vallinder (2018, pp. 1214–1216) shows that it does not achieve the desired result. White also mentions a specific way of restricting the initial set of probabilities but shows that while it does allow the agent’s belief state to move in response to the evidence provided by the test result, there is not enough movement: the resulting degree of belief that the tube contains X “is nothing like .9” (White, 2010, p. 185). But there is also a deeper problem with any such strategy of restricting the initial spread of probabilities:Footnote 57 it flies in the face of the idea that degrees of belief should respect the evidence. As Joyce (2010, p. 292) says, “what principled reason is there to confine credences to one proper subinterval...as opposed to another? Whichever way you go, you end up acting as if you have evidence that you do not actually possess.” As Joyce sees it, “the only view that makes epistemological sense” [290] is the original one according to which the agent’s initial belief state is represented by the full, unrestricted set of probability functions. But then we are stuck with the problem of belief inertia, which also flies in the face of the idea that degrees of belief should respect the evidence. As White puts it in the quote above, “We should change our opinion in the light of new evidence.” So the imprecise probabilist faces a devastating dilemma: either way, she fails to satisfy the original design brief according to which degrees of belief should respect the evidence.

The belief function view, by contrast, handles the situation nicely. Initially, the agent’s total lack of evidence about the contents of the tube (apart from knowing that the only possibilities are X and Y) induces a basic mass assignment \(m_1\) that assigns 1 to the frame of discernment \(\{T_X, T_Y\}\), where \(T_X\) (\(T_Y\)) is the situation in which the test tube contains X (Y). Knowing the test to have a 90% accuracy rate and seeing it give the result ‘X’ (analogous comments apply if it gives ‘Y’) warrants a basic mass assignment \(m_2\) that assigns 0.9 to \(\{T_X\}\) and 0.1 to the frame of discernment. Combining \(m_1\) and \(m_2\) using Dempster’s rule gives a basic mass assignment \(m_3\) that is the same as \(m_2\): it assigns 0.9 to \(\{T_X\}\) and 0.1 to the frame of discernment.Footnote 58 So in light of the test result, the agent’s degree of belief that the tube contains X goes from 0 to 0.9, as desired.Footnote 59
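
Spelling out the combination (this is just the rule from Sect. 2 applied to the two assignments; there is no conflict, so no renormalisation is needed): \( m_3(\{T_X\}) = m_1(\{T_X, T_Y\})\, m_2(\{T_X\}) = 1 \times 0.9 = 0.9 \) and \( m_3(\{T_X, T_Y\}) = m_1(\{T_X, T_Y\})\, m_2(\{T_X, T_Y\}) = 1 \times 0.1 = 0.1 \). So the belief function generated by \(m_3\) assigns 0.9 to \(\{T_X\}\) and 0 to \(\{T_Y\}\), whereas the one generated by \(m_1\) assigned 0 to both.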