1 Introduction

The goal of this paper is to propose a new model for learning a probability distribution, in situations that are commonly characterized as those of “radical uncertainty” (Walley 1996) or “Knightian uncertainty” (Cerreia-Vioglio et al. 2013). The most widespread model for these situations uses imprecise probabilities, i.e. sets of probability distributions. As an example, consider an urn full of marbles, coloured red, green, and blue, but with an unknown distribution. What is then the probability of drawing a red marble? In such cases, when the agent’s information is not enough to determine the true distribution, she is typically left with a large (possibly infinite) set of possible probability assignments. If she never goes beyond what she knows, then her only ‘rational’ answer should be “I don’t know”: she is in a state of ambiguity, and she should simply consider possible all distributions that are consistent with her background knowledge and observed evidence. This type of over-cautious rationality, resembling the famous paradox of “Buridan’s ass”, is not of much help in dealing with practical decision problems.

Our model allows the agent to go beyond what she knows with certainty, by forming rational qualitative beliefs about the unknown distribution, beliefs based on the inherent plausibility of each possible distribution. For this, we assume the agent is endowed with an initial plausibility map, assigning real numbers to the possible distributions. The plausibility map encodes the agent’s background beliefs and a priori assumptions about the world. For instance, an agent who assumes the Principle of Indifference (Williamson 2013; Hájek 2019) will use Shannon entropy as her plausibility function, thus initially believing that the distribution is the most non-informative one (in the given set of possibilities). On the other hand, an agent assuming a Normality or ‘Averageness’ Principle will use closeness to the Center of Mass or the barycenter (Paris 1994) as her plausibility measure, thus starting with a belief in the most typical distribution, i.e. the one that is the most representative for the given set of distributions. Finally, an agent who assumes some form of Ockham’s Razor will use as plausibility some measure of simplicity (Kelly 2008), thus her prior belief will focus on the simplest distribution(s).

Our agent forms beliefs by using the standard definition of qualitative belief in Logic and Belief Revision Theory, in terms of plausibility maximization (Board 2004; Baltag and Smets 2008b): she believes the most plausible distribution(s). More precisely, we equate “belief” with “truth in all the worlds that are plausible enough”: P is believed iff there exists some distribution \(\mu \) s.t. P is true in all distributions that are at least as plausible as \(\mu \). In particular, “belief” coincides with truth in all the most plausible worlds, whenever such most plausible worlds/distributions exist. As a consequence, all the usual KD45 axioms of doxastic logic will be valid in our framework.

Note that, although our plausibility map assigns real values to probability distributions, this account is essentially different from the ones using so-called “second-order probabilities” (i.e. probability distributions defined on the given set of probability distributions) (Gaifman and Snir 1982; Gaifman 2016). Plausibility values are only relevant in so far as they induce a qualitative order on distributions. In contrast to probability, plausibility is not cumulative (in the sense that the low-plausibility alternatives do not add up to form more plausible sets of alternatives), and as a result only higher-ranking distributions ‘beat’ lower-ranking ones; when some distributions have the highest plausibility, they are the only ones of any relevance for beliefs.

Our model is not just a way to “rationally” select a Bayesian prior, but it also comes with a rational method for revising beliefs in the face of new evidence. In fact, it can deal with two types of new information: first-order evidence gathered by repeated sampling from the (unknown) distribution; and higher-order information about the distribution itself, coming in the form of a set of possible distributions (often defined by a set of linear inequality constraints on that distribution). To see the difference between the two types of new evidence, take for instance the example of a coin. As is well known, any finite sequence of Heads and Tails is consistent with all possible non-extremal biases of the coin. As such, any finite number of repeated samples will not shrink the set of possible biases, though they may increase the plausibility of some biases. Thus this type of information changes only the plausibility map but leaves the given set of distributions essentially unchanged (except for the elimination of some extremal distributions that assigned probability 0 to the observed sample). The second type of information, on the other hand, shrinks the set of measures, while keeping their relative plausibility ranking. For instance, learning that the coin has a bias towards Tails (e.g. by weighing the coin, or receiving a communication in this sense from the coin’s manufacturer) eliminates all distributions that assign a higher probability to Heads. It is important to notice, however, that even with higher-order information, it is hardly ever the case that the distribution under consideration is fully specified. In our coin example, a known bias towards Tails will still leave an infinite set of possible biases consistent. Even a good measurement by weighing will leave open a whole interval of possible biases. In this sense, a combination of observations and higher-order information will not in general allow the agent to come to know the correct distribution, in the standard (‘infallible’) sense in which the term knowledge is used in doxastic and epistemic logics. Instead, it may eventually allow her to come to believe the true probability (at least, with a high degree of accuracy). This belief may even stabilize, to such a degree that it approaches the ‘softer’, defeasible notion of ‘knowledge’, which is the main focus in Epistemology (Lehrer 1990; Stalnaker 1996; Rott 2004) and (inductive) Learning Theory (Gold 1967; Baltag et al. 2019a). This convergence in belief and the resulting acquisition of statistical knowledge is what we aim to capture in this paper.

Our mechanism for belief revision with sampling evidence is non-Bayesian (and also different from AGM belief revision), though it incorporates a “plausibilistic” version of Bayes’ Rule. Instead of updating her prior belief according to this rule (and disregarding all other possible distributions), the agent keeps all possibilities in store and instead revises her plausibility ranking, using a non-probabilistic analogue of Bayes’ Rule. After that, her new belief will be formed in a similar way to her initial belief: by maximizing her (new) plausibility. The outcome is different from simply performing a Bayesian update on the ‘prior’: qualitative jumps are possible, leading to abandoning “wrong” conjectures in a non-monotonic way. This results in a faster convergence-in-belief to the true probability under less restrictive conditions than the usual Savage-style convergence through repeated Bayesian updating (Edwards et al. 1963; Savage 1954).

The second type of evidence (higher-order information about the distribution) induces a more familiar kind of update: the distributions that do not satisfy the new information (typically given in the form of linear inequalities) are simply eliminated, then beliefs are formed as before by focusing on the most plausible remaining distributions. This form of revision is known as AGM conditioning in Belief Revision Theory (Alchourrón et al. 1985), and as update or “public announcement” in Logic (Baltag and Renne 2016; van Ditmarsch et al. 2007), and satisfies all the standard AGM axioms.

The fact that in our setting there are two types of updates should not be so surprising. It is related to the fact that our static framework consists of two different semantic ingredients, capturing two types of information: the plausibility map (encoding the agent’s beliefs and conditional beliefs, defeasible forms of knowledge, etc.), and the set of possible distributions (encoding the agent’s infallible knowledge, her ‘hard information’ about the correct distribution). Correspondingly, the first type of update directly affects the agent’s beliefs (by changing the plausibility in view of the sampling results), and only indirectly her knowledge (since e.g. she knows her new beliefs). Dually, the second type of update directly affects the agent’s knowledge (by reducing the set of possibilities), and only indirectly her beliefs (by restricting the plausibility map to the new set).

By allowing two forms of learning, one having a Bayesian-statistical flavor and the other having a logical-AGM flavor (Alchourrón et al. 1985; Darwiche and Pearl 1997), our framework combines logical and statistical reasoning in a unified setting. In this sense, it fits within the recent trend towards a unification of logic and probability, see e.g. Leitgeb (2017). In particular, the fact that conditioning on sampling evidence is non-AGM is in fact essential for the successful learning of the true probability from repeated sampling: since every sample is logically consistent with every non-extremal distribution, an AGM learner (obeying the principle of Rational Monotonicity) would typically never change her initial beliefs about the true distribution after any number of samples! The same applies to all the generalizations of AGM conditioning that retain Rational Monotonicity, e.g. the ones proposed by Darwiche and Pearl (1997), or by Konieczny and Perez (2000).

A preliminary version of this paper was presented at TARK 2019, and an abstract appeared in the online proceedings (Baltag et al. 2019). Our current article is the extended, journal version of that work, though with many major changes: improvements of the basic setting, the formalization and study of new epistemic notions (e.g. safe belief of a distribution, statistical knowledge, distance-from-the-truth), and a number of new convergence results. The plan of the paper is as follows. We start by reviewing in Sect. 2 some basic notions, results, and examples on probability distributions. In Sect. 3, we define our main setting (probabilistic plausibility models), consider a number of standard examples, define in this setting the notions of belief and (infallible) knowledge, and study their logical properties. In Sect. 4, we move to conditional beliefs, defining our two forms of conditionalization, and use them to explore belief dynamics (as captured by our two types of model updates). In Sect. 5, we look at notions of doxastic stability, defining a weaker form of stability (“safe belief”), followed by a stronger form (“statistical knowledge”), and investigating their properties and their connection to a notion of verisimilitude (or “distance from the truth”). In Sect. 6, we present and prove our main results on doxastic convergence to the true probability. Finally, in Sect. 7 we briefly sketch the contours of a dynamic doxastic logic for statistical learning, and in Sect. 8 we end with some concluding remarks and a brief comparison with other approaches to the same problem.

2 Preliminaries and notation

Throughout this paper, we fix a finite set \(O=\{o_1, \ldots , o_n\}\) of possible observations, or ‘(elementary) outcomes’. Let

$$\begin{aligned} M_{O} := \left\{ \mu \in [0,1]^O \vert \, \sum _{o \in O} \mu (o)=1\right\} \end{aligned}$$

be the set of probability mass functions on O, which we identify with the corresponding probability functions on \({\mathcal {P}}(O)\). The sets of distributions \(P\in {\mathcal {P}}(M_O)\) will be called propositions. Let

$$\begin{aligned} \varOmega = O^{\infty } := \{\omega \vert \, \omega :{\mathbb {N}}{\setminus }\{0\}\rightarrow O\} \end{aligned}$$

be the set of infinite sequences from O, which we shall refer to as observation streams. Each such stream \(\omega =(\omega _1, \ldots , \omega _n, \ldots )\) represents a possible history of future sampling from an unknown distribution. For any \(\omega \in \varOmega \) and \(i \in {\mathbb {N}}{\setminus }\{0\}\), we write \(\omega _i\) for the i-th component of \(\omega \), and \(\omega ^{\le i}\) for its initial segment of length i, i.e. the sequence \(\omega ^{\le i}:=(\omega _1, \ldots , \omega _i)\) consisting of the first i components of \(\omega \). Similarly, we put \(\omega ^{>i}:=(\omega _{i+1}, \ldots , \omega _n, \ldots )\) for the infinite “tail” of \(\omega \) that follows the i-th observation. In particular, \(\omega ^{\le 0}:=\lambda =()\) is the empty sequence, and \(\omega ^{>0}=\omega \). We denote by

$$\begin{aligned} O^*=\{(\omega _1,\ldots , \omega _i) \vert \, i\ge 0, \omega _1, \ldots , \omega _i\in O\} \end{aligned}$$

the set of all finite sequences of observations. For each \(o \in O\) and \(j \in {\mathbb {N}}{\setminus }\{0\}\), we define the set \(o^{j}\) to be the basic cylinder

$$\begin{aligned} \, o^{j}=\{\omega \in \varOmega \, \vert \, \omega _{j}=o \} \subseteq \varOmega . \end{aligned}$$

These cylinders correspond to individual observations of evidence sampled from the unknown distribution. Let \({\mathcal {A}} \subseteq {\mathcal {P}}(\varOmega )\) be the \(\sigma \)-algebra of subsets of \(\varOmega \) generated by the basic cylinders (i.e. the \(\sigma \)-algebra obtained by closing the family of basic cylinders under complementation and countable unions). Every probability distribution \(\mu \in M_{O}\) induces a unique multinomial probability distribution over \((\varOmega , {\mathcal {A}})\), also denoted by \(\mu \), and obtained by first setting

$$\begin{aligned} {\mu }(o^{j})= \mu (o) \end{aligned}$$

then extending this to all of \({\mathcal {A}}\) using independence, additivity and continuity. Let \({\mathcal {E}}\subseteq {\mathcal {A}}\) be the family of sets obtained by closing the family of basic cylinders only under complementation and finite unions. The sets \(e\in {\mathcal {E}}\) are called observable events (or just ‘events’, for short). It is easy to see that every event \(e\in {\mathcal {E}}\) can be written as a finite disjoint union of finite intersections of basic cylinders. In particular, for each finite sequence of observations \(\omega ^{\le i}=(\omega _1, \ldots , \omega _i)\in O^*\), we denote by \([\omega ^{\le i}]=[\omega _1, \ldots , \omega _i]\) the corresponding event of observing this sequence by sampling, i.e. the event given by

$$\begin{aligned}{}[\omega ^{\le i}]=[\omega _1, \ldots , \omega _i]\, :=\, \{\omega '\in \varOmega \, \vert \, \omega '_j=\omega _j\ \text{ for } \text{ all } j\le i\}=\bigcap _{j=1}^i \omega _j^j. \end{aligned}$$
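To make the cylinder notation concrete, here is a minimal Python sketch (our own illustration, not part of the original text; the function name is hypothetical) computing the probability \(\mu ([\omega ^{\le i}])=\prod _{j\le i}\mu (\omega _j)\) that the multinomial extension of \(\mu \) assigns to such an event, using the independence of successive observations:

```python
from math import prod

def prob_of_sequence(mu, seq):
    """Probability mu([omega_1, ..., omega_i]) of observing the finite
    sequence `seq` under i.i.d. sampling from `mu` (a dict mapping
    outcomes to probabilities summing to 1)."""
    return prod(mu[o] for o in seq)

# Coin example: a biased coin with mu(H) = 0.7.
mu = {"H": 0.7, "T": 0.3}
print(prob_of_sequence(mu, "HTH"))  # 0.7 * 0.3 * 0.7, i.e. about 0.147
```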

Example 1

Let \(O=\{H, T\}\) be the possible outcomes of a coin toss. Then \(\varOmega \) will be the set of streams of Heads and Tails representing infinite tosses of the coin, e.g. HTTHHH.... And \(H^{j}\) (resp. \(T^{j}\)) will be the set of streams of observations in which the j-th toss of the coin lands Heads up (resp. Tails up). The set \(M_O\) will be the set of possible biases of the coin.

Example 2

Let \(O=\{R, B, G\}\) be the possible outcomes for a draw from an urn filled with marbles, coloured Red (\({ R}\)), Blue (\({ B}\)) and Green (\({ G}\)). Then \(M_O\) will be the set of all possible distributions of coloured marbles in the urn, \(\varOmega \) will be the set of infinite streams of R, B and G (representing infinite draws from the urn), and \(R^{j}\) (resp. \(B^{j}\) or \(G^{j}\)) will be the set of streams of draws in which the j-th draw is a Red (resp. Blue or Green) marble.

Standard topology on \(M_{O}\) Notice that a probability function \(\mu \in M_O\), defined over the set \(O=\{o_1, \ldots , o_n\}\), can be identified with an n-dimensional vector \((\mu (o_1), \ldots , \mu (o_n))\), corresponding to the probabilities assigned to each \(o_i\) respectively. Let \({\mathcal {D}}_O := \{\mathbf {x} \in [0,1]^{n} \, \vert \, \sum x_i =1\}\). Then every \(\mu \in M_O\) can be identified with the point \(\mathbf {\mu } \in {\mathcal {D}}_O \subset [0, 1]^n\). Thus probability functions in \(M_{O}\) live in the vector space \([0, 1]^{n}\). In the other direction, every \(\mathbf {x} \in {\mathcal {D}}_O\) defines a probability function x on O by setting \(x(o_i) = \mathbf {x}_i\). This gives a one-to-one correspondence between \(M_O\) and \({\mathcal {D}}_O\). There are various metric distances that can be defined on the space of probability measures over a (finite) set O, many of which are known to induce the same topology. Here we will consider the standard topology of \([0,1]^n\), induced by the Euclidean metric: for \(\mathbf {x}, \mathbf {y} \in [0,1]^n\), put \(d(\mathbf {x}, \mathbf {y}) := \sqrt{ \sum _{i=1}^{n}(x_i-y_i)^{2}}\); a basis for the standard topology is given by the family of all open balls \({{\mathcal {B}}}_\varepsilon (\mathbf {x})\) centred at some point \(\mathbf {x}\in {\mathbb {R}}^n\) with radius \(\varepsilon >0\), where

$$\begin{aligned} {{\mathcal {B}}}_\varepsilon (\mathbf {x})= \{\mathbf {y} \in {\mathbb {R}}^n \, \vert \, d(\mathbf {x},\mathbf {y})<\varepsilon \}. \end{aligned}$$

We will make use of the following well-known facts:

Proposition 1

For any finite set O, the set \(M_{O}\) of probability mass functions on O is compact in the standard topology.

Proof

Notice that the set \(\{\mathbf {x} \in [0, 1]^{n}\, \vert \, \sum _{i=1}^{n} x_{i}=1\}\) is closed and bounded in \({\mathbb {R}}^{n}\), hence compact by the Heine–Borel theorem. \(\square \)

Proposition 2

Let X, Y be compact metric spaces, \(Z \subseteq X\), and \(f: X \rightarrow Y\) a function. Then:

  (1) Every closed subset of X is compact.

  (2) If f is continuous, then f(X) is compact.

  (3) If Z is compact, then it is closed and bounded.

Proof

See Hunter (2012), Theorem 1.40 and Proposition 1.41. \(\square \)

Lemma 1

For each event \(e \in {\mathcal {E}}\), the function \(F_e: M_O \rightarrow [0, 1]\), defined as \(F_e(\mu ) := {\mu }(e)\), is continuous.

Proof

This can be verified using the above-mentioned fact that every event is a finite disjoint union of finite intersections of basic cylinders. The proof is by induction on the structure of this representation. The conclusion is immediate when \(e=o^j\) is a basic cylinder: given any \(\varepsilon >0\), we can take \(\delta :=\varepsilon \), and then, for all \(\mu ,\nu \in M_O\) with \(d(\mu ,\nu )<\delta \), we have \(|F_e(\mu )-F_e(\nu )|=|{\mu }(e)-{\nu }(e)|=|\mu (o)-\nu (o)|\le \sqrt{ \sum _{i=1}^{n}(\mu (o_i)-\nu (o_i))^{2}}=d(\mu , \nu )<\delta =\varepsilon \). This can be extended to finite intersections of basic cylinders, by noting that if \(e=\bigcap _{k=1}^m \omega _{k}^{j_k}\) is such a finite intersection with all \(j_k\) distinct, then by independence we have \(F_e=\prod _{k=1}^m F_{\omega _{k}^{j_k}}\), and then using the fact that a finite product of continuous functions is continuous. Finally, we can extend to disjoint unions of finite intersections of basic cylinders, noting that if \(e=\bigcup _{i=1}^m e_i\) is a disjoint union of events (with \(e_i\cap e_j=\emptyset \) for \(i\not =j\)), then by additivity we have \(F_e=\sum _{i=1}^m F_{e_i}\), and then using the fact that a finite sum of continuous functions is continuous.\(\square \)
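As an illustration of the normal form used in this proof, the following Python sketch (our own; the event encoding is hypothetical) evaluates \(F_e\) for an event written as a disjoint union of finite intersections of basic cylinders, turning each intersection into a product (by independence) and the disjoint union into a sum (by additivity):

```python
from math import prod

def F(mu, event):
    """F_e(mu) for an event e in the normal form used in the proof: a
    disjoint union (list) of finite intersections (tuples) of basic
    cylinders at distinct sampling positions."""
    return sum(prod(mu[o] for o in cell) for cell in event)

# Urn example: e = "the first draw is Red, or the first draw is Blue
# and the second draw is Green", i.e. the union of R^1 and B^1 n G^2.
event = [("R",), ("B", "G")]
mu = {"R": 0.2, "B": 0.3, "G": 0.5}
nu = {"R": 0.21, "B": 0.29, "G": 0.5}
print(F(mu, event), F(nu, event))  # about 0.35 and 0.355: nearby
                                   # inputs, nearby outputs, as
                                   # continuity demands
```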

Proposition 3

Every continuous function \(f: X \rightarrow {\mathbb {R}}\) on a compact topological space X is bounded, and it attains its supremum (i.e., it has a maximum value).

Proof

See Hunter (2012), Theorem 7.35. \(\square \)

Theorem 1

(Heine–Cantor) Let M, N be two metric spaces and \(f: M \rightarrow N\) be continuous. If M is compact then f is uniformly continuous.

Proof

See Rudin (1953). \(\square \)

Before presenting our framework, we need one more technical lemma, which will prove useful in the proof of our main convergence results in Sect. 6.

Lemma 2

For \(0<p_1, \ldots , p_n \le 1\) with \(\sum p_i=1\), the function \(f(\mathbf {x})= \prod _{i=1}^{n} x_{i}^{p_i}\) has \(\mathbf {x}=\mathbf {p}=(p_1, \ldots , p_n)\) as its unique maximizer on \(M_O\).

Proof

First we notice that \(f(\mathbf {x}) \ge 0\) on \(M_O=\{\mathbf {z} \in [0, 1]^{n} \, \vert \, \sum z_i=1\}\), and by Propositions 1 and 3, f has a maximum value on \(M_O\). But note that \(f(\mathbf {z}) =0\) for any point \(\mathbf {z} \in M_O\) having some zero coordinate \(z_i=0\) (for any \(i\le n\)). Hence, f reaches its maximum on \(D= (0, 1]^n\cap M_O= \{\mathbf {z} \in (0, 1]^n\, \vert \, \sum z_i=1\}\).

To prove the lemma, we will show that \(\log (f(\mathbf {x}))\) has \(\mathbf {x}= \mathbf {p}\) as its unique maximizer on D. The conclusion will then follow from noticing that \(f(\mathbf {x}) \ge 0\) and the monotonicity of the \(\log \) function on \({\mathbb {R}}^{+}\). To maximize \(\log (f(\mathbf {x}))\) subject to the condition \(\sum _{i} x_i=1\), we use the Lagrange multiplier method: let

$$\begin{aligned} G(\mathbf {x})= \log (f(\mathbf {x})) - \lambda \left( \sum _{i=1}^{n} x_i -1\right) = \sum _{i=1}^{n} p_i \log (x_i)- \lambda \left( \sum _{i=1}^{n} x_i -1\right) . \end{aligned}$$

Setting the partial derivatives of G equal to zero, we get

$$\begin{aligned} \frac{ \partial G(\mathbf {x})}{\partial x_i} = \frac{p_i}{x_i} -\lambda =0 \end{aligned}$$

which gives \(p_i= \lambda x_i\). Inserting this into the condition \(\sum _{i} p_i =1\), we get \(\lambda \sum _{i} x_i=1\), and using \(\sum _{i} x_i=1\) we get \(\lambda =1\) and thus \(x_i=p_i\). Since f has a maximum on this domain and the Lagrange multiplier method gives a necessary condition for the maximum, any point \(\mathbf {x}\) that maximizes f must satisfy the condition \(x_i=p_i\), and thus \(\mathbf {p}\) is the unique maximizer of f. \(\square \)
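The lemma is also easy to check numerically. The sketch below (our own illustration, for a fixed three-outcome distribution \(\mathbf {p}\); all names are hypothetical) compares \(f(\mathbf {p})\) with the value of f at many randomly drawn interior points of the simplex:

```python
import random
from math import prod

def f(x, p):
    """f(x) = prod_i x_i^{p_i}, defined on the probability simplex."""
    return prod(xi ** pi for xi, pi in zip(x, p))

def random_simplex_point(n):
    """Draw a random interior point of the simplex (not uniformly,
    but that is irrelevant for this check)."""
    w = [random.random() + 1e-12 for _ in range(n)]
    s = sum(w)
    return [wi / s for wi in w]

p = [0.5, 0.3, 0.2]
fp = f(p, p)
# f(p) should dominate f at every other point of the simplex.
assert all(f(random_simplex_point(3), p) <= fp for _ in range(10_000))
print("x = p maximizes f, as the lemma asserts; f(p) =", fp)
```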

3 Probabilistic plausibility models

In this section, we introduce and exemplify our basic framework for dealing with radical uncertainty.

Definition 1

(Plausibility measures) A plausibility ‘measure’ (on K) is a continuous function \(pla:K\rightarrow [0,\infty )\), whose domain is some closed set of distributions \(K={\overline{K}}\subseteq M_O\). Given a plausibility measure on K, we can extend it to a map on propositions (sets of distributions) \(P\subseteq M_O\), by putting

$$\begin{aligned} pla(P) \,\,\, :=\,\,\, sup\{pla(\mu )\, \vert \, \mu \in P\cap K\} \end{aligned}$$

Similarly, we can extend it to distribution-event pairs \((\mu ,e)\in K\times {{\mathcal {E}}}\), by putting:

$$\begin{aligned} pla(\mu ,e) \,\,\, :=\,\,\, pla(\mu )\cdot \mu (e), \end{aligned}$$

and further extend this to proposition-event pairs \((P,e)\in {{\mathcal {P}}}(M_O)\times {{\mathcal {E}}}\), by putting

$$\begin{aligned} pla(P,e) \,\,\, :=\,\,\, sup\{pla(\mu ,e)\, \vert \, \mu \in P\cap K\}= sup \{ pla(\mu )\cdot \mu (e) \, \vert \, \mu \in P\cap K\} \end{aligned}$$

These last two maps give us a way of assessing the joint plausibility of having true distribution \(\mu \) (or true proposition P) and observing event e.

It seems apt at this point to emphasize again that events \(e \in {\mathcal {E}} \subseteq {\mathcal {A}}\) in our setting are intended to capture observable events in multinomial experiments. The successive observations \(\omega _i\) in a finite sampling sequence \(\omega ^{\le i} =(\omega _1, \ldots , \omega _i)\) are thus regarded as outcomes of i independent and identically distributed trials, as in Examples 1 and 2. In the same manner, \(\mu (e)\) encodes the probability assigned to e by the unique multinomial probability distribution induced by each \(\mu \in M_O\) on \((\varOmega , {\mathcal {A}})\) (which by a slight abuse of notation we also denote by \(\mu \)).
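For finitely many candidate distributions, these joint plausibilities are straightforward to compute. Here is a small Python sketch (our own, with hypothetical names), where the event e is the observation of a finite sampling sequence, so that \(\mu (e)\) is the i.i.d. product probability:

```python
from math import prod

def joint_plausibility(pla_mu, mu, seq):
    """pla(mu, e) = pla(mu) * mu(e), where e = [seq] is the event of
    observing the finite sampling sequence `seq` (i.i.d. trials)."""
    return pla_mu * prod(mu[o] for o in seq)

def joint_plausibility_of_set(pla, P, seq):
    """pla(P, e) = sup { pla(mu) * mu(e) | mu in P }, for finite P.
    Distributions are encoded as tuples of (outcome, probability)
    pairs, so they can serve as dictionary keys."""
    return max(joint_plausibility(pla[mu], dict(mu), seq) for mu in P)

# Two candidate coins; the fair one is deemed a priori more plausible.
fair   = (("H", 0.5), ("T", 0.5))
biased = (("H", 0.8), ("T", 0.2))
pla = {fair: 1.0, biased: 0.6}
print(joint_plausibility_of_set(pla, {fair, biased}, "HHH"))  # 0.3072
```

Note that in this toy run the sample HHH already makes the pair (biased coin, e) jointly more plausible (\(0.6\cdot 0.8^3=0.3072\)) than the pair (fair coin, e) (\(1\cdot 0.5^3=0.125\)), even though the fair coin has the higher prior plausibility.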

Definition 2

(Plausibility models) A (probabilistic) plausibility model is a structure \({{\mathbf {M}}}= (M, pla)\) where \(M\not =\emptyset \) is a non-empty subset of \(M_O\), called the set of ‘possible distributions’, and \(pla: {\overline{M}}\rightarrow [0,1]\) is a plausibility measure on the closure \({\overline{M}}\), called the probabilistic plausibility ranking map (or just plausibility map, for short), and required to satisfy two additional conditions: (1) possible distributions have positive plausibility rank, i.e. \(pla(\mu )>0\) for all \(\mu \in M\); (2) \(pla(M)=1\), or equivalently the maximum plausibility value on \({\overline{M}}\) is 1.

The plausibility map induces a total preorder \(\le ^{{\mathbf {M}}}\) on the possible distributions in M, called the plausibility ranking order, and given by putting for all \(\mu ,\nu \in M\):

$$\begin{aligned} \mu \le ^{{\mathbf {M}}}\nu \,\, \text{ iff } \,\, pla(\mu )\le pla(\nu ). \end{aligned}$$

For every real number \(\delta \in [0,1]\), we put \(M^\delta :=\{\mu \in M\, \vert \, pla(\mu )\ge \delta \}\) for the set of all distributions in M that have plausibility rank at least \(\delta \). A (probabilistic) Grove sphere is a non-empty set of the form \(M^\delta \ne \emptyset \). It is easy to see that the family of all Grove spheres \({\mathcal {S}}:=\{M^\delta \, \vert \, \delta \in [0,1], M^\delta \ne \emptyset \}\) is nested (i.e. totally ordered by inclusion: in fact, for \(\delta \ge \varepsilon \) we have \(M^\delta \subseteq M^\varepsilon \)), and exhaustive (i.e. \(M=\bigcup {\mathcal {S}}\)).

The plausibility map pla attains its supremum (1) on M if and only if there exists a smallest Grove sphere, given by the set

$$\begin{aligned} Max (M):= M^{1}=\{\nu \in M\,|\, pla(\nu )=1\}=\{\nu \in M\,|\, \nu \ge ^{{{\mathbf {M}}}} \nu ' \ \text{ for } \text{ all } \nu '\in M\} \end{aligned}$$

of all maximizers of the function \(\textit{pla}\) on M.

The plausibility \(\textit{pla}(e)\) of an event e in the model \({{\mathbf {M}}}=(M,pla)\) is defined as the joint plausibility \(\textit{pla}(M,e)\):

$$\begin{aligned} pla(e)\,\,\, :=\,\,\, pla(M,e)= sup \{ pla(\mu )\cdot \mu (e) \, \vert \, \mu \in M\} \end{aligned}$$

A plausibility model \({{\mathbf {M}}}=(M, pla)\) is said to be closed if the set M of possible distributions is closed (in the standard topology on \(M_O\)). The model is said to be convex if the set M is convex (i.e. \(\alpha \cdot \mu + (1-\alpha )\cdot \nu \in M\) for all \(\mu ,\nu \in M\) and all \(\alpha \in [0,1]\)).

The difference between plausibility measures and (the special case of) plausibility ranking maps is a plausibilistic analogue of the difference between measures in Measure Theory and (the special case of) probability functions. Although conditions (1) and (2) in the definition of plausibility maps may appear very restrictive at first sight, they do not in fact restrict the generality of our plausibility ranking order: the next example shows that any plausibility measure can be used to define plausibility ranking maps.

Generic example: plausibility-generating measures Let \(\textit{pla}:K\rightarrow [0,\infty )\) be any plausibility measure with \(dom(pla)=K={\overline{K}}\subseteq M_O\). Then pla induces a plausibility model \({{\mathbf {M}}}= (M,pla^M)\) on each non-empty subset \(M\subseteq \{\mu \in K\, \vert \, pla(\mu )\ne 0\}\), with the plausibility map \(pla^M\) given by renormalizing \(\textit{pla}\) to \(\textit{M}\), i.e. putting

$$\begin{aligned} pla^M(\mu ) \,\, :=\,\, \frac{pla(\mu )}{pla(M)}=\frac{pla(\mu )}{sup\{pla(\nu )\, \vert \, \nu \in M\}}= \frac{pla(\mu )}{max\{pla(\nu )\, \vert \, \nu \in {\overline{M}}\}} \end{aligned}$$

for all \(\mu \in {\overline{M}}\). In this case, we say that the plausibility ranking map \(pla^M\) is generated by the plausibility measure pla. Note that the plausibility ranking order \(\le ^{{\mathbf {M}}}\) induced by \(pla^M\) on \(\textit{M}\) coincides with the order induced by the generating measure \(\textit{pla}\), i.e. we have:

$$\begin{aligned} \mu \le ^{{\mathbf {M}}}\mu ' \,\, \text{ iff } \,\, pla(\mu )\le pla(\mu '). \end{aligned}$$

A plausibility-generating measure pla is said to be fully positive whenever its domain \(dom(pla)=M_O\) is the full set of all distributions, and its codomain is \((0,\infty )\) (i.e. \(pla(\mu )>0\) for all \(\mu \in M_O\)). This is a special case of great importance: fully positive measures generate plausibility models on every non-empty set of distributions \(M\subseteq M_O\).
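For a finite set M the renormalization above is a one-liner, since the supremum is a maximum. The following sketch (our own illustration; names are hypothetical, and coin distributions are represented simply by their value \(\mu (H)\)) generates a plausibility ranking map from any plausibility-generating measure that is positive on M:

```python
def generated_plausibility_map(pla, M):
    """Renormalize a plausibility-generating measure `pla` (positive
    on the finite set M) to a plausibility ranking map on M:
    pla^M(mu) = pla(mu) / max{ pla(nu) | nu in M }, so that the
    maximum plausibility value is 1, as conditions (1)-(2) require."""
    top = max(pla(nu) for nu in M)
    return {mu: pla(mu) / top for mu in M}

# Coins with mu(H) in {0.3, 0.5, 0.6}, ranked by a centered measure
# that prefers biases closer to 1/2 (cf. the canonical examples below).
M = [0.3, 0.5, 0.6]
pla = lambda p: 2 ** (-abs(p - 0.5))
print(generated_plausibility_map(pla, M))  # the fair coin gets rank 1.0
```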

Interpretation In a plausibility model, the current set of possibilities \(\textit{M}\) encodes an agent’s current epistemic state, her “hard information” or higher-level knowledge about a given probability distribution \(\mu \): all she knows for sure is that \(\mu \in M\). The agent may have come to this prior knowledge due to some previously received information (either in the form of observations obtained by sampling or in the form of higher-level information about the mechanism underlying the unknown distribution). On the other hand, pla represents the agent’s “soft information”, her current beliefs (and conditional beliefs, etc.) about the unknown distribution, typically acquired by sampling. Unlike in probabilistic inference processes (Paris 1994) (but like in most concrete examples of such processes), this doesn’t give only one (unconditional) belief, but a whole ranking of the distributions, in the form of a continuous function (which will give rise to a series of conditional beliefs): she considers the higher-ranked distributions to be more plausible than the lower-ranked ones. But, in contrast to knowledge, such soft information is not enough to exclude the less plausible distributions: the agent ‘believes’ that they are not the real distribution; but she doesn’t know it for certain. The agent believes every proposition satisfied by all the “top” (most plausible) distributions: the ones having plausibility rank 1; or, if such top distributions don’t exist, the agent will believe every proposition satisfied by all distributions that are “plausible enough”: i.e. all above any given plausibility rank \(1-\varepsilon \) (for any \(\varepsilon >0\)).

The above-defined extensions of the plausibility map have epistemic/doxastic significance: \(pla(\mu ,e)\) can be thought of as a way of assessing the joint plausibility of having true distribution \(\mu \) and observing event e. Note the analogy with the formula for the joint probability of two events. Similarly, \(\textit{pla}(P)\) gives us a way to assess the plausibility of a ‘proposition’: essentially, a set of distributions \(P\subseteq M\) is only as plausible as the most plausible element of P (if such an element exists); or more generally P is at least as plausible as all its elements, but no more than (i.e. \(\textit{pla}(P)\) is the supremum of all plausibility ranks in P). Note now the analogy with, but also the difference from, probability: the role usually played by addition is played here by the supremum. With this notation, condition (2) on plausibility models \((M,\textit{pla})\) can be restated simply as \(pla(M)=1\). Finally, \(\textit{pla}(P,e)\) combines the formulas for \(\textit{pla}(\mu , e)\) and for \(\textit{pla}(P)\) in the natural way, giving the joint plausibility of having the true distribution in P and observing event e. In particular, \(\textit{pla}(e):=\textit{pla}(M,e)\) is a natural definition for the plausibility of the event e.

Differences between plausibility and probability Note the key differences between plausibility models and probabilistic models. First, unlike in the probabilistic case, maximal plausibility \(pla(\mu )=1\) does not mean certainty or full belief, but only consistency with all the agent’s beliefs: the distributions \(\mu \) with \(pla(\mu )=1\) are “doxastically possible”, i.e. they satisfy every proposition believed by the agent. Second, the plausibility map does not obey Kolmogorov’s additivity axiom: the plausibility pla(P) of a set is not the sum of plausibility ranks of its elements, but rather their supremum. This, together with the above normalization requirement (2), suggests that the plausibilistic analogue of addition of probabilities is the operation of maximization (or more generally, taking the supremum).

Models for experiment-based information Closed models characterize the situations in which all prior knowledge about the distribution is based only on experimental evidence about the mechanism underlying this distribution: e.g. measurements of the side weights or asymmetries of a coin or dice; opening each of a number of urns (from which an unknown one will be chosen for later sampling) and counting (or approximately estimating) the marbles of a given color in the urn, etc. In such contexts, it is indeed natural to assume that M is closed: if a distribution is a limit of possible distributions in M, then it is indistinguishable from M by any such experimental means, and hence it cannot be excluded from M.

In the case that the experimental evidence is based only on measurements, it is natural to assume more, namely that M is both closed and convex: measurements typically produce interval estimates \([a, b]\) for the probability \(\mu (o)\) of each outcome. Indeed, such interval models are the ones most used when dealing with imprecise probabilities. More generally, the information obtained in this way may come in the form of linear constraints of the form \(\sum _{i=1}^n a_i \mu (o_i)\ge c\) (with \(a_1, \ldots , a_n, c\in {\mathbb {Q}}\)). Any finite set of such constraints gives a closed and convex set M of possible distributions.

One might wonder why we permit distributions in \({\overline{M}}{\setminus } M\) to have positive plausibility ranks, or even why we take the whole closure \({\overline{M}}\) (instead of \(\textit{M}\)) as the domain of the plausibility map. Given that the agent knows for sure that the true distribution lies within \(\textit{M}\), the distributions in \({\overline{M}}{\setminus } M\) are incompatible with the agent’s hard information, so they are known to be ‘impossible’ in view of this information. It would seem natural to require that \(pla\equiv 0\) on \({\overline{M}}{\setminus } M\), or else just restrict the domain of \(\textit{pla}\) to \(\textit{M}\). This can indeed be done if \(\textit{M}\) is closed. But in general, the technical condition of continuity poses constraints on the plausibility ranks of distributions in the closure \({\overline{M}}\), which may force some \(\mu \in {\overline{M}}{\setminus } M\) to have non-zero plausibility ranks. Even from a purely conceptual perspective, distributions in \({\overline{M}}{\setminus } M\) are in a sense “almost possible”, since they are not distinguishable from the ones in M by any experimental means. Their epistemic impossibility is only due to higher-order, non-experimental information, and so it makes sense to take them into account. Moreover, such ideal limit-distributions may have a high inherent plausibility (despite being ruled out by the current information). In some cases, they may be inherently more plausible than the possible distributions. In such cases, these distributions would be in principle believed on purely a priori grounds, though they are disbelieved (in fact known to be impossible) when the higher-level information is taken into account.

The above intuitions about knowledge and belief can be made formal as follows:

Definition 3

(Knowledge and belief) We say that a proposition \(P \subseteq M_O\) is known in the model \(\mathbf{M}=(M,pla)\), and write \({{\mathbf {M}}}\models K(P)\), if all possible distributions are in \(\textit{P}\); i.e. if \(M\subseteq P\).

We say that \(P\subseteq M_O\) is believed in the model \({{\mathbf {M}}}=(M,pla)\), and write \({{\mathbf {M}}}\models B(P)\), if all “plausible enough” distributions in \(\textit{M}\) are in \(\textit{P}\); i.e. iff there exists some \(\mu \in M\) such that \(\{\nu \in M \, \vert \, \nu \ge ^{{\mathbf {M}}}\mu \}\subseteq P\). An equivalent definition can be given in terms of Grove spheres: \(\textit{B(P)}\) holds in \({{\mathbf {M}}}\) iff \(\textit{P}\) includes some Grove sphere; i.e. iff there exists \(\delta \le 1\) such that \(\emptyset \ne M^\delta \subseteq P\); or, yet another equivalence: there exists \(\varepsilon \ge 0\) such that \(\emptyset \ne M^{1-\varepsilon }\subseteq P\).
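On finite models, these definitions can be implemented directly. A minimal Python sketch (our own, with hypothetical names) follows; it uses the fact that, for finite M, every non-empty Grove sphere is of the form \(M^{pla(\mu )}\) for some \(\mu \in M\):

```python
def knows(M, pla, P):
    """K(P): every possible distribution is in P."""
    return set(M) <= set(P)

def believes(M, pla, P):
    """B(P): some non-empty Grove sphere {nu : pla(nu) >= delta} is
    included in P; for finite M, checking delta = pla(mu) suffices."""
    for mu in M:
        sphere = {nu for nu in M if pla[nu] >= pla[mu]}
        if sphere and sphere <= set(P):
            return True
    return False

# Three candidate coin biases, the fair coin being the most plausible.
M = [0.3, 0.5, 0.7]
pla = {0.3: 0.6, 0.5: 1.0, 0.7: 0.6}
P = [0.4, 0.5, 0.6]          # "the bias is close to 1/2"
print(knows(M, pla, P))      # False: 0.3 and 0.7 are still possible
print(believes(M, pla, P))   # True: the smallest sphere {0.5} is in P
```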

Connections to belief revision theory Grove sphere models (in non-probabilistic form, consisting of possible worlds instead of distributions) form the standard semantic framework in Belief Revision Theory (Grove 1988). Plausibility models (again, in their non-probabilistic version) are well-known equivalent relational descriptions of sphere models, that are preferred in Dynamic Epistemic Logic (Baltag and Smets 2008a, b; Baltag et al. 2019a; van Benthem 2007, 2011), as well as in the “dynamic interactive epistemology” approach developed by game-theorists (Board 2004). These are in fact adaptations to doxastic modeling of the older setting of Lewis spheres, with its equivalent description in terms of a comparative similarity relation (Lewis 2000). In these models, the elements of M are taken to be possible worlds, or possible ‘states’ of the world, and the structure is purely qualitative, given either in terms of a nested, exhaustive family of spheres, or in terms of a total preorder on worlds. Sometimes an additional converse well-foundedness condition, or a weaker ‘Limit Condition’, is imposed to ensure the existence of maximal elements \(Max(M)\not =\emptyset \) (or equivalently, the existence of a smallest sphere). As seen below, this simplifies the definition of (conditional) belief, as the doxastic analogue of Lewis conditionals. But as noted by Lewis (2000), such additional assumptions are not really needed, since a satisfactory notion of conditional (or conditional belief) can still be defined in non-converse-wellfounded models. Hence, we make no such additional assumptions here.

Our models are just a special case of plausibility models, adapted to a probabilistic setting: the possible worlds come as probability distributions, while the plausibility preorder and the Grove spheres are quantitatively defined from a plausibility ranking map. But the mechanism for forming beliefs B(P) and conditional beliefs B(P|Q) in our probabilistic plausibility models will be exactly the same as in the general (non-converse-wellfounded, non-probabilistic) plausibility models.

Connections to inference processes Our probabilistic plausibility models can also be seen as a generalization and refinement of Paris’ inference processes (Paris 1994; Paris and Rad 2008; Paris and Vencovska 1997). Roughly speaking, an inference process is a map Bel assigning to each set \(M\subseteq M_O\) of distributions some “believed” distribution \(Bel(M)\in M\). The definition in Paris (1994) actually restricts the domain of Bel to a subclass of \({\mathcal {P}}(M_O)\) (namely the ones definable by a set of linear inequalities), but our more general setting extends this to all sets of distributions. A good look at Paris’ examples of interesting inference processes shows that all of them define the salient distribution \(\textit{Bel(M)}\) by maximizing (or minimizing) over M a certain continuous quantity (entropy, distance from centre of mass, distance from barycentre, etc). Our approach makes explicit this method of generating inference processes, in the form of the plausibility map, and recognizes it as just a special case of the standard method of belief formation in Logic and Belief Revision Theory. Generalizing to arbitrary sets of distributions also forces us to give up the insistence on only one most preferred distribution, or even a set of most preferred distributions. Following Lewis’ approach (Lewis 2000) (as later adapted to non-converse-wellfounded plausibility models), one can still define beliefs as we did above, in terms of propositions that hold on all distributions that are plausible enough. Indeed, this seems the most natural generalization of maximization-based inference processes to arbitrary sets.

In closed models (and more generally in models in which the plausibility map attains its maximum value 1) the definition of belief can be simplified, yielding the maximization-based notion of belief that is standard in both inference processes and Belief Revision Theory (in terms of maximizing plausibility rank). In such cases, belief amounts to truth in all the ‘most plausible’ distributions (the ones with plausibility rank 1):

Proposition 4

If \({{\mathbf {M}}}=(M,pla)\) is a closed model, then there exists some \(\mu \in M\) with \(pla(\mu )=1\); i.e. we have \(Max(M)\ne \emptyset \). Moreover, whenever \({{\mathbf {M}}}=(M,pla)\) is any model with \(Max(M)\ne \emptyset \) (and hence in particular, whenever M is closed) and \(P\subseteq M_O\) is any proposition, we have that: P is believed iff all distributions in M with plausibility 1 satisfy P; i.e.

$$\begin{aligned} B(P) \hbox { holds in}\ {{\mathbf {M}}}\,\,\, \text{ iff } \,\,\, Max(M)\subseteq P. \end{aligned}$$

Proof

For the first part, let \(M \subseteq M_O\) be closed. Since \(\textit{pla}\) is a continuous function, we can use Propositions 1, 2(1) and 3 to conclude that \(\textit{pla}\) attains its supremum on \(\textit{M}\), hence \(M^1=Max(M) \ne \emptyset \).

For the second part, assume only that \(Max(M) \ne \emptyset \). To prove the left-to-right direction in the displayed equivalence, suppose that \(\textit{B(P)}\) holds; then by definition, there exists \(\delta \le 1\) such that \(\emptyset \ne M^\delta \subseteq P\). But \(\delta \le 1\) implies \(M^1\subseteq M^\delta \), hence by transitivity of inclusion we conclude that \(Max(M)=M^1\subseteq P\), as desired.

For the converse, suppose that we have \(Max(M)=M^1\subseteq P\). Since \(Max(M) \ne \emptyset \), take any \(\mu \in Max(M)=M^1\). Then we have \(\{\nu \in M \, \vert \, pla(\nu ) \ge pla(\mu )=1\}= M^1\) (since pla cannot take values larger than 1), hence \(\{\nu \in M \, \vert \, pla(\nu ) \ge pla(\mu )\}\subseteq P\), i.e. \(\textit{B(P)}\) holds in \({{\mathbf {M}}}\). \(\square \)

Some canonical plausibility maps and plausibility-generating functions

Here are some specific examples:

  1. Entropy-based plausibility maps: The most direct implementation of the Principle of Indifference is to take as our generating plausibility measure the Shannon entropy \(Ent:M_O\rightarrow [0,\infty )\) (illustrated in the code sketch after this list), given by putting

    $$\begin{aligned} Ent(\mu ):=-\sum _{o \in O,\, \mu (o)\not =0} \mu (o)\log (\mu (o)) \end{aligned}$$

    It is convenient to assume that the logarithms are taken in base n (where recall that \(n=|O|\) is the number of outcomes in O). This measure generates a plausibility model \({{\mathbf {M}}}=(M, Ent^M)\) on every non-empty set \(M\subseteq M_O\) of distributions with positive entropy. The generated probabilistic plausibility map \(Ent^M\) is obtained by renormalizing entropy with respect to \(\textit{M}\), as described in the generic example above: for \(\mu \in {\overline{M}}\), put \(Ent^M (\mu ):=\frac{Ent(\mu )}{Ent(M)}\), where \(Ent(M):={sup\{Ent(\nu )\,\vert \, \nu \in M\}}\). So the most plausible distribution will be the one with highest Shannon entropy, i.e. the most uninformative one. More generally, less informative distributions will be more plausible than more informative ones. Note also that, when using logarithms in base \(n=|O|\), we have \(Ent(M_O)=Ent(\mu ^{eq})=\sum _{1\le i\le n} -\frac{1}{n}\log _n\frac{1}{n}=1\) (where \(\mu ^{eq}\) is the distribution that gives equal probability \(\frac{1}{n}\) to every outcome), hence \(Ent^{M_O}=Ent\).

    One of the “defects” of entropy \(\textit{Ent}\) as a plausibility-generating measure is that it may take value zero, so it is not fully positive. This means there exist non-empty sets of distributions \(\textit{M}\), for which \(\textit{(M, Ent)}\) is technically speaking not a plausibility model (since \(Ent(\mu )=0\) for some \(\mu \in M\)): indeed, the set \(M_O\) of all distributions is such a counterexample! But recall that only the plausibility (pre-)order \(\le ^{{\mathbf {M}}}\) is of relevance when forming beliefs. So we can take instead any positive continuous function that induces the same order. One simple way to do this is to add to entropy some fixed positive number, say 1. In this way we obtain a fully positive version of entropy measure \(Ent^{+}:M_O \rightarrow (0, \infty )\), given by putting

    $$\begin{aligned} Ent^{+} (\mu ):= 1 +Ent(\mu )= 1-\sum _{o \in O,\, \mu (o)\not =0} \mu (o)\log (\mu (o)). \end{aligned}$$

    Using \(Ent^+\) as our plausibility-generating measure, we generate a plausibility model \((M, Ent^{+M})\) on every non-empty set \(M\subseteq M_O\), whose plausibility map is once again obtained by renormalizing \(Ent^{+}\) to M. Moreover, \(Ent^{+M}\) agrees with \(Ent^M\) on the ranking order between any two distributions, so it induces the same plausibility ranking order as the one given by entropy. As a consequence, for every plausibility model \((M,Ent^M)\), all beliefs and conditional beliefs (as well as knowledge) are the same as in the model \((M, Ent^{+M})\).

    Philosophically speaking, taking either \(\textit{Ent}\) or \(Ent^+\) as one’s plausibility measure amounts intuitively to the adoption of the Principle of Indifference at the level of the possible outcomes.

  2. Cautious plausibility: The most ‘cautious’ choice of plausibility is assigning equal plausibility to all possible distributions, e.g. taking

    $$\begin{aligned} C(\mu ):=1 \,\,\,\,\,\,\, \,\, \text{ for } \text{ all }\ \mu \in M_O. \end{aligned}$$

    Obviously, this is a fully positive plausibility measure, so it induces a plausibility model on every non-empty set \(M\subset M_O\) (with the generated plausibility map given by the restriction of \(\textit{C}\) to \({\overline{M}}\)).

    Cautious plausibility can be thought of as yet another application of the Principle of Indifference at a higher level (that of all possible distributions): since a priori there is no reason to prefer a distribution to another, the prior plausibility assigns equal rank to all of them. With this cautious choice, the prior beliefs do not go beyond what is known: the agent only believes what she knows. (But as we’ll see, this is no longer the case after more information is received, e.g. via sampling evidence from the unknown distribution.)

  3. Typicality-based plausibility maps: The so-called Limiting Centre of Mass of a set of distributions \(M\subseteq M_O\) (also called Centre of Mass Infinity) is the output of a probabilistic inference process (Paris 1994) that involves maximizing the quantity \(\sum _{o \in O(M)} \log (\mu (o))\), where we fixed a set \(M\subseteq M_O\) and used the notation \(O(M)=\{o\in O\,\vert \, \exists \mu \in M\ \text{ with } \mu (o)>0\}\). Whenever it exists and is unique (as e.g. in the case of closed and convex sets \(\textit{M}\)), the maximizer of this function over a set of distributions \(M\subseteq M_O\) gives a form of ‘averaging’ over \(\textit{M}\). So, in general, distributions for which this quantity has a higher value are closer to the average of \(\textit{M}\).

    Unfortunately, the function \(\sum _{o \in O(M)} \log (\mu (o))\) takes no positive values, since its range is \([-\infty , 0)\). But once again, only the induced ranking order is of relevance when forming beliefs, so we can apply any continuous transformation from \([-\infty ,0)\) to [0, 1) (e.g. \(x \mapsto 2^x\)), to obtain a plausibility measure

    $$\begin{aligned} CM_{\infty }(\mu ) := 2^{\sum _{o \in O(M)} \log (\mu (o))}= \prod _{o\in O(M)} \mu (o), \end{aligned}$$

    where here we assumed that the logarithm is taken in binary base. Assume now that \(M\subseteq M_O\) is a non-empty set with the property that for every outcome \(o\in O\), we have either \(\mu (o)=0\) for all \(\mu \in M\) or else \(\mu (o)>0\) for all \(\mu \in M\). Then the measure \(CM_{\infty }\) generates a probabilistic plausibility map on \(\textit{M}\), obtainable once again by renormalization to \(\textit{M}\).

    If we instead first apply a slightly different transformation (\(x\mapsto 1+2^x\)), we can go further and convert \(CM_{\infty }\) into a fully positive plausibility measure \(CM_{\infty }^+\). This helps avoid any restrictions on \(\textit{M}\): as long as \(M\ne \emptyset \), \(CM_{\infty }^+\) generates a probabilistic plausibility map \(CM_{\infty }^{+M}\) on \(\textit{M}\), that induces the same preorder \(\le ^{{\mathbf {M}}}\) as the original function \(\sum _{o \in O(M)} \log (\mu (o))\). Hence, \((M, CM_\infty ^{+M})\) is a plausibility model for every non-empty \(\textit{M}\), and its ranking order, beliefs, conditional beliefs etc, agree with those of \((M, CM_{\infty }^M)\), whenever the latter is a plausibility model.

    Taking \(CM_\infty ^M\) or \(CM_\infty ^{+M}\) as one’s plausibility ranking amounts intuitively to the adoption of a Principle of ‘Averageness’ or Typicality. Indeed, the probability distributions in M that have a higher \(CM_{\infty }^{+M}\)-plausibility will be those that are “more typical”, more ‘normal’ or representative for \(\textit{M}\); while the most plausible ones are the “most typical”.

    Another typicality-based plausibility map is related to the barycentre inference process (Paris 1994): this involves minimizing the function

    $$\begin{aligned} \mu \, \mapsto \, \sup \{d(\mu ,\nu )\, \vert \, \nu \in M\} \end{aligned}$$

    If it exists and is unique, the minimizer of this function over M is called the barycentre of the set M, and it gives another notion of averageness or representativeness. It chooses the distribution \(\mu \) that minimizes the worst error that could be made (when one wrongly takes \(\mu \) to be the true probability). To convert this into a maximization problem, we can apply the transformation \(x\mapsto 2^{-x}\), obtaining the (fully positive) barycentric plausibility measure \(B^M:{\overline{M}}\rightarrow (0,1]\), defined for any non-empty set \(M\subseteq M_O\) and arbitrary distribution \(\mu \in {\overline{M}}\) by:

    $$\begin{aligned} B^M(\mu ):=2^{-sup \{d(\mu ,\nu )\, \vert \, \nu \in M\}}. \end{aligned}$$

    Using again renormalization, this generates a probabilistic plausibility map on M, that will assign higher plausibility to distributions that are closer to M’s barycenter.

  4. Evidence-based plausibility: Given an observed event \(e\in {{\mathcal {E}}}\), we may prefer distributions that maximize the probability of e. This corresponds to taking as our plausibility measure the function \(F_e\) from Lemma 1, given by \(F_e(\mu )={\mu }(e)\). This gives higher ranking to distributions that assign higher probability to the event e. When renormalized to any non-empty set \(M\subseteq M_O\) with the property that \(\mu (e)>0\) for all \(\mu \in M\), it induces a plausibility model \((M,F^M_e)\), given by \(F^M_e(\mu ):= \frac{{\mu }(e)}{sup\{{\nu }(e)\,\vert \, \nu \in M\}}\).

  5. Centered plausibility: Given a salient distribution \(\mu \) (that is considered as the most plausible), one may adopt a plausibility map given by a “normal” curve centered at \(\mu \). This means that distributions that are closer to \(\mu \) are considered more plausible than the ones that are farther: \(pla(\nu )\ge pla(\nu ')\) iff \(d(\nu ,\mu )\le d(\nu ', \mu )\). One example of a fully positive plausibility measure that induces this ranking order is \(C_\mu : M_O\rightarrow (0,1]\), given by putting \(C_\mu (\nu ):= 2^{-d(\nu ,\mu )}\).

  6. Plausibility based on second-order probability: Let \(M\subseteq M_O\) be a discrete set of distributions, and let \(P:M\rightarrow [0,1]\) be any second-order probability mass function (cf. Gaifman and Snir 1982; Gaifman 2016), that is required to satisfy \(P(\mu )>0\) for all \(\mu \in M\) and \(\sum _{\mu \in M} P(\mu )=1\). Then this function can be extended to a continuous function \(P:{\overline{M}}\rightarrow [0,1]\), by putting \(P(\mu ):=0\) for all limit points \(\mu \in {\overline{M}}{\setminus } M\). The fact that this extension is continuous follows from the assumption that \(\sum _{\mu \in M} P(\mu )=1\), which implies that \(lim_{n\rightarrow \infty } P(\mu _n)=0\) for any infinite sequence of distinct points \(\mu _n\in M\). By taking this extended function \(\textit{P}\) as our plausibility measure, we generate a plausibility model \((M, P^M)\), by renormalizing as above: \(P^M(\mu ):=\frac{P(\mu )}{sup\{P(\mu ) \,\vert \, \mu \in M\}}= \frac{P(\mu )}{max_M(P)}\). (Note that, in order for \(\sum _{\mu \in M} P(\mu )\) to have a finite value, P must attain a maximum value \(max_M(P):=max\{P(\mu ) \,\vert \, \mu \in M\}\) on \(\textit{M}\).)

    However, note that the beliefs based on the plausibility ranking \(P^M\) will not necessarily match the Lockean beliefs based on the second-order probability \(\textit{P}\). Only the distributions \(\mu \in Max(M)\), having \(P^M(\mu )=1\), or equivalently \(P(\mu )=max_M(P)\), are relevant for the agent’s plausibilistic beliefs: she will believe that the true distribution is one of the ones in \(\textit{Max}(M)\). This will hold even in the case that \(\sum _{\mu \in Max(M)}P(\mu )<\frac{1}{2}\); while an agent using \(\textit{P}\) as her second-order probability will have in this case precisely the opposite belief: she believes that the true distribution is in \(M{\setminus } Max(M)\), since this is more likely to be the case. This points yet again to the fundamental difference between the interpretation of a function as a plausibility map versus its meaning as a probability function. Plausibility ranks do not obey the Kolmogorov additivity axiom, but instead higher plausibility ranks simply dominate lower ones.
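As promised in the first item above, here is a small Python sketch (our own illustration, for a coin with outcomes H and T) implementing the fully positive entropy measure \(Ent^{+}\) and the centered measure \(C_{\mu ^{eq}}\), and showing that both assign higher values to biases closer to \(\frac{1}{2}\), hence induce the same plausibility ranking order on coin distributions:

```python
from math import log, sqrt

def ent_plus(mu):
    """Ent+(mu) = 1 + Ent(mu), with logarithms taken in base n = |O|,
    so this fully positive measure has maximum value 2 (at mu^eq)."""
    n = len(mu)
    return 1 - sum(p * log(p, n) for p in mu.values() if p > 0)

def centered(mu, center):
    """C_center(mu) = 2^{-d(mu, center)}, d the Euclidean metric."""
    d = sqrt(sum((mu[o] - center[o]) ** 2 for o in mu))
    return 2 ** (-d)

fair = {"H": 0.5, "T": 0.5}
for bias in (0.5, 0.6, 0.9):
    mu = {"H": bias, "T": 1 - bias}
    print(bias, round(ent_plus(mu), 3), round(centered(mu, fair), 3))
# Both columns decrease as the bias moves away from 1/2, so the two
# measures induce the same ranking order on coin distributions.
```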

Example 1

(continued). In the Coin example, if we initially have no information about the coin, the set of possible coin biases will be the set \(M_O\) of all probability mass functions on \(O=\{H, T\}\). Suppose that we have background information that the extremal distributions (\(\mu _0\) with \(\mu _0(H)=0\), and \(\mu _1\) with \(\mu _1(H)=1\)) are impossible. Then the set of possibilities is given by \(M:=M_O{\setminus } \{\mu _0,\mu _1\}\). We can choose the entropy \(\textit{Ent}\) as our plausibility map, as this can be justified here in terms of symmetry: the faces of a coin (or a die) are symmetric, so there is no reason to prefer one outcome over another. Then \((M, Ent)\) is a plausibility model, where the highest plausibility will be given to the distribution with the highest entropy: the fair-coin distribution \(\mu ^{eq}\), assigning \(\mu ^{eq}(H)=\mu ^{eq}(T)=\frac{1}{2}\) (since for every \(\nu \ne \mu ^{eq}\) we have \(Ent(\nu ) < Ent(\mu ^{eq})\)). So entropic plausibility starts with an initial belief in the fairness of the coin (and more generally it assigns a higher ranking to a distribution that corresponds to a more well-balanced coin). Note that entropic plausibility induces the same ranking order on this model as the centered plausibility \(C_{\mu ^{eq}}\) (centered at the fair-coin distribution \(\mu ^{eq}\)).

If, however, we cannot exclude any distribution (not even the extremal ones), then the set of possibilities is the whole \(M_O\), and Ent will no longer give us a plausibility model. Still, we can choose instead the positive version of entropic plausibility \(Ent^+\), which makes \((M_O, Ent^+)\) into a plausibility model, while maintaining the same initial belief in the coin’s fairness (and the same preference for more well-balanced coins). Note again that \(Ent^+\) still induces the same ranking order on this model as the centered plausibility \(C_{\mu ^{eq}}\).

Example 2

(continued). In the Urn example, we initially have no other information besides the three colors, so the set of possibilities is the set \(M_O\) of all distributions over \(O=\{R, B, G\}\). Since there is no reason to prefer any one distribution over any other (and no considerations of symmetry are relevant, since we cannot see inside the urn to somehow assess whether there is a rough balance between the quantities of marbles of different colors), the most natural prior ranking seems to be in this case the cautious plausibility \(\textit{C}\): each possible distribution is assigned an equal plausibility of 1. In the plausibility model \((M_O,C)\), the agent has no other rational beliefs at this stage, beyond what she knows.

Proposition 5

Let \({{\mathbf {M}}}=(M,pla)\) be any model. Then knowledge and belief satisfy the following properties:

  1. Knowledge is truthful: if \(\textit{K(P)}\) holds, then \(\textit{P}\) holds at all possible distributions (i.e. \(M \subseteq P\));

  2. 2.

    Tautologies are known: \(K(M_O)\) holds;

  3. 3.

    Knowledge implies belief: if \(\textit{K(P)}\) holds then \(\textit{B(P)}\) holds.

  4. 4.

    Belief is consistent: \(B(\emptyset )\) never holds;

  5. 5.

    Knowledge and belief are closed under entailment: if \(P\subseteq Q\), then \(\textit{K(P)}\) implies \(\textit{K(Q)}\), and similarly \(\textit{B(P)}\) implies \(\textit{B(Q)}\);

  6. 6.

    Knowledge and belief are (finitely) conjunctive: if \(K(P_i)\) holds for all \(1\le i\le n\), then \(K(\bigcap _{i=1}^n P_i)\) holds; similarly, if \(B(P_i)\) holds for all \(1\le i\le n\), then \(B(\bigcap _{i=1}^n P_i)\) holds;

  7. 7.

    Any finite number of beliefs are mutually consistent: if \(B(P_i)\) holds for all \(1\le i\le n\), then \(\bigcap _{i=1}^n P_i\ne \emptyset \).

Proof

Properties 1,2,3,4 follow immediately from the definitions of knowledge and belief. Property 5 for knowledge follows directly from property 1. For property 5 for belief: \(\textit{B(P)}\) gives the existence of some \(\delta >0\) with \(\emptyset \ne M^\delta \subseteq P\), which together with \(P\subseteq Q\) gives us \(\emptyset \ne M^\delta \subseteq Q\), hence B(Q) holds. Property 6 for knowledge follows from property 1, via the sequence of implications: if \(K(P_i)\) holds for all \(1\le i\le n\), then \(M\subseteq P_i\) for all \(1\le i\le n\), so \(M\subseteq \bigcap _{i=1}^n P_i\), hence \(K(\bigcap _{i=1}^n P_i)\) holds. Property 6 for belief: suppose that \(B(P_i)\) holds for all \(1\le i\le n\); so, for every \(1\le i\le n\), there exists some \(\delta _i>0\) s.t. \(\emptyset \ne M^{\delta _i}\subseteq P_i\). Take \(\delta =min\{\delta _i\, \vert \, 1\le i\le n\}\). Then we have \(\delta >0\), \(M^\delta \ne \emptyset \), and \(M^\delta \subseteq \bigcap _{i=1}^n M^{\delta _i} \subseteq \bigcap _{i=1}^n P_i\), hence \(B(\bigcap _{i=1}^n P_i)\) holds. Property 7 follows immediately from properties 6 and 4. \(\square \)

Finally, one should note that belief in closed models (or more generally, any model having most plausible distributions) is better behaved, having stronger consistency and conjunctivity properties, than in arbitrary models:

Proposition 6

Let \({{\mathbf {M}}}=(M,pla)\) be any model with \(Max(M)\ne \emptyset \). Then we have the following:

  • beliefs are closed under arbitrary conjunctions: if \(\{P_i\, \vert \, i\in I\}\) is a family of propositions such that \(B(P_i)\) holds for all \(i\in I\), then \(B(\bigcap _{i\in I} P_i)\) also holds;

  • beliefs are globally consistent: \(\bigcap \{P\subseteq M_O\, \vert \, B(P)\ \text{ holds } \text{ in }\ {{\mathbf {M}}}\}\ne \emptyset \).

In particular, these properties hold in closed models.

Proof

For the first item, suppose that \(B(P_i)\) holds for all \(i\in I\). Then, by the second part of Proposition 4, we have \(Max(M)\subseteq P_i\) for all i, and hence \(Max(M)\subseteq \bigcap _{i\in I} P_i\), so that \(B(\bigcap _{i\in I} P_i)\) holds (again by Proposition 4).

For the second item, we apply the first item to the family \(\{P{\subseteq } M_O\, \vert \, B(P)\ \text{ holds } \text{ in }\ {{\mathbf {M}}}\}\) to infer that \(B(\bigcap \{P \subseteq M_O\, \vert \, B(P) \hbox { holds in}\ {{\mathbf {M}}}\})\) holds, then apply Proposition 5.4 (consistency of belief) to obtain the desired conclusion.\(\square \)

The following example shows that the above properties do not necessarily hold in arbitrary probabilistic plausibility models!

Counterexample: Suppose that, in the Coin Example, our agent learns from the manufacturer only one piece of information: the coin is not completely fair, due to very small, accidental imperfections (rather than any intentional bias). What is a rational agent, who forms entropy-based beliefs, supposed to believe? Smaller imperfections seem to be more plausible than larger ones: hence, any bias closer to \(\frac{1}{2}\) is more plausible than one that is farther. On the other hand, the agent knows for sure that the coin is not fair. Our agent has acquired omega-inconsistent beliefs, which nevertheless seem rational, given her information.

To formalize this counterexample, take \(O=\{H, T\}\) as in the Coin Example, and take the model \((M,Ent^+)\) with \(M=M_O{\setminus } \{\mu ^{eq}\}\), where \(\mu ^{eq}(H)=\mu ^{eq}(T)=\frac{1}{2}\) is the fair-coin distribution and \(Ent^+\) is the positive version of entropic plausibility. Recall that \(Ent^+\) induces the same ranking order on \(M_O\) as the centered plausibility \(C_{\mu ^{eq}}\): distributions that are closer to \(\mu ^{eq}\) are more plausible than the ones that are farther. For each \(n\ge 2\), take \(P_n:=\{\mu \in M\, \vert \, \mu (H)\in (\frac{1}{2}-\frac{1}{n}, \frac{1}{2}+\frac{1}{n})\}\). Then \(B(P_n)\) holds for all \(n\ge 2\) (since every distribution close enough to \(\mu ^{eq}\) is in \(P_n\)), but \(\bigcap _{n\ge 2} P_n=\emptyset \) (since \(\mu ^{eq}\not \in M\)), hence beliefs are globally inconsistent; moreover, \(B(\bigcap _{n\ge 2} P_n)\) does not hold (since \(B(\emptyset )\) is false, by Proposition 5.4), hence beliefs are not necessarily closed under countable conjunctions.

This counterexample shows that plausibility-based beliefs in non-closed models may be subject to a kind of Infinite Lottery Paradox: though believing, for each \(n\ge 2\), that the coin’s bias is in \((\frac{1}{2}-\frac{1}{n}, \frac{1}{2}+\frac{1}{n}){\setminus } \{\frac{1}{2}\}\), our agent does not believe that the bias is in the (empty) intersection of all these sets. So beliefs in non-closed models may exhibit a type of ‘omega-inconsistency’: though each belief is consistent, and any finitely many beliefs are mutually consistent, the family of all beliefs may still be inconsistent, when taken as a whole!

We think this is a small price to pay for being able to form beliefs when given arbitrary information \(M\subseteq M_O\). Situations such as in the above counterexample can occur in practice, whenever partial information is obtained, say by communication. Still, readers who consider global doxastic consistency to be an inherent feature of rationality are welcome to restrict our framework to models in which the plausibility map attains a maximum value. Full infinitary conjunctivity and global consistency of beliefs can be regained in this way, without any other loss, except for generality.

4 Conditioning and belief dynamics

One of the main motivations for developing the setting that we investigate here is to capture the process of learning a distribution as a form of iterated belief revision, resulting from receiving new information. But, as already explained, the two components of our probabilistic plausibility models \({{\mathbf {M}}}=(M,pla)\) capture two different types of information about the unknown distribution \(\mu \): the set \(\textit{M}\) represents the agent’s hard higher-level information about \(\mu \) (her ‘knowledge’, given by the proposition \(M\subseteq M_O\)); while the plausibility map \(pla:M_O\rightarrow [0,1]\) represents the agent’s soft information about \(\mu \) (typically obtained by sampling or other observational events), her “beliefs” given by the ranking order. Each of these two forms of information is subject to its own type of revision, captured by its own form of conditioning or update: (1) conditioning on a new proposition \(Q\subseteq M_O\), resulting in an eliminative update with the hard information \(\textit{Q}\), by which some distributions are eliminated while the plausibility ranking stays the same; and (2) conditioning on a new observational event \(e\in {{\mathcal {E}}}\), resulting in an upgrade of the plausibility map, by which distributions assigning a higher probability to e get a boost in their ranking, while the set M typically stays the same (except possibly for the elimination of those extreme distributions that assigned zero probability to e).

Definition 4

[Two forms of conditioning and updating] Given a plausibility model \({{\mathbf {M}}}=(M, pla)\), a proposition \(P\subseteq M_O\) is said to be compatible with \({{\mathbf {M}}}\) if the intersection \(M\cap P\) is non-empty. Similarly, an event \(e\in {\mathcal {E}}\) is said to be compatible with \({{\mathbf {M}}}\) if there exists some distribution \(\mu \in M\) with \({\mu }(e)\not =0\), i.e. if the set \(M_e:=\{\mu \in M\,\vert \, \mu (e)\ne 0\}\) is non-empty.

Let \(Prop_{{\mathbf {M}}}\) be the family of all propositions compatible with \({{\mathbf {M}}}\), and let \({\mathcal E}_{{\mathbf {M}}}\) be the family of all events compatible with \({{\mathbf {M}}}\). We can define two binary operations \(pla( . \vert .): {\overline{M}} \times Prop_{{\mathbf {M}}}\rightarrow [0, 1]\) (conditioning on a proposition) and \(pla( . \vert .): {\overline{M}} \times {\mathcal {E}}_{{\mathbf {M}}}\rightarrow [0,1]\) (conditioning on an event), by putting

$$\begin{aligned}&pla(\mu \vert P)\,\, :=\,\, \frac{pla(\mu )}{pla(M\cap P)}=\frac{pla(\mu )}{sup\{pla(\nu )\,\vert \, \nu \in M\cap P\}}, \\&pla(\mu \vert e)\,\, :=\,\, \frac{pla(\mu ,e)}{pla(e)}=\frac{pla(\mu ,e)}{pla(M,e)}=\frac{pla(\mu )\cdot \mu (e)}{sup\{pla(\nu )\cdot \nu (e)\, \vert \, \nu \in M\}}. \end{aligned}$$

The two types of conditioning give rise to two forms of dynamic operations on models, corresponding to two distinct varieties of learning: updating the plausibility model \({{\mathbf {M}}}=(M,pla)\) with a compatible proposition \(P\in Prop_{{\mathbf {M}}}\) yields the P-updated model \({{\mathbf {M}}}_P=(M_P, pla_P)\), given by \(M_P:=M\cap P\) and \(pla_P(\mu ):=pla(\mu \, \vert \, P)\) for \(\mu \in \overline{M\cap P}\); while updating the same model with a compatible event \(e\in {{\mathcal {E}}}_{{\mathbf {M}}}\) yields the e-updated model \({{\mathbf {M}}}_e=(M_e, pla_e)\), given by \(M_e:=\{\mu \in M: \mu (e)\not =0\}\) and \(pla_{e}(\mu ):= pla(\mu \, \vert \, e)\) for \(\mu \in \overline{M_e}\).

The first type of conditioning can be recognized as a plausibilistic analogue of the Kolmogorov definition of conditional probability, one that fits well with propositional learning. Note that the \(\textit{P}\)-conditional plausibility order \(\le _P\) in the model \({{\mathbf {M}}}_P\), given by

$$\begin{aligned} \mu \le _P \mu ' \,\, \text{ iff } \,\, pla_P(\mu )\le pla_P (\mu ') \,\, \text{ iff } \,\, pla(\mu )\le pla (\mu ') \,\, \text{ iff } \,\, \mu \le \mu ', \end{aligned}$$

is the same as the initial plausibility order \(\le \), except that it is restricted to \(M\cap P\) (since the renormalizing denominator \(pla(M\cap P)\) in the definition of \(pla_P\) doesn’t make a difference for the order). Indeed, the propositional update (generated by receiving new “hard” higher-order information P) shrinks the space of possible distributions M by eliminating certain possibilities, while leaving the plausibility map “essentially the same” (modulo the renormalizing factor). This shows that our propositional update falls well within the scope of traditional Belief Revision Theory, representing a special case of AGM conditioning.

On the other hand, the second type of conditioning can be seen as a plausibilistic analogue of Bayes’ conditioning formula (where the operation sup of taking suprema plays the role usually played by addition \(\sum \)), and thus captures a notion of learning through sampling. The event-conditioning rule weights the plausibility of each distribution by how well it predicts the observed sampling event e. Note that the e-conditional plausibility order \(\le _e\) in the model \({{\mathbf {M}}}_e\) is given by

$$\begin{aligned} \mu \le _e \mu ' \,\, \text{ iff } \,\, pla_e(\mu )\le pla_e(\mu ') \,\, \text{ iff } \,\, pla(\mu )\cdot \mu (e)\le pla(\mu ')\cdot \mu '(e). \end{aligned}$$

Indeed, the event update is generated by receiving “soft” information (obtained by sampling), and it naturally resembles the soft doxastic ‘upgrades’ (rather than updates) from Dynamic Epistemic Logic (Baltag and Renne 2016; van Benthem 2011; Baltag and Smets 2008b): it leaves the set of possibilities M “essentially the same” (since it does not necessarily eliminate any distribution, except for the extremal ones assigning probability 0 to e, if there are any in M), but only changes the plausibility ranking over them. Distributions that better fit the sampling evidence are ‘promoted’ in plausibility, while the others are demoted (but not eliminated, except for the extremal ones).
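To make the two operations concrete, here is a minimal computational sketch of Definition 4 (the finite dictionary model, the function names, and the toy data are our own illustrative choices; the paper’s models may of course be infinite):

# A model is a dict mapping each candidate distribution (a tuple of outcome
# probabilities) to its plausibility; an event is an i.i.d. observation sequence.

def normalize(pla):
    """Rescale so that the supremum of plausibility over the model is 1."""
    top = max(pla.values())
    return {mu: v / top for mu, v in pla.items()}

def update_prop(pla, P):
    """Eliminative update with hard information: keep only the distributions
    satisfying the predicate P; the plausibility order is unchanged."""
    kept = {mu: v for mu, v in pla.items() if P(mu)}
    assert kept, "P must be compatible with the model"
    return normalize(kept)

def likelihood(mu, outcomes, seq):
    """mu-probability of the i.i.d. sampling sequence seq."""
    p = 1.0
    for o in seq:
        p *= mu[outcomes.index(o)]
    return p

def update_event(pla, outcomes, seq):
    """Soft upgrade: weight each distribution by how well it predicts the
    observed event, eliminating only those that assign it probability 0."""
    kept = {mu: v * likelihood(mu, outcomes, seq)
            for mu, v in pla.items() if likelihood(mu, outcomes, seq) > 0}
    assert kept, "the event must be compatible with the model"
    return normalize(kept)

# Example: two coin-bias hypotheses, upgraded with the observation H, H, H.
pla = {(0.5, 0.5): 1.0, (0.8, 0.2): 0.9}
print(update_event(pla, ["H", "T"], ["H", "H", "H"]))  # the 0.8 bias now tops the order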

The next result confirms that our updates are well-defined operations on plausibility models:

Proposition 7

Let \({{\mathbf {M}}}=(M,pla)\) be a plausibility model, \(P\in Prop_{{\mathbf {M}}}\) be a compatible proposition, and \(e\in {\mathcal E}_{{\mathbf {M}}}\) be a compatible event. Then \({{\mathbf {M}}}_P=(M_P, pla_P)\) and \({{\mathbf {M}}}_e=(M_e, pla_{e})\), as defined above, are plausibility models.

Proof

For propositional updates, the compatibility of \(\textit{P}\) with \({{\mathbf {M}}}\) implies that the domain of the \(\textit{P}\)-updated model is non-empty: \(M_P=M\cap P\ne \emptyset \). Similarly, the compatibility of the event e with \({{\mathbf {M}}}\) implies that \(M_e\not =\emptyset \). It is easy to check that the function \(pla_P:\overline{M\cap P}\rightarrow [0,\infty )\), given by \(pla_P(\mu )=\frac{pla(\mu )}{pla(M\cap P)}\), indeed takes values in [0, 1] (since \(0\le pla(\mu )\le max\{pla(\nu )\, \vert \, \nu \in \overline{M\cap P}\}=sup\{pla(\nu )\,\vert \, \nu \in M\cap P\}=pla(M\cap P)\) for \(\mu \in \overline{M\cap P}\)); moreover, \(pla_P(\mu )>0\) for all \(\mu \in M_P=M\cap P\) (since the numerator \(pla(\mu )>0\) for all \(\mu \in M\)); and finally \(sup\{pla_P(\nu )\,\vert \nu \in M\cap P\}=sup \{\frac{pla(\nu )}{pla(M\cap P)}\, \vert \, \nu \in M\cap P\}=\frac{sup\{pla(\nu )\,\vert \, \nu \in M \cap P\}}{pla(M\cap P)}=1\) (again using the fact that \(sup\{pla(\nu )\,\vert \, \nu \in M\cap P\}=pla(M\cap P)\)).

Similarly, the definition of \(M_e\) ensures that the function \(pla_e(\mu )= pla(\mu \, \vert \, e)= \frac{pla(\mu )\cdot \mu (e)}{sup\{pla(\nu )\cdot \nu (e)\, \vert \, \nu \in M_e\}}\) takes only positive values on \(M_e\), and that its supremum is 1 on \(M_e\). To show that \(pla_e\) is continuous, we put together the definition of conditional plausibility, the fact that \(pla_e=\frac{pla\cdot F_e}{k}\) (where \(F_e\) is the function introduced in Lemma 1 and \(k= sup\{pla(\nu )\cdot \nu (e)\, \vert \, \nu \in M_e\}\) is a non-zero constant), the continuity of \(\textit{pla}\) (by definition) and of \(F_e\) (by Lemma 1), and use the closure of continuous functions under products and division by non-zero constants. \(\square \)

This fact allows us to iterate and even interleave the two forms of updating. For simplicity, we only do it for events and propositions that fit the true distribution (since this automatically ensures their mutual compatibility):

Definition 5

(Iterated updating) Given a plausibility model \({{\mathbf {M}}}=(M, pla)\) and a ‘true’ distribution \(\mu \in M\), we can define the iterated update \({{\mathbf {M}}}_\sigma \), for every finite sequence \(\sigma =(\sigma _1, \ldots , \sigma _n)\in (Prop\cup {\mathcal {E}})^*\) consisting of true propositions (\(\sigma _i\in Prop\) with \(\mu \in \sigma _i\)) or truly observable events (\(\sigma _i\in {\mathcal {E}}\) with \(\mu (\sigma _i)\not =0\)). The definition is by recursion on the length of \(\sigma \), by putting:

$$\begin{aligned}&{{\mathbf {M}}}_{\lambda }\, \, :=\,\, {{\mathbf {M}}}, \,\, \,\, \text{ for } \text{ the } \text{ empty } \text{ sequence }\ \lambda =(), \\&{{\mathbf {M}}}_{\sigma , e}\, \,:=\,\, ({{\mathbf {M}}}_{\sigma })_e, \, \, \,\, \text{ for } \text{ observable }\ e\in {\mathcal {E}}\ \text{(with }\ \mu (e)\not =0),\\&{{\mathbf {M}}}_{\sigma , P}\,\, :=\,\,({{\mathbf {M}}}_{\sigma })_P, \, \, \,\, \text{ for } \text{ truthful }\ P\in Prop\ \text{(with }\ \mu \in P). \end{aligned}$$
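In the finite sketch after Definition 4 above, iterated updating is just a left fold over \(\sigma \). A minimal continuation, reusing update_prop and update_event from that sketch (propositions represented as predicates, events as observation sequences; runnable when appended to the earlier code):

def update_iter(pla, outcomes, sigma):
    """Iterated update M_sigma of Definition 5, folding the basic updates
    from left to right; by Proposition 8 below, the order is immaterial."""
    for step in sigma:
        if callable(step):            # a true proposition, given as a predicate on mu
            pla = update_prop(pla, step)
        else:                         # a truly observable event: a sequence of outcomes
            pla = update_event(pla, outcomes, step)
    return pla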

The next three results ensure that updating satisfies some standard rationality constraints: Proposition 8 guarantees that the result of repeated conditionalisation is independent of the order of application; Proposition 9 says that the result of conditioning is independent of whether it is done successively (conditioning on each independent observation, one after the other) or in one global step (conditioning on the whole sequence of independent observations, as one big single event); while Proposition 10 shows that, when conditioning with a sequence of observations, the result is independent of the temporal order of the observations. These last three facts are important as they ensure that the agent’s posterior beliefs depend only on the evidence that is observed (and the prior plausibility model), not on the temporal or logical order in which this evidence is observed or processed.

Proposition 8

The order of applying (iterated) conditionalization is irrelevant: if \(\sigma ,\sigma ' \in (Prop\cup {\mathcal {E}})^*\) are sequences of equal length m of propositions and/or events, s.t. \(\sigma '\) is obtained by permuting the components of \(\sigma \) (i.e. there exists some bijection \(g:\{1, \ldots , m\}\rightarrow \{1, \ldots , m\}\) s.t. \(\sigma '_i=\sigma _{g(i)}\) for all i), then we have

$$\begin{aligned} {{\mathbf {M}}}_{\sigma }={{\mathbf {M}}}_{\sigma '} \end{aligned}$$

Proof

It is enough to show that we can commute the order of any two basic updates, since then the desired conclusion follows by induction. So we only need to check that we have \({{\mathbf {M}}}_{P,e}={{\mathbf {M}}}_{e,P}\), \({{\mathbf {M}}}_{P,Q}={{\mathbf {M}}}_{Q,P}\) and \({{\mathbf {M}}}_{e,e'}={{\mathbf {M}}}_{e', e}\). This is an easy but tedious verification, so we only sketch here the last case: for the underlying set we have \(M_{e,e'}=(M_e)_{e'}=\{\nu \in M_e \, \vert \, \nu (e')\ne 0\}=\{\nu \in M\,\vert \, \nu (e)\ne 0, \nu (e')\ne 0\}\), which immediately gives us \(M_{e,e'}=M_{e', e}\); for the plausibility map, we have \(pla_{e,e'}(\mu )=\frac{pla_e(\mu )\cdot \mu (e')}{sup\{pla_e(\nu )\cdot \nu (e')\,\vert \, \nu \in M_{e,e'}\}}= \frac{pla(\mu )\cdot \mu (e)\cdot \mu (e')}{sup \{pla(\nu )\cdot \nu (e)\cdot \nu (e')\,\vert \, \nu \in M_{e,e'}\}}\), which again immediately gives us \(pla_{e,e'}=pla_{e', e}\). \(\square \)

The next proposition shows that conditioning successively on a number of independent observations is the same as conditioning on the single event consisting of the whole sequence of observations:

Proposition 9

If \({{\mathbf {M}}}=(M,pla)\) is a plausibility model and the events \(e,e' \in {\mathcal {E}}\) are independent wrt every distribution \(\mu \in M\), then we have:

$$\begin{aligned} ({{\mathbf {M}}}_e)_{e'}={{\mathbf {M}}}_{e\cap e'} \end{aligned}$$

As a consequence, for any event of the form \([\omega _1, \ldots , \omega _m]=\bigcap _{i=1}^m \omega _{i}^{i}\), we have:

$$\begin{aligned} {{\mathbf {M}}}_{\omega _{1}^1,\omega _ {2}^2, \ldots , \omega _{m}^m}= {{\mathbf {M}}}_{[\omega _1, \ldots , \omega _m]} \end{aligned}$$

(where recall that, for any outcome \(o=\omega _i\in O\), the event \(\omega _i^j=o^j:=\{{\tilde{\omega }}\in \varOmega \,\vert \,{\tilde{\omega }}_j=o=\omega _i\}\) is the one of observing outcome \(o=\omega _i\) at the j-th sampling from the unknown distribution, while \([\omega _1,\ldots , \omega _m]=\bigcap _{i=1}^m \omega _i^i\) is the event associated to the observational sequence \(\omega _1,\ldots , \omega _m\)).

Proof

By independence we have \({\mu }(e\cap e')= {\mu }(e)\cdot {\mu }(e')\), so we get \((M_e)_{e'}=\{ \mu \in M\,\vert \, {\mu }(e)\not =0, {\mu }(e')\not =0 \}= \{\mu \in M\, \vert \, {\mu }(e)\cdot {\mu }(e')\not =0\} = \{\mu \in M\, \vert \, {\mu }(e\cap e')\not =0\}= M_{e\cap e'}\). Similarly, as seen in the proof of Proposition 8, we have \(pla_{e,e'}(\mu )= \frac{pla(\mu )\cdot \mu (e)\cdot \mu (e')}{sup \{pla(\nu )\cdot \nu (e)\cdot \nu (e')\,\vert \, \nu \in M_{e,e'}\}}\). Using the independence assumption, we obtain \(pla_{e,e'}(\mu )=\frac{pla_{e\cap e'}(\mu )}{sup\{pla_{e\cap e'}(\nu )\,\vert \, \nu \in M_{e,e'}\}}= pla_{e\cap e'}(\mu )\).

The second claim of our Proposition follows from the first by an easy induction (given that, by the definition of \({\mu }\), each event \(\omega _{j}^j\) is independent of the event \(\bigcap _{k=1}^{j-1} \omega _{k}^k\)). \(\square \)

While Proposition 8 states that the logical order of applying conditionalization (with both events and propositions) is irrelevant, the next result shows that the temporal order in which the outcomes are observed is also irrelevant:

Proposition 10

For \(m\ge 1\), let g be a permutation of the first m positive integers (i.e. a bijection \(g:\{1, 2, \ldots , m\}\rightarrow \{1, 2, \ldots , m\}\)). For any two events of the form \([\omega _1, \ldots , \omega _m]=\bigcap _{i=1}^m \omega _{i}^{i}\) and \([\omega _{g(1)},\ldots , \omega _{g(m)}]= \bigcap _{i=1}^m \omega _{g(i)}^{i}\), we have

$$\begin{aligned} {{\mathbf {M}}}_{[\omega _1, \ldots , \omega _m]}={{\mathbf {M}}}_{[\omega _{g(1)}, \ldots , \omega _{g(m)}]}. \end{aligned}$$

Proof

Using the notations \(F_e\) from Lemma 1, and applying the multiplicative rule for independent events (as well as the associativity and commutativity of multiplication), we obtain: \(F_{[\omega _1, \ldots , \omega _m]}(\mu )= {\mu }(\bigcap _{i=1}^m \omega _{i}^{i})=\prod _{i=1}^m {\mu }(\omega _{i}^i)= \prod _{i=1}^m \mu (\omega _{i})= \prod _{i=1}^m \mu (\omega _{g(i)})=\prod _{i=1}^m {\mu }(\omega _{g(i)}^{i})= {\mu }(\bigcap _{i=1}^m \omega _{g(i)}^{i})= F_{[\omega _{g(1)}, \ldots , \omega _{g(m)}]}(\mu )\), for all \(\mu \in M_O\). Using this, we get that \(M_{[\omega _1, \ldots , \omega _m]}= \{\mu \in M\, \vert \, F_{[\omega _1, \ldots , \omega _m]}(\mu )\not =0\}= \{\mu \in M \, \vert \, F_{[\omega _{g(1)}, \ldots , \omega _{g(m)}]}(\mu )\not =0\}= M_{[\omega _{g(1)}, \ldots , \omega _{g(m)}]}\). Similarly (using also the definition of conditional plausibility), we conclude that \(pla_{[\omega _1, \ldots , \omega _m]}(\mu )=\frac{pla(\mu )\cdot F_{[\omega _1, \ldots , \omega _m]}(\mu )}{sup\{ pla(\nu )\cdot F_{[\omega _1, \ldots , \omega _m]}(\nu )\,\vert \, \nu \in M\}}= \frac{pla(\mu )\cdot F_{[\omega _{g(1)}, \ldots , \omega _{g(m)}]}(\mu )}{sup\{ pla(\nu )\cdot F_{[\omega _{g(1)}, \ldots , \omega _{g(m)}]}(\nu )\,\vert \, \nu \in M\}}{=}pla_{[\omega _{g(1)}, \ldots , \omega _{g(m)}]}(\mu )\). \(\square \)

Example 1

(continued) Take the plausibility model \((M, Ent)\) as before, where \(M:=M_O{\setminus } \{\mu _0, \mu _1\}\) is the set of all non-extremal biases of the coin and \(Ent^M:{\overline{M}}=M_O\rightarrow [0,1]\) is the entropic plausibility. Since in this case \(O=\{Heads, Tails\}\) has \(n=2\) outcomes, our entropy calculations use base-2 logarithms. We have \(Ent(M)=Ent(M_O)=Ent(\mu ^{eq})=1\), where \(\mu ^{eq}\) is the fair distribution, so \(Ent^M=Ent\). Let \(e:=[H, H, H]=H^1\cap H^2\cap H^3 \in {\mathcal {E}}\) be the event that “the first three tosses of the coin have landed on Heads”. After observing e, no distribution is eliminated (since the only distribution incompatible with the evidence is \(\mu _0\), with \(\mu _0(H)=0\), which has already been excluded from the start), so \(M_e=M=M_O{\setminus } \{\mu _0, \mu _1\}\). The new plausibility function is given by \(pla_{e}(\mu )= \frac{pla(\mu ,e)}{pla(M,e)}\), where \(pla(\mu ,e)= Ent(\mu ) \cdot {\mu }(e)\). Thus the most plausible probability function will no longer be \(\mu ^{eq}\): distributions with a bias towards Heads become more plausible. Let \(\nu _1, \nu _2, \nu _3\) be such that \(\nu _1(Heads)= 0.75\), \(\nu _2(Heads)=0.8\) and \(\nu _3(Heads)= 0.9\); then it is easy to check that \(pla_{e}(\nu _1) < pla_{e}(\nu _2)\) and \(pla_{e}(\nu _2)> pla_{e}(\nu _3)\). So the maximizer has \(\mu (Heads)\in (0.8,0.9)\). This is natural: the initial belief in fairness is no longer realistic; the agent now believes there is a bias towards Heads.

If, however, we cannot initially exclude the extremal distributions, then Ent is not a good plausibility map, and we have to once again take its positive version to form the initial plausibility model \((M_O, Ent^+)\). The same event \(e=[H, H, H]\) will now change the model differently: it will eliminate \(\mu _0\), yielding \(M_e=M_O{\setminus } \{\mu _0\}\) as the new set of possibilities, while the new plausibility map is given (modulo the normalizing denominator) by \(pla_e(\mu ):=(1+Ent(\mu ))\cdot {\mu }(e)\). This changes the initial belief in fairness, and distributions with a higher bias towards Heads become more plausible (though the maximizer will be slightly different than in the previous situation). Also, note that the new plausibility map still inherits from the entropic plausibility its aversion towards extremal distributions: e.g. the distribution \(\mu _1\) with \(\mu _1(Heads)=1\), though it can no longer be excluded (since \(\mu _1\in M_e\) now) and though it in fact matches exactly the observed frequency of Heads, will still not be believed (indeed, it will never become the most plausible after any finite sequence of observations, no matter how many times the coin lands Heads up).

Example 2

(continued) Take the plausibility model \((M_O, C)\) as before, where \(M_O\) is the set of all possible distributions over the set \(O=\{R,G,B\}\), and \(\textit{C}\) is the cautious plausibility. Recall that \(C(\mu )=1\) for all \(\mu \in M_O\), hence all distributions are maximizers, so initially there are no special beliefs about the distribution. The agent starts sampling marbles, noting their colour, and replacing them in the urn. Let \(e:=[R,R,R]=R^1\cap R^2\cap R^3\) be the event that “the first three sampled marbles are all Red”. After observing e, all distributions \(\mu \) with \(\mu (R)=0\) are eliminated, so that the new set of possibilities is \(M_e=\{\mu \in M_O: \mu (R)\not =0\}\), and the new plausibility map is given by \(pla_{e}(\mu )= {\mu }(e)=\mu (R)^3\). The maximizer of this function is \(\mu _R\), given by \(\mu _R(R)=1\) and \(\mu _R(G)=\mu _R(B)=0\). So the agent now believes that there are only Red marbles in the urn: this is natural, since based on her current evidence there is no reason to assume there are any Green or Blue marbles inside. If, however, the next sampled marble comes up Green, then we have the event \(f:=e\cap G^4= [R,R,R,G]=R^1\cap R^2\cap R^3\cap G^4\). After observing this, all distributions with \(\mu (G)=0\) are also eliminated, so the new set of possibilities is \(M_f= \{\mu \in M_O: \mu (R)\cdot \mu (G)\not =0\}\). Note that the previously believed distribution \(\mu _R\) has now been eliminated: not only is it no longer believed, it is now known to be impossible! Furthermore, the new plausibility map is given (up to normalization) by \(pla_{f}(\mu )= {\mu }(f)=\mu (R)^3\cdot \mu (G)\). The unique maximizer of this function is the distribution \(\mu _{3R1G}\), given by \(\mu _{3R1G}(R)=\frac{3}{4}\), \(\mu _{3R1G}(G)=\frac{1}{4}\) and \(\mu _{3R1G}(B)=0\). So the agent now believes that there are three times as many Red marbles as Green marbles (and no Blue marbles) in the urn. Again, this is natural, since three Red marbles were observed for every Green one (and no Blue ones at all). One can in fact show that, when the prior is given by the cautious plausibility, the most plausible distribution after any sequence of observations will always be the one matching the observed frequencies.
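As a quick numerical check of the urn computation above (a sketch under our own discretization of \(M_O\); the grid step is an arbitrary choice):

step = 60
grid = [(r / step, g / step, (step - r - g) / step)
        for r in range(step + 1) for g in range(step + 1 - r)]

def pla_f(mu):
    """Plausibility after observing [R, R, R, G] from the cautious prior C = 1:
    proportional to mu(f); the normalizing constant does not affect the order."""
    r, g, b = mu
    return r ** 3 * g

print(max(grid, key=pla_f))   # (0.75, 0.25, 0.0): exactly the observed frequencies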

The above notion of conditional plausibility immediately gives us a theory of belief revision, which can be formalized in terms of a notion of conditional belief. Note that this is conditionalisation on an observable event, corresponding to learning from observations (i.e. from sampling from the unknown distribution). On the other hand, the standard AGM setting in Belief Revision Theory and Logic (Alchourrón et al. 1985; Board 2004; Baltag and Smets 2008b; van Benthem 2011) involves revising with a proposition (i.e. a set of distributions), rather than an event. This corresponds to learning high-level information about the unknown distribution, which allows the agent to further shrink the range of possibilities to some subset of the prior set of possible distributions. We thus obtain two forms of conditional beliefs: a Bayesian-type conditioning on events, encoding ‘statistical’ learning; and an AGM-type conditioning on propositions, encoding ‘logical’ belief revision.

Definition 6

[Two forms of conditional belief] Let \({{\mathbf {M}}}=(M,pla)\) be a plausibility model, and \(P \subseteq M\) be a proposition. For an event \(e \in {\mathcal {E}}\), we say that P is believed conditional on e in \({{\mathbf {M}}}\), and write \({{\mathbf {M}}}\models B (P \vert e)\), iff all e-plausible enough distributions in \(\textit{M}\) are in \(\textit{P}\); i.e. for some \(\mu \in M\), \(\{\nu \in M\, \vert \, pla_e(\nu ) \ge pla_e(\mu )\}\subseteq P\). For a proposition \(Q\subseteq M\), we say that \(\textit{P}\) is believed conditional on Q in \({{\mathbf {M}}}\), and write \({{\mathbf {M}}}\models B (P \vert Q)\), if and only if all plausible enough distributions in Q are in P; i.e. for some \(\mu \in Q\), \(\{\nu \in Q \, \vert \, pla(\nu ) \ge pla(\mu )\}\subseteq P\).

It should be clear that \(\textit{B(P)}\) is equivalent to \(B(P \vert \varOmega )\) and to \(B(P \vert M)\), where the set \(\varOmega \) of all observation streams represents the tautological event (corresponding to “no observation”) and the set \(\textit{M}\) of all worlds represents the tautological proposition (corresponding to “no further higher-order information”).

It should be equally clear that conditional beliefs track the updated beliefs: for every \(P\subseteq M_O\), B(P|Q) holds in \({{\mathbf {M}}}\) iff B(P) holds in \({{\mathbf {M}}}_Q\); and similarly, B(P|e) holds in \({{\mathbf {M}}}\) iff B(P) holds in \({{\mathbf {M}}}_e\). This allows us to generalize conditional beliefs to iterated conditioning:

Definition 7

[General conditional belief] Let \({{\mathbf {M}}}=(M,pla)\) be a plausibility model, and \(P \subseteq M\) be a proposition. For any finite sequence \(\sigma =(\sigma _1, \ldots , \sigma _n)\in (Prop\cup {\mathcal {E}})^*\) of propositions and/or events, we say that \(\textit{P}\) is believed conditional on \(\sigma \) in \({{\mathbf {M}}}\), and write \({{\mathbf {M}}}\models B (P \vert \sigma )\), iff \(\textit{P}\) is believed in \({{\mathbf {M}}}_\sigma \).

Conditional belief is consistent whenever the evidence is (i.e. if \(e\not =\emptyset \), then \(\textit{B(P|e)}\) implies \(P\not =\emptyset \), and similarly for \(\textit{B(P|Q)}\)). As we’ll see, beliefs conditional on events allow us to inductively learn from repeated sampling, and to ultimately converge to the true distribution. As such, they behave in a way that is somewhat similar to the usual Bayesian conditioning, used in statistical learning. In contrast, beliefs conditional on propositions will behave as a ‘logical’ form of belief update, satisfying all the standard axioms of Conditional Doxastic Logic (Board 2004; Baltag and Smets 2008b) (which are in fact just an equivalent formulation of the so-called AGM postulates (Alchourrón et al. 1985) from Belief Revision Theory).

As for simple belief, the definition of belief conditional on events can be simplified in closed models. In this case, conditional belief \(\textit{B(P|e)}\) amounts to truth in all the most e-plausible distributions:

Proposition 11

If \({{\mathbf {M}}}=(M,pla)\) is a closed model and \(e\in {\mathcal {E}}\) is compatible with \({{\mathbf {M}}}\), then there exists some \(\mu \in M_e\) with highest e-revised plausibility in M (i.e. s.t. \(pla_e(\mu ) \ge pla_e(\nu ')\) for all \(\nu ' \in M_e\)). In other words, we have

$$\begin{aligned} Max_e(M_e)\not =\emptyset , \end{aligned}$$

where for any proposition \(P\subseteq M_O\), we put \(Max_e (P):= \{\nu \in P\, \vert \, \nu \ge _e \nu ' \text{ for } \text{ all } \nu '\in P\}= \{\nu \in P \, \vert \, pla_e (\nu )\ge pla_e (\nu ') \text{ for } \text{ all } \nu '\in P\}\).

Moreover, for any proposition \(P\subseteq M_O\), we have that \(\textit{P}\) is believed conditional on e iff all most e-plausible distributions in \(\textit{M}\) are in \(\textit{P}\):

$$\begin{aligned} B(P|e) \mathrm{\ holds\ in}\ {{\mathbf {M}}} \,\,\, \mathrm{iff } \,\,\, Max_e(M_e)\subseteq P. \end{aligned}$$

Proof

By Proposition 7, \(pla_e\) is a plausibility function, hence it is continuous. Recall that \(\mathbf {M}\) is closed and hence (by Propositions 1, 2(1) and 3) \(pla_e\) has a maximum value on \(\textit{M}\). Let \(\mu \in M\) be a distribution in which this maximum value is attained, i.e. we have \(pla_e(\mu )\ge pla_e(\mu ')\) for all \(\mu '\in M\) (and thus also for all \(\mu '\in M_e\subseteq M\)). Since e is compatible with \(\mathbf{M}\), there exists some \(\nu \in M\) s.t. \({\nu }(e)> 0\), and hence \(pla_e(\mu )\ge pla_e(\nu )=pla(\nu )\cdot {\nu }(e)>0\). So we have \(0<pla_e(\mu )=pla(\mu )\cdot {\mu }(e)\), which implies that \({\mu }(e)\not =0\), i.e. \(\mu \in M_e\). This, together with the fact that \(pla_e(\mu )\ge pla_e(\mu ')\) for all \(\mu '\in M_e\), gives us that \(\mu \in Max_e(M_e)\not =\emptyset \).

The rest of the proof goes exactly as in the proof of Proposition 4, by replacing unconditional belief B(P), plausibility \(\textit{pla}\) and \(\textit{Max(M)}\) by their conditional versions \(\textit{B(P|e)}\), \(pla_e\) and \(Max_e(M_e)\). \(\square \)
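On a finite model (which is automatically closed), Proposition 11 yields a direct recipe for checking conditional beliefs. A small sketch, with toy data of our own:

pla = {(0.5, 0.5): 1.0, (0.7, 0.3): 0.95, (0.9, 0.1): 0.8}   # three coin-bias hypotheses
mu_e = {mu: mu[0] ** 2 for mu in pla}                        # mu(e) for the event e = [H, H]

revised = {mu: pla[mu] * mu_e[mu] for mu in pla if mu_e[mu] > 0}
top = max(revised.values())
max_e = {mu for mu, v in revised.items() if v == top}        # Max_e(M_e)

P = {mu for mu in pla if mu[0] > 0.5}    # the proposition "the coin is Heads-biased"
print(max_e <= P)                        # True, i.e. B(P | [H, H]) holds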

5 Safe belief, statistical knowledge, and verisimilitude

Until now, we only used the notion of knowledge K that is most common among logicians, economists and computer scientists: absolutely certain, infallible, irrevocable, and fully introspective knowledge. This matches what philosophers call “(hard) evidence” or “(hard) information”. But the notion of knowledge favoured by epistemologists is softer: fallible, less-than-absolutely-certain, revisable, and possibly non-introspective (or at least not always negatively introspective). It is the kind of knowledge that we typically encounter in daily life or in empirical sciences, where absolute certainty may be hard to achieve. This is sometimes known as defeasible knowledge, and it is also related to the notion of inductive knowledge in Philosophy of Science. Here, we are interested in developing such a soft notion of knowledge that can apply to statistical learning: after repeatedly updating our beliefs by sampling from an unknown distribution, when do our beliefs become focused enough and stable enough to qualify as soft ‘knowledge’ of the true distribution (at least to some good enough approximation)?

Various formalizations have been proposed for this notion. Here, we will borrow ideas from the so-called Defeasibility Theory of Knowledge (Lehrer 1990): the main principle is that ‘knowledge’ is a form of robust belief, namely belief that is resilient under conditioning with truthful information. These ideas go back to Plato’s Meno and were more recently championed in various forms by Klein, Lehrer, Pappas and Swain, Rott and others. Before going on to formalize and then criticize the defeasibility theory, Stalnaker (1996) summarizes it as follows: “An agent knows that \(\phi \) if and only if \(\phi \) is true, she believes that \(\phi \), and she continues to believe \(\phi \) if any true information is received”. Rott (2004) develops a version called stability theory, and states it as: “A belief \(\textit{K}\) is a piece of knowledge of the subject S iff \(\textit{K}\) is not given up by S on the basis of any true information that S might receive”. Baltag and Smets (2008b) restated Stalnaker’s formalization, under the name of safe belief, and developed it in the framework of dynamic epistemic logic. Here, we adapt this concept to our setting, and later strengthen it to a notion of statistical knowledge.

Definition 8

[Safe Belief] Let \({{\mathbf {M}}}=(M,pla)\) be a plausibility model, in which we also specify the ‘true’ distribution \(\mu \). We say that a proposition \(P\subseteq M\) is safely believed (or is a “safe belief”) at \(\mu \) in \({{\mathbf {M}}}\), and write \(\mu \models _{{\mathbf {M}}}Sb(P)\), if \(\textit{P}\) is believed in \({{\mathbf {M}}}\) conditional on every true proposition \(\textit{Q}\); i.e. \(\textit{B(P|Q)}\) holds for all \(Q\in Prop\) with \(\mu \in Q\).

This is simply the same notion as the one defined by Baltag and Smets (2008b) in general plausibility models, but stated here in the special case of our probabilistic plausibility models. As such, it satisfies the following general characterization, given in Baltag and Smets (2008b):

Proposition 12

The following are equivalent:

  • \(\textit{P}\) is safely believed at \(\mu \) in \({{\mathbf {M}}}\);

  • all distributions in \(\textit{M}\) that are at least as plausible as \(\mu \) satisfy \(\textit{P}\); i.e., we have that \(\{ \nu \in M \, \vert \, pla(\nu )\ge pla(\mu )\} \subseteq P\).

It is easy to see that, if \(\textit{P}\) is a safe belief, then \(\textit{P}\) is a true belief. As such, the notion of safe belief gives a good formal approximation of the defeasibility conception of knowledge.
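By Proposition 12, safe belief is directly computable on finite models; a minimal sketch (the model data below is invented purely for illustration):

def safely_believed(pla, mu, P):
    """Sb(P) at mu: every distribution at least as plausible as mu lies in P."""
    return all(nu in P for nu in pla if pla[nu] >= pla[mu])

pla = {(0.5, 0.5): 1.0, (0.6, 0.4): 0.9, (0.9, 0.1): 0.4}
P = {(0.5, 0.5), (0.6, 0.4)}                 # "the coin is roughly fair"
print(safely_believed(pla, (0.6, 0.4), P))   # True: only the two fair-ish biases rank above
print(safely_believed(pla, (0.9, 0.1), P))   # False: (0.9, 0.1) itself falls outside P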

Distance from the truth and verisimilitude We can think of a plausibility model \({{\mathbf {M}}}=(M,pla)\) as an epistemic/doxastic approximation of some unknown probability distribution \(\mu \in M\). The natural question that arises is: how ‘truthlike’ is our model \({{\mathbf {M}}}\), i.e. how good an approximation is it? To assess this, we connect with notions from Verisimilitude Theory, cf. Popper (1976), Tichy (1974), Miller (1974), Niiniluoto (1987, 1998), Kuipers (1987) and others. In particular, we adapt to our setting ideas coming from the metric approach to truthlikeness of Niiniluoto (1987). We are looking for a notion of distance of a model \({{\mathbf {M}}}\) from a distribution \(\mu \in M\), which measures how far the agent’s beliefs are from the truth. In the case of closed models, the beliefs are given by the set Max(M), so the natural notion of distance would be given in this case by the quantity

$$\begin{aligned} \delta _\mu ({{\mathbf {M}}}):=sup\{d(\mu ,\nu ) \, \vert \, \nu \in Max(M)\}, \end{aligned}$$

which measures the “worst possible error” one could make when taking the true distribution to be any of the ones compatible with the agent’s beliefs. However, when M is not closed, we might have \(Max(M)=\emptyset \), which would render the above notion of distance-from-the-truth meaningless, or at least useless (in case we adopt the natural convention that \(sup\, \emptyset =\infty \)). But one can weaken the above definition to include in the relevant set of possibilities (whose distances from the truth are assessed) all the “plausible enough” distributions, and in particular all the ones that are at least as plausible as the true distribution. In this way, we arrive at the following definition of distance-from-the-truth:

$$\begin{aligned} d_\mu ({{\mathbf {M}}}):=sup \{d(\mu , \nu ) \, \vert \, \nu \in M, pla(\nu )\ge pla(\mu )\}= sup \{d(\mu , \nu ) \, \vert \, \nu \in M, \nu \ge ^{{\mathbf {M}}}\mu \}. \end{aligned}$$

This measures the worst possible error one could make when taking as the true distribution any of the ones that are currently considered at least as plausible as the “truly true” distribution \(\mu \). It is easy to see that the distance-from-the-truth matches the radius of the smallest open ball around the true distribution that is safely believed:

$$\begin{aligned} d_\mu ({{\mathbf {M}}})=inf\{\varepsilon >0 \, \vert \, \mu \models _{{\mathbf {M}}}Sb({\mathcal {B}}_\varepsilon (\mu ))\} =min \{\varepsilon \ge 0 \, \vert \, \mu \models _{{\mathbf {M}}}Sb({\mathcal {B}}_\varepsilon (\mu ))\}. \end{aligned}$$

So \(d_\mu ({{\mathbf {M}}})\le \varepsilon \) tells us that the agent has a safe belief about the approximate value of the true distribution, within an \(\varepsilon \)-margin of error. It is also easy to see that we have

$$\begin{aligned} 0\le \delta _\mu ({{\mathbf {M}}})\le d_\mu ({{\mathbf {M}}}) \,\,\,\,\, \hbox { whenever}\ Max(M)\not =\emptyset , \end{aligned}$$

and also that we have

$$\begin{aligned} d_\mu ({{\mathbf {M}}}){=}0 \, \text{ iff } \, \delta _\mu ({{\mathbf {M}}}){=}0 \, \text{ iff } \, Max(M)=\{\mu \} \, \text{ iff } \, {{\mathbf {M}}}\models B(\{\mu \}) \, \text{ iff } \, \mu \models _{{\mathbf {M}}}Sb(\{\mu \}). \end{aligned}$$

So 0-distance (according to either definition) indicates that the agent’s (safe) beliefs fully match the true distribution.

When we have \(d_{\mu }({{\mathbf {M}}}) < d_\mu ({{\mathbf {M}}}')\) for the true distribution \(\mu \), we say that \({{\mathbf {M}}}\) is more truthlike than \({{\mathbf {M}}}'\). This verisimilitude order suffices for our purposes. But we could also convert it into an actual measure of truthlikeness, by defining the verisimilitude \(v_\mu ({{\mathbf {M}}})\) of a model \({{\mathbf {M}}}\) wrt a distribution \(\mu \), say by putting \(v_\mu ({{\mathbf {M}}}):=2^{-d_\mu ({{\mathbf {M}}})}\). The maximum verisimilitude \(v_\mu ({{\mathbf {M}}})=1\) is achieved when \(d_\mu ({{\mathbf {M}}})=0\), i.e. when \(\mu \models _{{\mathbf {M}}}Sb(\{\mu \})\).
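On finite models, these distances are easy to compute; a sketch (we take d to be the sup metric purely for illustration, since the text only presupposes some metric d on distributions; \(\delta _\mu \) is computed analogously, with Max(M) in place of the upper set):

def d(mu, nu):
    """Sup metric between two distributions, used here only as an example."""
    return max(abs(p - q) for p, q in zip(mu, nu))

def d_from_truth(pla, mu):
    """d_mu(M): worst error among the distributions at least as plausible as mu."""
    return max(d(mu, nu) for nu in pla if pla[nu] >= pla[mu])

pla = {(0.5, 0.5): 1.0, (0.6, 0.4): 0.9, (0.9, 0.1): 0.4}
print(d_from_truth(pla, (0.6, 0.4)))         # 0.1: the fair coin is still at least as plausible
print(2 ** -d_from_truth(pla, (0.6, 0.4)))   # the corresponding verisimilitude v_mu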

Safe belief is not safe from conditioning on events While of inherent interest, the notion of safe belief does not fully capture the intended meaning of defeasible knowledge in a probabilistic framework. Although safe belief is resilient under conditioning with any true ‘proposition’, in our setting propositions are not the only kind of new information: safe beliefs are not necessarily stable under conditioning on events. Indeed, even if we restrict to truly observable events (those e with true probability \({\mu }(e)\ne 0\)), one can show that no non-trivial belief is stable under every such event!

This means we have to moderate our safety requirements when dealing with events. Note that, for inductive learning, absolute safety (under all observable sampling events) is irrelevant: what is important is that our correct beliefs are resilient throughout the (actual) future sampling history. This resembles the notion of identification in the limit in Formal Learning Theory (Gold 1967), as well as the concepts of inductive knowledge developed in e.g. Kelly (2014) and Baltag et al. (2019b). In our setting, this gives rise to the concept of statistical knowledge:

Definition 9

[Statistical Knowledge] Let \({{\mathbf {M}}}=(M,pla)\) be a plausibility model, let \(\mu \) be some distribution (representing the ‘true’ probability), and let \(\omega \in \varOmega \) be an infinite observation stream (representing the ‘true’ future sampling history from the unknown distribution \(\mu \)). We say that a proposition \(P\subseteq M\) is statistically known (or is “statistical knowledge”) at \(\mu \) wrt \(\omega \) in \({{\mathbf {M}}}\), and write \(\mu , \omega \models _{{\mathbf {M}}}Sk(P)\), if P is believed in \({{\mathbf {M}}}\) conditional on every ‘true’ proposition Q and every (event corresponding to an) initial segment of the ‘true’ sampling history \(\omega \); i.e. if we have \(B(P|Q, [\omega ^{\le n}])\), for all \(Q\in Prop\) with \(\mu \in Q\), and all \(n\ge 0\).

It is obvious that, if P is statistically known, then it is safely believed. But statistical knowledge is much more resilient: it essentially captures a strong form of inductive knowledge. Using Proposition 12, we immediately obtain the following characterization:

Proposition 13

The following are equivalent:

  • P is statistically known at \(\mu \) wrt \(\omega \) in \({{\mathbf {M}}}\);

  • after every initial segment \([\omega ^{\le n}]\) of the true sampling history \(\omega \), every distribution in M that is at least as plausible as \(\mu \) satisfies P; i.e. we have:

    $$\begin{aligned} \forall \nu \in M \, \forall n\ge 0\, \left( \, pla_{[\omega ^{\le n}]}(\nu )\ge pla_{[\omega ^{\le n}]}(\mu )\Rightarrow \nu \in P\, \right) . \end{aligned}$$

In the next section, we show that this notion is actually realistically achievable, and in fact unavoidable: repeated sampling will almost surely eventually lead to statistical knowledge of the true distribution with any desired accuracy.

6 Tracking the truth

Definition 10

For \(\mu \in M\), we define the set \(\varOmega _\mu \) of \(\mu \)-normal observations as the set of infinite sequences from O for which (1) the limiting frequencies of each \(o_i\) correspond to \(\mu (o_i)\) and (2) no outcome with probability 0 is ever observed:

$$\begin{aligned} \varOmega _{\mu } :=\left\{ \omega \in \varOmega \, \vert \, \forall o \in O \,\, \lim _{n \rightarrow \infty } \frac{\vert \{ i \le n \, \vert \, \omega _i = o\} \vert }{n} = \mu (o)\right\} {\setminus } \left\{ \omega \in \varOmega \, \vert \, \exists i \in {\mathbb {N}}\,\,\, \mu (\omega _i)= 0\right\} \end{aligned}$$

Proposition 14

For every probability function \(\mu \), \({\mu }(\varOmega _{\mu }) =1\). Hence, if \(\mu \) is the true probability distribution over O, then almost all observable infinite sequences from O will be \(\mu \)-normal.

Proof

Let \(\varDelta =\{\omega \in \varOmega \, \vert \, \exists i \in {\mathbb {N}} \,\,\, \mu (\omega _i)= 0\}\). By the law of large numbers, it is enough to show that \({\mu }(\varDelta )=0\). To see this, let \(o\in O\) with \(\mu (o)=0\); then

$$\begin{aligned} {\mu }(\{ \omega \in \varOmega \,\, \vert \,\, \exists i \in {\mathbb {N}} \,\,\, \omega _i =o \}) = {\mu } \left( \bigcup _{i \in {\mathbb {N}}} \{ \omega \in \varOmega \,\, \vert \,\, \omega _i =o \}\right) \le \sum _{i \in {\mathbb {N}}}{\mu } (o^i) = 0. \end{aligned}$$

The result then follows from finiteness of O. \(\square \)

We are now in a position to look into the learnability of the correct probability distribution via the plausibility revision induced by repeated sampling. We first prove a preliminary convergence result.

Lemma 2

Let \({{\mathbf {M}}}=(M,pla)\) be a plausibility model, and \(\mu \in M\). Then, when repeatedly sampling from an unknown distribution \(\mu \), we have that for every \(\varepsilon >0\), the plausibility of having a distribution \(\varepsilon \)-far from \(\mu \) will become, in the limit, vanishingly small relative to the plausibility \(pla(\mu )\) of the true distribution \(\mu \).

More precisely: for every \(\mu \)-normal sequence \(\omega \in \varOmega _\mu \) and every positive real \(\varepsilon >0\), we have

$$\begin{aligned} \lim _{n\rightarrow \infty } \frac{pla_{[\omega ^{\le n}]}({\overline{M}}{\setminus } {\mathcal B}_\varepsilon (\mu ))}{pla_{[\omega ^{\le n}]}(\mu )}=0 \end{aligned}$$

(where recall that \({{\mathcal {B}}}_\varepsilon (\mu )= \{\nu \in M_O \vert d(\mu , \nu ) < \varepsilon \}\)).

Proof

We first need to make some preliminary notations and observations. If \(O=\{o_1, \ldots , o_n\}\) is the set of outcomes, and \(\mu \) is the fixed distribution in the statement of our Lemma, then we put \(p_i:=\mu (o_i)\), for all \(1\le i\le n\). More generally, for all distributions \(\nu \in {\overline{M}}\), all \(\mu \)-normal sequences \(\omega \in \varOmega _\mu \) and all \(1\le i\le n\), we put: \(\nu _i:=\nu (o_i)\), \(m_{i,\omega }:=|\{k\le m\, \vert \, \omega _k=o_i\}|\) for the number of occurrences of \(o_i\) in the sequence \(\omega ^{\le m}=(\omega _1, \ldots , \omega _m)\), and \(\alpha _{i,m, \omega }:=\frac{m_{i,\omega }}{m}\) for the relative frequency of \(o_i\) in \(\omega ^{\le m}\). Since \(\omega \in \varOmega _{\mu }\), we have (by the definition of \(\varOmega _\mu \)) that \(\lim _{m \rightarrow \infty } \alpha _{i,m,\omega }=p_i\) for all \(1\le i\le n\), and also that \(m_{i,\omega }=\alpha _{i,m,\omega }=0\) holds whenever \(p_i= \mu (o_i)=0\) (because of the normality of the sequence \(\omega \)).

Let us put \(A:=\{1 \le i \le n \, \vert p_i\not =0\}\). Since \(\lim _{m \rightarrow \infty } \alpha _{i,m,\omega }=p_i>0\) for \(i\in A\), there must exist some \(N_{1,\omega }\) such that \(0<\frac{p_i}{2}\le \alpha _{i,m,\omega }\le 2\cdot p_i\) for all \(m\ge N_{1,\omega }\) and all \(i \in A\). Since \(0\le p_i, \nu _i\le 1\), this gives us that

$$\begin{aligned} (*) \,\,\,\,\, \, \nu _i^{m\cdot 2\cdot p_i}\le \nu _i^{m\cdot \alpha _{i,m,\omega }} \le \nu _i^{m\cdot \frac{p_i}{2}} \,\, \hbox { for all}\ \nu \in {\overline{M}},\ \hbox {all}\ i\in A\ \hbox {and all}\ m\ge N_{1,\omega }, \end{aligned}$$

and in particular \(p_i^{m\cdot 2\cdot p_i}\le p_i^{m\cdot \alpha _{i,m,\omega }} \le p_i^{m\cdot \frac{p_i}{2}}\) for all such i and m.

Using independence, we have

$$\begin{aligned}&(**) \,\,\,\,\, \, pla_{[\omega ^{\le m}]}(\nu )=pla(\nu )\cdot {\nu }([\omega ^{\le m}])= pla(\nu ) \cdot \varPi _{i=1}^n \nu _i^{m_{i,\omega }}\\&\qquad = pla(\nu )\cdot \varPi _{i\in A} \nu _i^{m\cdot \alpha _{i,m, \omega }} \end{aligned}$$

(where we used the fact that, for every \(i\not \in A\) we have \(p_i=0\), so by normality of the sequence we also have \(m_{i,\omega }=\alpha _{i,m,\omega }=0\), and thus \(\nu _i^{m_{i,\omega }}=1\), hence these factors can be skipped from the product). In particular, for \(\nu :=\mu \) (so \(\nu _i=p_i\)), we get that

$$\begin{aligned} (***) \,\,\,\,\, \, pla_{[\omega ^{\le m}]}(\mu )= pla(\mu )\cdot \mu ([\omega ^{\le m}])= pla(\mu )\cdot \varPi _{i\in A} p_i^{m\cdot \alpha _{i,m, \omega }} > 0 \end{aligned}$$

(since \(p_i\not =0\) for \(i\in A\), and also \(pla(\mu ) \ne 0\) because \(\mu \in M\)).

Using these abbreviations and facts, we can now prove our lemma. Fix \(\omega \in \varOmega _\mu \) and \(\varepsilon >0\). To prove the desired conclusion, let \(\nu \in {\overline{M}}{\setminus } {{\mathcal {B}}}_\varepsilon (\mu )\), and let N be an arbitrarily chosen natural number. Using the above unfoldings (**) and (***) of the definitions of \(pla_{[\omega ^{\le m}]}({\overline{M}}{\setminus } {{\mathcal {B}}}_\varepsilon (\mu ))\) and \(pla_{[\omega ^{\le m}]}(\mu )\), we see that it is enough to show that, for any such N, we have

$$\begin{aligned} N\cdot pla(\nu )\cdot \varPi _{i\in A} \nu _i^{m\cdot \alpha _{i,m,\omega }} < pla(\mu )\cdot \varPi _{i\in A} p_i^{m\cdot \alpha _{i,m,\omega }} \end{aligned}$$
(1)

for all large enough m.

We prove this by cases. In the first case, assume that \(pla(\nu )=0\); then the left-hand side of (1) is 0 and the inequality holds. In the second case, assume that \(pla(\nu ) > 0\). Let \(\varDelta =\{\nu \in {\overline{M}} \, \vert \, \nu _i=0\ \text{ for } \text{ some } i \in A \}\), and similarly, for any \(\delta >0\), put \(\varDelta _{\delta }=\{\nu \in {\overline{M}} \, \vert \, \nu _i<\delta \ \text{ for } \text{ some } i \in A\}\), so that \(\overline{\varDelta _{\delta }}=\{\nu \in {\overline{M}} \, \vert \, \nu _i\le \delta \ \text{ for } \text{ some } i \in A\}\) is its closure. Choose some \(\delta >0\) small enough such that we have \(\varPi _{i \in A} \nu _i^{\frac{p_i}{2}} < \varPi _{i \in A} p_i^{2\cdot p_i}\) for all \(\nu \in \overline{\varDelta _{\delta }}\) (this is possible, since \(\varPi _{i \in A} \nu _i^{\frac{p_i}{2}}=0< \varPi _{i \in A} p_i^{2\cdot p_i}\) for all \(\nu \in \varDelta \), so the continuity of \(\varPi _{i \in A} \nu _i^{\frac{p_i}{2}}\) gives us the existence of such a \(\delta \)). Hence, we have

$$\begin{aligned} 0\le \frac{\varPi _{i \in A} \nu _i^{\frac{p_i}{2}}}{\varPi _{i \in A} p_i^{2\cdot p_i}}<1 \,\, \text{ for } \text{ all }\ \nu \in \overline{\varDelta _{\delta }} \end{aligned}$$

(where we used again the fact that \(p_i>0\) for \(i\in A\)). The set \(\overline{\varDelta _{\delta }}\) is closed, hence the continuous function \(\frac{\varPi _{i \in A} \nu _i^{\frac{p_i}{2}}}{\varPi _{i \in A} p_i^{2\cdot p_i}}\) has a maximum value Q on \(\overline{\varDelta _{\delta }}\). Note that \(Q<1\) (this follows from the inequality above), so there exists some \(N_2>N_{1,\omega }\) (where \(N_{1,\omega }\) is the number satisfying the inequality (*) in the preliminary facts above) s.t. we have \(Q^m< \frac{pla(\mu )}{N}\) for all \(m>N_2\). Recalling also that by definition \(pla(\nu )\le 1\), we obtain, for all \(\nu \in \varDelta _{\delta }\):

$$\begin{aligned}&N\cdot pla(\nu )\cdot \varPi _{i \in A} \nu _i^{m\cdot \alpha _{i,m,\omega }}\le N\cdot 1\cdot \varPi _{i \in A} \nu _i^{m\cdot \frac{p_i}{2}} \le N\cdot (Q\cdot \varPi _{i \in A} p_i^{2\cdot p_i})^m \\&\quad = N\cdot Q^m \cdot \varPi _{i \in A} p_i^{m\cdot 2\cdot p_i} < N\cdot \frac{pla(\mu )}{N}\cdot \varPi _{i \in A} p_i^{m\cdot \alpha _{i,m,\omega }} = pla(\mu )\cdot \varPi _{i \in A} p_i^{m\cdot \alpha _{i,m,\omega }} \end{aligned}$$

(where we used the above facts as well as the inequality (*)). So we proved that the inequality (1) holds for all \(\nu \in \varDelta _{\delta }\). It thus remains only to prove it for all \(\nu \in M':={\overline{M}} {\setminus } ({{\mathcal {B}}}_{\varepsilon }(\mu )\cup \varDelta _{\delta })\). For this, note that \(M'\) is closed, that \(\nu _i\not =0\) over this set for all \(i \in A\), and that for all \(i \notin A\) we have \(\alpha _{i,m,\omega } =0\). Hence, using the assumption that \(pla(\nu ) \ne 0\), over this set (1) is equivalent to:

$$\begin{aligned} \left( \frac{pla(\mu )}{pla(\nu )}\right) \cdot \left( \frac{\varPi _{i=1}^{n} p_i^{m\cdot \alpha _{i,m,\omega }}}{ \varPi _{i=1}^{n} \nu _i^{m \cdot \alpha _{i,m,\omega }}} \right) >N \end{aligned}$$
(2)

Applying logarithm (and using its monotonicity, and its other properties), this in turn is equivalent to

$$\begin{aligned} \log (pla(\mu ))-\log (pla(\nu )) + \sum _{i=1}^{n} m\cdot \alpha _{i,m,\omega } \cdot (\log p_i - \log \nu _i) > log N\end{aligned}$$
(3)

So we see that it is enough to show that, for all large m and for \(\nu \in M'\), we have

$$\begin{aligned} m> \frac{log N+ log(pla(\nu ))-log(pla(\mu ))}{\sum _{i=1}^{n} \alpha _{i,m,\omega } \cdot (\log p_i - \log \nu _i)} \end{aligned}$$
(4)

Recall that \(\alpha _{i,m,\omega }\ge \frac{p_i}{2}\) for all \(m> N_2> N_1\) and all \(1\le i\le n\). Thus, to prove (4), it is enough to show that, for large m and for all \(\nu \in M'\), we have

$$\begin{aligned} m> \frac{f(\nu )}{g(\nu )}, \end{aligned}$$
(5)

where we introduced the auxiliary continuous functions \(f, g: M'\rightarrow {\mathbb {R}}\), defined by putting \(f(\nu )= 2\cdot (\log N + \log (pla(\nu ))-\log (pla(\mu )))\) and \(g(\nu )= \sum _{i=1}^{n} p_i \cdot (\log p_i - \log \nu _i)\) for all \(\nu \in M'\).

To show (5), note first that

$$\begin{aligned} g(\nu )= \sum _{i=1}^n p_i\cdot (\log p_i - \log \nu _i)= log \left( \frac{\varPi _{i=1}^n p_i^{p_i}}{\varPi _{i=1}^n \nu _i^{p_i}}\right) > log 1=0 \end{aligned}$$

(where at the end we used the fact, proved in Lemma 1, that the measure \(\mu \), with values \(\mu (o_i)=p_i\), is the unique maximizer of the function \(\varPi _{i=1}^n \nu _i^{p_i}\) on \(M_O\)). Since g is continuous and \(M'\) is closed, g is bounded and attains its infimum \(B=min_{M'}(g)\) on \(M'\). But since g is non-zero on \(M'\), this minimum cannot be zero: \(B=min_{M'}(g)\not =0\). Similarly, since f is continuous and \(M'\) is closed, f is bounded and attains its supremum \(C=max_{M'}(f)<\infty \) (which thus has to be finite). Take now some \(N_3 \ge max(N_2, \frac{C}{B})\). For all \(m> N_3\), we have

$$\begin{aligned} m > \frac{C}{B}\ge \frac{f(\nu )}{g(\nu )} \end{aligned}$$

for all \(\nu \in M'\), as desired. \(\square \)

We can now establish our first convergence result.

Theorem 2

[Convergence in plausibility] Let \({{\mathbf {M}}}=(M, pla)\) be a plausibility model. If \(\mu \in M\) is the ‘true’ distribution, then we have the following:

  1. 1.

    when repeatedly sampling from the unknown distribution \(\mu \), we have that for every \(\varepsilon >0\), the plausibility \(pla({\overline{M}}{\setminus } {{\mathcal {B}}}_\varepsilon (\mu ))\) of having a distribution \(\varepsilon \)-farther from \(\mu \) will also almost surely converge to 0 (as sample size converges to infinity):

    $$\begin{aligned} \mu (\{\omega \in \varOmega \, \vert \, \lim _{n\rightarrow \infty } pla_{[\omega ^{\le n}]}({\overline{M}}{\setminus } {{\mathcal {B}}}_\varepsilon (\mu ))=0\})=1; \end{aligned}$$

    in particular, in the same conditions of repeated sampling, every other distribution \(\nu \in {\overline{M}}{\setminus } \{\mu \}\) will almost surely converge to 0 (as sample size converges to infinity):

    $$\begin{aligned} \mu (\{\omega \in \varOmega \, \vert \, \lim _{n\rightarrow \infty } pla_{[\omega ^{\le n}]}(\nu )=0\})=1; \end{aligned}$$
  2. 2.

    in contrast, in the same conditions, we have that for every \(\varepsilon >0\), the plausibility \(pla({{\mathcal {B}}}_\varepsilon (\mu ))\) of having a distribution \(\varepsilon \)-close to \(\mu \) will also almost surely eventually settle on 1 (after finitely many rounds of sampling):

    $$\begin{aligned} \mu (\{\omega \in \varOmega \, \vert \, \exists N \forall n\ge N pla_{[\omega ^{\le n}]}({\mathcal B}_\varepsilon (\mu ))=1\})=1; \end{aligned}$$

    as an obvious consequence, under the same conditions, we have that for every \(\varepsilon >0\), the plausibility \(pla({{\mathcal {B}}}_\varepsilon (\mu ))\) of having a distribution \(\varepsilon \)-close to \(\mu \) will almost surely converge to 1:

    $$\begin{aligned} \mu \left( \left\{ \omega \in \varOmega \, \vert \, \lim _{n\rightarrow \infty } pla_{[\omega ^{\le n}]}({\mathcal B}_\varepsilon (\mu ))=1\right\} \right) =1. \end{aligned}$$

Proof

Fix \(\mu \in M\). Since \(\mu (\varOmega _\mu )=1\), it is enough to show the following two claims for all \(\mu \)-normal sequences \(\omega \in \varOmega _\mu \) and all \(\varepsilon >0\):

$$\begin{aligned} \lim _{n\rightarrow \infty } pla_{[\omega ^{\le n}]}({\overline{M}}{\setminus } {{\mathcal {B}}}_\varepsilon (\mu ))=0; \\ \exists N \forall n\ge N \, pla_{[\omega ^{\le n}]}({{\mathcal {B}}}_\varepsilon (\mu ))=1. \end{aligned}$$

To prove the first claim, we use the fact that every plausibility ranking function satisfies \(0\le pla \le 1\) to derive

$$\begin{aligned}&0\le pla_{[\omega ^{\le n}]}({\overline{M}}{\setminus } {{\mathcal {B}}}_\varepsilon (\mu )) \le \frac{pla_{[\omega ^{\le n}]}({\overline{M}}{\setminus } {{\mathcal {B}}}_\varepsilon (\mu ))}{pla_{[\omega ^{\le n}]}(\mu )} \cdot {pla_{[\omega ^{\le n}]}(\mu )} \\&\quad \le \frac{pla_{[\omega ^{\le n}]}({\overline{M}}{\setminus } {{\mathcal {B}}}_\varepsilon (\mu ))}{pla_{[\omega ^{\le n}]}(\mu )} \cdot 1 = \frac{pla_{[\omega ^{\le n}]}({\overline{M}}{\setminus } {\mathcal B}_\varepsilon (\mu ))}{pla_{[\omega ^{\le n}]}(\mu )}, \end{aligned}$$

then obtain the desired conclusion by taking limits and applying Lemma 2.

For the second claim: given \(\varepsilon >0\), apply the first claim to conclude that \(\exists N \forall n\ge N \, pla_{[\omega ^{\le n}]}({\overline{M}}{\setminus } {{\mathcal {B}}}_\varepsilon (\mu ))\le \frac{1}{2}\). From this we get that

$$\begin{aligned}&1= pla_{[\omega ^{\le n}]}(M)= max(pla_{[\omega ^{\le n}]}(M\cap {{\mathcal {B}}}_\varepsilon (\mu )), pla_{[\omega ^{\le n}]}(M{\setminus } {{\mathcal {B}}}_\varepsilon (\mu )))\\&\quad \le max(pla_{[\omega ^{\le n}]}(M\cap {{\mathcal {B}}}_\varepsilon (\mu )), \frac{1}{2}), \end{aligned}$$

hence \(pla_{[\omega ^{\le n}]}(M\cap {{\mathcal {B}}}_\varepsilon (\mu ))=1\), and so also \(pla_{[\omega ^{\le n}]}({{\mathcal {B}}}_\varepsilon (\mu ))=1\). \(\square \)
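
To see the theorem at work, here is a minimal simulation sketch in Python (the candidate grid, the flat prior plausibility, and all variable names are our own illustrative assumptions, not part of the formal framework). It updates plausibilities by our plausibilistic analogue of Bayes' rule: multiply the prior by the likelihood of the observed sample, then renormalize so that the maximum plausibility is 1 (rather than the sum, as in probabilistic updating):

import numpy as np

rng = np.random.default_rng(0)

# a hypothetical finite set M of candidate distributions over 3 outcomes;
# the first row plays the role of the 'true' distribution mu
M = np.array([[0.2, 0.3, 0.5],
              [0.1, 0.6, 0.3],
              [1/3, 1/3, 1/3],
              [0.25, 0.25, 0.5]])
mu = M[0]
pla0 = np.ones(len(M))        # a flat prior plausibility (our assumption)

counts = np.zeros(3)
for _ in range(2000):         # repeated i.i.d. sampling from mu
    counts[rng.choice(3, p=mu)] += 1

# plausibilistic analogue of Bayes' rule: prior times sample likelihood,
# renormalized so that the *maximum* plausibility is 1 (not the sum)
log_pla = np.log(pla0) + counts @ np.log(M).T
pla = np.exp(log_pla - log_pla.max())
print(np.round(pla, 4))       # pla(mu) stays 1; every rival decays toward 0

On a typical run the true distribution retains plausibility 1 while every rival decays toward 0, in line with both claims of the theorem (and, on this finite grid, with the exact settling of Proposition 16 below).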

Corollary 1

[Convergence in belief] Let \({{\mathbf {M}}}=(M, pla)\) be a plausibility model. Then the agent’s beliefs after repeated sampling will almost surely eventually settle arbitrarily close to the true distribution.

More precisely: for every \(\mu \in M\) and every \(\varepsilon >0\), we have

$$\begin{aligned} \mu (\{ \omega \in \varOmega \, \vert \, \exists N\, \forall n \ge N \,\, B ( {\mathcal B}_\varepsilon (\mu )) \hbox { holds in}\ {{\mathbf {M}}}_{[\omega ^{\le n}]}\})=1, \end{aligned}$$

or equivalently

$$\begin{aligned} \mu (\{ \omega \in \varOmega \, \vert \, \exists N\, \forall n \ge N \,\, B ( {\mathcal B}_\varepsilon (\mu )\,\vert \, [\omega ^{\le n}]) \hbox { holds in}\ {{\mathbf {M}}}\})=1. \end{aligned}$$

Proof

From Theorem 2, we know that with \(\mu \)-probability 1 we have \(\lim _{n\rightarrow \infty } pla_{[\omega ^{\le n}]}({\overline{M}}{\setminus } {{\mathcal {B}}}_\varepsilon (\mu ))=0\), hence almost surely there is some \(N_1\) such that \(pla_{[\omega ^{\le n}]}({\overline{M}}{\setminus } {{\mathcal {B}}}_\varepsilon (\mu ))<1\) for all \(n\ge N_1\). Similarly, we know from Theorem 2 that, with \(\mu \)-probability 1, there is some \(N_2\) such that \(pla_{[\omega ^{\le n}]}({{\mathcal {B}}}_\varepsilon (\mu ))=1\) for all \(n\ge N_2\). Taking \(N:=max\{N_1, N_2\}\), we obtain (with \(\mu \)-probability 1) that for all \(n\ge N\), every \(\nu \in M\) of maximal plausibility \(pla_{[\omega ^{\le n}]}(\nu )=pla_{[\omega ^{\le n}]}(M)=1\) satisfies \(\nu \in {{\mathcal {B}}}_\varepsilon (\mu )\), as desired. \(\square \)

We now show that we can strengthen this result to:

Proposition 15

[Convergence in safe belief] Let \({{\mathbf {M}}}=(M, pla)\) be a plausibility model. Then the agent’s safe beliefs after repeated sampling will almost surely eventually settle arbitrarily close to the true distribution.

More precisely: for every \(\mu \in M\) and every \(\varepsilon >0\), we have

$$\begin{aligned} \mu (\{ \omega \in \varOmega \, \vert \, \exists N\, \forall n \ge N \,\, Sb ( {\mathcal B}_\varepsilon (\mu )) \hbox { holds at}\ \mu \ \hbox {in}\ {{\mathbf {M}}}_{[\omega ^{\le n}]}\})=1. \end{aligned}$$

Proof

Let \(\omega \in \varOmega _\mu \). By Lemma 2, we have \(\lim _{n\rightarrow \infty } \frac{pla_{[\omega ^{\le n}]}({\overline{M}}{\setminus } {\mathcal B}_\varepsilon (\mu ))}{pla_{[\omega ^{\le n}]}(\mu )}=0\). So there exists some N s.t. \(\frac{pla_{[\omega ^{\le n}]}({\overline{M}}{\setminus } {{\mathcal {B}}}_\varepsilon (\mu ))}{pla_{[\omega ^{\le n}]}(\mu )}<1\) for all \(n\ge N\). Hence, we have \(pla_{[\omega ^{\le n}]}({\overline{M}}{\setminus } {\mathcal B}_\varepsilon (\mu )) < pla_{[\omega ^{\le n}]}(\mu )\) for all \(n\ge N\). Thus, for all \(n\ge N\) and all \(\nu \in M\), if \(pla_{[\omega ^{\le n}]}(\nu )\ge pla_{[\omega ^{\le n}]}(\mu )\), then \(\nu \not \in {\overline{M}}{\setminus } {{\mathcal {B}}}_\varepsilon (\mu )\), i.e. \(\nu \in {{\mathcal {B}}}_\varepsilon (\mu )\). By Proposition 12, this means that \(Sb({{\mathcal {B}}}_\varepsilon (\mu ))\) holds at \(\mu \) in \({{\mathbf {M}}}_{[\omega ^{\le n}]}\) for all \(n\ge N\). The desired conclusion follows again from the fact that \(\mu (\varOmega _\mu )=1\). \(\square \)

An obvious consequence is the following:

Corollary 2

[Approximate statistical learning] Let \({{\mathbf {M}}}=(M, pla)\) be a plausibility model. Then after repeated sampling from an unknown distribution, the agent will almost surely eventually acquire approximate statistical knowledge of the true distribution with any desired accuracy \(\varepsilon >0\).

More precisely: for every \(\mu \in M\) and every \(\varepsilon >0\), we have

$$\begin{aligned} \mu (\{ \omega \in \varOmega \, \vert \, \exists N\, Sk ( {{\mathcal {B}}}_\varepsilon (\mu )) \text{ holds } \text{ at }\ \mu \ \text{ wrt }\ \omega ^{\ge N}\ \text{ in }\ {{\mathbf {M}}}_{[\omega ^{\le N}]}\})=1. \end{aligned}$$

The proof is immediate, given Proposition 15. All these convergence results are inexact: they concern only approximations of the true distribution. However, the fact that every non-zero degree of accuracy is eventually achieved (and maintained forever after) shows that the verisimilitude of our models keeps increasing, or equivalently that their distance-from-the-truth keeps decreasing (approaching 0 in the limit). In this sense, we have convergence in the limit to the exact true distribution:

Corollary 3

[Convergence in verisimilitude] Let \({{\mathbf {M}}}=(M,pla)\) be a plausibility model. If \(\mu \in M\) is the true distribution, then the distance-from-the-truth will almost surely converge to 0 after repeated sampling:

$$\begin{aligned} \mu (\{\omega \in \varOmega \, \vert \, \lim _{n\rightarrow \infty } d_\mu ({{\mathbf {M}}}_{[\omega ^{\le n}]})=0\})=1. \end{aligned}$$

Proof

Unfolding the definition of the limit, it suffices to show that

$$\begin{aligned} \mu (\{\omega \in \varOmega \, \vert \, \forall \varepsilon>0 \, \exists N \, \forall n\ge N \, d_\mu ({{\mathbf {M}}}_{[\omega ^{\le n}]})<\varepsilon \})=1. \end{aligned}$$

But note that, by the definition of distance-to-the-truth, we have the following equivalence:

$$\begin{aligned} d_\mu ({{\mathbf {M}}}_{[\omega ^{\le n}]})<\varepsilon \, \, \text{ iff } \,\, Sb({\mathcal {B}}_\varepsilon (\mu )) \text{ holds } \text{ at }\ \mu \ \text{ in }\ {{\mathbf {M}}}_{[\omega ^{\le n}]}. \end{aligned}$$

The desired conclusion follows immediately, given Proposition 15. \(\square \)

A general feature of all the above forms of truth-tracking is that the convergence to the exact true distribution (rather than to an approximation) happens only in the limit (rather than being reached at some finite stage). However, one can do better than this when the agent’s prior knowledge is consistent with only a discrete (or in particular, a finite) set of distributions:

Proposition 16

[Finite convergence to exact truth] Let \({{\mathbf {M}}}=(M,pla)\) be a plausibility model, based on a discrete set \(M\subseteq M_O\).Footnote 21 Then we have the following:

  • when repeatedly sampling from the unknown distribution \(\mu \), the plausibility \(pla(\mu )\) of the true distribution will almost surely eventually settle on 1 (after finitely many rounds of sampling); while in contrast, the plausibility \(pla(\nu )\) of any other distribution will almost surely settle below any given threshold \(\delta >0\) (after finitely many such rounds):

    $$\begin{aligned}&\mu (\{\omega \in \varOmega \, \vert \, \exists N\forall n\ge N\, pla_{[\omega ^{\le n}]}(\mu )=1\})=1, \,\, \text{ and } \\&\mu (\{\omega \in \varOmega \, \vert \, \exists N\forall n\ge N\, pla_{[\omega ^{\le n}]}(\nu )<\delta \})=1, \text{ for } \text{ all }\ \nu \ne \mu \ \text{ and } \text{ all }\ \delta >0; \end{aligned}$$
  • similarly, the agent’s beliefs will almost surely eventually settle on the exact true probability \(\mu \), after finitely many rounds of sampling:

    $$\begin{aligned} \mu (\{\omega \in \varOmega \, \vert \, \exists N\forall n\ge N\, B(\{\mu \}) \hbox { holds in}\ {{\mathbf {M}}}_{[\omega ^{\le n}]}\})=1; \end{aligned}$$
  • the same statement as in the previous part applies to safe beliefs:

    $$\begin{aligned} \mu (\{\omega \in \varOmega \, \vert \, \exists N\forall n\ge N\, Sb(\{\mu \}) \hbox { holds at}\ \mu \ \hbox {in}\ {{\mathbf {M}}}_{[\omega ^{\le n}]}\})=1; \end{aligned}$$
  • after finitely many rounds of sampling from the unknown distribution, the agent will almost surely eventually acquire exact statistical knowledge of the true distribution:

    $$\begin{aligned} \mu (\{\omega \in \varOmega \, \vert \, \exists N\forall n\ge N\, Sk(\{\mu \}) \hbox { holds at}\ \mu \ \hbox {wrt}\ \omega ^{\ge N}\ \hbox {in}\ {{\mathbf {M}}}_{[\omega ^{\le n}]}\})=1. \end{aligned}$$
  • finally, the distance-to-the-truth of the plausibility model will almost surely eventually settle to 0, after finitely many rounds of sampling:

    $$\begin{aligned} \mu (\{\omega \in \varOmega \, \vert \, \exists N \forall n\ge N \, d_\mu ({{\mathbf {M}}}_{[\omega ^{\le n}]})=0\})=1. \end{aligned}$$

Proof

Apply each of the previous results to some \(\varepsilon >0\) small enough that \({\mathcal {B}}_\varepsilon (\mu )\cap M=\{\mu \}\) (such an \(\varepsilon \) exists because M is discrete). \(\square \)
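
To illustrate this finite convergence concretely (again a sketch under illustrative assumptions; the two-point model and all names are our own), one can record the last round at which a rival distribution still ties for maximal plausibility; on almost every sample path this round is finite, after which the agent's belief is exactly \(\{\mu \}\):

import numpy as np

rng = np.random.default_rng(1)
M = np.array([[0.5, 0.5],      # the 'true' distribution mu
              [0.6, 0.4]])     # a single rival: a discrete two-point model
counts = np.zeros(2)
last_tie = 0
for n in range(1, 5001):
    counts[rng.choice(2, p=M[0])] += 1
    log_pla = counts @ np.log(M).T          # flat prior plausibility
    pla = np.exp(log_pla - log_pla.max())
    if pla[1] == 1.0:                       # rival still maximally plausible
        last_tie = n
print("beliefs settle on mu for good after round", last_tie)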

It is important to note the differences between our convergence results and the Savage-style convergence results in the Bayesian literature (Edwards et al. 1963; Savage 1954; Doob 1971; Gaifman and Snir 1982; Earman 1992), that were mentioned in the Introduction. Savage’s theorem assumes a certain restriction on the true hypothesis (namely, that its prior probability is non-zero), which makes it applicable only to a finite (or countable) set of hypothesesFootnote 22 (since otherwise the prior probability cannot be assumed to be non-zero for every hypothesis). Our general results (concerning truth-tracking in the limit) do not need this assumption and indeed, they even apply to the whole (uncountable) set \(M_O\) of all distributions.

On the other hand, in the case of a finite (or, more generally, discrete) set of hypotheses/distributions, our plausibilistic learning is even better-behaved than standard Bayesian learning: in this case we obtain convergence in finitely many steps (while Savage-style learning still converges only in the limit). This faster convergence is explained by the qualitative nature of our belief-formation (as is standard in logic, only the most plausible hypotheses matter for beliefs), instead of the quantitative-cumulative nature of probabilistic credences. The combination of this qualitative-logical way of forming beliefs with the statistical-Bayesian way of updating them (as encoded in our rule for conditioning on events) ensures that the true distribution will eventually reach the highest plausibility (among a finite set of distributions), thus giving us finite convergence to the exact truth.

7 Towards a logic of statistical learning

In this section we propose a logical setting that can capture the dynamics of statistical learning described in this paper. Our logical language is designed to accommodate both types of information, i.e. finite observations and higher-order information. As already mentioned, there is a fundamental distinction between these two types of information. The observations are interpreted in a \(\sigma \)-algebra \({\mathcal {E}} \subseteq {\mathcal {P}}(\varOmega )\), and are not themselves formulas in our formal logical language, as they do not correspond to properties of probability distributions. The formulas will instead be statements about the probabilities of observations, given in terms of linear inequalities and logical combinations thereof, as well as the statements concerning the dynamics arising from finite observations.

Given the set of outcomes \(O=\{o_1, \ldots , o_n\}\), the set of formulas \(\phi \) of our language is inductively defined by

$$\begin{aligned} \phi \,::= \sum _{i=1}^{m} a_i P(\omega _i) \ge c \, \vert \, \phi \wedge \phi \, \vert \, \lnot \phi \, \vert \, K \phi \, \vert \, Sb(\phi ) \, \vert \,B (\phi \, \vert \, \omega ^{\le n}) \,\vert \, [o] \phi \,\vert \, [\phi ] \phi \end{aligned}$$

where \(o,\omega _i \in O\), \(a_i, c \in {\mathbb {Q}}\), and \(\omega ^{\le n}=(\omega _1, \ldots , \omega _n) \in O^n\) is a stream of observations of length n.

Let \({{\mathbf {M}}}=(M, pla)\) be a probabilistic plausibility model. The semantics is given by inductively defining a satisfaction relation \({{\mathbf {M}}},\mu \vDash \phi \) between distributions \(\mu \in M\) and formulas \(\phi \). At each pair \(({{\mathbf {M}}},\mu )\), the symbol P will be interpreted as a probability mass function, namely \(\mu \) itself. In this definition, we use the notation \(\Vert \phi \Vert _{{\mathbf {M}}}:=\{\mu \in M \, \vert \, {{\mathbf {M}}}, \mu \vDash \phi \}\), and skip the subscript \({{\mathbf {M}}}\) when the model is understood:

$$\begin{aligned} \begin{array}{ll} {{\mathbf {M}}}, \mu \vDash \mathop {\sum }\limits _{i=1}^{m} a_i P(\omega _i) \ge c &{} \iff \mathop {\sum }\limits _{i=1}^{m} a_i {\mu }(\omega _i) \ge c\\ {{\mathbf {M}}}, \mu \vDash \phi \wedge \psi &{}\iff {{\mathbf {M}}}, \mu \vDash \phi \,\,\mathrm { and }\,\, {{\mathbf {M}}}, \mu \vDash \psi \\ {{\mathbf {M}}}, \mu \vDash \lnot \phi &{}\iff {{\mathbf {M}}}, \mu \nvDash \phi \\ {{\mathbf {M}}}, \mu \vDash K \phi &{}\iff {{\mathbf {M}}}, \nu \vDash \phi \,\, \mathrm { for } \,\, \mathrm { all }\,\, \nu \in M\\ {{\mathbf {M}}}, \mu \vDash Sb \phi &{}\iff {{\mathbf {M}}}, \nu \vDash \phi \,\, \mathrm { for } \,\, \mathrm { all }\,\, \nu \in M \, \mathrm { s. t. } \, pla(\nu )\ge pla(\mu )\\ {{\mathbf {M}}}, \mu \vDash B(\phi \, \vert \, \omega ^{\le n}) &{}\iff B(\Vert \phi \Vert \, \vert \, [\omega ^{\le n}]) \,\, \mathrm { holds }\,\, \mathrm { in } \,\, {{\mathbf {M}}}\\ {{\mathbf {M}}}, \mu \vDash [o] \phi &{}\iff \left( \mu (o)>0 \implies {{\mathbf {M}}}_{[o^1]}, \mu \vDash \phi \right) \\ {{\mathbf {M}}}, \mu \vDash [\theta ] \phi &{}\iff \left( {{\mathbf {M}}}, \mu \vDash \theta \implies {{\mathbf {M}}}_{\Vert \theta \Vert }, \mu \vDash \phi \right) \end{array} \end{aligned}$$

The atomic formulas \( \sum _{i=1}^{m} a_i P(\omega _i) \ge c\) describe linear inequalities satisfied by the true probability, with numerical constants ranging over the rationals. The propositional connectives \(\lnot , \wedge \) are standard. The letters \(\textit{K}\) and \(\textit{B}\) stand for the knowledge and (conditional) belief operators, and Sb stands for safe belief. The dynamic modalities \([o]\psi \) (standing for “after observing o, \(\psi \) holds”) and \([\phi ]\psi \) (standing for “after learning \(\phi \), \(\psi \) holds”) capture the updates induced by the two forms of learning.

The reason we did not include plain belief \(B\phi \) or propositionally-conditional beliefs \(B(\phi \, \vert \, \psi )\) in the syntax is that these operators are definable as abbreviations. Plain belief is obtained as the special case of conditioning on a sampling sequence \(\omega ^{\le 0}\) of length 0, i.e. we can put

$$\begin{aligned} B(\phi ) \,\,\, := \,\,\, B(\phi \, \vert \, \lambda ), \end{aligned}$$

where \(\lambda =()=\omega ^{\le 0}\) is the empty sequence of observations. Less trivially, conditional beliefs of the form \(B(\phi \, \vert \, \theta )\) can be defined in terms of knowledge and safe belief, by putting:

$$\begin{aligned} B(\phi \, \vert \, \theta ) \,\,\, := \,\,\, {\tilde{K}}\theta \rightarrow {\tilde{K}}(\theta \wedge Sb(\theta \rightarrow \phi )), \end{aligned}$$

where \({\tilde{K}}\psi :=\lnot K\lnot \psi \) is the Diamond-dual modality for \(\textit{K}\) (denoting “epistemic possibility”). With these abbreviations, one can easily check that the resulting notion satisfies the expected semantic clause for conditional belief:Footnote 23

$$\begin{aligned} {{\mathbf {M}}}, \mu \vDash B(\phi \, \vert \, \theta ) \,\,\,\, \text{ iff } \,\,\,\, B(\Vert \phi \Vert \, \vert \, \Vert \theta \Vert ) \,\, \text{ holds } \text{ in } \,\, {{\mathbf {M}}}. \end{aligned}$$

We say that a formula \(\phi \) is valid in model \({{\mathbf {M}}}\), and write \({{\mathbf {M}}}\vDash \phi \), if and only if \({{\mathbf {M}}}, \mu \vDash \phi \) for all \(\mu \in M\). As usual, \(\phi \) is simply valid if it is valid in every model \({{\mathbf {M}}}\).
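
To make these semantic clauses concrete, the following minimal model checker (a Python sketch; the nested-tuple encoding of formulas, the toy two-world model, and all names are our own choices) evaluates the static fragment — atoms, \(\lnot \), \(\wedge \), K, Sb — on a finite plausibility model:

from fractions import Fraction

# a finite plausibility model: pairs (distribution, plausibility), with
# distributions given as tuples of probabilities for outcomes o_0, ..., o_{k-1}
MODEL = [((Fraction(1, 2), Fraction(1, 2)), 1.0),
         ((Fraction(1, 4), Fraction(3, 4)), 0.5)]

def sat(model, w, phi):
    # decides M, mu |= phi for the static fragment; phi is a nested tuple
    mu, pla_mu = model[w]
    op = phi[0]
    if op == 'atom':                 # ('atom', (a_1,...,a_k), c): sum_i a_i P(o_i) >= c
        _, coeffs, c = phi
        return sum(a * p for a, p in zip(coeffs, mu)) >= c
    if op == 'not':
        return not sat(model, w, phi[1])
    if op == 'and':
        return sat(model, w, phi[1]) and sat(model, w, phi[2])
    if op == 'K':                    # phi holds at every world of the model
        return all(sat(model, v, phi[1]) for v in range(len(model)))
    if op == 'Sb':                   # phi holds at every world at least as plausible
        return all(sat(model, v, phi[1])
                   for v, (_, pla_nu) in enumerate(model) if pla_nu >= pla_mu)
    raise ValueError(op)

half = ('atom', (1, 0), Fraction(1, 2))     # encodes P(o_0) >= 1/2
print(sat(MODEL, 0, ('Sb', half)))          # True at the most plausible world
print(sat(MODEL, 1, ('Sb', half)))          # False: world 1 itself refutes it
print(sat(MODEL, 0, ('K', ('atom', (1, 0), Fraction(1, 4)))))  # True at all worlds

Extending sat with clauses for \(B(\phi \,\vert \,\omega ^{\le n})\) and the dynamic modalities amounts to conditioning the model on the corresponding event and re-running the evaluation; the abbreviation for \(B(\phi \,\vert \,\theta )\) above can then be checked to satisfy its expected semantic clause on such finite models.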

Proposition 17

Let \(o \in O\) and let \(\phi , \psi , \theta , \xi \) be formulas. Then the following formulas are valid:

  1. \(P(o) \ge 0\)

  2. \(\sum _{o \in O} P(o)=1\)

  3. \( K(\phi \rightarrow \theta ) \rightarrow (K \phi \rightarrow K \theta )\)

  4. \(K \phi \rightarrow \phi \)

  5. \(K \phi \rightarrow KK\phi \)

  6. \(\lnot K \phi \rightarrow K \lnot K\phi \)

  7. \(K\phi \rightarrow Sb\phi \)

  8. \(Sb \phi \rightarrow \phi \)

  9. \(Sb \phi \rightarrow Sb Sb\phi \)

  10. \(\left( K(\phi \vee Sb \psi )\wedge K(\psi \vee Sb\phi ) \right) \rightarrow (K\phi \vee K\psi )\)

  11. \( B(\phi \rightarrow \theta \,\vert \, \psi ) \rightarrow (B (\phi \,\vert \, \psi ) \rightarrow B( \theta \, \vert \, \psi ))\)

  12. \( K \phi \rightarrow B (\phi \, \vert \, \psi ) \)

  13. \( B(\phi \, \vert \, \phi )\)

  14. \( B (\phi \, \vert \, \psi ) \rightarrow K( B( \phi \, \vert \, \psi ) \,\vert \, \psi ) \)

  15. \(\lnot B (\phi \, \vert \, \psi ) \rightarrow K (\lnot B (\phi \, \vert \, \psi ) \, \vert \, \psi )\)

  16. \( B (\theta \, \vert \, \phi ) \rightarrow \left( B (\xi \, \vert \, \phi \wedge \theta ) \leftrightarrow B (\xi \, \vert \, \phi )\right) \)

  17. \( \lnot B (\lnot \theta \, \vert \, \phi ) \rightarrow \left( B (\xi \, \vert \, \phi \wedge \theta ) \leftrightarrow B (\theta \rightarrow \xi \, \vert \, \phi )\right) \)

  18. If \(\phi \leftrightarrow \theta \) is valid in \({{\mathbf {M}}}\) then so is \( B (\xi \, \vert \, \phi ) \leftrightarrow B (\xi \, \vert \, \theta )\).

Proof

Note that the plausibility function induces a complete preorder on the set of worlds. The validity of the above formulas over such models follows directly from the results in Board (2004) and Baltag and Smets (2008b), and it is in fact a straightforward application of general results in Correspondence Theory for modal frames. \(\square \)

Finally, we give some validities regarding the interaction of the dynamic modalities with knowledge modality and (conditional) belief.

Proposition 18

Let \(o,\omega _1, \ldots , \omega _n\in O\) and let \(\phi , \theta , \xi \) be formulas. Then the following formulas are valid:

  1. \([\phi ]q \leftrightarrow (\phi \rightarrow q)\) for atomic q

  2. \([o]q \leftrightarrow (P(o)>0 \rightarrow q)\) for atomic q

  3. \([\phi ] \lnot \theta \leftrightarrow (\phi \rightarrow \lnot [\phi ] \theta )\)

  4. \([o] \lnot \theta \leftrightarrow (P(o)>0 \rightarrow \lnot [o] \theta )\)

  5. \([\phi ] (\theta \wedge \xi ) \leftrightarrow ([\phi ] \theta \wedge [\phi ] \xi )\)

  6. \([o] (\theta \wedge \xi ) \leftrightarrow ([o] \theta \wedge [o] \xi )\)

  7. \([\phi ] K \theta \leftrightarrow (\phi \rightarrow K[\phi ] \theta )\)

  8. \([o] K\phi \leftrightarrow (P(o)>0 \rightarrow K [o]\phi )\)

  9. \([\phi ] B (\theta \, \vert \, \xi ) \leftrightarrow \left( \phi \rightarrow B([\phi ] \theta \, \vert \, \phi \wedge [\phi ]\xi )\right) \)

  10. \( [o] B (\phi \, \vert \, \omega _1, \ldots ,\omega _n) \leftrightarrow (P(o)>0 \rightarrow B([o]\phi \, \vert \, o, \omega _1,\ldots , \omega _n))\)
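
As an illustration of how these reduction axioms eliminate the dynamic modalities, consider belief in an atomic formula q after a single observation o, conditional on the empty sampling sequence \(\lambda \). Applying validity 10 (with \(n=0\)) and then validity 2 inside the belief operator (using replacement of provable equivalents) gives:

$$\begin{aligned} [o] B (q \, \vert \, \lambda ) \,\, \leftrightarrow \,\, (P(o)>0 \rightarrow B([o]q \, \vert \, o)) \,\, \leftrightarrow \,\, (P(o)>0 \rightarrow B(P(o)>0 \rightarrow q \, \vert \, o)). \end{aligned}$$

Iterating such rewriting steps pushes every dynamic modality inward until it disappears, in the style of the standard reduction method of Dynamic Epistemic Logic; whether this yields a complete axiomatization of the full language is part of the open question below.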

Open question. Is the above logic recursively axiomatizable? Is it decidable?

Further extension. To define statistical knowledge, we need to extend the above semantics by making explicit the actual (future) sampling history. This means that we define the satisfaction relation on triples \({{\mathbf {M}}}, \mu , \omega \vDash \phi \), where \({{\mathbf {M}}}\) and \(\mu \) are as above, while \(\omega \in \varOmega _\mu \) is the infinite string of future observations. The semantic clauses for all the above operators stay essentially the same (i.e. the sequence \(\omega \) plays no role, so it is just carried through). But we can now introduce new operators that refer to the future sampling history. We could directly introduce statistical knowledge \(Sk\phi \), but it seems more natural to add instead the temporal operator \(\Box \phi \) (“from now and forever in the future, \(\phi \) holds”) and its dual \(\Diamond \phi \) (“\(\phi \) holds now or at some future moment”), with the obvious semantics:

$$\begin{aligned} \begin{array}{ll} {{\mathbf {M}}}, \mu , \omega \vDash \Box \phi &{}\qquad \qquad \iff {{\mathbf {M}}}_{[\omega ^{\le n}]}, \mu , \omega ^{>n}\vDash \phi \,\, \mathrm { for } \,\, \mathrm { all } \, n\ge 0\\ {{\mathbf {M}}}, \mu , \omega \vDash \Diamond \phi &{}\qquad \qquad \iff {{\mathbf {M}}}_{[\omega ^{\le n}]}, \mu , \omega ^{>n}\vDash \phi \,\, \mathrm { for } \,\, \mathrm { some } \, n\ge 0 \end{array} \end{aligned}$$

(In fact, \(\Diamond \phi \) is redundant: it is the Diamond-dual of \(\Box \), so can be taken to be just an abbreviation for \(\lnot \Box \lnot \phi \).)

For non-epistemicFootnote 24 formulas P, we can identify statistical knowledge Sk(P) with the formula \(\Box Sb(P)\). As a result, Corollary 2, on eventual convergence (in finitely many steps) to approximate statistical knowledge of the true distribution, is captured in this logic by the validity

$$\begin{aligned} \Big (\bigwedge _i P(o_i)=p_i\Big ) \rightarrow \Diamond \Box Sb \bigwedge _i (p_i-\varepsilon<P(o_i)<p_i+\varepsilon ), \end{aligned}$$

for every \(\varepsilon >0\).

8 Conclusion and comparison with other work

We studied forming beliefs about unknown probabilities in situations that are commonly described as those of radical uncertainty. The most widespread approach to modelling such situations is in terms of imprecise probabilities, i.e. representing the agent’s knowledge as a set of probability measures. There is extensive literature on the study of imprecise probabilities (Bradley and Drechsler 2014; Chandler 2014; Hajek and Smithson 2012; Levi 1985; Walley 2000; Denoeux 2000; Romeijn and Roy 2014) and on different approaches to decision making based on them (Bradley and Steele 2014; Huntley et al. 2014; Troffaes 2007; Elkin and Wheeler 2018; Mayo-Wilson and Wheeler 2016; Seidenfeld 2004; Seidenfeld et al. 2010; Williams 2014). An alternative is to collapse the state of radical uncertainty by settling on some specific probability assignment as the most rational among all those consistent with the agent’s information; the latter approach gives rise to the area of investigation known as the Objective Bayesian account (Paris and Rad 2010; Paris and Vencovska 1997; Paris 2014; Rad 2017; Williamson 2008, 2010).

A similar line of inquiry has been extensively pursued in the Economics literature, as well as in Decision Theory, where the situation we are investigating in this paper is referred to as Knightian uncertainty or ‘ambiguity’: the case in which the decision-maker has too little information to arrive at a unique prior. There have been several approaches in this literature to model such scenarios. These include, among others, the use of Choquet integration (Huber and Strassen 1973; Schmeidler 1986, 1989), the maxmin expected utility of Gilboa and Schmeidler (1989), the smooth ambiguity model of Klibanoff et al. (2005), which employs second-order probabilities, and Al-Najjar’s work (Al-Najjar 2009), which models rational agents who use frequentist models for interpreting the evidence and investigates learning in the long run. Cerreia-Vioglio et al. (2013) study this problem in a formal setting similar to the one used here, axiomatize different decision rules (such as the maxmin model of Gilboa and Schmeidler and the smooth ambiguity model of Klibanoff et al.), and give an overview of some of the different approaches in that literature.

These approaches employ different mechanisms for ranking probability distributions than the one we propose in this paper. It is particularly worth pointing out the difference between our setting and those that rank probability distributions by their (second-order) probabilities. In our setting, only the worlds with the highest plausibility play a role in specifying the set of beliefs. In particular, unlike probabilities, plausibilities are not cumulative: distributions with low plausibility do not add up to form more plausible events, as sets of distributions with low probability would. This is a fundamental difference between our account and the account given in terms of second-order probabilities.

Another approach to dealing with these scenarios in the Bayesian literature is based on a series of convergence results collectively referred to as “washing out of the prior”. The idea, which traces back to Savage (see Edwards et al. 1963; Savage 1954), is that as long as one repeatedly updates a prior probability for an event through conditionalisation on new evidence, then in the limit one will almost surely converge to the true probability, independently of the initial choice of prior.Footnote 25 Bayesians use these results to argue that an agent’s choice of a probability distribution in scenarios such as our urn example is unimportant, as long as she repeatedly updates that choice (via conditionalisation) by acquiring further evidence, for example by repeated sampling from the urn. However, the efficiency of the agent’s choice of probability distribution, put in the context of a decision problem, depends strongly on how closely the chosen distribution tracks the actual one. This choice is most relevant when the agent faces a one-off decision problem, where her approximation of the true probability distribution at a given point ultimately determines her actions at that point.

Our approach, based on forming rational qualitative beliefs about probability (grounded in the agent’s assessment of each distribution’s plausibility), does not seem prone to these objections. The agent does “the best she can” at each moment, given her evidence, her higher-order information, and her background assumptions (captured by her plausibility map). Thus, she can solve one-off decision problems to the best of her ability. And, by updating her plausibility with new evidence, her beliefs are still guaranteed to converge to the true distribution (given enough evidence) in essentially all conditions (including in cases that evade Savage-type theorems).

As we already mentioned, our approach is based on a probabilistic adaptation of the standard qualitative theory of plausibility models (Board 2004; Baltag and Smets 2008b), which underlies modern presentations of standard Belief Revision Theory (Alchourrón et al. 1985; Grove 1988) within Dynamic Epistemic Logic (Baltag and Moss 2004; Baltag et al. 1998; van Ditmarsch et al. 2007; Baltag and Smets 2008b; Baltag and Renne 2016; van Benthem 2011). As such, it has some connections with Wolfgang Spohn’s quantitative theory of plausibility ranking (Spohn 2016), but it differs from it in essential ways: like Spohn’s ranking theory,Footnote 26 we use maximization to form beliefs (where standard probabilistic theory uses addition);Footnote 27 but, when updating plausibility with independent sampling evidence, we follow the probabilistic usage of taking products (as in Bayes’ rule), while Spohn’s ranking theory uses addition for this purpose.

On the other hand, our framework does satisfy the conditions of Halpern’s abstract theory of algebraic conditional plausibility spaces (Halpern 2003), which is meant as a generalization of a large number of theories of uncertainty (Bayesian probabilities, Dempster-Shafer belief functions, possibility measures, relative likelihoods, AGM conditioning, Popper measures, Spohn’s ranking theory). That theory postulates the existence of two operations: one, the analogue of probabilistic addition, is used for computing the plausibility of a proposition P and for deciding whether it is to be believed or not; the other, the analogue of probabilistic multiplication, is used for updating plausibilities (via an abstract analogue of Bayes’ rule) and for computing the plausibility of joint independent observations. To work well, the two operations need to satisfy certain conditions tying them together. Our particular combination of maximization and multiplication, though to our knowledge never before encountered in the literature, satisfies Halpern’s conditions, and so it is in a sense a “natural” theory.

But beyond that, we think that this particular combination is the key to fast learning from sampling, as well as to reconciling probability with logic. On the one hand, multiplication is needed for the update, to deal rationally with successive independent observations (cf. Proposition 9, which would fail without the use of multiplication in our plausibilistic analogue of Bayes’ rule). On the other hand, the use of maximization in the formation of beliefs allows convergence in finitely many steps (in contrast to mere convergence in the limit via probabilistic updating à la Savage), and at the same time makes beliefs about probability fit the general patterns and conditions of Doxastic Logic and Belief Revision Theory. Indeed, this particular combination seems to adopt the best features of both worlds (doxastic logic with its belief revision, and statistical reasoning with its Bayesian updates), while fitting within the general conditions of a natural theory of uncertainty (as formalized by Halpern’s abstract requirements).

Our approach connects well with mainstream epistemology and formal learning theory, by making essential use of the formal concept of “safe belief”, studied in Baltag and Smets (2008b) as an approximation of the philosophical notion of defeasible knowledge (Lehrer 1990; Rott 2004), and related also to the issue of stability or ‘resilience’ of probabilistic belief (Skyrms 2011), an issue underlying recent attempts at unifying logical and probabilistic reasoning, cf. the so-called stability theory of belief (Leitgeb 2017). Our concept of statistical knowledge improves on the notion of safe belief by adding a form of stability under future sampling, which connects well with the learning-theoretic concept of identifiability in the limit (Gold 1967), as well as with various formal notions of inductive knowledge, introduced in Baltag et al. (2019a, 2019b) and Kelly (2014) as epistemic correlatives of empirical induction. As already mentioned, the correlative notion of distance-from-the-truth fits well with the main tenets of Verisimilitude Theory, originating in the work of Popper (1976) and his critics (Tichy 1974; Miller 1974), and developed to maturity in the work of Niiniluoto (1987, 1998), Kuipers (1987) and others. In particular, our setting fits within the metric approach to truthlikeness (Niiniluoto 1987), resulting in the verisimilitude version of our convergence results: tracking the truth is then naturally understood as a progressive increase in our models’ truthlikeness (or equivalently, a progressive decrease of the models’ distance-from-the-truth).

Our paper ends by sketching the contours of a dynamic doxastic logic for statistical learning, which validates a number of standard axioms and can express the core of our convergence results. This leaves us with an outstanding open problem: finding a complete axiomatization of this logic and determining its complexity. This seems a daunting task at the time of writing. Given the power of this formalism and its significance for the investigation of statistical learning, we take it to be an important and potentially fertile challenge.