Tracking probabilistic truths: a logic for statistical learning

We propose a new model for forming and revising beliefs about unknown probabilities. To go beyond what is known with certainty and represent the agent’s beliefs about probability, we consider a plausibility map, associating to each possible distribution a plausibility ranking. Beliefs are defined as in Belief Revision Theory, in terms of truth in the most plausible worlds (or more generally, truth in all the worlds that are plausible enough). We consider two forms of conditioning or belief update, corresponding to the acquisition of two types of information: (1) learning observable evidence obtained by repeated sampling from the unknown distribution; and (2) learning higher-order information about the distribution. The first changes only the plausibility map (via a ‘plausibilistic’ version of Bayes’ Rule), but leaves the given set of possible distributions essentially unchanged; the second rules out some distributions, thus shrinking the set of possibilities, without changing their plausibility ordering. We look at the stability of beliefs under either of these types of learning, defining two related notions (safe belief and statistical knowledge), as well as a measure of the verisimilitude of a given plausibility model. We prove a number of convergence results, showing how our agent’s beliefs track the true probability after repeated sampling, and how she eventually gains, in a sense, (statistical) knowledge of that true probability. Finally, we sketch the contours of a dynamic doxastic logic for statistical learning.


Introduction
The goal of this paper is to propose a new model for learning a probabilistic distribution, in situations that are commonly characterized as those of "radical uncertainty" (Walley 1996) or "Knightian uncertainty" (Cerreia-Vioglio et al. 2013). The most widespread model for these situations uses imprecise probabilities, i.e. sets of probability distributions. As an example, consider an urn full of marbles, coloured red, green, and blue, but with an unknown distribution. What is then the probability of drawing a red marble? In such cases, when the agent's information is not enough to determine the true distribution, she is typically left with a large (possibly infinite) set of possible probability assignments. If she never goes beyond what she knows, then her only 'rational' answer should be "I don't know": she is in a state of ambiguity, and she should simply consider possible all distributions that are consistent with her background knowledge and observed evidence. This type of over-cautious rationality, resembling the famous paradox of "Buridan's ass", is not of much help in dealing with practical decision problems.
Our model allows the agent to go beyond what she knows with certainty, by forming rational qualitative beliefs about the unknown distribution, beliefs based on the inherent plausibility of each possible distribution. For this, we assume the agent is endowed with an initial plausibility map, assigning real numbers to the possible distributions. The plausibility map encodes the agent's background beliefs and a priori assumptions about the world. For instance, an agent who assumes the Principle of Indifference (Williamson 2013; Hájek 2019) will use Shannon entropy as her plausibility function, thus initially believing that the distribution is the most non-informative one (in the given set of possibilities). On the other hand, an agent assuming a Normality or 'Averageness' Principle will use closeness to the Center of Mass or the barycenter (Paris 1994) as her plausibility measure, thus starting with a belief in the most typical distribution, i.e. the one that is the most representative for the given set of distributions. Finally, an agent who assumes some form of Ockham's Razor will use as plausibility some measure of simplicity (Kelly 2008), so that her prior belief will focus on the simplest distribution(s).
Our agent forms beliefs by using the standard definition of qualitative belief in Logic and Belief Revision Theory, in terms of plausibility maximization (Board 2004; Baltag and Smets 2008b): she believes the most plausible distribution(s). More precisely, we equate "belief" with "truth in all the worlds that are plausible enough": P is believed iff there exists some distribution μ such that P is true in all distributions that are at least as plausible as μ. In particular, "belief" coincides with truth in all the most plausible worlds, whenever such most plausible worlds/distributions exist. As a consequence, all the usual KD45 axioms of doxastic logic will be valid in our framework.
Note that, although our plausibility map assigns real values to probability distributions, this account is essentially different from the ones using so-called "second-order probabilities" (i.e. probability distributions defined on the given set of probability distributions) (Gaifman and Snir 1982; Gaifman 2016). Plausibility values are only relevant in so far as they induce a qualitative order on distributions. In contrast to probability, plausibility is not cumulative (in the sense that the low-plausibility alternatives do not add up to form more plausible sets of alternatives), and as a result only higher-ranking distributions 'beat' lower-ranking ones; if some distributions have the highest plausibility, they are the only ones of any relevance for beliefs.
Our model is not just a way to "rationally" select a Bayesian prior; it also comes with a rational method for revising beliefs in the face of new evidence. In fact, it can deal with two types of new information: first-order evidence gathered by repeated sampling from the (unknown) distribution; and higher-order information about the distribution itself, coming in the form of a set of possible distributions (often defined by a set of linear inequality constraints on that distribution). To see the difference between the two types of new evidence, take for instance the example of a coin. As is well known, any finite sequence of Heads and Tails is consistent with all possible non-extremal biases of the coin. As such, no finite number of repeated samples will shrink the set of possible biases, though the samples may increase the plausibility of some biases. Thus this type of information changes only the plausibility map but leaves the given set of distributions essentially unchanged (except for the elimination of some extremal distributions, namely those that assign probability 0 to the observed sample). The second type of information, on the other hand, shrinks the set of measures, while keeping their relative plausibility ranking. For instance, learning that the coin has a bias towards Tails (e.g. by weighing the coin, or receiving a communication to this effect from the coin's manufacturer) eliminates all distributions that assign a higher probability to Heads. It is important to notice, however, that even with higher-order information, it is hardly ever the case that the distribution under consideration is fully specified. In our coin example, a known bias towards Tails will still leave an infinite set of possible biases consistent. Even a good measurement by weighing will leave open a whole interval of possible biases.
In this sense, a combination of observations and higher-order information will not in general allow the agent to come to know the correct distribution, in the standard ('infallible') sense in which the term knowledge is used in doxastic and epistemic logics. Instead, it may eventually allow her to come to believe the true probability (at least, with a high degree of accuracy). This belief may even stabilize, to such a degree that it approaches the 'softer', defeasible notion of 'knowledge', which is the main focus in Epistemology (Lehrer 1990; Stalnaker 1996; Rott 2004) and (inductive) Learning Theory (Gold 1967; Baltag et al. 2019a). This convergence in belief and the resulting acquisition of statistical knowledge is what we aim to capture in this paper.
Our mechanism for belief revision with sampling evidence is non-Bayesian (and also different from AGM belief revision), though it incorporates a "plausibilistic" version of Bayes' Rule. Instead of updating her prior belief according to this rule (and disregarding all other possible distributions), the agent keeps all possibilities in store and instead revises her plausibility ranking, using a non-probabilistic analogue of Bayes' Rule. After that, her new belief will be formed in a similar way to her initial belief: by maximizing her (new) plausibility. The outcome is different from simply performing a Bayesian update on the 'prior': qualitative jumps are possible, leading to abandoning "wrong" conjectures in a non-monotonic way. This results in a faster convergence-in-belief to the true probability, under less restrictive conditions than the usual Savage-style convergence through repeated Bayesian updating (Edwards et al. 1963; Savage 1954). 1

The second type of evidence (higher-order information about the distribution) induces a more familiar kind of update: the distributions that do not satisfy the new information (typically given in the form of linear inequalities) are simply eliminated, then beliefs are formed as before by focusing on the most plausible remaining distributions. This form of revision is known as AGM conditioning in Belief Revision Theory (Alchourrón et al. 1985), and as update or "public announcement" in Logic (Baltag and Renne 2016; van Ditmarsch et al. 2007), and satisfies all the standard AGM axioms. 2

The fact that in our setting there are two types of updates should not be so surprising. It is related to the fact that our static framework consists of two different semantic ingredients, capturing two types of information: the plausibility map (encoding the agent's beliefs and conditional beliefs, defeasible forms of knowledge, etc.), and the set of possible distributions (encoding the agent's infallible knowledge, her 'hard information' about the correct distribution). Correspondingly, the first type of update directly affects the agent's beliefs (by changing the plausibility in view of the sampling results), and only indirectly her knowledge (since e.g. she knows her new beliefs). Dually, the second type of update directly affects the agent's knowledge (by reducing the set of possibilities), and only indirectly her beliefs (by restricting the plausibility map to the new set).
By allowing two forms of learning, one having a Bayesian-statistical flavor and the other having a logical-AGM flavor (Alchourrón et al. 1985; Darwiche and Pearl 1997), our framework combines logical and statistical reasoning in a unified setting. In this sense, it fits within the recent trend towards a unification of logic and probability, see e.g. Leitgeb (2017). In particular, the fact that conditioning on sampling evidence is non-AGM is essential for the successful learning of the true probability from repeated sampling: since every sample is logically consistent with every non-extremal distribution, an AGM learner (obeying the principle of Rational Monotonicity 3 ) would typically never change her initial beliefs about the true distribution after any number of samples! The same applies to all the generalizations of AGM conditioning that retain Rational Monotonicity, e.g. the ones proposed by Darwiche and Pearl (1997), or by Konieczny and Perez (2000).

1 In contrast to Savage's theorem, our update ensures convergence even in the case that the initial set of possible distributions is infinite (indeed, even in the case we start with the uncountable set of all distributions). Moreover, in the finite case (where Savage's result does apply), our update is guaranteed to converge in finitely many steps, while Savage's theorem only ensures convergence in the limit.

2 We should note that there have been several proposals in the literature for AGM-compatible processes of iterated belief revision to remedy the inadequacy of AGM postulates to correctly capture the process of belief change from repeated observations, see for example Booth and Meyer (2006), Darwiche and Pearl (1997), Konieczny and Perez (2000) and Nayak (1994). In fact, the propositional conditionalization in our paper, like its older qualitative versions in Game Theory (Board 2004) and Dynamic Epistemic Logic (Baltag and Smets 2008c; Baltag and Renne 2016; van Ditmarsch et al. 2007), is an instance of the iterated revision operation of Darwiche and Pearl (1997): indeed, both the prior information state before revision and the revised information state are plausibility models (i.e. what Darwiche and Pearl (1997) call "epistemic states"), rather than theories or belief bases (i.e. propositions or sets of propositions). Still, we follow the usage in the epistemological (Kelly 2014; Kelly et al. 1995, 1998) and dynamic-epistemic logic literature (Baltag and Renne 2016; van Ditmarsch et al. 2007) in calling this operation AGM conditioning.
A preliminary version of this paper was presented at TARK 2019, and an abstract appeared in the online proceedings (Baltag et al. 2019). Our current article is the extended, journal version of that work, though with many major changes: improvements of the basic setting, the formalization and study of new epistemic notions (e.g. safe belief of a distribution, statistical knowledge, distance-from-the-truth), and a number of new convergence results. The plan of the paper is as follows. We start by reviewing in Sect. 2 some basic notions, results, and examples on probability distributions. In Sect. 3, we define our main setting (probabilistic plausibility models), consider a number of standard examples, define in this setting the notions of belief and (infallible) knowledge, and study their logical properties. In Sect. 4, we move to conditional beliefs, defining our two forms of conditionalization, and use them to explore belief dynamics (as captured by our two types of model updates). In Sect. 5, we look at notions of doxastic stability, defining a weaker form of stability ("safe belief"), followed by a stronger form ("statistical knowledge"), and investigating their properties and their connection to a notion of verisimilitude (or "distance from the truth"). In Sect. 6, we present and prove our main results on doxastic convergence to the true probability. Finally, in Sect. 7 we briefly sketch the contours of a dynamic doxastic logic for statistical learning, and in Sect. 8 we end with some concluding remarks and a brief comparison with other approaches to the same problem.

Preliminaries and notation
Throughout this paper, we fix a finite set $O = \{o_1, \ldots, o_n\}$ of possible observations, or '(elementary) outcomes'. 4 Let

$$M_O := \{\mu \in [0,1]^O \mid \textstyle\sum_{o \in O} \mu(o) = 1\}$$

be the set of probability mass functions on $O$, which we identify with the corresponding probability functions on $\mathcal{P}(O)$. The sets of distributions $P \in \mathcal{P}(M_O)$ will be called propositions. Let $\Omega := O^{\infty}$ be the set of infinite sequences from $O$, which we shall refer to as observation streams. Each such stream $\omega = (\omega_1, \ldots, \omega_n, \ldots)$ represents a possible history of future sampling from an unknown distribution. For any $\omega \in \Omega$ and $i \in \mathbb{N} \setminus \{0\}$, we write $\omega_i$ for the $i$-th component of $\omega$, and $\omega_{\le i}$ for its initial segment of length $i$, i.e. the sequence $\omega_{\le i} := (\omega_1, \ldots, \omega_i)$ consisting of the first $i$ components of $\omega$. Similarly, we put $\omega_{>i} := (\omega_{i+1}, \ldots, \omega_n, \ldots)$ for the infinite "tail" of $\omega$ that follows the $i$-th observation. In particular, $\omega_{\le 0} := \lambda = ()$ is the empty sequence, and $\omega_{>0} = \omega$. We denote by $O^* := \bigcup_{i \ge 0} O^i$ the set of all finite sequences of observations.

For each $o \in O$ and $j \ge 1$ we define the sets $o^j$ to be the basic cylinders

$$o^j := \{\omega \in \Omega \mid \omega_j = o\}.$$

These cylinders correspond to individual observations of evidence sampled from the unknown distributions. Let $\mathcal{A} \subseteq \mathcal{P}(\Omega)$ be the σ-algebra of subsets of $\Omega$ generated by the cylinders (the algebra obtained by closing the family of basic cylinders under complementation and countable unions). Every probability distribution $\mu \in M_O$ induces a unique multinomial probability distribution over $(\Omega, \mathcal{A})$, also denoted by $\mu$, and obtained by first setting

$$\mu(o^j) := \mu(o),$$

then extending this to all of $\mathcal{A}$ using independence, additivity and continuity. Let $E \subseteq \mathcal{A}$ be the family of sets obtained by closing the family of basic cylinders only under complementation and finite unions. The sets $e \in E$ are called observable events (or just 'events', for short). 5 It is easy to see that every event $e \in E$ can be written as a finite disjoint union of finite intersections of basic cylinders. In particular, for each finite sequence of observations $\omega_{\le i} = (\omega_1, \ldots, \omega_i) \in O^*$, we denote by $[\omega_{\le i}] = [\omega_1, \ldots, \omega_i]$ the corresponding event of observing this sequence by sampling, i.e. the event given by

$$[\omega_1, \ldots, \omega_i] := \bigcap_{j=1}^{i} \omega_j^{\,j} = \{\omega' \in \Omega \mid \omega'_{\le i} = \omega_{\le i}\}.$$

3 [...] beliefs (Board 2004; Baltag and Smets 2008b), and it is equivalent to a combination of AGM axioms of Inclusion/Subexpansion and Vacuity/Superexpansion.

4 Intuitively, these are the possible outcomes of sampling or of some other possible type of experimentation.

Example 1 Let $O = \{H, T\}$ be the possible outcomes of a coin toss. Then $\Omega$ will be the set of streams of Heads and Tails representing infinite tosses of the coin, e.g. HTTHHH.... And $H^j$ (resp. $T^j$) will be the set of streams of observations in which the $j$-th toss of the coin lands Heads up (resp. Tails up). The set $M_O$ will be the set of possible biases of the coin.
Example 2 Let $O = \{R, B, G\}$ be the possible outcomes for a draw from an urn filled with marbles, coloured Red (R), Blue (B) and Green (G). Then $M_O$ will be the set of all possible distributions of coloured marbles in the urn, $\Omega$ will be the set of infinite streams of R, B and G (representing infinite draws from the urn), and $R^j$ (resp. $B^j$ or $G^j$) will be the set of streams of draws in which the $j$-th draw is a Red (resp. Blue or Green) marble.
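As a small illustration of how a distribution on outcomes determines the probability of a finite sampling sequence, the following Python sketch computes the probability of the event of observing a given sequence under independent, identically distributed draws. The function name `sequence_probability` and the two coin biases are hypothetical choices made only for this example.

```python
from math import prod

def sequence_probability(mu, sample):
    """Probability of observing `sample` under i.i.d. draws from mu.

    mu     -- dict mapping each outcome to its probability
    sample -- finite sequence of outcomes (may be empty)
    """
    # By independence, the probability of the finite intersection of
    # basic cylinders is the product of the single-draw probabilities.
    return prod(mu[o] for o in sample)

fair = {"H": 0.5, "T": 0.5}      # hypothetical fair coin
biased = {"H": 0.8, "T": 0.2}    # hypothetical Heads-biased coin
sample = ["H", "T", "T", "H", "H", "H"]

p_fair = sequence_probability(fair, sample)      # 0.5 ** 6
p_biased = sequence_probability(biased, sample)  # 0.8 ** 4 * 0.2 ** 2
```

Both values are strictly positive, in line with the observation that a finite sample is consistent with every non-extremal bias.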

Standard topology on $M_O$
We will make use of the following well-known facts:

Proposition 1 For any finite set $O$, the set $M_O$ of probability mass functions on $O$ is compact in the standard topology.

Proof Notice that the set $\{x \in [0,1]^n \mid \sum_{i=1}^{n} x_i = 1\}$ is compact in $\mathbb{R}^n$.

Proposition 2 Let $X, Y$ be compact topological spaces, $Z \subseteq X$ and $f : X \to Y$.
(1) Every closed subset of $X$ is compact.
(2) If $f$ is continuous, then $f(X)$ is compact.

[...]

Proof (Here $F_e : M_O \to [0,1]$ denotes the map $F_e(\mu) := \mu(e)$, whose continuity is being established for every event $e \in E$.) This can be verified using the above-mentioned fact that every event is a finite disjoint union of finite intersections of basic cylinders. The proof is by induction on the structure of this representation. The conclusion is immediate when $e = o^j$ is a basic cylinder: given any $\varepsilon > 0$, we can take $\delta := \varepsilon$, and then, for all $\mu, \nu \in M_O$ with $d(\mu, \nu) < \delta$ (where $d$ is the standard distance on $M_O \subseteq \mathbb{R}^n$), we have

$$|F_e(\mu) - F_e(\nu)| = |\mu(o) - \nu(o)| \le d(\mu, \nu) < \varepsilon.$$

This can be extended to finite intersections of basic cylinders, by noting that if $e = \bigcap_{k=1}^{m} \omega_k^{j_k}$ is such a finite intersection with all $j_k$ distinct, 6 then by independence we have $F_e = \prod_{k=1}^{m} F_{\omega_k^{j_k}}$, and then using the fact that a finite product of continuous functions is continuous. Finally, we can extend to disjoint unions of finite intersections of basic cylinders, noting that if $e = \bigcup_{i=1}^{m} e_i$ is a disjoint union of events (with $e_i \cap e_j = \emptyset$ for $i \neq j$), then by additivity we have $F_e = \sum_{i=1}^{m} F_{e_i}$, and then using the fact that a finite sum of continuous functions is continuous.

6 If $j_k = j_q$ and $\omega_k \neq \omega_q$ for some $k \neq q$, then the intersection is empty; while if $j_k = j_q$ and $\omega_k = \omega_q$ (for $k \neq q$), then we have $\omega_k^{j_k} = \omega_q^{j_q}$, so one of the two terms is redundant and can be eliminated from the representation.

Proposition 3 Every continuous function $f : X \to \mathbb{R}$ on a compact topological space $X$ is bounded, and it attains its supremum (i.e., it has a maximum value).
Before presenting our framework, we need one more technical lemma that will prove useful in the proof of our convergence Theorem 1.
[...] and by Propositions 1 and 3, $f$ has a maximum value on $M_O$. But note that $f(z) = 0$ for any point $z \in M_O$ having some zero coordinate $z_i = 0$ (for some $i \le n$). Hence, $f$ reaches its maximum on $D = (0,1]^n \cap M_O = \{z \in (0,1]^n \mid \sum_i z_i = 1\}$. To prove the lemma, we will show that $\log(f(x))$ has $x = p$ as its unique maximizer on $D$. The conclusion will then follow from noticing that $f(x) \ge 0$ and from the monotonicity of the log function on $\mathbb{R}^+$. To maximise $\log(f(x))$ subject to the condition $\sum_i x_i = 1$, we use the method of Lagrange multipliers: let

$$G(x, \lambda) := \log(f(x)) - \lambda \Big(\sum_i x_i - 1\Big) = \sum_i p_i \log x_i - \lambda \Big(\sum_i x_i - 1\Big).$$

Setting the partial derivatives of $G$ equal to zero we get

$$\frac{\partial G}{\partial x_i} = \frac{p_i}{x_i} - \lambda = 0,$$

which gives $p_i = \lambda x_i$. Inserting this in the condition $\sum_i p_i = 1$ we get $\lambda \sum_i x_i = 1$, and using $\sum_i x_i = 1$ we get $\lambda = 1$ and thus $x_i = p_i$. Since $f$ has a maximum on this domain and the Lagrange multiplier method gives a necessary condition for the maximum, any point $x$ that maximises $f$ should satisfy the condition $x_i = p_i$, and thus $p$ is the unique maximiser for $f$.
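The stationarity condition $p_i = \lambda x_i$ in this argument suggests that the function being maximised has the form $f(x) = \prod_i x_i^{p_i}$; that form is an assumption here, since the statement of the lemma is not visible in the text. Under that assumption, the following Python sketch checks the conclusion numerically on a coarse grid of the interior of the simplex for three outcomes. The helper names and the particular vector `p` are illustrative only.

```python
from math import log

def log_f(p, x):
    """log of f(x) = prod_i x_i ** p_i (assumed form), for x with positive coordinates."""
    return sum(pi * log(xi) for pi, xi in zip(p, x))

def interior_grid(n_steps):
    """Grid points of the 3-outcome simplex with all coordinates strictly positive."""
    for i in range(1, n_steps):
        for j in range(1, n_steps - i):
            k = n_steps - i - j          # k >= 1 by the bounds on i and j
            yield (i / n_steps, j / n_steps, k / n_steps)

p = (0.5, 0.3, 0.2)   # hypothetical distribution, chosen to lie on the grid
best = max(interior_grid(20), key=lambda x: log_f(p, x))
```

On this grid the maximiser `best` coincides with `p`, which is what the Lagrange argument predicts; a grid search is of course only a sanity check, not a proof.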

Probabilistic plausibility models
In this section, we introduce and exemplify our basic framework for dealing with radical uncertainty.
Definition 1 (Plausibility measures) A plausibility 'measure' (on $K$) is a continuous function $\mathrm{pla} : K \to [0, \infty)$, whose domain is some closed set of distributions $K = \overline{K} \subseteq M_O$. Given a plausibility measure on $K$, we can extend it to a map 7 on propositions (sets of distributions) $P \subseteq M_O$, by putting

$$\mathrm{pla}(P) := \sup\{\mathrm{pla}(\mu) \mid \mu \in P \cap K\}.$$

Similarly, we can extend it to distribution-event pairs $(\mu, e) \in K \times E$, by putting:

$$\mathrm{pla}(\mu, e) := \mathrm{pla}(\mu) \cdot \mu(e),$$

and further extend this to proposition-event pairs $(P, e) \in \mathcal{P}(M_O) \times E$, by putting

$$\mathrm{pla}(P, e) := \sup\{\mathrm{pla}(\mu, e) \mid \mu \in P \cap K\} = \sup\{\mathrm{pla}(\mu) \cdot \mu(e) \mid \mu \in P \cap K\}.$$

These last two maps give us a way of assessing the joint plausibility of having true distribution $\mu$ (or true proposition $P$) and observing event $e$.
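To make the joint plausibility of a distribution (or proposition) and an event concrete, here is a minimal Python sketch for a coin. Everything specific in it is an assumption made for illustration: distributions are represented by the probability `h` of Heads, the closed set of distributions is replaced by a finite grid, and the prior plausibility is taken to be closeness to the fair coin.

```python
def pla(h):
    """Hypothetical prior plausibility of bias h: closeness to the fair coin."""
    return 1.0 - abs(h - 0.5)

def event_prob(h, heads, tails):
    """Probability, under bias h, of one fixed sequence with the given counts."""
    return h ** heads * (1.0 - h) ** tails

def joint_pla(h, heads, tails):
    """Joint plausibility of bias h and the event: pla(h) times the event's probability."""
    return pla(h) * event_prob(h, heads, tails)

def joint_pla_of_proposition(P, heads, tails):
    """Supremum of the joint plausibility over the proposition P (0 if P is empty)."""
    return max((joint_pla(h, heads, tails) for h in P), default=0.0)

K = [i / 100 for i in range(101)]           # finite stand-in for all biases
biased_to_heads = [h for h in K if h > 0.5]
biased_to_tails = [h for h in K if h < 0.5]

# After a sequence with 8 Heads and 2 Tails, compare the two propositions.
towards_heads = joint_pla_of_proposition(biased_to_heads, 8, 2)
towards_tails = joint_pla_of_proposition(biased_to_tails, 8, 2)
```

With these choices, `towards_heads` exceeds `towards_tails`, even though both propositions were equally plausible before the sample. Note that the supremum, not a sum, aggregates over the distributions in a proposition.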
It seems apt at this point to emphasize again that events $e \in E \subseteq \mathcal{A}$ in our setting are intended to capture observable events in multinomial experiments. The successive observations $\omega_i$ in a finite sampling sequence $\omega_{\le i} = (\omega_1, \ldots, \omega_i)$ are thus regarded as outcomes of $i$ independent and identically distributed trials, as in Examples 1 and 2. In the same manner, $\mu(e)$ encodes the probability assigned to $e$ by the unique multinomial probability distribution induced by each $\mu \in M_O$ on $(\Omega, \mathcal{A})$ (which by a slight abuse of notation we also denote by $\mu$).

[...] (2) $\mathrm{pla}(M) = 1$, or equivalently the maximum plausibility value on $\overline{M}$ is 1. 8

7 Using systematic ambiguity, we also denote this map by $\mathrm{pla}$.

8 The equivalence between these conditions is easily seen if we note that, by the continuity of plausibility measures, we have $\mathrm{pla}(M) = \sup\{\mathrm{pla}(\mu) \mid \mu \in M\} = \max\{\mathrm{pla}(\mu) \mid \mu \in \overline{M}\} = \mathrm{pla}(\overline{M})$. Note that $\mathrm{pla}(M) = 1$ implies that there exist possible distributions (in $M$) with plausibility arbitrarily close (or equal) to 1.
The plausibility map induces a total preorder 9 $\le_M$ on the possible distributions in $M$, called the plausibility ranking order, and given by putting for all $\mu, \nu \in M$:

$$\mu \le_M \nu \iff \mathrm{pla}(\mu) \le \mathrm{pla}(\nu).$$

For every real number $\delta \in [0, 1]$, we put $M_\delta := \{\mu \in M \mid \mathrm{pla}(\mu) \ge \delta\}$ for the set of all distributions in $M$ that have plausibility rank at least $\delta$. A (probabilistic) Grove sphere is a non-empty set of the form $M_\delta \neq \emptyset$. 10 It is easy to see that the family of all Grove spheres $S := \{M_\delta \mid \delta \in [0, 1], M_\delta \neq \emptyset\}$ is nested (i.e. totally ordered by inclusion: in fact, for $\delta \ge \varepsilon$ we have $M_\delta \subseteq M_\varepsilon$), and exhaustive (i.e. $M = \bigcup S$).
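The nesting and exhaustiveness of Grove spheres can be checked on a toy model. The Python sketch below is only an illustration under stated assumptions: the set of possible coin biases is a finite grid, and the tent-shaped plausibility map `pla` (maximal value 1 at the fair coin) is a hypothetical choice.

```python
def pla(h):
    """Hypothetical plausibility of bias h: 1 at the fair coin, 0 at the extremes."""
    return 1.0 - 2.0 * abs(h - 0.5)

def sphere(M, delta):
    """The set of distributions in M with plausibility rank at least delta."""
    return {h for h in M if pla(h) >= delta}

M = [i / 10 for i in range(11)]            # finite stand-in for the coin biases
deltas = [0.0, 0.25, 0.5, 0.75, 1.0]
spheres = [sphere(M, d) for d in deltas]   # all non-empty here

# Nested: a larger threshold gives a smaller (or equal) sphere.
nested = all(spheres[k + 1] <= spheres[k] for k in range(len(spheres) - 1))
# Exhaustive: the union of all spheres is the whole of M.
exhaustive = set().union(*spheres) == set(M)
```

In this finite model the smallest sphere is `{0.5}`, the set of most plausible distributions; in models without maximal elements no smallest sphere need exist.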
The plausibility map $\mathrm{pla}$ attains its supremum (1) [...]

The difference between plausibility measures and (the special case of) plausibility ranking maps is a plausibilistic analogue of the difference between measures in Measure Theory and (the special case of) probability functions. Although conditions (1) and (2) in the definition of plausibility maps may appear very restrictive at first sight, they do not in fact restrict the generality of our plausibility ranking order: the next example shows that any plausibility measure can be used to define plausibility ranking maps.

[...] for all $\mu \in M$. In this case, we say that the plausibility ranking map $\mathrm{pla}_M$ is generated by the plausibility measure $\mathrm{pla}$. Note that the plausibility ranking order $\le_M$ induced by $\mathrm{pla}_M$ on $M$ coincides with the order induced by the generating measure $\mathrm{pla}$, i.e. we have:

$$\mu \le_M \nu \iff \mathrm{pla}(\mu) \le \mathrm{pla}(\nu).$$

A plausibility-generating measure $\mathrm{pla}$ is said to be fully positive whenever its domain $\mathrm{dom}(\mathrm{pla}) = M_O$ is the full set of all distributions, and its codomain is $(0, \infty)$ (i.e. $\mathrm{pla}(\mu) > 0$ for all $\mu \in M_O$). This is a special case of great importance: fully positive measures generate plausibility models on every non-empty set of distributions $M \subseteq M_O$.
Interpretation In a plausibility model, the current set of possibilities M encodes an agent's current epistemic state, her "hard information" or higher-level knowledge about a given probabilistic distribution μ: all she knows for sure is that μ ∈ M. The agent may have come to this prior knowledge due to some previously received information (either in the form of observations obtained by sampling or in the form of higher-level information about the mechanism underlying the unknown distribution). On the other hand, pla represents the agent's "soft information", her current beliefs (and conditional beliefs etc.) about the unknown distribution, typically acquired by sampling. Unlike in probabilistic inference processes (Paris 1994) (but like in most concrete examples of such processes), this doesn't give only one (unconditional) belief, but a whole ranking of the distributions, in the form of a continuous function (which will give rise to a series of conditional beliefs): she considers the higher-ranked distributions to be more plausible than the lower-ranked ones. But, in contrast to knowledge, such soft information is not enough to exclude the less plausible distributions: the agent 'believes' that they are not the real distribution, but she doesn't know it for certain. The agent believes every proposition satisfied by all the "top" (most plausible) distributions: the ones having plausibility rank 1; or, if such top distributions don't exist, the agent will believe every proposition satisfied by all distributions that are "plausible enough": i.e. all those above some plausibility rank 1 − ε (for some ε > 0).

The above-defined extensions of the plausibility map have epistemic/doxastic significance: pla(μ, e) can be thought of as a way of assessing the joint plausibility of having true distribution μ and observing event e. Note the analogy with the formula for the joint probability of two events. 11 Similarly, pla(P) gives us a way to assess the plausibility of a 'proposition': essentially, a set of distributions P ⊆ M is only as plausible as the most plausible element of P (if such an element exists); or more generally P is at least as plausible as all its elements, but no more plausible than this requires (i.e. pla(P) is the supremum of all plausibility ranks in P). Note now the analogy with, but also the difference from, probability: the role usually played by addition is played here by the supremum. With this notation, condition (2) on plausibility models (M, pla) can be restated simply as pla(M) = 1. Finally, pla(P, e) combines the formulas for pla(μ, e) and for pla(P) in the natural way, giving the joint plausibility of having the true distribution in P and observing event e. In particular, pla(e) := pla(M, e) is a natural definition for the plausibility of the event e.

Differences between plausibility and probability Note the key differences between plausibility models and probabilistic models. First, unlike in the probabilistic case, maximal plausibility pla(μ) = 1 does not mean certainty or full belief, but only consistency with all the agent's beliefs: the distributions μ with pla(μ) = 1 are "doxastically possible", i.e. they satisfy every proposition believed by the agent. Second, the plausibility map does not obey Kolmogorov's additivity axiom: the plausibility pla(P) of a set is not the sum of plausibility ranks of its elements, but rather their supremum. This, together with the above normalization requirement (2), suggests that the plausibilistic analogue of addition of probabilities is the operation of maximization (or more generally, taking the supremum).

Models for experimental-based information Closed models characterize the situations in which all prior knowledge about the distribution is based only on experimental evidence about the mechanism underlying this distribution: e.g. measurements of the side weights or asymmetries of a coin or die; opening each of a number of urns (from which an unknown one will be chosen for later sampling) and counting (or approximately estimating) the marbles of a given color in the urn, etc. In such contexts, it is indeed natural to assume that M is closed: if a distribution is a limit of possible distributions in M, then it is indistinguishable from M by any such experimental means, and hence it cannot be excluded from M.
In the case that the experimental evidence is based only on measurements, it is natural to assume more, namely that M is both closed and convex: measurements typically produce interval estimates [a, b] for the probability μ(o) of each outcome. Indeed, such interval models are the ones most used when dealing with imprecise probabilities. More generally, the information obtained in this way may come in the form of linear constraints of the form $\sum_{i=1}^{n} a_i \mu(o_i) \ge c$ (with $a_1, \ldots, a_n, c \in \mathbb{Q}$). Any finite set of such constraints gives a closed and convex set M of possible distributions.
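The claim that finitely many linear constraints cut out a convex set can be illustrated with a short Python sketch. The urn constraints below (an interval estimate for Red, and Blue being at least as likely as Green) are hypothetical, as are the helper names; the small tolerance only absorbs floating-point rounding.

```python
def satisfies(mu, constraints, tol=1e-12):
    """Check every constraint sum_i a_i * mu_i >= c, given as pairs (a, c)."""
    return all(
        sum(ai * mi for ai, mi in zip(a, mu)) >= c - tol
        for a, c in constraints
    )

def mix(mu, nu, t):
    """Convex combination t * mu + (1 - t) * nu of two distributions."""
    return tuple(t * m + (1 - t) * n for m, n in zip(mu, nu))

# Outcomes ordered as (R, B, G). Hypothetical constraints:
#   P(R) >= 0.2,  P(R) <= 0.5 (written as -P(R) >= -0.5),  P(B) - P(G) >= 0
constraints = [((1, 0, 0), 0.2), ((-1, 0, 0), -0.5), ((0, 1, -1), 0.0)]

mu = (0.2, 0.5, 0.3)
nu = (0.5, 0.25, 0.25)
outside = (0.6, 0.3, 0.1)      # violates P(R) <= 0.5

mixtures_ok = all(
    satisfies(mix(mu, nu, t / 10), constraints) for t in range(11)
)
```

An interval estimate [a, b] for a single outcome corresponds to two such inequalities, as in the first two constraints above.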
One might wonder why we permit distributions in $\overline{M} \setminus M$ to have positive plausibility ranks, or even why we take the whole closure $\overline{M}$ (instead of M) as the domain of the plausibility map. Given that the agent knows for sure that the true distribution lies within M, the distributions in $\overline{M} \setminus M$ are incompatible with the agent's hard information, so they are known to be 'impossible' in view of this information. It would seem natural to require that pla ≡ 0 on $\overline{M} \setminus M$, or else just restrict the domain of pla to M. This can indeed be done if M is closed. But in general, the technical condition of continuity poses constraints on the plausibility ranks of distributions in the closure $\overline{M}$, which may force some μ ∈ $\overline{M} \setminus M$ to have non-zero plausibility ranks. Even from a purely conceptual perspective, distributions in $\overline{M} \setminus M$ are in a sense "almost possible", since they are not distinguishable from the ones in M by any experimental means. Their epistemic impossibility is only due to higher-order, non-experimental information, and so it makes sense to take them into account. Moreover, such ideal limit-distributions may have a high inherent plausibility (despite being ruled out by the current information). In some cases, they may be inherently more plausible than the possible distributions. In such cases, these distributions would be in principle believed on purely a priori grounds, though they are disbelieved (in fact known to be impossible) when the higher-level information is taken into account. 12

The above intuitions about knowledge and belief can be made formal as follows:

Definition 3 [Knowledge and belief] We say that a proposition [...]

An equivalent definition can be given in terms of Grove spheres: B(P) holds in M iff P includes some Grove sphere; i.e. iff there exists δ ≤ 1 such that $\emptyset \neq M_\delta \subseteq P$; or, yet another equivalence: there exists [...]

Connections to belief revision theory Grove sphere models (in non-probabilistic form, consisting of possible worlds instead of distributions) form the standard semantic framework in Belief Revision Theory (Grove 1988). Plausibility models (again, in their non-probabilistic version) are well-known equivalent relational descriptions of sphere models, which are preferred in Dynamic Epistemic Logic (Baltag and Smets 2008a, b; Baltag et al. 2019a; van Benthem 2007, 2011), as well as in the "dynamic interactive epistemology" approach developed by game-theorists (Board 2004). These are in fact adaptations to doxastic modeling of the older setting of Lewis spheres, with its equivalent description in terms of a comparative similarity relation (Lewis 2000). In these models, the elements of M are taken to be possible worlds, or possible 'states' of the world, and the structure is purely qualitative, given either in terms of a nested, exhaustive family of spheres, or in terms of a total preorder on worlds. Sometimes an additional converse well-foundedness condition, or a weaker 'Limit Condition', is imposed to ensure the existence of maximal elements $\mathrm{Max}(M) \neq \emptyset$ (or equivalently, the existence of a smallest sphere). As seen below, this simplifies the definition of (conditional) belief, as the doxastic analogue of Lewis conditionals. But as noted by Lewis (2000), such additional assumptions are not really needed, since a satisfactory notion of conditional (or conditional belief) can still be defined in non-converse-wellfounded models. Hence, we make no such additional assumptions here.
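The sphere-based characterization of belief can be tried out on a finite toy model. In the Python sketch below, the grid of coin biases and the plausibility map are hypothetical, and the reading of knowledge as truth in all possible distributions (`knows`) is an assumption taken from the informal description of 'hard information', not a formula stated here. For a finite model it suffices to test the thresholds δ that are actually attained as plausibility values.

```python
def pla(h):
    """Hypothetical plausibility of bias h: closeness to the fair coin (max 1)."""
    return 1.0 - abs(h - 0.5)

def believes(M, P):
    """B(P): some non-empty sphere {h in M : pla(h) >= delta} is included in P."""
    for delta in sorted({pla(h) for h in M}):
        sphere = {h for h in M if pla(h) >= delta}
        if sphere and sphere <= P:
            return True
    return False

def knows(M, P):
    """Assumed reading of infallible knowledge: P holds at every possible distribution."""
    return set(M) <= P

M = [i / 20 for i in range(21)]                     # finite stand-in for the biases
roughly_fair = {h for h in M if abs(h - 0.5) <= 0.1}
heads_biased = {h for h in M if h > 0.5}
```

Here the agent believes `roughly_fair` without knowing it, and does not believe `heads_biased`, since every sphere contains the fair coin.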
Our models are just a special case of plausibility models, adapted to a probabilistic setting: the possible worlds come as probability distributions, while the plausibility preorder and the Grove spheres are quantitatively defined from a plausibility ranking map. But the mechanism for forming beliefs B(P) and conditional beliefs B(P|Q) in our probabilistic plausibility models will be exactly the same as in the general (non-converse-wellfounded, non-probabilistic) plausibility models.

Connections to inference processes. Our probabilistic plausibility models can also be seen as a generalization and refinement of Paris' inference processes (Paris 1994; Paris and Rad 2008; Paris and Vencovska 1997).

12 Take a coin, for which there is no reason to suspect an in-built bias. Initially, before receiving any other information, the set of possible distributions was [0, 1] (if we represent each distribution by the probability it assigns to Heads), and the most plausible distribution was the fair one $\mu_{eq}$ (assigning probability 0.5 to Heads). But in the meantime, one piece of new higher-order information was received, namely that the coin is not perfectly fair (due to some small manufacturing accidents). Now, $\mu_{eq}$ is excluded as impossible, so the set of possibilities is M = [0, 1]\{0.5}, but nevertheless, there is still no reason to suspect any systematic bias. So, the distributions that are closer to $\mu_{eq}$ have higher plausibility, and their plausibility decreases as we move further away from it. The only way to extend this plausibility in a continuous way to the closure $\overline{M}$ = [0, 1] is to continue to assign maximal plausibility to $\mu_{eq}$. This merely technical constraint also makes conceptual sense, if we think counterfactually: if the received information happened to be wrong, then we'd revert to considering $\mu_{eq}$ as the most plausible distribution. A priori, this impossible distribution is still inherently the more plausible.
Roughly speaking, an inference process is a map Bel assigning to each set M ⊆ $M_O$ of distributions some "believed" distribution Bel(M) ∈ M. The definition in Paris (1994) actually restricts the domain of Bel to a subclass of $\mathcal{P}(M_O)$ (namely the sets definable by a set of linear inequalities), 13 but our more general setting extends this to all sets of distributions. A good look at Paris' examples of interesting inference processes shows that all of them define the salient distribution Bel(M) by maximizing (or minimizing) over M a certain continuous quantity (entropy, distance from centre of mass, distance from barycentre, etc.). Our approach makes this method of generating inference processes explicit, in the form of the plausibility map, and recognizes it as just a special case of the standard method of belief formation in Logic and Belief Revision Theory. Generalizing to arbitrary sets of distributions also forces us to give up the insistence on only one most preferred distribution, 14 or even a set of most preferred distributions. Following Lewis' approach (Lewis 2000) (as later adapted to non-converse-wellfounded plausibility models), one can still define beliefs as we did above, in terms of propositions that hold on all distributions that are plausible enough. Indeed, this seems the most natural generalization of maximization-based inference processes to arbitrary sets.
In closed models (and more generally in models in which the plausibility map attains its maximum value 1) the definition of belief can be simplified, yielding the maximization-based notion of belief that is standard in both inference processes and Belief Revision Theory (in terms of maximizing plausibility rank). In such cases, belief amounts to truth in all the 'most plausible' distributions (the ones with plausibility rank 1):

Proof For the first part, let M ⊆ $M_O$ be closed. Since pla is a continuous function, we can use Propositions 1, 2(1) and 3 to conclude that pla attains its supremum on M, hence $M_1$ = Max(M) ≠ ∅.

13 In fact, there are other differences: Paris' approach is syntactic, so the linear inequalities involve probabilities of sentences in a given language.

14 The existence of maximizers in Paris (1994) is ensured by the fact that the sets defined by linear inequalities are closed, while the quantity to be maximized is continuous. The uniqueness of the maximizer is ensured there by the fact that these sets are convex, while the relevant quantity is concave (or convex, in the case of minimization).

15 Recall that

Proposition 4 If
For the second part, assume only that Max(M) ≠ ∅. To prove the left-to-right direction in the displayed equivalence, suppose that B(P) holds; then by definition, there exists δ ≤ 1 such that ∅ ≠ $M_\delta$ ⊆ P. But δ ≤ 1 implies $M_1$ ⊆ $M_\delta$, hence by transitivity of inclusion we conclude that Max(M) = $M_1$ ⊆ P, as desired.
For the converse, suppose that we have

Some canonical plausibility maps and plausibility-generating functions
Here are some specific examples:

1. Entropy-based plausibility maps: The most direct implementation of the Principle of Indifference is to take as our generating plausibility measure the Shannon entropy Ent : $M_O$ → [0, ∞), given by putting $\mathrm{Ent}(\mu) := -\sum_{o \in O} \mu(o) \log \mu(o)$ (with the usual convention 0 log 0 = 0). It is convenient to assume that the logarithms are taken in base n (where recall that n = |O| is the number of outcomes) […] So the most plausible distribution will be the one with highest Shannon entropy, i.e. the most uninformative one. 16 More generally, less informative distributions will be more plausible than more informative ones. Note also that, when using logarithms in base n = |O|, we have $\mathrm{Ent}(M_O) = \mathrm{Ent}(\mu_{eq}) = \sum_{1 \le i \le n} -\frac{1}{n} \log_n \frac{1}{n} = 1$ (where $\mu_{eq}$ is the distribution that gives equal probability $\frac{1}{n}$ to every outcome), hence $\mathrm{Ent}^{M_O} = \mathrm{Ent}$. One of the "defects" of entropy Ent as a plausibility-generating measure is that it may take value zero, so it is not fully positive. This means there exist non-empty sets of distributions M, for which (M, Ent) is technically speaking not a plausibility model (since Ent(μ) = 0 for some μ ∈ M): indeed, the set $M_O$ of all distributions is such a counterexample! But recall that only the plausibility (pre-)order $\le_M$ is of relevance when forming beliefs. So we can take instead any positive continuous function that induces the same order. One simple way to do this is to add to entropy some fixed positive number, say 1. In this way we obtain a fully positive version of the entropy measure $\mathrm{Ent}^+$ : $M_O$ → (0, ∞), given by putting $\mathrm{Ent}^+(\mu) := 1 + \mathrm{Ent}(\mu)$. Using $\mathrm{Ent}^+$ as our plausibility-generating measure, we generate a plausibility model $(M, \mathrm{Ent}^{+M})$ on every non-empty set M ⊆ $M_O$, whose plausibility map is once again obtained by renormalizing $\mathrm{Ent}^+$ to M. Moreover, $\mathrm{Ent}^{+M}$ agrees with $\mathrm{Ent}^M$ on the ranking order between any two distributions, so it induces the same plausibility ranking order as the one given by entropy.
As a consequence, for every plausibility model $(M, \mathrm{Ent}^M)$, all beliefs and conditional beliefs (as well as knowledge) are the same as in the model $(M, \mathrm{Ent}^{+M})$. Philosophically speaking, taking either Ent or $\mathrm{Ent}^+$ as one's plausibility measure amounts intuitively to the adoption of the Principle of Indifference at the level of the possible outcomes.

2. Cautious plausibility: The most 'cautious' choice of plausibility is assigning equal plausibility to all possible distributions, e.g. taking C(μ) := 1 for all μ ∈ $M_O$. Obviously, this is a fully positive plausibility measure, so it induces a plausibility model on every non-empty set M ⊆ $M_O$ (with the generated plausibility map given by the restriction of C to M). Cautious plausibility can be thought of as yet another application of the Principle of Indifference at a higher level (that of all possible distributions): since a priori there is no reason to prefer a distribution to another, the prior plausibility assigns equal rank to all of them. With this cautious choice, the prior beliefs do not go beyond what is known: the agent only believes what she knows. (But as we'll see, this is no longer the case after more information is received, e.g. via sampling evidence from the unknown distribution.)

[…] But once again, only the induced ranking order is of relevance when forming beliefs, so we can apply any continuous transformation from […], where here we assumed that the logarithm is taken in binary base. Assume now that M ⊆ $M_O$ is a non-empty set with the property that for every outcome o ∈ O, we have either μ(o) = 0 for all μ ∈ M or else μ(o) > 0 for all μ ∈ M. Then the measure $CM_\infty$ generates a probabilistic plausibility map on M, obtainable once again by renormalization to M. If we instead apply first a slightly different transformation ($x \mapsto 1 + 2^x$), we can go further and convert $CM_\infty$ into a fully positive plausibility measure $CM^+_\infty$.
This helps avoid any restrictions on M: as long as M ≠ ∅, $CM^+_\infty$ generates a probabilistic plausibility map $CM^{+M}_\infty$ on M, that induces the same preorder $\le_M$ as the original function $\sum_{o \in O(M)} \log(\mu(o))$. Hence, $(M, CM^{+M}_\infty)$ is a plausibility model for every non-empty M, and its ranking order, beliefs, conditional beliefs etc. agree with those of $(M, CM^M_\infty)$, whenever the second is a plausibility model. Taking $CM^M_\infty$ or $CM^{+M}_\infty$ as one's plausibility ranking amounts intuitively to the adoption of a Principle of 'Averageness' or Typicality. Indeed, the probability distributions in M that have a higher $CM^{+M}_\infty$-plausibility will be those that are "more typical", more 'normal' or representative for M; while the most plausible ones are the "most typical". Another typicality-based plausibility map is related to the barycentre inference process (Paris 1994): this involves minimizing the function […] If it exists and is unique, the minimizer of this function over M is called the barycentre of the set M, and it gives another notion of averageness or representativeness. It chooses the distribution μ that minimizes the worst error that could be made (when one wrongly takes μ to be the true probability). To convert this into a maximization problem, we can apply the transformation $2^{-x}$, obtaining the (fully positive) barycentric plausibility measure $B^M$ : M → (0, 1], for any non-empty set M ⊆ $M_O$ and arbitrary distribution μ ∈ M: […] Using again renormalization, this generates a probabilistic plausibility map on M, that will assign higher plausibility to distributions that are closer to M's barycentre.

4. Evidence-based plausibility: Given an observed event e ∈ E, we may prefer distributions that maximize the probability of e. This corresponds to taking as our plausibility measure the function $F_e$ from Lemma 1, given by $F_e(\mu) = \mu(e)$. This gives higher ranking to distributions that assign higher probability to the event e.
When renormalized to any non-empty set M ⊆ $M_O$ with the property that μ(e) > 0 for all μ ∈ M, it induces a plausibility model $(M, F^M_e)$, given by $F^M_e(\mu) = \frac{\mu(e)}{\sup\{\nu(e) \mid \nu \in M\}}$.

5. Centered plausibility: Given a salient distribution μ (that is considered as the most plausible), one may adopt a plausibility map given by a "normal" curve centered at μ. This means that distributions that are closer to μ are considered more plausible than the ones that are farther […]. One example of a fully positive plausibility measure that induces this ranking order is $C_\mu$ : $M_O$ → (0, 1], given by putting $C_\mu(\nu) := 2^{-d(\nu,\mu)}$.
6. Plausibility based on second-order probability: Let M ⊆ $M_O$ be a discrete 17 set of distributions, and let P : M → [0, 1] be any second-order probability mass function (cf. Gaifman and Snir 1982; Gaifman 2016), that is required to satisfy P(μ) > 0 for all μ ∈ M and $\sum_{\mu \in M} P(\mu) = 1$. Then this function can be extended to a continuous function P : $\overline{M}$ → [0, 1], by putting P(μ) := 0 for all limit points μ ∈ $\overline{M} \setminus M$. The fact that this extension is continuous follows from the assumption that $\sum_{\mu \in M} P(\mu) = 1$, which implies that $\lim_{n \to \infty} P(\mu_n) = 0$ for any infinite sequence of distinct points $\mu_n$ ∈ M. By taking this extended function P as our plausibility measure, we generate a plausibility model $(M, P^M)$, by renormalizing as above: $P^M(\mu) = \frac{P(\mu)}{\max_M(P)}$. (Note that, in order for $\sum_{\mu \in M} P(\mu)$ to have a finite value, P must attain a maximum value $\max_M(P) := \max\{P(\mu) \mid \mu \in M\}$ on M.) However, note that the beliefs based on the plausibility ranking $P^M$ will not necessarily match the Lockean beliefs based on the second-order probability P. Only the distributions μ ∈ Max(M), having $P^M(\mu) = 1$, or equivalently $P(\mu) = \max_M(P)$, are relevant for the agent's plausibilistic beliefs: she will believe that the true distribution is one of the ones in Max(M). This will hold even in the case that $\sum_{\mu \in Max(M)} P(\mu) < \frac{1}{2}$; while an agent using P as her second-order probability will have in this case precisely the opposite belief: she believes that the true distribution is in M\Max(M), since this is more likely to be the case. This points yet again to the fundamental difference between the interpretation of a function as a plausibility map versus its meaning as a probability function. Plausibility ranks do not obey the Kolmogorov additivity axiom, but instead higher plausibility ranks simply dominate lower ones.
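The gap between plausibilistic and Lockean belief described in item 6 can be made concrete with a small sketch. Everything specific here is an assumption made for illustration: the three-element set of coin distributions (each named by its probability of Heads), the mass function `P`, the Lockean threshold of 1/2, and the helper name `renormalize`.

```python
from fractions import Fraction

# Hypothetical discrete set of coin distributions, keyed by P(Heads),
# with an assumed second-order probability mass function P.
P = {
    Fraction(1, 2): Fraction(2, 5),   # the single most probable distribution
    Fraction(1, 4): Fraction(3, 10),
    Fraction(3, 4): Fraction(3, 10),
}

def renormalize(measure):
    """Divide by the maximum value, so the top-ranked distributions get rank 1."""
    top = max(measure.values())
    return {mu: value / top for mu, value in measure.items()}

pla = renormalize(P)
most_plausible = {mu for mu, rank in pla.items() if rank == 1}

# Plausibilistic belief: the true distribution lies in Max(M).
# Lockean belief (assumed threshold 1/2): believe a set iff its
# second-order probability exceeds 1/2.
prob_of_max = sum(P[mu] for mu in most_plausible)
lockean_believes_max = prob_of_max > Fraction(1, 2)
lockean_believes_complement = 1 - prob_of_max > Fraction(1, 2)
```

In this toy model the plausibilistic agent believes the coin is fair, while the Lockean agent believes it is not, because the two biased distributions jointly carry more second-order probability than the fair one.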
Example 1 (continued). In the Coin example, if we initially have no information about the coin, the set of possible coin biases will be the set $M_O$ of all probability mass functions on O = {H, T}. Suppose that we have background information that the extremal distributions ($\mu_0$ with $\mu_0(H) = 0$, and $\mu_1$ with $\mu_1(H) = 1$) are impossible. Then the set of possibilities is given by M := $M_O \setminus \{\mu_0, \mu_1\}$. We can choose the entropy Ent as our plausibility map, as this can be justified here in terms of symmetry: the faces of a coin (or a die) are symmetric, so there is no reason to prefer one outcome over another. Then $(M, \mathrm{Ent}^+)$ is a plausibility model, where the highest plausibility will be given to the distribution with the highest entropy: the fair-coin distribution $\mu_{eq}$, assigning $\mu_{eq}(H) = \mu_{eq}(T) = \frac{1}{2}$ (since for every ν ≠ $\mu_{eq}$ we have Ent(ν) < Ent($\mu_{eq}$)). So entropic plausibility starts with an initial belief in the fairness of the coin (and more generally it assigns a higher ranking to a distribution that corresponds to a more well-balanced coin). Note that entropic plausibility induces the same ranking order on this model as the centered plausibility $C_{\mu_{eq}}$ (centered at the fair-coin distribution $\mu_{eq}$).
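The claim that entropic and centered plausibility rank coin biases in the same way can be checked numerically. The sketch below is only an illustration: it identifies each coin distribution with its probability of Heads, assumes that the distance d in the centered plausibility is the absolute difference of Heads probabilities, and uses hypothetical helper names (`ent`, `centered`, `same_order_on_grid`).

```python
import math

def ent(p):
    """Shannon entropy (base 2 = base |O|) of the coin with P(Heads) = p."""
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)

def centered(p, center=0.5):
    """Centered plausibility 2^(-d), with d assumed to be |p - center|."""
    return 2 ** (-abs(p - center))

def same_order_on_grid(steps=100):
    """Check that a bias strictly closer to 1/2 scores strictly higher on both measures."""
    interior = range(1, steps)  # extremal biases 0 and 1 are excluded
    for i in interior:
        for j in interior:
            # Integer comparison of distances from 1/2 avoids floating-point ties.
            if abs(2 * i - steps) < abs(2 * j - steps):
                p, q = i / steps, j / steps
                if not (ent(p) > ent(q) and centered(p) > centered(q)):
                    return False
    return True

grid = [i / 100 for i in range(1, 100)]
most_plausible_bias = max(grid, key=ent)
```

Renormalizing either score by its supremum would not change these comparisons, which is why the sketch works with the unnormalized generating measures.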
If, however, we cannot exclude any distribution (not even the extremal ones), then the set of possibilities is the whole $M_O$, and Ent will no longer give us a plausibility model. Still, we can choose instead the positive version of entropic plausibility $\mathrm{Ent}^+$, which makes $(M_O, \mathrm{Ent}^+)$ into a plausibility model, while maintaining the same initial belief in the coin's fairness (and the same preference for more well-balanced coins). Note again that $\mathrm{Ent}^+$ still induces the same ranking order on this model as the centered plausibility $C_{\mu_{eq}}$.

Example 2 (continued). In the Urn example, we initially have no other information besides the three colors, so the set of possibilities is the set $M_O$ of all distributions over O = {R, B, G}. Since there is no reason to prefer any one distribution over any other (and no considerations of symmetry are relevant, since we cannot see inside the urn to somehow assess whether there is a rough balance between the quantities of marbles of different colors), the most natural prior ranking seems to be in this case the cautious plausibility C: each possible distribution is assigned an equal plausibility of 1. In the plausibility model $(M_O, C)$ […]

[…] n, then $\bigcap_{i=1}^{n} P_i \neq \emptyset$.

Proof Properties 1, 2, 3, 4 follow immediately from the definitions of knowledge and belief. Property 5 for knowledge follows directly from property 1. For property 5 for belief: B(P) gives the existence of some δ > 0 with ∅ ≠ $M_\delta$ ⊆ P, which together with P ⊆ Q gives us ∅ ≠ $M_\delta$ ⊆ Q, hence B(Q) holds. Property 6 for knowledge follows from property 1, via the sequence of implications: if $K(P_i)$ holds for all 1 ≤ i ≤ n, […] Property 6 for belief: suppose that $B(P_i)$ holds for all 1 ≤ i ≤ n; so, for every 1 ≤ i ≤ n, there exists some […] Property 7 follows immediately from properties 6 and 4.
Finally, one should note that belief in closed models (or more generally, any model having most plausible distributions) is better behaved, having stronger consistency and conjunctivity properties, than in arbitrary models: In particular, these properties hold in closed models.
Proof For the first item, suppose that $B(P_i)$ holds for all i ∈ I. Then by the second part of Proposition 4, we have Max(M) ⊆ $P_i$ for all i, and hence Max(M) ⊆ $\bigcap_{i \in I} P_i$, hence $B(\bigcap_{i \in I} P_i)$ (again by Proposition 4).
For the second item, we apply the first item to the family {P ⊆ $M_O$ | B(P) holds in M} to infer that we have $B(\bigcap\{P \subseteq M_O \mid B(P) \text{ holds in } M\})$, then apply Proposition 5.3 to obtain the desired conclusion.
The following example shows that the above properties do not necessarily hold in arbitrary probabilistic plausibility models!

Counterexample: Suppose that, in the Coin Example, our agent learns from the manufacturer only one piece of information: the coin is not completely fair, due to very small, accidental imperfections (rather than any intentional bias). What is a rational agent, who forms entropy-based beliefs, supposed to believe? Smaller imperfections seem to be more plausible than larger ones: hence, any bias closer to $\frac{1}{2}$ is more plausible than one that is farther. On the other hand, the agent knows for sure that the coin is not fair. Our agent has acquired omega-inconsistent beliefs, which nevertheless seem rational, given her information.
To formalize this counterexample, take O = {H, T} as in the Coin Example, and take the model $(M, \mathrm{Ent}^+)$ with M = $M_O \setminus \{\mu_{eq}\}$, where $\mu_{eq}(H) = \mu_{eq}(T) = \frac{1}{2}$ is the fair-coin distribution and $\mathrm{Ent}^+$ is the positive version of entropic plausibility. Recall that $\mathrm{Ent}^+$ yields the same ranking order on $M_O$ as the centered plausibility $C_{\mu_{eq}}$: distributions that are closer to $\mu_{eq}$ are more plausible than the ones that are farther. For each n ≥ 2, take $P_n := \{\mu \in M \mid \mu(H) \in (\frac{1}{2} - \frac{1}{n}, \frac{1}{2} + \frac{1}{n})\}$. Then $B(P_n)$ holds for all n ≥ 2 (since every distribution close enough to $\mu_{eq}$ is in $P_n$), but $\bigcap_{n \ge 2} P_n = \emptyset$ (since $\mu_{eq} \notin M$), hence beliefs are globally inconsistent; moreover, $B(\bigcap_{n \ge 2} P_n)$ does not hold (since B(∅) is false, by Proposition 5.3), hence beliefs are not necessarily closed under countable conjunctions.
This counterexample shows that plausibility-based beliefs in non-closed models may be subject to a kind of Infinite Lottery Paradox: though believing, for each n ≥ 2, that the coin's bias is in $(\frac{1}{2} - \frac{1}{n}, \frac{1}{2} + \frac{1}{n}) \setminus \{\frac{1}{2}\}$, our agent does not believe that the bias is in the (empty) intersection of all these sets. So beliefs in non-closed models may exhibit a type of 'omega-inconsistency': though each belief is consistent, and any finitely many beliefs are mutually consistent, the family of all beliefs may still be inconsistent, when taken as a whole! We think this is a small price to pay for being able to form beliefs when given arbitrary information M ⊆ $M_O$. Situations such as in the above counterexample can occur in practice, whenever partial information is obtained, say by communication. Still, readers who consider global doxastic consistency to be an inherent feature of rationality are welcome to restrict our framework to models in which the plausibility map attains a maximum value. Full infinitary conjunctivity and global consistency of beliefs can be regained in this way, without any other loss, except for generality.
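The counterexample can be probed numerically. The following sketch rests on assumptions made only for illustration: coin distributions are identified with their probability of Heads, the renormalized plausibility is taken to be (1 + H(p))/2 (the supremum 2 of the positive entropy is approached near 1/2 but not attained on the set without the fair coin), and the helper names `pla`, `in_P`, `sphere_witness` and `excluding_index` are hypothetical. It exhibits, for each n, a threshold below 1 whose sphere sits inside the n-th interval, and, for each bias other than 1/2, an index n whose interval excludes it.

```python
import math
from fractions import Fraction

HALF = Fraction(1, 2)

def pla(p):
    """Assumed renormalized positive entropy, (1 + H(p)) / 2, for a bias p other than 1/2."""
    q = float(p)
    h = -sum(x * math.log2(x) for x in (q, 1 - q) if x > 0)
    return (1 + h) / 2

def in_P(p, n):
    """Membership in P_n: the bias is not 1/2 and lies within 1/n of it (exact arithmetic)."""
    return p != HALF and abs(p - HALF) < Fraction(1, n)

def sphere_witness(n):
    """A threshold delta < 1 whose sphere {p : pla(p) >= delta} lies inside P_n."""
    return pla(HALF - Fraction(1, 2 * n))

def excluding_index(p):
    """Some n >= 2 with p outside P_n, for any bias p other than 1/2."""
    return math.floor(1 / abs(p - HALF)) + 1

# Finite grid standing in for M (the fair coin 500/1000 is left out).
grid = [Fraction(i, 1000) for i in range(1001) if i != 500]
```

Each individual belief is thus witnessed by a non-empty sphere, while no single bias survives all of the intervals, which is the omega-inconsistency discussed above.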

Conditioning and belief dynamics
One of the main motivations for developing the setting that we investigate here is to capture the process of learning a distribution as a form of iterated belief revision that results from receiving new information. But, as already explained, the two components of our probabilistic plausibility models M = (M, pla) capture two different types of information about the unknown distribution μ: the set M represents the agent's hard higher-level information about μ (her 'knowledge', given by the proposition M ⊆ $M_O$); while the plausibility map pla : $M_O$ → [0, 1] represents the agent's soft information about μ (typically obtained by sampling or other observational events), her "beliefs" given by the ranking order. Each of these two forms of information is subject to its own type of revision, captured by its own form of conditioning or update: (1) conditioning on a new proposition Q ⊆ $M_O$, resulting in an eliminative update with the hard information Q, by which some distributions are eliminated, while the plausibility ranking stays the same; (2) conditioning on a new observational event e ∈ E, resulting in an upgrade of the plausibility map, by which distributions assigning a higher probability to e get a boost in ranking, while the set M typically stays the same (except possibly for the elimination of those extreme distributions that assigned zero probability to e).

The first type of conditioning can be recognized as a plausibilistic analogue of the Kolmogorov definition of conditional probability, which fits well with propositional learning. Note that the P-conditional plausibility order $\le_P$ in the model $M_P$, given by […], is the same as the initial plausibility order ≤, except that it is restricted to M ∩ P (since the renormalizing denominator pla(M ∩ P) in the definition of $pla_P$ doesn't make a difference for the order).
Indeed, the propositional update (generated by receiving new "hard" higher-order information P) shrinks the space of possible distributions M by eliminating certain possibilities, while leaving the plausibility map "essentially the same" (modulo the renormalizing factor). This shows that our propositional update falls well within the scope of traditional Belief Revision Theory, representing a special case of AGM conditioning.
On the other hand, the second type of conditioning can be seen as a plausibilistic analogue of Bayes' conditioning formula (where in both cases, the operation sup of taking the supremum plays the role usually played by addition), and thus captures a notion of learning through sampling. The event conditioning rule weights the plausibility of each distribution with how well it predicts the observed sampling event e. Note that the e-conditional plausibility order $\le_e$ in the model $M_e$ is given by […] Indeed, the event update is generated by receiving "soft" information (obtained by sampling), and it naturally resembles soft doxastic 'upgrades' (rather than updates) from Dynamic Epistemic Logic (Baltag and Renne 2016; van Benthem 2011; Baltag and Smets 2008b): it leaves the set of possibilities M "essentially the same" (since it does not necessarily eliminate any distribution, except for the extremal ones, assigning probability 0 to e, if there are any in M), but rather only changes the plausibility over them. Distributions that better fit the sampling evidence are only 'promoted' in plausibility, while the others are demoted (but not eliminated, except for the extremal ones).
The next result confirms that our updates are well-defined operations on plausibility models:

[…] = 1 (again using the fact that sup{pla(ν) | ν ∈ M ∩ P} = pla(M ∩ P)).
Similarly, the definition of $M_e$ ensures that the function

$$pla_e(\mu) = pla(\mu \mid e) = \frac{pla(\mu) \cdot \mu(e)}{\sup\{pla(\nu) \cdot \nu(e) \mid \nu \in M_e\}}$$

takes only positive values on $M_e$, and that its supremum is 1 on $M_e$. To show that $pla_e$ is continuous, we put together the definition of conditional plausibility, the fact that $pla_e = \frac{pla \cdot F_e}{k}$ (where $F_e$ is the function introduced in Lemma 1 and $k = \sup\{pla(\nu) \cdot \nu(e) \mid \nu \in M_e\}$ is a non-zero constant), the continuity of pla (by definition) and of $F_e$ (by Lemma 1), and use the closure of continuous functions under products and division by non-zero constants.
This fact allows us to iterate and even interleave the two forms of updating. For simplicity, we only do it for events and propositions that fit the true distribution (since this automatically ensures their mutual compatibility):

Definition 5 (Iterated updating) Given a plausibility model M = (M, pla), and letting μ ∈ M be the 'true' distribution, we can define the iterated update $M_\sigma$, for every finite sequence $\sigma = (\sigma_1, \ldots, \sigma_n) \in (Prop \cup E)^*$ consisting of true propositions ($\sigma_i$ ∈ Prop with μ ∈ $\sigma_i$) or truly observable events ($\sigma_i$ ∈ E with $\mu(\sigma_i) \neq 0$). The definition is by recursion on the length of σ, by putting: […]

The next three results ensure that updating satisfies some standard rationality constraints: Proposition 8 guarantees that the result of repeated conditionalisation is independent of the order of application; Proposition 9 says that the result of conditioning is independent of whether it is done successively (conditioning on each independent observation, one after the other) or in one global step (conditioning on the whole sequence of independent observations, as one big single event); while Proposition 10 shows that, when conditioning with a sequence of observations, the result is independent of the temporal order of the observations. These last three facts are important as they ensure that the agent's posterior beliefs depend only on the evidence that is observed (and the prior plausibility model), not on the temporal or logical order in which this evidence is observed or processed.

[…] $\sup\{pla_{e \cap e'}(\nu) \mid \nu \in M_{e,e'}\} = pla_{e \cap e'}(\mu)$. The second claim of our Proposition follows by an easy induction from the first (given that, by the definition of μ, each event $\omega_j^j$ is independent of the event $\bigcap_{k=1}^{j-1} \omega_k^k$).
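The two conditioning rules and the order-independence claims can be tried out on a small finite model. The sketch below is an illustration under stated assumptions, not the paper's construction: the model is a hypothetical finite set of coin biases with made-up prior ranks, the supremum is a plain maximum because the set is finite, and the function names (`update_proposition`, `update_event`, `heads`, `tails`, `two_heads`) are invented for this example.

```python
from fractions import Fraction as F

def update_proposition(model, prop):
    """Eliminative update: keep the distributions in prop, then renormalize by the maximum."""
    kept = {mu: rank for mu, rank in model.items() if mu in prop}
    top = max(kept.values())
    return {mu: rank / top for mu, rank in kept.items()}

def update_event(model, likelihood):
    """Event update: weight each rank by mu(e), drop mu(e) = 0, renormalize by the maximum."""
    weighted = {mu: rank * likelihood(mu)
                for mu, rank in model.items() if likelihood(mu) > 0}
    top = max(weighted.values())
    return {mu: w / top for mu, w in weighted.items()}

# Hypothetical prior: coin biases (probability of Heads) with assumed ranks.
prior = {F(0): F(1, 2), F(1, 4): F(3, 4), F(1, 2): F(1),
         F(3, 4): F(3, 4), F(1): F(1, 2)}

def heads(mu):
    return mu            # probability of observing Heads once

def tails(mu):
    return 1 - mu        # probability of observing Tails once

def two_heads(mu):
    return mu * mu       # two independent Heads taken as one event

not_fair = {mu for mu in prior if mu != F(1, 2)}  # a hard proposition

step_by_step = update_event(update_event(prior, heads), heads)
in_one_step = update_event(prior, two_heads)
```

Exact rational arithmetic is used so that the equalities between differently ordered updates can be compared without floating-point tolerance.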

Proposition 8 The order of applying (iterated) conditionalization is irrelevant: if
While Proposition 8 states that the logical order of applying conditionalization (with both events and propositions) is irrelevant, the next result shows that the temporal order in which the outcomes are observed is also irrelevant: […]

Proof Using the notations $F_e$ from Lemma 1, and applying the multiplicative rule for independent events (as well as the associativity and commutativity of multiplication), we obtain: […]

[…] The new plausibility function is given by $pla_e(\mu) = \frac{pla(\mu, e)}{pla(M, e)}$, where pla(μ, e) = Ent(μ) · μ(e). Thus the most plausible probability function will no longer be $\mu_{eq}$, and ones with a bias towards Heads will become more plausible. Let $\mu_1$, $\mu_2$ and $\mu_3$ be such that $\mu_1$(Heads) = 0.75, $\mu_2$(Heads) = 0.8 and $\mu_3$(Heads) = 0.9; then it is easy to check that $pla_e(\mu_1) < pla_e(\mu_2) > pla_e(\mu_3)$. 19 So the maximizer has μ(Heads) ∈ (0.8, 0.9). This is natural: the initial belief in fairness is no longer realistic; the agent now believes there is a bias towards Heads.
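A quick numerical check of this kind of computation is sketched below. The observed event is not specified in the passage above, so the sketch assumes, purely for illustration, that e is "three Heads in a row"; the function names `ent` and `score` and the grid resolution are likewise hypothetical.

```python
import math

def ent(p):
    """Shannon entropy (base 2) of the coin distribution with P(Heads) = p."""
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)

def score(p, heads=3):
    """Unnormalized conditional plausibility Ent(mu) * mu(e), assuming e = `heads` Heads in a row."""
    return ent(p) * p ** heads

# Dividing by the supremum rescales every score by the same constant,
# so the unnormalized scores already determine the ranking order.
grid = [i / 1000 for i in range(1, 1000)]
best_bias = max(grid, key=score)
```

Under this assumed event, the scores at 0.75, 0.8 and 0.9 rise and then fall, and the grid maximizer lies strictly between 0.8 and 0.9.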
If, however, we cannot initially exclude the extremal distributions, then Ent is not a good plausibility map, and we have to once again take its positive version to form the initial plausibility model […] (1 + Ent(μ)) · μ(e). This changes the initial belief in fairness, and distributions with a higher bias towards Heads become more plausible (though the maximizer will be slightly different than in the previous situation). Also, note that the new plausibility map still inherits from the entropic plausibility the aversion towards extremal distributions: e.g. the distribution $\mu_1$ with $\mu_1$(Heads) = 1, though it can no longer be excluded (since $\mu_1 \in M_e$ now) and though it, in fact, matches exactly the observed frequency of Heads, will still not be believed (and in fact will never become the most plausible after any finite sequence of observations, no matter how many times the coin falls Heads up).

The agent starts sampling marbles, noting their colour, and replacing them in the urn. Let $e := [R, R, R] = R_1 \cap R_2 \cap R_3$ be the event that "the first three sampled marbles are all Red". After observing e, all distributions μ with μ(R) = 0 are eliminated, so that the new set of possibilities is $M_e = \{\mu \in M_O : \mu(R) \neq 0\}$, and the new plausibility map is given by $pla_e(\mu) = \mu(e) = \mu(R)^3$. The maximizer of this function is $\mu_R$, given by $\mu_R(R) = 1$ and $\mu_R(G) = \mu_R(B) = 0$. So the agent now believes that there are only Red marbles in the urn: this is natural since based on her current evidence there is no reason to assume there are any Green or Blue marbles inside. If, however, the next sampled marble comes up Green, then we have the event […] After observing this, all distributions with μ(G) = 0 are also eliminated, so the new set of possibilities is […] Note that the previously believed distribution $\mu_R$ has been eliminated now: not only is it no longer believed, it is known now to be impossible!
Furthermore, the new plausibility map is given by […] The unique maximizer of this function is the distribution $\mu_{3R1G}$, given by $\mu_{3R1G}(R) = \frac{3}{4}$, $\mu_{3R1G}(G) = \frac{1}{4}$ and $\mu_{3R1G}(B) = 0$. So the agent now believes that there are three times as many Red marbles as Green marbles (and no Blue marbles) in the urn. Again, this is natural, since three times as many Red marbles were observed as Green (and no Blue). One can in fact show that, when the prior is given by the cautious plausibility, the most plausible distribution after any sequence of observations will always be the one matching the observed frequencies.
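The frequency-matching behaviour of the cautious prior can be illustrated on a finite grid of urn distributions. This is a sketch under assumptions: the grid of distributions whose entries are multiples of 1/12 is a hypothetical stand-in for the full simplex, and the names `simplex_grid` and `most_plausible` are invented here. With the cautious prior, the conditional plausibility is proportional to the likelihood of the sample, so the search simply maximizes the product of μ(o) raised to the observed count of o.

```python
from fractions import Fraction
from itertools import product

def simplex_grid(n):
    """All distributions (mu(R), mu(G), mu(B)) whose entries are multiples of 1/n."""
    return [(Fraction(r, n), Fraction(g, n), Fraction(n - r - g, n))
            for r, g in product(range(n + 1), repeat=2) if r + g <= n]

def most_plausible(counts, n=12):
    """Grid maximizers of the likelihood prod mu(o)^count(o) for counts = (#R, #G, #B)."""
    def score(mu):
        result = Fraction(1)
        for prob, count in zip(mu, counts):
            result *= prob ** count  # Fraction(0) ** 0 evaluates to 1
        return result
    scores = {mu: score(mu) for mu in simplex_grid(n)}
    best = max(scores.values())
    # Distributions with score 0 are the eliminated ones; they can never be maximal.
    return [mu for mu, s in scores.items() if s == best]

after_three_red = most_plausible((3, 0, 0))
after_three_red_one_green = most_plausible((3, 1, 0))
```

For the sample Red, Red, Red the maximizer is the all-Red distribution, and for Red, Red, Red, Green it is the distribution giving 3/4 to Red and 1/4 to Green, i.e. the observed frequencies.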
The above notion of conditional plausibility gives us immediately a theory of belief revision, which can be formalized in terms of a notion of conditional belief. Note that this is conditionalisation on an observable event, corresponding to learning from observations (i.e. from sampling from the unknown distribution). On the other hand, the standard AGM setting in Belief Revision Theory and Logic (Alchourrón et al. 1985; Board 2004; Baltag and Smets 2008b; van Benthem 2011) involves revising with a proposition (i.e. set of distributions), rather than an event. This corresponds to learning high-level information about the unknown distribution, which allows us to further shrink the range of possibilities to some subset of the prior set of possible distributions. We thus obtain two forms of conditional beliefs: a Bayesian-type conditioning on events, encoding 'statistical' learning; and an AGM-type of conditioning on propositions, encoding 'logical' belief revision.

[…] $(\sigma_1, \ldots, \sigma_n) \in (Prop \cup E)^*$ of propositions and/or events, we say that P is believed conditional on σ in M, and write M ⊨ B(P|σ), iff P is believed in $M_\sigma$.
Conditional belief is consistent whenever the evidence is (i.e. if e ≠ ∅, then B(P|e) implies P ≠ ∅, and similarly for B(P|Q)). As we'll see, beliefs conditional on events allow us to inductively learn from repeated sampling, and to ultimately converge to the true distribution. As such, they behave in a way that is somewhat similar to the usual Bayesian conditioning, used in statistical learning. In contrast, beliefs conditional on propositions will behave as a 'logical' form of belief update, satisfying all the standard axioms of Conditional Doxastic Logic (Board 2004; Baltag and Smets 2008b) (which are in fact just an equivalent formulation of the so-called AGM postulates (Alchourrón et al. 1985) from Belief Revision Theory).
As for simple belief, the definition of belief conditional on events can be simplified in closed models. In this case, conditional belief B(P|e) amounts to truth in all the most e-plausible distributions: […]

Proof By Proposition 7, $pla_e$ is a plausibility function, hence it is continuous. Recall that M is closed and hence (by Propositions 1, 2(1) and 3) $pla_e$ has a maximum value on M. Let μ ∈ M be a distribution in which this maximum value is attained, i.e. we have $pla_e(\mu) \ge pla_e(\mu')$ for all μ' ∈ M (and thus also for all μ' ∈ $M_e$ ⊆ M). Since e is compatible with M, there exists some ν ∈ M s.t. ν(e) > 0, and hence $pla_e(\mu) \ge pla_e(\nu) = pla(\nu) \cdot \nu(e) > 0$. So we have $0 < pla_e(\mu) = pla(\mu) \cdot \mu(e)$, which implies that μ(e) ≠ 0, i.e. μ ∈ $M_e$. This, together with the fact that $pla_e(\mu) \ge pla_e(\mu')$ for all μ' ∈ $M_e$, gives us that μ ∈ $Max_e(M_e)$ ≠ ∅.

Safe belief, statistical knowledge, and verisimilitude
Until now, we only used the notion of knowledge K that is most common among logicians, economists and computer scientists: absolutely certain, infallible, irrevocable, and fully introspective knowledge. This matches what philosophers call "(hard) evidence" or "(hard) information". But the notion of knowledge favoured by epistemologists is softer: fallible, less-than-absolutely-certain, revisable, and possibly non-introspective (or at least not always negatively introspective). It is the kind of knowledge that we typically encounter in daily life or in empirical sciences, where absolute certainty may be hard to achieve. This is known sometimes as defeasible knowledge, and it is also related to the notion of inductive knowledge in Philosophy of Science. Here, we are interested in developing such a soft notion of knowledge that can apply to statistical learning: after repeatedly updating our beliefs by sampling from an unknown distribution, when do our beliefs become focused enough and stable enough to qualify as soft 'knowledge' of the true distribution (at least to some good enough approximation)?
Various formalizations have been proposed for this notion. Here, we will borrow ideas from the so-called Defeasibility Theory of Knowledge (Lehrer 1990): the main principle is that 'knowledge' is a form of robust belief, namely belief that is resilient under conditioning with truthful information. These ideas go back to Plato's Meno and were more recently championed in various forms by Klein, Lehrer, Pappas and Swain, Rott and others. Before going on to formalize and then criticize the defeasibility theory, Stalnaker (1996) summarizes it as follows: "An agent knows that φ if and only if φ is true, she believes that φ, and she continues to believe φ if any true information is received". Rott (2004) develops a version called stability theory, and states it as: "A belief K is a piece of knowledge of the subject S iff K is not given up by S on the basis of any true information that S might receive". Baltag and Smets (2008b) restated Stalnaker's formalization, under the name of safe belief, and developed it in the framework of dynamic epistemic logic. Here, we adapt this concept to our setting, and later strengthen it to a notion of statistical knowledge.

Definition 8 [Safe Belief]
Let $\mathbf{M} = (M, \mathrm{pla})$ be a plausibility model, in which we also specify the 'true' distribution $\mu$. We say that a proposition $P \subseteq M$ is safely believed (or is a "safe belief") at $\mu$ in $\mathbf{M}$, and write $\mu \models_{\mathbf{M}} Sb(P)$, if $P$ is believed in $\mathbf{M}$ conditional on every true proposition $Q$; i.e. $B(P \mid Q)$ holds for all $Q \in \mathrm{Prop}$ with $\mu \in Q$.

This is simply the same notion as the one defined by Baltag and Smets (2008b) in general plausibility models, but stated here in the special case of our probabilistic plausibility models. As such, it satisfies the following general characterization, given in Baltag and Smets (2008b):

Proposition 12
The following are equivalent:
- $P$ is safely believed at $\mu$ in $\mathbf{M}$;
- all distributions in $M$ that are at least as plausible as $\mu$ satisfy $P$; i.e., we have that
$$\{\nu \in M \mid \mathrm{pla}(\nu) \ge \mathrm{pla}(\mu)\} \subseteq P.$$

It is easy to see that, if $P$ is a safe belief, then $P$ is a true belief. As such, the notion of safe belief gives a good formal approximation of the defeasibility conception of knowledge.

Distance from the truth and verisimilitude
We can think of a plausibility model $\mathbf{M} = (M, \mathrm{pla})$ as an epistemic/doxastic approximation of some unknown probability distribution $\mu \in M$. The natural question that arises is: how 'truthlike' is our model $\mathbf{M}$, how good an approximation is it? To assess this, we connect with notions from Verisimilitude Theory, cf. Popper (1976), Tichy (1974), Miller (1974), Niiniluoto (1987), Kuipers (1987) and others. In particular, we adapt to our setting ideas coming from the metric approach to truthlikeness (Niiniluoto 1987). We are looking for a notion of distance of a model $\mathbf{M}$ from a distribution $\mu \in M$, which measures how far the agent's beliefs are from the truth. In the case of closed models, the beliefs are given by the set $\mathrm{Max}(M)$, so the natural notion of distance would be given in this case by the quantity
$$\sup\{d(\mu, \nu) \mid \nu \in \mathrm{Max}(M)\},$$
which measures the "worst possible error" one could make when taking as the true distribution any of the ones compatible with the agent's beliefs. However, when $M$ is not closed, we might have $\mathrm{Max}(M) = \emptyset$, which would render the above notion of distance-from-the-truth meaningless, or at least useless (in case we adopt the natural convention that $\sup \emptyset = \infty$). But one can weaken the above definition to include in the relevant set of possibilities (whose distances from the truth are assessed) all the "plausible enough" distributions, and in particular all the ones that are at least as plausible as the true distribution.

In this way, we arrive at the following definition of distance-from-the-truth:
$$d_\mu(\mathbf{M}) := \sup\{d(\mu, \nu) \mid \nu \in M,\ \mathrm{pla}(\nu) \ge \mathrm{pla}(\mu)\}.$$
This measures the worst possible error one could make when taking as the true distribution any of the ones that are currently thought to be at least as plausible as the "truly true" distribution $\mu$. It is easy to see that the distance-from-the-truth matches the radius of the smallest open ball around the true distribution that is safely believed:
$$d_\mu(\mathbf{M}) = \inf\{\varepsilon > 0 \mid \mu \models_{\mathbf{M}} Sb(B_\varepsilon(\mu))\}.$$
So $d_\mu(\mathbf{M}) < \varepsilon$ tells us that the agent has a safe belief of the approximate value of the true distribution within an $\varepsilon$-margin of error. It is also easy to see that we have and also that we have So 0-distance (according to either definition) indicates that the agent's (safe) beliefs fully match the true distribution.
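For a finite model the supremum above is a maximum and can be computed directly. The following is a minimal sketch under assumptions of our own: distributions are tuples of probabilities over a fixed ordered outcome set, the model is a dictionary from distributions to plausibilities, and the metric (`sup_distance`) is the largest absolute difference between outcome probabilities. The function names and the example numbers are hypothetical and chosen only for illustration.

```python
from typing import Dict, Tuple

Dist = Tuple[float, ...]  # probabilities over a fixed, ordered outcome set

def sup_distance(mu: Dist, nu: Dist) -> float:
    """Assumed metric: largest absolute difference between outcome probabilities."""
    return max(abs(p - q) for p, q in zip(mu, nu))

def distance_from_truth(model: Dict[Dist, float], mu: Dist) -> float:
    """Worst-case distance to mu among distributions at least as plausible as mu."""
    threshold = model[mu]
    return max(sup_distance(mu, nu)
               for nu, pla in model.items() if pla >= threshold)

# Hypothetical two-outcome model: distribution -> plausibility.
model = {(0.5, 0.5): 0.6, (0.75, 0.25): 1.0, (0.25, 0.75): 0.2}

print(distance_from_truth(model, (0.5, 0.5)))    # truth is not the most plausible
print(distance_from_truth(model, (0.75, 0.25)))  # truth is the unique maximum
```

If the true distribution is (0.5, 0.5), the more plausible rival (0.75, 0.25) counts towards the error while the less plausible (0.25, 0.75) is ignored; if the truth is the unique most plausible distribution, the distance is 0.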
When we have $d_\mu(\mathbf{M}) < d_\mu(\mathbf{M}')$ for the true distribution $\mu$, we say that $\mathbf{M}$ is more truthlike than $\mathbf{M}'$. This verisimilitude order suffices for our purposes. But we could also convert it into an actual measure of truthlikeness, by defining the verisimilitude $v_\mu(\mathbf{M})$ of a model $\mathbf{M}$ wrt a distribution $\mu$, say by putting $v_\mu(\mathbf{M}) := 2^{-d_\mu(\mathbf{M})}$. The maximum verisimilitude $v_\mu(\mathbf{M}) = 1$ is achieved when $d_\mu(\mathbf{M}) = 0$, i.e. when $\mu \models_{\mathbf{M}} Sb(\{\mu\})$.

Safe belief is not safe from conditioning on events
While of inherent interest, the notion of safe belief does not fully capture the intended meaning of defeasible knowledge in a probabilistic framework. Although safe belief is resilient under conditioning with any true 'proposition', in our setting propositions are not the only kind of new information; and indeed, safe beliefs are not necessarily stable under conditioning on events. Even if we restrict to truly observable events $e$ (whose true probability is non-zero, $\mu(e) \neq 0$), one can show that no non-trivial belief is stable under every such event! This means we have to moderate our safety requirements when dealing with events. Note that, for inductive learning, absolute safety (under all observable sampling events) is irrelevant: what is important is that our correct beliefs are resilient throughout the (actual) future sampling history. This resembles the notion of identification in the limit in Formal Learning Theory (Gold 1967), as well as the concepts of inductive knowledge developed in e.g. Kelly (2014) and Baltag et al. (2019b). In our setting, this gives rise to the concept of statistical knowledge:

Definition 9 [Statistical Knowledge]
Let $\mathbf{M} = (M, \mathrm{pla})$ be a plausibility model, let $\mu$ be some distribution (representing the 'true' probability), and let $\omega \in \Omega$ be an infinite observation stream (representing the 'true' future sampling history from the unknown distribution $\mu$). We say that a proposition $P \subseteq M$ is statistically known (or is "statistical knowledge") at $\mu$ wrt $\omega$ in $\mathbf{M}$, and write $\mu, \omega \models_{\mathbf{M}} Sk(P)$, if $P$ is believed in $\mathbf{M}$ conditional on every 'true' proposition $Q$ and every (event corresponding to an) initial segment of the 'true' sampling history $\omega$; i.e. if we have $B(P \mid Q, [\omega^{\le n}])$, for all $Q \in \mathrm{Prop}$ with $\mu \in Q$, and all $n \ge 0$.
It is obvious that, if P is statistically known, then it is safely believed. But statistical knowledge is much more resilient: it essentially captures a strong form of inductive knowledge. Using Proposition 12, we immediately obtain the following characterization:

Proposition 13
The following are equivalent:
- $P$ is statistically known at $\mu$ wrt $\omega$ in $\mathbf{M}$;
- after every initial segment $[\omega^{\le n}]$ of the true sampling history $\omega$, every distribution in $M$ that is at least as plausible as $\mu$ satisfies $P$; i.e. we have, for all $n \ge 0$:
$$\{\nu \in M \mid \mathrm{pla}_{[\omega^{\le n}]}(\nu) \ge \mathrm{pla}_{[\omega^{\le n}]}(\mu)\} \subseteq P.$$

In the next section, we show that this notion is actually realistically achievable, and in fact unavoidable: repeated sampling will almost surely eventually lead to statistical knowledge of the true distribution with any desired accuracy.

Tracking the truth
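Definition 10 For $\mu \in M$, we define the set $\Omega_\mu$ of $\mu$-normal observations as the set of infinite sequences from $O$ for which (1) the limiting frequencies of each $o_i$ correspond to $\mu(o_i)$ and (2) no outcome with probability 0 is ever observed:
$$\Omega_\mu := \Bigl\{\omega \in \Omega \;\Bigm|\; \lim_{m \to \infty} \frac{|\{j \le m : \omega_j = o_i\}|}{m} = \mu(o_i) \text{ for all } 1 \le i \le n, \text{ and } \mu(\omega_j) > 0 \text{ for all } j \ge 1\Bigr\}.$$

Proposition 14 For every probability function $\mu$, $\mu(\Omega_\mu) = 1$. Hence, if $\mu$ is the true probability distribution over $O$, then almost all observable infinite sequences from $O$ will be $\mu$-normal.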
Using the law of large numbers it is enough to show that μ(Δ) = 0. To see this let μ(o) = 0 then The result then follows from finiteness of O.
We are now in the position to look into the learnability of the correct probability distribution via plausibility-revision induced by repeated sampling. We first prove a preliminary result on convergence.
Lemma 2 Let M = (M, pla) be a plausibility model, and μ ∈ M. Then, when repeatedly sampling from an unknown distribution μ, we have that for every ε > 0, the plausibility of having a distribution ε-farther from μ will become in the limit vanishingly smaller than the plausibility pla(μ) of the true distribution μ.
More precisely: for every $\mu$-normal sequence $\omega \in \Omega_\mu$ and every positive real

Proof We first need to make some preliminary notations and observations. If $O = \{o_1, \ldots, o_n\}$ is the set of outcomes, and $\mu$ is the fixed distribution in the statement of our Lemma, then we put $p_i := \mu(o_i)$, for all $1 \le i \le n$. More generally, for all distributions $\nu \in M$, all $\mu$-normal sequences $\omega \in \Omega_\mu$, all sample sizes $m$ and all $1 \le i \le n$, we put:
$$\nu_i := \nu(o_i), \qquad m_{i,\omega} := |\{j \le m : \omega_j = o_i\}|, \qquad \alpha_{i,m,\omega} := \frac{m_{i,\omega}}{m}.$$
Since $\omega \in \Omega_\mu$, we have (by the definition of $\Omega_\mu$) that $\lim_{m \to \infty} \alpha_{i,m,\omega} = p_i$ for all $1 \le i \le n$; and also that $m_{i,\omega} = \alpha_{i,m,\omega} = 0$ holds whenever $p_i = \mu(o_i) = 0$ (because of the normality of the sequence $\omega$). Let us put $A := \{1 \le i \le n \mid p_i \neq 0\}$. Since $\lim_{m \to \infty} \alpha_{i,m,\omega} = p_i > 0$ for $i \in A$, there must exist some $N_{1,\omega}$ such that

(*) $0 < \frac{p_i}{2} \le \alpha_{i,m,\omega} \le 2 \cdot p_i$ for all $m \ge N_{1,\omega}$ and all $i \in A$.

Since $0 \le p_i, \nu_i \le 1$, this gives us that (where we used the fact that, for every $i \notin A$ we have $p_i = 0$, so by normality of the sequence we also have $m_{i,\omega} = \alpha_{i,m,\omega} = 0$, and thus $\nu_i^{m_{i,\omega}} = 1$, hence these factors can be skipped from the product). In particular, for $\nu := \mu$ (so $\nu_i = p_i$), we get that

(***) $\mathrm{pla}_{[\omega^{\le m}]}(\mu) = \mathrm{pla}(\mu) \cdot \mu([\omega^{\le m}]) = \mathrm{pla}(\mu) \cdot \prod_{i \in A} p_i^{\,m \cdot \alpha_{i,m,\omega}} > 0$

(since $p_i \neq 0$ for $i \in A$, and also $\mathrm{pla}(\mu) \neq 0$ because $\mu \in M$).

Using these abbreviations and facts, we can now prove our lemma. Fix $\omega \in \Omega_\mu$ and $\varepsilon > 0$. To prove the desired conclusion, let now $\nu \in M \setminus B_\varepsilon(\mu)$, and let $N$ be any arbitrarily chosen natural number. Using the above unfoldings (**) and (***) of the definitions of $\mathrm{pla}(M \setminus B_\varepsilon(\mu))$ and $\mathrm{pla}_{[\omega^{\le m}]}(\mu)$, we see that it is enough to show that, for any such arbitrarily chosen $N$, we have for all large enough $m$.
We prove this by cases. In the first case, assume that $\mathrm{pla}(\nu) = 0$; then the left-hand side of (1) is 0 and the inequality holds. In the second case, assume that $\mathrm{pla}(\nu) > 0$. Let $\Delta = \{\nu \in M \mid \nu_i = 0 \text{ for some } i \in A\}$, and similarly, for any $\delta > 0$, put $\Delta_\delta = \{\nu \in M \mid \nu_i < \delta \text{ for some } i \in A\}$, so that $\overline{\Delta_\delta} = \{\nu \in M \mid \nu_i \le \delta \text{ for some } i \in A\}$ is its closure. Choose some $\delta > 0$ small enough such that we have $\prod_{i \in A} \nu_i^{p_i/2} < \prod_{i \in A} p_i^{2 p_i}$ for all $\nu \in \overline{\Delta_\delta}$ (this is possible, since $\prod_{i \in A} \nu_i^{p_i/2} = 0 < \prod_{i \in A} p_i^{2 p_i}$ for all $\nu \in \Delta$, so the continuity of $\prod_{i \in A} \nu_i^{p_i/2}$ gives us the existence of $\delta$). Hence, we have (where we used again the fact that $p_i > 0$ for $i \in A$). The set $\overline{\Delta_\delta}$ is closed, hence the continuous function has a maximum value $Q$ on $\overline{\Delta_\delta}$. Note that $Q < 1$ (this follows from the inequality above), so there exists some $N_2 > N_{1,\omega}$ (where $N_{1,\omega}$ is the number satisfying the inequality (*) in the preliminary facts above) s.t. we have $Q^m < \frac{\mathrm{pla}(\mu)}{N}$ for all $m > N_2$. Recalling also that by definition $\mathrm{pla}(\nu) \le 1$, we obtain, for all $\nu \in \overline{\Delta_\delta}$: (where we used the above facts as well as the inequality (*)). So we proved that the inequality (1) holds for all $\nu \in \overline{\Delta_\delta}$.

It thus remains only to prove it for all $\nu \in M' := M \setminus (B_\varepsilon(\mu) \cup \Delta_\delta)$. For this, note that $M'$ is closed and that $\nu_i \neq 0$ over this set for all $i \in A$, while for all $i \notin A$ we have $\alpha_{i,m,\omega} = 0$. Hence, using the assumption that $\mathrm{pla}(\nu) \neq 0$, (1) is equivalent over this set with: Applying logarithm (and using its monotonicity, and its other properties), this in turn is equivalent to So we see that it is enough to show that, for all large $m$ and for $\nu \in M'$, we have Recall that $\alpha_{i,m,\omega} \ge \frac{p_i}{2}$ for all $m > N_2 > N_{1,\omega}$ and all $1 \le i \le n$.

Thus, to prove (4), it is enough to show that, for large $m$ and for all $\nu \in M'$, we have where we introduced the auxiliary continuous functions $f, g : M' \to \mathbb{R}$, defined by putting $f(\nu) = 2 \cdot (\log N + \log(\mathrm{pla}(\nu)) - \log(\mathrm{pla}(\mu)))$ and $g(\nu)$ To show (5), note first that (where at the end we used the fact, proved in Lemma 1, that the measure $\mu$, with values $\mu(o_i) = p_i$, is the unique maximizer of the function $\prod_{i=1}^{n} \nu_i^{p_i}$ on $M_O$). Since $g$ is continuous and $M'$ is closed, $g$ is bounded and attains its infimum $B = \min_{M'}(g)$ on $M'$. But since $g$ is non-zero on $M'$, this minimum cannot be zero: $B = \min_{M'}(g) \neq 0$. Similarly, since $f$ is continuous and $M'$ is closed, $f$ is bounded and attains its supremum $C = \max_{M'}(f) < \infty$ (which thus has to be finite). Take now some for all $\nu \in M'$, as desired.
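The mechanism behind Lemma 2 can be seen in an idealized special case. The following worked computation is only a sketch under two simplifying assumptions that the lemma itself does not make: the first $m$ observations have exactly the frequencies $\alpha_{i,m,\omega} = p_i$, and $\nu_i > 0$ for every $i$ with $p_i > 0$. The symbol $D(\mu \,\|\, \nu)$ is our shorthand for the relative entropy (with natural logarithm) and is not part of the original argument.

```latex
\begin{align*}
\frac{\mathrm{pla}_{[\omega^{\le m}]}(\nu)}{\mathrm{pla}_{[\omega^{\le m}]}(\mu)}
  &= \frac{\mathrm{pla}(\nu)\cdot\prod_{i:\,p_i>0}\nu_i^{\,m p_i}}
          {\mathrm{pla}(\mu)\cdot\prod_{i:\,p_i>0}p_i^{\,m p_i}} \\
  &= \frac{\mathrm{pla}(\nu)}{\mathrm{pla}(\mu)}
     \cdot\Bigl(\prod_{i:\,p_i>0}\bigl(\tfrac{\nu_i}{p_i}\bigr)^{p_i}\Bigr)^{m} \\
  &= \frac{\mathrm{pla}(\nu)}{\mathrm{pla}(\mu)}\cdot e^{-m\,D(\mu\,\|\,\nu)},
  \qquad D(\mu\,\|\,\nu) := \sum_{i:\,p_i>0} p_i \log\frac{p_i}{\nu_i}.
\end{align*}
```

Since $\mu$ is the unique maximizer of $\prod_i \nu_i^{p_i}$ (Lemma 1), we have $D(\mu \,\|\, \nu) > 0$ whenever $\nu \neq \mu$, so in this idealized case the ratio decays exponentially in $m$, however small the prior plausibility $\mathrm{pla}(\mu) > 0$ may be.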
We can now establish our first convergence result.

Theorem 2 [Convergence in plausibility]
Let $\mathbf{M} = (M, \mathrm{pla})$ be a plausibility model. If $\mu \in M$ is the 'true' distribution, then we have the following:
1. when repeatedly sampling from the unknown distribution $\mu$, we have that for every $\varepsilon > 0$, the plausibility $\mathrm{pla}(M \setminus B_\varepsilon(\mu))$ of having a distribution $\varepsilon$-farther from $\mu$ will almost surely converge to 0 (as sample size converges to infinity): in particular, in the same conditions of repeated sampling, the plausibility of every other distribution $\nu \in M \setminus \{\mu\}$ will almost surely converge to 0 (as sample size converges to infinity):
2. in contrast, in the same conditions, we have that for every $\varepsilon > 0$, the plausibility $\mathrm{pla}(B_\varepsilon(\mu))$ of having a distribution $\varepsilon$-close to $\mu$ will almost surely eventually settle on 1 (after finitely many rounds of sampling): as an obvious consequence, in the same conditions, we have that for every $\varepsilon > 0$, the plausibility $\mathrm{pla}(B_\varepsilon(\mu))$ of having a distribution $\varepsilon$-close to $\mu$ will almost surely converge to 1:

Proof It is obviously enough to show the following two claims, for all $\mu$-normal sequences $\omega \in \Omega_\mu$ and all $\varepsilon > 0$:
$$\lim_{n \to \infty} \mathrm{pla}_{[\omega^{\le n}]}(M \setminus B_\varepsilon(\mu)) = 0; \qquad \exists N\ \forall n \ge N:\ \mathrm{pla}_{[\omega^{\le n}]}(B_\varepsilon(\mu)) = 1.$$
To prove the first claim, we use the fact that every plausibility ranking function satisfies 0 ≤ pla ≤ 1 to derive then obtain the desired conclusion by taking limits and applying Lemma 2.
An obvious consequence is the following:

Corollary 2 [Approximate statistical learning] Let $\mathbf{M} = (M, \mathrm{pla})$ be a plausibility model. Then after repeated sampling from an unknown distribution, the agent will almost surely eventually acquire approximate statistical knowledge of the true distribution with any desired accuracy $\varepsilon > 0$.

More precisely: for every $\mu \in M$ and every $\varepsilon > 0$, we have

The proof is immediate, given Proposition 15.

All these convergence results are inexact: they concern only approximations of the true distribution. However, the fact that every non-zero degree of accuracy is eventually achieved (and maintained forever after) shows that the verisimilitude of our models keeps increasing, or equivalently the distance-from-the-truth keeps decreasing (approaching 0 in the limit). In this sense, we have convergence in the limit to the exact true distribution:

Corollary 3 [Convergence in verisimilitude] Let $\mathbf{M} = (M, \mathrm{pla})$ be a plausibility model. If $\mu \in M$ is the true distribution, then the distance-from-the-truth will almost surely converge to 0 after repeated sampling:

Proof It is clear that we have to show that But note that, by the definition of distance-from-the-truth, we have the following equivalence: The desired conclusion follows immediately, given Proposition 15.
A general feature of all the above forms of truth-tracking is that the convergence to the exact true distribution (rather than to an approximation) happens only in the limit (rather than being reached at some finite stage). However, one can do better than this when the agent's prior knowledge is consistent with only a discrete (or in particular, a finite) set of distributions:
- after finitely many rounds of sampling from the unknown distribution, the agent will almost surely eventually acquire exact statistical knowledge of the true distribution:
- finally, the distance-from-the-truth of the plausibility model will almost surely eventually settle to 0, after finitely many rounds of sampling:

Proof Apply each of the previous results to some $\varepsilon > 0$ small enough so that $B_\varepsilon(\mu) \cap M = \{\mu\}$.

It is important to note the differences between our convergence results and the Savage-style convergence results in the Bayesian literature (Edwards et al. 1963; Savage 1954; Doob 1971; Gaifman and Snir 1982; Earman 1992) that were mentioned in the Introduction. Savage's theorem assumes a certain restriction on the true hypothesis (namely, that its prior probability is non-zero), which makes it applicable only to a finite (or countable) set of hypotheses (since otherwise the prior probability cannot be assumed to be non-zero for every hypothesis). Our general results (concerning truth-tracking in the limit) do not need this assumption and indeed, they even apply to the whole (uncountable) set $M_O$ of all distributions.
On the other hand, in the case of a finite (or more generally, discrete) set of hypotheses/distributions, our plausibilistic learning is even better-behaved than the standard Bayesian learning: we obtain convergence in this case in finitely many steps (while Savage's still converges only in the limit). This faster convergence is explained by the qualitative nature of our belief-formation (as standard in logic, only the most plausible hypotheses matter for beliefs), as opposed to the quantitative-cumulative nature of probabilistic credences. The combination of this qualitative-logical way of forming beliefs with the statistical-Bayesian way of updating them (as encoded in our rule for conditioning on events) ensures that the true distribution will eventually reach the highest plausibility (among a finite set of distributions), thus giving us finite convergence to the exact truth.
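The contrast between the two modes of convergence can be made concrete on a toy example. The sketch below rests on several assumptions of our own: a hypothetical three-hypothesis model over three outcomes, a deterministic observation stream whose frequencies match the true distribution exactly (standing in for a typical sample), and a Bayesian agent whose prior is proportional to the plausibility weights. All names and numbers are illustrative only.

```python
from math import prod

# Hypothetical finite model: distribution over outcomes 0, 1, 2 -> prior plausibility.
HYPOTHESES = {
    (0.5, 0.25, 0.25): 0.1,   # the 'true' distribution, initially least plausible
    (0.25, 0.5, 0.25): 1.0,
    (0.25, 0.25, 0.5): 0.5,
}
TRUE = (0.5, 0.25, 0.25)
PATTERN = (0, 0, 1, 2)        # repeating stream whose frequencies match TRUE

def scores(m):
    """Prior weight times likelihood of the first m observations, per hypothesis."""
    sample = [PATTERN[j % len(PATTERN)] for j in range(m)]
    return {h: w * prod(h[o] for o in sample) for h, w in HYPOTHESES.items()}

def believed(m):
    """Qualitative belief: the set of most plausible hypotheses after m observations."""
    s = scores(m)
    top = max(s.values())
    return {h for h, v in s.items() if v == top}

def bayes_posterior(m):
    """Quantitative credence in TRUE, reading the same weights as an unnormalised prior."""
    s = scores(m)
    return s[TRUE] / sum(s.values())

def settling_time(horizon):
    """First step from which the belief set is exactly {TRUE}, up to the horizon."""
    bad = [m for m in range(horizon + 1) if believed(m) != {TRUE}]
    return max(bad, default=-1) + 1

print(settling_time(40), bayes_posterior(40))
```

With these numbers the belief set first equals {TRUE} after 10 observations, flips back to a rival for two steps, and stays at {TRUE} from step 13 onward, whereas the Bayesian credence in TRUE increases towards 1 without reaching it at any finite step. The temporary flip also illustrates why stability under the future sampling history, and not just current belief, is the relevant notion.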

Towards a logic of statistical learning
In this section we propose a logical setting that can capture the dynamics of statistical learning described in this paper. Our logical language is designed to accommodate both types of information, i.e. finite observations and higher-order information. As already mentioned, there is a fundamental distinction between these two types of information. The observations are interpreted in a σ-algebra $\mathcal{E} \subseteq \mathcal{P}(\Omega)$, and are not themselves formulas in our formal logical language, as they do not correspond to properties of probability distributions. The formulas will instead be statements about the probabilities of observations, given in terms of linear inequalities and logical combinations thereof, as well as the statements concerning the dynamics arising from finite observations. Given the set of outcomes $O = \{o_1, \ldots, o_n\}$, the set of formulas $\varphi$ of our language is inductively defined by
$$\varphi ::= \sum_{i=1}^{m} a_i P(\omega_i) \ge c \;\mid\; \neg\varphi \;\mid\; \varphi \wedge \varphi \;\mid\; K\varphi \;\mid\; B(\varphi \mid \omega^{\le n}) \;\mid\; Sb\,\varphi \;\mid\; [o]\varphi \;\mid\; [\varphi]\varphi$$
where $o, \omega_i \in O$, the $a_i$'s and $c$ are in $\mathbb{Q}$, and $\omega^{\le n} = (\omega_1, \ldots, \omega_n) \in O^n$ is a stream of observations of length $n$.
Let $\mathbf{M} = (M, \mathrm{pla})$ be a probabilistic plausibility model. The semantics is given by inductively defining a satisfaction relation $\mathbf{M}, \mu \models \varphi$ between distributions $\mu \in M$ and formulas $\varphi$. At each pair $(\mathbf{M}, \mu)$, the symbol $P$ will be interpreted as a probability mass function, namely $\mu$ itself. In this definition, we use the notation $\|\varphi\|_{\mathbf{M}} := \{\mu \in M \mid \mathbf{M}, \mu \models \varphi\}$, and skip the subscript $\mathbf{M}$ when the model is understood:

The atomic formulas $\sum_{i=1}^{m} a_i P(\omega_i) \ge c$ describe linear inequalities satisfied by the true probability, using numerical constants ranging over rationals. The propositional connectives $\neg$, $\wedge$ are standard. Letters $K$ and $B$ stand for knowledge and (conditional) belief operators, and $Sb$ stands for safe belief. The dynamic modalities $[o]\psi$ (standing for "after observing $o$, $\psi$ holds") and $[\varphi]\psi$ (standing for "after learning $\varphi$, $\psi$ holds") capture the updates induced by the two forms of learning.
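To make the static part of this semantics concrete, here is a minimal evaluator sketch for a finite model. It covers only atomic inequalities, negation, conjunction, $K$ and $Sb$; conditional belief and the dynamic modalities are omitted. The encoding of formulas as nested tuples, the function name `holds`, and the example model are our own assumptions; the clause for $K$ (truth at every distribution in the model) is likewise assumed, while the clause for $Sb$ follows the characterization in Proposition 12. Rationals are represented exactly with Python's `Fraction`.

```python
from fractions import Fraction as F

def holds(formula, mu, model):
    """Evaluate a formula at distribution mu in a finite model {distribution: plausibility}."""
    op = formula[0]
    if op == "atom":                 # ("atom", [(a_i, outcome_index_i), ...], c)
        _, terms, c = formula
        return sum(a * mu[o] for a, o in terms) >= c
    if op == "not":
        return not holds(formula[1], mu, model)
    if op == "and":
        return holds(formula[1], mu, model) and holds(formula[2], mu, model)
    if op == "K":                    # assumed clause: truth at every distribution in the model
        return all(holds(formula[1], nu, model) for nu in model)
    if op == "Sb":                   # truth at every distribution at least as plausible as mu
        return all(holds(formula[1], nu, model)
                   for nu in model if model[nu] >= model[mu])
    raise ValueError(f"unknown operator: {op}")

# Hypothetical model over two outcomes (index 0 and index 1).
fair = (F(1, 2), F(1, 2))
biased = (F(3, 4), F(1, 4))
model = {fair: F(1, 2), biased: F(1)}

at_least_half = ("atom", [(F(1), 0)], F(1, 2))            # P(o_0) >= 1/2
gap = ("atom", [(F(1), 0), (F(-1), 1)], F(1, 4))          # P(o_0) - P(o_1) >= 1/4

print(holds(("K", at_least_half), fair, model))   # known: true at both distributions
print(holds(("Sb", gap), biased, model))          # safely believed at the most plausible one
print(holds(("Sb", gap), fair, model))            # fails at fair, which is itself a counterexample
```

In this example the inequality `gap` is safely believed at `biased` but not at `fair`, since `fair` is among the distributions at least as plausible as itself and violates `gap`.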
The reason we did not include simple belief $B\varphi$ or propositionally-conditional beliefs $B(\varphi \mid \theta)$ is that these operators are definable as abbreviations in the above syntax. For plain belief, it should be obvious that it can be obtained as a special case of conditioning on a sampling sequence $\omega^{\le 0}$ of length 0, i.e. we can put $B\varphi := B(\varphi \mid \lambda)$, where $\lambda = () = \omega^{\le 0}$ is the empty sequence of observations. Less trivially, conditional beliefs of the form $B(\varphi \mid \theta)$ can be defined in terms of knowledge and safe belief, by putting:
$$B(\varphi \mid \theta) := \hat{K}\theta \rightarrow \hat{K}\bigl(\theta \wedge Sb(\theta \rightarrow \varphi)\bigr),$$
where $\hat{K}\psi := \neg K \neg\psi$ is the Diamond-dual modality for $K$ (denoting "epistemic possibility"). With these abbreviations, one can easily check that the resulting notion satisfies the expected semantic clause for conditional belief:

We say that a formula $\varphi$ is valid in model $\mathbf{M}$, and write $\mathbf{M} \models \varphi$, if and only if $\mathbf{M}, \mu \models \varphi$ for all $\mu \in M$. As usual, $\varphi$ is simply valid if it is valid in every model $\mathbf{M}$.

Proposition 17
Let $o \in O$ and formulas $\varphi, \psi, \theta, \xi$. Then the following formulas are valid:

future sampling history. We could directly introduce statistical knowledge $Sk\varphi$, but it seems more natural to add instead temporal operators $\Box\varphi$ ("from now and forever in the future, $\varphi$ holds") and its dual $\Diamond\varphi$ ("$\varphi$ holds now or at some future moment"), with the obvious semantics: , $\mu, \omega^{>n} \models \varphi$ for some $n \ge 0$. (In fact, $\Diamond\varphi$ is redundant: it is the Diamond-dual of $\Box$, so it can be taken to be just an abbreviation for $\neg\Box\neg\varphi$.) For non-epistemic formulas $P$, we can identify statistical knowledge $Sk(P)$ with the formula $\Box Sb(P)$. As a result, our result in Corollary 2, on eventual convergence (in finitely many steps) to approximate statistical knowledge of the true distribution, is captured in this logic by the validity for every $\varepsilon > 0$.

Conclusion and comparison with other work
We studied forming beliefs about unknown probabilities in situations that are commonly described as those of radical uncertainty. The most widespread approach to modelling such situations is in terms of imprecise probabilities, i.e. representing the agent's knowledge as a set of probability measures. There is an extensive literature on the study of imprecise probabilities (Bradley and Drechsler 2014; Chandler 2014; Hajek and Smithson 2012; Levi 1985; Walley 2000; Denoeux 2000; Romeijn and Roy 2014), and on approaches that either base decision making directly on them (Bradley and Steele 2014; Huntley et al. 2014; Troffaes 2007; Elkin and Wheeler 2018; Mayo-Wilson and Wheeler 2016; Seidenfeld 2004; Seidenfeld et al. 2010; Williams and Robert 2014) or collapse the state of radical uncertainty by settling on some specific probability assignment as the most rational among all those consistent with the agent's information. The latter gives rise to the area of investigation known as the Objective Bayesian account (Paris and Rad 2010; Paris and Vencovska 1997; Paris 2014; Rad 2017; Williamson 2008, 2010). A similar line of inquiry has been extensively pursued in the Economics literature, as well as in Decision Theory, where the situation we are investigating in this paper is referred to as Knightian uncertainty or 'ambiguity'. This is the case when the decision-maker has too little information to arrive at a unique prior. There have been different approaches in this literature to model these scenarios. These include, among others, the use of Choquet integration, by for instance Huber and Strassen (1973) or Schmeidler (1986, 1989), the maxmin expected utility of Gilboa and Schmeidler (1989), the smooth ambiguity model of Klibanoff et al. (2005), which employs second-order probabilities, and Al-Najjar's work (Al-Najjar 2009), which models rational agents who use frequentist models for interpreting the evidence and investigates learning in the long run. Cerreia-Vioglio et al. (2013) study this problem in a formal setting similar to the one used here, axiomatize different decision rules such as the maxmin model of Gilboa and Schmeidler and the smooth ambiguity model of Klibanoff et al., and give an overview of some of the different approaches in that literature.
These approaches employ different mechanisms for ranking probability distributions compared to what we propose in this paper. Among these, it is particularly worth pointing out the difference between our setting and those that rank probability distributions by their (second-order) probabilities. In our setting, it is only the worlds with the highest plausibility that play a role in specifying the set of beliefs. In particular, unlike probabilities, plausibilities are not cumulative: distributions with low plausibility do not add up to form more plausible events, as those with low probability would. This is a fundamental difference between our account and the account given in terms of second-order probabilities.
Another approach to deal with these scenarios in the Bayesian literature is based on the series of convergence results that are collectively referred to as "washing out of the prior". The idea, which traces back to Savage, see Edwards et al. (1963) and Savage (1954), is that as long as one repeatedly updates a prior probability for an event through conditionalisation on new evidence, then in the limit one would surely converge to the true probability, independent of the initial choice of the prior. 25 Bayesians use these results to argue that an agent's choice of a probability distribution in scenarios such as our urn example is unimportant as long as she repeatedly updates that choice (via conditionalisation) by acquiring further evidence, for example by repeated sampling from the urn. However, it is clear that the efficiency of the agent's choice for the probability distribution, put in the context of a decision problem, depends strongly on how closely the chosen distribution tracks the actual one. This choice is most relevant when the agents are facing a one-off decision problem, where their approximation of the true probability distribution at a given point ultimately determines their actions at that point.
Our approach, based on forming rational qualitative beliefs about probability (based on the agent's assessment of each distribution's plausibility), does not seem prone to these objections. The agent does "the best she can" at each moment, given her evidence, her higher-order information, and her background assumptions (captured by her plausibility map). Thus, she can solve one-off decision problems to the best of her ability. And, by updating her plausibility with new evidence, her beliefs are still guaranteed to converge to the true distribution (if given enough evidence) in essentially all conditions (including in the cases that evade Savage-type theorems).

25 To be more precise, if one starts with a prior probability for an event A, and keeps updating this probability by conditionalising on new evidence, then almost surely, the conditional probability of A converges to the indicator function of A (i.e. to 1 if A is true, and to 0 otherwise). This form is called Lévy's 0-1 law. Savage's results use IID trials and objective probabilities and have been criticised regarding their applicability to scientific inference. There are, however, a number of more powerful convergence results avoiding these assumptions, for example, based on Doob's martingale convergence theorem (Doob 1971). There are also several generalisations of these results, e.g. Gaifman and Snir (1982).
As we already mentioned, our approach is based on a probabilistic adaptation of the standard qualitative theory of plausibility models (Board 2004; Baltag and Smets 2008b) that underlies modern presentations of standard Belief Revision Theory (Alchourrón et al. 1985; Grove 1988) within Dynamic Epistemic Logic (Baltag and Moss 2004; Baltag et al. 1998; van Ditmarsch et al. 2007; Baltag and Smets 2008b; Baltag and Renne 2016; van Benthem 2011). As such, it has some connections with Wolfgang Spohn's quantitative theory of plausibility ranking (Spohn 2016), but it differs from it in essential ways: like Spohn's ranking theory, we use maximization to form beliefs (where standard probabilistic theory uses addition); but, when updating plausibility with independent sampling evidence, we follow the probabilistic usage of taking products (in Bayes' rule), while Spohn's ranking theory uses addition for this purpose. On the other hand, our framework does satisfy the conditions of Halpern's abstract theory of algebraic conditional plausibility spaces (Halpern 2003), which is meant as a generalization of a large number of theories of uncertainty (Bayesian probabilities, Dempster-Shafer belief functions, possibility measures, relative likelihoods, AGM conditioning, Popper measures, Spohn's ranking theory). The theory postulates the existence of two operations: one, the analogue of probabilistic addition, is used for computing the plausibility of a proposition P and deciding whether it is to be believed or not; while the other, the analogue of probabilistic multiplication, is used for updating plausibilities (via an abstract analogue of Bayes' rule) and for computing the plausibility of joint independent observations. To work well, the two operations need to satisfy certain conditions, tying them together.
Our particular combination of maximization and multiplication, though as far as we know never before encountered in the literature, satisfies Halpern's conditions, and so it is in a sense a "natural" theory. But beyond that, we think that this particular combination is the key to fast learning from sampling, as well as to reconciling probability with logic: on the one hand, multiplication is needed for the update, to deal rationally with successive independent observations (cf. Proposition 9, which would fail without the use of multiplication in our plausibilistic analogue of Bayes' rule); and on the other hand, the use of maximization in the formation of beliefs allows convergence in finitely many steps (in contrast to mere convergence in the limit via probabilistic updating à la Savage), and at the same time makes beliefs about probability fit the general patterns and conditions of Doxastic Logic and Belief Revision Theory. Indeed, it does seem that the particular combination provided by our probabilistic plausibility theory succeeds in adopting the best features of both worlds (doxastic logic with its belief revision, and statistical reasoning with its Bayesian updates), while at the same time fitting within the general conditions of a natural theory of uncertainty (as formalized by Halpern's abstract requirements).
Our approach connects well with mainstream epistemology and formal learning theory, by making essential use of the formal concept of "safe belief", studied in Baltag and Smets (2008b) as an approximation of the philosophical notion of defeasible knowledge (Lehrer 1990; Rott 2004), and related also to the issue of stability or 'resilience' of probabilistic belief (Skyrms 2011), an issue underlying recent attempts at unifying logical and probabilistic reasoning, cf. the so-called stability theory of belief (Leitgeb 2017). Our concept of statistical knowledge improves on the notion of safe belief by adding a form of stability under future sampling that connects well with the learning-theoretic concept of identifiability in the limit (Gold 1967), as well as with various formal notions of inductive knowledge, introduced in Baltag et al. (2019a, b) and Kelly (2014) as epistemic correlatives of empirical induction. As already mentioned, the correlative notion of distance-from-the-truth fits well with the main tenets of Verisimilitude Theory, originating in the work of Popper (1976) and his critics (Tichy 1974; Miller 1974), and developed to maturity in the work of Niiniluoto (1987), Niiniluoto (1998), Kuipers (1987) and others. In particular, our setting fits within the metric approach to truthlikeness (Niiniluoto 1987), resulting in the verisimilitude version of our convergence results: tracking the truth is then naturally understood as progressive increase in our models' truthlikeness (or equivalently, progressive decrease of the models' distance-from-the-truth).
Our paper ends by sketching the contours of a dynamic doxastic logic for statistical learning, that validates a number of standard axioms, and can express the core of our convergence results. Nevertheless, this leads us to an outstanding open problem: finding a complete axiomatization of this logic and investigating its complexity. This seems a daunting task at the time of our writing. Given the power of this formalism and its significance for the investigation of statistical learning, we think this to be an important and potentially fertile challenge.