Learning under unawareness

We propose a model of learning when experimentation is possible, but unawareness and ambiguity matter. In this model, complete lack of information regarding the underlying data generating process is expressed as a (maximal) family of priors. These priors yield posterior inferences that become more precise as more information becomes available. As information accumulates, however, the individual’s level of awareness as encoded in the state space may expand. Such newly learned states are initially seen as ambiguous, but as evidence accumulates there is a gradual reduction of ambiguity.


Introduction
State spaces and probabilities are ubiquitous in economic models. They provide an unrivaled analytical tool to study situations in which uncertainty matters. Yet it is not always clear how state spaces and beliefs emerge. If such knowledge is founded upon experience, how so?
Standard models of learning take for granted (an exogenously given) state space, impose an objective prior, and update it using Bayes' rule in face of new information.
Practical problems, however, rarely fit this mould. For all intents and purposes, states of the world are abstract representations of resolutions of uncertainty. Moreover, whereas Bayesian updating is an effective tool to sort out the probability of the number of heads in a finite sequence of tosses of a fair coin, in many practical situations it may not be possible to specify a prior probability. The standard Bayesian machinery is ill-equipped to provide guidance when the new information is surprising (it falls on a zero-probability event) or when the new information contradicts past experience (it cannot be categorized in any of the previously considered events). Moreover, it leaves no room for confidence in the probability assessments made.
This paper considers learning when there is incomplete information about the structure of the state space, but further information can be obtained through experimentation. We take for granted that the final objective of such analysis, after observations have been made, is the choice among possible courses of action whose consequences depend on the state of the world. For the present purposes, we are only concerned with the assessment phase of the analysis: the process of building the state space on the go and updating beliefs in a way that satisfies some basic principles of coherence and consistency.
For the problem to be well-defined, we aim to answer two questions. First, we need to sort out what constitutes a state of the world. If it is possible to learn new sources of randomness, then the model should allow for extensions of the state space to accommodate them. In particular, the model of the state space should permit two operations: 1. Creation: when new states are added without changing the structure of old events; and 2. Refinement: when old states turn into events that can be partitioned into more richly described states.
Second, we need to work out how to form and update a prior belief on this space that truly reflects complete ignorance. To answer the first question, we invoke the approach of Vierø (2013, 2017); to answer the second, we propose an imprecise version of the Dirichlet process prior, defined by Ferguson (1973, 1974). More specifically, the theory presented here is adapted to the following kind of problem. Suppose that an expert is challenged to express her opinions about possible plans of action whose consequences depend on the unknown state of the world. Before making her assessment, she has the opportunity to carry out a sequential experiment to learn about possible states of the world and their plausibility. Experiments are described by an underlying stochastic process that embodies the physical law governing the machinery of the experiment. Ideally, the realizations of these sequential trials provide a full description of the outcome, thus resolving all uncertainty regarding the state of the world. However, the expert's level of awareness restricts her perception of the realized state of the world: she can only partially observe these realizations. Thus the experimenter's awareness level determines which events she can conceive and delimits the boundaries of her conceivable state space. As new evidence becomes available, the experimenter may discover new states. Consequently, her awareness level increases and her conceivable state space expands.
Beliefs are represented by an evolving set of predictive distributions over future, conceivable states of the world; inferences about events can be summarized by upper and lower probabilities. As new states are discovered, probability mass may be shifted from old, non-null events to the events just created. Moreover, newly learned events are initially perceived as ambiguous. As evidence accumulates, however, the experimenter becomes more familiar with these events. The ambiguity associated with those events gradually disappears, and the assessment made by the individual converges to their true posterior probability.
The resulting theory captures several desirable, intuitive features: 1. Rich class of stochastic environments: the model is able to capture a wide range of data generating processes. 2. Internal consistency: the individual's beliefs are revised in a coherent way. 3. External consistency: the evolution of beliefs reflects the data (frequentist validation). 4. Possibility of surprises: the individual is cognizant of the possibility of consequences and actions of which she is currently unaware, but which may be revealed over time; as a result, learning never ends. 5. Event-specific ambiguity: ambiguity is related to lack of familiarity; newly learned events are seen as more ambiguous than old events. 6. Large-sample confidence: as evidence accumulates, the ambiguity perceived by the individual is gradually reduced and her confidence in her assessment increases. Our approach relates to four different literatures. First, we provide a model of learning that, in line with Marinacci (2002) and Epstein and Schneider (2007), accommodates ambiguous beliefs. Second, we explicitly model the process of inductive reasoning implied by the dynamics of growing awareness described in Dominiak and Tserenjigmid (2021), which utilizes the framework developed in Vierø (2013, 2017). Third, because ambiguity emerges endogenously when information is surprising, we provide a dynamic foundation for the unanimity-rule preference representation axiomatized in Bewley (2002) and Gilboa et al. (2010), as well as for the partial "comparative likelihood relation" in Nehring (2009). Finally, we provide an axiomatization of the model's probability kernel, which can be compared to the result in Billot et al. (2005).
There are three other papers that make the connection between levels of awareness and perceptions of ambiguity. Halpern et al. (2010) introduce unawareness in the context of a Markov decision problem and provide a characterization of when the individual can 'learn' to play nearly optimally. Kochov (2016), on the other hand, proposes a 'revealed preference' test to distinguish between those contingencies the individual is unaware of and those she foresees but whose likelihood she perceives as ambiguous. Grant et al. (2021) formalize a notion of coherent multiple priors and derive a representation in which full awareness corresponds to the usual unique prior, while less than full awareness generates multiple priors. When information is received with no change in awareness, each element of the set of priors is updated according to Bayes' rule. An increase in awareness, however, leads to an expansion of the individual's subjective state space and a contraction in the set of priors.
The paper is organized as follows. Section 2 describes the underlying data generating process and how actions and consequences are discovered. Section 3 explains how the individual's conception of the world evolves in view of this acquired information. Section 4 investigates the properties of the model. Section 5 explores the implications of the results for the problem of consensus formation of beliefs. Finally, Sect. 6 discusses the Bayesian interpretation of the model. All proofs are collected in the Appendix.

The discovery process
First, we introduce the elements of the data generating process and the nature of the observations made by the individual. Let T = {0, 1, . . . , t, . . .} denote time. There exists a countable set A of actions, which corresponds to the set of alternatives that are or may become known to the individual. There also exists a set C of consequences, which we take to be a separable, completely metrizable space.

Underlying stochastic process
We are given a sequence X_1, X_2, …, X_t, … of random variables defined on the ambient probability space (Ω, F, μ) and taking values in the common measurable space (C^A, B), where B denotes the Borel σ-algebra on the metrizable product C^A. If x_t ∈ C^A is a realization of X_t, then action a ∈ A is associated with the consequence (x_t)_a ∈ C, which we also denote by a(x_t). We may also think of the action a as the sequence of random variables (a(X_1), a(X_2), …, a(X_t), …) with range C. An element of C^A specifies the unique consequence associated with every possible action in A. A realization x_t of X_t, being an element of C^A, thus resolves all uncertainty.
Throughout the paper, we assume the following stochastic dependence condition on (X_t)_{t≥1}, originally studied by Kallenberg (1988).

Definition 1
The sequence of random variables (X_t)_{t≥1} is said to be conditionally exchangeable if for every n ≥ 1 and every k ≥ 1,

(X_1, …, X_n, X_{n+k}) ∼ (X_1, …, X_n, X_{n+1}),

where ∼ means "distributed as." We view conditional exchangeability as a structural judgement; that is, it is an assessment regarding the structural properties of the underlying stochastic process, which are the result of the design of the experiment. It is related to the more familiar notion of exchangeability. Recall that a sequence of random variables (X_t)_{t≥1} is said to be exchangeable whenever every permutation of every finite subsequence has the same distribution, i.e., for every n ≥ 1 and every permutation σ of {1, …, n},

(X_{σ(1)}, …, X_{σ(n)}) ∼ (X_1, …, X_n).

It is immediate that exchangeable sequences are conditionally exchangeable, but the converse is true only under stationarity. Recall that a sequence of random variables (X_t)_{t≥1} is said to be stationary whenever the distribution of finite subsequences is invariant over time, i.e., for every k ≥ 1 and every t ≥ 0,

(X_{t+1}, …, X_{t+k}) ∼ (X_1, …, X_k).

By a result due to Kallenberg (1988, Proposition 2.1), if a stationary sequence of random variables is conditionally exchangeable, then it is exchangeable. Kallenberg (1988) also provides an example of a non-stationary, conditionally exchangeable sequence that is not exchangeable.
Furthermore, conditional exchangeability can be understood in terms of the martingale aspect of the process (X_t)_{t≥1}. In particular, Kallenberg (1988, Proposition 2.2) shows that conditional exchangeability is equivalent to a property later coined conditionally identically distributed.
Let G = (G_t)_{t≥0} denote the nested sequence of sub-σ-algebras G_t of F such that G_0 = {∅, Ω} and G_t = σ(X_1, …, X_t) for every t ≥ 1. That is, G is the filtration induced, or generated, by the sequence of random variables (X_t)_{t≥1}. It is the smallest filtration with the property that X_t is G_t-measurable for every t ≥ 1.

Definition 2
We say that (X_t)_{t≥1} is conditionally identically distributed with respect to G (or G-c.i.d.) if the conditional expectations satisfy, μ-a.s.,

E[g(X_k) | G_t] = E[g(X_{t+1}) | G_t]

for every k > t ≥ 0 and every bounded measurable g : C^A → R.
Essentially, this property means that, at every time t, future realizations (X_k)_{k>t} are identically distributed conditional on past realizations (X_k)_{k≤t}. We refer to Berti et al. (2004) for examples of non-exchangeable, c.i.d. stochastic processes.

Discovery of actions and consequences
The individual in period 0 is aware of a nonempty, finite set A_0 ⊆ A of actions, and a nonempty, finite set C_0 ⊆ C of consequences. We now describe how other actions and consequences are discovered.

Discovery of actions
We suppose that actions in A are discovered over time, via a given nested sequence

A_0 ⊆ A_1 ⊆ ⋯ ⊆ A_t ⊆ ⋯ ⊆ A,

where each A_t, t ≥ 0, is a finite, nonempty set of actions. We think of A_0 as the individual's prior knowledge of the available actions. At time t ≥ 1, the individual discovers or becomes aware of the set A_t \ A_{t−1} of additional actions. We do not model how the sequence (A_t)_{t≥0} is realized. We leave open the possibility that the sequence is a sample path of some independent stochastic process or the result of some independent learning process. As an example, the actions in (A_t)_{t≥0} could represent the sequence of treatment options for a particular illness, as new alternatives are discovered and made available to the specialist. Further, we do not assume that all actions in A are eventually revealed to the individual. That is, it could be that everything is intrinsically knowable, but our model also allows for "unknown unknowns" that remain unknown indefinitely.

Stochastic discovery of consequences
Throughout the rest of the section, we fix a sample path

(x_1, x_2, …, x_t, …)

of the underlying process (X_t)_{t≥1}. The realization x_t sets the perfect level of description of the true state of the world. What the individual observes, however, is restricted by her level of unawareness. If x ∈ C^A, then we let x|_{A_t} = (x_a)_{a∈A_t}. In period t, the individual's information consists of the finite history of all observations up to period t for the given sample path, which can be written as

x^t = (x_1|_{A_1}, …, x_t|_{A_t}).

In particular, the individual discovers consequences as she (partially) observes the realizations of the collection of random variables a(X_t), induced by the actions known to her. Thus the set of consequences known to the individual is the set of all observed consequences given the finite history x^t. Formally, the set of consequences the individual considers possible at the end of period t is given by

C̄_t = C_0 ∪ {a(x_k) : a ∈ A_k, 1 ≤ k ≤ t} ∪ {θ_t},

with C̄_0 := C_0 ∪ {θ_0}. We interpret θ_t, for t ≥ 0, as the intangible consequence that represents the possibility of observing an outcome that is not in the set C̄_t \ {θ_t} of consequences already known in period t. That is, we also allow for the individual to conceive of the possibility that some actions may yield, in future periods, consequences which are as yet unknown to her in period t. Of course, we assume that C ∩ {θ_0, θ_1, …} = ∅.
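As an illustration, the bookkeeping of discovered consequences can be sketched in code. This is a minimal sketch in which all names, and the encoding of the intangible consequence θ_t as the pair ('theta', t), are our own choices, not the paper's:

```python
# Hypothetical sketch: tracking the set of consequences known at the end of
# each period, given a fixed sample path and nested action sets A_0 ⊆ A_1 ⊆ ...

def known_consequences(sample_path, action_sets, C0):
    """sample_path[t-1] maps each action to its realized consequence in period t;
    action_sets[t] is the set A_t of actions known in period t (nested).
    Returns the list (C_0, C_1, ..., C_T), where each C_t contains the
    intangible consequence ('theta', t) standing in for the still unknown."""
    history = [set(C0) | {('theta', 0)}]
    for t, x_t in enumerate(sample_path, start=1):
        observed = {x_t[a] for a in action_sets[t]}       # x_t restricted to A_t
        prev = history[-1] - {('theta', t - 1)}           # retire old intangible
        history.append(prev | observed | {('theta', t)})  # add current intangible
    return history

# Example: action 'a' known throughout, action 'b' discovered in period 2,
# at which point the new consequence 'c2' is observed through 'b'.
path = [{'a': 'c1', 'b': 'c1'}, {'a': 'c1', 'b': 'c2'}]
A = {0: {'a'}, 1: {'a'}, 2: {'a', 'b'}}
Cs = known_consequences(path, A, C0={'c1'})
```

In the example, Cs[2] contains c1, c2, and the period-2 intangible consequence: the discovery of action b is what reveals c2.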

The inference problem
In this section, we explain how the individual's perception of the state space evolves over time. Recall that, given the fixed sample path, the information available to the individual at period t is the finite history of all observations up to period t. Given this finite history of observations, the individual's conceivable state space expands to convey her increasing level of awareness.

Evolution of the set of conceivable states
Following the approach of Karni and Schmeidler (1991) and Vierø (2013, 2017), conceivable states represent the possible resolutions of uncertainty, restricted by the awareness level of the individual. That is, having observed the finite history x^t and being aware of the consequences in C̄_t and the actions in A_t, the individual can conceive of states that are elements of the finite set

S_t = (C̄_t)^{A_t},

which we refer to as the set of conceivable states (at the end of period t), for the given history x^t. In particular, for every period t, the observation x_t|_{A_t} is understood as a conceivable state in S_t, that is, x_t|_{A_t} = (a(x_t))_{a∈A_t} ∈ S_t. Moreover, for every t ≥ 1, the set of conceivable states S_{t−1} induces a partition of S_t, whereby each element of S_{t−1} is mapped to an event in S_t.
The following example illustrates how the conceivable state space evolves over time and how conceivable states in previous periods are mapped to events in later periods. Suppose that the individual is aware of one action a and one consequence c_1 in period t − 1, so that C̄_{t−1} = {c_1, θ_{t−1}}. The set S_{t−1} comprises two states s_1 and s_2:

s_1 = (a ↦ c_1) and s_2 = (a ↦ θ_{t−1}).

In the next period, we consider three alternative scenarios. First, if A_t = {a} and a(x_t) = c_2, then a new, previously unknown consequence is revealed in period t, and C̄_t = {c_1, c_2, θ_t}. Hence, at the end of period t, the individual conceives of three states in S_t:

s_1^i = (a ↦ c_1), s_2^i = (a ↦ c_2) and s_2^{ii} = (a ↦ θ_t).

In this scenario, the state s_1 in S_{t−1} becomes s_1^i, and the state s_2 in S_{t−1} is "split" into s_2^i and s_2^{ii}. The partition of S_t induced by S_{t−1} is thus {{s_1^i}, {s_2^i, s_2^{ii}}}. Second, if A_t = {a, b} and no new consequence is observed, so that C̄_t = {c_1, θ_t}, then a new action is discovered but no new consequence, and s_1 and s_2 correspond to events in the four-element set S_t = (C̄_t)^{A_t}. In this alternative scenario, s_1 splits into s_1^i = (c_1, c_1) and s_1^{ii} = (c_1, θ_t), and s_2 splits into s_2^i = (θ_t, c_1) and s_2^{ii} = (θ_t, θ_t), where the first coordinate is the consequence of a and the second that of b. Third, if a new action b and a new consequence c_2 are discovered simultaneously, so that A_t = {a, b} and C̄_t = {c_1, c_2, θ_t}, then S_t comprises nine states. In this case, the state s_1 is split into the three states s_1^i, s_1^{ii} and s_1^{iii} in which a yields c_1, and the state s_2 is split into the remaining six states. Therefore, the partition of S_t induced by the state space S_{t−1} conceivable in the previous period consists of these two events. In a nutshell, there are two circumstances that surprise the individual and prompt her to expand the set of states she believes to be possible: i. the discovery of new consequences, which formally corresponds to C̄_{t−1}\{θ_{t−1}} being a proper subset of C̄_t\{θ_t}; and ii. the discovery of new feasible actions, which corresponds to A_{t−1} being a proper subset of A_t.
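The example above can be reproduced mechanically. The sketch below uses our own encoding (the paper contains no code): it enumerates the conceivable state space as the set of all maps from known actions to known consequences, and recovers the partition induced by the previous period via the projection that forgets newly discovered actions and sends newly learned consequences back to the old intangible consequence:

```python
# Sketch of S_t = (C_t)^{A_t} and the partition of S_t induced by S_{t-1}.
from itertools import product

def state_space(actions, consequences):
    """All maps from actions to consequences, encoded as sorted tuples of pairs."""
    acts = sorted(actions)
    return {tuple(zip(acts, cs))
            for cs in product(sorted(consequences), repeat=len(acts))}

def project(state, old_actions, old_consequences, theta_old):
    """Project a new state to the old state space: drop unknown actions and
    map consequences unknown in the old period to the old intangible."""
    return tuple((a, c if c in old_consequences else theta_old)
                 for a, c in state if a in old_actions)

# Third scenario of the example: at t-1 the individual knows action 'a' and
# consequence 'c1'; at t she discovers action 'b' and consequence 'c2'.
S_new = state_space({'a', 'b'}, {'c1', 'c2', 'th1'})   # 3^2 = 9 states
cells = {}
for s in S_new:
    cells.setdefault(project(s, {'a'}, {'c1'}, 'th0'), set()).add(s)
```

The dictionary `cells` maps each old state to the event it splits into: the old state (a ↦ c_1) gets the three new states in which a yields c_1, and the old state (a ↦ θ_{t−1}) gets the remaining six, exactly as in the example.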
The advantage of working with this canonical state space and its product structure is that it distinguishes between the information deduced by logical inference (which states can be conceived) and the information explicitly conveyed by the data (the distributions over consequences induced by the actions). In particular, nothing suggests to the individual that the drawings of consequences induced by different actions are independent. However, as will be shown in the next section, the individual's (imprecise) prior reflects her total ignorance regarding possible correlations between different actions. It is thus left to the evidence to reveal the true correlation structure among the states. We now formalize the embedding illustrated by the previous examples for the general structure of the states. For any pair of periods t and t′, with t < t′, we can express the associated embeddings and projections as follows.
For every period t, let κ_t : C̄ → C̄_t denote the mapping that takes consequences that are unknown in period t to θ_t in C̄_t, and keeps the other, known consequences unchanged. That is,

κ_t(c) = c if c ∈ C̄_t \ {θ_t}, and κ_t(c) = θ_t otherwise.

The mapping κ_t induces a natural embedding ψ_t : C̄^A → S_t, defined by ψ_t(x) = (κ_t(x_a))_{a∈A_t}. Finally, for t < t′, we define ϕ_{t′:t} : S_{t′} → S_t as the unique mapping such that

ϕ_{t′:t} ∘ ψ_{t′} = ψ_t.

Notice that, because C̄_t \ {θ_t} ⊆ C, the mapping ψ_t is surjective. Moreover, because A_t ⊆ A_{t′} and C̄_t \ {θ_t} ⊆ C̄_{t′} \ {θ_{t′}}, it follows that ϕ_{t′:t} is also surjective, and hence it has a set-valued inverse. The partition of S_{t′} induced by S_t is the one given by the collection of sets {ϕ_{t′:t}^{−1}(s) : s ∈ S_t}. In particular, for each state s ∈ S_t, the event ϕ_{t′:t}^{−1}(s) ⊆ S_{t′} is the collection of conceivable states in S_{t′} into which the state s has been "split."

Evolution of beliefs
We now propose a prior, based on the Dirichlet process, that represents (almost) complete ignorance with respect to the distribution μ.

The Dirichlet process
One of the simplest and most commonly used nonparametric statistical models is the Dirichlet process, defined by Ferguson (1973, 1974). It was introduced as a prior over probability distributions and, due to the tractability of the resulting posterior inferences, it is widely employed in Bayesian nonparametric inference. We use the following notation. If S is a separable, completely metrizable space, then let Δ(S) denote the set of Borel probability measures on S. For every s in S, let δ_s ∈ Δ(S) denote the (Dirac, or degenerate) probability measure that assigns probability one to s obtaining.
To describe the Dirichlet process, let G(α, β), with α > 0 and β > 0, denote the Gamma distribution on R_+, with Lebesgue density

f(x) = x^{α−1} e^{−x/β} / (Γ(α) β^α),

where Γ(α) is the complete gamma function. For G(α, β), α is called the shape parameter and β is the scale parameter. The n-dimensional Dirichlet distribution with parameter (α_1, …, α_n), with α_k ≥ 0 and ∑_k α_k > 0, is the distribution of the random vector (Y_1/∑_k Y_k, …, Y_n/∑_k Y_k), where Y_1, …, Y_n are independent random variables with Y_k ∼ G(α_k, 1).

Definition 3
Let π be a finite non-null (probability) measure on (X, B_X), where B_X is the Borel σ-algebra of subsets of X, and let α > 0. A random probability measure ℙ is a Dirichlet process with base measure π and concentration parameter α, denoted by ℙ ∼ DP(α, π), if for every finite measurable partition {B_1, …, B_n} of X, the random vector (ℙ(B_1), …, ℙ(B_n)) has a Dirichlet distribution with parameter (απ(B_1), …, απ(B_n)).
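The gamma construction behind the Dirichlet distribution is easy to simulate. The following sketch is standard textbook material, not specific to this paper: it draws a Dirichlet vector by normalizing independent Gamma(α_k, 1) random variables, using only the Python standard library:

```python
# Sketch: a Dirichlet(alpha_1, ..., alpha_n) vector as normalized Gamma draws.
import random

def dirichlet_sample(alphas, rng=random):
    """Draw one Dirichlet sample by normalizing independent G(alpha_k, 1) draws."""
    ys = [rng.gammavariate(a, 1.0) for a in alphas]  # independent Gamma(a, 1)
    total = sum(ys)
    return [y / total for y in ys]

w = dirichlet_sample([2.0, 3.0, 5.0])
# The coordinates are positive and sum to one; with parameters (2, 3, 5) the
# mean of the k-th coordinate is alpha_k / sum(alphas), i.e. (0.2, 0.3, 0.5).
```

Since scale and rate coincide for unit scale, the G(α_k, 1) draws here are unaffected by the scale-versus-rate parameterization choice.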
Under the Dirichlet process, the data are assumed to be generated according to the law

X_1, X_2, … | ℙ i.i.d. ∼ ℙ, with ℙ ∼ DP(α, π).

That is, the parameter that explains the data is the random measure ℙ itself. The Dirichlet process is thus a probability distribution on Δ(X), the space of probability measures over (X, B_X). (Benavoli et al. (2015) propose a prior similar to the one we describe; they, however, apply it to Bayesian hypothesis testing, obtaining a method with good asymptotic properties that is, at the same time, more robust than the usual tests.) We record the following properties of the Dirichlet process.
Theorem 1 (Ferguson (1973, Theorem 1), Ghosh and Ramamoorthi (2003, Chapter 3)) Let (X_1, …, X_t) be i.i.d. samples from ℙ and suppose ℙ has a DP(α, π) distribution. Then:
1. The posterior distribution of ℙ, given the finite sample (X_1, …, X_t), is the Dirichlet process DP(α + t, (απ + ∑_{k=1}^t δ_{X_k})/(α + t)).
2. The predictive distribution of the next observation X_{t+1}, given the finite sample (X_1, …, X_t), is

P[X_{t+1} ∈ · | X_1, …, X_t] = (απ + ∑_{k=1}^t δ_{X_k})/(α + t).

We can rewrite the predictive distribution of the next observation as

(α/(α + t)) π + (t/(α + t)) (1/t) ∑_{k=1}^t δ_{X_k}.

We interpret this conditional distribution as follows. The term (1/t) ∑_k δ_{X_k} is the empirical distribution and represents the contribution of experience. The term π represents the initial guess. Thus the conditional distribution of the next observation is a weighted average of the initial guess and the empirical distribution. The relative weights α/(α + t) and t/(α + t) balance our confidence in the prior beliefs against the data.
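On a finite state space, the predictive distribution in Theorem 1 reduces to elementary arithmetic. The sketch below (our own names; a finite illustration rather than the paper's general measure-theoretic setting) computes the weighted average of the initial guess π and the empirical distribution:

```python
# Sketch: Dirichlet-process predictive probabilities on a finite state space.
from collections import Counter

def dp_predictive(alpha, base, observations):
    """base: dict state -> pi(state). Returns the predictive probability
    P[X_{t+1} = state | data] = (alpha * pi(state) + count(state)) / (alpha + t)."""
    t = len(observations)
    counts = Counter(observations)
    return {s: (alpha * p + counts.get(s, 0)) / (alpha + t)
            for s, p in base.items()}

# Initial guess: uniform over two states; three of four observations are 's1'.
pred = dp_predictive(alpha=2.0, base={'s1': 0.5, 's2': 0.5},
                     observations=['s1', 's1', 's2', 's1'])
# pred['s1'] = (2*0.5 + 3)/(2 + 4) = 4/6; pred['s2'] = (2*0.5 + 1)/6 = 2/6.
```

Raising α drags the prediction toward the initial guess; raising t drags it toward the empirical frequencies, mirroring the weights α/(α + t) and t/(α + t) above.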

The imprecise Dirichlet process
We extend the Dirichlet process to allow for ambiguity, arising from the lack of information about the law of the underlying process as well as from the incompleteness of the observed data. Specifically, we construct a class of Dirichlet processes by allowing the base measure π to vary over the entire set of probability measures.
From the perspective of the individual, in every period t ≥ 0, prior uncertainty regarding the probability law of the underlying stochastic process is expressed by the following class of Dirichlet priors:

𝒟_t = {DP(α, π) : π ∈ Δ(S_t)}.

This is the set of Dirichlet priors that the individual can conceive of in a given period t, thus representing her unconditional belief assessment over the conceivable state space S_t under the veil of ignorance, before making any observations. We note that the base measure π in the precise Dirichlet process DP(α, π) is interpreted as the initial guess. Therefore, the set 𝒟_t reflects complete ignorance about the law of the underlying process, since the set of initial guesses is maximal: every conceivable (nontrivial) event E ⊆ S_t has upper probability 1 and lower probability 0 in that set.
In subsequent periods t ≥ 1, the individual updates these unconditional assessments in light of the history of observations. However, when reassessing her beliefs, the individual's level of awareness may have increased over time. With the wisdom of hindsight, she can conceive of many sample paths that are consistent with her observations. She then updates her belief assessment conservatively, allowing for all those possible sample paths.
More precisely, in period t, if the individual's conceivable state space is S_t, then she can conceive of all histories of length t, that is, all sequences h = (h_1, …, h_t) with h_k ∈ S_t for every k = 1, …, t. Given the observations x^t, we let H_t ⊆ ×_{k=1}^t S_t denote the set of histories that are conceivable in period t and consistent with the finite history of observations x^t. Notice that, since the state space may expand over time, there will usually be more than one history consistent with x^t. In particular, for every t ≥ 2, if at least two consequences are known, |C̄_t| ≥ 2, and the individual becomes aware of at least one new action in period t, so that A_{t−1} ⊂ A_t, then there will be at least two different histories in H_t. Moreover, regardless of the cardinality of the set H_t, the last coordinate of every history in H_t coincides with the current observation x_t|_{A_t}. The following theorem describes the individual's posterior inferences in the presence of the ambiguity generated by growing awareness.

Theorem 2 Let (X_1, …, X_t) be i.i.d. samples from ℙ and suppose ℙ has one of the distributions DP(α, π) in 𝒟_t = {DP(α, π) : π ∈ Δ(S_t)}. Let also H_t denote the set of histories consistent with the observations x^t generated by ℙ. Then:
1. The set of conditional distributions of ℙ, given the histories in H_t, is the set of Dirichlet processes:

{DP(α + t, (απ + ∑_{k=1}^t δ_{h_k})/(α + t)) : π ∈ Δ(S_t), h ∈ H_t}.

2. The set of predictive distributions, given the set of histories H_t, is

{(απ + ∑_{k=1}^t δ_{h_k})/(α + t) : π ∈ Δ(S_t), h ∈ H_t}.
Remark 1 Henceforth, we write P_{H_t} for the set of predictive distributions above.
Having observed x^t, the individual can conceive of many finite histories h ∈ H_t. Theorem 2 says that she updates every prior in her unconditional set of beliefs given every sample history consistent with her observations.
Notice that the influence of the probability measures in the individual's unconditional assessment on her conditional assessment is determined by the hyperparameter α and the length t of the history of observations she has seen. We shall interpret α as the learning parameter. Since the weight given to the empirical distributions is t/(α + t), higher values of α are associated with a lower degree of confidence of the individual in the accumulated data. We also note that the degree of imprecision of the conditional assessment comes from two sources. The first, not surprisingly, relates to the lack of prior information to guide the choice of the base measure π of the (unconditional) Dirichlet prior DP(α, π). The second arises from the lack of familiarity with newly discovered events, since the discovery of new actions in a subsequent period t increases the number of elements in the set H_t of conceivable histories of length t that are consistent with the observed data x^t.
The set of predictive distributions P_{H_t} represents the conditional assessment the individual has about the next observation. For every event E ⊆ S_t, this conditional assessment can be summarized by the upper probability

ρ̄_t(E) = max{ρ(E) : ρ ∈ P_{H_t}}

and the lower probability

ρ_t(E) = min{ρ(E) : ρ ∈ P_{H_t}}.

Notice that if, for some k ≤ t, h_k ∈ E for every h ∈ H_t, then ρ_t(E) > 0. That is, upon observing that event E has occurred at least once unambiguously, that is, in every conceivable history consistent with the observations, the individual's revised beliefs regarding E are bounded away from zero.
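To see how these bounds behave, consider the simplified case of a single conceivable history, thus ignoring the multiplicity of histories in H_t. Letting the base measure π range over all of Δ(S_t) then yields the familiar imprecise-Dirichlet bounds ρ_t(E) = n_E/(α + t) and ρ̄_t(E) = (α + n_E)/(α + t), where n_E is the number of observations falling in E (π puts no mass on E at the lower bound and concentrates on E at the upper). This is an illustrative special case, not the paper's general formula:

```python
# Sketch: upper and lower predictive probabilities of an event E when the base
# measure ranges over all probability measures (single-history simplification).

def predictive_bounds(alpha, t, n_E):
    """n_E: number of the t observations that fall in event E."""
    lower = n_E / (alpha + t)            # base measure puts no mass on E
    upper = (alpha + n_E) / (alpha + t)  # base measure concentrates on E
    return lower, upper

lo, hi = predictive_bounds(alpha=2.0, t=10, n_E=4)
# lo = 4/12, hi = 6/12: the width of the interval is alpha/(alpha + t).
```

As t grows with a stable empirical frequency n_E/t, the interval shrinks at rate α/(α + t), illustrating the gradual reduction of ambiguity; and n_E ≥ 1 already forces the lower probability strictly above zero.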
Remark 2 Given Theorem 2, it readily follows that, for every pair of periods t > t′ and every π ∈ P_{H_t}, there exists ν ∈ P_{H_{t′}} such that π(ϕ_{t:t′}^{−1}(s)) = ν(s) for every s ∈ S_{t′}.
So, in particular, by taking t′ = t − 1, we obtain an equivalent recursive definition of the individual's predictive assessment in period t in terms of her assessment in period t − 1.

Properties of the model
In this section, we study the properties of the model. We first provide a characterization of the updating rule that generates the predictive distribution in part 2 of Theorem 2. We then study the large-sample properties of the individual's belief assessment.

Internal consistency
We begin by noting that in any period t ≥ 0, if the decision maker's conceivable state space is S, then she can conceive of all future sample paths of any finite length n, that is, all sequences f = (f_1, …, f_n) with f_k ∈ S for every k = 1, …, n. Let F_S^n denote the set of all such conceivable future sample paths of length n, and set F_S = ∪_{n≥1} F_S^n. In addition, let S_∞ denote the set of conceivable state spaces the individual will become aware of as a result of the observations she will make. That is, S_∞ = {S_t : t ≥ 0}.
Adopting the nomenclature of Billot et al. (2005), we define a probabilistic belief in period t to be a mapping ρ_t^S : Δ(S) × F_S → Δ(S). Given a belief π ∈ Δ(S), we interpret the probabilistic belief ρ_t^S(π; f) as telling us how the individual anticipates that such a belief π in period t would be revised as a result of observing the conceivable future sample path f ∈ F_S.
We consider the following axioms on the family of probabilistic beliefs {ρ_t^S : t ≥ 0, S ∈ S_∞}.

Axiom 2 (symmetric treatment [of previously unconsidered states])
For every pair of state spaces S and S′ and every pair of beliefs π ∈ Δ(S), π′ ∈ Δ(S′):

ρ_t^S(π; s)(s) = ρ_t^{S′}(π′; s′)(s′)

whenever s ∉ supp π and s′ ∉ supp π′.
Axiom 5 (intertemporal coherence) For every pair of periods t < t′, every belief π ∈ Δ(S) and every future sample path f = (f_1, …, f_n) ∈ F_S^n with n > t′ − t:

ρ_t^S(π; f_1, …, f_n) = ρ_{t′}^S(ρ_t^S(π; f_1, …, f_{t′−t}); f_{t′−t+1}, …, f_n).
As the name suggests, Axiom 1 (responsiveness) requires that ρ_t^S assign strictly positive weight to every state that appears in the future sample path f. Furthermore, a state s cannot be in the support of ρ_t^S(π; f) if it is neither in the support of the belief π nor in the future sample path f. Axiom 2 (symmetric treatment) may be interpreted as requiring that, should the decision-maker observe next period a previously unconsidered state, the (strictly positive) weight that she anticipates being shifted to this previously unconsidered state depends only on the period t in which this probabilistic belief is formed.
Axiom 3 (invariance) is an exchangeability property: future information is interpreted in the same way, regardless of the order in which it arrives. It reflects the conditional exchangeability property of the underlying data generating process.
To explain the normative appeal of Axiom 4 (linearity in beliefs), recall that any initial belief π may be expressed as the probability-weighted sum π = ∑_{s∈S} π(s) δ_s.
That is, π is a weighted sum of the Dirac probability measures associated with each element in its support. Noting that ρ_t^S(δ_s; f) is the probabilistic belief the individual anticipates she would have, starting from the degenerate belief concentrated on the state s, after observing the sample path f, we see from repeated application of Axiom 4 that

ρ_t^S(π; f) = ∑_{s∈S} π(s) ρ_t^S(δ_s; f).

That is, the probabilistic belief ρ_t^S(π; f) can be expressed as a mixture of the beliefs in the set {ρ_t^S(δ_s; f) : s ∈ S}, with the weights corresponding to the weight she assigns in her belief π to each particular state s.
Our final axiom, Axiom 5 (intertemporal coherence), is a consistency property. It ensures that the family of probabilistic beliefs exhibits an appropriate "law of iterative conditioning." Together, these five axioms characterize the anticipated revision of beliefs being undertaken in accordance with a Dirichlet process.
Theorem 3 Suppose that |S_t| ≥ 3 for some t ≥ 0. Then the following are equivalent.
1. The family of probabilistic beliefs {ρ_t^S : t ≥ 0, S ∈ S_∞} satisfies Axioms 1-5 (responsiveness, symmetric treatment, invariance, linearity in beliefs and intertemporal coherence). 2. In every period t ≥ 0 and for each S ∈ S_∞, the probabilistic belief ρ_t^S takes the following form: for each belief π ∈ Δ(S) and each future sample path f = (f_1, …, f_n) ∈ F_S,

ρ_t^S(π; f) = ((α + t) π + ∑_{k=1}^n δ_{f_k}) / (α + t + n),

where α ≥ 0.
For any family of probabilistic beliefs satisfying the five axioms above, Theorem 3 implies that the individual's conditional assessment in period t, as specified in part 2 of Theorem 2, may be re-expressed in terms of the probabilistic belief ρ_0^{S_t} as follows:

P_{H_t} = {ρ_0^{S_t}(π; h) : π ∈ Δ(S_t), h ∈ H_t}.
That is, we can view P_{H_t} as the set of probabilistic beliefs that the individual in period 0 would anticipate her assessment in period t to be, had she been aware in period 0 of what her conceivable state space in period t would be, as well as of all the conceivable histories in H_t that would be consistent with her observations up to and including period t.
Alternatively, employing the second of the equivalent expressions for P_{H_t} from Remark 2, we obtain the recursive formulation given by

P_{H_t} = {ρ_{t−1}^{S_t}(π; h_t) : h ∈ H_t, π ∈ Δ(S_t)} such that π(ϕ_{t:t−1}^{−1}(s)) = ν(s), for some ν ∈ P_{H_{t−1}} and every s ∈ S_{t−1}.
We conclude this section by comparing our probability assignment rule with the similarity-based probability assignment rule introduced and axiomatized by Billot et al. (2005). Formally, the intersection of our probability assignment rule and theirs is the limiting case in which our learning parameter α is equal to zero and their similarity weighting function is constant, that is, the case in which all the elements of any "database" (the analog of our conceivable histories) in their setting are considered equally similar. As there is no increasing awareness in their model, their probability assignment is always precise.

External consistency
We now investigate whether, as the sample size increases, the decision maker's assessment regarding known events converges in some meaningful sense to the true posterior probability of the event, according to the data generating process.
The ambient sample space (Ω, F, μ) describes every possible outcome of all sources of randomness. However, from the perspective of the individual, the particular choice of this space is immaterial. The individual only cares about the sequence of induced distributions, defined for every B in the σ-algebra B of Borel subsets of C^A.
Furthermore, the individual's perception of these distributions is restricted by her current level of awareness, which has two implications.

First, the individual can conceive of the idea of the unknown; that is, she is cognizant of her unawareness. She can thus conceive of random variables that may take the intangible value θ_t in period t. Let C̄ = C ∪ {θ_0, θ_1, …} denote the set of extended consequences and consider the set C̄^A of functions from A into C̄. Endow C̄ with the σ-algebra C generated by the union of the Borel σ-algebra on C and the discrete σ-algebra on {θ_0, θ_1, …}. Let B = σ(B_a : a ∈ A) denote the σ-algebra of subsets of C̄^A generated by the cylinder sets for every set B in C and every action a ∈ A. We shall abuse notation and identify the random variable X_t, which takes values in C^A, with the coextension of X_t that takes values in the larger range space C̄^A.

Second, in every period, the individual can only observe the realized consequences associated with the actions in the set known to her. Because the individual can only learn the distributions of consequences induced by the actions already discovered by her, learning is not uniform. To deal with this issue, take as given a period t′. The set A_{t′} consists of all actions known to the individual in period t′. Consider the set C̄^{A_{t′}} of functions from A_{t′} into C̄ and endow it with B_{t′} = σ(B_a : a ∈ A_{t′}), that is, the σ-algebra of subsets of C̄^{A_{t′}} generated by the cylinder sets for every set B in C and every a ∈ A_{t′}. If g : C̄^{A_{t′}} → R is a bounded, B_{t′}-measurable function, then the individual's (conditional) assessment of g in period t ≥ t′ is given by the collection of conditional expectations for every ρ ∈ P_{H_t}. There is learning if these assessments become closer to the true expectation E_μ[g(X_{t+1}|A_{t′}) | G_t] as the sample size increases.^10

Our main result, Theorem 4, formalizes these ideas. It conveys the sense in which the individual learns in the long run. Essentially, it says that the limiting distribution induced over consequences by the actions in A_{t′} and the individual's assessment of this distribution become indistinguishable from the point of view of integrating bounded, measurable functions.
If (Ω, F, μ) is a measure space and (X_t)_{t≥1} is a sequence of random variables with values in an arbitrary measurable space, then we can define the following notion of countably generated sets of converging functions.
Consider a family G of real-valued, measurable functions such that for every g ∈ G there exists a random variable V_g to which the corresponding conditional expectations converge. We say that G is countably generated whenever it contains a countable subset G_0 such that the convergence is uniform for every g ∈ G if and only if it is uniform for every g ∈ G_0.

Theorem 4 Fix some t′ ≥ 1. If (X_t)_{t≥1} is conditionally exchangeable and G is a countably generated family of bounded, B_{t′}-measurable functions g : C̄^{A_{t′}} → R, then for every g ∈ G there exists a random variable V_g such that E_μ[g(X_{t+1}|A_{t′}) | G_t] → V_g μ-a.s. and

A useful feature of this model is that it is possible to measure the degree of imprecision or ambiguity of the individual's beliefs for every event she is aware of. We say that the degree of ambiguity associated with an event E ⊆ S_t in period t is the difference between the upper and the lower probabilities of E. The following corollary is an easy consequence of Theorem 4. It establishes that ambiguity is associated with lack of familiarity or information regarding conceivable events. There is resolution of ambiguity over time, but it is only partial, in the sense that the individual sees the events she just became aware of as more ambiguous.
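As a concrete illustration of the degree of ambiguity, the following sketch computes the upper probability minus the lower probability of an event across a finite set of beliefs (a stand-in for the set P_{H_t}). All identifiers and numbers are ours, chosen for illustration only.

```python
def event_probability(belief, event):
    """Probability that a belief (a list of state probabilities) assigns
    to an event (a set of state indices)."""
    return sum(belief[s] for s in event)

def degree_of_ambiguity(beliefs, event):
    """Upper probability minus lower probability of the event over the
    belief set: max over beliefs minus min over beliefs."""
    probs = [event_probability(b, event) for b in beliefs]
    return max(probs) - min(probs)

# Three beliefs over states {0, 1, 2}; the event of interest is {0, 1}.
beliefs = [
    [0.5, 0.3, 0.2],
    [0.4, 0.4, 0.2],
    [0.6, 0.1, 0.3],
]
print(degree_of_ambiguity(beliefs, {0, 1}))  # max 0.8 - min 0.7 ≈ 0.1
```

When every belief in the set assigns the same probability to the event, the degree of ambiguity is zero: the event is unambiguous.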

Merging of opinions
This section considers the problem of a group of agents who need to make a collective decision involving the (uncertain) value of possible plans of action. Initially, their beliefs may not be consistent, but they will have some opportunity to jointly gather evidence regarding the stochastic outcome of each plan of action available before they make a final decision.
The group consensus formation problem has been studied extensively. Among Bayesians, there is considerable controversy regarding the best procedure for aggregating beliefs: many desirable properties lead to dictatorial aggregation rules [see Genest and Zidek (1986) for a survey of classical aggregation methods and impossibility theorems, as well as Mongin (1995)]. Shafer (1986) and Walley (1991, Chapter 4) propose belief aggregation methods, based on theories of imprecise probability, that seem to perform better and avoid dictatorial rules (see also Crès et al. 2011). Going beyond the realm of belief aggregation and into the territory of multi-Bayesian decision problems does not help. These impossibility results carry through if one incorporates preferences into the problem, as illustrated by Hylland and Zeckhauser (1979). For surveys, we refer to Zidek (1981, 1983) and Zidek (1988); Gilboa et al. (2004) is a rare positive result in this area.
We focus on the aggregation rule determined by unanimity voting, whereby an action is acceptable if and only if it is acceptable for every individual in the group. We show that the beliefs of the members of the group merge with increasing information, as a corollary of Theorem 4.
Consider a group of n individuals who must reach a consensus regarding the value of each act or plan of action in a common, finite set of possible actions A. Suppose that each individual i has a prior assessment of the expected value of each action in A. For the purposes of this section, the choice of the set A is arbitrary. For example, it could happen that only actions "observable" by all individuals are considered. In this case, A is the intersection of the individual sets of actions known by each expert, that is, A = ∩_{i=1}^n A_i. Alternatively, it could also happen that the individuals set out to find a cooperative solution to a complex problem, about which each expert has only a partial understanding. In this case, they could have engaged in pre-play communication and agreed on the set that represents the combined knowledge of all in the group. Then the appropriate description of the problem is to take A to be the union of all individual sets of feasible acts, thus A = ∪_{i=1}^n A_i. We also assume that there is a common set C of consequences and that the group agrees with respect to the return associated with each of these consequences. That is, there exists a bounded and measurable function v : C → R that expresses the common value assigned by all individuals to each consequence in C. Let C_0 ⊆ C denote the finite subset of consequences known to the experts in period 0.
We assume that each individual in the group is Bayesian and makes inferences from the observed data by updating a Dirichlet process prior. In period 0, expert i's assessment is represented by a Dirichlet prior on C^A with base measure π_0^i, with support on the set C_0^A, and common concentration parameter α > 0.^11 The interpretation is that π_0^i represents the initial guess of expert i and α represents the experts' unanimous confidence in their initial estimation.
The evolution of the experts' assessments goes as follows. In period 0, expert i's belief is represented by the probability measure π_0^i ∈ Δ(C^A). For every period t ≥ 1, after the group publicly observes the sample X_1 = x_1, …, X_t = x_t, expert i updates her assessment by computing the predictive probability under the Dirichlet posterior. The group's belief is then represented by the set of predictive probabilities. Notice that the group's belief P_t^G(X_1, …, X_t) is a subset of the set of predictive distributions P_{H_t} described in Theorem 2, when we let A_t = A for every period t.
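A minimal numerical sketch of this predictive update, assuming a finite set of consequences and observations lying in the support of the base measure: the posterior predictive of a Dirichlet prior mixes the expert's initial guess π_0^i with the empirical counts, with weight α on the prior. The function name and data below are illustrative, not from the paper.

```python
from collections import Counter

def dirichlet_predictive(pi0, alpha, observations):
    """Posterior predictive of a Dirichlet prior with base measure pi0
    and concentration alpha: (alpha * pi0 + empirical counts) / (alpha + t)."""
    t = len(observations)
    counts = Counter(observations)
    return {c: (alpha * p + counts.get(c, 0)) / (alpha + t)
            for c, p in pi0.items()}

pi0 = {"good": 0.5, "bad": 0.5}          # expert's initial guess
data = ["good", "good", "bad", "good"]   # publicly observed sample, t = 4
print(dirichlet_predictive(pi0, 2.0, data))
# 'good': (2*0.5 + 3)/6 ≈ 0.667, 'bad': (2*0.5 + 1)/6 ≈ 0.333
```

As t grows, the empirical counts dominate the α-weighted prior term, which is exactly the mechanism behind the merging result below.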
Furthermore, each individual i evaluates action a by the corresponding expectation. Therefore, the consensus of the group at t is represented by the lower and upper expectations; that is, the aggregated opinion of the group is given by the pair of functionals E_t : C^A → R and Ē_t : C^A → R. The following result follows from Theorem 4. It establishes that the group's lower and upper expectations merge as the experts revise their opinions in view of the information gathered over time.
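Under the unanimity rule, the group's aggregated opinion can be sketched as the lower and upper expectations of the common value function across the experts' predictive beliefs. A hypothetical illustration with our own identifiers, assuming finite consequence sets:

```python
def group_expectations(predictives, v):
    """Lower and upper expectations of a common value function v : C -> R
    across the experts' predictive probability distributions."""
    evals = [sum(p[c] * v[c] for c in p) for p in predictives]
    return min(evals), max(evals)

v = {"good": 1.0, "bad": 0.0}            # common return on consequences
experts = [{"good": 0.7, "bad": 0.3},    # expert 1's predictive
           {"good": 0.6, "bad": 0.4}]    # expert 2's predictive
lower, upper = group_expectations(experts, v)
print(lower, upper)  # 0.6 0.7
```

An action is unanimously acceptable when even the lower expectation clears the relevant threshold; the gap between the two functionals is the group-level analog of the degree of ambiguity.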

Corollary 6
If (X_t)_{t≥1} is conditionally exchangeable, then, for every action a, the group's lower and upper expectations merge.

Corollary 6 provides, to some extent, circumstances under which the common prior assumption may be justified. It implies that the posterior beliefs of different agents merge after observing a sufficiently long history of past, public signals. The common prior could be understood as the limiting posterior of agents who have been observing the realization of a common signal for a sufficiently long time before the economic interaction being studied begins.
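The rate at which opinions merge can be made explicit in the simplest case: two experts with different base measures who observe the same public data. The gap between their predictive probabilities of any fixed outcome is α·|π_0^1 − π_0^2|/(α + t), which vanishes regardless of the data. A hedged sketch with illustrative numbers:

```python
def predictive_prob(pi0_good, alpha, data):
    """Dirichlet posterior predictive probability of the outcome 'good'
    given a prior guess pi0_good and an observed sample."""
    t = len(data)
    return (alpha * pi0_good + data.count("good")) / (alpha + t)

alpha = 2.0
data = ["good", "bad"] * 50                  # a common public sample, t = 100
p1 = predictive_prob(0.9, alpha, data)       # optimistic expert
p2 = predictive_prob(0.1, alpha, data)       # pessimistic expert
print(abs(p1 - p2))  # alpha * (0.9 - 0.1) / (alpha + 100) ≈ 0.0157
```

Note that the gap shrinks at rate 1/t even if the experts' base measures assign positive probability to entirely different consequences, which is the sense in which mutual absolute continuity is not needed.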
It is worth noting that the argument proposed here avoids, at least partially, the shortcomings of alternative arguments in support of the common prior assumption.^12 From a Bayesian statistics point of view, even in the single-person decision problem, agents need to assign positive probability to the true parameter. In the multi-person problem, these results require that individuals agree on which events should be assigned positive probability (the mutual absolute continuity condition).^13 These assumptions seem barely weaker than the common prior assumption itself. By applying the consistency results of our learning model, however, it is possible to show convergence of posterior beliefs even when the individuals' prior beliefs are mutually singular.
The common prior assumption has also been justified from a frequentist point of view, by the argument that past experiences wash away differences in beliefs, because limiting relative frequencies can be commonly learned. Corollary 6 has the same flavor. It should be noted, however, that contrary to most of the results available, our convergence results do not require stationarity of the underlying stochastic process.^14

^12 For a survey of such arguments, we refer to Morris (1995).
^13 The absolute continuity condition was suggested by Blackwell and Dubins (1962). Kalai and Lehrer (1993) formalized this argument in a game-theoretic setting, in which learning leads to Nash equilibrium.
^14 See, for example, Kurz (1994).

Bayesian interpretation

So far, the discussion has focused on the classical Bayesian view that there is an objective model of probability which explains the data. According to this view, there is a true but unknown parameter that is to be estimated from the data. The prior assessments in period t thus represent initial guesses about this parameter. Consistency matters to classical Bayesians because they would like the posterior to converge in a meaningful way to the true objective parameter as data accumulate.
An alternative view is the subjective view of probability, according to which there is no such thing as an objective probability model. For a subjective Bayesian, probabilities are nothing but a representation of degrees of belief. Here, we provide an interpretation of the model that is compatible with this view.
Usually, an individual has some information about a statistical problem, perhaps the order of magnitude of some parameter or some qualitative aspect of sampling. However, there is little reason to believe that an individual should have much confidence in a sharply defined prior distribution, or that individuals with different backgrounds should agree on all minute details of the model that explains the data. In particular, it is much easier for subjective Bayesians to reach an agreement about qualitative features of the process, such as conditional exchangeability, or about hyper-parameters, such as the concentration parameter α, than to reach a consensus about the whole prior distribution. However, especially in high-dimensional problems, there is no guarantee that the opinions of individuals with different subjective priors would eventually merge, no matter how much data they have. From this point of view, consistency represents asymptotic interpersonal agreement. Indeed, a model with nice frequentist properties is robust in the sense that small variations in the specification of the prior will not lead to large disagreements.

Discussion
The analysis presented in this paper has a number of limitations and poses questions that are yet to be addressed. A central question has to do with whether or not the learning process in the model can be extended to a model of preference updating over time, in line with the axiomatization proposed by Kopylov (2016).
Our analysis assumes conditional exchangeability of the underlying stochastic process. This guarantees asymptotic convergence of the learning process. The question not addressed in the paper is whether or not conditional exchangeability can be weakened to the case in which observed data are Cesàro summable. The relationship between Cesàro summable sequences and exchangeable processes is well understood; see, for example, Kingman (1978, Section 3(c)). Such an approach would have the added advantage of providing a non-probabilistic perspective on learning.
One of the open questions has to do with how the decision maker foresees future sample paths, having observed past samples. How do we incorporate into the model a decision maker who has the ability to theorize about plausible future sample paths? The philosophical approach we have in mind for this is Zabell's conception of "unanticipated knowledge" (Zabell 1992, 2005). In our model, the decision maker can only conceive counterfactual future experiments based on past observations. The limitation of this approach is that the decision maker has no theory in mind for the data generating process. Finally, the model presented here allows for "awareness of unawareness" of consequences. However, it does not incorporate the possibility of "awareness of unawareness" regarding future actions. We believe that our model can be generalized to allow for the possibility of learning "unanticipated actions" that are revealed over time. The main difficulty is the interpretation of asymptotic results under such a generalization. In particular, how does one define and interpret a limiting process of observed available actions? We think this is a potentially interesting avenue for future research.
Funding Open Access funding enabled and organized by CAUL and its Member Institutions.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
[2 ⇒ 1] The proof is straightforward and is omitted.

C Proofs of Sect. 4.2
To prove the main result in Sect. 4.2, we need two lemmas. The first essentially shows that the empirical average of any bounded and measurable function, computed from some fixed period t′ onwards, approaches the conditional expectation of that function. The second lemma establishes that the decision maker's assessment of that function converges to the limiting conditional expectation.
Lemma 7 Take some t′ ≥ 1 and let g : C̄^A → R be a bounded, B-measurable function.
If (X_t)_{t≥1} is conditionally exchangeable, then there exists a random variable V_g such that and, for t ≥ t′,

Proof of Lemma 7
Fix t′ ≥ 1 and let g : C̄^A → R be a B-measurable function. By Berti et al. (2004, Lemma 2.1), there exists a random variable V_g such that E_μ[g(X_{t+1}) | G_t] → V_g μ-almost surely. For t ≥ t′, define

The sequence (Y_t)_{t≥t′} is a uniformly integrable martingale with respect to (G_t)_{t≥t′} and thus converges μ-almost surely. Taking into account Berti et al. (2004, Lemma 2.1), an application of Kronecker's lemma gives

This concludes the proof of the lemma.
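For reference, the standard form of Kronecker's lemma invoked in the last step is the following (stated in our own notation, not the paper's):

```latex
% Kronecker's lemma (standard statement): if (b_n) increases to infinity
% and the series of x_n / b_n converges, then the b_n-scaled partial
% sums of x_n vanish.
\[
  b_n \uparrow \infty
  \quad\text{and}\quad
  \sum_{n \ge 1} \frac{x_n}{b_n} \ \text{converges}
  \;\Longrightarrow\;
  \frac{1}{b_n} \sum_{k=1}^{n} x_k \longrightarrow 0 .
\]
```

Applied with b_n = n and x_n the martingale differences underlying (Y_t)_{t≥t′}, it converts almost sure convergence of the series into almost sure convergence of the empirical averages.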
Lemma 8 Take some t′ ≥ 1 and let g : C̄^{A_{t′}} → R be a bounded, B_{t′}-measurable function. If (X_t)_{t≥1} is conditionally exchangeable, then there exists a random variable V_g such that E_μ[g(X_{t+1}|A_{t′}) | G_t] → V_g μ-a.s. and

Proof For every t ≥ t′, define

The sequence (Z_t)_{t≥t′} is a uniformly integrable martingale with respect to (G_t)_{t≥t′} and, thus, converges μ-almost surely to a random variable Z. Furthermore, if t ≥ t′, then h_k|A_{t′} = h′_k|A_{t′} = X_k|A_{t′} for every k ≤ t and h, h′ ∈ H_t. Thus,

max_{ρ∈P_{H_t}}

and, because 2(α + t′ − 1)/(α + t) → 0 surely, the desired convergence for g follows from Lemma 7. The same argument applied to case (2) completes the proof.
We are ready to prove Theorem 4.

Proof of Theorem 4
For each bounded, B_{t′}-measurable function g : C̄^{A_{t′}} → R, let N_g ∈ F denote the set, with μ(N_g) = 0, outside of which the convergence shown in Lemma 8 holds. Because G is countably generated, there exists a countable subset G_0 of bounded, measurable functions on C̄^{A_{t′}} such that the convergence is uniform for every g ∈ G if and only if it is uniform for every g ∈ G_0. Noting that μ(∪_{g∈G_0} N_g) = 0 completes the proof.