Decision-making with limited resources
The difference between the random belief model underlying Thompson sampling and the standard maximum expected utility model for decision-making is highlighted by contrasting two simple decision scenarios depicted in Figure 1. The goal is to predict the outcome when throwing one of two possible biased coins. A rational decision maker places bets (shown inside speech bubbles) such that his subjective expected utility is maximized. These subjective beliefs are delimited within dotted boxes. A Thompson sampling agent first samples a random belief and then chooses the best prediction with respect to this belief.
The difference between the two becomes clear by inspecting the expected utility in each case: they are

max_x Σ_θ P(θ) U(x,θ)   (a)
Σ_θ P(θ) max_x U(x,θ)   (b)

respectively, where U(x,θ) denotes the expected utility of predicting x when the coin bias is θ, and the labels (a) and (b) correspond to the labels in Figure 1. Here it is clearly seen that the difference between the two lies in the order in which we apply the expectation (over the environment parameter) and the maximization operator. It should also be noted that the expected utility of (b) is an upper bound on the expected utility of (a). Yet, both cases can constitute optimal decisions depending on constraints. In (a), the decision-maker picks his action taking into account the uncertainty over the bias, while in (b), the decision-maker picks his action only after his beliefs over the coin bias are instantiated—that is, he is optimal w.r.t. his random beliefs. Here we consider how this optimality w.r.t. random beliefs can be regarded as a form of optimal decision-making under information processing constraints.
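To make this contrast concrete, the following minimal sketch computes the two quantities for a hypothetical version of the coin problem; the biases 0.8 and 0.4 and the uniform prior are illustrative assumptions, not the values used in Figure 1.

```python
import numpy as np

# Hypothetical setup: two coins with assumed head-probabilities and a uniform prior.
thetas = np.array([0.8, 0.4])    # assumed P(Head | coin)
prior = np.array([0.5, 0.5])     # assumed prior over the two coins

# U[x, theta]: expected utility (probability of a correct prediction) of
# predicting x (row 0 = Head, row 1 = Tail) when the coin has bias theta.
U = np.stack([thetas, 1.0 - thetas])

# (a) maximize the expected utility: max_x sum_theta P(theta) U(x, theta)
utility_a = np.max(U @ prior)
# (b) expected utility under random beliefs: sum_theta P(theta) max_x U(x, theta)
utility_b = prior @ np.max(U, axis=0)

print(f"(a) maximum of the expectation: {utility_a:.3f}")   # 0.600
print(f"(b) expectation of the maximum: {utility_b:.3f}")   # 0.700, an upper bound on (a)
```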
Modeling bounded rational decision-making
Here we consider a particular information-theoretic model of bounded rational decision-making that formalizes limited information processing resources by a variational principle that trades off expected utility gains (or losses) and entropic information costs (Ortega [2011a]; Ortega and Braun [2011, 2012a, 2013]). Information processing costs are usually ignored in the study of perfectly rational decision-makers. Given a choice set X with choices x ∈ X and utilities U(x), a perfectly rational decision-maker would always choose the best option x* = arg max_x U(x)—presupposing there is a unique maximum. In general, a bounded rational decision-maker is unable to pick out the best option with certainty, and his choice can be described by a probability distribution P(x) reflecting uncertainty. Improving the choice strategy P(x) can be understood as a costly search process.
Let us assume the initial strategy of the decision-maker can be described by a probability distribution P0(x). The search process for the optimum transforms this initial choice into a final choice P(x). In the case of the perfectly rational decision-maker the final choice is the degenerate distribution that places all probability mass on the optimum x*. In the general case of the bounded rational decision-maker the search is costly and he will not be able to afford such a stark reduction in uncertainty. Assuming that search costs are real-valued, additive and higher for rare events (Ortega and Braun [2010c]), it can be shown that the cost of the search is determined by the information distance D_KL between P0 and P, that is

D_KL(P||P0) = Σ_x P(x) log ( P(x) / P0(x) ).

Both Bayesian search (Jaynes [1985]) and Koopman's random search (Stone [1998]) are compatible with these assumptions, as are the energetic costs that would have to be paid by a Maxwellian demon for reducing uncertainty in statistical physical systems (Ortega and Braun [2013]). How this information-theoretic model of search costs relates to computational resources such as space and time complexity is still an open problem (Vitanyi [2005]).
Simple decisions
The decision process is modeled as a transformation of a prior choice probability P0 into a posterior choice probability P by taking into account the utility gains (or losses) and the transformation costs arising from information processing, such that
P = arg max_P { Σ_x P(x) U(x) − (1/α) D_KL(P||P0) },   (7)
where the x are the possible outcomes, P0(x) are their prior probabilities, and U(x) are their utilities. The inverse temperature α≥0 can be regarded as a rationality parameter that translates the cost of information processing measured in units of information into units of utility. If the limits in information processing capabilities are given as a constraint D_KL(P||P0) ≤ K with some positive constant K, then α is determined as a Lagrange multiplier. The maximizing distribution is the equilibrium distribution
P(x) = (1/Z_α) P0(x) exp(α U(x))   with   Z_α = Σ_x' P0(x') exp(α U(x')),   (8)
and represents the choice probabilities after deliberation—see Theorem 1.1.3 (Keller [1998]) and (Ortega and Braun [2013]) for a proof. The value V of the choice set under choice probabilities P can be determined from the same variational principle
V = Σ_x P(x) U(x) − (1/α) D_KL(P||P0) = (1/α) log Z_α.   (9)
For the two different limits of α, the value and the equilibrium distribution take the asymptotic forms

α→∞:  V → max_x U(x)   and   P(x) → U_X*(x),
α→0:   V → Σ_x P0(x) U(x)   and   P(x) → P0(x),

where U_X* is the uniform distribution over the maximizing subset X* = arg max_x U(x). It can be seen that a perfectly rational agent with α→∞ is able to pick out the optimal action—which is a deterministic policy in the case of a single optimum—whereas finitely rational agents have stochastic policies with a non-zero probability of picking a sub-optimal action.
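As a minimal numerical sketch, the following code evaluates the equilibrium distribution and the value of Equations (8) and (9) for an assumed toy choice set (the prior and the utilities are illustrative), showing how the solution interpolates between the two limits discussed above.

```python
import numpy as np

def equilibrium(p0, U, alpha):
    """Equilibrium distribution and value of Equations (8) and (9)."""
    weights = p0 * np.exp(alpha * U)
    Z = weights.sum()
    return weights / Z, np.log(Z) / alpha

# Assumed toy choice set (illustrative numbers only).
p0 = np.array([0.25, 0.25, 0.25, 0.25])   # uniform prior strategy
U = np.array([1.0, 0.5, 0.2, 0.0])        # utilities

for alpha in (0.1, 1.0, 10.0, 100.0):
    p, V = equilibrium(p0, U, alpha)
    print(f"alpha = {alpha:6.1f}   p = {np.round(p, 3)}   V = {V:.3f}")
# Small alpha: p stays close to p0 and V approaches the prior expected utility;
# large alpha: p concentrates on the maximizing option and V approaches max U.
```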
The model of bounded rational decision-making also lends itself to an interpretation in terms of sampling complexity. If we use a rejection sampling scheme to obtain samples from p(x) by first sampling from p0(x), we can ask how many samples we will need on average from p0 to obtain one sample from p. In this scheme, we produce a sample x∼p0(x) and then decide whether to accept or reject the sample based on the criterion
u ≤ exp( α ( U(x) − T ) ),   (10)
where u is drawn from the uniform distribution on [0,1] and T is the acceptance target value with T ≥ max_x U(x). The equality holds for the most efficient sampler, but requires knowledge of the maximum. With this sampling scheme, the accepted samples will be distributed according to Equation (8). The average number of samples needed from p0 to produce one sample of p is then
exp(αT) / Z_α = exp(αT) / Σ_x p0(x) exp(α U(x)).   (11)
The important point about Equation (11) is that the average number of samples increases with increasing rationality parameter α. In fact, the average number of samples will grow exponentially for large α when T > max_x U(x), as

exp(αT) / Z_α ≈ (1/p0(x*)) exp( α (T − U(x*)) ),

where x* = arg max_x U(x). It can also be straightforwardly seen that
exp(αT) / Z_α ≥ exp( D_KL(p||p0) ),   (12)
because D_KL(p||p0) = α Σ_x p(x) U(x) − log Z_α ≤ αT − log Z_α; that is, the exponential of the Kullback-Leibler divergence provides a lower bound on the average number of samples.
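The following sketch implements this rejection scheme for an assumed toy choice set and compares the empirical number of proposals per accepted sample with the exact expression of Equation (11) and the lower bound of Equation (12); all numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy choice set (illustrative numbers only).
p0 = np.array([0.25, 0.25, 0.25, 0.25])
U = np.array([1.0, 0.5, 0.2, 0.0])
alpha = 3.0
T = U.max()                              # most efficient choice, T = max_x U(x)

Z = np.sum(p0 * np.exp(alpha * U))
p = p0 * np.exp(alpha * U) / Z           # target distribution of Equation (8)

def sample_once():
    """Propose x ~ p0 until one passes the test of Equation (10); return (x, #proposals)."""
    proposals = 0
    while True:
        proposals += 1
        x = rng.choice(len(p0), p=p0)
        if rng.uniform() <= np.exp(alpha * (U[x] - T)):
            return x, proposals

samples, counts = zip(*(sample_once() for _ in range(20000)))
print("empirical distribution :", np.round(np.bincount(samples) / len(samples), 3))
print("target p of Eq. (8)    :", np.round(p, 3))
print("avg. proposals per draw:", round(np.mean(counts), 2))
print("exp(alpha*T)/Z, Eq.(11):", round(np.exp(alpha * T) / Z, 2))
print("exp(KL) bound, Eq. (12):", round(np.exp(np.sum(p * np.log(p / p0))), 2))
```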
Decisions in the presence of latent variables
To model a Thompson sampling agent, we need at least a two-step decision with a variable x that has to be chosen by the agent and a variable θ that is chosen by the environment. In the example described in Figure 1, the variable x is the agent’s prediction for the outcome of a coin toss, and the variable θ indicates nature’s choice of which of the two coins is tossed. The agent’s prediction can take on the values x=H and x=T corresponding to the outcomes Head and Tail. The variable θ takes on the two values and corresponding to the biases of the two coins. The prior probability over θ is and . The expected rewards for all combinations of x and θ are then , , and .
In the case of two-step decisions, the variational problem can in general be formulated as a nested expression (Ortega and Braun [2011, 2012a, 2013])
max_{p(x)} Σ_x p(x) [ U(x) − (1/α) log ( p(x) / p0(x) ) + max_{p(θ|x)} Σ_θ p(θ|x) ( U(x,θ) − (1/β) log ( p(θ|x) / p0(θ|x) ) ) ]   (13)
with the two different rationality parameters α and β for the two different variables x and θ. Limited information processing resources with respect to these variables can also be thought of as different degrees of control. For example, if α assumed a large value, the decision-maker could basically hand-pick a particular x, or if θ was determined by a coin toss that the agent cannot influence, we could model this by setting β to zero. The utility can in general depend on both action and observation variables. However, since the action by itself does not yield a reward in our case, we have U(x)≡0. Moreover, we see that in our case, nature’s probability of flipping either coin does not actually depend on the agent’s prediction, so we can replace the conditional probabilities p(θ|x) by p(θ). We have then an inner variational problem:
max_{p(θ|x)} Σ_θ p(θ|x) ( U(x,θ) − (1/β) log ( p(θ|x) / p0(θ) ) )   (14)
with the solution
p(θ|x) = (1/Z_β(x)) p0(θ) exp( β U(x,θ) )   (15)
and the normalization constant Z_β(x) = Σ_θ p0(θ) exp( β U(x,θ) ), and an outer variational problem
max_{p(x)} Σ_x p(x) ( (1/β) log Z_β(x) − (1/α) log ( p(x) / p0(x) ) )   (16)
with the solution
p(x) = (1/Z) p0(x) Z_β(x)^(α/β)   (17)
and the normalization constant Z = Σ_x p0(x) Z_β(x)^(α/β). From Equation (17) we can derive both the perfectly rational decision-maker and the Thompson sampling agent. To simplify, we assume in the following that the agent has no prior preference for x, that is, p0(x) is uniform.
The perfectly rational decision-maker is obtained in the limit α→∞ and β→0. If we first take the limit β→0, a decision-maker with rationality α chooses x with probability
p(x) = (1/Z) exp( α Σ_θ p0(θ) U(x,θ) )   (18)
The perfectly rational expected utility maximizer as depicted in Figure 1a is then obtained from Equation (18) by taking the limit α→∞.
In contrast, the Thompson sampling agent is obtained when β=α. In this case, the choice probability for x is given by
p(x) = (1/Z) Σ_θ p0(θ) exp( α U(x,θ) )   (19)
The resulting agent is a probabilistic superposition of agents that act optimally for any given θ as depicted in Figure 1b. It can be seen that in Equation (19) and in Equation (18) the order of the expectation operation and the (soft-)maximization operation is reversed.
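To see how Equations (17)-(19) behave, the following sketch evaluates the general solution and its two special cases for an assumed utility table of the coin-prediction problem; the utilities and the non-uniform prior over θ are illustrative choices, not the values of Figure 1.

```python
import numpy as np

def choice_probabilities(U, p_theta, alpha, beta):
    """Choice distribution of Equation (17), p(x) ∝ p0(x) Z_beta(x)^(alpha/beta), with uniform p0(x)."""
    Z_beta = (p_theta * np.exp(beta * U)).sum(axis=1)   # Z_beta(x) = sum_theta p0(theta) exp(beta U(x,theta))
    weights = Z_beta ** (alpha / beta)
    return weights / weights.sum()

# Assumed utility table U(x, theta): rows are predictions x, columns are coins theta.
U = np.array([[0.8, 0.4],
              [0.2, 0.6]])
p_theta = np.array([0.7, 0.3])   # assumed (non-uniform) prior over theta
alpha = 10.0

# beta -> 0 yields Equation (18): a soft-maximization of the expected utility,
p_18 = choice_probabilities(U, p_theta, alpha, beta=1e-8)
# whereas beta = alpha yields Equation (19): an expectation of exponentiated utilities.
p_19 = choice_probabilities(U, p_theta, alpha, beta=alpha)

print("Eq. (18), beta -> 0   :", np.round(p_18, 4))
print("Eq. (19), beta = alpha:", np.round(p_19, 4))
# The two differ because the expectation over theta and the (soft-)maximization
# over x do not commute.
```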
Again we can interpret this formalism in terms of sampling complexity. Here we should accept a sample x∼p0(x) if it fulfils the criterion
u ≤ ( Z_β(x) / exp(βT) )^(α/β)   (20)
where u is drawn from the uniform distribution and T ≥ max_x max_θ U(x,θ). From Equation (11) we know that the ratio Z_β(x)/exp(βT) is the acceptance probability of a sample θ∼p0(θ). In order to accept one sample of x, we thus need to accept N consecutive samples of θ, with acceptance criterion
u ≤ exp( β ( U(x,θ) − T ) )   (21)
with u drawn from the uniform distribution and T as set above. Since α ≥ β we can assume α ≈ Nβ with N ∈ ℕ, and we can see easily that the perfectly rational agent will require infinitely many θ samples (α→∞ and β→0) to obtain one sample of x, whereas the Thompson sampling agent will only require a single sample (α=β). The Thompson sampling agent is therefore the agent that can solve the optimization problem of Equation (16) for a given α with the least amount of samples. This can also be seen from Equation (18) when doing a Monte Carlo approximation by drawing N samples θ_i ∼ p0(θ). For infinitely many samples the average approximates the expectation, whereas for a single sample Equation (18) turns into Equation (19). This sampling procedure also allows estimating the upper and lower bounds of the optimal utility (Tziortziotis et al. [2013]). Of course, the Thompson sampling agent will not achieve the same expected utility as the perfectly rational agent. But both agents can be considered optimal under particular information processing constraints.
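The following sketch implements the nested rejection scheme in the Thompson case β = α, where a single θ draw per proposed x suffices, and checks that the accepted samples follow Equation (19); it reuses the illustrative numbers of the previous sketch.

```python
import numpy as np

rng = np.random.default_rng(1)

# Same illustrative numbers as in the previous sketch.
U = np.array([[0.8, 0.4],
              [0.2, 0.6]])
p0_x = np.array([0.5, 0.5])       # uniform prior over predictions x
p0_theta = np.array([0.7, 0.3])   # assumed prior over theta
alpha = beta = 3.0                # Thompson case: alpha = beta, hence N = 1
T = U.max()                       # acceptance target, T >= max_x max_theta U(x, theta)

def sample_x():
    """Nested rejection sampler: one theta draw per proposed x when alpha = beta."""
    while True:
        x = rng.choice(2, p=p0_x)           # propose x ~ p0(x)
        theta = rng.choice(2, p=p0_theta)   # a single theta ~ p0(theta)
        if rng.uniform() <= np.exp(beta * (U[x, theta] - T)):   # criterion of Equation (21)
            return x

samples = [sample_x() for _ in range(20000)]

# Compare with Equation (19): p(x) ∝ p0(x) sum_theta p0(theta) exp(alpha U(x, theta)).
p_19 = p0_x * (p0_theta * np.exp(alpha * U)).sum(axis=1)
p_19 /= p_19.sum()
print("empirical:", np.round(np.bincount(samples) / len(samples), 3))
print("Eq. (19) :", np.round(p_19, 3))
```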
Causal induction
A generalized Thompson-sampling agent can be thought of as a probabilistic superposition of models θ, where each model θ is characterized by a likelihood model P(o_t | θ, a_{≤t}, o_{<t}) and a policy model P(a_t | θ, a_{<t}, o_{<t}). In previous applications we assumed that all models θ have the same causal structure, i.e. considering multivariate random variables a_t and o_t, we assumed that the same variables a_t are intervened for all θ and the same causal model is used to predict the consequences of these interventions on the observational variables o_t. However, this need not be the case. In principle, different models θ could represent different causal structures and suggest intervention of different variables. Such a setup can be used for causal induction as illustrated in the following example.
Imagine we are working on a medical treatment that involves two gene sites X and Y, each of which can be active or inactive. We encode the 'on' and 'off' states of X as X=x and X=¬x, and similarly we write Y=y and Y=¬y to denote the 'on' and 'off' states of Y. Assume we are unsure about the causal mechanism between the two variables, that is, we are unsure whether the activity of X influences the activity of Y or the other way around. Formally, we are interested in the explanatory power of two competing causal hypotheses: either 'X causes Y' (Θ=θ) or 'Y causes X' (Θ=¬θ). Assume our goal is to have Y in an active state, but that it is much cheaper and easier to manipulate X instead of Y. This leaves us with the following policies. If X causes Y we prefer to manipulate X, because it is cheap and easy. If Y causes X we have no other choice but to directly manipulate Y. When manipulating either gene, we can be 100% sure that the new state of the gene is set by us, but we only have a 50% chance that the state will be 'on'. Assume that not manipulating either variable is not an option, because then both gene sites stay inactive. The question is how we should act if we do not know the causal dependency.
One of the main methods to deal with problems of causal inference is the framework of causal graphical models (Pearl [2000]). Given a graph that represents a causal structure, we can intervene this graph and ask questions about the probabilities of the variables in the graph. However, in causal induction we would like to discover the causal structure itself, that is, we would like to do inference over a multitude of graphs representing different causal structures (Heckerman et al. [1999]). If one would like to represent the problem of causal discovery graphically, the main challenge is that the model Θ is a random variable that controls the causal structure itself. However, as argued in (Ortega [2011]), this difficulty can be overcome by using a probability tree to model the causal structure over the random events. Probability trees can encode alternative causal realizations, and in particular alternative causal hypotheses (Shafer [1996]). For instance, Figure 2a encodes the probabilities and functional dependencies among the random variables of the previous problem.
In a probability tree, each (internal) node is a causal mechanism; hence a path from the root node to one of the leaves corresponds to a particular sequential realization of causal mechanisms. The logic underlying the structure of this tree is as follows:
1. Causal precedence: A node causally precedes its descendants. For instance, the root node corresponding to the sure event Ω causally precedes all other nodes.
2. Resolution of variables: Each node resolves the value of a random variable. For instance, given the node corresponding to Θ=θ and X=¬x, either Y=y will happen with probability or Y=¬y with probability .
3. Heterogeneous order: The resolution order of random variables can vary across different branches. For instance, X precedes Y under Θ=θ, but Y precedes X under Θ=¬θ. This is precisely how we model competing causal hypotheses.
While the probability tree represents the agent’s subjective model explaining the order in which the random values are resolved, it does not necessarily correspond to the temporal order in which the events are revealed to us. So for instance, under hypothesis Θ=θ, the value of the variable Y might be revealed before X, even though X causally precedes Y; and the causal hypothesis Θ, which precedes both X and Y, is never observed.
Consider a Thompson sampling agent that uses the beliefs outlined in Figure 2 and runs a single experiment. The agent does so by first manipulating X and then observing Y:
1. Manipulating X: First, the agent instantiates his random beliefs by sampling the value of Θ from the prior P(Θ). Assume that the result is θ. Treating θ as if it was the true parameter, he proceeds to sample the action from P(X|θ), which assigns probability 1/2 to each of x and ¬x, as indicated in the left branch of the probability tree. Assume that the outcome is x, and that this is the action that the agent executes. Because of this, the agent has to update his beliefs, first by intervening the probability tree with X=x and second by conditioning on x. The intervention is carried out by replacing all the nodes in the tree that resolve the value of X with new nodes assigning probability one to x and zero to ¬x. Figure 2b illustrates the result of this intervention. The posterior over Θ is then still equal to the prior. In other words, the agent has switched on X, and has so far learned nothing from this fact.
2. Observing Y: Now the agent observes the activity of Y, and assume that it is active, i.e. Y=y. Then the posterior beliefs of the agent are obtained by conditioning the intervened tree of Figure 2b on Y=y. Since this posterior probability of θ is larger than its prior probability, the agent has gathered evidence favoring the hypothesis “X causes Y”. This was only possible because the intervention introduced a statistical asymmetry among the two hypotheses that did not exist in the beginning. In comparison, if the action is not treated as an intervention, then the posterior obtained by conditioning the original tree of Figure 2a on X=x and Y=y is equal to the prior; that is, the agent doesn’t learn anything just from observing. This also highlights the importance of interventions (Box [1966]). A numerical sketch of both updates is given below.
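The following minimal sketch reproduces both update rules numerically. The symmetric prior and the mechanism probability q = 0.8 are placeholder assumptions of the sketch; Figure 2 supplies the actual numbers.

```python
# Assumed placeholder numbers for the gene example: a symmetric prior over the two
# hypotheses and a mechanism probability q = P(effect 'on' | cause 'on') = 0.8.
prior = {"X->Y": 0.5, "Y->X": 0.5}
q = 0.8

# (1) Intervention do(X = on), then observing Y = on. Under 'X causes Y' the
#     likelihood of Y = on is q; under 'Y causes X' the intervention on X leaves
#     Y at its marginal probability 1/2 (the root mechanism of the tree).
lik_intervention = {"X->Y": q, "Y->X": 0.5}
post_intervention = {h: lik_intervention[h] * prior[h] for h in prior}
norm = sum(post_intervention.values())
post_intervention = {h: p / norm for h, p in post_intervention.items()}

# (2) Mere conditioning on X = on and Y = on (no intervention). Both hypotheses
#     assign the same joint probability (1/2) * q to this observation, so the
#     posterior stays at the prior.
lik_observation = {"X->Y": 0.5 * q, "Y->X": 0.5 * q}
post_observation = {h: lik_observation[h] * prior[h] for h in prior}
norm = sum(post_observation.values())
post_observation = {h: p / norm for h, p in post_observation.items()}

print("after intervention:", post_intervention)   # evidence favoring 'X causes Y'
print("after observation :", post_observation)    # unchanged, equal to the prior
```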
Naturally, multiple interventions and observations can be executed in succession. In this case Thompson sampling is used in each time step to decide which policy model to use, which in turn determines which variables to intervene. Then, after the intervention, all variables are revealed simultaneously at every time step of the inference process. The update of the observational probabilities is done in the same way as in the one-step case, taking into account which variables were intervened. A simulation of the repeated Thompson sampling process for causal induction of our example system is shown in Figure 3. This very simple example contains the principles of causal induction using Thompson sampling. Of course, more complex causal structures require richer model classes, as is customary in Bayesian modeling. But importantly, the essence of causal induction is already contained in our simple illustration.
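In the same spirit as the simulation shown in Figure 3, the following sketch runs the repeated procedure on the gene example. All numbers (the true structure 'X causes Y', the mechanism probabilities 0.8/0.2, the 50% manipulation outcome, and the treatment of a non-intervened cause as 'on' with probability 1/2) are illustrative assumptions of the sketch, not the values used in Figures 2 and 3.

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed mechanism: an active cause switches its effect on with probability 0.8,
# an inactive cause with probability 0.2; manipulation switches a gene on with
# probability 0.5; the (assumed) true structure is 'X causes Y'.
q_on, q_off = 0.8, 0.2
posterior = np.array([0.5, 0.5])            # [P('X causes Y'), P('Y causes X')]

def effect_prob(cause_is_on):
    return q_on if cause_is_on else q_off

for step in range(100):
    # 1. Thompson step: sample a hypothesis from the current posterior and use the
    #    associated policy (manipulate X under 'X causes Y', otherwise manipulate Y).
    manipulate_x = rng.uniform() < posterior[0]

    # 2. Simulate the (assumed) true environment 'X causes Y'.
    if manipulate_x:
        x_on = rng.uniform() < 0.5                    # intervened cause
        y_on = rng.uniform() < effect_prob(x_on)      # effect follows the mechanism
    else:
        y_on = rng.uniform() < 0.5                    # intervened effect
        x_on = rng.uniform() < 0.5                    # cause unaffected (assumption)

    # 3. Update with intervened likelihoods: the manipulated gene carries no evidence;
    #    the other gene is predicted by its mechanism under the hypothesis in which it
    #    is the effect, and by its marginal 0.5 under the competing hypothesis.
    if manipulate_x:
        p_y = effect_prob(x_on) if y_on else 1 - effect_prob(x_on)
        likelihood = np.array([p_y, 0.5])
    else:
        p_x = effect_prob(y_on) if x_on else 1 - effect_prob(y_on)
        likelihood = np.array([0.5, p_x])
    posterior = posterior * likelihood
    posterior /= posterior.sum()

print("posterior after 100 steps:", np.round(posterior, 3))   # concentrates on 'X causes Y'
```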