## Abstract

Information-theoretic principles for learning and acting have been proposed to solve particular classes of Markov Decision Problems. Mathematically, such approaches are governed by a variational free energy principle and allow solving MDP planning problems with information-processing constraints expressed in terms of a Kullback-Leibler divergence with respect to a reference distribution. Here we consider a generalization of such MDP planners by taking model uncertainty into account. As model uncertainty can also be formalized as an information-processing constraint, we can derive a unified solution from a single generalized variational principle. We provide a generalized value iteration scheme together with a convergence proof. As limit cases, this generalized scheme includes standard value iteration with a known model, Bayesian MDP planning, and robust planning. We demonstrate the benefits of this approach in a grid world simulation.




## 1 Introduction

The problem of planning in Markov Decision Processes was famously addressed by Bellman who developed the eponymous principle in 1957 [2]. Since then, numerous variants of this principle have flourished in the literature. Here we are particularly interested in a generalization of the Bellman principle that takes information-theoretic constraints into account. In the recent past there has been a special interest in the Kullback-Leibler divergence as a constraint to limit deviations of the action policy from a prior. This can be interesting in a number of ways. Todorov [31, 32], for example, has transformed the general MDP problem into a restricted problem class without explicit action variables, where control directly changes the dynamics of the environment and control costs are measured by the Kullback-Leibler divergence between controlled and uncontrolled dynamics. This simplification allows mapping the Bellman recursion to a linear algebra problem. This approach can also be generalized to continuous state spaces leading to path integral control [4, 5]. The same equations can also be interpreted in terms of *bounded rational* decision-making where the decision-maker has limited computational resources that allow only limited deviations from a prior decision strategy (measured by the Kullback-Leibler divergence in bits) [19]. Such a decision-maker can also be instantiated by a sampling process that has restrictions in the number of samples it can afford [20]. Disregarding the possibility of a sampling-based interpretation, the Kullback-Leibler divergence introduces a control information cost that is interesting in its own right when formalizing the perception-action cycle [30].

While the above frameworks have led to interesting computational advances, so far they have neglected the possibility of model misspecification in the MDP setting. Model misspecification or model uncertainty does not refer to the uncertainty arising due to the stochastic nature of the environment (usually called risk-uncertainty in the economic literature), but refers to the uncertainty with respect to the latent variables that specify the MDP. In Bayes-Adaptive MDPs [7], for example, the uncertainty over the latent parameters of the MDP is explicitly represented, such that new information can be incorporated with Bayesian inference. However, Bayes-Adaptive MDPs are not robust with respect to model misspecification and have no performance guarantees when planning with wrong models [15]. Accordingly, there has been substantial interest in developing robust MDP planners [13, 16, 33]. One way to take model uncertainty into account is to bias an agent’s belief model from a reference Bayesian model towards worst-case scenarios; thus avoiding disastrous outcomes by not visiting states where the transition probabilities are not known. Conversely, the belief model can also be biased towards best-case scenarios as a measure to drive exploration—also referred to in the literature as *optimism in the face of uncertainty* [28, 29].

When comparing the literature on information-theoretic control and model uncertainty, it is interesting to see that some notions of model uncertainty follow exactly the same mathematical principles as the principles of relative entropy control [32]. In this paper we therefore formulate a unified and combined optimization problem for MDP planning that takes *both* model uncertainty and bounded rationality into account. This new optimization problem can be solved by a generalized value iteration algorithm. We provide a theoretical analysis of its convergence properties and simulations in a grid world.

## 2 Background and Notation

In the MDP setting the agent at time *t* interacts with the environment by taking action \({a_{t}}\in \mathcal A\) while in state \({s_{t}}\in \mathcal {S}\). Then the environment updates the state of the agent to \({s_{t+1}}\in \mathcal {S}\) according to the transition probabilities \(T({s_{t+1}}|{a_{t}},{s_{t}})\). After each transition the agent receives a reward \(R_{{s_{t}},{a_{t}}}^{{s_{t+1}}}\in \mathcal R\) that is bounded. For our purposes we will consider \(\mathcal A\) and \(\mathcal S\) to be finite. The aim of the agent is to choose its policy \({\pi (a|s)}\) in order to maximize the total discounted expected reward or value function for any \(s \in \mathcal S\)

$$V^*(s) = \max _{\pi } \, {\mathbb {E}}\left[ \sum _{t=0}^{\infty } \gamma ^t R_{{s_{t}},{a_{t}}}^{{s_{t+1}}} \right] $$

with discount factor \(0 \le \gamma < 1\). The expectation is over all possible trajectories \(\xi = s_0, a_0, s_1 \dots \) of state and action pairs distributed according to \(p(\xi ) = \prod _{t=0}^{T-1}{\pi ({a_{t}}|{s_{t}})} T({s_{t+1}}|{a_{t}}, {s_{t}})\). It can be shown that the optimal value function satisfies the following recursion

$$V^*(s) = \max _a \sum _{s'} T(s'|a,s) \left[ R_{s,a}^{s'} + \gamma V^*(s') \right] $$

At this point there are two important implicit assumptions. The first is that the policy \(\pi \) can be chosen arbitrarily without any constraints which, for example, might not be true for a bounded rational agent with limited information-processing capabilities. The second is that the agent needs to know the transition model \(T(s'|a,s)\), but in practice this model is unknown or even misspecified with respect to the environment’s true transition probabilities, especially at initial stages of learning. In the following, we explain how to incorporate both bounded rationality and model uncertainty into agents.

### 2.1 Information-Theoretic Constraints for Acting

Consider a one-step decision-making problem where the agent is in state *s* and has to choose a single action *a* from the set \(\mathcal A\) to maximize the reward \(R_{s,a}^{s'}\), where \(s'\) is the next state. A perfectly rational agent selects the optimal action, that is \({a^*(s) = \mathop {\text {argmax}}\nolimits _a \sum _{s'} T(s'|a,s)R_{s,a}^{s'}}\). However, a bounded rational agent has only limited resources to find the maximum of the function \(\sum _{s'} T(s'|a,s)R_{s,a}^{s'}\). One way to model such an agent is to assume that the agent has a prior choice strategy \({\rho (a|s)}\) in state *s* *before* a deliberation process sets in that refines the choice strategy to a posterior distribution \({\pi (a|s)}\) that reflects the strategy *after* deliberation. Intuitively, because the deliberation resources are limited, the agent can only afford to deviate from the prior strategy by a certain number of information bits. This can be quantified by the relative entropy \({D_{\text {KL}}}(\pi ||\rho ) = \sum _a {\pi (a|s)}\log \frac{{\pi (a|s)}}{{\rho (a|s)}}\) that measures the average information cost of the policy \({\pi (a|s)}\) using the source distribution \({\rho (a|s)}\). For a bounded rational agent this relative entropy is bounded by some upper limit *K*. Thus, a bounded rational agent has to solve a constrained optimization problem that can be written as

$$\pi ^* = \mathop {\text {argmax}}_{\pi } \sum _a {\pi (a|s)} \sum _{s'} T(s'|a,s) R_{s,a}^{s'} \quad \text {s.t.} \quad {D_{\text {KL}}}(\pi ||\rho ) \le K$$

This problem can be rewritten as an unconstrained optimization problem

$$F^*(s) = \max _{\pi } \sum _a {\pi (a|s)} \left[ \sum _{s'} T(s'|a,s) R_{s,a}^{s'} - \frac{1}{\alpha } \log \frac{{\pi (a|s)}}{{\rho (a|s)}} \right] $$

where \(F^*\) is a free energy that quantifies the value of the policy \(\pi \) by trading off the average reward against the information cost. The optimal strategy can be expressed analytically in closed-form as

$$\pi ^*(a|s) = \frac{{\rho (a|s)}}{Z_\alpha (s)} \exp \left( \alpha \sum _{s'} T(s'|a,s) R_{s,a}^{s'} \right) $$

with partition sum \(Z_\alpha (s)= \sum _a {\rho (a|s)}\exp \left( \alpha \sum _{s'} T(s'|a,s)R_{s,a}^{s'}\right) \). Therefore, the maximum operator in (2) can be eliminated and the free energy can be rewritten as in (3). The Lagrange multiplier \(\alpha \) quantifies the boundedness of the agent. By setting \(\alpha \rightarrow \infty \) we recover a perfectly rational agent with optimal policy \(\pi ^*(a|s) = \delta (a - a^*(s))\). For \(\alpha = 0\) the agent has no computational resources and the agent’s optimal policy is to act according to the prior \(\pi ^*(a|s) = \rho (a|s)\). Intermediate values of \(\alpha \) lead to a spectrum of bounded rational agents.
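As an illustration, the closed-form bounded-rational policy can be computed with a numerically stabilized softmax; this is a minimal sketch (the function name and the toy numbers are ours, not the paper's):

```python
import numpy as np

def bounded_rational_policy(rho, q, alpha):
    """pi*(a|s) proportional to rho(a|s) exp(alpha * q(s,a)),
    where q(s,a) = sum_{s'} T(s'|a,s) R_{s,a}^{s'}.

    rho: prior policy over actions, shape (A,); q: expected rewards, shape (A,).
    alpha = 0 returns the prior; large alpha concentrates on argmax q.
    """
    logits = alpha * q + np.log(rho)
    logits -= logits.max()                 # subtract max for numerical stability
    w = np.exp(logits)
    return w / w.sum()

rho = np.full(4, 0.25)                     # uniform prior over four actions
q = np.array([1.0, 0.5, 0.2, 0.0])
print(bounded_rational_policy(rho, q, 0.0))    # recovers the prior
print(bounded_rational_policy(rho, q, 50.0))   # nearly deterministic on action 0
```

Intermediate values of \(\alpha \) interpolate smoothly between these two extremes.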

### 2.2 Information-Theoretic Constraints for Model Uncertainty

In the following we assume that the agent has a model of the environment \(T_\theta (s'|a,s)\) that depends on some latent variables \(\theta \in \varTheta \). In the MDP setting, the agent holds a belief \({\mu ({\theta }|a,s)}\) regarding the environmental dynamics where \(\theta \) is a probability vector of transition probabilities into all possible states \(s'\). While interacting with the environment the agent can incorporate new data by forming the Bayesian posterior \(\mu (\theta |a,s, D)\), where *D* is the observed data. When the agent has observed an infinite amount of data (and assuming \(\theta ^*(a,s) \in \varTheta \)) the belief will converge to the delta distribution \(\mu (\theta |s,a,D)= \delta (\theta -\theta ^*(a,s))\) and the agent will act optimally according to the true transition probabilities, exactly as in ordinary optimal choice strategies with known models. When acting under a limited amount of data the agent cannot determine the value of an action *a* with the true transition model according to \(\sum _{s'} T(s'|a,s)R_{s,a}^{s'}\), but it can only determine an expected value according to its beliefs \(\int _\theta {\mu ({\theta }|a,s)}\sum _{s'} {T_{\theta }(s'|a,s)}R_{s,a}^{s'}\).

The Bayesian model \(\mu \) can be subject to model misspecification (e.g. by having a wrong likelihood or a bad prior) and thus the agent might want to allow deviations from its model towards best-case (optimistic agent) or worst-case (pessimistic agent) scenarios up to a certain extent, in order to act more robustly or to enhance its performance in a friendly environment [12]. Such deviations can be measured by the relative entropy \({D_{\text {KL}}}(\psi ||\mu )\) between a new biased model \(\psi \) and the Bayesian posterior \(\mu \). Effectively, this allows for mathematically formalizing model uncertainty, by not only considering the specified model but all models within a neighborhood of the specified model that deviate no more than a restricted number of bits. Then, the effective expected value of an action *a* while having limited trust in the Bayesian posterior \(\mu \) can be determined for the case of optimistic deviations as

$$\max _{\psi } \int _\theta {\psi ({\theta }|a,s)} \left[ \sum _{s'} {T_{\theta }(s'|a,s)} R_{s,a}^{s'} - \frac{1}{\beta } \log \frac{{\psi ({\theta }|a,s)}}{{\mu ({\theta }|a,s)}} \right] $$

for \(\beta >0\), and for the case of pessimistic deviations as

$$\min _{\psi } \int _\theta {\psi ({\theta }|a,s)} \left[ \sum _{s'} {T_{\theta }(s'|a,s)} R_{s,a}^{s'} - \frac{1}{\beta } \log \frac{{\psi ({\theta }|a,s)}}{{\mu ({\theta }|a,s)}} \right] $$

for \(\beta <0\). Conveniently, both equations can be expressed as a single equation

$$\frac{1}{\beta } \log Z_\beta (s,a)$$

with \(\beta \in \mathbb {R}\) and \(Z_\beta (s,a) = \int _\theta {\mu ({\theta }|a,s)}\exp \left( \beta \sum _{s'} {T_{\theta }(s'|a,s)}R_{s,a}^{s'}\right) \) when inserting the optimal biased belief

$$\psi ^*({\theta }|a,s) = \frac{{\mu ({\theta }|a,s)}}{Z_\beta (s,a)} \exp \left( \beta \sum _{s'} {T_{\theta }(s'|a,s)} R_{s,a}^{s'} \right) $$

into either Eq. (4) or (5). By adopting this formulation we can model any degree of trust in the belief \(\mu \) allowing deviation towards worst-case or best-case with \(-\infty \le \beta \le \infty \). For the case of \(\beta \rightarrow -\infty \) we recover an infinitely pessimistic agent that considers only worst-case scenarios, for \(\beta \rightarrow \infty \) an agent that is infinitely optimistic and for \(\beta \rightarrow 0\) an agent that fully trusts its model.
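The effective value \(\frac{1}{\beta }\log Z_\beta (s,a)\) can be sketched for a belief discretized over a finite set of candidate models; names and the numerical treatment of \(\beta \rightarrow 0\) are ours:

```python
import numpy as np

def effective_value(mu, q_theta, beta):
    """(1/beta) log sum_m mu(theta_m) exp(beta * q_theta_m).

    mu: belief over M discretized models, shape (M,); q_theta: expected reward
    of the action under each model, shape (M,). beta > 0 is optimistic,
    beta < 0 pessimistic; beta -> 0 recovers the Bayesian expected value.
    """
    if abs(beta) < 1e-12:
        return float(mu @ q_theta)             # fully trusted Bayesian average
    m = np.max(beta * q_theta)                 # log-sum-exp stabilization
    return float((m + np.log(mu @ np.exp(beta * q_theta - m))) / beta)

mu = np.array([0.5, 0.5])                      # two equally plausible models
q_theta = np.array([0.0, 1.0])
print(effective_value(mu, q_theta, 0.0))       # Bayesian mean: 0.5
print(effective_value(mu, q_theta, 1000.0))    # approaches the best case, ~1
print(effective_value(mu, q_theta, -1000.0))   # approaches the worst case, ~0
```

The two extreme calls illustrate the \(\beta \rightarrow \pm \infty \) limits discussed above.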

## 3 Model Uncertainty and Bounded Rationality in MDPs

In this section, we consider a bounded rational agent with model uncertainty in the infinite horizon setting of an MDP. In this case the agent must take into account all future rewards and information costs, thereby optimizing the following free energy objective

$$F^*(s_0) = \max _{\pi } \mathop {\mathrm {ext}}_{\psi } {\mathbb {E}}\left[ \sum _{t=0}^{\infty } \gamma ^t \left( R_{{s_{t}},{a_{t}}}^{{s_{t+1}}} - \frac{1}{\alpha } \log \frac{{\pi ({a_{t}}|{s_{t}})}}{{\rho ({a_{t}}|{s_{t}})}} - \frac{1}{\beta } \log \frac{{\psi ({\theta _t}|{a_{t}},{s_{t}})}}{{\mu ({\theta _t}|{a_{t}},{s_{t}})}} \right) \right] $$

where the extremum operator \(\mathop {\mathrm {ext}}\limits \) can be either \(\max \) for \(\beta >0\) or \(\min \) for \(\beta <0\), \(0 \le \gamma <1\) is the discount factor and the expectation \({\mathbb {E}}\) is over all trajectories \(\xi = s_0, a_0,\theta _0,s_1,a_1,\dots a_{T-1}, \theta _{T-1},s_{T}\) with distribution \(p(\xi ) = \prod _{t=0}^{T-1}{\pi ({a_{t}}|{s_{t}})} {\psi ({\theta _t}|{a_{t}},{s_{t}})} T_{\theta _t}({s_{t+1}}|{a_{t}}, {s_{t}})\). Importantly, this free energy objective satisfies a recursive relation and thereby generalizes Bellman’s optimality principle to the case of model uncertainty and bounded rationality. In particular, Eq. (6) fulfills the recursion

$$F^*(s) = \max _{\pi } \mathop {\mathrm {ext}}_{\psi } \sum _a {\pi (a|s)} \left[ -\frac{1}{\alpha } \log \frac{{\pi (a|s)}}{{\rho (a|s)}} + \int _\theta {\psi ({\theta }|a,s)} \left( -\frac{1}{\beta } \log \frac{{\psi ({\theta }|a,s)}}{{\mu ({\theta }|a,s)}} + \sum _{s'} {T_{\theta }(s'|a,s)} \left[ R_{s,a}^{s'} + \gamma F^*(s') \right] \right) \right] $$

Applying variational calculus and following the same rationale as in the previous sections [19], the extremum operators can be eliminated and Eq. (7) can be re-expressed as

$$F^*(s) = \frac{1}{\alpha } \log \sum _a {\rho (a|s)} \exp \left( \frac{\alpha }{\beta } \log Z_\beta (s,a) \right) $$

with the optimizing arguments

$$\pi ^*(a|s) = \frac{{\rho (a|s)}}{Z_\alpha (s)} \exp \left( \frac{\alpha }{\beta } \log Z_\beta (s,a) \right) , \qquad \psi ^*({\theta }|a,s) = \frac{{\mu ({\theta }|a,s)}}{Z_\beta (s,a)} \exp \left( \beta \sum _{s'} {T_{\theta }(s'|a,s)} \left[ R_{s,a}^{s'} + \gamma F^*(s') \right] \right) $$

and partition sums

$$Z_\beta (s,a) = \int _\theta {\mu ({\theta }|a,s)} \exp \left( \beta \sum _{s'} {T_{\theta }(s'|a,s)} \left[ R_{s,a}^{s'} + \gamma F^*(s') \right] \right) , \qquad Z_\alpha (s) = \sum _a {\rho (a|s)} \exp \left( \frac{\alpha }{\beta } \log Z_\beta (s,a) \right) $$

With this free energy we can model a range of different agents for different \(\alpha \) and \(\beta \). For example, by setting \(\alpha \rightarrow \infty \) and \(\beta \rightarrow 0\) we can recover a Bayesian MDP planner and by setting \(\alpha \rightarrow \infty \) and \(\beta \rightarrow -\infty \) we recover a robust planner. Additionally, for \(\alpha \rightarrow \infty \) and when \({\mu ({\theta }|a,s)}=\delta (\theta - \theta ^*(a,s))\) we recover an agent with standard value function with known state transition model from Eq. (1).

### 3.1 Free Energy Iteration Algorithm

Solving the self-consistency Eq. (8) can be achieved by a generalized version of value iteration. Accordingly, the optimal solution can be obtained by initializing the free energy at some arbitrary value *F* and applying a value iteration scheme \(B^{i+1}F = BB^i F\) where we define the operator

$$(BF)(s) = \max _{\pi } \mathop {\mathrm {ext}}_{\psi } \sum _a {\pi (a|s)} \left[ -\frac{1}{\alpha } \log \frac{{\pi (a|s)}}{{\rho (a|s)}} + \int _\theta {\psi ({\theta }|a,s)} \left( -\frac{1}{\beta } \log \frac{{\psi ({\theta }|a,s)}}{{\mu ({\theta }|a,s)}} + \sum _{s'} {T_{\theta }(s'|a,s)} \left[ R_{s,a}^{s'} + \gamma F(s') \right] \right) \right] $$

with \(B^1F = BF\), which can be simplified to

$$(BF)(s) = \frac{1}{\alpha } \log \sum _a {\rho (a|s)} \exp \left( \frac{\alpha }{\beta } \log \int _\theta {\mu ({\theta }|a,s)} \exp \left( \beta \sum _{s'} {T_{\theta }(s'|a,s)} \left[ R_{s,a}^{s'} + \gamma F(s') \right] \right) \right) $$

In Algorithm (1) we show the pseudo-code of this generalized value iteration scheme. Given state-dependent prior policies \({\rho (a|s)}\), Bayesian posterior beliefs \({\mu ({\theta }|a,s)}\) and the values of \(\alpha \) and \(\beta \), the algorithm outputs the equilibrium distributions for the action probabilities \({\pi (a|s)}\) and the biased beliefs \({\psi ({\theta }|a,s)}\), together with an estimate of the free energy value function \(F^*(s)\). The iteration is run until a convergence criterion is met. Assuming dimensionality *A* for the action space, *S* for the state space, and *B* for the (discretized) belief space, we have a complexity of \(O(A S^2 B)\) per iteration, similar to other value iteration algorithms running on the belief space. The convergence proof of the algorithm is given in the next section.
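For concreteness, the iteration can be sketched with dense arrays, assuming the belief space is discretized into *M* candidate models; all names, the array layout, and the toy usage below are ours, not the paper's pseudo-code:

```python
import numpy as np

def generalized_value_iteration(R, T_models, mu, rho, alpha, beta,
                                gamma=0.9, tol=1e-6, max_iter=10_000):
    """Iterate F <- BF using the nested log-sum-exp form of the operator.

    R[s, a, s']           : bounded rewards
    T_models[m, s, a, s'] : transition model for each of M discretized models
    mu[m, s, a]           : Bayesian belief over the M models per (s, a)
    rho[s, a]             : prior policy
    alpha, beta           : nonzero inverse temperatures for policy and belief
    """
    F = np.zeros(R.shape[0])
    for _ in range(max_iter):
        # model-wise value q[m, s, a] = sum_{s'} T_m(s'|a,s) (R + gamma F(s'))
        q = np.einsum('msap,sap->msa', T_models, R + gamma * F[None, None, :])
        # belief step: (1/beta) log sum_m mu exp(beta q), stabilized
        bq = beta * q
        m0 = bq.max(axis=0)
        Q = (m0 + np.log(np.einsum('msa,msa->sa', mu, np.exp(bq - m0)))) / beta
        # policy step: (1/alpha) log sum_a rho exp(alpha Q), stabilized
        aQ = alpha * Q + np.log(rho)
        a0 = aQ.max(axis=1)
        F_new = (a0 + np.log(np.exp(aQ - a0[:, None]).sum(axis=1))) / alpha
        if np.max(np.abs(F_new - F)) < tol:
            return F_new
        F = F_new
    return F
```

With a single model (\(M=1\)) the belief step collapses to the identity and \(\beta \) has no effect, while large \(\alpha \) recovers standard value iteration with a known model, matching the limit cases discussed above.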

## 4 Convergence

Here, we show that the value iteration scheme described through Algorithm 1 converges to a unique fixed point satisfying Eq. (8). To this end, we first prove the existence of a unique fixed point (Theorem 1) following [3, 25], and subsequently prove the convergence of the value iteration scheme presupposing that a unique fixed point exists (Theorem 2) following [27].

### Theorem 1

Assuming a bounded reward function \(R_{s,a}^{s'}\), the optimal free-energy vector \(F^*(s)\) is a unique fixed point of Bellman’s equation \(F^*=BF^*\), where the mapping \(B:\mathbb {R}^{|\mathcal {S}|} \rightarrow \mathbb {R}^{|\mathcal {S}|}\) is defined as in Eq. (13).

### Proof

Theorem 1 is proven through Propositions 1 and 2 in the following.

### Proposition 1

The mapping \(T_{\pi ,\psi }: \mathbb {R}^{|\mathcal {S}|} \rightarrow \mathbb {R}^{|\mathcal {S}|}\)

$$(T_{\pi ,\psi }F)(s) = \sum _a {\pi (a|s)} \left[ -\frac{1}{\alpha } \log \frac{{\pi (a|s)}}{{\rho (a|s)}} + \int _\theta {\psi ({\theta }|a,s)} \left( -\frac{1}{\beta } \log \frac{{\psi ({\theta }|a,s)}}{{\mu ({\theta }|a,s)}} + \sum _{s'} {T_{\theta }(s'|a,s)} \left[ R_{s,a}^{s'} + \gamma F(s') \right] \right) \right] $$

converges to a unique solution for every policy-belief-pair \((\pi ,\psi )\) independent of the initial free-energy vector *F*(*s*).

### Proof

By introducing the matrix \(P_{\pi ,\psi }(s,s')\) and the vector \(g_{\pi ,\psi }(s)\) as

$$P_{\pi ,\psi }(s,s') = \sum _a {\pi (a|s)} \int _\theta {\psi ({\theta }|a,s)} \, {T_{\theta }(s'|a,s)} , \qquad g_{\pi ,\psi }(s) = \sum _a {\pi (a|s)} \left[ -\frac{1}{\alpha } \log \frac{{\pi (a|s)}}{{\rho (a|s)}} + \int _\theta {\psi ({\theta }|a,s)} \left( -\frac{1}{\beta } \log \frac{{\psi ({\theta }|a,s)}}{{\mu ({\theta }|a,s)}} + \sum _{s'} {T_{\theta }(s'|a,s)} R_{s,a}^{s'} \right) \right] $$

Equation (14) may be expressed in compact form: \(T_{\pi ,\psi }F = g_{\pi ,\psi } + \gamma P_{\pi ,\psi } F\). By applying the mapping \(T_{\pi ,\psi }\) an infinite number of times on an initial free-energy vector *F*, the free-energy vector \(F_{\pi ,\psi }\) of the policy-belief-pair \((\pi ,\psi )\) is obtained:

$$F_{\pi ,\psi } = \lim _{i\rightarrow \infty } T_{\pi ,\psi }^i F = \lim _{i\rightarrow \infty } \left( \gamma ^i P_{\pi ,\psi }^i F + \sum _{j=0}^{i-1} \gamma ^j P_{\pi ,\psi }^j \, g_{\pi ,\psi } \right) = \sum _{j=0}^{\infty } \gamma ^j P_{\pi ,\psi }^j \, g_{\pi ,\psi }$$

which no longer depends on the initial *F*. It is straightforward to show that the quantity \(F_{\pi ,\psi }\) is a fixed point of the operator \(T_{\pi ,\psi }\):

$$T_{\pi ,\psi } F_{\pi ,\psi } = g_{\pi ,\psi } + \gamma P_{\pi ,\psi } \sum _{j=0}^{\infty } \gamma ^j P_{\pi ,\psi }^j \, g_{\pi ,\psi } = \sum _{j=0}^{\infty } \gamma ^j P_{\pi ,\psi }^j \, g_{\pi ,\psi } = F_{\pi ,\psi }$$

Furthermore, \(F_{\pi ,\psi }\) is unique. Assume for this purpose an arbitrary fixed point \(F'\) such that \(T_{\pi ,\psi }F' = F'\), then \(F' = \lim _{i\rightarrow \infty }T_{\pi ,\psi }^iF'=F_{\pi ,\psi }\).

### Proposition 2

The optimal free-energy vector \(F^*=\max _{\pi }\mathop {\mathrm {ext}}\nolimits _{\psi }F_{\pi ,\psi }\) is a unique fixed point of Bellman’s equation \(F^*=BF^*\).

### Proof

The proof consists of two parts where we assume \(\mathop {\mathrm {ext}}\limits = \max \) in the first part and \(\mathop {\mathrm {ext}}\limits = \min \) in the second part respectively. Let \(\mathop {\mathrm {ext}}\limits = \max \) and \(F^*=F_{\pi ^*,\psi ^*}\), where \((\pi ^*,\psi ^*)\) denotes the optimal policy-belief-pair. Then

$$F^* = T_{\pi ^*,\psi ^*} F^* \le \max _{\pi } \max _{\psi } T_{\pi ,\psi } F^* = BF^* = T_{\pi ',\psi '} F^* \le \lim _{i\rightarrow \infty } T_{\pi ',\psi '}^i F^* = F_{\pi ',\psi '} \quad \text {with } (\pi ',\psi ') = \mathop {\text {argmax}}_{\pi ,\psi } T_{\pi ,\psi } F^*$$

where the last inequality can be straightforwardly proven by induction^{Footnote 1} and exploiting the fact that \(P_{\pi ,\psi }(s,s') \in [0;1]\). But by definition \(F^* = \max _{\pi }\max _{\psi }F_{\pi ,\psi } \ge F_{\pi ',\psi '}\), hence \(F^* = F_{\pi ',\psi '}\) and therefore \(F^*=BF^*\). Furthermore, \(F^*\) is unique. Assume for this purpose an arbitrary fixed point \(F'=F_{\pi ',\psi '}\) such that \(F'=BF'\) with the corresponding policy-belief-pair \((\pi ',\psi ')\). Then

$$F' = F_{\pi ',\psi '} \le \max _{\pi } \max _{\psi } F_{\pi ,\psi } = F^*$$

and similarly \(F' \ge F^*\), hence \(F' = F^*\).

Let \(\mathop {\mathrm {ext}}\limits = \min \) and \(F^*=F_{\pi ^*,\psi ^*}\). By taking a closer look at Eq. (13), it can be seen that the optimization over \(\psi \) does not depend on \(\pi \). Then

But by definition \(F^*=\min _{\psi }F_{\pi ^*,\psi } \le F_{\pi ^*,\psi '}\), hence \(F^*=F_{\pi ^*,\psi '}\). Therefore it holds that \(BF^* = \max _\pi \min _\psi T_{\pi ,\psi } F^* = \max _\pi T_{\pi ,\psi ^*} F^*\) and similar to the first part of the proof we obtain

But by definition \(F^* = \max _{\pi }F_{\pi ,\psi ^*} \ge F_{\pi ',\psi ^*}\), hence \(F^*=F_{\pi ',\psi ^*}\) and therefore \(F^*=BF^*\). Furthermore, \(F_{\pi ^*, \psi ^*}\) is unique. Assume for this purpose an arbitrary fixed point \(F'=F_{\pi ',\psi '}\) such that \(F'=BF'\). Then

and similarly \(F^* \le F'\), hence \(F^* = F'\).

### Theorem 2

Let \(\epsilon \) be a positive number satisfying \(\epsilon <\frac{\eta }{1-\gamma }\) where \(\gamma \in [0;1)\) is the discount factor and where *u* and *l* are the bounds of the reward function \(R_{s,a}^{s'}\) such that \(l \le R_{s,a}^{s'}\le u\) and \(\eta =\max \{|u|,|l|\}\). Suppose that the value iteration scheme from Algorithm 1 is run for \(i=\lceil \log _\gamma \frac{\epsilon (1-\gamma )}{\eta } \rceil \) iterations with an initial free-energy vector \(F(s)=0\) for all *s*. Then, it holds that \(\max _s |F^*(s) - B^iF(s)| \le \epsilon \), where \(F^*\) refers to the unique fixed point from Theorem 1.

### Proof

We start the proof by showing that the \(L_\infty \)-norm of the difference vector between the optimal free-energy \(F^*\) and \(B^iF\) decreases exponentially with the number of iterations *i*:

$$\max _s \left| F^*(s) - B^iF(s) \right| = \max _s \left| BF^*(s) - B(B^{i-1}F)(s) \right| \le \gamma \max _s \left| F^*(s) - B^{i-1}F(s) \right| \le \dots \le \gamma ^i \max _s \left| F^*(s) - F(s) \right| \le \gamma ^i \frac{\eta }{1-\gamma }$$

where we exploit the fact that \(\left| \mathop {\mathrm {ext}}\nolimits _xf(x) - \mathop {\mathrm {ext}}\nolimits _xg(x) \right| \le \max _x \left| f(x) - g(x) \right| \) and that the free-energy is bounded through the reward bounds *l* and *u* with \(\eta =\max \{|u|,|l|\}\). For a convergence criterion \(\epsilon >0\) such that \(\epsilon \ge \gamma ^i \frac{\eta }{1-\gamma }\), it then holds that \(i \ge \log _\gamma \frac{\epsilon (1-\gamma )}{\eta }\) presupposing that \(\epsilon < \frac{\eta }{1-\gamma }\).
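The iteration bound of Theorem 2 is easy to evaluate numerically; a small sketch (the function name is ours):

```python
import math

def iterations_for_accuracy(eps, gamma, eta):
    """Number of sweeps i = ceil(log_gamma(eps(1-gamma)/eta)) guaranteeing
    max_s |F*(s) - B^i F(s)| <= eps when starting from F(s) = 0, where
    eta = max{|u|, |l|} bounds the rewards.
    Requires 0 <= gamma < 1 and 0 < eps < eta / (1 - gamma)."""
    assert 0 <= gamma < 1 and 0 < eps < eta / (1 - gamma)
    return math.ceil(math.log(eps * (1 - gamma) / eta, gamma))

print(iterations_for_accuracy(eps=0.01, gamma=0.9, eta=1.0))  # 66 sweeps
```

Note that the base-\(\gamma \) logarithm is positive here because its argument lies in \((0,1)\) and \(\gamma <1\).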

## 5 Experiments: Grid World

This section illustrates the proposed value iteration scheme with an intuitive example where an agent has to navigate through a grid-world. The agent starts at position \(\mathbf S \in \mathcal {S}\) with the objective to reach the goal state \(\mathbf G \in \mathcal {S}\) and can choose one of at most four possible actions \(a \in \lbrace \uparrow , \rightarrow , \downarrow , \leftarrow \rbrace \) in each time-step. Along the way, the agent can encounter regular tiles (actions move the agent deterministically one step in the desired direction), walls that are represented as *gray tiles* (actions that move the agent towards the wall are not possible), holes that are represented as *black tiles* (moving into the hole causes a negative reward) and *chance tiles* that are illustrated as white tiles with a question mark (the transition probabilities of the chance tiles are unknown to the agent). Reaching the goal \(\mathbf G \) yields a reward \(R=+1\) whereas stepping into a hole results in a negative reward \(R= -1\). In both cases the agent is subsequently teleported back to the starting position \(\mathbf S \). Transitions to regular tiles have a small negative reward of \(R= -0.01\). When stepping onto a chance tile, the agent is pushed stochastically to an adjacent tile giving a reward as mentioned above. The true state-transition probabilities of the chance tiles are not known by the agent, but the agent holds the Bayesian belief

$$\mu ({\varvec{\theta }}_{s,a} | D) \propto \prod _{s'} \big ( \theta _{s,a}^{s'} \big )^{\varPhi _{s,a}^{s'}}$$

where the transition model is denoted as \(T_{{\varvec{\theta }}_{s,a}} (s'|s,a)= \theta _{s,a}^{s'}\) and \({\varvec{\theta }}_{s,a} = \big (\theta _{s,a}^{s_1'} \dots \theta _{s,a}^{s_{N(s)}'} \big )\) and *N*(*s*) is the number of possible successor states of state *s*. The data are incorporated into the model as a count vector \(\big ( \varPhi _{s,a}^{s_1'}, \dots , \varPhi _{s,a}^{s_{N(s)}'}\big ) \) where \(\varPhi _{s,a}^{s' }\) represents the number of times that transition \((s,a,s')\) occurred. The prior \({\rho (a|s)}\) for the actions at every state is set to be uniform. An important aspect of the model is that in the case of unlimited observational data, the agent will plan with the correct transition probabilities.
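The count-based belief update can be sketched as follows; we assume a symmetric Dirichlet prior with pseudo-count 1, which the text does not pin down, and all names are ours:

```python
import numpy as np

def update_count(counts, s, a, s_next):
    """Record one observed transition (s, a, s')."""
    counts[s, a, s_next] += 1

def posterior_mean_model(counts, pseudo_count=1.0):
    """Posterior-mean transition probabilities E[theta_{s,a}^{s'}] under a
    symmetric Dirichlet prior: (Phi + pseudo_count), normalized over s'."""
    c = counts + pseudo_count
    return c / c.sum(axis=-1, keepdims=True)

counts = np.zeros((2, 2, 2))                   # counts[s, a, s']
for _ in range(3):
    update_count(counts, 0, 0, 1)              # observe (s=0, a=0) -> s'=1 thrice
print(posterior_mean_model(counts)[0, 0])      # [0.2 0.8]
print(posterior_mean_model(counts)[1, 1])      # unvisited pair stays uniform
```

With unlimited data the counts dominate the pseudo-counts and the posterior mean converges to the true transition probabilities, as stated above.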

We conducted two experiments with discount factor \(\gamma =0.9\) and uniform priors \({\rho (a|s)}\) for the action variables. In the first experiment, we explore and illustrate the agent’s planning behavior under different degrees of computational limitations (by varying \(\alpha \)) and under different model uncertainty attitudes (by varying \(\beta \)) with fixed uniform beliefs \({\mu ({\theta }|a,s)}\). In the second experiment, the agent is allowed to update its beliefs \({\mu ({\theta }|a,s)}\) and use the updated model to re-plan its strategy.

### 5.1 The Role of the Parameters \(\alpha \) and \(\beta \) on Planning

Figure 1 shows the solution to the variational free energy problem that is obtained by iteration until convergence according to Algorithm 1 under different values of \(\alpha \) and \(\beta \). In particular, the first row shows the free energy function \(F^*(s)\) (Eq. (8)). The second, third and fourth row show heat maps of the position of an agent that follows the optimal policy (Eq. (12)) according to the agent’s biased beliefs (plan) and to the actual transition probabilities in a friendly and unfriendly environment, respectively. In chance tiles, the most likely transitions in these two environments are indicated by arrows where the agent is teleported with a probability of 0.999 into the tile indicated by the arrow and with a probability of 0.001 to a random other adjacent tile.

In the first column of Fig. 1 it can be seen that a stochastic agent (\(\alpha = 3.0\)) with high model uncertainty and optimistic attitude (\(\beta =400\)) has a strong preference for the broad corridor in the bottom by assuming favorable transitions for the unknown chance tiles. This way the agent also avoids the narrow corridors that are unsafe due to the stochasticity of the low-\(\alpha \) policy. In the second column of Fig. 1 with low \(\alpha =3\) and high model uncertainty with pessimistic attitude \(\beta =-400\), the agent strongly prefers the upper broad corridor because unfavorable transitions are assumed for the chance tiles. The third column of Fig. 1 shows a very pessimistic agent (\(\beta =-400\)) with high precision (\(\alpha =11\)) that allows the agent to safely take the shortest path by selecting the upper narrow corridor without risking any tiles with unknown transitions. The fourth column of Fig. 1 shows a very optimistic agent (\(\beta =400\)) with high precision. In this case the agent takes the shortest path by selecting the bottom narrow corridor that includes two chance tiles with unknown transitions.

### 5.2 Updating the Bayesian Posterior \(\mu \) with Observations from the Environment

Similar to model identification adaptive controllers that perform system identification while the system is running [1], we can also use the proposed planning algorithm in a reinforcement learning setup by updating the Bayesian beliefs about the MDP while always executing only the first action and replanning at the next time step. During the learning phase, the exploration is governed by both factors \(\alpha \) and \(\beta \), but each factor has a different influence. In particular, lower \(\alpha \)-values will cause more exploration due to the inherent stochasticity in the agent’s action selection, similar to an \(\epsilon \)-greedy policy. If \(\alpha \) is kept fixed through time, this will of course also imply a “suboptimal” (i.e. bounded optimal) policy in the long run. In contrast, the parameter \(\beta \) governs exploration of states with unknown transition-probabilities more directly and will not have an impact on the agent’s performance in the limit, where data has eliminated model uncertainty. We illustrate this with simulations in a grid-world environment where the agent is allowed to update its beliefs \({\mu ({\theta }|a,s)}\) over the state-transitions every time it enters a chance tile and receives observation data acquired through interaction with the environment—compare left panels in Fig. 2. In each step, the agent can then use the updated belief-models for planning the next action.

Figure 2 (right panels) shows the number of data points acquired (each time a chance tile is visited) and the average reward depending on the number of steps that the agent has interacted with the environment. The panels show several different cases: while keeping \(\alpha =12.0\) fixed we test \(\beta =(0.2, 5.0, 20.0)\) and while keeping \(\beta =0.2\) fixed we test \(\alpha =(5.0, 8.0, 12.0)\). It can be seen that lower \(\alpha \) leads to better exploration, but it can also lead to lower performance in the long run—see for example the rightmost bottom panel. In contrast, optimistic \(\beta \) values can also induce high levels of exploration with the added advantage that in the limit no performance detriment is introduced. However, high \(\beta \) values can in general also lead to a detrimental persistence with bad policies, as can be seen for example in the superiority of the low-\(\beta \) agent at the very beginning of the learning process.

## 6 Discussion and Conclusions

In this paper we are bringing two strands of research together, namely research on information-theoretic principles of control and decision-making and robustness principles for planning under model uncertainty. We have devised a unified recursion principle that extends previous generalizations of Bellman’s optimality equation and we have shown how to solve this recursion with an iterative scheme that is guaranteed to converge to a unique optimum. In simulations we could demonstrate how such a combination of information-theoretic policy and belief constraints that reflect model uncertainty can be beneficial for agents that act in partially unknown environments.

Most of the research on robust MDPs does not consider information-processing constraints on the policy, but only considers the uncertainty in the transition probabilities by specifying a set of permissible models such that worst-case scenarios can be computed in order to obtain a robust policy [13, 16]. Recent extensions of these approaches include more general assumptions regarding the set properties of the permissible models and assumptions regarding the data generation process [33]. Our approach falls inside this class of robustness methods that use a restricted set of permissible models, because we extremize the biased belief \({\psi ({\theta }|a,s)}\) under the constraint that it has to be within some information bounds measured by the Kullback-Leibler divergence from a reference Bayesian posterior. Contrary to these previous methods, our approach additionally considers robustness arising from the stochasticity in the policy.

Information-processing constraints on the policy in MDPs have been previously considered in a number of studies [14, 23, 25, 32], however not in the context of model uncertainty. In these studies a free energy value recursion is derived when restricting the class of policies through the Kullback-Leibler divergence and when disregarding separate information-processing constraints on observations. However, a small number of studies have considered information-processing constraints both for actions and observations. For example, Polani and Tishby [30] and Ortega and Braun [19] combine both kinds of information costs. The first cost formalizes an information-processing cost in the policy and the second cost constrains uncertainty arising from the state transitions directly (but crucially not the uncertainty in the latent variables). In both information-processing constraints the cost is determined as a Kullback-Leibler divergence with respect to a reference distribution. Specifically, the reference distribution in [30] is given by the marginal distributions (which is equivalent to a rate distortion problem), whereas in [19] it is given by fixed priors. The Kullback-Leibler divergence costs for the observations in these cases essentially correspond to a risk-sensitive objective. While there is a relation between risk-sensitive and robust MDPs [6, 22, 26], the innovation in our approach is at least twofold. First, it allows combining information-processing constraints on the policy with model uncertainty (as formalized by a latent variable). Second, it provides a natural setup to study learning.

The algorithm presented here and Bayesian models in general [7] are computationally expensive as they have to compute possibly high-dimensional integrals depending on the number of allowed transitions for action-state pairs. Nevertheless, there have been tremendous efforts in solving unknown MDPs efficiently, especially by sampling methods [10, 11, 24]. An interesting future direction to extend our methodology would therefore be to develop a sampling-based version of Algorithm 1 to increase the range of applicability and scalability [21]. Moreover, such sampling methods might allow for reinforcement learning applications, for example by estimating free energies through TD-learning [8], or by Thompson sampling approaches [17, 18] or other stochastic methods for adaptive control [1].

## Notes

- 1.
Base case: \(T_{\pi ,\psi } F \le F\). Inductive step: assume \(T^{i}_{\pi ,\psi } F \le T^{i-1}_{\pi ,\psi } F\) then \(T^{i+1}_{\pi ,\psi } F = g_{\pi ,\psi } + \gamma P_{\pi ,\psi } T^i_{\pi ,\psi } F \le g_{\pi ,\psi } + \gamma P_{\pi ,\psi } T^{i-1}_{\pi ,\psi } F = T^i_{\pi ,\psi } F \) and similarly for the base case \(T_{\pi ,\psi } F \ge F \;\square \).

## References

1. Åström, K.J., Wittenmark, B.: Adaptive Control. Courier Corporation, Mineola (2013)

2. Bellman, R.: Dynamic Programming, 1st edn. Princeton University Press, Princeton (1957)

3. Bertsekas, D., Tsitsiklis, J.: Neuro-Dynamic Programming. Athena Scientific, Belmont (1996)

4. Braun, D.A., Ortega, P.A., Theodorou, E., Schaal, S.: Path integral control and bounded rationality. In: 2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pp. 202–209. IEEE (2011)

5. van den Broek, B., Wiegerinck, W., Kappen, H.J.: Risk sensitive path integral control. In: UAI (2010)

6. Chow, Y., Tamar, A., Mannor, S., Pavone, M.: Risk-sensitive and robust decision-making: a CVaR optimization approach. In: Advances in Neural Information Processing Systems, pp. 1522–1530 (2015)

7. Duff, M.O.: Optimal learning: computational procedures for Bayes-adaptive Markov decision processes. Ph.D. thesis, University of Massachusetts Amherst (2002)

8. Fox, R., Pakman, A., Tishby, N.: G-learning: taming the noise in reinforcement learning via soft updates. arXiv preprint (2015). arXiv:1512.08562

9. Geramifard, A., Dann, C., Klein, R.H., Dabney, W., How, J.P.: RLPy: a value-function-based reinforcement learning framework for education and research. J. Mach. Learn. Res. **16**, 1573–1578 (2015)

10. Guez, A., Silver, D., Dayan, P.: Efficient Bayes-adaptive reinforcement learning using sample-based search. In: Advances in Neural Information Processing Systems, pp. 1025–1033 (2012)

11. Guez, A., Silver, D., Dayan, P.: Scalable and efficient Bayes-adaptive reinforcement learning based on Monte-Carlo tree search. J. Artif. Intell. Res. **48**, 841–883 (2013)

12. Hansen, L.P., Sargent, T.J.: Robustness. Princeton University Press, Princeton (2008)

13. Iyengar, G.N.: Robust dynamic programming. Math. Oper. Res. **30**(2), 257–280 (2005)

14. Kappen, H.J.: Linear theory for control of nonlinear stochastic systems. Phys. Rev. Lett. **95**(20), 200201 (2005)

15. Mannor, S., Simester, D., Sun, P., Tsitsiklis, J.N.: Bias and variance approximation in value function estimates. Manag. Sci. **53**(2), 308–322 (2007)

16. Nilim, A., El Ghaoui, L.: Robust control of Markov decision processes with uncertain transition matrices. Oper. Res. **53**(5), 780–798 (2005)

17. Ortega, P.A., Braun, D.A.: A Bayesian rule for adaptive control based on causal interventions. In: 3rd Conference on Artificial General Intelligence (AGI-2010). Atlantis Press (2010)

18. Ortega, P.A., Braun, D.A.: A minimum relative entropy principle for learning and acting. J. Artif. Intell. Res. **38**(11), 475–511 (2010)

19. Ortega, P.A., Braun, D.A.: Thermodynamics as a theory of decision-making with information-processing costs. Proc. R. Soc. A **469**, 20120683 (2013)

20. Ortega, P.A., Braun, D.A.: Generalized Thompson sampling for sequential decision-making and causal inference. Complex Adapt. Syst. Model. **2**(1), 2 (2014)

21. Ortega, P.A., Braun, D.A., Tishby, N.: Monte Carlo methods for exact & efficient solution of the generalized optimality equations. In: 2014 IEEE International Conference on Robotics and Automation (ICRA), pp. 4322–4327. IEEE (2014)

22. Osogami, T.: Robustness and risk-sensitivity in Markov decision processes. In: Advances in Neural Information Processing Systems, pp. 233–241 (2012)

23. Peters, J., Mülling, K., Altun, Y.: Relative entropy policy search. In: Twenty-Fourth National Conference on Artificial Intelligence (AAAI-10), pp. 1607–1612. AAAI Press (2010)

24. Ross, S., Pineau, J., Chaib-draa, B., Kreitmann, P.: A Bayesian approach for learning and planning in partially observable Markov decision processes. J. Mach. Learn. Res. **12**, 1729–1770 (2011)

25. Rubin, J., Shamir, O., Tishby, N.: Trading value and information in MDPs. In: Guy, T.V., Kárný, M., Wolpert, D.H. (eds.) Decision Making with Imperfect Decision Makers. Intelligent Systems Reference Library, vol. 28, pp. 57–74. Springer, Heidelberg (2012)

26. Shen, Y., Tobia, M.J., Sommer, T., Obermayer, K.: Risk-sensitive reinforcement learning. Neural Comput. **26**(7), 1298–1328 (2014)

27. Strehl, A.L., Li, L., Littman, M.L.: Reinforcement learning in finite MDPs: PAC analysis. J. Mach. Learn. Res. **10**, 2413–2444 (2009)

28. Szita, I., Lőrincz, A.: The many faces of optimism: a unifying approach. In: Proceedings of the 25th International Conference on Machine Learning, pp. 1048–1055. ACM (2008)

29. Szita, I., Szepesvári, C.: Model-based reinforcement learning with nearly tight exploration complexity bounds. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 1031–1038 (2010)

30. Tishby, N., Polani, D.: Information theory of decisions and actions. In: Cutsuridis, V., Hussain, A., Taylor, J.G. (eds.) Perception-Action Cycle. Springer Series in Cognitive and Neural Systems, pp. 601–636. Springer, New York (2011)

31. Todorov, E.: Linearly-solvable Markov decision problems. In: Advances in Neural Information Processing Systems, pp. 1369–1376 (2006)

32. Todorov, E.: Efficient computation of optimal actions. Proc. Natl. Acad. Sci. **106**(28), 11478–11483 (2009)

33. Wiesemann, W., Kuhn, D., Rustem, B.: Robust Markov decision processes. Math. Oper. Res. **38**(1), 153–183 (2013)

## Acknowledgments

This study was supported by the DFG, Emmy Noether grant BR4164/1-1. The code was developed on top of the RLPy library [9].

## Copyright information

© 2016 Springer International Publishing AG

## About this paper

### Cite this paper

Grau-Moya, J., Leibfried, F., Genewein, T., Braun, D.A. (2016). Planning with Information-Processing Constraints and Model Uncertainty in Markov Decision Processes. In: Frasconi, P., Landwehr, N., Manco, G., Vreeken, J. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2016. Lecture Notes in Computer Science(), vol 9852. Springer, Cham. https://doi.org/10.1007/978-3-319-46227-1_30
