1 Introduction

Reinforcement learning (RL) agents face the fundamental problem of maximizing long-term rewards while actively exploring an unknown environment, commonly referred to as the exploration versus exploitation trade-off. Model-based Bayesian reinforcement learning (BRL) is a principled framework for computing the optimal trade-off from the Bayesian perspective by maintaining a posterior distribution over the model of the unknown environment and computing the Bayes-optimal policy (Duff 2002; Poupart et al. 2006; Ross et al. 2007). Unfortunately, it is intractable to exactly compute Bayes-optimal policies except for very restricted cases.

Among a large and growing body of literature on model-based BRL, we focus on algorithms with formal guarantees, particularly PAC-BAMDP algorithms (Kolter and Ng 2009; Araya-López et al. 2012). These algorithms are accompanied by rigorous analyses showing that they perform nearly as well as the Bayes-optimal policy after executing a polynomial number of time steps. They adapt PAC-MDP algorithms (Kearns and Singh 2002; Strehl and Littman 2008; Asmuth et al. 2009), which guarantee near-optimal performance with respect to the optimal policy of an unknown ground-truth model, to the BRL setting. As such, they balance exploration and exploitation by adopting the optimism in the face of uncertainty principle, as do many PAC-MDP algorithms: using an additional reward bonus for state-action pairs that are executed less often than others (Kolter and Ng 2009), assuming optimistic transitions to states with higher values (Araya-López et al. 2012), or using posterior samples of models (Asmuth 2013).

In this paper, we propose a PAC-BAMDP algorithm based on optimistic transitions with an information-theoretic bound, which we name Bayesian optimistic Kullback–Leibler exploration (BOKLE). Specifically, BOKLE computes policies by constructing an optimistic MDP model in a neighborhood of the posterior mean of the transition probabilities, defined in terms of Kullback–Leibler (KL) divergence. We provide an analysis showing that BOKLE is near Bayes-optimal with high probability, i.e. PAC-BAMDP. In addition, we show that BOKLE asymptotically reduces to a well-known PAC-BAMDP algorithm, namely Bayesian exploration bonus (BEB) (Kolter and Ng 2009), with a reward bonus equivalent to that of UCB-V (Audibert et al. 2009), which strengthens our understanding of how optimistic transitions and reward bonuses relate to each other. Finally, although our contribution is mainly in the formal analysis of the algorithm, we provide experimental results on well-known model-based BRL domains and show that BOKLE performs better than some representative PAC-BAMDP algorithms in the literature.

We remark that perhaps the most relevant work in the literature is KL-UCRL (Filippi et al. 2010), where the transition probabilities are optimistically chosen in the neighborhood of empirical transition (multinomial) probabilities, also defined in terms of KL divergence. KL-UCRL is also shown to perform nearly optimally under a different type of formal analysis, namely regret. Compared to KL-UCRL, BOKLE can be seen as extending the neighborhood to be defined over Dirichlets. In addition, perhaps not surprisingly, we lose the connection to existing PAC-BAMDP algorithms if we make BOKLE optimal under Bayesian regret. We provide details on this issue in the next section after we review some necessary background.

2 Background

A Markov decision process (MDP) is a common environment model for RL, defined by a tuple \(\langle S,A,P,R \rangle \), where S is a finite set of states, A is a finite set of actions, \(P=\{\mathbf {p}_{sa} \in \Delta _S | s\in S,a\in A\}\) is the transition distribution, i.e. \(p_{sas'} = \Pr (s'|s,a)\), and \(R(s,a) \in [0,R_\text {max}]\) is the reward function. A (stationary) policy \(\pi : S \rightarrow A\) specifies the action to be executed in each state. For a fixed time horizon H, the value function of a given policy \(\pi \) is defined as \(V_H^{\pi }(s) = \mathbb {E}\big [ \sum _{t=0}^{H-1} R(s_t,\pi (s_t)) | s_0=s \big ]\), where \(s_t\) is the state at time step t. The optimal policy \(\pi _H^*\) is typically obtained by computing the optimal value function \(V_H^*\) that satisfies the Bellman optimality equation \( V_H^*(s) = \max _{a \in A} \left[ R(s,a) + \sum _{s' \in S}{p_{sas'} V_{H-1}^*(s')} \right] \) using classical dynamic programming methods (Puterman 2005).
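As a concrete illustration of the finite-horizon Bellman backup above, the following minimal sketch (ours, not taken from any of the cited works; the transition and reward arrays are hypothetical toy values) computes \(V_H^*\) by backward induction:

```python
import numpy as np

def finite_horizon_value_iteration(P, R, H):
    """Backward induction for V_H^*(s) = max_a [ R(s,a) + sum_s' p_{sas'} V_{H-1}^*(s') ].

    P: array of shape (S, A, S) with transition probabilities p_{sas'}.
    R: array of shape (S, A) with rewards in [0, R_max].
    H: finite horizon.
    Returns V_H^* of shape (S,) and a greedy nonstationary policy of shape (H, S).
    """
    S, A, _ = P.shape
    V = np.zeros(S)                      # V_0 = 0
    policy = np.zeros((H, S), dtype=int)
    for h in range(1, H + 1):
        Q = R + P @ V                    # Q[s, a] = R(s, a) + sum_s' p_{sas'} V_{h-1}(s')
        policy[H - h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return V, policy

# Tiny hypothetical 2-state, 2-action example
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[0.0, 1.0],
              [0.5, 2.0]])
V, pi = finite_horizon_value_iteration(P, R, H=10)
```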

In this paper, we consider model-based BRL where the underlying environment is modeled as an MDP with unknown transition distribution \(P = \{\mathbf {p}_{sa}\}\). Following the Bayes-adaptive MDP (BAMDP) formulation with discrete states (Duff 2002), we represent \(\mathbf {p}_{sa}\)’s as multinomial parameters and maintain the posterior over these parameters (i.e. belief b) using the flat Dirichlet-multinomial (FDM) distribution (Kolter and Ng 2009; Araya-López et al. 2012). Formally, given Dirichlet parameters \({\varvec{\alpha }}_{sa}\) for each state-action pair, which consist of both initial prior parameters \({\varvec{\alpha }}_{sa}^0\) and execution counts \(\mathbf {n}_{sa}\), the prior over the transition distribution \(\mathbf {p}_{sa}\) is given by

$$\begin{aligned} \text {Dir}(\mathbf {p}_{sa};\varvec{\alpha }_{sa}) = \frac{1}{B(\varvec{\alpha }_{sa})} \prod _{s'} {p_{sas'}^{\alpha _{sas'}-1}} \end{aligned}$$
(1)

where \(B(\varvec{\alpha }_{sa}) = {\prod _{s'} \varGamma (\alpha _{sas'})} / {\varGamma (\sum _{s'}{\alpha _{sas'}})}\) is the normalizing constant and \(\varGamma \) is the gamma function. The FDM assumes independent transition distributions among state-action pairs, so that

$$\begin{aligned} b(P) = \prod _{s,a} \text {Dir}(\mathbf {p}_{sa}; \varvec{\alpha }_{sa}). \end{aligned}$$

Upon observing a transition tuple \(\langle s,a,s' \rangle \), this prior belief is updated by

$$\begin{aligned} b_{a}^{ss'}(P)&= \eta p_{sas'} \prod _{\hat{s},\hat{a}}{\text {Dir}(\mathbf {p}_{\hat{s}\hat{a}}; \varvec{\alpha }_{\hat{s}\hat{a}})} = \prod _{\hat{s},\hat{a}}{\text {Dir}(\mathbf {p}_{\hat{s}\hat{a}}; \varvec{\alpha }_{\hat{s}\hat{a}} + \delta _{\hat{s},\hat{a},\hat{s}'}(s,a,s'))} \end{aligned}$$

where \(\delta _{\hat{s},\hat{a},\hat{s}'}(s,a,s')\) is the Kronecker delta function that yields 1 if \((s,a,s')=(\hat{s},\hat{a},\hat{s}')\) and 0 otherwise, and \(\eta \) is the normalizing factor. This is equivalent to incrementing the single Dirichlet parameter corresponding to the observed transition: \(\alpha _{sas'} \leftarrow \alpha _{sas'} + 1\). Thus, the belief is equivalently represented by its Dirichlet parameters, \(b = \{ \varvec{\alpha }_{sa} | s \in S, a\in A\}\), with \({\varvec{\alpha }}_{sa} = {\varvec{\alpha }}_{sa}^0 + \mathbf {n}_{sa}\), where \({\varvec{\alpha }}_{sa}^0\) are the initial Dirichlet parameters and \(\mathbf {n}_{sa}\) are the execution counts.
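A minimal sketch of this conjugate update (ours; the toy state and action sizes are arbitrary): the belief is just a count array, and observing \(\langle s,a,s' \rangle \) increments a single entry.

```python
import numpy as np

S, A = 5, 2
alpha0 = np.ones((S, A, S))          # initial Dirichlet prior parameters alpha^0_{sas'}
alpha = alpha0.copy()                # current belief b = {alpha_{sa}}

def update_belief(alpha, s, a, s_next):
    """Bayesian update of the FDM belief after observing the transition (s, a, s')."""
    alpha[s, a, s_next] += 1.0       # alpha_{sas'} <- alpha_{sas'} + 1
    return alpha

def posterior_mean(alpha, s, a):
    """E[p_{sas'} | b] = alpha_{sas'} / sum_{s''} alpha_{sas''}."""
    return alpha[s, a] / alpha[s, a].sum()

alpha = update_belief(alpha, s=0, a=1, s_next=3)
print(posterior_mean(alpha, s=0, a=1))   # [1/6, 1/6, 1/6, 2/6, 1/6]
```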

The BAMDP formulates the task of computing the Bayes-optimal policy as a stochastic planning problem. Specifically, the BAMDP augments the environment state s with the current belief b, which essentially incorporates the uncertainty in the transition distribution into the state space. The optimal value function of the BAMDP then satisfies the Bellman optimality equation

$$\begin{aligned} \mathbb {V}_H^*(s,b) = \max _{a} \big [ R(s,a) + \textstyle \sum _{s'} \mathbb {E}[p_{sas'}|b] \mathbb {V}_{H-1}^*(s',b_{a}^{ss'}) \big ] \end{aligned}$$

where \(\mathbb {E}[p_{sas'}|b] = {\alpha _{sas'}}/ {\sum _{s''}{\alpha _{sas''}}}\). Unfortunately, it is intractable to find the solution except for restricted cases, primarily because the number of reachable beliefs grows exponentially in H.
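To make the source of intractability concrete, the following naive sketch of the exact recursion (ours, with hypothetical toy inputs) branches over every \((a,s')\) pair at each level, so the number of belief nodes it visits grows as \(O((|S||A|)^H)\):

```python
import numpy as np

def bayes_optimal_value(s, alpha, R, H):
    """Exact BAMDP backup:
    V*_H(s, b) = max_a [ R(s,a) + sum_s' E[p_{sas'}|b] V*_{H-1}(s', b_a^{ss'}) ].

    alpha: Dirichlet counts of shape (S, A, S) representing the belief b.
    Every (a, s') pair spawns a distinct successor belief, which is why
    exact planning is intractable for all but tiny horizons.
    """
    if H == 0:
        return 0.0
    S, A, _ = alpha.shape
    best = -np.inf
    for a in range(A):
        mean_p = alpha[s, a] / alpha[s, a].sum()        # E[p_{sas'} | b]
        q = R[s, a]
        for s_next in range(S):
            alpha_next = alpha.copy()
            alpha_next[s, a, s_next] += 1.0             # successor belief b_a^{ss'}
            q += mean_p[s_next] * bayes_optimal_value(s_next, alpha_next, R, H - 1)
        best = max(best, q)
    return best

# Hypothetical toy problem; only feasible for very small S, A, and H.
S, A = 2, 2
R = np.array([[0.0, 1.0], [0.5, 2.0]])
print(bayes_optimal_value(0, np.ones((S, A, S)), R, H=3))
```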

Before we present our algorithm, we briefly review some of the most relevant work in the literature on RL. Since Rmax (Brafman and Tennenholtz 2002) and \(\text {E}^3\) (Kearns and Singh 1998), a growing body of research has been devoted to algorithms that can be shown to achieve near-optimal performance with high probability, i.e. probably approximately correct (PAC) algorithms. Depending on whether the learning target is the optimal policy or the Bayes-optimal policy, these algorithms are classified as either PAC-MDP or PAC-BAMDP. They commonly construct and solve optimistic MDP models of the environment by defining confidence regions of transition distributions centered at empirical distributions, or by adding larger reward bonuses to state-action pairs that are executed less often than others.

Model-based interval estimation (MBIE) (Strehl and Littman 2005) is a PAC-MDP algorithm that uses confidence regions of transition distributions captured by a 1-norm distance of \(O(1/\sqrt{n_{sa}})\), where \(n_{sa}\) is the execution count of action a in state s. Bayesian optimistic local transition (BOLT) (Araya-López et al. 2012) is a PAC-BAMDP algorithm that uses confidence regions of transition distributions captured by a 1-norm distance of \(O(1/n_{sa})\), although this is not explicitly stated in that work. MBIE-EB (Strehl and Littman 2008) is a simpler version that uses additive rewards of \(O(1/\sqrt{n_{sa}})\). On the other hand, BEB (Kolter and Ng 2009) is a PAC-BAMDP algorithm that uses additive rewards of \(O(1/n_{sa})\). These results imply that we can significantly reduce the degree of exploration in PAC-BAMDP compared to PAC-MDP, which is natural: the learning target is the Bayes-optimal policy (which is known in principle but hard to compute) rather than the optimal policy of the environment (which we do not know).

On the other hand, UCRL2 (Jaksch et al. 2010) uses a 1-norm distance bound of \(O(1/\sqrt{n_{sa}})\) for confidence regions of transition distributions, and is shown to produce a near-optimal policy under the notion of regret, a formal analysis framework alternative to PAC-MDP. KL-UCRL (Filippi et al. 2010) uses a KL bound of \(O(1/n_{sa})\) to achieve the same regret, while exhibiting better performance in experiments. This empirical advantage is due to the continuous change in the optimistic transition models constructed with the KL bound. It is then natural to ask whether we can reduce the degree of exploration by switching to Bayesian regret, as was the case with PAC-BAMDP. Unfortunately, there is some evidence to the contrary. In Bayes-UCB (Kaufmann et al. 2012), it was shown that the Bayesian bandit algorithm requires the same degree of exploration as KL-UCB (Garivier and Cappé 2011). In PSRL (Osband et al. 2013), the formal analysis uses the same set of plausible models as in UCRL2. Hence, we strongly believe that we cannot reduce the degree of exploration under the Bayesian regret criterion.

These results motivate us to investigate a PAC-BAMDP algorithm that uses optimistic transition models with the KL bound of \(O(1/n_{sa}^2)\), which is the main result of this paper.

3 Bayesian optimistic KL exploration

Algorithm 1: Bayesian Optimistic KL Exploration (BOKLE)

In order to characterize the confidence regions of transition distributions defined by the KL bound, we first define \(C_{{\varvec{\alpha }}_{sa}}\) for each state-action pair sa, which specifies the KL divergence threshold from the posterior mean \(\mathbf {q}_{sa}\). Here, \(C_{{\varvec{\alpha }}_{sa}}\) is a parameter of the algorithm that scales as \(O(1/n_{sa}^2)\), which will be discussed later. Algorithm 1 presents our algorithm, Bayesian Optimistic KL Exploration (BOKLE), which uses precisely this idea for computing optimistic value functions. For each state-action pair sa, the optimistic Bellman backup in BOKLE essentially seeks the solution to the following convex optimization problem

$$\begin{aligned} \max _{\mathbf {p}}{\sum _{s'}{p_{s'} \tilde{V}(s')}} \,\,\,\, \text { subject to } \,\,\,\, \begin{aligned}&D_{KL}(\mathbf {q}_{sa} \Vert \mathbf {p}) \le C_{{\varvec{\alpha }}_{sa}} \\&\textstyle \sum _{s'} p_{s'} = 1 \\&p_{s'} \ge 0, \, \forall s' \in S \end{aligned} \end{aligned}$$
(2)

recursively using \(\tilde{V}\) from the previous step; the problem can be solved in polynomial time by the barrier method (Boyd and Vandenberghe 2004) (details are available in Appendix A).
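Since Algorithm 1 is not reproduced here, the following is a rough sketch of the optimistic backup it performs, reconstructed from the description above rather than taken from the authors' implementation. It solves the convex program in Eq. (2) with a generic SLSQP solver in place of the barrier method, and the KL radii in the toy example are arbitrary placeholders (the principled choice of \(C_{{\varvec{\alpha }}_{sa}}\) is given in Definition 1 below).

```python
import numpy as np
from scipy.optimize import minimize

def optimistic_expected_value(q, V, C, eps=1e-12):
    """Solve max_p p.V subject to D_KL(q || p) <= C, sum(p) = 1, p >= 0 (Eq. 2).

    q: posterior mean transition distribution q_{sa}.
    V: optimistic values V~(s') from the previous backup step.
    C: KL radius C_{alpha_sa}.
    """
    cons = [
        {"type": "eq", "fun": lambda p: p.sum() - 1.0},
        {"type": "ineq", "fun": lambda p: C - np.sum(q * np.log((q + eps) / (p + eps)))},
    ]
    res = minimize(lambda p: -(p @ V), x0=q, bounds=[(0.0, 1.0)] * len(q),
                   constraints=cons, method="SLSQP")
    return -res.fun

def bokle_backup(alpha, R, C, H):
    """H sweeps of the optimistic Bellman backup over all state-action pairs."""
    S, A, _ = alpha.shape
    V = np.zeros(S)
    for _ in range(H):
        Q = np.zeros((S, A))
        for s in range(S):
            for a in range(A):
                q_sa = alpha[s, a] / alpha[s, a].sum()   # posterior mean q_{sa}
                Q[s, a] = R[s, a] + optimistic_expected_value(q_sa, V, C[s, a])
        V = Q.max(axis=1)
    return V

# Hypothetical toy inputs
S, A = 3, 2
alpha = np.ones((S, A, S))
R = np.random.default_rng(0).uniform(size=(S, A))
C = np.full((S, A), 0.05)                                # placeholder KL radii
print(bokle_backup(alpha, R, C, H=5))
```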

4 PAC-BAMDP analysis

The BOKLE algorithm described in Algorithm 1 obtains the optimistic value function over a KL bound of \(O(1/n_{sa}^2)\). This exploration bound is much tighter than that of KL-UCRL (Filippi et al. 2010), \(O(1/n_{sa})\), since BOKLE seeks Bayes-optimal actions whereas KL-UCRL seeks the actions that are optimal in the ground-truth model. Similarly, Pinsker's inequality implies that the exploration bound can be much tighter than the 1-norm bound of \(O(1/\sqrt{n_{sa}})\) in MBIE (Strehl and Littman 2005) and UCRL2 (Jaksch et al. 2010). In this section, we provide a PAC-BAMDP analysis of the BOKLE algorithm even though it optimizes over an asymptotically much tighter bound than the other algorithms. In Theorem 1, we show the sample complexity bound \(O(\frac{|S| |A| H^4 R_\text {max}^2 }{\epsilon ^2} \log \frac{|S||A|}{\delta })\), which matches the complexity bound of BOLT (Araya-López et al. 2012) and is the main result of our analysis.

Before presenting the main theorem, we define the KL bound parameter \(C_{{\varvec{\alpha }}_{sa}}\) that appears in Eq. (2).

Definition 1

Given a Dirichlet distribution with parameter \({\varvec{\alpha }}_{sa}\), let \(\mathbf {q}_{sa}\) be the mean of the posterior distribution. Then, \(C_{{\varvec{\alpha }}_{sa}}\) is the maximum KL divergence

$$\begin{aligned} C_{{\varvec{\alpha }}_{sa}} = \max _{ \begin{array}{c} h=1,\ldots ,H, \\ \hat{s} \text{ s.t. } n_{sa\hat{s}} \ne 0 \end{array}} D_{KL}(\mathbf {q}_{sa} \Vert \mathbf {p}^{h,\hat{s}}) \end{aligned}$$

where \(\mathbf {p}^{h,\hat{s}}\) is the mean of the Dirichlet distribution with parameter \({\varvec{\alpha }}'_{sa} = {\varvec{\alpha }}_{sa} + h\mathbf {e}_{\hat{s}}\), where \(\mathbf {e}_{\hat{s}}\) is the standard basis vector, i.e. \({\varvec{\alpha }}'_{sa}\) can be “reached” from \({\varvec{\alpha }}_{sa}\) by applying h Bayesian updates so that \(\alpha '_{sas'} = \alpha _{sas'}\) for all \(s' \ne \hat{s}\) and \(\alpha '_{sa\hat{s}} = \alpha _{sa\hat{s}} + h\).

We note that, in Definition 1, the h artificial pieces of evidence are applied only to a state \(\hat{s}\) that has been observed at least once after executing action a in state s while the agent is learning. Therefore, \(\alpha _{sa\hat{s}}\) asymptotically grows at a rate of \(p_{sa\hat{s}} \, n_{sa}\), which results in \(C_{{\varvec{\alpha }}_{sa}}\) diminishing at a rate of \(O(1/n_{sa}^2)\) if we regard the true underlying transition probability \(p_{sa\hat{s}}\) as a domain-specific constant.
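A direct sketch of Definition 1 (ours; the prior and count values in the toy example are hypothetical) enumerates the reachable means \(\mathbf {p}^{h,\hat{s}}\) and takes the largest KL divergence from the current posterior mean:

```python
import numpy as np

def kl_divergence(q, p, eps=1e-12):
    """D_KL(q || p) over the support of q."""
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / (p[mask] + eps))))

def kl_bound(alpha_sa, n_sa, H):
    """C_{alpha_sa} from Definition 1: the maximum of D_KL(q_sa || p^{h, s_hat})
    over h = 1..H and over states s_hat with n_{sa s_hat} != 0."""
    q = alpha_sa / alpha_sa.sum()                      # posterior mean q_sa
    C = 0.0
    for s_hat in np.flatnonzero(n_sa):                 # only states observed at least once
        for h in range(1, H + 1):
            alpha_prime = alpha_sa.copy()
            alpha_prime[s_hat] += h                    # h artificial observations of s_hat
            p = alpha_prime / alpha_prime.sum()        # mean of Dir(alpha_sa + h * e_{s_hat})
            C = max(C, kl_divergence(q, p))
    return C

# Hypothetical counts: a uniform prior of ones plus a few observed transitions
alpha0 = np.ones(4)
n_sa = np.array([3, 0, 1, 0])
print(kl_bound(alpha0 + n_sa, n_sa, H=10))
```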

Proposition 1

\(C_{{\varvec{\alpha }}_{sa}}\) defined in Definition 1 diminishes at a rate of \(O(1/n_{sa}^2)\) and is upper bounded by \(H^2 / \min _{\hat{s} \text{ s.t. } n_{sa\hat{s}} \ne 0} \alpha _{sa} \alpha _{sa\hat{s}}\), where \(\alpha _{sa} = \sum _{s'} \alpha _{sas'}\) is the sum of the Dirichlet parameters.

We provide the proof of Proposition 1 in Appendix C.1. We now present the main theorem stating that BOKLE is PAC-BAMDP.

Theorem 1

Let \(\mathcal {A}_t\) be the policy followed by BOKLE at time step t using H as the horizon for computing value functions with KL bound parameter \(C_{{\varvec{\alpha }}_{sa}}\) defined in Definition 1, and let \(s_t\) and \(b_t\) be the state and the belief (the parameter of the FDM posterior) at that time. Then, with probability at least \(1-\delta \), the Bayesian evaluation of \(\mathcal {A}_t\) is \(\epsilon \)-close to the optimal Bayesian evaluation

$$\begin{aligned} \mathbb {V}_H^{\mathcal {A}_t}(s_t,b_t) \ge \mathbb {V}_H^*(s_t,b_t) - \epsilon \end{aligned}$$

for all but

$$\begin{aligned} O\left( \frac{|S| |A| H^4 R_\text {max}^2 }{\epsilon ^2} \log \frac{|S||A|}{\delta } \right) \end{aligned}$$

time steps. Here, the Bayes value function \(\mathbb {V}_H^\pi (s,b)\) of a policy \(\pi \) is defined by

$$\begin{aligned} \mathbb {V}_H^\pi (s,b)= R(s,a)+\sum _{s'}\mathbb {E}[p_{sas'}|b]\,\mathbb {V}_{H-1}^\pi (s',b_a^{ss'}), \qquad a = \pi (s,b) \end{aligned}$$
(3)

Our proof of Theorem 1 is based on showing three essential properties of being PAC-BAMDP: bounded optimism, induced inequality, and mixed bound. We prove the three properties in Appendix C, following steps analogous to the analyses of BEB (Kolter and Ng 2009) and BOLT (Araya-López et al. 2012).

Lemma 1

(Bounded Optimism) Let \(s_t\) and \(b_t\) be the state and the belief at time step t. Then, \(\tilde{V}_H(s_t,b_t)\), computed by BOKLE with \(C_{{\varvec{\alpha }}_{sa}}\) defined in Definition 1, is lower bounded by

$$\begin{aligned} \tilde{V}_H(s_t,b_t) \ge \mathbb {V}_H^*(s_t,b_t) - \frac{H^2 V_\text {max}}{\alpha _{s_t} + H} \end{aligned}$$

where \(\mathbb {V}_H^*(s_t,b_t)\) is the H-horizon Bayes-optimal value, \(V_\text {max}\) is the upper bound on the H-horizon value function, and \(\alpha _{s_t} = \min _a \alpha _{s_ta}\).

Compared to the optimism lemma that appears in previous PAC-BAMDP analyses (Kolter and Ng 2009; Araya-López et al. 2012), this lemma is much more general, because we allow \(\tilde{V}_H(s_t,b_t)\) to fall below \(\mathbb {V}_H^*(s_t,b_t)\) by at most \(O(1/n_{sa})\). In the proof of the main theorem, we show that this weaker condition is still sufficient to establish that the algorithm is PAC-BAMDP.

The second lemma states that, if we evaluate a policy \(\pi \) on two different rewards and transition distributions, \(R,\mathbf {p}\) and \(\hat{R},\hat{\mathbf {p}}\), where \(R(s,a)=\hat{R}(s,a)\) and \(\mathbf {p}_{sa} = \hat{\mathbf {p}}_{sa}\) on a set K of “known” state-action pairs (Brafman and Tennenholtz 2002), the two value functions will be similar provided that the probability of escaping from K is small. This is a slight modification of the induced inequality lemma used in PAC-MDP analysis, and is essentially the same lemma as in BEB (Kolter and Ng 2009) and BOLT (Araya-López et al. 2012). The known set K is defined by

$$\begin{aligned} K=\bigg \lbrace (s,a)|\alpha _{sa}=\sum _{s'}\alpha _{sas'}\ge m\bigg \rbrace \end{aligned}$$

where m is a threshold parameter specifying when a state-action pair has accumulated enough evidence. This definition will be used frequently in the rest of this section. We will later derive an appropriate value of m that results in the PAC-BAMDP bound in Theorem 1.

Lemma 2

(Induced Inequality) Let \(\mathbb {V}_h^{\pi }(s,b)\) be the Bayesian evaluation of a policy \(\pi \) defined by Eq. (3), and let a be the action selected by the policy at (s, b). We define the mixed value function by

$$\begin{aligned} \hat{\mathbb {V}}_{h+1}^{\pi }(s,b) = {\left\{ \begin{array}{ll} R(s,a) + \sum _{s'}{E[p_{sas'} | b] \, \hat{\mathbb {V}}_h^{\pi }(s',b')} &{} \text { if } (s,a) \in K \\ \hat{R}(s,a) + \sum _{s'}{\hat{p}_{sas'} \, \hat{\mathbb {V}}_h^{\pi }(s',b')} &{} \text { if } (s,a) \notin K \end{array}\right. } \end{aligned}$$

for the known set K, where \(\hat{p}_{sas'}\) is a transition probability that can be different from the expected transition probability \(E[p_{sas'} | b]\), and \(b'\) is the belief obtained from b by observing the transition \((s,a,s')\). Let \(A_K\) be the event that a state-action pair not in K is visited when starting from state s and following policy \(\pi \) for H steps. Then,

$$\begin{aligned} \mathbb {V}_H^{\pi }(s,{\varvec{\alpha }}) \ge \hat{\mathbb {V}}_H^{\pi }(s,{\varvec{\alpha }}) - V_\text {max}\Pr (A_K) \end{aligned}$$

where \(V_\text {max}\) is the upper bound on the H-horizon value function and \(\Pr (A_K)\) is the probability of event \(A_K\).

The last lemma bounds the difference between the value function computed by BOKLE and the mixed value function, where the reward and transition distribution \(\hat{R},\hat{\mathbf {p}}\) are set to those used by BOKLE. Note that \(\hat{R}=R\) in our case, since BOKLE only modifies the transition distribution.

Lemma 3

(BOKLE Mixed Bound) Let the known set \(K = \{(s,a)| \,\alpha _{sa} = \sum _{s'} \alpha _{sas'} \ge m \}\). Then, the difference between the value obtained by BOKLE, \(\tilde{V}_H\), and the mixed value of BOKLE’s policy \(\mathcal {A}_t\) with BOKLE’s transition probabilities \(\hat{\mathbf {p}}_{sa}\) for K, \(\hat{\mathbb {V}}^{\mathcal {A}_t}_H\), is bounded by

$$\begin{aligned} \tilde{V}_H(s_t,b_t) - \hat{\mathbb {V}}_H^{\mathcal {A}_t}(s_t,b_t) \le \frac{(\sqrt{2}/p^\text {min}+ 1) H^2 V_{\text {max}}}{m} \end{aligned}$$

where \(p^\text {min}= \min _{s,a,s'} p_{sas'}\) is the minimum non-zero transition probability in the true underlying environment, which we regard as a domain-specific constant.

Finally, we provide the proof of Theorem 1 using the three lemmas.

Proof

$$\begin{aligned}&\mathbb {V}_H^{\mathcal {A}_t}(s_t,b_t) \nonumber \\&\quad \ge \hat{\mathbb {V}}_H^{\tilde{\pi }}(s_t,b_t) - V_{\text {max}} \Pr (A_K) \nonumber \\&\quad \ge \tilde{V}_H(s_t,b_t) - \frac{(\sqrt{2}/p^\text {min}+ 1) H^2 V_{\text {max}}}{m} - V_{\text {max}} \Pr (A_K) \nonumber \\&\quad \ge \mathbb {V}_H^*(s_t,b_t) - \frac{(\sqrt{2}/p^\text {min}+ 1) H^2 V_{\text {max}}}{m} - \frac{H^2 V_\text {max}}{m+ H} - V_{\text {max}} \Pr (A_K) \nonumber \\&\quad \ge \mathbb {V}_H^*(s_t,b_t) - \frac{\epsilon }{2} - V_{\text {max}} \Pr (A_K) \end{aligned}$$
(4)

by applying Lemma 2 (induced inequality) in the first inequality (noting that \(\mathcal {A}_t\) coincides with \(\tilde{\pi }\) unless \(A_K\) occurs), Lemma 3 (mixed bound) in the second inequality, and Lemma 1 (bounded optimism) in the third inequality. We obtain the last line if we set

$$\begin{aligned} m = \frac{(2\sqrt{2}/p^\text {min}+ 4) H^2 V_{\text {max}}}{\epsilon }. \end{aligned}$$

This particular value is chosen so that \(\frac{(\sqrt{2}/p^\text {min}+ 1) H^2 V_{\text {max}}}{m}+\frac{H^2 V_\text {max}}{m+ H} < \frac{(\sqrt{2}/p^\text {min}+ 2) H^2 V_{\text {max}}}{m} = \frac{\epsilon }{2}\), which can be easily checked.

If \(\Pr (A_K) \le \frac{\epsilon }{2 V_\text {max}}\), then from Eq. (4) we obtain \(\mathbb {V}_H^{\mathcal {A}_t}(s_t,b_t) \ge \mathbb {V}_H^*(s_t,b_t) - \epsilon \). If \(\Pr (A_K) > \frac{\epsilon }{2 V_\text {max}}\), then using the Hoeffding and union bounds, with probability at least \(1-\delta \), \(A_K\) will occur on no more than \(O(\frac{|S| |A| m }{\Pr (A_K)} \log \frac{|S||A|}{\delta }) = O(\frac{|S| |A| H^4 R_\text {max}^2 }{\epsilon ^2} \log \frac{|S||A|}{\delta })\) time steps, since \(V_\text {max}\le H R_\text {max}\) and \(\Pr (A_K)>\frac{\epsilon }{2V_{\max }}\ge \frac{\epsilon }{2HR_{\max }}\) with the value of m given above. \(\square \)

As we mentioned before, this sample complexity bound \(O(\frac{|S| |A| H^4 R_\text {max}^2 }{\epsilon ^2} \log \frac{|S||A|}{\delta })\) is the same as the bound of BOLT (Araya-López et al. 2012) \(O(\frac{|S| |A| H^2}{\epsilon ^2 (1-\gamma )^2} \log \frac{|S||A|}{\delta })\) and better than the bound of BEB (Kolter and Ng 2009) \(O(\frac{|S| |A| H^6}{\epsilon ^2} \log \frac{|S||A|}{\delta })\) if we reconcile the differences in the problem settings (in BOKLE: \(V_\text {max}= H R_\text {max}\), in BEB: \(V_\text {max}= H\), and in BOLT: \(V_\text {max}= 1/(1-\gamma )\)).

5 Relating to BEB

In this section, we discuss how BOKLE relates to BEB (Kolter and Ng 2009). The first few steps of our analysis share some similarities with KL-UCRL (Filippi et al. 2010), but we go further and derive asymptotic approximate solutions in order to make the connection. For the asymptotic analysis, we will from now on consider confidence regions of transition distributions centered at the posterior mode rather than the mean, since both converge asymptotically to the same value after a large number of observations.

The mode of the Dirichlet distribution in Eq. (1) is \(\mathbf {r}\), given by

$$\begin{aligned} r_s = \frac{\alpha _s - 1}{\sum _{s'} \alpha _{s'} - |S|}, \end{aligned}$$

where we dropped the state-action subscript for brevity. If we define the overall concentration parameter \(N = \sum _s \alpha _s - |S|\), then we can rewrite the belief as \(\text {Dir}(\mathbf {p}; {\varvec{\alpha }}) = \frac{1}{B({\varvec{\alpha }})} \prod _s p_s^{N r_s}\), and its log density as

$$\begin{aligned} \log \text {Dir}(\mathbf {p};{\varvec{\alpha }}) = \sum _s N r_s \log p_s - \log B({\varvec{\alpha }}). \end{aligned}$$

Then, the difference of log densities between \(\mathbf {p}\) and the mode \(\mathbf {r}\) becomes

$$\begin{aligned} \log \text {Dir}(\mathbf {p};{\varvec{\alpha }}) - \log \text {Dir}(\mathbf {r};{\varvec{\alpha }})&= - N D_{KL}(\mathbf {r}\Vert \mathbf {p}) \end{aligned}$$

Thus, the isocontours of the Dirichlet density, \(\text {Dir}(\mathbf {p}; {\varvec{\alpha }}) = \epsilon \), are exactly the sets of uniform KL divergence from the mode, i.e. \(D_{KL}(\mathbf {r}\Vert \mathbf {p}) = \epsilon '/N\) for an appropriately chosen \(\epsilon '\). This shows why a KL bound neighborhood is a more natural choice than a 1-norm neighborhood: the former can be seen as constraining the posterior density directly.
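A quick numeric check of this identity (ours; the Dirichlet parameters and the test point are arbitrary), using scipy's Dirichlet log-density:

```python
import numpy as np
from scipy.stats import dirichlet

alpha = np.array([4.0, 7.0, 2.0])        # arbitrary Dirichlet parameters (all > 1)
N = alpha.sum() - len(alpha)             # overall concentration N = sum_s alpha_s - |S|
r = (alpha - 1.0) / N                    # mode r_s = (alpha_s - 1) / N -> [0.3, 0.6, 0.1]

p = np.array([0.3, 0.5, 0.2])            # any interior point of the simplex
lhs = dirichlet.logpdf(p, alpha) - dirichlet.logpdf(r, alpha)
rhs = -N * np.sum(r * np.log(r / p))     # -N * D_KL(r || p)
print(lhs, rhs)                          # the two values agree (about -0.40)
```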

Explicitly representing the non-negativity constraint of probabilities, the Lagrangian \(\mathcal {L}\) of the problem in Eq. (2) can be written with the multipliers \(\nu ,\mu _s \ge 0\) and \(\lambda \) as

$$\begin{aligned} \mathcal {L}&= \sum _s p_s V(s) - \nu \Big (\sum _s r_s \log \frac{r_s}{p_s} - C_{\varvec{\alpha }}\Big ) - \lambda \Big (\sum _s p_s - 1 \Big ) + \sum _s{\mu _s p_s}, \end{aligned}$$

which has the analytical solution

$$\begin{aligned} p_s^* = \left[ \frac{\nu }{\lambda - \mu _s - V(s)} \right] r_s = {\left\{ \begin{array}{ll} 0 &{} \text { if } r_s = 0 \\ \left[ \frac{\nu }{\lambda - V(s)} \right] r_s &{} \text { if } r_s \ne 0 \end{array}\right. }, \end{aligned}$$

where the multiplier \(\mu _s = 0\) whenever \(r_s \ne 0\). This is because the KL divergence is well-defined only when \(p_s = 0 \Rightarrow r_s = 0\), and complementary slackness requires \(\mu _s p_s = 0\), so \(\mu _s = 0\) whenever \(r_s \ne 0\) (which forces \(p_s > 0\)). Thus, \(\mu _s\) was omitted in the earlier formulation.

We focus on the case \(r_s \ne 0\), where the solution can be rewritten as

$$\begin{aligned} p_s^* = \left[ 1 - \frac{1}{\nu } (V(s) - \lambda ') \right] ^{-1} r_s , \end{aligned}$$

where the constant \(\lambda '\) is determined by the condition \(\sum _s p_s = 1\). In the regime \(\nu \gg 1\) (i.e. \(C_{\varvec{\alpha }}\approx 0\)), we can approximate this solution by a first-order Taylor expansion:

$$\begin{aligned} p^*_s&\approx \left[ 1 + \frac{1}{\nu } (V(s) - \lambda ') \right] r_s = \left[ 1 + \frac{1}{\nu } (V(s) - E_{\mathbf {r}}[V]) \right] r_s \end{aligned}$$
(5)

where \(E_{\mathbf {r}}[V] = \sum _s{r_s V(s)}\).

Then, the KL divergence can be approximated by a second-order Taylor expansion (the proof is available in Appendix D):

$$\begin{aligned} C_{\varvec{\alpha }}= \frac{1}{2\nu ^2} \text {Var}_{\mathbf {r}}[V] \end{aligned}$$

where \(\text {Var}_{\mathbf {r}}[V] = E_{\mathbf {r}}[(V - E_{\mathbf {r}}[V])^2]\). Thus, \(\nu \) can be approximated as

$$\begin{aligned} \nu \approx \sqrt{\frac{\text {Var}_{\mathbf {r}}[V]}{2 C_{\varvec{\alpha }}}}. \end{aligned}$$

Fig. 1: Three benchmark domains: (a) chain (top), (b) double-loop (middle), and (c) RiverSwim (bottom). The solid (resp. dotted) arrows indicate transition probabilities and rewards for action a (resp. b); only non-zero rewards are shown together with the transition probabilities.

Using this \(\nu \) in Eq. (5), we obtain

$$\begin{aligned}&p^*_s \approx \left[ 1 + \sqrt{\frac{2 C_{\varvec{\alpha }}}{\text {Var}_{\mathbf {r}}[V]}} (V(s) - E_{\mathbf {r}}[V]) \right] r_s \quad \text{ and } \\&\sum _s{p^*_s V(s)} \approx \sum _s{r_s V(s)} + \sqrt{2 C_{\varvec{\alpha }}\text {Var}_{\mathbf {r}}[V]}. \end{aligned}$$

We can now derive an approximation to the dynamic programming update performed in BOKLE:

$$\begin{aligned} \tilde{V}_h (s,b)&= \max _{a\in A} \Big [ R(s,a) + \sum _{s'} p^*_{sa}(s') \tilde{V}_{h-1}(s',b) \Big ] \\&\approx \max _{a \in A} \Big [ R(s,a) + \sqrt{2C_{{\varvec{\alpha }}_{sa}} \text {Var}_{\mathbf {r}_{sa}}[V]} + \sum _{s'}{r_{sas'} \tilde{V}_{h-1}(s',b)} \Big ]. \end{aligned}$$

This is comparable to the value function computed in BEB (Kolter and Ng 2009):

$$\begin{aligned} V^{\text {BEB}}_h(s,b)&= \max _{a\in A} \Big [ R(s,a) + \frac{\beta ^{\text {BEB}}}{1+\sum _{s''}{\alpha _{sas''}}} + \sum _{s'}{E[p_{sas'}|b] V^{\text {BEB}}_{h-1}(s',b)} \Big ] \end{aligned}$$

for some constant \(\beta ^{\text {BEB}}\). This strongly suggests that the additive reward \(\sqrt{2 C_{{\varvec{\alpha }}_{sa}} \text {Var}_{\mathbf {r}_{sa}}[V]}\) corresponds to the BEB exploration bonus \(\frac{\beta ^{\text {BEB}}}{1+\sum _{s''}{\alpha _{sas''}}}\), ignoring the mean-mode difference in the transition model. As discussed in the previous section, \(C_{{\varvec{\alpha }}_{sa}} = O(1/n_{sa}^2)\), so the resulting bonus is consistent with the BEB exploration bonus of \(O(1/n_{sa})\). In addition, BOKLE scales the additive reward by \(\sqrt{\text {Var}_{\mathbf {r}_{sa}}[V]}\), which incentivizes the agent to explore actions with a higher variance in values, a formulation similar to, but distinct from, the variance-based reward bonus (VBRB) (Sorg et al. 2010). Interestingly, adding the square root of the empirical variance coincides with the exploration bonus in UCB-V (Audibert et al. 2009), a variance-aware upper confidence bound (UCB) algorithm for bandits.
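The following sketch (ours, with arbitrary toy inputs) compares the exact solution of Eq. (2), computed with a generic solver, against the approximate form \(\sum _s r_s V(s) + \sqrt{2 C_{\varvec{\alpha }}\text {Var}_{\mathbf {r}}[V]}\); as the derivation above suggests, the two agree increasingly well as \(C_{\varvec{\alpha }}\) shrinks.

```python
import numpy as np
from scipy.optimize import minimize

def exact_optimistic_value(r, V, C, eps=1e-12):
    """Exact solution of Eq. (2): max_p p.V subject to D_KL(r || p) <= C."""
    cons = [
        {"type": "eq", "fun": lambda p: p.sum() - 1.0},
        {"type": "ineq", "fun": lambda p: C - np.sum(r * np.log((r + eps) / (p + eps)))},
    ]
    res = minimize(lambda p: -(p @ V), x0=r, bounds=[(0.0, 1.0)] * len(r),
                   constraints=cons, method="SLSQP")
    return -res.fun

r = np.array([0.5, 0.3, 0.2])            # mode (or mean) of the transition posterior
V = np.array([1.0, 4.0, 2.5])            # current value estimates
for C in [1e-1, 1e-2, 1e-3]:
    mean_val = r @ V
    approx = mean_val + np.sqrt(2.0 * C * np.sum(r * (V - mean_val) ** 2))
    exact = exact_optimistic_value(r, V, C)
    print(f"C={C:.0e}  exact={exact:.4f}  approx={approx:.4f}")
```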

6 Experiments

Although our contribution is mainly in the formal analysis of BOKLE, we present simulation results on three BRL domains. We emphasize that the experiments are intended as a preliminary demonstration of how the different exploration strategies compare to each other, and not as a rigorous evaluation on real-world problems.

Table 1: Average returns and their standard errors in chain, double-loop, and RiverSwim from 50 runs of 1000 time steps

Fig. 2: Average return versus time step in (a) chain (top-left), (b) double-loop (top-right), and (c) RiverSwim (bottom) for three PAC-BAMDP algorithms: BOKLE, BEB, and BOLT. The shaded region represents the standard error.

Chain (Strens 2000) consists of 5 states and 2 actions as shown in Fig. 1a. The agent starts in state 1 and for each time step can either move on to the next state (action a, solid edges) or reset to state 1 (action b, dotted edges). The transition distributions make the agent perform the other action with a “slip” probability of 0.2. The agent receives a large reward of 10 by executing action a in the rightmost state 5 or a small reward of 2 by executing action b in any state. Double-Loop (Dearden et al. 1998) consists of 9 states and 2 deterministic actions as shown in Fig. 1b. It has two loops with a shared (starting) state 1, and the agent has to execute action b (dotted edges) to complete the loop with a higher reward of 2, instead of the easier loop with a lower reward of 1. RiverSwim (Filippi et al. 2010; Strehl and Littman 2008) consists of 6 states and 2 actions as shown in Fig. 1c. The agent starts in state 1, and can swim either to the left (action b, dotted edges) or the right (action a, solid edges). The agent has to swim all the way to state 6 to receive a reward of 1, which requires swimming against the current of the river. Swimming to the right has a success probability of 0.35, and a small probability 0.05 of drifting to the left. Swimming to the left always succeeds, but receives a much smaller reward of 0.005 in state 1.
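As an illustration, a minimal construction of the chain domain following the description above (our encoding; the 0-based indexing and the exact placement of the slip probability are assumptions):

```python
import numpy as np

def make_chain(slip=0.2, n_states=5):
    """Chain (Strens 2000): action 0 ('a') advances toward the last state,
    action 1 ('b') resets to the first state; with probability `slip`
    the other action is performed instead."""
    S, A = n_states, 2
    P = np.zeros((S, A, S))
    R = np.zeros((S, A))
    for s in range(S):
        forward = min(s + 1, S - 1)              # intended effect of action a
        P[s, 0, forward] += 1.0 - slip           # action a advances
        P[s, 0, 0] += slip                       # action a slips and resets
        P[s, 1, 0] += 1.0 - slip                 # action b resets
        P[s, 1, forward] += slip                 # action b slips and advances
        R[s, 1] = 2.0                            # small reward for action b in any state
    R[S - 1, 0] = 10.0                           # large reward for action a in the last state
    return P, R

P, R = make_chain()
assert np.allclose(P.sum(axis=2), 1.0)           # each row is a valid distribution
```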

Table 1 compares the returns collected from three PAC-BAMDP algorithms averaged over 50 runs of 1000 time steps: BOKLE (our algorithm in Algorithm 1), BEB (Kolter and Ng 2009), and BOLT (Araya-López et al. 2012). To better handle the sparsity of the transition distributions, BOKLE used confidence regions centered at the posterior mode. In all experiments, we used the discount factor \(\gamma = 0.95\) for computing internal value functions. For each domain, we varied the algorithm parameters as follows: for BOKLE, \(C_{\varvec{\alpha }}= \epsilon /N^2\) with \(\epsilon \in \{0.1,0.25,0.5,1,5,10,25,50\}\); for BEB, \(\beta \in \{0.1,1,5,10,25,50,100,150\}\); for BOLT, \(\eta \in \{0.1,1,5,10,25,50,100,150\}\); we then selected the best parameter setting for each domain.

In Fig. 2, we show the cumulative returns versus time steps at the onset of each simulation. It is evident from the figure that the learning performance of BOKLE is better than that of BEB and BOLT. These results reflect our discussion of the advantage of KL bound exploration in the previous section.

It is noteworthy that BOKLE performs better than BOLT in the experiments, even though their sample complexity bounds are the same. This result is supported by the discussion in Filippi et al. (2010) comparing KL-UCRL (Filippi et al. 2010) and UCRL2 (Jaksch et al. 2010): for constructing the optimistic transition model, KL-UCRL uses a KL divergence bound of \(O(1/n_{sa})\) whereas UCRL2 uses a 1-norm distance bound of \(O(1/\sqrt{n_{sa}})\). Although the formal bounds of these two algorithms are the same, KL-UCRL performs better than UCRL2 in experiments. This is due to the desirable properties of the neighborhood models under KL divergence: they are continuous with respect to the estimated value and robust with respect to unlikely transitions. This insight carries over to BOKLE versus BOLT, since BOKLE uses a KL divergence bound of \(O(1/n_{sa}^2)\) whereas BOLT uses a 1-norm distance bound of \(O(1/n_{sa})\).

7 Conclusion

In this paper, we introduced Bayesian optimistic Kullback–Leibler exploration (BOKLE), a model-based Bayesian reinforcement learning algorithm that uses KL divergence in constructing the optimistic posterior model of the environment for Bayesian exploration. We provided a formal analysis showing that the algorithm is PAC-BAMDP, meaning that the algorithm is near Bayes-optimal with high probability.

As we have discussed in previous sections, KL divergence is a natural measure for bounding the credible region of multinomial transition models when constructing optimistic models for exploration. It directly corresponds to the log ratio of the posterior density to that at the mode, which results in smooth isocontours in the probability simplex. In addition, we showed that the optimistic model constrained by KL divergence can be quantitatively related to other algorithms that use an additive reward approach for exploration (Kolter and Ng 2009; Sorg et al. 2010; Audibert et al. 2009). We presented simulation results on a number of standard BRL domains, highlighting the advantage of using KL exploration.

Promising directions for future work include extending the approach to other families of priors and to continuous state and action spaces, together with the corresponding formal analyses. In particular, we believe that BOKLE can be extended to the continuous case in a manner similar to UCCRL (Ortner and Ryabko 2012).