
1 Introduction

Value function estimation is an integral part of many reinforcement learning (RL) [29] algorithms (e.g., the policy evaluation step of policy iteration), as it assesses the quality of a fixed control policy. This is straightforward in domains with a finite number of states. Large or infinite state spaces generally prohibit an explicit value function representation, but we can always represent the value function through a parameterized class of functions. In this paper we focus on the case of linear architectures, where the values are approximated by a linear combination of a number of features. This approach is used by the Least-Squares Temporal Difference (LSTD) [6] algorithm, a temporal-difference algorithm that finds a linear approximation to the value function that minimizes the mean squared projected Bellman error (MSPBE) [30].

The selection of features is critical for LSTD, as it determines the expressiveness of the value function representation. The richer the feature space is, the more likely it is that the value function space will contain a good approximation to the value function, but more data are needed [21]. This problem, already present in linear regression, is only exacerbated in RL problems. Furthermore, using too many features makes the use of the computed policies rather slow.

In linear regression, regularization is commonly used to control over-fitting, through a penalty term which discourages coefficients from reaching large values. In regression problems, two of the most effective regularization approaches are \(\ell _1\) and \(\ell _2\)-regularization [15], which add a penalty term (the \(\ell _1\) and \(\ell _2\) norm of the parameter vector, respectively) to the error function in order to discourage the model’s parameters from taking large values. In both schemes, a coefficient \(\lambda \), which typically must be selected in advance, governs the relative importance of the penalty term compared to the error function.

Bayesian reinforcement learning (BRL) (see [34] for an overview) is a framework for designing RL algorithms that models the reinforcement learning problem in a Bayesian decision theoretic manner. In model-free BRL, a probability distribution is maintained over the parameters of the value function, which quantifies our uncertainty over its parameters. One of the first such algorithms was Gaussian-process temporal-difference learning (GPTD) [9], which assumes that the unknown true values over the observed states are random variables generated by a Gaussian process. More specifically, GPTD incorporates a Gaussian prior over value functions and assumes a Gaussian noise model. Thus, the solution to the inference problem is given by the posterior distribution conditioned on the observed sequence of rewards. A sparse Bayesian extension of GPTD, called RVMTD, was proposed in [32, 33], which adopted a sparse kernelized Bayesian learning approach [31]. However, RVMTD minimizes the mean Bellman error instead of the MSPBE as in our case.

In this paper, we propose a Bayesian treatment of the LSTD algorithm, called BLSTD, that instead of seeking only a point estimate of the unknown value function parameters, explicitly considers the uncertainty over the value function. We adopt a fully probabilistic framework by introducing a stochastic variant of the standard Bellman operator as well as a prior distribution over the unknown model’s parameters. To avoid overfitting, we further extend the BLSTD algorithm with a sparse Bayesian learning approach [3, 31], which we call VBLSTD. By using a tractable variational approach to automatically determine the model’s complexity, we obviate the need to select a regularization parameter. We demonstrate the performance of the proposed algorithms on a number of domains, showing the ability of our model to avoid overfitting.

The remainder of the paper is organised as follows. Section 2 presents some preliminaries, reviews the LSTD algorithm and gives an overview of related work. Section 3 introduces the Bayesian LSTD algorithm. In Sect. 4 we extend the Bayesian LSTD algorithm, presenting the VBLSTD algorithm that constitutes the main contribution of this paper. Our empirical analysis is presented in Sect. 5. We conclude the paper in Sect. 6 by discussing future directions.

2 Preliminaries and Related Work

A Markov Decision Process (MDP) is a tuple \(\{{\mathcal {S}}, {\mathcal {A}}, P, r, \gamma \}\), where \({\mathcal {S}}\) is a set of states; \({\mathcal {A}}\) is a set of actions; \(P(\cdot | s, a)\) is a transition probability kernel, defining the probability of next states in \({\mathcal {S}}\) for any state action pair \(s \in {\mathcal {S}}\) and \(a \in {\mathcal {A}}\); \(r: {\mathcal {S}}\rightarrow {\mathbb {R}}\) is a reward function and \(\gamma \in [0,1]\) is a constant discount factor. The policy \(\pi :{\mathcal {S}}\rightarrow {\mathcal {A}}\) to be evaluated is a deterministic mapping from states to actions.

Value functions are of central interest in reinforcement learning. Briefly, value function \(V^{\pi }\) defines the expected discounted sum of rewards for the policy \(\pi \), given that we start at state s: \(V^{\pi }(s) \mathrel {\triangleq }\mathop {{\mathbb {E}}}\nolimits ^{\pi } \left[ \sum _{t=0}^{\infty } \gamma ^t r(s_t) | s_{0} = s\right] \), with \(V^{*} \mathrel {\triangleq }\sup _{\pi } V^{\pi }\). It is known [28] that the value function is the unique fixed-point of the Bellman operator \(T^{\pi }\), i.e., \(V^{\pi } = T^{\pi } V^{\pi }\), defined as:

$$\begin{aligned} (T^{\pi }V)(s) = r(s) + \gamma \int _{{\mathcal {S}}} V(s') dP(s' | s, \pi (s)), \end{aligned}$$
(1)

or in a more compact form as \(T^{\pi }V= \varvec{r}+ \gamma P^{\pi } V\), where V and \(\varvec{r}\) are vectors of size \(|{\mathcal {S}}|\) that contain the state values and rewards, respectively. When the rewards and transition probabilities are known, the value function can be obtained analytically by solving the linear system \(V^{\pi } = (\varvec{I}- \gamma P^\pi )^{-1} \varvec{r}\).

In practice, however, the MDP is unknown, and we only have access to a set of n observations \({\mathcal {D}}= \{(s_i, r_i, s_{i}')\}_{i=1}^n\) generated by the policy we wish to evaluate, i.e., \(s_{i}' \sim P(\cdot | s_i, \pi (s_i))\). An additional difficulty is that when the state space is large (e.g., continuous) the value function cannot be represented exactly. It is then common to use some form of parametric value function approximation. In this paper we consider linear approximation architectures with parameters \(\varvec{\theta }\in {\mathbb {R}}^{k}\) over k features \(\varvec{\phi }: {\mathcal {S}}\rightarrow {\mathbb {R}}^{k}\), \(\varvec{\phi }(\cdot ) = \left( \phi _1(\cdot ),\ldots ,\phi _k(\cdot )\right) ^{\top }\):

$$\begin{aligned} V^{\pi }_{\varvec{\theta }}(s) = \varvec{\phi }(s)^{\top } \varvec{\theta } = \sum _{i=1}^k \phi _i(s) \theta _i. \end{aligned}$$

Throughout the paper we denote by \({\mathcal {F}}\) the linear function space spanned by the features \(\phi _i\), i.e., \( {\mathcal {F}}= \{f_{\varvec{\theta }} | f_{\varvec{\theta }}(\cdot ) =\varvec{\phi }(\cdot )^{\top } \varvec{\theta }\}\). Roughly speaking, \({\mathcal {F}}\) contains all the value functions that can be represented by the features. Let us also introduce the projection operator \(\varPi \) onto \({\mathcal {F}}\), which takes any value function \(\varvec{u}\) and projects it to the nearest representable value function, such that \(\varPi \varvec{u} = V^{\pi }_{\varvec{\theta }}\), where the corresponding parameters are the solution to the least-squares problem \(\varvec{\theta } = \mathop {\mathrm {arg\,min}}\nolimits _{\varvec{\theta }} \Vert V^{\pi }_{\varvec{\theta }} - \varvec{u}\Vert ^2_D\) [30]. As the parameterization is linear, it is straightforward to show that the projection operator is linear, independent of the parameters \(\varvec{\theta }\), and given by \(\varPi = \varPhi C^{-1}\varPhi ^{\top }D\), where \(\varPhi \in {\mathbb {R}}^{|{\mathcal {S}}|\times k}\) is a matrix whose rows contain the feature vectors \(\varvec{\phi }(s)^{\top }, \forall s \in {\mathcal {S}}\), and \(C = \varPhi ^{\top }D\varPhi \) is the Gram matrix.
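To make the projection operator concrete, the following toy sketch (our own illustration, with arbitrary sizes and random features, not taken from the paper) builds \(\varPi = \varPhi C^{-1}\varPhi ^{\top }D\) for a small finite state space and checks two properties implied by the definition: \(\varPi \) is idempotent and leaves functions in \({\mathcal {F}}\) unchanged.

```python
import numpy as np

# Toy illustration of the projection operator Pi = Phi C^{-1} Phi^T D,
# with C = Phi^T D Phi; sizes and features are arbitrary.
rng = np.random.default_rng(0)
n_states, k = 8, 3

Phi = rng.normal(size=(n_states, k))       # rows are phi(s)^T
d = rng.dirichlet(np.ones(n_states))       # a state distribution
D = np.diag(d)

C = Phi.T @ D @ Phi                        # Gram matrix
Pi = Phi @ np.linalg.solve(C, Phi.T @ D)   # projection onto span(Phi), weighted by D

u = rng.normal(size=n_states)              # an arbitrary value vector
proj_u = Pi @ u                            # its closest representable value function

# Sanity checks: Pi is idempotent, and representable functions are unchanged.
assert np.allclose(Pi @ Pi, Pi)
theta = rng.normal(size=k)
assert np.allclose(Pi @ (Phi @ theta), Phi @ theta)
```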

2.1 Least Squares Temporal Difference

The least-squares temporal difference (LSTD) algorithm was introduced by Bradtke and Barto [6] and computes the fixed-point of the composed projection and Bellman operators: \(V^{\pi }_{\varvec{\theta }} = \varPi T^{\pi } V^{\pi }_{\varvec{\theta }}\) (see Fig. 1). It can be seen as minimizing the mean-square projected Bellman error (MSPBE), i.e., the distance between \(V_{\varvec{\theta }}\) and its projected Bellman image onto \({\mathcal {F}}\):

$$\begin{aligned} \varvec{\theta } = \mathop {\mathrm {arg\,min}}_{\varvec{\theta } \in {\mathbb {R}}^k} \Vert V^{\pi }_{\varvec{\theta }} - \varPi T^{\pi } V^{\pi }_{\varvec{\theta }}\Vert _D^2. \end{aligned}$$
(2)

As shown in prior work [1], LSTD can be seen as solving the following nested optimization problem:

$$\begin{aligned} \varvec{u}^{*}&= \mathop {\mathrm {arg\,min}}_{\varvec{u} \in {\mathbb {R}}^k} \Vert \varPhi \varvec{u} - T^{\pi } \varPhi \varvec{\theta } \Vert _D^2,&\varvec{\theta }&= \mathop {\mathrm {arg\,min}}_{\varvec{\theta } \in {\mathbb {R}}^k} \Vert \varPhi \varvec{\theta } - \varPhi \varvec{u}^{*}\Vert _D^2, \end{aligned}$$
(3)

where the first (projection) step finds the back-projection of \(T^{\pi }V^{\pi }_{\varvec{\theta }}\) onto \({\mathcal {F}}\), and the second (fixed-point) step solves the fixed-point problem which minimizes the distance between \(V^{\pi }_{\varvec{\theta }}\) and its projection.

Fig. 1.

A graphical representation of the LSTD problem. Here we can see the geometric relationship between the MSBE and the MSPBE. Figure adapted from [16].

As we discussed, usually the MDP model is unknown, or the full \(\varPhi \) matrices are too large to be formed, and so LSTD relies on sample-based estimates. Using a set \({\mathcal {D}}\) of samples from the MDP of interest, we define \(\tilde{\varPhi } = [\phi (s_1)^{\top }; \ldots ; \phi (s_n)^{\top }]\) and \(\tilde{\varPhi '} = [\phi (s_1')^{\top }; \ldots ; \phi (s_n')^{\top }]\) as the sampled feature matrices of the visited and successor states, and \(\tilde{R} = [r_1, \ldots , r_n]^{\top }\) as the sampled reward vector. Given these samples, the sample-based LSTD solution is given by the empirical version of Eq. (3):

$$\begin{aligned} \varvec{u}^{*}&= \tilde{C}^{-1}\tilde{\varPhi }^{\top }(\tilde{R} + \gamma \tilde{\varPhi '} \varvec{\theta }), \\ \varvec{\theta }&= (\tilde{\varPhi }^{\top } (\tilde{\varPhi } - \gamma \tilde{\varPhi '}))^{-1}\tilde{\varPhi }^{\top } \tilde{R} = A^{-1} \varvec{b}, \end{aligned}$$

where we have defined

$$\begin{aligned} \tilde{C} \mathrel {\triangleq }\tilde{\varPhi }^{\top }\tilde{\varPhi }, \quad A \mathrel {\triangleq }\tilde{\varPhi }^{\top } (\tilde{\varPhi } - \gamma \tilde{\varPhi '}), \quad \text { and } \quad \varvec{b} \mathrel {\triangleq }\tilde{\varPhi }^{\top } \tilde{R}. \end{aligned}$$

As the number of samples n increases, the LSTD solution \( \tilde{\varPhi } \varvec{\theta }\) converges to the fixed-point of \(\hat{\varPi } T^{\pi }\) [6, 21, 24], where \(\hat{\varPi }\) denotes the sample-based feature-space projector (empirical projection).
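As a concrete illustration, a minimal sketch of the sample-based LSTD computation above might look as follows (our own code; the function name and the use of NumPy are our choices, and no regularization is applied):

```python
import numpy as np

def lstd(Phi, Phi_next, R, gamma):
    """Sample-based LSTD solution theta = A^{-1} b (a sketch of Sect. 2.1).

    Phi, Phi_next : (n, k) feature matrices of visited and successor states.
    R             : (n,) sampled rewards.
    gamma         : discount factor.
    """
    A = Phi.T @ (Phi - gamma * Phi_next)   # A = Phi^T (Phi - gamma Phi')
    b = Phi.T @ R                          # b = Phi^T R
    return np.linalg.solve(A, b)           # theta = A^{-1} b
```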

2.2 Review of Regularized LSTD Schemes

Despite the fact that LSTD offers an unbiased estimate of the value function [16], high-dimensional feature spaces create additional challenges. The larger the number of features, the more samples are required to estimate \(\varvec{\theta }\). In some cases, the number of features may even significantly outnumber the number of observed samples (\(k \gg n\)), leading to severe overfitting and poor predictions, as the matrix A will be ill-conditioned.

For this reason, some form of regularization or model selection should be adopted in order to prevent overfitting. Indeed, a plethora of methods have been proposed for value function approximation in RL, using different regularization and feature selection schemes (see [7] for an overview). A common form of regularization is based on ridge regression: this simply adds a term \(\lambda \varvec{I}\) to A, which is essentially \(\ell _2\)-regularization. This idea was introduced and analysed by Farahmand et al. [10] for the \(\ell _{2,2}\)-LSTD algorithm, which uses an \(\ell _2\)-penalty for both the projection and fixed-point steps. However, when the number of samples is much smaller than the number of features, ridge regression may fail, as it does not encourage sparsity.

On the other hand, as \(\ell _1\)-penalties enforce sparsity, it is natural to consider those instead. The LASSO-TD variant incorporates an \(\ell _1\)-penalty in the projection step. LARS-TD [19] applies \(\ell _1\)-regularization to the projection operator in the feature space \({\mathcal {F}}\), using a variant of LARS [8]. Finally, LC-TD [17] reformulates Lasso-TD as a linear complementarity (LC) problem, allowing the use of any efficient off-the-shelf solver. It should be emphasised that some of the solvers allow warm-starts, offering a significant computational advantage in the policy iteration context. In order for both LARS-TD and LC-TD to find a solution, matrix A is required to be a P-matrix. The theoretical properties of the Lasso-TD problem were examined in [14], demonstrating that LARS-TD and LC-TD converge to the same solution. In particular, it has been shown that Lasso-TD is guaranteed to have a unique fixed point. Additionally, Pires [27] suggests solving the linear system of LSTD by adding an \(\ell _1\)-regularization term directly to it. This is a standard convex optimization problem for which any off-the-shelf solver can be used, and it is also applicable to off-policy learning.

Two closely related algorithms have been proposed in order to alleviate some of the limitations of Lasso-TD (e.g., the P-matrix constraint): \(\ell _1\)-PBR (Projected Bellman Residual) [11] and \(\ell _{2,1}\)-LSTD [16]. Both of them place an \(\ell _1\)-penalty term on the fixed-point step, which penalizes the projected Bellman residual and yields a convex optimization problem. In contrast to \(\ell _1\)-PBR, \(\ell _{2,1}\)-LSTD also places an \(\ell _2\)-penalty term on the projection step. The Dantzig-LSTD algorithm, proposed by Geist et al. [12], integrates LSTD with the Dantzig selector, converting it into a standard linear program that can be solved efficiently. It minimizes the \(\ell _1\)-norm of the parameter vector under the constraint that the LSTD linear system is satisfied up to a predefined tolerance \(\lambda \) in each dimension. An alternative Dantzig selector temporal difference learning algorithm, called ODDS-TD, has been introduced recently by Liu et al. [22]. It is a two-stage algorithm that is also able to compute the optimal denoising matrix.

3 Bayesian LSTD

In this section we present a Bayesian LSTD algorithm, called BLSTD. In our analysis, we model the fact that the transition distribution P is not known exactly by considering an empirical Bellman operator, given by the standard Bellman operator (1) plus additive white noise \(N\). For simplicity, we assume that the noise term is state independent, with precision \(\beta \). Thus, the empirical Bellman operator can be written concisely as

$$\begin{aligned} \hat{T}^{\pi }V \mathrel {\triangleq }\varvec{r}+ \gamma \hat{P}^{\pi } V = T^{\pi }V + N, \qquad N \sim {\mathcal {N}}(\varvec{0}, \beta ^{-1}\varvec{I}). \end{aligned}$$

In other words, our model says that \(\varvec{r}+ \gamma \hat{P}^{\pi } V^{\pi }_{\varvec{\theta }}\) is normally distributed with mean \(\varvec{r}+ \gamma P^{\pi } V^{\pi }_{\varvec{\theta }}\). We shall formulate a Bayesian regression model, that is based on a sample from this empirical Bellman operator.

As aforementioned, given the set of observations \({\mathcal {D}}\), LSTD seeks the value function parameters \(\varvec{\theta }\) which are invariant with respect to the composed operator \(\hat{\varPi }\hat{T}^{\pi }\):

$$\begin{aligned} V^{\pi }_{\varvec{\theta }}&= \hat{\varPi }\hat{T}^{\pi }V^{\pi }_{\varvec{\theta }} \Leftrightarrow \\ \tilde{\varPhi }^{\top } \tilde{R}&= \tilde{\varPhi }^{\top }(\tilde{\varPhi } - \gamma \tilde{\varPhi '})\varvec{\theta } + \tilde{\varPhi }^{\top } N, \end{aligned}$$

where we have rewritten the projection operators and approximate value function in terms of the feature matrix and parameter vectors. We can now reformulate this as the following linear regression model:

$$\begin{aligned} \varvec{b} = A \varvec{\theta } + \tilde{\varPhi }^{\top } N. \end{aligned}$$

The likelihood function for this model is given by:

$$\begin{aligned} p(\varvec{b}| \varvec{\theta }, \beta ) = {\mathcal {N}}\left( \varvec{b}\,|\, A\varvec{\theta }, \beta ^{-1}\tilde{C}\right) . \end{aligned}$$

Taking the logarithm of the likelihood, we have

$$\begin{aligned} \ln p(\varvec{b}| \varvec{\theta }, \beta ) = \frac{k}{2}\ln (\beta ) - \frac{1}{2} \ln (|\tilde{C}|) - \frac{k}{2}\ln (2\pi ) - \frac{\beta }{2} E_{{\mathcal {D}}}(\varvec{\theta }), \end{aligned}$$
(4)

where \(E_{{\mathcal {D}}}\) corresponds to the MSPBE:

$$\begin{aligned} E_{{\mathcal {D}}}(\varvec{\theta }) = (\varvec{b}- A\varvec{\theta })^{\top } \tilde{C}^{-1} (\varvec{b}- A\varvec{\theta }). \end{aligned}$$
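For reference, the empirical MSPBE above can be evaluated directly from \(A\), \(\varvec{b}\) and \(\tilde{C}\); the following small sketch (our own, with illustrative names) is a direct transcription of the formula:

```python
import numpy as np

def mspbe(theta, A, b, C_tilde):
    """Empirical MSPBE E_D(theta) = (b - A theta)^T C~^{-1} (b - A theta)."""
    resid = b - A @ theta
    return resid @ np.linalg.solve(C_tilde, resid)
```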

To complete our Bayesian model, we now introduce a prior distribution over the model parameters \(\varvec{\theta }\). Specifically, we consider a zero-mean isotropic Gaussian conjugate prior governed by a single precision parameter \(\alpha \),

$$\begin{aligned} p(\varvec{\theta }| \alpha ) = {\mathcal {N}}(\varvec{\theta }\,|\, \varvec{0}, \alpha ^{-1}\varvec{I}). \end{aligned}$$

Thus, we model the parametric uncertainty [23], which arises when the true transition probabilities and expected rewards are not known and must be estimated. Keeping only the terms of the likelihood and prior that depend on the model parameters, the log of the posterior distribution is given by

$$\begin{aligned} \ln p(\varvec{\theta } | {\mathcal {D}}) \propto - \frac{\beta }{2} E_{{\mathcal {D}}}(\varvec{\theta }) - \frac{\alpha }{2}\varvec{\theta }^{\top }\varvec{\theta }. \end{aligned}$$
(5)

Maximizing the posterior distribution with respect to \(\varvec{\theta }\) is equivalent to minimizing the MSPBE with the addition of an \(\ell _2\)-penalty (\(\lambda = \alpha / \beta \)). Thus, if the hyperparameter \(\alpha \) takes a large value, the total squared length of the parameter vector \(\varvec{\theta }\) is encouraged to be small. Completing the square in Eq. (5),

$$\begin{aligned} \ln p(\varvec{\theta } | {\mathcal {D}})&\propto - \frac{\beta }{2} (\varvec{b}- A\varvec{\theta })^{\top } \tilde{C}^{-1} (\varvec{b}- A\varvec{\theta }) - \frac{\alpha }{2}\varvec{\theta }^{\top }\varvec{\theta } \\&\propto -\frac{1}{2} \varvec{\theta }^{\top } (\alpha \varvec{I}+ \beta A^{\top }\tilde{C}^{-1}A) \varvec{\theta } + \varvec{\theta }^{\top } \beta A^{\top } \tilde{C}^{-1}\varvec{b}+ const \end{aligned}$$

we get that the posterior distribution is also Gaussian,

$$\begin{aligned} p(\varvec{\theta }| {\mathcal {D}}) = {\mathcal {N}}(\varvec{\theta }\,|\, \varvec{m}, S), \end{aligned}$$

with the covariance and mean to be given as

$$\begin{aligned} S = (\alpha \varvec{I}+ \beta \underbrace{A^{\top } \tilde{C}^{-1} A}_{\varSigma })^{-1} \text { and } \varvec{m}= \beta S A^{\top } \tilde{C}^{-1} \varvec{b}, \end{aligned}$$

respectively, where matrix \(\varSigma \mathrel {\triangleq }A^{\top } \tilde{C}^{-1} A\) is always positive definite. Hence, the predictive distribution of the value function over a new state \(s^*\) is estimated by averaging the output of all possible linear models w.r.t. the posterior distribution

$$\begin{aligned} p(V(s^{*}) | {\mathcal {D}}) = {\mathcal {N}}\left( \varvec{\phi }(s^{*})^{\top } \varvec{m},\ \varvec{\phi }(s^{*})^{\top } S \varvec{\phi }(s^{*})\right) . \end{aligned}$$
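A minimal sketch of the BLSTD closed-form computations above might look as follows (our own code and names; it treats the precisions \(\alpha \) and \(\beta \) as fixed, user-chosen values, and the predictive variance below contains only the parameter-uncertainty term):

```python
import numpy as np

def blstd_posterior(Phi, Phi_next, R, gamma, alpha, beta):
    """Posterior N(theta | m, S) of BLSTD for fixed precisions alpha, beta."""
    A = Phi.T @ (Phi - gamma * Phi_next)
    b = Phi.T @ R
    C = Phi.T @ Phi                                  # C~ = Phi^T Phi
    Sigma = A.T @ np.linalg.solve(C, A)              # Sigma = A^T C~^{-1} A
    S = np.linalg.inv(alpha * np.eye(A.shape[1]) + beta * Sigma)
    m = beta * S @ A.T @ np.linalg.solve(C, b)       # m = beta S A^T C~^{-1} b
    return m, S

def predict_value(phi_new, m, S):
    """Predictive mean and variance of V(s*) = phi(s*)^T theta."""
    return phi_new @ m, phi_new @ S @ phi_new
```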

An online version of our model can also be derived easily, with the posterior distribution at any phase acting as the prior distribution for the subsequent transition [2] and by using the matrix inversion lemma for the covariance matrix.

Maximum likelihood. For illustrative purposes, consider also a maximum likelihood approach. Differentiating with respect to \(\varvec{\theta }\), the gradient of the log likelihood (4) is:

$$\begin{aligned} - \frac{\beta }{2} \nabla _{\varvec{\theta }} E_{{\mathcal {D}}}(\varvec{\theta }) = - \beta A^{\top } \tilde{C}^{-1} (A \varvec{\theta } - \varvec{b}). \end{aligned}$$

By setting the gradient equal to zero, we get the batch LSTD solution.

In conclusion, under our model, maximum a posteriori inference corresponds to \(\ell _2\)-regularization, while maximum likelihood inference to standard LSTD. In the next section, we propose an extension of our model that also induces sparsity.

4 Variational Bayesian LSTD (VBLSTD)

We now extend our model through a hierarchical sparse Bayesian prior, and introduce a variational approach for inference. The hope is that the resulting VBLSTD algorithm will be able to avoid the over-fitting problem through regularization. For the prior distribution over the parameter vector \(\varvec{\theta }\), we use an approach similar to [31], where a sparse zero-mean Gaussian prior was considered. Specifically, our prior over the model’s parameters \(\varvec{\theta }\) is given by:

$$\begin{aligned} p(\varvec{\theta }| \varvec{\alpha }) = \prod _{i=1}^{k} {\mathcal {N}}(\theta _i \,|\, 0, \alpha _i^{-1}), \end{aligned}$$

where \(\varvec{\alpha } = (\alpha _1, \dots , \alpha _k)^{\top }\) are the parameters specifying our prior. Instead of selecting an arbitrary value for \(\varvec{\alpha }\), we select a hyperprior over \(\varvec{\alpha }\) of the form:

$$\begin{aligned} p(\varvec{\alpha }) = \prod _{i=1}^{k} \text {Gamma}(\alpha _i \,|\, h_a, h_b), \end{aligned}$$

where \(h_{a}, h_{b}\) are fixed parameters. The choice of the Gamma distribution for \(\varvec{\alpha }\) results in a marginal distribution \(p(\varvec{\theta })\) that is Student-t, which is known to enforce sparse representations. To complete the specification of our model, we define a Gamma hyperprior over the noise precision \(\beta \):

$$\begin{aligned} p(\beta ) = \text {Gamma}(\beta \,|\, h_c, h_d). \end{aligned}$$

To obtain broad hyperpriors, we can set these parameters to some small value, e.g., \(h_{a} = h_{b} = h_{c} = h_{d} = 10^{-6}\).
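To illustrate why this hierarchy induces sparsity, the following small simulation (our own, not from the paper) samples from the Gaussian-Gamma hierarchy and compares the tails of the resulting Student-t marginal with a Gaussian. The hyperparameter values \(h_a = h_b = 5\) are chosen here only so that the kurtosis is finite; the broad setting \(10^{-6}\) is far heavier-tailed still.

```python
import numpy as np

# Sample theta_i from the hierarchy: alpha_i ~ Gamma(h_a, h_b),
# theta_i | alpha_i ~ N(0, 1/alpha_i). With h_a = h_b = 5 the marginal is a
# Student-t with 10 degrees of freedom (excess kurtosis 1); a Gaussian has 0.
rng = np.random.default_rng(0)
h_a, h_b, n = 5.0, 5.0, 200_000

alpha = rng.gamma(shape=h_a, scale=1.0 / h_b, size=n)   # rate h_b -> scale 1/h_b
theta = rng.normal(0.0, 1.0 / np.sqrt(alpha))

z = (theta - theta.mean()) / theta.std()
print("excess kurtosis:", (z**4).mean() - 3.0)          # roughly 1 here
```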

Bayesian inference requires the computation of the posterior distribution over all latent parameters \({\mathcal {Z}}= \{\varvec{\theta }, \varvec{\alpha }, \beta \}\) given the observations:

$$\begin{aligned} p(\varvec{\theta }, \varvec{\alpha }, \beta | \varvec{b}) = \frac{p(\varvec{b}| \varvec{\theta }, \beta ) p(\beta ) p(\varvec{\theta }| \varvec{\alpha }) p(\varvec{\alpha })}{p(\varvec{b})}. \end{aligned}$$

As the direct computation of the marginal likelihood is analytically intractable, we resort to variational inference [3, 18]. This introduces a variational approximation \({\mathcal {Q}}({\mathcal {Z}})\) to the true distribution \(p({\mathcal {Z}}| \varvec{b})\) over the latent variables, and the problem is defined as finding the approximation closest to the true posterior distribution in terms of KL divergence. The main insight in variational methods is the following identity,

$$\begin{aligned} \ln p(\varvec{b}) = {\mathcal {L}}({\mathcal {Q}}) + \text {KL}({\mathcal {Q}}\Vert p) \end{aligned}$$

where we have defined,

$$\begin{aligned} {\mathcal {L}}({\mathcal {Q}})&= \int {\mathcal {Q}}({\mathcal {Z}}) \ln \left\{ \frac{p({\mathcal {Z}}, \varvec{b})}{{\mathcal {Q}}({\mathcal {Z}})} \right\} , \end{aligned}$$
(6)
$$\begin{aligned} \text {KL}({\mathcal {Q}}\Vert p)&= - \int {\mathcal {Q}}({\mathcal {Z}}) \ln \left\{ \frac{p({\mathcal {Z}}| \varvec{b})}{{\mathcal {Q}}({\mathcal {Z}})} \right\} . \end{aligned}$$
(7)

The \(\text {KL}({\mathcal {Q}}\Vert p)\) (7) represents the Kullback-Leibler divergence between the variational posterior distribution \({\mathcal {Q}}({\mathcal {Z}})\) and the true posterior distribution \(p({\mathcal {Z}}| \varvec{b})\) over the latent variables. As \(\text {KL}({\mathcal {Q}}\Vert p) \ge 0\), it follows that \({\mathcal {L}}({\mathcal {Q}}) \le \ln p(\varvec{b})\), which means that \({\mathcal {L}}({\mathcal {Q}})\) is a lower bound on \(\ln p(\varvec{b})\). Therefore, maximizing the evidence lower bound (ELBO, see [4] for an overview) \({\mathcal {L}}({\mathcal {Q}})\) with respect to \({\mathcal {Q}}\) is equivalent to minimizing the \(\text {KL}({\mathcal {Q}}\Vert p)\), as the largest value of \({\mathcal {L}}({\mathcal {Q}})\) will be achieved when the \(\text {KL}({\mathcal {Q}}\Vert p)\) becomes zero.

In our problem, we consider a variational distribution that factorizes over the latent variables (cf. mean-field theory [26]), such that \({\mathcal {Q}}({\mathcal {Z}}) = {\mathcal {Q}}_{\varvec{\theta }}(\varvec{\theta }) {\mathcal {Q}}_{\varvec{\alpha }}(\varvec{\alpha }) {\mathcal {Q}}_{\beta }(\beta )\). The optimal distribution for each of the factors can then be written as:

$$\begin{aligned} {\mathcal {Q}}_{\varvec{\theta }}(\varvec{\theta })&= {\mathcal {N}}(\varvec{\theta }\,|\, \varvec{m}, S), \end{aligned}$$
(8)
$$\begin{aligned} {\mathcal {Q}}_{\varvec{\alpha }}(\varvec{\alpha })&= \prod _{i=1}^{k} \text {Gamma}(\alpha _i \,|\, \tilde{a}_i, \tilde{b}_i), \end{aligned}$$
(9)
$$\begin{aligned} {\mathcal {Q}}_{\beta }(\beta )&= \text {Gamma}(\beta \,|\, \tilde{c}, \tilde{d}), \end{aligned}$$
(10)

where,

$$\begin{aligned}&S = \left( \mathrm {diag}(\mathop {{\mathbb {E}}}\nolimits [\varvec{\alpha }]) + \mathop {{\mathbb {E}}}\nolimits [\beta ] \varSigma \right) ^{-1}, \quad \varvec{m}= \mathop {{\mathbb {E}}}\nolimits [\beta ] S A^{\top } \tilde{C}^{-1} \varvec{b},\\&\tilde{a}_{i} = h_{a} + \frac{1}{2}, \quad \tilde{b}_{i} = h_b + \frac{1}{2} \mathop {{\mathbb {E}}}\nolimits [\theta _{i}^{2}], \\&\tilde{c} = h_{c} + \frac{k}{2}, \quad \tilde{d} = h_{d} + \frac{1}{2} \Vert \varvec{b}- A \varvec{m}\Vert _{\tilde{C}}^2 + \frac{1}{2} \mathop {\mathrm {tr}}(\varSigma S). \end{aligned}$$

The required moments can be expressed as follows:

$$\mathop {{\mathbb {E}}}\nolimits [\alpha _{i}] = \tilde{a}_{i} / \tilde{b}_{i}, \quad \mathop {{\mathbb {E}}}\nolimits [\beta ] = \tilde{c} / \tilde{d}, \text { and } \mathop {{\mathbb {E}}}\nolimits [\theta _{i}^{2}] = \varvec{m}_i^{2} + S_{ii}.$$

The variational posterior distributions given in Eqs. (8), (9) and (10) are then iteratively updated until convergence. As the evidence lower bound is convex with respect to each one of the factors, convergence is guaranteed.
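For concreteness, a minimal sketch of the resulting update loop might look as follows (our own code and variable names; it stops on the change in \(\varvec{m}\) rather than on the lower bound used in the paper, and it reads \(\Vert \cdot \Vert _{\tilde{C}}\) as the \(\tilde{C}^{-1}\)-weighted norm, consistent with \(E_{{\mathcal {D}}}\)):

```python
import numpy as np

def vblstd(Phi, Phi_next, R, gamma, h=(1e-6, 1e-6, 1e-6, 1e-6),
           max_iter=200, tol=1e-6):
    """Sketch of the VBLSTD coordinate updates of Eqs. (8)-(10)."""
    h_a, h_b, h_c, h_d = h
    k = Phi.shape[1]
    A = Phi.T @ (Phi - gamma * Phi_next)
    b = Phi.T @ R
    C = Phi.T @ Phi
    Sigma = A.T @ np.linalg.solve(C, A)            # A^T C~^{-1} A
    C_inv_b = np.linalg.solve(C, b)

    E_alpha = np.ones(k)                           # initial E[alpha_i]
    E_beta = 1.0                                   # initial E[beta]
    m = np.zeros(k)
    for _ in range(max_iter):
        S = np.linalg.inv(np.diag(E_alpha) + E_beta * Sigma)
        m_new = E_beta * S @ A.T @ C_inv_b
        a_tilde = h_a + 0.5
        b_tilde = h_b + 0.5 * (m_new**2 + np.diag(S))   # uses E[theta_i^2]
        c_tilde = h_c + 0.5 * k
        resid = b - A @ m_new
        d_tilde = (h_d + 0.5 * resid @ np.linalg.solve(C, resid)
                   + 0.5 * np.trace(Sigma @ S))
        E_alpha = a_tilde / b_tilde                # E[alpha_i] = a~_i / b~_i
        E_beta = c_tilde / d_tilde                 # E[beta]    = c~ / d~
        if np.max(np.abs(m_new - m)) < tol:
            m = m_new
            break
        m = m_new
    return m, S
```

In this sketch the posterior mean \(\varvec{m}\) plays the role of the LSTD solution, while the per-feature precisions \(\mathop {{\mathbb {E}}}\nolimits [\alpha _i]\) grow for irrelevant features and shrink the corresponding weights towards zero.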

Similarly to BLSTD, the value function distribution over a new state \(s^{*}\) can be approximated by averaging the output of all possible linear models w.r.t. the variational posterior distribution \({\mathcal {Q}}_{\varvec{\theta }}(\varvec{\theta })\).

This gives us not only a specific mean value function, but also effectively expresses our uncertainty about what the value function is through the covariance terms.

The lower bound is interesting to look at more closely, as it is the quantity that we are maximizing. Furthermore, it can be used as a convergence criterion for the variational inference: if the difference between the lower bound on two successive iterations falls below a threshold, we assume that our model has converged. Algorithm 1 provides the pseudocode of the sparse Bayesian LSTD algorithm.

Remark 1

The lower bound can be written as

$$\begin{aligned} {\mathcal {L}}({\mathcal {Q}}) =&\frac{1}{2}\ln |S| - \frac{1}{2}\ln |\tilde{C}| + \sum _{i=1}^k \{ \ln \varGamma (\tilde{a}_i) - \tilde{a}_i\ln \tilde{b}_i \} + \ln \varGamma (\tilde{c}) - \tilde{c}\ln \tilde{d} \nonumber \\&+ \frac{k}{2}(1 - \ln 2\pi ) - k\ln \varGamma (h_a) + k h_a\ln h_b - \ln \varGamma (h_c) + h_c\ln h_d. \end{aligned}$$
(11)

Proof

Decomposing Eq. (6) we obtain:

$$\begin{aligned} {\mathcal {L}}({\mathcal {Q}})&= \mathop {{\mathbb {E}}}\nolimits _{\varvec{\theta }, \beta }[\ln p(\varvec{b}| \varvec{\theta }, \beta )] + \mathop {{\mathbb {E}}}\nolimits _{\beta }[\ln p(\beta )] +\mathop {{\mathbb {E}}}\nolimits _{\varvec{\theta },\varvec{\alpha }}[\ln p(\varvec{\theta }| \varvec{\alpha })] + \mathop {{\mathbb {E}}}\nolimits _{\varvec{\alpha }}[\ln p(\varvec{\alpha })] \\&- \mathop {{\mathbb {E}}}\nolimits _{\varvec{\theta }}[\ln {\mathcal {Q}}_{\varvec{\theta }}(\varvec{\theta }) ] - \mathop {{\mathbb {E}}}\nolimits _{\varvec{\alpha }}[\ln {\mathcal {Q}}_{\varvec{\alpha }}(\varvec{\alpha }) ] -\mathop {{\mathbb {E}}}\nolimits _{\beta } [\ln {\mathcal {Q}}_{\beta }(\beta ) ]. \end{aligned}$$

We now evaluate each term in turn.

$$\begin{aligned}&\mathop {{\mathbb {E}}}\nolimits _{\varvec{\theta }, \beta }[\ln p(\varvec{b}| \varvec{\theta }, \beta )] = \frac{k}{2}(\psi (\tilde{c}) - \ln \tilde{d})-\frac{k}{2}\ln 2\pi - \frac{1}{2}\ln |\tilde{C}| - \frac{1}{2}\mathop {{\mathbb {E}}}\nolimits [\beta ]\{ \Vert \varvec{b}- A \varvec{m}\Vert _{\tilde{C}}^2 +\mathop {\mathrm {tr}}(\varSigma S) \} \\&\mathop {{\mathbb {E}}}\nolimits _{\varvec{\theta },\varvec{\alpha }}[\ln p(\varvec{\theta }|\varvec{\alpha })] = -\frac{k}{2} \ln 2\pi + \frac{1}{2} \sum _{i=1}^k (\psi (\tilde{a}_i) - \ln \tilde{b}_i) - \frac{1}{2} \sum _{i=1}^k \mathop {{\mathbb {E}}}\nolimits [\alpha _i] (\varvec{m}_i^2 + S_{ii}) \\&\mathop {{\mathbb {E}}}\nolimits _{\varvec{\alpha }} [\ln p(\varvec{\alpha }) ] = -k\ln \varGamma (h_a) + k h_a \ln h_b + (h_a -1)\sum _{i=1}^k(\psi (\tilde{a}_i) - \ln \tilde{b}_i) - h_b \sum _{i=1}^k\mathop {{\mathbb {E}}}\nolimits [\alpha _i] \\&\mathop {{\mathbb {E}}}\nolimits _{\beta } [\ln p(\beta ) ] = -\ln \varGamma (h_c) + h_c\ln h_d +(h_{c}-1)(\psi (\tilde{c}) - \ln \tilde{d}) - h_d\mathop {{\mathbb {E}}}\nolimits [\beta ] \\&\mathop {{\mathbb {E}}}\nolimits _{{\varvec{\theta }}}[\ln {\mathcal {Q}}_{\varvec{\theta }}(\varvec{\theta })] = -\frac{1}{2} \ln |S| - \frac{k}{2}(1 + \ln 2\pi ) \\&\mathop {{\mathbb {E}}}\nolimits _{\varvec{\alpha }} [\ln {\mathcal {Q}}_{\varvec{\alpha }}(\varvec{\alpha }) ] = \sum _{i=1}^{k} \{ -\ln \varGamma (\tilde{a}_i) + \tilde{a}_i\ln \tilde{b}_i+(\tilde{a}_i-1)(\psi (\tilde{a}_i) - \ln \tilde{b}_i) - \tilde{b}_i\mathop {{\mathbb {E}}}\nolimits [\alpha _i] \}\\&\mathop {{\mathbb {E}}}\nolimits _{\beta } [\ln {\mathcal {Q}}_{\beta }(\beta ) ] = -\ln \varGamma (\tilde{c}) + \tilde{c}\ln \tilde{d}+(\tilde{c}-1)(\psi (\tilde{c}) - \ln \tilde{d}) - \tilde{d}\mathop {{\mathbb {E}}}\nolimits [\beta ]. \end{aligned}$$

Substituting back, we obtain the required result.    \(\square \)
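For completeness, the bound of Remark 1 can be transcribed directly; the sketch below is our own (function and argument names are illustrative), and it assumes the variational factors have already been updated, since the simplified form of Eq. (11) holds at the optimum of each factor.

```python
import numpy as np
from scipy.special import gammaln

def elbo_remark1(S, C_tilde, a_tilde, b_tilde, c_tilde, d_tilde,
                 h_a, h_b, h_c, h_d):
    """Evidence lower bound in the simplified form of Remark 1 (Eq. 11)."""
    k = S.shape[0]
    _, logdet_S = np.linalg.slogdet(S)
    _, logdet_C = np.linalg.slogdet(C_tilde)
    return (0.5 * logdet_S - 0.5 * logdet_C
            + np.sum(gammaln(a_tilde) - a_tilde * np.log(b_tilde))
            + gammaln(c_tilde) - c_tilde * np.log(d_tilde)
            + 0.5 * k * (1.0 - np.log(2.0 * np.pi))
            - k * gammaln(h_a) + k * h_a * np.log(h_b)
            - gammaln(h_c) + h_c * np.log(h_d))
```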

Algorithm 1. Pseudocode of the VBLSTD algorithm.

In the next section, we compare the Bayesian LSTD methods we derived with other state-of-the-art LSTD approaches for value function estimation.

5 Experiments

To analyze the performance of the proposed VBLSTD algorithm, we considered two discrete chain problems. Through our empirical analysis we examine both the convergence of VBLSTD to the true solution and the ability of the VBLSTD algorithm to avoid overfitting. In the first case, comparisons have been made with the vanilla LSTD algorithm, considering three different sizes of Boyan’s chain [5]. In the second case, comparisons have been conducted with the \(\ell _2\)-LSTD (adding an \(\ell _2\)-regularization factor to the projection operator), LARS-TD [19] and OMPTD [25] algorithms. For that purpose, we considered the corrupted chain problem, similar to [12, 16, 19].

In contrast to the VBLSTD algorithm, the performance of the three aforementioned algorithms depends entirely on the penalty parameter, which must be defined explicitly in advance. Therefore, we have to answer the following question: what is the best value for the regularization factor? In the case of the \(\ell _2\)-LSTD algorithm we adopted the same strategy as Hoffman et al. [16], using a grid of 10 parameters logarithmically spaced between \(10^{-6}\) and 10. In the case of the LARS-TD and OMPTD algorithms, we computed the whole regularization path, similarly to [12], setting the regularization factor equal to \(10^{-7}\). In all cases, the best prediction error has been reported.

In our experimental results, we illustrate the average root mean squared error with respect to the true value function, \(V^*\). The true value function was computed explicitly, since we examine discrete environments. It should also be noted that for each run the algorithms were provided with the same rollouts of data. For each average, we also plot the \(95\%\) confidence interval for the accuracy of the mean estimate with error bars. Additionally, we show the \(90\%\) percentile region of the runs, in order to indicate inter-run variability in performance.

5.1 Boyan’s Chain

To demonstrate the ability of the VBLSTD algorithm to converge to the same solution as standard LSTD, we examine Boyan’s chain problem [5]. This is an N-state Markov chain with a single action. Each episode starts in state \(N-1\) and terminates when the (absorbing) state zero is reached. From each state \(s > 2\), we transition with equal probability to state \(s-1\) or \(s-2\), with reward \(-3\). On the other hand, we deterministically transition from state 2 to 1 and from state 1 to 0, where the received rewards are equal to \(-2\) and 0, respectively. Similar to [13], three different problem sizes have been considered: \(N = \{14, 102, 402\}\). The feature vectors used to represent the states are exactly the same as those used by Geramifard et al. [13]. Figure 2 illustrates the performance of the VBLSTD and LSTD algorithms on the three different Boyan’s chain problems, averaged over 1000 runs. In all three problems, it is clear that the proposed VBLSTD algorithm converges to the same solution as the one returned by the LSTD algorithm. This means that the VBLSTD algorithm discovers the globally optimal solution (i.e., the solution that corresponds to the minimum MSPBE).
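For reproducibility, transitions in Boyan’s chain can be simulated directly from the description above; the following is our own minimal sketch (the function name and episode format are illustrative):

```python
import numpy as np

def boyan_rollout(N, rng):
    """One episode of Boyan's chain; returns (s, r, s_next) transitions."""
    transitions = []
    s = N - 1
    while s > 0:
        if s > 2:
            s_next = s - rng.integers(1, 3)   # to s-1 or s-2 with prob 1/2 each
            r = -3.0
        elif s == 2:
            s_next, r = 1, -2.0               # deterministic, reward -2
        else:                                 # s == 1
            s_next, r = 0, 0.0                # deterministic, reward 0
        transitions.append((s, r, s_next))
        s = s_next
    return transitions

rollout = boyan_rollout(14, np.random.default_rng(0))
```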

Fig. 2.

Performance of policy evaluation on the Boyan’s Chain for a fixed policy.

5.2 Corrupted Chain

In order to examine the sparsification properties of the VBLSTD algorithm, we consider the corrupted chain problem as in [12, 16, 19]. This is a 20-state, 2-action MDP proposed in [20]. In this problem, the states are connected in a chain, with the actions indicating the direction (left or right) and a success probability of 0.9. For instance, executing the left action at state s, the system transitions to state \(s-1\) with probability 0.9 and to state \(s+1\) with probability 0.1. A reward of one is given only at the ends of the chain. Similar to [12, 16, 19], to represent the value function we consider \(k = 6 + \overline{s}\) features: 6 ‘relevant’ features (a bias term and five RBF basis functions spaced evenly over the state space) and \(\overline{s}\) additional ‘irrelevant’ features containing random Gaussian noise. It should also be stressed that throughout our analysis we did not perform any standardization of the feature matrices. Also, in the case of the VBLSTD, we keep the noise precision fixed.
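The feature construction can be sketched as follows (our own code; the RBF width and the unit scale of the noise features are our own choices and are not specified above):

```python
import numpy as np

def corrupted_chain_features(states, n_noise, rng, rbf_width=4.0):
    """Bias + 5 evenly spaced RBFs over the 20 states + n_noise noise features."""
    states = np.asarray(states, dtype=float)
    centers = np.linspace(1, 20, 5)                      # evenly spaced RBF centres
    rbf = np.exp(-(states[:, None] - centers[None, :])**2 / (2 * rbf_width**2))
    bias = np.ones((len(states), 1))
    noise = rng.normal(size=(len(states), n_noise))      # 'irrelevant' features
    return np.hstack([bias, rbf, noise])

Phi = corrupted_chain_features(np.arange(1, 21), 600, np.random.default_rng(0))
```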

Fig. 3.

Performance of policy evaluation on the Corrupted chain for a fixed policy. (Left) we consider \(\overline{s} = 600\) ‘irrelevant’ features while varying the number of samples. The horizon of each episode is set equal to 20 steps. (Right) we use 400 transitions (20 rollouts of horizon 20) varying the number of ‘irrelevant’ features.

The results of our experiments are presented in Fig. 3, averaged over 30 runs. We report the prediction error between the estimated and the true value function on 1000 test points. The evaluated policy is the optimal one, which selects the left action in the first 10 states and the right action in the remaining 10. The first (left) plot shows the results in the case where we have \(\overline{s} = 600\) ‘irrelevant’ features while varying the number of samples (the horizon of each episode is equal to 20, starting randomly in \(\{1,\ldots ,20\}\)). On the other hand, the second (right) plot depicts the results in the case where we sample 400 transitions (20 rollouts of horizon 20), varying the number of ‘irrelevant’ features. In both cases, the VBLSTD algorithm performs much better than the other three regularization schemes. The difference becomes more apparent as the number of irrelevant features increases. Additionally, the performance of VBLSTD remains quite stable even when we select a large number of noise features. On the contrary, the OMPTD algorithm seems to become unstable when the number of noise features becomes large. Furthermore, we also note that VBLSTD is not affected by overfitting when the number of features becomes greater than the number of samples. As expected, the performance of all algorithms is quite close when a large number of transitions is used for training or when the number of noise features is small.

Finally, Fig. 4 illustrates the mean weights (solution), \(\varvec{\theta }\), for each of the examined algorithms, considering 600 irrelevant features. The numbers of training episodes used in these two plots (Fig. 4) are 10 and 100, respectively. As we can easily verify, when the number of features exceeds the number of training samples, we encounter the overfitting problem, which leads to poor predictions. This is most apparent in the case of the LSTD algorithm. The VBLSTD algorithm, however, manages to avoid the overfitting problem, succeeding in identifying the relevant features even in the case where the number of samples is much lower than the number of features. Last but not least, it should be highlighted that when the number of training samples is much higher than the number of features, the solutions of the LSTD and VBLSTD algorithms are quite similar.

Fig. 4.

The 606 mean weight values. The first weight is the bias term, the next 5 correspond to the relevant features (RBFs), and the remaining 600 correspond to the noise (irrelevant) features. The dichotomy between the ‘relevant’ and ‘irrelevant’ weights is apparent.

6 Conclusion

In this paper we introduced a fully Bayesian framework for the least-squares temporal difference learning algorithm, called BLSTD. This is achieved by adopting an explicit probabilistic model for the empirical Bellman operator and introducing a prior distribution over the unknown model’s parameters. This gives us the advantage of not only having a point estimate of the unknown value function parameters, but also of quantifying our uncertainty about the value function. We further extended this method to a sparse variational Bayes model, called VBLSTD. The main advantage of VBLSTD compared to other regularization schemes is its ability to avoid over-fitting by determining the model’s complexity in an automatic way. In practice, we verified that the VBLSTD solutions are at least as good as those of other state-of-the-art algorithms, while being able to automatically ignore noisy features. We believe that this principled approach to policy evaluation can also lead to reinforcement learning algorithms with good exploration performance, something that we leave for future work.