1 Summary of notation

We use \({\mathbf {x}}\) to denote a random variable and x a deterministic variable. For a set A, \(I_{A}\) represents the indicator function of A, i.e., \(I_{A}(x) = 1\) if \(x \in A\) and 0 otherwise. Let \(f_{\theta }(\cdot )\) denote the probability density function parametrized by \(\theta \). Let \({\mathbb {E}}_{\theta }[\cdot ]\) and \(P_{\theta }\) denote the expectation and the induced probability measure w.r.t. \(f_{\theta }\). For \(\rho \in (0,1)\) and a scalar-valued function J, let \(\gamma _{\rho }(J, \theta )\) denote the \((1-\rho )\)-quantile of \(J({\mathbf {x}})\) w.r.t. \(f_{\theta }\), i.e.,

$$\begin{aligned} \gamma _{\rho }(J, \theta ) \triangleq \sup \{l: P_{\theta }(J({\mathbf {x}}) \ge l) \ge \rho \}. \end{aligned}$$
(1)

Let \(supp(f) \triangleq \overline{\{x | f(x) \ne 0\}}\) denote the support of f and interior(A) be the interior of set A. Let \(\mathcal {N}_{d}(a, B)\) represent the multivariate Gaussian distribution with mean vector a and covariance matrix B. A function \(L:\mathrm{I\!R}^{m} \rightarrow \mathrm{I\!R}\) is Lipschitz continuous, if \(\exists K \ge 0\) s.t. \(\vert L(x) - L(y) \vert \le K\Vert x - y \Vert \), \(\forall x, y \in \mathrm{I\!R}^{m}\), where \(\Vert \cdot \Vert \) is a norm defined on \(\mathrm{I\!R}^{m}\). Also, for a matrix \(A = [a_{ij}]_{1 \le i \le m, 1 \le j \le n} \in \mathrm{I\!R}^{m \times n}\), we define the norm \(\Vert A \Vert _{\infty } \triangleq \max _{1 \le i \le m} \sum _{1 \le j \le n}\vert a_{ij} \vert \) and for invertible matrices, we define the condition number \(\kappa (A) \triangleq \Vert A \Vert _{\infty } \Vert A^{-1} \Vert _{\infty }\). Also, \(\vert A \vert \triangleq [\vert a_{ij} \vert ]_{1 \le i \le m, 1 \le j \le n}\). Similarly, for \(x \in \mathrm{I\!R}^{m}\), the sup norm \(\Vert x \Vert _{\infty }\) is defined as \(\Vert x \Vert _{\infty } \triangleq \sup _{i}{\vert x_i \vert }\) and \(\vert x \vert \triangleq (\vert x_i \vert )_{1 \le i \le m}\).
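
As a quick numerical illustration of the matrix norm and condition number defined above, the following sketch (assuming NumPy; the matrix is arbitrary) computes \(\Vert A \Vert _{\infty }\) and \(\kappa (A)\):

```python
import numpy as np

def inf_norm(A):
    # ||A||_inf: maximum over rows of the sum of absolute entries
    return np.max(np.sum(np.abs(A), axis=1))

def condition_number(A):
    # kappa(A) = ||A||_inf * ||A^{-1}||_inf, defined for invertible A
    return inf_norm(A) * inf_norm(np.linalg.inv(A))

A = np.array([[2.0, -1.0],
              [0.5,  3.0]])
print(inf_norm(A))          # 3.5
print(condition_number(A))  # approximately 2.15
```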

2 Introduction and preliminaries

A discrete time Markov decision process (MDP) (Sutton and Barto 1998; Bertsekas 1995) is a 4-tuple (\({\mathbb {S}}\), \({\mathbb {A}}\), R, P), where \({\mathbb {S}}\) denotes the set of states and \({\mathbb {A}}\) is the set of actions. Also, \(R: {\mathbb {S}} \times {\mathbb {A}} \times {\mathbb {S}} \rightarrow \mathrm{I\!R}\) is the reward function where \(R(s, a, s^{\prime })\) represents the reward obtained in state s after taking action a and transitioning to state \(s^{\prime }\). Without loss of generality, we assume that the same choice of actions is available for all the states. We also assume that the reward function is bounded, i.e., \(\Vert R \Vert _{\infty } < \infty \). We let \(P:{\mathbb {S}} \times {\mathbb {A}} \times {\mathbb {S}} \rightarrow [0,1]\) denote the transition probability kernel, where \(P(s, a, s^{\prime })\) is the probability of next state being \(s^{'}\) conditioned on the fact that the current state is s and action taken is a. We assume that the state and action spaces are finite. A stationary random policy (SRP) \(\pi (\cdot \vert s)\) is a probability distribution over the action space \({\mathbb {A}}\) conditioned on state \(s \in {\mathbb {S}}\). A given policy \(\pi \) along with the transition kernel P determines the state dynamics of the system. For a given policy \(\pi \), the system behaves as a homogeneous Markov chain with transition probabilities

$$\begin{aligned} P_{\pi }(s,s^{\prime }) = \sum _{a \in {\mathbb {A}}}\pi (a \vert s)P(s, a, s^{\prime }), s, s^{\prime } \in {\mathbb {S}}. \end{aligned}$$
(2)

In this paper, we consider only stationary randomized policies. We also assume that given an SRP \(\pi \), the Markov chain induced by \(P_{\pi }\) is ergodic, i.e., the Markov chain is irreducible and aperiodic.
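
For concreteness, a small sketch of Eq. (2): given arrays \(P[s, a, s^{\prime }]\) and \(\pi [s, a] = \pi (a \vert s)\) (array layouts assumed here purely for illustration), the induced transition matrix \(P_{\pi }\) can be formed as follows.

```python
import numpy as np

def policy_transition_matrix(P, pi):
    # Eq. (2): P_pi(s, s') = sum_a pi(a|s) * P(s, a, s')
    return np.einsum('sa,sat->st', pi, P)

# illustrative random model and policy with |S| = 3, |A| = 2
rng = np.random.default_rng(0)
nS, nA = 3, 2
P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)
pi = rng.random((nS, nA));    pi /= pi.sum(axis=1, keepdims=True)
P_pi = policy_transition_matrix(P, pi)
assert np.allclose(P_pi.sum(axis=1), 1.0)  # each row of P_pi is a distribution
```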

The two fundamental problems most commonly addressed in the MDP literature are: 1. the prediction problem and 2. the control problem.

Prediction problem For a given SRP \(\pi \) and discount factor \(\gamma \in (0,1)\), the objective is to evaluate the long-run \(\gamma \)-discounted reward (value function) \(V^{\pi } \in \mathrm{I\!R}^{\vert {\mathbb {S}} \vert }\), which is defined as

$$\begin{aligned} V^{\pi }(s) \triangleq {\mathbb {E}}_{\pi }\left[ \sum _{k=0}^{\infty }\gamma ^{k}R({\mathbf {s}}_{k}, {\mathbf {a}}_{k}, {\mathbf {s}}_{k+1}) \Big \vert \mathbf {s}_{0} = s\right] , s \in {\mathbb {S}}, \end{aligned}$$
(3)

where the random variable \({\mathbf {s}}_{k}\) represents the state at instant k, the random variable \({\mathbf {a}}_{k}\) represents the action chosen at instant k and the random variable \({\mathbf {s}}_{k+1}\) represents the transitioned state after instant k, i.e., the state at instant \(k+1\). Further, \({\mathbb {E}}_{\pi }[\cdot ]\) is the expectation w.r.t. the probability distribution induced by \(P_{\pi }\) with initial state \({\mathbf {s}}_{0} = s\). Note that the evaluation in (3) is realistic and prudent. Since the MDP is a sequential decision making paradigm, the discount factor \(\gamma \) controls the width of the window of future events to be considered to guide the decision process. For \(\gamma \) close to 0, only the rewards pertaining to the first few transitions count, since the effect of the future rewards, whose weights are geometric in \(\gamma \), is minimal. However, the case of \(\gamma \) very close to 1 requires a very long window to be considered.
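
A hedged sketch of Eq. (3): the value \(V^{\pi }(s)\) can be estimated by averaging truncated discounted returns over simulated trajectories. The arrays P, R and pi are hypothetical model inputs, and a simulator is assumed here purely to illustrate the definition.

```python
import numpy as np

def mc_value_estimate(P, R, pi, s0, gamma=0.9, horizon=200, n_traj=1000, seed=0):
    # Average of truncated discounted returns started from state s0
    rng = np.random.default_rng(seed)
    nS, nA, _ = P.shape
    total = 0.0
    for _ in range(n_traj):
        s, ret, disc = s0, 0.0, 1.0
        for _ in range(horizon):                 # truncate the infinite sum in Eq. (3)
            a = rng.choice(nA, p=pi[s])          # a_k ~ pi(.|s_k)
            s_next = rng.choice(nS, p=P[s, a])   # s_{k+1} ~ P(s_k, a_k, .)
            ret += disc * R[s, a, s_next]
            disc *= gamma
            s = s_next
        total += ret
    return total / n_traj
```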

For a given policy \(\pi \), the value function \(V^{\pi }\) satisfies the following Bellman equation (written in vector-matrix notation):

$$\begin{aligned} V^{\pi } = T^{\pi }V^{\pi }, \end{aligned}$$
(4)

where \(T^{\pi }\) called the Bellman operator is defined as \(T^{\pi }V \triangleq R^{\pi } + \gamma P_{\pi } V\) and \(R^{\pi }(s) \triangleq \sum _{a \in {\mathbb {A}}}\pi (a \vert s)\sum _{s^{\prime } \in {\mathbb {S}}}P(s, a, s^{\prime })R(s, a, s^{\prime })\). Hence \(V^{\pi }\) can be directly computed as \(V^{\pi } = (I-\gamma P_{\pi })^{-1}R^{\pi }\). The computational complexity of the above direct computation is \(O(\vert {\mathbb {S}} \vert ^{3})\) and the space complexity is \(O(\vert {\mathbb {S}} \vert ^{2})\). An alternate procedure to solve the prediction problem is value iteration that is based on the contraction mapping theorem. It is easy to see that the Bellman operator \(T^{\pi }\) is a contraction mapping with the contraction constant \(\gamma \). Hence by the contraction mapping theorem, \((T^{\pi })^{k}V \rightarrow V^{\pi }\) as \(k \rightarrow \infty \), \(\forall V \in \mathrm{I\!R}^{\vert {\mathbb {S}} \vert }\). The computational complexity of this successive approximation procedure is \(O(\vert {\mathbb {S}} \vert ^{2})\) per iteration and the space complexity is \(O(\vert {\mathbb {S}} \vert ^{2})\) as well. The state space \({\mathbb {S}}\) can be huge, for example, in cases where the state is represented as a high-dimensional vector. The cardinality of the state space in such a case is exponential in the dimension resulting in a corresponding exponential upsurge in computational effort and storage requirement. In such cases, the above method can become well-nigh intractable. This predicament is referred to in the literature as the curse of dimensionality. One commonly employed heuristic to circumvent the curse is the state aggregation (Bertsekas and Castanon 1989) technique. However, it also suffers dearly when the state space is huge.
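
The following sketch contrasts the two prediction procedures described above, assuming \(P_{\pi }\) (\(\vert {\mathbb {S}} \vert \times \vert {\mathbb {S}} \vert \)) and \(R^{\pi }\) (\(\vert {\mathbb {S}} \vert \)-vector) have already been formed; the function names are illustrative.

```python
import numpy as np

def solve_direct(P_pi, R_pi, gamma):
    # V^pi = (I - gamma * P_pi)^{-1} R^pi, O(|S|^3) time
    n = P_pi.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)

def solve_value_iteration(P_pi, R_pi, gamma, tol=1e-10, max_iter=100_000):
    # Successive application of T^pi V = R^pi + gamma * P_pi V, O(|S|^2) per sweep
    V = np.zeros_like(R_pi, dtype=float)
    for _ in range(max_iter):
        V_new = R_pi + gamma * P_pi @ V
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V
```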

Control problem The objective for this problem is to find the optimal stationary policy \(\pi ^{*}\) of the MDP, where

$$\begin{aligned} \pi ^{*}(s) \in \mathop {\mathrm{arg\,max}}\limits _{\pi }V^{\pi }(s), s \in {\mathbb {S}}. \end{aligned}$$
(5)

The existence of an optimal stationary policy is proven in Puterman (2014). The optimal value function \(V^{*} (= V^{\pi ^{*}})\) satisfies the Bellman optimality equation given by: \(T V^{*} = V^{*}\), where the Bellman optimality operator T is defined as \(T V(s) \triangleq max_{a \in {\mathbb {A}}}{\sum _{s^\prime \in {\mathbb {S}}}P(s, a, s^\prime )(R(s, a, s^\prime ) + \gamma V(s^\prime ))}\). The primary numerical methods which solve the control problem are the value iteration and policy iteration. A detailed description of these methods is available in Puterman (2014). In a nutshell, policy iteration can be characterized as generating a sequence of improving policies \(\{\pi _k\}_{k \in {\mathbb {N}}}\) with \(\pi _k\) converging to \(\pi ^{*}\) after a finite number of steps. Value iteration on the other hand involves repeated application of the Bellman optimality operator, which requires multiple extensive passes over the state space and the convergence is only guaranteed asymptotically. The computational complexities of policy iteration and value iteration are \(O(\vert {\mathbb {S}} \vert ^{2}\vert {\mathbb {A}} \vert + \vert {\mathbb {S}} \vert ^{3})\) and \(O(\vert {\mathbb {S}} \vert ^{2} \vert {\mathbb {A}} \vert )\) respectively. The space complexity of both the methods is the same and it is \(O(\vert {\mathbb {S}} \vert + \vert {\mathbb {A}} \vert )\). The super-linear dependency of the methods on the size of state space results in the curse of dimensionality. A recently proposed policy iteration method based on stochastic factorization (Barreto et al. 2014) has reduced the dependency to linear terms. However, when \({\mathbb {S}}\) is very large, stochastic factorization also becomes intractable.
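
As a minimal sketch of value iteration for the control problem (assuming model arrays \(P[s, a, s^{\prime }]\) and \(R[s, a, s^{\prime }]\) are available, purely for illustration), each sweep applies the Bellman optimality operator T and the greedy policy is read off at the end.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-10, max_iter=100_000):
    nS, nA, _ = P.shape
    V = np.zeros(nS)
    expected_R = np.einsum('sat,sat->sa', P, R)       # sum_{s'} P(s,a,s') R(s,a,s')
    for _ in range(max_iter):
        Q = expected_R + gamma * np.einsum('sat,t->sa', P, V)
        V_new = Q.max(axis=1)                         # T V(s) = max_a Q(s, a)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V_new, Q.argmax(axis=1)                    # value estimate and greedy policy
```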

2.1 Model free algorithms

The prediction and control algorithms in the above section are numerical methods that assume that the transition probability function P and the reward function R are available. In most practical scenarios, it is unrealistic to assume that accurate knowledge of P and R is realizable. However, the behaviour of the system can be observed and one needs to either predict the value of a given policy or find the optimal control using the available observations. The observations are in the form of a sample trajectory \(\{s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, \dots \}\), where \(s_i \in {\mathbb {S}}\) is the state, \(a_i \in {\mathbb {A}}\) is the action and \(r_i = R(s_i, a_i, s_{i+1})\) is the immediate reward at time instant i. Model free algorithms are basically of three types: (i) Indirect methods, (ii) Direct methods and (iii) Policy search methods. The last of these searches the policy space to find the optimal policy, where the performance measure used for comparison is the estimate of the value function induced from the observations. Prominent algorithms in this category are actor-critic (Konda and Tsitsiklis 2003), policy gradient (Baxter and Bartlett 2001), natural actor-critic (Bhatnagar et al. 2009) and fast policy search (Mannor et al. 2003). Indirect methods are based on the certainty equivalence principle, where initially the transition matrix and the expected reward vector are estimated using the observations and subsequently the model based approaches mentioned in the above section are applied to the estimates. A few indirect methods are control learning (Sato et al. 1982, 1988; Kumar and Lin 1982), priority sweeping (Moore and Atkeson 1993), adaptive real-time dynamic programming (ARTDP) (Barto et al. 1995) and PILCO (Deisenroth and Rasmussen 2011). In the case of direct methods, which are more appealing, the model is not estimated; rather, the control policy is adapted iteratively using a shadow utility function derived from the instantiation of the internal dynamics of the MDP. The algorithms in this class are generally referred to in the literature as reinforcement learning algorithms. Prominent reinforcement learning algorithms include temporal difference (TD) learning (Sutton 1988) (a prediction method), Q-learning (Watkins 1989) and SARSA (Singh and Sutton 1996) (control methods). There are two variants of the prediction algorithms depending on how the sample trajectory is generated: on-policy and off-policy algorithms. In the on-policy case, the sample trajectory is generated using the policy \(\pi \) which is being evaluated, i.e., \(s_{i+1} \sim P(s_i, a_i, \cdot )\), where \(a_i \sim \pi (\cdot \vert s_{i})\) and \(r_i = R(s_i, a_i, s_{i+1})\). In the off-policy case, the sample trajectory is generated using a policy \(\pi _b\) which is possibly different from the policy \(\pi \) being evaluated, i.e., \(s_{i+1} \sim P(s_i, a_i, \cdot )\), where \(a_i \sim \pi _b(\cdot \vert s_{i})\) and \(r_i = R(s_i, a_i, s_{i+1})\).

Model free algorithms have been shown to be robust and stable, and to exhibit good convergence behaviour under realistic assumptions. However, they suffer from the curse of dimensionality which arises due to the space complexity. Note that the space complexity of the above mentioned learning algorithms is \(O(\vert {\mathbb {S}} \vert )\), which becomes unmanageably large with increasing state space.

2.2 Linear function approximation (LFA) methods for model free Markov decision process

To tackle the curse of dimensionality and achieve tractability, it is imperative to eliminate the dependence of both the computational and storage requirements of the learning methods on the cardinalities of the state and action spaces. An efficient approach is to compactly yet effectively represent the system in a lower, \(k_1\)-dimensional space, where \(k_1 \ll \vert {\mathbb {S}} \vert \). A well understood dimensionality reduction technique is linear function approximation. Here, we choose a collection of prediction features \(\{\phi _i\}_{i=1}^{k_1}\), where \(\phi _i \in \mathrm{I\!R}^{\vert {\mathbb {S}} \vert }\). In this case, the prediction task becomes a projection, where

$$\begin{aligned} \varPi V^{\pi } = \mathop {\mathrm{arg \,min}}\limits _{h \in \mathrm{I\!H}^{\varPhi }} \Vert V^{\pi } - h \Vert ^{2}, \end{aligned}$$
(6)

where \(\mathrm{I\!H}^{\varPhi } \triangleq \{\varPhi x \vert x \in \mathrm{I\!R}^{k_1}\} \subset \mathrm{I\!R}^{\vert {\mathbb {S}} \vert }\) is the space of representable functions with \(\varPhi \triangleq (\phi _1, \dots , \phi _{k_1}) \in \mathrm{I\!R}^{\vert {\mathbb {S}} \vert \times k_1}\) and the norm \(\Vert \cdot \Vert \) is chosen appropriately according to the domain. Note that \(\mathrm{I\!H}^{\varPhi }\) is a linear function space. Further, we define \(\phi (s) \triangleq (\phi _1(s), \dots , \phi _{k_{1}}(s))^{\top }\), \(s \in {\mathbb {S}}\). Note that \(\phi _i\) can be viewed as a function from \({\mathbb {S}}\) to \(\mathrm{I\!R}\). Similarly, the control problem becomes \(\pi ^{*}(s) \in \mathop {\mathrm{arg\,max}}\limits _{\pi }\varPi V^{\pi }(s), \forall s \in {\mathbb {S}}\). Note that in the case of large and complex MDPs, the features are not hard-coded, instead one employs compact representations in the form of basis functions. Examples of basis functions include radial basis functions and Fourier basis.
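
As an illustrative sketch of building the feature matrix \(\varPhi \) from radial basis functions (the centres and width below are assumptions made purely for illustration):

```python
import numpy as np

def rbf_features(states, centres, width=1.0):
    # phi_i(s) = exp(-(s - c_i)^2 / (2 * width^2)); Phi has shape |S| x k1
    S = np.asarray(states, dtype=float)[:, None]
    C = np.asarray(centres, dtype=float)[None, :]
    return np.exp(-(S - C) ** 2 / (2.0 * width ** 2))

Phi = rbf_features(states=range(100), centres=[0, 25, 50, 75, 99], width=10.0)
print(Phi.shape)  # (100, 5): |S| = 100 states, k1 = 5 features
```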

To address the computational and storage concerns arising due to large action space, a sagacious approach is to employ a parametrized class of SRPs \(\{\pi _{w} \vert w \in {\mathbb {W}} \subset \mathrm{I\!R}^{k_{2}}\}\), where \(k_{2} \in {\mathbb {N}}\), instead of an exact representation. The most commonly used is the Gibbs (or Boltzmann) “soft-max” class of policies. In this case, for a given \(w \in {\mathbb {W}} \subset \mathrm{I\!R}^{k_{2}}\), the SRP \(\pi _{w}\) is defined as

$$\begin{aligned} \pi _{w}(a | s) = \frac{\exp {(w^{\top }\psi (s, a)/\tau )}}{\sum _{b \in {\mathbb {A}}}\exp {(w^{\top }\psi (s,b)/\tau )}}, \end{aligned}$$
(7)

where \(\{\psi (s, a) \in \mathrm{I\!R}^{k_{2}} \vert s \in {\mathbb {S}}, a \in {\mathbb {A}}\}\) is a given policy feature set and \(\tau \in \mathrm{I\!R}_{+}\) is fixed a priori.
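
A minimal sketch of the Gibbs policy class in Eq. (7), assuming the policy features are stored in a hypothetical array psi with psi[s, a, :] \(= \psi (s, a) \in \mathrm{I\!R}^{k_{2}}\):

```python
import numpy as np

def gibbs_policy(w, psi, tau=1.0):
    # pi_w(a|s) = exp(w^T psi(s,a)/tau) / sum_b exp(w^T psi(s,b)/tau)
    logits = psi @ w / tau                       # shape (|S|, |A|)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stabilisation only
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)      # row s holds pi_w(.|s)
```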

The accuracy of the function approximation method depends on the representational/expressive ability of \(\mathrm{I\!H}^{\varPhi }\). For example, when \(k_{1} = \vert {\mathbb {S}} \vert \), the representational ability is maximal, since \(\mathrm{I\!H}^{\varPhi } = \mathrm{I\!R}^{\vert \mathbb {S} \vert }\). In general, \(k_1 \ll \vert \mathbb {S} \vert \) and hence \(\mathrm{I\!H}^{\varPhi } \subset \mathrm{I\!R}^{\vert \mathbb {S} \vert }\). So for an arbitrary policy \(\pi \) with \(V^{\pi } \notin \mathrm{I\!H}^{\varPhi }\), the prediction of the value function \(V^{\pi }\) shall always incur an unavoidable approximation error (\(e_{appr}\)) given by \(\inf _{h \in \mathrm{I\!H}^{\varPhi }}\Vert V^{\pi } - h \Vert \). Given \(\mathrm{I\!H}^{\varPhi }\), one can do no better than \(e_{appr}\). The prediction features \(\{\phi _i\}\) are hand-crafted using prior domain knowledge and their choice is critical in approximating the value function. There is an abundance of literature available on the topic. In this paper, we assume that an appropriately chosen feature set is available a priori. Also note that the convergence of the prediction methods is only asymptotic. In most practical scenarios, the algorithm has to be terminated after a finite number of steps, which incurs an estimation error (\(e_{est}\)) that, however, decays to zero asymptotically.

Even though LFA produces sub-optimal solutions, since the search is conducted on a restricted subspace of \(\mathrm{I\!R}^{\vert {\mathbb {S}} \vert }\), it yields large computational and storage benefits. So some degree of trade-off between accuracy and tractability is indeed unavoidable.

2.3 Off-policy prediction using LFA

Setup Given \(w, w_b \in {\mathbb {W}}\) and an observation of the system dynamics in the form of a sample trajectory \(\{s_{0}, a_0, r_{0}, s_{1}, a_1, r_{1}, s_{2}, \dots \}\), where at each instant k, \(a_{k} \sim \pi _{w_b}(\cdot \vert s_{k})\), \(s_{k+1} \sim \) \(P(s_k, a_k, \cdot )\) and \(r_{k}\) = \(R(s_{k}, a_{k}, s_{k+1})\), the goal is to estimate the value function \(V^{\pi _{w}}\) of the target policy \(\pi _{w}\) (that is possibly different from \(\pi _{w_b}\)). We assume that the Markov chains defined by \(P_{w}\) and \(P_{w_b}\) are ergodic. Further, let \(\nu _{w}\) and \(\nu _{w_b}\) be the stationary distributions of the Markov chains with transition probability matrices \(P_{w}\) and \(P_{w_b}\) respectively, i.e., \(\lim _{k \rightarrow \infty }P_{w}({\mathbf {s}}_k = s) = \nu _{w}(s)\) and \(\nu _{w}^{\top }P_{w} = \nu _{w}^{\top }\) and likewise for \(\nu _{w_b}\). Note that for brevity the notations have been simplified here, i.e., \(P_{w} \triangleq P_{\pi _{w}}\) and \(P_{w_b} \triangleq P_{\pi _{w_b}}\). We follow the new notation for the rest of the paper. Similarly, \(V^{w} \triangleq V^{\pi _w}\).

In the off-policy learning case, the projection is w.r.t. the norm \(\Vert \cdot \Vert _{\nu _{w_b}}\), where \(\Vert V \Vert ^{2}_{\nu _{w_b}} = <V, V>_{\nu _{w_b}}\). The inner product is defined as \(<V_1, V_2>_{\nu } = V_{1}^{\top }D^{\nu }V_2\), where \(V_1, V_2 \in \mathrm{I\!R}^{\vert {\mathbb {S}} \vert }, \nu \in [0,1]^{\vert {\mathbb {S}} \vert }\) is a probability mass function over \({\mathbb {S}}\) and \(D^{\nu }\) is a \(\vert {\mathbb {S}} \vert \times \vert {\mathbb {S}} \vert \) diagonal matrix with \(D^{\nu }_{ii} = \nu (i)\), \(1 \le i \le \vert {\mathbb {S}} \vert \). Thus the norm \(\Vert \cdot \Vert _{\nu _{w_b}}\) is in fact the Euclidean norm weighted with the stationary distribution \(\nu _{w_b}\) of the behaviour policy \(\pi _{w_b}\), i.e., \(\Vert V \Vert _{\nu _{w_b}} \triangleq \sqrt{\sum _{s \in {\mathbb {S}}}\nu _{w_b}(s)V^{2}(s)}\). So

$$\begin{aligned} h_{w | w_b} \triangleq \varPi ^{w_b}V^{w} = \mathop {\mathrm{arg\,min}}\limits _{h \in \mathrm{I\!H}^{\varPhi }} \Vert V^{w} - h \Vert _{\nu _{w_b}}^{2}, \end{aligned}$$
(8)

where \(\varPi ^{w_b}\) denotes the projection operator w.r.t. \(\Vert \cdot \Vert _{\nu _{w_b}}\) whose closed form expression can be derived as follows:

$$\begin{aligned} \begin{aligned} \nabla _{x}\Vert V^{w} -&h \Vert _{\nu _{w_b}}^{2} = 0\\&\Rightarrow \nabla _x(V^{w} - \varPhi x)^{\top }D^{\nu _{w_b}}(V^{w} - \varPhi x) = 0\\&\Rightarrow \varPhi ^{\top }D^{\nu _{w_b}}(V^{w} - \varPhi x) = 0\\&\Rightarrow \varPhi ^{\top }D^{\nu _{w_b}}\varPhi x = \varPhi ^{\top }D^{\nu _{w_b}}V^{w}\\&\Rightarrow x = (\varPhi ^{\top }D^{\nu _{w_b}}\varPhi )^{-1}\varPhi ^{\top }D^{\nu _{w_b}}V^{w} \\&\Rightarrow \varPhi x = \varPhi (\varPhi ^{\top }D^{\nu _{w_b}}\varPhi )^{-1}\varPhi ^{\top }D^{\nu _{w_b}}V^{w}. \end{aligned}\nonumber \\ \therefore \varPi ^{w_b} = \varPhi (\varPhi ^{\top } D^{\nu _{w_b}} \varPhi )^{-1}\varPhi ^{\top }D^{\nu _{w_b}}. \end{aligned}$$
(9)

\(\circledast \) Assumption (A1) The prediction features \(\{\phi _i\}_{i=1}^{k_1}\) are linearly independent.
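
A sketch of the closed form in Eq. (9), assuming (A1) so that \(\varPhi ^{\top } D^{\nu _{w_b}} \varPhi \) is invertible; the helper that extracts the stationary distribution from \(P_{w_b}\) is included only for completeness, since in the model free setting \(\nu _{w_b}\) is not available.

```python
import numpy as np

def stationary_distribution(P_b):
    # Left eigenvector of P_b for eigenvalue 1: nu^T P_b = nu^T, normalised
    vals, vecs = np.linalg.eig(P_b.T)
    nu = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return nu / nu.sum()

def projection_matrix(Phi, nu_b):
    # Eq. (9): Pi = Phi (Phi^T D Phi)^{-1} Phi^T D, with D = diag(nu_b)
    D = np.diag(nu_b)
    M = Phi.T @ D @ Phi                    # k1 x k1, invertible under (A1)
    return Phi @ np.linalg.solve(M, Phi.T @ D)
```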

Algorithms The evaluation of \(\varPi ^{w_b}\) requires knowledge of the stationary distribution \(\nu _{w_b}\) which can only be derived if the transition matrix \(P_{w_b}\) is available. However, in model free learning \(P_{w_b}\) is hidden and hence all the state-of-the-art methods can only derive an approximation to the projection. Two pertinent algorithms are off-policy TD(\(\lambda \)) and off-policy LSTD(\(\lambda \)). The algorithms return a prediction vector \(x \in \mathrm{I\!R}^{k_{1}}\) s.t. \(\varPhi x \approx h_{w \vert w_b}\). The major technique used in both the algorithms is to correct the discrepancies between the target and behaviour policies using importance sampling (Glynn and Iglehart 1989). Here we introduce the sampling ratio at time k to be \(\rho _{k} \triangleq \frac{\pi _{w}(a_{k} \vert s_{k})}{\pi _{w_b}(a_{k} \vert s_{k})}\), where we use the convention \(0/0 = 0\).

  • Off-policy TD(\(\lambda \))

Off-policy TD(\(\lambda \)) (Yu 2012, 2015), where \(\lambda \in [0,1]\), is one of the fundamental algorithms for approximating the value function using a linear architecture. The algorithm is defined as follows:

$$\begin{aligned} {\mathbf {x}}_{k+1}:= & {} {\mathbf {x}}_k + \alpha _{k+1}\delta _{k+1}{\mathbf {e}}_{k},\end{aligned}$$
(10a)
$$\begin{aligned} {\mathbf {e}}_{k+1}:= & {} \gamma \lambda \rho _{k} {\mathbf {e}}_k + \phi (s_k), \end{aligned}$$
(10b)

where \({\mathbf {e}}_{k}, {\mathbf {x}}_{k} \in \mathrm{I\!R}^{k_{1}}\) and \(\delta _{k+1} \triangleq \rho _{k}r_{k} + \gamma \rho _{k}{\mathbf {x}}_{k}^{\top }\phi (s_{k+1}) - {\mathbf {x}}_{k}^{\top }\phi (s_k)\) is called the temporal difference error. The learning rate \(\alpha _{k}\) is non-negative, deterministic and satisfies \(\sum _{k} \alpha _k = \infty \), \(\sum _{k} \alpha _{k}^{2} < \infty \). The vector \({\mathbf {e}}_{k} \in \mathrm{I\!R}^{k_{1}}\) is called the eligibility trace and is used for variance reduction. Eligibility traces accelerate the learning process by integrating temporal differences from multiple time steps. The convergence analysis of the off-policy TD(\(\lambda \)) method is provided in Yu (2012). However, the analysis assumes that the iterates \({\mathbf {x}}_{k} \in \bar{B}_{r}(0), \forall k \ge 0\), with \(r > 0\) being sufficiently large. The convergence of the un-constrained case for \(\lambda \) close to 1 is proved in Yu (2015).
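
For concreteness, one step of the recursion in Eq. (10) may be sketched as below; the eligibility trace is refreshed with \(\phi (s_k)\) before the parameter update, which is the usual way these recursions are implemented (an interpretive assumption on the indexing, not a statement of the authors' code).

```python
def off_policy_td_lambda_step(x, e, phi_s, phi_s_next, r, rho, alpha, gamma, lam):
    # x, e, phi_s, phi_s_next are k1-dimensional NumPy arrays
    e = gamma * lam * rho * e + phi_s                              # cf. Eq. (10b)
    delta = rho * r + gamma * rho * (x @ phi_s_next) - x @ phi_s   # temporal difference error
    x = x + alpha * delta * e                                      # cf. Eq. (10a)
    return x, e
```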

  • Off-policy LSTD(\(\lambda \))

Off-policy least squares temporal difference (LSTD) with eligibility traces (Yu 2012) is another relevant algorithm in this category. The procedure is described below:

$$\begin{aligned} {\mathbf {e}}_{k+1}:= & {} \gamma \lambda \rho _{k} {\mathbf {e}}_k + \phi (s_k), \end{aligned}$$
(11a)
$$\begin{aligned} {\mathbf {A}}_{k+1}:= & {} {\mathbf {A}}_{k} + \frac{1}{k+1}\left( {\mathbf {e}}_{k}(\phi (s_{k})-\gamma \rho _{k}\phi (s_{k+1}))^{\top } - {\mathbf {A}}_k\right) ,\end{aligned}$$
(11b)
$$\begin{aligned} {\mathbf {b}}_{k+1}:= & {} {\mathbf {b}}_{k} + \frac{1}{k+1}(\rho _{k}r_{k}{\mathbf {e}}_k - {\mathbf {b}}_k), \end{aligned}$$
(11c)
$$\begin{aligned} {\mathbf {x}}_{k+1}:= & {} {\mathbf {A}}^{-1}_{k+1}{\mathbf {b}}_{k+1}, \end{aligned}$$
(11d)

where \({\mathbf {A}}_{k} \in \mathrm{I\!R}^{k_{1} \times k_{1}}\) and \(\mathbf {e}_{k}\), \({\mathbf {b}}_{k}\), \({\mathbf {x}}_{k} \in \mathrm{I\!R}^{k_{1}}\). In some cases, the matrix \({\mathbf {A}}_{k}\) may not be of full rank. To avoid such singularities, initialize \({\mathbf {A}}_0\) with \(\delta I\), \(\delta > 0\).
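
A compact sketch of the full recursion in Eq. (11) over a stored trajectory; phis holds \(\phi (s_0), \dots , \phi (s_T)\), rewards and rhos hold \(r_k\) and \(\rho _k\) (hypothetical array names), and \({\mathbf {A}}_0 = \delta I\) as suggested above. As with the TD(\(\lambda \)) sketch, the trace is refreshed before it is used, the usual reading of the recursion.

```python
import numpy as np

def off_policy_lstd(phis, rewards, rhos, gamma, lam, delta=1e-3):
    T, k1 = len(rewards), phis.shape[1]
    A = delta * np.eye(k1)        # delta * I initialisation avoids singular A_k
    b = np.zeros(k1)
    e = np.zeros(k1)
    for k in range(T):
        e = gamma * lam * rhos[k] * e + phis[k]                                    # cf. Eq. (11a)
        A += (np.outer(e, phis[k] - gamma * rhos[k] * phis[k + 1]) - A) / (k + 1)  # cf. Eq. (11b)
        b += (rhos[k] * rewards[k] * e - b) / (k + 1)                              # cf. Eq. (11c)
    return np.linalg.solve(A, b)                                                   # cf. Eq. (11d)
```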

Contrary to the earlier algorithm, off-policy LSTD(\(\lambda \)) is shown to be stable with well defined limiting behaviour for all \(\lambda \in [0,1]\) under pragmatic assumptions. The only restriction imposed is that the target policy \(\pi _{w}\) is absolutely continuous (\(\prec \)) w.r.t. the behaviour policy \(\pi _{{w}_{b}}\), i.e.,

$$\begin{aligned} \pi _{w} \prec \pi _{w_b} \,\,\Leftrightarrow \,\, \pi _{w_b}(a \vert s) = 0 \Rightarrow \pi _{w}(a \vert s) = 0, \forall a \in {\mathbb {A}}, \forall s \in {\mathbb {S}}. \end{aligned}$$
(12)

The contrapositive form of the above statement implies that \(\pi _{w}(a \vert s) \ne 0 \Rightarrow \pi _{w_b}(a \vert s) \ne 0\), \(\forall a \in {\mathbb {A}}, \forall s \in {\mathbb {S}}\). This means that for a given state \(s \in {\mathbb {S}}\), every action feasible under the target policy \(\pi _{w}\) is also feasible under the behaviour policy \(\pi _{w_b}\). The following result from Yu (2012) characterizes the limiting behaviour of the off-policy LSTD(\(\lambda \)) algorithm:

Theorem 1

For a given target policy vector \(w \in {\mathbb {W}}\) and a behaviour policy vector \(w_b \in {\mathbb {W}}\), the sequence \(\{{\mathbf {x}}_{k}\}\) generated by the off-policy LSTD(\(\lambda \)) algorithm defined in Eq. (11) converges to the limit \(x_{w \vert w_b}\) with probability one, where

$$\begin{aligned} \begin{aligned} x_{w | w_b} = A^{-1}_{w \vert w_b}&b_{w \vert w_b}, \text {with} \\ A_{w \vert w_b}&= \varPhi ^{\top }D^{\nu _{w_b}}(I-\gamma \lambda P_{w})^{-1}(I-\gamma P_{w})\varPhi \text { and }\\ b_{w \vert w_b}&= \varPhi ^{\top }D^{\nu _{w_b}}(I-\gamma \lambda P_{w})^{-1}R^{w}. \end{aligned} \end{aligned}$$
(13)

Here \(D^{\nu _{w_b}}\) is the diagonal matrix with \(D^{\nu _{w_b}}_{ii}=\nu _{w_b}(i)\), \(1 \le i \le \vert {\mathbb {S}} \vert \), where \(\nu _{w_b}\) is the stationary distribution of the Markov chain \(P_{w_b}\) induced by the behaviour policy \(\pi _{w_b}\), i.e., \(\nu _{w_b}\) satisfies \(\nu _{w_b}^{\top } P_{w_b} = \nu _{w_b}^{\top }\). Further, \(R^{w} \in \mathrm{I\!R}^{\vert {\mathbb {S}} \vert }\) is the expected reward vector, i.e., \(R^{w}(s) \triangleq \Sigma _{s^{\prime } \in {\mathbb {S}}, a \in {\mathbb {A}}}\pi _{w}(a \vert s)P(s, a, s^{\prime })R(s, a, s^{\prime })\).

It is also important to note that in the on-policy LSTD(\(\lambda \)), where both \(\pi _w\) and \(\pi _{w_b}\) are the same, the limit point \(x_{w | w}\) is given by \(x_{w \vert w} = A_{w \vert w}^{-1}b_{w \vert w}\), where

$$\begin{aligned} \begin{aligned} A_{w \vert w}&= \varPhi ^{\top }D^{\nu _{w}}(I-\gamma \lambda P_{w})^{-1}(I-\gamma P_{w})\varPhi \,\, \text { and }\\ b_{w \vert w}&= \varPhi ^{\top }D^{\nu _{w}}(I-\gamma \lambda P_{w})^{-1}R^{w}. \end{aligned} \end{aligned}$$
(14)
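
When the model quantities are available offline (purely for verification, since the algorithms above never need \(P_{w}\) or \(\nu _{w_b}\) explicitly), the limits in Eqs. (13) and (14) can be computed directly; the sketch below uses illustrative helper names.

```python
import numpy as np

def lstd_limit(Phi, P_w, R_w, nu_b, gamma, lam):
    # x = A^{-1} b with A and b as in Eq. (13); passing nu_w instead of nu_b gives Eq. (14)
    n = P_w.shape[0]
    D = np.diag(nu_b)
    M = np.linalg.inv(np.eye(n) - gamma * lam * P_w)
    A = Phi.T @ D @ M @ (np.eye(n) - gamma * P_w) @ Phi
    b = Phi.T @ D @ M @ R_w
    return np.linalg.solve(A, b)
```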

2.4 The control problem of interest

In this section, we define a variant of the control problem which is the topic of interest in this paper.

Problem Statement

$$\begin{aligned} \text { Find } w^{*} \in \mathop {\mathrm{arg\,max}}\limits _{w \in {\mathbb {W}} \subset \mathrm{I\!R}^{k_{2}}} {\mathbb {E}}_{\nu _{w}}\left[ L(h_{w \vert w})\right] , \end{aligned}$$
(15)

where \(L:\mathrm{I\!R}^{\vert {\mathbb {S}} \vert } \rightarrow \mathrm{I\!R}^{\vert {\mathbb {S}} \vert }\) is a performance function. We assume that L is bounded and continuous. Note that since \(h_{w \vert w} \in \mathrm{I\!R}^{\vert {\mathbb {S}} \vert }\), we have \(L(h_{w \vert w}) \in \mathrm{I\!R}^{\vert {\mathbb {S}} \vert }\), i.e., \(L(h_{w \vert w})\) can be viewed as a mapping from the state space \({\mathbb {S}}\) to the scalars. In the case of a finite MDP (both \({\mathbb {S}}\) and \({\mathbb {A}}\) are finite), we have \({\mathbb {E}}_{\nu _{w}}\left[ L(h_{w \vert w})\right] = \sum _{s \in {\mathbb {S}}}\nu _{w}(s)L(h_{w \vert w})(s)\). Thus the objective function in Eq. (15) is scalar-valued and hence the optimization problem defined in Eq. (15) is indeed well-defined.
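
For a finite MDP the objective in Eq. (15) reduces to a weighted sum over states; a minimal sketch, assuming \(h_{w \vert w}\) is replaced by an LSTD approximation \(\varPhi x\) and L acts componentwise (the choice of L below is an arbitrary bounded continuous example, not the one used in the paper):

```python
import numpy as np

def objective(Phi, x_w, nu_w, L):
    # E_{nu_w}[ L(h_{w|w}) ] ~= sum_s nu_w(s) * L(Phi x)(s)
    h = Phi @ x_w
    return float(np.dot(nu_w, L(h)))

L = lambda v: np.tanh(v)   # illustrative bounded, continuous performance function
```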

\(\circledast \) Assumption (A2) The Markov chain under any SRP \(\pi _{w}, w \in \mathrm{I\!R}^{k_2}\) is ergodic, i.e., irreducible and aperiodic.

\(\circledast \) Assumption (A3) \({\mathbb {W}}\) is a compact subset of \(\mathrm{I\!R}^{k_{2}}\).

Fig. 1 Self-drive system

2.5 Motivation

We demonstrate here a practical situation where the optimization problem of the kind (15) arises. We consider here a special case of the self-drive system (Fig. 1). The goal is to propel an automotive (equipped with sensors to detect the vehicular traffic) from source 0 to destination F (where there are multiple intersections in between) in minimum time without any accidents. Here, the collection of junctions represents the state space, i.e., \({\mathbb {S}} = \{0,1,2,3,F\}\). The automotive travels with a constant velocity between subsequent intersections and the choice of the velocities is restricted to the discrete, finite set \(\{1,2,3\}\). The velocity is chosen randomly by the automotive from the above set at each intersection. The purpose of the randomness is to capture the uncertainty in the traffic conditions during the subsequent stretch of the trip. At each intersection, the automotive senses the vehicular traffic at the intersection and has to make a choice of whether to halt or not. So the action space is \({\mathbb {A}} = \{0 \text { (halt)}, 1 \text { (proceed)}\}\). Here, the performance of the task is evaluated based on the overall time the automotive takes to cover the distance to the destination. Hence the reward function is taken as the velocity chosen by the automotive to traverse the subsequent stretch. This indeed makes sense since the time is directly dependent on the velocity with distance being constant. This optimization problem can be modeled using a finite horizon cost function. Now, suppose that the task is further rewarded based on the overall time it takes to complete the trip. In this case, the final payoff is dependent on the value function (in this case, the value function is time), then the role of the performance function L is to capture this particular aspect. If the payoffs are further based on the maintenance cost incurred (which cannot be integrated into the reward function due to the presence of multiple operating components and hence considering the net maintenance cost at the end of the episode is more worthwhile), the performance function might not be unimodal in general. This is further confirmed by Fig. 2, where we provide the plot of the objective function of the self-drive MDP which exhibits a complex landscape with many local optima. This particular problem is more relevant in the context of neural computation, where distinct neural substrates in regions of prefrontal and anterior striatum have been identified with human habitual learning (model free reinforcement learning) (ODoherty et al. 2015; Balleine and Dickinson 1998; Lee and Shimojo 2014). The human brain is a complex network of computing components and one is inclined to believe that the value function obtained through the habitual learning will be further evaluated using a performance function (similar to the activation function found in the artificial neural networks) before relaying to the subsequent level in the network.

Fig. 2 \({\mathbb {S}} = \{0,1,2,F\}\), \({\mathbb {A}} = \{0,1\}\), \(k_1=1\), \(k_2=1\), \(\gamma = 0.99, \tau = 10\), \(\lambda = 0.00125\), \(\psi (s, a)=s*a, \varPhi =(1, 0, 1, 0)^{\top }\), \(L(h_{w \vert w})(s) = \sin ^{2}{(\frac{\pi }{2} s)}\), \(P(0,0,0)=P(1,0,1)=P(2,0,2)=1.0\), \(P(0,1,1)=P(1,1,2)=P(2,1,F)=1.0\). The remaining transition probabilities are zero

The control problem in Eq. (15) is harder due to the application of the performance function L on the approximate value function. Hence we cannot apply the existing direct model free methods like LSPI or off-policy Q-learning (Maei et al. 2010). Note that the LSPI algorithm [Fig. 8 of Lagoudakis and Parr (2003)] is a policy iteration method, where at each iteration an improved policy parameter is deduced from the projected Q-value of the previous policy parameter. So one cannot directly incorporate the operator \({\mathbb {E}}_{\nu _w}\) into the LSPI iteration. Similar compatibility issues are found with the off-policy Q-learning also (Maei et al. 2010). However, policy search methods are a direct match for this problem. Not all policy search methods can provide quality solutions. The pertinent issue is the non-convexity of \({\mathbb {E}}_{\nu _w}\left[ L(h_{w \vert w})\right] \) which presents a landscape with many local optima. Any gradient based method like the state-of-the-art simultaneous perturbation stochastic approximation (SPSA) (Spall 1992) algorithm or the policy gradient methods can only provide sub-optimal solutions. In this paper, we try to solve the control problem in its true sense, i.e., find a solution close to the global optimum of the optimization problem (15). We employ a stochastic approximation variant of the well known cross entropy (CE) method proposed in Joseph and Bhatnagar (2016b, c, 2016a) to achieve the true sense behaviour. The CE method has in fact been applied to the model free control setting before in Mannor et al. (2003), where the algorithm is termed the fast policy search. However, the approach in Mannor et al. (2003) has left several practical and computational challenges uncovered. The method in Mannor et al. (2003) assumes access to a generative model, i.e., the real MDP system itself or a simulator/computational model of the MDP under consideration, which can be configured with moderate ease (with time constraints) and the observations recorded. The existence of generative models for extremely complex MDPs is highly unlikely, since it demands accurate knowledge about the transition dynamics of the MDP. Now regarding the computational aspect, the algorithm in Mannor et al. (2003) maintains an evolving \(\vert {\mathbb {S}} \vert \times \vert {\mathbb {A}} \vert \) matrix \(P^{(t)} \triangleq (P^{(t)}_{sa})_{s \in {\mathbb {S}}, a \in {\mathbb {A}}}\), where \(P^{(t)}_{sa}\) is the probability of taking action a in state s at time t. At each discrete time instant t, the algorithm generates multiple sample trajectories using \(P^{(t)}\), each of finite length, but sufficiently long. For each trajectory, the discounted cost is calculated and then averaged over those multiple trajectories to deduce the subsequent iterate \(P^{(t+1)}\). This however is an expensive operation, both computation and storage wise. Another pertinent issue is the number of sample trajectories required at each time instant t. There is no analysis pertaining to finding a bound on the trajectory count. This implies that a brute-force approach has to be adopted which further burdens the algorithm. A more recent global optimization algorithm called the model reference adaptive search (MRAS) has also been applied in the model free control setting (Chang et al. 2013). However, it also suffers from similar issues as the earlier approach.

Here, we illustrate using a real life scenario, the hardness incurred in assuming a generative model. We consider a legacy water delivery system (Feinberg and Shwartz 2012; Fracasso et al. 2014; Ikonen and Bene 2011; Ertin et al. 2001). The legacy water delivery systems in most cases are not electronically controlled, which implies that a manual intervention is required to adjust the various throughput levels. The reservoir operators have to rely on agreed upon rules, their judgement and experience to calibrate the network. Figure 3 shows a water delivery network where there is a web of manual controllers. The state space is the net output (quantity of water delivered) of the delivery system. Intuitively, one might expect the dynamics of the system to be Markovian in character since the immediate future output is indeed dependent on the current quantity of the reservoir and its current consumption rate. So the state variable takes real values and the underlying MDP is continuous. The reward function is a complex function with positive weights on profits from effective utilization (agriculture, drinking purpose, power generation, etc) and negative weights on spill overs, kinetic energy losses and factors engendering physical damage to the network like excessive pipe pressure. The objective is to find a configuration for the network of controllers (which is indeed a vector with each co-ordinate deciding the amount of calibration required for the corresponding controller) which provides optimum expected discounted reward. Here the configurations represent the action space and thus are also continuous. The reconfiguration of the whole system as and when demanded by the algorithm requires heavy human labor, which is a luxury one cannot afford. On the other hand, developing a simulator for this system requires understanding all the sources of water for the reservoir which depends on a wide variety of environmental factors and also the consumption statistics of the end users, both of which require observations for a long period of time notwithstanding the human labor incurred. Therefore, it is hard in general to develop a simulator/generative model for MDPs with large state and action spaces with complex, opaque and perplexing transition dynamics. Examples where similar issues arise can be found in manual human control, social sciences, biological systems, unmanned aerial vehicles (Bagnell and Schneider 2001) and mechanical systems which wear out quickly like low-cost robots (Deisenroth and Rasmussen 2011).

Fig. 3 Water delivery network: the system consists of a water reservoir and a web of manual controllers. The quantity of water in the reservoir is stochastic in nature and so is the consumption of the water by the end users. The end usage of the system includes agriculture, household activities, power generation etc. The reward function is a complex function with positive weights on profits from effective utilization and negative weights on spill overs

A few relevant works in the literature that do not assume the availability of a generative model include Bellman-residual minimization based fitted policy iteration using a single trajectory (Antos et al. 2008) and value-iteration based fitted policy iteration using a single trajectory (Antos et al. 2007). However, those approaches fall prey to the curse of dimensionality arising from large action spaces. Also, they are abstract in the sense that a generic function space is considered and the value function approximation step is expressed as a formal optimization problem. In the above methods, which are quite similar in their approach, considerable effort is dedicated to addressing the approximation power of the function space and the sample complexity.

In this paper, we address the above mentioned practical and computational concerns. We focus on two key objectives:

  1. To reduce the total number of policy evaluations.

  2. To find a high performing policy without presuming unlimited access to the generative model.

By accomplishing the above mentioned objectives, we try to chisel down the requirements inherent in most of the reinforcement learning algorithms and thus enable them to operate in real-time scenarios. We provide here a brief narrative of the approach we follow to realize the above objectives.

To accomplish the former objective, the natural choice is to employ the stochastic approximation (SA) version of the CE method instead of the naive CE method used in Mannor et al. (2003). The SA version of CE is a zero-order optimization method which is incremental, adaptive, robust and stable, with the additional attractive attribute of convergence to the global optimum of the objective function. It has been demonstrated empirically in Joseph and Bhatnagar (2016a, b) that the method exhibits efficient utilization of the samples and possesses a better rate of convergence than the naive CE method. The effective sample utilization implies that the method requires a minimal number of objective function evaluations. These attributes are appealing in the context of the control problem we consider here, especially in effectively addressing the former objective. The adaptive nature of the algorithm also eliminates the need for any brute-force approach, which has a detrimental impact on the performance of the naive CE method.

The latter objective is achieved by employing off-policy LSTD(\(\lambda \)), defined in Sect. 2.3, for policy evaluation. The advantage of this method lies in its ability to approximate the value function of an arbitrary policy (called the target policy) using the observations of the MDP under a possibly different policy (called the behaviour policy), with the only restriction being the absolute continuity between the target and behaviour policies. This implies that we optimize the approximate objective function given by \({\mathbb {E}}_{\nu _{w_b}}\left[ L(\varPhi x_{w \vert w_b})\right] \) (where \(x_{w \vert w_b}\) is the solution generated by the off-policy LSTD(\(\lambda \))) instead of the true objective function \({\mathbb {E}}_{\nu _w}\left[ L(h_{w \vert w})\right] \). Here, \(\nu _{w_b}\) is the steady state distribution of the Markov chain induced by the behaviour policy \(\pi _{w_b}\). This is the best approximation possible in the absence of the generative model, since \(\nu _{w}\) is the long-run steady state marginal distribution of the Markov chain induced by the policy \(\pi _{w}\) and one cannot correct the long-run discrepancies arising due to the restriction that the available sample trajectory is generated using the behaviour policy. However, the appealing single sample trajectory approach comes with an Achilles heel: the quality of the solution produced by the algorithm depends on the choice of the sample trajectory, which is directly dependent on the behaviour policy that generates it. The additional approximation error incurred due to this particular information restrictive setting is indeed unavoidable. In order to choose the behaviour policy wisely, it is imperative to provide a quantitative analysis of the cost incurred in the choice of the behaviour policy. In this paper, we provide a bound on the approximation error of the off-policy LSTD(\(\lambda \)) solution of an arbitrary target policy with respect to the deviation of the target policy from the behaviour policy. The practical aspect of the approach can be further improved by reusing the same sample trajectory for all value function evaluations. This implies that our algorithm requires just a single sample trajectory to solve the optimization problem defined in Eq. (15). Since access to the generative model is forbidden, reusing the trajectory requires provisions in terms of memory to store the transition stream.

Fig. 4 a Information pyramid. b Optimization box

Goal of the Paper To solve the control problem defined in Eq. (15) without having access to any generative model. Formally stated, given an infinitely long sample trajectory \(\{{\mathbf {s}}_0, {\mathbf {a}}_0, {\mathbf {r}}_0, {\mathbf {s}}_1, {\mathbf {a}}_1, {\mathbf {r}}_1, {\mathbf {s}}_2, \dots \}\) generated using the behaviour policy \(\pi _{w_b}\) (\(w_b \in \mathrm{I\!R}^{k_{2}}\)), solve the control problem in (15).

\(\circledast \) Assumption (A4) The behaviour policy \(\pi _{w_b}\), where \(w_b \in {\mathbb {W}}\), satisfies the following condition: \(\pi _{w_b}(a \vert s) > 0\), \(\forall s \in {\mathbb {S}}, \forall a \in {\mathbb {A}}\).

A few remarks are in order: we can classify reinforcement learning algorithms based on the information made available to the algorithm in order to seek the optimal policy. We graphically illustrate this classification as a pyramid in Fig. 4. The bottom of the pyramid contains the classical methods, where the entire model information, i.e., both P and R, is available. In the middle, we have the model free algorithms, where both P and R are assumed hidden, but access to a generative model/simulator is presumed. At the top of the pyramid, we have the single trajectory approaches, where a single sample trajectory generated using a behaviour policy is made available, but the algorithms have no access to the model information or a simulator. Observe that as one goes up the pyramid, the amount of information vested upon the algorithm reduces considerably. The algorithm we propose in this paper belongs to the top of the information pyramid and to the upper half of the optimization box, which makes it a unique combination.

3 Proposed algorithm

In this section, we propose an algorithm to solve the control problem defined in Eq. (15). We employ a stochastic approximation variant of the Gaussian based cross entropy method to find the optimal policy. We delay the discussion of the algorithm until the next subsection and focus first on the objective function estimation. The objective function values \({\mathbb {E}}_{\nu _{w}}\left[ L(h_{w \vert w})\right] \), which are required to efficiently guide the search for \(w^{*}\), are estimated using the off-policy LSTD(\(\lambda \)) method. In LFA, given \(w \in {\mathbb {W}}\), the best approximation of \(V^{w}\) one can hope for is the projection \(\varPi ^{w}V^{w}\). Theorem 1 of Tsitsiklis and Roy (1997) shows that the on-policy LSTD(\(\lambda \)) solution \(\varPhi x_{w \vert w}\) is indeed an approximation of the projection \(\varPi ^{w}V^{w}\). Using the Babylonian–Pythagorean theorem and Theorem 1 of Tsitsiklis and Roy (1997) along with a little arithmetic, we obtain \(\Vert \varPhi x_{w \vert w} - \varPi ^{w}V^{w} \Vert _{\nu _{w}} \le \frac{\sqrt{(1-\lambda )\gamma (\gamma +\gamma \lambda +2)}}{1-\gamma }\Vert \varPi ^{w}V^{w} - V^{w}\Vert _{\nu _{w}}\). Hence for \(\lambda = 1\), we have \(\varPhi x_{w \vert w} = \varPi ^{w}V^{w}\), i.e., the on-policy LSTD(1) provides the exact projection. However, for \(\lambda < 1\), only approximations to it are obtained. Now when off-policy LSTD(\(\lambda \)) is applied, it adds one more level of approximation, i.e., \(\varPhi x_{w \vert w}\) is approximated by \(\varPhi x_{w \vert w_b}\). Hence, to evaluate the performance of the off-policy approximation, we must quantify the errors incurred in the approximation procedure, and we believe a comprehensive analysis has been long overdue.

3.1 Choice of the behaviour policy

The behaviour policy is often an exploration policy which promotes the exploration of the state and action spaces of the MDP. Efficient exploration is a precondition for effective learning. In this paper, we operate in a minimalistic MDP setting, where the only information available for inference is the single stream of transitions and payoffs generated using the behaviour policy. So the choice of the behaviour policy is vital for sound inductive reasoning. The following theorem provides a bound on the approximation error incurred in the off-policy LSTD(\(\lambda \)) method. The provided bound can be beneficial in choosing a good behaviour policy and also aids in understanding the stability and usefulness of the proposed algorithm.

Theorem 2

For a given \(w \in {\mathbb {W}}\), the target policy vector, and \(w_b \in {\mathbb {W}}\), the behaviour policy vector, let \(x_{w \vert w}\) and \(x_{w \vert w_b}\) be the solutions of the on-policy and off-policy versions of LSTD(\(\lambda \)), respectively, with \(\lambda \in [0,1]\).

$$\begin{aligned} \textit{If }\sup _{s \in {\mathbb {S}}, a \in {\mathbb {A}}}&\Big \vert \frac{\pi _{w}(a \vert s)}{\pi _{w_b}(a \vert s)}-1\Big \vert < \epsilon _2, \text { then } \frac{\big \Vert x_{w \vert w} - x_{w \vert w_b} \big \Vert _{\infty }}{\Vert x_{w \vert w}\Vert _{\infty }} \le \nonumber \\&O\big ((\vert {\mathbb {S}} \vert ^{2}\epsilon ^{2}_2 + \vert {\mathbb {S}} \vert \epsilon _2)\frac{(1+\gamma )(1+\gamma \lambda )}{(1-\gamma )(1-\gamma \lambda )}\Vert D^{\nu _{w_b}}\Vert _{\infty }\Vert (D^{\nu _{w_b}})^{-1}\Vert _{\infty }\big ). \end{aligned}$$
(16)
$$\begin{aligned} \textit{Also, }&\Vert \varPhi x_{w \vert w_b} - V^{w} \Vert _{\nu _{w_b}} \le \frac{\gamma -2\gamma \lambda +1}{1-\gamma }\Vert V^{w} - V^{w_b} \Vert _{\nu _{w_b}} \nonumber \\&+\frac{\epsilon _2(1-\gamma \lambda )\Vert R \Vert _{\infty }}{(1-\gamma )^{2}} + \frac{1-\gamma \lambda }{1-\gamma }\Vert \varPi ^{w_b}V^{w} - V^{w} \Vert _{\nu _{w_b}}, \end{aligned}$$
(17)

where \(V^{w}\) and \(V^{w_b}\) are the true value functions corresponding to the SRPs \(\pi _w\) and \(\pi _{w_b}\) respectively. Also, \(\nu _{w_b}\) is the stationary distribution of the Markov chain defined by \(P_{w_b}\) and \(D^{\nu _{w_b}}\) is the diagonal matrix defined in Theorem 1.

Proof

Given \(w \in {\mathbb {W}}\), we have

$$\begin{aligned} P_{w}(s, s^{\prime }) = \sum _{a \in {\mathbb {A}}}\pi _{w}(a \vert s)P(s, a, s^{\prime }), s, s^{\prime } \in {\mathbb {S}},\\ P_{w_b}(s, s^{\prime }) = \sum _{a \in {\mathbb {A}}}\pi _{w_b}(a \vert s)P(s, a, s^{\prime }), s, s^{\prime } \in {\mathbb {S}}. \end{aligned}$$

Therefore,

$$\begin{aligned} P_{w} = P_{w_b} + F, \,\, \text {where} \, F = P_{w} - P_{w_b}. \end{aligned}$$

Hence, for \(s, s^{\prime } \in {\mathbb {S}}\),

$$\begin{aligned} \vert F(s, s^{\prime }) \vert&= \Big \vert \sum _{a \in {\mathbb {A}}}\left( \pi _{w}(a \vert s) - \pi _{w_b}(a \vert s)\right) P(s, a, s^{\prime }) \Big \vert ,\nonumber \\&= \Big \vert \sum _{a \in {\mathbb {A}}}\left( \frac{\pi _{w}(a \vert s)}{\pi _{w_b}(a \vert s)} - 1\right) \pi _{w_b}(a \vert s)P(s, a, s^{\prime }) \Big \vert ,\nonumber \\&\le \sum _{a \in {\mathbb {A}}}\epsilon _2\pi _{w_b}(a \vert s)P(s, a, s^{\prime }),\nonumber \\&= \epsilon _2 P_{w_b}(s, s^{\prime }). \end{aligned}$$
(18)

The above bound of the deviation matrix F in terms of \(P_{w_b}\) compels us to apply the result from Xue (1997), which provides a sensitivity analysis of the stationary distribution of a Markov chain w.r.t. its probability transition matrix. In particular, by appealing to Theorem 1 of Xue (1997) along with Eq. (18), we obtain the following:

$$\begin{aligned} \Big \vert \frac{\nu _w(s)-\nu _{w_b}(s)}{\nu _{w_b}(s)}\Big \vert \le 2(\vert {\mathbb {S}} \vert - 1) \epsilon _2 + O(\epsilon _2^{2}), s \in {\mathbb {S}}. \nonumber \\ \Longrightarrow \Big \vert \frac{\nu _w(s)-\nu _{w_b}(s)}{\nu _{w_b}(s)}\Big \vert \le O(\vert {\mathbb {S}} \vert \epsilon _2), s \in {\mathbb {S}}. \end{aligned}$$
(19)

Let \(\epsilon _3 = O(\vert {\mathbb {S}} \vert \epsilon _2)\). Then from (19), we get

$$\begin{aligned} \vert \nu _w(s)-\nu _{w_b}(s) \vert \le \epsilon _3\vert \nu _{w_b}(s) \vert \le \epsilon _3(\vert \nu _w(s)-\nu _{w_b}(s) \vert + \vert \nu _{w}(s) \vert )\nonumber \\ \Longrightarrow \frac{\vert \nu _w(s)-\nu _{w_b}(s)\vert }{\vert \nu _{w}(s) \vert } \le \frac{\epsilon _3}{1-\epsilon _3} = O(\epsilon _3 + \epsilon ^{2}_3) = O(\vert {\mathbb {S}} \vert \epsilon _2 + \vert {\mathbb {S}} \vert ^{2}\epsilon _2^{2}). \end{aligned}$$
(20)

For the policy \(\pi _w\), recall that the on-policy approximation is \(\varPhi x_{w \vert w}\), where \(x_{w \vert w}\) is the unique solution to the linear system \(A_{w \vert w}x = b_{w \vert w}\). Analogously, the off-policy approximation is given by \(\varPhi x_{w \vert w_b}\), where \(x_{w \vert w_b}\) is the unique solution to the linear system \(A_{w \vert w_b}x = b_{w \vert w_b}\). Now using the bound in (20) and the definitions of \(A_{w \vert w}\), \(A_{w \vert w_b}\), \(b_{w \vert w}\) and \(b_{w \vert w_b}\) in (14) and (13), it is easy to verify that

$$\begin{aligned} \vert A_{w \vert w_b} - A_{w \vert w}\vert \le O(\vert {\mathbb {S}} \vert ^{2}\epsilon _2^{2} + \vert {\mathbb {S}} \vert \epsilon _2)\vert A_{w \vert w} \vert \text { and }\\ \vert b_{w \vert w_b} - b_{w \vert w}\vert \le O(\vert {\mathbb {S}} \vert ^{2}\epsilon _2^{2} + \vert {\mathbb {S}} \vert \epsilon _2)\vert b_{w \vert w} \vert . \end{aligned}$$

Hence the off-policy linear system \(A_{w \vert w_b}x = b_{w \vert w_b}\) can be viewed as a perturbed version of the on-policy system \(A_{w \vert w}x = b_{w \vert w}\). Let \(\epsilon _4 = O(\vert {\mathbb {S}} \vert ^{2}\epsilon _2^{2} + \vert {\mathbb {S}} \vert \epsilon _2)\). Now we make use of the norm bound on the solutions of perturbed linear system of equations provided in Theorem 2.2 of Higham (1994). In particular, using the remark following Theorem 2.2 of Higham (1994), we have

$$\begin{aligned} \frac{\big \Vert x_{w \vert w} - x_{w \vert w_b} \big \Vert _{\infty }}{\Vert x_{w \vert w} \Vert _{\infty }} \le \frac{2\epsilon _4\kappa (A_{w \vert w})}{1-\epsilon _4\kappa (A_{w \vert w})}, \end{aligned}$$
(21)

where \(\kappa (A_{w \vert w}) = \Vert A_{w \vert w} \Vert _{\infty } \Vert A^{-1}_{w \vert w}\Vert _{\infty }\) (the condition number \(\kappa (\cdot )\) is defined in Sect. 1). Using the definition of \(A_{w \vert w}\) in (14), we obtain \(A^{-1}_{w \vert w} = \varPhi ^{-1}(I-\gamma P_{w})^{-1}(I-\gamma \lambda P_{w})(D^{\nu _{w}})^{-1}\varPhi ^{-\top }\), where \(\varPhi ^{-1}\) is the left inverse of \(\varPhi \) and \(\varPhi ^{-\top }\) is the right inverse of \(\varPhi ^{\top }\). Therefore \(\Vert A^{-1}_{w \vert w} \Vert _{\infty } \le \Vert \varPhi ^{-1}\Vert _{\infty }\Vert (I-\gamma P_{w})^{-1}\Vert _{\infty }\Vert I-\gamma \lambda P_{w} \Vert _{\infty } \Vert (D^{\nu _{w}})^{-1}\Vert _{\infty }\Vert \varPhi ^{-\top }\Vert _{\infty }\). Now by arguing along the same lines as (31), one can show that \(\Vert (I-\gamma P_{w})^{-1}\Vert _{\infty } \le \frac{1}{1-\gamma }\). Also \(\Vert I-\gamma \lambda P_{w} \Vert _{\infty } = 1+\gamma \lambda \). Further, the feature matrix \(\varPhi \) is presumed to be constant. A fortiori, \(\Vert A^{-1}_{w \vert w} \Vert _{\infty } = O(\frac{1+\gamma \lambda }{1-\gamma }\Vert (D^{\nu _{w}})^{-1}\Vert _{\infty })\). Also from (19), we have \(\nu _{w}(s) \ge (1-\epsilon _3)\nu _{w_b}(s)\), \(s \in {\mathbb {S}}\). Hence, \(\Vert A^{-1}_{w \vert w} \Vert _{\infty } = O(\frac{1+\gamma \lambda }{(1-\gamma )(1-\epsilon _3)}\Vert (D^{\nu _{w_b}})^{-1}\Vert _{\infty })\). Similarly, one can show that

$$\begin{aligned} \Vert A_{w \vert w} \Vert _{\infty } = O(\frac{(1+\gamma )(1+\epsilon _3)}{(1-\gamma \lambda )}\Vert D^{\nu _{w_b}}\Vert _{\infty }). \end{aligned}$$

Hence

$$\begin{aligned} \kappa (A_{w \vert w})= & {} O\big (\frac{(1+\epsilon _3)(1+\gamma )(1+\gamma \lambda )}{(1-\gamma )(1-\gamma \lambda )(1-\epsilon _3)}\Vert D^{\nu _{w_b}}\Vert _{\infty }\Vert (D^{\nu _{w_b}})^{-1}\Vert _{\infty }\big ),\\= & {} O\big (\frac{(1+\epsilon _3)^{2}(1+\gamma )(1+\gamma \lambda )}{(1-\gamma )(1-\gamma \lambda )}\Vert D^{\nu _{w_b}}\Vert _{\infty }\Vert (D^{\nu _{w_b}})^{-1}\Vert _{\infty }\big ). \end{aligned}$$

Consequently from (21), we get

$$\begin{aligned} \frac{\big \Vert x_{w \vert w} - x_{w \vert w_b} \big \Vert _{\infty }}{\Vert x_{w \vert w}\Vert _{\infty }}&\le O(\epsilon _4\kappa (A_{w \vert w}) +\epsilon ^{2}_4\kappa ^{2}(A_{w \vert w}))\\ = O&\big ((\vert {\mathbb {S}} \vert ^{2}\epsilon ^{2}_2 + \vert {\mathbb {S}} \vert \epsilon _2)\frac{(1+\gamma )(1+\gamma \lambda )}{(1-\gamma )(1-\gamma \lambda )}\Vert D^{\nu _{w_b}}\Vert _{\infty }\Vert (D^{\nu _{w_b}})^{-1}\Vert _{\infty }\big ). \end{aligned}$$

This completes the proof of (16). \(\square \)

Now, to prove (17), we define an operator \(T_{w \vert w_b}^{(\lambda )}\) (referred to as the TD(\(\lambda \)) operator in Tsitsiklis and Roy (1997)) as follows:

$$\begin{aligned}&T_{w \vert w_b}^{(\lambda )} V = (1-\lambda )\sum _{i=0}^{\infty }\lambda ^{i}\left( \sum _{j=0}^{i}(\gamma P_{w_b})^{j}R^{w} + (\gamma P_{w_b})^{i+1}V\right) \end{aligned}$$
(22)
$$\begin{aligned}&\text { with }P_{w_b}(s, s^{\prime }) \triangleq \sum _{a \in {\mathbb {A}}}\pi _{w_b}(a \vert s)P(s, a, s^{\prime }) \nonumber \\&\text { and }R^{w}(s) \triangleq \sum _{s^{\prime } \in {\mathbb {S}}}\sum _{a \in {\mathbb {A}}}\pi _{w}(a \vert s)P(s, a, s^{\prime })R(s, a, s^{\prime }). \end{aligned}$$
(23)

Before we proceed any further, a few observations are in order:

Observation 1

For \(V \in \mathrm{I\!R}^{\vert {\mathbb {S}} \vert } \text { and } w \in {\mathbb {W}}\) , we have

$$\begin{aligned} \Vert \varPi ^{w} V \Vert _{\nu _{w}} \le \Vert V \Vert _{\nu _{w}}. \end{aligned}$$
(24)

Proof

Using \(<\varPi ^{w} V - V, \varPi ^{w} V>_{\nu _{w}} = 0\) and the Babylonian–Pythagorean theorem, we have \(\Vert V \Vert _{\nu _{w}}^{2} = \Vert \varPi ^{w} V - V \Vert _{\nu _{w}}^{2} + \Vert \varPi ^{w} V \Vert _{\nu _{w}}^{2}\), which implies \(\Vert \varPi ^{w} V \Vert _{\nu _{w}} \le \Vert V \Vert _{\nu _{w}}\). This proves (24). \(\square \)

Observation 2

For \(w \in {\mathbb {W}}, s \in {\mathbb {S}}\),

$$\begin{aligned} \text {if }\sup _{a \in {\mathbb {A}}}&\Big \vert \frac{\pi _{w}(a \vert s)}{\pi _{w_b}(a \vert s)}-1\Big \vert < \epsilon _2 \text { then } \vert R^{w}(s) - R^{w_b}(s) \vert \le \epsilon _2\Vert R \Vert _{\infty }. \end{aligned}$$
(25)

Proof

From (23), we have,

$$\begin{aligned} \vert R^{w}(s) - R^{w_b}(s) \vert&= \big \vert \sum _{s^{\prime } \in {\mathbb {S}}}\sum _{a \in {\mathbb {A}}}\big (\pi _{w}(a \vert s) - \pi _{w_b}(a \vert s)\big )P(s, a, s^{\prime })R(s, a, s^{\prime })\big \vert ,\nonumber \\&\le \sum _{s^{\prime } \in {\mathbb {S}}}\sum _{a \in {\mathbb {A}}}\big \vert \pi _{w}(a \vert s) - \pi _{w_b}(a \vert s)\big \vert P(s, a, s^{\prime })\vert R(s, a, s^{\prime })\vert ,\nonumber \\&\le \sum _{s^{\prime } \in {\mathbb {S}}}\epsilon _2 P_{w_b}(s, s^{\prime })\Vert R \Vert _{\infty },\nonumber \\&\le \epsilon _2 \Vert R \Vert _{\infty }. \end{aligned}$$
(26)

This proves (25).\(\square \)

Observation 3

For \(V_1, V_2 \in \mathrm{I\!R}^{\vert {\mathbb {S}} \vert }\),

$$\begin{aligned} \Vert T_{w \vert w_b}^{(\lambda )} V_1 - T_{w \vert w_b}^{(\lambda )} V_2 \Vert _{\nu _{w_b}} \le \frac{\gamma (1-\lambda )}{1-\gamma \lambda }\Vert V_1 - V_2 \Vert _{\nu _{w_b}}. \end{aligned}$$

Proof

Refer Lemma 4 of Tsitsiklis and Roy (1997).

Observation 4

$$\begin{aligned} \big \vert T_{w \vert w_b}^{(\lambda )} V(s) - T_{w_b \vert w_b}^{(\lambda )} V(s) \big \vert \le \frac{\epsilon _2\Vert R \Vert _{\infty }}{1-\gamma }. \end{aligned}$$
(27)

Proof

From (22) and observation 2, we have

$$\begin{aligned}&\big \vert T_{w \vert w_b}^{(\lambda )} V(s) - T_{w_b \vert w_b}^{(\lambda )} V(s) \big \vert \\&\quad =\Big \vert (1-\lambda )\sum _{i=0}^{\infty }\lambda ^{i}\sum _{j=0}^{i}\gamma ^{j}\sum _{s^{\prime } \in {\mathbb {S}}}P_{w_b}^{j}(s,s^{\prime })\Big (R^{w}(s^{\prime })-R^{w_b}(s^{\prime })\Big ) \Big \vert ,\\&\quad \le (1-\lambda )\sum _{i=0}^{\infty }\lambda ^{i}\sum _{j=0}^{i}\gamma ^{j}\sum _{s^{\prime } \in {\mathbb {S}}}P_{w_b}^{j}(s, s^{\prime })\Vert R \Vert _{\infty }\epsilon _2,\\&\quad = (1-\lambda )\sum _{i=0}^{\infty }\lambda ^{i}\sum _{j=0}^{i}\gamma ^{j}\epsilon _{2} \Vert R \Vert _{\infty },\\&\quad \le \frac{\epsilon _2\Vert R \Vert _{\infty }}{1-\gamma }. \end{aligned}$$

This proves (27). \(\square \)

Observation 5

\(\varPhi x_{w \vert w_b} = \varPi ^{w_b}T^{(\lambda )}_{w \vert w_b}\varPhi x_{w \vert w_b}\). This is the off-policy projected Bellman equation. A detailed discussion is available in Yu (2012). For the on-policy case, a similar equation holds: \(\varPhi x_{w \vert w} = \varPi ^{w}T^{(\lambda )}_{w \vert w}\varPhi x_{w \vert w}\). For the proof of the above equation, refer Theorem 1 of Tsitsiklis and Roy (1997). A few other relevant fixed point equations are \(T^{(\lambda )}_{w \vert w}V^{w} = V^{w}\) and \(T^{(\lambda )}_{w_b \vert w_b}V^{w_b} = V^{w_b}\). The proof of the above equations is provided in Lemma 5 of Tsitsiklis and Roy (1997).

This completes the observations. Now we will prove (17). Using the triangle inequality and the above observations, we have

$$\begin{aligned}&\Vert \varPhi x_{w \vert w_b} - V^{w} \Vert _{\nu _{w_b}} \le \Vert \varPhi x_{w \vert w_b} - \varPi ^{w_b}V^{w_b} \Vert _{\nu _{w_b}} + \Vert \varPi ^{w_b}V^{w_b} - V^{w} \Vert _{\nu _{w_b}},\\&{=_{1}} \Vert \varPi ^{w_b}T^{(\lambda )}_{w \vert w_b}\varPhi x_{w \vert w_b} - \varPi ^{w_b}T^{(\lambda )}_{w_b \vert w_b}V^{w_b} \Vert _{\nu _{w_b}} + \Vert \varPi ^{w_b}V^{w_b} - V^{w} \Vert _{\nu _{w_b}},\\&\le _{2} \Vert T^{(\lambda )}_{w \vert w_b}\varPhi x_{w \vert w_b} - T^{(\lambda )}_{w_b \vert w_b}V^{w_b} \Vert _{\nu _{w_b}} + \Vert \varPi ^{w_b}V^{w_b} - V^{w} \Vert _{\nu _{w_b}}, \\&\le _{3} \Vert T^{(\lambda )}_{w \vert w_b}\varPhi x_{w \vert w_b} - T^{(\lambda )}_{w \vert w_b}V^{w_b} \Vert _{\nu _{w_b}} + \Vert T^{(\lambda )}_{w \vert w_b}V^{w_b} - T^{(\lambda )}_{w_b \vert w_b}V^{w_b} \Vert _{\nu _{w_b}} + \\&\Vert \varPi ^{w_b}V^{w_b} - V^{w} \Vert _{\nu _{w_b}},\\&\le _{4} \frac{\gamma (1-\lambda )}{1-\gamma \lambda }\Vert \varPhi x_{w \vert w_b} - V^{w_b} \Vert _{\nu _{w_b}} + \Vert T^{(\lambda )}_{w \vert w_b}V^{w_b} - T^{(\lambda )}_{w_b \vert w_b}V^{w_b} \Vert _{\nu _{w_b}} \\&+\,\Vert \varPi ^{w_b}V^{w_b} - V^{w} \Vert _{\nu _{w_b}},\\&\le _{5} \frac{\gamma (1-\lambda )}{1-\gamma \lambda }\Vert \varPhi x_{w \vert w_b} - V^{w} \Vert _{\nu _{w_b}} + \frac{\gamma (1-\lambda )}{1-\gamma \lambda }\Vert V^{w} - V^{w_b} \Vert _{\nu _{w_b}} + \\&\frac{\epsilon _2\Vert R \Vert _{\infty }}{1-\gamma } + \Vert \varPi ^{w_b}V^{w_b} - V^{w} \Vert _{\nu _{w_b}}, \end{aligned}$$

Note that \(=_{1}\) follows from Observation 5; \(\le _{2}\) follows from Observation 1; \(\le _{3}\) follows from the triangle inequality; \(\le _{4}\) follows from Observation 3; \(\le _{5}\) follows from Observation 4 and the triangle inequality. This further implies

$$\begin{aligned}&\frac{1-\gamma }{1-\gamma \lambda }\Vert \varPhi x_{w \vert w_b} - V^{w} \Vert _{\nu _{w_b}} \\&\quad \le \frac{\gamma (1-\lambda )}{1-\gamma \lambda }\Vert V^{w} - V^{w_b} \Vert _{\nu _{w_b}} + \frac{\epsilon _2\Vert R \Vert _{\infty }}{1-\gamma } + \Vert \varPi ^{w_b}V^{w_b} - V^{w} \Vert _{\nu _{w_b}},\\&\quad \le \frac{\gamma (1-\lambda )}{1-\gamma \lambda }\Vert V^{w} - V^{w_b} \Vert _{\nu _{w_b}} + \frac{\epsilon _2\Vert R \Vert _{\infty }}{1-\gamma } + \Vert \varPi ^{w_b}V^{w_b} - \varPi ^{w_b}V^{w} \Vert _{\nu _{w_b}} + \\&\Vert \varPi ^{w_b}V^{w} - V^{w} \Vert _{\nu _{w_b}},\\&\quad \le \frac{\gamma (1-\lambda )}{1-\gamma \lambda }\Vert V^{w} - V^{w_b} \Vert _{\nu _{w_b}} + \frac{\epsilon _2\Vert R \Vert _{\infty }}{1-\gamma } + \Vert V^{w_b} - V^{w} \Vert _{\nu _{w_b}} \\&\qquad + \Vert \varPi ^{w_b}V^{w} - V^{w} \Vert _{\nu _{w_b}}. \end{aligned}$$

Therefore

$$\begin{aligned} \Vert \varPhi x_{w \vert w_b} - V^{w} \Vert _{\nu _{w_b}} \le \frac{\gamma -2\gamma \lambda +1}{1-\gamma }\Vert V^{w} - V^{w_b} \Vert _{\nu _{w_b}} + \frac{\epsilon _2(1-\gamma \lambda )\Vert R \Vert _{\infty }}{(1-\gamma )^{2}} + \\ \frac{1-\gamma \lambda }{1-\gamma }\Vert \varPi ^{w_b}V^{w} - V^{w} \Vert _{\nu _{w_b}}. \end{aligned}$$

This completes the proof of (17). \(\square \)

The implications of the bounds given in Theorem 2 are indeed significant. The quantity \(\sup _{s \in {\mathbb {S}}, a \in {\mathbb {A}}}\Big \vert \frac{\pi _{w}(a \vert s)}{\pi _{w_b}(a \vert s)}-1\Big \vert \) appearing in the hypothesis of the theorem can be viewed as a measure of the closeness of the SRPs \(\pi _w\) and \(\pi _{w_b}\), with the minimum value of 0 being achieved in the on-policy case. Under the hypothesis that \(\sup _{s \in {\mathbb {S}}, a \in {\mathbb {A}}}\Big \vert \frac{\pi _{w}(a \vert s)}{\pi _{w_b}(a \vert s)}-1\Big \vert < \epsilon _2\), we obtain in (16) an upper bound on the relative error of the on-policy and off-policy solutions. The bound depends primarily on the hypothesis bound \(\epsilon _2\), the eligibility factor \(\lambda \), the discount factor \(\gamma \) and \(\Vert (D^{\nu _{w_b}})^{-1} \Vert _{\infty }\Vert D^{\nu _{w_b}} \Vert _{\infty }\). Note that \(\Vert D^{\nu _{w_b}} \Vert _{\infty } = \max _{s} \nu _{w_b}(s)\) and \(\Vert (D^{\nu _{w_b}})^{-1} \Vert _{\infty } = (\min _{s} \nu _{w_b}(s))^{-1}\). If the behaviour policy is chosen in such a way that all the states are (roughly) equally likely under its stationary distribution, then \(\Vert (D^{\nu _{w_b}})^{-1} \Vert _{\infty }\Vert D^{\nu _{w_b}} \Vert _{\infty } \approx 1\). Consequently, the upper bound reduces to \(O\big ((\vert {\mathbb {S}} \vert ^{2}\epsilon ^{2}_2 + \vert {\mathbb {S}} \vert \epsilon _2)\frac{(1+\gamma )(1+\gamma \lambda )}{(1-\gamma )(1-\gamma \lambda )}\big )\).
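For instance, with \(\gamma = 0.9\) and \(\lambda = 0.5\), the factor \(\frac{(1+\gamma )(1+\gamma \lambda )}{(1-\gamma )(1-\gamma \lambda )} = \frac{1.9 \times 1.45}{0.1 \times 0.55} \approx 50\); the bound thus degrades as \(\gamma \rightarrow 1\), even when \(\Vert D^{\nu _{w_b}} \Vert _{\infty }\Vert (D^{\nu _{w_b}})^{-1} \Vert _{\infty } \approx 1\).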

Now regarding the latter bound in Eq. (17): given \(w \in {\mathbb {W}}\), using the triangle inequality together with Eq. (17), we obtain a quantification of the distance between the solution of the off-policy LSTD(\(\lambda \)), i.e., \(\varPhi x_{w \vert w_b}\), and the projection \(\varPi ^{w_b}V^{w}\) in terms of \(\Vert \cdot \Vert _{\nu _{w_b}}\) and \(\epsilon _2\). The above bound can be further improved by obtaining a suitable bound for \(\Vert V^{w} - V^{w_b} \Vert _{\nu _{w_b}}\) as follows:

Corollary 1

Let \(w \in {\mathbb {W}}\), \(\lambda \in [0,1]\) and \(\gamma \in (0,1)\). Let the assumptions of Theorem 2 hold. Also, assume that \(\epsilon _2\), which is defined in Theorem 2, satisfies \(\epsilon _2\frac{1+\gamma }{1-\gamma } < 1\). Then \(\exists K_1 > 0\), s.t.

$$\begin{aligned} \begin{aligned}&\Vert \varPhi x_{w \vert w_b} - V^{w} \Vert _{\nu _{w_b}} \le \frac{K_1(\gamma -2\gamma \lambda +1)(1+\gamma )\epsilon _2}{(1-\gamma )(1-\gamma -\epsilon _2(1+\gamma ))} + \frac{\epsilon _2(1-\gamma \lambda )\Vert R \Vert _{\infty }}{(1-\gamma )^{2}} + \\&\frac{1-\gamma \lambda }{1-\gamma }\Vert \varPi ^{w_b}V^{w} - V^{w} \Vert _{\nu _{w_b}}, \end{aligned} \end{aligned}$$

Proof

Given \(w \in {\mathbb {W}}\), the value function \(V^{w}\) satisfies the linear system given by the Bellman equation as shown in Eq. (4), i.e.,

$$\begin{aligned} (I-\gamma P_{w})V^{w} = R^{w}. \end{aligned}$$
(28)

Similarly, for the behaviour policy \(w_b\), we have

$$\begin{aligned} (I-\gamma P_{w_b})V^{w_b} = R^{w_b}. \end{aligned}$$
(29)

Now, note that

$$\begin{aligned}&(I-\gamma P_{w}) = (I-\gamma P_{w_b}) + F, \text { where } F = \gamma (P_{w_b} - P_{w}).\\&R^{w} = R^{w_b} + b, \text { where } b = R^{w} - R^{w_b}. \end{aligned}$$

By arguing along the same lines as (26), one can show that \(\vert b(s) \vert \le \epsilon _2 \vert R^{w_b}(s) \vert \), \(\forall s \in {\mathbb {S}}\). Similarly, \(\vert F(s, s^{\prime }) \vert \le \epsilon _2 \gamma \vert P_{w_b}(s, s^{\prime }) \vert \le \epsilon _2 \vert (I-\gamma P_{w_b})(s, s^{\prime }) \vert \), \(\forall s, s^{\prime } \in {\mathbb {S}}\). [The proof is similar to that of (18)]. Hence the on-policy linear system given by (28) can be viewed as a perturbed version of the linear system (29) of the behaviour policy. So, using the remark following Theorem 2.2 of Higham (1994), we obtain the following:

$$\begin{aligned} \frac{\Vert V^{w} - V^{w_b} \Vert _{\nu _{w_b}}}{\Vert V^{w_b} \Vert _{\nu _{w_b}}} \le \frac{2\epsilon _2\kappa (I - \gamma P_{w_b})}{1-\epsilon _2\kappa (I - \gamma P_{w_b})}. \end{aligned}$$
(30)

where \(\kappa (I - \gamma P_{w_b}) = \Vert I - \gamma P_{w_b} \Vert _{\infty }\Vert (I - \gamma P_{w_b})^{-1} \Vert _{\infty }\) (condition number \(\kappa (\cdot )\) is defined in Sect. 1). It is also easy to verify that \(\Vert I - \gamma P_{w_b} \Vert _{\infty } = 1+\gamma \). Now to bound \(\Vert (I - \gamma P_{w_b})^{-1} \Vert _{\infty }\), we use the Ahlberg–Nilson–Varah bound from Varga (1976). In particular, by using Theorem A of Varga (1976), we have

$$\begin{aligned} \Vert (I - \gamma P_{w_b})^{-1} \Vert _{\infty }&\le \frac{1}{\min _{1 \le i \le \vert {\mathbb {S}} \vert }\big \{\vert (I-\gamma P_{w_b})_{ii}\vert - \sum _{j=1, j \ne i }^{\vert {\mathbb {S}} \vert }\vert (I-\gamma P_{w_b})_{ij}\vert \big \}},\nonumber \\&= \frac{1}{1-\gamma }, \end{aligned}$$
(31)

where \((\cdot )_{ij}\) is the (ij) entry of the matrix.
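The bound (31) can also be sanity-checked numerically; the sketch below uses an arbitrary randomly generated row-stochastic matrix standing in for \(P_{w_b}\) and is only an illustration of the inequality.

```python
import numpy as np

# Illustrative check of the Ahlberg-Nilson-Varah bound
# ||(I - gamma*P)^{-1}||_inf <= 1/(1 - gamma) for a random stochastic matrix P.
rng = np.random.default_rng(1)
nS, gamma = 25, 0.9
P = rng.random((nS, nS))
P /= P.sum(axis=1, keepdims=True)          # make P row-stochastic

M = np.eye(nS) - gamma * P
inv_inf_norm = np.linalg.norm(np.linalg.inv(M), ord=np.inf)   # max abs row sum
print(inv_inf_norm, 1.0 / (1.0 - gamma))   # first value never exceeds the second
assert inv_inf_norm <= 1.0 / (1.0 - gamma) + 1e-10
```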

By putting together the above facts, we get \(\kappa (I - \gamma P_{w_b}) \le \frac{1+\gamma }{1-\gamma }\). Consequently from Eq. (30) and the assumption that \(\epsilon _2\frac{1+\gamma }{1-\gamma } < 1\), we obtain

$$\begin{aligned} \frac{\Vert V^{w} - V^{w_b} \Vert _{\nu _{w_b}}}{\Vert V^{w_b} \Vert _{\nu _{w_b}}} \le \frac{2\epsilon _2(1+\gamma )}{1-\gamma -\epsilon _2(1+\gamma )}. \end{aligned}$$

Therefore \(\Vert V^{w} - V^{w_b} \Vert _{\nu _{w_b}} \le K_{1}\epsilon _2(1+\gamma )(1-\gamma -\epsilon _2(1+\gamma ))^{-1}\), \(K_1 > 0\). The corollary now easily follows from the above bound and from (17) of Theorem 2. \(\square \)

The noteworthy result on the upper bound of the approximation error of the on-policy LSTD(\(\lambda \)) provided in Tsitsiklis and Roy (1997) can be easily derived from the above result as follows:

Corollary 2

For \(w \in {\mathbb {W}}\), \(\lambda \in [0,1]\) and \(\gamma \in (0,1)\),

$$\begin{aligned} \Vert \varPhi x_{w \vert w} - V^{w} \Vert _{\nu _{w}} \le \frac{1-\gamma \lambda }{1-\gamma }\Vert \varPi ^{w}V^{w} - V^{w} \Vert _{\nu _{w}}. \end{aligned}$$

Proof

In the on-policy case, \(w_b = w\). Hence \(\epsilon _2 = 0\). The corollary follows by direct substitution of these values in (17). \(\square \)

3.2 Estimation of the objective function

The objective function of the control problem defined in Eq. (15) is

$$\begin{aligned} J(w) = {\mathbb {E}}_{\nu _{w}}\left[ L(h_{w \vert w})\right] . \end{aligned}$$
(32)

In this paper, we employ off-policy LSTD(\(\lambda \)) to approximate \(h_{w \vert w}\) for a given policy parameter \(w \in {\mathbb {W}}\). A sample trajectory \(\{{\mathbf {s}}_{0}, {\mathbf {a}}_0, {\mathbf {r}}_{0}, {\mathbf {s}}_{1}, {\mathbf {a}}_1, {\mathbf {r}}_{1}, {\mathbf {s}}_{2}, \dots \}\) (fixed for the algorithm) generated using the behaviour policy \(\pi _{w_b}\) is provided.

The procedure to estimate the objective function J is formally defined in Algorithm 1. The Predict procedure in Algorithm 1 is almost the same as the off-policy LSTD algorithm. The additional recursion (step 10) estimates the objective function defined in Eq. (32) as follows:

$$\begin{aligned} \ell _{k+1}^{w} = \ell _{k}^{w} + \alpha _{k+1}\Big (L({\mathbf {x}}_{k}^{\top }\phi ({\mathbf {s}}_{k+1})) - \ell _{k}^{w}\Big ), \end{aligned}$$
(33)

where \(\alpha _{k} = 1/k\). The above choice of \(\alpha _k\) is merely a recommendation and not a strict requirement. This, however, alleviates the extra burden of deciding \(\alpha _k\) during implementation.
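For concreteness, a minimal sketch of the recursion (33) is given below; the names x_k, phi_s_next and L are placeholders for the current off-policy LSTD(\(\lambda \)) iterate, the feature vector of the next state and the performance function, and are not part of Algorithm 1's notation.

```python
import numpy as np

# A minimal sketch of the objective-estimate recursion (33).
def update_objective_estimate(ell_k, k, x_k, phi_s_next, L):
    alpha = 1.0 / (k + 1)                     # the recommended step-size alpha_k = 1/k
    sample = L(x_k @ phi_s_next)              # L(x_k^T phi(s_{k+1}))
    return ell_k + alpha * (sample - ell_k)   # ell_{k+1}

# usage (illustrative): ell = update_objective_estimate(ell, k, x, phi_next, lambda y: y**2)
```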

For a given \(w \in {\mathbb {W}}\), \(\ell _{k}^{w}\) attempts to find an approximate value of the objective function J(w). The following lemma formally characterizes the limiting behaviour of the iterates \(\ell _{k}^{w}\).

Lemma 1

For a given \(w \in {\mathbb {W}}\),

$$\begin{aligned} \ell _{k}^{w} \rightarrow \ell ^{w}_{*} = {\mathbb {E}}_{\nu _{w_b}}\left[ L(x_{w \vert w_b}^{\top }\phi ({\mathbf {s}}))\right] \text { as } k \rightarrow \infty \text { w.p. 1.} \end{aligned}$$
(34)

Proof

We begin the proof by defining the filtration \(\{\mathcal {F}_{k}\}_{k \in {\mathbb {N}}}\), where the \(\sigma \)-field \(\mathcal {F}_{k} \triangleq \sigma (\{{\mathbf {x}}_{i}, \ell ^{w}_{i}, {\mathbf {s}}_{i}, {\mathbf {a}}_{i}, {\mathbf {r}}_{i}, 0 \le i \le k \})\).

Now recalling the recursion (33),

$$\begin{aligned} \begin{aligned} \ell _{k+1}^{w}&:= \ell _{k}^{w} + \alpha _{k+1}\Big (L({\mathbf {x}}_{k}^{\top }\phi ({\mathbf {s}}_{k+1})) - \ell _{k}^{w}\Big )\\&:= \ell _{k}^{w} + \alpha _{k+1}\Big (h(\ell _{k}^{w}) + {\mathbb {M}}_{k+1} + c_k\Big ), \end{aligned} \end{aligned}$$

where \({\mathbb {M}}_{k+1} \triangleq L(x_{w \vert w_b}^{\top }\phi ({\mathbf {s}}_{k+1})) - {\mathbb {E}}\left[ L(x_{w \vert w_b}^{\top }\phi ({\mathbf {s}}_{k+1})) \big \vert \mathcal {F}_{k}\right] \), \(h(z) \triangleq {\mathbb {E}}_{\nu _{w_b}}\left[ L(x_{w \vert w_b}^{\top }\phi ({\mathbf {s}}))\right] - z\) and \(c_k \triangleq L({\mathbf {x}}_{k}^{\top }\phi ({\mathbf {s}}_{k+1})) - L(x_{w \vert w_b}^{\top }\phi ({\mathbf {s}}_{k+1})) + {\mathbb {E}}\left[ L(x_{w \vert w_b}^{\top }\phi ({\mathbf {s}}_{k+1})) \big \vert \mathcal {F}_{k}\right] - {\mathbb {E}}_{\nu _{w_b}}\left[ L(x_{w \vert w_b}^{\top }\phi ({\mathbf {s}}))\right] \).

We state here a few observations:

  1. 1.

    \(\{{\mathbb {M}}_{k}, k \ge 1\}\) is a martingale difference noise sequence w.r.t. \(\{\mathcal {F}_{k}\}\), i.e., \({\mathbb {M}}_{k}\) is \(\mathcal {F}_{k}\)-measurable, integrable and \({\mathbb {E}}[{\mathbb {M}}_{k+1} \vert \mathcal {F}_{k}] = 0\) a.s., \(\forall k \ge 0\).

  2. 2.

    \(h(\cdot )\) is a Lipschitz continuous function.

  3. 3.

    \(\exists K > 0\) s.t. \({\mathbb {E}}[\vert {\mathbb {M}}_{k+1} \vert ^{2} \vert \mathcal {F}_{k}] \le K(1+\vert \ell _{k} \vert ^{2})\) a.s., \(\forall k \ge 0\).

  4. 4.

    By Theorem 1, \(c_k \rightarrow 0\) as \(k \rightarrow \infty \) w.p. 1. This directly follows by considering the following facts: (a) by Eq. (1), the off-policy LSTD(\(\lambda \)) iterates \(\{{\mathbf {x}}_{k}\}\) converge almost surely to the off-policy solution \(x_{w \vert w_b}\); (b) by assumption (A2), \(P_{w_b}({\mathbf {s}}_{k} = s) \rightarrow \nu _{w_b}(s)\) as \(k \rightarrow \infty \); and (c) \(L(\cdot )\) and \(\phi (\cdot )\) are bounded.

  5. 5.

    For a given \(w \in {\mathbb {W}}\), the iterates \(\{\ell _{k}^{w}\}_{k \in \mathbb {N}}\) are stable, i.e., \(\sup _{k} \vert \ell _{k}^{w} \vert < \infty \) a.s. A brief proof is provided here: For \(c > 0\), we define

    $$\begin{aligned} h_{c}(z) \triangleq \frac{h(cz)}{c} = \frac{{\mathbb {E}}_{\nu _{w_b}}\left[ L(x_{w \vert w_b}^{\top }\phi ({\mathbf {s}}))\right] }{c} - z. \end{aligned}$$
    (35)

    Now consider the ODE corresponding to the \(\infty \)-system:

    $$\begin{aligned} \dot{z}(t) = h_{\infty }(z(t)) \triangleq \lim _{c \rightarrow \infty }h_{c}(z(t)). \end{aligned}$$
    (36)

    Note that \(h_{\infty }(z) = -z\). It can be easily verified that the above ODE is globally asymptotically stable to the origin. This further implies the stability of the iterates \(\{\ell _{k}^{w}\}\) using Theorem 2, Chapter 3 of Borkar (2008).

Now by appealing to the third extension of Theorem 2, Section 2.2, Chapter 2 of Borkar (2008) and from the above observations, we conclude that, almost surely, the iterates \(\{\ell _{k}^{w}\}\) asymptotically track the ODE given by:

$$\begin{aligned} \dot{z}(t) = h(z(t)). \end{aligned}$$
(37)

This further implies that the limit points of the iterates \(\{\ell _{k}^{w}\}\) are indeed contained in the limit set of the ODE (37) almost surely. However, it is easy to verify that \({\mathbb {E}}_{\nu _{w_b}}\left[ L(x^{\top }_{w \vert w_b}\phi ({\mathbf {s}}))\right] \) is the unique globally asymptotically stable equilibrium of the ODE (37). Hence \(\lim _{k \rightarrow \infty }\ell _{k}^{w} = {\mathbb {E}}_{\nu _{w_b}}\left[ L(x^{\top }_{w \vert w_b}\phi ({\mathbf {s}}))\right] \) a.s. This completes the proof of (34). \(\square \)

Remark 1

By the above lemma, for a given \(w \in {\mathbb {W}}\), the quantity \(\ell _{k}^{w}\) tracks \({\mathbb {E}}_{\nu _{w_b}}\big [L(x^{\top }_{w \vert w_b}\phi ({\mathbf {s}}))\big ]\). This is, however, different from the true objective function value \(J(w) = {\mathbb {E}}_{\nu _{w}}\left[ L(h_{w \vert w})\right] \) when \(w \ne w_b\). This additional approximation error is the extra cost one has to pay for the lack of information (the absence of a generative model) about the underlying MDP. Nevertheless, from Eqs. (16) and (19), we know that the relative errors in the solutions \(x_{w\vert w}\) and \(x_{w \vert w_b}\) as well as in the stationary distributions \(\nu _{w}\) and \(\nu _{w_b}\) are bounded. We also know that \(\varPhi x_{w \vert w} \approx h_{w \vert w}\). Further, if the performance function L is sufficiently smooth, then the deviation of L(y) remains small when its argument y is perturbed slightly. All these factors affirm that the approximation proposed in (33) is well conditioned. This is indeed significant, considering the restricted setting we operate in, i.e., the non-availability of a generative model.

Algorithm 1 (pseudocode figure)

3.3 Stochastic approximation version of Gaussian cross entropy method and its application to the control problem

The cross entropy (CE) method (Rubinstein and Kroese 2013; Kroese et al. 2006) solves optimization problems in which the objective function lacks good structural properties, for instance objectives that are discontinuous or non-differentiable, i.e., problems of the kind:

$$\begin{aligned} \text {Find } x^{*} \in \mathop {\mathrm{arg\,max}}\limits _{x \in {\mathbb {X}} \subset \mathrm{I\!R}^{d}} J(x), \end{aligned}$$
(38)

where \(J:{\mathbb {X}} \rightarrow \mathrm{I\!R}\) is a bounded Borel measurable function.

CE is a model based search method (Zlochin et al. 2004) used to solve the global optimization problem. CE is a zero-order method (a.k.a. gradient-free method), i.e., the algorithm does not require the gradient or higher-order derivatives of the objective function. This remarkable feature of the algorithm makes it a suitable choice for the “black-box” optimization setting, where neither a closed form expression nor structural properties of the objective function J are available. The CE method has found successful application in diverse domains which include continuous multi-extremal optimization (Rubinstein 1999), buffer allocation (Alon et al. 2005), queueing models (de Boer 2000), DNA sequence alignment (Keith and Kroese 2002), control and navigation (Helvik and Wittner 2001), reinforcement learning (Mannor et al. 2003; Menache et al. 2005) and several NP-hard problems (Rubinstein 2002, 1999). We would also like to mention that there are other model based search methods in the literature; a few pertinent ones are the gradient-based adaptive stochastic search for simulation optimization (GASSO) (Zhou et al. 2014), the estimation of distribution algorithm (EDA) (Mühlenbein and Paass 1996) and model reference adaptive search (MRAS) (Hu et al. 2007). However, in this paper, we do not explore the possibility of employing the above algorithms in an MDP setting.

The Gaussian based cross entropy method generates a sequence of Gaussian distributions \({\{\theta _{j} = (\mu _{j}, \Sigma _{j})^{\top } \in \Theta \subset \mathrm{I\!R}^{d(d+1)}\}}_{j \in {\mathbb {N}}}\) parametrized by its mean vector \(\mu _{j} \in \mathrm{I\!R}^{d}\) and the covariance matrix \(\Sigma _{j} \in \mathrm{I\!R}^{d \times d}\), with the property that the support of the multivariate Gaussian probability density function given by

$$f_{\theta _{j+1}}(x) = \big ((2\pi )^{d}\vert \Sigma _{j+1}\vert \big )^{-1/2}\exp {\big (-\tfrac{1}{2}(x-\mu _{j+1})^{\top }\Sigma _{j+1}^{-1}(x-\mu _{j+1})\big )}$$

satisfies (P1) below.

Property (P1) \(supp(f_{\theta _{j+1}}) \subseteq \{x \vert J(x) \ge \gamma _{\rho }(J, \theta _{j})\}\),

where \(\rho \in (0,1)\) is fixed a priori. Note that \(\gamma _{\rho }(J, \theta _{j})\) is the \((1-\rho )\)-quantile of J w.r.t. the distribution \(f_{\theta _{j}}\). Hence it is easy to verify that the threshold sequence \(\{\gamma _{\rho }(J, \theta _{j})\}_{j \in {\mathbb {N}}}\) is a monotonically non-decreasing sequence. The intuition behind this recursive generation of the model sequence is that by assigning greater weight to the higher values of J at each iteration, the expected behaviour of the model sequence should improve. We make the following assumption on the model parameter space \(\Theta \):

\(\circledast \) Assumption (A5) The parameter space \(\Theta \) is a compact subset of \(\mathrm{I\!R}^{d(d+1)}\).

The invariant maintained in each iteration of the CE method is property (P1). At each instant \(j+1\), the CE method seeks the distribution that is closest to maintaining this invariant by solving the following optimization problem:

$$\begin{aligned} \theta _{j+1} = \mathop {\mathrm{arg\,max}}\limits _{\theta \in \Theta }\Gamma _{j}(\theta , \gamma _{\rho }(J, \theta _{j})), \end{aligned}$$
(39)

where \(\Gamma _{j}(\theta , \gamma ) \triangleq {\mathbb {E}}_{\theta _{j}}\left[ \varphi (J({\mathbf {x}}))I_{\varvec{\{}J({\mathbf {x}}) \ge \gamma \varvec{\}}}\log {f_\theta ({\mathbf {x}})}\right] \) and \(\varphi :\mathrm{I\!R}\rightarrow \mathrm{I\!R}_{+}\) is a positive, strictly monotonically increasing function. This recursive equation forms the basis of the cross entropy method and is referred to as the model update procedure.

Note that the solution to Eq. (39) is obtained by equating \(\nabla \Gamma _{j}\) to 0:

$$\begin{aligned} \nabla _{\vartheta ^{\theta }_1}\Gamma _{j}(\theta , \gamma ) = 0 \varvec{\Rightarrow } \mu = \frac{{\mathbb {E}}_{\theta _j}\left[ {{\mathbf {g}_{1}}}\varvec{\big (}J({\mathbf {x}}), {\mathbf {x}}, \gamma \varvec{\big )}\right] }{{\mathbb {E}}_{\theta _{j}}\left[ {{\mathbf {g}_{0}}}(J({\mathbf {x}}), \gamma )\right] } \triangleq \Upsilon _{1}(\theta _{j}, \gamma ), \end{aligned}$$
(40)
$$\begin{aligned} \nabla _{\vartheta ^{\theta }_2}\Gamma _{j}(\theta , \gamma ) = 0 \varvec{\Rightarrow } \Sigma = \frac{{\mathbb {E}}_{\theta _{j}}\left[ {\mathbf {g_{2}}}\varvec{\big (}J({\mathbf {x}}), {\mathbf {x}}, \gamma , \mu \varvec{\big )}\right] }{{\mathbb {E}}_{\theta _{j}}\left[ {\mathbf {g_{0}}}\varvec{\big (}J({\mathbf {x}}), \gamma \varvec{\big )}\right] } \triangleq \Upsilon _{2}(\theta _{j}, \gamma ), \end{aligned}$$
(41)
$$\begin{aligned}&\text { where }\quad {\mathbf {g_{0}}}\varvec{(}y, \gamma \varvec{)} \triangleq \varphi (y)I_{\varvec{\{}y \ge \gamma \varvec{\}}},\end{aligned}$$
(42a)
$$\begin{aligned}&{\mathbf {g_{1}}}\varvec{(}y, x, \gamma \varvec{)} \triangleq x\varphi (y)I_{\varvec{\{}y \ge \gamma \varvec{\}}}, \end{aligned}$$
(42b)
$$\begin{aligned}&{\mathbf {g_{2}}}\varvec{(}y, x, \gamma , \mu \varvec{)} \triangleq \varphi (y)(x-\mu )(x-\mu )^{\top }I_{\varvec{\{}y \ge \gamma \varvec{\}}}\end{aligned}$$
(42c)
$$\begin{aligned}&(\vartheta ^{\theta }_1, \vartheta ^{\theta }_2)^{\top } = (\Sigma ^{-1}\mu , -\frac{1}{2}\Sigma ^{-1})^{\top }. \end{aligned}$$
(42d)

The mapping \((\mu , \Sigma )^{\top } \mapsto (\Sigma ^{-1}\mu , \frac{-1}{2}\Sigma ^{-1})^{\top }\) is a bijection and considerably simplifies the algebra. Also, it is not hard to verify that \(\Upsilon _1\) and \(\Upsilon _2\) are well defined.
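To make the update concrete, the following sketch approximates \(\Upsilon _1\), \(\Upsilon _2\) and the quantile \(\gamma _{\rho }(J, \theta _j)\) by Monte Carlo sampling from \(f_{\theta _j}\); it illustrates the ideal update only, not the stochastic approximation version discussed later, and all function names are illustrative.

```python
import numpy as np

# Monte Carlo sketch of the closed-form CE updates (40)-(41): mu_new plays the role
# of Upsilon_1 and Sigma_new that of Upsilon_2; phi_fn stands in for varphi.
def ce_update(J, mu, Sigma, rho, phi_fn=np.exp, n=2000, rng=np.random.default_rng(3)):
    xs = rng.multivariate_normal(mu, Sigma, size=n)        # x ~ f_{theta_j}
    vals = np.array([J(x) for x in xs])
    gamma = np.quantile(vals, 1.0 - rho)                   # empirical gamma_rho(J, theta_j)
    w = phi_fn(vals) * (vals >= gamma)                     # g_0(J(x), gamma) per sample
    mu_new = (w[:, None] * xs).sum(0) / w.sum()            # sample version of Upsilon_1
    diffs = xs - mu_new
    outer = np.einsum('ni,nj->nij', diffs, diffs)          # (x - mu)(x - mu)^T per sample
    Sigma_new = (w[:, None, None] * outer).sum(0) / w.sum()  # sample version of Upsilon_2
    return mu_new, Sigma_new, gamma
```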

Now from (40) and (41), we can rewrite the recursion (39) as

$$\begin{aligned} \theta _{j+1} = \big (\Upsilon _{1}\left( \theta _{j}, \gamma _{\rho }(J, \theta _{j})\right) , \Upsilon _{2}\left( \theta _{j}, \gamma _{\rho }\left( J, \theta _{j}\right) \right) \big )^{\top }. \end{aligned}$$
(43)

The above update rule for recursively generating the model sequence \(\{\theta _{j}\}\) is commonly referred to as the ideal version of the standard CE method. However, in this paper, we employ an extended version of the CE method proposed in Joseph and Bhatnagar (2016a, b, c), whose update rule is slightly different. In the extended version, a mixture PDF \(\widehat{f}_{\theta _j} = (1-\zeta )f_{\theta _{j}} + \zeta f_{\theta _0}\) (with \(\zeta \in (0,1)\) and \(\theta _0\) the initial distribution parameter) is employed to compute \(\gamma _{\rho }\), \(\Upsilon _1\) and \(\Upsilon _2\) instead of the original PDF \(f_{\theta _j}\). In this case, the update rule is defined as follows:

$$\begin{aligned} \begin{aligned} \theta _{j+1} = \left( \Upsilon _{1}\left( \widehat{\theta }_{j}, \gamma _{\rho }(J, \widehat{\theta }_{j})\right) , \Upsilon _{2}\left( \widehat{\theta }_{j}, \gamma _{\rho }\left( J, \widehat{\theta }_{j}\right) \right) \right) ^{\top }.\\ \end{aligned} \end{aligned}$$
(44)

Here \(\gamma _{\rho }(J, \widehat{\theta })\) is defined as the \((1-\rho )\)-quantile of J w.r.t. the mixture distribution \(\widehat{f}_{\theta }\). Similarly we define \(\Upsilon _1(\widehat{\theta }, \cdot )\) and \(\Upsilon _2(\widehat{\theta }, \cdot )\) respectively. This extended version is shown to exhibit global optimum convergence (Joseph and Bhatnagar 2016a, b, c).
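A small sketch of how one may sample from the mixture PDF \(\widehat{f}_{\theta }\) is given below; with probability \(\zeta \) the sample is drawn from the initial model \(f_{\theta _0}\), which keeps the support of the mixture wide. The function name is illustrative.

```python
import numpy as np

# Draw one sample from hat{f}_theta = (1 - zeta) f_theta + zeta f_theta0.
def sample_mixture(mu, Sigma, mu0, Sigma0, zeta, rng=np.random.default_rng(4)):
    if rng.random() < zeta:
        return rng.multivariate_normal(mu0, Sigma0)   # exploration component f_theta0
    return rng.multivariate_normal(mu, Sigma)         # current model f_theta
```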

However, there are certain tractability concerns. The quantities \(\gamma _{\rho }(J, \widehat{\theta }_{j})\), \(\Upsilon _{1}(\widehat{\theta }_{j}, \cdot )\) and \(\Upsilon _{2}(\widehat{\theta }_{j}, \cdot )\) involved in the update rule are computationally intractable (hence the tag name ‘ideal’). To overcome this, a naive approach usually found in the literature is to employ sample averaging, with the sample size increasing to infinity. However, this approach suffers from hefty storage and computational complexity, which is primarily attributed to the accumulation and processing of a huge number of samples. In Joseph and Bhatnagar (2016a, b, c), a stochastic approximation variant of the extended cross entropy method has been proposed. The proposed approach is efficient both computationally and storage wise, when compared to the rest of the state-of-the-art CE tracking methods (Hu et al. 2012; Wang and Enright 2013; Kroese et al. 2006). It also integrates the mixture approach (44) and hence exhibits global optimum convergence.

The goal of the stochastic approximation (SA) version of the Gaussian CE method is to find a sequence of Gaussian model parameters \(\{\theta _j = (\mu _j, \Sigma _j)^{\top }\}\) (where \(\mu _j\) is the mean vector and \(\Sigma _j\) is the covariance matrix) which tracks the ideal CE method. The algorithm accomplishes this efficiently by employing multiple stochastic approximation recursions. It is shown to exhibit global optimum convergence, i.e., the model sequence \(\{\theta _{j}\}\) converges to the degenerate distribution concentrated on any of the global optima of the objective function (Fig. 5), both in the deterministic setting (when the objective function is deterministic) and in the stochastic setting, i.e., when only noisy versions of the objective function are available. The applicability of the SA version of CE in stochastic settings is appealing for the control problem we consider in this paper, since the off-policy LSTD(\(\lambda \)) method only provides estimates of the value function. The SA version of CE is a discrete evolutionary procedure in which the model sequence \(\{\theta _j\}\) is adapted towards the degenerate distribution concentrated at the global optima, using a single sample from the solution space at each discrete step of the evolution. This single-sample requirement, which is unique to the SA version, is particularly appealing in settings where objective function values are hard to obtain, such as the MDP control problem considered here, since one does not need to expend computational resources on unnecessary value function evaluations.

Our algorithm which attempts to solve the control problem defined in Eq. (15) is formally illustrated in Algorithm 2.

Fig. 5 Illustration of the sequence \(\{\theta _j\}\) generated by the CE method

A few remarks about the algorithm are in order:

  1. 1.

    The learning rates \(\{\overline{\beta }_{j}\}\), \(\{\beta _{j}\}\) and the mixing weight \(\zeta \) are deterministic, non-increasing and satisfy the following:

    $$\begin{aligned} \begin{aligned} \zeta \in (0, 1),&\beta _{j}> 0, \overline{\beta }_{j} > 0, \\&\sum _{j=1}^{\infty }\beta _{j} = \infty , \sum _{j=1}^{\infty }\overline{\beta }_{j} = \infty , \sum _{j=1}^{\infty }\left( \beta ^{2}_{j}+\bar{\beta }^{2}_{j}\right) < \infty . \end{aligned} \end{aligned}$$
    (45)
  2. 2.

    In our algorithm, the objective function is estimated in (50) using the Predict procedure which is defined in Algorithm 1. Even though an infinitely long sample trajectory is assumed to be available, the Predict procedure has to practically terminate after processing a finite number of transitions from the trajectory. Hence a user configured trajectory length rule \(\{N_{j} \in {\mathbb {N}}\setminus \{0\}\}_{j \in {\mathbb {N}}}\) with \(N_{j} \uparrow \infty \) is used. At each iteration j of the cross entropy method, when Predict procedure is invoked to estimate the objective function \(L(h_{w_j \vert w_j})\), the procedure terminates after processing the first \(N_{j}\) transitions in the trajectory. It is also important to note that the same sample trajectory is reused for all invocations of Predict. This eliminates the need for any further observations of the MDP.

  3. 3.

    Recall that we employ the stochastic approximation (SA) version of the extended CE method to solve our control problem (15). The SA version (hence Algorithm 2) maintains three variables: \(\gamma _j, \xi ^{(0)}_{j}\) and \(\xi ^{(1)}_{j}\), with \(\gamma _j\) tracking \(\gamma _{\rho }(\cdot , \widehat{\theta }_j)\), while \(\xi ^{(0)}_j\) and \(\xi ^{(1)}_j\) track \(\Upsilon _1(\widehat{\theta }_j, \cdot )\) and \(\Upsilon _2(\widehat{\theta }_j, \cdot )\) respectively. Their stochastic recursions are defined in Eqs. (51), (52) and (53) of Algorithm 2. The increment terms for their respective stochastic recursions are defined recursively as follows (a brief code sketch of these increments is given after this list):

    $$\begin{aligned}&\Delta \gamma _{j}(y) \triangleq -(1-\rho )I_{\{y \ge \gamma _j\}}+\rho I_{\{y \le \gamma _j\}}.\end{aligned}$$
    (46)
    $$\begin{aligned}&\Delta \xi ^{(0)}_{j}(x, y) \triangleq {\mathbf {g_{1}}}(y, x, \gamma _j) - \xi ^{(0)}_j {\mathbf {g_{0}}}(y, \gamma _j).\end{aligned}$$
    (47)
    $$\begin{aligned}&\Delta \xi ^{(1)}_{j}(x, y) \triangleq {\mathbf {g_{2}}}(y, x, \gamma _j, \xi ^{(0)}_j) - \xi ^{(1)}_j {\mathbf {g_{0}}}(y, \gamma _j). \end{aligned}$$
    (48)
  4. 4.

    The initial distribution parameter \(\theta _0\) is chosen by hand such that the probability density function \(f_{\theta _0}\) takes strictly positive values at every point in the solution space \({\mathbb {W}}\), i.e., \(f_{\theta _0}(w) > 0, \forall w \in {\mathbb {W}}\).

  5. 5.

    The stopping rule we adopt here for the control problem is to terminate the algorithm when successive elements of the model sequence \(\{\theta _j\}\) remain sufficiently close for a sufficiently long time, i.e., \(\exists \bar{j} \ge 0\) s.t. \(\Vert \theta _j - \theta _{j+1} \Vert < \delta _1\) for all j with \(\bar{j} \le j \le \bar{j}+N(\delta _1)\), where \(\delta _1 \in \mathrm{I\!R}_{+}\), \(N(\delta _1) \in {\mathbb {N}}\) are decided a priori.

  6. 6.

    The quantile factor \(\rho \) is also a relevant parameter of the CE method. An empirical analysis in Joseph and Bhatnagar (2016b) has revealed that the convergence rate of the algorithm is sensitive to the choice of \(\rho \). The paper also recommends [0.01, 0.3] as the most suitable range for \(\rho \).

  7. 7.

    We also extended the algorithm to include Polyak averaging of the model sequence \(\{\theta _j\}\). The sequence \(\{\overline{\theta }_{j}\}\) maintains the Polyak averages of the sequence \(\{\theta _j\}\) and its update step is given in (57). Note that Polyak averaging (Polyak and Juditsky 1992) is a double averaging technique which does not hamper the convergence of the original sequence \(\{\theta _j\}\); however, it reduces the variance of the iterates and accelerates the convergence of the sequence.
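As promised in remark 3 above, the following is a compact sketch of the per-sample increments (46)–(48) and the Polyak averaging step (57); the variable names are illustrative and the sketch is not a substitute for the precise recursions (51)–(55) of Algorithm 2.

```python
import numpy as np

# Per-sample increments for the SA recursions: x is the sampled policy parameter,
# y = hat{J}(x) the sampled objective value, and phi_fn stands in for varphi.
def increments(x, y, gamma, xi0, xi1, rho, phi_fn=np.exp):
    g0 = phi_fn(y) * (y >= gamma)                               # g_0(y, gamma)
    d_gamma = -(1.0 - rho) * (y >= gamma) + rho * (y <= gamma)  # (46)
    d_xi0 = x * g0 - xi0 * g0                                   # (47): g_1 - xi0 * g_0
    d_xi1 = g0 * np.outer(x - xi0, x - xi0) - xi1 * g0          # (48): g_2 - xi1 * g_0
    return d_gamma, d_xi0, d_xi1

def polyak_average(theta_bar, theta, beta_bar):
    # step (57): theta_bar_{j+1} = theta_bar_j + beta_bar_{j+1} (theta_{j+1} - theta_bar_j)
    return theta_bar + beta_bar * (theta - theta_bar)
```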

Algorithm 2 (pseudocode figure)

3.4 Convergence analysis of Algorithm 2

The convergence analysis of the generalized variant of Algorithm 2 is already addressed in Joseph and Bhatnagar (2016c) and its application to the prediction problem is given in Joseph and Bhatnagar (2016b). However, for completeness, we restate the results here. We do not give proofs of those results; instead, we provide references for them. The additional Polyak averaging (step 19 of Algorithm 2) requires analysis, which is covered below.

Note that Algorithm 2 employs the off-policy prediction method for estimating the objective function. In particular, in step 6 of Algorithm 2, we have \(\hat{J}({\mathbf {w}}_{j+1}) := Predict({\mathbf {w}}_{j+1}, N_{j+1})\), which converges to \({\mathbb {E}}_{\nu _{w_b}}\left[ L(x^{\top }_{w \vert w_b}\phi ({\mathbf {s}}))\right] \) almost surely as \(N_j \rightarrow \infty \) (by Lemma 1). Hence the objective function optimized by Algorithm 2 is \(J_b(w) \triangleq {\mathbb {E}}_{\nu _{w_b}}\left[ L(x^{\top }_{w \vert w_b}\phi ({\mathbf {s}}))\right] \), where \(w_b \in {\mathbb {W}}\) is the chosen behaviour policy vector.

Also note that the model parameter \(\theta _{j}\) in Algorithm 2 is not updated at each iteration j. Rather, it is updated whenever \(T_{j}\) hits the \(\epsilon \) threshold (step 15 of Algorithm 2), where \(\epsilon \in (0, 1)\) is a constant. So the update of \(\theta _{j}\) only happens along a sub-sequence \(\{j_{(n)}\}_{n \in {\mathbb {N}}}\) of \(\{j\}_{j \in {\mathbb {N}}}\). Between \(j = j_{(n)}\) and \(j = j_{(n+1)}\), the model parameter \(\theta _j\) remains constant and the variable \(\gamma _{j}\) estimates the \((1-\rho )\)-quantile of \(J_b\) w.r.t. \(\widehat{f}_{\theta _{j_{(n)}}}\).

Notation We denote by \(\gamma _{\rho }(J_b, \widehat{\theta })\), the \((1-\rho )\)-quantile of \(J_b\) w.r.t. the mixture distribution \(\widehat{f}_{\theta }\) and let \(E_{\widehat{\theta }}[\cdot ]\) be the expectation w.r.t. \(\widehat{f}_{\theta }\).

Since the model parameter \(\theta _j\) remains constant between \(j = j_{(n)}\) and \(j = j_{(n+1)}\), the convergence behaviour of \(\gamma _j\), \(\xi ^{(0)}_j\) and \(\xi ^{(1)}_j\) can be studied by keeping \(\theta _j\) constant.

Lemma 2

Let \(\theta _{j} \equiv \theta , \forall j\). Also, assume \(\sup _{j} \vert \gamma _{j} \vert < \infty \) a.s. Then the stochastic sequence \(\{\gamma _{j}\}\) defined in Eq. (51) satisfies \(\lim _{j \rightarrow \infty }\gamma _{j} = \gamma _{\rho }(J_b, \widehat{\theta })\) a.s.

Proof

Refer Lemma 3 of Joseph and Bhatnagar (2016b). \(\square \)

Lemma 3

Assume \(\theta _{j} \equiv \theta ,\forall j\) . Then almost surely,

  1. (i)
    $$\begin{aligned} \lim _{j \rightarrow \infty } \xi ^{(0)}_{j} = \xi ^{(0)}_{*} = \frac{{\mathbb {E}}_{\widehat{\theta }}\left[ {\mathbf {g_{1}}}\big (J_b({\mathbf {x}}), {\mathbf {x}}, \gamma _{\rho }(J_b, \widehat{\theta })\varvec{\big )}\right] }{{\mathbb {E}}_{\widehat{\theta }}\left[ {\mathbf {g_{0}}}\big (J_b({\mathbf {x}}), \gamma _{\rho }(J_b, \widehat{\theta })\big )\right] }. \end{aligned}$$
  2. (ii)
    $$\begin{aligned} \lim _{j \rightarrow \infty } \xi ^{(1)}_{j} = \xi ^{(1)}_{*} = \frac{{\mathbb {E}}_{\widehat{\theta }}\left[ {\mathbf {g_{2}}}\big (J_b({\mathbf {x}}), {\mathbf {x}}, \gamma _{\rho }(J_b, \widehat{\theta }), \xi ^{(0)}_{*}\big )\right] }{{\mathbb {E}}_{\widehat{\theta }}\left[ {\mathbf {g_{0}}}\big (J_b({\mathbf {x}}), \gamma _{\rho }(J_b, \widehat{\theta })\big )\right] }. \end{aligned}$$
  3. (iii)

    \(T_j\) defined in Eq. (54) satisfies \(-1< T_j < 1, \forall j\).

  4. (iv)

    If \(\gamma _{\rho }(J_b, \widehat{\theta }) > \gamma _{\rho }(J_b, \widehat{\theta }^{p})\), then \(T_{j}\),\(j \ge 1\) in (54) satisfy \(\lim _{j \rightarrow \infty } T_{j} = 1\) a.s.

Proof

For (i), (ii) and (iv), refer Lemma 4 of Joseph and Bhatnagar (2016b). For (iii) refer Proposition 1 of Joseph and Bhatnagar (2016b). \(\square \)

Notation For the subsequence \(\{j_{(n)}\}_{n > 0}\) of \(\{j\}_{j \in {\mathbb {N}}}\), we denote \(j^{-}_{(n)} \triangleq j_{(n)}-1\) for \(n > 0\).

Along the subsequence \(\{j_{(n)}\}_{n \ge 0}\) with \(j_{0} = 0\) the updating of \(\theta _{j}\) can be expressed as follows:

$$\begin{aligned} \theta _{j_{(n+1)}} := \theta _{j_{(n)}} + \beta _{j_{(n)}}\Delta \theta _{j_{(n)}}, \end{aligned}$$
(58)

where \(\Delta \theta _{j_{(n)}}\) = \((\xi ^{(0)}_{j^{-}_{(n+1)}}, \xi ^{(1)}_{j^{-}_{(n+1)}})^{\top } - \theta _{j_{(n)}}\).

We now present our main result. The following theorem shows that the model sequence \(\{\theta _{j}\}\) and the averaged sequence \(\{\overline{\theta }_{j}\}\) generated by Algorithm 2 converge to the degenerate distribution concentrated on the global maximum of the objective function \(J_b\).

Theorem 3

Let \(\varphi (x) = exp(rx), r \in \mathrm{I\!R}\). Let \(\rho , \zeta \in (0,1)\). Let the learning rates \(\{\overline{\beta }_{j}\}\) and \(\{\beta _{j}\}\) satisfy Eq. (45). Assume \(J_b \in \mathcal {C}^{2}\). Let \(\{\theta _{j} = (\mu _{j}, \Sigma _{j})\}_{j \in {\mathbb {N}}}\) and \(\{\overline{\theta }_{j} = (\overline{\mu }_{j}, \overline{\Sigma }_{j})\}_{j \in {\mathbb {N}}}\) be the sequences generated by Algorithm 2 and also assume \(\theta _{j} \in \Theta \), \(\forall j \in {\mathbb {N}}\). Let \(\overline{\beta }_{j} = o (\beta _{j})\). Let \(w_b \in {\mathbb {W}}\) be the chosen behaviour policy vector. Also, let the assumptions (A1–A5) hold. Then

$$\begin{aligned}&\theta _{j} \rightarrow (w^{b*}, 0_{k_2 \times k_2})^{\top } \text { as } j \rightarrow \infty \quad w.p.1, \end{aligned}$$
(59)
$$\begin{aligned}&\overline{\theta }_{j} \rightarrow (w^{b*}, 0_{k_2 \times k_2})^{\top } \text { as } j \rightarrow \infty \quad w.p.1, \end{aligned}$$
(60)

\(\text {where } w^{b*} \in \mathop {\mathrm{arg\,max}}\limits _{w \in {\mathbb {W}}} J_b(w)\) with \(J_b(w) \triangleq {\mathbb {E}}_{\nu _{w_b}}\left[ L(x^{\top }_{w \vert w_b}\phi ({\mathbf {s}}))\right] \).

Proof

Since \(\overline{\beta }_{j} = o(\beta _{j})\), \(\overline{\beta }_{j} \rightarrow 0\) faster than \(\beta _{j} \rightarrow 0\). This implies that the updates of \(\theta _j\) in (55) are larger than those of \(\overline{\theta }_j\) in (57). Hence the sequence \(\{\theta _{j}\}\) appears quasi-convergent when viewed from the timescale of the \(\{\overline{\theta }_{j}\}\) sequence.

Theorem 2 of Joseph and Bhatnagar (2016b) analyses the limiting behaviour of the stochastic recursion (55) of Algorithm 2 in great detail. The analysis establishes the global optimum convergence of the algorithm under limited regularity conditions. It is shown that the model sequence \(\{\theta _j\}\) converges almost surely to the degenerate distribution concentrated on the global optimum. The proposed regularity conditions for the global optimum convergence are that the objective function belongs to \(\mathcal {C}^{2}\) and the existence of a Lyapunov function on the neighbourhood of the degenerate distribution concentrated on the global optimum. This justifies the hypothesis \(J_{b} \in \mathcal {C}^{2}\) in the statement of the theorem, and we further assume the existence of a Lyapunov function on the neighbourhood of the degenerate distribution \((w^{b*}, 0_{k_2 \times k_2})^{\top }\). Then by Theorem 2 of Joseph and Bhatnagar (2016b), we deduce that \(\{\theta _j\}\) converges to \((w^{b*}, 0_{k_2 \times k_2})^{\top }\). This completes the proof of (59).

For brevity, let us define \(\theta ^{*} \triangleq (w^{b*}, 0_{k_2 \times k_2})^{\top }\). We also define the filtration \(\{\overline{\mathcal {F}}_{j}\}_{j \in {\mathbb {N}}}\), where the \(\sigma \)-field \(\overline{\mathcal {F}}_{j} \triangleq \sigma (\{\theta _{i}, \overline{\theta }_i, 0 \le i \le j \})\). Now recalling recursion (57),

$$\begin{aligned} \overline{\theta }_{j+1}&:= \overline{\theta }_{j} + \overline{\beta }_{j+1}\left( \theta _{j+1} - \overline{\theta }_{j}\right) ,\\&:= \overline{\theta }_{j} + \overline{\beta }_{j+1}\left( \theta _{j+1} - {\mathbb {E}}\left[ \theta _{j+1} \vert \overline{\mathcal {F}}_{j}\right] + {\mathbb {E}}\left[ \theta _{j+1} \vert \overline{\mathcal {F}}_{j}\right] - \theta ^{*} + \theta ^{*} - \overline{\theta }_{j}\right) ,\\&:= \overline{\theta }_{j} + \overline{\beta }_{j+1}\left( \overline{{\mathbb {M}}}_{j+1} + \overline{b}_{j} + \overline{h}(\overline{\theta }_{j})\right) , \end{aligned}$$

where \(\overline{{\mathbb {M}}}_{j+1} \triangleq \theta _{j+1} - {\mathbb {E}}\left[ \theta _{j+1} \vert \overline{\mathcal {F}}_{j}\right] \), \(\overline{b}_{j} \triangleq {\mathbb {E}}\left[ \theta _{j+1} \vert \overline{\mathcal {F}}_{j}\right] - \theta ^{*}\) and \(\overline{h}(x) \triangleq \theta ^{*} - x\).

Here we make the following observations:

  1. 1.

    \(\overline{b}_{j} \rightarrow 0\) almost surely as \(j \rightarrow \infty \). This follows from the hypothesis \(\overline{\beta }_{j} = o(\beta _j)\) and by considering the fact that \(\theta _j \rightarrow \theta ^{*}\) almost surely.

  2. 2.

    \(\overline{h}\) is Lipschitz continuous.

  3. 3.

    \(\{\overline{{\mathbb {M}}}_{j}\}\) is a martingale difference sequence.

  4. 4.

    \(\{\overline{\theta }_{j}\}\) is stable, i.e., \(\sup _{j}\Vert \overline{\theta }_{j} \Vert < \infty \).

  5. 5.

    The ODE defined by \(\dot{\overline{\theta }}(t) = \overline{h}(\overline{\theta }(t))\) is globally asymptotically stable at \(\theta ^{*}\).

All the above facts are easy to verify. Now, by appealing to the third extension of Theorem 2, Section 2.2, Chapter 2 of Borkar (2008) and from the above observations, we conclude that \(\overline{\theta }_{j} \rightarrow \theta ^{*}\) almost surely as \(j \rightarrow \infty \). This completes the proof of (60). \(\square \)

4 Experimental illustrations

The performance of our algorithm is evaluated on four different MDP settings:

  1. 1.

    Chain walk MDP.

  2. 2.

    Linearized cart-pole balancing.

  3. 3.

    5-link actuated pendulum balancing.

  4. 4.

    Random MDP.

Our algorithm is compared against the state-of-the-art algorithms such as least squares policy iteration (LSPI), fast policy search method, model reference adaptive search (MRAS) and simultaneous perturbation stochastic approximation (SPSA). In each setting, the results shown are averages over 10 independent sample sequences generated by the algorithms with different initial conditions. The function \(\varphi (\cdot )\) used here is \(\varphi (x) = \exp (rx)\), where \(r \in \mathrm{I\!R}_{+}\).

4.1 Experiment 1: chain walk

This particular setting (Fig. 6), which was proposed in Koller and Parr (2000), demonstrates the unique scenario where policy iteration is non-convergent when approximate value functions are employed instead of true ones. This particular example is also utilized to empirically evaluate the performance of LSPI in Lagoudakis and Parr (2003). Here, we compare the performance of our algorithm against LSPI and also against the stable Q-learning algorithm with linear function approximation (called Greedy-GQ) proposed in Maei et al. (2010). This particular demonstration is pertinent in two ways: (1) when LSPI was evaluated on this setting, the maximum state space cardinality considered was 50, whereas we consider here a larger MDP with 450 states; and (2) the stable Greedy-GQ algorithm has only been evaluated on a small experimental setting in Maei et al. (2010). Here, by applying it to a relatively harder setting, we attempt to assess its applicability and robustness.

Setup We consider a Markov decision process with \(\vert {\mathbb {S}} \vert = 450\), \({\mathbb {A}} = \{L, R\}\), \(k_1=5\), \(k_2=10\) and the discount factor \(\gamma = 0.99\).

Reward function \(R(\cdot , \cdot , 150) = R(\cdot , \cdot , 300) = 1.0\) and zero for all other transitions. This implies that only the transitions to states 150 and 300 will acquire a positive payoff, while the rest are nugatory transitions.

Transition dynamics The transition probability kernel is defined as follows:

$$\begin{aligned} \text {For }1< s < \vert {\mathbb {S}} \vert {\left\{ \begin{array}{ll} P(s, L, s+1) = 0.1,\,\, P(s, L, s-1) = 0.9,\\ P(s, R, s+1) = 0.9, \,\, P(s, R, s-1) = 0.1. \end{array}\right. } \end{aligned}$$
$$\begin{aligned}&P(1, L, 2) = 0.1, \, P(1, L, 1) = 0.9,\\&P(1, R, 2) = 0.9, \, P(1, R, 1) = 0.1,\\&P(\vert {\mathbb {S}} \vert , L, \vert {\mathbb {S}} \vert ) = 0.1, \,\, P(\vert {\mathbb {S}} \vert , L, \vert {\mathbb {S}} \vert -1) = 0.9,\\&P(\vert {\mathbb {S}} \vert , R, \vert {\mathbb {S}} \vert ) = 0.9, \,\, P(\vert {\mathbb {S}} \vert , R, \vert {\mathbb {S}} \vert -1) = 0.1, \end{aligned}$$
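The transition kernel above can be sketched in code as follows; the construction is only an illustration of the dynamics described in the text, with states indexed from 0.

```python
import numpy as np

# Build the chain-walk transition kernel: from each state, action L moves left with
# probability 0.9 and right with 0.1 (and vice versa for R); boundary states self-loop.
def chain_walk_kernel(nS=450):
    P = {a: np.zeros((nS, nS)) for a in ('L', 'R')}
    for s in range(nS):
        left = max(s - 1, 0)          # state 1 "moves left" into itself
        right = min(s + 1, nS - 1)    # state |S| "moves right" into itself
        P['L'][s, left] += 0.9; P['L'][s, right] += 0.1
        P['R'][s, right] += 0.9; P['R'][s, left] += 0.1
    return P
```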
Fig. 6 Chain walk MDP

Feature set We employ radial basis functions (RBF) as both policy and prediction features. We utilize 5 RBFs for prediction and 10 for policy features, i.e., \(k_1 = 5\) and \(k_2 = 10\). Note that RBFs are Gaussian kernels which are parametrized by the centroid \(m \in \mathrm{I\!R}\) and spread \(v \in \mathrm{I\!R}_{+}\) and are expressed as:

$$\begin{aligned} b(s) = e^{-\frac{(s-m)^{2}}{2.0v^{2}}}. \end{aligned}$$
(61)

In our experiments, we initially tried to employ polynomials as features and found that the approximations they produced were quite poor. However, with RBFs one can indeed obtain decent performance by uniformly distributing the centroids in the state or state-action space and by taking the spread to be half of the distance between subsequent centroids. In this way, one can cover the respective spaces reasonably well. The policy features and the prediction features are defined as follows:

Policy features

\(\psi (s,a) = \begin{pmatrix} I_{\{a = L\}}e^{-\frac{(s-m_1)^{2}}{2.0v_{1}^{2}}}\\ \vdots \\ I_{\{a = L\}}e^{-\frac{(s-m_5)^{2}}{2.0v_5^{2}}}\\ I_{\{a = R\}}e^{-\frac{(s-m_1)^{2}}{2.0v_1^{2}}}\\ \vdots \\ I_{\{a = R\}}e^{-\frac{(s-m_5)^{2}}{2.0v_5^{2}}} \end{pmatrix}.\)

Prediction features

\(\phi _{i}(s) = e^{-\frac{(s-m_{i})^{2}}{2.0v_{i}^{2}}},\)

where \(m_i = 5+10(i-1), v_i = 5\), \(1 \le i \le 5\).
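A sketch of these feature maps, using the centroid and spread values stated above, is given below; the function names are illustrative.

```python
import numpy as np

# RBF features for the chain-walk experiment: 5 Gaussian kernels with centroids
# m_i = 5 + 10(i-1) and spread v_i = 5; the policy features replicate them per action.
centroids = np.array([5.0 + 10.0 * (i - 1) for i in range(1, 6)])
spread = 5.0

def prediction_features(s):                     # phi(s) in R^5
    return np.exp(-(s - centroids) ** 2 / (2.0 * spread ** 2))

def policy_features(s, a):                      # psi(s, a) in R^10, a in {'L', 'R'}
    rbf = prediction_features(s)
    return np.concatenate([rbf * (a == 'L'), rbf * (a == 'R')])
```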

Behaviour policy This is the most important choice, and one has to be careful while choosing the behaviour policy. For this setting, we prefer a policy which is unbiased and which uniformly covers the action space to provide sufficient exploration. Hence, by choosing \(w_{b} = (0,0,\dots ,0)^{\top }\) we obtain a uniform distribution over the action space for every state in \({\mathbb {S}}\).

Performance function Note that both LSPI and Q-learning search the policy parameter space to find an optimal or sub-optimal policy by recalibrating the parameter vector at each iteration in the direction of the improved value function. But the objective function that we consider in this paper is a more generalized version involving the performance function L and scalarization using \({\mathbb {E}}_{\nu _w}[\cdot ]\). Thus, the problem that the above algorithms attempt to solve becomes a special instance of our generalized version, and hence, to compare our algorithm against them, we consider the objective function to be the weighted Euclidean norm of the approximate value function (with the weight being the stationary distribution \(\nu _w\)). Therefore, the performance function L is defined as \(L(h_{w \vert w}) = h^{2}_{w \vert w}\) (where squaring of the vector is defined as squaring of each of its components). Note that, in our algorithm, we approximate \(h_{w \vert w}\) using the behaviour policy, and the true approximation and the stationary distribution involved are \(\varPhi x_{w \vert w_b}\) and \(\nu _{w_b}\) respectively. However, since the behaviour policy chosen is the uniform distribution over the action space for each state in \({\mathbb {S}}\), one can easily deduce that the underlying Markov chain of the behaviour policy is a uniform random walk and its stationary distribution is the uniform distribution over the state space \({\mathbb {S}}\).

The various parameter values employed and the results obtained in the experiment are provided in Table 1 and Fig. 7 respectively.

Table 1 Algorithm parameter values used in the chain walk experiment

Fig. 7 The plot of the respective optimal value functions obtained by LSPI, Q-learning and Algorithm 2 for the chain walk MDP setting. The optimal solutions of the various algorithms are obtained by averaging over 10 independent trials. For Algorithm 2, we averaged the various optimal solutions obtained for different sample trajectories generated using the same behaviour policy, but with different initial states chosen randomly. Our approach (Algorithm 2) clearly surpassed the other algorithms in terms of solution quality. The random choice of the initial state effectively favoured sufficient exploration of the state space, which directly assisted in generating high quality solutions

4.2 Experiment 2: linearized cart-pole balancing (Dann et al. 2014)

Setup A pole with mass m and length l is connected to a cart of mass M. It can rotate in the interval \([-\pi , \pi ]\), with negative angles representing rotation in the counterclockwise direction. The cart is free to move in either direction within the bounds of a linear track, and its distance from the origin lies in the interval \([-4.0, 4.0]\), with negative distance representing movement to the left of the origin. In our experiment, we have \(m = 0.5\), \(M = 0.5\), \(l = 20.5\) and the discount factor \(\gamma = 0.1\).

Goal To bring the cart to the equilibrium position, i.e., to balance the pole upright and the cart at the centre of the track.

State space The state is the 4-tuple \((x, \dot{x}, \psi , \dot{\psi })^{\top }\), where \(\psi \) is the angle of the pendulum w.r.t. the vertical axis, \(\dot{\psi }\) is the angular velocity, x is the relative cart position from the centre of the track and \(\dot{x}\) is its velocity. For better tractability, we restrict \(\dot{x} \in [-5.0, 5.0]\) and \(\dot{\psi } \in [-5.0, 5.0]\).

Control (Policy) space The controller applies a horizontal force a on the cart parallel to the track. The stochastic policy used in this setting corresponds to \(\pi (a|s) = \mathcal {N}(a | \vartheta ^{\top }s, \sigma ^{2})\) (normal distribution with mean \({\vartheta }^{\top }s\) and standard deviation \(\sigma \)). Here the policy is parametrized by \(\vartheta \in \mathrm{I\!R}^{4}\) and \(\sigma \in \mathrm{I\!R}\).

System dynamics The dynamical equations of the system are given by

$$\begin{aligned} \ddot{\psi } = \frac{-3ml\dot{\psi }^{2}\sin {\psi }\cos {\psi }+(6M+m)g\sin {\psi }-6(a-b\dot{\psi })\cos {\psi }}{4l(M+m)-3ml\cos ^{2}{\psi }}, \end{aligned}$$
(62)
$$\begin{aligned} \ddot{x} = \frac{-2ml\dot{\psi }^{2}\sin {\psi }+3mg\sin {\psi }\cos {\psi }+4a-4b\dot{\psi }}{4(M+m)-3m\cos ^{2}{\psi }}. \end{aligned}$$
(63)

By making further assumptions on the initial conditions, the system dynamics can be approximated accurately by the linear system

$$\begin{aligned} \begin{bmatrix} x_{t+1}\\ \dot{x}_{t+1}\\ \psi _{t+1}\\ \dot{\psi }_{t+1} \end{bmatrix} = \begin{bmatrix} x_{t}\\ \dot{x}_{t}\\ \psi _{t}\\ \dot{\psi }_{t} \end{bmatrix} + \Delta t \begin{bmatrix} \dot{x}_{t} \\ \frac{3mg\psi _t + 4a - 4b\dot{\psi _t}}{4M-m} \\ \dot{\psi }_{t} \\ \frac{3(M+m)\psi _t-3a+3b\dot{\psi _t}}{4Ml-ml} \end{bmatrix} + \begin{bmatrix} 0 \\ {\mathbf {z}} \\ 0 \\ 0 \end{bmatrix}, \end{aligned}$$
(64)

where b is the friction coefficient of the cart on the floor, \(g = 9.81\frac{m}{sec^{2}}\) is the gravitational constant, \(\Delta t\) is the integration time step, i.e., the time difference between two transitions and \({\mathbf {z}}\) is a standard Gaussian noise on the velocity of the cart. In our experiment, we set \(b = 0.1Newton(msec)^{-1}\) and \(\Delta t = 0.1sec\), respectively.
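The following sketch applies one step of the linearized dynamics (64), with each drift term applied to its corresponding coordinate of the state \((x, \dot{x}, \psi , \dot{\psi })\) and the Gaussian noise entering the cart velocity; it is only an illustration using the constants stated above.

```python
import numpy as np

# One step of the linearized cart-pole dynamics (64); constants as in the experiment.
m, M, l, b, g, dt = 0.5, 0.5, 20.5, 0.1, 9.81, 0.1

def cartpole_step(x, x_dot, psi, psi_dot, a, rng=np.random.default_rng(5)):
    z = rng.standard_normal()                                     # noise on the cart velocity
    x_new = x + dt * x_dot
    x_dot_new = x_dot + dt * (3 * m * g * psi + 4 * a - 4 * b * psi_dot) / (4 * M - m) + z
    psi_new = psi + dt * psi_dot
    psi_dot_new = psi_dot + dt * (3 * (M + m) * psi - 3 * a + 3 * b * psi_dot) / (4 * M * l - m * l)
    return x_new, x_dot_new, psi_new, psi_dot_new
```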

Reward function \(R(s, a) = R(\psi , \dot{\psi }, x, \dot{x}, a) = -4\psi ^2 - x^2 - 0.1a^2\). The reward function can be viewed as assigning penalty which is directly proportional to the deviation from the equilibrium state.

Prediction features \(\phi (s \in \mathrm{I\!R}^{4}) = (1, s_{1}^{2}, s_{2}^{2} \dots , s_{1}s_{2}, s_{1}s_{3}, \dots , s_{3}s_{4})^{\top } \in \mathrm{I\!R}^{11}\).

Behaviour policy \(\pi _{b}(a|s) = \mathcal {N}(a | \vartheta _{b}^{\top }s, \sigma _{b}^{2})\), where \(\vartheta _{b} = (3.684, 3.193, 4.252,\) \(3.401)^{\top }\) and \(\sigma _{b} = 5.01\). The behaviour policy is determined by roughly solving the problem using true value functions and then choosing the behaviour policy vector \(\vartheta _b\) by perturbing each component of the rough solution so obtained. The margin of perturbation is chosen randomly from the interval \([-\,5.0, 5.0]\).

Performance function The performance function L is defined as follows: we randomly select (from the intervals given in the definition of the state space) \(s_{0} = (0.235, 3.581, 2.276, 1.069)^{\top }\). Now, define

$$\begin{aligned} \begin{aligned} L(h_{w \vert w})(s) = {\left\{ \begin{array}{ll} 0.1h_{w \vert w}(s_0), \text { for } s = s_{0}\\ 0, \forall s \in {\mathbb {S}}\setminus \{s_{0}\}. \end{array}\right. } \end{aligned} \end{aligned}$$
(65)

Here \(s_{0}\) is the initial state of the cart-pole system which implies that the cart is initially stationed at a distance of 0.235 from the centre and the pendulum is at an angle of 2.276 (\(=\frac{\pi }{1.38}\)) from the vertical position. The initial velocity of the cart and the angular velocity of the pendulum are 3.581 and 1.069 respectively. The goal is to find the optimal policy (which corresponds to the parameters of the horizontal force) to bring the cart to the equilibrium position, i.e., cart at the centre of the track and the pendulum in the vertical position. The nature of the performance function L in Eq. (65) is to explicitly capture this aspect of the problem, i.e., to find the optimal policy that takes the cart from \(s_{0}\) to the equilibrium position and hence, only the cumulative cost incurred starting from \(s_{0}\) is considered. Note that \(s_{0}\) is chosen arbitrarily for the experiment and thus does not render any particular advantage to any of the algorithms.

The various parameter values employed and the results obtained in the experiment are provided in Table 2 and Fig. 8 respectively.

Table 2 Algorithm parameter values used in the experiments
Fig. 8

a The cart-pole system: the goal is to keep the pole in the upright position and the cart at the centre of the track by applying a force a either to the right or to the left. The system is parametrized by the position x of the cart, the angle of the pole \(\psi \), the velocity \(\dot{x}\) and the angular velocity \(\dot{\psi }\). b Cart-pole results. Here, for Algorithm 2, we plot \({\mathbb {E}}_{\nu _{w_b}}\left[ L\left( x_{\overline{\mu }_{j} \vert w_b}^{\top }\phi ({\mathbf {s}})\right) \right] \), where \(\overline{\mu }_j\) is the mean vector of the Polyak averaged model sequence \(\{\overline{\theta }_j\}\), i.e., \(\overline{\theta }_{j} = (\overline{\mu }_{j}, \overline{\Sigma }_{j})^{\top }\). For the other algorithms, i.e., SPSA, MRAS and fast policy search, we plot \({\mathbb {E}}_{\nu _{w_j}}\left[ L\left( x_{w_{j} \vert w_j}^{\top }\phi ({\mathbf {s}})\right) \right] \), where \(\{w_j \in {\mathbb {W}}\}\) is the iterative sequence generated by the respective algorithm. Thus Algorithm 2 operates in the off-policy setting, while the other algorithms utilize on-policy value function approximations to generate the optimal policy vector. With this advantage, SPSA, MRAS and fast policy search are expected to perform better, as they have complete access to the generative model, unlike Algorithm 2, which has access only to the sample trajectory generated by the behaviour policy. Also, note that the x-axis is time in seconds relative to the start of the algorithm, since MRAS and fast policy search are batch-based approaches, while Algorithm 2 and SPSA are incremental schemes. Regarding the accuracy of the solution obtained by our algorithm, note that the global optimum is indeed zero, since the reward function is defined as the negative penalty with respect to the deviation from the equilibrium position and the goal is to bring the cart to the equilibrium position
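As a side note, the Polyak averaged sequence \(\{\overline{\theta }_j\}\) used in the plot above can be maintained incrementally. A minimal sketch, assuming the model parameters are stored as flat NumPy arrays:

```python
import numpy as np

def polyak_average(theta_bar, theta_j, j):
    """Running average after the j-th iterate (j >= 1):
    theta_bar_j = theta_bar_{j-1} + (theta_j - theta_bar_{j-1}) / j."""
    return np.asarray(theta_bar) + (np.asarray(theta_j) - np.asarray(theta_bar)) / j
```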

4.3 Experiment 3: 5-link actuated pendulum balancing (Dann et al. 2014)

Setup The system consists of 5 poles, each with mass m and length l, connected by 5 rotational joints, with the top pole being a pendulum. In our experiment, we take \(m = 1.5\), \(l = 10.0\) and the discount factor \(\gamma = 0.1\).

Goal To keep all the poles in the horizontal position by applying independent torques at each joint.

State space The state \(s = (q, \dot{q})^{\top } \in \mathrm{I\!R}^{10}\), where \(q = (\psi _{1}, \psi _{2}, \psi _{3}, \psi _{4}, \psi _{5}) \in \mathrm{I\!R}^{5}\) and \(\dot{q} = (\dot{\psi }_{1}, \dot{\psi }_{2}, \dot{\psi }_{3}, \dot{\psi }_{4}, \dot{\psi }_{5}) \in \mathrm{I\!R}^{5}\), with \(\psi _{i}\) being the angle of pole i w.r.t. the horizontal axis and \(\dot{\psi }_{i}\) its angular velocity. In our experiment, we consider the following bounds on the state space: \(\psi _i \in [-\pi , \pi ]\) and \(\dot{\psi }_i \in [-5.0, 5.0]\), \(\forall 1 \le i \le 5\).

Control space The action \(a = (a_{1}, a_{2}, \dots , a_{5})^{\top } \in \mathrm{I\!R}^{5}\) where \(a_{i}\) is the torque applied to the joint i. The stochastic policy used in this setting corresponds to

$$\begin{aligned} \pi (a|s) = \mathcal {N}_{5}(a | A s, B)\quad \text {where}\; A \in \mathrm{I\!R}^{5 \times 10}, B \in \mathrm{I\!R}^{5 \times 5}. \end{aligned}$$
(66)

We assume that the torques \(a_i\) applied at each joint are independent and hence B is a diagonal matrix. The policy parameter space \({\mathbb {W}}\) is defined as \({\mathbb {W}} = \{w \in \mathrm{I\!R}^{55} \vert w = (A_{00}, A_{01}, A_{02}, \dots , A_{48}, A_{49}, B_{00}, B_{11}, \dots , B_{44})^{\top }\}\).
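A minimal sketch of sampling torques from this Gaussian policy, assuming w is laid out as described above (the first 50 entries form A row-wise and the last 5 form the diagonal of B):

```python
import numpy as np

def unpack_policy(w):
    """Split w in R^55 into A (5 x 10, row-wise) and the diagonal of B (5,)."""
    w = np.asarray(w, dtype=float)
    return w[:50].reshape(5, 10), w[50:]

def sample_torques(w, s, rng=None):
    """Draw a ~ N_5(A s, B) with diagonal covariance B, Eq. (66)."""
    rng = np.random.default_rng() if rng is None else rng
    A, b_diag = unpack_policy(w)
    return rng.normal(loc=A @ np.asarray(s, dtype=float), scale=np.sqrt(b_diag))
```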

System dynamics The state equations representing the approximate linear system dynamics are given by

$$\begin{aligned} \begin{bmatrix} q_{t+1}\\ \dot{q}_{t+1} \end{bmatrix} = \begin{bmatrix} I&\Delta t I\\ -\Delta t M^{-1}U&I \end{bmatrix}\begin{bmatrix}q_{t}\\ \dot{q}_{t}\end{bmatrix} + \Delta t \begin{bmatrix} 0 \\ M^{-1} \end{bmatrix}a + {\mathbf {z}} \end{aligned}$$
(67)

where \(\Delta t\) is the integration time step, i.e., the time difference between two transitions, and M is the mass matrix in the horizontal position with \(M_{ij} = l^{2}(6-\max (i,j))m\). U is a diagonal matrix with \(U_{ii} = -gl(6-i)m\), where g is the gravitational constant. Each component of \({\mathbf {z}}\) is standard Gaussian noise. In our experiment, we take \(\Delta t = 0.1\) and \(g=9.8\).
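The matrices M and U and one transition of Eq. (67) can be assembled as in the following sketch (poles indexed \(1 \le i \le 5\)):

```python
import numpy as np

def link5_step(q, q_dot, a, m=1.5, l=10.0, g=9.8, dt=0.1, rng=None):
    """One transition of the linearized 5-link dynamics, Eq. (67)."""
    rng = np.random.default_rng() if rng is None else rng
    idx = np.arange(1, 6)
    M = l ** 2 * (6 - np.maximum.outer(idx, idx)) * m   # M_ij = l^2 (6 - max(i, j)) m
    U = np.diag(-g * l * (6 - idx) * m)                 # U_ii = -g l (6 - i) m
    M_inv = np.linalg.inv(M)
    z = rng.standard_normal(10)                         # standard Gaussian noise
    q_next = q + dt * q_dot + z[:5]
    q_dot_next = q_dot - dt * (M_inv @ U @ q) + dt * (M_inv @ a) + z[5:]
    return q_next, q_dot_next
```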

Reward function \(R(q, \dot{q}, a) = -q^{\top }q\). The reward function can be viewed as assigning a penalty (negative reward) proportional to the squared deviation of the joint angles from the horizontal position; the horizontal position itself incurs no penalty and hence the highest reward.

Fig. 9

a Each rotational joint i, \(1 \le i \le 3\), is independently actuated by a torque \(a_{i}\). The system is parametrized by the angle \(\psi _{i}\) against the horizontal direction and the angular velocity \(\dot{\psi }_{i}\). The goal is to balance the poles in the horizontal direction, i.e., all \(\psi _{i}\) should be as close to 0 as possible, by actuating Gaussian torques \(a_i\) [Eq. (66)]. b Here, for Algorithm 2, we plot \({\mathbb {E}}_{\nu _{w_b}}\left[ L\left( x_{\overline{\mu }_{j} \vert w_b}^{\top }\phi ({\mathbf {s}})\right) \right] \), where \(\overline{\mu }_j\) is the mean vector of the Polyak averaged model sequence \(\{\overline{\theta }_j\}\), i.e., \(\overline{\theta }_{j} = (\overline{\mu }_{j}, \overline{\Sigma }_{j})^{\top }\). For the other algorithms, i.e., SPSA, MRAS and fast policy search, we plot \({\mathbb {E}}_{\nu _{w_j}}\left[ L\left( x_{w_{j} \vert w_j}^{\top }\phi ({\mathbf {s}})\right) \right] \), where \(\{w_j \in {\mathbb {W}}\}\) is the iterative sequence generated by the respective algorithm. Thus Algorithm 2 operates in the off-policy setting, while the other algorithms utilize on-policy value function approximations to generate the optimal policy vector. With this advantage, MRAS, SPSA and fast policy search are expected to perform better, as they have unrestricted access to the generative model, unlike Algorithm 2, which has access only to a sample trajectory generated by the behaviour policy. Also, note that the x-axis is time in seconds relative to the start of the algorithm, since MRAS and fast policy search are batch-based approaches, while Algorithm 2 and SPSA are incremental schemes. Again, regarding the accuracy of the solution obtained by our algorithm, note that the global optimum is indeed zero, since the reward function is defined as the negative penalty with respect to the deviation from the equilibrium position and the goal is to bring the system to the equilibrium position. a 3-link actuated pendulum setting. b 5-link actuated pendulum results

Feature vectors \(\phi (s) = (1, s_{1}^{2}, s_{2}^{2}, \dots , s_{1}s_{2}, s_{1}s_{3}, \dots , s_{9}s_{10})^{\top } \in \mathrm{I\!R}^{46}\), where \(s \in \mathrm{I\!R}^{10}\).

Behaviour policy The behaviour policy considered in the experiment is given by \(\pi _{b}(a | s) = \mathcal {N}_{5}(a | A_{b}s, B_{b})\), where

$$\begin{aligned} {A^{\top }_b = \begin{pmatrix} 5.794 &{} 2.000 &{} 6.230 &{} 4.500 &{} 6.145 \\ 4.843 &{} 5.014 &{} 2.306 &{} 2.796 &{} 7.000 \\ 6.031 &{} 6.500 &{} 6.600 &{} 8.379 &{} 4.252 \\ 6.640 &{} 3.424 &{} 5.937 &{} 5.045 &{} 3.617 \\ 8.661 &{} 3.463 &{} 4.430 &{} 3.000 &{} 4.233 \\ 5.660 &{} 3.437 &{} 7.275 &{} 7.417 &{} 5.755 \\ 3.781 &{} 2.989 &{} 4.756 &{} 6.417 &{} 6.760 \\ 3.391 &{} 3.696 &{} 4.153 &{} 5.761 &{} 3.196 \\ 5.725 &{} 2.929 &{} 3.205 &{} 3.631 &{} 8.651 \\ 1.337 &{} 4.677 &{} 8.009 &{} 3.609 &{} 5.602 \end{pmatrix}} \text { and } {B_b = \begin{pmatrix} 5.0 &{} &{} &{} \mathbf O &{} \\ &{} 5.0 &{} &{} &{} \\ &{} &{} 5.0 &{} &{} \\ &{} &{} &{} 5.0 &{} \\ &{} \mathbf O &{} &{} &{} 5.0 \end{pmatrix}.} \end{aligned}$$

The methodology employed to induce the behaviour policy in this case is similar to that of the cart-pole setting.

Performance function The performance function L is defined as follows. We randomly select \(s_{0} = (-1.515, -2.437, -1.386, -3.041, 0.001, 4.510, 0.691, 1.450, 3.241, 3.535)^{\top }\) from the intervals described in the definition of the state space. Now define

$$\begin{aligned} L(h_{w \vert w})(s) = {\left\{ \begin{array}{ll} 0.1h_{w \vert w}(s_{0}), \text { for } s = s_{0}\\ 0, \forall s \in {\mathbb {S}} \setminus \{s_{0}\}. \end{array}\right. } \end{aligned}$$
(68)

The rationale behind this particular choice of performance function is similar to that of Experiment 2. Also, note that \(s_{0}\) is chosen arbitrarily for the experiment and thus does not give any particular advantage to any of the algorithms.

The various parameter values employed and the results obtained in the experiment are provided in Table 2 and Fig. 9 respectively.

4.4 Experiment 4: random MDP

Setup We consider a randomly generated Markov decision process with \(\vert {\mathbb {S}} \vert = 500\), \(\vert {\mathbb {A}} \vert = 30\), \(k_1=5\), \(k_2=5\) and \(\gamma = 0.8\).

Reward function The reward function R is defined as follows:

$$\begin{aligned} R(s, a, s^{\prime }) = \omega _{1}(s)\omega _{1}(s^{\prime })\left( \frac{\sin {(a)}+2.0}{(1.0+s^{\prime })^{0.25}}\right) , \quad s, s^{\prime } \in {\mathbb {S}}, a \in {\mathbb {A}}. \end{aligned}$$
(69)

Here \(\omega _{1} \in [3,5]^{\vert {\mathbb {S}} \vert }\) is initialized for the algorithm with \(\omega _{1}(s) \sim U(1,4)\).

Transition dynamics The transition probability kernel P is defined as follows:

$$\begin{aligned} P(s, a, s^{\prime }) = {n \atopwithdelims ()s^{\prime }}\omega _{2}(s, a)^{s^{\prime }}(1.0-\omega _{2}(s, a))^{n - s^{\prime }}, s, s^{\prime } \in {\mathbb {S}}, a \in {\mathbb {A}}. \end{aligned}$$
(70)

Here the matrix \(\omega _{2} \in [0,1]^{{\mathbb {S}} \times {\mathbb {A}}}\) is initialized for the algorithm with \(\omega _{2}(s, a) \sim U(0,1)\).
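The following sketch samples rewards and transitions according to Eqs. (69) and (70). Note that the binomial parameter n is not restated here, so the sketch assumes \(n = \vert {\mathbb {S}} \vert - 1\), which keeps \(s^{\prime }\) inside \(\{0, \dots , \vert {\mathbb {S}} \vert - 1\}\); this is our assumption, not a statement from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 500, 30
omega1 = rng.uniform(1.0, 4.0, size=n_states)               # omega_1(s) ~ U(1, 4)
omega2 = rng.uniform(0.0, 1.0, size=(n_states, n_actions))  # omega_2(s, a) ~ U(0, 1)

def reward(s, a, s_next):
    """R(s, a, s') as in Eq. (69)."""
    return omega1[s] * omega1[s_next] * (np.sin(a) + 2.0) / (1.0 + s_next) ** 0.25

def sample_next_state(s, a):
    """Sample s' from the binomial kernel in Eq. (70), assuming n = |S| - 1."""
    return rng.binomial(n_states - 1, omega2[s, a])
```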

Feature set The policy features and the prediction features are as follows:

Policy features \(\psi (s, a) = B[s\vert \mathbb {A} \vert + a]\), where \(B[i]\) denotes the i-th row of the matrix \(B \in \mathrm{I\!R}^{15000 \times 5}\) obtained by stacking the \(5 \times 5\) identity matrix \(\vert {\mathbb {S}} \vert \vert {\mathbb {A}} \vert /5 = 3000\) times, i.e.,

$$\begin{aligned} B = \begin{pmatrix} 1 &{} 0 &{} 0 &{} 0 &{} 0\\ 0 &{} 1 &{} 0 &{} 0 &{} 0\\ 0 &{} 0 &{} 1 &{} 0 &{} 0\\ 0 &{} 0 &{} 0 &{} 1 &{} 0\\ 0 &{} 0 &{} 0 &{} 0 &{} 1\\ 1 &{} 0 &{} 0 &{} 0 &{} 0\\ 0 &{} 1 &{} 0 &{} 0 &{} 0\\ \vdots &{} &{} \ddots &{} &{}\vdots \\ \end{pmatrix}. \end{aligned}$$

Prediction features \(\phi _{i}(s) = e^{-\frac{(s-m_{i})^{2}}{2.0v_{i}^{2}}}\), \(1 \le i \le k_1\), where \(m_i = 5+10(i-1)\) and \(v_i = 5\).
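A sketch of both feature maps; the mod-5 row pattern of B reproduces the stacked identity blocks shown above:

```python
import numpy as np

n_states, n_actions, k1, k2 = 500, 30, 5, 5

# Policy features: B stacks the 5 x 5 identity matrix |S||A|/5 times (15000 x 5),
# so psi(s, a) is the row of B indexed by s*|A| + a.
B = np.tile(np.eye(k2), (n_states * n_actions // k2, 1))

def policy_features(s, a):
    return B[s * n_actions + a]

# Prediction features: Gaussian radial basis functions with centres
# m_i = 5 + 10 (i - 1) and widths v_i = 5, for i = 1, ..., k1.
centres = 5.0 + 10.0 * np.arange(k1)
widths = 5.0 * np.ones(k1)

def prediction_features(s):
    return np.exp(-(s - centres) ** 2 / (2.0 * widths ** 2))
```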

In this experimental setting, we employ the Gibbs “softmax” policies defined in Eq. (7).

Behaviour policy The behaviour policy vector \(w_b\) considered for the experiment is \(w_{b} = (12.774, 15.615, 20.626, 25.877, 11.945)^{\top }\).
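Reusing policy_features from the sketch above, sampling from the Gibbs policy with this \(w_b\) can be done as follows; we assume the standard form \(\pi _{w}(a \vert s) \propto e^{w^{\top }\psi (s, a)}\) for Eq. (7) (the exact form in Eq. (7) may additionally contain a temperature parameter).

```python
import numpy as np

w_b = np.array([12.774, 15.615, 20.626, 25.877, 11.945])

def gibbs_probs(s, w=w_b):
    """pi_w(a | s) proportional to exp(w^T psi(s, a)) over all actions."""
    logits = np.array([w @ policy_features(s, a) for a in range(n_actions)])
    logits -= logits.max()                     # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def sample_action(s, rng=None, w=w_b):
    rng = np.random.default_rng() if rng is None else rng
    return rng.choice(n_actions, p=gibbs_probs(s, w))
```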

Performance function The performance function L is defined as follows:

\(L(h_{w \vert w}) = 0.1 h^{2}_{w \vert w}\) (note that the squaring here is coordinate-wise).

Table 3 Algorithm parameter values used in the random MDP experiment
Fig. 10

Plot of the results obtained in the random MDP experiment. Here also, the x-axis is time in seconds relative to the start of the algorithm

As with the previous two experiments, Algorithm 2 was run in the off-policy setting, while SPSA, MRAS and fast policy search were run in the on-policy setting.

The various parameter values employed and the results obtained in the experiment are provided in Table 3 and Fig. 10 respectively.

4.5 Exegesis of the experiments

In this section, we summarize the inferences drawn from the above experiments:

(1) The proposed algorithm performed better than the state-of-the-art methods without compromising on the rate of convergence. The choice of the underlying behaviour policy indeed influenced this improved performance; to obtain high quality solutions, the choice of the behaviour policy is pivotal. In Experiment 1, we considered a uniform policy, where every action is equally likely to be chosen in each state of \({\mathbb {S}}\). The results obtained in that experiment are quite promising, since, using only a uniform behaviour policy, we were able to obtain superior quality solutions. These results require some justification, considering the fact that LSPI is shown to produce the optimal policy given a generative model. Note that in the original LSPI paper, the LSPI method utilizes a sample trajectory provided in the form of tuples \(\{(s_i, a_i, r_i, s_i^{\prime })\}_{i \in {\mathbb {N}}}\), where \(s_i\) and \(a_i\) are drawn uniformly at random from \({\mathbb {S}}\) and \({\mathbb {A}}\) respectively, while \(s_{i}^{\prime }\) is the state transitioned to from \(s_i\) under action \(a_i\) according to the underlying transition dynamics of the MDP, and \(r_i\) is the immediate reward for that transition. One can immediately see that the information required to generate such a trajectory is equivalent to that of maintaining a generative model.


Further, in Lagoudakis and Parr (2003), where LSPI is empirically evaluated, a trajectory length of 5000 is used in the 20-state chain walk to obtain optimal performance. However, in our experiment (Experiment 1) with 450 states, we only allow a trajectory length of 5000 for LSPI and hence obtain sub-optimal performance. One should also consider the fact that the behaviour policy utilized by our algorithm in the same experiment is uniform (no prior information about the MDP is used) and the trajectory length is only half of that of LSPI. Regarding the performance of Q-learning, we know [from Theorem 1 of Maei et al. (2010)] that the method can only provide sub-optimal solutions.

In Experiments 2, 3 and 4, we chose the behaviour policy based on a reasonable amount of knowledge of the MDP. To make the comparison unbiased (since our algorithm utilized prior information about the MDP to induce the behaviour policy), the algorithms against which our method is compared (MRAS, fast policy search and SPSA) employ the more accurate on-policy approximation, which requires the generative model. This is contrary to our method, where off-policy approximation is used. Our algorithm performed as well as the state-of-the-art methods in the cart-pole experiment and noticeably best in the actuated pendulum experiment. This is despite the fact that our algorithm is primarily designed for the discrete, finite MDP setting, while the cart-pole experiment and the actuated pendulum experiment are MDPs with continuous state and action spaces. The sub-optimal performance of fast policy search and MRAS is primarily attributed to the insufficient sample size. The computing machine we used for the experiments is a 64-bit Intel i3 processor with 4GB of memory, and because of these limited resources, there is a finite limit to which the sample size can be scaled. This illustrates the effectiveness of our approach in a resource-restricted setting. Regarding the random MDP experiment, the performance of our algorithm is on par with (in fact superior to) the state-of-the-art schemes.

(2) The significance of these results is further strengthened by the fact that all the baseline algorithms considered in the experiments have access to the generative model, and the outcome depicted above is obtained after processing a large number of sample trajectories. This is contrary to our method, where no such privilege is conferred.

Fig. 11

The schematic diagram of the optimal policy generated by Algorithm 2 for the chain walk MDP with \(\vert {\mathbb {S}} \vert = 60\), \({\mathbb {A}} = \{L, R\}\) and the discount factor \(\gamma = 0.01\)

Fig. 12

The schematic diagram of the optimal policy generated by Algorithm 2 for the chain walk MDP with \(\vert {\mathbb {S}} \vert = 60\), \({\mathbb {A}} = \{L, R\}\) and the discount factor \(\gamma = 0.99\).

(3) The algorithm does not seem to depend heavily on the discount factor \(\gamma \). To corroborate this claim, we show the performance of the algorithm for two extreme values of \(\gamma \), i.e., \(\gamma \in \{0.01, 0.99\}\), on the chain walk MDP with 60 states. Here, only the transitions to states 20 and 40 incur a positive cost, while the rest are null transitions. The optimal policies generated by our algorithm in the two cases are shown in Figs. 11 and 12, respectively. As one can observe, for \(\gamma = 0.99\), the window around state 20 is wider than that for \(\gamma = 0.01\). This is the expected behaviour, since the discount factor controls the relative weights of future transitions while evaluating the discounted value function. However, note that this is not the case with regard to state 40. This lack of accuracy in the final third of the state space is primarily due to the fact that the behaviour policy we consider in this setting has its stationary distribution heavily concentrated on the first half of the state space. This scenario thus also illustrates the dependency of the accuracy of the solution generated by our algorithm on the behaviour policy, which is indeed revealed in Theorem 3. To exemplify it further, we show below how the relative frequency of the states in the trajectory generated using the behaviour policy determines the accuracy of the solution of our algorithm. Recall that the relative frequency of the states in the sample trajectory is determined by the stationary distribution of the Markov chain induced by the behaviour policy. The results are shown in Figs. 13 and 14.
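The frequency ratios reported in Figs. 13 and 14 can be computed from a sample trajectory as in the following sketch:

```python
import numpy as np

def frequency_ratio(trajectory, n_states):
    """Fraction of visits to each state in a sample trajectory; for an ergodic
    chain this converges to the stationary distribution of the induced chain."""
    counts = np.bincount(np.asarray(trajectory, dtype=int), minlength=n_states)
    return counts / counts.sum()
```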

Fig. 13

a Frequency ratio of the states in the sample trajectory. b Optimal value function generated by Algorithm 2. The frequency ratio of a particular state in the sample trajectory is defined as the ratio of the number of occurrences of that state in the sample trajectory to the total number of state transitions in the trajectory. For an ergodic Markov chain, this ratio eventually converges to its stationary distribution. In this particular example, observe that the accuracy of the value function is better for states whose relative frequency is higher

Fig. 14

a Frequency ratio of the states in the sample trajectory. b Optimal value function generated by Algorithm 2. In this setting, the relative frequency is higher on the right half of the state space and the value function is correspondingly more accurate in that region

(4) Finally, in the experiments, we found that the parameter requiring the most tuning is \(\beta _j\), which is also intuitive since \(\beta _{j}\) controls most of the stochastic recursions. The other parameters required minimal tuning, with almost all of them taking common values.

4.6 Data efficiency

Here, we compare the efficiency of our algorithm against the state-of-the-art algorithms. To measure efficiency, we consider two benchmarks: the system configuration count and the memory usage. The system configuration count denotes the number of times the algorithm queries the generative model of the MDP with a policy to obtain sample trajectories. Memory usage denotes the average real-time memory consumed by the algorithm. The results are shown in Fig. 15. The performance of our algorithm with regard to both benchmarks is commendable.

Fig. 15

Efficiency comparison of Algorithm 2 w.r.t. the state-of-the-art methods.

We also compare the average memory usage of the fast policy search algorithm and our algorithm with respect to \(k_2\), the dimension of the policy space. The results are shown in Fig. 16. The illustration shows that the memory usage of our algorithm remains almost constant, whereas fast policy search is very sensitive to the parameter \(k_2\).

Fig. 16

Memory usage w.r.t. \(k_2\)

This independence from the dimension of the policy space is a real pragmatic advantage, since our algorithm can be applied to very large and complex MDPs with high-dimensional policy spaces, where fast policy search and MRAS might become intractable.

Another advantage of our approach is its applicability to legacy systems. In such systems, information on the dynamics of the system, whether in digital form or on paper, might be hard to find. However, human experience gained through long-term interaction with the system is available in most cases. Utilizing this human experience to develop a generative model of the system might be hard; however, using it to find a behaviour policy that gives average performance is more plausible, and such a policy can in turn be exploited by our algorithm to find an optimal policy.

5 Conclusion

We presented an algorithm which solves the modified control problem in a model-free MDP setting. We showed its convergence to the global optimal policy relative to the choice of the behaviour policy. The algorithm is data efficient, robust, stable, and computationally and storage efficient. With an appropriately chosen behaviour policy, it is also seen to consistently outperform, or remain competitive with, the current state-of-the-art off-policy and on-policy methods.