Abstract
In this paper, we provide two new stable online algorithms for the problem of prediction in reinforcement learning, i.e., estimating the value function of a model-free Markov reward process using the linear function approximation architecture, with memory and computation costs scaling quadratically in the size of the feature set. The algorithms employ the multi-timescale stochastic approximation variant of the very popular cross entropy optimization method, which is a model-based search method for finding the global optimum of a real-valued function. A proof of convergence of the algorithms using the ODE method is provided. We supplement our theoretical results with experimental comparisons. The algorithms achieve good performance fairly consistently on many RL benchmark problems with regard to computational efficiency, accuracy and stability.
Introduction
In this paper, we follow the reinforcement learning (RL) framework as described in Sutton and Barto (1998), White (1993), Bertsekas (2013). The basic structure in this setting is the discrete-time Markov decision process (MDP), which is a 4-tuple (\({\mathbb {S}}\), \({\mathbb {A}}\), \(\mathrm {R}\), \(\mathrm {P}\)), where \({\mathbb {S}}\) denotes the set of states and \({\mathbb {A}}\) is the set of actions. We assume that the state and action spaces are finite. The function \(\mathrm {R}: {\mathbb {S}} \times {\mathbb {A}} \times {\mathbb {S}} \rightarrow \mathbb {R}\) is called the reward function, where \(\mathrm {R}(s, a, s^{\prime })\) represents the reward obtained in state s after taking action a and transitioning to \(s^{\prime }\). Without loss of generality, we assume that the reward function is bounded, i.e., \(\vert \mathrm {R}(\cdot ,\cdot ,\cdot ) \vert \le \mathrm {R}_{\mathrm {max}} < \infty \). Also, \(\mathrm {P}:{\mathbb {S}} \times {\mathbb {A}} \times {\mathbb {S}} \rightarrow [0,1]\) is the transition probability kernel, where \(\mathrm {P}(s, a, s^{\prime }) = {\mathbb {P}}(s^{\prime } \mid s, a)\) is the probability of the next state being \(s^{\prime }\) conditioned on the fact that the current state is s and the action taken is a. A stationary policy \(\pi :{\mathbb {S}} \rightarrow {\mathbb {A}}\) is a function from states to actions, where \(\pi (s)\) is the action taken whenever the system is in state s (independent of time).^{Footnote 1} A given policy \(\pi \) along with the transition kernel \(\mathrm {P}\) determines the state dynamics of the system. For a given policy \(\pi \), the system behaves as a Markov chain with transition matrix \(\mathrm {P}^{\pi }(s,s^{\prime })\) = \(\mathrm {P}(s,\pi (s),s^{\prime })\).
For a given policy \(\pi \), the system evolves at each discrete time step and this process can be captured as a coupled sequence of transitions and rewards \(\{{\mathbf {s}}_{0}, {\mathbf {r}}_{0}, {\mathbf {s}}_{1}, {\mathbf {r}}_{1}, {\mathbf {s}}_{2}, {\mathbf {r}}_{2}, \ldots \}\), where \({\mathbf {s}}_{t}\) is the random variable representing the state at time t, \({\mathbf {s}}_{t+1}\) is the state transitioned to from \({\mathbf {s}}_{t}\) and \({\mathbf {r}}_{t} = \mathrm {R}({\mathbf {s}}_t, \pi ({\mathbf {s}}_t), {\mathbf {s}}_{t+1})\) is the reward associated with the transition. In this paper, we are concerned with the problem of prediction, i.e., estimating the expected long-run \(\gamma \)-discounted cost \(V^{\pi } \in \mathbb {R}^{{\mathbb {S}}}\) (also referred to as the value function) corresponding to the given policy \(\pi \). Here, given \(s \in {\mathbb {S}}\), we let
where \(\gamma \in [0,1)\) is a constant called the discount factor and \({\mathbb {E}}[\cdot ]\) is the expectation over sample trajectories of states obtained in turn from \(\mathrm {P}^{\pi }\) when starting from the initial state s. \(V^{\pi }\) satisfies the well-known Bellman equation (Bertsekas 2013) under policy \(\pi \), given by
where \(\mathrm {R}^\pi \triangleq (\mathrm {R}^\pi (s),s\in {\mathbb {S}})^\top \) with \(\mathrm {R}^{\pi }(s)\) = \({\mathbb {E}}\left[ {\mathbf {r}}_{t} \vert {\mathbf {s}}_t = s\right] \), \(V^\pi \triangleq (V^\pi (s),s\in {\mathbb {S}})^\top \) and \(T^\pi V^\pi \triangleq ((T^\pi V^\pi )(s),\) \( s \in {\mathbb {S}})^\top \), respectively. Here \(T^\pi \) is called the Bellman operator.
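For a small MDP where \(\mathrm {P}^{\pi }\) and \(\mathrm {R}^{\pi }\) are known, the Bellman equation \(V^{\pi } = \mathrm {R}^{\pi } + \gamma \mathrm {P}^{\pi } V^{\pi }\) can be solved directly as a linear system; this is a useful baseline before sampling and function approximation enter the picture. A minimal sketch (the transition matrix and rewards are purely illustrative):

```python
import numpy as np

# Illustrative 3-state Markov reward process induced by a fixed policy pi.
P_pi = np.array([[0.5, 0.4, 0.1],
                 [0.2, 0.5, 0.3],
                 [0.1, 0.3, 0.6]])   # P^pi(s, s')
R_pi = np.array([1.0, 0.5, 2.0])     # R^pi(s) = E[r_t | s_t = s]
gamma = 0.9

# Bellman equation: V = R^pi + gamma P^pi V  =>  (I - gamma P^pi) V = R^pi
V_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, R_pi)

# Sanity check: V_pi is a fixed point of the Bellman operator T^pi.
assert np.allclose(R_pi + gamma * P_pi @ V_pi, V_pi)
```

Since \(\gamma < 1\), the matrix \(I - \gamma \mathrm {P}^{\pi }\) is always invertible, so the solve succeeds for any stochastic \(\mathrm {P}^{\pi }\).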
Prediction problem^{Footnote 2} (Sutton 1988; Maei et al. 2009; Sutton et al. 2009): In this paper, we follow a generalized RL framework, where we assume that the model, i.e., \(\mathrm {P}\) and \(\mathrm {R}\) are inaccessible; only a sample trajectory \(\{({\mathbf {s}}_t, {\mathbf {r}}_{t}, {\mathbf {s}}^{\prime }_{t})\}_{t=0}^{\infty }\) is available where at each instant t, the state \({\mathbf {s}}_t\) of the triplet \(({\mathbf {s}}_t, {\mathbf {r}}_{t}, {\mathbf {s}}^{\prime }_{t})\) is sampled using an arbitrary distribution \(\nu \) over \({\mathbb {S}}\) called the sampling distribution, while the next state \({\mathbf {s}}^{\prime }_{t}\) is drawn using \(\mathrm {P}^{\pi }({\mathbf {s}}_t, \cdot )\) following the underlying Markov dynamics and \({\mathbf {r}}_{t}\) is the immediate reward for the transition, i.e., \({\mathbf {r}}_{t} = \mathrm {R}({\mathbf {s}}_{t}, \pi ({\mathbf {s}}_{t}), {\mathbf {s}}^{\prime }_{t})\). We assume that \(\nu (s) > 0\), \(\forall s \in {\mathbb {S}}\). The goal of the prediction problem is to estimate the value function \(V^{\pi }\) from the given sample trajectory.
Remark 1
The framework that we consider in this paper is a generalized setting, commonly referred to as the off-policy setting.^{Footnote 3} In the literature, one often finds the on-policy setting, where the underlying Markovian system induced by the evaluation policy is assumed ergodic, i.e., aperiodic and irreducible, which directly implies the existence of a unique steady-state distribution (stationary distribution). In such cases, the sample trajectory is presumed to be a continuous rollout of a particular instantiation of the underlying transition dynamics in the form of \(\{{\mathbf {s}}_0, {\mathbf {r}}_1, {\mathbf {s}}_1, {\mathbf {r}}_2, \ldots \}\), where \({\mathbf {s}}_0\) is chosen arbitrarily. Since the system is ergodic, the distribution of the states in the sample trajectory will eventually follow the steady-state distribution. Hence, the on-policy setting becomes a special case of the off-policy setting, where the sampling distribution is nothing but the stationary distribution and \({\mathbf {s}}_{t+1} = {\mathbf {s}}^{\prime }_{t}\), \(\forall t \in {\mathbb {N}}\).
Unfortunately, the number of states \(\vert {\mathbb {S}} \vert \) may be large in many practical applications (Kaelbling et al. 1996; Doya 2000), for example, elevator dispatching (Crites and Barto 1996), robotics (Kober et al. 2013) and board games such as Backgammon (\(10^{20}\) states, Tesauro 1995) and computer Go (\(10^{170}\) states, Silver et al. 2007). The impending combinatorial blow-ups exemplify the underlying problem with value function estimation, commonly referred to as the curse of dimensionality. In this case, the value function is unrealizable due to both storage and computational limitations. Evidently, one has to resort to approximate solution methods, where we sacrifice precision for computational tractability. A common approach in this context is the function approximation method (Sutton and Barto 1998), where we approximate the value function of unobserved states using the knowledge of the observed states and their transitions.
In the linear function approximation technique, a linear architecture consisting of a set of k feature vectors (each \(\vert {\mathbb {S}} \vert \)-dimensional) \(\{\phi _{i} \in \mathbb {R}^{\vert {\mathbb {S}} \vert }, 1 \le i \le k\}\), where \(1 \le k \ll \vert {\mathbb {S}} \vert \), is chosen a priori. For a state \(s \in {\mathbb {S}}\), we define
where the vector \(\phi (s)\) is called the feature vector corresponding to the state \(s \in {\mathbb {S}}\), while the matrix \({\varPhi }\) is called the feature matrix.
Primarily, the task in linear function approximation is to find a weight vector \(z \in \mathbb {R}^{k}\) such that the predicted value function \({\varPhi } z \approx V^{\pi }\). Given \({\varPhi }\), the best approximation of \(V^{\pi }\) is its projection on to the closed subspace \(\mathbb {H}^{{\varPhi }} = \{{\varPhi } z \mid z \in \mathbb {R}^{k}\}\) (the column space of \({\varPhi }\)) with respect to some norm on \(\mathbb {R}^{\vert {\mathbb {S}} \vert }\). Typically, one uses the weighted seminorm \(\Vert \cdot \Vert _{\nu }\) on \(\mathbb {R}^{\vert {\mathbb {S}} \vert }\), where \(\nu (\cdot )\) is the sampling probability distribution with which the states \({\mathbf {s}}_{t}\) occur in the sample trajectory. It is assumed that \(\nu (s) > 0, \forall s \in \mathbb {S}\). The seminorm \(\Vert \cdot \Vert _{\nu }\) on \(\mathbb {R}^{\vert {\mathbb {S}} \vert }\) is defined as \(\Vert V \Vert _{\nu }^{2} = \sum _{s \in \mathbb {S}}V(s)^{2}\nu (s)\). The associated linear projection operator \({\varPi }^{\nu }\) is defined as \({\varPi }^{\nu } V^{\pi } = {{\mathrm{arg\,min}}}_{h \in \mathbb {H}^{{\varPhi }}} \Vert V^{\pi } - h \Vert _{\nu }^{2}\). It is not hard to derive the following closed-form expression for \({\varPi }^{\nu }\): \({\varPi }^{\nu } = {\varPhi }({\varPhi }^{\top }D^{\nu }{\varPhi })^{-1}{\varPhi }^{\top }D^{\nu }\),
where \(D^{\nu }\) is the diagonal matrix with \(D^{\nu }_{ii} = \nu (s_i), i=1,\dots ,\vert \mathbb {S} \vert \). On a technical note, observe that the projection is obtained by minimizing the squared \(\nu \)weighted distance from the true value function \(V^{\pi }\) and this distance is referred to as the mean squared error (MSE), i.e.,
However, it is hard to evaluate or even estimate \({\varPi }^{\nu }\) since it requires complete knowledge of the sampling distribution \(\nu \) and also requires \(\vert {\mathbb {S}} \vert \) amount of memory for storing \(D^{\nu }\). Therefore, one has to resort to additional approximation techniques to estimate the projection \({\varPi }^{\nu } V^{\pi }\) which is indeed the prime objective of this paper.
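When \(\nu \) and \({\varPhi }\) are small enough to hold explicitly, the projection can be computed exactly from the closed form \({\varPi }^{\nu } = {\varPhi }({\varPhi }^{\top }D^{\nu }{\varPhi })^{-1}{\varPhi }^{\top }D^{\nu }\) (valid when \({\varPhi }\) has full column rank). A sketch with an illustrative feature matrix and sampling distribution:

```python
import numpy as np

# Illustrative setup: |S| = 5 states, k = 2 features.
rng = np.random.default_rng(0)
Phi = rng.standard_normal((5, 2))            # feature matrix (full column rank)
nu = np.array([0.1, 0.2, 0.3, 0.25, 0.15])   # sampling distribution, nu(s) > 0
D = np.diag(nu)

# Closed-form nu-weighted projection onto the column space of Phi:
# Pi = Phi (Phi^T D Phi)^{-1} Phi^T D
Pi = Phi @ np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D)

V = rng.standard_normal(5)                   # a stand-in for V^pi
V_proj = Pi @ V                              # Pi^nu V

# Projections are idempotent: applying Pi twice changes nothing.
assert np.allclose(Pi @ Pi, Pi)
```

By construction, `V_proj` attains the smallest \(\nu \)-weighted squared distance (MSE) to `V` among all vectors of the form \({\varPhi } z\), which is exactly the optimality property defining \({\varPi }^{\nu }\).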
Goal of this paper: To find a vector \(z^{*} \in \mathbb {R}^{k}\) such that \({\varPhi } z^{*} \approx {\varPi }^{\nu }V^{\pi }\) without knowing \({\varPi }^{\nu }\) or trying to estimate the same.
A caveat is in order. It is important to note that the efficacy of the learning method depends on the choice of the feature set \(\{\phi _{i}\}\) (Lagoudakis and Parr 2003). One can either utilize prior knowledge about the system to develop hard-coded features or employ off-the-shelf basis functions^{Footnote 4} from the literature. In this paper, we assume that a carefully chosen set of features is available a priori.
Related work
The existing algorithms can be broadly classified as

1.
Linear methods, which include the temporal difference method (TD(\(\lambda \)), \(\lambda \in [0, 1]\); Sutton 1988; Tsitsiklis and Roy 1997), gradient temporal difference methods (GTD, GTD2 and TDC; Sutton et al. 2009) and residual gradient (RG) schemes (Baird 1995), whose computational complexities are linear in k and hence are good for large values of k, and

2.
Second-order methods, which include least squares temporal difference (LSTD) (Bradtke and Barto 1996; Boyan 2002) and least squares policy evaluation (LSPE) (Nedić and Bertsekas 2003), whose computational complexities are quadratic in k and are useful for moderate values of k. Second-order methods, albeit computationally expensive, are seen to be more data efficient than others except when the available trajectories are very short (Dann et al. 2014).
If the Markov chain is ergodic (i.e., irreducible and aperiodic) and the sampling distribution \(\nu \) is the stationary distribution of the Markov chain, then, with \({\varPhi }\) being a full column rank matrix, the convergence of TD(\(\lambda \)) is guaranteed (Tsitsiklis and Roy 1997). However, when the sampling distribution \(\nu \) is not the stationary distribution of the Markov chain or the projected subspace is a nonlinear manifold, TD(\(\lambda \)) can diverge (Tsitsiklis and Roy 1997; Baird 1995). By contrast, both the LSTD and LSPE algorithms are stable (Schoknecht 2002) and are also seen to be independent of the sampling distribution \(\nu \). However, there do not exist any extensions of LSTD and LSPE to nonlinear function approximation.
Tsitsiklis and Roy (1997) gave a different characterization for the stable limit point of TD(0) as the fixed point of the projected Bellman operator \({\varPi }^{\nu }T^{\pi }\),
where \(\nu \) is the stationary distribution of the underlying ergodic chain.
This characterization yields a new error function, the mean squared projected Bellman error (MSPBE) which is defined as follows:
The LSTD algorithm (Bradtke and Barto 1996; Boyan 2002) is a fitted value function method (least squares approach) obtained by directly solving MSPBE over the sample trajectory using sample averaging of the individual transitions. The LSPE method (Nedić and Bertsekas 2003), by contrast, solves MSPBE indirectly using a double minimization procedure, where the primary minimizer finds the projection of the Bellman operator value using the least squares approach, with the proximal Bellman operator value being obtained from the secondary gradient-based minimizer. In Sutton et al. (2009), MSPBE is adroitly manipulated to derive multiple stable \({\varTheta }(k)\) algorithms such as TDC and GTD2. A nonlinear function approximation version of the GTD2 algorithm is also available (Maei et al. 2009). The method is shown to be stable, and convergence to (possibly suboptimal) solutions is also guaranteed under reasonably realistic assumptions (Maei et al. 2009). The suboptimality of the solutions is expected, as GTD2 is a gradient-based method and the convexity of the objective function does not always hold in nonlinear function approximation settings.
Another pertinent error function is the mean squared Bellman residue (\(\mathrm {MSBR}\)) which is defined as follows:
where \(\delta _{t}(z) \triangleq {\mathbf {r}}_{t} + \gamma z^{\top }\phi ({\mathbf {s}}^{\prime }_{t}) - z^{\top }\phi ({\mathbf {s}}_{t})\) is the temporal difference error under function approximation when z is the associated approximation parameter. Note that MSBR is a measure of how closely the prediction vector represents the solution to the Bellman equation.
The residual gradient (RG) algorithm (Baird 1995) minimizes the error function MSBR directly using stochastic gradient search. Indeed, RG solves \(\nabla _{z}\text {MSBR} = 0\) \(\Rightarrow {\mathbb {E}}\Big [{\mathbb {E}}\left[ \delta _t(z) \mid {\mathbf {s}}_t\right] {\mathbb {E}}\left[ (\gamma \phi ({\mathbf {s}}^{\prime }_{t}) - \phi ({\mathbf {s}}_t)) \mid {\mathbf {s}}_t\right] \Big ] = 0\). The above expression is a product of two expectations conditioned on the current state \({\mathbf {s}}_{t}\). Hence it requires two independent samples \(\mathbf {s}^{\prime }_{t}\) and \(\mathbf {s}^{\prime \prime }_{t}\) of the next state when in the current state \({\mathbf {s}}_{t}\). This is generally referred to as double sampling. Even though the RG algorithm guarantees convergence, the convergence is slow due to the large variance of the estimates (Schoknecht and Merke 2003).
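The double-sampling requirement can be made concrete on a small chain whose dynamics are known, so that the Monte-Carlo estimate using two independent next-state draws per visit can be checked against the exact value. A sketch (the chain, rewards, features and parameter vector are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, gamma = 3, 0.9
P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.3, 0.3, 0.4]])          # P^pi (illustrative)
R = rng.standard_normal((3, 3))          # R(s, s') under pi (illustrative)
Phi = rng.standard_normal((3, 2))        # feature matrix, k = 2
z = rng.standard_normal(2)               # a fixed parameter vector

def td_error(s, s_next):
    # delta_t(z) = r_t + gamma z^T phi(s') - z^T phi(s)
    return R[s, s_next] + gamma * Phi[s_next] @ z - Phi[s] @ z

# Monte-Carlo estimate of MSBR(z) = E[ E[delta|s] E[delta|s] ] using two
# *independent* next-state samples per visit (double sampling).
est, T = 0.0, 100_000
for t in range(T):
    s = rng.integers(n_states)               # uniform sampling distribution nu
    s1 = rng.choice(n_states, p=P[s])        # first sample of s'
    s2 = rng.choice(n_states, p=P[s])        # independent second sample of s'
    est += td_error(s, s1) * td_error(s, s2)
est /= T

# Exact MSBR for comparison: E_nu[ (E[delta_t(z) | s_t])^2 ].
bar_delta = np.array([P[s] @ (R[s] + gamma * Phi @ z) - Phi[s] @ z
                      for s in range(n_states)])
exact = float(np.mean(bar_delta ** 2))
```

Using a single next-state sample (i.e. `td_error(s, s1) ** 2`) would instead estimate \({\mathbb {E}}[\delta _t(z)^2]\), which is biased upward for MSBR by the conditional variance of \(\delta _t(z)\); the two independent draws remove that bias.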
Eligibility traces (Sutton 1988) are a mechanism to accelerate learning by blending temporal difference methods with Monte Carlo simulation (averaging the values), with the values weighted using a geometric distribution with parameter \(\lambda \in [0,1]\). Eligibility traces can be integrated into most of these algorithms.^{Footnote 5} In this paper, we do not consider the treatment of eligibility traces.
Table 1 provides a list of important TD based algorithms along with the associated error objectives. The algorithm complexities and other characteristics are also shown in the table.
Put succinctly, when linear function approximation is applied in an RL setting, the main task can be cast as an optimization problem whose objective function is one of the aforementioned error functions. Typically, almost all the state-of-the-art algorithms employ a gradient search technique to solve the minimization problem. In this paper, we instead apply a gradient-free technique called the cross entropy (CE) method to find the minimum. By ‘gradient-free’, we mean that the algorithm does not incorporate information about the gradient of the objective function; rather, it uses the function values themselves. The cross entropy method as such lies within the general class of model-based search methods (Zlochin et al. 2004). Other methods in this class are model reference adaptive search (MRAS) (Hu et al. 2007), gradient-based adaptive stochastic search for simulation optimization (GASSO) (Zhou et al. 2014), ant colony optimization (ACO) (Dorigo and Gambardella 1997) and estimation of distribution algorithms (EDAs) (Mühlenbein and Paass 1996). Model-based search methods have been applied to the control problem^{Footnote 6} in Hu et al. (2008), Mannor et al. (2003), Busoniu et al. (2009) and in basis adaptation^{Footnote 7} (Menache et al. 2005), but this is the first time such a procedure has been applied to the prediction problem. However, due to the naive batch-based approach of the original CE method, it cannot be directly applied to the online RL setting. In this paper, therefore, we propose two incremental, adaptive, online algorithms which solve MSPBE and MSBR respectively by employing a stochastic approximation version of the cross entropy method proposed in Joseph and Bhatnagar (2016a, b, 2018).
Our contributions
The cross entropy (CE) method (Rubinstein and Kroese 2013; Boer et al. 2005) is a model-based search algorithm for finding the global maximum of a given real-valued objective function. In this paper, we propose, for the first time, an adaptation of this method to the problem of parameter tuning in order to find the best estimate of the value function \(V^{\pi }\) for a given policy \(\pi \) under the linear function approximation architecture. We propose two prediction algorithms using the multi-timescale stochastic approximation framework (Robbins and Monro 1951; Borkar 1997; Kushner and Clark 1978), which minimize MSPBE and MSBR respectively. The algorithms possess the following attractive features:

1.
A remodelling of the famous CE method into a model-free MDP framework using the stochastic approximation framework.

2.
Stable with minimal restrictions on both the structural properties of the underlying Markov chain and on the sample trajectory.

3.
Minimal restriction on the feature set.

4.
Computational complexity is quadratic in the number of features (this is a significant improvement compared to the cubic complexity of the least squares algorithms).

5.
Competitive with least squares and other stateoftheart algorithms in terms of accuracy.

6.
The algorithms are incremental, adaptive, streamlined and online.

7.
The algorithms provide guaranteed convergence to the global minimum of MSPBE (or MSBR).

8.
Relative ease in extending the algorithms to nonlinear function approximation settings.
A noteworthy observation is that under the linear architecture, both MSPBE and MSBR are strongly convex functions (Dann et al. 2014) and hence their local and global minima coincide. Hence, the fact that the CE method finds the global minimum as opposed to a local minimum, unlike gradient search, does not provide any tangible advantage in terms of the quality of the solution. Nonetheless, in the case of nonlinear function approximators, the convexity property does not hold in general, so there may exist multiple local minima in the objective, and gradient search schemes would get stuck in local optima, unlike CE based search. We have not explored the nonlinear case analytically in this paper. Notwithstanding, we have applied our algorithm to the nonlinear MDP setting defined in section X of Tsitsiklis and Roy (1997) and the results obtained are quite impressive. The MDP setting in Tsitsiklis and Roy (1997) is a classic example where TD(0) is shown to diverge and GTD2 is shown to produce suboptimal solutions. This demonstrates the robustness of our algorithm, which is quite appealing considering the fact that the state-of-the-art RL algorithms are specifically designed to perform in a linear environment and extending them to domains beyond the realm of linearity is quite tedious and often impossible. In view of all these alluring features, our approach can be viewed as a significant first step towards efficiently using model-based search for policy evaluation in a generalized RL environment.
Summary of notation
We use \({\mathbf {X}}\) for random variables and x for deterministic variables. Let \({\mathbb {I}}_{k \times k}\) and \(0_{k \times k}\) be the identity matrix and the zero matrix of dimension \(k \times k\) respectively. For a set A, \(I_{A}\) represents the indicator function of A, i.e., \(I_{A}(x) = 1\) if \(x \in A\) and 0 otherwise. Let \(f_{\theta }:\mathbb {R}^{n} \rightarrow \mathbb {R}\) denote the probability density function (PDF) over \(\mathbb {R}^{n}\) parametrized by \(\theta \). Let \({\mathbb {E}}_{\theta }[\cdot ]\) and \(P_{\theta }\) denote the expectation and the induced probability measure w.r.t. \(f_{\theta }\). For \(\rho \in (0,1)\) and \({\mathcal {H}}:\mathbb {R}^{n} \rightarrow \mathbb {R}\), let \(\gamma _{\rho }({\mathcal {H}}, \theta )\) denote the \((1-\rho )\)-quantile of \(\mathcal {H}(\mathbf {X})\) w.r.t. \(f_{\theta }\), i.e., \(\gamma _{\rho }({\mathcal {H}}, \theta ) \triangleq \sup \{l \in \mathbb {R} \mid P_{\theta }({\mathcal {H}}({\mathbf {X}}) \ge l) \ge \rho \}\).
Let int(A) be the interior of set A. Let \({\mathcal {N}}_{n}(m, V)\) represent the nvariate Gaussian distribution with mean vector \(m \in \mathbb {R}^{n}\) and covariance matrix \(V \in \mathbb {R}^{n \times n}\). A function \(L:\mathbb {R}^{n} \rightarrow \mathbb {R}\) is Lipschitz continuous, if \(\exists K \ge 0\) s.t. \(\vert L(x)  L(y) \vert \le K\Vert x  y \Vert \), \(\forall x, y \in \mathbb {R}^{n}\), where \(\Vert \cdot \Vert \) is some norm defined on \(\mathbb {R}^{n}\).
Background: the CE method
To better understand our algorithm, we explicate the original CE method first.
Objective of CE
The cross entropy (CE) method (Rubinstein and Kroese 2013; Boer et al. 2005) solves problems of the following form: Find \(x^{*} \in \mathop {{{\mathrm{arg\,max}}}}\limits _{x \in {\mathcal {X}} \subseteq \mathbb {R}^{m}} {\mathcal {H}}(x)\),
where \({\mathcal {H}}(\cdot )\) is a multimodal real-valued function and \({\mathcal {X}}\) is called the solution space.
The goal of the CE method is to find an optimal “model” or probability distribution over the solution space \({\mathcal {X}}\) which concentrates on the global maxima of \({\mathcal {H}}(\cdot )\). The CE method adopts an iterative procedure where at each iteration t, a search is conducted on a space of parametrized probability distributions \(\{f_{\theta } \vert \theta \in {\varTheta }\}\) over \({\mathcal {X}}\), where \({\varTheta }\) (a subset of the multidimensional Euclidean space) is the parameter space, to find a distribution parameter \(\theta _t\) which reduces the Kullback–Leibler (KL) divergence (also called the cross entropy distance) (Kullback 1959) from the optimal model. The most commonly used class here is the natural exponential family of distributions (NEF).
Natural exponential family of distributions (Morris 1982): These are denoted as \({\mathcal {C}} \triangleq \) \(\{f_{\theta }(x) = h(x)e^{\theta ^{\top }{\varGamma }(x)-K(\theta )} \mid \theta \in {\varTheta } \subseteq \mathbb {R}^d\}, \text { where }\) \(h:\mathbb {R}^{m} \longrightarrow \mathbb {R}\), \({\varGamma }:\mathbb {R}^{m} \longrightarrow \mathbb {R}^{d}\) and \(K:\mathbb {R}^{d} \longrightarrow \mathbb {R}\). By rearranging the parameters, we can show that the Gaussian distribution with mean vector \(\mu \) and covariance matrix \({\varSigma }\) belongs to \({\mathcal {C}}\). In this case,
and one may let \({\displaystyle h(x) = \frac{1}{\sqrt{(2\pi )^{m}}}}\), \({\varGamma }(x) = (x, xx^{\top })^{\top }\) and \({\displaystyle \theta = \Big ({\varSigma }^{-1} \mu ,\,-\frac{1}{2}{\varSigma }^{-1}\Big )^{\top }}\).
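This identification can be checked numerically. A minimal sketch, in which the log-normalizer \(K(\theta ) = \frac{1}{2}\mu ^{\top }{\varSigma }^{-1}\mu + \frac{1}{2}\log \vert {\varSigma } \vert \) is our assumption (it is implied by, but not spelled out in, the parametrization above):

```python
import numpy as np

rng = np.random.default_rng(4)
m = 2
mu = np.array([0.5, -1.0])
A = rng.standard_normal((m, m))
Sigma = A @ A.T + np.eye(m)              # a valid (SPD) covariance matrix
Sigma_inv = np.linalg.inv(Sigma)

def gaussian_pdf(x):
    # Standard multivariate Gaussian density N(x; mu, Sigma).
    d = x - mu
    return np.exp(-0.5 * d @ Sigma_inv @ d) / np.sqrt(
        (2 * np.pi) ** m * np.linalg.det(Sigma))

def nef_pdf(x):
    # h(x) exp(theta^T Gamma(x) - K(theta)) with Gamma(x) = (x, x x^T) and
    # theta = (Sigma^{-1} mu, -0.5 Sigma^{-1}).
    h = (2 * np.pi) ** (-m / 2)
    lin = (Sigma_inv @ mu) @ x                          # (Sigma^{-1} mu)^T x
    quad = np.sum(-0.5 * Sigma_inv * np.outer(x, x))    # <-0.5 Sigma^{-1}, x x^T>
    K = 0.5 * mu @ Sigma_inv @ mu + 0.5 * np.log(np.linalg.det(Sigma))
    return h * np.exp(lin + quad - K)

x = rng.standard_normal(m)
assert np.isclose(gaussian_pdf(x), nef_pdf(x))
```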
Assumption (A1): The parameter space \({\varTheta }\) is compact.
CE method (ideal version)
The CE method aims to find a sequence of model parameters \({\{\theta _t\}}_{t \in {\mathbb {N}}}\), where \(\theta _t \in {\varTheta }\), and an increasing sequence of thresholds \({\{\gamma _{t}\}}_{t \in {\mathbb {N}}}\), where \(\gamma _t \in \mathbb {R}\), with the property that the event \(\{{\mathcal {H}}({\mathbf {X}}) \ge \gamma _{t}\}\) is a very high probability event with respect to the probability measure induced by the model parameter \(\theta _{t}\). By assigning greater weight to higher values of \({\mathcal {H}}\) at each iteration, the expected behaviour of the probability distribution sequence should improve. The most common choice for \(\gamma _{t+1}\) is \(\gamma _{\rho }({{\mathcal {H}}}, \theta _t)\), the \((1-\rho )\)-quantile of \({\mathcal {H}}({\mathbf {X}})\) w.r.t. the probability density function \(f_{\theta _{t}}\), where \(\rho \in (0,1)\) is set a priori for the algorithm. We take the Gaussian distribution as the preferred choice for \(f_{\theta }\) in this paper. In this case, the model parameter is \(\theta = (\mu , {\varSigma })^{\top }\), where \(\mu \in \mathbb {R}^{m}\) is the mean vector and \({\varSigma } \in \mathbb {R}^{m \times m}\) is the covariance matrix.
The CE algorithm is an iterative procedure which starts with an initial value \(\theta _0 = (\mu _{0}, {\varSigma }_0)^{\top }\) of the mean vector and the covariance matrix tuple and at each iteration t, a new parameter \(\theta _{t+1} = (\mu _{t+1}, {\varSigma }_{t+1})^{\top }\) is derived from the previous value \(\theta _t\) as follows (from Section 4 of Hu et al. 2007):
where \(S:\mathbb {R}\rightarrow \mathbb {R}_{+}\) is a positive and strictly monotonically increasing function.
If the gradient w.r.t. \(\theta \) of the objective function in Eq. (11) is equated to 0, considering a Gaussian PDF for \(f_\theta \) (i.e., using the expression provided in Eq. (10) for \(f_{\theta }\)) and \(\gamma _{t+1} = \gamma _{\rho }({\mathcal {H}}, \theta _t)\), we obtain the following:
where
Remark 2
The function \(S(\cdot )\) in Eq. (11) is positive and strictly monotonically increasing and is used to account for the cases when the objective function \({\mathcal {H}}(x)\) takes negative values for some x. Note that in the expression for \(\mu _{t+1}\) in Eq. (12), x is weighted by \(S({\mathcal {H}}(x))\) in the region \(\{x \vert {\mathcal {H}}(x) \ge \gamma _{t+1}\}\). Since the function S is positive and strictly monotonically increasing, the region where \({\mathcal {H}}(x)\) is higher (and hence \(S({\mathcal {H}}(x))\) is also higher) is given more weight, so \(\mu _{t+1}\) concentrates in the region where \({\mathcal {H}}(x)\) takes higher values. In the case where \({\mathcal {H}}(\cdot )\) is positive, we can choose \(S(x) = x\). However, in general scenarios, where \({\mathcal {H}}(\cdot )\) takes both positive and negative values, the identity function is not an appropriate choice since the effect of the positive weights is reduced by the negative ones. In such cases, we take \(S(x) = \exp (rx)\), \(r \in \mathbb {R}_{+}\).
Thus the ideal CE algorithm can be expressed using the following recursion:
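In practice the recursion is instantiated as a batch Monte-Carlo procedure: sample from \(f_{\theta _t}\), set the threshold \(\gamma _{t+1}\) to the empirical \((1-\rho )\)-quantile, and re-fit the Gaussian parameters weighted by \(S({\mathcal {H}}(x))\) over the elite region. A sketch with an illustrative objective and hyper-parameters (the smoothing step is a common practical addition, not part of the ideal version above):

```python
import numpy as np

rng = np.random.default_rng(2)

def H(x):
    # Illustrative multimodal objective over R^2; global maximum near (1.5, 1.5).
    x = np.atleast_2d(x)
    return -np.sum((x - 1.5) ** 2, axis=1) + 2.0 * np.exp(-np.sum(x ** 2, axis=1))

def ce_optimize(H, dim=2, rho=0.1, n=500, iters=40, smooth=0.7):
    mu, Sigma = np.zeros(dim), 5.0 * np.eye(dim)      # theta_0 = (mu_0, Sigma_0)
    for _ in range(iters):
        X = rng.multivariate_normal(mu, Sigma, size=n)
        vals = H(X)
        gamma_t = np.quantile(vals, 1.0 - rho)        # (1 - rho)-quantile of H(X)
        # Weights S(H(x)) I{H(x) >= gamma_t}, with S(x) = exp(x) shifted for
        # numerical stability (the shift cancels in the normalized averages).
        w = np.exp(vals - vals.max()) * (vals >= gamma_t)
        mu_new = (w[:, None] * X).sum(axis=0) / w.sum()
        d = X - mu_new
        Sigma_new = (w[:, None, None] * d[:, :, None] * d[:, None, :]).sum(axis=0) / w.sum()
        mu = smooth * mu_new + (1 - smooth) * mu      # smoothed parameter updates
        Sigma = smooth * Sigma_new + (1 - smooth) * Sigma + 1e-8 * np.eye(dim)
    return mu

x_star = ce_optimize(H)
```

The mean and covariance updates mirror the weighted averages of Eq. (12): each is an expectation of \(S({\mathcal {H}}({\mathbf {X}}))\)-weighted samples restricted to the elite region, normalized by the total elite weight.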
An illustration demonstrating the evolution of the model parameters of the CE method with Gaussian distribution during the optimization of a multimodal objective function is provided in Fig. 16 of the “Appendix”.
Comparison of the objectives: MSPBE and MSBR
The choice between the two error objectives is critical, since most reinforcement learning algorithms can be characterized via some optimization problem which minimizes either MSBR or MSPBE. A comprehensive comparison of the two error functions is available in the literature (Schoknecht and Merke 2003; Schoknecht 2002; Scherrer 2010). A direct relationship between MSBR and MSPBE can be easily established as follows:
This follows directly from the Babylonian–Pythagorean theorem and the fact that \((T^{\pi }\Phi z - \Pi ^{\nu } T^{\pi }\Phi z)\) \(\perp \) \(\left( \Pi ^{\nu } T^{\pi }\Phi z - \Phi z\right) \), \(\forall z \in \mathbb {R}^{k}\). A vivid depiction of this relationship is shown in Fig. 1.
If the columns of the feature matrix \({\varPhi }\) are linearly independent, then both the error functions MSBR and MSPBE are strongly convex (Dann et al. 2014). However, the respective minima of MSBR and MSPBE are related depending on whether the feature set is perfect or not. A feature set is perfect if \(V^{\pi } \in \{{\varPhi } z \vert z \in \mathbb {R}^{k}\}\). In the perfect case, \(\exists z_{0} \in \mathbb {R}^{k}\) s.t. \({\varPhi } z_{0} = V^{\pi }\) and hence MSBR(\(z_{0}\)) = 0. Since MSBR\((z) \ge 0\), \(\forall z \in \mathbb {R}^{k}\), we have \(z_0 = {{\mathrm{arg\,min}}}_{z}{\text {MSBR}(z)}\). Now from (18), we get \(\text {MSPBE}(z_0) = 0\) and \(z_0 = {{\mathrm{arg\,min}}}_{z}{\text {MSPBE}(z)}\) (again since MSPBE\((z) \ge 0\), \(\forall z \in \mathbb {R}^{k}\)). Hence in the perfect feature set scenario, the respective minima of MSBR and MSPBE coincide. However, in the imperfect case, they might differ since MSPBE(z) \(\ne \) MSBR(z) for some \(z \in {\mathcal {Z}}\) (follows from Eq. (18)).
In Scherrer (2010), Williams and Baird (1993), a relationship between MSBR and MSE is provided as shown in (19). Recall that MSE is the error which defines the projection operator \({\varPi }^{\nu }\) in the linear function approximation setting. It is found that, for a given \(\nu \) with \(\nu (s) > 0, \forall s \in {\mathbb {S}}\),
where \(C(\nu ) = \max _{s,s^{\prime }}{\frac{\mathrm {P}^{\pi }(s, s^{\prime })}{\nu (s)}}\). This bound (albeit loose) ensures that the minimization of MSBR is indeed stable and the solution so obtained cannot be too far from the projection \({\varPi }^{\nu }V^{\pi }\). A noticeable drawback with MSBR is the statistical overhead brought about by the double sampling required for its estimation. To elaborate, recall that MSBR(z) = \({\mathbb {E}}\Big [{\mathbb {E}}\left[ \delta _t(z) \mid {\mathbf {s}}_t\right] {\mathbb {E}}\left[ \delta _t(z) \mid {\mathbf {s}}_t\right] \Big ]\) (from Eq. 8). In the above expression of MSBR, we have a product of two conditional expectations conditioned on the current state \({\mathbf {s}}_{t}\). This implies that to estimate MSBR, one requires two independent samples of the next state, given the current state \({\mathbf {s}}_{t}\). Another drawback observed in the literature is the large variance incurred while estimating MSBR (Dann et al. 2014; Scherrer 2010), which adversely affects the convergence rate of the optimization procedure. Also, in settings where only a finite length sample trajectory is available, the larger stochastic noise associated with the MSBR estimation will produce inferior quality solutions. MSPBE is attractive in the sense that double sampling is not required and there is sufficient empirical evidence (Dann et al. 2014) to believe that the minimum of MSPBE often has low MSE. The absence of double sampling is quite appealing, since for large complex MDPs obtaining sample trajectories is itself tedious, let alone double samples. MSPBE, when integrated with control algorithms, is also shown to produce better quality policies (Lagoudakis and Parr 2003). Another, less significant, advantage is the fact that MSPBE(z) \(\le \) MSBR(z), \(\forall z\) (follows from Eq. 18). This implies that the optimization algorithm can work with smaller objective function values compared to MSBR.
Now, we explore both the error functions analytically:
MSPBE
In Sutton et al. (2009), a compact expression for MSPBE is provided as follows:
where \(V_{z} = {\varPhi } z\), while \({\varPhi }\) and \(D^{\nu }\) are defined in Eqs. (3) and (4) respectively. Now the expression \({\varPhi }^{\top }D^{\nu } (T^{\pi }V_{z}V_{z})\) is further rewritten as
Putting everything together, we get
where \(\omega ^{(0)}_{*} \triangleq {\mathbb {E}}\left[ {\mathbb {E}}\left[ \phi _t {\mathbf {r}}_{t} \vert {\mathbf {s}}_t\right] \right] \), \(\omega ^{(1)}_{*} \triangleq {\mathbb {E}}\left[ {\mathbb {E}}\left[ \phi _t(\gamma \phi ^{\prime }_{t} - \phi _t)^{\top } \vert {\mathbf {s}}_t\right] \right] \) and \(\omega ^{(2)}_{*} \triangleq ({\mathbb {E}}\left[ \phi _t \phi _{t}^{\top }\right] )^{-1}\).
This is a quadratic function in z. Note that in the above expression, the parameter vector z and the stochastic component involving \({\mathbb {E}}[\cdot ]\) are decoupled. Hence the stochastic component can be estimated or tracked independently of the parameter vector z.
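Concretely, since \({\mathbb {E}}[\delta _t(z)\phi _t] = \omega ^{(0)}_{*} + \omega ^{(1)}_{*} z\), the compact form of Sutton et al. (2009) yields the quadratic \((\omega ^{(0)}_{*} + \omega ^{(1)}_{*} z)^{\top }\omega ^{(2)}_{*}(\omega ^{(0)}_{*} + \omega ^{(1)}_{*} z)\). A minimal sketch of evaluating this quadratic, with purely hypothetical numerical values for the decoupled expectations:

```python
import numpy as np

def mspbe(z, w0, w1, w2):
    # Quadratic form (w0 + w1 z)^T w2 (w0 + w1 z), where w0, w1 stand in for
    # omega^(0)_*, omega^(1)_* and w2 for omega^(2)_* = (E[phi phi^T])^{-1}.
    v = w0 + w1 @ z
    return float(v @ w2 @ v)

# Hypothetical values of the decoupled expectations (k = 2):
w0 = np.array([0.25, 0.75])
w1 = np.array([[-0.55, 0.45], [0.45, -0.55]])
w2 = 2.0 * np.eye(2)

value_at_origin = mspbe(np.zeros(2), w0, w1, w2)   # = w0^T w2 w0 = 1.25
z_star = np.linalg.solve(w1, -w0)                  # zero of w0 + w1 z (w1 invertible here)
value_at_min = mspbe(z_star, w0, w1, w2)           # = 0 at this (hypothetical) minimizer
```

Because z appears only through the affine term \(\omega ^{(0)} + \omega ^{(1)}z\), the same stochastic estimates of \(\omega _{*}\) can be reused to evaluate the objective at every candidate z the optimizer proposes.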
MSBR
We apply a similar decoupling procedure to the MSBR function. Indeed, from Eq. (8), we have
Therefore,
where \(\upsilon ^{(0)}_{*} \triangleq {\mathbb {E}}\Big [{\mathbb {E}}^{2}\left[ {\mathbf {r}}_{t}\big \vert {\mathbf {s}}_{t}\right] \Big ]\), \(\upsilon ^{(1)}_{*} \triangleq \gamma ^{2}{\mathbb {E}}\Big [{\mathbb {E}}[\phi ^{\prime }_{t}\big \vert {\mathbf {s}}_{t}]{\mathbb {E}}\left[ \phi ^{\prime }_{t}\big \vert {\mathbf {s}}_{t}\right] ^{\top }\Big ]\), \(\upsilon ^{(2)}_{*} \triangleq {\mathbb {E}}\Big [{\mathbb {E}}[{\mathbf {r}}_{t} \vert {\mathbf {s}}_t]\big (\gamma {\mathbb {E}}[\phi ^{\prime }_{t}\big \vert {\mathbf {s}}_t] - \phi _{t}\big )\Big ]\) and \(\upsilon ^{(3)}_{*} \triangleq {\mathbb {E}}\Big [\big (\phi _t - 2\gamma {\mathbb {E}}[\phi ^{\prime }_{t}\big \vert {\mathbf {s}}_{t}]\big )\phi _{t}^{\top }\Big ]\).
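One assembly of these terms consistent with expanding \({\mathbb {E}}\big [{\mathbb {E}}[\delta _t(z)\mid {\mathbf {s}}_t]^{2}\big ]\) is \(\mathrm {MSBR}(z) = \upsilon ^{(0)}_{*} + 2z^{\top }\upsilon ^{(2)}_{*} + z^{\top }(\upsilon ^{(1)}_{*} + \upsilon ^{(3)}_{*})z\); the exact grouping used in Eq. (24) is the authoritative form. A sketch under this assumed grouping, with hypothetical values:

```python
import numpy as np

def msbr(z, u0, u1, u2, u3):
    # u0 (scalar), u2 (k-vector), u1 and u3 (k x k matrices) play the roles of
    # upsilon^(0)_*, upsilon^(2)_*, upsilon^(1)_* and upsilon^(3)_*.
    # Assumed assembly: u0 + 2 z^T u2 + z^T (u1 + u3) z.
    return float(u0 + 2.0 * (z @ u2) + z @ (u1 + u3) @ z)

# Hypothetical values (k = 2), purely for illustration:
u0, u2 = 0.5, np.array([0.1, 0.2])
u1, u3 = np.eye(2), np.eye(2)

at_origin = msbr(np.zeros(2), u0, u1, u2, u3)       # constant term only: 0.5
at_e1 = msbr(np.array([1.0, 0.0]), u0, u1, u2, u3)  # 0.5 + 0.2 + 2.0 = 2.7
```

As with MSPBE, z and the stochastic terms are decoupled, so tracking the tuple \(\upsilon _{*}\) suffices to evaluate the objective at any candidate z.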
Proposed algorithms
We propose a generalized algorithm to approximate the value function \(V^{\pi }\) (for a given policy \(\pi \)) with linear function approximation by minimizing either MSPBE or MSBR, where the optimization is performed using a multitimescale stochastic approximation variant of the CE algorithm. Since the CE method is a maximization algorithm, the objective function in the optimization problem here is the negative of MSPBE or MSBR, respectively. More formally, in this paper we solve the following two optimization problems:

1.
$$\begin{aligned}&z_{p}^{*} = \mathop {{{\mathrm{arg\,min}}}}\limits _{z \in {\mathcal {Z}} \subset \mathbb {R}^{k}} \mathrm {MSPBE}(z) = \mathop {{{\mathrm{arg\,max}}}}\limits _{z \in {\mathcal {Z}} \subset \mathbb {R}^{k}} {\mathcal {J}}_{p}(z), \nonumber \\&\quad \mathrm { where }\,\, {\mathcal {J}}_{p} = -\mathrm {MSPBE}. \end{aligned}$$(25)

2.
$$\begin{aligned}&z_{b}^{*} = \mathop {{{\mathrm{arg\,min}}}}\limits _{z \in {\mathcal {Z}} \subset \mathbb {R}^{k}} \mathrm {MSBR}(z) = \mathop {{{\mathrm{arg\,max}}}}\limits _{z \in {\mathcal {Z}} \subset \mathbb {R}^{k}} {\mathcal {J}}_{b}(z), \nonumber \\&\quad \mathrm { where } \,\, {\mathcal {J}}_{b} = -\mathrm {MSBR}. \end{aligned}$$(26)
Here \({\mathcal {Z}}\) is the solution space, i.e., the space of parameter values of the function approximator. We also define \(\mathcal {J}_{p}^{*} \triangleq \mathcal {J}_{p}(z_{p}^{*})\) and \({\mathcal {J}}_{b}^{*} \triangleq {\mathcal {J}}_{b}(z_{b}^{*})\).
 \(\circledast \) :

Assumption (A2): The solution space \({\mathcal {Z}}\) is compact, i.e., it is closed and bounded.
A few remarks about the algorithms are in order:
1. Tracking the objective function \(\mathcal {J}_{p}, \mathcal {J}_{b}\): Recall that the goal of the paper is to develop an online and incremental prediction algorithm. This implies that the algorithm has to estimate the value function by recalibrating the prediction vector incrementally as new transitions of the sample trajectory are revealed. Note that the sample trajectory is simply a rollout of an arbitrary realization of the underlying Markovian dynamics in the form of state transitions and their associated rewards and we assume that the sample trajectory satisfies the following assumption:
 \(\circledast \) :

Assumption (A3): A sample trajectory \(\{({\mathbf {s}}_{t}, {\mathbf {r}}_{t}, {\mathbf {s}}^{\prime }_{t})\}_{t=0}^{\infty }\) is given, where \({\mathbf {s}}_{t} \sim \nu (\cdot )\), \({\mathbf {s}}^{\prime }_{t} \sim \mathrm {P}^{\pi }({\mathbf {s}}_{t}, \cdot )\) and \({\mathbf {r}}_{t} = \mathrm {R}({\mathbf {s}}_{t}, \pi ({\mathbf {s}}_{t}), {\mathbf {s}}^{\prime }_{t})\). Let \(\nu (s) > 0 \), \(\forall s \in {\mathbb {S}}\). Also, let \(\phi _t, \phi ^{\prime }_{t}\), and \({\mathbf {r}}_{t}\) have uniformly bounded second moments. Further, the matrix \({\mathbb {E}}\left[ \phi _{t}\phi _{t}^{\top }\right] \) is assumed to be nonsingular.
Now recall that in the analytic closed-form expression (Eq. (23)) of the objective function \({\mathcal {J}}_{p}(\cdot )\), we have isolated the stochastic and the deterministic parts. The stochastic part can be identified by the tuple \(\omega _{*} \triangleq (\omega ^{(0)}_{*}, \omega ^{(1)}_{*}, \omega ^{(2)}_{*})^{\top }\). Thus, if we can track \(\omega _{*}\), then we can track the objective function \({\mathcal {J}}_{p}(\cdot )\). This is the approach we follow here. In our algorithm, we track \(\omega _{*}\) by maintaining a time indexed variable \(\omega _{t} \triangleq (\omega ^{(0)}_{t}, \omega ^{(1)}_{t}, \omega ^{(2)}_{t})^{\top }\), where \(\omega ^{(0)}_{t} \in \mathbb {R}^{k}\), \(\omega ^{(1)}_{t} \in \mathbb {R}^{k \times k}\) and \(\omega ^{(2)}_{t} \in \mathbb {R}^{k \times k}\). Here \(\omega ^{(i)}_{t}\) independently tracks \(\omega ^{(i)}_{*}\), \(0 \le i \le 2\). We show here that \(\lim _{t \rightarrow \infty }\omega ^{(i)}_{t} = \omega ^{(i)}_{*}\), \(0 \le i \le 2\), with probability one. Now the stochastic recursion to track \(\omega _{*}\) is given by
The increment term \({\varDelta } \omega _{t+1} \triangleq ({\varDelta }\omega ^{(0)}_{t+1}, {\varDelta }\omega ^{(1)}_{t+1}, {\varDelta }\omega ^{(2)}_{t+1})^{\top }\) used for this recursion is defined as follows:
where \(\phi _t \triangleq \phi ({\mathbf {s}}_{t})\) and \(\phi ^{\prime }_{t} \triangleq \phi ({\mathbf {s}}^{\prime }_{t})\).
Now we define the estimate of \({\mathcal {J}}_{p}(\cdot )\) at time t as follows:
For a given \(z \in {\mathcal {Z}}\),
Structurally, it is identical to the expression of \({\mathcal {J}}_{p}\) in Eq. (23), except that \(\omega _{t}\) replaces \(\omega _{*}\). Since \(\omega _{t}\) tracks \(\omega _{*}\), it is easy to verify that \(\bar{{\mathcal {J}}}_{p}(\omega _{t}, z)\) indeed tracks \({\mathcal {J}}_{p}(z)\) for a given \(z \in {\mathcal {Z}}\).
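The tracking recursion can be sketched on a toy problem. The increment forms below, \({\varDelta }\omega ^{(0)}_{t+1} = {\mathbf {r}}_t\phi _t - \omega ^{(0)}_t\), \({\varDelta }\omega ^{(1)}_{t+1} = \phi _t(\gamma \phi ^{\prime }_t - \phi _t)^{\top } - \omega ^{(1)}_t\) and \({\varDelta }\omega ^{(2)}_{t+1} = {\mathbb {I}} - \phi _t\phi _t^{\top }\omega ^{(2)}_t\), are our unbiased-estimate reading of Eq. (38); the two-state chain, one-hot features and reward are hypothetical choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
k, gamma, n_steps = 2, 0.9, 50_000

# Hypothetical two-state chain: one-hot features, uniform sampling
# distribution nu, uniform transitions, reward r(s, s') = s + s'.
phi = np.eye(k)

w0 = np.zeros(k)        # tracks omega^(0)_* = E[r_t phi_t]          = [0.25, 0.75]
w1 = np.zeros((k, k))   # tracks omega^(1)_* = E[phi_t (g phi'_t - phi_t)^T]
w2 = np.eye(k)          # tracks omega^(2)_* = (E[phi_t phi_t^T])^{-1} = 2 I

for t in range(n_steps):
    a = 1.0 / (t + 1)                                  # step size alpha_t
    s = rng.integers(k)                                # s_t ~ nu
    s_next = rng.integers(k)                           # s'_t ~ P^pi(s_t, .) (uniform here)
    r = float(s + s_next)
    f, f_next = phi[s], phi[s_next]
    w0 += a * (r * f - w0)                             # Delta omega^(0)_{t+1}
    w1 += a * (np.outer(f, gamma * f_next - f) - w1)   # Delta omega^(1)_{t+1}
    w2 += a * (np.eye(k) - np.outer(f, f) @ w2)        # Delta omega^(2)_{t+1}

def J_p_bar(z):
    # Estimate of MSPBE(z) assembled from the tracked quantities.
    v = w0 + w1 @ z
    return float(v @ w2 @ v)
```

With \(\gamma = 0.9\) the true values for this toy chain are \(\omega ^{(0)}_{*} = [0.25, 0.75]^{\top }\), \(\omega ^{(1)}_{*}\) with rows \([-0.275, 0.225]\) and \([0.225, -0.275]\), and \(\omega ^{(2)}_{*} = 2{\mathbb {I}}\); the iterates settle near these after a few tens of thousands of samples.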
Similarly, in the case of MSBR, we require the following double sampling assumption on the sample trajectory:
 \(\circledast \) :

Assumption (A3)\(^{\prime }\): A sample trajectory \(\{({\mathbf {s}}_{t}, {\mathbf {r}}_{t}, {\mathbf {r}}^{\prime }_{t}, {\mathbf {s}}^{\prime }_{t}, {\mathbf {s}}^{\prime \prime }_{t})\}_{t=0}^{\infty }\) is provided, where \({\mathbf {s}}_{t} \sim \nu (\cdot )\), \({\mathbf {s}}^{\prime }_{t} \sim \mathrm {P}^{\pi }({\mathbf {s}}_{t}, \cdot )\), \({\mathbf {s}}^{\prime \prime }_{t} \sim \mathrm {P}^{\pi }({\mathbf {s}}_{t}, \cdot )\) with \({\mathbf {s}}^{\prime }_{t}\) and \({\mathbf {s}}^{\prime \prime }_{t}\) sampled independently. Also, \({\mathbf {r}}_{t} = \mathrm {R}({\mathbf {s}}_{t}, \pi ({\mathbf {s}}_{t}), {\mathbf {s}}^{\prime }_{t})\) and \({\mathbf {r}}^{\prime }_{t} = \mathrm {R}({\mathbf {s}}_{t}, \pi ({\mathbf {s}}_{t}), {\mathbf {s}}^{\prime \prime }_{t})\). Let \(\nu (s)> 0 \), \(\forall s \in {\mathbb {S}}\). Further, let \(\phi _t, \phi ^{\prime }_{t}, \phi ^{\prime \prime }_{t}, {\mathbf {r}}_{t}\), and \({\mathbf {r}}^{\prime }_{t}\) have uniformly bounded second moments (where \(\phi _t \triangleq \phi ({\mathbf {s}}_{t}), \phi ^{\prime }_{t} \triangleq \phi ({\mathbf {s}}^{\prime }_{t}), \phi ^{\prime \prime }_{t} \triangleq \phi ({\mathbf {s}}^{\prime \prime }_{t})\)).
Assumption (A3)\(^{\prime }\) does not contain any nonsingularity condition. However, it demands the availability of two independent transitions \(({\mathbf {s}}^{\prime }_{t}, {\mathbf {r}}_{t}) \) and \(({\mathbf {s}}^{\prime \prime }_{t}, {\mathbf {r}}^{\prime }_{t})\) given the current state \({\mathbf {s}}_{t}\). This requirement is referred to as double sampling.
We maintain the time indexed variable \(\upsilon _{t} \triangleq (\upsilon ^{(0)}_{t}, \upsilon ^{(1)}_{t}, \upsilon ^{(2)}_{t}, \upsilon ^{(3)}_{t})^{\top }\), where \(\upsilon ^{(0)}_{t} \in \mathbb {R}\), \(\upsilon ^{(1)}_{t} \in \mathbb {R}^{k \times k}\), \(\upsilon ^{(2)}_{t} \in \mathbb {R}^{k \times 1}\) and \(\upsilon ^{(3)}_{t} \in \mathbb {R}^{k \times k}\). Now the stochastic recursion to track \(\upsilon _{*}\) is given by
The increment term \({\varDelta } \upsilon _{t+1} \triangleq ({\varDelta }\upsilon ^{(0)}_{t+1}, {\varDelta }\upsilon ^{(1)}_{t+1}, {\varDelta }\upsilon ^{(2)}_{t+1}, {\varDelta }\upsilon ^{(3)}_{t+1})^{\top }\) used in the above recursion is defined as follows:
We also define the estimate of \({\mathcal {J}}_{b}(\cdot )\) at time t as follows:
For a given \(z \in {\mathcal {Z}}\),
2. Tracking the ideal CE method: The ideal CE method defined in Eq. (17) is computationally intractable due to the inherent hardness involved in efficiently computing the quantities \({\mathbb {E}}_{\theta _t}[\cdot ]\) and \(\gamma _{\rho }(\cdot , \cdot )\) (hence the tag name “ideal”). There are multiple ways one can track the ideal CE method. In this paper, we consider the efficient tracking of the ideal CE method using the stochastic approximation (SA) framework proposed in Joseph and Bhatnagar (2016a, b, 2018). The stochastic approximation approach is efficient both computationally and in terms of storage when compared to the other state-of-the-art CE tracking methods. The SA variant is also shown to exhibit global optimum convergence, i.e., the model sequence \(\{\theta _t\}_{t \in {\mathbb {N}}}\) converges to the degenerate distribution concentrated on any of the global optima of the objective function. The SA version of the CE method consists of three stochastic recursions which are defined as follows:
Note that the above recursions are defined for the objective function \({\mathcal {J}}_{p}\). However, in the case of \({\mathcal {J}}_{b}\), the recursions are similar except for \({\mathcal {J}}_{b}\) replacing \({\mathcal {J}}_{p}\) and \(\upsilon _t\) replacing \(\omega _t\) wherever required.
3. Learning rates and timescales: Our algorithms use two learning rates \(\{\alpha _{t}\}_{t \in {\mathbb {N}}}\) and \(\{\beta _{t}\}_{t \in {\mathbb {N}}}\), which are deterministic, positive, nonincreasing, predetermined (chosen a priori) and satisfy the following conditions:
In a multitimescale stochastic approximation setting (Borkar 1997), it is important to understand the difference between timescale and learning rate. The timescale of a stochastic recursion is defined by its learning rate (also referred to as stepsize). Note that from the conditions imposed on the learning rates \(\{\alpha _t\}_{t \in {\mathbb {N}}}\) and \(\{\beta _t\}_{t \in {\mathbb {N}}}\) in Eq. (36), we have \(\frac{\alpha _t}{\beta _t} \rightarrow 0\). So \(\alpha _{t}\) decays to 0 relatively faster than \(\beta _{t}\). Hence the timescale obtained from \(\{\beta _t\}_{t \in {\mathbb {N}}}\) is considered faster than the one obtained from \(\{\alpha _t\}_{t \in {\mathbb {N}}}\). So in a multitimescale stochastic recursion scenario, the evolution of the recursion controlled by \(\{\alpha _{t}\}\) (which converges to 0 relatively faster) is slower compared to the recursions controlled by \(\{\beta _{t}\}\). This is because the increments are weighted by their learning rates, i.e., the learning rates control the quantity of change that occurs to the variables when the update is executed. When observed from the faster timescale recursion, one can consider the slower timescale recursion to be almost stationary, while when viewed from the slower timescale, the faster timescale recursion appears to have equilibrated. This attribute of multitimescale recursions is very important in the analysis of the algorithm. In the analysis, when studying the asymptotic behaviour of a particular stochastic recursion, we can consider the variables of other recursions which are on slower timescales to be constant. In our algorithm, the recursions of \(\omega _t\) and \(\theta _t\) proceed along the slowest timescale, and so updates of \(\omega _{t}\) appear to be quasi-static when viewed from the timescale on which the recursions governed by \(\beta _{t}\) proceed.
The recursions of \(\gamma _t, \xi ^{(0)}_t\) and \(\xi ^{(1)}_t \) proceed along the faster timescale and hence appear equilibrated when viewed from the slower recursion. The coherent behaviour exhibited by the algorithms is primarily attributed to the timescale differences obeyed by the various recursions.
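One illustrative (not prescribed by the text) choice of learning rates satisfying Eq. (36) is \(\alpha _t = (t+1)^{-1}\) and \(\beta _t = (t+1)^{-0.6}\): both are positive and nonincreasing, \(\sum _t \alpha _t = \sum _t \beta _t = \infty \), \(\sum _t \alpha _t^2\) and \(\sum _t \beta _t^2\) are finite, and \(\alpha _t/\beta _t = (t+1)^{-0.4} \rightarrow 0\), so the \(\beta _t\)-driven recursions run on the faster timescale:

```python
import numpy as np

def alpha(t):
    # Slow-timescale step size (illustrative choice satisfying Eq. (36)).
    return (t + 1.0) ** -1.0

def beta(t):
    # Fast-timescale step size; decays more slowly, so its recursions move faster.
    return (t + 1.0) ** -0.6

t = np.arange(100_000)
ratio = alpha(t) / beta(t)   # equals (t + 1)^{-0.4}, monotonically decreasing to 0
```

Any pair of exponents \((a, b)\) with \(0.5 < b < a \le 1\) works the same way; the essential property is only the ratio condition, not the specific powers.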
The algorithm SCE-MSPBEM (stochastic cross entropy-mean squared projected Bellman error minimization), which minimizes the mean squared projected Bellman error (MSPBE) by incorporating a multitimescale stochastic approximation variant of the cross entropy (CE) method, is formally presented in Algorithm 1.
The algorithm SCE-MSBRM (stochastic cross entropy-mean squared Bellman residue minimization), which minimizes the mean squared Bellman residue (MSBR) by incorporating a multitimescale stochastic approximation variant of the cross entropy (CE) method, is formally presented in Algorithm 2.
Convergence analysis
Observe that the algorithms are multitimescale stochastic approximation algorithms (Borkar 1997) involving multiple stochastic recursions piggybacking on each other. The primal recursions which typify the algorithms are the stochastic recursions which update the model parameters \(\theta _t\) (Eq. (45) of Algorithm 1 and Eq. (54) of Algorithm 2), where the model parameters \(\theta _t\) are calibrated to ensure their evolution towards the degenerate distribution concentrated on the global optimum (\(z^{*}_p\) for Algorithm 1 and \(z_{b}^{*}\) for Algorithm 2). Nonetheless, the remaining recursions are also vital: they must augment each other and the primal recursion in achieving the desired limiting behaviour. Therefore, to analyze the limiting behaviour of the algorithms, one has to study the asymptotic behaviour of the individual recursions, i.e., the effectiveness of the variables involved in tracking the true quantities. For analyzing the asymptotic behaviour of the algorithms, we apply the ODE-based analysis from Ljung (1977), Kushner and Clark (1978), Kubrusly and Gravier (1973), Borkar (2008), Benveniste et al. (2012). In this method of analysis, for each individual stochastic recursion, we identify an associated ODE whose asymptotic (limiting) behaviour is similar to that of the stochastic recursion. In other words, the stochastic recursion eventually tracks the associated ODE. Subsequently, a qualitative analysis of the solutions of the associated ODE is performed to study their limiting behaviour, and it is argued that the stochastic recursion asymptotically converges almost surely to the set of stable fixed points of the ODE (see Chapter 2 of Borkar (2008), Chapter 5 of Kushner and Clark (1978), or Chapter 2 of Benveniste et al. 2012).
Outline of the proof
The roadmap followed in the analysis of the algorithms is as follows:

1.
First and foremost, in the case of Algorithm 1, we study the asymptotic behaviour of the stochastic recursion (38). We show in Lemma 1 that the stochastic sequence \(\{\omega _{t}\}\) indeed tracks the true quantity \(\omega _{*}\) which defines the true objective function \({\mathcal {J}}_{p}\). Note that the recursion (38) is independent of the other recursions and hence can be analyzed independently. The analysis (proof of Lemma 1) of the limiting behaviour of \(\{\omega _t\}\) involves multiple steps: analyzing the nature of growth of the stochastic sequence, identifying the character of the implicit noise present in the stochastic recursion, establishing finite bounds on the noise sequence (we employ probabilistic analysis (Borkar 2012) to realize these steps), ensuring the stability of the stochastic sequence (we appeal to the Borkar–Meyn theorem, Borkar 2008), and finally the qualitative analysis of the limit points of the associated ODE of the stochastic recursion (using dynamical systems theory, Perko 2013).

2.
Similarly, in the case of Algorithm 2, we study the asymptotic behaviour of the stochastic recursion (48). We show in Lemma 2 that the stochastic sequence \(\{\upsilon _{t}\}\) indeed tracks the true quantity \(\upsilon _{*}\) which defines the true objective function \({\mathcal {J}}_{b}\). The proof of Lemma 2 follows along the same lines as that of Lemma 1.

3.
Since the proposed algorithms are multitimescale stochastic approximation algorithms, their asymptotic behaviour depends heavily on the timescale differences induced by the stepsize schedules \(\{\alpha _{t}\}_{t \in {\mathbb {N}}}\) and \(\{\beta _{t}\}_{t \in {\mathbb {N}}}\). The timescale differences allow the different individual recursions in a multitimescale setting to learn at different rates. Since \(\frac{\alpha _t}{\beta _t} \rightarrow 0\), the stepsize \(\{\beta _t\}_{t \in {\mathbb {N}}}\) decays to 0 at a relatively slower rate than \(\{\alpha _t\}_{t \in {\mathbb {N}}}\), and therefore the increments in the recursions (40)–(42), which are controlled by \(\beta _t\), are relatively larger and hence appear to converge relatively faster than the recursions (38)–(39) and (45), which are controlled by \(\alpha _t\), when viewed from the latter. So, considering a finite yet sufficiently long time window, the relative evolution of the variables on the slower timescale \(\alpha _{t}\), i.e., \(\omega _{t}\) and \(\theta _t\), to their steady-state form is indeed slow and can in fact be considered quasistationary when viewed from the evolutionary path of the faster timescale \(\beta _t\). See Chapter 6 of Borkar (2008) for a succinct description of multitimescale stochastic approximation algorithms. Hence, when viewed from the timescale of the recursions (40)–(42), one may consider \(\omega _{t}\) and \(\theta _{t}\) to be fixed. This is a standard technique used in analyzing multitimescale stochastic approximation algorithms. Following this course of analysis, we obtain Lemma 3, which characterizes the asymptotic behaviour of the stochastic recursions (40)–(42). The original paper (Joseph and Bhatnagar 2018) on the stochastic approximation version of the CE method (proposed for a generalized optimization setting) establishes claims analogous to Lemma 3; hence we skip the proof of the lemma and instead provide references to the corresponding results.
The results in Lemma 3 establish that, under the quasistationary hypothesis of \(\omega _t \equiv \omega \) and \(\theta _t \equiv \theta \), the stochastic sequence \(\{\gamma _t\}\) tracks the true quantile \(\gamma _{\rho }(\bar{{\mathcal {J}}}_{p}(\omega , \cdot ),\widehat{\theta })\) ((1) of Lemma 3), while the stochastic sequences \(\{\xi ^{(0)}_{t}\}\) and \(\{\xi ^{(1)}_{t}\}\) track the ideal CE model parameters \({\varUpsilon }_1(\bar{{\mathcal {J}}}_{p}(\omega , \cdot ), \widehat{\theta })\) and \({\varUpsilon }_2(\bar{{\mathcal {J}}}_{p}(\omega , \cdot ), \widehat{\theta })\) respectively ((2–3) of Lemma 3) with probability one. These results establish that the stochastic recursions (40)–(42) track the ideal CE method and therefore provide a stable optimization procedure to minimize the error functions MSPBE (or MSBR). The rationale behind the stochastic recursion (44) is provided in Joseph and Bhatnagar (2018). Briefly, its purpose is as follows: the threshold sequence \(\{\gamma _{\rho }({\mathcal {J}}_{p}, \theta _t)\}\) (where \(\theta _t\) is generated by Eq. (17)) of the ideal CE method is monotonically increasing (Proposition 2 of Joseph and Bhatnagar 2018). However, when stochastic approximation iterates are employed to track the ideal model parameters, this monotonicity may not always hold. The stochastic recursion (44) ensures that the monotonicity of the threshold sequence is maintained, and therefore (4–5) of Lemma 3, along with an appropriate choice of \(\epsilon _1 \in [0,1)\) (Algorithm 1), ensure that the model sequence \(\{\theta _t\}\) is updated infinitely often.

4.
Finally, we state our main results regarding the convergence of MSPBE and MSBR in Theorems 1 and 2, respectively. The theorems analyze the asymptotic behaviour of the model sequence \(\{\theta _t\}_{t \in {\mathbb {N}}}\) for Algorithms 1 and 2 respectively. The theorems claim that the model sequence \(\{\theta _{t}\}\) generated by Algorithm 1 (Algorithm 2) almost surely converges to \(\theta _{p}^{*} = (z_{p}^{*}, 0_{k \times k})^{\top }\) \((\theta _{b}^{*} = (z_{b}^{*}, 0_{k \times k})^{\top })\), the degenerate distribution concentrated at \(z_{p}^{*}\) \((z_{b}^{*})\), where \(z_{p}^{*}\) \((z_{b}^{*})\) is the solution to the optimization problem (25) ((26)) which minimizes the error function MSPBE (MSBR).
The proof of convergence
For the stochastic recursion (38), we have the following result:
As a preliminary, we define the filtration^{Footnote 9} \(\{{\mathcal {F}}_{t}\}_{t \in {\mathbb {N}}}\), where the \(\sigma \)-field \({\mathcal {F}}_t \triangleq \sigma (\omega _i, \gamma _i, \gamma ^{p}_i, \xi ^{(0)}_i, \xi ^{(1)}_i, \theta _i, 0 \le i \le t; {\mathbf {Z}}_{i}, 1 \le i \le t; {\mathbf {s}}_{i}, {\mathbf {r}}_{i}, {\mathbf {s}}^{\prime }_{i}, 0 \le i < t )\), \(t \in {\mathbb {N}}\), is the \(\sigma \)-field generated by the specified random variables.
Lemma 1
Let the stepsize sequences \(\{\alpha _{t}\}_{t \in {\mathbb {N}}}\) and \(\{\beta _{t}\}_{t \in {\mathbb {N}}}\) satisfy Eq. (36). For the sample trajectory \(\{({\mathbf {s}}_{t}, {\mathbf {r}}_{t}, {\mathbf {s}}^{\prime }_{t})\}_{t=0}^{\infty }\), we let Assumption (A3) hold. Then, for a given \(z \in {\mathcal {Z}}\), the sequence \(\{\omega _{t}\}_{t \in {\mathbb {N}}}\) defined in Eq. (38) satisfies with probability one,
where \(\omega ^{(0)}_{*}\), \(\omega ^{(1)}_{*}\), \(\omega ^{(2)}_{*}\) and MSPBE are defined in Eq. (23), while \(\bar{{\mathcal {J}}}_{p}(\omega _t,z)\) is defined in Eq. (39).
Proof
By rearranging equations in (38), for \(t \in {\mathbb {N}}\), we get
where \({\mathbb {M}}^{(0,0)}_{t+1} = {\mathbf {r}}_{t}\phi _{t} - {\mathbb {E}}\left[ {\mathbf {r}}_{t}\phi _{t}\right] \, \mathrm {and} \, h^{(0,0)}(x)={\mathbb {E}}\left[ {\mathbf {r}}_{t}\phi _{t} \right] - x\). Similarly,
where \({\mathbb {M}}^{(0,1)}_{t+1} = \phi _{t}(\gamma \phi ^{\prime }_{t}-\phi _{t})^{\top } - {\mathbb {E}}\left[ \phi _{t}(\gamma \phi ^{\prime }_{t}-\phi _{t})^{\top }\right] \) and \(h^{(0,1)}(x)={\mathbb {E}}\left[ \phi _{t}(\gamma \phi ^{\prime }_{t}-\phi _{t})^{\top }\right] - x\). Finally,
where \({\mathbb {M}}^{(0,2)}_{t+1} = {\mathbb {E}}\left[ \phi _{t}\phi _{t}^{\top }\omega ^{(2)}_{t} \right] - \phi _{t}\phi _{t}^{\top }\omega ^{(2)}_{t} \text { and } h^{(0,2)}(x) = {\mathbb {I}}_{k \times k} - {\mathbb {E}}\left[ \phi _{t}\phi _{t}^{\top }\right] x\).
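Each decomposition above splits an increment into a deterministic drift \(h\) and zero-mean noise \({\mathbb {M}}\). A quick numerical sanity check (on a hypothetical sampling model, purely for illustration) that the noise term \({\mathbb {M}}^{(0,0)}_{t+1} = {\mathbf {r}}_t\phi _t - {\mathbb {E}}[{\mathbf {r}}_t\phi _t]\) indeed averages to zero:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
# Hypothetical sampling model: s_t, s'_t uniform on {0, 1}, one-hot features,
# reward r_t = s_t + s'_t; then E[r_t phi_t] = [0.25, 0.75].
s = rng.integers(0, 2, size=n)
s_next = rng.integers(0, 2, size=n)
r = (s + s_next).astype(float)
phi = np.eye(2)[s]                      # rows are phi(s_t)

true_mean = np.array([0.25, 0.75])      # E[r_t phi_t] for this model
noise = r[:, None] * phi - true_mean    # i.i.d. samples of M^(0,0)_{t+1}
empirical_mean = noise.mean(axis=0)     # should be close to the zero vector
```

The bounded second moments demanded by (A3) are what keep such noise averages well behaved along the actual recursion.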
To apply the ODE-based analysis, we first verify certain necessary structural conditions:

(B1)
\(h^{(0,j)}, 0 \le j \le 2\) are Lipschitz continuous (easy to verify).

(B2)
\(\{{\mathbb {M}}^{(0,j)}_{t+1}\}_{t \in {\mathbb {N}}}\), \(0 \le j \le 2\) are martingale difference noise sequences, i.e., for each j, \({\mathbb {M}}^{(0,j)}_{t}\) is \({\mathcal {F}}_{t}\)-measurable, integrable and \({\mathbb {E}}\left[ {\mathbb {M}}^{(0,j)}_{t+1} \vert {\mathcal {F}}_{t}\right] = 0\), \(t \in {\mathbb {N}}\), \(0 \le j \le 2\).

(B3)
Since \(\phi _{t}\), \(\phi ^{\prime }_{t}\) and \({\mathbf {r}}_{t}\) have uniformly bounded second moments, the noise sequences \(\{{\mathbb {M}}^{(0,j)}_{t+1}\}_{t \in {\mathbb {N}}}\) have uniformly bounded second moments as well for each \(0 \le j \le 2\) and hence \(\exists K_{0,0}, K_{0,1}, K_{0,2} > 0\) s.t.
$$\begin{aligned}&{\mathbb {E}}\left[ \Vert {\mathbb {M}}^{(0,0)}_{t+1} \Vert ^{2} \big \vert {\mathcal {F}}_{t}\right] \le K_{0,0}(1+\Vert \omega ^{(0)}_{t} \Vert ^{2}), \quad t \in {\mathbb {N}}. \end{aligned}$$(59)$$\begin{aligned}&{\mathbb {E}}\left[ \Vert {\mathbb {M}}^{(0,1)}_{t+1} \Vert ^{2} \big \vert {\mathcal {F}}_{t}\right] \le K_{0,1}(1+\Vert \omega ^{(1)}_{t} \Vert ^{2}), \quad t \in {\mathbb {N}}. \end{aligned}$$(60)$$\begin{aligned}&{\mathbb {E}}\left[ \Vert {\mathbb {M}}^{(0,2)}_{t+1} \Vert ^{2} \big \vert {\mathcal {F}}_{t}\right] \le K_{0,2}(1+\Vert \omega ^{(2)}_{t} \Vert ^{2}), \quad t \in {\mathbb {N}}. \end{aligned}$$(61) 
(B4)
To establish the stability (boundedness) condition, i.e., \(\sup _{t \in {\mathbb {N}}} \Vert \omega ^{(j)}_t \Vert < \infty \) a.s., for each \(0 \le j \le 2\), we appeal to the Borkar–Meyn theorem (Theorem 2.1 of Borkar and Meyn (2000) or Theorem 7, Chapter 3 of Borkar 2008). Particularly, in order to prove \(\sup _{t \in {\mathbb {N}}} \Vert \omega ^{(0)}_{t} \Vert < \infty \) a.s., we study the qualitative behaviour of the dynamical system defined by the following limiting ODE:
$$\begin{aligned} \frac{d}{dt}\omega ^{(0)}(t) = h^{(0,0)}_{\infty }(\omega ^{(0)}(t)), \quad t \in \mathbb {R}_{+}, \end{aligned}$$(62)where
$$\begin{aligned} h^{(0,0)}_{\infty }(x) \triangleq \lim _{c \rightarrow \infty }\frac{h^{(0,0)}(cx)}{c} = \lim _{c \rightarrow \infty }\frac{{\mathbb {E}}\left[ {\mathbf {r}}_{t}\phi _{t} \right] - cx}{c} = \lim _{c \rightarrow \infty }\frac{{\mathbb {E}}\left[ {\mathbf {r}}_{t}\phi _{t}\right] }{c} - x = -x. \end{aligned}$$According to the Borkar–Meyn theorem, the global asymptotic stability of the above limiting system to the origin is sufficient to warrant the stability of the sequence \(\{\omega ^{(0)}_{t}\}_{t \in {\mathbb {N}}}\). Now, note that the ODE (62) is a linear, first-order ODE with negative rate of change and hence qualitatively the flow induced by the ODE is globally asymptotically stable to the origin. Therefore, we obtain the following:
$$\begin{aligned} \sup _{t \in {\mathbb {N}}}{\Vert \omega ^{(0)}_{t} \Vert } < \infty \quad a.s. \end{aligned}$$(63)Similarly we can show that
$$\begin{aligned} \sup _{t \in {\mathbb {N}}}{\Vert \omega ^{(1)}_{t} \Vert } < \infty \quad a.s. \end{aligned}$$(64)Now, regarding the stability of the sequence \(\{\omega ^{(2)}_{t}\}_{t \in {\mathbb {N}}}\), we consider the following limiting ODE:
$$\begin{aligned} \frac{d}{dt}\omega ^{(2)}(t) = h^{(0,2)}_{\infty }(\omega ^{(2)}(t) ), \quad t \in \mathbb {R}_{+}, \end{aligned}$$(65)where
$$\begin{aligned} h^{(0,2)}_{\infty }(x) \triangleq \lim _{c \rightarrow \infty }\frac{h^{(0,2)}(cx)}{c}&= \lim _{c \rightarrow \infty }\frac{{\mathbb {I}}_{k \times k} - {\mathbb {E}}\left[ \phi _{t}\phi _{t}^{\top }\right] cx}{c} \\&= \lim _{c \rightarrow \infty } \frac{{\mathbb {I}}_{k \times k}}{c} - {\mathbb {E}}\left[ \phi _{t}\phi _{t}^{\top } \right] x = -{\mathbb {E}}\left[ \phi _{t}\phi _{t}^{\top } \right] x. \end{aligned}$$The system defined by the limiting ODE (65) is globally asymptotically stable to the origin since \({\mathbb {E}}[\phi _t \phi _t^{\top }]\) is positive definite (as it is positive semidefinite (easy to verify) and nonsingular (from Assumption A3)). Therefore, by the Borkar–Meyn theorem, we obtain the following:
$$\begin{aligned} \sup _{t \in {\mathbb {N}}}{\Vert \omega ^{(2)}_{t} \Vert } < \infty \quad a.s. \end{aligned}$$(66)
Having established the necessary conditions (B1)–(B4), by appealing to Theorem 2, Chapter 2 of Borkar (2008), we can directly establish the asymptotic equivalence between the individual stochastic recursions (56)–(58) and the following associated ODEs, respectively.
Now we study the qualitative behaviour of the above system of first-order, linear ODEs. A simple examination of the trajectories of the ODEs reveals that the point \({\mathbb {E}}\left[ {\mathbf {r}}_{t}\phi _{t}\right] \) is a globally asymptotically stable equilibrium point of the ODE (67). Similarly, for the ODE (68), the point \({\mathbb {E}}\left[ \phi _{t}(\gamma \phi ^{\prime }_{t}-\phi _{t})^{\top }\right] \) is a globally asymptotically stable equilibrium. Finally, regarding the limiting behaviour of the ODE (69), we find that the point \({\mathbb {E}}\left[ \phi _{t}\phi _{t}^{\top }\right] ^{-1}\) is a globally asymptotically stable equilibrium. This follows since \({\mathbb {E}}\left[ \phi _{t}\phi _{t}^{\top }\right] \) is positive semidefinite (easy to verify) and nonsingular (from Assumption A3). Formally,
where the above convergence is achieved independent of the initial values \(\omega ^{(0)}(0), \omega ^{(1)}(0)\) and \(\omega ^{(2)}(0)\).
Therefore, by employing the asymptotic equivalence of the stochastic recursions (56)–(58) and their associated ODEs (67)–(69), we obtain the following:
Putting all the above together, we get, for \(z \in {\mathcal {Z}}\), \(\lim _{t \rightarrow \infty } \bar{{\mathcal {J}}}_{p}(\omega _t, z) = \bar{{\mathcal {J}}}_{p}(\omega _{*}, z)\) = \({\mathcal {J}}_{p}(z)\) a.s. \(\square \)
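The qualitative behaviour used in the proof can be illustrated numerically: forward-Euler integration of the linear ODE \(\frac{d}{dt}\omega ^{(2)}(t) = {\mathbb {I}} - {\mathbb {E}}[\phi _t\phi _t^{\top }]\,\omega ^{(2)}(t)\) drives the trajectory to \({\mathbb {E}}[\phi _t\phi _t^{\top }]^{-1}\) from an arbitrary initial condition. The moment matrix below is a hypothetical positive-definite stand-in:

```python
import numpy as np

# Hypothetical positive-definite moment matrix standing in for E[phi_t phi_t^T]:
A = np.array([[1.0, 0.3], [0.3, 0.5]])
target = np.linalg.inv(A)        # the claimed globally stable equilibrium

w = 5.0 * np.ones((2, 2))        # arbitrary initial condition omega^(2)(0)
dt = 0.01
for _ in range(10_000):          # forward-Euler integration of d/dt w = I - A w
    w += dt * (np.eye(2) - A @ w)
```

Since \(A\) is positive definite, every eigenvalue of the drift \(-A\) has negative real part, so the discretized trajectory contracts to the fixed point \(A^{-1}\) regardless of the starting matrix, mirroring the global asymptotic stability argument in the lemma.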
For the stochastic recursion (48), we have the following result:
Again, as a preliminary, we define the filtration \(\{{\mathcal {F}}_{t}\}_{t \in {\mathbb {N}}}\), where the \(\sigma \)-field \({\mathcal {F}}_t \triangleq \sigma (\upsilon _i, \gamma _i, \gamma ^{p}_i, \xi ^{(0)}_i, \xi ^{(1)}_i, \theta _i, 0 \le i \le t; {\mathbf {Z}}_{i}, 1 \le i \le t; {\mathbf {s}}_{i}, {\mathbf {r}}_{i}, {\mathbf {r}}^{\prime }_{i}, {\mathbf {s}}^{\prime }_{i}, {\mathbf {s}}^{\prime \prime }_{i}, 0 \le i < t )\), \(t \in {\mathbb {N}}\).
Lemma 2
Let the stepsize sequences \(\{\alpha _{t}\}_{t \in {\mathbb {N}}}\) and \(\{\beta _{t}\}_{t \in {\mathbb {N}}}\) satisfy Eq. (36). For the sample trajectory \(\{({\mathbf {s}}_{t}, {\mathbf {r}}_{t}, {\mathbf {r}}^{\prime }_{t}, {\mathbf {s}}^{\prime }_{t}, {\mathbf {s}}^{\prime \prime }_{t})\}_{t=0}^{\infty }\), we let Assumption \((A3)^{\prime }\) hold. Then, for a given \(z \in {\mathcal {Z}}\), the sequence \(\{\upsilon _{t}\}_{t \in {\mathbb {N}}}\) defined in Eq. (48) satisfies with probability one,
where \(\upsilon ^{(0)}_{*}\), \(\upsilon ^{(1)}_{*}\), \(\upsilon ^{(2)}_{*}\), \(\upsilon ^{(3)}_{*}\) and MSBR are defined in Eq. (24), while \(\bar{{\mathcal {J}}}_{b}(\upsilon _t,z)\) is defined in Eq. (49).
Proof
By rearranging equations in (48), for \(t \in {\mathbb {N}}\), we get
where \({\mathbb {M}}^{(1,0)}_{t+1} = {\mathbf {r}}_{t}{\mathbf {r}}^{\prime }_{t} - {\mathbb {E}}\big [{\mathbb {E}}^{2}\left[ {\mathbf {r}}_{t}\big \vert {\mathbf {s}}_{t}\right] \big ]\, \mathrm {and} \, h^{(1,0)}(x)={\mathbb {E}}\big [{\mathbb {E}}^{2}\left[ {\mathbf {r}}_{t}\big \vert {\mathbf {s}}_{t}\right] \big ] - x\).
Similarly,
where \({\mathbb {M}}^{(1,1)}_{t+1} = \gamma ^{2}\phi ^{\prime }_{t}\phi ^{\prime \prime \top }_{t} - \gamma ^{2}{\mathbb {E}}\big [{\mathbb {E}}[\phi ^{\prime }_{t}\big \vert {\mathbf {s}}_{t}]{\mathbb {E}}\left[ \phi ^{\prime }_{t}\big \vert {\mathbf {s}}_{t}\right] ^\top \big ]\) and \(h^{(1,1)}(x) = \gamma ^{2}{\mathbb {E}}\big [{\mathbb {E}}[\phi ^{\prime }_{t}\big \vert {\mathbf {s}}_{t}]{\mathbb {E}}\left[ \phi ^{\prime }_{t}\big \vert {\mathbf {s}}_{t}\right] ^\top \big ] - x\). Also,
where \({\mathbb {M}}^{(1,2)}_{t+1} = {\mathbf {r}}_{t}\big (\gamma \phi ^{\prime \prime }_{t}-\phi _{t}\big ) - {\mathbb {E}}\left[ {\mathbf {r}}_{t}\big (\gamma \phi ^{\prime \prime }_{t}-\phi _{t}\big )\right] \text { and } h^{(1,2)}(x) = {\mathbb {E}}\left[ {\mathbf {r}}_{t}\big (\gamma \phi ^{\prime \prime }_{t}-\phi _{t}\big )\right] - x\). Finally,
where \({\mathbb {M}}^{(1,3)}_{t+1} = \big (\phi _{t} - 2\gamma \phi ^{\prime }_{t}\big )\phi _{t}^{\top } - {\mathbb {E}}\left[ \big (\phi _{t} - 2\gamma \phi ^{\prime }_{t}\big )\phi _{t}^{\top }\right] \) and \(h^{(1,3)}(x) = {\mathbb {E}}\left[ \big (\phi _{t} - 2\gamma \phi ^{\prime }_{t}\big )\phi _{t}^{\top }\right] - x\).
To apply the ODE-based analysis, we first verify certain necessary structural conditions:

(C1)
\(h^{(1,j)}, 0 \le j \le 3\) are Lipschitz continuous (easy to verify).

(C2)
\(\{{\mathbb {M}}^{(1,j)}_{t+1}\}_{t \in {\mathbb {N}}}\), \(0 \le j \le 3\) are martingale difference noise sequences, i.e., for each j, \({\mathbb {M}}^{(1,j)}_{t}\) is \({\mathcal {F}}_{t}\)-measurable, integrable and \({\mathbb {E}}\left[ {\mathbb {M}}^{(1,j)}_{t+1} \vert {\mathcal {F}}_t\right] = 0\), \(t \in {\mathbb {N}}\), \(0 \le j \le 3\).

(C3)
Since \(\phi _{t}\), \(\phi ^{\prime }_{t}\), \(\phi ^{\prime \prime }_{t}\), \({\mathbf {r}}_{t}\) and \({\mathbf {r}}^{\prime }_{t}\) have uniformly bounded second moments, the noise sequences \(\{{\mathbb {M}}^{(1,j)}_{t+1}\}_{t \in {\mathbb {N}}}\), \(0 \le j \le 3\) have uniformly bounded second moments as well and hence \(\exists K_{1,0},K_{1,1},K_{1,2},K_{1,3} > 0\) s.t.
$$\begin{aligned} {\mathbb {E}}\left[ \Vert {\mathbb {M}}^{(1,0)}_{t+1} \Vert ^{2} \vert {\mathcal {F}}_{t}\right]&\le K_{1,0}(1+\Vert \upsilon ^{(0)}_{t} \Vert ^{2}), \quad t \in {\mathbb {N}}, \end{aligned}$$(77)
$$\begin{aligned} {\mathbb {E}}\left[ \Vert {\mathbb {M}}^{(1,1)}_{t+1} \Vert ^{2} \vert {\mathcal {F}}_{t}\right]&\le K_{1,1}(1+\Vert \upsilon ^{(1)}_{t} \Vert ^{2}), \quad t \in {\mathbb {N}}, \end{aligned}$$(78)
$$\begin{aligned} {\mathbb {E}}\left[ \Vert {\mathbb {M}}^{(1,2)}_{t+1} \Vert ^{2} \vert {\mathcal {F}}_{t}\right]&\le K_{1,2}(1+\Vert \upsilon ^{(2)}_{t} \Vert ^{2}), \quad t \in {\mathbb {N}}, \end{aligned}$$(79)
$$\begin{aligned} {\mathbb {E}}\left[ \Vert {\mathbb {M}}^{(1,3)}_{t+1} \Vert ^{2} \vert {\mathcal {F}}_{t}\right]&\le K_{1,3}(1+\Vert \upsilon ^{(3)}_{t} \Vert ^{2}), \quad t \in {\mathbb {N}}. \end{aligned}$$(80)
(C4)
To establish the stability condition, i.e., \(\sup _{t \in {\mathbb {N}}} \Vert \upsilon ^{(j)}_t \Vert < \infty \) a.s., for each \(0 \le j \le 3\), we appeal to the Borkar–Meyn theorem (Theorem 2.1 of Borkar and Meyn (2000) or Theorem 7, Chapter 3 of Borkar 2008). Indeed, to prove \(\sup _{t \in {\mathbb {N}}} \Vert \upsilon ^{(0)}_{t} \Vert < \infty \) a.s., we consider the dynamical system defined by the following \(\infty \)-system ODE:
$$\begin{aligned} \frac{d}{dt}{\upsilon ^{(0)}}(t) = h^{(1,0)}_{\infty }(\upsilon ^{(0)}(t)), \end{aligned}$$(81)where
$$\begin{aligned} h^{(1,0)}_{\infty }(x) \triangleq \lim _{c \rightarrow \infty }\frac{h^{(1,0)}(cx)}{c} = \lim _{c \rightarrow \infty }\frac{{\mathbb {E}}^{2}\left[ {\mathbf {r}}_{t}\right] - cx}{c} = \lim _{c \rightarrow \infty }\frac{{\mathbb {E}}^{2}\left[ {\mathbf {r}}_{t}\right] }{c} - x = -x. \end{aligned}$$It is easy to verify that the above flow (81) is globally asymptotically stable to the origin. Therefore, by appealing to the Borkar–Meyn theorem, we obtain that the iterates \(\{\upsilon ^{(0)}_{t}\}_{t \in {\mathbb {N}}}\) are almost surely stable, i.e.,
$$\begin{aligned} \sup _{t \in {\mathbb {N}}}{\Vert \upsilon ^{(0)}_{t} \Vert } < \infty \quad a.s. \end{aligned}$$(82)Similarly we can show that
$$\begin{aligned} \sup _{t \in {\mathbb {N}}}{\Vert \upsilon ^{(1)}_{t} \Vert }&< \infty \quad a.s. \end{aligned}$$(83)
$$\begin{aligned} \sup _{t \in {\mathbb {N}}}{\Vert \upsilon ^{(2)}_{t} \Vert }&< \infty \quad a.s. \end{aligned}$$(84)
$$\begin{aligned} \sup _{t \in {\mathbb {N}}}{\Vert \upsilon ^{(3)}_{t} \Vert }&< \infty \quad a.s. \end{aligned}$$(85)
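To make the tracking behaviour concrete, here is a minimal simulation of the first recursion's pattern: the iterate averages products of two independently sampled rewards (double sampling) and so converges to \({\mathbb {E}}^{2}[{\mathbf {r}}_{t}]\), exactly the root of \(h^{(1,0)}\). The reward distribution and step-size schedule below are illustrative stand-ins, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-in: rewards r_t, r'_t drawn independently from the
# same distribution (double sampling); N(2, 1) is an assumption made
# purely for this sketch, so E^2[r] = 4.
mean_r, T = 2.0, 200_000
x = 0.0  # the iterate tracking E^2[r_t]

for t in range(1, T + 1):
    r, r_prime = rng.normal(mean_r, 1.0, size=2)
    alpha = 1.0 / t  # step-sizes with sum alpha_t = inf, sum alpha_t^2 < inf
    # x_{t+1} = x_t + alpha_t (h(x_t) + M_{t+1}), with h(x) = E^2[r] - x
    # and martingale difference noise M_{t+1} = r_t r'_t - E^2[r].
    x += alpha * (r * r_prime - x)

print(abs(x - mean_r ** 2) < 0.1)  # True: the iterate tracks E^2[r] = 4
```

The martingale noise \(r_t r'_t - {\mathbb {E}}^2[r_t]\) averages out under the square-summable step-sizes, which is the content of conditions (C2)–(C3) in this simple case.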
Having established the necessary conditions (C1)–(C4), we can now appeal to Theorem 2, Chapter 2 of Borkar (2008) to guarantee the asymptotic equivalence between the recursion (73) and the following ODE (i.e., the recursion (73) asymptotically tracks the following ODE):
Similarly, we can guarantee the independent asymptotic equivalences between the recursions (74)–(76) and the ODEs (87)–(89) respectively.
Note that all the above ODEs (86)–(89) are linear, first-order ODEs, and further qualitative analysis reveals that the individual flows they define are globally asymptotically stable. An examination of the trajectories of the ODEs shows that the limiting behaviour of the individual flows defined by the ODEs (86)–(89) satisfies the following:
Finally, Eq. (90) and the previously established asymptotic equivalence between the recursions (73)–(76) and their respective associated ODEs (86)–(89) establish the following:
Putting all the above together, we get, for \(z \in {\mathcal {Z}}\),
\(\square \)
Notation: We denote by \({\mathbb {E}}_{\widehat{\theta }}[\cdot ]\) the expectation w.r.t. the mixture PDF \(\widehat{f}_{\theta }\) and \({\mathbb {P}}_{\widehat{\theta }}\) denotes its induced probability measure. Also, \(\gamma _{\rho }(\cdot , \widehat{\theta })\) represents the \((1-\rho )\)-quantile w.r.t. the mixture PDF \({\widehat{f}}_{\theta }\).
The following result characterizes the asymptotic behaviour of the stochastic recursions (40)–(42):
Lemma 3
Assume \(\omega _{t} \equiv \omega \), \(\theta _{t} \equiv \theta \), \(\forall t \in {\mathbb {N}}\). Let Assumption A2 hold. Also, let the step-size sequences \(\{\alpha _{t}\}_{t \in {\mathbb {N}}}\) and \(\{\beta _{t}\}_{t \in {\mathbb {N}}}\) satisfy Eq. (36). Then,

1.
The sequence \(\{\gamma _t\}_{t \in {\mathbb {N}}}\) generated by Eq. (40) satisfies
$$\begin{aligned} \lim _{t \rightarrow \infty }\gamma _t = \gamma _{\rho }(\bar{{\mathcal {J}}}_{p}(\omega , \cdot ), \widehat{\theta }) \quad a.s. \end{aligned}$$ 
2.
The sequence \(\{\xi ^{(0)}_{t}\}_{t \in \mathbb {N}}\) generated by Eq. (41) satisfies
$$\begin{aligned} \lim _{t \rightarrow \infty } \xi ^{(0)}_{t} = \xi ^{(0)}_{\omega , \theta } = \frac{\mathbb {E}_{\widehat{\theta }}\left[ \mathbf {g}_{1}\left( \bar{\mathcal {J}}_{p}({\omega },\mathbf {Z}), \mathbf {Z}, \gamma _{\rho }(\bar{\mathcal {J}}_{p}(\omega , \cdot ), \widehat{\theta })\right) \right] }{\mathbb {E}_{\widehat{\theta }}\left[ \mathbf {g}_{0}\left( \bar{\mathcal {J}}_{p}({\omega },\mathbf {Z}), \gamma _{\rho }(\bar{\mathcal {J}}_{p}(\omega , \cdot ), \widehat{\theta })\right) \right] } \text { a.s.} \end{aligned}$$ 
3.
The sequence \(\{\xi ^{(1)}_{t}\}_{t \in \mathbb {N}}\) generated by Eq. (42) satisfies
$$\begin{aligned} \lim _{t \rightarrow \infty } \xi ^{(1)}_{t} = \frac{\mathbb {E}_{\widehat{\theta }}\left[ \mathbf {g}_{2}\left( \bar{\mathcal {J}}_{p}({\omega }, \mathbf {Z}), \mathbf {Z}, \gamma _{\rho }(\bar{\mathcal {J}}_{p}({\omega }, \cdot ), \widehat{\theta }), \xi ^{(0)}_{\omega ,\theta }\right) \right] }{\mathbb {E}_{\widehat{\theta }}\left[ \mathbf {g}_{0}\left( \bar{\mathcal {J}}_{p}({\omega }, \mathbf {Z}), \gamma _{\rho }(\bar{\mathcal {J}}_{p}({\omega }, \cdot ), \widehat{\theta })\right) \right] } \text { a.s.} \end{aligned}$$ 
4.
For any \(T_0 \in (0,1)\), \(\{T_t\}_{t \in {\mathbb {N}}}\) generated by Eq. (44) satisfies \(T_t \in (-1,1)\), \(\forall t \in {\mathbb {N}}\).

5.
If \(\gamma _{\rho }(\bar{\mathcal {J}}_{p}({\omega }, \cdot ), \widehat{\theta }) > \gamma _{\rho }(\bar{\mathcal {J}}_{p}({\omega }, \cdot ), \widehat{\theta ^{p}})\), then \(\{T_t\}_{t \in \mathbb {N}}\) generated by Eq. (44) satisfies \(\lim _{t \rightarrow \infty } T_{t} = 1\) a.s.
Proof
Please refer to the proofs of Proposition 1, Lemmas 2 and 3 in Joseph and Bhatnagar (2018). \(\square \)
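The quantile-tracking recursion behind part 1 of Lemma 3 is, in form, a Robbins–Monro root-finder for \({\mathbb {P}}(J \le \gamma ) = 1-\rho \). The sketch below illustrates the idea; the sampling distribution of the objective values and the step-size schedule are illustrative assumptions, not the paper's Eq. (40).

```python
import numpy as np

rng = np.random.default_rng(1)
rho, T = 0.1, 500_000
gamma = 0.0  # quantile iterate gamma_t

# Illustrative assumption: the objective values J(Z_t) are standard normal,
# so the true (1 - rho)-quantile is about 1.2816.
for t in range(1, T + 1):
    J = rng.standard_normal()
    beta = 1.0 / t ** 0.7  # slower step-size, mimicking the second timescale
    # Robbins-Monro root-finding for E[(1 - rho) - 1{J <= gamma}] = 0,
    # whose root is the (1 - rho)-quantile of J.
    gamma += beta * ((1.0 - rho) - (1.0 if J <= gamma else 0.0))

print(abs(gamma - 1.2816) < 0.1)  # True
```

At equilibrium the indicator fires with probability \(1-\rho \), which is precisely the defining property of \(\gamma _{\rho }\).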
Remark 3
Similar results can also be obtained for Algorithm 2 with \(\bar{{\mathcal {J}}}_{p}\) replaced by \(\bar{{\mathcal {J}}}_{b}\) and \(\omega \) replaced by \(\upsilon \).
Finally, we analyze the asymptotic behaviour of the model sequence \(\{\theta _t\}_{t \in {\mathbb {N}}}\). As a preliminary requirement, we define \({\varPsi }_p(\omega , \theta ) = ({\varPsi }^{(0)}_p(\omega , \theta ), {\varPsi }^{(1)}_p(\omega , \theta ))^{\top }\), where
Similarly, we define \({\varPsi }_{b}\) with \({\mathcal {J}}_{p}\) replaced by \({\mathcal {J}}_{b}\) and \(\omega \) replaced by \(\upsilon \).
We now state our main theorems. The first theorem states that the model sequence \(\{\theta _{t}\}_{t \in {\mathbb {N}}}\) generated by Algorithm 1 almost surely converges to \(\theta ^{p*} = (z_{p}^{*}, 0_{k \times k})^{\top }\), the degenerate distribution concentrated at \(z_{p}^{*}\), where \(z_{p}^{*}\) is the solution to the optimization problem (25) which minimizes the error function MSPBE.
Theorem 1
(MSPBE Convergence) Let \(S(z) = \exp (rz)\), \(r \in \mathbb {R}_{+}\). Let \(\rho \in (0,1)\) and \(\lambda \in (0,1)\). Let \(\theta _0 = (\mu _0, qI_{k \times k})^{\top }\), where \(q \in \mathbb {R}_{+}\). Let the step-size sequences \(\{\alpha _t\}_{t \in {\mathbb {N}}}\), \(\{\beta _t\}_{t \in {\mathbb {N}}}\) satisfy Eq. (36). Also let \(c_t \rightarrow 0\). Suppose \(\{\theta _t = (\mu _t, {\varSigma }_t)^{\top }\}_{t \in {\mathbb {N}}}\) is the sequence generated by Algorithm 1 and assume \(\theta _{t} \in {\varTheta }\), \(\forall t \in {\mathbb {N}}\). Also, let the Assumptions (A1), (A2) and (A3) hold. Further, we assume that there exists a continuously differentiable function \(V:U \rightarrow \mathbb {R}_{+}\), where \(U \subseteq \Theta \) is an open neighbourhood of \(\theta ^{p*}\) with \(\nabla V(\theta )^{\top }\Psi _{p}(\omega _{*}, \theta ) < 0\), \(\forall \theta \in U\smallsetminus \{\theta ^{p*}\}\) and \(\nabla V(\theta ^{p*})^{\top }\Psi _{p}(\omega _{*}, \theta ^{p*}) = 0\). Then, there exist \(q^{*} \in \mathbb {R}_{+}\) and \(r^{*} \in \mathbb {R}_{+}\) s.t. \(\forall q > q^{*}\) and \(\forall r > r^{*}\),
where \({\mathcal {J}}_{p}^{*}\) and \(z_{p}^{*}\) are defined in Eq. (25). Further, since \(\mathcal {J}_{p} = \mathrm {MSPBE}\), the algorithm SCE-MSPBEM converges to the global minimum of MSPBE a.s.
Proof
Please refer to the proof of Theorem 1 in Joseph and Bhatnagar (2018). \(\square \)
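Theorem 1 asserts that the model sequence collapses onto a degenerate distribution concentrated at the global minimizer. The mechanism can be illustrated with a plain batch CE iteration on a synthetic multimodal objective. This is a simplified sketch, not the paper's multi-timescale stochastic approximation variant; the objective, elite fraction, and smoothing constant are made-up illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)

def J(z):
    # Synthetic multimodal objective: global minimum at z = 3,
    # a shallower local minimum near z = -2.
    return -np.exp(-(z - 3.0) ** 2) - 0.5 * np.exp(-(z + 2.0) ** 2)

mu, sigma = 0.0, 5.0   # theta_0 = (mu_0, q) with q "large enough" (cf. q > q*)
rho, lam = 0.1, 0.7    # elite fraction and smoothing; illustrative values
for _ in range(60):
    z = rng.normal(mu, sigma, size=500)
    elite = z[np.argsort(J(z))[: int(rho * 500)]]  # samples past the quantile
    mu = lam * elite.mean() + (1 - lam) * mu       # smoothed model update
    sigma = lam * elite.std() + (1 - lam) * sigma

print(abs(mu - 3.0) < 0.1, sigma < 0.1)  # True True: degenerate at z* = 3
```

Despite the local minimum near \(z=-2\), a sufficiently wide initial model lets the elite samples concentrate in the global basin, after which the variance contracts to zero, mirroring \(\theta _t \rightarrow (z^{*}, 0_{k \times k})^{\top }\).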
Similarly for Algorithm 2, the following theorem states that the model sequence \(\{\theta _{t}\}_{t \in {\mathbb {N}}}\) generated by Algorithm 2 almost surely converges to \(\theta ^{b*} = (z_{b}^{*}, 0_{k \times k})^{\top }\), the degenerate distribution concentrated at \(z_{b}^{*}\), where \(z_{b}^{*}\) is the solution to the optimization problem (26) which minimizes the error function MSBR.
Theorem 2
(MSBR Convergence) Let \(S(z) = \exp (rz)\), \(r \in \mathbb {R}_{+}\). Let \(\rho \in (0,1)\) and \(\lambda \in (0,1)\). Let \(\theta _0 = (\mu _0, qI_{k \times k})^{\top }\), where \(q \in \mathbb {R}_{+}\). Let the step-size sequences \(\{\alpha _t\}_{t \in {\mathbb {N}}}\), \(\{\beta _t\}_{t \in {\mathbb {N}}}\) satisfy Eq. (36). Also let \(c_t \rightarrow 0\). Suppose \(\{\theta _t = (\mu _t, {\varSigma }_t)^{\top }\}_{t \in {\mathbb {N}}}\) is the sequence generated by Algorithm 2 and assume \(\theta _{t} \in {\varTheta }\), \(\forall t \in {\mathbb {N}}\). Also, let the Assumptions (A1), (A2) and (A3)\(^{\prime }\) hold. Further, we assume that there exists a continuously differentiable function \(V:U \rightarrow \mathbb {R}_{+}\), where \(U \subseteq \Theta \) is an open neighbourhood of \(\theta ^{b*}\) with \(\nabla V(\theta )^{\top }\Psi _{b}(\upsilon _{*}, \theta ) < 0\), \(\forall \theta \in U\smallsetminus \{\theta ^{b*}\}\) and \(\nabla V(\theta ^{b*})^{\top }\Psi _{b}(\upsilon _{*}, \theta ^{b*}) = 0\). Then, there exist \(q^{*} \in \mathbb {R}_{+}\) and \(r^{*} \in \mathbb {R}_{+}\) s.t. \(\forall q > q^{*}\) and \(\forall r > r^{*}\),
where \({\mathcal {J}}_{b}^{*}\) and \(z_{b}^{*}\) are defined in Eq. (26). Further, since \(\mathcal {J}_{b} = \mathrm {MSBR}\), the algorithm SCE-MSBRM converges to the global minimum of MSBR a.s.
Proof
Please refer to the proof of Theorem 1 in Joseph and Bhatnagar (2018). \(\square \)
Discussion of the proposed algorithms
The computational load of the algorithms SCE-MSPBEM and SCE-MSBRM is \({\varTheta }(k^{2})\) per iteration, primarily attributed to the computation of Eqs. (38) and (48) respectively. Least squares algorithms like LSTD and LSPE also require \({\varTheta }(k^{2})\) per iteration. However, LSTD requires the extra operation of inverting the \(k \times k\) matrix \(A_{T}\) (Algorithm 7), which demands an additional \({\varTheta }(k^{3})\) computational effort. (Note that LSPE also requires a \(k \times k\) matrix inversion.) This makes the overall complexity of LSTD and LSPE \({\varTheta }(k^{3})\). Further, in some cases the matrix \(A_{T}\) may not be invertible, in which case the pseudo-inverse of \(A_{T}\) needs to be computed, which is computationally even more expensive. Our algorithm does not require such an inversion procedure. Also, even though the complexity of first-order temporal difference algorithms such as TD(\(\lambda \)) and GTD2 is \({\varTheta }(k)\), the approximations they produced in the experiments we conducted turned out to be inferior to ours, and they also showed a slower rate of convergence than our algorithm. Another noteworthy characteristic exhibited by our algorithm is stability. Recall that the convergence of TD(0) is guaranteed by the requirement that the Markov chain of \(\mathrm {P}^{\pi }\) be ergodic with the sampling distribution \(\nu \) as its stationary distribution. The classic example of Baird's 7-star (Baird 1995) violates these restrictions and hence TD(0) is seen to diverge there. Our algorithm, however, does not impose such restrictions and shows stable behaviour even in non-ergodic cases such as Baird's example.
The most significant feature of SCE-MSPBEM/SCE-MSBRM, however, is its ability to find the global optimum. This characteristic enables it to produce high quality solutions when applied to nonlinear function approximation, where the objective function is in general non-convex. Also note that SCE-MSPBEM/SCE-MSBRM is a gradient-free technique and hence does not require strong structural restrictions on the objective function.
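The \({\varTheta }(k^{2})\)-versus-\({\varTheta }(k^{3})\) distinction above can be made concrete: a rank-1 update to a maintained matrix inverse costs \({\varTheta }(k^{2})\) via the Sherman–Morrison identity, whereas re-inverting from scratch costs \({\varTheta }(k^{3})\). The sketch below uses a symmetric stream \(\phi \phi ^{\top }\) for simplicity; LSTD's \(A_{T}\) is built from non-symmetric terms, for which the general two-vector form of the identity applies. This is an illustration of the complexity trade-off, not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(3)
k, eps = 8, 1e-2
A = eps * np.eye(k)        # regularized A, so the initial inverse exists
Ainv = np.eye(k) / eps     # its inverse, maintained incrementally

for _ in range(200):
    phi = rng.standard_normal(k)   # feature vector phi_t
    A += np.outer(phi, phi)        # A accumulates rank-1 terms
    # Sherman-Morrison rank-1 inverse update: Theta(k^2) per step,
    # versus Theta(k^3) for re-inverting A from scratch.
    w = Ainv @ phi
    Ainv -= np.outer(w, w) / (1.0 + phi @ w)

print(np.allclose(Ainv, np.linalg.inv(A)))  # True
```

The incrementally maintained inverse agrees with the directly computed one, while each step touches only \(k^2\) entries.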
Experimental results
We present here a numerical comparison of our algorithms with various state-of-the-art algorithms from the literature on some benchmark reinforcement learning problems. In each of the experiments, a random trajectory \(\{({\mathbf {s}}_t, {\mathbf {r}}_{t}, {\mathbf {s}}^{\prime }_{t})\}_{t=0}^{\infty }\) is chosen and all the algorithms are updated using it. Each \({\mathbf {s}}_t\) in \(\{({\mathbf {s}}_t, {\mathbf {r}}_{t}, {\mathbf {s}}^{\prime }_{t}), t \ge 0\}\) is sampled using an arbitrary distribution \(\nu \) over \({\mathbb {S}}\). The algorithms are run on 10 independent trajectories and the average of the results obtained is plotted. The x-axis in the plots is \(t/1000\), where t is the iteration number. The function \(S(\cdot )\) is chosen as \(S(x) = \exp {(rx)}\), where \(r \in \mathbb {R}_{+}\) is chosen appropriately. In all the test cases, the evolution of the model sequence \(\{\theta _t\}\) across independent trials was almost homogeneous, and hence we omit the standard error bars from our plots.
We evaluated the performance of our algorithms on the following benchmark problems:

1.
Linearized cart-pole balancing (Dann et al. 2014).

2.
5-link actuated pendulum balancing (Dann et al. 2014).

3.
Baird’s 7-star MDP (Baird 1995).

4.
10-state ring MDP (Kveton et al. 2006).

5.
MDPs with radial basis functions and Fourier basis functions (Konidaris et al. 2011).

6.
Settings involving nonlinear function approximation (Tsitsiklis and Roy 1997).
Experiment 1: linearized cart-pole balancing (Dann et al. 2014)

Setup: A pole with mass m and length l is connected to a cart of mass M. The pole can rotate \(360^{\circ }\) and the cart is free to move in either direction within the bounds of a linear track (Fig. 2).

Goal: To balance the pole upright and the cart at the centre of the track.

State space: The 4tuple \((x, \dot{x}, \psi , \dot{\psi })^{\top }\) \(\in \mathbb {R}^{4}\), where \(\psi \) is the angle of the pendulum with respect to the vertical axis, \(\dot{\psi }\) is the angular velocity, x the relative cart position from the centre of the track and \(\dot{x}\) is its velocity.

Control space: The controller applies a horizontal force \(a \in \mathbb {R}\) on the cart parallel to the track. The stochastic policy used in this setting corresponds to \(\pi (a \vert s) = {\mathcal {N}}(a \vert \beta _{1}^{\top }s, \sigma _1^{2})\), where \(\beta _1 \in \mathbb {R}^{4}\) and \(\sigma _1 \in \mathbb {R}\).

System dynamics: The dynamical equations of the system are given by
$$\begin{aligned} \ddot{\psi }= & {} \frac{-3ml\dot{\psi }^{2}\sin {\psi }\cos {\psi }+(6M+m)g\sin {\psi }-6(a-b\dot{\psi })\cos {\psi }}{4l(M+m)-3ml\cos {\psi }}, \end{aligned}$$(93)
$$\begin{aligned} \ddot{x}= & {} \frac{-2ml\dot{\psi }^{2}\sin {\psi }+3mg\sin {\psi }\cos {\psi }+4a-4b\dot{\psi }}{4(M+m)-3m\cos {\psi }}. \end{aligned}$$(94)
By making further assumptions on the initial conditions, the system dynamics can be approximated accurately by the linear system
$$\begin{aligned} \begin{bmatrix} \psi _{t+1}\\ {\dot{\psi }}_{t+1}\\ x_{t+1}\\ \dot{x}_{t+1} \end{bmatrix} = \begin{bmatrix} \psi _{t}\\ \dot{\psi }_{t}\\ x_{t}\\ \dot{x}_{t} \end{bmatrix} + {\varDelta } t \begin{bmatrix} \dot{\psi }_{t} \\ \frac{3(M+m)\psi _t-3a+3b\dot{\psi _t}}{4Ml-ml} \\ \dot{x}_{t} \\ \frac{3mg\psi _t + 4a - 4b\dot{\psi _t}}{4M-m} \end{bmatrix} + \begin{bmatrix} 0 \\ 0 \\ 0 \\ {\mathbf {z}} \end{bmatrix}, \end{aligned}$$(95)where \({\varDelta } t\) is the integration time step, i.e., the time difference between two transitions, and \({\mathbf {z}}\) is a Gaussian noise on the velocity of the cart with standard deviation \(\sigma _{2}\).

Reward function: \(\mathrm {R}(s, a) = \mathrm {R}(\psi , \dot{\psi }, x, \dot{x}, a) = -100\psi ^2 - x^2 - \frac{1}{10}a^2\).

Feature vectors: \(\phi (s \in \mathbb {R}^{4}) = (1, s_{1}^{2}, s_{2}^{2}, \ldots , s_{1}s_{2}, s_{1}s_{3}, \ldots , s_{3}s_{4})^{\top } \in \mathbb {R}^{11}\).

Evaluation policy: The policy evaluated in the experiment is the optimal policy \(\pi ^{*}(a  s) = {\mathcal {N}}(a  {\beta _{1}^{*}}^{\top }s, {\sigma _{1}^{*}}^{2})\). The parameters \(\beta _{1}^{*}\) and \(\sigma _{1}^{*}\) are computed using dynamic programming. The feature set chosen above is a perfect feature set, i.e., \(V^{\pi ^{*}} \in \{{\varPhi } z \vert z \in \mathbb {R}^{k}\}\).
Here the sample trajectory is obtained by a continuous rollout of a particular realization of the underlying Markov chain and hence it is of on-policy nature. Therefore, the sampling distribution is the stationary distribution (steady-state distribution) of the Markov chain induced by the policy being evaluated (see Remark 1). The various parameter values we used in our experiment are provided in Table 3 of “Appendix”. The results of the experiments are shown in Fig. 3.
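A single transition of the linearized dynamics of Eq. (95) can be sketched as below. The parameter values are placeholders (not those of Table 3), the state ordering \((\psi , \dot{\psi }, x, \dot{x})\) is our reading of the display, and the noise enters only on the cart velocity, as stated in the text.

```python
import numpy as np

rng = np.random.default_rng(4)

# Placeholder parameter values (NOT those of Table 3); state ordering
# assumed to be s = (psi, psi_dot, x, x_dot), with noise on x_dot.
M, m, l, g, b = 0.5, 0.5, 0.6, 9.81, 0.1
dt, sigma2 = 0.1, 0.01

def step(s, a):
    """One transition of the linearized dynamics of Eq. (95)."""
    psi, psi_dot, x, x_dot = s
    psi_ddot = (3 * (M + m) * psi - 3 * a + 3 * b * psi_dot) / (4 * M * l - m * l)
    x_ddot = (3 * m * g * psi + 4 * a - 4 * b * psi_dot) / (4 * M - m)
    drift = np.array([psi_dot, psi_ddot, x_dot, x_ddot])
    noise = np.array([0.0, 0.0, 0.0, rng.normal(0.0, sigma2)])
    return s + dt * drift + noise

s = np.zeros(4)
for _ in range(5):
    s = step(s, a=0.0)   # roll the chain forward under a zero-force action
```

Rolling this chain forward under the evaluation policy yields exactly the kind of on-policy trajectory \(\{({\mathbf {s}}_t, {\mathbf {r}}_t, {\mathbf {s}}'_t)\}\) the algorithms consume.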
Experiment 2: 5-link actuated pendulum balancing (Dann et al. 2014)

Setup: 5 independent poles, each with mass m and length l, connected using 5 rotational joints, with the top pole being a pendulum (Fig. 4).

Goal: To keep all the poles in the upright position by applying independent torques at each joint.

State space: The state \(s = (q, \dot{q})^{\top } \in \mathbb {R}^{10}\), where \(q = (\psi _{1}, \psi _{2}, \psi _{3}, \psi _{4}, \psi _{5}) \in \mathbb {R}^{5}\) and \(\dot{q} = (\dot{\psi }_{1}, \dot{\psi }_{2}, \dot{\psi }_{3}, \dot{\psi }_{4}, \dot{\psi }_{5}) \in \mathbb {R}^{5}\) with \(\psi _{i}\) being the angle of the pole i with respect to the vertical axis and \(\dot{\psi }_{i}\) the angular velocity.

Control space: The action \(a = (a_{1}, a_{2}, \ldots , a_{5})^{\top } \in \mathbb {R}^{5}\), where \(a_{i}\) is the torque applied to the joint i. The stochastic policy used in this setting corresponds to \(\pi (a \vert s) = {\mathcal {N}}_{5}(a \vert \beta _{1}^{\top }s, \sigma _1^{2})\), where \(\beta _1 \in \mathbb {R}^{10 \times 5}\) and \(\sigma _1 \in \mathbb {R}^{5 \times 5}\).

System dynamics: The approximate linear system dynamics is given by
$$\begin{aligned} \begin{bmatrix} q_{t+1}\\ \dot{q}_{t+1} \end{bmatrix} = \begin{bmatrix} I&{\varDelta } t\, I\\ {\varDelta } t \,M^{-1}U&I \end{bmatrix}\begin{bmatrix}q_{t}\\ \dot{q}_{t}\end{bmatrix} + {\varDelta } t \begin{bmatrix} 0 \\ M^{-1} \end{bmatrix}a + {\mathbf {z}}, \end{aligned}$$(96)where \({\varDelta } t\) is the integration time step, i.e., the time difference between two transitions, M is the mass matrix in the upright position with \(M_{ij} = l^{2}(6-\max (i,j))m\), and U is a diagonal matrix with \(U_{ii} = gl(6-i)m\). Each component of \({\mathbf {z}}\) is a Gaussian noise.

Reward function: \(\mathrm {R}(q, \dot{q}, a) = -q^{\top }q\).

Feature vectors: \(\phi (s \in \mathbb {R}^{10}) = (1, s_{1}^{2}, s_{2}^{2}, \ldots , s_{1}s_{2}, s_{1}s_{3}, \ldots , s_{9}s_{10})^{\top } \in \mathbb {R}^{46}\).

Evaluation policy: The policy evaluated in the experiment is the optimal policy \(\pi ^{*}(a  s) = {\mathcal {N}}(a  {\beta _{1}^{*}}^{\top }s, {\sigma _{1}^{*}}^{2})\). The parameters \(\beta _{1}^{*}\) and \(\sigma _{1}^{*}\) are computed using dynamic programming. The feature set chosen above is a perfect feature set, i.e., \(V^{\pi ^{*}} \in \{{\varPhi } z \vert z \in \mathbb {R}^{k}\}\).
Similar to the earlier experiment, here also the sample trajectory is of on-policy nature and therefore the sampling distribution is the steady-state distribution of the Markov chain induced by the policy being evaluated (see Remark 1). The various parameter values we used in our experiment are provided in Table 4 of Appendix. Note that we have used constant step-sizes in this experiment. The results of the experiment are shown in Fig. 5.
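The structured matrices of Eq. (96) are straightforward to build; a sketch follows, reading the definitions as \(M_{ij} = l^{2}(6-\max (i,j))m\) and \(U_{ii} = gl(6-i)m\). The values \(m = l = 1\) and \({\varDelta } t = 0.05\) are placeholders, not the paper's.

```python
import numpy as np

# Mass matrix M_ij = l^2 (6 - max(i, j)) m and diagonal U_ii = g l (6 - i) m,
# for i, j = 1..5; m = l = 1 and dt are placeholders, not the paper's values.
m, l, g, dt = 1.0, 1.0, 9.81, 0.05
idx = np.arange(1, 6)
Mmat = l ** 2 * (6 - np.maximum.outer(idx, idx)) * m
U = np.diag(g * l * (6 - idx) * m)

# The linear dynamics matrix of Eq. (96): [[I, dt I], [dt M^{-1} U, I]].
F = np.block([[np.eye(5), dt * np.eye(5)],
              [dt * np.linalg.inv(Mmat) @ U, np.eye(5)]])

print(Mmat[0, 0], Mmat[0, 4], F.shape)  # 5.0 1.0 (10, 10)
```

The block structure makes the \(10 \times 10\) transition matrix explicit: positions integrate velocities, and velocities are driven through \(M^{-1}U\).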
Experiment 3: Baird’s 7-star MDP (Baird 1995)
Our algorithm was also tested on Baird’s star problem (Baird 1995). We call this the stability test because the Markov chain in this case is not ergodic, and it is a classic example where TD(0) is seen to diverge (Baird 1995). We consider here an MDP with \(\vert {\mathbb {S}} \vert = 7\), \(\vert {\mathbb {A}} \vert = 2\) and \(k = 8\). The setting is illustrated in Fig. 6. We let the sampling distribution \(\nu \) be the uniform distribution over \({\mathbb {S}}\). The feature matrix \({\varPhi }\) and the transition matrix \(P^{\pi }\) are given by
The reward function is given by \(\mathrm {R}(s, s^{\prime }) = 0\), \(\forall s, s^{\prime } \in {\mathbb {S}}\). The performance comparison of the algorithms GTD2, TD(0) and LSTD(0) with SCE-MSPBEM is shown in Fig. 7. Here, the performance metric used for comparison is the \(\sqrt{\mathrm {MSE}(\cdot )}\) of the prediction vector generated by the corresponding algorithm at time t.
The algorithm parameter values used in the experiment are provided in Table 5 of Appendix.
A careful analysis in Schoknecht and Merke (2002) has shown that when the discount factor \(\gamma \le 0.88\), TD(0) converges with an appropriate learning rate. Nonetheless, the same paper shows that for discount factor \(\gamma = 0.9\), TD(0) diverges for all values of the learning rate. This is explicitly demonstrated in Fig. 7. However, our algorithm SCE-MSPBEM converges in both cases, which demonstrates its stable behaviour.
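The divergence of TD(0) on this example can be checked directly from the expected-update matrix. The feature assignment below is the standard Baird construction (an assumption of this sketch, since Eq. (97) is not reproduced here): the six outer states use \(\phi (s_i) = 2e_i + e_8\), the centre state uses \(\phi (s_7) = e_7 + 2e_8\), every transition enters the centre state, rewards are zero, and \(\nu \) is uniform.

```python
import numpy as np

# Baird's 7-star with the standard feature assignment (an assumption here):
# phi(s_i) = 2 e_i + e_8 for the six outer states, phi(s_7) = e_7 + 2 e_8,
# every transition enters s_7, nu is uniform, and all rewards are zero.
k, n = 8, 7
Phi = np.zeros((n, k))
for i in range(6):
    Phi[i, i], Phi[i, 7] = 2.0, 1.0
Phi[6, 6], Phi[6, 7] = 1.0, 2.0

def td0_matrix(gamma):
    # Expected TD(0) update is z <- z + alpha B z with
    # B = sum_s nu(s) phi(s) (gamma phi(s_7) - phi(s))^T.
    B = np.zeros((k, k))
    for s in range(n):
        B += np.outer(Phi[s], gamma * Phi[6] - Phi[s]) / n
    return B

lam = np.linalg.eigvals(td0_matrix(0.9)).real.max()
print(lam > 0)  # True: an unstable mode, so TD(0) diverges for every step-size
```

A positive real part in the spectrum of the expected-update matrix means the linear recursion \(z_{t+1} = z_t + \alpha B z_t\) is unstable for every positive step-size, consistent with the divergence reported above.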
The algorithms were also compared on the same Baird’s 7star, but with a different feature matrix \({\varPhi }_{1}\) as under.
In this case, the reward function is given by \(\mathrm {R}(s, s^{\prime })=2.0\), \(\forall s, s^{\prime } \in {\mathbb {S}}\). Note that \({\varPhi }_{1}\) gives an imperfect feature set. The algorithm parameter values used are the same as earlier. The results are shown in Fig. 8. In this case also, TD(0) diverges, whereas SCE-MSPBEM again exhibits good stable behaviour.
Experiment 4: 10-state ring MDP (Kveton et al. 2006)
Next, we studied the performance of the algorithms on a 10-ring MDP with \(\vert {\mathbb {S}}\vert = 10\) and \(k = 8\). The setting is illustrated in Fig. 9. We let the sampling distribution \(\nu \) be the uniform distribution over \({\mathbb {S}}\). The transition matrix \(P^{\pi }\) and the feature matrix \({\varPhi }\) are given by
The reward function is \(\mathrm {R}(s, s^{\prime })=1.0, \forall s, s^{\prime } \in {\mathbb {S}}\).
The performance comparisons of the algorithms GTD2, TD(0) and LSTD(0) with SCE-MSPBEM are shown in Fig. 10. The performance metric used here is the \(\sqrt{\mathrm {MSE}(\cdot )}\) of the prediction vector generated by the corresponding algorithm at time t. The Markov chain in this case is ergodic and the uniform distribution over \({\mathbb {S}}\) is indeed its stationary distribution. So theoretically all the algorithms should converge, and the results in Fig. 10 confirm this. However, there is a significant difference in the rates of convergence of the various algorithms for large values of the discount factor \(\gamma \). For \(\gamma = 0.99\), the results show that GTD2 and RG trail behind the other methods, while our method is behind only LSTD and outperforms TD(0), RG and GTD2. The algorithm parameter values used in the experiment are provided in Table 6 of “Appendix”.
Experiment 5: random MDP with radial basis functions and Fourier basis
We designed these toy experiments ourselves. Here, the tests are performed using standard basis functions to demonstrate that the algorithm does not depend on any particular feature set. Two types of feature sets are considered: Fourier basis functions and radial basis functions (RBF).
The Fourier basis functions (Konidaris et al. 2011) are defined as follows:
The radial basis functions are defined as follows:
with \(m_i\) and \(v_i\) fixed a priori.
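Both feature sets are easy to instantiate. The sketch below uses one common form for each; the exact normalization used in the paper's definitions is an assumption of this sketch, while the centres \(m_i = 10 + 20(i-1)\) and widths \(v_i = 10\) follow the values quoted later for the RBF case.

```python
import numpy as np

# One common form for the two feature sets; the exact normalization used
# in the paper's definitions is an assumption of this sketch.
n_states, k = 1000, 50
centres = 10 + 20 * np.arange(k)   # m_i = 10 + 20 (i - 1), as in the text
widths = np.full(k, 10.0)          # v_i = 10

def rbf(s):
    # Gaussian radial basis functions centred at m_i with width v_i.
    return np.exp(-((s - centres) ** 2) / (2.0 * widths ** 2))

def fourier(s):
    # Fourier cosine features on the discrete state space {0, ..., |S|-1}.
    return np.cos(np.pi * np.arange(k) * s / n_states)

print(rbf(10).argmax(), fourier(0)[0])  # 0 1.0
```

Either map turns a state index into a length-\(k\) feature vector \(\phi (s)\), which is all the prediction algorithms require.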
In both the cases, the reward function is given by
where the vector \(G \in (0,1)^{\vert {\mathbb {S}} \vert }\) is initialized for the algorithm with \(G(s) \sim U(0,1), \forall s \in {\mathbb {S}}\).
Also in both the cases, the transition probability matrix \(P^{\pi }\) is generated as follows:
where the vector \(b \in (0,1)^{\vert {\mathbb {S}} \vert }\) is initialized for the algorithm with \(b(s) \sim U(0,1), \forall s \in {\mathbb {S}}\). It is easy to verify that the Markov chain defined by \(\mathrm {P}_{\pi }\) is ergodic.
In the case of RBF, we let \(\vert {\mathbb {S}}\vert = 1000\), \(\vert {\mathbb {A}}\vert = 200\), \(k = 50\), \(m_i = 10+20(i-1)\) and \(v_i = 10\), while for Fourier basis functions, we let \(\vert {\mathbb {S}}\vert = 1000\), \(\vert {\mathbb {A}}\vert = 200\), \(k = 50\). In both cases, the distribution \(\nu \) is the stationary distribution of the Markov chain. The simulation is run sufficiently long to ensure that the chain achieves its steady-state behaviour, i.e., the states appear with the stationary distribution. The algorithm parameter values used in the experiment are provided in Table 7 of “Appendix” and the results obtained are provided in Figs. 11 and 12.
Also note that when the Fourier basis is used, the discount factor is \(\gamma =0.9\), while for RBFs, \(\gamma =0.01\). SCE-MSPBEM exhibits good convergence behaviour in both cases, which shows the non-dependence of SCE-MSPBEM on the discount factor \(\gamma \). This is important because in Schoknecht and Merke (2003), the performance of TD methods is shown to be dependent on the discount factor \(\gamma \).
To measure how well our algorithm scales with respect to the size of the state space, we applied it on a medium-sized MDP,^{Footnote 10} where \(\vert {\mathbb {S}} \vert = 2^{15}\), \(\vert {\mathbb {A}} \vert = 50\), \(k = 100\) and \(\gamma =0.9\). This is the stress test. The reward function \(\mathrm {R}\) and the transition probability matrix \(\mathrm {P}_{\pi }\) are generated using Eqs. (101) and (102) respectively. RBFs are used as the features in this case. Since the MDP is huge, the algorithms were run on Amazon cloud servers. The true value function \(V^{\pi }\) was computed and the \(\sqrt{\text {MSE}}\)s of the prediction vectors generated by the different algorithms were compared. The performance results are shown in Table 2. The results show that the performance of our algorithm does not seem affected by the complexity of the MDP.
Experiment 6: nonlinear function approximation of value function (Tsitsiklis and Roy 1997)
To demonstrate the flexibility and robustness of our approach, we also consider a few nonlinear function approximation RL settings. The landscape in the nonlinear setting is mostly non-convex and therefore multiple local optima exist. The stable nonlinear function approximation extension of GTD2 is only shown to converge to local optima (Maei et al. 2009). We believe that the nonlinear setting offers the perfect scaffolding to demonstrate the global convergence property of our approach.
Experiment 6.1: Van Roy and Tsitsiklis MDP (Tsitsiklis and Roy 1997)
This particular setting is designed in Tsitsiklis and Roy (1997) to show the divergence of the standard TD(0) algorithm in reinforcement learning under a nonlinear approximation architecture.
We consider here a discrete time Markov chain with state space \({\mathbb {S}} = \{1,2,3\}\), discount factor \(\gamma = 0.9\), the reward function \(\mathrm {R}(s, s^{\prime }) = 0, \forall s, s^{\prime } \in {\mathbb {S}}\) and the transition probability matrix as under:
Note that the Markov chain is ergodic and hence the sample trajectory is obtained by following the dynamics of the Markov chain. Therefore the steady-state distribution of the Markov chain is indeed the sampling distribution. Here, we minimize the MSBR error function, which demands double sampling as prescribed in Assumption \(A3^{\prime }\). The optimization problem is as follows:
where \(\delta _{t} = {\mathbf {r}}_{t}+\gamma \psi _{\eta }({\mathbf {s}}^{\prime }_{t})-\psi _{\eta }({\mathbf {s}}_{t})\). We also have
where \(a = [100, -70, -30]^{\top }\), \(b = [-23.094, 98.15, -75.056]^{\top }\), \(\tau = 0.01\) and \(\epsilon = 0.001\). Here \(\psi _{\eta }\) defines the projected nonlinear manifold. The true value function of this particular setting is \(V = (0, 0, 0)^{\top }\).
Now the challenge here is to best approximate the true value function V using the family of nonlinear functions \(\psi _{\eta }\), parametrized by \(\eta \in \mathbb {R}\), by solving the optimization problem (103). It is easy to see that \(\psi _{\infty } = V\) and hence this is a degenerate setting.
The objective function in Eq. (103) can be rearranged as
where \(\upsilon \in \mathbb {R}^{3 \times 3}\) with \(\upsilon = (\upsilon _{ij})_{1 \le i,j \le 3} \triangleq {\mathbb {E}}[ h _{t}]{\mathbb {E}}[ h ^{\prime }_{t}]^{\top }\). Here \( h _{t} = [{\mathbf {r}}_{t}, a({\mathbf {s}}^{\prime }_{t})-a({\mathbf {s}}_{t}), b({\mathbf {s}}^{\prime }_{t})-b({\mathbf {s}}_{t})]^{\top }\) and \( h ^{\prime }_{t} = [{\mathbf {r}}^{\prime }_{t}, a({\mathbf {s}}^{\prime \prime }_{t})-a({\mathbf {s}}_{t}), b({\mathbf {s}}^{\prime \prime }_{t})-b({\mathbf {s}}_{t})]^{\top }\).
Now we maintain the time-indexed random matrix \(\upsilon ^{(t)} \in \mathbb {R}^{3 \times 3}\) with \(\upsilon ^{(t)} = (\upsilon ^{(t)}_{ij})_{1 \le i,j \le 3}\) and employ the following recursion to track \(\upsilon \):
Also, we define
Now we solve the optimization problem (103) using Algorithm 2 with the objective function defined in Eq. (106) (i.e., using Eq. (105) instead of Eq. (48) and Eq. (106) instead of Eq. (49) respectively).
The various parameter values used in the experiment are provided in Table 8 of Appendix. The results of the experiment are shown in Fig. 13. The x-axis is the iteration number t. The performance measure considered here is the mean squared error (MSE), which is defined in Eq. (5). Algorithm 2 is seen to clearly outperform TD(0) and GTD2 here.
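The tracking recursion (105) used above is, in form, a running average driven by outer products \(h_t h_t'^{\top }\), which converges to \({\mathbb {E}}[h_t]{\mathbb {E}}[h'_t]^{\top }\) because the two samples are independent. A minimal sketch with synthetic stand-ins for \(h_t\) and \(h'_t\) (the true recursion uses the sampled quantities defined earlier, not these):

```python
import numpy as np

rng = np.random.default_rng(5)
T = 100_000
upsilon = np.zeros((3, 3))

# Synthetic stand-ins for h_t and h'_t: independent draws sharing a known
# mean mu (an illustrative assumption), so upsilon should converge to
# E[h_t] E[h'_t]^T = mu mu^T.
mu = np.array([1.0, -2.0, 0.5])
for t in range(1, T + 1):
    h = mu + rng.standard_normal(3)
    h_prime = mu + rng.standard_normal(3)   # the independent (double) sample
    alpha = 1.0 / t
    # upsilon^(t+1) = upsilon^(t) + alpha_t (h_t h_t'^T - upsilon^(t))
    upsilon += alpha * (np.outer(h, h_prime) - upsilon)

print(np.abs(upsilon - np.outer(mu, mu)).max() < 0.1)  # True
```

Independence of the two draws is what removes the cross-moment bias, which is why Assumption \(A3^{\prime }\) (double sampling) is needed for MSBR.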
Experiment 6.2: Baird’s 7-star MDP using nonlinear function approximation
Here, we consider the Baird’s 7-star MDP defined in Sect. 8.3 with discount factor \(\gamma = 0.9\), \(k=8\) and the sampling distribution taken to be the uniform distribution over \({\mathbb {S}}\). To perform the nonlinear function approximation, we consider the nonlinear manifold given by \(\{{\varPhi } h(z) \vert z \in \mathbb {R}^{8}\}\), where \(h(z) \triangleq (\cos ^{2}{(z_1)}\exp {(0.01z_1)}, \cos ^{2}{(z_2)}\exp {(0.01z_2)}, \ldots , \cos ^{2}{(z_8)}\exp {(0.01z_8)})^{\top }\) and \({\varPhi }\) is defined in Eq. (97). The reward function is given by \(\mathrm {R}(s,s^{\prime }) = 0, \forall s,s^{\prime } \in {\mathbb {S}}\) and hence the true value function is \((0, 0, \ldots , 0)^{\top }_{7 \times 1}\). Due to the particular structure of the nonlinear manifold, one can directly apply SCE-MSBRM (Algorithm 2) with h(z) replacing z in Eq. (49). This setting presents a hard and challenging task for TD(\(\lambda \)) since we have already seen the erratic and unstable behaviour of TD(\(\lambda \)) in the linear function approximation version of Baird’s 7-star. This setting also proves to be a litmus test for determining the limits of the stability of the nonlinear function approximation version of GTD2. The results obtained are provided in Fig. 14. The various parameter values used in the experiment are provided in Table 9 of Appendix. It can be seen that whereas both TD(0) and GTD2 diverge here, SCE-MSBRM converges to the true value.
Experiment 6.3: 10-ring MDP using nonlinear function approximation
Here, we consider the 10-ring MDP defined in Sect. 8.4 with discount factor \(\gamma = 0.99\), \(k=8\) and the sampling distribution taken as the stationary distribution of the underlying Markov chain. We consider the nonlinear manifold given by \(\{{\varPhi } h(z) \vert z \in \mathbb {R}^{8}\}\), where \(h(z) \triangleq (\cos ^{2}{(z_1)}\exp {(0.1z_1)}, \cos ^{2}{(z_2)}\exp {(0.1z_2)}, \ldots ,\) \(\cos ^{2}{(z_8)}\exp {(0.1z_8)})^{\top }\) and \({\varPhi }\) is defined in Eq. (98). The reward function is given by \(\mathrm {R}(s,s^{\prime }) = 0, \forall s,s^{\prime } \in {\mathbb {S}}\) and hence the true value function is \((0, 0, \ldots , 0)^{\top }_{10 \times 1}\). Similar to the previous experiment, one can directly apply SCE-MSBRM (Algorithm 2) with h(z) replacing z in Eq. (49). The results obtained are provided in Fig. 15. The various parameter values we used are provided in Table 10 of the Appendix. GTD2 does not converge to the true value here, while both SCE-MSBRM and TD(0) do, with TD(0) marginally better.
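The sampling distribution used here, the stationary distribution \(\nu \) of the chain, satisfies \(\nu \mathrm {P}^{\pi } = \nu \) with \(\sum _{s} \nu (s) = 1\). A minimal sketch of computing it, assuming (purely for illustration) a deterministic 10-state ring transition matrix:

```python
import numpy as np

n = 10
P = np.zeros((n, n))
for i in range(n):
    P[i, (i + 1) % n] = 1.0  # hypothetical deterministic ring dynamics

# solve nu P = nu together with sum(nu) = 1 as one linear system
A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
b = np.zeros(n + 1)
b[-1] = 1.0
nu, *_ = np.linalg.lstsq(A, b, rcond=None)
print(nu)  # uniform: every entry 0.1
```

For an irreducible chain the augmented system has a unique solution, so the least-squares solve recovers the stationary distribution exactly; for the symmetric ring it is uniform.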
Conclusion
We proposed, for the first time, an application of the cross entropy (CE) method to the prediction problem in reinforcement learning (RL) under the linear function approximation architecture. This task is accomplished by employing a multi-timescale stochastic approximation variant of the cross entropy optimization method to minimize the mean squared projected Bellman error (MSPBE) and mean squared Bellman residual (MSBR) objectives. Proofs of convergence of the algorithms to the optimum values using the ODE method are also provided. The theoretical analysis is supplemented by extensive experimental evaluation that corroborates our claims. Experimental comparisons with the state-of-the-art algorithms show the superiority of our approach in terms of stability and accuracy, while remaining competitive with regard to computational efficiency and rate of convergence. As future work, one may design similar cross entropy approaches for both prediction and control problems. Further numerical experiments involving other nonlinear function approximators, delayed rewards, etc., may be tried as well.
Notes
 1.
The policy can also be stochastic in order to incorporate exploration. In that case, for a given \(s \in {\mathbb {S}}\), \(\pi (\cdot \vert s)\) is a probability distribution over the action space \({\mathbb {A}}\).
 2.
The prediction problem is related to policy evaluation, except that the latter procedure evaluates the value of a policy given complete model information. We, by contrast, are in an online setting where the model information is completely unknown; however, a realization of the model dynamics in the form of a sample trajectory as described above is made available incrementally. The goal then is to predict, at each time instant, the value of each state in \({\mathbb {S}}\) (both observed and unobserved) under this constraint, using the sample trajectory revealed till that instant.
 3.
One may find the term off-policy to be a misnomer in this context. Usually, on-policy refers to RL settings where the underlying Markovian system is assumed ergodic and the sample trajectory provided follows the dynamics of the system. Hence, off-policy can be interpreted as the negation of this definition of on-policy, and in that sense our setting is indeed off-policy. See Sutton et al. (2009).
 4.
 5.
The algorithms with eligibility traces are named with \((\lambda )\) appended, for example TD\((\lambda )\), LSTD\((\lambda )\) etc.
 6.
The problem here is to find the optimal basis of the MDP.
 7.
The basis adaptation problem is to find the best parameters of the basis functions for a given policy.
 8.
A sufficient condition is that the columns of the feature matrix \({\varPhi }\) are linearly independent.
 9.
For detailed technical information pertaining to filtrations and \(\sigma \)-fields, refer to Borkar (2012).
 10.
This is the biggest MDP we could deploy on a machine with a 3.2 GHz processor and 8 GB of memory.
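The sufficient condition in note 8 is easy to verify numerically: the columns of \({\varPhi }\) are linearly independent exactly when \({\varPhi }\) has full column rank. A minimal sketch using a small hypothetical feature matrix:

```python
import numpy as np

# hypothetical 3-state, 2-feature matrix; any stand-in for Phi works here
Phi = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [1.0, 1.0]])

# columns are linearly independent iff Phi has full column rank
full_col_rank = np.linalg.matrix_rank(Phi) == Phi.shape[1]
print(full_col_rank)  # True
```

A matrix with a repeated column, e.g. `np.ones((3, 2))`, has rank 1 and fails the check.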
References
Baird, L. (1995). Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the twelfth international conference on machine learning (pp. 30–37).
Benveniste, A., Métivier, M., & Priouret, P. (2012). Adaptive Algorithms and Stochastic Approximations (Vol. 22). Berlin: Springer.
Bertsekas, D. P. (2013). Dynamic programming and optimal control (Vol. 2). Belmont: Athena Scientific.
Borkar, V. S. (1997). Stochastic approximation with two time scales. Systems & Control Letters, 29(5), 291–294.
Borkar, V. S. (2008). Stochastic approximation: A dynamical systems viewpoint. Cambridge: Cambridge University Press.
Borkar, V. S. (2012). Probability theory: An advanced course. Berlin: Springer.
Borkar, V. S., & Meyn, S. P. (2000). The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization, 38(2), 447–469.
Boyan, J. A. (2002). Technical update: Least-squares temporal difference learning. Machine Learning, 49(2–3), 233–246.
Bradtke, S. J., & Barto, A. G. (1996). Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1–3), 33–57.
Busoniu, L., Ernst, D., De Schutter, B., & Babuska, R. (2009). Policy search with cross-entropy optimization of basis functions. In IEEE symposium on adaptive dynamic programming and reinforcement learning, 2009. ADPRL’09 (pp. 153–160). IEEE.
Crites, R. H., & Barto, A. G. (1996). Improving elevator performance using reinforcement learning. In Advances in neural information processing systems (pp. 1017–1023).
Dann, C., Neumann, G., & Peters, J. (2014). Policy evaluation with temporal differences: A survey and comparison. The Journal of Machine Learning Research, 15(1), 809–883.
De Boer, P. T., Kroese, D. P., Mannor, S., & Rubinstein, R. Y. (2005). A tutorial on the cross-entropy method. Annals of Operations Research, 134(1), 19–67.
Dorigo, M., & Gambardella, L. M. (1997). Ant colony system: a cooperative learning approach to the traveling salesman problem. IEEE Transactions on Evolutionary Computation, 1(1), 53–66.
Doya, K. (2000). Reinforcement learning in continuous time and space. Neural Computation, 12(1), 219–245.
Eldracher, M., Staller, A., & Pompl, R. (1994). Function approximation with continuous valued activation functions in CMAC. Inst. für Informatik.
Hu, J., Fu, M. C., & Marcus, S. I. (2007). A model reference adaptive search method for global optimization. Operations Research, 55(3), 549–568.
Hu, J., Fu, M. C., & Marcus, S. I. (2008). A model reference adaptive search method for stochastic global optimization. Communications in Information & Systems, 8(3), 245–276.
Hu, J., & Hu, P. (2009). On the performance of the crossentropy method. In Proceedings of the 2009 winter simulation conference (WSC) (pp. 459–468). IEEE.
Joseph, A. G., & Bhatnagar, S. (2016). A randomized algorithm for continuous optimization. In Winter simulation conference, WSC 2016, Washington, DC, USA, (pp. 907–918).
Joseph, A. G., & Bhatnagar, S. (2016). Revisiting the cross entropy method with applications in stochastic global optimization and reinforcement learning. Frontiers in artificial intelligence and applications (ECAI 2016) (Vol. 285, pp. 1026–1034). https://doi.org/10.3233/978-1-61499-672-9-1026
Joseph, A. G., & Bhatnagar, S. (2018). A cross entropy based optimization algorithm with global convergence guarantees. CoRR (arXiv:1801.10291).
Kaelbling, L. P., Littman, M. L., & Moore, A. W. (1996). Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4, 237–285.
Kober, J., Bagnell, J. A., & Peters, J. (2013). Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11), 1238–1274.
Konidaris, G., Osentoski, S., & Thomas, P. S. (2011). Value function approximation in reinforcement learning using the Fourier basis. In Twentyfifth AAAI conference on artificial intelligence.
Kubrusly, C., & Gravier, J. (1973). Stochastic approximation algorithms and applications. In 1973 IEEE conference on decision and control including the 12th symposium on adaptive processes (Vol. 12, pp. 763–766).
Kullback, S. (1959). Statistics and information theory. New York: Wiley.
Kushner, H. J., & Clark, D. S. (1978). Stochastic approximation for constrained and unconstrained systems. New York: Springer.
Kveton, B., Hauskrecht, M., & Guestrin, C. (2006). Solving factored MDPs with hybrid state and action variables. Journal of Artificial Intelligence Research (JAIR), 27, 153–201.
Lagoudakis, M. G., & Parr, R. (2003). Least-squares policy iteration. The Journal of Machine Learning Research, 4, 1107–1149.
Ljung, L. (1977). Analysis of recursive stochastic algorithms. IEEE Transactions on Automatic Control, 22(4), 551–575.
Maei, H. R., Szepesvári, C., Bhatnagar, S., Precup, D., Silver, D., & Sutton, R. S. (2009). Convergent temporaldifference learning with arbitrary smooth function approximation. In Advances in neural information processing systems (pp. 1204–1212).
Mannor, S., Rubinstein, R. Y., & Gat, Y. (2003). The cross entropy method for fast policy search. In International conference on machine learning (ICML 2003) (pp. 512–519).
Menache, I., Mannor, S., & Shimkin, N. (2005). Basis function adaptation in temporal difference reinforcement learning. Annals of Operations Research, 134(1), 215–238.
Morris, C. N. (1982). Natural exponential families with quadratic variance functions. The Annals of Statistics, 10(1), 65–80.
Mühlenbein, H., & Paass, G. (1996). From recombination of genes to the estimation of distributions I. Binary parameters. In Parallel problem solving from nature (PPSN IV) (pp. 178–187). Springer.
Nedić, A., & Bertsekas, D. P. (2003). Least squares policy evaluation algorithms with linear function approximation. Discrete Event Dynamic Systems, 13(1–2), 79–110.
Perko, L. (2013). Differential equations and dynamical systems (Vol. 7). Berlin: Springer.
Robbins, H., & Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, 22(3), 400–407.
Rubinstein, R. Y., & Kroese, D. P. (2013). The cross-entropy method: A unified approach to combinatorial optimization, Monte-Carlo simulation and machine learning. Berlin: Springer.
Scherrer, B. (2010). Should one compute the temporal difference fixed point or minimize the Bellman residual? The unified oblique projection view. In 27th International conference on machine learning (ICML 2010).
Schoknecht, R. (2002). Optimality of reinforcement learning algorithms with linear function approximation. In Advances in neural information processing systems (pp. 1555–1562).
Schoknecht, R., & Merke, A. (2002). Convergent combinations of reinforcement learning with linear function approximation. In Advances in neural information processing systems (pp. 1579–1586).
Schoknecht, R., & Merke, A. (2003). TD(0) converges provably faster than the residual gradient algorithm. In International conference on machine learning (ICML 2003) (pp. 680–687).
Silver, D., Sutton, R. S., & Müller, M. (2007). Reinforcement learning of local shape in the game of Go. In International joint conference on artificial intelligence (IJCAI) (Vol. 7, pp. 1053–1058).
Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3(1), 9–44.
Sutton, R. S., & Barto, A. G. (1998). Introduction to reinforcement learning. New York: MIT Press.
Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvári, C., & Wiewiora, E. (2009). Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th annual international conference on machine learning (pp. 993–1000). ACM.
Sutton, R. S., Maei, H. R., & Szepesvári, C. (2009). A convergent O(n) temporal-difference algorithm for off-policy learning with linear function approximation. In Advances in neural information processing systems (pp. 1609–1616).
Tesauro, G. (1995). TD-Gammon: A self-teaching backgammon program. In Applications of neural networks (pp. 267–285). Springer.
Tsitsiklis, J. N., & Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5), 674–690.
White, D. J. (1993). A survey of applications of Markov decision processes. Journal of the Operational Research Society, 44, 1073–1096.
Williams, R. J., & Baird, L. C. (1993). Tight performance bounds on greedy policies based on imperfect value functions. Technical Report NU-CCS-93-14, Northeastern University, College of Computer Science, Boston, MA.
Zhou, E., Bhatnagar, S., & Chen, X. (2014). Simulation optimization via gradient-based stochastic search. In Winter simulation conference (WSC), 2014 (pp. 3869–3879). IEEE.
Zlochin, M., Birattari, M., Meuleau, N., & Dorigo, M. (2004). Model-based search for combinatorial optimization: A critical survey. Annals of Operations Research, 131(1–4), 373–395.
Author information
Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Editors: Jesse Davis, Elisa Fromont, Derek Greene, and Bjorn Bringmann.
Appendices
Appendices
A Linear function approximation (LFA) based prediction algorithms
B Nonlinear function approximation (NLFA) based prediction algorithms
C Parameter values used in various experiments
See Tables 3, 4, 5, 6, 7, 8, 9 and 10.
D Illustration of CE optimization procedure
See Fig. 16.
E Borkar–Meyn theorem (Theorem 2.1 of Borkar and Meyn 2000)
Theorem 3
For the stochastic recursion of \(x_{n} \in \mathbb {R}^{d}\) given by
$$\begin{aligned} x_{n+1} = x_{n} + a_{n}\left( h(x_{n}) + {\mathbb {M}}_{n+1}\right) , \;\; n \in {\mathbb {N}}, \end{aligned}$$
if the following assumptions are satisfied:

1. The map \(h:\mathbb {R}^{d} \rightarrow \mathbb {R}^{d}\) is Lipschitz, i.e., \(\Vert h(x) - h(y) \Vert \le L\Vert x - y \Vert \), for some \(0< L < \infty \).

2. Step sizes \(\{a_n\}\) are positive scalars satisfying
$$\begin{aligned} \sum _{n} a_n = \infty , \;\;\sum _{n} a_n^{2} < \infty . \end{aligned}$$

3. \(\{{\mathbb {M}}_{n+1}\}_{n \in {\mathbb {N}}}\) is a martingale difference noise w.r.t. the increasing family of \(\sigma \)-fields
$$\begin{aligned} {\mathcal {F}}_{n} \triangleq \sigma (x_{m},{\mathbb {M}}_{m},m \le n), \;\; n \in {\mathbb {N}}. \end{aligned}$$That is,
$$\begin{aligned} {\mathbb {E}}\left[ {\mathbb {M}}_{n+1} \vert {\mathcal {F}}_{n}\right] = 0 \;\; a.s., \;\; n \in {\mathbb {N}}. \end{aligned}$$Furthermore, \(\{{\mathbb {M}}_{n+1}\}_{n \in {\mathbb {N}}}\) are square-integrable with
$$\begin{aligned} {\mathbb {E}}\left[ \Vert {\mathbb {M}}_{n+1} \Vert ^{2} \vert {\mathcal {F}}_{n}\right] \le K(1+\Vert x_{n} \Vert ^{2}) \;\; a.s., \;\; n \in {\mathbb {N}}, \end{aligned}$$for some constant \(K > 0\).

4. The functions \(h_{c}(x) \triangleq \frac{h(cx)}{c}\), \(c \ge 1\), \(x \in \mathbb {R}^{d}\), satisfy \(h_{c}(x) \rightarrow h_{\infty }(x)\) as \(c \rightarrow \infty \), uniformly on compacts for some \(h_{\infty } \in C(\mathbb {R}^{d})\). Furthermore, the ODE
$$\begin{aligned} \dot{x}(t) = h_{\infty }(x(t)) \end{aligned}$$(108)has the origin as its unique globally asymptotically stable equilibrium,
then \(\sup _{n} \Vert x_{n} \Vert < \infty \) almost surely. Furthermore, if the ODE \(\dot{x}(t) = h(x(t))\) has a unique globally asymptotically stable equilibrium \(x^{*}\), then \(x_{n} \rightarrow x^{*}\) almost surely as \(n \rightarrow \infty \).
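A quick numerical illustration of the theorem, using a toy recursion with \(h(x) = -x\): here \(h\) is Lipschitz with \(L = 1\), \(h_{c}(x) = h(cx)/c = -x\) so \(h_{\infty } = h\) and the origin is globally asymptotically stable, the step sizes \(a_n = 1/(n+1)\) satisfy the summability conditions, and bounded i.i.d. zero-mean noise is a valid martingale difference sequence. This sketch only illustrates the assumptions and is not part of the original theorem:

```python
import numpy as np

rng = np.random.default_rng(0)

def h(x):
    # Lipschitz with L = 1; h_c(x) = h(cx)/c = -x, so h_inf = h and the
    # ODE xdot = -x has the origin as its unique GAS equilibrium
    return -x

x = np.array([5.0, -3.0])
for n in range(20000):
    a_n = 1.0 / (n + 1)                 # sum a_n = inf, sum a_n^2 < inf
    M = rng.uniform(-1.0, 1.0, size=2)  # bounded martingale difference noise
    x = x + a_n * (h(x) + M)

print(np.linalg.norm(x))  # small: the iterates stay bounded and approach 0
```

As the theorem predicts, the iterates remain bounded and track the ODE \(\dot{x} = -x\), whose unique equilibrium is the origin.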
Rights and permissions
About this article
Cite this article
Joseph, A.G., Bhatnagar, S. An online prediction algorithm for reinforcement learning with linear function approximation using cross entropy method. Mach Learn 107, 1385–1429 (2018). https://doi.org/10.1007/s10994-018-5727-z
Received:
Accepted:
Published:
Issue Date:
Keywords
 Markov decision process
 Prediction problem
 Reinforcement learning
 Stochastic approximation algorithm
 Cross entropy method
 Linear function approximation
 ODE method