1 Introduction

Actor-critic refers to a family of two time-scale algorithms for reinforcement learning where one alternates between policy gradient updates (actor) and action-value function estimation in an online fashion (critic). These approaches form the bedrock of several practical advances in reinforcement learning, as in supply chain management (Giannoccaro & Pontrandolfo, 2002), power systems (Jiang et al., 2014), robotic manipulation (Kober & Peters, 2012), and games of various kinds (Tesauro et al., 1995; Brockman et al., 2016; Mnih et al., 2016; Silver et al., 2017). While their asymptotic stability has been known for decades (Konda & Borkar, 1999; Konda & Tsitsiklis, 2000), their sample complexity is relatively unexplored. In this work, we establish the statistical behavior of actor-critic algorithms for a number of canonical settings, which to our knowledge is the first time a comprehensive accounting has been conducted.

We focus on reinforcement learning problems over possibly continuous state and action spaces, which are defined by a Markov Decision Process (Puterman, 2014): at each time, starting from one state, an agent selects an action, and then it transitions to a new state according to a transition distribution that is Markov in the current state and action. Then, the environment reveals a reward informing the quality of that decision. The goal of the agent is to select an action sequence which yields the largest expected accumulation of rewards, defined as the value (Bellman, 1954; Bertsekas, 2005). Actor-critic algorithms combine the merits of reinforcement learning algorithms based on approximate dynamic programming with those based on policy search, the two dominant model-free approaches in the literature (Sutton et al., 2017).

For finite spaces, one may obtain the globally optimal policy, and therefore it is possible to quantify sample complexity in terms of the gap to the optimal value function (regret) as, e.g., a polynomial function of the cardinality of the state and action spaces–see Jin et al. (2018) and references therein. This is possible because these quantities have finite cardinality; however, in continuous spaces, these analyses break down because policy parameterization is required, and the value function becomes non-convex with respect to the policy parameters (unless it is parameterized by a sufficiently high-dimensional neural model (Wang et al., 2019)).

More specifically, in the actor step of actor-critic, stochastic gradient steps with respect to the value function are conducted over a parameterized family of policies. Via the Policy Gradient Theorem (Sutton et al., 2000), the gradient with respect to the policy parameters (policy gradient) is the product of two factors: the score function and the Q function. One may employ Monte Carlo rollouts to estimate Q-factors, which, under careful choice of the rollout horizon, can be shown to be unbiased (Paternain, 2018). As a result, policy gradient methods can be linked to more standard stochastic programming results for non-convex optimization; namely, sublinear \({\mathcal {O}}(k^{-1/2})\) rates to stationarity have recently been established (Zhang et al., 2019). Doing so, however, requires an inordinate amount of querying to the environment in order to generate trajectory data. In actor-critic, we replace Monte Carlo rollouts with online estimates of the action-value function.

More specifically, in actor-critic, the critic step estimates the action-value (Q) function through stochastic approximation, i.e., temporal difference (TD) (Sutton, 1988), approaches to solving Bellman’s evaluation equation (Watkins & Dayan, 1992; Tsitsiklis, 1994). Combining temporal difference iterations with nonlinear function parameterizations may cause instability, as shown by Baird (1995); Tsitsiklis and Van Roy (1997). This motivates the majority of TD algorithms to focus on the case where the Q function is parameterized by a linear basis expansion over given universal features, which is common in practice (Sutton et al., 2017) and can be satisfied by radial basis function (RBF) networks or auto-encoders (Park & Sandberg, 1991). We consider this setting of universal features given a priori.

The asymptotic stability of linear TD algorithms hinges upon dynamical systems tools to encapsulate the mean estimation error sequence–see Borkar and Meyn (2000); Kushner and Yin (2003). By contrast, a number of finite-time characterizations of various TD algorithms have appeared recently, namely, those based on stochastic fixed point iterations and gradient-based approximations known as gradient temporal difference (GTD) (Sutton et al., 2009a). For TD algorithms, finite-time sublinear rates have been derived both in the case where samples (state-action-reward triples) are independent and identically distributed (i.i.d.) (Dalal et al., 2018b; Bhandari et al., 2018; Lakshminarayanan & Szepesvari, 2018) and when they exhibit Markovian dependence (Srikant & Ying, 2019). Further, the convergence of GTD was established in Koppel et al. (2017); Tolstaya et al. (2018) by employing coupled supermartingales (Wang et al., 2017a), which permits us to derive the rates of convergence in expectation of GTD as corollaries. As a result, we may explicitly derive the bias due to critic estimation error in terms of the number of critic steps. This is in contrast to the use of an unbiased estimate from a Monte Carlo rollout, as in pure policy gradient methods. We further note that contemporaneously with the beginning of this work, several analyses of GTD have been developed (Liu et al., 2015; Dalal et al., 2018b, 2020) that refine the rates employed in this analysis; however, these results focus on concentration bounds (“lock-in probability”), a weaker metric of stability than convergence in mean, i.e., convergence in Lebesgue integral implies convergence in measure. Since in this work we focus on the intuitive and broadly interpretable global convergence to stationarity in terms of the expected gradient norm of the value function, we seek to employ policy evaluation rates that are compatible with this goal, and defer refined lock-in probability results, for which tighter bounds of convergence on the critic exist, to future work.

Convergence of Actor-Critic In this work, we link the behavior of actor-critic to gradient ascent algorithms with biased gradient directions. This bias is controllable and depends on the step-size and the number of critic iterations per actor update. We perform this analysis for the setting where samples are i.i.d., which may be explicitly guaranteed through the introduction of a new Monte Carlo rollout step for each actor update. As a result, we establish that actor-critic, independent of any critic method, exhibits convergence to stationary points of the value function at rates comparable to stochastic gradient ascent in the non-convex regime. A key distinguishing feature from standard non-convex stochastic programming is that the rates are inherently tied to the bias of the search direction, which is determined by the choice of critic scheme. In fact, our methodology is such that a rate for actor-critic can be derived for any critic-only method for which a convergence rate in expectation on the parameters can be expressed. In particular, we establish the rates for actor-critic with temporal difference (TD) (Sutton, 1988) and gradient TD (GTD) (Sutton et al., 2009a) critic steps. Furthermore, we propose an Accelerated GTD (A-GTD) method derived from accelerations of stochastic compositional gradient descent (Wang et al., 2017a), which converges faster than TD and GTD.

Table 1 Rates of Actor-Critic with Policy Gradient actor updates and different critic-only methods. The term \(\sigma\) is the critic stepsize for TD(0) with continuous state-action space, and should be chosen according to the conditioning of the feature space (see Sect. 6.1)

In summary, for continuous spaces, we establish that A-GTD converges faster than GTD, and the effective convergence rate of TD(0) varies as a result of the feature space representation selected a priori. In particular, this introduces a trade-off between the smoothness assumptions and the rates derived (see Table 1). TD requires no additional smoothness assumptions, and it achieves a rate of \(O(\epsilon ^{-2/\sigma })\). This rate is analogous to the non-convex analysis of stochastic compositional gradient descent when \(\sigma\) is equal to 0.5, which is a conservative estimate (see Fig. 1). Adding a smoothness assumption, GTD achieves the faster rate of \(O(\epsilon ^{-3})\). By requiring an additional smoothness assumption, we find that A-GTD achieves the fastest convergence rate of \(O(\epsilon ^{-5/2})\). For the case of a finite state-action space, actor-critic achieves a convergence rate of \(O(\epsilon ^{-2})\). Overall, the contribution in terms of sample complexities of different actor-critic algorithms may be found in Table 1.

Relative to existing convergence results, actor-critic is classically studied as a form of two time-scale algorithm (Borkar, 1997), whose asymptotic stability is well-known via dynamical systems (Kushner & Yin, 2003; Borkar, 2009). To wield these approaches to establish finite-time performance, however, concentration probabilities and geometric ergodicity assumptions on the Markov dynamics are required–see Borkar (2009). We obviate these complications by focusing on the case where independent trajectory samples are acquirable through querying the environment, for which recent unbiased sampling procedures have proved adept (Paternain, 2018; Zhang et al., 2019). Relative to existing finite-time characterizations of actor-critic, Cai et al. (2019) proposes Neural TD updates, which converge to global optimality under a suitably over-parameterized deep neural network (DNN) and initialization. One quandary is how to find these initializations or design DNN architectures to satisfy these conditions. In separate work, the sample complexity of actor-critic has been established in terms of the value function gradient norm when the critic parameters are estimated with non-linear function approximation in a batch fashion (Yang et al., 2018). It is well-known that non-linear function approximators may diverge, as demonstrated by various counterexamples (Baird, 1995; Tsitsiklis & Van Roy, 1997). Our work circumvents this obstacle by considering only well-behaved and well-studied linear function approximation, which includes commonly chosen radial basis function (RBF) networks and auto-encoders fixed at the outset of RL training.

Since the original date of submission, efforts to refine the analysis in this work exist: for instance, relaxations of assumptions on the sampling distribution to allow Markovian dependence (Qiu et al., 2021; Xu et al., 2020; Wu et al., 2020) and augmentations of the critic objective for practical variance reduction (Parisi et al., 2019). However, these works require the Markov transition density to mix at an exponentially fast rate in order to establish convergence. Thus, while i.i.d. sampling may be difficult to justify, exponentially fast mixing often does not hold either, unless algorithm step-sizes are sent to null at an exponential rate. These intricacies have motivated experimental techniques to mitigate correlation among samples using replay buffers (Wang et al., 2017b) and parallelization of queries to a generative model (Gruslys et al., 2018). However, their exact relationship to mixing rates is opaque. Therefore, for simplicity, in this work we focus on the i.i.d. case.

Moreover, sharper sample complexities for actor-critic have been developed (Qiu et al., 2021; Xu et al., 2020; Wu et al., 2020); however, they do not address the possibility of designing alternate policy evaluation schemes beyond TD(0) updates, and instead focus only on actor-critic in its vanilla form. This is because their perspective is on understanding the sample complexity of actor-critic alone, whereas we provide a unified perspective upon the basis of biased stochastic gradient iteration. In doing so, we are able to incorporate a variety of critic updates and illuminate the interplay of problem smoothness, cardinality, and the choice of critic parameterization. In particular, the sample complexity of actor-critic with TD(0) updates for the tabular case given in Corollary 4 matches Xu et al. (2020); Wu et al. (2020), but in continuous spaces, depending on the conditioning of the feature map covariance and other problem smoothness conditions, GTD or A-GTD may yield faster convergence, a facet elsewhere unaddressed in the literature.

Even more recently, efforts have been made to improve upon the rate of convergence by considering regularized MDPs with overparametrized networks (Cayci et al., 2022), a single critic step (Olshevsky & Gharesifard, 2022), and single-trajectory actor updates (Chen et al., 2022a). Decentralized convergence rates have also been established (Chen et al., 2022b; Zeng et al., 2022). Shen et al. (2020) show that for both i.i.d. and Markovian sampling, there is a linear speedup for the decentralized setting whose bottleneck is the slowest-mixing chain. All of the aforementioned results require the assumption that the probability of any action given a state is strictly positive, which we do not require.

We evaluate actor-critic with TD, GTD, and A-GTD critic updates on both a navigation problem and the canonical pendulum problem. For the navigation problem, we find that indeed A-GTD converges faster than both GTD and TD. Interestingly, the stationary point it reaches is worse than those reached by GTD or TD. This suggests that the choice of critic scheme illuminates an interplay between optimization and generalization that is less well understood in reinforcement learning (Boyan & Moore, 1995; Bousquet & Elisseeff, 2002). For the pendulum problem, we also find that A-GTD converges fastest with respect to the gradient norm, which is consistent with our main convergence results. In particular, we again find that the faster convergence in gradient norm results in the stationary point having a lower cumulative reward. We additionally consider advantage actor-critic in our simulations (Mnih et al., 2016). A detailed discussion of the results and implications can be found in Sect. 7.

The remainder of the paper is organized as follows. Section 2 describes the problem of reinforcement learning and defines common assumptions which we use in our analysis. In Sect. 3, we derive a generic actor-critic algorithm from an optimization perspective and describe how the algorithm would be amended given different policy evaluation methods. The derivation of the convergence rate for generic actor-critic is presented in Sect. 4, and the specific analyses for Gradient, Accelerated Gradient, and vanilla Temporal Difference are characterized in Sects. 5 and 6.

2 Reinforcement learning

We consider the Reinforcement Learning (RL) problem where an agent moves through a state space \({\mathcal {S}}\) and takes actions that belong to some action set \({\mathcal {A}}\), and the state/action spaces are assumed to be continuous compact subsets of Euclidean space: \({\mathcal {S}}\subset {\mathbb {R}}^q\) and \({\mathcal {A}}\subset {\mathbb {R}}^p\). Every time an action is taken, the agent transitions to its next state, which depends only on its current state and action. Moreover, a reward is revealed by the environment. In this situation, the agent would like to accumulate as much reward as possible in the long term, which is referred to as value. Mathematically, this problem definition may be encapsulated as a Markov decision process (MDP), which is a tuple \(({\mathcal {S}},{\mathcal {A}},{\mathbb {P}},R,\gamma )\) with Markov transition density \({\mathbb {P}}(s'\mid s,a):{\mathcal {S}}\times {\mathcal {A}}\rightarrow {\mathbb {P}}({\mathcal {S}})\) that determines the probability of moving to state \(s'\) given the current state s and action a, and reward function R. Here, \(\gamma \in (0,1)\) is the discount factor that parameterizes the value of a given sequence of actions, which we will define shortly.

At each time t, the agent executes an action \(a_t\in {\mathcal {A}}\) given the current state \(s_t\in {\mathcal {S}}\), following a stochastic policy \(\pi :{\mathcal {S}}\rightarrow {\mathbb {P}}({\mathcal {A}})\), i.e., \(a_t\sim \pi (\cdot \mid s_t)\). Then, given the state-action pair \((s_t,a_t)\), the agent observes a (deterministic) reward \(r_t=R(s_t,a_t)\) and transitions to a new state \(s_t' \sim {\mathbb {P}}(\cdot \mid s_t, a_t)\) according to a Markov transition density. For any policy \(\pi\), define the value function \(V_{\pi }:{\mathcal {S}}\rightarrow {\mathbb {R}}\) as

$$\begin{aligned} V_{\pi }(s):={\mathbb {E}}_{a_t\sim \pi (\cdot \mid s_t),s_{t+1}\sim {\mathbb {P}}(\cdot \mid s_t,a_t)}\bigg (\sum _{t=0}^\infty \gamma ^t r_t\mid s_0=s\bigg ), \end{aligned}$$
(1)

which is a measure of the long-term accumulation of rewards discounted by \(\gamma\). We can further define the action-value, or Q, function \(Q_{\pi }:{\mathcal {S}}\times {\mathcal {A}}\rightarrow {\mathbb {R}}\), namely the value conditioned on a given initial action, as \(Q_{\pi }(s,a)={\mathbb {E}}\big (\sum _{t=0}^\infty \gamma ^t r_t\mid s_0=s,a_0=a\big )\). Given any initial state \(s_0\), the goal of the agent is to find the optimal policy \(\pi\) that maximizes the long-term return \(V_{\pi }(s_0)\), i.e., to solve the following optimization problem

$$\begin{aligned} \max _{\pi \in \Pi }~~J(\pi ) , \ \text {where} \ \quad \ J(\pi ):=V_{\pi }(s_0). \end{aligned}$$
(2)

In this work, we investigate actor-critic methods to solve (2); these are hybrid RL methods that fuse key properties of policy search and approximate dynamic programming. To ground the discussion, we first derive the canonical policy search technique, the policy gradient method, and explain how actor-critic augments policy gradient. Begin by noting that to address (2), one must search over an arbitrarily complicated function class \(\Pi\), which may include policies that are unbounded or discontinuous. To mitigate this issue, we parameterize the policy \(\pi\) by a vector \(\theta \in {\mathbb {R}}^d\), i.e., \(\pi =\pi _{\theta }\), yielding RL tools called policy gradient methods (Konda & Tsitsiklis, 2000; Bhatnagar et al., 2009; Castro & Meir, 2010). Under this specification, the search over the arbitrarily complicated function class \(\Pi\) in (2) reduces to a vector-valued optimization over Euclidean space \({\mathbb {R}}^d\), i.e., \(\max _{\theta \in {\mathbb {R}}^d}J(\pi _\theta ):=V_{\pi _{\theta }}(s_0)\). Subsequently, we denote \(J(\pi _\theta )\) by \(J(\theta )\) for notational convenience.

We now make the following standard assumption on the regularity of the MDP problem and the parameterized policy \(\pi _\theta\), which are the same conditions as in Zhang et al. (2020), as well as an assumption to bound the state-action feature representation.

Assumption 1

Suppose the reward function R and the parameterized policy \(\pi _\theta\) satisfy the following conditions:

  1. (i)

    The absolute value of the reward R is bounded uniformly by \(U_{R}\), i.e., \(|R(s,a)|\in [0,U_{R}]\) for any \((s,a)\in {\mathcal {S}}\times {\mathcal {A}}\).

  2. (ii)

    The policy \(\pi _{\theta }\) is differentiable with respect to \(\theta\), and the score function \(\nabla \log \pi _{\theta }(a\mid s)\) is \(L_\Theta\)-Lipschitz and has bounded norm, i.e., for any \((s,a)\in {\mathcal {S}}\times {\mathcal {A}}\),

    $$\begin{aligned}&\Vert \nabla \log \pi _{\theta ^1}(a\mid s)-\nabla \log \pi _{\theta ^2}(a\mid s)\Vert \le L_\Theta \cdot \Vert \theta ^1-\theta ^2\Vert ,\text {~~for any~~} \theta ^1,\theta ^2, \end{aligned}$$
    (3)
    $$\begin{aligned}&\Vert \nabla \log \pi _{\theta }(a\mid s)\Vert \le B_{\Theta },\text {~~for any~~} \theta . \end{aligned}$$
    (4)

Note that the boundedness of the reward function in Assumption 1(i) is standard in policy search algorithms (Bhatnagar et al., 2008, 2009; Castro & Meir, 2010; Zhang et al., 2018). Observe that with bounded R, the Q function is absolutely bounded by \(U_{R}/(1-\gamma )\), since by definition

$$\begin{aligned} |Q_{\pi _\theta }(s,a)|\le \sum _{t=0}^\infty \gamma ^t \cdot U_{R}=\frac{U_{R}}{1-\gamma }, ~~\text {for any}~~ (s,a)\in {\mathcal {S}}\times {\mathcal {A}}. \end{aligned}$$
(5)

The same bound also applies to \(V_{\pi _\theta }(s)\) for any \(\pi _\theta\) and \(s\in {\mathcal {S}}\), and thus the objective \(J(\theta )\), which is defined as \(V_{\pi _\theta }(s_0)\), satisfies

$$\begin{aligned} |V_{\pi _\theta }(s)|\le \frac{U_{R}}{1-\gamma }~~\text {for any}~~s\in {\mathcal {S}},\qquad |J(\theta )|\le \frac{U_{R}}{1-\gamma }. \end{aligned}$$
(6)

We note that the conditions (3) and (4) have appeared in recent analyses of policy search (Castro & Meir, 2010; Pirotta et al., 2015; Papini et al., 2018), and are satisfied by canonical policy parameterizations such as the Boltzmann policy (Konda & Borkar, 1999) and the Gaussian policy (Doya, 2000). For example, for a Gaussian policy in continuous spaces, \(\pi _\theta (\cdot \mid s)={\mathcal {N}}(\phi (s)^\top \theta ,\sigma ^2)\), where \({\mathcal {N}}(\mu ,\sigma ^2)\) denotes the Gaussian distribution with mean \(\mu\) and variance \(\sigma ^2\) and \(\phi (s)\) denotes some state feature representation. Then the score function has the form \([a-\phi (s)^\top \theta ]\phi (s)/\sigma ^2\), which satisfies (3) and (4) if the feature vectors \(\phi (s)\) have bounded norm, the parameter \(\theta\) lies in some bounded set, and the action \(a\in {\mathcal {A}}\) is bounded.
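To make the preceding example concrete, the short sketch below (with a hypothetical feature vector \(\phi (s)\) and a scalar action) evaluates the Gaussian-policy score function; it is only an illustration of the quantities appearing in (3)–(4), not the parameterization used in our experiments.

```python
import numpy as np

def gaussian_score(theta, phi_s, a, sigma=1.0):
    """Score function of the Gaussian policy pi_theta(. | s) = N(phi(s)^T theta, sigma^2).

    Returns grad_theta log pi_theta(a | s) = (a - phi(s)^T theta) * phi(s) / sigma^2,
    which is bounded whenever phi(s), theta, and the action a are bounded.
    """
    mean = phi_s @ theta
    return (a - mean) * phi_s / sigma ** 2

# Bounded (normalized) features and a bounded parameter set keep the score norm bounded.
phi_s = np.array([0.3, -0.5, 0.8])                              # feature vector phi(s)
theta = np.array([0.1, 0.2, -0.4])                              # parameter in a bounded set
a = float(np.random.default_rng(0).normal(phi_s @ theta, 1.0))  # a ~ pi_theta(. | s)
print(gaussian_score(theta, phi_s, a))
```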

Generally, the value function is nonconvex with respect to the parameter \(\theta\), meaning that obtaining a globally optimal solution to (2) is out of reach unless the problem has additional structural properties, as in phase retrieval (Sun et al., 2016), matrix factorization (Li et al., 2016), and tensor decomposition (Ge et al., 2015), among others. Thus, our goal is to design actor-critic algorithms that attain stationary points of the value function \(J(\theta )\). Moreover, we characterize the sample complexity of actor-critic, a noticeable gap in the literature for an algorithmic tool that is decades old (Konda & Borkar, 1999) and at the heart of recent innovations in artificial intelligence architectures (Silver et al., 2017).

3 From policy gradient to actor-critic

In this section, we derive the actor-critic method (Konda & Borkar, 1999) from an optimization perspective: we view actor-critic as a way of performing stochastic gradient ascent with biased ascent directions, where the magnitude of this bias is determined by the number of critic evaluations performed in the inner loop of the algorithm. The building block of actor-critic is the policy gradient method, a type of direct policy search based on stochastic gradient ascent. Begin by noting that the gradient of the objective \(J(\theta )\) with respect to the policy parameters \(\theta\), owing to the Policy Gradient Theorem (Sutton et al., 2000), has the following form:

$$\begin{aligned} \nabla J(\theta )&=\int _{s\in {\mathcal {S}},a\in {\mathcal {A}}}\sum _{t=0}^\infty \gamma ^t \cdot p(s_t=s\mid s_0,\pi _\theta )\cdot \nabla \pi _{\theta }(a\mid s)\cdot Q_{\pi _\theta }(s,a)dsda \end{aligned}$$
(7)
$$\begin{aligned}&=\frac{1}{1-\gamma }\int _{s\in {\mathcal {S}},a\in {\mathcal {A}}}(1-\gamma )\sum _{t=0}^\infty \gamma ^t \cdot p(s_t=s\mid s_0,\pi _\theta )\cdot \nabla \pi _{\theta }(a\mid s)\cdot Q_{\pi _\theta }(s,a)dsda\nonumber \\&=\frac{1}{1-\gamma }\int _{s\in {\mathcal {S}},a\in {\mathcal {A}}}\rho _{\pi _\theta }(s)\cdot \pi _{\theta }(a\mid s)\cdot \nabla \log [\pi _{\theta }(a\mid s)]\cdot Q_{\pi _\theta }(s,a)dsda\nonumber \\&=\frac{1}{1-\gamma }\cdot {\mathbb {E}}_{(s,a)\sim \rho _{\theta }(\cdot ,\cdot )}\big [\nabla \log \pi _{\theta }(a\mid s)\cdot Q_{\pi _\theta }(s,a)\big ]. \end{aligned}$$
(8)

This expression follows from rolling the sum forward, repeatedly applying Bellman’s evaluation equation, and exploiting the Markov property of the transition kernel, together with multiplying and dividing by \(\pi _\theta\) and rewriting the resulting ratio in terms of the score function via the fact that \(\nabla \log \pi _\theta = \nabla \pi _\theta /\pi _\theta\), as in Sutton et al. (2000); Zhang et al. (2019). In the preceding expression, \(p(s_t=s\mid s_0,\pi _\theta )\) denotes the probability that the state \(s_t\) equals s given initial state \(s_0\) and policy \(\pi _\theta\), which is occasionally referred to as the occupancy measure, or the Markov chain transition density induced by policy \(\pi _\theta\). Moreover, \(\rho _{\pi _\theta }(s)=(1-\gamma )\sum _{t=0}^\infty \gamma ^t p(s_t=s\mid s_0,\pi _\theta )\) is the ergodic distribution associated with the MDP for a fixed policy, which is shown to be a valid distribution in Sutton et al. (2000). For future reference, we define \(\rho _{\theta }(s,a)=\rho _{\pi _\theta }(s)\cdot \pi _{\theta }(a\mid s)\). The derivative of the logarithm of the policy, \(\nabla \log [\pi _{\theta }(\cdot \mid s)]\), is usually referred to as the score function corresponding to the probability distribution \(\pi _{\theta }(\cdot \mid s)\) for any \(s\in {\mathcal {S}}\).

Next, we discuss how (8) can be used to develop stochastic methods to address (2). Unbiased samples of the gradient \(\nabla J(\theta )\) are required to perform the stochastic gradient ascent, which hopefully converges to a stationary solution of the nonconvex maximization. One way to obtain an estimate of the gradient \(\nabla J(\theta )\) is to evaluate the score function and Q function at the end of a rollout whose length is drawn from a geometric distribution with parameter \(1-\gamma\) (Zhang et al., 2020)[Theorem 4.3]. If the Q function evaluation is unbiased, then the stochastic estimate of the gradient \(\nabla J(\theta )\) is unbiased as well. We therefore define the stochastic estimate by

$$\begin{aligned} {\hat{\nabla }} J(\theta ):= \frac{1}{1 - \gamma } {\hat{Q}}_{\pi _\theta } (s_T, a_T) \nabla \log \pi _{\theta }(a_T\vert s_T), \end{aligned}$$
(9)

where the tuple \((s_T, a_T)\) is drawn from the end of a rollout of geometrically distributed length \(T\sim {\textbf {Geom}}(1-\gamma )\). Of course, such an approach is very inefficient with respect to samples, as it does not utilize the state-action transitions up until the final tuple. Using the entire trajectory for the actor update comes at the cost of a biased gradient estimate. Before we characterize this bias, we discuss how to evaluate the Q function, using the single-point estimate (9) for simplicity.
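The following sketch illustrates the single-point estimator (9); the `env`, `policy`, `score`, and `q_hat` callables are hypothetical stand-ins for the simulator, \(\pi _\theta\), \(\nabla \log \pi _\theta\), and \({\hat{Q}}_{\pi _\theta }\), and `rng` is, e.g., `numpy.random.default_rng()`. It is a minimal illustration, not the implementation used in our experiments.

```python
def single_point_pg_estimate(env, policy, score, q_hat, gamma, rng):
    """Single-point policy gradient estimate from Eq. (9).

    Rolls the policy out for T ~ Geom(1 - gamma) steps and evaluates the estimated
    Q function and the score function at the final state-action pair.
    """
    T = rng.geometric(1.0 - gamma)       # rollout horizon with support {1, 2, ...}
    s = env.reset()
    for _ in range(T):
        a = policy(s)
        s_T, a_T = s, a                  # remember the most recently visited pair
        s, _reward = env.step(s, a)
    # Eq. (9): (1 / (1 - gamma)) * Q_hat(s_T, a_T) * grad log pi_theta(a_T | s_T)
    return q_hat(s_T, a_T) * score(s_T, a_T) / (1.0 - gamma)
```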

We consider the case where the Q function admits a linear parametrization of the form \({\hat{Q}}_{\pi _\theta }(s,a) = \xi ^\top \varphi (s,a)\), which in the literature on policy search is referred to as the critic (Konda & Borkar, 1999), as it “criticizes” the performance of actions chosen according to policy \(\pi\). We let \(\varphi :{\mathcal {S}} \times {\mathcal {A}} \rightarrow {\mathbb {R}}^p\) be a (possibly nonlinear) feature map, such as a network of radial basis functions or an auto-encoder, known a priori. The choice to consider the Q function with a linear function approximator comes from the well-known convergence results of linear critic-only methods. In contrast, nonlinear function approximators suffer from the possibility of divergence, as is demonstrated by well-known counterexamples (Baird, 1995; Tsitsiklis & Van Roy, 1997).

The critic parameter \(\xi\) belongs to a bounded set \(\xi \in \Xi \subset {\mathbb {R}}^p\) such that

$$\begin{aligned} \Vert \xi \Vert \le C_\xi \text {~for all~} \xi \in \Xi \end{aligned}$$
(10)

This is reasonable because (5) guarantees boundedness of the true Q function. The boundedness of the estimate \({\hat{Q}}\) follows from requiring the feature map \(\varphi (s,a)\) to be bounded, an assumption which can be satisfied through normalization and which we state next.

Assumption 2

For any state action pair \((s,a) \in {\mathcal {S}} \times {\mathcal {A}}\), the norm of the feature representation \(\varphi (s,a)\) is bounded by a constant \(C_\varphi \in {\mathbb {R}}_+\).

The true gradient of the objective function is also bounded,

$$\begin{aligned} \Vert \nabla J(\theta _k)\Vert \le C_\nabla , \end{aligned}$$
(11)

which is established by (8) being bounded as a result of \(|Q| \le U_R/(1-\gamma )\) [c.f. (5)] and \(\Vert \nabla \log \pi _\theta (a\vert s)\Vert \le B_\Theta\) [c.f. (4)].

Moreover, for each actor update k, we estimate the parameter \(\xi _k\) that defines the Q function from an online policy evaluation (critic-only) method after some \(T_C(k)\) iterations, where k denotes the number of policy gradient updates. Thus, we may write the stochastic gradient estimate as

$$\begin{aligned} {\hat{\nabla }} J(\theta ) = \frac{1}{1 - \gamma } \xi _k^\top \varphi (s_T, a_T) \nabla \log \pi _{\theta }(a_T\vert s_T). \end{aligned}$$
(12)

If the estimate of the Q function is unbiased, i.e., \({\mathbb {E}}[\xi _k^\top \varphi (s_T,a_T) \,|\,\theta , s, a]= Q(s,a)\), then \({\mathbb {E}}[{\hat{\nabla }} J(\theta ) \,|\,\theta ] = \nabla J(\theta )\) (c.f. (Zhang et al., 2020)[Theorem 4.3]). Typically, critic-only methods do not give unbiased estimates of the Q function; however, their bias decays in expectation at a rate governed by the number of Q estimation steps. In particular, denote by \(\xi _*\) the parameter for which the Q estimate is unbiased:

$$\begin{aligned} {\mathbb {E}}[\xi _*^\top \varphi (s,a)] = {\mathbb {E}}[{\hat{Q}}_{\pi _\theta }(s,a)] = Q(s,a). \end{aligned}$$
(13)

Hence, by adding and subtracting the true estimate of the parametrized Q function in (12), we arrive at the fact that the policy search direction admits the following decomposition:

$$\begin{aligned} {\hat{\nabla }} J(\theta ) = \frac{1}{1 - \gamma } (\xi _k - \xi _*)^\top \varphi (s_T, a_T) \nabla \log \pi _{\theta }(a_T\vert s_T) + \frac{1}{1 - \gamma } \xi _*^\top \varphi (s_T, a_T) \nabla \log \pi _{\theta }(a_T\vert s_T). \end{aligned}$$
(14)

The second term is the unbiased estimate of the gradient \(\nabla J(\theta )\), whereas the first involves the difference between the critic parameter at iteration k and the limiting parameter \(\xi _*\). For linear parameterizations of the Q function, policy evaluation methods establish convergence in mean of the bias

$$\begin{aligned} {\mathbb {E}}[\Vert \xi _k - \xi _*\Vert ] \le g(k), \end{aligned}$$
(15)

where g(k) is some decreasing function. We address cases where the critic bias decays at rate \(k^{-b}\) for \(b\in (0,1]\), since several state-of-the-art works on policy evaluation may be mapped to the form (15) under this specification (Wang et al., 2017a; Dalal et al., 2018b). We formalize this with the following proposition.

Proposition 1

Given some \(b \in (0,1]\), there exists a constant \(L_1 > 0\) such that

$$\begin{aligned} {\mathbb {E}}[\Vert \xi _k - \xi _*\Vert ] \le L_1k^{-b}. \end{aligned}$$
(16)

This implies the expected error of the critic parameter is bounded by \(O(k^{-b})\).

Recently, alternate rates of \(O(\log k / k)\) have been established; however, these works concede that O(1/k) rates may be possible (Bhandari et al., 2018; Zou et al., 2019). Thus, we subsume recent sample complexity characterizations of policy evaluation as described in Proposition 1. Proposition 1 is an intrinsic property of many policy evaluation schemes, and thus permits one to substitute the standard subsampling rates of a Monte Carlo-based estimator for the Q function (as in REINFORCE (Sutton et al., 2000)) with one that is estimated online using, e.g., temporal difference learning. Hence its role is critical in relating the bias incurred by using critic estimates rather than unbiased gradient estimates to the number of critic steps.

More specifically, (14) is nearly a valid ascent direction: it is approximately an unbiased estimate of the gradient \(\nabla J(\theta )\), since the first term becomes negligible as the number of critic estimation steps increases. Based upon this observation, we propose the following full-trajectory variant of the actor-critic method (Konda & Borkar, 1999): run a critic estimator (policy evaluator) for \(T_C(k)\) steps, whose output is the critic parameter \(\xi _{k}\). We denote the critic estimator by \({\textbf {Critic:}}{\mathbb {N}} \rightarrow {\mathbb {R}}^p\), which returns the parameter \(\xi _{k} \in {\mathbb {R}}^p\) after \(T_C(k) \in {\mathbb {N}}\) iterations. Then, simulate a trajectory of length H(k), and update the actor (policy) parameters \(\theta\) as:

$$\begin{aligned} \theta _{k+1} = \theta _k + \eta _k \frac{1}{1-\gamma } \sum _{t= 1}^{H(k)} \xi _{k}^\top \varphi (s_{t}, a_{t}) \nabla \log \pi _{\theta _k}(s_{t},a_{t}|\theta _k). \end{aligned}$$
(17)

Note that we make the number of critic estimation steps and horizon length grow with k. Increasing T and H with k allows us to control the bias of the estimate as is seen in Proposition 1 for the critic evaluations and in the following theorem for horizon length.

Now, we characterize the bias of the gradient estimate that uses the entire trajectory of length H(k). Let \(\tau = \left\{ s_1, a_1, \dots , s_{H-1}, a_{H-1}, s_{H}\right\}\) be a sampled trajectory of length H. Define \(F_t\) to be the product of the true state-action (Q) function with the score function evaluated at the tuple \((s_t, a_t)\), namely

$$\begin{aligned} F_t:= Q(s_t, a_t) \nabla _\theta {\log } \pi _\theta (s_t,a_t). \end{aligned}$$
(18)

One can consider constructing an estimate of the policy gradient using the entire trajectory of length H by

$$\begin{aligned} {\hat{g}}_H = \sum _{t = 1}^{H} \gamma ^{t-1}F_t. \end{aligned}$$
(19)

The following theorem establishes the bias between the true policy gradient and the finite horizon estimate.

Theorem 1

Let Assumption 1 be in effect. Then it is true that for some finite \(C_1\),

$$\begin{aligned} \left\| {\mathbb {E}}_\tau \left[ {\hat{g}}_H \right] - \nabla _\theta J(\theta ) \right\| \le \gamma ^{H-1}C_1. \end{aligned}$$

Proof

First we will show that \({\mathbb {E}}_\tau \left[ \sum _{t=1}^\infty \gamma ^{t-1}F_t\right] = \nabla _\theta J(\theta )\). We let \(\text {Pr}(s_t = s \vert s_1)\) denote the probability the state at time t is equal to s given the initial state \(s_1\).

$$\begin{aligned} \begin{aligned} {\mathbb {E}}\left[ \sum _{t=1}^\infty \gamma ^{t-1} F_t\right]&= \sum _{t=1}^\infty \gamma ^{t-1} \int _{\mathcal {S}} {\mathbb {E}}\left[ F_t | s_t = s\right] \text {Pr}\left( s_t = s\vert s_1\right) \text {d}s\\&= \sum _{t = 1}^\infty \gamma ^{t-1} \int _{{\mathcal {S}}} \int _{{\mathcal {A}}} Q(s,a)\nabla _\theta {\log } \pi _\theta (s,a) \text {d}a \text {Pr}(s_t = s|s_1) \text {d}s \\&= \int _{\mathcal {S}} \int _{\mathcal {A}} Q(s,a) \nabla _\theta {\log }\pi _\theta (s,a) \text {d}a \sum _{t = 1}^\infty \gamma ^{t-1}\text {Pr}(s_t = s|s_1)\text {d}s \\&= \int _{\mathcal {S}} \int _{\mathcal {A}} Q(s,a) \nabla _\theta {\log } \pi _\theta (s,a) \text {d}a \rho ^{\pi _\theta }(s)\text {d}s\\&= {\mathbb {E}}_{s\sim \rho ^{\pi _\theta }(s)} \left[ \int _{\mathcal {A}} Q(s,a) \nabla _\theta {\log } \pi _\theta (s,a) \text {d}a\right] \\&= {\mathbb {E}}_{s\sim \rho ^{\pi _\theta }(s), a \sim \pi _\theta (s, \cdot )} \left[ Q(s,a) \nabla _\theta \log \pi _\theta (s,a)\right] \\&= \nabla _\theta J(\theta ) \end{aligned} \end{aligned}$$
(20)

By Fubini’s Theorem, we are able to exchange the summation and integrals due to the regularity assumptions. Let \({\hat{g}}_\infty = \sum _{t = 1}^\infty \gamma ^{t-1} F_t\). Then

$$\begin{aligned} {\hat{g}}_\infty - {\hat{g}}_H = \gamma ^{H} \sum _{t = 0}^\infty \gamma ^t F_{t+H +1} \end{aligned}$$
(21)

By the regularity assumptions, we can bound \(F_t\) by \(U_RB_\Theta / (1-\gamma )\). As such, we establish the bound \(\sum _{t = 0}^\infty \gamma ^t F_{t+H +1} \le \sum _{t=0}^\infty \gamma ^{t} U_RB_\Theta / (1-\gamma ) \le U_RB_\Theta /(1-\gamma )^2=:C_1 < \infty\).

Taking the norm of the expectation completes the proof. \(\square\)

Theorem 1 holds under the assumption that the true Q function is accessible. Of course, only a biased estimate of the Q function is available through the use of a critic, as described above. The algorithm we propose is the actor-critic variant of the finite-horizon gradient estimate. The actor parameter update takes the following form:

$$\begin{aligned} \theta _{k+1} = \theta _k + \eta _k {\hat{g}}_{H}^{AC} = \theta _k + \frac{1}{1-\gamma }\eta _k \sum _{t = 1}^{H(k)} \gamma ^{t-1} \xi _{k}^\top \varphi (s_{t}, a_{t}) \nabla \log \pi _{\theta _k}(s_{t},a_{t}|\theta _k). \end{aligned}$$
(22)
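For concreteness, the estimator \({\hat{g}}_H^{AC}\) in (22) can be evaluated as in the following sketch, where `feature` and `score` are hypothetical callables for \(\varphi (s,a)\) and \(\nabla \log \pi _\theta (a\mid s)\); the actor step then adds \(\eta _k\) times this quantity to \(\theta _k\).

```python
def actor_critic_gradient(trajectory, xi, feature, score, gamma):
    """Finite-horizon actor-critic gradient estimate hat{g}_H^{AC} from Eq. (22).

    `trajectory` is a list of (s_t, a_t) pairs of length H and `xi` is the critic
    parameter returned after T_C(k) critic-only iterations.
    """
    g = 0.0
    for t, (s, a) in enumerate(trajectory):      # t = 0 carries discount gamma^0
        q_hat = xi @ feature(s, a)               # linear critic: xi_k^T phi(s_t, a_t)
        g = g + (gamma ** t) * q_hat * score(s, a)
    return g / (1.0 - gamma)

# Actor update (22): theta_{k+1} = theta_k + eta_k * actor_critic_gradient(...)
```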

The following theorem characterizes the bias of the stochastic gradient estimate.

Theorem 2

Let Assumptions 1 and 2 hold, and let Proposition 1 be in effect. Then, for a horizon of length H and T critic evaluations,

$$\begin{aligned} \left\| {\mathbb {E}}_\tau \left[ {\hat{g}}_H^{AC}\right] - \nabla _\theta J(\theta ) \right\| \le C_1 \gamma ^{H} + C_2T^{-b} \end{aligned}$$

Proof

Let \(F_{AC,t}:= \xi _k^\top \varphi (s_t,a_t) \nabla _\theta \log \pi _{\theta }(s_t,a_t)\). Then

$$\begin{aligned} \begin{aligned} {\mathbb {E}}_\tau \left[ {\hat{g}}_{\infty }^{AC}\right]&= {\mathbb {E}}_\tau \left[ \sum _{t=1}^\infty \gamma ^{t-1} F_{AC,t}\right] \\&= {\mathbb {E}}_\tau \left[ \sum _{t=1}^\infty \gamma ^{t-1} \left( F_t +F_{AC,t} -F_t\right) \right] \\&= {\mathbb {E}}_\tau \left[ \sum _{t=1}^\infty \gamma ^{t-1} F_t\right] + {\mathbb {E}}_\tau \left[ \sum _{t=1}^\infty \gamma ^{t-1} \left( F_{AC,t} -F_t\right) \right] \\&= \nabla _\theta J(\theta ) + {\mathbb {E}}_\tau \left[ \sum _{t=1}^\infty \gamma ^{t-1} \left( F_{AC,t} -F_t\right) \right] \\ \end{aligned} \end{aligned}$$
(23)

The final term can be considered an error term. Consider the difference

$$\begin{aligned} F_{AC,t} - F_t = \left( Q(s_t,a_t) - \xi _k^\top \varphi (s_t,a_t) \right) \nabla \log \pi _{\theta }(s_t,a_t). \end{aligned}$$
(24)

Let \(Q(s_t,a_t) = \xi _*^\top \varphi (s_t,a_t)\). Then, by Assumptions 1 and 2 and Proposition 1,

$$\begin{aligned} |F_{AC,t} - F_t| \le T^{-b}L_1C_\varphi B_\Theta \end{aligned}$$
(25)

This implies

$$\begin{aligned} \left\| {\hat{g}}_{\infty }^{AC} - \nabla _\theta J(\theta )\right\| \le T^{-b}L_1C_\varphi B_\Theta \frac{1}{1-\gamma } = C_2 T^{-b} \end{aligned}$$
(26)

Following the same logic as Theorem 1, we can bound the difference between the finite horizon estimate and the infinite horizon actor-critic estimate by

$$\begin{aligned} \Vert {\hat{g}}_\infty ^{AC} - {\hat{g}}_H^{AC}\Vert \le C_1 \gamma ^{H}. \end{aligned}$$
(27)

We invoke the triangle inequality to complete the proof.

$$\begin{aligned} \Vert {\hat{g}}_\infty - {\hat{g}}_H^{AC}\Vert = \Vert {\hat{g}}_\infty - {\hat{g}}_\infty ^{AC} + {\hat{g}}_\infty ^{AC} - {\hat{g}}_H^{AC}\Vert \le \Vert {\hat{g}}_\infty - {\hat{g}}_\infty ^{AC}\Vert + \Vert {\hat{g}}_\infty ^{AC} - {\hat{g}}_H^{AC}\Vert \le C_1 \gamma ^{H} + C_2 T^{-b}. \end{aligned}$$
(28)

This concludes the proof. \(\square\)

The fact that the estimate \({\hat{g}}^{AC}_{H}\) is bounded follows from the boundedness of \({\hat{g}}^{AC}_\infty\). We formalize this for use in the analysis:

$$\begin{aligned} {\mathbb {E}}(\Vert {\hat{g}}^{AC}_H \Vert ) \le {\mathbb {E}}(\Vert {\hat{g}}^{AC}_\infty \Vert ) \le \frac{C_\varphi C_\xi B_\Theta }{(1- \gamma ) } =: \sigma , \end{aligned}$$
(29)

where \(C_\varphi\), \(C_\xi\), and \(B_\Theta\) come from Assumption 2, (10), and Assumption 1(ii), respectively.

Theorem 2 establishes the bias of the stochastic gradient update. The bias can be decreased by increasing T, the number of critic update steps per actor step, and H, the horizon for the actor update. In our main result, we set both of these quantities to grow linearly with k, meaning that we decrease the bias with each actor update step (see Theorem 3). In our numerical results, we show that selecting large enough constants T and H is sufficient (see Sect. 7).

[Algorithm 1: generic actor-critic (pseudocode)]

We summarize the aforementioned procedure, which is agnostic to the particular choice of critic estimator, as Algorithm 1. We acknowledge that the actor-critic algorithm proposed in Algorithm 1 differs from Konda and Borkar (1999) in that rather than updating the actor and critic in tandem, the critic learns the state-action (Q) function from scratch at each update of the actor. The classical version of the algorithm can be recovered by setting \(T_C(k) = 1\) and initializing the critic parameter to the previous step. Existing convergence proofs of this format are limited to asymptotic convergence only, where the critic steps at a faster learning rate than the actor. This batch-type approach emulates that behavior, as the critic must learn something meaningful before the actor can update. In this sense, one might relate our work to Yang et al. (2018); however, unlike their work, we not only prove convergence to a stationary point of the original objective by increasing the number of critic evaluations at each actor step rather than keeping it fixed, but we also use the entire trajectory rather than a single state-action pair sampled from the discounted state distribution.
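The following Python sketch mirrors the generic loop of Algorithm 1 under the i.i.d. sampling setup described above; the `env`, `policy`, `score`, `feature`, and `critic` callables are hypothetical stand-ins, and the schedules \(T_C(k)=k\), \(H(k)=k\), and \(\eta _k=k^{-1/2}\) are taken from the analysis (ignoring the extra critic step used when b = 1; see Theorem 3). It is an illustrative sketch, not the exact implementation.

```python
def generic_actor_critic(env, policy, score, feature, critic, theta0,
                         num_actor_steps, gamma):
    """Sketch of the generic actor-critic method (Algorithm 1).

    At actor iteration k: run the critic for T_C(k) = k steps to obtain xi_k,
    simulate a trajectory of length H(k) = k, and take the ascent step (22).
    """
    theta = theta0
    for k in range(1, num_actor_steps + 1):
        eta_k = k ** (-0.5)                     # actor step-size eta_k = k^{-1/2}
        xi = critic(theta, num_steps=k)         # critic-only method run from scratch
        s = env.reset()
        g = 0.0
        for t in range(k):                      # trajectory of length H(k) = k
            a = policy(theta, s)
            q_hat = xi @ feature(s, a)          # linear critic xi_k^T phi(s_t, a_t)
            g = g + (gamma ** t) * q_hat * score(theta, s, a)
            s, _reward = env.step(s, a)
        theta = theta + eta_k * g / (1.0 - gamma)   # actor update (22)
    return theta
```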

Examples of Critic Updates We note that \({\textbf {Critic:}}\,\,{\mathbb {N}} \rightarrow {\mathbb {R}}^p\) admits two canonical forms: temporal difference (TD) (Sutton, 1988) and gradient temporal difference (GTD)-based estimators (Sutton et al., 2008). The TD update for the critic is given as

$$\begin{aligned} \delta _{t} = r_{t} + {\gamma \xi ^\top _t\varphi (s_t', a_t') - \xi ^\top _t\varphi (s_t,a_t)} \;, \quad \xi _{t+1} = \xi _t + \alpha _t\delta _t\varphi (s_t,a_t) \end{aligned}$$
(30)

whereas for the GTD-based estimator for the critic, we consider the update

$$\begin{aligned} \delta _{t}&= r_{t} + {\gamma \xi _t^\top \varphi (s'_{t}, a'_t) - \xi ^\top _t\varphi (s_t,a_t)} \; , \quad z_{t + 1} = (1- \beta _t)z_t + \beta _t \delta _t , \nonumber \\ \xi _{t+1}&= \xi _t - 2\alpha _t z_{t+1}[\gamma \varphi (s'_{t}, a'_{t}) - \varphi (s_t,a_t) ] \end{aligned}$$
(31)

We further analyze a modification of GTD updates proposed by (Wang et al., 2017a) that incorporates an extrapolation technique to reduce bias in the estimates and improve error dependency, which is distinct from accelerated stochastic approximation with Nesterov Smoothing (Nesterov, 1983). With \(y_0 = 0\) and \(z_t\) defined for \(t = 1, \dots\), the accelerated GTD (A-GTD) update becomes

$$\begin{aligned} \xi _{t+1}&= \xi _t - 2\alpha _t (\gamma \varphi (s'_t,a'_t) - \varphi (s_t,a_t))y_t \nonumber \\ z_{t+1}&= -\left( \frac{1}{\beta _t} - 1\right) \xi _t + \frac{1}{\beta _t}\xi _{t+1} \nonumber \\ y_{t+1}&= (1-\beta _t)y_t + \beta _t \left( r(s_t,a_t) + z_{t+1}^\top \left( \gamma \varphi (s'_t,a'_t) -\varphi (s_t,a_t) \right) \right) \end{aligned}$$
(32)
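As a concrete illustration of the three critic updates, the sketch below implements a single step of each of (30)–(32) for one transition, with the feature vectors \(\varphi (s_t,a_t)\) and \(\varphi (s_t',a_t')\) passed as precomputed arrays; it illustrates the per-step update rules only, not the full algorithms described later.

```python
import numpy as np

def td0_step(xi, phi_sa, phi_next, r, gamma, alpha):
    """One TD(0) critic step, Eq. (30)."""
    delta = r + gamma * xi @ phi_next - xi @ phi_sa       # temporal difference error
    return xi + alpha * delta * phi_sa

def gtd_step(xi, z, phi_sa, phi_next, r, gamma, alpha, beta):
    """One GTD-style critic step, Eq. (31); z is a running scalar estimate of delta."""
    delta = r + gamma * xi @ phi_next - xi @ phi_sa
    z_new = (1.0 - beta) * z + beta * delta
    xi_new = xi - 2.0 * alpha * z_new * (gamma * phi_next - phi_sa)
    return xi_new, z_new

def agtd_step(xi, y, phi_sa, phi_next, r, gamma, alpha, beta):
    """One accelerated GTD (A-GTD) critic step, Eq. (32), initialized with y_0 = 0."""
    xi_new = xi - 2.0 * alpha * (gamma * phi_next - phi_sa) * y
    z_new = -(1.0 / beta - 1.0) * xi + xi_new / beta      # extrapolated iterate
    y_new = (1.0 - beta) * y + beta * (r + z_new @ (gamma * phi_next - phi_sa))
    return xi_new, y_new

# Example usage with random features (illustration only).
rng = np.random.default_rng(0)
phi_sa, phi_next = rng.normal(size=8), rng.normal(size=8)
xi = np.zeros(8)
xi = td0_step(xi, phi_sa, phi_next, r=1.0, gamma=0.95, alpha=0.1)
```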

Subsequently, we shift focus to characterizing the mean convergence of actor-critic method given any policy evaluation method satisfying (15) in Sect. 4. Then, we specialize the sample complexity of actor-critic to the cases associated with critic updates (30) – (32), which we respectively call Classic (Algorithm 4), Gradient (Algorithm 2), and Accelerated Actor-Critic (Algorithm 3).

Remark 1

We wish to emphasize that a major advantage of this generic characterization of actor-critic is the ability to interchange critic-only methods for estimating the state-action (Q) function. The merit is twofold: it can accommodate faster convergence rates and fewer assumptions. In particular, recent works have shown tighter sample complexity bounds for critic-only methods for convergence in probability, which suggests that existing bounds on convergence in expectation are not necessarily tight. Furthermore, so long as the convergence of the critic takes the form of Proposition 1, the i.i.d. assumption for the critic can be lifted. The general conditions for stability of trajectories with Markov dependence, i.e., negative Lyapunov exponents for mixing rates, may be found in Meyn and Tweedie (2012).

4 Convergence rate of generic actor-critic

In this section, we derive the rate of convergence in expectation for the variant of actor-critic defined in Algorithm 1, which is agnostic to the particular choice of policy evaluation method used to estimate the Q function in the actor update. Unsurprisingly, we establish that the rate of convergence in expectation for actor-critic depends on the critic update used. Therefore, we present the main result in this paper for any generic critic method. Thereafter, we specialize this result to two well-known choices of policy evaluation previously described, (30)–(31), as well as a new variant that employs acceleration (32).

We begin by noting that under Assumption 1, one may establish Lipschitz continuity of the policy gradient \(\nabla J(\theta )\) (Zhang et al., 2020)[Lemma 4.2].

Lemma 1

(Lipschitz-Continuity of Policy Gradient) The policy gradient \(\nabla J(\theta )\) is Lipschitz continuous with some constant \(L>0\), i.e., for any \(\theta ^1,\theta ^2\in {\mathbb {R}}^d\)

$$\begin{aligned} \Vert \nabla J(\theta ^1)-\nabla J(\theta ^2)\Vert \le L\cdot \Vert \theta ^1-\theta ^2\Vert . \end{aligned}$$
(33)

This lemma allows us to establish an approximate ascent for the objective sequence \(\{J(\theta _k)\}\).

Lemma 2

Consider the actor parameter sequence defined by Algorithm 1, and let Assumptions 1 and 2 be in effect. Define the probability space \(\left( \Omega ,{\mathcal {F}}, P \right)\), and let \({\mathcal {F}}_k\) be the \(\sigma\)-algebra generated by the set \(\{s_u,a_u,\theta _u\}_{u< k}\), that is, the states, actions, and policy parameters until time k. Then, the sequence \(\{J(\theta _k)\}\) satisfies the inequality

$$\begin{aligned} {\mathbb {E}}[{J(\theta _{k+1})} \mid {\mathcal {F}}_k] \ge {J(\theta _k)} + \eta _k \Vert \nabla J(\theta _k)\Vert ^2 - \eta _k C_\nabla C_1 \gamma ^{H(k)-1} - \eta _k C_\nabla C_2 T(k)^{-b}- {L \sigma ^2 \eta _k^2}. \end{aligned}$$
(34)

where \(C_1\) and \(C_2\) come from Theorem 2.

Proof

See Appendix 1\(\square\)

From (34) (Lemma 2), consider taking the total expectation

$$\begin{aligned} {\mathbb {E}}[J(\theta _{k+1}) ] \ge {\mathbb {E}}[J(\theta _k)] + \eta _k {\mathbb {E}}[ \Vert \nabla J(\theta _k)\Vert ^2] - \eta _k C_\nabla C_1 \gamma ^{H(k)-1} - \eta _k C_\nabla C_2 T(k)^{-b}{- L \sigma ^2 \eta _k^2}. \end{aligned}$$
(35)

This almost describes an ascent of \(J(\theta _k)\). Because the norm of the gradient is non-negative, if the latter three terms were removed, an argument could be constructed to show that in expectation, the gradient converges to zero. Unfortunately, both the error of the finite horizon estimate and the critic error complicate the picture. However, we know that the error goes to zero in expectation as the number of critic steps and the horizon length increase. Thus, we leverage this property to derive the sample complexity of actor-critic (Algorithm 1).

We now present our main result, which is the convergence rate of the actor-critic method when the algorithm remains agnostic to the particular choice of critic scheme. We characterize the rate of convergence by the smallest number \(K_\epsilon\) of actor updates k required to attain a value function gradient smaller than \(\epsilon\), i.e., for \(\epsilon > 0\),

$$\begin{aligned} K_\epsilon = \min \{k:\,\inf _{0\le m \le k} \Vert \nabla J(\theta _m)\Vert ^2 < \epsilon \}. \end{aligned}$$
(36)
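As a small empirical illustration (not part of the analysis), the criterion (36) can be evaluated from a recorded sequence of estimated squared gradient norms as follows.

```python
def first_hitting_index(grad_norms_sq, eps):
    """Empirical analogue of K_eps in Eq. (36).

    `grad_norms_sq` holds (estimated) values of ||grad J(theta_m)||^2 for
    m = 0, 1, ...; returns the first k at which the running minimum drops below eps.
    """
    best = float("inf")
    for k, g in enumerate(grad_norms_sq):
        best = min(best, g)
        if best < eps:
            return k
    return None  # threshold not reached within the recorded horizon
```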

Theorem 3

Suppose the actor step-size satisfies \(\eta _k = k^{-a}\) for \(a >0\) and the critic update satisfies Proposition 1. Further let \(T_C(k) = k + 1\cdot {\textbf{1}}(b = 1)\), and \(H(k) = k\). Then the actor sequence defined by Algorithm 1 satisfies

$$\begin{aligned} K_\epsilon \le {\mathcal {O}}\left( \epsilon ^{-1/\ell } \right) \;, \text{ where } \ell = \min \{a,1-a, b\} \end{aligned}$$
(37)

Optimizing the exponent \(\ell\) over a yields the actor step-size \(\eta _k = k^{-1/2}\). Moreover, depending on the rate b of attenuation of the critic bias [cf. (15)], the resulting sample complexity is:

$$\begin{aligned} K_\epsilon \le {\left\{ \begin{array}{ll} {\mathcal {O}}\left( \epsilon ^{-1/b}\right) &{} \text {if } b\in (0,1/2),\\ {\mathcal {O}}\left( \epsilon ^{-2}\right) &{} \text {if } b\in [1/2, 1]. \end{array}\right. } \end{aligned}$$
(38)

Proof

See Appendix 2\(\square\)

The analysis of Lemma 2 and Theorem 3 does not make any assumptions on the size of the state-action space. Additionally, the result describes the number of actor updates required. The number of critic updates required is simply the \(K_\epsilon ^\text {th}\) triangular number, that is, \(\left( \begin{array}{c}K_\epsilon + 1\\ 2\end{array}\right) = K_\epsilon (K_\epsilon +1)/2\). These results connect actor-critic algorithms with the behavior of the stochastic gradient method for finding a stationary point of a non-convex objective. Under additional conditions, actor-critic with TD updates for the critic step attains a \(O( \epsilon ^{-2} )\) rate. However, under milder conditions on the state and action spaces but more stringent smoothness conditions on the reward function, using GTD updates for the critic yields \(O(\epsilon ^{-3})\) rates. These results are formally derived in the following subsections. We further note that contemporaneously with the beginning of this work, several refined analyses of TD and GTD have been developed (Dalal et al., 2018b, 2020) that focus on concentration bounds (“lock-in probability”), a weaker metric of stability than convergence in mean, i.e., convergence in Lebesgue integral implies convergence in measure. In this work, we focus on global convergence to stationarity in terms of the expected gradient norm of the value function, and thus employ policy evaluation rates that are compatible with this goal, i.e., rates in the form of attenuation of the mean square error. We defer the study of lock-in probabilities to future work.

Remark 2

We note that it may be possible to establish convergence in terms of the asymptotic covariance or the Hessian around a stationary point, as in Thoppe and Borkar (2019), and thus obtain a sharper characterization of the limit points of actor-critic. However, doing so pre-supposes that the algorithm settles to a neighborhood of a local extremum, and would require a Hessian parameterization that is only locally valid. Hence sharper global convergence characterizations are, to our knowledge, beyond reach. In this work, our intention is to establish the global sample complexity of actor-critic type algorithms, and we leave strengthening the local rates using, e.g., techniques developed in Thoppe and Borkar (2019), to future work.

5 Rates of gradient and accelerated actor-critic

In this section, we show how Algorithm 1 can be applied to derive the rate of actor-critic methods using Gradient Temporal Difference (GTD) as the critic update. Thus, we proceed with deriving GTD-style updates through links to compositional stochastic programming (Wang et al., 2017a) which is also the perspective we adopted to derive rates in the previous section. For simplicity in notation, we let Q stand for \(Q_{\pi _\theta }\). Begin by recalling that any critic method seeks a fixed point of the Bellman evaluation operator:

$$\begin{aligned} (T^{\pi _\theta }Q)(s,a) \triangleq r(s,a) + \gamma {\mathbb {E}}_{s'\in {\mathcal {S}},{ a' \sim \pi _{\theta }(s')}}[Q(s', a')~|~s,a] \end{aligned}$$
(39)

Since we focus on parameterizations of the Q function by parameter vectors \(\xi \in {\mathbb {R}}^p\) with some fixed feature map \(\varphi\) which is learned a priori, the Bellman operator simplifies to

$$\begin{aligned} T^{\pi _\theta } Q_\xi (s,a) = {\mathbb {E}}_{s' {\in {\mathcal {S}}},a'\sim \pi _\theta (s')} [r(s,a) +\gamma \xi ^\top \varphi (s',a') | s,a ] \end{aligned}$$
(40)

The solution of the Bellman equation is its fixed point: \(T^\pi Q(s,a) = Q(s,a)\) for all \(s\in {\mathcal {S}}, a \in {\mathcal {A}}\). Thus, we seek Q functions that minimize the (projected) Bellman error

$$\begin{aligned} \min _{\xi \in \Xi } \Vert \Pi T^{\pi _\theta } Q_\xi - Q_\xi \Vert _\mu ^2 =: F(\xi ). \end{aligned}$$
(41)

where \(\Xi \subseteq {\mathbb {R}}^p\) is a closed and convex feasible set. The Bellman error quantifies distance from the fixed point for a given \(Q_\xi\). Here the projection and \(\mu\)-norm are respectively defined as

$$\begin{aligned} \Pi {\hat{Q}} = \arg \min _{ \textrm{f} \in {\mathcal {F}} } \Vert {\hat{Q}} -\textrm{f} \Vert _\mu \;, \qquad \Vert Q \Vert _\mu ^2 = \int Q^2(s,a)\mu (\textrm{d}s,\textrm{d}a), \end{aligned}$$
(42)

This parameterization of Q implies that we restrict the feasible set–which is in general \(B({\mathcal {S}},{\mathcal {A}})\), the space of bounded continuous functions whose domain is \({\mathcal {S}}\times {\mathcal {A}}\)–to be \({{\mathcal {F}}}= \{Q_\xi : \xi \in \Xi \subset {{\mathbb {R}}}^p\}\) (as in Maei et al. (2010)). Without this parameterization, one would require searching over \(B({\mathcal {S}},{\mathcal {A}})\), whose complexity scales with the dimension of the state and action spaces (Bellman, 1957), which is costly when the dimensions are large, and downright impossible for continuous spaces (Powell, 2007).

Under certain mild conditions, drawing on tools from functional analysis, we can define a projection over a class of functions such that \(\Pi {\hat{Q}} = {\hat{Q}}\). For example, Radial-Basis-Function (RBF) networks have been shown to be capable of approximating arbitrarily well functions in \(L^p({\mathbb {R}}^r)\) ((Park & Sandberg, 1991), Theorem 1). Further, neural networks with one hidden layer and sigmoidal activation functions are known to approximate arbitrarily well continuous functions on the unit cube ((Cybenko, 1989), Theorem 1).

By the definition of the \(\mu\)-norm, we can write F [cf. (41)] as an expectation

$$\begin{aligned} F(\xi ) = {\mathbb {E}} [( T^{\pi _\theta } Q_\xi - Q_{\xi })^2]. \end{aligned}$$
(43)

As such, we replace the Bellman operator in (43) with (40) to obtain

$$\begin{aligned} F(\xi ) = {\mathbb {E}}_{s,a\sim \pi _\theta (s)}\{ ( {\mathbb {E}}_{s',a'\sim \pi _\theta (s')} [r(s,a) +\gamma \xi ^\top \varphi (s',a') | s,a\sim \pi _\theta (s) ] - \xi ^\top \varphi (s,a))^2\}. \end{aligned}$$
(44)

Pulling the last term into the inner expectation, \(F(\xi )\) can be written as the function composition \(F(\xi ) = (f \circ g)(\xi ) = f(g(\xi ))\), where \(f: {\mathbb {R}} \rightarrow {\mathbb {R}}\) and \(g: {\mathbb {R}}^p \rightarrow {\mathbb {R}}\) take the form of expected values

$$\begin{aligned} f(y) = {\mathbb {E}}_{(s,a)}[f_{(s,a)}(y)] ,\qquad g(\xi ) = {\mathbb {E}}_{(s',a')}[g_{(s',a')}(\xi )~|~s,a\sim \pi _\theta (s)], \end{aligned}$$
(45)

where

$$\begin{aligned} f_{(s,a)}(y) = y^2 \;, \qquad g_{(s',a')}(\xi ) = r(s,a) + \gamma \xi ^\top \varphi (s',a') - \xi ^\top \varphi (s,a). \end{aligned}$$
(46)

Because \(F(\xi )\) can be written as a nested expectation of convex functions, we can use Stochastic Compositional Gradient Descent (SCGD) for the critic update (Wang et al., 2017a). This requires the computation of the sample gradients of both f and g in (45)

$$\begin{aligned} \nabla f_{(s,a)}(y) =2 y \;, \qquad \nabla g_{(s',a')}(\xi ) = \gamma \varphi (s',a') - \varphi (s,a). \end{aligned}$$
(47)
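To make the compositional structure explicit, the sketch below writes out the sample maps (46)–(47) and a single SCGD step (Wang et al., 2017a); with the running scalar estimate playing the role of \(z_t\) and \(\nabla f(z)=2z\), this is algebraically the GTD update (31). The feature vectors are passed as precomputed arrays, which is an assumption of this illustration.

```python
def g_sample(xi, phi_sa, phi_next, r, gamma):
    """Inner sample map g_{(s',a')}(xi) from Eq. (46): the sampled Bellman residual."""
    return r + gamma * xi @ phi_next - xi @ phi_sa

def grad_g_sample(phi_sa, phi_next, gamma):
    """Sample gradient of g from Eq. (47); independent of xi for a linear critic."""
    return gamma * phi_next - phi_sa

def grad_f(y):
    """Gradient of the outer map f(y) = y^2 from Eqs. (46)-(47)."""
    return 2.0 * y

def scgd_step(xi, y, phi_sa, phi_next, r, gamma, alpha, beta):
    """One SCGD step: track g(xi) by a running average y, then descend grad_g * grad_f(y)."""
    y_new = (1.0 - beta) * y + beta * g_sample(xi, phi_sa, phi_next, r, gamma)
    xi_new = xi - alpha * grad_g_sample(phi_sa, phi_next, gamma) * grad_f(y_new)
    return xi_new, y_new
```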

The specification of SCGD to the Bellman evaluation error (44) yields the GTD updates (31) defined in Sect. 3–see (Sutton et al., 2008) for further details. We now turn to establishing the convergence rate in expectation for Algorithm 1 (substituting Algorithm 2 for the \({\textbf {Critic}}(k)\) step) using Theorem 3. Doing so requires the conditions of Theorem 3 from Wang et al. (2017a) to be satisfied, which we subsequently state.

Assumption 3

  1. (i)

    The outer function f is continuously differentiable, the inner function g is continuous, the critic parameter feasible set \(\Xi\) is closed and convex, and there exists at least one optimal solution to problem (41), namely \(\xi ^*\in \Xi\)

  2. (ii)

    The sample first order information is unbiased. That is,

    $$\begin{aligned} {\mathbb {E}}[g_{(s_0',a_0')}(\xi ) ~|~s_0,a_0\sim \pi _\theta (s_0)] = g(\xi ) \end{aligned}$$
  3. (iii)

    The function \({\mathbb {E}}[g(\xi )]\) [cf. (46)] is \(C_g\)-Lipschitz continuous and the samples \(g(\xi )\) and \(\nabla g(\xi )\) have bounded second moments

    $$\begin{aligned} {\mathbb {E}}[\Vert \nabla g_{(s_0',a_0')}(\xi )\Vert ^2 ~|~s_0,a_0\sim \pi _\theta (s_0)] \le C_g, \; \qquad {\mathbb {E}}[\Vert g_{(s_0',a_0')}(\xi ) - g(\xi )\Vert ^2] \le V_g \end{aligned}$$
  4. (iv)

    The function \(f_{(s,a)}(y)\) has a Lipschitz continuous gradient such that

    $$\begin{aligned} {\mathbb {E}}[\Vert \nabla f_{(s_0,a_0)}(y)\Vert ^2] \le C_f \; \qquad \Vert \nabla f_{(s_0,a_0)}(y) - \nabla f_{(s_0,a_0)}({\bar{y}})\Vert \le L_f\Vert y - {\bar{y}}\Vert \end{aligned}$$

    for all \(y, {\bar{y}} \in {\mathbb {R}}\)

  5. (v)

    The projected Bellman error is strongly convex with respect to the critic parameter \(\xi\) in the sense that there exists a \(\lambda > 0\) such that

    $$\begin{aligned} \nabla ^2 F(\xi ) \succeq \lambda I \end{aligned}$$

The first part of Assumption 3(i) is trivially satisfied by the forms of f and g in (46). Assumption 3(ii) requires the state-action pairs used to update the critic parameter to be independently and identically distributed (i.i.d.), which is a common assumption unless one focuses on performance along a single trajectory. Handling the latter requires tools from dynamical systems under appropriate mixing conditions on the Markov transition density (Borkar, 2009; Antos et al., 2008), which we omit here for simplicity and to clarify insights. We note that the sample complexity of policy evaluation along a trajectory has been established by Bhandari et al. (2018), but remains open for policy learning in continuous spaces. Moreover, i.i.d. sampling yields unbiasedness of certain gradient estimators and second-moment boundedness, which are typical requirements in stochastic optimization (Bottou, 1998). These conditions come directly from Wang et al. (2017a); here we translate them to the reinforcement learning context.


We further require \(F(\xi )\) to be strongly convex, so that Theorems 3 and 7 of Wang et al. (2017a) apply. Consider the Hessian

$$\begin{aligned} \nabla ^2 F(\xi ) = 2\, {\mathbb {E}}_{s,a} \left[ {\mathbb {E}}_{s',a'} \left[ \gamma \varphi (s',a') - \varphi (s,a)\,|\,s,a\right] \, {\mathbb {E}}_{s',a'} \left[ \gamma \varphi (s',a') - \varphi (s,a)\,|\,s,a\right] ^\top \right] . \end{aligned}$$
(48)

Due to its structure and the i.i.d. assumption, the Hessian \(\nabla ^2 F(\xi )\) is known to be positive definite (Bertsekas et al., 1995; Dalal et al., 2018b).

We can now combine the convergence result (Theorem 3) from Wang et al. (2017a) with Theorem 3 to establish the rate of actor-critic with GTD updates for the critic, through connecting GTD and SCGD. We summarize the resulting method as Algorithm 2, which we call Gradient Actor-Critic.
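
The overall method can then be read as a nested loop: for each actor iteration, the critic performs several GTD (SCGD) updates under the current policy, after which the actor takes one stochastic policy-gradient step using the critic's Q estimate in place of a Monte Carlo rollout. The sketch below is a schematic reading of this loop under assumed `env`, `policy`, and `phi` interfaces; it is not the paper's algorithm listing, and the critic stepsizes mirror those used in the experiments of Sect. 7.

```python
import numpy as np

def gradient_actor_critic(env, policy, phi, theta, xi, gamma,
                          num_actor_steps, critic_steps_per_actor):
    """Schematic Gradient Actor-Critic loop: a GTD (SCGD) critic nested
    inside a policy-gradient actor update.  Assumed interfaces:

    policy.sample(theta, s)   -> action drawn from pi_theta(. | s)
    policy.score(theta, s, a) -> grad_theta log pi_theta(a | s)
    env.reset() -> s ;  env.step(a) -> (s_next, r)
    phi(s, a)   -> critic feature vector of shape (p,)
    """
    y = 0.0  # SCGD auxiliary variable tracking the inner expectation
    for k in range(1, num_actor_steps + 1):
        eta_k = k ** -0.5                        # actor stepsize (Corollary 1)
        # ----- Critic(k): GTD updates under the current policy -------------
        s = env.reset()
        a = policy.sample(theta, s)
        for t in range(1, critic_steps_per_actor + 1):
            alpha_t, beta_t = 1.0 / t, t ** (-2.0 / 3.0)
            s_next, r = env.step(a)
            a_next = policy.sample(theta, s_next)
            feat, feat_next = phi(s, a), phi(s_next, a_next)
            td_err = r + gamma * xi @ feat_next - xi @ feat
            y = (1.0 - beta_t) * y + beta_t * td_err         # fast timescale
            xi = xi - alpha_t * 2.0 * y * (gamma * feat_next - feat)
            s, a = s_next, a_next
        # ----- Actor: stochastic policy gradient with the critic's Q -------
        s = env.reset()
        a = policy.sample(theta, s)
        theta = theta + eta_k * (xi @ phi(s, a)) * policy.score(theta, s, a)
    return theta, xi
```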


Corollary 1

Consider the actor parameter sequence defined by Algorithm 2. If the stepsize \(\eta _k = k^{-1/2}\) and the critic stepsizes are \(\alpha _t = 1/t\sigma\) and \(\beta _t = 1/t^{2/3}\), then we have the following bound on \(K_\epsilon\) defined in (36):

$$\begin{aligned} K_\epsilon \le {\mathcal {O}}\left( \epsilon ^{-3} \right) . \end{aligned}$$
(49)

Proof

Here we invoke ((Wang et al., 2017a), Theorem 3) which characterizes the rate of convergence for the critic parameter

$$\begin{aligned} {\mathbb {E}}[\Vert \xi _k - \xi _*\Vert ^2] \le {\mathcal {O}}\left( k^{-2/3}\right) . \end{aligned}$$
(50)

Applying Jensen’s inequality, we have

$$\begin{aligned} {\mathbb {E}}[\Vert \xi _k - \xi _*\Vert ]^2 \le {\mathbb {E}}[\Vert \xi _k - \xi _*\Vert ^2] \le {\mathcal {O}}\left( k^{-2/3}\right) . \end{aligned}$$
(51)

Taking the square root gives us

$$\begin{aligned} {\mathbb {E}}[\Vert \xi _k - \xi _*\Vert ] \le {\mathcal {O}}\left( k^{-1/3}\right) . \end{aligned}$$
(52)

Therefore, \(b = 1/3\) (cf. Proposition 1) in Theorem 3, which determines the \({\mathcal {O}}\left( \epsilon ^{-3}\right)\) rate on \(K_\epsilon\) in the preceding expression. \(\square\)
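
A useful way to read this and the subsequent corollaries is that, whenever the critic satisfies \({\mathbb {E}}[\Vert \xi _k - \xi _*\Vert ] \le {\mathcal {O}}(k^{-b})\) with \(b \le 1/2\), the actor complexity scales as

$$\begin{aligned} K_\epsilon \le {\mathcal {O}}\left( \epsilon ^{-1/b}\right) : \quad b = \tfrac{1}{3} \Rightarrow {\mathcal {O}}(\epsilon ^{-3}), \qquad b = \tfrac{2}{5} \Rightarrow {\mathcal {O}}(\epsilon ^{-5/2}), \qquad b = \tfrac{1}{2} \Rightarrow {\mathcal {O}}(\epsilon ^{-2}), \end{aligned}$$

so improvements in the critic rate translate directly into fewer actor queries.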

Unsurprisingly, with additional smoothness assumptions, it is possible to obtain faster convergence through accelerated variants of GTD. The corresponding actor-critic method with Accelerated GTD updates is given by substituting Algorithm 3 for \({\textbf {Critic}}(k)\) in Algorithm 1, which we call Accelerated Actor-Critic. The accelerated rates require, in addition to Assumption 3, that the inner expectation has Lipschitz gradients and that the sample gradients have bounded fourth moments, as formally stated below.

Assumption 4

  1. (i)

    There exists a constant scalar \(L_g > 0\) such that

    $$\begin{aligned} \Vert \nabla {\mathbb {E}}_{s',a'\sim \pi _\theta (s')}[g(\xi _1)] - \nabla {\mathbb {E}}_{s',a'\sim \pi _\theta (s')}[g(\xi _2)]\Vert \le L_g \Vert \xi _1-\xi _2\Vert , \; \forall \xi _1,\xi _2\in \Xi \end{aligned}$$
  2. (ii)

    The sample gradients satisfy with probability 1 that

    $$\begin{aligned} {\mathbb {E}}\left[ \Vert \nabla g(\xi )\Vert ^4 ~|~s_0,a_0\right] \le C^2_g, \; \forall \xi \in \Xi \;, \qquad {\mathbb {E}}\left[ \Vert \nabla f(y)\Vert ^4 \right] \le C^2_f, \; \forall y \in {\mathbb {R}} \end{aligned}$$

With this additional smoothness assumption, sample complexity is reduced, as we state in the following corollary.

Corollary 2

Consider the actor parameter sequence defined by Algorithm 3. If the stepsize \(\eta _k = k^{-1/2}\) and the critic stepsizes are \(\alpha _t = 1/t\sigma\) and \(\beta _t = 1/t^{4/5}\), then we have the following bound on \(K_\epsilon\) defined in (36):

$$\begin{aligned} K_\epsilon \le {\mathcal {O}}\left( \epsilon ^{-5/2} \right) . \end{aligned}$$
(53)

Proof

The proof is identical to the proof of Corollary 1 while invoking Theorem 7 from Wang et al. (2017a). \(\square\)

Corollary 2 establishes an \({\mathcal {O}}(\epsilon ^{-5/2})\) sample complexity of actor-critic when accelerated GTD steps are used for the critic update. This is the lowest sample complexity (fastest rate) among all methods analyzed in this work for continuous spaces. However, this fast rate requires the most stringent smoothness conditions. In the following section, we shift to the case where the critic is updated using vanilla TD(0) updates (30), which is the original form of actor-critic proposed by Konda and Borkar (1999).

6 Sample complexity of classic actor-critic


In this section, we derive convergence rates for actor-critic when the critic is updated using TD(0) as in (30) for two canonical settings: the case where the state and action spaces are continuous (Sect. 6.1) and the case where they are finite (Sect. 6.2). Both use TD(0) with linear function approximation in its unaltered form (Sutton, 1988). We substitute Algorithm 4 for the \({\textbf {Critic}}(k)\) step in Algorithm 1, which is the classical form of actor-critic given by Konda and Borkar (1999); Konda and Tsitsiklis (2000), hence the name Classic Actor-Critic.
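
For reference, the TD(0) critic update with a linear Q parameterization is a single semi-gradient step. The sketch below is illustrative (hypothetical names), assuming on-policy samples \((s_t, a_t, r_t, s_{t+1}, a_{t+1})\).

```python
import numpy as np

def td0_critic_step(xi, s, a, r, s_next, a_next, phi, gamma, alpha):
    """One TD(0) update of the linear critic Q_xi(s, a) = xi^T phi(s, a):
    move xi along the TD error times the current feature vector."""
    feat, feat_next = phi(s, a), phi(s_next, a_next)
    td_error = r + gamma * xi @ feat_next - xi @ feat
    return xi + alpha * td_error * feat
```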

6.1 Continuous state and action spaces

The analysis of TD(0) with linear function approximation in continuous state-action spaces uses the results of Dalal et al. (2018a) to characterize the rate of convergence of the critic. Their analysis requires the following common assumption.

Assumption 5

There exists a constant \(K_s > 0\) such that for the filtration \({\mathcal {G}}_t\) defined for the TD(0) critic updates, we have

$$\begin{aligned} {\mathbb {E}}[\Vert M_{t+1}\Vert ^2 \vert {\mathcal {G}}_t] \le K_s[1 + \Vert \xi _t - \xi _*\Vert ^2], \end{aligned}$$
(54)

where \(M_{t+1}\) is defined as

$$\begin{aligned} M_{t+1} = \left( r_t + \gamma \xi _t^\top \varphi (s_{t+1},a_{t+1}) - \xi _t^\top \varphi (s_{t},a_{t}) \right) \varphi (s_{t},a_{t}) - b + A\xi _t \end{aligned}$$
(55)

where

$$\begin{aligned} b:= {\mathbb {E}}_{s,a\sim \pi (s)}[r(s,a)\varphi (s,a)], \text {~and~} A:= {\mathbb {E}}_{s,a\sim \pi (s)}[\varphi (s,a)(\varphi (s,a) - \gamma \varphi (s',a'))^\top ]. \end{aligned}$$
(56)

Assumption 5 is known to hold when the samples have uniformly bounded second moments, which is a common assumption for convergence results (Sutton et al., 2009a, b). In the same way that the projected Bellman error is strongly convex [see (48)], it is known that A is positive definite. As such, we define \(\lambda _\text {TD} \in (0, \lambda _\text {min} (A + A^\top ))\). The value of \(\lambda _\text {TD}\) depends on the feature representation of the state space, which is chosen a priori. This value plays an important role in determining the rate of convergence of TD(0), as we see in the following corollary.
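
Because \(\lambda _\text {TD}\) is determined entirely by the feature representation and the policy, it can in principle be estimated offline by forming A from samples and taking the smallest eigenvalue of \(A + A^\top\). The sketch below is a minimal Monte Carlo version; the `sample_transition` interface is an assumption made for illustration.

```python
import numpy as np

def estimate_lambda_td(sample_transition, phi, gamma, num_samples=10_000):
    """Monte Carlo estimate of A in (56) and of lambda_min(A + A^T),
    which upper-bounds the admissible lambda_TD.

    sample_transition() -> (s, a, s_next, a_next) with a ~ pi(s) and
                           a_next ~ pi(s_next); an assumed interface.
    """
    s, a, _, _ = sample_transition()
    p = phi(s, a).shape[0]
    A = np.zeros((p, p))
    for _ in range(num_samples):
        s, a, s_next, a_next = sample_transition()
        feat, feat_next = phi(s, a), phi(s_next, a_next)
        A += np.outer(feat, feat - gamma * feat_next)
    A /= num_samples
    return np.min(np.linalg.eigvalsh(A + A.T))
```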

Corollary 3

Consider the actor parameter sequence defined by Algorithm 4. Suppose the actor step-size is chosen as \(\eta _k = k^{-1/2}\) and the critic step-size takes the form \(\alpha _t = {1}/{(t+1)^\sigma }\) where \(\sigma \in (0,1)\). Then, for large enough k,

$$\begin{aligned} K_\epsilon \le {\mathcal {O}}\left( \epsilon ^{-2/\sigma }\right) \end{aligned}$$
(57)

Proof

Here we invoke the TD(0) convergence result from ((Dalal et al., 2018b), Theorem 3.1) which establishes that

$$\begin{aligned} {\mathbb {E}}[\Vert \xi _t - \xi _*\Vert ^2] \le K_1 e^{-\lambda _\text {TD} t^{1-\sigma }/2} + \frac{K_2}{t^\sigma } \end{aligned}$$
(58)

for some positive constants \(K_1\) and \(K_2\). For \(\sigma\) not close to 1, the first term is dominated by \({K_2}/{t^\sigma }\), which permits us to write that

$$\begin{aligned} {\mathbb {E}}[\Vert \xi _t - \xi _*\Vert ^2] \le {\mathcal {O}}\left( \frac{1}{t^\sigma }\right) \end{aligned}$$
(59)

Applying Jensen’s inequality, we have

$$\begin{aligned} {\mathbb {E}}[\Vert \xi _t - \xi _*\Vert ]^2 \le {\mathbb {E}}[\Vert \xi _t - \xi _*\Vert ^2] \le {\mathcal {O}}\left( \frac{1}{t^\sigma }\right) . \end{aligned}$$
(60)

Taking the square root on both sides gives us

$$\begin{aligned} {\mathbb {E}}[\Vert \xi _t - \xi _*\Vert ] \le {\mathcal {O}}\left( \frac{1}{t^{\sigma /2}}\right) , \end{aligned}$$
(61)

which means that the convergence rate statement of Proposition 1 is satisfied with parameter \(b = \sigma /2\). Because \(b = \sigma /2 < 1/2\), this specializes Theorem 3, specifically (38), to case (i), which yields the rate

$$\begin{aligned} K_\epsilon \le {\mathcal {O}}\left( \epsilon ^{-2/\sigma } \right) \;. \end{aligned}$$
(62)

Thus the claim in Corollary 3 is valid. \(\square\)

The operative phrase in the proof of the preceding corollary is for \(\sigma\) not close to 1. This is because we want the second term of (58) to dominate the first (exponential) term so that Proposition 1 holds. Asymptotically this is not a problem; however, for finite sample complexity, the point at which the exponential term is dominated by the second term is highly sensitive to both \(\lambda _\text {TD}\) and \(\sigma\). In particular, \(\sigma\) can be chosen larger as the value of \(\lambda _\text {TD}\) grows. The choice of \(\sigma\) as a function of \(\lambda _\text {TD}\) and the number of iterates is summarized in Fig. 1.
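
The critical value of \(\sigma\) (the largest \(\sigma\) for which the exponential term of (58) is still matched by \(K_2/t^\sigma\) at a given iteration index t) can be located numerically by bisection, since the gap between the two terms is monotone in \(\sigma\). The sketch below, with placeholder constants \(K_1, K_2\), is the kind of computation behind curves such as those in Fig. 1; it is illustrative rather than the authors' code.

```python
import numpy as np

def critical_sigma(lam_td, t, K1=1.0, K2=1.0, tol=1e-6):
    """Find sigma in (0, 1) where the exponential term of (58),
    K1 * exp(-lam_td * t**(1 - sigma) / 2), equals the polynomial term
    K2 / t**sigma at a fixed iteration index t.  K1, K2 are illustrative
    placeholders for the (unknown) problem-dependent constants."""
    def gap(sigma):
        return K1 * np.exp(-lam_td * t ** (1.0 - sigma) / 2.0) - K2 / t ** sigma

    lo, hi = 1e-6, 1.0 - 1e-6
    if gap(hi) < 0:          # exponential term dominated for every sigma
        return hi
    while hi - lo > tol:     # bisection: gap() is increasing in sigma
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if gap(mid) < 0 else (lo, mid)
    return 0.5 * (lo + hi)

# Example: a larger lam_td pushes the critical sigma (and hence the usable
# critic stepsize exponent) upward, consistent with the discussion above.
# print(critical_sigma(lam_td=0.1, t=10_000), critical_sigma(lam_td=1.0, t=10_000))
```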

Fig. 1 Plot shows the critical value of \(\sigma\) for which the exponential term of (58) is dominated by the second term, thereby allowing Proposition 1 to hold. In particular, any \(\sigma > 0\) chosen between zero and the curves shown above satisfies the proposition. We show plots for varying values of \(\lambda _\text {TD}\), which is determined by the feature space representation. For each value of \(\lambda _\text {TD}\), we vary the ratio of the constants \(K_2/K_1\) from 0.001 to 100

We find that as the value of \(\lambda _\text {TD}\) increases, the critical value of \(\sigma\) also increases. This means that the stepsize of the critic can be chosen to be larger, allowing for faster convergence. Again, we define the critical value of \(\sigma\) to be the point at which both terms on the right-hand side of (58) are equal at a specific time t. Therefore, the feature space representation plays a large role in the performance of actor-critic with TD(0) updates. This becomes apparent in our numerical results (Sect. 7). We note that the GTD rates given in Corollary 1 hinge upon strong convexity of the projected Bellman error, which may hold for carefully chosen state-action feature maps, bounded parameter spaces, and lower bounds on the reward. These conditions are absent for TD(0) critic updates.

In the next section, we will consider analysis of actor-critic with TD(0) critic updates in the case where the state and action spaces are finite. As would be expected, this added assumption significantly improves the bound on the rate of convergence, i.e., reduces the sample complexity needed for policy parameters that are within \(\epsilon\) of stationary points of the value function.

6.2 Finite state and action spaces

In this section, we characterize the rate of convergence of the actor-critic defined by Algorithm 1 with TD(0) critic updates (Algorithm 4) when the numbers of states and actions are finite, i.e., \(|{\mathcal {S}}| = S<\infty\) and \(|{\mathcal {A}}| = A<\infty\). This setting yields faster convergence. A key quantity in the analysis of TD(0) in finite spaces is the minimal eigenvalue of the covariance of the feature map \(\varphi (s,a)\) weighted by the policy \(\pi (s)\), which is defined as

$$\begin{aligned} \omega =\min \left\{ \text {eig} \left( \sum _{(s,a) \in {\mathcal {S}}\times {\mathcal {A}}} \pi (s) \varphi (s,a)\varphi (s,a)^\top \right) \right\} \;. \end{aligned}$$
(63)

That \(\omega\) is well defined is a consequence of the finite state-action space assumption. The quantity (63) is used to define conditions on the rate of step-size attenuation for the TD(0) critic updates [cf. (30)] in ((Bhandari et al., 2018), Theorem 2 (c)), which we invoke to establish the iteration complexity of actor-critic in finite spaces. We do so next.
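
In finite spaces, \(\omega\) can be computed directly once the features and the policy weights are tabulated. The following is a minimal sketch mirroring (63); array names and shapes are illustrative.

```python
import numpy as np

def min_feature_eig(pi, features):
    """Compute omega in (63): the minimum eigenvalue of the policy-weighted
    feature covariance over a finite state-action space.

    pi       : array of shape (S,); pi[s] is the weight attached to state s,
               matching the weighting written in (63).
    features : array of shape (S, A, p); features[s, a] = varphi(s, a).
    """
    S, A, p = features.shape
    cov = np.zeros((p, p))
    for s in range(S):
        for a in range(A):
            f = features[s, a]
            cov += pi[s] * np.outer(f, f)
    return np.min(np.linalg.eigvalsh(cov))
```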

Corollary 4

Consider the actor parameter sequence defined by Algorithm 4. Let the actor step-size satisfy \(\eta _k = k^{-1/2}\) and the critic step-size decrease as \(\alpha _t = {\beta }/({\lambda + t})\) where \(\beta = 2/\omega (1-\gamma )\) and \(\lambda = 16/\omega (1-\gamma )^2\). Then when the number of critic updates per actor update satisfies \(T_C(k) = k+1\), the following convergence rate holds

$$\begin{aligned} K_\epsilon \le O\left( \epsilon ^{-2}\right) \end{aligned}$$
(64)

Proof

We begin by invoking the TD(0) convergence result ((Bhandari et al., 2018), Theorem 2 (c)):

$$\begin{aligned} {\mathbb {E}}[\Vert \xi _t - \xi _*\Vert ^2] \le {\mathcal {O}}\left( \frac{K_1}{t + K_2}\right) , \end{aligned}$$
(65)

for some constants \(K_1, K_2\) which depend on \(\omega\) and \(\sigma\). Applying Jensen’s inequality, we have

$$\begin{aligned} {\mathbb {E}}[\Vert \xi _t - \xi _*\Vert ]^2 \le {\mathbb {E}}[\Vert \xi _t - \xi _*\Vert ^2] \le {\mathcal {O}}\left( \frac{K_1}{t+K_2}\right) . \end{aligned}$$
(66)

Taking the square root on both sides yields

$$\begin{aligned} {\mathbb {E}}[\Vert \xi _t - \xi _*\Vert ]\le {\mathcal {O}}\left( \frac{K_1^{1/2}}{(t+K_2)^{1/2}}\right) \lesssim {\mathcal {O}}(t^{-1/2}), \end{aligned}$$
(67)

which means that Proposition 1 is valid with critic convergence rate parameter \(b = 1/2\). Therefore, we may apply Theorem 3 to obtain the rate

$$\begin{aligned} K_\epsilon \le {\mathcal {O}}\left( \epsilon ^{-2} \right) \end{aligned}$$
(68)

as stated in Corollary 4. \(\square\)

7 Numerical results

In this section, we compare the convergence rates of actor-critic with the aforementioned critic-only methods on a two-dimensional navigation problem and the inverted pendulum. Before detailing the RL problem specifics, we first discuss the metrics we use to evaluate both performance and convergence.

Because the main objective is to maximize the long term average reward accumulation, it is natural to measure the cumulative reward of a trajectory. We evaluate the policy without action noise \((\sigma ^2 = 0)\), with a fixed trajectory length, and with a fixed starting position, which makes the plots easier to compare. In addition, we consider a proxy for the gradient norm. In particular, we calculate the norm of the difference between two consecutive normalized actor parameters, \(\Vert \theta _k/ \Vert \theta _k\Vert - \theta _{k+1}/\Vert \theta _{k+1}\Vert \Vert\). The normalization treats two scaled versions of the same parameter equivalently, which is meaningful because the action vector fields induced by such parameters (see Fig. 3) are scaled versions of each other. In this form, the gradient norm proxy serves as the optimization metric on which our main result is based.
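
As a concrete illustration of what is being plotted, the proxy reduces to a short helper (names are illustrative):

```python
import numpy as np

def grad_norm_proxy(theta_k, theta_k1):
    """Proxy for the gradient norm: distance between consecutive actor
    parameters after normalizing each to unit length, so that two scaled
    versions of the same parameter are treated as identical."""
    return np.linalg.norm(theta_k / np.linalg.norm(theta_k)
                          - theta_k1 / np.linalg.norm(theta_k1))
```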

Along with varying the critic-only methods, we consider two additional variations on policy gradient in which the Q function is replaced by the advantage and value functions. Recall the definition of the value function from (1). The advantage function is defined by \(A(s_t, a_t) = Q(s_t, a_t) - V(s_t)\), which, by definition of the Q function, can also take the form \(A(s_t, a_t) = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\) (Mnih et al., 2016). The main benefit of using the value and advantage functions instead of the Q function for actor-critic is that the dimension of the function approximator domain is smaller, as the agent only needs to learn over the state space.
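
With a linear value function \(V(s) = v^\top \varphi (s)\) over state-only features, the one-step advantage estimate above is a single line; the sketch below uses hypothetical names.

```python
import numpy as np

def advantage_estimate(v_weights, phi_state, s, r, s_next, gamma):
    """One-step advantage A(s, a) ~ r + gamma * V(s') - V(s) with a linear
    value function V(s) = v_weights^T phi_state(s).  phi_state is a
    state-only feature map, which is the dimensionality saving noted above."""
    return r + gamma * v_weights @ phi_state(s_next) - v_weights @ phi_state(s)
```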

7.1 Navigating around an obstacle

We consider the problem of a point agent starting at an initial state \(s_0 \in {\mathbb {R}}^2\) whose objective is to navigate to a destination \(s^*\in {\mathbb {R}}^2\) while remaining in the free space at all times. The free space \({\mathcal {X}}\subset {\mathbb {R}}^2\) is defined by

$$\begin{aligned} {{\mathcal {X}}}:= \left\{ s\in {\mathbb {R}}^2 \Big | \Vert s\Vert \in [0.5,4] \right\} . \end{aligned}$$
(69)

The feature representation of the state is determined by a radial basis (Gaussian) kernel where

$$\begin{aligned} \kappa (s,s') = \exp \left\{ \frac{-\Vert s-s'\Vert _2^2}{2\sigma ^2}\right\} . \end{aligned}$$
(70)

The p kernel points are chosen evenly on the \([-5,5]\times [-5,5]\) grid so that the feature representation becomes

$$\begin{aligned} \varphi (s) = \begin{bmatrix} \kappa (s,s_1)&\kappa (s,s_2)&\dots&\kappa (s,s_p) \end{bmatrix}^\top , \end{aligned}$$
(71)

which we normalize. Given the state \(s_t\), the action is sampled from a multivariate Gaussian distribution with covariance matrix \(\Sigma = 0.5\cdot I_2\) and mean given by \(\theta _k^\top \varphi (s_t)\). We let the action determine the direction in which the agent will move. As such, the state transition is determined by \(s_{t+1} = s_t + 0.5a_t/\Vert a_t\Vert\).

Because the agent's objective is to reach the target \(s^*\) while remaining in \({\mathcal {X}}\) for all time, we want to penalize the agent heavily for taking actions which result in the next step being outside the free space and reward the agent for being close to the target. As such, we define the reward function to be

$$\begin{aligned} r_{t+1} = {\left\{ \begin{array}{ll} -11 &{} \text {~if~} s_{t+1} \notin {{\mathcal {X}}} \\ -0.1 &{} \text {~if~} \Vert s_{t+1} - s^*\Vert < 0.5 \\ -1 &{} \text {~otherwise~}. \end{array}\right. } \end{aligned}$$
(72)

The design of this reward function for the navigation problem is informed by Zhang et al. (2020), which suggests that the reward function should be bounded away from zero. In this simulation, we allow the agent to continue taking actions through the obstacle. This formulation is similar to a car driving on a race track which has grass outside the track: the car is allowed to drive off the track, but it incurs a larger cost due to the substandard driving conditions.
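
Putting (69)–(72) together, the environment, feature map, and policy used in this experiment can be sketched as follows. The grid resolution, bandwidth, and function names are illustrative choices for exposition, not the authors' implementation.

```python
import numpy as np

SIGMA_K = 1.0          # RBF bandwidth in (70); an illustrative choice
GRID = np.array([(x, y) for x in np.linspace(-5, 5, 10)
                        for y in np.linspace(-5, 5, 10)])   # p kernel points

def features(s):
    """Normalized RBF features (70)-(71) over the kernel grid."""
    k = np.exp(-np.sum((GRID - s) ** 2, axis=1) / (2 * SIGMA_K ** 2))
    return k / np.linalg.norm(k)

def sample_action(theta, s, cov=0.5):
    """Gaussian policy: mean theta^T varphi(s), covariance 0.5 * I_2."""
    return np.random.multivariate_normal(theta.T @ features(s), cov * np.eye(2))

def step(s, a, target):
    """Transition s_{t+1} = s_t + 0.5 a / ||a|| and reward (72)."""
    s_next = s + 0.5 * a / np.linalg.norm(a)
    in_free_space = 0.5 <= np.linalg.norm(s_next) <= 4.0     # annulus (69)
    if not in_free_space:
        r = -11.0
    elif np.linalg.norm(s_next - target) < 0.5:
        r = -0.1
    else:
        r = -1.0
    return s_next, r
```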

This particular formulation does not allow for generalization: if the target, the obstacle location, or the starting point of the agent is moved, the agent must learn a new policy from scratch. We emphasize, however, that it is the rates of convergence which are of interest in this exposition, not necessarily the best way to design the navigation problem.

Algorithm Specifics We consider the problem with \(\gamma = 0.97\). In practice, we use the entire trajectory data for the critic updates. In particular, for each actor parameter update, we run ten critic updates with rollout length \(T = 66\) (which corresponds to the expected rollout length for \(\gamma = 0.97\)). Similarly, we update the actor along a trajectory of rollout length H = 67. For the simulations, the actor stepsize \(\eta _t\) is chosen to be constant, \(\eta =10^{-3}\). For TD(0), we also let the critic stepsize be constant, namely \(\alpha _t = \alpha = 0.05\). For GTD, we let \(\alpha _t = t^{-1}\) and \(\beta _t = t^{-2/3}\). For A-GTD, we set \(\alpha _t = t^{-1}\) and \(\beta _t = t^{-4/5}\). We draw the initial state uniformly at random on the grid \([-2,2]\times [-2,2]\), and we set the target to be \(s^* = (-2,-2)\). For each critic-only method, we run the algorithm 50 times. We evaluate the policy by measuring the accumulated reward of a trajectory of length \(H = 66\).

Fig. 2 Navigation Problem: (a) Average reward per episode with confidence bounds over 50 trials. (b) Average gradient norm proxy over 50 trials. A-GTD converges fastest with respect to the cumulative reward and gradient norm proxy at the cost of converging to a suboptimal stationary point (see Fig. 3). A moving average filter of size ten has been applied on the gradient norm proxy to aid in comparison

Fig. 3 Visualization of the learned policy for the navigation problem. The obstacle is shown in the top right corner, and the target is located at (-2,-2). As Fig. 2 (a) depicts, TD (shown in (a)) and GTD (shown in (b)) learn meaningful policies which guide the agent to the target. In contrast, A-GTD (shown in (c)) simply learns to avoid the obstacle

7.2 Pendulum problem

We also consider the canonical continuous state-action space reinforcement learning problem of the pendulum. The objective is to balance the pendulum upright starting from any initial position. Given that this is a well-established benchmark for reinforcement learning, we refer the reader to Brockman et al. (2016) for the specifications of the reward and transition dynamics. Similar to the navigation problem, we let the feature representation be determined by a radial basis (Gaussian) kernel (cf. (70)), where the p kernel points are chosen evenly on the box \([-1, 1]\times [-1, 1]\times [-8, 8]\times [-2, 2]\); the bounds come from the sine and cosine of the angle \(\theta\), the time derivative of the angle \({\dot{\theta }}\), and the maximum torque of the action, respectively. The action is sampled from a normal distribution with mean \(\theta _k^\top \varphi (s)\) and variance \(\sigma _a^2\). Like the navigation problem, we use a linear policy and a linear critic. Again, we stress that these experiments are meant to show the rates of convergence, not necessarily to find the best way to solve the pendulum problem. For the pendulum problem, we only consider advantage actor-critic.

Algorithm Specifics Similar to the navigation problem, we let \(\gamma = 0.97\), and we use the entire trajectory data for the critic updates. In particular, for each actor parameter update, we run ten critic updates with rollout length \(T = 66\) (which corresponds to the expected rollout length for \(\gamma = 0.97\)). Similarly, we update the actor along a trajectory of rollout length H = 66. For the simulations, the actor stepsize \(\eta _t\) is chosen to be constant, \(\eta =0.01\). For the critic-only methods, we also let the critic stepsize be constant. In particular, we let \(\alpha _t = 0.01\) for TD(0), \((\alpha _t, \beta _t) = (0.2, 0.01)\) for GTD, and \((\alpha _t, \beta _t) = (0.05, 0.005)\) for A-GTD. We evaluate the policy by measuring the average reward accumulated by a single trajectory starting at \(\theta = \pi /2\) with angular velocity \({\dot{\theta }} = 1\). The action variance is chosen to be \(\sigma _a^2 = 0.5\).

Fig. 4 Pendulum Problem: (a) Average reward per episode with confidence bounds over 50 trials. (b) Average gradient norm proxy over 50 trials. In contrast to the navigation problem there is a significant gain in using advantage actor-critic; here, the state action (Q) function was used instead of the value function (V). A moving average filter of size ten has been applied on the gradient norm proxy to aid in comparison

7.3 Discussion

Recall that Corollaries 1, 2, and 3 establish that the convergence rates for GTD, A-GTD, and TD(0) are \(O(\epsilon ^{-3})\), \(O(\epsilon ^{-5/2})\), and \(O(\epsilon ^{-2/\sigma })\), respectively [also see Table 1]. Figure 2 shows the performance on the navigation problem with value and advantage function policy gradient updates. As expected, A-GTD converges fastest with respect to the gradient norm proxy, while GTD and TD(0) are comparable. The plots highlight a disconnect between the convergence in reward and the convergence in gradient norm. Namely, TD converges faster in gradient norm, but slower with respect to the cumulative reward. Even more interesting, although A-GTD converges fastest with respect to gradient norm and reward, its resulting stationary point is suboptimal compared to TD and GTD (see Fig. 3). On the other hand, GTD and TD(0) converge more slowly, and they consistently reach the solved region marked by the solid black line at \(-66\). We say that rewards greater than \(-66\) correspond to solved trajectories because such trajectories spend time in the destination region; a trajectory which does not reach the destination region will have accumulated reward of \(-66\) or less. Taken together, these theoretical and experimental results suggest a tight coupling between the choice of training methodology and the quality of learned policies. Thus, just as the choice of optimization method, statistical model, and sample size influence generalization in supervised learning, they do so in reinforcement learning. Theorem 3 characterizes the rate of convergence to a stationary point of the value function; however, it does not provide any guarantee on the quality of the stationary point. Figure 3 captures this trade-off between convergence rate and quality of the stationary point.

The disconnect between convergence in reward and convergence in gradient norm appears again in the pendulum problem. Figure 4 (b) shows the gradient norm proxy for advantage actor-critic applied to the pendulum. Consistent with Table 1, A-GTD converges the fastest, followed by GTD and TD(0). Here, we again see the disconnect between convergence in gradient norm and cumulative reward: in the first few iterations, TD(0) actually converges the fastest, and in tandem its cumulative reward also increases quickly, yet by the final episode TD(0) and A-GTD perform worse than GTD. This is consistent with the trade-off between convergence rate and quality of the stationary point observed in the navigation problem.

There are a number of future directions to take this work. To begin, we can establish bounds on cases where the samples are not i.i.d., but instead have Markovian noise. Second, we can further generalize our results to consider a generic critic convergence rate that does not necessarily take the form of Proposition 1. Third, we can explore the choice of feature representation to explicitly characterize the convergence rate of actor-critic with TD(0) critic updates with respect to \(\lambda _\text {TD}\). Finally, we can characterize the behavior of the variance and use such characterizations to accelerate training.