1 Introduction

Reinforcement Learning (RL) (Sutton & Barto, 2018) has achieved astounding successes in games (Mnih et al., 2015; Silver et al., 2018; OpenAI, 2018; Vinyals et al., 2019), matching or surpassing human performance on several occasions. However, the much-anticipated applications of RL to real-world tasks such as robotics (Kober et al., 2013), autonomous driving (Okuda et al., 2014) and finance (Li & Hoi, 2014) still seem far off. This delay may be due to the very nature of RL, which relies on the repeated interaction of the learning machine with the surrounding environment, e.g., a manufacturing plant, a busy road, a stock market. The trial-and-error process resulting from this interaction is what makes RL so powerful and general. However, it also poses significant challenges in terms of sample efficiency (Recht, 2019) and safety (Amodei et al., 2016).

In reinforcement learning, the term safety can actually refer to a variety of problems (García & Fernández, 2015). The general concern is always the same: avoiding or limiting damage. In financial applications, the damage is typically a loss of money. In robotics and autonomous driving, one should also consider direct damage to people and property. In this work, we do not make assumptions about the nature of the damage, but we assume it is entirely encoded in the scalar reward signal that is presented to the agent to evaluate its actions. Other works (e.g., Turchetta et al., 2016) employ a distinct safety signal, separate from rewards.

A further distinction is necessary on the scope of safety constraints with respect to the agent’s life. One may simply require the final behavior, the one deployed at the end of the learning process, to be safe. This is typically the case when learning is performed in simulation, but the final controller has to be deployed in the real world. The main challenge there is transferring safety properties from simulation to reality (e.g., Tan et al., 2018). In other cases, learning must be performed, or at least finalized, on the actual system, because no reliable simulator is available (e.g., Peters & Schaal, 2008). In such a scenario, safety must be enforced for the whole duration of the learning process. This poses a further challenge, as the agent must necessarily go through a sequence of sub-optimal behaviors before learning its final policy. The problem of learning while containing the damage is also known as safe exploration (Amodei et al., 2016) and is the focus of this work.

García and Fernández (2015) provide a comprehensive survey on safe RL, where the existing approaches are organized into two main families: methods that modify the exploration process directly in order to explicitly avoid dangerous actions (e.g., Gehring & Precup, 2013), and methods that constrain exploration in a more indirect way by modifying the reward optimization process. The former typically require some sort of external knowledge, such as human demonstrations or advice (e.g., Abbeel et al., 2010; Clouse & Utgoff, 1992). In this work, we only assume online access to a sufficiently informative reward signal and prior knowledge of some worst-case constants that are easy to obtain. Optimization-based methods (those belonging to the second class) are better suited for this scenario. A particular kind, identified by García and Fernández as constrained criteria (Moldovan & Abbeel, 2012; Castro et al., 2012; Kadota et al., 2006), enforces safety by introducing constraints in the optimization problem, i.e., reward maximization.

A typical constraint is that the agent’s performance, i.e., the sum of rewards, must never be less than a user-specified threshold (Geibel & Wysotzki, 2005; Thomas et al., 2015), which may be the average performance of a trusted baseline policy. Under the assumption that the reward signal also encodes danger, low performance can be associated with dangerous behavior, so that the performance threshold acts as a safety threshold. This falls into the general framework of Seldonian machine learning introduced by Thomas et al. (2019).

If we only cared about the safety of the final controller, the traditional RL objective of maximizing cumulative reward would be enough. However, most RL algorithms are known to exhibit oscillating performance during the learning phase. Regardless of the final solution, the intermediate ones may violate the threshold and hence yield unsafe behavior. This problem is known as policy oscillation (Bertsekas, 2011; Wagner, 2011).

A similar constraint, which confronts the policy oscillation problem even more directly, is Monotonic Improvement (MI, Kakade & Langford, 2002; Pirotta et al., 2013), and it is the one adopted in this work. The requirement is that each new policy implemented by the agent during the learning process performs no worse than the previous one. In this way, if the initial policy is safe, so are all the subsequent ones.

The way safety constraints such as MI can be imposed on the optimization process depends, of course, on what kind of policies are considered as candidates and on how the optimization itself is performed. These two aspects are often tied and depend on the specific kind of RL algorithm that is employed. Policy Search or Policy Optimization (PO, Deisenroth et al., 2013) is a family of RL algorithms where the class of candidate policies is fixed in advance and a direct search for the best policy within the class is performed. This makes PO algorithms radically different from value-based algorithms such as Deep Q-Networks (Mnih et al., 2015), where the optimal policy is a byproduct of a learned value function. Although value-based methods gained great popularity from their successes in games, PO algorithms are better suited for real-world tasks, especially those involving cyber-physical systems. The main reasons are the ability of PO methods to deal with high-dimensional continuous state and action spaces, their convergence guarantees (Sutton et al., 2000), their robustness to sensor noise, and the superior control they offer over the set of feasible policies. The latter allows introducing domain knowledge into the optimization process, possibly including safety constraints.

In this work, we focus on Policy Gradient methods (PG, Peters & Schaal, 2008; Sutton et al., 2000), where the set of candidate policies is a class of parametric distributions and the optimization is performed via stochastic gradient ascent on the performance objective as a function of the policy parameters. In particular, we analyze the prototypical PG algorithm, REINFORCE (Williams, 1992), and show how the MI constraint can be enforced by adaptively selecting its meta-parameters during the learning process. To achieve this, we study in more depth the stochastic gradient-based optimization process that is at the core of all PG methods (Robbins & Monro, 1951). In particular, we identify a general family of parametric policies that makes the optimization objective Lipschitz-smooth (Nesterov, 2013) and allows easy upper-bounding of the related smoothness constant. This family, referred to as smoothing policies, includes commonly used policy classes from the PG literature, namely Gaussian and Softmax policies. Using known properties of Lipschitz-smooth functions, we then provide lower bounds on the performance improvement produced by gradient-based updates, as a function of tunable meta-parameters. This, in turn, allows identifying meta-parameter schedules that guarantee MI with high probability. In previous work, a similar result was achieved only for Gaussian policies (Pirotta et al., 2013; Papini et al., 2017).

The meta-parameters studied here are the step size of the policy updates, or learning rate, and the batch size of gradient estimates, i.e., the number of trials that are performed within a single policy update. These meta-parameters, already present in the original REINFORCE algorithm, are typically selected by hand and fixed for the whole learning process (Duan et al., 2016). Besides guaranteeing monotonic improvement, our proposed method removes the burden of selecting these meta-parameters. This safe, automatic selection within the REINFORCE algorithmic framework yields SPG, our Safe Policy Gradient algorithm.

The paper is organized as follows: in Sect. 2 we introduce the necessary background on Markov decision processes, policy optimization, and smooth functions. In Sect. 3, we introduce smoothing policies and show the useful properties they induce on the policy optimization problem, most importantly a lower bound on the performance improvement yielded by an arbitrary policy parameter update (Theorem 7). In Sect. 4, we exploit these properties to select the step size of REINFORCE in a way that guarantees MI with high probability when the batch size is fixed, then we achieve similar results with an adaptive batch size. In Sect. 5, we design a monotonically improving policy gradient algorithm with adaptive batch size, called safe policy gradient (SPG), and show how the latter can also be adapted to weaker improvement constraints. In Sect. 6, we offer a detailed comparison of our contributions with the most closely related literature. In Sect. 7 we empirically evaluate SPG on simulated control tasks. Finally, we discuss the limitations of our approach and propose directions for future work in Sect. 8.

2 Preliminaries

In this section, we review continuous Markov decision processes (MDPs, Puterman, 2014), actor-only Policy Gradient algorithms (PG, Deisenroth et al., 2013), and some general properties of smooth functions.

2.1 Markov decision processes

A Markov decision process (MDP, Puterman, 2014) is a tuple \({\mathcal {M}=\langle \mathcal {S},\mathcal {A},p,r,\gamma ,\mu \rangle }\), comprised of a measurable state space \({\mathcal {S}}\), a measurable action space \({\mathcal {A}}\), a Markovian transition kernel \({p:\mathcal {S}\times \mathcal {A}\rightarrow \Delta _\mathcal {S}}\), where \(\Delta _\mathcal {S}\) denotes the set of probability distributions over \(\mathcal {S}\), a reward function \({r:\mathcal {S}\times \mathcal {A}\rightarrow \mathbb {R}}\), a discount factor \(\gamma \in (0,1)\) and an initial-state distribution \(\mu \in \Delta _\mathcal {S}\). We only consider bounded-reward MDPs, and denote with \(R\ge \sup _{s\in \mathcal {S},a\in \mathcal {A}}\vert r(s,a)\vert\) (a known upper bound on) the maximum absolute reward. This is the only prior knowledge we have on the task. The MDP is used to model the interaction of a rational agent with the environment. We model the agent’s behavior with a policy \(\pi :\mathcal {S}\rightarrow \Delta _\mathcal {A}\), a stochastic mapping from states to actions. The initial state is drawn as \(s_0\sim \mu\). For each time step \(t=0,1,\dots\), the agent draws an action \(a_t\sim \pi (\cdot \vert s_t)\), conditional on the current state \(s_t\). Then, the agent obtains a reward \(r_{t+1}=r(s_t,a_t)\) and the state of the environment transitions to \(s_{t+1}\sim p(\cdot \vert s_t, a_t)\). The goal of the agent is to maximize the expected sum of discounted rewards, or performance measure:

$$\begin{aligned} J(\pi ) {:}{=}\mathop {\mathbb {E}}_{}\left[ {\sum _{t=0}^\infty \gamma ^t r_{t+1} \vert s_0\sim \mu , a_t\sim \pi (\cdot \vert s_t), s_{t+1}\sim p(\cdot \vert s_t,a_t)}\right] . \end{aligned}$$
(1)

We focus on continuous MDPs, where states and actions are real vectors: \(\mathcal {S}\subseteq \mathbb {R}^{d_{\mathcal {S}}}\) and \(\mathcal {A}\subseteq \mathbb {R}^{d_{\mathcal {A}}}\). However, all the results naturally extend to the discrete case by replacing integrals with summations. See Puterman (2014) and Bertsekas and Shreve (2004) for matters of measurability and integrability, which require only standard technical assumptions. We slightly abuse notation by denoting probability measures (assumed to be absolutely continuous) and density functions with the same symbol.

Given an MDP, the purpose of RL is to find an optimal policy \(\pi ^*\in \arg \max _{\pi }J(\pi )\) without knowing the transition kernel \(p\) and the reward function \(r\) in advance, but only through interaction with the environment. To better characterize this optimization objective, it is convenient to introduce further quantities. We denote with \(p_{\pi }\) the transition kernel of the Markov Process induced by policy \(\pi\), i.e., \(p_{\pi }(\cdot \vert s) {:}{=}\int _{\mathcal {A}}\pi (a\vert s) p(\cdot \vert s,a)\,\mathrm {d}a\). The t-step transition kernel under policy \(\pi\) is defined inductively as follows:

$$\begin{aligned}&p_{\pi }^{0}(s'\vert s) {:}{=}\mathbbm {1}\left\{ s'=s\right\} ,\nonumber \\&p_{\pi }^{1}(\cdot \vert s) {:}{=}p_{\pi }(\cdot \vert s),\nonumber \\&p_{\pi }^{t+1}(\cdot \vert s) {:}{=}\int _{\mathcal {S}}p_{\pi }^{t}(s'\vert s)p_{\pi }(\cdot \vert s')\,\mathrm {d}s', \end{aligned}$$
(2)

for all \(s\in \mathcal {S}\) and \(t\ge 1\). The t-step transition kernel allows us to define the following conditional state-occupancy measure:

$$\begin{aligned} \rho _{s}^{\pi }(\cdot ) = (1-\gamma )\sum _{t=0}^{\infty }\gamma ^tp_{\pi }^t(\cdot \vert s), \end{aligned}$$
(3)

measuring the (discounted) probability of visiting a state starting from s and following policy \(\pi\). The following property of \(\rho _{s}^\pi\), a variant of the generalized eigenfunction property of Ciosek and Whiteson (2020, Lemma 20), will be useful (proof in "Appendix A.1"):

Proposition 1

Let \(\pi\) be any policy and f be any integrable function on \(\mathcal {S}\) satisfying the following recursive equation:

$$\begin{aligned} f(s) = g(s) + \gamma \int _{\mathcal {S}}p_{\pi }(s'\vert s)f(s')\,\mathrm {d}s', \end{aligned}$$

for all \(s\in \mathcal {S}\) and some integrable function g on \(\mathcal {S}\). Then:

$$\begin{aligned} f(s) = \frac{1}{1-\gamma }\int _{\mathcal {S}}\rho ^{\pi }_{s}(s')g(s')\,\mathrm {d}s', \end{aligned}$$

for all \(s\in \mathcal {S}\).

The state-value function \(V^{\pi }(s)=\mathop {\mathbb {E}}_{\pi }\left[ {\sum _{t=0}^{\infty }\gamma ^t r(S_t,A_t)\vert S_0=s}\right]\) is the discounted sum of rewards obtained, in expectation, by following policy \(\pi\) from state s, and satisfies Bellman’s equation (Puterman, 2014):

$$\begin{aligned} V^{\pi }(s) = \mathop {\mathbb {E}}_{a\sim \pi (\cdot \vert s)}\left[ {r(s,a) + \gamma \mathop {\mathbb {E}}_{s'\sim p(\cdot \vert s,a)}\left[ {V^{\pi }(s')}\right] }\right] , \end{aligned}$$
(4)

Similarly, the action-value function:

$$\begin{aligned} Q^{\pi }(s,a) = r(s,a) + \gamma \mathop {\mathbb {E}}_{s'\sim p(\cdot \vert s,a)}\left[ {V^{\pi }(s')}\right] , \end{aligned}$$
(5)

is the discounted sum of rewards obtained, in expectation, by taking action a in state s and following \(\pi\) afterwards.

The two value functions are closely related:

$$\begin{aligned} V^{\pi }(s)&= \int _{\mathcal {A}}\pi (a\vert s)Q^{\pi }(s,a)\,\mathrm {d}a, \end{aligned}$$
(6)
$$\begin{aligned} Q^{\pi }(s,a)&= r(s,a) + \gamma \int _{\mathcal {S}}p(s'\vert s,a)V^{\pi }(s')\,\mathrm {d}s'. \end{aligned}$$
(7)

For bounded-reward MDPs, the value functions are bounded for every policy \(\pi\):

$$\begin{aligned}&\left\| V^{\pi }\right\| _{\infty } \le \left\| Q^{\pi }\right\| _{\infty } \le \frac{R}{1-\gamma }, \end{aligned}$$
(8)

where \(\left\| V^{\pi }\right\| _{\infty } = \sup _{s\in \mathcal {S}}\vert V^{\pi }(s)\vert\) and \(\left\| Q^{\pi }\right\| _{\infty } = \sup _{s\in \mathcal {S},a\in \mathcal {A}} \vert Q^{\pi }(s,a)\vert\). Using the definition of state-value function we can rewrite the performance measure as follows:

$$\begin{aligned} J(\pi ) = \int _{\mathcal {S}}\mu (s)V^{\pi }(s)\,\mathrm {d}s = \frac{1}{1-\gamma }\int _{\mathcal {S}}\rho ^\pi (s)\int _\mathcal {A}\pi (a\vert s)r(s,a)\,\mathrm {d}a\,\mathrm {d}s, \end{aligned}$$
(9)

where:

$$\begin{aligned} \rho ^{\pi }(\cdot ) = \int _{\mathcal {S}}\mu (s)\rho _{s}^{\pi }(\cdot ) \,\mathrm {d}s, \end{aligned}$$
(10)

is the state-occupancy probability under the starting-state distribution \(\mu\).

2.2 Parametric policies

In this work, we only consider parametric policies. Given a d-dimensional parameter vector \(\varvec{\theta }\in \Theta \subseteq \mathbb {R}^d\), a parametric policy is a stochastic mapping from states to actions parametrized by \(\varvec{\theta }\), denoted with \(\pi _{\varvec{\theta }}\). The search for the optimal policy is thus limited to the policy class \(\Pi _{\Theta } = \left\{ \pi _{\varvec{\theta }}\mid \varvec{\theta }\in \Theta \right\}\). This corresponds to finding an optimal parameter, i.e., \(\varvec{\theta }^*\in \arg \max _{\varvec{\theta }\in \Theta }J(\pi _{\varvec{\theta }})\). For ease of notation, we often write \(\varvec{\theta }\) in place of \(\pi _{\varvec{\theta }}\) in function arguments and superscripts, e.g., \(J(\varvec{\theta })\), \(\rho ^{\varvec{\theta }}(s)\) and \(V^{\varvec{\theta }}(s)\) in place of \(J(\pi _{\varvec{\theta }})\), \(\rho ^{\pi _{\varvec{\theta }}}(s)\) and \(V^{\pi _{\varvec{\theta }}}(s)\), respectively. We restrict our attention to policies that are twice differentiable w.r.t. \(\varvec{\theta }\), for which the gradient \(\nabla _{\varvec{\theta }}\pi _{\varvec{\theta }}(a\vert s)\) and the Hessian \(\nabla _{\varvec{\theta }}^2\pi _{\varvec{\theta }}(a\vert s)\) are defined everywhere and finite. For ease of notation, we omit the \(\varvec{\theta }\) subscript in \(\nabla _{\varvec{\theta }}\) when clear from the context. Given any twice-differentiable scalar function \(f:\Theta \rightarrow \mathbb {R}\), we denote with \(D_if\) the i-th gradient component, i.e., \(\frac{\partial f}{\partial \theta _i}\), and with \(D_{ij}f\) the Hessian element of coordinates (i, j), i.e., \(\frac{\partial ^2f}{\partial \theta _i\partial \theta _j}\). We also write \(\nabla f(\varvec{\theta })\) to denote \(\left. \nabla _{\widetilde{\varvec{\theta }}} f(\widetilde{\varvec{\theta }})\right| _{\widetilde{\varvec{\theta }} = \varvec{\theta }}\) when this does not introduce any ambiguity.

The Policy Gradient Theorem (Sutton et al., 2000; Konda & Tsitsiklis, 1999) allows us to characterize the gradient of the performance measure \(J(\varvec{\theta })\) as an expectation over states and actions visited under \(\pi _{\varvec{\theta }}\):

$$\begin{aligned} \nabla J(\varvec{\theta }) = \frac{1}{1-\gamma }\int _{\mathcal {S}}\rho ^{\varvec{\theta }}(s)\int _{\mathcal {A}}\pi _{\varvec{\theta }}(a\vert s)\nabla \log \pi _{\varvec{\theta }}(a\vert s)Q^{\varvec{\theta }}(s,a)\,\mathrm {d}a \,\mathrm {d}s. \end{aligned}$$
(11)

The gradient of the log-likelihood \(\nabla \log \pi _{\varvec{\theta }}(\cdot \vert s)\) is called the score function, while the Hessian of the log-likelihood \(\nabla ^2\log \pi _{\varvec{\theta }}(\cdot \vert s)\) is sometimes called the observed information.
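For concreteness, consider a Gaussian policy with linear mean, \(\pi _{\varvec{\theta }}(a\vert s)=\mathcal {N}(\varvec{\theta }^{\top }\varvec{\phi }(s),\sigma ^2)\), for a scalar action and a feature map \(\varvec{\phi }\). The following Python sketch computes its score function and observed information; this particular parameterization and the helper names are used only for illustration and are not a requirement of what follows.

import numpy as np

def gaussian_score(theta, phi_s, a, sigma):
    # Score function: grad_theta log pi_theta(a|s) = (a - theta^T phi(s)) * phi(s) / sigma^2
    return (a - np.dot(theta, phi_s)) / sigma**2 * phi_s

def gaussian_observed_information(phi_s, sigma):
    # Observed information: grad^2_theta log pi_theta(a|s) = -phi(s) phi(s)^T / sigma^2,
    # which for this policy class does not depend on the sampled action.
    return -np.outer(phi_s, phi_s) / sigma**2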

2.3 Actor-only policy gradient

In practice, we always consider finite episodes of length T. We call this the effective horizon of the MDP, chosen to be sufficiently large so that the problem does not lose generality. We denote with \({\tau {:}{=}(s_0,a_0,s_1,a_1,\dots ,s_{T-1},a_{T-1})}\) a trajectory, i.e., a sequence of states and actions of length T such that \(s_0\sim \mu\), \(a_t\sim \pi (\cdot \vert s_t)\) for \(t=0,\dots ,T-1\), and \(s_{t}\sim p(\cdot \vert s_{t-1}, a_{t-1})\) for \(t=1,\dots ,T-1\), for some policy \(\pi\). In this context, the performance measure of a parametric policy \(\pi _{\varvec{\theta }}\) can be defined as:

$$\begin{aligned} J(\varvec{\theta }) = \mathop {\mathbb {E}}_{\tau \sim p_{\varvec{\theta }}}\left[ {\sum _{t=0}^{T-1}\gamma ^tr(s_t,a_t)}\right] , \end{aligned}$$
(12)

where \(p_{\varvec{\theta }}(\tau )\) is the probability density of the trajectory \(\tau\) that can be generated by following policy \(\pi _{\varvec{\theta }}\), i.e., \(p_{\varvec{\theta }}(\tau )=\mu (s_0)\pi _{\varvec{\theta }}(a_0\vert s_0)p(s_1\vert s_0,a_0)\dots \pi _{\varvec{\theta }}(a_{T-1}\vert s_{T-1})\). Let \(\mathcal {D}\sim p_{\varvec{\theta }}\) be a batch \(\{\tau _1,\tau _2,\dots ,\tau _N\}\) of N trajectories generated with \(\pi _{\varvec{\theta }}\), i.e., \(\tau _i\sim p_{\varvec{\theta }}\) i.i.d. for \(i=1,\dots ,N\). Let \(\widehat{\nabla }J(\varvec{\theta }{;}\mathcal {D})\) be an estimate of the policy gradient \(\nabla J(\varvec{\theta })\) based on \(\mathcal {D}\). Such an estimate can be used to perform stochastic gradient ascent on the performance objective \(J(\varvec{\theta })\):

$$\begin{aligned} \varvec{\theta }' \leftarrow \varvec{\theta }+ \alpha \widehat{\nabla }J(\varvec{\theta }{;}\mathcal {D}), \end{aligned}$$
(13)

where \(\alpha \ge 0\) is a step size and \(N=\vert \mathcal {D}\vert\) is called the batch size. This yields an Actor-only Policy Gradient method, summarized in Algorithm 1.

Algorithm 1 Actor-only policy gradient (pseudocode figure not reproduced here)
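The structure of Algorithm 1 can be summarized by the following minimal Python sketch, where sample_trajectories and grad_estimator are hypothetical stand-ins for trajectory collection with the current policy and for a policy gradient estimator such as Eq. (14) or (15) below.

import numpy as np

def actor_only_pg(theta0, sample_trajectories, grad_estimator, alpha, N, num_iterations):
    # Stochastic gradient ascent on J(theta), cf. Eq. (13).
    theta = np.asarray(theta0, dtype=float)
    for _ in range(num_iterations):
        # Collect a batch D of N i.i.d. trajectories with the current policy pi_theta
        D = sample_trajectories(theta, N)
        # Estimate the policy gradient from the batch
        grad_hat = grad_estimator(theta, D)
        # Update: theta' <- theta + alpha * grad_hat
        theta = theta + alpha * grad_hat
    return theta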

Under mild conditions, this algorithm is guaranteed to converge to a local optimum (Sutton et al., 2000). Convergence to a local optimum is all that can be guaranteed, since the objective \(J(\varvec{\theta })\) is non-convex in general. As for the gradient estimator, we can use REINFORCE (Williams, 1992; Glynn, 1986):

$$\begin{aligned} \widehat{\nabla }J(\varvec{\theta }{;}\mathcal {D}) = \frac{1}{N}\sum _{i=1}^{N}\left( \sum _{t=0}^{T-1}\gamma ^tr(s_t^i,a_t^i) - b\right) \left( \sum _{t=0}^{T-1}\nabla \log \pi _{\varvec{\theta }}(a_t^i\vert s_t^i)\right) , \end{aligned}$$
(14)

or its refinement, G(PO)MDP (Baxter & Bartlett, 2001), which typically suffers from less variance (Peters & Schaal, 2008):

$$\begin{aligned} \widehat{\nabla }J(\varvec{\theta }{;}\mathcal {D}) = \frac{1}{N}\sum _{i=1}^{N}\sum _{t=0}^{T-1}\left[ \left( \gamma ^tr(s_t^i,a_t^i) - b_t\right) \sum _{h=0}^{t}\nabla \log \pi _{\varvec{\theta }}(a_h^i\vert s_h^i)\right] , \end{aligned}$$
(15)

where the superscript on states and actions denotes the i-th trajectory of the dataset and b is a (possibly time-dependent and vector-valued) control variate, or baseline. Both estimators are unbiased for any action-independent baseline. Peters and Schaal (2008) prove that Algorithm 1 with the G(PO)MDP estimator is equivalent to Monte-Carlo PGT (Policy Gradient Theorem, Sutton et al., 2000), and provide variance-minimizing baselines for both REINFORCE and G(PO)MDP.
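As an illustration, the G(PO)MDP estimator of Eq. (15) with zero baseline (b_t = 0, which keeps it unbiased) can be implemented as in the following sketch, where each trajectory is assumed to be stored as a list of (state, action, reward) triples and score(theta, s, a) is assumed to return the score function of the chosen policy class.

import numpy as np

def gpomdp_gradient(theta, trajectories, score, gamma):
    # G(PO)MDP estimate of the policy gradient, Eq. (15), with b_t = 0.
    grad = np.zeros_like(np.asarray(theta, dtype=float))
    for traj in trajectories:
        cum_score = np.zeros_like(grad)      # sum of scores up to time t
        for t, (s, a, r) in enumerate(traj):
            cum_score += score(theta, s, a)
            grad += (gamma**t) * r * cum_score
    return grad / len(trajectories)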

Algorithm 1 is called actor-only to distinguish it from actor-critic policy gradient algorithms (Konda & Tsitsiklis, 1999), where an approximate value function, or critic, is employed in the gradient computation. In this work, we focus on actor-only algorithms, for which safety guarantees are more easily proven. Generalizations of Algorithm 1 include: reducing the variance of gradient estimates through baselines and other stochastic-optimization techniques (e.g., Papini et al., 2018; Shen et al., 2019; Xu et al., 2020); using a vector step size (Yu et al., 2006; Papini et al., 2017); making the step size adaptive, i.e., iteration and/or data-dependent (Pirotta et al., 2013); making the batch size N adaptive as well (Papini et al., 2017); and applying a preconditioning matrix to the gradient, as in Natural Policy Gradient (Kakade, 2002) and second-order methods (Furmston & Barber, 2012).

2.4 Smooth functions

In the following we denote with \(\left\| \varvec{x}\right\| _{p}\) the \(\ell _p\)-norm of vector \(\varvec{x}\), which is the Euclidean norm for \(p=2\). For a matrix A, \(\left\| A\right\| _{p}=\sup \{\left\| Ax\right\| _{p}:\left\| x\right\| _{p}=1\}\) denotes the induced norm, which is the spectral norm for \(p=2\). When the p subscript is omitted, we always mean \(p=2\).

Let \(g:\mathcal {X}\subseteq \mathbb {R}^d\rightarrow \mathbb {R}^n\) be a (non-convex) vector-valued function. We call g Lipschitz continuous if there exists \(L>0\) such that, for every \(\varvec{x},\varvec{x}'\in \mathcal {X}\):

$$\begin{aligned} \left\| g(\varvec{x}')-g(\varvec{x})\right\| _{} \le L\left\| \varvec{x}'-\varvec{x}\right\| _{}. \end{aligned}$$
(16)

Let \(f:\mathcal {X}\subseteq \mathbb {R}^d\rightarrow \mathbb {R}\) be a real-valued differentiable function. We call f Lipschitz smooth if its gradient is Lipschitz continuous, i.e., there exists \(L>0\) such that, for every \(\varvec{x},\varvec{x}'\in \mathcal {X}\):

$$\begin{aligned} \left\| \nabla f(\varvec{x}')- \nabla f(\varvec{x})\right\| _{} \le L\left\| \varvec{x}'-\varvec{x}\right\| _{}. \end{aligned}$$
(17)

Whenever we want to specify the Lipschitz constant L of the gradient, we call f L-smooth. We also call L the smoothness constant of f. For a twice-differentiable function, the following holds:

Proposition 2

Let \(\mathcal {X}\) be a convex subset of \(\mathbb {R}^d\) and \(f:\mathcal {X}\rightarrow \mathbb {R}\) be a twice-differentiable function. If the Hessian is uniformly bounded in spectral norm by \(L>0\), i.e., \(\sup _{\varvec{x}\in \mathcal {X}}\left\| \nabla ^2f(\varvec{x})\right\| _{2} \le L\), then f is L-smooth.

Lipschitz smooth functions admit a quadratic bound on the deviation from linear behavior:

Proposition 3

(Quadratic Bound) Let \(\mathcal {X}\) be a convex subset of \(\mathbb {R}^d\) and \(f:\mathcal {X}\rightarrow \mathbb {R}\) be an L-smooth function. Then, for every \(\varvec{x},\varvec{x}'\in \mathcal {X}\):

$$\begin{aligned} \left| f(\varvec{x}') - f(\varvec{x}) - \left\langle \varvec{x}'- \varvec{x}, \nabla f(\varvec{x})\right\rangle \right| \le \frac{L}{2}\left\| \varvec{x}' - \varvec{x}\right\| _{}^2, \end{aligned}$$
(18)

where \(\langle \cdot ,\cdot \rangle\) denotes the dot product.

This bound is often useful for optimization purposes (Nesterov, 2013).

3 Smooth policy gradient

In this section, we provide lower bounds on performance improvement based on general assumptions on the policy class.

3.1 Smoothing policies

We introduce a family of parametric stochastic policies having properties that we deem desirable for policy-gradient learning. We call them smoothing, as they render the performance measure Lipschitz smooth:

Definition 1

Let \(\Pi _{\Theta }=\{\pi _{\varvec{\theta }}\mid \varvec{\theta }\in \Theta \}\) be a class of twice-differentiable parametric stochastic policies, where \(\Theta \subset \mathbb {R}^d\) is convex. We call it smoothing if there exist non-negative constants \(\xi _1,\xi _2,\xi _3\) such that, for every state and in expectation over actions, the Euclidean norm of the score function:

$$\begin{aligned}&\sup _{s\in \mathcal {S}}\mathbb {E}_{a\sim \pi _{\varvec{\theta }}(\cdot \vert s)} \Big [ \left\| \nabla \log \pi _{\varvec{\theta }}(a\vert s)\right\| _{} \Big ] \le \xi _1, \end{aligned}$$
(19)

the squared Euclidean norm of the score function:

$$\begin{aligned}&\sup _{s\in \mathcal {S}}\mathbb {E}_{a\sim \pi _{\varvec{\theta }}(\cdot \vert s)} \Big [ \left\| \nabla \log \pi _{\varvec{\theta }}(a\vert s)\right\| _{}^2 \Big ] \le \xi _2, \end{aligned}$$
(20)

and the spectral norm of the observed information:

$$\begin{aligned}&\sup _{s\in \mathcal {S}}\mathbb {E}_{a\sim \pi _{\varvec{\theta }}(\cdot \vert s)} \Big [\left\| \nabla ^2\log \pi _{\varvec{\theta }}(a\vert s)\right\| _{}\Big ] \le \xi _3, \end{aligned}$$
(21)

are upper-bounded.

Note that the definition requires that the bounding constants \(\xi _1,\xi _2, \xi _3\) be independent of the policy parameters and the state. For this reason, the existence of such constants depends on the policy parameterization. We call a policy class \((\xi _1,\xi _2,\xi _3)\)-smoothing when we want to specify the bounding constants. In "Appendix B", we show that some of the most commonly used policies, such as the Gaussian policy for continuous actions and the Softmax policy for discrete actions, are smoothing. The smoothing constants for these classes are reported in Table 1. In the following sections, we will exploit the smoothness of the performance measure induced by smoothing policies to develop a monotonically improving policy gradient algorithm. However, smoothing policies have other interesting properties. For instance, variance upper bounds for REINFORCE/G(PO)MDP with Gaussian policies (Zhao et al., 2011; Pirotta et al., 2013) can be generalized to smoothing policies (see "Appendix D" for details). Other nice properties of smoothing policies, such as Lipschitzness of the performance measure, are discussed in Yuan et al. (2021, Lemma D.1).

Table 1 Smoothing constants \(\xi _1,\xi _2,\xi _3\) and smoothness constant L for Gaussian and Softmax policies, where \(M\) is an upper bound on the Euclidean norm of the feature function, \(R\) is the maximum absolute reward, \(\gamma\) is the discount factor, \(\sigma\) is the standard deviation of the Gaussian policy and \(\tau\) is the temperature of the Softmax policy. We also report the improved smoothness constant by Yuan et al. (2021) as \(L^\star\).
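When closed-form smoothing constants are not available for a given policy class, the three expectations of Definition 1 can be approximated numerically at a given state by sampling actions from the policy, as in the sketch below. This is only a diagnostic illustration with hypothetical helper names (sample_action, score, observed_information); the supremum over states and parameters must still be handled analytically, as done in "Appendix B" for the classes of Table 1.

import numpy as np

def estimate_smoothing_constants(theta, s, sample_action, score, observed_information, n_samples=10000):
    # Monte Carlo estimates of the expectations in (19), (20) and (21) for a single state s.
    xi1, xi2, xi3 = 0.0, 0.0, 0.0
    for _ in range(n_samples):
        a = sample_action(theta, s)
        g = score(theta, s, a)                   # grad log pi_theta(a|s)
        H = observed_information(theta, s, a)    # grad^2 log pi_theta(a|s)
        xi1 += np.linalg.norm(g)
        xi2 += np.linalg.norm(g)**2
        xi3 += np.linalg.norm(H, ord=2)          # spectral norm
    return xi1 / n_samples, xi2 / n_samples, xi3 / n_samples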

3.2 Policy Hessian

We now show that the Hessian of the performance measure \(\nabla ^2J(\varvec{\theta })\) for a smoothing policy has bounded spectral norm. We start by writing the policy Hessian for a general parametric policy as follows. The result is well known (Kakade, 2001), but we report a proof in "Appendix A.4" for completeness. Also, note that our smoothing-policy assumption is weaker than the typical one (uniformly bounded policy derivatives). See "Appendix  A.3" for details.

Proposition 4

Let \(\pi _{\varvec{\theta }}\) be a smoothing policy. The Hessian of the performance measure is:

$$\begin{aligned} \nabla ^2J(\varvec{\theta })&= \frac{1}{1-\gamma }\mathop {\mathbb {E}}_{\begin{array}{c} s\sim \rho ^{\varvec{\theta }}\\ a\sim \pi _{\varvec{\theta }}(\cdot \vert s) \end{array}} \Big [ \nabla \log \pi _{\varvec{\theta }}(a\vert s)\nabla ^{\top } Q^{\varvec{\theta }}(s,a) +\nabla Q^{\varvec{\theta }}(s,a)\nabla ^{\top }\log \pi _{\varvec{\theta }}(a\vert s)\\&\qquad +\left( \nabla \log \pi _{\varvec{\theta }}(a\vert s)\nabla ^{\top }\log \pi _{\varvec{\theta }}(a\vert s) +\nabla ^2\log \pi _{\varvec{\theta }}(a\vert s) \right) Q^{\varvec{\theta }}(s,a) \Big ]. \end{aligned}$$

For smoothing policies, we can bound the policy Hessian in terms of the constants from Definition 1:

Lemma 5

Given a \((\xi _1,\xi _2,\xi _3)\)-smoothing policy \(\pi _{\varvec{\theta }}\), the spectral norm of the policy Hessian can be upper-bounded as follows:

$$\begin{aligned} \left\| \nabla ^2J(\varvec{\theta })\right\| _{} \le \frac{R}{(1-\gamma )^2}\left( \frac{2\gamma \xi _1^2}{1-\gamma } + \xi _2+ \xi _3\right) . \end{aligned}$$

Proof

By the Policy Gradient Theorem (see the proof of Theorem 1 in Sutton et al., 2000):

$$\begin{aligned} \nabla V^{\varvec{\theta }}(s)&= \frac{1}{1-\gamma }\int _{\mathcal {S}}\rho _{s}^{\varvec{\theta }}(s')\int _{\mathcal {A}}\pi _{\varvec{\theta }}(a\vert s')\nabla \log \pi _{\varvec{\theta }}(a\vert s')Q^{\varvec{\theta }}(s',a) \,\mathrm {d}a\,\mathrm {d}s'. \end{aligned}$$
(22)

Using (22), we bound the gradient of the value function in Euclidean norm:

$$\begin{aligned} \left\| \nabla V^{\varvec{\theta }}(s)\right\| _{}&\le \frac{1}{1-\gamma }\mathop {\mathbb {E}}_{\begin{array}{c} s'\sim \rho _{s}^{\varvec{\theta }}\\ a\sim \pi _{\varvec{\theta }}(\cdot \vert s') \end{array}}\left[ {\left\| \nabla \log \pi _{\varvec{\theta }}(a\vert s') Q^{\varvec{\theta }}(s',a)\right\| _{}}\right] \nonumber \\&\le \frac{R}{(1-\gamma )^2}\mathop {\mathbb {E}}_{\begin{array}{c} s'\sim \rho ^{\varvec{\theta }}_s\\ a\sim \pi _{\varvec{\theta }}(\cdot \vert s') \end{array}}\left[ {\left\| \nabla \log \pi _{\varvec{\theta }}(a\vert s')\right\| _{}}\right] \end{aligned}$$
(23)
$$\begin{aligned}&\le \frac{R}{(1-\gamma )^2}\sup _{s'\in \mathcal {S}}\mathop {\mathbb {E}}_{a\sim \pi _{\varvec{\theta }}(\cdot \vert s')}\left[ {\left\| \nabla \log \pi _{\varvec{\theta }}(a\vert s')\right\| _{}}\right] \nonumber \\&\le \frac{\xi _1R}{(1-\gamma )^2}, \end{aligned}$$
(24)

where (23) is from the Cauchy-Schwarz inequality and (8), and (24) is from the smoothing-policy assumption. Next, we bound the gradient of the action-value function. From (7):

$$\begin{aligned} \left\| \nabla Q^{\varvec{\theta }}(s,a)\right\| _{}&=\left\| \nabla \left( r(s,a)+\gamma \mathop {\mathbb {E}}_{s'\sim p(\cdot \vert s,a)}\left[ {V^{\varvec{\theta }}(s')}\right] \right) \right\| _{} \end{aligned}$$
(25)
$$\begin{aligned}&=\gamma \left\| \mathop {\mathbb {E}}_{s'\sim p(\cdot \vert s,a)}\left[ {\nabla V^{\varvec{\theta }}(s')}\right] \right\| _{} \end{aligned}$$
(26)
$$\begin{aligned}&\le \gamma \mathop {\mathbb {E}}_{s'\sim p(\cdot \vert s,a)}\left[ {\left\| \nabla V^{\varvec{\theta }}(s')\right\| _{}}\right] \le \frac{\gamma \xi _1R}{(1-\gamma )^2}, \end{aligned}$$
(27)

where the interchange of gradient and expectation in (26) is justified by the smoothing-policy assumption (see "Appendix  A.3" for details) and (27) is from (24). Finally, from Proposition 4:

$$\begin{aligned} (1-\gamma )\left\| \nabla ^2J(\varvec{\theta })\right\| _{}&\le \mathop {\mathbb {E}}_{\begin{array}{c} s\sim \rho ^{\varvec{\theta }}\\ a\sim \pi _{\varvec{\theta }}(\cdot \vert s) \end{array}}\left[ { \left\| \nabla \log \pi _{\varvec{\theta }}(a\vert s)\nabla ^{\top } Q^{\varvec{\theta }}(s,a)\right\| _{}}\right] \nonumber \\&\qquad +\mathop {\mathbb {E}}_{\begin{array}{c} s\sim \rho ^{\varvec{\theta }}\\ a\sim \pi _{\varvec{\theta }}(\cdot \vert s) \end{array}}\left[ {\left\| \nabla Q^{\varvec{\theta }}(s,a)\nabla ^{\top }\log \pi _{\varvec{\theta }}(a\vert s)\right\| _{}}\right] \nonumber \\&\qquad +\mathop {\mathbb {E}}_{\begin{array}{c} s\sim \rho ^{\varvec{\theta }}\\ a\sim \pi _{\varvec{\theta }}(\cdot \vert s) \end{array}}\left[ {\left\| \nabla \log \pi _{\varvec{\theta }}(a\vert s)\nabla ^{\top }\log \pi _{\varvec{\theta }}(a\vert s) Q^{\varvec{\theta }}(s,a)\right\| _{}}\right] \nonumber \\&\qquad +\mathop {\mathbb {E}}_{\begin{array}{c} s\sim \rho ^{\varvec{\theta }}\\ a\sim \pi _{\varvec{\theta }}(\cdot \vert s) \end{array}}\left[ {\left\| \nabla ^2\log \pi _{\varvec{\theta }}(a\vert s)Q^{\varvec{\theta }}(s,a)\right\| _{}}\right] \end{aligned}$$
(28)
$$\begin{aligned}&\le 2\mathop {\mathbb {E}}_{\begin{array}{c} s\sim \rho ^{\varvec{\theta }}\\ a\sim \pi _{\varvec{\theta }}(\cdot \vert s) \end{array}}\left[ { \left\| \nabla \log \pi _{\varvec{\theta }}(a\vert s)\right\| _{}\left\| \nabla Q^{\varvec{\theta }}(s,a)\right\| _{}}\right] \nonumber \\ {}&\qquad +\mathop {\mathbb {E}}_{\begin{array}{c} s\sim \rho ^{\varvec{\theta }}\\ a\sim \pi _{\varvec{\theta }}(\cdot \vert s) \end{array}}\left[ {\left\| \nabla \log \pi _{\varvec{\theta }}(a\vert s)\right\| _{}^2 \left| Q^{\varvec{\theta }}(s,a)\right| }\right] \nonumber \\&\qquad +\mathop {\mathbb {E}}_{\begin{array}{c} s\sim \rho ^{\varvec{\theta }}\\ a\sim \pi _{\varvec{\theta }}(\cdot \vert s) \end{array}}\left[ {\left\| \nabla ^2\log \pi _{\varvec{\theta }}(a\vert s)\right\| _{}\left| Q^{\varvec{\theta }}(s,a)\right| }\right] \end{aligned}$$
(29)
$$\begin{aligned}&\le \frac{2\gamma \xi _1R}{(1-\gamma )^2}\mathop {\mathbb {E}}_{\begin{array}{c} s\sim \rho ^{\varvec{\theta }}\\ a\sim \pi _{\varvec{\theta }}(\cdot \vert s) \end{array}}\left[ { \left\| \nabla \log \pi _{\varvec{\theta }}(a\vert s)\right\| _{}}\right] \nonumber \\ {}&\qquad +\frac{R}{1-\gamma }\mathop {\mathbb {E}}_{\begin{array}{c} s\sim \rho ^{\varvec{\theta }}\\ a\sim \pi _{\varvec{\theta }}(\cdot \vert s) \end{array}}\left[ {\left\| \nabla \log \pi _{\varvec{\theta }}(a\vert s)\right\| _{}^2}\right] \nonumber \\&\qquad +\frac{R}{1-\gamma }\mathop {\mathbb {E}}_{\begin{array}{c} s\sim \rho ^{\varvec{\theta }}\\ a\sim \pi _{\varvec{\theta }}(\cdot \vert s) \end{array}}\left[ {\left\| \nabla ^2\log \pi _{\varvec{\theta }}(a\vert s)\right\| _{}}\right] \end{aligned}$$
(30)
$$\begin{aligned}&\le \frac{R}{(1-\gamma )}\left( \frac{2\gamma \xi _1^2}{1-\gamma } + \xi _2+ \xi _3\right) , \end{aligned}$$
(31)

where (28) is from Jensen's inequality (all norms are convex) and the triangle inequality, (29) is from \(\left\| \varvec{x}\varvec{y}^{\top }\right\| _{} = \left\| \varvec{x}\right\| _{}\left\| \varvec{y}\right\| _{}\) for any two vectors \(\varvec{x}\) and \(\varvec{y}\), (30) is from (8) and (27), and the last inequality is from the smoothing-policy assumption. \(\square\)

3.3 Smooth performance

For a smoothing policy, the performance measure \(J(\varvec{\theta })\) is Lipschitz smooth with a smoothness constant that only depends on the smoothing constants, the reward magnitude, and the discount factor. This result is of independent interest as it can be used to establish convergence rates for policy gradient algorithms (Yuan et al., 2021).

Lemma 6

Given a \((\xi _1,\xi _2,\xi _3)\)-smoothing policy class \(\Pi _{\Theta }\), the performance measure \(J(\varvec{\theta })\) is L-smooth with the following smoothness constant:

$$\begin{aligned} L = \frac{R}{(1-\gamma )^2}\left( \frac{2\gamma \xi _1^2}{1-\gamma } + \xi _2+ \xi _3\right) . \end{aligned}$$
(32)

Proof

From Lemma 5, L is a bound on the spectral norm of the policy Hessian. By Proposition 2, this is a valid Lipschitz constant for the policy gradient, hence the performance measure is L-smooth. \(\square\)

The smoothness of the performance measure, in turn, yields the following property on the guaranteed performance improvement:

Theorem 7

Let \(\Pi _{\Theta }\) be a \((\xi _1,\xi _2,\xi _3)\)-smoothing policy class. For every \(\varvec{\theta }, \varvec{\theta }'\in \Theta\):

$$\begin{aligned} J(\varvec{\theta }') - J(\varvec{\theta }) \ge \left\langle \Delta \varvec{\theta },\nabla J(\varvec{\theta })\right\rangle - \frac{L}{2}\left\| \Delta \varvec{\theta }\right\| _{}^2, \end{aligned}$$

where \(\Delta \varvec{\theta }= \varvec{\theta }'-\varvec{\theta }\) and \(L = \frac{R}{(1-\gamma )^2}\left( \frac{2\gamma \xi _1^2}{1-\gamma } + \xi _2+ \xi _3\right)\).

Proof

It suffices to apply Proposition 3 with the Lipschitz constant from Lemma 6. \(\square\)

The smoothness constant L for Gaussian and Softmax policies is reported in Table 1.

In the following, we will exploit this property of smoothing policies to enforce safety guarantees on the policy updates performed by Algorithm 1, i.e., stochastic gradient ascent updates. However, Theorem 7 applies to any policy update \(\Delta \varvec{\theta }\in \mathbb {R}^d\) as long as \(\varvec{\theta }+\Delta \varvec{\theta }\in \Theta\).

Very recently, Yuan et al. (2021, Lemma 4.4) provided an improved smoothness constant for smoothing policies:

$$\begin{aligned} L^\star = \frac{R(\xi _2+\xi _3)}{(1-\gamma )^2}. \end{aligned}$$
(33)

This is a significant step forward since it improves the dependence on the effective horizon by a \((1-\gamma )^{-1}\) factor. In Table 1 we report explicit expressions for \(L^\star\) in the case of linear Gaussian and Softmax policies. We will use this improved smoothness constant in the numerical simulations of Sect. 7.
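For reference, once the smoothing constants, the reward bound R, and the discount factor \(\gamma\) are known, the smoothness constants of Lemma 6 and Eq. (33) are straightforward to compute, as in the following small helper (an illustration, not part of the original algorithmic contribution):

def smoothness_constant(xi1, xi2, xi3, R, gamma):
    # L from Lemma 6
    return R / (1 - gamma)**2 * (2 * gamma * xi1**2 / (1 - gamma) + xi2 + xi3)

def improved_smoothness_constant(xi2, xi3, R, gamma):
    # L* from Eq. (33) (Yuan et al., 2021)
    return R * (xi2 + xi3) / (1 - gamma)**2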

4 Optimal safe meta-parameters

In this section, we provide a step size for Algorithm 1 that maximizes a lower bound on the performance improvement for smoothing policies. This yields safety in the sense of Monotonic Improvement (MI), i.e., non-negative performance improvements at each policy update:

$$\begin{aligned} J(\varvec{\theta }_{k+1}) - J(\varvec{\theta }_{k}) \ge 0, \end{aligned}$$
(34)

at least with high probability.

In policy optimization, at each learning iteration k, we ideally want to find the policy update \(\Delta \varvec{\theta }\) that maximizes the new performance \(J(\varvec{\theta }_{k}+\Delta \varvec{\theta })\), or equivalently:

$$\begin{aligned} \max _{\Delta \varvec{\theta }} J(\varvec{\theta }_{k}+\Delta \varvec{\theta }) - J(\varvec{\theta }_k), \end{aligned}$$
(35)

since \(J(\varvec{\theta }_{k})\) is fixed. Unfortunately, the performance of the updated policy cannot be known in advance. For this reason, we replace the optimization objective in (35) with a lower bound, i.e., a guaranteed improvement. In particular, taking Algorithm 1 as our starting point, we maximize the guaranteed improvement of a policy gradient update (line 5) by selecting optimal meta-parameters. The solution of this meta-optimization problem provides a lower bound on the actual performance improvement. As long as this is always non-negative, MI is guaranteed.

4.1 Adaptive step size: exact framework

To decouple the pure optimization aspects of this problem from gradient estimation issues, we first consider an exact policy gradient update, i.e., \(\varvec{\theta }_{k+1}\leftarrow \varvec{\theta }_k + \alpha \nabla J(\varvec{\theta }_k)\), where we assume access to a first-order oracle, i.e., the ability to compute the exact policy gradient \(\nabla J(\varvec{\theta }_k)\). This assumption is clearly not realistic, and will be removed in Sect. 4.2. In this simplified framework, performance improvement can be guaranteed deterministically. Furthermore, the only relevant meta-parameter is the step size \(\alpha\) of the update. We first need a lower bound on the performance improvement \(J(\varvec{\theta }_{k+1}) - J(\varvec{\theta }_k)\). For a smoothing policy, we can use the following:

Theorem 8

Let \(\Pi _{\Theta }\) be a \((\xi _1,\xi _2,\xi _3)\)-smoothing policy class. Let \(\varvec{\theta }_k\in \Theta\) and \(\varvec{\theta }_{k+1} = \varvec{\theta }_k + \alpha \nabla J(\varvec{\theta }_k)\), where \(\alpha >0\). Provided \(\varvec{\theta }_{k+1}\in \Theta\), the performance improvement of \(\varvec{\theta }_{k+1}\) w.r.t. \(\varvec{\theta }_k\) can be lower bounded as follows:

$$\begin{aligned} J(\varvec{\theta }_{k+1}) - J(\varvec{\theta }_k) \ge \alpha \left\| \nabla J(\varvec{\theta }_k)\right\| _{}^2 - \alpha ^2\frac{L}{2}\left\| \nabla J(\varvec{\theta }_k)\right\| _{}^2 {:}{=}B(\alpha {;}\varvec{\theta }_k), \end{aligned}$$

where \(L = \frac{R}{(1-\gamma )^2}\left( \frac{2\gamma \xi _1^2}{1-\gamma } + \xi _2+ \xi _3\right)\).

Proof

This is a direct consequence of Theorem 7 with \(\Delta \varvec{\theta }=\alpha \nabla J(\varvec{\theta }_k)\). \(\square\)

This bound is in the typical form of performance improvement bounds (e.g., Kakade & Langford, 2002; Pirotta et al., 2013; Schulman et al., 2015; Cohen et al., 2018): a positive term accounting for the anticipated advantage of \(\varvec{\theta }_{k+1}\) over \(\varvec{\theta }_k\), and a penalty term accounting for the mismatch between the two policies, which makes the anticipated advantage less reliable. In our case, the mismatch is measured by the curvature of the performance measure w.r.t. the policy parameters, via the smoothness constant L. This lower bound is quadratic in \(\alpha\), hence we can easily find the optimal step size \(\alpha ^*\).

Corollary 9

Let \(B(\alpha {;}\varvec{\theta }_k)\) be the guaranteed performance improvement of an exact policy gradient update, as defined in Theorem 8. Under the same assumptions, \(B(\alpha {;}\varvec{\theta }_k)\) is maximized by the constant step size \(\alpha ^*=\frac{1}{L}\), which guarantees the following non-negative performance improvement:

$$\begin{aligned} J(\varvec{\theta }_{k+1}) - J(\varvec{\theta }_k) \ge \frac{\left\| \nabla J(\varvec{\theta }_k)\right\| _{}^2}{2L}. \end{aligned}$$

Proof

We just maximize \(B(\alpha {;} \varvec{\theta }_k)\) as a (quadratic) function of \(\alpha\). The global optimum \(B(\alpha ^*{;}\varvec{\theta }_k) = \frac{\left\| \nabla J(\varvec{\theta }_k)\right\| _{}^2}{2L}\) is attained by \(\alpha ^*=\frac{1}{L}\). The improvement guarantee follows from Theorem 8. \(\square\)

4.2 Adaptive step size: approximate framework

In practice, we cannot compute the exact gradient \(\nabla J(\varvec{\theta }_k)\), but only an estimate \(\widehat{\nabla }J(\varvec{\theta }{;}\mathcal {D})\) obtained from a finite dataset \(\mathcal {D}\) of trajectories. In this section, N denotes the fixed size of \(\mathcal {D}\). To find the optimal step size, we just need to adapt the performance-improvement lower bound of Theorem 8 to stochastic-gradient updates. Since sample trajectories are involved, this new lower bound will only hold with high probability. To establish statistical guarantees, we make the following assumption on how the (unbiased) gradient estimate concentrates around its expected value:

Assumption 1

For a fixed parameter \(\varvec{\theta }\in \Theta\), batch size \(N\in \mathbb {N}\) and failure probability \(\delta \in (0,1)\), with probability at least \(1-\delta\):

$$\begin{aligned} \left\| \widehat{\nabla }J(\varvec{\theta }{;}\mathcal {D})-\nabla J(\varvec{\theta })\right\| _{} \le \frac{\epsilon (\delta )}{\sqrt{N}}, \end{aligned}$$

where \(\mathcal {D}\) is a dataset of N i.i.d. trajectories collected with \(\pi _{\varvec{\theta }}\) and \(\epsilon :(0,1)\rightarrow \mathbb {R}\) is a known function.
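As a simple example of a function \(\epsilon\) satisfying Assumption 1 (not necessarily the bound adopted later in Sect. 5 or "Appendix C"), suppose the single-trajectory gradient estimate is known to have expected squared error norm at most V; V is an assumed constant here. Then Chebyshev's inequality applied to the N-trajectory average gives the following:

import math

def chebyshev_epsilon(V, delta):
    # Illustrative epsilon(delta) under an assumed per-trajectory variance bound V:
    # with probability at least 1 - delta,
    # ||grad_hat - grad|| <= chebyshev_epsilon(V, delta) / sqrt(N).
    return math.sqrt(V / delta)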

We will discuss how this assumption is satisfied in cases of interest in Sect. 5 and "Appendix C". Under the above assumption, we can adapt Theorem 8 to the stochastic-gradient case as follows:

Theorem 10

Let \(\Pi _{\Theta }\) be a \((\xi _1,\xi _2,\xi _3)\)-smoothing policy class. Let \(\varvec{\theta }_{k}\in \Theta \subseteq \mathbb {R}^d\) and \(\varvec{\theta }_{k+1} = \varvec{\theta }_{k} + \alpha \widehat{\nabla }J(\varvec{\theta }_{k}{;} \mathcal {D}_k)\), where \(\alpha \ge 0\), \(N=\vert \mathcal {D}_k\vert \ge 1\). Under Assumption 1, provided \(\varvec{\theta }_{k+1}\in \Theta\), the performance improvement of \(\varvec{\theta }_{k+1}\) w.r.t. \(\varvec{\theta }_{k}\) can be lower bounded, with probability at least \(1-\delta _k\), as follows:

$$\begin{aligned} \begin{aligned} J(\varvec{\theta }_{k+1}) - J(\varvec{\theta }_{k})&\ge \alpha \left( \left\| \widehat{\nabla }J(\varvec{\theta }_{k}{;} \mathcal {D}_k)\right\| _{}- \frac{\epsilon (\delta _k)}{\sqrt{N}}\right) \\ {}&\qquad \times \max \left\{ \left\| \widehat{\nabla }J(\varvec{\theta }_{k}{;} \mathcal {D}_k)\right\| _{}, \frac{\left\| \widehat{\nabla }J(\varvec{\theta }_{k}{;} \mathcal {D}_k)\right\| _{} + \frac{\epsilon (\delta _k)}{\sqrt{N}}}{2} \right\} \\&\qquad - \frac{\alpha ^2L}{2} \left\| \widehat{\nabla }J(\varvec{\theta }_{k}{;} \mathcal {D}_k)\right\| _{}^2{:}{=}\widetilde{B}_k(\alpha {;}N), \end{aligned} \end{aligned}$$

where \(L = \frac{R}{(1-\gamma )^2}\left( \frac{2\gamma \xi _1^2}{1-\gamma } + \xi _2+ \xi _3\right)\).

Proof

Consider the good event \(E_k=\left\{ \left\| \widehat{\nabla }J(\varvec{\theta }{;}\mathcal {D})-\nabla J(\varvec{\theta })\right\| _{} \le {\epsilon (\delta _k)}/{\sqrt{N}}\right\}\). By Assumption 1, \(E_k\) holds with probability at least \(1-\delta _k\). For the rest of the proof, we will assume \(E_k\) holds.

Let \(\epsilon _k{:}{=}\epsilon (\delta _k)/\sqrt{N}\) for short. Under \(E_k\), by the triangle inequality:

$$\begin{aligned} \left\| \nabla J(\varvec{\theta }_{k})\right\| _{}&\ge \left\| \widehat{\nabla }J(\varvec{\theta }_{k}{;} \mathcal {D}_k)\right\| _{} - \left\| \nabla J(\varvec{\theta }_{k}) - \widehat{\nabla }J(\varvec{\theta }_{k}{;} \mathcal {D}_k)\right\| _{} \nonumber \\&\ge \left\| \widehat{\nabla }J(\varvec{\theta }_{k}{;} \mathcal {D}_k)\right\| _{} - \epsilon _k, \end{aligned}$$
(36)

thus:

$$\begin{aligned} \left\| \nabla J(\varvec{\theta }_{k})\right\| _{}^2 \ge \max \left\{ \left\| \widehat{\nabla }J(\varvec{\theta }_{k}{;} \mathcal {D}_k)\right\| _{} - \epsilon _k, 0\right\} ^2. \end{aligned}$$
(37)

Then, by the polarization identity:

$$\begin{aligned}&\left\langle \widehat{\nabla }J(\varvec{\theta }_{k}{;} \mathcal {D}_k),\nabla J(\varvec{\theta }_{k})\right\rangle = \frac{1}{2}\left( \left\| \widehat{\nabla }J(\varvec{\theta }_{k}{;} \mathcal {D}_k)\right\| _{}^2 + \left\| \nabla J(\varvec{\theta }_{k})\right\| _{}^2 \right. \\&\qquad \qquad \qquad \qquad \qquad \qquad \left. - \left\| \nabla J(\varvec{\theta }_{k}) - \widehat{\nabla }J(\varvec{\theta }_{k}{;} \mathcal {D}_k)\right\| _{}^2\right) \\&\qquad \qquad \ge \frac{1}{2}\left( \left\| \widehat{\nabla }J(\varvec{\theta }_{k}{;} \mathcal {D}_k)\right\| _{}^2 + \max \left\{ \left\| \widehat{\nabla }J(\varvec{\theta }_{k}{;} \mathcal {D}_k)\right\| _{} - \epsilon _k, 0\right\} ^2 - \epsilon _k^2 \right) , \end{aligned}$$

where the latter inequality is from (37). We first consider the case in which \(\left\| \widehat{\nabla }J(\varvec{\theta }_{k}{;} \mathcal {D}_k)\right\| _{} > \epsilon _k\):

$$\begin{aligned} \left\langle \widehat{\nabla }J(\varvec{\theta }_{k}{;} \mathcal {D}_k),\nabla J(\varvec{\theta }_{k})\right\rangle&\ge \frac{1}{2}\left( \left\| \widehat{\nabla }J(\varvec{\theta }_{k}{;} \mathcal {D}_k)\right\| _{}^2 + \left( \left\| \widehat{\nabla }J(\varvec{\theta }_{k}{;} \mathcal {D}_k)\right\| _{} - \epsilon _k\right) ^2 - \epsilon ^2_k \right) \nonumber \\&= \left( \left\| \widehat{\nabla }J(\varvec{\theta }_{k}{;} \mathcal {D}_k)\right\| _{} - \epsilon _k\right) \left\| \widehat{\nabla }J(\varvec{\theta }_{k}{;} \mathcal {D}_k)\right\| _{}. \end{aligned}$$
(38)

Then, we consider the case in which \(\left\| \widehat{\nabla }J(\varvec{\theta }_{k}{;} \mathcal {D}_k)\right\| _{} \le \epsilon _k\):

$$\begin{aligned} \left\langle \widehat{\nabla }J(\varvec{\theta }_{k}{;} \mathcal {D}_k),\nabla J(\varvec{\theta }_{k})\right\rangle&\ge \frac{1}{2}\left( \left\| \widehat{\nabla }J(\varvec{\theta }_{k}{;} \mathcal {D}_k)\right\| _{}^2 - \epsilon ^2_k \right) \end{aligned}$$
(39)
$$\begin{aligned}&=\left( \left\| \widehat{\nabla }J(\varvec{\theta }_{k}{;} \mathcal {D}_k)\right\| _{} - \epsilon _k\right) \frac{\left\| \widehat{\nabla }J(\varvec{\theta }_{k}{;} \mathcal {D}_k)\right\| _{} + \epsilon _k}{2}. \end{aligned}$$
(40)

The two cases can be unified as follows:

$$\begin{aligned}&\left\langle \widehat{\nabla }J(\varvec{\theta }_{k}{;} \mathcal {D}_k),\nabla J(\varvec{\theta }_{k})\right\rangle \ge \left( \left\| \widehat{\nabla }J(\varvec{\theta }_{k}{;} \mathcal {D}_k)\right\| _{}- \epsilon _k\right) \nonumber \\ {}&\qquad \qquad \qquad \qquad \qquad \qquad \times \max \left\{ \left\| \widehat{\nabla }J(\varvec{\theta }_{k}{;} \mathcal {D}_k)\right\| _{}, \frac{\left\| \widehat{\nabla }J(\varvec{\theta }_{k}{;} \mathcal {D}_k)\right\| _{} + \epsilon _k}{2} \right\} . \end{aligned}$$
(41)

From Theorem 7 with \(\Delta \varvec{\theta }=\alpha \widehat{\nabla }J(\varvec{\theta }_{k}{;} \mathcal {D}_k)\) we obtain:

$$\begin{aligned} J(\varvec{\theta }_{k+1}) - J(\varvec{\theta }_{k})&\ge \left\langle \varvec{\theta }_{k+1}-\varvec{\theta }_{k}, \nabla J(\varvec{\theta }_{k})\right\rangle - \frac{L}{2}\left\| \varvec{\theta }_{k+1}-\varvec{\theta }_{k}\right\| _{}^2 \nonumber \\&= \alpha \left\langle \widehat{\nabla }J(\varvec{\theta }_{k}{;} \mathcal {D}_k), \nabla J(\varvec{\theta }_{k})\right\rangle - \frac{\alpha ^2L}{2}\left\| \widehat{\nabla }J(\varvec{\theta }_{k}{;} \mathcal {D}_k)\right\| _{}^2 \nonumber \\&\ge \alpha \left( \left\| \widehat{\nabla }J(\varvec{\theta }_{k}{;} \mathcal {D}_k)\right\| _{}- \epsilon _k\right) \nonumber \\ {}&\quad \times \max \left\{ \left\| \widehat{\nabla }J(\varvec{\theta }_{k}{;} \mathcal {D}_k)\right\| _{}, \frac{\left\| \widehat{\nabla }J(\varvec{\theta }_{k}{;} \mathcal {D}_k)\right\| _{} + \epsilon _k}{2} \right\} \nonumber \\ {}&\qquad - \frac{\alpha ^2L}{2}\left\| \widehat{\nabla }J(\varvec{\theta }_{k}{;} \mathcal {D}_k)\right\| _{}^2, \end{aligned}$$
(42)

where the last inequality is from (41). \(\square\)

From Theorem 10 we can easily obtain an optimal step size, as done in the exact setting, provided the batch size is sufficiently large:

Corollary 11

Let \(\widetilde{B}_k(\alpha {;}N)\) be the guaranteed performance improvement of a stochastic policy gradient update, as defined in Theorem 10. Under the same assumptions, provided the batch size satisfies:

$$\begin{aligned} N\ge \frac{\epsilon ^2(\delta _k)}{\left\| \widehat{\nabla }J(\varvec{\theta }_k{;} \mathcal {D}_k)\right\| _{}^2}, \end{aligned}$$
(43)

\(\widetilde{B}_k(\alpha {;}N)\) is maximized by the following adaptive step size:

$$\begin{aligned} \alpha ^*_k = \frac{1}{L}\left( 1 - \frac{\epsilon (\delta _k)}{\sqrt{N}\left\| \widehat{\nabla }J(\varvec{\theta }_k{;} \mathcal {D}_k)\right\| _{}} \right) , \end{aligned}$$
(44)

which guarantees, with probability at least \(1-\delta _k\), the following non-negative performance improvement:

$$\begin{aligned} J(\varvec{\theta }_{k+1}) - J(\varvec{\theta }_k) \ge \frac{\left( \left\| \widehat{\nabla }J(\varvec{\theta }_{k}{;} \mathcal {D}_k)\right\| _{} - \frac{\epsilon (\delta _k)}{\sqrt{N}} \right) ^2}{2L}. \end{aligned}$$
(45)

Proof

Let \(N_0=\epsilon ^2(\delta _k)\left\| \widehat{\nabla }J(\varvec{\theta }_k{;} \mathcal {D}_k)\right\| _{}^{-2}\). When \(N\le N_0\), the second argument of the \(\max\) operator in (41) is selected. In this case, no positive improvement can be guaranteed and the optimal non-negative step size is \(\alpha =0\). Thus, we focus on the case \(N>N_0\). In this case, the first argument of the \(\max\) operator is selected. Optimizing \(\widetilde{B}_k(\alpha {;}N)\) as a function of \(\alpha\) alone, which is again quadratic, yields (44) as the optimal step size and (45) as the maximum guaranteed improvement. \(\square\)

In this case, the optimal step size is adaptive, i.e., time-varying and data-dependent. The constant, optimal step size for the exact case (Corollary 9) is recovered in the limit of infinite data, i.e., \(N\rightarrow \infty\). In the following we discuss why this adaptive step size should not be used in practice, and propose an alternative solution.
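In code, the safe step size of Corollary 11 amounts to the following check and formula (a sketch only; grad_norm is the norm of the estimated gradient, epsilon_delta is \(\epsilon (\delta _k)\), and a zero step size is returned when condition (43) fails, in line with the proof above):

import math

def safe_step_size(grad_norm, epsilon_delta, N, L):
    # Adaptive step size of Eq. (44); returns 0 when condition (43) is violated,
    # i.e., when no positive improvement can be guaranteed.
    threshold = epsilon_delta / math.sqrt(N)
    if grad_norm <= threshold:
        return 0.0
    return (1.0 - threshold / grad_norm) / L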

4.3 Adaptive batch size

The safe step size from Corollary 11 requires the batch size to be large enough. As soon as the condition (43) fails to hold, the user is left with the decision whether to interrupt the learning process or collect more data — an undesirable property for a fully autonomous system. To avoid this, a large batch size must be selected from the start, which results in a pointless waste of data in the early learning iterations. Even so, Eq. (43), used as a stopping condition, would be susceptible to random oscillations of the stochastic gradient magnitude, interrupting the learning process prematurely.

As observed by Papini et al. (2017), also controlling the batch size N of the gradient estimate can be advantageous. Intuitively, a larger batch size yields a more reliable estimate, which in turn allows a safer policy gradient update. The larger the batch size, the higher the guaranteed improvement, which would lead to selecting the highest possible value of N. However, we must take into account the cost of collecting the trajectories, which is non-negligible in real-world problems (e.g., robotics). For this reason, we would like the meta-parameters to maximize the per-trajectory performance improvement:

$$\begin{aligned} \alpha _k,N_k = \arg \max _{\alpha ,N} \frac{J(\varvec{\theta }_k+\alpha \widehat{\nabla }J(\varvec{\theta }_k{;}\mathcal {D})) - J(\varvec{\theta }_k)}{N}, \end{aligned}$$
(46)

where \(\mathcal {D}\) is a dataset of N i.i.d. trajectories sampled with \(\pi _{\varvec{\theta }_k}\). We can then use the lower bound from Theorem 10 to find the jointly optimal safe step size and batch size, similarly to what was done by Papini et al. (2017) for the special case of Gaussian policies:

Corollary 12

Let \(\widetilde{B}_k(\alpha {;}N)\) be the lower bound on the performance improvement of a stochastic policy gradient update, as defined in Theorem 10. Under the same assumptions, the continuous relaxation of \({\widetilde{B}_k(\alpha {;}N)}/{N}\) is maximized by the following step size \(\alpha ^*\) and batch size \(N_k^*\):

$$\begin{aligned} {\left\{ \begin{array}{ll} &{} \alpha ^* =\frac{1}{2L}\\ &{} N^*_k = \frac{4\epsilon ^2(\delta _k)}{\left\| \widehat{\nabla }J(\varvec{\theta }_{k}{;}\mathcal {D}_k)\right\| _{}^2}. \end{array}\right. } \end{aligned}$$
(47)

Using \(\alpha ^*\) and \(\lceil N^*_k\rceil\) in the stochastic gradient ascent update guarantees, with probability at least \(1-\delta _k\), the following non-negative performance improvement:

$$\begin{aligned} J(\varvec{\theta }_{k+1}) - J(\varvec{\theta }_k) \ge \frac{\left\| \widehat{\nabla }J(\varvec{\theta }_k{;}\mathcal {D}_k)\right\| _{}^2}{8L}. \end{aligned}$$
(48)

Proof

Fix k and let \(\Upsilon (\alpha ,N)=\widetilde{B}_k(\alpha {;}N)/N\) and \(N_0=\epsilon ^2(\delta _k)\big /\left\| \widehat{\nabla }J(\varvec{\theta }_k{;} \mathcal {D}_k)\right\| _{}^2\). We consider the continuous relaxation of \(\Upsilon (\alpha ,N)\), where N can be any positive real number. For \(N\ge N_0\), the first argument of the \(\max\) operator in (36) can be selected. Note that the second argument is always a valid choice, since it is a lower bound on the first one for every \(N\ge 1\). Thus, we separately solve the following constrained optimization problems:

$$\begin{aligned}&{\left\{ \begin{array}{ll} &{}\max _{\alpha ,N}\frac{1}{N}\left( \alpha \left\| \widehat{\nabla }J(\varvec{\theta }_k{;} \mathcal {D}_k)\right\| _{}\left( \left\| \widehat{\nabla }J(\varvec{\theta }_k{;} \mathcal {D}_k)\right\| _{}-\frac{\epsilon (\delta _k)}{\sqrt{N}}\right) \right. \\ {} &{}\qquad \qquad \qquad \left. - \alpha ^2\frac{L}{2}\left\| \widehat{\nabla }J(\varvec{\theta }_k{;} \mathcal {D}_k)\right\| _{}^2\right) \\ &{}\text {s.t.} \quad \alpha \ge 0, \\ &{}{\text {s.t.}} \quad N > \frac{\epsilon ^2(\delta _k)}{\left\| \widehat{\nabla }J(\varvec{\theta }_k{;} \mathcal {D}_k)\right\| _{}^2}, \end{array}\right. } \end{aligned}$$
(49)

and:

$$\begin{aligned} &{\left\{ \begin{array}{ll} &{}\max _{\alpha ,N}\frac{1}{N}\left( \frac{\alpha }{2}\left( \left\| \widehat{\nabla }J(\varvec{\theta }_k{;} \mathcal {D}_k)\right\| _{}^2-\frac{\epsilon ^2(\delta _k)}{N}\right) - \alpha ^2\frac{L}{2}\left\| \widehat{\nabla }J(\varvec{\theta }_k{;} \mathcal {D}_k)\right\| _{}^2\right) \\ &{}\text {s.t.} \quad \alpha \ge 0, \\ &{}{\text {s.t.}} \quad N > 0. \end{array}\right. } \end{aligned}$$
(50)

Both problems can be solved in closed form using KKT conditions. The first one (49) yields \(\Upsilon ^* = \left\| \widehat{\nabla }J(\varvec{\theta }_k{;} \mathcal {D}_k)\right\| _{}^4\big /\left( 32L\epsilon ^2(\delta _k)\right)\) with the values of \(\alpha ^*\) and \(N^*_k\) given in (47). The second one (50) yields a worse optimum \(\Upsilon ^* = \left\| \widehat{\nabla }J(\varvec{\theta }_k{;} \mathcal {D}_k)\right\| _{}^4\big /\left( 54L\epsilon ^2(\delta _k)\right)\) with \(\alpha =\frac{1}{3L}\) and \(N=3\epsilon ^2(\delta _k)\big /\left\| \widehat{\nabla }J(\varvec{\theta }_k{;} \mathcal {D}_k)\right\| _{}^2\). Hence, we keep the first solution. From Theorem 10, using \(\alpha ^*\) and \(N^*_k\) would guarantee \(J(\varvec{\theta }_{k+1}) - J(\varvec{\theta }_k) \ge \left\| \widehat{\nabla }J(\varvec{\theta }_k{;}\mathcal {D}_k)\right\| _{}^2\big /\left( 8L\right)\). Of course, only integer batch sizes can be used. However, for \(N\ge N_0\), the right-hand side of (36) is monotonically increasing in N. Since \(N_k^*\ge N_0\) and \(\lceil N_k^*\rceil \ge N_k^*\), the guarantee (48) is still valid when \(\alpha ^*\) and \(\lceil N_k^*\rceil\) are employed in the stochastic gradient ascent update. \(\square\)

In this case, the optimal step size is constant, and is exactly half of the optimal step size for the exact case (Corollary 9). In turn, the batch size is adaptive: when the norm of the (estimated) gradient is small, a large batch size is selected. Intuitively, this allows us to counteract the variance of the estimator, which is large relative to the gradient magnitude. One may worry about the circular dependence of \(N_k^*\) on the dataset \(\mathcal {D}_k\), whose size it is meant to prescribe. We will overcome this issue in the next section.
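To make the prescription of Corollary 12 concrete, the following minimal Python sketch evaluates the constant step size and the adaptive batch size of Eq. (47) from an estimated gradient; the inputs `smoothness_L` and `eps_delta` stand in for the problem-dependent quantities of Tables 1 and 2, and the numbers in the usage example are purely illustrative.

```python
import math
import numpy as np

def safe_meta_parameters(grad_estimate, smoothness_L, eps_delta):
    """Safe meta-parameters of Corollary 12, Eq. (47).

    grad_estimate : estimated policy gradient (1-D array)
    smoothness_L  : smoothness constant L of the objective (Table 1)
    eps_delta     : concentration bound eps(delta_k) of the estimator (Table 2)
    """
    alpha = 1.0 / (2.0 * smoothness_L)               # constant safe step size
    grad_norm_sq = float(np.dot(grad_estimate, grad_estimate))
    n_star = 4.0 * eps_delta ** 2 / grad_norm_sq     # adaptive batch size
    return alpha, math.ceil(n_star)

# Purely illustrative numbers (hypothetical gradient, L and eps values).
g = np.array([0.3, -0.1])
alpha, batch_size = safe_meta_parameters(g, smoothness_L=200.0, eps_delta=5.0)
improvement_bound = float(np.dot(g, g)) / (8.0 * 200.0)   # guarantee of Eq. (48)
print(alpha, batch_size, improvement_bound)
```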

5 Algorithm

In this section, we leverage the theoretical results of the previous sections to design a reinforcement learning algorithm with monotonic improvement guarantees. For the reasons discussed above, we adopt the adaptive-batch-size approach from Sect. 4.3.

Corollary 12 provides a constant step size \(\alpha ^*\) and a schedule for the batch size \((\lceil N_k^*\rceil )_{k\ge 1}\) that jointly maximize per-trajectory performance improvement under a monotonic-improvement constraint. Plugging these meta-parameters into Algorithm 1, we could obtain a safe policy gradient algorithm. Unfortunately, the closed-form expression for \(N_k^*\) provided in (47) cannot be used directly. We must compute the batch size before collecting the batch of trajectories \(\mathcal {D}_k\), but \(N_k^*\) depends on \(\mathcal {D}_k\) itself. To overcome this issue, we collect trajectories in an incremental fashion until the optimal batch size is achieved. We call this algorithm Safe Policy Gradient (SPG), outlined in Algorithm 2. The user specifies the failure probability \(\delta _k\) for each iteration k, while the smoothness constant L and the concentration bound \(\epsilon :(0,1)\rightarrow \mathbb {R}\) can be computed depending on the policy class and the gradient estimator (see Tables 1 and 2).

[Algorithm 2: Safe Policy Gradient (SPG)]
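As an illustration of the incremental data-collection scheme, here is a minimal sketch of one outer iteration of SPG. It is not the reference implementation of Algorithm 2: `collect_trajectory` and `grad_estimate` are hypothetical placeholders for the trajectory sampler and the policy gradient estimator, \(\epsilon (\delta )\) follows the form of Assumption 2, and the per-trajectory confidence schedule \(\delta _{k,i}=\delta _k/(i(i+1))\) is the one used in the analysis below.

```python
import math
import numpy as np

def eps_bound(delta, d, C):
    """Concentration bound of Assumption 2: eps(delta) = C * sqrt(d * log(6 / delta))."""
    return C * math.sqrt(d * math.log(6.0 / delta))

def spg_iteration(theta, collect_trajectory, grad_estimate,
                  smoothness_L, delta_k, C, max_trajectories=100_000):
    """One outer iteration of SPG (sketch): trajectories are collected one at a
    time until the stopping condition i >= 4 * eps(delta_{k,i})^2 / ||g||^2 of
    Eq. (65) holds, then the policy is updated with the safe step size 1/(2L)."""
    d = theta.size
    trajectories, g = [], np.zeros(d)
    for i in range(1, max_trajectories + 1):
        trajectories.append(collect_trajectory(theta))    # grow the batch by one
        g = grad_estimate(theta, trajectories)            # estimate from all i trajectories
        delta_ki = delta_k / (i * (i + 1))                # per-trajectory confidence schedule
        norm_sq = float(np.dot(g, g))
        if norm_sq > 0 and i >= 4.0 * eps_bound(delta_ki, d, C) ** 2 / norm_sq:
            break
    alpha = 1.0 / (2.0 * smoothness_L)                    # constant safe step size (Corollary 12)
    return theta + alpha * g, len(trajectories)
```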

We can study the data-collecting process of SPG as a stopping problem. Fix an iteration k and let \(\mathcal {F}_{k,i}=\sigma (\{\tau _{k,1},\dots ,\tau _{k,i-1}\})\) be the sigma-algebra generated by the first i trajectories collected at that iteration. Let \(\mathop {\mathbb {E}}_{i}[X]\) be short for \(\mathop {\mathbb {E}}[X\vert \mathcal {F}_{i-1}]\).Footnote 15 In Sects. 4 and 4.3 we assumed the Euclidean norm of the gradient estimation error to be bounded by \(\epsilon (\delta )/\sqrt{N}\) with probability \(1-\delta\) for some function \(\epsilon :(0,1)\rightarrow \mathbb {R}_+\). For Algorithm 2 to be well-behaved, we need gradient estimates to concentrate exponentially, which translates into the following, stronger assumption:

Assumption 2

Fix a parameter \(\varvec{\theta }\in \Theta\), a batch size \(N\in \mathbb {N}\), and a failure probability \(\delta \in (0,1)\). Then, with probability at least \(1-\delta\):

$$\begin{aligned} \left\| \widehat{\nabla }J(\varvec{\theta }{;}\mathcal {D})-\nabla J(\varvec{\theta })\right\| _{} \le \frac{\epsilon (\delta )}{\sqrt{N}}, \end{aligned}$$

where \(\mathcal {D}\) is a dataset of N i.i.d. trajectories collected with \(\pi _{\varvec{\theta }}\) and \(\epsilon (\delta )=C\sqrt{d\log (6/\delta )}\) for some problem-dependent constant C that is independent of \(\delta\), d, and N.

This is satisfied by REINFORCE/G(PO)MDP with Softmax and Gaussian policies, as shown in "Appendix  C". In Table 2 we summarize the value of the error bound \(\epsilon (\delta )\) to be used in the different scenarios. Equipped with this exponential tail bound we can prove that, at any given (outer) iteration of SPG, the data-collecting process (inner loop) terminates:

Lemma 13

Fix an iteration k of Algorithm 2 and let \(N_k\) be the number of trajectories that are collected at that iteration. Under Assumption 2, provided \(\left\| \nabla J(\varvec{\theta }_k)\right\| _{}>0\), we have \(\mathop {\mathbb {E}}[N_k]<\infty\).

Proof

First, note that \(N_k\) is a stopping time w.r.t. the filtration \((\mathcal {F}_{k,i})_{i\ge 1}\). Consider the event \(E_{k,i}=\big \{\left\| g_{k,i}-\nabla J(\varvec{\theta }_k)\right\| _{}\le {\epsilon (\delta _{k,i})}/{\sqrt{i}}\big \}\). By Assumption 2, \(\mathbb {P}(\lnot E_{k,i})\le \delta _{k,i}\). This allows us to upper-bound the expectation of \(N_k\) as follows:

$$\begin{aligned} \mathop {\mathbb {E}}[N_k]&\le \mathop {\mathbb {E}}\left[ \sum _{i=1}^\infty \mathbb {I}\left( \sqrt{i}<\frac{2\epsilon (\delta _{k,i})}{\left\| g_{k,i}\right\| _{}}\right) \right] \end{aligned}$$
(51)
$$\begin{aligned}&=\mathop {\mathbb {E}}\left[ \sum _{i=1}^\infty \mathbb {I}\left( \sqrt{i}<\frac{2\epsilon (\delta _{k,i})}{\left\| g_{k,i}\right\| _{}}, E_{k,i}\right) \right] + \mathop {\mathbb {E}}\left[ \sum _{i=1}^\infty \mathbb {I}\left( \sqrt{i}<\frac{2\epsilon (\delta _{k,i})}{\left\| g_{k,i}\right\| _{}}, \lnot E_{k,i}\right) \right] \end{aligned}$$
(52)
$$\begin{aligned}&\le \sum _{i=1}^\infty \mathbb {I}\left( \sqrt{i}<\frac{2\epsilon (\delta _{k,i})}{\left\| \nabla J(\varvec{\theta }_k)\right\| _{}-\epsilon (\delta _{k,i})/\sqrt{i}}\right) + \sum _{i=1}^\infty \mathbb {P}(\lnot E_{k,i}) \end{aligned}$$
(53)
$$\begin{aligned}&\le \min _{i\ge 1}\left\{ \sqrt{i}\ge \frac{2\epsilon (\delta _{k,i})}{\left\| \nabla J(\varvec{\theta }_k)\right\| _{}-\epsilon (\delta _{k,i})/\sqrt{i}}\right\} + \sum _{i=1}^\infty \delta _{k,i} \end{aligned}$$
(54)
$$\begin{aligned}&\le \min _{i\ge 1}\left\{ \left\| \nabla J(\varvec{\theta }_k)\right\| _{}\sqrt{i} \ge 3\epsilon (\delta _{k,i})\right\} + \delta _k\sum _{i=1}^\infty \frac{1}{i(i+1)} \end{aligned}$$
(55)
$$\begin{aligned}&\le \min _{i\ge 1}\left\{ \left\| \nabla J(\varvec{\theta }_k)\right\| _{}\sqrt{i} \ge 3C\sqrt{d\log (6i(i+1)/\delta _k)}\right\} + 1 \end{aligned}$$
(56)
$$\begin{aligned}&\le \min _{i\ge 1}\left\{ \left\| \nabla J(\varvec{\theta }_k)\right\| _{}^2i \ge 18C^2d\log (6i/\delta _k)\right\} + 1 \end{aligned}$$
(57)
$$\begin{aligned}&\le \left\lceil \frac{36C^2d}{\left\| \nabla J(\varvec{\theta }_k)\right\| _{}^2}\log \frac{108C^2d}{\left\| \nabla J(\varvec{\theta }_k)\right\| _{}^2\delta _k}\right\rceil +1, \end{aligned}$$
(58)

where (56) is by Assumption 2 and the last inequality is by Lemma 21 assuming \(\left\| \nabla J(\varvec{\theta }_k)\right\| _{}\le C\). If the latter is not true, we still get:

$$\begin{aligned} \mathop {\mathbb {E}}[N_k]&\le \min _{i\ge 1}\left\{ \left\| \nabla J(\varvec{\theta }_k)\right\| _{}^2i \ge 18C^2d\log (6i/\delta _k)\right\} + 1 \end{aligned}$$
(59)
$$\begin{aligned}&\le \min _{i\ge 1}\left\{ i \ge 18d\log (6i/\delta _k)\right\} + 1 \end{aligned}$$
(60)
$$\begin{aligned}&\le \lceil 36d\log (108d/\delta _k)\rceil +1. \end{aligned}$$
(61)

\(\square\)

Table 2 Gradient estimation error bound \(\epsilon (\delta )\) for Gaussian and Softmax policies using REINFORCE (RE.), GPOMDP (GP.), or the random-horizon estimator discussed in "Appendix  C.2" (RH.) as gradient estimator, where d is the dimension of the policy parameter, \(M\) is an upper bound on the max norm of the feature function, \(R\) is the maximum absolute-valued reward, \(\gamma\) is the discount factor, T is the task horizon, \(\sigma\) is the standard deviation of the Gaussian policy and \(\tau\) is the temperature of the Softmax policy

We can now prove that the policy updates of SPG are safe.

Theorem 14

Consider Algorithm 2 applied to a smoothing policy, where \(\widehat{\nabla }J\) is an unbiased policy gradient estimator. Under Assumption 2, for any iteration \({k\ge 1}\), provided \(\nabla J(\varvec{\theta }_k)\ne 0\), with probability at least \(1-\delta _k\):

$$\begin{aligned} J(\varvec{\theta }_{k+1}) - J(\varvec{\theta }_k) \ge \frac{\left\| \widehat{\nabla }J(\varvec{\theta }_k{;}\mathcal {D}_k)\right\| _{}^2}{8L}\ge 0. \end{aligned}$$

Proof

Fix an (outer) iteration k of Algorithm 2 and let \(g_{k,i}=\widehat{\nabla }J(\varvec{\theta }_k{;}\mathcal {D}_{k,i})\) for short. Using an unbiased policy gradient estimator we ensure \(\mathop {\mathbb {E}}_i[g_{k,i}-\nabla J(\varvec{\theta }_k)]=0\), so \(X_i=g_{k,i}-\nabla J(\varvec{\theta }_k)\) is a martingale difference sequence adapted to \((\mathcal {F}_{k,i})_{i\ge 1}\). We use an optional stopping argument to show that \(g_{k,N_k}\) is an unbiased policy gradient estimate. Lemma 13 shows that \(N_k\) is a stopping time w.r.t. the filtration \((\mathcal {F}_{k,i})_{i\ge 1}\) that is finite in expectation. Furthermore, by Assumption 2, integrating the tail:

$$\begin{aligned} {\mathop {\mathbb {E}}}_i[\left\| X_i\right\| _{}]&= \int _0^\infty \mathbb {P}\left( \left\| X_i\right\| _{}> x\vert \mathcal {F}_{k,i}\right) \,\mathrm {d}x \end{aligned}$$
(62)
$$\begin{aligned}&\le 6 \int _0^\infty \exp (-x^2i/(C^2d)) \,\mathrm {d}x \end{aligned}$$
(63)
$$\begin{aligned}&\le 6C\sqrt{\frac{\pi d}{4i}} \le 6C\sqrt{\frac{\pi d}{4}}&\hbox { for all}\ i\ge 1. \end{aligned}$$
(64)

Hence, by optional stopping (Lemma 22), \(\mathop {\mathbb {E}}[X_{N_k}]=0\). Since \(X_{N_k}=\widehat{\nabla }J(\varvec{\theta }_k{;}\mathcal {D}_k)-\nabla J(\varvec{\theta }_k)\), we have \(\mathop {\mathbb {E}}[\widehat{\nabla }J(\varvec{\theta }_k{;}\mathcal {D}_k)]=\nabla J(\varvec{\theta }_k)\). This shows that the policy update of Algorithm 2 is an unbiased policy-gradient update. By the stopping condition:

$$\begin{aligned} N_k \ge \frac{4\epsilon ^2(\delta _{k,N_k})}{\left\| \widehat{\nabla }J(\varvec{\theta }_k{;}\mathcal {D}_k)\right\| _{}^2}. \end{aligned}$$
(65)

Now consider the following good event:

$$\begin{aligned} E_k = \left\{ \forall i\ge 1: \left\| g_{k,i}-\nabla J(\varvec{\theta }_k)\right\| _{}\le \epsilon (\delta _{k,i})/\sqrt{i}\right\} . \end{aligned}$$
(66)

Under Assumption 2, by union bound:

$$\begin{aligned} \mathbb {P}\left( \lnot E_k\right) \le \sum _{i=1}^\infty \delta _{k,i} = \sum _{i=1}^\infty \frac{\delta _k}{i(i+1)} = \delta _k. \end{aligned}$$
(67)

So \(E_k\) holds with probability at least \(1-\delta _k\). Under \(E_k\), the performance improvement guarantee is by Corollary 12, Eq. (65), and the choice of the step size \(\alpha\). \(\square\)

We have shown that the policy updates of SPG are safe with probability \(1-\delta _k\), where the failure probability \(\delta _k\) can be specified by the user for each iteration k. Typically, one would like to ensure monotonic improvement for the whole duration of the learning process. This can be achieved by appropriate confidence schedules. If the number of updates K is fixed a priori, \(\delta _k=\delta /K\) guarantees monotonic improvement with probability \(1-\delta\). The same can be obtained by using an adaptive confidence schedule \(\delta _k=\frac{\delta }{k(k+1)}\), even when the number of updates is not known in advance. Both results are easily shown by taking a union bound over \(k\ge 1\). Notice that an exponential tail bound like the one from Assumption 2 is fundamental for the batch size to depend only logarithmically on the number of policy updates.
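For instance, a one-line check of the adaptive confidence schedule (plain arithmetic, no assumptions beyond the schedule itself):

```python
# Adaptive confidence schedule delta_k = delta / (k * (k + 1)): the per-update
# failure probabilities telescope, sum_{k>=1} 1/(k(k+1)) = 1, so a union bound
# over all updates gives an overall failure probability of at most delta.
delta = 0.05
print(sum(delta / (k * (k + 1)) for k in range(1, 100_000)))  # approaches 0.05
```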

5.1 Towards a practical algorithm

The version of SPG we have just analyzed is very conservative. The price for guaranteeing monotonic improvement is slow convergence, even in small problems (see Sect. 7.1 for an example). In this section, we discuss possible variants and generalizations of Algorithm 2 aimed at the development of a more practical method. In doing so, we still stay faithful to the principle of satisfying the safety requirement specified by the user with no compromises. We just list the changes here. See "Appendix  E" for a more rigorous discussion.

Improved smoothness constant

As mentioned in Sect. 3.3, we can use the improved smoothness constant by Yuan et al. (2021), denoted \(L^\star\) in the following, which has a better dependence on the effective horizon. This yields a larger step size with the same theoretical guarantees, and allows us to tackle problems with longer horizons in practice.

Mini-batches

In the inner loop of Algorithm 2, instead of just one trajectory at a time, we can collect mini-batches of n independent trajectories. For instance, \(n\ge 2\) is required to employ the variance-reducing baselines discussed in Sect. 2. Moreover, a carefully picked mini-batch size n can make the early gradient estimates more stable, leading to an earlier stopping of the inner loop and a smaller batch size \(N_k\). We leave the investigation of the optimal value of n to future work.

Largest safe step size

The meta-parameters of Algorithm 2 were selected to maximize a lower bound on the per-trajectory performance improvement. Although we believe this is the most theoretically justified choice, we could gain some convergence speed by using a larger step size. From Theorem 10, it is easy to check that \(\alpha =1/L\) is the largest constant step size we can use with our choice of adaptive batch size from Algorithm 2. We leave the investigation of alternative safe combinations of batch size and (possibly adaptive) step size to future work.

Empirical Bernstein bound

The stopping condition of Algorithm 2 (line 11) is based on a Hoeffding-style bound on the gradient estimation error. In the case of policies with bounded score function, such as Softmax policies (see "Appendix  B.2"), we can use an empirical Bernstein bound instead (Maurer & Pontil, 2009). This requires some modifications to the algorithm, but yields a smaller adaptive batch size with the same safety guarantees. See "Appendix  E" for details. Unfortunately, we cannot use the empirical Bernstein bound with the Gaussian policy because of its unbounded score function (see "Appendix  B.1").
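As a point of reference, the following sketch computes an empirical Bernstein confidence radius for the mean of a bounded scalar sample (Maurer & Pontil, 2009), which is the kind of variance-sensitive bound the modified stopping rule builds on. Extending it to the vector-valued gradient error, as done in "Appendix  E", requires extra care (e.g., union bounds over coordinates) that this sketch does not attempt, and the two-sided form below is our own adaptation.

```python
import math
import numpy as np

def empirical_bernstein_radius(samples, delta, value_range=1.0):
    """Two-sided empirical Bernstein radius for the mean of i.i.d. samples with
    range `value_range` (Maurer & Pontil, 2009): with probability at least
    1 - delta, |true mean - sample mean| is within this radius (each one-sided
    bound is applied with delta / 2)."""
    n = len(samples)
    var = np.var(samples, ddof=1)                 # unbiased sample variance
    log_term = math.log(4.0 / delta)              # log(2 / (delta / 2))
    return (math.sqrt(2.0 * var * log_term / n)
            + 7.0 * value_range * log_term / (3.0 * (n - 1)))

# Illustration on synthetic bounded data (hypothetical numbers).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=500)
print(empirical_bernstein_radius(x, delta=0.05))
```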

Weaker safety requirements

Monotonic improvement is a very strong requirement, so we do expect an algorithm with strict monotonic improvement guarantees like SPG to be very data-hungry and slow to converge. However, with little effort, Algorithm 2 can be modified to handle weaker safety requirements. A common one is the baseline constraint (e.g., Garcelon et al., 2020; Laroche et al., 2019), where the performance of the policy is required never to be (significantly) lower than the performance of a baseline policy \(\pi _b\). In a real safety-critical application, the reward could be designed so that policies with performance greater than \(J(\pi _b)\) are always safe. In other applications, \(\pi _b\) can be an existing, reliable controller that the user wants to replace with an adaptive RL agent. In this case, assuming \(\pi _{\varvec{\theta }_0}=\pi _b\), the baseline constraint guarantees that the learning agent never performs worse than the original controller. In our numerical simulations of Sect. 7, we will consider a stronger version of the baseline constraint that we call milestone constraint. In this case, the agent’s policy must never perform (significantly) worse than the best performance observed so far. Formally, for all \(k\ge 1\):

$$\begin{aligned} J(\varvec{\theta }_{k+1}) \ge \lambda \max _{j=1,2,\dots ,k}\{J(\varvec{\theta }_j)\}, \end{aligned}$$
(68)

where \(\lambda \in [0,1]\) is a user-defined significance parameter. The idea is as follows: every time the agent reaches a new level of performance (a milestone), it should never do significantly worse than that. When \(\lambda =1\), this reduces to monotonic improvement. When \(\lambda <1\), some amount of performance oscillation is allowed, but this relaxation can significantly improve the learning speed. Of course, the user has full control over this trade-off through the meta-parameter \(\lambda\). In "Appendix  E" we show that variants of Algorithm 2 satisfy the milestone constraint (and other requirements, such as the baseline constraint) with probability \(1-\delta\) for given significance \(\lambda\) and failure probability \(\delta\). We experiment with the milestone constraint in Sect. 7.2.
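For illustration, checking the milestone constraint of Eq. (68) only requires tracking the best performance reached so far; a minimal sketch with hypothetical, non-negative performance values:

```python
def milestone_satisfied(performance_history, lam):
    """Check the milestone constraint of Eq. (68):
    J(theta_{k+1}) >= lam * max_{j<=k} J(theta_j) for every update k
    (assumes non-negative performance values)."""
    for k in range(1, len(performance_history)):
        threshold = lam * max(performance_history[:k])
        if performance_history[k] < threshold:
            return False
    return True

# Hypothetical learning curves: a small dip is tolerated when lam < 1.
print(milestone_satisfied([1.0, 2.0, 1.9, 3.0], lam=0.9))   # True
print(milestone_satisfied([1.0, 2.0, 1.5, 3.0], lam=0.9))   # False
```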

6 Related works

In this section, we discuss previous results on MI guarantees for policy gradient algorithms.

The seminal work on monotonic performance improvement is by Kakade and Langford (2002). In this work, policy gradient approaches are quickly dismissed because of their lack of exploration, although they guarantee MI in the limit of an infinitesimally small step size. The authors hence focus on value-based RL, proposing the Conservative Policy Iteration (CPI) algorithm, where the new policy is a mixture of the old policy and a greedy one. The guaranteed improvement of this new policy (Kakade & Langford, 2002, Theorem 4.1) depends on the coefficient of this convex combination, which plays a role similar to that of the learning rate in our Theorem 8:

$$\begin{aligned} J(\pi _{k+1}) - J(\pi _k) \ge \frac{\alpha }{(1-\gamma )}\mathop {\mathbb {E}}_{\begin{array}{c} s\sim \rho ^{\pi _k}\\ a\sim \pi ^{+}_k \end{array}}\left[ {A^{\pi _k}(s,a)}\right] - \frac{2\alpha ^2\gamma \epsilon }{(1-\gamma )^2(1-\alpha )}, \end{aligned}$$
(69)

where \(\epsilon =\max _{s\in \mathcal {S}}\vert \mathop {\mathbb {E}}_{a\sim \pi ^{+}_k(\cdot \vert s)}\left[ {A^{\pi _k}(s,a)}\right] \vert\) and \(A^{\pi }(s,a)=Q^\pi (s,a)-V^\pi (s)\) denotes the advantage function of policy \(\pi\). In fact, both lower bounds have a positive term that accounts for the expected improvement of the new policy w.r.t. the old one, and a penalization term due to the mismatch between the two. The CPI approach is refined by Pirotta et al. (2013), who propose the Safe Policy Iteration (SPI) algorithm (see also Metelli et al., 2021).

Specific performance improvement bounds for policy gradient algorithms were first provided by Pirotta et al. (2013) by adapting previous results on policy iteration (Pirotta et al., 2013) to continuous MDPs. However, the penalty term can only be computed for shallow Gaussian policies ("Appendix  B.1") in practice. The bound for the exact framework is:

$$\begin{aligned} J(\varvec{\theta }_{k+1}) - J(\varvec{\theta }_k)&\ge \alpha _k\left\| \nabla J(\varvec{\theta }_k)\right\| _{}^2 -\alpha _k^2 \frac{M^2R}{\sigma ^2(1-\gamma )^2}\left( \frac{\vert \mathcal {A}\vert }{\sqrt{2\pi }\sigma }+\frac{\gamma }{2(1-\gamma )}\right) \nonumber \\ {}&\qquad \qquad \qquad \qquad \qquad \times \left\| \nabla J(\varvec{\theta }_k)\right\| _{1}^2, \end{aligned}$$
(70)

where \(\vert \mathcal {A}\vert\) denotes the volume of the action space. From Table 1, our bound for the same setting is (Corollary 9):

$$\begin{aligned} J(\varvec{\theta }_{k+1}) - J(\varvec{\theta }_k) \ge \alpha _k\left\| \nabla J(\varvec{\theta }_k)\right\| _{}^2 -\alpha _k^2 \frac{M^2R}{\sigma ^2(1-\gamma )^2}\left( 1+\frac{2\gamma }{\pi (1-\gamma )}\right) \left\| \nabla J(\varvec{\theta }_k)\right\| _{}^2, \end{aligned}$$

which has the same dependence on the step size, the policy standard deviation \(\sigma\), the effective horizon \((1-\gamma )^{-1}\), the maximum reward \(R\) and the maximum feature norm \(M\). Besides being more general, our penalty term does not depend on the problematic \(\vert \mathcal {A}\vert\) term (the action space is theoretically unbounded for Gaussian policies) and replaces the \(l_1\) norm of (70) with the smaller \(l_2\) norm. Due to the different constants, we cannot say our penalty is always smaller, but the change of norm could make a big difference in practice, especially for large parameter dimension d. Pirotta et al. (2013) also study the approximate framework. However, albeit formulated in terms of the estimated gradient, their lower bound (Theorem 5.2) still pertains to exact policy gradient updates, since \(\varvec{\theta }_{k+1}\) is defined as \(\varvec{\theta }_k+\alpha _k\nabla J(\varvec{\theta }_k)\). This easy-to-overlook observation makes our Theorem 10 the first rigorous monotonic improvement guarantee for stochastic policy gradient updates of the form \(\varvec{\theta }_{k+1}=\varvec{\theta }_k+\alpha _k\widehat{\nabla }J(\varvec{\theta }_k)\). Pirotta et al. (2013) use their results to design an adaptive step-size schedule for REINFORCE and G(PO)MDP, similarly to what we propose in this paper, but limited to Gaussian policies. Papini et al. (2017) rely on the same improvement lower bound (70) to design an adaptive-batch-size algorithm, the one most similar to our SPG. Again, their monotonic improvement guarantees are limited to shallow Gaussian policies.

Another related family of performance improvement lower bounds, inspired once again by Kakade and Langford (2002), is that of TRPO. These are very general results that apply to arbitrary pairs of stochastic policies, although they are mostly used to construct policy gradient algorithms in practice. Specializing Theorem 1 by Schulman et al. (2015) to our setting and applying the KL lower bound suggested by the authors we can get the following:

$$\begin{aligned} J(\varvec{\theta }_{k+1}) - J(\varvec{\theta }_k)&\ge \frac{1}{1-\gamma }\mathop {\mathbb {E}}_{\begin{array}{c} s\sim \rho ^{\varvec{\theta }_k}\\ a\sim \pi _{\varvec{\theta }_{k+1}} \end{array}}\left[ {A^{\varvec{\theta }_k}(s,a)}\right] \nonumber \\&\quad -\frac{2\gamma R}{(1-\gamma )^3}\max _{s\in \mathcal {S}}\left\{ \mathop {\mathcal {KL}}\left( \pi _{\varvec{\theta }_k}(\cdot \vert s)\,\Vert \,\pi _{\varvec{\theta }_{k+1}}(\cdot \vert s)\right) \right\} , \end{aligned}$$
(71)

where \(\pi _{\varvec{\theta }}\) is a stochastic policy. Unfortunately, the lower bound for a policy gradient update (exact or stochastic) cannot be computed exactly. Approximations can lead to very good practical algorithms such as TRPO, but not to actually implementable algorithms with rigorous monotonic improvement guarantees like our SPG. Achiam et al. (2017) and Pajarinen et al. (2019) are able to remove some approximations, but not all.Footnote 16 If we were to derive a computable worst-case lower bound starting from (71), we would get a result similar to (70). In fact, Pirotta et al. (2013) explicitly upper-bound the KL divergence in their derivations, which is why the final result is limited to Gaussian policies. We overcome this difficulty by directly upper-bounding the curvature of the objective function (Lemma 5). Furthermore, Theorem 7 suggests that our theory is not limited to policy gradient updates. Arbitrary update directions are considered in (Papini et al., 2020).

Pirotta et al. (2015) provide performance improvement lower bounds (Lemma 8) and adaptive-step algorithms for policy gradients under Lipschitz continuity assumptions on the MDP and the policy. Our assumptions on the environment are much weaker since we only require boundedness of the reward. Intuitively, stochastic policies smooth out the irregularities of the environment when computing expected-return objectives. On the other hand, the results of Pirotta et al. (2015) also apply to deterministic policies.

Cohen et al. (2018) provide a general safe policy improvement strategy that can also be applied to policy gradient updates. However, it requires maintaining and evaluating a set of policies at each iteration instead of a single one.

As mentioned, Yuan et al. (2021, Lemma 4.4) also study policy gradient with smoothing policies, providing an improved smoothness constant and proving Lipschitz continuity of the objective function. However, their main focus is the sample complexity of vanilla policy gradient.

7 Experiments

In this section, we test our SPG algorithm on simulated control tasks. We first test Algorithm 2 with monotonic improvement guarantees on a small continuous-control problem. We then experiment with the milestone-constraint relaxation proposed in Sect. 5.1 on a classic RL benchmark—cart-pole balancing.

7.1 Linear-quadratic regulator with Gaussian policy

The first task is a 1-dimensional Linear-Quadratic Regulator (LQR; Dorato et al., 1994), a typical continuous-control benchmark. See "Appendix  F.1" for a detailed task specification. We use a Gaussian policy ("Appendix  B.1") that is linear in the state, \(\pi _{\varvec{\theta }}(a\vert s)=\mathcal {N}(a{;}\theta s, \sigma ^2)\). The task horizon is \(T=10\) and we use \(\gamma =0.9\) as a discount factor. The policy mean parameter is initialized to \(\theta _0=0\) and the standard deviation is fixed to \(\sigma =1\). For this task, the maximum reward (in absolute value) is \(R=1\) and the only feature is the state itself, giving \(M=1\). Hence, the smoothness constant \(L^\star \simeq 200\) is easily computed (see Table 1). Similarly, the error bound can be retrieved from Table 2. We compare SPG (Algorithm 2) with an existing adaptive-batch-size policy gradient algorithm for Gaussian policies (Papini et al., 2017), discussed in the previous section and labeled AdaBatch in the plots. SPG is run with a mini-batch size of \(n=100\) (see Sect. 5.1), and AdaBatch (in the version with Bernstein’s inequality as recommended in the original paper) with an initial batch size of \(N_0=100\). Both use the adaptive confidence schedule \(\delta _k=\delta /(k(k+1))\) discussed in Sect. 5, with an overall failure probability of \(\delta =0.05\). We also consider SPG with a twice-as-large step size \(\alpha =1/L^\star\), as discussed in Sect. 5.1.
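As a rough sketch of this setup, the following code implements a linear-mean Gaussian policy and a plain REINFORCE gradient estimate on a toy one-dimensional system; the dynamics and cost below are illustrative placeholders and do not reproduce the exact task of "Appendix  F.1".

```python
import numpy as np

rng = np.random.default_rng(0)
theta, sigma, gamma, T = 0.0, 1.0, 0.9, 10

def rollout(theta):
    """One trajectory under the Gaussian policy N(theta * s, sigma^2)
    in a toy 1-D system (placeholder linear dynamics, quadratic cost)."""
    s = rng.normal()
    states, actions, rewards = [], [], []
    for _ in range(T):
        a = rng.normal(theta * s, sigma)
        states.append(s)
        actions.append(a)
        rewards.append(-(s ** 2 + a ** 2))     # quadratic cost as negative reward
        s = s + a                              # placeholder dynamics
    return np.array(states), np.array(actions), np.array(rewards)

def reinforce_gradient(theta, n_trajectories=100):
    """Plain REINFORCE estimate: average of (sum of scores) * (discounted return),
    with score d/dtheta log pi = (a - theta * s) * s / sigma^2 for this policy."""
    grads = []
    for _ in range(n_trajectories):
        s, a, r = rollout(theta)
        score_sum = np.sum((a - theta * s) * s) / sigma ** 2
        disc_return = np.sum(gamma ** np.arange(T) * r)
        grads.append(score_sum * disc_return)
    return np.mean(grads)

print(reinforce_gradient(theta))
```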

Figure 1 shows the expected performance of the algorithms on the LQR task. For this task, we are able to compute the expected performance in closed form given the policy parameters (Peters, 2002). This allows us to filter out the oscillations due to the stochasticity of policy and environment, focusing on actual (expected) performance oscillations. It is also why the variability among different seeds is so small (note that, for this figure, the shaded areas correspond to 10 standard deviations, whereas they correspond to a single standard deviation in the other figures). Performance is plotted against the total number of collected trajectories for fair comparison. The distribution of policy updates can be deduced from the markers. We can see that indeed all the safe PG algorithms exhibit monotonic improvement. SPG converges faster than AdaBatch. This is mostly due to the larger step size of SPG (we observed that the step size of SPG was more than 100 times larger than that of AdaBatch in most of the updates). This allows SPG to converge faster even with fewer policy updates. The variant of SPG with a larger step size (\(\alpha =1/L^\star\)) converges faster to a good policy, but the original version from Algorithm 2 achieves higher performance in the long run. This indicates that maximizing the lower bound on per-trajectory performance improvement from Theorem 10 is indeed meaningful.

Fig. 1 Performance of SPG and AdaBatch (Papini et al., 2017) on the LQR task with Gaussian policy. Results are averaged over 5 independent runs. The shaded areas correspond to 10 standard deviations. A marker corresponds to 100 policy updates

Figure 2 shows the batch size of the different algorithms. The batch size of SPG is mostly larger than that of AdaBatch. From Sect. 6 we know that the monotonic improvement guarantee of SPG is more rigorous, so a larger batch size is justified. Notice also that the batch size of SPG is smaller than that of AdaBatch in the early iterations, suggesting that the former is more adaptive.

Fig. 2 Batch size of SPG and AdaBatch on the LQR task. Results are averaged over 5 independent runs. The shaded areas correspond to one standard deviation. A marker corresponds to 100 policy updates

7.2 Cart-pole with softmax policy

The second task is cart-pole (Barto et al., 1983). We use the implementation from openai/gym, which has 4-dimensional continuous states and finite actions, \(a\in \{1,2\}\). See "Appendix  F.2" for further details. The policy is Softmax ("Appendix  B.2"), linear in the state: \(\pi _{\varvec{\theta }}(a\vert s)\propto \exp (\varvec{\theta }_a^\top s)\), with a separate parameter for each action (\(\varvec{\theta }= [\varvec{\theta }_1; \varvec{\theta }_2]\)). We use a fixed temperature \(\tau =1\), initial policy parameters set to zero (this corresponds to a uniform policy) and \(\gamma =0.9\) as a discount factor. For SPG, we employ all the practical variants proposed in Sect. 5.1. In particular, since the Softmax policy has a bounded score function, we can use the empirical Bernstein bound. Note that we could not have done the same for the LQG task since the score function of the Gaussian policy is unbounded (see "Appendix  C"). Moreover, we consider the relaxed milestone constraint for different values of the significance parameter, \(\lambda \in \{0.1,0.2,0.4\}\). The overall failure probability is always \(\delta =0.2\), the mini-batch size is \(n=100\), and the step size is \(\alpha =1/L^\star\).Footnote 17 We compare with GPOMDP with the same step size but a fixed batch size of \(N=100\), which comes with no safety guarantees, and corresponds to \(\lambda =0\). In Fig. 3 we plot the performance against the total number of collected trajectories. As expected, a more relaxed constraint yields faster convergence. However, no significant performance oscillations are observed, not even in the case of GPOMDP, suggesting that the choice of meta-parameters is still over-conservative. In Fig. 4 (left) we report the evolution of the batch size of SPG during the learning process. Note how, in this case, the batch size seems to converge to a constant value. In Fig. 4 (right) we illustrate the milestone constraint. The solid line is the performance of SPG with \(\lambda =0.1\), while the dotted line is the performance lower-threshold enforced by the milestone constraint, representing \(90\%\) of the highest performance achieved so far. As desired, the actual performance never falls under the threshold.
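For completeness, a minimal sketch of the linear Softmax policy described above; its score function is bounded whenever the state features are bounded, which is the property that enables the empirical Bernstein variant. The array shapes and zero-based action indexing below are assumptions for illustration only.

```python
import numpy as np

def softmax_policy(theta, s, tau=1.0):
    """Action probabilities pi(a|s) proportional to exp(theta_a^T s / tau);
    theta has shape (num_actions, state_dim), actions are indexed from 0."""
    logits = theta @ s / tau
    logits -= logits.max()                     # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def score(theta, s, a, tau=1.0):
    """Score function grad_theta log pi(a|s): row a' equals
    (1{a'=a} - pi(a'|s)) * s / tau, hence bounded for bounded states."""
    p = softmax_policy(theta, s, tau)
    grad = -np.outer(p, s) / tau
    grad[a] += s / tau
    return grad

theta = np.zeros((2, 4))                       # two actions, 4-D state (uniform policy)
print(softmax_policy(theta, np.array([0.1, 0.0, -0.2, 0.05])))
```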

Fig. 3 Performance of GPOMDP and SPG (for different values of the significance parameter \(\lambda\)) on the cart-pole task with Softmax policy. Results are averaged over 5 independent runs. The shaded areas correspond to one standard deviation. A marker corresponds to 1000 policy updates

Fig. 4 Further results for SPG on the cart-pole task. On the left, the batch size is plotted against the total number of trajectories. A marker corresponds to 1000 policy updates. On the right, the performance at each policy update (solid line) is compared with the performance threshold (dashed line) when \(\lambda =0.1\). In both plots, shaded areas correspond to one standard deviation

8 Conclusion

We have identified a general class of policies, called smoothing policies, for which the performance measure (expected total reward) is a smooth function of policy parameters. We have exploited this property to select meta-parameters for actor-only policy gradient that guarantee monotonic performance improvement. We have shown that an adaptive batch size can be used in combination with a constant step size for improved efficiency, especially in the early stages of learning. We have designed a monotonically improving policy gradient algorithm, called Safe Policy Gradient (SPG), with adaptive batch size. We have shown how SPG can also be applied to weaker performance-improvement constraints. Finally, we have tested SPG on simulated control tasks.

Although the safety motivations are clearly of practical interest, our contribution is mostly theoretical. The meta-parameters proposed in Sect. 4 and used in SPG are based on worst-case problem-dependent constants that are known and easy to compute, but can be very large. This would lead to over-conservative behavior in most problems of interest. However, we believe that this work provides a solid starting point to develop safe and efficient policy gradient algorithms that are rooted in theory.

To conclude, we propose some possible ideas for future work aimed at closing this gap between theory and practice. While we used the empirical Bernstein bound to characterize the gradient estimation error for Softmax policies, the same cannot be done for Gaussian policies due to their unbounded score function. Tighter concentration inequalities should be studied for this case. The convergence rate of SPG should also be studied. The main challenge here is the growing batch size. The numerical simulations of Sect. 7.1 suggest that the growth is sublinear. Moreover, we have observed convergence to a fixed batch size under the weaker milestone constraint in Sect. 7.2. It is also worth investigating whether SPG can be combined with stochastic variance-reduction techniques (e.g., (Papini et al., 2018; Yuan et al., 2020)). Convergence to global optima should also be investigated, as is now common in the policy optimization literature (Bhandari & Russo, 2019; Zhang et al., 2020; Agarwal et al., 2020). Actor-critic algorithms (Konda & Tsitsiklis, 1999) are more widely used than actor-only algorithms in practice (e.g., (Haarnoja et al., 2018)) due to their reduced variance. Thus, extending our improvement guarantees to this class of algorithms is also important. The main challenge lies in handling the bias due to the critic. A promising first step is to consider compatible critics that yield unbiased gradient estimates (Sutton et al., 2000; Konda & Tsitsiklis, 1999). Although the class of smoothing policies is very broad, we have restricted our attention to Gaussian and Softmax policies with given features. Other policy classes, such as beta policies (Chou et al., 2017), should be considered. Most importantly, deep policies, which also learn the features from data, should be considered, especially given their success in practice (Duan et al., 2016). See "Appendix B.3" for a brief discussion. Other possible extensions include generalizing the monotonic improvement guarantees to other concepts of safety, such as learning under constraints, or risk-averse RL (Bisi et al., 2020). Finally, the conservative approach adopted in this work could prevent exploration, making some tasks very hard to learn. We studied the case of Gaussian policies with adaptive standard deviation in (Papini et al., 2020). Future work should consider the trade-off between safety, efficiency and exploration in greater generality.