Cautious policy programming: exploiting KL regularization for monotonic policy improvement in reinforcement learning

In this paper, we propose cautious policy programming (CPP), a novel value-based reinforcement learning (RL) algorithm that exploits the idea of monotonic policy improvement during learning. Based on the nature of entropy-regularized RL, we derive a new entropy-regularization-aware lower bound of policy improvement that depends on the expected policy advantage function but not on state-action-space-wise maximization as in prior work. CPP leverages this lower bound as a criterion for adjusting the degree of a policy update to alleviate policy oscillation. Unlike similar algorithms, which are mostly theory-oriented, we also propose a novel interpolation scheme that makes CPP scale better in high-dimensional control problems. We demonstrate that the proposed algorithm can trade off performance and stability in both didactic classic control problems and challenging high-dimensional Atari games.


Introduction
Reinforcement learning (RL) has recently achieved impressive successes in fields such as robotic manipulation (Andrychowicz et al., 2020) and video game playing (Mnih et al., 2015). However, compared with supervised learning, which has a wide range of practical applications, RL applications have primarily been limited to game playing or lab robotics. A crucial reason for this limitation is the lack of a guarantee that the performance of RL policies will improve monotonically; they often oscillate during policy updates. As such, deploying updated policies without examining their reliability might bring severe consequences in real-world scenarios, e.g., crashing a self-driving car.
Dynamic programming (DP) (Bertsekas, 2005) offers a well-studied framework under which strict policy improvement is possible: with a known state transition model, reward function, and exact computation, monotonic improvement is ensured and convergence is guaranteed within a finite number of iterations (Ye, 2011). However, in practice an accurate model of the environment is rarely available. In situations where either model knowledge is absent or the DP value functions cannot be explicitly computed, approximate DP and corresponding RL methods are to be considered. However, approximation introduces unavoidable update and Monte-Carlo sampling errors, and possibly restricts the policy space in which the policy is updated, leading to the policy oscillation phenomenon (Bertsekas, 2011; Wagner, 2011), whereby the updated policy performs worse than pre-update policies during intermediate stages of learning. Inferior updated policies resulting from policy oscillation could pose a physical threat to real-world RL applications. Further, as value-based methods are widely employed in state-of-the-art RL algorithms (Haarnoja et al., 2018), addressing the problem of policy oscillation becomes important in its own right.
Previous studies (Kakade & Langford, 2002; Pirotta et al., 2013b) attempt to address this issue by optimizing lower bounds of policy improvement: the classic conservative policy iteration (CPI) algorithm (Kakade & Langford, 2002) states that, if the new policy is a linear interpolation of the greedy policy and the baseline policy, a non-negative lower bound on the policy improvement can be defined. Since this lower bound is a negative quadratic function in the interpolation coefficient, one can solve for the maximizing coefficient to obtain maximum improvement at every update. CPI opened the door to monotonic improvement algorithms, and the concept of linear interpolation can be regarded as performing regularization in the stochastic policy space to reduce greediness. Such regularization is theoretically sound as it has been proved to converge to the global optimum (Scherrer & Geist, 2014; Neu et al., 2017). For the last two decades, CPI has inspired many studies on ensuring monotonic policy improvement. However, those studies (including CPI itself) are mostly theory-oriented and hardly applicable to practical scenarios, in that maximizing the lower bound requires solving several state-action-space-wise maximization problems, e.g., estimating the maximum distance between two arbitrary policies. One significant factor causing the complexity might be their excessive generality (Kakade & Langford, 2002; Pirotta et al., 2013b): these bounds do not focus on any particular class of value-based RL algorithms, and hence without further assumptions the problem cannot be simplified.
Another recent trend in developing algorithms robust to oscillation is to introduce regularizers into the reward function. Such regularization has been shown to stabilize maximum-pursuing greedy policies. For example, by maximizing the reward as well as the Shannon entropy of the current policy (Ziebart, 2010; Fox et al., 2016; Haarnoja et al., 2017), the optimal policy becomes a multi-modal Boltzmann softmax distribution, which avoids putting unit probability mass on greedy but potentially sub-optimal actions corrupted by noise or error. This significantly enhances robustness since optimal actions always have nonzero probabilities of being chosen. Shannon entropy regularization has also been extensively practiced in the policy gradient literature (O'Donoghue et al., 2016; Haarnoja et al., 2018) and verified to smooth the policy optimization landscape (Ahmed et al., 2019). On the other hand, the introduction of the Kullback-Leibler (KL) divergence (Todorov, 2006) generalizes Shannon entropy regularization, since the Shannon entropy can be produced by taking the KL divergence with respect to a uniform policy (Fox et al., 2016). Besides encouraging exploration in the same way as the Shannon entropy (Chan et al., 2022), KL regularization penalizes aggressive policy changes with respect to some baseline policy, and it has recently been identified to attain state-of-the-art theoretical results (Vieillard et al., 2020a; Kozuno et al., 2022) as well as to average out different sources of errors in practice. However, it is worth noting that, though the aforementioned entropy-regularized algorithms have superior finite-time bounds and enjoy strong empirical performance, they are not guaranteed to reduce policy oscillation since degradation during learning can still persist (Nachum et al., 2018).
It is hence natural to raise the question of whether the practically intractable lower bounds from the monotonic improvement literature can benefit from entropy regularization if we restrict ourselves to the entropy-regularized policy class. By noticing that policy interpolation and entropy regularization actually perform regularization in different aspects, i.e., in the stochastic policy space and in the reward function, respectively, we answer this question in the affirmative. We show that focusing on the class of entropy-regularized policies significantly simplifies the problem, as a very recent result indicates that a sequence of entropy-regularized policies has bounded KL divergence (Kozuno et al., 2019). This result sheds light on approximating the intractable lower bounds from the monotonic improvement algorithms, since many quantities in them are related to the maximum distance between two arbitrary policies.
In this paper, we aim to tackle the policy oscillation problem by ensuring monotonic improvement via optimizing a more tractable lower bound. This novel entropy-regularization-aware lower bound of policy improvement depends only on the expected policy advantage function. We call the resultant algorithm cautious policy programming (CPP). CPP leverages this lower bound as a criterion for adjusting the degree of a policy update to alleviate policy oscillation. By introducing heuristic designs suitable for nonlinear approximators, CPP can be extended to work with deep networks. The extensions are compared with the state-of-the-art algorithm on monotonic policy improvement (Vieillard et al., 2020b). We demonstrate that our approach can trade off performance and stability in both didactic classic control problems and challenging Atari games.
The contributions of this paper can be succinctly summarized as follows:
• we develop an easy-to-use lower bound for ensuring monotonic policy improvement in RL;
• we propose a novel scalable algorithm, CPP, which optimizes the lower bound;
• CPP is validated to reduce policy oscillation on high-dimensional problems which are intractable for prior methods.
Here, the first and second points are presented in Sect. 4, after a brief review of related work in Sect. 2 and preliminaries in Sect. 3. The third point is inspected in Sect. 5, which presents the results. CPP touches upon many related problems, and we provide an in-depth discussion in Sect. 6. The paper is concluded in Sect. 7. To not interrupt the flow of the paper, we defer all proofs to the Appendix.

Related work
The policy oscillation phenomenon, also termed overshooting by Wagner (2011) and referred to as degraded performance of updated policies, frequently arises in approximate policy iteration algorithms (Bertsekas, 2011) and can occur even under asymptotically converged value functions (Wagner, 2011). It has been shown that aggressive updates with sampling and update errors, together with restricted policy spaces, are the main reasons for policy oscillation (Pirotta et al., 2013b). In modern applications of RL, policy oscillation becomes an important issue when learning with deep networks, where various sources of errors have to be taken into account. Fujimoto et al. (2018) and Fu et al. (2019) found that those errors are the main cause of the typical oscillating performance of deep RL implementations.
To attenuate policy oscillation, the seminal conservative policy iteration (CPI) algorithm (Kakade & Langford, 2002) proposes to perform regularization in the stochastic policy space, whereby the greedily updated policy is interpolated with the current policy to achieve less aggressive updates. CPI has inspired numerous conservative algorithms that enjoy strong theoretical guarantees (Abbasi-Yadkori et al., 2016; Metelli et al., 2018; Pirotta et al., 2013a, 2013b) and improve upon CPI by proposing new lower bounds for policy improvement. However, since their focus is on general Markov decision processes (MDPs), deriving practical algorithms based on the lower bounds is nontrivial and the proposed lower bounds are mostly of theoretical value. Indeed, as admitted by Papini et al. (2020), a large gap between theory and practice exists, as manifested by their experimental results: even for a simple Cartpole environment, a state-of-the-art algorithm failed to deliver attenuated oscillation and convergence speed comparable with a heuristic optimization scheme such as Adam (Kingma & Ba, 2015). This might explain why adaptive coefficients must be introduced in (Vieillard et al., 2020b) to extend CPI to be compatible with deep neural networks. To remove this limitation, our focus on entropy-regularized MDPs allows for a straightforward algorithm based on a novel, significantly simplified lower bound.
Another line of research toward alleviating policy oscillation is to incorporate regularization as a penalty into the reward function, leading to the recently booming literature on entropy-regularized MDPs (Azar et al., 2012; Fox et al., 2016; Kozuno et al., 2019; Haarnoja et al., 2017; Vieillard et al., 2020a; Mei et al., 2019). Instead of interpolating greedy policies, the reward is augmented with an entropy term of the policy, such as the Shannon entropy for more diverse behavior and a smooth optimization landscape (Ahmed et al., 2019), or the Kullback-Leibler (KL) divergence for enforcing policy similarity between policy updates and hence achieving superior sample efficiency (Uchibe, 2018; Uchibe & Doya, 2021). The Shannon entropy renders the optimal policy of the regularized MDP stochastic and multimodal, and hence robust against errors and noise, in contrast to the deterministic policy that puts all probability mass on a single action (Haarnoja et al., 2018). On the other hand, augmenting with the KL divergence shapes the optimal policy into an average of all past value functions, which is significantly more robust than a single point estimate. Compared to the CPI-based algorithms, entropy-regularized algorithms do not have a guarantee on per-update improvement, but they have demonstrated state-of-the-art empirical successes on a wide range of challenging tasks (Cui et al., 2017; Tsurumine et al., 2019; Zhu et al., 2020, 2022). To the best of the authors' knowledge, unifying those two regularization schemes has not been considered in published literature before.
It is worth noting that, inspired by Kakade and Langford (2002), the concept of monotonic improvement has also been exploited in policy search scenarios (Schulman et al., 2015; Akrour et al., 2018; Shani et al., 2019; Mei et al., 2020; Papini et al., 2020). However, there is a large gap between theory and practice in those policy gradient methods. On one hand, though Schulman et al. (2015, 2017) demonstrated good empirical performance, their relaxed trust region is often too optimistic and easily corrupted by noise and errors that arise frequently in the deep RL setting: as pointed out by Engstrom et al. (2019), the trust region technique alone fails to explain the efficiency of the algorithms, and many code-level tricks are necessary. On the other hand, exactly following the guidance of the monotonically improving gradient does not lead to tempered oscillation and better performance even for simple problems (Papini et al., 2017, 2020). Another shortcoming of policy gradient methods is that they focus on locally optimal policies with a strong dependency on initial parameters. In contrast, we focus on value-based RL, which searches for globally optimal policies.

Reinforcement learning
RL problems can be formulated by MDPs expressed by the quintuple $(\mathcal{S}, \mathcal{A}, T, R, \gamma)$, where $\mathcal{S}$ denotes the state space, $\mathcal{A}$ denotes the finite action space, and $T$ denotes the transition dynamics such that $T^{a}_{ss'} := T(s'|s, a)$ represents the transition from state $s$ to $s'$ with action $a$ taken. $R = r^{a}_{ss'}$ is the immediate reward associated with that transition. We also write the expected reward weighted by transition as $r^{a}_{s} = \sum_{s'} T^{a}_{ss'} r^{a}_{ss'}$. In this paper, we consider $r^{a}_{ss'}$ as bounded in the interval $[-1, 1]$. $\gamma \in (0, 1)$ is the discount factor. For simplicity, we consider the infinite horizon discounted setting with a fixed starting state $s_0$. A policy $\pi$ is a probability distribution over actions given some state. We also define the stationary state distribution induced by $\pi$ as $d^{\pi}(s) = (1 - \gamma) \sum_{t=0}^{\infty} \gamma^{t} T(s_t = s | s_0, \pi)$. RL algorithms search for an optimal policy $\pi^{*}$ that maximizes the state value function for all states $s$:
$$\pi^{*} := \arg\max_{\pi} V_{\pi}(s) = \arg\max_{\pi} \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t} r_t \,\Big|\, s_0 = s \right],$$
where the expectation is with respect to the transition dynamics $T$ and policy $\pi$. The state-action value function $Q$ is more frequently used in the control context:
$$Q_{\pi^{*}}(s, a) = \max_{\pi} \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t} r_t \,\Big|\, s_0 = s, a_0 = a \right].$$
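To make the definitions of $d^{\pi}$ and the state value function concrete, the following is a minimal tabular sketch. The toy MDP, the array layout `T[s, a, s']`, and all function names are assumptions made for illustration and are not part of the original algorithm description. It uses the closed form of the geometric series, $d^{\pi} = (1-\gamma)(I - \gamma P_{\pi}^{\top})^{-1} e_{s_0}$.

```python
import numpy as np

def policy_transition(T, pi):
    """P[s, s'] = sum_a pi(a|s) * T(s'|s, a)."""
    return np.einsum("sa,sap->sp", pi, T)

def discounted_state_distribution(T, pi, gamma, s0):
    """d^pi(s) = (1 - gamma) * sum_t gamma^t Pr(s_t = s | s_0, pi)."""
    n_states = T.shape[0]
    start = np.zeros(n_states)
    start[s0] = 1.0
    P = policy_transition(T, pi)
    # Geometric series: sum_t gamma^t (P^T)^t start = (I - gamma P^T)^{-1} start
    return (1.0 - gamma) * np.linalg.solve(np.eye(n_states) - gamma * P.T, start)

def state_values(T, R, pi, gamma):
    """Return (V_pi, r_pi), where V_pi solves V = r_pi + gamma * P V."""
    r_sa = np.einsum("sap,sap->sa", T, R)   # r^a_s = sum_s' T^a_ss' r^a_ss'
    r_pi = np.sum(pi * r_sa, axis=1)
    P = policy_transition(T, pi)
    return np.linalg.solve(np.eye(T.shape[0]) - gamma * P, r_pi), r_pi

# Hypothetical toy MDP: 3 states, 2 actions, rewards bounded in [-1, 1].
rng = np.random.default_rng(0)
T = rng.random((3, 2, 3))
T /= T.sum(axis=2, keepdims=True)
R = rng.uniform(-1.0, 1.0, size=(3, 2, 3))
pi = np.full((3, 2), 0.5)
gamma, s0 = 0.9, 0

d = discounted_state_distribution(T, pi, gamma, s0)
V, r_pi = state_values(T, R, pi, gamma)
```

Under these assumptions, `d` is a proper distribution over states, and the normalized return satisfies $(1-\gamma) V_{\pi}(s_0) = \sum_s d^{\pi}(s) r_{\pi}(s)$.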

Lower bounds on policy improvement
To frame the monotonic improvement problem, we introduce the following lemma that formally defines the criterion of policy improvement of some policy $\pi'$ over $\pi$:
Lemma 1 (Kakade & Langford, 2002) For any stationary policies $\pi'$ and $\pi$, the following equation holds:
$$\Delta J^{\pi'}_{\pi} := J^{\pi'} - J^{\pi} = \sum_{s} d^{\pi'}(s) \sum_{a} \pi'(a|s) A_{\pi}(s, a), \qquad (1)$$
where $J$ is the discounted cumulative reward, and $A_{\pi}(s, a) := Q_{\pi}(s, a) - V_{\pi}(s)$ is the advantage function. Though Lemma 1 relates policy improvement to the expected advantage function, pursuing policy improvement by directly exploiting Lemma 1 is intractable as it requires comparing $\pi'$ and $\pi$ point-wise for infinitely many new policies. Many existing works (Kakade & Langford, 2002; Pirotta et al., 2013b; Schulman et al., 2015) instead focus on finding a $\pi'$ such that the right-hand side of Eq. (1) is lower bounded. To alleviate policy oscillation brought by the greedily updated policy $\tilde{\pi}$, Kakade and Langford (2002) propose adopting the partial update:
$$\pi' = \zeta \tilde{\pi} + (1 - \zeta) \pi. \qquad (2)$$
Eq. (2) corresponds to performing regularization in the stochastic policy space by interpolating the greedy policy and the current policy to achieve conservative updates. The concept of linearly interpolating policies has inspired many algorithms that enjoy strong theoretical guarantees (Akrour et al., 2018; Metelli et al., 2018; Pirotta et al., 2013b). However, those algorithms are mostly of theoretical value and have only been applied to small problems due to intractable optimization or estimation when the state-action space is high-dimensional or continuous. Indeed, as admitted by Papini et al. (2020), there is a large gap between theory and practice when using algorithms based on the policy regularization of Eq. (2): even on a simple CartPole problem, a state-of-the-art algorithm fails to compete with a heuristic optimization technique. Like our proposal in this paper, a very recent work (Vieillard et al., 2020b) attempts to bridge this gap by proposing a heuristic coefficient design for learning with deep networks. We discuss the relationship between it and CPP in Sect. 4.4.
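Lemma 1 and the partial update can be checked numerically on a small tabular problem. The sketch below is illustrative only: the random MDP, the helper names, and the choice $\zeta = 0.3$ are assumptions, and $J$ is taken to be the return normalized by $(1-\gamma)$ so that it matches the normalization of $d^{\pi}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, s0 = 4, 3, 0.9, 0

# Hypothetical random MDP: T[s, a, s'] and expected rewards r[s, a] in [-1, 1].
T = rng.random((n_states, n_actions, n_states))
T /= T.sum(axis=2, keepdims=True)
r = rng.uniform(-1.0, 1.0, size=(n_states, n_actions))

def evaluate(pi):
    """Return (Q_pi, V_pi) by solving the linear Bellman equations exactly."""
    P = np.einsum("sa,sap->sp", pi, T)
    v = np.linalg.solve(np.eye(n_states) - gamma * P, np.sum(pi * r, axis=1))
    return r + gamma * T @ v, v

def state_distribution(pi):
    """d^pi(s) = (1 - gamma) * sum_t gamma^t Pr(s_t = s | s_0, pi)."""
    P = np.einsum("sa,sap->sp", pi, T)
    start = np.eye(n_states)[s0]
    return (1.0 - gamma) * np.linalg.solve(np.eye(n_states) - gamma * P.T, start)

pi = rng.random((n_states, n_actions))
pi /= pi.sum(axis=1, keepdims=True)
q, v = evaluate(pi)

greedy = np.eye(n_actions)[q.argmax(axis=1)]   # greedy policy w.r.t. Q_pi
zeta = 0.3
pi_new = zeta * greedy + (1.0 - zeta) * pi      # partial update, Eq. (2)

advantage = q - v[:, None]                      # A_pi(s, a)
_, v_new = evaluate(pi_new)
lhs = (1.0 - gamma) * (v_new[s0] - v[s0])       # normalized J^{pi'} - J^{pi}
rhs = state_distribution(pi_new) @ np.sum(pi_new * advantage, axis=1)
```

With exact computation, `lhs` and `rhs` agree up to floating-point error, and both are non-negative because the interpolated policy places extra mass on the greedy action. The difficulty discussed in the text arises once these quantities must be estimated from samples.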
In the next section, we detail the derivation of the proposed lower bound by exploiting entropy regularization. This novel lower bound allows us to significantly simplify the intractable optimization and estimation in prior work and provide a scalable implementation.

Proposed method
This section features the proposed novel lower bound on which we base a novel algorithm for ensuring monotonic policy improvement.

Entropy-regularized RL
In the following discussion, we provide a general formulation for entropy-regularized algorithms (Azar et al., 2012; Haarnoja et al., 2018; Kozuno et al., 2019). At iteration $K$, the entropy of the current policy $\pi_K$ and the Kullback-Leibler (KL) divergence between $\pi_K$ and some baseline policy $\bar{\pi}$ are added to the value function:
$$V^{\pi_K}_{\bar{\pi}}(s) := \sum_{a} \pi_K(a|s) \Big[ \sum_{s'} T^{a}_{ss'} \big( r^{a}_{ss'} + \gamma V^{\pi_K}_{\bar{\pi}}(s') \big) + I^{\pi_K}_{\bar{\pi}} \Big], \quad I^{\pi_K}_{\bar{\pi}} = -\tau \log \pi_K(a|s) - \sigma \log \frac{\pi_K(a|s)}{\bar{\pi}(a|s)}, \qquad (3)$$
where we can see that, by multiplying the policy $\pi_K$ into the bracket with $I^{\pi_K}_{\bar{\pi}}$, we obtain the Shannon entropy $-\sum_{a} \pi_K(a|s) \log \pi_K(a|s)$ minus the KL divergence $\sum_{a} \pi_K(a|s) \log \frac{\pi_K(a|s)}{\bar{\pi}(a|s)}$. Thus the coefficient $\tau$ controls the weight of the entropy bonus and $\sigma$ weights the effect of KL regularization. The baseline policy $\bar{\pi}$ is often taken as the policy from the previous iteration, $\pi_{K-1}$. Based on Nachum et al. (2017, 2018), we know the state value function $V^{\pi_K}_{\bar{\pi}}$ defined in Eq. (3) and the state-action value function $Q^{\pi_K}_{\bar{\pi}}$ also satisfy the Bellman recursion:
$$Q^{\pi_K}_{\bar{\pi}}(s, a) := r^{a}_{s} + \gamma \sum_{s'} T^{a}_{ss'} V^{\pi_K}_{\bar{\pi}}(s').$$
Following Vieillard et al. (2020a), the regularized optimal policies are no longer deterministic greedy. Instead, they are soft in the sense that they constitute a Boltzmann distribution over action values. For simplicity, let us suppose we have only either Shannon entropy or KL regularization. Then their optimal policies at the $(K+1)$-th iteration respectively take the form:
$$\pi_{K+1}(a|s) = \frac{\exp\big(\tau^{-1} Q^{\pi_K}_{\bar{\pi}}(s, a)\big)}{Z(s)}, \qquad \pi_{K+1}(a|s) = \frac{\pi_K(a|s) \exp\big(\sigma^{-1} Q^{\pi_K}_{\bar{\pi}}(s, a)\big)}{Z'(s)},$$
where the left with $\tau$ corresponds to Shannon entropy regularization and the right with $\sigma$ to KL regularization. $Z$, $Z'$ are their respective normalization functions. An intuitive explanation of the above optimal policies is that the entropy term endows the optimal policy with multi-modal behavior, since the action values can be multimodal (Haarnoja et al., 2017). It places nonzero probability mass on every action candidate, and hence is robust against error and noise in function approximation that can easily corrupt the conventional deterministic optimal policy (Puterman, 1994). On the other hand, the KL divergence provides smooth policy updates by multiplying with the previous policy $\pi_K(a|s)$, limiting the change that could be induced by the action value alone (Azar et al., 2012; Kozuno et al., 2019; Schulman et al., 2015). KL regularization has recently been shown to possess several advantages over the Shannon entropy. The optimal policy can be transformed into an exponential function over all past value functions (Vieillard et al., 2020a). Limiting the update step also plays a crucial role in recent successful algorithms since it prevents aggressive updates that could easily be corrupted by errors (Fujimoto et al., 2018; Fu et al., 2019). It is worth noting that when the optimal policy is attained, the KL regularization term becomes zero. Hence in Eq. (3), the optimal policy maximizes the cumulative reward while keeping the entropy high. In the rest of the paper, since we consider the case where both Shannon entropy and KL regularization appear, for notational convenience we use the following definition:
$$\alpha := \frac{\tau}{\tau + \sigma}, \qquad \beta := \frac{1}{\tau + \sigma}. \qquad (4)$$
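The two Boltzmann forms discussed above can be sketched in a few lines for a tabular action-value array. This is a minimal illustration assuming the standard forms $\pi_{K+1} \propto \exp(Q/\tau)$ for Shannon entropy and $\pi_{K+1} \propto \pi_K \exp(Q/\sigma)$ for KL regularization; the function names and the toy numbers are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(z)
    return p / p.sum(axis=-1, keepdims=True)

def shannon_policy(q, tau):
    """pi(a|s) proportional to exp(Q(s, a) / tau)."""
    return softmax(q / tau)

def kl_policy(q, pi_prev, sigma):
    """pi(a|s) proportional to pi_prev(a|s) * exp(Q(s, a) / sigma)."""
    return softmax(np.log(pi_prev) + q / sigma)

# Hypothetical action values and previous policy for 2 states, 3 actions.
q = np.array([[1.0, 0.5, -0.2], [0.0, 0.3, 0.1]])
pi_prev = np.array([[0.2, 0.5, 0.3], [0.6, 0.2, 0.2]])
uniform = np.full_like(pi_prev, 1.0 / 3.0)

pi_shannon = shannon_policy(q, tau=0.5)
pi_kl = kl_policy(q, pi_prev, sigma=0.5)
pi_kl_strong = kl_policy(q, pi_prev, sigma=50.0)   # heavy KL penalty
```

A large $\sigma$ keeps the new policy close to `pi_prev`, which is the smoothing effect described in the text, and with a uniform previous policy the KL form reduces to the Shannon form with $\tau = \sigma$.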

Entropy-regularization-aware lower bound
Let us recall that in Eq. (2) regularization was performed in the stochastic policy space by the linear interpolation $\pi' = \zeta \tilde{\pi} + (1 - \zeta) \pi$. This interpolation requires preparing a reference policy $\pi$, which could come from expert knowledge or previous policies. The resultant $\pi'$ has guaranteed monotonic improvement, which we formulate as the following lemma:
Lemma 2 (Pirotta et al., 2013b) Provided that policy $\pi'$ is generated by the partial update Eq. (2), $\zeta$ is chosen properly, and $A^{\tilde{\pi}}_{\pi, d^{\pi}} \geq 0$, then the following policy improvement is guaranteed: [...] (5)
Proof See Section A.1.1 for the proof. ◻
The interpolated policy $\pi'$ optimizes the bound, and the policy improvement is a negative quadratic function in $\zeta$. However, this optimization problem is highly non-trivial as $\delta$ and $\Delta A^{\tilde{\pi}}_{\pi}$ require searching the entire state-action space. This challenge explains why CPI-inspired methods have only been applied to small problems with low-dimensional state-action spaces (Metelli et al., 2018; Papini et al., 2020; Pirotta et al., 2013b).
When expert knowledge is not available, we can simply choose previous policies. Specifically, at any iteration $K$, we want to leverage the monotonic policy improvement given policy $\pi_K$. We propose constructing a new monotonically improving policy as:
$$\tilde{\pi}_{K+1} = \zeta \pi_{K+1} + (1 - \zeta) \pi_K. \qquad (6)$$
It is now clear by comparing Eq. (2) with Eq. (6) that our proposal takes $\pi'$, $\tilde{\pi}$, $\pi$ as $\tilde{\pi}_{K+1}$, $\pi_{K+1}$, $\pi_K$, respectively. It is worth noting that $\pi_{K+1}$ is the updated policy that has not been deployed.
However, the intractable quantities $\delta$ and $\Delta A^{\tilde{\pi}}_{\pi}$ in Lemma 2 are still an obstacle to deriving a scalable algorithm. Specifically, by writing the component $A^{\tilde{\pi}}_{\pi}(s)$ of $\Delta A^{\tilde{\pi}}_{\pi}$ as
$$A^{\tilde{\pi}}_{\pi}(s) = \sum_{a} \big( \tilde{\pi}(a|s) - \pi(a|s) \big) Q_{\pi}(s, a),$$
we see that both $\delta$ and $\Delta A^{\tilde{\pi}}_{\pi}$ require accurately estimating the total variation between two policies. This could be difficult without enforcing constraints such as gradual change of policies. Fortunately, by noticing that the consecutive entropy-regularized policies $\pi_{K+1}$, $\pi_K$ have bounded total variation, we can leverage the boundedness to bypass the intractable estimation.
Lemma 3 (Kozuno et al., 2019) For any policies $\pi_K$ and $\pi_{K+1}$ generated by taking the maximizer of Eq. (3), the following bound holds for their maximum total variation: [...] (7) where $\epsilon$ is the uniform upper bound of error.
Proof See Section A.1.2 for the proof. ◻
Lemma 3 states that entropy-regularized policies have bounded total variation (and hence bounded KL divergence by Pinsker's and Kozuno's inequality (Kozuno et al., 2019)). This bound allows us to bypass the intractable estimation in Lemma 2 and approximate the $\tilde{\pi}_{K+1}$ that optimizes the lower bound. We formally state this result in Theorem 4 below.
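The qualitative content of this boundedness can be illustrated numerically. The sketch below does not implement the bound of Lemma 3 itself; it only shows, under stated assumptions, that a KL-regularized update $\pi_{K+1} \propto \pi_K \exp(Q/\sigma)$ moves less in total variation as $\sigma$ grows, and that Pinsker's inequality $D_{TV} \leq \sqrt{D_{KL}/2}$ holds under the convention $D_{TV}(p, q) = \frac{1}{2}\sum_a |p(a) - q(a)|$. The random action values, the convention for $D_{TV}$, and all names are assumptions.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(z)
    return p / p.sum(axis=-1, keepdims=True)

def total_variation(p, q):
    """Per-state total variation, convention 0.5 * sum_a |p(a) - q(a)|."""
    return 0.5 * np.abs(p - q).sum(axis=-1)

def kl_divergence(p, q):
    """Per-state KL(p || q) for strictly positive distributions."""
    return np.sum(p * (np.log(p) - np.log(q)), axis=-1)

def kl_regularized_update(pi_prev, q, sigma):
    """pi_next(a|s) proportional to pi_prev(a|s) * exp(Q(s, a) / sigma)."""
    return softmax(np.log(pi_prev) + q / sigma)

rng = np.random.default_rng(1)
q_values = rng.uniform(-1.0, 1.0, size=(5, 4))   # hypothetical Q(s, a)
pi_k = softmax(rng.normal(size=(5, 4)))          # hypothetical current policy

pi_next = kl_regularized_update(pi_k, q_values, sigma=0.5)
pi_next_smooth = kl_regularized_update(pi_k, q_values, sigma=5.0)

tv_small_sigma = total_variation(pi_next, pi_k).max()
tv_large_sigma = total_variation(pi_next_smooth, pi_k).max()
pinsker_rhs = np.sqrt(0.5 * kl_divergence(pi_next, pi_k))
```

The maximum over states of the total variation is the kind of quantity that Lemma 3 controls analytically, so that it no longer has to be estimated by searching the state-action space.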
For convenience, we assume there is no error, i.e., $B_K = 0$. Setting $B_K = 0$ is only for ease of notation in the later derivation; our results still hold by simply replacing every appearance of $C_K$ with $B_K + C_K$. In implementation, on the other hand, a sensible choice of the upper bound of error is required, which is typically difficult, especially for high-dimensional problems and with nonlinear function approximators. Fortunately, by virtue of the KL regularization in Eq. (3), it has been shown in Azar et al. (2012) and Vieillard et al. (2020b) that if the sequence of errors is a martingale difference under the natural filtration, then the summation of errors asymptotically cancels out. Hence it might be safe to simply set $B_K = 0$ if we assume the martingale difference condition. We formally state this condition following Vieillard et al. (2020b). Let $\epsilon_{K+1}$ be the evaluation error associated with $Q_{K+1}$, and $\mathcal{F}_K$ be the natural filtration up to iteration $K$ (generated by the sequence of states sampled from the generative model (Vieillard et al., 2020b)). We assume the sequence of errors $\{\epsilon_K\}$ is a martingale difference sequence, i.e., $\mathbb{E}[\epsilon_{K+1} \mid \mathcal{F}_K] = 0$.
Theorem 4 Provided that the partial update Eq. (6) is adopted, $A^{K+1}_{K, d^{K}} \geq 0$, and $\zeta$ is chosen properly as specified below, then any maximizer policy of Eq. (3) guarantees the following improvement, which depends only on $\alpha$, $\beta$, and $A^{K+1}_{K, d^{K}}$, after any policy update: [...] (8) where $\alpha$, $\beta$ are defined in Eq. (4), and $A^{K+1}_{K, d^{K}}$ and $A^{K+1}_{K}(s)$ are the expected policy advantage and the policy advantage function, respectively. Furthermore, the following two convergence rates are available for CPI, and hence inherited by CPP: [...]
Proof See Section A.1.3 for the proof. ◻
While theoretically we need to compare $1 - e^{-2C_K}$ and $4C_K$ when computing $\zeta^{*}$, in implementation the exponential function $e^{-2C_K}$ might sometimes be close to 1, causing numerical instability. Hence in the rest of the paper we shall stick to using the constant $C_K$ rather than the exponential function.
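The instability mentioned above is the usual catastrophic cancellation in $1 - e^{-x}$ for small $x$. The following sketch only illustrates the issue in double precision; the value of `c_k` is hypothetical, and the `expm1`-based variant is a standard numerical alternative shown for comparison, not the choice made in this paper, which uses $C_K$ directly.

```python
import math

def one_minus_exp_naive(c_k):
    """Direct evaluation of 1 - exp(-2 * C_K); loses all digits for tiny C_K."""
    return 1.0 - math.exp(-2.0 * c_k)

def one_minus_exp_stable(c_k):
    """Same quantity via expm1, which avoids the cancellation."""
    return -math.expm1(-2.0 * c_k)

c_k = 1e-17   # hypothetical, very small policy-difference constant
naive = one_minus_exp_naive(c_k)
stable = one_minus_exp_stable(c_k)
linear = 4.0 * c_k   # the constant-based quantity, free of cancellation
```

With the naive form the denominator of $\frac{1}{1 - e^{-2C_K}}$ can collapse to exactly zero, whereas a quantity that is linear in $C_K$ stays well defined.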
In the lower bound Eq. (8), only $A^{K+1}_{K, d^{K}}$ needs to be estimated. It is worth noting that $\forall s, A^{K+1}_{K}(s) \geq 0$ is a straightforward criterion that is naturally satisfied by the greedy policy improvement of policy iteration when computation is exact. To handle the negative case caused by error or approximate computation, we can simply stack more samples to reduce the variance, as will be detailed in Sect. 4.5.
Remark 1 Since CPI-genre algorithms (including our work) aim at explicitly maximizing a lower bound of policy improvement, it is interesting to compare the improvement to the one presented by Agarwal et al. (2021). While it is not directly comparable to our bound since our contribution is not a policy gradient method, we assume that they are the same for a qualitative illustration. Lemma 17 of Agarwal et al. (2021) states that (using our notation): [...] (11) On the other hand, our lower bound Eq. (8) states: [...] (12) Dividing Eq. (12) by Eq. (11), we have [...]. Now let us inspect when this quantity is larger than 1 and solve for the number of iterations $K$. Let us assume $\gamma = 0.99$, $\alpha = 0.9$, $\beta = 1$, $\eta = 0.01 = (1 - \gamma)$. For simplicity let us also assume the ideal case $\log Z_t(s) = \log \sum_{a} \pi_t(a|s) \exp$ [...]. That is, in the ideal case, after around 2000 iterations the denominator shrinks towards zero and our bound offers a better lower bound. However, it should be noted that the example serves only for illustration; the aforementioned convergence result depends on the practical performance of the algorithm and is not considered a formal rate.

The CPP policy iteration
We now detail the structure of our proposed algorithm based on Theorem 4. Specifically, value update, policy update, and stationary distribution estimation are introduced, followed by a discussion of a subtlety in practice and two possible solutions. Following Scherrer et al. (2015), CPP can be written in the following succinct policy iteration style: [...] (13) where $C_{\gamma} := \frac{(1 - \gamma)^{3}}{2\gamma}$ is the horizon constant. Note that for numerical stability we stick to using $(4C_K)^{-1}$ as the entropy-bounding constant rather than $\frac{1}{1 - e^{-2C_K}}$. The first step corresponds to the greedy step of policy iteration, the second to the policy estimation step, the third to computing the interpolation coefficient, and the last to interpolating the policy. CPP can obtain the globally optimal policy rather than just achieving monotonic improvement (which might still converge to a local optimum), since CPP is an extension and a special case of CPI and SPI: by the argument of Scherrer and Geist (2014), both CPI and SPI can attain global optimality, and CPP inherits this merit. However, the convergence proof with linear/nonlinear function approximation as well as finite-time loss bounds remain open questions.

Policy improvement and policy evaluation
The first two steps are standard update and estimation steps of policy iteration algorithms (Sutton & Barto, 2018). The subscript of $Q^{\pi_K}_{\bar{\pi}}$ indicates it is entropy-regularized as introduced in Eq. (3). The policy improvement step consists of evaluating $\mathcal{G}Q$: [...] (14) where $\alpha$, $\beta$ were defined in Eq. (4).
The policy evaluation step estimates the value of the current policy $\pi_{K+1}$ by repeatedly applying the Bellman operator $T_{\pi_{K+1}}$:
$$Q_{K+1} = (T_{\pi_{K+1}})^{m} Q_{K}.$$
Note that $m = 1, \infty$ correspond to value iteration and policy iteration, respectively (Bertsekas & Tsitsiklis, 1996). Other integer-valued $m \in [2, \infty)$ correspond to approximate modified policy iteration (Scherrer et al., 2015). Now, in order to estimate $A^{K+1}_{K, d^{K}}$ in Theorem 4, both $A^{K+1}_{K}$ and $d^{K}$ need to be estimated from samples. Estimating $A^{K+1}_{K}(s)$ is straightforward by its definition in Eq. (10). We can first compute $Q_{K}(s, a) - V_{K}(s), \forall s, a$ for the current policy, and then update the policy to obtain $\pi_{K+1}(a|s)$. On the other hand, sampling with respect to $d^{K}$ results in an on-policy algorithm, which is expensive. We provide both on- and off-policy implementations of CPP in the following sections, but in principle off-policy learning algorithms can be applied to estimate $d^{K}$ by exploiting techniques such as the importance sampling (IS) ratio (Precup, 2000).
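The role of $m$ in the evaluation step can be seen in a small tabular sketch. As a simplifying assumption, the code below uses the plain (unregularized) Bellman operator of a fixed policy rather than the entropy-regularized operator of this paper; the toy MDP and all names are hypothetical.

```python
import numpy as np

def bellman_backup(q, pi, T, r, gamma):
    """One application of T_pi: (T_pi Q)(s, a) = r(s, a) + gamma * E[V(s')]."""
    v = np.sum(pi * q, axis=1)   # V(s) = sum_a pi(a|s) Q(s, a)
    return r + gamma * T @ v

def m_step_evaluation(q, pi, T, r, gamma, m):
    """Apply the Bellman operator of pi to q exactly m times."""
    for _ in range(m):
        q = bellman_backup(q, pi, T, r, gamma)
    return q

rng = np.random.default_rng(2)
n_states, n_actions, gamma = 4, 3, 0.9
T = rng.random((n_states, n_actions, n_states))
T /= T.sum(axis=2, keepdims=True)
r = rng.uniform(-1.0, 1.0, size=(n_states, n_actions))
pi = np.full((n_states, n_actions), 1.0 / n_actions)

# Exact Q_pi from the linear Bellman equations, used only as a reference.
P = np.einsum("sa,sap->sp", pi, T)
v_exact = np.linalg.solve(np.eye(n_states) - gamma * P, np.sum(pi * r, axis=1))
q_exact = r + gamma * T @ v_exact

q0 = np.zeros((n_states, n_actions))
err = {m: np.abs(m_step_evaluation(q0, pi, T, r, gamma, m) - q_exact).max()
       for m in (1, 10, 300)}
```

Each backup contracts the error by at least a factor of $\gamma$ in the maximum norm, so $m = 1$ performs a single partial evaluation while a large $m$ approaches full policy evaluation.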

Leveraging policy interpolation
Computing $\zeta$ in Eq. (8) involves the horizon constant $C_{\gamma} := \frac{(1 - \gamma)^{3}}{2\gamma}$ and the policy difference bound constant $C_K$. The horizon constant is effective in DP scenarios where the total number of timesteps is typically small, but might not be suitable for learning with deep networks, which feature a large number of timesteps: a vanishingly small $C_{\gamma}$ will significantly hinder learning, hence it should be removed in deep RL implementations. We detail this consideration in Sect. 4.5.
The updated policy $\pi_{K+1}$ in Eq. (13) cannot be directly deployed since it has not been verified to improve upon $\pi_K$. We interpolate between $\pi_{K+1}$ and $\pi_K$ with a coefficient $\zeta$ to obtain the resultant policy $\tilde{\pi}_{K+1}$, where $\zeta$ is found as the maximizer of a negative quadratic function in $\zeta$. The maximizer $\zeta^{*}$ optimizes the lower bound. Here, $\zeta$ is optimally tuned and changes dynamically in every update. It reflects the cautiousness against policy oscillation, i.e., how much we trust the updated policy $\pi_{K+1}$. Generally, at the early stage of learning, $\zeta$ should be small in order to explore conservatively.
However, a major concern is that Lemma 3 holds only for Boltzmann policies, while the interpolated policies are generally no longer Boltzmann. In practice, we have two options for handling this problem:
1. We use the interpolated policy only for collecting samples (i.e., as the behavior policy) but not for computing the next policy;
2. We perform an additional projection step to project the interpolated policy back to the Boltzmann class as the next policy.
The first solution might be suitable for relatively simple problems where safe exploration is required: the behavior policy is conservative in exploring when $\zeta \approx 0$, but learning can still proceed even with such a small $\zeta$. Hence this scheme suits problems where interaction with the environment is crucial but progress is desired. On the other hand, the second scheme is more natural for high-dimensional problems, since the off-policyness caused by the mismatch between the behavior and learning policy might be compounded by high dimensionality, and the increased mismatch might perturb performance. In the following section, we introduce CPP using linear function approximation for the first scheme and deep CPP for the second scheme.
For the second scheme, manipulating the interpolated policy is inconvenient since we will have to remember all previous weights and, more importantly, the theoretical properties of Boltzmann policies no longer hold. To solve this issue, heuristically an information projection step is performed for every interpolated policy to obtain a Boltzmann policy. In practice, this policy is found by solving $\min_{\theta} D_{KL}\big(\pi_{\theta} \,\|\, \zeta \pi_{K+1} + (1 - \zeta) \pi_K\big)$. Though the information projection step can only approximately guarantee that the CVI bound continues to apply, since the replay buffer capacity is finite, it has been commonly used in practice (Haarnoja et al., 2018; Vieillard et al., 2020b). In our implementation of deep CPP, the projection problem is solved efficiently using autodifferentiation (Line 7 of Algorithm 2).
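For intuition, the projection step can be sketched for a single state with a tabular softmax policy and hand-written gradients. This is a stand-in for the network-based, autodifferentiated projection used in deep CPP: the policies, the step size, and the number of steps are all hypothetical choices. The gradient of $D_{KL}(\pi_{\theta} \| q)$ with respect to the logits is $\pi_{\theta}(a)\big(\log\frac{\pi_{\theta}(a)}{q(a)} - D_{KL}(\pi_{\theta} \| q)\big)$.

```python
import numpy as np

def softmax(theta):
    z = theta - theta.max()
    p = np.exp(z)
    return p / p.sum()

def kl(p, q):
    """KL(p || q) for strictly positive distributions."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

def project_to_boltzmann(target, steps=500, lr=1.0):
    """Gradient descent on the logits theta of KL(softmax(theta) || target)."""
    theta = np.zeros_like(target)
    for _ in range(steps):
        p = softmax(theta)
        log_ratio = np.log(p) - np.log(target)
        grad = p * (log_ratio - np.sum(p * log_ratio))   # dKL / dtheta
        theta -= lr * grad
    return softmax(theta)

# Hypothetical single-state policies over 4 actions.
pi_old = np.full(4, 0.25)
pi_new = np.array([0.7, 0.1, 0.1, 0.1])
zeta = 0.3
mixture = zeta * pi_new + (1.0 - zeta) * pi_old   # interpolated policy
projected = project_to_boltzmann(mixture)
```

In this tabular case the softmax class can represent the mixture exactly, so the projection recovers it; with a function approximator shared across states the projection is only approximate, which is the situation discussed in the text.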

Approximate interpolation coefficient
The lower bound of policy improvement depends on $A^{\pi_{K+1}}_{\pi_K, d^{\pi_K}}$. Though it is generally difficult to compute exactly, Vieillard et al. (2020b) recently proposed to estimate it using batch samples. We hence define several quantities following Vieillard et al. (2020b): let $B_t$ denote a batch randomly sampled from the replay buffer $B$, and define $\hat{A}_K(s) := \max_a Q(s,a) - V(s)$ as an estimate of $A^{\tilde{\pi}}_{\pi}(s)$, $\hat{\mathbb{A}}_K := \mathbb{E}_{s\sim B}[\hat{A}_K(s)]$ as an estimate of $A^{\tilde{\pi}}_{\pi, d^{\pi}}$, and $\hat{A}_{K,\min} := \min_{s\sim B}\hat{A}_K(s)$ as the minimum over the batch. The reward term $\frac{r_{\max}}{1-\gamma}$ in CPI can be approximated by the batch maximum $\hat{Q}_K := \max_{(s,a,\cdots)\in B} Q_K(s,a)$, which itself approximates the maximum norm $\|Q_K\|_\infty$. When we use linear function approximation with the on-policy buffer $B_K$, we simply replace the minibatch $B$ in the above notation with the on-policy buffer $B_K$.
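The batch quantities defined above can be computed along the following lines. This is a hedged sketch: the function name `batch_advantage_estimates` is hypothetical, and the state value is assumed to be $V(s) = \sum_a \pi(a|s) Q(s,a)$, which the text does not spell out.

```python
import numpy as np

def batch_advantage_estimates(q_batch, pi_batch):
    """Batch estimates from Q-values and policy probabilities, both of shape (batch, actions)."""
    # Assumption: V(s) is the policy-weighted average of Q(s, .).
    v = np.sum(pi_batch * q_batch, axis=1)
    # Per-state advantage estimate: max_a Q(s, a) - V(s).
    a_hat = q_batch.max(axis=1) - v
    return {
        "a_mean": float(a_hat.mean()),   # estimate of the expected advantage
        "a_min": float(a_hat.min()),     # batch minimum of the advantage estimate
        "q_max": float(q_batch.max()),   # batch maximum of Q, used in place of r_max / (1 - gamma)
    }

q = np.array([[1.0, 2.0], [0.0, 4.0]])
pi = np.full((2, 2), 0.5)
print(batch_advantage_estimates(q, pi))
```

With an on-policy buffer, the same function would simply be applied to the whole buffer instead of a minibatch.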
Given the notation defined above, we can compare the existing interpolation coefficients in Table 1, with explanations as follows.
CPI: the classic CPI algorithm proposes the coefficient in Eq. (15), where $r_{\max}$ is the largest possible reward. When the largest reward is not known, an approximation based on batches or the buffer has to be employed.
Exact SPI: SPI extends CPI by using the coefficient in Eq. (16), where $\delta$ and $\Delta A^{\pi_{K+1}}_{\pi_K}$ were specified in Lemma 2. When $\delta$ and $\Delta A^{\pi_{K+1}}_{\pi_K}$ cannot be computed exactly, a sample-based approximation has to be employed.
Approximate SPI: as suggested by Pirotta et al. (2013b, Remark 1), an approximate $\zeta$ can be derived if we naïvely bound $\Delta A^{\pi_{K+1}}_{\pi_K}$, which gives the coefficient in Eq. (17).
Linear CPP: if policies are entropy-regularized as indicated in Eq. (3), we can upper bound $\Delta A^{\pi_{K+1}}_{\pi_K}$ by using Lemma 3, which gives the coefficient in Eq. (18). By the definition of $C_K$ in Eq. (7), the CPP coefficient can take on a wider range of values than that of A-SPI.
Deep CPP: we follow the DCPI coefficient design to make CPP suitable for deep RL. Specifically, we modify $\zeta_{DCPP}$ by defining $\zeta_0 = \frac{1}{C_K}$, which gives Eq. (20), where $m_K, M_K$ are the same as in Eq. (19).
Based on Eqs. (18) and (20), we detail the linear and deep implementations of CPP in the next section.

Approximate CPP
Algorithm 1 Linear Cautious Policy Programming
Require: α, β, γ CPP parameters, I the total number of iterations, T the number of steps for each iteration
1: Initialize θ, π0 at random, empty on-policy buffer B_K = {}
2: while K ≤ I do
3: Interact with the environment using policy π
5: Collect a transition tuple (s, a, r, s') into buffer B
6: Interact using policy πK−1
7: Compute basis matrix Φ_K using B_K
update θ by Eq. (21)
10: Compute ζ0 = 1/C_K and ζ = ζ0 m_K/M_K using Eq. (19)
11: Empty on-policy buffer B_K
12: end if
13: end while
14: end while
We introduce the linear implementation of CPP following Lagoudakis & Parr (2003) and Azar et al. (2012), and deep CPP inspired by Vieillard et al. (2020b), in Algs. 1 and 2, respectively. It is worth noting that in linear CPP we assume the interpolated policy $\tilde{\pi}$ is used only for collecting samples (line 5 of Alg. 1); hence no projection is necessary, as it does not interfere with computing the next policy.
Linear CPP. We adopt linear function approximation (LFA) to approximate the Q-function by $Q(s,a) = \varphi(s,a)^T\theta$, where $\varphi(x) = [\varphi_1(x), \dots, \varphi_M(x)]^T$, $x = [s,a]^T$, $\varphi(x)$ is the basis function, and $\theta$ is the weight vector. One typical choice of basis function is the radial basis function with center $c_i$ and width $\sigma$. We construct the basis matrix $\Phi = [\varphi_1(x_1), \dots, \varphi_M(x_N)] \in \mathbb{R}^{T\times M}$, where $T$ is the number of timesteps. Specifically, at the $K$-th iteration, we maintain an on-policy buffer $B_K$. For every timestep $t \in [1, T]$, we collect $(s^K_t, a^K_t, r^K_t, s^K_{t+1})$ into the buffer and compute the basis matrix at the end of every iteration. To obtain the best-fit $\theta_{K+1}$ for the $(K+1)$-th iteration, we solve the regularized least-squares problem in Eq. (21), in which a small constant prevents singular matrix inversion and $T_{\pi_{K+1}} Q_K$ is the empirical Bellman operator. Since the buffer is on-policy, we empty it at the end of every iteration (line 10).
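The feature construction and regularized least-squares fit described above can be sketched as follows. Several details are assumptions made for illustration only: the Gaussian form of the radial basis function, the name and value of the regularization constant `ridge`, the randomly placed centers, and the synthetic targets standing in for the empirical Bellman targets.

```python
import numpy as np

def rbf_features(x, centers, width):
    """Assumed Gaussian RBF: phi_i(x) = exp(-||x - c_i||^2 / (2 * width^2))."""
    sq_dist = np.sum((x[:, None, :] - centers[None, :, :]) ** 2, axis=2)
    return np.exp(-sq_dist / (2.0 * width ** 2))

def fit_weights(phi, targets, ridge=1e-6):
    """Solve (Phi^T Phi + ridge * I) theta = Phi^T y; ridge avoids a singular matrix."""
    m = phi.shape[1]
    return np.linalg.solve(phi.T @ phi + ridge * np.eye(m), phi.T @ targets)

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(200, 3))        # T = 200 samples of x = [s, a]
centers = rng.uniform(-1.0, 1.0, size=(10, 3))   # M = 10 hypothetical centers
phi = rbf_features(x, centers, width=0.5)        # basis matrix of shape (T, M)

true_theta = rng.normal(size=10)
targets = phi @ true_theta                       # stand-in for the empirical Bellman targets
theta = fit_weights(phi, targets)
print(np.max(np.abs(phi @ theta - targets)))
```

Solving the linear system with `np.linalg.solve` rather than forming an explicit inverse is a common numerical choice; the paper's own solver is not specified in this section.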

Algorithm 2 Deep Cautious Policy Programming
Require: α, β, γ CPP parameters, T the total number of steps, F the interaction period, C the update period
1: Initialize θ at random, set θ⁻ = θ, K = 0 and buffer B to be empty
2: while K ≤ T do
3: Interact with the environment using policy π
4: Collect a transition tuple (s, a, r, s') into buffer B
5: Sample a minibatch B_t from B and compute the losses L_value and L_policy using Eqs. (22), (23)
Do one step of gradient descent on the loss L_train = L_value + L_policy
Compute the advantage estimates and moving averages m_K, M_K using Eq. (19)
Compute ζ0 = 1/C_K and ζ_CPP = ζ0 m_K/M_K using Eq. (19)
12: end if
13: end while
Deep CPP. Though CPP is an on-policy algorithm, by following Vieillard et al. (2020b) off-policy data can also be leveraged, with the hope that random sampling from the replay buffer covers areas likely to be visited by the policy in the long term. Off-policy learning greatly expands CPP's coverage, since on-policy algorithms require a prohibitively large number of samples to converge, while off-policy algorithms are more competitive in terms of sample complexity in deep RL scenarios.
We implement CPP based on the DQN architecture, where the Q-function is parameterized as $Q_\theta$, with $\theta$ denoting the weights of an online network, as can be seen from Line 2. Line 3 begins the learning loop. For every step we interact with the environment using the policy $\pi_\epsilon$, where $\epsilon$ denotes the epsilon-greedy policy threshold. As a result, a tuple of experience is collected into the buffer.
Line 6 of Alg. 2 begins the update loop. We sample a minibatch from the buffer and compute the losses $L_{value}$ and $L_{policy}$ defined in Eqs. (22) and (23), respectively. Since our implementation is based on DQN, we do not include an additional policy network as done in Vieillard et al. (2020b). Instead, we denote the policy as $\pi_\theta$ to indicate that the policy is a function of $Q_\theta$, as shown in Eq. (24). The base policy is hence denoted by $\pi_{\theta^-}$ to indicate that it is computed by the target network with weights $\theta^-$. The value loss for $\theta$ in Eq. (22) is defined with respect to a regression target. It should be noted that the interpolated policy cannot be directly used as it is generally no longer Boltzmann. To tackle this problem, we further incorporate the minimization problem in Eq. (23) to project the interpolated policy back to the Boltzmann policy class, where $\mathcal{G}Q$ takes the maximizer of the action value function. The policy and $\mathcal{G}Q$ can be expressed with the subscript $\theta$ because the policy is a function of the action value function, which has a closed-form solution (see Kozuno et al. (2019) for details) and which by simple induction can be written completely in terms of $Q$ (Vieillard et al., 2020a). Line 8 performs one step of gradient descent on the compound loss and line 9 computes the approximate expected advantage function for computing $\zeta$.
There is one subtlety: the definition of $K$ is unclear in the deep RL context, since there is no clear notion of iteration. If we naïvely define $K$ as the number of steps or the number of updates, then $C_K$ in Eq. (20) could quickly converge to 0 or explode, causing CPP to lose the ability to control the update. Hence in our implementation, we increment $K$ by one every time we update the target network (every $C$ steps), which results in a suitable magnitude of $K$.

Experimental results
The proposed CPP algorithm can be applied to a variety of entropy-regularized algorithms. In this section, we utilize conservative value iteration (CVI) (Kozuno et al., 2019) as the base algorithm for our experiments. In our implementation, for the $(K+1)$-th update, the baseline policy in Eq. (3) is $\pi_K$.
For didactic purposes, we first examine all algorithms (specified below) in a safety gridworld and the classic control problem of pendulum swing-up. The tabular gridworld allows for exact computation to inspect the effect of the algorithms. On the other hand, pendulum swing-up leverages the linear function approximation detailed in Alg. 1. We then apply the algorithms to a set of Atari games to demonstrate the effectiveness of our proposed method. It is worth noting that even state-of-the-art monotonically improving methods failed in complicated Atari games (Papini et al., 2020). The gridworld, pendulum swing-up, and Atari games exhibit increasing complexity and allow for comparison of how the algorithms trade off stability and scalability.
For the gridworld and pendulum experiments, we compare Linear CPP using the coefficient in Eq. (18) against safe policy iteration (SPI) (Pirotta et al., 2013b), which is the closest to our work. We employ the Exact-SPI (E-SPI) coefficient in Eq. (16) on the gridworld, since in small state spaces the quantities $\delta$ and $\Delta A^{\pi_{K+1}}_{\pi_K}$ can be accurately estimated. As a result, SPI performance should upper bound that of CPP, since CPP was derived by further loosening the SPI bound. For problems with larger state-action spaces, SPI performance may become poor as a result of insufficient samples for estimating those quantities; hence the Approximate-SPI (A-SPI) coefficient in Eq. (17) should be used. However, leveraging the A-SPI coefficient often results in vanishingly small values.
For Atari games, we compare Deep CPP leveraging Eq. (20) against on- and off-policy state-of-the-art algorithms; see Sect. 5.3 for a detailed list. Specifically, we implement deep CPP using off-policy data to show that it is capable of leveraging off-policy samples, hence greatly expanding its coverage, since on-policy algorithms typically have expensive sample requirements.

Experimental setup
The agent in the 5 × 5 grid world starts from a fixed position at the upper left corner and can move to any of its neighboring states with success probability $p$, or to a random different direction with probability $1 - p$. Its objective is to travel to a fixed destination located at the lower right corner, and it receives a +1 reward upon arrival. Stepping into two danger grids located at the center of the gridworld incurs a cost of −1. Every step costs −0.1. We maintain tables for value functions to inspect the case when there is no approximation error. Parameters are tuned to yield empirically best performance. For testing the sample efficiency, every iteration terminates after 20 steps or upon reaching the goal, and only 30 iterations are allowed for training. For statistical significance, the results are averaged over 100 independent trials.
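A gridworld of this kind can be sketched as below. This is only an illustration of the setup: the class name `SafetyGridworld`, the positions of the two danger cells, the wall-blocking behavior at the borders, and the choice to add the goal and danger rewards on top of the step cost are all assumptions, since the text does not specify them.

```python
import numpy as np

class SafetyGridworld:
    """5x5 grid: start at the top-left corner, goal at the bottom-right corner."""
    MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

    def __init__(self, p=0.9, danger=((2, 1), (2, 2)), seed=0):
        self.size, self.p = 5, p
        self.danger = set(danger)                # assumed positions near the center
        self.goal = (4, 4)
        self.rng = np.random.default_rng(seed)
        self.pos = (0, 0)

    def reset(self):
        self.pos = (0, 0)
        return self.pos

    def step(self, action):
        # With probability 1 - p the agent slips to a random different direction.
        if self.rng.random() >= self.p:
            action = self.rng.choice([a for a in range(4) if a != action])
        dr, dc = self.MOVES[action]
        # Assumption: moving into a wall leaves the coordinate unchanged.
        row = min(max(self.pos[0] + dr, 0), self.size - 1)
        col = min(max(self.pos[1] + dc, 0), self.size - 1)
        self.pos = (row, col)
        reward = -0.1                            # per-step cost
        if self.pos in self.danger:
            reward += -1.0                       # danger cost, added to the step cost
        done = self.pos == self.goal
        if done:
            reward += 1.0                        # goal reward, added to the step cost
        return self.pos, reward, done

env = SafetyGridworld(p=1.0)
env.reset()
print(env.step(3))   # deterministic move to the right when p = 1
```

An episode in the experiments would additionally be truncated after 20 steps, which can be handled by the training loop rather than by the environment.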

Results
Figure 1a shows the performance of SPI, CPP, and CVI, respectively. Recall that SPI used the exact coefficient Eq. (16). The black, blue, and red lines indicate their respective cumulative reward (y-axis) along the number of iterations (x-axis). The shaded area shows ±1 standard deviation. CVI learned policies that visited danger regions more often, resulting in delayed convergence compared to CPP. Figure 1b compares the average policy oscillation defined in Eq. (25). The slightly worse oscillation value of CPP compared with SPI using the E-SPI coefficient is expected, as CPP exploits a lower bound that is looser than that of SPI. However, as will be shown in the following examples where both linear and nonlinear function approximation are adopted, SPI failed to learn meaningful behaviors due to its inability to accurately estimate the complicated lower bound.
Fig. 1 Comparison between SPI, CPP, and CVI on the safety grid world. The black line shows the mean SPI cumulative reward, the blue line CPP, and the red line CVI in Fig. 1a, with the shaded area indicating ±1 standard deviation. Figure 1b compares the respective policy oscillation value defined in Eq. (25)

Pendulum swing up
Since the state space is continuous in the pendulum swing-up task, E-SPI can no longer be expected to accurately estimate $\Delta A^{\pi_{K+1}}_{\pi_K}$, so we additionally employ A-SPI in Eq. (17) and compare both E-SPI and A-SPI against Linear CPP with Eq. (18).

Experimental setup
A pendulum of length 1.5 meters has a ball of mass 1 kg at its end and starts from the fixed initial state $[0, -\pi]$. The pendulum attempts to reach the goal $[0, \pi]$ and stay there for as long as possible. The state space is two-dimensional, $s = [\theta, \dot{\theta}]$, where $\theta$ denotes the vertical angle and $\dot{\theta}$ the angular velocity. The action is a one-dimensional torque in $[-2, 0, 2]$ applied to the pendulum. The reward is the negative sum of two quadratic functions, in the angle and the angular velocity, respectively, where $\frac{1}{z}$ normalizes the rewards and a large $b$ penalizes high angular velocity. We set $z = 10$, $a = 1$, $b = 0.01$.
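One reward of the shape described above is sketched below. The exact expression is an assumption: we take $r = -\frac{1}{z}(a\,\theta^2 + b\,\dot{\theta}^2)$, with the angle measured as the deviation from the goal, which matches the verbal description but is not confirmed by the text. The function name `pendulum_reward` is hypothetical.

```python
def pendulum_reward(angle, angular_velocity, z=10.0, a=1.0, b=0.01):
    """Assumed form: negative normalized sum of quadratics in angle and angular velocity."""
    # angle is taken to be the deviation from the goal (an assumption).
    return -(a * angle ** 2 + b * angular_velocity ** 2) / z

print(pendulum_reward(0.0, 0.0))    # no penalty at the goal with zero velocity
print(pendulum_reward(1.0, 10.0))   # both terms contribute
```

With the stated values $z = 10$, $a = 1$, $b = 0.01$, the angle term dominates unless the angular velocity is roughly ten times larger than the angle deviation.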
To demonstrate that the proposed algorithm can ensure monotonic improvement even with a small number of samples, we allow 80 iterations of learning; each iteration comprises 500 steps. For statistical evidence, all figures show results averaged over 100 independent experiments.

Results
We compare CPP with CVI and both E-SPI and A-SPI in Fig. 2. In this simple setup, all algorithms showed a similar trend, but CPP managed to converge to the optimal solution in all seeds, as can be seen from the variance plot. On the other hand, both SPI versions exhibited lower mean scores and large variance, which indicates that for many seeds they failed to learn the optimal policy. In Fig. 2a, both E-SPI and CVI exhibited wild oscillations, resulting in large average oscillation values, where the oscillation criterion is defined in Eq. (25) and $R_{K+1}$ refers to the cumulative reward at the $(K+1)$-th iteration. It is worth noting that the difference $R_{K+1} - R_K$ is obtained by $\tilde{\pi}_{K+1}, \tilde{\pi}_K$, which is the lower bound of that by $\tilde{\pi}_{K+1}, \pi_K$. Intuitively, $\|OJ\|_\infty$ and $\|OJ\|_2$ measure the maximum and average oscillation in cumulative reward. The stars between CPP and CVI represent statistical significance at level $p = 0.05$.
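A possible way to compute oscillation measures of this kind from a learning curve is sketched below. The definitions are assumptions: only decreases in cumulative reward are counted, the maximum measure is the largest decrease, and the average measure is the root mean square of the decreases. Eq. (25) may differ in normalization. The function name `oscillation_metrics` is hypothetical.

```python
import numpy as np

def oscillation_metrics(returns):
    """Return (max drop, RMS drop) of a sequence of per-iteration cumulative rewards."""
    returns = np.asarray(returns, dtype=float)
    # Keep only performance decreases R_K - R_{K+1} > 0; improvements count as zero.
    drops = np.maximum(0.0, returns[:-1] - returns[1:])
    return float(drops.max()), float(np.sqrt(np.mean(drops ** 2)))

print(oscillation_metrics([1.0, 3.0, 2.0, 5.0, 1.0]))
print(oscillation_metrics([1.0, 2.0, 3.0]))   # monotonic improvement gives zero oscillation
```

Under these assumed definitions, a monotonically improving learning curve has both measures equal to zero, which is the behavior a cautious update scheme aims for.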
The reason for SPI's drastic behavior can be observed in Fig. 2c (truncated to 30 iterations for better view): in E-SPI, insufficient samples led to a very large $\zeta$. The aggressive choice of $\zeta$ led to a large oscillation value. On the other hand, A-SPI went to the other extreme of producing vanishingly small $\zeta$ due to the loose bound chosen for ensuring improvement, as can be seen from the almost horizontal lines in the same figure; A-SPI had average values $\Delta J^{\tilde{\pi}_{K+1}}_{\pi_K, d^{\tilde{\pi}_{K+1}}} = 2.39 \times 10^{-9}$ and $\zeta = 1.69 \times 10^{-6}$. CPP converged with much lower oscillation thanks to the smooth growth of the $\zeta$ values; CPP was cautious in the beginning ($\zeta \approx 0$) and gradually became confident in the updates when it was close to the optimal policy ($\zeta \approx 1$).
However, it might happen that the $\zeta$ values are large but the probability changes are actually small, and vice versa. To certify that CPP did not produce such a pathological mixture policy and indeed learned cautiously, we plot in Fig. 3 the interpolated policies of CPP and E-SPI in terms of the action probability of the pendulum swinging right, $\tilde{\pi}(a_{right}|s_t)$. The probability change is plotted on the z-axis, and the timesteps $t = 1, \dots, 500$ of all iterations are drawn on the x and y axes.

Experimental setup
We applied the algorithms to a set of challenging Atari games: MsPacman, SpaceInvaders, BeamRider, Assault and Seaquest (Bellemare et al., 2013), using the adaptive $\zeta$ introduced in Eq. (20). We compare deep CPP with both on- and off-policy algorithms to demonstrate that CPP is capable of achieving a superior balance between learning speed and oscillation values.
For on-policy algorithms, we include the celebrated proximal policy optimization (PPO) (Schulman et al., 2017), a representative trust-region method. We also compare with Advantage Actor-Critic (A2C) (Mnih et al., 2016), which is a standard on-policy actor-critic algorithm: our intention is to confirm that the expensive sample requirement of on-policy algorithms typically renders them underperformant when the number of timesteps is not sufficiently large.
For the off-policy algorithms, we include several state-of-the-art DQN variants. Munchausen DQN (MDQN) (Vieillard et al., 2020a) features the implicit KL regularization brought by the Munchausen log-policy term; it was shown that MDQN was the only non-distributional RL method outperforming distributional ones. We also include another state-of-the-art variant, Momentum DQN (MoDQN) (Vieillard et al., 2020b), which avoids estimating the intractable base policy in KL-regularized RL by constructing momentum. MoDQN has been shown to obtain superior performance on a wide range of Atari games. Finally, as an ablation study, we are interested in the case $\zeta = 1$, which translates to conservative value iteration (CVI) (Kozuno et al., 2019) based on the framework Eq. (3). To the best of our knowledge, CVI has not yet seen a deep RL implementation. Hence a performant deep CVI implementation is of independent interest.
All algorithms are implemented using the library Stable Baselines 3 (Raffin et al., 2021) and tuned using the library Optuna (Akiba et al., 2019). Further, for fair comparison, the on- and off-policy algorithms share the same network architecture within their group (i.e., MDQN and CPP share one architecture, and PPO and A2C share another). The experiments are evaluated over 3 random seeds. Details are provided in Appendix A.2. We expect that on simple tasks PPO and A2C might be stable due to their on-policy nature, but too slow to learn meaningful behaviors. However, PPO is known to take drastic updates and to rely heavily on code-level optimization to correct them (Engstrom et al., 2019). On the other hand, for complicated tasks, overly drastic policy updates might be corrupted by noise and errors, leading to divergent learning. By contrast, CPP should balance learning speed and oscillation value, leading to gradual but smooth improvement.

Results
Final scores. As is visible from Fig. 4, Deep CPP achieved either the first or second place in terms of final scores on all environments, with the only competitive algorithms being MDQN, which is the state-of-the-art DQN variant, and occasionally CVI, which is the case of $\zeta = 1$. However, MDQN suffered from numerical instability on the environment Seaquest, as can be seen from the flat line at the end of learning.
CVI performed well on the simple environment MsPacman, which can be interpreted as follows: learning on simple environments is not likely to oscillate, hence the policy regularization imposed by $\zeta$ is not really necessary, and setting $\zeta = 1$ is the best approach for obtaining high return. However, in general it is better to have an adjustable update: on the environment BeamRider the benefit of adjusting the degree of updates was significant, as the CPP learning curve rose quickly at the beginning of learning, showing a significantly large gap with all other algorithms. Further, while CVI occasionally performed well, it also suffered from numerical instability: on the environment Assault, CVI and MoDQN achieved final scores of around 1000 but ran into numerical issues, as visible from the end of learning. This problem has been pointed out in Vieillard et al. (2020b).
On the other hand, on all environments the on-policy algorithms A2C and PPO failed to learn meaningful behaviors. On some environments such as Assault, A2C showed divergent learning behavior at around $4 \times 10^6$ steps and PPO did not learn meaningful behavior until the end. This observation suggests that the sample complexity of on-policy algorithms is high and generally not favorable compared to off-policy algorithms.
Oscillation. The averaged oscillation values of all algorithms are listed in Table 2. While MDQN showed competitive performance against CPP, it exhibited wild oscillation on the difficult environment Seaquest (Fortunato et al., 2018) and finally ran into numerical issues, as indicated by the flat line near the end. Its oscillation value reached around 2100. Since MDQN is the state-of-the-art regularized value iteration algorithm featuring implicit regularization, this result illustrates that on difficult environments, reward regularization alone might not be sufficient to maintain stable learning. On the other hand, CPP achieved a balance between stable learning and small oscillation, with an oscillation value of around 600, attaining a final score slightly lower than MDQN and higher than MoDQN and CVI.
The oscillation values and final scores should be considered together when evaluating how the algorithms perform. CVI and MoDQN sometimes showed similar performance to CPP, but in general their final scores are lower than those of CPP, with higher oscillation values. On the other hand, MDQN showed competitive final scores, but it sometimes exhibited wild oscillation and ran into numerical issues, implying that on environments where low oscillation is desired, CPP might be preferable to MDQN. On-policy algorithms showed low oscillation values, but their final scores are unacceptably low.

Ablation study
We are interested in comparing the performance of DCPI with CPP to see the role played by $\zeta_{DCPP}$. It is also enlightening to inspect the result of fixing $\zeta$ as a constant value. In this subsection, we perform an ablation study by comparing the following four designs:
• DCPI with fixed $\zeta = 0.01$: this is to inspect the result of a constantly low interpolation coefficient. It is expected that such a small value would hinder learning significantly.
• DCPI with fixed $\zeta = 0.5$: this is to examine the performance of equally weighting all policies. It is expected that such interpolation could not produce effective learning policies.
• CPI: this uses the coefficient from Eq. (15). Since the maximum reward has to be approximated or set to a large upper bound, it is expected that this coefficient would be similar to the case $\zeta = 0.01$.
• SPI: this uses the DCPI architecture, but we compute the A-SPI coefficient by using Eq. (17). This coefficient scheme is known to produce vanishingly small values, and is hence expected to hinder learning.
We examine those four designs on the challenging environment Seaquest. Other experimental settings are the same as in Sect. 5.3.
As can be seen from Fig. 5, all designs showed a similar trend of converging to some sub-optimal policy. The final scores were around 50, which was significantly lower than that of CPP in Fig. 4. This result is not surprising, since for $\zeta = 0.01$ almost no update was performed. For $\zeta = 0.5$, the algorithm weights the contributions of all policies equally without caring about their quality. On this environment, the A-SPI coefficient is vanishingly small, similar to that shown in Fig. 2c. Lastly, for CPI the number of learning steps is not sufficient for learning meaningful behavior.

Discussion
Leveraging the entropy-regularized formulation for monotonic improvement has recently been analyzed in the policy gradient literature for tabular MDPs (Agarwal et al., 2020; Cen et al., 2022; Mei et al., 2020). In the tabular MDP setting with exact computation, monotonic improvement and fast convergence can be proved. However, realistic applications are beyond the scope of their analysis and no scalable implementation has been provided. On the other hand, value-based methods have readily applicable error propagation analysis (Munos, 2005; Scherrer et al., 2015; Lazaric et al., 2016) for the function approximation setting, but they seldom focus on monotonic improvement guarantees such as $J_{K+1} - J_K \ge 0$. In this paper, we started from the value-based perspective to derive a monotonic improvement formulation and provide a scalable implementation suitable for learning with deep networks. However, one pending challenge that was not tackled in this paper is to exploit the theoretical tools from the aforementioned literature to show, e.g., finite-time loss bounds of CPP.
We verified that CPP empirically exploited monotonic improvement in low-dimensional problems and achieved a superior tradeoff between learning speed and stabilized learning in high-dimensional Atari games. This tradeoff is best seen from the value of $\zeta$: in the beginning of learning the agent prefers to be cautious, resulting in small $\zeta$ values as can be seen from Fig. 2c. In relatively simple scenarios where exact computation or linear function approximation suffices, $\zeta < 1$ might slow down the convergence rate in favor of more stable learning. On the other hand, in challenging problems this cautiousness might in turn accelerate learning in the later stages, as can be seen from the CVI curves in Fig. 4 that correspond to drastically setting $\zeta = 1$: except in the environment MsPacman, in all other environments CVI performed worse than CPP. This might be because learning with deep networks involves heavy approximation errors and noise. Smoothly changing the interpolation coefficient becomes necessary under these errors and noise, which is a core feature of CPP. We found that CPP was especially useful in challenging tasks where both learning progress and cautiousness are required. We believe CPP bridges the gap between theory and practice that has long existed in the monotonic improvement RL literature: previous algorithms have only been tested on simple environments yet failed to deliver guaranteed stability.
Fig. 5 Learning curves of DCPI on Seaquest with four coefficient designs. All designs achieved the final score of 50, while CPP achieved around 3000 in Fig. 4
CPP made a step towards practical monotonically improving RL by leveraging entropy-regularized RL. However, there is still room for improvement. Since the entropy-regularized policies are Boltzmann, the policy interpolation step, which adds two Boltzmann policies, generally does not yield another Boltzmann policy. Hence an information projection step should be performed to project the resultant policy back to the Boltzmann class to retrieve the Boltzmann properties. While this projection step can be made perfect in the ideal case, in practice there is an unavoidable projection error. This error, if not well controlled, could be damaging and significantly degrade the performance. How to remove this error is an interesting future direction.
Another subtlety of CPP lies in the use of Lemma 3. Lemma 3 states that the maximum KL divergence of a sequence of CVI policies is bounded. However, since we performed interpolation on top of CVI policies, it is not clear whether this guarantee continues to hold for the interpolated policy, which renders our use of Lemma 3 heuristic. As demonstrated by the experimental results, we found that this heuristic worked well for the problems studied. We leave the theoretical justification of Lemma 3 on interpolated policies to future work.
We believe the application of CPP, i.e., the combination of policy interpolation and entropy regularization, to other state-of-the-art methods is feasible at least within the value iteration scenario. Indeed, CPP performs two kinds of regularization: one in the stochastic policy space and the other in the reward function. Many algorithms share the reward function regularization idea with CPP, which implies the possibility of adding another layer of regularization on top of it. On the other hand, distributional RL methods may also benefit from the interpolation since they output a distribution of rewards, which renders interpolation straightforward. We leave these directions to future investigation.
We also discuss the relationship between the proposed method and line search. One benefit of CPP (and all CPI-like algorithms) is that the interpolated policies naturally guarantee improvement, while in line search one has to deploy unverified policies to the system, which is unacceptable for safety-critical tasks such as industrial process control problems. We refer to Zhu et al. (2022) for a successful application of the CPP interpolation. However, when safety is not of critical importance, line search and interpolation can be rather complementary. Two possible extensions combining both techniques exist: (1) Line search at a regular interval: we can interleave some line search steps during learning to verify that the weight design achieves non-trivial improvement. (2) Line search plus confidence: another possible extension is to measure the confidence of policy improvement at each iteration. When the confidence is low, we could introduce line search to take over the role of improvement.
Finally, another interesting future direction is to extend CPP to the actor-critic setting that can handle continuous action spaces. Though both CPI-based and entropy-regularized concepts have been respectively applied in actor-critic algorithms, there have not been published results featuring this combination. We expect that the combination could greatly alleviate the policy oscillation phenomenon in complicated continuous-action control domains such as MuJoCo environments.

Conclusion
In this paper we proposed a novel RL algorithm, cautious policy programming, that leverages a novel entropy-regularization-aware lower bound for monotonic policy improvement. The key ingredients of CPP are the seminal policy interpolation and entropy-regularized policies. Based on this combination, we proposed a family of novel RL algorithms that can effectively trade off learning speed and stability, especially inhibiting the policy oscillation problem that arises frequently in RL applications. We demonstrated the effectiveness of CPP against existing state-of-the-art algorithms on simple to challenging environments, in which CPP achieved performance consistent with the theory.

Appendix A Appendix
In the first part of the Appendix, we detail the proofs of the theorems and lemmas that appear in our paper. We provide implementation details in the latter half.

A.1. Proof of Theorem 4
In order to prove Theorem 4, we introduce the following two lemmas. The first concerns monotonic policy improvement and the second provides a tool for connecting it with the entropy-regularization-aware lower bound.

A.1.1 Monotonic Policy Improvement Lemma
In this section we provide the proof of Lemma 2. The proof is borrowed from Pirotta et al. (2013b), but for ease of reading we rephrase it here.
Lemma 2. Provided that policy $\pi'$ is generated by the partial update Eq. (2), $\zeta$ is chosen properly, and $A^{\tilde{\pi}}_{\pi, d^{\pi}} \ge 0$, then the following improvement is guaranteed: Proof The proof follows a derivation similar to that of the classic CPI (Kakade & Langford, 2002), and similar results have appeared many times, e.g., in Pirotta et al. (2013b) and Metelli et al. (2018). We also show that the roles of $\zeta$ and $(1 - \zeta)$ in Eq. (2) can be exchanged by solving a similar problem. To begin, we leverage Theorem 3.5 of Pirotta et al. (2013b), stated as Eq. (A2). Substituting in $\pi' = \zeta\tilde{\pi} + (1 - \zeta)\pi$ yields Eqs. (A3) and (A4). Hence, Eq. (A2) is transformed into Eq. (A5). The right-hand side (r.h.s.) is a quadratic function in $\zeta$ and has its maximum at $\zeta^*$. By substituting $\zeta^*$ back into Eq. (A5), we obtain Eq. (A7). When $\zeta^* > 1$, we clip it using $\min(1, \zeta^*)$.
Note that, if we exchange the roles of $\zeta$ and $(1 - \zeta)$, the coefficients in Eq. (A3) should be $(1 - \zeta)$. Equation (A5) would become a quadratic function in $(1 - \zeta)$; hence the r.h.s. of Eq. (A7) would be the maximum in terms of $(1 - \zeta^*)$. This concludes the proof. ◻
Remark. By noting that $\tilde{\pi}(a|s) - \pi(a|s)$ appears in both $\delta$ and $\Delta A^{\tilde{\pi}}_{\pi}$, we see that the policy improvement $\Delta J^{\pi'}_{\pi, d^{\pi}}$ is governed by the maximum total variation of policies. While one can exploit Lemma 2 for a value-based RL algorithm, it can be seen that it could only apply to problems with small state-action spaces. In general, without further assumptions on $\pi'$, $\tilde{\pi}$, $\pi$, lower-bounding policy improvement is intractable, as the maximization and $\Delta A^{\tilde{\pi}}_{\pi}$ in a large state space require exponentially many samples for accurate estimation.
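The step of maximizing a negative quadratic in $\zeta$ and clipping the maximizer can be illustrated generically. The sketch below assumes a bound of the form $f(\zeta) = \zeta A - \zeta^2 c$ with $c > 0$, where `penalty` stands for whatever constant multiplies $\zeta^2$ in the bound; the function name and arguments are hypothetical and the constants of Eq. (A5) are not reproduced here.

```python
def optimal_coefficient(expected_advantage, penalty):
    """Maximize f(zeta) = zeta * A - zeta**2 * c over zeta in [0, 1], for c > 0."""
    if penalty <= 0:
        raise ValueError("penalty must be positive")
    # Unconstrained maximizer is A / (2c); clip it to the valid mixture range [0, 1].
    zeta = min(1.0, max(0.0, expected_advantage / (2.0 * penalty)))
    bound = zeta * expected_advantage - zeta ** 2 * penalty
    return zeta, bound

print(optimal_coefficient(0.2, 1.0))   # interior maximizer
print(optimal_coefficient(4.0, 1.0))   # maximizer clipped to 1
```

When the maximizer is interior, the resulting bound equals $A^2 / (4c)$, which is why the guaranteed improvement scales with the squared expected advantage.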

A.1.2 Entropy-regularization Lemma
To optimize the lower bound in Lemma 2, it is required to know the quantities that appear in it (Pirotta et al., 2013b), which is intractable for large state spaces without further specification of the considered policy class. By considering the class of entropy-regularized MDPs, Lemma 2 can be significantly simplified, for which the following lemma plays a crucial role.
Lemma 3. For any policies $\pi_K$ and $\pi_{K+1}$ generated by taking the maximizer of Eq. (3), the following bound holds for their maximum total variation, in which $K$ denotes the current iteration index, $0 \le k \le K - 1$ is the loop index, and the error term is the uniform upper bound of error.
Proof. By the Fenchel conjugacy of the Shannon entropy and KL divergence (Boyd & Vandenberghe, 2004), the maximizing policies for the regularized MDP are Boltzmann softmax policies (Geist et al., 2019), as shown in Sect. 4.3.1. The relationship between Boltzmann softmax policies has recently been actively investigated (Azar et al., 2012; Asadi & Littman, 2017). We leverage the recent result of Kozuno et al. (2019, Proposition 3), which states that

$$\max_{s} D_{KL}\left(\pi_{K+1}(\cdot|s) \,\|\, \pi_{K}(\cdot|s)\right) \leq 4B_K + 2C_K, \qquad B_K = \frac{1-\gamma^{K}}{1-\gamma}\,\epsilon\beta, \quad C_K = \beta\, r_{max} \sum_{k=0}^{K-1} \alpha^{k}\gamma^{K-k-1}, \tag{A9}$$

where $\epsilon$ is the uniform upper bound of errors. While Pinsker's inequality $D_{TV}(p\|q) \leq \sqrt{2D_{KL}(p\|q)}$, where $p$ and $q$ are distributions, can be used to directly exploit Eq. (A9), there is a gap between the total variation and the KL divergence, since $D_{TV} \leq 1$ whereas $D_{KL}$ is potentially unbounded. Leveraging Pinsker's inequality on Eq. (A9) and then on Eqs. (A3, A4) will result in large errors when $D_{KL} \geq \frac{\sqrt{2}}{2}$. To tackle this problem, we introduce the following bound due to Bretagnolle and Huber (1978), which has more benign behavior:

$$D_{TV}(p\|q) \leq \sqrt{1 - e^{-D_{KL}(p\|q)}}. \tag{A10}$$

A similar bound also appears in Tsybakov (2008) but is slightly looser. More relevant inequalities of this kind can be found in Sason and Verdú (2016). The bounds of both Bretagnolle and Huber (1978) and Tsybakov (2008) feature the component $e^{-D_{KL}(p\|q)}$, which ensures that the total variation bound is well-defined: the upper bound $\sqrt{1 - e^{-D_{KL}(p\|q)}}$ is guaranteed to be no larger than 1. Hence we can combine Eq. (A10) with Eq. (A9) by taking the maximization on both sides, yielding the following relationship:

$$\max_{s} D_{TV}\left(\pi_{K+1}(\cdot|s) \,\|\, \pi_{K}(\cdot|s)\right) \leq \sqrt{1 - e^{-4B_K - 2C_K}}, \tag{A11}$$

where $B_K$ and $C_K$ are as defined in Eq. (A9). Now, by applying Pinsker's inequality to Eq. (A9), we have the following relationship:

$$\max_{s} D_{TV}\left(\pi_{K+1}(\cdot|s) \,\|\, \pi_{K}(\cdot|s)\right) \leq \sqrt{8B_K + 4C_K}. \tag{A12}$$

Taking the minimum of Eqs. (A11, A12) yields the promised result. ◻

Now back to Eq. (A9): since the reward is bounded in $[-1, 1]$, $r_{max}$ can be conveniently dropped. Also, note that for simplicity we assume there is no update error, i.e., $B_K = 0$. However, the result can be straightforwardly extended to cases where errors are present by simply choosing an upper bound for the errors. It is worth noting that in the deep RL setting the magnitude of $\epsilon$ might be non-trivial and has to be considered in parameter tuning. Intuitively, Lemma 3 ensures that an updated entropy-regularized policy will not deviate much from the previous policy.
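The contrast between the two total variation bounds used in the proof can be checked numerically. The following is a minimal sketch under stated assumptions, not the authors' implementation: the total variation is taken as half the $L_1$ distance (so that $D_{TV} \leq 1$), the regularized update is assumed to have the generic Boltzmann form $\pi_{new}(a) \propto \pi_{old}(a)^{\alpha} e^{\beta Q(a)}$ at a single state, and all function names are hypothetical.

```python
import numpy as np

def boltzmann_update(q, prev_pi, alpha, beta):
    """Assumed update form: new_pi(a) proportional to prev_pi(a)**alpha * exp(beta * q(a))."""
    logits = alpha * np.log(prev_pi) + beta * q
    logits = logits - logits.max()  # shift for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

def kl(p, q):
    """KL divergence D_KL(p || q) for strictly positive distributions."""
    value = float(np.sum(p * (np.log(p) - np.log(q))))
    return max(value, 0.0)  # guard against tiny negative rounding errors

def tv(p, q):
    """Total variation taken as half the L1 distance (an assumption of this sketch)."""
    return 0.5 * float(np.abs(p - q).sum())

def pinsker(d_kl):
    """Pinsker-type bound in the form used in the text: sqrt(2 * D_KL)."""
    return float(np.sqrt(2.0 * d_kl))

def bretagnolle_huber(d_kl):
    """Bretagnolle-Huber bound: sqrt(1 - exp(-D_KL)), never larger than 1."""
    return float(np.sqrt(1.0 - np.exp(-d_kl)))

rng = np.random.default_rng(0)
prev_pi = np.full(4, 0.25)             # uniform previous policy at one state
q_values = rng.uniform(-1.0, 1.0, 4)   # rewards in [-1, 1] motivate this range
for beta in (0.1, 1.0, 10.0):
    new_pi = boltzmann_update(q_values, prev_pi, alpha=0.9, beta=beta)
    d = kl(new_pi, prev_pi)
    print(f"beta={beta}: TV={tv(new_pi, prev_pi):.4f}, "
          f"BH={bretagnolle_huber(d):.4f}, Pinsker={pinsker(d):.4f}")
```

For small KL divergence the two bounds behave similarly, while for large KL divergence only the Bretagnolle-Huber bound stays below 1, which is why the lemma takes the minimum of the two.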

A.1.3 Proof of Theorem 4
Now, given Lemma 2 and Lemma 3, we are ready to prove Theorem 4. We first restate it for ease of reading.
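Theorem 4. Provided that the partial update Eq. (6) is adopted, $A^{\pi_{K+1}}_{\pi_K, d^{\pi_K}} \geq 0$, and $\zeta$ is chosen properly, then any maximizer policy of Eq. (3) guarantees the following improvement that depends only on $\alpha$, $\beta$, $\gamma$, and $A^{\pi_{K+1}}_{\pi_K, d^{\pi_K}}$ after any policy update:

$$\Delta J^{\tilde{\pi}_{K+1}}_{\pi_K, d^{\pi_K}} \geq \frac{(1-\gamma)^{3} \left(A^{\pi_{K+1}}_{\pi_K, d^{\pi_K}}\right)^{2}}{4\gamma} \max\left\{ \frac{1}{1 - e^{-2C_K}},\; \frac{1}{4C_K} \right\}, \qquad C_K = \beta \sum_{k=0}^{K-1} \alpha^{k}\gamma^{K-k-1}.$$

Proof. The proof follows that of Lemma 2 and hence that of Pirotta et al. (2013b). We prove Theorem 4 by bounding $\delta$ and $\Delta A^{\pi'}_{\pi}$ of Eq. (5), respectively, through chains of inequalities in which $V_{max} := \frac{1}{1-\gamma} r_{max}$ is the maximum possible value function. Since we assume the reward is upper bounded by 1, $V_{max} = \frac{1}{1-\gamma}$. The second inequality makes use of the triangle inequality, and the third makes use of Hölder's inequality with conjugate exponents $\frac{1}{p} + \frac{1}{q} = 1$, with $p$ set to 1 and $q$ set to $\infty$. The last inequality follows from Pinsker's inequality and the fact that $\|Q\|_{\infty} \leq V_{max} = \frac{1}{1-\gamma}$. The remainder follows Pirotta et al. (2013b). The way of choosing $\zeta$ is the same as in Eq. (A7): solving the equation that is negative quadratic in $\zeta$. ◻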
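Two steps of the proof lend themselves to a quick numerical illustration: the Hölder step with $p = 1$, $q = \infty$, and the choice of the coefficient by maximizing an expression that is negative quadratic in it. The sketch below is illustrative only: the function names are hypothetical, and `a` and `b` are placeholder coefficients of a generic concave quadratic $a z - b z^2$ rather than the exact coefficients of Eq. (A7), which are not reproduced here.

```python
import numpy as np

def v_max(gamma, r_max=1.0):
    """Largest possible discounted return when |reward| <= r_max."""
    return r_max / (1.0 - gamma)

def holder_check(pi_new, pi_old, q):
    """Return (lhs, rhs) of |<pi_new - pi_old, q>| <= ||pi_new - pi_old||_1 * ||q||_inf."""
    diff = pi_new - pi_old
    lhs = abs(float(diff @ q))
    rhs = float(np.abs(diff).sum() * np.abs(q).max())
    return lhs, rhs

def best_coefficient(a, b):
    """Maximizer of the concave quadratic a*z - b*z**2 over z in [0, 1], assuming b > 0."""
    # The unconstrained stationary point is a / (2b); concavity makes clipping exact.
    return float(np.clip(a / (2.0 * b), 0.0, 1.0))

gamma = 0.9
pi_old = np.array([0.3, 0.3, 0.4])
pi_new = np.array([0.7, 0.2, 0.1])
# Hypothetical action values, scaled so that ||q||_inf <= V_max.
q = v_max(gamma) * np.array([1.0, -0.5, 0.25])

lhs, rhs = holder_check(pi_new, pi_old, q)
print(f"Hoelder step: {lhs:.3f} <= {rhs:.3f}")
print(f"coefficient for a=1.0, b=4.0: {best_coefficient(1.0, 4.0)}")
```

Clipping to $[0, 1]$ reflects that the coefficient mixes two policies; a larger penalty term `b` yields a smaller, more cautious step.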

A.2 Implementation Details
In Algorithm 2 we followed Vieillard et al. (2020b) for computing the stationary weighted advantage function, which empirically shows good performance. It should be noted that accurately estimating the stationary distribution is still nontrivial (Wen et al., 2020), and we leave the improvement of CPP in this regard to future work. Deep CPP, MDQN, MoDQN and CVI in the experimental section share the same network architecture and hyperparameters, as specified in Table 3.

Comparing our results on Deep CPP with those on Deep CPI (Vieillard et al., 2020b), we see that the training horizons differ: we ran all algorithms for $5 \times 10^{6}$ steps, while Deep CPI was run for $5 \times 10^{7}$ steps. However, we can still compare the scores up to $5 \times 10^{6}$ steps. The comparison on the environments that appeared in both papers is given in Table 4. On relatively simple environments such as MsPacman and SpaceInvaders, DCPP and DCPI performed similarly. On the other hand, on the challenging environment Seaquest (Fortunato et al., 2018), DCPP achieved around 30% higher scores at the end of $5 \times 10^{6}$ environment steps.
We also report in Table 5 the hyperparameters unique to each algorithm in Figure 3, tuned using Optuna (Akiba et al., 2019). They were obtained by running 300 Optuna trials on the environment SpaceInvaders. Each trial consists of $10^{5}$ steps, and the resulting 300 sets of parameters were ranked. The on-policy algorithms, PPO and A2C, are built into the Stable Baselines 3 library (Raffin et al., 2021), and their parameters were already fine-tuned. We evaluated them without changing their default hyperparameters.

K π , which is the greedy operator acting on Q  K π .By the Fenchel conjugacy of Shannon entropy and KL divergence, GQ  K π has a closed-form solution (Kozuno et al. 2019; Beck 2017):

Fig. 2 Comparison of SPI, CPP, and CVI on the pendulum swing-up task. Figure 2a illustrates the policy oscillation value defined in Eq. (25). Figure 2b shows the cumulative reward with ±1 standard deviation. Figure 2c shows the $\zeta$ values

Fig. 3 (a) CPP interpolated policy of swinging right $\pi(a_{right}|s_t)$. (b) E-SPI interpolated policy of swinging right $\pi(a_{right}|s_t)$. CPP and E-SPI interpolated policies of pendulum swinging right $\pi(a_{right}|s_t)$ (z-axis) for timesteps $t = 1, \ldots, 500$ (x-axis) from the first to the last iteration (y-axis). The E-SPI interpolated policy was much more aggressive than the CPP policy, owing to the large $\zeta$ values shown in Fig. 2c

Fig. 4 Comparison on Atari games averaged over 3 random seeds. CPP, MoDQN, MDQN and CVI are implemented as variants of DQN and hence are off-policy. PPO and A2C are on-policy. Correspondence between algorithms and colors is shown in the lower right corner. Overall, CPP achieved the best balance between final scores, learning speed and oscillation values.

Table 1
Different linear interpolation coefficient computation schemes. Linear CPP and Deep CPP are proposed in this work

Table 2
The oscillation values of algorithms listed in Sect. 5.3.1, measured in $\|OJ\|_{2}$ and $\|OJ\|_{\infty}$ defined by Eq. (25). CPP achieved the best balance between final score, learning speed and oscillation values. Note that CPP was implemented to leverage off-policy data. Algorithms with small oscillation values, such as PPO, failed to compete with CPP in terms of final scores and convergence speed

Declarations

Conflict of interest Lingwei Zhu and Takamitsu Matsubara declared no conflicting interest.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.