Compatible natural gradient policy search

Trust-region methods have yielded state-of-the-art results in policy search. A common approach is to use KL-divergence to bound the trust region, resulting in a natural gradient policy update. We show that the natural gradient and trust-region optimization are equivalent if we use the natural parameterization of a standard exponential policy distribution in combination with compatible value function approximation. Moreover, we show that standard natural gradient updates may reduce the entropy of the policy according to a wrong schedule, leading to premature convergence. To control entropy reduction we introduce a new policy search method called compatible policy search (COPOS) which bounds entropy loss. The experimental results show that COPOS yields state-of-the-art results in challenging continuous control tasks and in discrete partially observable tasks.


Introduction
The natural gradient [Amari, 1998] is an integral part of many reinforcement learning [Kakade, 2001, Bagnell and Schneider, 2003, Peters and Schaal, 2008, Geist and Pietquin, 2010] and optimization [Wierstra et al., 2008] algorithms. With the natural gradient, gradient updates become invariant to affine transformations of the parameter space, and the natural gradient is also often used to define a trust region for the policy update. The trust region is defined by a bound on the Kullback-Leibler (KL) divergence [Peters et al., 2010, Schulman et al., 2015] between the new and old policy, and it is well known that the Fisher information matrix, used to compute the natural gradient, is a second-order approximation of the KL divergence. Such trust-region optimization is common in policy search and has been successfully used to optimize neural network policies. In this paper, we analyze the natural gradient analytically and empirically and show that the natural gradient does not yield fast convergence unless we add an entropy regularization term. This entropy regularization term results in a new update rule which ensures that the policy loses entropy at the correct pace, leading to convergence to a good policy. We further show that the natural gradient is the exact (and not an approximate) solution to a trust-region optimization problem for log-linear models if the natural parameters of the distribution are optimized and compatible value function approximation is used.
We analyze compatible value function approximation for neural networks and show that this approximation is composed of two terms: a state value function subtracted from a state-action value function. While it is well known that the compatible function approximation denotes an advantage function, its exact structure was unclear. We show that, using compatible value function approximation, we can derive algorithms similar to Trust Region Policy Optimization that obtain the policy update in closed form. A summary of our contributions is as follows:
• It is well known that the second-order Taylor approximation to trust-region optimization with a KL-divergence bound leads to an update direction identical to the natural gradient. However, what is not known is that, when using the natural parameterization for an exponential policy together with compatible features, we can compute the step size for the natural gradient that solves the trust-region update exactly for the log-linear parameters.
• When using an entropy bound in addition to the common KL-divergence bound, the compatible features allow us to compute the exact update for the trust-region problem in the log-linear case and for a Gaussian policy with a state independent covariance we can compute the exact update for the covariance also in the non-linear case.
• Our new algorithm called Compatible Policy Search (COPOS), based on the above insights, outperforms comparison methods in both continuous control and partially observable discrete action experiments due to entropy control allowing for principled exploration.

Preliminaries
This section reviews the background needed to understand our compatible policy search approach. We first cover Markov decision process (MDP) basics and introduce the optimization objective. We then show how trust-region methods with a KL-divergence bound help with the challenges of updating the policy, review the classic policy gradient update, and introduce the natural gradient and its connection to the KL-divergence bound. Moreover, we introduce compatible value function approximation and connect it to the natural gradient. Finally, we show how the optimization problem resulting from an entropy bound used to control exploration can be solved.
Following standard notation, we denote an infinite-horizon discounted Markov decision process (MDP) by the tuple (S, A, p, r, p_0, γ), where S is a finite set of states and A is a finite set of actions. p(s_{t+1}|s_t, a_t) denotes the probability of moving from state s_t to s_{t+1} when the agent executes action a_t at time step t. We assume p(s_{t+1}|s_t, a_t) is stationary and unknown but that we can sample from it either in simulation or on a physical system. p_0(s) denotes the initial state distribution, γ ∈ (0, 1) the discount factor, and r(s_t, a_t) the real-valued reward in state s_t when the agent executes action a_t. The goal is to find a stationary policy π(a_t|s_t) that maximizes the expected discounted reward E_{s_0,a_0,...}[Σ_t γ^t r(s_t, a_t)].
Compatible Value Function Approximation. It is well known that we can obtain an unbiased gradient with typically smaller variance if compatible value function approximation is used [Sutton et al., 1999]. An approximation of the Monte-Carlo estimates G̃_w(s, a) = φ(s, a)^T w is compatible with the policy π_θ(a|s) if the features φ(s, a) are given by the log gradients of the policy, that is, φ(s, a) = ∇_θ log π_θ(a|s). The parameter w of the approximation G̃_w(s, a) is the solution of the least-squares problem w* = argmin_w E_{s,a}[(φ(s, a)^T w − Q^{π_old}(s, a))²]. Peters and Schaal [2008] showed that, in the case of compatible value function approximation, the inverse of the Fisher information matrix cancels with the matrix spanned by the compatible features and, hence, ∇_θ J_NAC = η^{-1} w*. Another interesting observation is that the compatible value function approximation is in fact not an approximation of the Q-function but of the advantage function A^{π_old}(s, a) = Q^{π_old}(s, a) − V^{π_old}(s), as the compatible features are always zero mean.
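As a small numerical illustration (not from the paper), the sketch below builds the compatible features of a log-linear softmax policy π_θ(a|s) ∝ exp(ψ(s, a)^T θ) for a single state and checks that they are zero mean under the policy, which is why φ(s, a)^T w models an advantage rather than a Q-function; all sizes and values are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, dim = 4, 3
psi = rng.normal(size=(n_actions, dim))   # features psi(s, a) for one state
theta = rng.normal(size=dim)              # policy parameters

# softmax policy over the actions
logits = psi @ theta
pi = np.exp(logits - logits.max())
pi /= pi.sum()

# compatible features: phi(s, a) = psi(s, a) - E_pi[psi(s, .)]
phi = psi - pi @ psi

# zero mean under the policy (up to numerical precision)
assert np.allclose(pi @ phi, 0.0)
```

The zero-mean property holds for any θ and any feature matrix, since the expectation of the score function is always zero.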
In Section 3, we show how with compatible value function approximation the natural gradient directly gives us an exact solution to the trust region optimization problem instead of requiring a search for the update step size to satisfy the KL-divergence bound in trust region optimization.
Entropy Regularization. Recently, some approaches [Abdolmaleki et al., 2015, Akrour et al., 2016, O'Donoghue et al., 2016] use an additional bound on the entropy of the resulting policy. The entropy bound can be beneficial since it allows us to limit the change in exploration, potentially preventing greedy policy convergence. The trust-region problem in this case is given by
argmax_π E_{s∼p(s)}[∫ π(a|s) G̃_w(s, a) da] subject to E_{s∼p(s)}[KL(π(·|s) || π_old(·|s))] ≤ ε and E_{s∼p(s)}[H(π_old(·|s)) − H(π(·|s))] ≤ β,
where the second constraint limits the expected loss in entropy (H(·) denotes Shannon entropy in the discrete case and differential entropy in the continuous case) for the new distribution (applying an entropy constraint only on π(a|s) but adjusting β according to π_old(a|s) is equivalent [Akrour et al., 2016, 2018]). The policy update rule for this constrained optimization problem can be derived using the method of Lagrange multipliers, where η and ω are the Lagrange multipliers [Akrour et al., 2016]: η is associated with the KL-divergence bound and ω with the entropy bound β. Note that for ω = 0 the entropy bound is not active and the solution is therefore equivalent to the standard trust-region solution. It has been observed that the entropy bound is needed to prevent premature convergence issues connected with the natural gradient. We show that these premature convergence issues are inherent to the natural gradient as it always reduces the entropy of the distribution. In contrast, entropy control can prevent this.
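For intuition, the sketch below instantiates this constrained problem for a single discrete state. The Lagrangian solution has the form π(a) ∝ π_old(a)^{η/(η+ω)} exp(A(a)/(η+ω)); a crude grid search over the multipliers (η, ω) stands in for the dual optimization of the paper, and all numbers (π_old, advantages, bounds) are illustrative assumptions:

```python
import numpy as np

pi_old = np.array([0.5, 0.3, 0.2])
adv = np.array([1.0, 0.0, -1.0])      # illustrative advantage estimates
eps, beta = 0.01, 0.005               # KL bound and entropy-loss bound

def new_policy(eta, omega):
    # pi(a) ∝ pi_old(a)^(eta/(eta+omega)) * exp(adv(a)/(eta+omega))
    logits = (eta * np.log(pi_old) + adv) / (eta + omega)
    p = np.exp(logits - logits.max())
    return p / p.sum()

def kl(p, q):
    return np.sum(p * np.log(p / q))

def entropy(p):
    return -np.sum(p * np.log(p))

# grid search over the Lagrange multipliers in place of the dual optimization
best = None
for eta in np.logspace(0, 3, 60):
    for omega in np.logspace(-3, 2, 60):
        p = new_policy(eta, omega)
        if kl(p, pi_old) <= eps and entropy(pi_old) - entropy(p) <= beta:
            gain = p @ adv
            if best is None or gain > best[0]:
                best = (gain, p)

# a feasible policy satisfying both bounds exists (e.g. large eta keeps pi near pi_old)
assert best is not None
assert kl(best[1], pi_old) <= eps
```

In the paper the multipliers are obtained from the dual in closed form or by a short search, not by a grid; the grid merely makes the feasible region visible.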

Compatible Policy Search with Natural Parameters
In this section, we analyze the natural gradient update equations for exponential family distributions. This analysis reveals an important connection between the natural gradient and trust-region optimization: both are equivalent if we use the natural parameterization of the distribution in combination with compatible value function approximation. This is an important insight as the natural gradient now provides the optimal solution for a given trust region, not just an approximation as is commonly believed: for example, Schulman et al. [2015] have to use a line search to fit the natural gradient update to the KL-divergence bound. Moreover, this insight can be applied together with the entropy bound to control policy exploration and obtain a closed-form update in the case of compatible log-linear policies. Furthermore, the use of compatible value function approximation has several advantages in terms of variance reduction that cannot be achieved with plain Monte-Carlo estimates, which we leave for future work. We also present an analysis of the online performance of the natural gradient and show that entropy regularization can converge exponentially faster. Finally, we present our new algorithm, Compatible Policy Search (COPOS), which uses the insights above.

Equivalence of Natural Gradients and Trust Region Optimization
We first consider soft-max distributions that are log-linear in the parameters (for example, Gaussian distributions or the Boltzmann distribution) and subsequently extend our results to non-linear soft-max distributions, for example given by neural networks. A log-linear soft-max distribution can be represented as π_θ(a|s) ∝ exp(ψ(s, a)^T θ), where ψ(s, a) are the features and θ the natural parameters. Note that Gaussian distributions can also be represented this way (see, for example, Eq. (9)); however, the natural parameterization is commonly not used for Gaussian distributions. Typically, the Gaussian is parameterized by the mean µ and the covariance matrix Σ. However, the natural parameterization and our analysis suggest that the precision matrix B = Σ^{-1} and the linear vector b = Σ^{-1}µ should be used to benefit from the many beneficial properties of the natural gradient.
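A minimal sketch (illustrative names, not from the paper) of converting a Gaussian policy between the usual (µ, Σ) parameterization and the natural parameters (B, b) = (Σ^{-1}, Σ^{-1}µ) suggested above:

```python
import numpy as np

def to_natural(mu, Sigma):
    """Convert (mean, covariance) to natural parameters (precision B, linear b)."""
    B = np.linalg.inv(Sigma)
    return B, B @ mu

def from_natural(B, b):
    """Convert natural parameters back to (mean, covariance)."""
    Sigma = np.linalg.inv(B)
    return Sigma @ b, Sigma

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

B, b = to_natural(mu, Sigma)
mu2, Sigma2 = from_natural(B, b)

# the two parameterizations are equivalent: the round trip is the identity
assert np.allclose(mu, mu2) and np.allclose(Sigma, Sigma2)
```

The point of working in (B, b) is that the policy is then log-linear in its parameters, so the exact trust-region updates derived below apply directly.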
It makes sense to study the exact form of the compatible approximation for these log-linear models.
The compatible features are given by φ(s, a) = ∇_θ log π_θ(a|s) = ψ(s, a) − E_{π(·|s)}[ψ(s, ·)]. As we can see, the compatible feature space is always zero mean, which is in line with the observation that the compatible approximation G̃_w(s, a) is an advantage function. Moreover, the structure of the features suggests that the advantage function is composed of a term for the Q-function, Q̃_w(s, a) = ψ(s, a)^T w, and one for the value function, Ṽ_w(s) = E_{π(·|s)}[Q̃_w(s, ·)], that is, G̃_w(s, a) = Q̃_w(s, a) − Ṽ_w(s). We can now directly use the compatible advantage function G̃_w(s, a) in the trust-region optimization problem given in Eq. (3). The resulting policy is then given by
π(a|s) ∝ π_old(a|s) exp((ψ(s, a)^T w − E_{π(·|s)}[ψ(s, ·)^T w]) / η) ∝ exp(ψ(s, a)^T (θ_old + η^{-1} w)).
Note that the value function part of G̃_w does not influence the updated policy. Hence, if we use the natural parameterization of the distribution in combination with compatible function approximation, then we directly get a parametric update of the form θ_new = θ_old + η^{-1} w. Furthermore, the suggested update is equivalent to the natural gradient update: the natural gradient is the optimal solution for a given trust-region problem and not just an approximation. However, this statement only holds if we use natural parameters and compatible value function approximation. Moreover, the update needs only the Q-function part of the compatible function approximation.
We can perform a similar analysis for the optimization problem with entropy regularization given in Eq. (4) using Eq. (5). The optimal policy given the compatible value function approximation is now given by π(a|s) ∝ exp(ψ(s, a)^T (η θ_old + w)/(η + ω)), that is, θ_new = (η θ_old + w)/(η + ω). In comparison to the standard natural gradient, the influence of the old parameter vector is diminished by the factor η/(η + ω), which will play an important role in our further analysis.
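The two closed-form updates above can be sketched side by side for a vector of natural parameters; the numbers are hypothetical, and w stands for the compatible-approximation weight vector:

```python
import numpy as np

theta_old = np.array([2.0, -1.0])   # old natural parameters
w = np.array([0.5, 0.3])            # compatible approximation weights
eta, omega = 10.0, 1.0              # KL and entropy Lagrange multipliers

theta_ng = theta_old + w / eta                      # plain natural gradient update
theta_ent = (eta * theta_old + w) / (eta + omega)   # entropy-regularized update

# with omega = 0 the entropy bound is inactive and both updates coincide
assert np.allclose((eta * theta_old + w) / (eta + 0.0), theta_ng)

# otherwise the old parameters are shrunk by the factor eta / (eta + omega)
assert np.allclose(theta_ent,
                   theta_old * eta / (eta + omega) + w / (eta + omega))
```

The shrinkage factor η/(η + ω) on θ_old is exactly what lets the entropy-regularized rule forget a poor initialization, as analyzed in Section "Analysis and Illustration of the Update Rules".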

Compatible Approximation for Neural Networks
So far, we have only considered models that are log-linear in the parameters (ignoring the normalization constant). For more complex models, we need to introduce non-linear parameters β for the feature vector, that is, ψ(s, a) = ψ_β(s, a). We are particularly interested in Gaussian policies in the continuous case and softmax policies in the discrete case, as they are the standard choices for continuous and discrete actions, respectively. In the continuous case, we could use a Gaussian with constant variance whose mean is parameterized by a neural network, a non-linear interpolation of linear feedback controllers with Gaussian noise, or a Gaussian with state-dependent variance. For simplicity, we focus on Gaussians with a constant covariance Σ where the mean is a product of neural network (or any other non-linear function) features ϕ_i(s) and a mixing matrix K that can be part of the neural network output layer. The policy is then π(a|s) = N(a | Kϕ_β(s), Σ) (Eq. (8)), and the log policy (Eq. (9)) contains the action-dependent terms −0.5 a^T Σ^{-1} a + a^T Σ^{-1} K ϕ_β(s). To get ψ(s, a), we note that some parts of Eq. (9), and thus of ∇_θ log π_θ(a|s), do not depend on a.
We ignore those parts when computing ψ(s, a), since ψ(s, a)^T θ is the state-action value function and action-independent parts of the state-action value function do not influence the optimal action choice.
We then take the gradient w.r.t. the log-linear parameters θ, where vec(·) concatenates the columns of a matrix into a column vector.
Note that the variances and the linear parameters of the mean are contained in the parameter vector θ and can be updated by the update rule in Eq. (7) explained above. However, to obtain the update rules for the non-linear parameters β, we first have to compute the compatible basis, that is, ∇_β log π_θ,β(a|s). Note that due to the log operator the derivative is linear w.r.t. the log-linear parameters θ. For the Gaussian distribution in Eq. (8), the gradient of the action-dependent parts of the log policy in Eq. (11) becomes Eq. (12). Now, in order to find the update rule for the non-linear parameters, we write the update rule for the policy using Eq. (5) and the value function formed by multiplying the compatible basis in Eq. (11) by w_β, the part of the compatible approximation vector responsible for β, where we dropped action-independent parts, which can be seen as part of the distribution normalization, from Eq. (12) to Eq. (13).

Algorithm 1 Compatible Policy Search (COPOS)
  Initialize π_0
  for i = 1 to max episodes do
    Sample (s, a, r) tuples using π_{i−1}
    Estimate advantage function A^{π_{i−1}}(s, a) from samples
    Solve w = F^{-1} ∇J_PG(π_i)
    Use the compatible value function to solve for the Lagrange multipliers η and ω of Eq. (4)
    Update π_i using w, η and ω based on Eq. (7) and Eq. (16)
  end for

Note that Eq. (15) represents the first-order Taylor approximation of Eq. (14) at w_β/η = 0. Moreover, note that rescaling of the energy function ψ_β(s, a)^T θ is implemented by the update of the parameters θ and hence can often be ignored in the update for β. The approximate update rule for β is thus given by Eq. (16). Hence, we can conclude that the natural gradient is an approximate trust-region solution for the non-linear parameters β, as the first-order Taylor approximation of the energy function is replaced by the real energy function after the update. Still, for the parameters θ, which in the end dominate the mean and covariance of the policy, the natural gradient is the exact trust-region solution.

Compatible Value Function Approximation in Practice
Algorithm 1 shows the Compatible Policy Search (COPOS) approach (see Appendix B for a more detailed description of the discrete-action version of the algorithm). In COPOS, for the policy updates in Eq. (7) and Eq. (16), we need to find w, η, and ω. For estimating w we could use the compatible function approximation. In this paper, we do not estimate the value function explicitly but instead estimate w as a natural gradient using the conjugate gradient method, which removes the need to compute the inverse of the Fisher information matrix explicitly (see, for example, [Schulman et al., 2015]). As discussed before, η and ω are the Lagrange multipliers associated with the KL-divergence and entropy bounds. In the log-linear case with compatible natural parameters, we can compute them exactly using the dual of the optimization objective in Eq. (4); in the non-linear case we can compute them approximately.
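The conjugate-gradient step for w = F^{-1}g can be sketched as below; only Fisher-vector products are needed, never F itself. Here a small explicit SPD matrix stands in for the Fisher matrix purely for illustration (in practice the product is computed via automatic differentiation):

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=50, tol=1e-10):
    """Solve F x = g given only the matrix-vector product fvp(v) = F v."""
    x = np.zeros_like(g)
    r = g.copy()          # residual g - F x
    p = r.copy()          # search direction
    rr = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        alpha = rr / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        rr_new = r @ r
        if rr_new < tol:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 5))
F = A @ A.T + 5 * np.eye(5)   # SPD stand-in for the Fisher matrix
g = rng.normal(size=5)        # stand-in for the policy gradient

w = conjugate_gradient(lambda v: F @ v, g)
assert np.allclose(F @ w, g, atol=1e-4)
```

For an n-dimensional problem, exact arithmetic converges in at most n iterations; in practice a fixed small iteration budget suffices.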
In the continuous action case, the basic dual includes integration over both actions and states, but due to the compatible value function we can integrate over actions in closed form: terms that do not depend on the action can be eliminated. The dual then reduces to an integral over states only, allowing η and ω to be computed efficiently; see Appendix A for details. Since η is only approximate for the non-linear parameters, in the continuous-action experiments we performed an additional alternating optimization twice: 1) a binary search on η to satisfy the KL-divergence bound while keeping ω fixed, and 2) a re-optimization of ω (exact, since ω depends only on the log-linear parameters) while keeping η fixed. For discrete actions it was sufficient to perform only an additional line search to update the non-linear parameters.

Analysis and Illustration of the Update Rules
We now analyze the performance of both update rules, with and without entropy regularization, in more detail for a simple stateless Gaussian policy π(a) ∝ exp(−0.5 B a² + b a) with scalar action a. The true reward function R(a) = −0.5 R a² + r a is also quadratic in the action. We assume an infinite number of samples to perfectly estimate the compatible function approximation. In this case, G̃_w is given by G̃_w(a) = R(a) = −0.5 R a² + r a and w = [R, r]. The reward function maximum is at a* = R^{-1} r.

Natural Gradients.
For now, we analyze the performance of the natural gradient when η is a constant and not optimized for a given KL bound. After n update steps, the parameters of the policy are given by B_n = B_0 + nR/η and b_n = b_0 + nr/η. The distance between the mean µ_n = B_n^{-1} b_n and the optimal solution a* is d_n = µ_n − a* = (b_0 − B_0 a*)/(B_0 + nR/η) (Eq. (17)). We can see that the learned solution approaches the optimal solution, however, very slowly and heavily depending on the precision B_0 of the initial solution. The reason for this effect lies in the update rule of the precision: B_n increases at every update step, and this shrinking variance in turn decreases the step size for the next update.

[Figure 1: The distance d_n (Eq. (17)), the expected reward, the expected entropy, and the expected KL-divergence between previous and current policy over 200 iterations. Top: policy updates with constant learning rates and no trust region; comparison of the natural gradient (blue), natural gradient with entropy regularization (green), and vanilla policy gradient (red). Bottom: policy updates with trust region; comparison of the natural gradient (blue), natural gradient with entropy controlled to a set value (green), natural gradient with zero entropy loss (cyan), and vanilla policy gradient (red). To summarize, without entropy regularization the natural gradient decreases the entropy too fast.]
Entropy Regularization. We perform a similar analysis for the entropy-regularization update rule, starting with constant parameters η and ω and considering the trust-region case later. The updates for entropy regularization result, after n iterations, in the parameters B_n = c_2^{-n}(B_0 − R/ω) + R/ω and b_n = c_2^{-n}(b_0 − r/ω) + r/ω, with c_2 = (η + ω)/η > 1. The distance d_n = µ_n − a* can again be expressed as a function of n: d_n = c_2^{-n}(b_0 − B_0 a*)/B_n. Hence, this update rule also converges to the correct solution but, contrary to the natural gradient, the n-dependent part of the denominator grows exponentially. As the old parameter vector is always multiplied by a factor smaller than one, the influence of the initial precision B_0 vanishes, while B_0 dominates natural gradient convergence. While the natural gradient always decreases the variance, entropy regularization avoids the entropy loss and can even increase the variance.
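The two recursions above can be simulated directly; this numerical sketch (constants chosen to match the toy setup, η = 10, ω = 1) confirms that after 200 steps the entropy-regularized mean is essentially at a*, while the plain natural gradient is still visibly off:

```python
import numpy as np

R, r = 1.0, 0.5          # quadratic reward R(a) = -0.5*R*a^2 + r*a
a_star = r / R           # reward maximum
eta, omega = 10.0, 1.0   # multipliers held constant

# both policies start from B_0 = 1, b_0 = 0
B_ng, b_ng = 1.0, 0.0    # natural gradient
B_er, b_er = 1.0, 0.0    # entropy regularized

for _ in range(200):
    # natural gradient: theta_new = theta_old + w/eta
    B_ng += R / eta
    b_ng += r / eta
    # entropy regularization: theta_new = (eta*theta_old + w)/(eta + omega)
    B_er = (eta * B_er + R) / (eta + omega)
    b_er = (eta * b_er + r) / (eta + omega)

d_ng = abs(b_ng / B_ng - a_star)
d_er = abs(b_er / B_er - a_star)

# entropy regularization converges exponentially, natural gradient only O(1/n)
assert d_er < 1e-6
assert d_ng > 1e-3
assert d_er < d_ng
```

Note also that B_er approaches the fixed point R/ω instead of growing without bound, which is exactly the "entropy control" behavior discussed above.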
Empirical Evaluation of Constant Updates. Figure 1 (top) shows the behavior of the algorithms, along with the standard policy gradient, for this simple toy task. We use η = 10 and ω = 1 for the natural gradient and entropy regularization, and a learning rate of α = 1000 for the policy gradient. We estimate the standard policy gradients from 1000 samples. Entropy regularization performs favorably, speeding up learning in the beginning by increasing the entropy. With constant parameters η and ω, the algorithm drives the entropy to a given target value. The policy gradient performs better than the natural gradient as it does not continually reduce the variance and even increases it. However, the KL-divergence of the standard policy gradient is uncontrolled.

Empirical Evaluation of Trust Region Updates.
In the trust-region case, we minimized the Lagrange dual at each iteration to obtain η and ω. For the policy gradient, we chose at each iteration the highest learning rate for which the KL bound was still met. For entropy regularization we tested two setups: 1) the entropy of the policy was kept fixed (that is, β = 0), and 2) the entropy of the policy was slowly driven to 0. Figure 1 (bottom) shows the results. The natural gradient still suffers from slow convergence due to gradually decreasing the entropy of the policy. The standard gradient again performs better as it increases the entropy, outperforming even the zero-entropy-loss natural gradient. For entropy control, even more sophisticated schedules could be used, such as the step-size control of CMA-ES [Hansen and Ostermeier, 2001], a heuristic that works well.

Related Work
Similar to classical reinforcement learning, the leading contenders in deep reinforcement learning can be divided into value-function based methods such as Q-learning with the deep Q-network (DQN) [Mnih et al., 2015], actor-critic methods [Wu et al., 2017, Tangkaratt et al., 2018], policy gradient methods such as deep deterministic policy gradient (DDPG) [Silver et al., 2014, Lillicrap et al., 2015], and policy search methods based on information-theoretic / trust-region methods, such as proximal policy optimization (PPO) [Schulman et al., 2017] and trust region policy optimization (TRPO) [Schulman et al., 2015].
Trust-region optimization was introduced in the relative entropy policy search (REPS) method [Peters et al., 2010]. TRPO and TNPG [Schulman et al., 2015] were the first methods to apply trust-region optimization successfully to neural networks. In contrast to TRPO and TNPG, we derive our method from the compatible value function approximation perspective. TRPO and TNPG differ from our approach in that they do not use an entropy constraint and do not distinguish between the log-linear and non-linear parameters in their update. On the technical level, compared to TRPO, we can update the log-linear parameters (the output layer of the neural network and the covariance) with an exact update step, while TRPO uses a line search to find the update step. Moreover, for the covariance we can find an exact update to enforce a specific entropy and thus control exploration, while TRPO bounds only the KL-divergence, not the entropy. PPO also applies an adaptive KL penalty term. Kakade [2001], Bagnell and Schneider [2003], Peters and Schaal [2008], Geist and Pietquin [2010] have also suggested similar update rules based on the natural gradient for the policy gradient framework. Wu et al. [2017] applied approximate natural gradient updates to both the actor and the critic in an actor-critic framework but did not utilize compatible value functions or an entropy bound. Peters and Schaal [2008], Geist and Pietquin [2010] investigated the idea of compatible value functions in combination with the natural gradient but used manual learning rates instead of trust-region optimization. The approaches in [Abdolmaleki et al., 2015, Akrour et al., 2016] use an entropy bound similar to ours. However, the approach in [Abdolmaleki et al., 2015] is a stochastic search method, that is, it ignores sequential decisions and views the problem as black-box optimization, and the approach in [Akrour et al., 2016] is restricted to trajectory optimization.
Moreover, neither of these approaches explicitly handles non-linear parameters such as those found in neural networks. The entropy bound used in [Tangkaratt et al., 2018] is similar to ours; however, their method depends on second-order approximations of a deep Q-function, resulting in a much more complex policy update that can suffer from the instabilities of learning a non-linear Q-function.
For exploration, one can in general add an entropy term to the objective. In the experiments, we compare against TRPO with this additive entropy term. In preliminary experiments, we also tried controlling entropy in TRPO by combining the entropy and KL-divergence constraints into a single constraint, without success.

Experiments
In the experiments, we focused on the following research question: does the proposed entropy regularization approach improve performance compared to methods which do not control entropy explicitly? For selecting comparison methods we followed [Duan et al., 2016] and took four gradient-based methods: Trust Region Policy Optimization (TRPO) [Schulman et al., 2015], Truncated Natural Policy Gradient (TNPG) [Duan et al., 2016, Schulman et al., 2015], REINFORCE (VPG) [Williams, 1992], and Reward-Weighted Regression (RWR) [Kober and Peters, 2009]; and two gradient-free black-box optimization methods: the Cross Entropy Method (CEM) [Rubinstein, 1999] and the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) [Hansen and Ostermeier, 2001]. We used rllab for the algorithm implementations. We ran experiments in both challenging continuous control tasks and discrete partially observable tasks, which we discuss next.
Continuous Tasks. In the continuous case, we ran experiments in eight different Roboschool environments, which provide continuous control tasks of increasing difficulty and action dimensionality without requiring a paid license.
We ran all evaluations, with 10 random seeds per method, for 1000 iterations of 10000 samples each. In all problems, we used the Gaussian policy defined in Eq. (8) for COPOS and TRPO (denoted by π_1(a|s)) with max(10, action dimensions) neural network outputs as basis functions, a neural network with two hidden layers of 32 tanh-neurons each, and a diagonal precision matrix. For TRPO and the other methods, except COPOS, we also evaluated a policy, denoted for TRPO by π_2(a|s), where the neural network directly specifies the mean and the diagonal covariance is parameterized by log standard deviations. In the experiments, we used identical high initial variances (we tried others without success) for all policies. We set ε = 0.01 [Schulman et al., 2015] in all experiments. For COPOS we used two equality entropy constraints: β = ε and β = auto. In β = auto, we assume positive initial entropy and schedule the entropy to reach the negative of the initial entropy after 1000 iterations. Since we always initialize the variances to one, higher dimensional problems have higher initial entropy; thus β = auto reduces entropy faster for high dimensional problems, effectively scaling the reduction with dimensionality. Table 1 summarizes the results in the continuous tasks: COPOS outperforms the comparison methods in most environments. Fig. 2 and Fig. 3 show learning curves and the performance of COPOS compared to the other methods. COPOS prevents both too fast and too slow entropy reduction while outperforming the comparison methods. Table 4 in Appendix C shows additional results for experiments where different constant entropy bonuses were added to the objective function of TRPO, without success, highlighting the necessity of principled entropy control.
Discrete control task. Partial observability often requires efficient exploration, since non-myopic actions yield long-term rewards, which is challenging for model-free methods. The Field Vision Rock Sample (FVRS) [Ross et al., 2008] task is a partially observable Markov decision process (POMDP) benchmark. For the discrete action experiments we used a softmax policy with a fully connected feed-forward neural network consisting of 2 hidden layers with 30 tanh nonlinearities each. The input to the neural network is the observation history and the current position of the agent in the grid. To obtain the hyperparameters β, ε, and the scaling factor for TRPO with additive entropy regularization, denoted "TRPO ent reg", we performed a grid search on smaller instances of FVRS. See Appendix B for more details about the setup. Results in Table 2 and Figure 4 show that COPOS outperforms the comparison methods due to maintaining higher entropy. FVRS has been used with model-based online POMDP algorithms [Ross et al., 2008] but not with model-free algorithms. The best model-based results in [Ross et al., 2008] (scaled to correspond to our rewards; COPOS in parentheses) are 2.275 (1.94) in FVRS(5,5) and 2.34 (2.45) in FVRS(5,7).

Conclusions & Future Work
We showed that when we use the natural parameterization of a standard exponential policy distribution in combination with compatible value function approximation, the natural gradient and trust-region optimization are equivalent. Furthermore, we demonstrated that natural gradient updates may reduce the entropy of the policy according to a schedule which can lead to premature convergence. To combat the problem of bad entropy scheduling in trust-region methods we proposed a new compatible policy search method called COPOS that can control the entropy of the policy using an entropy bound. In both challenging high-dimensional continuous and discrete tasks the approach yielded state-of-the-art results due to better entropy control. In future work, an exciting direction is to apply efficient approximations to compute the natural gradient [Bernacchia et al., 2018]. Moreover, we have started work on applying the proposed algorithm in challenging partially observable environments found, for example, in autonomous driving, where exploration and sample efficiency are crucial for finding high quality policies [Dosovitskiy et al., 2017].

[Figure 4: Average discounted return and Shannon entropy for FVRS 5 × 7 with a noisy sensor and with full observations over 10 random seeds. The shaded area denotes the bootstrapped 95% confidence interval. Algorithms were executed for 600 iterations with 5000 time steps (samples) per iteration.]
Setting it to zero and rearranging terms results in π(a|s) ∝ π_old(a|s)^{η/(η+ω)} exp(G̃_w(s, a)/(η + ω)), where the last term can be seen as a normalization constant. Plugging in Eq. (51)

B Technical Details for Discrete Action Experiments
Here, we provide details on the experiments with discrete actions. Table 3 shows details on the hyperparameters used in the Field Vision RockSample (FVRS) experiments and Algorithm 2 describes details on the discrete action algorithm.
[Table 3: hyperparameters for the FVRS instances (5, 5) full, (5, 5) noise, (5, 7) full, (5, 7) noise, and (7, 8) full.]

Table 4 shows additional results for continuous control in the Roboschool environments. In these experiments, an additional entropy bonus is added to the objective function of TRPO.

Algorithm 2 COPOS discrete actions
  Initialize policy network π_θ,β with non-linear parameters β and linear parameters θ, with Θ = (θ, β)
  for episode ← 1 to maxEpisode do
    Initialize empty batch B
    while collected samples < batchsize do
      Run policy π_θ,β(a|s) for T timesteps or until termination: draw action a_t ∼ π_θ,β(a_t|s_t), observe reward r_t
      Add samples (s_t, a_t, r_t) to B
    end while
    Compute advantage values A^π_old(s_i, a_i)
    Compute w = (w_θ, w_β) using conjugate gradient
    Use G̃_w(s, a) to solve for η > 0 and ω > 0 using the dual of the corresponding trust-region optimization problem:
      argmax_π_θ E_{s∼p(s)}[∫ π_θ(a|s) G̃_w(s, a) da]
      subject to E_{s∼p(s)}[KL(π_θ(·|s) || π_θold(·|s))] < ε
                 E_{s∼p(s)}[H(π_θ(·|s)) − H(π_θold(·|s))] < β
    Apply the updates for the new policy, where s is a rescaling factor found by line search
  end for

Table 4: Additional continuous control environment benchmark runs. In these experiments TRPO was run with an additional entropy term, multiplied by a factor β, added to the objective function. All algorithms are TRPO versions with the two different policy structures π_1(a|s), π_2(a|s) and different β, except for the two COPOS entries. We report the mean of the average return over the last 50 iterations ± standard error over 10 random seeds. Bold denotes no statistically significant difference to the best result (Welch's t-test with p < 0.05). Columns: COPOS β = auto, COPOS β = ε, π_1 with β = 0.02, π_2 with β = 0.02