Geometry and convergence of natural policy gradient methods

We study the convergence of several natural policy gradient (NPG) methods in infinite-horizon discounted Markov decision processes with regular policy parametrizations. For a variety of NPGs and reward functions we show that the trajectories in state-action space are solutions of gradient flows with respect to Hessian geometries, based on which we obtain global convergence guarantees and convergence rates. In particular, we show linear convergence for unregularized and regularized NPG flows with the metrics proposed by Kakade and Morimura and co-authors by observing that these arise from the Hessian geometries of conditional entropy and entropy respectively. Further, we obtain sublinear convergence rates for Hessian geometries arising from other convex functions like log-barriers. Finally, we interpret the discrete-time NPG methods with regularized rewards as inexact Newton methods if the NPG is defined with respect to the Hessian geometry of the regularizer. This yields local quadratic convergence rates of these methods for step size equal to the inverse penalization strength.


Introduction
Markov decision processes (MDPs) are an important model for sequential decision making in interaction with an environment and constitute a theoretical framework for modern reinforcement learning (RL). This framework has been successfully applied in recent years to solve increasingly complex tasks from robotics to board and video games [62,63,56,45,60]. In MDPs the goal is to identify a policy π, i.e., a procedure to select actions at every time step, which maximizes an expected time-aggregated reward R(π). We will assume that the set of possible states S and the set of possible actions A are finite, and model the policy π_θ as a differentiably parametrized element in the polytope ∆_S^A of conditional probability distributions of actions given states, with π_θ(a|s) specifying the probability of selecting action a ∈ A when currently in state s ∈ S, for the parameter value θ. We will study gradient-based policy optimization methods and more specifically natural policy gradient (NPG) methods. Inspired by the seminal works of Amari [5,8], various NPG methods have been proposed [29,47,49]. In general, they take the form

θ_{k+1} = θ_k + ∆t G(θ_k)^+ ∇L(θ_k),

where G(θ)^+ denotes the Moore-Penrose pseudo inverse of the Gram matrix G(θ)_{ij} = g(dP_θ e_i, dP_θ e_j) defined with respect to some Riemannian metric g and some representation P(θ) of the parameter, and ∆t > 0 is a step size. Most of our analysis does not actually depend on the specific choice of the pseudo inverse, but in Section 6 we will use the Moore-Penrose pseudo inverse. The most traditional natural gradient method is the special case where P(θ) is a probability distribution and g is the Fisher information in the corresponding space of probability distributions. However, the terminology may be used more generally to refer to a Riemannian gradient method where the metric is in some sense natural. Kakade [29] proposed using P(θ) = π_θ and taking for g a product of Fisher metrics weighted by the state probabilities resulting from running the Markov process with policy π_θ. Although this is a natural choice for P, the choice of a Riemannian metric on ∆_S^A is a non-trivial problem. Peters et al. [56] offered reasons to regard Kakade's metric as the true Fisher metric in this case, yet other choices of the weights can be motivated by axiomatic approaches to define a Fisher metric on conditional probabilities [35,46]. From our perspective, a main difficulty is that it is not clear how to choose a Riemannian metric on ∆_S^A that interacts nicely with the objective function R(π), which is a non-convex rational function of π ∈ ∆_S^A. An alternative choice for P(θ) is the vector of state-action frequencies η_θ, whose components η_θ(s, a) are the probabilities of state-action pairs (s, a) ∈ S × A resulting from running the Markov process with policy π_θ. Morimura et al. [47] proposed using P(θ) = η_θ and the Fisher information on the state-action probability simplex ∆_{S×A} as a Riemannian metric. We will study both approaches and variants from the perspective of Hessian geometry.
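As a minimal numerical sketch of this update rule (toy data and model assumed, not from the paper), one can run the iteration θ ← θ + ∆t G(θ)^+ ∇L(θ) for a softmax model of a probability distribution with the Fisher metric as g:

```python
import numpy as np

# NPG sketch: theta <- theta + dt * pinv(G(theta)) @ grad L(theta), with
# G(theta)_ij = g(dP e_i, dP e_j) for the Fisher metric g_p(u, v) = sum_x u_x v_x / p_x
# and the softmax parametrization P(theta) of a probability vector. Toy reward assumed.

def softmax(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def fisher_gram(theta):
    p = softmax(theta)
    J = np.diag(p) - np.outer(p, p)        # Jacobian dP_theta of the softmax map
    return J.T @ np.diag(1.0 / p) @ J      # Gram matrix of the Fisher metric

def npg_step(theta, grad_L, dt=0.1):
    # Moore-Penrose pseudoinverse handles the singular Gram matrix
    return theta + dt * np.linalg.pinv(fisher_gram(theta)) @ grad_L

r = np.array([1.0, 0.5, 0.0])              # linear objective L(theta) = <r, P(theta)>
theta = np.zeros(3)
for _ in range(200):
    p = softmax(theta)
    J = np.diag(p) - np.outer(p, p)
    theta = npg_step(theta, J.T @ r)       # chain rule: grad L = dP^T r
print(softmax(theta))                      # mass concentrates on the best entry
```

The pseudoinverse is needed because the softmax overparametrizes the simplex (shifting θ by a constant leaves P(θ) unchanged), so G(θ) is singular.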

Contributions
We study the natural policy gradient dynamics inside the polytope N of state-action frequencies, which provides a unified treatment of several existing NPG methods. We focus on finite state and action spaces and the expected infinite-horizon discounted reward optimized over the set of memoryless stochastic policies.
• We show that the dynamics of Kakade's NPG and Morimura's NPG solve a gradient flow in N with respect to the Hessian geometries of conditional entropic and entropic regularization of the reward (Sections 4.2 and 4.3 and Proposition 16).
• Leveraging results on gradient flows in Hessian geometries, we derive linear convergence rates for Kakade's and Morimura's NPG flow for the unregularized reward, which is a linear and hence not strictly concave function in state-action space, and also for regularized reward (Theorems 25 and 26 and Corollaries 30 and 31).
• Further, for a class of NPG methods which correspond to β-divergences and which generalize Morimura's NPG, we show sub-linear convergence in the unregularized case and linear convergence in the regularized case (Theorem 26 and Corollary 31, respectively).
• We complement our theoretical analysis with experimental evaluation, which indicates that the established linear and sub-linear rates for unregularized problems are essentially tight.
• For discrete-time gradient optimization, our ansatz in state-action space yields an interpretation of the regularized NPG method as an inexact Newton iteration if the step size is equal to the regularization strength. This yields a relatively short proof for the local quadratic convergence of regularized NPG methods with Newton step sizes (Theorem 33). This recovers as a special case the local quadratic convergence of Kakade's NPG under state-wise entropy regularization previously shown in [19].

Related work
The application of natural gradients to optimization in MDPs was first proposed by Kakade [29], taking as a metric on ∆_S^A the product of the Fisher metrics on the simplices of action distributions for the individual states s ∈ S, weighted by the stationary state distribution. The relation of this metric to finite-horizon Fisher information matrices was studied by Bagnell and Schneider [12] as well as by Peters et al. [56]. Later, Morimura et al. [47] proposed a natural gradient using the Fisher metric on the state-action frequencies, which are probability distributions over states and actions.
There has been a growing number of works studying the convergence properties of policy gradient methods. It is well known that reward optimization in MDPs is a challenging problem, where both the non-convexity of the objective function with respect to the policy and the particular parametrization of the policies can lead to the existence of suboptimal critical points [15]. Global convergence guarantees of gradient methods require assumptions on the parametrization. Most of the existing results are formulated for tabular softmax policies, but more general sufficient criteria have been given in [15,73,74].
Vanilla PGs have been shown to converge sublinearly at rate O(t^{-1}) for the unregularized reward and linearly for entropically regularized reward. For unregularized problems, the convergence rate can be improved to a linear rate by normalization [44,43]. For continuous state and action spaces, vanilla PG converges linearly for entropic regularization and shallow policy networks in the mean-field regime [34].
For Kakade's NPG, [1] established a sublinear convergence rate O(t^{-1}) for unregularized problems, and the result has been improved to a linear rate of convergence for step sizes found by exact line search [16], constant step sizes [31,3,70], and for geometrically increasing step sizes [69]. For regularized problems, the method converges linearly for small step sizes and locally quadratically for Newton-like step sizes [19]. These results have been extended to more general frameworks using state-mixtures of Bregman divergences on the policy polytope [33,72,37], which however do not include NPG methods defined in state-action space such as Morimura's NPG. For projected PGs, [1] shows sublinear convergence at a rate O(t^{-1/2}), and the result has been improved to a sublinear rate O(t^{-1}) [69], and to a linear rate for step sizes chosen by exact line search [16].
Apart from the works on convergence rates for policy gradient methods for standard MDPs, a primal-dual NPG method with sublinear global convergence guarantees has been proposed for constrained MDPs [24,25]. For partially observable systems, a gradient domination property has been established in [11]. NPG methods with dimension-free global convergence guarantees have been studied for multi-agent MDPs and potential games [2]. The sample complexity of a Bregman policy gradient arising from a strongly convex function in parameter space has been studied in [27]. For the linear quadratic regulator, global linear convergence guarantees for vanilla, Gauss-Newton and Kakade's natural policy gradient methods are provided in [26]; note that this setting is different from reward optimization in MDPs, where the objective at a fixed time is linear and not quadratic. A lower bound of η^{-1}|S|^{2^{Ω((1−γ)^{-1})}} on the iteration complexity of the softmax PG method with step size η has been established in [36].
Notation We denote the simplex of probability distributions over a finite set X by ∆_X. An element µ ∈ ∆_X is a vector with non-negative entries µ_x = µ(x), x ∈ X, adding to one, ∑_x µ_x = 1. We denote the set of Markov kernels from a finite set X to another finite set Y by ∆_X^Y. For a vector µ ∈ R^X_{≥0} we denote its Shannon entropy by

H(µ) := −∑_x µ_x log(µ_x),

with the usual convention that 0 log(0) := 0. For µ ∈ R^{X×Y}_{≥0} we denote the X-marginal by µ_X ∈ R^X_{≥0}, where µ_X(x) := ∑_y µ(x, y). Further, we denote the conditional entropy of µ conditioned on X by

H(µ_{Y|X}) := −∑_{x,y} µ(x, y) log( µ(x, y) / µ_X(x) ).

For any strictly convex differentiable function φ: Ω → R defined on a convex subset Ω ⊆ R^d, the associated Bregman divergence is given by D_φ(x, y) := φ(x) − φ(y) − ⟨∇φ(y), x − y⟩. Given two smooth manifolds M and N and a smooth function f: M → N, we denote the differential of f at p ∈ M by df_p: T_p M → T_{f(p)} N. In the Euclidean case, we also write Df(p) for the Jacobian matrix with entries Df(p)_{ij} = ∂_j f_i(p). We denote the gradient of a smooth function f: M → R defined on a Riemannian manifold (M, g) by ∇^g f: M → TM and denote the values of the vector field by ∇^g f(p) ∈ T_p M for p ∈ M. When the Riemannian metric is unambiguous we drop the superscript.
For A ∈ R^{n×m}, we denote its Moore-Penrose inverse by A^+ ∈ R^{m×n}. Note that AA^+ is the orthogonal (Euclidean) projection onto range(A) and A^+A is the orthogonal (Euclidean) projection onto range(A^T) = ker(A)^⊥. Finally, for functions f, g we write f(t) = O(g(t)) for t → t_0 if there is a constant c > 0 such that f(t) ≤ cg(t) as t → t_0, where we allow t_0 = +∞.
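The stated projection properties of the Moore-Penrose inverse can be checked numerically (random toy matrix, assumed for illustration):

```python
import numpy as np

# Check the pseudoinverse facts: A A^+ is the orthogonal projection onto range(A),
# and A^+ A is the orthogonal projection onto range(A^T) = ker(A)-perp.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3)) @ rng.standard_normal((3, 4))  # rank-deficient, rank <= 3
Ap = np.linalg.pinv(A)

P_range = A @ Ap       # projection onto range(A)
P_row   = Ap @ A       # projection onto the row space range(A^T)

assert np.allclose(P_range @ P_range, P_range)   # idempotent
assert np.allclose(P_range, P_range.T)           # symmetric => orthogonal projection
assert np.allclose(P_range @ A, A)               # acts as identity on range(A)
assert np.allclose(A @ P_row, A)                 # P_row acts as identity on the row space
```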

Markov decision processes
A Markov decision process, or shortly MDP, is a tuple (S, A, α, r). We assume that S and A are finite sets, which we call the state and action space respectively. We fix a Markov kernel α ∈ ∆_{S×A}^S which we call the transition mechanism. Further, we consider an instantaneous reward vector r ∈ R^{S×A}. In the case of partially observable MDPs (POMDPs) one additionally has a finite set O of observations and a fixed kernel β ∈ ∆_S^O called the observation mechanism. The system is fully observable if β = id, in which case the POMDP simplifies to an MDP.
As policies we consider elements π ∈ ∆_S^A. More generally, in POMDPs one would consider observation-based policies π ∈ ∆_O^A; we will focus on the MDP case, however. A policy induces transition kernels P_π ∈ ∆_{S×A}^{S×A} and p_π ∈ ∆_S^S by

P_π(s', a'|s, a) := α(s'|s, a)π(a'|s') and p_π(s'|s) := ∑_a π(a|s)α(s'|s, a).

For any initial state distribution µ ∈ ∆_S, a policy π ∈ ∆_S^A defines a Markov process on S × A with transition kernel P_π, which we denote by P_{π,µ}. For a discount rate γ ∈ (0, 1) we define

R(π) := (1 − γ) E_{P_{π,µ}} [ ∑_{t≥0} γ^t r(s_t, a_t) ],

called the expected discounted reward. The expected mean reward is obtained as the limit with γ → 1 when this exists. We will focus on the discounted case, however. The goal is to maximize R over the policy polytope ∆_S^A (respectively ∆_O^A in the partially observable case). For a policy π we define the value function V_π(s) := E_{P_{π,δ_s}} [ ∑_{t≥0} γ^t r(s_t, a_t) ], where δ_s is the Dirac distribution concentrated at s, so that R(π) = (1 − γ)⟨µ, V_π⟩.
A short calculation shows that R(π) = ∑_{s,a} r(s, a)η_π(s, a) = ⟨r, η_π⟩_{S×A} [71], where

η_π(s, a) := (1 − γ) ∑_{t≥0} γ^t P_{π,µ}(s_t = s, a_t = a).

The vector η_π is an element of ∆_{S×A} called the expected (discounted) state-action frequency [22], or (discounted) visitation/occupancy measure, or on-policy distribution [64]. Denoting the state marginal of η_π by ρ_π ∈ ∆_S, we have η_π(s, a) = ρ_π(s)π(a|s) in the fully observable case and η_π(s, a) = ρ_π(s)(π ∘ β)(a|s) in the partially observable case. We denote the sets of all state-action frequencies in the fully and in the partially observable cases by N and N_β, respectively. Note that the expected cumulative reward function R: ∆_O^A → R factorizes according to R = ⟨r, ·⟩_{S×A} ∘ η. We recall the following well-known facts.
Proposition 1 (State-action polytope of MDPs, [22]). The set N of state-action frequencies is a polytope given by N = ∆_{S×A} ∩ L = R^{S×A}_{≥0} ∩ L, where L is the affine space defined by the linear conditions

ℓ_s(η) := ∑_a η(s, a) − γ ∑_{s',a'} α(s|s', a')η(s', a') − (1 − γ)µ(s) = 0 for all s ∈ S.

The state-action polytope for a two-state MDP is shown in Figure 3. We note that in the case of partially observable MDPs, the set of state-action frequencies N_β does not form a polytope, but rather a polynomially constrained set involving polynomials of higher degree depending on the properties of the observation kernel [50]. The result above shows that a (fully observable) Markov decision process can be solved by means of linear programming. Indeed, if η* is a solution of the linear program max ⟨r, η⟩_{S×A} over N, one can compute a maximizing policy over ∆_S^A by conditioning, π*(a|s) = η*(s, a)/∑_{a'} η*(s, a'). We propose to study the evolution of natural policy gradient methods in state-action space N ⊆ ∆_{S×A}. Indeed, we show that the evolution of diverse natural policy gradient algorithms in the state-action polytope solves the gradient flow of a (regularized) linear objective with respect to a Hessian geometry in state-action space. This perspective facilitates relatively short proofs for the global convergence of natural policy gradient methods and can also provide rates. In order to relate Riemannian geometries in the policy space ∆_S^A to Riemannian geometries in the state-action polytope N we need the following assumption.
Assumption 2 (Positivity). For every s ∈ S and π ∈ ∆_O^A, we assume that ∑_a η_π(s, a) > 0. Assumption 2 holds in particular if either α > 0 and γ > 0, or γ < 1 and µ > 0 entrywise [50]. This assumption is standard in linear programming approaches and necessary for the convergence of policy gradient methods in MDPs [30,44]. With this assumption in place we have the following.
Proposition 3 (Inverse of state-action map, [50]). Under Assumption 2, the mapping ∆_S^A → N, π ↦ η_π is rational and bijective with rational inverse given by conditioning N → ∆_S^A, η ↦ π, where π(a|s) = η(s, a)/∑_{a'} η(s, a'). This result shows that the (interior of the) set of policies and the (interior of the) state-action polytope are diffeomorphic. Hence, we can port the Riemannian geometry on any of the two sets to the other by using the pullback along π ↦ η or the conditioning map η ↦ π.
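The facts above can be illustrated on a toy two-state, two-action MDP (all data assumed for illustration): compute η_π by solving the linear Bellman flow equations, check R(π) = ⟨r, η_π⟩, maximize ⟨r, η⟩ over the deterministic policies (vertices of N), and recover an optimal policy from η* by conditioning as in Proposition 3.

```python
import numpy as np
from itertools import product

gamma, mu = 0.9, np.array([0.5, 0.5])
alpha = np.zeros((2, 2, 2))                    # alpha[s_next, s, a], assumed toy kernel
alpha[:, 0, 0] = [0.9, 0.1]; alpha[:, 0, 1] = [0.2, 0.8]
alpha[:, 1, 0] = [0.6, 0.4]; alpha[:, 1, 1] = [0.3, 0.7]
r = np.array([[1.0, 0.0], [0.0, 0.5]])         # r[s, a]

def state_action_freq(pi):
    p_pi = np.einsum('tsa,sa->ts', alpha, pi)  # state-to-state kernel p_pi[s_next, s]
    rho = (1 - gamma) * np.linalg.solve(np.eye(2) - gamma * p_pi, mu)
    return rho[:, None] * pi                   # eta(s, a) = rho(s) pi(a|s)

# sanity check R(pi) = <r, eta_pi> = (1 - gamma) <mu, V_pi> for a random policy
pi = np.array([[0.7, 0.3], [0.4, 0.6]])
eta = state_action_freq(pi)
p_pi = np.einsum('tsa,sa->ts', alpha, pi)
V = np.linalg.solve(np.eye(2) - gamma * p_pi.T, (pi * r).sum(axis=1))
assert np.isclose(eta.sum(), 1.0)              # eta lies in the simplex
assert np.isclose((eta * r).sum(), (1 - gamma) * mu @ V)

best_eta, best_R = None, -np.inf
for a0, a1 in product(range(2), repeat=2):     # deterministic policies = vertices of N
    det = np.zeros((2, 2)); det[0, a0] = det[1, a1] = 1.0
    eta_d = state_action_freq(det)
    if (eta_d * r).sum() > best_R:
        best_R, best_eta = (eta_d * r).sum(), eta_d

pi_star = best_eta / best_eta.sum(axis=1, keepdims=True)   # conditioning eta -> pi
print(pi_star)                                 # a deterministic optimal policy
```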

Natural gradients
In this section we provide some background on the notion of natural gradients.

Definition and general properties of natural gradients
In many applications, one aims to optimize a model parameter θ with respect to an objective function that is defined on a model space M, as illustrated in Figure 1. This general setup, with an objective function that factorizes as L(θ) = ℓ(P(θ)), covers several usual parameter estimation and supervised learning cases, and also problems such as the numerical solution of PDEs with neural networks or policy optimization in MDPs and reinforcement learning. Naively, the optimization problem can be approached with first-order methods, computing the gradients in parameter space with respect to the Euclidean geometry. However, this neglects the geometry of the parametrized model M_Θ = P(Θ), which is often seen as a disadvantage since it may lead to parametrization-dependent plateaus in the optimization landscape. At the same time, the biases that particular parametrizations can introduce into the optimization can be favorable in some cases. This is an active topic of investigation particularly in deep learning, where P is often a highly non-linear function of θ. At any rate, there is good motivation to study the effects of the parametrization and the possible advantages of incorporating the geometry of model space into the optimization procedure in parameter space. The natural gradient as introduced in [5] is a way to incorporate the geometry of the model space into the optimization procedure and to formulate iterative update directions that are invariant under reparametrizations. Although it has been most commonly applied in the context of parameter estimation under the maximum likelihood criterion, the concept of natural gradient has been formulated for general parametric optimization problems and in combination with arbitrary geometries. In particular, natural gradients have been applied to neural network training [55,42,23,28], policy optimization [29,56,47] and inverse problems [54]. Especially in the latter case, different notions of natural gradients have been introduced. A version that incorporates the geometry of the sample space is the natural gradient based on an optimal transport geometry in model space [38,39,9]. We shall discuss natural gradients in a way that emphasizes that even for a specific problem there may not be a unique natural gradient. This is because both the factorization L(θ) = ℓ(P(θ)) of the objective as well as what should be considered a natural geometry in model space may not be unique.
But what is it that makes a gradient or update direction natural? The general consensus is that it should be invariant under reparametrization to prevent artificial plateaus and provide consistent stopping criteria, and it should (approximately) correspond to a gradient update with respect to the geometry in the model space. We now give the formal definition of the natural gradient with respect to a given factorization and a geometry in model space that we adopt in this work, which can be shown to satisfy the desired properties.

Definition 4 (General natural gradient). Consider the problem of optimizing an objective L: Θ → R, where the parameter space Θ ⊆ R^p is an open subset. Further, assume that the objective factorizes as L = ℓ ∘ P, where P: Θ → M is a model parametrization with M a Riemannian manifold with Riemannian metric g, and ℓ: M → R is a loss in model space, as shown in Figure 1. For θ ∈ Θ we define the Gram matrix G(θ)_{ij} := g_{P(θ)}(dP_θ e_i, dP_θ e_j) and call ∇^N L(θ) := G(θ)^+ ∇L(θ) the natural gradient (NG) of L at θ with respect to the factorization L = ℓ ∘ P and the metric g.
Natural gradient as best improvement direction Consider a parametrization P: Θ → M with image M_Θ = P(Θ), where M is a Riemannian manifold with metric g. Let us fix a parameter θ ∈ Θ and set p := P(θ). Moving in the direction v ∈ T_θ Θ in parameter space results in moving in the direction w = dP_θ v ∈ T_p M in model space. The space of all directions that can result in this way is the generalized tangent space T_θ M_Θ := range(dP_θ) ⊆ T_p M. Hence, the best direction one can take on M_Θ by infinitesimally varying the parameter θ is given by

arg max { g_p(∇^g ℓ(p), w) : w ∈ T_θ M_Θ, ‖w‖_{g_p} ≤ 1 },

which is equal (up to normalization) to the projection Π_{T_θ M_Θ}(∇^g ℓ(p)) of the Riemannian gradient ∇^g ℓ(p) onto T_θ M_Θ. Moving in the direction of the natural gradient in parameter space results in the optimal update direction over the generalized tangent space T_θ M_Θ in model space.
Theorem 5 (Natural gradient leads to steepest descent in model space). Consider the setting of Definition 4, where M is a Riemannian manifold with metric g, and let ∇^N L(θ) = G(θ)^+ ∇L(θ) denote the natural gradient with respect to this factorization. Then it holds that

dP_θ ∇^N L(θ) = Π_{T_θ M_Θ}(∇^g ℓ(P(θ))),

where Π_{T_θ M_Θ} denotes the g-orthogonal projection onto the generalized tangent space. For invertible Gram matrices G(θ) this result is well known [6, Subsection 12.1.2]; for singular Gram matrices we refer to [66, Theorem 1].
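A quick numerical check of Theorem 5 in the Euclidean case g = id (toy Jacobian and gradient, assumed for illustration): pushing the natural gradient forward through dP_θ gives the orthogonal projection of ∇ℓ onto range(dP_θ).

```python
import numpy as np

# Euclidean-metric instance of Theorem 5: dP G^+ grad L = projection of grad(ell)
# onto the generalized tangent space range(dP_theta).
rng = np.random.default_rng(1)
J = rng.standard_normal((6, 4))              # Jacobian dP_theta (model dim 6, param dim 4)
grad_ell = rng.standard_normal(6)            # gradient of ell in model space

G = J.T @ J                                  # Gram matrix for g = id
grad_L = J.T @ grad_ell                      # chain rule: grad L = dP^T grad ell
nat_grad = np.linalg.pinv(G) @ grad_L        # natural gradient in parameter space

proj = J @ np.linalg.pinv(J)                 # orthogonal projection onto range(J)
assert np.allclose(J @ nat_grad, proj @ grad_ell)
```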

Choice of a geometry in model space
Invariance axiomatic geometries A celebrated theorem by Chentsov [20] characterizes the Fisher metric of statistical manifolds with finite sample spaces as the unique metric (up to multiplicative constants) that is invariant with respect to congruent embeddings by Markov mappings. A generalization of Chentsov's result for arbitrary sample spaces was given by Ay et al. [10].
Given two Riemannian manifolds (E, g), (E', g') and an embedding f: E → E', the metric is said to be invariant if f is an isometry, meaning that

g'_{f(p)}(f_* u, f_* v) = g_p(u, v) for all p ∈ E and u, v ∈ T_p E,

where f_*: T_p E → T_{f(p)} E' is the pushforward of f. And a congruent Markov mapping is in simple terms a linear map p ↦ M^T p, where M is a row-stochastic partition matrix, i.e., a matrix of non-negative entries with a single non-zero entry per column and entries of each row adding to one. Such a mapping has the natural interpretation of splitting each elementary event into several possible outcomes with fixed conditional probabilities. By Chentsov's theorem, requiring invariance with respect to any such mapping results in a single possible choice for the metric (up to multiplicative constants). We recall that on the interior of the probability simplex ∆_S the Fisher metric is given by

g^F_µ(u, v) = ∑_{s∈S} u_s v_s / µ_s.

Based on this approach, Campbell [18] characterized the set of invariant metrics on the set of non-normalized positive measures with respect to congruent embeddings by Markov mappings. This results in a family of metrics which restrict to the Fisher metric (up to a multiplicative constant) over the probability simplex. Following this line of ideas, Lebanon [35] characterized a class of invariant metrics of positive matrices that restrict to products of Fisher metrics over stochastic matrices.
The maps considered by Lebanon do not map stochastic matrices to stochastic matrices, which motivated [46] to investigate a natural class of mappings between conditional probabilities. They showed that requiring invariance with respect to their proposed mappings singles out a family of metrics that correspond to products of Fisher metrics on the interior of the conditional probability polytope, up to multiplicative constants. This work also offered a discussion of metrics on general polytopes and weighted products of Fisher metrics, which correspond to the Fisher metric when the conditional probability polytope is embedded in the joint probability simplex by way of providing a marginal distribution.
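The invariance that Chentsov's theorem requires can be verified directly on a small example (the splitting probabilities are assumed for illustration): a congruent Markov mapping preserves the Fisher metric exactly.

```python
import numpy as np

# Congruent Markov mapping p -> M^T p: M is row-stochastic with a single nonzero
# entry per column, splitting each elementary event into several outcomes.
M = np.array([[0.3, 0.7, 0.0, 0.0],      # event 1 splits into outcomes 1, 2
              [0.0, 0.0, 0.4, 0.6]])     # event 2 splits into outcomes 3, 4

p = np.array([0.25, 0.75])
u = np.array([0.1, -0.1])                # tangent vectors (entries sum to zero)
v = np.array([-0.2, 0.2])

def fisher(p, u, v):
    # Fisher metric g_p(u, v) = sum_x u_x v_x / p_x on the interior of the simplex
    return np.sum(u * v / p)

# the pushforward of the linear map p -> M^T p is M^T itself
q, fu, fv = M.T @ p, M.T @ u, M.T @ v
assert np.isclose(fisher(p, u, v), fisher(q, fu, fv))   # isometry property
```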
Hessian geometries Instead of characterizing the geometry of model space M via an invariance axiomatic, one can select a metric based on the optimization problem at hand. For example, it is well known that the Fisher metric is the local Riemannian metric induced by the Hessian of the KL-divergence in the probability simplex. Hence, if the objective function is a KL-divergence, choosing the Fisher metric yields preconditioners that recover the inverse of the Hessian at the optimum, which can yield locally quadratic convergence rates. More generally, if the objective ℓ: M → R has a positive definite Hessian at every point, it induces a Riemannian metric via

g_p(u, v) := u^T ∇²ℓ(p) v

in local coordinates, which we call the Hessian geometry induced by ℓ on M; see [7,61].
Example 6 (Hessian geometries).The following Riemannian geometries are induced by strictly convex functions.

Euclidean geometry:
The Euclidean geometry on R^d is induced by the squared Euclidean norm x ↦ ∑_i x_i².

Fisher geometry:
The Fisher metric on R^d_{>0} is induced by the negative entropy x ↦ ∑_i x_i log(x_i).

σ-geometries:
More generally, consider for σ ∈ R the strictly convex functions

φ_σ(x) := ∑_i x_i^{2−σ} / ((1 − σ)(2 − σ)) for σ ∉ {1, 2}, φ_1(x) := ∑_i x_i log(x_i), φ_2(x) := −∑_i log(x_i); (5)

then the resulting Riemannian metric on R^d for σ ∈ (−∞, 0] and on R^d_{>0} for σ ∈ (0, ∞) is given by

g^σ_x(u, v) = ∑_i u_i v_i x_i^{−σ}.

This recovers the Euclidean geometry for σ = 0, the Fisher metric for σ = 1, and the Itakura-Saito metric for σ = 2. Note that these geometries are closely related to the so-called β-divergences [7], which are the Bregman divergences of the functions φ_σ for β = 1 − σ. We use σ instead of β in order to avoid confusion with our notation for the observation kernel β in a POMDP.
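A small numerical sketch of this family (the explicit form of the generating functions φ_σ is assumed here, chosen so that their Hessians are diag(x_i^{−σ}), matching the Euclidean, Fisher and Itakura-Saito special cases):

```python
import numpy as np

# Assumed generating functions phi_sigma with second derivative x^(-sigma) per
# coordinate; the special cases sigma = 1 (negative entropy) and sigma = 2
# (log-barrier) are the limits of the generic power expression.
def phi_sigma(x, sigma):
    if sigma == 1:
        return np.sum(x * np.log(x))
    if sigma == 2:
        return -np.sum(np.log(x))
    return np.sum(x**(2 - sigma)) / ((1 - sigma) * (2 - sigma))

def hess_diag_numeric(x, sigma, h=1e-4):
    # second finite differences of phi_sigma along the coordinate directions
    d = []
    for i in range(len(x)):
        e = np.zeros_like(x); e[i] = h
        d.append((phi_sigma(x + e, sigma) - 2 * phi_sigma(x, sigma)
                  + phi_sigma(x - e, sigma)) / h**2)
    return np.array(d)

x = np.array([0.2, 0.5, 1.5])
for sigma in [0.0, 1.0, 2.0, 0.5]:
    # metric weights of g^sigma are exactly the Hessian diagonal x_i^(-sigma)
    assert np.allclose(hess_diag_numeric(x, sigma), x**(-sigma), rtol=1e-4)
```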

Conditional entropy:
Given two finite sets X, Y and a probability distribution µ ∈ ∆_{X×Y} we can consider the conditional entropy

φ_C(µ) := −H(µ_{Y|X}) = ∑_{x,y} µ(x, y) log( µ(x, y) / µ_X(x) ).

This is a convex function on the simplex ∆_{X×Y} [53]. The Hessian of the conditional entropy is given by

∂²φ_C(µ) / ∂µ(x, y)∂µ(x', y') = δ_{(x,y),(x',y')} / µ(x, y) − δ_{x,x'} / µ_X(x),

as can be verified by explicit computation or the chain rule for Hessian matrices (see also the proof of Theorem 11). This Hessian does not induce a Riemannian geometry on the entire simplex since it is not positive definite on the tangent space T∆_{X×Y}, as can be seen by considering the specific choice X = Y = {1, 2}, µ_ij = 1/4 for all i, j = 1, 2 and the tangent vector v ∈ T_µ ∆_{X×Y} given by v_ij = (−1)^i. However, when fixing a marginal distribution ν ∈ ∆_X, ν > 0, the conditional entropy φ_C induces a Riemannian metric on the interior of P = {µ ∈ ∆_{X×Y} : µ_X = ν}. To see this, we consider the diffeomorphism given by conditioning int(P) → int(∆_X^Y), µ ↦ µ_{Y|X}. It can be shown by explicit computation (analogous to the proof of Theorem 11) that the Hessian ∇²φ_C(µ) is the metric tensor of the pullback along this map of the ν-weighted product of Fisher metrics on int(∆_X^Y). This argument can be adapted to the sets {µ ∈ ∆_{X×Y} : µ_X = ν} for other choices of positive marginals ν. We note that the Bregman divergence induced by the conditional entropy is the conditional relative entropy [53],

D_{φ_C}(µ^(1), µ^(2)) = D_KL(µ^(1), µ^(2)) − D_KL(µ^(1)_X, µ^(2)_X).

Local Hessian of Bregman divergences Let φ be a twice differentiable strictly convex function and denote its Bregman divergence by D_φ(x, y) := φ(x) − φ(y) − ⟨∇φ(y), x − y⟩. On the diagonal, the Hessian of the Bregman divergence with respect to either of its arguments recovers the Hessian of φ. To see this, we set f(y) := D_φ(y, x). Then it is straightforward to see that ∇²f(y) = ∇²φ(y). Further, for g(y) := D_φ(x, y) one can compute ∇g(y) = ∇²φ(y)(y − x), hence ∇²g(y) = ∇²φ(y) + ∇³φ(y)[y − x], and hence ∇²g(x) = ∇²φ(x). In this sense, the Hessian geometry induced by φ is the local geometry of the Bregman divergence D_φ.

Connection to Gauss-Newton method Let φ be a twice differentiable strictly convex function.
Then the Gram matrix of the Hessian geometry is given by

G(θ) = DP(θ)^T ∇²φ(P(θ)) DP(θ).

Hence G(θ)^{-1} can be interpreted as a Gauss-Newton preconditioner of the objective function φ ∘ P [41]. In particular, for the square loss φ(x) = ‖x‖²₂/2 we have G(θ)^{-1} = (DP(θ)^T DP(θ))^{-1}, which is the usual nonlinear least squares Gauss-Newton preconditioner.
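The Gauss-Newton interpretation can be sanity-checked numerically (toy model P, assumed for illustration): for the square loss φ(x) = ‖x − y‖²/2 at a zero-residual point, the Gram matrix DP^T DP agrees with the full Hessian of φ ∘ P, since the curvature term dropped by Gauss-Newton is weighted by the residual.

```python
import numpy as np

# Toy nonlinear parametrization P and square-loss objective L = phi o P.
rng = np.random.default_rng(2)
W = rng.standard_normal((5, 3))

def P(theta):
    return np.tanh(W @ theta)

theta0 = rng.standard_normal(3)
y = P(theta0)                                  # target chosen so the residual vanishes
L = lambda th: 0.5 * np.sum((P(th) - y) ** 2)

h = 1e-3
E = np.eye(3) * h
# central-difference Hessian of L at theta0
H = np.array([[(L(theta0 + E[i] + E[j]) - L(theta0 + E[i] - E[j])
                - L(theta0 - E[i] + E[j]) + L(theta0 - E[i] - E[j])) / (4 * h * h)
               for j in range(3)] for i in range(3)])

# central-difference Jacobian DP(theta0), columns = partial derivatives
J = np.array([(P(theta0 + E[i]) - P(theta0 - E[i])) / (2 * h) for i in range(3)]).T
assert np.allclose(H, J.T @ J, atol=1e-4)      # Gauss-Newton matrix = Hessian here
```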

Natural policy gradient methods
In this section we give a brief overview of different notions of policy gradient methods that have been proposed in the literature and study their associated geometries in state-action space. Policy gradient methods [68,32,65,40,14] offer a flexible approach to reward optimization. They have been used in robotics [56] and have been combined with deep neural networks [62,63,60]. In the context of MDPs there are multiple notions of natural policy gradients. For instance, one may choose to use an optimal transport geometry in model space, resulting in Wasserstein natural policy gradients [49]. Most important to our discussion, there are different possible choices for the model space. One obvious candidate is the policy space ∆_S^A, which was used by Kakade [29]. However, the objective function R(π) is a rational non-convex function over this space and thus requires a delicate analysis. A second candidate, which was proposed by Morimura et al. [47], is the state-action space N ⊆ ∆_{S×A}, for which the objective becomes a rather simple, linear function. By Proposition 3 the two model spaces ∆_S^A and N are diffeomorphic under mild assumptions, which allows us to study any NPG method defined with respect to the policy space in state-action space. Because of the simplicity of the objective function in state-action space, we propose to study the evolution of NPG methods in this space. As we will see, this has the added benefit that it allows us to interpret several of the existing NPG methods as being induced by Hessian geometries. Based on this observation we can conduct a relatively simple convergence analysis for these methods. Finally, we propose a class of policy gradients closely related to β-divergences that interpolate between NPGs arising from logarithmic barriers, entropic regularization and the Euclidean geometry.

Policy gradients
Throughout the section, we consider parametric policy models P: Θ → ∆_S^A and write π_θ = P(θ) ∈ ∆_S^A for the policy arising from the parameter θ. We denote the corresponding state-action and state frequencies by η_θ and ρ_θ. Finally, in slight abuse of notation we write R(θ) for the expected infinite-horizon discounted reward obtained by the policy π_θ. The vanilla policy gradient (vanilla PG) method is given by the iteration

θ_{k+1} = θ_k + ∆t ∇R(θ_k),

where ∆t > 0 is the step size.
For γ ∈ (0, 1), the reward function π ↦ R(π) is a rational function. Hence, in principle it can be differentiated using any automatic differentiation method. Alternatively, one can use the celebrated policy gradient theorem and compute the required quantities by matrix inversion.
Theorem 7 (Policy gradient theorem). Consider an MDP (S, A, α, r), γ ∈ [0, 1) and a parametrized policy class. It holds that

∇R(θ) = ∑_{s,a} ρ_θ(s) ∇_θ π_θ(a|s) Q_θ(s, a), where Q_θ(s, a) := r(s, a) + γ ∑_{s'} α(s'|s, a) V_{π_θ}(s')

denotes the state-action value function. In a reinforcement learning setup, one does not have direct access to the transition mechanism α and hence to P_π nor Q_θ, and sometimes even S is not known a priori. In this case, one has to estimate the gradient from interactions with the environment [14,13,48,64]. In this work, however, we study the planning problem in MDPs, i.e., we assume that we have access to exact gradient evaluations.
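The policy gradient theorem can be checked numerically on a toy MDP (all data assumed for illustration). In the convention used here, ρ_θ carries the (1 − γ) normalization while V and Q are the unnormalized value functions, so the exact-formula gradient needs no extra prefactor.

```python
import numpy as np

gamma, mu = 0.9, np.array([0.5, 0.5])
alpha = np.zeros((2, 2, 2))                        # alpha[s_next, s, a], toy kernel
alpha[:, 0, 0] = [0.9, 0.1]; alpha[:, 0, 1] = [0.2, 0.8]
alpha[:, 1, 0] = [0.6, 0.4]; alpha[:, 1, 1] = [0.3, 0.7]
r = np.array([[1.0, 0.0], [0.0, 0.5]])

def policy(theta):                                 # tabular softmax, row-wise
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def reward(theta):                                 # R(theta) = <r, eta_theta>, normalized
    pi = policy(theta)
    p_pi = np.einsum('tsa,sa->ts', alpha, pi)
    rho = (1 - gamma) * np.linalg.solve(np.eye(2) - gamma * p_pi, mu)
    return ((rho[:, None] * pi) * r).sum()

def exact_grad(theta):
    pi = policy(theta)
    p_pi = np.einsum('tsa,sa->ts', alpha, pi)
    rho = (1 - gamma) * np.linalg.solve(np.eye(2) - gamma * p_pi, mu)
    V = np.linalg.solve(np.eye(2) - gamma * p_pi.T, (pi * r).sum(axis=1))
    Q = r + gamma * np.einsum('tsa,t->sa', alpha, V)   # unnormalized Q
    adv = Q - (pi * Q).sum(axis=1, keepdims=True)
    return rho[:, None] * pi * adv                 # softmax chain rule gives rho*pi*adv

theta = np.array([[0.3, -0.2], [0.1, 0.4]])
g_num, h = np.zeros_like(theta), 1e-6
for s in range(2):                                 # central finite differences of R
    for a in range(2):
        e = np.zeros_like(theta); e[s, a] = h
        g_num[s, a] = (reward(theta + e) - reward(theta - e)) / (2 * h)
assert np.allclose(exact_grad(theta), g_num, atol=1e-6)
```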
Policy parametrizations Many results on the convergence of policy gradient methods have been provided for tabular softmax policies. The tabular softmax parametrization is given by

π_θ(a|s) := e^{θ_{sa}} / ∑_{a'} e^{θ_{sa'}} for all a ∈ A, s ∈ S, (10)

for θ ∈ R^{S×A}. One benefit of tabular softmax policies is that they parametrize the interior of the policy polytope ∆_S^A in a regular way, i.e., such that the Jacobian has full rank everywhere, and the parameter is unconstrained in an affine space.

Definition 8 (Regular policy parametrization). We call a policy parametrization R^p → int(∆_S^A), θ ↦ π_θ regular if it is differentiable and satisfies

rank(dP_θ) = dim(∆_S^A) = |S|(|A| − 1) for all θ ∈ R^p.

We will focus on regular policy parametrizations. Nonetheless, we observe that policy optimization with constrained search variables can also be an attractive option and refer to [51] for a discussion in context of POMDPs.
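A minimal implementation of the tabular softmax parametrization (10), illustrating that it maps into the interior of the policy polytope and that it is overparametrized by per-state shifts:

```python
import numpy as np

# Tabular softmax: pi_theta(a|s) = exp(theta[s, a]) / sum_a' exp(theta[s, a']).
def softmax_policy(theta):
    e = np.exp(theta - theta.max(axis=1, keepdims=True))   # row-wise, stable
    return e / e.sum(axis=1, keepdims=True)

theta = np.array([[0.0, 1.0, -1.0],
                  [2.0, 0.0,  0.0]])                       # theta[s, a], toy values
pi = softmax_policy(theta)
assert np.allclose(pi.sum(axis=1), 1.0) and (pi > 0).all() # rows are interior policies

# Per-state shift invariance: adding a constant to a row of theta leaves pi unchanged,
# consistent with the Jacobian having rank |S|(|A|-1) rather than |S||A|.
assert np.allclose(softmax_policy(theta + 5.0), pi)
```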

Regularization in MDPs
In practice, the reward function is often regularized as R_λ = R − λψ. This is often motivated as encouraging exploration [68] and has also been shown to lead to fast convergence for strictly convex regularizers ψ [44,19]. One popular regularizer is the conditional entropy in state-action space, see [53,44,19], which has also been used to successfully design trust region and proximal methods for reward optimization [58,59]. It is also possible to take the functions φ_σ defined in (5) as regularizers. This includes the entropy function, which is studied in state-action space in [53], and logarithmic barriers, which are studied in policy space in [1]. Introducing a regularizer changes the optimization problem and usually also the optimizer. The difference introduced by this regularization can be estimated in terms of the regularization strength λ. For logarithmic barriers in state-action space, this follows from standard estimates for interior point methods [17]. For entropic regularization in state-action space, this is elaborated in [67], and for the conditional entropy this is done in [44,19]. We will see later that several of these regularizers lead to Hessian geometries in state-action space that correspond to different natural gradients that have been proposed in the context of policy optimization.
Partially observable systems Although we will only consider parametric policies in fully observable MDPs, our discussion covers the case of POMDPs in the following way. Any parametric family of observation-based policies {π_θ : θ ∈ Θ} ⊆ ∆_O^A induces a parametric family of state-based policies {π_θ ∘ β : θ ∈ Θ}. Hence, the policy gradient theorem as well as the definitions of natural policy gradients directly extend to the case of partially observable systems. However, the global convergence guarantees in Section 5 and Section 6 do not carry over to POMDPs since they assume tabular softmax (state) policies.
Projected policy gradients An alternative to using parametrizations with the property that any unconstrained choice of the parameter leads to a policy is to use constrained parametrizations and projected gradient methods. For instance, one can parametrize policies in ∆_S^A by their constrained entries and use the iteration

π_{k+1} = Π_{∆_S^A}(π_k + ∆t ∇R(π_k)),

where Π_{∆_S^A} is the (Euclidean) projection onto ∆_S^A. We will not study projected policy gradient methods and refer to [1,69] for convergence rates of these methods.

Kakade's natural policy gradient
Kakade [29] proposed a natural policy gradient based on a Riemannian geometry in the policy polytope ∆ S A. We will see that Kakade's NPG can be interpreted as the NPG induced by the Hessian geometry in state-action space arising from conditional entropy regularization of the linear program associated to MDPs. Kakade's idea was to mix the Fisher information matrices of the policy over the individual states according to the state frequencies, i.e., to use the Gram matrix G_K(θ)_{ij} = Σ_s ρ^{π_θ}(s) Σ_a π_θ(a|s) ∂_{θ_i} log π_θ(a|s) ∂_{θ_j} log π_θ(a|s), leading to the following
Definition 9 (Kakade's NPG and geometry in policy space). We refer to the natural gradient ∇^K R(θ) := G_K(θ)^+ ∇_θ R(π_θ) as Kakade's natural policy gradient (K-NPG), where G_K is defined in (13). Hence, Kakade's NPG is the NPG induced by the factorization θ → π_θ → R(θ) and the Riemannian metric on int(∆ S A) given by g^K_π(u, v) = Σ_s ρ_π(s) Σ_a u(a|s) v(a|s) / π(a|s). Due to its popularity, this method is often referred to simply as the natural policy gradient. We will call it Kakade's NPG in order to distinguish it from other NPGs.
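For concreteness, the Gram matrix just described can be assembled explicitly for a tabular softmax policy. The following numpy sketch is our own illustration (not from the paper); the state distribution rho is passed in as an input, whereas in the MDP setting it would be the discounted state frequency of π_θ.

```python
import numpy as np

def softmax_policy(theta):
    # theta: (S, A) table of logits; row s parametrizes pi(.|s)
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def kakade_gram(theta, rho):
    """G_K(theta)_{ij} = sum_s rho(s) sum_a pi(a|s) d_i log pi(a|s) d_j log pi(a|s).

    For tabular softmax, d log pi(a|s) / d theta(s,b) = delta_{ab} - pi(b|s),
    so G_K is block diagonal with one rho(s)-weighted Fisher block per state."""
    S, A = theta.shape
    pi = softmax_policy(theta)
    G = np.zeros((S * A, S * A))
    for s in range(S):
        J = np.eye(A) - pi[s][None, :]       # J[a, b] = d log pi(a|s) / d theta(s, b)
        F_s = J.T @ (pi[s][:, None] * J)     # Fisher information of pi(.|s)
        blk = slice(s * A, (s + 1) * A)
        G[blk, blk] = rho[s] * F_s
    return G
```

For a categorical softmax each state block reduces to the familiar ρ(s)(diag(π_s) − π_s π_sᵀ).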
Remark 10. In [29] the definition of G_K was heuristically motivated by the fact that the reward is also a mix of instantaneous rewards according to the state frequencies, R(π) = Σ_s ρ^π(s) Σ_a π(a|s) r(s, a). The invariance axiomatic approaches discussed in [35,46] also yield mixtures of Fisher metrics over individual states, which however do not fully recover Kakade's metric, since this would require a way to account for the particular process that gives rise to the stationary state distribution ρ_π. The works [56,12,52] argued that the Gram matrix G_K corresponds to the limit of the Fisher information matrices of finite-path probability measures as the path length tends to infinity.

Interpretation as Hessian geometry of conditional entropy regularization
The metric g_K on the conditional probability polytope ∆ S A has been studied in terms of its invariances and its connection to the Fisher metric on finite-horizon path space [12,56,46]. We offer a different interpretation of Kakade's geometry by studying its counterpart in state-action space, which we show to be the Hessian geometry induced by the conditional entropy.
Theorem 11 (Kakade's geometry as conditional entropy Hessian geometry). Consider an MDP (S, A, α) and fix µ ∈ ∆ S and γ ∈ (0, 1) such that Assumption 2 holds. Then, Kakade's geometry on ∆ S A is the pullback of the Hessian geometry induced by the conditional entropy on the state-action polytope N ⊆ ∆ S×A along π → η_π. In particular, Kakade's natural policy gradient is the natural policy gradient induced by the factorization θ → η_θ → R(θ) with respect to the conditional entropy Hessian geometry, i.e., ∇^K R(θ) = ∇^{φ_C} R(θ).
Proof. We can pull back the Riemannian metric on the policy polytope proposed by Kakade along the conditioning map to define a corresponding geometry in state-action space. The metric tensor in state-action space is given by (17), with entries δ_{ss'}(δ_{aa'}/η(s,a) − 1/ρ(s)), where ρ(s) = Σ_a η(s, a) denotes the state marginal. We aim to show that ∇²φ_C(η) agrees with this tensor. Note that ∇²H(η) = diag(η)^{−1}, which is the first term appearing in (17). For linear maps g_A(x) = Ax the chain rule yields the expression ∇²(H ∘ g_A)(x) = A^⊤ ∇²H(Ax) A. Noting that ρ is a linear function of η, we obtain for ∇²(H ∘ ρ)(η) the entries δ_{ss'}/ρ(s), which is the second term in (17). Overall this implies ∇²φ_C(η) = ∇²H(η) − ∇²(H ∘ ρ)(η), which agrees with (17). The Bregman divergence of the conditional entropy is the conditional relative entropy and has been studied as a regularizer for the linear program associated to MDPs in [53].
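The block structure of the conditional entropy Hessian can be checked numerically. The finite-difference sketch below (our own, not from the paper) verifies that φ_C(η) = Σ η log η − Σ ρ log ρ has Hessian with per-state blocks diag(1/η_s) − 1/ρ(s).

```python
import numpy as np

def phi_C(eta):
    # conditional entropy potential: sum eta log eta - sum rho log rho
    rho = eta.sum(axis=1)
    return (eta * np.log(eta)).sum() - (rho * np.log(rho)).sum()

def hessian_phi_C(eta):
    """Closed form: per-state blocks diag(1/eta[s]) - 1/rho(s)."""
    S, A = eta.shape
    rho = eta.sum(axis=1)
    H = np.zeros((S * A, S * A))
    for s in range(S):
        blk = slice(s * A, (s + 1) * A)
        H[blk, blk] = np.diag(1.0 / eta[s]) - 1.0 / rho[s]
    return H

def numeric_hessian(f, x, h=1e-4):
    # central second differences of f at the flat vector x
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = h
            ej = np.zeros(n); ej[j] = h
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * h * h)
    return H

rng = np.random.default_rng(1)
eta = rng.uniform(0.1, 1.0, size=(2, 2))
H_num = numeric_hessian(lambda x: phi_C(x.reshape(2, 2)), eta.ravel())
err = np.abs(H_num - hessian_phi_C(eta)).max()
```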
Remark 12. Kakade's NPG is known to converge at a locally quadratic rate under conditional entropy regularization [19], a regularizer which in policy space takes the form ψ(π) = Σ_s ρ^π(s) Σ_a π(a|s) log π(a|s). Note however, by direct calculation, that Kakade's geometry in policy space g_K defined in (14) is not the Hessian geometry induced by ψ in policy space, which would contain additional terms involving the derivatives of the state distribution ρ^π with respect to the policy. Instead, the metric proposed by Kakade only considers the contribution of the first term, see (14). As we will see in Sections 5 and 6, the interpretation of Kakade's NPG as a Hessian natural gradient induced by the conditional entropic regularization in state-action space allows for a great simplification of its convergence analysis.

Morimura's natural policy gradient
In contrast to Kakade, who proposed a mixture of Fisher metrics to obtain a metric on the conditional probability polytope ∆ S A, Morimura and co-authors [47] proposed to work with the Fisher metric in state-action space ∆ S×A to define a natural gradient for reward optimization. The resulting Gram matrix is given by the Fisher information matrix induced by the state-action distributions, that is P(θ) = η_θ and G_M(θ)_{ij} = Σ_{s,a} η_θ(s,a) ∂_{θ_i} log η_θ(s,a) ∂_{θ_j} log η_θ(s,a).
Definition 13 (Morimura's NPG). We refer to the natural gradient ∇^M R(θ) := G_M(θ)^+ ∇_θ R(π_θ) as Morimura's natural policy gradient (M-NPG), where G_M is defined in (18). Hence, Morimura's NPG is the NPG induced by the factorization θ → η_θ → R(θ) and the Fisher metric on int(∆ S×A).
By (15) the Gram matrix proposed by Morimura and co-authors and the Gram matrix proposed by Kakade are related to each other by G_M(θ) = G_K(θ) + F_ρ(θ), where F_ρ(θ)_{ij} = Σ_s ρ_θ(s) ∂_{θ_i} log ρ_θ(s) ∂_{θ_j} log ρ_θ(s) denotes the Fisher information matrix of the state distributions. This relation is reminiscent of the chain rule for the conditional entropy and can be verified by direct computation; see [47]. Where we have seen that Kakade's geometry in state-action space is the Hessian geometry of the conditional entropy, the Fisher metric is known to be the Hessian metric of the entropy function [7]. Hence, we can interpret the Fisher metric as the Hessian geometry of the entropy-regularized reward η ↦ ⟨r, η⟩ − H(η).
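The decomposition G_M = G_K + F_ρ can be verified numerically on a small randomly generated MDP. The sketch below is our own illustration (random transitions, finite-difference Jacobians), not the paper's example; η_θ denotes the normalized discounted state-action frequency.

```python
import numpy as np

rng = np.random.default_rng(2)
nS, nA, gamma = 3, 2, 0.9
P = rng.uniform(size=(nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)  # P[s, a, s']
mu = rng.uniform(size=nS); mu /= mu.sum()

def softmax(theta):
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def eta_of(theta):
    """Normalized discounted state-action frequencies eta(s, a) = rho(s) pi(a|s)."""
    pi = softmax(theta)
    P_pi = np.einsum('sap,sa->sp', P, pi)            # state transitions under pi
    rho = (1 - gamma) * np.linalg.solve(np.eye(nS) - gamma * P_pi.T, mu)
    return rho[:, None] * pi

def jac_log(f, theta, h=1e-5):
    # central-difference Jacobian of log(f) w.r.t. the flattened parameters
    J = np.zeros((f(theta).size, theta.size))
    for k in range(theta.size):
        e = np.zeros(theta.size); e[k] = h
        tp = (theta.ravel() + e).reshape(theta.shape)
        tm = (theta.ravel() - e).reshape(theta.shape)
        J[:, k] = (np.log(f(tp)) - np.log(f(tm))).ravel() / (2 * h)
    return J

theta = rng.normal(size=(nS, nA))
eta = eta_of(theta)
rho, w = eta.sum(axis=1), eta.ravel()
J_eta = jac_log(eta_of, theta)
J_pi = jac_log(softmax, theta)
J_rho = jac_log(lambda t: eta_of(t).sum(axis=1), theta)
G_M = J_eta.T @ (w[:, None] * J_eta)       # Fisher matrix of state-action distributions
G_K = J_pi.T @ (w[:, None] * J_pi)         # Kakade: eta(s,a) = rho(s) pi(a|s) weights
F_rho = J_rho.T @ (rho[:, None] * J_rho)   # Fisher matrix of the state distributions
```

The identity holds exactly because log η = log ρ + log π and the cross terms vanish, since Σ_a π(a|s) ∂ log π(a|s) = 0.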

General Hessian natural policy gradient
Generalizing the above definitions, we define general state-action space Hessian NPGs as follows. Consider a twice differentiable function φ : R S×A >0 → R such that ∇²φ(η) is positive definite on T_η N = TL ⊆ R S×A for every η ∈ int(N). Then we set G_φ(θ)_{ij} := ∂_{θ_i}η_θ^⊤ ∇²φ(η_θ) ∂_{θ_j}η_θ, which is the Gram matrix with respect to the Hessian geometry in R S×A >0.
Definition 14 (Hessian NPG). We refer to the natural gradient ∇^φ R(θ) := G_φ(θ)^+ ∇_θ R(π_θ) as the Hessian natural policy gradient with respect to φ, or shortly the φ-natural policy gradient (φ-NPG).
Leveraging results on gradient flows in Hessian geometries, we will later provide global convergence guarantees, including convergence rates, for a large class of Hessian NPG flows covering Kakade's and Morimura's natural gradients as special cases. Further, we consider the family φ_σ of strictly convex functions defined in (5). With G_σ(θ) we denote the Gram matrix associated with the Riemannian metric g_σ, i.e., G_σ(θ)_{ij} = g_σ(∂_{θ_i}η_θ, ∂_{θ_j}η_θ) = ∂_{θ_i}η_θ^⊤ ∇²φ_σ(η_θ) ∂_{θ_j}η_θ.
Definition 15 (σ-NPG). We refer to the natural gradient ∇^σ R(θ) := G_σ(θ)^+ ∇_θ R(π_θ) as the σ-natural policy gradient (σ-NPG). Hence, the σ-NPG is the NPG induced by the factorization θ → η_θ → R(θ) and the metric g_σ on int(∆ S×A) defined in (6).
For σ = 1 we recover the Fisher geometry and hence Morimura's NPG; for σ = 2 we obtain the Itakura-Saito metric; and for σ = 0 we recover the Euclidean geometry. Later, we show that the Hessian gradient flows exist globally for σ ∈ [1, ∞) and provide convergence rates depending on σ.
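These special cases can be checked directly. In the sketch below (ours, not from the paper) we take P(θ) to be a single softmax distribution on a probability simplex rather than the full state-action map, and we use as an assumption that ∇²φ_σ(η) = diag(η^{−σ}), which reproduces the Fisher metric for σ = 1, the Itakura-Saito metric for σ = 2 and the Euclidean metric for σ = 0.

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def gram_sigma(theta, sigma, h=1e-6):
    """G_sigma = J^T diag(p^{-sigma}) J with J = d softmax / d theta
    (assuming Hess phi_sigma(p) = diag(p^{-sigma}))."""
    p = softmax(theta)
    n = theta.size
    J = np.zeros((n, n))
    for k in range(n):
        e = np.zeros(n); e[k] = h
        J[:, k] = (softmax(theta + e) - softmax(theta - e)) / (2 * h)
    return J.T @ ((p ** -sigma)[:, None] * J)

theta = np.array([0.3, -0.7, 1.1])
p = softmax(theta)
G1 = gram_sigma(theta, 1.0)   # Fisher case (Morimura)
G0 = gram_sigma(theta, 0.0)   # Euclidean case
```

For σ = 1 the Gram matrix collapses to the categorical Fisher matrix diag(p) − ppᵀ of the softmax model.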

Convergence of natural policy gradient flows
In this section we study the convergence properties of natural policy gradient flows arising from Hessian geometries in state-action space for fully observable systems and tabular softmax policies. Although we focus on this case, we observe that our results directly extend to regular parametrizations of the interior of the policy polytope ∆ S A. Leveraging tools from the theory of gradient flows in Hessian geometries established in [4], we show O(t^{−1}) convergence of the objective value for a large class of Hessian geometries and unregularized reward. We strengthen this general result and establish linear convergence for Kakade's and Morimura's NPG flows and O(t^{−1/(σ−1)}) convergence for σ-NPG flows for σ ∈ (1, 2). We provide empirical evidence that these rates are tight and that the rate O(t^{−1/(σ−1)}) also holds for σ ≥ 2. Under strongly convex penalization, we obtain linear convergence for a large class of Hessian geometries.
Reduction to state-action space For a solution θ(t) of the natural policy gradient flow, the corresponding state-action frequencies η(t) solve the gradient flow with respect to the Riemannian metric. This is made precise in the following result, which shows that it suffices to study Riemannian gradient flows in state-action space in order to study natural policy gradient flows for tabular softmax policies.
Proposition 16 (Evolution in state-action space). Consider an MDP (S, A, α), a Riemannian metric g on int(N) = R S×A >0 ∩ L and a differentiable objective function R : int(∆ S×A) → R. Consider a regular policy parametrization, the objective R(θ) := R(η_θ), and a solution θ : [0, T] → Θ = R S×A of the natural policy gradient flow ∂_t θ(t) = G(θ(t))^+ ∇_θ R(θ(t)), where G(θ)_{ij} = g_{η_θ}(∂_{θ_i}η_θ, ∂_{θ_j}η_θ) and G(θ)^+ denotes some pseudo inverse of G(θ). Then, setting η(t) := η_{θ(t)}, we have that η : [0, T] → ∆ S×A is the gradient flow with respect to the metric g|_N and the objective R, i.e., it solves ∂_t η(t) = Π^g_{TL} ∇^g R(η(t)), where Π^g_{TL} is the g-orthogonal projection onto the tangent space TL with L defined in (4).
Proof. This is a direct consequence of Theorem 5.
By Proposition 16 it suffices to study solutions η : [0, T] → N of the gradient flow in state-action space. We have seen before that a large class of natural policy gradients arise from Hessian geometries in state-action space. In particular, this covers the natural policy gradients proposed by Kakade [29] and Morimura et al. [47]. We study the evolution of these flows in state-action space and leverage results on Hessian gradient flows of convex problems in [4] to obtain global convergence rates for different NPG methods.

Convergence of unregularized Hessian natural policy gradient flows
First, we study the case of unregularized reward, i.e., where the state-action objective is linear and given by R(η) = ⟨r, η⟩. In this case we obtain global convergence guarantees including rates. In particular, our general result covers the σ-NPGs, and thus Morimura's NPG, as well as Kakade's NPG. For the remainder of this section we work under the following assumptions.
Setting 17. Let (S, A, α) be an MDP, µ ∈ ∆ S and r ∈ R S×A, and let the positivity Assumption 2 hold. We denote the state-action polytope by N = R S×A ≥0 ∩ L, see Proposition 1, and its (relative) interior and boundary by int(N) = R S×A >0 ∩ L and ∂N = ∂R S×A ≥0 ∩ L respectively. We consider an objective function R : R S×A → R ∪ {−∞} that is finite, differentiable and concave on R S×A >0 and continuous on its domain dom(R) = {η ∈ R S×A : R(η) ∈ R}. Further, we consider a real-valued function φ : R S×A → R ∪ {+∞}, which we assume to be finite and twice continuously differentiable on R S×A >0 and such that ∇²φ(η) is positive definite on T_η N = TL ⊆ R S×A for every η ∈ int(N). Further, with η : [0, T) → N we denote a solution of the Hessian gradient flow (21), which is the gradient flow with respect to the Hessian geometry induced by φ on N. We denote R^* := sup_{η∈N} R(η) < ∞ and by η^* ∈ N we denote a maximizer (if one exists) of R over N. We denote the policies corresponding to η_0 and η^* by π_0 and π^*, see Proposition 3.
We observe that the Hessian of the conditional entropy defines a Riemannian metric only on int(N), not on all of int(∆ S×A). Note that in general η^* might lie on the boundary, and for linear R, corresponding to unregularized reward, it necessarily lies on the boundary.

Sublinear rates for general case
We begin by providing a sublinear rate of convergence for general NPG flows, which we then specialize to Kakade's NPG and the σ-NPGs.
Lemma 18 (Convergence of Hessian natural policy gradient flows). Consider Setting 17 and assume that there exists a solution η : [0, T) → int(N) of the NPG flow (21) with initial condition η(0) = η_0. Then for any η ∈ N and t ∈ (0, T) it holds that R(η) − R(η(t)) ≤ D_φ(η, η_0)/t, where D_φ denotes the Bregman divergence of φ. In particular, it holds that R(η(t)) → R^* as T → ∞. Further, this convergence happens at a rate O(t^{−1}) if there is a maximizer η^* ∈ N with D_φ(η^*, η_0) < +∞.
Proof. This is precisely the statement of Proposition 4.4 in [4]; note however that they assume a globally defined objective R : R S×A → R, and hence for completeness we provide a quick argument. Note that ∂_t D_φ(η, η(t)) ≤ R(η(t)) − R(η), which can either be seen by inspecting the proof of equation (4.4) in [4] and noting that the proof does not require the stronger assumption made there, or by explicit computation. Integration and the monotonicity of t ↦ R(η(t)) yield the claim.
The previous result is very general and reduces the problem of showing convergence of the natural gradient flow to the problem of well-posedness. However, well-posedness is not always given, for example in the case of an unregularized reward and the Euclidean geometry in state-action space. In this case, the gradient flow in state-action space reaches the boundary of the state-action polytope N in finite time, at which point the gradient is no longer classically defined and the softmax parameters blow up; see Figure 3. An important class of Hessian geometries that prevent a finite hitting time of the boundary are those induced by Legendre-type functions, which curve up towards the boundary.
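The finite hitting time can already be seen on a plain probability simplex with a linear objective ⟨r, p⟩, used here as a toy stand-in for the state-action polytope (all numbers below are our own choices): the Euclidean flow moves with constant velocity r − mean(r) and reaches the boundary in finite time, while the Fisher flow, i.e., the gradient flow in the Hessian geometry of the entropy, is the replicator dynamics ṗ = p ⊙ (r − ⟨r, p⟩) and keeps the iterate strictly positive.

```python
import numpy as np

r = np.array([1.0, 0.5, 0.0])      # linear reward on the simplex
p0 = np.ones(3) / 3
dt, n_steps = 1e-3, 2000

# Euclidean flow: projected velocity r - mean(r) is constant, so the
# smallest coordinate hits zero at the finite time p0_min / |v_min| = 2/3.
p_euc, hit_time = p0.copy(), None
v = r - r.mean()
for k in range(n_steps):
    p_euc = p_euc + dt * v
    if p_euc.min() <= 0.0:
        hit_time = (k + 1) * dt
        break

# Fisher flow (entropy Hessian geometry): replicator dynamics; the
# multiplicative structure keeps the iterate in the interior.
p_fis = p0.copy()
for k in range(n_steps):
    p_fis = p_fis + dt * p_fis * (r - p_fis @ r)
```

The Euler discretization of the replicator flow preserves the simplex constraint exactly and drifts toward the vertex maximizing r without ever leaving the interior.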

Smoothness and convexity:
We assume φ to be continuous on dom(φ) and twice continuously differentiable on R S×A >0, and such that ∇²φ(η) is positive definite on T_η N = TL ⊆ R S×A for every η ∈ int(N).

Gradient blowup at boundary: For any sequence (η_k) ⊆ int(N) with η_k → η ∈ ∂N it holds that ‖∇φ(η_k)‖ → +∞.
We note that the above definition differs from [4], who consider Legendre functions on arbitrary open sets but work with more restrictive assumptions. More precisely, they require the gradient blowup on the boundary of the entire cone R S×A ≥0 and not only on the boundary of the feasible set N of the optimization problem. However, this relaxation is required to cover the case of the conditional entropy, which corresponds to Kakade's NPG, as we see now.
Example 20. The class of Legendre type functions covers the functions inducing Kakade's and Morimura's NPGs via their Hessian geometries. More precisely, the following Legendre type functions will be of great interest in the remainder:
1. The functions φ_σ defined in (5) used to define the σ-NPG are Legendre type functions for σ ∈ [1, ∞). Note that this includes the Fisher geometry, corresponding to Morimura's NPG for σ = 1, but excludes the Euclidean geometry, which corresponds to σ = 0.
2. The conditional entropy φ_C defined in (12) is a Legendre type function. The Hessian geometry of this function induces Kakade's NPG. Note that in this case the gradient blowup holds on the boundary of N but not on the boundary of ∆ S×A or even R S×A ≥0. The definition of a Legendre function, with the gradient blowing up at the boundary of the feasible set, prevents the gradient flow from reaching the boundary in finite time and thus ensures the global existence of the gradient flow.
Let us now turn towards Kakade's natural policy gradient, which is the Hessian NPG induced by the conditional entropy φ_C defined in (12). The Bregman divergence of the conditional entropy (see [57]) is given by D_{φ_C}(η', η) = Σ_s ρ'(s) D_{KL}(π'(·|s) ‖ π(·|s)), the conditional relative entropy, which has been studied in the context of mirror descent algorithms for the linear programming formulation of MDPs in [53].
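The identity between the Bregman divergence of φ_C and the conditional relative entropy can be checked directly, using the fact that ∇φ_C(η)_{(s,a)} = log(η(s,a)/ρ(s)) = log π(a|s); the following sketch is our own.

```python
import numpy as np

def phi_C(eta):
    rho = eta.sum(axis=1)
    return (eta * np.log(eta)).sum() - (rho * np.log(rho)).sum()

def grad_phi_C(eta):
    # d phi_C / d eta(s,a) = log(eta(s,a)/rho(s)) = log pi(a|s)
    return np.log(eta / eta.sum(axis=1, keepdims=True))

def bregman_phi_C(eta1, eta2):
    return phi_C(eta1) - phi_C(eta2) - (grad_phi_C(eta2) * (eta1 - eta2)).sum()

rng = np.random.default_rng(3)
eta1 = rng.uniform(0.1, 1.0, (3, 2)); eta1 /= eta1.sum()
eta2 = rng.uniform(0.1, 1.0, (3, 2)); eta2 /= eta2.sum()

# conditional relative entropy: sum_s rho1(s) * KL(pi1(.|s) || pi2(.|s))
pi1 = eta1 / eta1.sum(axis=1, keepdims=True)
pi2 = eta2 / eta2.sum(axis=1, keepdims=True)
rho1 = eta1.sum(axis=1)
cond_kl = (rho1 * (pi1 * np.log(pi1 / pi2)).sum(axis=1)).sum()
```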
Proof. The well-posedness follows by a similar reasoning as in [4, Theorem 4.1]. Now the result follows directly from Lemma 18.
Now we elaborate the consequences of the general convergence result Lemma 18 for the case of σ-NPG flows. Here, the study is more delicate since for σ > 2 we typically have φ_σ(η^*) = +∞, as the maximizer η^* lies at the boundary unless the reward is constant.
Proof. By the preceding Lemma 18 it suffices to show the well-posedness of the σ-NPG flow.
The result [4, Theorem 4.1] guarantees well-posedness of Hessian gradient flows for smooth Legendre type functions. Note however that they work with slightly stronger assumptions, namely that the gradient blowup of the Legendre type function occurs not only on the boundary of N but on the boundary of R S×A ≥0, and that the objective R is globally defined. Inspecting the proof in [4] reveals that neither of these relaxations changes the validity or the proof of the statement.
It is easy to see that for σ ≥ 1 the functions φ_σ are of Legendre type and smooth, and hence we can apply the preceding Lemma 18. Let η^* be a maximizer, which necessarily lies at the boundary of N (except for constant reward) and therefore has at least one zero entry. For σ ∈ [1, 2) we have φ_σ(η^*) < ∞ and hence obtain the claimed rate from Lemma 18. For σ ∈ (2, ∞) the calculation follows in an analogous fashion. Noting that dist(η(t), S) ∼ R^* − R(η(t)) finishes the proof.
Remark 23. Theorem 22 and Theorem 21 show global convergence of σ-NPG and Kakade's NPG flows to a maximizer of the unregularized problem. Note that the reason why this is possible is that one does not work with a regularized objective but rather with a geometry arising from a regularization, while keeping the original linear objective. For σ < 1 the flow may reach a face of the feasible set in finite time; see Figure 3. For σ ≥ 3 Theorem 22 is uninformative since R^* − R(η(t)) is non-increasing anyway. However, in our experiments we observed that (discretizations of) σ-NPG flows still converge for σ ≥ 3, although the plateau problem becomes more pronounced, as can be seen in Figure 3.
Furthermore, one can show that the trajectory converges towards the maximizer that is closest to the initial point η 0 with respect to the Bregman divergence [4].
Faster rates for σ ∈ [1, 2) and Kakade's NPG We now obtain improved, and even linear, convergence rates for Kakade's and Morimura's NPG flows for unregularized problems. To this end, we first formulate the following general result.
Lemma 24 (Convergence rates for gradient flow trajectories). Consider Setting 17 and assume that there is a global solution η : [0, ∞) → int(N) of the Hessian gradient flow (21). Assume that there is η^* ∈ N such that φ(η^*) < +∞, as well as a neighborhood N of η^* in N, ω ∈ (0, ∞) and τ ≥ 1 such that R(η^*) − R(η) ≥ ω D_φ(η^*, η)^τ for all η ∈ N. Then there is a constant c > 0 such that D_φ(η^*, η(t)) ≤ c e^{−ωt} if τ = 1 and D_φ(η^*, η(t)) ≤ c t^{−1/(τ−1)} if τ > 1.
The lower bound (24) can be interpreted as a form of strong convexity, under which the objective value controls the Bregman divergence; hence convergence in objective value implies convergence of the state-action trajectories in the sense of the Bregman divergence.
Proof. The statement of this result can be found in [4, Proposition 4.9], where however stronger assumptions are made, and hence we provide a short proof. First, note that our assumption implies that η^* is the unique global maximizer of R over N. By (23) it holds that u(t) := D_φ(η^*, η(t)) is strictly decreasing as long as η(t) ≠ η^*. Note that if η(t_0) = η^* for some t_0 ∈ [0, ∞), we have u(t) = 0 for all t ≥ t_0 and hence the statement becomes trivial. Therefore, we can assume u(t) > 0 for all t > 0. By Lemma 18 it holds that R(η(t)) → R(η^*) and hence η(t) → η^*; this is due to the compactness of N and because the continuity of R implies that every accumulation point of η(t) is a maximizer and thus equal to η^*. Hence, η(t) ∈ N for t ≥ t_0. For the statement about the asymptotic behavior we may therefore assume without loss of generality that η(t) ∈ N for all t ≥ 0. Combining (23) and (24) we obtain u'(t) ≤ −ωu(t)^τ. Dividing by the right hand side and integrating the inequality we obtain u(t) ≤ u(0)e^{−ωt} for τ = 1 and u(t) = O(t^{−1/(τ−1)}) for τ > 1.
Theorem 25 (Linear convergence of unregularized Kakade's NPG flow). Consider Setting 17, where φ = φ_C is the conditional entropy defined in (12), and assume that there is a unique maximizer η^* of the unregularized reward R. Then R^* − R(η(t)) = O(e^{−ct}) for some c > 0.
Proof. Let φ_C denote the conditional entropy, so that D_{φ_C}(η^*, η) is the conditional relative entropy. Hence we obtain, just like in the case of σ-NPG flows for σ = 1, that D_{φ_C}(η^*, η(t)) = O(e^{−ct}) for some c > 0 by Lemma 24. Hence, it remains to estimate R(η^*) − R(η) = O(‖η^* − η‖_1) by the conditional relative entropy D_{φ_C}(η^*, η). Note that π^* is a deterministic policy, and hence we can write π^*(a^*_s|s) = 1 and estimate ‖η^* − η‖_1 in terms of D_{φ_C}(η^*, η), using log(t) ≤ t − 1.

Now we observe that the mapping π ↦ η^π is smooth; indeed, ∂_{π(a|s)} η^π = ρ^π(s)(I − γP_π)^{−1} e_{(s,a)} follows from the policy gradient theorem, see also [50, Proposition 48]. Altogether this implies R^* − R(η(t)) = O(e^{−ct}), which concludes the proof. The O notation hides constants that scale with the norm of the instantaneous reward vector r, inversely with the minimum state probability, and inversely with (1 − γ), where γ is the discount factor.
For y = 0 this is immediate, and for y > 0 the local strong convexity of x ↦ x^{2−σ} around y yields the desired estimate for x → y. Jensen's inequality then yields the claim; the case σ = 1 can be treated similarly. Compared to Theorem 22, the above Theorem 26 improves the O(t^{−1}) rates for σ ∈ [1, 2). Later, we conduct numerical experiments that indicate that the rates O(t^{−1/(σ−1)}) also hold for σ ≥ 2 and are tight.
Numerical examples We use the following example proposed by Kakade [29], which was also used in [12,47]. We consider an MDP with two states s_1, s_2 and two actions a_1, a_2, with the transitions and instantaneous rewards shown in Figure 2. We adopt the initial distribution µ(s_1) = 0.2, µ(s_2) = 0.8 and work with a discount factor of γ = 0.9, whereas Kakade studied the mean reward case. Note however that the experiments can be performed for arbitrarily large discount factors; we chose a smaller factor since the correspondence between the policy polytope and the state-action polytope is then clearer to see in the illustrations. We consider tabular softmax policies and plot the trajectories of vanilla PG, Kakade's NPG, and σ-NPG for the values σ ∈ {−0.5, 0, 0.5, 1, 1.5, 2, 3, 4} for 30 random initializations (the same for every method). We plot the trajectories in state-action space (Figure 3) and in the policy polytope (Figure 4). In order to put the convergence results from this section into perspective, we plot the evolution of the optimality gap R^* − R(θ(t)) (Figure 5). We use an adaptive step size ∆t_k, which prevents the blowup of the parameters for σ < 1, and hence we do not consider the number of iterations but rather the sum of the step sizes as a measure of time, t_n = Σ_{k=1}^n ∆t_k. For vanilla PG and σ ∈ (1, 2) we expect a decay at rate O(t^{−1}) [44] and O(t^{−1/(σ−1)}) by Theorem 26, respectively. Therefore we use a logarithmic (on both scales) plot for vanilla PG and σ > 1 and also indicate the predicted rate using a dashed gray line. For Kakade's and Morimura's NPG we expect linear convergence by Theorems 25 and 26 respectively and hence use a semi-logarithmic plot. First, we note that for σ ∈ {−0.5, 0, 0.5} the trajectories of the σ-NPG flow hit the boundary of the state-action polytope N, which is depicted in gray inside the simplex ∆ S×A. This is consistent with our analysis, since the functions φ_σ are Legendre type functions only for σ ∈ [1, ∞), and hence only in this case is the NPG flow guaranteed to admit long time solutions. However, we observe finite-time convergence of the trajectories towards the global optimum (see Figure 5), which we suspect to be due to the discretization error.
For the other methods, namely vanilla PG, Kakade's NPG and σ-NPG with σ ∈ [1, ∞), Theorem 22 and Theorem 21 show the global convergence of the gradient flow trajectories, which we also observe both in state-action space and in policy space (see Figures 3 and 4 respectively). When considering the convergence in objective value, we observe that both Kakade's and Morimura's NPG exhibit a linear rate of convergence, as asserted by Theorem 25 and Theorem 26, whereby Kakade's NPG appears to have more severe plateaus in some examples. For vanilla PG and σ > 1 we observe sublinear convergence rates of O(t^{−1}) and O(t^{−1/(σ−1)}) respectively, which are shown via dashed gray lines. This confirms the convergence rate O(t^{−1}) for vanilla PG [44] and indicates that the rate O(t^{−1/(σ−1)}) shown for σ ∈ (1, 2) is also valid in the regime σ > 2. Finally, we observe that larger σ appears to lead to more severe plateaus, which is apparent in the convergence in objective value as well as in the evolution in policy space and in state-action space.
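The experiments above use the Figure 2 MDP, whose transition data we only reference graphically. A minimal self-contained variant of the discretized Kakade NPG iteration on a randomly generated MDP (our own transitions, step size and iteration count, not the paper's setup) looks as follows.

```python
import numpy as np

rng = np.random.default_rng(4)
nS, nA, gamma = 3, 2, 0.9
P = rng.uniform(size=(nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)
mu = np.ones(nS) / nS
r = rng.uniform(size=(nS, nA))

def softmax(theta):
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def eta_of(pi):
    # normalized discounted state-action frequencies of policy pi
    P_pi = np.einsum('sap,sa->sp', P, pi)
    rho = (1 - gamma) * np.linalg.solve(np.eye(nS) - gamma * P_pi.T, mu)
    return rho[:, None] * pi

def reward(theta):
    return (eta_of(softmax(theta)) * r).sum()

def grad_reward(theta, h=1e-6):
    g = np.zeros(theta.size)
    for k in range(theta.size):
        e = np.zeros(theta.size); e[k] = h
        g[k] = (reward((theta.ravel() + e).reshape(nS, nA))
                - reward((theta.ravel() - e).reshape(nS, nA))) / (2 * h)
    return g

def kakade_gram(theta):
    pi = softmax(theta); rho = eta_of(pi).sum(axis=1)
    G = np.zeros((nS * nA, nS * nA))
    for s in range(nS):
        blk = slice(s * nA, (s + 1) * nA)
        G[blk, blk] = rho[s] * (np.diag(pi[s]) - np.outer(pi[s], pi[s]))
    return G

theta, dt = np.zeros((nS, nA)), 0.5
vals = [reward(theta)]
for _ in range(100):
    step = np.linalg.pinv(kakade_gram(theta)) @ grad_reward(theta)
    theta = theta + dt * step.reshape(nS, nA)
    vals.append(reward(theta))

# brute-force optimum over the nA**nS deterministic policies
best = -np.inf
for m in range(nA ** nS):
    pi_det = np.zeros((nS, nA))
    for s in range(nS):
        pi_det[s, (m // (nA ** s)) % nA] = 1.0
    best = max(best, (eta_of(pi_det) * r).sum())
```

On this small instance the iteration improves the reward monotonically toward the best deterministic policy, mirroring the linear convergence observed in Figure 5.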

Linear convergence of regularized Hessian natural policy gradient flows
It is known that strictly convex regularization in state-action space can yield linear convergence in reward optimization for vanilla and Kakade's natural policy gradients [44,19]. Using Lemma 24 we generalize the result for Kakade's NPG and provide a linear convergence result for general Hessian NPGs.
Theorem 27 (Linear convergence for regularized problems). Consider Setting 17, let φ be a Legendre type function, denote the regularized reward by R_λ(η) = ⟨r, η⟩ − λφ(η) for some λ > 0, fix η_0 ∈ int(N), and assume that the global maximizer η^*_λ of R_λ over N lies in the interior int(N). Assume that η : [0, ∞) → int(N) solves the natural policy gradient flow with respect to the regularized reward R_λ and the Hessian geometry induced by φ. For any c ∈ (0, λ) there exists a constant K > 0 such that D_φ(η^*_λ, η(t)) ≤ Ke^{−ct}. In particular, for any κ ∈ (κ_c, ∞) this implies R^*_λ − R_λ(η(t)) ≤ κλKe^{−ct}, where κ_c denotes the condition number of ∇²φ(η^*_λ).
Proof. We first recall that by Lemma 18 it holds that R_λ(η(t)) → R_λ(η^*_λ) and, by the uniqueness of the maximizer, η(t) → η^*_λ ∈ int(N). By Lemma 24 it suffices to show that for any ω ∈ (0, 1) it holds that R^*_λ − R_λ(η) ≥ ωλ D_φ(η^*_λ, η) for η in a neighborhood of η^*_λ (25), which shows the linear convergence of the trajectory in the Bregman divergence. For arbitrary m, M > 0 such that mI ≺ ∇²φ(η^*_λ) ≺ MI we can estimate R^*_λ − R_λ(η(t)) ≤ (M/m) λ D_φ(η^*_λ, η(t)) for η(t) close to η^*_λ, where we used that φ is m-strongly convex in a neighborhood of η^*_λ; this yields the second claim since M/m can be chosen arbitrarily close to κ_c.
In the proof of the previous theorem we used the following lemma.
Remark 29 (Location of maximizers). The condition η^*_λ ∈ int(N) assumed in Theorem 27 is satisfied if the gradient blow-up condition from Definition 19 is slightly strengthened. Indeed, suppose that for any η ∈ ∂N there is a direction v such that η + tv ∈ int(N) for small t > 0 and such that ∂_t φ(η + tv) → −∞ as t ↘ 0. Then by the concavity and continuity of R_λ we have R_λ(η + tv) > R_λ(η) for small t > 0, and hence η ≠ η^*_λ. Now we elaborate the consequences of the general convergence result given in Theorem 27 for Kakade's and σ-NPG flows.
Proof. We want to use Remark 29. Recall that ∂_{η(s,a)} φ_C(η) = log(η(s,a)/ρ(s)), where ρ(s) = Σ_a η(s, a) is the state marginal. Note that by Assumption 2 it holds that ρ(s) > 0. Hence, if η ∈ ∂N we can take any v ∈ R S×A such that η_ε := η + εv ∈ int(N) for small ε > 0. Writing ρ_ε for the associated state marginal, we obtain ∂_ε φ_C(η_ε) → −∞ for ε → 0, since η(s', a') = 0 for some s' ∈ S, a' ∈ A and ρ_ε(s) → ρ(s) > 0 for all s ∈ S.
Remark 32 (Extension to arbitrary regularizers). The results above do not cover arbitrary combinations of Hessian geometries and regularizers. However, the proof of Theorem 27 can be adapted to this case, where the only part that requires adjustment is (25), which couples the regularized reward to the Bregman divergence. In principle, this can be extended to the case of regularizers that differ from the function inducing the Hessian geometry.

Locally quadratic convergence for regularized problems
It is known that Kakade's NPG method, and more generally quasi-Newton policy gradient methods with suitable regularization and step sizes, converge at a locally quadratic rate [19,37]. Whereas these results regard the NPG method as an inexact Newton method in parameter space, we regard it as an inexact Newton method in state-action space, which allows us to directly leverage results from the optimization literature and thus formulate relatively short proofs. Our result extends the locally quadratic convergence rate to general Hessian NPG methods, which include in particular Kakade's and Morimura's NPG. Note that the result holds when the step size is equal to the inverse penalization strength, which is reminiscent of Newton's method converging for step size 1.
The proof of this result relies on the following convergence result for inexact Newton methods.
Theorem 34 (Theorem 3.3 in [21]). Consider an objective function f ∈ C²(R^d) with ∇²f(x) symmetric positive definite for every x ∈ R^d and assume that f admits a minimizer x^*. Let (x_k) be inexact Newton iterates given by x_{k+1} = x_k − ∇²f(x_k)^{−1}∇f(x_k) + ε_k and assume that they converge towards the minimizer x^*. If ‖ε_k‖ = O(‖∇f(x_k)‖^ω), then x_k → x^* at rate ω, i.e., ‖x_{k+1} − x^*‖ = O(‖x_k − x^*‖^ω).
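A toy instance of this theorem (our own construction, with f(x) = Σ_i (e^{x_i} − a_i x_i), so the minimizer x^* = log a is known in closed form and the Hessian is diagonal): perturbing each Newton step by an error of size proportional to ‖∇f(x_k)‖² preserves the quadratic convergence.

```python
import numpy as np

a = np.array([2.0, 0.5, 1.5])
x_star = np.log(a)                  # unique minimizer of f below

def grad_f(x):
    return np.exp(x) - a            # f(x) = sum(exp(x) - a*x), Hess f = diag(exp(x))

rng = np.random.default_rng(5)
x, errs = np.zeros(3), []
for k in range(6):
    g = grad_f(x)
    eps = rng.normal(size=3)
    eps *= 0.1 * np.linalg.norm(g) ** 2 / np.linalg.norm(eps)  # ||eps_k|| = O(||grad||^2)
    x = x - g / np.exp(x) + eps     # diagonal Newton step plus inexactness
    errs.append(np.linalg.norm(x - x_star))
```

With ω = 2 in the error bound, the distance to x^* roughly squares from one iteration to the next.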
We take this approach and show that the iterates of the regularized NPG method can be interpreted as an inexact Newton method in state-action space. For this, we first make the form of the Newton updates in state-action space explicit.
Lemma 35 (Newton iteration in state-action space). The iterates of Newton's method in state-action space are given by η_{k+1} = η_k − (Π^E_{TL} ∇²R_λ(η_k) Π^E_{TL})^+ Π^E_{TL} ∇R_λ(η_k), where R_λ(η) = ⟨r, η⟩ − λφ(η) is the regularized reward and Π^E_{TL} is the Euclidean projection onto the tangent space of the affine space L defined in (4).
Proof. The domain of the optimization problem is R S×A ≥0 ∩ L and hence we perform Newton's method on the affine subspace L. Writing L = η_0 + X for a linear subspace X, we can equivalently perform Newton's method on X, since the method is affine invariant. We denote the canonical inclusion by ι : X → L, x ↦ x + η_0, and set f(x) := R_λ(ιx). Then we obtain the Newton iterates x_k and η_k = ιx_k. A direct computation yields ∇f(x) = ι^⊤∇R_λ(ιx) and ∇²f(x) = ι^⊤∇²R_λ(ιx)ι. Hence, we obtain the claimed expression, where we used AA^+ = Π_{range(A)} and (A^⊤)^+A^⊤ = Π_{ker(A^⊤)^⊥} = Π_{range(A)}.
Lemma 36. Let (θ_k) be the iterates of a Hessian NPG induced by a strictly convex function φ with step size ∆t, i.e., θ_{k+1} = θ_k + ∆t G(θ_k)^+ ∇_θ R_λ(θ_k), where the Gram matrix is given by G(θ) = DP(θ)^⊤ ∇²φ(η_θ) DP(θ). Then the state-action iterates η_k := η_{θ_k} agree with the corresponding Riemannian update in state-action space up to an error of order O(‖∇R_λ(η_k)‖²).
Proof. This follows by writing P for the mapping θ ↦ η_θ and an application of Taylor's theorem.
Proof of Theorem 33. Since ∇²R_λ(η) = −λ∇²φ(η), the Newton iteration of Lemma 35 coincides, by the preceding two lemmata, with the NPG update in state-action space for step size ∆t = λ^{−1}, up to an error ε_k = O(‖∇R_λ(η_k)‖²); Theorem 34 with ω = 2 then proves the claim.
Remark 37. A benefit of regarding the iteration as an inexact Newton method in state-action space is that the problem is strongly convex in state-action space. In contrast, in policy space the problem is non-convex, which makes the analysis in that space more delicate. Further, the corresponding Riemannian metric might not be the Hessian metric of the regularizer in policy space (see also Remark 12). In the parameter θ, the NPG algorithm can be perceived as a generalized Gauss-Newton method; however, the reward function is non-convex in parameter space. Further, for overparametrized policy models, i.e., when dim(Θ) > dim(∆ S A) = |S|(|A|−1), the Hessian ∇²R(θ^*) cannot be positive definite, which makes the analysis in parameter space less immediate. Note that the tabular softmax policies in (10) are overparametrized since in this case dim(Θ) = |S||A|.

Discussion
We provide a study of a general class of natural policy gradient methods arising from Hessian geometries in state-action space. This covers, in particular, the notions of NPG due to Kakade and Morimura et al., which are induced by the conditional entropy and the entropy respectively. Leveraging results on gradient flows in Hessian geometries, we obtain global convergence guarantees of NPG flows for regular policy parametrizations, show that both Kakade's and Morimura's NPG converge linearly, and obtain sublinear convergence rates for NPGs associated with β-divergences. We provide experimental evidence of the tightness of these rates. Finally, we interpret the NPG with respect to the Hessian geometry induced by the regularizer, with step size equal to the inverse regularization strength, as an inexact Newton method in state-action space, which allows for a very compact proof of the locally quadratic convergence of this method.
Our convergence analysis currently covers neither general parametric policy classes nor partially observable MDPs, which we consider important future directions. Further, we study only the planning problem, i.e., we assume access to exact gradients; hence a combination of our study of NPG methods in state-action space with estimation problems would be a natural extension.

Figure 1 :
Figure 1: Schematic drawing of parametric models with an objective function and the resulting parameter objective function L; note that neither the choice of geometry in the model space nor the factorization nor the model space itself is uniquely determined by the objective function L.

Figure 2 :
Figure 2: Transition graph and reward of the MDP example.

Figure 3 :
Figure 3: State-action trajectories for different PG methods, namely vanilla PG, Kakade's NPG and σ-NPG, where Morimura's NPG corresponds to σ = 1; the state-action polytope is shown in gray inside a three-dimensional projection of the simplex ∆ S×A; shown are trajectories for the same 30 random initializations for every method; the maximizer η^* is located at the upper left corner of the state-action polytope.

Figure 4 :
Figure 4: Plots of the trajectories of the individual methods inside the policy polytope ∆ S A ≅ [0, 1]²; additionally, a heatmap of the reward function π ↦ R(π) is shown; the maximizer π^* is located at the upper left corner of the policy polytope.
Figure 5: Plot of the optimality gaps R^* − R(θ(t)) during optimization; note that for vanilla PG and σ > 1 these are log-log plots, since we expect a decay like t^{−1} and t^{−1/(σ−1)} respectively, shown as dashed gray lines; Kakade's and Morimura's NPG are shown on a semi-logarithmic plot, since we expect linear convergence; finally, for σ < 1 we observe finite-time convergence.
Lemma 28. Let φ be a strictly convex function defined on an open convex set Ω ⊆ R^d with unique minimizer x^*. Then for any ω ∈ (0, 1) there is a neighborhood N_ω of x^* such that φ