Abstract
The softmax policy gradient (PG) method, which performs gradient ascent under softmax policy parameterization, is arguably one of the de facto implementations of policy optimization in modern reinforcement learning. For \(\gamma \)-discounted infinite-horizon tabular Markov decision processes (MDPs), remarkable progress has recently been achieved towards establishing global convergence of softmax PG methods in finding a near-optimal policy. However, prior results fall short of delineating clear dependencies of convergence rates on salient parameters such as the cardinality \(|{\mathcal {S}}|\) of the state space and the effective horizon \(\frac{1}{1-\gamma }\), both of which could be excessively large. In this paper, we deliver a pessimistic message regarding the iteration complexity of softmax PG methods, despite assuming access to exact gradient computation. Specifically, we demonstrate that the softmax PG method with stepsize \(\eta \) can take
to converge, even in the presence of a benign policy initialization and an initial state distribution amenable to exploration (so that the distribution mismatch coefficient is not exceedingly large). This is accomplished by characterizing the algorithmic dynamics over a carefully constructed MDP containing only three actions. Our exponential lower bound hints at the necessity of carefully adjusting update rules or enforcing proper regularization in accelerating PG methods.
1 Introduction
Despite their remarkable empirical popularity in modern reinforcement learning [35, 41], theoretical underpinnings of policy gradient (PG) methods and their variants [20, 23, 37, 43, 47] remain severely obscured. Due to the nonconcave nature of value function maximization induced by complicated dynamics of the environments, it is in general highly challenging to pinpoint the computational efficacy of PG methods in finding a near-optimal policy. Motivated by their practical importance, a recent strand of work sought to make progress towards demystifying the effectiveness of policy gradient type methods (e.g., [1, 7,8,9, 11, 18, 22, 24, 31, 32, 34, 40, 53,54,55, 57, 59]), focusing primarily on canonical settings such as tabular Markov decision processes (MDPs) for discrete-state problems and linear quadratic regulators for continuous-state problems.
The current paper studies PG methods with softmax parameterization—commonly referred to as softmax policy gradient methods—which are among the de facto implementations of PG methods in practice. An intriguing theoretical result was recently obtained by Agarwal et al. [1], which established asymptotic global convergence of softmax PG methods for infinite-horizon \(\gamma \)-discounted tabular MDPs. Subsequently, Mei et al. [34] strengthened the theory by demonstrating that softmax PG methods are capable of finding an \(\varepsilon \)-optimal policy with an iteration complexity proportional to \(1/\varepsilon \) (see Table 1 for the precise form). While these results take an important step towards understanding the effectiveness of softmax PG methods, caution needs to be exercised before declaring fast convergence of the algorithms. In particular, the iteration complexity derived by Mei et al. [34] falls short of delineating clear dependencies on salient parameters of the MDP, such as the size \(|{\mathcal {S}}|\) of the state space and the effective horizon \(1/(1-\gamma )\). These parameters are, more often than not, enormous in contemporary RL applications, and might play a pivotal role in determining the scalability of softmax PG methods.
Additionally, it is worth noting that existing literature largely concentrated on developing algorithm-dependent upper bounds on the iteration complexities. Nevertheless, we recommend caution when directly comparing computational upper bounds for distinct algorithms: the superiority of the computational upper bound for one algorithm does not necessarily imply that this algorithm outperforms others, unless we can certify the tightness of all upper bounds being compared. As a more concrete example, it is of practical interest to benchmark softmax PG methods against natural policy gradient (NPG) methods with softmax parameterization, the latter of which is a variant of policy optimization lying underneath several mainstream RL algorithms such as proximal policy optimization (PPO) [39] and trust region policy optimization (TRPO) [38]. While it is tempting to claim superiority of NPG methods over softmax PG methods—given the appealing convergence properties of NPG methods [1] (see Table 1)—existing theory fell short of reaching such a conclusion, due to the absence of convergence lower bounds for softmax PG methods in prior literature.
The above considerations thus lead to a natural question that we aim to address in the present paper:
Can we develop a lower bound on the iteration complexity of softmax PG methods that reflects explicit dependency on salient parameters of the MDP of interest?
1.1 Main result
As an attempt to address the question posed above, our investigation delivers a somewhat surprising message that can be described in words as follows:
Softmax PG methods can take (super)exponential time to converge, even in the presence of a benign initialization and an initial state distribution amenable to exploration.
Our finding, which is concerned with a discounted infinite-horizon tabular MDP, is formally stated in the following theorem. Here and throughout, \(|{\mathcal {S}}|\) denotes the size of the state space \({\mathcal {S}}\), \(0<\gamma <1\) stands for the discount factor, \(V^{\star }\) indicates the optimal value function, \(\eta >0\) is the learning rate or stepsize, whereas \(V^{(t)}\) represents the value function estimate of softmax PG methods in the \(t\)-th iteration. All immediate rewards are assumed to fall within \([-1,1]\). See Sect. 2 for formal descriptions.
Theorem 1
Assume that the softmax PG method adopts a uniform initial state distribution, a uniform policy initialization, and has access to exact gradient computation. Suppose that \(0<\eta < (1-\gamma )^2/5\). Then there exist universal constants \(c_1, c_2, c_3>0\) such that: for any \(0.96< \gamma < 1\) and \( |{\mathcal {S}}| \ge c_3 (1-\gamma )^{-6}\), one can find a \(\gamma \)-discounted MDP with state space \({\mathcal {S}}\) that takes the softmax PG method at least
to reach
Remark 1
(Action space) The MDP we construct contains at most three actions for each state.
Remark 2
(Stepsize range) Our lower bound operates under the assumption that \(\eta < (1-\gamma )^2/5\). In comparison, prior convergence guarantees for PG-type methods with softmax parameterization (e.g., Agarwal et al. [1, Theorem 5.1] and Mei et al. [34, Theorem 6]) required \(\eta <(1-\gamma )^3/8\), a range of stepsizes fully covered by our theorem. In fact, prior works could only guarantee monotonicity of softmax PG methods (in terms of the value function) within the range \(\eta <(1-\gamma )^2/5\) (see Agarwal et al. [1, Lemma C.2]).
Remark 3
While we could also provide explicit values for the constants \(c_1, c_2, c_3>0\), these values are not informative, and hence we omit them to streamline the presentation.
For simplicity of presentation, Theorem 1 is stated for the long-effective-horizon regime where \(\gamma > 0.96\); it continues to hold when \(\gamma > c_0\) for some smaller constant \(c_0>0\). Our result is obtained by exhibiting a hard MDP instance—a properly augmented chain-like MDP—for which softmax PG methods converge extremely slowly even when perfect model specification is available. Several remarks and implications of our result are in order.
Comparisons with prior results. Table 1 provides an extensive comparison of the iteration complexities—including both upper and lower bounds—of PG and NPG methods under softmax parameterization. As suggested by our result, the iteration complexity \(O({\mathcal {C}}^2_{\textsf{spg}}({\mathcal {M}}) \frac{1}{\varepsilon })\) derived in Mei et al. [34] (see Table 1) might not be as rosy as it seems for problems with large state space and long effective horizon; in fact, the crucial quantity \({\mathcal {C}}_{\textsf{spg}}({\mathcal {M}})\) therein could scale in a prohibitive manner with both \(|{\mathcal {S}}|\) and \(\frac{1}{1-\gamma }\). Mei et al. [34] also developed a lower bound on the iteration complexity of softmax PG methods, which falls short of capturing the influence of the state space dimension and might become smaller than 1 unless \(\varepsilon \) is very small (e.g., \(\varepsilon \lesssim (1-\gamma )^3\)) for problems with long effective horizons. In addition, Mei et al. [33] provided some interesting evidence that a poorly-initialized softmax PG algorithm can get stuck at suboptimal policies for a single-state MDP (i.e., the bandit problem). This result, however, fell short of providing a complete runtime analysis and did not look into the influence of a large state space. By contrast, our theory reveals that softmax PG methods can take exponential time to reach even a moderate accuracy level.
Slow convergence even with benign distribution mismatch. Existing computational complexities for policy gradient type methods (e.g., [1, 34]) typically scale polynomially in the so-called distribution mismatch coefficient \(\big \Vert \frac{d^{\pi }_{\rho }}{\mu } \big \Vert _{\infty } \), where \(d^{\pi }_{\rho }\) stands for a certain discounted state visitation distribution (see (13) in Sect. 2), and \(\mu \) denotes the distribution over initial states. It is thus natural to wonder whether the exponential lower bound in Theorem 1 is a consequence of an exceedingly large distribution mismatch coefficient. This, however, is not the case; in fact, our theory chooses \(\mu \) to be a benign uniform distribution so that \(\Vert \frac{d^{\pi }_{\rho }}{\mu }\Vert _{\infty } \le \Vert \frac{1}{\mu }\Vert _{\infty } \le |{\mathcal {S}}|\), which scales at most linearly in \(|{\mathcal {S}}|\).
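As a quick numeric sanity check of the last inequality (a sketch of ours, not from the paper; the random vector below merely stands in for a visitation distribution \(d^{\pi }_{\rho }\)):

```python
import numpy as np

# For a uniform initial distribution mu over |S| states, the mismatch
# coefficient max_s d(s)/mu(s) equals |S| * max_s d(s) <= |S| for ANY
# probability distribution d, since max_s d(s) <= 1.
rng = np.random.default_rng(1)
num_states = 100
mu = np.ones(num_states) / num_states      # uniform initial state distribution
d = rng.dirichlet(np.ones(num_states))     # stand-in visitation distribution
mismatch = np.max(d / mu)
print(mismatch <= num_states)              # always True
```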
Benchmarking with softmax NPG methods. Our algorithm-specific lower bound suggests that softmax PG methods—in their vanilla form—might take a prohibitively long time to converge when the state space and effective horizon are large. This is in stark contrast to the convergence rate of NPG type methods, whose iteration complexity is dimension-free and scales only polynomially with the effective horizon [1, 11]. Consequently, our results shed light on the practical superiority of NPG-based algorithms such as PPO [39] and TRPO [38].
Crux of our design. As we shall elucidate momentarily in Sect. 3, our exponential lower bound is obtained through analyzing the trajectory of softmax PG methods on a carefully designed MDP instance with no more than 3 actions per state, when a uniform initialization scheme and a uniform initial state distribution are adopted. Our construction underscores the critical challenge of credit assignment [42] in RL compounded by the presence of delayed rewards, long horizon, and intertwined interactions across states. While it is difficult to elucidate the source of the exponential lower bound without presenting our MDP construction, we take a moment to point out some critical properties that underlie our designs. To be specific, we seek to design a chain-like MDP containing \(H=O\big ( \frac{1}{1-\gamma } \big )\) key primary states \(\{1,\ldots ,H\}\) (each coupled with many auxiliary states), for which the softmax PG method satisfies the following properties.

For the two key primary states, we have
$$\begin{aligned} \min \big \{ \textsf{convergence}\text {-}\textsf{time}\text {(state 1)}, \, \textsf{convergence}\text {-}\textsf{time}\text {(state 2)} \big \} \ge \frac{|{\mathcal {S}}|}{\eta } . \end{aligned}$$(3) 
(A blowing-up phenomenon) For each key primary state \(3\le s\le H=O\big (\frac{1}{1-\gamma }\big )\), one has
$$\begin{aligned} \textsf{convergence}\text {-}\textsf{time}\text {(state }s\,) \gtrsim \big ( \textsf{convergence}\text {-}\textsf{time}\text {(state }s-2\,) \big )^{1.5}, \qquad 3\le s\le H. \end{aligned}$$(4)
Here, it is understood that \(\textsf{convergence}\text {-}\textsf{time}\text {(state }s\,)\) represents informally the time taken for the value function of state s to be sufficiently close to its optimal value. The blowing-up phenomenon described above is precisely the source of our (super)exponential lower bound.
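To see how the recursion (4) alone forces (super)exponential growth, consider the following toy iteration (ours, for illustration; the seed value \(t_1=10\) is hypothetical):

```python
def crossing_time_lower_bounds(t_seed: float, H: int) -> dict:
    """Iterate the recursion t_s = (t_{s-2})**1.5 of property (4) along
    s = 1, 3, 5, ..., starting from a hypothetical seed bound t_1 = t_seed."""
    t = {1: t_seed}
    s = 3
    while s <= H:
        t[s] = t[s - 2] ** 1.5   # the exponent multiplies by 1.5 at every step
        s += 2
    return t

# after k steps the exponent is 1.5**k: t_9 = 10**(1.5**4), roughly 1.2e5
bounds = crossing_time_lower_bounds(t_seed=10.0, H=9)
```

Even with a modest seed, the exponent itself grows geometrically in \(s\), which is exactly how the chain of primary states converts a polynomial initial delay into a (super)exponential one.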
1.2 Other related works
Non-asymptotic analysis of (natural) policy gradient methods. Moving beyond tabular MDPs, finite-time convergence guarantees of PG/NPG methods and their variants have recently been studied for control problems (e.g., [18, 19, 44, 58]), regularized MDPs (e.g., [11, 24, 54]), constrained MDPs (e.g., [15, 50]), robust MDPs (e.g., [29, 60]), MDPs with function approximation (e.g., [1, 2, 10, 25, 30, 45]), Markov games (e.g., [13, 14, 46, 49, 61]), and their use in actor-critic methods (e.g., [3, 12, 48, 51]).
Other policy parameterizations. In addition to softmax parameterization, several other policy parameterization schemes have also been investigated in the context of policy optimization and reinforcement learning at large. For example, [1, 24, 54, 56] studied the convergence of projected PG methods and policy mirror descent with direct parameterization, [4] introduced the so-called mallow parameterization, while [33] studied the escort parameterization. Some of these parameterizations were proposed in response to the ineffectiveness of softmax parameterization observed in practice.
Lower bounds. Establishing information-theoretic or algorithm-specific lower bounds on the statistical and computational complexities of RL algorithms—often achieved by constructing hard MDP instances—plays an instrumental role in understanding the bottlenecks of RL algorithms. To give a few examples, Azar et al. [5], Domingues et al. [16], Li et al. [28], Yan et al. [52] developed information-theoretic lower bounds on the sample complexity of RL under multiple sampling mechanisms (e.g., sampling with a generative model, online RL, and offline/batch RL), Li et al. [27] established an algorithm-dependent lower bound on the sample complexity of Q-learning, whereas Khamaru et al. [21], Pananjady and Wainwright [36] developed instance-dependent lower bounds for policy evaluation. Additionally, Agarwal et al. [1] constructed a chain-like MDP whose value function under direct parameterization might contain very flat saddle points under a certain initial state distribution, highlighting the role of distribution mismatch coefficients in policy optimization. Finally, exponential-time convergence of gradient descent has been observed in other nonconvex problems as well (e.g., [17]) despite its asymptotic convergence [26], although the context and analysis therein are drastically different from what happens in RL settings.
1.3 Paper organization
The rest of this paper is organized as follows. In Sect. 2, we introduce the basics of Markov decision processes, and describe the softmax policy gradient method along with several key functions/quantities. Section 3 constructs a chain-like MDP, which is the hard MDP instance underlying our computational lower bound for PG methods. In Sect. 4, we outline the proof of Theorem 1, starting with the proof of a weaker version before establishing Theorem 1 in full. The proofs of all technical lemmas are deferred to the appendix. We conclude the paper in Sect. 5 with a summary of our findings.
2 Background
In this section, we introduce the basics of MDPs, and formally describe the softmax PG method. Here and throughout, we denote by \(\Delta ({\mathcal {X}})\) the probability simplex over a set \({\mathcal {X}}\), and let \(|{\mathcal {X}}|\) represent the cardinality of the set \({\mathcal {X}}\). Given two probability distributions p and q over \({\mathcal {S}}\), we adopt the notation \(\big \Vert \frac{p}{q} \big \Vert _{\infty } = \max _{s\in {\mathcal {S}}} \frac{p(s)}{q(s)}\) and \(\big \Vert \frac{1}{q} \big \Vert _{\infty } = \max _{s\in {\mathcal {S}}} \frac{1}{q(s)}\). Throughout this paper, the notation \(f({\mathcal {M}})\gtrsim g({\mathcal {M}})\) (resp. \(f({\mathcal {M}})\lesssim g({\mathcal {M}})\)) means there exists some universal constant \(c>0\) independent of the parameters of the MDP \({\mathcal {M}}\) such that \(f({\mathcal {M}})\ge c g({\mathcal {M}})\) (resp. \(f({\mathcal {M}})\le c g({\mathcal {M}})\)), while the notation \(f({\mathcal {M}})\asymp g({\mathcal {M}})\) means that \(f({\mathcal {M}})\gtrsim g({\mathcal {M}})\) and \(f({\mathcal {M}})\lesssim g({\mathcal {M}})\) hold simultaneously.
Infinite-horizon discounted MDP. Let \({\mathcal {M}} = ({\mathcal {S}}, \{{\mathcal {A}}_s \}_{s\in {\mathcal {S}}}, P, r,\gamma )\) be an infinite-horizon discounted MDP. Here, \({\mathcal {S}}\) represents the state space, \({\mathcal {A}}_s\) denotes the action space associated with state \(s\in {\mathcal {S}}\), \(\gamma \in (0,1)\) indicates the discount factor, P is the probability transition kernel (namely, for each state-action pair (s, a), \(P(\cdot \mid s,a)\in \Delta ({\mathcal {S}})\) denotes the transition probability from state s to the next state when action a is taken), and r stands for a deterministic reward function (namely, r(s, a) is the immediate reward received in state s upon executing action a). Throughout this paper, we assume normalized rewards such that \(-1 \le r(s,a)\le 1\) for any state-action pair (s, a). In addition, we concentrate on the scenario where \(\gamma \) is quite close to 1, and often refer to \(\frac{1}{1-\gamma }\) as the effective horizon of the MDP.
Policy, value function, Q-function and advantage function. The agent operates by adopting a policy \(\pi \), which is a (randomized) action selection rule based solely on the current state of the MDP. More precisely, for any state \(s\in {\mathcal {S}}\), we use \(\pi (\cdot \mid s)\in \Delta ({\mathcal {A}}_s)\) to specify a probability distribution, with \(\pi (a \mid s)\) denoting the probability of executing action \(a\in {\mathcal {A}}_s\) when in state s. The value function \(V^{\pi }: {\mathcal {S}}\rightarrow {\mathbb {R}}\) of a policy \(\pi \)—which indicates the expected discounted cumulative reward induced by policy \(\pi \)—is defined as
$$\begin{aligned} V^{\pi }(s) \,{:=}\, {\mathbb {E}}\left[ \sum _{k=0}^{\infty } \gamma ^k r\big (s^k,a^k\big ) \,\Big |\, s^0=s \right] , \qquad \forall s\in {\mathcal {S}}. \end{aligned}$$(5)
Here, the expectation is taken over the randomness of the MDP trajectory \(\{(s^k,a^k)\}_{k\ge 0}\) and the policy, where \(s^0=s\) and, for all \(k\ge 0\), \(a^k \sim \pi (\cdot \mid s^k)\) follows the policy \(\pi \) and \(s^{k+1}\sim P(\cdot \mid s^k, a^k)\) is generated by the transition kernel P. Analogously, we shall also define the value function \(V^{\pi }(\mu )\) of a policy \(\pi \) when the initial state is drawn from a distribution \(\mu \) over \({\mathcal {S}}\), namely,
$$\begin{aligned} V^{\pi }(\mu ) \,{:=}\, \mathop {{\mathbb {E}}}\limits _{s\sim \mu } \big [ V^{\pi }(s) \big ]. \end{aligned}$$(6)
Additionally, the Q-function \(Q^{\pi }\) of a policy \(\pi \)—namely, the expected discounted cumulative reward under policy \(\pi \) given an initial state-action pair \((s^0,a^0)=(s,a)\)—is formally defined by
$$\begin{aligned} Q^{\pi }(s,a) \,{:=}\, {\mathbb {E}}\left[ \sum _{k=0}^{\infty } \gamma ^k r\big (s^k,a^k\big ) \,\Big |\, s^0=s,\, a^0=a \right] , \qquad \forall (s,a), \end{aligned}$$(7)
where the expectation is again over the randomness of the MDP trajectory \(\{(s^k,a^k)\}_{k\ge 1}\) when policy \(\pi \) is adopted. In addition, the advantage function of policy \(\pi \) is defined as
$$\begin{aligned} A^{\pi }(s,a) \,{:=}\, Q^{\pi }(s,a) - V^{\pi }(s) \end{aligned}$$(8)
for every state-action pair (s, a).
A major goal is to find a policy that optimizes the value function and the Q-function. Throughout this paper, we denote respectively by \(V^{\star }\) and \(Q^{\star }\) the optimal value function and optimal Q-function, namely,
$$\begin{aligned} V^{\star }(s) \,{:=}\, \max _{\pi } V^{\pi }(s) \qquad \text {and}\qquad Q^{\star }(s,a) \,{:=}\, \max _{\pi } Q^{\pi }(s,a), \qquad \forall (s,a). \end{aligned}$$(9)
Softmax parameterization and policy gradient methods. The family of policy optimization algorithms attempts to identify optimal policies by resorting to optimization-based algorithms. To facilitate differentiable optimization, a widely adopted scheme is to parameterize policies using softmax mappings. Specifically, for any real-valued parameter \(\theta =[\theta (s,a)]_{s\in {\mathcal {S}}, a\in {\mathcal {A}}_s}\), the corresponding softmax policy \(\pi _\theta \,{:=}\, \textsf{softmax}(\theta )\) is defined such that
$$\begin{aligned} \pi _{\theta }(a \mid s) \,{:=}\, \frac{\exp \big (\theta (s,a)\big )}{\sum _{a'\in {\mathcal {A}}_s} \exp \big (\theta (s,a')\big )}, \qquad \forall s\in {\mathcal {S}},\, a\in {\mathcal {A}}_s. \end{aligned}$$(10)
With the aim of maximizing the value function under softmax parameterization, namely,
$$\begin{aligned} \mathop {\textrm{maximize}}\limits _{\theta }\quad V^{\pi _{\theta }}(\mu ), \end{aligned}$$
the softmax PG method proceeds by adopting gradient ascent update rules w.r.t. \(\theta \):
$$\begin{aligned} \theta ^{(t+1)} = \theta ^{(t)} + \eta \, \nabla _{\theta } V^{\pi _{\theta ^{(t)}}}(\mu ), \qquad t=0,1,\ldots \end{aligned}$$(11)
Here and throughout, we let \({V}^{(t)} = V^{\pi ^{(t)}}\) and \({Q}^{(t)} = Q^{\pi ^{(t)}}\) abbreviate respectively the value function and Q-function of the policy iterate \(\pi ^{(t)}{:=}\pi _{\theta ^{(t)}}\) in the \(t\)-th iteration, and \(\eta >0\) denotes the stepsize or learning rate. Interestingly, the gradient \( \nabla _{\theta } {V}^{\pi _{\theta }}\) under softmax parameterization (10) admits a closed-form expression [1], that is, for any state-action pair (s, a),
$$\begin{aligned} \frac{\partial V^{\pi _{\theta }}(\mu )}{\partial \theta (s,a)} = \frac{1}{1-\gamma }\, d_{\mu }^{\pi _{\theta }}(s)\, \pi _{\theta }(a \mid s)\, A^{\pi _{\theta }}(s,a). \end{aligned}$$(12)
Here, \(d_{\mu }^{\pi _\theta }(s)\) represents the discounted state visitation distribution of the policy \(\pi _\theta \) given the initial state \(s^0\sim \mu \):
$$\begin{aligned} d_{\mu }^{\pi _{\theta }}(s) \,{:=}\, (1-\gamma ) \mathop {{\mathbb {E}}}\limits _{s^0\sim \mu } \left[ \sum _{k=0}^{\infty } \gamma ^{k} \mathbb {1}\big \{ s^{k} = s \big \} \right] , \end{aligned}$$(13)
with the expectation taken over the randomness of the MDP trajectory \(\{(s^k,a^k)\}_{k\ge 0}\) under the policy \(\pi \) and the initial state distribution \(\mu \). In words, \(d_{\mu }^{\pi }(s)\) measures—starting from an initial distribution \(\mu \)—how frequently state s will be visited in a properly discounted fashion. Throughout this paper, we shall denote \({A}^{(t)} {:=}\, A^{\pi ^{(t)}}\) and \(d_{\mu }^{(t)}(s) {:=}\,d_{\mu }^{\pi ^{(t)}}(s)\) for notational simplicity.
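Putting the pieces above together, the exact-gradient softmax PG iteration can be sketched in a few lines. The snippet below is a minimal sketch of ours for illustration only: it runs the update on a small randomly generated MDP rather than the hard instance of Sect. 3, and all names are our own.

```python
import numpy as np

def softmax_pg_step(theta, P, r, mu, gamma, eta):
    """One exact softmax PG update via the closed-form policy gradient
    (1/(1-gamma)) * d_mu(s) * pi(a|s) * A(s,a).
    Shapes: theta, r are (S, A); P is (S, A, S); mu is (S,)."""
    S = mu.shape[0]
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    pi = z / z.sum(axis=1, keepdims=True)              # softmax policy pi(a|s)
    P_pi = np.einsum('sa,sat->st', pi, P)              # induced transition matrix
    r_pi = (pi * r).sum(axis=1)                        # expected one-step reward
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)   # V^pi via linear solve
    Q = r + gamma * np.einsum('sat,t->sa', P, V)          # Q^pi
    A = Q - V[:, None]                                    # advantage A^pi
    # discounted visitation: d^T = (1-gamma) * mu^T (I - gamma P_pi)^{-1}
    d = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, mu)
    grad = d[:, None] * pi * A / (1 - gamma)
    return theta + eta * grad, mu @ V                  # new theta, V^{(t)}(mu)

# toy 3-state, 2-action MDP with uniform mu and uniform policy initialization
rng = np.random.default_rng(0)
S, A_num, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A_num))         # random transitions
r = rng.uniform(-1, 1, size=(S, A_num))                # rewards in [-1, 1]
mu = np.ones(S) / S
theta = np.zeros((S, A_num))
values = []
for _ in range(200):
    theta, v = softmax_pg_step(theta, P, r, mu, gamma, eta=(1 - gamma)**2 / 5)
    values.append(v)
```

With rewards in \([-1,1]\), every \(V^{(t)}(\mu )\) is bounded by \(1/(1-\gamma )\) in magnitude, and within this stepsize range the iterates are expected to improve monotonically, consistent with [1, Lemma C.2].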
3 Construction of a hard MDP
This section constructs a discounted infinite-horizon MDP \({\mathcal {M}}=({\mathcal {S}}, \{{\mathcal {A}}_s\}_{s\in {\mathcal {S}}}, P, r, \gamma )\), as depicted in Fig. 1, which forms the basis of the exponential lower bound claimed in this paper. In addition to the basic notation already introduced in Sect. 2, we remark on the action space as follows.

For each state \(s\in {\mathcal {S}}\), we have \({\mathcal {A}}_s \subseteq \{a_0,a_1, a_2\}\). For convenience of presentation, we allow the action space to vary with \(s\in {\mathcal {S}}\), but it always comprises no more than 3 actions.
State space partitioning. The states of our MDP exhibit certain group structure. To be precise, we partition the state space \({\mathcal {S}}\) into a few disjoint subsets
which entails:

State 0 (an absorbing state);

Two key “buffer” state subsets \({\mathcal {S}}_1\) and \({\mathcal {S}}_2\);

A set of \(H-2\) key primary states \({\mathcal {S}}_{\textsf{primary}}{:=}\{3, \ldots , H \}\);

A set of H key adjoint states \({\mathcal {S}}_{\textsf{adj}}{:=}\{ {\overline{1}}, {\overline{2}}, \ldots , {\overline{H}} \}\);

2H “booster” state subsets \(\widehat{{\mathcal {S}}}_1, \ldots , {\widehat{{\mathcal {S}}}}_H\), \({\widehat{{\mathcal {S}}}}_{{\overline{1}}}, \ldots , {\widehat{{\mathcal {S}}}}_{{\overline{H}}}\).
Remark 4
Our subsequent analysis largely concentrates on the subsets \({\mathcal {S}}_1\), \({\mathcal {S}}_2\), \({\mathcal {S}}_{\textsf{primary}}\) and \({\mathcal {S}}_{\textsf{adj}}\). In particular, each state \(s\in \{3,\ldots , H\}\) is paired with what we call an adjoint state \({\overline{s}}\), whose role will be elucidated shortly. In addition, state \({\overline{1}}\) (resp. state \({\overline{2}}\)) can be viewed as the adjoint state of the set \({\mathcal {S}}_1\) (resp. \({\mathcal {S}}_2\)). The sets \({\mathcal {S}}_{\textsf{primary}}\) and \({\mathcal {S}}_{\textsf{adj}}\) comprise a total number of \(2H-2\) states; in comparison, \({\mathcal {S}}_1\) and \({\mathcal {S}}_2\) are chosen to be much larger and contain a number of replicated states, a crucial design component that helps ensure the property (3) under a uniform initial state distribution. As we shall make clear momentarily, the "booster" state sets are introduced mainly to help boost the discounted visitation distribution of the states in \({\mathcal {S}}_1\), \({\mathcal {S}}_2\), \({\mathcal {S}}_{\textsf{primary}}\), and \({\mathcal {S}}_{\textsf{adj}}\) at the initial stage.
We shall also specify below the size of these state subsets as well as some key parameters, where the choices of the quantities \(c_{\textrm{h}},c_{\textrm{b},1},c_{\textrm{b},2},c_{\textrm{m}}\asymp 1\) will be made clear in the analysis (cf. (35)).

H is taken to be on the same order as the “effective horizon” of this discounted MDP, namely,
$$\begin{aligned} H = \frac{c_{\textrm{h}}}{1-\gamma }. \end{aligned}$$(15) 
The two buffer state subsets \({\mathcal {S}}_1\) and \({\mathcal {S}}_2\) have size
$$\begin{aligned} |{\mathcal {S}}_1| = c_{\textrm{b},1}(1-\gamma )|{\mathcal {S}}| \qquad \text {and}\qquad |{\mathcal {S}}_2| = c_{\textrm{b},2}(1-\gamma )|{\mathcal {S}}|. \end{aligned}$$(16) 
The booster state sets are of the same size, namely,
$$\begin{aligned} |{\widehat{{\mathcal {S}}}}_1| = \cdots = |{\widehat{{\mathcal {S}}}}_H| = |{\widehat{{\mathcal {S}}}}_{{\overline{1}}}| = \cdots = |{\widehat{{\mathcal {S}}}}_{{\overline{H}}}| = c_{\textrm{m}}(1-\gamma )|{\mathcal {S}}|. \end{aligned}$$(17)
Probability transition kernel and reward function. We now describe the probability transition kernel and the reward function for each state subset. Before continuing, we find it helpful to isolate a few key parameters that will be used frequently in our construction:
where \(s\in \{1,2,\ldots ,H\}\), and \(c_{\textrm{p}}>0\) is some small constant that shall be specified later (see (35)). To facilitate understanding, we shall often treat \(\tau _s\) and \(r_s\) (\(s\le H\)) as quantities that are all fairly close to 0.5 (which would happen if \(\gamma \) is close to 1 and \(H=\frac{c_{\textrm{h}}}{1-\gamma }\) for \(c_{\textrm{h}}\) sufficiently small).
We are now positioned to make precise descriptions of both P and r as follows.

Absorbing state 0: singleton action space \(\{a_0\}\),
$$\begin{aligned} P(0 \mid 0, a_0)=1, \qquad \qquad r(0,a_0)=0. \end{aligned}$$(19)This is an absorbing state, namely, the MDP will stay in this state permanently once entered. As we shall see below, taking action \(a_0\) in an arbitrary state leads to state 0 immediately.

Key primary states \(s\in \{ 3,\ldots , H \}\): action space \(\{a_0,a_1, a_2\}\),
(20a)(20b)(20c)(20d)where p, \(\tau _s\) and \(r_s\) are all defined in (18).

Key adjoint states \({\overline{s}} \in \{ {\overline{3}},\ldots , {\overline{H}} \}\): action space \(\{a_0, a_1\}\),
(21a)(21b)where \(\tau _s\) is defined in (18a).

Key buffer state subsets \({\mathcal {S}}_1\) and \({\mathcal {S}}_2\): action space \(\{a_0,a_1\}\),
(22a)(22b)(22c)(22d)Given the homogeneity of the states in \({\mathcal {S}}_1\) (resp. \({\mathcal {S}}_2\)), we shall often use the shorthand notation \(P(\cdot \mid 1, a)\) (resp. \(P(\cdot \mid 2, a)\)) to abbreviate \(P(\cdot \mid s_{1}, a)\) (resp. \(P(\cdot \mid s_{2}, a)\)) for any \(s_1\in {\mathcal {S}}_1\) (resp. \(s_2\in {\mathcal {S}}_2\)) for the sake of convenience.

Other adjoint states \({\overline{1}}\) and \({\overline{2}}\): action space \(\{a_0,a_1\}\),
(23a)(23b)where \(\tau _1\) and \(\tau _2\) are defined in (18a).

Booster state subsets \({\widehat{{\mathcal {S}}}}_1\), \(\ldots \), \({\widehat{{\mathcal {S}}}}_H\), \({\widehat{{\mathcal {S}}}}_{{\overline{1}}}\), \(\ldots \), \({\widehat{{\mathcal {S}}}}_{{\overline{H}}}\): singleton action space \(\{a_1\}\),
(24a)(24b)for any \(s\in \{3,\ldots , H\}\),
$$\begin{aligned} \forall s'\in \widehat{{\mathcal {S}}}_{s}:\qquad&P(s \mid s',a_{1})=1, \end{aligned}$$(24c)and for any \({\overline{s}} \in \{{\overline{1}},\ldots ,{\overline{H}}\}\),
$$\begin{aligned} \forall s'\in \widehat{{\mathcal {S}}}_{{\overline{s}}}:\qquad&P({\overline{s}} \mid s',a_{1})=1. \end{aligned}$$(24d)The rewards in all these cases are set to be 0 (in fact, they will not even appear in the analysis). In addition, any transition probability that has not been specified is equal to zero.
Convenient notation for buffer state subsets \({\mathcal {S}}_1\) and \({\mathcal {S}}_2\). By construction, it is easily seen that the states in \({\mathcal {S}}_1\) (resp. \({\mathcal {S}}_2\)) have identical characteristics; in fact, all states in \({\mathcal {S}}_1\) (resp. \({\mathcal {S}}_2\)) share exactly the same value functions and Q-functions throughout the execution of the softmax PG method. As a result, we introduce the following convenient notation whenever it is clear from the context:
Optimal values and optimal actions of the constructed MDP. Before concluding this section, we find it convenient to determine the optimal value functions and the optimal actions of the constructed MDP, which would be particularly instrumental when presenting our analysis. This is summarized in the lemma below, whose proof can be found in Appendix A.3.
Lemma 1
Suppose that \(\gamma ^{2H}\ge 2/3\) and \(H\ge 2\).
Then one has
and the optimal policy is to take action \(a_1\) in all non-absorbing states. In addition, for any policy \(\pi \) and any state-action pair (s, a), one has \(Q^{\pi }(s,a) \ge -\gamma ^2\).
Lemma 1 tells us that for this MDP, the optimal policy for all non-absorbing states takes a simple form: sticking to action \(a_{1}\). In particular, when \(\gamma \approx 1\) and \(\gamma ^H\approx 1\), Lemma 1 reveals that the optimal values of all non-absorbing major states are fairly close to 1, namely,
Additionally, the above lemma directly implies that the Q-function (and hence the value function) is always bounded below by \(-1\), a property that will be used several times in our analysis.
4 Analysis: proof outline
In this section, we present the main steps for establishing our computational lower bound in Theorem 1. Before doing so, we find it convenient to start by presenting and proving a weaker version as follows.
Theorem 2
Consider the MDP \({\mathcal {M}}\) constructed in Sect. 3 (and Fig. 1). Assume that the softmax PG method adopts a uniform initial state distribution, a uniform policy initialization, and has access to exact gradient computation. Suppose that \(0<\eta <(1-\gamma )^2/5\). There exist universal constants \(c_1, c_2, c_3>0\) such that: for any \(0.96< \gamma < 1\) and \( |{\mathcal {S}}| \ge c_3 (1-\gamma )^{-6}\), one has
provided that the iteration number satisfies
In what follows, we shall concentrate on establishing Theorem 2, on the basis of the MDP instance constructed in Sect. 3. Once this theorem is established, we shall revisit Theorem 1 (towards the end of Sect. 4.3) and describe how the proof of Theorem 2 can be adapted to prove Theorem 1.
4.1 Preparation: crossing times and choice of constants
Crossing times. To investigate how long it takes for softmax PG methods to converge to the optimal policy, we shall pay particular attention to a family of key quantities: the number of iterations needed for \(V^{(t)}(s)\) to surpass a prescribed threshold \(\tau \) (\(\tau <1\)) before it reaches its optimal value. To be precise, for each \(s \in \{3,\ldots ,H\} \cup \{{\overline{1}},\ldots ,{\overline{H}}\}\) and any given threshold \(\tau >0\), we introduce the following crossing time:
When it comes to the buffer state subsets \({\mathcal {S}}_1\) and \({\mathcal {S}}_2\), we define the crossing times analogously as follows
where we recall the notation \(V^{(t)}(1)\) and \(V^{(t)}(2)\) introduced in (25).
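In code, a crossing time of the form (30) amounts to a first-passage computation over a value-function trajectory; the following is a small sketch of ours, with a hypothetical trajectory:

```python
def crossing_time(value_trajectory, tau):
    """First iteration t with V^(t)(s) > tau (cf. definition (30));
    returns None if the threshold is never crossed."""
    for t, v in enumerate(value_trajectory):
        if v > tau:
            return t
    return None

traj = [0.10, 0.20, 0.35, 0.55, 0.80]    # hypothetical V^(t)(s) sequence
print(crossing_time(traj, 0.5))           # -> 3
print(crossing_time(traj, 0.9))           # -> None (never crossed)
```

The monotone non-decrease of this quantity in \(\tau \) is immediate from the definition.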
Monotonicity of crossing times. Recalling the definition (30) of the crossing time \(t_s(\cdot )\), we know that
with \(\tau _{s}\) defined in expression (18a). We immediately make note of the following crucial monotonicity property that will be justified later in Remark 8:
It will also be shown in Lemma 4 that \(t_1({\tau }_1)\le t_2({\tau }_2)\) when the constants \(c_{\textrm{b},1}, c_{\textrm{b},2}\) and \(c_{\textrm{m}}\) are properly chosen.
Remark 5
As we shall see shortly (i.e., Part (iii) of Lemma 8), one has \(t_{{\overline{s}}}(\gamma \tau _s) = t_{s}(\tau _s)\) for any \({\overline{s}}\in \{{\overline{1}},\ldots ,{\overline{H}}\}\), which combined with (33) leads to
Choice of parameters. We assume the following choice of parameters throughout the proof:
In the sequel, we outline the key steps that underlie the proof of our main results, with the proofs of the key lemmas postponed to the appendix.
4.2 A high-level picture
While our proof is highly technical, it is prudent to point out some key features that help paint a high-level picture about the slow convergence of the algorithm. Recall that \(a_1\) is the optimal action in the constructed MDP. The chain-like structure of our MDP underscores a sort of sequential dependency: the dynamics of any primary state \(s\in \{3,\ldots ,H\}\) depend heavily on what happens in the states prior to s, particularly state \(s-1\) and state \(s-2\) as well as the associated adjoint states. By carefully designing the immediate rewards, we can ensure that for any \(s\in \{3,\ldots ,H\}\), the iterate \(\pi ^{(t)}(a_1 \,|\,s)\) corresponding to the optimal action \(a_1\) keeps decreasing before \(\pi ^{(t)}(a_1 \,|\,s-2)\) gets reasonably close to 1. As illustrated in Fig. 2, this feature implies that the time taken for \(\pi ^{(t)}(a_1 \,|\,s)\) to get close to 1 grows (at least) geometrically as s increases, as will be formalized in (46).
Furthermore, we summarize below the typical dynamics of the iterates \(\theta ^{(t)}(s,a)\) before they converge, which are helpful for the reader to understand the proof. We start with the key buffer state sets \({\mathcal {S}}_1\) and \({\mathcal {S}}_2\), which are the easiest to describe.
Next, the dynamics of \(\theta ^{(t)}(s,a)\) for the key primary states \(3\le s \le H\) are much more complicated, and rely heavily on the status of several prior states \(s-1\), \(s-2\) and \(\overline{s-1}\). This motivates us to divide the dynamics into several stages based on the crossing times of these prior states, which are illustrated in Fig. 3 as well. Here, we remind the reader of the definition of \(\tau _s\) in (18).
4.3 Proof outline
We are now in a position to outline the main steps of the proof of Theorem 1 and Theorem 2, with details deferred to the appendix. In the following, Steps 1–6 are devoted to analyzing the dynamics of softmax PG methods when applied to the constructed MDP \({\mathcal {M}}\), which in turn establish Theorem 2. Step 7 describes how these can be easily adapted to prove Theorem 1, by slightly modifying the MDP construction.
4.4 Step 1: bounding the discounted state visitation distributions
In view of the PG update rule (12), the size of the policy gradient relies heavily on the discounted state visitation distribution \(d_{\mu }^{(t)}(s)\). In light of this observation, this step aims to quantify the magnitudes of \(d_{\mu }^{(t)}(s)\), for which we start with several universal lower bounds regardless of the policy in use.
Lemma 2
For any policy \(\pi \), the following lower bounds hold true:
As it turns out, the above lower bounds are orderwise tight estimates prior to certain crucial crossing times. This is formalized in the following lemma, where we recall the definition of \(\tau _s\) in (18).
Lemma 3
Under the assumption (35), the following results hold:
Remark 6
As will be demonstrated in Lemma 4, one has \(t_1({\tau }_1)\le t_2({\tau }_2)\) for properly chosen constants \(c_{\textrm{b},1}, c_{\textrm{b},2}\) and \(c_{\textrm{m}}\). Therefore, we shall bear in mind that the properties (37d) and (37e) hold for any \(t \le t_1({\tau }_1)\).
The proofs of these two lemmas are deferred to Appendix B. The sets of booster states, whose cardinality is controlled by \(c_{\textrm{m}}\), play an important role in sandwiching the initial distribution of the states in \({\mathcal {S}}_1\), \({\mathcal {S}}_2\), \({\mathcal {S}}_{\textsf{primary}}\), and \({\mathcal {S}}_{\textsf{adj}}\). Combining these bounds, we uncover the following properties happening before \(V^{(t)}(s)\) exceeds \({\tau }_s\):

For any key primary state \(s\in \{3,\ldots ,H\} \) or any adjoint state \(s\in \{{\overline{1}},\ldots , {\overline{H}}\} \), one has
$$\begin{aligned} d_{\mu }^{(t)}(s) \asymp (1-\gamma )^2. \end{aligned}$$ 
For any state s contained in the buffer state subsets \({\mathcal {S}}_1\) and \({\mathcal {S}}_2\), we have
$$\begin{aligned} d_{\mu }^{(t)}(1) \asymp \frac{(1-\gamma )^2}{|{\mathcal {S}}_1|} \qquad \text {and} \qquad d_{\mu }^{(t)}(2) \asymp \frac{(1-\gamma )^2}{|{\mathcal {S}}_2|}, \end{aligned}$$where we recall the sizes of \({\mathcal {S}}_1\) and \({\mathcal {S}}_2\) in (16). In other words, the discounted state visitation probability of any buffer state is substantially smaller than that of any key primary state \(3,\ldots ,H\) or adjoint state. In principle, the size of each buffer state subset plays a crucial role in determining the associated \(d_{\mu }^{(t)}(s)\): the larger the size of the buffer state subset, the smaller the resulting state visitation probability.

Further, the aggregate discounted state visitation probability of the above states is no more than the order of
$$\begin{aligned} (1-\gamma )^2 \cdot H \asymp 1-\gamma = o(1), \end{aligned}$$which is vanishingly small. In fact, state 0 and the booster states account for the dominant fraction of state visitations at the initial stage of the algorithm.
4.5 Step 2: characterizing the crossing times for the first few states (\({\mathcal {S}}_1\), \({\mathcal {S}}_2\), and \({\overline{1}}\))
Armed with the bounds on \(d_{\mu }^{(t)}\) developed in Step 1, we can move forward to study the crossing times for the key states. In this step, we pay attention to the crossing times for the buffer states \({\mathcal {S}}_1, {\mathcal {S}}_2\) as well as the first adjoint state \({\overline{1}}\), which forms a crucial starting point towards understanding the behavior of the subsequent states. Specifically, the following lemma develops lower and upper bounds regarding these quantities, whose proof can be found in Appendix C.
Lemma 4
Suppose that (35) holds. If \(|{\mathcal {S}}| \ge 1/(1-\gamma )^2\), then the crossing times satisfy
In addition, if \(|{\mathcal {S}}| \ge \frac{320\gamma ^3}{c_{\textrm{m}}(1-\gamma )^{2}}\), then one has
For properly chosen constants \(c_{\textrm{b},1}\), \(c_{\textrm{b},2}\) and \(c_{\textrm{m}}\), Lemma 4 delivers the following important messages:

The crossing times of these first few states are already fairly large; for instance,
$$\begin{aligned} t_1(\tau _1) \,\asymp \, t_2(\tau _2) \,\asymp \, \frac{ |{\mathcal {S}}| }{ \eta }, \end{aligned}$$(39)which scale linearly with the cardinality of the state space. As we shall see momentarily, while \(t_1(\tau _1)\) and \(t_2(\tau _2)\) remain polynomially large, they play a pivotal role in ensuring rapid explosion of the crossing times of the states that follow (namely, the states \(\{3,\ldots ,H\}\)).

We can guarantee a strict ordering such that the crossing time of state 2 is at least as large as that of both state 1 and state \({\overline{1}}\). This property is helpful as well for subsequent analysis.
4.6 Step 3: understanding the dynamics of \(\theta ^{(t)}(s, a)\) before \(t_{s-2}({\tau }_{s-2})\)
With the above characterization of the crossing times for the first few states, we are ready to investigate the dynamics of \(\theta ^{(t)}(s, a)\) (\(3 \le s \le H\)) at the initial stage, that is, the duration prior to the threshold \(t_{s-2}({\tau }_{s-2})\). Our finding for this stage is summarized in the following lemma, with the proof deferred to Appendix D.
Lemma 5
Suppose that (35) holds. For any \(3 \le s \le H\) and any \(0 \le t \le t_{s-2}({\tau }_{s-2})\), one has
Lemma 5 makes clear the behavior of \(\theta ^{(t)}(s, a)\) during this initial stage:

The iterate \(\theta ^{(t)}(s, a_1)\) associated with the optimal action \(a_1\) keeps dropping at a rate of \(\log \big (O(\frac{1}{\sqrt{t}})\big )\), and remains the smallest compared to the ones with other actions (since \(\theta ^{(t)}(s, a_1) \le 0 \le \theta ^{(t)}(s, a_2) \le \theta ^{(t)}(s, a_0) \)).

The other two iterates \(\theta ^{(t)}(s, a_0)\) and \(\theta ^{(t)}(s, a_2)\) stay nonnegative throughout this stage, with \(a_0\) being perceived as more favorable than the other two actions.

In fact, a closer inspection of the proof in Appendix D reveals that \(\theta ^{(t)}(s, a_2)\) remains increasing—even though at a rate slower than that of \(\theta ^{(t)}(s, a_0)\)—throughout this stage (see (134) and the gradient expression (12b)).
In particular, around the threshold \(t_{s-2}({\tau }_{s-2})\), the iterate \(\theta ^{(t)}(s, a_1)\) becomes as small as
In fact, an inspection of the proof of this lemma reveals that
This means that \(\pi ^{(t)}(a_1 \,|\,s)\) becomes smaller for a larger \(t_{s-2}({\tau }_{s-2})\), making it more difficult for the iterate to converge back to 1 afterward.
4.7 Step 4: understanding the dynamics of \(\theta ^{(t)}(s, a)\) between \(t_{s-2}({\tau }_{s-2})\) and \(t_{\overline{s-1}}(\tau _{s})\)
Next, we investigate, for any \(3\le s\le H\), the behavior of the iterates during an “intermediate” stage, namely, the duration when the iteration count t is between \(t_{s-2}({\tau }_{s-2})\) and \(t_{\overline{s-1}}(\tau _{s})\). This is summarized in the following lemma, whose proof can be found in Appendix E.
Lemma 6
Consider any \(3 \le s \le H\). Assume that (35) holds. Suppose that
Then one has
In particular, when \(s = 3\), the results in (43) hold true without requiring the assumption (42).
Remark 7
Condition (42a) only requires \(t_{s-1}({\tau }_{s-1})\) to be slightly larger than \(t_{\overline{s-2}}(\tau _{s-1})\), which will be justified using an induction argument when proving the main theorem.
As revealed by the claim (43) of Lemma 6, the iterate \(\theta ^{(t)}(s, a_2)\) remains sufficiently large during this intermediate stage. In the meantime, Lemma 6 guarantees that during this stage, \(\theta ^{(t)}(s, a_1)\) lies below the level of \(\theta ^{(t_{s-2}({\tau }_{s-2}))}(s, a_1)\) that has been pinned down in Lemma 5 (which has been shown to be quite small). Both of these properties make clear that the iterates \(\theta ^{(t)}(s, a) \) remain far from optimal at the end of this intermediate stage.
4.8 Step 5: establishing a blowingup phenomenon
The next lemma, which plays a pivotal role in developing the desired exponential lower bound on convergence, demonstrates that the crossing times explode at a super-fast rate. The proof is postponed to Appendix F.
Lemma 7
Consider any \(3 \le s \le H\). Suppose that (35) holds and
Then there exists a time instance \(t_{\textsf{ref}}\) obeying \(t_{\overline{s-1}}(\tau _s) \le t_{\textsf{ref}}< t_{s}({\tau }_s)\) such that
and at the same time,
The most important message of Lemma 7 lies in property (45c). In a nutshell, this property uncovers that the crossing time \(t_{s}({\tau }_s)\) is substantially larger than \(t_{s-2}({\tau }_{s-2})\), namely,
thus leading to explosion at a superlinear rate. By contrast, the other two properties unveil some important features happening between \(t_{\overline{s-1}}(\tau _s) \) and \(t_{s}({\tau }_s)\) that in turn lead to property (45c). In words, property (45a) requires \(\theta ^{(t_{\textsf{ref}})}(s, a_0)\) to be not much larger than \(\theta ^{(t_{\textsf{ref}})}(s, a_1)\); property (45b) indicates that when \(t_{s-2}({\tau }_{s-2})\) is large, both \(\theta ^{(t_{\textsf{ref}})}(s, a_1)\) and \(\theta ^{(t_{\textsf{ref}})}(s, a_0)\) are fairly small, with \(\theta ^{(t_{\textsf{ref}})}(s, a_2)\) being the dominant one (due to the fact \(\sum _a \theta ^{(t_{\textsf{ref}})}(s, a) = 0\), as will be seen in Part (vii) of Lemma 8).
The reader might naturally wonder what the above results imply about \(\pi ^{(t_{\textsf{ref}})}(a_1 \,|\,s)\) (as opposed to \(\theta ^{(t_{\textsf{ref}})}(s, a_1 )\)). Towards this end, we make the observation that
where (i) holds true since \(\sum _a \theta ^{(t_{\textsf{ref}})}(s,a)=0\) (a well-known property of policy gradient methods as recorded in Lemma 8(vii)), and (ii) follows from the properties (45a) and (45b). In other words, \(\pi ^{(t_{\textsf{ref}})}(s,a_{1})\) is inversely proportional to \(\big ( t_{s-2}({\tau }_{s-2}) \big )^{3/2}\). As we shall see, the time taken for \(\pi ^{(t_{\textsf{ref}})}(a_1 \,|\,s)\) to converge to 1 is proportional to the inverse policy iterate \(\big (\pi ^{(t)}(s,a_{1})\big )^{-1}\), meaning that it is expected to take on the order of \(\big (t_{s-2}({\tau }_{s-2})\big )^{3/2}\) iterations to increase from \(\pi ^{(t_{\textsf{ref}})}(s,a_{1})\) to 1.
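Steps (i) and (ii) above boil down to elementary softmax arithmetic: with \(\sum_a \theta(s,a)=0\) and \(\theta(s,a_2)\) dominant, the policy weight on \(a_1\) is essentially \(\exp\big(\theta(s,a_1)-\theta(s,a_2)\big)\). A quick numerical check, where the \(\theta\) values are placeholders chosen only to mimic the qualitative shape of (45a) and (45b):

```python
import math

def softmax(theta):
    """Softmax policy pi(a | s) = exp(theta[a]) / sum_b exp(theta[b])."""
    m = max(theta.values())                       # shift for numerical stability
    z = sum(math.exp(v - m) for v in theta.values())
    return {a: math.exp(v - m) / z for a, v in theta.items()}

# Placeholder parameters with sum_a theta(s, a) = 0 and theta(s, a2) dominant.
theta = {"a0": -3.8, "a1": -4.2}
theta["a2"] = -theta["a0"] - theta["a1"]          # enforce the zero-sum property

pi = softmax(theta)
approx = math.exp(theta["a1"] - theta["a2"])      # predicted pi(a1 | s)
print(pi["a1"], approx)                           # agree to ~1e-5 relative error
```

In particular, driving \(\theta(s,a_1)\) down to a quantity of order \(-\log t_{s-2}\) sends \(\pi(a_1\,|\,s)\) to a polynomially small quantity in \(t_{s-2}\), which is the mechanism behind the displayed bound.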
4.9 Step 6: putting all this together to establish Theorem 2
With the above steps in place, we are ready to combine them to establish the following result. As can be easily seen, Theorem 2 is an immediate consequence of Theorem 3.
Theorem 3
Suppose that (35) holds. There exist some universal constants \(c_1, c_2, c_3 > 0\) such that
provided that
Proof of Theorem 3
Let us define two universal constants \(C_1 {:=}\frac{\log 3}{1+17c_{\textrm{m}}/c_{\textrm{b},1}}\) and \(C_2 {:=} \frac{10^{-20}c_{\textrm{p}}^2c_{\textrm{m}}\log 3}{1+17c_{\textrm{m}}/c_{\textrm{b},1}}\). We claim that if one can show that
then the desired bound (48) holds true directly. In order to see this, recall that \(\tau _{s} \le 1/2\) by definition, and therefore,
Here, (i) follows from (50) in conjunction with the assumption (49), whereas (ii) holds true by setting \(c_1 = C_1/C_2\) and \(c_2 = C_2^3\).
It is then sufficient to prove the inequality (50), towards which we shall resort to mathematical induction in conjunction with the following induction hypothesis

We start with the cases with \(s=1,2,3\). It follows from Lemma 4 that
$$\begin{aligned} t_2({\tau }_2) \ge t_1({\tau }_1) \ge \frac{\log 3}{1+17c_{\textrm{m}}/c_{\textrm{b},1}} \frac{|{\mathcal {S}}|}{\eta } = C_1 \frac{|{\mathcal {S}}|}{\eta }, \end{aligned}$$(52)which validates the above claim (50) for \(s = 1\) and \(s=2\). In addition, Lemma 7 ensures that
$$\begin{aligned} t_{3}({\tau }_3) - \max \Big \{ t_{{\overline{1}}}(\gamma ^{3}/4),~ t_{{\overline{2}}}(\tau _{3}) \Big \}&\ge 10^{-10}c_{\textrm{p}}c_{\textrm{m}}^{0.5}\eta ^{0.5}(1-\gamma )^2\Big (t_{1}({\tau }_{1}) \Big )^{1.5}\nonumber \\&\ge \frac{9776}{c_{\textrm{m}}\gamma \eta (1-\gamma )^2}, \end{aligned}$$(53)where the last inequality is guaranteed by (52) and the assumption \(|{\mathcal {S}}| \ge \max \Big \{\frac{4888}{C_1c_{\textrm{m}}\gamma (1-\gamma )^2}, \frac{4}{C_2(1-\gamma )^4}\Big \}\). This implies that the inequality (51) is satisfied when \(s = 3\).

Next, suppose that the inequality (50) holds true up to state \(s-1\) and the inequality (51) holds up to s for some \(3\le s\le H\). To invoke the induction argument, it suffices to show that the inequality (50) continues to hold for state s and the inequality (51) remains valid for \(s+1\). This will be accomplished by taking advantage of Lemma 7.
Given that the inequality (50) holds true for every state up to \(s-1\), one has
$$\begin{aligned} t_{s-1}({\tau }_{s-1})&\ge t_{s-2}({\tau }_{s-2}) \ge C_1\frac{|{\mathcal {S}}|}{\eta }\Big (C_2(1-\gamma )^4|{\mathcal {S}}| \Big )^{1.5^{\lfloor (s-3)/2\rfloor }-1} \\&\ge \Big ( \frac{6300 e}{c_{\textrm{p}}(1-\gamma )} \Big )^4\frac{1}{\frac{c_{\textrm{m}}\gamma }{35} \eta (1-\gamma )^2}, \end{aligned}$$where the last inequality is satisfied provided that \(|{\mathcal {S}}| > \max \Big \{\big ( \frac{6300 e}{c_{\textrm{p}}} \big )^4\frac{35}{C_1c_{\textrm{m}}\gamma (1-\gamma )^6}, \frac{4}{C_2(1-\gamma )^4}\Big \}\). Therefore, Lemma 7 is applicable for both s and \(s+1\), thus leading to
$$\begin{aligned} t_{s}({\tau }_{s}) - t_{\overline{s-1}}(\tau _{s})&\ge 10^{-10}c_{\textrm{p}}c_{\textrm{m}}^{0.5}\eta ^{0.5}(1-\gamma )^2\Big (t_{s-2}({\tau }_{s-2}) \Big )^{1.5} \\&\ge 10^{-10}c_{\textrm{p}}c_{\textrm{m}}^{0.5}\eta ^{0.5}(1-\gamma )^2\\&\quad \left( C_1\frac{|{\mathcal {S}}|}{\eta }\Big (C_2(1-\gamma )^4|{\mathcal {S}}| \Big )^{1.5^{\lfloor (s-3)/2\rfloor }-1}\right) ^{1.5} \\&\ge C_1\frac{|{\mathcal {S}}|}{\eta }\Big (C_2(1-\gamma )^4|{\mathcal {S}}| \Big )^{1.5^{\lfloor (s-1)/2\rfloor }-1}. \end{aligned}$$Here, the last step relies on the condition \(10^{-10}c_{\textrm{p}}c_{\textrm{m}}^{0.5}\eta ^{0.5}(1-\gamma )^2 \big (C_1\frac{|{\mathcal {S}}|}{\eta }\big )^{0.5}\ge 1\). This in turn establishes the property (50) for state s (given that \(t_{\overline{s-1}}(\tau _{s})\ge 0\)). In addition, Lemma 7—when applied to \(s+1\)—gives
$$\begin{aligned} t_{s+1}({\tau }_{s+1}) - t_{{\overline{s}}}(\tau _{s+1})&\ge 10^{-10}c_{\textrm{p}}c_{\textrm{m}}^{0.5}\eta ^{0.5}(1-\gamma )^2\Big (t_{s-1}({\tau }_{s-1}) \Big )^{1.5} \\&\ge 10^{-10}c_{\textrm{p}}c_{\textrm{m}}^{0.5}\eta ^{0.5}(1-\gamma )^2\\&\quad \left( C_1\frac{|{\mathcal {S}}|}{\eta }\Big (C_2(1-\gamma )^4|{\mathcal {S}}| \Big )^{1.5^{\lfloor (s-2)/2\rfloor }-1}\right) ^{1.5} \\&\ge C_1\frac{|{\mathcal {S}}|}{\eta }\Big (C_2(1-\gamma )^4|{\mathcal {S}}| \Big )^{1.5^{\lfloor s/2\rfloor }-1}\\&\ge \frac{2444(s+2)}{c_{\textrm{m}}\gamma \eta (1-\gamma )^2}, \end{aligned}$$where the last step follows as long as \(|{\mathcal {S}}| > \max \left\{ \frac{4888}{C_1c_{\textrm{m}}\gamma (1-\gamma )^2}, \frac{4}{C_2(1-\gamma )^4}\right\} \). We have thus established the property (51) for state \(s+1\).
Putting all the above pieces together, we arrive at the inequality (50), thus establishing Theorem 3. \(\square \)
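To get a feel for the growth dictated by (50), one can evaluate its right-hand side \(C_1\frac{|\mathcal{S}|}{\eta }\big (C_2(1-\gamma )^4|\mathcal{S}| \big )^{1.5^{\lfloor (s-1)/2\rfloor }-1}\) at illustrative parameter values; all the numbers below are placeholders (the paper's \(C_1\) and \(C_2\) depend on \(c_{\textrm{m}}\), \(c_{\textrm{b},1}\) and \(c_{\textrm{p}}\)):

```python
# Illustrative evaluation of the lower bound in (50):
#   t_s(tau_s) >= C1 * (S/eta) * (C2 * (1-gamma)**4 * S) ** (1.5 ** ((s-1)//2) - 1).
# C1, C2, gamma, S, eta are placeholder values, not the paper's choices.
C1, C2 = 1.0, 1.0
gamma, S, eta = 0.9, 1e6, 0.1

def lower_bound(s):
    exponent = 1.5 ** ((s - 1) // 2) - 1
    return C1 * (S / eta) * (C2 * (1 - gamma) ** 4 * S) ** exponent

for s in (3, 5, 7, 9, 11):
    print(f"s = {s:2d}:  bound ≈ {lower_bound(s):.3e}")
```

The doubly-exponential dependence on s is visible immediately: each increment of s by 2 multiplies the exponent by a factor of 1.5.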
4.10 Step 7: adapting the proof to establish Theorem 1
Thus far, we have established Theorem 2, and are well equipped to return to the proof of Theorem 1. As a remark, Theorem 2 and its analysis reveal that for a large fraction of the key primary states (as well as their associated adjoint states), softmax PG methods can take a prohibitively large number of iterations to converge. The issue, however, is that there are in total only O(H) key primary states and adjoint states, accounting for a vanishingly small fraction of all \(|{\mathcal {S}}|\) states. In order to extend Theorem 2 to Theorem 1 (the latter of which is concerned with the error averaged over the entire state space), we would need to show that the value functions associated with those booster states—which account for a large fraction of the state space—also converge slowly.
In the MDP instance constructed in Sect. 3, however, the action space associated with the booster states is a singleton set, meaning that the action is always optimal. As a result, we would first need to modify/augment the action space of booster states, so as to ensure that their learned actions remain suboptimal before the algorithm converges for the associated key primary states and adjoint states.
A modified MDP instance. We now augment the action space for all booster states in the MDP \({\mathcal {M}}\) constructed in Sect. 3, leading to a slightly modified MDP denoted by \({\mathcal {M}}_{\textsf{modified}}\):

for any key primary state \(s\in \{3,\ldots , H\}\) and any associated booster state \({\widehat{s}}\in \widehat{{\mathcal {S}}}_{s}\), take the action space of \({\widehat{s}}\) to be \(\{a_0, a_1\}\) and let
$$\begin{aligned}&P(0\,|\,{\widehat{s}},a_0) = 0.9, ~~ P(s\,|\,{\widehat{s}},a_0) = 0.1, ~~ r({\widehat{s}},a_0)=0.9\gamma \tau _s,~~\nonumber \\&P(s\,|\,{\widehat{s}},a_{1}) = 1, ~~ r({\widehat{s}},a_1)=0; \end{aligned}$$(54) 
for any key adjoint state \({\overline{s}} \in \{{\overline{1}},\ldots ,{\overline{H}}\}\) and any associated booster state \({\widehat{s}}\in \widehat{{\mathcal {S}}}_{{\overline{s}}}\), take the action space of \({\widehat{s}}\) to be \(\{a_0, a_1\}\) and let
$$\begin{aligned}&P(0\,|\,{\widehat{s}},a_0) = 0.9, ~~P({\overline{s}}\,|\,{\widehat{s}},a_0) = 0.1,~~ r({\widehat{s}},a_0)=0.9\gamma ^2\tau _s,~~ \nonumber \\&P({\overline{s}}\,|\,{\widehat{s}},a_{1}) = 1, ~~ r({\widehat{s}},a_1)=0; \end{aligned}$$(55) 
all other components of \({\mathcal {M}}_{\textsf{modified}}\) remain identical to those of the original \({\mathcal {M}}\).
Analysis for the new booster states. Given that the dynamics of nonbooster states are unaffected by the booster states, it suffices to perform analysis for the booster states. Let us first consider any key primary state s and any associated booster state \({\widehat{s}}\).

As can be easily seen,
$$\begin{aligned} Q^{(t)}({\widehat{s}},a_{0})&=r({\widehat{s}},a_{0})+\gamma P(s\,|\,{\widehat{s}},a_{0})V^{(t)}(s)+\gamma P(0\,|\,{\widehat{s}},a_{0})V^{(t)}(0)\\&=0.9\gamma \tau _{s}+0.1\gamma V^{(t)}(s),\\ Q^{(t)}({\widehat{s}},a_{1})&=r({\widehat{s}},a_{1})+\gamma P(s\,|\,{\widehat{s}},a_{1})V^{(t)}(s)=\gamma V^{(t)}(s), \end{aligned}$$where we have used the basic fact \(V^{(t)}(0)=0\) (see (73) in Lemma 8). Given that \(V^{(t)}({\widehat{s}})\) is a convex combination of \(Q^{(t)}({\widehat{s}},a_{0})\) and \(Q^{(t)}({\widehat{s}},a_{1})\), one can easily see that if \(V^{(t)}(s) < \tau _s\), then one necessarily has \(V^{(t)}({\widehat{s}}) < \gamma \tau _s\).

Similarly, the optimal Qfunction w.r.t. \({\widehat{s}}\) is given by
$$\begin{aligned} Q^{\star }({\widehat{s}},a_{0})&=r({\widehat{s}},a_{0})+\gamma P(s\,|\,{\widehat{s}},a_{0})V^{\star }(s)+\gamma P(0\,|\,{\widehat{s}},a_{0})V^{\star }(0)\\&=0.9\gamma \tau _{s}+0.1\gamma V^{\star }(s),\\ Q^{\star }({\widehat{s}},a_{1})&=r({\widehat{s}},a_{1})+\gamma P(s\,|\,{\widehat{s}},a_{1})V^{\star }(s)=\gamma V^{\star }(s), \end{aligned}$$which together with Lemma 1 and the definition (18) of \(\tau _s\) indicates that \(V^{\star }({\widehat{s}})=Q^{\star }({\widehat{s}},a_{1})=\gamma ^{2s+1}\).

The above facts taken collectively imply that: if \(V^{(t)}(s) < \tau _s\), then
$$\begin{aligned} V^{\star }({\widehat{s}}) - V^{(t)}({\widehat{s}})> \gamma ^{2s+1} - \gamma \tau _s = \gamma \big (\gamma ^{2s} - 0.5\gamma ^{\frac{2s}{3}}\big ) > 0.22, \end{aligned}$$(56)provided that \(\gamma \) is sufficiently large (which is satisfied under the condition (35)).
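The convex-combination argument in the first bullet is mechanical and can be sanity-checked numerically from (54). In the sketch below, \(\gamma \) and \(\tau _s\) are placeholder values (recall only that \(\tau _s \le 1/2\)):

```python
# Check: under (54), if V(s) < tau_s then V(s_hat) < gamma * tau_s,
# regardless of how the policy at s_hat mixes a0 and a1.
# gamma and tau_s are illustrative placeholders with tau_s <= 1/2.
gamma, tau_s = 0.95, 0.4

def V_hat(p_a0, V_s):
    """V(s_hat) as a convex combination of Q(s_hat, a0) and Q(s_hat, a1)."""
    q_a0 = 0.9 * gamma * tau_s + 0.1 * gamma * V_s   # uses V(0) = 0
    q_a1 = gamma * V_s
    return p_a0 * q_a0 + (1 - p_a0) * q_a1

# Sweep over mixing weights p_a0 in [0, 1] and values V(s) strictly below tau_s.
ok = all(
    V_hat(p / 100, (v / 100) * tau_s) < gamma * tau_s
    for p in range(101)
    for v in range(100)
)
print(ok)   # -> True
```

The algebra behind the sweep: \(V({\widehat{s}}) = 0.9 p_{a_0}\gamma \tau _s + \gamma (1-0.9p_{a_0})V(s) < \gamma \tau _s\) whenever \(V(s) < \tau _s\), since the coefficient \(\gamma (1-0.9p_{a_0})\) is always positive.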
Similarly, for any key adjoint state \({\overline{s}}\) and any associated booster state \({\widehat{s}}\), if \(V^{(t)}({\overline{s}}) < \gamma \tau _s\), then one must have
Repeating the same proof as for Theorem 2, one can easily show that (with slight adjustment of the universal constants)
This taken together with the above analysis suffices to establish Theorem 1, given the following two simple facts: (i) there are \(2 Hc_{\textrm{m}}(1-\gamma )|{\mathcal {S}}|= 2c_{\textrm{m}}c_{\textrm{h}}|{\mathcal {S}}|\) booster states, and (ii) more than \(90\%\) of them need a prohibitively large number of iterations (cf. (58)) to reach 0.22-optimality. Here, we can take \(c_{\textrm{m}}c_{\textrm{h}}> 0.18\), which satisfies (35). The proof is thus complete.
5 Discussion
This paper has developed an algorithm-specific lower bound on the iteration complexity of the softmax policy gradient method, obtained by analyzing its trajectory on a carefully designed hard MDP instance. We have shown that the iteration complexity of softmax PG methods can scale pessimistically, in fact (super)exponentially, with the dimension of the state space and the effective horizon of the discounted MDP of interest. Our finding makes apparent the potential inefficiency of softmax PG methods in solving large-dimensional and long-horizon problems. In turn, this suggests the necessity of carefully adjusting update rules and/or enforcing proper regularization in accelerating policy gradient methods (Table 2).
Our work relies heavily on proper exploitation of the structural properties of the MDP in algorithm-dependent analysis, which might shed light on lower bound constructions for other algorithms as well. For instance, if the objective function (i.e., the value function) is augmented by a regularization term, how does the choice of regularization affect the global convergence behavior? While Agarwal et al. [1] demonstrated polynomial-time convergence of PG methods in the presence of log-barrier regularization, non-asymptotic analysis of PG methods with other popular regularizers—particularly entropy regularization—remains unavailable in the existing literature. How to understand the (in)effectiveness of entropy-regularized PG methods is of fundamental importance in the theory of policy optimization. Additionally, the current paper concentrates on the use of constant learning rates; it falls short of accommodating more adaptive learning rates, which might be a potential solution to accelerating vanilla PG methods. Furthermore, our strategy for lower bound construction might be extended to unveil algorithmic bottlenecks of policy optimization in multi-agent Markov games as well. All this is worthy of future investigation.
Notes
Here and throughout, the division of two vectors represents componentwise division.
While we do not include states 1 and 2 here, any state in \({\mathcal {S}}_1\) (resp. \({\mathcal {S}}_2\)) can essentially be viewed as a (replicated) copy of state 1 (resp. state 2).
References
Agarwal, A., Kakade, S.M., Lee, J.D., Mahajan, G.: On the theory of policy gradient methods: optimality, approximation, and distribution shift. J. Mach. Learn. Res. 22(98), 1–76 (2021)
Agazzi, A., Lu, J.: Global optimality of softmax policy gradient with single hidden layer neural networks in the meanfield regime. In: International Conference on Learning Representations (ICLR) (2021)
Alacaoglu, A., Viano, L., He, N., Cevher, V.: A natural actor-critic framework for zero-sum Markov games. In: International Conference on Machine Learning, pp. 307–366. PMLR (2022)
Asadi, K., Littman, M.L.: An alternative softmax operator for reinforcement learning. In: International Conference on Machine Learning, pp. 243–252 (2017)
Azar, M.G., Munos, R., Kappen, H.J.: Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Mach. Learn. 91(3), 325–349 (2013)
Beck, A.: First-Order Methods in Optimization. SIAM, Philadelphia, PA (2017)
Bhandari, J.: Optimization foundations of reinforcement learning. Ph.D. thesis, Columbia University (2020)
Bhandari, J., Russo, D.: Global optimality guarantees for policy gradient methods. arXiv preprint arXiv:1906.01786 (2019)
Bhandari, J., Russo, D.: On the linear convergence of policy gradient methods for finite MDPs. In: International Conference on Artificial Intelligence and Statistics, pp. 2386–2394. PMLR (2021)
Cai, Q., Yang, Z., Jin, C., Wang, Z.: Provably efficient exploration in policy optimization. In: International Conference on Machine Learning, pp. 1283–1294 (2020)
Cen, S., Cheng, C., Chen, Y., Wei, Y., Chi, Y.: Fast global convergence of natural policy gradient methods with entropy regularization. Oper. Res. 70(4), 2563–2578 (2022)
Cen, S., Chi, Y., Du, S.S., Xiao, L.: Faster last-iterate convergence of policy optimization in zero-sum Markov games. arXiv preprint arXiv:2210.01050 (2022)
Cen, S., Wei, Y., Chi, Y.: Fast policy extragradient methods for competitive games with entropy regularization. Adv. Neural Inf. Process. Syst. 34, 27952–27964 (2021)
Daskalakis, C., Foster, D.J., Golowich, N.: Independent policy gradient methods for competitive reinforcement learning. Adv. Neural Inf. Process. Syst. 33, 5527–5540 (2020)
Ding, D., Zhang, K., Basar, T., Jovanovic, M.: Natural policy gradient primal–dual method for constrained Markov decision processes. Adv. Neural Inf. Process. Syst. 33, 8378–8390 (2020)
Domingues, O.D., Ménard, P., Kaufmann, E., Valko, M.: Episodic reinforcement learning in finite MDPs: minimax lower bounds revisited. In: Algorithmic Learning Theory, pp. 578–598 (2021)
Du, S., Jin, C., Jordan, M., Póczos, B., Singh, A., Lee, J.: Gradient descent can take exponential time to escape saddle points. Adv. Neural Inf. Process. Syst. 30, 1068–1078 (2017)
Fazel, M., Ge, R., Kakade, S., Mesbahi, M.: Global convergence of policy gradient methods for the linear quadratic regulator. In: International Conference on Machine Learning, pp. 1467–1476 (2018)
Jansch-Porto, J.P., Hu, B., Dullerud, G.E.: Convergence guarantees of policy optimization methods for Markovian jump linear systems. In: American Control Conference, pp. 2882–2887. IEEE (2020)
Kakade, S.M.: A natural policy gradient. Adv. Neural Inf. Process. Syst. 14, 1531–1538 (2002)
Khamaru, K., Pananjady, A., Ruan, F., Wainwright, M.J., Jordan, M.I.: Is temporal difference learning optimal? An instance-dependent analysis. SIAM J. Math. Data Sci. 3(4), 1013–1040 (2021)
Khodadadian, S., Doan, T.T., Romberg, J., Maguluri, S.T.: Finite sample analysis of two-timescale natural actor-critic algorithm. IEEE Trans. Automatic Control. (2022). https://doi.org/10.1109/TAC.2022.3190032
Konda, V.R., Tsitsiklis, J.N.: Actor-critic algorithms. Adv. Neural Inf. Process. Syst. 12, 1008–1014 (2000)
Lan, G.: Policy mirror descent for reinforcement learning: linear convergence, new sampling complexity, and generalized problem classes. Math. Program. (2022). https://doi.org/10.1007/s10107-022-01816-5
Lan, G.: Policy optimization over general state and action spaces. arXiv preprint arXiv:2211.16715 (2022)
Lee, J.D., Simchowitz, M., Jordan, M.I., Recht, B.: Gradient descent only converges to minimizers. In: Conference on learning Theory, pp. 1246–1257 (2016)
Li, G., Cai, C., Chen, Y., Wei, Y., Chi, Y.: Is Q-learning minimax optimal? A tight sample complexity analysis. arXiv preprint arXiv:2102.06548 (2021)
Li, G., Shi, L., Chen, Y., Chi, Y., Wei, Y.: Settling the sample complexity of model-based offline reinforcement learning. arXiv preprint arXiv:2204.05275 (2022)
Li, Y., Zhao, T., Lan, G.: First-order policy optimization for robust Markov decision process. arXiv preprint arXiv:2209.10579 (2022)
Liu, B., Cai, Q., Yang, Z., Wang, Z.: Neural proximal/trust region policy optimization attains globally optimal policy. Adv. Neural Inf. Process. Syst. 32, 10565–10576 (2019)
Liu, Y., Zhang, K., Basar, T., Yin, W.: An improved analysis of (variance-reduced) policy gradient and natural policy gradient methods. Adv. Neural Inf. Process. Syst. 33, 7624–7636 (2020)
Mei, J., Gao, Y., Dai, B., Szepesvari, C., Schuurmans, D.: Leveraging non-uniformity in first-order non-convex optimization. In: International Conference on Machine Learning, pp. 7555–7564 (2021)
Mei, J., Xiao, C., Dai, B., Li, L., Szepesvári, C., Schuurmans, D.: Escaping the gravitational pull of softmax. Adv. Neural Inf. Process. Syst. 33, 21130–21140 (2020)
Mei, J., Xiao, C., Szepesvari, C., Schuurmans, D.: On the global convergence rates of softmax policy gradient methods. In: International Conference on Machine Learning, pp. 6820–6829 (2020)
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015)
Pananjady, A., Wainwright, M.J.: Instance-dependent \(\ell _{\infty }\)-bounds for policy evaluation in tabular reinforcement learning. IEEE Trans. Inf. Theory 67(1), 566–585 (2020)
Peters, J., Schaal, S.: Natural actor-critic. Neurocomputing 71(7–9), 1180–1190 (2008)
Schulman, J., Levine, S., Abbeel, P., Jordan, M., Moritz, P.: Trust region policy optimization. In: International Conference on Machine Learning, pp. 1889–1897 (2015)
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)
Shani, L., Efroni, Y., Mannor, S.: Adaptive trust region policy optimization: global convergence and faster rates for regularized MDPs. In: AAAI Conference on Artificial Intelligence, vol. 34, pp. 5668–5675 (2020)
Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of Go with deep neural networks and tree search. Nature 529(7587), 484–489 (2016)
Sutton, R.S.: Temporal credit assignment in reinforcement learning. Ph.D. thesis, University of Massachusetts (1984)
Sutton, R.S., McAllester, D.A., Singh, S.P., Mansour, Y.: Policy gradient methods for reinforcement learning with function approximation. Adv. Neural Inf. Process. Syst. 12, 1057–1063 (2000)
Tu, S., Recht, B.: The gap between model-based and model-free methods on the linear quadratic regulator: an asymptotic viewpoint. In: Conference on Learning Theory, pp. 3036–3083 (2019)
Wang, L., Cai, Q., Yang, Z., Wang, Z.: Neural policy gradient methods: global optimality and rates of convergence. In: International Conference on Learning Representations (2019)
Wei, C.-Y., Lee, C.-W., Zhang, M., Luo, H.: Last-iterate convergence of decentralized optimistic gradient descent/ascent in infinite-horizon competitive Markov games. In: Conference on Learning Theory, pp. 4259–4299. PMLR (2021)
Williams, R.J.: Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn. 8(3–4), 229–256 (1992)
Wu, Y.F., Zhang, W., Xu, P., Gu, Q.: A finite-time analysis of two time-scale actor-critic methods. Adv. Neural Inf. Process. Syst. 33, 17617–17628 (2020)
Xie, Q., Yang, Z., Wang, Z., Minca, A.: Provable fictitious play for general meanfield games. arXiv preprint arXiv:2010.04211 (2020)
Xu, T., Liang, Y., Lan, G.: A primal approach to constrained policy optimization: global optimality and finitetime analysis. arXiv preprint arXiv:2011.05869 (2020)
Xu, T., Wang, Z., Liang, Y.: Nonasymptotic convergence analysis of two timescale (natural) actorcritic algorithms. arXiv preprint arXiv:2005.03557 (2020)
Yan, Y., Li, G., Chen, Y., Fan, J.: Modelbased reinforcement learning is minimaxoptimal for offline zerosum Markov games. arXiv preprint arXiv:2206.04044 (2022)
Yang, W., Li, X., Xie, G., Zhang, Z.: Finding the near optimal policy via adaptive reduced regularization in MDPs. arXiv preprint arXiv:2011.00213 (2020)
Zhan, W., Cen, S., Huang, B., Chen, Y., Lee, J.D., Chi, Y.: Policy mirror descent for regularized reinforcement learning: a generalized framework with linear convergence. arXiv preprint arXiv:2105.11066 (2021)
Zhang, J., Kim, J., O’Donoghue, B., Boyd, S.: Sample efficient reinforcement learning with REINFORCE. In: AAAI Conference on Artificial Intelligence, vol. 35, pp. 10887–10895 (2021)
Zhang, J., Koppel, A., Bedi, A.S., Szepesvari, C., Wang, M.: Variational policy gradient method for reinforcement learning with general utilities. Adv. Neural Inf. Process. Syst. 33, 4572–4583 (2020)
Zhang, J., Ni, C., Szepesvari, C., Wang, M.: On the convergence and sample efficiency of variance-reduced policy gradient method. Adv. Neural Inf. Process. Syst. 34, 2228–2240 (2021)
Zhang, K., Hu, B., Basar, T.: Policy optimization for \(\cal{H}_2\) linear control with \(\cal{H}_{\infty }\) robustness guarantee: implicit regularization and global convergence. In: Learning for Dynamics and Control, pp. 179–190 (2020)
Zhang, K., Koppel, A., Zhu, H., Basar, T.: Global convergence of policy gradient methods to (almost) locally optimal policies. SIAM J. Control. Optim. 58(6), 3586–3612 (2020)
Zhang, X., Chen, Y., Zhu, X., Sun, W.: Robust policy gradient against strong data corruption. In: International Conference on Machine Learning, pp. 12391–12401 (2021)
Zhao, Y., Tian, Y., Lee, J., Du, S.: Provably efficient policy gradient methods for two-player zero-sum Markov games. In: International Conference on Artificial Intelligence and Statistics (2021)
Acknowledgements
Y. Wei is supported in part by the NSF grants CCF2106778 and CAREER award DMS2143215. Y. Chi is supported in part by the grants ONR N000141812142 and N000141912404, ARO W911NF1810303, NSF CCF1806154, CCF2007911 and CCF2106778. Y. Chen is supported in part by the Alfred P. Sloan Research Fellowship, the Google Research Scholar Award, the AFOSR grant FA95502210198, the ONR grant N000142212354, and the NSF grants CCF2221009, CCF1907661, IIS2218713 and IIS2218773.
This work was presented in part at the Conference on Learning Theory (COLT) 2021.
Appendices
Preliminary facts
1.1 Basic properties of the constructed MDP
In this section, we collect additional basic properties of the MDP constructed in Sect. 3. Specifically, we present a miscellaneous collection of basic relations that hold for more general policies, postponing the proofs to Appendix A.4.
Lemma 8
Consider any policy \(\pi \), and recall the quantities defined in (18). Suppose that \(\gamma ^{2H}\ge 1/2\) and \(0<c_{\textrm{p}}\le 1/6\).

(i)
For any state \(s\in \{3,\ldots , H\}\), one has
$$\begin{aligned} \gamma ^{\frac{3}{2}}\tau _{s-1} \le Q^{\pi }(s,a_{0})&= r_{s}+\gamma ^{2}p\tau _{s-2} \le \gamma ^{\frac{1}{2}}\tau _{s}, \end{aligned}$$(59a)$$\begin{aligned} Q^{\pi }(s,a_{1})&=\gamma V^{\pi }(\overline{s-1}), \end{aligned}$$(59b)$$\begin{aligned} Q^{\pi }(s,a_{2})&=r_{s}+\gamma pV^{\pi }(\overline{s-2}) \le \gamma ^{\frac{1}{2}}\tau _{s}. \end{aligned}$$(59c)If one further has \(V^{\pi }(\overline{s-2})\ge 0\), then \(Q^{\pi }(s,a_{2}) \ge \gamma ^{\frac{3}{2}}\tau _{s-1}\).

(ii)
If \(V^{\pi }(s) \ge \tau _s\) for some \(s\in \{3,\ldots , H\}\), then we necessarily have
$$\begin{aligned} \pi (a_{1}\mid s) \ge \frac{1-\gamma }{2}. \end{aligned}$$(60) 
(iii)
For any \({\overline{s}}\in \{{\overline{1}},\ldots ,{\overline{H}}\}\), one has
$$\begin{aligned} Q^{\pi }({\overline{s}},a_{0})&=\gamma \tau _{s} \qquad \text {and} \qquad Q^{\pi }({\overline{s}},a_{1}) = \gamma V^{\pi }(s), \end{aligned}$$(61)where we recall the definition of \(V^{\pi }(1)\) and \(V^{\pi }(2)\) in (25). In addition, if \(\pi (a_1\mid {\overline{s}})>0\), then
$$\begin{aligned} V^{\pi }({\overline{s}}) \ge \gamma \tau _s \qquad \text {holds if and only if} \qquad V^{\pi }(s) \ge \tau _{s}. \end{aligned}$$(62)This means that: if \(\pi ^{(t)}(a_1\mid {\overline{s}})>0\) holds for all \(t\ge 0\), then one necessarily has
$$\begin{aligned} t_{{\overline{s}}}(\gamma \tau _s) = t_{s}(\tau _s). \end{aligned}$$(63) 
(iv)
For any policy \(\pi \), we have
$$\begin{aligned}&Q^{\pi }(1,a_{0})=-\gamma ^{2},\quad Q^{\pi }(1,a_{1})=\gamma ^{2}, \quad V^{\pi }(1)=-\gamma ^{2}\pi (a_{0}\mid 1)+\gamma ^{2}\pi (a_{1}\mid 1), \end{aligned}$$(64a)$$\begin{aligned}&Q^{\pi }(2,a_{0})=-\gamma ^{4},\quad Q^{\pi }(2,a_{1})=\gamma ^{4}, \quad V^{\pi }(2)=-\gamma ^{4}\pi (a_{0}\mid 2)+\gamma ^{4}\pi (a_{1}\mid 2). \end{aligned}$$(64b) 
(v)
Consider any policy \(\pi \) obeying \(\min _{a,s}\pi (a\mid s)>0\). For every \(s\in \{3,\ldots , H\}\), if \(V^{\pi }(s) \ge \gamma ^{\frac{1}{2}}\tau _{s}\) occurs, then one necessarily has \(V^{\pi }(s-1) \ge \tau _{s-1}.\)

(vi)
If \(V^{\pi }(s-2) < \tau _{s-2}\) and \(\pi (a_1 \mid \overline{s-2}) >0\), then
$$\begin{aligned} Q^{\pi }(s,a_0) - Q^{\pi }(s,a_2) = \gamma p \big (\gamma \tau _{s-2} - V^\pi (\overline{s-2}) \big ) > 0. \end{aligned}$$If \(V^{\pi }(s-1) \le \tau _{s-1}\) and \(V^{\pi }(\overline{s-2})\ge 0\), then
$$\begin{aligned} \min \big \{Q^{\pi }(s, a_0), Q^{\pi }(s, a_2) \big \} - Q^{\pi }(s, a_1) \ge (1-\gamma )/8. \end{aligned}$$ 
(vii)
Consider the softmax PG update rule (12). One has for any \(s \in {\mathcal {S}}\) and any \(\theta \),
$$\begin{aligned} \sum _a \frac{\partial V^{\pi _{\theta }}(\mu )}{\partial \theta (s, a)} = 0 \qquad \text {and}\qquad \sum _a \theta ^{(t)}(s,a) = 0 \end{aligned}$$(65)
Remark 8
As it turns out, invoking Part (v) of Lemma 8 recursively reveals that: for any \(2\le s \le H\) and any \(t< t_s({\tau }_s)\), we have
$$\begin{aligned} V^{(t)}(s^{\prime }) < \tau _{s^{\prime }} \qquad \text {for all } s\le s^{\prime } \le H. \end{aligned}$$(66)
This in turn implies that \(t_2({\tau }_2)\le t_3({\tau }_3)\le \cdots \le t_H({\tau }_H)\) according to the definition (30).
Let us point out some implications of Lemma 8 that help guide our lower bound analysis. Once again, it is helpful to look at the results of this lemma when \(\gamma \approx 1\) and \(\gamma ^{H} \approx 1\). In this case, the quantities defined in (18) obey \(\tau _s\approx r_s \approx 1/2\), allowing us to obtain the following messages:

Lemma 8(i) implies that, under mild conditions,
$$\begin{aligned} Q^{\pi }(s,a_0)\approx Q^{\pi }(s,a_2) \approx 1/2 \end{aligned}$$holds for any \(s\in \{3,\ldots , H\}\) and any policy \(\pi \). In comparison with the optimal values (27), this result uncovers the strict suboptimality of actions \(a_0\) and \(a_2\), and indicates that one cannot possibly approach the optimal values unless \(\pi (a_{1}\mid s)\approx 1\).

As further revealed by Lemma 8(ii), one needs to ensure a sufficiently large \(\pi (a_{1}\mid s)\)—i.e., \(\pi (a_{1}\mid s) \ge (1-\gamma )/2\)—in order to achieve \(V^{\pi }(s) \gtrapprox 1/2\).

Lemma 8(iii) establishes an intimate connection between \(V^{\pi }(s)\) and \(V^{\pi }({\overline{s}})\): if we hope to attain \(V^{\pi }({\overline{s}})\gtrapprox 1/2\) for an adjoint state \({\overline{s}}\), then one needs to first ensure that its associated primary state achieves \(V^{\pi }(s)\gtrapprox 1/2\). The equivalence property (63) allows one to propagate the crossing time of state s to that of state \({\overline{s}}\).

In Lemma 8(iv), we make clear that the Q-functions w.r.t. the buffer states are independent of the policy in use.

Lemma 8(v) further establishes an intriguing connection between the crossing time of state s and that of the preceding state \(s-1\).

Lemma 8(vi) uncovers that: (a) if \(V^{\pi }(s-2)\) is not sufficiently large, then the Q-value associated with \((s,a_0)\) dominates the one associated with \((s,a_2)\); (b) if \(V^{\pi }(s-1)\) is not large enough, then the Q-value associated with \((s,a_1)\) is dominated by those of the other two.

As indicated by Lemma 8(vii), the sum of the iterate \(\theta ^{(t)}(s,a)\) over a remains unchanged throughout the execution of the algorithm.
Another key feature that permeates our analysis is a certain monotonicity property of value function estimates as the iteration count t increases, which we discuss in the sequel. To begin with, akin to the monotonicity properties of gradient descent [6], the softmax PG update is known to achieve monotonic performance improvement in a pointwise manner, as summarized in the following lemma. The interested reader is referred to Agarwal et al. [1, Lemma C.2] for details.
Lemma 9
Consider the softmax PG method (12). One has
$$\begin{aligned} V^{(t+1)}(s)\ge V^{(t)}(s) \qquad \text {and} \qquad Q^{(t+1)}(s,a)\ge Q^{(t)}(s,a) \end{aligned}$$
for any state–action pair (s, a) and any \(t\ge 0\), provided that \(0<\eta < (1-\gamma )^2 / 5\).
The preceding monotonicity feature, in conjunction with the uniform initialization scheme, ensures nonnegativity of value function estimates throughout the execution of the algorithm.
Lemma 10
Consider the softmax PG method (12), and suppose the initial policy \(\pi ^{(0)}(\cdot \mid s)\) for any \(s\in {\mathcal {S}}\) is given by a uniform distribution over the action space \({\mathcal {A}}_s\) and \(0< \eta < (1-\gamma )^2/5\). Then one has
$$\begin{aligned} V^{(t)}(s)\ge 0 \qquad \text {for all } s\in {\mathcal {S}} \text { and all } t\ge 0. \end{aligned}$$
Proof
The only negative rewards in our constructed MDP are \(r(s_1,a_0)\) for \(s_1\in {\mathcal {S}}_1\) and \(r(s_2,a_0)\) for \(s_2\in {\mathcal {S}}_2\). When \(\pi ^{(0)}(\cdot \mid s_1)\) is uniformly distributed, the MDP specification (22) gives
$$\begin{aligned} V^{(0)}(s_{1})=\frac{1}{2}\big (-\gamma ^{2}\big )+\frac{1}{2}\gamma ^{2}=0 \qquad \text {for all } s_{1}\in {\mathcal {S}}_{1}. \end{aligned}$$
Similarly, one has \(V^{(0)}(s_2)=0\) for all \(s_2\in {\mathcal {S}}_2\). Applying Lemma 9, we can demonstrate that \(V^{(t)}(s)\ge V^{(0)}(s) \ge 0\) for any \(s\in {\mathcal {S}}_1 \cup {\mathcal {S}}_2\) and any \(t\ge 0\). From the Bellman equation, it is easily seen that the value function \(V^{(t)}\) of any other state is a linear combination (with nonnegative coefficients) of \(\{r(s,a)\mid s\notin {\mathcal {S}}_1, s\notin {\mathcal {S}}_2\}\), \(\{V^{(t)}(s_1) \mid s_1\in {\mathcal {S}}_1 \}\) and \(\{V^{(t)}(s_2)\mid s_2\in {\mathcal {S}}_2\}\), which are all nonnegative. It thus follows that \(V^{(t)}(s)\ge 0\) for any \(s\in {\mathcal {S}}\) and any \(t\ge 0\). \(\square \)
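The monotonicity underlying Lemmas 9 and 10 can be illustrated numerically. The sketch below runs exact softmax PG on a small random MDP (an arbitrary stand-in, not the constructed MDP of Sect. 3; the sizes, rewards, and the deliberately conservative step size are all illustrative assumptions) and checks that no state's value ever decreases across iterations:

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 4, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # transition kernel P[s, a, s']
r = rng.uniform(size=(S, A))                 # rewards in [0, 1]
mu = np.full(S, 1.0 / S)                     # uniform initial distribution

def policy(theta):
    """Softmax policy parameterization."""
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def values(pi):
    """Exact V and Q via the Bellman equation (a linear solve)."""
    P_pi = np.einsum('sa,sap->sp', pi, P)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, (pi * r).sum(axis=1))
    return V, r + gamma * P @ V

def visitation(pi):
    """Discounted state visitation distribution d_mu^pi."""
    P_pi = np.einsum('sa,sap->sp', pi, P)
    return (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, mu)

theta = np.zeros((S, A))          # uniform initialization
eta = (1 - gamma) ** 3 / 8        # conservative choice, well inside the lemma's range
V_prev = None
for _ in range(200):
    pi = policy(theta)
    V, Q = values(pi)
    if V_prev is not None:
        # pointwise monotonic improvement, in the spirit of Lemma 9
        assert np.all(V >= V_prev - 1e-10)
    V_prev = V
    # exact softmax PG update: grad = d_mu(s) * pi(a|s) * A(s,a) / (1 - gamma)
    theta += eta * visitation(pi)[:, None] * pi * (Q - V[:, None]) / (1 - gamma)
```

Since the rewards here are nonnegative, the nonnegativity of the value iterates is immediate in this toy example; the interesting point of Lemma 10 is that it persists even with the negative buffer-state rewards of the constructed MDP.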
1.2 A type of recursive relations
In addition, we make note of a sort of recursive relations that appear commonly when studying the dynamics of gradient descent [6]. The proof of the following lemma can be found in Appendix A.5.
Lemma 11
Consider a positive sequence \(\{x_t\}_{t\ge 0}\).

(i)
Suppose that \(x_t\le x_{t-1}\) for all \(t>0\). If there exists some quantity \(c_{\textrm{l}} >0\) obeying \(c_{\textrm{l}} x_0\le 1/2\) and
$$\begin{aligned} x_t \ge x_{t-1} - c_{\textrm{l}} x_{t-1}^2 \qquad \text {for all }t > 0, \end{aligned}$$(67a)then one has
$$\begin{aligned} x_{t}\ge \frac{1}{2c_{\textrm{l}} t + \frac{1}{x_{0}}} \qquad \text {for all }t \ge 0. \end{aligned}$$(67b) 
(ii)
If there exists some quantity \(c_{\textrm{u}} >0\) obeying
$$\begin{aligned} x_t \le x_{t-1} - c_{\textrm{u}} x_{t-1}^2 \qquad \text {for all }t > 0, \end{aligned}$$(68a)then it follows that
$$\begin{aligned} x_{t}\le \frac{1}{c_{\textrm{u}} t + \frac{1}{x_{0}}} \qquad \text {for all }t \ge 0. \end{aligned}$$(68b) 
(iii)
Suppose that \(0<x_{t} < c_{x}\) for all \(t<t_{0}\) and \(x_{t_0}\ge c_{x}\) for some quantity \(c_x>0\). Assume that
$$\begin{aligned} x_{t}\ge x_{t-1}+c_{-}x_{t-1}^{2}\qquad \text {for all }0<t\le t_0 \end{aligned}$$(69a)for some quantity \(c_{-} >0\). Then one necessarily has
$$\begin{aligned} t_0 \le \frac{1+c_{-}c_x}{c_{-} x_{0}}. \end{aligned}$$(69b) 
(iv)
Suppose that
$$\begin{aligned} 0 \le x_{t}\le x_{t-1}+c_{+}x_{t-1}^{2}\qquad \text {for all }0 < t \le t_{0} \end{aligned}$$(70a)for some quantity \(c_+ >0\). Then one necessarily has
$$\begin{aligned} t_0 \ge \frac{\frac{1}{x_0}-\frac{1}{x_{t_0}}}{c_+}. \end{aligned}$$(70b)
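As a sanity check on Lemma 11, one can simulate the equality case of the recursions (67a) and (68a), which satisfies both hypotheses simultaneously, and confirm the resulting \(1/t\)-type sandwich bounds. The constants `c` and `x0` below are hypothetical illustrative choices obeying \(c\, x_0\le 1/2\):

```python
def check_lemma11(c, x0, T):
    """Simulate x_t = x_{t-1} - c*x_{t-1}^2 and verify (67b) and (68b)."""
    assert c * x0 <= 0.5          # condition c_l * x_0 <= 1/2 of part (i)
    x, xs = x0, [x0]
    for _ in range(T):
        x = x - c * x * x         # stays positive and monotonically decreases
        xs.append(x)
    for t, xt in enumerate(xs):
        # lower bound (67b) and upper bound (68b)
        assert 1.0 / (2 * c * t + 1.0 / x0) <= xt <= 1.0 / (c * t + 1.0 / x0)
    return xs

xs = check_lemma11(c=0.1, x0=1.0, T=1000)
```

The upper bound follows from \(1/x_t \ge 1/x_{t-1} + c\), and the lower bound from \(1/x_t \le 1/x_{t-1} + 2c\) whenever \(c\,x_{t-1}\le 1/2\), which is exactly how the proof in Appendix A.5 proceeds.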
1.3 Proof of Lemma 1

(i)
Let us start with state 0. Given that this is an absorbing state and that \(r(0,a_0)=0\), we have \( V^{\star }(0) = 0\).

(ii)
Next, we turn to the buffer states in \({\mathcal {S}}_1\) and \({\mathcal {S}}_2\). For any \(s_1\in {\mathcal {S}}_1\), the Bellman equation gives
$$\begin{aligned} Q^{\star }(s_{1},a_{0})&=r(s_{1},a_{0})+\gamma V^{\star }(0)=-\gamma ^{2}; \end{aligned}$$(71a)$$\begin{aligned} Q^{\star }(s_{1},a_{1})&=r(s_{1},a_{1})+\gamma V^{\star }(0) = \gamma ^{2} . \end{aligned}$$(71b)This in turn implies that \(V^{\star }(s_1)= Q^{\star }(s_{1},a_{1})=\gamma ^2\). Repeating the same argument, we arrive at \(V^{\star }(s_2) = Q^{\star }(s_2,a_1) = r(s_{2},a_{1}) = \gamma ^4\) for any \(s_2\in {\mathcal {S}}_2\).

(iii)
We then move on to the adjoint states \({\overline{1}}\) and \({\overline{2}}\). From the construction (23), the Bellman equation yields
$$\begin{aligned} Q^{\star }({\overline{1}},a_{0})&=r({\overline{1}},a_{0})+\gamma V^{\star }(0)=\gamma \tau _{1} < \gamma /2,\\ Q^{\star }({\overline{1}},a_{1})&=r({\overline{1}},a_{1})+\frac{\gamma }{|{\mathcal {S}}_{1}|}\sum _{s_{1}\in {\mathcal {S}}_{1}}V^{\star }(s_{1})= \frac{\gamma }{|{\mathcal {S}}_{1}|}\sum _{s_{1}\in {\mathcal {S}}_{1}}V^{\star }(s_{1}) = \gamma ^3, \end{aligned}$$where the last identity follows since \(V^{\star }(s_{1})=\gamma ^2\). This in turn indicates that \(V^{\star }({\overline{1}})=\max \{Q^{\star }({\overline{1}},a_{0}), Q^{\star }({\overline{1}},a_{1})\} =\gamma ^3\), provided that \(\gamma ^2 \ge 1/2\). Similarly, repeating this argument shows that \(V^{\star }({\overline{2}})=\gamma ^5\), as long as \(\gamma ^4 \ge 1/2\). As before, the optimal action in state \({\overline{1}}\) (resp. \({\overline{2}}\)) is \(a_1\).

(iv)
The next step is to determine \(V^{\star }(s)\) for any \(s\in \{3,\ldots ,H\}\). Suppose that \(V^{\star }(\overline{s-2})=\gamma ^{2s-3}\) and \(V^{\star }(\overline{s-1})=\gamma ^{2s-1}\). Then the construction (20) together with the Bellman equation yields
$$\begin{aligned} Q^{\star }(s,a_{0})&=r(s,a_{0})+\gamma V^{\star }(0)=r_{s}+\gamma ^{2}p\tau _{s-2}< 2/3;\\ Q^{\star }(s,a_{1})&=r(s,a_{1})+\gamma V^{\star }(\overline{s-1})=\gamma \cdot \gamma ^{2s-1}=\gamma ^{2s}; \\ Q^{\star }(s,a_{2})&=r(s,a_{2})+\gamma (1-p)V^{\star }(0)+\gamma pV^{\star }(\overline{s-2})\\&=r_{s}+p\gamma ^{2s-2} < 2/3. \end{aligned}$$Consequently, one has \(V^{\star }(s)=Q^{\star }(s,a_{1})=\gamma ^{2s}\)—namely, \(a_1\) is the optimal action—as long as \(\gamma ^{2s}\ge 2/3\).

(v)
We then turn attention to \(V^{\star }({\overline{s}})\) for any \({\overline{s}}\in \{{\overline{3}},\ldots ,{\overline{H}}\}\). Suppose that \(V^{\star }(s)= \gamma ^{2s}\). In view of the construction (21) and the Bellman equation, one has
$$\begin{aligned} Q^{\star }({\overline{s}},a_{0})&=r({\overline{s}},a_{0})+\gamma V^{\star }(0)=\gamma \tau _{s} < 1/2;\\ Q^{\star }({\overline{s}},a_{1})&=r({\overline{s}},a_{1})+\gamma V^{\star }(s)=\gamma ^{2s+1}. \end{aligned}$$Hence, we have \(V^{\star }({\overline{s}})=Q^{\star }({\overline{s}},a_{1})=\gamma ^{2s+1}\)—with the optimal action being \(a_1\)—provided that \(\gamma ^{2s+1} \ge 1/2\).

(vi)
Applying an induction argument based on Steps (iii), (iv) and (v), we conclude that
$$\begin{aligned} V^{\star }(s)= \gamma ^{2s} \qquad \text {and} \qquad V^{\star }({\overline{s}})=\gamma ^{2s+1} \end{aligned}$$(72)for all \(3\le s\le H\), with the proviso that \(\gamma ^{2H}\ge 2/3\) and \(\gamma ^{2H+1}\ge 1/2\).

(vii)
In view of our MDP construction, a negative immediate reward (which is either \(-\gamma ^2\) or \(-\gamma ^4\)) is accrued only when the current state lies in the buffer sets \({\mathcal {S}}_1\) and \({\mathcal {S}}_2\) and when action \(a_0\) is executed. However, once \(a_0\) is taken, the MDP will transition to the absorbing state 0, with all subsequent rewards frozen to 0. In conclusion, the entire MDP trajectory cannot receive negative immediate rewards more than once, thus indicating that \(Q^{\pi }(s,a)\ge \min \{-\gamma ^2, -\gamma ^4\}=-\gamma ^2\) irrespective of \(\pi \) and (s, a).
1.4 Proof of Lemma 8
Proof of Part (i). Before proceeding, we make note of the straightforward fact
$$\begin{aligned} V^{\pi }(0)=0 \qquad \text {for any policy } \pi , \end{aligned}$$(73)
given that state 0 is an absorbing state and \(r(0,a_0)=0\).
For any \(s\in \{ 3, \ldots , H\}\), the construction (20) together with (73) and the Bellman equation yields
$$\begin{aligned} Q^{\pi }(s,a_{0})&=r_{s}+\gamma ^{2}p\tau _{s-2}, \end{aligned}$$(74a)$$\begin{aligned} Q^{\pi }(s,a_{1})&=\gamma V^{\pi }(\overline{s-1}), \end{aligned}$$(74b)$$\begin{aligned} Q^{\pi }(s,a_{2})&=r_{s}+\gamma pV^{\pi }(\overline{s-2}). \end{aligned}$$(74c)
Recalling the choices of \(\tau _s\), \(r_s\) and p in (18), we can continue the derivation in (74a) to reach
Here, the last inequality is valid when \(c_{\textrm{p}}\le 1/6\), given that \(\gamma ^{\frac{1}{3}}+\frac{1-\gamma }{6}\gamma ^{\frac{1}{6}} \le 1\) holds for any \(\gamma <1\).
In addition, combining (74c) with (72), we arrive at
This is guaranteed to hold when \(c_{\textrm{p}}\le 1/6\), given that \(\gamma ^{\frac{1}{3}}+\frac{1-\gamma }{3}\gamma ^{\frac{4s}{3}-\frac{5}{2}}\le \gamma ^{\frac{1}{3}}+\frac{1-\gamma }{3}\gamma ^{\frac{3}{2}}\le 1\) is valid for all \(\gamma <1\) and \(s\ge 3\). Moreover, if one further has \(V^{\pi }(\overline{s-2})\ge 0\), then it is seen from (74c) that
Proof of Part (ii). By virtue of the construction (20), we can invoke the Bellman equation to show that
Here, the second identity comes from (74b), the penultimate line follows from (59), (72), as well as the fact \( V^{\pi }(\overline{s-1})\le V^{\star }(\overline{s-1})\), while the last inequality exploits the fact \(\pi (a_{0}\mid s)+\pi (a_{2}\mid s)=1-\pi (a_{1}\mid s)\).
If \(V^{\pi }(s)\ge \tau _s\), then this together with the upper bound (76) necessarily requires that
which is equivalent to saying that
Putting these arguments together establishes the advertised result (60).
Proof of Part (iii). For any \({\overline{s}}\in \{{\overline{3}},\ldots ,{\overline{H}}\}\), in view of the construction (21) and the Bellman equation, one has
Regarding state \({\overline{1}}\), we have
Similarly, one obtains \(Q^{\pi }({\overline{2}},a_{0})=\gamma \tau _2\) and \(Q^{\pi }({\overline{2}},a_{1}) = \gamma V^{\pi }(2)\).
Next, let us decompose \(V^{\pi }({\overline{s}})\) as follows:
where we have used \(\pi (a_{0}\mid {\overline{s}})+\pi (a_{1}\mid {\overline{s}})=1\). From this relation and the assumption \(\pi ( a_{1}\mid {\overline{s}} )>0\), it is straightforward to see that \(V^{\pi }({\overline{s}})\ge \gamma \tau _s\) if and only if \(V^{\pi }(s) \ge \tau _{s}\). The claim (63) regarding \(t_s(\tau _s)\) and \(t_{{\overline{s}}}(\gamma \tau _s)\) then follows directly from the definition of \(t_s\) (see (30) and (31)).
Proof of Part (iv). For any \(s_{1}\in {\mathcal {S}}_{1}\), the Bellman equation yields
and hence
A similar argument immediately yields that for any \(s_{2}\in {\mathcal {S}}_{2}\),
These together with our notation convention (25) establish (64).
Proof of Part (v). Suppose instead that \(V^{\pi }(s-1) < \tau _{s-1}\). In view of the basic property (62) in Lemma 8, this necessarily requires that
Taking (78) together with the relation (59b) allows us to reach
In addition, the properties (59a) and (59c) imply that
Putting everything together implies that
which contradicts the assumption \(V^{\pi }(s) \ge \gamma ^{\frac{1}{2}}\tau _s\). This establishes the claimed result for any \(s\in \{3,\ldots , H\}\).
Proof of Part (vi). First, due to explicit expressions of the Q functions (74a) and (74c), one has
where the last relation holds since \(V^\pi (\overline{s-2})<\gamma \tau _{s-2}\) when \(V^\pi (s-2)<\tau _{s-2}\) (see (62)).
In addition, following the same derivation as for (79), we see that the condition \(V^{\pi }(s-1) \le \tau _{s-1}\) implies
It is also seen from Part (i) of this lemma that
provided that \(V^{\pi }(\overline{s-2}) \ge 0\). Combining these two inequalities, we arrive at the claimed bound
where the last inequality holds if \(\gamma ^{2s/3+5/6} \ge \gamma ^{s} \ge 1/2\).
Proof of Part (vii). According to the update rule (12), we have—for any policy \(\pi \)—that
where we have used the identities \(\sum _a \pi _{\theta }(a \mid s)=1\) and \(V^{\pi }(s) = \sum _a \pi (a \mid s)Q^{\pi }(s, a) \). As a result, if \(\sum _a \theta ^{(0)}(s,a)=0\), then it follows from the PG update rule that \(\sum _a \theta ^{(t)}(s,a)=0\).
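The zero-sum property (65) can be illustrated in a single-state caricature: under softmax parameterization the per-action gradient components are proportional to \(\pi (a\mid s)\big (Q^{\pi }(s,a)-V^{\pi }(s)\big )\), which sum to zero, so every update adds a zero-sum vector to \(\theta (s,\cdot )\). The fixed Q-values and the step size below are hypothetical placeholders:

```python
import numpy as np

rng = np.random.default_rng(1)
Q = rng.uniform(size=5)          # arbitrary fixed Q-values for one state
theta = np.zeros(5)              # sum_a theta(s, a) = 0 at initialization

for _ in range(50):
    p = np.exp(theta) / np.exp(theta).sum()
    g = p * (Q - p @ Q)          # per-action gradient components, up to the
                                 # positive factor d_mu(s) / (1 - gamma)
    assert abs(g.sum()) < 1e-12  # advantages average to zero under pi(.|s)
    theta += 0.1 * g             # hence sum_a theta(s, a) stays at zero

assert abs(theta.sum()) < 1e-10
```

Note that holding the Q-values fixed is a simplification for illustration only; in the actual algorithm they are re-evaluated at every iterate, which does not affect the zero-sum argument.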
1.5 Proof of Lemma 11
Proof of Part (i). Dividing both sides of (67a) by \(x_{t}x_{t-1}\), we obtain
If \(c_{\textrm{l}} x_{0}\le 1/2\), then the monotonicity assumption gives \(c_{\textrm{l}} x_{t}\le 1/2\) for all \(t\ge 0\). It then follows that
Apply this relation recursively to deduce that
This readily concludes the proof of (67b).
Proof of Part (ii). Similarly, divide both sides of (68a) by \(x_{t}x_{t-1}\) to derive
given the monotonicity and positivity assumption \(0< x_{t}\le x_{t-1}\). Invoking this inequality recursively gives
thus establishing the advertised bound (68b).
Proof of Part (iii). We now turn attention to (69b). As is clearly seen, the nonnegative sequence \(\{x_t\}\) majorizes another sequence \(\{y_t\}\) generated as follows (in the sense that \(x_t\ge y_t\) for all \(0<t\le t_0\))
Dividing both sides of the second equation of (80) by \(y_{t-1}y_{t}\), we reach
To see why the last inequality holds, note that, according to the first equation of (80) and the assumption \(x_{t-1}< c_{x}\) (and hence \(y_{t-1}\le x_{t-1} < c_{x}\)), we have
As a result, we can apply the preceding inequalities recursively to derive
and hence we arrive at (69b).
Proof of Part (iv). The proof of (70b) is quite similar to that of (69b). Let us construct another nonnegative sequence \(\{z_t\}\) as follows
Comparing this with (70a) clearly reveals that \(z_{t} \ge x_t\). Divide both sides of (81) by \(z_tz_{t-1}\) to reach
where the last inequality is valid since, by construction, \(z_t\ge z_{t-1}\). Applying this relation recursively yields
which taken together with the fact \(z_0 = x_0 \) and \(z_{t_0}\ge x_{t_0}\) leads to
Discounted state visitation probability (Lemmas 23)
In this section, we establish our bounds concerning the discounted state visitation probability, as claimed in Lemma 2 and Lemma 3. Throughout this section, we denote by \({\mathbb {P}}(\cdot \mid \pi )\) the probability distribution when policy \(\pi \) is adopted. Also, we recall that \(\mu \) is taken to be a uniform distribution over all states.
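For concreteness, the discounted state visitation distribution of definition (13), \(d_{\mu }^{\pi }(s)=(1-\gamma )\sum _{k\ge 0}\gamma ^{k}\,{\mathbb {P}}(s^{k}=s\mid s^{0}\sim \mu ,\pi )\), can be evaluated in two equivalent ways: by truncating the series, or by solving a linear system. The policy-induced transition matrix below is an arbitrary illustrative example, not the constructed MDP:

```python
import numpy as np

gamma = 0.9
P_pi = np.array([[0.5, 0.5, 0.0],            # illustrative policy-induced
                 [0.1, 0.6, 0.3],            # transition matrix P_pi[s, s']
                 [0.0, 0.2, 0.8]])
mu = np.full(3, 1.0 / 3)                     # uniform initial distribution

# truncated series: (1 - gamma) * sum_k gamma^k P(s^k = s | s^0 ~ mu, pi)
d_series, dist = np.zeros(3), mu.copy()
for k in range(500):
    d_series += (1 - gamma) * gamma ** k * dist
    dist = dist @ P_pi                       # propagate the state distribution

# equivalent closed form: d = (1 - gamma) * (I - gamma * P_pi^T)^{-1} mu
d_solve = (1 - gamma) * np.linalg.solve(np.eye(3) - gamma * P_pi.T, mu)

assert np.allclose(d_series, d_solve, atol=1e-10)
assert np.isclose(d_series.sum(), 1.0, atol=1e-10)
```

The bounds of Lemmas 2 and 3 control exactly the terms \({\mathbb {P}}(s^{k}=s\mid s^{0}\sim \mu ,\pi )\) entering this series.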
1.1 Lower bounds: proof of Lemma 2
Consider an arbitrary policy \(\pi \), and let \(\{s^k\}_{k\ge 0}\) represent an MDP trajectory. For any \(s\in \{3,\ldots ,H\}\), it follows from the definition (13) of \(d_{\mu }^{\pi }\) that
Here, the penultimate identity is valid due to the construction (24) and the assumption that \(\mu \) is uniformly distributed, whereas the last identity results from the assumption (17). This establishes (36a). Repeating the same argument also reveals that
for any \({\overline{s}}\in \{{\overline{1}},\ldots , {\overline{H}}\}\), thus validating the lower bound (36b).
In addition, for any \(s\in {\mathcal {S}}_{1}\), the MDP construction (24) allows one to derive
Here, the last line holds due to the fact that \(\mu \) is uniformly distributed and the assumptions (16) and (17). We have thus concluded the proof for (36c). The proof for (36d) follows from an identical argument and is hence omitted.
1.2 Upper bounds: proof of Lemma 3
1.2.1 Preliminary facts
Before embarking on the proof, we collect several basic yet useful properties that hold when \(t< t_s({\tau }_s)\). First-time readers may proceed directly to Appendix B.2.2.
Properties about \({Q}^{(t)}({\overline{s}},a)\). Combine the property (61) in Lemma 8 with (32) to yield that: for any \(1\le s \le H\) and any \(t< t_s({\tau }_s)\), one has
In addition, combining the property (61) in Lemma 8 with (66) yields: for any \(2\le s \le H\),
holds for all \(s^{\prime }\) obeying \(s \le s^{\prime } \le H\) and all \(t< t_s({\tau }_s)\). As a remark, (83) indicates that \(a_1\) remains unfavored (according to the current estimate \(Q^{(t)}\)) before the iteration number hits \(t_s({\tau }_s)\).
Properties about \({Q}^{(t)}(s+1,a)\) and \({Q}^{(t)}(s+2,a)\). First, combining (84) with the relation (74) reveals that: for any \(2\le s \le H-1\) and any \(t< t_s({\tau }_s)\),
hold as long as \(V^{(t)}(\overline{s-1})\ge 0\) (which is guaranteed by Lemma 10). Similarly, (84) and (74) also give
for any \(1\le s \le H-2\) and any \(t< t_s({\tau }_s)\). Consequently, we have
for all \(t< t_s({\tau }_s)\). In other words, the above two inequalities reveal that actions \(a_1\) and \(a_2\) are perceived as suboptimal (based on the current Q-function estimates) before the iteration count surpasses \(t_s({\tau }_s)\).
Next, consider any \(2\le s\le H-1\) and any \( t< t_s({\tau }_s)\). It has already been shown above that
A similar argument also implies that, for any \( t< t_s({\tau }_s)\),
which forms another property useful for our subsequent analysis.
1.2.2 Proof of the upper bounds (37a) and (37b)
We now turn attention to upper bounding \(d_{\mu }^{(t)}(s)\) for any \(s\in \{3,\ldots ,H\}\). By virtue of the expansion (82), upper bounding \(d_{\mu }^{(t)}(s)\) requires controlling \({\mathbb {P}}\big (s^{k}=s\mid s^{0}\sim \mu ,\pi ^{(t)}\big )\) for all \(k\ge 0\). In light of this, our analysis consists of (i) developing upper bounds on the interrelated quantities \({\mathbb {P}}\big (s^{k}=s\mid s^{0}\sim \mu ,\pi ^{(t)}\big )\) and \({\mathbb {P}}\big (s^{k}= {\overline{s}} \mid s^{0}\sim \mu ,\pi ^{(t)}\big )\) for any \(k\ge 0\), and (ii) combining these upper bounds to control \(d_{\mu }^{(t)}(s)\). At the core of our analysis are the following upper bounds on the tth policy iterate, which will be established in Appendix B.2.6.
Lemma 12
Under the assumption (35), for any \(2\le s\le H\) and any \(t < t_s({\tau }_s)\), one has
Furthermore,
hold if \(2\le s \le H-1\), and
hold if \(1\le s \le H-2\).
In words, Lemma 12 posits that, in the early phase of the algorithm, the policy iterate \(\pi ^{(t)}\) does not assign too much probability mass to actions that are currently perceived as suboptimal (see the remarks in Appendix B.2.1). With this lemma in place, we are positioned to establish the advertised upper bound.
Step 1: bounding \({\mathbb {P}}\big (s^{k}=s\mid s^{0}\sim \mu ,\pi ^{(t)}\big )\). For any \(t < t_s({\tau }_s)\) and any \(s\in \{3,\ldots ,H\}\), making use of the upper bound (88a) and the MDP construction in Sect. 3 yields
for all \(k\ge 2\). Note that the above calculation exploits the fact that \(\mu \) is a uniform distribution.
Step 2: bounding \({\mathbb {P}}\big (s^{k}= {\overline{s}} \mid s^{0}\sim \mu ,\pi ^{(t)}\big )\). Given that \(\mu \) is a uniform distribution, one has
for any \(s\in {\mathcal {S}}\). With (88b) and (88c) in mind, the MDP construction in Sect. 3 allows one to show that
holds for any \(2\le s \le H-2\) and any \(t< t_s({\tau }_s)\), and in addition,
hold for any \(k\ge 2\), \(2\le s \le H-2\), and any \(t< t_s({\tau }_s)\). Moreover, invoking (88b) and the MDP construction once again reveals that
hold for any \(k\ge 2\) and any \(t< t_s({\tau }_s)\). In addition, it is seen that
Step 3: putting all this together. Combining the preceding upper bounds on both \({\mathbb {P}}\big (s^{k}= s \mid s^{0}\sim \mu ,\pi ^{(t)}\big )\) and \({\mathbb {P}}\big (s^{k}= {\overline{s}} \mid s^{0}\sim \mu ,\pi ^{(t)}\big )\) (\(k\ge 1\)) and recognizing the monotonicity property (33), we immediately arrive at the following crude bounds
for any \(k\ge 2\). It is then straightforward to deduce that
for any \(k\ge 1\). In turn, these bounds give rise to
for any \(3\le s \le H\) and any \(t< t_s({\tau }_s)\). This establishes the claimed upper bound (37a) as long as Lemma 12 is valid. Further, replacing s with \({\overline{s}}\) in (92) also reveals that
for any \(2\le s \le H\) and any \(t< t_s({\tau }_s)\), thus concluding the proof of (37b).
1.2.3 Proof of the upper bound (37c)
We now consider any \(s\in {\mathcal {S}}_2\). From our MDP construction, we have
for any \(k\ge 2\) and any \(s\in {\mathcal {S}}_2\). In addition, our bound in (91b) gives
for any \(k\ge 2\) and any \(t<t_2({\tau }_2)\). Consequently, we arrive at
Armed with the preceding inequalities, we can derive
for any \(s\in {\mathcal {S}}_2\) and any \(t< t_2({\tau }_2)\), thus concluding the advertised upper bound for \(s\in {\mathcal {S}}_2\).
1.2.4 Proof of the upper bound (37d)
It follows from our MDP construction that
Moreover, for any \(k\ge 2\) and any \(t<t_3({\tau }_3)\), one can derive
where the last inequality arises from (91a). Putting these bounds together leads to
where we have used the assumption that \(|{\widehat{{\mathcal {S}}}}_{{\overline{1}}}|=c_{\textrm{m}}(1-\gamma )|{\mathcal {S}}|\). When \(t<t_2({\tau }_2)\), the monotonicity property (33) indicates that \(t<t_3({\tau }_3)\), thus concluding the proof of (37d).
1.2.5 Proof of the upper bound (37e)
In view of our MDP construction, for any \(s \in {\mathcal {S}}_1\) and any \(t<\min \{t_1({\tau }_1),t_2({\tau }_2) \}\) we have
where k is any integer obeying \(k\ge 2\). Here, the last inequality comes from (95). These bounds taken collectively demonstrate that
for any \(s \in {\mathcal {S}}_1\) and any \(t<\min \{t_1({\tau }_1),t_2({\tau }_2) \}\). This completes the proof.
1.2.6 Proof of Lemma 12
In order to prove this lemma, we are in need of the following auxiliary result, whose proof can be found in Appendix B.2.7.
Lemma 13
Consider any state \(1\le s \le H \). Suppose that \( 0 <\eta \le (1-\gamma )/2\).

(i)
If the following conditions
$$\begin{aligned} {Q}^{(t)}(s, a_0) - {Q}^{(t)}(s, a_1)&\ge 0,\quad \quad \quad {Q}^{(t)}(s, a_2) - {Q}^{(t)}(s, a_1) \ge 0 \\ \pi ^{(t-1)}(a_{1} \mid s )&\le \min \big \{ \pi ^{(t-1)}(a_{0} \mid s ), \pi ^{(t-1)}(a_{2} \mid s ) \big \} \end{aligned}$$hold, then one has \(\pi ^{(t)}(a_{1} \mid s )\le 1/3\) and \(\pi ^{(t)}(a_{1} \mid s ) \le \min \big \{ \pi ^{(t)}(a_{0} \mid s ), \pi ^{(t)}(a_{2} \mid s ) \big \}\).

(ii)
If the following conditions
$$\begin{aligned} {Q}^{(t)}(s, a_0) - {Q}^{(t)}(s, a_2) \ge 0 \quad \text {and} \quad \pi ^{(t-1)}(a_{2} \mid s ) \le \pi ^{(t-1)}(a_{0} \mid s ) \end{aligned}$$hold, then one has \(\pi ^{(t)}(a_{2} \mid s )\le 1/2\) and \(\pi ^{(t)}(a_{2} \mid s ) \le \pi ^{(t)}(a_{0} \mid s )\).

(iii)
If the following conditions
$$\begin{aligned} {Q}^{(t)}({\overline{s}}, a_0) - {Q}^{(t)}({\overline{s}}, a_1) \ge 0 \quad \text {and} \quad \pi ^{(t-1)}(a_{1} \mid {\overline{s}} ) \le \pi ^{(t-1)}(a_{0} \mid {\overline{s}} ) \end{aligned}$$hold, then one has \(\pi ^{(t)}(a_{1} \mid {\overline{s}} )\le 1/2\) and \(\pi ^{(t)}(a_{1} \mid {\overline{s}} ) \le \pi ^{(t)}(a_{0} \mid {\overline{s}} )\).
Remark 9
In words, Lemma 13 develops nontrivial upper bounds on the policy associated with actions that are currently perceived as suboptimal. As we shall see, such upper bounds—which are strictly below 1—translate to some contraction factors that enable the advertised result of this lemma.
With Lemma 13 in place, we proceed to prove Lemma 12 by induction. Let us start from the base case with \(t=0\). Given that the initial policy is chosen to be uniformly distributed, we have
Therefore, the claim (88) trivially holds for \(t=0\).
Next, we move on to the induction step. Suppose that the induction hypothesis (88) holds for the t-th iteration, and we intend to establish it for the (t+1)-th iteration. Apply Lemma 13 with Conditions (83) and (88a) to yield
with the proviso that \(0<\eta \le (1-\gamma )/2\). Clearly, this also implies that \(\pi ^{(t+1)}(a_1\,|\,{\overline{s}})\le 1/2\). Further, invoke Lemma 13 once again with Condition (87) and the induction hypothesis (88) to arrive at
A straightforward consequence is \(\pi ^{(t+1)}(a_1 \,|\,s+1) \le 1/3\) and \(\pi ^{(t+1)}(a_2 \,|\,s+2) \le 1/2\). The proof is thus complete by induction.
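The flavor of this induction can be reproduced in a stylized single-state (bandit) analogue of the three-action construction, iterating the exact softmax PG update; the Q-values below are hypothetical, chosen only so that \(a_1\) is the worst action, and the instance is an illustration rather than the paper's MDP:

```python
import math

# Stylized three-action bandit: exact softmax PG iterates
#   theta(a) <- theta(a) + eta * pi(a) * (Q(a) - V),   V = <pi, Q>.
# Starting from the uniform initialization, the induction's invariant
# (pi(a_1) <= 1/3 and a_1 keeps the smallest weight) holds along the
# whole trajectory.

def softmax(theta):
    m = max(theta)
    exps = [math.exp(v - m) for v in theta]
    z = sum(exps)
    return [e / z for e in exps]

Q = [1.0, 0.2, 0.8]            # Q(a_0), Q(a_1), Q(a_2): a_1 is worst
eta = 0.05                     # small step size (cf. eta <= (1 - gamma)/2)
theta = [0.0, 0.0, 0.0]        # uniform initial policy

trajectory = []
for _ in range(200):
    pi = softmax(theta)
    V = sum(p * q for p, q in zip(pi, Q))
    theta = [th + eta * p * (q - V) for th, p, q in zip(theta, pi, Q)]
    trajectory.append(softmax(theta))

assert all(pi[1] <= 1 / 3 + 1e-12 for pi in trajectory)
assert all(pi[1] <= min(pi[0], pi[2]) + 1e-12 for pi in trajectory)
```
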
1.2.7 Proof of Lemma 13
First of all, suppose that \({Q}^{(t)}(s, a_0) - {Q}^{(t)}(s, a_1) \ge 0\) and \(\pi ^{(t-1)}(a_{0}\,|\,s)\ge \pi ^{(t-1)}(a_{1}\,|\,s)\) hold true. Combining this result with the PG update rule (12) gives
Consequently, applying this inequality and using the PG update rule (12) yield
where the last line arises by combining terms and invoking the assumption \(\pi ^{(t-1)}(a_{0}\,|\,s) \ge \pi ^{(t-1)}(a_{1}\,|\,s)\).
Additionally, it is seen from the definition of the advantage function that
where the last inequality follows from Lemma 1. Recognizing that \(d^{(t-1)}_{\mu }(s) \le 1\), one obtains
with the proviso that \(0<\eta \le (1-\gamma )/2\).
Substituting (99) into (97) then yields
where both the first line and the last identity rely on the fact that \(\theta ^{(t-1)}(s,a_{1})\le \theta ^{(t-1)}(s,a_{0})\)—an immediate consequence of the assumption \(\pi ^{(t-1)}(a_{1} \,|\,s)\le \pi ^{(t-1)}(a_{0}\,|\,s)\). To see why the inequality (100) holds, it suffices to make note of the following consequence of softmax parameterization:
where (b) follows since \(\frac{e^x-1}{e^x+1}\le x\) for all \(x\ge 0\), and the validity of (a) is guaranteed since
To conclude, the above result (100) implies that
Repeating the above argument immediately reveals that: if
then one has \(\pi ^{(t)}(a_{2} \,|\,s ) \ge \pi ^{(t)}(a_{1} \,|\,s ) \), which together with (101) indicates that
This establishes Part (i) of Lemma 13.
The proofs of Parts (ii) and (iii) follow from exactly the same argument as for Part (i), and are hence omitted for the sake of brevity.
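The elementary softmax inequality invoked in step (b) of the proof, \(\frac{e^x-1}{e^x+1}\le x\) for \(x\ge 0\), is just the bound \(\tanh (x/2)\le x/2\le x\); a quick numerical sanity check:

```python
import math

# Check (e^x - 1)/(e^x + 1) <= x for x >= 0; the left-hand side equals
# tanh(x/2), so the bound follows from tanh(x/2) <= x/2 <= x.
for i in range(0, 1001):
    x = 0.01 * i                                  # grid on [0, 10]
    lhs = (math.exp(x) - 1.0) / (math.exp(x) + 1.0)
    assert abs(lhs - math.tanh(x / 2.0)) < 1e-9   # the underlying identity
    assert lhs <= x + 1e-12                       # the bound used in (b)
```
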
Crossing times of the first few states (Lemma 4)
This section presents the proof of Lemma 4 regarding the crossing times w.r.t. \({\mathcal {S}}_1\), \({\mathcal {S}}_2\), and state \({\overline{1}}\).
1.1 Crossing times for the buffer states in \({\mathcal {S}}_1\) and \({\mathcal {S}}_2\)
We first present the proof of the relation (38a) regarding several quantities about \(t_1\) and \(t_2\).
Step 1: characterize the policy gradients. Our analysis largely relies on understanding the policy gradient dynamics, towards which we need to first characterize the gradient. Recall that the gradient of \(V^{(t)}\) w.r.t. \(\theta ^{(t)}(1, a_1)\) (cf. (12b)) is given by
where in the last step we use \(Q^{(t)}(1, a_1) - Q^{(t)}(1, a_0) = 2\gamma ^{2}\) (see (64)). The same calculation also yields
As an immediate consequence, the PG update rule (12a) reveals that both \(\theta ^{(t)}(1,a_1)\) (resp. \(\pi ^{(t)}(a_1\,|\,1)\)) and \(\theta ^{(t)}(2,a_1)\) (resp. \(\pi ^{(t)}(a_1\,|\,2)\)) are monotonically increasing with t throughout the execution of the algorithm, which together with the initial condition \(\pi ^{(0)}(a_0 \,|\,1) = \pi ^{(0)}(a_1\,|\,1)=\pi ^{(0)}(a_0 \,|\,2) = \pi ^{(0)}(a_1\,|\,2)\) as well as the identities \(\theta ^{(t)}(1,a_1)=-\theta ^{(t)}(1,a_0)\) and \(\theta ^{(t)}(2,a_1)=-\theta ^{(t)}(2,a_0)\) (due to (65)) gives
Step 2: determine the range of \(\pi ^{(t)}(\cdot \,|\,1)\) and \(\pi ^{(t)}(\cdot \,|\,2)\). From the basic property (64), the value function of the buffer states in \({\mathcal {S}}_1\)—abbreviated by \(V^{(t)}(1)\) as in the notation convention (25)—satisfies
given that \(\pi ^{(t)}(a_0 \,|\,1) + \pi ^{(t)}(a_1 \,|\,1) = 1\). Therefore, for any \(t < t_{1}(\gamma ^2-1/4)\)—which means \(V^{(t)}(1)< \gamma ^2-1/4\) according to the definition (31)—one has the following upper bound:
This is equivalent to requiring that
and, consequently, \(\pi ^{(t)}(a_0 \,|\,1) = 1 - \pi ^{(t)}(a_1 \,|\,1) \ge 1/8\) for any \(t < t_{1}(\gamma ^2-1/4)\). Putting this and (104) together further implies—for every \(t < t_{1}(\gamma ^2-1/4)\)—that:
Step 3: determine the range of policy gradients. In addition to showing the nonnegativity of \(\frac{\partial V^{(t)}(\mu )}{\partial \theta (1, a_1)}\) and \(\frac{\partial V^{(t)}(\mu )}{\partial \theta (2, a_1)}\) for all \(t\ge 0\), we are also in need of bounding their magnitudes. Towards this, invoke the property (107) to bound the derivative (102) by
for any \(t < t_{1}(\gamma ^2-1/4)\), where we have used the elementary facts
Similarly, repeating the above argument with the gradient expression (103) leads to
Further, note that Lemma 2 and Lemma 3 deliver upper and lower bounds on the quantities \(d^{(t)}_{\mu }(1)\) and \(d^{(t)}_{\mu }(2)\), which allow us to deduce that
Step 4: develop an upper bound on \(t_{1}(\gamma ^2-1/4)\). The preceding bounds allow us to develop an upper bound on \(t_{1}(\gamma ^2-1/4)\). To do so, it is first observed from the fact \(\theta ^{(t)}(1,a_{0})=-\theta ^{(t)}(1,a_{1})\) (due to (65)) that
Recognizing that \(V^{(t)}(1) < \gamma ^2-1/4\) occurs if and only if \(\pi ^{(t)}(a_1 \,|\,1) < 1-(8\gamma ^2)^{-1}\) (see (106)), we can easily demonstrate that
If \(t_{1}\big (\gamma ^{2}-1/4\big ) \ge \big \lceil \frac{32\log (7)c_{\textrm{b},1}|{\mathcal {S}}|}{7\gamma ^{3}c_{\textrm{m}}\eta } \big \rceil \), then taking \(t= \big \lceil \frac{32\log (7)c_{\textrm{b},1}|{\mathcal {S}}|}{7\gamma ^{3}c_{\textrm{m}}\eta } \big \rceil \) together with (110) and (12a) yields
thus leading to contradiction with (111). As a result, one arrives at the following upper bound:
with the proviso that \(\gamma \ge 0.85\) (so that \(\tau _1\le \gamma ^{2}-1/4\)).
An upper bound on \(t_2(\gamma ^4 - 1/4)\) (and hence \(t_2(\tau _2)\)) can be obtained in a completely analogous manner
provided that \(\gamma \ge 0.95\) (so that \(\tau _2\le \gamma ^{4}-1/4\)). We omit the proof of this part for the sake of brevity.
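The monotonicity exploited in Steps 1-4 can be reproduced in a stripped-down two-action, single-state softmax PG instance; the constant gap below is a hypothetical stand-in for the gap \(2\gamma ^2\) from (64), and the update is the exact bandit PG step (a sketch, not the paper's MDP):

```python
import math

# Two-action, single-state softmax PG with a constant positive Q-gap.
# With theta(a_0) = -theta(a_1) (zero-sum parameterization), the exact
# gradient on theta(a_1) is pi(a_1) * (1 - pi(a_1)) * gap > 0, so both
# theta(a_1) and pi(a_1) increase monotonically, mirroring Step 1.
gap = 0.5          # hypothetical stand-in for the gap 2*gamma^2
eta = 0.1
theta1 = 0.0       # uniform initialization: theta(a_1) = theta(a_0) = 0

pis = []
for _ in range(100):
    pi1 = 1.0 / (1.0 + math.exp(-2.0 * theta1))   # softmax weight on a_1
    theta1 += eta * pi1 * (1.0 - pi1) * gap       # exact bandit PG step
    pis.append(1.0 / (1.0 + math.exp(-2.0 * theta1)))

assert all(later >= earlier for earlier, later in zip(pis, pis[1:]))
assert pis[0] >= 0.5          # never drops below the uniform value 1/2
```
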
Step 5: develop a lower bound on \(t_{2}({\tau }_2)\). Repeating the argument in (106) and (111), we see that \(V^{(t)}(2) \ge {\tau }_2\) if and only if \(\pi ^{(t)}(a_1\,|\,2) \ge \frac{1}{2}+\frac{{\tau }_2}{2\gamma ^4}\), which is also equivalent to
as long as \(2{\tau }_2 > \gamma ^{4}\). Of necessity, this implies that \(\theta ^{(t)}(2,a_{1})>\frac{1}{2} \log 3\) when \(t=t_2 ({\tau }_2)\). If \(t_{2}({\tau }_2) \le \frac{|{\mathcal {S}}|\log 3}{2\eta \gamma ^{4}\left( 1+8c_{\textrm{m}}/c_{\textrm{b},2}\right) }\), then invoking (110) and (12a) and taking \(t= t_{2} ({\tau }_2)\) yield
thus resulting in a contradiction. We can thus conclude that
As an important byproduct, comparing (113) with (112) immediately reveals that
with the proviso that \(\frac{\log 3}{1+8c_{\textrm{m}}/c_{\textrm{b},2}}\ge \frac{15c_{\textrm{b},1}}{c_{\textrm{m}}}\) and \(\gamma \ge 0.87\) (so that \(\gamma ^{2}-1/4>{\tau }_1\)).
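Steps 4-6 share one mechanism: two-sided bounds on the per-iteration gain of \(\theta \) sandwich the crossing time of any \(\theta \)-threshold, and a violation of either side produces the contradiction. A minimal numerical sketch of this template, with purely illustrative constants (not the paper's):

```python
import math
import random

# If each exact-PG step gains between eta*g_lo and eta*g_hi in theta, the
# first crossing time T of a threshold Theta obeys
#     Theta / (eta * g_hi)  <=  T  <=  ceil(Theta / (eta * g_lo)),
# which is precisely the contradiction template of Steps 4-6.
random.seed(0)
eta, g_lo, g_hi, Theta = 0.01, 0.2, 1.0, 0.5   # illustrative constants

theta, T = 0.0, 0
while theta < Theta:
    theta += eta * random.uniform(g_lo, g_hi)   # any per-step gain in range
    T += 1

assert T >= Theta / (eta * g_hi)                # lower bound: T >= 50
assert T <= math.ceil(Theta / (eta * g_lo))     # upper bound: T <= 250
```
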
Step 6: develop a lower bound on \(t_{1}({\tau }_1)\). Repeat the analysis in (106) and (111) to show that: \(V^{(t)}(1) \ge {\tau }_1\) if and only if
Clearly, this lower bound should hold if \(t=t_{1} ({\tau }_1 )\). In addition, in view of (114), one has \(\min \{t_1({\tau }_1), t_2({\tau }_2)\}=t_1({\tau }_1)\). If \(t_1({\tau }_1) \le \frac{|{\mathcal {S}}|\log 3}{\eta \gamma ^{2}(1+17c_{\textrm{m}}/c_{\textrm{b},1})}\), then setting \(t= t_1({\tau }_1) =\min \{t_1({\tau }_1), t_2({\tau }_2)\} \) and applying (110) and (12a) lead to
which is contradictory to the preceding lower bound. This in turn implies that
1.2 Crossing times for the adjoint state \({\overline{1}}\)
We now move on to the proof of (38b). Note that we have developed a lower bound on \(t_{2}({\tau }_2)\) in (113). In order to justify the advertised result \(t_{2}({\tau }_2) > t_{{\overline{1}}}\big (\gamma ^3-1/4\big )\), it thus suffices to demonstrate that
a goal we aim to accomplish in this subsection.
To do so, we divide into two cases. In the scenario where \(t_{1} ({\tau }_1 ) \ge t_{{\overline{1}}}\big (\gamma ^3-1/4\big )\), the bound (112) derived previously immediately leads to the desired bound:
with the proviso that \(\frac{15c_{\textrm{b},1}}{c_{\textrm{m}}}\le \frac{\log 3}{1+8c_{\textrm{m}}/c_{\textrm{b},2}}\). Consequently, the subsequent analysis concentrates on establishing (116) for the case where
In what follows, we divide into three stages and investigate each one separately, after presenting some basic gradient calculations that shall be invoked frequently.
Gradient characterizations. To begin with, observe from (12) that
which makes use of the fact \(\pi ^{(t)}\big (a_{0}\,|\,{\overline{1}}\big )+ \pi ^{(t)}\big (a_{1}\,|\,{\overline{1}}\big ) =1\). Analogously, we have
Stage 1: any t obeying \(t < t_{1} ({\tau }_1 )\). We start by looking at each term in the gradient expression (117a) separately. First, note that when \(t < t_{1} ({\tau }_1 )\), one has \(V^{(t)}(1)<{\tau }_1\), which combined with (61) in Lemma 8 indicates that \(Q^{(t)}({\overline{1}}, a_1) = \gamma V^{(t)}(1) < \gamma \tau _1 = Q^{(t)}({\overline{1}}, a_0)\). In fact, from the definition (18a) of \(\tau _1\), the property (61) and Lemma 10, we have
Additionally, recall that \(t_{1} ({\tau }_1 ) < t_{2} ({\tau }_2 )\) (see (114)). Lemma 3 then tells us that \(d^{(t)}_{\mu }({\overline{1}}) \le 14c_{\textrm{m}}(1-\gamma )^2\) during this stage. Substituting these into (117a) and using \(\pi ^{(t)}(a_0 \,|\,{\overline{1}})\le 1\), we arrive at
which together with the PG update rule (12) also indicates that \(\theta ^{(t)}({\overline{1}},a_{1})\) (and hence \(\pi ^{(t)}\big (a_{1}\,|\,{\overline{1}}\big )\)) is monotonically nonincreasing with t in this stage. Invoke the auxiliary fact in Lemma 14 to reach
Taking the preceding recursive relation together with Lemma 11 and recalling the initialization \(\pi ^{(0)}\big (a_{1}\,|\,{\overline{1}}\big )=1/2\), we can guarantee that
provided that \(14\eta c_{\textrm{m}}(1-\gamma )\le 1\). In conclusion, the above calculation precludes \(\pi ^{(t)}\big (a_{1}\,|\,{\overline{1}}\big )\) from decaying to zero too quickly, an observation that is particularly useful for our analysis in Stage 3.
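The slow-decay conclusion here follows the pattern of Lemma 11: a nonnegative sequence obeying \(x_{t+1}\ge (1-\delta )x_t\) with \(0\le \delta \le 1/2\) satisfies \(x_t \ge x_0 e^{-2\delta t}\), since \(1-\delta \ge e^{-2\delta }\) on this range. A numerical sketch of this mechanism (the contraction rate and horizon below are illustrative):

```python
import math
import random

# A nonnegative sequence with x_{t+1} >= (1 - delta) * x_t, 0 <= delta <= 1/2,
# cannot decay faster than x_0 * exp(-2 * delta * t), because
# 1 - delta >= exp(-2 * delta) on that range.
random.seed(1)
delta = 0.3
assert 1.0 - delta >= math.exp(-2.0 * delta)    # the key scalar inequality

x = 0.5                                          # mirrors pi^{(0)} = 1/2
for t in range(1, 201):
    x *= 1.0 - random.uniform(0.0, delta)        # per-step loss at most delta
    assert x >= 0.5 * math.exp(-2.0 * delta * t) - 1e-15
```
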
Stage 2: any t obeying \(t_{1}({\tau }_1) \le t < t_{1}(\gamma ^2-1/4)\). The main step lies in extending the lower bound (119) to this stage. From the definition (31) of \(t_{1}({\tau }_1)\) as well as the monotonicity of \(V^{(t)}(1)\) (see Lemma 9), we know that
provided that \(\eta < (1-\gamma )^2/5\). This taken together with the property (61) in Lemma 8 reveals that
and hence \(\pi ^{(t)}\big (a_{1}\,|\,{\overline{1}}\big )\) is nondecreasing in t during this stage. Therefore, we have
where the first inequality follows from the nondecreasing property established above, and the second inequality follows from the lower bound (119). In fact, we have established a lower bound on \(\pi ^{(t)}\big (a_{1}\,|\,{\overline{1}}\big )\) that holds for the entire trajectory of the algorithm.
Stage 3: any t obeying \(t_{1}(\gamma ^2-1/4) \le t \le t_{{\overline{1}}}(\gamma ^3-1/4)\). To facilitate analysis, we single out a time threshold \(t^{\prime }\) as follows:
We begin by developing an upper bound on \(\pi ^{(t)}\big (a_{0}\,|\,{\overline{1}}\big ) \) for any \(t \ge \max \{t^{\prime }, t_{1}(\gamma ^2-1/4)\}\). Towards this, with the help of (61) in Lemma 8 we make the observation that: for any \(t \ge t_{1}(\gamma ^2-1/4)\), one has
as long as \(\gamma \ge 0.92\), which combined with (117b) indicates that
Recognizing that \(d^{\pi }_{\mu }({\overline{1}}) \ge c_{\textrm{m}}\gamma (1-\gamma )^2\) (see Lemma 2), we can continue from (117b) to derive
for any \(t \ge \max \{t^{\prime }, t_{1}(\gamma ^2-1/4)\}\), which implies
Invoke Lemma 14 to arrive at
provided that \(2\eta \frac{\partial {V}^{(t)}(\mu )}{\partial \theta ({\overline{1}},a_{0})}\ge -1\), which is guaranteed by \(\eta < (1-\gamma )/2\). Recalling that \(\pi ^{(t)}\big (a_{0}\,|\,{\overline{1}}\big ) \le 1/2\) for this entire stage, one can apply Lemma 11 to obtain
for any \(t \ge \max \{t^{\prime }, t_{1}(\gamma ^2-1/4)\}\).
With the above upper bound (124) in place, we are capable of showing that the target quantity \(t_{{\overline{1}}}\big ( \gamma ^{3}-1/4 \big )\) is not much larger than \(\max \{t^{\prime }, t_{1}(\gamma ^2-1/4)\}\). To show this, we first note that the value function of the adjoint state \({\overline{1}}\) obeys (see Part (iii) in Lemma 8)
where the inequality holds since \(V^{(t)}(1)\ge \gamma ^{2}-1/4\) in this stage (given that \(t\ge t_1(\gamma ^2-1/4)\)). Recognizing that \(0.5\gamma ^{2/3}-\gamma ^{2}+1/4<0\) for any \(\gamma \ge 0.85\), we can rearrange terms to demonstrate that \(V^{(t)}({\overline{1}})\ge \gamma ^3 - 1/4\) holds once
In fact, for any \(\gamma \ge 0.85\), the above inequality is guaranteed to hold as long as \(\pi ^{(t)}(a_{0}\,|\,{\overline{1}})\le 1-\gamma \) since \(4\gamma \left( \gamma ^{2}-1/4-0.5\gamma ^{2/3}\right) <1\). In view of (124), we can achieve \(\pi ^{(t)}(a_0 \,|\,{\overline{1}}) \le 1-\gamma \) as soon as \(t - \max \left\{ t^{\prime }, t_{1}(\gamma ^2-1/4)\right\} \) surpasses \(\frac{40}{c_{\textrm{m}}\gamma \eta (1-\gamma )^2}\). As a consequence, we reach
Armed with the relation (125), the goal of upper bounding \(t_{{\overline{1}}}\big ( \gamma ^{3}-1/4 \big )\) can be accomplished by controlling \(t^{\prime }\). To this end, we claim for the moment that
If this claim holds, then combining it with (125) and (112) would result in the advertised bound (116):
where the penultimate inequality relies on the assumptions \(\frac{c_{\textrm{b},1}}{c_{\textrm{m}}} \le \frac{1}{79776}\) and \(|{\mathcal {S}}| \ge \frac{320\gamma ^3}{c_{\textrm{m}}(1-\gamma )^{2}}\), and the last one holds as long as \(\frac{1}{4\gamma ^{4}}\le \frac{\log 3}{1+8c_{\textrm{m}}/c_{\textrm{b},2}}\). To finish up, it suffices to establish the claim (126).
Proof of the claim (126). It is sufficient to consider the case where \(t^{\prime } > t_{1}(\gamma ^2-1/4)\); otherwise the inequality (126) is trivially satisfied. Since Lemma 2 tells us that \(d^{\pi }_{\mu }({\overline{1}}) \ge c_{\textrm{m}}\gamma (1-\gamma )^2\), we can see from (117a) that, for any t with \(t_{1}(\gamma ^2-1/4) \le t < t^{\prime }\),
where the last line follows by combining (122) and the fact that \(\pi ^{(t)}(a_{0}\,|\,{\overline{1}})\ge 1/2\) for any \(t<t^{\prime }\) (see the definition (121) of \(t^{\prime }\)). According to Lemma 14, we can demonstrate that
for any t obeying \(t_{1}(\gamma ^2-1/4) \le t < t^{\prime }\). Invoking Lemma 11, we then have
as claimed, where the second line follows from (120).
1.3 Auxiliary facts
In this subsection, we collect some elementary facts that have been used multiple times in the proof of Lemma 4. Specifically, the lemma below makes clear an explicit link between the gradient \(\nabla _{\theta } V^{(t)}(\mu )\) and the difference between two consecutive policy iterates.
Lemma 14
Consider any s whose associated action space is \(\{a_0,a_1\}\).

If \(\frac{\partial V^{(t)}(\mu )}{\partial \theta (s,a_{1})}\le 0\), then one has
$$\begin{aligned} \pi ^{(t+1)}\big (a_{1}\,|\,s\big )-\pi ^{(t)}\big (a_{1}\,|\,s\big ) \ge 2 \eta \pi ^{(t)}\big (a_{1}\,|\,s\big ) \frac{\partial V^{(t)}(\mu )}{\partial \theta (s,a_{1})}. \end{aligned}$$(127) 
If \(\pi ^{(t+1)}\big (a_{0}\,|\,s\big )\ge 1/2\) and \(-1\le 2\eta \frac{\partial V^{(t)}(\mu )}{\partial \theta (s,a_{1})}\le 0\), then we have
$$\begin{aligned} \pi ^{(t+1)}\big (a_{1}\,|\,s\big )-\pi ^{(t)}\big (a_{1}\,|\,s\big ) \le \frac{\eta }{2}\pi ^{(t)}\big (a_{1}\,|\,s\big )\frac{\partial V^{(t)}(\mu )}{\partial \theta (s,a_{1})}. \end{aligned}$$(128) 
If \(\frac{\partial V^{(t)}(\mu )}{\partial \theta (s,a_{1})}\ge 0\) and if \(\pi ^{(t+1)}\big (a_{0}\,|\,s\big )\ge 1/2\), then one has
$$\begin{aligned} \pi ^{(t+1)}\big (a_{1}\,|\,s\big )-\pi ^{(t)}\big (a_{1}\,|\,s\big ) \ge \eta \pi ^{(t)}\big (a_{1}\,|\,s\big ) \frac{\partial V^{(t)}(\mu )}{\partial \theta (s,a_{1})}. \end{aligned}$$(129)
Proof of Lemma 14
We make note of the following elementary identity
which allows us to write

If \(\frac{\partial V^{(t)}(\mu )}{\partial \theta (s,a_{1})} \le 0\), then one can deduce that
$$\begin{aligned} {}(130)&\ge 2\eta \pi ^{(t+1)}\big (a_{0}\,|\,s\big )\pi ^{(t)}\big (a_{1}\,|\,s\big ) \frac{\partial V^{(t)}(\mu )}{\partial \theta (s,a_{1})} \ge 2\eta \pi ^{(t)}\big (a_{1}\,|\,s\big ) \frac{\partial V^{(t)}(\mu )}{\partial \theta (s,a_{1})}, \end{aligned}$$where the first inequality relies on the elementary fact \(e^{x}-1\ge x\) for all \(x\in {\mathbb {R}}\), and the second one holds since \(\pi ^{(t+1)}\big (a_{0}\,|\,s\big ) \le 1\) and \(\frac{\partial V^{(t)}(\mu )}{\partial \theta (s,a_{1})} \le 0\).

If \(-1\le 2\eta \frac{\partial V^{(t)}(\mu )}{\partial \theta (s,a_{1})}\le 0\) and \(\pi ^{(t+1)}\big (a_{0}\,|\,s\big ) \ge 1/2\), then one has
$$\begin{aligned} {}(130) \le \eta \pi ^{(t+1)}\big (a_{0}\,|\,s\big )\pi ^{(t)}\big (a_{1}\,|\,s\big )\frac{\partial V^{(t)}(\mu )}{\partial \theta (s,a_{1})} \le \frac{\eta }{2}\pi ^{(t)}\big (a_{1}\,|\,s\big )\frac{\partial V^{(t)}(\mu )}{\partial \theta (s,a_{1})}, \end{aligned}$$where the first inequality comes from the elementary inequality \(e^{x}-1\le 0.5 x\) for any \(-1\le x \le 0\), and the last inequality is valid since \(\pi ^{(t+1)}\big (a_{0}\,|\,s\big ) \ge 1/2\) and \(\frac{\partial V^{(t)}(\mu )}{\partial \theta (s,a_{1})}\le 0\).

If \(\frac{\partial V^{(t)}(\mu )}{\partial \theta (s,a_{1})} \ge 0\) and if \(\pi ^{(t+1)}\big (a_{0}\,|\,s\big ) \ge 1/2\), then it follows that
$$\begin{aligned} {}(130)&\ge 2\eta \pi ^{(t+1)}\big (a_{0}\,|\,s\big )\pi ^{(t)}\big (a_{1}\,|\,s\big )\frac{\partial V^{(t)}(\mu )}{\partial \theta (s,a_{1})}\ge \eta \pi ^{(t)}\big (a_{1}\,|\,s\big )\frac{\partial V^{(t)}(\mu )}{\partial \theta (s,a_{1})}, \end{aligned}$$as claimed in (129). \(\square \)
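The three bounds of Lemma 14 rest on two scalar inequalities, \(e^x-1\ge x\) (all real x) and \(e^x-1\le x/2\) (for \(-1\le x\le 0\)); a grid-based sanity check:

```python
import math

# Lemma 14 rests on two scalar facts:
#   (i)  e^x - 1 >= x       for every real x,
#   (ii) e^x - 1 <= x / 2   for -1 <= x <= 0.
for i in range(-500, 501):
    x = 0.01 * i                                   # grid on [-5, 5]
    assert math.exp(x) - 1.0 >= x - 1e-12          # fact (i)
for i in range(-100, 1):
    x = 0.01 * i                                   # grid on [-1, 0]
    assert math.exp(x) - 1.0 <= 0.5 * x + 1e-12    # fact (ii)
```
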
Analysis for the initial stage (Lemma 5)
This section establishes Lemma 5, which investigates the dynamics of \(\theta ^{(t)}(s, a)\) prior to the threshold \(t_{s-2}({\tau }_{s-2})\). Before proceeding, let us introduce a rescaled version of \(\pi ^{(t)}(a \,|\,s)\) that is sometimes convenient to work with:
for any state-action pair (s, a). This is orderwise equivalent to \(\pi ^{(t)}(a \,|\,s)\) since
1.1 Two key properties
Our proof is based on the following claim: in order to establish the advertised results of Lemma 5, it suffices to justify the following two properties
hold for any \(t\le t_{s-2}({\tau }_{s-2})\). In light of this claim, our subsequent analysis consists of validating these two inequalities separately, which forms the main content of Sect. D.2.
We now move on to justify the above claim, namely, Lemma 5 is valid as long as the two key properties (133) and (134) hold true. First, recall that Lemma 12 together with (33) and Lemma 4 tells us that
for any \(3\le s\le H\). Next, note that the gradient takes the following form (cf. (12))
which together with the assumption \({Q}^{(t)}(s, a_2) - {V}^{(t)}(s) \ge 0\) (cf. (134)) implies that
Consequently, \(\theta ^{(t)}(s, a_2)\) keeps increasing before t exceeds \(t_{s-2}({\tau }_{s-2})\). This combined with the relation (135), the initialization \(\theta ^{(0)}(s, a_0)=\theta ^{(0)}(s, a_2)=0\) and the constraint \(\sum _a \theta ^{(t)}(s, a)=0\) (see Part (vii) of Lemma 8) reveals that
thereby confirming the desired property (41).
Further, given the nonnegativity of \(\theta ^{(t)}(s, a_2)\) stated in (137), one can readily derive
where the last line also makes use of the identity \(\theta ^{(t)}(s,a_{0})=-\theta ^{(t)}(s,a_{1})-\theta ^{(t)}(s,a_{2})\) (see Part (vii) of Lemma 8). With this observation in mind, the assumed property (133) directly leads to the advertised result (40).
1.2 Proof of the properties (133) and (134)
This subsection presents the proofs of the two key properties, which are somewhat intertwined and require a bit of induction. Before proceeding, we make note of the initialization \({\widehat{\pi }}^{(0)}(a_{1}\,|\,s)=1\), which clearly satisfies the property (133) for this base case. Our proof consists of two steps to be detailed below. As can be easily seen, combining these two steps in an inductive manner immediately establishes both properties (133) and (134) for any \(t\le t_{s-2}({\tau }_{s-2})\).
1.2.1 Step 1: justifying (134) for the t-th iteration if (133) holds for the t-th iteration
We first turn to the proof of the inequality (134), assuming that (133) holds for the t-th iteration. According to (132) and (133), we have
By virtue of the auxiliary fact (146c) in Lemma 15 (see Sect. D.3), one has
Given that \(p {:=}c_{\textrm{p}}(1-\gamma )\) for some small constant \(0< c_{\textrm{p}}< \frac{1}{2016}\), the above two inequalities allow one to ensure that
With the above relation in mind, we are ready to control \(Q^{(t)}(s, a_2) - V^{(t)}(s)\) as follows
Here, the second line arises from the auxiliary facts in Lemma 15, while the last inequality is a consequence of (140). This completes the proof of the inequality (134).
1.2.2 Step 2: justifying (133) for the (t+1)-th iteration if (134) holds up to the t-th iteration
Suppose that the inequality (134) holds up to the t-th iteration. To validate (133) for the (t+1)-th iteration, we claim for the moment that
as long as \(t\le t_s({\tau }_s)\). Let us take this claim as given, and return to prove it shortly.
Recall from (135) that
and hence \(\theta _{\textsf{max}}^{(t)}(s) = \theta ^{(t)}(s,a_{0})\) is increasing with t during this stage, and as a result,
The gradient expression (136) combined with the satisfaction of (134) up to the t-th iteration implies that \(\theta ^{(t)}(s,a_{2})\) is increasing up to the t-th iteration. Given that \(\sum _a \theta ^{(t)}(s,a) = 0\) (see Part (vii) of Lemma 8), we can derive
thus indicating that
These combined with Lemma 16 in Sect. D.3 guarantee that
and as a consequence,
Taking this collectively with (141), we reach
Apply Lemma 11 together with the initialization \({\widehat{\pi }}^{(0)}(a_1 \,|\,s)=1\) to arrive at
Proof of the inequality (141). Recall the gradient expression (136):
each term of which will be bounded separately.
The first step is to control \(Q^{(t)}(s, a_1) - V^{(t)}(s)\), towards which we start with the following decomposition
The auxiliary facts stated in Lemma 15 (see Appendix D.3) imply that
while Lemma 1 and Lemma 10 tell us that
At the same time, the auxiliary fact (146a) in Lemma 15 (see Appendix D.3) taken together with the gradient expression (12b) guarantees that
and hence \(\theta ^{(t)}(s, a_1)\le \theta ^{(t)}(s, a_2) \le \theta ^{(t)}(s, a_0)\) (or equivalently \(\pi ^{(t)}(a_1\,|\,s)\le \pi ^{(t)}(a_2\,|\,s) \le \pi ^{(t)}(a_0\,|\,s)\)) during this stage. As a result,
Substituting the preceding bounds into the decomposition (145), we arrive at
provided that \(\gamma \ge 0.85\). Meanwhile, it follows from Lemma 1 and Lemma 10 that
Further, from Lemma 3, we have learned that \(c_{\textrm{m}}\gamma (1-\gamma )^2 \le d^{(t)}_{\mu }(s) \le 14c_{\textrm{m}}(1-\gamma )^2\) for any \(t\le t_s(\tau _s)\). Substituting the above bounds into (144) and invoking (132), we establish the desired inequality (141).
1.3 Auxiliary facts
We now gather a few basic facts that are useful throughout this section. The first lemma presents some preliminary facts regarding the difference of Qfunction estimates across different actions in the current setting; the proof is deferred to Appendix D.3.1.
Lemma 15
Consider any \(t < t_{s-2}({\tau }_{s-2})\). Under the assumption (35), the following are satisfied
Remark 10
Lemma 15 makes clear that—before t exceeds \(t_{s-2}({\tau }_{s-2})\)—action \(a_0\) is perceived as the best choice, with \(a_1\) being the least favorable one. In the meantime, it also reveals that (i) \(Q^{(t)}(s, a_2)\) is considerably larger than \(Q^{(t)}(s, a_1)\), while (ii) the gap between \(Q^{(t)}(s, a_0)\) and \(Q^{(t)}(s, a_2)\) decays at least as rapidly as O(1/t) in this stage.
The second lemma is concerned with the consecutive difference between two rescaled policy iterates. The proof can be found in Appendix D.3.2.
Lemma 16
Suppose that \(0<\eta \le (1-\gamma ) / 6\). For any \(t \ge 0\) and any \(3 \le s \le H\), define \(\theta _{\textsf{max}}^{(t)}(s) {:=}\max _a \theta ^{(t)}(s, a)\). If we write
for some \(c\in {\mathbb {R}}\), then we necessarily have
1.3.1 Proof of Lemma 15
In view of Lemma 10, one has \(V^{(t)}(\overline{s-2}) \ge 0\) for all \(t\ge 0\). Therefore, the relation (59) yields
In addition, for any \(t < t_{s-2}({\tau }_{s-2}) \le t_{s-1}({\tau }_{s-1}) \le t_{\overline{s-1}}(\gamma \tau _{s-1})\) (see Lemma 8 and Lemma 4), we have \(V^{(t)}(\overline{s-1})< \gamma \tau _{s-1}\), and hence it is seen from the relation (59) that
as claimed in (146b). Also, Part (i) of Lemma 8 tells us that
where the last inequality holds for any \(t < t_{s-2}({\tau }_{s-2})\) (see Part (iii) of Lemma 8). These taken together validate (146a).
It remains to justify (146c), which is the content of the rest of this proof. The main step lies in demonstrating that, for any \(t < t_{s}({\tau }_{s})\) and any \(1\le s\le H\),
If this were true, then taking it together with the following property (which is a consequence of (59))
would establish the inequality (146c). It then boils down to justifying (148). Towards this, we first make the observation that
where the second line holds since \(Q^{(t)}({\overline{s}},a_{0}) =\gamma \tau _{s}\) (see (61)). Additionally, recall from the definition that for any \(t < t_{s}({\tau }_{s}) \), one has \(V^{(t)}(s)< {\tau }_{s}\) and hence
where the last line makes use of the identities in (61). This means that \(\theta ^{(t)}({\overline{s}}, a_1)\) keeps decreasing, and hence \(\theta ^{(t)}({\overline{s}}, a_1)\le 0\) given the initialization \(\theta ^{(0)}({\overline{s}}, a_1) =0\). As an immediate consequence, one has \(\theta ^{(t)}( {\overline{s}} , a_0 ) = -\theta ^{(t)}( {\overline{s}} , a_1 ) \ge 0\) and \(\pi ^{(t)}(a_0 \,|\,{\overline{s}}) \ge 1/2\). Taking this observation together with (151) and Lemma 2 gives
Moreover, combine (151) with Lemma 3 and Lemma 1 to yield
If \(28c_{\textrm{m}}\eta (1-\gamma )<1/2\), then the above inequalities taken together with Lemma 14 give
for all \(t < t_{s}({\tau }_{s})\). This combined with (150) and the monotonicity of \(Q^{(t)}({\overline{s}}, a_1)\) (see Lemma 9) gives
where the penultimate line follows from the inequality (152) for the iteration \(t-1\), and the last identity makes use of (150). In conclusion, we have arrived at the following inductive relation
which bears resemblance to the recursive relations studied in Lemma 11. Recognizing that \(\gamma \tau _{s} - V^{(0)}({\overline{s}}) \le \gamma \tau _{s-2}\) (since \(V^{(0)}({\overline{s}})\ge 0\) according to Lemma 10), we can invoke Lemma 11 to derive
Putting the above pieces together concludes the proof of (146c).
1.3.2 Proof of Lemma 16
From the definition (131), direct calculations lead to
According to Lemma 1, we have \({Q}^{(t)}(s, a) \le 1\) and \({V}^{(t)}(s) \le 1\), which indicates—for any action \(a\in \{a_0,a_1,a_2\}\)—that
provided that \(\eta \le (1\gamma ) / 6\). An immediate consequence is that \(\theta _{\textsf{max}}^{(t+1)}(s)\theta _{\textsf{max}}^{(t)}(s) \le 1/3\) and hence
This taken together with the following elementary facts
establishes the claim (147).
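The per-step bound underlying Lemma 16 can also be checked numerically: with \(\eta \le (1-\gamma )/6\) and the exact-PG derivative bounded in magnitude by \(\frac{2}{1-\gamma }\), every \(\theta \)-coordinate moves by at most 1/3 per iteration. A sketch on random single-state instances (the factor \(d(s)/(1-\gamma )\), Q-values in [0, 1], and the random policy are illustrative assumptions, not the paper's MDP):

```python
import random

# With eta <= (1 - gamma)/6 and
#   |dV/dtheta(s,a)| <= d(s)/(1 - gamma) * pi(a|s) * |Q(s,a) - V(s)|
#                    <= 2/(1 - gamma),
# every theta-coordinate moves by at most 1/3 per exact-PG step.
random.seed(2)
gamma = 0.9
eta = (1.0 - gamma) / 6.0

max_step = 0.0
for _ in range(1000):
    Q = [random.random() for _ in range(3)]        # Q-values in [0, 1]
    w = [random.random() + 1e-9 for _ in range(3)]
    pi = [v / sum(w) for v in w]                   # a random policy
    V = sum(p * q for p, q in zip(pi, Q))          # V in [0, 1]
    d = random.random()                            # state weight d(s) <= 1
    for p, q in zip(pi, Q):
        step = eta * d / (1.0 - gamma) * p * (q - V)
        max_step = max(max_step, abs(step))

assert max_step <= 1.0 / 3.0
```
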
Analysis for the intermediate stage (Lemma 6)
We now turn attention to Lemma 6, which studies the dynamics during an intermediate stage between \(t_{s-2}({\tau }_{s-2})\) and \(t_{\overline{s-1}}(\tau _{s})\).
1.1 Main steps
Key facts regarding crossing times. Our proof for Lemma 6 relies on several crucial properties regarding the crossing times for both the key primary states and the adjoint states, as stated in the following two lemmas.
Lemma 17
Suppose that (35) holds. There exists some constant \(0<c_{0}\le \frac{1222}{c_{\textrm{m}}\gamma }\) such that:
holds for every \(3\le s \le H\), and
holds for every \(1\le s \le H\).
Lemma 18
Suppose that (35) holds and
Then for every \(3\le s \le H\), we have
In addition, if we further have \(t_{s-1}({\tau }_{s-1}) > t_{\overline{s-2}}(\tau _{s-1}) + \frac{2sc_0}{\eta (1-\gamma )^2}\), then
Furthermore, (156c) still holds for \(s = 3\) without requiring the assumption (155).
The proofs of the above two lemmas are postponed to Appendix E.2 and Appendix E.3, respectively. Let us take a moment to explain these two lemmas; to provide some intuition, let us treat \(\gamma ^{2s}\approx 1\). Lemma 17 makes clear that: once the value function estimates for states \(\overline{s-1}\) and s are both sufficiently large (i.e., \(V^{(t)}(\overline{s-1})\gtrapprox 0.75\) and \(V^{(t)}(s)\gtrapprox 0.5\)), then it does not take long for \(V^{(t)}(s)\) to (approximately) exceed 0.75. A similar message holds true if we replace s (resp. \(\overline{s-1}\)) with \({\overline{s}}\) (resp. s). Built upon this observation, Lemma 18 further reveals that the time taken for \(V^{(t)}(s)\) (resp. \(V^{(t)}(\overline{s-1})\)) to rise from 0.5 to 0.75 is fairly short.
Proof of Lemma 6
We are now in a position to present the proof of Lemma 6. To begin with, recall from Lemma 8 that: for any \(t \le t_{\overline{s-1}}(\tau _{s}) \le t_{\overline{s-1}}(\tau _{s-1})\), one has
Given that \(V^{(t)}(s)\) is a convex combination of \(\{Q^{(t)}(s, a)\}_{a\in \{a_0,a_1,a_2\}}\), one has \(V^{(t)}(s)-Q^{(t)}(s, a_1)\ge 0\), which together with the gradient expression (136) indicates that
and hence \(\theta ^{(t)}(s, a_1)\) is nonincreasing with t for any \(t< t_{\overline{s-1}}(\tau _{s})\). Additionally, we have learned from Lemma 18 that
where the second inequality holds since \(\gamma ^{2s-3}-1/4 \ge \gamma {\tau }_{s-2}\), and the last identity results from Part (iii) of Lemma 8. This combined with the nonincreasing nature of \(\theta ^{(t)}(s, a_1)\) readily establishes the advertised inequality \(\theta ^{(t_{\overline{s-1}}(\tau _{s}))}(s, a_1) \le \theta ^{(t_{s-2}({\tau }_{s-2}))}(s, a_1)\).
The next step is to justify \(\theta ^{(t_{\overline{s-1}}(\tau _{s}))}(s, a_2) \ge 0\). Notice that for \(t > t_{s-2}(\tau _{s-2})\), we have \(V^{(t)}(s-2) > \tau _{s-2}\), and then \(V^{(t)}(\overline{s-2}) > \gamma \tau _{s-2}\) by (62), which leads to \(Q^{(t)}(s, a_2) > Q^{(t)}(s, a_0)\) by (59) in Lemma 8. Recall from (157) that \(Q^{(t)}(s, a_1) \le \gamma \tau _{s} \le \min \big \{Q^{(t)}(s, a_0), Q^{(t)}(s, a_2) \big \}\). Then, one has \(Q^{(t)}(s, a_2)-V^{(t)}(s)\ge 0\), which together with the gradient expression (136) indicates that
and hence \(\theta ^{(t)}(s, a_2)\) is nondecreasing with t for any \(t< t_{\overline{s-1}}(\tau _{s})\). This establishes \(\theta ^{(t_{\overline{s-1}}(\tau _{s}))}(s, a_2) \ge 0\). \(\square \)
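The signing argument above uses only that \(V^{(t)}(s)\) is a convex combination of the Q-values, hence sandwiched between their minimum and maximum; a quick numerical check on random instances (Q-values and policy weights are illustrative):

```python
import random

# V(s) = sum_a pi(a|s) Q(s,a) is a convex combination of the Q-values, so
# min_a Q(s,a) <= V(s) <= max_a Q(s,a); in particular V(s) - Q(s,a_1) >= 0
# whenever a_1 attains the smallest Q-value, which signs the gradient of
# theta(s, a_1).
random.seed(3)
for _ in range(1000):
    Q = [random.random() for _ in range(3)]
    w = [random.random() + 1e-9 for _ in range(3)]
    pi = [v / sum(w) for v in w]                   # a random policy
    V = sum(p * q for p, q in zip(pi, Q))
    assert min(Q) - 1e-12 <= V <= max(Q) + 1e-12
    assert V - min(Q) >= -1e-12
```
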
1.2 Proof of Lemma 17
For every \(t \ge \max \big \{t_{\overline{s-1}}(\gamma ^{2s-1}-1/4), t_{s}({\tau }_s) \big \}\), we isolate the following properties that will prove useful.

The definition (30) of \(t_{\overline{s-1}}(\cdot )\) together with the monotonicity property in Lemma 9 requires that \(V^{(t)}(\overline{s-1}) \ge \gamma ^{2s-1}-1/4\), and hence it is seen from (59) that
$$\begin{aligned} Q^{(t)}(s, a_1) = \gamma V^{(t)}(\overline{s-1}) \ge \gamma ^{2s}-\gamma /4. \end{aligned}$$(160) 
In the meantime, since \(t \ge t_{s}({\tau }_s)\), Lemma 8 (cf. (60)) guarantees that
$$\begin{aligned} \pi ^{(t)}(a_{1} \,|\,s) \ge (1-\gamma )/2. \end{aligned}$$(161) 
Given that \(t_{s}({\tau }_s) \ge t_{s-2}(\tau _{s-2})\) (see (33)) and the monotonicity property in Lemma 9, one has \(V^{(t)}(\overline{s-2})\ge \tau _{s-2}\), and thus we can see from (59) that
$$\begin{aligned} Q^{(t)}(s,a_{2}) - Q^{(t)}(s,a_{0}) = \gamma p \big ( V^{(t)}(\overline{s-2}) - \tau _{s-2} \big ) \ge 0. \end{aligned}$$(162) 
In addition, Lemma 8 ensures that both \(Q^{(t)}(s,a_{2}) \) and \(Q^{(t)}(s,a_{0})\) are bounded above by \(\gamma ^{1/2}\tau _{s}\). Therefore, it is easily seen that
$$\begin{aligned} Q^{(t)}(s,a_{1}) \ge \gamma ^{2s} - \gamma / 4 > \gamma ^{1/2}\tau _{s} \ge Q^{(t)}(s,a_{2}) \ge Q^{(t)}(s,a_{0}), \end{aligned}$$(163)where the first inequality comes from (160), the second one holds when \(\gamma ^{2s}>0.75\), and the last inequality has been justified in (162).

Moreover, given that \(V^{(t)}(s)\ge {\tau }_s\) (since \(t \ge t_{s}({\tau }_s)\)), one further has
$$\begin{aligned}&Q^{(t)}(s,a_{1})-\max \big \{Q^{(t)}(s,a_{2}), Q^{(t)}(s,a_{0}) \big \}\nonumber \\&\quad> V^{(t)}(s) -\max \big \{Q^{(t)}(s,a_{2}), Q^{(t)}(s,a_{0}) \big \} \nonumber \\&\quad> {\tau }_s - \gamma ^{1/2}\tau _{s} > 0. \end{aligned}$$(164)Here, the first inequality comes from (163), while the penultimate inequality is a consequence of (163).

We have seen from the above bullet points that
$$\begin{aligned} Q^{(t)}(s,a_{1})> V^{(t)}(s) > \max \big \{Q^{(t)}(s,a_{2}), Q^{(t)}(s,a_{0}) \big \} , \end{aligned}$$(165) which combined with the gradient expression (136) reveals that
$$\begin{aligned} \frac{\partial V^{(t)}(\mu )}{\partial \theta (s,a_{1})}>0 > \max \left\{ \frac{\partial V^{(t)}(\mu )}{\partial \theta (s,a_{0})}, \frac{\partial V^{(t)}(\mu )}{\partial \theta (s,a_{2})} \right\} . \end{aligned}$$(166)
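To make the sign pattern above concrete, here is a minimal numerical sketch (in Python, with hypothetical Q-values and state-visitation weight rather than the quantities of the constructed MDP) of the softmax PG derivative \(\frac{1}{1-\gamma }d_{\mu }(s)\,\pi (a\,|\,s)\big (Q(s,a)-V(s)\big )\) for a single three-action state:

```python
import math

def softmax(logits):
    """Convert logits theta(s, .) into a softmax policy pi(. | s)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pg_derivatives(theta, q, d_s, gamma):
    """Per-action softmax PG derivatives for one state:
    dV/dtheta(s, a) = 1/(1-gamma) * d(s) * pi(a|s) * (Q(s,a) - V(s))."""
    pi = softmax(theta)
    v = sum(p * qa for p, qa in zip(pi, q))  # V(s): convex combination of Q-values
    return [d_s / (1.0 - gamma) * p * (qa - v) for p, qa in zip(pi, q)]

# Hypothetical numbers; index order (a_1, a_2, a_0), with a_1 holding the largest Q-value.
grads = pg_derivatives(theta=[0.2, -0.1, -0.1], q=[0.9, 0.5, 0.4], d_s=0.3, gamma=0.9)
# The a_1 component is positive while the a_2 and a_0 components are negative.
```

Since \(V(s)\) is a convex combination of the Q-values, the per-action derivatives always sum to zero: only an action whose Q-value exceeds \(V(s)\) is pushed up.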
With the above properties in place, we are now ready to prove our lemma, for which we shall look at the key primary states \(3\le s \le H\) and the adjoint states separately.
Analysis for the key primary states. Let us start with any state \(3\le s \le H\), and control \(t_s(\gamma ^{2s}-1/4)\) as claimed in (153). As before, define
$$\begin{aligned} \theta _{\textsf{max}}^{(t)}(s) {:=}\max _{a}\theta ^{(t)}(s,a). \end{aligned}$$
From the above fact (166), we know that \(\theta ^{(t)}(s, a_1)\) keeps increasing with t while \(\theta ^{(t)}(s, a_0)\) and \(\theta ^{(t)}(s, a_2)\) are both decreasing with t. As a result, once \(\theta ^{(t)}(s, a_1) = \theta _{\textsf{max}}^{(t)}(s)\), then \(\theta ^{(t)}(s, a_1)\) will remain equal to \(\theta _{\textsf{max}}^{(t)}(s)\) for all subsequent iterations. This allows us to divide the analysis into two stages as follows.

Stage 1: the duration when \(\theta ^{(t)}(s, a_1) < \theta _{\textsf{max}}^{(t)}(s)\). Our aim is to show that this stage contains at most \(O\big (\frac{1}{ \eta (1-\gamma )^2} \big )\) iterations. In order to prove this, the starting point is again the gradient expression (136):
$$\begin{aligned} \frac{\partial V^{(t)}(\mu )}{\partial \theta (s, a_1)}&= \frac{1}{1-\gamma } d^{(t)}_{\mu }(s) \pi ^{(t)}(a_1 \,|\,s)\big (Q^{(t)}(s, a_1) - V^{(t)}(s) \big ) \nonumber \\&\ge c_{\textrm{m}}\gamma (1-\gamma )\pi ^{(t)}(a_1 \,|\,s)\big ( Q^{(t)}(s, a_1) - V^{(t)}(s) \big ) , \end{aligned}$$(167) where the last line relies on Lemma 2 and the fact \(Q^{(t)}(s, a_1) > V^{(t)}(s)\) (cf. (165)). Regarding the size of \(Q^{(t)}(s, a_1) - V^{(t)}(s)\), we make the observation that
$$\begin{aligned} Q^{(t)}(s, a_1) - V^{(t)}(s)&= \pi ^{(t)}(a_0 \,|\,s) \Big (Q^{(t)}(s, a_1) - Q^{(t)}(s, a_0) \Big )\\&\quad + \pi ^{(t)}(a_2 \,|\,s) \Big (Q^{(t)}(s, a_1) - Q^{(t)}(s, a_2) \Big )\\&\ge \big (\pi ^{(t)}(a_0 \,|\,s) + \pi ^{(t)}(a_2 \,|\,s) \big ) \Big (Q^{(t)}(s, a_1) - \max _{a\in \{a_0,a_2\}} Q^{(t)}(s, a)\Big )\\&{\mathop {\ge }\limits ^{(\textrm{i})}} \frac{1}{2} \Big (Q^{(t)}(s, a_1) - \max _{a\in \{a_0,a_2\}} Q^{(t)}(s, a)\Big )\\&{\mathop {\ge }\limits ^{(\textrm{ii})}} \frac{1}{2} \big (\gamma ^{2s} - \gamma /4 - \gamma ^{1/2}\tau _s \big ) {\mathop {\ge }\limits ^{(\textrm{iii})}} \frac{1}{16}. \end{aligned}$$Here, (i) follows since \(\theta ^{(t)}(s, a_1) < \theta _{\textsf{max}}^{(t)}(s)\) during this stage and, therefore, \(\pi ^{(t)}(a_1 \,|\,s) \le 1/2\); (ii) arises from the relation (163); and (iii) holds whenever \(\gamma ^{2s} - \gamma / 4 > 5/8\). Substitution into (167) yields
$$\begin{aligned} \frac{\partial V^{(t)}(\mu )}{\partial \theta (s, a_1)} \ge \frac{1}{16} c_{\textrm{m}}\gamma (1-\gamma )\pi ^{(t)}(a_1 \,|\,s) \ge \frac{1}{48} c_{\textrm{m}}\gamma (1-\gamma ) {\widehat{\pi }}^{(t)}(a_1 \,|\,s), \end{aligned}$$(168) where the last inequality comes from (132). In addition, recall that \(\theta ^{(t)}(s, a_1)\) is increasing with t, while \(\theta ^{(t)}(s, a_0)\) and \(\theta ^{(t)}(s, a_2)\) are both decreasing (and hence \(\theta _{\textsf{max}}^{(t)}(s)\) is also decreasing). Invoking Lemma 16 then yields
$$\begin{aligned}&{\widehat{\pi }}^{(t+1)}(a_{1}\,|\,s)-{\widehat{\pi }}^{(t)}(a_{1}\,|\,s) \\&\quad \ge {\widehat{\pi }}^{(t)}(a_{1}\,|\,s)\Big ({{\theta }}^{(t+1)}(s,a_{1})-\theta ^{(t)}(s,a_{1})+{{\theta }}_{\textsf{max}}^{(t)}(s)-{{\theta }}_{\textsf{max}}^{(t+1)}(s)\Big )\\&\quad \ge {\widehat{\pi }}^{(t)}(a_{1}\,|\,s)\Big ({{\theta }}^{(t+1)}(s,a_{1})-\theta ^{(t)}(s,a_{1})\Big )\\&\quad ={\widehat{\pi }}^{(t)}(a_{1}\,|\,s)\cdot \eta \frac{\partial {V}^{(t)}(\mu )}{\partial \theta (s,a_{1})} \ge \frac{1}{48} c_{\textrm{m}}\eta \gamma (1-\gamma ) \Big [ {\widehat{\pi }}^{(t)}(a_1 \,|\,s) \Big ]^2, \end{aligned}$$where the last line arises from (168). Given this recursive relation, Lemma 11 implies that: if \({\widehat{\pi }}^{(t)}(a_{1}\,|\,s) < 1\) (or equivalently, \(\theta ^{(t)}(s, a_1) < \theta _{\textsf{max}}^{(t)}(s)\)), then one necessarily has
$$\begin{aligned} t - t_{0,1}&\le \frac{1+ \frac{1}{48} c_{\textrm{m}}\eta \gamma (1-\gamma ) }{ \frac{1}{48} c_{\textrm{m}}\eta \gamma (1-\gamma ) \pi ^{(t_{0,1})}(a_{1} \,|\,s) } \le \frac{2 }{ \frac{1}{48} c_{\textrm{m}}\eta \gamma (1-\gamma ) \pi ^{(t_{0,1})}(a_{1} \,|\,s) } \\&\le \frac{240}{c_{\textrm{m}}\eta \gamma (1-\gamma )^2}, \end{aligned}$$with \(t_{0,1} {:=}\max \big \{t_{\overline{s-1}}(\gamma ^{2s-1}-1/4), t_{s}({\tau }_s) \big \}\). Here, the last inequality relies on the property (161).
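The counting argument above rests on a recursion of the form \(x_{t+1}\ge x_{t}+c\,x_{t}^{2}\), with c playing the role of the stepsize-dependent constant \(\frac{1}{48}c_{\textrm{m}}\eta \gamma (1-\gamma )\). As a sanity check (a sketch of the worst case of this recursion, not of Lemma 11 itself, whose statement is not reproduced in this excerpt), one can verify numerically that such a sequence reaches 1 within roughly \(2(1+c)/(c\,x_{0})\) iterations:

```python
def iterations_until_one(x0, c):
    """Iterate the worst case x_{t+1} = x_t + c * x_t**2 of the growth
    recursion and count the steps before the sequence first reaches 1."""
    x, t = x0, 0
    while x < 1.0:
        x += c * x * x
        t += 1
    return t

# Hypothetical values for c and x0 (standing in for the stepsize-dependent
# constant and the initial rescaled policy weight, respectively).
c, x0 = 0.01, 0.05
assert iterations_until_one(x0, c) <= 2 * (1 + c) / (c * x0)
```

The count scales like \(1/(c\,x_{0})\), which is exactly why the duration of this stage is inversely proportional to \(\eta (1-\gamma )^{2}\) once the initial weight \((1-\gamma )/2\) from (161) is plugged in.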

Stage 2: the duration when \(\theta ^{(t)}(s, a_1) = \theta _{\textsf{max}}^{(t)}(s)\). For this stage, we intend to demonstrate that it takes at most \(O\big ( \frac{1}{\eta (1-\gamma )^2} \big )\) iterations to achieve \(\max \big \{ \pi ^{(t)}(a_0 \,|\,s),\pi ^{(t)}(a_2 \,|\,s) \big \} \le (1-\gamma )/8\). To this end, we again begin by studying the gradient as follows:
$$\begin{aligned} \frac{\partial V^{(t)}(\mu )}{\partial \theta (s, a_2)}&=\frac{1}{1-\gamma } d^{(t)}_{\mu }(s) \pi ^{(t)}(a_2 \,|\,s) \Big ( Q^{(t)}(s, a_2) - V^{(t)}(s) \Big )\\&\le c_{\textrm{m}}\gamma (1-\gamma )\pi ^{(t)}(a_2 \,|\,s) \Big ( Q^{(t)}(s, a_2) - V^{(t)}(s) \Big ) \\&\le \frac{1}{3}c_{\textrm{m}}\gamma (1-\gamma ){\widehat{\pi }}^{(t)}(a_2 \,|\,s) \Big (Q^{(t)}(s, a_2) - V^{(t)}(s) \Big ). \end{aligned}$$Here, the first inequality comes from Lemma 2 and the fact \(Q^{(t)}(s, a_2) < V^{(t)}(s)\) (see (165)), whereas the last inequality is a consequence of (132). In order to control \(Q^{(t)}(s, a_2) - V^{(t)}(s)\), we observe that
$$\begin{aligned} Q^{(t)}(s,a_{2})-V^{(t)}(s)&=\pi ^{(t)}(a_{1}\,|\,s)\Big (Q^{(t)}(s,a_{2})-Q^{(t)}(s,a_{1})\Big )\\&\quad +\pi ^{(t)}(a_{0}\,|\,s)\Big (Q^{(t)}(s,a_{2})-Q^{(t)}(s,a_{0})\Big )\\&\le \pi ^{(t)}(a_{1}\,|\,s)\Big (\gamma ^{1/2}\tau _{s}-\gamma ^{2s}+\gamma /4\Big )\\&\quad +\pi ^{(t)}(a_{0}\,|\,s)\gamma p\left( V^{(t)}(\overline{s-2})-\tau _{s-2}\right) \\&\le \frac{1}{3}\Big (\gamma ^{1/2}\tau _{s}-\gamma ^{2s}+\gamma /4\Big )+\gamma p\le -\frac{1}{24}, \end{aligned}$$where the second line arises from (163) and (162), and the last line holds since \(V^{(t)}(\overline{s-2})\le 1\) as well as the fact \(\pi ^{(t)}(a_{1}\,|\,s)\ge 1/3\) during this stage (since \(\theta ^{(t)}(s, a_1) = \theta _{\textsf{max}}^{(t)}(s)\)). Putting the above two bounds together leads to
$$\begin{aligned} \frac{\partial V^{(t)}(\mu )}{\partial \theta (s, a_2)} \le - \frac{1}{72}c_{\textrm{m}}\gamma (1-\gamma ){\widehat{\pi }}^{(t)}(a_2 \,|\,s) . \end{aligned}$$(169) Next, Lemma 16 tells us that
$$\begin{aligned} {\widehat{\pi }}^{(t+1)}(a_{2}\,|\,s)-{\widehat{\pi }}^{(t)}(a_{2}\,|\,s)&\le 0.72{\widehat{\pi }}^{(t)}(a_{2}\,|\,s)\Big ({{\theta }}^{(t+1)}(s,a_{2})-\theta ^{(t)}(s,a_{2})\\&\quad +{{\theta }}_{\textsf{max}}^{(t)}(s)-{{\theta }}_{\textsf{max}}^{(t+1)}(s)\Big )\\&\le 0.72{\widehat{\pi }}^{(t)}(a_{2}\,|\,s)\Big ({{\theta }}^{(t+1)}(s,a_{2})-\theta ^{(t)}(s,a_{2})\Big )\\&=0.72{\widehat{\pi }}^{(t)}(a_{2}\,|\,s)\cdot \eta \frac{\partial {V}^{(t)}(\mu )}{\partial \theta (s,a_{2})}\\&\le - 0.01\eta c_{\textrm{m}}\gamma (1-\gamma )\Big [{\widehat{\pi }}^{(t)}(a_{2}\,|\,s)\Big ]^{2}, \end{aligned}$$where the first inequality makes use of the facts \(\theta ^{(t+1)}(s,a_{2})\le \theta ^{(t)}(s,a_{2})\) and \(\theta _{\textsf{max}}^{(t)}(s)= \theta ^{(t)}(s,a_{1}) \le \theta ^{(t+1)}(s,a_{1}) = \theta _{\textsf{max}}^{(t+1)}(s)\) (see (166)). Denoting by \(t_{0,2}\) the first iteration in this stage, we can invoke Lemma 11 to reach
$$\begin{aligned} {\widehat{\pi }}^{(t)}(a_{2}\,|\,s)\le \frac{1}{0.01\eta c_{\textrm{m}}\gamma (1-\gamma )(t-t_{0,2})+1} . \end{aligned}$$(170) As a consequence, once \(t-t_{0,2}\) exceeds
$$\begin{aligned} \frac{800}{\eta c_{\textrm{m}}\gamma (1-\gamma )^{2}}, \end{aligned}$$then one has \(\pi ^{(t)}(a_2 \,|\,s) \le (1-\gamma )/8\). The same conclusion holds for \(a_0\) as well.
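The decay just used is an instance of the elementary fact that any sequence obeying \(x_{t+1}\le x_{t}-c\,x_{t}^{2}\) with \(x_{0}\le 1\) stays below the hyperbolic envelope \(1/(ct+1)\). A small sketch under that assumption (the constant c is hypothetical, standing in for \(0.01\eta c_{\textrm{m}}\gamma (1-\gamma )\)):

```python
def decay_sequence(x0, c, T):
    """Worst case x_{t+1} = x_t - c * x_t**2 of the decaying recursion."""
    xs = [x0]
    for _ in range(T):
        x = xs[-1]
        xs.append(x - c * x * x)
    return xs

c, T = 0.01, 2000
xs = decay_sequence(1.0, c, T)
# Every iterate stays below the hyperbolic envelope 1/(c*t + 1).
assert all(x <= 1.0 / (c * t + 1) + 1e-12 for t, x in enumerate(xs))
```

In particular, driving \(x_{t}\) below a target level \(\varepsilon \) takes on the order of \(1/(c\,\varepsilon )\) iterations, which is where the \(\frac{1}{\eta (1-\gamma )^{2}}\)-type iteration counts in this stage come from.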
Combining the above analysis for the two stages, we see that: if
with \(t_{0,1} {:=}\max \big \{t_{\overline{s-1}}(\gamma ^{2s-1}-1/4), t_{s}({\tau }_s) \big \}\), then one has
which combined with (163) leads to
This means that one necessarily has \(t \ge t_{s}(\gamma ^{2s}-1/4)\). It then follows that
thus concluding the proof of (153).
Analysis for the adjoint states. We then move forward to the adjoint states \(\{{\overline{1}},\ldots ,{\overline{H}}\}\) and control \(t_{{\overline{s}}}(\gamma ^{2s+1}-1/4)\) as desired in (154). The proof consists of studying the dynamics for any t obeying
Once again, we divide into two stages and analyze each of them separately.

Stage 1: the duration where \(\theta ^{(t)}({\overline{s}}, a_1) < \theta ^{(t)}({\overline{s}}, a_0)\). We aim to demonstrate that it takes no more than \(O\big ( \frac{1}{ \eta (1-\gamma )^2} \big )\) iterations for \(\theta ^{(t)}({\overline{s}}, a_1)\) to surpass \(\theta ^{(t)}({\overline{s}}, a_0)\). In order to do so, note that
$$\begin{aligned} \frac{\partial V^{(t)}(\mu )}{\partial \theta ({\overline{s}},a_{1})}&=\frac{1}{1-\gamma }d_{\mu }^{(t)}({\overline{s}})\pi ^{(t)}(a_{1}\,|\,{\overline{s}})\pi ^{(t)}(a_{0}\,|\,{\overline{s}})\Big (Q^{(t)}({\overline{s}},a_{1})-Q^{(t)}({\overline{s}},a_{0})\Big ) \nonumber \\&\ge \frac{1}{16}c_{\textrm{m}}\gamma (1-\gamma )\pi ^{(t)}(a_{1}\,|\,{\overline{s}}) > 0. \end{aligned}$$(171) Here, the last line applies Lemma 2, the fact \(\pi ^{(t)}(a_{0}\,|\,{\overline{s}})\ge 1/2\) in this stage, and the bound
$$\begin{aligned} Q^{(t)}({\overline{s}}, a_1) - Q^{(t)}({\overline{s}}, a_0) = \gamma V^{(t)}(s) - \gamma \tau _s \ge \gamma (\gamma ^{2s}-1/4 - \tau _s) \ge 1/8, \end{aligned}$$(172) where the inequality comes from the assumption \(t\ge t_{s} \big (\gamma ^{2s}-1/4 \big )\) as well as the monotonicity property in Lemma 9. As a result, the PG update rule (12a) implies that \(\theta ^{(t)}({\overline{s}},a_{1})\) is increasing in t, and hence \(\theta ^{(t)}({\overline{s}},a_{0})\) is decreasing in t (since \(\sum _{a} \theta ^{(t)}({\overline{s}}, a) = 0\)); these taken collectively mean that
$$\begin{aligned} \theta ^{(t+1)}({\overline{s}},a_{1}) - \theta ^{(t)}({\overline{s}},a_{1}) + \theta ^{(t)}({\overline{s}},a_{0}) - \theta ^{(t+1)}({\overline{s}},a_{0}) \ge \theta ^{(t+1)}({\overline{s}},a_{1}) - \theta ^{(t)}({\overline{s}},a_{1}) \ge 0 . \end{aligned}$$Invoking Lemma 16 then reveals that
$$\begin{aligned} {\widehat{\pi }}^{(t+1)}(a_{1}\,|\,{\overline{s}})-{\widehat{\pi }}^{(t)}(a_{1}\,|\,{\overline{s}})&\ge {\widehat{\pi }}^{(t)}(a_{1}\,|\,{\overline{s}})\Big (\theta ^{(t+1)}({\overline{s}},a_{1})-\theta ^{(t)}({\overline{s}},a_{1})\\&\quad +\theta ^{(t)}({\overline{s}},a_{0})-\theta ^{(t+1)}({\overline{s}},a_{0})\Big )\\&\ge {\widehat{\pi }}^{(t)}(a_{1}\,|\,{\overline{s}})\Big (\theta ^{(t+1)}({\overline{s}},a_{1})-\theta ^{(t)}({\overline{s}},a_{1})\Big )\\&=\eta {\widehat{\pi }}^{(t)}(a_{1}\,|\,{\overline{s}})\frac{\partial {V}^{(t)}(\mu )}{\partial \theta ({\overline{s}},a_{1})}\\&\ge \frac{1}{48} \eta c_{\textrm{m}}\gamma (1-\gamma ) \Big [ {\widehat{\pi }}^{(t)}(a_{1}\,|\,{\overline{s}}) \Big ]^2, \end{aligned}$$where the last inequality relies on (171) and (132). Given this recursive relation, Lemma 11 tells us that one has \({\widehat{\pi }}^{(t)}(a_{1}\,|\,{\overline{s}}) \ge 1\) (which means \(a_1\) becomes the favored action by (131)) as soon as \(t-t_{0, 3}\) exceeds
$$\begin{aligned} \frac{2}{\frac{1}{48} \eta c_{\textrm{m}}\gamma (1-\gamma ) {\widehat{\pi }}^{(t_{0,3})}(a_{1}\,|\,{\overline{s}}) } \le \frac{96}{\eta c_{\textrm{m}}\gamma (1-\gamma ) {\pi }^{(t_{0,3})}(a_{1}\,|\,{\overline{s}}) } \le \frac{1152}{\eta c_{\textrm{m}}\gamma (1-\gamma )^2 }, \end{aligned}$$where \(t_{0, 3}{:=}\max \Big \{t_{s} \big (\gamma ^{2s}-1/4 \big ), \, t_{{\overline{s}}}(\tau _{s+1}) \Big \}\). Here, the last inequality is valid as long as
$$\begin{aligned} \pi ^{(t_{0,3})}(a_{1} \,|\,{\overline{s}}) \ge (1-\gamma )/12 \end{aligned}$$(173) holds. It thus remains to justify (173). Towards this end, observe that for any \(t\ge t_{{\overline{s}}}(\tau _{s+1})\),
$$\begin{aligned} \tau _{s+1}&\le V^{(t)}({\overline{s}})=\pi ^{(t)}(a_{0}\,|\,{\overline{s}})Q^{(t)}({\overline{s}},a_{0})+\pi ^{(t)}(a_{1}\,|\,{\overline{s}})Q^{(t)}({\overline{s}},a_{1})\\&=\pi ^{(t)}(a_{0}\,|\,{\overline{s}})\gamma \tau _{s}+\pi ^{(t)}(a_{1}\,|\,{\overline{s}})\gamma V^{(t)}(s)\\&=\gamma \tau _{s}+\pi ^{(t)}(a_{1}\,|\,{\overline{s}})\gamma \left( V^{(t)}(s)-\tau _{s}\right) \le \gamma \tau _{s}+\pi ^{(t)}(a_{1}\,|\,{\overline{s}})\gamma , \end{aligned}$$and, as a result,
$$\begin{aligned} \pi ^{(t)}(a_{1}\,|\,{\overline{s}})&\ge \frac{\tau _{s+1}}{\gamma }-\tau _{s}=\frac{1}{2}\cdot \frac{\gamma ^{\frac{2}{3}s+\frac{2}{3}}-\gamma ^{\frac{2}{3}s+1}}{\gamma }=\frac{\gamma ^{\frac{2}{3}s-1}}{2}\left( \gamma ^{\frac{2}{3}}-\gamma \right) \ge \frac{1-\gamma }{12}, \end{aligned}$$provided that \(\gamma \ge 0.9\) (so that \(\gamma ^{\frac{2}{3}}-\gamma \ge 0.3(1-\gamma )\)) and \(\gamma ^{\frac{2}{3}H} \ge 0.7\). This concludes the analysis of this stage.
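The elementary inequality \(\gamma ^{2/3}-\gamma \ge 0.3(1-\gamma )\) invoked in the last step can be checked directly; the following quick numerical scan (a sanity check, not part of the proof) verifies it over a grid of \(\gamma \in [0.9, 1)\):

```python
# Scan gamma over [0.9, 0.9999] and confirm gamma^(2/3) - gamma >= 0.3*(1 - gamma),
# the elementary fact used above to lower bound pi(a_1 | s-bar).
for k in range(1000):
    gamma = 0.9 + 0.0999 * k / 999
    assert gamma ** (2.0 / 3.0) - gamma >= 0.3 * (1.0 - gamma)
```

The bound is tightest near \(\gamma = 0.9\) and its margin improves as \(\gamma \rightarrow 1\), where \(\gamma ^{2/3}-\gamma \approx (1-\gamma )/3\).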

Stage 2: the duration where \(\theta ^{(t)}({\overline{s}}, a_1) \ge \theta ^{(t)}({\overline{s}}, a_0)\). Similar to the above argument, we intend to show that it takes at most \(O\big ( \frac{1}{\eta (1-\gamma )^2} \big )\) iterations for \(\pi ^{(t)}(a_0 \,|\,{\overline{s}}) \le 1-\gamma \) to occur. From the gradient expression and the property (172), we obtain
$$\begin{aligned} \frac{\partial V^{(t)}(\mu )}{\partial \theta ({\overline{s}}, a_0)}&= \frac{1}{1-\gamma } d^{(t)}_{\mu }({\overline{s}}) \pi ^{(t)}(a_0 \,|\,{\overline{s}})\pi ^{(t)}(a_1 \,|\,{\overline{s}}) \Big ( Q^{(t)}({\overline{s}}, a_0) - Q^{(t)}({\overline{s}}, a_1) \Big ) \\&\le -\frac{1}{16}c_{\textrm{m}}\gamma (1-\gamma ) {\pi }^{(t)}(a_0 \,|\,{\overline{s}} ) \le -\frac{1}{48}c_{\textrm{m}}\gamma (1-\gamma ) {\widehat{\pi }}^{(t)}(a_0 \,|\,{\overline{s}} ), \end{aligned}$$where the first inequality uses Lemma 2 and the property \(\pi ^{(t)}(a_1 \,|\,{\overline{s}})\ge 1/2\) (since \(\theta ^{(t)}({\overline{s}}, a_1) \ge \theta ^{(t)}({\overline{s}}, a_0)\)), and the last inequality relies on (132). Repeating a similar argument as above, we can demonstrate that
$$\begin{aligned} {\widehat{\pi }}^{(t+1)}(a_0 \,|\,{\overline{s}} ) - {\widehat{\pi }}^{(t)}(a_0 \,|\,{\overline{s}} ) \le - \frac{1}{70} \eta c_{\textrm{m}}\gamma (1-\gamma ) \Big [{\widehat{\pi }}^{(t)}(a_0 \,|\,{\overline{s}} )\Big ]^2. \end{aligned}$$(174) This combined with Lemma 11 implies that
$$\begin{aligned} {\widehat{\pi }}^{(t)}(a_0 \,|\,{\overline{s}} ) \le \frac{1}{\frac{1}{70} \eta c_{\textrm{m}}\gamma (1-\gamma ) (t-t_{0,4}) +1}, \end{aligned}$$(175) with \(t_{0,4}\) denoting the first iteration of this stage. Consequently, one has \({\widehat{\pi }}^{(t)}(a_0 \,|\,{\overline{s}} ) \le 1-\gamma \)—and therefore \(\pi ^{(t)}(a_0 \,|\,{\overline{s}} )\le 1-\gamma \) according to (132)—as soon as \(t-t_{0,4}\) exceeds
$$\begin{aligned} \frac{70}{ \eta c_{\textrm{m}}\gamma (1-\gamma )^2 }. \end{aligned}$$
Finally, if \(\pi ^{(t)}(a_0 \,|\,{\overline{s}} )\le 1-\gamma \), then one has
where the first inequality holds by recalling that \(t\ge t_{s}(\gamma ^{2s}-1/4)\). Consequently, putting the above pieces (regarding the duration of the two stages) together allows us to conclude that
as claimed.
1.3 Proof of Lemma 18
Before proceeding, we first single out two properties that play a crucial role in the proof of Lemma 18.
Lemma 19
The following basic properties hold true for any \(2\le s\le H\):
The proof of this auxiliary lemma is deferred to the end of this subsection. Equipped with this result, we are now positioned to present the proof of Lemma 18. To begin with, we seek to bound the quantity \(t_{s}(\gamma ^{2s}-1/4) - t_{s}({\tau }_s)\). Applying Lemma 17 with a little algebra yields
With the assistance of the bound (176a) in Lemma 19, we can continue the bound in (177) to derive
To continue, we shall bound the quantity \(t_{\overline{s-1}}(\gamma ^{2s-1}-1/4) - t_{\overline{s-1}}(\tau _{s})\). Similar to the derivation of the inequality (177), we can apply Lemma 17 to show that
where the third line makes use of (176b) in Lemma 19.
Applying the inequalities (178) and (179) recursively, one arrives at
To continue, note that Lemma 17 and the bound (176a) give
which taken together lead to
Plugging this back into (180) leads to
where the last step arises from the assumption (155), that is, \(t_{2}(\gamma ^{4}-1/4) < t_{3}({\tau }_3)\).
Further, the above inequality taken together with (179) yields
We have thus established (156a) and (156b).
Finally, we turn to the proof of (156c). In view of (156b), one has
In addition,
where the identity in the first line comes from Part (iii) of Lemma 8, and the last inequality uses the assumption \(t_{s-1}({\tau }_{s-1}) > t_{\overline{s-2}}(\tau _{s-1}) + \frac{2sc_0}{\eta (1-\gamma )^2}\). Combining the above two inequalities justifies the validity of the advertised inequality (156c). Then, we establish (156c) for \(s = 3\) through Lemma 4, which gives
where the last inequality comes from (176b).
Proof of Lemma 19
To begin with, the claim (176a) holds when \(s=2\) as a result of the inequality (38b) in Lemma 4. We now turn to the case with \(3\le s\le H\). In view of the property (59) in Lemma 8, we have
Recognizing that \(V^{(t)}(s)\) is a convex combination of \(\big \{ Q^{(t)}(s,a) \big \}_{a\in \{a_0,a_1,a_2\}}\), we know that if \(V^{(t)}(s) \ge {\tau }_s\), then one necessarily has \(Q^{(t)}(s,a_{1}) > {\tau }_s\), or equivalently, \(V^{(t)}(\overline{s-1}) > {\tau }_s / \gamma \ge \tau _s \). This essentially means that \(t_{s}({\tau }_s) \ge t_{\overline{s-1}}(\tau _{s})\), thus establishing the claim (176a).
Similarly, Lemma 8 (cf. (61)) also tells us that
This means that if \(V^{(t)}(s-1)\le {\tau }_{s-1}\), then
Consequently, we conclude that \(t_{\overline{s-1}}(\tau _{s}) \ge t_{s-1}({\tau }_{s-1})\), as claimed in (176b). \(\square \)
Analysis for the blowing-up lemma (Lemma 7)
In this section, we establish the blowing-up phenomenon as asserted in Lemma 7.
1.1 Which reference point \(t_{\textsf{ref}}\) shall we choose?
Let us specify the time instance \(t_{\textsf{ref}}\) as required in Lemma 7 as follows
where \(c_{\textsf{ref}}\in (0,1/3)\) is some constant to be specified shortly.
Existence. An important step is to justify that (185) is well-defined, namely, there exists at least one time instance within \(\big [\,t_{\overline{s-1}}(\tau _s) , t_{s}(\tau _s) \,\big )\) that satisfies \(c_{\textsf{ref}}(1-\gamma )\pi ^{(t)}(a_0 \,|\,s) \le \pi ^{(t)}(a_1 \,|\,s)\). Towards this, we note that if the time instance \(t_{\overline{s-1}}(\tau _s) \) obeys
then we simply have \(t_{\textsf{ref}}= t_{\overline{s-1}}(\tau _s) \). We then move on to the complement case where
To justify that the construction (185) makes sense, it suffices to show that the endpoint \(t_s({\tau }_s)\) obeys
In order to validate (187), recall that the inequality (60) in Lemma 8 ensures that
given that \(V^{(t_s({\tau }_s))}(s) \ge {\tau }_s\). Therefore, the inequality (187) must be satisfied when \(c_{\textsf{ref}}<1/2\), given that the left-hand side of (187) obeys
This in turn validates the existence of (187) for this case.
Several immediate properties about \(t_{\textsf{ref}}\) and \(t_{\overline{s-1}}(\tau _s) \). We pause to single out a couple of immediate properties about the \(t_{\textsf{ref}}\) constructed above as well as \(t_{\overline{s-1}}(\tau _s) \).
Consider the case where \(t_{\overline{s-1}}(\tau _s) \) obeys
then one has \(t_{\textsf{ref}}= t_{\overline{s-1}}(\tau _s) \) (as discussed above). As can be clearly seen, \(t_{\overline{s-1}}(\tau _s) \) satisfies the advertised inequality (45a) by taking \(c_{\textsf{ref}}\ge c_{\textrm{p}}/8064\). Additionally, let us first recall from (156c) in Lemma 18 that
This combined with Lemma 6 (see (43)) tells us that
where the last relation utilizes the bound (40) in Lemma 5. This leads to the advertised bound (45b).
As a result, the claims (45a)–(45b) only need to be justified under the assumption (186).
Organization of the proof. In light of the above basic facts, the subsequent proof focuses on the scenario where (186) is satisfied, namely, the case where
We shall start by justifying that \(\theta ^{(t)}(s, a_1)\) has not increased much during \([t_{\overline{s-1}}(\tau _s) , t_{\textsf{ref}}]\), as detailed in Appendices F.2 and F.3 (focusing on two separate stages respectively). This feature will then be used in Appendix F.4 to establish the claims (45a)–(45b), and in Appendix F.5 to establish the claim (45c).
1.2 Stage I: the duration where \(\theta ^{(t)}(s, a_2) < \theta ^{(t)}(s, a_0)\)
Suppose that at the starting point we have \(\theta ^{(t_{\overline{s-1}}(\tau _s) )}(s, a_2) < \theta ^{(t_{\overline{s-1}}(\tau _s) )}(s, a_0)\); otherwise the reader can proceed directly to Stage II in Appendix F.3. The goal is to control the number of iterations taken to achieve \(\theta ^{(t)}(s, a_2) \ge \theta ^{(t)}(s, a_0)\). More specifically, let us define the transition point
In this subsection, we seek to develop an upper bound on \(t_{\textsf{tran}}-t_{\overline{s-1}}(\tau _s) \), and to show that \(\theta ^{(t)}(s, a_1) - \theta ^{(t_{\overline{s-1}}(\tau _s) )}(s, a_1) \le 1/2\) holds throughout this stage.
Preparation: basic facts and rescaled policies. Before moving forward, we first gather some basic facts. To begin with, from the definition (185) of \(t_{\textsf{ref}}\), we know that the inequality \(c_{\textsf{ref}}(1-\gamma )\pi ^{(t)}(a_0\,|\,s) > \pi ^{(t)}(a_1 \,|\,s)\) holds true for every \(t \in [t_{\overline{s-1}}(\tau _s) ,~t_{\textsf{ref}})\), or equivalently,
In the case considered here, we have—according to (190) and (189)—that
for any t obeying \(t_{\overline{s-1}}(\tau _s) \le t < \min \{t_{\textsf{tran}},t_{\textsf{ref}}\}\). This means that
holds for any t obeying \(t_{\overline{s-1}}(\tau _s) \le t < \min \{t_{\textsf{tran}},t_{\textsf{ref}}\}\), provided that \(0< c_{\textsf{ref}}< 1\).
Moreover, let us introduce the rescaled policy \({\widehat{\pi }}^{(t)}(a\,|\,s)\) as before
In view of (192), the rescaled policy can therefore be written as
for any t with \(t_{\overline{s-1}}(\tau _s) \le t < \min \{t_{\textsf{tran}},t_{\textsf{ref}}\}\), where we have used the constraint \(\sum _{a} \theta ^{(t)}(s,a) = 0\) (see Part (vii) of Lemma 8).
Showing \(\theta ^{(t)}(s, a_1) - \theta ^{(t_{\overline{s-1}}(\tau _s) )}(s, a_1) \le 1/2\) by induction. In the following, we seek to prove by induction the following key property
for any t that obeys \(t_{\overline{s-1}}(\tau _s) \le t \le \min \{t_{\textsf{tran}},t_{\textsf{ref}}\}\) and
We shall return to justify (195) for all t within this stage later on. In words, the claim (194) essentially means that \(\theta ^{(t)}(s, a_1)\) does not deviate much from \(\theta ^{(t_{\overline{s-1}}(\tau _s) )}(s, a_1)\) during this stage. With regards to the base case where \(t = t_{\overline{s-1}}(\tau _s) \), the hypothesis (194) holds true trivially. Next, assuming that (194) is satisfied for every integer less than or equal to \(t-1\), we intend to establish this hypothesis for the tth iteration, which is accomplished as follows.
First, Lemma 1 and Lemma 10 tell us that \(Q^{(t)}(s, a_1) - V^{(t)}(s)\le 1\). It then follows that
which relies on the bound \(d_{\mu }^{(t)}(s) \le 14 c_{\textrm{m}}(1-\gamma )^{2}\) stated in Lemma 3. As a result, it can be derived from the PG update rule (12a) that
Regarding the term involving \(\pi ^{(j)}(a_1 \,|\,s)\), we observe that for any \(t_{\overline{s-1}}(\tau _s) \le j < t\),
Here, (i) is a consequence of (132), (ii) holds since (in view of (193), \(\theta ^{(j)}(s,a_0)\ge 0\), and \(\sum _a\theta ^{(j)}(s,a)=0\))
whereas (iii) follows from the induction hypothesis (194) for any \(t_{\overline{s-1}}(\tau _s) \le j<t\). Combine the inequalities (196) and (198) to reach
Consequently, under the constraint (195), the preceding inequality implies that
where the last inequality makes use of (188) and the assumption (44). These allow us to establish the induction hypothesis for the tth iteration, namely,
Validating the constraint (195) and upper bounding \(\min \{t_{\textsf{tran}},t_{\textsf{ref}}\}  t_{\overline{s1}}(\tau _s) \). It remains to justify the assumed condition (195) for all iteration within this stage. To this end, suppose instead that
where \({\widetilde{t}}\) is defined in (195). We claim that the following relation is satisfied
for any t obeying \(t_{\overline{s-1}}(\tau _s) \le t \le t_{\overline{s-1}}(\tau _s) + {\widetilde{t}} \le \min \{t_{\textsf{tran}},t_{\textsf{ref}}\}\). Equipped with this recursive relation, we can invoke Lemma 11 to develop a lower bound on \({\widehat{\pi }}^{(t)}(a_2 \,|\,s)\), provided that an initial lower bound is available. In order to do so, in view of the expression (193), we can deduce that
where the last relation is due to the bound \(\theta ^{(t_{\overline{s-1}}(\tau _s) )}(s, a_2) \ge 0\) (see (43) in Lemma 6). Combining the above two inequalities and applying Lemma 11 (see (69b)), we arrive at \(\pi ^{(t)}(a_2 \,|\,s) \ge 1/2\)—and hence \(\pi ^{(t)}(a_2 \,|\,s) \ge \pi ^{(t)}(a_0 \,|\,s) \)—as soon as \(t- t_{\overline{s-1}}(\tau _s) \) exceeds
This together with the definition of \(t_{\textsf{tran}}\) thus indicates that
provided that \(\frac{c_{\textrm{p}}c_{\textrm{m}}}{150}\eta (1-\gamma )^2 \le 0.5\). This, however, contradicts the assumption (201). As a consequence, we conclude that \(t_{\overline{s-1}}(\tau _s) + {\widetilde{t}} > \min \{t_{\textsf{tran}},t_{\textsf{ref}}\}\), thus indicating that
Showing that \(t_{\textsf{tran}}= \min \{t_{\textsf{tran}},t_{\textsf{ref}}\} \). We now justify that \(t_{\textsf{tran}}< t_{\textsf{ref}}\), so that the upper bound (203) leads to an upper bound on \(t_{\textsf{tran}} t_{\overline{s1}}(\tau _s) \). Suppose instead that
and we would like to show that this leads to contradiction. By definition of \(t_{\textsf{ref}}\), we have
This further yields
where the second inequality arises from (194), and the last one makes use of (188) as long as \(t_{s-2}({\tau }_{s-2})\) is sufficiently large. However, this together with the constraint \(\sum _a \theta ^{(t_{\textsf{ref}})}(s, a)=0\) implies that
which, however, implies that \(t_{\textsf{ref}}> t_{\textsf{tran}}\) (according to the definition of \(t_{\textsf{tran}}\)) and leads to contradiction. As a result, we conclude that
and the bound (203) then indicates that
1.2.1 Proof of the inequality (202)
From the relation (193), one can deduce that
for any t with \(t_{\overline{s-1}}(\tau _s) \le t \le \min \{t_{\textsf{tran}},t_{\textsf{ref}}\}\), where the inequality above follows from the elementary fact \(e^{x}-1\ge x\) for any \(x\in {\mathbb {R}}\). Therefore, the difference between \({\widehat{\pi }}^{(t)}(a_2 \,|\,s)\) and \({\widehat{\pi }}^{(t-1)}(a_2 \,|\,s)\) depends on both \(\frac{\partial {V}^{(t-1)}(\mu )}{\partial \theta (s, a_2)}\) and \(\frac{\partial {V}^{(t-1)}(\mu )}{\partial \theta (s, a_1)}\), motivating us to lower bound these two derivatives separately.
Step 1: bounding \(\frac{\partial {V}^{(t)}(\mu )}{\partial \theta (s, a_2)}\). First, we make the observation that for any \(3 \le s < H\) and any \(t \ge t_{\overline{s-1}}(\tau _s) \),
holds as long as \(\gamma (\gamma ^{2s-3} - 1/4 - \gamma \tau _s) \ge 1/8.\) Here, the first identity comes from (59) in Lemma 8, and the first inequality holds for any \(t\ge t_{\overline{s-2}}(\gamma ^{2s-3}-1/4)\)—a consequence of the monotonicity property in Lemma 9. As a result, for any t obeying \(t_{\overline{s-1}}(\tau _s) \le t \le \min \{t_{\textsf{tran}},t_{\textsf{ref}}\}\) we have
where the first inequality combines (207) with the facts that \(\pi ^{(t)}(a_0 \,|\,s)\ge 1/3\) (see (192)) and \(0\le Q^{(t)}(s, a_2) , Q^{(t)}(s, a_1)\le 1\) (see Lemma 1), and the last line holds by observing (see (185))
and using the assumption \(c_{\textsf{ref}}\le c_{\textrm{p}}/2\). Consequently, for any \(t \ge t_{\overline{s-1}}(\tau _s) \), the gradient w.r.t. \(\theta (s, a_2)\) satisfies
where the first inequality above also makes use of the lower bound in Lemma 2.
In fact, the above lower bound holds true for every \(t \ge t_{\overline{s-1}}(\tau _s) \), where we have shown that \(\frac{\partial V^{(t-1)}(\mu )}{\partial \theta (s, a_2)}\) is bounded from below by 0. One can thus conclude that the iterate \(\theta ^{(t)}(s, a_2)\) increases with t.
Step 2: bounding \(\frac{\partial {V}^{(t)}(\mu )}{\partial \theta (s, a_1)}\). Regarding the gradient w.r.t. \(\theta (s, a_1)\), we have
where the last line follows since (see Lemma 8 and the fact that \(t\ge t_{\overline{s-1}}(\tau _s)\))
In addition, recognizing that \(\pi ^{(t)}(a_0 \,|\,s) + \pi ^{(t)}(a_2 \,|\,s) \le 1\) and \(d_{\mu }^{(t)}(s) \le 14 c_{\textrm{m}}(1-\gamma )^{2}\) (see Lemma 3), we can continue the above bound to obtain
where the last inequality is due to \(\tau _{s} \le 1/2\), \(0< \gamma < 1\), and the bound (132).
Step 3: connecting \({\widehat{\pi }}^{(t)}(a_1 \,|\,s)\) with \({\widehat{\pi }}^{(t)}(a_2 \,|\,s)\). The above lower bound (210) on \(\frac{\partial {V}^{(t)}(\mu )}{\partial \theta (s, a_1)}\) depends on \({\widehat{\pi }}^{(t)}(a_1 \,|\,s)\). However, the desired lower bound (202) is only a function of \({\widehat{\pi }}^{(t)}(a_2 \,|\,s)\). This motivates us to investigate the connection between \({\widehat{\pi }}^{(t)}(a_1 \,|\,s)\) and \({\widehat{\pi }}^{(t)}(a_2 \,|\,s)\).
To this end, let us write
As a result, one only needs to control the quantity \(\exp \big (\theta ^{(t-1)}(s, a_1) - \theta ^{(t-1)}(s, a_2)\big )\). In order to do so, we make use of the induction hypothesis (194) for the \((t-1)\)-th iteration to show that
Here, (i) follows from the fact that \(\theta ^{(t)}(s, a_2)\) increases with t (see (209)); and (ii) comes from the inequality (43) in Lemma 6 as well as (188). Recalling Lemma 5, one has
where the last inequality is satisfied provided that \(t_{s-2}({\tau }_{s-2}) \ge \frac{1050^2 e}{\frac{c_{\textrm{m}}\gamma ^3}{35} \eta (1-\gamma )^2 c_{\textrm{p}}^2}\). Combining (210) with (211) and (212), we arrive at
Step 4: combining bounds. Putting the above pieces together and invoking the expression (206) yield for \(\gamma > 0.96\),
which concludes the proof of the advertised bound (202).
1.3 Stage II: the duration where \(\theta ^{(t)}(s, a_2) \ge \theta ^{(t)}(s, a_0)\)
We now turn attention to the case where t lies within \([t_{\textsf{tran}}, t_{\textsf{ref}})\), which is a nonempty interval according to (204). In this case one has
as a consequence of the definition (189) of \(t_{\textsf{tran}}\). Again, from the definition (185) of \(t_{\textsf{ref}}\), the inequality \(c_{\textsf{ref}}(1-\gamma )\pi ^{(t)}(a_0\,|\,s) > \pi ^{(t)}(a_1 \,|\,s)\) holds true for every \(t \in [t_{\textsf{tran}}, ~t_{\textsf{ref}})\), or equivalently,
The goal of this subsection is to show that \(\theta ^{(t)}(s, a_1) - \theta ^{(t_{\textsf{tran}})}(s, a_1) \le 1/2\) throughout this stage.
Preparation. From the above conditions (214) and (215), we have
We now look at the gradient w.r.t. \(\theta (s, a_0)\), for which we first observe that
Here, (i) follows from the inequalities (207) and (216), whereas (ii) holds true as long as \(c_{\textsf{ref}}\le {c_{\textrm{p}}}/{72}\). Consequently,
thus indicating that \(\theta ^{(t)}(s, a_0)\) is decreasing with t.
Key induction hypotheses. Again, we seek to prove by induction that
For the base case where \(t = t_{\textsf{tran}}\), this claim trivially holds true. Now suppose that the induction hypothesis (218) is satisfied for every iteration up to \(t1\), and we would like to establish it for the tth iteration. Towards this, we find it helpful to introduce another auxiliary induction hypothesis
As an immediate remark, this hypothesis trivially holds true when \(t=t_{\textsf{tran}}+1\). In what follows, we shall first establish (218) for the tth iteration assuming satisfaction of (219), and then use it to demonstrate that (219) holds for \(i=t\) as well.
Inductive step 1: showing that \(\theta ^{(t)}(s, a_1) - \theta ^{(t_{\textsf{tran}})}(s, a_1) \le 1/2\). Towards this, let us introduce for convenience another time instance
which reflects the time when \(\theta ^{(i)}(s, a_1)\) reaches its maximum before iteration t. In order to establish the induction hypothesis (218) for the tth iteration, it is sufficient to demonstrate that
As before, let us employ the PG update rule (12a) to expand \(\theta ^{({\widetilde{t}})}(s, a_1) - \theta ^{(t_{\textsf{tran}})}(s, a_1)\) as follows
For each gradient \(\frac{\partial {V}^{(i)}(\mu )}{\partial \theta (s, a_1)}\), invoking Lemma 3, Lemma 1 and Lemma 10 tells us that
In addition, a little algebra together with (216) leads to
for any \(i\) obeying \(t_{\overline{s-1}}(\tau _s) \le i < {\widetilde{t}}\), where the first inequality comes from (132), (i) makes use of \(\sum _a \theta ^{(i)}(s, a) = 0\), and (ii) follows from the induction hypothesis (219) along with the definition (220) of \({\widetilde{t}}\).
Putting the above bounds together with (222) and (223) guarantees that
Given that \(\theta ^{({\widetilde{t}})}(s, a_0) \ge \theta ^{({\widetilde{t}})}(s, a_1) - \log \big (c_{\textsf{ref}}(1-\gamma )\big )\) (see (215)) and \(\sum _a \theta ^{({\widetilde{t}})}(s, a) = 0\), one obtains
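As a sketch of one way these two facts combine: because the three parameters sum to zero, the largest of them is nonnegative, so \(\theta^{(\widetilde{t})}(s,a_0)+\theta^{(\widetilde{t})}(s,a_1) = -\theta^{(\widetilde{t})}(s,a_2) \le 0\) whenever \(\theta^{(\widetilde{t})}(s,a_2)=\max _a \theta^{(\widetilde{t})}(s,a)\) (cf. (216)), whence

```latex
% Adding theta(s, a_1) to both sides of (215) and using the
% zero-sum normalization of the three parameters:
2\,\theta^{(\widetilde{t})}(s, a_1)
  \;\le\; \theta^{(\widetilde{t})}(s, a_0) + \theta^{(\widetilde{t})}(s, a_1)
          + \log\big(c_{\textsf{ref}}(1-\gamma)\big)
  \;\le\; \log\big(c_{\textsf{ref}}(1-\gamma)\big).
```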
which combined with the inequality (219) thus implies that
As a consequence of the inequalities (224) and (225), we obtain
where the last line holds as long as \(c_{\textsf{ref}}< c_{\textrm{p}}/ 16128\). This in turn establishes (221), and hence the induction hypothesis (218) for the \(t\)-th iteration, assuming satisfaction of the hypothesis (219).
Inductive step 2: establishing the upper bound (219). The next step is thus to justify the induction hypothesis (219) when \(i=t\). To do so, we first pay attention to the dynamics of \(\theta ^{(i)}(s, a_0)\) for any \(t_{\textsf{tran}}\le i\le t\). Recognizing that \(\theta ^{(i)}(s, a_2)=\max _a \theta ^{(i)}(s, a)\) (see (216)) and \(\sum _a \theta ^{(i)}(s, a)=0\), we can express
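For concreteness, the zero-sum normalization makes this expression elementary: with three actions and \(\sum _a \theta ^{(i)}(s, a)=0\),

```latex
% Eliminating theta(s, a_2) via the zero-sum constraint:
\theta^{(i)}(s, a_2) = -\,\theta^{(i)}(s, a_0) - \theta^{(i)}(s, a_1)
\quad\Longrightarrow\quad
\theta^{(i)}(s, a_0) - \theta^{(i)}(s, a_2)
  = 2\,\theta^{(i)}(s, a_0) + \theta^{(i)}(s, a_1),
```

so each PG step changes this gap by \(\eta \big ( 2\frac{\partial V^{(i)}(\mu)}{\partial \theta(s,a_0)} + \frac{\partial V^{(i)}(\mu)}{\partial \theta(s,a_1)} \big )\), the combination that reappears at the end of this subsection.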
This allows one to obtain
With the above observation in mind, we claim for the moment the following recursive relation
for any i obeying \(t_{\textsf{tran}}\le i < t\), whose proof is deferred to the end of this section. If this claim were true, then (67b) in Lemma 11 allows us to conclude the desired bound
Proof of the inequality (228). Combining (217) and the lower bound on \(d^{(i)}_{\mu }(s)\) in Lemma 2, we have
where the last inequality also makes use of (132). In addition, invoking the inequalities (223) and (132) gives
Recall that for any \(i \in [t_{\textsf{tran}},~t_{\textsf{ref}})\), one has \(\theta ^{(i)}(s, a_0) \ge \theta ^{(i)}(s, a_1) - \log \big (c_{\textsf{ref}}(1-\gamma )\big )\), or equivalently,
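Since softmax probabilities satisfy \(\pi ^{(i)}(a \mid s) \propto \exp \big (\theta ^{(i)}(s,a)\big )\), exponentiating both sides yields the equivalent probability-ratio form (mirroring the reformulation used after the definition (185) of \(t_{\textsf{ref}}\)):

```latex
% Exponentiating the parameter gap gives a ratio of softmax
% probabilities, since the normalization constant cancels:
\theta^{(i)}(s, a_1) - \theta^{(i)}(s, a_0)
  \le \log\big(c_{\textsf{ref}}(1-\gamma)\big)
\quad\Longleftrightarrow\quad
\pi^{(i)}(a_1 \mid s) \le c_{\textsf{ref}}(1-\gamma)\,\pi^{(i)}(a_0 \mid s).
```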
It thus follows that
As a result, the above bounds taken collectively lead to
provided that \(c_{\textsf{ref}}/ c_{\textrm{p}}< 1/1568\). In addition, similar to (230), we can easily see that
as long as \(\eta c_{\textrm{m}}(1-\gamma ) \le 1/42\).
Substituting the preceding bounds into (227), we immediately arrive at
where the first inequality holds due to the fact \(-1 \le 2\eta \frac{\partial {V}^{(i)}(\mu )}{\partial \theta (s,a_{0})}+\eta \frac{\partial {V}^{(i)}(\mu )}{\partial \theta (s,a_{1})} \le 0\) as well as the elementary inequality \(1-e^x \ge -x/2\), valid as long as \(-1\le x\le 0\). This establishes the inequality (228).
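The elementary inequality invoked here, \(1-e^{x}\ge -x/2\) for \(-1\le x\le 0\), is straightforward to verify numerically; the following snippet (an illustrative sanity check, not part of the original analysis) tests it on a fine grid over the interval:

```python
import math

def check_inequality(num_points: int = 10_001) -> bool:
    """Check that 1 - e^x >= -x/2 holds for all x on a grid over [-1, 0]."""
    for k in range(num_points):
        x = -1.0 + k / (num_points - 1)  # grid point in [-1, 0]
        # Allow a tiny tolerance for the equality case at x = 0.
        if 1.0 - math.exp(x) < -x / 2.0 - 1e-12:
            return False
    return True
```

The inequality also follows from convexity of \(e^x\): on \([-1,0]\) the graph of \(e^x\) lies below the chord through \((-1, e^{-1})\) and \((0,1)\), whose slope \(1-e^{-1} \approx 0.632\) exceeds \(1/2\).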
1.4 Proof of the claims (45a) and (45b)
We are now ready to justify the claims (45a) and (45b). Combining (194) and (218), we reach
This taken collectively with (188) leads to
as claimed in (45b).
In addition, recalling the definition (185) of \(t_{\textsf{ref}}\), we have
which clearly satisfies (45a) as long as \(c_{\textsf{ref}}\ge c_{\textrm{p}}/16128\).
1.5 Proof of the claim (45c)
Finally, we move on to analyze what happens after iteration \(t_{\textsf{ref}}\), for which we focus on tracking the changes of \({\widehat{\pi }}^{(t)}(a_1 \mid s)\). In this part, let us only consider the set of \(t\) satisfying
Note that at time \(t_{\textsf{ref}}\), the inequalities (45a) and (45b) are both satisfied, which together with the property \(\pi ^{(t)}(a_1 \mid s) \le \pi ^{(t)}(a_2 \mid s)\) yield
Then, if \(c_{\textsf{ref}}< c_{\textrm{p}}/ 1000\), we have
Here, the first inequality holds if \(\eta \frac{\partial {V}^{(t)}(\mu )}{\partial \theta (s, a_1)} - \eta \frac{\partial {V}^{(t)}(\mu )}{\partial \theta (s, a_2)} \le 1\) (given the elementary fact \(e^x - 1 \le 2x\) for any \(0\le x\le 1\)), and the last line is valid since
where (ii) holds since \(Q^{(t)}(s,a_{2})\ge Q^{(t)}(s,a_{0})\) (cf. (207)), and (i) and (iii) make use of Lemma 1 and Lemma 3. In addition, these bounds also imply that \(\eta \frac{\partial {V}^{(t)}(\mu )}{\partial \theta (s, a_1)} - \eta \frac{\partial {V}^{(t)}(\mu )}{\partial \theta (s, a_2)} \le 1\) holds as long as \(28\eta c_{\textrm{m}}(1-\gamma )\le 1\), thus validating the first inequality in (232).
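The elementary fact \(e^x - 1 \le 2x\) on \([0,1]\) used above follows from convexity of \(e^x\): on this interval the graph lies below the chord through \((0,1)\) and \((1,e)\), so

```latex
% Convexity bound on [0,1], then e - 1 < 2:
e^{x} - 1 \;\le\; (e-1)\,x \;\le\; 2x,
\qquad 0 \le x \le 1,
```

since \(e - 1 \approx 1.718 < 2\).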
Armed with the above recursive relation (232), we can invoke Lemma 11 to show that
where the last inequality holds in view of (132) and (60).
In order to control \(t_{s}({\tau }_s) - t_{\textsf{ref}}\) via (234), it remains to upper bound \({\widehat{\pi }}^{(t_{\textsf{ref}})}(a_1 \mid s)\). Towards this end, it is seen that
where the first line uses \(\sum _a \theta ^{(t_{\textsf{ref}})}(s, a)=0\), the second line relies on the inequality (45a), and the last one applies the inequality (45b). Substitution into the relation (234) yields
thus establishing the advertised bound.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Li, G., Wei, Y., Chi, Y. et al. Softmax policy gradient methods can take exponential time to converge. Math. Program. 201, 707–802 (2023). https://doi.org/10.1007/s10107022019206
Keywords
 Policy gradient methods
 Exponential lower bounds
 Softmax parameterization
 Discounted infinitehorizon MDPs