
Finite-time error bounds for Greedy-GQ

Published in: Machine Learning (Springer).

Abstract

Greedy-GQ with linear function approximation, originally proposed by Maei et al. (in: Proceedings of the international conference on machine learning (ICML), 2010), is a value-based off-policy algorithm for optimal control in reinforcement learning, with a non-linear two-timescale structure and a non-convex objective function. This paper develops its tightest finite-time error bounds. We show that the Greedy-GQ algorithm converges as fast as \(\mathcal {O}({1}/{\sqrt{T}})\) under the i.i.d. setting and \(\mathcal {O}({\log T}/{\sqrt{T}})\) under the Markovian setting. We further design a variant of the vanilla Greedy-GQ algorithm using the nested-loop approach, and show that its sample complexity is \(\mathcal {O}({\log (1/\epsilon )\epsilon ^{-2}})\), which matches that of the vanilla Greedy-GQ. Our finite-time error bounds match those of the stochastic gradient descent algorithm for general smooth non-convex optimization problems, despite the additional challenge of the two-timescale updates. Our finite-sample analysis provides theoretical guidance on choosing step-sizes for faster convergence in practice, and suggests a trade-off between the convergence rate and the quality of the obtained policy. Our techniques provide a general approach for the finite-sample analysis of non-convex two-timescale value-based reinforcement learning algorithms.


Data availability

Not applicable.

References

  • Archibald, T., McKinnon, K., & Thomas, L. (1995). On the generation of Markov decision processes. Journal of the Operational Research Society, 46(3), 354–361.

  • Baird, L. (1995). Residual algorithms: Reinforcement learning with function approximation. In Machine learning proceedings (pp. 30–37). Elsevier.

  • Bhandari, J., Russo, D., & Singal, R. (2018). A finite time analysis of temporal difference learning with linear function approximation. arXiv preprint arXiv:1806.02450

  • Bhatnagar, S., Precup, D., Silver, D., Sutton, R. S., Maei, H., & Szepesvári, C. (2009). Convergent temporal-difference learning with arbitrary smooth function approximation. In Proceedings of the advances in neural information processing systems (NIPS) (Vol. 22, pp. 1204–1212).

  • Borkar, V. S., & Pattathil, S. (2018). Concentration bounds for two time scale stochastic approximation. In Proceedings of the annual Allerton conference on communication, control, and computing (pp. 504–511). IEEE.

  • Borkar, V. S. (2009). Stochastic approximation: A dynamical systems viewpoint (Vol. 48). Springer.

  • Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., & Zaremba, W. (2016). OpenAI Gym. arXiv preprint arXiv:1606.01540

  • Cai, Q., Yang, Z., Lee, J. D., & Wang, Z. (2019). Neural temporal-difference learning converges to global optima. In Proceedings of the advances in neural information processing systems (NeurIPS) (pp. 11312–11322).

  • Dalal, G., Szorenyi, B., & Thoppe, G. (2020). A tale of two-timescale reinforcement learning with the tightest finite-time bound. In Proceedings of the AAAI conference on artificial intelligence (AAAI) (pp. 3701–3708).

  • Dalal, G., Szörényi, B., Thoppe, G., & Mannor, S. (2018a). Finite sample analyses for TD(0) with function approximation. In Proceedings of the AAAI conference on artificial intelligence (AAAI) (pp. 6144–6160).

  • Dalal, G., Szörényi, B., Thoppe, G., & Mannor, S. (2018). Finite sample analysis of two-timescale stochastic approximation with applications to reinforcement learning. Proceedings of Machine Learning Research, 75, 1–35.

  • Doan, T. T. (2021). Nonlinear two-time-scale stochastic approximation: Convergence and finite-time performance. In Learning for dynamics and control (pp. 47–47). PMLR.

  • Ghadimi, S., & Lan, G. (2013). Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4), 2341–2368.

  • Gordon, G. J. (1996). Chattering in SARSA (\(\lambda\)). CMU Learning Lab Technical Report.

  • Gupta, H., Srikant, R., & Ying, L. (2019). Finite-time performance bounds and adaptive learning rate selection for two time-scale reinforcement learning. In Proceedings of the advances in neural information processing systems (NeurIPS) (pp. 4706–4715).

  • Kaledin, M., Moulines, E., Naumov, A., Tadic, V., & Wai, H.-T. (2020). Finite time analysis of linear two-timescale stochastic approximation with Markovian noise. arXiv preprint arXiv:2002.01268

  • Karmakar, P., & Bhatnagar, S. (2018). Two time-scale stochastic approximation with controlled Markov noise and off-policy temporal-difference learning. Mathematics of Operations Research, 43(1), 130–151.

  • Konda, V. R., & Tsitsiklis, J. N. (2004). Convergence rate of linear two-time-scale stochastic approximation. The Annals of Applied Probability, 14(2), 796–819.

  • Lakshminarayanan, C., & Szepesvari, C. (2018). Linear stochastic approximation: How far does constant step-size and iterate averaging go? In International conference on artificial intelligence and statistics (pp. 1347–1355).

  • Liu, B., Liu, J., Ghavamzadeh, M., Mahadevan, S., & Petrik, M. (2015). Finite-sample analysis of proximal gradient TD algorithms. In Proceedings of the international conference on uncertainty in artificial intelligence (UAI) (pp. 504–513). Citeseer.

  • Ma, S., Chen, Z., Zhou, Y., & Zou, S. (2021). Greedy-GQ with variance reduction: Finite-time analysis and improved complexity. In International conference on learning representations (ICLR).

  • Ma, S., Zhou, Y., & Zhou, S. (2020). Variance-reduced off-policy TDC learning: Non-asymptotic convergence analysis. arXiv preprint arXiv:2010.13272

  • Maei, H. R. (2011). Gradient temporal-difference learning algorithms. Thesis, University of Alberta.

  • Maei, H. R., Szepesvári, C., Bhatnagar, S., & Sutton, R. S. (2010). Toward off-policy learning control with function approximation. In Proceedings of the international conference on machine learning (ICML).

  • Mokkadem, A., & Pelletier, M. (2006). Convergence rate and averaging of nonlinear two-time-scale stochastic approximation algorithms. The Annals of Applied Probability, 16(3), 1671–1702.

  • Srikant, R., & Ying, L. (2019). Finite-time error bounds for linear stochastic approximation and TD learning. In Proceedings of the annual conference on learning theory (COLT).

  • Sun, J., Wang, G., Giannakis, G.B., Yang, Q., & Yang, Z. (2020). Finite-time analysis of decentralized temporal-difference learning with linear function approximation. In Proceedings of the international conference on artificial intelligence and statistics (AISTATS) (pp. 4485–4495). PMLR.

  • Sutton, R. S., Maei, H. R., & Szepesvári, C. (2009a). A convergent \({O} (n)\) temporal-difference algorithm for off-policy learning with linear function approximation. In Proceedings of the advances in neural information processing systems (NIPS) (pp. 1609–1616).

  • Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvári, C., & Wiewiora, E. (2009b). Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the international conference on machine learning (ICML) (pp. 993–1000).

  • Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction (2nd ed.). The MIT Press.

  • Wang, Y., & Zou, S. (2020). Finite-sample analysis of Greedy-GQ with linear function approximation under Markovian noise. In Proceedings of the uncertainty in artificial intelligence (UAI) (pp. 11–20). PMLR.

  • Wang, Y., Chen, W., Liu, Y., Ma, Z.-M., & Liu, T.-Y. (2017). Finite sample analysis of the GTD policy evaluation algorithms in Markov setting. In Proceedings of the advances in neural information processing systems (NIPS) (pp. 5504–5513).

  • Watkins, C. J., & Dayan, P. (1992). Q-learning. Machine learning, 8(3–4), 279–292.

  • Xu, P., & Gu, Q. (2020). A finite-time analysis of Q-learning with neural network function approximation. In Proceedings of the international conference on machine learning (ICML) (pp. 10555–10565).

  • Xu, T., & Liang, Y. (2021). Sample complexity bounds for two timescale value-based reinforcement learning algorithms. In International conference on artificial intelligence and statistics (pp. 811–819). PMLR.

  • Xu, T., Zou, S., & Liang, Y. (2019). Two time-scale off-policy TD learning: Non-asymptotic analysis over Markovian samples. In Proceedings of the advances in neural information processing systems (NeurIPS) (pp. 10633–10643).

  • Yu, H. (2017). On convergence of some gradient-based temporal-differences algorithms for off-policy learning. arXiv preprint arXiv:1712.09652

  • Zou, S., Xu, T., & Liang, Y. (2019). Finite-sample analysis for SARSA with linear function approximation. In Proceedings of the advances in neural information processing systems (NeurIPS) (pp. 8665–8675).

Funding

S. Zou and Y. Wang were supported by the National Science Foundation (NSF) CAREER Award under Grant ECCS-2337375 and by NSF under Grants CCF-2106560 and CCF-2007783. Yi Zhou's work was supported in part by the U.S. National Science Foundation under Grant CCF-2106216.

Author information

Authors and Affiliations

Authors

Contributions

Conceptualization: Shaofeng Zou, Yi Zhou and Yue Wang; Methodology: Shaofeng Zou, Yi Zhou and Yue Wang; Formal analysis and investigation: Shaofeng Zou and Yue Wang; Writing - original draft preparation: Yue Wang; Writing - review and editing: Shaofeng Zou and Yi Zhou; Funding acquisition: Shaofeng Zou and Yi Zhou; Supervision: Shaofeng Zou and Yi Zhou.

Corresponding author

Correspondence to Shaofeng Zou.

Ethics declarations

Conflict of interest

The authors have no conflicts of interest to declare that are relevant to the content of this article.

Consent for publication

Not applicable.

Ethics approval

Not applicable.

Consent to participate

Not applicable.

Code availability

The code is written in Python and will be open-sourced after the publication of this paper.

Additional information

Editor: Tong Zhang.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1: Analysis for vanilla Greedy-GQ

In the following proof, \(\Vert a\Vert\) denotes the \(\ell _2\) norm if a is a vector, and \(\Vert A\Vert\) denotes the operator norm if A is a matrix. For technical convenience, we impose a projection step on the updates of both \(\theta\) and \(\omega\) with radius R: for any t, \(\Vert \theta _t\Vert \le R\) and \(\Vert \omega _t\Vert \le R\). The projection step is necessary to guarantee the stability of the algorithm. The approach developed in Srikant and Ying (2019), which bounds the parameter using its retrospective copy several time steps back, is not applicable here due to the non-linear structure of Greedy-GQ.
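To make the projection step concrete, here is a minimal sketch (not the authors' released code; `project_ball` is an illustrative helper name) of the Euclidean projection onto the radius-\(R\) ball applied after each update:

```python
import numpy as np

def project_ball(x, R):
    """Euclidean projection of x onto the l2 ball {v : ||v||_2 <= R}."""
    norm = np.linalg.norm(x)
    return x if norm <= R else (R / norm) * x

# After each Greedy-GQ iteration, both iterates are projected, so that
# ||theta_t|| <= R and ||omega_t|| <= R hold for all t:
#   theta = project_ball(theta + alpha * update_theta, R)
#   omega = project_ball(omega + beta * update_omega, R)
```

Since the \(\ell _2\) ball is convex, this projection is non-expansive, which is what makes the subsequent error bounds compatible with the projected iterates.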

We first show that the objective function \(J(\theta )\) is K-smooth for \(\theta \in \{\theta : \Vert \theta \Vert \le R\}\).

Lemma 2

\(J(\theta )\) is K-smooth:

$$\begin{aligned} \Vert \nabla J(\theta _1)-\nabla J(\theta _2)\Vert \le K\vert \vert \theta _1-\theta _2\vert \vert , \forall \Vert \theta _1\Vert ,\Vert \theta _2 \Vert \le R, \end{aligned}$$

where \(K=2\gamma {\lambda ^{-1}}\big ((k_1\vert \mathcal {A}\vert R+1)(1+\gamma +\gamma Rk_1\vert \mathcal {A}\vert )+\vert \mathcal {A}\vert (r_{\max }+R+\gamma R)( 2k_1+ k_2R) \big ).\)

Proof

It follows that

$$\begin{aligned}&\nabla J\left( \theta _1\right) -\nabla J\left( \theta _2\right) \\&=2\nabla \left( \mathbb {E}_{\mu }\left[ \delta _{S,A,S'}\left( \theta _1\right) \phi _{S,A}\right] \right) C^{-1}\mathbb {E}_{\mu }\left[ \delta _{S,A,S'}\left( \theta _1\right) \phi _{S,A}\right] \\&\quad -2\nabla \left( \mathbb {E}_{\mu }\left[ \delta _{S,A,S'}\left( \theta _2\right) \phi _{S,A}\right] \right) C^{-1}\mathbb {E}_{\mu }\left[ \delta _{S,A,S'}\left( \theta _2\right) \phi _{S,A}\right] \\&=2\nabla \left( \mathbb {E}_{\mu }\left[ \delta _{S,A,S'}\left( \theta _1\right) \phi _{S,A}\right] \right) C^{-1}\mathbb {E}_{\mu }\left[ \delta _{S,A,S'}\left( \theta _1\right) \phi _{S,A}\right] \\&\quad -2\nabla \left( \mathbb {E}_{\mu }\left[ \delta _{S,A,S'}\left( \theta _1\right) \phi _{S,A}\right] \right) C^{-1}\mathbb {E}_{\mu }\left[ \delta _{S,A,S'}\left( \theta _2\right) \phi _{S,A}\right] \\&\quad +2\nabla \left( \mathbb {E}_{\mu }\left[ \delta _{S,A,S'}\left( \theta _1\right) \phi _{S,A}\right] \right) C^{-1}\mathbb {E}_{\mu }\left[ \delta _{S,A,S'}\left( \theta _2\right) \phi _{S,A}\right] \\&\quad -2\nabla \left( \mathbb {E}_{\mu }\left[ \delta _{S,A,S'}\left( \theta _2\right) \phi _{S,A}\right] \right) C^{-1}\mathbb {E}_{\mu }\left[ \delta _{S,A,S'}\left( \theta _2\right) \phi _{S,A}\right] . \end{aligned}$$

Since \(C^{-1}\) is positive definite, it suffices to show that both \(\nabla \left( \mathbb {E}_{\mu }\left[ \delta _{S,A,S'}\left( \theta \right) \phi _{S,A}\right] \right)\) and \(\mathbb {E}_{\mu }\left[ \delta _{S,A,S'}\left( \theta \right) \phi _{S,A}\right]\) are Lipschitz in \(\theta\) and bounded.

It is straightforward to see that

$$\begin{aligned} \Vert \mathbb {E}_{\mu }\left[ \delta _{S,A,S'}\left( \theta \right) \phi _{S,A}\right] \Vert \le r_{\max }+(1+\gamma ) R, \end{aligned}$$
(13)

and \(\Vert \nabla \mathbb {E}_{\mu }\left[ \delta _{S,A,S'}\left( \theta \right) \phi _{S,A}\right] \Vert \le 1+\gamma (k_1\vert \mathcal {A}\vert R+1).\) We further have that

$$\begin{aligned}&\Vert \nabla \left( \mathbb {E}_{\mu }\left[ \delta _{S,A,S'}\left( \theta _1\right) \phi _{S,A}\right] \right) -\nabla \left( \mathbb {E}_{\mu }\left[ \delta _{S,A,S'}\left( \theta _2\right) \phi _{S,A}\right] \right) \Vert \nonumber \\&= \gamma \bigg \Vert \mathbb {E}_{\mu }\bigg [\sum _{a\in \mathcal {A}}\bigg ( \nabla \left( \pi _{\theta _1}\left( a\vert S'\right) \right) \theta _1^\top \phi _{S',a}-\nabla \left( \pi _{\theta _2}\left( a\vert S'\right) \right) \nonumber \\&\quad \cdot \theta _2^\top \phi _{S',a}+\pi _{\theta _1}\left( a\vert S'\right) \phi _{S',a}-\pi _{\theta _2}\left( a\vert S'\right) \phi _{S',a}\bigg )\phi _{S,A}^\top \bigg ]\bigg \Vert \nonumber \\&= \gamma \bigg \Vert \mathbb {E}_{\mu }\bigg [\sum _{a\in \mathcal {A}}\bigg ( \nabla \left( \pi _{\theta _1}\left( a\vert S'\right) -\pi _{\theta _2}\left( a\vert S'\right) \right) \theta _1^\top \phi _{S',a}\nonumber \\&\quad +\nabla \left( \pi _{\theta _2}\left( a\vert S'\right) \right) (\theta _1-\theta _2)^\top \phi _{S',a}\bigg )\phi _{S,A}^\top \bigg ]\bigg \Vert \end{aligned}$$
(14)
$$\begin{aligned}&\quad +\gamma \big \Vert \mathbb {E}_{\mu }\big [\big (\sum _{a\in \mathcal {A}} \big ( \pi _{\theta _1}\left( a\vert S'\right) \phi _{S',a}-\pi _{\theta _2}\left( a\vert S'\right) \phi _{S',a}\big )\big )\phi _{S,A}^\top \big ]\big \Vert \nonumber \\&\le \gamma \vert \mathcal {A}\vert \left( 2k_1+ k_2 R \right) \Vert \theta _1-\theta _2\Vert , \end{aligned}$$
(15)

which is from Assumption 3.

Following similar steps, we can also show that \(\mathbb {E}_{\mu }\left[ \delta _{S,A,S'}\left( \theta \right) \phi _{S,A}\right]\) is Lipschitz in \(\theta\):

$$\begin{aligned} \Vert \mathbb {E}_{\mu }\left[ \delta _{S,A,S'}\left( \theta _1\right) \phi _{S,A}\right] -\mathbb {E}_{\mu }\left[ \delta _{S,A,S'}\left( \theta _2\right) \phi _{S,A}\right] \Vert \nonumber \\ \le \left( \gamma (\vert \mathcal {A}\vert k_1R+1)+1\right) \Vert \theta _1-\theta _2\Vert . \end{aligned}$$
(16)

Combining (14) and (16) concludes the proof. \(\square\)

Recall the definition of \(G_{t+1}(\theta , \omega )\) in Sect. 3.3. The following lemma shows that \(G_{t+1}(\theta , \omega )\) is Lipschitz in \(\omega\), and \(G_{t+1}(\theta , \omega ^*(\theta ))\) is Lipschitz in \(\theta\).

Lemma 3

For any \(\omega _1,\omega _2\), \(\Vert G_{t+1}(\theta ,\omega _1)-G_{t+1}(\theta ,\omega _2)\Vert \le \gamma (\vert \mathcal {A}\vert Rk_1+1)\Vert \omega _1-\omega _2\Vert ,\) and for any \(\theta _1,\theta _2\in \{\theta :\Vert \theta \Vert \le R\}\),

$$\begin{aligned} \hspace{-0.2cm}\Vert G_{t+1}(\theta _1,\omega ^*(\theta _1))-G_{t+1}(\theta _2,\omega ^*(\theta _2))\Vert \le k_3\Vert \theta _1-\theta _2 \Vert , \end{aligned}$$
(17)

where \(k_3=(1+\gamma +\gamma R\vert \mathcal {A}\vert k_1+\gamma \frac{1}{\lambda }\vert \mathcal {A}\vert (2k_1+k_2R)(r_{\max }+\gamma R+R)+\gamma \frac{1}{\lambda }(1+\vert \mathcal {A}\vert Rk_1)(1+\gamma +\gamma R\vert \mathcal {A}\vert k_1)).\)

Proof

Under Assumption 3, it can be easily shown that

$$\begin{aligned} \Vert \hat{\phi }_{t+1}(\theta ) \Vert \le \vert \mathcal {A}\vert Rk_1+1. \end{aligned}$$
(18)

It then follows that for any \(\omega _1\) and \(\omega _2\),

$$\begin{aligned}&\Vert G_{t+1}(\theta ,\omega _1)-G_{t+1}(\theta ,\omega _2)\Vert \le \gamma (\vert \mathcal {A}\vert Rk_1+1)\Vert \omega _1-\omega _2\Vert . \end{aligned}$$

To show that \(G_{t+1}(\theta ,\omega ^*(\theta ))\) is Lipschitz in \(\theta\), we first show that \(\hat{\phi }_{t+1}(\theta )\) is Lipschitz in \(\theta\) following similar steps as those in (14):

$$\begin{aligned}&\Vert \hat{\phi }_{t+1}(\theta _1)-\hat{\phi }_{t+1}(\theta _2)\Vert \le \vert \mathcal {A}\vert (2k_1 +k_2R)\Vert \theta _1-\theta _2 \Vert . \end{aligned}$$
(19)

We have that

$$\begin{aligned}&\Vert G_{t+1}(\theta _1,\omega ^*(\theta _1))-G_{t+1}(\theta _2,\omega ^*(\theta _2)) \Vert \nonumber \\&\le \vert \delta _{t+1}(\theta _1)-\delta _{t+1}(\theta _2) \vert +\gamma \Vert (\omega ^*(\theta _2))^\top \phi _t\hat{\phi }_{t+1}(\theta _2)\nonumber \\&\quad -(\omega ^*(\theta _1))^\top \phi _t\hat{\phi }_{t+1}(\theta _1) \Vert \nonumber \\&\overset{(a)}{\le }\gamma \Vert (\omega ^*(\theta _2))^\top \phi _t\hat{\phi }_{t+1}(\theta _2)-(\omega ^*(\theta _1))^\top \phi _t\hat{\phi }_{t+1}(\theta _1)\nonumber \\&\quad -(\omega ^*(\theta _1))^\top \phi _t\hat{\phi }_{t+1}(\theta _2)+(\omega ^*(\theta _1))^\top \phi _t\hat{\phi }_{t+1}(\theta _2) \Vert \nonumber \\&\quad +(1+\gamma +\gamma R\vert \mathcal {A}\vert k_1)\Vert \theta _1-\theta _2 \Vert \nonumber \\&\le \gamma (1+\vert \mathcal {A}\vert Rk_1)\Vert \omega ^*(\theta _2)-\omega ^*(\theta _1) \Vert \nonumber \\&\quad +\gamma \Vert \omega ^*(\theta _1) \Vert \Vert \hat{\phi }_{t+1}(\theta _1)-\hat{\phi }_{t+1}(\theta _2)\ \Vert \nonumber \\&\quad +\gamma (1+R\vert \mathcal {A}\vert k_1)\Vert \theta _1-\theta _2 \Vert +\Vert \theta _1-\theta _2 \Vert \nonumber \\&\overset{(b)}{\le }\ \bigg (\left( 1+\frac{\gamma }{\lambda }(1+\vert \mathcal {A}\vert Rk_1)\right) (1+\gamma +\gamma R\vert \mathcal {A}\vert k_1)\nonumber \\&\quad +\frac{\gamma }{\lambda }\vert \mathcal {A}\vert (2k_1+k_2R)(r_{\max }+\gamma R+R)\bigg ) \Vert \theta _1-\theta _2 \Vert , \end{aligned}$$
(20)

where (a) can be shown following steps similar to those in (16), while (b) can be shown using

$$\begin{aligned} \Vert \omega ^*(\theta _2)-\omega ^*(\theta _1)\Vert \le \frac{(1+\gamma +\gamma R\vert \mathcal {A}\vert k_1)}{\lambda }\Vert \theta _1-\theta _2\Vert , \end{aligned}$$
(21)

and \(\Vert \omega ^*(\theta )\Vert \le \frac{1}{\lambda }(r_{\max }+\gamma R+R).\) \(\square\)

Since \(J(\theta )\) is K-smooth, by Taylor expansion we have that

$$\begin{aligned}&J(\theta _{t+1}) \le J(\theta _t) +\left\langle \nabla J(\theta _t), \theta _{t+1}-\theta _t\right\rangle + \frac{K}{2} \Vert \theta _{t+1}-\theta _t\Vert ^2\nonumber \\&=J(\theta _t)-\alpha \big \langle \nabla J(\theta _t),-G_{t+1}(\theta _t, \omega _t)+G_{t+1}(\theta _t, \omega ^*(\theta _t)) \big \rangle \nonumber \\&\quad +\frac{\alpha }{2} \langle \nabla J(\theta _t), {\nabla J(\theta _t)} +2G_{t+1}(\theta _t, \omega ^*(\theta _t)) \rangle \nonumber \\&\quad -\frac{\alpha }{2}\vert \vert \nabla J(\theta _t)\vert \vert ^2+\frac{K}{2} \alpha ^2\vert \vert G_{t+1}(\theta _t,\omega _t)\vert \vert ^2\nonumber \\&\le J(\theta _t) +\alpha \gamma \Vert \nabla J(\theta _t) \Vert (1+\vert \mathcal {A}\vert Rk_1)\Vert \omega ^*(\theta _t)-\omega _t \Vert \nonumber \\&\quad +\frac{\alpha }{2} \langle \nabla J(\theta _t), {\nabla J(\theta _t)}+2G_{t+1}(\theta _t, \omega ^*(\theta _t)) \rangle \nonumber \\&\quad -\frac{\alpha }{2}\vert \vert \nabla J(\theta _t)\vert \vert ^2+\frac{K}{2} \alpha ^2\vert \vert G_{t+1}(\theta _t,\omega _t)\vert \vert ^2, \end{aligned}$$
(22)

where the last inequality follows from Lemma 3.

Re-arranging the terms in (22), summing up w.r.t. t from 0 to \(T-1\), taking the expectation and applying Cauchy’s inequality implies that

$$\begin{aligned}&\sum ^{T-1}_{t=0} \frac{\alpha }{2} \mathbb {E}[\Vert \nabla J(\theta _t) \Vert ^2]\nonumber \\&\le J(\theta _0)-J(\theta _{T})+ \gamma \alpha (1+\vert \mathcal {A}\vert Rk_1)\sqrt{\sum ^{T-1}_{t=0} \mathbb {E}[\Vert \nabla J(\theta _t)\Vert ^2]}\nonumber \\&\quad \cdot \sqrt{\sum ^{T-1}_{t=0}\mathbb {E}[\Vert \omega ^*(\theta _t)-\omega _t\Vert ^2]} +\frac{K}{2}\sum ^{T-1}_{t=0}\alpha ^2 \mathbb {E}[\Vert G_{t+1}(\theta _t,\omega _t)\Vert ^2]\nonumber \\&\quad +\sum ^{T-1}_{t=0}\frac{\alpha }{2} \mathbb {E}\left[ \left\langle \nabla J(\theta _t),{\nabla J(\theta _t)}+2G_{t+1}(\theta _t, \omega ^*(\theta _t)) \right\rangle \right] . \end{aligned}$$
(23)

We then provide the bounds on \(\mathbb {E}[\Vert \omega ^*(\theta _t)-\omega _t\Vert ^2]\) and \(\mathbb {E}\left[ \left\langle \nabla J(\theta _t),{\nabla J(\theta _t)}/{2}+G_{t+1}(\theta _t, \omega ^*(\theta _t)) \right\rangle \right]\), which we refer to as the “tracking error” and the “stochastic bias”, respectively. We define \(\zeta (\theta , O_t)\triangleq \langle \nabla J(\theta ), \frac{\nabla J(\theta )}{2}+G_{t+1}(\theta , \omega ^*(\theta )) \rangle\); then \(\mathbb {E}_{\mu }[\zeta (\theta , O_t)]=0\) for any fixed \(\theta\) when \(O_t\sim \mu\) (which does not hold under the Markovian setting). In the following lemma, we provide an upper bound on \(\mathbb {E}[\zeta (\theta , O_t)]\).

Lemma 4

Stochastic Bias. Let \(\tau _{\alpha }\triangleq \min \left\{ k: m\rho ^k \le \alpha \right\}\). If \(t \le \tau _{\alpha }\), then \(\mathbb {E}[\zeta (\theta _t,O_t)] \le k_{\zeta },\) and if \(t > \tau _{\alpha }\), then

$$\begin{aligned} \mathbb {E}[\zeta (\theta _t, O_t)]\le k_{\zeta }\alpha +c_{\zeta }(c_{f_1}+c_{g_1})\tau _{\alpha }\alpha , \end{aligned}$$
(24)

where \(c_{\zeta }=2\gamma (1+k_1\vert \mathcal {A}\vert R)\frac{1}{\lambda }(r_{\max }+R+\gamma R)(\frac{K}{2}+k_3)+K(r_{\max }+R+\gamma R)( \frac{2\gamma }{\lambda }(1+k_1\vert \mathcal {A}\vert R)+1)\) and \(k_{\zeta }=4\gamma (1+k_1R\vert \mathcal {A}\vert )\frac{1}{\lambda }(r_{\max }+R+\gamma R)^2(2\gamma (1+k_1\vert \mathcal {A}\vert R)\frac{1}{\lambda }+1)\).

Proof

For any \(\theta _1\) and \(\theta _2\), it follows that

$$\begin{aligned}&\vert \zeta (\theta _1,O_t)-\zeta (\theta _2,O_t)\vert \nonumber \\&=\frac{1}{2}\vert \left\langle \nabla J(\theta _1), {\nabla J(\theta _1)} +2G_{t+1}(\theta _1, \omega ^*(\theta _1)) \right\rangle \nonumber \\&\quad -\left\langle \nabla J(\theta _1), {\nabla J(\theta _2)}+2G_{t+1}(\theta _2, \omega ^*(\theta _2)) \right\rangle \nonumber \\&\quad +\left\langle \nabla J(\theta _1)-\nabla J(\theta _2), {\nabla J(\theta _2)} +2G_{t+1}(\theta _2, \omega ^*(\theta _2)) \right\rangle \vert . \end{aligned}$$
(25)

By Lemma 2, \(\zeta (\theta ,O_t)\) is also Lipschitz in \(\theta\): \(\vert \zeta (\theta _1,O_t)-\zeta (\theta _2,O_t)\vert \le c_{\zeta } \Vert \theta _1-\theta _2 \Vert ,\) where \(c_{\zeta }=2\gamma (1+k_1\vert \mathcal {A}\vert R)\frac{1}{\lambda }(r_{\max }+R+\gamma R)(\frac{K}{2}+k_3) +K(r_{\max }+R+\gamma R)(\gamma \frac{1}{\lambda }(1+k_1\vert \mathcal {A}\vert R)+1 +\gamma \frac{1}{\lambda }(1+Rk_1\vert \mathcal {A}\vert )).\) Thus from (25), it follows that for any \(\tau \ge 0\),

$$\begin{aligned}&\vert \zeta (\theta _t,O_t)-\zeta (\theta _{t-\tau },O_t)\vert \le c_{\zeta } \Vert \theta _t-\theta _{t-\tau } \Vert \nonumber \\&\le c_\zeta \sum ^{t-1}_{k=t-\tau }\alpha \Vert G_{k+1}(\theta _k,\omega _k)\Vert \le c_{\zeta }(c_{f_1}+c_{g_1})\sum ^{t-1}_{k=t-\tau }\alpha , \end{aligned}$$
(26)

where \(c_{f_1}=r_{\max }+(1+\gamma )R+\frac{\gamma }{\lambda }(r_{\max }+(1 + \gamma )R)(1+R\vert \mathcal {A}\vert k_1)\), \(c_{g_1}=2\gamma R(1+R\vert \mathcal {A}\vert k_1)\) and \(\Vert G_{k+1}(\theta _k,\omega _k)\Vert \le c_{f_1}+c_{g_1}\).

We define an independent random variable \({\hat{O}}=({\hat{S}},{\hat{A}},{\hat{R}},{\hat{S}}')\), where \(({\hat{S}},{\hat{A}})\sim \mu\), \({\hat{S}}'\) is the subsequent state and \({\hat{R}}\) is the reward. Then \(\mathbb {E}[\zeta (\theta _{t-\tau },{\hat{O}})]=0\) by the fact that \(\mathbb {E}_{\mu }[{G_{t+1}(\theta ,\omega ^*(\theta ))}]=-\frac{1}{2}\nabla J(\theta )\). Thus for any \(\tau \le t\),

$$\begin{aligned} \mathbb {E}[\zeta (\theta _{t-\tau },O_t)]&\le \vert \mathbb {E}[\zeta (\theta _{t-\tau },O_t)]-\mathbb {E}[\zeta (\theta _{t-\tau },\hat{O})]\vert \le k_{\zeta }m\rho ^{\tau }, \end{aligned}$$

which follows from Assumption 4, and \(k_{\zeta }=4\gamma (1+k_1R\vert \mathcal {A}\vert )\frac{1}{\lambda }(r_{\max }+R+\gamma R)^2(2\gamma (1+k_1\vert \mathcal {A}\vert R)\frac{1}{\lambda }+1)\).

If \(t \le \tau _{\alpha }\), the conclusion follows from the fact that \(\vert \zeta (\theta ,O_t)\vert \le k_{\zeta }\).

If \(t > \tau _{\alpha }\), we choose \(\tau =\tau _{\alpha }\), and then \(\mathbb {E}[\zeta (\theta _t, O_t)]\le \mathbb {E}[\zeta (\theta _{t-\tau _{\alpha }},O_t)]+c_{\zeta }(c_{f_1}+c_{g_1})\sum ^{t-1}_{k=t-\tau _{\alpha }}\alpha \le k_{\zeta }\alpha +c_{\zeta }(c_{f_1}+c_{g_1})\tau _{\alpha }\alpha .\) \(\square\)
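Since \(\tau _{\alpha }=\min \{k: m\rho ^k\le \alpha \}\) is defined through a geometric mixing bound, it admits a closed form. The sketch below (with placeholder values for the mixing constants \(m\) and \(\rho\) of Assumption 4) illustrates that \(\tau _{\alpha }=\mathcal {O}(\log (1/\alpha ))\), which is the source of the \(\log T\) factor in the Markovian bounds:

```python
import math

def tau_alpha(alpha, m=1.0, rho=0.5):
    """Smallest integer k with m * rho**k <= alpha; m and rho are
    (placeholder) mixing constants from Assumption 4, with 0 < rho < 1."""
    if m <= alpha:
        return 0
    # m * rho**k <= alpha  <=>  k >= log(alpha / m) / log(rho)
    return math.ceil(math.log(alpha / m) / math.log(rho))
```

With the step-size choice \(\alpha =\mathcal {O}(T^{-a})\), this gives \(\tau _{\alpha }\alpha =\mathcal {O}(T^{-a}\log T)\), matching the bias term in (24).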

The tracking error can be bounded in the following lemma.

Lemma 5

Tracking error. (proof in Appendix 1.1)

$$\begin{aligned}&\frac{\sum ^{T-1}_{t=0}\mathbb {E}[\Vert z_t\Vert ^2]}{T}\le \frac{2Q_T}{T}+\frac{32}{1-e^{-2\lambda \beta }}\frac{ \Vert R_2\Vert ^2}{\lambda \beta }\\&\quad + \frac{8\alpha ^2}{\lambda ^3\beta } \frac{(1+\gamma +\gamma k_1R\vert \mathcal {A}\vert )^2}{1-e^{-2\lambda \beta }}\frac{\sum ^{T-1}_{t=0}\mathbb {E}[\Vert \nabla J(\theta _t) \Vert ^2]}{T}\\&=\mathcal {O}\Bigg (\frac{1}{T^{1-b}}+\frac{\log T}{T^b}+\frac{1}{T^{2a-2b}} \frac{\sum ^{T-1}_{t=0}\mathbb {E}[\Vert \nabla J(\theta _t) \Vert ^2]}{T} \Bigg ), \end{aligned}$$

where \(Q_T=\frac{\Vert z_0\Vert ^2}{1-e^{-2\lambda \beta }}+\frac{\left( 4Rc_{f_2}\beta +2b_{g_2}\beta +4Rb_{\eta }\alpha \right) }{\left( 1-e^{-2\lambda \beta }\right) ^2} +\frac{\tau _{\beta }+1}{1-e^{-2\lambda \beta }} \left( 4Rc_{f_2}\beta +2b_{g_2}\beta +4Rb_{\eta }\alpha +c_{z}\beta ^2 \right) +c_z\beta ^2 +\frac{T}{1-e^{-2\lambda \beta }} \big (2\beta \left( 4Rc_{f_2}\beta +b_{f_2}\beta \tau _{\beta }\right) + 2\beta \left( b_{g_2}\beta +b'_{g_2}\beta \tau _{\beta }\right) +2\alpha \left( 4Rb_{\eta }\beta +b'_{\eta }\beta \tau _{\beta }\right) \big )\), and \(b_{g_2}, b'_{g_2}, b_{\eta }, b'_{\eta } \text { and } c_z\) are some constants defined in Lemmas 9 and 10.

Now we have the bounds on the stochastic bias and the tracking error.

From (23), we first have that

$$\begin{aligned}&\frac{\sum ^{T-1}_{t=0}\alpha \mathbb {E}[\Vert \nabla J(\theta _t)\Vert ^2]}{2T\alpha }\nonumber \\&\le \frac{1}{T\alpha } \Bigg ( J(\theta _0)-J^*+\gamma \alpha (1+\vert \mathcal {A}\vert Rk_1)\sqrt{\sum ^{T-1}_{t=0}\mathbb {E}[\Vert \nabla J(\theta _t) \Vert ^2]}\nonumber \\&\quad \cdot \sqrt{\sum ^{T-1}_{t=0}\mathbb {E}[\Vert z_t\Vert ^2]}+\sum ^{T-1}_{t=0}\alpha \mathbb {E}[\zeta (\theta _t,O_t)]\nonumber \\&\quad +\sum ^{T-1}_{t=0}K\alpha ^2\left( r_{\max } + R+ \gamma R(2+\vert \mathcal {A}\vert Rk_1) \right) ^2 \Bigg ), \end{aligned}$$
(27)

where \(J^*=\min _{\theta } J(\theta )\) is positive and finite, and the inequality is from \(\Vert G_{t+1}(\theta ,\omega )\Vert \le r_{\max }+\gamma R+R+\gamma R(1+\vert \mathcal {A}\vert Rk_1)\). From Lemma 4, it follows that \(\sum _{t=0}^{T-1} \alpha \mathbb {E}[\zeta (\theta _t,O_t)] \le \sum ^{\tau _{\alpha }}_{t=0}\alpha k_{\zeta } +\sum ^{T-1}_{t=\tau _{\alpha }+1} (k_{\zeta }\alpha ^2+c_{\zeta }(c_{f_1}+c_{g_1})\tau _{\alpha }\alpha ^2).\) Hence, we have that

$$\begin{aligned}&\frac{\sum ^{T-1}_{t=0}\mathbb {E}[\Vert \nabla J(\theta _t)\Vert ^2]}{2T}\\&\le \Omega +\gamma (1+\vert \mathcal {A}\vert Rk_1)\sqrt{\frac{\sum ^{T-1}_{t=0}\mathbb {E}[\Vert \nabla J(\theta _t)\Vert ^2]}{T}}\sqrt{\frac{\sum ^{T-1}_{t=0}\mathbb {E}[\Vert z_t\Vert ^2]}{T}}, \end{aligned}$$

where \(\Omega \triangleq k_{\zeta }\frac{\tau _{\alpha }+1}{T}+c_{\zeta }(c_{f_1}+c_{g_1})\tau _{\alpha }\alpha +k_{\zeta }{\alpha }+\frac{J(\theta _0)-J^*}{T\alpha }+K\alpha \left( r_{\max }+\gamma R+R+\gamma R(1+\vert \mathcal {A}\vert Rk_1) \right) ^2\). We then plug in the tracking error bound in Lemma 5:

$$\begin{aligned}&\frac{\sum ^{T-1}_{t=0}\mathbb {E}[\Vert \nabla J(\theta _t)\Vert ^2]}{2T} \overset{(a)}{\le }\ \Omega +\gamma (1+\vert \mathcal {A}\vert Rk_1)\nonumber \\&\quad \cdot \sqrt{\frac{\sum ^{T-1}_{t=0}\mathbb {E}[\Vert \nabla J(\theta _t)\Vert ^2]}{T}}\Bigg ( \sqrt{\frac{2Q_T}{T}+\frac{32}{1-e^{-2\lambda \beta }}\frac{ \Vert R_2\Vert ^2}{\lambda \beta }}\nonumber \\&\quad +\bigg (\frac{8\alpha ^2}{\beta }\frac{1}{\lambda ^3}(1+\gamma +\gamma k_1R\vert \mathcal {A}\vert )^2\frac{1}{1-e^{-2\lambda \beta }}\nonumber \\&\quad \cdot \frac{\sum ^{T-1}_{t=0}\mathbb {E}[\Vert \nabla J(\theta _t) \Vert ^2]}{T}\bigg )^{0.5} \Bigg )\nonumber \\&=\Omega +\gamma (1+\vert \mathcal {A}\vert Rk_1)\sqrt{\frac{\sum ^{T-1}_{t=0}\mathbb {E}[\Vert \nabla J(\theta _t)\Vert ^2]}{T}}\nonumber \\&\quad \cdot \sqrt{\frac{2Q_T}{T}+\frac{32}{1-e^{-2\lambda \beta }}\frac{ \Vert R_2\Vert ^2}{\lambda \beta }}\nonumber \\&\quad +\gamma (1+\vert \mathcal {A}\vert Rk_1)\sqrt{\frac{8\alpha ^2}{\beta }\frac{1}{\lambda ^3}(1+\gamma +\gamma k_1R\vert \mathcal {A}\vert )^2\frac{1}{1-e^{-2\lambda \beta }}}\nonumber \\&\quad \cdot \frac{\sum ^{T-1}_{t=0}\mathbb {E}[\Vert \nabla J(\theta _t) \Vert ^2]}{T}, \end{aligned}$$
(28)

where (a) is from \(\sqrt{x+y}\le \sqrt{x}+\sqrt{y}\) for any \(x,y \ge 0\). Rearranging the terms, and choosing \(\alpha\) and \(\beta\) such that \(\gamma (1+\vert \mathcal {A}\vert Rk_1)\sqrt{\frac{8\alpha ^2}{\beta (1-e^{-2\lambda \beta })}\frac{1}{\lambda ^3}(1+\gamma +\gamma \vert \mathcal {A}\vert Rk_1)^2}<\frac{1}{4}\), then

$$\begin{aligned}&\frac{\sum ^{T-1}_{t=0}\mathbb {E}[\Vert \nabla J(\theta _t)\Vert ^2]}{T} \le U+V\sqrt{\frac{\sum ^{T-1}_{t=0}\mathbb {E}[\Vert \nabla J(\theta _t)\Vert ^2]}{T}}, \end{aligned}$$

where \(V=4\gamma (1+\vert \mathcal {A}\vert Rk_1)\left( \sqrt{\frac{2Q_T}{T}+\frac{32}{1-e^{-2\lambda \beta }}\frac{\Vert R_2\Vert ^2}{\lambda \beta }}\right)\) and \(U=4\Omega\). Hence, we have that

$$\begin{aligned}&\frac{\sum ^{T-1}_{t=0}\mathbb {E}[\Vert \nabla J(\theta _t)\Vert ^2]}{T}\le \left( \frac{V+\sqrt{V^2+4U}}{2}\right) ^2\nonumber \\&\overset{(a)}{\le }\ V^2+2U\nonumber \\&\le 16\gamma ^2(1+\vert \mathcal {A}\vert Rk_1)^2\Bigg ({\frac{2Q_T}{T}}+{\frac{32}{1-e^{-2\lambda \beta }}\frac{\Vert R_2\Vert ^2}{\lambda \beta }}\Bigg )+8\Omega \nonumber \\&=\mathcal {O}\left( \frac{1}{T^{1-a}}+\frac{\log T}{T^a}+\frac{1}{T^{1-b}} +\frac{\log T}{T^b}\right) , \end{aligned}$$
(29)

where (a) is from \((x+y)^2\le 2x^2+2y^2\) for any \(x,y \ge 0\), and the last step is due to the fact that \(\alpha =\mathcal {O}\left( T^{-a} \right)\), \(\beta =\mathcal {O}\left( T^{-b}\right)\), \(1-e^{-2\lambda \beta }=\mathcal {O}\left( T^{-b}\right)\), \(\frac{Q_T}{T}=\mathcal {O}\left( \frac{1}{T^{1-b}}+\frac{\log T}{T^b}\right)\), and \(\frac{\Vert R_2\Vert ^2}{\beta }=\mathcal {O}\left( \frac{\alpha ^4}{\beta ^2}\right) =\mathcal {O}(T^{-2a})\), which follows from \(a\ge b \ge 0\). This completes the proof of Theorem 1.
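The final step above solves the quadratic inequality \(x \le U + V\sqrt{x}\) for \(x=\frac{1}{T}\sum ^{T-1}_{t=0}\mathbb {E}[\Vert \nabla J(\theta _t)\Vert ^2]\). As a quick numerical sanity check of this step (with arbitrary illustrative values of \(U\) and \(V\)):

```python
import math

def quadratic_bound(U, V):
    """If x <= U + V * sqrt(x) with U, V >= 0, then
    sqrt(x) <= (V + sqrt(V**2 + 4*U)) / 2, hence x <= V**2 + 2*U
    (using (a + b)**2 <= 2*a**2 + 2*b**2)."""
    return ((V + math.sqrt(V ** 2 + 4 * U)) / 2) ** 2

# The largest feasible x satisfies the inequality with equality,
# and the cruder bound V**2 + 2*U dominates it.
x_max = quadratic_bound(1.0, 1.0)
```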

1.1 Appendix 1.1: Proof of Lemma 5

Recall that \(z_t=\omega _t-\omega ^*(\theta _t)\), then

$$\begin{aligned} z_{t+1}&=z_t+\beta (f_2(\theta _t,O_t)+g_2(z_t,O_t))+\omega ^*(\theta _t)-\omega ^*(\theta _{t+1}), \nonumber \\ \theta _{t+1}&=\theta _t+\alpha (f_1(\theta _t,O_t)+g_1(\theta _t,z_t,O_t)), \end{aligned}$$
(30)

where \(f_1(\theta _t, O_t) \triangleq \delta _{t+1}(\theta _t)\phi _t-\gamma \phi _t^\top \omega ^*(\theta _t)\hat{\phi }_{t+1}(\theta _t),\) \(g_1(\theta _t, z_t, O_t) \triangleq -\gamma \phi _t^\top z_t\hat{\phi }_{t+1}(\theta _t),\) \(f_2(\theta _t,O_t) \triangleq (\delta _{t+1}(\theta _t)-\phi _t^\top \omega ^*(\theta _t))\phi _t,\) and \(g_2(z_t,O_t) \triangleq -\phi _t^\top z_t\phi _t.\) We then develop upper bounds on functions \(f_1,g_1,f_2,g_2\) as follows.

Lemma 6

For \(\Vert \theta \Vert \le R\) and \(\Vert z\Vert \le 2R\), we have \(\Vert f_1(\theta ,O_t)\Vert \le c_{f_1},\) \(\Vert g_1(\theta ,z,O_t)\Vert \le c_{g_1},\) \(\Vert f_2(\theta ,O_t)\Vert \le c_{f_2}\) and \(\Vert g_2(z,O_t)\Vert \le c_{g_2}\), where \(c_{f_2}=r_{\max }+(1+\gamma )R+\frac{1}{\lambda }(r_{\max }+(1 + \gamma )R)\), and \(c_{g_2}=2R\).

Proof

This lemma follows from (13), (18) and (21). \(\square\)

We then decompose the tracking error as follows

$$\begin{aligned} \vert \vert z_{t+1}\vert \vert ^2&=\vert \vert z_t\vert \vert ^2+2\beta \langle z_t, f_2(\theta _t,O_t)\rangle +2\beta \langle z_t,g_2(z_t,O_t)\rangle \nonumber \\&\quad +2\langle z_t, \omega ^*(\theta _t)-\omega ^*(\theta _{t+1})\rangle \nonumber \\&\quad +\vert \vert \beta f_2(\theta _t,O_t)+\beta g_2(z_t,O_t)+\omega ^*(\theta _t)-\omega ^*(\theta _{t+1})\vert \vert ^2\nonumber \\&\le \vert \vert z_t\vert \vert ^2+2\beta \langle z_t, f_2(\theta _t,O_t)\rangle +2\beta \langle z_t,\bar{g}_2(z_t)\rangle \nonumber \\&\quad +2\langle z_t, \omega ^*(\theta _t)-\omega ^*(\theta _{t+1})\rangle +2\beta \langle z_t,g_2(z_t,O_t)-\bar{g}_2(z_t)\rangle \nonumber \\&\quad +3\beta ^2c_{f_2}^2+3\beta ^2c_{g_2}^2\nonumber \\&\quad + {6}(1+\gamma +\gamma R\vert \mathcal {A}\vert k_1)^2\alpha ^2 (c_{f_1}^2+c_{g_1}^2)/{\lambda ^2}, \end{aligned}$$
(31)

where \(\bar{g}_2(z)\triangleq -Cz\), and the inequality follows from Lemmas 6 and 3.

Define \(\zeta _{f_2}(\theta ,z,O_t)\triangleq \langle z, f_2(\theta ,O_t) \rangle\), and \(\zeta _{g_2}(z,O_t)\triangleq \langle z,g_2(z,O_t)-\bar{g}_2(z)\rangle\). We then characterize the bounds on and the Lipschitz smoothness of \(\zeta _{f_2}\) and \(\zeta _{g_2}\).

Lemma 7

For any \(\theta ,\theta _1,\theta _2 \in \{\theta :\Vert \theta \Vert \le R\}\) and any \(z,z_1,z_2\in \{z:\Vert z\Vert \le 2R\}\), 1) \(\vert \zeta _{f_2}(\theta ,z,O_t) \vert \le 2Rc_{f_2}\); 2) \(\vert \zeta _{f_2}(\theta _1,z_1,O_t)-\zeta _{f_2}(\theta _2,z_2,O_t) \vert \le k_{f_2}\Vert \theta _1-\theta _2 \Vert +c_{f_2}\Vert z_1-z_2\Vert\), where \(k_{f_2}=2R(1+\gamma +\gamma Rk_1\vert \mathcal {A}\vert )(1+\frac{1}{\lambda })\); 3) \(\vert \zeta _{g_2}(z,O_t) \vert \le 8R^2\); and 4) \(\vert \zeta _{g_2}(z_1,O_t)-\zeta _{g_2}(z_2,O_t) \vert \le 8R\Vert z_1-z_2\Vert\).

Proof

1) and 3) follow directly from the definition and Lemma 6. For 2), it can be shown that

$$\begin{aligned}&\vert \zeta _{f_2}(\theta _1,z_1,O_t)-\zeta _{f_2}(\theta _2,z_2,O_t)\vert \nonumber \\&\le \vert \langle z_1, f_2(\theta _1,O_t) \rangle -\langle z_1, f_2(\theta _2,O_t)\rangle \vert \nonumber \\&\quad +\vert \langle z_1, f_2(\theta _2,O_t)\rangle -\langle z_2, f_2(\theta _2,O_t) \rangle \vert \nonumber \\&\le 2R \Vert f_2(\theta _1,O_t)-f_2(\theta _2,O_t)\Vert +\Vert f_2(\theta _2,O_t) \Vert \Vert z_1-z_2 \Vert \nonumber \\&\le 2R(\vert \delta _{t+1}(\theta _1)-\delta _{t+1}(\theta _2)\vert +\Vert \omega ^*(\theta _1)-\omega ^*(\theta _2) \Vert )\nonumber \\&\quad +c_{f_2}\Vert z_1-z_2 \Vert \nonumber \\&{\le } k_{f_2}\Vert \theta _1-\theta _2\Vert +c_{f_2}\Vert z_1-z_2\Vert , \end{aligned}$$
(32)

where the last inequality follows from the fact that both \(\delta _{t+1}(\theta )\) and \(\omega ^*(\theta )\) are Lipschitz in \(\theta\).

To prove 4), we have that

$$\begin{aligned}&\vert \zeta _{g_2}(z_1,O_t)-\zeta _{g_2}(z_2,O_t)\vert \nonumber \\&=\vert \langle z_1, -\phi _t^\top z_1\phi _t+\mathbb {E}[\phi _t^\top z_1\phi _t]\rangle \nonumber \\&\quad -\langle z_1, -\phi _t^\top z_2\phi _t+\mathbb {E}[\phi _t^\top z_2\phi _t]\rangle +\langle z_1, -\phi _t^\top z_2\phi _t\nonumber \\&\quad +\mathbb {E}[\phi _t^\top z_2\phi _t]\rangle -\langle z_2, -\phi _t^\top z_2\phi _t+\mathbb {E}[\phi _t^\top z_2\phi _t]\rangle \vert \nonumber \\&\le 8R\Vert z_1-z_2\Vert . \end{aligned}$$
(33)

\(\square\)

Now we are ready to bound the tracking error. Since \(\langle z_t,\bar{g}_2(z_t)\rangle =-z_t^\top C z_t\le -\lambda \Vert z_t\Vert ^2\), (31) can be bounded as follows

$$\begin{aligned} \vert \vert z_{t+1}\vert \vert ^2&\le (1-2\beta \lambda )\Vert z_t\Vert ^2+2\beta \zeta _{f_2}(\theta _t,z_t,O_t)\nonumber \\&\quad +2\beta \zeta _{g_2}(z_t,O_t)+2\langle z_t,\omega ^*(\theta _t)-\omega ^*(\theta _{t+1})\rangle +3\beta ^2c_{f_2}^2\nonumber \\&\quad +3\beta ^2c_{g_2}^2+ \frac{6}{\lambda ^2}(1+\gamma +\gamma R\vert \mathcal {A}\vert k_1)^2\alpha ^2 (c_{f_1}^2+c_{g_1}^2). \end{aligned}$$
(34)

Taking expectation on both sides of (34), applying it recursively and using the fact that \(1-2\beta \lambda \le e^{-2\beta \lambda }\), we obtain

$$\begin{aligned} \mathbb {E}[\vert \vert z_{t+1}\vert \vert ^2]&\le A_t \vert \vert z_0\vert \vert ^2+2\sum _{i=0}^t B_{it}\nonumber \\&\quad +2\sum _{i=0}^t C_{it}+2\sum _{i=0}^t D_{it}+c_z\sum _{i=0}^t E_{it}, \end{aligned}$$
(35)

where

$$\begin{aligned} A_t&=e^{-2\lambda \sum _{i=0}^t \beta }, \nonumber \\ B_{it}&=e^{-2\lambda \sum _{k=i+1}^t \beta } \beta \mathbb {E}[\zeta _{f_2}(\theta _i,z_i,O_i)], \nonumber \\ C_{it}&=e^{-2\lambda \sum _{k=i+1}^t \beta } \beta \mathbb {E}[\zeta _{g_2}(z_i,O_i)],\nonumber \\ D_{it}&=e^{-2\lambda \sum _{k=i+1}^t \beta } \mathbb {E}[\langle z_i,\omega ^*(\theta _i)-\omega ^*(\theta _{i+1})\rangle ],\nonumber \\ E_{it}&=e^{-2\lambda \sum _{k=i+1}^t \beta } \beta ^2, \end{aligned}$$
(36)

and \(c_z=3\left( c_{f_2}^2+c_{g_2}^2+\frac{2}{\lambda ^2}(1+\gamma +\gamma R\vert \mathcal {A}\vert k_1)^2(c_{f_1}^2+c_{g_1}^2)\right)\).
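As a sanity check on the unrolling step from (34) to (35), the sketch below (with illustrative constants that are not tied to the paper's values, and with the noise and cross terms dropped) iterates a contraction of the form \(x_{t+1}=(1-2\beta \lambda )x_t+c\) and verifies the bound obtained via \(1-2\beta \lambda \le e^{-2\beta \lambda }\):

```python
import math

# Illustrative constants (hypothetical; not the paper's values)
lam, beta, c = 0.5, 0.01, 1e-4
r = math.exp(-2 * lam * beta)    # 1 - 2*beta*lam <= r

x = 4.0        # plays the role of ||z_0||^2
geom = 0.0     # running value of sum_{k=0}^{t} r^k
bound = 4.0
for t in range(300):
    x = (1 - 2 * beta * lam) * x + c          # one noiseless step of (34)
    geom = 1.0 + r * geom
    bound = r ** (t + 1) * 4.0 + c * geom     # unrolled bound, as in (35)
    assert x <= bound + 1e-12
```

The exponential decay of the \(A_t\Vert z_0\Vert ^2\) term and the geometric accumulation of the remaining terms follow the same pattern.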

To bound (35), we provide the following lemmas.

Lemma 8

Define \(\tau _{\beta }=\min \left\{ k: m\rho ^k \le \beta \right\}\). If \(t\le \tau _{\beta }\), then \(\mathbb {E}[\zeta _{f_2}(\theta _t,z_t,O_t)]\le 2Rc_{f_2};\) and if \(t> \tau _{\beta }\), then \(\mathbb {E}[\zeta _{f_2}(\theta _t,z_t,O_t)]\le 4Rc_{f_2}\beta +b_{f_2}\tau _{\beta }\beta ,\) where \(b_{f_2}=( c_{f_2}(c_{f_2}+c_{g_2})+ (k_{f_2}(c_{f_1}+c_{g_1})+c_{f_2}\frac{1}{\lambda }(1+\gamma +\gamma R\vert \mathcal {A}\vert k_1)(c_{f_1}+c_{g_1})))\).

Proof

We first note that

$$\begin{aligned}&\Vert z_{t+1}-z_t\Vert \\&=\Vert \beta (f_2(\theta _t,O_t)+g_2(z_t,O_t))+\omega ^*(\theta _t)-\omega ^*(\theta _{t+1}) \Vert \\&\le (c_{f_2}+c_{g_2})\beta +\frac{1}{\lambda }(1+\gamma +\gamma R\vert \mathcal {A}\vert k_1)(c_{f_1}+c_{g_1})\alpha , \end{aligned}$$

where the last step is due to (21). Furthermore, due to part 2) in Lemma 7, \(\zeta _{f_2}\) is Lipschitz in both \(\theta\) and z, then we have that for any \(\tau \le t\)

$$\begin{aligned}&\vert \zeta _{f_2}(\theta _t,z_t,O_t)-\zeta _{f_2}(\theta _{t-\tau },z_{t-\tau },O_t)\vert \nonumber \\&\overset{(a)}{\le } c_{f_2}(c_{f_2}+c_{g_2})\sum ^{t-1}_{i=t-\tau }\beta +\bigg (k_{f_2}(c_{f_1}+c_{g_1})\nonumber \\&\quad +c_{f_2}\frac{1}{\lambda }(1+\gamma +\gamma R\vert \mathcal {A}\vert k_1)(c_{f_1}+c_{g_1})\bigg )\sum ^{t-1}_{i=t-\tau }\alpha , \end{aligned}$$
(37)

where in (a), we apply (21) and Lemma 6.

Define an independent random variable \({\hat{O}}=({\hat{S}},{\hat{A}},{\hat{R}},{\hat{S}}')\), where \(({\hat{S}},{\hat{A}})\sim \mu\), \({\hat{S}}'\sim \mathsf P(\cdot \vert {\hat{S}},{\hat{A}})\) is the subsequent state, and \({\hat{R}}\) is the reward. Then it can be shown that

$$\begin{aligned}&\mathbb {E}[\zeta _{f_2}(\theta _{t-\tau },z_{t-\tau },O_t)] \nonumber \\&\overset{(a)}{\le }\ \vert \mathbb {E}[\zeta _{f_2}(\theta _{t-\tau },z_{t-\tau },O_t)]-\mathbb {E}[\zeta _{f_2}(\theta _{t-\tau },z_{t-\tau },{\hat{O}})]\vert \nonumber \\&\le 4Rc_{f_2}m\rho ^{\tau }, \end{aligned}$$
(38)

where (a) is due to the fact that \(\mathbb {E}[\zeta _{f_2}(\theta _{t-\tau },z_{t-\tau },{\hat{O}})]=0\), and the last inequality follows from Assumption 4.

If \(t\le \tau _{\beta }\), the result follows from the bound \(\vert \zeta _{f_2}(\theta _t,z_t,O_t)\vert \le 2Rc_{f_2}\) in Lemma 7.

If \(t> \tau _{\beta }\), we choose \(\tau =\tau _{\beta }\) in (37). Then,

$$\begin{aligned}&\mathbb {E}[\zeta _{f_2}(\theta _t,z_t,O_t)]\le \mathbb {E}[\zeta _{f_2}(\theta _{t-\tau _{\beta }},z_{t-\tau _{\beta }},O_t)]\\&\quad +\bigg (c_{f_2}\frac{1}{\lambda }(1+\gamma +\gamma R\vert \mathcal {A}\vert k_1)(c_{f_1}+c_{g_1})+k_{f_2}(c_{f_1}+c_{g_1})\bigg )\\&\quad \cdot \sum ^{t-1}_{i=t-\tau _{\beta }}\alpha +c_{f_2}(c_{f_2}+c_{g_2})\sum ^{t-1}_{i=t-\tau _{\beta }}\beta \\&\le 4Rc_{f_2}\beta +\bigg ( c_{f_2}(c_{f_2}+c_{g_2})+ \bigg (k_{f_2}(c_{f_1}+c_{g_1})\\&\quad +c_{f_2}\frac{1}{\lambda }(1+\gamma +\gamma R\vert \mathcal {A}\vert k_1)(c_{f_1}+c_{g_1})\bigg )\bigg )\tau _{\beta }\beta , \end{aligned}$$

where in the last step we upper bound \(\alpha\) using \(\beta\). Note that this will not change the order of the bound. \(\square\)
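The mixing time \(\tau _{\beta }=\min \{k: m\rho ^k\le \beta \}\) used above grows only logarithmically as \(\beta\) shrinks, so the \(\tau _{\beta }\beta\) terms in Lemma 8 vanish as \(\beta \rightarrow 0\). A minimal sketch with illustrative values of \(m\) and \(\rho\):

```python
# tau_beta = min { k : m * rho**k <= beta }, computed by direct search
def tau(beta, m=1.0, rho=0.9):
    k = 0
    while m * rho ** k > beta:
        k += 1
    return k

# tau(beta) is the smallest k with m*rho^k <= beta,
# i.e. tau(beta) = O(log(1/beta))
for b in [1e-1, 1e-2, 1e-3]:
    k = tau(b)
    assert 1.0 * 0.9 ** k <= b < 1.0 * 0.9 ** (k - 1)
```

Here \(m\) and \(\rho\) play the roles of the mixing constants in Assumption 4.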

Define the following constants:

$$\begin{aligned} b_{\eta }&=(1+\gamma +\gamma k_1R\vert \mathcal {A}\vert )\left( \frac{1+\lambda +2\gamma (1+k_1R\vert \mathcal {A}\vert )}{\lambda ^2}\right) (r_{\max }+2R),\\ b'_{\eta }&= k'_{\eta }\left( c_{f_2}+c_{g_2}\right) + \big (k_{\eta }+\frac{k'_{\eta }}{\lambda }\left( 1+\gamma +\gamma R\vert \mathcal {A}\vert k_1\right) \big ) \left( c_{f_1}+c_{g_1}\right) ,\\ k_{\eta }&=2R\bigg ( \frac{1}{\lambda }(1+\gamma +\gamma k_1R\vert \mathcal {A}\vert )\left( k_3+\frac{K}{2} \right) +(r_{\max }+\gamma R+R)\\&\quad \cdot \left( 1+\lambda +2\gamma (1+k_1R\vert \mathcal {A}\vert )\right) \frac{2}{\lambda ^2}(\gamma \vert \mathcal {A}\vert ( k_1+k_2R))\bigg ),\\ k'_{\eta }&= \left( \frac{1+\lambda +2\gamma (1+k_1R\vert \mathcal {A}\vert )}{\lambda ^2(r_{\max }+\gamma R+R)^{-1}}\right) (1+\gamma +\gamma k_1R\vert \mathcal {A}\vert ). \end{aligned}$$

Lemma 9

Let \(\eta (\theta ,z,O_t)=\langle z, -\nabla \omega ^*(\theta )^\top (G_{t+1}(\theta ,\omega ^*(\theta ))+{\nabla J(\theta )}/{2})\rangle\), then if \(t\le \tau _{\beta }\), \(\mathbb {E}[\eta (\theta _t,z_t,O_t)]\le 2Rb_{\eta }\); and if \(t> \tau _{\beta }\), then \(\mathbb {E}[\eta (\theta _t,z_t,O_t)]\le 4Rb_{\eta }\beta +b'_{\eta }\tau _{\beta }\beta\).

Proof

From the update of \(z_t\) in (30), we first have

$$\begin{aligned}&\Vert z_{t+1}-z_t\Vert \nonumber \\&=\Vert \beta (f_2(\theta _t,O_t)+g_2(z_t,O_t))+\omega ^*(\theta _t)-\omega ^*(\theta _{t+1}) \Vert \nonumber \\&\le (c_{f_2}+c_{g_2})\beta +\frac{1}{\lambda }(1+\gamma +\gamma R\vert \mathcal {A}\vert k_1)(c_{f_1}+c_{g_1})\alpha , \end{aligned}$$
(39)

where the last step is due to the fact that \(\Vert f_2(\theta ,O_t)\Vert \le c_{f_2}\), \(\Vert g_2(z,O_t)\Vert \le c_{g_2}\) and \(\omega ^*(\theta )\) is Lipschitz in \(\theta\) (Lemma 3).

Recall that both \({\nabla J(\theta )}/{2}\), and \(G_{t+1}(\theta ,\omega ^*(\theta ))\) are Lipschitz in \(\theta\) from (2) and (17). Also note that \(\nabla \omega ^*(\theta )=C^{-1} \nabla \mathbb {E}[\delta _{S,A,S'}(\theta )\phi _{S,A}]\), which implies that \(\Vert \nabla \omega ^*(\theta )\Vert ^2\le \frac{1}{\lambda ^2}(1+\gamma +\gamma k_1R\vert \mathcal {A}\vert )^2\). Then \(\nabla \omega ^*(\theta )\) is Lipschitz in \(\theta\):

$$\begin{aligned}&\Vert \nabla \omega ^*(\theta _1)-\nabla \omega ^*(\theta _2) \Vert \nonumber \\&\le \Vert C^{-1}\Vert \big \Vert \nabla \left( \mathbb {E}_{\mu }\left[ \delta _{S,A,S'}\left( \theta _1\right) \phi _{S,A}\right] \right) -\nabla \left( \mathbb {E}_{\mu }\left[ \delta _{S,A,S'}\left( \theta _2\right) \phi _{S,A}\right] \right) \big \Vert \nonumber \\&\le \frac{\gamma }{\lambda }\Bigg \Vert \mathbb {E}_{\mu }\Bigg [ \Bigg (\sum _{a'\in \mathcal {A}} \bigg (\nabla \pi _{\theta _1}\left( a'\vert S'\right) \theta _1^\top \phi _{S',a'}-\nabla \pi _{\theta _2}\left( a'\vert S'\right) \theta _2^\top \phi _{S',a'}\nonumber \\&\quad +\pi _{\theta _1}\left( a'\vert S'\right) \phi _{S',a'}-\pi _{\theta _2}\left( a'\vert S'\right) \phi _{S',a'}\bigg )\Bigg )\phi _{S,A}^\top \Bigg ]\Bigg \Vert \nonumber \\&\le \frac{\gamma }{\lambda }\Bigg \Vert \mathbb {E}_{\mu }\Bigg [\Bigg ( \sum _{a'\in \mathcal {A}} (\nabla \pi _{\theta _1}\left( a'\vert S'\right) \theta _1^\top \phi _{S',a'}-\nabla \pi _{\theta _2}\left( a'\vert S'\right) \theta _1^\top \phi _{S',a'}\nonumber \\&\quad +\nabla \pi _{\theta _2}\left( a'\vert S'\right) \theta _1^\top \phi _{S',a'}-\nabla \pi _{\theta _2}\left( a'\vert S'\right) \theta _2^\top \phi _{S',a'})\Bigg )\phi _{S,A}^\top \Bigg ]\Bigg \Vert \nonumber \\&\quad +\frac{\gamma }{\lambda }\Bigg \Vert \mathbb {E}_{\mu }\Bigg [\Big (\sum _{a'\in \mathcal {A}} (\pi _{\theta _1} (a'\vert S' )-\pi _{\theta _2} (a'\vert S' ))\phi _{S',a'}\Big )\phi _{S,A}^\top \Bigg ]\Bigg \Vert \nonumber \\&\le \frac{2\gamma }{\lambda } \vert \mathcal {A}\vert (Rk_2+k_1)\Vert \theta _1-\theta _2\Vert . \end{aligned}$$
(40)

Therefore,

$$\begin{aligned}&\vert \eta (\theta _1,z_1,O_t)-\eta (\theta _2,z_2,O_t)\vert \nonumber \\&\le 0.5\big \vert \big \langle z_1, \nabla \omega ^*(\theta _1)^\top \left( 2G_{t+1}(\theta _1,\omega ^*(\theta _1))+ {\nabla J(\theta _1)} \right) \big \rangle \nonumber \\&\quad -\big \langle z_2, \nabla \omega ^*(\theta _1)^\top \left( 2G_{t+1}(\theta _1,\omega ^*(\theta _1))+ {\nabla J(\theta _1)}\right) \big \rangle \big \vert \nonumber \\&\quad +0.5\big \vert \big \langle z_2, \nabla \omega ^*(\theta _1)^\top \left( 2G_{t+1}(\theta _1,\omega ^*(\theta _1))+ {\nabla J(\theta _1)}\right) \big \rangle \nonumber \\&\quad -\big \langle z_2, \nabla \omega ^*(\theta _2)^\top \left( 2G_{t+1}(\theta _2,\omega ^*(\theta _2))+ \nabla J(\theta _2)\right) \big \rangle \big \vert \nonumber \\&\le \frac{1}{\lambda }\bigg (\left( 1+\frac{1}{\lambda }+\frac{2\gamma (1+ k_1R\vert \mathcal {A}\vert )}{\lambda }\right) \nonumber \\&\quad \cdot (r_{\max }+\gamma R+R)(1+\gamma +\gamma k_1R\vert \mathcal {A}\vert )\bigg )\Vert z_1-z_2\Vert \nonumber \\&\quad +R\big \Vert \nabla \omega ^*(\theta _1)^\top \left( 2G_{t+1}(\theta _1,\omega ^*(\theta _1))+\nabla J(\theta _1)\right) \nonumber \\&\quad -\nabla \omega ^*(\theta _2)^\top \left( 2G_{t+1}(\theta _2,\omega ^*(\theta _2))+\nabla J(\theta _2)\right) \big \Vert . \end{aligned}$$
(41)

Consider the last term in (41). We know that \(\nabla \omega ^*(\theta )\) and \(G_{t+1}(\theta ,\omega ^*(\theta ))+\frac{\nabla J(\theta )}{2}\) are both Lipschitz in \(\theta\) from (2), (17) and (40). It can then be shown that \(\nabla \omega ^*(\theta )^\top \left( G_{t+1}(\theta ,\omega ^*(\theta ))+\frac{\nabla J(\theta )}{2}\right)\) is also Lipschitz with constant \(\frac{1}{\lambda }(1+\gamma +\gamma k_1R\vert \mathcal {A}\vert )\left( k_3+\frac{K}{2} \right) +\left( 1+\frac{2\gamma (1+k_1R\vert \mathcal {A}\vert )}{\lambda }+\frac{1}{\lambda }\right) (r_{\max }+\gamma R+R)\frac{2}{\lambda }(\gamma \vert \mathcal {A}\vert ( k_1+k_2R))\). Plugging this into (41), we obtain that

$$\begin{aligned}&\vert \eta (\theta _1,z_1,O_t)-\eta (\theta _2,z_2,O_t)\vert \le k_{\eta }\Vert \theta _1-\theta _2\Vert +k'_{\eta }\Vert z_1-z_2\Vert . \end{aligned}$$

Then for any \(\tau \ge 0\),

$$\begin{aligned}&\vert \eta \left( \theta _t,z_t,O_t\right) -\eta \left( \theta _{t-\tau },z_{t-\tau },O_t\right) \vert \nonumber \\&\overset{}{\le } k'_{\eta }\left( c_{f_2}+c_{g_2}\right) \sum ^{t-1}_{i=t-\tau }\beta +\bigg (k_{\eta }\left( c_{f_1}+c_{g_1}\right) \nonumber \\&\quad +k'_{\eta }\frac{1}{\lambda }\left( 1+\gamma +\gamma R\vert \mathcal {A}\vert k_1\right) \left( c_{f_1}+c_{g_1}\right) \bigg )\sum ^{t-1}_{i=t-\tau }\alpha . \end{aligned}$$
(42)

Define an independent random variable \({\hat{O}}=({\hat{S}},{\hat{A}},{\hat{R}},{\hat{S}}')\), where \(({\hat{S}},{\hat{A}})\sim \mu\), \({\hat{S}}'\sim \mathsf P(\cdot \vert {\hat{S}},{\hat{A}})\) is the subsequent state, and \({\hat{R}}\) is the reward. Then it can be shown that

$$\begin{aligned}&\mathbb {E}[\eta (\theta _{t-\tau },z_{t-\tau },O_t)] \nonumber \\&\overset{(a)}{\le }\ \vert \mathbb {E}[\eta (\theta _{t-\tau },z_{t-\tau },O_t)]-\mathbb {E}[\eta (\theta _{t-\tau },z_{t-\tau },{\hat{O}})]\vert \nonumber \\&\le 4Rb_{\eta }m\rho ^{\tau }, \end{aligned}$$
(43)

where (a) is due to the fact that \(\mathbb {E}[\eta (\theta _{t-\tau },z_{t-\tau },{\hat{O}})]=0\), and \(b_{\eta }\triangleq \sup _{\Vert \theta \Vert \le R} \left\| \nabla \omega ^*(\theta )^\top \left( G_{t+1}(\theta ,\omega ^*(\theta ))+{\nabla J(\theta )}/{2}\right) \right\| = {1}/{\lambda } (1+\gamma +\gamma k_1R\vert \mathcal {A}\vert )\left( 1+{1}/{\lambda }+{2\gamma (1+k_1R\vert \mathcal {A}\vert )}/{\lambda }\right) (r_{\max }+(1+\gamma )R)\).

If \(t\le \tau _{\beta }\), the conclusion is straightforward by noting that \(\vert \eta (\theta ,z,O_t)\vert \le 2Rb_{\eta }\) for any \(\Vert \theta \Vert \le R\) and \(\Vert z\Vert \le 2R\). If \(t> \tau _{\beta }\), we choose \(\tau =\tau _{\beta }\) in (42) and (43). Then, it can be shown that

$$\begin{aligned}&\mathbb {E}[\eta \left( \theta _t,z_t,O_t\right) ]\nonumber \\&\le \mathbb {E}[\eta \left( \theta _{t-\tau _{\beta }},z_{t-\tau _{\beta }},O_t\right) ]+k'_{\eta }\left( c_{f_2}+c_{g_2}\right) \hspace{-0.1cm}\sum ^{t-1}_{i=t-\tau _{\beta }}\hspace{-0.1cm}\beta +\hspace{-0.1cm}\sum ^{t-1}_{i=t-\tau _{\beta }}\hspace{-0.1cm}\alpha \nonumber \\&\quad \cdot \left( k_{\eta }\left( c_{f_1}+c_{g_1}\right) +k'_{\eta }\frac{1}{\lambda }\left( 1+\gamma +\gamma R\vert \mathcal {A}\vert k_1\right) \left( c_{f_1}+c_{g_1}\right) \right) \nonumber \\&\le 4Rb_{\eta }\beta +b'_{\eta }\beta \tau _{\beta }. \end{aligned}$$
(44)

\(\square\)

The next lemma provides a bound on \(\mathbb {E}[\zeta _{g_2}(z_t,O_t)]\).

Lemma 10

If \(t\le \tau _{\beta }\), then \(\mathbb {E}[\zeta _{g_2}(z_t,O_t)] \le b_{g_2}\); and if \(t> \tau _{\beta }\), then \(\mathbb {E}[\zeta _{g_2}(z_t,O_t)] \le b_{g_2}\beta +b'_{g_2}\tau _{\beta }\beta\), where \(b'_{g_2}=8R(c_{f_2}+c_{g_2})+\frac{1}{\lambda }(1+\gamma +\gamma R\vert \mathcal {A}\vert k_1)(c_{f_1}+c_{g_1})\) and \(b_{g_2}=16R^2\).

Proof

The proof is similar to the one for Lemma 8. \(\square\)

Now we bound the terms in (35) as follows. If \(t\le \tau _{\beta }\),

$$\begin{aligned} \sum ^t_{i=0} B_{it}&\le 2\beta Rc_{f_2}\sum ^t_{i=0}e^{-2\lambda (t-i)\beta }\le \frac{2\beta Rc_{f_2}}{1-e^{-2\lambda \beta }}. \end{aligned}$$
(45)

If \(t>\tau _{\beta }\), we have that

$$\begin{aligned}&\sum ^t_{i=0} B_{it} \le \beta (2Rc_{f_2})\sum ^{\tau _{\beta }}_{i=0}e^{-2\lambda \sum _{k=i+1}^t \beta }\nonumber \\&\quad +\sum ^t_{i=\tau _{\beta }+1} e^{-2\lambda (t-i)\beta }\beta (4Rc_{f_2}\beta +b_{f_2}\beta \tau _{\beta })\nonumber \\&\le 2Rc_{f_2}\beta \frac{e^{-2\lambda (t-\tau _{\beta })\beta }}{1-e^{-2\lambda \beta }}+\frac{\beta (4Rc_{f_2}\beta +b_{f_2}\beta \tau _{\beta })}{1-e^{-2\lambda \beta }} . \end{aligned}$$
(46)

Similarly, using Lemma 10, we can bound the third term in (35) as follows. If \(t\le \tau _{\beta }\), we have that

$$\begin{aligned} \sum ^t_{i=0}C_{it}&\le \sum ^t_{i=0} e^{-2\lambda \sum _{k=i+1}^t \beta } \beta b_{g_2} \le \frac{\beta b_{g_2} }{1-e^{-2\lambda \beta }}. \end{aligned}$$
(47)

If \(t>\tau _{\beta }\), we have that

$$\begin{aligned} \sum ^t_{i=0}C_{it}&\le b_{g_2}\beta \frac{e^{-2\lambda (t-\tau _{\beta })\beta }}{1-e^{-2\lambda \beta }}+\frac{\beta (b_{g_2}\beta +b'_{g_2}\beta \tau _{\beta })}{1-e^{-2\lambda \beta }}. \end{aligned}$$
(48)

The last step to bound the tracking error is to bound \(\sum ^t_{i=0} D_{it}\), which is shown in the following lemma.

Lemma 11

If \(t\le \tau _{\beta }\), \(\sum ^t_{i=0} D_{it}\le P_t+ \frac{2Rb_{\eta }\alpha }{1-e^{-2\lambda \beta }}\); and if \(t>\tau _{\beta }\), \(\sum ^t_{i=0} D_{it}\le P_t+2Rb_{\eta }\alpha \frac{e^{-2\lambda (t-\tau _{\beta })\beta }}{1-e^{-2\lambda \beta }}+\alpha (4Rb_{\eta }\beta +b'_{\eta }\beta \tau _{\beta })\frac{1}{1-e^{-2\lambda \beta }}\), where

$$\begin{aligned} P_t&=\sum ^t_{i=0} e^{-2\lambda \left( t-i\right) \beta } \bigg ( \bigg (\frac{\lambda \beta }{8}+\frac{8\alpha ^2}{\beta }\frac{1}{\lambda ^3}\gamma ^2\left( 1+k_1R\vert \mathcal {A}\vert \right) ^2\nonumber \\&\quad \cdot (1+\gamma +\gamma k_1R\vert \mathcal {A}\vert )^2\bigg )\mathbb {E}[\Vert z_i\Vert ^2]+\frac{8\Vert R_2\Vert ^2}{\lambda \beta } \nonumber \\&\quad +\frac{2\alpha ^2}{\beta \lambda ^3} \left( 1+\gamma +\gamma k_1R\vert \mathcal {A}\vert \right) ^2\mathbb {E}\left[ \Vert \nabla J\left( \theta _i\right) \Vert ^2\right] \bigg ). \end{aligned}$$
(49)

Proof

We first have that

$$\begin{aligned}&\mathbb {E}\left[ \left\langle z_i, \omega ^*\left( \theta _i\right) -\omega ^*\left( \theta _{i+1}\right) \right\rangle \right] \nonumber \\&\overset{(*)}{=}\mathbb {E}\left[ \left\langle z_i, \nabla \omega ^*\left( \theta _i\right) ^\top \left( \theta _i-\theta _{i+1}\right) +R_2\right\rangle \right] \nonumber \\&= \alpha \mathbb {E}[\eta (\theta _i,z_i,O_i)]+\frac{1}{2}\mathbb {E}\big [\big \langle z_i, -\alpha \nabla \omega ^*\left( \theta _i\right) ^\top \big (2G_{i+1}\left( \theta _i,\omega _i\right) \nonumber \\&\quad -2G_{i+1}\left( \theta _i,\omega ^*\left( \theta _i\right) \right) -\nabla J\left( \theta _i\right) \big )+2R_2\big \rangle \big ], \end{aligned}$$
(50)

where \((*)\) follows from the Taylor expansion, and \(R_2\) denotes higher order terms with \(\Vert R_2\Vert =\mathcal {O}(\alpha ^2)\).

The second expectation on the RHS of (50) can be bounded as follows

$$\begin{aligned}&\frac{1}{2}\mathbb {E}\big [\big \langle z_i, -\alpha \nabla \omega ^*\left( \theta _i\right) ^\top \big (2G_{i+1}\left( \theta _i,\omega _i\right) -2G_{i+1}\left( \theta _i,\omega ^*\left( \theta _i\right) \right) \nonumber \\&\quad -\nabla J\left( \theta _i\right) \big )+2R_2\big \rangle \big ]\nonumber \\&\overset{(a)}{\le }\ \mathbb {E}\left[ \frac{\lambda \beta }{8}\Vert z_i\Vert ^2\right] \nonumber \\&\quad +\mathbb {E}\bigg [\frac{8\alpha ^2}{\beta } \frac{1}{\lambda ^3} \gamma ^2\left( \vert \mathcal {A}\vert Rk_1+1\right) ^2(1+\gamma +\gamma \vert \mathcal {A}\vert Rk_1)^2\Vert z_i\Vert ^2\bigg ]\nonumber \\&\quad +\mathbb {E}\left[ \frac{2}{\lambda ^3} \left( 1+\gamma +\gamma k_1R\vert \mathcal {A}\vert \right) ^2\frac{ \alpha ^2}{\beta }\Vert \nabla J\left( \theta _i\right) \Vert ^2\right] +\frac{8\Vert R_2\Vert ^2}{\lambda \beta }\nonumber \\&=\left( \frac{\lambda \beta }{8}+\frac{8\alpha ^2}{\beta }\frac{1}{\lambda ^3}\gamma ^2\left( 1+k_1R\vert \mathcal {A}\vert \right) ^2(1+\gamma +\gamma k_1R\vert \mathcal {A}\vert )^2\right) \nonumber \\&\quad \cdot \mathbb {E}\left[ \Vert z_i\Vert ^2\right] +\frac{8\Vert R_2\Vert ^2}{\lambda \beta }\nonumber \\&\quad +\frac{2\alpha ^2}{\beta }\frac{1}{\lambda ^3}\left( 1+\gamma +\gamma k_1R\vert \mathcal {A}\vert \right) ^2\mathbb {E}\left[ \Vert \nabla J\left( \theta _i\right) \Vert ^2\right] , \end{aligned}$$
(51)

where (a) follows from \(\langle x,y\rangle \le \frac{\lambda \beta }{8}\Vert x\Vert ^2+\frac{2}{\lambda \beta }\Vert y\Vert ^2\) for any \(x,y \in \mathbb {R}^N\), \(\Vert x+y+z\Vert ^2\le 4\Vert x\Vert ^2+4\Vert y\Vert ^2+4\Vert z\Vert ^2\) for any \(x,y,z \in \mathbb {R}^N\), and Lemma 3.

Thus, we have that

$$\begin{aligned}&\sum ^t_{i=0} D_{it}\le P_t +\sum ^{t}_{i=0} \alpha e^{-2\lambda \left( t-i\right) \beta }\mathbb {E}[\eta \left( \theta _i,z_i,O_i\right) ]. \end{aligned}$$
(52)

With Lemma 9, this concludes the proof. \(\square\)

We then consider the tracking error \(\mathbb {E}[\Vert z_{t} \Vert ^2]\) in (35). Combining the bounds in (45), (46), (47), (48) and Lemma 11, we have that if \(t\le \tau _{\beta }\),

$$\begin{aligned}&\mathbb {E}[\Vert z_t\Vert ^2]\le \Vert z_0\Vert ^2 e^{-2\lambda t\beta }+\Omega _1+2P_t, \end{aligned}$$
(53)

where \(\Omega _1\triangleq \frac{1}{1-e^{-2\lambda \beta }} ( 4Rc_{f_2}\beta +2b_{g_2}\beta +4Rb_{\eta }\alpha +c_z\beta ^2)\); and if \(t> \tau _{\beta }\),

$$\begin{aligned} \mathbb {E}[\Vert z_t\Vert ^2]&\le \Vert z_0\Vert ^2 e^{-2\lambda t\beta }+2P_t+\Omega _2\\&\quad +\frac{e^{-2\lambda (t-\tau _{\beta })\beta }}{1-e^{-2\lambda \beta }} (4Rc_{f_2}\beta +2b_{g_2}\beta +4Rb_{\eta }\alpha ), \end{aligned}$$

where \(\Omega _2\triangleq \frac{1}{1-e^{-2\lambda \beta }} (2\beta (4Rc_{f_2}\beta +b_{f_2}\beta \tau _{\beta })+ 2\beta (b_{g_2}\beta +b'_{g_2}\beta \tau _{\beta })+2\alpha (4Rb_{\eta }\beta +b'_{\eta }\beta \tau _{\beta })+c_z\beta ^2 )\). We then bound \(\sum ^{T-1}_{t=0}\mathbb {E}[\Vert z_t\Vert ^2]\). The sum is divided into two parts \(\sum ^{\tau _{\beta }}_{t=0}\mathbb {E}[\Vert z_t\Vert ^2]\) and \(\sum ^{T-1}_{t=\tau _{\beta }+1}\mathbb {E}[\Vert z_t\Vert ^2]\) as follows

$$\begin{aligned} \sum ^{T-1}_{t=0}\mathbb {E}[\Vert z_t\Vert ^2]&=\sum ^{\tau _{\beta }}_{t=0}\mathbb {E}[\Vert z_t\Vert ^2]+\sum ^{T-1}_{t=\tau _{\beta }+1}\mathbb {E}[\Vert z_t\Vert ^2]\nonumber \\&\le \sum ^{\tau _{\beta }}_{t=0} \bigg (\Vert z_0\Vert ^2 e^{-2\lambda t\beta }+\Omega _1+2P_t\bigg )+\sum ^{T-1}_{t=\tau _{\beta }+1} \bigg (\Vert z_0\Vert ^2 e^{-2\lambda t\beta }\nonumber \\&\quad +2P_t+\frac{e^{-2\lambda \left( t-\tau _{\beta }\right) \beta }}{1-e^{-2\lambda \beta }} \left( 4Rc_{f_2}\beta +2b_{g_2}\beta +4Rb_{\eta }\alpha \right) +\Omega _2\bigg )\nonumber \\&\le \sum ^{T-1}_{t=0}\left( \Vert z_0\Vert ^2e^{-2\lambda t\beta }+ 2P_t\right) +(1+\tau _\beta )\Omega _1+(T-\tau _\beta )\Omega _2\nonumber \\&\quad +\sum ^{T-1}_{t=\tau _{\beta }+1}\frac{e^{-2\lambda \left( t-\tau _{\beta }\right) \beta }}{1-e^{-2\lambda \beta }} \left( 4Rc_{f_2}\beta +2b_{g_2}\beta +4Rb_{\eta }\alpha \right) \nonumber \\&\le \frac{\Vert z_0\Vert ^2}{1-e^{-2\lambda \beta }}+\sum ^{T-1}_{t=0} 2P_t+(1+\tau _\beta )\Omega _1+(T-\tau _\beta )\Omega _2\nonumber \\&\quad +\frac{\left( 4Rc_{f_2}\beta +2b_{g_2}\beta +4Rb_{\eta }\alpha \right) }{\left( 1-e^{-2\lambda \beta }\right) ^2}\nonumber \\&=\mathcal {O} \left( \frac{1}{\beta }+\tau _{\beta }+{T \beta \tau _{\beta }}+2\sum ^{T-1}_{t=0} P_t \right) . \end{aligned}$$
(54)

Let \(Q_T\triangleq \frac{\Vert z_0\Vert ^2}{1-e^{-2\lambda \beta }}+\frac{\left( 4Rc_{f_2}\beta +2b_{g_2}\beta +4Rb_{\eta }\alpha \right) }{\left( 1-e^{-2\lambda \beta }\right) ^2}+(1+\tau _\beta )\Omega _1+(T-\tau _\beta )\Omega _2.\) Then, \(\sum ^{T-1}_{t=0}\mathbb {E}[\Vert z_t\Vert ^2]\le 2\sum ^{T-1}_{t=0}P_t+Q_T.\)

Now we plug in the exact definition of \(P_t\),

$$\begin{aligned} \sum ^{T-1}_{t=0}\mathbb {E}[\Vert z_t\Vert ^2]&\le 2\sum ^{T-1}_{t=0}P_t+Q_T\nonumber \\&\le Q_T+\frac{16T}{1-e^{-2\lambda \beta }}\frac{\Vert R_2\Vert ^2}{\lambda \beta }+ 2\sum ^{T-1}_{t=0} \sum ^t_{i=0}e^{-2\lambda \left( t-i\right) \beta }\mathbb {E}[\Vert z_i\Vert ^2]\nonumber \\&\quad \cdot \bigg (\frac{\lambda \beta }{8}+\frac{8\alpha ^2}{\beta }\frac{1}{\lambda ^3}\gamma ^2\left( 1+k_1R\vert \mathcal {A}\vert \right) ^2(1+\gamma +\gamma k_1R\vert \mathcal {A}\vert )^2\bigg )\nonumber \\&\quad +\frac{4\alpha ^2}{\beta \lambda ^3}(1+\gamma +\gamma k_1R\vert \mathcal {A}\vert )^2\sum ^{T-1}_{t=0} \sum ^t_{i=0}e^{-2\lambda \left( t-i\right) \beta }\mathbb {E}[\Vert \nabla J\left( \theta _i\right) \Vert ^2]\nonumber \\&\le Q_T+\frac{16T}{1-e^{-2\lambda \beta }}\frac{\Vert R_2\Vert ^2}{\lambda \beta }+\frac{1}{1-e^{-2\lambda \beta }}\sum ^{T-1}_{t=0}\mathbb {E}[\Vert z_t\Vert ^2]\nonumber \\&\quad \cdot 2\bigg (\frac{\lambda \beta }{8}+\frac{8\alpha ^2}{\beta }\frac{1}{\lambda ^3}\gamma ^2\left( 1+k_1R\vert \mathcal {A}\vert \right) ^2(1+\gamma +\gamma k_1R\vert \mathcal {A}\vert )^2\bigg )\nonumber \\&\quad +\frac{4\alpha ^2}{\lambda ^3\beta }\frac{(1+\gamma +\gamma k_1R\vert \mathcal {A}\vert )^2}{1-e^{-2\lambda \beta }}\sum ^{T-1}_{t=0}\mathbb {E}[\Vert \nabla J\left( \theta _t\right) \Vert ^2], \end{aligned}$$
(55)

where the last step is from the double sum trick: for any \(x_t\ge 0\), \(\sum ^{T-1}_{t=0}\sum ^t_{i=0} e^{-2\lambda (t-i)\beta }x_i \le \frac{1}{1-e^{-2\lambda \beta }}\sum ^{T-1}_{t=0}x_t\). Choose \(\beta\) such that \(\big (\frac{\lambda \beta }{8}+\frac{8\alpha ^2}{\beta }\frac{1}{\lambda ^3}\gamma ^2\left( 1+k_1R\vert \mathcal {A}\vert \right) ^2(1+\gamma +\gamma k_1R\vert \mathcal {A}\vert )^2\big )\frac{1}{1-e^{-2\lambda \beta }}<\frac{1}{4}\). Then it follows that

$$\begin{aligned}&\frac{\sum ^{T-1}_{t=0}\mathbb {E}[\Vert z_t\Vert ^2]}{T}\le \frac{2Q_T}{T}+\frac{32}{1-e^{-2\lambda \beta }}\frac{ \Vert R_2\Vert ^2}{\lambda \beta }\nonumber \\&\quad +\frac{8\alpha ^2}{\beta }\frac{1}{\lambda ^3}\frac{(1+\gamma +\gamma k_1R\vert \mathcal {A}\vert )^2}{1-e^{-2\lambda \beta }}\frac{\sum ^{T-1}_{t=0}\mathbb {E}[\Vert \nabla J(\theta _t) \Vert ^2]}{T}\nonumber \\&=\mathcal {O}\Bigg (\frac{\log T}{T^b}+\frac{1}{T^{1-b}}+ \frac{\sum ^{T-1}_{t=0}\mathbb {E}[\Vert \nabla J(\theta _t) \Vert ^2]}{T^{1+2a-2b}} \Bigg ), \end{aligned}$$
(56)

where the last step follows from \(1-e^{-2\lambda \beta }=\mathcal {O}(\beta )\) and \(\Vert R_2\Vert ^2=\mathcal {O}(\alpha ^4)\). This completes the proof of Lemma 5.
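The double sum trick used in (55) can be checked numerically. The sketch below, with arbitrary illustrative constants and non-negative inputs, compares both sides of the inequality directly:

```python
import math
import random

# Check: sum_{t<T} sum_{i<=t} e^{-2*lam*(t-i)*beta} * x_i
#        <= (1 / (1 - e^{-2*lam*beta})) * sum_{t<T} x_t
lam, beta, T = 0.5, 0.05, 200
r = math.exp(-2 * lam * beta)
random.seed(0)
x = [random.random() for _ in range(T)]   # arbitrary nonnegative sequence

lhs = sum(sum(r ** (t - i) * x[i] for i in range(t + 1)) for t in range(T))
rhs = sum(x) / (1 - r)
assert lhs <= rhs + 1e-9
```

The inequality follows by swapping the order of summation and bounding the inner geometric series by \(\frac{1}{1-e^{-2\lambda \beta }}\).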

Appendix 2: Analysis for nested-loop Greedy-GQ

1.1 Appendix 2.1: Proof of Theorem 2

Define \(\hat{G}_t(\theta ,\omega )=\frac{1}{M}\sum ^M_{i=1} G_{(BT_c+M)t+BT_c+i}(\theta ,\omega ).\) By the K-smoothness of \(J(\theta )\), and following steps similar to those in the proof of Theorem 1, we have that

$$\begin{aligned}&\frac{\alpha -K\alpha ^2}{4} \Vert \nabla J(\theta _t)\Vert ^2\le J(\theta _t)-J(\theta _{t+1})\nonumber \\&\quad +2(\alpha +K\alpha ^2)\left\| \hat{G}_t(\theta _t,\omega _t)-\hat{G}_t(\theta _t,\omega ^*(\theta _t))\right\| ^2\nonumber \\&\quad +\frac{1}{2}(\alpha +K\alpha ^2)\left\| 2\hat{G}_t(\theta _t,\omega ^*(\theta _t))+ \nabla J(\theta _t)\right\| ^2. \end{aligned}$$
(57)

By the definition, we have that

$$\begin{aligned}&\hat{G}_t(\theta _t,\omega _t)-\hat{G}_t(\theta _t,\omega ^*(\theta _t))\nonumber \\&=\frac{1}{M} \sum ^M_{i=1} \big ( G_{(BT_c+M)t+BT_c+i}(\theta _t,\omega _t)\nonumber \\&\quad -G_{(BT_c+M)t+BT_c+i}(\theta _t,\omega ^*(\theta _t))\big ). \end{aligned}$$
(58)

For any \(\Vert \theta \Vert \le R\) and any \(\omega _1, \omega _2\), \(\Vert G_{(BT_c+M)t+BT_c+i}(\theta ,\omega _1)-G_{(BT_c+M)t+BT_c+i}(\theta ,\omega _2) \Vert \le \gamma (1+\vert \mathcal {A}\vert Rk_1)\Vert \omega _1-\omega _2\Vert\). Hence we have that

$$\begin{aligned}&\left\| \hat{G}_t(\theta _t,\omega _t)-\hat{G}_t(\theta _t,\omega ^*(\theta _t))\right\| \le \gamma (1+\vert \mathcal {A}\vert Rk_1)\Vert z_t\Vert . \end{aligned}$$
(59)

Thus \(\Vert \hat{G}_t(\theta _t,\omega _t)-\hat{G}_t(\theta _t,\omega ^*(\theta _t))\Vert ^2\le \gamma ^2(1+\vert \mathcal {A}\vert Rk_1)^2\Vert z_t\Vert ^2.\) Plugging this into (57), we have that

$$\begin{aligned}&\frac{\alpha -K\alpha ^2}{4} \Vert \nabla J(\theta _t)\Vert ^2\nonumber \\&\le J(\theta _t)-J(\theta _{t+1})+2(\alpha +K\alpha ^2)\gamma ^2(1+\vert \mathcal {A}\vert Rk_1)^2\Vert z_t\Vert ^2\nonumber \\&\quad +\frac{1}{2}(\alpha +K\alpha ^2)\left\| 2\hat{G}_t(\theta _t,\omega ^*(\theta _t))+\nabla J(\theta _t)\right\| ^2. \end{aligned}$$
(60)

The following lemma provides upper bounds on the two terms on the right-hand side of (60) (see Appendix 2.2 for the proof).

Lemma 12

For any \(t\ge 1\),

$$\begin{aligned}&\mathbb {E}[\Vert z_{t}\Vert ^2] \le 4R^2e^{\left( 4\beta ^2-\beta \lambda \right) (T_c-1)}+\frac{4\beta \lambda +2}{\lambda -4\beta }\frac{k_l+k_h}{B}, \end{aligned}$$
(61)
$$\begin{aligned}&\mathbb {E}\left[ \left\| 2\hat{G}_t(\theta _t,\omega ^*(\theta _t))+\nabla J(\theta _t)\right\| ^2\right] \le \frac{4k_G}{M}, \end{aligned}$$
(62)

where \(k_h=\frac{32R^2(1+\rho m-\rho )}{1-\rho }\), \(k_G=8(r_{\max }+\gamma R+R)^2\left( 1+\frac{1}{\lambda }+\frac{2\gamma }{\lambda }(1+Rk_1\vert \mathcal {A}\vert )\right) ^2(1+\rho (m-1))\) and \(k_l=\frac{8(1+\lambda )^2(r_{\max }+R+\gamma R)^2(1+\rho m-\rho )}{1-\rho }\). If we further let \(T_c=\mathcal {O}\left( \log \frac{1}{\epsilon }\right)\) and \(B=\mathcal {O}\left( \frac{1}{\epsilon }\right)\), then \(\mathbb {E}[\Vert z_{T_c}\Vert ^2]\le \mathcal {O}\left( \epsilon \right)\).
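The \(\frac{4k_G}{M}\) bound in (62) reflects the usual minibatch variance reduction: averaging \(M\) noisy estimates shrinks the expected squared error like \(1/M\). A minimal Monte Carlo illustration under an i.i.d. zero-mean noise simplification (the lemma itself handles Markovian samples):

```python
import random

random.seed(0)

def mean_sq_error(M, trials=2000):
    # E[||average of M zero-mean noise terms||^2] ~ 1/M
    total = 0.0
    for _ in range(trials):
        g_hat = sum(random.gauss(0.0, 1.0) for _ in range(M)) / M
        total += g_hat ** 2
    return total / trials

e1, e16 = mean_sq_error(1), mean_sq_error(16)
# roughly a 16x reduction, up to Monte Carlo noise
assert e16 < e1 / 4
```

In the nested-loop algorithm, this is why increasing the outer-loop batch size \(M\) tightens the gradient-estimation error in (62).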

With the bounds above in hand, we plug them into (60) and sum over \(t\) from 0 to \(T-1\). Then

$$\begin{aligned}&\frac{\alpha -K\alpha ^2}{4} \frac{\sum _{t=0}^{T-1}\mathbb {E} [\Vert \nabla J(\theta _t)\Vert ^2]}{T}\\&\le \frac{J(\theta _0)-J^*}{T}+2(\alpha +K\alpha ^2)L^2\frac{\sum _{t=0}^{T-1}\mathbb {E}[\Vert z_{t}\Vert ^2] }{T}\\&\quad +2(\alpha +K\alpha ^2)\frac{k_G}{M}+2(\alpha +K\alpha ^2)\\&\quad \cdot \frac{4R^2L^2+\left( 1+\frac{1}{\lambda }+\frac{2\gamma (1+k_1R\vert \mathcal {A}\vert )}{\lambda }\right) ^2(r_{\max }+\gamma R+R)^2}{T}, \end{aligned}$$

which implies that

$$\begin{aligned}&\frac{\sum _{t=0}^{T-1}\mathbb {E} [\Vert \nabla J\left( \theta _t\right) \Vert ^2]}{T}\nonumber \\&\le \frac{4\left( J\left( \theta _0\right) -J^*\right) }{\left( \alpha -K\alpha ^2\right) T}+\frac{8L^2\left( \alpha +K\alpha ^2\right) }{\alpha -K\alpha ^2}\bigg (4R^2e^{\left( 4\beta ^2-\beta \lambda \right) (T_c-1)}\nonumber \\&\quad +\left( 4\beta ^2+\frac{2\beta }{\lambda }\right) \left( \frac{1}{\beta \lambda -4\beta ^2} \right) \frac{k_l+k_h}{B}\bigg )\nonumber \\&\quad +\frac{8\left( \alpha +K\alpha ^2\right) }{\alpha -K\alpha ^2}\frac{k_G}{M}+\frac{8(\alpha +K\alpha ^2)}{\alpha -K\alpha ^2}\nonumber \\&\quad \cdot \frac{4R^2L^2+\left( 1+\frac{1}{\lambda }+\frac{2\gamma (1+k_1R\vert \mathcal {A}\vert )}{\lambda }\right) ^2(r_{\max }+\gamma R+R)^2}{T}\nonumber \\&=\mathcal {O}\left( \frac{1}{T}+\frac{1}{M} +\frac{1}{B}+e^{-T_c}\right) , \end{aligned}$$
(63)

where \(L=\gamma (1+\vert \mathcal {A}\vert Rk_1)\). Now letting \(T, M, B=\mathcal {O}\left( \frac{1}{\epsilon }\right)\) and \(T_c=\mathcal {O}(\log (\epsilon ^{-1}))\), we have \(\mathbb {E}[\Vert \nabla J(\theta _W)\Vert ^2] \le \epsilon\), with sample complexity \(\left( M+T_cB\right) T=\mathcal {O}\left( \epsilon ^{-2}\log {\epsilon }^{-1}\right)\).
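The sample-complexity bookkeeping can be checked numerically. The sketch below is illustrative only: `total_samples` is a hypothetical helper, and the constants hidden in the \(\mathcal {O}(\cdot )\) choices are all set to one, with \(T=M=B=1/\epsilon\) and \(T_c=\log (1/\epsilon )\) as in the proof.

```python
import math

def total_samples(eps, c=1.0):
    """Total sample count of the nested-loop Greedy-GQ variant.

    Illustrative only: all constants hidden in the O(.) notation are set
    to c, with T = M = B = c/eps and T_c = c*log(1/eps) as in the proof.
    """
    T = M = B = math.ceil(c / eps)
    T_c = math.ceil(c * math.log(1.0 / eps))
    return (M + T_c * B) * T

# The ratio against eps^{-2} log(1/eps) stays bounded as eps shrinks,
# consistent with the stated O(eps^{-2} log(1/eps)) sample complexity.
for eps in (1e-1, 1e-2, 1e-3):
    ratio = total_samples(eps) / (eps**-2 * math.log(1.0 / eps))
    print(f"eps={eps:.0e}  ratio={ratio:.2f}")
```

The dominant cost per outer iteration is the \(T_cB\) inner-loop samples, so shrinking \(\epsilon\) only inflates the total by the logarithmic factor \(T_c\).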

1.2 Appendix 2.2: Proof of Lemma 12

Define \(z_{t,t_c}=\omega _{t,t_c}-\omega ^*(\theta _t)\). Then by the update of \(\omega _{t,t_c}\), we have that for any \(t\ge 0\),

$$\begin{aligned}&z_{t,t_c+1} =z_{t,t_c}+\frac{\beta }{B}\sum ^B_{i=1} \big (\delta _{(BT_c+M)t+Bt_c+i}(\theta _t)\nonumber \\&\quad -\phi _{(BT_c+M)t+Bt_c+i-1}^\top \omega _{t,t_c}\big )\phi _{(BT_c+M)t+Bt_c+i-1}\nonumber \\&\triangleq z_{t,t_c}+\frac{\beta }{B}\sum ^B_{i=1} l_{t,t_c,i}(\theta _t)-\frac{\beta }{B} \sum ^B_{i=1} h_{t,t_c,i}(z_{t,t_c}), \end{aligned}$$
(64)

where \(l_{t,t_c,i}(\theta _t)=(\delta _{(BT_c+M)t+Bt_c+i}(\theta _t)-\phi ^\top _{(BT_c+M)t+Bt_c+i-1}\omega ^*(\theta _t))\phi _{(BT_c+M)t+Bt_c+i-1}\), and \(h_{t,t_c,i}(z_{t,t_c})=\phi _{(BT_c+M)t+Bt_c+i-1}^\top z_{t,t_c}\phi _{(BT_c+M)t+Bt_c+i-1}\). We also define the expectation of the above two functions under the stationary distribution for any fixed \(\theta\) and z: \(\bar{l}(\theta )=\mathbb {E}_{\mu }[l_{t,t_c,i}(\theta )]=0\) and \(\bar{h}(z)=\mathbb {E}_{\mu }[h_{t,t_c,i}(z)]=Cz\). We then have that

$$\begin{aligned} \Vert z_{t,t_c+1}\Vert ^2&\le \Vert z_{t,t_c}\Vert ^2+\frac{2\beta ^2}{B^2}\left\| {\sum ^B_{i=1} l_{t,t_c,i}\left( \theta _t\right) } \right\| ^2\nonumber \\&\quad +2\frac{\beta ^2}{B^2}\left\| {\sum ^B_{i=1} h_{t,t_c,i}\left( z_{t,t_c}\right) } \right\| ^2+2\frac{\beta }{B}\left\langle z_{t,t_c},\sum ^B_{i=1} l_{t,t_c,i}\left( \theta _t\right) \right\rangle \nonumber \\&\quad -2\frac{\beta }{B}\left\langle z_{t,t_c}, \sum ^B_{i=1} h_{t,t_c,i}\left( z_{t,t_c}\right) \right\rangle \nonumber \\&\overset{(a)}{\le } \left( 1+4\beta ^2-\beta \lambda \right) \Vert z_{t,t_c}\Vert ^2+\left( 2\beta ^2+\frac{2\beta }{\lambda }\right) \left\| \frac{\sum ^B_{i=1} l_{t,t_c,i}\left( \theta _t\right) }{B} \right\| ^2\nonumber \\&\quad +\left( 4\beta ^2+\frac{2\beta }{\lambda }\right) \left\| \bar{h}\left( z_{t,t_c}\right) -\frac{\sum ^B_{i=1} h_{t,t_c,i}\left( z_{t,t_c}\right) }{B}\right\| ^2, \end{aligned}$$
(65)

where (a) is from \(\langle z, \bar{h}(z)\rangle =z^\top C z\ge \lambda \Vert z\Vert ^2\), \(\Vert \bar{h}(z)\Vert ^2=z^\top C^\top C z \le \Vert z\Vert ^2\) for any \(z\in \mathbb {R}^N\), and \(\langle x, y\rangle \le \frac{\lambda }{4}\Vert x\Vert ^2+ \frac{1}{\lambda }\Vert y\Vert ^2\) for any \(x,y \in \mathbb {R}^N\). Recall that \(\mathcal {F}_t\) is the \(\sigma\)-field generated by the randomness until \(\theta _t\) and \(\omega _t\), hence taking expectation conditioned on \(\mathcal {F}_t\) on both sides implies that

$$\begin{aligned}&\mathbb {E}[\Vert z_{t,t_c+1}\Vert ^2\vert \mathcal {F}_t]\nonumber \\&\le \left( 1+4\beta ^2-\beta \lambda \right) \mathbb {E}[\Vert z_{t,t_c}\Vert ^2\vert \mathcal {F}_t]+\left( \frac{2\beta +4\beta ^2\lambda }{\lambda B^2}\right) \nonumber \\&\quad \cdot \mathbb {E}\bigg [\bigg \Vert B\bar{h}\left( z_{t,t_c}\right) -\sum ^B_{i=1} h_{t,t_c,i}\left( z_{t,t_c}\right) \bigg \Vert ^2\bigg \vert \mathcal {F}_t\bigg ]\nonumber \\&\quad +\left( 2\beta ^2+\frac{2\beta }{\lambda }\right) \mathbb {E}\bigg [\bigg \Vert \frac{\sum ^B_{i=1} l_{t,t_c,i}\left( \theta _t\right) }{B} \bigg \Vert ^2\bigg \vert \mathcal {F}_t\bigg ]. \end{aligned}$$
(66)

From Lemma 13, it follows that \(\mathbb {E}[\Vert z_{t,t_c+1}\Vert ^2\vert \mathcal {F}_t] \le (1+4\beta ^2-\beta \lambda )\mathbb {E}[\Vert z_{t,t_c} \Vert ^2\vert \mathcal {F}_t]+\left( 4\beta ^2+\frac{2\beta }{\lambda }\right) \frac{k_l+k_h}{B}.\) Choosing \(\beta <\frac{\lambda }{4}\) and recursively applying this inequality, it follows that

$$\begin{aligned} \mathbb {E}[\Vert z_{t+1}\Vert ^2]&=\mathbb {E}[\mathbb {E}[\Vert z_{t+1}\Vert ^2\vert \mathcal {F}_t]]\\&\le 4R^2e^{\left( 4\beta ^2-\beta \lambda \right) (T_c-1)}+\left( 4\beta ^2+\frac{2\beta }{\lambda }\right) \left( \frac{1}{\beta \lambda -4\beta ^2} \right) \frac{k_l+k_h}{B}, \end{aligned}$$

which follows from \(1-x\le e^{-x}\) for any \(x>0\) and \(\Vert z_{t,0} \Vert ^2\le 4R^2\). Thus, letting \(T_c=\mathcal {O}\left( \log \frac{1}{\epsilon }\right)\) and \(B=\mathcal {O}\left( \frac{1}{\epsilon }\right)\), we obtain \(\mathbb {E}[\Vert z_{t}\Vert ^2]\le \mathcal {O}\left( \epsilon \right)\). This completes the proof of (61).
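For completeness, and assuming (as the exponent \(T_c-1\) in the display indicates) that the inner loop performs \(T_c-1\) updates so that \(z_{t+1}=z_{t,T_c-1}\), the unrolled recursion reads

$$\begin{aligned} \mathbb {E}[\Vert z_{t,T_c-1}\Vert ^2\vert \mathcal {F}_t]&\le \left( 1+4\beta ^2-\beta \lambda \right) ^{T_c-1}\Vert z_{t,0}\Vert ^2\\&\quad +\left( 4\beta ^2+\frac{2\beta }{\lambda }\right) \frac{k_l+k_h}{B}\sum ^{T_c-2}_{k=0}\left( 1+4\beta ^2-\beta \lambda \right) ^{k}\\&\le e^{\left( 4\beta ^2-\beta \lambda \right) (T_c-1)}\Vert z_{t,0}\Vert ^2+\left( 4\beta ^2+\frac{2\beta }{\lambda }\right) \left( \frac{1}{\beta \lambda -4\beta ^2}\right) \frac{k_l+k_h}{B}, \end{aligned}$$

where the last step uses \(1+y\le e^{y}\) with \(y=4\beta ^2-\beta \lambda <0\), and the geometric-series bound \(\sum _{k\ge 0}(1+4\beta ^2-\beta \lambda )^k=\frac{1}{\beta \lambda -4\beta ^2}\), valid since \(\beta <\frac{\lambda }{4}\) gives \(0<1+4\beta ^2-\beta \lambda <1\).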

1.3 Appendix 2.3: Lemma 13 and its proof

We now present bounds on the “variance terms” in (66).

Lemma 13

Consider the Markovian setting. Then

$$\begin{aligned}&\mathbb {E}\left[ \left\| \frac{\sum ^B_{i=1} l_{t,t_c,i}(\theta _t)}{B} \right\| ^2\Bigg \vert \mathcal {F}_t\right] \le \frac{8(1+\lambda )^{2}(1+\rho (m-1))}{B(r_{\max }+R+\gamma R)^{-2}(1-\rho )};\\&\mathbb {E}\left[ \left\| \frac{\sum ^B_{i=1} h_{t,t_c,i}(z_{t,t_c})}{B} -\bar{h}(z_{t,t_c}) \right\| ^2\bigg \vert \mathcal {F}_t\right] \le \frac{32R^2(1+\rho (m-1))}{B(1-\rho )};\\&\mathbb {E}\left[ \left\| {2\hat{G}_t(\theta _t,\omega ^*(\theta _t))}+\nabla J(\theta _t)\right\| ^2\bigg \vert \mathcal {F}_t\right] \\&\le \frac{32\left( 1+\lambda +2\gamma (1+Rk_1\vert \mathcal {A}\vert )\right) ^2(1+\rho (m-1))}{(r_{\max }+\gamma R+R)^{-2}M(1-\rho )\lambda ^2}. \end{aligned}$$

Proof

Note that \(\bar{l}(\theta )=\mathbb {E}_{\mu }[l_{t,t_c,i}(\theta )]=0\), thus

$$\begin{aligned}&\frac{1}{B^2}\mathbb {E}\left[ \bigg \Vert \sum ^B_{i=1} l_{t,t_c,i}(\theta _t) \bigg \Vert ^2\bigg \vert \mathcal {F}_t\right] \nonumber \\&=\frac{1}{B^2}\mathbb {E}\left[ \bigg \Vert \sum ^B_{i=1} l_{t,t_c,i}(\theta _t) -\sum ^B_{i=1} \bar{l}(\theta _t)\bigg \Vert ^2\bigg \vert \mathcal {F}_t\right] \nonumber \\&=\frac{1}{B^2} \sum ^B_{i=1} \mathbb {E}\left[ \Vert l_{t,t_c,i}(\theta _t)-\bar{l}(\theta _t) \Vert ^2\vert \mathcal {F}_t\right] \nonumber \\&\quad +\frac{1}{B^2} \sum ^B_{i\ne j} \mathbb {E}\left[ \langle l_{t,t_c,i}(\theta _t)-\bar{l}(\theta _t),l_{t,t_c,j}(\theta _t)-\bar{l}(\theta _t) \rangle \vert \mathcal {F}_t\right] \nonumber \\&\le \frac{4(1+\lambda )^2(r_{\max }+R+\gamma R)^2}{B}\nonumber \\&\quad +\frac{2}{B^2} \sum ^B_{i> j} \mathbb {E}\left[ \langle l_{t,t_c,i}(\theta _t)-\bar{l}(\theta _t),l_{t,t_c,j}(\theta _t)-\bar{l}(\theta _t) \rangle \vert \mathcal {F}_t\right] , \end{aligned}$$
(67)

which is due to the fact that \(\Vert l_{s,a,s'}(\theta )\Vert \le (1+\lambda )(r_{\max }+R+\gamma R)\) for any \((s,a,s')\) and \(\Vert \theta \Vert \le R.\)

For the second part, we first consider the case \(i>j\). Let \(X_j\) be the \(((BT_c+M)t+Bt_c+j)\)-th sample and \(X_i\) be the \(((BT_c+M)t+Bt_c+i)\)-th sample, and denote the \(\sigma\)-field generated by all the randomness until \(X_j\) by \(\mathcal {F}_{t,t_c,j}\); then

$$\begin{aligned}&\mathbb {E}\left[ \langle l_{t,t_c,i}(\theta _t)-\bar{l}(\theta _t),l_{t,t_c,j}(\theta _t)-\bar{l}(\theta _t) \rangle \vert \mathcal {F}_t\right] \nonumber \\&=\mathbb {E}\left[ \langle \mathbb {E}\left[ l_{t,t_c,i}(\theta _t)-\bar{l}(\theta _t)\vert \mathcal {F}_{t,t_c,j}\right] , l_{t,t_c,j}(\theta _t)-\bar{l}(\theta _t)\rangle \vert \mathcal {F}_t\right] \nonumber \\&\le \mathbb {E}\left[ \left\| \mathbb {E}\left[ l_{t,t_c,i}(\theta _t)-\bar{l}(\theta _t)\vert \mathcal {F}_{t,t_c,j}\right] \right\| \left\| l_{t,t_c,j}(\theta _t)-\bar{l}(\theta _t) \right\| \vert \mathcal {F}_t\right] \nonumber \\&\le 2(1+\lambda )(r_{\max }+R+\gamma R)\mathbb {E}\left[ \left\| \mathbb {E}\left[ l_{t,t_c,i}(\theta _t)-\bar{l}(\theta _t)\vert \mathcal {F}_{t,t_c,j}\right] \right\| \vert \mathcal {F}_t\right] \nonumber \\&=2(1+\lambda )(r_{\max }+R+\gamma R) \nonumber \\&\quad \cdot \left\| \int _{X_i} l_{X_i}(\theta _t) \mathbb {P}(dX_i\vert X_j) -\int _{X_i} l_{X_i}(\theta _t) \mu (dX_i) \right\| \nonumber \\&\le 2(1+\lambda )(r_{\max }+R+\gamma R)\left\| \int _{X_i} l_{X_i}(\theta _t) \left( \mathbb {P}(dX_i\vert X_j)-\mu (dX_i)\right) \right\| \nonumber \\&\le 2(1+\lambda )^2(r_{\max }+R+\gamma R)^2 \int _{X_i}\left| \mathbb {P}(dX_i\vert X_j)-\mu (dX_i) \right| \nonumber \\&\le 4c_l^2m\rho ^{i-j}, \end{aligned}$$
(68)

where the last inequality is from the geometric uniform ergodicity of the MDP. Thus we have that

$$\begin{aligned}&\frac{1}{B^2}\mathbb {E}\left[ \bigg \Vert \sum ^B_{i=1} l_{t,t_c,i}(\theta _t)\bigg \Vert ^2\vert \mathcal {F}_t\right] \le \frac{8(1+\lambda )^2(1+\rho (m-1))}{(r_{\max }+R+\gamma R)^{-2}B(1-\rho )}. \end{aligned}$$

Similarly we can show the other two inequalities. \(\square\)
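The common \(\frac{1+\rho (m-1)}{B(1-\rho )}\) shape of these variance bounds — a \(1/B\) decay with a constant inflated by the mixing of the chain — can be illustrated with a small simulation. This is illustrative only: a two-state Markov chain with values \(\pm 1\) stands in for the MDP samples, and `batch_mean_var` is a hypothetical helper, not part of the paper's algorithms.

```python
import numpy as np

rng = np.random.default_rng(0)

def batch_mean_var(B, n_rep=2000, p_stay=0.9):
    """Monte Carlo variance of the batch mean of a two-state Markov chain.

    The chain takes values in {-1, +1}, stays put with probability p_stay
    (mixing rate roughly rho = 2*p_stay - 1), and starts from its uniform
    stationary distribution, which has mean zero.
    """
    means = np.empty(n_rep)
    for r in range(n_rep):
        x = rng.integers(0, 2)  # draw the initial state from stationarity
        total = 0.0
        for _ in range(B):
            total += 2 * x - 1  # map state {0, 1} to {-1, +1}
            if rng.random() > p_stay:
                x = 1 - x
        means[r] = total / B
    return means.var()

# B * Var hovers near the asymptotic constant (1 + rho) / (1 - rho) = 9,
# mirroring the (1 + rho(m-1)) / (B(1 - rho)) shape of the lemma's bounds.
for B in (50, 200, 800):
    print(f"B={B:4d}  B*Var={B * batch_mean_var(B):.2f}")
```

Even though consecutive samples are correlated, the batch mean concentrates at the \(1/B\) rate; only the constant, not the rate, pays for the correlation.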


About this article

Cite this article

Wang, Y., Zhou, Y. & Zou, S. Finite-time error bounds for Greedy-GQ. Mach Learn (2024). https://doi.org/10.1007/s10994-024-06542-x
