Concentration bounds for temporal difference learning with linear function approximation: the case of batch data and uniform sampling


We propose a stochastic approximation (SA) based method with randomization of samples for policy evaluation using the least squares temporal difference (LSTD) algorithm. Our proposed scheme is equivalent to running regular temporal difference learning with linear function approximation, albeit with samples picked uniformly from a given dataset. Our method results in an O(d) improvement in complexity in comparison to LSTD, where d is the dimension of the data. We provide non-asymptotic bounds for our proposed method, both in high probability and in expectation, under the assumption that the matrix underlying the LSTD solution is positive definite. The latter assumption can be easily satisfied for the pathwise LSTD variant proposed by Lazaric (J Mach Learn Res 13:3041–3074, 2012). Moreover, we also establish that using our method in place of LSTD does not impact the rate of convergence of the approximate value function to the true value function. These rate results coupled with the low computational complexity of our method make it attractive for implementation in big data settings, where d is large. A similar low-complexity alternative for least squares regression is well-known as the stochastic gradient descent (SGD) algorithm. We provide finite-time bounds for SGD. We demonstrate the practicality of our method as an efficient alternative for pathwise LSTD empirically by combining it with the least squares policy iteration algorithm in a traffic signal control application. We also conduct another set of experiments that combines the SA-based low-complexity variant for least squares regression with the LinUCB algorithm for contextual bandits, using the large scale news recommendation dataset from Yahoo.
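To make the scheme concrete, the following is an illustrative sketch (our own minimal reconstruction, not the paper's pseudocode): TD(0) with linear function approximation, where each update uses a transition drawn uniformly at random from a fixed batch, combined with iterate averaging, and compared against the closed-form batch LSTD solution. All variable names, step-size constants, and the synthetic data are our choices, not the paper's.

```python
import numpy as np

# Illustrative sketch: uniform-sampling TD(0) with linear function
# approximation and iterate averaging, vs. the closed-form batch LSTD
# solution theta_hat = A^{-1} b. Data and constants are synthetic choices.

rng = np.random.default_rng(0)
T, d, beta = 500, 2, 0.9

phi = rng.normal(size=(T, d))          # features phi(s_i)
phi_next = rng.normal(size=(T, d))     # next-state features phi(s_i')
r = phi @ np.array([1.0, -0.5]) + 0.1 * rng.normal(size=T)  # rewards

# Batch LSTD: solve A theta = b, A = (1/T) sum_i phi_i (phi_i - beta phi_i')^T.
A = phi.T @ (phi - beta * phi_next) / T
b = phi.T @ r / T
theta_hat = np.linalg.solve(A, b)

# SA variant: each iteration touches one uniformly sampled transition and
# costs O(d), as opposed to LSTD's O(d^2) per-sample cost.
theta = np.zeros(d)
theta_bar = np.zeros(d)
n_iters = 100_000
for k in range(n_iters):
    i = rng.integers(T)                        # uniform sample from the batch
    gamma = 0.1 / (k + 1) ** 0.55              # decaying step size (our choice)
    td_err = r[i] + beta * phi_next[i] @ theta - phi[i] @ theta
    theta = theta + gamma * td_err * phi[i]
    theta_bar += (theta - theta_bar) / (k + 1)  # running average of iterates

err = float(np.linalg.norm(theta_bar - theta_hat))
print("distance to LSTD solution:", err)
```

On this synthetic instance the averaged iterate ends up close to the LSTD fixed point, which is the behaviour the paper's bounds quantify; the matrix A here is positive definite with high probability by construction, matching the paper's standing assumption.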



  1.

    By an abuse of notation, we shall use \(\varPhi\) to denote the feature matrix for TD as well as LSTD and the composition of \(\varPhi\) should be clear from the context.

  2.

    A real matrix A is positive definite if and only if the symmetric part \(\frac{1}{2}(A+A^\textsf {T})\) is positive definite.

  3.

    For notational convenience, we have chosen to ignore the dependence of \(K_1\) and \(K_2\) on the confidence parameter \(\delta\).

  4.

    For notational convenience, we have chosen not to make the dependence of \(g_k\) on the random innovation \(f_k\) explicit. The Lipschitz continuity of \(g_k\) as a function of \(f_k\) is clear from equation (43) presented below.

  5.

    One usually sees terms of the form \(\phi (s_{i_j}) (\phi (s_{i_j}) - \beta \phi (s_{i_j}'))\), whereas we use a transposed form to simplify handling the products that get written through the \(\varPi _j^n\) matrices.


  1. Antos, A., Szepesvári, C., & Munos, R. (2008). Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71(1), 89–129.

  2. Bach, F., & Moulines, E. (2011). Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in neural information processing systems (pp. 451–459).

  3. Bach, F., & Moulines, E. (2013). Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Advances in neural information processing systems (pp. 773–781).

  4. Bertsekas, D. P. (2012). Dynamic programming and optimal control, Vol. II: Approximate dynamic programming (4th ed.). Belmont: Athena Scientific.

  5. Bertsekas, D. P., & Tsitsiklis, J. N. (1996). Neuro-Dynamic Programming (Optimization and Neural Computation Series, 3), (Vol. 7). Belmont: Athena Scientific.

  6. Bhandari, J., Russo, D., & Singal, R. (2018). A finite time analysis of temporal difference learning with linear function approximation. In Conference on learning theory (pp. 1691–1692).

  7. Borkar, V. (2008). Stochastic approximation: A dynamical systems viewpoint. Cambridge: Cambridge University Press.

  8. Borkar, V. S., & Meyn, S. P. (2000). The ode method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization, 38(2), 447–469.

  9. Bradtke, S., & Barto, A. (1996). Linear least-squares algorithms for temporal difference learning. Machine Learning, 22, 33–57.

  10. Dalal, G., Szörényi, B., Thoppe, G., & Mannor, S. (2018). Finite sample analyses for TD(0) with function approximation. In Thirty-second AAAI conference on artificial intelligence.

  11. Dani, V., Hayes, T. P., & Kakade, S. M. (2008). Stochastic linear optimization under bandit feedback. In Proceedings of the 21st annual conference on learning theory (COLT) (pp. 355–366).

  12. Dieuleveut, A., Flammarion, N., & Bach, F. (2016). Harder, better, faster, stronger convergence rates for least-squares regression. arXiv preprint arXiv:1602.05419.

  13. Fathi, M., & Frikha, N. (2013). Transport-entropy inequalities and deviation estimates for stochastic approximation schemes. arXiv preprint arXiv:1301.7740.

  14. Frikha, N., & Menozzi, S. (2012). Concentration bounds for stochastic approximations. Electronic Communications in Probability, 17(47), 1–15.

  15. Geramifard, A., Bowling, M., Zinkevich, M., & Sutton, R. S. (2007). iLSTD: Eligibility traces and convergence analysis. In NIPS (Vol. 19, p. 441).

  16. Hazan, E., & Kale, S. (2011). Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization. In COLT (pp. 421–436).

  17. Konda, V. R. (2002). Actor-critic algorithms. PhD thesis, Department of Electrical Engineering and Computer Science, MIT.

  18. Korda, N., Prashanth, L. A., & Munos, R. (2015). Fast Gradient Descent for Drifting Least Squares Regression, with Application to Bandits. In Proceedings of the twenty-ninth AAAI conference on artificial intelligence (pp. 2708–2714).

  19. Kushner, H., & Clark, D. (1978). Stochastic approximation methods for constrained and unconstrained systems. Berlin: Springer-Verlag.

  20. Kushner, H. J., & Yin, G. (2003). Stochastic approximation and recursive algorithms and applications, (Vol. 35). Berlin: Springer Verlag.

  21. Lagoudakis, M. G., & Parr, R. (2003). Least-squares policy iteration. The Journal of Machine Learning Research, 4, 1107–1149.

  22. Lakshminarayanan, C., & Szepesvári, C. (2018). Linear stochastic approximation: How far does constant step-size and iterate averaging go? Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, 84, 1347–1355.

  23. Lazaric, A., Ghavamzadeh, M., & Munos, R. (2012). Finite-sample analysis of least-squares policy iteration. Journal of Machine Learning Research, 13, 3041–3074.

  24. Li, L., Chu, W., Langford, J., & Schapire, R. E. (2010). A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on world wide web, ACM (pp. 661–670).

  25. Li, L., Chu, W., Langford, J., & Wang, X. (2011). Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proceedings of the fourth ACM international conference on web search and data mining, ACM (pp. 297–306).

  26. Liu, B., Liu, J., Ghavamzadeh, M., Mahadevan, S., & Petrik, M. (2015). Finite-sample analysis of proximal gradient TD algorithms. In Proceedings of the 31st conference on uncertainty in artificial intelligence, Amsterdam, Netherlands.

  27. Mary, J., Garivier, A., Li, L., Munos, R., Nicol, O., Ortner, R., & Preux, P. (2012). ICML exploration and exploitation 3: New challenges.

  28. Narayanan, C., & Szepesvári, C. (2017). Finite time bounds for temporal difference learning with function approximation: Problems with some “state-of-the-art” results. Technical report.

  29. Nemirovsky, A., & Yudin, D. (1983). Problem complexity and method efficiency in optimization. NY: Wiley-Interscience.

  30. Pires, B. A., & Szepesvári, C. (2012). Statistical linear estimation with penalized estimators: An application to reinforcement learning. arXiv preprint arXiv:1206.6444.

  31. Polyak, B. T., & Juditsky, A. B. (1992). Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4), 838–855.

  32. Prashanth, L. A., & Bhatnagar, S. (2011). Reinforcement learning with function approximation for traffic signal control. IEEE Transactions on Intelligent Transportation Systems, 12(2), 412–421.

  33. Prashanth, L. A., & Bhatnagar, S. (2012). Threshold tuning using stochastic optimization for graded signal control. IEEE Transactions on Vehicular Technology, 61(9), 3865–3880.

  34. Prashanth, L. A., Korda, N., & Munos, R. (2014). Fast LSTD using stochastic approximation: Finite time analysis and application to traffic control. In Joint European conference on machine learning and knowledge discovery in databases (pp. 66–81).

  35. Robbins, H., & Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, 22, 400–407.

  36. Roux, N. L., Schmidt, M., & Bach, F. R. (2012). A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in neural information processing systems (pp. 2663–2671).

  37. Ruppert, D. (1991). Stochastic approximation. In B. K. Ghosh & P. K. Sen (Eds.), Handbook of sequential analysis (pp. 503–529).

  38. Silver, D., Sutton, R. S., & Müller, M. (2007). Reinforcement learning of local shape in the game of Go. IJCAI, 7, 1053–1058.

  39. Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge: Cambridge University Press.

  40. Sutton, R. S., Szepesvári, C., & Maei, H. R. (2009a). A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation. In NIPS (pp. 1609–1616).

  41. Sutton, R. S., et al. (2009b). Fast gradient-descent methods for temporal-difference learning with linear function approximation. In ICML, ACM (pp. 993–1000).

  42. Tagorti, M., & Scherrer, B. (2015). On the rate of convergence and error bounds for LSTD(\(\lambda\)). In ICML.

  43. Tarrès, P., & Yao, Y. (2011). Online learning as stochastic approximation of regularization paths. arXiv preprint arXiv:1103.5538.

  44. Tsitsiklis, J. N., & Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5), 674–690.

  45. Wainwright, M. J. (2019). High-dimensional statistics: A non-asymptotic viewpoint (Vol. 48). Cambridge: Cambridge University Press.

  46. Yahoo! Webscope. (2011). Yahoo! Webscope dataset ydata-frontpage-todaymodule-clicks-\(\text{v}2_0\).

  47. Yu, H. (2015). On convergence of emphatic temporal-difference learning. In COLT (pp. 1724–1751).

  48. Yu, H., & Bertsekas, D. P. (2009). Convergence results for some temporal difference methods based on least squares. IEEE Transactions on Automatic Control, 54(7), 1515–1531.

  49. Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. In ICML (pp. 928–936).

Author information



Corresponding author

Correspondence to L. A. Prashanth.

Additional information


Editor: Csaba Szepesvári.

A portion of this work was done when the authors were at INRIA Lille - Nord Europe.

Appendix 1: Proof of Theorem 10.3

The proof of Theorem 10.3 relies on a general rate result, which is built from Proposition 10.1.

Proposition 13.1

Under (A1)–(A3) we have, for all \(\epsilon \ge 0\) and all \(n\ge 1\),

$$\begin{aligned}&{\mathbb {P}}( \left\| z_n \right\| _2- {\mathbb {E}}\left\| z_n \right\| _2\ge \epsilon ) \le \exp \left( - \dfrac{\epsilon ^2}{4h(n)^2\sum \limits _{m=1}^{n} L_m^2} \right) , \end{aligned}$$

where \(L_i \triangleq \frac{\gamma _i}{n} \left( \sum _{l=i+1}^{n-1}\prod \limits _{j=i}^{l} \left( 1- \mu \gamma _{j+1}( 2 - \varPhi _{\max }^2\gamma _{j+1}) \right) ^{1/2}\right)\), and \(h(n)\) is as in Proposition 10.1.


This proof follows exactly the proof of Proposition 8.3, except that it uses the form of \(L_i\) for non-averaged iterates as derived in Proposition 10.1 part (1), rather than as derived in Proposition 8.1 part (1). \(\square\)

We specialise this result with the choice of step size \(\gamma _n \triangleq (c_0 c^{\alpha })/(n+c)^{\alpha }\). First, we prove the form of the \(L_i\) constants for this choice of step size in the lemma below.

Lemma 13.1

Under the conditions of Theorem 10.3, we have

$$\begin{aligned} \sum _{i=1}^n L_i^2 \le \frac{1}{\mu ^2} \left\{ 2^\alpha + \left[ \left[ \frac{2\alpha }{ c_0\mu c^{\alpha }}\right] ^{\frac{1}{1-\alpha }} + \frac{2(1 - \alpha )(c_0\mu )^{\alpha }}{\alpha } \right] \right\} ^2\frac{1}{n}. \end{aligned}$$

Second, we bound the expected error by directly averaging the errors of the non-averaged iterates:

$$\begin{aligned} {\mathbb {E}}\left\| {\bar{\theta }}_{n} - {\hat{\theta }}_T\right\| _2\le \frac{1}{n}\sum _{k = 1}^n{\mathbb {E}}\left\| \theta _k - {\hat{\theta }}_T \right\| _2, \end{aligned}$$

and then applying the bounds in expectation given in Proposition 8.1.

Lemma 13.2

Under the conditions of Theorem 10.3, we have

$$\begin{aligned} {\mathbb {E}}\left\| {\bar{\theta }}_n - {\hat{\theta }}_T\right\| _2\le&C_0\left( C_1\left\| \theta _0 - {\hat{\theta }}_T\right\| _2+ 2h(n)c^{\alpha }c_0 \left( 2 c_0 \mu c^{\alpha }\right) ^{\frac{\alpha }{(1-\alpha )}} \sqrt{e}\left( \frac{2\alpha }{1-\alpha }\right) ^{\frac{1}{2(1-\alpha )}}\right) \frac{1}{n}\\&+ 2h(n) c^\alpha c_0 \left( 2c_0\mu c^\alpha \right) ^{\frac{\alpha }{2(1-\alpha )}} (n+c)^{-\frac{\alpha }{2}}, \end{aligned}$$

where \(C_0\) and \(C_1\) are as defined in Theorem 10.3.

Proof of Lemma 13.1

Recall from the statement of Theorem 10.3 that

$$\begin{aligned} 0< c_0 \varPhi _{\max }^2 < 1. \end{aligned}$$

Recall also the formula for \(L_i\) from Proposition 13.1:

$$\begin{aligned} L_i = \frac{\gamma _i}{n} \left( \sum _{l=i+1}^{n-1}\prod \limits _{j=i}^{l} \left( 1- \mu \gamma _{j+1}( 2 - \varPhi _{\max }^2\gamma _{j+1}) \right) ^{1/2}\right) . \end{aligned}$$

Notice that

$$\begin{aligned} \sum _{i = 1}^n L_i^2&= \sum _{i = 1}^n\left[ \frac{\gamma _i}{n} \left( \sum _{l=i+1}^{n-1}\prod \limits _{j=i}^{l} \left( 1- \mu \gamma _{j+1}( 2 - \varPhi _{\max }^2\gamma _{j+1}) \right) ^{1/2}\right) \right] ^2\\&\le \frac{1}{n^2}\sum _{i = 1}^n\left[ \gamma _i \left( \sum _{l = i+1}^{n-1} \exp \left( - \sum _{j=i}^l \mu \gamma _{j+1}(2 - \varPhi _{\max }^2\gamma _{j+1}) \right) \right) \right] ^2\\&< \frac{1}{n^2}\sum _{i = 1}^n {\underbrace{ \left[ c_0\left( \frac{c}{c+i}\right) ^\alpha \left( \sum _{l = i+1}^{n-1} \exp \left( - c_0\mu \sum _{j=i}^l \left( \frac{c}{c+j}\right) ^\alpha \right) \right) \right] }_{\triangleq (A)}}^2. \end{aligned}$$

To produce the final bound, we bound the summand (A) highlighted in the display above by a constant, uniformly over all values of i and n, exactly as in the proof of Lemma 8.1. Thus, we have

$$\begin{aligned} \sum _{i=1}^n L_i^2 \le \frac{1}{\mu ^2} \left\{ 2^\alpha + \left[ \left[ \frac{2\alpha }{ c_0\mu c^{\alpha }}\right] ^{\frac{1}{1-\alpha }} + \frac{2(1 - \alpha )(c_0\mu )^{\alpha }}{\alpha } \right] \right\} ^2\frac{1}{n}. \end{aligned}$$

The rest of the proof follows that of Theorem 4.2. \(\square\)

Proof of Lemma 13.2

Recall that \(\gamma _n \triangleq c_0\left( \frac{c}{(c+n)}\right) ^{\alpha }\). Recall that in Theorem 10.3 we have assumed that

$$\begin{aligned} 0< c_0 \varPhi _{\max }^2 < 1. \end{aligned}$$

Using (99), we have

$$\begin{aligned}&{\mathbb {E}}\left( \left\| \theta _n - {\hat{\theta }}_T\right\| _2\right) ^2\nonumber \\&\quad \le \left[ \prod _{k = 1}^n \left( 1 - \mu \gamma _k(2 - \gamma _k\varPhi _{\max }^2)\right) \left\| z_0\right\| _2\right] ^2 + 4\sum _{k=1}^{n}\gamma _k^2 \left[ \prod _{j = k}^{n-1} \left( 1 - \mu \gamma _j(2 - \gamma _j\varPhi _{\max }^2)\right) \right] ^2 h(k)^2\nonumber \\&\quad \le \left[ \prod _{k = 1}^n \left( 1 - \frac{\mu c_0 c^{\alpha }}{(c+k)^{\alpha }}\right) \left\| z_0\right\| _2\right] ^2 + 4\sum _{k=1}^{n}\frac{c_0^2 c^{2\alpha }}{(c+k)^{2\alpha }} \left[ \prod _{j = k}^{n-1} \left( 1 - \frac{\mu c_0 c^{\alpha }}{(c+j)^{\alpha }}\right) \right] ^2 h(k)^2 \nonumber \\&\quad \le \left[ \exp \left( -\mu c_0 \sum _{k = 1}^n \frac{c^{\alpha }}{(c+k)^{\alpha }}\right) \left\| z_0\right\| _2\right] ^2 + 4h(n)^2\sum _{k=1}^{n}\frac{c_0^2 c^{2\alpha }}{(c+k)^{2\alpha }} \exp \left( -2\mu c_0\sum _{j = k}^{n-1} \frac{ c^{\alpha }}{(c+j)^{\alpha }}\right) . \end{aligned}$$

To obtain (109), we have applied (108). For the final inequality, we have exponentiated the logarithm of the products, and used the inequality \(\ln (1+x) < x\) in several places.
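Spelled out for the first product, this exponentiation step reads:

$$\begin{aligned} \prod _{k = 1}^n \left( 1 - \frac{\mu c_0 c^{\alpha }}{(c+k)^{\alpha }}\right) = \exp \left( \sum _{k = 1}^n \ln \left( 1 - \frac{\mu c_0 c^{\alpha }}{(c+k)^{\alpha }}\right) \right) \le \exp \left( -\mu c_0 \sum _{k = 1}^n \frac{c^{\alpha }}{(c+k)^{\alpha }}\right) , \end{aligned}$$

using \(\ln (1-x)\le -x\) for \(x<1\); the second product is handled in the same manner.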

Continuing the derivation, we have

$$\begin{aligned}&{\mathbb {E}}\left\| \theta _n - {\hat{\theta }}_T \right\| _2\end{aligned}$$
$$\begin{aligned}&\quad \le \exp \left( -c_0\mu c^\alpha \left( (n+c)^{1-\alpha } - (1+c)^{1-\alpha }\right) \right) \left\| \theta _0 - {\hat{\theta }}_T\right\| _2\nonumber \\&\quad \quad +\, 2h(n)\left( \sum _{k = 1}^{n}c_0^2\left( \frac{c}{k+c}\right) ^{2\alpha } \exp \left( -2c_0\mu c^\alpha \left( (n+c)^{1-\alpha } - (k+c)^{1-\alpha }\right) \right) \right) ^{\frac{1}{2}} \end{aligned}$$
$$\begin{aligned}&\quad = \exp \left( -c_0\mu c^\alpha (n+c)^{1-\alpha }\right) \nonumber \\&\quad \quad \times \, \Bigg [\exp \left( c_0\mu c^\alpha (1+c)^{1-\alpha }\right) \left\| \theta _0 - {\hat{\theta }}_T\right\| _2\nonumber \\&\qquad +\, 2h(n) \left\{ \sum _{k = 1}^{n}c_0^2\left( \frac{c}{k+c}\right) ^{2\alpha } \exp \left( 2c_0\mu c^\alpha (k+c)^{1-\alpha }\right) \right\} ^{\frac{1}{2}} \Bigg ]\nonumber \\&\quad \le \exp \left( -c_0\mu c^\alpha (n+c)^{1-\alpha }\right) \nonumber \\&\quad \quad \times \, \Bigg [\exp \left( c_0\mu c^\alpha (1+c)^{1-\alpha }\right) \left\| \theta _0 - {\hat{\theta }}_T\right\| _2\nonumber \\&\qquad +\, 2 h(n) \left\{ c^{2\alpha } c_0^2\int _{1}^{n+c}x^{-2\alpha }\exp \left( 2c_0\mu c^\alpha x^{1-\alpha }\right) dx \right\} ^{\frac{1}{2}} \Bigg ] \end{aligned}$$
$$\begin{aligned}&\quad \le \exp \left( -c_0\mu c^\alpha (n+c)^{1-\alpha }\right) \nonumber \\&\quad \quad \times \, \Bigg [\exp \left( c_0\mu c^\alpha (1+c)^{1-\alpha }\right) \left\| \theta _0 - {\hat{\theta }}_T\right\| _2\nonumber \\&\quad \quad +\, 2 h(n) \left\{ c^{2\alpha } c_0^2\left( 2c_0\mu c^\alpha \right) ^{\frac{2\alpha }{1-\alpha }} \times \int _{\left( 2c_0\mu c^\alpha \right) ^{1/(1-\alpha )}}^{(n+c)\left( 2c_0\mu c^\alpha \right) ^{1/(1-\alpha )}} y^{-2\alpha }\exp (y^{1-\alpha })dy \right\} ^{\frac{1}{2}} \Bigg ] \end{aligned}$$

As in the proof of Theorem 5.1, for arriving at (111), we have used Jensen’s inequality, and that \(\sum _{j=k}^{n-1}(c+j)^{-\alpha }\ge \int _{k}^{n}(c+j)^{-\alpha }dj=\frac{(c+n)^{1-\alpha } - (c+k)^{1-\alpha }}{1-\alpha }\ge (c+n)^{1-\alpha } - (c+k)^{1-\alpha }\). To obtain (112), we have upper bounded the sum with an integral, the validity of which follows from the observation that \(x\mapsto x^{-2\alpha }e^{x^{1-\alpha }}\) is convex for \(x\ge 1\). Finally, for (113), we have applied the change of variables \(y = (2c_0\mu c^\alpha )^{1/(1-\alpha )}x\).

Now, since \(y^{-2\alpha } \le \frac{2}{1-\alpha } ((1-\alpha )y^{-2\alpha } - \alpha y^{-(1+\alpha )})\) when \(y\ge \left( \frac{2\alpha }{1-\alpha }\right) ^{\frac{1}{1-\alpha }}\), we have

$$\begin{aligned}&\int _{\left( \frac{2\alpha }{1-\alpha }\right) ^{\frac{1}{1-\alpha }}} ^{(n+c)\left( 2c_0\mu c^\alpha \right) ^{1/(1-\alpha )}} y^{-2\alpha }\exp (y^{1-\alpha })dy\\&\quad \le \frac{2}{1-\alpha } \int _{\left( \frac{2\alpha }{1-\alpha }\right) ^{\frac{1}{1-\alpha }}} ^{(n+c)\left( 2c_0\mu c^\alpha \right) ^{1/(1-\alpha )}} ((1-\alpha )y^{-2\alpha } - \alpha y^{-(1+\alpha )}) \exp (y^{1-\alpha })dy\\&\quad \le \frac{2}{1-\alpha } \exp \left( 2c_0\mu c^\alpha (n+c)^{1-\alpha }\right) (n+c)^{-\alpha }\left( 2c_0\mu c^\alpha \right) ^{-\alpha /(1-\alpha )} \end{aligned}$$

and furthermore, since \(y\mapsto y^{-2\alpha }\exp (y^{1-\alpha })\) is decreasing for \(y\le \left( \frac{2\alpha }{1-\alpha }\right) ^{\frac{1}{1-\alpha }}\), we have

$$\begin{aligned} \int _{1}^{\left( \frac{2\alpha }{1-\alpha }\right) ^{\frac{1}{1-\alpha }}} y^{-2\alpha }\exp (y^{1-\alpha })dy \le e \left( \frac{2\alpha }{1-\alpha }\right) ^{\frac{1}{1-\alpha }}. \end{aligned}$$
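The second inequality in the first of the two displays above holds because the bounding integrand is an exact derivative:

$$\begin{aligned} \frac{d}{dy}\left[ y^{-\alpha }\exp (y^{1-\alpha })\right] = \left( (1-\alpha )y^{-2\alpha } - \alpha y^{-(1+\alpha )}\right) \exp (y^{1-\alpha }), \end{aligned}$$

so the integral is bounded by \(\frac{2}{1-\alpha }\) times this antiderivative evaluated at the upper limit \(y = (n+c)\left( 2c_0\mu c^\alpha \right) ^{1/(1-\alpha )}\), where \(y^{1-\alpha } = 2c_0\mu c^\alpha (n+c)^{1-\alpha }\) and \(y^{-\alpha } = \left( 2c_0\mu c^\alpha \right) ^{-\alpha /(1-\alpha )}(n+c)^{-\alpha }\).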

Plugging these into (113), we obtain

$$\begin{aligned}&{\mathbb {E}}\left\| \theta _n - {\hat{\theta }}_T \right\| _2\le \exp \left( -c_0\mu c^\alpha (n+c)^{1-\alpha }\right) \\&\quad \times \, \left( \exp \left( c_0\mu c^\alpha (1+c)^{1-\alpha }\right) \left\| \theta _0 - {\hat{\theta }}_T\right\| _2+ 2h(n)c^{\alpha }c_0 \left( 2 c_0 \mu c^{\alpha }\right) ^{\frac{\alpha }{(1-\alpha )}} \sqrt{e}\left( \frac{2\alpha }{1-\alpha }\right) ^{\frac{1}{2(1-\alpha )}}\right) \\&\quad +\, 2 h(n)c^\alpha c_0 \left( 2c_0\mu c^\alpha \right) ^{\frac{\alpha }{2(1-\alpha )}} (n+c)^{-\frac{\alpha }{2}}. \end{aligned}$$

Hence, we obtain

$$\begin{aligned}&{\mathbb {E}}\left\| {\bar{\theta }}_n - {\hat{\theta }}_T\right\| _2\le \left( \sum _{k=1}^{\infty } \exp \left( -c_0\mu c^\alpha (k+c)^{1-\alpha }\right) \right) \\&\quad \times \, \left( \exp \left( c_0\mu c^\alpha (1+c)^{1-\alpha }\right) \left\| \theta _0 - {\hat{\theta }}_T\right\| _2+ 2h(n)c^{\alpha }c_0 \left( 2 c_0 \mu c^{\alpha }\right) ^{\frac{\alpha }{(1-\alpha )}} \sqrt{e}\left( \frac{2\alpha }{1-\alpha }\right) ^{\frac{1}{2(1-\alpha )}}\right) \frac{1}{n}\\&\quad +\, 2 h(n) c^\alpha c_0 \left( 2c_0\mu c^\alpha \right) ^{\frac{\alpha }{2(1-\alpha )}} (n+c)^{-\frac{\alpha }{2}}. \end{aligned}$$

\(\square\)


Prashanth, L.A., Korda, N. & Munos, R. Concentration bounds for temporal difference learning with linear function approximation: the case of batch data and uniform sampling. Mach Learn 110, 559–618 (2021).
