Concentration bounds for temporal difference learning with linear function approximation: the case of batch data and uniform sampling


We propose a stochastic approximation (SA) based method with randomization of samples for policy evaluation using the least squares temporal difference (LSTD) algorithm. Our proposed scheme is equivalent to running regular temporal difference learning with linear function approximation, albeit with samples picked uniformly from a given dataset. Our method results in an O(d) improvement in complexity in comparison to LSTD, where d is the dimension of the data. We provide non-asymptotic bounds for our proposed method, both in high probability and in expectation, under the assumption that the matrix underlying the LSTD solution is positive definite. The latter assumption can be easily satisfied for the pathwise LSTD variant proposed by Lazaric (J Mach Learn Res 13:3041–3074, 2012). Moreover, we also establish that using our method in place of LSTD does not impact the rate of convergence of the approximate value function to the true value function. These rate results coupled with the low computational complexity of our method make it attractive for implementation in big data settings, where d is large. A similar low-complexity alternative for least squares regression is well-known as the stochastic gradient descent (SGD) algorithm. We provide finite-time bounds for SGD. We demonstrate the practicality of our method as an efficient alternative for pathwise LSTD empirically by combining it with the least squares policy iteration algorithm in a traffic signal control application. We also conduct another set of experiments that combines the SA-based low-complexity variant for least squares regression with the LinUCB algorithm for contextual bandits, using the large scale news recommendation dataset from Yahoo.
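To make the scheme concrete, the following is an illustrative sketch (our own minimal reconstruction, not the paper's pseudocode): TD(0) with linear function approximation, where each update uses a transition drawn uniformly at random from a fixed batch, combined with iterate averaging, and compared against the closed-form batch LSTD solution. All variable names, step-size constants, and the synthetic data are our choices, not the paper's.

```python
import numpy as np

# Illustrative sketch: uniform-sampling TD(0) with linear function
# approximation and iterate averaging, vs. the closed-form batch LSTD
# solution theta_hat = A^{-1} b. Data and constants are synthetic choices.

rng = np.random.default_rng(0)
T, d, beta = 500, 2, 0.9

phi = rng.normal(size=(T, d))          # features phi(s_i)
phi_next = rng.normal(size=(T, d))     # next-state features phi(s_i')
r = phi @ np.array([1.0, -0.5]) + 0.1 * rng.normal(size=T)  # rewards

# Batch LSTD: solve A theta = b, A = (1/T) sum_i phi_i (phi_i - beta phi_i')^T.
A = phi.T @ (phi - beta * phi_next) / T
b = phi.T @ r / T
theta_hat = np.linalg.solve(A, b)

# SA variant: each iteration touches one uniformly sampled transition and
# costs O(d), as opposed to LSTD's O(d^2) per-sample cost.
theta = np.zeros(d)
theta_bar = np.zeros(d)
n_iters = 100_000
for k in range(n_iters):
    i = rng.integers(T)                        # uniform sample from the batch
    gamma = 0.1 / (k + 1) ** 0.55              # decaying step size (our choice)
    td_err = r[i] + beta * phi_next[i] @ theta - phi[i] @ theta
    theta = theta + gamma * td_err * phi[i]
    theta_bar += (theta - theta_bar) / (k + 1)  # running average of iterates

err = float(np.linalg.norm(theta_bar - theta_hat))
print("distance to LSTD solution:", err)
```

On this synthetic instance the averaged iterate ends up close to the LSTD fixed point, which is the behaviour the paper's bounds quantify; the matrix A here is positive definite with high probability by construction, matching the paper's standing assumption.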



  1.

    By an abuse of notation, we shall use \(\varPhi\) to denote the feature matrix for TD as well as LSTD and the composition of \(\varPhi\) should be clear from the context.

  2.

    A real matrix A is positive definite if and only if the symmetric part \(\frac{1}{2}(A+A^\textsf {T})\) is positive definite.

  3.

    For notational convenience, we have chosen to ignore the dependence of \(K_1\) and \(K_2\) on the confidence parameter \(\delta\).

  4.

    For notational convenience, we have chosen not to make the dependence of \(g_k\) on the random innovation \(f_k\) explicit. The Lipschitz continuity of \(g_k\) as a function of \(f_k\) is clear from equation (43) presented below.

  5.

    One usually sees terms of the form \(\phi (s_{i_j}) (\phi (s_{i_j}) - \beta \phi (s_{i_j}'))\), whereas we use a transposed form to simplify handling the products that get written through the \(\varPi _j^n\) matrices.


  1. Antos, A., Szepesvári, C., & Munos, R. (2008). Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71(1), 89–129.

  2. Bach, F., & Moulines, E. (2011). Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in neural information processing systems (pp. 451–459).

  3. Bach, F., & Moulines, E. (2013). Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Advances in neural information processing systems (pp. 773–781).

  4. Bertsekas, D. P. (2012). Dynamic programming and optimal control, Vol. II: Approximate dynamic programming (4th ed.). Belmont: Athena Scientific.

  5. Bertsekas, D. P., & Tsitsiklis, J. N. (1996). Neuro-Dynamic Programming (Optimization and Neural Computation Series, 3), (Vol. 7). Belmont: Athena Scientific.

  6. Bhandari, J., Russo, D., & Singal, R. (2018). A finite time analysis of temporal difference learning with linear function approximation. In Conference on learning theory (pp. 1691–1692).

  7. Borkar, V. (2008). Stochastic approximation: A dynamical systems viewpoint. Cambridge: Cambridge University Press.

  8. Borkar, V. S., & Meyn, S. P. (2000). The ode method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization, 38(2), 447–469.

  9. Bradtke, S., & Barto, A. (1996). Linear least-squares algorithms for temporal difference learning. Machine Learning, 22, 33–57.

  10. Dalal, G., Szörényi, B., Thoppe, G., & Mannor, S. (2018). Finite sample analyses for TD(0) with function approximation. In Thirty-second AAAI conference on artificial intelligence.

  11. Dani, V., Hayes, T. P., & Kakade, S. M. (2008). Stochastic linear optimization under bandit feedback. In Proceedings of the 21st annual conference on learning theory (COLT) (pp. 355–366).

  12. Dieuleveut, A., Flammarion, N., & Bach, F. (2016). Harder, better, faster, stronger convergence rates for least-squares regression. arXiv preprint arXiv:1602.05419.

  13. Fathi, M., & Frikha, N. (2013). Transport-entropy inequalities and deviation estimates for stochastic approximation schemes. arXiv preprint arXiv:1301.7740.

  14. Frikha, N., & Menozzi, S. (2012). Concentration bounds for stochastic approximations. Electronic Communications in Probability, 17(47), 1–15.

  15. Geramifard, A., Bowling, M., Zinkevich, M., & Sutton, R. S. (2007). iLSTD: Eligibility traces and convergence analysis. In NIPS (Vol. 19, p. 441).

  16. Hazan, E., & Kale, S. (2011). Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization. In COLT (pp. 421–436).

  17. Konda, V. R. (2002). Actor-critic algorithms. PhD thesis, Department of Electrical Engineering and Computer Science, MIT.

  18. Korda, N., Prashanth, L. A., & Munos, R. (2015). Fast Gradient Descent for Drifting Least Squares Regression, with Application to Bandits. In Proceedings of the twenty-ninth AAAI conference on artificial intelligence (pp. 2708–2714).

  19. Kushner, H., & Clark, D. (1978). Stochastic approximation methods for constrained and unconstrained systems. Berlin: Springer-Verlag.

  20. Kushner, H. J., & Yin, G. (2003). Stochastic approximation and recursive algorithms and applications, (Vol. 35). Berlin: Springer Verlag.

  21. Lagoudakis, M. G., & Parr, R. (2003). Least-squares policy iteration. The Journal of Machine Learning Research, 4, 1107–1149.

  22. Lakshminarayanan, C., & Szepesvári, C. (2018). Linear stochastic approximation: How far does constant step-size and iterate averaging go? Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, 84, 1347–1355.

  23. Lazaric, A., Ghavamzadeh, M., & Munos, R. (2012). Finite-sample analysis of least-squares policy iteration. Journal of Machine Learning Research, 13, 3041–3074.

  24. Li, L., Chu, W., Langford, J., & Schapire, R. E. (2010). A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on world wide web, ACM (pp. 661–670).

  25. Li, L., Chu, W., Langford, J., & Wang, X. (2011). Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proceedings of the fourth ACM international conference on web search and data mining, ACM (pp. 297–306).

  26. Liu, B., Liu, J., Ghavamzadeh, M., Mahadevan, S., & Petrik, M. (2015). Finite-sample analysis of proximal gradient TD algorithms. In Proceedings of the 31st conference on uncertainty in artificial intelligence, Amsterdam, Netherlands.

  27. Mary, J., Garivier, A., Li, L., Munos, R., Nicol, O., Ortner, R., & Preux, P. (2012). ICML exploration and exploitation 3: New challenges.

  28. Narayanan, C., & Szepesvári, C. (2017). Finite time bounds for temporal difference learning with function approximation: Problems with some “state-of-the-art” results. Technical report.

  29. Nemirovsky, A., & Yudin, D. (1983). Problem complexity and method efficiency in optimization. NY: Wiley-Interscience.

  30. Pires, B. A., & Szepesvári, C. (2012). Statistical linear estimation with penalized estimators: An application to reinforcement learning. arXiv preprint arXiv:1206.6444.

  31. Polyak, B. T., & Juditsky, A. B. (1992). Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4), 838–855.

  32. Prashanth, L. A., & Bhatnagar, S. (2011). Reinforcement learning with function approximation for traffic signal control. IEEE Transactions on Intelligent Transportation Systems, 12(2), 412–421.

  33. Prashanth, L. A., & Bhatnagar, S. (2012). Threshold tuning using stochastic optimization for graded signal control. IEEE Transactions on Vehicular Technology, 61(9), 3865–3880.

  34. Prashanth, L. A., Korda, N., & Munos, R. (2014). Fast LSTD using stochastic approximation: Finite time analysis and application to traffic control. In Joint European conference on machine learning and knowledge discovery in databases (pp. 66–81).

  35. Robbins, H., & Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, 22, 400–407.

  36. Roux, N. L., Schmidt, M., & Bach, F. R. (2012). A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in neural information processing systems (pp. 2663–2671).

  37. Ruppert, D. (1991). Stochastic approximation. In B. K. Ghosh & P. K. Sen (Eds.), Handbook of sequential analysis (pp. 503–529).

  38. Silver, D., Sutton, R. S., & Müller, M. (2007). Reinforcement learning of local shape in the game of Go. IJCAI, 7, 1053–1058.

  39. Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge: Cambridge University Press.

  40. Sutton, R. S., Szepesvári, C., & Maei, H. R. (2009a). A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation. In NIPS (pp. 1609–1616).

  41. Sutton, R. S., et al. (2009b). Fast gradient-descent methods for temporal-difference learning with linear function approximation. In ICML, ACM (pp. 993–1000).

  42. Tagorti, M., & Scherrer, B. (2015). On the rate of convergence and error bounds for LSTD(\(\lambda\)). In ICML.

  43. Tarrès, P., & Yao, Y. (2011). Online learning as stochastic approximation of regularization paths. arXiv preprint arXiv:1103.5538.

  44. Tsitsiklis, J. N., & Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5), 674–690.

  45. Wainwright, M. J. (2019). High-dimensional statistics: A non-asymptotic viewpoint (Vol. 48). Cambridge: Cambridge University Press.

  46. Yahoo! Webscope. (2011). Yahoo! Webscope dataset ydata-frontpage-todaymodule-clicks-\(\text{v}2_0\).

  47. Yu, H. (2015). On convergence of emphatic temporal-difference learning. In COLT (pp. 1724–1751).

  48. Yu, H., & Bertsekas, D. P. (2009). Convergence results for some temporal difference methods based on least squares. IEEE Transactions on Automatic Control, 54(7), 1515–1531.

  49. Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. In ICML (pp. 928–936).

Author information



Corresponding author

Correspondence to L. A. Prashanth.

Additional information


Editor: Csaba Szepesvári.

A portion of this work was done when the authors were at INRIA Lille - Nord Europe.

Appendix 1: Proof of Theorem 10.3

The proof of Theorem 10.3 relies on a general rate result, which is built from Proposition 10.1.

Proposition 13.1

Under (A1)–(A3) we have, for all \(\epsilon \ge 0\) and all \(n\ge 1\),

$$\begin{aligned}&{\mathbb {P}}( \left\| z_n \right\| _2- {\mathbb {E}}\left\| z_n \right\| _2\ge \epsilon ) \le \exp \left( - \dfrac{\epsilon ^2}{4h(n)^2\sum \limits _{m=1}^{n} L_m^2} \right) , \end{aligned}$$

where \(L_i \triangleq \frac{\gamma _i}{n} \left( \sum _{l=i+1}^{n-1}\prod \limits _{j=i}^{l} \left( 1- \mu \gamma _{j+1}( 2 - \varPhi _{\max }^2\gamma _{j+1}) \right) ^{1/2}\right)\), and \(h(n)\) is as in Proposition 10.1.


This proof follows exactly the proof of Proposition 8.3, except that it uses the form of \(L_i\) for non-averaged iterates as derived in Proposition 10.1 part (1), rather than as derived in Proposition 8.1 part (1). \(\square\)

We specialise this result with the choice of step size \(\gamma _n \triangleq (c_0 c^{\alpha })/(n+c)^{\alpha }\). First, we prove the form of the \(L_i\) constants for this choice of step size in the lemma below.

Lemma 13.1

Under the conditions of Theorem 10.3, we have

$$\begin{aligned} \sum _{i=1}^n L_i^2 \le \frac{1}{\mu ^2} \left\{ 2^\alpha + \left[ \left[ \frac{2\alpha }{ c_0\mu c^{\alpha }}\right] ^{\frac{1}{1-\alpha }} + \frac{2(1 - \alpha )(c_0\mu )^{\alpha }}{\alpha } \right] \right\} ^2\frac{1}{n}. \end{aligned}$$

Second, we bound the expected error by directly averaging the errors of the non-averaged iterates:

$$\begin{aligned} {\mathbb {E}}\left\| {\bar{\theta }}_{n} - {\hat{\theta }}_T\right\| _2\le \frac{1}{n}\sum _{k = 1}^n{\mathbb {E}}\left\| \theta _k - {\hat{\theta }}_T \right\| _2, \end{aligned}$$

and then applying the bounds in expectation given in Proposition 8.1.

Lemma 13.2

Under the conditions of Theorem 10.3, we have

$$\begin{aligned} {\mathbb {E}}\left\| {\bar{\theta }}_n - {\hat{\theta }}_T\right\| _2\le&C_0\left( C_1\left\| \theta _0 - {\hat{\theta }}_T\right\| _2+ 2h(n)c^{\alpha }c_0 \left( 2 c_0 \mu c^{\alpha }\right) ^{\frac{\alpha }{(1-\alpha )}} \sqrt{e}\left( \frac{2\alpha }{1-\alpha }\right) ^{\frac{1}{2(1-\alpha )}}\right) \frac{1}{n}\\&+ 2h(n) c^\alpha c_0 \left( 2c_0\mu c^\alpha \right) ^{\frac{\alpha }{2(1-\alpha )}} (n+c)^{-\frac{\alpha }{2}}, \end{aligned}$$

where \(C_0\) and \(C_1\) are as defined in Theorem 10.3.

Proof of Lemma 13.1

Recall from the statement of Theorem 10.3 that

$$\begin{aligned} 0< c_0 \varPhi _{\max }^2 < 1. \end{aligned}$$

Recall also the formula for \(L_i\) from Proposition 13.1:

$$\begin{aligned} L_i = \frac{\gamma _i}{n} \left( \sum _{l=i+1}^{n-1}\prod \limits _{j=i}^{l} \left( 1- \mu \gamma _{j+1}( 2 - \varPhi _{\max }^2\gamma _{j+1}) \right) ^{1/2}\right) . \end{aligned}$$

Notice that

$$\begin{aligned} \sum _{i = 1}^n L_i^2&= \sum _{i = 1}^n\left[ \frac{\gamma _i}{n} \left( \sum _{l=i+1}^{n-1}\prod \limits _{j=i}^{l} \left( 1- \mu \gamma _{j+1}( 2 - \varPhi _{\max }^2\gamma _{j+1}) \right) ^{1/2}\right) \right] ^2\\&\le \frac{1}{n^2}\sum _{i = 1}^n\left[ \gamma _i \left( \sum _{l = i+1}^{n-1} \exp \left( - \sum _{j=i}^l \mu \gamma _{j+1}(2 - \varPhi _{\max }^2\gamma _{j+1}) \right) \right) \right] ^2\\&< \frac{1}{n^2}\sum _{i = 1}^n {\underbrace{ \left[ c_0\left( \frac{c}{c+i}\right) ^\alpha \left( \sum _{l = i+1}^{n-1} \exp \left( - c_0\mu \sum _{j=i}^l \left( \frac{c}{c+j}\right) ^\alpha \right) \right) \right] }_{\triangleq (A)}}^2. \end{aligned}$$

To produce the final bound, we bound the summand (A) highlighted in the display above by a constant, uniformly over all values of i and n, exactly as in the proof of Lemma 8.1. Thus, we have

$$\begin{aligned} \sum _{i=1}^n L_i^2 \le \frac{1}{\mu ^2} \left\{ 2^\alpha + \left[ \left[ \frac{2\alpha }{ c_0\mu c^{\alpha }}\right] ^{\frac{1}{1-\alpha }} + \frac{2(1 - \alpha )(c_0\mu )^{\alpha }}{\alpha } \right] \right\} ^2\frac{1}{n}. \end{aligned}$$

The rest of the proof follows that of Theorem 4.2. \(\square\)

Proof of Lemma 13.2

Recall that \(\gamma _n \triangleq c_0\left( \frac{c}{(c+n)}\right) ^{\alpha }\). Recall that in Theorem 10.3 we have assumed that

$$\begin{aligned} 0< c_0 \varPhi _{\max }^2 < 1. \end{aligned}$$

Using (99), we have

$$\begin{aligned}&{\mathbb {E}}\left( \left\| \theta _n - {\hat{\theta }}_T\right\| _2\right) ^2\nonumber \\&\quad \le \left[ \prod _{k = 1}^n \left( 1 - \mu \gamma _k(2 - \gamma _k\varPhi _{\max }^2)\right) \left\| z_0\right\| _2\right] ^2 + 4\sum _{k=1}^{n}\gamma _k^2 \left[ \prod _{j = k}^{n-1} \left( 1 - \mu \gamma _j(2 - \gamma _j\varPhi _{\max }^2)\right) \right] ^2 h(k)^2\nonumber \\&\quad \le \left[ \prod _{k = 1}^n \left( 1 - \frac{\mu c_0 c^{\alpha }}{(c+k)^{\alpha }}\right) \left\| z_0\right\| _2\right] ^2 + 4\sum _{k=1}^{n}\frac{c_0^2 c^{2\alpha }}{(c+k)^{2\alpha }} \left[ \prod _{j = k}^{n-1} \left( 1 - \frac{\mu c_0 c^{\alpha }}{(c+j)^{\alpha }}\right) \right] ^2 h(k)^2 \nonumber \\&\quad \le \left[ \exp \left( -\mu c_0 \sum _{k = 1}^n \frac{c^{\alpha }}{(c+k)^{\alpha }}\right) \left\| z_0\right\| _2\right] ^2 + 4h(n)^2\sum _{k=1}^{n}\frac{c_0^2 c^{2\alpha }}{(c+k)^{2\alpha }} \exp \left( -2\mu c_0\sum _{j = k}^{n-1} \frac{ c^{\alpha }}{(c+j)^{\alpha }}\right) . \end{aligned}$$

To obtain (109), we have applied (108). For the final inequality, we have exponentiated the logarithm of the products, and used the inequality \(\ln (1+x) < x\) in several places.
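Spelled out for the first product, this exponentiation step reads:

$$\begin{aligned} \prod _{k = 1}^n \left( 1 - \frac{\mu c_0 c^{\alpha }}{(c+k)^{\alpha }}\right) = \exp \left( \sum _{k = 1}^n \ln \left( 1 - \frac{\mu c_0 c^{\alpha }}{(c+k)^{\alpha }}\right) \right) \le \exp \left( -\mu c_0 \sum _{k = 1}^n \frac{c^{\alpha }}{(c+k)^{\alpha }}\right) , \end{aligned}$$

using \(\ln (1-x)\le -x\) for \(x<1\); the second product is handled in the same manner.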

Continuing the derivation, we have

$$\begin{aligned}&{\mathbb {E}}\left\| \theta _n - {\hat{\theta }}_T \right\| _2\end{aligned}$$
$$\begin{aligned}&\quad \le \exp \left( -c_0\mu c^\alpha \left( (n+c)^{1-\alpha } - (1+c)^{1-\alpha }\right) \right) \left\| \theta _0 - {\hat{\theta }}_T\right\| _2\nonumber \\&\quad \quad +\, 2h(n)\left( \sum _{k = 1}^{n}c_0^2\left( \frac{c}{k+c}\right) ^{2\alpha } \exp \left( -2c_0\mu c^\alpha \left( (n+c)^{1-\alpha } - (k+c)^{1-\alpha }\right) \right) \right) ^{\frac{1}{2}} \end{aligned}$$
$$\begin{aligned}&\quad = \exp \left( -c_0\mu c^\alpha (n+c)^{1-\alpha }\right) \nonumber \\&\quad \quad \times \, \Bigg [\exp \left( c_0\mu c^\alpha (1+c)^{1-\alpha }\right) \left\| \theta _0 - {\hat{\theta }}_T\right\| _2\nonumber \\&\qquad +\, 2h(n) \left\{ \sum _{k = 1}^{n}c_0^2\left( \frac{c}{k+c}\right) ^{2\alpha } \exp \left( 2c_0\mu c^\alpha (k+c)^{1-\alpha }\right) \right\} ^{\frac{1}{2}} \Bigg ]\nonumber \\&\quad \le \exp \left( -c_0\mu c^\alpha (n+c)^{1-\alpha }\right) \nonumber \\&\quad \quad \times \, \Bigg [\exp \left( c_0\mu c^\alpha (1+c)^{1-\alpha }\right) \left\| \theta _0 - {\hat{\theta }}_T\right\| _2\nonumber \\&\qquad +\, 2 h(n) \left\{ c^{2\alpha } c_0^2\int _{1}^{n+c}x^{-2\alpha }\exp \left( 2c_0\mu c^\alpha x^{1-\alpha }\right) dx \right\} ^{\frac{1}{2}} \Bigg ] \end{aligned}$$
$$\begin{aligned}&\quad \le \exp \left( -c_0\mu c^\alpha (n+c)^{1-\alpha }\right) \nonumber \\&\quad \quad \times \, \Bigg [\exp \left( c_0\mu c^\alpha (1+c)^{1-\alpha }\right) \left\| \theta _0 - {\hat{\theta }}_T\right\| _2\nonumber \\&\quad \quad +\, 2 h(n) \left\{ c^{2\alpha } c_0^2\left( 2c_0\mu c^\alpha \right) ^{\frac{2\alpha }{1-\alpha }} \times \int _{\left( 2c_0\mu c^\alpha \right) ^{1/(1-\alpha )}}^{(n+c)\left( 2c_0\mu c^\alpha \right) ^{1/(1-\alpha )}} y^{-2\alpha }\exp (y^{1-\alpha })dy \right\} ^{\frac{1}{2}} \Bigg ] \end{aligned}$$

As in the proof of Theorem 5.1, for arriving at (111), we have used Jensen’s inequality, and that \(\sum _{j=k}^{n-1}(c+j)^{-\alpha }\ge \int _{k}^{n}(c+j)^{-\alpha }dj=\frac{(c+n)^{1-\alpha } - (c+k)^{1-\alpha }}{1-\alpha }\ge (c+n)^{1-\alpha } - (c+k)^{1-\alpha }\). To obtain (112), we have upper bounded the sum with an integral, the validity of which follows from the observation that \(x\mapsto x^{-2\alpha }e^{x^{1-\alpha }}\) is convex for \(x\ge 1\). Finally, for (113), we have applied the change of variables \(y = (2c_0\mu c^\alpha )^{1/(1-\alpha )}x\).

Now, since \(y^{-2\alpha } \le \frac{2}{1-\alpha } ((1-\alpha )y^{-2\alpha } - \alpha y^{-(1+\alpha )})\) when \(y\ge \left( \frac{2\alpha }{1-\alpha }\right) ^{\frac{1}{1-\alpha }}\), we have

$$\begin{aligned}&\int _{\left( \frac{2\alpha }{1-\alpha }\right) ^{\frac{1}{1-\alpha }}} ^{(n+c)\left( 2c_0\mu c^\alpha \right) ^{1/(1-\alpha )}} y^{-2\alpha }\exp (y^{1-\alpha })dy\\&\quad \le \frac{2}{1-\alpha } \int _{\left( \frac{2\alpha }{1-\alpha }\right) ^{\frac{1}{1-\alpha }}} ^{(n+c)\left( 2c_0\mu c^\alpha \right) ^{1/(1-\alpha )}} ((1-\alpha )y^{-2\alpha } - \alpha y^{-(1+\alpha )}) \exp (y^{1-\alpha })dy\\&\quad \le \frac{2}{1-\alpha } \exp \left( 2c_0\mu c^\alpha (n+c)^{1-\alpha }\right) (n+c)^{-\alpha }\left( 2c_0\mu c^\alpha \right) ^{-\alpha /(1-\alpha )} \end{aligned}$$

and furthermore, since \(y\mapsto y^{-2\alpha }\exp (y^{1-\alpha })\) is decreasing for \(y\le \left( \frac{2\alpha }{1-\alpha }\right) ^{\frac{1}{1-\alpha }}\), we have

$$\begin{aligned} \int _{1}^{\left( \frac{2\alpha }{1-\alpha }\right) ^{\frac{1}{1-\alpha }}} y^{-2\alpha }\exp (y^{1-\alpha })dy \le e \left( \frac{2\alpha }{1-\alpha }\right) ^{\frac{1}{1-\alpha }}. \end{aligned}$$
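The second inequality in the first of the two displays above holds because the bounding integrand is an exact derivative:

$$\begin{aligned} \frac{d}{dy}\left[ y^{-\alpha }\exp (y^{1-\alpha })\right] = \left( (1-\alpha )y^{-2\alpha } - \alpha y^{-(1+\alpha )}\right) \exp (y^{1-\alpha }), \end{aligned}$$

so the integral is bounded by \(\frac{2}{1-\alpha }\) times this antiderivative evaluated at the upper limit \(y = (n+c)\left( 2c_0\mu c^\alpha \right) ^{1/(1-\alpha )}\), where \(y^{1-\alpha } = 2c_0\mu c^\alpha (n+c)^{1-\alpha }\) and \(y^{-\alpha } = \left( 2c_0\mu c^\alpha \right) ^{-\alpha /(1-\alpha )}(n+c)^{-\alpha }\).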

Plugging these into (113), we obtain

$$\begin{aligned}&{\mathbb {E}}\left\| \theta _n - {\hat{\theta }}_T \right\| _2\le \exp \left( -c_0\mu c^\alpha (n+c)^{1-\alpha }\right) \\&\quad \times \, \left( \exp \left( c_0\mu c^\alpha (1+c)^{1-\alpha }\right) \left\| \theta _0 - {\hat{\theta }}_T\right\| _2+ 2h(n)c^{\alpha }c_0 \left( 2 c_0 \mu c^{\alpha }\right) ^{\frac{\alpha }{(1-\alpha )}} \sqrt{e}\left( \frac{2\alpha }{1-\alpha }\right) ^{\frac{1}{2(1-\alpha )}}\right) \\&\quad +\, 2 h(n)c^\alpha c_0 \left( 2c_0\mu c^\alpha \right) ^{\frac{\alpha }{2(1-\alpha )}} (n+c)^{-\frac{\alpha }{2}}. \end{aligned}$$

Hence, we obtain

$$\begin{aligned}&{\mathbb {E}}\left\| {\bar{\theta }}_n - {\hat{\theta }}_T\right\| _2\le \left( \sum _{k=1}^{\infty } \exp \left( -c_0\mu c^\alpha (k+c)^{1-\alpha }\right) \right) \\&\quad \times \, \left( \exp \left( c_0\mu c^\alpha (1+c)^{1-\alpha }\right) \left\| \theta _0 - {\hat{\theta }}_T\right\| _2+ 2h(n)c^{\alpha }c_0 \left( 2 c_0 \mu c^{\alpha }\right) ^{\frac{\alpha }{(1-\alpha )}} \sqrt{e}\left( \frac{2\alpha }{1-\alpha }\right) ^{\frac{1}{2(1-\alpha )}}\right) \frac{1}{n}\\&\quad +\, 2 h(n) c^\alpha c_0 \left( 2c_0\mu c^\alpha \right) ^{\frac{\alpha }{2(1-\alpha )}} (n+c)^{-\frac{\alpha }{2}}. \end{aligned}$$

\(\square\)


Prashanth, L.A., Korda, N. & Munos, R. Concentration bounds for temporal difference learning with linear function approximation: the case of batch data and uniform sampling. Mach Learn 110, 559–618 (2021).
