
Least-Squares Methods for Policy Iteration

Part of the book series: Adaptation, Learning, and Optimization (ALO, volume 12)

Abstract

Approximate reinforcement learning addresses the essential problem of applying reinforcement learning in large and continuous state-action spaces by using function approximators to represent the solution. This chapter reviews least-squares methods for policy iteration, an important class of algorithms for approximate reinforcement learning. We discuss three techniques for solving the core policy evaluation component of policy iteration: least-squares temporal difference, least-squares policy evaluation, and Bellman residual minimization. We introduce these techniques starting from their general mathematical principles and detail them down to fully specified algorithms. We pay special attention to online variants of policy iteration, and provide a numerical example highlighting the behavior of representative offline and online methods. For the policy evaluation component, as well as for the overall resulting approximate policy iteration, we provide guarantees on the performance obtained asymptotically, as the number of samples processed and iterations executed grows to infinity. We also provide finite-sample results, which apply when a finite number of samples and iterations are considered. Finally, we outline several extensions and improvements to the techniques and methods reviewed.
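
As a rough illustration of the least-squares temporal difference technique the abstract names (in its LSTD-Q form, the variant used for control), the Python sketch below builds and solves the linear system characterizing the fixed point of the projected Bellman equation from a batch of transitions. All identifiers here (samples, phi, policy, gamma, reg) are illustrative assumptions for this sketch, not notation taken from the chapter.

    import numpy as np

    def lstd_q(samples, phi, policy, gamma=0.95, reg=1e-6):
        # Sketch of LSTD-Q: given a list of transitions (s, a, r, s_next),
        # a feature map phi(s, a) -> 1-D array of length k, and the policy
        # being evaluated, return theta with Q(s, a) ~= phi(s, a) @ theta.
        k = len(phi(*samples[0][:2]))
        A = np.zeros((k, k))
        b = np.zeros(k)
        for s, a, r, s_next in samples:
            f = phi(s, a)
            f_next = phi(s_next, policy(s_next))  # next features under the evaluated policy
            A += np.outer(f, f - gamma * f_next)  # accumulates the projected Bellman system
            b += r * f
        # A small ridge term (an assumption of this sketch) keeps the system
        # solvable when the features are linearly dependent.
        return np.linalg.solve(A + reg * np.eye(k), b)

An LSPI-style loop of the kind the chapter reviews would alternate this evaluation step with greedy policy improvement over the resulting Q-function; the other two techniques mentioned in the abstract differ in how the evaluation step is carried out, with least-squares policy evaluation solving a similar system iteratively and Bellman residual minimization minimizing the Bellman residual directly rather than solving for the projected fixed point.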




Author information

Correspondence to Lucian Buşoniu.


Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Buşoniu, L., Lazaric, A., Ghavamzadeh, M., Munos, R., Babuška, R., De Schutter, B. (2012). Least-Squares Methods for Policy Iteration. In: Wiering, M., van Otterlo, M. (eds) Reinforcement Learning. Adaptation, Learning, and Optimization, vol 12. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-27645-3_3

  • DOI: https://doi.org/10.1007/978-3-642-27645-3_3

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-27644-6

  • Online ISBN: 978-3-642-27645-3

  • eBook Packages: Engineering, Engineering (R0)
