
Machine Learning, Volume 49, Issue 2–3, pp 233–246

Technical Update: Least-Squares Temporal Difference Learning

  • Justin A. Boyan

Abstract

TD(λ) is a popular family of algorithms for approximate policy evaluation in large MDPs. TD(λ) works by incrementally updating the value function after each observed transition. It has two major drawbacks: it may make inefficient use of data, and it requires the user to manually tune a stepsize schedule for good performance. For the case of linear value function approximations and λ = 0, the Least-Squares TD (LSTD) algorithm of Bradtke and Barto (1996, Machine Learning, 22:1–3, 33–57) eliminates all stepsize parameters and improves data efficiency.
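
The following is a minimal sketch, not taken from the paper, of TD(λ) with linear value-function approximation; it illustrates the per-transition incremental update and the hand-tuned stepsize the abstract refers to. The names phi, alpha, gamma, and lam are illustrative assumptions, with V(s) ≈ theta · phi(s).

    # Sketch: TD(lambda) with linear features and a manually tuned stepsize.
    import numpy as np

    def td_lambda_episode(transitions, phi, theta, alpha=0.01, gamma=1.0, lam=0.0):
        """Update weights theta from one episode of (s, r, s_next) transitions.

        phi(s) maps a state to a feature vector; s_next is None at episode end.
        The stepsize alpha must be tuned by hand -- the drawback LSTD removes.
        """
        z = np.zeros_like(theta)                 # eligibility trace
        for s, r, s_next in transitions:
            v = phi(s) @ theta
            v_next = 0.0 if s_next is None else phi(s_next) @ theta
            delta = r + gamma * v_next - v       # TD error
            z = gamma * lam * z + phi(s)         # accumulate trace
            theta = theta + alpha * delta * z    # incremental update
        return theta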

This paper updates Bradtke and Barto's work in three significant ways. First, it presents a simpler derivation of the LSTD algorithm. Second, it generalizes from λ = 0 to arbitrary values of λ; at the extreme of λ = 1, the resulting new algorithm is shown to be a practical, incremental formulation of supervised linear regression. Third, it presents a novel and intuitive interpretation of LSTD as a model-based reinforcement learning technique.
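
Below is a sketch of the least-squares TD computation the abstract describes, written for general λ under the usual assumptions of linear features phi(s) and discount gamma. Sufficient statistics A and b are accumulated over observed transitions and the weights are obtained by solving one linear system, with no stepsize parameter; the ridge term and the terminal-state convention are illustrative choices, not details from the paper.

    # Sketch: LSTD(lambda) for linear features; solves A theta = b once.
    import numpy as np

    def lstd_lambda(episodes, phi, d, gamma=1.0, lam=0.0, ridge=1e-6):
        """Return theta with V(s) ~= theta . phi(s), phi(s) a length-d vector."""
        A = ridge * np.eye(d)          # small ridge keeps A invertible
        b = np.zeros(d)
        for transitions in episodes:
            z = np.zeros(d)            # eligibility trace, reset each episode
            for s, r, s_next in transitions:
                phi_s = phi(s)
                phi_next = np.zeros(d) if s_next is None else phi(s_next)
                z = gamma * lam * z + phi_s
                A += np.outer(z, phi_s - gamma * phi_next)
                b += z * r
        return np.linalg.solve(A, b)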

Keywords: reinforcement learning, temporal difference learning, value function approximation, linear least-squares methods

References

  1. Atkeson, C. G., & Santamaria, J. C. (1997). A comparison of direct and model-based reinforcement learning. In International Conference on Robotics and Automation.
  2. Bertsekas, D., & Tsitsiklis, J. (1996). Neuro-dynamic programming. Belmont, MA: Athena Scientific.
  3. Boyan, J. A. (1998). Learning evaluation functions for global optimization. Ph.D. Thesis, Carnegie Mellon University.
  4. Boyan, J. A., & Moore, A. W. (1998). Learning evaluation functions for global optimization and Boolean satisfiability. In Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI).
  5. Bradtke, S. J., & Barto, A. G. (1996). Linear least-squares algorithms for temporal difference learning. Machine Learning, 22:1–3, 33–57.
  6. Lin, L.-J. (1993). Reinforcement learning for robots using neural networks. Ph.D. Thesis, Carnegie Mellon University.
  7. Moore, A. W., & Atkeson, C. G. (1993). Prioritized sweeping: Reinforcement learning with less data and less time. Machine Learning, 13, 103–130.
  8. Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992). Numerical recipes in C: The art of scientific computing (2nd ed.). Cambridge: Cambridge University Press.
  9. Singh, S., & Bertsekas, D. (1997). Reinforcement learning for dynamic channel allocation in cellular telephone systems. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), NIPS-9 (p. 974). Cambridge, MA: The MIT Press.
  10. Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9–44.
  11. Sutton, R. S. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the Seventh International Conference on Machine Learning. San Mateo, CA: Morgan Kaufmann.
  12. Sutton, R. S. (1992). Gain adaptation beats least squares. In Proceedings of the 7th Yale Workshop on Adaptive and Learning Systems (pp. 161–166).
  13. Sutton, R. S. (1995). TD models: Modeling the world at a mixture of time scales. In Machine Learning: Proceedings of the 12th International Conference (pp. 531–539). San Mateo, CA: Morgan Kaufmann.
  14. Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA: MIT Press.
  15. Tesauro, G. (1994). TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, 6:2, 215–219.
  16. Tsitsiklis, J. N., & Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42:5, 674–690.

Copyright information

© Kluwer Academic Publishers 2002

Authors and Affiliations

  • Justin A. Boyan
  1. ITA Software, Cambridge, USA
