Least-Squares Methods in Reinforcement Learning for Control

  • Michail G. Lagoudakis
  • Ronald Parr
  • Michael L. Littman
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2308)


Least-squares methods have been successfully used for prediction problems in the context of reinforcement learning, but little has been done in extending these methods to control problems. This paper presents an overview of our research efforts in using least-squares techniques for control. In our early attempts, we considered a direct extension of the Least-Squares Temporal Difference (LSTD) algorithm in the spirit of Q-learning. Later, an effort to remedy some limitations of this algorithm (approximation bias, poor sample utilization) led to the Least-Squares Policy Iteration (LSPI) algorithm, which is a form of model-free approximate policy iteration and makes efficient use of training samples collected in any arbitrary manner. The algorithms are demonstrated on a variety of learning domains, including algorithm selection, inverted pendulum balancing, bicycle balancing and riding, multiagent learning in factored domains, and, recently, on two-player zero-sum Markov games and the game of Tetris.
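As a rough illustration of the policy-iteration loop described in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical deterministic 4-state chain, one-hot (tabular) state-action features, a discount factor of 0.9, and a small hand-written linear solver; the names `step`, `phi`, `lstdq`, and `lspi` are all illustrative. Each iteration fits Q-function weights for the current greedy policy by least squares over a fixed batch of samples, then repeats until the weights stop changing.

```python
# Minimal LSPI-style sketch on a hypothetical 4-state chain (assumption, not from the paper).
GAMMA = 0.9                 # assumed discount factor
N_STATES, N_ACTIONS = 4, 2  # actions: 0 = left, 1 = right
K = N_STATES * N_ACTIONS    # number of features


def step(s, a):
    """Toy deterministic dynamics: reward 1 for landing on the last state."""
    s2 = max(s - 1, 0) if a == 0 else min(s + 1, N_STATES - 1)
    return s2, (1.0 if s2 == N_STATES - 1 else 0.0)


def phi(s, a):
    """One-hot (tabular) features over state-action pairs."""
    f = [0.0] * K
    f[s * N_ACTIONS + a] = 1.0
    return f


def q_value(w, s, a):
    return sum(wi * fi for wi, fi in zip(w, phi(s, a)))


def greedy(w, s):
    """Greedy action under weights w (ties go to the lower action index)."""
    return max(range(N_ACTIONS), key=lambda a: q_value(w, s, a))


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= factor * M[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][j] * x[j] for j in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def lstdq(samples, w):
    """Least-squares fit of Q-weights for the policy that is greedy w.r.t. w."""
    A = [[0.0] * K for _ in range(K)]
    b = [0.0] * K
    for s, a, r, s2 in samples:
        f = phi(s, a)
        f2 = phi(s2, greedy(w, s2))  # features of the policy's action in s2
        for i in range(K):
            b[i] += f[i] * r
            for j in range(K):
                A[i][j] += f[i] * (f[j] - GAMMA * f2[j])
    return solve(A, b)


def lspi(samples, max_iter=20, tol=1e-9):
    """Alternate evaluation and greedy improvement until the weights settle."""
    w = [0.0] * K
    for _ in range(max_iter):
        w_new = lstdq(samples, w)
        if max(abs(x - y) for x, y in zip(w, w_new)) < tol:
            return w_new
        w = w_new
    return w


# Fixed batch: one sample per state-action pair, not generated by any policy.
samples = []
for s in range(N_STATES):
    for a in range(N_ACTIONS):
        s2, r = step(s, a)
        samples.append((s, a, r, s2))

weights = lspi(samples)
policy = [greedy(weights, s) for s in range(N_STATES)]
```

The same batch of samples is reused in every iteration, which is the sample-reuse property the abstract attributes to LSPI. With tabular features this toy problem is solved exactly; with a compact feature set the fit would only be approximate.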


Optimal Policy; Reinforcement Learning; Markov Decision Process; Inverted Pendulum; Policy Iteration
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



Copyright information

© Springer-Verlag Berlin Heidelberg 2002

Authors and Affiliations

  • Michail G. Lagoudakis (1)
  • Ronald Parr (1)
  • Michael L. Littman (2)

  1. Department of Computer Science, Duke University, Durham, USA
  2. Shannon Laboratory, AT&T Labs - Research, Florham Park, USA
