Technical Note

Incremental Multi-Step Q-Learning
  • Jing Peng
  • Ronald J. Williams


This paper presents a novel incremental algorithm that combines Q-learning, a well-known dynamic-programming-based reinforcement learning method, with the TD(λ) return estimation process typically used in actor-critic learning, another well-known dynamic-programming-based reinforcement learning method. The parameter λ is used to distribute credit throughout sequences of actions, leading to faster learning and also helping to alleviate the non-Markovian effect of coarse state-space quantization. The resulting algorithm, Q(λ)-learning, thus combines some of the best features of the Q-learning and actor-critic learning paradigms. The behavior of this algorithm has been demonstrated through computer simulations.
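The role of λ in distributing credit along a sequence of actions can be illustrated with the standard λ-return recursion. The sketch below is not the paper's incremental algorithm, which the abstract does not spell out. It is a minimal offline illustration in which the function name `lambda_returns` and its arguments are hypothetical, and the bootstrap values are assumed to be greedy estimates of the form max over a of Q(s', a), as in Q-learning.

```python
from typing import List, Sequence


def lambda_returns(rewards: Sequence[float],
                   next_values: Sequence[float],
                   gamma: float,
                   lam: float) -> List[float]:
    """Compute lambda-returns for one recorded trajectory.

    Uses the backward recursion
        G_t = r_t + gamma * ((1 - lam) * V(s_{t+1}) + lam * G_{t+1}).

    next_values[t] is the bootstrap estimate for the state reached after
    step t. Assumption for a Q-learning flavour: it equals
    max_a Q(s_{t+1}, a), and 0.0 if that state is terminal.
    """
    if len(rewards) != len(next_values):
        raise ValueError("rewards and next_values must have equal length")

    returns = [0.0] * len(rewards)
    # Beyond the last recorded step there is no further return to mix in,
    # so the recursion starts from the final bootstrap value.
    g_next = next_values[-1] if rewards else 0.0
    for t in reversed(range(len(rewards))):
        g_next = rewards[t] + gamma * ((1.0 - lam) * next_values[t] + lam * g_next)
        returns[t] = g_next
    return returns


if __name__ == "__main__":
    rewards = [0.0, 0.0, 1.0]        # reward arrives only at the last step
    next_values = [0.5, 0.5, 0.0]    # hypothetical estimates; last state terminal
    for lam in (0.0, 0.5, 1.0):
        print(lam, lambda_returns(rewards, next_values, gamma=0.9, lam=lam))
```

With `lam=0.0` each target reduces to the one-step target r + γ·V(s'), so the final reward affects only the last step. With `lam=1.0` the final reward is propagated back to every earlier step, discounted by γ per step. Intermediate values blend the two, which is the credit-distribution effect the abstract attributes to λ.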


reinforcement learning, temporal difference learning





Copyright information

© Kluwer Academic Publishers 1996

Authors and Affiliations

  • Jing Peng (1)
  • Ronald J. Williams (2)
  1. College of Engineering, University of California, Riverside
  2. College of Computer Science, Northeastern University, Boston
