This paper presents a novel incremental algorithm that combines Q-learning, a well-known dynamic-programming-based reinforcement learning method, with the TD(λ) return estimation process typically used in actor-critic learning, another well-known dynamic-programming-based reinforcement learning method. The parameter λ distributes credit throughout sequences of actions, leading to faster learning and also helping to alleviate the non-Markovian effect of coarse state-space quantization. The resulting algorithm, Q(λ)-learning, thus combines some of the best features of the Q-learning and actor-critic learning paradigms. Its behavior is demonstrated through computer simulations.
Keywords: reinforcement learning, temporal difference learning
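To make the combination concrete, the following is a minimal tabular sketch of Q-learning augmented with TD(λ)-style eligibility traces. It is hedged in several ways: it follows the common Watkins-style convention of cutting traces after exploratory actions, which is one standard variant and not necessarily the exact update rule proposed in this paper; the `env` interface (`reset()` returning a state, `step(action)` returning `(next_state, reward, done)`) and all parameter values are illustrative assumptions, not taken from the source.

```python
import numpy as np

def q_lambda_episode(env, Q, alpha=0.1, gamma=0.95, lam=0.8,
                     epsilon=0.1, rng=None):
    """Run one episode of tabular Q(lambda) (Watkins-style traces), updating Q in place.

    Q is an (n_states, n_actions) float array. The env interface used here
    is an illustrative assumption: reset() -> state,
    step(action) -> (next_state, reward, done).
    """
    rng = np.random.default_rng() if rng is None else rng

    def eps_greedy(s):
        if rng.random() < epsilon:
            return int(rng.integers(Q.shape[1]))
        return int(np.argmax(Q[s]))

    E = np.zeros_like(Q)            # eligibility traces, one per (state, action)
    s = env.reset()
    a = eps_greedy(s)
    done = False
    while not done:
        s2, r, done = env.step(a)
        a2 = eps_greedy(s2)                  # action actually taken next
        a_star = int(np.argmax(Q[s2]))       # greedy action (Q-learning target)
        # One-step TD error toward the greedy backup.
        target = r if done else r + gamma * Q[s2, a_star]
        delta = target - Q[s, a]
        E[s, a] += 1.0                       # accumulating trace for (s, a)
        # lambda spreads the correction over recently visited pairs at once,
        # which is what speeds credit assignment along action sequences.
        Q += alpha * delta * E
        if a2 == a_star:
            E *= gamma * lam                 # decay traces after a greedy step
        else:
            E[:] = 0.0                       # Watkins-style: cut traces after
                                             # an exploratory action
        s, a = s2, a2
    return Q
```

With λ = 0 the trace decays immediately and the update reduces to ordinary one-step Q-learning; larger λ propagates each TD error further back along the recent state-action sequence, which is the credit-distribution effect described in the abstract.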