Propagation of Q-values in Tabular TD(λ)

  • Philippe Preux
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2430)

Abstract

In this paper, we propose a new idea for the tabular TD(λ) algorithm. In TD learning, rewards are propagated along the sequence of state/action pairs that have been visited recently. Complementing this, we propose to also propagate rewards to state/action pairs that neighbor this sequence, even though they have not been visited. This greatly decreases the number of iterations TD(λ) needs in order to generalize, since a state/action pair no longer has to be visited for its Q-value to be updated. This propagation process brings tabular TD(λ) closer to neural-network-based TD(λ) with regard to its ability to generalize, while leaving the other properties of tabular TD(λ) unchanged.
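
The abstract does not spell out the exact propagation rule, so the Python sketch below only illustrates the general idea on top of a standard tabular SARSA(λ) loop: after each eligibility-trace update, a fraction of the same correction is pushed to neighbouring, unvisited state/action pairs. The neighbors function, the spread factor, and the minimal env.reset()/env.step() interface are assumptions made for this illustration, not the paper's actual algorithm.

    # Illustrative sketch only. The neighbourhood relation, the spread factor and the
    # environment interface are assumptions; the paper defines its own propagation rule.
    import random
    from collections import defaultdict

    def neighbors(state, action, n_states):
        """Hypothetical neighbourhood: adjacent state indices, same action."""
        for s in (state - 1, state + 1):
            if 0 <= s < n_states:
                yield (s, action)

    def epsilon_greedy(Q, state, n_actions, epsilon):
        if random.random() < epsilon:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[(state, a)])

    def sarsa_lambda_with_propagation(env, n_states, n_actions, episodes=500,
                                      alpha=0.1, gamma=0.95, lam=0.9,
                                      epsilon=0.1, spread=0.5):
        Q = defaultdict(float)            # tabular Q-values, keyed by (state, action)
        for _ in range(episodes):
            E = defaultdict(float)        # eligibility traces of visited pairs
            s = env.reset()               # assumed reset()/step() interface
            a = epsilon_greedy(Q, s, n_actions, epsilon)
            done = False
            while not done:
                s2, r, done = env.step(a)
                a2 = epsilon_greedy(Q, s2, n_actions, epsilon)
                delta = r + gamma * Q[(s2, a2)] * (not done) - Q[(s, a)]
                E[(s, a)] += 1.0
                for (si, ai), e in list(E.items()):
                    Q[(si, ai)] += alpha * delta * e
                    # Extra step sketching the abstract's idea: push a fraction of
                    # the same update to neighbouring pairs that were not visited
                    # along the current sequence.
                    for (sj, aj) in neighbors(si, ai, n_states):
                        if (sj, aj) not in E:
                            Q[(sj, aj)] += spread * alpha * delta * e
                    E[(si, ai)] = gamma * lam * e
                s, a = s2, a2
        return Q

In a real task the neighbourhood relation would come from the topology of the state/action space (e.g. adjacent grid cells), and the spread factor would trade off faster generalization against inaccuracies introduced in pairs that were never actually tried.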

Copyright information

© Springer-Verlag Berlin Heidelberg 2002

Authors and Affiliations

  • Philippe Preux
  1. Laboratoire d’Informatique du Littoral, UPRES-EA 2335, Université du Littoral Côte d’Opale, Calais Cedex, France
