Machine Learning, Volume 22, Issue 1–3, pp 123–158

Reinforcement learning with replacing eligibility traces

  • Satinder P. Singh
  • Richard S. Sutton

Abstract

The eligibility trace is one of the basic mechanisms used in reinforcement learning to handle delayed reward. In this paper we introduce a new kind of eligibility trace, the replacing trace, analyze it theoretically, and show that it results in faster, more reliable learning than the conventional trace. Both kinds of trace assign credit to prior events according to how recently they occurred, but only the conventional trace gives greater credit to repeated events. Our analysis is for conventional and replace-trace versions of the offline TD(1) algorithm applied to undiscounted absorbing Markov chains. First, we show that these methods converge under repeated presentations of the training set to the same predictions as two well-known Monte Carlo methods. We then analyze the relative efficiency of the two Monte Carlo methods. We show that the method corresponding to conventional TD is biased, whereas the method corresponding to replace-trace TD is unbiased. In addition, we show that the method corresponding to replacing traces is closely related to the maximum likelihood solution for these tasks, and that its mean squared error is always lower in the long run. Computational results confirm these analyses and show that they are applicable more generally. In particular, we show that replacing traces significantly improve performance and reduce parameter sensitivity on the "Mountain-Car" task, a full reinforcement-learning problem with a continuous state space, when using a feature-based function approximator.
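The following is a minimal sketch, not code from the paper, contrasting the two trace updates in tabular TD(λ): every trace decays by γλ on each step, but on a visit the conventional (accumulating) trace adds 1 while the replacing trace resets to 1. The online update form, the function name, and the (state, reward, next state) tuple format are illustrative assumptions; the paper's theoretical results concern the offline TD(1) case.

    import numpy as np

    def td_lambda_episode(transitions, V, e, alpha, gamma, lam, replacing):
        """One episode of tabular TD(lambda) with either trace type (illustrative sketch).

        transitions: list of (state, reward, next_state) tuples, with
                     next_state = None at the absorbing terminal state.
        V, e:        numpy arrays holding state values and eligibility traces.
        """
        e[:] = 0.0                            # traces start at zero each episode
        for s, r, s_next in transitions:
            v_next = 0.0 if s_next is None else V[s_next]
            delta = r + gamma * v_next - V[s]  # TD error for this step
            e *= gamma * lam                   # all traces decay by gamma * lambda
            if replacing:
                e[s] = 1.0                     # replacing trace: reset to 1 on a visit
            else:
                e[s] += 1.0                    # conventional trace: accumulate on revisits
            V += alpha * delta * e             # credit prior states in proportion to their traces
        return V

When a state is revisited within an episode, the accumulating version gives it extra credit for each revisit while the replacing version credits it only once per decay interval; this behavioral difference underlies the bias comparison stated in the abstract.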

Keywords

reinforcement learning, temporal difference learning, eligibility trace, Monte Carlo method, Markov chain, CMAC


Copyright information

© Kluwer Academic Publishers 1996

Authors and Affiliations

  • Satinder P. Singh, Dept. of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge
  • Richard S. Sutton, Dept. of Computer Science, University of Massachusetts, Amherst