Prioritized sweeping: Reinforcement learning with less data and less time

Abstract

We present a new algorithm, prioritized sweeping, for efficient prediction and control of stochastic Markov systems. Incremental learning methods such as temporal differencing and Q-learning have real-time performance. Classical methods are slower, but more accurate, because they make full use of the observations. Prioritized sweeping aims for the best of both worlds. It uses all previous experiences both to prioritize important dynamic programming sweeps and to guide the exploration of state-space. We compare prioritized sweeping with other reinforcement learning schemes for a number of different stochastic optimal control problems. It successfully solves large state-space real-time problems with which other methods have difficulty.
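
The sketch below illustrates the general idea described in the abstract: a learned tabular model of a discrete Markov system is updated from real experience, and a small, fixed budget of Bellman backups per real step is spent on the states whose values are predicted to change most, ordered by a priority queue. This is a minimal illustration under assumed simplifications, not the paper's exact procedure; names such as PrioritizedSweeper, backups_per_step, and priority_threshold are this sketch's own, and it queues state-action pairs rather than states for brevity.

```python
import heapq
import itertools
from collections import defaultdict


class PrioritizedSweeper:
    def __init__(self, n_actions, gamma=0.95,
                 backups_per_step=5, priority_threshold=1e-4):
        self.n_actions = n_actions
        self.gamma = gamma
        self.backups_per_step = backups_per_step
        self.priority_threshold = priority_threshold
        self.Q = defaultdict(float)                          # (s, a) -> value estimate
        self.counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': visit count}
        self.reward_sum = defaultdict(float)                 # (s, a) -> summed observed reward
        self.predecessors = defaultdict(set)                 # s' -> {(s, a) that reached it}
        self.queue = []                                      # max-heap via negated priorities
        self._tiebreak = itertools.count()

    def _value(self, s):
        return max(self.Q[(s, a)] for a in range(self.n_actions))

    def _backup(self, s, a):
        # One full Bellman backup of Q(s, a) from the learned empirical model.
        total = sum(self.counts[(s, a)].values())
        r_mean = self.reward_sum[(s, a)] / total
        self.Q[(s, a)] = r_mean + self.gamma * sum(
            (c / total) * self._value(s2)
            for s2, c in self.counts[(s, a)].items())

    def _push(self, s, a, priority):
        if priority > self.priority_threshold:
            heapq.heappush(self.queue, (-priority, next(self._tiebreak), (s, a)))

    def observe(self, s, a, r, s2):
        # Record one real transition, then spend a fixed budget of backups.
        self.counts[(s, a)][s2] += 1
        self.reward_sum[(s, a)] += r
        self.predecessors[s2].add((s, a))
        self._push(s, a, float("inf"))   # process the fresh experience first
        for _ in range(self.backups_per_step):
            if not self.queue:
                break
            _, _, (si, ai) = heapq.heappop(self.queue)
            old_v = self._value(si)
            self._backup(si, ai)
            delta = abs(self._value(si) - old_v)
            if delta > self.priority_threshold:
                # Predecessors of si may now need updating, weighted by how
                # likely they are to reach si under the learned model.
                for (sp, ap) in self.predecessors[si]:
                    p = self.counts[(sp, ap)][si] / sum(self.counts[(sp, ap)].values())
                    self._push(sp, ap, p * delta)
```

In use, observe(s, a, r, s2) would be called once per real step, with the behavior policy reading Q greedily plus some exploration. A fuller implementation would keep each queued entry at most once, at its highest pending priority, rather than allowing duplicates as this sketch does.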

Cite this article

Moore, A.W., Atkeson, C.G. Prioritized sweeping: Reinforcement learning with less data and less time. Mach Learn 13, 103–130 (1993). https://doi.org/10.1007/BF00993104

Key words

  • Memory-based learning
  • learning control
  • reinforcement learning
  • temporal differencing
  • asynchronous dynamic programming
  • heuristic search
  • prioritized sweeping