Machine Learning, Volume 13, Issue 1, pp 103–130

Prioritized sweeping: Reinforcement learning with less data and less time

  • Andrew W. Moore
  • Christopher G. Atkeson


We present a new algorithm, prioritized sweeping, for efficient prediction and control of stochastic Markov systems. Incremental learning methods such as temporal differencing and Q-learning have real-time performance. Classical methods are slower, but more accurate, because they make full use of the observations. Prioritized sweeping aims for the best of both worlds. It uses all previous experiences both to prioritize important dynamic programming sweeps and to guide the exploration of state-space. We compare prioritized sweeping with other reinforcement learning schemes for a number of different stochastic optimal control problems. It successfully solves large state-space real-time problems with which other methods have difficulty.
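The idea of prioritizing dynamic programming backups can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a batch of logged `(s, a, r, s_next)` tuples, fits a maximum-likelihood model from counts, and repeatedly backs up the state whose value is expected to change most. The function name `prioritized_sweeping` and the parameters `gamma`, `theta`, and `max_backups` are hypothetical, and the sketch omits the online model updates and the exploration strategy that the abstract mentions.

```python
import heapq
import itertools
from collections import defaultdict


def prioritized_sweeping(transitions, gamma=0.9, theta=1e-6, max_backups=10_000):
    """Estimate optimal state values from logged (s, a, r, s_next) tuples."""
    counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s_next: count}
    reward_sum = defaultdict(float)                 # (s, a) -> summed reward
    visits = defaultdict(int)                       # (s, a) -> number of samples
    actions = defaultdict(set)                      # s -> actions observed in s
    preds = defaultdict(set)                        # s_next -> {(s, a)} leading to it
    for s, a, r, s_next in transitions:
        counts[(s, a)][s_next] += 1
        reward_sum[(s, a)] += r
        visits[(s, a)] += 1
        actions[s].add(a)
        preds[s_next].add((s, a))

    V = defaultdict(float)  # unseen or terminal states default to value 0

    def backup(s):
        # One-step Bellman optimality backup under the empirical model.
        best = None
        for a in actions[s]:
            n = visits[(s, a)]
            q = reward_sum[(s, a)] / n + gamma * sum(
                c / n * V[s2] for s2, c in counts[(s, a)].items())
            best = q if best is None else max(best, q)
        return 0.0 if best is None else best

    tie = itertools.count()  # tie-breaker so states never need to be comparable
    queue = []
    for s in list(actions):
        priority = abs(backup(s) - V[s])
        if priority > theta:
            heapq.heappush(queue, (-priority, next(tie), s))

    backups = 0
    while queue and backups < max_backups:
        _, _, s = heapq.heappop(queue)  # state with the largest pending priority
        new_value = backup(s)
        delta = abs(new_value - V[s])
        V[s] = new_value
        backups += 1
        if delta <= theta:
            continue
        # A change in V[s] affects each predecessor in proportion to the
        # estimated probability of reaching s from it.
        for p, a in preds[s]:
            priority = gamma * counts[(p, a)][s] / visits[(p, a)] * delta
            if priority > theta:
                heapq.heappush(queue, (-priority, next(tie), p))
    return dict(V)


if __name__ == "__main__":
    # Hypothetical three-state chain: 0 -> 1 -> 2, reward 1 on the last step.
    log = [(0, "go", 0.0, 1), (1, "go", 1.0, 2)]
    print(prioritized_sweeping(log))
```

The queue may hold stale duplicates of a state. The sketch tolerates this by recomputing the backup when a state is popped, which keeps the code short at the cost of some redundant work.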

Key words

Memory-based learning, learning control, reinforcement learning, temporal differencing, asynchronous dynamic programming, heuristic search, prioritized sweeping



Copyright information

© Kluwer Academic Publishers 1993

Authors and Affiliations

  • Andrew W. Moore (1)
  • Christopher G. Atkeson (1)
  1. MIT Artificial Intelligence Laboratory, NE43-771, Cambridge
