Machine Learning, Volume 25, Issue 1, pp 5–22

Exploration Bonuses and Dual Control

  • Peter Dayan
  • Terrence J. Sejnowski


Finding the Bayesian balance between exploration and exploitation in adaptive optimal control is in general intractable. This paper shows how to compute suboptimal estimates based on a certainty equivalence approximation (Cozzolino, Gonzalez-Zubieta & Miller, 1965) arising from a form of dual control. This systematizes and extends existing uses of exploration bonuses in reinforcement learning (Sutton, 1990). The approach has two components: a statistical model of uncertainty in the world and a way of turning this into exploratory behavior. It is applied to two-dimensional mazes with moveable barriers, and its performance is compared with that of Sutton's DYNA system.
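To make the notion of an exploration bonus concrete, the following is a minimal sketch in the spirit of the bonus used in Sutton's (1990) DYNA work, where an action's modelled reward is increased by an amount that grows with the time since it was last tried. It is not the certainty equivalence method developed in this paper. The function names, the square-root form with weight `kappa`, the deterministic model tables, and all parameter values are assumptions chosen for illustration.

```python
import math


def bonus_q_values(rewards, next_state, elapsed, kappa=0.1, gamma=0.9, sweeps=200):
    """Value iteration on a deterministic model with an added exploration bonus.

    rewards[s][a]    : modelled reward for action a in state s
    next_state[s][a] : modelled successor state for action a in state s
    elapsed[s][a]    : time steps since (s, a) was last tried (hypothetical statistic)
    kappa            : assumed bonus weight; bonus = kappa * sqrt(elapsed)
    """
    n_states = len(rewards)
    q = [[0.0 for _ in rewards[s]] for s in range(n_states)]
    for _ in range(sweeps):
        for s in range(n_states):
            for a in range(len(rewards[s])):
                # Long-untried actions look more attractive to the planner.
                bonus = kappa * math.sqrt(elapsed[s][a])
                q[s][a] = rewards[s][a] + bonus + gamma * max(q[next_state[s][a]])
    return q


def greedy_action(q, s):
    """Index of the action with the highest bonus-augmented value in state s."""
    return max(range(len(q[s])), key=lambda a: q[s][a])


# One state, two self-looping actions: action 0 pays 1, action 1 pays 0.
rewards = [[1.0, 0.0]]
next_state = [[0, 0]]

q_fresh = bonus_q_values(rewards, next_state, elapsed=[[0, 0]])
q_stale = bonus_q_values(rewards, next_state, elapsed=[[0, 400]])

print(greedy_action(q_fresh, 0))  # exploit: the known-good action
print(greedy_action(q_stale, 0))  # explore: the long-untried action
```

In this toy setting the agent exploits when both actions were tried recently and switches to the neglected action once its bonus outweighs the reward difference. This is the qualitative behavior that a statistical model of uncertainty is meant to justify in a more principled way.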

Keywords: reinforcement learning, dynamic programming, exploration bonuses, certainty equivalence, non-stationary environment


  1. Barto, A.G., Bradtke, S.J. & Singh, S.P. (1995). Learning to act using real-time dynamic programming. Artificial Intelligence 72, 81-138.
  2. Barto, A.G., Sutton, R.S. & Watkins, C.J.C.H. (1989). Learning and sequential decision making. In M Gabriel & J Moore, editors, Learning and Computational Neuroscience: Foundations of Adaptive Networks. Cambridge, MA: MIT Press, Bradford Books.
  3. Bertsekas, D. & Shreve, S.E. (1978). Stochastic Optimal Control: The Discrete Time Case. New York, NY: Academic Press.
  4. Cohn, D.A. (1994). Neural network exploration using optimal experiment design. In JD Cowan, G Tesauro & J Alspector, editors, Advances in Neural Information Processing Systems, 6. San Mateo, CA: Morgan Kaufmann, 679-686.
  5. Cozzolino, J.M., Gonzalez-Zubieta, R. & Miller, R. (1965). Markov Decision Processes with Uncertain Transition Probabilities. Technical Report 11, Operations Research Center, MIT, Cambridge.
  6. Dersin, P.L., Athans, M. & Kendrick, D.A. (1981). Some properties of the dual adaptive stochastic control algorithm. IEEE Transactions on Automatic Control 26, 1001-1008.
  7. Dreyfus, S.E. (1965). Dynamic Programming and the Calculus of Variations. New York, NY: Academic Press.
  8. Fedorov, V. (1972). Theory of Optimal Experiments. New York: Academic Press.
  9. Fe'ldbaum, A.A. (1965). Optimal Control Systems. New York, NY: Academic Press.
  10. Howard, R.A. (1960). Dynamic Programming and Markov Processes. New York, NY: Technology Press & Wiley.
  11. Kumar, P.R. (1985). A survey of some results in stochastic adaptive control. SIAM Journal on Control and Optimization 23, 329-380.
  12. Littman, M.L. (1996). Algorithms for Sequential Decision Making. Ph.D., Department of Computer Science, Brown University.
  13. Lovejoy, W.S. (1991). A survey of algorithmic methods for partially observed Markov decision processes. Annals of Operations Research 28, 47-66.
  14. Meier, L., III (1965). Combined optimal control and estimation. Proceedings of the Third Annual Allerton Conference on Circuit and System Theory.
  15. Monahan, G.E. (1982). A survey of partially observable Markov decision processes: Theory, models and algorithms. Management Science 28, 1-16.
  16. Moore, A.W. & Atkeson, C.G. (1993). Prioritized sweeping: Reinforcement learning with less data and less real time. Machine Learning 13, 103-130.
  17. Moore, A.W. & Atkeson, C.G. (1994). The Parti-Game algorithm. In G Tesauro, JD Cowan & J Alspector, editors, Advances in Neural Information Processing Systems, 6. San Mateo, CA: Morgan Kaufmann.
  18. Peng, J. & Williams, R.J. (1992). Efficient search control in DYNA. College of Computer Science, Northeastern University.
  19. Rishel, R.W. (1970). Necessary and sufficient dynamic programming conditions for continuous time stochastic optimal control. SIAM Journal of Control 8, 559-571.
  20. Sato, M., Abe, K. & Takeda, H. (1982). Learning control of finite Markov chains with unknown transition probabilities. IEEE Transactions on Automatic Control 27, 502-505.
  21. Schmidhuber, J.H. (1991). Adaptive Confidence and Adaptive Curiosity. (Technical Report FKI-149-91). Technische Universität München, Germany.
  22. Striebel, C.T. (1965). Sufficient statistics in the optimal control of stochastic systems. Journal of Mathematical Analysis and Applications 12, 576-592.
  23. Sutton, R.S. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. Machine Learning: Proceedings of the Seventh International Conference, 216-224.
  24. Thrun, S.B. (1992). The role of exploration in learning control. In D.A. White & D.A. Sofge, editors, Handbook of Intelligent Control: Neural, Fuzzy and Adaptive Approaches. New York, NY: Van Nostrand Reinhold.
  25. Thrun, S.B. & Möller, K. (1992). Active exploration in dynamic environments. In J.E. Moody, S.J. Hanson & R.P. Lippmann, editors, Advances in Neural Information Processing Systems 4, 531-538. San Mateo, CA: Morgan Kaufmann.
  26. Tse, E. & Bar-Shalom, Y. (1973). An actively adaptive control for linear systems with random parameters via the dual control approach. IEEE Transactions on Automatic Control 18, 109-117.
  27. Tse, E., Bar-Shalom, Y. & Meier, L., III (1973). Wide-sense adaptive dual control for nonlinear stochastic systems. IEEE Transactions on Automatic Control 18, 98-108.
  28. Watkins, C.J.C.H. (1989). Learning from Delayed Rewards. PhD Thesis, Department of Psychology, University of Cambridge, England.

Copyright information

© Kluwer Academic Publishers 1996

Authors and Affiliations

  • Peter Dayan (1)
  • Terrence J. Sejnowski (2, 3)

  1. CBCL, Department of Brain and Cognitive Science, Cambridge, MA 02139
  2. Howard Hughes Medical Institute, The Salk Institute, CA
  3. Department of Biology, University of California at San Diego, La Jolla, CA
