Optimal Tuning of Continual Online Exploration in Reinforcement Learning

  • Youssef Achbany
  • Francois Fouss
  • Luh Yen
  • Alain Pirotte
  • Marco Saerens
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4131)


This paper presents a framework for tuning continual exploration optimally. It first quantifies the rate of exploration by defining the degree of exploration of a state as the entropy of the probability distribution over the admissible actions in that state. The exploration/exploitation tradeoff is then stated as a global optimization problem: find the exploration strategy that minimizes the expected cumulative cost while maintaining fixed degrees of exploration at the nodes. In other words, “exploitation” is maximized for constant “exploration”. This formulation leads to a set of nonlinear updating rules reminiscent of the value-iteration algorithm. Convergence of these rules to a local minimum can be proved for a stationary environment. Interestingly, in the deterministic case, these equations reduce to the Bellman equations for finding the shortest path when there is no exploration, whereas a full “blind” exploration is performed when exploration is maximum.
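The degree of exploration described in the abstract can be illustrated with a minimal sketch. It assumes the entropy is the Shannon entropy measured in nats; the function name `degree_of_exploration` and the example distributions are hypothetical and are not taken from the paper.

```python
import math


def degree_of_exploration(action_probs):
    """Shannon entropy (in nats) of a distribution over admissible actions.

    Assumption: natural logarithm; the abstract does not fix the base.
    """
    if any(p < 0.0 for p in action_probs):
        raise ValueError("action probabilities must be non-negative")
    if abs(sum(action_probs) - 1.0) > 1e-9:
        raise ValueError("action probabilities must sum to 1")
    # Terms with p == 0 contribute nothing, by the convention 0 * log(0) = 0.
    return -sum(p * math.log(p) for p in action_probs if p > 0.0)


# Three hypothetical policies for a state with three admissible actions.
greedy = [1.0, 0.0, 0.0]         # no exploration: one action is always chosen
mixed = [0.7, 0.2, 0.1]          # intermediate exploration
uniform = [1 / 3, 1 / 3, 1 / 3]  # maximum, "blind" exploration

for name, probs in [("greedy", greedy), ("mixed", mixed), ("uniform", uniform)]:
    print(name, degree_of_exploration(probs))
```

Under this assumption the entropy is zero for the deterministic choice and reaches its maximum, log(3), for the uniform choice, which corresponds to the two extremes mentioned at the end of the abstract.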


Optimal Policy · Reinforcement Learning · Destination State · Average Cost · Bellman Equation
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Youssef Achbany (1)
  • Francois Fouss (1)
  • Luh Yen (1)
  • Alain Pirotte (1)
  • Marco Saerens (1)

  1. Information Systems Research Unit (ISYS), Place des Doyens 1, Université de Louvain, Belgium
