Machine Learning

Volume 38, Issue 3, pp. 287–308

Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms

  • Satinder Singh
  • Tommi Jaakkola
  • Michael L. Littman
  • Csaba Szepesvári


An important application of reinforcement learning (RL) is to finite-state control problems, and one of the most difficult problems in learning for control is balancing the exploration/exploitation tradeoff. Existing theoretical results for RL give very little guidance on reasonable ways to perform exploration. In this paper, we examine the convergence of single-step on-policy RL algorithms for control. On-policy algorithms cannot separate exploration from learning and therefore must confront the exploration problem directly. We prove convergence results for several related on-policy algorithms with both decaying exploration and persistent exploration. We also provide examples of exploration strategies that can be followed during learning that result in convergence to both optimal values and optimal policies.
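The single-step on-policy algorithm at the heart of this analysis is Sarsa(0), whose update uses the action the behavior policy actually takes at the next state. A minimal sketch on a hypothetical toy chain MDP (the environment, step sizes, and 1/k exploration schedule below are illustrative assumptions, not the paper's construction) shows the two ingredients the abstract refers to: the on-policy update and a decaying, "greedy in the limit" exploration strategy:

```python
import random

# Toy deterministic chain MDP: states 0..3, actions 0 (left) / 1 (right).
# Reaching state 3 yields reward 1 and ends the episode.
N_STATES, N_ACTIONS, GOAL = 4, 2, 3
GAMMA = 0.9

def step(s, a):
    s2 = min(s + 1, GOAL) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

def epsilon_greedy(Q, s, eps):
    # With probability eps explore uniformly; otherwise act greedily.
    if random.random() < eps:
        return random.randrange(N_ACTIONS)
    return max(range(N_ACTIONS), key=lambda a: Q[s][a])

def sarsa(episodes=2000, alpha=0.1, seed=0):
    random.seed(seed)
    Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for k in range(1, episodes + 1):
        eps = 1.0 / k          # decaying exploration: greedy in the limit
        s = 0
        a = epsilon_greedy(Q, s, eps)
        done = False
        while not done:
            s2, r, done = step(s, a)
            a2 = epsilon_greedy(Q, s2, eps)   # action actually taken next
            target = r + (0.0 if done else GAMMA * Q[s2][a2])
            Q[s][a] += alpha * (target - Q[s][a])   # on-policy update
            s, a = s2, a2
    return Q

Q = sarsa()
policy = [max(range(N_ACTIONS), key=lambda a: Q[s][a]) for s in range(N_STATES)]
```

Because the update bootstraps from the exploratory action `a2` rather than the greedy maximum, the learned values reflect the behavior policy itself; decaying `eps` toward zero is what lets both the values and the greedy policy converge to optimal ones.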

Keywords: reinforcement learning; on-policy; convergence; Markov decision processes



Copyright information

© Kluwer Academic Publishers 2000

Authors and Affiliations

  • Satinder Singh (1)
  • Tommi Jaakkola (2)
  • Michael L. Littman (3)
  • Csaba Szepesvári (4)

  1. AT&T Labs–Research, USA
  2. Department of Computer Science, Massachusetts Institute of Technology, Cambridge, USA
  3. Department of Computer Science, Duke University, Durham, USA
  4. Mindmaker Ltd., Budapest, Hungary
