Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms
  • Published: March 2000


  • Satinder Singh (1),
  • Tommi Jaakkola (2),
  • Michael L. Littman (3) &
  • Csaba Szepesvári (4)

Machine Learning, volume 38, pages 287–308 (2000)


Abstract

An important application of reinforcement learning (RL) is to finite-state control problems, and one of the most difficult problems in learning for control is balancing the exploration/exploitation tradeoff. Existing theoretical results for RL give very little guidance on reasonable ways to perform exploration. In this paper, we examine the convergence of single-step on-policy RL algorithms for control. On-policy algorithms cannot separate exploration from learning and therefore must confront the exploration problem directly. We prove convergence results for several related on-policy algorithms with both decaying exploration and persistent exploration. We also provide examples of exploration strategies that can be followed during learning that result in convergence to both optimal values and optimal policies.
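The abstract does not spell out the specific algorithms analyzed, so the snippet below is only an illustrative sketch of what a single-step on-policy method with decaying exploration can look like: tabular Sarsa(0) with an ε-greedy behavior policy whose ε shrinks over time. The toy MDP, the step-size schedule, and the exploration schedule are assumptions chosen for the example, not the constructions or conditions used in the paper.

```python
import numpy as np

# Minimal sketch (assumptions, not the paper's construction): tabular
# Sarsa(0), a single-step on-policy algorithm, with decaying
# epsilon-greedy exploration on a hypothetical two-state MDP.

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 2, 2, 0.9

def step(state, action):
    """Toy transition/reward model, invented for illustration."""
    if action == 0:                 # action 0: stay put, small reward
        return state, 0.1
    next_state = 1 - state          # action 1: switch states
    reward = 1.0 if state == 1 else 0.0
    return next_state, reward

Q = np.zeros((n_states, n_actions))

def eps_greedy(state, eps):
    """Behavior policy: explore with probability eps, else act greedily."""
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[state]))

state = 0
action = eps_greedy(state, eps=1.0)
for t in range(1, 50_001):
    eps = 1.0 / np.sqrt(t)          # decaying exploration: greedy in the limit
    alpha = 1.0 / (1.0 + 0.01 * t)  # decaying step size (a simplification;
                                    # convergence conditions are usually stated
                                    # per state-action pair)
    next_state, reward = step(state, action)
    next_action = eps_greedy(next_state, eps)
    # On-policy (Sarsa) update: bootstrap on the action actually taken next,
    # so exploration cannot be separated from learning.
    Q[state, action] += alpha * (reward + gamma * Q[next_state, next_action]
                                 - Q[state, action])
    state, action = next_state, next_action

print(np.round(Q, 3))
```

Persistent exploration, the other setting mentioned in the abstract, would correspond to holding ε fixed at a positive value instead of decaying it.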




Author information

Authors and Affiliations

  1. AT&T Labs-Research, 180 Park Avenue, Florham Park, NJ, 07932, USA

    Satinder Singh

  2. Department of Computer Science, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA

    Tommi Jaakkola

  3. Department of Computer Science, Duke University, Durham, NC, 27708-0129, USA

    Michael L. Littman

  4. Mindmaker Ltd., Konkoly Thege M. u. 29-33, Budapest, 1121, Hungary

    Csaba Szepesvári



About this article

Cite this article

Singh, S., Jaakkola, T., Littman, M.L. et al. Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms. Machine Learning 38, 287–308 (2000). https://doi.org/10.1023/A:1007678930559


  • Issue Date: March 2000

  • DOI: https://doi.org/10.1023/A:1007678930559


Keywords

  • reinforcement-learning
  • on-policy
  • convergence
  • Markov decision processes