Machine Learning, Volume 22, Issue 1, pp 159–195

Average reward reinforcement learning: Foundations, algorithms, and empirical results

  • Sridhar Mahadevan

DOI: 10.1007/BF00114727

Cite this article as:
Mahadevan, S. Mach Learn (1996) 22: 159. doi:10.1007/BF00114727


This paper presents a detailed study of average reward reinforcement learning, an undiscounted optimality framework that is more appropriate for cyclical tasks than the much better studied discounted framework. A wide spectrum of average reward algorithms is described, ranging from synchronous dynamic programming methods to several (provably convergent) asynchronous algorithms from optimal control and learning automata. A general sensitive discount optimality metric called n-discount-optimality is introduced and used to compare the various algorithms. The overview identifies a key similarity across several asynchronous algorithms that is crucial to their convergence, namely independent estimation of the average reward and the relative values. The overview also uncovers a surprising limitation shared by the different algorithms: while several algorithms can provably generate gain-optimal policies that maximize average reward, none of them can reliably filter these to produce bias-optimal (or T-optimal) policies that also maximize the finite reward to absorbing goal states. This paper also presents a detailed empirical study of R-learning, an average reward reinforcement learning method, using two empirical testbeds: a stochastic grid world domain and a simulated robot environment. A detailed sensitivity analysis of R-learning is carried out to test its dependence on learning rates and exploration levels. The results suggest that R-learning is quite sensitive to exploration strategies and can fall into sub-optimal limit cycles. The performance of R-learning is also compared with that of Q-learning, the best studied discounted RL method. Here, the results suggest that R-learning can be fine-tuned to give better performance than Q-learning in both domains.
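The "independent estimation of the average reward and the relative values" noted in the abstract can be illustrated with a minimal tabular sketch of a commonly stated form of the R-learning update. The function name `r_learning_step`, the list-of-lists table layout, and the step sizes `alpha` and `beta` are illustrative assumptions and are not taken from the article; the exact update order and greedy test used in the paper may differ.

```python
def r_learning_step(R, rho, s, a, r, s_next, alpha=0.1, beta=0.5):
    """One tabular R-learning update (illustrative sketch, not the paper's code).

    R      : list of lists, R[state][action] = relative action value (updated in place)
    rho    : current estimate of the average reward per step
    s, a   : state and action just taken
    r      : immediate reward received
    s_next : resulting state
    alpha  : step size for the average reward estimate (assumed value)
    beta   : step size for the relative values (assumed value)
    Returns the new average reward estimate.
    """
    best_next = max(R[s_next])          # greedy relative value of the next state
    best_here = max(R[s])               # greedy relative value of the current state
    was_greedy = R[s][a] == best_here   # update rho only on non-exploratory actions

    # Relative value update: reward is measured against the average reward rho.
    R[s][a] += beta * (r - rho + best_next - R[s][a])

    # Average reward update, estimated separately from the relative values.
    if was_greedy:
        rho += alpha * (r + best_next - best_here - rho)
    return rho


# Usage: a two-state, two-action table initialised to zero.
R = [[0.0, 0.0], [0.0, 0.0]]
rho = r_learning_step(R, 0.0, s=0, a=1, r=1.0, s_next=1)
```

Keeping `rho` as a separate scalar, updated only after greedy actions, is one way to keep exploratory moves from biasing the average reward estimate; the abstract's finding that R-learning is sensitive to exploration is consistent with this coupling between the action choice and the `rho` update.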


Reinforcement learning; Markov decision processes

Copyright information

© Kluwer Academic Publishers 1996

Authors and Affiliations

  • Sridhar Mahadevan (1)
  1. Department of Computer Science and Engineering, University of South Florida, Tampa