*Recent Advances in Reinforcement Learning*, pp. 159–195

# Average Reward Reinforcement Learning: Foundations, Algorithms, and Empirical Results


## Abstract

This paper presents a detailed study of average reward reinforcement learning, an undiscounted optimality framework that is more appropriate for cyclical tasks than the much better studied discounted framework. A wide spectrum of average reward algorithms are described, ranging from synchronous dynamic programming methods to several (provably convergent) asynchronous algorithms from optimal control and learning automata. A general sensitive discount optimality metric called *n-discount-optimality* is introduced, and used to compare the various algorithms. The overview identifies a key similarity across several asynchronous algorithms that is crucial to their convergence, namely independent estimation of the average reward and the relative values. The overview also uncovers a surprising limitation shared by the different algorithms: while several algorithms can provably generate *gain-optimal* policies that maximize average reward, none of them can reliably filter these to produce *bias-optimal* (or *T-optimal*) policies that also maximize the finite reward to absorbing goal states. This paper also presents a detailed empirical study of R-learning, an average reward reinforcement learning method, using two empirical testbeds: a stochastic grid world domain and a simulated robot environment. A detailed sensitivity analysis of R-learning is carried out to test its dependence on learning rates and exploration levels. The results suggest that R-learning is quite sensitive to exploration strategies, and can fall into sub-optimal limit cycles. The performance of R-learning is also compared with that of Q-learning, the best studied discounted RL method. Here, the results suggest that R-learning can be fine-tuned to give better performance than Q-learning in both domains.
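The key algorithmic ingredient highlighted above — estimating the average reward independently of the relative action values — can be made concrete with a minimal tabular sketch of R-learning (Schwartz, 1993). The sketch below runs it on a hypothetical two-state cyclical task (the state names, transition table, and learning rates are illustrative assumptions, not from the paper): the agent learns relative values `R(s, a)` with an update measured against the average-reward estimate `rho`, and updates `rho` only on non-exploratory (greedy) steps.

```python
import random

def r_learning(mdp, start, steps=20000, alpha=0.1, beta=0.01, epsilon=0.1, seed=0):
    """Tabular R-learning sketch on a deterministic MDP given as
    mdp[state][action] = (next_state, reward).  Learns relative action
    values R(s, a) and an independent average-reward estimate rho."""
    rng = random.Random(seed)
    R = {(s, a): 0.0 for s in mdp for a in mdp[s]}
    rho = 0.0
    s = start
    for _ in range(steps):
        greedy = max(mdp[s], key=lambda a: R[(s, a)])
        a = rng.choice(list(mdp[s])) if rng.random() < epsilon else greedy
        s2, r = mdp[s][a]
        best_next = max(R[(s2, b)] for b in mdp[s2])
        best_here = max(R[(s, b)] for b in mdp[s])
        # Relative-value update: the immediate reward is measured
        # against the current average-reward estimate rho.
        R[(s, a)] += alpha * (r - rho + best_next - R[(s, a)])
        # The average reward is estimated independently of the relative
        # values, and only on non-exploratory (greedy) steps.
        if a == greedy:
            rho += beta * (r + best_next - best_here - rho)
        s = s2
    return R, rho

# Hypothetical two-state cyclical task: staying in state B pays 2 per
# step, so the gain-optimal policy has average reward rho = 2.
mdp = {
    "A": {"stay": ("A", 0.0), "move": ("B", 0.0)},
    "B": {"stay": ("B", 2.0), "move": ("A", 0.0)},
}
R, rho = r_learning(mdp, start="A")
```

On this toy task the estimate `rho` settles near the optimal gain of 2, and the greedy policy moves to `B` and stays there. Note that the sketch uses the fixed learning rates and the simple epsilon-greedy exploration named above; as the sensitivity analysis in the paper suggests, both choices matter in practice.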

## Keywords

Reinforcement learning; Markov decision processes
