Machine Learning

Volume 22, Issue 1–3, pp 159–195

Average reward reinforcement learning: Foundations, algorithms, and empirical results

  • Sridhar Mahadevan

Abstract

This paper presents a detailed study of average reward reinforcement learning, an undiscounted optimality framework that is more appropriate for cyclical tasks than the much better studied discounted framework. A wide spectrum of average reward algorithms are described, ranging from synchronous dynamic programming methods to several (provably convergent) asynchronous algorithms from optimal control and learning automata. A general sensitive discount optimality metric called n-discount-optimality is introduced, and used to compare the various algorithms. The overview identifies a key similarity across several asynchronous algorithms that is crucial to their convergence, namely independent estimation of the average reward and the relative values. The overview also uncovers a surprising limitation shared by the different algorithms: while several algorithms can provably generate gain-optimal policies that maximize average reward, none of them can reliably filter these to produce bias-optimal (or T-optimal) policies that also maximize the finite reward to absorbing goal states. This paper also presents a detailed empirical study of R-learning, an average reward reinforcement learning method, using two empirical testbeds: a stochastic grid world domain and a simulated robot environment. A detailed sensitivity analysis of R-learning is carried out to test its dependence on learning rates and exploration levels. The results suggest that R-learning is quite sensitive to exploration strategies and can fall into sub-optimal limit cycles. The performance of R-learning is also compared with that of Q-learning, the best studied discounted RL method. Here, the results suggest that R-learning can be fine-tuned to give better performance than Q-learning in both domains.
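
The key property highlighted above, independent estimation of the average reward and the relative values, can be made concrete with a short sketch. The following is a minimal, illustrative rendering of a Schwartz-style tabular R-learning loop consistent with the description in the abstract, not the paper's exact formulation; the environment interface (env.reset(), env.actions(state), and env.step(state, action) returning a reward and next state) and the parameter names alpha, beta, and epsilon are assumptions made here for illustration.

    import random
    from collections import defaultdict

    def r_learning(env, steps=100_000, alpha=0.05, beta=0.1, epsilon=0.1):
        """Sketch of tabular R-learning: relative action values R(s, a) plus an
        independently maintained estimate rho of the average reward per step."""
        R = defaultdict(float)   # relative values, keyed by (state, action); default 0.0
        rho = 0.0                # running estimate of the average reward

        def best_value(s):
            return max(R[(s, a)] for a in env.actions(s))

        def greedy_action(s):
            return max(env.actions(s), key=lambda a: R[(s, a)])

        state = env.reset()
        for _ in range(steps):
            # Epsilon-greedy exploration; the empirical results reported in the
            # paper suggest R-learning is quite sensitive to this choice.
            exploratory = random.random() < epsilon
            action = (random.choice(list(env.actions(state))) if exploratory
                      else greedy_action(state))

            reward, next_state = env.step(state, action)

            # Relative-value update: move R(s, a) toward r - rho + max_a' R(s', a').
            R[(state, action)] += beta * (reward - rho + best_value(next_state)
                                          - R[(state, action)])

            # Average-reward update, applied only on greedy (non-exploratory)
            # steps and maintained independently of the relative values.
            if not exploratory:
                rho += alpha * (reward + best_value(next_state)
                                - best_value(state) - rho)

            state = next_state

        return R, rho

A fixed number of steps, rather than episodes, is used here because the average reward criterion is most natural for continuing (cyclical) tasks; whether rho is updated before or after the relative-value update is an implementation detail that varies across presentations.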

Keywords

Reinforcement learning · Markov decision processes

References

  1. Baird, L. Personal communication.
  2. Baird, L., (1995). Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Machine Learning, pages 30–37. Morgan Kaufmann.
  3. Barto, A., Bradtke, S. & Singh, S., (1995). Learning to act using real-time dynamic programming. Artificial Intelligence, 72(1):81–138.
  4. Bertsekas, D., (1982). Distributed dynamic programming. IEEE Transactions on Automatic Control, AC-27(3).
  5. Bertsekas, D., (1987). Dynamic Programming: Deterministic and Stochastic Models. Prentice-Hall.
  6. Blackwell, D., (1962). Discrete dynamic programming. Annals of Mathematical Statistics, 33:719–726.
  7. Boutilier, C. & Puterman, M., (1995). Process-oriented planning and average-reward optimality. In Proceedings of the Fourteenth IJCAI, pages 1096–1103. Morgan Kaufmann.
  8. Dayan, P. & Hinton, G., (1992). Feudal reinforcement learning. In Neural Information Processing Systems (NIPS), pages 271–278.
  9. Denardo, E., (1970). Computing a bias-optimal policy in a discrete-time Markov decision problem. Operations Research, 18:272–289.
  10. Dent, L., Boticario, J., McDermott, J., Mitchell, T. & Zabowski, D., (1992). A personal learning apprentice. In Proceedings of the Tenth National Conference on Artificial Intelligence (AAAI), pages 96–103. MIT Press.
  11. Engelberger, J., (1989). Robotics in Service. MIT Press.
  12. Federgruen, A. & Schweitzer, P., (1984). Successive approximation methods for solving nested functional equations in Markov decision problems. Mathematics of Operations Research, 9:319–344.
  13. Haviv, M. & Puterman, M., (1991). An improved algorithm for solving communicating average reward Markov decision processes. Annals of Operations Research, 28:229–242.
  14. Hordijk, A. & Tijms, H., (1975). A modified form of the iterative method of dynamic programming. Annals of Statistics, 3:203–208.
  15. Howard, R., (1960). Dynamic Programming and Markov Processes. MIT Press.
  16. Jalali, A. & Ferguson, M., (1989). Computationally efficient adaptive control algorithms for Markov chains. In Proceedings of the 28th IEEE Conference on Decision and Control, pages 1283–1288.
  17. Jalali, A. & Ferguson, M., (1990). A distributed asynchronous algorithm for expected average cost dynamic programming. In Proceedings of the 29th IEEE Conference on Decision and Control, pages 1394–1395.
  18. Kaelbling, L., (1993a). Hierarchical learning in stochastic domains: Preliminary results. In Proceedings of the Tenth International Conference on Machine Learning, pages 167–173. Morgan Kaufmann.
  19. Kaelbling, L., (1993b). Learning in Embedded Systems. MIT Press.
  20. Lin, L., (1993). Reinforcement Learning for Robots using Neural Networks. PhD thesis, Carnegie-Mellon Univ.
  21. Mahadevan, S. A model-based bias-optimal reinforcement learning algorithm. In preparation.
  22. Mahadevan, S., (1992). Enhancing transfer in reinforcement learning by building stochastic models of robot actions. In Proceedings of the Seventh International Conference on Machine Learning, pages 290–299. Morgan Kaufmann.
  23. Mahadevan, S., (1994). To discount or not to discount in reinforcement learning: A case study comparing R-learning and Q-learning. In Proceedings of the Eleventh International Conference on Machine Learning, pages 164–172. Morgan Kaufmann.
  24. Mahadevan, S. & Baird, L. Value function approximation in average reward reinforcement learning. In preparation.
  25. Mahadevan, S. & Connell, J., (1992). Automatic programming of behavior-based robots using reinforcement learning. Artificial Intelligence, 55:311–365. Appeared originally as IBM TR RC16359, Dec 1990.
  26. Moore, A., (1991). Variable resolution dynamic programming: Efficiently learning action maps in multivariate real-valued state spaces. In Proceedings of the Eighth International Workshop on Machine Learning, pages 333–337. Morgan Kaufmann.
  27. Narendra, K. & Thathachar, M., (1989). Learning Automata: An Introduction. Prentice Hall.
  28. Puterman, M., (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley.
  29. Ross, S., (1983). Introduction to Stochastic Dynamic Programming. Academic Press.
  30. Salganicoff, M., (1993). Density-adaptive learning and forgetting. In Proceedings of the Tenth International Conference on Machine Learning, pages 276–283. Morgan Kaufmann.
  31. Schwartz, A., (1993). A reinforcement learning method for maximizing undiscounted rewards. In Proceedings of the Tenth International Conference on Machine Learning, pages 298–305. Morgan Kaufmann.
  32. Singh, S., (1994a). Learning to Solve Markovian Decision Processes. PhD thesis, Univ. of Massachusetts, Amherst.
  33. Singh, S., (1994b). Reinforcement learning algorithms for average-payoff Markovian decision processes. In Proceedings of the 12th AAAI. MIT Press.
  34. Sutton, R., (1988). Learning to predict by the method of temporal differences. Machine Learning, 3:9–44.
  35. Sutton, R., (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the Seventh International Conference on Machine Learning, pages 216–224. Morgan Kaufmann.
  36. Sutton, R., editor, (1992). Reinforcement Learning. Kluwer Academic Press. Special issue of Machine Learning Journal, Vol. 8, Nos. 3–4, May 1992.
  37. Tadepalli, P. Personal communication.
  38. Tadepalli, P. & Ok, D., (1994). H-learning: A reinforcement learning method to optimize undiscounted average reward. Technical Report 94-30-01, Oregon State Univ.
  39. Tesauro, G., (1992). Practical issues in temporal difference learning. In R. Sutton, editor, Reinforcement Learning. Kluwer Academic Publishers.
  40. Thrun, S. The role of exploration in learning control. In D. A. White and D. A. Sofge, editors, Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches. Van Nostrand Reinhold.
  41. Tsitsiklis, J., (1994). Asynchronous stochastic approximation and Q-learning. Machine Learning, 16:185–202.
  42. Veinott, A., (1969). Discrete dynamic programming with sensitive discount optimality criteria. Annals of Mathematical Statistics, 40(5):1635–1660.
  43. Watkins, C., (1989). Learning from Delayed Rewards. PhD thesis, King's College, Cambridge, England.
  44. Wheeler, R. & Narendra, K., (1986). Decentralized learning in finite Markov chains. IEEE Transactions on Automatic Control, AC-31(6).
  45. White, D., (1963). Dynamic programming, Markov chains, and the method of successive approximations. Journal of Mathematical Analysis and Applications, 6:373–376.
  46. Whitehead, S., Karlsson, J. & Tenenberg, J., (1993). Learning multiple goal behavior via task decomposition and dynamic policy merging. In J. Connell and S. Mahadevan, editors, Robot Learning. Kluwer Academic Publishers.

Copyright information

© Kluwer Academic Publishers 1996

Authors and Affiliations

  • Sridhar Mahadevan
    1. Department of Computer Science and Engineering, University of South Florida, Tampa