Average Reward Optimization with Multiple Discounting Reinforcement Learners

  • Chris Reinke
  • Eiji Uchibe
  • Kenji Doya
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10634)


Maximization of average reward is a major goal in reinforcement learning. Existing model-free, value-based algorithms such as R-Learning use average adjusted values. We propose a different framework, the Average Reward Independent Gamma Ensemble (AR-IGE). It is based on an ensemble of discounting Q-learning modules with a different discount factor for each module. Existing algorithms only learn the optimal policy and its average reward. In contrast, the AR-IGE learns different policies and their resulting average rewards. We prove the optimality of the AR-IGE in episodic and deterministic problems where rewards are given at several goal states. Furthermore, we show that the AR-IGE outperforms existing algorithms in such problems, especially in situations where policies have to be changed due to changes in the task. The AR-IGE represents a new way to optimize average reward that could lead to further improvements in the field.
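The core idea of the abstract, several Q-learning modules that differ only in their discount factor and therefore settle on different policies with different average rewards, can be illustrated with a small sketch. This is not the authors' AR-IGE implementation. The line-world task, its reward values, the sweep-based update schedule, and all function names are hypothetical choices made for illustration. Average reward is taken here to be the reward per step of one greedy episode, which is also an assumption.

```python
import numpy as np

# Hypothetical deterministic, episodic task (not from the paper):
# states 0..7 on a line, both ends are goal states with different rewards.
N_STATES = 8
START = 2
GOAL_REWARD = {0: 1.0, 7: 2.0}  # near goal: small reward, far goal: larger reward
ACTIONS = (-1, +1)              # move left, move right


def step(state, action_idx):
    """Deterministic transition; returns (next_state, reward, done)."""
    nxt = state + ACTIONS[action_idx]
    return nxt, GOAL_REWARD.get(nxt, 0.0), nxt in GOAL_REWARD


def learn_q(gamma, sweeps=50, alpha=1.0):
    """One module: tabular Q-learning updates with its own discount factor.

    Assumption: instead of exploring, the module sweeps over every
    non-goal (state, action) pair, which keeps the sketch deterministic.
    """
    q = np.zeros((N_STATES, len(ACTIONS)))
    for _ in range(sweeps):
        for s in range(1, N_STATES - 1):
            for a in range(len(ACTIONS)):
                nxt, r, done = step(s, a)
                target = r if done else r + gamma * q[nxt].max()
                q[s, a] += alpha * (target - q[s, a])
    return q


def average_reward(q, max_steps=100):
    """Follow the greedy policy from START; return reward per step."""
    s, total, steps, done = START, 0.0, 0, False
    while not done and steps < max_steps:
        s, r, done = step(s, int(q[s].argmax()))
        total += r
        steps += 1
    return total / steps


def run_ensemble(gammas=(0.5, 0.7, 0.9, 0.99)):
    """Train one module per discount factor and compare average rewards."""
    results = {g: average_reward(learn_q(g)) for g in gammas}
    best = max(results, key=results.get)
    return results, best


if __name__ == "__main__":
    results, best = run_ensemble()
    for g, rho in results.items():
        print(f"gamma={g}: average reward per step = {rho}")
    print("module with highest average reward: gamma =", best)
```

With these hypothetical numbers, the modules with small discount factors head for the near goal (reward 1 after 2 steps, 0.5 per step), while the modules with large discount factors head for the far goal (reward 2 after 5 steps, 0.4 per step). The module with the largest discount factor is therefore not the one with the highest average reward, which is why comparing the policies of all modules can be useful.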


Reinforcement learning, Average reward, Model-free, Value-based, Q-learning, Modular



We thank Tadashi Kozuno for his help with parts of the optimality proof.



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Okinawa Institute of Science and Technology, Okinawa, Japan
  2. ATR Computational Neuroscience Laboratories, Kyoto, Japan
