Abstract
Maximization of average reward is a major goal in reinforcement learning. Existing model-free, value-based algorithms, such as R-Learning, use average adjusted values. We propose a different framework, the Average Reward Independent Gamma Ensemble (AR-IGE), which is based on an ensemble of discounting Q-learning modules, each with a different discount factor. Whereas existing algorithms learn only the optimal policy and its average reward, the AR-IGE learns several policies and their resulting average rewards. We prove the optimality of the AR-IGE in episodic, deterministic problems where rewards are given at several goal states. Furthermore, we show that the AR-IGE outperforms existing algorithms in such problems, especially when policies must be changed due to changes in the task. The AR-IGE represents a new way to optimize average reward that could lead to further improvements in the field.
Acknowledgement
We thank Tadashi Kozuno for his help with parts of the optimality proof.
Copyright information
© 2017 Springer International Publishing AG
Cite this paper
Reinke, C., Uchibe, E., Doya, K. (2017). Average Reward Optimization with Multiple Discounting Reinforcement Learners. In: Liu, D., Xie, S., Li, Y., Zhao, D., El-Alfy, E.S. (eds) Neural Information Processing. ICONIP 2017. Lecture Notes in Computer Science, vol 10634. Springer, Cham. https://doi.org/10.1007/978-3-319-70087-8_81
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-70086-1
Online ISBN: 978-3-319-70087-8