To address the problem that Q-learning can suffer from large overestimations in some stochastic environments, we first propose a new form of Q-learning, prove that it is equivalent to the incremental form, and analyze why positive bias slows the convergence of Q-learning. We then generalize the new form so that it can be adapted easily. By using the current value in place of the bias term, we present an accurate Q-learning algorithm and show that it converges to an optimal policy. Experimentally, the new algorithm avoids the effect of positive bias and converges faster than Q-learning and its variants on several MDP problems.
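As a concrete illustration of the overestimation problem the abstract refers to, the minimal sketch below shows the standard tabular Q-learning update, whose max operator is the source of positive bias. This is not the paper's accurate Q-learning algorithm; the function name, signature, and parameters (Q, alpha, gamma) are assumptions made for the example.

```python
import numpy as np

def q_learning_step(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One standard tabular Q-learning update (Watkins).

    The bootstrap target takes a max over the same noisy estimates that
    are being learned; since E[max_a Q(s', a)] >= max_a E[Q(s', a)],
    the target is positively biased in stochastic environments.
    """
    target = r + gamma * np.max(Q[s_next])  # max operator: source of positive bias
    Q[s, a] += alpha * (target - Q[s, a])   # incremental update toward the biased target
    return Q
```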
Keywords: Reinforcement learning · Q-learning · Positive bias · Accurate Q-learning
This work was funded by the National Natural Science Foundation of China (61272005, 61303108, 61373094, 61472262, 61502323), the Natural Science Foundation of Jiangsu (BK2012616), the High School Natural Foundation of Jiangsu (13KJB520020), and the Key Laboratory of Symbolic Computation and Knowledge Engineering. We would also like to thank the reviewers for their helpful comments.