Accurate Q-Learning

  • Zhihui Hu
  • Yubin Jiang
  • Xinghong Ling
  • Quan Liu
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11303)


Abstract

To address the problem that Q-learning can suffer from large overestimations in some stochastic environments, we first propose a new form of Q-learning, prove that it is equivalent to the incremental form, and analyze why positive bias slows the convergence of Q-learning. We then generalize the new form so that it can be adapted easily. By using the current value in place of the bias term, we obtain an accurate Q-learning algorithm and show that it converges to an optimal policy. Experiments on several MDP problems demonstrate that the new algorithm avoids the effect of positive bias and converges faster than Q-learning and its variants.
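For orientation, here is a minimal sketch of the standard tabular Q-learning update that the abstract builds on, with a comment marking the max operator whose positive bias (E[max_a Q(s,a)] >= max_a E[Q(s,a)] under noisy estimates) the paper's accurate variant is designed to remove. The function name, parameters, and environment interface (env.reset(), env.step()) are hypothetical stand-ins; this is not the paper's algorithm itself.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.95, epsilon=0.1, seed=0):
    """Watkins-style tabular Q-learning with epsilon-greedy exploration."""
    rng = np.random.default_rng(seed)
    q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()  # hypothetical interface: returns an initial state index
        done = False
        while not done:
            # Epsilon-greedy action selection over the current estimates.
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))
            else:
                a = int(np.argmax(q[s]))
            s_next, r, done = env.step(a)  # hypothetical: (next state, reward, done)
            # The max below selects over the same noisy estimates it evaluates,
            # so in expectation it overshoots the true maximum value. This is
            # the positive bias that the paper's accurate update addresses by
            # using the current value instead of the bias term.
            target = r + (0.0 if done else gamma * np.max(q[s_next]))
            q[s, a] += alpha * (target - q[s, a])
    return q
```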


Keywords: Reinforcement learning · Q-learning · Positive bias · Accurate Q-learning



Acknowledgements

This work was funded by the National Natural Science Foundation of China (61272005, 61303108, 61373094, 61472262, 61502323), the Natural Science Foundation of Jiangsu (BK2012616), the High School Natural Foundation of Jiangsu (13KJB520020), and the Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University. We would also like to thank the reviewers for their helpful comments.



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Zhihui Hu (1)
  • Yubin Jiang (1)
  • Xinghong Ling (1)
  • Quan Liu (1, 2, 3)

  1. School of Computer Science and Technology, Soochow University, Suzhou, China
  2. Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing, China
  3. Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, China
