Full Gradient DQN Reinforcement Learning: A Provably Convergent Scheme

  • Conference paper
  • First Online:
Modern Trends in Controlled Stochastic Processes:

Part of the book series: Emergence, Complexity and Computation ((ECC,volume 41))

Abstract

We analyze the DQN reinforcement learning algorithm as a stochastic approximation scheme using the o.d.e. (for ‘ordinary differential equation’) approach and point out certain theoretical issues. We then propose a modified scheme called Full Gradient DQN (FG-DQN, for short) that has a sound theoretical basis and compare it with the original scheme on sample problems. We observe better performance for FG-DQN.
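
The abstract describes FG-DQN only at a high level. As a rough illustration, the sketch below (not the authors' implementation) contrasts the usual DQN update, which treats the bootstrapped target as a constant ("semi-gradient"), with a full-gradient update that also differentiates the squared Bellman error through the max-term; the network architecture, optimizer, and batch construction are arbitrary choices made for the example.

```python
# Minimal PyTorch sketch (illustrative only, not the authors' code).
# It contrasts the DQN semi-gradient update with a full-gradient
# Bellman-error update of the kind the abstract refers to.
import torch
import torch.nn as nn

gamma = 0.99
n_states, n_actions = 4, 2

q_net = nn.Sequential(nn.Linear(n_states, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3)

def bellman_loss(batch, full_gradient: bool):
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s, a)
    q_next = q_net(s_next).max(dim=1).values              # max_a' Q(s', a')
    target = r + gamma * (1.0 - done) * q_next
    if not full_gradient:
        # Standard DQN: the target is treated as a constant, so the
        # gradient flows only through Q(s, a) ("semi-gradient").
        target = target.detach()
    # Full-gradient variant: the squared Bellman error is differentiated
    # through both Q(s, a) and the max-term.
    return ((target - q_sa) ** 2).mean()

# Usage with a random batch of 8 transitions (placeholder data).
batch = (torch.randn(8, n_states),
         torch.randint(0, n_actions, (8,)),
         torch.randn(8),
         torch.randn(8, n_states),
         torch.zeros(8))
loss = bellman_loss(batch, full_gradient=True)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

When the maximum is attained by several actions, autograd simply picks one admissible gradient; the Appendix below recalls the sub/super-differential calculus relevant to such non-smooth points.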

References

  1. Agarwal, A., Kakade, S.M., Lee, J.D., Mahajan, G.: Optimality and approximation with policy gradient methods in Markov decision processes. In: Conference on Learning Theory, PMLR, pp. 64–66 (2020)

  2. Agazzi, A., Lu, J.: Global optimality of softmax policy gradient with single hidden layer neural networks in the mean-field regime. arXiv preprint arXiv:2010.11858 (2020)

  3. Baird, L.: Residual algorithms: reinforcement learning with function approximation. In: Machine Learning Proceedings 1995, pp. 30–37 (1995)

  4. Bardi, M., Capuzzo-Dolcetta, I.: Optimal Control and Viscosity Solutions of Hamilton-Jacobi-Bellman Equations. Birkhäuser, Boston (2018)

  5. Barto, A.G., Sutton, R.S., Anderson, C.W.: Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Trans. Syst. Man Cybern. SMC-13(5), 834–846 (1983)

  6. Benveniste, A., Metivier, M., Priouret, P.: Adaptive Algorithms and Stochastic Approximations. Springer, Heidelberg (1991). https://doi.org/10.1007/978-3-642-75894-2_9

  7. Bertsekas, D.P.: Reinforcement Learning and Optimal Control. Athena Scientific (2019)

  8. Bhandari, J., Russo, D.: Global optimality guarantees for policy gradient methods. arXiv preprint arXiv:1906.01786 (2019)

  9. Bhatnagar, S., Borkar, V.S., Prabuchandran, K.J.: Feature search in the Grassmanian in online reinforcement learning. IEEE J. Sel. Top. Signal Process. 7(5), 746–758 (2013)

  10. Borkar, V.S.: Probability Theory: An Advanced Course. Springer, New York (1995)

  11. Borkar, V.S.: Stochastic Approximation: A Dynamical Systems Viewpoint. Hindustan Publishing Agency, New Delhi, and Cambridge University Press, Cambridge, UK (2008)

  12. Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: OpenAI Gym. ArXiv preprint arXiv:1606.01540 (2016)

  13. Cai, Q., Yang, Z., Lee, J.D., Wang, Z.: Neural temporal-difference learning converges to global optima. Adv. Neural Inf. Process. Syst. 32 (2019)

  14. Chadès, I., Chapron, G., Cros, M.J., Garcia, F., Sabbadin, R.: MDPtoolbox: a multi-platform toolbox to solve stochastic dynamic programming problems. Ecography 37, 916–920 (2014)

  15. Chizat, L., Bach, F.: On the global convergence of gradient descent for over-parameterized models using optimal transport. In: Proceedings of Neural Information Processing Systems, pp. 3040–3050 (2018)

  16. Couture, S., Cros, M.J., Sabbadin, R.: Risk aversion and optimal management of an uneven-aged forest under risk of windthrow: a Markov decision process approach. J. For. Econ. 25, 94–114 (2016)

  17. Danskin, J.M.: The theory of max-min, with applications. SIAM J. Appl. Math. 14, 641–664 (1966)

  18. Delyon, B., Lavielle, M., Moulines, E.: Convergence of a stochastic approximation version of the EM algorithm. Ann. Stat. 27, 94–128 (1999)

  19. Florian, R.V.: Correct equations for the dynamics of the cart-pole system. Center for Cognitive and Neural Studies (Coneural), Romania (2007)

  20. François-Lavet, V., Henderson, P., Islam, R., Bellemare, M.G., Pineau, J.: An introduction to deep reinforcement learning. Found. Trends Mach. Learn. 11(3–4), 219–354 (2018)

  21. Gordon, G. J.: Stable fitted reinforcement learning. In: Advances in Neural Information Processing Systems, pp. 1052–1058 (1996)

  22. Gordon, G. J.: Approximate solutions to Markov decision processes. Ph.D. Thesis, Carnegie-Mellon University (1999)

  23. Gu, S., Holly, E., Lillicrap, T., Levine, S.: Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In: Proceedings of IEEE International Conference on Robotics and Automation, pp. 3389–3396 (2017)

  24. Haarnoja, T., Ha, S., Zhou, A., Tan, J., Tucker, G., Levine, S.: Learning to walk via deep reinforcement learning. ArXiv preprint arXiv:1812.11103 (2018)

  25. Jaakkola, T., Jordan, M.I., Singh, S.P.: On the convergence of stochastic iterative dynamic programming algorithms. Neural Comput. 6, 1185–1201 (1994)

  26. Jonsson, A.: Deep reinforcement learning in medicine. Kidney Diseas. 5, 18–22 (2019)

  27. Lin, L.J.: Self-improving reactive agents based on reinforcement learning, planning and teaching. Mach. Learn. 8(3–4), 293–321 (1992)

  28. Lin, L.-J.: Reinforcement learning for robots using neural networks. Ph.D. Thesis School of Computer Science, Carnegie-Mellon University, Pittsburgh (1993)

  29. Luong, N.C., Hoang, D.T., Gong, S., Niyato, D., Wang, P., Liang, Y.C., Kim, D.I.: Applications of deep reinforcement learning in communications and networking: a survey. IEEE Commun. Surv. Tutor. 21, 3133–3174 (2019)

  30. Marbach, P., Tsitsiklis, J.N.: Simulation-based optimization of Markov reward processes. IEEE Trans. Automat. Contr. 46, 191–209 (2001)

  31. Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., Riedmiller, M.: Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013)

  32. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G., Petersen, S.: Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015)

  33. Peng, X.B., Berseth, G., Yin, K., van de Panne, M.: Deeploco: dynamic locomotion skills using hierarchical deep reinforcement learning. ACM Trans. Graph. 36, 1–13 (2017)

  34. Popova, M., Isayev, O., Tropsha, A.: Deep reinforcement learning for de novo drug design. Sci. Adv. 4(7), eaap7885 (2018)

  35. Prabuchandran, K.J., Bhatnagar, S., Borkar, V.S.: Actor-critic algorithms with online feature adaptation. ACM Trans. Model. Comput. Simul. (TOMACS) 26(4), 1–26 (2016)

  36. Qian, Y., Wu, J., Wang, R., Zhu, F., Zhang, W.: Survey on reinforcement learning applications in communication networks. J. Commun. Netw. 4, 30–39 (2019)

  37. Ramaswamy, A., Bhatnagar, S.: Analysis of gradient descent methods with nondiminishing bounded errors. IEEE Trans. Automat. Contr. 63, 1465–1471 (2018)

  38. Riedmiller, M.: Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method. In: Machine Learning: ECML 2005, pp. 317–328 (2005)

  39. Saleh, E., Jiang, N.: Deterministic Bellman residual minimization. In: Proceedings of Optimization Foundations for Reinforcement Learning Workshop at NeurIPS (2019)

  40. Schaul, T., Quan, J., Antonoglou, I., Silver, D.: Prioritized experience replay. arXiv preprint arXiv:1511.05952 (2015)

  41. Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., van den Driessche, G., Hassabis, D.: Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489 (2016)

  42. Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Hassabis, D.: A general reinforcement learning algorithm that masters chess, Shogi, and Go through self-play. Science 362, 1140–1144 (2018)

  43. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction, 2nd edn. MIT Press, Cambridge (2018)

  44. Sutton, R. S., McAllester, D. A., Singh, S. P., Mansour, Y.: Policy gradient methods for reinforcement learning with function approximation. In: Neural Information Processing Systems Proceedings, pp. 1057–1063 (1999)

  45. Tsitsiklis, J.N.: Asynchronous stochastic approximation and Q-learning. Mach. Learn. 16, 185–202 (1994)

  46. Tsitsiklis, J.N., Van Roy, B.: An analysis of temporal-difference learning with function approximation. IEEE Trans. Automat. Contr. 42, 674–690 (1997)

  47. van Hasselt, H.: Double Q-learning. Adv. Neural Inf. Process. Syst. 23, 2613–2621 (2010)

  48. van Hasselt, H., Guez, A., Silver, D.: Deep reinforcement learning with double Q-learning. In: Proceedings of the 30th AAAI Conference on Artificial Intelligence, vol. 30, pp. 2094–2100 (2016)

  49. Watkins, C.J.C.H.: Learning from delayed rewards. Ph.D. Thesis, King’s College, University of Cambridge, UK (1989)

  50. Watkins, C.J., Dayan, P.: Q-learning. Mach. Learn. 8(3–4), 279–292 (1992)

  51. Xiong, Z., Zhang, Y., Niyato, D., Deng, R., Wang, P., Wang, L.C.: Deep reinforcement learning for mobile 5G and beyond: fundamentals, applications, and challenges. IEEE Veh. Technol. Mag. 14, 44–52 (2019)

  52. Yaji, V.G., Bhatnagar, S.: Stochastic recursive inclusions with non-additive iterate-dependent Markov noise. Stochastics 90, 330–363 (2018)

Acknowledgement

The authors are greatly obliged to Prof. K. S. Mallikarjuna Rao for pointers to the relevant literature on non-smooth analysis. The work of VSB was supported in part by an S. S. Bhatnagar Fellowship from the Council of Scientific and Industrial Research, Government of India. The work of KP and KA is partly supported by ANSWER project PIA FSN2 (P15 9564-266178/DOS0060094) and the project of Inria - Nokia Bell Labs “Distributed Learning and Control for Network Analysis”. This work is also partly supported by the project IFC/DST-Inria-2016-01/448 “Machine Learning for Network Analytics”.

Author information

Corresponding author

Correspondence to Konstantin E. Avrachenkov.

Appendix: Elements of Non-smooth Analysis

The (Frechet) sub/super-differentials of a map \(f: \mathcal {R}^d \mapsto \mathcal {R}\) are defined by

$$\begin{aligned} \partial ^-f(x)&:= \left\{ z \in \mathcal {R}^d: \liminf _{y\rightarrow x}\frac{f(y) - f(x) - \langle z, y-x\rangle }{|x-y|} \ge 0\right\} , \\ \partial ^+f(x)&:= \left\{ z \in \mathcal {R}^d: \limsup _{y\rightarrow x}\frac{f(y) - f(x) - \langle z, y-x\rangle }{|x-y|} \le 0\right\} , \end{aligned}$$

respectively. Assume \(f, g\) are Lipschitz. Some of the properties of \(\partial ^{\pm }f\) are as follows.

  • (P1) Both \(\partial ^-f(x)\) and \(\partial ^+f(x)\) are closed and convex, and each is nonempty for x in a dense set.

  • (P2) If f is differentiable at x, both equal the singleton \(\{\nabla f(x)\}\). Conversely, if both are nonempty at x, f is differentiable at x and they equal \(\{\nabla f(x)\}\).

  • (P3) \( \partial ^-f + \partial ^-g \subset \partial ^-(f + g), \ \partial ^+f + \partial ^+g \subset \partial ^+(f + g) .\)

The first two are proved in [4], pp. 30–31. The third follows from the definition. Next consider a continuous function \(f: \mathcal {R}^d\times B \mapsto \mathcal {R}\) where B is a compact metric space. Suppose \(f( \cdot , y)\) is continuously differentiable uniformly w.r.t. y. Let \(\nabla _xf(x,y)\) denote the gradient of \(f(\cdot , y)\) at x. Let \(g(x) := \max _yf(x,y), h(x) := \min _yf(x,y)\) with

$$ M(x) := \{\nabla _xf(x, y), y \in \text{ Argmax } f(x, \cdot )\} $$

and

$$ N(x) := \{\nabla _xf(x,y), y \in \text{ Argmin } f(x, \cdot )\}. $$

Then M(x), N(x) are compact nonempty subsets of \(\mathcal {R}^d\) which are upper semi-continuous in x as set-valued maps. We then have the following general version of Danskin’s theorem [17]:

  • (P4) \(\partial ^-g(x) = \overline{\text{ co }}(M(x))\), \(\partial ^+g(x) = \{y\}\) if \(M(x) = \{y\}\), \(= \emptyset \) otherwise, and g has a directional derivative in any direction z given by \(\max _{y\in M(x)}\langle y, z\rangle \).

  • (P5) \(\partial ^+h(x) = \overline{\text{ co }}(N(x))\), \(\partial ^-h(x) = \{y\}\) if \(N(x) = \{y\}\), \(= \emptyset \) otherwise, and h has a directional derivative in any direction z given by \(\min _{y\in N(x)}\langle y, z\rangle \).

The latter is proved in [4], pp. 44–46; the former follows by a symmetric argument.
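
For a simple illustration of (P4) and (P5), take \(d = 1\), \(B = [-1,1]\) and \(f(x,y) = xy\), so that \(g(x) = \max _yf(x,y) = |x|\) and \(h(x) = \min _yf(x,y) = -|x|\). At \(x = 0\) every \(y \in B\) attains both the maximum and the minimum, and \(\nabla _xf(0,y) = y\), so \(M(0) = N(0) = [-1,1]\). Then

$$ \partial ^-g(0) = \overline{\text{ co }}([-1,1]) = [-1,1], \quad \partial ^+g(0) = \emptyset , \quad \partial ^+h(0) = [-1,1], \quad \partial ^-h(0) = \emptyset , $$

in agreement with the familiar sub/super-differentials of \(|x|\) and \(-|x|\) at the origin; away from \(x = 0\) both g and h are differentiable and (P2) applies.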

Copyright information

© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Avrachenkov, K.E., Borkar, V.S., Dolhare, H.P., Patil, K. (2021). Full Gradient DQN Reinforcement Learning: A Provably Convergent Scheme. In: Piunovskiy, A., Zhang, Y. (eds) Modern Trends in Controlled Stochastic Processes:. Emergence, Complexity and Computation, vol 41. Springer, Cham. https://doi.org/10.1007/978-3-030-76928-4_10

  • DOI: https://doi.org/10.1007/978-3-030-76928-4_10

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-76927-7

  • Online ISBN: 978-3-030-76928-4
