Full Gradient DQN Reinforcement Learning: A Provably Convergent Scheme

  • Conference paper
  • First Online:
Modern Trends in Controlled Stochastic Processes:

Part of the book series: Emergence, Complexity and Computation ((ECC,volume 41))

Abstract

We analyze the DQN reinforcement learning algorithm as a stochastic approximation scheme using the o.d.e. (for ‘ordinary differential equation’) approach and point out certain theoretical issues. We then propose a modified scheme called Full Gradient DQN (FG-DQN, for short) that has a sound theoretical basis and compare it with the original scheme on sample problems. We observe better performance for FG-DQN.
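
The abstract describes FG-DQN only at a high level. As a rough illustration, the sketch below (not the authors' implementation) contrasts the usual DQN update, which treats the bootstrapped target as a constant ("semi-gradient"), with a full-gradient update that also differentiates the squared Bellman error through the max-term; the network architecture, optimizer, and batch construction are arbitrary choices made for the example.

```python
# Minimal PyTorch sketch (illustrative only, not the authors' code).
# It contrasts the DQN semi-gradient update with a full-gradient
# Bellman-error update of the kind the abstract refers to.
import torch
import torch.nn as nn

gamma = 0.99
n_states, n_actions = 4, 2

q_net = nn.Sequential(nn.Linear(n_states, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3)

def bellman_loss(batch, full_gradient: bool):
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s, a)
    q_next = q_net(s_next).max(dim=1).values              # max_a' Q(s', a')
    target = r + gamma * (1.0 - done) * q_next
    if not full_gradient:
        # Standard DQN: the target is treated as a constant, so the
        # gradient flows only through Q(s, a) ("semi-gradient").
        target = target.detach()
    # Full-gradient variant: the squared Bellman error is differentiated
    # through both Q(s, a) and the max-term.
    return ((target - q_sa) ** 2).mean()

# Usage with a random batch of 8 transitions (placeholder data).
batch = (torch.randn(8, n_states),
         torch.randint(0, n_actions, (8,)),
         torch.randn(8),
         torch.randn(8, n_states),
         torch.zeros(8))
loss = bellman_loss(batch, full_gradient=True)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

When the maximum is attained by several actions, autograd simply picks one admissible gradient; the Appendix below recalls the sub/super-differential calculus relevant to such non-smooth points.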

References

  1. Agarwal, A., Kakade, S.M., Lee, J.D., Mahajan, G.: Optimality and approximation with policy gradient methods in Markov decision processes. In: Conference on Learning Theory, PMLR, pp. 64–66 (2020)

  2. Agazzi, A., Lu, J.: Global optimality of softmax policy gradient with single hidden layer neural networks in the mean-field regime. arXiv preprint arXiv:2010.11858 (2020)

  3. Baird, L.: Residual algorithms: reinforcement learning with function approximation. In: Machine Learning Proceedings 1995, pp. 30–37 (1995)

  4. Bardi, M., Capuzzo-Dolcetta, I.: Optimal Control and Viscosity Solutions of Hamilton-Jacobi-Bellman Equations. Birkhäuser, Boston (2018)

  5. Barto, A.G., Sutton, R.S., Anderson, C.W.: Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Trans. Syst. Man Cybern. SMC-13(5), 834–846 (1983)

  6. Benveniste, A., Metivier, M., Priouret, P.: Adaptive Algorithms and Stochastic Approximations. Springer, Heidelberg (1991). https://doi.org/10.1007/978-3-642-75894-2_9

  7. Bertsekas, D.P.: Reinforcement Learning and Optimal Control. Athena Scientific (2019)

  8. Bhandari, J., Russo, D.: Global optimality guarantees for policy gradient methods. arXiv preprint arXiv:1906.01786 (2019)

  9. Bhatnagar, S., Borkar, V.S., Prabuchandran, K.J.: Feature search in the Grassmanian in online reinforcement learning. IEEE J. Sel. Top. Signal Process. 7(5), 746–758 (2013)

  10. Borkar, V.S.: Probability Theory: An Advanced Course. Springer, New York (1995)

  11. Borkar, V.S.: Stochastic Approximation: A Dynamical Systems Viewpoint. Hindustan Publishing Agency, New Delhi, and Cambridge University Press, Cambridge, UK (2008)

  12. Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: OpenAI Gym. ArXiv preprint arXiv:1606.01540 (2016)

  13. Cai, Q., Yang, Z., Lee, J.D., Wang, Z.: Neural temporal-difference learning converges to global optima. Adv. Neural Inf. Process. Syst. 32 (2019)

  14. Chadès, I., Chapron, G., Cros, M.J., Garcia, F., Sabbadin, R.: MDPtoolbox: a multi-platform toolbox to solve stochastic dynamic programming problems. Ecography 37, 916–920 (2014)

  15. Chizat, L., Bach, F.: On the global convergence of gradient descent for over-parameterized models using optimal transport. In: Proceedings of Neural Information Processing Systems, pp. 3040–3050 (2018)

  16. Couture, S., Cros, M.J., Sabbadin, R.: Risk aversion and optimal management of an uneven-aged forest under risk of windthrow: a Markov decision process approach. J. For. Econ. 25, 94–114 (2016)

  17. Danskin, J.M.: The theory of max-min, with applications. SIAM J. Appl. Math. 14, 641–664 (1966)

  18. Delyon, B., Lavielle, M., Moulines, E.: Convergence of a stochastic approximation version of the EM algorithm. Ann. Stat. 27, 94–128 (1999)

  19. Florian, R.V.: Correct equations for the dynamics of the cart-pole system. Center for Cognitive and Neural Studies (Coneural), Romania (2007)

  20. François-Lavet, V., Henderson, P., Islam, R., Bellemare, M.G., Pineau, J.: An introduction to deep reinforcement learning. Found. Trends Mach. Learn. 11(3–4), 219–354 (2018)

  21. Gordon, G. J.: Stable fitted reinforcement learning. In: Advances in Neural Information Processing Systems, pp. 1052–1058 (1996)

  22. Gordon, G. J.: Approximate solutions to Markov decision processes. Ph.D. Thesis, Carnegie-Mellon University (1999)

  23. Gu, S., Holly, E., Lillicrap, T., Levine, S.: Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In: Proceedings of IEEE International Conference on Robotics and Automation, pp. 3389–3396 (2017)

  24. Haarnoja, T., Ha, S., Zhou, A., Tan, J., Tucker, G., Levine, S.: Learning to walk via deep reinforcement learning. ArXiv preprint arXiv:1812.11103 (2018)

  25. Jaakkola, T., Jordan, M.I., Singh, S.P.: On the convergence of stochastic iterative dynamic programming algorithms. Neural Comput. 6, 1185–1201 (1994)

  26. Jonsson, A.: Deep reinforcement learning in medicine. Kidney Diseas. 5, 18–22 (2019)

  27. Lin, L.J.: Self-improving reactive agents based on reinforcement learning, planning and teaching. Mach. Learn. 8(3–4), 293–321 (1992)

  28. Lin, L.-J.: Reinforcement learning for robots using neural networks. Ph.D. Thesis School of Computer Science, Carnegie-Mellon University, Pittsburgh (1993)

  29. Luong, N.C., Hoang, D.T., Gong, S., Niyato, D., Wang, P., Liang, Y.C., Kim, D.I.: Applications of deep reinforcement learning in communications and networking: a survey. IEEE Commun. Surv. Tutor. 21, 3133–3174 (2019)

  30. Marbach, P., Tsitsiklis, J.N.: Simulation-based optimization of Markov reward processes. IEEE Trans. Automat. Contr. 46, 191–209 (2001)

  31. Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., Riedmiller, M.: Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013)

  32. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G., Petersen, S.: Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015)

  33. Peng, X.B., Berseth, G., Yin, K., van de Panne, M.: Deeploco: dynamic locomotion skills using hierarchical deep reinforcement learning. ACM Trans. Graph. 36, 1–13 (2017)

  34. Popova, M., Isayev, O., Tropsha, A.: Deep reinforcement learning for de novo drug design. Sci. Adv. 4(7), eaap7885 (2018)

  35. Prabuchandran, K.J., Bhatnagar, S., Borkar, V.S.: Actor-critic algorithms with online feature adaptation. ACM Trans. Model. Comput. Simul. (TOMACS) 26(4), 1–26 (2016)

  36. Qian, Y., Wu, J., Wang, R., Zhu, F., Zhang, W.: Survey on reinforcement learning applications in communication networks. J. Commun. Netw. 4, 30–39 (2019)

  37. Ramaswamy, A., Bhatnagar, S.: Analysis of gradient descent methods with nondiminishing bounded errors. IEEE Trans. Automat. Contr. 63, 1465–1471 (2018)

  38. Riedmiller, M.: Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method. In: Machine Learning: ECML 2005, pp. 317–328 (2005)

  39. Saleh, E., Jiang, N.: Deterministic Bellman residual minimization. In: Proceedings of Optimization Foundations for Reinforcement Learning Workshop at NeurIPS (2019)

  40. Schaul, T., Quan, J., Antonoglou, I., Silver, D.: Prioritized experience replay. arXiv preprint arXiv:1511.05952 (2015)

  41. Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., van den Driessche, G., Hassabis, D.: Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489 (2016)

  42. Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Hassabis, D.: A general reinforcement learning algorithm that masters chess, Shogi, and Go through self-play. Science 362, 1140–1144 (2018)

  43. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction, 2nd edn. MIT Press, Cambridge (2018)

  44. Sutton, R. S., McAllester, D. A., Singh, S. P., Mansour, Y.: Policy gradient methods for reinforcement learning with function approximation. In: Neural Information Processing Systems Proceedings, pp. 1057–1063 (1999)

  45. Tsitsiklis, J.N.: Asynchronous stochastic approximation and Q-learning. Mach. Learn. 16, 185–202 (1994)

  46. Tsitsiklis, J.N., Van Roy, B.: An analysis of temporal-difference learning with function approximation. IEEE Trans. Automat. Contr. 42, 674–690 (1997)

  47. van Hasselt, H.: Double Q-learning. Adv. Neural Inf. Process. Syst. 23, 2613–2621 (2010)

  48. van Hasselt, H., Guez, A., Silver, D.: Deep reinforcement learning with double Q-learning. In: Proceedings of the 30th AAAI Conference on Artificial Intelligence, vol. 30, pp. 2094–2100 (2016)

  49. Watkins, C.J.C.H.: Learning from delayed rewards. Ph.D. Thesis, King’s College, University of Cambridge, UK (1989)

  50. Watkins, C.J., Dayan, P.: Q-learning. Mach. Learn. 8(3–4), 279–292 (1992)

  51. Xiong, Z., Zhang, Y., Niyato, D., Deng, R., Wang, P., Wang, L.C.: Deep reinforcement learning for mobile 5G and beyond: fundamentals, applications, and challenges. IEEE Veh. Technol. Mag. 14, 44–52 (2019)

  52. Yaji, V.G., Bhatnagar, S.: Stochastic recursive inclusions with non-additive iterate-dependent Markov noise. Stochastics 90, 330–363 (2018)

Acknowledgement

The authors are greatly obliged to Prof. K. S. Mallikarjuna Rao for pointers to the relevant literature on non-smooth analysis. The work of VSB was supported in part by an S. S. Bhatnagar Fellowship from the Council of Scientific and Industrial Research, Government of India. The work of KP and KA is partly supported by ANSWER project PIA FSN2 (P15 9564-266178/DOS0060094) and the project of Inria - Nokia Bell Labs “Distributed Learning and Control for Network Analysis”. This work is also partly supported by the project IFC/DST-Inria-2016-01/448 “Machine Learning for Network Analytics”.

Author information

Corresponding author

Correspondence to Konstantin E. Avrachenkov.

Appendix: Elements of Non-smooth Analysis

The (Frechet) sub/super-differentials of a map \(f: \mathcal {R}^d \mapsto \mathcal {R}\) are defined by

$$\begin{aligned} \partial ^-f(x)&:= \left\{ z \in \mathcal {R}^d: \liminf _{y\rightarrow x}\frac{f(y) - f(x) - \langle z, y-x\rangle }{|x-y|} \ge 0\right\} , \\ \partial ^+f(x)&:= \left\{ z \in \mathcal {R}^d: \limsup _{y\rightarrow x}\frac{f(y) - f(x) - \langle z, y-x\rangle }{|x-y|} \le 0\right\} , \end{aligned}$$

respectively. Assume \(f, g\) are Lipschitz. Some of the properties of \(\partial ^{\pm }f\) are as follows.

  • (P1) Both \(\partial ^-f(x)\) and \(\partial ^+f(x)\) are closed and convex, and each is nonempty for x in a dense set.

  • (P2) If f is differentiable at x, both equal the singleton \(\{\nabla f(x)\}\). Conversely, if both are nonempty at x, f is differentiable at x and they equal \(\{\nabla f(x)\}\).

  • (P3) \( \partial ^-f + \partial ^-g \subset \partial ^-(f + g), \ \partial ^+f + \partial ^+g \subset \partial ^+(f + g) .\)

The first two are proved in [4], pp. 30–31. The third follows from the definition. Next consider a continuous function \(f: \mathcal {R}^d\times B \mapsto \mathcal {R}\) where B is a compact metric space. Suppose \(f( \cdot , y)\) is continuously differentiable uniformly w.r.t. y. Let \(\nabla _xf(x,y)\) denote the gradient of \(f(\cdot , y)\) at x. Let \(g(x) := \max _yf(x,y), h(x) := \min _yf(x,y)\) with

$$ M(x) := \{\nabla _xf(x, y), y \in \text{ Argmax } f(x, \cdot )\} $$

and

$$ N(x) := \{\nabla _xf(x,y), y \in \text{ Argmin } f(x, \cdot )\}. $$

Then M(x), N(x) are compact nonempty subsets of \(\mathcal {R}^d\) which are upper semi-continuous in x as set-valued maps. We then have the following general version of Danskin’s theorem [17]:

  • (P4) \(\partial ^-g(x) = \overline{\text{ co }}(M(x))\), \(\partial ^+g(x) = \{y\}\) if \(M(x) = \{y\}\), \(= \emptyset \) otherwise, and g has a directional derivative in any direction z given by \(\max _{y\in M(x)}\langle y, z\rangle \).

  • (P5) \(\partial ^+h(x) = \overline{\text{ co }}(N(x))\), \(\partial ^-h(x) = \{y\}\) if \(N(x) = \{y\}\), \(= \emptyset \) otherwise, and h has a directional derivative in any direction z given by \(\min _{y\in N(x)}\langle y, z\rangle \).

The latter is proved in [4], pp. 44–46; the former follows by a symmetric argument.
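
For a simple illustration of (P4) and (P5), take \(d = 1\), \(B = [-1,1]\) and \(f(x,y) = xy\), so that \(g(x) = \max _yf(x,y) = |x|\) and \(h(x) = \min _yf(x,y) = -|x|\). At \(x = 0\) every \(y \in B\) attains both the maximum and the minimum, and \(\nabla _xf(0,y) = y\), so \(M(0) = N(0) = [-1,1]\). Then

$$ \partial ^-g(0) = \overline{\text{ co }}([-1,1]) = [-1,1], \quad \partial ^+g(0) = \emptyset , \quad \partial ^+h(0) = [-1,1], \quad \partial ^-h(0) = \emptyset , $$

in agreement with the familiar sub/super-differentials of \(|x|\) and \(-|x|\) at the origin; away from \(x = 0\) both g and h are differentiable and (P2) applies.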

Copyright information

© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Avrachenkov, K.E., Borkar, V.S., Dolhare, H.P., Patil, K. (2021). Full Gradient DQN Reinforcement Learning: A Provably Convergent Scheme. In: Piunovskiy, A., Zhang, Y. (eds) Modern Trends in Controlled Stochastic Processes:. Emergence, Complexity and Computation, vol 41. Springer, Cham. https://doi.org/10.1007/978-3-030-76928-4_10

  • DOI: https://doi.org/10.1007/978-3-030-76928-4_10

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-76927-7

  • Online ISBN: 978-3-030-76928-4
