Abstract
We analyze the DQN reinforcement learning algorithm as a stochastic approximation scheme using the o.d.e. (for ‘ordinary differential equation’) approach and point out certain theoretical issues. We then propose a modified scheme called Full Gradient DQN (FG-DQN, for short) that has a sound theoretical basis, and compare it with the original scheme on sample problems. We observe better performance for FG-DQN.
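To make the contrast concrete, the following is a minimal sketch (in PyTorch, with a hypothetical network, batch shapes, and discount factor) of the difference between the semi-gradient update of standard DQN and a full-gradient Bellman-error update of the kind FG-DQN performs; it is an illustration under our reading of the scheme, not the authors' reference implementation.

```python
# Minimal sketch (PyTorch; network, sizes, and gamma are hypothetical), not the
# authors' reference code. Standard DQN detaches the bootstrap target, so only
# the "semi-gradient" through Q(s, a) is used. A full-gradient scheme in the
# spirit of FG-DQN keeps the target inside the computation graph, so
# backpropagation also differentiates the max term (justified a.e. by
# Danskin's theorem; see the Appendix).
import torch
import torch.nn as nn

gamma = 0.99
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))

def bellman_loss(s, a, r, s_next, done, full_gradient: bool) -> torch.Tensor:
    q_sa = q_net(s).gather(1, a)                            # Q(s, a; theta)
    q_next = q_net(s_next).max(dim=1, keepdim=True).values  # max_a' Q(s', a'; theta)
    target = r + gamma * (1.0 - done) * q_next
    if not full_gradient:
        target = target.detach()  # DQN: no gradient flows through the target
    return 0.5 * ((q_sa - target) ** 2).mean()

# Example usage with a random batch:
s = torch.randn(32, 4); a = torch.randint(0, 2, (32, 1))
r = torch.randn(32, 1); s2 = torch.randn(32, 4); d = torch.zeros(32, 1)
bellman_loss(s, a, r, s2, d, full_gradient=True).backward()
```

With `full_gradient=True`, the factor multiplying the Bellman error in the gradient becomes \(\nabla _\theta Q(s,a;\theta ) - \gamma \nabla _\theta \max _{a'}Q(s',a';\theta )\) rather than \(\nabla _\theta Q(s,a;\theta )\) alone. Standard DQN additionally evaluates the target with a separately frozen copy of the network; that detail is omitted here for brevity.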
References
Agarwal, A., Kakade, S.M., Lee, J.D., Mahajan, G.: Optimality and approximation with policy gradient methods in Markov decision processes. In: Conference on Learning Theory, PMLR, pp. 64–66 (2020)
Agazzi, A., Lu, J.: Global optimality of softmax policy gradient with single hidden layer neural networks in the mean-field regime. arXiv preprint arXiv:2010.11858 (2020)
Baird, L.: Residual algorithms: reinforcement learning with function approximation. In: Machine Learning Proceedings 1995, pp. 30–37 (1995)
Bardi, M., Capuzzo-Dolcetta, I.: Optimal Control and Viscosity Solutions of Hamilton-Jacobi-Bellman Equations. Birkhäuser, Boston (2018)
Barto, A.G., Sutton, R.S., Anderson, C.W.: Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Trans. Syst. Man Cybern. 13(5), 834–846 (1983)
Benveniste, A., Metivier, M., Priouret, P.: Adaptive Algorithms and Stochastic Approximations. Springer, Heidelberg (1991). https://doi.org/10.1007/978-3-642-75894-2_9
Bertsekas, D.P.: Reinforcement Learning and Optimal Control. Athena Scientific (2019)
Bhandari, J., Russo, D.: Global optimality guarantees for policy gradient methods. arXiv preprint arXiv:1906.01786 (2019)
Bhatnagar, S., Borkar, V.S., Prabuchandran, K.J.: Feature search in the Grassmannian in online reinforcement learning. IEEE J. Sel. Top. Signal Process. 7(5), 746–758 (2013)
Borkar, V.S.: Probability Theory: An Advanced Course. Springer, New York (1995)
Borkar, V.S.: Stochastic Approximation: A Dynamical Systems Viewpoint. Hindustan Book Agency, New Delhi, and Cambridge University Press, Cambridge, UK (2008)
Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: OpenAI Gym. ArXiv preprint arXiv:1606.01540 (2016)
Cai, Q., Yang, Z., Lee, J.D., Wang, Z.: Neural temporal-difference learning converges to global optima. Adv. Neural Inf. Process. Syst. 32 (2019)
Chadès, I., Chapron, G., Cros, M.J., Garcia, F., Sabbadin, R.: MDPtoolbox: a multi-platform toolbox to solve stochastic dynamic programming problems. Ecography 37, 916–920 (2014)
Chizat, L., Bach, F.: On the global convergence of gradient descent for over-parameterized models using optimal transport. In: Proceedings of Neural Information Processing Systems, pp. 3040–3050 (2018)
Couture, S., Cros, M.J., Sabbadin, R.: Risk aversion and optimal management of an uneven-aged forest under risk of windthrow: a Markov decision process approach. J. For. Econ. 25, 94–114 (2016)
Danskin, J.M.: The theory of max-min, with applications. SIAM J. Appl. Math. 14, 641–664 (1966)
Delyon, B., Lavielle, M., Moulines, E.: Convergence of a stochastic approximation version of the EM algorithm. Ann. Stat. 27, 94–128 (1999)
Florian, R.V.: Correct equations for the dynamics of the cart-pole system. Center for Cognitive and Neural Studies (Coneural), Romania (2007)
François-Lavet, V., Henderson, P., Islam, R., Bellemare, M.G., Pineau, J.: An introduction to deep reinforcement learning. Found. Trends Mach. Learn. 11(3–4), 219–354 (2018)
Gordon, G.J.: Stable fitted reinforcement learning. In: Advances in Neural Information Processing Systems, pp. 1052–1058 (1996)
Gordon, G.J.: Approximate solutions to Markov decision processes. Ph.D. Thesis, Carnegie-Mellon University (1999)
Gu, S., Holly, E., Lillicrap, T., Levine, S.: Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In: Proceedings of IEEE International Conference on Robotics and Automation, pp. 3389–3396 (2017)
Haarnoja, T., Ha, S., Zhou, A., Tan, J., Tucker, G., Levine, S.: Learning to walk via deep reinforcement learning. ArXiv preprint arXiv:1812.11103 (2018)
Jaakkola, T., Jordan, M.I., Singh, S.P.: On the convergence of stochastic iterative dynamic programming algorithms. Neural Comput. 6, 1185–1201 (1994)
Jonsson, A.: Deep reinforcement learning in medicine. Kidney Dis. 5(1), 18–22 (2019)
Lin, L.J.: Self-improving reactive agents based on reinforcement learning, planning and teaching. Mach. Learn. 8(3–4), 293–321 (1992)
Lin, L.J.: Reinforcement learning for robots using neural networks. Ph.D. Thesis, School of Computer Science, Carnegie-Mellon University, Pittsburgh (1993)
Luong, N.C., Hoang, D.T., Gong, S., Niyato, D., Wang, P., Liang, Y.C., Kim, D.I.: Applications of deep reinforcement learning in communications and networking: a survey. IEEE Commun. Surv. Tutor. 21, 3133–3174 (2019)
Marbach, P., Tsitsiklis, J.N.: Simulation-based optimization of Markov reward processes. IEEE Trans. Automat. Contr. 46, 191–209 (2001)
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., Riedmiller, M.: Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013)
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G., Petersen, S.: Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015)
Peng, X.B., Berseth, G., Yin, K., van de Panne, M.: DeepLoco: dynamic locomotion skills using hierarchical deep reinforcement learning. ACM Trans. Graph. 36, 1–13 (2017)
Popova, M., Isayev, O., Tropsha, A.: Deep reinforcement learning for de novo drug design. Sci. Adv. 4(7), eaap7885 (2018)
Prabuchandran, K.J., Bhatnagar, S., Borkar, V.S.: Actor-critic algorithms with online feature adaptation. ACM Trans. Model. Comput. Simul. (TOMACS) 26(4), 1–26 (2016)
Qian, Y., Wu, J., Wang, R., Zhu, F., Zhang, W.: Survey on reinforcement learning applications in communication networks. J. Commun. Inf. Netw. 4(2), 30–39 (2019)
Ramaswamy, A., Bhatnagar, S.: Analysis of gradient descent methods with nondiminishing bounded errors. IEEE Trans. Automat. Contr. 63, 1465–1471 (2018)
Riedmiller, M.: Neural fitted Q iteration–first experiences with a data efficient neural reinforcement learning method. In: Machine Learning: ECML 2005, pp. 317–328 (2005)
Saleh, E., Jiang, N.: Deterministic Bellman residual minimization. In: Proceedings of Optimization Foundations for Reinforcement Learning Workshop at NeurIPS (2019)
Schaul, T., Quan, J., Antonoglou, I., Silver, D.: Prioritized experience replay. arXiv preprint arXiv:1511.05952 (2015)
Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., van den Driessche, G., Hassabis, D.: Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489 (2016)
Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Hassabis, D.: A general reinforcement learning algorithm that masters chess, Shogi, and Go through self-play. Science 362, 1140–1144 (2018)
Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction, 2nd edn. MIT Press, Cambridge (2018)
Sutton, R.S., McAllester, D.A., Singh, S.P., Mansour, Y.: Policy gradient methods for reinforcement learning with function approximation. In: Advances in Neural Information Processing Systems, pp. 1057–1063 (1999)
Tsitsiklis, J.N.: Asynchronous stochastic approximation and Q-learning. Mach. Learn. 16, 185–202 (1994)
Tsitsiklis, J.N., Van Roy, B.: An analysis of temporal-difference learning with function approximation. IEEE Trans. Automat. Contr. 42, 674–690 (1997)
van Hasselt, H.: Double Q-learning. Adv. Neural Inf. Process. Syst. 23, 2613–2621 (2010)
van Hasselt, H., Guez, A., Silver, D.: Deep reinforcement learning with double Q-learning. In: Proceedings of the 30th AAAI Conference on Artificial Intelligence, vol. 30, pp. 2094–2100 (2016)
Watkins, C.J.C.H.: Learning from delayed rewards. Ph.D. Thesis, King’s College, University of Cambridge, UK (1989)
Watkins, C.J., Dayan, P.: Q-learning. Mach. Learn. 8(3–4), 279–292 (1992)
Xiong, Z., Zhang, Y., Niyato, D., Deng, R., Wang, P., Wang, L.C.: Deep reinforcement learning for mobile 5G and beyond: fundamentals, applications, and challenges. IEEE Veh. Technol. Mag. 14, 44–52 (2019)
Yaji, V.G., Bhatnagar, S.: Stochastic recursive inclusions with non-additive iterate-dependent Markov noise. Stochastics 90, 330–363 (2018)
Acknowledgement
The authors are greatly obliged to Prof. K. S. Mallikarjuna Rao for pointers to the relevant literature on non-smooth analysis. The work of VSB was supported in part by an S. S. Bhatnagar Fellowship from the Council of Scientific and Industrial Research, Government of India. The work of KP and KA is partly supported by ANSWER project PIA FSN2 (P15 9564-266178/DOS0060094) and the project of Inria - Nokia Bell Labs “Distributed Learning and Control for Network Analysis”. This work is also partly supported by the project IFC/DST-Inria-2016-01/448 “Machine Learning for Network Analytics”.
Appendix: Elements of Non-smooth Analysis
The (Fréchet) sub/super-differentials of a map \(f: \mathcal {R}^d \mapsto \mathcal {R}\) are defined by
\[ \partial ^-f(x) := \left\{ z \in \mathcal {R}^d : \liminf _{y \rightarrow x} \frac{f(y) - f(x) - \langle z, y - x\rangle }{\Vert y - x\Vert } \ge 0 \right\} \]
and
\[ \partial ^+f(x) := \left\{ z \in \mathcal {R}^d : \limsup _{y \rightarrow x} \frac{f(y) - f(x) - \langle z, y - x\rangle }{\Vert y - x\Vert } \le 0 \right\} , \]
respectively. Assume \(f, g\) are Lipschitz. Some of the properties of \(\partial ^{\pm }f\) are as follows.
- (P1) Both \(\partial ^-f(x), \partial ^+f(x)\) are closed and convex, and each is nonempty on a dense set of x.
- (P2) If f is differentiable at x, both equal the singleton \(\{\nabla f(x)\}\). Conversely, if both are nonempty at x, then f is differentiable at x and they equal \(\{\nabla f(x)\}\) (a concrete illustration follows this list).
- (P3) \( \partial ^-f + \partial ^-g \subset \partial ^-(f + g), \ \partial ^+f + \partial ^+g \subset \partial ^+(f + g) .\)
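As a concrete illustration of the definitions and of (P2) (a standard example, not taken from the paper), take \(f(x) = |x|\) on \(\mathcal {R}\). Then
\[ \partial ^-f(0) = [-1, 1], \qquad \partial ^+f(0) = \emptyset , \]
since \((|y| - zy)/|y| = 1 - z\,\mathrm {sgn}(y)\) has \(\liminf _{y \rightarrow 0}\) equal to \(1 - |z|\), which is \(\ge 0\) exactly when \(|z| \le 1\), while its \(\limsup \) equals \(1 + |z| > 0\) for every z. Consistently with (P2), f is not differentiable at 0 and the two sets are not simultaneously nonempty there.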
Properties (P1) and (P2) are proved in [4], pp. 30–31; (P3) follows from the definitions. Next consider a continuous function \(f: \mathcal {R}^d\times B \mapsto \mathcal {R}\) where B is a compact metric space. Suppose \(f( \cdot , y)\) is continuously differentiable uniformly w.r.t. y. Let \(\nabla _xf(x,y)\) denote the gradient of \(f(\cdot , y)\) at x. Let \(g(x) := \max _yf(x,y)\), \(h(x) := \min _yf(x,y)\) with
\[ M(x) := \{ \nabla _xf(x,y) : f(x,y) = g(x)\} \]
and
\[ N(x) := \{ \nabla _xf(x,y) : f(x,y) = h(x)\} . \]
Then M(x), N(x) are compact nonempty subsets of \(\mathcal {R}^d\) (the sets of gradients at the maximizing, resp. minimizing, \(y \in B\)) which are upper semi-continuous in x as set-valued maps. We then have the following general version of Danskin’s theorem [17]:
- (P4) \(\partial ^-g(x) = \overline{\text{ co }}(M(x))\); \(\partial ^+g(x) = \{y\}\) if \(M(x) = \{y\}\) and \(= \emptyset \) otherwise; and g has a directional derivative in any direction z given by \(\max _{y\in M(x)}\langle y, z\rangle \).
- (P5) \(\partial ^+h(x) = \overline{\text{ co }}(N(x))\); \(\partial ^-h(x) = \{y\}\) if \(N(x) = \{y\}\) and \(= \emptyset \) otherwise; and h has a directional derivative in any direction z given by \(\min _{y\in N(x)}\langle y, z\rangle \).
The latter is proved in [4], pp. 44–46; the former follows by a symmetric argument.
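As a quick numeric sanity check of the directional-derivative formula in (P4), the following sketch (with a hypothetical finite B and a linear f, not taken from the paper) compares a finite-difference estimate against the Danskin expression \(\max _{y\in M(x)}\langle y, z\rangle \):

```python
# Numeric check of (P4) for f(x, y) = <x, y> with finite B (hypothetical example).
# Here grad_x f(x, y) = y, and at x = 0 every y in B attains the max, so M(0)
# consists of all the vectors in B and the directional derivative of g at 0
# along z should equal max_{y in B} <y, z>.
import numpy as np

B = [np.array([1.0, 0.0]), np.array([-1.0, 0.0]), np.array([0.0, 1.0])]

def f(x, y):
    return float(x @ y)

def g(x):
    return max(f(x, y) for y in B)   # g(x) = max_y f(x, y)

x = np.zeros(2)
z = np.array([0.3, -0.7])            # direction
eps = 1e-6
finite_diff = (g(x + eps * z) - g(x)) / eps
danskin = max(float(y @ z) for y in B)
print(finite_diff, danskin)          # both are approximately 0.3
```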
Copyright information
© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Avrachenkov, K.E., Borkar, V.S., Dolhare, H.P., Patil, K. (2021). Full Gradient DQN Reinforcement Learning: A Provably Convergent Scheme. In: Piunovskiy, A., Zhang, Y. (eds.) Modern Trends in Controlled Stochastic Processes. Emergence, Complexity and Computation, vol. 41. Springer, Cham. https://doi.org/10.1007/978-3-030-76928-4_10
Print ISBN: 978-3-030-76927-7
Online ISBN: 978-3-030-76928-4