# A Study of Reinforcement Learning in the Continuous Case by the Means of Viscosity Solutions

## Abstract

This paper studies *Reinforcement Learning* (RL) for control problems with continuous state space and continuous time, within the theoretical framework of *viscosity solutions* (VSs). We use the method of *dynamic programming* (DP), which introduces the *value function* (VF): the expectation of the best future cumulative reinforcement. In the continuous case, the value function satisfies a non-linear differential equation called the *Hamilton-Jacobi-Bellman* (HJB) equation, which is of first order when the process is deterministic and of second order when it is stochastic. It is well known that this equation has infinitely many generalized solutions (differentiable almost everywhere) other than the VF. We show that gradient-descent methods may converge to one of these generalized solutions, thus failing to find the optimal control.
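To make the two claims above concrete, here is a hedged sketch in standard textbook notation. The symbols $f$ (state dynamics), $r$ (reinforcement), $\rho > 0$ (discount rate) and $u$ (control) are assumptions introduced for illustration and are not taken from the abstract; boundary conditions are omitted. The first two lines give a common form of the VF and of the first-order HJB equation in the deterministic, discounted, one-dimensional case. The last two lines give a classical toy equation (not specific to this paper) with infinitely many almost-everywhere solutions but a single viscosity solution.

```latex
\begin{align}
  % Value function for \dot{x}(t) = f(x(t), u(t)), x(0) = x (notation assumed)
  V(x) &= \sup_{u(\cdot)} \int_0^{\infty} e^{-\rho t}\, r\bigl(x(t), u(t)\bigr)\, dt, \\
  % Deterministic, discounted HJB equation (first order)
  \rho\, V(x) &= \max_{u} \bigl[\, r(x, u) + V'(x)\, f(x, u) \,\bigr], \\
  % Toy example: any zigzag with slopes +1 and -1 vanishing at 0 and 1
  % satisfies this almost everywhere
  |V'(x)| - 1 &= 0 \quad \text{on } (0, 1), \qquad V(0) = V(1) = 0, \\
  % ... but only this one is a viscosity solution
  V(x) &= \min(x,\, 1 - x).
\end{align}
```

In the toy example, a method that only drives the residual $|V'(x)| - 1$ to zero almost everywhere cannot tell the zigzag functions apart from $\min(x, 1 - x)$, which is the kind of failure the abstract attributes to gradient-descent methods.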

In order to solve the HJB equation, we use the powerful framework of viscosity solutions and state that there exists a unique viscosity solution to the HJB equation, which is the value function. Then, we use another main result of VSs (their stability when passing to the limit) to prove the convergence of numerical approximation schemes based on finite difference (FD) and finite element (FE) methods. These methods discretize, at some resolution, the HJB equation into the DP equation of a *Markov Decision Process* (MDP), which can be solved by DP methods (thanks to a “strong” contraction property) provided that all the initial data (the state dynamics and the reinforcement function) are perfectly known. However, the RL approach considers a system that interacts with an environment that is a priori (at least partially) unknown and that learns “from experience”, so the initial data are not perfectly known and have to be approximated during learning. The main contribution of this work is a general convergence theorem for RL algorithms that use only “approximations” of the initial data (in the sense of satisfying some “weak” contraction property). This result applies to model-based or model-free RL algorithms, with off-line or on-line updating methods, for deterministic or stochastic state dynamics (though the stochastic case is not described here), and with FE or FD discretization methods. It is illustrated with several RL algorithms and one numerical simulation for the “Car on the Hill” problem.
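The discretization step described above can be illustrated with a minimal sketch, written in the spirit of an FD scheme rather than as the paper's exact method. It assumes a toy one-dimensional problem chosen for illustration: dynamics dx/dt = u with u in {-1, +1} on [0, 1], zero running reinforcement, a terminal reinforcement of 1 on exit at either boundary, and a discount factor `gamma` per unit time. The function name `fd_value_iteration` and all parameters are hypothetical. Here the initial data are perfectly known, so this shows only the DP case, not the RL case where they must be learned.

```python
import numpy as np

def fd_value_iteration(n=21, gamma=0.5, controls=(-1.0, 1.0),
                       tol=1e-12, max_sweeps=10_000):
    """Value iteration on a finite-difference MDP for dx/dt = u on [0, 1]."""
    x = np.linspace(0.0, 1.0, n)
    delta = x[1] - x[0]                  # grid resolution
    V = np.zeros(n)
    V[0] = V[-1] = 1.0                   # assumed terminal reinforcement on exit
    for sweep in range(max_sweeps):
        V_new = V.copy()
        for i in range(1, n - 1):
            candidates = []
            for u in controls:
                tau = delta / abs(u)             # time to reach the neighbouring node
                j = i + (1 if u > 0 else -1)     # neighbour in the direction of motion
                # Running reinforcement is zero, so only the discounted
                # value of the successor node remains.
                candidates.append(gamma ** tau * V[j])
            V_new[i] = max(candidates)
        if np.max(np.abs(V_new - V)) < tol:
            return x, V_new, sweep + 1
        V = V_new
    return x, V, max_sweeps

if __name__ == "__main__":
    x, V, sweeps = fd_value_iteration()
    exact = 0.5 ** np.minimum(x, 1.0 - x)    # gamma ** (time to nearest boundary)
    print(sweeps, float(np.max(np.abs(V - exact))))
```

Each sweep applies an operator that contracts in the sup norm with factor `gamma ** delta < 1`, which is why the iteration converges from any starting guess. For this toy problem the discrete solution coincides at the grid nodes with `gamma ** min(x, 1 - x)`, the discounted value of exiting at the nearest boundary.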
