Machine Learning, Volume 40, Issue 3, pp. 265–299

A Study of Reinforcement Learning in the Continuous Case by the Means of Viscosity Solutions

  • Rémi Munos

Abstract

This paper proposes a study of Reinforcement Learning (RL) for continuous state-space and time control problems, based on the theoretical framework of viscosity solutions (VSs). We use the method of dynamic programming (DP), which introduces the value function (VF), the expectation of the best possible future cumulative reinforcement. In the continuous case, the value function satisfies a non-linear differential equation, of first order for deterministic processes and of second order for stochastic ones, called the Hamilton-Jacobi-Bellman (HJB) equation. It is well known that this equation admits infinitely many generalized solutions (differentiable almost everywhere) other than the VF. We show that gradient-descent methods may converge to one of these generalized solutions, thus failing to find the optimal control.
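
For reference, a minimal sketch of the deterministic, discounted form of this equation, under assumed notation (state dynamics f, reinforcement r, discount rate λ, control set U); the paper's exact formulation, including boundary conditions and the discounting convention, may differ:

    \dot{x}(t) = f\bigl(x(t), u(t)\bigr), \qquad
    V(x) = \sup_{u(\cdot)} \int_0^{\infty} e^{-\lambda t}\, r\bigl(x(t), u(t)\bigr)\, dt,

    \lambda V(x) = \sup_{u \in U} \Bigl[ \nabla V(x) \cdot f(x, u) + r(x, u) \Bigr].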

In order to solve the HJB equation, we use the powerful framework of viscosity solutions and state that there exists a unique viscosity solution to the HJB equation, which is the value function. Then, we use another main result of VSs (their stability when passing to the limit) to prove the convergence of numerical approximation schemes based on finite difference (FD) and finite element (FE) methods. These methods discretize, at some resolution, the HJB equation into the DP equation of a Markov Decision Process (MDP), which could be solved by DP methods (thanks to a “strong” contraction property) if all the initial data (the state dynamics and the reinforcement function) were perfectly known. However, in the RL approach the system learns “from experience” by interacting with an environment that is, a priori, at least partially unknown; the initial data are therefore not perfectly known and have to be approximated during learning. The main contribution of this work is to derive a general convergence theorem for RL algorithms that use only “approximations” (in the sense of satisfying some “weak” contraction property) of the initial data. This result applies to model-based or model-free RL algorithms, with off-line or on-line updating methods, for deterministic or stochastic state dynamics (though the latter case is not described here), and based on FE or FD discretization methods. It is illustrated with several RL algorithms and a numerical simulation of the “Car on the Hill” problem.
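
To make the discretize-then-solve step concrete, here is a minimal, hypothetical Python sketch of the known-model case mentioned above: a uniform-grid discretization of a deterministic control problem whose resulting DP equation is solved by value iteration, which converges thanks to the discounted (contracting) backup. The dynamics, reinforcement, grid sizes and time step are toy placeholders, not the paper's FD/FE schemes nor its “Car on the Hill” formulation, and the RL setting with approximated data is not addressed here:

    import numpy as np

    GAMMA = 0.95           # discount factor per unit of time (placeholder)
    TAU = 0.1              # integration time step of the scheme (placeholder)
    ACTIONS = [-1.0, 1.0]  # admissible controls (placeholder)

    # State space: position in [-1, 1], velocity in [-2, 2] (toy choice)
    xs = np.linspace(-1.0, 1.0, 41)
    vs = np.linspace(-2.0, 2.0, 41)

    def f(x, v, u):
        """Placeholder state dynamics: dx/dt = v, dv/dt = u - x."""
        return v, u - x

    def r(x, v, u):
        """Placeholder reinforcement: penalize distance from the goal x = 1."""
        return -abs(x - 1.0)

    def interpolate(V, x, v):
        """Bilinear interpolation of the value table at a continuous state."""
        x = np.clip(x, xs[0], xs[-1])
        v = np.clip(v, vs[0], vs[-1])
        i = np.clip(np.searchsorted(xs, x) - 1, 0, len(xs) - 2)
        j = np.clip(np.searchsorted(vs, v) - 1, 0, len(vs) - 2)
        ax = (x - xs[i]) / (xs[i + 1] - xs[i])
        av = (v - vs[j]) / (vs[j + 1] - vs[j])
        return ((1 - ax) * (1 - av) * V[i, j] + ax * (1 - av) * V[i + 1, j]
                + (1 - ax) * av * V[i, j + 1] + ax * av * V[i + 1, j + 1])

    # Value iteration on the discretized DP equation.  The factor
    # GAMMA**TAU < 1 makes the backup a contraction, so the sweeps converge.
    V = np.zeros((len(xs), len(vs)))
    for _ in range(500):
        V_new = np.empty_like(V)
        for i, x in enumerate(xs):
            for j, v in enumerate(vs):
                backups = []
                for u in ACTIONS:
                    dx, dv = f(x, v, u)
                    x1, v1 = x + TAU * dx, v + TAU * dv   # Euler step
                    backups.append(TAU * r(x, v, u)
                                   + GAMMA**TAU * interpolate(V, x1, v1))
                V_new[i, j] = max(backups)
        if np.max(np.abs(V_new - V)) < 1e-6:
            V = V_new
            break
        V = V_new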

Keywords: reinforcement learning, dynamic programming, optimal control, viscosity solutions, finite difference and finite element methods, Hamilton-Jacobi-Bellman equation

Copyright information

© Kluwer Academic Publishers 2000

Authors and Affiliations

  • Rémi Munos
    Robotics Institute, Carnegie Mellon University, Pittsburgh, USA
