Machine Learning, Volume 49, Issue 2–3, pp. 161–178

Kernel-Based Reinforcement Learning

  • Dirk Ormoneit
  • Śaunak Sen

Abstract

We present a kernel-based approach to reinforcement learning that overcomes the stability problems of temporal-difference learning in continuous state spaces. First, our algorithm converges to the unique solution of an approximate Bellman equation, regardless of how it is initialized. Second, the method is consistent in the sense that the resulting policy converges asymptotically to the optimal policy; parametric value-function estimates such as neural networks do not possess this property. Our kernel-based approach also allows us to show that the limiting distribution of the value-function estimate is a Gaussian process, information that is useful in studying the bias-variance tradeoff in reinforcement learning. We find that all reinforcement learning approaches to estimating the value function, parametric or non-parametric, are subject to a bias, and that this bias is typically larger in reinforcement learning than in a comparable regression problem.

Keywords: reinforcement learning, Markov decision process, kernel-based learning, kernel smoothing, local averaging, lazy learning
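
The abstract summarizes the method at a high level; the sketch below is a minimal illustration (not the authors' exact algorithm) of kernel-based approximate value iteration in the spirit described there. The approximate Bellman equation takes roughly the form

    \hat{Q}(x, a) = \sum_{i} k(x, x_i^a) \left[ r_i^a + \gamma \max_{b} \hat{Q}(x_i'^a, b) \right],

where the sum runs over the sample transitions (x_i^a, r_i^a, x_i'^a) observed under action a and the kernel weights k(x, x_i^a) are non-negative and sum to one (local averaging). The function names, the Gaussian kernel, the bandwidth, and the data layout in the sketch are illustrative assumptions, not details taken from the paper.

    import numpy as np

    def kernel_based_rl(transitions, gamma=0.9, bandwidth=0.2, n_iter=200):
        """Kernel-based approximate value iteration (illustrative sketch).

        transitions: dict mapping each action a to a triple (S, R, S_next)
            of arrays holding the observed states (n_a x d), rewards (n_a,),
            and successor states (n_a x d) of the samples for that action.
        Returns a function q(x, a) evaluating the fixed-point Q-value estimate.
        """
        actions = list(transitions.keys())

        def weights(x, centers):
            # Normalized Gaussian kernel weights: non-negative and summing
            # to one, so every update below is a local average of targets.
            d2 = np.sum((np.asarray(x)[None, :] - centers) ** 2, axis=-1)
            w = np.exp(-d2 / (2.0 * bandwidth ** 2))
            return w / (w.sum() + 1e-12)

        def q_value(x, a, V):
            # Approximate Bellman operator:
            #   Q(x, a) = sum_i k(x, x_i^a) * (r_i^a + gamma * V_i^a),
            # where V_i^a stands in for max_b Q(x'_i^a, b).
            S, R, _ = transitions[a]
            return weights(x, S) @ (R + gamma * V[a])

        # V[a][i] approximates max_b Q(x'_i^a, b) at the successor states.
        # Any initialization converges to the same fixed point because the
        # local-averaging operator is a sup-norm contraction (modulus gamma).
        V = {a: np.zeros(len(transitions[a][1])) for a in actions}
        for _ in range(n_iter):
            V = {a: np.array([max(q_value(x_next, b, V) for b in actions)
                              for x_next in transitions[a][2]])
                 for a in actions}

        return lambda x, a: q_value(x, a, V)

Because each update is a convex combination of sampled targets, the approximate operator is a contraction, which is what yields a unique solution regardless of initialization; the bandwidth acts as the smoothing parameter behind the bias-variance tradeoff discussed in the abstract.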

Copyright information

© Kluwer Academic Publishers 2002

Authors and Affiliations

  • Dirk Ormoneit (1)
  • Śaunak Sen (2)
  1. Department of Computer Science, Stanford University, Stanford, USA
  2. The Jackson Laboratory, Bar Harbor, USA
