Encyclopedia of Machine Learning

2010 Edition
Editors: Claude Sammut, Geoffrey I. Webb

Gaussian Process Reinforcement Learning

  • Yaakov Engel
Reference work entry
DOI: https://doi.org/10.1007/978-0-387-30164-8_325

Definition

Gaussian process reinforcement learning generically refers to a class of reinforcement learning (RL) algorithms that use Gaussian processes (GPs) to model and learn some aspect of the problem.

Such methods may be divided roughly into two groups:

  1. Model-based methods: Here, GPs are used to learn the transition and reward model of the Markov decision process (MDP) underlying the RL problem. The estimated MDP model is then used to compute an approximate solution to the true MDP (a minimal sketch of this approach follows the list).

  2. Model-free methods: Here, no explicit representation of the MDP is maintained. Rather, GPs are used to learn the MDP's value function, its state-action value function, or some other quantity that may be used to solve the MDP (a sketch appears after the next paragraph).

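To make the model-based approach concrete, the following is a minimal illustrative sketch rather than any specific published algorithm: a GP regressor is fit to observed transitions of a hypothetical one-dimensional system, and its posterior mean is then used as a certainty-equivalent model to evaluate a fixed policy by rollout. The dynamics, reward, policy, kernel choice, and helper names (true_step, evaluate_policy) are all assumptions made for the example.

```python
# Illustrative model-based GP RL sketch (not the entry's specific algorithm):
# fit a GP to observed transitions, then plan/evaluate on the learned model.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)

# Hypothetical unknown dynamics and reward (used only to generate data).
def true_step(s, a):
    return 0.9 * s + 0.5 * a + 0.05 * rng.standard_normal()

def reward(s, a):
    return -(s ** 2) - 0.1 * a ** 2

# Collect transition data (s, a) -> s' under random exploration.
S = rng.uniform(-2, 2, size=200)
A = rng.uniform(-1, 1, size=200)
S_next = np.array([true_step(s, a) for s, a in zip(S, A)])
X = np.column_stack([S, A])

# GP model of the transition function; the WhiteKernel term absorbs noise.
dyn_gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
dyn_gp.fit(X, S_next)

# Certainty-equivalent rollout of a fixed policy pi(s) = -0.5 * s,
# using the GP posterior mean as the transition model.
def evaluate_policy(s0, horizon=30, gamma=0.95):
    s, ret = s0, 0.0
    for t in range(horizon):
        a = -0.5 * s
        ret += gamma ** t * reward(s, a)
        s = float(dyn_gp.predict(np.array([[s, a]]))[0])
    return ret

print("estimated return from s0=1.0:", evaluate_policy(1.0))
```
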
This entry is concerned with the latter class of methods, as these constitute the majority of published research in this area.
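
Since the entry focuses on model-free methods, the following sketch illustrates, in the spirit of Gaussian process temporal-difference (GPTD) learning, how a GP prior over the value function can be conditioned on observed rewards through the relation r_t = V(s_t) - gamma * V(s_{t+1}) + noise. The batch solve, the squared-exponential kernel, the independent-noise assumption on the residuals, and the helper names (rbf_kernel, gptd_posterior_mean) are simplifications for illustration; the published algorithms also handle correlated residual noise and use online, sparsified updates.

```python
# Minimal batch sketch of a GPTD-style value estimator (model-free):
# place a GP prior on V and condition on rewards via the TD relation
#   r_t = V(s_t) - gamma * V(s_{t+1}) + noise.
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """Squared-exponential kernel between two sets of 1-D states."""
    d = A[:, None] - B[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gptd_posterior_mean(states, rewards, query, gamma=0.95, sigma=0.1):
    """Posterior mean of V at `query` states, given a single trajectory.

    states : array of shape (T+1,), visited states s_0 .. s_T
    rewards: array of shape (T,),   rewards r_0 .. r_{T-1}
    """
    T = len(rewards)
    # H maps the latent values (V(s_0), ..., V(s_T)) to TD residuals.
    H = np.zeros((T, T + 1))
    H[np.arange(T), np.arange(T)] = 1.0
    H[np.arange(T), np.arange(T) + 1] = -gamma
    K = rbf_kernel(states, states)                  # prior covariance of V
    G = H @ K @ H.T + sigma ** 2 * np.eye(T)        # covariance of rewards
    alpha = np.linalg.solve(G, rewards)
    k_q = rbf_kernel(query, states)                 # cross-covariances
    return k_q @ H.T @ alpha

# Toy usage: a random-walk chain with reward equal to the negative state.
rng = np.random.default_rng(1)
states = np.cumsum(rng.standard_normal(51)) * 0.1
rewards = -states[:-1]
query = np.linspace(states.min(), states.max(), 5)
print(gptd_posterior_mean(states, rewards, query))
```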

Motivation and Background

Reinforcement learning is a class of learning problems concerned with achieving long-term goals in unfamiliar,...



Copyright information

© Springer Science+Business Media, LLC 2011

Authors and Affiliations

  • Yaakov Engel
  1. University of Alberta, Edmonton, Canada