Improving Gaussian Process Value Function Approximation in Policy Gradient Algorithms

  • Hunor Jakab
  • Lehel Csató
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6792)


The use of value-function approximation in reinforcement learning (RL) is widely studied, most commonly as a way of extending value-based RL methods to continuous domains. Gradient-based policy search algorithms can also benefit from an estimated value function, since this estimate can be used to reduce the variance of the gradient estimates. In this article we present a new value-function approximation method that uses a modified version of Kullback–Leibler (KL) distance–based sparse on-line Gaussian process regression. We combine it with Williams’ episodic REINFORCE algorithm to reduce the variance of the gradient estimates. A significant computational overhead of the algorithm is caused by the need to completely re-estimate the value function after each gradient update step. To overcome this problem we propose a measure, composed of a KL distance–based score and a time-dependent factor, for exchanging obsolete basis vectors with newly acquired measurements. This method leads to a more stable estimation of the action-value function and also reduces gradient variance. Performance and convergence comparisons are provided for the described algorithm, tested on a dynamic-system control problem with a continuous state-action space.
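The variance-reduction idea underlying the abstract can be sketched in a few lines. The following is a minimal illustration of episodic REINFORCE (Williams, 1992) with a subtracted baseline, not the paper's GP-based value-function estimator: the trajectory format, the undiscounted returns, and the toy constant-reward task are assumptions made purely for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def reinforce_gradient(episodes, baseline=None):
    """Episodic REINFORCE gradient estimate.

    episodes: list of trajectories, each a list of (score, reward)
        pairs, where `score` is the score function grad_theta log pi(a|s)
        evaluated at that step.
    baseline: optional callable t -> scalar b_t subtracted from the
        return-to-go; a value-function estimate used this way reduces
        the variance of the gradient estimate without biasing it.
    Returns the mean gradient over episodes and its per-component
    variance across episodes.
    """
    grads = []
    for traj in episodes:
        rewards = np.array([r for _, r in traj])
        # Undiscounted return-to-go from each step t.
        returns = np.cumsum(rewards[::-1])[::-1]
        g = np.zeros_like(traj[0][0], dtype=float)
        for t, (score, _) in enumerate(traj):
            b = baseline(t) if baseline is not None else 0.0
            g += score * (returns[t] - b)
        grads.append(g)
    return np.mean(grads, axis=0), np.var(grads, axis=0)

# Toy task (hypothetical): 1-D policy parameter, reward 1.0 at every
# step, so the return-to-go from step t of a 10-step episode is 10 - t.
episodes = [[(rng.normal(size=1), 1.0) for _ in range(10)]
            for _ in range(200)]

g_no_b, var_no_b = reinforce_gradient(episodes)
# With a baseline equal to the exact expected return-to-go, the
# centred returns vanish and so does the estimator's variance.
g_b, var_b = reinforce_gradient(episodes, baseline=lambda t: 10 - t)
```

Since the reward here is independent of the actions, the true gradient is zero; both estimators are unbiased, but the baselined one has strictly lower variance, which is the role the estimated value function plays in the paper's setting.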


Keywords: Reinforcement learning · Policy gradient methods · Gaussian processes · Value function estimation · Control problems


References


  1. Baird, L., Moore, A.: Gradient descent for general reinforcement learning. In: Kearns, M.S., Solla, S.A., Cohn, D.A. (eds.) Advances in Neural Information Processing Systems, vol. 11, pp. 968–974. MIT Press, Cambridge (1998)
  2. Csató, L.: Gaussian Processes – Iterative Sparse Approximation. PhD thesis, Neural Computing Research Group (2002)
  3. Csató, L., Opper, M.: Sparse representation for Gaussian process models. In: Leen, T.K., Dietterich, T.G., Tresp, V. (eds.) Advances in Neural Information Processing Systems, vol. 13, pp. 444–450. MIT Press, Cambridge (2001)
  4. Deisenroth, M.P., Rasmussen, C.E., Peters, J.: Gaussian process dynamic programming. Neurocomputing 72(7-9), 1508–1524 (2009)
  5. Engel, Y., Mannor, S., Meir, R.: Reinforcement learning with Gaussian processes. In: Proceedings of the 22nd International Conference on Machine Learning, pp. 201–208, New York (2005)
  6. Fan, Y., Xu, J., Shelton, C.R.: Importance sampling for continuous time Bayesian networks. Journal of Machine Learning Research 11, 2115–2140 (2010)
  7. Ghavamzadeh, M., Engel, Y.: Bayesian policy gradient algorithms. In: Schölkopf, B., Platt, J., Hoffman, T. (eds.) Advances in Neural Information Processing Systems, vol. 19, pp. 457–464. MIT Press, Cambridge (2007)
  8. Jakab, H.S., Csató, L.: Using Gaussian processes for variance reduction in policy gradient algorithms. In: 8th International Conference on Applied Informatics, Eger, pp. 55–63 (2010)
  9. Peters, J., Schaal, S.: Reinforcement learning of motor skills with policy gradients. Neural Networks 21(4), 682–697 (2008)
  10. Puterman, M.L.: Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, New York (1994)
  11. Rasmussen, C.E., Kuss, M.: Gaussian processes in reinforcement learning. In: Saul, L.K., Thrun, S., Schölkopf, B. (eds.) Advances in Neural Information Processing Systems, vol. 16, pp. 751–759. MIT Press, Cambridge (2004)
  12. Rasmussen, C.E., Williams, C.: Gaussian Processes for Machine Learning. MIT Press, Cambridge (2006)
  13. Sugiyama, M., Hachiya, H., Towell, C., Vijayakumar, S.: Geodesic Gaussian kernels for value function approximation. Autonomous Robots 25, 287–304 (2008)
  14. Sutton, R.S., McAllester, D.A., Singh, S.P., Mansour, Y.: Policy gradient methods for reinforcement learning with function approximation. In: Solla, S.A., Leen, T.K., Müller, K.R. (eds.) Advances in Neural Information Processing Systems, vol. 12, pp. 1057–1063. MIT Press, Cambridge (1999)
  15. Williams, R.J.: Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8, 229–256 (1992)

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Hunor Jakab (1, 2)
  • Lehel Csató (1, 2)

  1. Babeş-Bolyai University, Cluj-Napoca, Romania
  2. Eötvös Loránd University, Budapest, Hungary
