Knowledge Gradient for Online Reinforcement Learning

  • Saba Yahyaa
  • Bernard Manderick
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8946)


A key challenge for a reinforcement learning agent is to learn online in an unknown, large discrete or continuous stochastic model. The agent not only has to trade off exploration against exploitation, but also has to find a good set of basis functions to approximate the value function. We extend offline kernel-based LSPI (least squares policy iteration) to online learning. Online kernel-based LSPI combines features of offline kernel-based LSPI and online LSPI: it uses the knowledge gradient policy as an exploration policy to trade off exploration and exploitation, and the approximate linear dependency (ALD) based kernel sparsification method to select basis functions automatically. We compare online kernel-based LSPI with online LSPI on five discrete Markov decision problems, where online kernel-based LSPI outperforms online LSPI with respect to the performance of the resulting policy.
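The ALD-based sparsification mentioned above can be illustrated with a minimal sketch (not the authors' implementation; the Gaussian kernel, the tolerance `nu`, and the bandwidth `sigma` are illustrative assumptions): a sample is kept as a new basis function only if its kernel feature cannot be approximated, within tolerance, by a linear combination of the features of the dictionary built so far.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel between two state vectors."""
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def ald_sparsify(samples, nu=0.1, sigma=1.0):
    """Approximate linear dependency (ALD) sparsification.

    A sample enters the dictionary of basis-function centers only if
    its ALD error delta exceeds the tolerance nu, i.e. it is "almost
    linearly independent" of the current dictionary in feature space.
    """
    dictionary = [samples[0]]
    for x in samples[1:]:
        m = len(dictionary)
        # m x m kernel matrix over the current dictionary
        K = np.array([[gaussian_kernel(a, b, sigma) for b in dictionary]
                      for a in dictionary])
        # kernel values between the new sample and the dictionary
        k = np.array([gaussian_kernel(a, x, sigma) for a in dictionary])
        # best linear reconstruction coefficients (small ridge for stability)
        coef = np.linalg.solve(K + 1e-8 * np.eye(m), k)
        delta = gaussian_kernel(x, x, sigma) - k @ coef  # ALD error
        if delta > nu:
            dictionary.append(x)
    return dictionary

# Example: densely sampled states on a line; nearby (kernel-redundant)
# states are pruned, so far fewer basis functions than samples remain.
states = [np.array([s]) for s in np.linspace(0.0, 5.0, 50)]
basis = ald_sparsify(states, nu=0.3, sigma=1.0)
print(len(basis), "basis functions selected from", len(states), "samples")
```

In online kernel-based LSPI this test runs as transitions arrive, so the set of basis functions grows only where the visited state space actually requires new features.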


Keywords: Online reinforcement learning · Trade-off between exploration and exploitation · Knowledge gradient exploration policy · Value function approximation · (Kernel-based) least squares policy iteration · Approximate linear dependency kernel sparsification



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. Computer Science Department, Vrije Universiteit Brussel, Brussels, Belgium
