Multigrid Reinforcement Learning with Reward Shaping

  • Marek Grześ
  • Daniel Kudenko
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5163)

Abstract

Potential-based reward shaping has been shown to be a powerful method for improving the convergence rate of reinforcement learning agents. It is a flexible technique for incorporating background knowledge into temporal-difference learning in a principled way. However, the question remains of how to compute the potential that is used to shape the reward given to the learning agent. In this paper we propose a way to solve this problem in reinforcement learning with state space discretisation. In particular, we show that the potential function can be learned online, in parallel with the actual reinforcement learning process. If the Q-function is learned for states determined by a given grid, a V-function for states at a lower resolution can be learned in parallel and used to approximate the potential for the ground-level learning. The novel algorithm is presented and experimentally evaluated.
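
The abstract describes the mechanism only at a high level, so the following minimal sketch illustrates one way the scheme could be realised: tabular Q-learning on a fine state grid, with a V-function learned from the same experience on a coarser grid and used as the shaping potential. All identifiers (env, coarse_state, the learning rates, and the step API) are illustrative assumptions, not the authors' implementation.

```python
# Sketch, under assumed interfaces: Q-learning on a fine discretisation,
# with a coarse-grid V learned in parallel and used as the shaping potential.
import random
from collections import defaultdict

ALPHA, BETA, GAMMA, EPSILON = 0.1, 0.1, 0.99, 0.1

Q = defaultdict(float)   # Q-values on the fine grid, keyed by (fine_state, action)
V = defaultdict(float)   # V-values on the coarse grid, keyed by coarse_state

def coarse_state(fine_state, factor=4):
    """Map a fine grid cell to its enclosing coarse cell (assumed mapping)."""
    return tuple(x // factor for x in fine_state)

def potential(fine_state):
    """The potential is the current coarse-grid value estimate."""
    return V[coarse_state(fine_state)]

def choose_action(s, actions):
    """Epsilon-greedy action selection on the fine-grid Q-values."""
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def learn_episode(env, actions):
    s = env.reset()                     # env is assumed to expose reset()/step()
    done = False
    while not done:
        a = choose_action(s, actions)
        s_next, r, done = env.step(a)

        # Potential-based shaping reward: F(s, s') = gamma * Phi(s') - Phi(s)
        f = GAMMA * potential(s_next) - potential(s)

        # Q-learning update on the fine grid using the shaped reward r + f
        target = r + f
        if not done:
            target += GAMMA * max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])

        # In parallel, a TD(0) update of V on the coarse grid from the same
        # transition, so the potential improves as learning progresses.
        cs, cs_next = coarse_state(s), coarse_state(s_next)
        td_target = r if done else r + GAMMA * V[cs_next]
        V[cs] += BETA * (td_target - V[cs])

        s = s_next
```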



Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Marek Grześ (1)
  • Daniel Kudenko (1)
  1. Department of Computer Science, University of York, York, UK
