Automatic Construction of Temporally Extended Actions for MDPs Using Bisimulation Metrics

  • Pablo Samuel Castro
  • Doina Precup
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7188)


Abstract

Temporally extended actions are usually effective in speeding up reinforcement learning. In this paper we present a mechanism for automatically constructing such actions, expressed as options [24], in a finite Markov Decision Process (MDP). To do this, we compute a bisimulation metric [7] between the states in a small MDP and the states in a large MDP, which we want to solve. The shape of this metric is then used to completely define a set of options for the large MDP. We demonstrate empirically that our approach is able to improve the speed of reinforcement learning, and is generally not sensitive to parameter tuning.
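For intuition about the state-similarity notion the paper builds on: the bisimulation metric of [7] is a quantitative relaxation of the bisimulation equivalence of Givan et al. [8], which can be computed by iterative partition refinement. The sketch below illustrates that equivalence on a hypothetical toy MDP (the state names, reward table, and transition table are invented for illustration; the paper itself uses the metric, whose fixed-point computation replaces these equality tests with reward differences and Kantorovich distances between transition distributions, typically solved by linear programming).

```python
# Illustrative sketch: exact bisimulation partition refinement for a finite MDP
# (the equivalence underlying the bisimulation metric; toy MDP is hypothetical).

def bisimulation_partition(states, actions, R, P):
    """Refine a partition of `states` until two states share a block iff,
    for every action, they have equal rewards and equal transition mass
    into every block.

    R[s][a] -> immediate reward; P[s][a] -> dict mapping next state to prob.
    """
    partition = [set(states)]  # start with one block containing all states
    while True:
        def signature(s):
            # Per-action reward plus transition mass into each current block.
            sig = []
            for a in actions:
                mass = tuple(
                    round(sum(P[s][a].get(t, 0.0) for t in block), 10)
                    for block in partition
                )
                sig.append((R[s][a], mass))
            return tuple(sig)

        new_partition = []
        for block in partition:
            groups = {}
            for s in block:
                groups.setdefault(signature(s), set()).add(s)
            new_partition.extend(groups.values())
        if len(new_partition) == len(partition):  # no block was split: done
            return new_partition
        partition = new_partition


# Toy 4-state MDP: states 0 and 1 behave identically, as do 2 and 3.
states = [0, 1, 2, 3]
actions = ["a"]
R = {0: {"a": 1.0}, 1: {"a": 1.0}, 2: {"a": 0.0}, 3: {"a": 0.0}}
P = {
    0: {"a": {2: 1.0}},
    1: {"a": {3: 1.0}},
    2: {"a": {2: 1.0}},
    3: {"a": {3: 1.0}},
}
blocks = bisimulation_partition(states, actions, R, P)
print(sorted(sorted(b) for b in blocks))  # -> [[0, 1], [2, 3]]
```

States that end up in the same block are exactly those at metric distance zero; the metric additionally assigns small distances to states that are nearly, but not exactly, equivalent, which is what allows matching states of the small MDP to states of the large one.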


Keywords: Optimal Policy · Reinforcement Learning · Markov Decision Process · Target Domain · Option Construction



References

  1. Anderson, J.R.: ACT: A simple theory of complex cognition. American Psychologist 51, 355–365 (1996)
  2. Barto, A.G., Mahadevan, S.: Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems 13(4), 341–379 (2003)
  3. Castro, P.S., Precup, D.: Using bisimulation for policy transfer in MDPs. In: Proceedings of the 24th AAAI Conference on Artificial Intelligence (AAAI 2010), pp. 1065–1070 (2010)
  4. Comanici, G., Precup, D.: Optimal policy switching algorithms in reinforcement learning. In: Proceedings of AAMAS (2010)
  5. Dietterich, T.G.: Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research 13, 227–303 (2000)
  6. Ferns, N., Castro, P.S., Precup, D., Panangaden, P.: Methods for computing state similarity in Markov Decision Processes. In: Proceedings of the 22nd Annual Conference on Uncertainty in Artificial Intelligence (UAI 2006), pp. 174–181 (2006)
  7. Ferns, N., Panangaden, P., Precup, D.: Metrics for finite Markov decision processes. In: Proceedings of the 20th Annual Conference on Uncertainty in Artificial Intelligence (UAI 2004), pp. 162–169 (2004)
  8. Givan, R., Dean, T., Greig, M.: Equivalence notions and model minimization in Markov Decision Processes. Artificial Intelligence 147(1-2), 163–223 (2003)
  9. Jonsson, A., Barto, A.G.: Causal graph based decomposition of factored MDPs. Journal of Machine Learning Research 7, 2259–2301 (2006)
  10. Konidaris, G., Kuindersma, S., Barto, A.G., Grupen, R.A.: Constructing skill trees for reinforcement learning agents from demonstration trajectories. In: Advances in Neural Information Processing Systems 23, pp. 1162–1170 (2010)
  11. Laird, J., Bloch, M.K., the Soar Group: Soar home page (2011)
  12. Mannor, S., Menache, I., Hoze, A., Klein, U.: Dynamic abstraction in reinforcement learning via clustering. In: Proceedings of the 21st International Conference on Machine Learning, ICML 2004 (2004)
  13. McGovern, A., Barto, A.G.: Automatic discovery of subgoals in reinforcement learning using diverse density. In: Proceedings of the 18th International Conference on Machine Learning, ICML 2001 (2001)
  14. Mehta, N., Ray, S., Tadepalli, P., Dietterich, T.: Automatic discovery and transfer of MAXQ hierarchies. In: Proceedings of the 25th International Conference on Machine Learning, ICML 2008 (2008)
  15. Mugan, J., Kuipers, B.: Autonomously learning an action hierarchy using a learned qualitative state representation. In: Proceedings of the 21st International Joint Conference on Artificial Intelligence (2009)
  16. Parr, R., Russell, S.: Reinforcement learning with hierarchies of machines. In: Advances in Neural Information Processing Systems, NIPS 1998 (1998)
  17. Precup, D.: Temporal Abstraction in Reinforcement Learning. PhD thesis, University of Massachusetts, Amherst (2000)
  18. Ravindran, B., Barto, A.G.: Relativized options: Choosing the right transformation. In: Proceedings of the 20th International Conference on Machine Learning, ICML 2003 (2003)
  19. Soni, V., Singh, S.: Using Homomorphism to Transfer Options across Reinforcement Learning Domains. In: Proceedings of the AAAI Conference on Artificial Intelligence, AAAI 2006 (2006)
  20. Sorg, J., Singh, S.: Transfer via Soft Homomorphisms. In: Proceedings of the 8th International Conference on Autonomous Agents and Multiagent Systems, AAMAS 2009 (2009)
  21. Stolle, M., Precup, D.: Learning Options in Reinforcement Learning. In: Koenig, S., Holte, R.C. (eds.) SARA 2002. LNCS (LNAI), vol. 2371, p. 212. Springer, Heidelberg (2002)
  22. Stone, P., Sutton, R.S., Kuhlmann, G.: Reinforcement learning for RoboCup soccer keepaway. Adaptive Behavior 13(3), 165–188 (2005)
  23. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge (1998)
  24. Sutton, R.S., Precup, D., Singh, S.: Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112, 181–211 (1999)
  25. Taylor, J., Precup, D., Panangaden, P.: Bounding performance loss in approximate MDP homomorphisms. In: Advances in Neural Information Processing Systems, NIPS 2009 (2009)
  26. Taylor, M.E., Stone, P.: Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research 10, 1633–1685 (2009)
  27. Šimšek, Ö., Wolfe, A.P., Barto, A.G.: Identifying useful subgoals in reinforcement learning by local graph partitioning. In: Proceedings of the 22nd International Conference on Machine Learning, ICML 2005 (2005)
  28. Watkins, C.J.C.H., Dayan, P.: Q-learning. Machine Learning 8, 279–292 (1992)
  29. Wolfe, A.P., Barto, A.G.: Defining object types and options using MDP homomorphisms. In: Proceedings of the ICML 2006 Workshop on Structural Knowledge Transfer for Machine Learning (2006)
  30. Zang, P., Zhou, P., Minnen, D., Isbell, C.: Discovering options from example trajectories. In: Proceedings of the 26th International Conference on Machine Learning, ICML 2009 (2009)

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Pablo Samuel Castro (1)
  • Doina Precup (1)
  1. School of Computer Science, McGill University, Canada
