Machine Learning, 73:289

Transfer in variable-reward hierarchical reinforcement learning

  • Neville Mehta
  • Sriraam Natarajan
  • Prasad Tadepalli
  • Alan Fern


Transfer learning seeks to leverage previously learned tasks to achieve faster learning in a new task. In this paper, we consider transfer learning in the context of related but distinct Reinforcement Learning (RL) problems. In particular, our RL problems are derived from Semi-Markov Decision Processes (SMDPs) that share the same transition dynamics but have different reward functions that are linear in a set of reward features. We formally define the transfer learning problem in the context of RL as learning an efficient algorithm to solve any SMDP drawn from a fixed distribution after experiencing a finite number of them. Furthermore, we introduce an online algorithm to solve this problem, Variable-Reward Reinforcement Learning (VRRL), that compactly stores the optimal value functions for several SMDPs, and uses them to optimally initialize the value function for a new SMDP. We generalize our method to a hierarchical RL setting where the different SMDPs share the same task hierarchy. Our experimental results in a simplified real-time strategy domain show that significant transfer learning occurs in both flat and hierarchical settings. Transfer is especially effective in the hierarchical setting where the overall value functions are decomposed into subtask value functions which are more widely amenable to transfer across different SMDPs.
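The abstract does not spell out how stored value functions are reused, so the following is only a minimal sketch of one way the idea could work, not the authors' VRRL algorithm. It relies on one fact: when the reward is a weighted sum of reward features, the value of any fixed policy is also linear in the weights, so a policy can be re-evaluated for a new task with a dot product. The class `PolicyLibrary`, the policy names, the two reward features, and all numbers are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


class PolicyLibrary:
    """Hypothetical store of previously learned policies.

    Each entry keeps a vector with one component per reward feature:
    the value the policy earns on that feature alone. Under a linear
    reward with weights w, the policy's scalar value is dot(w, vector).
    """

    def __init__(self) -> None:
        self._entries: List[Tuple[str, Tuple[float, ...]]] = []

    def add(self, name: str, value_vector: Sequence[float]) -> None:
        """Record a learned policy and its per-feature value vector."""
        self._entries.append((name, tuple(value_vector)))

    def best_for(self, weights: Sequence[float]) -> Tuple[str, float]:
        """Return the stored policy with the highest value under `weights`."""
        if not self._entries:
            raise ValueError("library is empty")
        scored = [(name, dot(weights, vec)) for name, vec in self._entries]
        return max(scored, key=lambda pair: pair[1])


# Made-up example with two reward features: (resources gathered, enemies defeated).
library = PolicyLibrary()
library.add("gatherer", (2.0, 0.5))
library.add("fighter", (0.5, 3.0))

# A new task that mostly rewards gathering starts from the "gatherer" policy.
start_name, start_value = library.best_for((1.0, 0.0))
print(start_name, start_value)
```

In this sketch the selected entry would serve only as the starting point for learning on the new task; if learning then finds a better policy, its value vector could be added to the library for later tasks.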


Keywords: Hierarchical reinforcement learning; Transfer learning; Average-reward learning; Multi-criteria learning



Copyright information

© Springer Science+Business Media, LLC 2008

Authors and Affiliations

  • Neville Mehta (1)
  • Sriraam Natarajan (1)
  • Prasad Tadepalli (1)
  • Alan Fern (1)
  1. School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, USA
