Model primitives for hierarchical lifelong reinforcement learning


Learning interpretable and transferable subpolicies and performing task decomposition from a single, complex task is difficult. Such decomposition can lead to immense sample-efficiency gains in lifelong learning. Some traditional hierarchical reinforcement learning techniques enforce this decomposition in a top-down manner, while meta-learning techniques require a task distribution at hand to learn such decompositions. This article presents a framework for using diverse suboptimal world models to decompose complex task solutions into simpler modular subpolicies. Given these world models, the framework decomposes a single source task in a bottom-up manner, concurrently learning the required modular subpolicies as well as a controller to coordinate them. We perform a series of experiments on high-dimensional continuous-action control tasks to demonstrate the effectiveness of this approach at both complex single-task learning and lifelong learning. Finally, we perform ablation studies to understand the importance and robustness of different elements in the framework, as well as the limitations of this approach.
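
To make the idea of a controller coordinating modular subpolicies more concrete, the following is a minimal sketch, not the method described in the article. It assumes, purely for illustration, that the controller weights each subpolicy by how well its associated world model predicted the most recent transition (a softmax over negative squared prediction errors) and then averages the subpolicies' proposed actions. All names (`gate_weights`, `mix_actions`, the toy models and policies) and the `temperature` parameter are hypothetical.

```python
import math


def gate_weights(models, state, action, next_state, temperature=1.0):
    """Softmax over negative squared prediction errors of each world model.

    Assumption for illustration only: a model that explains the observed
    transition better gives its subpolicy a larger weight.
    """
    errors = []
    for model in models:
        predicted = model(state, action)
        errors.append(sum((p - s) ** 2 for p, s in zip(predicted, next_state)))
    # Subtract the smallest error before exponentiating for numerical stability.
    best = min(errors)
    scores = [math.exp(-(e - best) / temperature) for e in errors]
    total = sum(scores)
    return [s / total for s in scores]


def mix_actions(subpolicies, weights, state):
    """Weighted average of the actions proposed by each subpolicy."""
    proposals = [policy(state) for policy in subpolicies]
    dim = len(proposals[0])
    return [sum(w * a[i] for w, a in zip(weights, proposals)) for i in range(dim)]


# Two toy, deliberately imperfect one-dimensional world models (hypothetical).
def model_drift_right(state, action):
    return [state[0] + action[0] + 0.5]


def model_no_drift(state, action):
    return [state[0] + action[0]]


# Two toy subpolicies, one paired with each model (hypothetical).
def policy_a(state):
    return [1.0]


def policy_b(state):
    return [-1.0]


# Observed transition: state 0.0, action 1.0, next state 1.0.
weights = gate_weights([model_drift_right, model_no_drift], [0.0], [1.0], [1.0])
action = mix_actions([policy_a, policy_b], weights, [0.0])
```

In this toy transition the second model predicts the next state exactly, so its subpolicy receives the larger weight and dominates the mixed action. How the article actually trains its controller and subpolicies is not specified in the abstract, so this sketch should be read only as one plausible way such coordination could work.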



We are thankful to the anonymous reviewers and everyone at the Stanford Intelligent Systems Laboratory for useful comments and suggestions. This work is supported in part by DARPA under Agreement Number D17AP00032. The content is solely the responsibility of the authors and does not necessarily represent the official views of DARPA. We are also grateful for the support from Google Cloud in scaling our experiments.

Author information



Corresponding author

Correspondence to Jayesh K. Gupta.

Cite this article

Wu, B., Gupta, J.K. & Kochenderfer, M. Model primitives for hierarchical lifelong reinforcement learning. Auton Agent Multi-Agent Syst 34, 28 (2020).

Keywords


  • Reinforcement learning
  • Task decomposition
  • Transfer
  • Lifelong learning
  • Hierarchical learning