Evolving hierarchical memory-prediction machines in multi-task reinforcement learning


A fundamental aspect of intelligent agent behaviour is the ability to encode salient features of experience in memory and use these memories, in combination with current sensory information, to predict the best action for each situation such that long-term objectives are maximized. The world is highly dynamic, and behavioural agents must generalize across a variety of environments and objectives over time. This scenario can be modeled as a partially-observable multi-task reinforcement learning problem. We use genetic programming to evolve highly-generalized agents capable of operating in six unique environments from the control literature, including OpenAI’s entire Classic Control suite. This requires the agent to support discrete and continuous actions simultaneously. No task-identification sensor inputs are provided, thus agents must identify tasks from the dynamics of state variables alone and define control policies for each task. We show that emergent hierarchical structure in the evolving programs leads to multi-task agents that succeed by performing a temporal decomposition and encoding of the problem environments in memory. The resulting agents are competitive with task-specific agents in all six environments. Furthermore, the hierarchical structure of programs allows for dynamic run-time complexity, which results in relatively efficient operation.
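The abstract notes that a single agent must support discrete and continuous actions simultaneously, with no task-identification input. One way such an interface can work is sketched below; this is a hedged illustration only, not the authors' evolved-program implementation (the `GeneralistAgent` class, its linear scorer, and its weights are all hypothetical stand-ins). The agent always produces one real-valued score per candidate action: discrete tasks take the index of the highest score, while continuous tasks squash the winning score into the task's action bounds.

```python
import math

class GeneralistAgent:
    """Hypothetical sketch of one policy serving tasks with discrete or
    continuous action spaces, with no task-identification input."""

    def __init__(self, weights):
        # one weight vector per candidate action; a linear scorer stands
        # in for the evolved program hierarchy described in the paper
        self.weights = weights

    def _scores(self, obs):
        # score every candidate action from the raw observation alone
        return [sum(w * x for w, x in zip(wv, obs)) for wv in self.weights]

    def act_discrete(self, obs):
        # discrete task: index of the highest-scoring action
        s = self._scores(obs)
        return max(range(len(s)), key=s.__getitem__)

    def act_continuous(self, obs, low, high):
        # continuous task: logistically squash the winning score into
        # the action bounds [low, high]
        s = max(self._scores(obs))
        u = 1.0 / (1.0 + math.exp(-s))
        return low + u * (high - low)

agent = GeneralistAgent(weights=[[0.5, -0.2], [-0.1, 0.8]])
a_d = agent.act_discrete([1.0, 0.0])                # e.g. a 2-action task
a_c = agent.act_continuous([1.0, 0.0], -2.0, 2.0)   # e.g. a torque in [-2, 2]
```

Because both methods read the same scores, the same evolved structure can drive, say, CartPole's binary actions and Pendulum's continuous torque without being told which task it is in.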




Notes

  1. With the parameters listed in Table 4, the team generation process creates 1575 new agents in each generation.

  2. The population at any given generation includes 1575 new agents and 1575 elite agents from previous generations. The initial population size (\(R_{size}\) in Table 4) is 1000, so after two generations the 63 bins of elites remain full, with their contents recalculated each generation based on the fitness of new agents.
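The 63 elite bins in footnote 2 equal \(2^6 - 1\), the number of non-empty subsets of the six tasks. Reading each bin as holding the best agents on one task subset, the per-generation recalculation can be sketched as below; this reading, the placeholder task names, the `per_bin` count, and the mean-of-normalized-scores fitness are all our assumptions for illustration, not details taken from the paper.

```python
from itertools import combinations

# five Classic Control tasks plus a placeholder for the sixth environment
TASKS = ("CartPole", "Acrobot", "MountainCar",
         "MountainCarContinuous", "Pendulum", "ExtraTask")

# 2**6 - 1 = 63 non-empty task subsets, matching the footnote's bin count
BINS = [c for r in range(1, len(TASKS) + 1)
        for c in combinations(TASKS, r)]

def rebuild_bins(agents, per_bin):
    """Recompute every bin's elites from scratch each generation.

    `agents` maps agent id -> {task: normalized score}; an agent's
    fitness in a bin is its mean score over that bin's tasks.
    """
    elites = {}
    for subset in BINS:
        ranked = sorted(
            agents,
            key=lambda a: sum(agents[a][t] for t in subset) / len(subset),
            reverse=True)
        elites[subset] = ranked[:per_bin]
    return elites

# toy population: a generalist, a weak agent, and a CartPole specialist
pop = {
    "a": {t: 0.9 for t in TASKS},
    "b": {t: 0.1 for t in TASKS},
    "c": {t: (0.99 if t == "CartPole" else 0.0) for t in TASKS},
}
elites = rebuild_bins(pop, per_bin=1)
```

Under this reading the single-task bins reward specialists while the full six-task bin rewards generalists, and 1575 elites spread over 63 bins would imply 25 elites per bin (1575 / 63 = 25), though Table 4 should be consulted for the actual setting.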




Acknowledgements

S.K. gratefully acknowledges support through the NSERC Postdoctoral Scholarship program. This material is based in part upon work supported by the National Science Foundation under Cooperative Agreement No. DBI-0939454 to the BEACON Center for Evolution in Action at Michigan State University. W.B. acknowledges support from the John R. Koza Endowment fund for part of this work. Michigan State University provided computational resources through the Institute for Cyber-Enabled Research. Additional support was provided by ACENET, Calcul Québec, Compute Ontario, WestGrid, and Compute Canada. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Author information



Corresponding author

Correspondence to Stephen Kelly.

Ethics declarations

Conflicts of interest

The authors declare that they have no conflict of interest.



Cite this article

Kelly, S., Voegerl, T., Banzhaf, W. et al. Evolving hierarchical memory-prediction machines in multi-task reinforcement learning. Genet Program Evolvable Mach 22, 573–605 (2021).



Keywords

  • Genetic programming
  • Reinforcement learning
  • Temporal memory
  • Multi-task