Abstract
Animals often demonstrate a remarkable ability to adapt to their environments during their lifetime. They do so partly due to the evolution of morphological and neural structures. These structures capture features of environments shared between generations to bias and speed up lifetime learning. In this work, we propose a computational model for studying a mechanism that can enable such a process. We adopt a computational framework based on meta reinforcement learning as a model of the interplay between evolution and development. At the evolutionary scale, we evolve reservoirs, a family of recurrent neural networks that differ from conventional networks in that one optimizes not the synaptic weights, but hyperparameters controlling macro-level properties of the resulting network architecture. At the developmental scale, we employ these evolved reservoirs to facilitate the learning of a behavioral policy through Reinforcement Learning (RL). Within an RL agent, a reservoir encodes the environment state before providing it to an action policy. We evaluate our approach on several 2D and 3D simulated environments. Our results show that the evolution of reservoirs can improve the learning of diverse challenging tasks. We study in particular three hypotheses: the use of an architecture combining reservoirs and reinforcement learning could enable (1) solving tasks with partial observability, (2) generating oscillatory dynamics that facilitate the learning of locomotion tasks, and (3) facilitating the generalization of learned behaviors to new tasks unknown during the evolution phase.
C. Léger and G. Hamon—Equal first authors.
X. Hinaut and C. Moulin-Frier—Equal last authors.
C. Léger—Work done as intern at Flowers and Mnemosyne.
References
Akiba, T., Sano, S., Yanase, T., Ohta, T., Koyama, M.: Optuna: a next-generation hyperparameter optimization framework. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2623–2631 (2019)
Bäck, T., Schwefel, H.P.: An overview of evolutionary algorithms for parameter optimization. Evol. Comput. 1(1), 1–23 (1993)
Beck, J., et al.: A survey of meta-reinforcement learning. arXiv preprint arXiv:2301.08028 (2023)
Berner, C., et al.: Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680 (2019)
Bertschinger, N., Natschläger, T.: Real-time computation at the edge of chaos in recurrent neural networks. Neural Comput. 16(7), 1413–1436 (2004)
Chang, H., Futagami, K.: Reinforcement learning with convolutional reservoir computing. Appl. Intell. 50, 2400–2410 (2020)
Chang, H.H., Song, H., Yi, Y., Zhang, J., He, H., Liu, L.: Distributive dynamic spectrum access through deep reinforcement learning: a reservoir computing-based approach. IEEE Internet Things J. 6(2), 1938–1948 (2018)
Clune, J.: AI-GAs: AI-generating algorithms, an alternate paradigm for producing general artificial intelligence. arXiv preprint arXiv:1905.10985 (2019)
Doya, K.: Reinforcement learning: computational theory and biological mechanisms. HFSP J. 1(1), 30 (2007)
Duan, Y., Schulman, J., Chen, X., Bartlett, P.L., Sutskever, I., Abbeel, P.: RL\(^2\): fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779 (2016)
Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks. In: International Conference on Machine Learning, pp. 1126–1135. PMLR (2017)
Ha, D., Dai, A., Le, Q.V.: HyperNetworks (2016). http://arxiv.org/abs/1609.09106. arXiv:1609.09106 [cs]
Hansen, N.: The CMA evolution strategy: a tutorial. arXiv preprint arXiv:1604.00772 (2016)
Hinaut, X., Dominey, P.F.: A three-layered model of primate prefrontal cortex encodes identity and abstract categorical structure of behavioral sequences. J. Physiol.-Paris 105(1–3), 16–24 (2011)
Hinaut, X., Dominey, P.F.: Real-time parallel processing of grammatical structure in the fronto-striatal system: a recurrent network simulation study using reservoir computing. PLoS ONE 8(2), e52946 (2013)
Hougen, D.F., Shah, S.N.H.: The evolution of reinforcement learning. In: 2019 IEEE Symposium Series on Computational Intelligence (SSCI), pp. 1457–1464. IEEE (2019)
Johnston, T.D.: Selective costs and benefits in the evolution of learning. In: Rosenblatt, J.S., Hinde, R.A., Beer, C., Busnel, M.C. (eds.) Advances in the Study of Behavior, vol. 12, pp. 65–106. Academic Press (1982). https://doi.org/10.1016/S0065-3454(08)60046-7. http://www.sciencedirect.com/science/article/pii/S0065345408600467
Johnston, T.D.: Selective costs and benefits in the evolution of learning. In: Advances in the Study of Behavior, vol. 12, pp. 65–106. Elsevier (1982)
Kauffman, S.A.: The Origins of Order: Self Organization and Selection in Evolution. Oxford University Press, Oxford (1993)
Laland, K.N., et al.: The extended evolutionary synthesis: its structure, assumptions and predictions. Proc. Royal Soc. B: Biol. Sci. 282(1813), 20151019 (2015). https://doi.org/10.1098/rspb.2015.1019. https://royalsocietypublishing.org/doi/10.1098/rspb.2015.1019
Li, Y.: Deep reinforcement learning: an overview. arXiv preprint arXiv:1701.07274 (2017)
Lukoševičius, M., Jaeger, H.: Reservoir computing approaches to recurrent neural network training. Comput. Sci. Rev. 3(3), 127–149 (2009)
Mante, V., Sussillo, D., Shenoy, K.V., Newsome, W.T.: Context-dependent computation by recurrent dynamics in prefrontal cortex. Nature 503(7474), 78–84 (2013)
Marder, E., Bucher, D.: Central pattern generators and the control of rhythmic movements. Curr. Biol. 11(23), R986–R996 (2001)
Mnih, V., et al.: Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013)
Monahan, G.E.: State of the art-a survey of partially observable Markov decision processes: theory, models, and algorithms. Manag. Sci. 28(1), 1–16 (1982)
Moulin-Frier, C.: The ecology of open-ended skill acquisition. Ph.D. thesis, Université de Bordeaux (UB) (2022)
Najarro, E., Sudhakaran, S., Risi, S.: Towards self-assembling artificial neural networks through neural developmental programs. In: Artificial Life Conference Proceedings, vol. 35, p. 80. MIT Press, Cambridge (2023)
Nussenbaum, K., Hartley, C.A.: Reinforcement learning across development: what insights can we draw from a decade of research? Dev. Cogn. Neurosci. 40, 100733 (2019)
Pearson, K.: Neural adaptation in the generation of rhythmic behavior. Ann. Rev. Physiol. 62(1), 723–753 (2000)
Pedersen, J., Risi, S.: Learning to act through evolution of neural diversity in random neural networks. In: Proceedings of the Genetic and Evolutionary Computation Conference, pp. 1248–1256 (2023)
Pedersen, J.W., Risi, S.: Evolving and merging hebbian learning rules: increasing generalization by decreasing the number of rules. In: Proceedings of the Genetic and Evolutionary Computation Conference, pp. 892–900 (2021)
Puterman, M.L.: Markov decision processes. Handb. Oper. Res. Manag. Sci. 2, 331–434 (1990)
Raffin, A.: Ppo vs recurrentppo (aka ppo lstm) on environments with masked velocity (sb3 contrib). https://wandb.ai/sb3/no-vel-envs/reports/PPO-vs-RecurrentPPO-aka-PPO-LSTM-on-environments-with-masked-velocity-VmlldzoxOTI4NjE4. Accessed Nov 2023
Raffin, A., Hill, A., Gleave, A., Kanervisto, A., Ernestus, M., Dormann, N.: Stable-baselines3: reliable reinforcement learning implementations. J. Mach. Learn. Res. 22(1), 12348–12355 (2021)
Reddy, M.J., Kumar, D.N.: Computational algorithms inspired by biological processes and evolution. Curr. Sci. 370–380 (2012)
Ren, G., Chen, W., Dasgupta, S., Kolodziejski, C., Wörgötter, F., Manoonpong, P.: Multiple chaotic central pattern generators with learning for legged locomotion and malfunction compensation. Inf. Sci. 294, 666–682 (2015)
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)
Seoane, L.F.: Evolutionary aspects of reservoir computing. Phil. Trans. R. Soc. B 374(1774), 20180377 (2019)
Silver, D., et al.: Mastering the game of go without human knowledge. Nature 550(7676), 354–359 (2017)
Stanley, K.O., D’Ambrosio, D.B., Gauci, J.: A hypercube-based encoding for evolving large-scale neural networks. Artif. Life 15(2), 185–212 (2009). https://doi.org/10.1162/artl.2009.15.2.15202
Stephens, D.W.: Change, regularity, and value in the evolution of animal learning. Behav. Ecol. 2(1), 77–89 (1991). https://doi.org/10.1093/beheco/2.1.77
Stork, D.G.: Is backpropagation biologically plausible? In: International 1989 Joint Conference on Neural Networks, pp. 241–246. IEEE (1989)
Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT press, Cambridge (2018)
Sutton, R.S., McAllester, D., Singh, S., Mansour, Y.: Policy gradient methods for reinforcement learning with function approximation. Adv. Neural Inf. Process. Syst. 12 (1999)
Tierney, A.: Evolutionary implications of neural circuit structure and function. Behav. Proc. 35(1–3), 173–182 (1995)
Towers, M., et al.: Gymnasium (2023). https://doi.org/10.5281/zenodo.8127026. https://zenodo.org/record/8127025
Trouvain, N., Pedrelli, L., Dinh, T.T., Hinaut, X.: ReservoirPy: an efficient and user-friendly library to design echo state networks. In: Farkaš, I., Masulli, P., Wermter, S. (eds.) ICANN 2020. LNCS, vol. 12397, pp. 494–505. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-61616-8_40
Watson, R.A., Szathmáry, E.: How can evolution learn? Trends Ecol. Evol. 31(2), 147–157 (2016)
Wyffels, F., Schrauwen, B.: Design of a central pattern generator using reservoir computing for learning human motion. In: 2009 Advanced Technologies for Enhanced Quality of Life, pp. 118–122. IEEE (2009)
Yu, Y., Si, X., Hu, C., Zhang, J.: A review of recurrent neural networks: LSTM cells and network architectures. Neural Comput. 31(7), 1235–1270 (2019)
Zador, A.M.: A critique of pure learning and what artificial neural networks can learn from animal brains. Nat. Commun. 10(1), 3770 (2019)
Acknowledgments
Financial support was received from: the University of Bordeaux's France 2030 program/RRI PHDS framework, and the French National Research Agency (ANR) grants ECOCURL ANR-20-CE23-0006 and DEEPPOOL ANR-21-CE23-0009-01. We benefited from HPC resources: IDRIS under the allocation A0091011996 made by GENCI, using the Jean Zay supercomputer, and Curta from the University of Bordeaux.
6 Appendix
In this appendix, we provide additional details and clarifications on the methods used in our study. Specifically, we describe the parameters governing our experiments, including those of the RL (PPO), RC, and evolutionary (CMA-ES) algorithms we used. We also give a detailed description of the environments used in our research. Finally, we present supplementary analyses aimed at better understanding some phenomena observed in the results, along with results from experiments not featured in the main text, to offer a more complete view of our findings.
6.1 Methods
Proximal Policy Optimization (PPO). PPO is a policy gradient method [45] that explores diverse policies via stochastic gradient ascent: actions correlated with high rewards are assigned higher probabilities, gradually adjusting the policy toward higher expected returns. We adopted PPO because of its well-established reputation as an efficient and stable algorithm in the scientific literature, although this particular choice has no major theoretical implications for this project.
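As a reminder, the clipped surrogate objective that PPO maximizes (following the original formulation by Schulman et al.) can be written as:

```latex
L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\; \mathrm{clip}\!\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right],
\qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```

where \(\hat{A}_t\) is the advantage estimate and \(\epsilon\) the clipping range; the clipping discourages destructively large policy updates, which underlies the stability mentioned above.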
Reservoir Hyperparameters. In Reservoir Computing, the spectral radius controls the trade-off between stability and chaoticity of the reservoir dynamics; in general, "edge of chaos" dynamics are often desired [5]. The input scaling determines the strength of input signals, and the leak rate governs the memory capacity of reservoir neurons over time. These HPs specify how the reservoir weights are generated. Once the reservoir is generated, its weights are kept fixed and only a readout layer, mapping the states of the reservoir neurons to the desired output of the network, is learned. Other HPs exist to initialize a reservoir, but they were not studied in the experiments of this paper, as preliminary tests showed that they have much less influence on the results.
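How these three HPs shape a reservoir can be sketched in a few lines of NumPy (a minimal illustration of the standard echo state network recipe, not our exact implementation, which relies on ReservoirPy [48]):

```python
import numpy as np

def make_reservoir(n=100, spectral_radius=0.9, input_scaling=1.0,
                   n_inputs=4, seed=0):
    """Generate fixed reservoir weights from the evolved HPs (a sketch)."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n, n))
    # Rescale recurrent weights so their largest eigenvalue magnitude
    # equals the target spectral radius.
    W *= spectral_radius / max(abs(np.linalg.eigvals(W)))
    W_in = input_scaling * rng.uniform(-1, 1, (n, n_inputs))
    return W, W_in

def step(x, u, W, W_in, leak_rate=0.3):
    """Leaky-integrator state update; only a linear readout on x is trained."""
    return (1 - leak_rate) * x + leak_rate * np.tanh(W @ x + W_in @ u)
```

A leak rate near 1 makes the neurons react quickly and forget fast, while a small leak rate yields slow, long-memory dynamics, which is relevant to the HP analysis in Sect. 6.7.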
6.2 Experiment Parameters
General Parameters. In our experiments, we adapted the number of timesteps during the training phase of our ER-MRL agent in the inner loop, based on whether we were evolving the reservoir HPs or testing the best HPs set discovered during the CMA-ES evolution. For the evolution phase, which was computationally intensive, we utilized 300,000 timesteps per training. Conversely, when evaluating our agents against standard RL agents, we employed 1,000,000 timesteps. Notably, in the case of the LunarLander environment, we extended the testing to 3,000,000 timesteps, as the learning curve had not yet converged at 1,000,000 timesteps.
PPO Hyperparameters. Regarding the parameters of our RL algorithm, PPO, we used the default settings provided by the Stable Baselines3 library [35]. For tasks involving partial observability, we made a slight adjustment by setting the learning rate to 0.0001, as opposed to the standard 0.0003. This modification notably enhanced performance, potentially indicating that reservoirs contained a degree of noise, warranting a lower learning rate to stabilize RL training.
CMA-ES Hyperparameters. For the parameters of our evolutionary algorithm, CMA-ES, we adopted the default settings of the CMA-ES sampler from the Optuna library [1].
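The outer evolutionary loop can be sketched as follows. Note that this is a toy isotropic evolution strategy standing in for CMA-ES (which additionally adapts a full covariance matrix and step size), and the fitness function here is a hypothetical placeholder for the RL training performance of an ER-MRL agent:

```python
import numpy as np

def toy_es(fitness, x0, sigma=0.3, pop=16, gens=50, seed=0):
    """Minimal isotropic ES -- a simplified stand-in for CMA-ES."""
    rng = np.random.default_rng(seed)
    mean = np.asarray(x0, dtype=float)
    for _ in range(gens):
        samples = mean + sigma * rng.standard_normal((pop, mean.size))
        scores = np.array([fitness(x) for x in samples])
        elite = samples[np.argsort(scores)[-pop // 4:]]  # keep the best quarter
        mean = elite.mean(axis=0)
        sigma *= 0.95  # geometric step-size decay (CMA-ES adapts this automatically)
    return mean

# Hypothetical fitness peaking at spectral radius 1.0 and leak rate 0.3
# (illustrative only; the real fitness is the agent's training performance).
best = toy_es(lambda x: -(x[0] - 1.0) ** 2 - (x[1] - 0.3) ** 2, x0=[0.5, 0.5])
```

In our experiments this loop is handled by Optuna's CMA-ES sampler with its default settings, as stated above.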
Reservoir Hyperparameters. For the reservoirs, we only modified the parameters mentioned in Sect. 6.1 and the number of neurons. We consistently used 100 neurons per reservoir in all experiments. All other HPs were kept at the default reservoir parameters used in ReservoirPy [48]. We conducted additional analyses and observed that these HPs exerted a relatively modest influence on tasks of this nature; however, a more refined analysis of their importance could be interesting for future work.
6.3 Partially Observable Environments
In this section, we present the Reinforcement Learning environments from the Gymnasium library [47] that were used in our experiments on partial observability.
Results Analysis. To better understand the reservoir’s capabilities on these tasks, we conducted several tests on supervised learning problems where a sequence of actions, rewards, and observations (without velocity) was provided to a reservoir with a linear readout. In one case, the model had to reconstruct full observation information (position, angle, velocity, angular velocity), and in the other, it had to reconstruct positions and angles over several time steps (doing this only for the last 2 time steps allows a PPO to achieve maximum reward later on). In both cases, this model successfully solved the tasks with very high performance. Moreover, it was also capable of predicting future observations, which can be extremely valuable to find an optimal action policy.
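The linear readout used in these reconstruction tests can be written in closed form as a ridge regression over collected reservoir states. As a minimal illustration, we substitute random states with linearly decodable targets (in the actual experiment, the states come from driving a reservoir with action/reward/observation sequences, and the targets are the missing velocities):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in: T timesteps of reservoir states (n neurons) and d-dimensional
# targets that are linearly decodable from them (e.g. velocities).
T, n, d = 500, 100, 2
X = rng.standard_normal((T, n))   # states (would come from the reservoir)
A = rng.standard_normal((n, d))
Y = X @ A                         # hidden "velocity" targets

# Closed-form ridge readout: W_out = (X^T X + ridge * I)^{-1} X^T Y
ridge = 1e-6
W_out = np.linalg.solve(X.T @ X + ridge * np.eye(n), X.T @ Y)

error = np.mean((X @ W_out - Y) ** 2)
```

Since only this readout is trained, fitting it is cheap compared to training a recurrent network end to end, which is the point made in the benchmark comparison below.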
Benchmark Comparison. Regarding benchmarks, our approach compares favorably with the results reported in the blog post by Raffin [34], which combines another RNN (an LSTM [51]) with the same RL algorithm (PPO) on the same partially observable tasks. Performance on each environment is quite similar; what varies most between the methods is the number of training timesteps needed to reach maximum performance. On the LunarLander environment, our method learns in fewer timesteps after evolving reservoirs, whereas the opposite holds for the CartPole and Pendulum tasks.
It is worth noting that, even though both approaches share similarities, ER-MRL optimizes the HPs of reservoirs at an evolutionary scale, whereas the method presented in the blog post trains a recurrent architecture from scratch. This divergence complicates direct comparisons between the two methods. Indeed, our results are derived after an extensive phase of computation in a Meta-RL outer loop, but the subsequent evaluation with the final reservoir configuration is comparatively swift, as only the RL policy (linear readout) requires training. In contrast, the LSTM-PPO method does not incorporate a computationally intensive meta-learning phase, but each of its training steps demands more computation, since the LSTM must be trained from scratch in addition to PPO, whereas our method only trains the linear readout at the developmental scale.
However, to ensure a fair and comprehensive comparison with other baselines, especially in terms of the total time required to achieve the presented results, more experiments are necessary.
6.4 MuJoCo Forward Locomotion Environments
Results Analysis. In this section, we present how reservoirs could act as Central Pattern Generators within agents learning a locomotion task in these 3D environments.
It can be observed that the separation between the two models seems to occur starting from 100,000 timesteps at the top-right of Fig. 5. Therefore, we recorded videos of the RL and ER-MRL agents to better understand the performance difference between the two models. Furthermore, we conducted a study at the level of the input vector in the agent’s policy (\(o_t\) for RL agent, and \(c_t\) for ER-MRL agent). As seen in Fig. 11, it is noticeable that very early in the learning process, the reservoir exhibits much more rhythmic dynamics than the sole observation provided by the environment. This could be due to the link between the reservoir and CPGs, potentially facilitating the acquisition and learning of motor control in these tasks.
Expanding on this, it is notable that CPGs, shared across various species, have evolved toward common structures. Drawing a parallel with nature, we investigate whether generalization across a spectrum of motor tasks (results in Sect. 4.3) may mirror the principles found in biological systems.
However, further experiments, accompanied by robust quantitative analysis, are necessary to gain valuable insights into whether reservoirs can function as CPG-like structures.
6.5 MuJoCo Humanoid Environments
Interesting Reservoir Results. As seen in Sect. 2.3, one of the basic principles of RC is to project input data into a higher-dimensional space. In the case of the Humanoid tasks, whose results are displayed in Fig. 5 and Fig. 7, the original observation and action space is larger (400 dimensions) than the context produced by one or two reservoirs of 100 neurons (the context dimension equals the number of neurons). This means that the reservoir improves the quality of the data even while reducing the input dimension of the RL policy network. For the other morphologies, the dimension of the input data is lower than the dimension of our reservoir context.
6.6 Normalized Scores for Generalization
To prevent any particular task from disproportionately influencing the fitness score due to variations in reward scales, we use a fitness function for CMA-ES that aggregates the normalized score, denoted as nScore, across both environments. The normalization process is defined as:

\( nScore = \dfrac{score - randomScore}{baselineScore - randomScore} \)

where randomScore and baselineScore represent the performances of a random and of a standard PPO agent, respectively.
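The normalization and its aggregation across environments can be sketched as follows (a minimal illustration; aggregating by the mean is our assumption here):

```python
def n_score(score, random_score, baseline_score):
    """Normalize a raw return: 0 matches a random agent, 1 a standard PPO agent."""
    return (score - random_score) / (baseline_score - random_score)

def fitness(scores, random_scores, baseline_scores):
    """Aggregate normalized scores across environments (mean, as an assumption)."""
    return sum(n_score(s, r, b)
               for s, r, b in zip(scores, random_scores, baseline_scores)) / len(scores)
```

With this normalization, an environment with rewards in the thousands weighs no more than one with rewards in single digits.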
6.7 Reservoir Hyperparameters Analysis
In preceding sections, we observed how HPs play a pivotal role in enabling ER-MRL agents to generalize across tasks. Now, we delve deeper into understanding why some reservoirs aid in generalization for specific tasks while others do not. To gain this insight, we constructed a hyperparameter map to visualize the regions of HPs associated with the best fitness in each environment. We selected the best 30 sets of HPs, comprising the spectral radius and leak rate values of the reservoirs, out of a pool of 900 for all MuJoCo locomotion tasks (refer to Fig. 10) and plotted them on a 2D plane.
In Fig. 13, we observe that the HPs for most environments are clustered closely together. Conversely, those for the Ant environment form a distinct cluster, characterized by notably lower leak rates. The leak rate reflects how much information a neuron retains in the reservoir, influencing its responsiveness to input data and connections with other neurons. A lower leak rate implies a more extended memory, possibly instrumental in capturing long-term dynamics. This observation aligns with the stable morphology of the Ant, potentially allowing the agent to prioritize long-term dynamics for efficient locomotion. This would partially explain why generalization was not successful in this environment in Sect. 4.3, when reservoirs were evolved on other types of morphologies.
7 Additional Experiments
We also conducted other experiments that were not mentioned in the main text. As noted in Sect. 6.1, we consistently employed reservoirs of 100 neurons to ensure a standardized basis for comparing results: one reservoir corresponds to 100 neurons, two reservoirs to 200 neurons, and so forth. We conducted additional experiments to investigate the impact of varying the number of reservoirs and the number of neurons within them. Altering the number of neurons within a reservoir had a limited effect. For example, reducing the number of neurons to as low as 25 did not significantly affect performance on the partially observable environments. Increasing the size of the reservoirs did not substantially improve performance either, except on the Humanoid environments (with a large observation space), where reservoirs with many neurons (1000) performed slightly better than others. While we opted for 100 neurons per reservoir in our experiments, there is certainly potential for further optimization.
Furthermore, we explored experiments involving partially observable reservoirs, in which only a subset of the reservoir states was provided to the policy. The results demonstrated that the policy does not always need to fully observe the contextual information within the reservoir to successfully accomplish tasks. On the CartPole environment, we tested three types of models: a reservoir of 100 fully observable neurons (the policy has access to all 100 neurons), a reservoir of 1000 fully observable neurons, and a reservoir of 1000 neurons of which only 100 were observable. We observed that the model with 1000 fully observable neurons performed worse than the other two, which had similar results.
Regarding generalization experiments, we investigated the impact of varying the number of reservoirs. Although experiments with three reservoirs yielded intriguing insights, such as distinct memory types characterized by leak rate in the different reservoirs, the overall performance was notably lower compared to configurations with two reservoirs. This observation can likely be attributed to the increased complexity of learning due to the larger observation space, despite the potential for richer dynamics. We also noted instances where several reservoirs maintained very similar hyperparameters for specific tasks, potentially indicating the importance of capturing particular dynamics.
Additionally, we considered the possibility of employing smaller reservoirs in greater numbers. This approach could capture a diverse range of interesting features, such as different dynamics, while keeping the total number of neurons low. This strategy would be particularly advantageous for tasks characterized by small observation and action spaces, but would also imply a larger reservoir HP search space in return.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Léger, C., Hamon, G., Nisioti, E., Hinaut, X., Moulin-Frier, C. (2024). Evolving Reservoirs for Meta Reinforcement Learning. In: Smith, S., Correia, J., Cintrano, C. (eds) Applications of Evolutionary Computation. EvoApplications 2024. Lecture Notes in Computer Science, vol 14635. Springer, Cham. https://doi.org/10.1007/978-3-031-56855-8_3
Print ISBN: 978-3-031-56854-1
Online ISBN: 978-3-031-56855-8