
Policy-based optimization: single-step policy gradient method seen as an evolution strategy

  • Original Article
  • Neural Computing and Applications

Abstract

This research reports on the recent development of black-box optimization methods based on single-step deep reinforcement learning and on their conceptual similarity to evolution strategy (ES) techniques. It formally introduces policy-based optimization (PBO), a policy-gradient-based optimization algorithm that relies on a policy network to describe the density function of its forthcoming evaluations, and uses covariance estimation to steer the policy improvement process in the right direction. The specifics of the PBO algorithm are detailed, and its connections to evolution strategies are discussed. Relevance is assessed by benchmarking PBO against classical ES techniques on analytic function minimization problems, and by optimizing various parametric control laws for the Lorenz attractor and the classical cartpole problem. Given the scarce existing literature on the topic, this contribution establishes PBO as a valid, versatile black-box optimization technique, and opens the way to multiple future improvements building on the inherent flexibility of the neural network approach.
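To make the above description concrete, the following minimal sketch (plain NumPy) illustrates the single-step policy-gradient view of black-box optimization that PBO builds upon. It is not the authors' implementation: the search distribution here is a simple diagonal Gaussian updated by vanilla gradient ascent, whereas PBO relies on a policy network and full covariance estimation, and all names and hyper-parameters below are illustrative assumptions.

```python
import numpy as np

def sphere(x):
    """Toy black-box objective to minimize (sphere function)."""
    return np.sum(x ** 2, axis=-1)

def single_step_pg(f, dim=2, pop_size=8, n_gen=200, lr=0.05, seed=0):
    """Single-step policy-gradient loop over a diagonal Gaussian search
    distribution: sample a population, score it, and push the distribution
    parameters up the reward-weighted log-likelihood gradient."""
    rng = np.random.default_rng(seed)
    mu = rng.normal(size=dim)      # mean of the search distribution
    log_sigma = np.zeros(dim)      # log standard deviations

    for _ in range(n_gen):
        sigma = np.exp(log_sigma)
        eps = rng.normal(size=(pop_size, dim))
        x = mu + sigma * eps                     # sampled candidates ("actions")
        reward = -f(x)                           # one-step reward: minimize f
        adv = (reward - reward.mean()) / (reward.std() + 1e-8)  # normalized advantage

        # REINFORCE-style gradients of the expected reward w.r.t. mu and log_sigma
        grad_mu = np.mean(adv[:, None] * eps / sigma, axis=0)
        grad_log_sigma = np.mean(adv[:, None] * (eps ** 2 - 1.0), axis=0)

        mu += lr * grad_mu
        log_sigma += lr * grad_log_sigma

    return mu

if __name__ == "__main__":
    best = single_step_pg(sphere)
    print("approximate minimizer:", best, "f:", sphere(best))
```

Each generation plays the role of a one-step episode: the candidates are the "actions", the negated objective value is the reward, and the reward-weighted log-likelihood gradient (the REINFORCE estimator) updates the sampling distribution, much as an evolution strategy updates its search distribution from ranked samples.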


Notes

  1. The base code used to produce all results documented in this paper is available via a dedicated GitHub repository [22].

  2. The \(\rho \) and \(\sigma \) used here follow the canonical notation for the Lorenz attractor parameters, and are unrelated to the standard deviation and correlation parameters used previously in the paper.
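For reference, the Lorenz system in this canonical notation reads as follows (standard equations, reproduced here only for clarity, with \(\beta \) denoting the third parameter):

\[
\begin{aligned}
\dot{x} &= \sigma \,(y - x),\\
\dot{y} &= x\,(\rho - z) - y,\\
\dot{z} &= x\,y - \beta \, z.
\end{aligned}
\]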

References

  1. Rawat W, Wang Z (2017) Deep convolutional neural networks for image classification: a comprehensive review. Neural Comput 29:2352–2449

  2. Khan A, Sohail A, Zahoora U, Qureshi AS (2020) A survey of the recent architectures of deep convolutional neural networks. Artif Intell Rev

  3. Nassif AB, Shahin I, Attili I, Azzeh M, Shaalan K (2019) Speech recognition using deep neural networks: a systematic review. IEEE Access 7:19143–19165

  4. Gui J, Sun Z, Wen Y, Tao D, Ye J (2020) A review on generative adversarial networks: algorithms, theory, and applications. http://arxiv.org/abs/2001.06937

  5. Mnih V, Kavukcuoglu K, Silver D, Graves A, Antonoglou I, Wierstra D, Riedmiller M (2013) Playing Atari with deep reinforcement learning. http://arxiv.org/abs/1312.5602

  6. Silver D, Schrittwieser J, Simonyan K, Antonoglou I, Huang A, Guez A, Hubert T, Baker L, Lai M, Bolton A, Chen Y, Lillicrap T, Hui F, Sifre L, van den Driessche G, Graepel T, Hassabis D (2017) Mastering the game of Go without human knowledge. Nature 550

  7. OpenAI (2018) OpenAI Five. https://blog.openai.com/openai-five/

  8. Pinto L, Andrychowicz M, Welinder P, Zaremba W, Abbeel P (2017) Asymmetric actor critic for image-based robot learning. http://arxiv.org/abs/1710.06542

  9. Bahdanau D, Brakel P, Xu K, Goyal A, Lowe R, Pineau J, Courville A, Bengio Y (2016) An actor-critic algorithm for sequence prediction. http://arxiv.org/abs/1607.07086

  10. Kendall A, Hawke J, Janz D, Mazur P, Reda D, Allen J-M, Lam V-D, Bewley A, Shah A (2018) Learning to drive in a day. http://arxiv.org/abs/1807.00412

  11. Bewley A, Rigley J, Liu Y, Hawke J, Shen R, Lam V-D, Kendall A (2018) Learning to drive from simulation without real world labels. http://arxiv.org/abs/1812.03823

  12. Knight W (2018) Google just gave control over data center cooling to an AI. http://www.technologyreview.com/s/611902/google-just-gave-control-over-data-center-cooling-to-an-ai/

  13. Villarrubia G, De Paz JF, Chamoso P, De la Prieta F (2018) Artificial neural networks used in optimization problems. Neurocomputing 272:10–16

  14. Schweidtmann AM, Mitsos A (2019) Deterministic global optimization with artificial neural networks embedded. J Opt Theory Appl 180:925–948

  15. Andrychowicz M, Denil M, Gomez S, Hoffman MW, Pfau D, Schaul T, Shillingford B, de Freitas N (2016) Learning to learn by gradient descent by gradient descent. http://arxiv.org/abs/1606.04474

  16. Yan X, Zhu J, Kuang M, Wang X (2019) Aerodynamic shape optimization using a novel optimizer based on machine learning techniques. Aerospace Sci Technol 86:826–835

  17. Li R, Zhang Y, Chen H (2020) Learning the aerodynamic design of supercritical airfoils through deep reinforcement learning. https://arxiv.org/abs/2010.03651

  18. Viquerat J, Rabault J, Kuhnle A, Ghraieb H, Larcher A, Hachem E (2021) Direct shape optimization through deep reinforcement learning. J Comput Phys 428:110080

  19. Ghraieb H, Viquerat J, Larcher A, Meliga P, Hachem E (2020) Optimization and passive flow control using single-step deep reinforcement learning. http://arxiv.org/abs/2006.02979

  20. Hachem E, Ghraieb H, Viquerat J, Larcher A, Meliga P (2020) Deep reinforcement learning for the control of conjugate heat transfer with application to workpiece cooling. https://arxiv.org/abs/2011.15035

  21. Hämäläinen P, Babadi A, Ma X, Lehtinen J (2018) PPO-CMA: proximal policy optimization with covariance matrix adaptation. http://arxiv.org/abs/1810.02541

  22. Viquerat J (2021) PBO git repository. https://github.com/jviquerat/pbo

  23. Hornik K, Stinchcombe M, White H (1989) Multilayer feedforward networks are universal approximators. Neural Netw 2(5):359–366

  24. Goodfellow I, Bengio Y, Courville A (2017) The deep learning book. MIT Press, London

  25. Sutton R, McAllester D, Singh S, Mansour Y (2000) Policy gradient methods for reinforcement learning with function approximation. Adv Neural Inf Process Syst 12

  26. Rumelhart DE, Hinton GE, Williams RJ (1986) Learning representations by back-propagating errors. Nature 323:533–536

  27. Konda VR, Tsitsiklis JN (2000) Actor-critic algorithms. In: Adv Neural Inf Process Syst, pp 1008–1014

  28. Schulman J, Moritz P, Levine S, Jordan M, Abbeel P (2015) High-dimensional continuous control using generalized advantage estimation. https://arxiv.org/abs/1506.02438

  29. Schulman J, Wolski F, Dhariwal P, Radford A, Klimov O (2017) Proximal policy optimization algorithms. http://arxiv.org/abs/1707.06347

  30. Sutton R, Barto AG (2018) Reinforcement learning: an introduction. MIT Press, Cambridge

  31. Beyer H-G, Schwefel H-P (2002) Evolution strategies—a comprehensive introduction. Natural Comput 1(1):3–52

  32. Eiben AE, Smith JE (2015) Introduction to evolutionary computing, 2nd edn. Springer, Berlin

  33. Hansen N (2016) The CMA evolution strategy: a tutorial. http://arxiv.org/abs/1604.00772

  34. Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. http://arxiv.org/abs/1412.6980

  35. Degris T, White M, Sutton RS (2013) Off-policy actor-critic. https://arxiv.org/abs/1205.4839

  36. Rebonato R, Jäckel P (2011) The most general methodology to create a valid correlation matrix for risk management and option pricing purposes. Available at SSRN 1969689

  37. Rapisarda F, Brigo D, Mercurio F (2007) Parameterizing correlations: a geometric interpretation. IMA J Manage Math 18(1):55–73

  38. Numpacharoen K, Atsawarungruangkit A (2012) Generating correlation matrices based on the boundaries of their coefficients. PLoS ONE 7(11)

  39. Maree S (2012) Correcting non positive definite correlation matrices. BSc Thesis Applied Mathematics, TU Delft

  40. Saltzman B (1962) Finite amplitude free convection as an initial value problem. J Atmos Sci 19(4):329–341

  41. Lorenz EN (1963) Deterministic nonperiodic flow. J Atmos Sci 20(2):130–141

  42. Beintema G, Corbetta A, Biferale L, Toschi F (2020) Controlling Rayleigh-Bénard convection via reinforcement learning. J Turbul 21(9–10):585–605

  43. Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, Burovski E, Peterson P, Weckesser W, Bright J, van der Walt SJ, Brett M, Wilson J, Millman KJ, Mayorov N, Nelson ARJ, Jones E, Kern R, Larson E, Carey CJ, Polat I, Feng Y, Moore EW, VanderPlas J, Laxalde D, Perktold J, Cimrman R, Henriksen I, Quintero EA, Harris CR, Archibald AM, Ribeiro AH, Pedregosa F, van Mulbregt P (2020) SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat Methods 17:261–272

  44. Barto AG, Sutton RS, Anderson CW (1983) Neuron-like adaptive elements that can solve difficult learning control problems. IEEE Trans Syst Man Cybern SMC-13(5):834–846

  45. Brockman G, Cheung V, Pettersson L, Schneider J, Schulman J, Tang J, Zaremba W (2016) OpenAI Gym. https://arxiv.org/abs/1606.01540

  46. Lillicrap TP, Hunt JJ, Pritzel A, Heess N, Erez T, Tassa Y, Silver D, Wierstra D (2019) Continuous control with deep reinforcement learning. https://arxiv.org/abs/1509.02971v6

  47. Wang Z, Bapst V, Heess N, Mnih V, Munos R, Kavukcuoglu K, de Freitas N (2017) Sample efficient actor-critic with experience replay. https://arxiv.org/abs/1611.01224

Acknowledgements

This work is supported by the Carnot M.I.N.E.S. Institute through the M.I.N.D.S. project.

Author information

Corresponding author

Correspondence to J. Viquerat.

Ethics declarations

Conflict of interest

The authors have no conflict of interest to declare.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Viquerat, J., Duvigneau, R., Meliga, P. et al. Policy-based optimization: single-step policy gradient method seen as an evolution strategy. Neural Comput & Applic 35, 449–467 (2023). https://doi.org/10.1007/s00521-022-07779-0

