Combining a gradient-based method and an evolution strategy for multi-objective reinforcement learning

Abstract

Multi-objective reinforcement learning (MORL) algorithms aim to approximate the Pareto frontier uniformly in multi-objective decision-making problems. In deep reinforcement learning (RL), gradient-based methods are often adopted to learn deep policies and value functions because they converge quickly, but pure gradient-based methods cannot guarantee a uniformly approximated Pareto frontier. Evolution strategies, on the other hand, operate directly in the solution space and can achieve a well-distributed Pareto frontier, but applying them to optimize deep networks remains challenging. To leverage the advantages of both kinds of methods, we propose a two-stage MORL framework that combines a gradient-based method with an evolution strategy. In the first stage, an efficient multi-policy soft actor-critic algorithm learns multiple policies collaboratively; the lower layers of all policy networks are shared, so this stage can be regarded as representation learning. In the second stage, the multi-objective covariance matrix adaptation evolution strategy (MO-CMA-ES) fine-tunes the policy-independent parameters to approach a dense and uniform estimate of the Pareto frontier. Experimental results on three benchmarks (Deep Sea Treasure, Adaptive Streaming, and Super Mario Bros) show the superiority of the proposed method.
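The two ideas the abstract relies on, a Pareto frontier over several objectives and a set of policies whose lower layers are shared, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the names `pareto_front` and `SharedTrunkPolicies`, the single shared layer, the linear per-policy heads, and the assumption that every objective is maximized are all hypothetical choices made for illustration.

```python
import numpy as np


def pareto_front(returns):
    """Return indices of non-dominated rows, assuming every objective is maximized."""
    returns = np.asarray(returns, dtype=float)
    keep = []
    for i, r in enumerate(returns):
        # r is dominated if another row is >= r in every objective
        # and strictly > r in at least one objective.
        dominated = np.any(
            np.all(returns >= r, axis=1) & np.any(returns > r, axis=1)
        )
        if not dominated:
            keep.append(i)
    return keep


class SharedTrunkPolicies:
    """Several policies sharing one lower layer, each with its own head (hypothetical layout)."""

    def __init__(self, obs_dim, hidden_dim, n_actions, n_policies, seed=0):
        rng = np.random.default_rng(seed)
        # Shared lower layer: one weight matrix used by every policy.
        self.W_shared = rng.normal(0.0, 0.1, (obs_dim, hidden_dim))
        # Per-policy output heads: one weight matrix per policy.
        self.heads = rng.normal(0.0, 0.1, (n_policies, hidden_dim, n_actions))

    def action_probs(self, obs, k):
        """Action distribution of policy k for a single observation."""
        h = np.tanh(obs @ self.W_shared)      # shared representation
        logits = h @ self.heads[k]            # policy-specific output
        z = np.exp(logits - logits.max())     # numerically stable softmax
        return z / z.sum()
```

For example, `pareto_front([[1, 5], [2, 4], [1, 4], [3, 1], [0, 0]])` keeps the rows at indices 0, 1, and 3, because `[1, 4]` is dominated by `[1, 5]` and `[0, 0]` is dominated by every other row. In this sketch the per-policy heads stand in for parameters that are not shared across policies; which parameters the second stage actually tunes is described in the full paper, not in this abstract.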

Notes

  1. Anonymous et al., Adaptive Streaming: From Bitrate Maximization to Rate-Distortion Optimization

Acknowledgements

The authors would like to express their thanks for the support from the following research grants: 2018AAA0102004, NSFC-61625201, NSFC-61527804.

Author information

Corresponding author

Correspondence to Diqi Chen.

About this article

Cite this article

Chen, D., Wang, Y. & Gao, W. Combining a gradient-based method and an evolution strategy for multi-objective reinforcement learning. Appl Intell 50, 3301–3317 (2020). https://doi.org/10.1007/s10489-020-01702-7

  • DOI: https://doi.org/10.1007/s10489-020-01702-7

Keywords

  • Multi-objective reinforcement learning
  • Multi-policy reinforcement learning
  • Pareto frontier
  • Sampling efficiency