Multi-agent cooperation policy gradient method based on enhanced exploration for cooperative tasks

  • Original Article
  • Published in: International Journal of Machine Learning and Cybernetics

Abstract

Multi-agent cooperation and coordination are often essential for task fulfillment. Multi-agent deep reinforcement learning (MADRL) can learn effective solutions to such problems, but its application is still largely limited by the exploration–exploitation trade-off. MADRL research therefore focuses on how to explore the environment effectively and collect informative, high-quality experience that strengthens cooperative behaviors and improves policy learning. To address this problem, we propose a novel multi-agent cooperation policy gradient method, multi-agent proximal policy optimization based on self-imitation learning and random network distillation (MAPPOSR). MAPPOSR adds two policy-gradient-based components: (1) a random network distillation (RND) exploration-bonus component that produces intrinsic rewards and encourages agents to visit new states and actions, helping them discover better trajectories and preventing the algorithm from converging prematurely or getting stuck in local optima; and (2) a self-imitation learning (SIL) policy-update component that stores and reuses high-return trajectories generated by the agents themselves, strengthening their cooperation and boosting learning efficiency. Experimental results show that, in addition to effectively solving the hard-exploration problem, the proposed method significantly outperforms other state-of-the-art MADRL algorithms in learning efficiency and in escaping local optima. Moreover, we investigate the effect of different value-function inputs on algorithm performance under the centralized training and decentralized execution (CTDE) framework and, based on this analysis, develop an individual-based joint-observation encoding method. By encouraging each agent to focus on the local observations of the other agents relevant to it and to discard the global state provided by the environment, the encoding method removes the negative effects of excessive value-function input dimensions and redundant features on algorithm performance.
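The two add-on components can be summarized concretely. Below is a minimal PyTorch sketch, assuming a discrete-action policy that returns a torch.distributions object; the class names (RNDBonus, SILBuffer), network sizes, and helper functions are illustrative placeholders, not the authors' implementation. The RND bonus scores a state by how poorly a trained predictor matches a fixed random target network, and the SIL loss imitates only those stored transitions whose realised return exceeds the critic's estimate; in practice the intrinsic reward would be added to the environment reward with a scaling coefficient before the PPO update.

```python
# Hedged sketch of the two components described in the abstract.
# All names and hyperparameters here are illustrative assumptions.

import torch
import torch.nn as nn


def mlp(in_dim, out_dim, hidden=128):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))


class RNDBonus(nn.Module):
    """Random network distillation: the prediction error of a trained
    predictor against a fixed, randomly initialised target network is
    used as an intrinsic reward that is large for rarely visited states."""

    def __init__(self, obs_dim, feat_dim=64):
        super().__init__()
        self.target = mlp(obs_dim, feat_dim)      # fixed random network
        self.predictor = mlp(obs_dim, feat_dim)   # trained to match it
        for p in self.target.parameters():
            p.requires_grad = False

    def intrinsic_reward(self, obs):
        with torch.no_grad():
            target_feat = self.target(obs)
        pred_feat = self.predictor(obs)
        # per-state prediction error; also the predictor's training loss
        return ((pred_feat - target_feat) ** 2).mean(dim=-1)


class SILBuffer:
    """Stores complete episodes and imitates only those transitions whose
    realised return exceeds the current value estimate (self-imitation)."""

    def __init__(self):
        self.episodes = []  # list of (obs, actions, returns) tensors

    def add(self, obs, actions, returns):
        self.episodes.append((obs, actions, returns))

    def sil_loss(self, policy, value_fn):
        if not self.episodes:
            return torch.tensor(0.0)
        losses = []
        for obs, actions, returns in self.episodes:
            # clipped advantage: positive only when the return beat the critic
            adv = (returns - value_fn(obs).squeeze(-1)).clamp(min=0.0)
            logp = policy(obs).log_prob(actions)
            losses.append(-(logp * adv.detach()).mean()
                          + 0.5 * (adv ** 2).mean())
        return torch.stack(losses).mean()
```

A typical use, under the same assumptions, would compute r_total = r_env + beta * rnd.intrinsic_reward(obs) for the PPO advantage estimates, train the RND predictor on the same error, and add buffer.sil_loss(policy, value_fn) to the PPO objective with a small weight.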


Code or data availability

The data sets generated during the current study are available from the corresponding author on reasonable request.


Author information


Corresponding author

Correspondence to Li-yang Zhao.

Ethics declarations

Conflict of interest

No conflict of interest exists in the submission of this manuscript, and the manuscript has been approved by all authors for publication.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zhao, Ly., Chang, Tq., Zhang, L. et al. Multi-agent cooperation policy gradient method based on enhanced exploration for cooperative tasks. Int. J. Mach. Learn. & Cyber. 15, 1431–1452 (2024). https://doi.org/10.1007/s13042-023-01976-6

