Abstract
Multi-agent cooperation and coordination are often essential for task fulfillment. Multi-agent deep reinforcement learning (MADRL) can learn solutions to such problems effectively, but its application is still largely limited by the exploration–exploitation trade-off. MADRL research therefore focuses on how to explore the environment effectively and collect informative, high-quality experience that strengthens cooperative behaviors and improves policy learning. To address this problem, we propose a novel multi-agent cooperation policy gradient method, multi-agent proximal policy optimization based on self-imitation learning and random network distillation (MAPPOSR). MAPPOSR augments the policy gradient with two additional components: (1) a random network distillation (RND) exploration bonus that produces intrinsic rewards and encourages agents to visit novel states and actions, helping them discover better trajectories and preventing the algorithm from converging prematurely or getting stuck in local optima; and (2) a self-imitation learning (SIL) policy update that stores and reuses high-return trajectories generated by the agents themselves, strengthening their cooperation and boosting learning efficiency. The experimental results show that, in addition to effectively solving hard-exploration problems, the proposed method significantly outperforms other state-of-the-art MADRL algorithms in learning efficiency and in escaping local optima. Moreover, we investigate how different value function inputs affect algorithm performance under the centralized training and decentralized execution (CTDE) framework and, based on this analysis, develop an individual-based joint-observation encoding method. By encouraging each agent to attend to the local observations of the other agents relevant to it rather than the global state provided by the environment, the encoding method removes the adverse effects of excessive value function input dimensionality and redundant feature information on algorithm performance.
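To make the two add-on components concrete, the following is a minimal sketch, assuming PyTorch; all class, function, and parameter names are illustrative and not taken from the authors' implementation. The RND module scores states by how poorly a trained predictor matches a fixed random target network, and the SIL losses imitate the agent's own past transitions only where the stored return exceeds the current value estimate, following Oh et al. (2018).

    import torch
    import torch.nn as nn

    class RNDBonus(nn.Module):
        # Random network distillation: the prediction error against a
        # fixed, randomly initialized target network serves as an
        # intrinsic reward. It is large for rarely visited states and
        # shrinks as the predictor is trained on them.
        def __init__(self, obs_dim, feat_dim=64):
            super().__init__()
            self.target = nn.Sequential(
                nn.Linear(obs_dim, feat_dim), nn.ReLU(),
                nn.Linear(feat_dim, feat_dim))
            self.predictor = nn.Sequential(
                nn.Linear(obs_dim, feat_dim), nn.ReLU(),
                nn.Linear(feat_dim, feat_dim))
            for p in self.target.parameters():  # the target is never trained
                p.requires_grad_(False)

        def forward(self, obs):
            # Per-state intrinsic reward: mean squared prediction error.
            return (self.predictor(obs) - self.target(obs)).pow(2).mean(dim=-1)

    def sil_losses(log_probs, values, returns):
        # Self-imitation learning: clip the advantage at zero so that only
        # transitions whose realized return beats the current value
        # estimate contribute to the policy and value updates.
        advantage = (returns - values).clamp(min=0.0)
        policy_loss = -(log_probs * advantage.detach()).mean()
        value_loss = 0.5 * advantage.pow(2).mean()
        return policy_loss, value_loss

In training, the intrinsic reward would typically be normalized and added to the environment reward with a weighting coefficient before advantages are computed, while the SIL losses are applied to minibatches drawn from a replay buffer of high-return episodes.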
Code or data availability
Data sets generated during the current study are available from the corresponding author on reasonable request.
Ethics declarations
Conflict of interest
No conflict of interest exists in the submission of this manuscript, and the manuscript has been approved by all authors for publication.
About this article
Cite this article
Zhao, Ly., Chang, Tq., Zhang, L. et al. Multi-agent cooperation policy gradient method based on enhanced exploration for cooperative tasks. Int. J. Mach. Learn. & Cyber. 15, 1431–1452 (2024). https://doi.org/10.1007/s13042-023-01976-6