
Learning structured communication for multi-agent reinforcement learning


Abstract

This work explores large-scale multi-agent communication mechanisms for multi-agent reinforcement learning (MARL). We summarize the general topology categories of communication structures, which are often manually specified in the MARL literature. A novel framework termed Learning Structured Communication (LSC) is proposed to learn a flexible and efficient communication topology (a hierarchical structure). It contains two modules: a structured communication module and a communication-based policy module. The structured communication module learns to form a hierarchical structure by maximizing the cumulative reward of the agents under the current communication-based policy. The communication-based policy module adopts hierarchical graph neural networks to generate messages, propagate information over the learned communication structure, and select actions. In contrast to existing communication mechanisms, our method has a learnable and hierarchical communication structure. Experiments on large-scale battle scenarios show that the proposed LSC achieves high communication efficiency and global cooperation capability.



Notes

  1. We use the terms topology and structure interchangeably.

  2. In this paper, structured communication refers to communicating over a structured (hierarchical) topology; LSC consists of a structured communication module and a communication-based policy module.

  3. The ‘dynamic’ property means that the structure can change over time rather than remaining fixed.

  4. The field that an agent can perceive through observation.

  5. Communication efficiency varies across communication mechanisms; our analysis here assumes the peer-to-peer mode.

  6. To clarify, we refer to the areas within which an agent can communicate and observe as the communication field and the perception field, respectively.

  7. https://github.com/geek-ai/MAgent.

  8. https://github.com/Jarvis-K/LSC.

  9. In practical scenarios, a device can only connect to a finite number of other devices.

References

  1. Adler, J. L., & Blue, V. J. (2002). A cooperative multi-agent transportation management and route guidance system. Transportation Research Part C: Emerging Technologies, 10(5–6), 433–454.


  2. Battaglia, P.W., Hamrick, J.B., Bapst, V., Sanchez-Gonzalez, A., Zambaldi, V., Malinowski, M., Tacchetti, A., Raposo, D., Santoro, A., Faulkner, R., et al. (2018) Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261

  3. Bellemare, M. G., Dabney, W., & Munos, R. (2017). A distributional perspective on reinforcement learning. In International Conference on Machine Learning (pp. 449–458). JMLR.org.

  4. Das, A., Gervet, T., Romoff, J., Batra, D., Parikh, D., Rabbat, M., & Pineau, J. (2019). TarMAC: Targeted multi-agent communication. In International Conference on Machine Learning (pp. 1538–1546).

  5. Foerster, J., Assael, I.A., De Freitas, N., & Whiteson, S. (2016). Learning to communicate with deep multi-agent reinforcement learning. In Neural Information Processing Systems (pp. 2137–2145).

  6. Foerster, J.N., Farquhar, G., Afouras, T., Nardelli, N., & Whiteson, S. (2018). Counterfactual multi-agent policy gradients. In Association for the Advance of Artificial Intelligence (pp. 2974–2982).

  7. Gilmer, J., Schoenholz, S.S., Riley, P.F., Vinyals, O., & Dahl, G.E. (2017). Neural message passing for quantum chemistry. In International Conference on Machine Learning (pp. 1263–1272).

  8. Gilmer, J., Schoenholz, S.S., Riley, P.F., Vinyals, O., & Dahl, G.E. (2017). Neural message passing for quantum chemistry. In International Conference on Machine Learning (pp. 1263–1272).

  9. Iqbal, S., & Sha, F. (2019). Actor-attention-critic for multi-agent reinforcement learning. In International Conference on Machine Learning (pp. 2961–2970).

  10. Jiang, J., Dun, C., Huang, T., & Lu, Z. (2020) Graph convolutional reinforcement learning. In International Conference on Learning Representations.

  11. Jiang, J., & Lu, Z. (2018) Learning attentional communication for multi-agent cooperation. In Neural Information Processing Systems (pp. 7254–7264).

  12. Kim, D., Moon, S., Hostallero, D., Kang, W.J., Lee, T., Son, K., & Yi, Y. (2019) Learning to schedule communication in multi-agent reinforcement learning. In International Conference on Learning Representations.

  13. Lazaridou, A., Peysakhovich, A., & Baroni, M. (2016) Multi-agent cooperation and the emergence of (natural) language. arXiv preprint arXiv:1612.07182

  14. LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444.


  15. Li, Y., Tarlow, D., Brockschmidt, M., & Zemel, R. S. (2016). Gated graph sequence neural networks. In International Conference on Learning Representations. arXiv:1511.05493.


  16. Lillicrap, T., Hunt, J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., & Wierstra, D. (2016). Continuous control with deep reinforcement learning. In International Conference on Learning Representations.

  17. Liu, I. J., Yeh, R. A., & Schwing, A. G. (2019). PIC: Permutation invariant critic for multi-agent deep reinforcement learning (pp. 590–602). PMLR.

  18. Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, O.P., & Mordatch, I. (2017). Multi-agent actor-critic for mixed cooperative-competitive environments. In Neural Information Processing Systems (pp. 6379–6390).

  19. Malysheva, A., Sung, T.T., Sohn, C.B., Kudenko, D., & Shpilman, A. (2018) Deep multi-agent reinforcement learning with relevance graphs. arXiv preprint arXiv:1811.12557

  20. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M. A., Fidjeland, A., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., & Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529–533.


  21. Mordatch, I., & Abbeel, P. (2018). Emergence of grounded compositional language in multi-agent populations. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1). https://doi.org/10.1609/aaai.v32i1.11492

  22. Pianini, D., Casadei, R., Viroli, M., & Natali, A. (2021). Partitioned integration and coordination via the self-organising coordination regions pattern. Future Generation Computer Systems, 114, 44–68. https://doi.org/10.1016/j.future.2020.07.032


  23. Raposo, D., Santoro, A., Barrett, D., Pascanu, R., Lillicrap, T., & Battaglia, P. (2017). Discovering objects and their relations from entangled scene representations. arXiv preprint arXiv:1702.05068.

  24. Rezaee, M., & Yaghmaee, M. (2009). Cluster based routing protocol for mobile ad hoc networks. INFOCOM, 8(1), 30–36.

  25. Ryu, H., Shin, H., & Park, J. (2020). Multi-agent actor-critic with hierarchical graph attention network. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05), 7236–7243. https://doi.org/10.1609/aaai.v34i05.6214


  26. Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., & Monfardini, G. (2008). Computational capabilities of graph neural networks. IEEE Transactions on Neural Networks, 20(1), 81–102.

  27. Semsar-Kazerooni, E., & Khorasani, K. (2009). Multi-agent team cooperation: A game theory approach. Automatica, 45(10), 2205–2213. https://doi.org/10.1016/j.automatica.2009.06.006


  28. Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., & Hassabis, D. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484–489. https://doi.org/10.1038/nature16961


  29. Singh, A., Jain, T., & Sukhbaatar, S. (2019). Learning when to communicate at scale in multiagent cooperative and competitive tasks. In International Conference on Learning Representations. arXiv:1812.09755.

  30. Sukhbaatar, S., Fergus, R., et al. (2016). Learning multiagent communication with backpropagation. In Neural Information Processing Systems (pp. 2244–2252).

  31. Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. IEEE Transactions on Neural Networks, 9(5), 1054–1054. https://doi.org/10.1109/TNN.1998.712192


  32. Tacchetti, A., Song, H. F., Mediano, P. A. M., Zambaldi, V., Kramár, J., Rabinowitz, N. C., Graepel, T., Botvinick, M., & Battaglia, P. W. (2019). Relational forward models for multi-agent learning. In International Conference on Learning Representations. arXiv:1809.11044.

  33. Tampuu, A., Matiisen, T., Kodelja, D., Kuzovkin, I., Korjus, K., Aru, J., Aru, J., & Vicente, R. (2017). Multiagent cooperation and competition with deep reinforcement learning. PLOS ONE, 12(4), 1–15.

  34. Tan, M. (1993). Multi-agent reinforcement learning: Independent vs. cooperative agents. In International Conference on Machine Learning (pp. 330–337).

  35. Van Hasselt, H., Guez, A., & Silver, D. (2016). Deep reinforcement learning with double Q-learning. Proceedings of the AAAI Conference on Artificial Intelligence, 30(1). https://doi.org/10.1609/aaai.v30i1.10295

  36. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems (pp. 5998–6008).

  37. Viroli, M., Beal, J., Damiani, F., Audrito, G., Casadei, R., & Pianini, D. (2019). From distributed coordination to field calculus and aggregate computing. Journal of Logical and Algebraic Methods in Programming, 109, 100486. https://doi.org/10.1016/j.jlamp.2019.100486


  38. Wang, W., Yang, T., Liu, Y., Hao, J., Hao, X., Hu, Y., Chen, Y., Fan, C., & Gao, Y. (2020). From few to more: Large-scale dynamic multiagent curriculum learning. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05), 7293–7300. https://doi.org/10.1609/aaai.v34i05.6221


  39. Wang, X., Cheng, H., & Huang, M. (2013). Multi-robot navigation based QoS routing in self-organizing networks. Engineering Applications of Artificial Intelligence, 26(1), 262–272. https://doi.org/10.1016/j.engappai.2012.01.008


  40. Wang, X., Girshick, R., Gupta, A., & He, K. (2018) Non-local neural networks. In IEEE conference on Computer Vision and Pattern Recognition (pp. 7794–7803).

  41. Wang, Z., Schaul, T., Hessel, M., Van Hasselt, H., Lanctot, M., & De Freitas, N. (2016). Dueling network architectures for deep reinforcement learning. In International Conference on Machine Learning (pp. 1995–2003).

  42. Watkins, C. J., & Dayan, P. (1992). Q-learning. Machine Learning, 8(3–4), 279–292.

  43. Weyns, D., & Holvoet, T. (2003). Regional synchronization for simultaneous actions in situated multi-agent systems. In V. Mařík, M. Pěchouček, & J. Müller (Eds.), Multi-agent systems and applications III (pp. 497–510). Berlin, Heidelberg: Springer. https://doi.org/10.1007/3-540-45023-8_48


  44. Wu, F., Zilberstein, S., & Chen, X. (2011). Online planning for multi-agent systems with bounded communication. Artificial Intelligence, 175(2), 487–511. https://doi.org/10.1016/j.artint.2010.09.008


  45. Yang, Y., Luo, R., Li, M., Zhou, M., Zhang, W., & Wang, J. (2018). Mean field multi-agent reinforcement learning. In International Conference on Machine Learning (pp. 5571–5580).

  46. Ying, Z., You, J., Morris, C., Ren, X., Hamilton, W., & Leskovec, J. (2018). Hierarchical graph representation learning with differentiable pooling. In Neural Information Processing Systems (pp. 4800–4810).

  47. Zhang, C., & Lesser, V. (2013). Coordinating multi-agent reinforcement learning with limited communication. In International Conference on Autonomous Agents and Multi-Agent Systems (pp. 1101–1108).

  48. Zhang, C., & Lesser, V. (2013). Coordinating multi-agent reinforcement learning with limited communication. In International Conference on Autonomous Agents and Multi-Agent Systems. arXiv:1902.01554.


Funding

This work was supported in part by the National Key Research and Development Program of China (No. 2020AAA0107400), NSFC (No. 12071145), the Open Research Projects of Zhejiang Lab (No. 2021KE0AB03), and a grant from the Shenzhen Institute of Artificial Intelligence and Robotics for Society.

Author information


Corresponding author

Correspondence to Xiangfeng Wang.

Ethics declarations

Conflict of interest

The authors have no conflicts of interest to declare that are relevant to the content of this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix


1.1 CBRP function and HCOMM function

We provide the CBRP and HCOMM functions used in LSC in Algorithms 2 and 3.

[Algorithm 2: CBRP function (figure)]
[Algorithm 3: HCOMM function (figure)]
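
Since the algorithm listings are provided as figures in the published article, the snippet below is only a hypothetical, minimal sketch of what a CBRP-style cluster-formation step could look like; the function name `cbrp_clusters`, the `comm_radius` parameter, and the lowest-id leader-election rule are illustrative assumptions and do not reproduce the actual Algorithm 2, nor the HCOMM message passing of Algorithm 3.

```python
# Hypothetical CBRP-style clustering sketch (NOT the paper's Algorithm 2):
# agents within a communication radius are greedily grouped around a leader.
from typing import Dict, List, Tuple


def cbrp_clusters(positions: Dict[int, Tuple[float, float]],
                  comm_radius: float) -> Dict[int, List[int]]:
    """Greedily partition agents into clusters keyed by an elected leader id."""
    unassigned = set(positions)
    clusters: Dict[int, List[int]] = {}
    while unassigned:
        leader = min(unassigned)  # simplistic, deterministic leader election
        members = sorted(
            a for a in unassigned
            if ((positions[a][0] - positions[leader][0]) ** 2
                + (positions[a][1] - positions[leader][1]) ** 2) ** 0.5 <= comm_radius
        )
        clusters[leader] = members
        unassigned -= set(members)
    return clusters


if __name__ == "__main__":
    pos = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (5.0, 5.0), 3: (5.5, 5.0)}
    print(cbrp_clusters(pos, comm_radius=2.0))  # {0: [0, 1], 2: [2, 3]}
```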

1.2 Communication structure effects on Q-learning

The performance of a multi-agent learning algorithm is often measured by the sum of utilities obtained by all the agents: \(Q({\mathbf {o}}, {\mathbf {a}})=\sum _i^N Q^i({\mathbf {o}},{\mathbf {a}})\), where \(Q^i\) is the local utility function of agent i and \({\mathbf {o}}, {\mathbf {a}}\) are the joint observation and joint action, respectively. Analyzing the performance of multi-agent learning in partially observable settings is difficult, so we consider a simple case in which all agents can observe the state information: the joint observation is available to every agent, whereas the joint action can only be accessed through communication. Communication-based MARL uses local and communicated actions to approximate the joint action:

$$\begin{aligned} Q({\mathbf {o}}, {\mathbf {a}}) \approx \sum _i^N Q^i({\mathbf {o}}, a_i, a_{C_i}) \le \sum _i^N \max _{{\hat{a}}_{C_i}} Q^i({\mathbf {o}}, a_i, {\hat{a}}_{C_i}), \end{aligned}$$
(5)

where \(C_i\) is the set of agents that agent i communicates with. Denote \(-i = \{1, \dots , i-1\} \bigcup \{i+1, \dots , N\}\). The FC and STAR communication structures take \(C_i=N(i)=-i\), while NBOR and TREE have \(C_i = N(i)\subset -i\), where \(N(i)\) is the set of neighbour agents of agent i. For the hierarchical topology, \(C_i=N(i)\bigcup NH(i) \subset -i\), where \(NH(i)\) is the set of agents reachable from agent i through intra-group communication. For example, in Fig. 4 of our manuscript, agent E can communicate with A through intra-group communication, so \(\{A\} \subset NH(E)\). On the one hand, taking \(C_i=-i\) often faces the dimensionality issue (i.e., as the number of agents increases, the complexity of learning the utility function increases exponentially [48]). On the other hand, an inappropriate choice of \(C_i\) leads to a loss of cooperation. To quantify how much utility an agent may potentially lose, we define the potential loss in lacking communication. Before that, we first define the potential expected utility as follows:

Definition 1

The potential expected utility of agent i is the maximum expected utility of agent i when it perfectly coordinates with its neighbors via communication:

$$\begin{aligned} PV_i({\mathbf {o}}, a_i, C_i)= \max _{a_{C_i}} Q^i({\mathbf {o}}, a_i, a_{C_i}), \end{aligned}$$
(6)

where \(Q^i({\mathbf {o}}, a_i, a_{C_i}) =\sum _{{\mathbf {o}}} \sum _{a_{-i\backslash C_i}} P({\mathbf {o}})P(a_{-i\backslash C_i} | {\mathbf {o}}, a_i, a_{C_i}) Q({\mathbf {o}}, {\mathbf {a}})\).

The probabilities \(P(a_{-i\backslash C_i} | {\mathbf {o}}, a_i, a_{C_i})\) and \(P({\mathbf {o}})\) can be estimated from experience data. The \(\max\) operation is taken over \(a_{C_i}\); therefore, this measure generally overestimates the expected utility that agent i can obtain when it communicates and coordinates with \(C_i\).
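
As a hedged numerical illustration of Eq. (6) (not code from the LSC repository), the sketch below fixes a single joint observation, draws a random joint utility table \(Q({\mathbf {o}}, {\mathbf {a}})\), assumes for simplicity a uniform estimate of \(P(a_{-i\backslash C_i} | {\mathbf {o}}, a_i, a_{C_i})\), and maximizes over \(a_{C_i}\); the helper name `potential_value` and the toy sizes are assumptions made only for illustration.

```python
import itertools
import numpy as np

n_agents, n_actions = 4, 2
rng = np.random.default_rng(0)
# Joint utility Q(o, a_1, ..., a_4) for a single fixed joint observation o.
Q = rng.uniform(0.0, 1.0, size=(n_actions,) * n_agents)


def potential_value(i, a_i, C_i):
    """PV_i(o, a_i, C_i): max over a_{C_i} of the expected Q over the actions
    of agents outside C_i (assumed uniform here; in general this conditional
    distribution would be estimated from experience data)."""
    comm = sorted(C_i)
    rest = [j for j in range(n_agents) if j != i and j not in C_i]
    best = float("-inf")
    for a_comm in itertools.product(range(n_actions), repeat=len(comm)):
        values = []
        for a_rest in itertools.product(range(n_actions), repeat=len(rest)):
            joint = [0] * n_agents
            joint[i] = a_i
            for j, a in zip(comm, a_comm):
                joint[j] = a
            for j, a in zip(rest, a_rest):
                joint[j] = a
            values.append(Q[tuple(joint)])
        best = max(best, float(np.mean(values)))
    return best


# Communicating with more agents can only increase the potential value.
print(potential_value(0, a_i=1, C_i=set()))        # no communication
print(potential_value(0, a_i=1, C_i={1}))          # one neighbour
print(potential_value(0, a_i=1, C_i={1, 2, 3}))    # all other agents (-i)
```

The printed values are non-decreasing in the size of \(C_i\), which anticipates Proposition 1 below.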

Proposition 1

If \(D_i\subset C_i \subset -i\), then \(PV_i({\mathbf {o}}, a_i, D_i) \le PV_i({\mathbf {o}}, a_i, C_i)\).

Thus \(C_i=-i\) yields the maximum potential expected utility. We next quantify the potential utility loss when an agent communicates with only a subset of the other agents while selecting its action.

Definition 2

The potential loss in lacking communication of agent i is the difference between the potential expected utility of agent i when it coordinates with all agents and when it coordinates only with its neighbors \(C_i\):

$$\begin{aligned} PL_i({\mathbf {o}}, C_i)= \max _{a_i} PV_i({\mathbf {o}}, a_i, -i) - \max _{a_i} PV_i({\mathbf {o}}, a_i, C_i) \end{aligned}$$
(7)

Similar definitions also appear in [48]. With Proposition 1, it can be inferred that for \(D_i\subset C_i\), \(PL_i({\mathbf {o}}, -i) \le PL_i({\mathbf {o}}, C_i) \le PL_i({\mathbf {o}}, D_i)\); that is, enlarging the communication set never increases the potential loss. Then, if NBOR (or TREE) and the hierarchical structure share the same neighbor set N(i), we have \(N(i)\subset N(i)\bigcup NH(i)\), so the hierarchical structure has a potential loss less than or equal to that of NBOR (or TREE). Thus, the hierarchical structure is a trade-off between the cooperation loss of NBOR and TREE and the curse of dimensionality of FC and STAR.
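
Continuing the hedged toy sketch from Definition 1 (and reusing its `potential_value`, `Q`, `n_agents`, and `n_actions`), the check below computes \(PL_i\) from Definition 2 for a NBOR/TREE-like set \(N(i)\), a hierarchical set \(N(i)\bigcup NH(i)\), and the full set \(-i\); the neighbour sets are made-up assumptions. The printed losses are non-increasing and reach zero at \(C_i=-i\), matching the ordering above.

```python
def potential_loss(i, C_i):
    """PL_i(o, C_i) = max_{a_i} PV_i(o, a_i, -i) - max_{a_i} PV_i(o, a_i, C_i)."""
    minus_i = set(range(n_agents)) - {i}
    pv_full = max(potential_value(i, a, minus_i) for a in range(n_actions))
    pv_comm = max(potential_value(i, a, C_i) for a in range(n_actions))
    return pv_full - pv_comm


N_i = {1}    # hypothetical neighbour set N(i) of agent 0
NH_i = {2}   # hypothetical intra-group reachable set NH(i) of agent 0

# Loss shrinks (weakly) as the communication set grows; it is zero at C_i = -i.
print(potential_loss(0, N_i))                         # NBOR / TREE-like set
print(potential_loss(0, N_i | NH_i))                  # hierarchical set
print(potential_loss(0, set(range(n_agents)) - {0}))  # FC / STAR: 0.0
```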

Furthermore, PL also indicates the importance of choosing a suitable \(C_i\). If \(PL_i({\mathbf {o}}, C_i)\) is large, restricting the communication set from \(-i\) to \(C_i\) incurs a large potential loss; this lowers the upper bound in Eq. 5 and leads the algorithm to a lower expected global utility. If \(PL_i({\mathbf {o}}, C_i) = 0\), we can replace \(Q^i({\mathbf {o}}, a_i, a_{-i})\) with \(Q^i({\mathbf {o}}, a_i, a_{C_i})\) without any loss of potential expected utility, and the complexity of learning is reduced by the smaller size of \(C_i\). Thus, learning \(C_i\) (i.e., communication structure learning) is critical for Q-learning: the communication structure is learned to obtain a higher global utility under the current policy, which benefits both communication structure learning and policy learning.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Sheng, J., Wang, X., Jin, B. et al. Learning structured communication for multi-agent reinforcement learning. Auton Agent Multi-Agent Syst 36, 50 (2022). https://doi.org/10.1007/s10458-022-09580-8
