
Learning structured communication for multi-agent reinforcement learning


Abstract

This work explores large-scale multi-agent communication mechanisms for multi-agent reinforcement learning (MARL). We summarize the general topology categories of communication structures, which are often manually specified in the MARL literature. A novel framework termed Learning Structured Communication (LSC) is proposed to learn a flexible and efficient communication topology (a hierarchical structure). It contains two modules: a structured communication module and a communication-based policy module. The structured communication module learns to form a hierarchical structure by maximizing the cumulative reward of the agents under the current communication-based policy. The communication-based policy module adopts hierarchical graph neural networks to generate messages, propagate information over the learned communication structure, and select actions. In contrast to existing communication mechanisms, our method has a learnable and hierarchical communication structure. Experiments on large-scale battle scenarios show that the proposed LSC achieves high communication efficiency and global cooperation capability.



Notes

  1. We use the terms topology and structure interchangeably.

  2. In this paper, structured communication refers to communicating over a structured (hierarchical) topology; LSC consists of a structured communication module and a communication-based policy module.

  3. The ‘dynamic’ property means that the structure can change over time rather than remaining fixed.

  4. The field that an agent can perceive through observation.

  5. Communication efficiency varies across communication mechanisms; our analysis here assumes the peer-to-peer mode.

  6. To clarify, we refer to the areas within which an agent can communicate and observe as the communication field and the perception field, respectively.

  7. https://github.com/geek-ai/MAgent.

  8. https://github.com/Jarvis-K/LSC.

  9. In practical scenarios, a device can only connect to a finite number of other devices.

References

  1. Adler, J. L., & Blue, V. J. (2002). A cooperative multi-agent transportation management and route guidance system. Transportation Research Part C: Emerging Technologies, 10(5–6), 433–454.


  2. Battaglia, P.W., Hamrick, J.B., Bapst, V., Sanchez-Gonzalez, A., Zambaldi, V., Malinowski, M., Tacchetti, A., Raposo, D., Santoro, A., Faulkner, R., et al. (2018) Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261

  3. Bellemare, M. G., Dabney, W., & Munos, R. (2017). A distributional perspective on reinforcement learning. In International Conference on Machine Learning (pp. 449–458). JMLR.org.

  4. Das, A., Gervet, T., Romoff, J., Batra, D., Parikh, D., Rabbat, M., & Pineau, J. (2019). TarMAC: Targeted multi-agent communication. In International Conference on Machine Learning (pp. 1538–1546).

  5. Foerster, J., Assael, I.A., De Freitas, N., & Whiteson, S. (2016). Learning to communicate with deep multi-agent reinforcement learning. In Neural Information Processing Systems (pp. 2137–2145).

  6. Foerster, J.N., Farquhar, G., Afouras, T., Nardelli, N., & Whiteson, S. (2018). Counterfactual multi-agent policy gradients. In Association for the Advance of Artificial Intelligence (pp. 2974–2982).

  7. Gilmer, J., Schoenholz, S.S., Riley, P.F., Vinyals, O., & Dahl, G.E. (2017). Neural message passing for quantum chemistry. In International Conference on Machine Learning (pp. 1263–1272).

  8. Gilmer, J., Schoenholz, S.S., Riley, P.F., Vinyals, O., & Dahl, G.E. (2017). Neural message passing for quantum chemistry. In International Conference on Machine Learning (pp. 1263–1272).

  9. Iqbal, S., & Sha, F. (2019). Actor-attention-critic for multi-agent reinforcement learning. In International Conference on Machine Learning (pp. 2961–2970).

  10. Jiang, J., Dun, C., Huang, T., & Lu, Z. (2020) Graph convolutional reinforcement learning. In International Conference on Learning Representations.

  11. Jiang, J., & Lu, Z. (2018) Learning attentional communication for multi-agent cooperation. In Neural Information Processing Systems (pp. 7254–7264).

  12. Kim, D., Moon, S., Hostallero, D., Kang, W.J., Lee, T., Son, K., & Yi, Y. (2019) Learning to schedule communication in multi-agent reinforcement learning. In International Conference on Learning Representations.

  13. Lazaridou, A., Peysakhovich, A., & Baroni, M. (2016) Multi-agent cooperation and the emergence of (natural) language. arXiv preprint arXiv:1612.07182

  14. LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444.


  15. Li, Y., Tarlow, D., Brockschmidt, M., & Zemel, R. S. (2016). Gated graph sequence neural networks. In International Conference on Learning Representations. arXiv:1511.05493.


  16. Lillicrap, T., Hunt, J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., & Wierstra, D. (2016). Continuous control with deep reinforcement learning. In International Conference on Learning Representations.

  17. Liu, I. J., Yeh, R. A., & Schwing, A. G. (2019). PIC: Permutation invariant critic for multi-agent deep reinforcement learning (pp. 590–602). PMLR.

  18. Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, O.P., & Mordatch, I. (2017). Multi-agent actor-critic for mixed cooperative-competitive environments. In Neural Information Processing Systems (pp. 6379–6390).

  19. Malysheva, A., Sung, T.T., Sohn, C.B., Kudenko, D., & Shpilman, A. (2018) Deep multi-agent reinforcement learning with relevance graphs. arXiv preprint arXiv:1811.12557

  20. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M. A., Fidjeland, A., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., & Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529–533.


  21. Mordatch, I., & Abbeel, P. (2018). Emergence of grounded compositional language in multi-agent populations. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1). https://doi.org/10.1609/aaai.v32i1.11492

  22. Pianini, D., Casadei, R., Viroli, M., & Natali, A. (2021). Partitioned integration and coordination via the self-organising coordination regions pattern. Future Generation Computer Systems, 114, 44–68. https://doi.org/10.1016/j.future.2020.07.032


  23. Raposo, D., Santoro, A., Barrett, D., Pascanu, R., Lillicrap, T., & Battaglia, P. (2017). Discovering objects and their relations from entangled scene representations. arXiv preprint arXiv:1702.05068.

  24. Rezaee, M., & Yaghmaee, M. (2009). Cluster based routing protocol for mobile ad hoc networks. INFOCOM, 8(1), 30–36.

  25. Ryu, H., Shin, H., & Park, J. (2020). Multi-agent actor-critic with hierarchical graph attention network. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05), 7236–7243. https://doi.org/10.1609/aaai.v34i05.6214


  26. Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., & Monfardini, G. (2008). Computational capabilities of graph neural networks. IEEE Transactions on Neural Networks, 20(1), 81–102.

  27. Semsar-Kazerooni, E., & Khorasani, K. (2009). Multi-agent team cooperation: A game theory approach. Automatica, 45(10), 2205–2213. https://doi.org/10.1016/j.automatica.2009.06.006


  28. Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., & Hassabis, D. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484–489. https://doi.org/10.1038/nature16961


  29. Singh, A., Jain, T., & Sukhbaatar, S. (2019). Learning when to communicate at scale in multiagent cooperative and competitive tasks. In International Conference on Learning Representations. arXiv:1812.09755.

  30. Sukhbaatar, S., Fergus, R., et al. (2016). Learning multiagent communication with backpropagation. In Neural Information Processing Systems (pp. 2244–2252).

  31. Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. IEEE Transactions on Neural Networks, 9(5), 1054–1054. https://doi.org/10.1109/TNN.1998.712192


  32. Tacchetti, A., Song, H. F., Mediano, P. A. M., Zambaldi, V., Kramár, J., Rabinowitz, N. C., Graepel, T., Botvinick, M., & Battaglia, P. W. (2019). Relational forward models for multi-agent learning. In International Conference on Learning Representations. arXiv:1809.11044.

  33. Tampuu, A., Matiisen, T., Kodelja, D., Kuzovkin, I., Korjus, K., Aru, J., Aru, J., & Vicente, R. (2017). Multiagent cooperation and competition with deep reinforcement learning. PLOS ONE, 12(4), 1–15.

  34. Tan, M. (1993). Multi-agent reinforcement learning: Independent vs. cooperative agents. In International Conference on Machine Learning (pp. 330–337).

  35. Van Hasselt, H., Guez, A., & Silver, D. (2016). Deep reinforcement learning with double Q-learning. Proceedings of the AAAI Conference on Artificial Intelligence, 30(1). https://doi.org/10.1609/aaai.v30i1.10295

  36. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems (pp. 5998–6008).

  37. Viroli, M., Beal, J., Damiani, F., Audrito, G., Casadei, R., & Pianini, D. (2019). From distributed coordination to field calculus and aggregate computing. Journal of Logical and Algebraic Methods in Programming, 109, 100486. https://doi.org/10.1016/j.jlamp.2019.100486


  38. Wang, W., Yang, T., Liu, Y., Hao, J., Hao, X., Hu, Y., Chen, Y., Fan, C., & Gao, Y. (2020). From few to more: Large-scale dynamic multiagent curriculum learning. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05), 7293–7300. https://doi.org/10.1609/aaai.v34i05.6221


  39. Wang, X., Cheng, H., & Huang, M. (2013). Multi-robot navigation based QoS routing in self-organizing networks. Engineering Applications of Artificial Intelligence, 26(1), 262–272. https://doi.org/10.1016/j.engappai.2012.01.008


  40. Wang, X., Girshick, R., Gupta, A., & He, K. (2018) Non-local neural networks. In IEEE conference on Computer Vision and Pattern Recognition (pp. 7794–7803).

  41. Wang, Z., Schaul, T., Hessel, M., Van Hasselt, H., Lanctot, M., & De Freitas, N. (2016). Dueling network architectures for deep reinforcement learning. In International Conference on Machine Learning (pp. 1995–2003).

  42. Watkins, C. J., & Dayan, P. (1992). Q-learning. Machine Learning, 8(3–4), 279–292.

  43. Weyns, D., & Holvoet, T. (2003). Regional synchronization for simultaneous actions in situated multi-agent systems. In V. Mařík, M. Pěchouček, & J. Müller (Eds.), Multi-agent systems and applications III (pp. 497–510). Berlin, Heidelberg: Springer. https://doi.org/10.1007/3-540-45023-8_48


  44. Wu, F., Zilberstein, S., & Chen, X. (2011). Online planning for multi-agent systems with bounded communication. Artificial Intelligence, 175(2), 487–511. https://doi.org/10.1016/j.artint.2010.09.008


  45. Yang, Y., Luo, R., Li, M., Zhou, M., Zhang, W., & Wang, J. (2018). Mean field multi-agent reinforcement learning. In International Conference on Machine Learning (pp. 5571–5580).

  46. Ying, Z., You, J., Morris, C., Ren, X., Hamilton, W., & Leskovec, J. (2018). Hierarchical graph representation learning with differentiable pooling. In Neural Information Processing Systems (pp. 4800–4810).

  47. Zhang, C., & Lesser, V. (2013). Coordinating multi-agent reinforcement learning with limited communication. In International Conference on Autonomous Agents and Multi-Agent Systems (pp. 1101–1108).

  48. Zhang, C., & Lesser, V. (2013). Coordinating multi-agent reinforcement learning with limited communication. In International Conference on Autonomous Agents and Multi-Agent Systems. arXiv:1902.01554.


Funding

This work was supported in part by the National Key Research and Development Program of China (No. 2020AAA0107400), NSFC (No. 12071145), the Open Research Projects of Zhejiang Lab (No. 2021KE0AB03), and a grant from the Shenzhen Institute of Artificial Intelligence and Robotics for Society.

Author information


Corresponding author

Correspondence to Xiangfeng Wang.

Ethics declarations

Conflict of interest

The authors have no conflicts of interest to declare that are relevant to the content of this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix


1.1 CBRP function and HCOMM function

We provide the CBRP and HCOMM functions used in LSC in Algorithms 2 and 3.

[Algorithm 2: CBRP function (figure)]
[Algorithm 3: HCOMM function (figure)]
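
Since the algorithm listings are provided as figures in the published article, the snippet below is only a hypothetical, minimal sketch of what a CBRP-style cluster-formation step could look like; the function name `cbrp_clusters`, the `comm_radius` parameter, and the lowest-id leader-election rule are illustrative assumptions and do not reproduce the actual Algorithm 2, nor the HCOMM message passing of Algorithm 3.

```python
# Hypothetical CBRP-style clustering sketch (NOT the paper's Algorithm 2):
# agents within a communication radius are greedily grouped around a leader.
from typing import Dict, List, Tuple


def cbrp_clusters(positions: Dict[int, Tuple[float, float]],
                  comm_radius: float) -> Dict[int, List[int]]:
    """Greedily partition agents into clusters keyed by an elected leader id."""
    unassigned = set(positions)
    clusters: Dict[int, List[int]] = {}
    while unassigned:
        leader = min(unassigned)  # simplistic, deterministic leader election
        members = sorted(
            a for a in unassigned
            if ((positions[a][0] - positions[leader][0]) ** 2
                + (positions[a][1] - positions[leader][1]) ** 2) ** 0.5 <= comm_radius
        )
        clusters[leader] = members
        unassigned -= set(members)
    return clusters


if __name__ == "__main__":
    pos = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (5.0, 5.0), 3: (5.5, 5.0)}
    print(cbrp_clusters(pos, comm_radius=2.0))  # {0: [0, 1], 2: [2, 3]}
```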

1.2 Communication structure effects on Q-learning

The performance of a multi-agent learning algorithm is often measured by the sum of utilities obtained by all the agents: \(Q({\mathbf {o}}, {\mathbf {a}})=\sum _i^N Q^i({\mathbf {o}},{\mathbf {a}})\), where \(Q^i\) is the local utility function of agent i and \({\mathbf {o}}, {\mathbf {a}}\) are the joint observation and joint action, respectively. Analyzing the performance of multi-agent learning in partially observable settings is difficult, so we consider a simple case in which all agents can observe the state information: the joint observation is available to every agent, whereas the joint action can only be accessed through communication. Communication-based MARL uses local and communicated actions to approximate the joint action:

$$\begin{aligned} Q({\mathbf {o}}, {\mathbf {a}}) \approx \sum _i^N Q^i({\mathbf {o}}, a_i, a_{C_i}) \le \sum _i^N \max _{{\hat{a}}_{C_i}} Q^i({\mathbf {o}}, a_i, {\hat{a}}_{C_i}), \end{aligned}$$
(5)

where \(C_i\) is the set of agents that agent i communicates with. Denote \(-i = \{1, \dots , i-1\} \bigcup \{i+1, \dots , N\}\). The FC and STAR communication structures take \(C_i=N(i)=-i\), while NBOR and TREE have \(C_i = N(i)\subset -i\), where \(N(i)\) is the set of neighbour agents of agent i. For the hierarchical topology, \(C_i=N(i)\bigcup NH(i) \subset -i\), where \(NH(i)\) is the set of agents reachable from agent i through intra-group communication. For example, in Fig. 4 of our manuscript, agent E can communicate with A through intra-group communication, so \(\{A\} \subset NH(E)\). On the one hand, taking \(C_i=-i\) often faces the dimensionality issue (i.e., as the number of agents increases, the complexity of learning the utility function increases exponentially [48]). On the other hand, an inappropriate choice of \(C_i\) leads to a loss of cooperation. To quantify how much utility an agent may potentially lose, we define the potential loss in lacking communication. Before that, we first define the potential expected utility as follows:

Definition 1

The potential expected utility of agent i is the maximum expected utility of agent i when it perfectly coordinates with its neighbors via communication:

$$\begin{aligned} PV_i({\mathbf {o}}, a_i, C_i)= \max _{a_{C_i}} Q^i({\mathbf {o}}, a_i, a_{C_i}), \end{aligned}$$
(6)

where \(Q^i({\mathbf {o}}, a_i, a_{C_i}) =\sum _{{\mathbf {o}}} \sum _{a_{-i\backslash C_i}} P({\mathbf {o}})P(a_{-i\backslash C_i} | {\mathbf {o}}, a_i, a_{C_i}) Q({\mathbf {o}}, {\mathbf {a}})\).

The probabilities \(P(a_{-i\backslash C_i} | {\mathbf {o}}, a_i, a_{C_i})\) and \(P({\mathbf {o}})\) can be estimated from experience data. The \(\max\) operation is taken over \(a_{C_i}\); therefore, this measure generally overestimates the expected utility that agent i can obtain when it communicates and coordinates with \(C_i\).
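
As a hedged numerical illustration of Eq. (6) (not code from the LSC repository), the sketch below fixes a single joint observation, draws a random joint utility table \(Q({\mathbf {o}}, {\mathbf {a}})\), assumes for simplicity a uniform estimate of \(P(a_{-i\backslash C_i} | {\mathbf {o}}, a_i, a_{C_i})\), and maximizes over \(a_{C_i}\); the helper name `potential_value` and the toy sizes are assumptions made only for illustration.

```python
import itertools
import numpy as np

n_agents, n_actions = 4, 2
rng = np.random.default_rng(0)
# Joint utility Q(o, a_1, ..., a_4) for a single fixed joint observation o.
Q = rng.uniform(0.0, 1.0, size=(n_actions,) * n_agents)


def potential_value(i, a_i, C_i):
    """PV_i(o, a_i, C_i): max over a_{C_i} of the expected Q over the actions
    of agents outside C_i (assumed uniform here; in general this conditional
    distribution would be estimated from experience data)."""
    comm = sorted(C_i)
    rest = [j for j in range(n_agents) if j != i and j not in C_i]
    best = float("-inf")
    for a_comm in itertools.product(range(n_actions), repeat=len(comm)):
        values = []
        for a_rest in itertools.product(range(n_actions), repeat=len(rest)):
            joint = [0] * n_agents
            joint[i] = a_i
            for j, a in zip(comm, a_comm):
                joint[j] = a
            for j, a in zip(rest, a_rest):
                joint[j] = a
            values.append(Q[tuple(joint)])
        best = max(best, float(np.mean(values)))
    return best


# Communicating with more agents can only increase the potential value.
print(potential_value(0, a_i=1, C_i=set()))        # no communication
print(potential_value(0, a_i=1, C_i={1}))          # one neighbour
print(potential_value(0, a_i=1, C_i={1, 2, 3}))    # all other agents (-i)
```

The printed values are non-decreasing in the size of \(C_i\), which anticipates Proposition 1 below.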

Proposition 1

If \(D_i\subset C_i \subset -i\), then \(PV_i({\mathbf {o}}, a_i, D_i) \le PV_i({\mathbf {o}}, a_i, C_i)\).

Thus \(C_i=-i\) yields the maximum potential expected utility. We next quantify the potential utility loss when an agent communicates with only a subset of the other agents while selecting its action.

Definition 2

The potential loss in lacking communication of agent i is the difference between the potential expected utility of agent i when it coordinates with all agents and when it coordinates only with its neighbors \(C_i\):

$$\begin{aligned} PL_i({\mathbf {o}}, C_i)= \max _{a_i} PV_i({\mathbf {o}}, a_i, -i) - \max _{a_i} PV_i({\mathbf {o}}, a_i, C_i) \end{aligned}$$
(7)

Similar definitions also appear in [48]. With Proposition 1, it can be inferred that for \(D_i\subset C_i\), \(PL_i({\mathbf {o}}, -i) \le PL_i({\mathbf {o}}, C_i) \le PL_i({\mathbf {o}}, D_i)\); that is, enlarging the communication set never increases the potential loss. Then, if NBOR (or TREE) and the hierarchical structure share the same neighbor set N(i), we have \(N(i)\subset N(i)\bigcup NH(i)\), so the hierarchical structure has a potential loss less than or equal to that of NBOR (or TREE). Thus, the hierarchical structure is a trade-off between the cooperation loss of NBOR and TREE and the curse of dimensionality of FC and STAR.
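
Continuing the hedged toy sketch from Definition 1 (and reusing its `potential_value`, `Q`, `n_agents`, and `n_actions`), the check below computes \(PL_i\) from Definition 2 for a NBOR/TREE-like set \(N(i)\), a hierarchical set \(N(i)\bigcup NH(i)\), and the full set \(-i\); the neighbour sets are made-up assumptions. The printed losses are non-increasing and reach zero at \(C_i=-i\), matching the ordering above.

```python
def potential_loss(i, C_i):
    """PL_i(o, C_i) = max_{a_i} PV_i(o, a_i, -i) - max_{a_i} PV_i(o, a_i, C_i)."""
    minus_i = set(range(n_agents)) - {i}
    pv_full = max(potential_value(i, a, minus_i) for a in range(n_actions))
    pv_comm = max(potential_value(i, a, C_i) for a in range(n_actions))
    return pv_full - pv_comm


N_i = {1}    # hypothetical neighbour set N(i) of agent 0
NH_i = {2}   # hypothetical intra-group reachable set NH(i) of agent 0

# Loss shrinks (weakly) as the communication set grows; it is zero at C_i = -i.
print(potential_loss(0, N_i))                         # NBOR / TREE-like set
print(potential_loss(0, N_i | NH_i))                  # hierarchical set
print(potential_loss(0, set(range(n_agents)) - {0}))  # FC / STAR: 0.0
```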

Furthermore, PL also indicates the importance of choosing a suitable \(C_i\). If \(PL_i({\mathbf {o}}, C_i)\) is large, restricting the communication set from \(-i\) to \(C_i\) incurs a large potential loss; this lowers the upper bound in Eq. 5 and leads the algorithm to a lower expected global utility. If \(PL_i({\mathbf {o}}, C_i) = 0\), we can replace \(Q^i({\mathbf {o}}, a_i, a_{-i})\) with \(Q^i({\mathbf {o}}, a_i, a_{C_i})\) without any loss of potential expected utility, and the complexity of learning is reduced by the smaller size of \(C_i\). Thus, learning \(C_i\) (i.e., communication structure learning) is critical for Q-learning: the communication structure is learned to obtain a higher global utility under the current policy, which benefits both communication structure learning and policy learning.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Sheng, J., Wang, X., Jin, B. et al. Learning structured communication for multi-agent reinforcement learning. Auton Agent Multi-Agent Syst 36, 50 (2022). https://doi.org/10.1007/s10458-022-09580-8
