
Modeling and reinforcement learning in partially observable many-agent systems

Abstract

Multiagent reinforcement learning (MARL) methods that engage in centralized training are prevalent. These methods rely on all the agents sharing various types of information, such as their actions or gradients, with a centralized trainer or with each other during learning. Consequently, the methods produce agent policies whose prescriptions and performance are contingent on other agents engaging in the behavior assumed by the centralized training. But in many contexts, such as mixed or adversarial settings, this assumption may not hold. In this article, we present a new line of methods that relaxes this assumption and engages in decentralized training, resulting in each agent's individual policy. The interactive advantage actor-critic (IA2C) maintains and updates beliefs over other agents' candidate behaviors based on (noisy) observations, thus enabling learning at the agent's own level. We also address MARL's prohibitive curse of dimensionality due to the presence of many agents in the system. Under assumptions of action anonymity and population homogeneity, often exhibited in practice, large numbers of other agents can be modeled in aggregate by the count vectors of their actions instead of by individual agent models. More importantly, we may model the distribution of these vectors and its update using the Dirichlet-multinomial model, which offers an elegant way to scale IA2C to many-agent systems. We evaluate the performance of the fully decentralized IA2C along with other known baselines on a novel Organization domain, which we introduce, and on instances of two existing domains. Experimental comparisons with prominent and recent baselines show that IA2C is more sample efficient, more robust to noise, and can scale to learning in systems with up to a hundred agents.
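
The aggregate modeling idea can be made concrete with a short, hypothetical sketch (not the authors' released implementation; the uniform prior, the four-action setting, and the variable names are assumptions): a Dirichlet prior over the other agents' action distribution is conjugately updated with an observed action-count vector, and the Dirichlet-multinomial gives the predictive probability of a future count vector.

```python
import numpy as np
from scipy.special import gammaln

def update_dirichlet(alpha, observed_counts):
    """Conjugate update: posterior concentration = prior + observed counts."""
    return alpha + observed_counts

def dirichlet_multinomial_logpmf(counts, alpha):
    """Log probability of an action-count vector under the Dirichlet-multinomial."""
    n, a0 = counts.sum(), alpha.sum()
    return (gammaln(n + 1) + gammaln(a0) - gammaln(n + a0)
            + np.sum(gammaln(counts + alpha) - gammaln(counts + 1) - gammaln(alpha)))

# Illustration: 4 actions, 100 other agents, uniform prior over action proportions.
alpha = np.ones(4)
observed = np.array([40, 30, 20, 10])        # one (noisy) configuration of action counts
alpha = update_dirichlet(alpha, observed)    # posterior concentration parameters
print(dirichlet_multinomial_logpmf(np.array([38, 32, 19, 11]), alpha))
```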

Notes

  1. Game theory refers to this setup as private information monitoring [7], which is fundamentally different from centralized training where the action of each agent is effectively public information (shared) and monitored by the trainer.

  2. Parts of this work have appeared in the literature before [10, 11]. Compared to these earlier versions, in this article we introduce the open Organization domain, which is notably more realistic and challenging (more non-stationarity) than the previous closed version. We provide a new theoretical result, which upper bounds the error introduced by our main method of calculating the Dirichlet posterior (Proposition 2). In Sect. 5 on experiments, we include multi-agent proximal policy optimization (MAPPO) as an additional baseline method, given its prominence as a recent and widely used MARL approach. In addition to presenting results from the new open domains, we also include results from environments with the Gaussian noise model. This comparison aims to assess the generalizability of different approaches for updating the Dirichlet distribution to another common noise model.

  3. We can make the discretization finer, or let the financial health be its net revenue, which would make it continuous.

  4. Our implementation of IA2C in Python is released at https://github.com/khextendss/IA2C.

References

  1. Kaelbling, L. P., Littman, M. L., & Moore, A. W. (1996). Reinforcement learning: A survey. Journal of Artificial Intelligence Research (JAIR), 4(1), 237–285.

  2. Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, P., & Mordatch, I. (2017). Multi-agent actor-critic for mixed cooperative-competitive environments. In Neural information processing systems.

  3. Foerster, J., Farquhar, G., Afouras, T., Nardelli, N., & Whiteson, S. (2018). Counterfactual multi-agent policy gradients. In Association for the advancement of artificial intelligence.

  4. Yu, C., Velu, A., Vinitsky, E., Gao, J., Wang, Y., Bayen, A., & Wu, Y. (2022). The surprising effectiveness of PPO in cooperative multi-agent games. In Neural information processing systems (NeurIPS).

  5. Konda, V., & Tsitsiklis, J. (2000). Actor-critic algorithms. In Advances in neural information processing systems (Vol. 12, pp. 1008–1014).

  6. Shoham, Y., Powers, R., & Grenager, T. (2007). If multi-agent learning is the answer, what is the question? Artificial Intelligence, 171(7), 365–377. https://doi.org/10.1016/j.artint.2006.02.006

  7. Abreu, D., Pearce, D., & Stacchetti, E. (1990). Toward a theory of discounted repeated games with imperfect monitoring. Econometrica, 58(5), 1041–1063.

  8. Jovanovic, B., & Rosenthal, R. W. (1988). Anonymous sequential games. Journal of Mathematical Economics, 17(1), 77–87.

  9. Jiang, A., & Leyton-Brown, K. (2010). Bayesian action-graph games. In Neural information processing systems (NIPS).

  10. He, K., Banerjee, B., & Doshi, P. (2021). Cooperative-competitive reinforcement learning with history-dependent rewards. In Autonomous agents and multiagent systems (AAMAS).

  11. He, K., Doshi, P., & Banerjee, B. (2022). Reinforcement learning in many-agent settings under partial observability. In Uncertainty in artificial intelligence (UAI).

  12. Gmytrasiewicz, P., & Doshi, P. (2005). A framework for sequential planning in multi-agent settings. Journal of Artificial Intelligence Research (JAIR), 24, 49–79.

  13. Chandrasekaran, M., Eck, A., Doshi, P., & Soh, L. (2016). Individual planning in open and typed agent systems. In Uncertainty in artificial intelligence.

  14. Shoham, Y., & Leyton-Brown, K. (2008). Multiagent systems: Algorithmic, game-theoretic, and logical foundations. Cambridge University Press.

  15. Brandenburger, A., & Nalebuff, B. (1996). Co-opetition.

  16. Tsai, W. (2002). Social structure of “coopetition” within a multiunit organization: Coordination, competition, and intraorganizational knowledge sharing. Organization Science, 13, 179–190.

  17. Walley, K. (2007). Coopetition: An introduction to the subject and an agenda for research. International Studies of Management and Organization, 37, 11–31.

  18. Radulescu, R., Legrand, M., Efthymiadis, K., & Roijers, D. (2018). Deep multi-agent reinforcement learning in a homogeneous open population. Artificial Intelligence, 90–105.

  19. Mnih, V., Badia, A. P., Mirza, M., Graves, A., Harley, T., Lillicrap, T. P., Silver, D., & Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. In International conference on machine learning.

  20. Yang, Y., Luo, R., Li, M., Zhou, M., Zhang, W., & Wang, J. (2018). Mean field multi-agent reinforcement learning. In International conference on machine learning.

  21. Jiang, A. X., Leyton-Brown, K., & Bhat, N. A. R. (2011). Action-graph games. Games and Economic Behavior, 71(1), 141–173.

  22. Doshi, P., & Gmytrasiewicz, P. J. (2006). On the difficulty of achieving equilibrium in interactive POMDPs. In Proceedings of the 21st national conference on artificial intelligence (Vol. 2, pp. 1131–1136).

  23. Blei, D., Ng, A., & Jordan, M. (2002). Latent Dirichlet allocation. In Advances in neural information processing systems (Vol. 14).

  24. Zheng, L., Yang, J., Cai, H., Zhou, M., Zhang, W., Wang, J., & Yu, Y. (2018). MAgent: A many-agent reinforcement learning platform for artificial collective intelligence. In Association for the advancement of artificial intelligence (AAAI).

  25. Samvelyan, M., Rashid, T., Witt, C. S., Farquhar, G., Nardelli, N., Rudner, T. G. J., Hung, C.-M., Torr, P. H. S., Foerster, J. N., & Whiteson, S. (2019). The StarCraft multi-agent challenge. In Neural information processing systems (NeurIPS).

  26. Rashid, T., Samvelyan, M., Witt, C., Farquhar, G., Foerster, J., & Whiteson, S. (2018). QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. In International conference on machine learning.

  27. Wray, K., Kumar, A., & Zilberstein, S. (2018). Integrated cooperation and competition in multi-agent decision-making. In AAAI conference on artificial intelligence.

  28. Kleiman-Weiner, M., Ho, M., Austerweil, J., Littman, M., & Tenenbaum, J. (2016). Coordinate to cooperate or compete: Abstract goals and joint intentions in social interaction. In Conference of the cognitive science society.

  29. Foerster, J., Chen, R. Y., Al-Shedivat, M., Whiteson, S., Abbeel, P., & Mordatch, I. (2018). Learning with opponent-learning awareness. In International conference on autonomous agents and multiagent systems (pp. 122–130).

  30. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A., Veness, J., Bellemare, M., Graves, A., Riedmiller, M., Fidjeland, A., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., & Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518, 529–33. https://doi.org/10.1038/nature14236

  31. Tampuu, A., Matiisen, T., Kodelja, D., Kuzovkin, I., Korjus, K., Aru, J., Aru, J., & Vicente, R. (2017). Multi-agent cooperation and competition with deep reinforcement learning. PLoS ONE Journal, 12, e0172395.

  32. Jiang, J., & Lu, Z. (2022). I2Q: A fully decentralized Q-learning algorithm. In Neural information processing systems (NeurIPS).

  33. Sunehag, P., Lever, G., Gruslys, A., Czarnecki, W. M., Zambaldi, V., Jaderberg, M., Lanctot, M., Sonnerat, N., Leibo, J. Z., Tuyls, K., & Graepel, T. (2018). Value-decomposition networks for cooperative multi-agent learning based on team reward. In International conference on autonomous agents and multiagent systems (AAMAS '18) (pp. 2085–2087).

  34. Rashid, T., Farquhar, G., Peng, B., & Whiteson, S. (2020). Weighted QMIX: Expanding monotonic value function factorisation. In Advances in neural information processing systems (NeurIPS) (pp. 10199–10210).

  35. Ganapathi Subramanian, S., Taylor, M., Crowley, M., & Poupart, P. (2021). Partially observable mean field reinforcement learning. In Autonomous agents and multiagent systems (AAMAS) (pp. 537–545).

  36. Verma, T., Varakantham, P., & Lau, H. C. (2019). Markov games as a framework for multi-agent reinforcement learning. In International conference on automated planning and scheduling (ICAPS).

  37. Eck, A., Soh, L.-K., & Doshi, P. (2010). Decision making in open agent systems. AI Magazine. https://doi.org/10.1002/aaai.12131

  38. Eck, A., Shah, M., Doshi, P., & Soh, L.-K. (2020). Scalable decision-theoretic planning in open and typed multiagent systems. In Association for the advancement of artificial intelligence (AAAI).

  39. Rahman, A., Hopner, N., Christianos, F., & Albrecht, S. V. (2021). Towards open ad hoc teamwork using graph-based policy learning. In International conference on machine learning (ICML).

  40. Liu, I.-J., Jain, U., Yeh, R. A., & Schwing, A. G. (2021). Cooperative exploration for multi-agent deep reinforcement learning. In International conference on machine learning (ICML).

Acknowledgements

This work was supported in part by grants from NSF #IIS-1910037 and #IIS-2312657 (to PD) and UGA’s RIAS program for graduate students. We thank seminar participants at the Hebrew University of Jerusalem, University of Waterloo, Newcastle University, and Edinburgh University for their technical feedback during seminar presentations by PD. We also thank the anonymous reviewers for their valuable suggestions.

Author information

Contributions

KH conducted all experiments and wrote Sect. 2.5. PD and BB wrote Sects. 1, 6, and 7, and edited all sections. All authors proofread and reviewed the manuscript for correctness.

Corresponding author

Correspondence to Prashant Doshi.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1

1.1 The influence of history-dependent reward

In this subsection, we demonstrate the influence of history-dependent reward on policies in Org. Suppose that policy \(\pi _0\) leads to the reward sequence \(\{\beta r, \beta r, r, \beta r, \ldots \}\) for an agent, and policy \(\pi _1\) leads to the reward sequence \(\{\beta r, \beta r, \frac{1+\beta }{\alpha }r, \frac{1+\beta }{\alpha }r,\ldots \}\). For convenience, we set \(d = \frac{1+\beta }{\alpha }\). For horizon \(H = 4\), the total reward from performing policy \(\pi _0\) is:

$$\begin{aligned}&\beta r + (\phi \beta r + \beta r) + (\phi ^2 \beta r + \phi \beta r + r) + (\phi ^3 \beta r + \phi ^2 \beta r + \phi r + \beta r)\\&\quad = \phi ^3\beta r + 2\phi ^2 \beta r + 2\phi \beta r+ \phi r + 3\beta r + r \end{aligned}$$

The total reward from performing policy \(\pi _1\) is:

$$\begin{aligned}&\beta r + (\phi \beta r + \beta r) + (\phi ^2 \beta r + \phi \beta r + d r) + (\phi ^3 \beta r + \phi ^2 \beta r + \phi d r + d r)\\&\quad = \phi ^3\beta r + 2\phi ^2 \beta r + 2\phi \beta r + \phi dr + 2\beta r + 2d r \end{aligned}$$

The total rewards from these two policies can then be compared for varying choices of \(\beta\) and \(\phi\). A simple program checks whether \(\phi\) affects which of the two policies is optimal for horizon \(H \ge 4\) when d is set to \(\frac{9}{4}\). The result shows that for every horizon from 4 to 100, the history-dependent parameter \(\phi\) is always a deciding factor in determining the optimal policy. Figure 22 shows the total reward from policies \(\pi _0\) and \(\pi _1\) with varying \(\beta\) and \(\phi\) for a horizon of 4; notice that \(\phi\) determines where each surface has the higher total reward.
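
A minimal sketch of such a check is shown below (illustrative only; r = 1 and the grids over \(\beta\) and \(\phi\) are assumptions, not the values used for Fig. 22). It accumulates each policy's history-weighted return, where step t pays \(\sum _{s\le t}\phi ^{t-s}r_s\), and reports whether \(\phi\) changes which policy is better for H = 4:

```python
import numpy as np

def total_reward(base_rewards, phi):
    """Step t pays sum_{s<=t} phi^(t-s) * r_s; return the sum over all steps."""
    return sum(phi ** (t - s) * base_rewards[s]
               for t in range(len(base_rewards)) for s in range(t + 1))

r, d = 1.0, 9 / 4                                  # d = (1+beta)/alpha fixed at 9/4
for beta in np.linspace(0.0, 5.0, 26):             # assumed grid over beta
    winners = set()
    for phi in np.linspace(0.0, 1.0, 101):         # assumed grid over phi
        pi0 = [beta * r, beta * r, r, beta * r]    # reward sequence of policy pi_0
        pi1 = [beta * r, beta * r, d * r, d * r]   # reward sequence of policy pi_1
        winners.add('pi0' if total_reward(pi0, phi) > total_reward(pi1, phi) else 'pi1')
    if len(winners) > 1:
        print(f"beta={beta:.1f}: phi decides which policy is optimal")
```

(From the two expansions above, the difference in total reward for H = 4 is \(r\big (\phi (d-1) + 2d - 1 - \beta \big )\), so whether \(\phi\) flips the winner depends on the value of \(\beta\).)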

Fig. 22 The total reward from policies \(\pi _0\) and \(\pi _1\) with varying \(\beta\) and \(\phi\) for a horizon of 4

Appendix 2

1.1 Proof of Proposition 1

Claim The probability of error (which occurs when an observed configuration \(\omega _0'\) is different from the true configuration \({\mathcal {C}}^e\)) with the \(\epsilon\)-noise model is a decreasing function of N if

$$\begin{aligned} N>\frac{|A_i|}{\log \big (1/(1-\epsilon )\big )}. \end{aligned}$$

Proof

According to the \(\epsilon\)-noise model, \(P(a_j^o|a_k^e)\), where the subject agent observes action \(a_j^o\) from another agent when the latter executed action \(a_k^e\), is

$$\begin{aligned} P(a_j^o|a_k^e) = \left\{ \begin{array}{ll}1-\epsilon & \text {if } a_j^o=a_k^e\\ \frac{\epsilon }{|A_i|-1} & \text {otherwise}\end{array}\right. \end{aligned}$$
(27)

for some small \(\epsilon\). The effect of such noise from the private observation of an individual agent’s action can be aggregated over N agents in terms of \(\epsilon\) as follows. Suppose the observed configuration, \(\omega _0'\), is \({\mathcal {C}}^o=(\#a_1^o,\#a_2^o,\ldots ,\#a_{|A_i|}^o)\), and the true configuration is \({\mathcal {C}}^e=(\#a_1^e,\#a_2^e,\ldots ,\#a_{|A_i|}^e)\). Then the probability of an error in the observation of a configuration is

$$\begin{aligned} P(error)&=\sum _{{\mathcal {C}}^e}\sum _{{\mathcal {C}}^o\ne {\mathcal {C}}^e}P({\mathcal {C}}^o\wedge {\mathcal {C}}^e) \end{aligned}$$
(28)
$$\begin{aligned}&=\sum _{{\mathcal {C}}^e}\sum _{{\mathcal {C}}^o\ne {\mathcal {C}}^e}P({\mathcal {C}}^o|{\mathcal {C}}^e)P({\mathcal {C}}^e) \end{aligned}$$
(29)

where

$$\begin{aligned} P({\mathcal {C}}^e)&=\prod _{i}\rho _i^{\#a_i^e}, \quad \text {and} \\ P({\mathcal {C}}^o|{\mathcal {C}}^e)&=\prod _{(j,k)\in A\times A} P(a_j^o|a_k^e)^{n_{jk}} \\ \text {s.t. } &\left( \sum _jn_{jk}=\#a_k^e\right) \wedge \left( \sum _kn_{jk} = \#a_j^o\right) \end{aligned}$$
(30)

Let \(m^{oe}_i=\min \{\#a_i^o,\#a_i^e\}\). Then \(P({\mathcal {C}}^o|{\mathcal {C}}^e)\) can be maximized by setting the diagonal of the matrix \([n_{jk}]\) as \(n_{ii}=m^{oe}_i\), and distributing the remaining weight \(N-\sum _i m^{oe}_i\) to the off-diagonal positions while satisfying Eq. 30. This yields

$$\begin{aligned} P({\mathcal {C}}^o|{\mathcal {C}}^e) \le&(1-\epsilon )^{\sum _i m^{oe}_i}\left( \frac{\epsilon }{|A_i|-1}\right) ^{N-\sum _i m^{oe}_i} \end{aligned}$$
(31)
$$\begin{aligned} \le&(1-\epsilon )^{N-1}\left( \frac{\epsilon }{|A_i|-1}\right) \end{aligned}$$
(32)

where the last inequality uses \(\sum _i m^{oe}_i \le N-1\), which must hold because \({\mathcal {C}}^o\ne {\mathcal {C}}^e\), together with \(1-\epsilon \ge \frac{\epsilon }{|A_i|-1}\) for small \(\epsilon\). Furthermore, the number of solutions of Eq. 30 is \(\le \prod _i(m^{oe}_i +1)=O(N^{|A_i|})\). Hence

$$\begin{aligned} P(error)\le N^{|A_i|}(1-\epsilon )^{N-1}\left( \frac{\epsilon }{|A_i|-1}\right) \end{aligned}$$
(33)

The above is a decreasing function of N when \(N>\frac{|A_i|}{\log \big (1/(1-\epsilon )\big )}\). \(\square\)
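
As a quick numerical illustration of this bound (the values \(\epsilon = 0.05\) and \(|A_i| = 4\) below are arbitrary assumptions, not taken from the paper), one can evaluate the right-hand side of Eq. 33 and confirm that it decreases once N exceeds the stated threshold:

```python
import numpy as np

def error_bound(N, eps, num_actions):
    """Right-hand side of Eq. 33: N^{|A_i|} * (1-eps)^(N-1) * eps/(|A_i|-1)."""
    return N ** num_actions * (1 - eps) ** (N - 1) * eps / (num_actions - 1)

eps, num_actions = 0.05, 4                           # assumed values, for illustration
threshold = num_actions / np.log(1.0 / (1.0 - eps))
print(f"bound is decreasing for N > {threshold:.1f}")  # roughly 78 agents here

N = np.arange(2, 500, dtype=float)
vals = error_bound(N, eps, num_actions)
assert np.all(np.diff(vals[N > threshold]) < 0)      # strictly decreasing past the threshold
```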

1.2 Proof of Proposition 2

Claim Suppose the true configuration that \(\mathcal{C}'\) estimates via Eq. 23 is \(\mathcal{C}^*\). Then the estimation error \(\Vert \mathcal{C}'-\mathcal{C}^*\Vert\) is upper bounded by

$$\begin{aligned} \Vert \mathcal{C}'-\mathcal{C}^*\Vert \le \left( \frac{1}{1-\frac{\epsilon |A_i|}{|A_i|-1}}\right) \Vert \omega _0'-\omega _0^*\Vert \end{aligned}$$

where \(\omega _0^*\) is the LHS of Eq. 23 obtained by plugging \(\mathcal{C}^*\) into its RHS, and \(\Vert \cdot \Vert\) denotes the L1 norm.

Proof

The linear system of Eq. 23 has a coefficient matrix B with \((1-\epsilon )\) on its diagonal and \(\epsilon /(|A_i|-1)\) in the off-diagonal elements. Now, \(\omega _0^*=B\mathcal{C}^*\) and \(\omega _0'=B\mathcal{C}'\). A property of such a linear system is that

$$\begin{aligned} \frac{\Vert \mathcal{C}'-\mathcal{C}^*\Vert }{\Vert \mathcal{C}^*\Vert } \le \kappa (B)\frac{\Vert \omega _0'-\omega _0^*\Vert }{\Vert \omega _0^*\Vert }, \end{aligned}$$
(34)

where \(\kappa (B)\) is the condition number of B. Note that the symmetric matrix B has one eigenvalue equal to 1 and \(|A_i|-1\) eigenvalues equal to \(1-\epsilon -\frac{\epsilon }{|A_i|-1}\). This yields a condition number of \(\left( \frac{1}{1-\frac{\epsilon |A_i|}{|A_i|-1}}\right)\). The result is obtained by plugging the condition number into Eq. 34, and noting that \(\Vert \mathcal{C}^*\Vert =\Vert \omega _0^*\Vert =N\). \(\square\)

When the number of actions is large and \(\epsilon\) is small, the condition number is \(\kappa (B)\approx (1+\epsilon )\). This shows that for small \(\epsilon\), the rectified estimation is well-conditioned, i.e., the rectification error is not significantly larger than the observation error.
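
For a concrete check (again with assumed values \(\epsilon = 0.05\) and \(|A_i| = 4\)), one can build the matrix B described above and verify that its condition number matches the closed form \(\big (1-\frac{\epsilon |A_i|}{|A_i|-1}\big )^{-1}\):

```python
import numpy as np

eps, k = 0.05, 4                          # assumed epsilon and number of actions |A_i|
B = np.full((k, k), eps / (k - 1))        # off-diagonal entries: eps / (|A_i| - 1)
np.fill_diagonal(B, 1 - eps)              # diagonal entries: 1 - eps

kappa_numeric = np.linalg.cond(B)                  # 2-norm condition number of B
kappa_closed = 1.0 / (1.0 - eps * k / (k - 1))     # closed form from the proof
print(kappa_numeric, kappa_closed)                 # both ~1.0714 for these values
assert np.isclose(kappa_numeric, kappa_closed)
```

Increasing k while shrinking \(\epsilon\) in this sketch shows \(\kappa (B)\) approaching \(1+\epsilon\), consistent with the remark above.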

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

He, K., Doshi, P. & Banerjee, B. Modeling and reinforcement learning in partially observable many-agent systems. Auton Agent Multi-Agent Syst 38, 12 (2024). https://doi.org/10.1007/s10458-024-09640-1

  • DOI: https://doi.org/10.1007/s10458-024-09640-1
