Cooperative Multi-agent Policy Gradient

  • Guillaume Bono
  • Jilles Steeve Dibangoye
  • Laëtitia Matignon
  • Florian Pereyron
  • Olivier Simonin
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11051)


Reinforcement Learning (RL) for decentralized partially observable Markov decision processes (Dec-POMDPs) is lagging behind the spectacular breakthroughs of single-agent RL. This is because assumptions that hold in single-agent settings often no longer hold in decentralized multi-agent systems. To tackle this issue, we investigate the foundations of policy gradient methods within the centralized training for decentralized control (CTDC) paradigm. In this paradigm, learning can be accomplished in a centralized manner while execution remains independent. Using this insight, we establish a policy gradient theorem and compatible function approximations for decentralized multi-agent systems. The resulting actor-critic methods preserve decentralized control at the execution phase, but can also estimate the policy gradient from collective experiences guided by a centralized critic at the training phase. Experiments demonstrate that our policy gradient methods compare favorably against standard RL techniques on benchmarks from the literature. Code related to this paper is available at:


Decentralized control, Partial observable Markov decision processes, Multi-agent systems, Actor critic

Supplementary material

Supplementary material 1: 478880_1_En_28_MOESM1_ESM.pdf (PDF, 1340 KB)



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Univ Lyon, INSA Lyon, INRIA, CITI, Villeurbanne, France
  2. Univ Lyon, Université Lyon 1, LIRIS, CNRS, UMR5205, Villeurbanne, France
  3. Volvo Group, Advanced Technology and Research, Saint-Priest, France
