
Shaping multi-agent systems with gradient reinforcement learning

Autonomous Agents and Multi-Agent Systems (2007)

Abstract

An original reinforcement learning (RL) methodology is proposed for the design of multi-agent systems. In the realistic setting of situated agents with local perception, the task of automatically building a coordinated system is of crucial importance. To that end, we design simple reactive agents in a decentralized way, as independent learners. To cope with the difficulties RL faces in this framework, we have developed an incremental learning algorithm in which agents are confronted with a sequence of progressively more complex tasks. We illustrate this general framework with computer experiments in which agents must coordinate to reach a global goal.
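The abstract describes the method only at a high level. As a purely illustrative sketch of the general idea (decentralized independent learners, each with only local perception, trained by a policy-gradient rule on a sequence of tasks of increasing difficulty), the following Python snippet sets up a toy two-agent corridor task and a tabular REINFORCE-style learner. The task, the learning rule, and every name and parameter below are assumptions made for this sketch; they are not the authors' implementation or benchmarks.

```python
"""Minimal sketch (not the authors' code) of decentralized independent
learners trained by a policy-gradient rule on progressively harder tasks.
All environment and parameter choices are illustrative assumptions."""

import numpy as np


class CorridorTeamTask:
    """Toy cooperative task: two agents on a 1-D corridor must both stand
    on the rightmost cell to earn a shared reward. Each agent only
    perceives its own position (local perception)."""

    def __init__(self, length, horizon=50):
        self.length = length
        self.horizon = horizon

    def reset(self):
        self.pos = [0, 0]
        self.t = 0
        return tuple(self.pos)

    def step(self, actions):  # action 0 = stay, 1 = move right
        for i, a in enumerate(actions):
            if a == 1:
                self.pos[i] = min(self.pos[i] + 1, self.length - 1)
        self.t += 1
        success = all(p == self.length - 1 for p in self.pos)
        done = success or self.t >= self.horizon
        return tuple(self.pos), (1.0 if success else 0.0), done


class IndependentLearner:
    """Tabular softmax policy over local observations, updated by REINFORCE."""

    def __init__(self, n_obs, n_actions=2, lr=0.1):
        self.theta = np.zeros((n_obs, n_actions))
        self.lr = lr

    def policy(self, obs):
        z = self.theta[obs] - self.theta[obs].max()
        p = np.exp(z)
        return p / p.sum()

    def act(self, obs, rng):
        return rng.choice(self.theta.shape[1], p=self.policy(obs))

    def update(self, trajectory, ret):
        # REINFORCE: push log-probability of taken actions toward the return.
        for obs, action in trajectory:
            grad = -self.policy(obs)
            grad[action] += 1.0
            self.theta[obs] += self.lr * ret * grad


def train_incrementally(levels=(3, 5, 8), episodes_per_level=2000, seed=0):
    """Train the same pair of agents on corridors of increasing length,
    reusing the policies learned on easier tasks as the starting point."""
    rng = np.random.default_rng(seed)
    agents = [IndependentLearner(n_obs=max(levels)) for _ in range(2)]
    for length in levels:
        task = CorridorTeamTask(length)
        for _ in range(episodes_per_level):
            obs = task.reset()
            trajs = [[], []]
            ret, done = 0.0, False
            while not done:
                actions = [ag.act(o, rng) for ag, o in zip(agents, obs)]
                for i in range(2):
                    trajs[i].append((obs[i], actions[i]))
                obs, r, done = task.step(actions)
                ret += r
            for ag, tr in zip(agents, trajs):
                ag.update(tr, ret)
        print(f"corridor length {length}: training done")
    return agents


if __name__ == "__main__":
    train_incrementally()
```

The point of the sketch is the training schedule: the agents' policy parameters are carried over from one task level to the next, so behaviour learned on the easy tasks seeds learning on the harder ones, which is the shaping idea the abstract refers to.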


Abbreviations

RL: Reinforcement learning

MAS: Multi-agent system

MDP: Markov decision process

POMDP: Partially observable Markov decision process


Author information

Corresponding author

Correspondence to Olivier Buffet.

Additional information

This work was conducted in part at NICTA's Canberra laboratory.


About this article

Cite this article

Buffet, O., Dutech, A. & Charpillet, F. Shaping multi-agent systems with gradient reinforcement learning. Auton Agent Multi-Agent Syst 15, 197–220 (2007). https://doi.org/10.1007/s10458-006-9010-5

