Volume 2, Issue 2, pp 71–81

Cooperative Multi-Agent Reinforcement Learning for Multi-Component Robotic Systems: guidelines for future research

  • Manuel Graña
  • Borja Fernandez-Gauna
  • Jose Manuel Lopez-Guede
Review Article


Reinforcement Learning (RL) aims to develop algorithms that train an agent to achieve a goal optimally with minimal feedback about the desired behavior, which is not precisely specified. Scalar rewards are returned to the agent in response to its actions, endorsing or opposing them. RL algorithms have been successfully applied to robot control design. Extending the RL paradigm to the design of control systems for Multi-Component Robotic Systems (MCRS) poses new challenges, mainly related to the scaling up of complexity due to exponential state-space growth, coordination issues, and the propagation of rewards among agents. In this paper, we identify the main issues that offer opportunities to develop innovative solutions towards fully scalable cooperative multi-agent systems.
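The scalar-reward training loop the abstract describes can be sketched as a minimal tabular Q-learning agent. This is a standard single-agent formulation, not the paper's own method; the corridor environment, reward values, and hyperparameters below are illustrative assumptions.

```python
import random

random.seed(0)

# Minimal tabular Q-learning on a 1-D corridor: states 0..4, goal at state 4.
# All numbers (grid size, ALPHA, GAMMA, EPSILON) are illustrative choices.
N_STATES, GOAL = 5, 4
ACTIONS = [-1, +1]  # move left / move right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(state, action):
    """Environment: returns (next_state, scalar_reward, done)."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    if nxt == GOAL:
        return nxt, 1.0, True   # reward endorses reaching the goal
    return nxt, -0.01, False    # small cost opposes wasted moves

for episode in range(500):
    s, done = 0, False
    while not done:
        # Epsilon-greedy action selection: mostly exploit, sometimes explore.
        if random.random() < EPSILON:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s2, r, done = step(s, a)
        # Q-learning update: move Q toward reward plus discounted best future value.
        best_next = 0.0 if done else max(Q[(s2, act)] for act in ACTIONS)
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
        s = s2

# The greedy policy learned from scalar rewards alone: move right everywhere.
policy = {s: max(ACTIONS, key=lambda act: Q[(s, act)]) for s in range(GOAL)}
print(policy)
```

Note that the desired behavior ("walk to the goal") is never specified directly; it emerges from the reward signal. In the multi-agent MCRS setting discussed in the paper, the joint state and action spaces of such a table grow exponentially with the number of agents, which is precisely the scalability problem the survey addresses.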


Keywords: reinforcement learning · multi-component robotic systems · multi-agent systems





Copyright information

© Versita Warsaw and Springer-Verlag Wien 2011

Authors and Affiliations

  • Manuel Graña (1)
  • Borja Fernandez-Gauna (1)
  • Jose Manuel Lopez-Guede (1)

  1. Grupo de Inteligencia Computacional, Universidad Pais Vasco (UPV/EHU), Pais Vasco, Spain
