Abstract
Two multi-agent policy iteration learning algorithms are proposed in this work. Both use the exponential moving average approach together with the Q-learning algorithm to update the learning agent's policy so that it converges to a Nash equilibrium policy. The first algorithm uses a constant learning rate when updating the agent's policy, while the second uses two decaying learning rates that are adjusted by either the Win-or-Learn-Fast (WoLF) mechanism or the Win-or-Learn-Slow (WoLS) mechanism. The WoLS mechanism, introduced in this article, makes the algorithm learn fast when it is winning and slowly when it is losing. The second algorithm uses the rewards received by the learning agent to decide which mechanism (WoLF or WoLS) to apply to the game being learned. Both algorithms are analyzed theoretically, and a mathematical proof of convergence to a pure Nash equilibrium is provided for each. For games with a mixed Nash equilibrium, our analysis shows that the second algorithm converges to an equilibrium; although the analysis does not explicitly establish that this equilibrium is a Nash equilibrium, our simulation results indicate that it is. The proposed algorithms are examined on a variety of matrix and stochastic games. Simulation results show that the second algorithm converges in a wider variety of situations than state-of-the-art multi-agent reinforcement learning algorithms.
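The update scheme described above can be illustrated with a minimal sketch: the policy is nudged toward the greedy policy implied by the current Q-values via an exponential moving average, and the step size is switched between two decaying-rate schedules. All function names, default rates, and the winning test (comparing average reward to a baseline) are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def ema_policy_update(pi, q, eta):
    """One exponential-moving-average policy step: move the policy
    distribution `pi` toward the greedy policy implied by the current
    Q-values `q` with step size `eta` (sketch of the constant-rate
    variant; the exact rule in the paper may differ)."""
    greedy = np.zeros_like(pi)
    greedy[np.argmax(q)] = 1.0
    pi = (1.0 - eta) * pi + eta * greedy
    return pi / pi.sum()  # guard against numerical drift

def wolf_eta(avg_reward, baseline, eta_win=0.05, eta_lose=0.2):
    """Win-or-Learn-Fast: small step when winning, large when losing.
    `baseline` stands in for whatever winning criterion is used."""
    return eta_win if avg_reward >= baseline else eta_lose

def wols_eta(avg_reward, baseline, eta_win=0.2, eta_lose=0.05):
    """Win-or-Learn-Slow: the reverse schedule, learning fast
    while winning and slowly while losing."""
    return eta_win if avg_reward >= baseline else eta_lose
```

Repeatedly applying `ema_policy_update` with a fixed `eta` drives the policy toward the pure greedy policy geometrically, which is the intuition behind the pure-equilibrium convergence proof; in the second algorithm the step size would instead come from `wolf_eta` or `wols_eta`, chosen per game from the observed rewards.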
Cite this article
Awheda, M.D., Schwartz, H.M. Exponential moving average based multiagent reinforcement learning algorithms. Artif Intell Rev 45, 299–332 (2016). https://doi.org/10.1007/s10462-015-9447-5