Policy iteration based Q-learning for linear nonzero-sum quadratic differential games
- 8 Downloads
In this paper, a policy iteration-based Q-learning algorithm is proposed to solve infinite horizon linear nonzero-sum quadratic differential games with completely unknown dynamics. The Q-learning algorithm, which employs off-policy reinforcement learning (RL), can learn the Nash equilibrium and the corresponding value functions online, using the data sets generated by behavior policies. First, we prove equivalence between the proposed off-policy Q-learning algorithm and an offline PI algorithm by selecting specific initially admissible polices that can be learned online. Then, the convergence of the off-policy Q-learning algorithm is proved under a mild rank condition that can be easily met by injecting appropriate probing noises into behavior policies. The generated data sets can be repeatedly used during the learning process, which is computationally effective. The simulation results demonstrate the effectiveness of the proposed Q-learning algorithm.
Keywordsadaptive dynamic programming ADP Q-learning reinforcement learning RL linear nonzero-sum quadratic differential games policy iteration PI off-policy
This work was supported by National Natural Science Foundation of China (Grant No. 61203078) and the Key Project of Shenzhen Robotics Research Center NSFC (Grant No. U1613225).
- 4.Lin F H, Liu Q, Zhou X W, et al. Towards green for relay in InterPlaNetary Internet based on differential game model. Sci China Inf Sci, 2014, 57: 042306Google Scholar
- 10.Possieri C, Sassano M. An algebraic geometry approach for the computation of all linear feedback Nash equilibria in LQ differential games. In: Proceedings of the 54th IEEE Conference on Decision and Control, Osaka, 2015. 5197–3Google Scholar
- 11.Engwerda J C. LQ Dynamic Optimization and Differential Games. New York: Wiley, 2005Google Scholar
- 14.Werbos P J. Approximate dynamic programming for real-time control and neural modeling. In: Handbook of Intelligent Control. New York: Van Nostrand, 1992Google Scholar
- 16.Werbos P J. The elements of intelligence. Cybernetica, 1968, 11: 131Google Scholar
- 36.Vrabie D, Lewis F L. Integral reinforcement learning for online computation of feedback Nash strategies of nonzero-sum differential games. In: Proceedings of the 49th IEEE Conference on Decision and Control, Atlanta, 2010: 3066–3071Google Scholar
- 41.Bradtke S J, Ydstie B E, Barto A G. Adaptive linear quadratic control using policy iteration. In: Proceedings of American Control Conference, Baltimore, 1994. 3475–3Google Scholar