Policy iteration based Q-learning for linear nonzero-sum quadratic differential games

Abstract

In this paper, a policy iteration (PI)-based Q-learning algorithm is proposed to solve infinite-horizon linear nonzero-sum quadratic differential games with completely unknown dynamics. The Q-learning algorithm, which employs off-policy reinforcement learning (RL), can learn the Nash equilibrium and the corresponding value functions online, using data sets generated by behavior policies. First, we prove the equivalence between the proposed off-policy Q-learning algorithm and an offline PI algorithm by selecting specific initially admissible policies that can be learned online. Then, the convergence of the off-policy Q-learning algorithm is proved under a mild rank condition that can be easily met by injecting appropriate probing noises into the behavior policies. The generated data sets can be used repeatedly during the learning process, which makes the algorithm computationally efficient. The simulation results demonstrate the effectiveness of the proposed Q-learning algorithm.
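
To make the setting concrete, here is a minimal sketch of the standard two-player linear nonzero-sum quadratic game and of the offline PI iteration that the proposed Q-learning algorithm emulates; the notation ($A$, $B_i$, $Q_i$, $R_{ij}$, $P_i$, $K_i$) is illustrative and may differ from the paper's. The dynamics and the cost of player $i$ are

\[
\dot{x} = A x + B_1 u_1 + B_2 u_2, \qquad
J_i = \int_0^{\infty} \Bigl( x^\top Q_i x + \sum_{j=1}^{2} u_j^\top R_{ij} u_j \Bigr)\, \mathrm{d}t, \quad i = 1, 2,
\]

with each player applying a stabilizing linear feedback $u_i = -K_i x$. Writing $A_k = A - B_1 K_1^{(k)} - B_2 K_2^{(k)}$ for the closed-loop matrix at iteration $k$, offline model-based PI alternates policy evaluation (a pair of coupled Lyapunov equations) and policy improvement:

\[
A_k^\top P_i^{(k)} + P_i^{(k)} A_k + Q_i + \sum_{j=1}^{2} \bigl(K_j^{(k)}\bigr)^\top R_{ij} K_j^{(k)} = 0,
\qquad
K_i^{(k+1)} = R_{ii}^{-1} B_i^\top P_i^{(k)}.
\]

The point of the off-policy Q-learning algorithm is to carry out these two steps from measured state and input data alone, so the model matrices $A$ and $B_i$ are never required.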

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant No. 61203078) and the Key Project of Shenzhen Robotics Research Center NSFC (Grant No. U1613225).

Author information

Corresponding author

Correspondence to Zhihong Peng.

About this article

Cite this article

Li, X., Peng, Z., Liang, L. et al. Policy iteration based Q-learning for linear nonzero-sum quadratic differential games. Sci. China Inf. Sci. 62, 52204 (2019). https://doi.org/10.1007/s11432-018-9602-1

Keywords

  • adaptive dynamic programming (ADP)
  • Q-learning
  • reinforcement learning (RL)
  • linear nonzero-sum quadratic differential games
  • policy iteration (PI)
  • off-policy