Online adaptive Q-learning method for fully cooperative linear quadratic dynamic games

Abstract

A model-based offline policy iteration (PI) algorithm and a model-free online Q-learning algorithm are proposed for solving fully cooperative linear quadratic dynamic games. The PI-based adaptive Q-learning method learns the feedback Nash equilibrium online from state samples generated by behavior policies, without querying the system model. Unlike existing Q-learning methods, the proposed algorithm performs both policy evaluation and policy improvement adaptively. We prove convergence of the offline PI algorithm by showing that it is equivalent to Newton’s method applied to the game algebraic Riccati equation (GARE). Furthermore, we prove that the proposed Q-learning method converges to the Nash equilibrium for a sufficiently small learning rate, provided that certain persistence of excitation conditions hold; these conditions can be easily met by suitable behavior policies. Simulation results demonstrate the good performance of the proposed online adaptive Q-learning algorithm.
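To make the offline, model-based PI idea concrete, the following is a minimal sketch and not the authors' algorithm. It assumes a discrete-time system x[k+1] = A x[k] + B1 u1[k] + B2 u2[k] in which both players minimize one common quadratic cost, so that stacking the inputs turns the game into a single Riccati problem. The matrices, the two-player setup, the zero initial gain, and all function names are hypothetical choices made for illustration. Each iteration evaluates the current gain through a Lyapunov equation and then improves it, which is the classical Newton-type iteration on the Riccati equation.

```python
import numpy as np

def evaluate_policy(A, B, Q, R, K):
    """Policy evaluation: solve P = Acl' P Acl + Q + K' R K with Acl = A - B K."""
    Acl = A - B @ K
    M = Q + K.T @ R @ K
    n = A.shape[0]
    # Vectorized discrete Lyapunov equation: (I - Acl' (x) Acl') vec(P) = vec(M)
    lhs = np.eye(n * n) - np.kron(Acl.T, Acl.T)
    P = np.linalg.solve(lhs, M.reshape(-1)).reshape(n, n)
    return 0.5 * (P + P.T)  # symmetrize against round-off

def improve_policy(A, B, R, P):
    """Policy improvement: K = (R + B' P B)^{-1} B' P A."""
    return np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

def policy_iteration(A, B, Q, R, K0, iters=20):
    """Alternate evaluation and improvement, starting from a stabilizing gain K0."""
    K = K0
    for _ in range(iters):
        P = evaluate_policy(A, B, Q, R, K)
        K = improve_policy(A, B, R, P)
    return P, K

def riccati_residual(A, B, Q, R, P):
    """Residual of the Riccati equation for the stacked (assumed) system."""
    G = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return A.T @ P @ A - P + Q - A.T @ P @ B @ G

# Hypothetical two-player example (all values are assumptions).
A = np.array([[0.9, 0.2], [0.0, 0.8]])   # open-loop stable, so K0 = 0 is stabilizing
B1 = np.array([[0.0], [1.0]])            # input matrix of player 1
B2 = np.array([[1.0], [0.0]])            # input matrix of player 2
B = np.hstack([B1, B2])                  # stacked inputs of the cooperating players
Q = np.eye(2)                            # shared state weight
R = np.diag([1.0, 2.0])                  # per-player control weights, block diagonal
K0 = np.zeros((2, 2))

P, K = policy_iteration(A, B, Q, R, K0)
K1, K2 = K[:1, :], K[1:, :]              # each player's feedback gain, u_i = -K_i x
print("Riccati residual norm:", np.linalg.norm(riccati_residual(A, B, Q, R, P)))
```

The row split of `K` recovers one feedback gain per player under the stacking assumption. A model-free variant would replace the Lyapunov solve with an estimate built from measured state and input samples, which this sketch does not attempt.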

Acknowledgements

This work was supported by the Key Program of the National Natural Science Foundation of China (Grant No. U1613225).

Author information

Corresponding author

Correspondence to Zhihong Peng.

About this article

Cite this article

Li, X., Peng, Z., Jiao, L. et al. Online adaptive Q-learning method for fully cooperative linear quadratic dynamic games. Sci. China Inf. Sci. 62, 222201 (2019). https://doi.org/10.1007/s11432-018-9865-9

Keywords

  • adaptive dynamic programming
  • reinforcement learning
  • Q-learning
  • fully cooperative linear quadratic dynamic games
  • policy iteration
  • off-policy