
Online adaptive Q-learning method for fully cooperative linear quadratic dynamic games

Abstract

A model-based offline policy iteration (PI) algorithm and a model-free online Q-learning algorithm are proposed for solving fully cooperative linear quadratic dynamic games. The PI-based adaptive Q-learning method can learn the feedback Nash equilibrium online from state samples generated by behavior policies, without querying the system model. Unlike existing Q-learning methods, the proposed algorithm performs both policy evaluation and policy improvement in an adaptive manner. We prove the convergence of the offline PI algorithm by showing its equivalence to Newton's method for solving the game algebraic Riccati equation (GARE). Furthermore, we prove that the proposed Q-learning method converges to the Nash equilibrium under a small learning rate, provided certain persistence-of-excitation conditions are satisfied, which can easily be met by suitable behavior policies. Simulation results demonstrate the good performance of the proposed online adaptive Q-learning algorithm.
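To make the offline PI building block of the abstract concrete, the following is a minimal, hypothetical sketch (not the paper's exact formulation) of model-based policy iteration for a fully cooperative LQ game. It assumes discrete-time dynamics x_{k+1} = A x_k + B_1 u_{1,k} + B_2 u_{2,k} and a single quadratic cost shared by all players, so that the stacked-input problem reduces to a standard LQR; the paper's precise problem setup, notation, and GARE may differ. Each iteration evaluates the current stacked gain by solving a Lyapunov equation and then improves it in closed form; convergence of this iteration to the Riccati solution is the Newton's-method equivalence mentioned above. The model-free online Q-learning algorithm replaces these model-based steps with quantities estimated from state samples generated by behavior policies.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov, block_diag

def cooperative_policy_iteration(A, B_list, Q, R_list, K0, iters=100, tol=1e-10):
    """Offline, model-based PI sketch: alternate policy evaluation (a discrete
    Lyapunov equation) with policy improvement on the stacked feedback gain."""
    B = np.hstack(B_list)             # stack all players' input matrices
    R = block_diag(*R_list)           # shared cost weights every player's input
    K = np.asarray(K0, dtype=float)   # K0 must make A - B @ K0 Schur stable
    for _ in range(iters):
        A_cl = A - B @ K
        # Policy evaluation: P = Q + K' R K + A_cl' P A_cl
        P = solve_discrete_lyapunov(A_cl.T, Q + K.T @ R @ K)
        # Policy improvement: K_next = (R + B' P B)^{-1} B' P A
        K_next = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        if np.linalg.norm(K_next - K) < tol:
            return P, K_next
        K = K_next
    return P, K

# Toy example: two players jointly regulating a stable 2-state system,
# so the zero gain K0 = 0 is an admissible initial (stabilizing) policy.
A = np.array([[0.9, 0.1],
              [0.0, 0.8]])
B1 = np.array([[0.0], [0.1]])
B2 = np.array([[0.1], [0.0]])
Q = np.eye(2)
R1 = np.array([[1.0]])
R2 = np.array([[2.0]])

P, K = cooperative_policy_iteration(A, [B1, B2], Q, [R1, R2], K0=np.zeros((2, 2)))
print("Approximate GARE solution P:\n", P)
print("Stacked equilibrium gains [K1; K2]:\n", K)
```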



Acknowledgements

This work was supported by the Key Program of the National Natural Science Foundation of China (Grant No. U1613225).

Author information

Correspondence: Zhihong Peng.


About this article


Cite this article

Li, X., Peng, Z., Jiao, L. et al. Online adaptive Q-learning method for fully cooperative linear quadratic dynamic games. Sci. China Inf. Sci. 62, 222201 (2019). https://doi.org/10.1007/s11432-018-9865-9


Keywords

  • adaptive dynamic programming
  • reinforcement learning
  • Q-learning
  • fully cooperative linear quadratic dynamic games
  • policy iteration
  • off-policy