Abstract
This paper addresses the finite-horizon two-player zero-sum game for continuous-time nonlinear systems by defining a novel Z-function and proposing a completely model-free reinforcement learning (RL) method with a reduced-dimension basis. First, a model-based RL policy iteration framework is proposed to reduce the order of the Hamilton–Jacobi–Isaacs (HJI) equation and to strengthen the anti-interference capability and efficiency; this provides the basic framework for the model-free algorithms. A partially model-free algorithm is then developed by applying integral RL and iterative learning control techniques, which further simplifies the solution procedure and removes the need for system dynamics in the value function update; an integral Bellman equation is employed. The value function of the HJI equation is evaluated by a critic neural network with time-varying weights and state-dependent basis functions. To realize completely model-free learning, a novel Z-function is finally defined, and a completely model-free algorithm is proposed that also removes the need for system dynamics in the input update. Convergence and stability analysis is provided, and simulation results verify the validity of the algorithm.
Data Availability Statement
Data sharing is not applicable to this article as no datasets were generated or analyzed during the current study.
Acknowledgements
This work is supported by the National Natural Science Foundation of China under Grant 61773260 and the National Key R&D Program of China under Grant 2018YFB1305902.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Appendices
A Proof of Theorem 1
Proof
Optimality. Along the system trajectories of (1), we have
Substituting (66) into (8) gives
Based on (10) and processing the Hamiltonian function (8) by completing the squares, one has
and \(J\left( x_{0},u,w \right) \) in (2) can be represented as
Hence, the Nash equilibrium condition (5) is met when the input policy is adopted as \(u=u^*,w=w^*\), and the corresponding cost function is \(V^*\left( x(t_0),t_0 \right) \). \(\square \)
B Proof of Theorem 2
Proof
Along the trajectory \({\dot{x}}=f+gu^{j+1}+kw^{j+1}\), the value function \(V^j\) satisfies
and similarly for \(V^{j+1}\), one has
According to the Bellman equation (11), one has
Now, we prove the convergence of \(V^{j}\) and optimality of the solutions obtained by Algorithm 1. Consider
Since \(V^j\left( x\left( t_{f}\right) ,t_{f}\right) =V^{j+1}\left( x\left( t_{f}\right) ,t_{f}\right) =\psi \left( x\left( t_{f} \right) ,t_{f}\right) \), subtracting (74) from (73) gives
Then, along the trajectory \({\dot{x}}=f+gu^{j+1}+kw^{j+1}\), substituting (70) and (71) into (75) gives
According to (72), equivalently one has
Substituting (72) and (77) into (76) gives
Considering (12), we have
Then, substituting (79) into (78) gives
From (69) in Theorem 1 and (80), we conclude that as \(j\rightarrow \infty \), the cost function \(V^{j+1}(x_0,t_0)\) is monotonically decreasing in \(u\), monotonically increasing in \(w\), and bounded. Thus, \(V^{j+1}\left( x_0,t_0\right) \) converges, and when \(j\rightarrow \infty \) we have \(V^{j+1}=V^{j}\). Accordingly,
Substituting (81) into (11) gives
Considering (12), (82) is essentially equivalent to (10). Hence, as \(j\rightarrow \infty \), \(V^{j}\) becomes the solution of (10). This completes the proof. \(\square \)
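The policy iteration whose convergence is established above can be illustrated on the linear-quadratic special case, where policy evaluation reduces to a Lyapunov equation and the iterates converge to the solution of the game algebraic Riccati equation. The sketch below is a minimal infinite-horizon analogue; the matrices \(A, B, K, Q, R\) and \(\gamma \) are hypothetical stand-ins chosen only for illustration, not values from the paper:

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# Hypothetical LQ zero-sum game (infinite-horizon analogue of the paper's
# setting): dx/dt = Ax + Bu + Kw, cost = integral of x'Qx + u'Ru - g^2 w'w.
A = np.array([[0.0, 1.0], [-1.0, -0.5]])
B = np.array([[0.0], [1.0]])
K = np.array([[0.0], [0.5]])
Q = np.eye(2)
R = np.eye(1)
gamma = 2.0

P = np.zeros((2, 2))  # initial value-function guess V^0(x) = x'Px
for j in range(50):
    Lu = np.linalg.solve(R, B.T @ P)   # minimizing player: u = -Lu x
    Lw = (K.T @ P) / gamma**2          # maximizing player: w = +Lw x
    Acl = A - B @ Lu + K @ Lw
    # Policy evaluation (Lyapunov counterpart of the Bellman equation):
    # Acl' P_new + P_new Acl + Q + Lu'R Lu - g^2 Lw'Lw = 0
    Qcl = Q + Lu.T @ R @ Lu - gamma**2 * Lw.T @ Lw
    P_new = solve_continuous_lyapunov(Acl.T, -Qcl)
    if np.linalg.norm(P_new - P) < 1e-10:
        break
    P = P_new

# The converged P should satisfy the game algebraic Riccati equation
res = (A.T @ P + P @ A + Q
       - P @ B @ np.linalg.solve(R, B.T) @ P
       + P @ K @ K.T @ P / gamma**2)
print(np.linalg.norm(res))  # small residual, approximately 0
```

Each pass evaluates the current pair of policies and then improves both players, mirroring the monotone convergence of \(V^{j}\) argued in the proof.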
C Proof of Theorem 3
Proof
Denote
where \(x(t)\) is the corresponding state trajectory under the input \(u^j\), and \(j\) remains unchanged during the whole inner-loop iteration.
When \(t=t_f\), we know from (30) that
Applying (84) at iteration \(i\) and subtracting it from (84) at iteration \(i+1\) gives
When \(t\in [t_0,t_f)\), substituting (83) into (30) gives
Let \(m_1=\zeta ^j\zeta ^{jT}\), \(m_2=\zeta ^j \sigma ^{\mathrm {T}}\left( x\left( t+T\right) \right) \), then \({\hat{W}}_{j}^{i+1}\left( t \right) \) in (86) can be obtained by
Applying (87) at iteration \(i\) and subtracting it from (87) at iteration \(i+1\) shows that
Therefore, we can obtain
According to (85) and (89), it can be proved by mathematical induction that
which gives
Denote \(t=t_f-(n-1)T\), \(n\ge 1\).

1) When \(n=1\), i.e., \(t=t_f\): as noted, \(\sigma \left( x\left( t_f \right) \right) \sigma ^{\mathrm {T}}\left( x\left( t_f \right) \right) \) is positive definite and symmetric, hence its eigenvalues are greater than zero. Denote \(T_{ad}=I-c_2 \sigma \left( x\left( t_f \right) \right) \sigma ^{\mathrm {T}}\left( x\left( t_f \right) \right) \) and \(x_w={\hat{W}}_{j}^{i}\left( t_f \right) -{\hat{W}}_{j}^{i-1}\left( t_f \right) \). Then (85) becomes

$$\begin{aligned} \left\| {\hat{W}}_{j}^{i+1}\left( t_f \right) -{\hat{W}}_{j}^{i}\left( t_f \right) \right\| = \left\| T_{ad}x_w \right\| =\sqrt{x_w^{\mathrm {T}}T_{ad}^{\mathrm {T}}T_{ad}x_w}. \end{aligned}$$ (92)

Since \(c_2>0\) can be selected to make the eigenvalues of \(T_{ad}\) belong to \([0,1)\), \(T_{ad}\) can be orthogonally diagonalized as \(T_{ad}=M_{ad}^{\mathrm {T}}D_{ad}M_{ad}\) with an orthogonal matrix \(M_{ad}\) and a diagonal matrix \(D_{ad}\) whose entries are the eigenvalues of \(T_{ad}\), all in \([0,1)\). Let \(\varepsilon \) be the maximum of these eigenvalues, so that \(\varepsilon \in (0,1)\). Therefore,

$$\begin{aligned} D_{ad}^2 \le \varepsilon ^2 I, \qquad M_{ad}^{\mathrm {T}}M_{ad}=I, \end{aligned}$$ (93)

and hence

$$\begin{aligned} \left\| {\hat{W}}_{j}^{i+1}\left( t_f \right) -{\hat{W}}_{j}^{i}\left( t_f \right) \right\|&= \sqrt{x_w^{\mathrm {T}}T_{ad}^{\mathrm {T}}T_{ad}x_w} =\sqrt{x_w^{\mathrm {T}}M_{ad}^{\mathrm {T}}D_{ad}^2 M_{ad}x_w}\\&\le \varepsilon \sqrt{x_w^{\mathrm {T}}M_{ad}^{\mathrm {T}}M_{ad}x_w} =\varepsilon \sqrt{x_w^{\mathrm {T}}x_w} = \varepsilon \left\| {\hat{W}}_{j}^{i}\left( t_f \right) -{\hat{W}}_{j}^{i-1}\left( t_f \right) \right\| . \end{aligned}$$ (94)

Therefore,

$$\begin{aligned} \lim _{i \rightarrow \infty }\left\| {\hat{W}}_{j}^{i+1}\left( t_f \right) -{\hat{W}}_{j}^{i}\left( t_f \right) \right\| =0. \end{aligned}$$ (95)

2) When \(n=k\), i.e., \(t=t_f-(k-1)T\), \(k\ge 1\), assume (91) holds for such \(t\), that is,

$$\begin{aligned} \lim _{i \rightarrow \infty }\left\| {\hat{W}}_{j}^{i+1}\left( t_f+T-kT \right) -{\hat{W}}_{j}^{i}\left( t_f+T-kT \right) \right\| =0. \end{aligned}$$ (96)

3) When \(n=k+1\), i.e., \(t=t_f-kT\), \(k\ge 1\), we know from (89) and (96) that

$$\begin{aligned} \left\| {\hat{W}}_{j}^{i+1}\left( t \right) -{\hat{W}}_{j}^{i}\left( t \right) \right\| \le \left\| (I-c_1 m_1)\left( {\hat{W}}_{j}^{i}\left( t \right) -{\hat{W}}_{j}^{i-1}\left( t \right) \right) \right\| . \end{aligned}$$ (97)

Similar to the manipulations (92)–(95), (97) becomes

$$\begin{aligned} \left\| {\hat{W}}_{j}^{i+1}\left( t \right) -{\hat{W}}_{j}^{i}\left( t \right) \right\| \le \varepsilon \left\| {\hat{W}}_{j}^{i}\left( t \right) -{\hat{W}}_{j}^{i-1}\left( t \right) \right\| . \end{aligned}$$ (98)

Therefore, \(\left\| {\hat{W}}_{j}^{i+1}\left( t \right) -{\hat{W}}_{j}^{i}\left( t \right) \right\| \) converges to 0 as \(i\rightarrow \infty \); that is, \({\hat{W}}_{j}^{i+1}\left( t \right) \) converges to an optimal solution \({\hat{W}}_{j}^{*}(t)\). The overall residual error \(E\) also goes to 0, so \(e_1^j(t)\) and \(e_2^j(t_f)\) go to 0. Hence, \({\hat{W}}_{j}^{*}(t)\) gives the optimal solution \(V^{j}\) in (17) and (18).
\(\square \)
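The contraction argument of (92)–(98) can be checked numerically: for any symmetric positive-definite Gram matrix and a gain chosen so the eigenvalues of the update matrix lie in \([0,1)\), successive weight differences shrink geometrically. A small sketch, with a hypothetical random Gram matrix standing in for \(\sigma \sigma ^{\mathrm {T}}\):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical symmetric positive-definite Gram matrix standing in for
# sigma(x(t_f)) sigma'(x(t_f)) in (85)
M = rng.standard_normal((4, 4))
Sigma = M @ M.T + 0.1 * np.eye(4)

# Pick c2 so the eigenvalues of T_ad = I - c2*Sigma lie in [0, 1)
c2 = 1.0 / np.linalg.eigvalsh(Sigma).max()
T_ad = np.eye(4) - c2 * Sigma
eps = np.abs(np.linalg.eigvalsh(T_ad)).max()  # the epsilon of (93)-(94)

# Successive weight differences contract: ||T_ad x|| <= eps * ||x||
x = rng.standard_normal(4)  # stands in for W^i - W^{i-1}
norms = [np.linalg.norm(x)]
for _ in range(30):
    x = T_ad @ x
    norms.append(np.linalg.norm(x))
assert all(n1 <= eps * n0 + 1e-12 for n0, n1 in zip(norms, norms[1:]))
```

Because \(T_{ad}\) is symmetric, its spectral norm equals its largest eigenvalue magnitude, which is exactly the \(\varepsilon < 1\) used in the proof.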
D Proof of Theorem 4
Proof
The Z-function for \(V^j\) at the time instant \(t\) can be expressed as
Since \(T\) is small and \(r\left( x,u,w \right) =Q(x)+u^{\mathrm {T}}Ru-\gamma ^2w^{\mathrm {T}}w\), (99) can be evaluated as
In (100), the current state \(x(t)\) at the instant \(t\) and the value function \(V^j\) are given; the variables remaining to be optimized are \(u(t)\) and \(w(t)\). Therefore, (100) can be decomposed into three parts:
In (101), \(Z^j_c\) is the constant term, independent of \(u(t)\) and \(w(t)\); \(Z^j_{o_1}\) is the first-order term, linear in \(u(t)\) and \(w(t)\); and \(Z^j_{o_2}\) is the second-order term, composed of quadratic terms in \(u(t)\) and in \(w(t)\). Note that (101) explains why the basis of \(Z^j\) is selected as \(\sigma _a \left( u,w \right) \) in (39).
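Because of this constant-linear-quadratic structure, identifying the Z-function parameters amounts to an ordinary least-squares fit over the basis \([1,u,w,u^2,w^2]\); probing with test signals recovers the coefficients exactly in the noise-free case. A minimal scalar-input sketch, with hypothetical "true" coefficients chosen only to exercise the fit:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical "true" Z-function with the structure of (101), scalar u and w:
# Z(u, w) = c0 + a*u + b*w + p*u^2 - q*w^2
c0, a, b, p, q = 0.7, 1.2, -0.4, 0.5, 0.9

def Z_true(u, w):
    return c0 + a * u + b * w + p * u**2 - q * w**2

# Probe with test signals and fit over the basis [1, u, w, u^2, w^2]
u = rng.uniform(-2.0, 2.0, 200)
w = rng.uniform(-2.0, 2.0, 200)
Phi = np.column_stack([np.ones_like(u), u, w, u**2, w**2])
W_a, *_ = np.linalg.lstsq(Phi, Z_true(u, w), rcond=None)
print(W_a)  # approximately [0.7, 1.2, -0.4, 0.5, -0.9]
```

The fitted vector plays the role of the identified \(W_a\): its first entry is the constant part, the next two the linear parts, and the last two the quadratic parts.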
The parameters of \(W_a(t)\) in (39) are identified using test signals as
where \(W_{a,0}\in {{\mathbb {R}}^{1}}\), \(W_{a,u^1}\in {{\mathbb {R}}^{m}}\), \(W_{a,w^1}\in {{\mathbb {R}}^{q}}\), \(W_{a,u^2}\in {{\mathbb {R}}^{m}}\), and \(W_{a,w^2}\in {{\mathbb {R}}^{q}}\) correspond to the dimensions of the terms of \(\sigma _a \left( u,w \right) \) in (39).
Based on (102), the value of the identified \(W_{a,0}\) is actually \(Z^j_c\). The parameters corresponding to the basis \(\sigma _a^1(w)=[w_1,w_2,\cdots ,w_q]\in {{\mathbb {R}}^{q}}\) are identified as \(W_{a,w^1}=T\big (\frac{\partial V^{j}}{\partial x}\big )^{\mathrm {T}}k\in {{\mathbb {R}}^{q}}\). The parameters corresponding to the basis \(\sigma _a^2(w)=[w_1^2,w_2^2, \cdots ,w_q^2]\in {{\mathbb {R}}^{q}}\) are identified as \(W_{a,w^2}=-T\gamma ^2[1,1,\cdots ,1]\in {{\mathbb {R}}^{q}}\). To determine the worst disturbance input \(w^{j+1}\) by (44), we have
and then
where \(\mathrm {diag}(W_{a,w^2})\in {{\mathbb {R}}^{q\times q}}\) denotes the diagonal matrix with all the elements coming from \(W_{a,w^2}\).
Without loss of generality, we regard \(R\in {{\mathbb {R}}^{m\times m}}\) as a diagonal matrix. Then, based on (101) and (102), the parameters corresponding to the basis \(\sigma _a^1(u)=[u_1,u_2,\cdots ,u_m]\in {{\mathbb {R}}^{m}}\) are identified as \(W_{a,u^1}=T\big (\frac{\partial V^{j}}{\partial x}\big )^{\mathrm {T}}g\in {{\mathbb {R}}^{m}}\). The parameters corresponding to the basis \(\sigma _a^2(u)=[u_1^2,u_2^2,\cdots ,u_m^2]\in {{\mathbb {R}}^{m}}\) are identified as \(W_{a,u^2}=T[R_{11},R_{22},\cdots , R_{mm}]\in {{\mathbb {R}}^{m}}\). To determine the optimal control input \(u^{j+1}\) by (44), we have
and then
where \(\mathrm {diag}(W_{a,u^2})\in {{\mathbb {R}}^{m\times m}}\) denotes the diagonal matrix with all the elements coming from \(W_{a,u^2}\).
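Once the coefficients are identified, the extremizing inputs follow by completing the square elementwise, as in the derivations above. A sketch with hypothetical identified coefficients (the signs of the quadratic parts mirror \(TR>0\) for \(u\) and \(-T\gamma ^2<0\) for \(w\)):

```python
import numpy as np

# Hypothetical identified Z-function coefficients (structure of (102)):
# Z(u, w) = W0 + W_u1'u + W_w1'w + u'diag(W_u2)u + w'diag(W_w2)w
W_u1 = np.array([0.8, -0.3])   # linear-in-u coefficients
W_u2 = np.array([0.5, 0.4])    # positive: u-quadratic part (from T*R)
W_w1 = np.array([0.6])         # linear-in-w coefficients
W_w2 = np.array([-0.9])        # negative: w-quadratic part (from -T*gamma^2)

# Completing the square: the minimizing u and maximizing w are the
# stationary points  u* = -1/2 diag(W_u2)^{-1} W_u1,
#                    w* = -1/2 diag(W_w2)^{-1} W_w1.
u_star = -0.5 * W_u1 / W_u2
w_star = -0.5 * W_w1 / W_w2

# Verify stationarity: the gradient W_u1 + 2 diag(W_u2) u vanishes at u*
assert np.allclose(W_u1 + 2 * W_u2 * u_star, 0)
assert np.allclose(W_w1 + 2 * W_w2 * w_star, 0)
print(u_star, w_star)  # [-0.8, 0.375] and [0.3333...]
```

Since \(W_{u^2}>0\) the stationary point in \(u\) is a minimum, and since \(W_{w^2}<0\) the stationary point in \(w\) is a maximum, matching the saddle-point roles of the two players.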
Unlike Algorithm 2, where the input policy is updated through the feedback gain between the input and the state, the input in Algorithm 3 is calculated directly from the identified Z-function parameters.
Now we apply mathematical induction to prove that the optimal control input \(u^{j+1}(t)\) and the worst disturbance input \(w^{j+1}(t)\) in step 3 of Algorithm 3 are equivalent to those in step 3 of Algorithm 2 over the whole time interval \([t_0,t_f]\).
Denote \(t=t_0+(n-1)T\).

(1) When \(n=1\), \(t=t_0\), the system state is given as \(x_0\), which is the same in Algorithms 2 and 3. Therefore, the values of \(w^{j+1}\) in (104) and \(u^{j+1}\) in (106) for Algorithm 3 are the same as those for Algorithm 2. Based on the same state \(x_0\), applying the same input policy to system (1), the system state \(x(t_0+T)\) at the next sampling instant is also the same in Algorithms 2 and 3.

(2) When \(n=k\), \(t=t_0+(k-1)T\), the state \(x(t)\) at this time instant is assumed to be the same in Algorithms 2 and 3. Following the same procedure, the corresponding input policy and the state \(x(t_0+kT)\) are also the same in the two algorithms.

(3) When \(n=k+1\), \(t=t_0+kT\), the state \(x(t)\) at this instant is the same based on the conclusion for \(t=t_0+(k-1)T\). Following the same procedure, the corresponding input policy and the state \(x(t_0+(k+1)T)\) are also the same in Algorithms 2 and 3.
Therefore, the input policy \(u^{j+1}(t)\), \(w^{j+1}(t)\) and the corresponding system trajectories in Algorithms 2 and 3 are the same over the whole time interval \([t_0,t_f]\). \(\square \)
Cite this article
Chen, Z., Xue, W., Li, N. et al. A novel Z-function-based completely model-free reinforcement learning method to finite-horizon zero-sum game of nonlinear system. Nonlinear Dyn 107, 2563–2582 (2022). https://doi.org/10.1007/s11071-021-07049-z