
A novel Z-function-based completely model-free reinforcement learning method to finite-horizon zero-sum game of nonlinear system


Abstract

This paper addresses the finite-horizon two-player zero-sum game for continuous-time nonlinear systems by defining a novel Z-function and proposing a completely model-free reinforcement learning (RL) method with a reduced dimension of basis functions. First, a model-based RL policy-iteration framework is constructed to reduce the order of the Hamilton–Jacobi–Isaacs (HJI) equation and to strengthen the anti-interference capability and efficiency; it provides the basic framework for the model-free algorithms. A partially model-free algorithm is then developed by applying integral RL and iterative learning control techniques, which further simplifies the solution procedure and removes the need for the system dynamics in the value-function update. An integral Bellman equation is considered, and the value function of the HJI equation is evaluated by a critic neural network with time-varying weights and state-dependent basis functions. To realize completely model-free learning, a novel Z-function is finally defined, and a completely model-free algorithm is proposed that further removes the need for the system dynamics in the input update. Sufficient convergence and stability analysis is provided, and simulation results verify the validity of the algorithm.


Data Availability Statement

Data sharing is not applicable to this article as no datasets were generated or analyzed during the current study.


Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grant 61773260 and the National Key R&D Program of China under Grant 2018YFB1305902.

Author information

Corresponding author

Correspondence to Ning Li.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

A Proof of Theorem 1

Proof

Optimality: Along the system trajectories of (1), we have

$$\begin{aligned} \frac{\mathrm {d} V}{\mathrm {d} t} =\left( \frac{\partial V}{\partial x} \right) ^{\mathrm {T}}\left( f+gu+kw \right) +\frac{\partial V}{\partial t}. \end{aligned}$$
(66)

Substituting (66) into (8) gives

$$\begin{aligned} H\left( x,t,u,w,V \right) =Q(x)+u^{\mathrm {T}}Ru-\gamma ^2w^{\mathrm {T}}w+\frac{\mathrm {d} V}{\mathrm {d} t}. \end{aligned}$$
(67)

Based on (10), completing the squares in the Hamiltonian function (8) yields

$$\begin{aligned}&H\left( x,t,u,w,V^* \right) \nonumber \\&=Q(x)-\frac{1}{4}\left(\frac{\partial V^*}{\partial x}\right)^{\mathrm {T}}gR^{-1}g^{\mathrm {T}}\frac{\partial V^*}{\partial x}+\frac{1}{4\gamma ^2}\left(\frac{\partial V^{*}}{\partial x}\right)^{\mathrm {T}}kk^{\mathrm {T}}\frac{\partial V^{*}}{\partial x}+\frac{\partial V^*}{\partial t}+\left(\frac{\partial V^*}{\partial x}\right)^{\mathrm {T}}f\nonumber \\&\quad +\left[ u+\frac{1}{2}R^{-1}g^{\mathrm {T}}\frac{\partial V^{*}}{\partial x}\right] ^{\mathrm {T}}R\left[ u+\frac{1}{2}R^{-1}g^{\mathrm {T}}\frac{\partial V^{*}}{\partial x}\right] \nonumber \\&\quad -\gamma ^2\left[ w-\frac{1}{2\gamma ^2}k^{\mathrm {T}}\frac{\partial V^{*}}{\partial x}\right] ^{\mathrm {T}}\left[ w-\frac{1}{2\gamma ^2}k^{\mathrm {T}}\frac{\partial V^{*}}{\partial x}\right] \nonumber \\&=H\left( x,t,u^*,w^*,V^* \right) +\left( u-u^*\right) ^{\mathrm {T}}R\left( u-u^*\right) -\gamma ^2\left( w-w^*\right) ^{\mathrm {T}}\left( w-w^*\right) \nonumber \\&=\left( u-u^*\right) ^{\mathrm {T}}R\left( u-u^*\right) -\gamma ^2\left( w-w^*\right) ^{\mathrm {T}}\left( w-w^*\right) , \end{aligned}$$
(68)

and \(J\left( x_{0},u,w \right) \) in (2) can be rewritten as

$$\begin{aligned}&J\left( x_{0},u,w \right) \nonumber \\&=\psi \left( x\left( t_{f} \right) ,t_{f}\right) +\int _{t_0}^{t_{f}}\left( Q(x)+u^{\mathrm {T}}Ru-\gamma ^2w^{\mathrm {T}}w\right) \mathrm {d}\tau \nonumber \\&=\psi \left( x\left( t_{f} \right) ,t_{f}\right) +\int _{t_0}^{t_{f}}\left( Q(x)+u^{\mathrm {T}}Ru-\gamma ^2w^{\mathrm {T}}w\right) \mathrm {d}\tau \nonumber \\&\quad +\int _{t_0}^{t_f} \frac{\mathrm {d} V^*}{\mathrm {d} \tau } \mathrm {d}\tau -V^*\left( x\left( t_{f}\right) ,t_{f}\right) +V^*\left( x\left( t_{0}\right) ,t_{0}\right) \nonumber \\&=\int _{t_0}^{t_{f}}\left( Q(x)+u^{\mathrm {T}}Ru-\gamma ^2w^{\mathrm {T}}w+\frac{\mathrm {d} V^*}{\mathrm {d} \tau }\right) \mathrm {d}\tau \nonumber \\&\quad +\psi \left( x\left( t_{f} \right) ,t_{f}\right) -V^*\left( x\left( t_{f}\right) ,t_{f}\right) +V^*\left( x\left( t_{0}\right) ,t_{0}\right) \nonumber \\&=\int _{t_0}^{t_{f}}H\left( x,\tau ,u,w,V^* \right) \mathrm {d}\tau +V^*\left( x\left( t_0\right) ,t_0\right) \nonumber \\&=\int _{t_0}^{t_{f}}\left[ \left( u-u^*\right) ^{\mathrm {T}}R\left( u-u^*\right) -\gamma ^2\left( w-w^*\right) ^{\mathrm {T}}\left( w-w^*\right) \right] \mathrm {d}\tau +V^*\left( x\left( t_0\right) ,t_0\right) . \end{aligned}$$
(69)

Here the terms added in the second step, \(\int _{t_0}^{t_f}\frac{\mathrm {d} V^*}{\mathrm {d}\tau }\mathrm {d}\tau -V^*\left( x\left( t_f\right) ,t_f\right) +V^*\left( x\left( t_0\right) ,t_0\right) \), sum to zero, and the last two steps use the terminal condition \(V^*\left( x\left( t_f\right) ,t_f\right) =\psi \left( x\left( t_f\right) ,t_f\right) \) together with (67) and (68). Hence, the Nash equilibrium condition (5) is met when the input policies are adopted as \(u=u^*\), \(w=w^*\), and the corresponding cost function is \(V^*\left( x(t_0),t_0 \right) \). \(\square \)
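
As a quick sanity check on the completing-the-squares step, the identity in (68) can be verified symbolically for the scalar case (\(n=m=q=1\)). The following SymPy sketch is illustrative only; its symbols are scalar stand-ins for the paper's vector quantities.

```python
# Scalar symbolic check of (68): H(u, w) - H(u*, w*) equals the quadratic
# difference R (u - u*)^2 - gamma^2 (w - w*)^2. All symbols are stand-ins.
import sympy as sp

u, w, f, g, k, Vx, Vt, Q, R, gam = sp.symbols('u w f g k V_x V_t Q R gamma')

# Hamiltonian (8) with dV/dt expanded along the dynamics as in (66)-(67):
H = Q + R*u**2 - gam**2*w**2 + Vx*(f + g*u + k*w) + Vt

# Saddle-point policies of (10): u* = -(1/2) R^{-1} g V_x, w* = (1/(2 gamma^2)) k V_x
u_star = -g*Vx / (2*R)
w_star = k*Vx / (2*gam**2)
H_star = H.subs({u: u_star, w: w_star})

residual = sp.simplify(H - H_star - (R*(u - u_star)**2 - gam**2*(w - w_star)**2))
print(residual)  # prints 0, confirming the identity
```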

B Proof of Theorem 2

Proof

Along the trajectory \({\dot{x}}=f+gu^{j+1}+kw^{j+1}\), the value function \(V^j\) satisfies

$$\begin{aligned} \frac{\mathrm {d} V^j}{\mathrm {d} t} =\left[ \frac{\partial V^j}{\partial x} \right] ^{\mathrm {T}}\left( f+gu^{j+1}+kw^{j+1} \right) +\frac{\partial V^j}{\partial t}, \end{aligned}$$
(70)

and likewise for \(V^{j+1}\):

$$\begin{aligned} \frac{\mathrm {d} V^{j+1}}{\mathrm {d} t}&=\left[ \frac{\partial V^{j+1}}{\partial x} \right] ^{\mathrm {T}}\left( f+gu^{j+1}+kw^{j+1} \right) \nonumber \\&\quad +\frac{\partial V^{j+1}}{\partial t}. \end{aligned}$$
(71)

According to the Bellman equation (11), one has

$$\begin{aligned} \begin{aligned}&\frac{\partial V^j}{\partial t}+\left(\frac{\partial V^j}{\partial x}\right)^{\mathrm {T}}f\\&=-\left(\frac{\partial V^j}{\partial x}\right)^{\mathrm {T}}\left( gu^j+kw^{j} \right) \\&\quad -Q(x)-\left( u^j \right) ^{\mathrm {T}}Ru^j+\gamma ^2(w^j)^{\mathrm {T}}w^j. \end{aligned} \end{aligned}$$
(72)

Now, we prove the convergence of \(V^{j}\) and optimality of the solutions obtained by Algorithm 1. Consider

$$\begin{aligned} V^{j+1}\left( x\left( t_{0}\right) ,t_{0}\right) =V^{j+1}\left( x\left( t_{f}\right) ,t_{f}\right) -\int _{t_0}^{t_f} \frac{\mathrm {d} V^{j+1}}{\mathrm {d} t} \mathrm {d}t, \end{aligned}$$
(73)
$$\begin{aligned} V^{j}\left( x\left( t_{0}\right) ,t_{0}\right) =V^{j}\left( x\left( t_{f}\right) ,t_{f}\right) -\int _{t_0}^{t_f} \frac{\mathrm {d} V^{j}}{\mathrm {d} t} \mathrm {d}t. \end{aligned}$$
(74)

Since \(V^j\left( x\left( t_{f}\right) ,t_{f}\right) =V^{j+1}\left( x\left( t_{f}\right) ,t_{f}\right) =\psi \left( x\left( t_{f} \right) ,t_{f}\right) \), subtracting (74) from (73) gives

$$\begin{aligned}&V^{j+1}\left( x\left( t_{0}\right) ,t_{0}\right) -V^{j}\left( x\left( t_{0}\right) ,t_{0}\right) \nonumber \\&\quad =\int _{t_0}^{t_f} \left( \frac{\mathrm {d} V^{j}}{\mathrm {d} t} -\frac{\mathrm {d} V^{j+1}}{\mathrm {d} t}\right) \mathrm {d}t. \end{aligned}$$
(75)

Then, along the trajectory \({\dot{x}}=f+gu^{j+1}+kw^{j+1}\), substituting (70) and (71) into (75) gives

$$\begin{aligned}&V^{j+1}\left( x\left( t_{0}\right) ,t_{0}\right) -V^{j}\left( x\left( t_{0}\right) ,t_{0}\right) \nonumber \\&=\int _{t_0}^{t_f} \left[ \left( \left(\frac{\partial V^j}{\partial x}\right)^{\mathrm {T}}\left( f+gu^{j+1}+kw^{j+1} \right) +\frac{\partial V^j}{\partial t} \right) \right. \nonumber \\&\quad \left. -\left( \left(\frac{\partial V^{j+1}}{\partial x}\right)^{\mathrm {T}}\left( f+gu^{j+1}+kw^{j+1} \right) +\frac{\partial V^{j+1}}{\partial t}\right) \right] \mathrm {d}t. \end{aligned}$$
(76)

Analogously to (72), for the \((j+1)\)th iteration one has

$$\begin{aligned} \begin{aligned}&\frac{\partial V^{j+1}}{\partial t}+\left(\frac{\partial V^{j+1}}{\partial x}\right)^{\mathrm {T}}\left( f+gu^{j+1}+kw^{j+1} \right) \\&=\gamma ^2(w^{j+1})^{\mathrm {T}}w^{j+1}-Q(x)-\left( u^{j+1} \right) ^{\mathrm {T}}Ru^{j+1}. \end{aligned} \end{aligned}$$
(77)

Substituting (72) and (77) into (76), one has

$$\begin{aligned}&V^{j+1}\left( x\left( t_{0}\right) ,t_{0}\right) -V^{j}\left( x\left( t_{0}\right) ,t_{0}\right) \nonumber \\&=\int _{t_0}^{t_f} \Bigg [\left(\frac{\partial V^j}{\partial x}\right)^{\mathrm {T}} gu^{j+1} +\Bigg ( -\left(\frac{\partial V^j}{\partial x}\right)^{\mathrm {T}}\left( gu^j+kw^{j} \right) \nonumber \\&\quad -\left( u^j \right) ^{\mathrm {T}}Ru^j+\gamma ^2(w^{j})^{\mathrm {T}}w^{j} \Bigg )+\left(\frac{\partial V^j}{\partial x}\right)^{\mathrm {T}}kw^{j+1}\nonumber \\&\quad -\Big (\gamma ^2(w^{j+1})^{\mathrm {T}}w^{j+1}-\left( u^{j+1} \right) ^{\mathrm {T}}Ru^{j+1}\Big )\Bigg ]\mathrm {d}t. \end{aligned}$$
(78)

Considering (12), we have

$$\begin{aligned} \left(\frac{\partial V^{j}}{\partial x}\right)^{\mathrm {T}}g=-2\left( u^{j+1}\left(x\right) \right) ^{\mathrm {T}}R,\quad \left(\frac{\partial V^{j}}{\partial x}\right)^{\mathrm {T}}k=2\gamma ^2\left( w^{j+1}\left(x\right) \right) ^{\mathrm {T}}. \end{aligned}$$
(79)

Then, substituting (79) into (78) gives

$$\begin{aligned}&V^{j+1}\left( x\left( t_{0}\right) ,t_{0}\right) -V^{j}\left( x\left( t_{0}\right) ,t_{0}\right) \nonumber \\&=\int _{t_0}^{t_f} \left[ -\left( u^{j+1} \right) ^{\mathrm {T}}Ru^{j+1}+2\left( u^{j+1} \right) ^{\mathrm {T}}Ru^j-\left( u^j \right) ^{\mathrm {T}}Ru^j\right. \nonumber \\&\quad \left. +\gamma ^2\left( w^{j+1} \right) ^{\mathrm {T}}w^{j+1}-2\gamma ^2\left( w^{j+1} \right) ^{\mathrm {T}}w^j+\gamma ^2\left( w^j \right) ^{\mathrm {T}}w^j\right] \mathrm {d}t\nonumber \\&=-\int _{t_0}^{t_f} \left( u^{j+1}-u^{j} \right) ^{\mathrm {T}}R\left( u^{j+1}-u^{j} \right) \mathrm {d}t\nonumber \\&\quad +\int _{t_0}^{t_f} \gamma ^2\left( w^{j+1}-w^{j} \right) ^{\mathrm {T}}\left( w^{j+1}-w^{j} \right) \mathrm {d}t. \end{aligned}$$
(80)

From (69) in Theorem 1 and (80), we conclude that as \(j\rightarrow \infty \), the cost function \(V^{j+1}(x_0,t_0)\) is monotonically decreasing in terms of u, monotonically increasing in terms of w, and bounded. Thus, \(V^{j+1}\left( x_0,t_0\right) \) converges, and when \(j\rightarrow \infty \) we have \(V^{j+1}=V^{j}\), and accordingly

$$\begin{aligned} u^{j+1}=u^{j},w^{j+1}=w^{j}. \end{aligned}$$
(81)

Substituting (81) into (11) yields

$$\begin{aligned}&-\frac{\partial V^{j}}{\partial t}=\left(\frac{\partial V^{j}}{\partial x}\right)^{\mathrm {T}}\left( f+gu^{j+1}+kw^{j+1} \right) +Q(x)\nonumber \\&\quad +(u^{j+1})^{\mathrm {T}}Ru^{j+1}-\gamma ^2(w^{j+1})^{\mathrm {T}}w^{j+1},\nonumber \\&V^j\left( x\left( t_{f}\right) ,t_{f}\right) =\psi \left( x\left( t_{f} \right) ,t_{f}\right) . \end{aligned}$$
(82)

Considering (12), (82) is essentially equivalent to (10). Hence, when \(j\rightarrow \infty \), \(V^{j}\) is the solution of (10). This completes the proof. \(\square \)
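
To make the iteration of Theorem 2 concrete, the sketch below runs an Algorithm 1-style policy-iteration loop, built directly from (11) and (12), on a hypothetical scalar linear plant \({\dot{x}}=ax+bu+kw\) with quadratic value \(V^j(x,t)=p_j(t)x^2\), so that policy evaluation reduces to a backward scalar ODE. The plant coefficients, weights, and horizon are illustrative assumptions, not the paper's simulation example.

```python
# Policy iteration for a scalar finite-horizon zero-sum LQ game (toy sketch):
# evaluation solves the backward ODE implied by the Bellman equation (11) for
# fixed gains; improvement applies (12): u = -(b p / R) x, w = (k p / gam2) x.
import numpy as np

a, b, k = -1.0, 1.0, 0.5            # hypothetical plant: x' = a x + b u + k w
q, R, gam2 = 1.0, 1.0, 4.0          # running cost q x^2 + R u^2 - gam2 w^2
t0, tf, N = 0.0, 2.0, 2000
dt = (tf - t0) / N
p_f = 1.0                           # terminal weight: psi(x, tf) = p_f x^2

Ku = np.zeros(N + 1)                # time-varying gains: u = -Ku(t) x, w = Kw(t) x
Kw = np.zeros(N + 1)

for j in range(30):
    # Policy evaluation: -p' = q + R Ku^2 - gam2 Kw^2 + 2 p (a - b Ku + k Kw),
    # integrated backward from p(tf) = p_f (explicit Euler).
    p = np.empty(N + 1)
    p[N] = p_f
    for n in range(N, 0, -1):
        rhs = q + R*Ku[n]**2 - gam2*Kw[n]**2 + 2*p[n]*(a - b*Ku[n] + k*Kw[n])
        p[n - 1] = p[n] + dt * rhs
    # Policy improvement, cf. (12) with V_x = 2 p x:
    Ku_new, Kw_new = b*p/R, k*p/gam2
    if max(np.max(np.abs(Ku_new - Ku)), np.max(np.abs(Kw_new - Kw))) < 1e-9:
        break
    Ku, Kw = Ku_new, Kw_new

print(j, p[0])  # p_j(t0) converges as j grows, consistent with (80)
```

Here \(V^{j}(x_0,t_0)=p_j(t_0)x_0^2\), so convergence of the printed \(p_j(t_0)\) mirrors the convergence of \(V^{j+1}(x_0,t_0)\) established via (80).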

C Proof of Theorem 3

Proof

Denote

$$\begin{aligned} \zeta ^j=\sigma (x(t)),\quad j=1,2,\ldots \end{aligned}$$
(83)

where x(t) is the corresponding state trajectory under the input \(u^j\), and j remains unchanged during the whole inner-loop iteration.

When \(t=t_f\), we know from (30) that

$$\begin{aligned} {\hat{W}}_{j}^{i+1}\left( t_f \right)&={\hat{W}}_{j}^{i}\left( t_f \right) -c_2 \sigma \left( x\left( t_f \right) \right) \sigma \left( x\left( t_f \right) \right) ^{\mathrm {T}} \nonumber \\&\quad \times {\hat{W}}_{j}^{i}\left( t_f \right) +c_2 \sigma \left( x\left( t_f \right) \right) \psi \left( x\left( t_{f} \right) ,t_{f}\right) . \end{aligned}$$
(84)

Writing (84) with i replaced by \(i-1\) (i.e., for the update of \({\hat{W}}_{j}^{i}\left( t_f \right) \) from \({\hat{W}}_{j}^{i-1}\left( t_f \right) \)) and subtracting it from (84) gives

$$\begin{aligned} {\hat{W}}_{j}^{i+1}\left( t_f \right) -{\hat{W}}_{j}^{i}\left( t_f \right) =&\bigg [I-c_2 \sigma \left( x\left( t_f \right) \right) \sigma \left( x\left( t_f \right) \right) ^{\mathrm {T}}\bigg ]\nonumber \\&\times \left[ {\hat{W}}_{j}^{i}\left( t_f \right) -{\hat{W}}_{j}^{i-1}\left( t_f \right) \right] . \end{aligned}$$
(85)

When \(t\in [t_0,t_f)\), substituting (83) into (30) gives

$$\begin{aligned}&{\hat{W}}_{j}^{i+1}\left( t \right) ={\hat{W}}_{j}^{i}\left( t \right) -c_1\zeta ^j \bigg [ -\sigma \left( x\left( t+T \right) \right) ^{\mathrm {T}}{\hat{W}}_{j}^{i}\left( t+T \right) \nonumber \\&\quad -\int _{t}^{t+T} \left( Q(x)+(u^{j})^{\mathrm {T}}Ru^j-\gamma ^2(w^{j})^{\mathrm {T}}w^j \right) \mathrm {d}\tau \nonumber \\&\quad +(\zeta ^j)^{\mathrm {T}}{\hat{W}}_{j}^{i}\left( t \right) \bigg ]. \end{aligned}$$
(86)

Let \(m_1=\zeta ^j(\zeta ^{j})^{\mathrm {T}}\) and \(m_2=\zeta ^j \sigma ^{\mathrm {T}}\left( x\left( t+T\right) \right) \); then \({\hat{W}}_{j}^{i+1}\left( t \right) \) in (86) can be written as

$$\begin{aligned}&{\hat{W}}_{j}^{i+1}\left( t \right) ={\hat{W}}_{j}^{i}\left( t \right) -c_1m_1{\hat{W}}_{j}^{i}\left( t \right) +c_1m_2{\hat{W}}_{j}^{i}\left( t+T \right) \nonumber \\&\quad +c_1 \zeta ^j\int _{t}^{t+T} \left( Q(x)+(u^{j})^{\mathrm {T}}Ru^j-\gamma ^2(w^{j})^{\mathrm {T}}w^j \right) \mathrm {d}\tau . \end{aligned}$$
(87)

Writing (87) with i replaced by \(i-1\) and subtracting it from (87) shows that

$$\begin{aligned}&{\hat{W}}_{j}^{i+1}\left( t \right) -{\hat{W}}_{j}^{i}\left( t \right) =\left( I-c_1m_1\right) \left[ {\hat{W}}_{j}^{i}\left( t \right) -{\hat{W}}_{j}^{i-1}\left( t \right) \right] \nonumber \\&\quad +c_1m_2\left[ {\hat{W}}_{j}^{i}\left( t+T \right) -{\hat{W}}_{j}^{i-1}\left( t+T \right) \right] . \end{aligned}$$
(88)

Therefore, we can obtain

$$\begin{aligned} \begin{aligned}&\left\| {\hat{W}}_{j}^{i+1}\left( t \right) -{\hat{W}}_{j}^{i}\left( t \right) \right\| \\&\le \left\| (I-c_1m_1)({\hat{W}}_{j}^{i}\left( t \right) -{\hat{W}}_{j}^{i-1}\left( t \right) ) \right\| \\&\quad +\left\| c_1m_2\left[ {\hat{W}}_{j}^{i}\left( t+T \right) -{\hat{W}}_{j}^{i-1}\left( t+T \right) \right] \right\| . \end{aligned} \end{aligned}$$
(89)

According to (85) and (89), it can be proved by mathematical induction that

$$\begin{aligned} \begin{aligned}&\left\| {\hat{W}}_{j}^{i+1}\left( t \right) -{\hat{W}}_{j}^{i}\left( t \right) \right\| \\&\le \Bigg \Vert (I-c_1m_1)({\hat{W}}_{j}^{i}\left( t \right) -{\hat{W}}_{j}^{i-1}\left( t \right) ) \Bigg \Vert , \end{aligned} \end{aligned}$$
(90)

which gives

$$\begin{aligned} \begin{aligned} \lim _{i \rightarrow \infty }\left\| {\hat{W}}_{j}^{i+1}\left( t \right) -{\hat{W}}_{j}^{i}\left( t \right) \right\| =0,\forall t\in [t_0,t_f]. \end{aligned} \end{aligned}$$
(91)

Denote \(t=t_f-(n-1)T,\ n\ge 1\).

  1)

    When \(n=1\), i.e., \(t=t_f\), note that \(\sigma \left( x\left( t_f \right) \right) \sigma ^{\mathrm {T}}\left( x\left( t_f \right) \right) \) is symmetric and positive definite, hence its eigenvalues are larger than 0. Denote \(T_{ad}=I-c_2 \sigma \left( x\left( t_f \right) \right) \sigma ^{\mathrm {T}}\left( x\left( t_f \right) \right) \) and \(x_w={\hat{W}}_{j}^{i}\left( t_f \right) -{\hat{W}}_{j}^{i-1}\left( t_f \right) \). Then (85) becomes

    $$\begin{aligned} \begin{aligned}&\left\| {\hat{W}}_{j}^{i+1}\left( t_f \right) -{\hat{W}}_{j}^{i}\left( t_f \right) \right\| \\&= \left\| T_{ad}x_w \right\| =\sqrt{(x_w)^{\mathrm {T}}(T_{ad})^{\mathrm {T}}T_{ad}x_w}. \end{aligned} \end{aligned}$$
    (92)

    Since \(c_2>0\) can be selected to make the eigenvalues of \(T_{ad}\) belong to \([0,1)\), \(T_{ad}\) can be orthogonally diagonalized as \(T_{ad}=({M}_{ad})^{\mathrm {T}}D_{ad}{M}_{ad}\), where \({M}_{ad}\) is an orthogonal matrix and \(D_{ad}\) is a diagonal matrix. The diagonal elements of \(D_{ad}\) are the eigenvalues of \(T_{ad}\), all of which belong to \([0, 1)\). Let \(\varepsilon \) be the maximum of these eigenvalues, so that \(\varepsilon \in (0,1)\). Therefore, we have

    $$\begin{aligned} \begin{aligned} (D_{ad})^2 \le \varepsilon ^2 I, ({M}_{ad})^{\mathrm {T}}{M}_{ad}=I. \end{aligned} \end{aligned}$$
    (93)

    With (93), (92) gives

    $$\begin{aligned} \begin{aligned}&\left\| {\hat{W}}_{j}^{i+1}\left( t_f \right) -{\hat{W}}_{j}^{i}\left( t_f \right) \right\| \\&= \sqrt{(x_w)^{\mathrm {T}}(T_{ad})^{\mathrm {T}}T_{ad}x_w}\\&=\sqrt{(x_w)^{\mathrm {T}}({M}_{ad})^{\mathrm {T}}(D_{ad})^2{M}_{ad}x_w}\\&\le \varepsilon \sqrt{(x_w)^{\mathrm {T}}({M}_{ad})^{\mathrm {T}}{M}_{ad}x_w}\\&=\varepsilon \sqrt{(x_w)^{\mathrm {T}}x_w}\\&= \varepsilon \left\| {\hat{W}}_{j}^{i}\left( t_f \right) -{\hat{W}}_{j}^{i-1}\left( t_f \right) \right\| . \end{aligned} \end{aligned}$$
    (94)

    Therefore,

    $$\begin{aligned} \lim _{i \rightarrow \infty }\left\| {\hat{W}}_{j}^{i+1}\left( t_f \right) -{\hat{W}}_{j}^{i}\left( t_f \right) \right\| =0. \end{aligned}$$
    (95)
  2)

    When \(n=k\), i.e., \(t=t_f-(k-1)T,\ k\ge 1\), we assume that (91) holds for such t, that is,

    $$\begin{aligned} \lim _{i \rightarrow \infty }\left\| {\hat{W}}_{j}^{i+1}\left( t_f+T-kT \right) -{\hat{W}}_{j}^{i}\left( t_f+T-kT \right) \right\| =0. \end{aligned}$$
    (96)
  3)

    When \(n=k+1\), i.e., \(t=t_f-kT,\ k\ge 1\), we know from (89) and (96) that

    $$\begin{aligned} \begin{aligned}&\left\| {\hat{W}}_{j}^{i+1}\left( t \right) -{\hat{W}}_{j}^{i}\left( t \right) \right\| \\&\quad \le \left\| (I-c_1 m_1)({\hat{W}}_{j}^{i}\left( t \right) -{\hat{W}}_{j}^{i-1}\left( t \right) ) \right\| . \end{aligned} \end{aligned}$$
    (97)

    Following the same manipulations as in (92)–(95), (97) becomes

    $$\begin{aligned} \begin{aligned}&\left\| {\hat{W}}_{j}^{i+1}\left( t \right) -{\hat{W}}_{j}^{i}\left( t \right) \right\| \\&\quad \le \varepsilon \left\| {\hat{W}}_{j}^{i}\left( t \right) -{\hat{W}}_{j}^{i-1}\left( t \right) \right\| . \end{aligned} \end{aligned}$$
    (98)

    Therefore, \(\left\| {\hat{W}}_{j}^{i+1}\left( t \right) -{\hat{W}}_{j}^{i}\left( t \right) \right\| \) will converge to 0 as \(i\rightarrow \infty \). That is, \({\hat{W}}_{j}^{i+1}\left( t \right) \) will converge to an optimal solution \({\hat{W}}_{j}^{*}(t)\). Also, the overall residual error E will go to 0, thus \(e_1^j(t)\) and \(e_2^j(t_f)\) will go to 0. Hence, \({\hat{W}}_{j}^{*}(t)\) gives the optimal solution \(V^{j}\) in (17) and (18).

\(\square \)
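
The core of the proof is that each weight increment is propagated through \(T_{ad}=I-c\,M\), whose spectral radius \(\varepsilon <1\) under the theorem's positive-definiteness assumption. A small numerical illustration of this contraction follows; the matrix \(M\), dimensions, and constants are arbitrary test values, not taken from the paper.

```python
# Numerical illustration of (85)/(94): increments dW^i = W^i - W^{i-1} shrink
# geometrically when propagated through T_ad = I - c M, with M symmetric
# positive definite and c chosen so the eigenvalues of T_ad lie in [0, 1).
import numpy as np

rng = np.random.default_rng(0)
n = 5
S = rng.standard_normal((n, n))
M = S @ S.T + 0.1*np.eye(n)              # symmetric positive-definite Gram matrix
c = 1.0 / np.linalg.eigvalsh(M)[-1]      # puts eig(I - c M) inside [0, 1)
T_ad = np.eye(n) - c*M

eps = np.max(np.abs(np.linalg.eigvalsh(T_ad)))   # contraction factor, cf. (94)
dW = rng.standard_normal(n)              # initial increment W^1 - W^0
for i in range(50):
    dW = T_ad @ dW                       # one application of (85)
print(eps, np.linalg.norm(dW))           # ||dW|| <= eps**50 * ||dW_0||, tiny
```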

D Proof of Theorem 4

Proof

The Z-function for \(V^j\) at the time instant t can be expressed as

$$\begin{aligned} \begin{aligned}&Z^{j}(u(t),w(t),t)\\&=\int _{t}^{t+T}r\left( x\left( \tau \right) ,u\left( \tau \right) ,w\left( \tau \right) \right) \mathrm {d}\tau \\&\quad +V^{j}(x(t+T),t+T)\\&=\int _{t}^{t+T}r\left( x\left( \tau \right) ,u\left( \tau \right) ,w\left( \tau \right) \right) \mathrm {d}\tau +V^{j}(x(t),t)\\&\quad +\int _{t}^{t+T}\left[ \bigg (\frac{\partial V^{j}}{\partial x}\bigg )^{\mathrm {T}}\left( f+gu+kw \right) +\frac{\partial V^{j}}{\partial \tau }\right] \mathrm {d}\tau . \end{aligned} \end{aligned}$$
(99)

Since T is small and \(r\left( x,u,w \right) =Q(x)+u^{\mathrm {T}}Ru-\gamma ^2w^{\mathrm {T}}w\), (99) can be approximated (to first order in T, evaluating the integrands at the left endpoint t) as

$$\begin{aligned} \begin{aligned}&Z^{j}(u(t),w(t),t)=T\bigg [Q(x(t))+u^{\mathrm {T}}(t)Ru(t)\\&-\gamma ^2w^{\mathrm {T}}(t)w(t)\\&+\bigg (\frac{\partial V^{j}}{\partial x}\bigg )^{\mathrm {T}}\left( f+gu(t)+kw(t) \right) +\frac{\partial V^{j}}{\partial t}\bigg ]\\&\quad +V^{j}(x(t),t). \end{aligned} \end{aligned}$$
(100)

In (100), the current state x(t) at the instant t and the value function \(V^j\) are given; the variables remaining to be optimized are u(t) and w(t). Therefore, (100) can be decomposed into three parts:

$$\begin{aligned} \begin{aligned}&Z^{j}(u(t),w(t),t)\\&=\underset{Z^j_c}{\underbrace{\bigg [T\bigg (Q(x(t))+\big (\frac{\partial V^{j}}{\partial x}\big )^{\mathrm {T}} f+\frac{\partial V^{j}}{\partial t}\bigg )+V^{j}(x(t),t)\bigg ]}}\\&+\underset{Z^j_{o_1}}{\underbrace{T\frac{\partial V^{j}}{\partial x}^{\mathrm {T}}gu(t)+T\frac{\partial V^{j}}{\partial x}^{\mathrm {T}}kw(t)}} \\&+\underset{Z^j_{o_2}}{\underbrace{Tu^{\mathrm {T}}(t)Ru(t)-T\gamma ^2w^{\mathrm {T}}(t)w(t)}}. \end{aligned} \end{aligned}$$
(101)

In (101), \(Z^j_c\) is the constant term, independent of u(t) and w(t); \(Z^j_{o_1}\) is the first-order term, linear in u(t) and w(t); and \(Z^j_{o_2}\) is the second-order term, composed of quadratic terms in u(t) and in w(t). Note that (101) explains why the basis of \(Z^j\) is selected as \(\sigma _a \left( u,w \right) \) in (39), as sketched below.
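
For concreteness, a minimal sketch of this basis construction; the exact ordering of \(\sigma _a(u,w)\) in (39) is not reproduced in this excerpt, so the layout below is an assumption inferred from (102).

```python
# Quadratic basis for the Z-function, matching the block structure of (102):
# [1, u_1..u_m, w_1..w_q, u_1^2..u_m^2, w_1^2..w_q^2] (elementwise squares).
import numpy as np

def sigma_a(u: np.ndarray, w: np.ndarray) -> np.ndarray:
    return np.concatenate(([1.0], u, w, u**2, w**2))

# Z^j(u, w, t) is then approximated linearly in these features as
# W_a(t)^T sigma_a(u, w), with W_a(t) partitioned as in (102).
```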

The parameters \(W_a(t)\) in (39) are identified by test signals as

$$\begin{aligned} \begin{aligned} W_a(t)=[W_{a,0},W_{a,u^1},W_{a,w^1},W_{a,u^2},W_{a,w^2}]^{\mathrm {T}} \end{aligned} \end{aligned}$$
(102)

where \(W_{a,0}\in {{\mathbb {R}}^{1}}, W_{a,u^1}\in {{\mathbb {R}}^{m}},W_{a,w^1}\in {{\mathbb {R}}^{q}},W_{a,u^2}\in {{\mathbb {R}}^{m}}\), \(W_{a,w^2}\in {{\mathbb {R}}^{q}}\) correspond to the dimensions of the respective terms of \(\sigma _a \left( u,w \right) \) in (39).

Based on (102), the value of the identified \(W_{a,0}\) is actually \(Z^j_c\). The parameters corresponding to the basis \(\sigma _a^1(w)=[w_1,w_2,\cdots ,w_q]\in {{\mathbb {R}}^{q}}\) are identified as \(W_{a,w^1}=T\big (\frac{\partial V^{j}}{\partial x}\big )^{\mathrm {T}}k\in {{\mathbb {R}}^{q}}\). The parameters corresponding to the basis \(\sigma _a^2(w)=[w_1^2,w_2^2, \cdots ,w_q^2]\in {{\mathbb {R}}^{q}}\) are identified as \(W_{a,w^2}=-T\gamma ^2[1,1,\cdots ,1]\in {{\mathbb {R}}^{q}}\). To determine the worst disturbance input \(w^{j+1}\) by (44), we have

$$\begin{aligned} \begin{aligned}&\frac{\partial [W_a^{\mathrm {T}}\left( t \right) \sigma _a \left( u,w \right) ]}{\partial w}\\&\quad =\frac{\partial [W_{a,w^1}^{\mathrm {T}}\sigma _a^1(w)+W_{a,w^2}^{\mathrm {T}}\sigma _a^2(w)]}{\partial w}=0 \end{aligned} \end{aligned}$$
(103)

and then

$$\begin{aligned} w^{j+1}=-\frac{1}{2}\mathrm {diag}(W_{a,w^2})^{-1}W_{a,w^1}^{\mathrm {T}}=\frac{1}{2\gamma ^2}k^{\mathrm {T}}\frac{\partial V^j}{\partial x} \end{aligned}$$
(104)

where \(\mathrm {diag}(W_{a,w^2})\in {{\mathbb {R}}^{q\times q}}\) denotes the diagonal matrix whose diagonal elements are the entries of \(W_{a,w^2}\).

Without loss of generality, we regard \(R\in {{\mathbb {R}}^{m\times m}}\) as a diagonal matrix. Then, based on (101) and (102), the parameters corresponding to the basis \(\sigma _a^1(u)=[u_1,u_2,\cdots ,u_m]\in {{\mathbb {R}}^{m}}\) are identified as \(W_{a,u^1}=T\big (\frac{\partial V^{j}}{\partial x}\big )^{\mathrm {T}}g\in {{\mathbb {R}}^{m}}\). The parameters corresponding to the basis \(\sigma _a^2(u)=[u_1^2,u_2^2,\cdots ,u_m^2]\in {{\mathbb {R}}^{m}}\) are identified as \(W_{a,u^2}=T[R_{11},R_{22},\cdots , R_{mm}]\in {{\mathbb {R}}^{m}}\). To determine the optimal control input \(u^{j+1}\) by (44), we have

$$\begin{aligned} \begin{aligned}&\frac{\partial [W_a^{\mathrm {T}}\left( t \right) \sigma _a \left( u,w \right) ]}{\partial u}\\&\quad =\frac{\partial [W_{a,u^1}^{\mathrm {T}}\sigma _a^1(u)+W_{a,u^2}^{\mathrm {T}}\sigma _a^2(u)]}{\partial u}=0 \end{aligned} \end{aligned}$$
(105)

and then

$$\begin{aligned} u^{j+1}&=-\frac{1}{2}\mathrm {diag}(W_{a,u^2})^{-1}W_{a,u^1}^{\mathrm {T}}=-\frac{1}{2}(TR)^{-1}W_{a,u^1}^{\mathrm {T}}\nonumber \\&=-\frac{1}{2}R^{-1}g^{\mathrm {T}}\frac{\partial V^j}{\partial x} \end{aligned}$$
(106)

where \(\mathrm {diag}(W_{a,u^2})\in {{\mathbb {R}}^{m\times m}}\) denotes the diagonal matrix whose diagonal elements are the entries of \(W_{a,u^2}\).

Different from Algorithm 2, where the input policy is updated through the feedback gain between the input and the state, Algorithm 3 calculates the input directly from the identified Z-function parameters, as the sketch below illustrates.
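
A hedged end-to-end sketch of that step: probe the (here simulated) Z-function, identify \(W_a(t)\) by least squares on the quadratic basis, then read the policies off the identified parameters via (104) and (106). The probing scheme, dimensions, and coefficients are illustrative assumptions, not the paper's simulation.

```python
# Identify W_a(t) at one time instant from probing inputs and extract the
# policies, following the parameter layout (102) and formulas (104)/(106).
import numpy as np

m, q_dim, T, gam2 = 2, 1, 0.05, 4.0
rng = np.random.default_rng(1)

def sigma_a(u, w):
    return np.concatenate(([1.0], u, w, u**2, w**2))

# Stand-ins for the unknown coefficients of (101) at this instant:
Zc = 0.7                              # constant part Z_c^j
au = T * rng.standard_normal(m)       # plays the role of T (dV^j/dx)^T g
aw = T * rng.standard_normal(q_dim)   # plays the role of T (dV^j/dx)^T k
R_diag = np.array([1.0, 2.0])         # diagonal of R

def Z(u, w):                          # simulated Z-function measurement, cf. (101)
    return Zc + au @ u + aw @ w + T*(R_diag*u) @ u - T*gam2*(w @ w)

probes = [(rng.standard_normal(m), rng.standard_normal(q_dim)) for _ in range(50)]
Phi = np.array([sigma_a(u, w) for u, w in probes])
z = np.array([Z(u, w) for u, w in probes])
W_a, *_ = np.linalg.lstsq(Phi, z, rcond=None)    # least-squares identification

W_u1 = W_a[1:1 + m]                              # approx. T (dV/dx)^T g
W_w1 = W_a[1 + m:1 + m + q_dim]                  # approx. T (dV/dx)^T k
W_u2 = W_a[1 + m + q_dim:1 + 2*m + q_dim]        # approx. T diag(R)
W_w2 = W_a[1 + 2*m + q_dim:]                     # approx. -T gam2
u_next = -0.5 * W_u1 / W_u2                      # (106)
w_next = -0.5 * W_w1 / W_w2                      # (104)
print(u_next, w_next)  # equal -au/(2 T R_diag) and aw/(2 T gam2) up to round-off
```

With noise-free measurements the fit recovers the coefficients of (101) exactly, provided the probes make the regressor matrix full column rank; no knowledge of f, g, or k enters the extraction.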

Now we apply mathematical induction to prove that the optimal control input \(u^{j+1}(t)\) and the worst disturbance input \(w^{j+1}(t)\) in step 3 of Algorithm 3 are equivalent to those in step 3 of Algorithm 2 over the whole time interval \([t_0,t_f]\).

Denote \(t=t_0+(n-1)T\).

  (1)

    When \(n=1\), \(t=t_0\), the system state is given as \(x_0\), which is the same in Algorithms 2 and 3. Therefore, the values of \(w^{j+1}\) in (104) and \(u^{j+1}\) in (106) for Algorithm 3 are the same as those for Algorithm 2. Starting from the same state \(x_0\) and applying the same input policy to system (1), the system state \(x(t_0+T)\) at the next sampling instant is also the same in Algorithms 2 and 3.

  (2)

    When \(n=k\), \(t=t_0+(k-1)T\), the state x(t) at this time instant t is assumed to be the same in Algorithms 2 and 3. Similar to the above procedure, the corresponding input policy and the state \(x(t_0+(k-1)T+T)\) are also the same in these two algorithms.

  (3)

    When \(n=k+1\), \(t=t_0+kT\), the state x(t) at this instant t is the same based on the conclusion for \(t=t_0+(k-1)T\). Similar to the above procedure, the corresponding input policy and the state \(x(t_0+kT+T)\) are also the same in Algorithms 2 and 3.

Therefore, the input policy \(u^{j+1}(t)\), \(w^{j+1}(t)\) and the corresponding system trajectories in Algorithms 2 and 3 are the same over the whole time interval \([t_0,t_f]\). \(\square \)


About this article


Cite this article

Chen, Z., Xue, W., Li, N. et al. A novel Z-function-based completely model-free reinforcement learning method to finite-horizon zero-sum game of nonlinear system. Nonlinear Dyn 107, 2563–2582 (2022). https://doi.org/10.1007/s11071-021-07049-z

