Abstract
This paper addresses the finite-horizon two-player zero-sum game for continuous-time nonlinear systems by defining a novel Z-function and proposing a completely model-free reinforcement learning (RL) method with a reduced-dimension basis. First, a model-based RL policy iteration framework is proposed to reduce the order of the Hamilton–Jacobi–Isaacs (HJI) equation and to strengthen the anti-interference capability and efficiency; this provides the basic framework for the model-free algorithms. A partially model-free algorithm is then developed by applying integral RL and iterative learning control techniques, which further simplifies the solution procedure and removes the need for system dynamics in the value function update; an integral Bellman equation is employed. The value function of the HJI equation is evaluated by a critic neural network with time-varying weights and state-dependent basis functions. To realize completely model-free learning, a novel Z-function is finally defined, and a completely model-free algorithm is proposed that also removes the need for system dynamics in the input update. Convergence and stability analysis is provided, and simulation results verify the validity of the algorithm.
Data Availability Statement
Data sharing is not applicable to this article as no datasets were generated or analyzed during the current study.
Acknowledgements
This work is supported by the National Natural Science Foundation of China under Grant 61773260 and the National Key R&D Program of China under Grant 2018YFB1305902.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Appendices
A Proof of Theorem 1
Proof
Optimality. Along the system trajectories of (1), we have
Substituting (66) into (8) gives
Based on (10) and processing the Hamiltonian function (8) by completing the squares, one has
and \(J\left( x_{0},u,w \right) \) in (2) can be represented as
Hence, the Nash equilibrium condition (5) is met when the input policy is adopted as \(u=u^*,w=w^*\), and the corresponding cost function is \(V^*\left( x(t_0),t_0 \right) \). \(\square \)
B Proof of Theorem 2
Proof
Along the trajectory \({\dot{x}}=f+gu^{j+1}+kw^{j+1}\), the value function \(V^j\) satisfies
and similarly for \(V^{j+1}\), one has
According to the Bellman equation (11), one has
Now, we prove the convergence of \(V^{j}\) and optimality of the solutions obtained by Algorithm 1. Consider
Since \(V^j\left( x\left( t_{f}\right) ,t_{f}\right) =V^{j+1}\left( x\left( t_{f}\right) ,t_{f}\right) =\psi \left( x\left( t_{f} \right) ,t_{f}\right) \), subtracting (74) from (73) gives
Then, along the trajectory \({\dot{x}}=f+gu^{j+1}+kw^{j+1}\), substituting (70) and (71) into (75) gives
According to (72), equivalently one has
Substituting (72) and (77) into (76) gives
Considering (12), we have
Then, substituting (79) into (78) gives
From (69) in Theorem 1 and (80), we conclude that as \(j\rightarrow \infty \), the cost function \(V^{j+1}(x_0,t_0)\) is monotonically decreasing in \(u\), monotonically increasing in \(w\), and bounded. Thus, \(V^{j+1}\left( x_0,t_0\right) \) converges, and when \(j\rightarrow \infty \) we have \(V^{j+1}=V^{j}\). Accordingly,
Substituting (81) into (11) gives
Considering (12), (82) is essentially equivalent to (10). Hence, as \(j\rightarrow \infty \), \(V^{j}\) becomes the solution of (10). This completes the proof. \(\square \)
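The policy iteration whose convergence is established above can be illustrated on the linear-quadratic special case, where policy evaluation reduces to a Lyapunov equation and the iterates converge to the solution of the game algebraic Riccati equation. The sketch below is a minimal infinite-horizon analogue; the matrices \(A, B, K, Q, R\) and \(\gamma \) are hypothetical stand-ins chosen only for illustration, not values from the paper:

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# Hypothetical LQ zero-sum game (infinite-horizon analogue of the paper's
# setting): dx/dt = Ax + Bu + Kw, cost = integral of x'Qx + u'Ru - g^2 w'w.
A = np.array([[0.0, 1.0], [-1.0, -0.5]])
B = np.array([[0.0], [1.0]])
K = np.array([[0.0], [0.5]])
Q = np.eye(2)
R = np.eye(1)
gamma = 2.0

P = np.zeros((2, 2))  # initial value-function guess V^0(x) = x'Px
for j in range(50):
    Lu = np.linalg.solve(R, B.T @ P)   # minimizing player: u = -Lu x
    Lw = (K.T @ P) / gamma**2          # maximizing player: w = +Lw x
    Acl = A - B @ Lu + K @ Lw
    # Policy evaluation (Lyapunov counterpart of the Bellman equation):
    # Acl' P_new + P_new Acl + Q + Lu'R Lu - g^2 Lw'Lw = 0
    Qcl = Q + Lu.T @ R @ Lu - gamma**2 * Lw.T @ Lw
    P_new = solve_continuous_lyapunov(Acl.T, -Qcl)
    if np.linalg.norm(P_new - P) < 1e-10:
        break
    P = P_new

# The converged P should satisfy the game algebraic Riccati equation
res = (A.T @ P + P @ A + Q
       - P @ B @ np.linalg.solve(R, B.T) @ P
       + P @ K @ K.T @ P / gamma**2)
print(np.linalg.norm(res))  # small residual, approximately 0
```

Each pass evaluates the current pair of policies and then improves both players, mirroring the monotone convergence of \(V^{j}\) argued in the proof.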
C Proof of Theorem 3
Proof
Denote
where \(x(t)\) is the corresponding state trajectory under the input \(u^j\), and \(j\) remains unchanged during the whole inner-loop iteration.
When \(t=t_f\), we know from (30) that
Applying (84) at iteration \(i\) and subtracting it from (84) at iteration \(i+1\) gives
When \(t\in [t_0,t_f)\), substituting (83) into (30) gives
Let \(m_1=\zeta ^j\zeta ^{jT}\), \(m_2=\zeta ^j \sigma ^{\mathrm {T}}\left( x\left( t+T\right) \right) \), then \({\hat{W}}_{j}^{i+1}\left( t \right) \) in (86) can be obtained by
Applying (87) at iteration \(i\) and subtracting it from (87) at iteration \(i+1\) shows that
Therefore, we can obtain
According to (85) and (89), it can be proved by mathematical induction that
which gives
Denote \(t=t_f-(n-1)T\), \(n\ge 1\).

1) When \(n=1\), i.e., \(t=t_f\): as noted, \(\sigma \left( x\left( t_f \right) \right) \sigma ^{\mathrm {T}}\left( x\left( t_f \right) \right) \) is positive definite and symmetric, hence its eigenvalues are greater than zero. Denote \(T_{ad}=I-c_2 \sigma \left( x\left( t_f \right) \right) \sigma ^{\mathrm {T}}\left( x\left( t_f \right) \right) \) and \(x_w={\hat{W}}_{j}^{i}\left( t_f \right) -{\hat{W}}_{j}^{i-1}\left( t_f \right) \). Then (85) becomes

$$\begin{aligned} \left\| {\hat{W}}_{j}^{i+1}\left( t_f \right) -{\hat{W}}_{j}^{i}\left( t_f \right) \right\| = \left\| T_{ad}x_w \right\| =\sqrt{x_w^{\mathrm {T}}T_{ad}^{\mathrm {T}}T_{ad}x_w}. \end{aligned}$$ (92)

Since \(c_2>0\) can be selected to make the eigenvalues of \(T_{ad}\) belong to \([0,1)\), \(T_{ad}\) can be orthogonally diagonalized as \(T_{ad}=M_{ad}^{\mathrm {T}}D_{ad}M_{ad}\) with an orthogonal matrix \(M_{ad}\) and a diagonal matrix \(D_{ad}\) whose entries are the eigenvalues of \(T_{ad}\), all in \([0,1)\). Let \(\varepsilon \) be the maximum of these eigenvalues, so that \(\varepsilon \in (0,1)\). Therefore,

$$\begin{aligned} D_{ad}^2 \le \varepsilon ^2 I, \qquad M_{ad}^{\mathrm {T}}M_{ad}=I, \end{aligned}$$ (93)

and hence

$$\begin{aligned} \left\| {\hat{W}}_{j}^{i+1}\left( t_f \right) -{\hat{W}}_{j}^{i}\left( t_f \right) \right\|&= \sqrt{x_w^{\mathrm {T}}T_{ad}^{\mathrm {T}}T_{ad}x_w} =\sqrt{x_w^{\mathrm {T}}M_{ad}^{\mathrm {T}}D_{ad}^2 M_{ad}x_w}\\&\le \varepsilon \sqrt{x_w^{\mathrm {T}}M_{ad}^{\mathrm {T}}M_{ad}x_w} =\varepsilon \sqrt{x_w^{\mathrm {T}}x_w} = \varepsilon \left\| {\hat{W}}_{j}^{i}\left( t_f \right) -{\hat{W}}_{j}^{i-1}\left( t_f \right) \right\| . \end{aligned}$$ (94)

Therefore,

$$\begin{aligned} \lim _{i \rightarrow \infty }\left\| {\hat{W}}_{j}^{i+1}\left( t_f \right) -{\hat{W}}_{j}^{i}\left( t_f \right) \right\| =0. \end{aligned}$$ (95)

2) When \(n=k\), i.e., \(t=t_f-(k-1)T\), \(k\ge 1\), assume (91) holds for such \(t\), that is,

$$\begin{aligned} \lim _{i \rightarrow \infty }\left\| {\hat{W}}_{j}^{i+1}\left( t_f+T-kT \right) -{\hat{W}}_{j}^{i}\left( t_f+T-kT \right) \right\| =0. \end{aligned}$$ (96)

3) When \(n=k+1\), i.e., \(t=t_f-kT\), \(k\ge 1\), we know from (89) and (96) that

$$\begin{aligned} \left\| {\hat{W}}_{j}^{i+1}\left( t \right) -{\hat{W}}_{j}^{i}\left( t \right) \right\| \le \left\| (I-c_1 m_1)\left( {\hat{W}}_{j}^{i}\left( t \right) -{\hat{W}}_{j}^{i-1}\left( t \right) \right) \right\| . \end{aligned}$$ (97)

Similar to the manipulations (92)–(95), (97) becomes

$$\begin{aligned} \left\| {\hat{W}}_{j}^{i+1}\left( t \right) -{\hat{W}}_{j}^{i}\left( t \right) \right\| \le \varepsilon \left\| {\hat{W}}_{j}^{i}\left( t \right) -{\hat{W}}_{j}^{i-1}\left( t \right) \right\| . \end{aligned}$$ (98)

Therefore, \(\left\| {\hat{W}}_{j}^{i+1}\left( t \right) -{\hat{W}}_{j}^{i}\left( t \right) \right\| \) converges to 0 as \(i\rightarrow \infty \); that is, \({\hat{W}}_{j}^{i+1}\left( t \right) \) converges to an optimal solution \({\hat{W}}_{j}^{*}(t)\). The overall residual error \(E\) also goes to 0, so \(e_1^j(t)\) and \(e_2^j(t_f)\) go to 0. Hence, \({\hat{W}}_{j}^{*}(t)\) gives the optimal solution \(V^{j}\) in (17) and (18).
\(\square \)
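The contraction argument of (92)–(98) can be checked numerically: for any symmetric positive-definite Gram matrix and a gain chosen so the eigenvalues of the update matrix lie in \([0,1)\), successive weight differences shrink geometrically. A small sketch, with a hypothetical random Gram matrix standing in for \(\sigma \sigma ^{\mathrm {T}}\):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical symmetric positive-definite Gram matrix standing in for
# sigma(x(t_f)) sigma'(x(t_f)) in (85)
M = rng.standard_normal((4, 4))
Sigma = M @ M.T + 0.1 * np.eye(4)

# Pick c2 so the eigenvalues of T_ad = I - c2*Sigma lie in [0, 1)
c2 = 1.0 / np.linalg.eigvalsh(Sigma).max()
T_ad = np.eye(4) - c2 * Sigma
eps = np.abs(np.linalg.eigvalsh(T_ad)).max()  # the epsilon of (93)-(94)

# Successive weight differences contract: ||T_ad x|| <= eps * ||x||
x = rng.standard_normal(4)  # stands in for W^i - W^{i-1}
norms = [np.linalg.norm(x)]
for _ in range(30):
    x = T_ad @ x
    norms.append(np.linalg.norm(x))
assert all(n1 <= eps * n0 + 1e-12 for n0, n1 in zip(norms, norms[1:]))
```

Because \(T_{ad}\) is symmetric, its spectral norm equals its largest eigenvalue magnitude, which is exactly the \(\varepsilon < 1\) used in the proof.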
D Proof of Theorem 4
Proof
The Z-function for \(V^j\) at the time instant \(t\) can be expressed as
Since \(T\) is small and \(r\left( x,u,w \right) =Q(x)+u^{\mathrm {T}}Ru-\gamma ^2w^{\mathrm {T}}w\), (99) can be evaluated as
In (100), the current state \(x(t)\) at the instant \(t\) and the value function \(V^j\) are given; the variables remaining to be optimized are \(u(t)\) and \(w(t)\). Therefore, (100) can be decomposed into three parts:
In (101), \(Z^j_c\) is the constant term, independent of \(u(t)\) and \(w(t)\); \(Z^j_{o_1}\) is the first-order term, linear in \(u(t)\) and \(w(t)\); and \(Z^j_{o_2}\) is the second-order term, composed of quadratic terms in \(u(t)\) and in \(w(t)\). Note that (101) explains why the basis of \(Z^j\) is selected as \(\sigma _a \left( u,w \right) \) in (39).
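Because of this constant-linear-quadratic structure, identifying the Z-function parameters amounts to an ordinary least-squares fit over the basis \([1,u,w,u^2,w^2]\); probing with test signals recovers the coefficients exactly in the noise-free case. A minimal scalar-input sketch, with hypothetical "true" coefficients chosen only to exercise the fit:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical "true" Z-function with the structure of (101), scalar u and w:
# Z(u, w) = c0 + a*u + b*w + p*u^2 - q*w^2
c0, a, b, p, q = 0.7, 1.2, -0.4, 0.5, 0.9

def Z_true(u, w):
    return c0 + a * u + b * w + p * u**2 - q * w**2

# Probe with test signals and fit over the basis [1, u, w, u^2, w^2]
u = rng.uniform(-2.0, 2.0, 200)
w = rng.uniform(-2.0, 2.0, 200)
Phi = np.column_stack([np.ones_like(u), u, w, u**2, w**2])
W_a, *_ = np.linalg.lstsq(Phi, Z_true(u, w), rcond=None)
print(W_a)  # approximately [0.7, 1.2, -0.4, 0.5, -0.9]
```

The fitted vector plays the role of the identified \(W_a\): its first entry is the constant part, the next two the linear parts, and the last two the quadratic parts.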
The parameters of \(W_a(t)\) in (39) are identified using test signals as
where \(W_{a,0}\in {{\mathbb {R}}^{1}}\), \(W_{a,u^1}\in {{\mathbb {R}}^{m}}\), \(W_{a,w^1}\in {{\mathbb {R}}^{q}}\), \(W_{a,u^2}\in {{\mathbb {R}}^{m}}\), and \(W_{a,w^2}\in {{\mathbb {R}}^{q}}\) correspond to the dimensions of the terms of \(\sigma _a \left( u,w \right) \) in (39).
Based on (102), the value of the identified \(W_{a,0}\) is actually \(Z^j_c\). The parameters corresponding to the basis \(\sigma _a^1(w)=[w_1,w_2,\cdots ,w_q]\in {{\mathbb {R}}^{q}}\) are identified as \(W_{a,w^1}=T\big (\frac{\partial V^{j}}{\partial x}\big )^{\mathrm {T}}k\in {{\mathbb {R}}^{q}}\). The parameters corresponding to the basis \(\sigma _a^2(w)=[w_1^2,w_2^2, \cdots ,w_q^2]\in {{\mathbb {R}}^{q}}\) are identified as \(W_{a,w^2}=-T\gamma ^2[1,1,\cdots ,1]\in {{\mathbb {R}}^{q}}\). To determine the worst disturbance input \(w^{j+1}\) by (44), we have
and then
where \(\mathrm {diag}(W_{a,w^2})\in {{\mathbb {R}}^{q\times q}}\) denotes the diagonal matrix with all the elements coming from \(W_{a,w^2}\).
Without loss of generality, we regard \(R\in {{\mathbb {R}}^{m\times m}}\) as a diagonal matrix. Then, based on (101) and (102), the parameters corresponding to the basis \(\sigma _a^1(u)=[u_1,u_2,\cdots ,u_m]\in {{\mathbb {R}}^{m}}\) are identified as \(W_{a,u^1}=T\big (\frac{\partial V^{j}}{\partial x}\big )^{\mathrm {T}}g\in {{\mathbb {R}}^{m}}\). The parameters corresponding to the basis \(\sigma _a^2(u)=[u_1^2,u_2^2,\cdots ,u_m^2]\in {{\mathbb {R}}^{m}}\) are identified as \(W_{a,u^2}=T[R_{11},R_{22},\cdots , R_{mm}]\in {{\mathbb {R}}^{m}}\). To determine the optimal control input \(u^{j+1}\) by (44), we have
and then
where \(\mathrm {diag}(W_{a,u^2})\in {{\mathbb {R}}^{m\times m}}\) denotes the diagonal matrix with all the elements coming from \(W_{a,u^2}\).
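Once the coefficients are identified, the extremizing inputs follow by completing the square elementwise, as in the derivations above. A sketch with hypothetical identified coefficients (the signs of the quadratic parts mirror \(TR>0\) for \(u\) and \(-T\gamma ^2<0\) for \(w\)):

```python
import numpy as np

# Hypothetical identified Z-function coefficients (structure of (102)):
# Z(u, w) = W0 + W_u1'u + W_w1'w + u'diag(W_u2)u + w'diag(W_w2)w
W_u1 = np.array([0.8, -0.3])   # linear-in-u coefficients
W_u2 = np.array([0.5, 0.4])    # positive: u-quadratic part (from T*R)
W_w1 = np.array([0.6])         # linear-in-w coefficients
W_w2 = np.array([-0.9])        # negative: w-quadratic part (from -T*gamma^2)

# Completing the square: the minimizing u and maximizing w are the
# stationary points  u* = -1/2 diag(W_u2)^{-1} W_u1,
#                    w* = -1/2 diag(W_w2)^{-1} W_w1.
u_star = -0.5 * W_u1 / W_u2
w_star = -0.5 * W_w1 / W_w2

# Verify stationarity: the gradient W_u1 + 2 diag(W_u2) u vanishes at u*
assert np.allclose(W_u1 + 2 * W_u2 * u_star, 0)
assert np.allclose(W_w1 + 2 * W_w2 * w_star, 0)
print(u_star, w_star)  # [-0.8, 0.375] and [0.3333...]
```

Since \(W_{u^2}>0\) the stationary point in \(u\) is a minimum, and since \(W_{w^2}<0\) the stationary point in \(w\) is a maximum, matching the saddle-point roles of the two players.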
Unlike Algorithm 2, where the input policy is updated through the feedback gain between the input and the state, the input in Algorithm 3 is calculated directly from the identified Z-function parameters.
Now we apply mathematical induction to prove that the optimal control input \(u^{j+1}(t)\) and the worst disturbance input \(w^{j+1}(t)\) in step 3 of Algorithm 3 are equivalent to those in step 3 of Algorithm 2 over the whole time interval \([t_0,t_f]\).
Denote \(t=t_0+(n-1)T\).

(1) When \(n=1\), \(t=t_0\), the system state is given as \(x_0\), which is the same in Algorithms 2 and 3. Therefore, the values of \(w^{j+1}\) in (104) and \(u^{j+1}\) in (106) for Algorithm 3 are the same as those for Algorithm 2. Based on the same state \(x_0\), applying the same input policy to system (1), the system state \(x(t_0+T)\) at the next sampling instant is also the same in Algorithms 2 and 3.

(2) When \(n=k\), \(t=t_0+(k-1)T\), the state \(x(t)\) at this time instant is assumed to be the same in Algorithms 2 and 3. Following the same procedure, the corresponding input policy and the state \(x(t_0+kT)\) are also the same in the two algorithms.

(3) When \(n=k+1\), \(t=t_0+kT\), the state \(x(t)\) at this instant is the same based on the conclusion for \(t=t_0+(k-1)T\). Following the same procedure, the corresponding input policy and the state \(x(t_0+(k+1)T)\) are also the same in Algorithms 2 and 3.
Therefore, the input policy \(u^{j+1}(t)\), \(w^{j+1}(t)\) and the corresponding system trajectories in Algorithms 2 and 3 are the same over the whole time interval \([t_0,t_f]\). \(\square \)
Cite this article
Chen, Z., Xue, W., Li, N. et al. A novel Z-function-based completely model-free reinforcement learning method to finite-horizon zero-sum game of nonlinear system. Nonlinear Dyn 107, 2563–2582 (2022). https://doi.org/10.1007/s11071-021-07049-z