
Reinforcement learning-based optimal control of unknown constrained-input nonlinear systems using simulated experience

  • Original Paper
  • Published in Nonlinear Dynamics

Abstract

Reinforcement learning (RL) provides a way to approximately solve optimal control problems. However, solving such problems online requires a method that guarantees convergence to the optimal policy while also keeping the system stable during learning. In this study, we develop an online RL-based optimal control framework for input-constrained nonlinear systems. The design includes two new model identifiers that learn the system's drift dynamics: a slow identifier that simulates experience to support convergence to the solution of the optimal control problem, and a fast identifier that keeps the system stable during the learning phase. The approach is a critic-only design in which a new fast estimation law is developed for the critic network. A Lyapunov-based analysis shows that the estimated control policy converges to the optimal one, and simulation studies demonstrate the effectiveness of the developed control scheme.


Data availability

Enquiries about data availability should be directed to the authors.

Notes

  1. The accuracy of \(\hat{{\textbf{f}}}({\textbf{x}})\) can be monitored through the error signal \(\tilde{{\textbf{x}}}\), and its output can be used in Eq. (32) instead of \({\textbf{f}}\) when the error becomes negligible.

References

  1. Balakrishnan, S.N., Biega, V.: Adaptive-critic-based neural networks for aircraft optimal control. J. Guid. Control Dyn. 19(4), 893–898 (1996)

  2. He, P., Jagannathan, S.: Reinforcement learning neural-network-based controller for nonlinear discrete-time systems with input constraints. IEEE Trans. Syst. Man Cybern. Part B Cybern. 37(2), 425–436 (2007)

  3. Dierks, T., Jagannathan, S.: Optimal control of affine nonlinear continuous-time systems. In: Proceedings of the 2010 American Control Conference, pp. 1568–1573 (2010)

  4. Doya, K.: Reinforcement learning in continuous time and space. Neural Comput. 12(1), 219–245 (2000)

  5. Vamvoudakis, K.G., Lewis, F.L.: Online actor-critic algorithm to solve the continuous-time infinite horizon optimal control problem. Automatica 46(5), 878–888 (2010)

  6. Vrabie, D., Pastravanu, O., Abu-Khalaf, M., Lewis, F.L.: Adaptive optimal control for continuous-time linear systems based on policy iteration. Automatica 45(2), 477–484 (2009)

  7. Bhasin, S., Kamalapurkar, R., Johnson, M., Vamvoudakis, K.G., Lewis, F.L., Dixon, W.E.: A novel actor-critic-identifier architecture for approximate optimal control of uncertain nonlinear systems. Automatica 49(1), 82–92 (2013)

  8. Modares, H., Lewis, F.L.: Optimal tracking control of nonlinear partially-unknown constrained-input systems using integral reinforcement learning. Automatica 50(7), 1780–1792 (2014)

  9. Modares, H., Lewis, F.L., Naghibi-Sistani, M.-B.: Integral reinforcement learning and experience replay for adaptive optimal control of partially-unknown constrained-input continuous-time systems. Automatica 50(1), 193–202 (2014)

  10. Kamalapurkar, R., Andrews, L., Walters, P., Dixon, W.E.: Model-based reinforcement learning for infinite-horizon approximate optimal tracking. IEEE Trans. Neural Netw. Learn. Syst. 28(3), 753–758 (2016)

  11. Kamalapurkar, R., Walters, P., Dixon, W.E.: Concurrent learning-based approximate optimal regulation. In: 52nd IEEE Conference on Decision and Control, pp. 6256–6261 (2013)

  12. Zhao, B., Liu, D., Alippi, C.: Sliding-mode surface-based approximate optimal control for uncertain nonlinear systems with asymptotically stable critic structure. IEEE Trans. Cybern. 51(6), 2858–2869 (2020)

  13. Abu-Khalaf, M., Lewis, F.L.: Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach. Automatica 41(5), 779–791 (2005)

  14. Guo, X., Yan, W., Cui, R.: Integral reinforcement learning-based adaptive NN control for continuous-time nonlinear MIMO systems with unknown control directions. IEEE Trans. Syst. Man Cybern. Syst. 50(11), 4068–4077 (2019)

  15. Modares, H., Lewis, F.L., Naghibi-Sistani, M.-B.: Online solution of nonquadratic two-player zero-sum games arising in the H∞ control of constrained input systems. Int. J. Adapt. Control Signal Process. 28(3–5), 232–254 (2014)

  16. Yang, Y., Vamvoudakis, K.G., Modares, H., Yin, Y., Wunsch, D.C.: Safe intermittent reinforcement learning with static and dynamic event generators. IEEE Trans. Neural Netw. Learn. Syst. 31(12), 5441–5455 (2020)

  17. Mishra, A., Ghosh, S.: Variable gain gradient descent-based reinforcement learning for robust optimal tracking control of uncertain nonlinear system with input constraints. Nonlinear Dyn., pp. 2195–2214 (2022)

  18. Modares, H., Lewis, F.L., Naghibi-Sistani, M.-B.: Adaptive optimal control of unknown constrained-input systems using policy iteration and neural networks. IEEE Trans. Neural Netw. Learn. Syst. 24(10), 1513–1525 (2013)

  19. Huo, Y., Wang, D., Qiao, J., Li, M.: Adaptive critic design for nonlinear multi-player zero-sum games with unknown dynamics and control constraints. Nonlinear Dyn., pp. 1–13 (2023)

  20. Slotine, J.-J.E., Li, W.: Applied Nonlinear Control, vol. 199. Prentice Hall, Englewood Cliffs, NJ (1991)

  21. Sastry, S.: Nonlinear Systems: Analysis, Stability, and Control, vol. 10. Springer Science and Business Media, Berlin (2013)

  22. Dong, H., Zhao, X., Luo, B.: Optimal tracking control for uncertain nonlinear systems with prescribed performance via critic-only ADP. IEEE Trans. Syst. Man Cybern. Syst. (2020)

  23. Lv, Y., Ren, X., Na, J.: Online optimal solutions for multi-player nonzero-sum game with completely unknown dynamics. Neurocomputing 283, 87–97 (2018)

  24. Wang, W., Wen, C.: Adaptive actuator failure compensation control of uncertain nonlinear systems with guaranteed transient performance. Automatica 46(12), 2082–2091 (2010)

  25. Xian, B., Dawson, D.M., de Queiroz, M.S., Chen, J.: A continuous asymptotic tracking control strategy for uncertain nonlinear systems. IEEE Trans. Autom. Control 49(7), 1206–1211 (2004)

  26. de Queiroz, M.S., Hu, J., Dawson, D.M., Burg, T., Donepudi, S.R.: Adaptive position/force control of robot manipulators without velocity measurements: theory and experimentation. IEEE Trans. Syst. Man Cybern. Part B 27(5), 796–809 (1997)

  27. Chowdhary, G., Johnson, E.: Concurrent learning for convergence in adaptive control without persistency of excitation. In: 49th IEEE Conference on Decision and Control, pp. 3674–3679 (2010)

  28. Chowdhary, G.V., Johnson, E.N.: Theory and flight-test validation of a concurrent-learning adaptive controller. J. Guid. Control Dyn. 34(2), 592–607 (2011)

  29. Vahidi-Moghaddam, A., Mazouchi, M., Modares, H.: Memory-augmented system identification with finite-time convergence. IEEE Control Syst. Lett. 5(2), 571–576 (2020)

  30. Spong, M.W.: On the robust control of robot manipulators. IEEE Trans. Autom. Control 37(11), 1782–1786 (1992)

  31. Hornik, K., Stinchcombe, M., White, H.: Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks. Neural Netw. 3(5), 551–560 (1990)

  32. Chong, E.K.P., Zak, S.H.: An Introduction to Optimization, vol. 75. Wiley, Hoboken (2013)

  33. Khalil, H.K.: Nonlinear Systems, 3rd edn. Prentice-Hall, New Jersey (1996)

  34. Patre, P.: Lyapunov-based robust and adaptive control of nonlinear systems using a novel feedback structure. PhD thesis, University of Florida (2009)

  35. Polycarpou, M.M., Ioannou, P.A.: A robust adaptive nonlinear control design. In: Proceedings of the 1993 American Control Conference, pp. 1365–1369 (1993)


Funding

This work is based on results obtained from project JPNP20006, commissioned by the New Energy and Industrial Technology Development Organization (NEDO), and was partially supported by JSPS KAKENHI Grant Number JP21H03527.

Author information

Corresponding author

Correspondence to Hamed Jabbari Asl.

Ethics declarations

Conflicts of interest

The authors have not disclosed any competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Proof of Theorem 1

Consider a domain \({\mathcal {D}}\subseteq {\mathbb {R}}^{2n+2}\) containing \({\textbf{y}}=0\), where \({\textbf{y}}\in {\mathbb {R}}^{2n+2}\) is defined as \({\textbf{y}}\triangleq [\varvec{\omega }^{\top }\;\sqrt{P}\;\sqrt{Q_f}]^{\top }\), in which \(Q_f(t)\in {\mathbb {R}}^{+}\) is defined as \(Q_f(t)\triangleq \frac{\alpha }{2\gamma }\text {tr}(\tilde{{\textbf{W}}}_f^{\top }\tilde{{\textbf{W}}}_f)\), with \(\text {tr}(\cdot )\) denoting the trace of a matrix. The function \(P(t)\in {\mathbb {R}}\) is given by

$$\begin{aligned} P(t)\triangleq \beta _1\sum _{i=1}^{n}|\varsigma _i(0)|-\varvec{\varsigma }(0)^{\top }(\dot{\varvec{\epsilon }}_0(0)+{\textbf{N}}_B(0))-\int _0^t {\mathcal {K}}({\bar{t}})\text {d}{\bar{t}}, \end{aligned}$$
(48)

where function \({\mathcal {K}}(t)\in {\mathbb {R}}\) is defined as

$$\begin{aligned} {\mathcal {K}}(t)\triangleq {\textbf{e}}_{\varsigma }^{\top }\left( \dot{\varvec{\epsilon }}_0-\beta _1\text {sgn}(\varvec{\varsigma })\right) +\dot{\varvec{\varsigma }}^{\top }{\textbf{N}}_B-\beta _2\Vert \varvec{\varsigma }\Vert ^2, \end{aligned}$$
(49)

in which \(\beta _2\in {\mathbb {R}}^{+}\) is a constant. Provided the conditions of Eqs. (30) and (31) are satisfied, it can be shown that \(P(t)\ge 0\); see the proof in Appendix B. Now consider the following continuously differentiable positive-definite function:

$$\begin{aligned} {\mathcal {V}}=\frac{1}{2}\varvec{\varsigma }^{\top }\varvec{\varsigma }+\frac{1}{2}{\textbf{e}}_{\varsigma }^{\top }{\textbf{e}}_{\varsigma }+P+Q_f. \end{aligned}$$
(50)

It can be concluded that

$$\begin{aligned} \psi _1({\textbf{y}})\le {\mathcal {V}}({\textbf{y}})\le \psi _2({\textbf{y}}), \end{aligned}$$
(51)

where positive-definite strictly increasing functions \(\psi _1,\psi _2\in {\mathbb {R}}^{+}\) are defined as \(\psi _1\triangleq 0.5\Vert {\textbf{y}}\Vert ^2\) and \(\psi _2\triangleq \Vert {\textbf{y}}\Vert ^2\). Using Eqs. (24), (25), and the time derivative of Eq. (48), the time derivative of \({\mathcal {V}}\) can be developed as

$$\begin{aligned} \dot{{\mathcal {V}}}=&-(\alpha -\beta _2)\varvec{\varsigma }^{\top }\varvec{\varsigma }-k_f{\textbf{e}}_{\varsigma }^{\top }{\textbf{e}}_{\varsigma }+{\textbf{e}}_{\varsigma }^{\top }\tilde{{\textbf{N}}}+{\textbf{e}}_{\varsigma }^{\top }{\textbf{N}}_B\\&-\dot{\varvec{\varsigma }}^{\top }{\textbf{N}}_B-\frac{\alpha }{\gamma }\text {tr}(\tilde{{\textbf{W}}}_f^{\top }\dot{\hat{{\textbf{W}}}}_f). \end{aligned}$$

Therefore, using Eq. (24) and the definition of \({\textbf{N}}_B\), and knowing that \({\textbf{a}}^{\top }{\textbf{b}}=\text {tr}({\textbf{b}}{\textbf{a}}^{\top })\), \(\forall {\textbf{a}},{\textbf{b}}\in {\mathbb {R}}^{n}\), we can write

$$\begin{aligned} \dot{{\mathcal {V}}}=&-(\alpha -\beta _2)\varvec{\varsigma }^{\top }\varvec{\varsigma }-k_f{\textbf{e}}_{\varsigma }^{\top }{\textbf{e}}_{\varsigma }+{\textbf{e}}_{\varsigma }^{\top }\tilde{{\textbf{N}}}\\&-\frac{\alpha }{\gamma }\text {tr}(\tilde{{\textbf{W}}}_f^{\top }\dot{\hat{{\textbf{W}}}}_f-\gamma \tilde{{\textbf{W}}}_f^{\top }\dot{\varvec{\sigma }}_f\varvec{\varsigma }^{\top }\varvec{\Upsilon }_0). \end{aligned}$$

Considering the adaptation law (Eq. 29) and using the properties of the projection operator, we have

$$\begin{aligned} -\frac{\alpha }{\gamma }\text {tr}(\tilde{{\textbf{W}}}_f^{\top }\dot{\hat{{\textbf{W}}}}_f-\gamma \tilde{{\textbf{W}}}_f^{\top }\dot{\varvec{\sigma }}_f\varvec{\varsigma }^{\top }\varvec{\Upsilon }_0)\le 0. \end{aligned}$$
(52)

Therefore, using Eqs. (26) and (52), an upper bound for \(\dot{{\mathcal {V}}}\) can be written as

$$\begin{aligned} \dot{{\mathcal {V}}}\le -(\alpha -\beta _2)\Vert \varvec{\varsigma }\Vert ^2-k_f\Vert {\textbf{e}}_{\varsigma }\Vert ^2+\rho (\Vert \varvec{\omega }\Vert )\Vert \varvec{\omega }\Vert \Vert {\textbf{e}}_{\varsigma }\Vert . \end{aligned}$$

By splitting \(k_f\) into adjustable positive gains \(k_{f1},k_{f2}\in {\mathbb {R}}^{+}\) as \(k_f=k_{f1}+k_{f2}\), we can further bound \(\dot{{\mathcal {V}}}\) as follows if the condition in Eq. (31) is satisfied:

$$\begin{aligned} \dot{{\mathcal {V}}}\le -\beta _3\Vert \varvec{\omega }\Vert ^2-k_{f2}\Vert {\textbf{e}}_{\varsigma }\Vert ^2+\rho (\Vert \varvec{\omega }\Vert )\Vert \varvec{\omega }\Vert \Vert {\textbf{e}}_{\varsigma }\Vert , \end{aligned}$$
(53)

where \(\beta _3\) is defined as \(\beta _3\triangleq \min \{(\alpha -\beta _2),k_{f1}\}\). Completing the square for the last two terms of Eq. (53), we obtain \(\dot{{\mathcal {V}}}\le -\left( \beta _3-\rho ^2(\Vert \varvec{\omega }\Vert )/(4k_{f2})\right) \Vert \varvec{\omega }\Vert ^2\). Therefore, we have \(\dot{{\mathcal {V}}}\le -c\Vert \varvec{\omega }\Vert ^2\) for some positive constant \(c\in {\mathbb {R}}^{+}\) in the following domain:

$$\begin{aligned} {\mathcal {D}}\triangleq \left\{ {\textbf{y}}\in {\mathbb {R}}^{2n+2}\;|\;\Vert {\textbf{y}}\Vert \le \rho ^{-1}(2\sqrt{\beta _3 k_{f2}})\right\} , \end{aligned}$$

which, considering Eq. (51), indicates that \({\mathcal {V}}\in {\mathcal {L}}_{\infty }\) in \({\mathcal {D}}\). Therefore, from Eq. (50), we have \(\varvec{\varsigma }\), \({\textbf{e}}_{\varsigma }\), P, \(Q_f\), and hence \(\tilde{{\textbf{W}}}_f\in {\mathcal {L}}_{\infty }\) in \({\mathcal {D}}\). Since \(\varvec{\varsigma }\) is bounded, we can conclude that the prescribed performance condition is satisfied and \(\tilde{{\textbf{x}}}\in {\mathcal {L}}_{\infty }\) in \({\mathcal {D}}\).

To analyze the convergence of the signals, we need to show that \(\varvec{\omega }\) is uniformly continuous. From Eq. (24), we can see that \(\dot{\varvec{\varsigma }}\in {\mathcal {L}}_{\infty }\) in \({\mathcal {D}}\). Therefore, since \(\varvec{\Upsilon }\) is bounded, from Eq. (19) we have \(\dot{\tilde{{\textbf{x}}}}\in {\mathcal {L}}_{\infty }\) in \({\mathcal {D}}\). Consequently, from Eq. (22), we conclude that \(\varvec{\nu }\in {\mathcal {L}}_{\infty }\) in \({\mathcal {D}}\). Since \(\dot{\tilde{{\textbf{x}}}}\) is bounded, from the definition of \(\varvec{\Upsilon }\), we also see that \(\dot{\varvec{\Upsilon }}\in {\mathcal {L}}_{\infty }\) in \({\mathcal {D}}\). From Eq. (29), \(\dot{\hat{{\textbf{W}}}}_f\in {\mathcal {L}}_{\infty }\) in \({\mathcal {D}}\), and we also have \(\dot{\tilde{\varvec{\sigma }}}_f\in {\mathcal {L}}_{\infty }\). From these results and Eq. (25), we can see that \(\dot{{\textbf{e}}}_{\varsigma }\in {\mathcal {L}}_{\infty }\) in \({\mathcal {D}}\). Then, since \(\dot{\varvec{\varsigma }},\dot{{\textbf{e}}}_{\varsigma }\in {\mathcal {L}}_{\infty }\), we conclude that \(\dot{\varvec{\omega }}\in {\mathcal {L}}_{\infty }\) in \({\mathcal {D}}\), which indicates that \(\varvec{\omega }\) is uniformly continuous in \({\mathcal {D}}\).

Now consider the region \({\mathcal {S}}\subset {\mathcal {D}}\) defined as \({\mathcal {S}}\triangleq \{{\textbf{y}}\in {\mathcal {D}}\;|\;\psi _2({\textbf{y}})< \frac{1}{2}(\rho ^{-1}(2\sqrt{\beta _3 k_{f2}}))^2\}\). Based on the above results, Theorem 8.4 of [33] can be used to conclude that \(\Vert \varvec{\omega }\Vert \rightarrow 0\) as \(t\rightarrow \infty \) for all \({\textbf{y}}(0)\in {\mathcal {S}}\). According to Eq. (17), \(T_i\rightarrow 0\) as \(\varsigma _i\rightarrow 0\), and from Eq. (14), \(\Vert \tilde{{\textbf{x}}}\Vert \rightarrow 0\). Therefore, from Eqs. (19) and (24), we conclude that \(\Vert \dot{\tilde{{\textbf{x}}}}\Vert \rightarrow 0\) as \(t\rightarrow \infty \). The convergence of \(\dot{\tilde{{\textbf{x}}}}\) to zero indicates that \(\hat{{\textbf{f}}}\), given by Eq. (21), converges to \({\textbf{f}}\).

Appendix B: Proof of \(P(t)\ge 0\)

Here we show that \(P(t)\ge 0\). The proof follows the same steps as in a prior study [34]. Integrating both sides of Eq. (49), we have

$$\begin{aligned} \int _0^t {\mathcal {K}}({\bar{t}})\text {d}{\bar{t}}=&\int _0^t {\textbf{e}}_{\varsigma }^{\top }\left( \dot{\varvec{\epsilon }}_0-\beta _1\text {sgn}(\varvec{\varsigma })\right) \text {d}{\bar{t}}+\int _0^t \dot{\varvec{\varsigma }}^{\top }{\textbf{N}}_B\text {d}{\bar{t}}\\&-\int _0^t \beta _2\Vert \varvec{\varsigma }\Vert ^2\text {d}{\bar{t}}. \end{aligned}$$

Using Eq. (24), we have

$$\begin{aligned} \int _0^t {\mathcal {K}}({\bar{t}})\text {d}{\bar{t}}&=\int _0^t \frac{\text {d}\varvec{\varsigma }^{\top }}{\text {d}{\bar{t}}}(\dot{\varvec{\epsilon }}_0+{\textbf{N}}_B)\text {d}{\bar{t}}-\int _0^t \frac{\text {d}\varvec{\varsigma }^{\top }}{\text {d}{\bar{t}}}\beta _1\text {sgn}(\varvec{\varsigma })\text {d}{\bar{t}}\nonumber \\&+\int _0^t \alpha \varvec{\varsigma }^{\top }\left( \dot{\varvec{\epsilon }}_0-\beta _1\text {sgn}(\varvec{\varsigma })\right) \text {d}{\bar{t}}-\int _0^t \beta _2\Vert \varvec{\varsigma }\Vert ^2\text {d}{\bar{t}}. \end{aligned}$$
(54)

Integrating the first integral in Eq. (54) by parts yields

$$\begin{aligned}&\int _0^t {\mathcal {K}}({\bar{t}})\text {d}{\bar{t}} = \varvec{\varsigma }^{\top }(\dot{\varvec{\epsilon }}_0+{\textbf{N}}_B)\bigg |_0^t-\int _0^t\varvec{\varsigma }^{\top }\frac{\text {d}(\dot{\varvec{\epsilon }}_0+{\textbf{N}}_B)}{\text {d}{\bar{t}}}\text {d}{\bar{t}}\\&\quad \quad -\int _0^t \frac{\text {d}\varvec{\varsigma }^{\top }}{\text {d}{\bar{t}}}\beta _1\text {sgn}(\varvec{\varsigma })\text {d}{\bar{t}}-\int _0^t \beta _2\Vert \varvec{\varsigma }\Vert ^2\text {d}{\bar{t}}\\&\quad \quad +\int _0^t \alpha \varvec{\varsigma }^{\top }\left( \dot{\varvec{\epsilon }}_0-\beta _1\text {sgn}(\varvec{\varsigma })\right) \text {d}{\bar{t}}\\&\qquad =\varvec{\varsigma }^{\top }(\dot{\varvec{\epsilon }}_0+{\textbf{N}}_B)-\varvec{\varsigma }(0)^{\top }(\dot{\varvec{\epsilon }}_0(0)+{\textbf{N}}_B(0))\\&\qquad +\int _0^t \alpha \varvec{\varsigma }^{\top }\left( \dot{\varvec{\epsilon }}_0-\frac{1}{\alpha }\frac{\text {d}(\dot{\varvec{\epsilon }}_0+{\textbf{N}}_B)}{\text {d}{\bar{t}}}-\beta _1\text {sgn}(\varvec{\varsigma })\right) \text {d}{\bar{t}}\\&\qquad -\beta _1\sum _{i=1}^{n}|\varsigma _i|+\beta _1\sum _{i=1}^{n}|\varsigma _i(0)|-\int _0^t \beta _2\Vert \varvec{\varsigma }\Vert ^2\text {d}{\bar{t}}. \end{aligned}$$

Knowing that \(\sum _{i=1}^{n}|\varsigma _i|\ge \Vert \varvec{\varsigma }\Vert \), and using Eqs. (27) and (28), we can write the following inequality:

$$\begin{aligned}&\int _0^t {\mathcal {K}}({\bar{t}})\text {d}{\bar{t}} \le (\zeta _1+\zeta _3-\beta _1)\Vert \varvec{\varsigma }\Vert +\beta _1\sum _{i=1}^{n}|\varsigma _i(0)|\\&\qquad \qquad +\int _0^t \alpha (\zeta _1+\frac{\zeta _2+\zeta _4}{\alpha }-\beta _1)\Vert \varvec{\varsigma }\Vert \text {d}{\bar{t}}\\&\qquad \quad +\int _0^t (\zeta _5-\beta _2)\Vert \varvec{\varsigma }\Vert ^2\text {d}{\bar{t}}-\varvec{\varsigma }(0)^{\top }(\dot{\varvec{\epsilon }}_0(0)+{\textbf{N}}_B(0)). \end{aligned}$$

Therefore, if the conditions in Eqs. (30) and (31) are satisfied, we have

$$\begin{aligned} \int _0^t {\mathcal {K}}({\bar{t}})\text {d}{\bar{t}} \le \beta _1\sum _{i=1}^{n}|\varsigma _i(0)|-\varvec{\varsigma }(0)^{\top }(\dot{\varvec{\epsilon }}_0(0)+{\textbf{N}}_B(0)), \end{aligned}$$

which indicates that \(P(t)\ge 0\).

Appendix C: Proof of Theorem 2

Consider the following Lyapunov function:

$$\begin{aligned} L\triangleq V^{*}({\textbf{x}})+L_{W_c}+L_{W_s}, \end{aligned}$$
(55)

where \(L_{W_s}\) is defined in Eq. (36), and \(L_{W_c}\triangleq \frac{1}{2\alpha _c}\tilde{{\textbf{W}}}_c^{\top }\tilde{{\textbf{W}}}_c\), in which \(\tilde{{\textbf{W}}}_c\triangleq {\textbf{W}}_c-\hat{{\textbf{W}}}_c\). Using Eqs. (1), (5), and (42), we have

$$\begin{aligned} {\dot{V}}^{*}&=\frac{\partial V^{*}}{\partial {\textbf{x}}}\dot{{\textbf{x}}} =\nabla V^{*}({\textbf{f}}+{\textbf{g}}\hat{{\textbf{u}}})\\&=\nabla V^{*}({\textbf{f}}+{\textbf{g}}{\textbf{u}}^{*})+\lambda \nabla V^{*}{\textbf{g}}(\tanh ({\textbf{D}}^{*})-\tanh (\hat{{\textbf{D}}})). \end{aligned}$$

Therefore, using Eqs. (7) and (38), \({\dot{V}}^{*}\) can be written as

$$\begin{aligned} {\dot{V}}^{*}=&-Q({\textbf{x}})-U({\textbf{u}}^{*})\nonumber \\&+\lambda ({\textbf{W}}_c^{\top }\nabla \varvec{\sigma }_c+\nabla \epsilon _c^{\top }){\textbf{g}}(\tanh ({\textbf{D}}^{*})-\tanh (\hat{{\textbf{D}}})). \end{aligned}$$
(56)

Defining \(\tilde{{\textbf{u}}}\triangleq \tanh ({\textbf{D}}^{*})-\tanh (\hat{{\textbf{D}}})\), and knowing that

$$\begin{aligned}&U({\textbf{u}}^{*})>0,\\&\lambda \nabla \epsilon _c^{\top }{\textbf{g}}\tilde{{\textbf{u}}} \le \lambda \Vert \nabla \epsilon _c\Vert \Vert {\textbf{g}}\Vert \Vert \tilde{{\textbf{u}}}\Vert \le 2\lambda b_{\epsilon cx}b_g,\\&\lambda {\textbf{W}}_c^{\top }\nabla \varvec{\sigma }_c{\textbf{g}}\tilde{{\textbf{u}}} \le \lambda \Vert {\textbf{W}}_c\Vert \Vert \nabla \varvec{\sigma }_c\Vert \Vert {\textbf{g}}\Vert \Vert \tilde{{\textbf{u}}}\Vert \\&\;\;\quad \qquad \qquad \le 2\lambda \Vert {\textbf{W}}_c\Vert b_{\sigma cx}b_g, \end{aligned}$$

and \(Q({\textbf{x}})>q_{\min }{\textbf{x}}^{\top }{\textbf{x}}\) for some \(q_{\min }\in {\mathbb {R}}^{+}\), an upper bound can be written for \({\dot{V}}^{*}\) as

$$\begin{aligned} {\dot{V}}^{*}\le -q_{\min }\Vert {\textbf{x}}\Vert ^{2}+\kappa _1, \end{aligned}$$
(57)

where \(\kappa _1\triangleq 2\lambda \Vert {\textbf{W}}_c\Vert b_{\sigma cx}b_g+2\lambda b_{\epsilon cx}b_g\).

To develop the time derivative of the second term of the right-hand side of Eq. (55), we first write the Bellman error (Eq. 10) as follows by substituting the gradient of Eq. (41) into Eq. (8):

$$\begin{aligned} \delta _B=Q({\textbf{x}})+U(\hat{{\textbf{u}}})+\hat{{\textbf{W}}}_c^{\top }\nabla \varvec{\sigma }_c\left( \hat{{\textbf{f}}}+{\textbf{g}}\hat{{\textbf{u}}}\right) . \end{aligned}$$
(58)
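To make Eq. (58) concrete, the following minimal Python sketch evaluates this computable Bellman error at one state sample. It is ours, not the authors' code: the two-state system, the polynomial critic basis \(\varvec{\sigma }_c\), the policy form \(\hat{{\textbf{u}}}=-\lambda \tanh (\hat{{\textbf{D}}})\) with \(\hat{{\textbf{D}}}=\frac{1}{2\lambda }{\textbf{R}}^{-1}{\textbf{g}}^{\top }\nabla \varvec{\sigma }_c^{\top }\hat{{\textbf{W}}}_c\), and the closed form of the nonquadratic penalty \(U\) (standard in constrained-input ADP [13, 18]) are illustrative assumptions, since Eqs. (6), (9), and (42) are not reproduced here.

```python
import numpy as np

lam = 1.0    # input saturation bound lambda (illustrative)
r = 1.0      # scalar input weight R (illustrative)

def sigma_c_grad(x):
    # Gradient of an assumed polynomial critic basis
    # sigma_c(x) = [x1^2, x1*x2, x2^2]^T for a two-state system.
    x1, x2 = x
    return np.array([[2*x1, 0.0],
                     [x2,   x1],
                     [0.0, 2*x2]])

def u_hat(Wc_hat, x, g):
    # Constrained policy u = -lam*tanh(D_hat), with the assumed form
    # D_hat = (1/(2*lam*r)) * g^T * grad(sigma_c)^T * Wc_hat.
    D = (g.T @ sigma_c_grad(x).T @ Wc_hat) / (2.0 * lam * r)
    return -lam * np.tanh(D)

def U_penalty(u):
    # Closed form of the nonquadratic penalty for saturated inputs:
    # U(u) = 2*lam*r*u*atanh(u/lam) + lam^2*r*log(1 - (u/lam)^2).
    return (2*lam*r*u*np.arctanh(u/lam)
            + lam**2*r*np.log(1.0 - (u/lam)**2)).item()

def bellman_error(Wc_hat, x, f_hat, g, Q):
    # Eq. (58): delta_B = Q(x) + U(u_hat) + Wc_hat^T grad(sigma_c)(f_hat + g*u_hat)
    u = u_hat(Wc_hat, x, g)
    return Q(x) + U_penalty(u) + Wc_hat @ (sigma_c_grad(x) @ (f_hat + g @ u))

# Example evaluation at one state with placeholder values.
x = np.array([0.5, -0.3])
g = np.array([[0.0], [1.0]])                  # input matrix g(x) (assumed constant)
f_hat = np.array([x[1], -x[0] - 0.5*x[1]])    # fast-identifier output (placeholder)
Wc = np.array([0.5, 0.0, 1.0])                # current critic weights (placeholder)
Q = lambda x: x @ x                           # state penalty Q(x) = x^T x
print(bellman_error(Wc, x, f_hat, g, Q))
```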

From Eq. (40), we have

$$\begin{aligned} Q({\textbf{x}})=-U({\textbf{u}}^{*})-{\textbf{W}}_c^{\top }\nabla \varvec{\sigma }_c\left( {\textbf{f}}+{\textbf{g}}{\textbf{u}}^{*}\right) +\epsilon _{hjb}. \end{aligned}$$

Therefore, substituting this expression into Eq. (58), we obtain

$$\begin{aligned} \delta _{B}=&~U(\hat{{\textbf{u}}})-U({\textbf{u}}^{*})-{\textbf{W}}_c^{\top }\nabla \varvec{\sigma }_c({\textbf{f}}+{\textbf{g}}{\textbf{u}}^{*})\nonumber \\&+\hat{{\textbf{W}}}_c^{\top }\nabla \varvec{\sigma }_c\left( \hat{{\textbf{f}}}+{\textbf{g}}\hat{{\textbf{u}}}\right) +\epsilon _{hjb}. \end{aligned}$$
(59)

Considering the definitions of \(U({\textbf{u}}^{*})\) and \(U(\hat{{\textbf{u}}})\), given in Eqs. (6) and (9), respectively, and knowing that \(\ln ({\textbf{1}}-\tanh ^2({\textbf{D}}^{*}))=\ln ({\textbf{4}})-2{\textbf{D}}^{*}\text {sgn}({\textbf{D}}^{*})+\varvec{\epsilon }_{D^{*}}\) and \(\ln ({\textbf{1}}-\tanh ^2(\hat{{\textbf{D}}}))=\ln ({\textbf{4}})-2\hat{{\textbf{D}}}\text {sgn}(\hat{{\textbf{D}}})+\varvec{\epsilon }_{D}\) for some bounded \(\varvec{\epsilon }_{D^{*}}\) and \(\varvec{\epsilon }_{D}\) [18], where \(\text {sgn}(\cdot )\) denotes the signum function, we develop the following expression:

$$\begin{aligned} U(\hat{{\textbf{u}}})-U({\textbf{u}}^{*})=&~2\lambda ^2\bar{{\textbf{R}}}\left( {\textbf{D}}^{*}\text {sgn}({\textbf{D}}^{*})-\hat{{\textbf{D}}}\text {sgn}(\hat{{\textbf{D}}})\right) \nonumber \\&-\hat{{\textbf{W}}}_c^{\top }\nabla \varvec{\sigma }_c{\textbf{g}}\hat{{\textbf{u}}}+{\textbf{W}}_c^{\top }\nabla \varvec{\sigma }_c{\textbf{g}}{\textbf{u}}^{*}\nonumber \\&+\nabla \epsilon _c{\textbf{g}}{\textbf{u}}^{*}+\lambda ^2\bar{{\textbf{R}}}(\varvec{\epsilon }_{D}-\varvec{\epsilon }_{D^{*}}). \end{aligned}$$
(60)

The signum function can be approximated by a \(\tanh \) function with the following relation quantifying the approximation error [35]:

$$\begin{aligned} 0\le x\,\text {sgn}(x)-x\tanh (\kappa x)\le \frac{0.2785}{\kappa }. \end{aligned}$$
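As a quick self-contained numerical check of this bound (our sketch, independent of the paper's simulation studies), the gap \(x\,\text {sgn}(x)-x\tanh (\kappa x)\) stays within \(0.2785/\kappa \) on a dense grid and tightens as \(\kappa \) grows:

```python
import numpy as np

# Check 0 <= |x| - x*tanh(kappa*x) <= 0.2785/kappa on a dense grid.
for kappa in (1.0, 5.0, 20.0):
    x = np.linspace(-10.0, 10.0, 200001)
    gap = np.abs(x) - x * np.tanh(kappa * x)
    assert gap.min() >= -1e-12 and gap.max() <= 0.2785 / kappa
    print(f"kappa={kappa:5.1f}  max gap={gap.max():.6f}  bound={0.2785/kappa:.6f}")
```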

Therefore, the expression in Eq. (60) can be written as

$$\begin{aligned}&U(\hat{{\textbf{u}}})-U({\textbf{u}}^{*})=2\lambda ^2\bar{{\textbf{R}}}\left( {\textbf{D}}^{*}\tanh (\kappa {\textbf{D}}^{*})-\hat{{\textbf{D}}}\tanh (\kappa \hat{{\textbf{D}}})\right) \nonumber \\&\qquad +2\lambda ^2\bar{{\textbf{R}}}(\varvec{\kappa }_{D^{*}}-\varvec{\kappa }_{D})-\hat{{\textbf{W}}}_c^{\top }\nabla \varvec{\sigma }_c{\textbf{g}}\hat{{\textbf{u}}}+{\textbf{W}}_c^{\top }\nabla \varvec{\sigma }_c{\textbf{g}}{\textbf{u}}^{*}\nonumber \\&\qquad +\nabla \epsilon _c{\textbf{g}}{\textbf{u}}^{*}+\lambda ^2\bar{{\textbf{R}}}(\varvec{\epsilon }_{D}-\varvec{\epsilon }_{D^{*}}), \end{aligned}$$
(61)

where \(\varvec{\kappa }_{D^{*}}\) and \(\varvec{\kappa }_{D}\) denote the approximation errors. Then, considering the definitions of \({\textbf{D}}^{*}\) and \(\hat{{\textbf{D}}}\), and adding and subtracting \(\lambda {\textbf{W}}_c^{\top }\nabla \varvec{\sigma }_c{\textbf{g}}\tanh (\kappa \hat{{\textbf{D}}})\) to the right-hand side of Eq. (61), we have

$$\begin{aligned}&U(\hat{{\textbf{u}}})-U({\textbf{u}}^{*})= \lambda \tilde{{\textbf{W}}}_c^{\top }\nabla \varvec{\sigma }_c{\textbf{g}}\tanh (\kappa \hat{{\textbf{D}}})\nonumber \\&+2\lambda ^2\bar{{\textbf{R}}}(\varvec{\kappa }_{D^{*}}-\varvec{\kappa }_{D})-\hat{{\textbf{W}}}_c^{\top }\nabla \varvec{\sigma }_c{\textbf{g}}\hat{{\textbf{u}}}+{\textbf{W}}_c^{\top }\nabla \varvec{\sigma }_c{\textbf{g}}{\textbf{u}}^{*}\nonumber \\&+\nabla \epsilon _c{\textbf{g}}{\textbf{u}}^{*}+\lambda ^2\bar{{\textbf{R}}}(\varvec{\epsilon }_{D}-\varvec{\epsilon }_{D^{*}})\nonumber \\&+\lambda {\textbf{W}}_c^{\top }\nabla \varvec{\sigma }_c{\textbf{g}}\left( \tanh (\kappa {\textbf{D}}^{*})-\tanh (\kappa \hat{{\textbf{D}}})\right) . \end{aligned}$$
(62)

Substituting Eq. (62) into Eq. (59) and performing some manipulations, we obtain

$$\begin{aligned} \delta _B=-\varvec{\beta }^{\top }\tilde{{\textbf{W}}}_c+\epsilon _{\delta }, \end{aligned}$$
(63)

where \(\varvec{\beta }\in {\mathbb {R}}^{{\mathcal {L}}_c}\) is defined in Eq. (44) and \(\epsilon _\delta \in {\mathbb {R}}\) is given by

$$\begin{aligned} \epsilon _{\delta }\triangleq&~2\lambda ^2\bar{{\textbf{R}}}(\varvec{\kappa }_{D^{*}}-\varvec{\kappa }_{D})+\lambda ^2\bar{{\textbf{R}}}(\varvec{\epsilon }_{D}-\varvec{\epsilon }_{D^{*}})+\nabla \epsilon _c{\textbf{g}}{\textbf{u}}^{*}\\&+\epsilon _{hjb}+\lambda {\textbf{W}}_c^{\top }\nabla \varvec{\sigma }_c{\textbf{g}}\left( \tanh (\kappa {\textbf{D}}^{*})-\tanh (\kappa \hat{{\textbf{D}}})\right) \\&-{\textbf{W}}_c^{\top }\nabla \varvec{\sigma }_c\tilde{{\textbf{f}}}, \end{aligned}$$

where all of the terms are bounded, so \(\epsilon _{\delta }\) admits an upper bound \(|\epsilon _{\delta }|\le {\bar{\epsilon }}_{\delta }\). Note also that the approximation errors \(\varvec{\epsilon }_{D^{*}}\), \(\varvec{\epsilon }_{D}\), \(\varvec{\kappa }_{D^{*}}\), and \(\varvec{\kappa }_{D}\) converge to zero as \({\textbf{x}}\) goes to zero. Following the same steps, the unmeasurable form of the simulated Bellman error \(\delta _{Bj}\) can be obtained as

$$\begin{aligned} \delta _{Bj}=-\varvec{\beta }_j^{\top }\tilde{{\textbf{W}}}_c-{\textbf{W}}_c^{\top }\nabla \varvec{\sigma }_{cj}\tilde{{\textbf{f}}}_{sj}+\epsilon _{\delta j}, \end{aligned}$$
(64)

where \(\tilde{{\textbf{f}}}_{sj}\triangleq {\textbf{f}}_{j}-\hat{{\textbf{f}}}_{sj}={\textbf{e}}_{sj}\), subscript j indicates the jth sample of the variables, and \(\epsilon _{\delta j}\triangleq 2\lambda ^2\bar{{\textbf{R}}}(\varvec{\kappa }_{D^{*}}-\varvec{\kappa }_{Dj})+\lambda ^2\bar{{\textbf{R}}}(\varvec{\epsilon }_{Dj}-\varvec{\epsilon }_{D^{*}})+\lambda {\textbf{W}}_c^{\top }\nabla \varvec{\sigma }_c{\textbf{g}}(\tanh (\kappa {\textbf{D}}^{*})-\tanh (\kappa \hat{{\textbf{D}}}({\textbf{x}}_j)))-\nabla \epsilon _c{\textbf{f}}_j\), for which a constant upper bound can be considered.
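Continuing the illustrative Python sketch given after Eq. (58) (same assumed system and basis; the sampled states and the slow-identifier model \(\hat{{\textbf{f}}}_{s}\) below are likewise placeholders), simulated experience amounts to evaluating the Bellman error at stored samples with the slow identifier standing in for the unknown drift:

```python
# Simulated experience: Bellman errors at stored samples {x_j}, with the
# slow identifier f_hat_s replacing the unknown drift f (cf. Eq. (64)).
def simulated_bellman_errors(Wc_hat, samples, f_hat_s, g, Q):
    return np.array([bellman_error(Wc_hat, x_j, f_hat_s(x_j), g, Q)
                     for x_j in samples])

samples = [np.array([0.4, 0.1]), np.array([-0.2, 0.6]), np.array([0.0, -0.5])]
f_hat_s = lambda x: np.array([x[1], -0.9*x[0] - 0.4*x[1]])  # placeholder slow model
print(simulated_bellman_errors(Wc, samples, f_hat_s, g, Q))
```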

Then, using Eqs. (63) and (64) and the adaptation rule (Eq. 43), the time derivative of \(L_{W_c}\) can be written as

$$\begin{aligned}&{\dot{L}}_{W_c}= -\tilde{{\textbf{W}}}_c^{\top }\bar{\varvec{\beta }}\bar{\varvec{\beta }}^{\top }\tilde{{\textbf{W}}}_c+\tilde{{\textbf{W}}}_c^{\top }\frac{\bar{\varvec{\beta }}}{m_s}\epsilon _{\delta }\\&-\frac{\gamma _{c1}}{N}\sum _{j=1}^{N}\tilde{{\textbf{W}}}_c^{\top }\bar{\varvec{\beta }}_j\bar{\varvec{\beta }}_j^{\top }\tilde{{\textbf{W}}}_c\\&-\frac{\gamma _{c1}}{N}\sum _{j=1}^{N}\tilde{{\textbf{W}}}_c^{\top }\frac{\bar{\varvec{\beta }}_j}{m_{sj}}{\textbf{W}}_c^{\top }\nabla \varvec{\sigma }_{cj}\tilde{{\textbf{f}}}_{sj}+\frac{\gamma _{c1}}{N}\sum _{j=1}^{N}\tilde{{\textbf{W}}}_c^{\top }\frac{\bar{\varvec{\beta }}_j}{m_{sj}}\epsilon _{\delta j}\\&-\frac{\gamma _{c2}\sum _{j=1}^{N}\tilde{{\textbf{W}}}_c^{\top }\bar{\varvec{\beta }}_j\bar{\varvec{\beta }}_j^{\top }\tilde{{\textbf{W}}}_c}{N(\Vert \varvec{\Xi }\Vert +\varepsilon _c)}+\frac{\gamma _{c2}\sum _{j=1}^{N}\tilde{{\textbf{W}}}_c^{\top }\bar{\varvec{\beta }}_j \epsilon _{\delta j}/m_{sj}}{N(\Vert \varvec{\Xi }\Vert +\varepsilon _c)}\\&-\frac{\gamma _{c2}\sum _{j=1}^{N}\tilde{{\textbf{W}}}_c^{\top }\bar{\varvec{\beta }}_j {\textbf{W}}_c^{\top }\nabla \varvec{\sigma }_{cj}\tilde{{\textbf{f}}}_{sj}/m_{sj}}{N(\Vert \varvec{\Xi }\Vert +\varepsilon _c)}, \end{aligned}$$

where \(\bar{\varvec{\beta }}\triangleq \varvec{\beta }/(1+\varvec{\beta }^{\top }\varvec{\beta })\) and \(m_s\triangleq 1+\varvec{\beta }^{\top }\varvec{\beta }\), which satisfy the following inequality:

$$\begin{aligned} \Vert \frac{\bar{\varvec{\beta }}}{m_s}\Vert \le b_{\beta }<1, \end{aligned}$$

in which \(b_{\beta }\) is a positive constant; this bound is verified by the short calculation below.
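Indeed, writing \(s\triangleq \Vert \varvec{\beta }\Vert \), we have \(\Vert \bar{\varvec{\beta }}/m_s\Vert =s/(1+s^{2})^{2}\), and a short maximization gives

$$\begin{aligned} \max _{s\ge 0}\frac{s}{(1+s^{2})^{2}}=\frac{3\sqrt{3}}{16}\approx 0.325<1, \end{aligned}$$

with the maximum attained at \(s=1/\sqrt{3}\); hence such a \(b_{\beta }<1\) always exists. Returning to the bound on \({\dot{L}}_{W_c}\), we have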

$$\begin{aligned} -\frac{\gamma _{c2}\sum _{j=1}^{N}\tilde{{\textbf{W}}}_c^{\top }\bar{\varvec{\beta }}_j\bar{\varvec{\beta }}_j^{\top }\tilde{{\textbf{W}}}_c}{N(\Vert \varvec{\Xi }\Vert +\varepsilon _c)}\le -\gamma _{c2}\vartheta _c\Vert \tilde{{\textbf{W}}}_c\Vert , \end{aligned}$$

where \(\vartheta _c>0\) is defined as

$$\begin{aligned} \vartheta _c\triangleq 1- \frac{(c_{\beta _2}-c_{\beta _1})\Vert \tilde{{\textbf{W}}}_c\Vert +\hbar _1+\hbar _2+\varepsilon _c}{c_{\beta _2}\Vert \tilde{{\textbf{W}}}_c\Vert +\hbar _1+\hbar _2+\varepsilon _c}, \end{aligned}$$

in which \(c_{\beta _2}\triangleq \frac{1}{N}(\sup \limits _{t\in {\mathbb {R}}_{\ge t_0}}(\lambda _{\max }(\sum _{j=1}^{N}\frac{\varvec{\beta }_j\varvec{\beta }_j^{\top }}{(1+\varvec{\beta }_j^{\top }\varvec{\beta }_j)^2})))\), and \(\hbar _1\triangleq {\bar{b}}_{\beta } b_{\sigma cx} b_{\sigma sx}\Vert {\varvec{W}}_c\Vert \Vert \tilde{{\textbf{W}}}_s\Vert _F\), \(\hbar _2\triangleq {\bar{b}}_{\beta }b_{\sigma cx}{\bar{\epsilon }}_{sj}\Vert {\varvec{W}}_c\Vert + {\bar{b}}_{\beta }{\bar{\epsilon }}_{\delta j}\), in which \({\bar{b}}_{\beta }\triangleq \sup \limits _{j=1\cdots N}(\Vert \bar{\varvec{\beta }}_{j}/m_{sj}\Vert )\), \({\bar{\epsilon }}_{sj}\triangleq \sup \limits _{j=1\cdots N}(\Vert \varvec{\epsilon }_{sj}\Vert )\) and \({\bar{\epsilon }}_{\delta j}\triangleq \sup \limits _{j=1\cdots N}(|\epsilon _{\delta j}|)\). Also, assuming that \(\varepsilon _c> \sup \limits _{j=1\cdots N}(\Vert \tilde{{\textbf{f}}}_{sj}\Vert )+{\bar{\epsilon }}_{\delta }\), the following inequality can be developed:

$$\begin{aligned}&-\frac{\gamma _{c2}\sum _{j=1}^{N}\tilde{{\textbf{W}}}_c^{\top }\bar{\varvec{\beta }}_j {\textbf{W}}_c^{\top }\nabla \varvec{\sigma }_{cj}\tilde{{\textbf{f}}}_{sj}/m_{sj}}{N(\Vert \varvec{\Xi }\Vert +\varepsilon _c)}\\&\qquad +\frac{\gamma _{c2}\sum _{j=1}^{N}\tilde{{\textbf{W}}}_c^{\top }\bar{\varvec{\beta }}_j \epsilon _{\delta j}/m_{sj}}{N(\Vert \varvec{\Xi }\Vert +\varepsilon _c)}\le \\&\qquad \gamma _{c2}{\bar{b}}_{\beta }b_{\sigma cx}\varpi _{c1}\Vert {\textbf{W}}_c\Vert \Vert \tilde{{\textbf{W}}}_c\Vert +\gamma _{c2}{\bar{b}}_{\beta }\varpi _{c2}\Vert \tilde{{\textbf{W}}}_c\Vert \end{aligned}$$

for some \(\varpi _{c1},\varpi _{c2}<1\). Then, defining the following positive constants:

$$\begin{aligned}&\varrho _1\triangleq \gamma _{c1} {\bar{b}}_{\beta } b_{\sigma cx} b_{\sigma sx}\Vert {\varvec{W}}_c\Vert ,\\&\varrho _2\triangleq \gamma _{c1} {\bar{b}}_{\beta }b_{\sigma cx}{\bar{\epsilon }}_{sj}\Vert {\varvec{W}}_c\Vert +\gamma _{c2}{\bar{b}}_{\beta }b_{\sigma cx}\varpi _{c1}\Vert {\varvec{W}}_c\Vert \\&\qquad +\gamma _{c1} {\bar{b}}_{\beta }{\bar{\epsilon }}_{\delta j}+\gamma _{c2}{\bar{b}}_{\beta }\varpi _{c2}+b_{\beta }{\bar{\epsilon }}_{\delta }, \end{aligned}$$

and considering Eq. (45), the following upper bound can be written for \({\dot{L}}_{W_c}\):

$$\begin{aligned} {\dot{L}}_{W_c}\le&-\gamma _{c1}c_{\beta _1}\Vert \tilde{{\textbf{W}}}_c\Vert ^2+\varrho _1\Vert \tilde{{\textbf{W}}}_c\Vert \Vert \tilde{{\textbf{W}}}_s\Vert _F\nonumber \\&+(\varrho _2-\gamma _{c2}\vartheta _c)\Vert \tilde{{\textbf{W}}}_c\Vert . \end{aligned}$$
(65)
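For transparency, the cross terms in Eqs. (65) and (37) are absorbed using Young's inequality in the form \(ab\le a^{2}/\eta ^{2}+\eta ^{2}b^{2}/4\); the instances used here are (the third line assumes the cross term of Eq. (37) has the form \(\epsilon _{w}\Vert \tilde{{\textbf{W}}}_s\Vert _F\), which we infer from the resulting bound in Eq. (66)):

$$\begin{aligned} \varrho _1\Vert \tilde{{\textbf{W}}}_c\Vert \Vert \tilde{{\textbf{W}}}_s\Vert _F&\le \frac{\varrho _1}{\eta _1^2}\Vert \tilde{{\textbf{W}}}_c\Vert ^2+\frac{\eta _1^2\varrho _1}{4}\Vert \tilde{{\textbf{W}}}_s\Vert _F^2,\\ (\varrho _2-\gamma _{c2}\vartheta _c)\Vert \tilde{{\textbf{W}}}_c\Vert&\le \frac{\eta _2^2}{4}\Vert \tilde{{\textbf{W}}}_c\Vert ^2+\frac{(\varrho _2-\gamma _{c2}\vartheta _c)^2}{\eta _2^2},\\ \epsilon _{w}\Vert \tilde{{\textbf{W}}}_s\Vert _F&\le \frac{\eta _3^2}{4}\Vert \tilde{{\textbf{W}}}_s\Vert _F^2+\frac{\epsilon _{w}^2}{\eta _3^2}. \end{aligned}$$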

Therefore, using Eqs. (37), (57), and (65) and Young’s inequality, an upper bound for \({\dot{L}}\) can be written as

$$\begin{aligned} {\dot{L}}\le&-q_{\min }\Vert {\textbf{x}}\Vert ^{2}-(\gamma _{c1} c_{\beta _1}-\frac{\varrho _1}{\eta _1^2}-\frac{\eta _2^2}{4})\Vert \tilde{{\textbf{W}}}_c\Vert ^2\nonumber \\&-(\gamma _{s1}c_{\sigma _1}-\frac{\eta _1^2\varrho _1}{4}-\frac{\eta _3^2}{4})\Vert \tilde{{\textbf{W}}}_s\Vert _F^2\nonumber \\&+\kappa _1+\frac{(\varrho _2-\gamma _{c2}\vartheta _c)^2}{\eta _2^2}+\frac{\epsilon _{w}^2}{\eta _3^2}, \end{aligned}$$
(66)

where \(\eta _1,\eta _2,\eta _3\in {\mathbb {R}}^{+}\) are adjustable constants. Hence, considering Eq. (66), whenever the gain conditions in Eq. (46) are satisfied, we conclude that \({\textbf{x}}\), \(\tilde{{\textbf{W}}}_c\), and \(\tilde{{\textbf{W}}}_s\) are uniformly ultimately bounded.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Asl, H.J., Uchibe, E. Reinforcement learning-based optimal control of unknown constrained-input nonlinear systems using simulated experience. Nonlinear Dyn 111, 16093–16110 (2023). https://doi.org/10.1007/s11071-023-08688-0

