Abstract
Reinforcement learning (RL) provides a way to approximately solve optimal control problems. Online solutions to such problems, however, require a method that guarantees convergence to the optimal policy while also ensuring stability during the learning process. In this study, we develop an online RL-based optimal control framework for input-constrained nonlinear systems. Its design includes two new model identifiers that learn a system’s drift dynamics: a slow identifier, used to simulate experience that supports convergence to the optimal solution, and a fast identifier that keeps the system stable during the learning phase. The approach is a critic-only design, in which a new fast estimation law is developed for the critic network. A Lyapunov-based analysis shows that the estimated control policy converges to the optimal one, and simulation studies demonstrate the effectiveness of the developed control scheme.
Data availability
Enquiries about data availability should be directed to the authors.
Notes
The accuracy of \(\hat{{\textbf{f}}}({\textbf{x}})\) can be monitored through the error signal \(\tilde{{\textbf{x}}}\), and its output can be used in Eq. (32) instead of \({\textbf{f}}\) when the error becomes negligible.
References
Balakrishnan, S.N., Biega, V.: Adaptive-critic-based neural networks for aircraft optimal control. J. Guid. Control Dyn. 19(4), 893–898 (1996)
He, P., Jagannathan, S.: Reinforcement learning neural-network-based controller for nonlinear discrete-time systems with input constraints. IEEE Trans. Syst. Man Cybern. Part B Cybern. 37(2), 425–436 (2007)
Dierks, T., Jagannathan, S.: Optimal control of affine nonlinear continuous-time systems. In: Proceedings of the 2010 American Control Conference, pp. 1568–1573 (2010)
Doya, K.: Reinforcement learning in continuous time and space. Neural Comput. 12(1), 219–245 (2000)
Vamvoudakis, K.G., Lewis, F.L.: Online actor-critic algorithm to solve the continuous-time infinite horizon optimal control problem. Automatica 46(5), 878–888 (2010)
Vrabie, D., Pastravanu, O., Abu-Khalaf, M., Lewis, F.L.: Adaptive optimal control for continuous-time linear systems based on policy iteration. Automatica 45(2), 477–484 (2009)
Bhasin, S., Kamalapurkar, R., Johnson, M., Vamvoudakis, K.G., Lewis, F.L., Dixon, W.E.: A novel actor critic identifier architecture for approximate optimal control of uncertain nonlinear systems. Automatica 49(1), 82–92 (2013)
Modares, H., Lewis, F.L.: Optimal tracking control of nonlinear partially-unknown constrained-input systems using integral reinforcement learning. Automatica 50(7), 1780–1792 (2014)
Modares, H., Lewis, F.L., Naghibi-Sistani, M.-B.: Integral reinforcement learning and experience replay for adaptive optimal control of partially-unknown constrained-input continuous-time systems. Automatica 50(1), 193–202 (2014)
Kamalapurkar, R., Andrews, L., Walters, P., Dixon, W.E.: Model-based reinforcement learning for infinite-horizon approximate optimal tracking. IEEE Trans. Neural Netw. Learn. Syst. 28(3), 753–758 (2016)
Kamalapurkar, R., Walters, P., Dixon, W.: Concurrent learning-based approximate optimal regulation. In: 52nd IEEE Conference on Decision and Control, pp. 6256–6261 (2013)
Zhao, B., Liu, D., Alippi, C.: Sliding-mode surface-based approximate optimal control for uncertain nonlinear systems with asymptotically stable critic structure. IEEE Trans. Cybern. 51(6), 2858–2869 (2020)
Abu-Khalaf, M., Lewis, F.L.: Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach. Automatica 41(5), 779–791 (2005)
Guo, X., Yan, W., Cui, R.: Integral reinforcement learning-based adaptive NN control for continuous-time nonlinear MIMO systems with unknown control directions. IEEE Trans. Syst. Man Cybern. Syst. 50(11), 4068–4077 (2019)
Modares, H., Lewis, F.L., Naghibi-Sistani, M.-B.: Online solution of nonquadratic two-player zero-sum games arising in the H∞ control of constrained input systems. Int. J. Adapt. Control Signal Process. 28(3–5), 232–254 (2014)
Yang, Y., Vamvoudakis, K.G., Modares, H., Yin, Y., Wunsch, D.C.: Safe intermittent reinforcement learning with static and dynamic event generators. IEEE Trans. Neural Netw. Learn. Syst. 31(12), 5441–5455 (2020)
Mishra, A., Ghosh, S.: Variable gain gradient descent-based reinforcement learning for robust optimal tracking control of uncertain nonlinear system with input constraints. Nonlinear Dyn. pp. 2195–2214 (2022)
Modares, H., Lewis, F.L., Naghibi-Sistani, M.-B.: Adaptive optimal control of unknown constrained-input systems using policy iteration and neural networks. IEEE Trans. Neural Netw. Learn. Syst. 24(10), 1513–1525 (2013)
Huo, Y., Wang, D., Qiao, J., Li, M.: Adaptive critic design for nonlinear multi-player zero-sum games with unknown dynamics and control constraints. Nonlinear Dyn. pp. 1–13 (2023)
Slotine, J.-J.E., Li, W.: Applied Nonlinear Control, vol. 199. Prentice Hall, Englewood Cliffs, NJ (1991)
Sastry, S.: Nonlinear Systems: Analysis, Stability, and Control, vol. 10. Springer Science and Business Media, Berlin (2013)
Dong, H., Zhao, X., Luo, B.: Optimal tracking control for uncertain nonlinear systems with prescribed performance via critic-only ADP. IEEE Trans. Syst. Man Cybern. Syst. (2020)
Lv, Y., Ren, X., Na, J.: Online optimal solutions for multi-player nonzero-sum game with completely unknown dynamics. Neurocomputing 283, 87–97 (2018)
Wang, W., Wen, C.: Adaptive actuator failure compensation control of uncertain nonlinear systems with guaranteed transient performance. Automatica 46(12), 2082–2091 (2010)
Xian, B., Dawson, D.M., de Queiroz, M.S., Chen, J.: A continuous asymptotic tracking control strategy for uncertain nonlinear systems. IEEE Trans. Autom. Control 49(7), 1206–1211 (2004)
de Queiroz, M.S., Hu, J., Dawson, D.M., Burg, T., Donepudi, S.R.: Adaptive position/force control of robot manipulators without velocity measurements: Theory and experimentation. IEEE Trans. Syst. Man Cybern. Part B 27(5), 796–809 (1997)
Chowdhary, G., Johnson, E.: Concurrent learning for convergence in adaptive control without persistency of excitation. In: 49th IEEE Conference on Decision and Control, pp. 3674–3679 (2010)
Chowdhary, G.V., Johnson, E.N.: Theory and flight-test validation of a concurrent-learning adaptive controller. J. Guid. Control Dyn. 34(2), 592–607 (2011)
Vahidi-Moghaddam, A., Mazouchi, M., Modares, H.: Memory-augmented system identification with finite-time convergence. IEEE Control Syst. Lett. 5(2), 571–576 (2020)
Spong, M.W.: On the robust control of robot manipulators. IEEE Trans. Autom. Control 37(11), 1782–1786 (1992)
Hornik, K., Stinchcombe, M., White, H.: Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks. Neural Netw. 3(5), 551–560 (1990)
Chong, E.K.P., Zak, S.H.: An Introduction to Optimization 75, 514 (2013)
Khalil, H.K.: Nonlinear Systems, 3rd edn. Prentice-Hall, New Jersey (1996)
Patre, P.: Lyapunov-based robust and adaptive control of nonlinear systems using a novel feedback structure. University of Florida, Florida (2009)
Polycarpou, M.M., Ioannou, P.A.: A robust adaptive nonlinear control design. In: 1993 American Control Conference, pp. 1365–1369 (1993)
Funding
This work is based on results obtained from project JPNP20006, commissioned by the New Energy and Industrial Technology Development Organization (NEDO). This work was partially supported by JSPS KAKENHI Grant Number JP21H03527.
Ethics declarations
Conflicts of interest
The authors have not disclosed any competing interests.
Appendices
Proof of Theorem 1
Consider \({\mathcal {D}}\subseteq {\mathbb {R}}^{2n+2}\) a domain containing \({\textbf{y}}=0\), where \({\textbf{y}}\in {\mathbb {R}}^{2n+2}\) is defined as \({\textbf{y}}\triangleq [\varvec{\omega }^{\top }\;\sqrt{P}\;\sqrt{Q_f}]^{\top }\), in which \(Q_f(t)\in {\mathbb {R}}^{+}\) is defined as \(Q_f(t)\triangleq \frac{\alpha }{2\gamma }\text {tr}(\tilde{{\textbf{W}}}_f^{\top }\tilde{{\textbf{W}}}_f)\) with \(\text {tr}(\cdot )\) denoting the trace of a matrix. Here, function \(P(t)\in {\mathbb {R}}\) is given by
where function \({\mathcal {K}}(t)\in {\mathbb {R}}\) is defined as
in which \(\beta _2\in {\mathbb {R}}^{+}\) is a positive constant. Provided the conditions of Eqs. (30) and (31) are satisfied, it can be shown that \(P(t)\ge 0\); see the proof in Appendix B. Now consider the continuously differentiable positive-definite function as follows:
\[{\mathcal {V}}\triangleq \tfrac{1}{2}\varvec{\omega }^{\top }\varvec{\omega }+P+Q_f. \tag{50}\]
It can be concluded that
\[\psi _1({\textbf{y}})\le {\mathcal {V}}\le \psi _2({\textbf{y}}), \tag{51}\]
where positive-definite strictly increasing functions \(\psi _1,\psi _2\in {\mathbb {R}}^{+}\) are defined as \(\psi _1\triangleq 0.5\Vert {\textbf{y}}\Vert ^2\) and \(\psi _2\triangleq \Vert {\textbf{y}}\Vert ^2\). Using Eqs. (24), (25), and the time derivative of Eq. (48), the time derivative of \({\mathcal {V}}\) can be developed as
Therefore, using Eq. (24) and the definition of \({\textbf{N}}_B\), and knowing that \({\textbf{a}}^{\top }{\textbf{b}}=\text {tr}({\textbf{b}}{\textbf{a}}^{\top })\), \(\forall {\textbf{a}},{\textbf{b}}\in {\mathbb {R}}^{n}\), we can write
Considering the adaptation law (Eq. 29) and using the properties of the projection operator, we have
Therefore, using Eqs. (26) and (52), an upper bound for \(\dot{{\mathcal {V}}}\) can be written as
By splitting \(k_f\) into adjustable positive gains \(k_{f1},k_{f2}\in {\mathbb {R}}^{+}\) as \(k_f=k_{f1}+k_{f2}\), we can further bound \(\dot{{\mathcal {V}}}\) as follows if the condition in Eq. (31) is satisfied:
where \(\beta _3\) is defined as \(\beta _3\triangleq \min \{(\alpha -\beta _2),k_{f1}\}\). Completing the squares for the last two terms of Eq. (53), we obtain \(\dot{{\mathcal {V}}}\le -(\beta _3-\rho ^2(\Vert \varvec{\omega }\Vert )/(4k_{f2}))\Vert \varvec{\omega }\Vert ^2\). Therefore, we have \(\dot{{\mathcal {V}}}\le -c\Vert \varvec{\omega }\Vert ^2\) for some positive constant \(c\in {\mathbb {R}}^{+}\) in the following domain:
\[{\mathcal {D}}\triangleq \{{\textbf{y}}\in {\mathbb {R}}^{2n+2}\;|\;\Vert {\textbf{y}}\Vert <\rho ^{-1}(2\sqrt{\beta _3 k_{f2}})\},\]
which, considering Eq. (51), indicates that \({\mathcal {V}}\in {\mathcal {L}}_{\infty }\) in \({\mathcal {D}}\). Therefore, from Eq. (50), we have \(\varvec{\varsigma }\), \({\textbf{e}}_{\varsigma }\), \(P\), \(Q_f\), and hence \(\tilde{{\textbf{W}}}_f\in {\mathcal {L}}_{\infty }\) in \({\mathcal {D}}\). Since \(\varvec{\varsigma }\) is bounded, we can conclude that the prescribed performance condition is satisfied and \(\tilde{{\textbf{x}}}\in {\mathcal {L}}_{\infty }\) in \({\mathcal {D}}\). To analyze the convergence of the signals, we need to show that \(\varvec{\omega }\) is uniformly continuous. From Eq. (24), we can see that \(\dot{\varvec{\varsigma }}\in {\mathcal {L}}_{\infty }\) in \({\mathcal {D}}\). Therefore, since \(\varvec{\Upsilon }\) is bounded, from Eq. (19) we have \(\dot{\tilde{{\textbf{x}}}}\in {\mathcal {L}}_{\infty }\) in \({\mathcal {D}}\). Consequently, from Eq. (22), we conclude that \(\varvec{\nu }\in {\mathcal {L}}_{\infty }\) in \({\mathcal {D}}\). Since \(\dot{\tilde{{\textbf{x}}}}\) is bounded, from the definition of \(\varvec{\Upsilon }\), we also see that \(\dot{\varvec{\Upsilon }}\in {\mathcal {L}}_{\infty }\) in \({\mathcal {D}}\). From Eq. (29), \(\dot{\hat{{\textbf{W}}}}_f\in {\mathcal {L}}_{\infty }\) in \({\mathcal {D}}\), and we also have \(\dot{\tilde{\varvec{\sigma }}}_f\in {\mathcal {L}}_{\infty }\). From these results and Eq. (25), we can see that \(\dot{{\textbf{e}}}_{\varsigma }\in {\mathcal {L}}_{\infty }\) in \({\mathcal {D}}\). Then, since \(\dot{\varvec{\varsigma }},\dot{{\textbf{e}}}_{\varsigma }\in {\mathcal {L}}_{\infty }\), we conclude that \(\dot{\varvec{\omega }}\in {\mathcal {L}}_{\infty }\) in \({\mathcal {D}}\), which indicates that \(\varvec{\omega }\) is uniformly continuous in \({\mathcal {D}}\).
Now consider the region \({\mathcal {S}}\subset {\mathcal {D}}\) defined as \({\mathcal {S}}\triangleq \{{\textbf{y}}\in {\mathcal {D}}\;|\;\psi _2({\textbf{y}})< \frac{1}{2}(\rho ^{-1}(2\sqrt{\beta _3 k_{f2}}))^2\}\). Based on the above results, Theorem 8.4 of an earlier work [33] can be used to conclude that \(\Vert \varvec{\omega }\Vert \rightarrow 0\) as \(t\rightarrow \infty \) for all \({\textbf{y}}(0)\in {\mathcal {S}}\). According to Eq. (17), \(T_i\rightarrow 0\) as \(\varsigma _i\rightarrow 0\), and from Eq. (14), \(\Vert \tilde{{\textbf{x}}}\Vert \rightarrow 0\). Therefore, from Eqs. (19) and (24), we conclude that \(\Vert \dot{\tilde{{\textbf{x}}}}\Vert \rightarrow 0\) as \(t\rightarrow \infty \). The convergence of \(\dot{\tilde{{\textbf{x}}}}\) to zero indicates that \(\hat{{\textbf{f}}}\), given by Eq. (21), converges to \({\textbf{f}}\).
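The completing-the-squares step in this proof can be illustrated generically. Eq. (53) is not reproduced here, so the sketch below assumes that its last two terms have the form \(-k_{f2}a^2+\rho (\Vert \varvec{\omega }\Vert )\Vert \varvec{\omega }\Vert a\), where \(a\ge 0\) is a hypothetical stand-in for the relevant error norm:

```latex
\begin{aligned}
-k_{f2}a^{2}+\rho(\Vert\boldsymbol{\omega}\Vert)\Vert\boldsymbol{\omega}\Vert a
&= -k_{f2}\left(a-\frac{\rho(\Vert\boldsymbol{\omega}\Vert)\Vert\boldsymbol{\omega}\Vert}{2k_{f2}}\right)^{2}
   +\frac{\rho^{2}(\Vert\boldsymbol{\omega}\Vert)}{4k_{f2}}\Vert\boldsymbol{\omega}\Vert^{2} \\
&\le \frac{\rho^{2}(\Vert\boldsymbol{\omega}\Vert)}{4k_{f2}}\Vert\boldsymbol{\omega}\Vert^{2}.
\end{aligned}
```

Under that assumption, adding the remaining term \(-\beta _3\Vert \varvec{\omega }\Vert ^2\) gives the coefficient \(-(\beta _3-\rho ^2(\Vert \varvec{\omega }\Vert )/(4k_{f2}))\), which is negative whenever \(\rho (\Vert \varvec{\omega }\Vert )<2\sqrt{\beta _3 k_{f2}}\). This is the same threshold that appears in the definition of \({\mathcal {S}}\).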
Proof of \(P(t)\ge 0\)
Here we show that \(P(t)\ge 0\). The proof follows the same steps as in a prior study [34]. Integrating both sides of Eq. (49), we have
Using Eq. (24), we have
Integrating the first integral in Eq. (54) by parts yields
Knowing that \(\sum _{i=1}^{n}|\varsigma _i|\ge \Vert \varvec{\varsigma }\Vert \), and using Eqs. (27) and (28), we can write the following inequality:
Therefore, if the conditions in Eqs. (30) and (31) are satisfied, we have
which indicates that \(P(t)\ge 0\).
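Two elementary facts are used without proof in Appendices A and B: the identity \({\textbf{a}}^{\top }{\textbf{b}}=\text {tr}({\textbf{b}}{\textbf{a}}^{\top })\) and the inequality \(\sum _{i=1}^{n}|\varsigma _i|\ge \Vert \varvec{\varsigma }\Vert \). The following is a small numerical sanity check, not part of the proof; the helper names, the vector length, and the random test vectors are illustrative choices.

```python
import math
import random

def inner(a, b):
    """Inner product a^T b."""
    return sum(x * y for x, y in zip(a, b))

def outer(b, a):
    """Outer product b a^T as a list of rows."""
    return [[bi * aj for aj in a] for bi in b]

def trace(m):
    """Sum of the diagonal entries of a square matrix."""
    return sum(m[i][i] for i in range(len(m)))

def one_norm(v):
    """Sum of absolute values, i.e. sum_i |v_i|."""
    return sum(abs(x) for x in v)

def two_norm(v):
    """Euclidean norm ||v||."""
    return math.sqrt(sum(x * x for x in v))

random.seed(0)  # fixed seed so the check is repeatable
n = 4           # illustrative dimension
a = [random.uniform(-1.0, 1.0) for _ in range(n)]
b = [random.uniform(-1.0, 1.0) for _ in range(n)]

# a^T b should equal tr(b a^T)
trace_identity_ok = math.isclose(inner(a, b), trace(outer(b, a)), abs_tol=1e-12)

# sum_i |a_i| should be at least ||a||
norm_inequality_ok = one_norm(a) >= two_norm(a)

print(trace_identity_ok, norm_inequality_ok)
```

The norm inequality follows from squaring both sides: \((\sum _i|\varsigma _i|)^2\) contains every term of \(\sum _i\varsigma _i^2\) plus nonnegative cross terms.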
Proof of Theorem 2
Consider the following Lyapunov function:
where \(L_{W_s}\) is defined in Eq. (36), and \(L_{W_c}\triangleq \frac{1}{2\alpha _c}\tilde{{\textbf{W}}}_c^{\top }\tilde{{\textbf{W}}}_c\), in which \(\tilde{{\textbf{W}}}_c\triangleq {\textbf{W}}_c-\hat{{\textbf{W}}}_c\). Using Eqs. (1), (5), and (42), we have
Therefore, using Eqs. (7) and (38), \({\dot{V}}^{*}\) can be written as
Defining \(\tilde{{\textbf{u}}}\triangleq \tanh ({\textbf{D}}^{*})-\tanh (\hat{{\textbf{D}}})\), and knowing that
and \(Q({\textbf{x}})>q_{\min }{\textbf{x}}^{\top }{\textbf{x}}\) for some \(q_{\min }\in {\mathbb {R}}^{+}\), an upper bound can be written for \({\dot{V}}^{*}\) as
where \(\kappa _1\triangleq 2\lambda \Vert {\textbf{W}}_c\Vert b_{\sigma cx}b_g+2\lambda b_{\epsilon cx}b_g\).
To develop the time derivative of the second term of the right-hand side of Eq. (55), we first write the Bellman error (Eq. 10) as follows by substituting the gradient of Eq. (41) into Eq. (8):
From Eq. (40), we have
Therefore, substituting this expression into Eq. (58), we obtain
Considering the definition of \(U({\textbf{u}}^{*})\) and \(U(\hat{{\textbf{u}}})\), given respectively in Eqs. (6) and (9), and knowing that \(\ln ({\textbf{1}}-\tanh ^2({\textbf{D}}^{*}))=\ln ({\textbf{4}})-2{\textbf{D}}^{*}\text {sgn}({\textbf{D}}^{*})+\varvec{\epsilon }_{D^{*}}\) and \(\ln ({\textbf{1}}-\tanh ^2(\hat{{\textbf{D}}}))=\ln ({\textbf{4}})-2\hat{{\textbf{D}}}\text {sgn}(\hat{{\textbf{D}}})+\varvec{\epsilon }_{D}\) for some bounded \(\varvec{\epsilon }_{D^{*}}\) and \(\varvec{\epsilon }_{D}\) [18], where \(\text {sgn}(\cdot )\) denotes the signum function, we develop the following expression:
The signum function can be approximated by a \(\tanh \) function with the following relation quantifying the approximation error [35]:
Therefore, the expression in Eq. (60) can be written as
where \(\varvec{\kappa }_{D^{*}}\) and \(\varvec{\kappa }_{D}\) denote the approximation errors. Then, considering the definitions of \({\textbf{D}}^{*}\) and \(\hat{{\textbf{D}}}\), and adding and subtracting \(\lambda {\textbf{W}}_c^{\top }\nabla \varvec{\sigma }_c{\textbf{g}}\tanh (\kappa \hat{{\textbf{D}}})\) to the right-hand side of Eq. (61), we have
Substituting Eq. (62) into Eq. (59) and rearranging, we obtain
where \(\varvec{\beta }\in {\mathbb {R}}^{{\mathcal {L}}_c}\) is defined in Eq. (44) and \(\epsilon _\delta \in {\mathbb {R}}\) is given by
where all of its elements are bounded terms, and thus an upper bound can be considered for it as \(|\epsilon _{\delta }|\le {\bar{\epsilon }}_{\delta }\). Also note that approximation errors \(\varvec{\epsilon }_{D^{*}}\), \(\varvec{\epsilon }_{D}\), \(\varvec{\kappa }_{D^{*}}\), and \(\varvec{\kappa }_{D}\) converge to zero as \({\textbf{x}}\) goes to zero. Following the same steps, the unmeasurable form of simulated Bellman error \(\delta _{Bj}\) can be obtained as
where \(\tilde{{\textbf{f}}}_{sj}\triangleq {\textbf{f}}_{j}-\hat{{\textbf{f}}}_{sj}={\textbf{e}}_{sj}\), subscript j indicates the jth sample of the variables, and \(\epsilon _{\delta j}\triangleq 2\lambda ^2\bar{{\textbf{R}}}(\varvec{\kappa }_{D^{*}}-\varvec{\kappa }_{Dj})+\lambda ^2\bar{{\textbf{R}}}(\varvec{\epsilon }_{Dj}-\varvec{\epsilon }_{D^{*}})+\lambda {\textbf{W}}_c^{\top }\nabla \varvec{\sigma }_c{\textbf{g}}(\tanh (\kappa {\textbf{D}}^{*})-\tanh (\kappa \hat{{\textbf{D}}}({\textbf{x}}_j)))-\nabla \epsilon _c{\textbf{f}}_j\), for which an upper constant bound can be considered.
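The two scalar approximations used in this derivation can be checked numerically. The sketch below assumes the standard bound from [35], \(0\le |d|-d\tanh (\kappa d)\le 0.2785/\kappa \), and evaluates the remainder \(\epsilon _D=\ln (1-\tanh ^2(d))-(\ln 4-2|d|)\) on a grid. The function names, the grid, and the \(\kappa \) values are illustrative, and the exact relation used in the paper may be stated differently.

```python
import math

KU = 0.2785  # constant in the tanh-based bound on the signum approximation

def sgn_gap(d, kappa):
    """Error of replacing d*sgn(d) = |d| by d*tanh(kappa*d)."""
    return abs(d) - d * math.tanh(kappa * d)

def log_remainder(d):
    """Remainder eps_D in ln(1 - tanh(d)^2) = ln(4) - 2*d*sgn(d) + eps_D."""
    return math.log(1.0 - math.tanh(d) ** 2) - (math.log(4.0) - 2.0 * abs(d))

# Illustrative grid on [-5, 5]; kept modest so 1 - tanh(d)^2 stays well above zero.
grid = [-5.0 + 0.01 * k for k in range(1001)]
kappas = [1.0, 10.0, 100.0]

# Largest approximation error of the tanh surrogate for each gain kappa
max_gap = {kappa: max(sgn_gap(d, kappa) for d in grid) for kappa in kappas}

# Range of the logarithmic remainder over the grid
rem_min = min(log_remainder(d) for d in grid)
rem_max = max(log_remainder(d) for d in grid)

for kappa in kappas:
    print(f"kappa={kappa:g}: max gap={max_gap[kappa]:.5f}, bound={KU / kappa:.5f}")
print(f"eps_D range on grid: [{rem_min:.5f}, {rem_max:.5f}]")
```

Writing \(1-\tanh ^2(d)=4/(e^{d}+e^{-d})^2\) gives \(\epsilon _D=-2\ln (1+e^{-2|d|})\), so the remainder lies in \([-2\ln 2,0)\). This is consistent with the boundedness stated in the text.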
Then, using Eqs. (63) and (64) and the adaptation rule (Eq. 43), the time derivative of \(L_{W_c}\) can be written as
where \(\bar{\varvec{\beta }}\triangleq \varvec{\beta }/(1+\varvec{\beta }^{\top }\varvec{\beta })\) and \(m_s\triangleq 1+\varvec{\beta }^{\top }\varvec{\beta }\), which satisfy the following inequality:
in which \(b_{\beta }\) is a positive constant. We have
where \(\vartheta _c>0\) is defined as
in which \(c_{\beta _2}\triangleq \frac{1}{N}(\sup \limits _{t\in {\mathbb {R}}_{\ge t_0}}(\lambda _{\max }(\sum _{j=1}^{N}\frac{\varvec{\beta }_j\varvec{\beta }_j^{\top }}{(1+\varvec{\beta }_j^{\top }\varvec{\beta }_j)^2})))\), and \(\hbar _1\triangleq {\bar{b}}_{\beta } b_{\sigma cx} b_{\sigma sx}\Vert {\varvec{W}}_c\Vert \Vert \tilde{{\textbf{W}}}_s\Vert _F\), \(\hbar _2\triangleq {\bar{b}}_{\beta }b_{\sigma cx}{\bar{\epsilon }}_{sj}\Vert {\varvec{W}}_c\Vert + {\bar{b}}_{\beta }{\bar{\epsilon }}_{\delta j}\), in which \({\bar{b}}_{\beta }\triangleq \sup \limits _{j=1\cdots N}(\Vert \bar{\varvec{\beta }}_{j}/m_{sj}\Vert )\), \({\bar{\epsilon }}_{sj}\triangleq \sup \limits _{j=1\cdots N}(\Vert \varvec{\epsilon }_{sj}\Vert )\) and \({\bar{\epsilon }}_{\delta j}\triangleq \sup \limits _{j=1\cdots N}(|\epsilon _{\delta j}|)\). Also, assuming that \(\varepsilon _c> \sup \limits _{j=1\cdots N}(\Vert \tilde{{\textbf{f}}}_{sj}\Vert )+{\bar{\epsilon }}_{\delta }\), the following inequality can be developed:
for some \(\varpi _{c1},\varpi _{c2}<1\). Then, defining the following positive constants:
and considering Eq. (45), the following upper bound can be written for \({\dot{L}}_{W_c}\):
Therefore, using Eqs. (37), (57), and (65) and Young’s inequality, an upper bound for \({\dot{L}}\) can be written as
where \(\eta _1,\eta _2,\eta _3\in {\mathbb {R}}^{+}\) are adjustable constants. Therefore, considering Eq. (66), whenever the gain conditions in Eq. (46) are satisfied, we conclude that \({\textbf{x}}\), \(\tilde{{\textbf{W}}}_c\), and \(\tilde{{\textbf{W}}}_s\) are uniformly ultimately bounded.
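For reference, the form of Young's inequality typically used to split cross terms in bounds of this kind is sketched below. How \(\eta _1,\eta _2,\eta _3\) enter Eq. (66) is not reproduced here, so the pairing of terms in the second line is an assumption made only for illustration.

```latex
\begin{aligned}
ab &\le \frac{\eta}{2}a^{2}+\frac{1}{2\eta}b^{2},
   \qquad \forall\, a,b\in\mathbb{R},\ \eta\in\mathbb{R}^{+}, \\
\text{e.g.}\quad
\Vert\mathbf{x}\Vert\,\Vert\tilde{\mathbf{W}}_c\Vert
   &\le \frac{\eta_{1}}{2}\Vert\mathbf{x}\Vert^{2}
      +\frac{1}{2\eta_{1}}\Vert\tilde{\mathbf{W}}_c\Vert^{2}.
\end{aligned}
```

The first line follows from expanding \(0\le (\sqrt{\eta }\,a-b/\sqrt{\eta })^2\). Each adjustable constant trades the weight of a cross term between the two quadratic terms it couples, which is why such constants usually appear in the resulting gain conditions.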
About this article
Cite this article
Asl, H.J., Uchibe, E. Reinforcement learning-based optimal control of unknown constrained-input nonlinear systems using simulated experience. Nonlinear Dyn 111, 16093–16110 (2023). https://doi.org/10.1007/s11071-023-08688-0
DOI: https://doi.org/10.1007/s11071-023-08688-0