1 Introduction

In recent decades, robots have been widely applied in industrial automation, for example as assembly, handling, and welding robots. They can not only cooperate with human partners on certain tasks, but also complete some tasks independently, or even replace human beings in hazardous environments with high temperature, high pressure, or radiation. However, in many practical applications, robots unavoidably interact with the external environment, which not only affects task execution but also directly threatens the safety of human partners and of the robots themselves. Consequently, interaction control between the robot and the environment has become an important research topic.

It is noted that two main approaches are applied in current robotics research to ensure compliant behaviour, i.e., hybrid position/force control proposed by Raibert and Craig (1981) and impedance control proposed by Hogan (1981). The former requires decomposition into position and force subspaces and control-law switching during implementation. Since the dynamic coupling between the robot and the external environment is not considered, the accuracy of this approach is difficult to guarantee. Comparatively, the latter establishes the relationship between the robot and the environment and achieves compliant behaviour by adjusting the mechanical impedance to a target value when interaction occurs, which guarantees interaction safety. Impedance control has two execution methods according to the controller causality, i.e., impedance control and admittance control. In an impedance control system, the interaction force is computed from the desired trajectory and the impedance model, while in an admittance control system, a modified motion trajectory is derived from the measured interaction force and the prescribed admittance model. Therefore, we adopt admittance control to deal with the robot-environment interaction problem.

The interaction force and the admittance model are the two key elements of admittance control. When interaction between the robot and the environment occurs, the interaction force can be measured by force sensors mounted at the end-effector of the robot. However, owing to the complexity of the environment, it is often very hard to obtain the desired admittance model, which is critical for the admittance control system. In addition, a fixed model cannot satisfy the requirements of all situations. Consequently, Braun et al. (2012) took human-robot cooperation as an example and proposed that a variable admittance model is essential to improve system efficiency. For variable admittance control, iterative learning has been studied to derive admittance parameters that adapt to an unknown environment. To complete a wall-following task, Cohen and Flash (1991) proposed an impedance learning strategy with an associative network. Tsuji et al. (1996) introduced neural networks into impedance control to tune the model parameters. However, iterative learning requires the robot to perform the same task repeatedly, which is not feasible in some practical applications. Therefore, researchers have adopted adaptation methods to solve this problem, such as Love and Book (2004), Uemura and Kawamura (2009), Stanisic and Fernández (2012), Landi et al. (2017) and Yao et al. (2018).

Tracking control is a very important research topic in robot intelligent control. In current studies, many control methods have been applied to robot systems. Cervantes and Alvarez-Ramirez (2001) and Parra-Vega et al. (2003) applied classic proportional-integral-derivative (PID) control to robot systems with satisfactory tracking performance. PID control is often used in industry owing to its simple structure and good performance, but for complex systems it is very difficult to choose appropriate PID parameters, which normally depends on the operator's experience. In recent years, neural network (NN) control has been investigated and applied to robot systems because of its strong ability to approximate unknown dynamics (Yang et al. 2017). In Zhang et al. (2018), NN control was employed to improve the tracking performance of robot systems with uncertainties. In Yang et al. (2019), an NN-based controller combined with admittance adaptation was proposed to tackle the robot-environment interaction problem. However, these control methods only deal with the stabilization problem of the system without considering optimal control. Based on optimal control theory, we expect to find a control strategy that enables the system to reach the target in an optimal manner. To achieve this goal, it is usually required to minimize a specified cost function by solving the Hamilton–Jacobi–Bellman (HJB) equation. The HJB equation for a nonlinear system is a nonlinear partial differential equation, so its analytical solution is non-trivial to derive. Dynamic programming proposed by Bellman (1957) provides a useful method for solving the HJB equation. However, since this method is based on a backward numerical process, it suffers from the well-known curse of dimensionality as the system dimension increases. To overcome this problem, Werbos (1992) proposed the adaptive dynamic programming (ADP) strategy, which uses NNs to approximate the cost function forward in time and thereby obtain the solution of the HJB equation. During the past few years, great efforts have been made on ADP to deal with control issues for nonlinear systems (Liu et al. 2014; Jiang and Jiang 2015), such as systems with dynamic uncertainties (Wang et al. 2018) and disturbances (Cui et al. 2017).

In practical control systems, actuator saturation is a common phenomenon, which may degrade system performance or even result in instability. Therefore, it is essential yet challenging to derive optimal control strategies for nonlinear systems with actuator saturation. Wenzhi and Selmic (2006) proposed an NN-based feed-forward saturation compensation strategy for nonlinear systems in Brunovsky canonical form. In Wen et al. (2011), the Nussbaum function was employed to compensate for the nonlinear term caused by input saturation. To handle the control issue for nonlinear systems with unknown saturation, auxiliary systems were proposed in He et al. (2016) and Peng et al. (2020) to tackle actuator saturation, and in Zhao et al. (2018) a control strategy consisting of an ADP-based nominal control and an NN-based compensator was proposed. In Abu-Khalaf and Lewis (2005), the HJB equation was formulated with a non-quadratic cost function, and an NN least-squares method was proposed to obtain its solution.

In Peng et al. (2020), robot-environment interaction and actuator saturation were considered, while optimal control was not. However, for robot systems, it is worthwhile to investigate how to realize tracking control in an optimal manner. Therefore, based on our previous work, the optimal tracking control problem for robot systems with environment interaction and actuator saturation is studied in this paper. Inspired by Abu-Khalaf and Lewis (2005), Lyshevski (1998) and Jiang and Jiang (2012), a control scheme based on admittance control and the ADP method is employed to improve the control performance of robot systems. The main contributions of this paper are summarized as follows:

(i) To solve the interaction problem, the unknown environment is regarded as a linear system and an admittance adaptation approach based on an iterative linear quadratic regulator (LQR) is adopted to obtain compliant behaviour of the robot.

(ii) To tackle the optimal tracking problem, an ADP-based controller is designed. The cost function is defined in a non-quadratic form. A critic network based on an RBFNN is developed to approximate the minimum cost solution of the HJB equation, from which the corresponding optimal control is obtained.

The rest of this paper is arranged as follows. In Sect. 2, the robot system with actuator saturation and the environment dynamics are described, and the control objective is provided. In Sect. 3, the control strategy based on admittance adaptation and an ADP-based optimal controller is proposed. In Sect. 4, simulation studies are performed on a 2-DOF planar manipulator. In Sect. 5, the conclusion is drawn. The system stability is discussed and proved in the Appendix.

2 Preliminaries and problem formulation

2.1 Robot dynamics

The dynamics of an n-link robot manipulator subject to actuator saturation are described as:

$$\begin{aligned} M(q){\ddot{q}}+C(q,{\dot{q}}){\dot{q}}+G(q)=\mu \end{aligned}$$
(1)

where \(q \in {{\mathbb {R}}}^n\), \( {\dot{q}} \in {{\mathbb {R}}}^n \), and \( {\ddot{q}} \in {{\mathbb {R}}}^n \) denote the position, velocity and acceleration vectors of the robot in joint space, respectively. \( \mu \), \(\lambda \) and A denote the joint torque, the admissible control set and the constant saturation bound, respectively, with \( \mu \in \lambda \) and \(\lambda = \{ \mu \in { {\mathbb {R}}}^n: \vert \mu _i \vert \le A \}\). For the sake of brevity, we use M, C and G to denote the known inertia matrix \(M(q)\in {{\mathbb {R}}}^{n\times n}\), Coriolis/centrifugal matrix \(C(q,{\dot{q}})\in {{\mathbb {R}}}^{n\times n}\) and gravity vector \(G(q)\in {{\mathbb {R}}}^n\), respectively.

If we define the reference trajectory as \(q_r\in {{\mathbb {R}}}^n\), the tracking error \(q_e\in {{\mathbb {R}}}^n\) is given as \(q_e=q-q_r\). Define the sliding surface as \(\xi =\varLambda q_e+{\dot{q}}_e\), where \(\varLambda \in {\mathbb {R}}^{n\times n}\) is a constant positive-definite matrix; then we have

$$\begin{aligned} \begin{aligned} {\dot{q}}\,=\,&\xi -\varLambda q_e+{\dot{q}}_r\\ {\ddot{q}}\,=\,&{\dot{\xi }}-\varLambda {\dot{q}}_e+{\ddot{q}}_r \end{aligned} \end{aligned}$$
(2)

According to (1) and (2), the error dynamics is derived as

$$\begin{aligned} \begin{aligned} {\dot{\xi }}=&-M^{-1}C(\xi -\varLambda q_e+{\dot{q}}_r)-M^{-1}G{}\\&-{\ddot{q}}_r+\varLambda {\dot{q}}_e+M^{-1}\mu \end{aligned} \end{aligned}$$
(3)

Consequently, we can obtain the following system:

$$\begin{aligned} {\dot{\xi }}=f(\xi )+g(\xi )\mu \end{aligned}$$
(4)

where \(f:{{\mathbb {R}}^{n}}\rightarrow {{\mathbb {R}}^{n}}\) and \(g:{{\mathbb {R}}^{n}}\rightarrow {{\mathbb {R}}^{n\times n}}\) are nonlinear functions described as

$$\begin{aligned} \begin{aligned} f(\xi )=&-M^{-1}C(\xi -\varLambda q_e+{\dot{q}}_r){}-M^{-1}G-{\ddot{q}}_r+\varLambda {\dot{q}}_e\\ g(\xi )\,=\,&M^{-1} \end{aligned} \end{aligned}$$
(5)
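
To make (4)-(5) concrete, the following minimal sketch (Python/NumPy; the dynamics terms M, C, G and the gain \(\varLambda \) are assumed to be supplied by the user) evaluates \(f(\xi )\) and \(g(\xi )\) at the current state:

```python
import numpy as np

def error_dynamics(xi, q_e, dq_e, dq_r, ddq_r, M, C, G, Lam):
    """Evaluate f(xi) and g(xi) of the error system (4)-(5).

    M, C, G : inertia matrix, Coriolis matrix and gravity vector of (1),
              evaluated at the current joint state
    Lam     : sliding-surface gain Lambda (positive definite)
    """
    M_inv = np.linalg.inv(M)
    f = (-M_inv @ C @ (xi - Lam @ q_e + dq_r)
         - M_inv @ G - ddq_r + Lam @ dq_e)   # eq. (5), first line
    g = M_inv                                # eq. (5), second line
    return f, g
```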

2.2 Environment dynamics

In this paper, we consider an unknown interaction environment, which is regarded as the damping-stiffness model of Ge et al. (2014) given by

$$\begin{aligned} C_E{\dot{x}}+G_Ex=-F \end{aligned}$$
(6)

where \(C_E\) and \(G_E\) are unknown damping and stiffness of the environment, respectively. F represents the measured interaction force by the force sensor and x is the end-effector position of the robot in Cartesian space.

We define \(x_d\) as the corresponding desired trajectory, generated by a known matrix \(U_d\in {\mathbb {R}}^{m\times m}\) as follows

$$\begin{aligned} {\dot{x}}_d = U_dx_d \end{aligned}$$
(7)

Consequently, defining \(\eta = [x,~ x_d]^T\), the combined dynamics of the environment and the desired trajectory are derived as

$$\begin{aligned} \begin{aligned} {\dot{\eta }}\,=\,&\left[ \begin{array}{cc}-C_E^{-1}G_E&{}\mathbf{0} \\ \mathbf{0} &{}U_d\end{array}\right] \eta +\left[ \begin{array}{c}-C_E^{-1}\\ \mathbf{0} \end{array}\right] F\\ =\,&A_e\eta +B_eF \end{aligned} \end{aligned}$$
(8)

Therefore, (8) can be regarded as a linear system, where F is the control input and \(\eta \) is the controlled state. \(F = -K_e\eta \) is the corresponding optimal feedback control law, and the objective is to minimize the cost function given as

$$\begin{aligned} \varGamma _1 = \int _{0}^{\infty }\left( x_e^TQ_{E1}x_e+F^TR_{E}F\right) dt \end{aligned}$$
(9)

From (9), we can see that the purpose of modifying the trajectory \(x_d\) is to balance the interaction force F and the tracking error \(x_e=x-x_d\), which can be realized by adjusting the user-defined matrices \(Q_{E1}\) and \(R_{E}\).

The robot dynamics with actuator saturation and the unknown environment dynamics have been described in this section. In the next section, an ADP-enhanced admittance control scheme is designed to ensure compliant behaviour and optimal trajectory tracking under robot-environment interaction.

3 Control strategy

As shown in Fig. 1, the control scheme designed in this section, inspired by Zhan et al. (2020), consists of three parts: an optimal trajectory modifier that uses admittance control to modify the user-desired trajectory \(x_d\) into the modified trajectory \(x_r\); a closed-loop inverse kinematics (CLIK) solver that transforms \(x_r\) in Cartesian space into \(q_r\) in joint space; and an optimal trajectory tracking controller based on ADP, whose output torque \(\mu \) acts on the robot manipulator to ensure optimal tracking performance.

Fig. 1  An illustration of the proposed control scheme

3.1 Trajectory modifier using admittance control

By transformation, the cost function (9) can be written in the following form, expressed in terms of the augmented state \(\eta \) of system (8).

$$\begin{aligned} \varGamma= & {} \int _{0}^{\infty }\left( \eta ^TQ_E\eta +F^TR_EF\right) dt \nonumber \\ Q_E= & {} \left[ \begin{array}{ll}Q_{E1}&{}-Q_{E1}U_d\\ -U_d^TQ_{E1}&{}U_d^TQ_{E1}U_d \end{array} \right] \end{aligned}$$
(10)

It is noted that solving (10) can be regarded as a process similar to the LQR problem. The algebraic Riccati equation (ARE) associated with (9) and (10) is given in (11). In this subsection, the algorithm proposed by Jiang and Jiang (2012) is employed to solve the ARE and obtain the feedback gain \(K_e\) in (11).

$$\begin{aligned} \begin{aligned}& PA_e+A_e^TP+Q_E-PB_eR_E^{-1}B_e^TP=0\\ &K_e= R_E^{-1}B_e^TP \end{aligned} \end{aligned}$$
(11)
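
For reference, if \(A_e\) and \(B_e\) were known, (11) could be solved directly. The sketch below (Python with NumPy/SciPy; the numerical values are purely illustrative, not taken from the paper) computes such a model-based solution, which the data-driven iteration introduced next approximates without knowledge of \(A_e\) and \(B_e\):

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# illustrative 1-D example: eta = [x, x_d], so n = 2 and m = 1
C_E, G_E = 0.1, 1.0                       # environment damping/stiffness
U_d = np.array([[-0.5]])                  # stable illustrative choice for (7)
A_e = np.block([[np.array([[-G_E / C_E]]), np.zeros((1, 1))],
                [np.zeros((1, 1)),         U_d]])
B_e = np.array([[-1.0 / C_E], [0.0]])

Q_E1, R_E = np.eye(1), np.eye(1)
Q_E = np.block([[Q_E1,           -Q_E1 @ U_d],
                [-U_d.T @ Q_E1,   U_d.T @ Q_E1 @ U_d]])   # eq. (10)

P = solve_continuous_are(A_e, B_e, Q_E, R_E)              # ARE in (11)
K_e = np.linalg.inv(R_E) @ B_e.T @ P                      # gain, F = -K_e eta
```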

Now, we list the matrices constructed from sampled signals as follows

$$\begin{aligned} \begin{aligned} {\hat{p}}&=\left[ p_{11}, 2p_{12}, \ldots , 2p_{1n}, p_{22}, 2p_{23}, \ldots , p_{nn}\right] ^{T} \\ {\bar{\eta }}&=\left[ \eta _{1}^{2}, \eta _{1} \eta _{2}, \ldots , \eta _{1} \eta _{n}, \eta _{2}^{2}, \eta _{2} \eta _{3}, \ldots , \eta _{n}^{2}\right] ^{T} \\ d_{{\bar{\eta }}}&=\left[ {\bar{\eta }}\left( t_{1}\right) -{\bar{\eta }}\left( t_{0}\right) , {\bar{\eta }}\left( t_{2}\right) -{\bar{\eta }}\left( t_{1}\right) , \ldots , {\bar{\eta }}\left( t_{d}\right) -{\bar{\eta }}\left( t_{d-1}\right) \right] ^{T} \\ I_{\eta }^{\eta }&=\left[ \int _{t_{0}}^{t_{1}} \eta \otimes \eta dt, \int _{t_{1}}^{t_{2}} \eta \otimes \eta dt, \ldots , \int _{t_{d-1}}^{t_{d}} \eta \otimes \eta d t\right] ^{T} \\ I^\eta _F&=\left[ \int _{t_{0}}^{t_{1}} \eta \otimes F d t, \int _{t_{1}}^{t_{2}} \eta \otimes F dt, \ldots , \int _{t_{d-1}}^{t_{d}} \eta \otimes F d t\right] ^{T} \end{aligned} \end{aligned}$$
(12)

where n, m and d denote the dimension of \(\eta \), the dimension of F and the number of sampling intervals, respectively. \(p_{ij}\) and \(\eta _i\) represent the entries of P and \(\eta \), respectively. In addition, in (12), \(\otimes \) represents the Kronecker product, and \({\hat{p}}\in {\mathbb {R}}^{\frac{1}{2}n(n+1)}\), \({\bar{\eta }}\in {\mathbb {R}}^{\frac{1}{2}n(n+1)}\), \(d_{{\bar{\eta }}}\in {\mathbb {R}}^{d\times \frac{1}{2}n(n+1)}\), \(I_{\eta }^{\eta }\in {\mathbb {R}}^{d\times n^2}\), \(I_{F}^{\eta }\in {\mathbb {R}}^{d\times nm}\).

Let \(\Vert *\Vert \) and \(vec(*)\) denote the 2-norm and the column vectorization of \(*\), respectively, and let k and \(I_n\in {\mathbb {R}}^{n\times n}\) denote the iteration index and the identity matrix, respectively. If enough data have been sampled so that the rank condition in (13) is satisfied, \(K_e\) can be obtained by iteratively calculating (14) until \(||{\hat{p}}^{(k)}-{\hat{p}}^{(k-1)}||<\varepsilon \), where \(\varepsilon \) is a small convergence threshold.

$$\begin{aligned}&rank\left( \left[ I^\eta _\eta ,~ I^\eta _F\right] \right) =\frac{n(n+1)}{2}+nm \end{aligned}$$
(13)
$$\begin{aligned}&\quad Q_E^{(k)}=Q_E+K_e^{(k)T}R_EK_e^{(k)}\nonumber \\&\quad \varTheta ^{(k)}=\left[ d_{{\bar{\eta }}},-2 I_\eta ^\eta \left( I_{n} \otimes K_e^{(k)T}R_E\right) -2 I^\eta _F\left( I_{n} \otimes R_E\right) \right] \nonumber \\&\quad \varXi ^{(k)}=-I_\eta ^\eta vec\left( Q_E^{(k)}\right) \nonumber \\&\quad \left[ \begin{array}{l}{ {\hat{p}}^{(k)}} \\ {vec\left( K_e^{(k+1)}\right) }\end{array}\right] =\left( \varTheta ^{(k)T} \varTheta ^{(k)}\right) ^{-1} \varTheta ^{(k)T} \varXi ^{(k)} \end{aligned}$$
(14)
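
A minimal sketch of the iteration (14) is given below (Python/NumPy), assuming the data matrices of (12) have already been assembled from sampled trajectories; the initial gain `K0` and the tolerance are user-supplied:

```python
import numpy as np

def admittance_gain_adaptation(d_eta, I_ee, I_eF, Q_E, R_E, K0,
                               n, m, eps=1e-6, max_iter=50):
    """Data-driven policy iteration (14).

    d_eta : (d, n(n+1)/2) increments of eta_bar, cf. (12)
    I_ee  : (d, n*n)      integrals of kron(eta, eta)
    I_eF  : (d, n*m)      integrals of kron(eta, F)
    K0    : initial (ideally stabilizing) gain, shape (m, n)
    """
    K, p_prev = K0, None
    for _ in range(max_iter):
        Q_k = Q_E + K.T @ R_E @ K
        Theta = np.hstack([
            d_eta,
            -2.0 * I_ee @ np.kron(np.eye(n), K.T @ R_E)
            - 2.0 * I_eF @ np.kron(np.eye(n), R_E),
        ])
        Xi = -I_ee @ Q_k.reshape(-1, order="F")        # -I vec(Q_E^{(k)})
        sol, *_ = np.linalg.lstsq(Theta, Xi, rcond=None)
        p_hat = sol[: n * (n + 1) // 2]
        K = sol[n * (n + 1) // 2:].reshape((m, n), order="F")
        if p_prev is not None and np.linalg.norm(p_hat - p_prev) < eps:
            break                                      # stopping rule
        p_prev = p_hat
    return K, p_hat
```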

Once the optimal feedback gain \(K_e\) is obtained, the modified trajectory \(x_r\), which is to be tracked and plays the role of x in (15), can be calculated by (16), where \(K_{e1}\) and \(K_{e2}\) are compatible partitions of \(K_e\).

$$\begin{aligned} F= & {} -K_e\eta =-\left[ \begin{array}{cc}K_{e1}&K_{e2}\end{array}\right] \left[ \begin{array}{c}x\\ x_d\end{array}\right] \end{aligned}$$
(15)
$$\begin{aligned} x_r= & {} -K_{e1}^{-1}F - K_{e1}^{-1}K_{e2}x_d \end{aligned}$$
(16)
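
In code, (16) is a one-line computation once \(K_e\) has been partitioned (a sketch; `K_e1` is assumed invertible):

```python
import numpy as np

def modified_trajectory(F, x_d, K_e1, K_e2):
    """Modified reference x_r from the measured force, eq. (16)."""
    K1_inv = np.linalg.inv(K_e1)
    return -K1_inv @ F - K1_inv @ K_e2 @ x_d
```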

3.2 CLIK solver

We adopt the CLIK algorithm proposed by Siciliano (1990) to transform the reference trajectory \(x_r\) in Cartesian space into \(q_r\) in joint space. Let \(\kappa (*)\) and \(K_f\) represent the forward kinematics and a positive user-defined matrix, respectively. Define \(e:=\kappa (q_r)-x_r\), \({\dot{e}}=-K_fe \), \({\dot{x}}=J_{co}{\dot{q}}\), \(J_{co}=\partial \kappa (q)/\partial q\); then

$$\begin{aligned} {\dot{q}}_r=J_{co}^\dagger ({\dot{x}}_r-K_f(\kappa (q_r)-x_r)) \end{aligned}$$
(17)

Integrating both sides of the above equation, \(q_r\) can be obtained as follows

$$\begin{aligned} q_r=\int _{0}^{t}\left( J_{co}^\dagger {\dot{x}}_r-J_{co}^\dagger K_f(\kappa (q_r)-x_r)\right) dt \end{aligned}$$
(18)

where \(q_r(0)=\kappa ^{-1}(x_r(0))\), \(J_{co}^\dagger =J_{co}^T(J_{co}J_{co}^T+\sigma I_n)^{-1}\), and \(\sigma \in {\mathbb {R}}\). Note that \(\sigma \) is introduced to avoid singularities, and it should be small enough to preserve the accuracy of the solution.
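
The whole CLIK update can be realized with a simple forward-Euler integration of (17); the sketch below assumes user-supplied forward kinematics `kappa` and Jacobian `jacobian` functions and a scalar gain \(K_f\):

```python
import numpy as np

def clik_step(q_r, x_r, dx_r, kappa, jacobian, K_f, dt, sigma=1e-6):
    """One Euler step of (17)-(18) for the reference joint trajectory."""
    J = jacobian(q_r)                                  # J_co
    # damped pseudoinverse from (18); sigma avoids singularities
    J_dag = J.T @ np.linalg.inv(J @ J.T + sigma * np.eye(J.shape[0]))
    e = kappa(q_r) - x_r                               # task-space error
    dq_r = J_dag @ (dx_r - K_f * e)                    # eq. (17)
    return q_r + dt * dq_r
```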

3.3 Optimal control using ADP

The objective of this section is to find a stabilizing control input \(\mu \) for the robot system (4) that minimizes the defined cost function. According to optimal control theory, the optimal feedback control of system (4) can be obtained by solving the HJB equation in the ADP framework. The structure diagram of the ADP-based tracking controller is given in Fig. 2.

Fig. 2  Structure diagram of the ADP-based tracking controller

We assume that system (4) is controllable and that the nonlinear functions \(f(\xi )\) and \(g(\xi )\) are Lipschitz continuous and differentiable in \({\mathbb {R}}^{n}\). In order to deal with the actuator saturation of the robot system, inspired by Abu-Khalaf and Lewis (2005) and Lyshevski (1998), we define the cost function as follows

$$\begin{aligned} J(\xi (t))=\int _{t}^{\infty }\left[ \varPhi (\xi (s))+U(\xi (s), \mu (\xi (s)))\right] ds \end{aligned}$$
(19)

where

$$\begin{aligned} \varPhi (\xi (s))\,=\, & {} \xi (s)^{\mathrm {T}} Q \xi (s) \end{aligned}$$
(20)
$$\begin{aligned} U(\xi (s), \mu (\xi (s)))\,= \, & {} 2A \int _{0}^{\mu } ({\varPsi }^{-1}(v/A))^{\mathrm {T}}Rdv \end{aligned}$$
(21)

It is noted that \(Q\in {\mathbb {R}}^{n\times n}\) in (20) is symmetric positive definite. In (21), \({\varPsi }^{-1}(v/A)= {\left[ {\psi }^{-1}(v_1/A), {\psi }^{-1}(v_2/A),\cdots , {\psi }^{-1}(v_n/A) \right] }^{\mathrm {T}}\), \(\varPsi \in {\mathbb {R}}^{n}\), where \(\psi (\cdot )\) is a strictly monotonic odd function whose first derivative is bounded by a constant B. R is also a symmetric positive definite matrix; therefore \(U(\xi (s), \mu (\xi (s)))\) is positive definite. Without loss of generality, we select \(\psi (\cdot ) = \tanh (\cdot )\) and \(R=rI_n\), with r a positive constant and \(I_n\) the n-dimensional identity matrix.
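
For the scalar case with \(\psi = \tanh \), the integral in (21) can be evaluated in closed form (a direct evaluation, using \(\int \tanh ^{-1}(v/A)\,dv = v\tanh ^{-1}(v/A)+\frac{A}{2}\ln (1-v^2/A^2)\)), which is convenient for implementation and anticipates the identity (29) used later:

$$\begin{aligned} 2Ar\int _{0}^{\mu }\tanh ^{-1}(v/A)\,dv = 2Ar\,\mu \tanh ^{-1}(\mu /A)+A^{2}r\ln \left( 1-\mu ^{2}/A^{2}\right) \end{aligned}$$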

If \(J(\xi (t))\) defined in (19) is continuously differentiable, taking the time derivative of (19) yields the following nonlinear Lyapunov equation with \(J(0)=0\), which is an infinitesimal version of (19).

$$\begin{aligned} 0\,=\, & {} \varPhi (\xi )+2Ar\int _{0}^{\mu } (\tanh ^{-1}(v/A))^{\mathrm {T}}dv\nonumber \\&+(\nabla J(\xi ))^{\mathrm {T}}(f(\xi )+g(\xi )\mu (\xi )) \end{aligned}$$
(22)

where \(J(\xi )\) denotes \(J(\xi (t))\) and, for convenience, \(\nabla * \triangleq \frac{\partial *}{\partial \xi }\) denotes the partial derivative of \(*\) with respect to \(\xi \).

Therefore, the Hamiltonian function and optimal cost function are described as

$$\begin{aligned} H(\xi , \mu (\xi ), \nabla J(\xi ))\,= \,& {} \varPhi (\xi )+2Ar\int _{0}^{\mu } (\tanh ^{-1}(v/A))^{\mathrm {T}}dv+\nonumber \\&(\nabla J(\xi ))^{\mathrm {T}}(f(\xi )+g(\xi )\mu (\xi )) \end{aligned}$$
(23)
$$\begin{aligned} {J(\xi )}^*= & {} \min \limits _{\mu \in \lambda}\nonumber \\&\int _{t}^{\infty }\left[ \varPhi (\xi (s))+U(\xi (s), \mu (\xi (s)))\right] ds \end{aligned}$$
(24)

We can derive HJB equation as below

$$\begin{aligned} 0=\min \limits _{\mu \in \lambda}H(\xi , \mu (\xi ), \nabla {J^*(\xi )}) \end{aligned}$$
(25)

Suppose that the minimum on the right-hand side of (25) exists and is unique; then from \(\frac{\partial H(\xi , \mu (\xi ), \nabla J^*(\xi ))}{\partial \mu }=0\) we can obtain the optimal control \(\mu ^*(\xi )\) as

$$\begin{aligned} \mu ^{*}(\xi )=-A \tanh \left( \frac{1}{2A}r^{-1}g^{\mathrm {T}}(\xi ) \nabla {J^*(\xi )}\right) \end{aligned}$$
(26)

Substituting (26) into (22), another HJB equation form related to \(\nabla {J^*(\xi )}\) will be derived as

$$\begin{aligned} H(\xi , \mu ^{*}(\xi ), \nabla J^{*}(\xi ))=0 \end{aligned}$$
(27)

Then, from (26) and (27), the HJB equation for the robot system with actuator saturation becomes

$$\begin{aligned}&H(\xi , \mu ^{*}(\xi ), \nabla J^{*}(\xi ))=(\nabla J^*(\xi ))^\mathrm {T}f(\xi )\nonumber \\&\quad -2A^2rD^\mathrm {T}(\xi )\tanh (D(\xi ))\nonumber \\&+\varPhi (\xi )+2Ar\int _{0}^{-A\tanh (D(\xi ))} \tanh ^\mathrm {-T}(v/A)dv=0 \end{aligned}$$
(28)

where \(D(\xi )=\frac{1}{2A}r^{-1}g^{\mathrm {T}}(\xi )\nabla J^*(\xi )\). Applying the integral formula of inverse hyperbolic function, we have

$$\begin{aligned}&2Ar\int _{0}^{-A\tanh (D(\xi ))} \tanh ^\mathrm {-T} \left( v/A \right) dv\nonumber \\&\quad =2Ar\sum _{i=1}^n{\int _{0}^{-A\tanh (D_i(\xi ))} \tanh ^\mathrm {-T} \left( v_i/A\right) dv_i}\nonumber \\&\quad =2A^2rD^\mathrm {T} \left( \xi \right) \tanh \left( D \left( \xi \right) \right) \nonumber \\&\quad\quad +A^2r \sum _{i=1}^n{\ln \left[ 1-\tanh ^2(D_i(\xi )) \right] } \end{aligned}$$
(29)

where \(D(\xi )=(D_1(\xi ), \ldots , D_n(\xi ))^\mathrm {T}\) with \(D_i(\xi ) \in {\mathbb {R}} , i=1, \ldots , n\). Substituting (29) into (28), (28) can be rewritten as follows

$$\begin{aligned}&H(\xi , \mu ^{*}(\xi ), \nabla J^{*}(\xi ))=(\nabla J^*(\xi ))^{\mathrm {T}}f(\xi )\nonumber \\&+\varPhi (\xi )+A^2r\sum _{i=1}^n{\ln \left[ 1-\tanh ^2(D_i(\xi )) \right] }=0 \end{aligned}$$
(30)

However, (30) is a nonlinear partial differential equation with respect to \(J^{*}(\xi )\), and it is very difficult, if not impossible, to obtain \(J^{*}(\xi )\) from it analytically.

Supposing \(J^*(\xi )\) is continuously differentiable, it can be approximated by an RBFNN as

$$\begin{aligned} J^{*}(\xi )=w^{\mathrm {T}} S(\xi )+\varepsilon (\xi ) \end{aligned}$$
(31)

where \(w \in {{{\mathbb {R}}}^l}\) and \(S:{{\mathbb {R}}}^{n}\rightarrow {{\mathbb {R}}}^l \) represent the ideal constant weight and the activation function, respectively, and l and \(\varepsilon (\xi )\) denote the number of hidden-layer nodes and the unknown approximation error of the critic NN, respectively. Consequently, the derivative of (31) with respect to \(\xi \) is obtained as follows.

$$\begin{aligned} \nabla {J^*(\xi )}={(\nabla S(\xi ))}^{\mathrm {T}} w+\nabla {\varepsilon (\xi )} \end{aligned}$$
(32)

From (26) and (32) and using Taylor series expansion, we have \(\mu ^{*}\) shown as

$$\begin{aligned} \mu ^{*}(\xi )= & {} -A \tanh \left( \frac{1}{2A}r^{-1}g^{\mathrm {T}}(\xi ){(\nabla S(\xi ))}^{\mathrm {T}}w \right) +\varepsilon _{\mu ^*} \end{aligned}$$
(33)
$$\begin{aligned} \varepsilon _{\mu ^*}= & {} -\frac{1}{2} \left( \mathbf {1}-\tanh ^{2}( \iota )\right) g^{\mathrm {T}}(\xi )\nabla {\varepsilon (\xi )} \end{aligned}$$
(34)

where \( \mathbf {1}=(1,\ldots , 1)^{\mathrm {T}} \in {{\mathbb {R}}}^{n} \) and \(\iota \in {{\mathbb {R}}}^{n}\) is a point between \( \frac{1}{2A}r^{-1}g^{\mathrm {T}}(\xi ){(\nabla S(\xi ))}^{\mathrm {T}}w \) and \( \frac{1}{2A}r^{-1}g^{\mathrm {T}}(\xi )\left( {(\nabla S(\xi ))}^{\mathrm {T}}w+\nabla {\varepsilon (\xi )} \right) \). Then, substituting (32) into (30), (30) can be written as

$$\begin{aligned} H^{*}(\xi , \mu ^{*}(\xi ), \nabla J^{*}(\xi ))\,= \,& {} w^\mathrm {T}(\nabla S(\xi ))f(\xi )+\varPhi (\xi )\nonumber \\&+A^2 r \sum _{i=1}^n{\ln \left[ 1-\tanh ^2(B_{1i}(\xi )) \right] }\nonumber \\&+\varepsilon _{HJB}=0 \end{aligned}$$
(35)
$$\begin{aligned} B_1(\xi )= & {} \frac{1}{2A}r^{-1}g^\mathrm {T}(\xi ){(\nabla S(\xi ))}^{\mathrm {T}}w \end{aligned}$$
(36)

where \(B_1(\xi )=(B_{11}(\xi ),\ldots ,B_{1n}(\xi ))^\mathrm {T}\), \(B_{1i}(\xi ) \in {{\mathbb {R}}}\), and \(\varepsilon _{HJB}\) is the HJB approximation error.

In practice, the ideal weight w and \(J^*(\xi )\) in (31) are not available, so we use the estimated weight \({\hat{w}}\) and the estimated optimal cost function \({{\hat{J}}}(\xi )\) given by the constructed critic NN as

$$\begin{aligned} {\hat{J}}(\xi )={\hat{w}}^{\mathrm {T}} S(\xi ) \end{aligned}$$
(37)

Then, the partial derivative of \({\hat{J}}(\xi )\) with respect to \(\xi \) and the approximate optimal control \(\hat{\mu }(\xi )\) can be obtained as follows

$$\begin{aligned} \nabla {\hat{J}}(\xi )\,= \,& {} {(\nabla S(\xi ))}^{\mathrm {T}} {\hat{w}} \end{aligned}$$
(38)
$$\begin{aligned} {\hat{\mu }}(\xi )= & {} -A \tanh \left( \frac{1}{2A}r^{-1}g^{\mathrm {T}}(\xi ){(\nabla S(\xi ))}^{\mathrm {T}}{\hat{w}} \right) \end{aligned}$$
(39)
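
A minimal sketch of the critic evaluation (37)-(38) and the approximate control (39) with Gaussian RBF features is given below (Python/NumPy; the centers, width and weight vector are assumed given):

```python
import numpy as np

def rbf_features(xi, centers, sigma_N):
    """Gaussian activations S(xi) and their Jacobian (rows = grad S_i)."""
    diff = xi - centers                                  # (l, n)
    S = np.exp(-np.sum(diff**2, axis=1) / sigma_N**2)    # (l,)
    dS = -2.0 / sigma_N**2 * diff * S[:, None]           # (l, n)
    return S, dS

def approx_optimal_control(xi, w_hat, centers, sigma_N, g_xi, r, A):
    """Approximate optimal control, eq. (39)."""
    _, dS = rbf_features(xi, centers, sigma_N)
    grad_J = dS.T @ w_hat                 # (nabla S)^T w_hat, eq. (38)
    return -A * np.tanh(g_xi.T @ grad_J / (2.0 * A * r))
```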

According to (23), (38) and (39), we can obtain the approximate Hamiltonian function \({{\hat{H}}}(\xi , \hat{\mu }(\xi ), \nabla {{\hat{J}}}(\xi ))\) shown as

$$\begin{aligned} {\hat{H}}(\xi ,\hat{\mu }(\xi ),\nabla {\hat{J}}(\xi ))= \,& {} {{\hat{w}}}^\mathrm {T}(\nabla S(\xi ))f(\xi )+\varPhi (\xi )\nonumber \\&+A^2r \sum _{i=1}^n{\ln \left[ 1-\tanh ^2(B_{2i}(\xi )) \right] } \end{aligned}$$
(40)
$$\begin{aligned} B_2(\xi )= \,& {} \frac{1}{2A}r^{-1}g^\mathrm {T}(\xi )(\nabla S(\xi ))^{\mathrm {T}} {\hat{w}} \end{aligned}$$
(41)

where \(B_2(\xi )=(B_{21}(\xi ),\ldots, B_{2n}(\xi ))^\mathrm {T}\) and \(B_{2i}(\xi ) \in {{\mathbb {R}}}\). Now we define the critic NN weight estimation error as \({\tilde{w}}=w-{{\hat{w}}}\) and the error between \({{\hat{H}}}\) and \(H^{*}\) as \(E_H\); then we have

$$\begin{aligned} E_{H}\,= \,& {} {\hat{H}}(\xi ,\hat{\mu }(\xi ),\nabla {\hat{J}}(\xi ))-H^{*}(\xi , \mu ^{*}(\xi ), \nabla J^{*}(\xi ))\nonumber \\= \, & {} {\hat{H}}(\xi ,\hat{\mu }(\xi ),\nabla {\hat{J}}(\xi ))\nonumber \\=\, & {} -{\tilde{w}}^{\mathrm {T}}\nabla { S(\xi )}f(\xi )+A^{2}r \nonumber \\&\sum _{i=1}^n{\left[ \varUpsilon (B_{2i}(\xi ))-\varUpsilon (B_{1i}(\xi ))\right] }-\varepsilon _{HJB} \end{aligned}$$
(42)

where \(\varUpsilon (B_{\ell i}(\xi ))=\ln \left[ 1-\tanh ^2(B_{\ell i}(\xi )) \right] \), \(\ell =1,2\) and \(i=1, \ldots , n\). Note that \(\varUpsilon (B_{\ell i}(\xi ))\) can be expressed as

$$\begin{aligned} \begin{aligned} \varUpsilon (B_{\ell i}(\xi ))&=\ln \left[ 1-\tanh ^2(B_{\ell i}(\xi )) \right] \\&= \left\{ \begin{array}{lll} \ln 4-2B_{\ell i}(\xi )-2\ln \left( 1+\exp \left( -2B_{\ell i}(\xi )\right) \right) , B_{\ell i}(\xi ) >0\\ \ln 4+2B_{\ell i}(\xi )-2\ln \left( 1+\exp \left( 2B_{\ell i}(\xi )\right) \right) , B_{\ell i}(\xi ) <0 \end{array} \right. \end{aligned} \end{aligned}$$
(43)

For convenience, it can be written as follows

$$\begin{aligned}&\varUpsilon (B_{\ell i}(\xi ))\nonumber \\&\quad =\ln 4-2B_{\ell i}(\xi ) \mathrm {sgn} {( B_{\ell i}(\xi ))}-2 \nonumber \\&\quad \ln {[1+\exp (-2B_{\ell i}(\xi )\mathrm {sgn} (B_{\ell i}(\xi )))]} \end{aligned}$$
(44)

where \(\mathrm {sgn}(B_{\ell i}(\xi ))\) is the sign function.
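
Form (44) is exactly what one implements, since evaluating \(\ln [1-\tanh ^2(\cdot )]\) directly loses precision and underflows to \(-\infty \) for moderately large arguments; a sketch:

```python
import numpy as np

def upsilon(B):
    """Numerically stable ln(1 - tanh(B)^2) via (44), elementwise."""
    s = np.sign(B)
    return (np.log(4.0) - 2.0 * B * s
            - 2.0 * np.log1p(np.exp(-2.0 * B * s)))
```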

To train the critic NN, inspired by Liu et al. (2017) and Yang et al. (2013), a suitable updating law for the weight \({{\hat{w}}}\) is designed, which minimizes the objective function \(E_c=\frac{1}{2}E_H^{2}\) while ensuring that \({{\hat{w}}}\) converges to a neighbourhood of w.

$$\begin{aligned} \begin{aligned} \dot{{\hat{w}}}=&-\alpha _H {\bar{\phi }}(\varPhi (\xi )\\&+{{\hat{w}}}^{\mathrm {T}}\nabla {S(\xi )}f(x)+A^2r \sum _{i=1}^n \ln [1-\tanh ^2(B_{2i}(\xi ))])\\&+\frac{\alpha _H}{2}h\nabla { S(\xi )}g(\xi )[I_n- Z(B_{2}(\xi ))]g^{\mathrm {T}}(\xi )\nabla {V_s(\xi )}\\&+\alpha _H ( A\nabla {S(\xi )}g(\xi )[\tanh (B_{2}(\xi ))- \mathrm {sgn} (B_{2}(\xi ))]\frac{\varphi ^{\mathrm {T}}}{m_s} {{\hat{w}}} \\&-(F_2 {{\hat{w}}}-F_1 \varphi ^{\mathrm {T}} {{\hat{w}}})) \end{aligned} \end{aligned}$$
(45)

where \({\bar{\phi }}={\phi }/{m_s}^2\), \(m_s=1+{\phi }^{\mathrm {T}}\phi \), \(\phi =\nabla {S(\xi )}f(\xi )-A\nabla {S(\xi )}g(\xi )\tanh (B_2(\xi ))\), \(\varphi =\phi /m_s\), \(\alpha _H >0\) is a design parameter, \( Z(B_2(\xi )) = \mathrm {diag} \left[ \tanh ^2(B_{21}(\xi )), \ldots , \tanh ^2(B_{2n}(\xi )) \right] \), and \(F_1\) and \(F_2\) are tuning parameters with suitable dimensions. In (45), h is defined as follows:

$$\begin{aligned} h=\left\{ \begin{array}{l} {0,\quad \text{ if } {(\nabla V_s(\xi ))^{\mathrm {T}}(f(\xi )-Ag(\xi )\tanh (B_2(\xi )))}<0} \\ {1, \quad \text{ else } } \end{array}\right. \end{aligned}$$
(46)

where \(V_s(\xi )\) is chosen as a continuously differentiable Lyapunov function candidate. Supposing that a positive definite matrix N exists, the following is satisfied.

$$\begin{aligned} \begin{aligned} \dot{V}_s(\xi )&={(\nabla V_s(\xi ))^{\mathrm {T}}(f(\xi )+g(\xi ){\mu ^*})}\\ {}&=-(\nabla V_s(\xi ))^{\mathrm {T}}N{\nabla V_s(\xi )}<0 \end{aligned} \end{aligned}$$
(47)

Here, \(V_s(\xi )\) is a polynomial with regard to the state variable \(\xi \), which can be appropriately selected, such as \(V_s(\xi )=\frac{1}{2}\xi ^{\mathrm {T}}k_{\xi } \xi \).
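
With this choice, the switch (46) reduces to a sign test on the estimated derivative of \(V_s\); a sketch (all arguments evaluated at the current state):

```python
import numpy as np

def h_switch(xi, f_xi, g_xi, B2, A, k_xi):
    """Stability switch h of (46) for V_s = 0.5 * xi^T k_xi xi."""
    grad_Vs = k_xi @ xi
    vdot = grad_Vs @ (f_xi - A * g_xi @ np.tanh(B2))
    return 0.0 if vdot < 0.0 else 1.0
```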

Remark 1

The updating law \(\dot{{{\hat{w}}}}\) in (45) consists of two parts: the first term is based on the standard gradient-descent algorithm, and the remaining terms are introduced to ensure the stability of the robot system during the critic NN learning process. Note that in (46), if \((\nabla V_s(\xi ))^{\mathrm {T}}(f(\xi )-Ag(\xi )\tanh (B_2(\xi )))\ge 0\), the system tends to be unstable, so \(h=1\) and the stabilizing term in (45) is activated, which improves the learning process. Therefore, the requirement of an initial stabilizing control is relaxed.

Remark 2

From (40) and (45), we can see that if \(\xi =0\) and \(f(\xi )=0\), then \({\hat{H}}(\xi ,\hat{\mu }(\xi ),\nabla {\hat{J}}(\xi ))=0\); if in addition \(F_2=F_1 \varphi ^{\mathrm {T}}\), then \(\dot{{\hat{w}}}=0\), so the critic NN is no longer updated and the optimal control may not be obtained. Consequently, a persistent excitation condition is required.

3.4 Stability analysis

We now discuss the stability of the robot system and give a detailed proof that the estimated weight error \({{\tilde{w}}}\) and the system state \(\xi \) are uniformly ultimately bounded.

Now we give the necessary assumption as follows:

Assumption

There exist known positive constants \(w_m\), \(\varepsilon _M\) and \(\varepsilon _N\) such that \(\Vert {w}\Vert \le w_{m}\), \(\Vert {\varepsilon }\Vert \le {\varepsilon _M}\) and \(\Vert {\varepsilon _{\mu ^*}}\Vert \le {\varepsilon _N}\), respectively. The function \(g(\xi )\) in (4) is bounded over a compact set \(\varOmega \), i.e., there exist positive constants \(g_m\) and \(g_M\) such that \(g_m \le \Vert g(\xi ) \Vert \le g_M\).

Theorem

Consider the robot system (1) subject to actuator saturation, the corresponding HJB equation (30) and the Assumption. If the control law is designed as (39) and the critic NN weight is updated according to (45), then the critic NN weight approximation error \({\tilde{w}}\) and the state \(\xi \) are guaranteed to be uniformly ultimately bounded (UUB).

Proof

See the Appendix. \(\square \)

4 Simulation study

4.1 Simulation settings

Table 1  Parameters of the robot manipulator

Fig. 3  An illustration of the simulation settings

A two-link manipulator, constructed with the Robotics Toolbox of Corke (2017) and shown in Fig. 3, is employed to verify the proposed control strategy; its dynamic parameters are given in Table 1. The simulation runs in Matlab 2018a with an ode3 solver and a fixed time step of 0.01 s. The robot manipulator is required to track a reference trajectory while simultaneously interacting with a virtual environment governed by

$$\begin{aligned} F = {\left\{ \begin{array}{ll} -C_E{\dot{x}}-G_E(x-x_0), &{}\quad x\le x_0\\ 0, &{}\quad x>x_0 \end{array}\right. } \end{aligned}$$
(48)

where \(C_E=0.1\), \(G_E=1.0\), \(x_0\) denotes the contour of an object, and F denotes the reactive force due to penetration into the object. For simplicity and without loss of generality, only the trajectory along the x-axis is modified and disturbed by the external interaction forces.
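
For reproduction, the environment (48) amounts to a piecewise force law; a sketch (the contour value `x0` below is a placeholder, as its numerical value is not stated here):

```python
def contact_force(x, dx, C_E=0.1, G_E=1.0, x0=0.28):
    """Reactive force of the virtual environment, eq. (48)."""
    if x <= x0:                       # penetration into the object
        return -C_E * dx - G_E * (x - x0)
    return 0.0                        # free motion
```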

Parameters of the proposed control scheme are selected as follows. For the “Optimal Trajectory Modifier” block in Fig. 1, in (10), \(Q_{E1} = 1.0\) and \(R_E = 1.0\); the reference trajectory is \(x_d=[0.3e^{-0.5t},0.5]^T\,\mathrm {m}\), where \(U_d = 0.3\); for the inverse kinematics in (18), the feedback gain is \(K_f = 30\) and \(\sigma = 10^{-6}\). An RBFNN is selected to approximate the cost function in (31), where \(S(\xi )=\exp (-(\xi -c)^T(\xi -c)/{\sigma _N}^2)\) with \({\hat{w}}\in {\mathbb {R}}^{49}\) and \(S(\xi )\in {\mathbb {R}}^{49}\). For the controller in (39), \(A=6\,\mathrm {N\cdot m}\), the centers of the RBFNN are placed on the grid \(c\in [-1.5,-0.5,-0.1,0,0.1,0.5,1.5]\times [-1.5,-0.5,-0.1,0,0.1,0.5,1.5]\), the variance is \(\sigma _N=0.6\), and \({\hat{w}}(0)=\mathbf{0} \). For the updating law in (45), \(V_s=2\xi ^T\xi \), \(\alpha _H=30\), \(Q=200\), \(R=0.006\), \(F_1=10^{-6}\), \(F_2=10^{-8}\).
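
The 49 RBF centers above form a 7 × 7 grid over the two components of \(\xi \); a sketch of how they can be instantiated:

```python
import numpy as np

pts = np.array([-1.5, -0.5, -0.1, 0.0, 0.1, 0.5, 1.5])
centers = np.array([[a, b] for a in pts for b in pts])   # (49, 2) grid
sigma_N, A, r = 0.6, 6.0, 0.006                          # from the settings
w_hat = np.zeros(49)                                     # w_hat(0) = 0
```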

4.2 Results analysis

Fig. 4  Tracking performance of the proposed control scheme. Top: reference and modified trajectory during the interaction. Bottom: tracking errors

Fig. 5  Control signals of the proposed control scheme. Top: control input of the controller. Bottom: weights of the RBFNN

The control performance is shown in Fig. 4. At the beginning of the run there is a large transient error, since the weights of the RBFNN have not yet converged; however, before the trajectory starts to be modified at \(t=4.2\,\mathrm {s}\), the tracking error has already been reduced to an acceptable range. Subsequently, the actual trajectory gradually converges to the desired trajectory. Figure 5 gives the control signals during the control process, from which we can clearly see that the control input stays within the actuator limits and the weights of the RBFNN eventually converge to constant values. These observations demonstrate the effectiveness of the ADP-based controller under the saturation effect.

To show the effectiveness of the optimal admittance adaptation, the control performance under two different feedback gains \(K_e\), which affect the trajectory modification in (16), is compared: \(K_e^{opt}\) is obtained by assuming that the dynamic parameters of the environment in (6) are exactly known, while \(K_e^{pro}\) is calculated by the algorithm presented in (14). Note that, unlike the virtual environment in (48), the environment dynamics (6) adopted for the theoretical design do not take the contour \(x_0\) into consideration; thus \(K_e^{opt}\) is in fact sub-optimal. The results are shown in Fig. 6. Both the tracking error and the value of the cost function (9) under \(K^{pro}_e\) are smaller than those under \(K^{opt}_e\), which shows the superiority of the proposed method when the dynamics of the environment are unknown.

Fig. 6  A comparison of the tracking performance under the optimal feedback gain \(K^{opt}_e\) in (11) and the feedback gain \(K^{pro}_e\) obtained from the proposed control scheme. Top: time series of the cost function in (9). Bottom: tracking performance

5 Conclusion

In this paper, the optimal tracking control problem for robot systems with environment interaction and actuator saturation has been addressed, and an admittance adaptation control scheme enhanced by an ADP-based controller has been developed. The unknown environment is considered as a linear system, and admittance adaptation ensures compliant behaviour of the robot. In the ADP-based controller, to guarantee optimal tracking performance, an RBFNN is used to approximate the minimum cost function and to derive the corresponding optimal control from the HJB equation. The system stability has been analysed, and simulation studies have been performed to demonstrate the effectiveness of the control scheme.

Other input constraints, such as dead zones and hysteresis, as well as dynamic uncertainties, are also very common in actual robotic systems. These constraints not only reduce system performance but may also affect system stability. Consequently, optimal control with such constraints and dynamic uncertainties under the ADP framework will be considered in our future work.