2.1 Introduction

In the fight against cancer, there were no effective measures before chemotherapy and radiation appeared, since only tiny differences exist between cancer cells and normal cells. Surgeons can remove solid tumors that have not yet spread, but surgery cannot guarantee that the cancer will not recur. Since radiotherapy and chemotherapy bring considerable side effects, and targeted therapy lacks flexibility because of its strong specificity, research attention has turned to the human immune system itself. Generally, tumor cells escape from the immune system not because the immune system fails to recognize them or is not activated, but because cancer cells have evolved a way to block the activation of T cells through specific binding. Thus, the medical community has sought means to prevent cancer cells from intercepting the activation of T cells, thereby freeing up the immune system. Compared with traditional treatments such as surgery, radiation and chemotherapy, immunotherapy has fewer side effects and better therapeutic effects. However, the transient period of immune agents is difficult to handle. Therefore, the hybrid therapy of chemotherapy and immunotherapy is a better choice. As shown in [1], it is hardly sufficient to control tumor growth through either chemotherapy or immunotherapy alone, but tumor cells can be eradicated by adopting the combination therapy known as biochemotherapy described in [2].

With the extensive development of nonlinear dynamics [3, 4], its engineering application scenarios have become increasingly diverse, including competitive Nash equilibrium problems, especially in the biomedical field. Not coincidentally, game theory has been introduced into the interaction model of tumor cells and immune cells. Both chemotherapy and immunotherapy aim at reducing the number of tumor cells. Based on this fact, a cooperative game is formed and one can design adaptive therapy from the viewpoint of game theory. Multiple biological interactions constitute the complex nonlinear growth process of tumor cells; here the focus is on the major factors influencing tumor cell populations. Hunting cells refer to the immune cells participating in removing foreign agents and strengthening the immune response. Studies have suggested that cell-mediated anti-tumor immunity contributes to increasing the population of hunting cells so as to maintain a specific proportion of about 40% between the resting and the hunting predator cells, as reported in [5], which is beneficial for maintenance of the tumor dormant state. Immune regulation varies from individual to individual, but immunotherapy-based optimal regulation plays the general role of reducing tumor cells, special circumstances aside. Enhanced tumor antigen presentation can effectively stimulate dendritic cells and increase the immunotherapy-based curative effect [6]. The well-known "predator-prey" relation between immune cells and tumor cells leads to cyclic growth and reduction, which can continue indefinitely or reach an equilibrium saddle point determined by the system parameters. Literature [7] investigated a nonlinear dynamical model which provided guidance for introducing it into cybernetics. As is known, system identification and optimal control are of great practical value. As a powerful and effective optimization method, ADP can solve nonlinear optimal control problems well, realizing the most appropriate therapeutic strategy.

Of course, the immune system has the responsibility for restraining tumor growth, but it can hardly wipe out the tumor cells alone. Firstly, since tumor cells originate from the body's own cells, the immune system shows tolerance rather than exclusion toward them. Secondly, the immune system itself has no strong defense mechanism in fighting cancer cells, which means the failure of the immune response. Finally, immune function was observed to be protective through intervention with organic binding agents of CD4 and CD8 cells. Chemotherapy can rapidly kill differentiated tumor cells, but it also destroys regular cells. This side effect caused by chemotherapy can be lessened by introducing immunotherapy. Thus the combined therapy of chemotherapy and immunotherapy is more reasonable. Immunotherapies can strengthen the immune system through extra stimulation and, on the other hand, improve its ability to recognize foreign entities. Therefore, decelerating the growth rate of tumor cells with minimized doses of chemotherapeutic and immunotherapeutic drugs is the control objective. Furthermore, the optimal control strategy is obtained through the ADP method, giving the optimal level of each treatment regimen through the nonzero-sum differential game strategy developed in [8].

Prescribed performance tracking control has been creatively developed in [9]; however, little literature in this scope considers the mutual relationship among tumor cells, immune cells, chemotherapy and immunotherapy drugs, let alone setting the performance as the eventually acquired optimal therapeutic effect associated with the coupling behaviors mentioned above. Looking back at the literature such as [10], this chapter transforms the problem into a multi-player nonzero-sum game whose optimal control is obtained through complex decoupling of the Hamilton-Jacobi equations, as in [11]. Subsequently, online adaptive and off-policy learning algorithms were respectively developed in [12,13,14]. Of course, the constrained input was taken into consideration when it comes to practical applications in [15], and even more intensive work on uncertain constraints was contemplated in [16]. In [17], the control policies of the distributed subsystems acted as players; correspondingly, this chapter is formulated as a two-player nonzero-sum game including chemotherapy and immunotherapy. [18] first introduced an updating strategy based on intertask relationships. Analogously, the reciprocal action between the tumor cells and immune cells can be viewed as the interaction between subsystems in [19, 20].

The unknown nonlinear dynamics are usually handled by fuzzy control as in [21, 22] and neural networks as in [18, 23], where the actor network and critic network are adopted for updating the control policy at an appropriate time through the policy iteration technique [24,25,26]. The convergence of the model-based policy iteration algorithm is equivalent to that of data-based learning [27]. Similarly, the states of the system and the critic error are required to be uniformly ultimately bounded during the process of value iteration, which was guaranteed through the event-triggered formation control scheme first proposed for all signals of the closed-loop system in [28]. According to the value iteration algorithm, the optimum can be obtained through continuous learning [29, 30]. However, there is little research on the two-player nonzero-sum game considering tumor cells and immune cells using the proposed value iteration learning.

2.2 Preliminaries

As is known, there exist interaction relationships among the anticancer immune cells, lymphocytes and macrophages that constitute the basic immune-system microenvironment, which can be presented as follows. Firstly, T-lymphocytes and cytotoxic macrophages/natural killer cells can effectively damage tumor cells. Secondly, the destroying behaviour of macrophages can also activate T-lymphocytes to launch another attack. Meanwhile, the population of T-lymphocytes can be fed by resting cells. Finally, the model accounts for the degradation of resting cells and the activation of immune cells at their natural growth rates. This section gives the nonlinear growth equation which represents the whole immune response.

$$\begin{aligned} N_{total}=\frac{\upsilon N_{H}(t)N_{T}(t)}{\nu +N_{T}(t)} \end{aligned}$$
(2.1)

where \(N_{H}(t)\), \(N_{T}(t)\) denote the number of hunting cells and tumor cells at time t, respectively. \(\upsilon \) and \(\nu \) are positive constants. The changes in quantity caused by the inactivation of the immune cells and the apoptosis of tumor cells are presented as:

$$\begin{aligned}&\frac{dN_{T}(t)}{dt}=-\sigma _1N_{H}(t)N_{T}(t) \nonumber \\&\frac{dN_{H}(t)}{dt}=-\sigma _2N_{H}(t)N_{T}(t) \end{aligned}$$
(2.2)

where \(\sigma _1\) denotes the loss rate of \(N_T(t)\) caused by \(N_H(t)\) and \(\sigma _2\) represents the loss rate of \(N_H(t)\) caused by \(N_T(t)\). The situations above reflect the competition between tumor cells and the host cells. Then we construct the dynamic equations as follows

$$\begin{aligned} \dot{N}_{T}(t)=&\,\iota _{1}{N}_{T}(t)(1-\varrho _{1}{N}_{T}(t))-\sigma _{1}N_{T}(t)N_{H}(t)\nonumber \\&-\delta _{1}{N}_{CD}(t)N_{T}(t)\nonumber \\ \dot{N}_{H}(t)=&\,\frac{\upsilon N_{H}(t)N_{T}^{2}(t)}{\nu +N_{T}^{2}(t)} +\frac{\varsigma N_{H}(t){N}_{ID}(t)}{\vartheta +{N}_{ID}(t)}-\sigma _{2}N_{T}(t)N_{H}(t) \nonumber \\&-\mathfrak {D}N_{H}(t)-\delta _{2}{N}_{CD}(t)N_{H}(t)\nonumber \\ \end{aligned}$$
(2.3)

where \(\mathfrak {D}\) represents the natural death rate of the hunting cells, independent of the tumor cells. \(\iota _{\alpha }\) \((\alpha =1,2)\) and \(\varrho _{\alpha }\) denote the per capita growth rates and reciprocal carrying capacities, respectively. The descriptions of the other associated parameters are given in Table 2.1.

Table 2.1 Detailed descriptions of system parameters

Consider the chemotherapy and immunotherapy drug doses u(t) and v(t) given at time t, which are regarded as multiple-dose administration, comparable to the influence of injections of recombinant human interleukin-11 or recombinant human granulocyte colony-stimulating factor. Assume that a targeted effect cannot be achieved through chemotherapeutic drugs alone. Then we can obtain

$$\begin{aligned} f_{response}(t)=s_{\alpha }(1-e^{-\lambda u(t)}) \end{aligned}$$
(2.4)

where \(s_{\alpha }\) denotes the response coefficient distinguishing the change rates of different cell types. The mathematical model considering the injected drugs is presented as

$$\begin{aligned} \dot{N}_{CD}(t)=&\,u(t)-\varphi _{1}{N}_{CD}(t)\nonumber \\ \dot{N}_{ID}(t)=&\,v(t)-\varphi _{2}{N}_{ID}(t)\nonumber \\ \dot{N}_{T}(t)=&\,\iota _{1}{N}_{T}(t)(1-\varrho _{1}{N}_{T}(t))-\sigma _{1}N_{T}(t)N_{H}(t)\nonumber \\&-\delta _{1}{N}_{CD}(t)N_{T}(t)-s_{2}(1-e^{-\lambda u(t)})\nonumber \\ \dot{N}_{H}(t)=&\,\frac{\upsilon N_{H}(t)N_{T}^{2}(t)}{\nu +N_{T}^{2}(t)} +\frac{\varsigma N_{H}(t){N}_{ID}(t)}{\vartheta +{N}_{ID}(t)}-\sigma _{2}N_{T}(t)N_{H}(t)-\mathfrak {D}N_{H}(t) \nonumber \\&-\delta _{2}{N}_{CD}(t)N_{H}(t)-s_{1}(1-e^{-\lambda u(t)}) \end{aligned}$$
(2.5)

where \({N}_{CD}(t)\) and \({N}_{ID}(t)\) are the concentrations of the chemotherapy and immunotherapy drugs, and u(t) and v(t) are the doses of the chemotherapeutic and immunotherapeutic drugs, respectively. Generally speaking, \(\lambda \) is taken as 1 since the role of cytokines is unknown.

Remark 2.1

The model (2.5) describes the relations among the hunting cells, the tumor cells, the concentration of the chemotherapy agent, and the concentration of the immunotherapy agent. From (2.5) we can find that both the hunting cells and the chemotherapy agent can reduce the number of tumor cells, and that the immunotherapy agent can stimulate the growth of hunting cells. On the other hand, the tumor cells can influence the number of hunting cells. Based on this complicated interactive relationship, we can achieve the optimization objective through ADP, that is, minimization of tumor cells while ensuring the number of normal cells at a certain time t.

Before proceeding, let \(X=[N_T,N_H,N_{CD},N_{ID}]^T\), then the model (2.5) can be simplified as

$$\begin{aligned} \dot{{X}}(t)=f(X)+g(X)u(t)+\kappa (X) v(t) \end{aligned}$$
(2.6)

where f(X) denotes the right-hand dynamics of (2.5) excluding the controls u(t) and v(t). The matrices are \(g(X)=[0,0,1,0]^T\) and \(\kappa (X)=[0,0,0,1]^T\).
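For illustration, a minimal Python sketch of the control-affine form (2.6) is given below. The parameter dictionary `p`, the function names, and the omission of the saturating drug-response terms \(s_{\alpha }(1-e^{-\lambda u(t)})\) (which are not affine in u and are therefore not captured by g and \(\kappa \)) are assumptions of this sketch rather than part of the chapter's implementation; the numerical values of the parameters are those listed in Table 2.1.

```python
import numpy as np

def f(X, p):
    """Drift f(X) of (2.6): right-hand side of (2.5) with u(t), v(t) removed.
    p is a dict of the model parameters listed in Table 2.1 (names are placeholders)."""
    NT, NH, NCD, NID = X
    dNT = (p["iota1"] * NT * (1.0 - p["rho1"] * NT)
           - p["sigma1"] * NT * NH - p["delta1"] * NCD * NT)
    dNH = (p["upsilon"] * NH * NT**2 / (p["nu"] + NT**2)
           + p["varsigma"] * NH * NID / (p["vartheta"] + NID)
           - p["sigma2"] * NT * NH - p["D"] * NH - p["delta2"] * NCD * NH)
    dNCD = -p["phi1"] * NCD
    dNID = -p["phi2"] * NID
    return np.array([dNT, dNH, dNCD, dNID])

# Constant input directions: the chemotherapy dose u enters the N_CD channel,
# the immunotherapy dose v enters the N_ID channel.
g = np.array([0.0, 0.0, 1.0, 0.0])
kappa = np.array([0.0, 0.0, 0.0, 1.0])

def xdot(X, u, v, p):
    """Control-affine dynamics (2.6): dX/dt = f(X) + g*u + kappa*v."""
    return f(X, p) + g * u + kappa * v
```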

For system (2.6), the performance index function of the \(\epsilon \)th player can be given as

$$\begin{aligned} J_\epsilon (X_0)=&\int _{0}^{\infty }\Big ({X}^{T}\mathcal {Q}_{\epsilon }{X}+u^{T}\mathcal {R}_{\epsilon 1}u+v^{T}\mathcal {R}_{\epsilon 2}v\Big ) d\tau \end{aligned}$$
(2.7)

where \(\mathcal {Q}_{\epsilon }\) is a positive definite matrix, and \(\mathcal {R}_{\epsilon 1}\) and \(\mathcal {R}_{\epsilon 2}\) are symmetric positive definite matrices. The corresponding cost functions are presented as:

$$\begin{aligned} \mathcal {V}_{\epsilon }(X,u,v)=\int _{t}^{\infty }\mathfrak {R}_{\epsilon }(X,u,v)d\tau \end{aligned}$$
(2.8)

with the utility function

$$\begin{aligned} \mathfrak {R}_{\epsilon }(X,u,v) =&{X}^{T}\mathcal {Q}_{\epsilon }{X}+u^{T}{\mathcal {R}_{\epsilon 1}}u+v^{T}\mathcal {R}_{\epsilon 2}v. \end{aligned}$$
(2.9)

Definition 2.2

For the two-player NZS game of system (2.6), the Nash equilibrium solution is said to be obtained with the control pair \((u^{*},v^{*})\) which satisfies

$$\begin{aligned} \mathcal {V}_{\epsilon }(u^{*},v^{*})&\le \mathcal {V}_{\epsilon }(u,v^{*})\nonumber \\ \mathcal {V}_{\epsilon }(u^{*},v^{*})&\le \mathcal {V}_{\epsilon }(u^{*},v) \end{aligned}$$
(2.10)

for any admissible control policies u and v.

The Hamiltonian functions can be constructed as:

$$\begin{aligned} \textrm{H}_{\epsilon }(X,u,v)&={X}^{T}\mathcal {Q}_{\epsilon }{X}+u^{T}\mathcal {R}_{\epsilon 1}u+v^{T}\mathcal {R}_{\epsilon 2}v\nonumber \\&+\nabla \mathcal {V}_{\epsilon }^{T}(f(X)+g(X)u(t)+\kappa (X) v(t)) \end{aligned}$$
(2.11)

where \(\nabla \mathcal {V}_{\epsilon }\) is the partial derivative of the cost function with respect to X and \({\epsilon }=1,2\). According to the stationarity conditions at the equilibrium points, the optimal controls for the two players are obtained as

$$\begin{aligned} u^{*}&=-\frac{1}{2}\mathcal {R}_{11}^{-1}g^{T}(X)\nabla \mathcal {V}_{1}^*\nonumber \\ v^{*}&=-\frac{1}{2}\mathcal {R}_{22}^{-1}\kappa ^{T}(X)\nabla \mathcal {V}_{2}^*\end{aligned}$$
(2.12)

with \(\mathcal {V}_1^*\) and \(\mathcal {V}_2^*\) being the solutions of coupled HJ equations as

$$\begin{aligned} X^T\mathcal {Q}_{1}{X}&-\frac{1}{4}\nabla \mathcal {V}_{1}^{*T}g(X)\mathcal {R}_{11}^{-1}g^{T}(X)\nabla \mathcal {V}_{1}^*+\nabla \mathcal {V}_{1}^{*T}f({X})\nonumber \\&+\frac{1}{4}\nabla \mathcal {V}_{2}^{*T}\kappa (X)\mathcal {R}_{22}^{-1}\mathcal {R}_{12}\mathcal {R}_{22}^{-1}\kappa ^{T}(X)\nabla \mathcal {V}_{2}^*\nonumber \\&-\frac{1}{2}\nabla \mathcal {V}_{1}^{*T}\kappa (X)\mathcal {R}_{22}^{-1}\kappa ^{T}(X)\nabla \mathcal {V}_{2}^*=0, \end{aligned}$$
(2.13)

and

$$\begin{aligned} X^T\mathcal {Q}_{2}{X}&-\frac{1}{4}\nabla \mathcal {V}_{2}^{*T}\kappa (X)\mathcal {R}_{22}^{-1}\kappa ^{T}(X)\nabla \mathcal {V}_{2}^*+\nabla \mathcal {V}_{2}^{*T}f({X})\nonumber \\&+\frac{1}{4}\nabla \mathcal {V}_{1}^{*T}g(X)\mathcal {R}_{11}^{-1}\mathcal {R}_{21}\mathcal {R}_{11}^{-1}g^{T}(X)\nabla \mathcal {V}_{1}^*\nonumber \\&-\frac{1}{2}\nabla \mathcal {V}_{2}^{*T}g(X)\mathcal {R}_{11}^{-1}g^{T}(X)\nabla \mathcal {V}_{1}^*=0. \end{aligned}$$
(2.14)
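For readability, the stationarity step that yields (2.12) can be spelled out; this is the standard first-order condition applied to (2.11), not an additional assumption:

$$\begin{aligned} \frac{\partial \textrm{H}_{1}(X,u,v)}{\partial u}=2\mathcal {R}_{11}u+g^{T}(X)\nabla \mathcal {V}_{1}^*=0\;\;\Rightarrow \;\;u^{*}=-\frac{1}{2}\mathcal {R}_{11}^{-1}g^{T}(X)\nabla \mathcal {V}_{1}^*,\nonumber \\ \frac{\partial \textrm{H}_{2}(X,u,v)}{\partial v}=2\mathcal {R}_{22}v+\kappa ^{T}(X)\nabla \mathcal {V}_{2}^*=0\;\;\Rightarrow \;\;v^{*}=-\frac{1}{2}\mathcal {R}_{22}^{-1}\kappa ^{T}(X)\nabla \mathcal {V}_{2}^*. \end{aligned}$$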

Lemma 2.3

For nonlinear system (2.6), suppose that \(\mathcal {V}_1^{*}\) and \(\mathcal {V}_2^*\) satisfy the equations (2.13) and (2.14). Then under the optimal control (2.12), the system is asymptotically stable.

Proof

The proof is omitted since it is similar to that in [31, 32].

By solving the coupled HJ equations (2.13) and (2.14), one can obtain the optimal control (2.12), which means the Nash equilibrium for the two-player NZS game system is attained. Nevertheless, due to the existence of nonlinear terms and coupled terms, these partial differential equations are difficult to solve. Since ADP is a powerful approximate learning method, approximate solutions of (2.13) and (2.14) can be acquired.

2.3 Design of Adaptive Controller

To find the optimal control strategy, a critic network is first constructed based on neural networks. The optimal value function can then be expressed as:

$$\begin{aligned} \mathcal {V}_{\epsilon }^{*}=(\zeta _{\epsilon }^{*})^{T}\xi _{\epsilon }(X)+o_{\epsilon },\epsilon =1,2, \end{aligned}$$
(2.15)

where \(\zeta _{\epsilon }^{*}\in R^{p_{\epsilon }}\), \(\xi _{\epsilon }\in R^{p_{\epsilon }}\) and \(o_{\epsilon }\in \textrm{R}\) are the ideal weight vector, the activation function and the approximation error of the neural network, respectively. As it is scarcely possible to obtain the ideal weight \(\zeta _{\epsilon }^{*}\), we use the approximate version

$$\begin{aligned} \hat{\mathcal {V}}_{\epsilon }^{*}=(\hat{\zeta }_{\epsilon })^{T}\xi _{\epsilon }(X). \end{aligned}$$
(2.16)

Based on (2.12) and (2.15), we obtain the optimal control as

$$\begin{aligned} u^{*}&=-\frac{1}{2}\mathcal {R}_{11}^{-1}g^{T}(X)((\nabla \xi _{1}(X))^{T}\zeta _{1}^{*}+\nabla o_{1})\nonumber \\ v^{*}&=-\frac{1}{2}\mathcal {R}_{22}^{-1}\kappa ^{T}(X)((\nabla \xi _{2}(X))^{T}\zeta _{2}^{*}+\nabla o_{2}) \end{aligned}$$
(2.17)

Then we further get the approximate control policies as

$$\begin{aligned} \hat{u}&=-\frac{1}{2}\mathcal {R}_{11}^{-1}g^{T}(X)(\nabla \xi _{1}(X))^{T}\hat{\zeta }_{1}\nonumber \\ \hat{v}&=-\frac{1}{2}\mathcal {R}_{22}^{-1}\kappa ^{T}(X)(\nabla \xi _{2}(X))^{T}\hat{\zeta }_{2} \end{aligned}$$
(2.18)
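The following is a minimal sketch of how (2.18) can be evaluated numerically, assuming scalar doses, scalar \(\mathcal {R}_{11}\) and \(\mathcal {R}_{22}\), and helper functions `grad_xi1`, `grad_xi2` that return the Jacobians \(\nabla \xi _{\epsilon }(X)\) (a concrete choice of \(\xi _{\epsilon }\) appears in Sect. 2.4.2); all identifiers are hypothetical.

```python
import numpy as np

def approx_policies(X, zeta1_hat, zeta2_hat, grad_xi1, grad_xi2, R11, R22, g, kappa):
    """Approximate policies u_hat, v_hat of (2.18), assuming scalar doses u and v."""
    X = np.asarray(X, dtype=float)
    grad_V1 = grad_xi1(X).T @ zeta1_hat     # approximate gradient of V_1, shape (4,)
    grad_V2 = grad_xi2(X).T @ zeta2_hat     # approximate gradient of V_2, shape (4,)
    u_hat = -0.5 / R11 * (g @ grad_V1)      # chemotherapy dose (scalar)
    v_hat = -0.5 / R22 * (kappa @ grad_V2)  # immunotherapy dose (scalar)
    return u_hat, v_hat
```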

Remark 2.4

Since the ideal weights are unknown, NNs are used to give the approximate value functions as in (2.16). Aiming at minimizing the current estimates of the value functions in (2.15), the policies (2.18) can then be obtained with available closed-form expressions.

According to (2.18), the closed-loop system can be rewritten as

$$\begin{aligned} \dot{X}(t)=f(X)+g(X)\hat{u}+\kappa (X)\hat{v}. \end{aligned}$$
(2.19)

Furthermore, we can obtain the approximate Hamiltonian as

$$\begin{aligned} \textrm{H}_{\epsilon }(X,\hat{u},\hat{v})&={X}^{T}\mathcal {Q}_{\epsilon }{X}+\hat{u}^{T}\mathcal {R}_{\epsilon 1}\hat{u} \nonumber \\&+\hat{v}^{T}\mathcal {R}_{\epsilon 2}\hat{v} +(\hat{\zeta }_{\epsilon })^{T}\nabla \xi _{\epsilon }(X)\dot{X}(t) \nonumber \\&=e_{\epsilon }(t). \end{aligned}$$
(2.20)

To approach the optimal strategy and minimize \(e_\epsilon (t)\), the objective of adaptive learning is set as \(\mathcal {E}=\mathcal {E}_1+\mathcal {E}_2=1/2e_1^2+1/2e_2^2\). Then, applying the gradient descent method, we obtain the learning law of the critic for player \(\epsilon \)

$$\begin{aligned} \dot{\hat{\zeta }}_{\epsilon }&=-\varrho _{\epsilon }\frac{1}{(\delta _{\epsilon }^{T}\delta _{\epsilon }+1)^{2}}\frac{\partial \mathcal {E}(t)}{\partial \hat{\zeta }_{\epsilon }}=-\varrho _{\epsilon }\frac{1}{(\delta _{\epsilon }^{T}\delta _{\epsilon }+1)^{2}}\frac{\partial \mathcal {E}_{\epsilon }(t)}{\partial \hat{\zeta }_{\epsilon }} =-\varrho _{\epsilon }\frac{\delta _{\epsilon }e_{\epsilon }(t)}{(\delta _{\epsilon }^{T}\delta _{\epsilon }+1)^{2}}\nonumber \\ \end{aligned}$$
(2.21)

where \(\delta _{\epsilon }=\nabla \xi _{\epsilon }(X)\dot{X}(t)\), and \(\varrho _{\epsilon }\) is the positive learning rate. Let \(\tilde{\zeta }_\epsilon =\zeta _\epsilon ^*-\hat{\zeta }_\epsilon \), then we have

$$\begin{aligned} \dot{\tilde{\zeta }}_\epsilon&=\varrho _{\epsilon }\frac{\delta _{\epsilon }\sigma _{\epsilon }(t)}{(\delta _{\epsilon }^{T}\delta _{\epsilon }+1)^{2}}-\varrho _{\epsilon }\frac{\delta _{\epsilon }\delta _{\epsilon }^{T}\tilde{\zeta }_\epsilon }{(\delta _{\epsilon }^{T}\delta _{\epsilon }+1)^{2}} =\varrho _\epsilon \underline{\delta }_\epsilon \sigma _\epsilon (t)-\varrho _\epsilon \bar{\delta }_\epsilon \bar{\delta }_\epsilon ^T\tilde{\zeta }_\epsilon , \end{aligned}$$
(2.22)

where \(\underline{\delta }_\epsilon =\delta _\epsilon /(\delta _\epsilon ^T\delta _\epsilon +1)^2\), \(\bar{\delta }_\epsilon =\delta _\epsilon /(\delta _\epsilon ^T\delta _\epsilon +1)\) and \(\sigma _\epsilon (t)=-\nabla o_\epsilon ^T(X)(f(X)+g(X)\hat{u}+\kappa (X)\hat{v})\) is the residual error introduced by the critic neural network approximation [33].
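A minimal sketch of one Euler-discretized step of the critic update (2.20)-(2.21) is given below, under the same scalar-control assumption as in the earlier sketch; the step size `dt` and all identifiers are hypothetical.

```python
import numpy as np

def critic_step(X, Xdot, u, v, zeta_hat, grad_xi, Q, Re1, Re2, rho_eps, dt):
    """One normalized gradient-descent step of (2.21) for player eps (Euler-discretized)."""
    X = np.asarray(X, dtype=float)
    delta = grad_xi(X) @ np.asarray(Xdot, dtype=float)          # delta_eps = grad(xi_eps) * Xdot
    e = X @ Q @ X + Re1 * u**2 + Re2 * v**2 + zeta_hat @ delta  # Hamiltonian residual e_eps of (2.20)
    norm = (delta @ delta + 1.0) ** 2                           # normalization (delta'delta + 1)^2
    zeta_new = zeta_hat - dt * rho_eps * delta * e / norm       # discretized learning law (2.21)
    return zeta_new, e
```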

Before presenting the main results of this chapter, two standard assumptions are necessary [34,35,36].

Assumption 2.1

For \(\epsilon =1,2\), the signal \(\bar{\delta }_\epsilon \) is persistently exciting such that the following inequality is satisfied

$$\begin{aligned} \varsigma _\epsilon I_{\nu _\epsilon \times \nu _\epsilon }\le \int _t^{t+T}\bar{\delta }_\epsilon \bar{\delta }_\epsilon ^Td\varepsilon , \end{aligned}$$
(2.23)

where \(\nu _\epsilon \) denotes the number of neurons of the \(\epsilon \)th critic network, and \(\varsigma _\epsilon \) and T are positive constants.

Assumption 2.2

For \(\epsilon =1,2\), there exist positive constants \(\xi _{\epsilon max}\), \(o_{\epsilon max}\) and \(\sigma _{\epsilon max}\) such that the following inequalities hold: \(\Vert \nabla \xi _\epsilon (X)\Vert \le \xi _{\epsilon max}\), \(\Vert \nabla o_\epsilon \Vert \le o_{\epsilon max}\) and \(\Vert \sigma _\epsilon \Vert \le \sigma _{\epsilon max}\).

Applying the Lyapunov method, stability in the sense of uniform ultimate boundedness (UUB) is guaranteed by the following theorem.

Theorem 2.5

For system (2.6), when the weight updating laws of the critic networks are given by (2.21), the UUB property of the weight estimation errors \(\tilde{\zeta }_\epsilon \) can be guaranteed under the obtained control policies (2.18).

Proof

Select the Lyapunov function as

$$\begin{aligned} \mathcal {L}=\frac{1}{2}\varrho _1^{-1}\tilde{\zeta }_1^T\tilde{\zeta }_1+\frac{1}{2}\varrho _2^{-1}\tilde{\zeta }_2^T\tilde{\zeta }_2. \end{aligned}$$
(2.24)

Taking the time derivative of (2.24), then we obtain

$$\begin{aligned} \dot{\mathcal {L}}=\,&\varrho _1^{-1}\tilde{\zeta }_1^T\dot{\tilde{\zeta }}_1+\varrho _2^{-1}\tilde{\zeta }_2^T\dot{\tilde{\zeta }}_2 \nonumber \\ =\,&\tilde{\zeta }_1^T(\underline{\delta }_1\sigma _1(t)-\bar{\delta }_1\bar{\delta }_1^T\tilde{\zeta }_1)+\tilde{\zeta }_2^T(\underline{\delta }_2\sigma _2(t)-\bar{\delta }_2\bar{\delta }_2^T\tilde{\zeta }_2) \end{aligned}$$
(2.25)

According to Young’s inequality, we have

$$\begin{aligned} \tilde{\zeta }_1^T\underline{\delta }_1\sigma _1(t)\le \tilde{\zeta }_1^T\bar{\delta }_1\sigma _1(t)\le \frac{1}{2}\tilde{\zeta }_1^T\bar{\delta }_1\bar{\delta }_1^T\tilde{\zeta }_1+\frac{1}{2}\sigma _{1max}^2. \end{aligned}$$
(2.26)

Similarly,

$$\begin{aligned} \tilde{\zeta }_2^T\underline{\delta }_2\sigma _2(t)\le \frac{1}{2}\tilde{\zeta }_2^T\bar{\delta }_2\bar{\delta }_2^T\tilde{\zeta }_2+\frac{1}{2}\sigma _{2max}^2. \end{aligned}$$
(2.27)

Substituting (2.26) and (2.27) into (2.25), we get

$$\begin{aligned} \dot{\mathcal {L}}\le -\frac{1}{2}\tilde{\zeta }_1^T\bar{\delta }_1\bar{\delta }_1^T\tilde{\zeta }_1-\frac{1}{2}\tilde{\zeta }_2^T\bar{\delta }_2\bar{\delta }_2^T\tilde{\zeta }_2+\frac{1}{2}(\sigma _{1max}^2+\sigma _{2max}^2). \end{aligned}$$
(2.28)

From (2.28) we can conclude that \(\dot{\mathcal {L}}<0\) when one of the following conditions holds

$$\begin{aligned} \Vert \tilde{\zeta }_1\Vert >\sqrt{\frac{\sigma _{1max}^2+\sigma _{2max}^2}{\lambda _{min}(\bar{\delta }_1\bar{\delta }_1^T)}}, \end{aligned}$$
(2.29)

or

$$\begin{aligned} \Vert \tilde{\zeta }_2\Vert >\sqrt{\frac{\sigma _{1max}^2+\sigma _{2max}^2}{\lambda _{min}(\bar{\delta }_2\bar{\delta }_2^T)}}. \end{aligned}$$
(2.30)

According to Lyapunov theory, the weight estimation errors of both critic networks are UUB.

Remark 2.6

The weight matrices are usually updated through certain renewal equations. Since the residual errors \(\sigma _{\epsilon max}\) decrease as the number of neurons increases, from (2.29) and (2.30) we can conclude that the bound on the weight approximation error shrinks accordingly, and the error asymptotically converges to zero as \(\nu _\epsilon \rightarrow \infty \).

Theorem 2.7

Consider the system (2.6) with the weight updating laws of the critic networks given by (2.21). Then the obtained policies (2.18) can force the system state X to be UUB.

Proof

In order to discuss the stability of the closed-loop system, the derivative of \(\mathcal {V}=\mathcal {V}_1^*+\mathcal {V}_2^*\) is considered as

$$\begin{aligned} \dot{\mathcal {V}}=&(\nabla \mathcal {V}_1^*)^{T}(f(X)+g(X)\hat{u}+\kappa (X)\hat{v}) \nonumber \\&+(\nabla \mathcal {V}_2^*)^T(f(X)+g(X)\hat{u}+\kappa (X)\hat{v}). \end{aligned}$$
(2.31)

Recalling (2.13) and (2.14), we have

$$\begin{aligned} \nabla \mathcal {V}_{1}^{*T}f({X})=&-X^T\mathcal {Q}_{1}{X}+\frac{1}{4}\nabla \mathcal {V}_{1}^{*T}g(X)\mathcal {R}_{11}^{-1}g^{T}(X)\nabla \mathcal {V}_{1}^*\nonumber \\&-\frac{1}{4}\nabla \mathcal {V}_{2}^{*T}\kappa (X)\mathcal {R}_{22}^{-1}\mathcal {R}_{12}\mathcal {R}_{22}^{-1}\kappa ^{T}(X)\nabla \mathcal {V}_{2}^*\nonumber \\&+\frac{1}{2}\nabla \mathcal {V}_{1}^{*T}\kappa (X)\mathcal {R}_{22}^{-1}\kappa ^{T}(X)\nabla \mathcal {V}_{2}^*, \end{aligned}$$
(2.32)

and

$$\begin{aligned} \nabla \mathcal {V}_{2}^{*T}f({X})=&-X^T\mathcal {Q}_{2}{X}+\frac{1}{4}\nabla \mathcal {V}_{2}^{*T}\kappa (X)\mathcal {R}_{22}^{-1}\kappa ^{T}(X)\nabla \mathcal {V}_{2}^*\nonumber \\&-\frac{1}{4}\nabla \mathcal {V}_{1}^{*T}g(X)\mathcal {R}_{11}^{-1}\mathcal {R}_{21}\mathcal {R}_{11}^{-1}g^{T}(X)\nabla \mathcal {V}_{1}^*\nonumber \\&+\frac{1}{2}\nabla \mathcal {V}_{2}^{*T}g(X)\mathcal {R}_{11}^{-1}g^{T}(X)\nabla \mathcal {V}_{1}^*. \end{aligned}$$
(2.33)

For \(\epsilon =1\), we can obtain \(\dot{\mathcal {V}}_1^*\) as

$$\begin{aligned} \dot{\mathcal {V}}^*_1=&-X^T\mathcal {Q}_1X-\frac{1}{4}\nabla \mathcal {V}_{1}^{*T}g(X)\mathcal {R}_{11}^{-1}g^{T}(X)\nabla \mathcal {V}_{1}^*\nonumber \\&-\frac{1}{4}\nabla \mathcal {V}_{2}^{*T}\kappa (X)\mathcal {R}_{22}^{-1}\mathcal {R}_{12}\mathcal {R}_{22}^{-1}\kappa ^{T}(X)\nabla \mathcal {V}_{2}^*\nonumber \\&-\nabla \mathcal {V}_1^{*T}(g(X)(u^*-\hat{u})+\kappa (X)(v^*-\hat{v})). \end{aligned}$$
(2.34)

According to (2.15) and (2.16) we have

$$\begin{aligned} \dot{\mathcal {V}}_1^*=&-X^T\mathcal {Q}_1X-\frac{1}{4}\nabla \mathcal {V}_{1}^{*T}g(X)\mathcal {R}_{11}^{-1}g^{T}(X)\nabla \mathcal {V}_{1}^*\nonumber \\&-\frac{1}{4}\nabla \mathcal {V}_{2}^{*T}\kappa (X)\mathcal {R}_{22}^{-1}\mathcal {R}_{12}\mathcal {R}_{22}^{-1}\kappa ^{T}(X)\nabla \mathcal {V}_{2}^*\nonumber \\&+\frac{1}{2}((\nabla \xi _1(X))^T\zeta _1^*+\nabla o_1)^T\Big (g(X)\mathcal {R}_{11}^{-1}g^T(X) \nonumber \\&\times ((\nabla \xi _1^T(X))^T\tilde{\zeta }_1+\nabla o_1)+\kappa (X)\mathcal {R}_{22}^{-1}\kappa ^T(X) \nonumber \\&\times ((\nabla \xi _2^T(X))^T\tilde{\zeta }_2+\nabla o_2)\Big ). \end{aligned}$$
(2.35)

Due to Assumption 2.2 and Theorem 2.5, we obtain that

$$\begin{aligned} \dot{\mathcal {V}}_1^*\le&-X^T\mathcal {Q}_1X-\frac{1}{4}\nabla \mathcal {V}_{1}^{*T}g(X)\mathcal {R}_{11}^{-1}g^{T}(X)\nabla \mathcal {V}_{1}^*\nonumber \\&-\frac{1}{4}\nabla \mathcal {V}_{2}^{*T}\kappa (X)\mathcal {R}_{22}^{-1}\mathcal {R}_{12}\mathcal {R}_{22}^{-1}\kappa ^{T}(X)\nabla \mathcal {V}_{2}^*+\theta _1, \nonumber \\ \end{aligned}$$
(2.36)

where the positive constant \(\theta _1\) denotes the bound of the term \(\frac{1}{2}((\nabla \xi _1(X))^T\zeta _1^*+\nabla o_1)^T\Big (g(X)\mathcal {R}_{11}^{-1}g^T(X)((\nabla \xi _1^T(X))^T\tilde{\zeta }_1+\nabla o_1)+\kappa (X)\mathcal {R}_{22}^{-1}\kappa ^T(X)((\nabla \xi _2^T(X))^T\tilde{\zeta }_2\) \(+\nabla o_2)\Big )\). As \(\mathcal {R}_{11}\), \(\mathcal {R}_{12}\) and \(\mathcal {R}_{22}\) are symmetric positive definite, we have

$$\begin{aligned} \frac{1}{4}\nabla \mathcal {V}_{2}^{*T}\kappa (X)\mathcal {R}_{22}^{-1}\mathcal {R}_{12}\mathcal {R}_{22}^{-1}\kappa ^{T}(X)\nabla \mathcal {V}_{2}^*\nonumber \\ +\frac{1}{4}\nabla \mathcal {V}_{1}^{*T}g(X)\mathcal {R}_{11}^{-1}g^{T}(X)\nabla \mathcal {V}_{1}^*>0. \end{aligned}$$
(2.37)

Furthermore, we attain

$$\begin{aligned} \dot{\mathcal {V}}_1^*\le -X^T\mathcal {Q}_1X+\theta _1\le -\lambda _{min}(\mathcal {Q}_1)\Vert X\Vert ^2+\theta _1. \end{aligned}$$
(2.38)

Similarly, for \(\epsilon =2\), it yields that

$$\begin{aligned} \dot{\mathcal {V}}_2^*\le -X^T\mathcal {Q}_2X+\theta _2\le -\lambda _{min}(\mathcal {Q}_2)\Vert X\Vert ^2+\theta _2, \end{aligned}$$
(2.39)

where the definition of \(\theta _2\) is similar to that of \(\theta _1\). Then it can be concluded that \(\dot{\mathcal {V}}<0\) when the following inequality is satisfied

$$\begin{aligned} \Vert X\Vert >\max \left\{ \sqrt{\frac{\theta _1}{\lambda _{min}(\mathcal {Q}_1)}},\sqrt{\frac{\theta _2}{\lambda _{min}(\mathcal {Q}_2)}}\right\} \triangleq \varTheta . \end{aligned}$$
(2.40)

Thus, with the proposed control policies (2.18), the system state X is UUB with the bound \(\varTheta \). This completes the proof.

Remark 2.8

From Theorems 2.5 and 2.7, we can conclude that under the obtained control policies, the states of the system X and the critic weight errors \(\tilde{\zeta }_\epsilon \) are uniformly ultimately bounded.

Remark 2.9

According to the clinical requirements, the specific weights of the cost function are finalised. A transformation is implemented from the mathematical mechanism model to the solvable control-affine model. Subsequently, the chapter solves the optimal control problem, which means that the minimum dose of medicine can realize the best therapeutic effect.

2.4 Simulation and Numerical Experiments

To verify the method proposed in the previous section, a simulation is given as follows.

2.4.1 States Analysis on Tumor Cell Growth

According to the clinical medical statistics borrowed from the literature [37], the specific parameters of the dynamic model are presented in Table 2.2.

Table 2.2 Concentration variation on immune cells, tumor cells, chemotherapeutic drug and immunoagents

According to (2.5) and Table 2.2, we construct the model (2.41)

$$\begin{aligned} \dot{N}_{T}(t)=&\,0.00431{N}_{T}(t)(1-1.02\times 10^{-9}{N}_{T}(t)) \nonumber \\ {}&-6.41\times 10^{-11}N_{T}(t)N_{H}(t)\nonumber \\&-0.08{N}_{CD}(t)N_{T}(t)-(1-e^{-u(t)})\nonumber \\ \dot{N}_{H}(t)=&\,0.33+\frac{0.0125N_{H}(t)N_{T}^{2}(t)}{2.02\times 10^7+N_{T}^{2}(t)} +\frac{0.125N_{H}(t){N}_{ID}(t)}{2\times 10^7+{N}_{ID}(t)}\nonumber \\&-3.42\times 10^{-6}N_{T}(t)N_{H}(t)-(1-e^{- u(t)}) \nonumber \\&-0.204N_{H}(t)-3.42\times 10^{-6}{N}_{CD}(t)N_{H}(t)\nonumber \\ \dot{N}_{CD}(t)=&\,u(t)-0.1{N}_{CD}(t)\nonumber \\ \dot{N}_{ID}(t)=&\,v(t)-{N}_{ID}(t) \end{aligned}$$
(2.41)

Consider the states of tumor cells \(N_{1}(t)\) and immune cells \(N_{2}(t)\) in a patient who follows a certain chemotherapy and immunotherapy regimen. Correspondingly, \(N_{3}(t)\) and \(N_{4}(t)\) respectively denote the concentrations of the chemotherapy and immunotherapy drugs. The initial value is set as \(X_0=\left[ {\begin{array}{*{20}{c}} 20&10&8&6 \end{array}} \right] ^{T}\), and the resulting curves of the system states (tumor cells, immune cells, chemotherapy and immunotherapy drugs) are shown in Fig. 2.1.
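As a sketch, the model (2.41) can be integrated numerically as shown below. The constant test doses `u0` and `v0` are placeholders standing in for the learned policies (2.18), and the time horizon is chosen arbitrarily for illustration; only the parameter values of (2.41) and the initial state above are taken from the chapter.

```python
import numpy as np
from scipy.integrate import solve_ivp

def rhs(t, X, u0=1.0, v0=1.0):
    """Right-hand side of (2.41) with constant test doses u0, v0 (placeholders)."""
    NT, NH, NCD, NID = X
    dNT = (0.00431 * NT * (1 - 1.02e-9 * NT)
           - 6.41e-11 * NT * NH - 0.08 * NCD * NT - (1 - np.exp(-u0)))
    dNH = (0.33 + 0.0125 * NH * NT**2 / (2.02e7 + NT**2)
           + 0.125 * NH * NID / (2e7 + NID)
           - 3.42e-6 * NT * NH - (1 - np.exp(-u0))
           - 0.204 * NH - 3.42e-6 * NCD * NH)
    dNCD = u0 - 0.1 * NCD
    dNID = v0 - NID
    return [dNT, dNH, dNCD, dNID]

X0 = [20.0, 10.0, 8.0, 6.0]                        # initial state used in Fig. 2.1
sol = solve_ivp(rhs, (0.0, 50.0), X0, max_step=0.1)
# sol.t and sol.y contain the time grid and the four state trajectories
```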

Fig. 2.1 The curves of system states

It is obvious that the control policies can stabilize the nonlinear system and make the system states tend to zero, which means that the closed-loop system is stable and the control method is effective. Recall that in the original problem the key is to minimize the number of cancer cells and reduce the therapy toxicity as much as possible.

2.4.2 Weight Analysis of Control Policies

The weights \(\zeta _{\epsilon }^{*}\) of the control policies u(t) and v(t) can be estimated through the value function \(\hat{\mathcal {V}}_{\epsilon }^{*}=(\hat{\zeta }_{\epsilon })^{T}\xi _{\epsilon }(X)\) in (2.16), and the performance index is given by (2.7) with \( \mathcal {Q}_{1}=I_{4\times 4}\), \(\mathcal {Q}_{2}=5\mathcal {Q}_{1}\), \(\mathcal {R}_{11}=\mathcal {R}_{22}=1\), \(\mathcal {R}_{12}=\mathcal {R}_{21}=2\). The initial weights are set as \([-0.25,-0.25,-1,-0.25]^{T}\). The activation function is selected as \([\zeta _{11\rightarrow 15}^{T},\zeta _{16\rightarrow 18}^{T},\zeta _{19\rightarrow 10}^{T}]\), where \(\zeta _{11\rightarrow 15}=[N_{1}^{2}(t),N_{1}(t)N_{2}(t),N_{1}(t)N_{3}(t),N_{1}(t)N_{4}(t) ,N_{2}^{2}(t)]\), \(\zeta _{16\rightarrow 18}=[N_{2}(t)N_{3}(t),N_{2}(t)N_{4}(t) ,N_{3}^{2}(t)]\) and \(\zeta _{19\rightarrow 10}=[N_{3}(t)N_{4}(t),N_{4}^{2}(t)]\).
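A minimal construction of this quadratic activation vector and its Jacobian, as used by the earlier sketches of (2.18) and (2.21), could look as follows; the ordering matches the monomials listed above, and the function names are hypothetical.

```python
import numpy as np

def xi(X):
    """10 quadratic monomials N_i*N_j with 1 <= i <= j <= 4, in the order listed in the text."""
    N1, N2, N3, N4 = X
    return np.array([N1*N1, N1*N2, N1*N3, N1*N4, N2*N2,
                     N2*N3, N2*N4, N3*N3, N3*N4, N4*N4])

def grad_xi(X):
    """(10 x 4) Jacobian d(xi)/dX."""
    N1, N2, N3, N4 = X
    return np.array([[2*N1, 0,    0,    0   ],
                     [N2,   N1,   0,    0   ],
                     [N3,   0,    N1,   0   ],
                     [N4,   0,    0,    N1  ],
                     [0,    2*N2, 0,    0   ],
                     [0,    N3,   N2,   0   ],
                     [0,    N4,   0,    N2  ],
                     [0,    0,    2*N3, 0   ],
                     [0,    0,    N4,   N3  ],
                     [0,    0,    0,    2*N4]])
```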

Fig. 2.2 Optimal control policies u(t)

Fig. 2.3 Optimal control policies v(t)

Fig. 2.4 The curves of system states

According to Fig. 2.2, we can conclude that the proposed optimal control demonstrates a shorter convergence time than the case without optimal control: the former needs only 10 s, while the latter may need 38 s, which shows the superiority of the proposed method.

From Fig. 2.3, we can see that the lower drug doses are another advantage compared with the case without optimal control. Considering Figs. 2.2 and 2.3 together, we can draw the conclusion that the adopted algorithm can not only decrease the convergence time but also reduce the doses of chemotherapy drugs and immune agents, so patients will benefit from the minimal toxicity and the shorter response time.

When the initial state is set as \([-0.5,-0.1,-1,-0.4]^{T}\) and the other parameters are unaltered, another set of figures is given in Figs. 2.4, 2.5 and 2.6. In Figs. 2.5 and 2.6, the proposed algorithm shows even more obvious advantages over the case without optimal control in response time and control policies, and we can conclude that the effectiveness of the control method does not vary with different initial conditions.

Fig. 2.5 Optimal control policies u(t)

Fig. 2.6 Optimal control policies v(t)

2.5 Conclusion

This chapter has introduced adaptive dynamic programming to solve for the optimal control policies of the tumor cell growth model and has realized the objective of minimizing tumor cells with the minimum doses of chemotherapeutic and immunotherapeutic drugs. As is known, the negative effects caused by chemotherapy and immunotherapy must be reduced, so a reasonable treatment plan is extracted from the optimal control behavior. Convergence properties have been proved to be guaranteed through Lyapunov theory. Meanwhile, the states of the system and the critic errors have been demonstrated to be uniformly ultimately bounded. Simulations have been given to verify the rationality of the proposed method. In future work, we will further investigate medical frontier topics and propose adaptive therapeutic methods to solve these issues by employing the ADP approach.