3.1 Introduction

With the rapid increase in the number of tumor patients, immunotherapies integrated with multi-pronged approaches are burgeoning for the treatment of specific forms of cancer, especially poorly immunogenic tumors [1]. The original intention of immunotherapy is to fight cancer cells with the lethality of the patient's own immune cells. AIDS, a typical immunodeficiency syndrome caused by failure of the immune response, is attributed to a weakened immune level; at the other extreme, when an activated immune system cannot be shut down and keeps producing cytokines [2], the result is an overreaction of the immune system, as observed in COVID-19, with the natural killer cell population helping to determine whether the immune response terminates. Thus, a Nash equilibrium between the tumor cell population and the immune cell population needs to be found through optimal regulation based on a suitable learning method. Optimal control schemes are first brought into this field for their unique superiority, and nonzero-sum games-based ADP in particular enjoys both merit and practicability.

Decision and estimation under unknown nonlinearities arise extensively in engineering practice, medical treatment, and even the social sciences; for instance, [3] first proposed evaluating a designed S-Box with high nonlinearity on the basis of Chinese I-Ching philosophy. Making suitable treatment decisions is of great importance in health care, where strong nonlinearity remains. To obtain an optimal mixed treatment strategy, growth models of cell population levels were developed for combined immunotherapy and chemotherapy in [4, 5]. When it comes to the reaction of the immune system to tumor growth, a rather complicated nonlinear model of the immune system is required to simulate an aggressive combination treatment plan of immunotherapy and chemotherapy well. The process of solving such nonlinear functions is hardly achievable without exceptionally optimized iterative algorithms, such as backstepping techniques [6], self-learning optimal regulation [7], hierarchical lifelong learning [8], broad learning adaptive neural control [9], and adaptive dynamic programming, which benefits from its adaptive capability and strong autonomous iterative learning ability [10, 11]. Both backstepping and adaptive dynamic programming can guarantee that the control objective is achieved, with the unknown nonlinear function approximated through successive search by neural networks or fuzzy control [12,13,14,15].

\(\mathcal {H}_{\infty }\) control enjoys excellent disturbance suppression while minimizing a performance index; it is recognized as a typical two-player zero-sum problem, which is equivalent to solving algebraic Riccati equations and is generally applied to linear dynamic systems, although systems with a quadratic performance index can also be handled, as in [16]. Meanwhile, the familiar Hamilton-Jacobi-Isaacs equation is perceived as an effective medium for dealing with systems exhibiting inherent nonlinearity, such as the unknown mechanical parameters in [17], which are difficult to handle with conventional methods in the absence of exact system parameters. The mainstream line of ADP analysis seeks the optimal control strategy by solving the Bellman equation without information on the system dynamics, and it has ascended to the core methodologies of optimization and artificial intelligence. In actual models, control constraints must be considered [18,19,20]; thus, this chapter focuses on a dynamic model of the immune system in which each single injection of drugs is limited to an intervention level. The optimal control scheme is therefore transformed into a constrained control problem with a discount factor, effectively avoiding the infinite time dimension and leading to the development of an optimal constrained control policy.

Model-free adaptive control was developed to obtain optimal control strategies without knowledge of exact system parameters [21,22,23], and multiple neural networks were constructed for multi-objective approximation or optimization of the control process. Research on multiple networks has been extended to numerous actor-critic constructions. A tremendous number of practical application scenarios need multiple controllers, each of which minimizes its individual performance function, forming a nonzero-sum problem. As elaborated in nonzero-sum game theory, the control objective in [24] was to minimize the individual performance functions while maintaining stability so as to yield a Nash equilibrium. In [25], the saddle point of the Nash equilibrium was explored throughout the nonzero-sum games-based optimization iteration using ADP; even when no feasible saddle point existed, an optimum was realized iteratively through a mixed optimal control scheme, and the latter is of universal significance for conditions that are hard to satisfy in practical applications. The local optimum problem, which arises extensively, was first effectively avoided through fault-tolerant adaptive multigradient recursive reinforcement learning [9]. To seek the Nash equilibrium, simultaneous algebraic Riccati or Hamilton-Jacobi-Isaacs equations must be solved for nonlinear systems, which leads to the "curse of dimensionality" with a huge amount of computation, especially for numerous actor-critic constructions suffering a many-fold higher computational burden, such as the double-loop policy iteration in [26]. For these reasons, this chapter adopts a compromise of acceptable actor-critic neural networks with appropriate dimensions, effectively realizing the transformation from value iteration to the cost function.

Value and policy iterations together constitute the family of iterative methods, beginning with a semidefinite function or an admissible control law, respectively. ADP has been applied to solve optimal control strategies for both continuous-time [27, 28] and discrete-time systems [29, 30]; however, traditional ADP cannot satisfy the physical requirements of the immune system under a mixed treatment strategy combining chemotherapy drugs and immunotherapy, a situation that nonzero-sum games-based ADP improves considerably. There are few, if any, publications on nonzero-sum games-based ADP methods for solving optimal regulation schemes of the immune system, let alone ones considering optimal constrained control, policy iteration, tumor regression, and a mixed control strategy of chemotherapy and immunotherapy, i.e., a cost function covering minimization of the tumor cells, chemotherapy drugs, and immunotherapy drugs simultaneously.

3.2 Establishment of Mathematical Model

This part introduces the mathematical growth model of tumor cells, which considers the influence of external factors such as chemotherapy drugs and immunotherapy on the tumor cells, as well as the mutual effect between the two types of cells. In the following model, Tu(t) represents the number of tumor cells, Im(t) denotes the number of immune cells, and Che(t) and \(Im_{py}(t)\) depict the concentrations of chemotherapy drugs and immunotherapy drugs in the bloodstream, respectively.

3.2.1 Growth Model of Tumor Cells

Considering the natural growth law of tumor cells individually, without interaction with immune cells or any external effect, the growth of tumor cells is subject to logistic growth:

$$\begin{aligned} Tu(t+1)=Tu(t)+C_1Tu(t)(1-C_2Tu(t)). \end{aligned}$$
(3.1)

When it comes to the interaction between immune cells and tumor cells and to the direct killing of tumor cells by chemotherapeutic drugs, however, the growth model of tumor cells is revised to:

$$\begin{aligned} Tu(t+1)=&\,Tu(t)+C_1Tu(t)(1-C_2Tu(t))\nonumber \\&-C_{Im,Tu}Tu(t)Im(t)-C_{Che,Tu}Tu(t)Che(t), \end{aligned}$$
(3.2)

where the parameters are specified in Table 3.1.

Table 3.1 Parameter specifications of the tumor cells

3.2.2 Growth Model of Immune Cells

Considering only the natural growth law of immune cells, we assume that a fixed number of immune cells is produced per unit of time and that these cells have a finite life cycle:

$$\begin{aligned} Im(t+1)=Im(t)+C_3-C_{Im,d}Im(t) \end{aligned}$$
(3.3)

The tumor cells in the body can stimulate the growth of immune cells, which exhibits a positive nonlinear change given by (3.4):

$$\begin{aligned} \varDelta _{im}=\frac{\alpha _1 Tu(t)^2Im(t)}{\beta _1 +Tu(t)^2}. \end{aligned}$$
(3.4)

In immunotherapy, the added immune agents produce an immune response, which leads to nonlinear growth of the immune cells:

$$\begin{aligned} \varDelta _{Im_{py}}=\frac{\alpha _2 Tu(t)Im_{py}(t)}{\beta _2 +Im_{py}(t)}. \end{aligned}$$
(3.5)

Simultaneously, in the struggle between immune cells and tumor cells, the immune cells themselves also suffer losses,

$$\begin{aligned} \varDelta _{C_{Tu,Im}}= -C_{Tu,Im}Tu(t)Im(t). \end{aligned}$$
(3.6)

and in chemotherapy, chemotherapeutic drugs also damage the immune cells:

$$\begin{aligned} \varDelta _{C_{Che,Im}}= -C_{Che,Im}Che(t)Im(t). \end{aligned}$$
(3.7)

Combining (3.3)–(3.7), we obtain (3.8):

$$\begin{aligned} Im(t+1)\,=\,&Im(t)+C_3-C_{Im,d}Im(t)+\varDelta _{im}+\varDelta _{Im_{py}}\nonumber \\&+\varDelta _{C_{Tu,Im}}+\varDelta _{C_{Che,Im}}\nonumber \\ =\,&Im(t)+C_3-C_{Im,d}Im(t)+\frac{\alpha _1 Tu(t)^2Im(t)}{\beta _1 +Tu(t)^2}\nonumber \\&+\frac{\alpha _2 Tu(t)Im_{py}(t)}{\beta _2 +Im_{py}(t)}-C_{Tu,Im}Tu(t)Im(t)\nonumber \\&-C_{Che,Im}Che(t)Im(t). \end{aligned}$$
(3.8)

The parameters of the immune cell model are elucidated in Table 3.2.

Table 3.2 Parameter specifications of the immune cells

3.2.3 Drug Attenuation Model

We assume that after the injection of a chemotherapy drug, its concentration in the body decreases exponentially with time. To guarantee the effectiveness of the treatment, chemotherapy drugs are simultaneously added to the body:

$$\begin{aligned} Che(t+1)=Dr_{Che}(t)-e^{-\gamma _1}Che(t). \end{aligned}$$
(3.9)

Similarly, we can obtain the attenuation model of the immunoagents:

$$\begin{aligned} Im_{py}(t+1)=Dr_{Im}(t)-e^{-\gamma _2}Im_{py}(t). \end{aligned}$$
(3.10)

where \(Dr_{Che}(t)\) and \(Dr_{Im}(t)\) denote the concentrations of the chemotherapy drugs and immunoagents injected at time t, respectively, and \(\gamma _1\) and \(\gamma _2\) are the decay rates of the chemotherapy drugs and immunoagents.

3.2.4 The Design of the Optimization Problem

Combining the contents of Sects. 3.2.1–3.2.3, we finally obtain the mathematical model governing the growth of tumor cells:

$$\begin{aligned} \left\{ \begin{array}{l} Tu(t+1)=Tu(t)+C_1Tu(t)(1-C_2Tu(t))\\ \qquad \qquad \quad -C_{Im,Tu}Tu(t)Im(t)-C_{Che,Tu}Tu(t)Che(t)\\ Im(t+1)=Im(t)+C_3-C_{Im,d}Im(t)\\ \qquad \qquad \quad +\frac{\alpha _1 Tu(t)^2Im(t)}{\beta _1 +Tu(t)^2}+\frac{\alpha _2 Tu(t)Im_{py}(t)}{\beta _2 +Im_{py}(t)}\\ \qquad \qquad \quad -C_{Tu,Im}Tu(t)Im(t)-C_{Che,Im}Che(t)Im(t)\\ Che(t+1)=Dr_{Che}(t)-e^{-\gamma _1}Che(t)\\ Im_{py}(t+1)=Dr_{Im}(t)-e^{-\gamma _2}Im_{py}(t). \end{array} \right. \end{aligned}$$
(3.11)

Given that Tu(t) and Im(t) are biomasses, and Che(t) and \(Im_{py}(t)\) are the drug concentrations in the bloodstream,

$$\begin{aligned} Tu(t),Im(t),Che(t),Im_{py}(t)\ge 0,\forall t> 0. \end{aligned}$$
(3.12)

and all parameters in the model are non-negative:

$$\begin{aligned}&C_1,C_2,C_3,C_{Im,Tu},C_{Che,Tu},C_{Im,d},C_{Tu,Im},C_{Che,Im},\nonumber \\&\alpha _1,\alpha _2,\beta _1,\beta _2,\gamma _1,\gamma _2\ge 0. \end{aligned}$$
(3.13)

We qualitatively analyze the problem of how to minimize the residual tumor cell population in the bloodstream while using as few drugs as possible, including chemotherapy drugs and immunoagents. This process can be described quantitatively by (3.14):

$$\begin{aligned} \min \Big \{aTu(t)^2&+b_1\int _{0}^{Dr_{Che}(t)} \tanh ^{-1}(\bar{U}^{-1}_1s)\bar{U}_1R_1\, ds\nonumber \\&+b_2\int _{0}^{Dr_{Im}(t)} \tanh ^{-1}(\bar{U}^{-1}_2s)\bar{U}_2R_2\, ds\Big \}. \end{aligned}$$
(3.14)

It is emphasized here that the single dose of each drug should be limited to avoid drug poisoning, so we use a formulation with input constraints. Over the whole treatment process, we get:

$$\begin{aligned} \sum _{t=t_0}^{t_f}\lambda ^t\Big \{aTu(t)^2&+b_1\int _{0}^{Dr_{Che}(t)} \tanh ^{-1}(\bar{U}^{-1}_1s)\bar{U}_1R_1\, ds\nonumber \\&+b_2\int _{0}^{Dr_{Im}(t)} \tanh ^{-1}(\bar{U}^{-1}_2s)\bar{U}_2R_2\, ds\Big \}, \end{aligned}$$
(3.15)

where \(0< \lambda < 1\), and \(\bar{U}_1\) and \(\bar{U}_2\) represent the maximum permissible single-injection doses of the chemotherapy drug and of the immune agents, respectively.
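For later use, note that the input-penalty integral in (3.15) admits a closed form; this identity is standard calculus, added here for clarity rather than stated in the chapter. For \(|u| < \bar{U}_1\),

$$\begin{aligned} \int _{0}^{u} \tanh ^{-1}(\bar{U}^{-1}_1s)\bar{U}_1R_1\, ds = \bar{U}_1R_1\Big [u\tanh ^{-1}(\bar{U}^{-1}_1u)+\frac{\bar{U}_1}{2}\ln \big (1-\bar{U}^{-2}_1u^{2}\big )\Big ], \end{aligned}$$

and analogously for the second integral. The penalty is non-negative and finite on \([-\bar{U}_1,\bar{U}_1]\); it is this nonquadratic form that later produces the saturated \(\tanh \) control laws (3.30)–(3.31) and thereby keeps each single dose within its bound.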

3.3 The Proposed Nonzero-Sum Games-Based ADP Scheme

To solve the problems given above, we propose an aggressive treatment plan, i.e., a control scheme based on the nonzero-sum games-based ADP algorithm.

3.3.1 Theoretical Introduction

For a discrete-time control system \(x(t+1)=F(x(t),u(t),t)\), x(t) is the state variable, u(t) is the control variable, and F is the transition mapping between states. The cost of one state transition is U(x(t), u(t), t), and the total cost over the whole period is \(\sum _{t=t_0}^{t_f}U(x(t),u(t),t)\).

When solving a finite-time problem, we can equivalently reformulate it as

$$\begin{aligned} \sum _{t=t_0}^{\infty } \lambda ^t U(x(t),u(t),t), 0< \lambda < 1. \end{aligned}$$
(3.16)

Applying Bellman's principle of optimality to solve (3.16), we first stipulate \(J(x(t_0))=\sum _{t=t_0}^{\infty } \lambda ^t U(x(t),u(t),t)\), and then we can obtain

$$\begin{aligned} J^*(x(t))=\mathop {\min }\limits _{u(t)}\left\{ U[x(t),u(t)]+\lambda J^*[x(t+1)] \right\} , t\in (t_0,\infty ). \end{aligned}$$
(3.17)

The corresponding optimal control can then be obtained in the following form:

$$\begin{aligned} u^*[x(t)]=\mathop {\arg \min }\limits _{u(t)}\left\{ U[x(t),u(t)]+\lambda J^*[x(t+1)] \right\} , t\in (t_0,\infty ). \end{aligned}$$
(3.18)

This classical solution approach poses a considerable challenge in terms of computation and storage space.

Remark 3.1

Adaptive dynamic programming, as an optimizing learning method, is usually employed to approximate the cost function, which in this chapter is designed not only to minimize the tumor cells but also to minimize the doses of chemotherapy drugs and immunoagents.

3.3.2 Iterative ADP Algorithm

To solve the problem formulated above for model (3.11), we use an iterative adaptive dynamic programming algorithm together with a revised model that facilitates solving the difference equations.

(1) Brief interpretation of ADP algorithm

First, we take a value function K(x) to approximate the cost function J(x). The purpose of the iteration is to ensure that the approximate function approaches the optimal value equation and yields the optimal decision law. Namely,

$$\begin{aligned} \left\{ \begin{array}{lcl} K(x)\rightarrow J^*(x)\\ \kappa \rightarrow u^*. \end{array} \right. \end{aligned}$$
(3.19)

Secondly, in the specific solution process:

Given \(K^{0}(\cdot )=0\), we let

$$\begin{aligned} \kappa ^{0}(x(t))=\mathop {\arg \min }\limits _{u(t)}\left\{ U[x(t),u(t)]+\lambda K^{0}(x(t+1)) \right\} , \end{aligned}$$
(3.20)

and update the value function as

$$\begin{aligned} K^{1}(x(t))= U[x(t),\kappa ^{0}(x(t))]+\lambda K^{0}(x(t+1)), \end{aligned}$$
(3.21)

Then, for \(i=1,2,3,\ldots \), we get

$$\begin{aligned} \kappa ^{i}(x(t))=\mathop {\arg \min }\limits _{u(t)}\left\{ U[x(t),u(t)]+\lambda K^{i}(x(t+1)) \right\} . \end{aligned}$$
(3.22)

and

$$\begin{aligned} K^{i+1}(x(t))=\mathop {\min }\limits _{u(t)}\left\{ U[x(t),u(t)]+\lambda K^{i}(x(t+1)) \right\} . \end{aligned}$$
(3.23)

Thus,

$$\begin{aligned} K^{i+1}(x(t))=U[x(t),\kappa ^{i}(x(t))]+\lambda K^{i}(x(t+1)). \end{aligned}$$
(3.24)

The optimal solution is obtained once the error requirement is adequately satisfied, i.e., when \(K^i(x(t))\rightarrow K^*(x(t))\) and \(\Vert K^{i+1}(x(t))- K^i(x(t))\Vert \le \varepsilon \), where i represents the number of iterations.
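To make the recursion (3.20)–(3.24) concrete, the following minimal sketch runs the discounted value iteration on a grid for a generic scalar system. The dynamics, stage cost, and grids are illustrative placeholders, not the immune-system model.

```python
import numpy as np

# Minimal sketch of K^{i+1}(x) = min_u {U(x,u) + lam * K^i(x')}, (3.20)-(3.24).
# Dynamics, stage cost, and grids are placeholders for illustration only.
f = lambda x: 0.99 * x                  # drift (placeholder)
g = lambda x: 0.1                       # input gain (placeholder)
U = lambda x, u: x**2 + u**2            # stage cost (placeholder)
lam, eps = 0.95, 1e-4                   # discount factor, stopping tolerance

xs = np.linspace(-1.0, 1.0, 101)        # state grid
us = np.linspace(-1.0, 1.0, 41)         # control grid
K = np.zeros_like(xs)                   # K^0(.) = 0, as in (3.20)

for i in range(500):
    K_next = np.empty_like(K)
    for j, x in enumerate(xs):
        x1 = f(x) + g(x) * us                     # candidate next states
        Kx1 = np.interp(x1, xs, K)                # K^i(x(t+1)), grid-clamped
        K_next[j] = np.min(U(x, us) + lam * Kx1)  # Bellman update (3.23)
    if np.max(np.abs(K_next - K)) <= eps:         # ||K^{i+1} - K^i|| <= eps
        break
    K = K_next
```

Starting from \(K^{0}(\cdot )=0\), the computed sequence is non-decreasing in i, consistent with the convergence analysis of Sect. 3.3.3.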

\(\textbf{Algorithm: Evolutionary ADP algorithm}\)

\(\textbf{Initialization:}\)

1. A certain initial state x(t) is given randomly in the feasible region;

2. Set \(\varLambda ^0(\cdot )=0\);

3. Specify the required parameters: error tolerance \(\epsilon \) and discount factor \(\lambda \);

\(\textbf{Iteration and Update:}\)

4. Set \(i=0\); substitute x(t) into "(3.26) = 0" to yield \(\kappa ^i(t)\);

5. Plug x(t) and \(\kappa ^i(t)\) into the system model (3.29) to get \(x(t+1)\);

6. According to (3.25), calculate \( \varLambda ^{i+1}(x(t))= \frac{\partial U(x(t+1),\kappa ^i(t))}{\partial x(t)}+\varLambda ^{i}(x(t+1))\);

7. From the data set \([x(t),\varLambda ^{i+1}(x(t))]\), train the neural network capturing the relationship \(x \sim \varLambda \);

8. Using the neural network obtained in step 7, calculate the value at the same state. If \(\Vert \varLambda ^{i+1}(x(t))-\varLambda ^{i}(x(t)) \Vert \le \epsilon \), the algorithm ends; otherwise, set \(i=i+1\) and return to step 4.

To converge to the optimal solution faster, in each iteration we update the value function and the control law along the current direction of steepest descent, that is,

$$\begin{aligned} \frac{\partial K^{i+1}(x(t))}{\partial x(t)}=&\frac{\partial U(x(t),\kappa ^{i+1}(t))}{\partial x(t)}+\lambda [\frac{\partial x(t+1)}{\partial x(t)}]^T\frac{\partial K^{i}(x(t+1))}{\partial x(t+1)},\end{aligned}$$
(3.25)
$$\begin{aligned} \frac{\partial K^{i+1}(x(t))}{\partial \kappa ^{i+1}(t)}=&\frac{\partial U(x(t),\kappa ^{i+1}(t))}{\partial \kappa ^{i+1}(t)}+\lambda [\frac{\partial x(t+1)}{\partial \kappa ^{i+1}(t)}]^T\frac{\partial K^{i}(x(t+1))}{\partial x(t+1)}. \end{aligned}$$
(3.26)

Setting \(\varLambda ^{i}(x(t+1))= \frac{\partial K^i(x(t+1))}{\partial x(t+1)}\) puts (3.25) and (3.26) into the costate form used in steps 4–6 of the algorithm above.

(2) Modification of Model (3.11)

Compared with traditional control strategies, we solve the problem posed in this chapter directly by ADP, although the model itself is difficult to solve; hence we propose a fitting idea to modify the model. Analysis of (3.11) shows that the injection of chemotherapy drugs into the body has a direct effect on tumor cells, whereas immunoagents act on immune cells, which in turn affect the tumor cell population. Throughout the whole action process, we therefore consider only the inputs of chemotherapy drugs and immunoagents at each moment as the two control inputs of the system, and select the intermediate transition variables, namely the tumor cell and immune cell populations, as the state variables.

1. The standard expressions for the control variables, state variables, cost function, and so on are given as follows:

$$\begin{aligned} x(t)=Tu(t), u_1(t)=Dr_{che}(t),u_2(t)=Dr_{Im}(t). \end{aligned}$$
(3.27)
$$\begin{aligned} K(x)=\sum _{t=t_0}^{\infty }&\lambda ^t\{ax(t)^2+b_1\int _{0}^{u_1(t)} \tanh ^{-1}(\bar{U}^{-1}_1s)\bar{U}_1R_1\, ds\nonumber \\&+b_2\int _{0}^{u_2(t)} \tanh ^{-1}(\bar{U}^{-1}_2s)\bar{U}_2R_2\, ds\}. \end{aligned}$$
(3.28)

2. The modified system model adopts the form of a nonlinear affine system, namely:

$$\begin{aligned} x(t+1)=f(x(t))- [g_1(x(t)),g_2(x(t))][u_1(t),u_2(t)]^T. \end{aligned}$$
(3.29)

3. Update the optimal control law and value function:

Letting \(\frac{\partial K^{i+1}(x(t))}{\partial u^{i}_1(t)}=0 \) and \(\frac{\partial K^{i+1}(x(t))}{\partial u^{i}_2(t)}=0 \) yields

$$\begin{aligned} u_1^{i,*}(t)=\bar{U}_1\tanh \Big (\frac{\lambda }{b_1\bar{U}_1R_1}g_1(x(t))\varLambda ^{i}(x(t+1))\Big ), \end{aligned}$$
(3.30)
$$\begin{aligned} u_2^{i,*}(t)=\bar{U}_2\tanh \Big (\frac{\lambda }{b_2\bar{U}_2R_2}g_2(x(t))\varLambda ^{i}(x(t+1))\Big ). \end{aligned}$$
(3.31)

From this, we can also get

$$\begin{aligned} \frac{\partial K^{i+1}(x(t))}{\partial x(t)}=\varLambda ^{i+1}(x(t)) =&\lambda [\frac{\textrm{d}{f(x)}}{\textrm{d}x}-u^{i}_1\frac{\textrm{d}{g_1(x)}}{\textrm{d}x}-u^{i}_2\frac{\textrm{d}{g_2(x)}}{\textrm{d}x}]\nonumber \\ {}&\cdot \varLambda ^{i}(x(t+1))+2ax. \end{aligned}$$
(3.32)
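As a minimal sketch of one sweep of (3.30)–(3.32), the fragment below uses placeholder dynamics f, g1, g2 and a placeholder costate value; all numerical choices are illustrative assumptions, not the identified immune-system model (the fitted functions appear in Sect. 3.4.1).

```python
import numpy as np

lam = 0.95                       # discount factor
a = 1e-4                         # tumor weight in the stage cost (placeholder)
bUR1, bUR2 = 50.0, 850.0         # products b_i * Ubar_i * R_i (assumed)
U1_bar = U2_bar = 0.05           # single-dose limits

# Placeholder affine model x(t+1) = f(x) - g1(x) u1 - g2(x) u2.
f  = lambda x: 1.004 * x
g1 = lambda x: 40.0 + 0.0 * x
g2 = lambda x: 15.0 + 0.0 * x

def controls(x, Lam_next):
    """Constrained control laws (3.30)-(3.31), given Lambda^i(x(t+1))."""
    u1 = U1_bar * np.tanh(lam / bUR1 * g1(x) * Lam_next)
    u2 = U2_bar * np.tanh(lam / bUR2 * g2(x) * Lam_next)
    return u1, u2

def costate(x, u1, u2, Lam_next, h=1e-6):
    """Costate recursion (3.32); model derivatives by central differences."""
    d = lambda fn: (fn(x + h) - fn(x - h)) / (2 * h)
    return lam * (d(f) - u1 * d(g1) - u2 * d(g2)) * Lam_next + 2 * a * x
```

Because of the \(\tanh \) saturation, \(|u_1|\le \bar{U}_1\) and \(|u_2|\le \bar{U}_2\) hold automatically, whatever the magnitude of the costate.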

Remark 3.2

To approximate the optimal value based on the optimal decision law, the value iteration method drives the value function K(x) toward the cost function J(x).

Remark 3.3

The fitted curve is constructed from data generated by the original model, which is hard to solve directly; the modified model replaces it as the research object, with the chemotherapy drugs and immunoagents considered simultaneously as control inputs.

3.3.3 Convergence Analysis

This section proves the convergence of the algorithm so as to establish its effectiveness in theory. The proof is mainly derived from the iterative relations (3.20)–(3.24) and comprises two lemmas and three theorems.

Lemma 3.4

Take an arbitrary control sequence \(\{\vec {Ar}^i(\vec {x}(t))\}\). Substituting it into the value-update recursion (3.21)–(3.24) yields the corresponding value function \(J^{i}_{Ar}(\vec {x})\), to be compared with the control sequence \(\{\vec {\kappa }^{i}(\vec {x}(t))\}\) corresponding to the minimum cost \(K^{i}(\vec {x}(t))\). If \(J^{0}_{Ar}(\cdot )=K^{0}(\cdot )=0\) and \(J^{i+1}_{Ar}(\vec {x}(t))=U[\vec {x}(t),\vec {Ar}^i(\vec {x}(t))]+\lambda J^{i}_{Ar}(\vec {x}(t+1))\), while

$$\begin{aligned} K^{i+1}(\vec {x}(t))&=U[\vec {x}(t),\vec {\kappa }^i(\vec {x}(t))]+\lambda K^{i}(\vec {x}(t+1))\nonumber \\&=\mathop {\min }\limits _{Ar(t)}\left\{ U[\vec {x}(t),Ar(t)]+\lambda K^{i}(\vec {x}(t+1)) \right\} , \end{aligned}$$
(3.33)

then \( J^{i}_{Ar}(\vec {x}(t)) \ge K^{i}(\vec {x}(t)) \) holds for all i.

Proof

\(K^{i}(\vec {x})\) is obtained by minimizing the value equation, and \(\{\vec {\kappa }^i(\vec {x}(t))\}\) is the corresponding optimal control sequence. For an arbitrary control sequence \(\{\vec {Ar}^i(\vec {x}(t))\}\), the corresponding value equation \(J^{i}_{Ar}(\vec {x})\) can therefore be no less than \(K^{i}(\vec {x})\).

Lemma 3.5

Select a stable admissible control sequence \(\{\vec {Sa}^i(\vec {x}(t))\}\) subject to certain restrictions, with corresponding value equation \(J^{i}_{Sa}(\vec {x})\). For a controllable system, if \(J^{0}_{Sa}(\cdot )=K^{0}(\cdot )=0\) and \(J^{i+1}_{Sa}(\vec {x}(t))=U[\vec {x}(t),\vec {Sa}^i(\vec {x}(t))] +\lambda J^{i}_{Sa}(\vec {x}(t+1))\), then \(J^{i}_{Sa}(\vec {x})\) is bounded.

Proof

$$\begin{aligned}&J^{i+1}_{Sa}(\vec {x}(t))=U[\vec {x}(t),\vec {Sa}^i(\vec {x}(t))]+\lambda J^{i}_{Sa}(\vec {x}(t+1))\nonumber \\&=U[\vec {x}(t),\vec {Sa}^i(\vec {x}(t))]+\lambda U[\vec {x}(t+1),\vec {Sa}^{i-1} (\vec {x}(t+1))]\nonumber \\&~~+\lambda ^2 J^{i-1}_{Sa}(\vec {x}(t+2))\nonumber \\&=U[\vec {x}(t),\vec {Sa}^i(\vec {x}(t))]+\lambda U[\vec {x}(t+1),\vec {Sa}^{i-1} (\vec {x}(t+1))]\nonumber \\&~~+\lambda ^2U[\vec {x}(t+2),\vec {Sa}^{i-2}(\vec {x}(t+2))]+\cdots \nonumber \\&~~+\lambda ^{i+1}J^{0}_{Sa}(\vec {x}(t+i+1)). \end{aligned}$$
(3.34)

Thus, \(J^{i+1}_{Sa}(\vec {x}(t))=\sum _{j=0}^{i}\lambda ^jU[\vec {x}(t+j), \vec {Sa}^{i-j}(\vec {x}(t+j))]\), and hence \(J^{i+1}_{Sa}(\vec {x}(t))\le \lim _{i \rightarrow \infty } \sum _{j=0}^{i}\lambda ^jU[\vec {x}(t+j),\vec {Sa}^{i-j}(\vec {x}(t+j))] \). Since \(\{\vec {Sa}^{i}(\vec {x})\}\) is a stable admissible control sequence, we can conclude that \(0\le J^{i+1}_{Sa}(\vec {x}(t))\le \lim _{i \rightarrow \infty } \sum _{j=0}^{i}\lambda ^jU[\vec {x}(t+j),\vec {Sa}^{i-j}(\vec {x}(t+j))]\le C \) for some constant C. That is, \(J^{i}_{Sa}(\vec {x})\) is bounded.

Theorem 3.6

From the recursion (3.22)–(3.24), \(\{\vec {\kappa }^{i}(\vec {x}(t))\}\) is the control sequence corresponding to the minimum value function \(K^{i}(\vec {x})\). Assuming the initial condition \(K^{0}(\cdot )=0 \), it can be proved that the value-function sequence is monotonically non-decreasing, i.e., \(K^{i}(\vec {x}(t)) \le K^{i+1}(\vec {x}(t))\).

Proof

Define a value equation \(T^{i}(\vec {x}(t))\) with \(T^{0}(\cdot )=0\) and \(T^{i+1}(\vec {x}(t))=\lambda T^{i}(\vec {x}(t+1)) + U[\vec {x}(t),\vec {\tau }^{i}(\vec {x}(t))]\). When \(i=0\), \(T^{1}(\vec {x}(t))=U[\vec {x}(t),\vec {\tau }^{0}(\vec {x}(t))]+\lambda T^{0}(\vec {x}(t+1))\), so \( T^{1}(\vec {x}(t))-T^{0}(\vec {x}(t))=U[\vec {x}(t),\vec {\tau }^{0} (\vec {x}(t))]\ge 0\), and we get \(T^{1}(\vec {x}(t))\ge T^{0}(\vec {x}(t))\).

Assume that \(T^{i}(\vec {x}(t))\ge T^{i-1}(\vec {x}(t))\) holds at step \(i-1\). At step i, \(T^{i+1}(\vec {x}(t))=U[\vec {x}(t),\vec {\tau }^{i}(\vec {x}(t))]+\lambda T^i(\vec {x}(t+1))\), so \(T^{i+1}(\vec {x}(t))-T^{i}(\vec {x}(t))=\lambda \big (T^{i}(\vec {x}(t+1))-T^{i-1}(\vec {x}(t+1))\big )\ge 0 \) by the induction hypothesis. Then \(T^{i+1}(\vec {x}(t))\ge T^{i}(\vec {x}(t))\), and consequently \(K^{i}(\vec {x}(t)) \le K^{i+1}(\vec {x}(t))\).

Theorem 3.7

It is known that \(\{\vec {\kappa }^{i}(\vec {x}(t))\}\) is the control sequence corresponding to the minimum cost function \(K^{i}(\vec {x})\); it can be proved that \(\lim _{i \rightarrow \infty }K^{i}(\vec {x}(t))= K^{*}(\vec {x}(t))\).

Proof

\(\{\vec {\kappa }^{i}(\vec {x})\}\) and \(K^{i}(\vec {x})\) have been given in Lemma 3.4, and the value function corresponding to \(\{\vec {\kappa }^{i,l}(\vec {x})\}\) is \(K^{i+1,l}(\vec {x}(t))=U[\vec {x}(t),\vec {\kappa }^{i,l}(\vec {x}(t))]+\lambda K^{i,l}(\vec {x}(t+1))\), where l is the length of the control sequence. Obviously, \(K^{i+1,l}(\vec {x}(t))=\sum _{j=0}^{i}\lambda ^jU[\vec {x}(t+j),\vec {\kappa }^ {i-j,l}(\vec {x}(t+j))]\).

After taking the limit, we obtain \(K^{\infty ,l}(\vec {x}(t))=\lim _{i \rightarrow \infty } \sum _{j=0}^{i}\lambda ^jU[\vec {x}(t+j), \vec {\kappa }^{i-j,l}(\vec {x}(t+j))]\), and define \(K^{*}(\vec {x}(t))=\mathop {\inf }\limits _{l}\{K^{\infty ,l} (\vec {x}(t))\}\). From Lemma 3.5, \(K^{i+1,l}(\vec {x}(t)) \le K^{\infty ,l}(\vec {x}(t)) \le D^l\) for some bound \(D^l\); on the other hand, \(K^{i+1}(\vec {x}(t)) \le K^{i+1,l}(\vec {x}(t))\) follows from Lemma 3.4. Therefore, \(K^{i+1}(\vec {x}(t)) \le K^{i+1,l}(\vec {x}(t))\le K^{\infty ,l}(\vec {x}(t)) \le D^l\). Since \(K^{*}(\vec {x}(t)) = \mathop {\inf }\limits _{l} K^{\infty ,l}(\vec {x}(t))\) by the definition of the optimal value equation, for any \(\epsilon > 0\) we can extract a control sequence \(\{\vec {\kappa }^{i,m}\}\) such that \(K^{\infty ,m}(\vec {x}(t)) \le K ^{*}(\vec {x}(t))+\epsilon \). Because the chain of inequalities holds for any i and l, taking the limit gives \(\lim _{i \rightarrow \infty }K^{i}(\vec {x}(t)) \le \mathop {\inf }\limits _{l} K^{\infty ,l}(\vec {x}(t)) = K^{*}(\vec {x}(t))\).

Conversely, a control sequence \(\{\vec {\kappa }^{i,g}\}\) guaranteeing \(\lim _{i \rightarrow \infty }K^{i}(\vec {x}(t)) =K^{\infty ,g}(\vec {x}(t))\) is needed, from which \(\lim _{i \rightarrow \infty }K^{i}(\vec {x}(t)) \ge K^{*}(\vec {x}(t)) \) follows. Combining both aspects above, \(\lim _{i \rightarrow \infty }K^{i}(\vec {x}(t))= K^{*}(\vec {x}(t))\) is obtained.

Theorem 3.8

For any state variable \(\vec {x}(t)\), the optimal value equation \(K^{*}(\vec {x}(t))\) satisfies the characteristics of the HJB equation:

$$\begin{aligned} K^{*}(\vec {x}(t))=U[\vec {x}(t),\vec {\kappa }(t)]+\lambda K^{*}(\vec {x}(t+1)). \end{aligned}$$
(3.35)

Proof

From the lemmas and theorems proved above, a series of properties of \(K^{i}(\vec {x}(t))\) has been obtained; it remains to verify that the characteristics of the HJB equation are satisfied. According to Theorems 3.6 and 3.7, \(K^{i+1}(\vec {x}(t))=\mathop {\min }\limits _{\vec {\kappa }(t)}\{{U[\vec {x}(t),\vec {\kappa }]}+\lambda K^{i}(\vec {x}(t+1))\}\), and by (3.23), \(K^{i+1}(\vec {x}(t)) \le U[\vec {x}(t),\vec {\kappa }]+\lambda K^{i}(\vec {x}(t+1))\) for an arbitrary \(\vec {\kappa }(t)\). Taking the limit and then the infimum over \(\vec {\kappa }(t)\) yields \(K^{*}(\vec {x}(t)) \le \mathop {\inf }\limits _{\vec {\kappa }(t)} \{U[\vec {x}(t),\vec {\kappa }]+\lambda K^{*}(\vec {x}(t+1))\}\).

From the other side, \(K^{i+1}(\vec {x}(t))\ge \mathop {\inf }\limits _{\vec {\kappa }(t)} \{U[\vec {x}(t),\vec {\kappa }]+\lambda K^{i-1}(\vec {x}(t+1))\}\) by the monotonicity of Theorem 3.6; taking the limit again yields \(K^{*}(\vec {x}(t))\ge \mathop {\inf }\limits _{\vec {\kappa }(t)}\{U[\vec {x}(t),\vec {\kappa }]+\lambda K^{*}(\vec {x}(t+1))\}\). From the analysis above, we reach the final conclusion:

$$\begin{aligned} K^{*}(\vec {x}(t))=U[\vec {x}(t),\vec {\kappa }(t)]+\lambda K^{*}(\vec {x}(t+1)). \end{aligned}$$
(3.36)

This completes the proof.

Remark 3.9

By Theorems 3.6 and 3.7, the value-function sequence generated with the minimizing control sequence \(\{\vec {\kappa }^{i}(\vec {x}(t))\}\) is monotonically non-decreasing and eventually tends to \(K^{*}(\vec {x}(t))\), satisfying the characteristics of the HJB equation as in [31].

3.4 Simulation and Numerical Experiments

In this section, the proposed mechanism model of tumor cell growth combined with immunotherapy, chemotherapy, and combination treatments is validated experimentally. First, the affine system model is constructed with chemotherapy drugs and immunoagents as control inputs and the tumor cell count as the state variable. Second, according to the affine model obtained by fitting, we develop the cost function of treatment loss in line with clinical treatment requirements. Finally, the optimal treatment plan for a patient with a given baseline condition is computed by the algorithm.

3.4.1 An Affine Model of Tumor Cell Growth

According to clinical medical statistics and literature [4], the specific parameters of the mechanism model are given in Table 3.3.

At this point, when we specify the initial counts of the tumor cell population and immune cells in a patient and follow a certain chemotherapy and immunotherapy regimen, we obtain the four curves of tumor cells and the immune cell population shown in Figs. 3.1 and 3.2. It is obvious that the state variable Tu(t), denoting the population of tumor cells, tends to be stable in Fig. 3.1, and similarly for Im(t) in Fig. 3.2.

Table 3.3 Concentration variation on immune cells, tumor cells, chemotherapeutic drug and immunoagents
Fig. 3.1 The curves of tumor cells

When the fitted affine system is constructed from the data generated by the mechanism model, \(Dr_{Che}(t)\) and \(Dr_{Im}(t)\) are selected as the two control inputs and Im(t) as the state variable. Within the allowable error range, the obtained fitting relation takes the following form,

$$\begin{aligned} x(t+1)=f(x(t))- [g_1(x(t)),g_2(x(t))][u_1(t),u_2(t)]^T,\end{aligned}$$
(3.37)
$$\begin{aligned} f(x)=x+0.00431x(1-1.02\times 10^{-9}x),\end{aligned}$$
(3.38)
$$\begin{aligned} g_1(x)=\exp (8.15\times 10^{-6}[\log (x)]^{6.131}+3.482),\end{aligned}$$
(3.39)
$$\begin{aligned} g_2(x)=\exp (0.05639[\log (x)]^{2.093}+2.492). \end{aligned}$$
(3.40)
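The fitted functions (3.38)–(3.40) can be transcribed directly, for example for substitution into the iteration of Sect. 3.3.2. This is a transcription sketch in which the natural logarithm is assumed for log; it is valid for x > 1 so that \(\log (x)>0\).

```python
import numpy as np

# Fitted affine model (3.37)-(3.40); natural logarithm assumed for log.
def f(x):
    return x + 0.00431 * x * (1.0 - 1.02e-9 * x)

def g1(x):
    return np.exp(8.15e-6 * np.log(x) ** 6.131 + 3.482)

def g2(x):
    return np.exp(0.05639 * np.log(x) ** 2.093 + 2.492)

def step(x, u1, u2):
    """One transition of the affine model (3.37)."""
    return f(x) - g1(x) * u1 - g2(x) * u2
```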

The curves before and after fitting are compared in Fig. 3.3; the fit meets the precision requirements, which guarantees that the data can be traced back accurately to the original source.

Fig. 3.2 The curves of immune cells

Fig. 3.3 The curves of immunoagent drug concentrations in the bloodstream

3.4.2 The Treatment Loss Cost Function

The form of the cost function was proposed in Sect. 3.3 as (3.28). Unlike the theoretical mechanism-model analysis, and in line with clinical requirements, it is necessary to limit each single injection of drugs to no more than 0.05. Therefore,

$$\begin{aligned} \bar{U}_1=0.05, \bar{U}_2=0.05. \end{aligned}$$
(3.41)

To avoid seeking the optimal solution over an infinite time dimension, we choose the discount factor \(\lambda =0.95\). Finally, the specific cost function is obtained as follows:

$$\begin{aligned} K(x)=&\sum _{t=t_0}^{\infty }0.95^t\{2.784\times 10^{-5}x(t)^2+\int _{0}^{u_1(t)} 50\tanh ^{-1}\nonumber \\&(0.05^{-1}s)\,ds+\int _{0}^{u_2(t)}850\tanh ^{-1}(0.05^{-1}s)\,ds\}. \end{aligned}$$
(3.42)
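As a numerical cross-check, the stage cost inside (3.42) can be evaluated with the closed form of the penalty integral noted after (3.15); the helper below is a sketch under that identity, using the chapter's coefficients.

```python
import numpy as np
from scipy.integrate import quad

a, c1, c2, U_bar = 2.784e-5, 50.0, 850.0, 0.05   # coefficients of (3.42)

def penalty(u, c, U_bar):
    """Closed form of c * int_0^u arctanh(s/U_bar) ds, for |u| < U_bar."""
    r = u / U_bar
    return c * (u * np.arctanh(r) + 0.5 * U_bar * np.log(1.0 - r**2))

def stage_cost(x, u1, u2):
    return a * x**2 + penalty(u1, c1, U_bar) + penalty(u2, c2, U_bar)

# Sanity check of the closed form against numerical quadrature.
u = 0.03
num, _ = quad(lambda s: c1 * np.arctanh(s / U_bar), 0.0, u)
assert abs(penalty(u, c1, U_bar) - num) < 1e-9
```

The full cost (3.42) then follows by weighting each stage cost with \(0.95^t\) along the treated trajectory.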

3.4.3 The Optimal Solution of the Treatment

According to the previous two subsections, we have completed the transformation from the mathematical mechanism model to a solvable affine model and determined the specific cost function according to clinical requirements. The optimal treatment strategy is acquired through the proposed algorithm, and a comparison is made to prove its effectiveness and feasibility. The cost function is designed to minimize the tumor cells while keeping the doses of chemotherapy drugs and immunoagents at a minimum.

In the following three figures (Figs. 3.4, 3.5 and 3.6), the blue curve represents the changes in tumor cells and in the single doses administered to the patient under the normal treatment regimen, whereas the red curve represents the effect of the optimal treatment regimen calculated by the nonzero-sum games-based ADP algorithm.

Fig. 3.4 The injection dose curve of chemotherapy drugs under two kinds of treatment

As shown in Fig. 3.4, there are originally many cancer cells in the body, so both dosing curves start close to the upper limit; under the dual action of the drugs and the immune system, the number of cancer cells drops substantially. Under the normal regimen, the injected amount barely changes from beginning to end: even in the closing stage, when the cancer cells have decreased significantly, specific doses are still administered. In contrast, the treatment dose obtained by our method is substantially less.

Fig. 3.5 The injection dose curve of immunologic agents under two kinds of treatment

Correspondingly, Fig. 3.5 shows that the changing trend of the injected dose of immunoagents on the two curves is close to that of the chemotherapy drugs. The optimized dose slightly exceeds the traditional treatment plan in the initial stage, when there are more cancer cells, but this does not last long. When the number of cancer cells is relatively large, the primary or indirect target of both drugs is the cancer cells; in the late stage of treatment, the number of cancer cells is significantly reduced. If chemotherapy drugs are still administered according to the normal regimen, the normal cells suffer considerable erosion, with a significant impact on the body; under the optimized regimen, however, the drug dose is dramatically reduced and the normal cells are far less affected.

Fig. 3.6 The curves of tumor cells under two kinds of treatment

As shown in Fig. 3.6, the control effects of the two treatment schemes on the number of tumor cells resemble each other in the initial stage. At the final stage, however, the ADP-optimized algorithm not only significantly reduces the tumor cell count but also, as seen in Figs. 3.4 and 3.5, minimizes the injected amounts of the two drugs, which shows the effectiveness of our treatment scheme.

Remark 3.10

The optimal regulation strategy for the immune system enjoys the advantage of decreasing the tumor cells; what is more, clinical treatment benefits from the minimization of chemotherapy drugs and immunoagents.

3.5 Conclusion

Nonzero-sum games-based adaptive dynamic programming has been proposed to acquire the optimum by affecting the growth of tumor and immune cells, providing guidance for clinical practice by adjusting the administered doses of chemotherapy and immunotherapy drugs. The obtained results have shown that the immune system can decrease the tumor cells while the chemotherapy drugs and immunoagents are minimized through the optimal control behavior. Simulation examples have demonstrated the availability and effectiveness of the research methodology. Future research will focus on solving the optimal mixed treatment strategy for a complex immunotherapy system including immune cell subsets and cytokines, and on switched control policies in accordance with hybrid therapy.