4.1 Introduction

Interactions among cancer cells, surrounding stromal cells, and immune cells, mediated by cell-autonomous and non-autonomous signaling, shape the competition for survival; a mechanistic understanding of tumor progression therefore hinges on its evolutionary and ecological dynamics [1]. Evolution is assumed to change traits continuously over time even while the ecological dynamics themselves keep changing. More broadly, one can imagine an evolutionarily stable state that is a trajectory of phenotypic states, an evolutionarily stable trait attractor. This notion applies when there is sufficient variation to support rapid evolution, or when the state involves a plastic response to environmental conditions that ultimately constitutes evolutionary stability. Natural killer (NK) cells, one of the players in this game, attack many tumour cell lines and are critical to anti-tumour immunity [2]; nevertheless, the interaction between NK cells and their tumour targets remains poorly understood. To overcome drug resistance, anti-tumor immunotherapy is gradually replacing traditional treatment strategies [3]. The interplay between specialized cancer cell populations and immune cells has become a distinctive evolutionary-dynamics phenomenon in the tumor-immune growth architecture. The optimization goal is to minimize the administered dosage and thereby reduce adverse effects.

The dynamic perception, or learning, process is realized through interactions between cells and the organism: responses are observed and an optimal control strategy for the underlying Markov decision process is learned. What is required is an optimal control scheme such that the desired administration dosage is tracked while the amounts of chemotherapeutic drugs and immunological agents are minimized. Reinforcement learning is therefore well suited to this optimization-oriented tumor-immunity architecture. The classical policy-iteration and value-iteration frameworks remain relevant, while the new min-Hamiltonian formulation [4] and the low-gain-parameter ADP-Bellman equation for global stabilization are thriving [5].

The interactions between cells are highly nonlinear and coupled. When computational conditions allow, both adaptive algorithm designs based on policy iteration and adaptive hierarchical neural-network algorithms [6] can readily solve coupled problems such as fractional-order chaotic synchronization, and both inspire approaches to the optimal solution of the HJB equation. When computing conditions are not available, model-free methods are the natural choice. The iterative ADP algorithm has been developed into an iterative NDP algorithm that does not require an accurate system model [7] but only observable system data, which reduces cost and optimizes the control action through error backpropagation [8]. Q-learning has evolved from three classes of four networks, through interleaved double iteration, to critic-only Q-learning [9] with a single class of one network, markedly improving resource utilization and eliminating the problem of insufficient exploration. Interacting cells resemble multiple agents, and attacks of tumor cells on normal cells may trigger abnormal reactions; the neural-network-based attack detection and estimation scheme designed in [10] can readily capture such anomalies. Cells cannot proliferate without limit: for the optimal solution of the constrained auxiliary subsystem, within the ADP framework and continuing the idea of policy iteration, a strongly convergent synchronous iterative optimization strategy has been given [11].

The hard-to-decouple leader-follower behavior of vehicle-vehicle communication [12] and human-vehicle interaction can be handled with off-policy iteration [13]. Switched systems [14], T-S fuzzy models, Nash equilibria, and zero-sum games [15] let each agent deal with a low-dimensional state and a local pattern, reducing conservatism and readily attaining the minimum local cost [16]. Benefiting from improved exploration, the parallel actor-critic asynchronous gradient-sharing mechanism realizes parallel optimization of diverse agents in a short time [17]. Driven by the temporal-difference error, integral reinforcement learning obtains the estimated control strategy by updating the critic weights [18, 19]. To obtain a better stabilizing adaptive control scheme, an appropriate robust control design for the controlled system is needed [20]. Reference [21] surveys recent progress on continuous nonlinear control systems whose controllers combine adaptivity and robustness, and demonstrates the reliability and effectiveness of these two designs on actual power systems and on large and heavy machinery. The theory integrates ecological and evolutionary dynamics, blending ecological mathematical models with evolutionary game theory [22]; evolutionarily stable strategies can then be investigated to seamlessly integrate both sides [23]. Solvable dynamic equations can be used to pursue optimal control objectives; what follows, however, is the curse of dimensionality.

To overcome it, a dual-heuristic dynamic programming method, descended from ADP and accounting for the actual constraints, is proposed for the nonlinear affine evolutionary dynamics. By introducing a discounted performance index, the infinite-horizon optimal regulation problem is reformulated as a finite one. Unlike previous value iterations, no initially stabilizing policy is required. ADP is adapted to optimal formation control through the construction of a performance index function [24]. An affine mathematical model is first introduced to mirror the real scenario [25]; the optimal control problem is transformed into solving the HJB equation, and convergence is proved. ADP lets learners derive a learning strategy, and here a competitive learning setting with cancer cell populations and immune cells is studied, aiming to minimize the administered dose.

4.2 Preliminaries

Consider a classical discrete-time nonlinear affine system,

$$\begin{aligned} \mathcal {x(t+1)} = \mathcal {f}(x(\mathcal {t}))+ \mathcal {g}(x(\mathcal {t}))u(\mathcal {t}) \end{aligned}$$
(4.1)

where the state variable \(x(\mathcal {t}) \in \mathcal {R}^{\mathcal {n}}\), the control variable \(u(\mathcal {t}) \in \mathcal {R}^{\mathcal {m}}\), and \(\mathcal {f}(\cdot ) \in \mathcal {R}^{\mathcal {n}}, \mathcal {g}(\cdot ) \in \mathcal {R}^{\mathcal {n \times m}}\); the system can be stabilized on a compact set \( \mathbf {\Omega } \subset \mathcal {R}^{\mathcal {n}}\), and \(\mathcal {f}(0)=0,\ \mathcal {g}(0)=0\). In short, the optimal control problem for (4.1) is equivalent to obtaining the optimal control law \(u^*(\mathcal {t})=u(x(\mathcal {t}))\) that minimizes the infinite-horizon performance index:

$$\begin{aligned} \textrm{J}(x(\mathcal {t})) = \sum _{l=\mathcal {t}}^{\infty } \textrm{K}(x(l),u(l)). \end{aligned}$$
(4.2)

\(\textrm{K}(x(\mathcal {t}),u(\mathcal {t}))\) is the stage cost, with \(\textrm{K}(x,u) \ge 0 \ \forall x,u \). Typically, the cost function \(\textrm{K}(\cdot )\) takes the quadratic form

$$\begin{aligned} \textrm{K}(x(\mathcal {t}),u(\mathcal {t}))= x^{\scriptscriptstyle {T}}(\mathcal {t})\textrm{P}x(\mathcal {t}) + u^{\scriptscriptstyle {T}}(\mathcal {t})\textrm{Q}u(\mathcal {t}) \end{aligned}$$
(4.3)

where \(\textrm{P}\) and \(\textrm{Q}\) are positive definite matrices.
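As a concrete illustration, here is a minimal sketch of an affine system (4.1) together with the quadratic stage cost (4.3); the dynamics \(\mathcal{f}, \mathcal{g}\) and the weights \(\textrm{P}, \textrm{Q}\) are hypothetical choices that merely satisfy the stated conditions (\(\mathcal{f}(0)=0\), \(\mathcal{g}(0)=0\), \(\textrm{P},\textrm{Q}>0\)):

```python
import numpy as np

# Hypothetical scalar (n = m = 1) instance of x(t+1) = f(x(t)) + g(x(t)) u(t)
def f(x):
    return 0.9 * x + 0.1 * np.sin(x)      # smooth, f(0) = 0

def g(x):
    return np.atleast_2d(0.5 * x)         # g(0) = 0, shape (n, m)

def step(x, u):
    return f(x) + g(x) @ u                # one transition of (4.1)

# Quadratic stage cost K(x, u) = x^T P x + u^T Q u of (4.3)
P = np.array([[2.0]])                     # state weight, positive definite
Q = np.array([[0.5]])                     # control weight, positive definite

def stage_cost(x, u):
    return float(x @ P @ x + u @ Q @ u)
```

Since \(\mathcal{f}(0)=0\) and \(\mathcal{g}(0)=0\), the origin is an equilibrium for every input, and positive definiteness of \(\textrm{P}, \textrm{Q}\) makes \(\textrm{K}(x,u)\ge 0\) with equality only at the origin.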

The optimal control problem (4.2) can thus be converted into solving the HJB equation. According to the Bellman optimality principle, the optimal value function obeys the following [9]:

$$\begin{aligned} \textrm{J}^{\scriptscriptstyle {*}}\!(x(\mathcal {t})) \!=\! \min _{u(\mathcal {t})} \Big \{ x^{\scriptscriptstyle {T}}(\mathcal {t})\textrm{P}x(\mathcal {t}) \! + u^{{\scriptscriptstyle {T}}}(\mathcal {t})\textrm{Q}u(\mathcal {t}) \!+ \textrm{J}^{\scriptscriptstyle {*}}\!(x(\mathcal {t+1}))\Big \} \end{aligned}$$
(4.4)

Minimizing the right-hand side of (4.4) yields the optimal control law and the optimal value function \(\textrm{J}^*(x(\mathcal {t}))\). As a necessary condition, one can take the partial derivative of the right-hand side of (4.4) with respect to \(u(\mathcal {t})\) to obtain \(u^*\). Hence,

$$\begin{aligned} u^*(\mathcal {t}) = -\frac{\textrm{Q}^{\scriptscriptstyle {-1}}}{2}\ \Big [\mathcal {g}(x(\mathcal {t}))\Big ]^{\scriptscriptstyle {T}}\!\frac{\partial \textrm{J}^*(x(\mathcal {t+1}))}{\partial x(\mathcal {t+1})} \end{aligned}$$
(4.5)

Substituting (4.5) into (4.4) yields

$$\begin{aligned} \textrm{J}^*(x(\mathcal {t})) =\,&x^{\scriptscriptstyle {T}}(\mathcal {t})\textrm{P}x(\mathcal {t})+\frac{1}{4}\Big [\frac{\partial \textrm{J}^*(x(\mathcal {t+1}))}{\partial x(\mathcal {t+1})}\Big ]^{\scriptscriptstyle {T}}\mathcal {g}(x(\mathcal {t}))\textrm{Q}^{\scriptscriptstyle {-1}} \nonumber \\&\ \cdot \mathcal {g}^{\scriptscriptstyle {T}}(x(\mathcal {t}))\Big [\frac{\partial \textrm{J}^*(x(\mathcal {t+1}))}{\partial x(\mathcal {t}+1)}\Big ] + \textrm{J}^*(x(\mathcal {t+1})). \end{aligned}$$
(4.6)

From (4.6), an analytical solution for \(u^*(\mathcal {t})\) is in general impossible to obtain: at the current time \(\mathcal {t}\), the next-step value \(\textrm{J}^*(x(\mathcal {t+1}))\) is unknown. To overcome this dilemma, an approximate optimal solution of the HJB equation can be studied. In the fourth part of this chapter, the I-DHP algorithm is derived to solve this class of optimal control problems [26, 27].
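To make the approximation idea concrete, the Bellman recursion (4.4) can be iterated numerically for a simple plant. The sketch below runs tabular value iteration on a state grid for a hypothetical scalar system \(x(\mathcal{t}+1)=0.8x(\mathcal{t})+u(\mathcal{t})\) with \(\textrm{P}=\textrm{Q}=1\); the grid, control set, and iteration count are illustrative choices:

```python
import numpy as np

# Tabular value iteration for the Bellman equation (4.4) on a 1-D grid
xs = np.linspace(-1.0, 1.0, 101)          # state grid
us = np.linspace(-1.0, 1.0, 41)           # candidate controls
J = np.zeros_like(xs)                      # J^0 = 0

for _ in range(200):
    J_new = np.empty_like(J)
    for i, x in enumerate(xs):
        x_next = 0.8 * x + us                       # all successor states
        J_next = np.interp(x_next, xs, J)           # interpolate J off-grid
        J_new[i] = np.min(x**2 + us**2 + J_next)    # Bellman backup (4.4)
    if np.max(np.abs(J_new - J)) < 1e-9:
        break
    J = J_new
```

For this linear-quadratic instance the converged values can be checked against the discrete-time Riccati solution (\(\textrm{J}^*(x)=px^2\) with \(p\approx 1.37\)), up to the grid and control discretization error.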

4.3 Modeling of Mixed Immunotherapy and Chemotherapy for Tumor Cell

In this part, a mathematical model is constructed from the natural growth of a single type of tumor cell, the interactions between various immune cells and tumor cells in vivo, and the influence of externally applied chemotherapy drugs and immune agents on the tumor cell population [22, 28, 29].

First, define the notation for the various cell populations:

  • \(\mathcal {T_u}(\mathcal {t})\): Tumor cell population in vivo.

  • \(\mathcal {N_K}(\mathcal {t})\): NK cells are derived from bone marrow lymphoid stem cells.

  • \(\mathcal {C_T}(\mathcal {t})\): Cytotoxic T lymphocytes (CTL), a subdivision of leukocytes, are specific T cells that secrete various cytokines and participate in immune function.

  • \(\mathcal {C_L}(\mathcal {t})\): Number of circulating lymphocytes (or leukocytes).

  • \(\mathcal {Ch_{dr}}(\mathcal {t})\): Chemotherapeutic drug concentration in the blood.

  • \(\mathcal {Im_{dr}}(\mathcal {t})\): Immunotherapy drug concentration in the blood.

For brevity, the time argument is omitted in the following subsections and defaults to \(\mathcal {t}\). Lowercase letters “\(\mathfrak {a}\), \(\mathfrak {b}\), \(\mathfrak {c_1}\), \(\mathfrak {c_2}\), \(\mathfrak {e}\), \(\mathfrak {f}\), \(\mathfrak {g}\), \(\mathfrak {h_1}\), \(\mathfrak {h_2}\), \(\mathfrak {i}\), \(\mathfrak {j}\), \(\mathfrak {l}\), \(\mathfrak {m}\), \(\mathfrak {n_1}\), \(\mathfrak {n_2}\), \(\mathfrak {p_1}\), \(\mathfrak {p_2}\), \(\mathfrak {q_1}\), \(\mathfrak {q_2}\), \(\mathfrak {r}\), \(\mathfrak {s}\), \(\mathfrak {u}\)” all denote fixed real numbers; uppercase letters “\( \textrm{G,K,O,R,I} \)” denote different categories of gain terms, which depend on time \(\mathcal {t}\); \(\mathcal {L}_{(\cdot )}\) is a constant depending on the cell type; and \(\textbf{e}^{(\cdot )}\) denotes the exponential function.

4.3.1 The Natural Growth of Cells

According to [2, 22], tumor cells follow a natural growth curve, \( \mathcal {G}_{\mathcal {T_u}}=\mathfrak {a}\mathcal {T_u}(1-\mathfrak {b}\mathcal {T_u})\) (\(\mathcal {G}_{(\cdot )}\) denotes the natural growth term of each cell type). Natural killer cells [22] are assumed to be produced at a constant rate and to be influenced by circulating lymphocytes throughout the production cycle (since circulating lymphocytes reflect the overall level of immune health); thus, \(\mathcal {G}_{\mathcal {N_K}}=\mathfrak {c}_1\mathcal {C_{L}}-\mathfrak {c}_2\mathcal {N_K}\). In the absence of tumor cells, cytotoxic T lymphocytes are assumed to be absent, and the growth of \(\mathcal {C_{T}}(\mathcal {t})\) cells is affected only by natural mortality, \(\mathcal {G}_{\mathcal {C_{T}}}=-\mathfrak {e}\mathcal {C_{T}}\). Circulating lymphocytes are likewise produced at a constant rate during their lifetime, \(\mathcal {G}_{\mathcal {C_{L}}}=\mathfrak {f}-\mathfrak {g}\mathcal {C_{L}}\). Injected chemotherapy drugs and immune agents are assumed to decay exponentially, \(\mathcal {G}_{\mathcal {Ch_{dr}}}=-\textbf{e}^{-\gamma _{\alpha }}\mathcal {Ch_{dr}}\), \(\mathcal {G}_{\mathcal {Im_{dr}}}=-\textbf{e}^{-\gamma _{\beta }}\mathcal {Im_{dr}}\).
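Collected in code, the natural-growth terms read as follows; this is a sketch, and every numerical parameter value is a hypothetical placeholder rather than a fitted constant:

```python
import numpy as np

# Natural-growth terms of Sect. 4.3.1; a, b, c1, c2, e, f_, g_, gamma_a,
# gamma_b stand for the constants of the text (all values hypothetical).
a, b = 0.43, 1.0e-9        # tumour logistic growth rate / inverse capacity
c1, c2 = 1.2e-4, 4.1e-2    # NK production from lymphocytes / NK death
e = 2.0e-2                 # CTL natural mortality
f_, g_ = 7.5e8, 1.2e-2     # circulating-lymphocyte source / death
gamma_a, gamma_b = 0.9, 0.9

def growth_terms(Tu, Nk, Ct, Cl, Ch, Im):
    return {
        "Tu": a * Tu * (1.0 - b * Tu),     # logistic tumour growth
        "Nk": c1 * Cl - c2 * Nk,           # lymphocyte-driven NK production
        "Ct": -e * Ct,                     # CTL decay (absent tumour)
        "Cl": f_ - g_ * Cl,                # constant-rate lymphocyte turnover
        "Ch": -np.exp(-gamma_a) * Ch,      # chemotherapy drug decay
        "Im": -np.exp(-gamma_b) * Im,      # immunotherapy drug decay
    }
```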

4.3.2 Intercellular Conditioning

When these cell populations coexist, negative interactions arise between pairs of populations: partly the indirect effect of competition for growth space and nutrients, and partly the direct resistance of cell populations to each other [22]:

$$\begin{aligned} \mathcal {K}_{\mathcal {T_u}}=-\mathfrak {j}\mathcal {N_K}\mathcal {T_u}\quad \mathcal {K}_{\mathcal {C_{T}}}=\mathfrak {h}_1\cdot \frac{(\mathcal {C_{T}}/\mathcal {T_u})^\mathfrak {i}}{\mathfrak {h}_2+(\mathcal {C_{T}}/\mathcal {T_u})^\mathfrak {i}}\cdot \mathcal {T_u} \end{aligned}$$

To simplify the writing, let \(\mathcal {O}\) denote this particular term; note that \(\mathcal {O}=\mathcal {O}(\mathcal {t})\), since it depends on \(\mathcal {C_{T}}(\mathcal {t})\) and \(\mathcal {T_u}(\mathcal {t})\).

$$\begin{aligned} \mathcal {O}=\mathfrak {h}_1\cdot \frac{(\mathcal {C_{T}}/\mathcal {T_u})^\mathfrak {i}}{\mathfrak {h}_2+(\mathcal {C_{T}}/\mathcal {T_u})^\mathfrak {i}} \quad \mathcal {K}_{\mathcal {C_{T}}}=\mathcal {O}\cdot \mathcal {T_u} \end{aligned}$$
(4.7)

NK cells are subject to recruitment: the presence of tumor cells draws more NK cells into the response, which motivates sequential application of cell-cycle-nonspecific and cell-cycle-specific drugs to recruit more cells at specific stages into the proliferation cycle, thereby increasing the number of tumor cells killed [29,30,31].

$$\begin{aligned} \mathcal {R}_{\mathcal {N_k}}=\frac{\mathfrak {l}\cdot \mathcal {T_u}^2}{\mathfrak {m}+\mathcal {T_u}^2}\mathcal {N_k};\quad \mathcal {R}_{\mathcal {C_{T}}}(\mathcal {T_u},\mathcal {C_{T}} )=\mathfrak {p}_1\frac{\mathcal {O}^2\mathcal {T_u}^2}{\mathfrak {q}_1\!+\mathcal {O}^2\mathcal {T_u}^2}\mathcal {C_{T}} \end{aligned}$$

\(\mathcal {C_{T}}\) cells exhibit a similar recruitment effect [32], proportional to the number of tumor cells killed by NK-cell lysis, \(\mathcal {R}_{\mathcal {C_{T}}}(\mathcal {N_k},\mathcal {T_u} )=\mathfrak {n}_1\mathcal {N_k}\mathcal {T_u}\). In addition, the presence of tumor cells stimulates the immune system to secrete more cells, \(\mathcal {R}_{\mathcal {C_{T}}}(\mathcal {C_{L}},\mathcal {T_u} )=\mathfrak {n}_2\mathcal {C_{L}}\mathcal {T_u}\). During the immune response, NK cells or CTLs may undergo multiple contacts with tumor cells and are then inactivated [29, 33,34,35].

$$\begin{aligned} \mathcal {I}_{\mathcal {ac,N_k}}=-\mathfrak {p}_2\mathcal {T_u}\mathcal {N_k}\quad \mathcal {I}_{\mathcal {ac,\mathcal {C_{T}}}}=-\mathfrak {q}_2\mathcal {C_{T}}\mathcal {T_u}\quad \mathcal {I}_{\mathcal {C_{L}},\mathcal {C_{T}}}=-\mathfrak {r}\mathcal {N_k}(\mathcal {C_{T}})^2 \end{aligned}$$
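The intercellular terms can be sketched as below. Two assumptions are flagged: the denominators are read as \(\mathfrak {h}_2+(\mathcal {C_{T}}/\mathcal {T_u})^{\mathfrak {i}}\) and \(\mathfrak {m}+\mathcal {T_u}^2\) (the forms retained in (4.8e) and (4.8f)), and all parameter values are hypothetical placeholders:

```python
# Intercellular terms of Sect. 4.3.2 (hypothetical parameter values)
j, h1, h2, i_ = 3.2e-8, 1.25, 2.0e-2, 2.0
l_, m_ = 2.5e-2, 2.0e7
n1, n2 = 1.1e-7, 6.5e-11
p1, q1 = 0.125, 2.0e7
p2, q2, r = 1.0e-7, 3.4e-10, 1.0e-14

def O_term(Ct, Tu):
    ratio = (Ct / Tu) ** i_
    return h1 * ratio / (h2 + ratio)            # CTL lysis fraction O of (4.7)

def interaction_terms(Tu, Nk, Ct, Cl):
    O = O_term(Ct, Tu)
    return {
        "K_Tu": -j * Nk * Tu,                   # NK kill of tumour cells
        "K_Ct": O * Tu,                         # CTL kill of tumour cells
        "R_Nk": l_ * Tu**2 / (m_ + Tu**2) * Nk, # NK recruitment by tumour
        "R_Ct": p1 * (O * Tu)**2 / (q1 + (O * Tu)**2) * Ct
                + n1 * Nk * Tu + n2 * Cl * Tu,  # CTL recruitment
        "I_Nk": -p2 * Tu * Nk,                  # NK inactivation after contact
        "I_Ct": -q2 * Ct * Tu - r * Nk * Ct**2, # CTL inactivation / regulation
    }
```

By construction \(0<\mathcal {O}<\mathfrak {h}_1\), so the CTL kill term saturates as the CTL-to-tumour ratio grows.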

4.3.3 Drug Intervention

Every cell population in this model contains an action term for chemotherapy drugs, whose killing effect is not uniformly effective: at low drug concentration the kill rate increases almost linearly, while at high drug concentration it levels off. A saturating form is used in the model [36], \(1-\textbf{e}^{-\mathcal {Ch_{dr}}(\mathcal {t})}\).

$$\begin{aligned} \mathcal {D}^{\mathcal{C}\mathcal{h}}_{r}(\cdot )=\mathcal {L}_{(\cdot )}(1-\textbf{e}^{-\mathcal {Ch_{dr}}(\mathcal {t})})(\cdot )\quad \end{aligned}$$

\((\cdot )=\mathcal {T_u}, \mathcal {C_{T}}, \mathcal {C_{L}}, \mathcal {N_k}\).

\(\mathcal {L}_{(\cdot )}\) represents the drug-interaction coefficient of the corresponding cell population. Immunotherapy is also included; its impact on immune-system efficacy can be described mathematically by the Michaelis-Menten interaction below, where \(\mathfrak {s}\) and \(\mathfrak {u}\) are constants [30].

$$\begin{aligned} \mathcal {D}^{\mathcal{I}\mathcal{m}}_{r}(\mathcal {C_T},\mathcal {Im_{dr}})=\mathfrak {u}\frac{\mathcal {Im_{dr}}\mathcal {C_T}}{\mathfrak {s}+\mathcal {Im_{dr}}} \end{aligned}$$

Chemotherapy and immunotherapy drugs are injected over a certain period of time; denote by \(\mathcal {V}_{Che}(\mathcal {t})\) and \(\mathcal {V}_{Im}(\mathcal {t})\) the injected amounts of chemotherapy drug and immunotherapy drug, respectively.
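In code, the two drug terms behave exactly as described: linear at low concentration and saturating at high concentration. The exponent is taken as \(-\mathcal {Ch_{dr}}\) so that the kill rate actually saturates, and the constants are hypothetical:

```python
import numpy as np

# Saturating chemotherapy kill L_x (1 - e^{-Ch}) x and
# Michaelis-Menten immunotherapy term u_ * Im * Ct / (s_ + Im);
# L_x, u_, s_ are hypothetical constants.
def chemo_kill(L_x, Ch, x):
    return L_x * (1.0 - np.exp(-Ch)) * x   # ~linear for small Ch, -> L_x * x

def immuno_boost(Ct, Im, u_=0.3, s_=1.0e4):
    return u_ * Im * Ct / (s_ + Im)        # saturates at u_ * Ct
```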

4.3.4 Mixed Growth Model of Cell Population

Combining the above contributions, the total cell-population growth model is obtained:

$$\begin{aligned} \mathcal {Im_{dr}}(\mathcal {t}+1)&=(1-\textbf{e}^{-\gamma _{\beta }})\mathcal {Im_{dr}}(\mathcal {t})+\mathcal {V}_{Im}(\mathcal {t}) \end{aligned}$$
(4.8a)
$$\begin{aligned} \mathcal {Ch_{dr}}(\mathcal {t}+1)&=(1-\textbf{e}^{-\gamma _{\alpha }})\mathcal {Ch_{dr}}(\mathcal {t})+\mathcal {V}_{Che}(\mathcal {t}) \end{aligned}$$
(4.8b)
$$\begin{aligned} \mathcal {C_{L}}(\mathcal {t}+1)&=\mathfrak {f}+(1-\mathfrak {g}-\mathcal {L}_{\mathcal {C_{L}}})\mathcal {C_{L}}(\mathcal {t})+\mathcal {L}_{\mathcal {C_{L}}}\textbf{e}^{-\mathcal {Ch_{dr}}(\mathcal {t})}\mathcal {C_{L}}(\mathcal {t}) \end{aligned}$$
(4.8c)
$$\begin{aligned} \mathcal {T_u}(\mathcal {t}+1)&=(1+\mathfrak {a}-\mathcal {L}_{\mathcal {T_u}})\mathcal {T_u}(\mathcal {t})-\mathfrak {a}\mathfrak {b}\mathcal {T_u}^{\scriptscriptstyle {2}}(\mathcal {t})\nonumber \\&\quad +\mathcal {T_u}(\mathcal {t})\Big [\mathcal {L}_{\mathcal {T_u}}\textbf{e}^{-\mathcal {Ch_{dr}}(\mathcal {t})}-\mathfrak {j}\mathcal {N_k}(\mathcal {t})-\mathcal {O}(\mathcal {t})\Big ] \end{aligned}$$
(4.8d)
$$\begin{aligned} \mathcal {C_T}(\mathcal {t}+1)&=(1-\mathfrak {e}-\mathcal {L}_{\mathcal {C_T}})\mathcal {C_T}(\mathcal {t})+\big [\mathfrak {n}_1\mathcal {N_k}(\mathcal {t})-\mathfrak {q}_2\mathcal {C_T}(\mathcal {t})+\mathfrak {n}_2\mathcal {C_{L}}(\mathcal {t})\big ]\mathcal {T_u}(\mathcal {t})\nonumber \\&\quad -\mathfrak {r}\mathcal {N_k}(\mathcal {t})\mathcal {C_T}^{\scriptscriptstyle {2}}(\mathcal {t})+\mathcal {L}_{\mathcal {C_T}}\textbf{e}^{-\mathcal {Ch_{dr}}(\mathcal {t})}\mathcal {C_T}(\mathcal {t})\nonumber \\&\quad +\mathcal {C_T}(\mathcal {t})\Big [\frac{\mathfrak {u}\mathcal {Im_{dr}}(\mathcal {t})}{\mathfrak {s}+\mathcal {Im_{dr}}(\mathcal {t})}+\frac{\mathfrak {p}_1\mathcal {O}^{\scriptscriptstyle {2}}(\mathcal {t})\mathcal {T_u}^{\scriptscriptstyle {2}}(\mathcal {t})}{\mathfrak {q}_1+\mathcal {O}^{\scriptscriptstyle {2}}(\mathcal {t})\mathcal {T_u}^{\scriptscriptstyle {2}}(\mathcal {t})}\Big ] \end{aligned}$$
(4.8e)
$$\begin{aligned} \mathcal {N_k}(\mathcal {t}+1)&=(1-\mathfrak {c}_2-\mathcal {L}_{\mathcal {N_k}})\mathcal {N_k}(\mathcal {t})+\frac{\mathfrak {l}\cdot \mathcal {T_u}^{\scriptscriptstyle {2}}(\mathcal {t})}{\mathfrak {m}+\mathcal {T_u}^{\scriptscriptstyle {2}}(\mathcal {t})}\mathcal {N_k}(\mathcal {t})\nonumber \\&\quad +\Big [\mathcal {L}_{\mathcal {N_k}}\textbf{e}^{-\mathcal {Ch_{dr}}(\mathcal {t})}-\mathfrak {p}_2\mathcal {T_u}(\mathcal {t})\Big ]\mathcal {N_k}(\mathcal {t})+\mathfrak {c}_1\mathcal {C_{L}}(\mathcal {t}) \end{aligned}$$
(4.8f)
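One step of the mixed model (4.8) can be sketched as follows. This is a reading of the model under two assumptions flagged above: the chemotherapy survival factor is taken as \(\textbf{e}^{-\mathcal {Ch_{dr}}}\) so that the kill term \(\mathcal {L}_{(\cdot )}(1-\textbf{e}^{-\mathcal {Ch_{dr}}})(\cdot )\) saturates, and every parameter value is a hypothetical placeholder:

```python
import numpy as np

# One hypothetical-parameter step of the mixed growth model (4.8a)-(4.8f)
par = dict(a=0.43, b=1.0e-9, c1=1.2e-4, c2=4.1e-2, e=2.0e-2,
           f=7.5e8, g=1.2e-2, h1=1.25, h2=2.0e-2, i=2.0, j=3.2e-8,
           l=2.5e-2, m=2.0e7, n1=1.1e-7, n2=6.5e-11, p1=0.125,
           p2=1.0e-7, q1=2.0e7, q2=3.4e-10, r=1.0e-14, s=1.0e4,
           u=0.3, ga=0.9, gb=0.9,
           L_Tu=0.9, L_Ct=0.6, L_Cl=0.6, L_Nk=0.6)

def model_step(Tu, Nk, Ct, Cl, Ch, Im, V_che, V_im, p=par):
    O = p['h1'] * (Ct / Tu)**p['i'] / (p['h2'] + (Ct / Tu)**p['i'])
    surv = np.exp(-Ch)                                  # chemo survival factor
    Im1 = (1 - np.exp(-p['gb'])) * Im + V_im            # (4.8a)
    Ch1 = (1 - np.exp(-p['ga'])) * Ch + V_che           # (4.8b)
    Cl1 = p['f'] + (1 - p['g'] - p['L_Cl']) * Cl + p['L_Cl'] * surv * Cl
    Tu1 = ((1 + p['a'] - p['L_Tu']) * Tu - p['a'] * p['b'] * Tu**2
           + Tu * (p['L_Tu'] * surv - p['j'] * Nk - O))                 # (4.8d)
    Ct1 = ((1 - p['e'] - p['L_Ct']) * Ct
           + (p['n1'] * Nk - p['q2'] * Ct + p['n2'] * Cl) * Tu
           - p['r'] * Nk * Ct**2 + p['L_Ct'] * surv * Ct
           + Ct * (p['u'] * Im / (p['s'] + Im)
                   + p['p1'] * O**2 * Tu**2 / (p['q1'] + O**2 * Tu**2)))
    Nk1 = ((1 - p['c2'] - p['L_Nk']) * Nk
           + p['l'] * Tu**2 / (p['m'] + Tu**2) * Nk
           + (p['L_Nk'] * surv - p['p2'] * Tu) * Nk + p['c1'] * Cl)     # (4.8f)
    return Tu1, Nk1, Ct1, Cl1, Ch1, Im1
```

A useful sanity check: with zero drug concentration the survival factor equals 1, so the chemotherapy terms cancel and each update reduces to the drug-free dynamics of Sects. 4.3.1 and 4.3.2.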

4.4 Iterative-Dual Heuristic Dynamic Programming Algorithm for Mixed Treatment

The optimal control problem has been transformed into solving the HJB equation (4.4). In this part, a constrained iterative dual-heuristic dynamic programming algorithm for the mixed treatment is given, derived from adaptive dynamic programming [26]. Three topics are presented: the working mechanism of the ADP algorithm, the structure of the constrained iterative dual-heuristic dynamic programming algorithm, and the proof of convergence of the I-DHP algorithm.

4.4.1 Working Mechanism of ADP Algorithm

Generally speaking, for unconstrained control problems the performance functional (4.3) is chosen in quadratic form. In this chapter, to respect the actual constraints, the problem is transformed into a bounded control problem and a non-quadratic functional is adopted as follows:

$$\begin{aligned} \textrm{Y}(\mathcal {t})= x^{\scriptscriptstyle {T}}(\mathcal {t})\textrm{P}x(\mathcal {t}) + 2\int _{0}^{u(\mathcal {t})} \text {tanh}^{\scriptscriptstyle {-T}}(\overline{\mathcal {U}}^{\scriptscriptstyle {-1}}s)\overline{\mathcal {U}}\textrm{Q}ds \end{aligned}$$

In a cyclic or infinite-horizon Markov decision process, reward is accrued again and again, so the value function can grow without bound; a discount factor is needed to keep it finite. By introducing a discount factor \(\lambda \), \(0<\lambda \le 1\), the infinite-dimensional problem is transformed into a finite-dimensional one.

$$\begin{aligned} \textrm{J}(\mathcal {t})&=\sum _{l=\mathcal {t}}^{\infty }\lambda ^{\scriptscriptstyle {l-\mathcal {t}}} \textrm{Y}(x(l),u(l)) =\textrm{Y}(\mathcal {t})\nonumber \\&\quad \quad +\lambda \sum _{l=\mathcal {t}+1}^{\infty }\lambda ^{\scriptscriptstyle {{l-(\mathcal {t}+1)}}} \textrm{Y}(x(l),u(l)) \end{aligned}$$
(4.9)
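A quick numerical check of why the discount factor is needed: with a constant stage cost, the discounted sum (4.9) converges to the geometric-series limit, whereas the undiscounted sum grows without bound (the values of \(\lambda \) and the stage cost below are hypothetical):

```python
# Discounted return with constant stage cost Y: sum over k of lam^k * Y
lam, Y = 0.95, 2.0
partial = sum(lam**k * Y for k in range(10_000))   # truncated discounted sum
closed_form = Y / (1 - lam)                        # geometric-series limit
undiscounted = 10_000 * Y                          # grows linearly with horizon
```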

According to the Bellman optimality principle, the optimal value function satisfies:

$$\begin{aligned} \textrm{J}^{\scriptscriptstyle {*}}\!(x(\mathcal {t}))&= \min _{u(\mathcal {t})} \Big \{ x^{\scriptscriptstyle {T}}(\mathcal {t})\textrm{P}x(\mathcal {t}) + 2\int _{0}^{u(\mathcal {t})} \text {tanh}^{\scriptscriptstyle {-T}}(\overline{\mathcal {U}}^{\scriptscriptstyle {-1}}s)\cdot \nonumber \\&\quad \quad \quad \quad \overline{\mathcal {U}}\textrm{Q}ds+\lambda \textrm{J}^{\scriptscriptstyle {*}}\!(x(\mathcal {t+1}))\Big \}. \end{aligned}$$
(4.10)

In the ADP algorithm structure, iteration proceeds in the manner of policy iteration, with \(\textrm{T}^{\iota }(x)\) as the approximate value function and \(\mathrm {\tau }^{\iota }(x)\) as the corresponding control law. The whole iterative process is as follows:

  1.

    Let the initial value function be \(\textrm{T}^{0}(\cdot )=0\) (generally far from optimal) and compute the control law at “\(\iota =0 \)” as follows.

    $$\begin{aligned} \mathrm {\tau }^{\scriptscriptstyle {0}}\!(x(\mathcal {t})) =&\, \underset{u(\mathcal {t})}{\text {arg min}} \Big \{x^{\scriptscriptstyle {T}}(\mathcal {t})\textrm{P}x(\mathcal {t}) + 2\int _{0}^{u(\mathcal {t})} \text {tanh}^{\scriptscriptstyle {-T}}\!(\overline{\mathcal {U}}^{\scriptscriptstyle {-1}}s) \nonumber \\&\quad \quad \cdot \overline{\mathcal {U}}\textrm{Q}ds +\lambda \textrm{T}^{\scriptscriptstyle {0}}\!(x(\mathcal {t+1})) \Big \} \end{aligned}$$
    (4.11)
  2.

    Get \(\textrm{T}^{\scriptscriptstyle {1}}\!(x(\mathcal {t}))\):

    $$\begin{aligned} \textrm{T}^{\scriptscriptstyle {1}}\!(x(\mathcal {t})) =\,&x^{\scriptscriptstyle {T}}(\mathcal {t})\textrm{P}x(\mathcal {t}) + 2\int _{0}^{\mathrm {\tau }^{\scriptscriptstyle {0}}\!(x(\mathcal {t}))}\! \text {tanh}^{\scriptscriptstyle {-T}}\!(\overline{\mathcal {U}}^{\scriptscriptstyle {-1}}s)\overline{\mathcal {U}}\nonumber \\ {}&\quad \cdot \textrm{Q}ds +\lambda \textrm{T}^{\scriptscriptstyle {0}}\!(x(\mathcal {t+1})). \end{aligned}$$
    (4.12)
  3.

    And for \(\iota =1,2,3,\cdots \)

    $$\begin{aligned} \mathrm {\tau }^{\scriptscriptstyle {\iota }}\!(x(\mathcal {t})) =\,&\underset{u(\mathcal {t})}{\text {arg min}} \Big \{x^{\scriptscriptstyle {T}}(\mathcal {t})\textrm{P}x(\mathcal {t}) + 2\int _{0}^{u(\mathcal {t})}\! \text {tanh}^{\scriptscriptstyle {-T}}\!(\overline{\mathcal {U}}^{\scriptscriptstyle {-1}}s) \nonumber \\&\quad \quad \cdot \overline{\mathcal {U}}\textrm{Q}ds +\lambda \textrm{T}^{\scriptscriptstyle {\iota }}\!(x(\mathcal {t+1})) \Big \}. \end{aligned}$$
    (4.13)
  4.

    The iterative value function is obtained as follows:

    $$\begin{aligned} \textrm{T}^{\scriptscriptstyle {\iota +1}}\!(x(\mathcal {t})) =\,&x^{\scriptscriptstyle {T}}(\mathcal {t})\textrm{P}x(\mathcal {t}) + 2\int _{0}^{\mathrm {\tau }^{\scriptscriptstyle {\iota }}\!(x(\mathcal {t}))} \text {tanh}^{\scriptscriptstyle {-T}}\!(\overline{\mathcal {U}}^{\scriptscriptstyle {-1}}s)\nonumber \\ \cdot \overline{\mathcal {U}}\textrm{Q}ds&+\lambda \textrm{T}^{\scriptscriptstyle {\iota }}\!(x(\mathcal {t+1})). \end{aligned}$$
    (4.14)
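The non-quadratic control penalty appearing in (4.11)-(4.14) admits a closed form in the scalar case: \(2\int _{0}^{u} \text {tanh}^{-1}(s/\overline{\mathcal {U}})\,\overline{\mathcal {U}}\textrm{Q}\,ds = 2\overline{\mathcal {U}}\textrm{Q}\big [u\,\text {tanh}^{-1}(u/\overline{\mathcal {U}}) + \tfrac{\overline{\mathcal {U}}}{2}\ln (1-(u/\overline{\mathcal {U}})^2)\big ]\). A sketch comparing this closed form with numerical quadrature, using hypothetical values of \(\overline{\mathcal {U}}\) and \(\textrm{Q}\):

```python
import numpy as np

# Scalar bounded-control penalty H(u) = 2 * int_0^u arctanh(s/Ubar) * Ubar * Q ds
Ubar, Q = 1.0, 0.5

def H_closed(u):
    z = u / Ubar
    return 2 * Ubar * Q * (u * np.arctanh(z) + 0.5 * Ubar * np.log(1 - z**2))

def H_numeric(u, n=200_001):
    s = np.linspace(0.0, u, n)                  # quadrature nodes on [0, u]
    vals = 2 * np.arctanh(s / Ubar) * Ubar * Q  # integrand
    return float(np.sum((vals[:-1] + vals[1:]) * np.diff(s) / 2))  # trapezoid
```

The penalty grows without bound as \(|u|\rightarrow \overline{\mathcal {U}}\), which is precisely what keeps the minimizing control inside the actuator bound.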

4.4.2 Structure of Constrained Iterative Dual-Heuristic Dynamic Programming Algorithm

In dual-heuristic dynamic programming, the value function is assumed to be smooth. Modeled on (4.5), taking the partial derivative of the right-hand side of (4.14) with respect to \(\mathrm {\tau }^{\scriptscriptstyle {\iota }}\!(x(\mathcal {t}))\) and setting it to zero gives [37]:

$$\begin{aligned} \frac{\partial {\textrm{T}^{\scriptscriptstyle {\iota +1}}\!(x(\mathcal {t})) }}{\partial {u(\mathcal {t})}} \!&=\!\frac{\partial {\Big \{x^{\scriptscriptstyle {T}}(\mathcal {t})\textrm{P}x(\mathcal {t}) + 2\int _{0}^{u(\mathcal {t})} \text {tanh}^{\scriptscriptstyle {-T}}\!(\overline{\mathcal {U}}^{\scriptscriptstyle {-1}}s)\overline{\mathcal {U}}\textrm{Q}ds\Big \}}}{\partial {u(\mathcal {t})}}\nonumber \\&+\lambda \frac{\partial {\textrm{T}^{\scriptscriptstyle {\iota }}\!(x(\mathcal {t+1}))}}{\partial {u(\mathcal {t})}}= 0. \end{aligned}$$

And, for \(\iota =0,1,2,\cdots \)

$$\begin{aligned} \mathrm {\tau }^{\scriptscriptstyle {\iota }}\!(x(\mathcal {t})) =\overline{\mathcal {U}}\text {tanh}\Big (\frac{-\lambda }{2\overline{\mathcal {U}}\textrm{Q}}\ \Big [\frac{\partial {x(\mathcal {t+1})}}{\partial {u(\mathcal {t})}}\Big ]^{\scriptscriptstyle {T}}\frac{\partial {\textrm{T}^{\scriptscriptstyle {\iota }}\!(x(\mathcal {t}+1)) }}{\partial {x(\mathcal {t}+1)}}\Big ) \end{aligned}$$
(4.15)

Doing the same to (4.14) with respect to \(x(\mathcal {t})\),

$$\begin{aligned} \frac{\partial {\textrm{T}^{\scriptscriptstyle {\iota +1}}\!(x(\mathcal {t})) }}{\partial {x(\mathcal {t})}}&= 2\textrm{P}x(\mathcal {t})+\lambda \Big [\frac{\partial {x(\mathcal {t+1})}}{\partial {x(\mathcal {t})}}\Big ]^{\scriptscriptstyle {T}}\frac{\partial {\textrm{T}^{\scriptscriptstyle {\iota }}\!(x(\mathcal {t}+1)) }}{\partial {x(\mathcal {t}+1)}}. \end{aligned}$$
(4.16)

As can be seen from (4.15) and (4.16), both contain \(\displaystyle {\frac{\partial {\textrm{T}^{\scriptscriptstyle {\iota }}\!(x(\mathcal {t}+1)) }}{\partial {x(\mathcal {t}+1)}}}\); compared with \(\textrm{T}^{\scriptscriptstyle {\iota }}(x(\mathcal {t}))\) in (4.14), the DHP algorithm evaluates and updates the first partial derivative of the value function rather than the value function itself.

The algorithm iterates on the costate function \(\textrm{C}^{\boldsymbol{\iota }}(x(\mathcal {t}))=\partial {\textrm{T}^{\boldsymbol{\iota }}}\!(x(\mathcal {t}))/\partial {x(\mathcal {t})}\), alternating the control-law update (4.15) with the costate update (4.16).
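A minimal numerical sketch of this costate iteration, for a hypothetical scalar linear plant \(x(\mathcal {t}+1)=Ax+Bu\) with \(|u|\le \overline{\mathcal {U}}\), where the costate is approximated linearly as \(\textrm{C}^{\iota }(x)\approx c_{\iota }x\) (a reasonable small-signal approximation, since tanh is effectively linear there); the implicit relation (4.15) is solved by damped fixed-point iteration at a probe state:

```python
import numpy as np

# I-DHP recursions (4.15)-(4.16), scalar linear plant, linear costate model
A, B, P, Q, lam, Ubar = 0.9, 1.0, 1.0, 1.0, 0.95, 1.0
x = 0.1                      # probe state (small-signal regime)
c = 0.0                      # costate slope, C^0 = 0
u = 0.0

for _ in range(500):
    # (4.15): u = Ubar * tanh(-(lam / (2 Ubar Q)) * B * C(x(t+1))),
    # implicit in u because x(t+1) depends on u; damped fixed-point solve.
    u = 0.0
    for _ in range(300):
        x_next = A * x + B * u
        u = 0.5 * u + 0.5 * Ubar * np.tanh(
            -(lam / (2 * Ubar * Q)) * B * c * x_next)
    x_next = A * x + B * u
    # (4.16): C^{i+1}(x) = 2 P x + lam * (dx(t+1)/dx)^T * C^i(x(t+1))
    c_new = (2 * P * x + lam * A * c * x_next) / x
    if abs(c_new - c) < 1e-12:
        break
    c = c_new
```

In this setting the costate slope converges to a fixed point (roughly \(c\approx 2.9\) for the values above), and the computed control always respects the bound because it passes through the tanh saturation.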

4.4.3 Proof of Convergence on I-DHP Algorithm

The convergence proof shows that, as the number of iterations increases, the alternating evaluation and update between (4.15) and (4.16) eventually satisfy the termination condition and yield the optimal solution.


The necessary lemmas are given before the formal theorem. For brevity, abbreviate “\(2\int _{0}^{u(\mathcal {t})} \text {tanh}^{\scriptscriptstyle {-T}}(\overline{\mathcal {U}}^{\scriptscriptstyle {-1}}s)\overline{\mathcal {U}}\textrm{Q}ds\)” as “\(\mathrm {H(u(\mathcal {t}))}\)”.

Lemma 4.1

Assume that \(\mathrm {\tau }^{\scriptscriptstyle {\iota }}(\mathcal {t})\) is the control sequence calculated by (4.13) and \(\textrm{T}^{\scriptscriptstyle {\iota }}(x)\) the value function calculated by (4.14). Let \(\mathrm {\omega }^{\scriptscriptstyle {\iota }}(\mathcal {t})\) be any admissible control sequence in the domain, and \(\mathrm {\Omega }^{\scriptscriptstyle {\iota }}(x)\) its corresponding value function,

$$\begin{aligned} \mathrm {\Omega }^{\scriptscriptstyle {\iota +1}}\!(x(\mathcal {t})) \!=\! x^{\scriptscriptstyle {T}}(\mathcal {t})\textrm{P}x(\mathcal {t}) + \mathrm {H(\mathrm {\omega }^{\scriptscriptstyle {\iota }}(\mathcal {t}))} +\lambda \mathrm {\Omega }^{\scriptscriptstyle {\iota }}\!(x(\mathcal {t+1})) \end{aligned}$$
(4.17)

and it is easy to obtain:

If \(\mathrm {\Omega }^{\scriptscriptstyle {0}}\!(\cdot )=\textrm{T}^{\scriptscriptstyle {0}}\!(\cdot )=0\), then \(0 \le \textrm{T}^{\scriptscriptstyle {\iota }}(x) \le \mathrm {\Omega }^{\scriptscriptstyle {\iota }}(x)\), \(\forall \iota \).

Proof

The conclusion is immediate: \(\textrm{T}^{\scriptscriptstyle {\iota }}(x)\) is the minimum attainable value of the right-hand side of (4.14), with \(\mathrm {\tau }^{\scriptscriptstyle {\iota }}(\mathcal {t})\) the corresponding control sequence, while \(\mathrm {\Omega }^{\scriptscriptstyle {\iota }}(x)\) corresponds to an arbitrary admissible control sequence, so it can be no less than \(\textrm{T}^{\scriptscriptstyle {\iota }}(x)\).\(\blacksquare \)

Lemma 4.2

Given \(\textrm{T}^{\scriptscriptstyle {\iota }}(x)\) defined by (4.14), if the system is stabilizable, then \(\textrm{T}^{\scriptscriptstyle {\iota }}(x)\) has an upper bound \(\mathfrak {Z}\) (a constant):

$$\begin{aligned} 0 \le \textrm{T}^{\scriptscriptstyle {\iota }}(x) \le \mathfrak {Z}, \forall \iota \end{aligned}$$

Proof

Let \(\mathcal {v}^{\scriptscriptstyle {\iota }}(\mathcal {t})\) be an admissible, stabilizing control sequence and define \(\mathcal {V}^{\scriptscriptstyle {\iota }}(x)\) by:

$$\begin{aligned} \mathcal {V}^{\scriptscriptstyle {\iota +1}}(x)=\!x^{\scriptscriptstyle {T}}(\mathcal {t})\textrm{P}x(\mathcal {t}) + \mathrm {H(\mathcal {v}^{\scriptscriptstyle {\iota }}(\mathcal {t}))} +\lambda \mathcal {V}^{\scriptscriptstyle {\iota }}\!(x(\mathcal {t+1})) \end{aligned}$$

Then, with \(\mathcal {V}^{\scriptscriptstyle {0}}\!(\cdot )= \textrm{T}^{\scriptscriptstyle {0}}(\cdot ) =0\), it can be obtained that

$$\begin{aligned} \mathcal {V}^{\scriptscriptstyle {\iota +1}}(x)&=\!x^{\scriptscriptstyle {T}}(\mathcal {t})\textrm{P}x(\mathcal {t}) + \mathrm {H(\mathcal {v}^{\scriptscriptstyle {\iota }}(\mathcal {t}))} +\lambda \mathcal {V}^{\scriptscriptstyle {\iota }}\!(x(\mathcal {t+1}))\\&=\!x^{\scriptscriptstyle {T}}(\mathcal {t})\textrm{P}x(\mathcal {t}) + \mathrm {H(\mathcal {v}^{\scriptscriptstyle {\iota }}(\mathcal {t}))} +\lambda \Big [ x^{\scriptscriptstyle {T}}(\mathcal {t}+1)\textrm{P} \\&\!\cdot x(\mathcal {t}+1)\!+\!\mathrm {H(\mathcal {v}^{\scriptscriptstyle {\iota -1}}(\mathcal {t}+1))}\Big ]\!+\!\lambda ^{\scriptscriptstyle {2}} \mathcal {V}^{\scriptscriptstyle {\iota -1}}\!(x(\mathcal {t+2}))\\&=\ldots \\&=\!x^{\scriptscriptstyle {T}}(\mathcal {t})\textrm{P}x(\mathcal {t}) + \mathrm {H(\mathcal {v}^{\scriptscriptstyle {\iota }}(\mathcal {t}))}+\lambda \Big [ x^{\scriptscriptstyle {T}}(\mathcal {t}+1)\textrm{P} \\&\!\cdot x(\mathcal {t}+1)\!+\!\mathrm {H(\mathcal {v}^{\scriptscriptstyle {\iota -1}}(\mathcal {t}+1))}\Big ]\!+\cdots \\&+\!\lambda ^{\scriptscriptstyle {\iota }}\Big [x^{\scriptscriptstyle {T}}(\mathcal {t}+\iota )\textrm{P}x(\mathcal {t}+\iota ) + \mathrm {H(\mathcal {v}^{\scriptscriptstyle {0}}(\mathcal {t}+\iota ))}\Big ]\\&+\lambda ^{\scriptscriptstyle {\iota +1}} \mathcal {V}^{\scriptscriptstyle {0}}\!(x(\mathcal {t+\iota +1})) .\\ \end{aligned}$$

\(\mathcal {V}^{\scriptscriptstyle {\iota +1}}(x) = \sum _{l=0}^{\iota }\lambda ^{\scriptscriptstyle {l}} \Big [\!x^{\scriptscriptstyle {T}}(\mathcal {t}+l)\textrm{P}x(\mathcal {t}+l) + \textrm{H}(\mathcal {v}^{\scriptscriptstyle {\iota }-l}(\mathcal {t}+l)) \Big ]\le \lim _{\iota \rightarrow \infty } \Big \{ \sum _{l=0}^{\iota }\lambda ^{\scriptscriptstyle {l}} \Big [ \!x^{\scriptscriptstyle {T}}(\mathcal {t}+l)\textrm{P}x(\mathcal {t}+l) + \textrm{H}(\mathcal {v}^{\scriptscriptstyle {\iota }-l}(\mathcal {t}+l)) \Big ] \Big \} \).

Since \(\mathcal {v}^{\scriptscriptstyle {\iota }}(\mathcal {t})\) is an admissible control sequence, this sum has an upper bound \(\mathfrak {Z}\):

$$\begin{aligned} \mathcal {V}^{\scriptscriptstyle {\iota +1}}(x) \! \le \!\lim _{\iota \rightarrow \infty }\! \Big \{ \sum _{l=0}^{\iota }\lambda ^{\scriptscriptstyle {l}} \Big [ \!x^{\scriptscriptstyle {T}}(\mathcal {t}+l)\textrm{P}x(\mathcal {t}+l)\!+\! \textrm{H}(\mathcal {v}^{\scriptscriptstyle {\iota -l}}(\mathcal {t}+l))\!\Big ] \Big \}\!\le \!\mathfrak {Z}. \end{aligned}$$

Combining this with Lemma 4.1 yields the result.\(\blacksquare \)

Theorem 4.1

For the iterative cost function \(\textrm{T}^{\scriptscriptstyle {\iota }}(x)\) which follows (4.14) and its corresponding control law \(\mathrm {\tau }^{\scriptscriptstyle {\iota }}(\mathcal {t})\) obtained by (4.13), it can be concluded that, as the number of iterations increases, \(\textrm{T}^{\scriptscriptstyle {\iota }}(x)\) converges to the optimal value function and \(\mathrm {\tau }^{\scriptscriptstyle {\iota }}(\mathcal {t})\) converges to the optimal control law, i.e., \(\textrm{T}^{\scriptscriptstyle {\iota }}(x)\rightarrow \textrm{J}^*(x)\), \(\mathrm {\tau }^{\scriptscriptstyle {\iota }}(\mathcal {t}) \rightarrow {u}^*(\mathcal {t})\).

Proof

From Lemma 4.1, \(\mathrm {\Omega }^{\scriptscriptstyle {\iota }}\!(x(\mathcal {t}))\) is the cost function corresponding to an arbitrary admissible control sequence \(\mathrm {\omega }^{\scriptscriptstyle {\iota }}(\mathcal {t})\), with \(\Omega ^0(\cdot ) = 0\).

First, for \(\iota =0\),

$$\begin{aligned} \textrm{T}^{\scriptscriptstyle {1}}\!(x(\mathcal {t}))-\mathrm {\Omega }^{\scriptscriptstyle {0}}\!(x(\mathcal {t}))=x^{\scriptscriptstyle {T}}(\mathcal {t})\textrm{P}x(\mathcal {t}) + \mathrm {H(\mathrm {\omega }^{\scriptscriptstyle {0}}(\mathcal {t}))} \ge 0 \end{aligned}$$

then \(\textrm{T}^{\scriptscriptstyle {1}}\!(x(\mathcal {t}))\ge \mathrm {\Omega }^{\scriptscriptstyle {0}}\!(x(\mathcal {t}))\) holds for \(\iota =0\).

Next, suppose that \(\textrm{T}^{\scriptscriptstyle {\iota }}\!(x(\mathcal {t}))\ge \Omega ^{\scriptscriptstyle {\iota -1}}(x(\mathcal {t}))\) holds for all \(x(\mathcal {t})\). Then, at step \(\iota +1\), it can be concluded that

$$\begin{aligned} \textrm{T}^{\scriptscriptstyle {\iota +1}}\!(x(\mathcal {t}))-&\mathrm {\Omega }^{\scriptscriptstyle {\iota }}\!(x(\mathcal {t}))=x^{\scriptscriptstyle {T}}(\mathcal {t})\textrm{P}x(\mathcal {t}) + \mathrm {H(\mathrm {\omega }^{\scriptscriptstyle {\iota +1}}(\mathcal {t}))} \nonumber \\&+\lambda \big (\textrm{T}^{\scriptscriptstyle {\iota }}\!(x(\mathcal {t}+1))-\Omega ^{\scriptscriptstyle {\iota -1}}(x(\mathcal {t}+1))\big )\nonumber \\&\ge \lambda \big (\textrm{T}^{\scriptscriptstyle {\iota }}\!(x(\mathcal {t}+1))-\Omega ^{\scriptscriptstyle {\iota -1}}(x(\mathcal {t}+1))\big ) \end{aligned}$$
(4.18)

By mathematical induction, \(\textrm{T}^{\scriptscriptstyle {\iota +1}}\!(x(\mathcal {t}))\ge \mathrm {\Omega }^{\scriptscriptstyle {\iota }}\!(x(\mathcal {t}))\) holds for all \(\iota \). Combined with Lemma 4.1, it follows that \(\textrm{T}^{\scriptscriptstyle {\iota +1}}\!(x(\mathcal {t})) \ge \mathrm {\Omega }^{\scriptscriptstyle {\iota }}\!(x(\mathcal {t}))\ge \textrm{T}^{\scriptscriptstyle {\iota }}\!(x(\mathcal {t}))\); that is, \(\Big \{\textrm{T}^{\scriptscriptstyle {\iota }}\!(x(\mathcal {t}))\Big \} \) is a non-decreasing sequence for all \(\iota \).

From Lemma 4.2, the sequence \(\Big \{\textrm{T}^{\scriptscriptstyle {\iota }}\!(x(\mathcal {t}))\Big \}\) is bounded by \(\mathfrak {Z}\), which means the iteration has a limit, expressed as \(\lim _{\iota \rightarrow \infty }\textrm{T}^{\scriptscriptstyle {\iota }}\!(x(\mathcal {t}))=\textrm{T}^{\scriptscriptstyle {\infty }}\!(x(\mathcal {t}))\). Therefore, it is reasonable to conjecture that \( \textrm{T}^{\scriptscriptstyle {\infty }}\!(x(\mathcal {t}))=\underset{\mathrm {\tau }(\mathcal {t})}{\text {min}} \Big \{x^{\scriptscriptstyle {T}}(\mathcal {t})\textrm{P}x(\mathcal {t}) + \mathrm {H(\mathrm {\tau }(\mathcal {t}))}+\lambda \textrm{T}^{\scriptscriptstyle {\infty }}\!(x(\mathcal {t}+1))\Big \} \). This conjecture will be proved below. According to (4.14),

$$\begin{aligned} \textrm{T}^{\scriptscriptstyle {\iota }}\!(x(\mathcal {t})) \le x^{\scriptscriptstyle {T}}(\mathcal {t})\textrm{P}x(\mathcal {t})+ \mathrm {H(\mathrm {\tau }(\mathcal {t}))}+ \lambda \textrm{T}^{\scriptscriptstyle {\iota -1}}\!(x(\mathcal {t}+1)). \end{aligned}$$
(4.19)

From the non-decreasing property of sequence \(\Big \{\textrm{T}^{\scriptscriptstyle {\iota }}\!(x(\mathcal {t}))\Big \} \), it can be known that \(\textrm{T}^{\scriptscriptstyle {\iota }}\!(x(\mathcal {t})) \le \textrm{T}^{\scriptscriptstyle {\infty }}\!(x(\mathcal {t})) \quad \forall \iota \).

Substitute it into (4.19),

$$\begin{aligned} \textrm{T}^{\scriptscriptstyle {\iota }}\!(x(\mathcal {t})) \le x^{\scriptscriptstyle {T}}(\mathcal {t})\textrm{P}x(\mathcal {t})+ \mathrm {H(\mathrm {\tau }(\mathcal {t}))}+ \lambda \textrm{T}^{\scriptscriptstyle {\infty }}\!(x(\mathcal {t}+1)), \quad \forall \iota . \end{aligned}$$
(4.20)

Since (4.20) holds for any \(\iota \), it also holds as \(\iota \rightarrow \infty \):

$$\begin{aligned} \textrm{T}^{\scriptscriptstyle {\infty }}\!(x(\mathcal {t})) \le x^{\scriptscriptstyle {T}}(\mathcal {t})\textrm{P}x(\mathcal {t})+ \mathrm {H(\mathrm {\tau }(\mathcal {t}))}+ \lambda \textrm{T}^{\scriptscriptstyle {\infty }}\!(x(\mathcal {t}+1)). \end{aligned}$$
(4.21)

Since \(\mathrm {\tau }(\mathcal {t})\) is an arbitrary control sequence, (4.21) further yields:

$$\begin{aligned} \textrm{T}^{\scriptscriptstyle {\infty }}\!(x(\mathcal {t})) \le \underset{\mathrm {\tau }(\mathcal {t})}{\text {min}}\Big \{x^{\scriptscriptstyle {T}}(\mathcal {t})\textrm{P}x(\mathcal {t})+ \mathrm {H(\mathrm {\tau }(\mathcal {t}))}+ \lambda \textrm{T}^{\scriptscriptstyle {\infty }}\!(x(\mathcal {t}+1))\Big \}. \end{aligned}$$
(4.22)

With (4.14), \(\textrm{T}^{\scriptscriptstyle {\iota }}\!(x(\mathcal {t}))\!=\!\underset{\mathrm {\tau }(\mathcal {t})}{\text {min}}\Big \{x^{\scriptscriptstyle {T}}(\mathcal {t})\textrm{P}x(\mathcal {t})\!+\! \mathrm {H(\mathrm {\tau }(\mathcal {t}))}\!+\!\lambda \textrm{T}^{\scriptscriptstyle {\iota -1}}\!(x(\mathcal {t}+1))\Big \}\), \(\forall \iota \).

Since the sequence \(\Big \{\textrm{T}^{\scriptscriptstyle {\iota }}\!(x(\mathcal {t}))\Big \} \) on the left-hand side is non-decreasing, it follows that \(\textrm{T}^{\scriptscriptstyle {\infty }}\!(x(\mathcal {t}))\!\ge \!\underset{\mathrm {\tau }(\mathcal {t})}{\text {min}}\Big \{x^{\scriptscriptstyle {T}}(\mathcal {t})\textrm{P}x(\mathcal {t})\!+\! \mathrm {H(\mathrm {\tau }(\mathcal {t}))}\!+\!\lambda \textrm{T}^{\scriptscriptstyle {\iota -1}}\!(x(\mathcal {t}+1))\Big \}\). Similarly, letting \(\iota \rightarrow \infty \),

$$\begin{aligned} \textrm{T}^{\scriptscriptstyle {\infty }}\!(x(\mathcal {t}))\!\ge \!\underset{\mathrm {\tau }(\mathcal {t})}{\text {min}}\Big \{x^{\scriptscriptstyle {T}}(\mathcal {t})\textrm{P}x(\mathcal {t})+ \mathrm {H(\mathrm {\tau }(\mathcal {t}))}+ \lambda \textrm{T}^{\scriptscriptstyle {\infty }}\!(x(\mathcal {t}+1))\Big \}. \end{aligned}$$
(4.23)

Combining (4.22) and (4.23), it follows that,

$$\begin{aligned} \textrm{T}^{\scriptscriptstyle {\infty }}\!(x(\mathcal {t}))\!=\!\underset{\mathrm {\tau }(\mathcal {t})}{\text {min}}\Big \{x^{\scriptscriptstyle {T}}(\mathcal {t})\textrm{P}x(\mathcal {t})+ \mathrm {H(\mathrm {\tau }(\mathcal {t}))}+ \lambda \textrm{T}^{\scriptscriptstyle {\infty }}\!(x(\mathcal {t}+1))\Big \}. \end{aligned}$$
(4.24)

As can be seen from (4.24), the previous conjecture is proved. Hence \(\textrm{T}^{\scriptscriptstyle {\infty }}\!(x(\mathcal {t}))\) is a solution of the discrete-time HJB equation. Considering the uniqueness of the solution of the discrete-time HJB equation, \(\textrm{T}^{\scriptscriptstyle {\infty }}\!(x(\mathcal {t}))\) in (4.24) and \(\textrm{J}^{\scriptscriptstyle {*}}\!(x(\mathcal {t}))\) in (4.10) are the same solution. In other words, \(\lim _{\iota \rightarrow \infty }\textrm{T}^{\scriptscriptstyle {\iota }}\!(x(\mathcal {t}))=\textrm{T}^{\scriptscriptstyle {*}}\!(x(\mathcal {t}))=\textrm{J}^{\scriptscriptstyle {*}}\!(x(\mathcal {t}))\).\(\blacksquare \)

Theorem 4.1 proves that \(\textrm{T}^{\scriptscriptstyle {\infty }}\!(x(\mathcal {t}))\) in (4.24) and \(\textrm{J}^{\scriptscriptstyle {*}}\!(x(\mathcal {t}))\) in (4.10) are the same solution of the HJB equation corresponding to the same cost function, while the termination criterion “\(\left\| \textrm{T}^{\iota +1}(x(\mathcal {t}))-\textrm{T}^{\iota }(x(\mathcal {t})) \right\| \le \epsilon \)” indicates that the optimal control law can be solved in finite time. Theorem 4.2 explains this point.

Theorem 4.2

Suppose the system (4.1) is controllable and the initial state \(x(\mathcal {t})\) can be chosen arbitrarily. Under a finite iteration index \(\iota \), the approximation criterion \(\Vert \textrm{T}^{*}(x(\mathcal {t}))-\textrm{T}^{\iota }(x(\mathcal {t})) \Vert \le \epsilon \) between the iterative cost function and the optimal cost function is equivalent to the termination criterion \(\Vert \textrm{T}^{\iota +1}(x(\mathcal {t}))-\textrm{T}^{\iota }(x(\mathcal {t})) \Vert \le \epsilon \).

Proof

In Theorem 4.1, it is mentioned that \(\Big \{\textrm{T}^{\scriptscriptstyle {\iota }}\!(x(\mathcal {t}))\Big \} \) is a non-decreasing sequence, that is

$$\begin{aligned} \textrm{J}^{\scriptscriptstyle {*}}\!(x(\mathcal {t}))=\textrm{T}^{\scriptscriptstyle {*}}\!(x(\mathcal {t})) \ge \textrm{T}^{\scriptscriptstyle {\iota +1}}\!(x(\mathcal {t})) \ge \textrm{T}^{\scriptscriptstyle {\iota }}\!(x(\mathcal {t})). \end{aligned}$$
(4.25)

If \(\left\| \textrm{T}^{*}(x(\mathcal {t}))-\textrm{T}^{\iota }(x(\mathcal {t})) \right\| \le \epsilon \), it can be concluded that

$$\begin{aligned} \textrm{T}^{*}(x(\mathcal {t}))-\textrm{T}^{\iota }(x(\mathcal {t}))\le \epsilon ,\ \textrm{T}^{*}(x(\mathcal {t}))\le \textrm{T}^{\iota }(x(\mathcal {t}))+\epsilon . \end{aligned}$$
(4.26)

Combining (4.26) with (4.25),

$$\begin{aligned} \textrm{T}^{\scriptscriptstyle {\iota }}\!(x(\mathcal {t}))\le \textrm{T}^{\scriptscriptstyle {\iota +1}}\!(x(\mathcal {t}))\le \textrm{T}^{\scriptscriptstyle {*}}\!(x(\mathcal {t}))\le \textrm{T}^{\iota }(x(\mathcal {t}))+\epsilon . \end{aligned}$$
$$\begin{aligned} \Rightarrow \textrm{T}^{\scriptscriptstyle {\iota }}\!(x(\mathcal {t}))\le \textrm{T}^{\scriptscriptstyle {\iota +1}}\!(x(\mathcal {t}))\le \textrm{T}^{\iota }(x(\mathcal {t}))+\epsilon . \end{aligned}$$
(4.27)

It follows that

$$\begin{aligned} \left\| \textrm{T}^{\iota +1}(x(\mathcal {t}))-\textrm{T}^{\iota }(x(\mathcal {t})) \right\| \le \epsilon \end{aligned}$$
(4.28)

From a different perspective, if (4.28) holds and the sequence \(\Big \{\textrm{T}^{\scriptscriptstyle {\iota }}\!(x(\mathcal {t}))\Big \}\) is non-decreasing,

$$\begin{aligned} -\epsilon +\textrm{T}^{\iota +1}(x(\mathcal {t})) \le \textrm{T}^{\iota }(x(\mathcal {t}))\le \textrm{T}^{*}(x(\mathcal {t}))=\textrm{J}^{*}(x(\mathcal {t})). \end{aligned}$$
(4.29)

It is obvious that \(\textrm{T}^{\iota +1}(x(\mathcal {t}))-\textrm{T}^{*}(x(\mathcal {t}))\le \epsilon \), i.e.,

$$\begin{aligned} \left\| \textrm{T}^{\iota +1}(x(\mathcal {t}))-\textrm{T}^{*}(x(\mathcal {t}))\right\| \le \epsilon . \end{aligned}$$
(4.30)

Based on the analysis of both sides, it can be concluded that \(\Vert \textrm{T}^{*}\!(x(\mathcal {t}))\!-\!\textrm{T}^{\iota }\!(x(\mathcal {t})) \Vert \le \epsilon \Leftrightarrow \left\| \textrm{T}^{\iota +1}\!(x(\mathcal {t})) -\textrm{T}^{\iota }\!(x(\mathcal {t})) \right\| \le \epsilon \).\(\blacksquare \)
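Theorems 4.1 and 4.2 can be illustrated on a simple scalar example. The following sketch runs the value iteration on a scalar linear system with quadratic costs, where every iterate has the form \(T^{\iota}(x)=k_{\iota}x^{2}\); the dynamics coefficients `a`, `b`, the weights `p`, `h`, and the discount `lam` are illustrative values, not parameters of the chapter's tumor model.

```python
# Scalar value-iteration sketch of Theorems 4.1 and 4.2: starting from
# T^0 = 0, the iterates T^i(x) = k_i x^2 form a non-decreasing, bounded
# sequence, and the termination test |k_{i+1} - k_i| <= eps fires after
# finitely many iterations. All numeric values are illustrative.
a, b = 1.2, 1.0          # open-loop unstable scalar dynamics x(t+1) = a x + b u
p, h, lam = 1.0, 1.0, 0.9  # state weight, control weight, discount factor
eps = 1e-6               # termination threshold

k = 0.0                  # T^0(x) = 0
history = [k]
while True:
    # One Bellman update: min over u of p x^2 + h u^2 + lam k (a x + b u)^2
    # stays quadratic in x, with the new coefficient:
    k_next = p + lam * k * a * a * h / (h + lam * k * b * b)
    history.append(k_next)
    if abs(k_next - k) <= eps:   # termination criterion of Theorem 4.2
        break
    k = k_next
```

The `history` list grows monotonically toward the fixed point of the discounted Riccati recursion, mirroring the non-decreasing bounded sequence \(\{\textrm{T}^{\iota}(x(\mathcal{t}))\}\) of the proofs.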

The two theorems deal with the value function \(\textrm{T}(x(\mathcal {t}))\), while Algorithm 1 deals with the costate function \(\boldsymbol{\textrm{C}}(x(\mathcal {t}))\). Theorem 4.3 shows that the two notions of convergence are equivalent.

Theorem 4.3

The sequence of value functions is defined by (4.14), the control law sequence by (4.13), and the costate function update sequence by (4.16). The optimal costate function is defined as the limit \(\textrm{C}^{*}(x(\mathcal {t}))=\lim _{\iota \rightarrow \infty } \!\textrm{C}^{\iota }(x(\mathcal {t}))\); as the value function approaches the optimal value, the sequence of costate functions converges together with the sequence of control laws.

Proof

In Theorems 4.1 and 4.2, it is shown that \(\textrm{T}^{*}(x(\mathcal {t}))\) and \(\textrm{T}^{\infty }(x(\mathcal {t}))\) both satisfy the corresponding HJB equation, i.e., \(\textrm{T}^{\scriptscriptstyle {\infty }}\!(x(\mathcal {t}))\!=\!\textrm{T}^{\scriptscriptstyle {*}}\!(x(\mathcal {t}))\!=\!\underset{\mathrm {\tau }(\mathcal {t})}{\text {min}}\Big \{x^{\scriptscriptstyle {T}}(\mathcal {t})\textrm{P}x(\mathcal {t})+ \mathrm {H(\mathrm {\tau }(\mathcal {t}))}+ \lambda \textrm{T}^{\scriptscriptstyle {*}}\!(x(\mathcal {t}+1))\Big \}.\)

Therefore, it can be concluded that the sequence \(\Big \{\textrm{T}^{\scriptscriptstyle {\iota }}\!(x(\mathcal {t}))\Big \}\) of value functions converges to the optimal value function of the discrete-time HJB equation, i.e., \(\textrm{T}^{\scriptscriptstyle {\iota }} \rightarrow \textrm{T}^{\scriptscriptstyle {*}} \) as \(\iota \rightarrow \infty \).

Given \(\textrm{C}^{\boldsymbol{\iota }}(x(\mathcal {t}))=\partial {\textrm{T}^{\boldsymbol{\iota }}}\!(x(\mathcal {t}))/\partial {x(\mathcal {t})}\), the corresponding sequence \(\Big \{\textrm{C}^{\scriptscriptstyle {\iota }}\!(x(\mathcal {t}))\Big \}\) of costate functions also converges, \(\textrm{C}^{\scriptscriptstyle {\iota }}\! \rightarrow \textrm{C}^{\scriptscriptstyle {*}} \) as \(\iota \rightarrow \infty \). Since the control law is determined by the costate function, the convergence of the costate sequence implies that the control law sequence converges to the optimal control law, \(\mathrm {\tau }^{\scriptscriptstyle {\iota }}\! \rightarrow \mathrm {\tau }^{\scriptscriptstyle {*}} \) as \(\iota \rightarrow \infty \).\(\blacksquare \)
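The costate relation above can be checked directly for a quadratic value function \(T(x)=kx^{2}\), whose costate is \(C(x)=\partial T/\partial x=2kx\); the coefficient `k = 1.7` below is an arbitrary illustrative value, not a quantity from the chapter.

```python
# Sketch of the costate relation C = dT/dx used in Theorem 4.3, for a
# quadratic value function T(x) = k x^2. Convergence of the coefficient k
# (Theorems 4.1-4.2) therefore carries over to the costate C(x) = 2 k x.
def T(k, x):
    """Iterative value function, quadratic case."""
    return k * x * x

def C(k, x):
    """Analytic costate: gradient of T with respect to x."""
    return 2.0 * k * x

# Central finite difference confirming that C is the gradient of T.
k, x, dx = 1.7, 0.5, 1e-6
fd = (T(k, x + dx) - T(k, x - dx)) / (2.0 * dx)
```

The finite-difference value `fd` agrees with `C(k, x)` up to rounding, which is the pointwise content of the costate definition.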

Table 4.1 Estimated parameter values

4.5 Multi-factor Mixed Optimization Experiment Treatment of Tumor Cells

This section explores a novel therapeutic intervention for tumor cell growth inhibition. A discrete-time affine control system has been constructed from the multi-factor tumor cell growth model, and the iterative DHP algorithm has been applied to realize the reduction of drug dosage under the condition of greatly inhibiting the proliferation of tumor cell population.

4.5.1 Discrete Affine Model of Tumor Cell Growth

According to clinical medical statistics and literature [2, 30, 31, 38,39,40,41], the values of each parameter in the tumor cell proliferation model affected by multiple factors are shown in Table 4.1.

Using these parameters, the behavior of the tumor cell proliferation model can be observed under several scenarios.

With reference to [1], the initial conditions “\(\mathcal {T_u}(0)=2\times 10^{7}\), \(\mathcal {N_k}(0)=1\times 10^{3}\), \(\mathcal {C_T}(0)=10\), \(\mathcal {C_L}(0)=6\times 10^{8}\)” were selected, and a chemotherapy dose of \(\mathcal {V}_{Che}(\mathcal {t})=3.5\) was injected every 5 days in (4.8) to observe the changes of the various cell populations in the body.

Fig. 4.1

Ten doses of chemotherapy over 60 days are sufficient to eliminate the tumor. a Curves of the populations of the four cell species. b Distribution of 10 doses of chemotherapy drugs within 60 days and the trend of the chemotherapy drug concentration in vivo

Figure 4.1 shows a pulsed injection scheme for the chemotherapy drug: the drug is injected into the body to study how its addition affects the number of the various cell populations at different times. As can be seen from the tumor cell curve in Fig. 4.1a (the second curve), a dose of chemotherapy drug injected every 5 days for 60 days is sufficient to control the proliferation of tumor cells. The four curves show different forms of oscillation in the early stage, driven mainly by the pulsed injection of the chemotherapy drug; the immunospecific cells \(\mathcal {C_T}\) also settle to a steady level after the tumor cells stabilize in the later stage. Figure 4.1b shows the corresponding mode of administration, with the red curve marking the administration pulses and the green curve the change of the chemotherapy drug concentration in the body.
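The pulsed schedule of Fig. 4.1b can be sketched with a simple pharmacokinetic model: a bolus every 5 days followed by first-order elimination. The elimination rate `gamma` below is an assumed illustrative value, not a parameter taken from Table 4.1.

```python
# Sketch of the pulsed chemotherapy schedule of Fig. 4.1b: a bolus of
# V_Che = 3.5 every 5 days over 60 days, with assumed first-order
# elimination dC/dt = -gamma * C between injections (gamma illustrative).
dose, interval, horizon = 3.5, 5.0, 60.0
gamma, dt = 0.9, 0.01                       # assumed decay rate; Euler step

steps_per_dose = int(interval / dt)         # 500 steps between boluses
conc, peak = 0.0, 0.0
trace = []
for step in range(int(horizon / dt)):
    if step % steps_per_dose == 0:          # pulse injection every 5 days
        conc += dose
    conc -= gamma * conc * dt               # forward-Euler first-order decay
    peak = max(peak, conc)
    trace.append(conc)
```

The resulting `trace` reproduces the sawtooth shape of the green concentration curve in Fig. 4.1b: a jump at each injection followed by exponential washout.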

4.5.2 Construction of Affine Model

Although the discrete model has been obtained in (4.8), it is too complex, and its various coupling terms make it difficult to embed directly in the iterative DHP structure. Therefore, the idea of constructing a simple affine model is introduced. As can be seen from the two preceding subsections, the dynamics can be simplified to the influence of the injected concentrations of the two drugs on the tumor cells in the body. The current tumor cell concentration is then selected as the state variable, and the injected concentrations of the two drugs (chemotherapy drugs and immune agents) as the control variables; starting from a large set of random data, the desired affine discrete model is obtained by fitting.

$$\begin{aligned} x(\mathcal {t}+1) = \mathcal {f}(x(\mathcal {t}))+ \begin{bmatrix} \mathcal {g}_1(x(\mathcal {t})) \\ \mathcal {g}_2(x(\mathcal {t})) \end{bmatrix}^{\scriptscriptstyle {T}} u(\mathcal {t}) \end{aligned}$$
(4.31)
$$\begin{aligned} \mathcal {g}_1\Big (\text {log}_{10}(x)\Big )&= 0.001771\Big (\text {log}_{10}(x)\Big )\!^{\scriptscriptstyle {5}}\!-\!0.02931\Big (\text {log}_{10}(x)\Big )\!^{\scriptscriptstyle {4}} \!+\!0.1793\Big (\text {log}_{10}(x)\Big )\!^{\scriptscriptstyle {3}}\!\nonumber \\ {}&-\!0.5353\Big (\text {log}_{10}(x)\Big )\!^{\scriptscriptstyle {2}}\!+\!1.741\Big (\text {log}_{10}(x)\Big )\!-\!1.133 \end{aligned}$$
(4.32)
$$\begin{aligned} \mathcal {g}_2\Big (\text {log}_{10}(x)\Big )&= 0.007579 \Big (\text {log}_{10}(x)\Big )\!^{\scriptscriptstyle {4}}\!-\!0.1087\Big (\text {log}_{10}(x)\Big )\!^{\scriptscriptstyle {3}}\nonumber \\&\!+\!0.4838\Big (\text {log}_{10}(x)\Big )\!^{\scriptscriptstyle {2}}\!+\!0.1783\Big (\text {log}_{10}(x)\Big )\!-\!0.2304 \end{aligned}$$
(4.33)
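The fitted model can be evaluated directly from the published coefficients. The sketch below reads the repeated quadratic exponent in (4.33) as the linear term, which the polynomial pattern suggests is a typo; the drift term \(\mathcal{f}\) of (4.31) is not reproduced in the text, so it must be supplied by the caller.

```python
import numpy as np

# Input-gain polynomials (4.32)-(4.33), evaluated in log10 of the
# tumor-cell concentration x. Coefficients are listed highest degree first.
G1 = [0.001771, -0.02931, 0.1793, -0.5353, 1.741, -1.133]  # degree-5 fit
G2 = [0.007579, -0.1087, 0.4838, 0.1783, -0.2304]          # degree-4 fit

def g1(x):
    """Gain of the chemotherapy-drug input at tumor concentration x > 0."""
    return np.polyval(G1, np.log10(x))

def g2(x):
    """Gain of the immune-agent input at tumor concentration x > 0."""
    return np.polyval(G2, np.log10(x))

def step(x, u1, u2, f):
    """One step of the affine model (4.31); the drift f must be supplied."""
    return f(x) + g1(x) * u1 + g2(x) * u2
```

For instance, at \(x=10\) (so \(\log_{10}x=1\)), `g1` and `g2` reduce to the sums of their coefficients.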

4.5.3 Optimization of Mixed Treatment Regimen

Following the affine model above, the cost function required by iteration-DHP must be specified before the treatment can be optimized:

$$\begin{aligned} \textrm{J}&(x(\mathcal {t}))= \sum _{\iota =0}^{\infty }\lambda ^{\scriptscriptstyle {\iota }} \Big \{ x^{\scriptscriptstyle {T}}(\mathcal {t})\textrm{P}x(\mathcal {t}) + m_1\int _{0}^{u_1(\mathcal {t})} \text {tanh}^{\scriptscriptstyle {-T}}(\overline{\mathcal {U}}_1^{\scriptscriptstyle {-1}}s)\nonumber \\&\cdot \overline{\mathcal {U}}_1\textrm{Q}_1ds +m_2\int _{0}^{u_2(\mathcal {t})} \text {tanh}^{\scriptscriptstyle {-T}}(\overline{\mathcal {U}}_2^{\scriptscriptstyle {-1}}s)\overline{\mathcal {U}}_2\textrm{Q}_2ds\Big \}. \end{aligned}$$
(4.34)
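The integral penalty in (4.34) keeps the controls bounded and, in the scalar case, has a known closed form, \(\int_{0}^{u}\tanh^{-1}(s/\bar{\mathcal{U}})\,ds = u\tanh^{-1}(u/\bar{\mathcal{U}})+\tfrac{\bar{\mathcal{U}}}{2}\ln(1-u^{2}/\bar{\mathcal{U}}^{2})\). The sketch below checks this numerically; the weights `m`, `q` and bound `u_bar` are illustrative placeholders, not the values of Table 4.2.

```python
import numpy as np
from scipy.integrate import quad

def control_cost(u, u_bar, q=1.0, m=1.0):
    """Scalar case of the penalty in (4.34):
    m * q * u_bar * integral_0^u arctanh(s / u_bar) ds, for |u| < u_bar."""
    val, _ = quad(lambda s: np.arctanh(s / u_bar), 0.0, u)
    return m * q * u_bar * val

def control_cost_closed(u, u_bar, q=1.0, m=1.0):
    """Closed form of the same integral."""
    return m * q * u_bar * (u * np.arctanh(u / u_bar)
                            + 0.5 * u_bar * np.log(1.0 - (u / u_bar) ** 2))
```

The penalty grows without bound as \(u\) approaches \(\bar{\mathcal{U}}\), which is what discourages the optimizer from saturating the drug dosage.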
Table 4.2 Default parameters
Fig. 4.2

The iteration error curve; the termination condition is satisfied at the end of the 67th iteration

According to clinical experience, the default parameters are shown in Table 4.2. The iteration error \(\epsilon \) is set to \(10^{-6}\), and the iteration error variation curve is shown in Fig. 4.2. The error decreases extremely fast in the first twenty iterations of the calculation, and the convergence rate gradually decreases after 20 iterations. At \(\iota =67\), the termination condition has been satisfied.

After the termination criterion is met, the tumor cell population under the optimized regimen evolves as shown in Fig. 4.3: the growth of the tumor cell population is stemmed at an extremely rapid rate. The usage and dosage of the two drugs are shown in Fig. 4.4, where Fig. 4.4a shows the injected concentration curve of the chemotherapy drug and Fig. 4.4b shows that of the immune agent.

Fig. 4.3

Tumor cell population changes in optimized treatment

Fig. 4.4

Usage and dosage of the different drugs under the optimized treatment. a Injection concentration of chemotherapeutic drugs. b Injection concentration of immune agents

4.6 Conclusion

In this chapter, a tumor immune differential game system has been established to solve the problem of optimal clinical tumor treatment oriented to evolutionary dynamics. Firstly, a mathematical model of the game system between tumor cells and immune cells treated by immune agents and chemotherapy drugs has been given. Secondly, the bounded optimal control problem has been solved via the HJB equation with an infinite-horizon performance index subject to practical constraints. Finally, the optimal iterative approximate control strategy has been obtained by the iterative dual heuristic dynamic programming algorithm, and the effectiveness of the proposed algorithm has been verified.