7.1 Introduction

Traditional therapies such as surgery, chemotherapy, and radiation are characterized by low efficacy and high toxicity for patients. Oncolytic virotherapy has therefore emerged as one of the most promising tumor treatment strategies. It relies on viruses with relatively weak pathogenicity and appropriate gene modification, and its therapeutic effect benefits from their strong replication capability. Similar in principle to targeted therapy, gene-modified viruses selectively infect tumor cells (TCs), replicate rapidly inside them, and ultimately destroy them while activating the body’s immune response. Oncolytic virotherapy not only kills TCs but also attracts more immune cells to kill residual cancer cells, without depleting normal cells in the body. Compared with traditional treatment strategies, oncolytic viruses (OVs) offer minimal side effects and superior therapeutic effects [1]. The development of OVs benefits from the virus-specific lytic CTL response, which elicits immunostimulatory signals and contributes to the killing of infected tumor cells (ITCs) [2]. Determining viral doses, the number of doses, and their timing with reliable mathematical models is thus a direction for future research.

To study cancer virotherapy, mathematical models describing the interaction mechanisms among TCs, OVs and immune cells have been proposed and refined [3, 4]. The model in [5] described the underlying mechanism in terms of uninfected tumor cells (UTCs), ITCs and free viruses. Subsequently, infected and uninfected cells were distinguished through logistic growth of TCs and the elimination of free recombinant measles viruses [6]. Most important is the immune response, which can inhibit viral therapy because the immune system does not distinguish the genetically modified viruses from ordinary pathogens.

Therapy efficiency depends on whether the immune response is excessive; in other words, infected cancer cells and viruses are engulfed by immune cells that cannot tell them apart. The literature has demonstrated this side effect of immune cells, and the immunosuppressive agent cyclophosphamide has been chosen to reduce the immune response [7]. Reference [8] added a virus-free population to the previous three variables, reflecting the interaction of innate immunity with infected cancer cells and viruses; the result is an effective model for mechanism analysis, but a more effective control strategy is urgently needed. Cytokines from natural killer cells contribute most to the destruction of both tumor and virus-infected cells. The proposed model makes explicit the interplay among TCs, OVs, and the immune response, which serves as a guideline for optimal therapeutic strategies and dosage regimens in oncotherapy. Although related research on regulating the immune system and TCs using adaptive dynamic programming (ADP) has been reported [9], selective oncolysis with gene-modified viruses is expected to achieve a better therapeutic effect under the ADP method than wild-type OVs.

As a vital branch of machine learning that obtains information from an interactive environment [10,11,12], reinforcement learning (RL) has been demonstrated to perform well in solving optimal control problems of nonlinear systems [13]. The ADP method, which is derived from RL and dynamic programming, generally attempts to obtain the optimal strategies with the aid of the classic actor-critic framework [14]. Under this architecture, the critic evaluates the cost incurred when the current strategy is applied, and the actor updates the control strategy in accordance with the feedback information provided by the critic. Thus an approximate optimal strategy can be derived and the “curse of dimensionality” can be obviated. Recently, ADP-based methods have been widely studied to tackle various optimal control problems, for instance, tracking control [15,16,17], optimal consensus control [18,19,20], and zero-sum and nonzero-sum games [21,22,23]. Different from the fuzzy approximation in [24], a robust dynamic neural network (NN) was established to asymptotically identify the uncertain system with additive disturbances, and the critic and actor worked together to find the equilibrium solution of nonzero-sum games for nonlinear systems. The identifier was developed to reconstruct the unknown dynamics, and the critic was tuned by a concurrent learning strategy that effectively uses both real-time and recorded data, so that the persistence of excitation (PE) condition could be removed. By utilizing both online and off-line data, a data-based policy gradient ADP method was developed to seek the optimal scheme in [25]. To address the global optimal control issue and avoid falling into local optima as in [26], an ADP method combined with predesigned extra compensators was proposed in [27]. The introduction of these compensators contributed to deriving the augmented neighborhood error systems, so that the requirement of knowing the system dynamics for ADP was avoided.
In [28], integrating the learning ability of neural networks and the spirit of ADP, a general architecture of intelligent critic control was proposed to solve the robustness issues of disturbed nonlinear systems.

Since saturation phenomena, which exist widely in practical systems, can degrade system performance, various ADP-based methods have been proposed to achieve optimal control with input constraints [29,30,31]. For the tumor-virus-immune system considered here, the control input is the medicine containing the virus particles. Excessive or insufficient medicine dosages may impair the therapeutic effect or the patient’s health. Thus we consider asymmetric input constraints and construct the corresponding non-quadratic value function associated with the tumor-virus-immune system.

Recently, ADP-based methods have been proposed to develop approximate optimal strategies in various practical applications [32,33,34,35]. However, few works address virotherapy-based optimal strategies derived from ADP-based methods. Motivated by the literature mentioned above, we design a virotherapy-based optimal strategy via the ADP method with a medicine dosage regulation mechanism (MDRM). The contributions can be stated as follows. Firstly, a mathematical model is introduced to describe the relationships among TCs, OVs and immune cells. Because of the asymmetric dosage constraints on the medicine, a non-quadratic utility function is constructed to form the discounted value function. Then, on the basis of the tumor-virus-immune model, an ADP method with a single-critic architecture is proposed to solve the Hamilton-Jacobi-Bellman equation (HJBE), so that the approximate optimal strategy can be obtained; this means that the TCs can be largely eliminated with the constrained optimal virotherapy-based strategy. Furthermore, the MDRM is introduced into this algorithm framework for the first time, and indications for medicine are likewise considered for the first time. Finally, theoretical analysis and simulation experiments both validate the effectiveness of the designed therapeutic strategy.

7.2 Problem Formulation and Preliminaries

7.2.1 Establishment of Interaction Model

In this section, a tumor-virus-immune interaction model is introduced to describe the relations among TCs, viruses and immune cells. According to the behavior of OVs, TCs can be divided into UTCs and ITCs. In the following model composed of four ordinary differential equations, \(P_{TU}(t)\), \(P_{TI}(t)\), \(P_{VI}(t)\) and \(P_{IM}(t)\) respectively denote the populations of UTCs, ITCs, free OVs and immune cells.

The population of UTCs is affected by multiple factors, namely the multiplication and apoptosis of TCs, infection by OVs, and the reduction caused by immune cells. The growth dynamics of UTCs are given by

$$\begin{aligned} \dot{P}_{TU}(t)=&\,A_1 P_{TU}(t)\big (1-\frac{P_{TU}(t)+P_{TI}(t)}{K}\big )-A_2 P_{TU}(t)P_{VI}(t) \nonumber \\&-B_1 P_{TU}(t)P_{IM}(t)-C_1 P_{TU}(t), \end{aligned}$$
(7.1)

where \(A_1\) is the tumor proliferation rate, K is the carrying capacity of the logistic growth term, \(A_2\) is the infection rate of the virus, \(B_1\) denotes the killing efficiency of immune cells, and \(C_1\) is the apoptosis rate of UTCs.

Similarly, the population of ITCs can be modeled by

$$\begin{aligned} \dot{P}_{TI}(t)=A_2 P_{TU}(t)P_{VI}(t)-B_2 P_{TI}(t)P_{IM}(t)-\varphi P_{TI}(t), \end{aligned}$$
(7.2)

where \(B_2\) denotes the immune killing efficiency for ITCs and \(\varphi \) is the apoptosis rate of ITCs.

The lysis of ITCs, which contain multiple replicated virion particles, and the input of the viral agent both contribute to the rise of the free virus population. Thus the evolution dynamics of the virus population can be presented as

$$\begin{aligned} \dot{P}_{VI}(t)=&\,\mathcal {U}+\kappa \varphi P_{TI}(t)-A_2 P_{TU}(t)P_{VI}(t) \nonumber \\&-B_3 P_{VI}(t)P_{IM}(t)-C_2 P_{VI}(t), \end{aligned}$$
(7.3)

where \(\mathcal {U}\) denotes the input of the viral agent, \(\kappa \) the burst size of free viruses, \(B_3\) the immune killing efficiency for OVs, and \(C_2\) the clearance rate of OVs.

The immune response dynamics can be formulated as

$$\begin{aligned} \dot{P}_{IM}(t)=&D_1 P_{TI}(t)P_{IM}(t)+D_2 P_{TU}(t)P_{IM}(t) \nonumber \\&-C_3 P_{IM}(t), \end{aligned}$$
(7.4)

where \(D_1\) and \(D_2\) are the immune response rates stimulated by infected and uninfected cells, respectively, and \(C_3\) is the apoptosis rate of immune cells. To simplify the interaction model, we apply the nondimensionalization technique [36, 37] and derive the simplified version as

$$\begin{aligned} \left\{ \begin{aligned} \dot{p}_{TU}(t)=&\,a_1 p_{TU}(t)(1-p_{TU}(t)-p_{TI}(t))-c_1 p_{TU}(t) \\&-a_2 p_{TU}(t)p_{VI}(t)-b_1 p_{TU}(t)p_{IM}(t) \\ \dot{p}_{TI}(t)=&\,a_2 p_{TU}(t)p_{VI}(t)-b_2 p_{TI}(t)p_{IM}(t)-\varphi p_{TI}(t) \\ \dot{p}_{VI}(t)=&\,\mathcal {u}+\kappa p_{TI}(t)-a_2 p_{TU}(t)p_{VI}(t) \\&-b_3 p_{VI}(t)p_{IM}(t)-c_2 p_{VI}(t) \\ \dot{p}_{IM}(t)=&\,d_1 p_{TI}(t)p_{IM}(t)+d_2 p_{TU}(t)p_{IM}(t) \\&-c_3 p_{IM}(t). \end{aligned} \right. \end{aligned}$$
(7.5)

Herein the nonnegative states of the nondimensionalized version are represented as \(p_{TU}(t)\), \(p_{TI}(t)\), \(p_{VI}(t)\) and \(p_{IM}(t)\).

Remark 7.1

In virotherapy, the viruses achieve their reproductive objective by infecting tumor cells and replicating themselves. After the lysis of infected cells, the new virions burst out and infect other tumor cells. Under this mechanism, the tumor cells can be effectively eliminated. Furthermore, compared with uninfected tumor cells, the infected cells activate immune cells more effectively to kill tumor cells.

7.2.2 Problem Formulation

System (7.5) can be rewritten in the compact form

$$\begin{aligned} \dot{x}=f(x)+\mathcal {g}\mathcal {u}, \end{aligned}$$
(7.6)

where \(x=[p_{TU},p_{TI},p_{VI},p_{IM}]^T\) is the state vector, \(\mathcal {g}=[0,0,1,0]^T\), and f(x) collects the right-hand sides of (7.5) excluding the control input \(\mathcal {u}\). The input satisfies \(\mathcal {u}\in [\mathcal {u}_m,\mathcal {u}_M]\), where \(\mathcal {u}_m\) and \(\mathcal {u}_M\) denote the minimum and maximum thresholds of the medicine input dosage.

For system (7.6), the corresponding discounted value function is defined as

$$\begin{aligned} V(x(t))=\int _t^{\infty }e^{-\theta (\iota -t)}{} W (x,\mathcal {u})d\iota , \end{aligned}$$
(7.7)

with the discount factor \(\theta >0\). The utility function is given by

$$\begin{aligned} W (x,\mathcal {u})=x^T \varUpsilon x+\chi (\mathcal {u}), \end{aligned}$$
(7.8)

where the matrix \(\varUpsilon \) is positive definite, and \(\chi (\mathcal {u})\) is a non-negative function. Note that the input constraints of system (7.6) are not symmetric. To cope with this issue, the function \(\chi (\mathcal {u})\) is defined as

$$\begin{aligned} \chi (\mathcal {u})=2\hbar \int _\alpha ^\mathcal {u} \psi ^{-1}(\hbar ^{-1}(\iota -\alpha ))d\iota , \end{aligned}$$
(7.9)

where \(\alpha =(\mathcal {u}_m+\mathcal {u}_M)/2\) and \(\hbar =(\mathcal {u}_M-\mathcal {u}_m)/2\). \(\psi (\cdot )\) is a monotonic odd function which is continuously differentiable with \(\psi (0)=0\). Without loss of generality, we select the hyperbolic tangent function as \(\psi (\cdot )\), that is, \(\psi (\cdot )=\tanh (\cdot )\).
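
As an illustration of (7.9) with \(\psi =\tanh \), the following sketch evaluates \(\chi (\mathcal {u})\) in two ways: by a midpoint-rule quadrature of the integral, and by the closed form \(\chi (\mathcal {u})=2\hbar s\tanh ^{-1}(s/\hbar )+\hbar ^2\ln (1-s^2/\hbar ^2)\) with \(s=\mathcal {u}-\alpha \), which follows from integrating \(\tanh ^{-1}\) by parts. The thresholds are the values \(\mathcal {u}_m=0\) and \(\mathcal {u}_M=0.02\) quoted in Sect. 7.5; the function names are illustrative assumptions and not part of the original method.

```python
import math

U_MIN, U_MAX = 0.0, 0.02          # dosage thresholds u_m, u_M (values quoted in Sect. 7.5)
ALPHA = (U_MAX + U_MIN) / 2.0     # midpoint alpha of the admissible interval
HBAR = (U_MAX - U_MIN) / 2.0      # half-width hbar of the admissible interval


def chi_closed_form(u):
    """chi(u) from (7.9) with psi = tanh, valid for u_m < u < u_M."""
    s = u - ALPHA
    return (2.0 * HBAR * s * math.atanh(s / HBAR)
            + HBAR ** 2 * math.log(1.0 - (s / HBAR) ** 2))


def chi_quadrature(u, steps=20000):
    """Midpoint-rule value of 2*hbar * int_alpha^u atanh((iota - alpha)/hbar) d(iota)."""
    width = (u - ALPHA) / steps
    total = 0.0
    for k in range(steps):
        iota = ALPHA + (k + 0.5) * width
        total += math.atanh((iota - ALPHA) / HBAR)
    return 2.0 * HBAR * total * width


if __name__ == "__main__":
    for u in (0.002, 0.01, 0.015, 0.019):
        print(f"u={u:.3f}  closed={chi_closed_form(u):.3e}  quad={chi_quadrature(u):.3e}")
```

The penalty vanishes at the midpoint \(\alpha \), is symmetric about it, and grows steeply as the dosage approaches either threshold, which is how the asymmetric interval \([\mathcal {u}_m,\mathcal {u}_M]\) is encoded in the utility.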

Differentiating the value function (7.7) along system (7.6), we obtain that

$$\begin{aligned} 0=\nabla V^T (f+\mathcal {g}\mathcal {u})+x^T \varUpsilon x+\chi (\mathcal {u})-\theta V. \end{aligned}$$
(7.10)

Then the Hamiltonian function can be expressed as

$$\begin{aligned} H(x,\mathcal {u},\nabla V)=\nabla V^T (f+\mathcal {g}\mathcal {u})+x^T \varUpsilon x+\chi (\mathcal {u})-\theta V. \end{aligned}$$
(7.11)

The optimal value function is defined as

$$\begin{aligned} V^{*}(x)=\min _{\mathcal {u}} \int _t^{\infty } e^{-\theta (\iota -t)}{} W (x,\mathcal {u})d\iota , \end{aligned}$$
(7.12)

which satisfies the HJBE

$$\begin{aligned} \min _u H(x,\mathcal {u},\nabla V^{*})=0. \end{aligned}$$
(7.13)

Applying the stationarity condition \(\partial H/\partial \mathcal {u}=0\), we can derive the optimal strategy as

$$\begin{aligned} \mathcal {u}^{*}=-\hbar \tanh (\frac{1}{2\hbar }\mathcal {g}^T\nabla V^{*})+\alpha . \end{aligned}$$
(7.14)
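
To make the stationarity step concrete: with \(\psi =\tanh \), differentiating (7.9) gives \(\mathrm {d}\chi /\mathrm {d}\mathcal {u}=2\hbar \tanh ^{-1}((\mathcal {u}-\alpha )/\hbar )\), so \(\partial H/\partial \mathcal {u}=\mathcal {g}^T\nabla V^{*}+2\hbar \tanh ^{-1}((\mathcal {u}-\alpha )/\hbar )=0\), which is solved by (7.14). The sketch below checks this numerically for a few values of the scalar \(\mathcal {g}^T\nabla V^{*}\). The thresholds are those quoted in Sect. 7.5, and the function names and test values are illustrative assumptions.

```python
import math

U_MIN, U_MAX = 0.0, 0.02          # dosage thresholds u_m, u_M (values quoted in Sect. 7.5)
ALPHA = (U_MAX + U_MIN) / 2.0
HBAR = (U_MAX - U_MIN) / 2.0


def optimal_dosage(g_grad_v):
    """u* from (7.14); g_grad_v stands for the scalar g^T grad V*."""
    return -HBAR * math.tanh(g_grad_v / (2.0 * HBAR)) + ALPHA


def stationarity_residual(u, g_grad_v):
    """dH/du = g^T grad V* + d(chi)/du, with d(chi)/du = 2*hbar*atanh((u - alpha)/hbar)."""
    return g_grad_v + 2.0 * HBAR * math.atanh((u - ALPHA) / HBAR)


if __name__ == "__main__":
    for g_grad_v in (-0.05, -0.01, 0.0, 0.01, 0.05):   # arbitrary illustrative values
        u = optimal_dosage(g_grad_v)
        print(f"g^T grad V = {g_grad_v:+.2f}  u* = {u:.5f}  "
              f"residual = {stationarity_residual(u, g_grad_v):.1e}")
```

Because \(\tanh \) is bounded by one in magnitude, the resulting dosage always stays inside \([\mathcal {u}_m,\mathcal {u}_M]\) without any explicit clipping.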

On the basis of (7.13) and (7.14), we rewrite the HJBE as

$$\begin{aligned} (\nabla V^{*})^T f-\hbar (\nabla V^{*})^T\mathcal {g}\tanh (\frac{1}{2\hbar }\mathcal {g}^T\nabla V^{*})+x^T \varUpsilon x \nonumber \\ +(\nabla V^{*})^T\mathcal {g}\alpha -\theta V^{*}+\chi (\mathcal {u}^{*})=0. \end{aligned}$$
(7.15)

Remark 7.2

In conventional optimal control problems with control constraints, the input constraints are often required to be symmetric. In contrast, the method proposed here takes asymmetric input constraints into account. The symmetry requirement is relaxed by constructing the unconventional utility function (7.8).

Due to the nonlinear nature of (7.15), it is often intractable to derive its analytical solution, which is required for designing the optimal strategy. To overcome this issue, in the following sections an ADP method with a single-critic network and the dosage regulation mechanism is designed to approximately solve (7.15).

7.3 Optimal Strategy Based on MDRM

To regulate the therapeutic strategy in a timely manner and only when necessary, the MDRM is introduced to provide indications for medicine, which determine the instants at which a regulation is required. A time sequence \(\{z_\imath \}\) is therefore used to record the regulating instants, where the index \(\imath \in \mathbb {N}^{+}\) labels the \(\imath \)th regulating instant and \(\mathbb {N}^{+}\) is the set of all positive integers. Then we can define the held state as

$$\begin{aligned} \breve{x}_\imath (t)=x(z_\imath ),t\in [z_\imath ,z_{\imath +1}). \end{aligned}$$
(7.16)

In general, the clinical data recorded at the latest regulation differ from the current clinical data. Hence the error is given by

$$\begin{aligned} \nu _\imath (t)=\breve{x}_\imath -x(t), t\in [z_\imath ,z_{\imath +1}). \end{aligned}$$
(7.17)

Based on \(\nu _\imath \) and a threshold associated with the state x, the medicine regulation mechanism is established. When a regulation occurs, \(\nu _\imath =0\), which means the medicine dosage is reset according to the current medicine indication. The reference data are updated with the clinical data at the regulation instant, and the medicine dosage remains unchanged until the next regulation occurs. That is, \(\breve{\mathcal {u}}=\mathcal {u}(\breve{x}_\imath )\). Thus we derive the MDRM-based strategy as

$$\begin{aligned} \breve{\mathcal {u}}^{*}=-\hbar \tanh (\frac{1}{2\hbar }\mathcal {g}^T(\breve{x}_\imath )\nabla V^{*}(\breve{x}_\imath ))+\alpha , \end{aligned}$$
(7.18)

where \(\nabla V^{*}(\breve{x}_\imath )\) denotes \(\partial V^{*}/\partial x\) evaluated at \(t=z_\imath \). Then the medicine regulation mechanism-based HJBE can be denoted as

$$\begin{aligned} H(x,\breve{\mathcal {u}}^{*},\nabla V^{*})=&-\hbar (\nabla V^{*})^T\mathcal {g}\tanh (\frac{1}{2\hbar }\mathcal {g}^T(\breve{x}_\imath )\nabla V^{*}(\breve{x}_\imath )) \nonumber \\&+(\nabla V^{*})^T f+(\nabla V^{*})^T \mathcal {g}\alpha +x^T \varUpsilon x \nonumber \\&+\chi (\breve{\mathcal {u}}^{*})-\theta V^{*}. \end{aligned}$$
(7.19)

Because of the error \(\nu _\imath \), (7.19) is in general not equal to 0, which distinguishes it from the HJBE (7.15). Before proceeding, an assumption is necessary [31].

Assumption 7.1

The optimal strategy \(\mathcal {u}^{*}\) is locally Lipschitz with respect to error \(\nu _\imath \), i.e., \(\Vert \mathcal {u}^{*}-\breve{\mathcal {u}}^{*}\Vert ^2\le K_\mathcal {u}\Vert x-\breve{x}_\imath \Vert ^2=K_\mathcal {u}\Vert \nu _\imath \Vert ^2\) where \(K_\mathcal {u}\) is a positive constant.

Theorem 7.1

Consider the nonlinear system (7.6). Suppose that Assumption 7.1 is tenable and there exists function \(V^{*}\) satisfying (7.15). If the optimal strategy is formulated as (7.18) with the medicine indication

$$\begin{aligned} \Vert \nu _\imath \Vert ^2\le \frac{(1-\zeta ^2)\lambda _m(\varUpsilon )}{K_\mathcal {u}}\Vert x\Vert ^2 \end{aligned}$$
(7.20)

where \(\zeta \in (0,1)\) is a design parameter, then the controlled system is guaranteed to be stable in the sense of uniform ultimate boundedness (UUB).

Proof

Select the Lyapunov function \(\bar{Y}=V^{*}(x)\). Its derivative along the trajectories of system (7.6) under \(\breve{\mathcal {u}}^{*}\) is

$$\begin{aligned} \dot{\bar{Y}}=(\nabla V^{*})^T (f+\mathcal {g}\breve{\mathcal {u}}^{*}). \end{aligned}$$
(7.21)

According to (7.14) and (7.15), we derive that

$$\begin{aligned} (\nabla V^{*})^T f=-(\nabla V^{*})^T \mathcal {g}\mathcal {u}^{*}-x^T \varUpsilon x-\chi (\mathcal {u}^{*})+\theta V^{*}, \end{aligned}$$
(7.22)

and

$$\begin{aligned} (\nabla V^{*})^T \mathcal {g}=-2\hbar (\tanh ^{-1}((\mathcal {u}^{*}-\alpha )/\hbar ))^T. \end{aligned}$$
(7.23)

Then (7.21) can be rewritten as

$$\begin{aligned} \dot{\bar{Y}}=&-(\nabla V^{*})^T \mathcal {g}(\mathcal {u}^{*}-\breve{\mathcal {u}}^{*})-x^T \varUpsilon x-\chi (\mathcal {u}^{*})+\theta V^{*} \nonumber \\ =&-2\hbar (\tanh ^{-1}((\mathcal {u}^{*}-\alpha )/\hbar ))^T(\breve{\mathcal {u}}^{*}-\mathcal {u}^{*})-x^T \varUpsilon x \nonumber \\&-\chi (\mathcal {u}^{*})+\theta V^{*} \nonumber \\ =&-x^T \varUpsilon x-\chi (\mathcal {u}^{*})+\theta V^{*}+\varpi , \end{aligned}$$
(7.24)

where \(\varpi =-2\hbar (\tanh ^{-1}((\mathcal {u}^{*}-\alpha )/\hbar ))^T(\breve{\mathcal {u}}^{*}-\mathcal {u}^{*})\). By Young’s inequality and Assumption 7.1, we derive

$$\begin{aligned} \varpi \le \hbar ^2(\tanh ^{-1}((\mathcal {u}^{*}-\alpha )/\hbar ))^2+K_\mathcal {u}\Vert \nu _\imath \Vert ^2. \end{aligned}$$
(7.25)

By a change of variables, we have

$$\begin{aligned} \chi (\mathcal {u}^{*})=2\hbar \int _0^{\mathcal {u}^{*}-\alpha }\tanh ^{-1}((\iota -\alpha )/\hbar )d(\iota -\alpha ). \end{aligned}$$
(7.26)

The function (7.26) can be further expressed as

$$\begin{aligned} \chi (\mathcal {u}^{*})=&2\hbar ^2\int _0^{\tanh ^{-1}((\mathcal {u}^{*}-\alpha )/\hbar )}\varsigma (1-\tanh ^2(\varsigma ))d\varsigma \nonumber \\ =&-2\hbar ^2\int _0^{\tanh ^{-1}((\mathcal {u}^{*}-\alpha )/\hbar )}\varsigma \tanh ^2(\varsigma )d\varsigma \nonumber \\&+\hbar ^2(\tanh ^{-1}((\mathcal {u}^{*}-\alpha )/\hbar ))^2. \end{aligned}$$
(7.27)

Based on (7.24), (7.25) and (7.27), we can obtain

$$\begin{aligned} \dot{\bar{Y}}\le \varXi _1+K_{\mathcal {u}}\Vert \nu _\imath \Vert ^2+\theta V^{*}-x^T \varUpsilon x, \end{aligned}$$
(7.28)

where \(\varXi _1(x)=2\hbar ^2\int _0^{\tanh ^{-1}((\mathcal {u}^{*}-\alpha )/\hbar )}\varsigma \tanh ^2(\varsigma )d\varsigma \). By the integral mean-value theorem, we derive that

$$\begin{aligned} \varXi _1(x)=2\hbar ^2\tanh ^{-1}((\mathcal {u}^{*}-\alpha )/\hbar )\rho \tanh ^2(\rho ), \end{aligned}$$
(7.29)

where \(\rho \) lies between 0 and \(\tanh ^{-1}((\mathcal {u}^{*}-\alpha )/\hbar )\). As \(\mathcal {u}^{*}\) is admissible, it can be deduced that \(V^{*}\) and \(\nabla V^{*}\) are bounded. Let \(\Vert V^{*}\Vert \le b_V\) and \(\Vert \nabla V^{*}\Vert \le b_{\nabla V}\) with \(b_V\) and \(b_{\nabla V}\) being positive constants. Then (7.29) yields

$$\begin{aligned} \varXi _1(x)\le&2\hbar ^2\tanh ^{-1}((\mathcal {u}^{*}-\alpha )/\hbar )\rho \nonumber \\ \le&2\hbar ^2(\tanh ^{-1}((\mathcal {u}^{*}-\alpha )/\hbar ))^2 \nonumber \\ =&\frac{1}{2}\nabla V^{*T} \mathcal {g}\mathcal {g}^T \nabla V^{*} \nonumber \\ \le&\frac{1}{2}b_{\mathcal {g}}^2 b_{\nabla V}^2\triangleq b_{\varXi _1}, \end{aligned}$$
(7.30)

where the positive constant \(b_{\mathcal {g}}\) denotes the bound of \(\mathcal {g}(x)\). According to (7.28) and (7.30), it can be obtained that

$$\begin{aligned} \dot{\bar{Y}}\le&-\zeta ^2 \lambda _m(\varUpsilon )\Vert x\Vert ^2-(1-\zeta ^2)\lambda _m(\varUpsilon )\Vert x\Vert ^2 \nonumber \\&+K_{\mathcal {u}}\Vert \nu _\imath \Vert ^2+\theta b_V+b_{\varXi _1}. \end{aligned}$$
(7.31)

When the indication (7.20) is satisfied, (7.31) yields \(\dot{\bar{Y}}\le -\zeta ^2\lambda _m(\varUpsilon )\Vert x\Vert ^2+\theta b_V+b_{\varXi _1}\). We can then conclude that \(\dot{\bar{Y}}<0\) whenever \(\Vert x\Vert >\sqrt{\frac{\theta b_V+b_{\varXi _1}}{\zeta ^2\lambda _m(\varUpsilon )}}\), so the state is UUB. \(\blacksquare \)
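
The integration-by-parts identity (7.27) and the bound \(0\le \varXi _1\le 2\hbar ^2(\tanh ^{-1}((\mathcal {u}^{*}-\alpha )/\hbar ))^2\) used in (7.30) can be sanity-checked numerically. In the sketch below, `a` stands for \(\tanh ^{-1}((\mathcal {u}^{*}-\alpha )/\hbar )\); the helper names, the quadrature rule and the sample values of `a` are illustrative assumptions, and \(\hbar =0.01\) corresponds to the thresholds quoted in Sect. 7.5.

```python
import math

HBAR = 0.01  # half-width (u_M - u_m)/2 for the thresholds quoted in Sect. 7.5


def integrate(func, upper, steps=20000):
    """Midpoint rule for int_0^upper func(s) ds (upper may be negative)."""
    width = upper / steps
    return sum(func((k + 0.5) * width) for k in range(steps)) * width


def chi_direct(a):
    """First line of (7.27): 2*hbar^2 * int_0^a s*(1 - tanh(s)^2) ds."""
    return 2.0 * HBAR ** 2 * integrate(lambda s: s * (1.0 - math.tanh(s) ** 2), a)


def xi_1(a):
    """Xi_1 as defined after (7.28): 2*hbar^2 * int_0^a s*tanh(s)^2 ds."""
    return 2.0 * HBAR ** 2 * integrate(lambda s: s * math.tanh(s) ** 2, a)


if __name__ == "__main__":
    for a in (-2.0, -0.5, 0.3, 1.5):   # arbitrary sample values of atanh((u* - alpha)/hbar)
        lhs = chi_direct(a)
        rhs = HBAR ** 2 * a * a - xi_1(a)      # last two lines of (7.27)
        print(f"a={a:+.1f}  chi={lhs:.6e}  hbar^2 a^2 - Xi_1={rhs:.6e}")
```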

Theorem 7.1 indicates that, with the medicine regulation mechanism, the MDRM-based optimal strategy stabilizes the controlled system in the sense of UUB.

7.4 MDRM-Based Approximate Optimal Control Design

The approximate optimal control strategy is designed based on the ADP algorithm which integrates the medicine regulation mechanism. Furthermore, for the closed-loop controlled system, stability in the sense of UUB is guaranteed when the proposed medicine indication is applied.

7.4.1 Implementation of the Adaptive Dynamic Programming Method

In this section, the approximate optimal strategy is designed by the ADP method of single-critic framework which integrates the medicine regulation mechanism.

Based on the universal approximation properties of NN, \(V^{*}\) can be represented as

$$\begin{aligned} V^{*}=\omega ^{*T}\vartheta (x)+\tau , \end{aligned}$$
(7.32)

where \(\omega ^{*}\) is the ideal weight vector, \(\vartheta (\cdot )\) the activation function and \(\tau \) the approximation error. Let \(\varGamma _1(\breve{x}_\imath )=\frac{1}{2\hbar }\mathcal {g}^T(\breve{x}_\imath )\nabla \vartheta ^T(\breve{x}_\imath )\omega ^{*}\), then we have

$$\begin{aligned} \breve{\mathcal {u}}^{*}=-\hbar \tanh (\varGamma _1(\breve{x}_\imath ))+\bar{\tau }(\breve{x}_\imath )+\alpha , t\in [z_\imath ,z_{\imath +1}) \end{aligned}$$
(7.33)

where \(\bar{\tau }(\breve{x}_\imath )=-(1/2)(1-\tanh ^2(\Phi (\breve{x}_\imath )))\mathcal {g}^T(\breve{x}_\imath )\nabla \tau (\breve{x}_\imath )\). Herein, \(\Phi (\breve{x}_\imath )\) is selected between \(1/(2\hbar )\mathcal {g}^T(\breve{x}_\imath )\nabla V^{*}(\breve{x}_\imath )\) and \(\varGamma _1(\breve{x}_\imath )\). As the ideal weight \(\omega ^{*}\) is unknown, the approximate version of \(V^{*}\) is derived by the critic NN, which is presented as

$$\begin{aligned} \hat{V}=\hat{\omega }^{T}\vartheta (x), \end{aligned}$$
(7.34)

where \(\hat{\omega }\) is the estimated weight vector. Then the MDRM-based approximate strategy is obtained as

$$\begin{aligned} \breve{\mathcal {u}}=-\hbar \tanh (\varGamma _2(\breve{x}_\imath ))+\alpha , t\in [z_\imath ,z_{\imath +1}), \end{aligned}$$
(7.35)

where \(\varGamma _2(\breve{x}_\imath )=1/(2\hbar )\mathcal {g}^T(\breve{x}_\imath )\nabla \vartheta ^T(\breve{x}_\imath )\hat{\omega }\). Then the approximate Hamiltonian could be restated as

$$\begin{aligned} H(x,\breve{\mathcal {u}},\hat{\omega })=\hat{\omega }^T\xi +x^T \varUpsilon x+\chi (\breve{\mathcal {u}})\triangleq \varepsilon _H, \end{aligned}$$
(7.36)

where \(\xi =\nabla \vartheta (f+\mathcal {g}\breve{\mathcal {u}})-\theta \vartheta \).
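
A minimal sketch of how (7.35) and the regressor \(\xi \) in (7.36) could be evaluated is given below. The chapter does not specify the activation function \(\vartheta \) at this point, so a quadratic monomial basis of the four states is assumed here purely for illustration; the function names are likewise hypothetical. Since \(\mathcal {g}=[0,0,1,0]^T\), the product \(\mathcal {g}^T\nabla \vartheta ^T\hat{\omega }\) only involves derivatives with respect to the free-virus state.

```python
import math
from itertools import combinations_with_replacement

# ASSUMED critic basis: the 10 quadratic monomials x_i * x_j (i <= j) of the 4 states.
PAIRS = list(combinations_with_replacement(range(4), 2))


def basis(x):
    """Activation vector vartheta(x) for the assumed quadratic basis."""
    return [x[i] * x[j] for i, j in PAIRS]


def basis_jacobian(x):
    """Rows are the gradients of each basis function with respect to x (10 x 4)."""
    jac = []
    for i, j in PAIRS:
        row = [0.0, 0.0, 0.0, 0.0]
        row[i] += x[j]
        row[j] += x[i]
        jac.append(row)
    return jac


def approx_dosage(w_hat, x_held, u_min=0.0, u_max=0.02):
    """Approximate strategy (7.35), evaluated at the held state x_held."""
    alpha = (u_max + u_min) / 2.0
    hbar = (u_max - u_min) / 2.0
    # g = [0, 0, 1, 0]^T selects the derivative with respect to the third state.
    g_grad_v = sum(w * row[2] for w, row in zip(w_hat, basis_jacobian(x_held)))
    return -hbar * math.tanh(g_grad_v / (2.0 * hbar)) + alpha


def regressor(x, x_dot, theta):
    """xi = grad(vartheta) (f + g u) - theta * vartheta, where x_dot = f(x) + g u."""
    return [
        sum(d * v for d, v in zip(row, x_dot)) - theta * phi
        for row, phi in zip(basis_jacobian(x), basis(x))
    ]
```

With all critic weights at zero the strategy returns the midpoint dosage \(\alpha \), and any weight vector yields a dosage inside \([\mathcal {u}_m,\mathcal {u}_M]\).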

The goal of tuning \(\hat{\omega }\) is to minimize the term \(\varepsilon _H\). Thus we set the target function as \(E=\frac{1}{2}\varepsilon _H^T\varepsilon _H\). Using the gradient descent approach, we obtain

$$\begin{aligned} \dot{\hat{\omega }}=-\ell \frac{\xi }{(\xi ^T \xi +1)^2}\varepsilon _H=-\ell \breve{\xi }\varepsilon _H, \end{aligned}$$
(7.37)

where \(\ell \) is the learning parameter and \(\breve{\xi }=\xi /(\xi ^T \xi +1)^2\). Define \(\tilde{\omega }=\omega ^{*}-\hat{\omega }\). From (7.37) we derive that

$$\begin{aligned} \dot{\tilde{\omega }}=-\ell \bar{\xi }\bar{\xi }^T\tilde{\omega }+\ell \breve{\xi }e_H, \end{aligned}$$
(7.38)

where \(\bar{\xi }=\xi /(\xi ^T\xi +1)\) and the approximate residual error \(e_H=-\nabla \tau ^{T}(f+\mathcal {g}\breve{\mathcal {u}})+\theta \tau \). Before presenting the main results, the following assumptions are requisite [38, 39].

Assumption 7.2

The signal \(\bar{\xi }\) is persistently excited over the time interval \([t,t+T]\). In other words, there exist positive constants \(\phi \) and T such that

$$\begin{aligned} \phi I_{N_{c}\times N_{c}}\le \int _t^{t+T}\bar{\xi }\bar{\xi }^{T}d\iota , \end{aligned}$$
(7.39)

with \(N_{c}\) being the neuron number of the critic network.

Assumption 7.3

The terms \(\bar{\tau }\) and \(e_H\) are both bounded. That is, \(\Vert \bar{\tau }\Vert \le b_{\bar{\tau }}\) and \(\Vert e_H\Vert \le b_{eH}\) where \(b_{\bar{\tau }}\) and \(b_{eH}\) are positive constants.
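
The tuning law (7.37) can be sketched as a single explicit-Euler update, as below. The step size `dt`, the learning rate and the toy regressor are assumptions chosen for illustration only; `utility` stands for \(x^T\varUpsilon x+\chi (\breve{\mathcal {u}})\), so that the residual matches \(\varepsilon _H\) in (7.36).

```python
def dot(a, b):
    """Inner product of two equally long lists."""
    return sum(p * q for p, q in zip(a, b))


def critic_step(w_hat, xi, utility, learning_rate, dt):
    """One explicit-Euler step of the tuning law (7.37).

    Returns the updated weights and the residual eps_H = w_hat^T xi + utility
    evaluated before the update.
    """
    eps_h = dot(w_hat, xi) + utility
    scale = learning_rate / (dot(xi, xi) + 1.0) ** 2   # normalisation 1/(xi^T xi + 1)^2
    new_w = [w - dt * scale * x_k * eps_h for w, x_k in zip(w_hat, xi)]
    return new_w, eps_h


if __name__ == "__main__":
    # Toy data (assumed): a frozen regressor and utility, just to watch the residual shrink.
    w, xi, utility = [0.0, 0.0], [1.0, 2.0], 0.3
    for _ in range(200):
        w, eps = critic_step(w, xi, utility, learning_rate=5.0, dt=0.1)
    print("weights:", w, "last residual:", eps)
```

With a single frozen regressor only the residual is driven to zero; the weight vector itself is not uniquely determined, which is one way to see why the persistence-of-excitation condition in Assumption 7.2 is needed.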

7.4.2 Stability Analysis

This section discusses the stability of the controlled system under the designed MDRM-based strategy.

Theorem 7.2

Consider system (7.6) and let Assumptions 7.1–7.3 hold. The strategy is given by (7.35) and the weight tuning law for the critic is set as (7.37). Then the closed-loop system (7.6) and the weight estimation error \(\tilde{\omega }\) are stable in the sense of UUB, provided that the following medicine indication is applied

$$\begin{aligned} \Vert \nu _\imath \Vert ^2\le \frac{(1-\eta ^2)\lambda _m(\varUpsilon )}{2K_u}\Vert x\Vert ^2\triangleq \Vert T_{\nu _\imath }\Vert \end{aligned}$$
(7.40)

with \(\eta \in (0,1)\) being the regulation parameter.

Proof

Select the Lyapunov function as

$$\begin{aligned} Y=V^{*}(\breve{x}_\imath )+V^{*}(x)+\tilde{\omega }^T\ell ^{-1}\tilde{\omega }=Y_a+Y_b+Y_c. \end{aligned}$$
(7.41)

Note that when medicine indication is applied, the system can be described by the impulsive model comprising two components. One is flow dynamics for \(t\in [z_\imath ,z_{\imath +1})\) and the other is jump dynamics for \(t=z_\imath \). Hence we present the discussions over the two cases.

Case I: No regulation occurs, i.e., \(t\in [z_\imath ,z_{\imath +1})\). Then we can obtain \(\dot{Y}_a=0\). In light of (7.22) and (7.23), we could derive that

$$\begin{aligned} \dot{Y}_b=&(\nabla V^{*})^T(f+\mathcal {g}\breve{\mathcal {u}}) \nonumber \\ =&\varXi _2-\chi (\mathcal {u}^{*})-x^T \varUpsilon x+\theta V^{*}, \end{aligned}$$
(7.42)

where \(\varXi _2=-2\hbar (\tanh ^{-1}((\mathcal {u}^{*}-\alpha )/\hbar ))^T(\breve{\mathcal {u}}-\mathcal {u}^{*})\). According to Young’s inequality, we have

$$\begin{aligned} \varXi _2\le \hbar ^2\Vert \tanh ^{-1}((\mathcal {u}^{*}-\alpha )/\hbar )\Vert ^2+\Vert \breve{\mathcal {u}}-\mathcal {u}^{*}\Vert ^2. \end{aligned}$$
(7.43)

Recalling (7.27), we obtain

$$\begin{aligned} \varXi _2-\chi (\mathcal {u}^{*})\le \varXi _1(x)+\Vert \breve{\mathcal {u}}-\mathcal {u}^{*}\Vert ^2. \end{aligned}$$
(7.44)

As \(\varXi _1(x)\) and \(V^{*}(x)\) are bounded, (7.42) becomes

$$\begin{aligned} \dot{Y}_b\le \Vert \breve{\mathcal {u}}-\mathcal {u}^{*}\Vert ^2+b_{\varXi _1}+\theta b_V-x^T \varUpsilon x. \end{aligned}$$
(7.45)

Applying Young’s inequality, we derive that

$$\begin{aligned} \Vert \breve{\mathcal {u}}-\mathcal {u}^{*}\Vert ^2=&\Vert \breve{\mathcal {u}}-\breve{\mathcal {u}}^{*}+\breve{\mathcal {u}}^{*}-\mathcal {u}^{*}\Vert ^2 \le 2\Vert \breve{\mathcal {u}}-\breve{\mathcal {u}}^{*}\Vert ^2+2\Vert \breve{\mathcal {u}}^{*}-\mathcal {u}^{*}\Vert ^2 \nonumber \\ \le&4\Vert \hbar \tanh (\varGamma _1(\breve{x}_\imath ))-\hbar \tanh (\varGamma _2(\breve{x}_\imath ))\Vert ^2+4\Vert \bar{\tau }(\breve{x}_\imath )\Vert ^2+2K_{\mathcal {u}}\Vert \nu _\imath \Vert ^2 \nonumber \\ \le&8\hbar ^2\tanh ^2(\varGamma _1(\breve{x}_\imath ))+8\hbar ^2\tanh ^2(\varGamma _2(\breve{x}_\imath ))+2K_{\mathcal {u}}\Vert \nu _\imath \Vert ^2+4b_{\bar{\tau }}^2. \end{aligned}$$
(7.46)

As \(|\tanh (\cdot )|\le 1\), it could be obtained that

$$\begin{aligned} \dot{Y}_b\le -\lambda _m(\varUpsilon )\Vert x\Vert ^2+2K_{\mathcal {u}}\Vert \nu _\imath \Vert ^2+\sigma , \end{aligned}$$
(7.47)

where \(\sigma =16\hbar ^2+4b_{\bar{\tau }}^2+\theta b_V+b_{\varXi _1}\).

Taking the derivative of \(Y_c\), we derive that

$$\begin{aligned} \dot{Y}_c=-2\tilde{\omega }^T\bar{\xi }\bar{\xi }^T\tilde{\omega }+2\tilde{\omega }^T\breve{\xi }e_H. \end{aligned}$$
(7.48)

In light of Young’s inequality, it follows that

$$\begin{aligned} 2\tilde{\omega }^T\breve{\xi }e_H\le 2\tilde{\omega }^T\bar{\xi }e_H\le \tilde{\omega }^T\bar{\xi }\bar{\xi }^T\tilde{\omega }+e_H^T e_H. \end{aligned}$$
(7.49)

Then (7.48) can be further expressed as

$$\begin{aligned} \dot{Y}_c\le -\tilde{\omega }^T\bar{\xi }\bar{\xi }^T\tilde{\omega }+e_H^T e_H \le -\lambda _m(\delta )\Vert \tilde{\omega }\Vert ^2+b_{eH}^2, \end{aligned}$$
(7.50)

where \(\delta =\bar{\xi }\bar{\xi }^T\).

According to (7.47) and (7.50), when the medicine indication (7.40) is satisfied, we can derive that

$$\begin{aligned} \dot{Y}\le&-(1-\eta ^2)\lambda _m(\varUpsilon )\Vert x\Vert ^2-\eta ^2\lambda _m(\varUpsilon )\Vert x\Vert ^2+2K_{\mathcal {u}}\Vert \nu _\imath \Vert ^2 \nonumber \\&-\lambda _m(\delta )\Vert \tilde{\omega }\Vert ^2+b_{eH}^2+\sigma \nonumber \\ \le&-\eta ^2\lambda _m(\varUpsilon )\Vert x\Vert ^2-\lambda _m(\delta )\Vert \tilde{\omega }\Vert ^2+b_{eH}^2+\sigma . \end{aligned}$$
(7.51)

Then it can be concluded that \(\dot{Y}<0\) when either of the following conditions holds:

$$\begin{aligned} \Vert x\Vert >\frac{1}{\eta }\sqrt{\frac{b_{eH}^2+\sigma }{\lambda _m(\varUpsilon )}}, \end{aligned}$$
(7.52)

or

$$\begin{aligned} \Vert \tilde{\omega }\Vert >\sqrt{\frac{b_{eH}^2+\sigma }{\lambda _m(\delta )}}. \end{aligned}$$
(7.53)

Thus x and \(\tilde{\omega }\) are demonstrated to be UUB.

Case II: A regulation occurs, i.e., \(t=z_\imath \). The difference of Y is presented as

$$\begin{aligned} \triangle Y=&\underbrace{V^{*}(\breve{x}_{\imath +1})-V^{*}(\breve{x}_{\imath })}_{\triangle Y_a}+\underbrace{V^{*}(x(z_\imath ^+))-V^{*}(x(z_\imath ))}_{\triangle Y_b} \nonumber \\&+\underbrace{\frac{1}{\ell }\tilde{\omega }^T(z_\imath ^+)\tilde{\omega }(z_\imath ^+)-\frac{1}{\ell }\tilde{\omega }^T(z_\imath )\tilde{\omega }(z_\imath )}_{\triangle Y_c}. \end{aligned}$$
(7.54)

From the analysis in Case I, it can be derived that \(\dot{Y}<0\) when (7.52) or (7.53) is satisfied. It can be further deduced that \(Y_b+Y_c\) is monotonically decreasing when \(t\in [z_\imath ,z_{\imath +1})\), that is,

$$\begin{aligned} Y_b(x(z_\imath ))+Y_c(x(z_\imath ))\ge Y_b(x(z_\imath +\epsilon ))+Y_c(x(z_\imath +\epsilon )), \end{aligned}$$
(7.55)

where \(\epsilon \in (0,z_{\imath +1}-z_\imath )\). According to the properties of limits, we can obtain

$$\begin{aligned} Y_b(x(z_\imath ))+Y_c(x(z_\imath ))\ge Y_b(x(z_\imath ^+))+Y_c(x(z_\imath ^+)), \end{aligned}$$
(7.56)

with \(x(z_\imath ^+)=\lim _{\epsilon \rightarrow 0}x(z_\imath +\epsilon )\). More specifically, it follows that

$$\begin{aligned} V^{*}(x(z_\imath ))+\frac{1}{\ell }\tilde{\omega }^{T}(z_\imath )\tilde{\omega }(z_\imath )\ge V^{*}(x(z_\imath ^{+}))+\frac{1}{\ell }\tilde{\omega }^{T}(z_\imath ^{+})\tilde{\omega }(z_\imath ^{+}). \end{aligned}$$
(7.57)

As x is proved to be UUB, it can be obtained that

$$\begin{aligned} V^{*}(\breve{x}_{\imath +1})\le V^{*}(\breve{x}_\imath ). \end{aligned}$$
(7.58)

From (7.57) and (7.58), it is derived that \(\triangle Y<0\), which indicates that the constructed Lyapunov function (7.41) is monotonically decreasing when \(t=z_\imath \). \(\blacksquare \)

Remark 7.3

\(\eta \) in (7.40) is the regulation parameter that determines the frequency of medicine dosage regulation. A large \(\eta \) shrinks the threshold, so the medicine dosage is regulated frequently, while a small \(\eta \) implies that regulation occurs rarely. It can be set to an appropriate value according to the clinical data.

Remark 7.4

Theorem 7.2 indicates that the designed MDRM-based approximate optimal strategy (7.35) can asymptotically stabilize system (7.6). The medicine indication (7.40), the cornerstone of the MDRM, provides a reasonable reference threshold for the therapeutic strategy. When the difference between the current clinical data and the latest reference data exceeds the threshold, the medicine dosage is regulated, and the current indication data are recorded and used as the new reference data from then on. The designed therapeutic strategy can thus be regulated in a timely manner and only when necessary, according to the medicine indication.
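The regulation rule described in this remark can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the chapter's implementation: the names `DosageRegulator`, `policy` and `threshold` are hypothetical, the gap between the current and the reference data is measured here by a Euclidean norm, and the threshold is left as a user-supplied function standing in for the medicine indication (7.40).

```python
import numpy as np


class DosageRegulator:
    """Hold the dosage constant until the indication gap exceeds a threshold.

    policy    : hypothetical callable, maps the reference state to a dosage
    threshold : hypothetical callable, maps the current state to a scalar
                threshold (a stand-in for the medicine indication (7.40))
    """

    def __init__(self, policy, threshold):
        self.policy = policy
        self.threshold = threshold
        self.reference = None   # latest recorded reference data
        self.dosage = None      # dosage currently being applied
        self.updates = 0        # number of dosage regulations so far

    def step(self, x):
        """Return the dosage to apply for the current clinical data x."""
        x = np.asarray(x, dtype=float)
        first_call = self.reference is None
        # Assumption: the gap is the Euclidean distance to the reference data.
        if first_call or np.linalg.norm(x - self.reference) > self.threshold(x):
            self.reference = x.copy()                  # record new reference data
            self.dosage = self.policy(self.reference)  # regulate the dosage
            self.updates += 1
        return self.dosage


# Toy usage with made-up policy and threshold values.
regulator = DosageRegulator(policy=lambda ref: 0.02 * ref[0],
                            threshold=lambda x: 0.1)
for state in ([0.8, 0.0, 0.2, 0.05], [0.75, 0.0, 0.2, 0.05], [0.6, 0.0, 0.2, 0.05]):
    print(regulator.step(state), regulator.updates)
```

Between two regulations the dosage is computed from the stored reference data rather than from the current data, which is what allows the regulation frequency to drop when the data change slowly.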

Remark 7.5

The discount factor is introduced to keep the value function from becoming unbounded or vanishing as rewards accumulate, and it reflects the fact that an immediate return is worth more than a delayed one. Human trials have indicated that the preference for immediate returns follows an approximately exponential pattern, and the discount factor is used to model this cognitive and biological process in decision making.
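As a rough illustration of the boundedness argument, assume that the discount enters the value function as an exponential weight \(e^{-\theta (\tau -t)}\) with discount factor \(\theta >0\), and that the running cost is bounded above by a constant \(\bar{r}\). Both are assumptions made here for illustration; the precise form is the one given in (7.7). Under these assumptions the accumulated cost satisfies

$$\begin{aligned} \int _t^{\infty }e^{-\theta (\tau -t)}\,\bar{r}\,\textrm{d}\tau =\frac{\bar{r}}{\theta }, \end{aligned}$$

which is finite for every \(\theta >0\), whereas the undiscounted integral (\(\theta =0\)) diverges whenever \(\bar{r}>0\). For instance, \(\theta =0.5\) would give the bound \(2\bar{r}\).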

7.5 Simulation Study

In this section, we consider system (7.6), which is the simplified version of the growth dynamics of cells and viruses described by (7.1)–(7.4). Based on system (7.6), a simulation experiment is conducted to show the effectiveness of the proposed ADP method with the medicine regulation mechanism.

According to clinical medical statistics and the literature [36, 37, 40], the parameters associated with the dynamics (7.1)–(7.4) are presented in Table 7.1. After nondimensionalization, the corresponding parameters are set as \(a_1=0.36\), \(a_2=0.1\), \(b_1=0.36\), \(b_2=0.48\), \(b_3=0.16\), \(c_1=0.1278\), \(c_2=0.2\), \(c_3=0.036\), \(d_1=0.6\), and \(d_2=0.29\). The initial state vector is \([0.8,0,0.2,0.05]^T\). The minimum and maximum thresholds are given by \(\mathcal {u}_m=0\) and \(\mathcal {u}_M=0.02\). For the discounted value function (7.7) of system (7.6), the parameters are chosen as \(\varUpsilon =0.2I_{4\times 4}\) and \(\theta =0.5\).

Table 7.1 Parameter specifications of the tumor-virus-immune system

Fig. 7.1 The population evolution of uninfected tumor cells

Fig. 7.2 The population evolution of infected tumor cells

Fig. 7.3 The population evolution of free oncolytic virus

Fig. 7.4 The population evolution of immune cells

Fig. 7.5 The curves of the therapeutic strategies

Fig. 7.6 The population evolutions of uninfected tumor cells

Fig. 7.7 The population evolutions of infected tumor cells

For the critic network, we select the activation function as \([x_1^2\), \(x_1 x_2\), \(x_1 x_3\), \(x_1 x_4\), \(x_2^2\), \(x_2 x_3\), \(x_2 x_4\), \(x_3^2\), \(x_3 x_4\), \(x_4^2]^{T}\). The other parameters are set as \(K_{\mathcal {u}}=20\), \(\zeta =0.9\) and \(\ell =1.6\).
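The quadratic activation vector above can be written compactly in Python. The following is a minimal sketch, assuming the critic output is the usual linear-in-weights form (weight vector times activation vector); the function names and the weight vector are hypothetical and are not taken from the chapter.

```python
import numpy as np


def critic_basis(x):
    """Quadratic activations [x1^2, x1*x2, ..., x3*x4, x4^2] of the state x."""
    x = np.asarray(x, dtype=float)
    i, j = np.triu_indices(x.size)   # index pairs (i <= j) in row-major order
    return x[i] * x[j]


def critic_basis_jacobian(x):
    """Jacobian of critic_basis: one row per activation, one column per state."""
    x = np.asarray(x, dtype=float)
    i, j = np.triu_indices(x.size)
    rows = np.arange(i.size)
    jac = np.zeros((i.size, x.size))
    jac[rows, i] += x[j]             # d(x_i * x_j)/dx_i = x_j
    jac[rows, j] += x[i]             # d(x_i * x_j)/dx_j = x_i (sums to 2*x_i if i == j)
    return jac


def value_estimate(x, weights):
    """Assumed linear-in-weights critic output for a hypothetical weight vector."""
    return float(np.dot(weights, critic_basis(x)))


x0 = np.array([0.8, 0.0, 0.2, 0.05])   # initial state vector from the text
w = np.ones(10)                        # placeholder weights, for illustration only
print(critic_basis(x0))
print(critic_basis_jacobian(x0).shape)
print(value_estimate(x0, w))
```

For a four-dimensional state this yields ten activations, in the same order as listed in the text, and the Jacobian is the quantity typically needed when the gradient of the approximate value function enters the control law.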

The simulation results are shown in Figs. 7.1–7.7. For model (7.5), the evolution trajectories of the states are depicted in Figs. 7.1–7.4. From Fig. 7.1, we can observe that under the attacks from oncolytic viruses and immune cells, the population of uninfected tumor cells rapidly declines and settles at a very low value after \(t=150\) d. Figures 7.2 and 7.3 reveal that the population of infected tumor cells and that of virus particles are largely proportional. Figure 7.4 shows that the immune cells are activated by the uninfected and infected tumor cells to kill tumor cells.

The medicine dosage of the derived approximate optimal therapeutic strategy and that of the initial strategy are compared in Fig. 7.5. The dosage of the obtained strategy is clearly lower than that of the initial strategy, and the input dosages of both strategies are constrained by the pre-designed thresholds. This is of great practical significance, since excess medicine may threaten the health of patients and incur considerable cost. Furthermore, the frequency of dosage regulation decreases as the clinical data improve, which means that, with the aid of the medicine regulation mechanism, the dosage is regulated in a timely manner and only when necessary.

Figures 7.6 and 7.7 present the population curves of uninfected and infected tumor cells under the derived strategy with different burst sizes of viruses, that is, \(\kappa =2,5\). This verifies that the obtained therapeutic strategy can effectively kill tumor cells with oncolytic viruses of different burst sizes. However, when the parameter \(\kappa \) is large enough, oscillations may occur. When the innate immune response is considered, the tumor-virus-immune system becomes very complicated. Although viruses with large \(\kappa \) produce more replicas and infect more tumor cells, the resulting reduction of tumor cells simultaneously deactivates the immune response. The viruses then dominate the dynamics, and the contest between tumor cells and viruses can last a long time, so that the oscillation occurs repeatedly. The oncolytic virus can effectively kill tumor cells, while the immune response can reduce the killing efficiency of the viruses and block their infection. Furthermore, the activated immune response can eliminate tumor cells as well. There is thus a subtle balance between the viruses and the immune cells, which demands further investigation.

7.6 Conclusion

A medicine regulation mechanism has been designed such that a constrained therapeutic strategy based on virotherapy can be obtained to eliminate tumor cells, guaranteeing that the medicine dosage is regulated in a timely manner and only when necessary. Firstly, a mathematical model is utilized to describe the relations among the uninfected tumor cells, infected tumor cells, oncolytic viruses and immune cells. Meanwhile, for the simplified version of the tumor-virus-immune model, a non-quadratic function is proposed to formulate the value function and to obtain the HJBE. Secondly, to address the optimal therapeutic strategy, a single-critic architecture has been designed to seek the approximate solution of the HJBE through ADP. Finally, the simulation results have verified the effectiveness of the proposed method. Furthermore, nonzero-sum optimal control based on differential games will be a new frontier in the treatment of tumors, cardiovascular and cerebrovascular diseases, and osteoporosis, as well as in orthodontic treatment.