The stability analysis of dynamical systems, which are ubiquitous in nature, has long been an active research topic, and many approaches have been proposed. However, control scientists often demand optimality in addition to stability of the controlled system. In the 1950s and 1960s, motivated by the development of space technology and the practical use of digital computers, the theory of optimization of dynamical systems developed rapidly, forming an important branch of the discipline: optimal control. It is increasingly used in many fields, such as space technology, systems engineering, economic management and decision-making, population control, and the optimization of multi-stage process equipment. In 1957, Bellman proposed an effective tool for solving optimal control problems: the dynamic programming (DP) method [1]. At the heart of this approach is Bellman’s principle of optimality, which states that an optimal policy for a multi-stage decision process has the property that, whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the initial decision. This principle can be reduced to a basic recursive formula for solving multi-stage decision problems by starting at the end and working backward to the beginning. It applies to a wide range of discrete, continuous, linear, nonlinear, deterministic, and stochastic systems.

ADP is a new approach to approximately optimal control, and it is a current research focus in the international optimization community. The ADP method uses a function approximation structure to approximate the solution of the Hamilton-Jacobi-Bellman (HJB) equation and uses offline iteration or online updating to obtain an approximately optimal control strategy for the system, which can effectively solve the optimal control problem of nonlinear systems [2,3,4,5,6,7,8,9,10,11]. Bertsekas et al. summarized neuro-dynamic programming in [12, 13], describing in detail dynamic programming, the structure of neural networks, and training algorithms. Meanwhile, several effective methods have been proposed for applying neuro-dynamic programming. Si et al. summarized the development of ADP methods across disciplines and discussed the connection of DP and ADP methods with artificial intelligence, approximation theory, control theory, operations research, and statistics [14]. In [15], Powell showed how to use ADP methods to solve deterministic or stochastic optimization problems and pointed out directions for ADP methods. In [16], Balakrishnan et al. reviewed previous approaches to the design of feedback controllers for dynamic systems using the ADP method, covering both model-based and model-free cases. In [17], the ADP method was described from the perspectives of requiring and not requiring initial stability.

The ADP method has a distinctive algorithm and structure compared with other existing optimal control methods. It overcomes the drawback that classical variational theory cannot handle optimal control problems with closed-set constraints on the control variables. Like the maximum principle, the ADP method is suitable not only for optimal control problems with open-set constraints, but also for those with closed-set constraints. While the maximum principle can only provide necessary conditions for optimality, the DP method gives sufficient conditions. However, direct application of the DP method is difficult because solving the HJB equation suffers from the so-called “curse of dimensionality”. Hence the ADP method, as an approximate solution technique for DP, overcomes this limitation of the DP method. It is more suitable for application to systems with strong coupling, strong nonlinearity, and high complexity. For example, [18] presented a constrained adaptive dynamic programming (CADP) algorithm that can solve general nonlinear non-affine optimal control problems with known dynamics. Unlike previous ADP algorithms, it handles problems with state constraints directly by proposing a constrained generalized policy iteration framework that transforms the traditional policy improvement process into a constrained policy optimization problem with state constraints. To solve the robust tracking control problem, [19] designed an online adaptive learning structure to build a robust tracking controller for nonlinear uncertain systems. The work in [20] proposed a biased-policy iterative method for solving data-driven optimal control problems for unknown continuous-time linear systems; by adding a bias parameter, it further relaxes the conditions on the initial admissible controller. The work in [21] made a first attempt at ADP control for a nonlinear Itô-type stochastic system, transforming a complex optimal tracking control problem into an optimal stabilization problem by reconstructing a new stochastic augmented system; the use of a critic neural network in the iterative learning simplifies the actor-critic structure and reduces the computational load. The ADP approach is also widely used in a number of practical systems. The work in [22] developed an event-triggered adaptive dynamic programming method to design formation controllers and solved the distributed formation control problem for multi-rotor unmanned aircraft systems. For hybrid wind/solar energy systems, [23] presented an adaptive dynamic programming method based on Bellman’s principle that enables accurate current sharing and voltage regulation; based on this approach, the optimal control variables for each energy body objective can be obtained.

Optimal control of nonlinear systems has been one of the hot topics and challenges in the field of control research. As a novel technique for solving optimal control problems, the ADP method integrates the theories of neural networks, adaptive critic design, reinforcement learning, and classical dynamic programming to overcome the “curse of dimensionality”, and it enables the acquisition of an approximately optimal closed-loop feedback control law. As a consequence, delving deeper into the theory of ADP and its algorithms for solving optimal control of nonlinear systems holds immense theoretical significance and practical application value. Although research on the ADP method is still in its early stages, this book aims to equip readers with a foundational understanding of the method and empower them to apply it to diverse optimization problems in fields such as medicine, science, and engineering.

1.1 Optimal Control Formulation

There are several schemes of dynamic programming [1, 13, 24]. One can consider discrete-time systems or continuous-time systems, linear systems or nonlinear systems, time-invariant systems or time-varying systems, deterministic systems or stochastic systems, etc. Discrete-time (deterministic) nonlinear (time-invariant) dynamical systems will be discussed first. Time-invariant nonlinear systems cover most of the application areas and discrete time is the basic consideration for digital implementation.

1.1.1 ADP for Discrete-Time Systems

Consider the following discrete-time nonlinear systems:

$$\begin{aligned} x_{k+1}=F(x_k,u_k), \ k=0,1,2,\ldots , \end{aligned}$$
(1.1)

where \(x_k \in \mathbb {R}^n\) is the state vector and \(u_k \in \mathbb {R}^m\) is the control input vector. The corresponding cost function (performance index function) of the system takes the form of

$$\begin{aligned} J(x_k, \overline{u}_k)= \sum \limits _{i=k}^{\infty } \gamma ^{i-k}U(x_i,u_i), \end{aligned}$$
(1.2)

where \(\overline{u}_k=(u_k,u_{k+1},...)\) is the control sequence starting at time k, \( U(x_i,u_i) \) is the utility function, and \( \gamma \) is the discount factor, satisfying \( 0<\gamma <1 \). Note that the function J depends on the initial time k and the initial state \( x_k \). Generally, it is desired to determine \(\overline{u}_0=(u_0,u_{1},...)\) so that \( J(x_0, \overline{u}_0 )\) is optimized (i.e., maximized or minimized). We will use \(\overline{u}_0^*=(u_0^*,u_1^*,...)\) and \( J^*(x_0 )\) to denote the optimal control sequence and the optimal cost function, respectively. The objective of the dynamic programming problem in this book is to determine a control sequence \(u_k , k = 0, 1, \ldots ,\) so that the function J (i.e., the cost) in (1.2) is minimized. The optimal cost function is defined as

$$\begin{aligned} J^*(x_0)= \inf \limits _{\overline{u}_0}J(x_0, \overline{u}_0 )=J(x_0, \overline{u}_0^* ), \end{aligned}$$
(1.3)

which is dependent upon the initial state \( x_0 \).

The control action may be determined as a function of the state. In this case, we write \( u_k = u(x_k), \forall k\). Such a relationship, or mapping \(u: \mathbb {R}^n \rightarrow \mathbb {R}^m\), is called a feedback control, control policy, or simply policy; it is also called a control law. For a given control policy \(\mu \), the cost function in (1.2) is rewritten as

$$\begin{aligned} J^{\mu }(x_k)= \sum \limits _{i=k}^{\infty } \gamma ^{i-k}U(x_i,\mu (x_i)), \end{aligned}$$
(1.4)

which is the cost function for system (1.1) starting at \( x_k \) when the policy \( u_k = \mu (x_k) \) is applied. The optimal cost for system (1.1) starting at \( x_0 \) is determined as

$$\begin{aligned} J^*(x_0)= \inf \limits _{\mu }J^{\mu }(x_0 )=J^{\mu ^*}(x_0 ), \end{aligned}$$
(1.5)

where \( \mu ^*\) denotes the optimal policy.
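
As a quick illustration of how the cost (1.4) of a given policy can be evaluated in practice, the following Python sketch truncates the infinite discounted sum at a finite horizon; since \(0<\gamma <1\), the discarded tail is negligible under a stabilizing policy. The scalar dynamics, utility, and feedback gain used here are hypothetical choices for demonstration only, not a system from the text.

```python
import numpy as np

# Minimal sketch: evaluating J^mu(x_k) in (1.4) by truncating the infinite
# discounted sum. F, U, and the feedback gain are hypothetical examples.
def F(x, u):
    return 0.9 * x + u            # x_{k+1} = F(x_k, u_k)

def U(x, u):
    return x ** 2 + u ** 2        # utility U(x_k, u_k)

def evaluate_policy_cost(x0, mu, gamma=0.95, horizon=500):
    """Partial sum of (1.4); the tail is O(gamma^horizon) for a stabilizing policy."""
    x, cost = x0, 0.0
    for k in range(horizon):
        u = mu(x)
        cost += gamma ** k * U(x, u)
        x = F(x, u)
    return cost

print(evaluate_policy_cost(x0=1.0, mu=lambda x: -0.5 * x))
```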

Dynamic programming is based on Bellman’s principle of optimality [1, 13, 24]: An optimal (control) policy has the property that no matter what previous decisions have been, the remaining decisions must constitute an optimal policy with regard to the state resulting from those previous decisions.

According to Bellman, the minimum cost from any state \( x_k \) starting at time k consists of two parts: the utility incurred at time k and the discounted minimum cumulative cost from time \( k + 1 \) onward, minimized jointly over the control \( u_k \). In terms of equations, this means that

$$\begin{aligned} \begin{aligned} J^*(x_k)&= \min \limits _{u_k} \{U(x_k,u_k)+ \gamma J^*(x_{k+1}) \} \\&=\min \limits _{u_k} \{U(x_k,u_k)+ \gamma J^*(F(x_{k},u_k)) \}. \end{aligned} \end{aligned}$$
(1.6)

This is known as the Bellman optimality equation, or the discrete-time Hamilton-Jacobi-Bellman (HJB) equation. One then has the optimal policy, i.e., the optimal control \(u_k^*\) at time k is the \( u_k \) that achieves this minimum as

$$\begin{aligned} u_k^*= \arg \min \limits _{u_k} \{U(x_k,u_k)+\gamma J^*(x_{k+1})\}. \end{aligned}$$
(1.7)

Since one must know the optimal policy at time \( k+1 \) in order to use (1.6) to determine the optimal policy at time k, Bellman’s principle yields a backward-in-time procedure for solving the optimal control problem. It is the basis for the dynamic programming algorithms in extensive use in control system theory, operations research, and elsewhere.
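
For low-dimensional problems, the backward recursion implied by (1.6) and (1.7) can be carried out directly by value iteration on a discretized state and control grid. The following Python sketch illustrates this; the scalar dynamics and utility are hypothetical, and linear interpolation is used to evaluate \(J\) at off-grid successor states.

```python
import numpy as np

# A minimal value-iteration sketch for the Bellman equation (1.6) on a grid.
# The dynamics F and utility U are hypothetical illustrative choices.
def F(x, u):
    return 0.8 * x + u

def U(x, u):
    return x ** 2 + 0.5 * u ** 2

gamma = 0.95
xs = np.linspace(-2.0, 2.0, 201)      # state grid
us = np.linspace(-1.0, 1.0, 41)       # control grid
J = np.zeros_like(xs)                 # initial guess J_0(x) = 0

for _ in range(200):
    J_new = np.empty_like(J)
    for i, x in enumerate(xs):
        # U(x,u) + gamma * J(F(x,u)) for every candidate control u
        x_next = np.clip(F(x, us), xs[0], xs[-1])
        q = U(x, us) + gamma * np.interp(x_next, xs, J)
        J_new[i] = q.min()            # Bellman backup (1.6)
    if np.max(np.abs(J_new - J)) < 1e-8:
        break
    J = J_new

# Greedy (approximately optimal) control at a given state, cf. (1.7)
x0 = 1.0
q0 = U(x0, us) + gamma * np.interp(np.clip(F(x0, us), xs[0], xs[-1]), xs, J)
print("approximately optimal control at x0:", us[np.argmin(q0)])
```

Since the Bellman operator is a contraction for \(0<\gamma <1\), the sweeps converge geometrically; this brute-force tabulation is exactly what becomes infeasible in high dimensions and motivates the function-approximation structures used by ADP.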

1.1.2 ADP for Continuous-Time Systems

For continuous-time systems, the cost function J is also the key to dynamic programming. By minimizing J, one obtains the optimal cost function \( J^* \), which is often a Lyapunov function of the system. As a consequence of Bellman’s principle of optimality, \( J^* \) satisfies the Hamilton-Jacobi-Bellman (HJB) equation. Usually, however, one cannot obtain an analytical solution of the HJB equation; even finding an accurate numerical solution is very difficult due to the so-called curse of dimensionality.

Consider the continuous-time nonlinear dynamical system

$$\begin{aligned} \dot{x}(t)=F(x(t),u(t)), t \ge t_0, \end{aligned}$$
(1.8)

where \(x \in \mathbb {R}^n\) is the state vector and \(u \in \mathbb {R}^m\) is the control input vector. The corresponding cost function of the system can be defined as

$$\begin{aligned} J(x_0, u)= \int _{t_0}^{\infty } U(x(\tau ),u(\tau ))d\tau , \end{aligned}$$
(1.9)

with utility function \( U(x, u) \ge 0 \), where \( x(t_0) = x_0 \). Bellman’s principle of optimality can also be applied to continuous-time systems. In this case, the optimal cost

$$\begin{aligned} J^*(x(t))= \min \limits _{u(t)} \{J(x(t),u(t)) \} , t \ge t_0, \end{aligned}$$
(1.10)

satisfies the HJB equation

$$\begin{aligned} -\frac{\partial J^*}{\partial t}=\min _{u(t)}\left\{ U(x, u)+\left( \frac{\partial J^*}{\partial x}\right) ^{T} F(x, u)\right\} . \end{aligned}$$
(1.11)

The HJB equation in (1.11) can be derived from Bellman’s principle of optimality [24]. Meanwhile, the optimal control \( u^* (t) \) is the one that minimizes the cost function,

$$\begin{aligned} u^*(t)=\arg \min _{u(t)}\{J(x(t), u(t))\}, t \ge t_0. \end{aligned}$$
(1.12)
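
As a standard special case that connects (1.11) and (1.12) to a familiar result (not worked out in the text), consider the linear time-invariant system \(\dot{x}=Ax+Bu\) with quadratic utility \(U(x,u)=x^{T}Qx+u^{T}Ru\), \(Q \ge 0\), \(R>0\), over an infinite horizon. Trying \(J^*(x)=x^{T}Px\) with \(P=P^{T}>0\), the time-derivative term vanishes and (1.11) becomes

$$ 0=\min _{u}\left\{ x^{T} Q x+u^{T} R u+2 x^{T} P(A x+B u)\right\} , $$

whose minimizer is \(u^{*}=-R^{-1}B^{T}Px\); substituting back shows that P must satisfy the algebraic Riccati equation

$$ A^{T} P+P A-P B R^{-1} B^{T} P+Q=0 . $$

For general nonlinear \(F(x,u)\), no such closed form is available, which is precisely where approximate solution methods enter.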

In 1994, Saridis and Wang [25] studied the nonlinear stochastic systems described by

$$\begin{aligned} \textrm{d} x=f(x, t) \textrm{d} t+g(x, t) u \,\textrm{d} t+h(x, t) \textrm{d} w, \ t_0 \le t \le T \end{aligned}$$
(1.13)

with the cost function

$$ J\left( x_0, u\right) =\mathbb {E}\left\{ \int _{t_0}^T\left( Q(x, t)+u^{T} u\right) \textrm{d} t+\phi (x(T), T): x\left( t_0\right) =x_0\right\} $$

where \(x \in \mathbb {R}^n, u \in \mathbb {R}^m\), and \(w \in \mathbb {R}^k\) are the state vector, the control vector, and a separable Wiener process, respectively; f, g, and h are measurable system functions; and Q and \(\phi \) are nonnegative functions. A value function V is defined as

$$ V(x, t)=\mathbb {E}\left\{ \int _t^T\left( Q(x, t)+u^{T} u\right) \textrm{d} t+\phi (x(T), T): x\left( t_0\right) =x_0\right\} , t \in I, $$

where \(I \triangleq \left[ t_0, T\right] \). The HJB equation is modified to become the following equation

$$\begin{aligned} \frac{\partial V}{\partial t}+\mathscr {L}_u V+Q(x, t)+u^{T} u=\nabla V \end{aligned}$$
(1.14)

where \(\mathscr {L}_u\) is the infinitesimal generator of the stochastic process specified by (1.13) and is defined by

$$\begin{aligned} \mathscr {L}_u V =&\frac{1}{2} {\text {tr}}\left\{ h(x, t) h^{T}(x, t) \frac{\partial }{\partial x}\left( \frac{\partial V(x, t)}{\partial x}\right) ^{T}\right\} \nonumber \\&+\left( \frac{\partial V(x, t)}{\partial x}\right) ^{T}(f(x, t)+g(x, t) u) \end{aligned}$$

Depending on whether \(\nabla V \le 0\) or \(\nabla V \ge 0\), an upper bound \(\bar{V}\) or a lower bound \(\underline{V}\) of the optimal cost \(J^*\) is found by solving equation (1.14) such that \(\underline{V} \le J^* \le \bar{V}\). Using \(\bar{V}\) (or \(\underline{V}\)) as an approximation to \(J^*\), one can solve for a control law. This leads to so-called suboptimal control. It was proved that such controls are stable for the infinite-time stochastic regulator optimal control problem, where the cost function is defined as

$$ J\left( x_0, u\right) =\lim _{T \rightarrow \infty } \mathbb {E}\left\{ \frac{1}{T} \int _{t_0}^T\left( Q(x, t)+u^{T} u\right) \textrm{d} t: x\left( t_0\right) =x_0\right\} $$

The benefit of suboptimal control is that the bound \(\bar{V}\) (or \(\underline{V}\)) of the optimal cost \(J^*\) can be approximated by an iterative process. Beginning from certain chosen functions \(u_0\) and \(V_0\), let

$$\begin{aligned} u_i(x, t)=-\frac{1}{2} g^{T}(x, t) \frac{\partial V_{i-1}(x, t)}{\partial x}, i=1,2, \ldots . \end{aligned}$$
(1.15)

Then, by repeatedly applying (1.14) and (1.15), one obtains a sequence of functions \(V_i\). This sequence \(\left\{ V_i\right\} \) converges to the bound \(\bar{V}\) (or \(\underline{V}\)) of the cost function \(J^*\). Consequently, \(u_i\) approximates the optimal control as i tends to \(\infty \). It is important to note that the sequences \(\left\{ V_i\right\} \) and \(\left\{ u_i\right\} \) are obtainable by computation and they approximate the optimal cost and the optimal control law, respectively.

Some further theoretical results for ADP have been obtained in [2], which investigated stability and optimality for some special cases of ADP. Specifically, Murray et al. [2] studied the (deterministic) continuous-time affine nonlinear systems

$$\begin{aligned} \dot{x}=f(x)+g(x) u, x\left( t_0\right) =x_0 \end{aligned}$$
(1.16)

with the cost function

$$\begin{aligned} J(x, u)=\int _{t_0}^{\infty } U(x, u) \textrm{d} t \end{aligned}$$
(1.17)

where \(U(x, u)=Q(x)+u^{T} R(x) u, Q(x)>0\) for \(x \ne 0\) and \(Q(0)=0\), and \(R(x)>0\) for all x. Similar to [25], an iterative procedure is proposed to find the control law as follows. For the plant (1.16) and the cost function (1.17), the HJB equation leads to the following optimal control law

$$\begin{aligned} u^*(x)=-\frac{1}{2} R^{-1}(x) g^{T}(x)\left[ \frac{\textrm{d} J^*(x)}{\textrm{d} x}\right] . \end{aligned}$$
(1.18)

Applying (1.17) and (1.18) repeatedly, one obtains sequences of estimates of the optimal cost function \(J^*\) and the optimal control \(u^*\). Starting from an initial stabilizing control \(v_0(x)\), for \(i=0,1, \ldots \), the approximation is given by the following iterations between value functions

$$ V_{i+1}(x)=\int _t^{\infty } U\left( x(\tau ), v_i(x(\tau ))\right) \textrm{d} \tau $$

and control laws

$$ v_{i+1}(x)=-\frac{1}{2} R^{-1}(x) g^{T}(x)\left[ \frac{\textrm{d} V_{i+1}(x)}{\textrm{d} x}\right] $$

The following results were shown in [2].

(1) The sequence of functions \(\left\{ V_i\right\} \) obtained above converges to the optimal cost function \(J^*\).

(2) Each of the control laws \(v_{i+1}\) obtained above stabilizes the plant (1.16), for all \(i=0,1, \ldots \)

(3) Each of the value functions \(V_{i+1}(x)\) is a Lyapunov function of the plant, for all \(i=0,1, \ldots \)
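
To make this successive-approximation scheme concrete, the following Python sketch applies it to a scalar example: the value function \(V_{i+1}\) is evaluated on a state grid by integrating the running cost along Euler-simulated closed-loop trajectories, and the control update uses a numerical gradient of \(V_{i+1}\). The functions f, g, Q and the penalty R are hypothetical choices, not taken from [2].

```python
import numpy as np

# Sketch of the iteration between value functions and control laws above,
# on a scalar example with grid-based value functions. All model data are
# hypothetical illustrative choices.
f = lambda x: -0.5 * x            # internal dynamics (open-loop stable)
g = lambda x: 1.0                 # input gain
Q = lambda x: x ** 2              # state penalty
R = 1.0                           # control penalty

xs = np.linspace(-2.0, 2.0, 81)   # state grid for representing V_i and v_i
dt, T = 0.01, 20.0                # Euler step and (truncated) horizon

def evaluate_value(policy):
    """V_{i+1}(x) = integral of Q(x) + R v_i(x)^2 along the closed-loop flow."""
    V = np.zeros_like(xs)
    for i, x0 in enumerate(xs):
        x, cost = x0, 0.0
        for _ in range(int(T / dt)):
            u = np.interp(x, xs, policy)
            cost += (Q(x) + R * u ** 2) * dt
            x += (f(x) + g(x) * u) * dt      # explicit Euler step
        V[i] = cost
    return V

v = np.zeros_like(xs)             # v_0(x) = 0 is stabilizing for this f
for _ in range(10):
    V = evaluate_value(v)
    dVdx = np.gradient(V, xs)
    v = -0.5 / R * g(xs) * dVdx   # v_{i+1} = -(1/2) R^{-1} g^T dV_{i+1}/dx

# For these linear-quadratic choices the analytic optimum is about u*(x) = -0.618*x.
print("approximate feedback at x = 1:", np.interp(1.0, xs, v))
```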

Abu-Khalaf and Lewis [26] also studied the system (1.16) with the following value function

$$ V(x(t))=\int _t^{\infty } U(x(\tau ), u(\tau )) \textrm{d} \tau =\int _t^{\infty }\left( x^{T}(\tau ) Q x(\tau )+u^{T}(\tau ) R u(\tau )\right) \textrm{d} \tau $$

where Q and R are positive-definite matrices. The successive approximation to the HJB equation starts with an initial stabilizing control law \(v_0(x)\). For \(i=0,1, \ldots \), the approximation is given by the following iterations between policy evaluation

$$ 0=x^{T} Q x+v_i^{T}(x) R v_i(x)+\nabla V_i^{T}(x)\left( f(x)+g(x) v_i(x)\right) $$

and policy improvement

$$ v_{i+1}(x)=-\frac{1}{2} R^{-1} g^{T}(x) \nabla V_i(x) $$

where \(\nabla V_i(x)=\partial V_i(x) / \partial x\). In [26], the above iterative approach was applied to systems (1.16) with saturating actuators through a modified utility function, with convergence and optimality proofs showing that \(V_i \rightarrow J^*\) and \(v_i \rightarrow u^*\) as \(i \rightarrow \infty \). For continuous-time optimal control problems, the quest for successive approximations to the solution of the HJB equation has a long history; published works date back to as early as 1967 by Leake and Liu [26]. The brief overview presented here only serves as an entry point to many more recent results [26,27,28].
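
For the linear time-invariant special case of (1.16), the policy-evaluation/policy-improvement loop above reduces to what is commonly known as Kleinman's iteration: each evaluation step solves a Lyapunov equation and each improvement step updates a feedback gain, with convergence to the algebraic Riccati solution. The sketch below assumes known linear dynamics and a stabilizing initial gain; the matrices A, B, Q, R are arbitrary illustrative choices.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# Kleinman-style policy iteration for the linear-quadratic special case.
# A, B, Q, R are illustrative; A is Hurwitz, so K = 0 is a stabilizing start.
A = np.array([[0.0, 1.0],
              [-1.0, -0.5]])
B = np.array([[0.0],
              [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])

K = np.zeros((1, 2))
for _ in range(20):
    Ac = A - B @ K                                  # closed loop under v_i(x) = -K x
    # policy evaluation: Ac^T P + P Ac + Q + K^T R K = 0
    P = solve_continuous_lyapunov(Ac.T, -(Q + K.T @ R @ K))
    # policy improvement: v_{i+1}(x) = -R^{-1} B^T P x
    K = np.linalg.solve(R, B.T @ P)

print("converged gain K:\n", K)
print("value-function matrix P:\n", P)
```

The same two-step pattern (evaluate the current policy, then improve it greedily) is what the nonlinear iterations of [2, 26] generalize, with neural networks or other approximators taking the place of the quadratic value function.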

1.2 Publication Outline

The general layout of this monograph is as follows. Adaptive dynamic programming is used to design drug dosage regulation mechanisms that provide adaptive viral-therapy treatment strategies for input-constrained organisms, and the approach is extended to tumor cells, immune cells, and the interplay and regulation schemes within the immune system. The main contents of this monograph are as follows:

 

Chapter 1:

introduces the research background, development, and current status of ADP research worldwide, as well as the basic idea and design framework of ADP, covering both discrete-time and continuous-time systems.

Chapter 2:

investigates an optimal regulation scheme between tumor and immune cells based on the ADP approach. The therapeutic goal is to inhibit the growth of tumor cells to an allowable injury degree while maximizing the number of immune cells. A reliable controller is derived through the ADP approach to drive the cell populations to the specified ideal states. Firstly, the main objective is to weaken the negative effects caused by chemotherapy and immunotherapy, which means that minimal doses of chemotherapeutic and immunotherapeutic drugs are applied in the treatment process. Secondly, according to the nonlinear dynamical mathematical model of tumor cells, chemotherapy and immunotherapy drugs act as powerful regulatory measures, which constitutes a closed-loop control behavior. Finally, the system states and critic weight errors are proved to be uniformly ultimately bounded under the appropriate optimization control strategy, and simulation results are shown to demonstrate the effectiveness of the control methodology.

Chapter 3:

investigates the optimal control strategy problem for nonzero-sum games of the immune system based on adaptive dynamic programming. Firstly, the main objective is to approximate a Nash equilibrium between the tumor cells and the immune cell population, which is governed through chemotherapy drugs and immunoagents guided by the mathematical growth model of the tumor cells. Secondly, a novel intelligent nonzero-sum-game-based ADP scheme is put forward to solve the optimal control problem by reducing the growth rate of tumor cells while minimizing the doses of chemotherapy and immunotherapy drugs. Meanwhile, a convergence analysis of the iterative ADP algorithm is provided to prove feasibility. Finally, simulation examples are presented to demonstrate the availability and effectiveness of the research methodology.

Chapter 4:

is devoted to evolutionary-dynamics-oriented optimal control of the tumor-immune differential game system. Firstly, a mathematical model covering immune cells and tumor cells is established, considering the effects of chemotherapy drugs and immune agents. Secondly, the bounded optimal control problem is transformed into solving the HJB equation, considering the actual constraints and an infinite-horizon performance index based on minimizing the amount of medication administered. Finally, an approximate optimal control strategy is acquired through an iterative dual heuristic dynamic programming algorithm, which effectively avoids the curse of dimensionality and provides an optimal treatment scheme for clinical applications.

Chapter 5:

mainly proposes an evolutionary algorithm and its first application to developing therapeutic strategies for Ecological Evolutionary Dynamics Systems (EEDS), obtaining a balance between tumor cells and immune cells by rationally scheduling chemotherapeutic and immune drugs. Firstly, an EEDS nonlinear kinetic model is constructed to describe the relationship between tumor cells, immune cells, dose, and drug concentration. Secondly, the N-Level Hierarchy Optimization (NLHO) algorithm is designed and compared with 5 algorithms on 20 benchmark functions, which proves the feasibility and effectiveness of NLHO. Finally, we apply NLHO to EEDS to give a dynamic adaptive optimal control policy, and develop therapeutic strategies that reduce tumor cells while minimizing the harm of chemotherapy and immune drugs to the human body. The experimental results prove the validity of the research method.

Chapter 6:

investigates the optimal control strategy for the organism by using the ADP method under a single-critic network architecture. Firstly, a tumor model is established to formulate the interaction relationships among normal cells, tumor cells, endothelial cells, and the concentrations of drugs. Then, an ADP-based method with a single-critic network architecture is proposed to approximate the coupled HJEs under the medicine dosage regulation mechanism (MDRM). According to game theory, the approximate MDRM-based optimal strategy can be derived, which is of great practical significance. Owing to the proposed mechanism, the dosages of the chemotherapy and anti-angiogenic drugs can be regulated in a timely manner as needed. Furthermore, the stability of the closed-loop system under the obtained strategy is analyzed via Lyapunov theory. Finally, a simulation experiment is conducted to verify the effectiveness of the proposed method.

Chapter 7:

investigates the constrained adaptive control strategy based on virotherapy for the organism using the MDRM. Firstly, the tumor-virus-immune interaction dynamics is established to model the relations among the tumor cells (TCs), virus particles, and the immune response. The ADP method is extended to approximately obtain the optimal strategy for the interaction system so as to reduce the population of TCs. Due to the consideration of asymmetric control constraints, non-quadratic functions are proposed to formulate the value function, from which the corresponding Hamilton-Jacobi-Bellman equation (HJBE), the cornerstone of ADP algorithms, is derived. Then, the ADP method with a single-critic network architecture integrating the MDRM is proposed to obtain approximate solutions of the HJBE and eventually derive the optimal strategy. The design of the MDRM makes it possible for the dosage of the agent containing oncolytic virus particles to be regulated in a timely manner as needed. Furthermore, the uniform ultimate boundedness of the system states and critic weight estimation errors is validated by Lyapunov stability analysis. Finally, simulation results are given to show the effectiveness of the derived therapeutic strategy.