1 Introduction

Since the first reported case of coronavirus disease 2019 (COVID-19) in early December 2019 in China, the disease has caused an ongoing crisis that has spread around the world at an unprecedented pace [1,2,3,4]. Acute respiratory syndrome can occur in severely ill patients, leading to multiple organ failure and, in some cases, death [5, 6]. It has been established that the spread rate of the present pandemic is much higher than that of the similar epidemics reported in 2003 and 2012, namely SARS coronavirus (SARS-CoV) and MERS coronavirus (MERS-CoV). To date, the crisis has resulted in a growing number of deaths all over the globe [7, 8].

Mathematical simulations have long been used to obtain insight into the mechanisms of disease transmission [9,10,11,12,13,14,15,16,17,18,19,20,21,22]. The essence of modeling lies in defining a set of equations that mimics the spread or dynamics of the system in reality [23, 24]. From the beginning of the current epidemic, mathematical models of its spread have been at the forefront of prediction and control of the novel coronavirus outbreak [25,26,27,28,29]. Using the available data on the reported number of infections, what is already known about how the virus spreads, and the confirmed numbers of deaths and hospitalizations, an accurate picture of the future course of the epidemic can be obtained [30, 31].

Up to now, to mitigate the spread of COVID-19 effectively, decision makers in all countries have applied various control policies, such as mandatory lockdowns, quarantining and isolating infected people, maintaining a minimum social distance, discouraging or banning crowded events, and requiring face masks in public [32,33,34,35]. Recently, several effective vaccines have been introduced to battle the pandemic. Some of them have passed all approval criteria, and countries are now using them. However, with the advent of approved vaccines, governments and decision makers face new challenges. To apply vaccines effectively, several questions have to be answered quickly and accurately. Which vaccination policies should be adopted? How should decision makers prioritize different groups of people? How should the vaccine be distributed over time? How much will the vaccine reduce the risk of infection? Since the dynamics of the disease are complicated and its spread is affected by several factors, answering these questions requires treating them as optimization problems, which motivated the current study. The present study aims to address these questions by proposing reinforcement learning-based optimal policies.

2 COVID-19 model with controls

In this study, an extended version of the “Susceptible-Exposed-Infectious-Recovered” (SEIR) compartmental model is introduced to investigate the spread of COVID-19. Using the Markov chain Monte Carlo (MCMC) method and fitting the proposed model to real data, the coefficients of the dynamic system have been derived.

As mentioned in [36] and [27], the total population, denoted \(N\), can be classified into eight epidemiological subclasses: uninfected but susceptible humans \(S\), exposed humans \(E\), asymptomatic infected humans who show no clinical symptoms but can infect healthy people \(A\), infected people showing clinical symptoms \(I\), quarantined humans who are uninfected but susceptible \({S}_{q}\), quarantined humans who are exposed to the infection \({E}_{q}\), hospitalized individuals \(H\), and recovered individuals \(R\). Under these assumptions, the model is given by defining \(q\) as the quarantine rate, \(\beta \) as the probability of transmission per contact, \(\varrho \) as the likelihood of developing symptoms among infected people, \(\sigma \) as the rate at which exposed individuals move to the infected classes, \(\lambda \) as the release rate of quarantined uninfected contacts, and \(c\) as the person-to-person contact rate. The disease-induced death rate is \(\alpha \). In this work, \({\delta }_{I}\) and \({\delta }_{q}\) stand for the transition rates of infected people and quarantined exposed people, respectively, to the hospitalized class. The recovery rate of asymptomatically infected patients is \({\gamma }_{A}\), \({\gamma }_{I}\) is the rate at which symptomatically infected individuals recover, and \({\gamma }_{H}\) is the rate at which hospitalized individuals recover. Based on these coefficients, the epidemic model describing the transmission dynamics is given by

$$ \begin{gathered} \frac{{\text{d}}}{{{\text{d}}t}}S = - \left( {\beta c + cq\left( {1 - \beta } \right)} \right)S\left( {I + \theta A} \right) + \lambda S_{q} \hfill \\ \frac{{\text{d}}}{{{\text{d}}t}}E = \beta c\left( {1 - q} \right)S\left( {I + \theta A} \right) - \sigma E \hfill \\ \frac{{\text{d}}}{{{\text{d}}t}}I = \sigma \varrho E - \left( {\delta _{I} + \alpha + \gamma _{I} } \right)I \hfill \\ \frac{{\text{d}}}{{{\text{d}}t}}A = \sigma \left( {1 - \varrho } \right)E - \gamma _{A} A \hfill \\ \frac{{\text{d}}}{{{\text{d}}t}}S_{q} = \left( {1 - \beta } \right)cqS\left( {I + \theta A} \right) - \lambda S_{q} \hfill \\ \frac{{\text{d}}}{{{\text{d}}t}}E_{q} = \beta cqS\left( {I + \theta A} \right) - \delta _{q} E_{q} \hfill \\ \frac{{\text{d}}}{{{\text{d}}t}}H = \delta _{I} I + \delta _{q} E_{q} - \left( {\alpha + \gamma _{H} } \right)H \hfill \\ \frac{{\text{d}}}{{{\text{d}}t}}R = \gamma _{I} I + \gamma _{A} A + \gamma _{H} H \hfill \\ \end{gathered} $$
(1)

where

$$c=\left({c}_{0}-{c}_{b}\right){e}^{{-r}_{1}t}+{c}_{b}$$
(2)
$$\frac{1}{{\delta }_{I}(t)}=\left(\frac{1}{{\delta }_{I0}}-\frac{1}{{ \delta }_{If}}\right){e}^{{-r}_{2}t}+\frac{1}{{ \delta }_{If}}$$
(3)

represent the person-to-person contact rate \(c\) and the diagnosis rate \({\delta }_{I}\), respectively. Equations (2) and (3) include six parameters defined as follows:

  • \({c}_{0}\): initial contact rate

  • \({c}_{\mathrm{b}}\): final (minimum) contact rate, which is smaller than \({c}_{0}\)

  • \({r}_{1}\): exponentially decreasing rate of contact rate

  • \({\delta }_{\mathrm{I}0}\): initial diagnosis rate

  • \({\delta }_{If}\): final (maximum) diagnosis rate

  • \({r}_{2}\): exponentially increasing rate of diagnosis rate

It is assumed that the contact rate decreases exponentially over time and the diagnosis rate increases exponentially with respect to time. Furthermore, we rewrite system (1) as follows:

$$ \dot{x} = f\left( x \right), \qquad f\left( {x\left( t \right)} \right) = \begin{bmatrix} - \left( \beta c + cq\left( 1 - \beta \right) \right)S\left( I + \theta A \right) + \lambda S_{q} \\ \beta c\left( 1 - q \right)S\left( I + \theta A \right) - \sigma E \\ \sigma \varrho E - \left( \delta_{I} + \alpha + \gamma_{I} \right)I \\ \sigma \left( 1 - \varrho \right)E - \gamma_{A} A \\ \left( 1 - \beta \right)cqS\left( I + \theta A \right) - \lambda S_{q} \\ \beta cqS\left( I + \theta A \right) - \delta_{q} E_{q} \\ \delta_{I} I + \delta_{q} E_{q} - \left( \alpha + \gamma_{H} \right)H \\ \gamma_{I} I + \gamma_{A} A + \gamma_{H} H \end{bmatrix} $$
(4)

where \(x\left(t\right)={\left[S\left(t\right),E\left(t\right),I\left(t\right),A\left(t\right),{S}_{q}\left(t\right),{E}_{q}\left(t\right),H\left(t\right),R\left(t\right)\right]}^{\mathrm{T}}\in {\mathbb{R}}_{0+}^{8}\) is the state vector. This model has been selected because it describes the ongoing situation better than the alternatives. First, it yields a higher reproduction number than other models [36, 37], which makes this compartmental model a reasonable and conservative choice. More specifically, the estimated reproduction number was found to be quite uncertain [37], and some new variants of the novel coronavirus have a higher reproduction number; consequently, when we consider a model with a high reproduction number and impose vaccination as a control variable, the resulting optimal controller is adapted to the worst-case scenario [38, 39]. Moreover, this model estimated the confirmed cases very well from 23 to 29 January 2020 because it considered different parameter variations and its data collection was performed during intensive social events [39]. Therefore, this model can reflect the real situation better than others. In Sect. 4, we consider vaccination as a control input and discuss the system's input signal and how to impose vaccination on the nonlinear system.
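For concreteness, the sketch below encodes the right-hand side of system (1), together with the time-varying contact rate (2) and diagnosis rate (3), in Python. The parameter values in `p` are illustrative placeholders, not the MCMC estimates of [36].

```python
import numpy as np

# Placeholder parameter values (illustrative only, not the MCMC estimates of [36]).
p = dict(beta=2.1e-8, q=1.9e-7, theta=1.0, lam=1/14, sigma=1/7, rho=0.87,
         alpha=1.8e-5, delta_q=0.13, gamma_I=0.33, gamma_A=0.14, gamma_H=0.12,
         c0=14.8, cb=2.9, r1=1.3, delta_I0=0.13, delta_If=2.7, r2=0.3)

def contact_rate(t, p):
    # Eq. (2): contact rate decays exponentially from c0 toward cb.
    return (p["c0"] - p["cb"]) * np.exp(-p["r1"] * t) + p["cb"]

def diagnosis_rate(t, p):
    # Eq. (3): 1/delta_I(t) decays exponentially toward 1/delta_If.
    inv = (1 / p["delta_I0"] - 1 / p["delta_If"]) * np.exp(-p["r2"] * t) + 1 / p["delta_If"]
    return 1.0 / inv

def covid_rhs(t, x, p):
    """Right-hand side f(x) of system (1); x = [S, E, I, A, Sq, Eq, H, R]."""
    S, E, I, A, Sq, Eq, H, R = x
    c = contact_rate(t, p)
    delta_I = diagnosis_rate(t, p)
    force = S * (I + p["theta"] * A)                       # S(I + theta*A)
    dS  = -(p["beta"] * c + c * p["q"] * (1 - p["beta"])) * force + p["lam"] * Sq
    dE  =  p["beta"] * c * (1 - p["q"]) * force - p["sigma"] * E
    dI  =  p["sigma"] * p["rho"] * E - (delta_I + p["alpha"] + p["gamma_I"]) * I
    dA  =  p["sigma"] * (1 - p["rho"]) * E - p["gamma_A"] * A
    dSq =  (1 - p["beta"]) * c * p["q"] * force - p["lam"] * Sq
    dEq =  p["beta"] * c * p["q"] * force - p["delta_q"] * Eq
    dH  =  delta_I * I + p["delta_q"] * Eq - (p["alpha"] + p["gamma_H"]) * H
    dR  =  p["gamma_I"] * I + p["gamma_A"] * A + p["gamma_H"] * H
    return np.array([dS, dE, dI, dA, dSq, dEq, dH, dR])
```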

3 Optimal control problem

Consider the system dynamics described by

$$\dot{x}=f\left(x\right)+g\left(x\right)u$$
(5)

with \(x\in {\mathbb{R}}^{n}\) denoting the state, \(f\left(x\right)\in {\mathbb{R}}^{n}\), \(g\left(x\right)\in {\mathbb{R}}^{n\times m}\), and the input \(u\in U\subset {\mathbb{R}}^{m}\), where \(U\) is the set defined by the control input saturation.

Assumption 1

\(f\left(.\right)\) and \(g\left(.\right)\) are differentiable in their arguments with \({f}\left(0\right)=0\) and \({g}\left(0\right)=0\), and they are Lipschitz continuous on their domain, so that \(f\left(x\right)+g\left(x\right)u\) is Lipschitz continuous on a set \(\Omega \subseteq {\mathbb{R}}^{n}\) containing the origin. Moreover, the dynamics (5) is controllable, and there exists a continuous control function \(u\) such that (5) is asymptotically stable on \(\Omega \).

Assumption 2

The drift dynamics \(f\left(x\right)\) and the control matrix \(g\left(x\right)\) are bounded over the compact set: \(\Vert g\left(x\right)\Vert \le {\Xi }_{g}\), \(\Vert f\left(x\right)\Vert \le {\Xi }_{f}\).

Definition 1

In this paper, we define infinite horizon integral cost as follows:

$$V\left(x\left(t\right),u\left(t\right)\right)= {\int }_{t}^{\infty }r\left(x\left(\tau \right),u\left(\tau \right)\right)\mathrm{d}\tau $$
(6)

where \(r\left(x\left(\tau \right),u\left(\tau \right)\right)=Q\left(x\left(\tau \right)\right)+{u\left(x\left(\tau \right)\right)}^{\mathrm{T}}Ru\left(x\left(\tau \right)\right)\) and \(Q\left(x\left(\tau \right)\right)\) is a positive definite monotonically increasing function. \(R\) is a symmetric positive definite matrix and \(Q\left(0\right)=0\).

Definition 2

(Admissible control policy) [40, 41] A control policy \(u\) is said to be admissible with respect to the cost function (6) on \(\Omega \) if \(u\) is continuous on the compact set \(\Omega \subset {\mathbb{R}}^{n}\) and differentiable on \(\Omega \), \(u\left(0\right)=0\), \(u\) stabilizes (5), and \(V\left({x}_{0},u\right)\) is finite for every \({x}_{0}\in \Omega \).

Given the differentiability and continuity of the cost function, the infinitesimal version of (6) is the nonlinear Lyapunov equation

$$0=Q\left(x\left(\tau \right)\right)+{u\left(x\left(\tau \right)\right)}^{\mathrm{T}}Ru\left(x\left(\tau \right)\right)+{\left(\nabla V\right)}^{\mathrm{T}}\left(f\left(x\right)+g(x)u\right)$$
(7)

with \(V\left(0\right)=0\). In Eq. (7), the notation \({\nabla }_{x}\left(.\right)\) denotes the gradient operator with respect to \(x\) and is equivalent to \(\frac{\partial (.)}{\partial x}\). Consider the Hamiltonian of (5)

$$H\left(x,u,\nabla V\right)=Q\left(x\left(\tau \right)\right)+{u\left(x\left(\tau \right)\right)}^{\mathrm{T}}Ru\left(x\left(\tau \right)\right)+{\left(\nabla V\right)}^{\mathrm{T}}\left(f\left(x\right)+g\left(x\right)u\right)$$
(8)

The optimal performance index function of (5) can be formulated as

$${V}^{*}\left(x\left(t\right),u\left(t\right)\right)= {\int }_{t}^{\infty }r\left(x\left(\tau \right),{u}^{*}\left(\tau \right)\right)\mathrm{d}\tau =\underset{u\in U}{\mathrm{min}}{\int }_{t}^{\infty }r\left(x\left(\tau \right),u\left(\tau \right)\right)\mathrm{d}\tau $$
(9)

According to the Bellman optimal control theory, the optimal value function \({V}^{*}\left(x\left(t\right),u\left(t\right)\right)\) can be obtained by solving the Hamilton–Jacobi–Bellman (HJB) equation:

$$0=\underset{u\in U}{\mathrm{min}}H(x,u,\nabla {V}^{*}) $$
(10)

Assume that the minimum on the right-hand side of Eq. (10) exists and is unique. By setting the derivative of the Hamiltonian with respect to \(u\) to zero, the optimal control for the given problem can be expressed as

$${u}^{*}\left(x\right)=\mathrm{arg}\underset{u\in U}{\mathrm{min}}H\left(x,u,{\nabla }_{x}{V}^{*}\left(x\right)\right)= -\frac{1}{2}{R}^{-1}{g}^{T}\left(x\right)\nabla {V}^{*}\left(x\right)$$
(11)

where \({V}^{*}\left(x\right)\) satisfies the following HJB equation

$$ \begin{aligned} 0 & = Q\left( {x\left( t \right)} \right) + \left( {\nabla V^{*} } \right)^{{\text{T}}} f\left( x \right) + \frac{1}{4}\left( {\nabla V^{*} } \right)^{{\text{T}}} g\left( x \right)R^{ - 1} g^{{\text{T}}} \left( x \right)\left( {\nabla V^{*} } \right) \\ V^{*} \left( 0 \right) & = 0 \\ \end{aligned} $$
(12)

This nonlinear partial differential HJB equation is extremely difficult to solve and, in general, may be impossible to compute in some cases. Moreover, complete knowledge of the system's dynamics is required. Following [42], an IRL algorithm is presented in the following section to estimate the value function iteratively.

Definition 3

(UUB stability [43, 44]) For the nonlinear system (5) with equilibrium point \({x}_{e}\), the solution is said to be uniformly ultimately bounded (UUB) if there exists a compact set \(\Omega \subset {\mathbb{R}}^{n}\) such that, for every \({x}_{0}\in \Omega \), there exist a positive bound \(p\) and a time \(T\left(p,{x}_{0}\right)>0\), independent of \({t}_{0}\), such that \(\Vert x\left(t\right)-{x}_{e}\Vert \le p\) for all \(t\ge {t}_{0}+T\).

In this article, partially model-free integral reinforcement learning (IRL) has been introduced to obtain the optimal value function approximation \({V}^{*}\left(x\right)\) and a continuous optimal control policy \({u}^{*}\left(x\right)\).

3.1 Value function approximation using Critic network

Critic-based control design with neural networks is a widely accepted way to obtain optimal approximations in control problems [45, 46]. By the higher-order Weierstrass approximation theorem [47], a single-layer neural network can be utilized to reconstruct the cost function \({V}^{*}\left(x\right)\):

$${V}_{c}^{*}\left(x\right)={{w}_{c}^{*}}^{T}{\phi }_{c}\left(x\right)+{\varepsilon }_{c}\left(x\right)$$
(13)

where \({w}_{c}^{*}\in {\mathbb{R}}^{l}\) is the ideal weight vector with \(l\) neurons, \({\phi }_{c}\left(x\right)\in {\mathbb{R}}^{l}\) is the activation (basis) function vector of the neural network (NN), and \({\varepsilon }_{c}\left(x\right)\in {\mathbb{R}}\) is the reconstruction error. We assume \({w}_{c}^{*}\) and \({\phi }_{c}\left(x\right)\) are bounded: \(\Vert {w}_{c}^{*}\Vert \le {\Xi }_{{w}_{c}^{*}}\) and \({\Vert {\phi }_{c}\Vert \le \Xi }_{{\phi }_{c}}\). Since \({V}_{c}^{*}\left(x\right)\) is differentiable, its gradient can be approximated as

$$\frac{\partial {V}_{c}^{*}\left(x\right)}{\partial x}={\left(\frac{\partial {\phi }_{c}\left(x\right)}{\partial x}\right)}^{T}{w}_{c}^{*}+\frac{\partial {\varepsilon }_{c}}{\partial x}={\nabla \phi }_{c}\left(x\right){w}_{c}^{*}+\nabla {\varepsilon }_{c}\left(x\right)$$
(14)

According to [48], for \(x\in \Omega \), the reconstruction error \({\varepsilon }_{c}\left(x\right)\) and its gradient \(\nabla {\varepsilon }_{c}\left(x\right)\) are bounded, \(\Vert {\nabla }^{T}{\varepsilon }_{c}\left(x\right)\Vert \le {\Xi }_{{\varepsilon }_{1}}\). Likewise, since the basis functions are smooth on the compact set and \({\Vert {\phi }_{c}\Vert \le \Xi }_{{\phi }_{c}}\), one can infer \({\Vert {\nabla \phi }_{c}\Vert \le \Xi }_{{\nabla \phi }_{c}}\). Since the ideal weight vector \({w}_{c}^{*}\) is generally unknown, the estimated value function is given by

$${\widehat{V}}_{c}\left(x\right)={\widehat{w}}_{c}^{T}{\phi }_{c}\left(x\right)$$
(15)

where \({\widehat{w}}_{c}\) denotes estimated weights of these basis functions that are updated through the learning process. The updating rule will be formulated in the following section.
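As an illustration, the sketch below encodes a critic of the form (15), \({\widehat{V}}_{c}(x)={\widehat{w}}_{c}^{T}{\phi }_{c}(x)\), using as an assumption the quadratic-plus-cross-term basis later adopted in Sect. 5, together with the gradient needed in (14).

```python
import numpy as np

def phi_c(x):
    """Critic basis: squares of the 8 states plus products of S with the other states
    (the basis assumed in the numerical section)."""
    x = np.asarray(x, dtype=float)
    squares = x**2                       # S^2, E^2, ..., R^2
    cross = x[0] * x[1:]                 # S*E, S*I, ..., S*R
    return np.concatenate([squares, cross])   # length 15

def grad_phi_c(x):
    """Jacobian d(phi_c)/dx, shape (15, 8); used for the critic gradient in (14)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    J = np.zeros((2 * n - 1, n))
    J[:n, :] = 2.0 * np.diag(x)          # d(x_i^2)/dx_i = 2 x_i
    for k, j in enumerate(range(1, n)):  # d(S*x_j)/dS = x_j and d(S*x_j)/dx_j = S
        J[n + k, 0] = x[j]
        J[n + k, j] = x[0]
    return J

def V_hat(x, w_c):
    """Estimated value function (15): V_hat(x) = w_c^T phi_c(x)."""
    return float(w_c @ phi_c(x))

def grad_V_hat(x, w_c):
    """Gradient of the estimated value function w.r.t. x (Eq. (14) without the error term)."""
    return grad_phi_c(x).T @ w_c
```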

3.2 Policy approximation using Actor-network

Zhu et al. [49] have determined policy estimation by considering the fact that, if an initial admissible policy is given, the policy function can be expressed by an NN. NN approximation is a well-known method for policy estimation in optimal control [50,51,52]. Therefore, similar to the value function, according to the Weierstrass higher-order approximation theorem, the smooth policy can be uniformly approximated over a compact set as

$${u}_{c}^{*}\left(x\right)=\Gamma \left({{w}_{a}^{*}}^{T}{\phi }_{a}\left(x\right)\right)+{\varepsilon }_{a}\left(x\right)$$
(16)

where \({w}_{a}^{*}\in {\mathbb{R}}^{l{^{\prime}}\times m}\) is the ideal weight matrix with \(l{^{\prime}}\) neurons, \({\phi }_{a}\left(x\right)\in {\mathbb{R}}^{l{^{\prime}}}\) is the activation function vector of the neural network, and \({\varepsilon }_{a}\left(x\right)\) is the approximation error, which is bounded, \(\Vert {\varepsilon }_{a}\left(x\right)\Vert \le {b}_{a}\). \(\Gamma \left(.\right)\) is a continuous activation function.

Assumption 3

\(\Gamma :{\mathbb{R}}^{m}\cup \left\{\pm \infty \right\} \to U\) is a continuous, monotonic, bijective function. Its first derivative \({\Gamma }^{^{\prime}}\left(.\right)=\frac{\mathrm{d\Gamma }(.)}{\mathrm{d}(.)}\) is bounded, and \(\Gamma \left(0\right)=0.\)

Remark 1

The tanh, SQNL [53], and softsign [54] activation functions satisfy Assumption 3. In this case, because the input must be bounded by a constant, \(\Vert u\Vert \le {\Xi }_{u}\), softsign is employed. The estimated policy function is then given by

$${\widehat{u}}_{c}\left(x\right)=\Gamma \left({\widehat{w}}_{a}^{T}{\phi }_{a}\left(x\right)\right)$$
(17)

where \({\widehat{w}}_{a}\) denotes estimated weights to learn \({w}_{a}^{*}.\)
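A minimal sketch of the bounded actor (17), assuming that softsign (Remark 1) is the squashing function \(\Gamma \) and that \({\phi }_{a}(x)\) is simply the state vector, as assumed in Sect. 5; an additional scaling would be needed if the admissible set \(U\) were not \([-1,1]\).

```python
import numpy as np

def softsign(y):
    # Gamma(y) = y / (1 + |y|): continuous, monotonic, bijective, Gamma(0) = 0.
    return y / (1.0 + np.abs(y))

def softsign_prime(y):
    # Gamma'(y) = 1 / (1 + |y|)^2, bounded by 1.
    return 1.0 / (1.0 + np.abs(y))**2

def phi_a(x):
    """Actor basis: the states themselves (the choice assumed in Sect. 5)."""
    return np.asarray(x, dtype=float)

def u_hat(x, w_a):
    """Estimated bounded policy (17): u_hat(x) = Gamma(w_a^T phi_a(x))."""
    return softsign(w_a.T @ phi_a(x))
```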

3.3 Learning rules for actor and critic networks

Updating rule for the critic network: By substituting Eq. (14) into (8), we have

$$H\left(x,u,{w}_{c}^{*}\right)=Q\left(x\left(\tau \right)\right)+{u\left(x\left(\tau \right)\right)}^{\mathrm{T}}Ru\left(x\left(\tau \right)\right)+{\left({w}_{c}^{*}\right)}^{\mathrm{T}}{\nabla \phi }_{c}\left(x\right)\left(f\left(x\right)+g\left(x\right)u\right)={\varepsilon }_{H}$$
(18)

where, based on Eq. (7), \({\varepsilon }_{H}\) is given by

$${\varepsilon }_{H}=-{\nabla }^{T}{\varepsilon }_{c}\left(x\right)\left(f\left(x\right)+g\left(x\right)u\right)$$
(19)

Assumption 4

Under Lemma 1 in [55] and Assumption 7 in [56], the least-squares solution to (18) exists and is unique for any admissible control policy; as the number of hidden-layer neurons \(N\,\to\,\infty \), \({\phi }_{c}\left(x\right)\) provides a complete independent basis for \({V}_{c}^{*}\left(x\right)\).

Hence, \({V}_{c}^{*}\left(x\right)\) and \(\frac{\partial {V}_{c}^{*}\left(x\right)}{\partial x}\) can be estimated by NNs in view of the above assumption and the Weierstrass higher-order approximation theorem, so that as \(N\,\to\,\infty \), \({\varepsilon }_{c}\left(x\right)\) and \(\frac{ \partial {\varepsilon }_{c}\left(x\right)}{\partial x}\) approach zero [40]. Motivated by the research in [57], in order to find the updating law for the critic weights, we define the error function for the critic network as

$${e}_{c}={\sigma }^{T}{\widehat{w}}_{c}+Q\left(x\right)+{{\widehat{u}}_{c}}^{T}R{\widehat{u}}_{c}$$
(20)

where \(\sigma ={\nabla \phi }_{c}\left(x\right)\left(f\left(x\right)+g\left(x\right){\widehat{u}}_{c}\right)\). To train the critic network, the following squared residual error should be minimized:

$${E}_{c}=\frac{1}{2}{e}_{c}^{2}$$
(21)

The weight of the critic network \({\widehat{w}}_{c}\) is updated in a gradient descent algorithm to minimize \({E}_{c}\) with

$$\dot{{\widehat{w}}_{c}}= -{\alpha }_{c}\frac{\sigma }{{\left({\sigma }^{T}\sigma +1\right)}^{2}}\left[{\sigma }^{T}{\widehat{w}}_{c}+Q\left(x\right)+{{\widehat{u}}_{c}}^{T}R{\widehat{u}}_{c}\right]$$
(22)

where \({\alpha }_{c}>0\) is the critic learning rate. Hence, in the light of (22) and (11) and inspired by [42], the nearly optimal policy can be obtained by

$${u}_{c}\left(x\right)= -\frac{1}{2}{R}^{-1}{g}^{T}\left(x\right)\nabla {\widehat{V}}_{c}\left(x\right)=-\frac{1}{2}{R}^{-1}{g}^{T}\left(x\right){\widehat{w}}_{c}^{T}\nabla {\phi }_{c}\left(x\right)$$
(23)

where \({\widehat{V}}_{c}\left(x\right)\) is used instead of the optimal value function \({V}_{c}^{*}\left(x\right)\). This nearly optimal policy is used below to update the actor network.
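The sketch below performs one Euler step of the normalized critic update (22) and then extracts the nearly optimal policy (23) from the updated critic; `f`, `g`, `Q`, and `grad_phi_c` stand for the (assumed) drift, control matrix, state cost, and basis Jacobian from the sketches above.

```python
import numpy as np

def critic_step(x, w_c, u, f, g, Q, R, grad_phi_c, alpha_c=0.1):
    """One Euler step of the normalized gradient-descent critic update (22)."""
    u = np.atleast_1d(u)
    sigma = grad_phi_c(x) @ (f(x) + g(x) @ u)            # sigma in (20), shape (l,)
    e_c = sigma @ w_c + Q(x) + u @ R @ u                 # Bellman residual e_c in (20)
    w_c_new = w_c - alpha_c * sigma / (sigma @ sigma + 1.0) ** 2 * e_c   # Eq. (22)
    return w_c_new, e_c

def nearly_optimal_policy(x, w_c, g, R, grad_phi_c):
    """Target policy (23): u_c(x) = -1/2 R^{-1} g^T(x) grad(V_hat)(x)."""
    grad_V = grad_phi_c(x).T @ w_c                       # gradient of the critic estimate
    return -0.5 * np.linalg.solve(R, g(x).T @ grad_V)
```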

Updating rule for the actor network: Similar to the training of the critic network, one may define

$${E}_{a}=\frac{1}{2}{e}_{a}^{2}$$
(24)

where \({e}_{a}\) is the output error between the policy function \({\widehat{u}}_{c}\left(x\right)\) and the nearly targeted control policy \({u}_{c}\left(x\right)\)

$${e}_{a}={\widehat{u}}_{c}\left(x\right)-{u}_{c}\left(x\right)=\Gamma \left({\widehat{w}}_{a}^{T}{\phi }_{a}\left(x\right)\right)-{u}_{c}\left(x\right)$$
(25)

To minimize the square actor error \({E}_{a}\), the weights of the actor-network are tuned by gradient descent rule as follows:

$$\dot{{\widehat{w}}_{a}}=-{\alpha }_{a}\frac{\partial {E}_{a}}{\partial {\widehat{w}}_{a}}=-{\alpha }_{a}{\phi }_{a}\left(x\right){\Gamma }^{^{\prime}}\left({\widehat{w}}_{a}^{T}{\phi }_{a}\left(x\right)\right){e}_{a}$$
(26)
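Likewise, a sketch of one Euler step of the actor update (25)-(26) toward the target policy produced by the critic; `phi_a`, `softsign`, and `softsign_prime` are the assumed helpers sketched earlier.

```python
import numpy as np

def actor_step(x, w_a, u_target, phi_a, softsign, softsign_prime, alpha_a=0.1):
    """One Euler step of the actor update (26) toward the target policy (23)."""
    z = w_a.T @ phi_a(x)                          # pre-activation w_a^T phi_a(x), shape (m,)
    e_a = softsign(z) - np.atleast_1d(u_target)   # output error e_a in (25)
    # Eq. (26): w_a_dot = -alpha_a * phi_a(x) * Gamma'(z) * e_a
    w_a_new = w_a - alpha_a * np.outer(phi_a(x), softsign_prime(z) * e_a)
    return w_a_new, e_a
```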

Theorem 1

Consider the system given by (5) and the critic and actor updating laws given by (22) and (26), respectively. Let \({\tilde{w }}_{a}={w}_{a}^{*}-{\widehat{w}}_{a}\), \({\tilde{w }}_{c}={w}_{c}^{*}-{\widehat{w}}_{c}\), and let \(\frac{\sigma }{\left({\sigma }^{T}\sigma +1\right)}\) be persistently exciting (PE) [44]. If Assumptions 1–4 hold, the actor-critic weight estimation errors \({\tilde{w }}_{a}\) and \({\tilde{w }}_{c}\) are uniformly ultimately bounded, and \({\widehat{w}}_{a}\) and \({\widehat{w}}_{c}\) converge to a residual set in the neighborhood of \({w}_{a}^{*}\) and \({w}_{c}^{*}\), respectively.

Proof

For convenience, we define

$${u}_{a}\left(x\right)=\Gamma \left({{w}_{a}^{*}}^{T}{\phi }_{a}\left(x\right)\right)$$
(27)
$$\widehat{D}={\widehat{w}}_{a}^{T}{\phi }_{a}\left(x\right)$$
(28)
$${D}^{*}={{w}_{a}^{*}}^{T}{\phi }_{a}\left(x\right)$$
(29)
$$\tilde{D }={D}^{*}-\widehat{D}$$
(30)

The convergence of the actor-critic network during learning is based on Lyapunov analysis. We consider the following Lyapunov candidate

$$L={L}_{c}+{L}_{V}+{L}_{a}=\frac{1}{2}{\tilde{w }}_{c}^{T}{\alpha }_{c}^{-1}{\tilde{w }}_{c}+{V}^{*}(x)+ \frac{1}{2}{\tilde{w }}_{a}^{T}{\alpha }_{a}^{-1}{\tilde{w }}_{a}$$
(31)

First, let us consider \({L}_{c}=\frac{1}{2}{\tilde{w }}_{c}^{T}{\alpha }_{c}^{-1}{\tilde{w }}_{c}\). Then, \({\dot{L}}_{c}=-{\tilde{w }}_{c}^{T}{\alpha }_{c}^{-1}\dot{{\widehat{w}}_{c}}\) where \(\dot{{\widehat{w}}_{c}}\) is given in (22).

$${\dot{L}}_{c}=-{\tilde{w }}_{c}^{T}{\alpha }_{c}^{-1}\dot{{\widehat{w}}_{c}}={\tilde{w }}_{c}^{T}\frac{\sigma }{{\left({\sigma }^{T}\sigma +1\right)}^{2}}\left[{\sigma }^{T}\left({w}_{c}^{*}-{\tilde{w }}_{c}\right)+Q\left(x\right)+{\widehat{u}}_{c}^{T}R{\widehat{u}}_{c}\right]$$
(32)

or equivalently

$${\dot{L}}_{c}=-\frac{\sigma {\sigma }^{T}}{{\left({\sigma }^{T}\sigma +1\right)}^{2}}{\tilde{w }}_{c}^{T}{\tilde{w }}_{c}+\frac{\sigma {\sigma }^{T}}{{\left({\sigma }^{T}\sigma +1\right)}^{2}}\left[{\tilde{w }}_{c}^{T}\left(Q\left(x\right)+{\widehat{u}}_{c}^{T}R{\widehat{u}}_{c}\right)+{\tilde{w }}_{c}^{T}{\sigma }^{T}\left({w}_{c}^{*}\right)\right]$$
(33)

Considering (12), the HJB equation with the NN representation of the value function can be rewritten as

$${\varepsilon }_{\text{HJB}}=Q\left(x\left(t\right)\right)+{{w}_{c}^{*}}^{T}\nabla {\phi }_{c}\left(x\right)f\left(x\right)+\frac{1}{4}{{w}_{c}^{*}}^{T}\nabla {\phi }_{c}\left(x\right)g\left(x\right){R}^{-1}{g}^{T}\left(x\right){\nabla }^{T}{\phi }_{c}\left(x\right){w}_{c}^{*}$$
(34)

where \({\varepsilon }_{\mathrm{HJB}}\) is the residual error due to the function approximation error, which is

$$ \begin{aligned} \varepsilon _{{{\text{HJB}}}} & = - \nabla ^{T} \varepsilon _{c} \left( x \right)f\left( x \right) + \frac{1}{2}w_{c}^{{*T}} \nabla \phi _{c} \left( x \right)g\left( x \right)R^{{ - 1}} g^{T} \left( x \right)\nabla \varepsilon _{c} \left( x \right) \\ & \quad + \frac{1}{4}\nabla ^{T} \varepsilon _{c} \left( x \right)g\left( x \right)R^{{ - 1}} g^{T} \left( x \right)\nabla \varepsilon _{c} \left( x \right) \\ \end{aligned} $$
(35)

It can be shown that this error converges uniformly to zero as the number of hidden-layer units \(N\) increases [40]; hence, \({\varepsilon }_{\mathrm{HJB}}\) is bounded, \(\Vert {\varepsilon }_{\mathrm{HJB}}\Vert \le {\Xi }_{1}\).

By considering (34), one can derive

$$ \begin{aligned}\dot{L}_{c} & = - \frac{{\sigma \sigma ^{T} }}{{\left( {\sigma ^{T} \sigma + 1} \right)^{2} }}\tilde{w}_{c}^{T} \tilde{w}_{c} \\ & \quad + \frac{{\sigma \sigma ^{T} }}{{\left( {\sigma ^{T} \sigma + 1} \right)^{2} }}\bigg[ \tilde{w}_{c}^{T} \bigg( \varepsilon _{{{\text{HJB}}}} - w_{c}^{{*T}} \nabla \phi _{c}\left( x \right)f\big( x \big) - \frac{1}{4}w_{c}^{{*T}} \nabla \phi _{c}\left( x \right)g\left( x \right)R^{{ - 1}} g^{T} \left( x \right)\nabla ^{T} \phi _{c}\left( x \right)w_{c}^{*} + \hat{u}_{c}^{T} R\hat{u}_{c} \bigg)+ \tilde{w}_{c}^{T} \sigma ^{T} \left( {w_{c}^{*} } \right) \bigg] \\ \end{aligned} $$
(36)

Based on (20), \(\sigma ={\nabla \phi }_{c}\left(x\right)\left(f\left(x\right)+g\left(x\right){\widehat{u}}_{c}\right)\), and making use of \( \mathop \sigma \limits^{ = } =\frac{\sigma {\sigma }^{T}}{{\left({\sigma }^{T}\sigma +1\right)}^{2}}\),

$${\dot{L}}_{c}=-\mathop \sigma \limits^{ = }{\tilde{w }}_{c}^{T}{\tilde{w }}_{c}+\mathop \sigma \limits^{ = }\left[{\tilde{w }}_{c}^{T}\left({\varepsilon }_{\mathrm{HJB}}-{{w}_{c}^{*}}^{T}\nabla {\phi }_{c}\left(x\right)f\left(x\right)-\frac{1}{4}{{w}_{c}^{*}}^{T}\nabla {\phi }_{c}\left(x\right)g\left(x\right){R}^{-1}{g}^{T}\left(x\right){\nabla }^{T}{\phi }_{c}\left(x\right){w}_{c}^{*}+{\widehat{u}}_{c}^{T}R{\widehat{u}}_{c}\right)+{\tilde{w }}_{c}^{T}{{w}_{c}^{*}}^{T}\nabla {\phi }_{c}\left(x\right)\left(f\left(x\right)+g\left(x\right){\widehat{u}}_{c} \right)\right]$$
(37)

where \(\mathop \sigma \limits^{ = }\) is bounded, \( \left\| {\mathop \sigma \limits^{ = } } \right\| \le \Xi _{{\mathop \sigma \limits^{ = } }} \). Then,

$$\dot{L}_{c} = - \mathop \sigma \limits^{ = } \tilde{w}_{c}^{T} \tilde{w}_{c} + \mathop \sigma \limits^{ = }\left[ \tilde{w}_{c}^{T} \left( \varepsilon _{{{\text{HJB}}}} - \frac{1}{4}w_{c}^{{*T}} \nabla \phi _{c} \left( x \right)g\left( x \right)R^{{ - 1}} g^{T} \left( x \right)\nabla ^{T} \phi _{c} \left( x \right)w_{c}^{*} + \hat{u}_{c}^{T} R\hat{u}_{c} \right) + \tilde{w}_{c}^{T} w_{c}^{{*T}} \nabla \phi _{c} \left( x \right)\left( {g\left( x \right)\hat{u}_{c} ~} \right) \right] $$
(38)

For the second term of (31), \({L}_{V}={V}^{*}(x)\), the time derivative is

$$ \begin{aligned} \dot{L}_{V} & = \left( {w_{c}^{{*T}} \nabla \phi _{c} \left( x \right) + \nabla ^{T} \varepsilon _{c} \left( x \right)} \right)\left( {f\left( x \right) + g\left( x \right)\hat{u}_{c} } \right) \\ & = ~\left( {w_{c}^{{*T}} \nabla \phi _{c} \left( x \right) + \nabla ^{T} \varepsilon _{c} \left( x \right)} \right)\left( {f\left( x \right) + g\left( x \right)\Gamma \left( {\hat{w}_{a}^{T} \phi _{a} \left( x \right)} \right)} \right) \\ \end{aligned} $$
(39)

Combining with (34),

$$ \begin{aligned} \dot{L}_{V} & = \nabla ^{T} \varepsilon _{c} \left( x \right)\left( {f\left( x \right) + g\left( x \right)\Gamma \left( {\hat{w}_{a}^{T} \phi _{a} \left( x \right)} \right)} \right) \\ & \quad + w_{c}^{{*T}} \nabla \phi _{c} \left( x \right)f\left( x \right) + w_{c}^{{*T}} \nabla \phi _{c} \left( x \right)g\left( x \right)\Gamma \left( {\hat{w}_{a}^{T} \phi _{a} \left( x \right)} \right) \\ & = \nabla ^{T} \varepsilon _{c} \left( x \right)\left( {f\left( x \right) + g\left( x \right)\Gamma \left( {\hat{w}_{a}^{T} \phi _{a} \left( x \right)} \right)} \right) \\& \quad + \varepsilon _{{HJB}} - Q\left( {x\left( t \right)} \right) - \frac{1}{4}w_{c}^{{*T}} \nabla \phi _{c} \left( x \right)g\left( x \right)R^{{ - 1}} g^{T} \left( x \right)\nabla ^{T} \phi _{c} \left( x \right)w_{c}^{*} \\ \end{aligned} $$
(40)

For the third term of (31), one can write

$$ \begin{aligned} \dot{L}_{a} & = ~\tilde{w}_{a}^{T} \alpha _{a}^{{ - 1}} \mathop {\tilde{w}_{a} }\limits^{ \cdot } = - \tilde{w}_{a}^{T} \alpha _{a}^{{ - 1}} \mathop {\hat{w}_{a} }\limits^{ \cdot } = \tilde{w}_{a}^{T} \left( {\phi _{a} \left( x \right)\Gamma ^{\prime } \left( {\hat{w}_{a}^{T} \phi _{a} \left( x \right)} \right)e_{a} } \right) \\ & = ~\tilde{w}_{a}^{T} \left( {\phi _{a} \left( x \right)\Gamma ^{\prime}\left( {\hat{w}_{a}^{T} \phi _{a} \left( x \right)} \right)\left( {\Gamma \left( {\hat{w}_{a}^{T} \phi _{a} \left( x \right)} \right) + \frac{1}{2}R^{{ - 1}} g^{T} \left( x \right)\hat{w}_{c}^{T} \nabla \phi _{c} \left( x \right)} \right)} \right) \\ \end{aligned} $$
(41)

Consequently,

$$ \begin{aligned} \dot{L}_{a} & = ~\tilde{w}_{a}^{T} \phi _{a} \left( x \right)\Gamma ^{\prime}\left( {\hat{w}_{a}^{T} \phi _{a} \left( x \right)} \right)\Gamma \left( {\hat{w}_{a}^{T} \phi _{a} \left( x \right)} \right) \\ & \quad + \tilde{w}_{a}^{T} \phi _{a} \left( x \right)\Gamma ^{\prime}\left( {\hat{w}_{a}^{T} \phi _{a} \left( x \right)} \right)\frac{1}{2}R^{{ - 1}} g^{T} \left( x \right)\hat{w}_{c}^{T} \nabla \phi _{c} \left( x \right) \\ \end{aligned} $$
(42)

Based on (28)–(30), (42) becomes

$$ \begin{aligned} \dot{L}_{a} & = \tilde{D}^{T} \Gamma ^{\prime}\left( {\hat{D}} \right)\Gamma \left( {\hat{D}} \right) + \frac{1}{2}\tilde{D}^{T} \Gamma ^{\prime}\left( {\hat{D}} \right)R^{{ - 1}} g^{T} \left( x \right)\hat{w}_{c}^{T} \nabla \phi _{c} \left( x \right) \\ & = \tilde{D}^{T} \Gamma ^{\prime}\left( {\hat{D}} \right)\Gamma \left( {\hat{D}} \right) - \tilde{D}^{T} \Gamma ^{\prime}\left( {\hat{D}} \right)u_{c} \\ & = \tilde{D}^{T} \Gamma ^{\prime}\left( {\hat{D}} \right)\Gamma \left( {\hat{D}} \right) - \tilde{D}^{T} \Gamma ^{\prime}\left( {\hat{D}} \right)\left( {\Gamma \left( {\hat{D}} \right) - e_{a} } \right) = \tilde{D}^{T} \Gamma ^{\prime}\left( {\hat{D}} \right)e_{a} ~ \\ \end{aligned} $$
(43)

Then, \({u}_{c}^{*}\left(x\right)\) can be written as

$${u}_{c}^{*}\left(x\right)=\Gamma \left({D}^{*}\right)+{\varepsilon }_{a}\left(x\right)=-\frac{1}{2}{R}^{-1}{g}^{T}\left(x\right){{w}_{c}^{*}}^{T}\nabla {\phi }_{c}\left(x\right)$$
(44)

Based on (25), \( {e}_{a}\) becomes

$$ \begin{aligned} e_{a} & = \hat{u}_{c} \left( x \right) - u_{c} \left( x \right) \\ & = \Gamma \left( {\hat{D}} \right) + \frac{1}{2}R^{{ - 1}} g^{T} \left( x \right)\left( {w_{c}^{{*T}} - \tilde{w}_{c}^{T} } \right)\nabla \phi _{c} \left( x \right) \\ & = \Gamma \left( {\hat{D}} \right) - \Gamma \left( {D^{*} } \right) - \varepsilon _{a} \left( x \right) - \frac{1}{2}R^{{ - 1}} g^{T} \left( x \right)\tilde{w}_{c}^{T} \nabla \phi _{c} \left( x \right) \\ & = ~ - \Gamma \left( {\tilde{D}} \right) - \varepsilon _{a} \left( x \right) - \frac{1}{2}R^{{ - 1}} g^{T} \left( x \right)\tilde{w}_{c}^{T} \nabla \phi _{c} \left( x \right) \\ \end{aligned} $$
(45)

Then, using (45) in (43),

$$ \begin{aligned} \dot{L}_{a} & = - \tilde{D}^{T} \Gamma^{\prime}\left( {\hat{D}} \right)\left( {\Gamma \left( {\tilde{D}} \right) + \varepsilon_{a} \left( x \right) + \frac{1}{2}R^{ - 1} g^{T} \left( x \right)\tilde{w}_{c}^{T} \nabla \phi_{c} \left( x \right)} \right) \\ & = - \tilde{D}^{T} \Gamma^{\prime}\left( {\hat{D}} \right)\varepsilon_{a} \left( x \right) - \tilde{D}^{T} \Gamma^{\prime}\left( {\hat{D}} \right)\left( {\Gamma \left( {\tilde{D}} \right)} \right) - \tilde{D}^{T} \Gamma^{\prime}\left( {\hat{D}} \right)\frac{1}{2}R^{ - 1} g^{T} \left( x \right)\tilde{w}_{c}^{T} \nabla \phi_{c} \left( x \right) \\ \end{aligned} $$
(46)

By applying Young's inequality to the last term, we have

$$ \begin{aligned} & - \tilde{D}^{T} \Gamma^{\prime}\left( {\hat{D}} \right)\frac{1}{2}R^{ - 1} g^{T} \left( x \right)\nabla \phi_{c} \left( x \right)\tilde{w}_{c} \\ & \le \frac{1}{4}\tilde{w}_{c} \nabla^{T} \phi_{c} \left( x \right)g\left( x \right)R^{ - 1} g^{T} \left( x \right)\nabla \phi_{c} \left( x \right) \tilde{w}_{c} \\ & \quad + \tilde{D}^{T} \Gamma^{\prime}\left( {\hat{D}} \right)\Gamma^{\prime}\left( {\hat{D}} \right)^{T} \tilde{D} \\ \end{aligned} $$
(47)

Then, we rewrite (46)

$$ \begin{aligned} \dot{L}_{a} & \le - \tilde{D}^{T} \Gamma^{\prime}\left( {\hat{D}} \right)\left( { - \Gamma^{\prime}\left( {\hat{D}} \right)^{T} \tilde{D} + \varepsilon_{a} \left( x \right) + \left( {\Gamma \left( {\tilde{D}} \right)} \right)} \right) \\ & \quad + \frac{1}{4}\tilde{w}_{c} \nabla^{T} \phi_{c} \left( x \right)g\left( x \right)R^{ - 1} g^{T} \left( x \right)\nabla \phi_{c} \left( x \right) \tilde{w}_{c} \\ \end{aligned} $$
(48)

By considering softsign as the activation function, \(\Gamma \left(y\right)=\frac{y}{\left(1+\left|y\right|\right)}\) and therefore \(\Gamma{^{\prime}}\left(y\right)=\frac{1}{{\left(1+\left|y\right|\right)}^{2}}\). With these functions, one can show that the first term of (48) can be written as

$$-{\tilde{D }}^{T}{\Gamma }^{^{\prime}}\left(\widehat{D}\right)\left(-{\Gamma }^{^{\prime}}{\left(\widehat{D}\right)}^{T}\tilde{D }+{\varepsilon }_{a}\left(x\right)+\left(\Gamma \left(\tilde{D }\right)\right)\right)=-{\tilde{D }}^{T}{\Gamma }^{^{\prime}}\left(\widehat{D}\right)\left(\frac{-\tilde{D }}{{\left(1+\Vert \widehat{D}\Vert \right)}^{2}}+{\varepsilon }_{a}\left(x\right)-\frac{\widehat{D}}{\left(1+\Vert \widehat{D}\Vert \right)}+\frac{{D}^{*}}{\left(1+\Vert {D}^{*}\Vert \right)}\right)=-{\tilde{D }}^{T}{\Gamma }^{^{\prime}}\left(\widehat{D}\right)\left(\frac{-\left({D}^{*}-\widehat{D}\right)-\widehat{D}\left(1+\Vert \widehat{D}\Vert \right)+{D}^{*}\left(1+\Vert \widehat{D}\Vert \right)}{{\left(1+\Vert \widehat{D}\Vert \right)}^{2}}\right)=-{\tilde{D }}^{T}{\Gamma }^{^{\prime}}\left(\widehat{D}\right)\frac{\Vert \widehat{D}\Vert \tilde{D }}{{\left(1+\Vert \widehat{D}\Vert \right)}^{2}}$$
(49)

Substituting (49) in (48)

$${\dot{L}}_{a}\le -{\tilde{D }}^{T}{\Gamma }^{^{\prime}}\left(\widehat{D}\right)\frac{\Vert \widehat{D}\Vert \tilde{D }}{{\left(1+\Vert \widehat{D}\Vert \right)}^{2}}+\frac{1}{4}{\tilde{w }}_{c}{\nabla }^{T}{\phi }_{c}\left(x\right)g\left(x\right){R}^{-1}{g}^{T}\left(x\right)\nabla {\phi }_{c}\left(x\right) {\tilde{w }}_{c}$$
(50)

Next, using (38), (40), and (50) to rewrite the derivative of (31), we get

$$ \begin{aligned}\dot{L} &=-Q\left(x\left(t\right)\right)-\mathop \sigma \limits^{ = }{\tilde{w }}_{c}^{T}{\tilde{w }}_{c}-{\tilde{D }}^{T}{\Gamma }^{^{\prime}}\left(\widehat{D}\right)\frac{\Vert \widehat{D}\Vert \tilde{D }}{{\left(1+\Vert \widehat{D}\Vert \right)}^{2}}-\frac{1}{4}{{w}_{c}^{*}}^{T}\nabla {\phi }_{c}\left(x\right)g\left(x\right){R}^{-1}{g}^{T}\left(x\right){\nabla }^{T}{\phi }_{c}\left(x\right){w}_{c}^{*} -{\tilde{D }}^{T}{\Gamma }^{^{\prime}}\left(\widehat{D}\right)\left(\Gamma \left(\tilde{D }\right)\right)\\ &\quad +{\nabla }^{T}{\varepsilon }_{c}\left(x\right)\left(f\left(x\right)+g\left(x\right)\Gamma \left({\widehat{w}}_{a}^{T}{\phi }_{a}\left(x\right)\right)\right)\\ &\quad+{\varepsilon }_{\mathrm{HJB}}+\mathop \sigma \limits^{ = }\bigg[{\tilde{w }}_{c}^{T}\left({\varepsilon }_{\mathrm{HJB}}-\frac{1}{4}{{w}_{c}^{*}}^{T}\nabla {\phi }_{c}\left(x\right)g\left(x\right){R}^{-1}{g}^{T}\left(x\right){\nabla }^{T}{\phi }_{c}\left(x\right){w}_{c}^{*}+{\widehat{u}}_{c}^{T}R{\widehat{u}}_{c}\right)\\ &\quad+{\tilde{w }}_{c}^{T}{{w}_{c}^{*}}^{T}\nabla {\phi }_{c}\left(x\right)\left(g\left(x\right){\widehat{u}}_{c}\right)\bigg]+\frac{1}{4}{\varepsilon }_{\mathcal{K}}{\tilde{w }}_{c}{\nabla }^{T}{\phi }_{c}\left(x\right)g\left(x\right){R}^{-1}{g}^{T}\left(x\right)\nabla {\phi }_{c}\left(x\right) {\tilde{w }}_{c}\end{aligned}$$
(51)

Let \({\Upsilon }_{0}=\frac{{\Gamma }^{\mathrm{^{\prime}}}\left(\widehat{D}\right)\Vert \widehat{D}\Vert }{{\left(1+\Vert \widehat{D}\Vert \right)}^{2}}\), \({\Upsilon}_{1}=\frac{1}{4}{\varepsilon }_{\mathcal{K}}{\nabla }^{T}{\phi }_{c}\left(x\right)g\left(x\right){R}^{-1}{g}^{T}\left(x\right)\nabla {\phi }_{c}\left(x\right)\), and \({\Upsilon}_{2}=\frac{\mathop \sigma \limits^{ = }}{4}{{w}_{c}^{*}}^{T}\nabla {\phi }_{c}\left(x\right)g\left(x\right){R}^{-1}{g}^{T}\left(x\right){\nabla }^{T}{\phi }_{c}\left(x\right){w}_{c}^{*}\), which are positive definite. Based on the definitions of \({\widehat{u}}_{c}\), \({\varepsilon }_{\mathrm{HJB}}\), \({\nabla }^{T}{\varepsilon }_{c}\), \(\Gamma \left({\widehat{w}}_{a}^{T}{\phi }_{a}\left(x\right)\right)\), \(\mathop \sigma \limits^{ = }\), \({w}_{c}^{*}\), \(\nabla {\phi }_{c}\left(x\right)\) and Assumptions 2 and 3, one can conclude

$$ \begin{aligned} &\Vert {\varepsilon }_{\mathrm{HJB}}\Vert \le {\Xi }_{1}\\&\Vert {\nabla }^{T}{\varepsilon }_{c}\left(x\right)\left(f\left(x\right)+g\left(x\right)\Gamma \left({\widehat{w}}_{a}^{T}{\phi }_{a}\left(x\right)\right)\right)\Vert \le {\Xi }_{{\varepsilon }_{1}}\left({\Xi }_{f}+{\Xi }_{g}{\Xi }_{u}\right)={\Xi }_{2}\\&\Vert \mathop \sigma \limits^{ = }{\widehat{u}}_{c}^{T}R{\widehat{u}}_{c}\Vert \le {\Xi }_{\mathop \sigma \limits^{ = }}{\Xi }_{u}^{2}R={\Xi }_{3}\\&\left\| {\mathop \sigma \limits^{ = } w_{c}^{{*T}} \nabla \phi _{c} \left( x \right)\left( {g\left( x \right)\hat{u}_{c} } \right)} \right\| \le \Xi _{{\mathop \sigma \limits^{ = } }} \Xi _{{w_{c}^{*} }} \Xi _{{\nabla \phi _{c} }} \Xi _{g} \Xi _{u} = \Xi _{4} \end{aligned}$$
(52)

where \({\Xi }_{u}\), \({\Xi }_{1}\), \({\Xi }_{{\varepsilon }_{1}}\), \({\Xi }_{\mathop \sigma \limits^{ = }}\), \({\Xi }_{{w}_{c}^{*}}\), \({\Xi }_{\nabla {\phi }_{c}}\), \({\Xi }_{f}\) and \({\Xi }_{g}\) are the upper bound of \({\widehat{u}}_{c}\), \({\varepsilon }_{HJB}\), \({\nabla }^{T}{\varepsilon }_{c}\), \(\Gamma \left({\widehat{w}}_{a}^{T}{\phi }_{a}\left(x\right)\right)\), \(\mathop \sigma \limits^{ = }\), \({w}_{c}^{*}\), \(\nabla {\phi }_{c}\left(x\right)\), \(f\left(x\right)\) and\(g\left(x\right)\), respectively. Consequently, we can obtain

$$ \dot{L} \le - \lambda_{Q} x^{2} - \tilde{w}_{c}^{T} \left( {\mathop \sigma \limits^{ = } - \Upsilon_{1} } \right)\tilde{w}_{c} - \tilde{D}^{T} \Upsilon_{0} \tilde{D} - \tilde{w}_{c}^{T} \left( {\Upsilon_{2} - \Xi_{3} - \Xi_{4} } \right) + \Xi_{1} + \Xi_{2} $$
(53)

where \({\lambda }_{Q}\) is a positive constant such that \(Q\left(x\left(t\right)\right)>{x}^{T}{\lambda }_{Q}x\) for every \(x\in\Omega \). If we choose \({\varepsilon }_{\mathcal{K}}\) and \(R\) such that \(\left(\mathop \sigma \limits^{ = }-{\Upsilon}_{1}\right)\) is positive definite, then \(\dot{L}\) yields

$$ \dot{L} \le - \lambda _{{{\text{min}}}} \left( {\mathcal{G}} \right)\left\| {\mathop Z\limits^{ = } } \right\|^{2} + \left\| q \right\|\left\| {\mathop Z\limits^{ = } } \right\| + \Sigma $$
(54)

where \( \mathop {{Z}}\limits^{ = } = \left[ {\begin{array}{*{20}c} x \\ {\tilde{D}} \\ {\tilde{w}_{c} } \\ \end{array} } \right] \) and

$$\mathcal{G}=\left[\begin{array}{ccc}Q& 0& 0\\ 0& {\Upsilon }_{0}& 0\\ 0& 0& \left( \mathop \sigma \limits^{ = } -{\Upsilon}_{1}\right)\end{array}\right]$$
$$ q = \left[ {\begin{array}{*{20}c} 0 \\ 0 \\ { - \left( {{\Upsilon }_{2} - \Xi_{3} - \Xi_{4} } \right)} \\ \end{array} } \right] $$
$$ \Sigma = \Xi_{1} + \Xi_{2} $$

If the parameters are selected such that \(\mathcal{G}\) is positive definite, then the Lyapunov derivative is negative whenever

$$ \left\| {\mathop Z\limits^{ = } } \right\| \ge \frac{{\left\| q \right\| + \sqrt {\left\| q \right\|^{2} + 4\lambda _{{\min }} \left({\mathcal{G}} \right)\Sigma } }}{{2\lambda _{{\min }} \left({\mathcal{G}} \right)}} $$
(55)

Hence, \(\dot{L}\) is negative for sufficiently large \( \left\| {\mathop Z\limits^{ = } } \right\| \). Therefore, by the standard Lyapunov extension theorem [58], the system state and the weight estimation errors are UUB, which completes the proof.

Remark 2

If Assumptions 1–4 hold, then the assumption that the nonlinear HJB equation (12) admits an exact solution can be relaxed without loss of stability, and the equilibrium point of (5) remains UUB.

Remark 3

By utilizing gradient descent rules (22), (26), and the backpropagation rule, the HJB error \({E}_{c}=\frac{1}{2}{e}_{c}^{2}\) and the point-wise control error \({E}_{a}=\frac{1}{2}{e}_{a}^{2}\) update the critic and actor networks. The error \({E}_{c}\) is convex with respect to \({\widehat{w}}_{c}\) . Hence, the critic network's weight converges to its global optimal point by applying the updating rule (22). The error \({E}_{a}\) is non-convex with respect to \({\widehat{w}}_{a}\) so the weights of the actor-network converge to a locally optimal point by applying the updating rule (26).

4 Optimal vaccination strategies

In this study, several optimal vaccination strategies are proposed. For each strategy, we formulate a cost function that captures the main concerns and characteristics to be taken into account, and we define the appropriate model for it. It should be noted that the main objective is to reduce the effects of COVID-19 and the domino effects that stem from its spread and progression. In this section, the optimal control principle provides an optimized approach that can limit the impact of the COVID-19 outbreak. Therefore, to minimize the defined objective functions, we apply reinforcement learning optimal control to obtain the optimal policy. Here, the HJB equation described in each subsection ensures that the necessary condition for optimality is satisfied.

4.1 Strategy 1

In view of the above discussion, we formulate the optimal control problem with vaccination as the control input. Motivated by the models proposed in [59] and [60], we introduce a mathematical model with an end-point state constraint, a control input inspired by [37], and vaccine efficacy taken into account. This model is shown in Fig. 1 and can be derived as follows:

$$ \begin{aligned} \dot{x} & = f\left( x \right) + g\left( x \right)u \\ f\left( {x\left( t \right)} \right) & = \begin{bmatrix} - \left( \beta c + cq\left( 1 - \beta \right) \right)S\left( I + \theta A \right) + \lambda S_{q} \\ \beta c\left( 1 - q \right)S\left( I + \theta A \right) - \sigma E + \left( 1 - e_{v} \right)\beta cV_{a} \left( I + \theta A \right) \\ \sigma \varrho E - \left( \delta_{I} + \alpha + \gamma_{I} \right)I \\ \sigma \left( 1 - \varrho \right)E - \gamma_{A} A \\ \left( 1 - \beta \right)cqS\left( I + \theta A \right) - \lambda S_{q} \\ \beta cqS\left( I + \theta A \right) - \delta_{q} E_{q} \\ \delta_{I} I + \delta_{q} E_{q} - \left( \alpha + \gamma_{H} \right)H \\ \gamma_{I} I + \gamma_{A} A + \gamma_{H} H \\ - \left( 1 - e_{v} \right)\beta cV_{a} \left( I + \theta A \right) \end{bmatrix} \\ g\left( x \right) & = \left[ { - S, 0,0,0,0,0,0,0,S} \right]^{T} \\ x\left( t \right) & = \left[ {S\left( t \right),E\left( t \right),I\left( t \right),A\left( t \right),S_{q} \left( t \right),E_{q} \left( t \right),H\left( t \right),R\left( t \right),V_{a} \left( t \right)} \right]^{T} \\ \end{aligned} $$
(56)

where \(u\) denotes the ratio of susceptible individuals vaccinated per day \((0\le u\le 1)\) and \({e}_{v}\) is the vaccine efficacy: if \({e}_{v}=1\), the vaccine is fully effective. The term \(\left(1-{e}_{v}\right)\beta c{V}_{a}\left(I+\theta A\right)\) represents vaccinated people who can still become infected because of vaccine incompleteness and expresses the fact that no vaccine is 100% effective. Moreover, in the introduced model, \({V}_{a}\) denotes the vaccinated subpopulation. In this strategy, we wish to minimize an objective function that accounts for the infected individuals and the ratio of vaccinated people:

Fig. 1

Transmission diagram of the dynamics of COVID-19 spread with vaccination implemented under a Strategy 1 or 2, b Strategy 3, c Strategy 4

$${V}^{*}=\mathrm{min}\left({\int }_{0}^{{t}_{f}}{\Gamma }_{I,{s}_{1}}{I}^{2}(t)+{{\Gamma }_{u,{s}_{1}}u}^{2}\mathrm{d}t\right)$$
(57)

where \({\Gamma }_{I,{s}_{1}}\) and \({\Gamma }_{u,{s}_{1}}\) are relative weight factors selected to balance the objective function over the intervention time \({t}_{f}\).
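For illustration, the sketch below extends the earlier (assumed) `covid_rhs` with the vaccine-efficacy leakage term and control matrix of (56), and encodes the Strategy 1 running cost of (57); the weight values and the `e_v` default are placeholders.

```python
import numpy as np

def g_strategy1(x):
    """Control matrix g(x) of (56): vaccination moves susceptibles from S into V_a."""
    S = x[0]
    g = np.zeros((9, 1))
    g[0, 0] = -S     # vaccinated susceptibles leave S ...
    g[8, 0] = S      # ... and enter the vaccinated class V_a
    return g

def f_strategy1(t, x, p, e_v=0.90):
    """Drift f(x) of (56): system (1) plus the leaky-vaccine exposure term."""
    S, E, I, A, Sq, Eq, H, R, Va = x
    base = covid_rhs(t, x[:8], p)                              # first eight states as in (1)
    leak = (1 - e_v) * p["beta"] * contact_rate(t, p) * Va * (I + p["theta"] * A)
    base[1] += leak                                            # imperfectly protected vaccinees become exposed
    return np.concatenate([base, [-leak]])                     # dVa/dt = -leak (control enters through g)

def running_cost_strategy1(x, u, G_I=1.0, G_u=1.0):
    """Integrand of the Strategy 1 objective (57); the weights G_I, G_u are placeholders."""
    return G_I * x[2] ** 2 + G_u * float(u) ** 2               # x[2] = I(t)
```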

4.2 Strategy 2

The objective function (57) can be enhanced by including the exposed individuals as a population to be minimized. In fact, by employing a strategy whose aim is also to reduce the number of exposed people, we can find a better solution that minimizes the number of infected individuals and the cost of vaccination. More precisely, we seek the optimal control that minimizes the objective functional

$${V}^{*}=\mathrm{min}\left({\int }_{0}^{{t}_{f}}{{\Gamma }_{I,{s}_{2}}I}^{2}(t)+{{\Gamma }_{E,{s}_{2}}E}^{2}(t)+{\Gamma }_{u,{s}_{2}}{u}^{2}\mathrm{d}t\right)$$
(58)

subject to the mathematical model proposed in (56). Similar to (57), \({\Gamma }_{I,{s}_{2}}\), \({\Gamma }_{E,{s}_{2}}\) and \({\Gamma }_{u,{s}_{2}}\) are positive weights that balance the cost function.

4.3 Strategy 3

In this strategy, we consider vaccination of quarantined individuals as a method that can effectively reduce the cost of vaccination and the number of infected and exposed individuals. Hence, the vaccination control variable is also imposed on quarantined individuals who are susceptible to the virus. The epidemic model with the control input imposed on both susceptible compartments is then given by

$$ \begin{aligned} \dot{x} & = f\left( x \right) + g\left( x \right)u \\ f\left( {x\left( t \right)} \right) & = \begin{bmatrix} - \left( \beta c + cq\left( 1 - \beta \right) \right)S\left( I + \theta A \right) + \lambda S_{q} \\ \beta c\left( 1 - q \right)S\left( I + \theta A \right) - \sigma E + \left( 1 - q \right)\left( 1 - e_{v} \right)\beta cV_{a} \left( I + \theta A \right) \\ \sigma \varrho E - \left( \delta _{I} + \alpha + \gamma _{I} \right)I \\ \sigma \left( 1 - \varrho \right)E - \gamma _{A} A \\ \left( 1 - \beta \right)cqS\left( I + \theta A \right) - \lambda S_{q} \\ \beta cqS\left( I + \theta A \right) - \delta _{q} E_{q} + q\left(1 - e_{v} \right)\beta cV_{a} \left(I + \theta A\right) \\ \delta _{I} I + \delta _{q} E_{q} - \left( \alpha + \gamma _{H} \right)H \\ \gamma _{I} I + \gamma _{A} A + \gamma _{H} H \\ - \left( 1 - e_{v} \right)\beta cV_{a} \left( I + \theta A \right) \end{bmatrix} \\ g\left( x \right) & = \left[ { - S,0,0,0, - S_{q} ,0,0,0,\left( {S + S_{q} } \right)} \right]^{T} \\ x\left( t \right) & = \left[ {S\left( t \right),E\left( t \right),I\left( t \right),A\left( t \right),S_{q} \left( t \right),E_{q} \left( t \right),H\left( t \right),R\left( t \right),V_{a} \left( t \right)} \right]^{T} \\ \end{aligned} $$
(59)

In this model, the vaccination is distributed equally between the quarantined and non-quarantined susceptible individuals. Since quarantined individuals are treated as one group that is vaccinated, the quadratic objective functional, extending (58), is defined as

$${V}^{*}=\mathrm{min}\left({\int }_{0}^{{t}_{f}}{{\Gamma }_{I,{s}_{2}}I}^{2}(t)+{{\Gamma }_{E,{s}_{2}}E}^{2}\left(t\right)+{{\Gamma }_{{E}_{q},{s}_{2}}{E}_{q}}^{2}(t)+{\Gamma }_{u,{s}_{2}}{u}^{2}\mathrm{d}t\right)$$
(60)

where \({{\Gamma }_{{E}_{q},{s}_{2}}{E}_{q}}^{2}(t)\) is the quadratic term for the quarantined exposed individuals, representing the population that we wish to minimize besides the infected and non-quarantined exposed individuals.

4.4 Strategy 4

Here, instead of the uniform allocation of vaccine proposed in Strategy 3, we use two independent control variables for the propagation control of the coronavirus. The resulting control model, after incorporating the aforementioned control variables, is formulated via the following system:

$$ \begin{aligned} \dot{x} & = f\left( x \right) + g\left( x \right){\mathcal{V}} \\ f\left( {x\left( t \right)} \right) & = \begin{bmatrix} - \left( \beta c + cq\left( 1 - \beta \right) \right)S\left( I + \theta A \right) + \lambda S_{q} \\ \beta c\left( 1 - q \right)S\left( I + \theta A \right) - \sigma E + \left( 1 - q \right)\left( 1 - e_{v} \right)\beta cV_{a} \left( I + \theta A \right) \\ \sigma \varrho E - \left( \delta _{I} + \alpha + \gamma _{I} \right)I \\ \sigma \left( 1 - \varrho \right)E - \gamma _{A} A \\ \left( 1 - \beta \right)cqS\left( I + \theta A \right) - \lambda S_{q} \\ \beta cqS\left( I + \theta A \right) - \delta _{q} E_{q} + q\left(1 - e_{v} \right)\beta cV_{a} \left(I + \theta A\right) \\ \delta _{I} I + \delta _{q} E_{q} - \left( \alpha + \gamma _{H} \right)H \\ \gamma _{I} I + \gamma _{A} A + \gamma _{H} H \\ - \left( 1 - e_{v} \right)\beta cV_{a} \left( I + \theta A \right) \end{bmatrix} \\ g\left( x \right) & = \left[ {\begin{array}{*{20}c} { - S,0,0,0,0,0,0,0,S} \\ {0,0,0,0, - S_{q} ,0,0,0,S_{q} } \\ \end{array} } \right]^{T} \\ x\left( t \right) & = \left[ {S\left( t \right),E\left( t \right),I\left( t \right),A\left( t \right),S_{q} \left( t \right),E_{q} \left( t \right),H\left( t \right),R\left( t \right),V_{a} \left( t \right)} \right]^{T} \\ \end{aligned} $$
(61)
$${\mathcal{V}} = \left[ {\begin{array}{*{20}c} u \\ {u_{q} } \\ \end{array} } \right] $$

In this strategy, \({u}_{q}\) is the input variable representing the fraction of quarantined susceptible individuals who are vaccinated per day. By modifying the control-variable assumption of Strategy 3, the objective functional extending (60) is given as

$$ V^{*} = \min \left( {\mathop \smallint \limits_{0}^{{t_{f} }} \Gamma_{{I,s_{2} }} I^{2} (t) + \Gamma_{{E,s_{2} }} E^{2} \left( t \right) + \Gamma_{{E_{q} ,s_{2} }} E_{q}^{2} \left( t \right) + \Gamma_{{u_{q} ,s_{2} }} u_{q}^{2} + \Gamma_{{u,s_{2} }} u^{2} {\text{d}}t} \right) $$
(62)

where \({\Gamma }_{{u}_{q},{s}_{2}}{u}_{q}^{2}\) accounts for minimizing the vaccination of quarantined individuals. This term reflects the importance of optimizing the vaccine allocation. Moreover, the constant \({\Gamma }_{{u}_{q},{s}_{2}}\), as in the previous strategies, is a balancing factor that measures the relative cost of quarantined vaccination. Figure 1 shows these strategies. In the next section, we present the results of each strategy and compare the numerical results of their optimal solutions.
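The main structural difference in Strategy 4 is the \(9\times 2\) control matrix of (61), which routes the two independent vaccination rates from \(S\) and \({S}_{q}\) into \({V}_{a}\). A minimal sketch under the same assumptions as before:

```python
import numpy as np

def g_strategy4(x):
    """Control matrix g(x) of (61): column 1 applies u to S, column 2 applies u_q to S_q."""
    S, Sq = x[0], x[4]
    g = np.zeros((9, 2))
    g[0, 0], g[8, 0] = -S, S       # u moves susceptibles from S to V_a
    g[4, 1], g[8, 1] = -Sq, Sq     # u_q moves quarantined susceptibles from S_q to V_a
    return g

def running_cost_strategy4(x, v, G_I=1.0, G_E=1.0, G_Eq=1.0, G_u=1.0, G_uq=1.0):
    """Integrand of (62); v = [u, u_q] and all weights are placeholders."""
    E, I, Eq = x[1], x[2], x[5]
    u, uq = v
    return G_I * I**2 + G_E * E**2 + G_Eq * Eq**2 + G_u * u**2 + G_uq * uq**2
```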

These strategies are designed on the assumption that the susceptible people have already been identified. Moreover, in this article, susceptible people are considered the only group that should be prioritized for the vaccine, because they are more likely to become infected, and their infection will be more severe than in other people. After identification of the susceptible individuals, they should be classified. In this case, susceptibility can be discerned through potential risk factors such as age or pregnancy. Consequently, it will be necessary to monitor the susceptible individuals and prioritize them according to their conditions.

5 Numerical results

In this section, we simulate the epidemiological model with vaccination based on the data from laboratory-confirmed cases of 2019-nCoV in mainland China reported in [36]. It should be noted that their research was based on a dataset and surveys collected until January 22, 2020, and that they employed Markov chain Monte Carlo to estimate the model parameters and their baselines. Based on these parameters, we implement the four strategies in Python. In each strategy, the balancing factors are chosen to offset the differences in magnitude between the objective terms. In this simulation, the embedded Runge–Kutta method (RK5(4)) [61] has been used to integrate the dynamics of the epidemiological system. According to the research in [62], we assume a vaccine efficacy \({e}_{v}=0.90\), and the parameters in the optimal control framework are taken as

$${\phi }_{c}\left(x\right)=[{S}^{2},{E}^{2},{I}^{2},{A}^{2},{S}_{q}^{2},{E}_{q}^{2},{H}^{2},{R}^{2},S\cdot E,S\cdot I,S\cdot A,S\cdot {S}_{q},S\cdot {E}_{q},S\cdot H,S\cdot R],$$
$${\phi }_{a}\left(x\right)=[S,E,I,A,{S}_{q},{E}_{q},H,R]$$

The initial values of weights are as follows:

$${\widehat{w}}_{c}=[{10}^{-5},\dots ,{10}^{-5}]$$
$${\widehat{w}}_{a}=[{10}^{-4},\dots ,{10}^{-4}]$$
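As a rough sketch of the simulation setup (with the learning updates omitted and the actor weights frozen at their initial values), the closed-loop Strategy 1 system \(\dot{x}=f(x)+g(x)\widehat{u}_{c}(x)\) can be integrated with SciPy's embedded RK5(4) method. Here `softsign`, `f_strategy1`, `g_strategy1`, and the parameter dictionary `p` are the assumed sketches above, and `x0` is a placeholder initial condition rather than the Table 2 values.

```python
import numpy as np
from scipy.integrate import solve_ivp

w_a = 1e-4 * np.ones(8)      # initial actor weights, one per entry of phi_a(x) = [S, E, ..., R]

# Placeholder initial condition (not the Table 2 values).
x0 = np.array([1.1e7, 100.0, 30.0, 20.0, 1.0e4, 50.0, 1.0, 2.0, 0.0])

def closed_loop(t, x):
    # Bounded vaccination ratio from the actor, clipped to 0 <= u <= 1.
    u = np.clip(softsign(w_a @ x[:8]), 0.0, 1.0)
    return f_strategy1(t, x, p) + g_strategy1(x) @ np.atleast_1d(u)

# Embedded Runge-Kutta RK5(4) integration of the closed-loop Strategy 1 model.
sol = solve_ivp(closed_loop, (0.0, 150.0), x0, method="RK45", max_step=0.5)
# sol.t and sol.y contain the time grid and trajectories of [S, E, I, A, Sq, Eq, H, R, Va].
```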

Based on [36], the model's baselines and initial values are given in Tables 1 and 2; we use them as the baselines and initial values of the model. Next, according to the defined cost functions (57), (58), (60), and (62), reinforcement learning optimal control has been applied as a feedback controller. The time evolution of the respective subpopulations and of the vaccination effort is shown in Figs. 2, 3 and 4. Figure 2 shows the outcome of the different optimal control strategies on the stratified population groups. First, the time evolution of the subpopulations in Fig. 2 illustrates that, by using the vaccination strategies, the susceptible, exposed, infected (with or without signs of disease), hospitalized, and recovered populations fall. Moreover, the number of hospitalized individuals is reduced compared with the no-control case, which can be considered a secondary effect of vaccination: at the beginning of public vaccination, the number of infected people is successfully reduced, so people are less likely to be exposed to infected individuals who can spread the disease, and the need for hospitalization decreases in the long term. Figure 2a shows that the population of susceptible individuals declines most in Strategy 4, in which the vaccine is given to both quarantined and non-quarantined susceptible individuals. This suggests that vaccinating quarantined individuals is one of the best options for eradicating the disease in the long run. Figure 3 shows the vaccinated population under each control policy. As shown in this figure, the total population of vaccinated individuals in Strategy 3 is lower than in the other optimal control strategies; however, Fig. 2c illustrates that the time evolution of infected individuals is close in all optimal strategies. Thus, after 110 days, the population of infected individuals in each strategy is similar to the others. Therefore, Strategy 3 is suggested if the vaccine supply is limited. From Fig. 2a, one can conclude that the tenth day can be regarded as the ideal trigger time for vaccination: on this day, the population of susceptibles reaches its minimum, after which this compartment rises gradually. In this context, if the cost of vaccination is important to governments, they can follow Strategy 3, which is the best option for bringing down the cost of vaccination and reducing the number of infected people simultaneously. Based on this strategy, it would be better for governments and authorities to begin public vaccination when the population of susceptible people reaches its minimum. Figure 4 shows the number of people vaccinated per day in each strategy, allowing the time evolution of the vaccinated population in the four strategies to be compared. In Fig. 4a, we compare the control profiles of each strategy. In Strategy 4, the number of susceptible people declines more than in the others, but this strategy requires more vaccination effort; from the viewpoint of vaccination cost, Strategy 3 can therefore be more satisfactory than the other strategies. Note that in Strategy 4, the vaccination is distributed among quarantined and non-quarantined susceptible individuals. The allocation of the vaccine in this strategy is shown in Fig. 4b.
One can infer from this figure that, in the primary phase, the authorities should give top priority to the quarantined susceptible individuals, although the non-quarantined susceptible individuals should also be considered for vaccination. As mentioned in the previous section, these strategies are formulated to be applied only to susceptible individuals. As a result, before implementation, the susceptible people should be identified, stratified, and prioritized. This stratification can be performed based on their risk factors and their vulnerability. Moreover, ring vaccination is another strategy to control the outbreak [63, 64]. To be more specific, smart surveillance monitoring can provide authorities and governors with a powerful tool to identify susceptible people. This approach can reduce transmission earlier through vaccination and immunization of the susceptible ring. Therefore, the proposed strategies can provide effective protection.

Table 1 Parameter estimates for COVID-19 in Wuhan, China [36]
Table 2 Initial values estimation for COVID-19 in Wuhan, China [37]
Fig. 2

Number of individuals under the different control strategies: a susceptible people, b exposed people, c symptomatic infected people, d asymptomatic infected people, e quarantined susceptible people, f quarantined exposed people, g quarantined infected people, h recovered people

Fig. 3

Number of vaccinated people

Fig. 4

Comparison of the vaccination solutions for COVID-19 under the different strategies

In this sense, Fig. 4a shows that Strategy 2 performs better than Strategy 1 because it reduces the number of susceptible people more. This figure also demonstrates that if the exposed people are included in the objective function, the optimal controller performs better in reducing both the susceptible and the exposed populations. It should be noted that the more the exposed population decreases, the fewer susceptible individuals become infected. For this reason, one can infer that both the susceptible and exposed populations should be considered in the objective functions.

Also, from a practical viewpoint, it should be noted that reinforcement learning optimal control can introduce a better policy for vaccine distribution than Pontryagin’s minimum principle. For example, in [59, 65, 66], the proposed optimal controls suggest vaccination profiles whose initial proportion is high and significant. This high initial vaccine usage makes the Pontryagin minimum principle approach impractical and too harsh in the real world; in contrast, as presented in this article, reinforcement learning optimal control can propose a policy with a smooth start that makes public vaccination functional and practical.

The graphical results depict the importance of vaccine allocation. This graphical interpretation shows that if vaccination is taken into account, the severity of the infection can be reduced gradually. In the presented model, vaccination plays a vital role in the reduction of susceptible individuals. Consequently, when the number of susceptible individuals who can transmit the virus and become infected starts to fall, the number of infected people declines as well. Decreasing the number of symptomatically infected people reduces the exposure of uninfected people to infected people and therefore also lowers the probability of infection through disease transmission. As a result, the number of infected people decreases significantly, which can end with the elimination of the disease in society. It should be noted that, owing to the slow dynamic behavior of the epidemic model, the vaccine may at first seem not to affect the infected population, but over time the significance of vaccination becomes observable. Hence, this simulation strongly suggests that governments and authorities should not focus solely on the number of infected people during the early stage of vaccination, because vaccines take time to induce immunity.

6 Conclusion

In this research, the significant challenge of designing vaccination strategies for COVID-19 has been investigated. Based on data from confirmed cases of 2019-nCoV in mainland China, a new deterministic SEIR-type model with an additional vaccination compartment was developed. Following that, an optimal control based on the reinforcement learning method was developed to discover the best policies. By implementing the dynamic model of the epidemiological system, numerical results for four different control strategies obtained by the proposed technique were demonstrated. These findings clearly show the feasibility of the recommended method for designing optimal vaccination plans. As a future study, it would be useful to consider the behavioral and emotional side effects of quarantine, such as depression, which may affect mental health or even the suicide rate in society. Such investigations would lead us to an optimal trade-off for quarantine decisions.