1 Introduction

Since the first reported case of coronavirus disease 2019 (COVID-19) in early December 2019 in China, the disease has caused an ongoing crisis that has spread around the world at an unprecedented pace [1,2,3,4]. Acute respiratory syndrome can occur in severely ill patients, leading to multiple organ failure and, in some cases, death [5, 6]. It has been established that the spread rate of the present pandemic is much higher than that of the similar epidemics reported in 2003 and 2012, namely SARS coronavirus (SARS-CoV) and MERS coronavirus (MERS-CoV). To date, the crisis has resulted in a growing number of deaths all over the globe [7, 8].

Mathematical simulations have long been used to obtain insight into the mechanisms of disease transmission [9,10,11,12,13,14,15,16,17,18,19,20,21,22]. The essence of modeling lies in defining a set of equations that mimics the spread or dynamics of the system in reality [23, 24]. From the beginning of the current epidemic, mathematical models of its spread have been at the forefront of prediction and control of the novel coronavirus outbreak [25,26,27,28,29]. Using the available data on the reported number of infections, what is already known about how the virus spreads, and the confirmed numbers of deaths and hospitalizations, an accurate picture of the future course of the epidemic can be obtained [30, 31].

Up to now, to mitigate the spread of COVID-19 effectively, decision makers in all countries have applied various control policies, such as mandatory lockdowns, quarantining and isolating infected people, maintaining a minimum social distance, discouraging or banning crowded events, and requiring face masks in public [32,33,34,35]. Recently, several effective vaccines have been introduced to battle the pandemic. Some of them have passed all approval criteria, and countries are now using them. However, with the advent of approved vaccines, governments and decision makers face new challenges. To apply vaccines effectively, several questions have to be answered quickly and accurately. Which vaccination policies should be adopted? How should decision makers prioritize different groups of people? How should the vaccine be distributed over time? How much will the vaccine reduce the risk of infection? Since the dynamics of the disease are complicated and its spread is affected by several factors, answering these questions requires treating them as optimization problems, which motivated the current study. The present study aims to address these questions by proposing reinforcement learning-based optimal policies.

2 COVID-19 model with controls

In this study, an extended version of the “Susceptible-Exposed-Infectious-Recovered” (SEIR) compartmental model is introduced to investigate the spread of COVID-19. Using the Markov chain Monte Carlo (MCMC) method and fitting the proposed model to real data, the coefficients of the dynamic system have been derived.

As mentioned in [36] and [27], the total population, denoted \(N\), can be classified into eight epidemiological subclasses: uninfected but susceptible humans \(S\), exposed humans \(E\), asymptomatic infected humans who show no clinical symptoms but can infect healthy people \(A\), infected people showing clinical symptoms \(I\), quarantined humans who are uninfected but susceptible \({S}_{q}\), quarantined humans who are exposed to the infection \({E}_{q}\), hospitalized individuals \(H\), and recovered individuals \(R\). Under these assumptions, the model is given by defining \(q\) as the quarantine rate, \(\beta \) as the probability of transmission per contact, \(\varrho \) as the likelihood of developing symptoms among infected people, \(\sigma \) as the rate at which exposed individuals move to the infected classes, \(\lambda \) as the release rate of quarantined uninfected contacts, and \(c\) as the person-to-person contact rate. The disease-induced death rate is \(\alpha \). In this work, \({\delta }_{I}\) and \({\delta }_{q}\) stand for the transition rates of infected people and quarantined exposed people, respectively, to the hospitalized class. The recovery rate of asymptomatically infected patients is \({\gamma }_{A}\), \({\gamma }_{I}\) is the rate at which symptomatically infected individuals recover, and \({\gamma }_{H}\) is the rate at which hospitalized individuals recover. Based on these coefficients, the epidemic model describing the transmission dynamics is given by

$$ \begin{gathered} \frac{{\text{d}}}{{{\text{d}}t}}S = - \left( {\beta c + cq\left( {1 - \beta } \right)} \right)S\left( {I + \theta A} \right) + \lambda S_{q} \hfill \\ \frac{{\text{d}}}{{{\text{d}}t}}E = \beta c\left( {1 - q} \right)S\left( {I + \theta A} \right) - \sigma E \hfill \\ \frac{{\text{d}}}{{{\text{d}}t}}I = \sigma \varrho E - \left( {\delta _{I} + \alpha + \gamma _{I} } \right)I \hfill \\ \frac{{\text{d}}}{{{\text{d}}t}}A = \sigma \left( {1 - \varrho } \right)E - \gamma _{A} A \hfill \\ \frac{{\text{d}}}{{{\text{d}}t}}S_{q} = \left( {1 - \beta } \right)cqS\left( {I + \theta A} \right) - \lambda S_{q} \hfill \\ \frac{{\text{d}}}{{{\text{d}}t}}E_{q} = \beta cqS\left( {I + \theta A} \right) - \delta _{q} E_{q} \hfill \\ \frac{{\text{d}}}{{{\text{d}}t}}H = \delta _{I} I + \delta _{q} E_{q} - \left( {\alpha + \gamma _{H} } \right)H \hfill \\ \frac{{\text{d}}}{{{\text{d}}t}}R = \gamma _{I} I + \gamma _{A} A + \gamma _{H} H \hfill \\ \end{gathered} $$
(1)

where

$$c=\left({c}_{0}-{c}_{b}\right){e}^{{-r}_{1}t}+{c}_{b}$$
(2)
$$\frac{1}{{\delta }_{I}(t)}=\left(\frac{1}{{\delta }_{I0}}-\frac{1}{{ \delta }_{If}}\right){e}^{{-r}_{2}t}+\frac{1}{{ \delta }_{If}}$$
(3)

represent the person-to-person contact rate \(c\) and the diagnosis rate \({\delta }_{I}\), respectively. Equations (2) and (3) include six parameters defined as follows:

  • \({c}_{0}\): initial contact rate

  • \({c}_{\mathrm{b}}\): final (minimum) contact rate, which is smaller than \({c}_{0}\)

  • \({r}_{1}\): exponentially decreasing rate of contact rate

  • \({\delta }_{\mathrm{I}0}\): initial diagnosis rate

  • \({\delta }_{If}\): final (maximum) diagnosis rate

  • \({r}_{2}\): exponentially increasing rate of diagnosis rate

It is assumed that the contact rate decreases exponentially over time and the diagnosis rate increases exponentially with respect to time. Furthermore, we rewrite system (1) as follows:

$$ \dot{x} = f\left( x \right), \qquad f\left( {x\left( t \right)} \right) = \begin{bmatrix} - \left( \beta c + cq\left( 1 - \beta \right) \right)S\left( I + \theta A \right) + \lambda S_{q} \\ \beta c\left( 1 - q \right)S\left( I + \theta A \right) - \sigma E \\ \sigma \varrho E - \left( \delta_{I} + \alpha + \gamma_{I} \right)I \\ \sigma \left( 1 - \varrho \right)E - \gamma_{A} A \\ \left( 1 - \beta \right)cqS\left( I + \theta A \right) - \lambda S_{q} \\ \beta cqS\left( I + \theta A \right) - \delta_{q} E_{q} \\ \delta_{I} I + \delta_{q} E_{q} - \left( \alpha + \gamma_{H} \right)H \\ \gamma_{I} I + \gamma_{A} A + \gamma_{H} H \end{bmatrix} $$
(4)

where \(x\left(t\right)={\left[S\left(t\right),E\left(t\right),I\left(t\right),A\left(t\right),{S}_{q}\left(t\right),{E}_{q}\left(t\right),H\left(t\right),R\left(t\right)\right]}^{\mathrm{T}}\in {\mathbb{R}}_{0+}^{8}\) is the state vector. This model has been selected because it describes the ongoing situation better than the alternatives. First, it yields a higher reproduction number than other models [36, 37], which makes this compartmental model a reasonable and conservative choice. More specifically, the estimated reproduction number was found to be quite uncertain [37], and some new variants of the novel coronavirus have a higher reproduction number; consequently, when we consider a model with a high reproduction number and impose vaccination as a control variable, the resulting optimal controller is adapted to the worst-case scenario [38, 39]. Moreover, this model estimated the confirmed cases very well from 23 to 29 January 2020 because it considered different parameter variations and its data collection was performed during intensive social events [39]. Therefore, this model can reflect the real situation better than others. In Sect. 4, we consider vaccination as a control input and discuss the system's input signal and how to impose vaccination on the nonlinear system.
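For concreteness, the sketch below encodes the right-hand side of system (1), together with the time-varying contact rate (2) and diagnosis rate (3), in Python. The parameter values in `p` are illustrative placeholders, not the MCMC estimates of [36].

```python
import numpy as np

# Placeholder parameter values (illustrative only, not the MCMC estimates of [36]).
p = dict(beta=2.1e-8, q=1.9e-7, theta=1.0, lam=1/14, sigma=1/7, rho=0.87,
         alpha=1.8e-5, delta_q=0.13, gamma_I=0.33, gamma_A=0.14, gamma_H=0.12,
         c0=14.8, cb=2.9, r1=1.3, delta_I0=0.13, delta_If=2.7, r2=0.3)

def contact_rate(t, p):
    # Eq. (2): contact rate decays exponentially from c0 toward cb.
    return (p["c0"] - p["cb"]) * np.exp(-p["r1"] * t) + p["cb"]

def diagnosis_rate(t, p):
    # Eq. (3): 1/delta_I(t) decays exponentially toward 1/delta_If.
    inv = (1 / p["delta_I0"] - 1 / p["delta_If"]) * np.exp(-p["r2"] * t) + 1 / p["delta_If"]
    return 1.0 / inv

def covid_rhs(t, x, p):
    """Right-hand side f(x) of system (1); x = [S, E, I, A, Sq, Eq, H, R]."""
    S, E, I, A, Sq, Eq, H, R = x
    c = contact_rate(t, p)
    delta_I = diagnosis_rate(t, p)
    force = S * (I + p["theta"] * A)                       # S(I + theta*A)
    dS  = -(p["beta"] * c + c * p["q"] * (1 - p["beta"])) * force + p["lam"] * Sq
    dE  =  p["beta"] * c * (1 - p["q"]) * force - p["sigma"] * E
    dI  =  p["sigma"] * p["rho"] * E - (delta_I + p["alpha"] + p["gamma_I"]) * I
    dA  =  p["sigma"] * (1 - p["rho"]) * E - p["gamma_A"] * A
    dSq =  (1 - p["beta"]) * c * p["q"] * force - p["lam"] * Sq
    dEq =  p["beta"] * c * p["q"] * force - p["delta_q"] * Eq
    dH  =  delta_I * I + p["delta_q"] * Eq - (p["alpha"] + p["gamma_H"]) * H
    dR  =  p["gamma_I"] * I + p["gamma_A"] * A + p["gamma_H"] * H
    return np.array([dS, dE, dI, dA, dSq, dEq, dH, dR])
```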

3 Optimal control problem

Consider the system dynamics described by

$$\dot{x}=f\left(x\right)+g\left(x\right)u$$
(5)

with \(x\in {\mathbb{R}}^{n}\) denoting the state, \(f\left(x\right)\in {\mathbb{R}}^{n}\), \(g\left(x\right)\in {\mathbb{R}}^{n\times m}\), and the input \(u\in U\subset {\mathbb{R}}^{m}\), where \(U\) is the set defined by the control input saturation.

Assumption 1

\(f\left(.\right)\) and \(g\left(.\right)\) are differentiable in their arguments with \({f}\left(0\right)=0\) and \({g}\left(0\right)=0\), and they are Lipschitz continuous on their domain, so that \(f\left(x\right)+g\left(x\right)u\) is Lipschitz continuous on a set \(\Omega \subseteq {\mathbb{R}}^{n}\) containing the origin. Moreover, the dynamics (5) is controllable, and there exists a continuous control function \(u\) such that (5) is asymptotically stable on \(\Omega \).

Assumption 2

The drift dynamics \(f\left(x\right)\) and the control matrix \(g\left(x\right)\) are bounded over the compact set: \(\Vert g\left(x\right)\Vert \le {\Xi }_{g}\), \(\Vert f\left(x\right)\Vert \le {\Xi }_{f}\).

Definition 1

In this paper, we define infinite horizon integral cost as follows:

$$V\left(x\left(t\right),u\left(t\right)\right)= {\int }_{t}^{\infty }r\left(x\left(\tau \right),u\left(\tau \right)\right)\mathrm{d}\tau $$
(6)

where \(r\left(x\left(\tau \right),u\left(\tau \right)\right)=Q\left(x\left(\tau \right)\right)+{u\left(x\left(\tau \right)\right)}^{\mathrm{T}}Ru\left(x\left(\tau \right)\right)\) and \(Q\left(x\left(\tau \right)\right)\) is a positive definite monotonically increasing function. \(R\) is a symmetric positive definite matrix and \(Q\left(0\right)=0\).

Definition 2

(Admissible control policy) [40, 41] A control policy \(u\) is said to be admissible with respect to the cost function (6) on \(\Omega \) if \(u\) is continuous on the compact set \(\Omega \subset {\mathbb{R}}^{n}\) and differentiable on \(\Omega \), \(u\left(0\right)=0\), \(u\) stabilizes (5), and \(V\left({x}_{0},u\right)\) is finite for every \({x}_{0}\in \Omega \).

Given the differentiability and continuity of the cost function, the infinitesimal version of (6) is the nonlinear Lyapunov equation

$$0=Q\left(x\left(\tau \right)\right)+{u\left(x\left(\tau \right)\right)}^{\mathrm{T}}Ru\left(x\left(\tau \right)\right)+{\left(\nabla V\right)}^{\mathrm{T}}\left(f\left(x\right)+g(x)u\right)$$
(7)

with \(V\left(0\right)=0\). In Eq. (7), the notation \({\nabla }_{x}\left(.\right)\) denotes the gradient operator with respect to \(x\) and is equivalent to \(\frac{\partial (.)}{\partial x}\). Consider the Hamiltonian of (5)

$$H\left(x,u,\nabla V\right)=Q\left(x\left(\tau \right)\right)+{u\left(x\left(\tau \right)\right)}^{\mathrm{T}}Ru\left(x\left(\tau \right)\right)+{\left(\nabla V\right)}^{\mathrm{T}}\left(f\left(x\right)+g\left(x\right)u\right)$$
(8)

The optimal performance index function of (5) can be formulated as

$${V}^{*}\left(x\left(t\right),u\left(t\right)\right)= {\int }_{t}^{\infty }r\left(x\left(\tau \right),{u}^{*}\left(\tau \right)\right)\mathrm{d}\tau =\underset{u\in U}{\mathrm{min}}{\int }_{t}^{\infty }r\left(x\left(\tau \right),u\left(\tau \right)\right)\mathrm{d}\tau $$
(9)

According to the Bellman optimal control theory, the optimal value function \({V}^{*}\left(x\left(t\right),u\left(t\right)\right)\) can be obtained by solving the Hamilton–Jacobi–Bellman (HJB) equation:

$$0=\underset{u\in U}{\mathrm{min}}H(x,u,\nabla {V}^{*}) $$
(10)

Assume that the minimum on the right-hand side of Eq. (10) exists and is unique. By setting the derivative of the Hamiltonian with respect to \(u\) to zero, the optimal control for the given problem can be expressed as

$${u}^{*}\left(x\right)=\mathrm{arg}\underset{u\in U}{\mathrm{min}}H\left(x,u,{\nabla }_{x}{V}^{*}\left(x\right)\right)= -\frac{1}{2}{R}^{-1}{g}^{T}\left(x\right)\nabla {V}^{*}\left(x\right)$$
(11)

where \({V}^{*}\left(x\right)\) satisfies the following HJB equation

$$ \begin{aligned} 0 & = Q\left( {x\left( t \right)} \right) + \left( {\nabla V^{*} } \right)^{{\text{T}}} f\left( x \right) + \frac{1}{4}\left( {\nabla V^{*} } \right)^{{\text{T}}} g\left( x \right)R^{ - 1} g^{{\text{T}}} \left( x \right)\left( {\nabla V^{*} } \right) \\ V^{*} \left( 0 \right) & = 0 \\ \end{aligned} $$
(12)

This nonlinear partial differential HJB equation is extremely difficult to solve and, in general, may be impossible to compute in some cases. Moreover, complete knowledge of the system's dynamics is required. Following [42], an IRL algorithm is presented in the following section to estimate the value function iteratively.

Definition 3

(UUB stability [43, 44]) For the nonlinear system (5) with equilibrium point \({x}_{e}\), the solution is said to be uniformly ultimately bounded (UUB) if there exists a compact set \(\Omega \subset {\mathbb{R}}^{n}\) such that, for every \({x}_{0}\in \Omega \), there exist a positive bound \(p\) and a time \(T\left(p,{x}_{0}\right)>0\), independent of \({t}_{0}\), such that \(\Vert x\left(t\right)-{x}_{e}\Vert \le p\) for all \(t\ge {t}_{0}+T\).

In this article, partially model-free integral reinforcement learning (IRL) has been introduced to obtain the optimal value function approximation \({V}^{*}\left(x\right)\) and a continuous optimal control policy \({u}^{*}\left(x\right)\).

3.1 Value function approximation using Critic network

Critic-based control design with neural networks is a widely accepted way to obtain optimal approximations in control problems [45, 46]. By the higher-order Weierstrass approximation theorem [47], a single-layer neural network can be utilized to reconstruct the cost function \({V}^{*}\left(x\right)\):

$${V}_{c}^{*}\left(x\right)={{w}_{c}^{*}}^{T}{\phi }_{c}\left(x\right)+{\varepsilon }_{c}\left(x\right)$$
(13)

where \({w}_{c}^{*}\in {\mathbb{R}}^{l}\) is the ideal weight vector with \(l\) neurons, \({\phi }_{c}\left(x\right)\in {\mathbb{R}}^{l}\) is the activation (basis) function vector of the neural network (NN), and \({\varepsilon }_{c}\left(x\right)\in {\mathbb{R}}\) is the reconstruction error. We assume \({w}_{c}^{*}\) and \({\phi }_{c}\left(x\right)\) are bounded: \(\Vert {w}_{c}^{*}\Vert \le {\Xi }_{{w}_{c}^{*}}\) and \({\Vert {\phi }_{c}\Vert \le \Xi }_{{\phi }_{c}}\). Since \({V}_{c}^{*}\left(x\right)\) is differentiable, its gradient can be approximated as

$$\frac{\partial {V}_{c}^{*}\left(x\right)}{\partial x}={\left(\frac{\partial {\phi }_{c}\left(x\right)}{\partial x}\right)}^{T}{w}_{c}^{*}+\frac{\partial {\varepsilon }_{c}}{\partial x}={\nabla \phi }_{c}\left(x\right){w}_{c}^{*}+\nabla {\varepsilon }_{c}\left(x\right)$$
(14)

According to [48], for \(x\in \Omega \), the reconstruction error \({\varepsilon }_{c}\left(x\right)\) and its gradient \(\nabla {\varepsilon }_{c}\left(x\right)\) are bounded, \(\Vert {\nabla }^{T}{\varepsilon }_{c}\left(x\right)\Vert \le {\Xi }_{{\varepsilon }_{1}}\). Likewise, since the basis functions are smooth on the compact set and \({\Vert {\phi }_{c}\Vert \le \Xi }_{{\phi }_{c}}\), one can infer \({\Vert {\nabla \phi }_{c}\Vert \le \Xi }_{{\nabla \phi }_{c}}\). Since the ideal weight vector \({w}_{c}^{*}\) is generally unknown, the estimated value function is given by

$${\widehat{V}}_{c}\left(x\right)={\widehat{w}}_{c}^{T}{\phi }_{c}\left(x\right)$$
(15)

where \({\widehat{w}}_{c}\) denotes estimated weights of these basis functions that are updated through the learning process. The updating rule will be formulated in the following section.
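As an illustration, the sketch below encodes a critic of the form (15), \({\widehat{V}}_{c}(x)={\widehat{w}}_{c}^{T}{\phi }_{c}(x)\), using as an assumption the quadratic-plus-cross-term basis later adopted in Sect. 5, together with the gradient needed in (14).

```python
import numpy as np

def phi_c(x):
    """Critic basis: squares of the 8 states plus products of S with the other states
    (the basis assumed in the numerical section)."""
    x = np.asarray(x, dtype=float)
    squares = x**2                       # S^2, E^2, ..., R^2
    cross = x[0] * x[1:]                 # S*E, S*I, ..., S*R
    return np.concatenate([squares, cross])   # length 15

def grad_phi_c(x):
    """Jacobian d(phi_c)/dx, shape (15, 8); used for the critic gradient in (14)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    J = np.zeros((2 * n - 1, n))
    J[:n, :] = 2.0 * np.diag(x)          # d(x_i^2)/dx_i = 2 x_i
    for k, j in enumerate(range(1, n)):  # d(S*x_j)/dS = x_j and d(S*x_j)/dx_j = S
        J[n + k, 0] = x[j]
        J[n + k, j] = x[0]
    return J

def V_hat(x, w_c):
    """Estimated value function (15): V_hat(x) = w_c^T phi_c(x)."""
    return float(w_c @ phi_c(x))

def grad_V_hat(x, w_c):
    """Gradient of the estimated value function w.r.t. x (Eq. (14) without the error term)."""
    return grad_phi_c(x).T @ w_c
```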

3.2 Policy approximation using Actor-network

Zhu et al. [49] have determined policy estimation by considering the fact that, if an initial admissible policy is given, the policy function can be expressed by an NN. NN approximation is a well-known method for policy estimation in optimal control [50,51,52]. Therefore, similar to the value function, according to the Weierstrass higher-order approximation theorem, the smooth policy can be uniformly approximated over a compact set as

$${u}_{c}^{*}\left(x\right)=\Gamma \left({{w}_{a}^{*}}^{T}{\phi }_{a}\left(x\right)\right)+{\varepsilon }_{a}\left(x\right)$$
(16)

where \({w}_{a}^{*}\in {\mathbb{R}}^{l{^{\prime}}\times m}\) is the ideal weight matrix with \(l{^{\prime}}\) neurons, \({\phi }_{a}\left(x\right)\in {\mathbb{R}}^{l{^{\prime}}}\) is the activation function vector of the neural network, and \({\varepsilon }_{a}\left(x\right)\) is the approximation error, which is bounded, \(\Vert {\varepsilon }_{a}\left(x\right)\Vert \le {b}_{a}\). \(\Gamma \left(.\right)\) is a continuous activation function.

Assumption 3

\(\Gamma :{\mathbb{R}}^{m}\cup \left\{\pm \infty \right\} \to U\) is a continuous, monotonic, bijective function. Its first derivative \({\Gamma }^{^{\prime}}\left(.\right)=\frac{\mathrm{d\Gamma }(.)}{\mathrm{d}(.)}\) is bounded, and \(\Gamma \left(0\right)=0.\)

Remark 1

The tanh, SQNL [53], and softsign [54] activation functions satisfy Assumption 3. In this case, because the input must be bounded by a constant, \(\Vert u\Vert \le {\Xi }_{u}\), softsign is employed. The estimated policy function is then given by

$${\widehat{u}}_{c}\left(x\right)=\Gamma \left({\widehat{w}}_{a}^{T}{\phi }_{a}\left(x\right)\right)$$
(17)

where \({\widehat{w}}_{a}\) denotes estimated weights to learn \({w}_{a}^{*}.\)
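A minimal sketch of the bounded actor (17), assuming that softsign (Remark 1) is the squashing function \(\Gamma \) and that \({\phi }_{a}(x)\) is simply the state vector, as assumed in Sect. 5; an additional scaling would be needed if the admissible set \(U\) were not \([-1,1]\).

```python
import numpy as np

def softsign(y):
    # Gamma(y) = y / (1 + |y|): continuous, monotonic, bijective, Gamma(0) = 0.
    return y / (1.0 + np.abs(y))

def softsign_prime(y):
    # Gamma'(y) = 1 / (1 + |y|)^2, bounded by 1.
    return 1.0 / (1.0 + np.abs(y))**2

def phi_a(x):
    """Actor basis: the states themselves (the choice assumed in Sect. 5)."""
    return np.asarray(x, dtype=float)

def u_hat(x, w_a):
    """Estimated bounded policy (17): u_hat(x) = Gamma(w_a^T phi_a(x))."""
    return softsign(w_a.T @ phi_a(x))
```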

3.3 Learning rules for actor and critic networks

Updating rule for the critic network: By substituting Eq. (14) into (8), we have

$$H\left(x,u,{w}_{c}^{*}\right)=Q\left(x\left(\tau \right)\right)+{u\left(x\left(\tau \right)\right)}^{\mathrm{T}}Ru\left(x\left(\tau \right)\right)+{\left({w}_{c}^{*}\right)}^{\mathrm{T}}{\nabla \phi }_{c}\left(x\right)\left(f\left(x\right)+g\left(x\right)u\right)={\varepsilon }_{H}$$
(18)

where, based on Eq. (7), \({\varepsilon }_{H}\) is given by

$${\varepsilon }_{H}=-{\nabla }^{T}{\varepsilon }_{c}\left(x\right)\left(f\left(x\right)+g\left(x\right)u\right)$$
(19)

Assumption 4

Under Lemma 1 in [55] and Assumption 7 in [56], the least-squares solution to (18) exists and is unique for any admissible control policy; as the number of hidden-layer neurons \(N\,\to\,\infty \), \({\phi }_{c}\left(x\right)\) provides a complete independent basis for \({V}_{c}^{*}\left(x\right)\).

Hence, \({V}_{c}^{*}\left(x\right)\) and \(\frac{\partial {V}_{c}^{*}\left(x\right)}{\partial x}\) can be estimated by NNs in view of the above assumption and the Weierstrass higher-order approximation theorem, so that as \(N\,\to\,\infty \), \({\varepsilon }_{c}\left(x\right)\) and \(\frac{ \partial {\varepsilon }_{c}\left(x\right)}{\partial x}\) approach zero [40]. Motivated by the research in [57], in order to find the updating law for the critic weights, we define the error function for the critic network as

$${e}_{c}={\sigma }^{T}{\widehat{w}}_{c}+Q\left(x\right)+{{\widehat{u}}_{c}}^{T}R{\widehat{u}}_{c}$$
(20)

where \(\sigma ={\nabla \phi }_{c}\left(x\right)\left(f\left(x\right)+g\left(x\right){\widehat{u}}_{c}\right)\). To train the critic network, the following squared residual error should be minimized:

$${E}_{c}=\frac{1}{2}{e}_{c}^{2}$$
(21)

The weight of the critic network \({\widehat{w}}_{c}\) is updated in a gradient descent algorithm to minimize \({E}_{c}\) with

$$\dot{{\widehat{w}}_{c}}= -{\alpha }_{c}\frac{\sigma }{{\left({\sigma }^{T}\sigma +1\right)}^{2}}\left[{\sigma }^{T}{\widehat{w}}_{c}+Q\left(x\right)+{{\widehat{u}}_{c}}^{T}R{\widehat{u}}_{c}\right]$$
(22)

where \({\alpha }_{c}>0\) is the critic learning rate. Hence, in the light of (22) and (11) and inspired by [42], the nearly optimal policy can be obtained by

$${u}_{c}\left(x\right)= -\frac{1}{2}{R}^{-1}{g}^{T}\left(x\right)\nabla {\widehat{V}}_{c}\left(x\right)=-\frac{1}{2}{R}^{-1}{g}^{T}\left(x\right){\widehat{w}}_{c}^{T}\nabla {\phi }_{c}\left(x\right)$$
(23)

where \({\widehat{V}}_{c}\left(x\right)\) is used instead of the optimal value function \({V}_{c}^{*}\left(x\right)\). This nearly optimal policy is used below to update the actor network.
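The sketch below performs one Euler step of the normalized critic update (22) and then extracts the nearly optimal policy (23) from the updated critic; `f`, `g`, `Q`, and `grad_phi_c` stand for the (assumed) drift, control matrix, state cost, and basis Jacobian from the sketches above.

```python
import numpy as np

def critic_step(x, w_c, u, f, g, Q, R, grad_phi_c, alpha_c=0.1):
    """One Euler step of the normalized gradient-descent critic update (22)."""
    u = np.atleast_1d(u)
    sigma = grad_phi_c(x) @ (f(x) + g(x) @ u)            # sigma in (20), shape (l,)
    e_c = sigma @ w_c + Q(x) + u @ R @ u                 # Bellman residual e_c in (20)
    w_c_new = w_c - alpha_c * sigma / (sigma @ sigma + 1.0) ** 2 * e_c   # Eq. (22)
    return w_c_new, e_c

def nearly_optimal_policy(x, w_c, g, R, grad_phi_c):
    """Target policy (23): u_c(x) = -1/2 R^{-1} g^T(x) grad(V_hat)(x)."""
    grad_V = grad_phi_c(x).T @ w_c                       # gradient of the critic estimate
    return -0.5 * np.linalg.solve(R, g(x).T @ grad_V)
```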

Updating rule for the actor network: Similar to the training of the critic network, one may define

$${E}_{a}=\frac{1}{2}{e}_{a}^{2}$$
(24)

where \({e}_{a}\) is the output error between the policy function \({\widehat{u}}_{c}\left(x\right)\) and the nearly targeted control policy \({u}_{c}\left(x\right)\)

$${e}_{a}={\widehat{u}}_{c}\left(x\right)-{u}_{c}\left(x\right)=\Gamma \left({\widehat{w}}_{a}^{T}{\phi }_{a}\left(x\right)\right)-{u}_{c}\left(x\right)$$
(25)

To minimize the square actor error \({E}_{a}\), the weights of the actor-network are tuned by gradient descent rule as follows:

$$\dot{{\widehat{w}}_{a}}=-{\alpha }_{a}\frac{\partial {E}_{a}}{\partial {\widehat{w}}_{a}}=-{\alpha }_{a}{\phi }_{a}\left(x\right){\Gamma }^{^{\prime}}\left({\widehat{w}}_{a}^{T}{\phi }_{a}\left(x\right)\right){e}_{a}$$
(26)
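Likewise, a sketch of one Euler step of the actor update (25)-(26) toward the target policy produced by the critic; `phi_a`, `softsign`, and `softsign_prime` are the assumed helpers sketched earlier.

```python
import numpy as np

def actor_step(x, w_a, u_target, phi_a, softsign, softsign_prime, alpha_a=0.1):
    """One Euler step of the actor update (26) toward the target policy (23)."""
    z = w_a.T @ phi_a(x)                          # pre-activation w_a^T phi_a(x), shape (m,)
    e_a = softsign(z) - np.atleast_1d(u_target)   # output error e_a in (25)
    # Eq. (26): w_a_dot = -alpha_a * phi_a(x) * Gamma'(z) * e_a
    w_a_new = w_a - alpha_a * np.outer(phi_a(x), softsign_prime(z) * e_a)
    return w_a_new, e_a
```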

Theorem 1

Consider the system given by (5) and the critic and actor updating laws given by (22) and (26), respectively. Let \({\tilde{w }}_{a}={w}_{a}^{*}-{\widehat{w}}_{a}\), \({\tilde{w }}_{c}={w}_{c}^{*}-{\widehat{w}}_{c}\), and let \(\frac{\sigma }{\left({\sigma }^{T}\sigma +1\right)}\) be persistently exciting (PE) [44]. If Assumptions 1–4 hold, the actor-critic weight estimation errors \({\tilde{w }}_{a}\) and \({\tilde{w }}_{c}\) are uniformly ultimately bounded, and \({\widehat{w}}_{a}\) and \({\widehat{w}}_{c}\) converge to a residual set in the neighborhood of \({w}_{a}^{*}\) and \({w}_{c}^{*}\), respectively.

Proof

For convenience, we define

$${u}_{a}\left(x\right)=\Gamma \left({{w}_{a}^{*}}^{T}{\phi }_{a}\left(x\right)\right)$$
(27)
$$\widehat{D}={\widehat{w}}_{a}^{T}{\phi }_{a}\left(x\right)$$
(28)
$${D}^{*}={{w}_{a}^{*}}^{T}{\phi }_{a}\left(x\right)$$
(29)
$$\tilde{D }={D}^{*}-\widehat{D}$$
(30)

The convergence of the actor-critic network during learning is based on Lyapunov analysis. We consider the following Lyapunov candidate

$$L={L}_{c}+{L}_{V}+{L}_{a}=\frac{1}{2}{\tilde{w }}_{c}^{T}{\alpha }_{c}^{-1}{\tilde{w }}_{c}+{V}^{*}(x)+ \frac{1}{2}{\tilde{w }}_{a}^{T}{\alpha }_{a}^{-1}{\tilde{w }}_{a}$$
(31)

First, let us consider \({L}_{c}=\frac{1}{2}{\tilde{w }}_{c}^{T}{\alpha }_{c}^{-1}{\tilde{w }}_{c}\). Then, \({\dot{L}}_{c}=-{\tilde{w }}_{c}^{T}{\alpha }_{c}^{-1}\dot{{\widehat{w}}_{c}}\) where \(\dot{{\widehat{w}}_{c}}\) is given in (22).

$${\dot{L}}_{c}=-{\tilde{w }}_{c}^{T}{\alpha }_{c}^{-1}\dot{{\widehat{w}}_{c}}={\tilde{w }}_{c}^{T}\frac{\sigma }{{\left({\sigma }^{T}\sigma +1\right)}^{2}}\left[{\sigma }^{T}\left({w}_{c}^{*}-{\tilde{w }}_{c}\right)+Q\left(x\right)+{\widehat{u}}_{c}^{T}R{\widehat{u}}_{c}\right]$$
(32)

or equivalently

$${\dot{L}}_{c}=-\frac{\sigma {\sigma }^{T}}{{\left({\sigma }^{T}\sigma +1\right)}^{2}}{\tilde{w }}_{c}^{T}{\tilde{w }}_{c}+\frac{\sigma {\sigma }^{T}}{{\left({\sigma }^{T}\sigma +1\right)}^{2}}\left[{\tilde{w }}_{c}^{T}\left(Q\left(x\right)+{\widehat{u}}_{c}^{T}R{\widehat{u}}_{c}\right)+{\tilde{w }}_{c}^{T}{\sigma }^{T}\left({w}_{c}^{*}\right)\right]$$
(33)

Considering (12), the HJB equation with the NN representation of the value function can be rewritten as

$${\varepsilon }_{\text{HJB}}=Q\left(x\left(t\right)\right)+{{w}_{c}^{*}}^{T}\nabla {\phi }_{c}\left(x\right)f\left(x\right)+\frac{1}{4}{{w}_{c}^{*}}^{T}\nabla {\phi }_{c}\left(x\right)g\left(x\right){R}^{-1}{g}^{T}\left(x\right){\nabla }^{T}{\phi }_{c}\left(x\right){w}_{c}^{*}$$
(34)

where \({\varepsilon }_{\mathrm{HJB}}\) is the residual error due to the function approximation error, which is

$$ \begin{aligned} \varepsilon _{{{\text{HJB}}}} & = - \nabla ^{T} \varepsilon _{c} \left( x \right)f\left( x \right) + \frac{1}{2}w_{c}^{{*T}} \nabla \phi _{c} \left( x \right)g\left( x \right)R^{{ - 1}} g^{T} \left( x \right)\nabla \varepsilon _{c} \left( x \right) \\ & \quad + \frac{1}{4}\nabla ^{T} \varepsilon _{c} \left( x \right)g\left( x \right)R^{{ - 1}} g^{T} \left( x \right)\nabla \varepsilon _{c} \left( x \right) \\ \end{aligned} $$
(35)

It can be shown that this error converges uniformly to zero as the number of hidden-layer units \(N\) increases [40]; hence, \({\varepsilon }_{\mathrm{HJB}}\) is bounded, \(\Vert {\varepsilon }_{\mathrm{HJB}}\Vert \le {\Xi }_{1}\).

By considering (34), one can derive

$$ \begin{aligned}\dot{L}_{c} & = - \frac{{\sigma \sigma ^{T} }}{{\left( {\sigma ^{T} \sigma + 1} \right)^{2} }}\tilde{w}_{c}^{T} \tilde{w}_{c} \\ & \quad + \frac{{\sigma \sigma ^{T} }}{{\left( {\sigma ^{T} \sigma + 1} \right)^{2} }}\bigg[ \tilde{w}_{c}^{T} \bigg( \varepsilon _{{{\text{HJB}}}} - w_{c}^{{*T}} \nabla \phi _{c}\left( x \right)f\big( x \big) - \frac{1}{4}w_{c}^{{*T}} \nabla \phi _{c}\left( x \right)g\left( x \right)R^{{ - 1}} g^{T} \left( x \right)\nabla ^{T} \phi _{c}\left( x \right)w_{c}^{*} + \hat{u}_{c}^{T} R\hat{u}_{c} \bigg)+ \tilde{w}_{c}^{T} \sigma ^{T} \left( {w_{c}^{*} } \right) \bigg] \\ \end{aligned} $$
(36)

Based on (20), \(\sigma ={\nabla \phi }_{c}\left(x\right)\left(f\left(x\right)+g\left(x\right){\widehat{u}}_{c}\right)\), and making use of \( \mathop \sigma \limits^{ = } =\frac{\sigma {\sigma }^{T}}{{\left({\sigma }^{T}\sigma +1\right)}^{2}}\),

$${\dot{L}}_{c}=-\mathop \sigma \limits^{ = }{\tilde{w }}_{c}^{T}{\tilde{w }}_{c}+\mathop \sigma \limits^{ = }\left[{\tilde{w }}_{c}^{T}\left({\varepsilon }_{\mathrm{HJB}}-{{w}_{c}^{*}}^{T}\nabla {\phi }_{c}\left(x\right)f\left(x\right)-\frac{1}{4}{{w}_{c}^{*}}^{T}\nabla {\phi }_{c}\left(x\right)g\left(x\right){R}^{-1}{g}^{T}\left(x\right){\nabla }^{T}{\phi }_{c}\left(x\right){w}_{c}^{*}+{\widehat{u}}_{c}^{T}R{\widehat{u}}_{c}\right)+{\tilde{w }}_{c}^{T}{{w}_{c}^{*}}^{T}\nabla {\phi }_{c}\left(x\right)\left(f\left(x\right)+g\left(x\right){\widehat{u}}_{c} \right)\right]$$
(37)

where \(\mathop \sigma \limits^{ = }\) is bounded, \( \left\| {\mathop \sigma \limits^{ = } } \right\| \le \Xi _{{\mathop \sigma \limits^{ = } }} \). Then,

$$\dot{L}_{c} = - \mathop \sigma \limits^{ = } \tilde{w}_{c}^{T} \tilde{w}_{c} + \mathop \sigma \limits^{ = }\left[ \tilde{w}_{c}^{T} \left( \varepsilon _{{{\text{HJB}}}} - \frac{1}{4}w_{c}^{{*T}} \nabla \phi _{c} \left( x \right)g\left( x \right)R^{{ - 1}} g^{T} \left( x \right)\nabla ^{T} \phi _{c} \left( x \right)w_{c}^{*} + \hat{u}_{c}^{T} R\hat{u}_{c} \right) + \tilde{w}_{c}^{T} w_{c}^{{*T}} \nabla \phi _{c} \left( x \right)\left( {g\left( x \right)\hat{u}_{c} ~} \right) \right] $$
(38)

For the second term of (31), \({L}_{V}={V}^{*}(x)\), the time derivative is

$$ \begin{aligned} \dot{L}_{V} & = \left( {w_{c}^{{*T}} \nabla \phi _{c} \left( x \right) + \nabla ^{T} \varepsilon _{c} \left( x \right)} \right)\left( {f\left( x \right) + g\left( x \right)\hat{u}_{c} } \right) \\ & = ~\left( {w_{c}^{{*T}} \nabla \phi _{c} \left( x \right) + \nabla ^{T} \varepsilon _{c} \left( x \right)} \right)\left( {f\left( x \right) + g\left( x \right)\Gamma \left( {\hat{w}_{a}^{T} \phi _{a} \left( x \right)} \right)} \right) \\ \end{aligned} $$
(39)

Combining with (34),

$$ \begin{aligned} \dot{L}_{V} & = \nabla ^{T} \varepsilon _{c} \left( x \right)\left( {f\left( x \right) + g\left( x \right)\Gamma \left( {\hat{w}_{a}^{T} \phi _{a} \left( x \right)} \right)} \right) \\ & \quad + w_{c}^{{*T}} \nabla \phi _{c} \left( x \right)f\left( x \right) + w_{c}^{{*T}} \nabla \phi _{c} \left( x \right)g\left( x \right)\Gamma \left( {\hat{w}_{a}^{T} \phi _{a} \left( x \right)} \right) \\ & = \nabla ^{T} \varepsilon _{c} \left( x \right)\left( {f\left( x \right) + g\left( x \right)\Gamma \left( {\hat{w}_{a}^{T} \phi _{a} \left( x \right)} \right)} \right) \\& \quad + \varepsilon _{{HJB}} - Q\left( {x\left( t \right)} \right) - \frac{1}{4}w_{c}^{{*T}} \nabla \phi _{c} \left( x \right)g\left( x \right)R^{{ - 1}} g^{T} \left( x \right)\nabla ^{T} \phi _{c} \left( x \right)w_{c}^{*} \\ \end{aligned} $$
(40)

For the third term of (31), one can write

$$ \begin{aligned} \dot{L}_{a} & = ~\tilde{w}_{a}^{T} \alpha _{a}^{{ - 1}} \mathop {\tilde{w}_{a} }\limits^{ \cdot } = - \tilde{w}_{a}^{T} \alpha _{a}^{{ - 1}} \mathop {\hat{w}_{a} }\limits^{ \cdot } = \tilde{w}_{a}^{T} \left( {\phi _{a} \left( x \right)\Gamma ^{\prime } \left( {\hat{w}_{a}^{T} \phi _{a} \left( x \right)} \right)e_{a} } \right) \\ & = ~\tilde{w}_{a}^{T} \left( {\phi _{a} \left( x \right)\Gamma ^{\prime}\left( {\hat{w}_{a}^{T} \phi _{a} \left( x \right)} \right)\left( {\Gamma \left( {\hat{w}_{a}^{T} \phi _{a} \left( x \right)} \right) + \frac{1}{2}R^{{ - 1}} g^{T} \left( x \right)\hat{w}_{c}^{T} \nabla \phi _{c} \left( x \right)} \right)} \right) \\ \end{aligned} $$
(41)

Consequently,

$$ \begin{aligned} \dot{L}_{a} & = ~\tilde{w}_{a}^{T} \phi _{a} \left( x \right)\Gamma ^{\prime}\left( {\hat{w}_{a}^{T} \phi _{a} \left( x \right)} \right)\Gamma \left( {\hat{w}_{a}^{T} \phi _{a} \left( x \right)} \right) \\ & \quad + \tilde{w}_{a}^{T} \phi _{a} \left( x \right)\Gamma ^{\prime}\left( {\hat{w}_{a}^{T} \phi _{a} \left( x \right)} \right)\frac{1}{2}R^{{ - 1}} g^{T} \left( x \right)\hat{w}_{c}^{T} \nabla \phi _{c} \left( x \right) \\ \end{aligned} $$
(42)

Based on (28)–(30), (42) becomes

$$ \begin{aligned} \dot{L}_{a} & = \tilde{D}^{T} \Gamma ^{\prime}\left( {\hat{D}} \right)\Gamma \left( {\hat{D}} \right) + \frac{1}{2}\tilde{D}^{T} \Gamma ^{\prime}\left( {\hat{D}} \right)R^{{ - 1}} g^{T} \left( x \right)\hat{w}_{c}^{T} \nabla \phi _{c} \left( x \right) \\ & = \tilde{D}^{T} \Gamma ^{\prime}\left( {\hat{D}} \right)\Gamma \left( {\hat{D}} \right) - \tilde{D}^{T} \Gamma ^{\prime}\left( {\hat{D}} \right)u_{c} \\ & = \tilde{D}^{T} \Gamma ^{\prime}\left( {\hat{D}} \right)\Gamma \left( {\hat{D}} \right) - \tilde{D}^{T} \Gamma ^{\prime}\left( {\hat{D}} \right)\left( {\Gamma \left( {\hat{D}} \right) - e_{a} } \right) = \tilde{D}^{T} \Gamma ^{\prime}\left( {\hat{D}} \right)e_{a} ~ \\ \end{aligned} $$
(43)

Then, \({u}_{c}^{*}\left(x\right)\) can be written as

$${u}_{c}^{*}\left(x\right)=\Gamma \left({D}^{*}\right)+{\varepsilon }_{a}\left(x\right)=-\frac{1}{2}{R}^{-1}{g}^{T}\left(x\right){{w}_{c}^{*}}^{T}\nabla {\phi }_{c}\left(x\right)$$
(44)

Based on (25), \( {e}_{a}\) becomes

$$ \begin{aligned} e_{a} & = \hat{u}_{c} \left( x \right) - u_{c} \left( x \right) \\ & = \Gamma \left( {\hat{D}} \right) + \frac{1}{2}R^{{ - 1}} g^{T} \left( x \right)\left( {w_{c}^{{*T}} - \tilde{w}_{c}^{T} } \right)\nabla \phi _{c} \left( x \right) \\ & = \Gamma \left( {\hat{D}} \right) - \Gamma \left( {D^{*} } \right) - \varepsilon _{a} \left( x \right) - \frac{1}{2}R^{{ - 1}} g^{T} \left( x \right)\tilde{w}_{c}^{T} \nabla \phi _{c} \left( x \right) \\ & = ~ - \Gamma \left( {\tilde{D}} \right) - \varepsilon _{a} \left( x \right) - \frac{1}{2}R^{{ - 1}} g^{T} \left( x \right)\tilde{w}_{c}^{T} \nabla \phi _{c} \left( x \right) \\ \end{aligned} $$
(45)

Then, using (45) in (43),

$$ \begin{aligned} \dot{L}_{a} & = - \tilde{D}^{T} \Gamma^{\prime}\left( {\hat{D}} \right)\left( {\Gamma \left( {\tilde{D}} \right) + \varepsilon_{a} \left( x \right) + \frac{1}{2}R^{ - 1} g^{T} \left( x \right)\tilde{w}_{c}^{T} \nabla \phi_{c} \left( x \right)} \right) \\ & = - \tilde{D}^{T} \Gamma^{\prime}\left( {\hat{D}} \right)\varepsilon_{a} \left( x \right) - \tilde{D}^{T} \Gamma^{\prime}\left( {\hat{D}} \right)\left( {\Gamma \left( {\tilde{D}} \right)} \right) - \tilde{D}^{T} \Gamma^{\prime}\left( {\hat{D}} \right)\frac{1}{2}R^{ - 1} g^{T} \left( x \right)\tilde{w}_{c}^{T} \nabla \phi_{c} \left( x \right) \\ \end{aligned} $$
(46)

By applying Young's inequality to the last term, we have

$$ \begin{aligned} & - \tilde{D}^{T} \Gamma^{\prime}\left( {\hat{D}} \right)\frac{1}{2}R^{ - 1} g^{T} \left( x \right)\nabla \phi_{c} \left( x \right)\tilde{w}_{c} \\ & \le \frac{1}{4}\tilde{w}_{c} \nabla^{T} \phi_{c} \left( x \right)g\left( x \right)R^{ - 1} g^{T} \left( x \right)\nabla \phi_{c} \left( x \right) \tilde{w}_{c} \\ & \quad + \tilde{D}^{T} \Gamma^{\prime}\left( {\hat{D}} \right)\Gamma^{\prime}\left( {\hat{D}} \right)^{T} \tilde{D} \\ \end{aligned} $$
(47)

Then, we rewrite (46)

$$ \begin{aligned} \dot{L}_{a} & \le - \tilde{D}^{T} \Gamma^{\prime}\left( {\hat{D}} \right)\left( { - \Gamma^{\prime}\left( {\hat{D}} \right)^{T} \tilde{D} + \varepsilon_{a} \left( x \right) + \left( {\Gamma \left( {\tilde{D}} \right)} \right)} \right) \\ & \quad + \frac{1}{4}\tilde{w}_{c} \nabla^{T} \phi_{c} \left( x \right)g\left( x \right)R^{ - 1} g^{T} \left( x \right)\nabla \phi_{c} \left( x \right) \tilde{w}_{c} \\ \end{aligned} $$
(48)

By considering softsign as the activation function, \(\Gamma \left(y\right)=\frac{y}{\left(1+\left|y\right|\right)}\) and therefore \(\Gamma{^{\prime}}\left(y\right)=\frac{1}{{\left(1+\left|y\right|\right)}^{2}}\). With these functions, one can show that the first term of (48) can be written as

$$-{\tilde{D }}^{T}{\Gamma }^{^{\prime}}\left(\widehat{D}\right)\left(-{\Gamma }^{^{\prime}}{\left(\widehat{D}\right)}^{T}\tilde{D }+{\varepsilon }_{a}\left(x\right)+\left(\Gamma \left(\tilde{D }\right)\right)\right)=-{\tilde{D }}^{T}{\Gamma }^{^{\prime}}\left(\widehat{D}\right)\left(\frac{-\tilde{D }}{{\left(1+\Vert \widehat{D}\Vert \right)}^{2}}+{\varepsilon }_{a}\left(x\right)-\frac{\widehat{D}}{\left(1+\Vert \widehat{D}\Vert \right)}+\frac{{D}^{*}}{\left(1+\Vert {D}^{*}\Vert \right)}\right)=-{\tilde{D }}^{T}{\Gamma }^{^{\prime}}\left(\widehat{D}\right)\left(\frac{-\left({D}^{*}-\widehat{D}\right)-\widehat{D}\left(1+\Vert \widehat{D}\Vert \right)+{D}^{*}\left(1+\Vert \widehat{D}\Vert \right)}{{\left(1+\Vert \widehat{D}\Vert \right)}^{2}}\right)=-{\tilde{D }}^{T}{\Gamma }^{^{\prime}}\left(\widehat{D}\right)\frac{\Vert \widehat{D}\Vert \tilde{D }}{{\left(1+\Vert \widehat{D}\Vert \right)}^{2}}$$
(49)

Substituting (49) in (48)

$${\dot{L}}_{a}\le -{\tilde{D }}^{T}{\Gamma }^{^{\prime}}\left(\widehat{D}\right)\frac{\Vert \widehat{D}\Vert \tilde{D }}{{\left(1+\Vert \widehat{D}\Vert \right)}^{2}}+\frac{1}{4}{\tilde{w }}_{c}{\nabla }^{T}{\phi }_{c}\left(x\right)g\left(x\right){R}^{-1}{g}^{T}\left(x\right)\nabla {\phi }_{c}\left(x\right) {\tilde{w }}_{c}$$
(50)

Next, using (38), (40), and (50) to rewrite the derivative of (31), we get

$$ \begin{aligned}\dot{L} &=-Q\left(x\left(t\right)\right)-\mathop \sigma \limits^{ = }{\tilde{w }}_{c}^{T}{\tilde{w }}_{c}-{\tilde{D }}^{T}{\Gamma }^{^{\prime}}\left(\widehat{D}\right)\frac{\Vert \widehat{D}\Vert \tilde{D }}{{\left(1+\Vert \widehat{D}\Vert \right)}^{2}}-\frac{1}{4}{{w}_{c}^{*}}^{T}\nabla {\phi }_{c}\left(x\right)g\left(x\right){R}^{-1}{g}^{T}\left(x\right){\nabla }^{T}{\phi }_{c}\left(x\right){w}_{c}^{*} -{\tilde{D }}^{T}{\Gamma }^{^{\prime}}\left(\widehat{D}\right)\left(\Gamma \left(\tilde{D }\right)\right)\\ &\quad +{\nabla }^{T}{\varepsilon }_{c}\left(x\right)\left(f\left(x\right)+g\left(x\right)\Gamma \left({\widehat{w}}_{a}^{T}{\phi }_{a}\left(x\right)\right)\right)\\ &\quad+{\varepsilon }_{\mathrm{HJB}}+\mathop \sigma \limits^{ = }\bigg[{\tilde{w }}_{c}^{T}\left({\varepsilon }_{\mathrm{HJB}}-\frac{1}{4}{{w}_{c}^{*}}^{T}\nabla {\phi }_{c}\left(x\right)g\left(x\right){R}^{-1}{g}^{T}\left(x\right){\nabla }^{T}{\phi }_{c}\left(x\right){w}_{c}^{*}+{\widehat{u}}_{c}^{T}R{\widehat{u}}_{c}\right)\\ &\quad+{\tilde{w }}_{c}^{T}{{w}_{c}^{*}}^{T}\nabla {\phi }_{c}\left(x\right)\left(g\left(x\right){\widehat{u}}_{c}\right)\bigg]+\frac{1}{4}{\varepsilon }_{\mathcal{K}}{\tilde{w }}_{c}{\nabla }^{T}{\phi }_{c}\left(x\right)g\left(x\right){R}^{-1}{g}^{T}\left(x\right)\nabla {\phi }_{c}\left(x\right) {\tilde{w }}_{c}\end{aligned}$$
(51)

Let \({\Upsilon }_{0}=\frac{{\Gamma }^{\mathrm{^{\prime}}}\left(\widehat{D}\right)\Vert \widehat{D}\Vert }{{\left(1+\Vert \widehat{D}\Vert \right)}^{2}}\), \({\Upsilon}_{1}=\frac{1}{4}{\varepsilon }_{\mathcal{K}}{\nabla }^{T}{\phi }_{c}\left(x\right)g\left(x\right){R}^{-1}{g}^{T}\left(x\right)\nabla {\phi }_{c}\left(x\right)\), and \({\Upsilon}_{2}=\frac{\mathop \sigma \limits^{ = }}{4}{{w}_{c}^{*}}^{T}\nabla {\phi }_{c}\left(x\right)g\left(x\right){R}^{-1}{g}^{T}\left(x\right){\nabla }^{T}{\phi }_{c}\left(x\right){w}_{c}^{*}\), which are positive definite. Based on the definitions of \({\widehat{u}}_{c}\), \({\varepsilon }_{\mathrm{HJB}}\), \({\nabla }^{T}{\varepsilon }_{c}\), \(\Gamma \left({\widehat{w}}_{a}^{T}{\phi }_{a}\left(x\right)\right)\), \(\mathop \sigma \limits^{ = }\), \({w}_{c}^{*}\), \(\nabla {\phi }_{c}\left(x\right)\) and Assumptions 2 and 3, one can conclude

$$ \begin{aligned} &\Vert {\varepsilon }_{\mathrm{HJB}}\Vert \le {\Xi }_{1}\\&\Vert {\nabla }^{T}{\varepsilon }_{c}\left(x\right)\left(f\left(x\right)+g\left(x\right)\Gamma \left({\widehat{w}}_{a}^{T}{\phi }_{a}\left(x\right)\right)\right)\Vert \le {\Xi }_{{\varepsilon }_{1}}\left({\Xi }_{f}+{\Xi }_{g}{\Xi }_{u}\right)={\Xi }_{2}\\&\Vert \mathop \sigma \limits^{ = }{\widehat{u}}_{c}^{T}R{\widehat{u}}_{c}\Vert \le {\Xi }_{\mathop \sigma \limits^{ = }}{\Xi }_{u}^{2}R={\Xi }_{3}\\&\left\| {\mathop \sigma \limits^{ = } w_{c}^{{*T}} \nabla \phi _{c} \left( x \right)\left( {g\left( x \right)\hat{u}_{c} } \right)} \right\| \le \Xi _{{\mathop \sigma \limits^{ = } }} \Xi _{{w_{c}^{*} }} \Xi _{{\nabla \phi _{c} }} \Xi _{g} \Xi _{u} = \Xi _{4} \end{aligned}$$
(52)

where \({\Xi }_{u}\), \({\Xi }_{1}\), \({\Xi }_{{\varepsilon }_{1}}\), \({\Xi }_{\mathop \sigma \limits^{ = }}\), \({\Xi }_{{w}_{c}^{*}}\), \({\Xi }_{\nabla {\phi }_{c}}\), \({\Xi }_{f}\) and \({\Xi }_{g}\) are the upper bound of \({\widehat{u}}_{c}\), \({\varepsilon }_{HJB}\), \({\nabla }^{T}{\varepsilon }_{c}\), \(\Gamma \left({\widehat{w}}_{a}^{T}{\phi }_{a}\left(x\right)\right)\), \(\mathop \sigma \limits^{ = }\), \({w}_{c}^{*}\), \(\nabla {\phi }_{c}\left(x\right)\), \(f\left(x\right)\) and\(g\left(x\right)\), respectively. Consequently, we can obtain

$$ \dot{L} \le - \lambda_{Q} x^{2} - \tilde{w}_{c}^{T} \left( {\mathop \sigma \limits^{ = } - \Upsilon_{1} } \right)\tilde{w}_{c} - \tilde{D}^{T} \Upsilon_{0} \tilde{D} - \tilde{w}_{c}^{T} \left( {\Upsilon_{2} - \Xi_{3} - \Xi_{4} } \right) + \Xi_{1} + \Xi_{2} $$
(53)

where \({\lambda }_{Q}\) is a positive constant such that \(Q\left(x\left(t\right)\right)>{x}^{T}{\lambda }_{Q}x\) for every \(x\in\Omega \). If we choose \({\varepsilon }_{\mathcal{K}}\) and \(R\) such that \(\left(\mathop \sigma \limits^{ = }-{\Upsilon}_{1}\right)\) is positive definite, then \(\dot{L}\) yields

$$ \dot{L} \le - \lambda _{{{\text{min}}}} \left( {\mathcal{G}} \right)\left\| {\mathop Z\limits^{ = } } \right\|^{2} + \left\| q \right\|\left\| {\mathop Z\limits^{ = } } \right\| + \Sigma $$
(54)

where \( \mathop {{Z}}\limits^{ = } = \left[ {\begin{array}{*{20}c} x \\ {\tilde{D}} \\ {\tilde{w}_{c} } \\ \end{array} } \right] \) and

$$\mathcal{G}=\left[\begin{array}{ccc}Q& 0& 0\\ 0& {\Upsilon }_{0}& 0\\ 0& 0& \left( \mathop \sigma \limits^{ = } -{\Upsilon}_{1}\right)\end{array}\right]$$
$$ q = \left[ {\begin{array}{*{20}c} 0 \\ 0 \\ { - \left( {{\Upsilon }_{2} - \Xi_{3} - \Xi_{4} } \right)} \\ \end{array} } \right] $$
$$ \Sigma = \Xi_{1} + \Xi_{2} $$

If the parameters are selected such that \(\mathcal{G}\) is positive definite, then the Lyapunov derivative is negative whenever

$$ \left\| {\mathop Z\limits^{ = } } \right\| \ge \frac{{\left\| q \right\| + \sqrt {\left\| q \right\|^{2} + 4\lambda _{{\min }} \left({\mathcal{G}} \right)\Sigma } }}{{2\lambda _{{\min }} \left({\mathcal{G}} \right)}} $$
(55)

Hence, \(\dot{L}\) is negative for sufficiently large \( \left\| {\mathop Z\limits^{ = } } \right\| \). Therefore, by the standard Lyapunov extension theorem [58], the system state and the weight estimation errors are UUB, which completes the proof.

Remark 2

If Assumptions 1–4 hold, then the assumption that the nonlinear HJB equation (12) admits an exact solution can be relaxed without loss of stability, and the equilibrium point of (5) remains UUB.

Remark 3

By utilizing gradient descent rules (22), (26), and the backpropagation rule, the HJB error \({E}_{c}=\frac{1}{2}{e}_{c}^{2}\) and the point-wise control error \({E}_{a}=\frac{1}{2}{e}_{a}^{2}\) update the critic and actor networks. The error \({E}_{c}\) is convex with respect to \({\widehat{w}}_{c}\) . Hence, the critic network's weight converges to its global optimal point by applying the updating rule (22). The error \({E}_{a}\) is non-convex with respect to \({\widehat{w}}_{a}\) so the weights of the actor-network converge to a locally optimal point by applying the updating rule (26).

4 Optimal vaccination strategies

In this study, several optimal vaccination strategies are proposed. For each strategy, we formulate a cost function that captures the main concerns and characteristics to be taken into account, and we define the appropriate model for it. It should be noted that the main objective is to reduce the effects of COVID-19 and the domino effects that stem from its spread and progression. In this section, the optimal control principle provides an optimized approach that can limit the impact of the COVID-19 outbreak. Therefore, to minimize the defined objective functions, we apply reinforcement learning optimal control to obtain the optimal policy. Here, the HJB equation described in each subsection ensures that the necessary condition for optimality is satisfied.

4.1 Strategy 1

In view of the above discussion, we formulate the optimal control problem with vaccination as the control input. Motivated by the models proposed in [59] and [60], we introduce a mathematical model with an end-point state constraint, a control input inspired by [37], and vaccine efficacy taken into account. This model is shown in Fig. 1 and can be derived as follows:

$$ \begin{aligned} \dot{x} & = f\left( x \right) + g\left( x \right)u \\ f\left( {x\left( t \right)} \right) & = \begin{bmatrix} - \left( \beta c + cq\left( 1 - \beta \right) \right)S\left( I + \theta A \right) + \lambda S_{q} \\ \beta c\left( 1 - q \right)S\left( I + \theta A \right) - \sigma E + \left( 1 - e_{v} \right)\beta cV_{a} \left( I + \theta A \right) \\ \sigma \varrho E - \left( \delta_{I} + \alpha + \gamma_{I} \right)I \\ \sigma \left( 1 - \varrho \right)E - \gamma_{A} A \\ \left( 1 - \beta \right)cqS\left( I + \theta A \right) - \lambda S_{q} \\ \beta cqS\left( I + \theta A \right) - \delta_{q} E_{q} \\ \delta_{I} I + \delta_{q} E_{q} - \left( \alpha + \gamma_{H} \right)H \\ \gamma_{I} I + \gamma_{A} A + \gamma_{H} H \\ - \left( 1 - e_{v} \right)\beta cV_{a} \left( I + \theta A \right) \end{bmatrix} \\ g\left( x \right) & = \left[ { - S, 0,0,0,0,0,0,0,S} \right]^{T} \\ x\left( t \right) & = \left[ {S\left( t \right),E\left( t \right),I\left( t \right),A\left( t \right),S_{q} \left( t \right),E_{q} \left( t \right),H\left( t \right),R\left( t \right),V_{a} \left( t \right)} \right]^{T} \\ \end{aligned} $$
(56)

where \(u\) denotes the ratio of susceptible individuals vaccinated per day \((0\le u\le 1)\) and \({e}_{v}\) is the vaccine efficacy: if \({e}_{v}=1\), the vaccine is fully effective. The term \(\left(1-{e}_{v}\right)\beta c{V}_{a}\left(I+\theta A\right)\) represents vaccinated people who can still become infected because of vaccine incompleteness and expresses the fact that no vaccine is 100% effective. Moreover, in the introduced model, \({V}_{a}\) denotes the vaccinated subpopulation. In this strategy, we wish to minimize an objective function that accounts for the infected individuals and the ratio of vaccinated people:

Fig. 1

Transmission diagram of the dynamics of COVID-19 spread with vaccination implemented under a Strategy 1 or 2, b Strategy 3, c Strategy 4

$${V}^{*}=\mathrm{min}\left({\int }_{0}^{{t}_{f}}{\Gamma }_{I,{s}_{1}}{I}^{2}(t)+{{\Gamma }_{u,{s}_{1}}u}^{2}\mathrm{d}t\right)$$
(57)

where \({\Gamma }_{I,{s}_{1}}\) and \({\Gamma }_{u,{s}_{1}}\) are relative weight factors selected to balance the objective function over the intervention time \({t}_{f}\).
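For illustration, the sketch below extends the earlier (assumed) `covid_rhs` with the vaccine-efficacy leakage term and control matrix of (56), and encodes the Strategy 1 running cost of (57); the weight values and the `e_v` default are placeholders.

```python
import numpy as np

def g_strategy1(x):
    """Control matrix g(x) of (56): vaccination moves susceptibles from S into V_a."""
    S = x[0]
    g = np.zeros((9, 1))
    g[0, 0] = -S     # vaccinated susceptibles leave S ...
    g[8, 0] = S      # ... and enter the vaccinated class V_a
    return g

def f_strategy1(t, x, p, e_v=0.90):
    """Drift f(x) of (56): system (1) plus the leaky-vaccine exposure term."""
    S, E, I, A, Sq, Eq, H, R, Va = x
    base = covid_rhs(t, x[:8], p)                              # first eight states as in (1)
    leak = (1 - e_v) * p["beta"] * contact_rate(t, p) * Va * (I + p["theta"] * A)
    base[1] += leak                                            # imperfectly protected vaccinees become exposed
    return np.concatenate([base, [-leak]])                     # dVa/dt = -leak (control enters through g)

def running_cost_strategy1(x, u, G_I=1.0, G_u=1.0):
    """Integrand of the Strategy 1 objective (57); the weights G_I, G_u are placeholders."""
    return G_I * x[2] ** 2 + G_u * float(u) ** 2               # x[2] = I(t)
```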

4.2 Strategy 2

The objective function (57) can be enhanced by including the exposed individuals as a population to be minimized. In fact, by employing a strategy whose aim is also to reduce the number of exposed people, we can find a better solution that minimizes the number of infected individuals and the cost of vaccination. More precisely, we seek the optimal control that minimizes the objective functional

$${V}^{*}=\mathrm{min}\left({\int }_{0}^{{t}_{f}}{{\Gamma }_{I,{s}_{2}}I}^{2}(t)+{{\Gamma }_{E,{s}_{2}}E}^{2}(t)+{\Gamma }_{u,{s}_{2}}{u}^{2}\mathrm{d}t\right)$$
(58)

subject to the mathematical model proposed in (56). Similar to (57), \({\Gamma }_{I,{s}_{2}}\), \({\Gamma }_{E,{s}_{2}}\) and \({\Gamma }_{u,{s}_{2}}\) are positive weights that balance the cost function.

4.3 Strategy 3

In this strategy, we consider vaccination of quarantined individuals as a method that can effectively reduce the cost of vaccination and the number of infected and exposed individuals. Hence, the vaccination control variable is also imposed on quarantined individuals who are susceptible to the virus. The epidemic model with the control input imposed on both susceptible compartments is then given by

$$ \begin{aligned} \dot{x} & = f\left( x \right) + g\left( x \right)u \\ f\left( {x\left( t \right)} \right) & = \begin{bmatrix} - \left( \beta c + cq\left( 1 - \beta \right) \right)S\left( I + \theta A \right) + \lambda S_{q} \\ \beta c\left( 1 - q \right)S\left( I + \theta A \right) - \sigma E + \left( 1 - q \right)\left( 1 - e_{v} \right)\beta cV_{a} \left( I + \theta A \right) \\ \sigma \varrho E - \left( \delta _{I} + \alpha + \gamma _{I} \right)I \\ \sigma \left( 1 - \varrho \right)E - \gamma _{A} A \\ \left( 1 - \beta \right)cqS\left( I + \theta A \right) - \lambda S_{q} \\ \beta cqS\left( I + \theta A \right) - \delta _{q} E_{q} + q\left(1 - e_{v} \right)\beta cV_{a} \left(I + \theta A\right) \\ \delta _{I} I + \delta _{q} E_{q} - \left( \alpha + \gamma _{H} \right)H \\ \gamma _{I} I + \gamma _{A} A + \gamma _{H} H \\ - \left( 1 - e_{v} \right)\beta cV_{a} \left( I + \theta A \right) \end{bmatrix} \\ g\left( x \right) & = \left[ { - S,0,0,0, - S_{q} ,0,0,0,\left( {S + S_{q} } \right)} \right]^{T} \\ x\left( t \right) & = \left[ {S\left( t \right),E\left( t \right),I\left( t \right),A\left( t \right),S_{q} \left( t \right),E_{q} \left( t \right),H\left( t \right),R\left( t \right),V_{a} \left( t \right)} \right]^{T} \\ \end{aligned} $$
(59)

In this model, the vaccination is distributed equally between the quarantined and non-quarantined susceptible individuals. Since quarantined individuals are treated as one group that is vaccinated, the quadratic objective functional, extending (58), is defined as

$${V}^{*}=\mathrm{min}\left({\int }_{0}^{{t}_{f}}{{\Gamma }_{I,{s}_{2}}I}^{2}(t)+{{\Gamma }_{E,{s}_{2}}E}^{2}\left(t\right)+{{\Gamma }_{{E}_{q},{s}_{2}}{E}_{q}}^{2}(t)+{\Gamma }_{u,{s}_{2}}{u}^{2}\mathrm{d}t\right)$$
(60)

where \({{\Gamma }_{{E}_{q},{s}_{2}}{E}_{q}}^{2}(t)\) is the quadratic term for the quarantined exposed individuals, representing the population that we wish to minimize besides the infected and non-quarantined exposed individuals.

4.4 Strategy 4

Here, instead of the uniform allocation of vaccine proposed in Strategy 3, we use two independent control variables for the propagation control of the coronavirus. The resulting control model, after incorporating the aforementioned control variables, is formulated via the following system:

$$ \begin{aligned} \dot{x} & = f\left( x \right) + g\left( x \right){\mathcal{V}} \\ f\left( {x\left( t \right)} \right) & = \begin{bmatrix} - \left( \beta c + cq\left( 1 - \beta \right) \right)S\left( I + \theta A \right) + \lambda S_{q} \\ \beta c\left( 1 - q \right)S\left( I + \theta A \right) - \sigma E + \left( 1 - q \right)\left( 1 - e_{v} \right)\beta cV_{a} \left( I + \theta A \right) \\ \sigma \varrho E - \left( \delta _{I} + \alpha + \gamma _{I} \right)I \\ \sigma \left( 1 - \varrho \right)E - \gamma _{A} A \\ \left( 1 - \beta \right)cqS\left( I + \theta A \right) - \lambda S_{q} \\ \beta cqS\left( I + \theta A \right) - \delta _{q} E_{q} + q\left(1 - e_{v} \right)\beta cV_{a} \left(I + \theta A\right) \\ \delta _{I} I + \delta _{q} E_{q} - \left( \alpha + \gamma _{H} \right)H \\ \gamma _{I} I + \gamma _{A} A + \gamma _{H} H \\ - \left( 1 - e_{v} \right)\beta cV_{a} \left( I + \theta A \right) \end{bmatrix} \\ g\left( x \right) & = \left[ {\begin{array}{*{20}c} { - S,0,0,0,0,0,0,0,S} \\ {0,0,0,0, - S_{q} ,0,0,0,S_{q} } \\ \end{array} } \right]^{T} \\ x\left( t \right) & = \left[ {S\left( t \right),E\left( t \right),I\left( t \right),A\left( t \right),S_{q} \left( t \right),E_{q} \left( t \right),H\left( t \right),R\left( t \right),V_{a} \left( t \right)} \right]^{T} \\ \end{aligned} $$
(61)
$${\mathcal{V}} = \left[ {\begin{array}{*{20}c} u \\ {u_{q} } \\ \end{array} } \right] $$

In this strategy, \({u}_{q}\) is the input variable representing the fraction of quarantined susceptible individuals who are vaccinated per day. By modifying the control-variable assumption of Strategy 3, the objective functional extending (60) is given as

$$ V^{*} = \min \left( {\mathop \smallint \limits_{0}^{{t_{f} }} \Gamma_{{I,s_{2} }} I^{2} (t) + \Gamma_{{E,s_{2} }} E^{2} \left( t \right) + \Gamma_{{E_{q} ,s_{2} }} E_{q}^{2} \left( t \right) + \Gamma_{{u_{q} ,s_{2} }} u_{q}^{2} + \Gamma_{{u,s_{2} }} u^{2} {\text{d}}t} \right) $$
(62)

where \({\Gamma }_{{u}_{q},{s}_{2}}{u}_{q}^{2}\) accounts for minimizing the vaccination of quarantined individuals. This term reflects the importance of optimizing the vaccine allocation. Moreover, the constant \({\Gamma }_{{u}_{q},{s}_{2}}\), as in the previous strategies, is a balancing factor that measures the relative cost of quarantined vaccination. Figure 1 shows these strategies. In the next section, we present the results of each strategy and compare the numerical results of their optimal solutions.
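The main structural difference in Strategy 4 is the \(9\times 2\) control matrix of (61), which routes the two independent vaccination rates from \(S\) and \({S}_{q}\) into \({V}_{a}\). A minimal sketch under the same assumptions as before:

```python
import numpy as np

def g_strategy4(x):
    """Control matrix g(x) of (61): column 1 applies u to S, column 2 applies u_q to S_q."""
    S, Sq = x[0], x[4]
    g = np.zeros((9, 2))
    g[0, 0], g[8, 0] = -S, S       # u moves susceptibles from S to V_a
    g[4, 1], g[8, 1] = -Sq, Sq     # u_q moves quarantined susceptibles from S_q to V_a
    return g

def running_cost_strategy4(x, v, G_I=1.0, G_E=1.0, G_Eq=1.0, G_u=1.0, G_uq=1.0):
    """Integrand of (62); v = [u, u_q] and all weights are placeholders."""
    E, I, Eq = x[1], x[2], x[5]
    u, uq = v
    return G_I * I**2 + G_E * E**2 + G_Eq * Eq**2 + G_u * u**2 + G_uq * uq**2
```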

These strategies are designed on the assumption that the susceptible people have already been identified. Moreover, in this article, susceptible people are considered the only group that should be prioritized for the vaccine, because they are more likely to become infected, and their infection will be more severe than in other people. After identification of the susceptible individuals, they should be classified. In this case, susceptibility can be discerned through potential risk factors such as age or pregnancy. Consequently, it will be necessary to monitor the susceptible individuals and prioritize them according to their conditions.

5 Numerical results

In this section, we simulate the epidemiological model with vaccination based on the data from laboratory-confirmed cases of 2019-nCoV in mainland China reported in [36]. It should be noted that their research was based on a dataset and surveys collected until January 22, 2020, and that they employed Markov chain Monte Carlo to estimate the model parameters and their baselines. Based on these parameters, we implement the four strategies in Python. In each strategy, the balancing factors are chosen to offset the differences in magnitude between the objective terms. In this simulation, the embedded Runge–Kutta method (RK5(4)) [61] has been used to integrate the dynamics of the epidemiological system. According to the research in [62], we assume a vaccine efficacy \({e}_{v}=0.90\), and the parameters in the optimal control framework are taken as

$${\phi }_{c}\left(x\right)=[{S}^{2},{E}^{2},{I}^{2},{A}^{2},{S}_{q}^{2},{E}_{q}^{2},{H}^{2},{R}^{2},S\cdot E,S\cdot I,S\cdot A,S\cdot {S}_{q},S\cdot {E}_{q},S\cdot H,S\cdot R],$$
$${\phi }_{a}\left(x\right)=[S,E,I,A,{S}_{q},{E}_{q},H,R]$$

The initial values of weights are as follows:

$${\widehat{w}}_{c}=[{10}^{-5},\dots ,{10}^{-5}]$$
$${\widehat{w}}_{a}=[{10}^{-4},\dots ,{10}^{-4}]$$
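As a rough sketch of the simulation setup (with the learning updates omitted and the actor weights frozen at their initial values), the closed-loop Strategy 1 system \(\dot{x}=f(x)+g(x)\widehat{u}_{c}(x)\) can be integrated with SciPy's embedded RK5(4) method. Here `softsign`, `f_strategy1`, `g_strategy1`, and the parameter dictionary `p` are the assumed sketches above, and `x0` is a placeholder initial condition rather than the Table 2 values.

```python
import numpy as np
from scipy.integrate import solve_ivp

w_a = 1e-4 * np.ones(8)      # initial actor weights, one per entry of phi_a(x) = [S, E, ..., R]

# Placeholder initial condition (not the Table 2 values).
x0 = np.array([1.1e7, 100.0, 30.0, 20.0, 1.0e4, 50.0, 1.0, 2.0, 0.0])

def closed_loop(t, x):
    # Bounded vaccination ratio from the actor, clipped to 0 <= u <= 1.
    u = np.clip(softsign(w_a @ x[:8]), 0.0, 1.0)
    return f_strategy1(t, x, p) + g_strategy1(x) @ np.atleast_1d(u)

# Embedded Runge-Kutta RK5(4) integration of the closed-loop Strategy 1 model.
sol = solve_ivp(closed_loop, (0.0, 150.0), x0, method="RK45", max_step=0.5)
# sol.t and sol.y contain the time grid and trajectories of [S, E, I, A, Sq, Eq, H, R, Va].
```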

Based on [36], the model's baselines and initial values are given in Tables 1 and 2; we use them as the baselines and initial values of the model. Next, according to the defined cost functions (57), (58), (60), and (62), reinforcement learning optimal control has been applied as a feedback controller. The time evolution of the respective subpopulations and of the vaccination effort is shown in Figs. 2, 3 and 4. Figure 2 shows the outcome of the different optimal control strategies on the stratified population groups. First, the time evolution of the subpopulations in Fig. 2 illustrates that, by using the vaccination strategies, the susceptible, exposed, infected (with or without signs of disease), hospitalized, and recovered populations fall. Moreover, the number of hospitalized individuals is reduced compared with the no-control case, which can be considered a secondary effect of vaccination: at the beginning of public vaccination, the number of infected people is successfully reduced, so people are less likely to be exposed to infected individuals who can spread the disease, and the need for hospitalization decreases in the long term. Figure 2a shows that the population of susceptible individuals declines most in Strategy 4, in which the vaccine is given to both quarantined and non-quarantined susceptible individuals. This suggests that vaccinating quarantined individuals is one of the best options for eradicating the disease in the long run. Figure 3 shows the vaccinated population under each control policy. As shown in this figure, the total population of vaccinated individuals in Strategy 3 is lower than in the other optimal control strategies; however, Fig. 2c illustrates that the time evolution of infected individuals is close in all optimal strategies. Thus, after 110 days, the population of infected individuals in each strategy is similar to the others. Therefore, Strategy 3 is suggested if the vaccine supply is limited. From Fig. 2a, one can conclude that the tenth day can be regarded as the ideal trigger time for vaccination: on this day, the population of susceptibles reaches its minimum, after which this compartment rises gradually. In this context, if the cost of vaccination is important to governments, they can follow Strategy 3, which is the best option for bringing down the cost of vaccination and reducing the number of infected people simultaneously. Based on this strategy, it would be better for governments and authorities to begin public vaccination when the population of susceptible people reaches its minimum. Figure 4 shows the number of people vaccinated per day in each strategy, allowing the time evolution of the vaccinated population in the four strategies to be compared. In Fig. 4a, we compare the control profiles of each strategy. In Strategy 4, the number of susceptible people declines more than in the others, but this strategy requires more vaccination effort; from the viewpoint of vaccination cost, Strategy 3 can therefore be more satisfactory than the other strategies. Note that in Strategy 4, the vaccination is distributed among quarantined and non-quarantined susceptible individuals. The allocation of the vaccine in this strategy is shown in Fig. 4b.
One can infer from this figure that, in the primary phase, the authorities should give top priority to the quarantined susceptible individuals, although the non-quarantined susceptible individuals should also be considered for vaccination. As mentioned in the previous section, these strategies are formulated to be applied only to susceptible individuals. As a result, before implementation, the susceptible people should be identified, stratified, and prioritized. This stratification can be performed based on their risk factors and their vulnerability. Moreover, ring vaccination is another strategy to control the outbreak [63, 64]. To be more specific, smart surveillance monitoring can provide authorities and governors with a powerful tool to identify susceptible people. This approach can reduce transmission earlier through vaccination and immunization of the susceptible ring. Therefore, the proposed strategies can provide effective protection.

Table 1 Parameter estimates for COVID-19 in Wuhan, China [36]
Table 2 Initial values estimation for COVID-19 in Wuhan, China [37]
Fig. 2

Number of individuals under the different control strategies: a susceptible people, b exposed people, c symptomatic infected people, d asymptomatic infected people, e quarantined susceptible people, f quarantined exposed people, g quarantined infected people, h recovered people

Fig. 3

Number of vaccinated people

Fig. 4

Comparison of the vaccination solutions for COVID-19 under the different strategies

In this sense, Fig. 4a shows that Strategy 2 performs better than Strategy 1 because it reduces the number of susceptible people more. This figure also demonstrates that if the exposed people are included in the objective function, the optimal controller performs better in reducing both the susceptible and the exposed populations. It should be noted that the more the exposed population decreases, the fewer susceptible individuals become infected. For this reason, one can infer that both the susceptible and exposed populations should be considered in the objective functions.

Also, from a practical viewpoint, it should be noted that reinforcement learning optimal control can introduce a better policy for vaccine distribution than Pontryagin’s minimum principle. For example, in [59, 65, 66], the proposed optimal controls suggest vaccination profiles whose initial proportion is high and significant. This high initial vaccine usage makes the Pontryagin minimum principle approach impractical and too harsh in the real world; in contrast, as presented in this article, reinforcement learning optimal control can propose a policy with a smooth start that makes public vaccination functional and practical.

The graphical results depict the importance of vaccine allocation. This graphical interpretation shows that if vaccination is taken into account, the severity of the infection can be reduced gradually. In the presented model, vaccination plays a vital role in the reduction of susceptible individuals. Consequently, when the number of susceptible individuals who can transmit the virus and become infected starts to fall, the number of infected people declines as well. Decreasing the number of symptomatically infected people reduces the exposure of uninfected people to infected people and therefore also lowers the probability of infection through disease transmission. As a result, the number of infected people decreases significantly, which can end with the elimination of the disease in society. It should be noted that, owing to the slow dynamic behavior of the epidemic model, the vaccine may at first seem not to affect the infected population, but over time the significance of vaccination becomes observable. Hence, this simulation strongly suggests that governments and authorities should not focus solely on the number of infected people during the early stage of vaccination, because vaccines take time to induce immunity.

6 Conclusion

In this research, the significant challenge of designing vaccination strategies for COVID-19 has been investigated. Based on data from confirmed cases of 2019-nCoV in mainland China, a new deterministic SEIR-type model with an additional vaccination compartment was developed. Following that, an optimal control based on the reinforcement learning method was developed to discover the best policies. By implementing the dynamic model of the epidemiological system, numerical results for four different control strategies obtained by the proposed technique were demonstrated. These findings clearly show the feasibility of the recommended method for designing optimal vaccination plans. As a future study, it would be useful to consider the behavioral and emotional side effects of quarantine, such as depression, which may affect mental health or even the suicide rate in society. Such investigations would lead us to an optimal trade-off for quarantine decisions.