Description of Modelling Approach

In this article, we briefly describe an AI-driven approach for generating lockdown policies that are optimised based on disease characteristics and network parameters. The approach is intended for use by policy-makers who are knowledgeable in epidemiology but not necessarily well-versed in dynamic systems and control. It is designed to be modular and flexible: the underlying reinforcement learning algorithm can work with any compatible disease and network models, and the critical characteristics of the models are exposed as tunable parameters. The models described in this paper are based on a commonly used epidemiological model from the literature (Perez and Dragicevic 2009), as shown in Fig. 1. The parameter values can be tuned to model infectious diseases including Covid-19; we use values for Covid-19 computed by Jung et al. (2020).

Fig. 1: Disease progression model adapted for Covid-19. S is susceptible, E is exposed (virus in the body but not yet affecting the immune system), IS is infected (showing symptoms), IA is an asymptomatic carrier, D is dead, and R is recovered. Note that the numbers represent transition probabilities and not rates of change. All parameters can be tuned. We do not consider a transition from recovered to susceptible states, but this can be added if found to be possible.
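
To make this tunability concrete, the sketch below shows one possible encoding of the compartments in Fig. 1 as a per-person, per-day stochastic update. The parameter names and numeric values are illustrative assumptions, not the calibrated Covid-19 values from Jung et al. (2020); infection itself (S to E) is driven by the network model described next.

```python
# Minimal sketch of the tunable disease-progression model in Fig. 1
# (compartments S, E, IS, IA, D, R). Parameter names and values are
# illustrative placeholders, not the calibrated Covid-19 estimates.
import random

DISEASE_PARAMS = {
    "p_symptomatic": 0.6,          # probability that E progresses to IS rather than IA
    "p_death": 0.02,               # probability that IS ends in D rather than R
    "mean_incubation_days": 5,     # mean time spent in E
    "mean_infectious_days": 10,    # mean time spent in IS or IA
}

def step_person(state: str, params: dict = DISEASE_PARAMS) -> str:
    """Advance one person's disease state by one simulated day."""
    if state == "E" and random.random() < 1.0 / params["mean_incubation_days"]:
        return "IS" if random.random() < params["p_symptomatic"] else "IA"
    if state in ("IS", "IA") and random.random() < 1.0 / params["mean_infectious_days"]:
        if state == "IS" and random.random() < params["p_death"]:
            return "D"
        return "R"
    return state  # S, D and R do not change without an external infection event
```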

We also account for network propagation characteristics through tunable parameters such as the strictness of lockdowns within network nodes and of travel between nodes, including the possibility of leaky quarantine. The probability of disease transmission between people is a macro-level parameter, but it accounts for micro-level effects such as social distancing, mask usage, and weather. The network definition is based on node locations and the population of each node, with connectivity between each pair of nodes defined by a gravity model (Allamanis et al. 2012). The results presented in this paper are based on a randomly generated network with 100 nodes and 10,000 people randomly distributed amongst those nodes. Fig. 2 shows the disease progression under a fixed strategy of locking down any node when its symptomatic population exceeds 5% of the total, and reopening it when the fraction falls below this level. This threshold-based method is typical of several regions worldwide. While the infection peak is small, the epidemic lasts for nearly the full year, with a correspondingly high economic cost.
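
As an illustration of these ingredients, the following sketch constructs a random network with a gravity-model connectivity matrix and encodes the fixed 5% threshold policy used as the baseline in Fig. 2. The coordinate range, the squared-distance form of the gravity model, and the helper names are assumptions made for this example rather than the exact formulation used in our experiments.

```python
# Sketch of the random network construction and the fixed 5% threshold
# baseline policy (Fig. 2). Values and the gravity-model exponent are
# illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_nodes, total_people = 100, 10_000

coords = rng.uniform(0.0, 1.0, size=(n_nodes, 2))               # random node locations
population = rng.multinomial(total_people, [1 / n_nodes] * n_nodes)

# Gravity model: connectivity grows with the product of populations and
# decays with distance between nodes (here, squared distance).
dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
np.fill_diagonal(dist, np.inf)                                   # no self-links
connectivity = population[:, None] * population[None, :] / dist**2

def threshold_policy(symptomatic: np.ndarray, pop: np.ndarray, level: float = 0.05):
    """Lock down any node whose symptomatic fraction exceeds `level`."""
    return symptomatic / np.maximum(pop, 1) > level              # True = locked down
```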

Fig. 2: [Left] Network model; [right] epidemic progression for the policy of locking down any node when its symptomatic population exceeds 5% of the total

Fig. 3: Evolution of the reward (objective function) during training, based on (i) duration of lockdowns, (ii) number of people infected, and (iii) number of people dead

Reinforcement learning approach for computing lockdown policy

Reinforcement learning (RL) works by running a large number of simulations of the spread of the disease while searching for the optimal lockdown policy (Sutton and Barto 2012). The chief requirement is to quantify the cost of each outcome of the simulation. In this study, we impose a cost of 1.0 for each day of lockdown, 1.0 for each person infected, and 2.5 for each death. The reward is defined as the negative of these costs (the higher the reward, the lower the cost). The actions required of the algorithm are binary: at the beginning of every week and for every node, the algorithm must decide whether to keep the node open or lock it down. We use Deep Q Learning (Mnih et al. 2015) to train the algorithm. For this specific instance, the RL algorithm improves and then saturates within 75 simulations, as shown in Fig. 3. The evolution of infection rates in Fig. 4 (computed over 10 independent runs) shows that the learnt policy produces a higher peak than the 5% threshold policy in Fig. 2, but requires significantly fewer lockdowns and results in a shorter epidemic duration. Note also that there are no kinks caused by new infections after nodes are reopened.
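
A minimal sketch of this reward and of the weekly decision step is given below. The cost weights (1.0, 1.0, 2.5) are those stated above; `q_network` and the per-node state features are placeholders for the trained Deep Q Network and its inputs, whose exact forms are not detailed here.

```python
# Sketch of the reward (negative cost) and the weekly per-node binary action.
# `q_network` is a placeholder for the trained Deep Q Network.
import numpy as np

COST_LOCKDOWN_DAY, COST_INFECTION, COST_DEATH = 1.0, 1.0, 2.5

def reward(lockdown_days: int, new_infections: int, new_deaths: int) -> float:
    """Higher reward corresponds to lower combined economic and health cost."""
    return -(COST_LOCKDOWN_DAY * lockdown_days
             + COST_INFECTION * new_infections
             + COST_DEATH * new_deaths)

def weekly_actions(q_network, node_states: np.ndarray) -> np.ndarray:
    """At the start of each week, choose open (0) or lockdown (1) for every node."""
    q_values = q_network(node_states)    # assumed shape: (n_nodes, 2)
    return np.argmax(q_values, axis=-1)  # greedy action per node
```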

The key points of novelty in this approach are: (i) we focus neither on epidemiological modelling nor on prediction of the spread of the disease, but rather on controlling the spread while balancing long-term economic and health costs; (ii) our control approach can work with any disease parameters (not just Covid-19), and with any compatible network data and propagation model (not just specific geographies); (iii) rather than taking decisions based on simple thresholds such as the fraction of people with symptoms, the learnt policies combine several context variables, such as the rate of new infections, to make optimal decisions; (iv) end-users need only change input parameters to create policies with their desired characteristics; and (v) the algorithm is not a black box, and the sensitivity of the policy to its input features can be studied. Fig. 5 demonstrates the last claim by considering the sensitivity of decisions to pairs of input features. The first plot shows that the policy recommends lockdowns when the infection rate in the overall population or within a node exceeds 0.2; however, lockdowns can be recommended at much smaller values if both infection rates reach 0.1. The plot on the right shows a similar trend: lockdowns are recommended at much lower infection rates if a node has a large population.
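
The two-feature sensitivity analysis behind Fig. 5 can be reproduced with a sweep of the kind sketched below: two selected input features are varied over a grid while the remaining features are held at baseline values, and the policy's lockdown decision is recorded at each grid point. The function and argument names are illustrative assumptions.

```python
# Sketch of a two-feature sensitivity sweep over a trained policy (Fig. 5).
# `policy` maps a feature vector to 1 (lockdown) or 0 (keep open).
import numpy as np

def sensitivity_grid(policy, baseline_features: np.ndarray, idx_x: int, idx_y: int,
                     values: np.ndarray = np.linspace(0.0, 0.3, 31)) -> np.ndarray:
    """Return lockdown decisions over a grid of two features, others held fixed."""
    decisions = np.zeros((len(values), len(values)), dtype=int)
    for i, vx in enumerate(values):
        for j, vy in enumerate(values):
            features = baseline_features.copy()
            features[idx_x], features[idx_y] = vx, vy
            decisions[i, j] = policy(features)
    return decisions
```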

Fig. 4: Epidemic progression using a policy trained by reinforcement learning

Fig. 5: Policy outcomes as a function of two features at a time

Discussion

The reinforcement learning algorithm is ready for use in conjunction with real-world data sets, epidemiological models, and network propagation models. Any of these three aspects can be changed according to user requirements. The algorithm is computationally lightweight, and running it requires only Python. We have demonstrated its capability to handle nation-scale data (Khadilkar et al. 2020). We are open to collaborating with epidemiologists who could benefit from a computational approach to addressing the spread of communicable diseases.