Abstract
During an epidemic, decisionmakers in public health need accurate predictions of the future case numbers, in order to control the spread of new cases and allow efficient resource planning for hospital needs and capacities. In particular, considering that infectious diseases are spread through humanhuman transmissions, the analysis of spatiotemporal mobility data can play a fundamental role to enable epidemic forecasting. This paper presents the design and implementation of a predictive approach, based on spatial analysis and regressive models, to discover spatiotemporal predictive epidemic patterns from mobility and infection data. The experimental evaluation, performed on mobility and COVID19 data collected in the city of Chicago, is aimed to assess the effectiveness of the approach in a realworld scenario.
1 Introduction
Reference Context. An epidemic is the rapid spread of a disease to a large number of people in a given population in a short period of time. Usually, it poses a serious threat not only to public health but also to healthcare institutions and economies as a whole (Schwabe et al. 2021). In the last two years, the whole world has experienced huge issues and problems due to COVID19, which, as of March 28, 2022, has been responsible for more than 400 million reported cases (source data: Johns Hopkins University Center for Systems Science and Engineering Dong et al. 2020), thus resulting as one of the worst pandemics in history.
To control the spread during an epidemic, decisionmakers in public health need accurate predictions of future case numbers. This allows for early interventions and, on top of that, is important for near realtime resource planning, so that hospital needs can be satisfied without exceeding their capacities (Ferguson et al. 2020). To do that, several explanatory models have been designed and studied for epidemic forecasting, aimed at modeling virus spreading dynamics and evolutions.
Motivations and Contributions. As infectious diseases are spread through humanhuman transmissions, spatiotemporal mobility data can play a fundamental role to enable epidemic forecasting. In fact, mobility data can be derived from movements among several locations of urban and/or suburban areas and, differently from static data, have the added value of dynamically tracing the underlying trajectories, providing a more comprehensive description of humanhuman interactions. Moreover, there can be some areas, like the touristic ones, in which there are more users (tourists) prone to produce and share contexts, data and messages by social media, so data can be integrated by different sources (Cesario et al. 2017). In particular, the underlying hypothesis of our work is that human mobility can strongly affect virus spreading, thus the extraction of mobility patterns among congested areas can enable the prediction of spatiotemporal epidemic patterns. More specifically, the identification of regions that are both infection hotspots (thus involved in high density of infection cases ) and mobility hotspots (thus involved in mobility patterns) is a crucial step of the process to identify how mobility patterns can also play a role of epidemic patterns.
This paper presents the design and implementation of an approach based on spatial analysis and regressive models to discover spatiotemporal predictive epidemic models from mobility and infection data. The algorithm is composed of several steps. First, infection hotspots (urban regions more densely affected by infection events with respect to others) and mobility hotspots (urban regions more densely visited by mobility traces) are detected. Then, mobility patterns among mobility hotspots are discovered. On the basis of such knowledge, epidemic hotspots (infection hotspots whose spatial overlap with mobility hotspots involved in mobility patterns is greater than a given threshold) and epidemic patterns are generated from the mobility patterns, by selecting those that have epidemic hotspots involved as both source and destination areas. Finally, the approach extracts a specific epidemic forecasting model for each epidemic hotspot, analyzing the infection data of the epidemic hotspots involved in mobility patterns.
Plan of the Paper. The rest of the paper is organized as follows. Section 2 reports the most important approaches in the virus spread forecasting literature and the most representative projects in that field of research. Section 3 outlines the problem statement and the goals of our analysis. Section 4 presents the proposed approach, based on spatial analysis and regressive models to discover spatiotemporal predictive epidemic models from mobility and infection data. Section 5 describes the experimental evaluation, performed on a realworld case study concerning the city of Chicago, aimed at showing the most significant mobility patterns among hotspots, the epidemic hotspots, and effective predictive models, which can estimate the number of epidemic events that are likely to happen in the future. We also performed a comparative analysis of our results with respect to a baseline approach (i.e., autoregression algorithm). Finally, Sect. 6 concludes the paper and plans future research work.
2 Related works
Forecasting virus spread is very important in the public health domain, because it can support decision makers to anticipate epidemic trends and thus to optimize public resources’ allocations. This is undoubtedly a challenging task and, since 2019, it has been experiencing a large effort of virologists, epidemiologists and data scientists to predict COVID19 diffusion. In this section we briefly review the most representative research works in this field, grouped in two main categories: mechanistic/stochastic models and AI/datadriven models. Then, we report a critical comparison (on the basis of some specific features) among the method we developed and stateofart solutions.
Mechanistic and stochastic models. Traditionally, the models for epidemic forecasting proposed in the literature can be distinguished in two categories: mechanistic models and stochastic models (Schwabe et al. 2021). Mechanistic models explicitly utilize epidemiological theory and empirical evidence, while stochastic models mainly rely on the predictive power of data. A type of mechanistic model is the compartmental model, which has been exploited in some research works on COVID19 (Bertozzi et al. 2020; Chang et al. 2021). A compartmental model is based on the assumption that each individual belongs to some compartment (e.g., susceptible, infected, recovered) and has a certain probability of transitioning to another compartment. The probability of a transition can be either derived from case data or modeled as being dependent on additional predictors. Stochastic models are based on probability distributions to predict the evolution of the events. In Reinhard and Frank (2020) a Gaussian model is used as stochastic model to predict the peak of COVID19 cases, while in Bertozzi et al. (2020) an exponential model is exploited to predict the disease spread in the early stages.
AI and datadriven models. Recently, artificial intelligence (AI) methods and datadriven approaches integrating mobility, social networks, web search, and air quality data have been proposed in literature. Interestingly, in Comito and Pizzuti (2022) is provided a comprehensive review of methods, algorithms, applications, and emerging AI technologies that can be utilized for forecasting and diagnosing COVID19. In particular, the purpose of this review is to investigate and discuss an extensive collection of papers with the aim of giving an overview of how AI can help fighting COVID19 pandemic. In Yabe et al. (2022) an approach based on human mobility trajectories (collected as GPS traces) and web search queries (with common user identifiers) has been proposed, to predict COVID19 hotspot locations beforehand. More specifically, a web search query analysis is conducted to identify users with a high risk of COVID19 contraction, and social contact analysis was further performed on the mobility patterns of these users to quantify the risk of an outbreak. The approach has been empirically tested using data collected from users in Tokyo, Japan, to predict COVID19 hotspot locations 12 weeks beforehand. In Schwabe et al. (2021) a model for epidemic forecasting based on mobility data, called mobility marked Hawkes model, has been proposed. This model consists of (i) a Hawkes process that captures the transmission dynamics of infectious diseases, (ii) a Poisson regression model which modulates the rate of infections (thus accounting for how the reproduction number R varies across space and time), and (iii)a correction procedure taking into account new cases seeded by people traveling between regions. This model has been used to predict the COVID19 epidemic in Switzerland, over different forecast horizons between 5 and 21 days. The paper (Mokhlesur Rahman et al. 2021) presents a review study aimed at analyzing interactions among the COVID19 pandemic, lockdown measures, human mobility, and air quality. In particular, the paper shows that urban form, people’s socioeconomic and physical conditions, social cohesion, and social distancing measures significantly affect human mobility and COVID19 transmission. The study also noticed that lockdown measures applied during COVID19 significantly improved air quality by reducing the concentration of air pollutants, which in turn improved the COVID19 situation by reducing respiratoryrelated sickness and deaths of people. In Ilin et al. (2021) a study is presented showing how public available data on human mobility (collected by Google, Facebook, and other providers) can be used to evaluate the effectiveness of nonpharmaceutical interventions (NPIs) and forecast the spread of COVID19. The approach has been evaluated using local and regional data from China, France, Italy, South Korea, and the United States, and has been applied to provide 10day forecasts of COVID19 cases. The paper (Comito 2021) presents a methodology based on Twitter data analysis that combines peak detection and clustering techniques to inspect how information about the COVID19 epidemics spread in the US. To this purpose, the objectives are to identify the key terms and features used in the tweets, the interest in the COVID19 topics, together with the evolution of the discussion all over the US. Spacetime features are extracted from the tweets and modeled as time series. After that, peaks are detected from the time series, and peaks of textual features are clustered based on the cooccurrence in the tweets.
Comparative analysis among the approaches. Now, we report a critical comparison of the proposed approach and some other solutions proposed in the literature. Specifically, the comparison of the different approaches has been made on the basis of some specific features (i.e., hotspot detection approach, hotspot shapes, forecasting approach), as summarized in Table 1 and detailed in the following:
Hotspot detection approach. This feature differentiates the algorithms on the basis of the approach used to detect spatial (epidemic) hotspots. Our approach uses spatial densitybased clustering to detect interesting hotspots, while the approaches presented in Schwabe et al. (2021); Ilin et al. (2021) rely on predefined regions (phone cellular grid cells and regular square regions, respectively). On the other side, the algorithm described in Comito (2021) exploits a topicbased clustering approach to detect groups of textual features on the basis of their cooccurrence in the tweets. Finally, the approach presented in Yabe et al. (2022) adopts a web searchbased approach to identify Covid19 hotspots, where a user is identified as a high risk user if he/she had more than k (predefined threshold value) COVID19 related web search sessions.
Hotspot shapes. This feature takes into account the shape of the detected hotspots, which is relevant to assess the ability of the detection approach in identifying any possible area, regardless of the shape. Our approach is capable of detecting regions of any shape (e.g., circular, rectangular, irregular), while the works described in Schwabe et al. (2021); Ilin et al. (2021) deal with only specific region shapes. The algorithm described in Comito (2021) partitions the geographic area according to the US state borders and performs the analysis at stategranularity shapes. Finally, the approach presented in Yabe et al. (2022) presents the experimental evaluation by splitting the geographic area in both \(1Km \times 1Km\) and \(125m \times 125m\) grid cells in Tokyo metropolitan region, thus considering hotspots with specific square shapes.
Regression approach. This feature classifies the approaches on the basis of the regression methodology used to forecast new COVID19 case numbers. Specifically, our approach exploits the LSTM method, while the methodologies described in Schwabe et al. (2021); Ilin et al. (2021) exploit Poisson regression and polynomial regression, respectively. On the other side, the algorithm described in Comito (2021) adopts two forecasting models, one expressed as a specific autoregression problem and the other based on a Bayesian approach. Finally, the approach presented in Yabe et al. (2022) adopts a social contact index (SCI) and a timelagged crosscorrelation analysis to predict the number of new cases.
From the above comparative evaluation, we can summarize the main differences the proposed approach exhibits with respect to the other ones proposed in literature. First, it detects infection and mobility hotspots as they emerge from real infection and mobility data, without relying on predefined static subdivisions of the spatial area (as done in Schwabe et al. (2021) and Ilin et al. (2021)), and thus enabling the detection of hotspots of any shape. Second, our approach relies on mobility data to predict further epidemic evolutions among several locations of urban and/or suburban areas. Thus, it is based on a comprehensive datadriven spatiotemporal knowledge of humanhuman interactions, rather than exploiting stochastic and probabilistic models to forecast future infection cases.
3 Problem definition and goal
We begin by fixing a proper notation and giving some definitions to be used throughout the paper.
Timestamp List. Let \(T=<t_1,t_2,\ldots ,t_H>\) be an ordered timestamp list, such that \(t_h<t_{h+1}, \forall _{ 0<h<H}\), and where all \(t_h\) are at equal time intervals (e.g., every minute, hour, day).
Infection Data. Let \({{\mathcal {I}}}{{\mathcal {D}}}\) be a dataset collecting infection data instances, \({{\mathcal {I}}}{{\mathcal {D}}}=\{ID_1,ID_2,\ldots ,ID_M\}\), where each \(ID_i\) is a data tuple \(<n_i, lat_i, long_i, t_i>\) described by the following features: \(n_i\) is the number of infection cases (i.e., number of positive cases) detected, \(lat_i\) and \(long_i\) are the latitude and longitude of the place the infection event has occurred, \(t_i\) (with \(t_i \in T\)) is the observation timestamp.
Mobility Data. Let \({{\mathcal {M}}}{{\mathcal {D}}}\) be a dataset collecting mobility data instances, \({{\mathcal {M}}}{{\mathcal {D}}}=\{MD_1,MD_2,\ldots ,MD_N\}\), where each \(MD_i\) is a spatiotemporal trajectory instance \(MD_i=<(lat_{i1},log_{i2},t_1), \ldots , (lat_{iH},long_{iH},t_H)>\), where each triple \((lat_{ih},long_{ih},t_h)\) indicates that an object of the trajectory \(MD_i\) is in the position \(lat_{ih},long_{ih}\) at time \(t_h\) (with \(t_h \in T\)).
Infection Hotspots. Let \({{\mathcal {I}}}{{\mathcal {H}}}\) be a set of infection hotspots, \({{\mathcal {I}}}{{\mathcal {H}}}=\{IH_1,IH_2,\ldots ,IH_N\}\), where each \(IH_i\) is an area where infection events of a specific disease occur with an higher density with respect to other neighbor areas.
Mobility Hotspots. Let \({{\mathcal {M}}}{{\mathcal {H}}}\) be a set of mobility hotspots, \({{\mathcal {M}}}{{\mathcal {H}}}=\{MH_1,MH_2,\ldots ,MH_N\}\), where each \(MH_i\) is an area that is more densely visited by the object’s trajectories with respect to other areas.
Mobility Patterns. Let \({{\mathcal {M}}}{{\mathcal {P}}}\) be a set of (frequent) mobility patterns, \({{\mathcal {M}}}{{\mathcal {P}}}=\{MP_1,MP_2,\ldots ,MP_N\}\), where each \(MP_i\) is a sequential pattern of two mobility hotspots, in the form \((MH_{t_s}^{j_s} \rightarrow MH_{t_d}^{j_d})\) with time constraints \(t_r < t_s\). The block on the left \(MH_{t_s}^{j_s}\) is the source hotspot, while the block on the right \(MH_{t_d}^{j_d}\) is the destination hotspot of the pattern. Mining of sequential patterns consists of mining the set of subsequences occurring with a support sup higher than a given minimum threshold \(sup_{min}\).
Overlapping infection hotspots. Let \(oih_{{{\mathcal {I}}}{{\mathcal {H}}}}(MH_i)\) be the set of infection hotspots \(IH_j\) having a spatial overlap with \(MH_i \in {{\mathcal {M}}}{{\mathcal {H}}}\) higher than a given threshold \(\delta _{min}\), and thus \(oih_{{{\mathcal {I}}}{{\mathcal {H}}}}(MH)=\{ IH \mid IH \in {{\mathcal {I}}}{{\mathcal {H}}} \wedge \delta (IH,MH)\ge \delta _{min} \}\), for all MH in \({{\mathcal {M}}}{{\mathcal {H}}}\), and \(\delta (\cdot ,\cdot )\) being a function computing the spatial overlapping percentage between its parameters.
Epidemic Hotspots. An epidemic hotspot is an infection hotspot \(IH_j\) having a spatial overlap (higher than a given threshold \(\delta\)) with a mobility hotspot \(MH_k\) and if \(MH_k\), and thus \({{\mathcal {E}}}{{\mathcal {H}}}=\{ EH \mid EH \in {{\mathcal {I}}}{{\mathcal {H}}} \wedge \exists MH \in {{\mathcal {M}}}{{\mathcal {H}}} \mid EH \in oih_{{{\mathcal {I}}}{{\mathcal {H}}}}(MH) \}\).
Now, let us consider a future temporal horizon, \(S=<t_w, t_{w+1}, \ldots>\), with \(w>H\). The goal of the analysis is to find epidemic spread models for reliably predicting the number and location of new infection events at a given timestamp \(t_w \in S\). More specifically, given the infection dataset \({{\mathcal {I}}}{{\mathcal {D}}}\) and the mobility dataset \({{\mathcal {M}}}{{\mathcal {D}}}\), our analysis aims at achieving the following goals:

1.
discover a set \({{\mathcal {E}}}{{\mathcal {H}}}\) of epidemic hotspots, \({{\mathcal {E}}}{{\mathcal {H}}} = \{EH_1, \ldots , EH_K\}\), where a epidemic hotspot \(EH_k\) is a spatial area both affected by higher density of infections than other areas and involved in frequent mobility patterns (as above described);

2.
discover a set \({{\mathcal {E}}}{{\mathcal {P}}}=\{EP_1,EP_2,...\}\) of epidemic patterns, where each EP is a couple \((EH_s,EH_d)\), meaning that the infection trend of \(EH_s\) influences the infection trend of \(EH_s\).

3.
extract a function \(F_{spreading}:{\mathcal {S}}\rightarrow \mathcal {({{\mathcal {E}}}{{\mathcal {H}}},R)}\) that, given a timestamp \(t_w \in S\), states the number of epidemic events (i.e., number of positive cases) \(N \in {\mathcal {R}}\) that are predicted to happen in each epidemic hotspot \(EH_i \in {{\mathcal {E}}}{{\mathcal {H}}}\) at the timestamp \(t_w\).
4 The proposed approach
This section describes the algorithm that we have designed to discover spatiotemporal predictive epidemic models from mobility and infection data. Specifically, Sect. 4.1 depicts the main steps of the proposed approach and its pseudocode, whereas Sects. 4.2, 4.3, and 4.4 describe in details the procedures for infection and mobility hotspots detection, mobility patterns extraction and epidemic forecasting models training, respectively.
4.1 Algorithm’s wokflow and pseudocode
The main workflow of the proposed approach is shown in Fig. 1, while the pseudocode is reported in Algorithm 1. For the reader’s convenience, Table 2 reports the meaning of the main symbols used throughout the code.
The algorithm receives in input the infection and the mobility datasets (represented in the previously format described in Sect. 3), and returns a set of epidemic hotspots, epidemic patterns and epidemic regression models. The workflow is composed of six steps (see Fig. 1 and Algorithm 1), as described in the following.
Step 1 and 2: Infection and Mobility Hotspots Detection. These two steps are aimed at detecting Infection and Mobility Hotspots from Infection and Mobility datasets, respectively. More specifically, Infection Hotspots are urban regions more densely affected by infection events with respect to others (thus, of interest for the further analysis), while Mobility Hotspots are urban regions more densely visited by mobility traces. This is done by running densitybased clustering algorithm instances whose final result consists of N and M clusters (each corresponding to a detected dense region). The detected hotspots, whose number is automatically detected by the algorithm, can have different shapes and sizes. These two steps are performed by the DiscoverHotspots method on the Infection and Mobility datasets, respectively (lines 1 and 2, Algorithm 1).
Step 3: Frequent Mobility Patterns Extraction. During this step a sequential pattern extraction algorithm on the detected mobility hotspots is executed, to discover frequent mobility patterns from them. The final mining model is a set of rules describing mobility relations between the movement of the users under investigation. For the sake of clarity, in this work we represent mobility patterns composed of one source and one destination hotspots. This is done by running the ExtractMobilityPatterns method (line 3, Algorithm 1).
Step 4: Overlapping InfectionMobility Hotspots Detection. This step is aimed at detecting epidemic hotspots, that is, infection hotspots whose spatial overlap with mobility hotspots involved in mobility patterns is higher than a given threshold \(\delta _{min}\). Since our hypothesis is that mobility can strongly affect the infection spreading, then the identification of regions which are both infection hotspots (thus involved in high density of infection cases ) and mobility hotspots (thus involved in mobility patterns) is a crucial step of the process. The spatial overlap is calculated as the percentage of the overlapping area between the identified infection and mobility hotspots. In Algorithm 1 this step is implemented by the lines 411.
Step 5: Epidemic Hotspots and Patterns Detection. On the basis of the overlapped hotspots detected, this step is aimed at detecting (i)epidemic hotspots and (ii)epidemic patterns. In particular, an epidemic hotspot is an infection hotspot that has a reasonable spatial overlap with a mobility hotspot involved in a mobility pattern. As epidemic hotspots are detected, epidemic patterns are generated from the mobility patterns, by selecting those having epidemic hotspots involved as both source and destination areas. The computation of the Epidemic Hotspots \({{\mathcal {E}}}{{\mathcal {H}}}\) and the Epidemic Patterns \({{\mathcal {E}}}{{\mathcal {P}}}\) is done by the lines 1221 of Algorithm 1.
Step 6: Epidemic Spread Forecasting. This step is aimed at extracting a specific epidemic forecasting model for each epidemic hotspot that is a destination of an epidemic pattern, by exploiting data from its sources. The training of the epidemic forecasting models is done by the lines 2231 of the Algorithm 1.
In particular, for each epidemic hotspot \(EH_i\), an Epidemic Time Series \(ETS_i\) is extracted, by the function BuildEpidemicTSData (line 28) applied in its Source Epidemic Hotspots \(SEH_i\) and the infection data ID. \(ETS_i\) is a multivariate time series that aggregates the number of infections occurred in a time interval (e.g., each day) in \(EH_i\) and in each source epidemic hotspot in \(SEH_i\). Given the Epidemic Time Series \(ETS_i\) related to \(EH_i\), the DiscoverEpidemicModel (line 29) method learns a forecasting model for predicting the number of epidemic events in the future, inside the epidemic hotspot \(EH_i\).
4.2 Detection of Infection and Mobility Hotspots
The DiscoverHotspots method (lines 1 and 2) performs a spatial clustering of the data set, where each cluster represents a hotspot of events. The densitybased notion is a common approach for clustering, whose inspiring idea is that objects forming a dense region should be grouped together into one cluster. In our implementation, this task has been performed by applying DBSCAN (Ester et al. 1996), a popular densitybased clustering algorithm that finds clusters starting from the estimated density distribution of the considered data. We have chosen the DBSCAN algorithm because it has the ability to discover clusters with arbitrary shape such as linear, concave, oval, etc. and (in contrast to other clustering algorithms proposed in the literature) it does not require the predetermination of the number of clusters to be discovered. Basically, the algorithm finds clusters with respect to the notion of density reachability among points: a point is directly densityreachable from another point if it is not farther away than a given distance (\(\epsilon\)) (i.e., is part of its neighborhood) and if it is surrounded by sufficiently many points (minPts). In the considered context, a cluster corresponds to an hotspot. Finally, DBSCAN requires the user to specify the radius of the neighborhood (i.e., \(\varepsilon\)) and the minimum number of objects it should have (i.e., minPoints), whose values affect size and density of the discovered clusters. Generally, an optimal setting of its parameters is complex to be achieved and requires specific techniques; however, this topic is outside the scope of this paper.
4.3 Extraction of mobility patterns
The ExtractMobilityPatterns method (line 3) performs the detection of mobility patterns between two hotspots. We implemented this task by the TApriori algorithm (Cesario et al. 2017), an Aprioribased algorithm that extracts rules whose elements respect a monotonically increasing time order (i.e., the timestamps of the antecedents are chronologically previous of those appearing in the consequent). The final result is a set of sequential patterns between two mobility hotspots, in the form \(SH \rightarrow DH\). The block on the left SH is the source hotspot, while the block on the right DH is the destination hotspot of the pattern. The algorithm requires the user to specify the \(sup_{min}\) and \(conf_{min}\) thresholds, and it discovers the set of subsequences occurring with a support \(sup\ge sup_{min}\) and a confidence \(conf\ge conf_{min}\).
4.4 Discovery of epidemic forecasting models
The DiscoverEpidemicModel method (line 29 in Algorithm 1) performs the training of an epidemic forecasting model given a multivariate time series that aggregates the number of infections that occurred in a time interval in a target epidemic hotspot and its sources previously identified. This task can be done through different time series forecasting approaches, such as Simple Exponential Smoothing (SES) and Autoregressive models (ARIMA). In this work we exploit Recurrent Neural Nnetworks, and in particular LongShort Term Memory (LSTM) neural networks.
LSTMs (Hochreiter and Schmidhuber 1997; Sak et al. 2014) have been proved to be effective in forecasting tasks, due to their capability of processing sequences. They are also widely used in speech recognition and language translation. An LSTM neural network is a particular kind of recurrent neural network (RNN) and thus admits feedback connections. LSTMs are characterized by the presence of a block of memory cells, which are special structures permitting the network to remember values over arbitrary time intervals. Four gates control a cell’s information flow and state: an input gate, an output gate, a forget gate, and a cell gate. The following equations define the behavior of an LSTM unit:
where \(h_t\) is the hidden state at time t, \(c_t\) is the cell state at time t, \(x_t\) is the input at time t, \(i_t\), \(f_t\), \(g_t\), \(o_t\) are the input, forget, cell, and output gates, respectively. \(\sigma\) is the sigmoid activation function, and \(\odot\) is the Hadamard product. W and b are the weights to be learned during the LSTM optimization. It is worth noting that the recurrent nature of LSTM is given as both \(c_t\) and \(h_t\) depend on \(c_{t1}\) and \(h_{t1}\), which are their values at the previous step.
In this work, we exploited a neural network embedding an LSTM layer. The models are trained by exploiting the wellknown ADAM optimizer (Kingma and Ba 2014).
5 Analysis and experimental results
To evaluate the performance and the effectiveness of the approach described above, we carried out an experimental analysis by performing different tests in a realworld case study concerning the city of Chicago, whose map and zip codes are shown in Fig. 2. The goal of our analysis comprises detecting the most significant mobility patterns among hotspots, the epidemic hotspots and effective predictive models, which can estimate the number of epidemic events that are likely to happen in the future. We also performed a comparative analysis of our results with respect to a baseline, achieved by autoregression. In the following subsections we describe the main issues of our analysis: data description and gathering (Section 5.1), the regressive model training, the detection of epidemic hotspots and mobility patterns (Section 5.2), training and testing of the forecasting model (Section 5.3), the experimental evaluation of the model on the test set and a comparative analysis with an autoregression model (Section 5.4).
5.1 Data description
The data that we used to train the models and perform the experimental evaluation have been gathered from Chicago Data Portal [18] and Observable Web Portal [19], two public data search and exploration platforms hosting datasets related to urban and extraurban environments. In particular:

Mobility data have been gathered from the ’Public Passenger Vehicle Licenses’ dataset housed on the Chicago Data Portal, a reallife collection of trajectories traced by public passenger vehicles (i.e., licensed taxicabs, liveries, ambulances, medicars, chartersightseeing buses, horsedrawn carriages, and pedicabs) in the city of Chicago. Each trajectory is described by source place and destination place.

Infection data have been retrieved from the ’Historical Illinois COVID19 ZIP code data’, a dataset populated by daily snapshots of Illinois Department of Health counts by ZIP code^{Footnote 1}. This dataset has been retrieved as open data by the Observable Web Portal, a Web framework which gives public access to several datasets storing urban data^{Footnote 2}. In particular, for our analysis we have collected infection data from April 18, 2020 to December 20, 2021. Each infection event is described by several attributes (i.e., zip code, date, cumulative number of tested, cumulative number of positive cases, etc.).
Mobility data have been analyzed to discover mobility patterns and epidemic hotspots, while the infection data have been analyzed to discover predictive models for epidemic spread forecasting.
Figure 3 shows a preliminary view of the infection data collected, which provides some hints about data trends and distribution. In particular, it shows the time plot of the number of observed infection data for each zip code, in which the cumulative number of infections (positive cases) is plotted versus the time of observation. From the plot, we see that the occurrence of infection is almost stable during Spring 2020 and Summer 2020, strongly increases in late Autumn 2020 and Winter 20202021, again achieving a stable trend during Spring and Summer 2021, and returning to rise in Autumn 2021.
Starting from the collected cumulative data, we computed the daily number of infected cases by performing the difference between two consecutive measurements, for each zip code. We plot the daily data in Fig. 4, which reveals some additional interesting features about data trends and distribution. First, it is evident that the number of detected infections is very unstable among the observation time, showing an high spread of the time series. Second, a seasonal pattern is clearly observable, that shows a multiwave pattern in the data. From the plot, we see that the number of positive cases increased in April 2020, decreased during the Summer 2020, increased in late Autumn 2020 and during the Winter 20202021, decreased again during the Spring 2021 (with a peak in April 2021) and the Summer 2021, and returning to rise again during the Autumn 2021.
A clearer view of the seasonality hidden in the data can be seen in Fig. 5, which shows the distribution of the number of positive cases by month in the whole city area, that is, cumulated over all the zip codes. The histogram shows that the number of infection events varies significantly between different periods of the year. In particular, the number of infection events is highest in November 2021, and lowest in June 2021.
5.2 Detection of Epidemic Hotspots and Epidemic Patterns
Before the analysis of the trajectories, a preprocessing step has been performed to clean, select and transform data to make them suitable for the analysis. First, we cleaned the collected data by removing all the points with unreliable positions (i.e., null coordinates and evident wrong values). Then, since infection data are spatially referenced at zipcode granularity, we transformed the trajectory data by assigning each pickUp and DropOff location to its specific Zip Code. In total, this matching step resulted in 23 Zip Codes (i.e., 60601, 60654, 60611, etc.) involved in the mobility data. The final dataset contains about 15,000 daily trajectories, each modeled as \(<PickUp, DropOff>\) pairs and describing the set of trajectories traced by buses during a day. The total data size is about 28 MB. We report in the following the results of the analysis performed on such a dataset, by showing the zip codes of the city involved in the mobility data and the discovered mobility patterns.
Mobility and Epidemic Patterns. Mobility patterns (Cesario et al. 2017) have been discovered by applying the mlxtend.frequentpatterns Python library, a pattern mining implementation of the well known apriori algorithm. The discovered mining model is a set of mobility patterns describing sequential relations between the movement of the users under investigation. The number of mobility patterns extracted from the frequent regions highly depends on the minimum support. When the minimum support increases, the number of rules decreases. In our tests we set a support \(s=0.6\) and we discovered 5 mobility patterns, involving 16 zip codes. In particular, the set of discovered mobility patterns is reported below:
\(EP_1\): 60611,60654,60656,60661,60603 \(\rightarrow\) 60601
\(EP_2\): 60654,60661,60601 \(\rightarrow\) 60603
\(EP_3\): 60601,60603 \(\rightarrow\) 60611
\(EP_4\): 60601,60603 \(\rightarrow\) 60661
\(EP_5\): 60661,60603,60601 \(\rightarrow\) 60654
As the mobility patterns are detected, epidemic patterns are generated by selecting those ones having epidemic hotspots involved as both source and destination areas. As in our case each zip code region is also an epidemic hotspot, the five mobility patterns above listed are also epidemic patterns, whose antecedents and consequents will be further exploited to learn the epidemic forecasting models.
5.3 Training and Testing the Forecasting Models
Having extracted the epidemic patterns, the next step is aimed at learning a specific epidemic forecasting model for each destination location, exploiting infection data of the source locations as regression variables. For such a reason, we perform now the training of five regression functions, one for each destination zip code: 60601, 60603, 60611, 60661, 60654.
As well known, to perform the regression task and its validation, we need to split the original dataset into two partitions: the training set and the test set. The first one is exploited to discover the relationships inside data while the second one is used for evaluating whether the discovered relationships hold. In our case, the overall infection data spans 21 months of data, and it has been split with respect to the number of months: the training set contains the infection data of the first 17 months (April 2020  August 2021), while the test set holds the infection data of the last 4 months (September 2021  December 2021). Thus, the training set contains the 81% of the data, while the test set the remaining 19%.
As described in the following subsections, we trained the knowledge models using data from April 2020 to August 2021 and we used the trained model to forecast the infection events from September 2021 to December 2021, to assess the quality of the predictions. Forecasting models have been discovered by applying the LSTM algorithm of the PyTorch library.
To assess the effectiveness and accuracy of the regressive function modeled by LSTM regression model, we performed an evaluation analysis on the test set consisting of the last four months of data (i.e., from September 2021 to December 2021). In particular, the model has been used to predict future values of the number of positive cases that will occur in that area, day by day. The prediction of other types of associated events (i.e., number of hospitalized persons, number of deaths, etc.) is out of the scope of this work and it will be studied in a further research activity.
Let \(y_i\) denote the \(i{th}\) observation and \(\widehat{y_i}\) denote the forecast of \(y_i\) according to the LSTM model for \(t=t_i\). Figure 6 shows the curves for the five zip codes under investigation. For the training set period, from April 2020 to August 2021, observed data and fitted data are plotted in black and red, respectively. For the test set period, i.e. from September 2021 to December 2021, observed and forecasted data are traced in blue and green, respectively. It is interesting to highlight that the regressive curve fits well the training data series. By looking at the test set, we can notice that forecasted data adhere very well to the observed data for that period. It is evident that the trend forecasted by the regressive model is very similar to that occurring in the observed data. In particular, we can notice that predicted values in general are a bit lower than observed data by showing an underforecasting with respect to the real number of infection events.
5.4 Experimental evaluation and comparative analysis
To make our evaluation more accurate and complete, we performed a comparative analysis of the proposed approach with an autoregressive approach. In particular, autoregressive (AR) models are a subset of time series models, which can be used to predict future values based on previous observations (AR models use regression techniques and rely on autocorrelation in order to make accurate predictions). AR models have been discovered by applying the AutoReg algorithm of the statsmodels.tsa.armodel Python library.
Now, let us give a quantitative evaluation about the accuracy of the regressive model. To do that, we computed several indices (commonly used in the literature) to evaluate the forecasting accuracy. In particular, let \(y_i\) denote the \(i^{th}\) observation, \(\widehat{y_i}\) denote the forecast of \(y_i\) according the regressive model and \({\overline{y}}=mean(y)\), the indices are defined as follows:

Mean Absolute Error: \(MAE(y_i, \widehat{y_i})=\frac{1}{n} \sum _{i=0}^{n  1}\mid y_i  \widehat{y_i}\mid\), i.e., a scaledependent index measuring the average forecasting absolute error (the lower value, the better score);

Mean Absolute Percentage Error: \(MAPE(y_i, \widehat{y_i})=\frac{1}{n} \sum _{i=0}^{n  1}\frac{\mid y_i  \widehat{y_i}\mid }{\mid y_i \mid }\), i.e., a scaleindependent index computing the average forecasting percentage error (the lower value, the better score);

Mean Squared Error: \(MSE(y_i, \widehat{y_i})=\frac{1}{n} \sum _{i=0}^{n  1} (y_i  \widehat{y_i})^2\), i.e., a general purpose error metric for numerical predictions that amplifies and severely punishes large errors (the lower value, the better score);

Median Absolute Error: \(MedAE(y_i, \widehat{y_i})=Median(\mid y_1  \widehat{y_1}\mid , \dots , \mid y_n  \widehat{y_n}\mid )\), i.e., a scaledependent index measuring the median absolute forecasting error, which is particularly interesting because it is robust to outliers (the lower value, the better score);

\(R^2\) score: \(R^2(y_i, \widehat{y_i})=1  \frac{\sum _{i=1}^{n} (y_i  \widehat{y_i})^2}{\sum _{i=1}^{n}(y_i  {\overline{y}})^2}\), i.e., a general purpose error metric for numerical predictions that represents the proportion of variance (of y) that has been explained by the independent variables in the model. It provides an indication of goodness of fit and therefore a measure of how well unseen samples are likely to be predicted by the model, through the proportion of explained variance. The best score is 1.0 and it can be negative because the model can be arbitrarily worse (the higher value, the better score);

Explained Variance Score: \(EVS(y_i, \widehat{y_i})=1  \frac{Var(y  {\widehat{y}})}{Var(y)}\), which measures the proportion to which a mathematical model accounts for the variance (dispersion) of a given data set, whose the best possible score is 1.0 (the higher value, the better score).
To perform the comparative analysis, we evaluated the forecasting performance of the two approaches on the test set of the five areas (identified by the zipcodes). The results for the algorithms were obtained by performing an accurate tuning of the input parameters: for each dataset, different runs were executed for different values of the parameters, then the best results were selected. The results shown below only refer to the run with the best combination of parameters. Figure 7 summarizes the results of the comparison between our proposed approach based on mobility patterns (MobPat) and the autoregression approach based only on a autoregressive formula (AR). The figure shows the achieved indices, for the five areas (zipcodes) under investigation. In particular, we can see that our approach largely achieves better performances than other algorithms, in terms of MAE, MAPE, MSE and MedAE, for all five zipcodes (but MedAE on the zipcodes 60611, as shown in Fig. 7d). Taking into account the two indices \(R^2\) and EVS, even if it is less evident, our approach results in better performance compared to the other. These results confirm the appropriateness of the proposed approach based on mobility patterns and its good performance in the epidemic prediction domain.
6 Conclusion
COVID19 has been resulting as one of the worst pandemics in history, which has been responsible for more than four hundred million reported cases. This has motivated a research effort towards the study of datadriven predictive models for epidemic events, whose effectiveness is crucial to support decisionmakers in the efficient management and utilization of healthcare resources.
This paper presented the design and implementation of an approach based on spatial analysis and regressive models to discover spatiotemporal predictive epidemic models from mobility and infection data. First, the algorithm extracts infection hotspots, mobility hotspots, and mobility patterns. Then, on the basis of such knowledge, as infectious diseases are mainly spread through humanhuman transmissions, epidemic patterns are extracted from the subset of mobility patterns involving epidemic hotspots. Finally, the approach extracts a specific epidemic forecasting model for each epidemic hotspot, by analyzing the infection data. The experimental evaluation, performed on a realworld data set collecting the infection cases of some areas of Chicago, showed that the proposed methodology can forecast the number of positive cases with good accuracy. Furthermore, we also presented a comparative analysis with an autoregressive algorithm exploited as baseline.
In future work, other research issues may be investigated. First, we may perform an extended experimental evaluation on other urban territories, to assess the results obtained in the case study reported here. Second, we may further explore the application of other spatial analysis approaches and regressive algorithms for the extraction of forecasting models. Third, in addition to the number of positive cases, it may be interesting to investigate some methodologies that predict other relevant indicators related to the epidemics under investigation.
References
Bertozzi AL, Franco E, Mohler G, Short MB, Sledge D (2020) The challenges of modeling and forecasting the spread of Covid19. Proce Nat Acad Sci 117(29):16732–16738
Comito C (2021) How covid19 information spread in us the role of twitter as early indicator of epidemics. IEEE Transactions on Services Computing, (01)
Cesario E, Comito C, Talia D (2017) An approach for the discovery and validation of urban mobility patterns. Pervasive Mobile Comput 42:77–92
Cesario E, Marozzo F, Talia D, Trunfio P (2017) Sma4td: a social media analysis methodology for trajectory discovery in largescale events. Online Social Networks and Media 3:49–62
Chang S, Pierson E, Koh PW, Gerardin J, Redbird B, Grusky D, Leskovec J (2021) Mobility network models of Covid19 explain inequities and inform reopening. Nature 589(7840):82–87
Chicago data portal (2022) https://data.cityofchicago.org/. Accessed: 20220325
Comito C, Pizzuti C (2022) Artificial intelligence for forecasting and diagnosing COVID19 pandemic: a focused review. Artif. Intell. Medicine, 128.
Diederik PK and Jimmy B (2014) Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980,
Dong E, Hongru D, Gardner L (2020) An interactive webbased dashboard to track Covid19 in real time. The Lancet Infectious Diseases 20(5):533–534
Ester M, Kriegel HP, Sander J, Xiaowei X et al (1996) A densitybased algorithm for discovering clusters in large spatial databases with noise. In KDD 96:226–231
Ferguson N, Laydon D, NedjatiGilani G, Imai N, Ainslie K, Baguelin M, Bhatia S, Boonyasiri A, Cucunubá Z, CuomoDannenburg G et al (2020) Report 9: impact of nonpharmaceutical interventions (NPIS) to reduce covid19 mortality and healthcare demand. Imperial College London 10(77482):491–497
Haşim S, Andrew S and Françoise B (2014) Long shortterm memory based recurrent neural network architectures for large vocabulary speech recognition. arXiv preprint arXiv:1402.1128,
Hochreiter S, Schmidhuber J (1997) Long shortterm memory. Neural Comput 9(8):1735–1780
Ilin C, AnnanPhan S, Tai XH, Mehra S, Hsiang S, Blumenstock JE (2021) Public mobility data enables Covid19 forecasting and management at local and global scales. Sci Rep 11(1):1–11
Mokhlesur Rahman M, Paul KC, Hossain MA, Ali GMN, Rahman MS and Thill JC (2021) Machine learning on the covid19 pandemic, human mobility and air quality: A review. Ieee Access,
Observable web portal https://observablehq.com/. Accessed: 20220325
Reinhard S, Frank S (2020) A gaussian model for the time development of the sarscov2 corona pandemic disease. Predictions for Germany made on 30 March 2020. Physics 2(2):164–170
Schwabe A, Persson J and Feuerriegel S (2021) Predicting covid19 spread from largescale mobility data. In: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 3531–3539,
Yabe T, Tsubouchi K, Sekimoto Y, Ukkusuri SV (2022) Early warning of Covid19 hotspots using human mobility and web search query data. Comput Environ Urban Syst 92:101747
Funding
Open access funding provided by Università della Calabria within the CRUICARE Agreement.
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Canino, M.P., Cesario, E., Vinci, A. et al. Epidemic forecasting based on mobility patterns: an approach and experimental evaluation on COVID19 Data. Soc. Netw. Anal. Min. 12, 116 (2022). https://doi.org/10.1007/s13278022009326
Received:
Revised:
Accepted:
Published:
DOI: https://doi.org/10.1007/s13278022009326