Introduction

In recent years, the surge in the availability of spatial and spatio-temporal data has been remarkable, largely driven by advances in computational tools. These tools enable the real-time acquisition of data from sources such as GPS satellites, cellular network triangulation, and Wi-Fi location tracking. Consequently, researchers across diverse domains, from epidemiology, ecology, and climatology to the social sciences, often find themselves dealing with geo-referenced and time-stamped data that encapsulate both spatial and temporal information.

Machine learning and deep learning have attracted tremendous attention from researchers in fields such as AI, computer vision, and language processing, but also in more traditional sciences such as physics, biology, and manufacturing. The added value of these algorithms is that few or no assumptions need to be met: they are far more flexible than traditional statistical models, with weaker requirements in terms of collinearity, Gaussianity of residuals, and the like, and are therefore highly tolerant of model misspecification. Despite these advantages, neural networks are often criticized for their lack of interpretability (black-box models), their lack of uncertainty quantification, and their high computational costs.

Image-processing components such as convolutional neural networks, sequence-processing models such as recurrent neural networks, and regularization layers such as dropout [1] are used extensively, and all of them contribute to the lack of interpretability.

Yet, in sectors like physics, biology, business, and manufacturing, the representation of model uncertainty remains paramount. As these sectors increasingly lean toward embracing uncertainty, deep learning presents novel opportunities. Accordingly, the goal of the current work is twofold: to create a model that is able to infer both the spatial and temporal components, and to combine the advantages of both approaches, namely the flexibility of a neural network and the quantification of uncertainty offered by a traditional probabilistic regression model.

To implement this model, two approaches were combined:

  • Embeddings, which translate high-dimensional vectors into a relatively low-dimensional space, to model the spatio-temporal components and feed them to neural networks;

  • A neural network architecture able to handle sequences of data and quantify the uncertainty of each prediction.

We conduct a comprehensive analysis of the historical series of COVID-19 deaths, evaluating the accuracy of our forecasts against other state-of-the-art models from the machine learning and spatio-temporal statistics literature over various forecasting ranges.

1 Related works

In this section we review, to the best of our knowledge, the studies most closely related to the use of statistical models and machine learning for the analysis of COVID-19 data.

In [2] and [3], the authors use machine learning models to forecast the number of upcoming COVID-19 patients. In particular, [2] applies four standard forecasting models, namely linear regression (LR), the least absolute shrinkage and selection operator (LASSO), support vector machines (SVM), and exponential smoothing (ES), while [3] uses different ML algorithms to predict the chance of infection and leverages an autoregressive integrated moving average (ARIMA) time series model to forecast confirmed cases in various states of India.

The authors of [4] propose comparative forecasting results using machine learning methods: the classical SIR model was fitted to COVID-19 data using different techniques and tools for forecasting, including machine learning with fitting functions. In [5] a multilayer perceptron for predicting the spread of COVID-19 is proposed, while in [6] simple recurrent neural network (RNN), long short-term memory (LSTM), bidirectional LSTM (BiLSTM), gated recurrent unit (GRU), and variational autoencoder (VAE) algorithms are applied to global forecasting of COVID-19 cases based on a small volume of data.

The authors of [7] compare the performance of several machine learning methods in predicting the spread of COVID-19 in different countries. In [8] a multimodel machine learning technique is proposed for long-term forecasting of COVID-19-related parameters, both within India and on a global scale.

In [9] the author proposes a semi-parametric approach to estimate the evolution of COVID-19 (SARS-CoV-2) in Spain, combining a deep learning model with a Poisson-Gamma Bayesian regression model to account for uncertainty quantification; the goal was to elicit the expected number of counts and their reliability. In [10] an INLA spatio-temporal stochastic model is used to explain the temporal and spatial variations in the daily number of new confirmed cases in Spain, Italy, and Germany. In [11] the authors present a Poisson autoregressive model to monitor the temporal evolution of COVID-19 contagion and the associated reproduction rate, dynamically adapting parameters to explain the epidemic propagation in terms of short- and long-term case count dependencies and demonstrating how health policies can impact contagion trends. In [12] the authors use a Poisson autoregressive model to analyze daily new observed cases, revealing whether the contagion exhibits a trend and determining the position of each country on that trend, while in [13] an endemic–epidemic model is proposed to track COVID-19 contagion dynamics both temporally and spatially, exemplified through an empirical analysis of the provinces of Northern Italy affected by the pandemic. The authors of [14] use a discrete latent variable model with spatial and time dependences for the analysis of SARS-CoV-2 infections. Finally, [15] reviews different spatial and spatio-temporal approaches to identify spatial clusters and associated risk factors.

2 Methods

2.1 Modeling spatial and temporal components with embeddings

Embeddings are numerical representations of categorical variables, commonly used in machine learning [16]. They capture the semantic meaning of a variable by mapping it to a dense vector of real numbers. Models based on embeddings can thus learn the relationships among the variables in a continuous space rather than a discrete one.

The idea is to embed the spatial and temporal components by synthesizing “context” information (i.e., locations with similar behavior have similar latent representations). Entity Embedding [17] serves this purpose: categorical variables are mapped into Euclidean spaces by solving a function approximation problem in which each category is turned into an “entity” (a.k.a. category) embedding. Similar categories are expected to lie close to each other in the embedding space.
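For concreteness, below is a minimal sketch of such an entity-embedding layer; we assume TensorFlow/Keras here, and the cardinality (the 20 Italian regions) and embedding dimension are illustrative choices rather than the settings used in our experiments.

```python
# Minimal entity-embedding sketch (assuming TensorFlow/Keras).
# Cardinality (20 Italian regions) and embedding size (8) are illustrative.
import tensorflow as tf
from tensorflow.keras import layers

n_regions, emb_dim = 20, 8

region_id = layers.Input(shape=(1,), dtype="int32", name="region")
region_vec = layers.Embedding(input_dim=n_regions, output_dim=emb_dim)(region_id)
region_vec = layers.Flatten()(region_vec)  # dense 8-dimensional representation

# The embedding weights are learned jointly with the downstream task, so
# regions with similar behavior end up close together in the latent space.
```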

In this work, we take into account categorical information related to region, week-day, month, season, and year. Figure 1 gives an intuition of how embeddings work. In this case, the data come from the COVID-19 daily deaths time series for each of the 20 Italian regions (as discussed extensively in Sect. 4), and the goal is to embed each region in a latent space using the entity embedding approach. For visualization purposes, Fig. 1 shows the embeddings in a two-dimensional space obtained by applying t-SNE [18] to the (n-dimensional) latent embedding vectors. Regions that have had a similar incidence of deaths from COVID-19 lie close together in this two-dimensional representation, and the resulting clusters among regions can be intuitively justified.

Fig. 1
figure 1

A plot of the embeddings related to the spatial input. For visualization purposes, t-SNE was applied to represent the embeddings in 2 dimensions. Points of the same color belong to the same cluster. Clusters are identified by running a K-Means algorithm with K = 4 on the embedding representations
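A visualization in the spirit of Fig. 1 can be reproduced from the trained embedding weights; the sketch below uses scikit-learn's t-SNE and K-Means (with K = 4, as in the caption), with random stand-in data in place of the trained weight matrix and hypothetical region names.

```python
# Project trained region embeddings to 2D with t-SNE and cluster them with
# K-Means (K = 4), as in Fig. 1. The weight matrix here is a random stand-in;
# in practice it would be extracted from the trained embedding layer,
# e.g. via layer.get_weights()[0].
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

embedding_weights = np.random.rand(20, 8)          # stand-in: n_regions x emb_dim
region_names = [f"region_{i}" for i in range(20)]  # stand-in region labels

coords = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(embedding_weights)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(embedding_weights)

plt.scatter(coords[:, 0], coords[:, 1], c=labels)  # same color = same cluster
for i, name in enumerate(region_names):
    plt.annotate(name, coords[i])
plt.show()
```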

2.2 Model architecture

We aim to construct a neural network (NN) architecture capable of discerning patterns within spatio-temporal count data. To achieve this, we will model the outcome using a probability distribution [19, 20] suitable for count data, such as the Poisson distribution. Additionally, our NN will incorporate uncertainty in its predictions, akin to conventional statistical models.

The proposed architecture is shown in Fig. 2, and it is based on a multi-head CNN-LSTM [21] structure. We will now delve into its various components, emphasizing their novel aspects.

Consider a scenario where we wish to study the temporal progression of a specific phenomenon across \(N\) distinct locations. Formally, for each location \(j \in \{1, \dots , N\}\) an event is observed at \(T\) time intervals, and our objective is to forecast for subsequent intervals up to a horizon of \(T+h\).

The first layer consists of \(N\) inputs, where \(N\) is the number of sites under consideration, ensuring that the \(T\) temporal data of each site are individually accounted for.

The temporal data are processed by a 1D convolutional layer (Conv1D). This layer provides temporal smoothing, ensuring that fluctuations over time are harmonized, and it is instrumental in identifying pertinent patterns within the time series. The output of the convolutional layer is then flattened [22] and reshaped, making it compatible with the subsequent layers. The next step in the architecture employs two stacked LSTMs [23], which extract insights from the sequential data. In parallel with the spatio-temporal data, the network is also fed additional information about the region under prediction, the day of the week, the month, and the year. These details are processed through separate embedding layers.

These different processing flows are then joined together: the additional-information embeddings (region under prediction, ..., month, and year), which encompass both temporal and spatial information, are merged with the LSTM outputs. This ensures that the model has a comprehensive view of the data, priming it to make accurate predictions.

For each time instance, identical input data are supplied to the network \(N\) times, each copy paired with the spatial data pertinent to the specific location for which a prediction is being generated. The rationale behind supplying both the data and the spatial embeddings as network inputs is that the embeddings provide invaluable insights for computing the output of a specific site while considering input data from multiple sites. Furthermore, replicating the same information \(N\) times acts as a data augmentation strategy: more complex models demand larger datasets for effective training. For instance, if we were to rely solely on a year's worth of observations, we would be limited to 365 input data points to train our architecture. By iterating this process \(N\) times, we effectively amplify the number of input observations available to the model. This not only enhances the robustness of our model, but also aids parameter estimation, ultimately leading to more accurate predictions.
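A minimal sketch of this replication scheme, using NumPy with illustrative array names and shapes, could look as follows.

```python
# Input-replication sketch (NumPy): every training window of the N regional
# series is paired, in turn, with each region's id, multiplying the number of
# training samples by N. Shapes and names are illustrative.
import numpy as np

N, T = 20, 28                                    # regions, window length
series = np.random.poisson(5.0, size=(365, N))   # stand-in for the daily counts

X_seq, X_region, y = [], [], []
for t in range(T, series.shape[0]):
    window = series[t - T:t, :]   # the same T x N window for all regions...
    for j in range(N):
        X_seq.append(window)      # ...is replicated N times,
        X_region.append(j)        # each copy paired with a target region j,
        y.append(series[t, j])    # whose next count is the target.

X_seq, X_region, y = np.array(X_seq), np.array(X_region), np.array(y)
# (365 - T) x N training samples instead of just (365 - T).
```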

Fig. 2
figure 2

Skeleton of our probabilistic neural network architecture

After concatenating the embeddings and the LSTM outputs, a dense layer is added, culminating in the output layer. The latter is a dense layer with as many neurons as the forecast horizon, representing the rate parameters \( \lambda \) of a Poisson distribution, which fully identifies the conditional probability distribution (CPD) of the outcome \( y \) given the input \( x \). The whole NN input scheme is summarized in Algorithm 1.
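To make the architecture concrete, the following is a minimal sketch assuming TensorFlow/Keras; the layer sizes, embedding dimensions, and variable names are illustrative and do not reproduce the exact configuration of Fig. 2.

```python
# Sketch of the multi-head CNN-LSTM with entity embeddings and a Poisson-rate
# output (assuming TensorFlow/Keras; all sizes are illustrative).
import tensorflow as tf
from tensorflow.keras import layers, Model

N, T, H = 20, 28, 7   # regions, input window, forecast horizon

# One head per site: Conv1D smooths each regional series over time.
series_in, heads = [], []
for j in range(N):
    x = layers.Input(shape=(T, 1), name=f"series_{j}")
    h = layers.Conv1D(16, kernel_size=3, padding="same", activation="relu")(x)
    h = layers.Flatten()(h)
    series_in.append(x)
    heads.append(h)

# Flattened heads are reshaped into a sequence and fed to two stacked LSTMs.
seq = layers.Reshape((N, -1))(layers.Concatenate()(heads))
seq = layers.LSTM(64, return_sequences=True)(seq)
seq = layers.LSTM(32)(seq)

# Entity embeddings for the categorical inputs.
cats = {"region": N, "weekday": 7, "month": 12, "season": 4, "year": 4}
cat_in, cat_emb = [], []
for name, card in cats.items():
    inp = layers.Input(shape=(1,), dtype="int32", name=name)
    cat_in.append(inp)
    cat_emb.append(layers.Flatten()(layers.Embedding(card, 4)(inp)))

# Merge temporal features with the embeddings; output H Poisson rates.
z = layers.Concatenate()([seq] + cat_emb)
z = layers.Dense(64, activation="relu")(z)
rates = layers.Dense(H, activation="softplus", name="lambda")(z)  # lambda > 0

model = Model(inputs=series_in + cat_in, outputs=rates)
# keras.losses.Poisson is the Poisson NLL up to the constant log(x!) term.
model.compile(optimizer="adam", loss=tf.keras.losses.Poisson())
```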

Conventionally, a NN updates its parameters by minimizing a loss function. In our context, the approach centers on maximizing the likelihood, ensuring that the resulting model predicts the observed values with high probability. The likelihood of an arbitrary CPD can be maximized within a neural network framework by interpreting the probabilistic neural network's output as a distribution parameter. The neural network “learns” to predict the \( \lambda \) value that maximizes the likelihood, or equivalently minimizes the negative log likelihood (NLL), of the observed data (a code sketch of this loss is given after the list below)

$$\begin{aligned} \text {NLL}(\lambda ; X)&= \sum _{i=1}^{n} \left[ -x_i \log (\lambda _i) + \lambda _i + \log (x_i!)\right] \end{aligned}$$

where

  • \(\lambda _i\) is the value predicted by the neural network for the \(i^{th}\) data point.

  • \(x_i\) is the \(i^{th}\) observed data point.

  • n is the total number of data points.
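A direct implementation of this loss, under the same TensorFlow assumption as in the previous sketches, could be:

```python
# Poisson negative log likelihood, as in the formula above.
# tf.math.lgamma(x + 1) computes log(x!).
import tensorflow as tf

def poisson_nll(x_true, lam_pred):
    lam = tf.maximum(lam_pred, 1e-8)  # numerical guard against log(0)
    return tf.reduce_sum(
        -x_true * tf.math.log(lam) + lam + tf.math.lgamma(x_true + 1.0))
```

Since \(\log (x_i!)\) does not depend on \(\lambda\), it can be dropped during optimization; Keras' built-in Poisson loss does exactly that.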

Algorithm 1
figure a

Neural Network Input Preparation

Fig. 3
figure 3

Historical trend of the number of deaths across five Italian regions: LOMBARDY, VENETO, BASILICATA, CALABRIA, and VALLE D’AOSTA. The period from January 2021 to December 2021 is enlarged in the upper part to better highlight the differences between the historical series

2.3 Alternative models

To evaluate the performance of our model, the results are compared with a pool of alternatives commonly used in the literature for spatio-temporal data forecasting: two ensemble models, Random Forest [24] and XGBoost [25], and a Bayesian statistical model, INLA [26].

In particular, for the ensemble models, the embeddings obtained from the embedding layers of a neural network are used as input, together with the lagged time series; these embeddings are tasked with learning meaningful data representations. For INLA, we used a Poisson distribution to model the outcomes: the estimated dispersion parameter in our data is close to one, indicating that the Poisson distribution adequately captures the data variability. Additionally, we incorporated a spatial component through the Besag-York-Mollié (BYM) model [27] and an autoregressive temporal component, in order to capture the spatio-temporal dynamics in our data.
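As an illustration of the first point, the sketch below feeds the learned embedding vectors, together with lagged counts, to XGBoost; all arrays are random stand-ins for quantities produced earlier in the pipeline, and the hyperparameters are illustrative.

```python
# Sketch: use the trained embedding vectors as tabular features for XGBoost,
# together with lagged values of the series. All arrays below are random
# stand-ins for quantities produced earlier in the pipeline.
import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
n_samples, n_lags, emb_dim = 1000, 7, 8
lagged = rng.poisson(5.0, size=(n_samples, n_lags))   # lagged daily counts
region_idx = rng.integers(0, 20, size=n_samples)      # region of each row
embedding_weights = rng.random((20, emb_dim))         # from the trained layer

X = np.hstack([lagged, embedding_weights[region_idx]])  # lags + embedding
y = rng.poisson(5.0, size=n_samples)                    # next-day count

model = XGBRegressor(n_estimators=300, max_depth=4, objective="count:poisson")
model.fit(X, y)
```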

3 Data

Our primary data source is the GitHub repository maintained by the Italian Civil Protection, accessible at https://github.com/pcm-dpc/COVID-19. This repository provides daily updates, offering a comprehensive overview of the pandemic's progression. It includes different time series, such as the number of new positive cases, ICU occupancy, swabs performed, and deaths, at both national and regional levels.
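For instance, the regional series can be loaded with pandas; the file path and column names below follow the repository layout at the time of writing and should be verified against the current repository.

```python
# Load the regional COVID-19 data and derive daily deaths from the cumulative
# `deceduti` column. File path and column names follow the repository layout
# at the time of writing; verify them against the current repository.
import pandas as pd

URL = ("https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/"
       "dati-regioni/dpc-covid19-ita-regioni.csv")

df = pd.read_csv(URL, parse_dates=["data"])
df = df.sort_values(["denominazione_regione", "data"])
# `deceduti` is cumulative: differencing gives the daily death counts.
df["daily_deaths"] = df.groupby("denominazione_regione")["deceduti"].diff()
```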

Our analysis primarily centers on the historical series of daily deaths for several reasons:

  • This time series exhibits significant variability and is prone to abrupt fluctuations. The daily death count often undergoes revisions in the days following its initial publication. Consequently, discerning the genuine signal amidst this noise requires the deployment of complex modeling techniques.

  • It is used as an indicator of the pandemic's severity. Unlike time series such as new daily positive cases (which are influenced by the number of swabs performed) or ICU occupancy (which exhibits a degree of temporal persistence), the count of new deaths provides a more direct and unfiltered reflection of the pandemic's impact.

Figure 3 showcases the historical series of COVID-19-related deaths across five Italian regions: LOMBARDY, VENETO, BASILICATA, CALABRIA, and VALLE D’AOSTA. A cursory examination reveals notable disparities in death counts across regions. Furthermore, on certain days, there are significant spikes, indicating abrupt surges in fatalities.

Table 1 Mean absolute error (MAE) and mean squared error (MSE) are reported for each forecast interval (7, 14, 21, and 28 days) in the considered period 12/01/2023–12/28/2023
Table 2 Mean absolute error (MAE) and mean squared error (MSE) are reported for each model across various regions within the forecast interval considered (12/01/2023–12/28/2023). The model with the better score in each region is highlighted in bold

4 Experiments and results

Table 1 presents two metrics to evaluate the performance of the proposed model in comparison with benchmarks: the mean absolute error (MAE), which measures the average absolute prediction error, and the mean squared error (MSE), which more heavily penalizes large prediction errors. Comparisons were made using different forecasting ranges (7, 14, 21, and 28 days) for the period considered in December 2023. This approach aimed to assess the accuracy and reliability of forecasts over varying time spans within the specific month. Table 2 shows the two error metrics across different regions, specifically for each model, in the context of the 28-day forecast. It is observed that the regions with the largest discrepancies between the observed and predicted cases are also those with higher incidence and greater variability (Lombardy, Piedmont, and Veneto).

The proposed model, which we name probabilistic neural network (PNN), outperforms the others in terms of both metrics in all scenarios. Additionally, Fig. 4 shows that the PNN returns highly accurate predictions at the national level (obtained by summing the daily regional forecasts): the 0.025 and 0.975 quantiles of the conditional probability distribution define a 95% prediction interval that quantifies the uncertainty of each prediction and, in most cases, includes the true number of COVID-19 deaths. To assess the accuracy of the prediction intervals more precisely, in Table 3 we compare the intervals generated by the PNN model with those derived from the INLA approach. This analysis reveals that the coverage of the PNN's prediction intervals is close to the theoretical level, though not perfectly in line with it. The regional coverages in both cases do not reach the nominal 95%, possibly because of anomalies in the regional data, such as recounts of previous days' deaths. These events introduce significant and sudden variations in the observed regional time series, making predictions more uncertain and affecting the coverage of the prediction intervals.
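For reference, the prediction intervals and their empirical coverage can be computed directly from the predicted Poisson rates; the sketch below uses SciPy, with illustrative stand-in values for the predicted rates and observed counts.

```python
# 95% prediction intervals from the predicted Poisson rates, and their
# empirical coverage. `rates` and `observed` are illustrative stand-ins for
# the network outputs and the true counts.
import numpy as np
from scipy.stats import poisson

rates = np.array([120.0, 95.0, 140.0, 80.0])   # predicted lambda per day
observed = np.array([117, 101, 150, 79])       # observed death counts

lower = poisson.ppf(0.025, rates)              # 0.025 quantile of the CPD
upper = poisson.ppf(0.975, rates)              # 0.975 quantile of the CPD
coverage = np.mean((observed >= lower) & (observed <= upper))
print(f"95% PI coverage: {coverage:.2%}")
```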

Table 3 Prediction interval (95% level) coverage, for PNN and INLA models at both regional and national levels
Fig. 4
figure 4

The forecasts at the national level are represented (with the dashed blue lines) along with their respective 95% prediction intervals for the time period considered

5 Strengths and weaknesses

The model we propose offers several advantages, compared to competing models:

  • It is designed to handle input data characterized by spatial and temporal changes, and it delivers accurate results, offering a comprehensive and detailed view of trends and patterns. Moreover, it is a lightweight approach that can easily run on a standard laptop, making it accessible without advanced hardware resources.

  • Compared to a traditional neural approach, it adopts a probabilistic approach, thus providing an estimate of the probability of a particular outcome. This makes it flexible and particularly suitable for the analysis of count data. Specifically, the Poisson distribution that we adopt is supported on the non-negative integers, capturing the inherent nature of count data and ensuring more meaningful predictions.

  • Embedding-related representations can be extracted to provide insights on specific features, such as locations. This means that the model can identify and place similar entities close in the embedding space, facilitating the interpretation and the understanding of relationships between different entities.

However, alongside the numerous advantages, it is also appropriate to analyze the potential weaknesses of our model. First, like all deep learning models, it needs a substantial amount of data for training: a limited amount of data can compromise the accuracy and reliability of parameter estimation. Furthermore, an incorrect representation of the embeddings can yield a model that cannot correctly discriminate between different locations. That is, if the embeddings are not properly calibrated or correctly interpreted, the model might fail to distinguish between different positions or categories, leading to inaccurate or misleading results.

6 Conclusion

We introduce a neural network architecture capable of delivering forecasts for spatio-temporal data together with a measure of uncertainty. Our model is evaluated across various forecasting intervals, outperforming the benchmarks.

The use of a neural network, particularly a probabilistic one, offers a level of flexibility that traditional statistical models often lack. This flexibility is especially crucial when dealing with complex datasets, such as the spatio-temporal one we focus on. Neural networks can adapt to intricate patterns and complex relationships in the data, which might be challenging to capture by conventional statistical models. Moreover, by employing a probabilistic neural network, we not only benefit from the adaptability of neural architectures but also retain the advantages of statistical models in estimating uncertainty. This combination ensures that our predictions are both accurate and endowed with a reliable measure of confidence.

Moreover, embeddings have proven to be a valuable tool in guiding the network’s learning process, especially when forecasting COVID-19-related deaths across different Italian regions. These embeddings allow the model to understand and represent the similarities and differences between regions, enhancing predictive capabilities.

The proposed model serves as an efficient foundational framework and, in light of the results we discuss, is versatile enough to be extended for other related series and geographies.