1 Introduction

On December 31, 2019, China reported a cluster of pneumonia cases of unknown etiology in Wuhan city. On January 30, 2020, the World Health Organization (WHO) declared the outbreak of the new coronavirus SARS-CoV-2 in China to be a public health emergency of international concern (Gorbalenya et al. 2020).

On January 31, 2020, the Italian government proclaimed a state of emergency and implemented the first measures to contain the infection on the entire national territory (Camporesi et al. 2022).

Since then, Coronavirus disease 2019 (COVID-19) has become an unprecedented public health crisis with a major impact on the healthcare system. This impact was evident in Europe, especially in Italy (Paterlini 2020).

In particular, according to the data available at the beginning of 2020, the Campania Region, in Southern Italy, has about 5,870,000 inhabitants, making it the third most populated Region in Italy and the most populated in the South. Its population density is 429.4 people per km², the highest value at the national level. Furthermore, 63.1% of the population resides in 65 centers with more than 20,000 inhabitants. These characteristics put the Campania Region at high risk of disease spread and of saturation of the local health system. Figure 1 shows a map of the population density of the region (Tuttitalia 2020; Siniscalchi 2018).

Fig. 1 Map of the population density of the Campania Region (inhabitants per km²)

In response, since the beginning of the pandemic, the Campania Region has adopted a preventive management approach, supporting the use of both the tools available in the study of infectious epidemiology and the new multidisciplinary approaches based on prediction algorithms through machine learning (Kour and Gondhi 2020).

In this paper, we describe some models of the SARS-CoV-2 spread in the territory, and a forecasting formula that was then integrated into SVIMAC-19, an analytical-forecasting system for the containment, contrast, and monitoring of COVID-19 within the Campania Region (Regione Campania 2020). Namely, our goal was to predict the number of new daily infections at least 10 to 15 days in advance.

Forecasting of a pandemic can be done based on various parameters such as the impact of environmental factors, the incubation period, the impact of quarantine, age, gender, and many others (Shinde et al. 2020; Pak et al. 2020). However, not all these data are publicly available. In this study, we used only publicly available data from both Italian National Health Organization databases and Regional repositories.

1.1 Methodology

To date, many studies have tried to identify formulas and rules able to define a mathematical model of the COVID-19 spread. Although in some cases the accuracy was found to be high, the state-of-the-art solutions rely on many kinds of data, such as government interventions, new drugs, and so forth, and such information may be unavailable or unreliable. As a consequence, the resulting forecasting models are often difficult to adapt to a specific area (Tu et al. 2020).

In contrast, our approach was intended to build a forecasting model by mining useful insights from the data observed over time, without taking into account any type of external information or human intervention, in the framework of inductive inference (Angluin and Smith 1983; Rampone and Russo 2012). Such a technique assesses past situations, thereby enabling better predictions about future ones.

Namely, the approach used in this study relies on so-called Evolutionary Algorithms, and in particular on Genetic Programming (GP) (Koza 1994; Schmidt and Lipson 2009), which improves a random population of solutions (formulae) in an evolutionary way. The performance of other widely used algorithms was also evaluated and compared (Fix and Hodges 1951; Altman 1992; Zhang et al. 2017).

1.2 Related works

Given its massive impacts on lives globally, the COVID-19 pandemic is a major focus of research interest at present (Doornik et al. 2022) and the list of related works is necessarily incomplete.

On March 16, 2020, the White House, collaborating with research institutes and tech companies, issued a call to action for global artificial intelligence (AI) researchers to develop novel text and data-mining techniques to assist COVID-19-related research (Alimadadi et al. 2020). Several studies investigated the kinetics of coronavirus spread through human populations (Remuzzi and Remuzzi 2020; Li et al. 2020), and the basic reproductive ratio of the virus has been estimated (Anđelić et al. 2021).

Koza (1994) laid the foundations of Genetic Programming (GP) (Affenzeller et al. 2009) and since then several variations have been made (Katoch et al. 2020; D’Angelo and Palmieri 2021).

There are numerous applications of GP in the predictive field (Rampone et al. 2021; Rampone and Valente 2021). The application of GP to publicly available COVID-19 data, to estimate confirmed, deceased, and recovered cases and the epidemiology curve for countries such as China, Italy, Spain, and the USA, as well as on the global scale, was addressed by, among others, Anđelić et al. (2021) and Salgotra et al. (2020).

Del Giudice et al. (2020) implemented a regressive model investigating some consequences of the COVID-19 pandemic in the Campania Region, taking into account how the event might affect the regional activity.

1.3 Paper outline

This paper is organized as follows: In Sects. 2 and 3, we summarize the model set-up, the formulae obtained, the test results, and the comparisons with some alternative methods; in Sect. 4, we show the model tuning and the experimental results during the pandemic; Sect. 5 is devoted to the discussion and conclusions.

2 Model set up

We aimed to find a model, expressed as a set of explicit formulae, describing the number of newly infected people in the Campania Region (Italy) at least 10 to 15 days before the occurrence. More specifically, the model we intended to build should be able to perform the prediction starting only from information on the currently infected people.

3 Reference data

The initial data were taken from an officially published dataset of the Campania Region (Footnote 1). The data were in accordance with the daily national summary of health monitoring prepared by the Department of Civil Protection and made available on the website http://www.protezionecivile.gov.it/ following the official communication, via a press conference at 6.00 pm, by the Head of the Department of Civil Protection as extraordinary Commissioner.

The data describe in successive lines the daily situation in the Campania Region in terms of number of infected people (hospitalized, in intensive care, in home isolation, currently positives, new positives, discharged, cured, deceased, total) and swabs and cases tested.

At the time of use, the dataset included daily data from February 24, 2020 to December 31, 2020 (312 rows).

From each row, we defined a feature vector, adding a label, named Forecast, representing the new positives ten days after the current date. The feature vector structure is reported in Table 1.

Table 1 Labelled instances structure

In this way, we obtained 302 labelled instances, from February 24, 2020 (1) to December 21, 2020 (302). It should be pointed out that the data of June 02, 2020 contain a negative value of new positives (− 229), which is probably a correction of previous data. We left it unchanged.
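The construction of the Forecast label can be illustrated with a minimal Python sketch. The function name `make_labelled_instances`, the `horizon` parameter, and the synthetic series are assumptions introduced here for illustration; only the new-positives value is shown, not the full feature vector of Table 1.

```python
from typing import List, Tuple


def make_labelled_instances(
    new_positives: List[int], horizon: int = 10
) -> List[Tuple[int, int, int]]:
    """Pair each day with the new positives observed `horizon` days later.

    Returns tuples (instance_number, new_positives_today, forecast_label).
    The last `horizon` days have no label yet, so they are dropped.
    """
    instances = []
    for day in range(len(new_positives) - horizon):
        instances.append((day + 1, new_positives[day], new_positives[day + horizon]))
    return instances


# Synthetic series standing in for the 312 daily rows (not real data).
series = list(range(312))
instances = make_labelled_instances(series, horizon=10)
print(len(instances))  # 312 rows minus a 10-day horizon
print(instances[0])
```

With 312 daily rows and a 10-day horizon, the last 10 rows have no label, which leaves the 302 labelled instances reported above.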

3.1 Cross-validation and fitness measure

To build the formulae while limiting bias, we divided the dataset of Sect. 3 into 5 subsets according to the k-fold cross-validation approach (Devijver and Kittler 1982). In this way, the whole dataset was divided into 5 folds, and, in turn, one fold was used as the validation set, while the remaining folds were used as the training set.
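As a rough illustration of the fivefold split, the following Python sketch partitions instance indices into folds. Whether the folds were contiguous or shuffled is not stated in the text; contiguous folds are an assumption made here, and `k_fold_indices` is a hypothetical helper name.

```python
from typing import List, Tuple


def k_fold_indices(n_samples: int, k: int = 5) -> List[Tuple[List[int], List[int]]]:
    """Split indices 0..n_samples-1 into k contiguous folds.

    Returns one (train_indices, validation_indices) pair per fold.
    Contiguous folds are an assumption; the source does not specify them.
    """
    # The first (n_samples % k) folds get one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    splits = []
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        splits.append((train, validation))
        start += size
    return splits


splits = k_fold_indices(302, k=5)
for train, validation in splits:
    print(len(train), len(validation))
```

Each instance appears in exactly one validation fold and in the training set of the other four experiments.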

As the fitness measure guiding GP (Affenzeller et al. 2009), we chose to minimize the Root Mean Square Error (RMSE), defined as

$$ {\text{RMSE}} = \sqrt {\mathop \sum \limits_{i = 1}^{m} \frac{{\left( {y_{i} - \hat{y}_{i} } \right)^{2} }}{m}} $$
(1)

where \(\hat{y}_{i}\) is the prediction, \(y_{i}\) is the true value, and \(m\) is the number of samples.
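Equation (1) translates directly into code. The sketch below is a plain Python version; the function name `rmse` and the sample values are illustrative assumptions, not data from the study.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root Mean Square Error as in Eq. (1)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    m = len(y_true)
    squared_sum = sum((y - y_hat) * (y - y_hat) for y, y_hat in zip(y_true, y_pred))
    return math.sqrt(squared_sum / m)


# Illustrative values only: errors are -10, 10 and 0.
print(rmse([100, 200, 300], [110, 190, 300]))
```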

3.2 GP hyperparameters tuning

The GP experiments were made in the Matlab environment (Higham and Higham 2016).

To run GP, several hyperparameters were set, such as the population size, the maximum number of generations, the tournament type and its size, the maximum depth of trees, the maximum number of genes allowed in an individual, and the permitted operators. We remark that the choice of these parameters significantly affects the final result (Sipper et al. 2018).

These choices are generally made either manually or automatically. In the former case, the values of the hyperparameters are chosen at random by trial and error, through an extensive series of experiments and evaluation of the corresponding performance. The latter case makes use of intelligent logic able to find appropriate hyperparameter values through an iteration-based method. In this study, we used the second approach by first defining the upper and lower bounds of each hyperparameter and then choosing the values by following the workflow used by the Talos library, implemented for running TensorFlow-based applications in Python (Footnote 2). More specifically, we used 70% of the dataset for calibrating these parameters.
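The general idea of an automatic search within predefined bounds can be sketched as an exhaustive scan over candidate values. This is not the Talos API nor the authors' Matlab procedure: the helper `grid_scan`, the candidate values, and the stand-in scoring function `toy_evaluate` are all hypothetical, and in practice the score would be the validation RMSE of a GP run on the calibration data.

```python
import itertools
from typing import Callable, Dict, List, Tuple


def grid_scan(
    param_grid: Dict[str, List[int]],
    evaluate: Callable[[Dict[str, int]], float],
) -> Tuple[Dict[str, int], float]:
    """Try every combination of candidate values and keep the lowest score."""
    names = sorted(param_grid)
    best_params: Dict[str, int] = {}
    best_score = float("inf")
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical candidate values between assumed lower and upper bounds
# (these are not the ranges of Table 2).
param_grid = {
    "population_size": [100, 250, 500],
    "max_generations": [50, 100],
    "max_tree_depth": [3, 4, 5],
}


def toy_evaluate(params: Dict[str, int]) -> float:
    # Stand-in for "run GP with these settings and return the validation RMSE".
    return (
        abs(params["population_size"] - 250)
        + abs(params["max_tree_depth"] - 4)
        + abs(params["max_generations"] - 100) / 10
    )


best, score = grid_scan(param_grid, toy_evaluate)
print(best, score)
```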

The selected hyperparameters and their ranges are reported in Table 2.

Table 2 GP selected hyperparameters

3.3 GP formulae

In the GP experiments, we were looking for formulae f() that would satisfy

$$ {\text{Forecast}} = f\left( {F1,F2, \ldots ,F12} \right) $$
(2)

from the described data.

As mentioned above, we performed 5 main experiments, according to the fivefold cross-validation. Each experiment was repeated 100 times, and the best solution was retained. In addition, GP was applied to the whole dataset.
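To give a feel for how an evolutionary search can produce an explicit formula, the following is a deliberately simplified Python sketch. It is not the Matlab set-up used in the study: it uses mutation only with elitism and tournament selection, no crossover and no multi-gene individuals, and the operator set, the tree encoding, and the synthetic data are assumptions made for illustration. Repeating the run with different seeds and keeping the best result mirrors the "repeat and keep the best" step described above.

```python
import math
import random

# Binary operators allowed in a formula (a hypothetical, minimal set).
OPS = {"add": lambda a, b: a + b,
       "sub": lambda a, b: a - b,
       "mul": lambda a, b: a * b}


def random_tree(n_features, depth, rng):
    """Grow a random formula: ("x", i) is feature i, ("c", v) is a constant."""
    if depth == 0 or rng.random() < 0.3:
        if rng.random() < 0.7:
            return ("x", rng.randrange(n_features))
        return ("c", round(rng.uniform(-2.0, 2.0), 3))
    op = rng.choice(sorted(OPS))
    return (op, random_tree(n_features, depth - 1, rng),
            random_tree(n_features, depth - 1, rng))


def evaluate(tree, row):
    """Evaluate a formula tree on one feature vector."""
    if tree[0] == "x":
        return float(row[tree[1]])
    if tree[0] == "c":
        return tree[1]
    return OPS[tree[0]](evaluate(tree[1], row), evaluate(tree[2], row))


def fitness(tree, X, y):
    """RMSE of the formula on (X, y); non-finite results count as worst."""
    total = 0.0
    for row, target in zip(X, y):
        diff = target - evaluate(tree, row)
        total += diff * diff
    value = math.sqrt(total / len(y))
    return value if math.isfinite(value) else float("inf")


def mutate(tree, n_features, rng):
    """Replace one randomly chosen subtree with a fresh random subtree."""
    if tree[0] in ("x", "c") or rng.random() < 0.3:
        return random_tree(n_features, 2, rng)
    if rng.random() < 0.5:
        return (tree[0], mutate(tree[1], n_features, rng), tree[2])
    return (tree[0], tree[1], mutate(tree[2], n_features, rng))


def evolve(X, y, pop_size=30, generations=15, seed=0):
    """Return the best formula and the best RMSE after each generation."""
    rng = random.Random(seed)
    n_features = len(X[0])
    population = [random_tree(n_features, 3, rng) for _ in range(pop_size)]
    best = min(population, key=lambda t: fitness(t, X, y))
    history = [fitness(best, X, y)]
    for _ in range(generations):
        offspring = [best]  # elitism: the current best always survives
        while len(offspring) < pop_size:
            contenders = rng.sample(population, 3)  # tournament of size 3
            parent = min(contenders, key=lambda t: fitness(t, X, y))
            offspring.append(mutate(parent, n_features, rng))
        population = offspring
        best = min(population, key=lambda t: fitness(t, X, y))
        history.append(fitness(best, X, y))
    return best, history


# Synthetic data (not from the study): the target is 2*x0 + x1.
X = [[i % 7, (3 * i) % 5] for i in range(20)]
y = [2 * a + b for a, b in X]

# Repeat the run with several seeds and keep the best final formula.
runs = [evolve(X, y, seed=s) for s in range(3)]
best_tree, best_history = min(runs, key=lambda run: run[1][-1])
print(best_tree, best_history[-1])
```

Because the best individual is always copied into the next generation, the best RMSE in `best_history` never increases from one generation to the next.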

The resulting formulae, for each cross-validation experiment and for the whole dataset experiment, are reported in Table 3.

Table 3 GP formulae for each experiment

Table 4 shows the RMSE for each experiment, the mean value of the 5 cross-validation results and the RMSE value when the entire dataset was considered. Figure 2 graphically shows the expected and actual values of the new positives in the experiments. In particular, the graph of Exp 2 highlights the negative value of June 02, 2020 and its impact on forecasts.

Table 4 The RMSE for each experiment and the mean value of the 5 cross-validation results
Fig. 2 Plot of predicted and real values of new positives at 10 days for each formula in Table 3. In each picture, the real values are reported in blue and the predicted values are reported in red. The graph of Exp 2 highlights the negative value of June 02, 2020

Table 5 shows how the considered features are distributed among the formulae obtained in the experiments. Judging from the occurrences reported in Table 5, the most significant features seem to be F7, F8, F10 and F12, i.e., the number of new positives 10 days after the moment of observation seems strongly dependent on the current variation in the number of infected people, the newly infected, the deceased, and the molecular swabs performed at the time of observation.
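A feature-occurrence count of this kind can be obtained by scanning the formula strings. In the Python sketch below, the formulae are hypothetical placeholders (not those of Table 3), and counting each feature at most once per formula is an assumption about how the occurrences are tallied.

```python
import re
from collections import Counter
from typing import Iterable


def feature_occurrences(formulae: Iterable[str]) -> Counter:
    """Count in how many formulae each feature name (F1, F2, ...) appears."""
    counts: Counter = Counter()
    for formula in formulae:
        # A set ensures a feature is counted once per formula,
        # even if it appears several times in the same expression.
        counts.update(set(re.findall(r"F\d+", formula)))
    return counts


# Hypothetical formulae used only to exercise the function.
formulae = [
    "0.5*F8 + 0.1*F12 - F7",
    "F8*F10 + 2.0*F8",
    "F12 + F7 - 0.3*F1",
]
counts = feature_occurrences(formulae)
for name, n in sorted(counts.items(), key=lambda item: (-item[1], item[0])):
    print(name, n)
```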

Table 5 Feature occurrences for each formula

3.4 Result comparison

To compare the results, we repeated the experiments with several algorithms widely used in the literature, namely k-Nearest Neighbors (KNN-Regression), Multi-Layer Perceptron (MLP), Support Vector Machines (SMO Regression), and Regression Tree (REPTree). All experiments were carried out in the Waikato Environment for Knowledge Analysis (WEKA), using the same folds as for GP testing (Witten et al. 2016).

Table 6 shows the results. The RMSE values are comparable with those obtained from GP, but these algorithms cannot provide a representation of the relationships among the features involved, given their sub-symbolic nature (Ilkou and Koutraki 2020).

Table 6 RMSE values for each compared method in all the experiments and the mean value of the 5 cross-validation results

4 Experimental results during the pandemic

In order to integrate the results into the SVIMAC-19 system, extending the forecast interval to 15 days, a new GP formula was produced from a newly available set of data. We considered the Campania Region data available at the following link: https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-regioni/dpc-covid19-ita-regioni.csv.
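As a hedged illustration of how the regional rows might be extracted from such a file, the Python sketch below filters a CSV by region and returns the daily new positives in chronological order. The column names `data`, `denominazione_regione`, and `nuovi_positivi` are assumed to match the published file, the helper name is hypothetical, and the inline sample is synthetic so that the example runs without network access.

```python
import csv
import io
from typing import List


def region_new_positives(csv_text: str, region: str = "Campania") -> List[int]:
    """Return the daily new positives of one region, ordered by date."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [row for row in reader if row["denominazione_regione"] == region]
    # ISO-formatted timestamps sort chronologically when compared as strings.
    rows.sort(key=lambda row: row["data"])
    return [int(row["nuovi_positivi"]) for row in rows]


# Synthetic sample with the assumed column names (values are not real data).
sample = (
    "data,denominazione_regione,nuovi_positivi\n"
    "2020-02-25T18:00:00,Campania,4\n"
    "2020-02-24T18:00:00,Lombardia,9\n"
    "2020-02-24T18:00:00,Campania,1\n"
    "2020-02-25T18:00:00,Lombardia,7\n"
)
print(region_new_positives(sample))
```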

At the time of use, the dataset included daily data from February 24, 2020 to April 01, 2021 (403 rows). The considered features are the same as in Table 1 except for the Forecast label, which was changed to the number of new positives 15 days after the time of observation, as reported in Table 7.

Table 7 Label of the new instances

In this way, we obtained 388 labelled instances from February 24, 2020 (1) to March 17, 2021 (388).

The GP formula was built by using the whole dataset, and it is reported in Table 8.

Table 8 GP formula for the whole dataset

Figure 3 plots real and predicted data. The RMSE achieved was 436.88 (variation explained 84.2035%).

Fig. 3 Plot of predicted and real values of new positives at 15 days for the GP formula. In the picture, the real values are reported in blue and the predicted values are reported in red

As can be seen from Table 8, in this case too the most significant feature is F8, which appears in the equation with a very high coefficient (3.222), while F3 is less representative due to its medium coefficient (1.611); lastly, F1 and F2 are barely representative due to a very small coefficient (0.001135).

The formula was then integrated into the SVIMAC-19 system, where it is still operating. Its performance is evaluated both by the RMSE and by 4 standard measures of forecast error used in both scientific and applied fields:

  • Mean Error (ME), i.e., the arithmetic mean of the errors:

    $$ {\text{ME}} = \frac{1}{m}\sum\limits_{t = 1}^{m} {e_{t} } $$
    (3)
  • Mean Squared Error (MSE), i.e., the arithmetic mean of the squares of the errors:

    $$ {\text{MSE}} = \frac{1}{m}\sum\limits_{t = 1}^{m} {e_{t}^{2} } $$
    (4)
  • Mean Absolute Error (MAE), i.e., the arithmetic average of the errors taken as an absolute value:

    $$ {\text{MAE}} = \frac{1}{m}\sum\limits_{t = 1}^{m} {\left| {e_{t} } \right|} $$
    (5)
  • Mean Absolute Percentage Error (MAPE), that is the arithmetic mean of the relative percentage errors, taken as an absolute value:

    $$ {\text{MAPE}} = \frac{1}{m}\sum\limits_{t = 1}^{m} {\frac{{\left| {e_{t} } \right|}}{{y_{t} }}} 100 $$
    (6)

    where \(y_{t}\) is the true value and \(e_{t}\) is the forecast error at time \(t\).
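The four measures in Eqs. (3) to (6) can be computed together, as in the following Python sketch. The sign convention \(e_{t} = y_{t} - \hat{y}_{t}\) is an assumption (it follows the order of terms in Eq. (1)), as are the function name and the sample values; MAPE is left undefined when a true value is zero.

```python
from typing import Dict, Sequence


def forecast_error_measures(
    y_true: Sequence[float], y_pred: Sequence[float]
) -> Dict[str, float]:
    """Compute ME, MSE, MAE and MAPE as in Eqs. (3)-(6)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    if any(y == 0 for y in y_true):
        raise ValueError("MAPE is undefined when a true value is zero")
    m = len(y_true)
    # Assumed sign convention: error = true value minus prediction.
    errors = [y - y_hat for y, y_hat in zip(y_true, y_pred)]
    return {
        "ME": sum(errors) / m,
        "MSE": sum(e * e for e in errors) / m,
        "MAE": sum(abs(e) for e in errors) / m,
        # Follows Eq. (6) literally: |e_t| divided by y_t, times 100.
        "MAPE": sum(abs(e) / y for e, y in zip(errors, y_true)) / m * 100,
    }


# Illustrative values only: errors are -10, 20 and 0.
measures = forecast_error_measures([100, 200, 400], [110, 180, 400])
for name, value in measures.items():
    print(name, round(value, 4))
```

ME keeps the sign of the errors, so over- and under-estimates can cancel out, whereas MAE and MSE do not allow such cancellation.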

We report the experimental results for nine months of operation, i.e., from March 18, 2021 to December 18, 2021. The error measures are reported in Table 9, while Fig. 4 shows the plot of predicted and real values.

Table 9 Error measures as defined in (1), (3), (4), (5), (6) for the time interval from March 18, 2021 to December 18, 2021
Fig. 4 Plot of predicted and real values of new positives at 15 days for the GP formula on the unseen data from March 18, 2021 to December 18, 2021 (real data are reported in blue, predicted data are reported in violet)

5 Conclusions

In this paper, we used Genetic Programming to highlight how the SARS-CoV-2 spread in the Campania Region, in Italy, depends on past data. Our approach aimed to build a forecasting model by mining useful insights from the data observed over time, without taking into account any type of external information or human intervention.

Furthermore, we based the prediction on only a few pieces of information, such as the number of infected people (hospitalized, in intensive care, in home isolation, currently positives, new positives, discharged, cured, deceased, total) and swabs and cases tested.

According to our experimental results, which provide an explicit representation of relationships from the data, the number of future new positives appears to be independent of the number of people currently hospitalized with symptoms or in intensive care, of the number of people in home isolation, and of the total number of infected people since the start of the pandemic. On the contrary, the influence of the current number of newly infected is evident.

The resulting models proved effective in predicting the number of new positives 10 to 15 days in advance. Then, thanks to the adoption of the model within a monitoring system, the experimental data were analyzed in the long term by evaluating different error measures, namely Root Mean Square Error, Mean Error, Mean Squared Error, Mean Absolute Error, and Mean Absolute Percentage Error.

The general adherence of the forecast curve to the real trend is rather surprising. In fact, in line with the initial choices, the model has not been modified following the strengthening of the vaccination policy and the occurrence of virus mutations. This suggests that the latter have an impact mainly on the severity of the disease rather than on the spread of the virus, and this will be a topic for future work.