Comparison of ARIMA, ETS, NNAR, TBATS and hybrid models to forecast the second wave of COVID-19 hospitalizations in Italy

The coronavirus disease (COVID-19) is a severe, ongoing, novel pandemic that emerged in Wuhan, China, in December 2019. As of January 21, 2021, the virus had infected approximately 100 million people, causing over 2 million deaths. This article analyzed several time series forecasting methods to predict the spread of COVID-19 during the pandemic’s second wave in Italy (the period after October 13, 2020). The autoregressive moving average (ARIMA) model, innovations state space models for exponential smoothing (ETS), the neural network autoregression (NNAR) model, the trigonometric exponential smoothing state space model with Box–Cox transformation, ARMA errors, and trend and seasonal components (TBATS), and all of their feasible hybrid combinations were employed to forecast the number of patients hospitalized with mild symptoms and the number of patients hospitalized in the intensive care units (ICU). The data for the period February 21, 2020–October 13, 2020 were extracted from the website of the Italian Ministry of Health (www.salute.gov.it). The results showed that (i) hybrid models were better at capturing the linear, nonlinear, and seasonal pandemic patterns, significantly outperforming the respective single models for both time series, and (ii) the numbers of COVID-19-related hospitalizations of patients with mild symptoms and in the ICU were projected to increase rapidly from October 2020 to mid-November 2020. According to the estimations, the necessary ordinary and intensive care beds were expected to double in 10 days and to triple in approximately 20 days. These predictions were consistent with the observed trend, demonstrating that hybrid models may facilitate public health authorities’ decision-making, especially in the short-term.


3
Thus, the first goal of this paper is to provide short-term and mid-term forecasts for the number of patients hospitalized with COVID-19 during the second wave of COVID-19 infections, i.e., during the period after October 13, 2020. COVID-19-related hospitalization trends offer a clear picture of the overall pressure on the national healthcare system. Moreover, models fitted to hospitalized patients are usually more reliable and accurate than models fitted to confirmed cases [30]. 1 The paper's second goal is to compare and investigate the accuracy of several statistical methods.
In particular, I estimated four time series forecast techniques and all of their feasible hybrid combinations: the autoregressive moving average (ARIMA) model, innovations state space models for exponential smoothing (ETS), the neural network autoregression (NNAR) model, and the trigonometric exponential smoothing state space model with Box-Cox transformation, ARMA errors, and trend and seasonal components (TBATS).
The rest of this paper is organized as follows. "Related literature" reviews the relevant literature while "Materials and methods" presents the data used in the analysis and discusses the empirical strategy. "Evaluation metrics" presents the evaluation metrics used to measure the performance of the models. "Results and discussion" discusses the main findings and policy implications. Finally, "Conclusions" provides some conclusive considerations.
Ala'raj et al. [2] utilized a dynamic hybrid model based on a modified susceptible-exposed-infected-recovered-dead (SEIRD) model with ARIMA corrections of the residuals. They provided long-term forecasts for infected, recovered, and deceased people using a US COVID-19 dataset, and their model had a remarkable ability to make accurate predictions. Using a nonseasonal ARIMA model, Ceylan [14] made short-term predictions of cumulative confirmed cases after April 15, 2020, for France, Italy, and Spain. The forecasts showed low mean absolute percentage errors (MAPE) and seemed to be sufficiently reliable and suitable for the short-term epidemiological analysis of COVID-19 trends.
Hasan [29] proposed a hybrid model that incorporates ensemble empirical mode decomposition (EEMD) and neural networks to forecast real-time global COVID-19 cases for the period after May 18, 2020. The analysis showed that the ANN-EEMD approach was quite promising and outperformed traditional statistical methods, such as regression analysis and moving average.
Ribeiro et al. [65] provided short-term estimates of COVID-19 cumulative confirmed cases in Brazil by employing multiple approaches and selecting several models, such as ARIMA, cubist regression (CUBIST), random forest (RF), ridge regression (RIDGE), support vector regression (SVR), and stacking-ensemble learning (SEL). The models' reliabilities were evaluated based on the improvement index, mean absolute error (MAE), and symmetric MAPE criteria. The analysis demonstrated that SVR and SEL performed best, but all models exhibited good forecasting performances.
Using ARIMA, TBATS, their statistical hybrid, and a mechanistic mathematical model combining the best of the previous models, Sardar et al. [68] attempted to forecast daily COVID-19 confirmed cases across India and in five different states (Delhi, Gujarat, Maharashtra, Punjab, and Tamil Nadu) from May 17, 2020, until May 31, 2020. The ensemble model showed the best prediction skills and suggested that COVID-19 that daily COVID-19 cases would significantly increase in the considered forecast window and that lockdown measures would be more effective in states with the highest percentages of symptomatic infection.
Wieczorek et al. [75] implemented deep neural network architectures, which learned by using a Nesterov-accelerated adaptive moment (Nadam) training model, to forecast cumulative confirmed COVID-19 cases in several countries and regions. The predictions, which referred to different time windows, revealed that the models had an extremely high level of accuracy (approximately 87.7% for most regions but, in some cases, reaching almost 100%).
Talkhi et al. [71] attempted to forecast the number of COVID-19 confirmed infections and deaths in Iran between August 15, 2020, and September 14, 2020, using several single and hybrid models. The extreme learning machine (ELM) and hybrid ARIMA-NNAR models were the most suitable for forecasting confirmed cases, while the Holt-Winters (HW) approach outperformed the others in predicting death cases.
hospitalizations of patients with mild symptoms and patients assigned to the ICU in Italy from February 21, 2020, to October 13, 2020. I extracted the data from the official Italian Ministry of Health's website (www. salute. gov. it). The confirmed COVID-19-related hospitalization trends appear in Fig. 1.
The data showed that the number of COVID-19 patients hospitalized with mild symptoms and the number of COVID-19 patients assigned to the ICU reached an initial peak on April 4, 2020. They then followed a downward trend until mid-August before accelerating again from the end of September 2020 to mid-October 2020. Recognizing that the use of only one model is never wise and may lead to unreliable forecasts [68], I computed the forecasts by employing different statistical techniques and their combinations. Specifically, linear ARIMA models, ETS models, linear and nonlinear NNAR models, TBATS models, and their feasible hybrid combinations were examined.
ARIMA models, which were first proposed by Box and Jenkins [11], represent one of the most widely used frameworks for epidemic/pandemic and disease time series predictions [66,84]. Considering only the linear trend of a time series, they are able to capture both nonseasonal and seasonal patterns of that series. The following should be noted regarding the nonseasonal component of ARIMA models: (i) the autoregressive (AR) process aims to forecast a time series using a linear combination of its past values; (ii) the differencing (I) is required to make the time series stationary by removing (or mitigating) trend or seasonality (if any), and (iii) the moving average (MA) process aims to forecast future values using a linear combination of previous forecast errors. The seasonal component is similar to the nonseasonal component, but implies backshifts of the seasonal period, i.e., adding the seasonal parameters to the AR, I, and MA components, which allows the model to handle most of the seasonal patterns in the real-world data. Therefore, the final seasonal model can be denoted as ARIMA (p,q,d)(P,Q,D) m , where m is the seasonal period and the lowercase and uppercase letters indicate the number of nonseasonal and seasonal parameters for each of its three components, respectively [34], Sect. 8).
The ETS class of models was introduced in the late 1950s [12,31,76] to consider different combinations of trend and seasonal components. The basic ETS model consists of two main equations: a forecast equation and a smoothing equation. By integrating these two equations into an innovation state space model, which may correspond to the additive (A) or multiplicative (M) error assumption, it is possible to obtain an observation/measurement equation and a transition/state equation, respectively. 2 The first equation describes the observed data while the second equation describes the behavior of the unobserved states. The states refer to the level, trend, and seasonality. The trend and seasonal components may be none (N), additive (A), additive damped (Ad), 3 or multiplicative (M), resulting in a wide range of model combinations. The final model assumes the form of a three-character string (Z,Z,Z), where the first letter identifies the error assumption of the state space model, the second letter identifies the trend type, and the third letter identifies the season type. These models are able to produce a time series forecast by using the weighted average of its past values and adding more weight to recent observations [34], Sect. 7, [40].
NNAR models can be viewed as a network of neurons or nodes that depict complex nonlinear relationships and functional forms. In a basic neural network framework, the neurons are organized in two layers: (i) the bottom layer identifies the original time series, and (ii) the top layer identifies the predictions. The resulting model is equivalent to a simple linear regression and becomes nonlinear only when an intermediate layer with "hidden neurons" is included. For seasonal data, NNAR models can be described with the notation NNAR (p,P,k) m , where m is the seasonal period, p denotes the number of nonseasonal lagged inputs for the linear AR process, P represents the seasonal lags for the AR process, and k indicates the number of nodes/neurons in the hidden layer [34], Sect. 11.3).
Finally, TBATS models are a class of models that combine different approaches: trigonometric terms for modeling seasonality, Box-Cox transformation [10] for addressing heterogeneity, ARMA errors for addressing short-term dynamics, damping (if any) trends, and seasonal components. Therefore, TBATS models have several properties: (i) they deal well with very complex seasonal patterns, Fig. 1 Patients hospitalized with mild symptoms and in the ICU from February 21, 2020 to October 13, 2020. Source: Italian Ministry of Health [43] which might, for example, exhibit daily, weekly, and annual patterns simultaneously; (ii) they are able to consider time series nonlinear patterns, and iii) they can handle any type of autocorrelation in the residuals [34], Sect. 11.1, [71].
The combination of different times series forecast methods maximizes the chance of capturing seasonal, linear, and nonlinear patterns [60,82] and is especially useful for predicting real-world phenomena, such as the COVID-19 pandemic, which are characterized by complex dynamics [7]. Well established from the seminal work of Bates and Granger [6], combining techniques with unique properties could allow models to achieve better performance and forecast accuracy. 4 The models were calculated by using the following analytical procedures: • ARIMA models were detected by applying the "auto. arima()" function included in the package "forecast" (in the R environment) and developed by Hyndman and Khandakar [35]. This function followed sequential steps to identify the best model, i.e., the number of p parameters of the autoregressive process (AR), the order i of differencing (I), the number of q parameters of the MA, and the number of the parameters of the seasonal component. It combined unit root tests 5 and the minimization of the following estimation methods: the bias-corrected Akaike's information criterion (AICc) 6 and the maximum likelihood estimation (MLE). The unit root tests identified the order of differencing while the AICc and the MLE methods identified the order and the values of the parameters (respectively) of the seasonal and nonseasonal AR and MA processes; • ETS models were identified by using the "ets()" function included in the package "forecast" (in the R environment) and developed by Hyndman et al. [39]. 7 In particular, I applied the Box-Cox [10] transformation to the data before estimating the model and then used the AICc metric to determine if the trend type was damped or not. The final three-character string identifying method (Z,Z,Z) was selected automatically; • NNAR models were identified via the "nnetar()" function included in the package "forecast" (in the R environment) written by Hyndman [33]. 8 I proceeded as follows: (i) first, the Box-Cox transformation [10] was applied to the data before estimating the model; (ii) second, the optimal number of nonseasonal p lags for the AR(p) process was obtained by using the AICc metric; (iii) third, the seasonal P lags for the AR process were set to 1 9 ; and (iv) finally, the optimal number of neurons was identified using the formula k = (p+P+1)

2
[34], Sect. 11.3); • TBATS models were identified using the "tbats()" function included in the package "forecast" (in the R environment) as described in De Livera et al. [17]. The optimal Box-Cox transformation parameter, ARMA (p,q) order, damping parameter, and number of Fourier terms were selected using the Akaike's information criterion (AIC) metric; 10 • Hybrid models were identified via the "hybridModel()" function included in the "forecastHybrid" package (in the R environment) developed by Shaub and Ellis. 11 The individual time series forecasting methods were combined as follows: (i) first, the Box-Cox power transformation [10] was applied to the inputs to increase the plausibility of the normality assumption; and (ii) then, the individual models were combined using both equal weights and cross-validated errors ("cv.errors"), which gave greater weight to the models that performed relatively better. In fact, since the best weighting procedure has not been established, I adopted a parsimonious approach and chose the one that performed better. Specifically, I tested the overall goodness-of-fit of all models with four common forecast accuracy measures: MAE, MAPE, mean absolute scaled error (MASE), and root mean square error (RMSE).
The estimated basic equation for the ARIMA was the following [16]: where Δ d is the second difference operator, y t indicates the predicted values, p is the lag order of the AR process, is the coefficient of each parameters p, q is the order of the (1) 4 See also, for example, Fallah et al. [21]. 5 Both the augmented Dickey-Fuller test [19] and the Kwiatkowsky, Phillips, Schmidt, and Shin test [50] were used. In fact, as stated by Gujarati and Porter [28], there is no recognized uniformly powerful test for detecting unit roots. 6 The AICc is a bias-corrected version of the original Akaike information criterion (AIC) proposed by Sugiura [70] and Hurvich and Tsai [32]. The former performed significantly better than the latter in both small and moderate sample sizes [32]. Thus, it is suitable for this study. 7 A description of the "ets()" function is provided by Hyndman and Athanasopoulos [34], Sect. 7.6). MA process, is the coefficient of each parameter q, and t denotes the residuals of the errors at time t.
The estimated equations for the basic ETS (A,N,N) model with additive errors were the following [34], Sect. 7): where l t is the new estimated level, ŷ t+1|t denotes each onestep-ahead prediction for time t +1 which results from the weighted average of all the observed data, 0 ≤ ≤ 1 is the smoothing parameter, which controls the rate of decrease of the weights, and y t − l t−1 is the error at time t. Hence, each forecasted observation is the sum of the previous level and an error, and each type of error, additive or multiplicative, corresponds to a specific probability distribution. For a model with additive errors, as is this case here, errors are assumed to follow a normal distribution. Thus, Equations (2) and (3), respectively, can be rewritten as follows: Equations (4) and (5) represent the innovation state space models that underlie the exponential smoothing methods.
The basic form of the neural network autoregression equation was the following [34], Sect. 11.3): w h e r e y t i n d i c a t e s t h e p r e d i c t e d va l u e s , y t−1 = y t−1, y t−2 … , y t−n � is a vector containing the lagged values of the observed data, f is the neural network with n hidden neurons in a single layer, and t is the error at time t. A simple graphical example of a nonlinear neural network is shown in Fig. 2.
Finally, the basic equation of the TBATS model took the following form [17]: where y ( ) i indicates the Box-Cox transformation parameter (ω) applied to the observation y t at time t, l t is the local level, is the damped trend, b is the long-run trend, T denotes the seasonal pattern, s (i) t is the ith seasonal component, 12 denotes the seasonal periods, and d t indicates an ARMA (p,q) process for residuals.

Evaluation metrics
The main metrics used to compare the performances of the single and hybrid prediction models were MAE, MAPE, MASE, and RMSE. The formulae used to calculate each of these metrics appear below (Eqs. [8][9][10][11]: where n represents the number of observations, y i denotes the actual values, and ŷ i indicates the predicted values. Specifically, MAE and RMSE are both scale-dependent measures, although based on different errors. MAE is easier to interpret because minimizing it leads to predictions of the median, while minimizing RMSE leads to predictions of the mean. In fact, if the first metric is based on absolute errors, the second is based on squared errors. MAPE is probably the most widely employed error measure [27,47], and unlike MAE and RMSE, it is not scale-dependent because it is based on percentage errors. Thus, it has the advantage of being a unit-free metric. However, it also requires some critical considerations. For example, it can lead to biased forecasts because it gives infinite or undefined results when one or more time series data point equals 0, and it puts a heavier penalty on negative errors (i.e., when predicted values are higher than actual values) than on positive errors. Finally, MASE, which was proposed by Hyndman and Koehler [36], is a scale-free error metric and probably the most versatile and reliable measure of forecast accuracy. It is superior to MAPE in that it does not give infinite or undefined values and can be used to compare forecast accuracy both on single and multiple time series. 13 Since each model thus entails specific strengths and disadvantages, I opted for the prudent approach-evaluating the output of all of them. Table 2 reports the best selected parameters for the single models 14 while Tables 3 and 4 include the forecast accuracy measures for the single and hybrid models. 15 For patients hospitalized with mild symptoms (Table 2), the optimal single models were the seasonal ARIMA (1,2,3) (0,0,1) 7 , ETS (A,Ad,N), 16 NNAR (7,1,4) 7 , and TBATS (0.428, {2,2}, 1, {< 7,2 >}). 17 For patients hospitalized in the ICU (Table 2), the optimal single models were the seasonal ARIMA (1,2,2) (0,0,1) 7 , ETS (A,A,N), 18 NNAR (6,1,4) 7 , and (T)BATS (0.427,{0,0},1,−). 19 The hybrid models were derived by combining the optimal single models with equal weights, which proved to be more suitable than weights based on error values. 20 In particular, for patients hospitalized with mild symptoms, the most accurate single model was NNAR, followed by seasonal ARIMA, while the best hybrid model was NNAR-TBATS, followed by ARIMA-NNAR, and ARIMA-NNAR-TBATS. For patients hospitalized in the ICU, the best single model was NNAR, followed by   14 The parameter values of the ARIMA, ETS, and TBATS models were reported in Tables A1, A2, and A3 (Appendix A). 15 As suggested by Hyndman [38], since the time series covered less than year and had daily observations, the frequency was set to 7, which allowed for weekly seasonality. 16 The selected ETS model is also known as the damped trend method with additive errors. 17 The estimated TBATS model was derived by applying a Box-Cox transformation of 0.428, an ARMA{2,2} process for modeling errors, a damping parameter of 1 (doing nothing), and two Fourier pairs with seasonal periods of 7. 18 The selected ETS model is also known as Holt's linear trend method. In this case, both error and trend were additive. 19 In this case, the "tbats()" algorithm gave a BATS model that differed from TBATS only in the way that it modeled seasonality. In particular, it did not implement the Fourier series to model seasonality. For simplicity, the notation TBATS was used for the discussion of hybrid models forecasting patients hospitalized in the ICU. This model employed a Box-Cox transformation of 0.427, no ARMA {0,0} errors, and a damping parameter of 1 (doing nothing). 20 The forecast performance measures of the hybrid models combined with the cross-validation procedure were reported in the Appendix B (Tables B1 and B2). With regard to patients hospitalized with mild symptoms and patients hospitalized in the ICU, the models derived using equal weights outperformed those obtained with crossvalidated errors in 26 and 31 (out of 33) performance accuracy metrics, respectively Tables B3 and B4.

3
ARIMA, while the best hybrid model was ARIMA-NNAR, followed by ARIMA-ETS-NNAR, and ETS-NNAR. 21 Autocorrelation function (ACF) indicated that current values were not correlated with previous values at lag 1 (Tables 2 and 3). In fact, the correlation coefficient between one point and the next in the time series ranged from − 0.07 to 0.19 for patients hospitalized with mild symptoms and from − 0.16 to 0.31 for patients in the ICU. The highest values (0.19 and 0.31) were obtained for the TBATS model. 22 According to Lewis' [52] interpretation, since MAPE was always significantly lower than 10, all predictive models can be considered as highly accurate. Moreover, MASE was much lower than 1 for all models; therefore, all the proposed forecasting approaches performed significantly better than the forecasts from the (no-change) "naïve" methods, i.e., the forecasts with no adjustments for casual factors (Hyndman and Koehler 2006), which justifies the use of more complex and sophisticated models.
Tables 5 and 6 compare the hybrid models with the respective single models considering the minimization of MAE, MAPE, MASE, and RMSE metrics. For patients hospitalized with mild symptoms, the hybrid models outperformed the respective single models in 98 out of 112 metrics, i.e., on 87.5% of all the forecast accuracy measures. For patients hospitalized in the ICU, the hybrid models outperformed the respective single models in 81 out of 112 metrics, i.e., on 72.3% of all measures. In the latter case, however, almost all losses of efficiency (25 out 31) were attributable to NNAR. 23 Thus, the hybrid models generally increased forecast accuracy, but this increase was more evident for patients hospitalized with mild symptoms. The best hybrid model for patients hospitalized with mild symptoms-NNAR-TBATS-outperformed the single NNAR and TBATS models by 5.63-9.23% on MAE and 5.88-9.95% on RMSE. While, the best hybrid model for patients hospitalized in the ICU-ARIMA-NNARoutperformed the single ARIMA and NNAR models by 1.2-8.67% on MAE, and 2.81-11.44% on RMSE. Figures 3  and 4 graphically represent the models for both time series ranked by the MAE and RMSE metrics.  21 In particular, ARIMA-ETS-NNAR and ETS-NNAR models exhibited nearly identical accuracy. 22 Tables C1 and C2 (Appendix C) also reported the results of Ljung-Box's [54] test for residual autocorrelation using no more than 10 lags, as suggested by Ljung [53] when the sample size is not large. The output showed that for patients hospitalized with mild symptoms, ARIMA, TBATS, and all hybrid models, except for ETS-TBATS, had no autocorrelation issue, while single ETS and NNAR exhibited some autocorrelation, especially at lags 8 and 10. With regard to patients hospitalized in the ICU, ARIMA and most of hybrid models, except for ARIMA-TBATS, ETS-TBATS, and ARIMA-ETS-TBATS, had no particular signs of autocorrelation. In contrast, single ETS, NNAR, and TBATS showed some problems at lags 8 and 10. 23 This did not change the meaning of the results because, as confirmed by Table C2 (Appendix C), the residuals of the hybrid model's output were generally less affected by autocorrelation than were those for NNAR. Figures 5 and 6 show the best six models and the remaining nine models for patients hospitalized with mild symptoms, respectively. Similarly, Figs. 7 and 8 show the best six models and the remaining nine models for patients hospitalized in the ICU, respectively. The light blue area in each graph shows the prediction intervals at 80%, while the dark blue area shows the prediction intervals at 95%. 24 The forecasts of the best single and hybrid models anticipated an increase in the number of patients hospitalized with mild symptoms and in the number of patients admitted to the ICU over the next 30 days, i.e., from October 14, 2020, to November 12, 2020. This predicted trend was also confirmed by the remaining estimated models.
Specifically, the NNAR-TBATS, ARIMA-NNAR, and ARIMA-NNAR-TBATS models predicted that: (i) after 10 days (October 23), the number of patients hospitalized with mild symptoms should have been 9624, 9397, and 9259, respectively; (ii) after 20 days (by November 2), the number should have been 18,000, 16,986, and 16,062, respectively; and (iii) after 30 days (by November 12), the number should have been 25,039, 21,669, and 21,430, respectively (Fig. 5). Regarding the number of patients hospitalized in the ICU, the ARIMA-NNAR, NNAR, and ARIMA-ETS-NNAR models predicted that: (i) after 10 days, the required number   (Fig. 7). 25 Figures 9, 10, 11 and 12 compare all 15 of the estimated models and the observed data over the period October 14, 2020, to November 12, 2020. For patients hospitalized with mild symptoms, the NNAR-TBATS models best fit the observed data (Fig. 9). Of the remaining models, predictions from ARIMA-NNAR, ARIMA-NNAR-TBATS, and NNAR most closely approximated the observed data (Figs. 9, 10). For patients hospitalized in the ICU, Fig. 11 reveals that ARIMA-NNAR, NNAR-TBATS,   25 Values are rounded to the nearest integer. The predicted values of the six best models for patients hospitalized with mild symptoms and in the ICU were reported in Table D1 and D2 (Appendix D), respectively.

3
and ARIMA-NNAR-TBATS hybrid models also fit the observed data quite well. All estimated models generally exhibited a strong match between predictions and observed data, except for ETS, NNAR, and ETS-TBATS, where the trends differed (Figs. 11 and 12). Notably, the models with lowest loss of efficiency adapted better to the observed data, which confirms the consistency and robustness of the statistical approach.
Thus, a second wave of COVID-19 was predicted for the period following October 13, 2020, which could have several policy implications both for the national healthcare system and the economy. In particular, the predictions underscored the importance of implementing adequate containment Fig. 6 The remaining nine forecast models for predicting patients hospitalized with mild symptoms. Notes: The models were ranked (from seventh to fifteenth place) on the MAE metric 1 3 measures and increasing the number of ordinary and intensive care beds, hiring additional healthcare personnel, and buying care facilities, protective equipment, and ventilators to fight the infection and reduce deaths. 26 Meanwhile, the opportunity to implement more or less restrictive non-pharmaceutical interventions (NIPs) to tackle the pandemicsuch as social distancing, travel bans, the use of face masks, hand hygiene, and bar and restaurant restrictions [20]should be evaluated carefully in light of these measures' potentially negative economic impacts. In fact, according to Fitch Rating's [24] previsions, the first wave of COVID-19 and the consequent massive lockdown measures may already have caused up to a 9.5% contraction in Italy's 2020 GDP. Fig. 8 The remaining nine forecast models for predicting patients hospitalized in the ICU. Notes: the models were ranked (from seventh to fifteenth place) on the MAE metric While the models with the lowest loss of efficiency seemed to adapt substantially well to the observed data in the forecast window, the predictions should, in general, be treated with caution and employed mainly to inform shortterm decisions. In fact, pandemic forecasting has raised many doubts in the last year due to several issues that can affect its accuracy and reliability, including, for example, (i) high interval predictions and sensitivity of the estimates, especially with long-term forecasts; (ii) inaccurate modeling assumptions, and (iii) the lack of or difficulty in measuring and identifying biological features of COVID-19 transmission [30,42]. These limitations-and the possibility for misleading forecasts-may erode public trust in science and thus affect compliance with policies intended to mitigate the Fig. 9 Comparison between forecasts and real data during the period October 14, 2020, to November 12, 2020, for patients hospitalized with mild symptoms (six best models) spread of COVID-19 [49]. Indeed, the inevitable uncertainty associated with this novel disease and the general failure of long-term and even mid-term forecasts require a different scientific approach toward model predictions. More prudent and balanced communication with the public is crucial if the field of science desires to maintain its leading role in human development and policymaking.

Conclusions
This paper attempted to forecast the short-term dynamics of real-time patients hospitalized from COVID-19 in Italy. In particular, it employed both single time series forecast methods and their feasible hybrid combinations. The results demonstrated that (i) the best single models were NNAR and ARIMA for both patients hospitalized with mild symptom and patients admitted to the ICU, (ii) the most accurate hybrid models were NNAR-TBATS, ARIMA-NNAR, and ARIMA-NNAR-TBATS for patients hospitalized with mild symptoms and ARIMA-NNAR, ARIMA-ETS-NNAR, and ETS-NNAR for patients hospitalized in the ICU, (iii) hybrid models generally outperformed the respective single models by offering more accurate predictions, and (iv) finally, predictions for the number of patients hospitalized in the ICU generally better fit the observed data than did predictions for patients hospitalized with mild symptoms. Notably, the best hybrid models always included a NNAR process, confirming the extensive and successful use of this algorithm in the COVID-19 related literature [56,57,71,72,81].
Compared to the single models, the hybrid statistical models captured a greater number of properties in the data Fig. 10 Comparison between forecasts and real data during the period October 14, 2020, to November 12, 2020, for patients hospitalized with mild symptoms (nine remaining models) structure, and the predictions seemed to offer useful policy implications. In fact, consistent with real-time data, the models predicted that the number of patients hospitalized with mild symptoms and admitted to the ICU would grow significantly until mid-November 2020. According to the estimations, the necessary ordinary and intensive care beds were expected to double in 10 days and to triple in approximately 20 days. Thus, since new waves of COVID-19 infections cannot be excluded, it may be necessary to strengthen the national healthcare system by buying protective equipment and hospital beds, managing healthcare facilities, and training healthcare staff.
Although the hybrid models proved to be sufficiently accurate, it is nevertheless important to stress that statistical methods may lead to unavoidable uncertainty and bias, which tend to grow over time, due, for example, to public authorities' progressive implementation of NIPs, such as the closure of public spaces and national or local lockdown measures, which the forecasts cannot adequately incorporate. Combining hybrid models with mechanistic mathematical models may partially overcome these issues by considering the effects of lockdowns on epidemiological parameters [68]. Thus, future research could proceed along these lines. Ultimately, since other factors may have affected COVID-19 dynamics, especially as the pandemic progressed, these predictions should be treated with caution and utilized only to inform short-term decision-making processes.

Appendix C
See Tables C1 and C2.

Appendix D
See Tables D1 and D2. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/.