Optimal forecasting accuracy using Lp-norm combination

A well-known result in statistics is that a linear combination of two point forecasts has a smaller Mean Square Error (MSE) than either of the two competing forecasts (Bates and Granger in J Oper Res Soc 20(4):451–468, 1969). The only case in which no improvement is possible is when one of the single forecasts is already optimal in terms of MSE. Combination methods are various, ranging from the simple average (SA) to more robust methods based on the median, the Trimmed Average (TA) or Least Absolute Deviations, and to optimization techniques (Stock and Watson in J Forecast 23(6):405–430, 2004). Standard regression-based combination approaches may fail to give realistic results when the forecasts are highly collinear or the data distribution is not Gaussian. Therefore, we propose a forecast combination method based on Lp-norm estimators. These estimators are based on the Generalized Error Distribution, a generalization of the Gaussian distribution, and can handle both multicollinearity and non-Gaussianity. To demonstrate the potential of Lp-norms, we conducted a simulation study and an empirical study, comparing their performance with other standard regression-based combination approaches. The simulation study was carried out with different values of the autoregressive parameter, alternating heteroskedastic and homoskedastic innovations. The real data application is based on the daily Bitfinex historical series of Bitcoin (2014–2020) and on the 25 historical series of companies included in the Dow Jones index. We show that, by combining different GARCH and ARIMA models, assuming both Gaussian and non-Gaussian distributions, the Lp-norm scheme improves forecasting accuracy with respect to other regression-based combination procedures.


Introduction
Model instabilities are deeply rooted in real-life forecasting challenges because models are uncertain and mutable. Moreover, in statistics, the Data Generating Process (DGP) refers to the true process that generates the data, and it is different from a model of the data. Indeed, statistical models can be either misspecified or incomplete [53].
This fact is widely accepted in theory, but less widely applied in practice in the field of statistics, in which researchers often still hang on to the conceptual error of assuming one true data-generating process and focus too much on model selection in order to find the one true model.
In this respect, Hansen [26] takes note of this and related misconceptions of econometric forecasting practice in his essay on the challenges for statistical model selection.
An obvious alternative to choosing a single 'best' forecasting method is to combine forecasts from different models. The combination can be achieved in different ways, one of the most common being regression approaches based on the Ordinary Least Squares (OLS) estimation technique. Below, we propose an alternative and more general approach for combining models and/or forecasts, based on Lp-norm estimators [40].
In this paper we study the effectiveness of Lp-norm estimators in combining volatility forecasts. We consider the volatility of stock market data because it empirically shows non-Gaussian distributions; indeed, the Lp-norm should work better than OLS in the presence of non-Gaussian forecasts. GARCH distributions play an important role in volatility measurement and in value-at-risk applications [10].
The main aim of this paper is to measure the performance of GARCH techniques for forecast combination using different distributional assumptions. The GARCH distributions considered in the paper are the Student's t, the Gaussian and the GED, jointly considered with some ARMA models. We try to show the advantages of the GED GARCH over the classical specifications, such as the Student's t GARCH and the Gaussian GARCH.
The paper is structured as follows. In the next section, we present a historical survey of the forecast combination literature and glance at some of the main combination methods since their introduction; we also briefly introduce our method. In Sect. 3, we review the main recent methods for forecast combination and show the advantages and disadvantages of the different combination approaches.
In Sect. 4, we discuss the Generalized Error Distribution (GED), explaining in detail our approach based on Lp-norm estimators and proposing a new algorithm to combine the forecasts. Section 5 presents some comparative Monte Carlo experiments in which the overperformance of the proposed methodology under multicollinearity and non-Gaussianity is assessed by means of a simulation study considering heteroskedastic and homoskedastic innovations.
The real data application is proposed in Sect. 6, in which we attempt to determine whether the results obtained in the simulation study also hold in practice. In the last section, some final comments and observations are offered.
Finally, in Appendices A and B we show the commented R scripts that we built and used to estimate the forecasts and the combinations in the simulated and empirical studies, whilst in Appendix C some graphical findings of the simulation study are reported.

The main approaches to combine forecasts in models: a review
The common way of choosing a forecasting model is to select the one that yields the most accurate forecasts, discarding the others. This practice has been re-evaluated since the introduction of forecast combinations, which integrate the information generated by multiple forecasts and use it to obtain a more accurate prediction. Combining forecasts is a useful practice because the discarded models may contain important information that is not present in the selected individual forecast.
When the forecasts diverge, owing to the assumptions made in each forecasting model (linearity, type of distribution, variability of the parameters, etc.), the decision-maker who wants a single forecast faces a thorny problem of choice. Forecast combination constitutes a solution to this problem and refines the forecast. The empirical finding that simple combinations are hard to beat is still true almost 30 years later and is known as the "forecast combination puzzle", a term coined by Stock and Watson [47]. Hibon and Evgeniou [29] verified that a linear combination, even one using the individually worst forecasts, still leads to better results than the individual forecasts. Hence, choosing an individual method out of a set of available methods is riskier than choosing a combination.
First of all, let's review the main forecast combination methodologies introduced since the seminal paper by Bates and Granger [7]. In the last few decades, several methodologies have been put forward in both the theoretical and the empirical literature. Andrawis et al. [1] even suggest using hierarchical forecast combinations, i.e. combining already combined forecasts. Our purpose is not to find the best combination method by comparing all those proposed in the literature: much depends on the specifics of the available data and on the differences between modelling approaches. Rather, we provide a new way of combining within the regression framework that works well in several real-life situations, such as non-Gaussian forecast densities or collinear forecasts.
The Bayesian approach was suggested by Min and Zellner [37], who consider various time series models producing forecasts of production growth rates for 18 countries over a 13-year period in order to combine models and their forecasts. Optimal Bayesian combination procedures consider posterior odds for alternative models and are also used in predictive testing to decide whether or not to combine alternative model forecasts. Applying Bayesian pooling techniques, it turns out that it is not always optimal to combine forecasts.
A well-known result is that a more accurate forecast can be obtained by taking a linear combination of two point forecasts. The combination improves the forecast, which will have a smaller mean square error (MSE) than either of the two competing forecasts; by combining multiple forecasts, decision-making also becomes simpler. The only case in which no improvement is possible is when one of the single forecasts is already optimal in terms of MSE [30].
A simple procedure to combine forecasts is to take the arithmetic average of the single forecasts, which serves as a useful benchmark. Given the forecast ŷ_i provided by the i-th forecasting model (or by the i-th expert forecaster), the well-known result is that the Mean Square Error of the average ȳ of all the ŷ_i is lower than that of each ŷ_i itself. In a paper from 1963, Barnard showed an empirical example in which an arithmetic average of two forecasts had a smaller Mean Square Error (MSE) than either of the individual forecasts. If the component forecasts are not biased, combining with the mean is a good method. Clemen [15] summarized the literature on forecast combinations and concluded that combining forecasts of various economic and financial variables led to increased forecast accuracy. He states that using a combination of forecasts amounts to admitting that the forecaster is unable to build a single fully specified model, and that trying to combine increasingly elaborate models is incorrect. Similar conclusions were reached by Aksu and Gunter [4] based on macroeconomic variables and firm-specific series, by Makridakis and Hibon [36] based on the so-called M3 competition, by Stock and Watson [46,47] across various economic and financial variables, and by Swanson and Zeng [48] using US macroeconomic variables. Several authors [2,14,15] deemed that this equal weighting of the component forecasts is often the best strategy. According to Armstrong [2], it is necessary to combine predictions from essentially different methods derived from different sources; in particular, formal procedures for combining forecasts have been suggested in order to improve accuracy.
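The benchmark result above can be checked numerically. The following Python sketch (the paper's own computations are in R; this is an illustrative reimplementation with made-up data) compares the MSE of two unbiased forecasts with that of their simple average:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a target series and two unbiased but noisy forecasts.
y = rng.normal(0.0, 1.0, 10_000)             # target
f1 = y + rng.normal(0.0, 1.0, 10_000)        # forecast 1: unbiased, noisy
f2 = y + rng.normal(0.0, 1.0, 10_000)        # forecast 2: independent errors

avg = 0.5 * (f1 + f2)                        # simple average (SA) combination

mse = lambda f: np.mean((y - f) ** 2)
mse_f1, mse_f2, mse_avg = mse(f1), mse(f2), mse(avg)
# With independent, equal-variance errors the SA halves the error variance,
# so mse_avg is close to 0.5 while mse_f1 and mse_f2 are close to 1.
```

Under these assumptions the averaging gain is guaranteed; with correlated or biased component errors the gain shrinks, which is exactly why the more elaborate schemes reviewed below exist.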
Another simple and appealing combination method uses the median of the component forecasts. The median is insensitive to outliers, which can be relevant for some applications. Palm and Zellner [41], Ruiz and Nieto [43] and Hendry and Clements [28] suggested that simple averaging may not be a suitable combination method when some of the component forecasts are biased.
Jose and Winkler [33] proposed using the trimmed mean instead of a simple average or median combination. Sometimes the simpler methods, such as the median [42], can generate satisfactory results compared to more complex methods.
It might be better to combine the forecasts using more sophisticated rules; one of them is Least Squares regression. The idea of using regression for combining forecasts was put forward by Crane and Crotty [16] and successfully driven to the forefront by Granger and Ramanathan [25]. Under this approach, the combined forecast is a linear function of the individual forecasts, whose weights are determined by regressing the target on the individual forecasts, i.e. y_t = β₀ + β₁ŷ_1t + … + β_mŷ_mt + ε_t. An advantage of the OLS forecast combination is that the combined forecast is unbiased, even if one of the individual forecasts is biased. A disadvantage is that the method does not restrict the combination weights (they need not add up to one and can be negative). In this respect, Constrained Least Squares (CLS) regression can be used to obtain the weights by minimizing the squared errors under the constraints of non-negativity, summing to one and zero intercept. Because of the absence of the intercept, this method might lead to a distorted forecast if one of the individual forecasts is biased. As a consequence, CLS does not always achieve the best level of accuracy in terms of MSE.
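The two regression-based schemes just described can be sketched as follows in Python (an illustration on assumed data, not the paper's R code): the Granger-Ramanathan weights come from an ordinary regression with intercept, while the CLS weights come from a constrained minimization.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

# Hypothetical data: target y and a matrix F of two individual forecasts.
y = rng.normal(0.0, 1.0, 500)
F = np.column_stack([y + rng.normal(0, 0.8, 500),    # forecast 1 (less noisy)
                     y + rng.normal(0, 1.2, 500)])   # forecast 2 (noisier)

# Granger-Ramanathan OLS combination: regress y on the forecasts plus intercept.
X = np.column_stack([np.ones(len(y)), F])
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)     # [intercept, w1, w2]

# CLS combination: no intercept, non-negative weights summing to one.
sse = lambda w: np.sum((y - F @ w) ** 2)
res = minimize(sse, x0=np.full(2, 0.5), method="SLSQP",
               bounds=[(0, None)] * 2,
               constraints={"type": "eq", "fun": lambda w: w.sum() - 1})
w_cls = res.x
```

Note how CLS, by construction, gives larger weight to the less noisy forecast while keeping the weights interpretable as shares.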
While OLS regression estimates the coefficients by minimizing the sum of squared errors, we may want to estimate those coefficients differently [18,27], minimizing a different loss function, such as the sum of absolute deviations (LAD). The Least Absolute Deviation method has two main advantages: first, it has a lower sensitivity to outliers; secondly, it performs better when the predictors are highly correlated. This suggests that the LAD combination should be preferred to OLS in these situations [52].
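A minimal LAD-combination sketch (Python, with hypothetical contaminated data; the paper's implementation is in R) replaces the squared-error loss with the sum of absolute residuals:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)

# Hypothetical data: two forecasts of y, then a few large outliers in y.
y = rng.normal(0.0, 1.0, 400)
F = np.column_stack([y + rng.normal(0, 0.5, 400),
                     y + rng.normal(0, 0.9, 400)])
y[:10] += 15.0                    # contaminate the target with outliers

X = np.column_stack([np.ones(len(y)), F])

# LAD: minimize the sum of absolute residuals (L1 loss) instead of squares.
lad_loss = lambda b: np.sum(np.abs(y - X @ b))
b0, *_ = np.linalg.lstsq(X, y, rcond=None)    # OLS solution as start values
b_lad = minimize(lad_loss, b0, method="Nelder-Mead",
                 options={"xatol": 1e-8, "fatol": 1e-8, "maxiter": 20_000}).x
```

LAD is exactly the p = 1 special case of the Lp-norm estimators introduced later, which is why it fits naturally into the framework proposed in this paper.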
Dielman [17] proposed a Monte Carlo simulation to evaluate predictions, comparing the least absolute value method with regression equations estimated by least squares. In the presence of outliers, the least absolute value forecasts performed better than the forecasts obtained by the least squares method.
These results underline the importance of using least absolute value regression (or some other robust regression method) in the presence of outliers.
However, we can potentially do still better by providing a more general framework for combining forecasts through Lp-norm estimators. As we will see in more detail, the Lp-norm is generic, whereas LAD and Least Squares are special cases. When forecasts are collinear, OLS is no longer the best way of estimating the weights, since it becomes less efficient as its variance increases. Another noticeable limitation occurs when the density function of the forecasts is not Gaussian (e.g. volatility forecasts in financial markets), since in that case a different estimator gains in efficiency.
In this paper we propose a forecast combination method based on the Lp-norm estimator, where the minimization of the residuals is carried out according to the estimated kurtosis of the data and the selection of the more relevant forecasts is achieved through procedures proposed in the literature. Among them, the Lp-med method is a recent proposal to estimate p [23].
If the errors follow a Gaussian distribution, the forecast obtained with a linear combination using Lp-norm parameters (the L2 norm in this case) gives satisfying results: the mean square error is only slightly higher than the MSE using the OLS parameters, but lower than that of the other methods (constrained least squares, least absolute deviation, generalized least squares, etc.).
Lp-norm estimators are useful generalizations of Ordinary Least Squares estimators, based on the Generalized Error Distribution (GED) [21].
Below, we introduce a table in which we provide a summary of the state of the art about the best-known methods of forecast combination and the main authors to whom they refer (Table 1).

Combine or not combine: recent contributions
The search for new forecast combination methods is still very active. Even in recent years new methods have been introduced and older methods tested, such as the forecast combination approach for the M4 competition given by Jaganathan and Prakash [31].
Atiya [3] has provided a brief analysis of the reasons why forecast combinations are successful. There are many other insightful works in the literature on this important topic, considering several different aspects such as the effects of serial correlation, heteroskedasticity, structural breaks and estimation error in the combination weights.
Xiao et al. [54] stated that combination models perform better than individual forecast models. The most common procedure for combining forecasts is to assign a weight to each model based on its past forecast performance and then aggregate the weighted forecasts. At the same time, there are other approaches, for example methodologies based on intelligent optimization algorithms that select or optimize the parameters of statistical models in the training phase. Another class of combined models processes the time series by decomposing it into more stationary and regular sub-series that are easier to identify and handle; the aim is to apply an adequate forecast model to each decomposed series and, finally, to aggregate the related forecasts. Finally, what we will call hybrid models are also combined models, in which the forecast of the series takes into account the forecast of the residuals produced by the main model applied to the original series. Wang et al. [51], in their study, forecast the realized volatility of the S&P 500 index using the heterogeneous autoregressive model for realized volatility (HAR-RV) and its various extensions.
The models under consideration take into account the time-varying property of the models' parameters and the volatility of realized volatility. A dynamic model averaging (DMA) approach is used to combine the forecasts of the individual models. The empirical results of their paper suggest that DMA can generate more accurate forecasts than the individual models in both statistical and economic senses, and that models with time-varying parameters achieve greater forecasting accuracy than models with constant coefficients.
Zhang et al. [56], to improve forecasting accuracy, proposed a new hybrid model based on improved empirical mode decomposition (IEMD), the autoregressive integrated moving average (ARIMA) and a wavelet neural network (WNN) optimized by the fruit fly optimization algorithm (FOA), and compared it with some other models. Zhang et al. [57] employed two prevailing shrinkage methods, the elastic net and the lasso, in the prediction of the oil price. Their findings suggest that the elastic net and the lasso exhibit significantly better out-of-sample forecasting performance than not only the individual HAR-RV-type models but also the combination approaches; they also exhibit higher directional accuracy and yield sizeable economic gains for asset allocation. Liang et al. [34] investigated which uncertainty indices have predictive power for the oil price by employing the standard predictive regression framework, various model combinations and two prevailing model shrinkage methods (elastic net and lasso); they find that the two shrinkage methods outperform the other individual and combination models in forecasting crude oil market volatilities.
Forecast combinations, while appealing in theory, are at a disadvantage with respect to a single forecast model because they introduce parameter estimation error whenever the combination weights need to be estimated. Finite-sample errors in the estimates of the combination weights can lead to poor performance of combination schemes that dominate in large samples. Is it better to combine the forecasts, or rather simply to attempt to identify the single best forecasting model? When the information sets are unobserved, combining forecasts is often justified, provided that the private (non-overlapping) parts of the information sets are sufficiently important. When forecast users do have access to the full information set used to construct the individual forecasts, combinations may be less justified. Finding a 'best' model may of course be rather difficult if the space of models included in the search is high-dimensional and the time series short [50].
In situations where the data are not very informative and it is not possible to identify a single dominant model, it makes sense to combine forecasts. The Schwarz Information Criterion (SIC) can be used as an alternative for choosing which subset of predictions to combine. Swanson and Zeng [48] observed that combination forecasts from a simple averaging approach MSE-dominate SIC combination forecasts less than 25% of the time in most cases, while other 'standard' combination approaches fare even worse. Estimation errors in the combination weights tend to be particularly large owing to difficulties in precisely estimating the covariance matrix of the forecast errors. One answer to this problem is simply to ignore the correlations across forecast errors. Stock and Watson [45] propose a broader set of combination weights that also ignore correlations between forecast errors, basing the weights on the models' relative MSE performance raised to various powers.
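The MSE-based weighting scheme attributed to Stock and Watson [45] can be illustrated with a short sketch; the past MSE values and the exponent k below are hypothetical illustrations, not values from the paper:

```python
import numpy as np

# Hypothetical past MSE performance of three forecasting models.
past_mse = np.array([0.8, 1.0, 1.6])

def inverse_mse_weights(mse, k=1.0):
    """Weights proportional to (1/MSE_i)^k, ignoring error correlations.

    k = 0 recovers the simple average; larger k tilts the weights more
    aggressively toward the historically best-performing model."""
    w = (1.0 / mse) ** k
    return w / w.sum()

w0 = inverse_mse_weights(past_mse, k=0)   # equal weights (simple average)
w1 = inverse_mse_weights(past_mse, k=1)   # inverse-MSE weights
w2 = inverse_mse_weights(past_mse, k=2)   # more aggressive tilt
```

Because no covariance matrix is estimated, this scheme avoids exactly the estimation-error problem discussed above, at the cost of ignoring any correlation between the models' errors.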
In conclusion, although the results reported in the previous paragraphs always show that the forecast combination has a smaller MSE than the single forecasts, it is necessary to evaluate properly in which situations it is convenient to combine the single predictions and in which it is not (Table 2). For instance, in hybrid models a forecast model is also built for the residuals generated by the main forecast model applied to the time series: such models ensure a reduction in systematic error, and therefore high accuracy, but they are inefficient in terms of computation time.

The proposed methodology
The Generalized Error Distribution (GED), also known as the Exponential Power Function (EPF), is a family of probability distributions proposed by Subbotin [49] and Mineo [38] that generalizes the Gaussian distribution. An asymmetrical version of the Gaussian distribution has also been proposed in the literature, see among others [10,11]. The GED constitutes a valid generalization of the Gaussian distribution, which is usually assumed even though, depending on the available data, it is not always fully supportable.
It constitutes a parametric alternative to previous robust methods [39]. The density function of the GED is

f(x) = [2 σ_p p^(1/p) Γ(1 + 1/p)]⁻¹ exp(−|x − μ_p|^p / (p σ_p^p)),   x ∈ ℝ,

where μ_p ∈ (−∞, +∞) is the location parameter, σ_p is the scale parameter, which is positive, i.e. σ_p ∈ (0, +∞), and the shape parameter p ∈ (0, +∞) is a measure of the fatness of the tails [58].
Considering the Pearson kurtosis index β₂ = Γ(1/p)Γ(5/p)/Γ(3/p)², we distinguish leptokurtic distributions (p < 2, β₂ > 3), the Gaussian case (p = 2, β₂ = 3) and platykurtic distributions (p > 2, β₂ < 3). For specific values of p we recover known distributions: p = 1 gives the Laplace distribution, p = 2 the Gaussian and p → +∞ the uniform. Considering a sample of n observed data (x_i, y_i), a general linear regression model is (Fig. 1)

y_i = g(x_i) + ε_i,   i = 1, …, n,

with g(·) a linear function. Lp-norm estimators are useful generalizations of ordinary least squares estimators, obtained by replacing the exponent 2 with a general exponent p; similarly, we obtain the LAD (Least Absolute Deviation) estimator with p = 1. Therefore, they minimize the sum of the p-th power of the absolute deviations of the observed points from the regression function [23]:

min Σ_{i=1}^{n} |y_i − g(x_i)|^p.

Under the regular assumptions, the log-likelihood associated with the sample is

ℓ = −n ln[2 σ_p p^(1/p) Γ(1 + 1/p)] − (1/(p σ_p^p)) Σ_{i=1}^{n} |y_i − g(x_i)|^p.

This shows that the maximum likelihood estimators (for a model with m parameters) are equivalent to the Lp-norm estimators if the value of p is known. If p is unknown, we face the problem of estimating it.
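A minimal Python sketch of the two-step idea implied above (estimate p by inverting the kurtosis relation, then minimize the Lp loss) follows; the moment-based estimator of p and the numerical minimizer are illustrative choices, not necessarily those of [23]:

```python
import numpy as np
from scipy.optimize import brentq, minimize
from scipy.special import gamma
from scipy.stats import gennorm, kurtosis

# For the GED, the Pearson kurtosis depends only on the shape p:
#   beta2(p) = Gamma(1/p) * Gamma(5/p) / Gamma(3/p)**2,  with beta2(2) = 3.
beta2 = lambda p: gamma(1 / p) * gamma(5 / p) / gamma(3 / p) ** 2

def estimate_p(x):
    """Moment estimate of the GED shape: invert beta2(p) = sample kurtosis."""
    b2 = kurtosis(x, fisher=False)
    return brentq(lambda p: beta2(p) - b2, 0.4, 20.0)

def lp_norm_fit(X, y, p):
    """Lp-norm regression: minimize sum |y - Xb|^p (p=2 -> OLS, p=1 -> LAD)."""
    b0, *_ = np.linalg.lstsq(X, y, rcond=None)        # OLS start values
    loss = lambda b: np.sum(np.abs(y - X @ b) ** p)
    return minimize(loss, b0, method="Nelder-Mead",
                    options={"maxiter": 20_000, "fatol": 1e-10}).x

# Illustrative data: linear model with GED (shape 1.5) errors.
rng = np.random.default_rng(3)
x = rng.standard_normal(600)
eps = gennorm.rvs(1.5, size=600, random_state=rng)    # shape parameter 1.5
y = 1.0 + 2.0 * x + eps
X = np.column_stack([np.ones_like(x), x])

p_hat = estimate_p(y - X @ lp_norm_fit(X, y, 2.0))    # kurtosis of OLS residuals
b_lp = lp_norm_fit(X, y, p_hat)
```

With p known, minimizing the Lp loss coincides with maximizing the GED log-likelihood over the regression coefficients, which is the equivalence stated above.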
The GED assumption has been applied with success also in forecasting literature. A clear example is represented by the GED-GARCH model for predicting volatility [32].
Indeed, since the introduction of the Generalized Autoregressive Conditional Heteroskedastic (GARCH) model of Bollerslev [9], thousands of articles applying the model to financial series have been published. Because many observed phenomena do not exhibit a Gaussian law, we extend the classical econometric models to the case of a non-Gaussian distribution. As the extension, we propose to use the Generalized Error Distribution, which in the literature has often revealed itself to be one of the most adequate for the examined time series.
In this paper we illustrate the main properties of the considered models and present testing and estimation procedures. We illustrate the theoretical results with real financial data analysis and simulation studies.
The Normal distribution is the usual assumption in time series estimation but, because the distribution of a GARCH process is leptokurtic, the Normal distribution was found to be inappropriate for capturing the tail behavior of the series [55].
Even in the classical definition it is assumed that the standardized residuals z_t of ARCH and GARCH models are Gaussian random variables; in practice, it turns out that this assumption corresponds only weakly to reality. An example of the GED distribution in forecasting is the GED-GARCH model.
Since the Gaussian distribution is a special case of the GED, the standard GARCH model is a particular case of the GED-GARCH. In other words, a GED-GARCH model takes the following form for the mean equation:

y_t = u_t + ε_t,   ε_t = σ_t z_t,   z_t ∼ GED(p),

where u_t is the conditional mean, which can also be expressed as u_t = E(y_t | I_{t−1}). The conditional variance, on the other hand, can be expressed as

σ_t² = ω + Σ_{i=1}^{q} α_i ε²_{t−i} + Σ_{j=1}^{p} β_j σ²_{t−j},

where, as before, the GED location parameter μ_p ∈ (−∞, +∞), the scale parameter σ_p is positive, hence σ_p ∈ (0, +∞), the shape parameter p, a measure of the fatness of the tails, is in (0, +∞), and x ∈ ℝ.
However, the proposed parameterization is widely used, especially in GARCH models, and we will develop the calculation of the parameters of the volatility models on that basis, assuming that z_t follows a Generalized Error Distribution.
In fact, the parameterization proposed by Forbes et al. [19], after a reparameterization that sets the scale parameter equal to 1, involves a shape parameter (the equivalent of p) and a scale parameter (the equivalent of σ_p). Equation (11) is the parameterization from which we start to estimate the parameters of the GED-GARCH model. The algorithm used for the real data application runs as follows. Given a time series y_t, we consider the following steps:
Step 0: for each time series y_t in the sample, we compute its descriptive statistics;
Step 1: we assess the stationarity of the series by means of the Dickey-Fuller and KPSS unit root tests and take the first difference if the hypothesis of a unit root is not rejected;
Step 2: we divide the sample into a training set and a testing set;
Step 3: we fit the following statistical models on the training set: the ARMA(p,q) model, the Gaussian GARCH(p,q), the Student's t GARCH(p,q) and the GED GARCH(p,q);
Step 4: we compute the forecasts from these single models and calculate the Mean Squared Forecast Error (MSFE);
Step 5: we draw the Q-Q plots for the Gaussian GARCH and the GED GARCH and apply the Jarque-Bera test to both in order to evaluate the Gaussianity of the data distribution.
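The conditional-variance recursion at the heart of Step 3 can be sketched in a few lines. The Python sketch below (the paper's estimation is done in R) simulates a GED-GARCH(1,1) path and then recovers the conditional variances with the standard GARCH(1,1) filter; the parameter values are purely illustrative:

```python
import numpy as np
from scipy.stats import gennorm

def garch11_filter(eps, omega, alpha, beta):
    """GARCH(1,1) recursion sigma2_t = omega + alpha*eps_{t-1}^2
    + beta*sigma2_{t-1}, initialized at the unconditional variance."""
    sigma2 = np.empty_like(eps)
    sigma2[0] = omega / (1.0 - alpha - beta)     # unconditional variance
    for t in range(1, len(eps)):
        sigma2[t] = omega + alpha * eps[t - 1] ** 2 + beta * sigma2[t - 1]
    return sigma2

# Simulate a GED-GARCH(1,1) path with illustrative parameters.
rng = np.random.default_rng(4)
omega, alpha, beta, shape = 0.05, 0.08, 0.90, 1.5
n = 2000
z = gennorm.rvs(shape, size=n, random_state=rng)  # GED innovations
z /= z.std()                                      # standardize to unit variance

eps = np.empty(n)
sig2 = np.empty(n)
sigma2_t = omega / (1.0 - alpha - beta)
for t in range(n):
    sig2[t] = sigma2_t
    eps[t] = np.sqrt(sigma2_t) * z[t]
    sigma2_t = omega + alpha * eps[t] ** 2 + beta * sigma2_t

sigma2_filtered = garch11_filter(eps, omega, alpha, beta)
```

Estimating the model then amounts to maximizing the GED log-likelihood of the residuals over (ω, α, β, p), with the filter above evaluated at each candidate parameter vector.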

A comparative simulation study
Here is a short outline of what we are going to analyze in this section. We made twenty-four simulations: twelve with an AR(1) process and homoskedastic innovations and twelve with the same process but heteroskedastic innovations. In both cases we consider three different shape parameters of the GED distribution (p = 1.5 for leptokurtic distributions, p = 2 for the Gaussian case, p = 2.5 for platykurtic distributions) and, for each of them, four different autoregressive parameters (φ = 0.2, low; φ = 0.4, medium-low; φ = 0.6, medium-high; φ = 0.8, high). We aim to show that, in the presence of non-Gaussian data, combining forecasts with the Lp-norm instead of OLS is a good choice, since it is always better than LAD and CLS and sometimes better than OLS. Hence, we first simulated 1000 historical series of 200 observations each from an autoregressive process of order 1, i.e. an AR(1) with non-Gaussian innovations ε_t and autoregressive parameter φ. We assumed a Generalized Error Distribution (GED) with mean 0, scale parameter 1 and true shape parameter p. Then we split the overall sample into a training set (y_t^TRAIN) and a testing set (y_t^TEST) [22]. The mean squared forecast errors are computed as the average over the 1000 replications [12]. Tables 3, 4 and 5 illustrate the case of homoskedastic innovations. As we can see from Table 3, for the case p = 1.5 (leptokurtic case) with homoskedastic innovations, we estimated the models ARMA(1,1), GARCH(1,1), t-GARCH(1,1) and GED-GARCH(1,1).
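One replication of the simulation design just described can be sketched as follows (Python rather than the paper's R; the burn-in length and the 150/50 train/test split are assumptions for illustration, since the exact split follows [22]):

```python
import numpy as np
from scipy.stats import gennorm

def simulate_ar1_ged(n, phi, shape, burn=200, rng=None):
    """AR(1) path y_t = phi*y_{t-1} + e_t with GED(shape) innovations
    (location 0, scale 1), discarding an initial burn-in period."""
    if rng is None:
        rng = np.random.default_rng()
    e = gennorm.rvs(shape, size=n + burn, random_state=rng)
    y = np.empty(n + burn)
    y[0] = e[0]
    for t in range(1, n + burn):
        y[t] = phi * y[t - 1] + e[t]
    return y[burn:]

rng = np.random.default_rng(5)
# One replication of the design in the text: 200 observations,
# phi in {0.2, 0.4, 0.6, 0.8} and shape p in {1.5, 2, 2.5}.
y = simulate_ar1_ged(n=200, phi=0.6, shape=1.5, rng=rng)
y_train, y_test = y[:150], y[150:]        # illustrative 75/25 split
```

In the full experiment this generator would be called 1000 times per (φ, p) cell, with each replication fitted, forecast and combined, and the MSFEs averaged across replications.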
From the analysis of Table 3, we can see that the best model (i.e. the model with the lowest MSFE) is the GED-GARCH(1,1), because it has the lowest MSFE both for φ = 0.4 and for φ = 0.8. Moreover, we combined the forecasts with the OLS, CLS, LAD and Lp-norm combination methods. We know that combining forecasts increases accuracy and, as proof of this, the MSFEs of the single forecasts are always greater than the MSFEs of the forecast combinations [13]. The best combination method is the Lp-norm; indeed, it has the smallest MSFE for all the considered φ parameters.
In Table 4 we show the case p = 2 (Gaussianity) with homoskedastic errors, where we estimated the same models as before. In this case the ARMA(1,1) is the best model, because it has the lowest loss function value (MSFE) for φ = 0.4 and for φ = 0.8. We know that when the shape parameter p is equal to 2, the GED distribution is equivalent to the Gaussian distribution. The most accurate combination method is again the Lp-norm but, as we can see, its MSFEs are very close to the forecast combination values obtained by the OLS method.
From Table 5, we can observe that the best forecasting model is the GARCH(1,1) in the cases p = 2.5 with φ = 0.2 and φ = 0.4; for φ = 0.6 the best model is the t-GARCH(1,1), whilst for φ = 0.8 the best forecasting model is the ARMA(1,1). Also in the case p = 2.5 (platykurtic case), the Lp-norm provides the most accurate forecast combination for all values of φ.
In the following Tables 6, 7 and 8 we illustrate the case of heteroskedastic innovations. In Table 6, for the case p = 1.5 (leptokurtic case) with heteroskedastic innovations, we estimated again the models ARMA(1,1), GARCH(1,1), t-GARCH(1,1) and GED-GARCH(1,1). From this table we can see that the best model (i.e. the model with the lowest MSFE) is the t-GARCH(1,1), because it has the lowest MSFE both for φ = 0.4 and for φ = 0.8.
Again, we combined the forecasts with the OLS, CLS, LAD and Lp-norm combination methods; combining forecasts increases accuracy. The most interesting finding is that the Lp-norm method is no longer always the best forecast combination: it is the best for φ = 0.2 and φ = 0.8, but not for φ = 0.4 and φ = 0.6. In Table 7 we show the case p = 2 (Gaussianity) with heteroskedastic errors, where we estimated the same models as before: ARMA(1,1), GARCH(1,1), t-GARCH(1,1) and GED-GARCH(1,1).
In this case the best model is the GARCH(1,1), because it has the lowest MSFE for φ = 0.2, φ = 0.4 and φ = 0.6. It is well known that when the shape parameter p is equal to 2, the GED distribution is equivalent to the Gaussian distribution. The most accurate forecast combination method is the Lp-norm for the cases φ = 0.6 and φ = 0.8 (medium-large φ); on the contrary, for φ = 0.2 and φ = 0.4 (medium-low values of φ) the best forecast combination is the OLS. From Table 8, which illustrates the case p = 2.5 (platykurtic case), we can observe that the best forecasting model is the GARCH(1,1) in three cases out of four (φ = 0.2, φ = 0.4 and φ = 0.6).
For the case φ = 0.8 the best forecasting model is the ARMA(1,1). In two of the four cases considered, the Lp-norm provides the most accurate forecast combination; in particular, the good performance of the Lp method occurs for the central values of the φ parameter (φ = 0.4 and φ = 0.6), while the OLS method furnishes the best forecast combination for the extreme values of φ (φ = 0.2 and φ = 0.8). Hence, we can conclude that in the case of homoskedastic innovations the Lp-norm method always provides the most accurate forecasts across the different values of p (1.5, 2, 2.5) and of φ (0.2, 0.4, 0.6, 0.8). The situation with heteroskedastic innovations is completely different: in these cases the Lp-norm is better than CLS and LAD, but it is not always better than OLS, which sometimes performs better than the Lp-norm method. Moreover, the difference in accuracy between the Lp-norm and OLS is very small, so in practice it matters little.

Real data application
For the real data application, we analyzed 26 time series: one was examined individually, while the other 25 were used to reinforce and generalize what we observed from the study of the single series.

Historical series of Bitcoin
Cerqueti et al. [11] recently proposed a new and comprehensive study of the cryptocurrency market. To this aim, they considered non-Gaussian GARCH volatility models, which form a class of stochastic recursive systems commonly adopted for financial predictions. Results have shown that the best specification and forecasting accuracy are achieved under the Skewed Generalized Error Distribution when the Bitcoin/USD and Litecoin/USD exchange rates are considered, while the best performances are obtained with Skewed Distributions in the case of the Ethereum/USD exchange rate. Taking this research as a reference point, we used a historical daily financial series of the BTC/USD exchange rate from Bitfinex, ranging from March 1st, 2014 to February 28th, 2020. Bitfinex is one of the leading cryptocurrency trading websites.
Bitcoin is not classified as a currency, but rather as a highly volatile means of transacting, because its value is not set by a central bank but only by supply and demand (1 bitcoin was worth 300 EUR in 2012, it peaked at 14,000 EUR in 2018 and at the time of writing it was worth 10,000 EUR).
We took that historical series from investing.com, a financial portal that tracks the financial performance of companies around the world. Some of the most recent and interesting research on the potential of forecast combinations to improve accuracy, and on their real applications, has been provided by Liu et al. [35] and Shaub [44].
The volatility models for cryptocurrencies have not been fully developed yet and, most of all, the previous literature on this subject [8] has focused on finding the best GARCH model while ignoring the assumption about the error distribution, even though the non-normality and asymmetry of the distribution of Bitcoin's returns are well known to researchers [8]. Our aim, with this application, is to verify that the combination of forecasts generates more accurate results (specifically, that the Lp-norm combination method outperforms the others in situations of multicollinearity and in the presence of a non-Gaussian distribution) and to verify the good performance of the GED distribution. To reach these targets we used the software R.
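The GED mentioned above nests the Gaussian as a special case. As an illustration, a small pure-Python function (the function name is ours; the parametrization is the standard standardized-GED density, not code from the paper) shows how the shape parameter p controls the peakedness:

```python
import math

def ged_pdf(x, p):
    """Density of the standardized Generalized Error Distribution with
    shape parameter p: p = 2 gives the standard normal, p < 2 is
    leptokurtic (fatter tails), p > 2 is platykurtic (thinner tails)."""
    lam = math.sqrt(2.0 ** (-2.0 / p) * math.gamma(1.0 / p) / math.gamma(3.0 / p))
    return (p * math.exp(-0.5 * abs(x / lam) ** p)
            / (lam * 2.0 ** (1.0 + 1.0 / p) * math.gamma(1.0 / p)))
```

For instance, at x = 0 the leptokurtic case p = 1.5 yields a higher peak than the Gaussian case p = 2, consistent with the fat-tailed behaviour of Bitcoin returns.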
We start our application by showing the descriptive statistics (Table 9) and the stationary logarithmic historical series (Fig. 2).
Firstly, we verified that the series is stationary with the Dickey-Fuller test and the KPSS test; as expected, the historical series of returns is stationary (Fig. 2). Then, after splitting the historical series into a training set (80% of the data) and a testing set (20% of the data), we estimated the best ARIMA model for the training set, which is the ARIMA (2,0,0) with coefficients AR1 = − 0.0302 and AR2 = − 0.0139.
Subsequently, we fitted the training set with three GARCH (1,1) models: one with the Gaussian distribution (with coefficients of the fitted model: AR1 = − 0.45 and MA1 = 0.526), one with the Student's t distribution (with coefficients: AR1 = 0.342 and MA1 = − 0.413) and one with the GED (with coefficients: AR1 = 0.131 and MA1 = − 0.23). We thus obtained forecasts from four separate data generating processes: we computed 543 forecasts (the length of the testing set) from each model fitted on the training set and then compared them, through the MSFE, with the elements of the testing set.
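The 80/20 chronological split and the MSFE-based model comparison described above can be sketched as follows (function names are ours; this is a simplified illustration, not the actual R code used in the paper):

```python
def train_test_split_ts(series, train_frac=0.8):
    """Chronological split: the first train_frac of the sample is the
    training set, the remainder is the (pseudo out-of-sample) testing set."""
    cut = int(len(series) * train_frac)
    return series[:cut], series[cut:]

def msfe(actual, forecast):
    """Mean Square Forecast Error over the testing set."""
    return sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)

def best_model(actual, forecasts_by_model):
    """Return the name of the model with the smallest MSFE,
    mirroring the comparison reported in Table 10."""
    return min(forecasts_by_model,
               key=lambda m: msfe(actual, forecasts_by_model[m]))
```

With the Bitcoin series, the testing set produced this way contains the last 543 observations, one per out-of-sample forecast.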
The best forecast is given by the GED-GARCH (1,1), which has the smallest MSFE among the individual models (Table 10).
We performed the Jarque-Bera test on the residuals of the GED-GARCH (1,1): the distribution of the residuals is not Gaussian, since the p value is very low (2.2e−16).
As we can see from the Q-Q plots and from the results of the Jarque-Bera test (whose null hypothesis H0 is Gaussianity), our sample is not Gaussian. This is why, among the individual models, the GED-GARCH (1,1) is the best for forecasting.
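The Jarque-Bera statistic used above can be computed directly from the sample moments; a minimal pure-Python sketch (the function name is ours):

```python
def jarque_bera(x):
    """Jarque-Bera statistic: n/6 * (S^2 + (K - 3)^2 / 4), where S is the
    sample skewness and K the sample kurtosis; under the null hypothesis
    of Gaussianity it is asymptotically chi-squared with 2 degrees of
    freedom, so large values reject normality."""
    n = len(x)
    m = sum(x) / n
    m2 = sum((v - m) ** 2 for v in x) / n   # central moments
    m3 = sum((v - m) ** 3 for v in x) / n
    m4 = sum((v - m) ** 4 for v in x) / n
    s = m3 / m2 ** 1.5                      # skewness
    k = m4 / m2 ** 2                        # kurtosis (3 for a Gaussian)
    return n / 6.0 * (s ** 2 + (k - 3.0) ** 2 / 4.0)
```

A p value as small as the one reported above corresponds to a very large value of this statistic relative to the chi-squared(2) distribution.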
The last thing we needed to prove was that the Lp-norm is the best combination method in situations of multicollinearity and non-Gaussian distribution. We combined the four models with the OLS, the CLS, the LAD and an Lp-norm combination. By doing that, we confirmed that the MSFE of each combination was smaller than the MSFE of the individual models (Table 11).
As we can see, every MSFE is smaller than the MSFE of the individual models, so we can assert that the combination of forecasts generates more accurate results. The best combination methods, in this case, are the OLS and the Lp-norm: their MSFEs differ by only 1 × 10−8, so we can consider the two forecast combination schemes equivalent. As proof of this, we can see the complete similarity of Figs. 3 and 4, which represent the two analyzed Q-Q plots. For this reason, we regard the Lp-norm method as more general than the OLS (giving approximately the same results), and we decided to compare it with the CLS and the LAD considering 25 further historical series.
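The near-coincidence of the OLS and Lp-norm results is expected: with p = 2 the Lp-norm criterion reduces exactly to the least-squares criterion. A minimal sketch of the OLS combination weights for two forecasts without intercept (a simplification of the four-model combination actually used; the function name is ours):

```python
def ols_combination_weights(y, f1, f2):
    """OLS combination weights for two forecasts (no intercept), minimising
    sum (y_t - w1*f1_t - w2*f2_t)^2 via the normal equations; this is
    exactly the Lp-norm criterion with p = 2."""
    a11 = sum(x * x for x in f1)
    a12 = sum(x1 * x2 for x1, x2 in zip(f1, f2))
    a22 = sum(x * x for x in f2)
    b1 = sum(x * yt for x, yt in zip(f1, y))
    b2 = sum(x * yt for x, yt in zip(f2, y))
    det = a11 * a22 - a12 * a12        # near zero under strong collinearity
    return (b1 * a22 - b2 * a12) / det, (b2 * a11 - b1 * a12) / det
```

Note that the determinant in the last step is where multicollinearity bites: when the competing forecasts are nearly collinear, the OLS weights become unstable, which is one motivation for the more robust Lp-norm scheme.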

Application to Dow Jones index
To verify the predictive quality of the proposed method, we considered 25 historical series (taken from Yahoo Finance) listed in the Dow Jones index, which measures the performance of 30 companies representing about a quarter of the value of the securities traded on the New York Stock Exchange (NYSE). We excluded from our study the series with missing values, so we used 25 historical series of 2516 observations each (from 2010/01/01 to 2020/01/01).
We could have considered more observations, but we chose to stop the historical series at the first day of 2020 because the Covid-19 global pandemic caused severe market instability; otherwise the forecasts would have been meaningless. In Fig. 5 we show the historical time series with logarithmic standardization and the Q-Q plots of the time series returns. Looking at the Q-Q plots, we can see that the behavior of the series is very close to a Gaussian distribution: the curves lie close to the Gaussian reference lines, and consequently our distributions (at least in the 25 cases considered) seem similar to normal distributions. Table 12 reports the descriptive statistics. In what follows we provide the results related to the forecast accuracy of the different regression-based combination approaches.
As we can see in Table 13, the ARMA (1,1) is the best model for most of the historical series, since it has the smallest MSFE in 15 cases. Among the remaining series, the GARCH (1,1) is the best model seven times and the GED-GARCH (1,1) only three times.
It is known that combining forecasts increases the accuracy; indeed, if we compare Table 13 with Table 14, we notice that the MSFEs of the single forecasts are always greater than the MSFEs of the combined forecasts.
The combination of the forecasts led to very satisfactory results: we investigated the potential of the proposed methodology on a large segment of the NYSE, obtaining forecasts with an MSFE (Table 14) smaller than both of the compared methodologies. In fact, the Lp-norm estimators always make better predictions than the CLS and the LAD, providing the most accurate results for all of the considered stock return series.

Concluding remarks
This paper started with the aim of evaluating whether the method of combining forecasts with the Lp-norm constituted a valid alternative to the most common existing methods.
Through the analysis of the simulations, we found that the combination of forecasts with the Lp-norm generates an MSFE lower than the OLS, LAD and CLS methods in situations of strong multicollinearity and non-Gaussianity, especially in the presence of homoskedastic innovations.
In this case the Lp-norm provides the most accurate forecast with respect to the alternative regression-based combination methods.
In the case of heteroskedastic innovations, instead, the Lp-norm performs better for a highly persistent process (phi = 0.8) or for a poorly persistent process (phi = 0.2) in the case of leptokurtic data (p = 1.5) (see Table 6).
In the case of Gaussian data (p = 2), the Lp-norm seems to perform better for high levels of persistence (phi = 0.6 and phi = 0.8) (Table 7). Finally, concerning the platykurtic case (p = 2.5), the Lp-norm performs better for the central values of phi (phi = 0.4 and phi = 0.6) (Table 8).
On the other hand, by analysing the empirical data, the MSFE obtained using the Lp-norm combination confirmed the simulation results: it was lower than the one obtained with the LAD and CLS methods, while, comparing it with the MSFE of the OLS combination (historical series of Bitcoin), we obtained an almost infinitesimal difference (Table 15).
The reason for this small difference lies in the sample size: by the central limit theorem there is a tendency towards Gaussianity, which is consistent with the simulation results obtained in the presence of a normal distribution. Looking at the 25 historical series relating to companies included in the Dow Jones, we can observe a strong tendency towards Gaussianity (confirmed by the Q-Q plot analysis), which explains the identical behaviour of the OLS and Lp-norm forecast combinations.
In conclusion, the method based on Lp-norm estimators is a valid forecast combination method since, compared to the other ones (LAD and CLS), it generates a lower MSFE.