1 Introduction

Time series analysis and forecasting are popular research areas in statistics, with many important applications. Given a model or a method, a modeller faces the question of how to estimate its parameters. In the time series literature, among the various alternative estimators, estimation based on multiple-steps-ahead errors has been well studied by statisticians and econometricians. Many papers argue that using these estimators leads to an increase in forecast accuracy (Weiss and Andersen 1984; Chevillon and Hendry 2005; Taylor 2008; Chevillon 2009; Franses and Legerstee 2009; McElroy 2015; Chevillon 2016), especially when the model is misspecified (Proietti 2011; Xia and Tong 2011), which is often the case in practice. However, other research demonstrates that this finding may not hold universally and depends on the data characteristics and the degree of misspecification of the model (Kang 2003; Marcellino et al. 2006; Proietti 2011; McElroy and Wildi 2013). Regarding the statistical properties of parameters estimated with such methods, it has been shown that multi-step estimators are asymptotically efficient (Haywood and Tunnicliffe Wilson 1997; Ing 2003; Chevillon and Hendry 2005; Chevillon 2007), consistent, and Normal (Weiss 1991; Haywood and Tunnicliffe Wilson 1997). At the same time, Tiao and Xu (1993) showed that the parameter estimates become less efficient in finite samples, in comparison with the conventional one-step-ahead mean squared error estimator, and that the efficiency may decrease even further as the forecast horizon increases.

There is also a general understanding that the use of multi-step estimators leads to more conservative and robust models (Cox 1961; Gersch and Kitagawa 1983; Tiao and Xu 1993; Marcellino et al. 2006; McElroy 2015). Still, there is no plausible and detailed explanation of why this happens, although many researchers imply that such an explanation is obvious. Several authors point out that there is a connection between multi-step estimators and the conventional Mean Squared Error (see, for example, Bhansali 1997; Haywood and Tunnicliffe Wilson 1997), but this fact is usually neglected in the discussion and not used in the analysis of the estimators. Kourentzes and Trapero (2018) attempt to provide some insights, demonstrating the effect of some multi-step estimators on the parameter values of exponential smoothing.

The distribution of multi-step forecast errors has been discussed in the context of multivariate models in Jordà and Marcellino (2010), Pesaran et al. (2010) and Martinez (2017). Furthermore, Clements and Hendry (1998, p. 77) and Martinez (2017) discuss the likelihood function based on a multi-step estimator and claim that, if the assumptions hold, the off-diagonal elements of the covariance matrix become zero.

The aim of this paper is to fill a gap in the field of multi-step estimators, providing a thorough explanation of what happens to any linear time series model when multi-step estimators are used. We do that using the single source of error (SSOE) state space approach, which is flexible enough to encompass both ARIMA and exponential smoothing, two popular univariate models among both researchers and practitioners. As we carefully elicit these explanations, we identify limitations, in response to which we propose a new multi-step estimator, the Geometric Trace Mean Squared Error.

The paper is organised as follows: in Sect. 2 we discuss the population properties of the existing multi-step estimators. We then discuss their properties in small samples in Sect. 3 and show what happens to the estimates of parameters when they are used. We introduce a general predictive likelihood (GPL) approach for multi-step estimators in Sect. 4, which encompasses all existing multi-step estimators. In Sect. 5 a simulation is carried out, demonstrating the properties of the discussed estimators in different circumstances and providing evidence of the identified shrinkage. Finally, an example of an application of multi-step estimators is given in Sect. 6, followed by concluding remarks.

2 Conventional multi-step estimators in population

2.1 The definitions and general properties

Before we discuss the estimators, we define a general form of a forecasting model under consideration. We focus on the additive error models, which can be written in the following general way:

$$\begin{aligned} y_t = \mu _t + \epsilon _t, \end{aligned}$$

where \(y_t\) is the actual value, \(\mu _t\) is the structure of the model (e.g., covariates, seasonality, dependencies), \(\epsilon _t\) is the error term (typically, normal i.i.d.) and t is an integer time index. For what follows we assume that such a general model can generate h-steps ahead point forecasts (conditional expectation) from the observation t denoted as \(\mu _{t+h|t}\).

In this section, we discuss the asymptotic properties of several popular multi-step estimators based on different loss functions. We start with one of the simplest and best-known loss functions based on the multiple-steps-ahead forecast error, the mean squared h-steps-ahead error:

$$\begin{aligned} \mathcal {MSE}_h(y_{t+h}) = \text {E} \left( \eta _{t+h|t}^2 \right) , \end{aligned}$$
(1)

where \(\eta _{t+h|t} = y_{t+h} - \mu _{t+h|t}\) is the conditional h steps ahead forecast error and \(y_{t+h}\) is the actual value at observation \(t+h\). For convenience, in this paper we will write \(\mathcal {MSE}_h\) instead of \(\mathcal {MSE}_h(y_{t+h})\), dropping the part in the brackets for the estimators from all loss functions, while noting that they are functions of the observed data \(y_{t+h}\).

The estimator resulting from the MSE loss function (1) is sometimes used to estimate a model several times, once for each horizon \(j=1, \ldots , h\) (Kang 2003; Chevillon and Hendry 2005; Pesaran et al. 2010), resulting in h different sets of parameter values. Such an estimator is sometimes called the “direct multi-step estimator” (DMS, Chevillon 2007). Naturally, the estimation process of DMS is more complex than optimising the model parameters only once, but it is reported to result in increased prediction accuracy. This happens due to the alignment of the forecast objective with the loss function, i.e., predictions h steps ahead.

Another popular loss function is based on the sum of all the \(\mathcal {MSE}_j\) for horizons from 1 to h calculated using (1), instead of using only one error on observation h (Weiss and Andersen 1984; Xia and Tong 2011). The resulting estimator, denoted as “Trace Mean Squared Error” (\(\mathcal {TMSE}\)), is:

$$\begin{aligned} \mathcal {TMSE} = \sum _{j=1}^h \text {E} \left( \eta _{t+j|t}^2 \right) . \end{aligned}$$
(2)

Using \(\mathcal {TMSE}\) simplifies the calculations in comparison to DMS, as the model is estimated only once instead of h times, with only marginal losses in the resulting accuracy (Kourentzes and Trapero 2018).

A further loss function that can be used is the “Mean Squared Cumulative Error” (\(\mathcal {MSCE}\)), motivated from how predictions may be used, for instance in inventory control (for example, Kourentzes et al. 2019; Trapero et al. 2019; Saoud et al. 2022, and references therein), where analysts may be interested in the cumulative demand of a product over a period of time, for instance to cover demand over the supply lead time, rather than in the values for each separate observation within the same period. We thus define \(\mathcal {MSCE}\) as follows.

Definition 2.1

For a series of forecast errors, \(\eta _{t+j|t}\) generated at time t, for forecast horizon \(j=1,\ldots ,h\), the mean squared cumulative error is:

$$\begin{aligned} \mathcal {MSCE} = \text {E} \left( \left( \sum _{j=1}^h \eta _{t+j|t} \right) ^2 \right) . \end{aligned}$$
(3)

It follows directly from (3) that \(\mathcal {MSCE}\) can be represented as a sum of products of forecast errors:

$$\begin{aligned} \mathcal {MSCE} = \text {E} \left( \sum _{j=1}^h \eta ^2_{t+j|t} + 2 \sum _{j=2}^h \sum _{i=1}^{j-1} \eta _{t+j|t} \eta _{t+i|t} \right) . \end{aligned}$$
(4)

When the right-most sum of errors is equal to zero, \(\mathcal {MSCE}\) reduces to \(\mathcal {TMSE}\). This is possible only when the respective multiple-steps-ahead forecast errors are not correlated with each other, a condition that is unrealistic in many cases, as we show later in this paper.

Note that the three loss functions (1), (2) and (3) produce estimators that are equal to the variances of multiple steps ahead forecast errors, given that the error term in the population has a zero mean:

$$\begin{aligned} \mathcal {MSE}_h&= \sigma ^2_h, \\ \mathcal {TMSE}&= \displaystyle {\sum _{j=1}^h \sigma ^2_j}, \\ \mathcal {MSCE}&= \displaystyle {\sum _{j=1}^h \sigma ^2_j + 2 \sum _{j=2}^h \sum _{i=1}^{j-1} \sigma _{i,j}}, \end{aligned}$$
(5)

where \(\sigma _{i,j}\) is the covariance between i and j steps ahead forecast errors and \(\sigma ^2_{j}\) is the variance of j-steps ahead forecast error. This can be shown easily for example for \(\mathcal {MSE}_h\) using (1):

$$\begin{aligned} \mathcal {MSE}_h = \text {E} \left( \eta _{t+h|t}^2 \right) = \text {E} \left( \eta _{t+h|t} - \text {E}(\eta _{t+h|t})\right) ^2 = \text {V}(\eta _{t+h|t}) = \sigma ^2_h, \end{aligned}$$

which holds when \(\text {E}(\eta _{t+h|t})=0\). This means that when the estimators are used in population, then they minimise the respective variances of multi-step errors. This property can be used to explain what specifically happens to time series model dynamics when these estimators are used.

2.2 State space approach

In order to give a simple and concise explanation of the effect of minimisation of these loss functions on parameter estimates and model dynamics, we use the SSOE state space framework (Snyder 1985; Ord et al. 1997):

$$\begin{aligned} \left\{ \begin{array}{ll} y_t = {\textbf{w}}' {\textbf{v}}_{t-1} + \varepsilon _t \\ {\textbf{v}}_t = {\textbf{F}} {\textbf{v}}_{t-1} + {\textbf{g}} \varepsilon _t, \end{array}\right. \end{aligned}$$
(6)

where \({\textbf{v}}_t\) is the state vector, \({\textbf{F}}\) is the transition matrix, \({\textbf{g}}\) is the persistence vector, \({\textbf{w}}\) is the measurement vector and \(\varepsilon _t \sim \text {i.i.d. }{\mathcal {N}}(0,\sigma _1^2)\). We focus on this approach because it encompasses the commonly used ARIMA and exponential smoothing models (ETS) as shown in Snyder (1985) and Hyndman et al. (2008). So, all the discussions about the state space model (6) can be transferred to both ETS and ARIMA.

Using (6), it can be shown that the actual value for some observation \(t+h\) (where \(h > 1\)) with predefined values of \({\textbf{v}}_t, {\textbf{F}}, {\textbf{w}} \text { and } {\textbf{g}}\) can be calculated as (Hyndman et al. 2008, p. 95):

$$\begin{aligned} y_{t+h} = {\textbf{w}}' {\textbf{F}}^{h-1} {\textbf{v}}_{t} + \sum _{j=1}^{h-1} c_{j} \varepsilon _{t+h-j} + \varepsilon _{t+h}, \end{aligned}$$
(7)

where

$$\begin{aligned} c_{j}= {\textbf{w}}'{\textbf{F}}^{j-1}{\textbf{g}} \end{aligned}$$
(8)

is a scalar that depends on the parameters of the model, which enter the measurement vector \({\textbf{w}}\), the transition matrix \({\textbf{F}}\) and the persistence vector \({\textbf{g}}\).

The expectation on observation \(t+h\), conditional on the information on the observation t is equal to:

$$\begin{aligned} \mu _{t+h | t} = {\textbf{w}}' {\textbf{F}}^{h-1} {\textbf{v}}_{t}. \end{aligned}$$
(9)

Substituting (9) in (7), we obtain the conditional h steps ahead forecast error:

$$\begin{aligned} \eta _{t+h|t} = y_{t+h} - \mu _{t+h | t} = \sum _{j=1}^{h-1} c_{j} \varepsilon _{t+j} + \varepsilon _{t+h}. \end{aligned}$$
(10)

Note that when \(h=1\), \(\eta _{t+1|t} = \varepsilon _{t+1}\). The formula (10) is essential for our analysis, as it shows how the forecast error is connected with the error term \(\varepsilon _t\) and with the parameters of the model.

The variance of the error term (10) can be calculated using:

$$\begin{aligned} \sigma ^2_h = \left\{ \begin{array}{ll} \sigma ^2_1 \left( 1 + \displaystyle {\sum _{j=1}^{h-1}} c_{j}^2 \right) & \text {when }h>1\\ \sigma ^2_1 & \text {when }h=1 \end{array} \right. , \end{aligned}$$
(11)

where \(\sigma ^2_h\) is the variance of h steps ahead forecast error. For the derivation see Hyndman et al. (2008, p.95). Another useful property, which we will use later in this paper, follows directly from the state space model (6) and the forecast error \(\eta _{t+h|t}\) (10).

Proposition 2.2

Assuming that the model errors, \(\varepsilon _t\), are independent and homoscedastic, the covariance \(\sigma _{i,j}\) between the \(i^{\text {th}}\) and \(j^{\text {th}}\) step ahead forecast errors can be decomposed as:

$$\begin{aligned} \sigma _{i,j} = \left\{ \begin{array}{ll} \sigma ^2_1 c_{|i-j|} & \text {when either } i=1 \text { or } j=1 \\ \sigma ^2_1 \left( c_{|i-j|} + \sum _{l=1}^{i-1} c_{|l-j|} c_{|i-l|} \right) & \text {otherwise} \end{array} \right. . \end{aligned}$$
(12)

Proof

See “Appendix A”. \(\square\)
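
To make these quantities concrete, the following sketch implements (8), (11) and (12) in R (the language used for the experiments in Sect. 5) for a generic SSOE model; the function names and the restriction to \(i < j\) in the covariance are our own conventions for illustration.

```r
# Sketch: c_j from (8), the h-step variance (11) and the covariance (12)
# for an SSOE model defined by w, Fmat, g and the one-step variance s2
c_j <- function(j, w, Fmat, g) {
  Fj <- diag(length(w))                       # F^(j-1), computed by repeated products
  if (j > 1) for (i in 1:(j - 1)) Fj <- Fj %*% Fmat
  as.numeric(t(w) %*% Fj %*% g)
}
var_h <- function(h, w, Fmat, g, s2) {        # equation (11)
  if (h == 1) return(s2)
  s2 * (1 + sum(sapply(1:(h - 1), c_j, w = w, Fmat = Fmat, g = g)^2))
}
cov_ij <- function(i, j, w, Fmat, g, s2) {    # equation (12), assuming i < j
  cj <- function(k) c_j(k, w, Fmat, g)
  if (i == 1) return(s2 * cj(j - 1))
  s2 * (cj(j - i) + sum(sapply(1:(i - 1), function(l) cj(j - l) * cj(i - l))))
}
```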

It becomes apparent at this stage that when the multi-step estimators are used in the case of state space models, the multi-step-ahead variance (11) is minimised, implying that \(c_j\) is minimised as well; thus the estimates of parameters of the model are directly influenced by the loss function, in a style similar to what regularisation techniques do. This is not new; it has been briefly discussed in specific model contexts in several papers (see, for example, Bhansali 1996, 1997; Haywood and Tunnicliffe Wilson 1997), but it has not been explored further. The behaviour of multi-step estimators should not differ from the behaviour of the conventional one-step-ahead one when the model is estimated in population, because of the Law of Large Numbers. However, the sample behaviour of these estimators may differ depending on the number of observations.

3 The sample behaviour of multi-step estimators

3.1 The sample estimation of state space models

When the same estimators discussed in Sect. 2.1 are used in sample, the expectations are substituted by the sample means:

$$\begin{aligned} \textrm{MSE}_h = \frac{1}{T-h} \sum _{t=1}^{T-h} e_{t+h|t}^2, \end{aligned}$$
(13)

where \(\textrm{MSE}_h\) is the sample counterpart of \(\mathcal {MSE}_h\), \(e_{t+h|t} = y_{t+h} - {\hat{\mu }}_{t+h|t}\) is the estimate of the conditional h steps ahead forecast error \(\eta _{t+h|t}\), \({\hat{\mu }}_{t+h|t}\) is the estimate of \(\mu _{t+h|t}\) and T is the number of observations in sample.

Similarly, the sample versions of other estimators can be written as:

$$\begin{aligned} \textrm{TMSE} = \frac{1}{T-h} \sum _{j=1}^h \sum _{t=1}^{T-h} e_{t+j|t}^2, \end{aligned}$$
(14)

and

$$\begin{aligned} \textrm{MSCE} = \frac{1}{T-h} \sum _{t=1}^{T-h} \left( \sum _{j=1}^h e_{t+j|t} \right) ^2. \end{aligned}$$
(15)
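
For clarity, here is a minimal sketch of these three sample losses, assuming that the 1 to h steps ahead in-sample errors have been collected in a matrix E with one row per forecast origin and one column per horizon:

```r
# Sketch: sample losses (13)-(15); E has T-h rows (forecast origins) and h columns (horizons)
MSEh <- function(E) mean(E[, ncol(E)]^2)   # (13): only the h-step-ahead column is used
TMSE <- function(E) sum(colMeans(E^2))     # (14): sum of the per-horizon mean squared errors
MSCE <- function(E) mean(rowSums(E)^2)     # (15): squared cumulative error for each origin
```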

Coming back to the state space model (6), which is now estimated on a sample of data, the estimate of the h steps ahead variance (11) is based on the estimates of \(c_j\) and \(\sigma _1^2\), so that the formula (11) is eventually substituted by:

$$\begin{aligned} s^2_h = \left\{ \begin{array}{ll} s^2_1 \left( 1 + \displaystyle {\sum _{j=1}^{h-1}} {\hat{c}}_{j}^2 \right) & \text {when }h>1 \\ s^2_1 & \text {when }h=1 \end{array} \right. , \end{aligned}$$
(16)

where \(s_j^2\) is an estimate of the j-steps-ahead variance \(\sigma _j^2\) and \({\hat{c}}_j\) is the estimate of \(c_j\). It is important to note that the minimisation of MSE\(_h\) in (13) implies the minimisation of the estimate of the variance (16), due to the connection between the estimators and the variance (5). This means that, as both the one-step-ahead variance and the squared values of parameters decrease, the term \({{\hat{c}}^2_{j}}\) inevitably moves towards zero. This causes shrinkage of \({\hat{c}}^2_{j}\), whose intensity increases with the forecast horizon h and decreases with the sample size T. This finding helps explain results in the literature so far. For example, Tiao and Xu (1993) find, using simulations, that MA parameters tend towards one when the MSE\(_h\) loss function is used (a detailed explanation of the shrinkage mechanism for ARIMA models is given in Sect. 3.2). This effect also explains the observed robustness of models estimated using MSE\(_h\) discussed in the literature.
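
This shrinkage is easy to reproduce numerically. The sketch below estimates the smoothing parameter of ETS(A,N,N) by minimising the in-sample MSE\(_h\), using a plain SES recursion with a crude state initialisation and an arbitrary seed; the exact numbers will vary, but the \(h=10\) estimate is typically pulled towards zero relative to the \(h=1\) one.

```r
# Sketch: shrinkage of the SES smoothing parameter under MSE_h.
# DGP: ETS(A,N,N) with alpha = 0.2; initialisation is deliberately simplistic.
set.seed(42)
n <- 100; alpha_true <- 0.2
eps <- rnorm(n); y <- numeric(n); l <- 10
for (t in 1:n) { y[t] <- l + eps[t]; l <- l + alpha_true * eps[t] }

mse_h_loss <- function(alpha, y, h) {
  n <- length(y)
  level <- numeric(n); level[1] <- y[1]
  for (t in 2:n) level[t] <- level[t - 1] + alpha * (y[t] - level[t - 1])
  mean((y[(1 + h):n] - level[1:(n - h)])^2)   # flat h-step forecast from each origin
}

c(h1  = optimize(mse_h_loss, c(0, 1), y = y, h = 1)$minimum,
  h10 = optimize(mse_h_loss, c(0, 1), y = y, h = 10)$minimum)
```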

A similar shrinkage happens when using TMSE, which is proportional to the sum of variances of multi-step forecast errors:

$$\begin{aligned} \textrm{TMSE} \propto \sum _{j=1}^h s^2_j&= s^2_1 + \sum _{j=2}^h s^2_1 \left( 1 + \displaystyle {\sum _{i=1}^{j-1}} {\hat{c}}_{|j-i|}^2 \right) \\&= s_1^2 \left( h + \sum _{j=2}^h \sum _{i=1}^{j-1} {\hat{c}}_{|j-i|}^2 \right) . \end{aligned}$$
(17)

The parameters in (17) interact with the variance directly, and the size of the sum on the right-hand side is at least \(h-1\) times greater than in the simpler case of (11). At the same time, the one-step-ahead variance in (17) is multiplied by h, which mitigates the shrinkage effect. Given that TMSE contains the sum of 1 to h steps ahead forecast errors, the short-term forecast errors weaken the shrinkage effect in TMSE in contrast with MSE\(_h\), especially when the forecast horizon h is large. This means that whenever MSE\(_h\) or TMSE is used, the estimates of parameters might be biased, and the amount of bias should be proportional to the forecast horizon h.

In order to analyse MSCE, the sample estimates of the variances (11) and covariances (12) can be used in (4):

$$\begin{aligned} \textrm{MSCE} \propto s_1^2 \left( h + \sum _{j=2}^h \sum _{i=1}^{j-1} {\hat{c}}_{|j-i|}^2 + 2 \sum _{j=2}^h {\hat{c}}_{|j-1|} + \right. \\ \left. 2 \sum _{j=3}^h \sum _{i=2}^{j-1} \left( {\hat{c}}_{|i-j|} + \sum _{l=1}^{i-1} {\hat{c}}_{|l-j|} {\hat{c}}_{|i-l|} \right) \right) . \end{aligned}$$
(18)

Analysing (18), the shrinkage effect is also observed in the case of MSCE, but it can be either mitigated or emphasised by the sum of covariances, depending on the values of parameters. The former may happen if some of the \({\hat{c}}_{j}\) become negative; the latter will happen if \({\hat{c}}_j\) is always positive. In that case \({\hat{c}}_j\) will be forced to become close to zero and the shrinkage effect will be even stronger in comparison with MSE\(_h\) and TMSE. The specific effect of the MSCE estimator depends on the data and the applied model. Some relevant examples will be discussed later in this section.

We should note at this stage that, given the structure (12), the covariances between the multiple-steps-ahead forecast errors will be equal to zero only when \({\hat{c}}_j=0\) for all j. In all other cases the assumption of zero correlation between the forecast errors is unrealistic (it is sometimes used in the literature, for example, in Martinez 2017, p. 11).

For the case of \(h=1\), all the loss functions discussed in this section result in equivalent estimators, and no shrinkage is imposed. Therefore, we restrict our discussion in this paper to \(h>1\).

In order to see what exactly happens with state space models when the different loss functions are minimised, we examine several cases. In particular, we discuss three model families: ARIMA, exponential smoothing, and regression.

3.2 ARIMA

Snyder (1985) demonstrated that any ARIMA model can be represented in a single source of error state space form. Using the positive sign convention for the MA polynomial:

$$\begin{aligned} \theta (B) = 1 + \theta _1 B + \theta _2 B^2 + \ldots , \end{aligned}$$

leads to the following components of the state space model:

$$\begin{aligned} {\textbf{w}} = \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix}, \quad {\textbf{F}} = \begin{pmatrix} \eta _1 & {\textbf{I}}_{k-1} \\ \vdots & \\ \eta _k & {\textbf{0}}'_{k-1} \end{pmatrix}, \quad {\textbf{g}} = \begin{pmatrix} \eta _1 + \theta _1 \\ \vdots \\ \eta _k + \theta _k \end{pmatrix}, \end{aligned}$$
(19)

where \(\eta _i\) is the \(i^{\text {th}}\) coefficient of the AR polynomial (defined similarly to Hyndman et al. 2008, p. 173) and \(\theta _i\) is the \(i^{\text {th}}\) moving average parameter. The analysis of (19) shows that, in general, AR parameters should shrink towards zero, as the transition matrix gets exponentiated with each step \(j=1,\ldots ,h\), while MA parameters should shrink towards the negative of the AR polynomial coefficients (recall (8)). Furthermore, with the increase of the forecasting horizon, the shrinkage of AR parameters becomes stronger than the shrinkage of MA parameters, because of the exponentiation of the transition matrix \({\textbf{F}}\) in (16).

3.2.1 Example with ARIMA(1,1,1)

Analysis of more specific models gives a better understanding of the shrinkage mechanism in ARIMA when any multi-step loss function is minimised. For example, ARIMA(1,1,1) can be represented in the state space form (6) where:

$$\begin{aligned} {\textbf{w}} = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, {\textbf{F}} = \begin{pmatrix} 1 + \phi _1 & 1 \\ -\phi _1 & 0 \end{pmatrix}, {\textbf{g}} = \begin{pmatrix} 1 + \phi _1 + \theta _1 \\ - \phi _1 \end{pmatrix}. \end{aligned}$$

The matrix \({\textbf{F}}\) is raised to the power of \(j-1\) when the h steps ahead variance is calculated. The transition matrix decomposition (see “Appendix B”) shows that the estimated \({\hat{c}}_{j}\) in this case is equal to:

$$\begin{aligned} {\hat{c}}_{j} = 1 + \left( {\hat{\phi }}_1 + {\hat{\theta }}_1\right) \sum _{i=1}^{j} {\hat{\phi }}_1^{i-1}. \end{aligned}$$
(20)
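
As a quick numerical check of (20) (a sketch with arbitrary illustrative parameter values), the closed form can be compared with the matrix product (8):

```r
# Sketch: the closed form (20) versus the matrix product (8) for ARIMA(1,1,1)
phi <- 0.5; theta <- -0.3; h <- 10
w <- c(1, 0)
Fmat <- matrix(c(1 + phi, -phi,   # first column of F
                 1,       0),     # second column of F
               nrow = 2)
g <- c(1 + phi + theta, -phi)

c_state <- sapply(1:h, function(j) {
  Fj <- diag(2)
  if (j > 1) for (i in 1:(j - 1)) Fj <- Fj %*% Fmat
  as.numeric(t(w) %*% Fj %*% g)
})
c_closed <- sapply(1:h, function(j) 1 + (phi + theta) * sum(phi^(0:(j - 1))))
all.equal(c_state, c_closed)   # TRUE up to numerical tolerance
```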

Substituting (20) into the formula of h steps ahead variance (16) results in the following:

$$\begin{aligned} s^2_h = s^2_1 \left( 1 + \displaystyle {\sum _{j=1}^{h-1}} \left( 1 + \left( {\hat{\phi }}_1 + {\hat{\theta }}_1\right) \sum _{i=1}^{j} {\hat{\phi }}_1^{i-1} \right) ^2 \right) . \end{aligned}$$
(21)

Now it can be concluded that the minimisation of (21) leads to the shrinkage of both AR and MA parameters: the sum of polynomials on the right-hand side of (21) shrinks towards zero, implying the shrinkage of \({\hat{\phi }}_1\) towards zero as well. The speed of shrinkage of \({\hat{\phi }}_1\) increases with the forecasting horizon, due to the increase in the number of elements in the sum of polynomials. In fact, the inner sum on the right-hand side of (21) is a geometric series, which gives:

$$\begin{aligned} s^2_h = s^2_1 \left( 1 + \displaystyle {\sum _{j=1}^{h-1}} \left( \frac{{\hat{\theta }}_1 (1-{\hat{\phi }}_1^j) + 1 - {\hat{\phi }}_1^{j+1}}{1-{\hat{\phi }}_1} \right) ^2 \right) . \end{aligned}$$
(22)

The minimisation of (22) implies that:

  • \({\hat{\phi }}_1 \rightarrow -1\);

  • \({\hat{\theta }}_1 (1-{\hat{\phi }}_1^j) + 1 - {\hat{\phi }}_1^{j+1} \rightarrow 0\).

The second condition implies that \({\hat{\theta }}_1 \rightarrow -\frac{1 - {\hat{\phi }}_1^{j+1}}{1-{\hat{\phi }}_1^j}\), which, given the first condition, leads to \({\hat{\theta }}_1 \rightarrow -1\).

Therefore, when multi-step loss functions are minimised, ARIMA(1,1,1) transforms into a restricted ARIMA(1,0,0) model. In order to see that, we need to analyse the compact form of ARIMA(1,1,1):

$$\begin{aligned} (1 -B) (1 -{\hat{\phi }}_1 B)y_t = (1 +{\hat{\theta }}_1 B) e_t. \end{aligned}$$
(23)

Given that, asymptotically with the increase of the forecast horizon, \({\hat{\phi }}_1 = -1\) and \({\hat{\theta }}_1 = -1\), the formula (23) transforms into:

$$\begin{aligned} (1 -B)(1 +B) y_t = (1 -B) e_t, \end{aligned}$$

which is equivalent to:

$$\begin{aligned} (1 + B) y_t = e_t, \end{aligned}$$

which is an ARIMA(1,0,0) model with the AR parameter fixed at \(-1\). This shows that when multi-step loss functions are minimised for the estimation of ARIMA(1,1,1), they force it to become a degenerate model.

3.3 Exponential smoothing

All the exponential smoothing models that we discuss here follow the Gardner (2006) taxonomy. We examine pure additive ETS models only, keeping in mind that the main findings can be extended to some of the multiplicative cases as well. Kourentzes and Trapero (2018) discuss the effect of MSE\(_h\) and TMSE for some exponential smoothing models, but mainly focus on their effects on parameter values and forecast accuracy. Here we provide a more exhaustive investigation, focusing on one special case and briefly discussing the main properties of the others.

Table 1 summarises the four pure additive ETS models, their \(c_j\) values and the h steps ahead conditional variances. The other two pure additive models, namely ETS(A,A,A) and ETS(A,Ad,A) can be considered as combinations of the models in the table and are not discussed separately.

Table 1 Parameters of ETS models. Here \(\alpha\) is the level smoothing parameter, \(\beta\) is the smoothing parameter for the trend component, \(\gamma\) is the seasonal smoothing parameter, \(\phi\) is the damping parameter, \(j_m = \lfloor \frac{j - 1}{m} \rfloor\) and m is the periodicity of the data

It can be seen from Table 1 that the minimisation of any multi-step loss function leads to the shrinkage of the smoothing parameters towards zero, but the strength and sequence of shrinkage differ from model to model. For example, \({\hat{\alpha }}\) in ETS(A,N,N) shrinks towards zero, with an increasing speed of shrinkage for longer forecast horizons. Therefore, minimising multi-step loss functions with long-term forecasts leads to models with a more uniform weight distribution across observations for ETS(A,N,N). This corresponds to a model with slower level changes that is less reactive to new information, as \({\hat{\alpha }}\) tends to zero. Asymptotically, with the increase of h, the model becomes a global level model.
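
For ETS(A,N,N) the SSOE components are scalars, so \(c_j = \alpha\) for all j and the h-step variance multiplier is \(1+(h-1)\alpha^2\) (consistent with the well-known SES result in Hyndman et al. 2008). A one-line sketch shows how the multiplier, and hence the loss, grows with \(\alpha\):

```r
# Sketch: h-step variance multiplier of ETS(A,N,N); smaller alpha => smaller multiplier
multiplier_ann <- function(alpha, h) 1 + (h - 1) * alpha^2
multiplier_ann(alpha = c(0.1, 0.3, 0.5), h = 10)   # 1.09 1.81 3.25
```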

Similarly, the shrinkage effect is preserved in ETS(A,A,N), but has a different form: both smoothing parameters shrink, but \({\hat{\beta }}\) shrinks faster than \({\hat{\alpha }}\) with the increase of h. This leads to more stable estimates of trend and allows capturing long term tendencies in time series. Asymptotically, with the increase of horizon the model becomes a deterministic trend model.

As for ETS(A,Ad,N), given that the damping parameter \(\phi \in [0,1]\), it slows down the shrinkage of \({\hat{\beta }}\) in comparison to ETS(A,A,N). Asymptotically with the increase of the horizon, ETS(A,Ad,N) becomes a model with a deterministic damped trend.

Finally, due to the values of \(j_m\) in ETS(A,N,A), the shrinkage happens in a stepwise manner, increasing as complete seasonal cycles accumulate: \({\hat{\gamma }}\) does not shrink when the forecasting horizon is smaller than the seasonal periodicity, \(h \le m\). Parameter \({\hat{\alpha }}\) shrinks faster than \({\hat{\gamma }}\) on shorter horizons, but as h increases this is inverted and \({\hat{\gamma }}\) shrinks faster. This happens because the weight of \({\hat{\gamma }}\), governed by \(j_m\), grows with the horizon. The switch in the shrinkage speed happens when \(h > 2m\), because \({\hat{\gamma }}\) starts prevailing in the sum in the formula in Table 1. Once again, asymptotically the model becomes deterministic as \(h \rightarrow \infty\).

Other additive state space exponential smoothing models demonstrate behaviour similar to the ones discussed above. For example, the shrinkage mechanism in ETS(A,Ad,A) has features of both ETS(A,Ad,N) and ETS(A,N,A).

It should be noted that the initial states of any ETS model influence the variance of the one-step-ahead forecast error, so the shrinkage of the smoothing parameters is compensated by the change of the initial states. However, the initial states do not shrink on their own; their behaviour depends mainly on the time series characteristics.

Finally, as discussed above, the shrinkage effect will vary from loss function to loss function, and from model to model. However, given that the smoothing parameters of the ETS model are typically positive, the covariances between the j-steps ahead and i-steps ahead forecast errors will be positive as well. So, due to (18), the shrinkage effect in MSCE will be the strongest among all the multi-step loss functions.

3.4 Summary

Concluding this section, estimators based on multi-step loss functions impose shrinkage on the parameters of time series models, which means that the models become more robust and persistent. Asymptotically, with an increase of h, any time series model becomes deterministic if estimated using a multi-step loss function. If the forecast horizon is very long, then the parameters may become biased due to over-shrinkage. They may take much more time to converge to their true values in comparison with the conventional MSE\(_1\) estimator. The parameters are still asymptotically consistent (Weiss 1991), but may revert to boundary values in cases of finite samples and high values of h. This effect has been discussed in Clements and Hendry (1998) for several special cases, but the discussion presented here is more general. Overall, it may not be advisable to use multi-step loss functions when the sample size is small and the forecast horizon is large. This is especially important when the parameters of the model are of main interest.

Based on the discussion in this section, we can conclude that if a model is specified correctly, then estimators based on multi-step loss functions might not be suitable and might lead to unnecessary shrinkage of parameters, making the model deterministic instead of stochastic. However, when the model is misspecified, the shrinkage effect might be useful, pushing unnecessary parameters to their boundary values. For example, if a local trend model is applied to local level data, then using MSE\(_h\), TMSE or MSCE can help in shrinking the smoothing parameter of the trend to zero. Nevertheless, the main problem in this situation is in regulating the strength of shrinkage, which can only be done by choosing between the loss functions and changing the forecast horizon h.

In this section, we focused on dynamic models. If a regression is used, then it can be shown that there is no shrinkage from multi-step loss functions. This is because a conventional regression has no dynamic elements and its persistence vector is equal to zero.

Having discussed the properties of estimators from multi-step loss functions, the following section presents likelihood functions for some of them and briefly discusses their properties.

4 Predictive likelihood approach

This section considers the derivation of predictive likelihood functions for the loss functions when the error term follows the Normal distribution. This is presented for each of the loss functions discussed in Sects. 4.1 and 4.2, before a general predictive likelihood approach is presented in Sect. 4.3. The main motivation for doing this is that the likelihood approach typically produces consistent and asymptotically efficient estimates of parameters and can be used in model selection. Showing the connection between the multi-step loss functions, the estimators presented thus far, and predictive likelihoods allows a better understanding of the principles behind them and gives additional flexibility in model building.

4.1 MSE\(_h\)

We start with a predictive likelihood function for MSE\(_h\). Taking (10) into account and assuming that the original error term \(\varepsilon _t \sim N(0, \sigma _1^2)\), we can conclude that the h-steps-ahead forecast error follows \(\eta _{t+h|t} \sim N(0, \sigma _h^2)\). Based on that, we can derive a predictive likelihood for the value h steps ahead, as shown in Proposition 4.1.

Proposition 4.1

The predictive log-likelihood for h steps ahead forecast error, assuming that \(\varepsilon _t \sim N(0,\sigma _1^2)\), is given by

$$\begin{aligned} \ell _{\textrm{MSE}_h}(\varvec{\theta }, {\sigma ^2_h} | {\textbf{y}}) = -\frac{T-h}{2} \left( \log (2 \pi ) + \log \sigma ^2_h \right) -\frac{1}{2} \sum _{t=1}^{T-h} \left( \frac{\eta _{t+h}^2}{\sigma ^2_h} \right) , \end{aligned}$$
(24)

where \(\varvec{\theta }\) is the vector of parameters of the model and \({\textbf{y}}\) is the vector of \(y_{t+h}\) for all \(t=1,..,T-h\).

The MLE of variance \(\sigma ^2_h\) based on (24) is:

$$\begin{aligned} s^2_h = \frac{1}{T-h} \sum _{t=1}^{T-h} \eta _{t+h}^2. \end{aligned}$$
(25)

The concentrated sample log-likelihood based on (25) and (24) is then:

$$\begin{aligned} \ell _{\textrm{MSE}_h}(\varvec{\theta }, s^2_h | {\textbf{y}}) = -\frac{T-h}{2} \left( \log (2 \pi ) + \log s^2_h + 1 \right) , \end{aligned}$$
(26)
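
Computationally, the concentrated likelihood (26) is a one-line function of the h-steps-ahead in-sample errors; a minimal sketch:

```r
# Sketch: concentrated predictive log-likelihood (26) from a vector of h-step errors
loglik_MSEh <- function(e_h) {
  n <- length(e_h)                 # n = T - h forecast origins
  s2_h <- mean(e_h^2)              # MLE of the h-step variance, equation (25)
  -n / 2 * (log(2 * pi) + log(s2_h) + 1)
}
```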

The maximisation of the likelihood (26) is equivalent to the minimisation of the sample estimate of variance (25), which in turn is equivalent to the MSE\(_h\) discussed in Sect. 3.1. The concentrated predictive log-likelihood based on (26) and (16) (in the case when \(h \ne 1\)) is:

$$\begin{aligned} \ell _{\textrm{MSE}_h}(\varvec{\theta }, s^2_h | {\textbf{y}}) = -\frac{T-h}{2} \left( \log (2 \pi e) + \log s^2_1 + \log \left( 1 + \sum _{j=1}^{h-1} {c}_{j}^2 \right) \right) . \end{aligned}$$
(27)

This predictive log-likelihood has similarities with the conventional one-step-ahead concentrated log-likelihood, which can be written as:

$$\begin{aligned} \ell _{\textrm{MSE}_1}(\varvec{\theta }, s^2_1 | {\textbf{y}}) = -\frac{T-1}{2} \left( \log (2 \pi e) + \log s^2_1 \right) , \end{aligned}$$
(28)

where \(s^2_1 = \frac{1}{T-1} \sum _{t=1}^{T-1} e_{t+1}^2\) is the MLE of the one-step-ahead variance. When maximised in sample, both likelihoods will produce estimates of parameters, \(\hat{\varvec{\theta }}_h\) and \(\hat{\varvec{\theta }}_1\) respectively. These estimates are both consistent due to the general consistency of likelihoods (see, for example, Wald 1949), so that \(\text {plim}_{T \to \infty } \hat{\varvec{\theta }}_h = \varvec{\theta }^{*}\).

Lemma 4.2

The predictive log-likelihood (24) is maximised asymptotically, with the increase of the sample size, by the vector of the true parameters \(\varvec{\theta }^*\) and the true variance \(\sigma ^2_1\).

Proof

With the increase of the sample size, the value of the estimate of the variance \(s^2_1\) will converge to the true value of \(\sigma ^2_1\), while the vector of parameters \(\hat{\varvec{\theta }}_h\) will converge to \(\varvec{\theta }^*\) due to the Law of Large Numbers. In addition, the linear relation between \(\sigma ^2_1\) and \(\sigma ^2_h\) in (11) implies that \(s^2_h\) will converge to \(\sigma ^2_h\). Given that the maximum of the likelihood (24) is reached with the estimates of parameters \(\hat{\varvec{\theta }}_h\) and \(s^2_h\), asymptotically the likelihood (24) will be maximised by the vector of the true parameters \(\varvec{\theta }^*\) and \(\sigma ^2_1\). \(\square\)

Note that the estimate of the one-step-ahead variance \(s_1^2\) will in general be greater in (24) than in (28), because it is not optimised directly in (25). However, both will converge to the true value \(\sigma ^2_1\) asymptotically, with the increase of the sample size.

For further analysis, the predictive log-likelihood (27) with estimates of parameters \(\hat{\varvec{\theta }}_h\) can be decomposed into two parts: conventional one-step-ahead concentrated log-likelihood and the “bias” term:

$$\begin{aligned} \begin{array}{ll} \ell _{\textrm{MSE}_h}(\hat{\varvec{\theta }}_h, s^2_h | {\textbf{y}}) &= \frac{T-h}{T-1} \ell _{\textrm{MSE}_1}(\hat{\varvec{\theta }}_h, s^2_1 | {\textbf{y}}) -\frac{T-h}{2} \log \left( 1 + \displaystyle {\sum _{j=1}^{h-1} {\hat{c}}_{j}^2} \right) \\ &= \frac{T-h}{T-1} \ell _{\textrm{MSE}_1}(\hat{\varvec{\theta }}_h, s^2_1 | {\textbf{y}}) - \textrm{B}_{\textrm{MSE}_h}(\hat{\varvec{\theta }}_h, h) \end{array}. \end{aligned}$$
(29)

The parameters \(\hat{\varvec{\theta }}_1\) estimated via the maximisation of (28) will correspond to the vector of parameters \(\hat{\varvec{\theta }}_h\) only in the case when \({\hat{c}}_j=0\) for all \(j=1,\ldots ,h-1\), implying that we deal with a deterministic model. In all other cases the likelihoods (27) and (28) will give different estimates of parameters. The value of the first term in the sum (29) is always greater than the respective value of (28), because \(\frac{T-h}{T-1} < 1\) for \(h>1\) and \(\ell _{\textrm{MSE}_1}(\hat{\varvec{\theta }}_h, s^2_1 | {\textbf{y}}) < 0\) by definition. As for the bias, \(\textrm{B}_{\textrm{MSE}_h}(\hat{\varvec{\theta }}_h, h) \ge 0\) always, because \({\hat{c}}_j^2\) cannot be negative. So the value of the predictive likelihood \(\ell _{\textrm{MSE}_h}(\hat{\varvec{\theta }}_h, s^2_h | {\textbf{y}})\) will always be less than the value of the likelihood \(\ell _{\textrm{MSE}_1}(\hat{\varvec{\theta }}_h, s^2_1 | {\textbf{y}})\). We argue that, because of that, the second derivative of (29) with respect to the parameters will be smaller in the case of the MSE\(_h\) likelihood than in the case of MSE\(_1\). Therefore, we argue that in general the variances of parameters estimated using (27) will be higher than the variances of parameters of (28). However, given the discussion in this subsection, this does not hold universally, and there might be some cases when the multi-step estimator will be more efficient than the conventional one. This will hold when the true parameters are such that \(c_j\) is close to zero.

4.2 MSCE and TMSE

It is also possible to derive predictive likelihood function for MSCE, assuming the same normal distribution for the error term \(\varepsilon _t\):

Proposition 4.3

The predictive log-likelihood for MSCE (15), assuming that \(\varepsilon _t \sim N(0,\sigma _1^2)\) is given by,

$$\begin{aligned} \ell _{\textrm{MSCE}}(\varvec{\theta }, {\varsigma ^2_h} | {\textbf{z}}) = -\frac{T-h}{2} \left( \log (2 \pi ) + \log {\varsigma ^2_h} \right) -\frac{1}{2} \sum _{t=1}^{T-h} \left( \frac{\left( \sum _{j=1}^h \eta _{t+j|t}\right) ^2}{{\varsigma ^2_h}} \right) , \end{aligned}$$
(30)

where \({\textbf{z}}\) is the cumulative sum of actual values, the vector of \(z_t=\sum _{j=1}^h y_{t+j}\) for all \(t=1, \ldots , T-h\) and \({\varsigma ^2_h}\) is the variance of the cumulative error term.

Corollary 4.4

Using (18), it can be shown that the log-likelihood (30) can be represented in terms of the conventional one-step-ahead likelihood and bias, based on the MLE \(\hat{\varvec{\theta }}_{h}\) and \({\hat{\varsigma }}^2_h\):

$$\begin{aligned} \ell _{\textrm{MSCE}}(\hat{\varvec{\theta }}_{h}, {\hat{\varsigma }}^2_h | {\textbf{z}}) = \frac{T-h}{T-1} \ell _{\textrm{MSE}_1}(\hat{\varvec{\theta }}_{h}, {\hat{\varsigma }}^2_1 | {\textbf{y}}) - \textrm{B}_{\textrm{MSCE}}(\hat{\varvec{\theta }}_{h}, h), \end{aligned}$$

where

$$\begin{aligned} \begin{array}{ll} \textrm{B}_{\textrm{MSCE}}(\hat{\varvec{\theta }}_{h}, h) &= \frac{T-h}{2} \log \left( h + \sum\limits _{j=2}^h \sum\limits _{i=1}^{j-1} {\hat{c}}_{|j-i|}^2 + 2 \sum\limits _{j=2}^h {\hat{c}}_{|j-1|} + \right. \\ & \left. 2 \sum\limits _{j=3}^h \sum\limits _{i=2}^{j-1} \left( {\hat{c}}_{|i-j|} + \sum\limits _{l=1}^{i-1} {\hat{c}}_{|l-j|} {\hat{c}}_{|i-l|} \right) \right) \end{array}. \end{aligned}$$

Comparing the MLE of \(\ell _{\textrm{MSE}_h}(\hat{\varvec{\theta }}_{\textrm{MSE}_h}, s^2_h | {\textbf{y}})\) and \(\ell _{\textrm{MSCE}}(\hat{\varvec{\theta }}_{\textrm{MSCE}_h}, {\hat{\varsigma }}^2_h | {\textbf{z}})\), the two likelihoods will have the same asymptotic efficiency when:

$$\begin{aligned} \begin{array}{ll} 1 + \displaystyle {\sum\limits _{j=1}^{h-1} {\hat{c}}_{j,{MSE_h}}^2} &= h + \sum\limits _{j=2}^h \sum\limits _{i=1}^{j-1} {\hat{c}}_{|j-i|,\textrm{MSCE}_h}^2 + 2 \sum\limits _{j=2}^h {\hat{c}}_{|j-1|,\textrm{MSCE}_h} + \\ & 2 \sum\limits _{j=3}^h \sum\limits _{i=2}^{j-1} \left( {\hat{c}}_{|i-j|,\textrm{MSCE}_h} + \sum\limits _{l=1}^{i-1} {\hat{c}}_{|l-j|,\textrm{MSCE}_h} {\hat{c}}_{|i-l|,\textrm{MSCE}_h} \right) \end{array}. \end{aligned}$$

However, there is no analytical solution for this equation, so each situation should be analysed separately for the specific values of parameters.

In general, the bias of \(\ell _{\textrm{MSCE}}(\hat{\varvec{\theta }}_{\textrm{MSCE}_h}, {\hat{\varsigma }}^2_h | {\textbf{z}})\) is higher than the bias of \(\ell _{\textrm{MSE}_h}(\hat{\varvec{\theta }}_{\textrm{MSE}_h}, s^2_h | {\textbf{y}})\) when the parameters are the same and \({\hat{c}}_j>0\) for all \(j=1, \ldots , h-1\). This implies that the MLE of MSCE is less efficient than the MLE of MSE\(_h\) in that case. However, the estimates should become asymptotically as efficient, although at a slower rate than MSE\(_h\).

As TMSE is a restricted version of MSCE, in which the covariances between forecast errors are set to zero (an artificial condition), it is not possible to derive an appropriate likelihood for which TMSE would be the optimal estimator. Considering that it has features similar to MSCE, it can be concluded that in general it will also be inefficient on small samples, but it should asymptotically be an efficient estimator.

4.3 General predictive likelihood

Following the discussion of likelihoods in the previous subsections, we can derive a General Predictive Likelihood (GPL). If the actual observations are collected into a vector of consecutive observations \(y_{t+1}, \ldots , y_{t+h}\):

$$\begin{aligned} {\textbf{Y}}_t = \begin{pmatrix} y_{t+1} \\ y_{t+2} \\ \vdots \\ y_{t+h} \end{pmatrix}, \end{aligned}$$
(31)

the predictive log-likelihood for a linear model and the vector (31) can be written as (Clements and Hendry 1998, p. 77):

$$\begin{aligned} \ell _{\textrm{GPL}_h}(\varvec{\theta }, {\varvec{\Sigma }} | {\textbf{Y}}) = -\frac{T-h}{2} \left( h \log (2 \pi ) + \log | {\varvec{\Sigma }}| \right) -\frac{1}{2} \sum\limits _{t=1}^{T-h} \left( \mathbf {E_t^\prime } {\varvec{\Sigma }}^{-1} \mathbf {E_t} \right) , \end{aligned}$$
(32)

where \({\varvec{\Sigma }}\) is the conditional covariance matrix for variable \({\textbf{Y}}_t\), \({\textbf{Y}}\) is the matrix consisting of (31) for all \(t=1, \ldots , T-h\) and \(\mathbf {E_t}^{\prime } = \begin{pmatrix} \eta _{t+1|t}&\eta _{t+2|t}&\ldots&\eta _{t+h|t} \end{pmatrix}\) is the vector of 1 to h steps ahead forecast errors. Given that \({\varvec{\Sigma }}\) is unknown, it can be estimated via the maximisation of (32):

$$\begin{aligned} {\varvec{\hat{\Sigma }}} = \frac{1}{T-h} \sum\limits _{t=1}^{T-h} \mathbf {E_t} \mathbf {E_t^\prime }. \end{aligned}$$
(33)

Proposition 4.5

The concentrated predictive log-likelihood based on 1, \(\ldots\), h steps ahead for data \({\textbf{Y}}\), with errors \(\varepsilon _t \sim N (0,\sigma _1^2)\), is:

$$\begin{aligned} \ell _{\textrm{GPL}_h}(\varvec{\theta }, {\varvec{\hat{\Sigma }}} | {\textbf{Y}}) = -\frac{T-h}{2} \left( h \log (2 \pi e) + \log | {\varvec{\hat{\Sigma }}} | \right) \end{aligned}$$
(34)

Proof

Concentrating out the predictive log-likelihood, after several simplifications (see “Appendix C” for the details) we obtain the result. \(\square\)

Analysing (34) shows that its maximisation is equivalent to the minimisation of the determinant of the covariance matrix \({\varvec{\hat{\Sigma }}}\). This determinant is called the “Generalised Variance”:

$$\begin{aligned} \textrm{GV} = | {\varvec{\hat{\Sigma }}} | \end{aligned}$$
(35)

For convenience of estimation, in cases of large h, the logarithm of (35) may be taken.

It is essential to understand what elements the covariance matrix \({\varvec{\hat{\Sigma }}}\) contains and what the minimisation of its determinant implies. The matrix has the following structure:

$$\begin{aligned} {\varvec{\hat{\Sigma }}} = \begin{pmatrix} s_1^2 & s_{1,2} & \dots & s_{1,h} \\ s_{1,2} & s_2^2 & \dots & s_{2,h} \\ \vdots & \vdots & \ddots & \vdots \\ s_{1,h} & s_{2,h} & \dots & s_h^2 \end{pmatrix}. \end{aligned}$$
(36)

It is worth pointing out that MSE\(_h\) is equal to \(s_h^2\) in (36), TMSE is equal to the trace of the matrix and MSCE is equal to the sum of all the elements of the matrix \({\varvec{\hat{\Sigma }}}\). This makes all the estimators discussed in the previous sections special cases of (35).
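
These relationships can be written compactly in code; the sketch below expresses the estimators through the sample covariance matrix (33), with E denoting the matrix of 1 to h steps ahead in-sample errors (one row per forecast origin):

```r
# Sketch: the estimators as functions of the sample covariance matrix (33)
Sigma_hat <- function(E) crossprod(E) / nrow(E)         # equation (33)
MSEh_S <- function(E) Sigma_hat(E)[ncol(E), ncol(E)]    # s_h^2, bottom-right element of (36)
TMSE_S <- function(E) sum(diag(Sigma_hat(E)))           # trace of the matrix
MSCE_S <- function(E) sum(Sigma_hat(E))                 # sum of all the elements
GV     <- function(E) det(Sigma_hat(E))                 # Generalised Variance (35)
```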

Theorem 4.6

Based on the formula for the multi-step covariances (12), the logarithm of the Generalized Variance is:

$$\begin{aligned} \log \textrm{GV} = \log |{\varvec{\hat{\Sigma }}}| = h \log s_1^2 + \log |\hat{{\textbf{A}}}|, \end{aligned}$$
(37)

where

$$\begin{aligned} \begin{array}{l} \hat{{\textbf{A}}} = \begin{pmatrix} {\hat{a}}_{1,1} & {\hat{a}}_{1,2} & \ldots & {\hat{a}}_{1,h} \\ {\hat{a}}_{2,1} & {\hat{a}}_{2,2} & \ldots & {\hat{a}}_{2,h} \\ \vdots & \vdots & \ddots & \vdots \\ {\hat{a}}_{h,1} & {\hat{a}}_{h,2} & \ldots & {\hat{a}}_{h,h} \\ \end{pmatrix} \\ \text {and } {\hat{a}}_{i,j} = \left\{ \begin{array}{ll} 1 & i = j = 1 \\ {\hat{c}}_{|i-j|}, & i = 1 \text { or } j = 1, \; i \ne j \\ {\hat{a}}_{i-1,j-1} + {\hat{a}}_{1,j} {\hat{a}}_{1,i}, & \text {otherwise} \end{array} \right. \end{array}. \end{aligned}$$
(38)

Proof

The proof hinges on the fact that the matrix \({\varvec{\hat{\Sigma }}}\) can be expressed through the variance of the one-step-ahead error and the \({\hat{c}}_{|j-i|}\) values (see “Appendix D”). \(\square\)

Corollary 4.7

Using the Generalised Variance (35) will give parameter estimates similar to those of the one-step-ahead estimator when the model is correctly specified.

Proof

If the model is correctly specified, then minimising (35) is equivalent to minimising (37). When we consider the determinant of \(\hat{{\textbf{A}}}\), it becomes clear that it is always equal to one (see “Appendix E”). As a result, the MLE of (34) will behave similarly to the MLE based on the conventional one-step-ahead likelihood (28). This also means that the Generalised Variance (35) will behave similarly to the conventional MSE\(_1\). \(\square\)

Interestingly, Corollary 4.7 demonstrates that there is no shrinkage in (35) when the model is correctly specified, because the covariances between the multiple-steps-ahead errors cancel out the effect. However, when the model is misspecified, the decomposition (37) will not necessarily hold, so the GPL may produce estimates of parameters that differ from the conventional likelihood based on the one-step-ahead error.

4.4 Geometric TMSE

An additional estimator can be derived from (35) if all the off-diagonal elements in \(\varvec{\Sigma }\) are set to zero. The determinant of the matrix is then equal to the product of its diagonal elements. Switching to logarithms and substituting the variances with their in-sample estimates gives the “Geometric TMSE”:

Definition 4.8

$$\begin{aligned} \textrm{GTMSE} = \sum\limits _{j=1}^h \log \left( \frac{1}{T-h} \sum\limits _{t=1}^{T-h} e^2_{t+j} \right) . \end{aligned}$$
(39)

This cost function is proportional to:

$$\begin{aligned} \textrm{GTMSE} \propto \sum\limits _{j=1}^h \log \left( s^2_{j} \right) = h \log \left( s^2_1 \right) + \sum\limits _{j=2}^h \log \left( 1 + \sum\limits _{i=1}^{j-1} {\hat{c}}_{i}^2 \right) . \end{aligned}$$
(40)

Due to the logarithms in the estimator (40), the one-step-ahead variance is balanced against the sum of the \({\hat{c}}_j\) elements. As a result, there is a shrinkage effect in models estimated using (39), but it is weaker than in MSE\(_h\), TMSE or MSCE. Although we know that the off-diagonal elements cannot be zero in real life (excluding the case when \({\hat{c}}_{j}=0\) for all \(j=1,\ldots ,h-1\)), we can consider this as a restriction on the matrix (36).
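
In sample, GTMSE is simply the sum of the logarithms of the per-horizon mean squared errors; a minimal sketch, reusing the error matrix E introduced above:

```r
# Sketch: sample GTMSE (39); E holds the 1..h step errors, one row per forecast origin
GTMSE <- function(E) sum(log(colMeans(E^2)))
```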

We argue that some shrinkage of parameters can be useful in the case of time series models, for the reasons outlined in Sect. 3.4. However, it should not be as strong as, for example, in MSE\(_h\), TMSE or MSCE. The new estimator, GTMSE, is more balanced in this sense, and we show how it performs in comparison with the other estimators in the next section.

5 Simulation experiment

We conduct two simulation experiments in order to examine the behaviour of the different estimators depending on the sample size, the forecasting horizon and the appropriateness of the fitted model. We do not include GPL in the simulation experiment, due to Corollary 4.7, which shows that it performs similarly to MSE\(_1\).

In the first experiment the ETS(A,N,N) model with \(\alpha =0.2\) is used as the data generating process, while in the second ARIMA(0,1,1) with \(\theta =0.6\) is used. In both cases 500 time series are generated, and each time series contains 5000 observations. Several sub-samples from these time series are used in order to assess the influence of the sample size on the parameter estimation, specifically 20, 50, 100, 200, 500, 1000 and 5000 observations. Four models are applied to the data: ETS(A,N,N), ETS(A,A,N), ARIMA(0,1,1) and ARIMA(1,1,1). The first model is the “true model” for the first case, and the third model is “true” for the second case. Note that these models have a direct connection, with their parameters related as \(\theta =\alpha -1\) (Box and Jenkins 1976, p. 107). Therefore, although they are formulated differently, we expect them to perform similarly in the two experiments. Our simulation covers the four cases below:

  1. The model is correctly specified;

  2. The model is analogous to the true one (e.g., ETS(A,N,N) and ARIMA(0,1,1));

  3. The model is misspecified (e.g., ARIMA(1,1,1) and ARIMA(0,1,1));

  4. The model is wrong (e.g., ETS(A,A,N) and ARIMA(0,1,1)).

The parameters of all these models are estimated using MSE\(_h\), TMSE, GTMSE, MSCE and the conventional MSE\(_1\) discussed in this paper. The forecast horizon is set to 10, 20, 50, 100 and 200 steps ahead. If the sample size is equal to or smaller than the forecast horizon, then the model estimation is skipped.

All the simulations and model estimation are done using the sim.es, sim.ssarima, es and ssarima functions from the smooth package v2.5.3 (Svetunkov 2023b) in R (R Core Team 2023).
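
A minimal sketch of a single cell of this design is given below. It assumes that the es() function from the smooth package selects the estimator through its loss argument (called cfType in some older versions of the package) and the horizon through h; consult the package documentation for the exact interface of the installed version.

```r
# Sketch of one simulation cell; argument names follow recent versions of smooth
library(smooth)
set.seed(1)
y_full <- sim.es("ANN", obs = 5000, persistence = 0.2)$data   # ETS(A,N,N) DGP, alpha = 0.2
y <- y_full[1:100]                                             # sub-sample of 100 observations
fit <- es(y, model = "ANN", loss = "MSEh", h = 10)             # MSE_h estimator with h = 10
fit$persistence                                                # estimate of alpha under MSE_h
```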

5.1 ETS(A,N,N) as DGP

Fig. 1

ETS(A,N,N) applied to ETS(A,N,N) data. The estimates of the parameter \(\alpha\). The horizontal line indicates the parameter value of the DGP. Each box plot corresponds to a different sample size: 1–20, 2–50, 3–100, 4–200, 5–500, 6–1000 and 7–5000, indicated by numerical values 1 through 7

We start by applying ETS(A,N,N) to the data generated from ETS(A,N,N). The distribution of parameters estimated using the different methods is shown in Fig. 1. Each of the boxplots corresponds to a specific sample size: 20, 50, 100, 200, 500, 1000 and 5000. There are several observations we can make from the figure:

  • All estimators are consistent, as the parameters converge to true values with the increase of sample size;

  • The estimators have different efficiency and are in general biased;

  • Longer horizons lead to larger bias for the same sample size;

  • The effect of shrinkage depends on the length of the forecast horizon and is negligible on larger samples. For example, the distributions of estimates look very similar for the cases of \(h=1\) and \(h=10\) on large samples;

  • The longer the horizon is, the slower the parameter converges to its true value;

  • GTMSE is the least biased and most efficient of the four multi-step estimators investigated. It converges to the true values faster than the other multi-step estimators;

  • MSE\(_h\) is the most biased and least efficient estimator of the four;

  • MSE\(_1\) produces more efficient and less biased estimates than any multi-step estimator.

Since this is the scenario in which the true model is known, all differences are due to the estimators.

Fig. 2

ARIMA(0,1,1) applied to ETS(A,N,N) data, h = 50. True value is \(\theta =-0.8\), indicated by the horizontal line. Each box plot corresponds to a different sample size: 1–20, 2–50, 3–100, 4–200, 5–500, 6–1000 and 7–5000, indicated by numerical values 1 through 7

Similar observations can be made for ARIMA(0,1,1). We do not present all the results, because they demonstrate a similar behaviour: with the increase of the horizon the estimates become more biased and inefficient, while with the increase of the sample size they converge to the true value. We restrict our demonstration to \(h = 50\), where our observations are most clearly seen, in Fig. 2. Note that in some cases the estimators were so inefficient that the estimated parameter was positive, while the true parameter was \(\theta =0.2-1=-0.8\). In addition, MSE\(_h\) underestimates the parameter in absolute value (e.g. the median for the sample of 100 is higher than \(-\)0.8), while the other estimators tend to overestimate it in absolute value (e.g. the median is lower than \(-\)0.8).

In the case of ETS(A,A,N) applied to the ETS(A,N,N) data, we analyse how the two parameters are estimated: \(\alpha\), which needs to be equal to 0.2, and \(\beta\), which is expected to be equal to zero. Again, we demonstrate the case of \(h = 50\) (Fig. 3 for the \(\alpha\) parameter and Fig. 4 for the \(\beta\) parameter), as the pattern is similar for all h.

Fig. 3

ETS(A,A,N) applied to ETS(A,N,N) data, h = 50. Parameter \(\alpha\). Each box plot corresponds to a different sample size: 1–20, 2–50, 3–100, 4–200, 5–500, 6–1000 and 7–5000, indicated by numerical values 1 through 7

Fig. 4

ETS(A,A,N) applied to ETS(A,N,N) data, h = 50. Parameter \(\beta\). Each box plot corresponds to a different sample size: 1–20, 2–50, 3–100, 4–200, 5–500, 6–1000 and 7–5000, indicated by numerical values 1 through 7

For the \(\alpha\) parameter, we observe similar findings as before. In general, MSE\(_1\) produces more efficient and less biased estimates than any multi-step estimator. However, when it comes to the parameter \(\beta\), there are some cases when the parameter is estimated incorrectly on smaller samples with MSE\(_1\), while, due to the shrinkage effect, the true value is reached much faster by the estimators based on the multi-step loss functions. We can observe that estimators based on multi-step loss functions are more efficient than MSE\(_1\) on smaller samples when the true parameter is equal to zero. Therefore, we argue that estimators based on multi-step loss functions are beneficial for the removal of redundant parameters in the context of the ETS model, but they might over-shrink the necessary ones.

Figure 5 shows the distribution of the estimated parameter \(\theta\) of ARIMA(1,1,1), while Fig. 6 presents the distribution of parameter \(\phi\), for \(h = 50\). The true value of \(\theta\) is \(-\)0.8, while \(\phi =0\). Note that all the estimators but GTMSE underestimate the MA parameter. Once again, GTMSE seems to produce the most efficient and least biased estimates among the multi-step estimators.

Fig. 5

ARIMA(1,1,1) applied to ETS(A,N,N) data, h = 50. Parameter \(\theta\). Each box plot corresponds to a different sample size: 1–20, 2–50, 3–100, 4–200, 5–500, 6–1000 and 7–5000, indicated by numerical values 1 through 7

As for the AR part, due to the shrinkage effect, the parameter is forced to become negative in the majority of cases. GTMSE is the least biased multi-step estimator, but still none of the multi-step estimators help in the case of a wrong model. They actually make the situation worse, producing wrong values on small samples. This is not observed in the case of MSE\(_1\). The results for ARIMA(1,1,1) support our analytical derivations in Sect. 3.2.

Fig. 6

ARIMA(1,1,1) applied to ETS(A,N,N) data. Parameter \(\phi\). Each box plot corresponds to a different sample size: 1–20, 2–50, 3–100, 4–200, 5–500, 6–1000 and 7–5000, indicated by numerical values 1 through 7

5.2 ARIMA(0,1,1) as DGP

Fig. 7

ARIMA(0,1,1) applied to ARIMA(0,1,1) data. The estimates of the parameter \(\theta\). The horizontal line indicates the parameter value of the DGP. Each box plot corresponds to a different sample size: 1–20, 2–50, 3–100, 4–200, 5–500, 6–1000 and 7–5000, indicated by numerical values 1 through 7

We observe similar patterns when analysing the results for the ARIMA(0,1,1) scenario. Starting from the case of using the correct model, presented in Fig. 7, in addition to the effects observed in the ETS case, we see that the multi-step loss functions produce even more biased estimates of the parameter. This is especially evident on small samples, where the estimated parameter becomes negative in many cases, while the true parameter is positive. Only after a substantial increase in the sample size do the estimates become less biased. However, in some cases this still does not help. For example, in the case of \(h=200\), the estimates are still biased and inefficient even on the samples of 5000 observations. Once again, GTMSE seems to produce the least biased and most efficient estimates among the multi-step loss functions.

Fig. 8

ARIMA(1,1,1) applied to ARIMA(0,1,1) data, h = 50. Parameter \(\theta\). Each box plot corresponds to a different sample size: 1–20, 2–50, 3–100, 4–200, 5–500, 6–1000 and 7–5000, indicated by numerical values 1 through 7

Applying the incorrect ARIMA(1,1,1) model, Fig. 8 demonstrates the distribution of the MA parameter. The patterns found here are similar to the ones discussed above for ARIMA(0,1,1). The main difference is that the estimates converge slightly faster to the true value than in the case of the true model for \(h=50\).

Fig. 9: ARIMA(1,1,1) applied to ARIMA(0,1,1) data, h = 50. Parameter \(\phi\). Each box plot corresponds to a different sample size: 1–20, 2–50, 3–100, 4–200, 5–500, 6–1000 and 7–5000, indicated by numerical values 1 through 7

As for the AR part, in Fig. 9, different estimators have different effects on the value of the parameter. MSE\(_h\) and MSCE force the AR parameter to become negative on small samples, while on larger samples all the estimators produce positive values and only asymptotically converge to the true value of zero. Once again, for both of these cases, MSE\(_1\) is the most efficient and the least biased estimator.

We also analyse the performance of the estimators for ETS(A,N,N) and ETS(A,A,N), plotting similar graphs for \(h=50\). Note that in this case the true value of \(\theta\) is 0.6, which corresponds to \(\alpha =1.6\) (ETS(A,N,N) is equivalent to ARIMA(0,1,1) with \(\theta = \alpha - 1\)). Figure 10 demonstrates the distribution of the smoothing parameter \(\alpha\) for ETS(A,N,N).

Fig. 10: ETS(A,N,N) applied to ARIMA(0,1,1) data, h = 50. Each box plot corresponds to a different sample size: 1–20, 2–50, 3–100, 4–200, 5–500, 6–1000 and 7–5000, indicated by numerical values 1 through 7

We observe that the smoothing parameter of ETS(A,N,N) is substantially reduced, especially on small samples. The shrinkage effect is the strongest for the MSE\(_h\) loss function and the weakest for GTMSE.

Finally, we study the case of an incorrect model and analyse the distribution of the \(\alpha\) and \(\beta\) parameters of ETS(A,A,N), which are shown in Figs. 11 and 12 respectively.

Fig. 11: ETS(A,A,N) applied to ARIMA(0,1,1) data, h = 50. Parameter \(\alpha\). Each box plot corresponds to a different sample size: 1–20, 2–50, 3–100, 4–200, 5–500, 6–1000 and 7–5000, indicated by numerical values 1 through 7

The estimates of the smoothing parameter \(\alpha\) are biased and inefficient in the presented case of \(h=50\) for almost all the estimators based on multi-step loss functions. The value does not converge to the true \(\alpha =1.6\) even on the largest sample of 5000 observations. GTMSE produces slightly more efficient estimates than the other multi-step methods on the larger sample sizes, but is outperformed by MSE\(_1\).

Fig. 12: ETS(A,A,N) applied to ARIMA(0,1,1) data, h = 50. Parameter \(\beta\). Each box plot corresponds to a different sample size: 1–20, 2–50, 3–100, 4–200, 5–500, 6–1000 and 7–5000, indicated by numerical values 1 through 7

However, there is an improvement in efficiency when it comes to the estimation of parameter \(\beta\): while the conventional MSE\(_1\) has high variance on smaller samples, the estimators based on multi-step loss functions shrink the value towards zero.

One of the main conclusions of this simulation study is that the discussed shrinkage effect happens in univariate models no matter what the DGP is. This might be beneficial for the redundant parameters of ETS models, because they tend to shrink to zero, but it can be damaging for ARIMA, as the wrong parameters are emphasised. If the parameters of the models are of the main interest, then MSE\(_1\) should be preferred to the other estimators. If a multi-step estimator is needed for such a task, then GTMSE gives more efficient and less biased estimates than the other multi-step estimators.

6 Applications

6.1 The Mean Surface Temperature data

In order to compare the effect of different estimators, we use the Hadley Centre/Climatic Research Unit data of Mean Surface Temperatures, MST, (version HadCRUT.4.5.0.0; available at http://www.metoffice.gov.uk/hadobs/hadcrut4/data/current/download.html). We depict the analysis for the location at 322.5\(^{\circ }\), 62.5\(^{\circ }\), which lies in Greenland. The data is organised as a monthly time series starting in January 1850 and finishing in December 2017, so the series contains 2016 observations. We withhold the last five years of data (60 data points) in order to measure the accuracy of a forecasting model estimated on the first 1956 observations.
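To illustrate this split, a minimal sketch in base R is given below, assuming the series has been read into a hypothetical monthly ts object called mst (January 1850 to December 2017):

```r
# Hypothetical object `mst`: monthly ts, January 1850 - December 2017,
# 2016 observations in total. The last five years form the test set.
mst_train <- window(mst, end = c(2012, 12))    # first 1956 observations
mst_test  <- window(mst, start = c(2013, 1))   # last 60 observations
```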

We use an ARIMA(1,1,2) in this experiment, because the data is non-stationary and an AR(1) term is commonly included for this type of data (see Beaulieu and Killick 2018, and the discussion within). We estimate the model using the following six loss functions:

  1. MSE\(_1\) (referred to as just “MSE”),
  2. MSE\(_h\),
  3. TMSE,
  4. GTMSE,
  5. MSCE,
  6. GPL.

We measure their performance using the root mean squared error (RMSE), mean absolute error (MAE) and mean error (ME):

$$\begin{aligned} \text{RMSE} &= \sqrt{\frac{1}{h}\sum_{j=1}^h e_{t+j|t}^2}, \end{aligned}$$ (41)
$$\begin{aligned} \text{MAE} &= \frac{1}{h}\sum_{j=1}^h |e_{t+j|t}|, \end{aligned}$$ (42)
$$\begin{aligned} \text{ME} &= \frac{1}{h}\sum_{j=1}^h e_{t+j|t}. \end{aligned}$$ (43)

RMSE and MAE measure the magnitude of the errors from the mean and the median of the distribution of the target variable respectively (Kolassa 2016). ME measures the bias of the predictions.
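These measures translate directly into a few lines of R; the sketch below assumes that the h-steps-ahead forecast errors \(e_{t+j|t}\) (actual values minus point forecasts) are available as a numeric vector:

```r
# Error measures from Eqs. (41)-(43). `errors` is assumed to be the vector of
# h-steps-ahead forecast errors e_{t+j|t} = actual - forecast, j = 1, ..., h.
rmse <- function(errors) sqrt(mean(errors^2))
mae  <- function(errors) mean(abs(errors))
me   <- function(errors) mean(errors)

# Example usage with hypothetical test-set actuals `mst_test` and forecasts `fc`:
# errors <- mst_test - fc
# c(RMSE = rmse(errors), MAE = mae(errors), ME = me(errors))
```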

To fit the model we use the msarima function from the smooth v3.2.1 package (Svetunkov 2023b) for R (R Core Team 2023). Figure 13 plots the model fit and forecasts for the different loss functions. For MSE\(_h\) we only use the model corresponding to \(h=60\) and not all 60 possible models, one for each horizon. Note that the model estimated using MSE\(_{h}\) does not fit the data well, as it attempts to capture the specific horizon of 60 periods ahead and as a result does not react to local changes in the data. Still, all the loss functions produce very similar forecasts (a slow decline), but at different levels.
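A sketch of this estimation step is shown below. The msarima() arguments (orders, lags, loss, h), the loss-function strings and the forecast() method follow smooth v3.2.1, but the exact calls should be treated as assumptions rather than a verbatim extract of the code used here:

```r
# Fit ARIMA(1,1,2) to the training sample under each loss function and
# produce the 60-steps-ahead forecasts for the withheld five years.
# `mst_train` is the hypothetical training set defined above.
library(smooth)

losses <- c("MSE", "MSEh", "TMSE", "GTMSE", "MSCE", "GPL")
fits <- lapply(losses, function(loss) {
  msarima(mst_train, orders = list(ar = 1, i = 1, ma = 2), lags = 1,
          loss = loss, h = 60)
})
names(fits) <- losses

# Point forecasts for the 60 withheld months
forecasts <- lapply(fits, function(fit) forecast(fit, h = 60)$mean)
```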

Fig. 13: ARIMA(1,1,2) with different loss functions. The vertical line marks the start of the test set. Actuals are indicated by a solid black line, fitted values by a dashed line, and the line in the test set is the h-steps-ahead forecast

The forecast errors and the parameters of the model with the different loss functions are shown in Table 2, where the cases with the lowest errors are marked in boldface. Note that the conventional MSE estimator does not perform well on this data for any error measure. Given the sample size, GPL produces forecasts very similar to the MSE. The multi-step loss functions perform better, with MSE\(_h\) being the most accurate in terms of RMSE and MAE and the least biased in terms of ME. TMSE, GTMSE and MSCE perform very similarly across all measures.

Table 2 Performance of estimators on the mean surface temperature time series. The lowest values are marked in boldface

Regarding the parameters in Table 2, we can note that GTMSE, MSCE and GPL produce similar estimates, demonstrating less shrinkage than the MSE\(_h\) (as the AR(1) parameter is closer to \(-1\) for the MSE\(_h\) compared to the other estimators).

Climatologists argue that the AR(1) parameter should be positive, indicating that the shrinkage seen in MSE\(_h\) does not produce a realistic estimate. However, the GTMSE, MSCE and GPL estimates of the AR(1) parameter are all similar to the results obtained by others in the field (Hartmann et al. 2013; Cahill et al. 2015; Rahmstorf et al. 2017), indicating that the shrinkage in the parameters is beneficial and necessary when compared to the model resulting from the MSE.

6.2 Air passengers data

As an additional example, we use the classical monthly Air Passengers data from Box and Jenkins (1976) to demonstrate the effect of model misspecification. Although this series is known to have multiplicative components, we use the ETS(A,A,A) model (additive error, trend, and seasonal components) and estimate it using MSE, MSE\(_{h}\), TMSE, GTMSE, MSCE and GPL. We set the forecasting horizon to 12 (one year). The model was applied using the es() function (Svetunkov 2023a) from the smooth package in R (Svetunkov 2023b).
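A minimal sketch of this step is given below, assuming the es() interface of smooth v3.2.1 ("AAA" selecting the purely additive model); the loss, h and holdout arguments and the $persistence accessor are named after that version and should be treated as assumptions:

```r
# ETS(A,A,A) on the Air Passengers series under the six loss functions,
# with the last 12 observations withheld for evaluation.
library(smooth)

losses <- c("MSE", "MSEh", "TMSE", "GTMSE", "MSCE", "GPL")
fits <- lapply(losses, function(loss) {
  es(AirPassengers, model = "AAA", loss = loss, h = 12, holdout = TRUE)
})
names(fits) <- losses

# Smoothing parameters (alpha, beta, gamma) for each estimator
sapply(fits, function(fit) fit$persistence)
```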

Fig. 14: ETS(A,A,A) applied to the Air Passengers data, estimated with the losses discussed in this paper

Figure 14 shows how the ETS(A,A,A) fits the data when different loss functions are used. We can see that with the conventional MSE, the model has difficulties fitting the data, updating the level component more often than required. The MSE\(_{h}\) is not doing well in this example either, failing to produce adequate forecasts except for the specific point 12 observations ahead. In contrast to these two, TMSE, GTMSE, MSCE and GPL do much better, capturing the dynamics correctly and not adapting to the noise too much.

To better understand how the models performed and why, Table 3 presents the holdout-sample error measures and the values of the smoothing parameters for the model estimated using the discussed approaches. The GTMSE did not over-shrink the smoothing parameter \(\gamma\) as MSE and MSE\(_{h}\) did, but it imposed some shrinkage on \(\alpha\). This example demonstrates the benefits of using estimators based on multi-step loss functions in the case of model misspecification, which is very common in practice.

Table 3 Performance of estimators on air passengers data example. The lowest values are marked in boldface

7 Conclusions

Estimation methods based on multiple steps ahead forecast errors have been known for several decades. Their effects on forecasting performance have been documented in detail, indicating that these loss functions make time series models more robust. Nonetheless, there was no explanation of why this happens.

We show that the main reason for this robustness is the implied shrinkage, which is automatically imposed on the parameters when any of these estimators is used. We discuss how the shrinkage happens in the ARIMA and ETS model families, using the state space model framework. Furthermore, we demonstrate that this type of shrinkage does not affect the coefficients of exogenous variables in regression; therefore, it constitutes a univariate form of parameter shrinkage, in contrast to the shrinkage in Ridge or LASSO regression (Tibshirani 1996). We also indicate that there is a danger of “over-shrinking” the parameters.

We investigate several estimators based on multi-step loss functions, showing analytically the strength of shrinkage for each of them, as well as their limitations. We propose predictive likelihood functions for some of them and discuss the efficiency of these estimators. Despite the fact that MSE\(_h\) is in general a less efficient estimator than MSE\(_1\), we demonstrate that there are regions of the parameter space where MSE\(_h\) becomes more efficient, which has not been shown in the literature before. In addition, we show that due to the dynamic structure of the models, there is always a non-zero covariance between the forecast errors of different horizons. This means that, when the multivariate distribution of the trace forecast is analysed, assuming that the off-diagonal elements are zero is unreasonable. We introduce the General Predictive Likelihood, show how it is connected with the existing multi-step estimators and demonstrate that it produces estimates similar to MSE\(_1\). Furthermore, we propose a new multi-step estimator, the Geometric Trace MSE, which has a milder shrinkage effect than the conventional multi-step estimators.

Using simulations, we demonstrate that the shrinkage effect in all the multi-step estimators increases with the forecast horizon, matching our analytical investigation, and is compensated for by large sample sizes. This can cause over-shrinkage of the model parameters, making them biased and inefficient. They eventually converge to the true values, but at a very low rate, and the speed of convergence is further reduced as the forecast horizon increases. However, the shrinkage effect is weakened in the proposed GTMSE, which is found to result in more efficient, less biased and consistent estimates of parameters, compared to the other multi-step estimators. Our simulations show that, when an incorrect model is used, the redundant parameters of univariate models shrink towards zero. In all the other cases, the multi-step estimators appear to unnecessarily over-shrink the parameters, potentially damaging the performance of the estimated models.

We apply the investigated estimators on a real data example, demonstrating the advantages of the estimators based on multi-step loss functions. We find that TMSE, GTMSE and MSCE produce more accurate 1 to h-steps ahead forecasts than either MSE\(_1\) or MSE\(_h\). Furthermore, the resulting parameters for TMSE, GTMSE and MSCE match the expected values of the modelled process more closely, indicating their benefit when the underlying DGP is unknown.

We conclude that if the parameters are of the main interest and we have some confidence in approximating well the underlying data structure, MSE\(_1\) should be preferred to estimators based on multi-step loss functions, especially on small samples. However, when the accuracy of forecasts is of the main interest, then the estimators based on multi-step loss functions can be advantageous over conventional estimators. Our analysis and results suggest that the proposed GTMSE performs very promisingly compared to existing multi-step estimators.

We can see from the simulation and the empirical evidence that the multi-step estimators can shrink the parameters of models to the extent of potentially introducing bias. This is a limitation of the approach. The bias disappears on large samples, when the ratio \(\frac{T-h}{T-1}\) becomes close to one; however, on small samples it can lead to the estimation of less stochastic and more inert models with parameter estimates biased towards zero. In this respect, the multi-step estimators act similarly to LASSO and Ridge for time series models (Pritularga et al. 2022), but without a dedicated meta-parameter to regulate the shrinkage, with the forecasting horizon effectively playing that role.

Therefore, we do not argue that the shrinkage in time series models caused by multi-step estimators is always beneficial. There will be many cases where conventional estimators will perform well. Nonetheless, being aware of their properties, an informed statistician can understand and improve the performance of their models.