1 Introduction

Financial investment planning for long-term savings is highly relevant for the development of new pension products (Merton, 2014; Gerrard et al., 2019, 2020). Therefore, understanding the dynamics of the stock market is crucial in providing the long-term saver with sufficient wealth at retirement. It is well-known from the empirical literature that model-based predictions for longer horizons can provide better forecasts than the simple historical mean (Campbell & Thompson, 2008). However, a careful validation approach has to be applied when predictions of stock returns are based on reasonable long-term economic drivers. In this paper, we focus on nonlinear predictive functions which are estimated with a fully data-driven local-linear smoother in combination with leave-k-out cross-validation for the prediction of stock returns in excess of different benchmarks, as developed in Kyriakou et al. (2021a). These functions optimally incorporate the given information as they allow for complex interrelations of the potential predictor variables. We work with low-dimensional models and estimate the nonlinear predictive relationship individually for each selection of variables. However, forecast combinations are known to potentially reduce the mean squared forecast error when several individual candidates are available. Thus, we not only validate the predictive power of the individual forecasts but also analyse whether it is beneficial to combine them in several ways. Recently, machine learning (ML) algorithms have been proposed for this purpose, and we focus mainly on weighting schemes for the forecast combinations based on techniques such as the Lasso, the Ridge, and the Elastic Net, as well as their egalitarian variants or recently introduced refinements (the Combination Elastic Net). We employ historical S&P 500 returns in excess of different benchmarks, including the short-term interest rate and inflation, at the annual frequency for a sample period ranging from 1872 to 2022.

The contributions of this paper are manifold. First, we extend the nonlinear prediction framework of Kyriakou et al. (2021a) to also consider three-dimensional models. We show that such complex models can have reasonable predictive power both in-sample and out-of-sample. For example, under the short-term interest rate benchmark, three of the five models with the largest out-of-sample predictive power are three-dimensional. In particular, the model based on time-lagged excess returns, dividends, and the term spread is the second-best predictive model for the risk premium (in terms of a large out-of-sample \(R^2\) value). Under the inflation benchmark, four of the five models with the largest out-of-sample predictive power are three-dimensional. The model based on real dividends, real earnings, and the term spread performed best in predicting real stock returns out-of-sample (cf. Tables 2, 6). Second, we find that individual nonparametric forecasts usually outperform forecast combination methods based on ML techniques. Only if we restrict the candidate set under the short-term interest rate benchmark to one-dimensional models do the forecast combinations give slightly better predictions than the best individual model. Thus, the complexity introduced in the prediction process when using ML-based techniques does not pay off well enough, and it is better to use simpler and more transparent methods. Third, we highlight that the classical shrinkage methods are prone to in-sample over-fitting when too many individual forecasts are used as possible candidates. The consequence is that the suggested predictive power is spurious and the out-of-sample performance is very poor. However, using only the one-dimensional models balances in-sample and out-of-sample behaviour. Note also that forecast combinations perform better than the simple historical average. Fourth, considering all variables in real terms net of inflation (the inflation double benchmark) results in a much more stable and consistent analysis both between models and over time when compared to the prediction of the risk premium (short-term interest rate single benchmark). This is especially important for the long-term pension saver who is interested in adequate strategies for real-income protection.

The remainder of this paper is organized as follows. Section 2 presents the literature review. In Sect. 3, we introduce our long-term predictive framework, outline the estimation procedure using the local-linear smoother, describe different ways of combining the individual forecasts, and give an overview of the US stock market data. Section 4 discusses the results of the empirical study for the prediction of the risk premium and real returns. Section 5 summarizes the key points of our analysis and concludes the paper.

2 Literature review

Over the last decades, numerous studies in the academic literature have focused on the question of whether asset returns are predictable. From an economic perspective, it was commonly assumed until the mid-1980s that predictability would contradict the efficient markets hypothesis (Fama, 1970). However, the seminal work by Fama (1988), Campbell and Shiller (1988), or Stambaugh (1999) suggests the nowadays ‘common wisdom’ of long-term predictability (Lioui & Poncet, 2019). For more recent approaches regarding stock market forecasts, see, for example, Scholz et al. (2015), Scholz et al. (2016), Lioui and Poncet (2019), or Akyildirim et al. (2022) and the discussion therein.

From the statistical or econometric point of view, the prediction setup can be described in the following very general way (Hastie et al., 2017):

$$\begin{aligned} \min _{f\in \mathcal {H}}\bigg \{{L}\big (y_{t+h},f(Z_{t})\big )+p(f,\tau )\bigg \} ,\quad t=1,\ldots ,T, \end{aligned}$$
(1)

where \(y_{t+h}\) is the variable to be predicted h periods ahead, \(Z_{t}\) the vector of predictors, \(\mathcal {H}\) a space of possible functions f that combine the data to form the prediction, p a penalty on f, \(\tau \) a set of hyper-parameters (for example, the \(\lambda \) in the Lasso), and L a loss function that defines the optimal forecast.
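To make the notation concrete, the following Python sketch instantiates (1) with a squared-error loss L, linear functions \(f(Z)=Z^{\top }\beta \), and an \(\ell _1\)-penalty \(p(f,\tau )=\lambda \Vert \beta \Vert _1\), that is, the Lasso mentioned above. The simulated data and the chosen penalty level are purely illustrative and not part of our empirical study.

```python
# Minimal sketch: problem (1) with squared-error loss, linear f, and an L1 penalty (Lasso).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
T, q = 150, 3                                       # annual sample size and number of predictors
Z = rng.normal(size=(T, q))                         # predictor vectors Z_t
y = 0.5 * Z[:, 0] + rng.normal(scale=0.2, size=T)   # target y_{t+h}, driven by one predictor only

fit = Lasso(alpha=0.1).fit(Z, y)                    # alpha plays the role of the hyper-parameter lambda in tau
print(fit.coef_)                                    # shrunken, possibly sparse, coefficient vector
```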

In this article, we take the long-term actuarial perspective and base our empirical study on annual observations. Thus, we are not in a big-data context where the number of observations is huge. The set of possible predictive variable combinations is also rather small. In other words, we can work with low-dimensional models in (1), and shrinkage, dimension reduction, or penalization are not necessary. However, data sparsity could be an issue with our data set, and a careful imposition of structure in the statistical modelling process is helpful. Note further that the use of nonlinear functions f in (1) has shown evidence of much stronger stock return predictability when compared to their linear counterparts (Lettau & Van Nieuwerburgh, 2008; Chen & Hong, 2010; Yang et al., 2010; Cheng et al., 2019; Caldeira et al., 2020; Freyberger et al., 2020). Thus, the local-linear smoother based on the standard \(L_{2}\)-loss function is ideally suited. Note that a linear function—the classical benchmark in this context—can be estimated without any bias.

Several studies are based on this technique. Most of them try to improve the prediction by utilizing additional structure in the estimation process and to reduce the impact of the curse of dimensionality in a sparse data environment. Nielsen and Sperlich (2003) were the first to introduce this nonparametric technique together with an adequate validation method into the actuarial literature. Scholz et al. (2015) use bootstrap techniques to formally test the null hypothesis of non-predictability of returns and improve the smoothing through prior knowledge using a multiplicative bias-reduction approach. Scholz et al. (2016) propose a two-step procedure for the prediction of excess stock returns: (i) the same-years bond yield is constructed fully nonparametrically, and (ii) this additional predictor is used to forecast excess stock returns. Mammen et al. (2019) focus on the prediction of the conditional variance of long-term stock returns. They find that volatility forecastability is much less important at longer horizons and that the homoscedastic historical average of the squared return prediction errors gives an adequate approximation of the unobserved realised conditional variance. Kyriakou et al. (2020) consider the 5-year horizon and corresponding econometric challenges like overlapping observations. They find that long-term forecasting performs well and recommend drawing more attention to it when designing investment strategies for long-term investors. Kyriakou et al. (2021a) propose the use of different benchmarks when predicting stock returns. Their full benchmarking approach, that is, considering all variables net of inflation, has important consequences for long-term saving strategies, where one is interested in real value. Finally, Kyriakou et al. (2021b) propose an econometric model which combines different horizons. Their method exploits the lower long-term variance to further reduce the short-term variance, which is susceptible to speculative exuberance. As a consequence, the long-term pension saver avoids an over-conservative portfolio and the implied reduction in upside potential, given their optimal risk appetite. Our study now analyses the question of whether the combination of individual forecasts based on ML techniques can improve predictability, as recently documented in the literature, for example, by Rapach and Zhou (2020). However, we find that this kind of complexity does not pay off well enough and we recommend the use of simpler individual forecasts.

ML is one of the in-vogue topics in empirical finance and actuarial science (Asimit et al., 2020; Dixon et al., 2020) for asset return prediction or portfolio choice (Coqueret & Guida, 2020; Akyildirim et al., 2021, 2022). It is often seen as “(i) a diverse collection of high-dimensional models for statistical prediction, combined with (ii) so-called ‘regularization’ methods for model selection and mitigation of overfit, and (iii) efficient algorithms for searching among a vast number of potential model specifications” (Gu et al., 2020). Typically, one of the following methods is used to address these three challenges: linear models for regression (including regularization via shrinkage methods with penalization, such as Ridge Regression, the Lasso, or Elastic Nets), dimension reduction via principal components regression and partial least squares, regression trees and forests (including boosted trees and random forests), (deep) neural networks, and boosting (Oztekin et al., 2016; Athey & Imbens, 2019; Coulombe et al., 2020; Gu et al., 2020; Hiabu et al., 2020; Iworiso & Vrontos, 2020; Wu et al., 2020; Gambella et al., 2021).

Forecast combinations are a popular way of reducing the mean squared forecast error when several individual predictive models (usually of low dimensionality) for a target variable are available. The forecasting ability of individual predictive regression models can be seriously impaired by model uncertainty and (parameter) instability (Rapach et al., 2010). Several methods of finding the (optimal) combination forecast have been proposed in a large body of literature: for example, a weighted average of forecasts, with the weights adding up to unity (Granger & Ramanathan, 1984); trimming (Granger & Jeon, 2004); rank-based approaches (Aiolfi & Timmermann, 2006); least-squares forecast averaging (Hansen, 2008b); complete subset regressions (Elliott et al., 2013); and iterated (Lin et al., 2018) or depth-weighted combinations (Lee & Sul, 2021). Recently, ML techniques have been proposed to select and weight appropriate individual forecasts using, for example, Lasso-based procedures (Diebold & Shin, 2019; Mascio et al., 2020; Freyberger et al., 2020); a combining method for sophisticated models with the historical average serving as shrinkage target (Zhang et al., 2020); or the Combination Elastic Net (Rapach & Zhou, 2020). However, in many practical applications, the simple average of candidate forecasts is more robust than more sophisticated combination approaches (Qian et al., 2019), a phenomenon known as the forecast combination puzzle. A theoretical explanation for the latter is given in Claeskens et al. (2016), together with the warning that “there is no guarantee that the ‘optimal’ forecast combination will be better than the equal-weight case, or even improve on the original forecasts”.

3 Methodology and materials

In this section, we introduce the underlying financial model and the corresponding nonparametric predictive long-term regressions. We follow the approach of Scholz et al. (2015) and focus on (nonlinear) relationships between stock returns in excess of different benchmarks and a set of predictor variables. We aim to compare individual models with several combination approaches in terms of their in-sample and out-of-sample predictability over the horizon of 1 year. We consider the four benchmarks introduced in Kyriakou et al. (2021a): the short- and the long-term interest rate, the earnings-by-price ratio, and the inflation rate.

3.1 Predictive framework

Let \(D_{t}\) denote the (nominal) dividends paid during year t and \(P_{t}\) the (nominal) stock price at the end of year t. We consider stock returns \(S_{t}=(P_{t}+D_{t})/P_{t-1}\) in excess (log-scale) of a given reference rate or benchmark \(B_{t-1}^{(A)}\):

$$\begin{aligned} Y_{t}^{(A)}=\ln \frac{S_{t}}{B_{t-1}^{(A)}}, \end{aligned}$$
(2)

where \(A\in \{R,L,E,C\}\) with, respectively,

$$\begin{aligned} B_{t}^{(R)}=1+\frac{{R_{t}}}{100},\quad B_{t}^{(L)}=1+\frac{{L_{t}}}{100} ,\quad B_{t}^{(E)}=1+\frac{{E_{t}}}{P_{t}},\quad B_{t}^{(C)}=\frac{CPI_{t}}{ CPI_{t-1}}, \end{aligned}$$

\(R_{t}\) is the short-term interest rate, \(L_{t}\) the long-term interest rate, \(E_{t}\) the earnings accruing to the index in year t, and \(CPI_{t}\) the consumer price index for year t. The predictive nonparametric regression model for the 1-year excess stock returns \(Y_{t}^{(A)}\) is then given by

$$\begin{aligned} Y_{t}^{(A)}=m(X_{t-1}^{(A)})+\xi _{t}, \end{aligned}$$
(3)

where

$$\begin{aligned} m(x^{(A)})=\mathbb {E}(Y^{(A)}|X^{(A)}=x^{(A)}),\;x^{(A)}\in \mathbb {R}^{q}, \end{aligned}$$
(4)

is the unknown conditional mean function, which is estimated with the local-linear smoother. The error terms \(\xi _{t}\) in Eq. (3) form a martingale difference process and are serially uncorrelated zero-mean random variables with an unknown conditionally heteroscedastic form \(\sigma (x)\).

Our individual predictive models use (subsets of) popular time-lagged predictive variables: the dividend-by-price ratio \(d_{t-1}=D_{t-1}/P_{t-1}\); the earnings-by-price ratio \(e_{t-1}=E_{t-1}/P_{t-1}\); the short-term interest rate \(r_{t-1}=R_{t-1}/100\); the long-term interest rate \(l_{t-1}=L_{t-1}/100\); the inflation rate \(\pi _{t-1}=(CPI_{t-1}-CPI_{t-2})/CPI_{t-2}\); the term spread \(s_{t-1}=l_{t-1}-r_{t-1}\); and the excess stock return \(Y_{t-1}^{(A)}\). Note that we apply both the single benchmarking approach (Kyriakou et al., 2021a), where only the dependent variable in Eq. (2) is transformed with the benchmark, and the double benchmarking approach, where the predictive variables are also transformed according to

$$\begin{aligned} X_{t-1}^{(A)}=\left\{ \begin{array}{l} \frac{1+X_{t-1}}{B_{t-1}^{(A)}},\quad X\in \{d,e,r,l,\pi \} \\ \frac{s_{t-1}}{B_{t-1}^{(A)}}=\frac{l_{t-1}-r_{t-1}}{B_{t-1}^{(A)}} \\ Y_{t-1}^{(A)} \end{array} \right. ,\quad A\in \{R,L,E,C\}. \end{aligned}$$
(5)

The double benchmarking approach can be seen as a simple way of reducing dimensionality. It allows us to impose more structure in the estimation process, which can help to reduce or circumvent problems caused by the curse of dimensionality. Remember that we apply our methods to annual data, that is, we use sparsely distributed observations in higher dimensions, which limits the complexity of the fitted models.
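For concreteness, the following Python sketch shows one way to construct the benchmarked variables of Eqs. (2) and (5) from an annual data set. The column names (price, dividend, earnings, short_rate, long_rate, cpi) are hypothetical placeholders rather than the original series labels; under the single benchmarking approach, only the column Y would be used and the predictors would enter untransformed.

```python
# Minimal sketch of the benchmarked excess returns (Eq. 2) and double-benchmarked predictors (Eq. 5).
import numpy as np
import pandas as pd

def build_variables(df: pd.DataFrame, benchmark: str = "R") -> pd.DataFrame:
    S = (df["price"] + df["dividend"]) / df["price"].shift(1)   # stock return S_t
    B = {
        "R": 1 + df["short_rate"] / 100,                        # risk-free benchmark
        "L": 1 + df["long_rate"] / 100,                         # long-rate benchmark
        "E": 1 + df["earnings"] / df["price"],                  # earnings benchmark
        "C": df["cpi"] / df["cpi"].shift(1),                    # inflation benchmark
    }[benchmark]
    out = pd.DataFrame(index=df.index)
    out["Y"] = np.log(S / B.shift(1))                           # excess return, Eq. (2)
    # time-lagged raw predictors (each row t holds the value dated t-1)
    d = (df["dividend"] / df["price"]).shift(1)
    e = (df["earnings"] / df["price"]).shift(1)
    r = (df["short_rate"] / 100).shift(1)
    l = (df["long_rate"] / 100).shift(1)
    pi = df["cpi"].pct_change().shift(1)
    s = l - r
    # double benchmarking, Eq. (5): levels become (1 + X_{t-1})/B_{t-1}, the spread s_{t-1}/B_{t-1}
    Blag = B.shift(1)
    for name, x in {"d": d, "e": e, "r": r, "l": l, "pi": pi}.items():
        out[name] = (1 + x) / Blag                              # under benchmark "C", "pi" becomes constant
    out["sp"] = s / Blag
    out["Ylag"] = out["Y"].shift(1)                             # lagged excess return as predictor
    return out
```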

3.2 Estimation and evaluation procedure

In the empirical part, we estimate the unknown conditional mean function m of Eq. (3) with the local-linear smoother which is based on the following minimization problem

$$\begin{aligned} \min _{a,b} \sum _{t=1}^T \bigg (Y_t^{(A)} - a - \left( X_t^{(A)} - x^{(A)}\right) ^{\top }b\bigg )^2 K_h\left( X_t^{(A)} - x^{(A)}\right) , \end{aligned}$$
(6)

where \(K_h\) denotes some kernel function, for example, the standard product kernel \(K_h\left( X_t^{(A)} - x^{(A)}\right) =\prod _{s=1}^q \frac{1}{h_s} k\left( \frac{X_{t,s}^{(A)}-x_s^{(A)}}{h_s}\right) \) which depends on a set of bandwidths \(h=(h_1,\ldots ,h_q)\) and the kernels k of order \(\nu \). The latter are univariate symmetric functions satisfying standard assumptions: \(\int k(u) du = 1\), \(\int u^lk(u)du=0\) (\(l=1,\ldots ,\nu -1\)), and \(\int u^{\nu }k(u)du=:\kappa _{\nu }>0\). \(X_{t,s}^{(A)}\) denotes the sth component of \(X_t^{(A)}\), \(s=1,\ldots ,q\). The solution \(\hat{a}=\hat{a}(x^{(A)})\) of (6) is a consistent estimator of \(m(x^{(A)})\) which depends on the bandwidths h. For a discussion of properties and references for proofs, see, for example, Section 3.1 in Kyriakou et al. (2021a).
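A stylized Python implementation of the minimization in (6), using the quartic product kernel employed in the empirical part, may look as follows; it is a didactic sketch rather than the code behind the reported results, and the function names are ours.

```python
# Minimal sketch of the local-linear estimate hat{a}(x0) from Eq. (6) with a quartic product kernel.
import numpy as np

def quartic(u):
    """Quartic (biweight) kernel of order 2, zero outside [-1, 1]."""
    return np.where(np.abs(u) <= 1.0, 15.0 / 16.0 * (1.0 - u ** 2) ** 2, 0.0)

def local_linear(x0, X, Y, h):
    """Local-linear estimate of m(x0) from X (T x q), Y (T,) with bandwidth vector h (q,)."""
    U = (X - x0) / h                                   # componentwise scaled distances
    w = np.prod(quartic(U), axis=1) / np.prod(h)       # product kernel weights K_h(X_t - x0)
    Z = np.column_stack([np.ones(len(Y)), X - x0])     # design for intercept a and slopes b
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(Z * sw[:, None], Y * sw, rcond=None)   # weighted least squares
    return beta[0]                                     # hat{a}(x0), the estimate of m(x0)
```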

For the choice of the smoothing parameters h, we apply a cross-validation approach to the local-linear estimator and select those bandwidths which minimize

$$\begin{aligned} CV(h) = \sum \limits _{t=1}^T\,\left( Y_{t}^{(A)}-\hat{m}_{-t,h}\right) ^{2}, \end{aligned}$$
(7)

where T is the number of observations in the estimation sample and \(\hat{m}_{-t,h}\) is the leave-k-out estimator for the conditional mean function. It is computed by removing k observations around the tth time point and depends on the horizon of the prediction. Here we focus on the 1-year horizon and use the classical leave-one-out estimator.
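The bandwidth choice can be illustrated with the following sketch, which reuses the illustrative local_linear() function from above and searches a purely exemplary grid of candidate bandwidths; the actual optimization routine may differ.

```python
# Minimal sketch of leave-one-out bandwidth selection via the criterion CV(h) in Eq. (7).
import itertools
import numpy as np

def cv_criterion(h, X, Y):
    """CV(h) of Eq. (7); here k = 1, as used for the 1-year horizon."""
    T = len(Y)
    sse = 0.0
    for t in range(T):
        keep = np.arange(T) != t                          # remove the t-th observation
        m_loo = local_linear(X[t], X[keep], Y[keep], h)   # leave-one-out prediction at X_t
        sse += (Y[t] - m_loo) ** 2
    return sse

def select_bandwidths(X, Y, grid):
    """Pick the bandwidth vector on a grid (one list of candidate values per dimension)."""
    return min(itertools.product(*grid), key=lambda h: cv_criterion(np.asarray(h), X, Y))
```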

Based on the cross-validation criterion in Eq. (7), we introduce next our validation measure used for in-sample model selection. It is a generalization of the validated \(R^2\) (Nielsen & Sperlich, 2003) and is defined as

$$\begin{aligned} R_{V}^{2}=1-\frac{\sum \nolimits _{t=1}^T\left( Y_{t}^{(A)}-\hat{m}_{-t,h}\right) ^{2}}{ \sum \nolimits _{t=1}^T\left( Y_{t}^{(A)}-\bar{Y}_{-t}^{(A)}\right) ^{2}}, \end{aligned}$$
(8)

where leave-k-out estimators (\(\hat{m}_{-t,h}\) and \(\bar{Y}_{-t}^{(A)}\)) are used for the conditional mean function m and for the unconditional (historical) mean of \(Y_{t}^{(A)}\), respectively. The \(R_{V}^{2}\) measures the predictive power of a given model compared to the cross-validated historical mean. A positive \(R_{V}^{2}\) implies that the predictor-based regression model (3) outperforms the corresponding historical average excess stock return over T years. Thus, we use the \(R_V^2\) to rank all possible candidate models and prefer the one with the largest value. The \(R_V^2\) can also be used for bandwidth selection, as maximizing the \(R_V^2\) in Eq. (8) is equivalent to minimizing the cross-validation criterion in Eq. (7). Note further that we apply the \(R_V^2\) also to the linear counterparts of the regression model (3); in this case, \(\hat{m}_{-t,h}\) is simply replaced by the linear predictor based on the leave-k-out OLS estimate \(\hat{\beta }_{-t}\).
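A direct translation of Eq. (8) into code, again reusing the illustrative local_linear() helper and the leave-one-out case k = 1, could read as follows.

```python
# Minimal sketch of the validated R^2 in Eq. (8): forecast and historical mean are both leave-one-out.
import numpy as np

def validated_r2(X, Y, h):
    T = len(Y)
    num = den = 0.0
    for t in range(T):
        keep = np.arange(T) != t
        m_loo = local_linear(X[t], X[keep], Y[keep], h)   # leave-one-out regression forecast
        ybar_loo = Y[keep].mean()                          # leave-one-out historical mean
        num += (Y[t] - m_loo) ** 2
        den += (Y[t] - ybar_loo) ** 2
    return 1.0 - num / den
```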

For out-of-sample evaluation, we use the last \(\tau \) observations in our records to calculate the classical out-of-sample \(R^2\) (Campbell & Thompson, 2008) which is defined as

$$\begin{aligned} R_{oos}^2 = 1 - \frac{\sum \nolimits _{t=T+1}^{T+\tau } \left( Y_t^{(A)} - \hat{m}_t\right) ^2 }{\sum \nolimits _{t=T+1}^{T+\tau } \left( Y_t^{(A)} - \bar{Y}_t^{(A)}\right) ^2 }, \end{aligned}$$
(9)

where \(\hat{m}_t\) is the fitted value from the predictive regression estimated through period T (the last observation in the estimation sample) and evaluated at \(X_{t-1}^{(A)}\) (\(t=T+1,\ldots ,T+\tau \)), and \(\bar{Y}_t^{(A)}\) is the historical average return through period \(t-1\). In other words, we use the estimation sample (\(t=1,\ldots ,T\)) to fix the model by choosing the corresponding bandwidths, and the evaluation in the left-out sample is based on the time-lagged information available through period \(t-1\). A positive \(R_{oos}^2\) indicates that the predictive regression has a lower average mean squared prediction error than the historical average return. As Campbell and Thompson (2008) point out, the historical average has an advantage over predictive regressions because it is based on more observations and more recently available information.
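The following sketch mirrors this evaluation under one plausible reading of the procedure: the bandwidths are fixed on the estimation sample, each forecast uses only information dated through period \(t-1\), and the benchmark is the expanding historical mean. It reuses the illustrative local_linear() helper and is not a definitive reproduction of the original computation.

```python
# Minimal sketch of the out-of-sample R^2 in Eq. (9); rows of X hold the time-lagged predictors.
import numpy as np

def oos_r2(X, Y, T, h):
    """Out-of-sample R^2 over t = T+1, ..., T+tau with bandwidths h fixed on the estimation sample."""
    num = den = 0.0
    for t in range(T, len(Y)):
        m_hat = local_linear(X[t], X[:t], Y[:t], h)   # forecast using information through t-1
        ybar = Y[:t].mean()                           # expanding historical average through t-1
        num += (Y[t] - m_hat) ** 2
        den += (Y[t] - ybar) ** 2
    return 1.0 - num / den
```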

3.3 Forecast combinations

It is well documented in the literature that the combination of M individual forecasts \(\hat{Y}_{t+1}^{(A),m}\) (with \(m=1,\ldots ,M\)), defined as

$$\begin{aligned} \hat{Y}_{t+1}^{comb} = w_1 \hat{Y}_{t+1}^{(A),1} + \ldots + w_M\hat{Y}_{t+1}^{(A),M}, \end{aligned}$$
(10)

may perform better (in terms of higher out-of-sample predictability) than the individual predictions themselves (Bates & Granger, 1969; Granger & Ramanathan, 1984; Rapach et al., 2010). A popular choice is, for example, the simple average of the M different predictors:

$$\begin{aligned} \hat{Y}^{av}_{t+1} = \frac{1}{M}\sum _{m=1}^M \hat{Y}_{t+1}^{(A),m}. \end{aligned}$$
(11)

Each individual forecast gets the same weight \(w_m=1/M\), which, in the case of a multivariate linear predictive model, shrinks the estimated (and probably biased) coefficients by the factor 1/M and reduces the role of multicollinearity when highly correlated predictors are used (Rapach et al., 2010). The simple average (11) allows one to incorporate information from a large number of plausible predictors and helps to prevent in-sample over-fitting (Rapach & Zhou, 2020). However, equal weights can be sub-optimal as one usually wants to give more weight to those forecasts with errors of lower variance (Diebold & Shin, 2019). In addition, when a large number of potential predictors is available, the redundant ones should be excluded, that is, receive a weight of zero. Thus, several ML techniques have been applied to select and weight the relevant predictors in Eq. (10). Popular regularization methods set some weights to zero and shrink the remaining weights toward zero [the ‘classical’ Lasso (Tibshirani, 1996), the ‘adaptive’ Lasso (Zou, 2006), Ridge Regression (Hoerl & Kennard, 1970), or the Elastic Net (ENet) (Zou & Hastie, 2005)] or toward equality [the ‘egalitarian’ Lasso, the ‘egalitarian’ Ridge (Diebold & Shin, 2019), or the ‘combination’ Elastic Net (cENet) (Rapach & Zhou, 2020)].

The underlying penalization problem for the forecast combination methods used in this paper can be summarized as follows:

$$\begin{aligned} \hat{w} = {\mathrm{arg\,min}}_w \Bigg [ \sum _{t=1}^T\left( Y_t^{(A)} - \sum _{m=1}^M w_m \hat{Y}_{t}^{(A),m}\right) ^2 + \lambda \sum _{m=1}^M\bigg \{\alpha |w_m| + (1-\alpha ) w_m^2\bigg \} \Bigg ], \end{aligned}$$
(12)

that is, the Lasso (\(\alpha =1\)), the Ridge (\(\alpha =0\)), and the ENet (\(\alpha \in (0,1)\)) with \(w_m\) (\(m=1,\ldots ,M\)) restricted to be non-negative. In addition, we consider their ‘egalitarian’ versions (eLasso, eRidge, and eENet) using a two-step procedure (Diebold & Shin, 2019): solving the standard problem (12), we (i) select the l important forecasts from the full set of M potential candidates, that is, the \(M-l\) forecasts with weight zero are excluded; and (ii) shrink their combining weights towards equality, that is, toward 1/l. Recently, Rapach and Zhou (2020) proposed a further refinement, the so-called ‘combination’ ENet (cENet). They split the estimation sample into two parts, an initial in-sample period and a ‘holdout’ out-of-sample period, and apply the eENet only to the latter instead of to all available observations. Note further that we use the multivariate regression approach introduced in Sect. 3.1. Therefore, we also account for model complexity, measured by the number of included predictor variables, and combine the different forecasts based on complete subset regressions with dimensionality \(k\in \{1,2,3\}\) (Elliott et al., 2013).
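As an illustration, the sketch below covers the Lasso case (\(\alpha =1\)) of (12) with non-negative weights and a simplified version of the egalitarian second step. The convex-combination factor shrink is our illustrative stand-in for the exact egalitarian penalty of Diebold and Shin (2019), and all variable names are placeholders; the Ridge, ENet, and cENet variants follow analogously.

```python
# Stylized sketch: non-negative Lasso combination weights (Eq. 12) plus a simplified egalitarian step.
import numpy as np
from sklearn.linear_model import Lasso

def egalitarian_lasso_weights(F, y, lam, shrink=0.5):
    """F: (T x M) matrix of individual in-sample forecasts, y: realized excess returns;
    lam and shrink are illustrative tuning parameters."""
    lasso = Lasso(alpha=lam, positive=True, fit_intercept=False)   # w_m >= 0 as in Eq. (12), alpha = 1
    w = lasso.fit(F, y).coef_
    selected = w > 0                                               # step (i): keep the l surviving forecasts
    l = selected.sum()
    if l == 0:                                                     # nothing selected: fall back to the simple average
        return np.full(F.shape[1], 1.0 / F.shape[1])
    w_egal = np.zeros_like(w)
    w_egal[selected] = (1.0 - shrink) * w[selected] + shrink / l   # step (ii): pull weights towards 1/l
    return w_egal

# Combined one-step-ahead forecast as in Eq. (10):
# y_hat_comb = f_new @ egalitarian_lasso_weights(F_insample, y_insample, lam=0.01)
```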

This leads in total to 32 different ways of combining the individual forecasts: each of the eight methods (Lasso, Ridge, ENet, eLasso, eRidge, eENet, cENet, and the simple average) is applied either to all available potential forecasts or restricted to the k-dimensional ones with \(k\in \{1,2,3\}\).

3.4 The data

In the empirical part of this paper, we apply the methods described in Sects. 3.2 and 3.3 to annual US stock market data over the period 1872 to 2022. We use a revised and updated version of the series described in Shiller's Chapter 26 (Shiller, 1989), which consists of the Standard and Poor's (S&P) Composite Stock Price Index, dividends and earnings accruing to the index, a 1-year interest rate, a long government bond yield, and the consumer price index. Note that we had to replace the original risk-free rate series (which was discontinued in 2013) by an annual yield based on the 6-month Treasury-bill rate (secondary market). This new series is only available from 1958 onwards. Therefore, we regressed the Treasury-bill rate on the original commercial paper rate from Shiller's data and instrumented the risk-free rate from 1872 to 1957 with the corresponding predicted values. For more details, see, for example, Kyriakou et al. (2020) or Mammen et al. (2019). Table 1 summarizes the available variables with their basic descriptive statistics for both the in-sample part of the data used for estimation and the out-of-sample part used for the evaluation of predictability. It is evident that most of the variables have a much larger mean and standard deviation in the left-out part. However, we focus on the predictability of excess stock returns, which are very similar in both parts. As an example, Figure 1 shows returns in excess of the risk-free rate with the out-of-sample period highlighted in red. Note that large positive returns have been realized with higher probability in the in-sample part of the data.
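The splicing of the risk-free rate series can be sketched as a simple OLS backcast; the column names tbill_6m and commercial_paper are hypothetical placeholders, and the snippet only illustrates the instrumenting step described above.

```python
# Minimal sketch: extend the risk-free rate back to 1872 via a regression on the commercial paper rate.
import numpy as np
import pandas as pd

def backcast_riskfree(df: pd.DataFrame) -> pd.Series:
    overlap = df.dropna(subset=["tbill_6m", "commercial_paper"])          # years with both series
    slope, intercept = np.polyfit(overlap["commercial_paper"], overlap["tbill_6m"], deg=1)
    fitted = intercept + slope * df["commercial_paper"]                    # predicted risk-free rate
    return df["tbill_6m"].fillna(fitted)                                   # instrumented values before 1958
```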

Table 1 US market data (1872–2022)
Fig. 1

Stock returns in excess of the risk-free rate. In-sample part (black), out-of-sample part (red). Left: Time-series plot, Right: Density estimates. Period: 1872–2022. Data: annual S&P 500. (Color figure online)

4 Results and discussion

In this section, we present and discuss the results of the empirical application. For ease of presentation, we focus on the most important benchmark models of Kyriakou et al. (2021a), the short-term interest rate (single benchmarking) and the inflation rate (double benchmarking). Results for the other benchmarks (single and double benchmarking) are available upon request. Note that the short-term interest rate benchmark directly corresponds to the classical prediction of the risk premium (over a risk-free investment), and the inflation rate benchmark refers to the forecast of real returns as advocated by Merton (2014).

One primary goal of this study is to compare the in-sample predictive power of several methods with the corresponding out-of-sample performance. For this reason, we split the annual US stock market data into two parts: (i) an in-sample period (1872–1962) used for (smoothing) parameter estimation and in-sample validation and (ii) an out-of-sample period (1963–2022) of 60 years used for 1-year-ahead prediction and out-of-sample evaluation. We estimated the nonparametric models with a local-linear kernel smoother using the quartic (product) kernel. The smoothing parameters (bandwidths) were chosen by leave-one-out cross-validation, that is, by maximizing the in-sample performance measure \(R_V^2\) introduced in Sect. 3.2. In other words, the in-sample period is just used to fix the smoothness of the underlying conditional mean function. The prediction itself is then based on the most recent (time-lagged) information. The corresponding linear models were estimated with ordinary least squares (OLS).

4.1 Prediction of the risk-premium

Numerous academic research articles rely on macroeconomic variables to forecast the US equity risk premium. We follow this route and begin by presenting in Table 2 a comparison of in-sample predictive power (measured by the \(R_V^2\)) and out-of-sample performance (measured by the \(R_{oos}^2\)) for several individual models. Based on the in-sample measure, the best five nonparametric models (\(\{sp\}\), \(\{r\}\), \(\{r,sp\}\), \(\{l,sp\}\), \(\{r,l\}\)) have an \(R_V^2\) in the range of 8.8–6.9% and are one- or two-dimensional models, three of them including the term spread as covariate. However, only two of those models (\(\{sp\}\), \(\{l,sp\}\)) perform convincingly out-of-sample and are among the top five predictive models (\(\{l,sp\}\), \(\{Y,d,sp\}\), \(\{e,inf,sp\}\), \(\{l,inf,sp\}\), \(\{sp\}\)), which show an \(R_{oos}^2\) in the range of 15.5–9.1%. Note that the term spread is included in all of these models and that now also some of the three-dimensional models perform reasonably out-of-sample. Nevertheless, most three-dimensional models cannot beat the historical mean over the considered 60-year out-of-sample period. For the linear models, we find a similar set of five best performing (in-sample) models (\(\{sp\}\), \(\{r\}\), \(\{r,l\}\), \(\{r,sp\}\), \(\{l,sp\}\)) with a somewhat lower \(R_V^2\) in the range of 7.9–6.6%. However, only three of those models can beat the historical mean out-of-sample (\(\{e,sp\}\), \(\{r,sp\}\), \(\{l,sp\}\) with an \(R_{oos}^2\) between 4.2% and 7.0%). The five best performing predictive models (\(\{d,r,l\}\), \(\{d,l,sp\}\), \(\{e,r,l\}\), \(\{e,r,sp\}\), \(\{e,l,sp\}\)) are all three-dimensional, with \(R_{oos}^2\) in the range of 13.2–9.3%. Here, the combination of earnings and spread together with an additional variable gives the most promising results.

Table 2 Comparison of predictive power: in-sample (measured by the \(R_V^2\)) versus out-of-sample (measured by the \(R^2_{oos}\))

In a next step, we consider the correlation between the individual forecasts. Figure 2 displays the correlation matrix of forecasts from all one- and two-dimensional nonparametric models for both the in-sample (left) and out-of-sample predictions (right). The ‘ideal’ or best individual predictor would be (highly) positively correlated with the excess stock returns in the out-of-sample period. To improve over such a forecast, the (linear) combination of different individual predictors must (i) be positively correlated with the former and (ii) be composed of positively correlated candidates. When considering forecasts from the two-dimensional models, the left-hand side of Fig. 2 shows a (high) positive correlation for most of them. Thus, one would expect a large potential for improvements in predictive power from forecast combinations that strongly select (shrink) toward only a few predictive candidates based on the in-sample information. However, for the corresponding out-of-sample predictions the correlations are less pronounced and for some even negative. Using the weights fixed in the estimation sample will then not necessarily lead to an improved out-of-sample performance. A theoretical analysis of the factors that determine the gains from combining forecasts, including a discussion of their correlation, can be found, for example, in Timmermann (2006). Note that there are only a few studies which directly account for the possibility of correlation between forecasts (Guerrero & Pena, 2003).

Fig. 2

Correlations of predictions for stock returns in excess of the risk-free rate (for nonlinear models of one or two predictive variables). Left: In-sample, Right: Out-of-sample. Period: 1872–2022. Data: annual S&P 500

Table 3 Comparison of predictive power: in-sample (measured by the \(R_V^2\)) versus out-of-sample (measured by the \(R^2_{oos}\))

As described in Sect. 3.3, forecast combinations are a popular method for further improving forecast quality. Table 3 summarizes the 32 different versions of such combinations, making use of the individual forecasts shown in Table 2. When using all 62 different nonparametric forecasts, the ENet (25.1%), the Ridge (22.3%), the Lasso (21.6%), and the eLasso (8.8%) improve in-sample over the individual models. However, none of those combinations produces forecasts that beat the historical mean out-of-sample, that is, none of them has out-of-sample predictive power. This finding shows that those methods are prone to in-sample over-fitting when too many candidate models are available (even when these methods are validated against the mean). The situation is different when the possible candidates are restricted to be one-dimensional. Now, the ENet, the Lasso, and the eLasso have both higher in-sample and higher out-of-sample power than the individual models (\(R_V^2=\)13.1%, 11.4%, 9.2%, and \(R_{oos}^2=\)17.1%, 17.0%, 16.1%). Note that all of those combine individual forecasts based on the term spread and the long-term interest rate. In terms of out-of-sample improvements, restrictions to two- or three-dimensional individual forecasts are less successful strategies. For example, in the two-dimensional case, the ENet (\(R_{V}^2=11.0\%\)) selects the forecasts of the following four individual models: \(\{Y,sp\}\), \(\{d,r\}\), \(\{e,r\}\), and \(\{e,inf\}\), which are highly correlated in-sample. The combined forecast improves over the individual ones in-sample but is far away from the out-of-sample predictive power of the best individual model. For the linear counterpart, only a few of the forecast combination methods are able to increase the in-sample performance. For example, the Ridge and the ENet restricted to three-dimensional individual models show \(R_V^2\) values of 9.5% and 8.0%. However, only one of the 32 ways of combining individual forecasts was able to improve out-of-sample over the best individual three-dimensional model (the eLasso with \(R_{oos}^2=13.6\%\)). Note also that in-sample over-fitting is not such an issue in the linear case because the higher-dimensional models do not include interaction terms. This is also the reason for the very similar results when accounting for model complexity (that is, similar \(R_{oos}^2\) values in all the panels of Table 3). The nonlinear and linear cases have in common that the recently proposed refinement of the elastic net, the cENet, was hardly able to beat the historical mean at all. For the linear models, it was even the only one of the eight different ways of combining individual forecasts that produced negative \(R_{oos}^2\) values.

Table 4 Comparison of predictive power: out-of-sample mean squared error for the full sample period (full), during recessions (rec) and expansions (exp)

Now, we address the question of how the models performing best out-of-sample behave during recessions and economic expansions. For this purpose, we calculate the out-of-sample mean squared error during the aforementioned sub-samples, based on the US business cycle expansion and contraction data provided by the NBER, and for the full period. Note that in the 60-year out-of-sample period only 8 years have been classified as recession years (that is, with more than 6 months of a recession). Table 4 shows a comparison of these out-of-sample measures for the individual models, while Table 5 focuses on the forecast combination methods. It is evident that the (nonparametric and linear) models with the smallest out-of-sample mean squared error over the full period (and thus the largest \(R_{oos}^2\) values) belong to the best performing models during economic expansions. Note that such models perform only slightly better than the historical mean during the recessions. There are several models which perform reasonably well during the recessions (for example, \(\{e,l,inf\}\) or \(\{d,l,inf\}\)). However, they have in common that they cannot beat the historical mean during economic expansions. A similar conclusion can be drawn for the forecast combination methods. Only a few of them can improve over the best individual models during the full period and the expansions, while none of those methods improves during the recessions.

We finish the empirical analysis for the risk premium by checking the robustness of the considered models over time. For this purpose, we increased the in-sample period stepwise from 89 to 124 years (and reduced the out-of-sample evaluation period correspondingly from 60 to 35 years). Figure 3 shows the \(R_V^2\) (left) and the \(R_{oos}^2\) (right) for the models with the largest out-of-sample \(R^2\) (we show the best three nonparametric and the best three linear models, respectively). Note that for the nonparametric models, only individual models give reasonable results over time. The best three are \(\{sp\}\), \(\{e,sp\}\), and \(\{d,inf,sp\}\). In contrast, most forecast combination models suffered from negative \(R_{oos}^2\) values during the out-of-sample periods 1975–2022 or 1980–2022. For the linear models, the situation is quite different: only three individual models but most forecast combination models performed steadily over time. The best three in this case are the Lasso and the ENet over all individual models, and the Lasso over all three-dimensional models. Figure 3 shows as well that (i) the in-sample performance of the displayed models increases steadily over time and (ii) the out-of-sample performance remains stable over the first half of the considered period but drops sharply at its end. A possible explanation is that, as the out-of-sample period gets shorter and shorter, it becomes increasingly dominated by the large negative returns during the Great Recession, which was caused by the Global Financial Crisis (compare also Fig. 1). To summarize, the model which performed best in terms of a high and stable \(R_{oos}^2\) was the nonparametric model based on the term spread as covariate.
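A stylized sketch of this robustness check, reusing the illustrative select_bandwidths() and oos_r2() helpers from Sect. 3.2 and assuming that the bandwidths are re-selected on each extended in-sample period, is given below; it is meant only to fix the mechanics of the expanding-window exercise.

```python
# Minimal sketch of the expanding in-sample window used for the robustness check over time.
import numpy as np

def expanding_window_oos(X, Y, first_T, last_T, grid):
    """Out-of-sample R^2 for in-sample lengths first_T, ..., last_T (e.g. 89 to 124 years)."""
    results = {}
    for T in range(first_T, last_T + 1):
        h = np.asarray(select_bandwidths(X[:T], Y[:T], grid))   # bandwidths re-chosen on the longer sample
        results[T] = oos_r2(X, Y, T, h)                         # evaluated on the shrinking left-out period
    return results
```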

Table 5 Comparison of predictive power: out-of-sample mean squared error for the full sample period (full), during recessions (rec) and expansions (exp)
Fig. 3

Robustness over time (increasing in-sample period) for selected models for stock returns in excess of the risk-free rate. Left: \(R_V^2\), Right: \(R_{oos}^2\). Period: 1872–2022. Data: annual S&P 500

4.2 Prediction of real returns

Real-income protection is one of the main aspects of long-term pension planning (Merton, 2014; Gerrard et al., 2018, 2019). Therefore, the underlying financial model used when optimizing the investment asset allocation for the long term should reflect these needs in real terms. We apply here the double benchmarking approach of Kyriakou et al. (2021a) with inflation as the reference rate, that is, all variables are measured net of inflation. Note that inflation itself cannot be included as a covariate because it is transformed to a constant. Therefore, only 40 different models are possible under the inflation benchmark (instead of the 62 under the single benchmarking with the risk-free rate in Sect. 4.1) because all combinations which include inflation as a covariate are redundant. Table 6 presents the comparison of in-sample performance (measured by the \(R_V^2\)) and out-of-sample predictive power (measured by the \(R_{oos}^2\)) for the individual models. Based on the in-sample measure, the best five nonparametric models (\(\{e,sp\}\), \(\{Y,e,sp\}\), \(\{r,sp\}\), \(\{l,sp\}\), \(\{r,l\}\)) have an \(R_V^2\) in the range of 18.3–16.6%. We again find the term spread to be included in most of these models. However, only the model \(\{e,sp\}\) performs convincingly out-of-sample (\(R_{oos}^2=13.9\%\)), being one of the five best predictive models (\(\{d,e,sp\}\), \(\{e,sp\}\), \(\{e,r,sp\}\), \(\{e,l,sp\}\), \(\{e,r,l\}\)), which show an \(R_{oos}^2\) in the range of 13.9–13.0%. Note that the combination of real earnings and spread is included in four of those models. For the linear case, we find the same set of five best performing (in-sample) models with an \(R_V^2\) in the range of 18.5–16.9%. However, only two of those models beat the historical mean out-of-sample (\(\{e,sp\}\) and \(\{Y,e,sp\}\) with an \(R_{oos}^2\) of 13.3% and 11.1%, respectively). The best five predictive models (\(\{d,e,sp\}\), \(\{e,sp\}\), \(\{e,r,l\}\), \(\{e,r,sp\}\), \(\{e,l,sp\}\)) all include the variable combination of real earnings and the spread, in most cases together with an additional covariate. Their \(R_{oos}^2\) values are in the range of 13.4–12.9%.

Table 6 Comparison of predictive power: in-sample (measured by the \(R_V^2\)) versus out-of-sample (measured by the \(R^2_{oos}\))

When considering the correlation matrix of the in-sample forecasts from all one- and two-dimensional nonparametric models, which is displayed in Fig. 4 (left-hand side), one can observe that most predictors are highly positively correlated. The three exceptions are the models \(\{Y\}\), \(\{sp\}\), and \(\{Y,sp\}\), which are indeed the models with the lowest \(R_V^2\) values, that is, the worst in-sample performance. The correlations for the out-of-sample forecasts shown in Fig. 4 (right-hand side) are less pronounced but remain mostly positive (in contrast to the risk-free rate benchmark discussed above).

Fig. 4

Correlations of predictions for stock returns in excess of the inflation rate (for nonlinear models of one or two predictive variables). Left: In-sample, Right: Out-of-sample. Period: 1872–2022. Data: annual S&P 500

In the next step, we focus on the in-sample and out-of-sample performance of the 32 different forecast combination models. The corresponding results are shown in Table 7. When using all 40 available models, we find, similar to before, that the ENet (27.1%), the Ridge (23.8%), and the Lasso (23.7%) largely improve in-sample over the individual models. However, none of these forecast combinations produces forecasts with improved predictive power out-of-sample compared to the individual models. We thus confirm our finding from the risk-free benchmark that those methods are prone to over-fitting. The restriction to one-dimensional models reduces the in-sample \(R_V^2\) somewhat (ENet: 21.8%, Ridge: 20.1%, Lasso: 19.2%), but all of these models now have \(R^2_{oos}\) values larger than 12.7%. However, none of the forecast combination models is able to improve out-of-sample over the best individual model. Note that the cENet seems to perform reasonably under the double inflation benchmark. The reason is that it selects only the single two-dimensional model based on real earnings and the term spread, one of the best individual candidate models. For the linear counterpart, the situation is similar. In-sample, mainly the ENet improves over the individual models. Out-of-sample, however, none of the 32 ways of combining individual forecasts was able to increase the predictive power over the best individual three-dimensional model.

Table 7 Comparison of predictive power: in-sample (measured by the \(R_V^2\)) versus out-of-sample (measured by the \(R^2_{oos}\))
Table 8 Comparison of predictive power: out-of-sample mean squared error for the full sample period (full), during recessions (rec) and expansions (exp)

For the performance during recessions and economic expansions, we find similar patterns as under the risk-free rate benchmark. Table 9 shows the mean squared errors for the individual models and Table 8 those for the forecast combination models. One observes again that the (nonparametric and linear) models with the smallest out-of-sample mean squared error over the full period perform excellently during economic expansions but are only slightly able to beat the historical mean during the recessions. There are again several models which perform very promisingly during recessions (for example, \(\{Y,d,l\}\) or \(\{Y,d,r\}\)). During the economic expansions, however, these models cannot beat the historical mean at all.

Table 9 Comparison of predictive power: out-of-sample mean squared error for the full sample period (full), during recessions (rec) and expansions (exp)

We finish the empirical part with the robustness check over time. Figure 5 shows the \(R_V^2\) (left) and the \(R_{oos}^2\) (right) for the models with the largest out-of-sample \(R^2\) (we restrict ourselves again to the best three nonparametric and the best three linear models, respectively). In contrast to the risk-free benchmark, the performance over time is more stable under the inflation benchmark. For the nonparametric models, 8 of the individual models and 18 of the forecast combinations were able to beat the historical mean in each setting. However, the three best performing models are all individual models: \(\{e,sp\}\), \(\{d,e,sp\}\), and \(\{e,l,sp\}\). For the linear models, a similar set of 8 individual models and 28 of the forecast combinations yield positive \(R_{oos}^2\) values over time. Again, the three best performing models are individual models: \(\{d,e,sp\}\), \(\{e,r,l\}\), and \(\{e,l,sp\}\). Figure 5 also shows that the in-sample performance is now quite stable over time. However, the out-of-sample measure drops sharply at the end of the considered period. Note also that nonparametric and linear models are close together in-sample as well as out-of-sample. Nevertheless, the nonparametric models based on the covariates \(\{e,sp\}\) and \(\{d,e,sp\}\) performed best in terms of a large and stable \(R_{oos}^2\) and are thus the preferred model choices.

Fig. 5

Robustness over time (increasing in-sample period) for selected models for stock returns in excess of the inflation rate. Left: \(R_V^2\), Right: \(R_{oos}^2\). Period: 1872–2022. Data: annual S&P 500

5 Conclusion

In this paper, we analyse whether forecast combinations of stock returns in excess of different benchmarks are able to improve over the individual models. Our focus lies thereby on nonlinear predictive functions estimated by a fully nonparametric smoother with the covariates and smoothing parameters chosen by cross-validation. We extend the approach of Kyriakou et al. (2021a) to three-dimensional models and find for some of them a reasonable performance both in-sample and out-of-sample. However, the low number of observations in the estimation sample limits the complexity of the fitted models. This reduces the probability of choosing such models, as their in-sample measure is worse when compared to simpler models of lower dimensionality. Note that the reason lies in the fact that the rate of convergence of the local-linear smoother is slower than, for example, in parametric regression (Hansen, 2008a).

We find further that the classical shrinkage methods (Lasso, Ridge, ENet) are prone to in-sample over-fitting when all individual forecasts are used as possible candidates. As a consequence, the suggested predictive power is spurious and the out-of-sample performance is indeed very poor. The restriction to one-dimensional candidates helps to balance in-sample and out-of-sample behaviour and improves the out-of-sample predictive power. We also find that forecast combinations perform better than the simple historical average. However, the individual nonparametric models outperform linear models and combination forecasts throughout. Finally, the double inflation benchmark results in a more stable performance compared to the single risk-free benchmark.

Recently, there has been a fast growth of methodology to process data for financial applications. This again presents us with the challenge of making sure that more and more complex methodology is indeed also better than simpler methods. It is well-known that complexity often comes with a price. With this study, we come to the conclusion that a simple benchmark methodology is indeed as good as a selected collection of the most popular, but also more complex and less transparent, modern ML-type approaches. Also for the financial practitioner, implementing the model and communicating the results are simply easier with a simpler methodology. So, it is important for policy making and the financial planning of long-term saving that complexity and lack of transparency are only introduced into the econometric modelling when absolutely necessary. In the important challenge of understanding long-term financial returns based on econometric modelling, the conclusion of this study seems to be that complexity does not pay off well enough and that it is better to use simpler benchmark methods.