1 Introduction

Machine learning provides a toolbox of powerful methods that excel in static prediction problems such as face recognition [37], language translation [12], and playing board games [41]. The recent literature suggests that machine learning methods can also outperform conventional models in forecasting problems; see, e.g., [4] for bond risk premia, [15] for recessions, and [5] for financial crises. Predicting macroeconomic dynamics is challenging. Relationships between variables may not hold over time, and shocks such as recessions or financial crises might lead to a breakdown of previously observed relationships. Nevertheless, several studies have shown that machine learning methods outperform econometric baselines in predicting unemployment, inflation, and output [38, 9].

While machine learning models learn meaningful relationships between variables from the data, these relationships are not directly observable, which has led to the criticism that models such as random forests and neural networks are opaque black boxes. However, as we demonstrate, there exist approaches that can make machine learning predictions transparent and even allow for statistical inference.

We have organized this chapter as a guiding example for how to combine improved performance and statistical inference for machine learning models in the context of macroeconomic forecasting.

We start by comparing the forecasting performance and inference on various machine learning models to more commonly used econometric models. We find that machine learning models outperform econometric benchmarks in predicting 1-year changes in US unemployment. Next, we address the black box critique by using Shapley values [44, 28] to depict the nonlinear relationships learned by the machine learning models and then test their statistical significance [24]. Our method closes the gap between two distinct data modelling objectives, using black box machine learning methods to maximize predictive performance and statistical techniques to infer the data-generating process [8].

While several studies have shown that multivariate machine learning models can be useful for macroeconomic forecasting [38, 9, 31], little research has tried to explain the machine learning predictions. Coulombe et al. [13] show more generally that the success of machine learning models in macro-forecasting can be attributed to their ability to exploit nonlinearities in the data, particularly at longer time horizons. However, we are not aware of any macroeconomic forecasting study that attempted to identify the functional form learned by the machine learning models.Footnote 1 Addressing the explainability of models is nevertheless important when model outputs inform decisions, given the intertwined ethical, safety, privacy, and legal concerns about the application of opaque models [14, 17, 20]. There is a debate about the level of model explainability that is necessary. Lipton [27] argues that a complex machine learning model does not need to be less interpretable than a simpler linear model if the latter operates on a more complex space, while Miller [32] suggests that humans prefer simple explanations, i.e., those providing fewer causes and explaining more general events, even though these may be biased.

Therefore, with our focus on explainability, we consider a small but diverse set of variables to learn a forecasting model, while the forecasting literature often relies on many variables [21] or latent factors that summarize individual variables [43]. In the machine learning literature, approaches to interpreting machine learning models usually focus on measuring how important input variables are for prediction. These variable attributions can be either global, assessing variable importance across the whole data set [23, 25] or local, by measuring the importance of the variables at the level of individual observations. Popular global methods are permutation importance or Gini importance for tree-based models [7]. Popular local methods are LIMEFootnote 2 [34], DeepLIFTFootnote 3 [40] and Shapley values [44]. Local methods decompose individual predictions into variable contributions [36, 45, 44, 34, 40, 28, 35]. The main advantage of local methods is that they uncover the functional form of the association between a feature and the outcome as learned by the model. Global methods cannot reveal the direction of association between a variable and the outcome of interest. Instead, they only identify variables that are relevant on average across all predictions, which can also be achieved via local methods and averaging attributions across all observations.

For model explainability in the context of macroeconomic forecasting, we suggest that local methods that uncover the functional form of the data generating process are most appropriate. Lundberg and Lee [28] demonstrate that Shapley values, a local method, offer a unified framework for LIME and DeepLIFT with appealing properties. We chose to use Shapley values in this chapter because of their important property of consistency: when the impact of a feature in a model increases, the feature’s estimated attribution for a prediction does not decrease, independent of all other features. Originally, Shapley values were introduced in game theory [39] as a way to determine the contribution of individual players in a cooperative game. Shapley values estimate the increase in the collective pay-off when a player joins all possible coalitions with other players. Štrumbelj and Kononenko [44] used this approach to estimate the contribution of variables to a model prediction, where the variables and the predicted value are analogous to the players and payoff in a game.

The global and local attribution methods mentioned here are descriptive—they explain the drivers of a model’s prediction but they do not assess a model’s goodness-of-fit or the predictors’ statistical significance. These concepts relate to statistical inference and require two steps: (1) measuring or estimating some quantity, such as a regression coefficient, and (2) inferring how certain one is in this estimate, e.g., how likely is it that the true coefficient in the population is different from zero.

The econometric approach of statistical inference for machine learning is mostly focused on measuring low-dimensional parameters of interest [10, 11], such as treatment effects in randomized experiments [2, 47]. However, in many situations we are interested in estimating the effects for all variables included in a model. To the best of our knowledge, there exists only one general framework that performs statistical inference jointly on all variables used in a machine learning prediction model to test for their statistical significance [24]. The framework is called Shapley regressions, where an auxiliary regression of the outcome variable on the Shapley values of individual data points is used to identify those variables that significantly improve the predictions of a nonlinear machine learning model. We will discuss this framework in detail in Sect. 4. Before that, we will describe the data and the forecasting methodology (Sect. 2) and present the forecasting results (Sect. 3). We conclude in Sect. 5.

2 Data and Experimental Setup

We first introduce the necessary notation. Let y and \(\hat {y} \in \mathbb {R}^{m}\) be the observed and predicted continuous outcome, respectively, where m is the number of observations in the time series.Footnote 4 The feature matrix is denoted by \(x \in \mathbb {R}^{m \times n}\), where n is the number of features in the dataset. The feature vector of observation i is denoted by \(x_i\). Generally, we use i to index the point in time of the observation and k to index features. While our empirical analysis is limited to numerical features, the forecasting methods as well as the techniques to interpret their predictions also work when the data contains categorical features. These just need to be transformed into binary variables, each indicating membership of a category.

2.1 Data

We use the FRED-MD macroeconomic database [30]. The data contains monthly series of 127 macroeconomic indicators of the USA between 1959 and 2019. Our outcome variable is unemployment and we choose nine variables as predictors, each capturing a different macroeconomic channel. We add the slope of the yield curve as a variable by computing the difference of the interest rates of the 10-year treasury note and the 3-month treasury bill. The authors of the database suggest specific transformations to make each series stationary. We use these transformations, which are (for a variable a): (1) changes \((a_i - a_{i-l})\), (2) log changes \((\log_e a_i - \log_e a_{i-l})\), and (3) second-order log changes \(((\log_e a_i - \log_e a_{i-l}) - (\log_e a_{i-l} - \log_e a_{i-2l}))\). As we want to predict the year-on-year change in unemployment, we set l to 12 for the outcome and the lagged outcome when used as a predictor. For the remaining predictors, we set l = 3 in our baseline setup. This generally leads to the best performance (see Table 3 for other choices of l). Table 1 shows the variables, with the respective transformations and the series names in the original database. The augmented Dickey-Fuller test confirms that all transformed series are stationary (p < 0.01).
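The three transformations can be illustrated with a minimal sketch. The function names and the toy series are assumptions made for this illustration and are not taken from the FRED-MD tooling; undefined leading entries are marked with `None`.

```python
import math

def change(a, l):
    """Transformation (1): a_i - a_{i-l}; the first l entries are undefined."""
    return [None] * l + [a[i] - a[i - l] for i in range(l, len(a))]

def log_change(a, l):
    """Transformation (2): log a_i - log a_{i-l}."""
    return change([math.log(v) for v in a], l)

def second_order_log_change(a, l):
    """Transformation (3): (log a_i - log a_{i-l}) - (log a_{i-l} - log a_{i-2l})."""
    logs = [math.log(v) for v in a]
    return [None] * (2 * l) + [
        (logs[i] - logs[i - l]) - (logs[i - l] - logs[i - 2 * l])
        for i in range(2 * l, len(a))
    ]

# Made-up monthly values, transformed with the baseline horizon l = 3.
series = [100.0, 102.0, 101.0, 105.0, 110.0, 108.0, 112.0]
quarterly_change = change(series, 3)
quarterly_log_change = log_change(series, 3)
```

With l = 12 the same functions would give the year-on-year change used for the outcome.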

Table 1 Series used in the forecasting experiment. The middle column shows the transformations suggested by the authors of the FRED-MD database and the right column shows the names in that database
Table 2 Forecasting performance for the different prediction models. The models are ordered by decreasing RMSE on the whole sample with the errors of the random forest set to unity. The forest’s MAE and RMSE (full period) are 0.574 and 0.763, respectively. The asterisks indicate the statistical significance of the Diebold-Mariano test, comparing the performance of the random forest with the other models, with significance levels ∗ p < 0.1; ∗∗ p < 0.05; ∗∗∗ p < 0.01

2.2 Models

We test three families of models that can be formalized in the following way assuming that all variables have been transformed according to Table 1.

  • The simple linear lag model only uses the 1-year lag of the outcome variable as a predictor: \(\hat {y}_i = \alpha + \theta _0 y_{i-12}\).

  • The autoregressive model (AR) uses several lags of the response as predictors: \({\hat {y}_i = \alpha + \sum _{l = 1}^{h} \theta _l y_{i-l}}\). We test AR models with a horizon 1 ≤ h ≤ 12, chosen by the Akaike Information Criterion [1].

  • The full information models use the 1-year lag of the outcome and 1-year lags of the other features as independent variables: \(\hat {y}_i = f(y_{i-12}, x_{i-12})\), where f can be any prediction model. For example, if f is a linear regression, \(f(y_{i-12},x_{i-12}) = \alpha + \theta _0y_{i-12} + \sum _{k= 1}^{n} \theta _kx_{i-12,k}\). To simplify this notation, we imply in the following that the lagged outcome is included in the feature matrix x. We test five full information models: ordinary least squares regression and Lasso regularized regression [46], and three machine learning regressors: random forest [7], support vector regression [16], and artificial neural networks [22].Footnote 5
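The lead-lag structure shared by the simple linear lag model and the full information models can be sketched as follows. The helper `make_lagged_pairs` is a hypothetical name introduced for this illustration; it assumes that `y` and the rows of `x` have already been transformed as in Table 1 and are aligned in time.

```python
def make_lagged_pairs(y, x, lag=12):
    """Pair each target y_i with the predictors observed `lag` periods earlier.

    Returns (features, targets), where features[j] = [y_{i-lag}, x_{i-lag,1}, ...]
    and targets[j] = y_i for i = lag, ..., len(y) - 1.
    """
    features, targets = [], []
    for i in range(lag, len(y)):
        features.append([y[i - lag]] + list(x[i - lag]))
        targets.append(y[i])
    return features, targets

# Toy data: 15 months, one extra predictor.
y_toy = [float(i) for i in range(15)]
x_toy = [[10.0 * i] for i in range(15)]
features, targets = make_lagged_pairs(y_toy, x_toy, lag=12)
```

Keeping only the first column of `features` gives the input of the simple linear lag model; any regressor f can be fitted on the full rows.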

2.3 Experimental Procedure

We evaluate how all models predict changes in unemployment 1 year ahead. After transforming the variables (see Table 1) and removing missing values, the first observation in the training set is February 1962. All methods are evaluated on the 359 data points of the forecasts between January 1990 and November 2019 using an expanding window approach. We recalibrate the full information and simple linear lag models every 12 months such that each model makes 12 predictions before it is updated. The autoregressive model is updated every month. Due to the lead-lag structure of the full information and simple linear lag models, we have to create an initial gap between training and test set when making predictions to avoid a look-ahead bias. For a model trained on observations 1…i, the earliest observation in the test set that provides a true 12-month forecast is i + 12. For observations i + 1, …, i + 11, the time difference to the last observed outcome in the training set is smaller than a year.
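The expanding window with an initial gap can be sketched as an index generator. This is an illustration under stated assumptions, not the authors' code: indices are 0-based, `first_train_size` is a hypothetical parameter, and the optional `window` argument (also hypothetical) switches to a sliding window of fixed length.

```python
def expanding_window_splits(n_obs, first_train_size, gap=12, step=12, window=None):
    """Yield (train_indices, test_indices) pairs.

    The first test index lies `gap` positions after the last training index,
    so every test point is a true `gap`-step-ahead forecast. Each fitted model
    is used for `step` consecutive predictions before the training set grows.
    """
    last_train = first_train_size - 1  # index of the last training observation
    while last_train + gap < n_obs:
        train_start = 0 if window is None else max(0, last_train + 1 - window)
        test_start = last_train + gap
        test_end = min(test_start + step, n_obs)
        yield list(range(train_start, last_train + 1)), list(range(test_start, test_end))
        last_train += step

splits = list(expanding_window_splits(n_obs=60, first_train_size=24))
```

Observations between the last training index and the first test index are skipped on purpose, because their distance to the last observed outcome is shorter than the forecast horizon.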

All machine learning models that we tested have hyperparameters. We optimize their values in the training sets using fivefold cross-validation.Footnote 6 As this is computationally expensive, we conduct the hyperparameter search every 36 months with the exception of the computationally less costly Lasso regression, whose hyperparameters are updated every 12 months.

To increase the stability of the full information models, we use bootstrap aggregation, also referred to as bagging. We train 100 models on different bootstrapped samples (of the same size as the training set) and average their predictions. We do not use bagging for the random forest as, by design, each individual tree is already calibrated on a different bootstrapped sample of the training set.
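Bootstrap aggregation can be sketched with NumPy as follows. Ordinary least squares stands in for an arbitrary full information model, and the function names and the default seed are assumptions made for this example.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit with an intercept; returns the coefficient vector."""
    X1 = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef

def predict_ols(coef, X):
    """Predictions of a model fitted with fit_ols."""
    return np.column_stack([np.ones(len(X)), X]) @ coef

def bagged_predict(X_train, y_train, X_test, n_models=100, seed=0):
    """Average the predictions of models fitted on bootstrapped training sets."""
    rng = np.random.default_rng(seed)
    m = len(y_train)
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, m, size=m)  # draw m rows with replacement
        coef = fit_ols(X_train[idx], y_train[idx])
        preds.append(predict_ols(coef, X_test))
    return np.mean(preds, axis=0)
```

Replacing `fit_ols` and `predict_ols` with the fit and predict steps of another regressor gives the bagged version of that model.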

3 Forecasting Performance

3.1 Baseline Setting

Table 2 shows three measures of forecasting performance: the correlation of the observed and predicted response, the mean absolute error (MAE), and the root mean squared error (RMSE). The latter is the main metric considered, as most models minimize RMSE during training. The models are ordered by decreasing RMSE on the whole test period between 1990 and 2019. The random forest performs best and we divide the MAE and RMSE of all models by that of the random forest for ease of comparison.
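The three performance measures and the normalization by a reference model can be written compactly. The sketch below uses plain Python; `relative_errors` and its `reference` argument are hypothetical names for this illustration.

```python
import math

def mae(y, y_hat):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)

def rmse(y, y_hat):
    """Root mean squared error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))

def correlation(y, y_hat):
    """Pearson correlation between observed and predicted values."""
    my, mh = sum(y) / len(y), sum(y_hat) / len(y_hat)
    cov = sum((a - my) * (b - mh) for a, b in zip(y, y_hat))
    var_y = sum((a - my) ** 2 for a in y)
    var_h = sum((b - mh) ** 2 for b in y_hat)
    return cov / math.sqrt(var_y * var_h)

def relative_errors(y, predictions, reference):
    """Divide each model's MAE and RMSE by those of the reference model."""
    ref_mae = mae(y, predictions[reference])
    ref_rmse = rmse(y, predictions[reference])
    return {
        name: (mae(y, p) / ref_mae, rmse(y, p) / ref_rmse)
        for name, p in predictions.items()
    }
```

By construction the reference model (the random forest in Table 2) has a relative MAE and RMSE of one.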

Table 3 Performance for different parameter specifications. The shown metric is RMSE divided by the RMSE of the random forest in the baseline setup

Table 2 also breaks down the performance in three periods: the 1990s and the period before and after the onset of the global financial crisis in September 2008. We statistically compare the RMSE and MAE of the best model, the random forest, against all other models using a Diebold-Mariano test. The asterisks indicate the p-value of the tests.Footnote 7
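A basic form of the Diebold-Mariano test can be sketched as follows. The choices here are assumptions for illustration and may differ from the settings used for Table 2: squared-error loss, a long-run variance built from autocovariances up to lag h − 1 with equal weights, and a normal approximation for the two-sided p-value.

```python
import math

def diebold_mariano(e1, e2, h=1):
    """Diebold-Mariano statistic and two-sided p-value for squared-error loss.

    e1, e2: forecast errors of two models on the same test observations.
    h: forecast horizon; autocovariances up to lag h - 1 enter the variance.
    """
    d = [a * a - b * b for a, b in zip(e1, e2)]  # loss differential
    T = len(d)
    d_bar = sum(d) / T

    def autocov(k):
        return sum((d[t] - d_bar) * (d[t - k] - d_bar) for t in range(k, T)) / T

    long_run_var = autocov(0) + 2 * sum(autocov(k) for k in range(1, h))
    if long_run_var <= 0:
        raise ValueError("non-positive variance estimate; try another kernel or horizon")
    stat = d_bar / math.sqrt(long_run_var / T)
    p_value = math.erfc(abs(stat) / math.sqrt(2))  # equals 2 * (1 - Phi(|stat|))
    return stat, p_value
```

A positive statistic indicates that the first model has the larger average squared error.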

Apart from support vector regression (SVR), all machine learning models outperform the linear models on the whole sample. The inferior performance of SVR is not surprising as it does not minimize a squared error metric such as RMSE but a metric similar to MAE which is lower for SVR than for the linear models. In the 1990s and the periods before the global financial crisis, there are only small differences in performance between the models, with the neural network being the most accurate model. Only after the onset of the crisis does the random forest outperform the other models by a large and statistically significant margin.

Figure 1 shows the observed response variable and the predictions of the random forest, the linear regression, and the AR. The vertical dashed lines indicate the different time periods distinguished in Table 2. The predictions of the random forest are more volatile than those of the regression and the AR.Footnote 8 All models underestimate unemployment during the global financial crisis and overestimate it during the recovery. However, the random forest is least biased in those periods and forecasts high unemployment earliest during the crisis. This shows that its relatively high forecast volatility can be useful in registering negative turning points. A similar observation can be made after the burst of the dotcom bubble in 2000. This points to an advantage of machine learning models associated with their greater flexibility in incorporating new information as it arrives. This can be intuitively understood as adjusting model predictions locally, e.g., in regions (periods) of high unemployment, while a linear model needs to realign the full (global) model hyperplane.

Fig. 1 Observed and predicted 1-year change in unemployment for the whole forecasting period comparing different models

3.2 Robustness Checks

We altered several parameters in our baseline setup to investigate their effects on the forecasting performance. The results are shown in Table 3. The RMSE of alternative specifications is again divided by the RMSE of the random forest in the baseline setup for a clearer comparison.

  • Window size. In the baseline setup, the training set grows over time (expanding window). This can potentially improve the performance over time as more observations may facilitate a better approximation of the true data generating process. On the other hand, it may also make the model sluggish and prevent quick adaptation to structural changes. We test sliding windows of 60, 120, and 240 months. Only the simplest model, linear regression with only a lagged response, profits from a short horizon; the remaining models perform best with the biggest possible training set. This is not surprising for machine learning models, as they can “memorize” different sets of information through the incorporation of multiple specifications in the same model. For instance, different paths down a tree model, or different trees in a forest, are all different submodels, e.g., characterizing different time periods in our setting. By contrast, a simple linear model cannot adjust in this way and needs to fit the best hyperplane to the current situation, explaining its improved performance for some fixed window sizes.

  • Change horizon. In the baseline setup, we use a horizon of 3 months, when calculating changes, log changes, and second-order log changes of the predictors (see Table 1). Testing the horizons of 1, 6, 9, and 12 months, we find that 3 months generally leads to the best performance of all full information models. This is useful from a practical point of view, as quarterly changes are one of the main horizons considered for short-term economic projections.

  • Bootstrap aggregation (bagging). The linear regression, neural network, and SVR all benefit from averaging the prediction of 100 bootstrapped models. The intuition is that our relatively small dataset likely leads to models with high variance, i.e., overfitting. The bootstrap aggregation of models reduces the models’ variance and the degree of overfitting. Note that we do not expect much improvement for bagged linear models, as different draws from the training set are likely to lead to similar slope parameters resulting in almost identical models. This is confirmed by the almost identical performance of the single and bagged model.

4 Model Interpretability

4.1 Methodology

We saw in the last section that machine learning models outperform conventional linear approaches in a comprehensive economic forecasting exercise. Improved model accuracy is often the principal reason for applying machine learning models to a problem. However, especially in situations where model results are used to inform decisions, it is crucial to both understand and clearly communicate modelling results. This brings us to a second step when using machine learning models—explaining them.

Here, we introduce and compare two different methods for interpreting machine learning forecasting models: permutation importance [7, 18] and Shapley values and regressions [44, 28, 24]. Both approaches are model-agnostic, meaning that they can be applied to any model, unlike other approaches, such as Gini impurity [25, 19], which are only compatible with specific machine learning methods. Both methods allow us to understand the relative importance of model features. For permutation importance, variable attribution is at the global level while Shapley values are constructed locally, i.e., for each single prediction. We note that both importance measures require column-wise independence of the features, i.e., contemporaneous independence in our forecasting experiments, an assumption that will not hold in all contexts.Footnote 9

4.1.1 Permutation Importance

The permutation importance of a variable measures the change of model performance when the values of that variable are randomly scrambled. Scrambling or permuting a variable’s values can either be done within a particular sample or by swapping values between samples. If a model has learnt a strong dependency between the model outcome and a given variable, scrambling the value of the variable leads to very different model predictions and thus affects performance. A variable k is said to be important in a model if the test error \(e_{k}^{perm}\) after scrambling feature k is substantially higher than the test error e when using the original values for k, i.e., \(e_{k}^{perm} \gg e\). Clearly, the value of the permutation error \(e_{k}^{perm}\) depends on the realization of the permutation, and variation in its value can be large, particularly in small datasets. Therefore, it is recommended to average \(e_{k}^{perm}\) over several random draws for more accurate estimates and to assess sampling variability.Footnote 10

The following procedure estimates the permutation importance.

  1. For each feature \(x_k\):

    (a) Generate a permutation sample \(x_{k}^{perm}\) with the values of \(x_k\) permuted across observations (or swapped between samples).

    (b) Reevaluate the test score for \(x_{k}^{perm}\), resulting in \(e_{k}^{perm}\).

    (c) The permutation importance of \(x_k\) is given by \(I(x_k)=e_{k}^{perm}/e\).Footnote 11

    (d) Repeat over Q iterations and average: \(I_k = \frac{1}{Q}\sum_{q} I_q(x_k)\).

  2. If \(I_q\) is given by the ratio of errors, consider the normalized quantity \(\bar {I}_k = (I_k-1)/\sum _k (I_k-1)\,\in \,(0,1)\).Footnote 12

  3. Sort features by \(I_k\) (or \(\bar {I}_k\)).
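The procedure above can be sketched in plain Python. This is an illustration under stated assumptions: the model is passed in as a `predict` function acting on one row, the rows of `X` are lists, RMSE is the test error, and the importance is the ratio of errors. None of these names come from the chapter.

```python
import math
import random

def rmse(y, y_hat):
    """Root mean squared error used as the test error e."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))

def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Return (I, I_bar): average error ratios and their normalized version."""
    rng = random.Random(seed)
    base = rmse(y, [predict(row) for row in X])  # error on unpermuted data
    ratios = []
    for k in range(len(X[0])):
        total = 0.0
        for _ in range(n_repeats):
            column = [row[k] for row in X]
            rng.shuffle(column)  # permute feature k across observations
            X_perm = [row[:k] + [v] + row[k + 1:] for row, v in zip(X, column)]
            total += rmse(y, [predict(row) for row in X_perm]) / base
        ratios.append(total / n_repeats)
    excess = [r - 1.0 for r in ratios]
    norm = sum(excess)
    normalized = [e / norm for e in excess] if norm != 0 else None
    return ratios, normalized
```

Only new predictions are needed, never a refit. In finite samples a ratio can fall slightly below one for an irrelevant feature, in which case the normalized values need not lie strictly inside (0, 1).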

Permutation importance is an intuitive measure that is relatively cheap to compute, requiring only new predictions generated on the permuted data and not model retraining. However, this ease of use comes at some cost. First, and foremost, permutation importance is inconsistent. For example, if two features contain similar information, permuting either of them will not reflect the actual importance of this feature relative to all other features in the model. Only permuting both or excluding one would do so. This situation is accounted for by Shapley values because they identify the individual marginal effect of a feature, accounting for its interaction with all other features. Additionally, the computation of permutation importance necessitates access to true outcome values and in many situations, e.g., when working with models trained on sensitive or confidential data, these may not be available. As a global measure, permutation importance only explains which variables are important but not how they contribute to the model, i.e., we cannot uncover the functional form or even the direction of the association between features and outcome that was learned by the model.

4.1.2 Shapley Values and Regressions

Shapley values originate from game theory [39] as a general solution to the problem of attributing a payoff obtained in a cooperative game to the individual players based on their contribution to the game. Štrumbelj and Kononenko [44] introduced the analogy between players in a cooperative game and variables in a general supervised model, where variables jointly generate a prediction, the payoff. The calculation is analogous in both cases (see also [24]),

$$\displaystyle \begin{aligned} \begin{array}{rcl} \varPhi^S\Big[f(x_i)\Big] & \equiv &\displaystyle \phi_0^S + \sum_{k=1}^{n}\phi_k^S(x_i) \;=\; f(x_i)\,, \qquad (1) \end{array} \end{aligned} $$
$$\displaystyle \begin{aligned} \begin{array}{rcl} \phi_{k}^{S}(x_i;f) & =&\displaystyle \sum_{x^{\prime}\,\subseteq\,\mathcal{C}(x)\setminus\{k\}} \frac{|x^{\prime}|!(n-|x^{\prime}|-1)!}{n!}\,\big[f(x_i|x^{\prime}\cup \{k\}) - f(x_i|x^{\prime})\big]\,, \qquad (2) \end{array} \end{aligned} $$
$$\displaystyle \begin{aligned} \begin{array}{rcl} & =&\displaystyle \sum_{x^{\prime}\,\subseteq\,\mathcal{C}(x)\setminus\{k\}} \omega_{x^{\prime}}\big[ \mathbb{E}_b[f(x_i)|x^{\prime}\cup \{k\}]-\mathbb{E}_b[f(x_i)|x^{\prime}] \big]\,, \qquad (3)\\ & \text{with}&\displaystyle \qquad \mathbb{E}_b[ f(x_i)|x^{\prime}] \;\equiv\, \int f(x_i)\,\mathrm{d}b(\bar{x^{\prime}})=\frac{1}{|b|}\sum_b f(x_i|\bar{x^{\prime}})\,. \end{array} \end{aligned} $$

Equation 1 states that the Shapley decomposition \(\varPhi^S[f(x_i)]\) of model f is local at \(x_i\) and exact, i.e., it precisely adds up to the actually predicted value \(f(x_i)\). In Eq. 2, \(\mathcal {C}(x)\setminus \{k\}\) is the set of all possible variable combinations (coalitions) of n − 1 variables when excluding the kth variable. \(|x^{\prime}|\) denotes the number of variables included in coalition \(x^{\prime}\), \(\omega _{x^{\prime }}\equiv |x^{\prime }|!(n-|x^{\prime }|-1)!/n!\) is a combinatorial weighting factor summing to one over all possible coalitions, b is a background dataset, and \(\bar {x^{\prime }}\) stands for the set of variables not included in \(x^{\prime}\).

Equation 2 is the weighted sum of marginal contributions of variable k accounting for the number of possible variable coalitions.Footnote 13 In a general model, it is usually not possible to put an arbitrary feature to missing, i.e., exclude it. Instead, the contributions from features not included in \(x^{\prime}\) are integrated out over a suitable background dataset, where \(\{x_i|\bar {x^{\prime }}\}\) is the set of points with variables not in \(x^{\prime}\) being replaced by values in b. The background provides an informative reference point by determining the intercept \(\phi _0^S\). A reasonable choice is the training dataset incorporating all information the model has learned from.

An obvious disadvantage of Shapley values compared to permutation importance is the considerably higher complexity of their calculation. Given the factorial in Eq. 2, an exhaustive calculation is generally not feasible with larger feature sets. This can be addressed by either sampling from the space of coalitions or by setting all “not important” variables to “others,” i.e., treating them as single variables. This substantially reduces the number of elements in \(\mathcal {C}(x)\).
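For a handful of features, the exhaustive calculation can be written down directly. The sketch below enumerates all coalitions and replaces excluded features by values from a background dataset; the function names and the list-based data layout are assumptions for this illustration, and the cost grows exponentially with the number of features.

```python
from itertools import combinations
from math import factorial

def coalition_value(f, x, coalition, background):
    """E_b[f(x) | coalition]: features in the coalition keep their values from x,
    all other features are replaced by each background row and averaged out."""
    total = 0.0
    for b in background:
        z = [x[j] if j in coalition else b[j] for j in range(len(x))]
        total += f(z)
    return total / len(background)

def shapley_values(f, x, background):
    """Exact Shapley decomposition of f(x): returns (phi_0, [phi_1, ..., phi_n])."""
    n = len(x)
    phi0 = coalition_value(f, x, frozenset(), background)  # intercept
    phis = []
    for k in range(n):
        others = [j for j in range(n) if j != k]
        phi_k = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                gain = (coalition_value(f, x, s | {k}, background)
                        - coalition_value(f, x, s, background))
                phi_k += weight * gain
        phis.append(phi_k)
    return phi0, phis
```

The intercept plus the attributions add up to the prediction f(x), and a feature that enters the model additively receives its coefficient times the distance of its value from the background mean.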

Nevertheless, these computational costs come with significant advantages. Shapley values are the only feature attribution method which is model independent, local, accurate, linear, and consistent [28]. This means that it delivers a granular high-fidelity approach for assessing the contribution and importance of variables. By comparing the local attributions of a variable across all observations we can visualize the functional form learned by the model. For instance, we might see that observations with a high (low) value on the variable have a disproportionally high (low) Shapley value on that variable, indicating a positive nonlinear functional form.

Based on these properties, which are directly inherited from the game theoretic origins of Shapley values, we can formulate an inference framework using Eq. 1. Namely, the Shapley regression [24],

$$\displaystyle \begin{aligned} y_i\,=\, \sum_{k=0}^n\phi^S_{k}(f,x_i)\beta^S_k + \hat{\epsilon}_i\,\equiv\,\varPhi^S_i\beta^S + \hat{\epsilon}_i, \qquad (4) \end{aligned} $$

where k = 0 corresponds to the intercept and \(\hat {\epsilon }_i\sim \mathcal {N}(0,\sigma _{\epsilon }^2)\). The surrogate coefficients \(\beta _k^S\) are tested against the null hypothesis

$$\displaystyle \begin{aligned} \mathcal{H}_{0}^{k}(\varOmega)\;:\;\{\beta_k^S\leq 0\,\big|\,\varOmega\}\,, \qquad (5) \end{aligned} $$

with \(\varOmega \subset \mathbb {R}^n\) (a region of) the model input space. The intuition behind this approach is to test the alignment of Shapley components with the target variable. This is analogous to a linear model where we use “raw” feature values rather than their associated Shapley attributions. A key difference to the linear case is the regional dependence on Ω. We only make local statements about the significance of variable contributions, i.e., on those regions where it is tested against \(\mathcal {H}_{0}\). This is appropriate in the context of potential nonlinearity, where the model plane in the original input-target space may be curved, unlike that of a linear model. Note that the Shapley value decomposition (Eqs. 1–3) absorbs the signs of variable attributions, such that only positive coefficient values indicate significance. When negative values occur, it indicates that a model has poorly learned from a variable and \(\mathcal {H}_{0}\) cannot be rejected.
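The auxiliary regression and the one-sided test can be sketched with NumPy. The sketch makes several simplifying assumptions that are not from the chapter: a column of ones replaces the constant \(\phi_0^S\), which only rescales the intercept coefficient; standard errors are the classical homoskedastic ones; and p-values use a normal approximation instead of the Student t distribution.

```python
import math
import numpy as np

def shapley_regression(y, phi):
    """Regress y on an intercept and the Shapley values phi (shape m x n).

    Returns (beta, se, p_one_sided), where p_one_sided[k] is the approximate
    p-value for the null hypothesis beta_k <= 0.
    """
    y = np.asarray(y, dtype=float)
    Z = np.column_stack([np.ones(len(y)), np.asarray(phi, dtype=float)])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    dof = len(y) - Z.shape[1]
    sigma2 = resid @ resid / dof                  # residual variance
    cov = sigma2 * np.linalg.inv(Z.T @ Z)         # classical OLS covariance
    se = np.sqrt(np.diag(cov))
    t_stats = beta / se
    # P(Z > t) for a standard normal Z; small values reject beta_k <= 0.
    p_one_sided = np.array([0.5 * math.erfc(t / math.sqrt(2)) for t in t_stats])
    return beta, se, p_one_sided
```

Restricting the rows of `y` and `phi` to observations inside a region Ω gives the regional version of the test.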

The coefficients \(\beta^S\) are only informative about variable alignment (the strength of association between the output variable and feature of interest), not the magnitude of importance of a variable. Both together can be summarized by Shapley share coefficients,

$$\displaystyle \begin{aligned} \begin{array}{rcl} \varGamma^S_k(f,{\varOmega}) & \equiv &\displaystyle \Bigg[sign\big(\beta^{lin}_k\big)\,\left\langle\frac{|\phi_{k}^{S}(f)|}{\sum_{l=1}^{n}|\phi_{l}^{S}(f)|}\right\rangle_{\varOmega} \Bigg]^{(*)}\;\in\;[-1,1]\,, \qquad (6) \end{array} \end{aligned} $$

where \(\langle\cdot\rangle_{\varOmega}\) stands for the average over \(x_k\) in \(\varOmega _k\subset \mathbb {R}\). The Shapley share coefficient \(\varGamma _k^S(f,\varOmega )\) is a summary statistic for the contribution of \(x_k\) to the model over a region \(\varOmega \subset \mathbb {R}^n\) for modelling y.

It consists of three parts. The first is the sign, which is the sign of the corresponding linear model. The motivation for this is to indicate the direction of alignment of a variable with the target y. The second part is coefficient size. It is defined as the fraction of absolute variable attribution allotted to \(x_k\) across Ω. The sum of the absolute value of Shapley share coefficients is one by construction.Footnote 14 It measures how much of the model output is explained by \(x_k\). The last component is the significance level, indicated by the star notation (∗), and refers to the standard notation used in regression analysis to indicate the certainty with which we can reject the null hypothesis (Eq. 5). This indicates the confidence one can have in information derived from variable \(x_k\) measured by the strength of alignment of the corresponding Shapley components and the target, which is the same as its interpretation in a conventional regression analysis.
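The sign and size components of the share coefficients can be computed from a matrix of Shapley values and the coefficients of a linear model. The sketch below is illustrative: `shapley_shares` is a hypothetical name, rows whose attributions are all zero are assumed not to occur, and the significance stars would come from a separate Shapley regression.

```python
import math

def shapley_shares(phi, lin_coefs):
    """Signed Shapley shares for n features.

    phi: list of m rows with the Shapley values of the n features (no intercept).
    lin_coefs: coefficients of the corresponding linear model, used for the sign.
    """
    m, n = len(phi), len(phi[0])
    shares = [0.0] * n
    for row in phi:
        total = sum(abs(v) for v in row)  # assumed to be non-zero
        for k in range(n):
            shares[k] += abs(row[k]) / total
    # Average the per-observation fractions and attach the linear-model sign.
    return [math.copysign(1.0, lin_coefs[k]) * shares[k] / m for k in range(n)]

# Toy example with two observations and three features.
gamma = shapley_shares([[1.0, -1.0, 2.0], [0.5, 0.5, -1.0]], [1.0, -2.0, -0.1])
```

Because each row of fractions sums to one, the absolute values of the resulting shares also sum to one.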

Equation 7 provides the explicit form for the linear model, where an analytical form exists. The only difference to the conventional regression case is the normalizing factor.

4.2 Results

We explain the predictions of the machine learning models and the linear regression as calibrated in the baseline setup of our forecasting. Our focus is largely on explaining forecast predictions in a pseudo-real-world setting where the model is trained on earlier observations that predate the predictions. However, in some cases it can be instructive to explain the predictions of a model that was trained on observations across the whole time period. For that, we use fivefold block cross-validation [3, 42].Footnote 15 This cross-validation analysis is subject to look-ahead bias, as we use future data to predict the past, but it allows us to evaluate a model for the whole time series.
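Block cross-validation keeps each test fold as one contiguous stretch of time. A minimal index-based sketch follows; it is an assumption-laden illustration (the function name is hypothetical, and details such as gaps between training and test blocks are not modelled).

```python
def block_folds(n_obs, n_folds=5):
    """Split indices 0..n_obs-1 into contiguous test blocks.

    Returns a list of (train_indices, test_indices); the training set of each
    fold contains all observations outside the test block, including later ones.
    """
    base, extra = divmod(n_obs, n_folds)
    folds, start = [], 0
    for j in range(n_folds):
        size = base + (1 if j < extra else 0)  # spread the remainder over early folds
        test = list(range(start, start + size))
        train = [i for i in range(n_obs) if i < start or i >= start + size]
        folds.append((train, test))
        start += size
    return folds

folds = block_folds(12, n_folds=5)
```

Training on observations that follow the test block is what introduces the look-ahead bias mentioned above.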

4.2.1 Feature Importance

Figure 2 shows the global variable importance based on the analysis of the forecasting predictions. It compares Shapley shares \(|\varGamma^S|\) (left panel) with permutation importance \(\bar {I}\) (middle panel). The variables are sorted by the Shapley shares of the best-performing model, the random forest. Vertical lines connect the lowest and highest share across models for each feature as a measure for disagreement between models.

Fig. 2 Variable importance according to different measures. The left panel shows the importance according to the Shapley shares and the middle panel shows the variable importance according to permutation importance. The right panel shows an altered metric of permutation importance that measures the effect of permutation on the predicted value

The two importance measures only roughly agree in their ranking of feature importance. For instance, using a random forest model, past unemployment seems to be a key indicator according to permutation importance but relatively less crucial according to Shapley calculations. Permutation importance is based on model forecasting error and so is a measure of a feature’s predictive power (how much does its inclusion in a model improve predictive accuracy) and it is influenced by how the relationship between outcome and features may change over time. In contrast, Shapley values indicate which variables influence a predicted value, independent of predictive accuracy. The right panel of Fig. 2 shows an altered measure of permutation importance. Instead of measuring the change in the error due to permutations, we measure the change in the predicted value.Footnote 16 We see that this importance measure is more closely aligned with Shapley values. Furthermore, when we evaluate permutation importance using predictions based on block cross-validation, we find a strong alignment with Shapley values as the relationship between variables is not affected by the change between the training and test set (not shown).

Figure 3 plots Shapley values attributed to the S&P500 (vertical axis) against its input values (horizontal axis) for the random forest (left panel) and the linear regression (right panel) based on the block cross-validation analysis (see Footnote 17). Each point represents one of the observations between 1990 and 2019 and its value of the S&P500 variable. The approximate functional forms learned by both models are traced out by best-fit degree-3 polynomials. The linear regression learns a steep negative slope, i.e., higher stock market values are associated with lower unemployment 1 year down the road. This makes economic sense. However, the random forest allows more nuanced observations. There is satiation for high market valuations, i.e., changes beyond a certain point do not provide additional information about changes in unemployment (see Footnote 18). A linear model cannot reflect such nuances, while machine learning models extract a more detailed signal from the stock market and other variables.

Fig. 3

Functional form learned by the random forest (left panel) and linear regression (right panel). The gray line shows a degree-3 polynomial fitted to the data. The Shapley values shown here are computed based on fivefold block cross-validation and are therefore subject to look-ahead bias
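To make concrete how plotting Shapley values against input values can reveal a satiation effect, the following sketch computes exact Shapley values by enumerating all feature coalitions. It is only an illustration under stated assumptions: the function `shapley_values`, the saturating toy `model`, and the small `background` sample are hypothetical, and features outside a coalition are replaced by background values, which is one common choice rather than necessarily the one used in the chapter.

```python
from itertools import combinations
from math import factorial
from statistics import fmean


def shapley_values(f, x, background):
    """Exact Shapley values of f at x; absent features are drawn from `background`."""
    n = len(x)

    def value(coalition):
        # Expected prediction when only the features in `coalition` are fixed to x.
        return fmean(
            f([x[j] if j in coalition else b[j] for j in range(n)])
            for b in background
        )

    phi = []
    for k in range(n):
        others = [j for j in range(n) if j != k]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude feature k.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {k}) - value(s))
        phi.append(total)
    return phi


def model(row):
    # Hypothetical saturating response: the effect of feature 0 stops growing above 1.0.
    return -min(row[0], 1.0) + 0.5 * row[1]


background = [[-1.0, 0.0], [0.0, 1.0], [0.5, -1.0], [2.0, 0.5]]
phi_mid = shapley_values(model, [0.5, 0.2], background)
phi_high = shapley_values(model, [2.0, 0.2], background)
phi_higher = shapley_values(model, [3.0, 0.2], background)
print(phi_mid[0], phi_high[0], phi_higher[0])
```

In this toy model the Shapley value of feature 0 falls as its input rises towards the threshold and then stays flat, which is the kind of shape a cubic fit through the points would trace out. Enumeration is exponential in the number of features, so it is only practical for small examples like this one.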

4.2.2 Shapley Regressions

Shapley value-based inference allows us to communicate machine learning models analogously to a linear regression analysis. The main difference between the coefficients of a linear model and Shapley share coefficients is the normalization of the latter. The reason for this is that nonlinear models do not have a “natural scale,” for instance, to measure variation. We summarize the Shapley regression on the forecasting predictions (1990–2019) of the random forest and linear regression in Table 4.

Table 4 Shapley regression of random forest (left) and linear regression (right) for forecasting predictions between 1990 and 2019. Significance levels: ∗p < 0.1; ∗∗p < 0.05; ∗∗∗p < 0.01

The coefficients \(\beta ^S\) measure the alignment of a variable with the target. Values close to one indicate perfect alignment and convergence of the learning process. Values larger than one indicate that a model underestimates the effect of a variable on the outcome, and values smaller than one indicate the opposite. Intuitively, the model hyperplane of the Shapley regression tilts either towards the Shapley component of a variable (underestimation, \(\beta ^S_k>1\)) or away from it (overestimation, \(\beta ^S_k<1\)). Significance decreases as \(\beta ^S_k\) approaches zero (see Footnote 19).
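The interpretation of \(\beta ^S_k\) can be illustrated with a deliberately simplified sketch. It regresses the target on a single Shapley component, whereas the Shapley regression itself uses all components jointly. The helper `shapley_coefficient`, the simulated data, and the three hypothetical attribution series are assumptions for illustration only. The sketch relies on the fact that, for a linear model with slope \(b\) and background mean \(\bar {x}\), the Shapley component of a feature is \(b(x-\bar {x})\).

```python
import random
from statistics import fmean


def shapley_coefficient(phi, y):
    """OLS slope and t-statistic from regressing y on one Shapley component."""
    n = len(y)
    phi_bar, y_bar = fmean(phi), fmean(y)
    sxx = sum((p - phi_bar) ** 2 for p in phi)
    sxy = sum((p - phi_bar) * (t - y_bar) for p, t in zip(phi, y))
    beta = sxy / sxx
    intercept = y_bar - beta * phi_bar
    rss = sum((t - intercept - beta * p) ** 2 for p, t in zip(phi, y))
    se = (rss / (n - 2) / sxx) ** 0.5  # standard error of the slope
    return beta, beta / se


rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(300)]
y = [2.0 * v + rng.gauss(0, 0.5) for v in x]  # assumption: the true effect is 2
x_bar = fmean(x)

# Hypothetical Shapley components from three different models.
phi_under = [1.0 * (v - x_bar) for v in x]  # model understates the effect
phi_over = [4.0 * (v - x_bar) for v in x]   # model overstates the effect
phi_noise = [rng.gauss(0, 1) for _ in x]    # attribution unrelated to the target

beta_under, t_under = shapley_coefficient(phi_under, y)
beta_over, t_over = shapley_coefficient(phi_over, y)
beta_noise, t_noise = shapley_coefficient(phi_noise, y)
print(f"underestimating model: beta = {beta_under:.2f}, t = {t_under:.1f}")
print(f"overestimating model:  beta = {beta_over:.2f}, t = {t_over:.1f}")
print(f"uninformative signal:  beta = {beta_noise:.2f}, t = {t_noise:.1f}")
```

The model that understates the effect yields a coefficient above one, the model that overstates it yields a coefficient below one, and the uninformative attribution yields a coefficient near zero with a small t-statistic. Rescaling a component changes the coefficient but not the t-statistic, so significance in this sketch reflects how well the attribution tracks the target rather than its scale.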

Variables with lower p-values usually have higher Shapley shares \(|\Gamma ^S|\), which are equivalent to those shown in Fig. 2. This is intuitive, as the model learns to rely more on features that are important for predicting the target. However, this does not hold by construction. Especially in the forecasting setting, where the relationships between variables change over time, statistical significance may disappear in the test set, even for features with high shares.

In the Shapley regression, more variables are statistically significant for the random forest than for the linear regression model. This is expected because the forest, like other machine learning models, is more flexible: it can exploit nonlinear relationships that the regression cannot account for (as in Fig. 3). These relationships are reflected in localized Shapley values, which provide a stronger, i.e., more significant, signal in the regression stage.

5 Conclusion

This chapter provided a comparative study of how machine learning models can be used for macroeconomic forecasting relative to standard econometric approaches. We find significantly better performance of machine learning models for forecasting changes in US unemployment at a 1-year horizon, particularly in the period after the global financial crisis of 2008.

Apart from model performance, we provide an extensive explanation of model predictions, where we present two approaches that allow for greater machine learning interpretability—permutation feature importance and Shapley values. Both methods demonstrate that a range of machine learning models learn comparable signals from the data. By decomposing individual predictions into Shapley value attributions, we extract learned functional forms that allow us to visually demonstrate how the superior performance of machine learning models is explained by their enhanced ability to adapt to individual variable-specific nonlinearities. Our example allows for a more nuanced economic interpretation of learned dependencies compared to the interpretation offered by a linear model. The Shapley regression framework, which enables conventional parametric inference on machine learning models, allows us to communicate the results of machine learning models analogously to traditional presentations of regression results.

Nevertheless, as with conventional linear models, the interpretation of our results is not fixed. We observe some variation across models, model specifications, and interpretability methods. This is in part due to small-sample limitations; this modelling issue is common but is likely aggravated for machine learning models because of their nonparametric structure.

However, we believe that the methodology and results presented justify the use of machine learning models and such explainability methods to inform decisions in a policy-making context. The inherent advantages of their nonlinearity over conventional models are most evident when the underlying data-generating process is unknown and expected to change over time, as in the forecasting environment of the case study presented here. Overall, the use of machine learning in conjunction with Shapley value-based inference as presented in this chapter may offer a better trade-off between maximizing predictive performance and statistical inference, thereby narrowing the gap between Breiman’s two cultures.