Abstract
We present a comprehensive comparative case study of the use of machine learning models for macroeconomic forecasting. We find that machine learning models mostly outperform conventional econometric approaches in forecasting changes in US unemployment on a 1-year horizon. To address the black box critique of machine learning models, we apply and compare two variable attribution methods: permutation importance and Shapley values. While the aggregate information derived from both approaches is broadly in line, Shapley values offer several advantages, such as the discovery of unknown functional forms in the data generating process and the ability to perform statistical inference. The latter is achieved by the Shapley regression framework, which allows for the evaluation and communication of machine learning models akin to that of linear models.
1 Introduction
Machine learning provides a toolbox of powerful methods that excel in static prediction problems such as face recognition [37], language translation [12], and playing board games [41]. The recent literature suggests that machine learning methods can also outperform conventional models in forecasting problems; see, e.g., [4] for bond risk premia, [15] for recessions, and [5] for financial crises. Predicting macroeconomic dynamics is challenging. Relationships between variables may not hold over time, and shocks such as recessions or financial crises might lead to a breakdown of previously observed relationships. Nevertheless, several studies have shown that machine learning methods outperform econometric baselines in predicting unemployment, inflation, and output [38, 9].
While machine learning models learn meaningful relationships between variables from the data, these relationships are not directly observable, leading to the criticism that models such as random forests and neural networks are opaque black boxes. However, as we demonstrate, there exist approaches that can make machine learning predictions transparent and even allow for statistical inference.
We have organized this chapter as a guiding example for how to combine improved performance and statistical inference for machine learning models in the context of macroeconomic forecasting.
We start by comparing forecasting performance and inference for various machine learning models with those of more commonly used econometric models. We find that machine learning models outperform econometric benchmarks in predicting 1-year changes in US unemployment. Next, we address the black box critique by using Shapley values [44, 28] to depict the nonlinear relationships learned by the machine learning models and then test their statistical significance [24]. Our method closes the gap between two distinct data modelling objectives: using black box machine learning methods to maximize predictive performance and using statistical techniques to infer the data-generating process [8].
While several studies have shown that multivariate machine learning models can be useful for macroeconomic forecasting [38, 9, 31], little research has tried to explain the machine learning predictions. Coulombe et al. [13] show that the success of machine learning models in macro-forecasting can generally be attributed to their ability to exploit nonlinearities in the data, particularly at longer time horizons. However, we are not aware of any macroeconomic forecasting study that has attempted to identify the functional form learned by the machine learning models.^{Footnote 1} Addressing the explainability of models is nevertheless important when model outputs inform decisions, given the intertwined ethical, safety, privacy, and legal concerns about the application of opaque models [14, 17, 20]. There is a debate about the level of model explainability that is necessary. Lipton [27] argues that a complex machine learning model does not need to be less interpretable than a simpler linear model if the latter operates on a more complex space, while Miller [32] suggests that humans prefer simple explanations, i.e., those providing fewer causes and explaining more general events, even though these may be biased.
Therefore, with our focus on explainability, we consider a small but diverse set of variables to learn a forecasting model, while the forecasting literature often relies on many variables [21] or latent factors that summarize individual variables [43]. In the machine learning literature, approaches to interpreting machine learning models usually focus on measuring how important input variables are for prediction. These variable attributions can be either global, assessing variable importance across the whole data set [23, 25], or local, measuring the importance of the variables at the level of individual observations. Popular global methods are permutation importance and Gini importance for tree-based models [7]. Popular local methods are LIME^{Footnote 2} [34], DeepLIFT^{Footnote 3} [40], and Shapley values [44]. Local methods decompose individual predictions into variable contributions [36, 45, 44, 34, 40, 28, 35]. The main advantage of local methods is that they uncover the functional form of the association between a feature and the outcome as learned by the model. Global methods cannot reveal the direction of association between a variable and the outcome of interest. Instead, they only identify variables that are relevant on average across all predictions, which can also be achieved with local methods by averaging attributions across all observations.
For model explainability in the context of macroeconomic forecasting, we suggest that local methods that uncover the functional form of the data generating process are most appropriate. Lundberg and Lee [28] demonstrate that Shapley values, a local method, offer a unified framework of LIME and DeepLIFT with appealing properties. We chose to use Shapley values in this chapter because of their important property of consistency. Here, consistency means that when the impact of a feature in a model increases, the feature’s estimated attribution for a prediction does not decrease, independent of all other features. Originally, Shapley values were introduced in game theory [39] as a way to determine the contribution of individual players in a cooperative game. Shapley values estimate the increase in the collective payoff when a player joins all possible coalitions with other players. Štrumbelj and Kononenko [44] used this approach to estimate the contribution of variables to a model prediction, where the variables and the predicted value are analogous to the players and payoff in a game.
The global and local attribution methods mentioned here are descriptive: they explain the drivers of a model’s prediction, but they do not assess a model’s goodness-of-fit or the predictors’ statistical significance. These concepts relate to statistical inference and require two steps: (1) measuring or estimating some quantity, such as a regression coefficient, and (2) inferring how certain one is in this estimate, e.g., how likely it is that the true coefficient in the population is different from zero.
The econometric approach of statistical inference for machine learning is mostly focused on measuring low-dimensional parameters of interest [10, 11], such as treatment effects in randomized experiments [2, 47]. However, in many situations we are interested in estimating the effects for all variables included in a model. To the best of our knowledge, there exists only one general framework that performs statistical inference jointly on all variables used in a machine learning prediction model to test for their statistical significance [24]. The framework is called Shapley regressions, where an auxiliary regression of the outcome variable on the Shapley values of individual data points is used to identify those variables that significantly improve the predictions of a nonlinear machine learning model. We will discuss this framework in detail in Sect. 4. Before that, we will describe the data and the forecasting methodology (Sect. 2) and present the forecasting results (Sect. 3). We conclude in Sect. 5.
2 Data and Experimental Setup
We first introduce the necessary notation. Let y and \(\hat {y} \in \mathbb {R}^{m}\) be the observed and predicted continuous outcome, respectively, where m is the number of observations in the time series.^{Footnote 4} The feature matrix is denoted by \(x \in \mathbb {R}^{m \times n}\), where n is the number of features in the dataset. The feature vector of observation i is denoted by x _{i}. Generally, we use i to index the point in time of the observation and k to index features. While our empirical analysis is limited to numerical features, the forecasting methods as well as the techniques to interpret their predictions also work when the data contains categorical features. These just need to be transformed into binary variables, each indicating membership of a category.
2.1 Data
We use the FRED-MD macroeconomic database [30]. The data contains monthly series of 127 macroeconomic indicators of the USA between 1959 and 2019. Our outcome variable is unemployment and we choose nine variables as predictors, each capturing a different macroeconomic channel. We add the slope of the yield curve as a variable by computing the difference of the interest rates of the 10-year treasury note and the 3-month treasury bill. The authors of the database suggest specific transformations to make each series stationary. We use these transformations, which are (for a variable a): (1) changes (a _{i} − a _{i−l}), (2) log changes (log_{e} a _{i} − log_{e} a _{i−l}), and (3) second-order log changes ((log_{e} a _{i} − log_{e} a _{i−l}) − (log_{e} a _{i−l} − log_{e} a _{i−2l})). As we want to predict the year-on-year change in unemployment, we set l to 12 for the outcome and the lagged outcome when used as a predictor. For the remaining predictors, we set l = 3 in our baseline setup. This generally leads to the best performance (see Table 3 for other choices of l). Table 1 shows the variables, with the respective transformations and the series names in the original database. The augmented Dickey-Fuller test confirms that all transformed series are stationary (p < 0.01).
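The three transformations can be sketched in a few lines of Python. This is a minimal illustration only: the function names and the toy series are hypothetical and not part of the FRED-MD tooling, and the lag l is passed as an argument.

```python
import math

def change(a, l):
    """Change over l periods: a_i - a_{i-l}."""
    return [a[i] - a[i - l] for i in range(l, len(a))]

def log_change(a, l):
    """Log change over l periods: log(a_i) - log(a_{i-l})."""
    return [math.log(a[i]) - math.log(a[i - l]) for i in range(l, len(a))]

def second_order_log_change(a, l):
    """Difference between the current and the previous l-period log change."""
    d = log_change(a, l)  # d[j] belongs to time index j + l
    return [d[j] - d[j - l] for j in range(l, len(d))]

# Hypothetical monthly series growing by 1% per month (assumed toy data).
series = [100 * 1.01 ** t for t in range(24)]
quarterly_log_change = log_change(series, 3)
quarterly_second_order = second_order_log_change(series, 3)
```

Each transformation shortens the series: by l observations for the first two and by 2l for the second-order log change. This is one reason the first usable training observation is later than the start of the raw data.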
2.2 Models
We test three families of models that can be formalized in the following way assuming that all variables have been transformed according to Table 1.

The simple linear lag model only uses the 1-year lag of the outcome variable as a predictor: \(\hat {y}_i = \alpha + \theta _0 y_{i-12}\).

The autoregressive model (AR) uses several lags of the response as predictors: \({\hat {y}_i = \alpha + \sum _{l = 1}^{h} \theta _l y_{i-l}}\). We test AR models with a horizon 1 ≤ h ≤ 12, chosen by the Akaike Information Criterion [1].

The full information models use the 1-year lag of the outcome and 1-year lags of the other features as independent variables: \(\hat {y}_i = f(y_{i-12}, x_{i-12})\), where f can be any prediction model. For example, if f is a linear regression, \(f(y_{i-12},x_{i-12}) = \alpha + \theta _0y_{i-12} + \sum _{k= 1}^{n} \theta _kx_{i-12,k}\). To simplify this notation, we imply in the following that the lagged outcome is included in the feature matrix x. We test five full information models: ordinary least squares regression, Lasso regularized regression [46], and three machine learning regressors, namely random forest [7], support vector regression [16], and artificial neural networks [22].^{Footnote 5}
2.3 Experimental Procedure
We evaluate how all models predict changes in unemployment 1 year ahead. After transforming the variables (see Table 1) and removing missing values, the first observation in the training set is February 1962. All methods are evaluated on the 359 data points of the forecasts between January 1990 and November 2019 using an expanding window approach. We recalibrate the full information and simple linear lag models every 12 months such that each model makes 12 predictions before it is updated. The autoregressive model is updated every month. Due to the lead-lag structure of the full information and simple linear lag models, we have to create an initial gap between training and test set when making predictions to avoid a look-ahead bias. For a model trained on observations 1…i, the earliest observation in the test set that provides a true 12-month forecast is i + 12. For observations i + 1, …, i + 11, the time difference to the last observed outcome in the training set is smaller than a year.
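The expanding window with a 12-month gap can be sketched as an index generator. This is an illustrative sketch under stated assumptions: the function name, the zero-based indexing, and the toy sizes are hypothetical, and the generator only produces indices, leaving model fitting aside.

```python
def expanding_window_splits(n_obs, first_test, gap=12, step=12):
    """Yield (train_indices, test_indices) pairs for an expanding window.

    Training always starts at index 0 and ends `gap` observations before the
    first test point, so every test point is at least a `gap`-step-ahead
    forecast. The model is refitted every `step` observations.
    """
    test_start = first_test
    while test_start < n_obs:
        train_end = test_start - gap              # last training index, inclusive
        test_end = min(test_start + step, n_obs)  # exclusive upper bound
        yield range(0, train_end + 1), range(test_start, test_end)
        test_start += step

# Toy example (assumed sizes): 60 monthly observations, testing from index 36.
splits = list(expanding_window_splits(n_obs=60, first_test=36))
```

In this toy run the first model is trained on indices 0 to 24 and evaluated on 36 to 47, and the second is trained on 0 to 36 and evaluated on 48 to 59.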
All machine learning models that we tested have hyperparameters. We optimize their values in the training sets using fivefold cross-validation.^{Footnote 6} As this is computationally expensive, we conduct the hyperparameter search every 36 months with the exception of the computationally less costly Lasso regression, whose hyperparameters are updated every 12 months.
To increase the stability of the full information models, we use bootstrap aggregation, also referred to as bagging. We train 100 models on different bootstrapped samples (of the same size as the training set) and average their predictions. We do not use bagging for the random forest as, by design, each individual tree is already calibrated on a different bootstrapped sample of the training set.
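Bootstrap aggregation can be sketched with a deliberately simple base learner. The one-variable least-squares fit below is a stand-in chosen for brevity, not one of the chapter's models, and all names and data are hypothetical.

```python
import random

def fit_line(xs, ys):
    """Least-squares fit of y = a + b * x; returns a prediction function."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    # Guard against a degenerate bootstrap sample with a single distinct x.
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx > 0 else 0.0
    a = my - b * mx
    return lambda x: a + b * x

def bagged_predict(fit, xs, ys, x_new, n_models=100, seed=0):
    """Average the predictions of models fitted on bootstrapped samples."""
    rng = random.Random(seed)
    n = len(xs)
    preds = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # draw n points with replacement
        model = fit([xs[i] for i in idx], [ys[i] for i in idx])
        preds.append(model(x_new))
    return sum(preds) / n_models

# Assumed toy data: a noisy line y = 2 + 0.5 x.
noise = random.Random(1)
xs = [float(i) for i in range(20)]
ys = [2 + 0.5 * x + noise.uniform(-0.2, 0.2) for x in xs]
prediction = bagged_predict(fit_line, xs, ys, x_new=10.0)
```

Any fitting routine that returns a prediction function can be passed as `fit`, which mirrors the idea that bagging wraps around the base model rather than changing it.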
3 Forecasting Performance
3.1 Baseline Setting
Table 2 shows three measures of forecasting performance: the correlation of the observed and predicted response, the mean absolute error (MAE), and the root mean squared error (RMSE). The latter is the main metric considered, as most models minimize RMSE during training. The models are ordered by decreasing RMSE on the whole test period between 1990 and 2019. The random forest performs best and we divide the MAE and RMSE of all models by that of the random forest for ease of comparison.
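The three measures can be written out as follows. This is a minimal sketch with made-up numbers; the function names are hypothetical, and the values are unrelated to Table 2.

```python
import math

def mae(y, y_hat):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)

def rmse(y, y_hat):
    """Root mean squared error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))

def correlation(y, y_hat):
    """Pearson correlation between observed and predicted values."""
    n = len(y)
    my, mh = sum(y) / n, sum(y_hat) / n
    cov = sum((a - my) * (b - mh) for a, b in zip(y, y_hat))
    var_y = sum((a - my) ** 2 for a in y)
    var_h = sum((b - mh) ** 2 for b in y_hat)
    return cov / math.sqrt(var_y * var_h)

# Assumed toy values, not results from the chapter.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 1.5, 3.5, 3.5]
scores = (correlation(observed, predicted), mae(observed, predicted), rmse(observed, predicted))
```

Dividing each model's MAE and RMSE by the corresponding value of a reference model gives relative errors of the kind reported in the table, where values above one indicate a worse fit than the reference.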
Table 2 also breaks down the performance in three periods: the 1990s and the period before and after the onset of the global financial crisis in September 2008. We statistically compare the RMSE and MAE of the best model, the random forest, against all other models using a Diebold-Mariano test. The asterisks indicate the p-value of the tests.^{Footnote 7}
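A textbook form of the Diebold-Mariano statistic can be sketched as below. The exact variant used for Table 2 is not specified in this section, so the rectangular-kernel long-run variance and the normal approximation are assumptions, as are the function name and the toy errors.

```python
import math

def diebold_mariano(e1, e2, h=1, power=2):
    """DM statistic and two-sided normal p-value for equal predictive accuracy.

    e1, e2: forecast errors of two models on the same test points.
    h:      forecast horizon; autocovariances up to lag h - 1 enter the variance.
    power:  2 for squared-error loss, 1 for absolute-error loss.
    """
    d = [abs(a) ** power - abs(b) ** power for a, b in zip(e1, e2)]
    T = len(d)
    d_bar = sum(d) / T

    def autocov(k):
        return sum((d[t] - d_bar) * (d[t - k] - d_bar) for t in range(k, T)) / T

    # Rectangular kernel: this estimate can turn negative in small samples when h > 1.
    long_run_var = autocov(0) + 2 * sum(autocov(k) for k in range(1, h))
    stat = d_bar / math.sqrt(long_run_var / T)
    p_value = math.erfc(abs(stat) / math.sqrt(2))  # equals 2 * (1 - Phi(|stat|))
    return stat, p_value

# Assumed toy forecast errors: model 1 is clearly more accurate than model 2.
errors_1 = [0.1, -0.2, 0.1, 0.0, -0.1, 0.2]
errors_2 = [0.5, -0.6, 0.4, 0.7, -0.5, 0.6]
stat, p_value = diebold_mariano(errors_1, errors_2)
```

A negative statistic indicates lower loss for the first model. For 12-month-ahead forecasts, h = 12 would be the natural choice, ideally with a variance estimator that is guaranteed to be positive.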
Apart from support vector regression (SVR), all machine learning models outperform the linear models on the whole sample. The inferior performance of SVR is not surprising, as it does not minimize a squared error metric such as RMSE but a metric similar to MAE, which is lower for SVR than for the linear models. In the 1990s and the period before the global financial crisis, there are only small differences in performance between the models, with the neural network being the most accurate model. Only after the onset of the crisis does the random forest outperform the other models by a large and statistically significant margin.
Figure 1 shows the observed response variable and the predictions of the random forest, the linear regression, and the AR. The vertical dashed lines indicate the different time periods distinguished in Table 2. The predictions of the random forest are more volatile than those of the regression and the AR.^{Footnote 8} All models underestimate unemployment during the global financial crisis and overestimate it during the recovery. However, the random forest is least biased in those periods and forecasts high unemployment earliest during the crisis. This shows that its relatively high forecast volatility can be useful in registering negative turning points. A similar observation can be made after the burst of the dot-com bubble in 2000. This points to an advantage of machine learning models associated with their greater flexibility in incorporating new information as it arrives. This can be intuitively understood as adjusting model predictions locally, e.g., in regions (periods) of high unemployment, while a linear model needs to realign the full (global) model hyperplane.
3.2 Robustness Checks
We altered several parameters in our baseline setup to investigate their effects on the forecasting performance. The results are shown in Table 3. The RMSE of alternative specifications is again divided by the RMSE of the random forest in the baseline setup for a clearer comparison.

Window size. In the baseline setup, the training set grows over time (expanding window). This can potentially improve the performance over time as more observations may facilitate a better approximation of the true data generating process. On the other hand, it may also make the model sluggish and prevent quick adaptation to structural changes. We test sliding windows of 60, 120, and 240 months. Only the simplest model, linear regression with only a lagged response, profits from a short horizon; the remaining models perform best with the biggest possible training set. This is not surprising for machine learning models, as they can “memorize” different sets of information through the incorporation of multiple specifications in the same model. For instance, different paths down a tree model, or different trees in a forest, are all different submodels, e.g., characterizing different time periods in our setting. By contrast, a simple linear model cannot adjust in this way and needs to fit the best hyperplane to the current situation, explaining its improved performance for some fixed window sizes.

Change horizon. In the baseline setup, we use a horizon of 3 months when calculating changes, log changes, and second-order log changes of the predictors (see Table 1). Testing the horizons of 1, 6, 9, and 12 months, we find that 3 months generally leads to the best performance of all full information models. This is useful from a practical point of view, as quarterly changes are one of the main horizons considered for short-term economic projections.

Bootstrap aggregation (bagging). The linear regression, neural network, and SVR all benefit from averaging the prediction of 100 bootstrapped models. The intuition is that our relatively small dataset likely leads to models with high variance, i.e., overfitting. The bootstrap aggregation of models reduces the models’ variance and the degree of overfitting. Note that we do not expect much improvement for bagged linear models, as different draws from the training set are likely to lead to similar slope parameters resulting in almost identical models. This is confirmed by the almost identical performance of the single and bagged model.
4 Model Interpretability
4.1 Methodology
We saw in the last section that machine learning models outperform conventional linear approaches in a comprehensive economic forecasting exercise. Improved model accuracy is often the principal reason for applying machine learning models to a problem. However, especially in situations where model results are used to inform decisions, it is crucial to both understand and clearly communicate modelling results. This brings us to a second step when using machine learning models—explaining them.
Here, we introduce and compare two different methods for interpreting machine learning forecasting models: permutation importance [7, 18] and Shapley values and regressions [44, 28, 24]. Both approaches are model-agnostic, meaning that they can be applied to any model, unlike other approaches, such as Gini impurity [25, 19], which are only compatible with specific machine learning methods. Both methods allow us to understand the relative importance of model features. For permutation importance, variable attribution is at the global level, while Shapley values are constructed locally, i.e., for each single prediction. We note that both importance measures require column-wise independence of the features, i.e., contemporaneous independence in our forecasting experiments, an assumption that will not hold in all contexts.^{Footnote 9}
4.1.1 Permutation Importance
The permutation importance of a variable measures the change of model performance when the values of that variable are randomly scrambled. Scrambling or permuting a variable’s values can either be done within a particular sample or by swapping values between samples. If a model has learnt a strong dependency between the model outcome and a given variable, scrambling the value of the variable leads to very different model predictions and thus affects performance. A variable k is said to be important in a model if the test error after scrambling feature k, \(e_{k}^{perm}\), is substantially higher than the test error e when using the original values of k, i.e., \(e_{k}^{perm} \gg e\). Clearly, the value of the permutation error \(e_{k}^{perm}\) depends on the realization of the permutation, and variation in its value can be large, particularly in small datasets. Therefore, it is recommended to average \(e_{k}^{perm}\) over several random draws for more accurate estimates and to assess sampling variability.^{Footnote 10}
The following procedure estimates the permutation importance.

1. For each feature x _{k}:

   (a) Generate a permutation sample \(x_{k}^{perm}\) with the values of x _{k} permuted across observations (or swapped between samples).

   (b) Reevaluate the test score for \(x_{k}^{perm}\), resulting in \(e_{k}^{perm}\).

   (c) The permutation importance of x _{k} is given by \(I(x_k)=e_{k}^{perm}/e\).^{Footnote 11}

   (d) Repeat steps (a) to (c) for Q iterations and average: I _{k} = 1∕Q ∑_{q} I _{q}(x _{k}).

2. If I _{q} is given by the ratio of errors, consider the normalized quantity \(\bar {I}_k = (I_k-1)/\sum _k (I_k-1)\,\in \,(0,1)\).^{Footnote 12}

3. Sort features by I _{k} (or \(\bar {I}_k\)).
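The procedure above can be sketched as follows, using the error ratio as the importance measure and RMSE as the test error. The function names, the toy model, and the simulated data are assumptions made for illustration; the model is a fixed prediction function, so no retraining takes place.

```python
import math
import random

def rmse(y, y_hat):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))

def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Return the averaged error ratios I_k and their normalized shares."""
    rng = random.Random(seed)
    base_error = rmse(y, [predict(row) for row in X])  # e, assumed to be > 0
    ratios = []
    for k in range(len(X[0])):
        total = 0.0
        for _ in range(n_repeats):
            column = [row[k] for row in X]
            rng.shuffle(column)  # step (a): permute feature k across observations
            X_perm = [row[:k] + [v] + row[k + 1:] for row, v in zip(X, column)]
            # steps (b) and (c): permuted test error relative to the original error
            total += rmse(y, [predict(row) for row in X_perm]) / base_error
        ratios.append(total / n_repeats)  # step (d): average over repeats
    excess = [r - 1 for r in ratios]
    shares = [e / sum(excess) for e in excess]  # step 2; assumes sum(excess) != 0
    return ratios, shares

# Assumed toy setting: the third feature is ignored by the model.
rng = random.Random(1)
X = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(200)]
y = [2 * r[0] + 0.5 * r[1] + rng.gauss(0, 0.1) for r in X]
model = lambda row: 2 * row[0] + 0.5 * row[1]
ratios, shares = permutation_importance(model, X, y)
ranking = sorted(range(3), key=lambda k: shares[k], reverse=True)  # step 3
```

In this toy setting the unused third feature has a ratio of exactly one and a share of zero, since permuting it leaves every prediction unchanged.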
Permutation importance is an intuitive measure that is relatively cheap to compute, requiring only new predictions generated on the permuted data and not model retraining. However, this ease of use comes at some cost. First, and foremost, permutation importance is inconsistent. For example, if two features contain similar information, permuting either of them will not reflect the actual importance of this feature relative to all other features in the model. Only permuting both or excluding one would do so. This situation is accounted for by Shapley values because they identify the individual marginal effect of a feature, accounting for its interaction with all other features. Additionally, the computation of permutation importance necessitates access to true outcome values and in many situations, e.g., when working with models trained on sensitive or confidential data, these may not be available. As a global measure, permutation importance only explains which variables are important but not how they contribute to the model, i.e., we cannot uncover the functional form or even the direction of the association between features and outcome that was learned by the model.
4.1.2 Shapley Values and Regressions
Shapley values originate from game theory [39] as a general solution to the problem of attributing a payoff obtained in a cooperative game to the individual players based on their contribution to the game. Štrumbelj and Kononenko [44] introduced the analogy between players in a cooperative game and variables in a general supervised model, where variables jointly generate a prediction, the payoff. The calculation is analogous in both cases (see also [24]),
Equation 1 states that the Shapley decomposition Φ ^{S}[f(x _{i})] of model f is local at x _{i} and exact, i.e., it precisely adds up to the actually predicted value f(x _{i}). In Eq. 2, \(\mathcal {C}(x)\setminus \{k\}\) is the set of all possible variable combinations (coalitions) of n − 1 variables when excluding the k ^{th} variable. |x ^{′}| denotes the number of variables included in that coalition, \(\omega _{x^{\prime }}\equiv |x^{\prime }|!(n-|x^{\prime }|-1)!/n!\) is a combinatorial weighting factor summing to one over all possible coalitions, b is a background dataset, and \(\bar {x^{\prime }}\) stands for the set of variables not included in x ^{′}.
Equation 2 is the weighted sum of marginal contributions of variable k accounting for the number of possible variable coalitions.^{Footnote 13} In a general model, it is usually not possible to put an arbitrary feature to missing, i.e., exclude it. Instead, the contributions from features not included in x ^{′} are integrated out over a suitable background dataset, where \(\{x_i \mid \bar {x^{\prime }}\}\) is the set of points with variables not in x ^{′} being replaced by values in b. The background provides an informative reference point by determining the intercept \(\phi _0^S\). A reasonable choice is the training dataset incorporating all information the model has learned from.
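For reference, the decomposition described by Eqs. 1 and 2, together with the averaging over the background dataset, can be written out as follows. This is a sketch assembled from the definitions in the text; the symbol \(\mathbb{E}_b\) for the average over the background and the exact typesetting are assumptions.

```latex
\begin{align}
\Phi^S[f(x_i)] &\equiv \phi_0^S + \sum_{k=1}^{n} \phi_k^S(x_i) = f(x_i), \\
\phi_k^S(x_i; f) &= \sum_{x' \subseteq \mathcal{C}(x)\setminus\{k\}} \omega_{x'}
  \left( \mathbb{E}_b\left[f(x_i) \mid x' \cup \{k\}\right] - \mathbb{E}_b\left[f(x_i) \mid x'\right] \right), \\
\mathbb{E}_b\left[f(x_i) \mid x'\right] &= \frac{1}{|b|} \sum_{b} f(x_i \mid \bar{x'}).
\end{align}
```

The first line expresses exactness, the second the weighted marginal contributions of variable k over all coalitions, and the third the replacement of excluded variables by background values.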
An obvious disadvantage of Shapley values compared to permutation importance is the considerably higher complexity of their calculation. Given the factorial in Eq. 2, an exhaustive calculation is generally not feasible with larger feature sets. This can be addressed by either sampling from the space of coalitions or by setting all “not important” variables to “others,” i.e., treating them as single variables. This substantially reduces the number of elements in \(\mathcal {C}(x)\).
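The exhaustive calculation can be sketched for a small number of features, which also makes the cost visible: the number of coalitions doubles with every added variable. The function below is an illustrative brute-force version with hypothetical names and toy data, not the sampling-based estimators used in practice.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, background):
    """Exact Shapley values of `predict` at point x relative to a background set."""
    n = len(x)

    def value(coalition):
        # Average prediction when features outside the coalition take background values.
        total = 0.0
        for b in background:
            z = [x[j] if j in coalition else b[j] for j in range(n)]
            total += predict(z)
        return total / len(background)

    phi = []
    for k in range(n):
        others = [j for j in range(n) if j != k]
        contribution = 0.0
        for size in range(n):
            # Combinatorial weight |x'|! (n - |x'| - 1)! / n!
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                contribution += weight * (value(s | {k}) - value(s))
        phi.append(contribution)
    return value(set()), phi  # intercept phi_0 and the attributions phi_k

# Assumed toy model with an interaction term, and a small background set.
model = lambda z: 2 * z[0] - z[1] + z[0] * z[2]
background = [[0.0, 0.0, 0.0], [1.0, 2.0, 1.0], [2.0, 1.0, 2.0]]
x = [1.0, 3.0, 2.0]
phi_0, phi = shapley_values(model, x, background)
```

The intercept plus the attributions adds up to the model prediction at x, which is the exactness property stated for the decomposition.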
Nevertheless, these computational costs come with significant advantages. Shapley values are the only feature attribution method which is model independent, local, accurate, linear, and consistent [28]. This means that they deliver a granular, high-fidelity approach for assessing the contribution and importance of variables. By comparing the local attributions of a variable across all observations, we can visualize the functional form learned by the model. For instance, we might see that observations with a high (low) value on the variable have a disproportionately high (low) Shapley value on that variable, indicating a positive nonlinear functional form.
Based on these properties, which are directly inherited from the game theoretic origins of Shapley values, we can formulate an inference framework using Eq. 1. Namely, the Shapley regression [24],
where k = 0 corresponds to the intercept and \(\hat {\epsilon }_i\sim \mathcal {N}(0,\sigma _{\epsilon }^2)\). The surrogate coefficients \(\beta _k^S\) are tested against the null hypothesis
with \(\varOmega \subset \mathbb {R}^n\) (a region of) the model input space. The intuition behind this approach is to test the alignment of Shapley components with the target variable. This is analogous to a linear model where we use “raw” feature values rather than their associated Shapley attributions. A key difference to the linear case is the regional dependence on Ω. We only make local statements about the significance of variable contributions, i.e., on those regions where it is tested against \(\mathcal {H}_{0}\). This is appropriate in the context of potential nonlinearity, where the model plane in the original input-target space may be curved, unlike that of a linear model. Note that the Shapley value decomposition (Eqs. 1–3) absorbs the signs of variable attributions, such that only positive coefficient values indicate significance. When negative values occur, it indicates that a model has poorly learned from a variable and \(\mathcal {H}_{0}\) cannot be rejected.
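In symbols, the regression and its null hypothesis can be sketched as below. This form is inferred from the description in the text, namely an auxiliary regression of the outcome on the Shapley components with a one-sided test, and the exact notation is an assumption.

```latex
\begin{align}
y_i &= \sum_{k=0}^{n} \phi_k^S(x_i)\, \beta_k^S + \hat{\epsilon}_i, \\
\mathcal{H}_0^k(\varOmega) &: \left\{ \beta_k^S \leq 0 \,\middle|\, \varOmega \right\}.
\end{align}
```

The one-sided null matches the statement that only positive coefficients indicate significance.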
The coefficients β ^{S} are only informative about variable alignment (the strength of association between the output variable and feature of interest), not the magnitude of importance of a variable. Both together can be summarized by Shapley share coefficients,
where 〈⋅〉_{Ω} stands for the average over x _{k} in \(\varOmega _k\in \mathbb {R}\). The Shapley share coefficient \(\varGamma _k^S(f,\varOmega )\) is a summary statistic for the contribution of x _{k} to the model over a region \(\varOmega \subset \mathbb {R}^n\) for modelling y.
It consists of three parts. The first is the sign, which is the sign of the corresponding linear model. The motivation for this is to indicate the direction of alignment of a variable with the target y. The second part is coefficient size. It is defined as the fraction of absolute variable attribution allotted to x _{k} across Ω. The sum of the absolute value of Shapley share coefficients is one by construction.^{Footnote 14} It measures how much of the model output is explained by x _{k}. The last component is the significance level, indicated by the star notation (∗), and refers to the standard notation used in regression analysis to indicate the certainty with which we can reject the null hypothesis (Eq. 5). This indicates the confidence one can have in information derived from variable x _{k} measured by the strength of alignment of the corresponding Shapley components and the target, which is the same as its interpretation in a conventional regression analysis.
Equation 7 provides the explicit form for the linear model, where an analytical form exists. The only difference to the conventional regression case is the normalizing factor.
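The sign and size parts of the Shapley share coefficient can be sketched as follows, leaving the significance stars aside. The function name and the toy attributions are hypothetical, and the signs are simply passed in rather than estimated from a linear model.

```python
def shapley_shares(phi, signs):
    """Signed Shapley shares from a matrix of local attributions.

    phi:   one row per observation, one column per feature (intercept excluded).
    signs: +1 or -1 per feature, e.g. taken from a linear model's coefficients.
    """
    n_obs, n_feat = len(phi), len(phi[0])
    shares = [0.0] * n_feat
    for row in phi:
        total = sum(abs(v) for v in row)  # assumed non-zero for every observation
        for k, v in enumerate(row):
            shares[k] += abs(v) / total / n_obs  # average fraction of absolute attribution
    return [s * sign for s, sign in zip(shares, signs)]

# Assumed toy attributions for two observations and three features.
phi = [[1.0, -1.0, 2.0], [0.5, -1.5, 2.0]]
shares = shapley_shares(phi, signs=[1, -1, 1])
```

By construction, the absolute values of the shares sum to one, so each share can be read as the fraction of the model output attributed to a variable over the region considered.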
4.2 Results
We explain the predictions of the machine learning models and the linear regression as calibrated in the baseline setup of our forecasting. Our focus is largely on explaining forecast predictions in a pseudo-real-world setting where the model is trained on earlier observations that predate the predictions. However, in some cases it can be instructive to explain the predictions of a model that was trained on observations across the whole time period. For that, we use fivefold block cross-validation [3, 42].^{Footnote 15} This cross-validation analysis is subject to look-ahead bias, as we use future data to predict the past, but it allows us to evaluate a model for the whole time series.
4.2.1 Feature Importance
Figure 2 shows the global variable importance based on the analysis of the forecasting predictions. It compares Shapley shares Γ ^{S} (left panel) with permutation importance \(\bar {I}\) (middle panel). The variables are sorted by the Shapley shares of the best-performing model, the random forest. Vertical lines connect the lowest and highest share across models for each feature as a measure of disagreement between models.
The two importance measures only roughly agree in their ranking of feature importance. For instance, using a random forest model, past unemployment seems to be a key indicator according to permutation importance but relatively less crucial according to Shapley calculations. Permutation importance is based on model forecasting error and so is a measure of a feature’s predictive power (how much its inclusion in a model improves predictive accuracy), and it is influenced by how the relationship between outcome and features may change over time. In contrast, Shapley values indicate which variables influence a predicted value, independent of predictive accuracy. The right panel of Fig. 2 shows an altered measure of permutation importance. Instead of measuring the change in the error due to permutations, we measure the change in the predicted value.^{Footnote 16} We see that this importance measure is more closely aligned with Shapley values. Furthermore, when we evaluate permutation importance using predictions based on block cross-validation, we find a strong alignment with Shapley values, as the relationship between variables is not affected by the change between the training and test set (not shown).
Figure 3 plots Shapley values attributed to the S&P500 (vertical axis) against its input values (horizontal axis) for the random forest (left panel) and the linear regression (right panel) based on the block cross-validation analysis.^{Footnote 17} Each point reflects one of the observations between 1990 and 2019 and its respective value on the S&P500 variable. The approximate functional forms learned by both models are traced out by best-fit degree-3 polynomials. The linear regression learns a steep negative slope, i.e., higher stock market values are associated with lower unemployment 1 year down the road. This makes economic sense. However, we can make more nuanced observations for the random forest. There is satiation for high market valuations, i.e., changes beyond a certain point do not provide greater information for changes in unemployment.^{Footnote 18} A linear model is not able to reflect those nuances, while machine learning models provide a more detailed signal from the stock market and other variables.
4.2.2 Shapley Regressions
Shapley value-based inference allows us to communicate machine learning models analogously to a linear regression analysis. The difference between the coefficients of a linear model and Shapley share coefficients is primarily the normalization of the latter. The reason for this is that nonlinear models do not have a “natural scale,” for instance, to measure variation. We summarize the Shapley regression on the forecasting predictions (1990–2019) of the random forest and linear regression in Table 4.
The coefficients \(\beta ^S\) measure the alignment of a variable with the target. Values close to one indicate perfect alignment and convergence of the learning process. Values larger than one indicate that a model underestimates the effect of a variable on the outcome. The opposite is the case for values smaller than one. This can intuitively be understood from the model hyperplane of the Shapley regression either tilting more towards a Shapley component from a variable (underestimation, \(\beta ^S_k>1\)) or away from it (overestimation, \(\beta ^S_k<1\)). Significance decreases as \(\beta ^S_k\) approaches zero.^{Footnote 19}
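The reading of \(\beta ^S_k\) as a measure of alignment can be checked on a toy example. The sketch below is an in-sample illustration under strong assumptions, not the procedure behind Table 4: the data are synthetic, the model is an ordinary least squares fit, and its Shapley values are taken as \(b_k (x_k - \bar {x}_k)\), which holds for a linear model when features are treated as independent. Inference (standard errors and p-values) is omitted; see [24] for the technical details.

```python
import numpy as np

# Synthetic data (assumption: three independent features, linear target).
rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 3))
y = 1.5 * X[:, 0] - 0.8 * X[:, 1] + 0.3 * X[:, 2] + rng.normal(scale=0.5, size=n)


def ols(design: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Least squares coefficients for target ~ design."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef


# Step 1: fit the model whose predictions are to be explained.
b = ols(np.column_stack([np.ones(n), X]), y)

# Step 2: Shapley decomposition of the linear model's predictions
# (valid under the feature-independence assumption stated above).
phi = b[1:] * (X - X.mean(axis=0))

# Step 3: regress the target on the Shapley components.
beta_s = ols(np.column_stack([np.ones(n), phi]), y)

# Illustration of underestimation: halve the attribution of feature 0.
phi_under = phi.copy()
phi_under[:, 0] *= 0.5
beta_under = ols(np.column_stack([np.ones(n), phi_under]), y)
```

Here all entries of `beta_s[1:]` equal one up to numerical precision, because the Shapley components span the same space as the original regressors. Halving the attribution of the first feature mimics a model that underestimates its effect, and the corresponding coefficient in `beta_under` rises to two.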
Variables with lower p-values usually have higher Shapley shares \(\Gamma ^S\), which are equivalent to those shown in Fig. 2. This is intuitive as the model learns to rely more on features which are important for predicting the target. However, this does not hold by construction. Especially in the forecasting setting where the relationships of variables change over time, the statistical significance may disappear in the test set, even for features with high shares.
In the Shapley regression, more variables are statistically significant for the random forest than for the linear regression model. This is expected, because the forest, like other machine learning models, can exploit nonlinear relationships that the regression cannot account for (as in Fig. 3), i.e., it is a more flexible model. These are then reflected in localized Shapley values providing a stronger, i.e., more significant, signal in the regression stage.
5 Conclusion
This chapter provided a comparative study of how machine learning models can be used for macroeconomic forecasting relative to standard econometric approaches. We find significantly better performance of machine learning models for forecasting changes in US unemployment at a 1-year horizon, particularly in the period after the global financial crisis of 2008.
Apart from model performance, we provide an extensive explanation of model predictions, where we present two approaches that allow for greater machine learning interpretability: permutation feature importance and Shapley values. Both methods demonstrate that a range of machine learning models learn comparable signals from the data. By decomposing individual predictions into Shapley value attributions, we extract learned functional forms that allow us to visually demonstrate how the superior performance of machine learning models is explained by their enhanced ability to adapt to individual variable-specific nonlinearities. Our example allows for a more nuanced economic interpretation of learned dependencies compared to the interpretation offered by a linear model. The Shapley regression framework, which enables conventional parametric inference on machine learning models, allows us to communicate the results of machine learning models analogously to traditional presentations of regression results.
Nevertheless, as with conventional linear models, the interpretation of our results is not fixed. We observe some variation under different models, different model specifications, and the interpretability method chosen. This is in part due to small sample limitations; this modelling issue is common, but it is likely aggravated when using machine learning models due to their nonparametric structure.
However, we believe that the methodology and results presented justify the use of machine learning models and such explainability methods to inform decisions in a policy-making context. The inherent advantages of their nonlinearity over conventional models are most evident in a situation where the underlying data-generating process is unknown and expected to change over time, such as in a forecasting environment as presented in the case study here. Overall, the use of machine learning in conjunction with Shapley value-based inference as presented in this chapter may offer a better trade-off between maximizing predictive performance and statistical inference, thereby narrowing the gap between Breiman’s two cultures.
Notes
 1.
 2.
Local Interpretable Model-agnostic Explanations.
 3.
Deep Learning Important FeaTures for NN.
 4.
That is, we are in the setting of a regression problem in machine learning speak, while classification problems operate on categorical targets. All approaches presented here can be applied to both situations.
 5.
In machine learning, classification is arguably the most relevant and most researched prediction problem. While models such as random forests and support vector machines are best known as classifiers, their regression variants are also known to perform well.
 6.
 7.
The horizon of the Diebold-Mariano test is set to 1 for all tests. Note, however, that the horizon of the AR model is 12, so that the p-values for this comparison are biased and thus reported in parentheses. Setting the horizon of the Diebold-Mariano test to 12, we do not observe significant differences between the RMSE of the random forest and AR.
 8.
The mean absolute deviations from the models’ mean predictions are 0.439, 0.356, and 0.207 for the random forest, regression, and AR, respectively.
 9.
Lundberg et al. [29] proposed TREESHAP, which correctly estimates the Shapley values when features are dependent for tree models only.
 10.
Considering a test set of size m with each observation having a unique value, there are m! permutations to consider for an exhaustive evaluation, which is intractable to compute for larger m.
 11.
Alternatively, the difference \(e_{j}^{perm} - e\) can be considered.
 12.
Note that \(I_{k} \geq 1\) in general. If not, there may be problems with model optimization.
 13.
For example, assuming we have three players (variables) {A, B, C}, the Shapley value of player C would be \(\phi _{C}^{S}(f) = 1/3 [f(\{A,B,C\}) - f(\{A,B\})] + 1/6 [f(\{A,C\}) - f(\{A\})] + 1/6 [f(\{B,C\}) - f(\{B\})] + 1/3 [f(\{C\}) - f(\emptyset )]\).
 14.
The normalization is not needed in binary classification problems where the model output is a probability. Here, a Shapley contribution relative to a base rate can be interpreted as the expected change in probability due to that variable.
 15.
The time series is partitioned in five blocks of consecutive points in time and each block is once used as the test set.
 16.
This metric computes the mean absolute difference between the observed predicted values and the predicted values after permuting feature k: \(\frac {1}{m}\sum _{i=1}^{m}|\hat {y}_i - \hat {y}_{i(k)}^{perm}|\). The higher this difference, the higher the importance of the feature k (see [26, 36] for similar approaches to measure variable importance).
 17.
Showing the Shapley values based on the forecasting predictions makes it difficult to disentangle whether nonlinear patterns are due to a nonlinear functional form or to (slow) changes of the functional form over time.
 18.
Similar nonlinearities are learned by the SVR and the neural network.
 19.
The underlying technical details for this interpretation are provided in [24].
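The weights in the three-player example of note 13 can be verified numerically. The sketch below computes exact Shapley values by enumerating all coalitions; the function name and the toy value function are hypothetical and serve only to illustrate the formula.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values by enumerating every coalition without the player."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for size in range(n):
            # Weight |S|! (n - |S| - 1)! / n!, e.g. 1/3, 1/6, 1/3 for n = 3.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                s = frozenset(coalition)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


def toy_value(coalition):
    # Hypothetical game: additive payoffs plus a synergy when A and B cooperate.
    base = {"A": 1.0, "B": 2.0, "C": 3.0}
    payoff = sum(base[p] for p in coalition)
    if {"A", "B"} <= coalition:
        payoff += 6.0
    return payoff


phi = shapley_values(["A", "B", "C"], toy_value)
```

In this toy game, player C receives exactly its standalone payoff, while A and B share their synergy equally. The attributions sum to the payoff of the full coalition, which is the efficiency property that makes Shapley values a complete decomposition of a prediction. Exhaustive enumeration grows exponentially in the number of players and is practical only for a small number of variables.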
References
Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6), 716–723.
Athey, S., & Imbens, G. (2016). Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, 113(27), 7353–7360.
Bergmeir, C., & Benítez, J. M. (2012). On the use of cross-validation for time series predictor evaluation. Information Sciences, 191, 192–213.
Bianchi, D., Büchner, M., & Tamoni, A. (2019). Bond risk premia with machine learning. In USC-INET Research Paper, No. 19–11.
Bluwstein, K., Buckmann, M., Joseph, A., Kang, M., Kapadia, S., & Simsek, Ö. (2020). Credit growth, the yield curve and financial crisis prediction: evidence from a machine learning approach. In Bank of England Staff Working Paper, No. 848.
Bracke, P., Datta, A., Jung, C., & Sen, S. (2019). Machine learning explainability in finance: an application to default risk analysis. In Bank of England Staff Working Paper, No. 816.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.
Breiman, L. (2001). Statistical modeling: The two cultures (with comments and a rejoinder by the author). Statistical Science, 16(3), 199–231.
Chen, J. C., Dunn, A., Hood, K. K., Driessen, A., & Batch, A. (2019). Off to the races: A comparison of machine learning and alternative data for predicting economic indicators. In Big Data for 21st Century Economic Statistics. Chicago: National Bureau of Economic Research, University of Chicago Press. Available at: http://www.nber.org/chapters/c14268.pdf
Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., et al. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1), C1–C68.
Chernozhukov, V., Demirer, M., Duflo, E., & Fernandez-Val, I. (2018). Generic machine learning inference on heterogenous treatment effects in randomized experiments. In NBER Working Paper Series, No. 24678.
Conneau, A., & Lample, G. (2019). Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems, NIPS 2019 (Vol. 32, pp. 7059–7069). Available at: https://proceedings.neurips.cc/paper/2019/file/c04c19c2c2474dbf5f7ac4372c5b9af1Paper.pdf
Coulombe, P. G., Leroux, M., Stevanovic, D., & Surprenant, S. (2019). How is machine learning useful for macroeconomic forecasting? In CIRANO Working Papers 2019s-22. Available at: https://ideas.repec.org/p/cir/cirwor/2019s22.html
Crawford, K. (2013). The hidden biases of big data. Harvard Business Review, art number H00ADRPDFENG. Available at: https://hbr.org/2013/04/thehiddenbiasesinbigdata
Döpke, J., Fritsche, U., & Pierdzioch, C. (2017). Predicting recessions with boosted regression trees. International Journal of Forecasting, 33(4), 745–759.
Drucker, H., Burges, C. J. C., Kaufman, L., Smola, A. J., & Vapnik, V. (1997). Support vector regression machines. In Advances in Neural Information Processing Systems, NIPS 1996 (Vol. 9, pp. 155–161). Available at: https://papers.nips.cc/paper/1996/file/d38901788c533e8286cb6400b40b386dPaper.pdf
European Union. (2016). Regulation (EU) 2016/679 of the European Parliament, Directive 95/46/EC (General Data Protection Regulation). Official Journal of the European Union, L119, 1–88.
Fisher, A., Rudin, C., & Dominici, F. (2019). All models are wrong, but many are useful: Learning a variable’s importance by studying an entire class of prediction models simultaneously. Journal of Machine Learning Research, 20(177), 1–81.
Friedman, J., Hastie, T., & Tibshirani, R. (2009). The Elements of Statistical Learning. Springer Series in Statistics. Berlin: Springer.
Fuster, A., Goldsmith-Pinkham, P., Ramadorai, T., & Walther, A. (2017). Predictably unequal? The effects of machine learning on credit markets. In CEPR Discussion Papers (No. 12448).
Giannone, D., Lenza, M., & Primiceri, G. E. (2017). Economic predictions with big data: The illusion of sparsity. In CEPR Discussion Paper (No. 12256).
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. Cambridge: MIT Press.
Henelius, A., Puolamäki, K., Boström, H., Asker, L., & Papapetrou, P. (2014). A peek into the black box: exploring classifiers by randomization. Data Mining and Knowledge Discovery, 28(5–6), 1503–1529.
Joseph, A. (2020). Parametric inference with universal function approximators, arXiv, CoRR abs/1903.04209
Kazemitabar, J., Amini, A., Bloniarz, A., & Talwalkar, A. S. (2017). Variable importance using decision trees. In Advances in Neural Information Processing Systems, NIPS 2017 (Vol. 30, pp. 426–435). Available at: https://papers.nips.cc/paper/2017/file/5737c6ec2e0716f3d8a7a5c4e0de0d9aPaper.pdf
Lemaire, V., Féraud, R., & Voisine, N. (2008). Contact personalization using a score understanding method. In 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence) (pp. 649–654).
Lipton, Z. C. (2016). The mythos of model interpretability, ArXiv, CoRR abs/1606.03490
Lundberg, S., & Lee, S.-I. (2017). A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, NIPS 2017 (Vol. 30, pp. 4765–4774). Available at: https://papers.nips.cc/paper/2017/file/8a20a8621978632d76c43dfd28b67767Paper.pdf
Lundberg, S., Erion, G., & Lee, S.-I. (2018). Consistent individualized feature attribution for tree ensembles. ArXiv, CoRR abs/1802.03888
McCracken, M. W., & Ng, S. (2016). FRED-MD: A monthly database for macroeconomic research. Journal of Business & Economic Statistics, 34(4), 574–589.
Medeiros, M. C., Vasconcelos, G. F. R., Veiga, Á., & Zilberman, E. (2019). Forecasting inflation in a data-rich environment: the benefits of machine learning methods. Journal of Business & Economic Statistics, 39(1), 98–119.
Miller, T. (2017). Explanation in Artificial Intelligence: Insights from the Social Sciences. ArXiv, CoRR abs/1706.07269
Racine, J. (2000). Consistent cross-validatory model-selection for dependent data: hv-block cross-validation. Journal of Econometrics, 99(1), 39–61.
Ribeiro, M., Singh, S., & Guestrin, C. (2016). “Why should I trust you?”: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD (pp. 1135–1144).
Ribeiro, M. T., Singh, S., & Guestrin, C. (2018). Anchors: High-precision model-agnostic explanations. In Thirty-Second AAAI Conference on Artificial Intelligence, AAAI 2018 (pp. 1527–1535), art number 16982. Available at: https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16982
Robnik-Šikonja, M., & Kononenko, I. (2008). Explaining classifications for individual instances. IEEE Transactions on Knowledge and Data Engineering, 20(5), 589–600.
Schroff, F., Kalenichenko, D., & Philbin, J. (2015). Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 815–823).
Sermpinis, G., Stasinakis, C., Theofilatos, K., & Karathanasopoulos, A. (2014). Inflation and unemployment forecasting with genetic support vector regression. Journal of Forecasting, 33(6), 471–487.
Shapley, L. (1953). A value for n-person games. Contributions to the Theory of Games, 2, 307–317.
Shrikumar, A., Greenside, P., & Kundaje, A. (2017). Learning important features through propagating activation differences. ArXiv, CoRR abs/1704.02685.
Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., et al. (2018). A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science, 362(6419), 1140–1144.
Snijders, T. A. B. (1988). On cross-validation for predictor evaluation in time series. In T. K. Dijkstra (Ed.), On model uncertainty and its statistical implications, LNE (Vol. 307, pp. 56–69). Berlin: Springer.
Stock, J. H., & Watson, M. W. (2002). Forecasting using principal components from a large number of predictors. Journal of the American Statistical Association, 97(460), 1167–1179.
Štrumbelj, E., & Kononenko, I. (2010). An efficient explanation of individual classifications using game theory. Journal of Machine Learning Research, 11, 1–18.
Štrumbelj, E., Kononenko, I., & Robnik-Šikonja, M. (2009). Explaining instance classifications with interactions of subsets of feature values. Data & Knowledge Engineering, 68(10), 886–904.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267–288.
Wager, S., & Athey, S. (2018). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523), 1228–1242.
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2021 The Author(s)
About this chapter
Cite this chapter
Buckmann, M., Joseph, A., Robertson, H. (2021). Opening the Black Box: Machine Learning Interpretability and Inference Tools with an Application to Economic Forecasting. In: Consoli, S., Reforgiato Recupero, D., Saisana, M. (eds) Data Science for Economics and Finance. Springer, Cham. https://doi.org/10.1007/9783030668914_3
DOI: https://doi.org/10.1007/9783030668914_3
Publisher Name: Springer, Cham
Print ISBN: 9783030668907
Online ISBN: 9783030668914
eBook Packages: Computer Science, Computer Science (R0)