1 Introduction

The rapidly growing interest in artificial intelligence (AI) for decision-making underlines the need to explain analytical solutions derived from black box machine learning (ML) approaches. In this vein, De Bock et al. (2023) address the relevance of explainable artificial intelligence (XAI) in operations research (OR) and introduce a pioneering and comprehensive research agenda. The scope of this paper is an assessment of XAI with a focus on economic data analytics. Although the application of AI is expanding into all areas of OR in finance, there is as yet no standard metric for assessing the explainability of machine learning approaches in financial data analytics. This aspect is crucial, since regulatory demands call for trustworthy and transparent AI in line with the General Data Protection Regulation. Therefore, this study adds to the general discussion in De Bock et al. (2023) and provides an in-depth discussion of the information content of the discussed post-hoc explanation methods with a focus on economic data.

The application of AI to economic data stands in stark contrast to the "traditional" application of econometric regression approaches. Specifically, ML techniques are less restrictive concerning assumptions about the underlying data (e.g. normality, stationarity, linearity) and are therefore less limiting. Also, due to the scalable number of coefficients and the scalable network complexity, ML algorithms are well suited to handle larger data sets with complex dependence structures. On the other hand, ML algorithms are often described as black boxes without adequate interpretability, which is of particular importance for financial applications, especially from a regulatory point of view.

Triggered by Cochrane (2011), who emphasizes that econometric regression approaches can be insufficient for handling a large number of highly correlated predictive economic variables, a field of study has emerged that deals with the application of deep learning to financial data. In general, the rapidly growing body of literature can be divided into two strands: the application of ML techniques to cross-sectional data in order to validate advantages in the context of large data sets, and the application of ML techniques to time series data to improve forecasting accuracy (see Karolyi and Van Nieuwerburgh 2020). In this regard, econometric benchmarks are given by the studies of Fama and French (2008) and Lewellen (2015), who provide fundamental discussions of cross-sectional data, and by Welch and Goyal (2008) and Koijen and Van Nieuwerburgh (2011), who provide elementary findings regarding econometric benchmarks for time series data.

This paper draws on the first strand of literature and addresses the interpretability of machine learning applied to economic data; we also demonstrate that our implications are highly relevant for time series forecasting, the second strand. We discuss the precision of competing techniques for achieving post-hoc explanations by studying both simulated and real economic data. As a variety of competing machine learning and deep learning approaches exist, we focus on model-agnostic techniques which enable us to unbox any black box ML approach in order to assess economic interpretations of the identified determinants.

In this vein, Gu et al. (2020) provide a thorough empirical assessment of competing ML approaches for empirical asset pricing and study the benefits of ML techniques. The authors compare the performance of black box ML techniques against econometric benchmarks and demonstrate that a higher predictive accuracy can be achieved via competing ML techniques, i.e. trees and neural nets. Furthermore, the authors apply ML techniques to applied portfolio management and demonstrate the economic gains to investors from outperforming regression-based trading strategies. Rossi (2018) and Chen et al. (2019) provide similar results and find that return predictability can be improved by the application of tree-based techniques. Avramov et al. (2022) add to the discussion of ML-based asset pricing and confirm the superior performance of ML in the context of return prediction under economic restrictions. The authors argue that ML-based signals turn out to be more profitable for investors than the factors identified via linear modelling in recent literature. Feng et al. (2020) demonstrate the application of deep learning approaches for predicting asset returns and Longo et al. (2022) provide similar results with respect to GDP prediction. Krauss et al. (2017) apply machine learning techniques to identify the number of relevant factors for arbitrage on the S&P 500 and Schnaubelt et al. (2020) demonstrate that arbitrage strategies can be improved via ML techniques by taking into account innovative data sources, such as Twitter. For a thorough overview of the identified relevant determinants for asset pricing we refer to Harvey et al. (2016) and Hou et al. (2020).

Although the promising performance of ML techniques on economic data describes a growing field of research, the lack of explainability in the context of financial data has not yet been adequately discussed. Taking into account that financial applications such as return prediction (see Avramov et al. 2022), asset management (see Babaei et al. 2022) or credit risk assessment (see Kellner et al. 2022) are highly relevant for financial institutions, the lack of interpretability of ML techniques presents a crucial challenge, since regulatory demands require interpretability as a necessary precondition. In this vein, Giudici and Raffinetti (2022) address cyber risk management, Babaei et al. (2023) the lending process within a fintech universe for small and medium enterprises, and Babaei et al. (2022) provide a discussion of asset allocation and crypto assets. Up to now, recent literature on ML has mainly focused on classification problems, and the assessment of individual feature contributions, especially when the target variable is measured on a metric scale, is of growing interest not only in a financial context (see De Bock et al. 2023).

However, competing concepts of interpretable ML co-exist. Ribeiro et al. (2016) propose the Local Interpretable Model-agnostic Explanations (LIME) approach to study the contributions of individual explanatory variables to the target variable. Potrawa and Tetereva (2022) demonstrate the application of LIME to real estate data. Also, as described by Du et al. (2018), Feature Importance presents a widely accepted technique to identify the contribution of each feature to a desired precision metric of the estimated model. For instance, Kellner et al. (2022) apply a neural network structure to model quantiles for loss given default prediction and assess economic interpretability via global Feature Importance. Gu et al. (2020) apply Feature Interaction to study relevant determinants of equity premia. Lundberg and Lee (2017) introduce a model-agnostic computational framework derived from cooperative game theory, namely Shapley values, and build an explanatory approach that locally approximates the original model by calculating the marginal contribution of each explanatory variable. Lundberg et al. (2020) provide a discussion of local and global interpretation and Davila-Pena et al. (2022) provide a thorough assessment of SHAP values for large data sets.

With our paper, we study competing approaches to interpretable machine learning in order to improve the quality of model validation in the context of financial data, especially when the response variable is measured on a metric scale. Furthermore, we add to the existing literature and assess a black box ML approach, namely boosted trees, which is also applicable to panel data. In order to assess the precision of the competing approaches, we discuss both simulated and empirical data. The simulation and the empirical assessment are closely related to the study of Gu et al. (2020): we discuss similar data and add to this study by providing an innovative perspective on the interpretability of the assessed ML approaches.

The remainder of this paper is structured as follows. Section 2 reviews the relevant literature on interpretable machine learning, Sect. 3 gives an overview of the methodology and Sect. 4 presents the investigated data. The results of the simulation study and the empirical assessment are presented in Sect. 5 and Sect. 6, and Sect. 7 concludes.

2 Literature on model interpretability

Up to now, there is no unique definition of interpretable machine learning and the differentiation between explainability and interpretability is vague. In this paper, our understanding of model interpretability is similar to Doshi-Velez and Kim (2017), who describe model interpretability as the degree to which a human can consistently predict the result of a model. With reference to this paper, we discuss interpretable ML as a crucial component in the field of explainable artificial intelligence (XAI) for operational research as introduced in De Bock et al. (2023). Because interpretable ML merges different disciplines, i.e. statistics, mathematics and computer science, different terms have come up. In this paper we use the expression "target variable" to describe the dependent or response variable, and "explanatory variable", which corresponds to the independent variable or feature.

Generally, the interpretability of a model can be separated into three stages: pre-model interpretability, which takes place before building a model; in-model interpretability, while building a model, for example via constraints imposed on the complexity of a model; and post-model interpretability, where, after building a model, we are interested in the assessment of the achieved output. In this paper, we focus on the latter, post-model interpretability, and study interpretable machine learning techniques that can be described as model-agnostic approaches. That is, contrary to model-specific interpretability such as AIC scores or p-values, which are derived directly from the applied model, we exclusively take into account pairs of input and output variables. Consequently, the findings of our paper are applicable to all black box ML algorithms.

A widely used model-agnostic concept to achieve post-model interpretability is the technique of Feature Importance. Technically, the approach focuses on the model error and measures the increase in model error when the information of one input variable is perturbed. As described in Molnar (2022), a feature is determined as important if perturbing its values increases the model error, measured in terms of variance. On the other hand, an explanatory variable is characterized as less important if perturbing its values does not impact the error of the model. Severino and Peng (2021) assess fraud prediction in property insurance via Feature Importance and Gu et al. (2020) apply a similar technique, called Feature Interaction, to evaluate the application of ML to factor modelling in the context of financial data. Kellner et al. (2022) apply a neural network structure to model relevant quantiles for loss given default prediction and assess economic interpretability via global Feature Importance. As the concept of Feature Importance summarizes the impact of each variable, averaged over the full sample, it provides information on the global impact of an explanatory variable on the model output. In contrast, LIME and SHAP focus on the interpretation of individual observations and provide information on the local importance of each explanatory variable.

Within this framework of local importance, Lundberg and Lee (2017) draw on Shapley (1953) and introduce a model-agnostic computational framework derived from cooperative game theory, namely Shapley values, to construct an explanatory model that locally approximates the original model. The authors show that this is the only approach based on a solid theory which satisfies statistical properties such as efficiency, symmetry and additivity, and which is therefore highly relevant for legal complaints. Bussmann et al. (2021) apply Shapley values to discuss credit risk management and Lin and Bai (2021) collect and assess 224 financial and non-financial variables of 40 listed enterprises and apply Shapley values to identify local feature importance. Jäger et al. (2021) provide a pioneering study on interpretable machine learning to discuss the performance of competing portfolio strategies. The authors provide a non-linear interpretable measure, the Calmar-ratio spread, to discuss the performance of two strategies based on bootstrapped data sets and use Shapley values to identify the importance of the Calmar ratio for drawdown measures of individual assets. Giudici and Raffinetti (2021) combine the Shapley framework with the Lorenz Zonoid approach to normalize Shapley values and to provide information on the contribution of each feature to the global predictive accuracy. Furthermore, based on the introduced Shapley-Lorenz approach, Giudici and Raffinetti (2023) initiate a discussion on the accuracy and fairness of ML applied to financial market prices.

Ribeiro et al. (2016) also address the lack of interpretability of ML approaches and propose a co-existing model-agnostic technique, namely Local Interpretable Model-agnostic Explanations (LIME). LIME belongs to the group of surrogate models: the goal is to approximate the output of the underlying black box ML algorithm with an interpretable approach. Depending on the underlying ML task, especially the scale of the target variable, typical interpretable surrogate models are the linear or the logistic regression approach. Potrawa and Tetereva (2022) apply machine learning in combination with LIME to identify relevant determinants of real estate prices. Although this surrogate technique is based on sparse linear models, which are popular in the economic literature, the application of LIME to economic data has not been adequately assessed.

Furthermore, the discussed approaches, especially SHAP and LIME, are typically applied to explain binary classification problems. Thus, this paper draws on the pioneering study by Jäger et al. (2021), provides an innovative discussion in the context of economic data analytics, especially when the output variable is measured on a metric scale, and thereby fills an existing gap in the recent literature.

3 Methodology

In this section, we introduce the notation used throughout the paper, describe the benchmark linear model and tree ensembles, specify boosting for trees and present our interpretable machine learning settings, namely Feature Importance, LIME and SHAP.

3.1 Benchmark model

In order to assess the impact of explanatory variables on the equity premium, we apply the econometric baseline approach which is described as follows:

$$\begin{aligned} y_{t} = \beta _0 + X_t \beta + u_{t}. \end{aligned}$$
(1)

Here, \(y_{t}\) is the monthly return of the S&P 500 in excess of the risk-free rate at time t. \(X_t\) is the \(1 \times N\) vector of explanatory variables at time t and the linear impact of each variable is captured by the corresponding coefficients in the \(N \times 1\) vector \(\beta\). \(\beta _0\) is the constant, \(u_t\) the error term with mean 0, T is the sample size and N is the number of explanatory variables. We apply the simple ordinary least squares (OLS) technique to estimate the equity premium:

$$\begin{aligned} {\hat{y}}_{t|OLS} = {\hat{\beta }}_{0|OLS} + X_t {\hat{\beta }}_{OLS}. \end{aligned}$$
(2)

Also, as described in Wei et al. (2015), the estimated \({\hat{\beta }}_{OLS}\) depends on the dimension of the corresponding explanatory variable. Therefore, we also apply the multiple linear regression approach to the normalized explanatory variables in order to quantify the relative importance of each variable:

$$\begin{aligned} {\hat{y}}^{(z)}_{t|OLS} = {\hat{\beta }}^{(z)}_{0|OLS} + Z_t {\hat{\beta }}^{(z)}_{OLS}, \end{aligned}$$
(3)

where \(Z_t\) is the \(1 \times N\) vector of normalized explanatory variables at time t. Based on the normalized explanatory variables, \({{\hat{\beta }}}^{(z)}_{OLS}\) indicates the relative importance of each explanatory variable, independently of its individual dimension.
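To make the benchmark concrete, the following minimal sketch fits both regressions, equations (2) and (3), on a small synthetic data set; the predictor names, the synthetic data and the use of statsmodels are illustrative assumptions and not part of the original study.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
predictors = ["dp", "ep", "bm", "ntis", "tbl", "tms", "dfy", "svar"]

# Synthetic stand-in for the monthly Welch and Goyal (2008) predictors and the excess return
df = pd.DataFrame(rng.normal(size=(1069, len(predictors))), columns=predictors)
df["y"] = -0.5 * df["svar"] + 0.3 * df["dfy"] + rng.normal(scale=1.0, size=len(df))

# Equation (2): OLS on the raw explanatory variables
ols_raw = sm.OLS(df["y"], sm.add_constant(df[predictors])).fit()

# Equation (3): OLS on normalized (z-scored) explanatory variables, so that the
# estimated betas are comparable across predictors irrespective of their scale
Z = (df[predictors] - df[predictors].mean()) / df[predictors].std()
ols_norm = sm.OLS(df["y"], sm.add_constant(Z)).fit()

print(ols_norm.params.drop("const").sort_values())
```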

3.2 Tree ensembles

We draw on the findings of Gu et al. (2020) and apply boosted trees as a promising machine learning technique for economic data. The authors find that for economic data, namely the equity premia described in Welch and Goyal (2008), boosted models outperform more complex machine learning algorithms. Hence, we apply an ensemble approach that sums the predictions of multiple trees. As described in Chen and Guestrin (2016), the ensemble is then defined as:

$$\begin{aligned} {\hat{y}}_{t|ML} = \sum _{k=1}^{K}f_k(X_t), f_k \in F \end{aligned}$$
(4)

where K is the number of trees, each \(f_k\) is a function in the functional space F, and F is the set of all possible regression trees. \({\hat{y}}_{t|ML}\) is the estimated equity premium and, similar to the setting in equation (1), \(X_t\) is the \(1 \times N\) vector of explanatory variables at time t.

3.2.1 Regularized learning objective

In order to fit the tree ensembles to our data, we follow Chen and Guestrin (2016) and apply the following objective function:

$$\begin{aligned} obj=\underbrace{\sum _{t=1}^{T}l(y_t,{\hat{y}}_{t|ML}^{(i)})}_{training~loss}+\underbrace{\sum _{k=1}^K \Omega (f_k)}_{regularization}, \end{aligned}$$
(5)

with

$$\begin{aligned} \Omega (f)=\gamma A + \frac{1}{2}\lambda ||\omega ||^2. \end{aligned}$$
(6)

Here, l is a differentiable convex loss function that measures the difference between the prediction \({\hat{y}}^{(i)}_{t|ML}\) after iteration i and the target variable \(y_t\), and \(\Omega\) penalizes the complexity of the model. Each \(f_k\) corresponds to an independent tree structure with leaf weights \(\omega\), and A is the number of leaves in the tree.

3.2.2 Gradient tree boosting

Since the model consists of functions instead of parameters, we train the trees in an additive manner: let \({\hat{y}}^{(i-1)}_{t|ML}\) be the prediction of the \(t^{th}\) observation at the \((i-1)^{th}\) iteration; then \(f_i\) is added in order to minimize the following objective:

$$\begin{aligned} obj^i=\sum _{t=1}^{T}l\left( y_t,{\hat{y}}_{t|ML}^{(i-1)}+f_i(X_t)\right) +\Omega (f_i). \end{aligned}$$
(7)

Here, \(f_i\), which improves the model fit according to equation (5), is greedily added to the fitted model. As demonstrated in Chen and Guestrin (2016), this objective function can then be optimized via a second-order approximation:

$$\begin{aligned} \widetilde{obj}^i=\sum _{t=1}^{T}\left( g_t f_i(X_t)+\frac{1}{2}h_t f_i^2(X_t)\right) +\Omega (f_i), \end{aligned}$$
(8)

with \(g_t= \partial _{{\hat{y}}_{t|ML}^{(i-1)}} l(y_t,{\hat{y}}_{t|ML}^{(i-1)})\) and \(h_t= \partial ^2_{{\hat{y}}_{t|ML}^{(i-1)}} l(y_t,{\hat{y}}_{t|ML}^{(i-1)})\), which are first and second order gradient statistics on the loss function.
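For instance, under a squared-error loss \(l(y_t,{\hat{y}}_{t|ML})=(y_t-{\hat{y}}_{t|ML})^2\), which matches the RMSE-based assessment used below, these gradient statistics reduce to the following simple form (a worked example added for illustration):

$$\begin{aligned} g_t = 2\left( {\hat{y}}_{t|ML}^{(i-1)} - y_t\right) , \qquad h_t = 2. \end{aligned}$$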

Furthermore, we define \(I_j\) as the instance set of leaf j and hence we can expand \(\Omega\) as follows:

$$\begin{aligned} \widetilde{obj}^i=\sum _{t=1}^{T}\left( g_t f_i(X_t)+\frac{1}{2}h_t f_i^2(X_t)\right) + \gamma A + \frac{1}{2} \lambda \sum _{j=1}^A \omega _j^2 \end{aligned}$$
(9)

and for a fixed tree structure, the optimal weight of leaf j is given by \(\omega ^*_j\)

$$\begin{aligned} \omega _j^*=-\frac{\sum _{t \in I_j}g_t}{\sum _{t \in I_j} h_t + \lambda }, \end{aligned}$$
(10)

and the corresponding optimal value is calculated as

$$\begin{aligned} \widetilde{obj}^i=-\frac{1}{2} \sum _{j=1}^A \frac{\left( \sum _{t \in I_j}g_t\right) ^2}{\sum _{t \in I_j} h_t + \lambda }+\gamma A. \end{aligned}$$
(11)

Based on the scoring function presented in equation (11), branches are iteratively added to the tree. In doing so, trees are grown successively and we work exclusively with the original data without assessing, for instance, bootstrapped samples, as would be the case for bagging. In this vein, Gu et al. (2020) find that, on average, boosted trees applied to economic data comprise less than six leaves. Also, boosting is robust against correlated explanatory variables: once a specific link between an explanatory variable and the target variable has been learned by the algorithm, boosting tries not to refocus on it. Therefore, the importance of two correlated variables will be attributed to either one of the two variables but not to both. Consequently, if two explanatory variables are highly correlated, one variable will have a remarkable impact on the target variable, and we have to search for correlated variables once the importance of the explanatory variable has been identified. For a thorough introduction to boosting for regression trees we refer to Chen and Guestrin (2016) for a technical treatment and to James et al. (2021) for a general overview of boosting in the context of competing ML techniques.
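A minimal sketch of such a boosted-tree fit, continuing the synthetic example from Sect. 3.1; we use the XGBoost implementation of Chen and Guestrin (2016), and the chosen hyperparameter values are assumptions that only mirror the configuration described later in Sect. 5.

```python
from xgboost import XGBRegressor

model = XGBRegressor(
    objective="reg:squarederror",  # differentiable convex loss l in equation (5)
    n_estimators=20,               # K: number of boosting iterations / trees
    max_depth=6,                   # limits the size A of each tree
    learning_rate=0.9,             # close to 1, as in the in-sample setting of Sect. 5
    reg_lambda=1.0,                # lambda: L2 penalty on the leaf weights omega
    gamma=0.01,                    # gamma: minimal loss reduction required for a further split
)
model.fit(df[predictors], df["y"])

y_hat_ml = model.predict(df[predictors])
rmse_ml = float(np.sqrt(np.mean((df["y"] - y_hat_ml) ** 2)))
rmse_ols = float(np.sqrt(np.mean(ols_raw.resid ** 2)))  # in-sample OLS benchmark
```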

3.3 Interpretable machine learning

The scope of our paper lies in techniques for interpretable machine learning that are characterized as model-agnostic. That is, the introduced methods take the inputs and outputs of the underlying black box ML approach into account, whereas there is no need to re-estimate the weights of the explanatory variables within the underlying model. The introduced approaches do not assume any knowledge about the applied ML approach itself, see Slack (2020).

3.3.1 Feature importance

Feature Importance provides a global, model-agnostic interpretation of the impact of each explanatory variable. As described in Wei et al. (2015), the idea behind Feature Importance is to measure the impact of each explanatory variable on the model's prediction error. That is, each explanatory variable is assessed individually, and the corresponding impact on the overall prediction error, namely the mean squared error \(\frac{1}{T}\sum _{t=1}^T ({\hat{y}}_{t|ML} - y_t)^2\), is tracked at each node at which the explanatory variable is chosen. Breiman et al. (1984) introduced the concept of feature importance for regression trees and we apply the concept of feature importance as described in Hastie et al. (2017). The Feature Importance of explanatory variable j is then given as follows:

$$\begin{aligned} FI_j = \frac{1}{K} \sum ^K_{k=1} FI_j(A_k). \end{aligned}$$
(12)

Here, K is the total number of fitted trees and \(FI_j(A_k)\) is the Feature Importance of explanatory variable j in tree \(A_k\); hence, \(FI_j\) is the importance of explanatory variable j averaged over the boosted trees.

An input variable is then characterized as relevant if its impact on the prediction error is relatively large in comparison to the impact stemming from the other variables. Analogously, an input variable is characterized as less important if its impact on the prediction error is relatively small. Consequently, every explanatory variable can be ranked according to its impact on the prediction error.
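The following sketch illustrates both readings of Feature Importance for the boosted-tree fit above: the gain-based importance aggregated over the fitted trees (in the spirit of equation (12)) and a perturbation-based variant that tracks the increase in mean squared error when one predictor is permuted; the scikit-learn helper and the chosen number of repetitions are illustrative choices.

```python
from sklearn.inspection import permutation_importance

# Gain-based importance aggregated over the K boosted trees (spirit of equation (12));
# the exact normalization is an XGBoost implementation detail
gain_importance = model.get_booster().get_score(importance_type="gain")

# Perturbation-based importance: increase in MSE when one explanatory variable is shuffled
perm = permutation_importance(
    model, df[predictors], df["y"],
    scoring="neg_mean_squared_error", n_repeats=10, random_state=0,
)
fi_ranking = sorted(zip(predictors, perm.importances_mean), key=lambda pair: -pair[1])
```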

3.3.2 Local interpretable model-agnostic explanations

The idea of the Local Interpretable Model-agnostic Explanations (LIME) approach is to fit a surrogate model to the output generated by the machine learning technique. Ribeiro et al. (2016) suggest fitting an interpretable model to the explanatory variables, with the output of the machine learning approach as the target variable. The goal is to find an interpretable surrogate model which provides insights into the impact of the explanatory variables. Depending on the structure of the target variable, logistic or linear regression approaches typically describe the surrogates.

Let D be the input dataset of T data points with \(D=(X,y)=\{(X_1,y_1),\ldots ,(X_T,y_T) \}\), where \(X_{t}\) is the vector that comprises the explanatory variable values at time \(t = 1,\ldots ,T\) and \(y_t\) the respective target variable. Then, let f be the applied (black box) ML approach and let g denote an interpretable approach (i.e. linear regression) which will be applied to interpret f. The complexity of g is described by \(\Omega (g)\) and \(\pi _X(X')\) describes the proximity measure between inputs X and \(X'\). The objective function is then given as follows:

$$\begin{aligned} \underset{g \in G}{\arg \min } \;\; L(f,g,\pi _X)+\Omega (g) \end{aligned}$$
(13)

and the respective loss function L is defined as:

$$\begin{aligned} L(f,g,\pi _X) = \sum _{X_t' \in X'}[f(X_t')-g(X_t')]^2 \pi _X(X_t'). \end{aligned}$$
(14)

Here, \(X'\) is the set of inputs describing the local neighbourhood of X, \(\Omega\) is the number of non-zero weights in the linear model and \(\pi _X(X')\) is defined using the \(L2\)-distance. This leads to the following approach:

$$\begin{aligned} {\hat{y}}_{t|ML} = {\hat{\beta }}^{(LIME)}_{0|OLS} + X_t {\hat{\beta }}^{(LIME)}_{OLS}, \end{aligned}$$
(15)

where \(X_t\) describes the \(1 \times N\) vector of explanatory variables at time t and \({\hat{y}}_{t|ML}\) the target variable estimated via the ML approach. Via \({\hat{\beta }}^{(LIME)}_{OLS}\) we are then able to identify statistically significant determinants of the estimated risk premium. In this paper, we estimate \({\hat{\beta }}^{(LIME)}_{OLS}\) based on the entire data set to provide an average of the local explanations which can then be compared to global Feature Importance and Shapley values.
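A minimal LIME-style sketch for the regression setting of equations (13)-(15), continuing the example above: perturbed neighbours are drawn around an observation, weighted by an L2-distance kernel \(\pi _X\), and a weighted linear surrogate is fitted to the black-box predictions. The kernel width, the number of perturbations and the averaging over every 50th observation are illustrative assumptions rather than the exact configuration of Ribeiro et al. (2016).

```python
from sklearn.linear_model import LinearRegression

def lime_coefficients(black_box, x_row, X_train, n_samples=2000, kernel_width=0.75, seed=0):
    """Local linear surrogate around one observation (sketch of equations (13)-(15))."""
    rng = np.random.default_rng(seed)
    scale = X_train.std().values
    # Perturbed neighbourhood X' around the instance of interest
    X_pert = pd.DataFrame(
        x_row.values + rng.normal(scale=scale, size=(n_samples, X_train.shape[1])),
        columns=X_train.columns,
    )
    # Proximity weights pi_X(X') based on the scaled L2-distance to the original instance
    dist = np.linalg.norm((X_pert.values - x_row.values) / scale, axis=1)
    weights = np.exp(-(dist ** 2) / kernel_width ** 2)
    # Interpretable surrogate g fitted to the black-box predictions f(X')
    surrogate = LinearRegression().fit(X_pert, black_box.predict(X_pert), sample_weight=weights)
    return surrogate.coef_

# Average the local explanations over (a subsample of) the data set, as done for Tables 2 and 3
lime_betas = np.mean(
    [lime_coefficients(model, df[predictors].iloc[t], df[predictors]) for t in range(0, len(df), 50)],
    axis=0,
)
```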

3.3.3 Shapley values

In contrast to LIME, where \(\Omega (g)\) and \(\pi _X(X')\) are determined heuristically, SHAP grounds the definition on game-theoretic principles to guarantee that the explanation satisfies certain desired properties, as described in Shapley (1953). Lundberg and Lee (2017) introduce the Shapley value to machine learning and demonstrate that Shapley values describe the average marginal contribution of an observed explanatory variable across all possible coalitions. The SHAP (SHapley Additive exPlanation) framework describes a model-agnostic approach which allows Shapley values to be estimated by expressing predictions as linear combinations. The Shapley value \(\phi _j\) of the \(j\)-th explanatory variable is defined as follows:

$$\begin{aligned} \phi _j = \sum _{S \subseteq \{1,...,p\} \backslash \{ j \}} \frac{|S|! (p-|S|-1)!}{p!}\left( f_{S \cup \{ j \} }(X_{S \cup \{ j \} })-f_S(X_S)\right) \end{aligned}$$
(16)

Here, \(X_S\) is the subset of the explanatory variables X contained in the set S, \(f_{S \cup \{ j \}}\) is the model trained with variable j and \(f_{S}\) the model trained without it. The Shapley value is thus the individual contribution of a variable to the prediction and therefore describes a local explanation. Additionally, we assess global importance via SHAP Feature Importance:

$$\begin{aligned} \Phi _j = \frac{1}{n}\sum ^{n}_{i=1} |\phi _j^{(i)}|, \end{aligned}$$
(17)

where \(\Phi _j\) is the average absolute Shapley value of explanatory variable j over the full data set. For a thorough methodological discussion of Shapley values, we refer to Davila-Pena et al. (2022).
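For the boosted-tree fit above, TreeSHAP provides efficient per-observation Shapley values from which the global importance of equation (17) follows directly; a minimal sketch using the shap package (an assumption — the original study does not name its implementation):

```python
import shap

# Local Shapley values phi_j for every observation and explanatory variable
explainer = shap.TreeExplainer(model)
phi = explainer.shap_values(df[predictors])   # array of shape (T, N)

# Equation (17): global SHAP feature importance as the mean absolute Shapley value
shap_importance = np.abs(phi).mean(axis=0)
shap_ranking = sorted(zip(predictors, shap_importance), key=lambda pair: -pair[1])
```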

4 Data

We assess the monthly risk premium of the S&P 500 as the dependent target variable and construct macroeconomic predictors. The independent explanatory variables in our study are calculated as described in Welch and Goyal (2008). Similar to Gu et al. (2020), we assess the eight macroeconomic predictors which are available at a monthly frequency. Our sample ranges from December 1931 to December 2020.

Table 1 Descriptive Statistics

Table 1 presents the descriptive statistics of the economic predictors and the monthly returns of the S&P 500. The dividend-price ratio (dp) is calculated as the difference between the log dividends and log prices of the S&P 500. The earnings-price ratio (ep) describes the difference between the log of annual earnings and the log of prices, and the book-to-market ratio (bm) is the ratio of book value to market value for the Dow Jones Industrial Average. Also, we consider corporate issuing activity via net equity expansion (ntis), which is the ratio of the moving sum of net issues by NYSE-listed stocks to the total end-of-year market capitalization of NYSE stocks. The Treasury-bill rate (tbl) is the 3-month Treasury bill rate, and the term spread (tms) measures the difference between the long-term yield on government bonds and the Treasury-bill rate. The default spread (dfy) gives the difference between BAA- and AAA-rated corporate bond yields, and the stock variance (svar) provides information on market volatility, measured as the sum of squared daily returns of the S&P 500.
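A sketch of how these predictors can be constructed from raw inputs; the column names of the raw table (price, dividends, earnings, lty, baa, aaa, ...) are hypothetical placeholders, and the formulas simply follow the verbal definitions above rather than any specific data release.

```python
import numpy as np
import pandas as pd

def build_predictors(raw: pd.DataFrame) -> pd.DataFrame:
    """Construct the monthly predictors from a hypothetical raw table with placeholder
    columns: price, dividends, earnings, bm, ntis, tbl, lty, baa, aaa, svar."""
    out = pd.DataFrame(index=raw.index)
    out["dp"] = np.log(raw["dividends"]) - np.log(raw["price"])  # dividend-price ratio
    out["ep"] = np.log(raw["earnings"]) - np.log(raw["price"])   # earnings-price ratio
    out["bm"] = raw["bm"]                                        # book-to-market ratio (DJIA)
    out["ntis"] = raw["ntis"]                                    # net equity expansion
    out["tbl"] = raw["tbl"]                                      # 3-month Treasury-bill rate
    out["tms"] = raw["lty"] - raw["tbl"]                         # term spread
    out["dfy"] = raw["baa"] - raw["aaa"]                         # default spread
    out["svar"] = raw["svar"]                                    # sum of squared daily returns
    return out
```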

5 Simulation study

We conduct a simulation study to assess the precision and information content of the competing interpretable ML approaches. Specifically, we generate financial returns via a linear regression approach and then fit a boosted tree algorithm in order to discuss the concepts of SHAP and LIME when the data generating process is known. In addition, we discuss the concept of Feature Importance as a benchmark for global model interpretability. In doing so, we assume that the critical underlying assumptions of the standard econometric model as presented in equation (1) hold. That is, we assume that the impact of each explanatory variable does not change over time and that the impact can be measured, given that the other variables do not change. Also, in order to discuss the interpretability of the competing approaches, we focus on a strict in-sample assessment.

To discuss the simulated equity premium with respect to realistic factor loadings, we fit the linear model to the underlying data as presented in Table 1 to obtain parameter estimates \({\hat{\beta }}\) for each predictor variable. Then, we draw on \({\hat{\beta }}\) to generate the estimated \({\hat{y}}\), and the data generating process is given as follows:

$$\begin{aligned} {\hat{y}}_{t} = {\hat{\beta }}_{0|OLS} + X_t {\hat{\beta }}_{OLS}. \end{aligned}$$
(18)

Hence, we obtain a simulated return series \({\hat{y}}\) which is perfectly described by the applied predictors, and the choice of factor loadings is based on the factors fitted to the monthly S&P 500 returns.

On that account, we fit the boosted trees algorithm as described in Sect. 3.2.2 with predictors X and \({\hat{y}}\) as the target variable. As the discussion of the optimal ML approach, especially its hyperparameter tuning, is not in the scope of our study, we apply a configuration of the boosted tree algorithm similar to that discussed in Gu et al. (2020). That is, we allow a maximum tree depth corresponding to 6 leaves and a maximum of 20 boosting iterations. We regularize the weights via the L2-norm, apply a learning rate close to 1 (overfitting is not in scope here) and apply a minimal loss reduction of 0.01. As a stopping criterion for the training algorithm, we stop the training after 10 epochs, given that the in-sample root mean squared error (RMSE) is smaller than in the initially fitted OLS regression. For a thorough discussion of boosting, we refer to Chen and Guestrin (2016) and James et al. (2021).
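Continuing the synthetic sketches from Sect. 3, the data generating process of equation (18) and the subsequent black-box fit can be expressed along the following lines (a sketch; the simulated target here is the fitted value of the toy OLS regression, not the original study's data, and the stopping criterion described above is omitted for brevity):

```python
# Equation (18): simulated returns that are perfectly linear in the predictors
y_sim = ols_raw.predict(sm.add_constant(df[predictors]))

# Fit the boosted trees of Sect. 3.2 to the simulated target
model_sim = XGBRegressor(objective="reg:squarederror", n_estimators=20, max_depth=6,
                         learning_rate=0.9, reg_lambda=1.0, gamma=0.01)
model_sim.fit(df[predictors], y_sim)
y_hat_ml = model_sim.predict(df[predictors])
```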

Drawing on the fitted black box ML approach, we apply the predicted \({\hat{y}}_{ML}\), assess the competing techniques of interpretable machine learning and study the identified importance of each explanatory variable. Specifically, as we know the underlying linear dynamics of the data generating process, we are able to discuss SHAP and LIME with respect to economic precision. However, as described in Sect. 2, the applied approaches provide different information on either local or global interpretability of the model.

Fig. 1 Simulation Study: (a) For each of the assessed explanatory variables, the estimated coefficients from the linear regression are plotted against the respective LIME coefficients. The 90%-confidence interval for the estimated coefficients is indicated via a bar. Details on the explanatory variables are presented in Table 1 and details on the estimated coefficients \({\hat{\beta }}_{OLS}\) and \({\hat{\beta }}^{(LIME)}_{OLS}\) are presented in Table 2, column (1) and column (3). (b) Variable importance via Shapley values. The y-axis gives the ranking of the sorted explanatory variables, with the most important variable at the top; the value next to each explanatory variable is its average Shapley value. The x-axis shows the individual Shapley values; each point describes one observation from the applied data and the gradient colour indicates the original value of that variable. Details on the averaged Shapley values are presented in Table 2, column (6)

Fig. 1 illustrates both LIME and Shapley values. The left plot (plot a) visualizes the estimated OLS coefficients for the underlying data, which are used as input to the data generating process to simulate \({\hat{y}}_{OLS}\). The estimated LIME parameters, i.e. the parameters that describe the impact of the explanatory variables on \({\hat{y}}_{ML}\), and the corresponding 90%-confidence intervals are also illustrated. The right plot (plot b) provides information on the corresponding SHAP values. Local Shapley importance is indicated by coloured dots and averaged Shapley values are presented on the y-axis.

In order to compare the competing approaches, we focus on the identified rankings of each variable to discuss the information content, in terms of factor importance, that is provided by each approach. Specifically, we investigate to what extent the identified impact factors are in line with the simulated factor loadings. Because we apply a linear regression approach as a local surrogate model via LIME, we are also able to study the differences between the simulated impact (measured via \(\beta _{OLS}\)) and the interpretable betas of the LIME approach (\(\beta _{OLS}^{(LIME)}\)). Also, we compare averaged Shapley values with global Feature Importance.

Table 2 The relation between simulated monthly S&P 500 returns (\({\hat{y}}_{t|OLS}\)) and explanatory variables measured via linear regression and boosted trees with LIME, Feature Importance and SHAP

Table 2 provides the estimated parameters for all approaches. The corresponding OLS and LIME coefficients (\(\beta _{OLS}\) and \(\beta _{OLS}^{(LIME)}\)) of the left plot of Fig. 1 are presented in column (1) and column (3), and the underlying averaged Shapley coefficients (\(\Phi\)) of the right plot of Fig. 1 are given in column (6). In order to identify the relative importance of each explanatory variable, we normalize each variable and compare the estimated betas. These betas quantify the linear impact stemming from each determinant, independently of the individual dimension of each variable; the coefficients are given in column (2). Column (4) provides the ratio between \(\beta _{OLS}\) and \(\beta _{OLS}^{(LIME)}\) and global Feature Importance is presented in column (5).

We find that the identified leverage effect, namely the asymmetric impact of stock variance (\(\beta _{svar|OLS}^{(z)}=-1.1142\)), describes the strongest impact on the equity premium, followed by the impact stemming from the default spread (\(\beta _{dfy|OLS}^{(z)}=0.63426\)) and the Treasury-bill rate (\(\beta _{tbl|OLS}^{(z)}=0.1040\)). As indicated by the low ratios presented in column (4), the LIME approach precisely identifies the impact of market variance (svar), default yield spread (dfy) and the Treasury-bill rate (tbl); all differences are close to 0. Due to the intercept, LIME suggests that the impact of the dividend-price ratio (dp) is higher than suggested by the data generating process (\(ratio_{dp} = 1.49\)). Specifically, the estimated LIME coefficient equals \(\beta ^{(LIME)}_{dp|OLS}=-0.0083\) whereas the simulated impact is given by \(\beta _{dp|OLS}=-0.003\). The impact of net equity expansion (ntis) is underestimated by LIME (\(ratio_{ntis}=0.93\)), and the simulated impact of the earnings-price ratio (ep) is underestimated (\(ratio_{ep}=-0.73\)), as are the book-to-market ratio (\(ratio_{bm}=-0.63\)) and the term spread (\(ratio_{tms}=0.58\)). It can be stated that the factors dp, tms and bm describe the least relevant impact factors in the simulated series. These factors are characterized by a small \({\hat{\beta }}_{OLS}^{(z)}\) coefficient, specifically \({\hat{\beta }}_{dp|OLS}^{(z)}=-0.07\), \({\hat{\beta }}_{tms|OLS}^{(z)}=-0.14\) and \({\hat{\beta }}_{bm|OLS}^{(z)}=-0.14\), respectively.

Both SHAP and Feature Importance identify similar impact factors as important. That is, both approaches identify a strong impact stemming from market variance (\(FI_{svar}=0.834;\Phi _{svar}=0.004\)) as well as an impact from the Treasury-bill rate (\(FI_{tbl}=0.104;\Phi _{tbl}=0.003\)) on the target variable. In addition, both SHAP and Feature Importance indicate that the earnings-price ratio describes the third largest impact factor (\(FI_{ep}=0.062;\Phi _{ep}=0.003\)). Interestingly, neither of the two approaches puts relevance on the default spread, which describes the second strongest determinant in the data generating process. This finding is contrary to the LIME results. According to the LIME approach, market variance also describes the strongest impact on the monthly equity premium (\(\beta _{svar|OLS}^{(LIME)}=-1.53\)), whereas the default spread (\(\beta _{dfy|OLS}^{(LIME)}=0.59\)) is ranked second, followed by net equity expansion (\(\beta _{ntis|OLS}^{(LIME)}=-0.01\)).

Within the frame of available recent research, our findings add to the simulation assessment presented by Gu et al. (2020). The authors demonstrate that boosted regression trees describe a sensible choice for both linear and non-linear processes, and our results provide evidence that the relevance of linear impact factors can be identified by all three concepts of interpretable machine learning. Consequently, our results indicate that boosted trees do not only fit the data well, but that the linear dependencies of the data generating process can also be recovered. These results also describe relevant findings for the XAI research agenda as introduced in De Bock et al. (2023). With a focus on financial data, we add post-model interpretability via techniques of interpretable machine learning and find that the black box model provides economic information content similar to that suggested by the linear regression coefficients. In view of the fact that the applied ML technique is less restrictive when it comes to assumptions about the underlying data, and that ML-based techniques are scalable to large (Big) data sets, our simulation exercise indicates that the presented technique describes a promising alternative to interpretable linear models for in-sample assessments.

6 Empirical assessment

In this section, we study the economic information content of interpretable machine learning when the data generating process is unknown. Thus, in contrast to the previous Sect. 5, the estimated linear OLS regression model describes the interpretable benchmark for our assessment: we study the information content of ML-based interpretability in comparison to the estimated betas, and the in-sample fit can be evaluated in terms of \(R^2\). Hence, we are able to analyze the additional information that trained ML algorithms extract from the data. Furthermore, we discuss the empirical results in the context of the simulation assessment. In doing so, we are able to identify additional impact stemming from the explanatory variables, which is explained by non-linearities that are not captured by the OLS approach.

Analogous to the simulation assessment, we fit both a linear regression model and boosted trees to the data set. We define the in-sample root mean squared error (RMSE) as a stopping criterion for the learning algorithm. That is, we apply the same hyperparameters for the ML approach as described in the simulation analysis and stop the training of our model when the in-sample RMSE is smaller than the RMSE of the OLS approach. In doing so, we ensure that the ML approach is well fitted to the underlying data; however, to generate results that build a solid ground for potential out-of-sample applications, we avoid in-sample overfitting. To further unbox the black box algorithm via the competing approaches, and similar to the simulation exercise, we also assess LIME, SHAP and Feature Importance and compare the relevant determinants of the equity premium with the estimated linear OLS parameters. In order to assess comparable \(\beta\)s, we also estimate the coefficients based on the normalized explanatory variables.

Fig. 2 Empirical Study: (a) For each of the assessed explanatory variables, the estimated coefficients from the linear regression are plotted against the respective LIME coefficients. The 90%-confidence interval for the estimated coefficients is indicated via a bar. Details on the explanatory variables are presented in Table 1 and details on the estimated coefficients \({\hat{\beta }}_{OLS}\) and \({\hat{\beta }}^{(LIME)}_{OLS}\) are presented in Table 3, column (1) and column (3). (b) Variable importance via Shapley values. The y-axis gives the ranking of the sorted explanatory variables, with the most important variable at the top; the value next to each explanatory variable is its average Shapley value. The x-axis shows the individual Shapley values; each point describes one observation from the applied data and the gradient colour indicates the original value of that variable. Details on the averaged Shapley values are presented in Table 3, column (6)

Fig. 2 provides information on both LIME and SHAP values. The left plot (plot a) presents the estimated OLS coefficients for the underlying data and the LIME coefficients; for both, the 90%-confidence intervals are presented for each parameter. The right plot (plot b) illustrates the local and averaged Shapley values. Local Shapley importance is indicated by coloured dots and the averaged Shapley values are given on the y-axis.

Table 3 The relation between monthly S&P 500 returns at time t (\(y_t\)) and its respective explanatory variables measured via linear regression and boosted trees with LIME, Feature Importance and SHAP

Table 3 provides details on the estimated coefficients as illustrated in Fig. 2. The estimated coefficients of the linear regression approach are given in column (1) and the LIME coefficients in column (3). The estimated OLS coefficients based on normalized explanatory variables are presented in column (2) and the ratio between estimated OLS and LIME coefficients is in column (4). The ratio measures the difference between the OLS and LIME coefficients (standardized by \(\beta _{OLS}\)) and is calculated for each explanatory variable. Global Feature Importance and averaged Shapley values are given in columns (5) and (6), respectively. The in-sample fit of the linear regression approach is \(R_{OLS}^2=0.073\) and that of the ML approach is \(R_{ML}^2=0.12\). The LIME approach indicates that stock variance (\({\hat{\beta }}^{(LIME)}_{svar|OLS}=-1.93\)), net equity expansion (\({\hat{\beta }}^{(LIME)}_{ntis|OLS}=-0.11\)) and the Treasury-bill rate (\({\hat{\beta }}^{(LIME)}_{tbl|OLS}=-0.09\)) describe a statistically significant negative impact on the predicted equity premium, whereas the default spread describes a statistically significant positive impact on the target variable (\({\hat{\beta }}^{(LIME)}_{dfy|OLS}=0.91\)). All other explanatory variables are not statistically significant. Consequently, our results confirm the findings of Lundberg et al. (2018): the applied boosted regression tree approach leads to a higher \(R^2\), which means that the black box ML approach describes a superior in-sample fit to the data. When we compare the estimated impact of each statistically significant variable measured via the OLS and ML approaches, indicated by ratios close to zero, we find that both approaches identify a similar impact stemming from stock variance (\(ratio_{svar}=-0.027\)), the default spread (\(ratio_{dfy}=-0.012\)) and the Treasury-bill rate (\(ratio_{tbl} = 0.234\)). In contrast to OLS, the application of ML identifies an additional statistically significant explanatory variable, namely net equity expansion (\({\hat{\beta }}^{(LIME)}_{ntis|OLS}=-0.159\)); in addition, the book-to-market ratio describes a weakly significant impact (\({\hat{\beta }}^{(LIME)}_{bm|OLS}=-0.019\); \(p\)-value = 0.08).

Based on the normalized variables (column (2)), the estimated OLS coefficients indicate the relevance of each variable within the linear regression framework. According to the OLS benchmark, stock variance describes a remarkable negative impact on the equity premium (\({\hat{\beta }}^{(z)}_{svar|OLS}=-0.30\)) and the default spread a positive impact (\({\hat{\beta }}^{(z)}_{dfy|OLS}=0.17\)). Also, the Treasury-bill rate describes a statistically significant impact on the monthly equity premium (\({\hat{\beta }}^{(z)}_{tbl|OLS}=0.09\)). Interestingly, both SHAP and Feature Importance identify stock variance and the default spread as the most relevant determinants. Moreover, the ranked relevance of the remaining explanatory variables is similar for SHAP and Feature Importance; that is, the ranked importance of each explanatory variable differs only marginally between the two. Both approaches determine the impact of the book-to-market ratio, the term spread and the earnings-price ratio as less relevant. In contrast to the benchmark, and in line with LIME, both techniques identify net equity expansion as an important determinant for the ML-based approach. Based on the superior data fit of the ML approach, we find that non-linear impacts on the target variable can be captured and then made interpretable via the techniques of interpretable machine learning.

Table 4 Rank correlation of Feature Importance

Table 4 summarizes the findings on the identified feature importance and provides information on Spearman's correlation and Kendall's tau. The feature importance identified by the LIME approach is significantly positively correlated with the information content produced via the linear OLS regression (S: 0.905; K: 0.786), whereas SHAP is also positively correlated but only weakly statistically significant (S: 0.738; K: 0.643). These findings indicate that the information content of model-agnostic post-hoc explanations is similar to the information provided by the linear regression approach. Interestingly, we do not find a significant dependence between averaged SHAP and LIME coefficients. This indicates that both approaches co-exist but do not provide similar information content. That result is mainly driven by the fact that LIME is designed to identify linear dependencies, which explains the high correlation with the estimated linear regression parameters, whereas Shapley values, derived from game theory, describe different information on feature importance.
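The rank correlations in Table 4 can be reproduced along the following lines (a sketch on the synthetic example from the earlier sections; the illustrative vectors would be replaced by the absolute coefficients and importance measures reported in Tables 2 and 3):

```python
from scipy.stats import spearmanr, kendalltau

ols_importance = np.abs(ols_norm.params[predictors].values)  # |beta^(z)_OLS| per predictor
lime_importance = np.abs(lime_betas)                         # |beta^(LIME)_OLS| per predictor

rho, p_rho = spearmanr(ols_importance, lime_importance)   # Spearman's correlation
tau, p_tau = kendalltau(ols_importance, lime_importance)  # Kendall's tau
```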

Furthermore, this result also sheds light on the fact that the assumptions of the linear regression approach are not fully in line with the underlying data, and that the application of less restrictive ML approaches provides additional information on the impact of the assessed explanatory variables. For instance, contrary to the simulation exercise, the empirical impact of the dividend-price ratio increases. Specifically, non-linear effects, which are not identified by the linear regression benchmark, can be captured by the less restrictive black box ML approach. This finding also affects the relevance of explanatory variables such as the Treasury-bill rate, the term spread, the book-to-market ratio and the earnings-price ratio. Furthermore, in contrast to the concept of global Feature Importance, SHAP provides local explanations and therefore enables us to study the local interpretation provided by individual Shapley values via partial dependence. Specifically, for each observed explanatory variable, we are able to assess the impact on the target variable via the respective Shapley values.

Fig. 3 Partial dependence plot between the equity premium and each of the assessed explanatory variables. The y-axis gives the Shapley value, the x-axis the observed value of each explanatory variable. Each black dot represents one observation and the red line indicates the direction of dependence. Details on the explanatory variables are presented in Table 1

Fig. 3 illustrates the partial dependence between the target variable and every single explanatory variable. A local explanation of each observation is provided by individual bivariate plots of the respective Shapley value on the y-axis against the observed value of the explanatory variable on the x-axis. The red line indicates a function that approximates the dependence between the variable and its Shapley value. As indicated by each individual partial dependence, Shapley values provide insights into the non-linearity of the dependence structure between explanatory and target variable, measured in terms of the marginal contribution of each explanatory variable. For instance, we can observe that large values of market volatility (svar) coincide with the largest negative Shapley values, whereas the impact on the target variable seems to converge for increased market risk. The default spread (dfy) and net equity expansion (ntis) are also well described by a non-linear dependence structure: specifically, the dependence between the target variable and the default spread (dfy) approximates an exponential function, whereas mixed dependence schemes are observed for net equity expansion (ntis). If the ntis variable is around zero, its impact on the target variable seems to increase, whereas it decreases for values between 0.03 and 0.6 and increases again for values larger than 0.6. These findings add to the discussion in Lundberg et al. (2019): boosted trees are not only an adequate fit for metric target variables, they also provide innovative insights into the structure of impact factors.
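Such partial dependence plots can be generated directly from the Shapley values computed in the earlier sketch; a minimal example for the market-variance predictor (the choice of svar and the plotting backend are illustrative assumptions):

```python
import matplotlib.pyplot as plt

# Observed svar values on the x-axis against their Shapley values on the y-axis
shap.dependence_plot("svar", phi, df[predictors], interaction_index=None, show=False)
plt.title("Partial dependence of the equity premium on svar (SHAP)")
plt.show()
```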

Based on the empirical ML assessment, we find that the competing techniques of interpretable ML provide similar information on the relevance of the explanatory variables for the estimated equity premium as the OLS benchmark. However, due to the superior in-sample fit, novel insights can be identified when using ML algorithms in combination with post-hoc model explanations. As global Feature Importance does not provide information on the direction of the impact of the respective variables, we find that both SHAP and LIME provide relevant economic information for the application of interpretable machine learning.

Hence, the results of our study suggest that post-model interpretability for ML algorithms describes a sensible option for economic data sets. Our results confirm the findings in Gu et al. (2020) and we provide additional evidence that boosted trees describe a promising interpretable alternative to OLS regression in the context of economic data. Also, our results are in line with Avramov et al. (2022) and we confirm that boosted trees lead to a superior in-sample fit for financial data. Moreover, our results provide innovative insights into the interpretability of black box machine learning algorithms. Both LIME and SHAP describe solid concepts of local interpretability which can be applied to economic data when the target variable is measured on a metric scale. Moreover, additional non-linear explanatory impacts can be identified via the application of interpretable machine learning.

7 Conclusion

The main objective of this paper is to evaluate the information content of XAI applied to OR and finance. To this end, we draw on Gu et al. (2020), apply an adequate black box machine learning algorithm to assess explanatory variables of the equity premium and discuss competing model-agnostic XAI techniques. The setup of our study allows for both the assessment of the accuracy of machine learning applied to financial returns and the discussion of the information content of XAI in comparison to interpretable econometric benchmarks.

Conducting a simulation study, we simulate equity premia that are determined by a set of realistic financial explanatory variables. We find that machine learning in combination with SHAP and LIME provides information similar to that of the widely applied linear regression approach. The empirical study provides evidence that machine learning techniques describe a sensible alternative to the linear regression approach for economic data. Furthermore, when the underlying data generating process is unknown, machine learning allows for the identification of additional non-linear determinants which are not captured via linear regression.

Hence, the implications of our study are twofold. First, this study adds to the research agenda in De Bock et al. (2023) and describes an initial step towards a deeper understanding of XAI applied to economic data and metric target variables. We find that the application of ML allows for a better fit to the data and that post-model interpretability enables us to identify novel and non-linear determinants. Second, due to the less restrictive assumptions of the applied ML technique and due to the ability of ML to handle larger data sets with higher frequencies, our results add to Gu et al. (2020) and provide evidence that black box ML techniques in combination with post-model interpretability provide a promising alternative to the linear regression approach for financial data analytics.

Therefore, our study provides a pioneering step towards a deeper understanding of XAI as discussed in De Bock et al. (2023), and we suggest that the presented framework can be applied to relevant quantitative approaches in financial management with metric target variables. From our point of view, we consider an assessment of XAI applied to credit risk modelling, as discussed in Kellner et al. (2022), to multivariate time series forecasts, as presented in Ahelegbey et al. (2016) and Giudici et al. (2020), as well as to portfolio management, as introduced in Babaei et al. (2022), a promising next step towards a deeper understanding of XAI in finance.