A Comparative Study of Methods for Estimating Conditional Shapley Values and When to Use Them

Shapley values originated in cooperative game theory but are extensively used today as a model-agnostic explanation framework to explain predictions made by complex machine learning models in industry and academia. There are several algorithmic approaches for computing different versions of Shapley value explanations. Here, we focus on conditional Shapley values for predictive models fitted to tabular data. Estimating precise conditional Shapley values is difficult, as they require the estimation of non-trivial conditional expectations. In this article, we develop new methods, extend earlier proposed approaches, and systematize the new refined and existing methods into different method classes for comparison and evaluation. The method classes use either Monte Carlo integration or regression to model the conditional expectations. We conduct extensive simulation studies to evaluate how precisely the different method classes estimate the conditional expectations, and thereby the conditional Shapley values, for different setups. We also apply the methods to several real-world data experiments and provide recommendations for when to use the different method classes and approaches. Roughly speaking, we recommend using parametric methods when we can specify the data distribution almost correctly, as they generally produce the most accurate Shapley value explanations. When the distribution is unknown, both generative methods and regression models with a similar form as the underlying predictive model are good and stable options. Regression-based methods are often slow to train but produce the Shapley value explanations quickly once trained. The opposite is true for Monte Carlo-based methods, making the different methods appropriate in different practical situations.


Introduction
Complex machine learning (ML) models are extensively applied to solve supervised learning problems in many different fields and settings: cancer prognosis (Kourou et al. 2015), credit scoring (Kvamme et al. 2018), impact sensitivity of energetic crystals (Lansford et al. 2022), and money laundering detection (Jullum, Løland, et al. 2020). The ML methods are often very complex, as they contain thousands, millions, or even billions of tuneable model parameters.
Thus, understanding the complete underlying decision-making process of the ML algorithms is infeasible (for us humans). The use of ML methods is motivated by their potential to generate more accurate predictions than established statistical models, but this may come at the expense of model interpretability, as discussed by Johansson et al. (2011), Guo, Zhang, et al. (2019), and Luo et al. (2019). Rudin (2019) conjectures that equally accurate but interpretable models exist across domains, even though they might be hard to find.
The lack of understanding of how the input features of the ML model influence the model's output is a major drawback. Hence, to remedy the absence of interpretation, the fields of explainable artificial intelligence (XAI) and interpretable machine learning (IML) have become active research fields in recent years (Adadi and Berrada 2018; Molnar 2022; Covert et al. 2021). A wide variety of explanation frameworks has been developed to extract the hidden knowledge about the underlying data structure captured by the black-box model, making the model's decision-making process more transparent. This is essential for, e.g., medical researchers who apply an intricate ML model to obtain well-performing predictions but who also aim to discover important risk factors. The Right to Explanation legislation in the European Union's General Data Protection Regulation (GDPR) has also been a driving factor (European Commission 2016).
One of the most commonly used explanation frameworks in XAI is Shapley values, an explanation methodology with a strong mathematical foundation and unique theoretical properties from the field of cooperative game theory (Shapley 1953). Shapley values are most commonly used as a model-agnostic explanation framework for individual predictions, that is, for local explanations. Model-agnostic means that Shapley values do not rely on model internals and can be used to compare and explain any ML model trained on the same supervised learning problem. Local explanation means that Shapley values explain the local model behavior for an individual observation and not the global model behavior across all data instances. The methodology has also been used to provide global explanations; see, e.g., Owen (2014), Covert et al. (2020), Frye et al. (2021), and Giudici and Raffinetti (2021). See Molnar (2022) for an overview and detailed introduction to other explanation frameworks.
Shapley values originated in cooperative game theory but have been reintroduced as a framework for model explanation by Strumbelj and Kononenko (2010, 2014) and Lundberg and Lee (2017). Originally, Shapley values described a possible solution concept of how to fairly allocate the payout of a game among the players based on their contribution to the overall cooperation/payout. The solution concept is based on several desirable axioms, for which the Shapley values are the unique solution. When applying Shapley values as an explanation framework, we treat the features as the "players", the machine learning model as the "game", and the corresponding prediction as the "payout".
There are several ways to define the game, which yields different types of Shapley values. For local explanations, the two main types are interventional and conditional Shapley values, and there is an ongoing debate about when to use them (Chen, Janizek, et al. 2020; Kumar et al. 2020; Chen, Covert, et al. 2022). Briefly stated, the interventional version does not take dependencies between the features into consideration, while the conditional version does. A disadvantage of conditional Shapley values, compared to the interventional counterpart, is that they require the estimation/modeling of non-trivial conditional expectations. Throughout this article, we mean conditional Shapley values when we discuss Shapley values, if not otherwise specified.
There is a vast amount of literature on different approaches for estimating Shapley values (Strumbelj, Kononenko, and Sikonja 2009; Lundberg and Lee 2017; Lundberg, Erion, and Lee 2018; Redelmeier, Jullum, and Aas 2020; Williamson and Feng 2020; Aas, Jullum, et al. 2021; Aas, Nagler, et al. 2021; Frye et al. 2021; Covert et al. 2021; Olsen et al. 2022). These methods can be grouped into different method classes based on their characteristics, that is, whether they (implicitly) assume feature independence or use empirical estimates, parametric assumptions, generative methods, and/or regression models; see Figure 1. To the best of our knowledge, there exists no thorough and methodological comparison of all the method classes and approaches. Chen, Covert, et al. (2022, Section 6) state that "[conditional Shapley values] constitutes an important future research direction that would benefit from new methods or systematic evaluations of existing approaches".

Figure 1: Schematic overview of the paradigms, method classes, and methods/approaches used in this article to compute the conditional Shapley value explanations.
In this article, we investigate existing methods, introduce several new approaches, and conduct extensive simulation studies, starting from a very simple setup with an interpretable model as a sanity check and gradually increasing the complexity of the predictive model. We also investigate the effect that the data distribution, with varying levels of dependence, and the training sample size have on the estimation of the conditional expectations using the different methods. Finally, we also conduct experiments on real-world data sets from the UCI Machine Learning Repository. In the numerical simulation studies, the parametric methods, which correctly (or nearly correctly) assume the data distribution, generate the most accurate Shapley values. However, if the data distribution is unknown, as for most real-world data sets, our experiments show that using either a generative method or a regression model with the same form as the predictive model is the best option. In addition to accuracy, we also investigate the computation times of the methods. Based on our findings, we present recommendations for when to use the different method classes and approaches.
In Section 2, we give an overview of Shapley values' origin and use as a model-agnostic explanation framework. The existing and novel methods for estimating the Shapley value explanations are described in Section 3. In Section 4, we present the simulation studies and discuss the corresponding results. We conduct experiments on real-world data sets in Section 5. Recommendations for when to use the different methods and a conclusion are given in Sections 6 and 7, respectively. In the Appendix, we provide more approaches, implementation details, and additional simulation studies.

Shapley Values
In this section, we first briefly describe Shapley values in cooperative game theory, before we elaborate on their use for model explanations.

Shapley Values in Cooperative Game Theory
Shapley values are a solution concept for how to divide the payout of a cooperative game v : P(M) → R based on four axioms (Shapley 1953). The game is played by M players, where M = {1, 2, . . . , M} denotes the set of all players and P(M) is the power set, that is, the set of all subsets of M. We call v(S) the contribution function, and it maps a subset of players S ∈ P(M), also called a coalition, to a real number representing their contribution in the game v. The Shapley values φ_j = φ_j(v) assigned to each player j, for j = 1, . . . , M, uniquely satisfy the following properties:

Efficiency: They sum to the value of the grand coalition M over the empty set ∅, that is, Σ_{j=1}^{M} φ_j = v(M) − v(∅).

Symmetry: Two equally contributing players j and k, that is, v(S ∪ {j}) = v(S ∪ {k}) for all S, receive equal payouts φ_j = φ_k.

Dummy: A non-contributing player j, that is, v(S ∪ {j}) = v(S) for all S, receives φ_j = 0.
Linearity: A linear combination of n games {v_1, . . . , v_n}, that is, v(S) = Σ_{k=1}^{n} c_k v_k(S), has Shapley values given by φ_j(v) = Σ_{k=1}^{n} c_k φ_j(v_k).

Shapley (1953) showed that the values φ_j which satisfy these axioms are given by

φ_j = Σ_{S ⊆ M\{j}} (|S|!(M − |S| − 1)!/M!) (v(S ∪ {j}) − v(S)),     (1)

where |S| is the number of players in coalition S. The number of terms in (1) is 2^M; hence, the complexity grows exponentially with the number of players M. Each Shapley value is a weighted average of the player's marginal contribution to each coalition S.
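To make the formula concrete, the following minimal base-R sketch evaluates (1) by brute-force enumeration of all coalitions. The function names and the additive example game v_example are our own hypothetical illustrations, not part of any package.

shapley_values <- function(v, M) {
  phi <- numeric(M)
  for (j in seq_len(M)) {
    others <- setdiff(seq_len(M), j)
    for (size in 0:length(others)) {
      # All coalitions S of the remaining players, including the empty set.
      subsets <- if (size == 0) list(integer(0)) else combn(others, size, simplify = FALSE)
      for (S in subsets) {
        w <- factorial(length(S)) * factorial(M - length(S) - 1) / factorial(M)
        phi[j] <- phi[j] + w * (v(c(S, j)) - v(S))  # weighted marginal contribution
      }
    }
  }
  phi
}

v_example <- function(S) sum(S)        # hypothetical additive game
shapley_values(v_example, M = 3)       # returns c(1, 2, 3); note the efficiency axiom holds

Since the loop visits all 2^M coalitions for each player, this brute-force version only illustrates the definition; it is not how Shapley values are computed at scale.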

Shapley Values in Model Explanation
We consider the setting of supervised learning, where we aim to explain a predictive model f(x) trained on data {x^{[i]}, y^{[i]}}_{i=1}^{N_train}, where x^{[i]} is an M-dimensional feature vector, y^{[i]} is a univariate response, and N_train is the number of training observations. The prediction ŷ = f(x), for a specific feature vector x = x*, is explained using Shapley values as a model-agnostic explanation framework (Strumbelj and Kononenko 2010, 2014; Lundberg and Lee 2017). The fairness aspect of Shapley values in the model explanation setting is discussed in, for example, Chen, Janizek, et al. (2020), Fryer et al. (2021), and Aas, Jullum, et al. (2021).
In the Shapley value framework, the predictive model f (indirectly) replaces the cooperative game, and the M-dimensional feature vector replaces the M players. The Shapley value φ_j describes the importance of the jth feature in the prediction f(x*). By the efficiency axiom, f(x*) = φ_0 + Σ_{j=1}^{M} φ_j, where φ_0 = E[f(x)]. That is, the sum of the Shapley values explains the difference between the prediction f(x*) and the global average prediction.
To calculate (1), we need to define an appropriate contribution function v(S) = v(S, x*), which should resemble the value of f(x*) when only the features in coalition S are known. We use the contribution function proposed by Lundberg and Lee (2017), namely the expected response of f(x) conditioned on the features in S taking on the values x*_S. That is,

v(S) = E[f(x) | x_S = x*_S] = E[f(x_S̄, x_S) | x_S = x*_S] = ∫ f(x_S̄, x*_S) p(x_S̄ | x_S = x*_S) dx_S̄,     (2)

where x_S = {x_j : j ∈ S} denotes the features in subset S, x_S̄ = {x_j : j ∈ S̄} denotes the features outside S, that is, S̄ = M\S, and p(x_S̄ | x_S = x*_S) is the conditional density of x_S̄ given x_S = x*_S. The conditional expectation summarizes the whole probability distribution, it is the most common estimator in prediction applications, and it is also the minimizer of the commonly used squared error loss function. Note that the last equality of (2) only holds for continuous features. If there are any discrete or categorical features, the integral should be replaced by sums for these features, and p(x_S̄ | x_S = x*_S) is then no longer a continuous density.
The contribution function in (2) is also used by, for example, Covert et al. (2020), Aas, Jullum, et al. (2021), Aas, Nagler, et al. (2021), Frye et al. (2021), and Olsen et al. (2022). Covert et al. (2021) argue that the conditional approach in (2) is the only approach that is consistent with standard probability axioms. Computing (2) is not straightforward for a general data distribution and model. Assuming independent features, or having f be linear, simplifies the computations (Lundberg and Lee 2017;Aas, Jullum, et al. 2021), but these assumptions do not hold in general.
To compute the Shapley values in (1), we need to compute the contribution function v(S) in (2) for all S ∈ P(M), except for the edge cases S ∈ {∅, M}. For S = M, we have x_S = x* and v(M) = f(x*) by definition. For S = ∅, we have by definition that φ_0 = v(∅) = E[f(x)], for which the average training response is a commonly used estimate. We denote the non-trivial coalitions by P*(M) = P(M)\{∅, M}. The Shapley values φ* = {φ*_j}_{j=0}^{M} for the prediction f(x*) are computed as the solution of a weighted least squares problem (Charnes et al. 1988; Lundberg and Lee 2017; Aas, Jullum, et al. 2021).
In Sections 2.2.1 and 2.2.2, we describe two prominent paradigms for estimating the contribution function v(S) for all S ∈ P * (M), namely, Monte Carlo integration and regression.

Monte Carlo Integration
One way to estimate the contribution function v(S) is by Monte Carlo integration, that is,

v(S) ≈ v̂(S) = (1/K) Σ_{k=1}^{K} f(x^{(k)}_S̄, x*_S),     (3)

where f is the predictive model, x^{(k)}_S̄ ∼ p(x_S̄ | x_S = x*_S), for k = 1, 2, . . . , K, and K is the number of Monte Carlo samples. We insert v̂(S) into (1) to estimate the Shapley values. To obtain accurate conditional Shapley values, we need to generate Monte Carlo samples which follow the true conditional distribution of the data. This distribution is in general not known and needs to be estimated based on the training data. In Sections 3.1 to 3.4, we describe different method classes for generating the conditional samples x^{(k)}_S̄.
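As an illustration, the following sketch implements (3) in base R, assuming the predictive model f accepts a matrix of feature rows and that cond_sampler is a user-supplied function drawing from p(x_S̄ | x_S = x*_S); the samplers themselves are what the method classes in Sections 3.1 to 3.4 provide. The function names are our own.

estimate_v_mc <- function(f, x_star, S, cond_sampler, K = 250) {
  M <- length(x_star)
  Sbar <- setdiff(seq_len(M), S)
  X <- matrix(rep(x_star, each = K), nrow = K)  # K copies of x*, one per row
  X[, Sbar] <- cond_sampler(K, S, x_star)       # overwrite x_Sbar with conditional draws
  mean(f(X))                                    # average over the K model calls, as in (3)
}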

Regression
As stated above, the conditional expectation in (2) is the minimizer of the mean squared error loss function. That is,

v(S) = E[f(x) | x_S = x*_S] = g*_S(x*_S), where g*_S = argmin_{g_S} E[(f(x) − g_S(x_S))^2].     (4)

Thus, any regression model g_S(x_S) which is fitted with the mean squared error loss function as the objective function will approximate (4), yielding an alternative estimator v̂(S). The accuracy of the approximation will depend on the form of the predictive model f(x), the flexibility of the regression model g_S(x_S), and the optimization routine. We can either train a separate regression model g_S(x_S) for each S ∈ P*(M), or we can train a single regression model g(x̃_S) which approximates the contribution function v(S) for all S ∈ P*(M) simultaneously. Here x̃_S is a fixed-length augmented version of x_S, where the augmented values are mask values to be explained later. We elaborate on the notation and details of these two regression methodologies in Sections 3.5 and 3.6, respectively.

Conditional Expectation Estimation
Computing conditional Shapley values is difficult due to the complexity of estimating the conditional distributions, which are not directly available from the training data. In this section, we give a methodological introduction to different methods for estimating the conditional expectation in (2) via either Monte Carlo integration or regression, while we provide implementation details in Appendix A. We organize the methods into six method classes in accordance with those described in Chen, Covert, et al. (2022, Section 5.1.3) and Covert et al. (2021, Section 8.2). The method classes we consider are called independence, empirical, parametric, generative, separate regression, and surrogate regression, and they are described in Sections 3.1 to 3.6, respectively. The first four classes estimate the conditional expectation in (2) using Monte Carlo integration, while the last two classes use regression.

The Independence Method
Lundberg and Lee (2017) avoided estimating the complex conditional distributions by implicitly assuming feature independence. In the independence approach, the conditional distribution p(x_S̄ | x_S) simplifies to p(x_S̄), and the corresponding Shapley values are the interventional Shapley values discussed in Section 1. The Monte Carlo samples x^{(k)}_S̄ ∼ p(x_S̄) are generated by randomly sampling observations from the training data; thus, no modeling is needed, and x^{(k)}_S̄ follows the assumed true data distribution. However, for dependent features, which are common in observational studies, the independence approach produces biased estimates of the contribution function in (3) and the conditional Shapley values. Thus, the independence approach can lead to incorrect Shapley value explanations for real-world data (Merrick and Taly 2020; Frye et al. 2021; Olsen et al. 2022).
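A minimal sketch of such a sampler, assuming X_train is an N_train × M matrix; it can be plugged directly into the Monte Carlo sketch of Section 2.2.1. The function name is our own.

independence_sampler <- function(K, S, x_star, X_train) {
  Sbar <- setdiff(seq_along(x_star), S)
  idx <- sample(nrow(X_train), K, replace = TRUE)  # ignores x*_S entirely
  X_train[idx, Sbar, drop = FALSE]                 # marginal, not conditional, draws of x_Sbar
}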

The Empirical Method
Instead of sampling randomly from the training data, the empirical method samples only from similar observations in the training data. The optimal procedure is to use only samples which perfectly match the feature values x*_S, as this approach exactly estimates the conditional expectation when the number of matching observations tends to infinity (Chen, Covert, et al. 2022). However, this is not applicable in practice, as data sets can have few observations, contain a high number of features to match, or contain continuous features for which an exact match is very unlikely. A natural extension is to relax the perfect match criterion and allow for similar observations (Mase et al. 2019; Sundararajan and Najmi 2020; Aas, Jullum, et al. 2021). However, this procedure is also influenced by the curse of dimensionality, as conditioning on many features can yield few similar observations and thereby inaccurate estimates of the conditional expectation in (2). We can relax the similarity criterion and include less similar observations, but then we break the feature dependencies. The empirical approach coincides with the independence approach when the similarity measure defines all observations in the training data as similar.
We use the empirical approach described in Aas, Jullum, et al. (2021). The approach uses a scaled version of the Mahalanobis distance to calculate a distance D_S(x*, x^{[i]}) between the observation being explained x* and every training instance x^{[i]}. A Gaussian distribution kernel then converts the distance into a weight w_S(x*, x^{[i]}) for a given bandwidth parameter σ. All the weights are sorted in decreasing order, with x^{{k}} having the kth largest value. Finally, (2) is approximated by the weighted version of (3) given by

v̂(S) = Σ_{k=1}^{K*} w_S(x*, x^{{k}}) f(x^{{k}}_S̄, x*_S) / Σ_{k=1}^{K*} w_S(x*, x^{{k}}),

where K* is the smallest integer such that Σ_{k=1}^{K*} w_S(x*, x^{{k}}) / Σ_{i=1}^{N_train} w_S(x*, x^{[i]}) > η, that is, the ratio between the sum of the K* largest weights and the sum of all weights must be at least η, for instance 0.95.
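The following sketch illustrates the estimator, assuming continuous features. The exact distance scaling and default bandwidth in Aas, Jullum, et al. (2021) may differ from this simplified version, and the function name is our own.

estimate_v_empirical <- function(f, x_star, S, X_train, sigma = 0.1, eta = 0.95) {
  XS <- X_train[, S, drop = FALSE]
  D2 <- mahalanobis(XS, x_star[S], cov(XS)) / length(S)  # scaled squared Mahalanobis distance
  w <- exp(-D2 / (2 * sigma^2))                          # Gaussian kernel weights
  ord <- order(w, decreasing = TRUE)
  K_star <- which(cumsum(w[ord]) / sum(w) > eta)[1]      # smallest K* covering the eta ratio
  keep <- ord[seq_len(K_star)]
  X <- X_train[keep, , drop = FALSE]
  X[, S] <- matrix(x_star[S], K_star, length(S), byrow = TRUE)  # fix x_S = x*_S
  sum(w[keep] * f(X)) / sum(w[keep])                     # weighted version of (3)
}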
Note that since Aas, Jullum, et al. (2021) use the Mahalanobis distance, their approach is limited to continuous features. One could potentially extend their method by using a distance measure which supports mixed data, for example, Gower's distance (Gower 1971; Podani 1999). Another solution is to use encodings such as one-hot encoding or entity embeddings to represent the categorical variables numerically (Guo and Berkhahn 2016), although that would increase the computational demand due to the increased dimension.

The Parametric Method Class
In the parametric method class, we make a parametric assumption about the distribution of the data. This simplifies the process of generating the conditional Monte Carlo samples x^{(k)}_S̄. The idea is to assume a distribution whose conditional distributions have closed-form solutions or are otherwise easily obtainable after estimating the parameters of the full joint distribution. The parametric approaches can yield very accurate representations if the data truly follow the assumed distribution, but they may impose a large bias under incorrect parametric assumptions. In this section, we discuss two previously proposed parametric approaches and introduce two new methods. The current parametric approaches do not support categorical features, which is a major drawback, but one can potentially use the same type of encodings or entity embeddings of the categorical variables as for the empirical method.

Gaussian
Both Chen, Janizek, et al. (2020) and Aas, Jullum, et al. (2021) assume that the observations are multivariate Gaussian distributed with mean µ and covariance matrix Σ, that is, x ∼ N_M(µ, Σ). The parameters µ and Σ are easily estimated by the sample mean and covariance matrix of the training data, respectively. In the Gaussian approach, we sample the conditional samples x^{(k)}_S̄ from p(x_S̄ | x_S = x*_S), for k = 1, 2, . . . , K and S ∈ P*(M), and use them in (3) to estimate the Shapley values in (1).
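A sketch of the closed-form conditional sampling, using the standard formulas µ_{S̄|S} = µ_S̄ + Σ_{S̄S} Σ_{SS}^{-1}(x*_S − µ_S) and Σ_{S̄|S} = Σ_{S̄S̄} − Σ_{S̄S} Σ_{SS}^{-1} Σ_{SS̄}, with the sampling step delegated to mvtnorm::rmvnorm; the wrapper name is our own.

library(mvtnorm)

gaussian_sampler <- function(K, S, x_star, mu, Sigma) {
  Sbar <- setdiff(seq_along(mu), S)
  A <- Sigma[Sbar, S, drop = FALSE] %*% solve(Sigma[S, S, drop = FALSE])
  mu_cond <- mu[Sbar] + drop(A %*% (x_star[S] - mu[S]))        # conditional mean
  Sigma_cond <- Sigma[Sbar, Sbar, drop = FALSE] -
    A %*% Sigma[S, Sbar, drop = FALSE]                         # conditional covariance
  rmvnorm(K, mean = mu_cond, sigma = Sigma_cond)               # K draws of x_Sbar
}

In practice, mu and Sigma would be estimated by colMeans(X_train) and cov(X_train).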

Gaussian Copula
Aas, Jullum, et al. (2021) also proposed an alternative approach for when the features are far from multivariate Gaussian distributed, namely the (Gaussian) copula approach. The idea is to represent the marginals of the features by their empirical distributions and then model the dependence structure by a Gaussian copula. See Appendix B.1 for additional information about copulas. Assuming a Gaussian copula, Aas, Jullum, et al. (2021) use the following procedure to generate the K conditional Monte Carlo samples x^{(k)}_S̄:

1. Convert each marginal x_j of x to a Gaussian feature v_j by v_j = Φ^{-1}(F̂_j(x_j)), where F̂_j is the empirical distribution function of the jth feature and Φ is the standard Gaussian distribution function.

2. Assume that v is distributed according to a multivariate Gaussian (the quality of this assumption will depend on how close the Gaussian copula is to the true copula), and sample from the conditional distribution p(v_S̄ | v_S = v*_S) using the method described in Section 3.3.1.

3. Convert the margins v_j in the conditional distribution back to the original distribution using x_j = F̂_j^{-1}(Φ(v_j)).
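The margin transforms in steps 1 and 3 can be sketched as follows; the rescaling of the empirical distribution function and the quantile type are our own simplifications, not necessarily those used in the shapr implementation.

to_gaussian <- function(x, x_train_j) {
  n <- length(x_train_j)
  u <- (ecdf(x_train_j)(x) * n + 0.5) / (n + 1)  # shrink away from 0 and 1 so qnorm stays finite
  qnorm(u)                                       # v_j = Phi^{-1}(F_hat_j(x_j))
}

from_gaussian <- function(v, x_train_j) {
  quantile(x_train_j, probs = pnorm(v), names = FALSE, type = 8)  # x_j = F_hat_j^{-1}(Phi(v_j))
}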

Burr and Generalized Hyperbolic
The multivariate Gaussian distribution is probably the most well-known multivariate distribution with closed-form expressions for the conditional distributions. However, any other distribution with easily obtainable conditional distributions is also applicable, for example, the multivariate Burr distribution (Takahasi 1965; Yari and Jafari 2006) and the multivariate generalized hyperbolic (GH) distribution (Barndorff-Nielsen 1977; McNeil et al. 2015; Browne and McNicholas 2015; Wei et al. 2019). We call these two approaches Burr and GH, respectively. In contrast to the Gaussian distribution, whose parameters can easily be estimated by the sample mean and covariance, the parameters of the Burr and GH distributions are more cumbersome to estimate. We describe the distributions in more detail in Appendix B. The GH distribution is unbounded and can model any continuous data set, while the Burr distribution is strictly positive and is therefore limited to positive data sets. The GH distribution is related to the Gaussian distribution through the t-distribution: the t-distribution is a special case of the GH distribution and coincides with the Gaussian distribution when the degrees of freedom tend to infinity.

The Generative Method Class
The generative and parametric methods are similar in that they both generate Monte Carlo samples from estimated conditional distributions. However, the generative methods do not make a parametric assumption about the data. We consider two generative approaches: the ctree approach of Redelmeier, Jullum, and Aas (2020) and the VAEAC approach of Olsen et al. (2022). The latter is an extension of the approach suggested by Frye et al. (2021). Both methods support mixed data, i.e., continuous and categorical features.

Ctree
Redelmeier, Jullum, and Aas (2020) compute conditional Shapley values by modeling the dependence structure between the features with conditional inference trees (ctree). A ctree is a type of recursive partitioning algorithm that builds trees by making binary splits on features until a stopping criterion is satisfied (Hothorn et al. 2006). The process is sequential, where the splitting feature is chosen first using statistical significance tests, and then the splitting point is chosen using any type of splitting criterion. The ctree algorithm is independent of the dimension of the response, which in our case is x_S̄, while the input features are x_S, which vary in dimension based on the coalition S. That is, for each coalition S ∈ P*(M), a ctree with x_S as the features and x_S̄ as the response is fitted to the training data. For a given x*_S, the ctree approach finds the corresponding leaf node and samples K observations with replacement from the x_S̄ part of the training observations in the same node to generate the conditional Monte Carlo samples x^{(k)}_S̄; a minimal sketch of this leaf-based sampling is given at the end of this section. We get duplicated Monte Carlo samples when K is larger than the number of samples in the leaf node. Thus, the ctree method weights the Monte Carlo samples based on their sampling frequencies to bypass redundant calls to f. Therefore, the contribution function v(S) is not estimated by (3), but rather by the weighted average

v̂(S) = Σ_{k=1}^{K*} w_k f(x^{(k)}_S̄, x*_S) / Σ_{k=1}^{K*} w_k,

where K* is the number of unique Monte Carlo samples and w_k is the sampling frequency of the kth unique sample. For more details, see Redelmeier, Jullum, and Aas (2020, Section 3).

VAEAC

Olsen et al. (2022) use a type of variational autoencoder called VAEAC (Ivanov et al. 2019) to generate the conditional Monte Carlo samples. Briefly stated, the original variational autoencoder (Kingma and Welling 2014, 2019; Rezende et al. 2014) gives a probabilistic representation of the true unknown distribution p(x). The VAEAC model extends this methodology to all conditional distributions p(x_S̄ | x_S = x*_S) simultaneously. That is, a single VAEAC model can generate Monte Carlo samples x^{(k)}_S̄ ∼ p(x_S̄ | x_S = x*_S) for all coalitions S ∈ P*(M). It is advantageous to only have to fit a single model for all coalitions, as the number of coalitions, 2^M − 2, increases exponentially with the number of features. In contrast, ctree trains 2^M − 2 different models, which eventually becomes computationally intractable for large M. The VAEAC model is trained by maximizing a variational lower bound, which conceptually corresponds to artificially masking features and then trying to reproduce them using a probabilistic representation. In deployment, the VAEAC method considers the unconditioned features x_S̄ as masked features to be imputed.
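The leaf-based sampling referenced above can be sketched with partykit::ctree as follows. This is our own simplified illustration, assuming partykit accepts the multivariate-response formula we build and that x_star is a named vector; the actual shapr implementation differs in details such as significance-test settings and the handling of categorical features.

library(partykit)

ctree_sampler <- function(K, S, x_star, X_train) {
  df <- as.data.frame(X_train)
  Sbar <- setdiff(seq_along(x_star), S)
  # Response x_Sbar, predictors x_S: "xSbar1 + xSbar2 + ... ~ xS1 + ..."
  fml <- as.formula(paste(paste(names(df)[Sbar], collapse = " + "), "~",
                          paste(names(df)[S], collapse = " + ")))
  fit <- ctree(fml, data = df)
  new_obs <- as.data.frame(t(x_star))
  leaf <- predict(fit, newdata = new_obs, type = "node")   # leaf node of x*_S
  in_leaf <- predict(fit, newdata = df, type = "node") == leaf
  X_train[sample(which(in_leaf), K, replace = TRUE), Sbar, drop = FALSE]
}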

The Separate Regression Method Class
The next two method classes use regression instead of Monte Carlo integration to estimate the conditional expectation in (2). In the separate regression methods, we train a new regression model g S (x S ) to estimate the conditional expectation for each coalition of features. Related ideas have been explored by Lipovetsky and Conklin (2001), Strumbelj, Kononenko, and Sikonja (2009), and Williamson and Feng (2020). However, to the best of our knowledge, we are the first to compare different regression models for estimating the conditional expectation as the contribution function v(S) in the local Shapley value explanation framework.
The idea is to estimate v(S) = E[f(x) | x_S = x*_S] separately for each coalition S ∈ P*(M). For each coalition, we construct a training data set X_S = {x^{[i]}_S, z^{[i]}}_{i=1}^{N_train}, where the response z^{[i]} = f(x^{[i]}) is the prediction of the model we want to explain. For each data set X_S, we train a regression model g_S(x_S) with respect to the mean squared error loss function. The optimal model, with respect to the loss function, is g*_S(x_S) = E[f(x) | x_S], which corresponds to the contribution function v(S). The regression model g_S aims for the optimum; hence, it resembles/estimates the contribution function, i.e., v̂(S) = g_S(x*_S). A wide variety of regression models minimize the MSE, and we describe a selection of them in Sections 3.5.1 to 3.5.5. The selection discussed below consists of classical regression models and those that generally perform well in many experiments.
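A compact sketch of this recipe, using lm() as the regression model (i.e., the LM separate method of Section 3.5.1); fit_separate and the coalition enumeration are our own illustrative names, and x_star is assumed to be a named vector matching the columns of X_train.

fit_separate <- function(f, X_train, coalitions) {
  df <- as.data.frame(X_train)
  df$z <- f(X_train)                               # response: the model's own predictions
  lapply(coalitions, function(S) {
    lm(reformulate(names(df)[S], response = "z"),  # g_S(x_S), one model per coalition
       data = df)
  })
}

# The coalitions are all non-trivial subsets of 1:M, e.g.:
# coalitions <- unlist(lapply(1:(M - 1), function(s) combn(M, s, simplify = FALSE)),
#                      recursive = FALSE)
# v_hat(S, x*) is then predict(models[[s]], newdata = as.data.frame(t(x_star))).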

Linear Regression Model
The simplest regression model we consider is the linear regression model. It takes the form g_S(x_S) = β_{S,0} + Σ_{j∈S} β_{S,j} x_j = x_S^T β_S, where the coefficients β_S are estimated by the least squares solution, that is, β̂_S = argmin_β ||X_S β − z||² = (X_S^T X_S)^{-1} X_S^T z, for all S ∈ P*(M). Here X_S is the design matrix with the first column consisting of 1s to also estimate the intercept β_{S,0}. We call this approach LM separate.

Generalized Additive Model
The generalized additive model (GAM) extends the linear regression model and allows for nonlinear effects between the features and the response (Wood 2006b; Hastie et al. 2009). The fitted GAM takes the form g_S(x_S) = β_{S,0} + Σ_{j∈S} g_{S,j}(x_{S,j}), where the effect functions g_{S,j} are penalized regression splines. We call this approach GAM separate.

Projection Pursuit Regression
The projection pursuit regression (PPR) model extends the GAM model (Friedman and Stuetzle 1981; Hastie et al. 2009). The PPR model takes the form g_S(x_S) = β_{S,0} + Σ_{l=1}^{L} g_{S,l}(β_{S,l}^T x_S), where each parameter vector β_{S,l} is an |S|-dimensional unit vector. The PPR model is additive, but in the transformed features β_{S,l}^T x_S rather than in the original features x_S. The ridge functions g_{S,l} are unspecified and are estimated along with the parameters β_{S,l} using some flexible smoothing method. The PPR model combines nonlinear functions of linear combinations, producing a large class of potential models. Moreover, it is a universal approximator for continuous functions for arbitrarily large L and appropriate choices of g_{S,l} (Hastie et al. 2009, Section 11.2). We call this approach PPR separate.

Random Forest
A random forest (RF) is an ensemble model consisting of a multitude of decision trees, where the average prediction of the individual trees is returned. The first algorithm was developed by Ho (1995), but Breiman (2001) later extended the algorithm to include bootstrap aggregating to improve the stability and accuracy. We call this approach RF separate.

Boosting
A (tree-based) boosted model is an ensemble learner consisting of weighted weak base-learners, each iteratively fitted to the errors of the previous base-learners, which together form a strong learner (Hastie et al. 2009). The seminal boosting algorithm was developed by Freund and Schapire (1997), but a multitude of boosting algorithms have since been developed (Mayr et al. 2014), for example, CatBoost (Prokhorenkova et al. 2018). We call this approach CatBoost separate.

The Surrogate Regression Method Class
Since the separate regression methods train a new regression model g_S(x_S) for each coalition S ∈ P*(M), a total of 2^M − 2 models have to be trained, which can be time-consuming for models that are slow to fit. The surrogate regression method class builds on the ideas from the separate regression class, but instead of fitting a new regression model for each coalition, we train a single regression model g(x̃_S) for all coalitions S ∈ P*(M), where x̃_S is defined in Section 3.6.1. The surrogate regression idea is used by Frye et al. (2021) and Covert et al. (2021), but their setup is limited to neural networks. In Section 3.6.1, we propose a general and novel framework which allows us to use any regression model. Then, we relate our framework to the previously proposed neural network setup in Section 3.6.2.

General Surrogate Regression Framework
To construct a surrogate regression method, we must consider that most regression models g rely on a fixed-length input, while the size of x_S varies with the coalition S. Thus, we are either limited to regression models which support variable-length input, or we can create a fixed-length representation x̃_S of x_S for all coalitions S. The x̃_S representation must also include fixed-length information about the coalition S to enable the regression model g to distinguish between coalitions. Finally, we need to augment the training data to reflect that g is to predict the conditional expectation for all coalitions S.
In our framework, we augment the training data by systematically applying all possible coalitions to all training observations. We can then train a single regression model g on the augmented training data set, and the corresponding regression model can then (in theory) estimate the contribution function v(S) for all coalitions S ∈ P * (M) simultaneously.
To illustrate the augmentation idea, we consider a small example with M = 3 features and N_train = 2 training observations. Let

X = [ x_11 x_12 x_13 ]
    [ x_21 x_22 x_23 ]

be the training data, where the ith row is the observation x^{[i]}. Assuming that g relies on fixed-length input, we must represent both the observed values x_S and the coalition S in a fixed-length notation for the surrogate regression methods to work.
To solve this, we first introduce I(S) = {1(j ∈ S) : j = 1, . . . , M} ∈ {0, 1}^M, where 1(j ∈ S) is the indicator function, which is one if j ∈ S and zero otherwise. Then, I(S) is an M-dimensional binary vector where the jth element I(S)_j is one if the jth feature is in S (i.e., observed/conditioned on) and zero if it is in S̄ (i.e., unobserved/unconditioned). The I function ensures fixed-length representations of the coalitions/masks, and note that I(S̄) = 1_M − I(S), where 1_M is the size-M vector of 1s. Second, to obtain a fixed-length representation x̄_S of the observed/conditioned feature vector x_S, we apply the fixed-length mask I(S) to x as an element-wise product, that is, x̄_S = x ∘ I(S) = x ∘ (1_M − I(S̄)), where ∘ is the element-wise product. Finally, we concatenate the fixed-length representations to form the augmented version of x_S, namely, x̃_S = {x̄_S, I(S)}, which has 2M entries. We include I(S) in x̃_S such that the model g can distinguish between actual zeros in x̄_S and those induced by the masking procedure when creating x̄_S. We treat the entries of I(S) as binary categorical features.
After carrying out this procedure for all coalitions and training observations, we obtain the following augmented training data and responses:

X_aug = [ x_11 0    0    1 0 0 ]        z_aug = [ f(x^{[1]}) ]
        [ x_21 0    0    1 0 0 ]                [ f(x^{[2]}) ]
        [ 0    x_12 0    0 1 0 ]                [ f(x^{[1]}) ]
        [ 0    x_22 0    0 1 0 ]                [ f(x^{[2]}) ]
        [ ...                  ]                [ ...        ]
        [ 0    x_12 x_13 0 1 1 ]                [ f(x^{[1]}) ]
        [ 0    x_22 x_23 0 1 1 ]                [ f(x^{[2]}) ]     (5)

Here, each row of X_aug is x̃^{[i]}_S = {x^{[i]} ∘ I(S), I(S)} for one of the coalitions S ∈ P*(M), and the corresponding response is z^{[i]} = f(x^{[i]}). The number of rows in X_aug and z_aug is N_train(2^M − 2). For example, with N_train = 1000 and M = 8, the augmented data X_aug consists of 254 000 rows, while the number of rows is 65 534 000 when M = 16. This exponential growth can make it computationally intractable to fit some types of regression models to the augmented training data {X_aug, z_aug} in high dimensions. The data instance x*, which we want to explain, is augmented by the same procedure, and g(x̃*_S) then approximates the corresponding contribution function v(S, x*).
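The augmentation in (5) is straightforward to sketch in R; augment_data is our own illustrative name, and coalitions is a list of the index vectors in P*(M), as in the separate regression sketch of Section 3.5.

augment_data <- function(f, X_train, coalitions) {
  M <- ncol(X_train)
  blocks <- lapply(coalitions, function(S) {
    I_S <- as.numeric(seq_len(M) %in% S)                    # the fixed-length mask I(S)
    cbind(sweep(X_train, 2, I_S, `*`),                      # x_bar_S = x o I(S), element-wise
          matrix(I_S, nrow(X_train), M, byrow = TRUE))      # append I(S) as extra columns
  })
  list(X_aug = do.call(rbind, blocks),
       z_aug = rep(f(X_train), times = length(coalitions))) # z = f(x^{[i]}) for every row
}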
Methods: For the surrogate regression method class, we consider the same regression models as in Section 3.5. We call the methods LM surrogate, GAM surrogate, PPR surrogate, RF surrogate, and CatBoost surrogate, and they take the following forms:

LM surrogate: g(x̃_S) = β_0 + Σ_{j=1}^{2M} β_j x̃_{S,j}.

GAM surrogate: g(x̃_S) = β_0 + Σ_{j=1}^{M} g_j(x̃_{S,j}) + Σ_{j=M+1}^{2M} β_j x̃_{S,j}. That is, we add nonlinear effect functions to the augmented features x̄_S = x ∘ I(S) in x̃_S while letting the binary mask indicators I(S) in x̃_S be linear.

PPR surrogate: g(x̃_S) = β_0 + Σ_{l=1}^{L} g_l(β_l^T x̃_S), where g_l and β_l are the lth ridge function and parameter vector, respectively.
RF surrogate: g(x̃_S) is an RF model fitted to the augmented data in (5).

CatBoost surrogate: g(x̃_S) is a CatBoost model fitted to the augmented data in (5).

Surrogate Regression: Neural Networks
The surrogate regression neural network (NN-Frye surrogate) approach in Frye et al. (2021) differs from our general setup above in that they do not train the model on the complete augmented data. Instead, for each observation in every batch in the training process, they randomly sample a coalition S with probability |S|!(M − |S| − 1)!/M!. Then they set the masked entries of the observation, i.e., the features not in S, to an off-distribution value not present in the data. Furthermore, they do not concatenate the masks to the data, as we do in (5). We propose an additional neural network (NN-Olsen surrogate) approach to illustrate that one can improve on the NN-Frye surrogate method. The main conceptual differences between the methods are the following. First, for each batch, we generate a missing completely at random (MCAR) mask with paired sampling. MCAR means that the binary entries of the mask are Bernoulli distributed with probability 0.5, which ensures that all coalitions are equally likely to be considered. Further, paired sampling means that we duplicate the observations in the batch and apply the complement mask, S̄, to these duplicates. This ensures more stable training, as the network can associate both x_S and x_S̄ with the response f(x). Second, we set the masked entries to zero and include the binary mask entries as additional features, as done in (5) and Olsen et al. (2022). This enables the network to learn to distinguish between actual zeros in the data set and zeros induced by the masking, removing the need for an off-distribution masking value. Additional differences due to implementation, for example, network architecture and optimization routine, are elaborated in Appendix A.
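The paired MCAR masking can be sketched in a few lines; this only illustrates the mask generation, not the network training itself, and the function name is our own.

paired_mcar_masks <- function(batch_size, M) {
  I_S <- matrix(rbinom(batch_size * M, size = 1, prob = 0.5), batch_size, M)  # Bernoulli(0.5) masks
  rbind(I_S, 1 - I_S)  # duplicated batch rows get the complement mask S_bar
}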

Additional Methods in the Appendix
In addition to the methods described in Sections 3.1 to 3.6, we include dozens more generative, separate regression, and surrogate regression methods in the Appendix. These methods are not included in the main text as they generally perform worse than the introduced methods. For the generative method class, we consider three additional VAEAC approaches with methodological differences and point to eleven other potential generative methods. For the separate regression method class, we consider twenty other regression models, and most of these are also applicable to the surrogate regression method class. Among the regression methods are: linear regression with interactions, polynomial regression with and without interactions, elastic nets, generalized additive models, principal component regression, partial least squares, K-nearest neighbors, support vector machines, decision trees, boosting, and neural networks. In the Appendix, we apply the additional methods to the numerical simulation studies and real-world data experiments conducted in Sections 4 and 5, respectively.

Numerical Simulation Studies
A major problem when evaluating explanation frameworks is that there is no ground truth for authentic real-world data. In this section, we simulate data for which we can compute the true Shapley values φ_true and compare how close the estimated Shapley values φ̂_q of each approach q are to them. We gradually increase the complexity of the setups in the simulation studies to uncover in which settings the different methods described in Section 3 perform the best and should be used. Additionally, as we focus on conditional Shapley values, we vary the dependencies between the features within each simulation setup to also investigate how the methods cope with different dependence levels.
In all experiments, we generate univariate prediction problems with M = 8-dimensional features simulated from a multivariate Gaussian distribution p(x) = N_8(0, Σ), where Σ_ij = ρ^{|i−j|} for ρ ∈ {0, 0.3, 0.5, 0.9} and 1 on the diagonal. Larger values of ρ correspond to stronger dependencies between the features. Higher feature dimensions are possible, but we chose M = 8 to keep the computation time of the simulation studies tractable. The real-world data sets in Section 5 contain more features. See also Chen, Covert, et al. (2022, Section 5.2) for estimation strategies used in the literature to compute Shapley values in higher dimensions.
We let the number of training observations be N_train = 1000, while we explain N_test = 250 test observations. Thus, the training data set is {x^{[i]}, y^{[i]}}_{i=1}^{N_train}, where y^{[i]} = f_true(x^{[i]}) + ε^{[i]}. The function f_true differs between the experiments, and ε^{[i]} ∼ N(0, 1). The test data sets are created by the same procedure. We provide additional experiments with other settings and some illustrative plots of the data in Appendices D and E.
We evaluate the performance of the different approaches by computing the mean absolute error (MAE) between the true and estimated Shapley values, averaged over all test observations and features. This criterion has been used in Redelmeier, Jullum, and Aas (2020), Aas, Jullum, et al. (2021), Aas, Nagler, et al. (2021), and Olsen et al. (2022). The MAE is given by

MAE = MAE_q = (1/N_test) Σ_{i=1}^{N_test} (1/M) Σ_{j=1}^{M} |φ_{j,true}(x^{[i]}) − φ̂_{j,q}(x^{[i]})|.     (6)

The true Shapley values are in general unknown, but we can compute them with arbitrary precision in our setup with multivariate Gaussian distributed data, as the conditional distributions p_true(x_S̄ | x_S) are analytically known and samplable for all S. Thus, by sampling x^{(k)}_{S̄,true} ∼ p_true(x_S̄ | x_S), we can compute the true contribution function v_true(S) in (2) by using (3). The true Shapley values are then obtained by inserting the v_true(S) quantities into the Shapley value formula in (1). The v_true(S) quantities can be made arbitrarily precise by choosing a sufficiently large number of Monte Carlo samples, e.g., K = 10 000.
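A sketch of how these pieces fit together, reusing the gaussian_sampler and estimate_v_mc sketches from Sections 3.3.1 and 2.2.1; phi_true and phi_hat are assumed to be N_test × M matrices of true and estimated Shapley values.

v_true_gaussian <- function(f, x_star, S, mu, Sigma, K = 10000) {
  sampler <- function(K, S, x) gaussian_sampler(K, S, x, mu, Sigma)  # exact conditional draws
  estimate_v_mc(f, x_star, S, sampler, K)                           # v_true(S) via (3)
}

mae <- function(phi_true, phi_hat) mean(abs(phi_true - phi_hat))    # the criterion in (6)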

Linear Regression Models
The first simulation setup should be considered a sanity check, as we generate the response y^{[i]} according to three linear regression models, denoted lm no interactions, lm more interactions, and lm numerous interactions. All three models contain the eight linear main effects with coefficients β = {1.0, 0.2, −0.8, 1.0, 0.5, −0.8, 0.6, −0.7, −0.6} (including the intercept), while the two latter models add an increasing number of pairwise interaction terms with coefficients γ = {0.8, −1.0, −2.0, 1.5}. For each setup, we fit a predictive linear model f with the same form as the true model. E.g., for the lm more interactions setup, the predictive linear model f has eight linear terms and two interaction terms, reflecting the form of f_lm,more. We fit the predictive models using the lm function in base R.
In Figures 2 to 4, we show the MAE for each test observation (i.e., the absolute error averaged only over the features), and the methods are sorted based on the overall MAE (i.e., when also averaged over the test observations). In what follows, we provide a short summary of the results for the different simulation setups.

lm no interactions (Figure 2): For ρ = 0, we see that ctree and LM surrogate perform the best. The independence approach, which makes the correct feature independence assumption, is close behind. For ρ > 0, the parametric and separate regression (LM, GAM, and PPR) methods generally perform the best. In particular, the LM separate method, which makes the correct model assumption, is the best-performing approach. The generative and empirical approaches form the mid-field, while the surrogate regression and independence methods seem to be the least precise.
lm more interactions (Figure 3): In this case, the LM separate method performs poorly, which is reasonable due to the incorrect model assumption. For ρ = 0, the ctree approach is the most accurate, but the independence and parametric methods are close behind. For ρ > 0, the parametric methods are clearly the best approaches, as they make the correct parametric assumption. The PPR separate method performs very well, and the generative approaches are almost on par for moderate correlation.
The NN-Olsen surrogate method is the most accurate surrogate regression approach. In general, the separate regression approaches perform better as ρ increases, in particular the GAM separate approach.
lm numerous interactions (Figure 4): The overall tendencies are very similar to those for lm more interactions. The parametric methods are by far the most accurate. Further, ctree is the best generative approach, NN-Olsen surrogate is the best surrogate regression method, and PPR separate is the best separate regression approach.

Generalized Additive Models
In this section, we first investigate situations with a nonlinear relationship between the features and the response, and later also include pairwise nonlinear interaction terms. More specifically, we first gradually progress from the lm no interactions model to a full generalized additive model by applying the nonlinear function cos(x_j) to a subset of the features in x. Then we extend the full generalized additive model by also including pairwise nonlinear interaction terms. We generate the features x^{[i]} as before, but the response value y^{[i]} is now generated according to the setups gam three, gam all, gam more interactions, and gam numerous interactions, with β = {1.0, 0.2, −0.8, 1.0, 0.5, −0.8, 0.6, −0.7, −0.6} and γ = {0.8, −1.0, −2.0, 1.5}, i.e., the same coefficients as in Section 4.1. As the true models contain smooth nonlinear effects and smooth pairwise nonlinear interaction terms, we let the corresponding predictive models be GAMs with splines for the nonlinear terms and tensor product smooths for the nonlinear interaction terms. E.g., for the gam three experiment, the fitted predictive model f uses splines on the first three features while the others are linear. For the gam more interactions experiment, f uses splines on all eight features and tensor product smooths on the two nonlinear interaction terms. We fit the predictive models using the mgcv package with default parameters (Wood 2006a, 2022). In what follows, we provide a short summary of the results for the different simulation setups.

gam three (Figure 5): In contrast to the lm no interactions experiment, we see that the LM separate approach performs much worse than the GAM separate approach, which makes sense as we have moved from a linear to a nonlinear setting. For ρ = 0, we see that ctree and independence are the best approaches. For ρ > 0, the parametric approaches are superior, but the GAM separate approach is not far behind, while the NN-Olsen surrogate method is the best surrogate regression approach.
gam all (Figure 6): The performance of the LM approaches continues to degrade. The separate regression methods get gradually better for higher values of ρ, but the parametric methods are still superior. The generative methods constitute the second-best class for ρ ∈ {0.3, 0.5}, but the GAM separate and PPR separate approaches are relatively close. The latter approaches outperform the generative methods when ρ = 0.9.

gam more interactions (Figure 7): The parametric approaches are superior in all settings. The generative methods perform quite well for ρ < 0.5, but are beaten by the PPR separate method for ρ = 0.9. Note that the GAM separate approach now falls behind the PPR separate approach, as it is not complex enough to model the nonlinear interaction terms. This indicates that complex separate regression approaches are needed to model complex predictive models. Furthermore, the RF surrogate method is on par with or outperforms the NN-based surrogate regression approaches.
gam numerous interactions (Figure 8): We get nearly identical results as in the previous experiment; hence, we do not provide further comments on them.

Computation Time
In this section, we discuss the computation time used by the different methods to estimate the Shapley values, as a proper evaluation of the methods should not only be limited to their accuracy. We report the CPU times to get a fair comparison between the approaches, as some methods are parallelized and would therefore benefit from multiple cores when it comes to elapsed time. The CPU times for the different methods will vary significantly depending on operating system, hardware, and implementation. The times we report here are based on an Intel(R) Core(TM) i5-1038NG7 CPU@2.00GHz with 16GB 3733MHz LPDDR4X RAM running R version 4.2.0 on the macOS Ventura (13.0.1) operating system. Throughout this article, we mean CPU time when we discuss time.
In Table 1, we report the time it took to estimate the Shapley values using the different methods for the gam more interactions experiment with ρ = 0.5, N train = 1000, and N test = 250 in Section 4.2. We split the total time into three time categories: time used training the approaches, time used generating the Monte Carlo samples, and time used predicting the v(S) using Monte Carlo integration (including the calls to f ) or regression. We denote these three categories by: training, generating, and predicting, respectively. The matrix multiplication needed to estimate the Shapley values from the estimated contribution functions is almost instantaneous and is part of the predicting time. Furthermore, creating the augmented training data for the surrogate regression methods in Section 3.6.1 takes around one second and is part of the training time. We see a wide spread in the times, but, in general, the Monte Carlo approaches take on average around half an hour, while the regression methods are either much faster or slower, depending on the approach.
The Monte Carlo methods make a total of N_test · K · (2^M − 2) calls to the predictive model f to explain the N_test test observations with M features and K Monte Carlo samples. In our setting with N_test = 250, M = 8, and K = 250, the predictive model is called N_f = 15 875 000 times; thus, the speed of calculating f greatly affects the explanation time. For example, the GAM model in the gam more interactions experiment is slow to call, as we can see in Table 1, since the predicting time constitutes the majority of the total time. For comparison, N_f calls to the linear model in the lm more interactions experiment take approximately 3 CPU seconds, while the GAMs in the gam three and gam more interactions experiments take roughly 13 and 35 CPU minutes, respectively. For the latter experiment, the PPR and RF models in Section 4.5 take around 0.5 and 40 CPU minutes, respectively.
The training and generating times of the independence approach are higher than expected, but this is because the independence method is implemented as a special case of the empirical approach in version 0.2.0 of the shapr-package. Furthermore, the empirical and ctree approaches have lower predicting time than the other Monte Carlo-based methods due to fewer calls to f since they use weighted Monte Carlo samples; see Sections 3.2 and 3.4.1. The three influential time factors for the Monte Carlo methods are: the training time of the approach (estimating the parameters), the sampling time of the Monte Carlo samples, and the computational cost of calling f .
In contrast, both the separate regression and surrogate regression methods use roughly the same time to estimate the Shapley values for different predictive models f, as f is only called N_train times when creating the training data sets. After that, we train the separate regression and surrogate regression approaches and use them to directly estimate the contribution functions. The influential factors for the regression methods are: the training time of the 2^M − 2 separate models (or the single surrogate model) and the prediction time of calling them a total of N_test(2^M − 2) times. The former is the primary factor, and it is influenced by, e.g., hyperparameter tuning and the training data size. The latter can be a problem for the surrogate regression methods due to the size of the augmented data, as we will see in Section 5.4. The NN approaches are the slowest methods, and the training time is the cause. We can reduce the training time, at the cost of precision, by using default hyperparameter values instead of conducting cross-validation. This would reduce the time by factors of approximately six and nine for the NN-Frye surrogate and NN-Olsen surrogate methods, respectively.
When excluding the time of the training step, which is only done once and can be considered an upfront time cost, it is evident that the regression-based methods produce the Shapley value explanations considerably faster than the Monte Carlo-based methods. For example, consider the most accurate Monte Carlo-based and regression-based methods in the gam more interactions experiment with ρ = 0.5, i.e., the Gaussian and PPR separate methods, respectively. The Gaussian approach uses approximately 37 CPU minutes to explain the 250 predictions, an average of 8.88 CPU seconds per explanation. In contrast, the PPR separate method explains all N_test = 250 predictions in half a second. Thus, the PPR separate method is approximately 4440 times faster than the Gaussian approach per explanation, which is essential for large values of N_test. However, note that this factor is lower for predictive models that are less computationally expensive to call.

Number of Training Observations
We repeated the experiments in Sections 4.1 and 4.2 with N_train ∈ {100, 5000}, and some of them with N_train = 20 000, to investigate whether the MAE-based ordering of the methods depends on the training data size. We obtained nearly identical results, except for three distinctions. First, the independence approach became relatively more accurate compared to the other methods when N_train = 100, and worse when N_train ∈ {5000, 20 000}. This is intuitive, as modeling the data distribution/response is easier when the methods have access to more data. Second, for the simple experiments in Sections 4.1 and 4.2 with N_train ∈ {5000, 20 000}, the GAM separate and PPR separate approaches became even better, but were still beaten by the Gaussian and copula approaches in most experiments. Third, we observed that the MAE had a tendency to decrease when N_train increased. However, we cannot directly compare the MAE scores, as they depend on the fitted predictive model f, which changes when N_train is adjusted.

Other Choices for the Predictive Model
In practice, it might be difficult to identify the pairwise interactions in Section 4.2. Hence, one would potentially fit a model without them. We included them above because we knew the data generating processes and wanted a precise model, but we now pretend otherwise and fit other predictive models. We consider two types of complex black-box predictive models: projection pursuit regression (PPR) and random forest (RF), and we conduct the same type of hyperparameter tuning as for the other experiments. However, we conduct no feature transformations and directly use the original features when fitting the models. These models are less precise than the GAMs in Section 4.2, which have an unfair advantage as they use the true formulas. E.g., for the gam more interactions experiment, the test prediction MSE was 1.32, 3.67, and 7.36 for GAM, PPR, and RF, respectively, where 1 is the theoretical optimum since Var(ε) = 1. We only include the figures for the gam more interactions experiment, as the corresponding figures for the other experiments are almost identical. The results are displayed in Figures 9 and 10 for the PPR and RF models, respectively, and they are quite similar to those obtained for the GAM model in Figure 7. In general, the parametric methods are superior, followed by the generative methods, while the empirical, separate regression, and surrogate regression approaches are worse. However, some separate regression approaches perform much better for high dependence. The independence method performs well when ρ = 0, but gradually degrades as the dependence level increases, as expected. We see that the PPR separate approach performs well for the PPR predictive model, but it is outperformed by the CatBoost separate method for the RF models. These results indicate that, for our experiments, it is beneficial to choose a regression method similar to the predictive model; that is, for a non-smooth model, one should consider using a non-smooth regression method. However, note that the differences in the MAE are minuscule.

Different Data Distribution
In Appendix D.2, we repeat all the experiments described in Sections 4.1 and 4.2 but with multivariate Burr-distributed features instead of Gaussian ones. The Burr distribution allows for heavy-tailed, skewed marginals, and nonlinear dependence. In this case, the parametric Burr approach, which assumes Burr distributed data, not surprisingly, is the most accurate. The Gaussian method, which now incorrectly assumes Gaussian distributed data, performs worse. The VAEAC approach performs very well on the Burr distributed data, which was also observed by Olsen et al. (2022). In general, VAEAC is the second-best approach after Burr. The PPR separate method also performs well, but compared to the Burr and VAEAC approaches, it is less precise in the experiments with nonlinear interaction terms.

Summary of the Experiments
Making the correct (or nearly correct) parametric assumption about the data is advantageous, as the corresponding parametric methods significantly outperform the other approaches in most settings. In general, if the distribution is unknown, the second-best option for low to moderate levels of dependence is the generative methods. The separate regression approaches improve relative to the other methods when the feature dependence increases, and for highly dependent features, the PPR separate approach is a prime choice. Furthermore, the separate regression methods which match the form of f often give more accurate Shapley value estimates. The PPR model in the PPR separate approach is simple to fit but is still very flexible and can, therefore, accurately model complex predictive models. The independence approach is accurate for no (or very low) feature dependence, but it is often the worst approach for high feature dependence. The NN-Olsen surrogate method outperforms the NN-Frye surrogate approach in most settings and is generally the best surrogate regression approach.
We found it (often) necessary to conduct some form of cross-validation to tune (most of) the separate regression and surrogate regression methods to make them more competitive. Using default hyperparameter values usually resulted in less accurate Shapley value explanations; see Appendix D. The hyperparameter tuning can be time-consuming, but it was feasible in our setting with M = 8 features and N_train = 1000 training observations. The regression-based methods use most of their computation time on training, while the predicting step is almost instantaneous for several methods. The opposite holds for the Monte Carlo-based approaches, which are overall slower than most regression-based methods. Hence, we have a trade-off between computation time and Shapley value accuracy in the numerical simulation studies. We did not conduct hyperparameter tuning for the empirical, parametric, and generative methods. Thus, the methods where we conduct hyperparameter tuning have an unfair advantage regarding the precision of the estimated Shapley values.

Real-World Data Experiments
In this section, we fit several predictive models to different real-world data sets from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets.php) and then use Shapley values to explain the models' predictions. The models range from complex statistical models to black-box machine learning methods. We consider four data sets: Abalone, Diabetes, Wine, and Adult. Some illustrative data plots are provided in Appendix E.
For real-world data sets, the true Shapley values are unknown. Hence, we cannot use the MAE evaluation criterion from Section 4 to evaluate and rank the approaches. Instead, we use the MSE_v criterion proposed by Frye et al. (2021) and later used by Olsen et al. (2022). The MSE_v is given by

\[ \text{MSE}_v = \frac{1}{N_S} \sum_{S \in \mathcal{P}^*(M)} \frac{1}{N_\text{test}} \sum_{i=1}^{N_\text{test}} \big( f(x^{[i]}) - \hat{v}_q(S, x^{[i]}) \big)^2, \tag{7} \]

where N_S = |P*(M)| = 2^M − 2 and v̂_q is the estimated contribution function using method q. The motivation behind the MSE_v criterion is that

\[ \mathbb{E}_S \mathbb{E}_x \big[ \big( v_\text{true}(S, x) - \hat{v}_q(S, x) \big)^2 \big] = \mathbb{E}_S \mathbb{E}_x \big[ \big( f(x) - \hat{v}_q(S, x) \big)^2 \big] - \mathbb{E}_S \mathbb{E}_x \big[ \big( f(x) - v_\text{true}(S, x) \big)^2 \big]; \tag{8} \]

see Covert et al. (2020, Appendix A). The first term on the right-hand side of (8) can be estimated by (7), while the second term is a fixed (unknown) constant not influenced by the approach q. Thus, a low value of (7) indicates that the estimated contribution function v̂_q is closer to the true counterpart v_true than a high value. An advantage of the MSE_v criterion is that v_true is not involved. Thus, we can apply it to real-world data sets. However, the criterion has two drawbacks. First, we can only use it to rank the methods and not assess their closeness to the optimum, since the minimum value of the MSE_v criterion is unknown. Second, the criterion evaluates the contribution functions and not the Shapley values. It might be the case that the estimates for v(S) overshoot for some coalitions and undershoot for others, and such errors may cancel each other out in the Shapley value formula in (1). Nevertheless, for the numerical simulation studies in Section 4, we computed both criteria to compare their orderings empirically. We generally observe a relatively linear relationship between the MAE and MSE_v criteria. That is, a method that achieves a low MSE_v score also tends to obtain a low MAE score, and vice versa. To illustrate this tendency, we include Figure 11, where we plot the MSE_v criterion against the MAE criterion for the gam more interactions experiment with Gaussian distributed data and ρ = 0.5. Note that the orderings of the two criteria are not one-to-one, but they give very similar rankings of the methods.
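To make the computation of (7) concrete, the following minimal R sketch evaluates the criterion; the object names are hypothetical, with vhat assumed to be an N_S × N_test matrix holding v̂_q(S, x^[i]) and fx the vector of predictions f(x^[i]).

    # Hypothetical sketch: evaluate the MSE_v criterion in (7).
    # vhat: N_S x N_test matrix with entries vhat_q(S, x^[i]).
    # fx:   length-N_test vector with the predictions f(x^[i]).
    mse_v <- function(vhat, fx) {
      mean(sweep(vhat, 2, fx)^2)  # average squared error over coalitions and test points
    }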
In Table 2, we report the MSE_v scores and CPU times of the different methods for the four data sets. The Abalone, Diabetes, and Wine data sets were run on the same system as specified in Section 4.3, while the Adult data set was run on a shared computer server running Red Hat Enterprise Linux 8.5 with two Intel(R) Xeon(R) Gold 6226R CPUs @ 2.90GHz (16 cores, 32 threads each) and 768GB DDR4 RAM, due to memory constraints on the former system. Thus, one should not compare the CPU times across these systems, but only the CPU times of the different methods within the same experiment. A more detailed decomposition of the CPU times, together with additional methods, is provided in Appendix F.

Abalone
We first consider the classical Abalone data set with mixed features. The data set originates from a study by the Tasmanian Aquaculture and Fisheries Institute (Nash et al. 1994) and has been used in several XAI papers (Vilone et al. 2020; Aas, Nagler, et al. 2021; Frye et al. 2021; Olsen et al. 2022). The data set contains clear nonlinearity and heteroscedasticity among the pairs of features, and there is a significant pairwise correlation between the features, as all continuous features have a pairwise correlation above 0.775. The mean correlation is 0.89, and the maximum is 0.99. Furthermore, all marginals are skewed. We split the 4177 observations into training (75%) and testing (25%) data sets. The goal is to predict the age of the abalone based on M = 8 easily obtainable features: Length, Diameter, Height, WholeWeight, ShuckedWeight, VisceraWeight, ShellWeight, and Sex. All features are continuous except for Sex, which is a three-level categorical feature (infant, female, male). Thus, the empirical and parametric methods are not applicable. To remedy this, we train two PPR models to act as our predictive models: one based on all features (PPR_all) and another based solely on the continuous features (PPR_cont). We chose the PPR model as it outperformed the other prediction models we fitted (GAM, RF, CatBoost). The test MSE increases from 2.04 to 2.07 when excluding Sex. Cross-validation determined that the number of terms in PPR_all and PPR_cont should be 4 and 7, respectively. Table 2 shows that the best approaches for explaining the PPR predictive models are the PPR separate, NN-Olsen surrogate, and VAEAC methods. For the Abalone_cont data set, the PPR separate and NN-Olsen surrogate methods perform equally well and share first place, but both methods are marginally outperformed by the VAEAC approach for the Abalone_all data set. However, both the VAEAC and NN-Olsen surrogate methods are very slow compared to the PPR separate approach. The second-best Monte Carlo-based method for the Abalone_cont data set is the Gaussian copula approach, even though the Abalone data set is far from Gaussian distributed. This is probably because the copula method makes no parametric assumption about the marginal distributions of the data, only about the copula/dependence structure, which makes it more robust than the Gaussian approach.

Diabetes
The Diabetes data set stems from Efron et al. (2004) and contains M = 10 baseline features: Age, Sex, BMI, BP (blood pressure), and six blood serum measurements (S1, S2, S3, S4, S5, S6) obtained from 442 diabetes patients. The response of interest is a quantitative measure of disease progression one year after the baseline. Like Efron et al. (2004), we treat Sex as numerical and standardize all features; hence, we can apply all methods. Many features are strongly correlated, with a mean absolute correlation of 0.35, while the maximum is 0.90. The Age feature is the least correlated with the other features. Most scatter plots and marginal density functions display structures and marginals somewhat similar to the Gaussian distribution, except those related to the S4 feature, which has a multi-modal marginal. We split the data into a training and a test data set at a 75–25 ratio, and we let the predictive model be a principal component regression (PCR) model with six principal components. This model outperformed the linear model and cross-validated random forest, XGBoost, CatBoost, PPR, and NN models in prediction error on the test data. The PCR model is not easily interpretable as it does not directly depend on the features but on their principal components. Table 2 shows that the LM separate, GAM separate, and PPR separate methods obtain the lowest MSE_v scores, with the VAEAC, Gaussian, and copula approaches having nearly as low MSE_v scores. We are not surprised that the latter two methods are competitive due to the Gaussian-like structures in the Diabetes data set. The LM separate method is the fastest approach, with a CPU time of 1.9 seconds.

Red Wine
The Red Wine data set contains information about variants of the Portuguese Vinho Verde wine (Cortez et al. 2009). The response is a quality score between 0 and 10, while the M = 11 continuous features are based on physicochemical tests: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol. For the Red Wine data set, most scatter plots and marginal density functions display structures and marginals far from the Gaussian distribution, as most of the marginals are right-skewed. Many of the features have no to moderate correlation, with a mean absolute correlation of 0.20, while the largest correlation in absolute value is 0.683, between pH and fix acid. The data set contains 1599 wines, and we split it into a training (1349) and a test (250) data set. A cross-validated XGBoost model and a random forest with 500 trees perform equally well on the test data, and we use the latter as the predictive model f. Table 2 shows that the RF separate approach is the best method by far. Next, we have the CatBoost separate, RF surrogate, empirical, and VAEAC methods. The RF surrogate and CatBoost surrogate methods perform well compared to the other surrogate regression methods. The good performance of the non-smooth RF separate and CatBoost separate methods on the non-smooth predictive model f supports our findings from the simulation studies, where we observed that using a separate regression method with the same form as f was beneficial. The generative methods perform better than the GH and copula methods, while the Gaussian method falls behind. This is intuitive, as the Red Wine data set is far from Gaussian distributed.

Adult
The Adult data set is based on the 1994 Census database, and the goal is to predict whether a person makes over $50 000 a year based on M = 14 mixed features: age (cont.), workclass (7 cat.), fnlwgt (cont.), education (16 cat.), education-num (cont.), marital-status (7 cat.), occupation (14 cat.), relationship (6 cat.), race (5 cat.), sex (2 cat.), capital-gain (cont.), capital-loss (cont.), hours-per-week (cont.), and native-country (41 cat.). The pairwise Pearson correlation coefficients for the continuous features are all close to zero, with a mean absolute correlation of 0.06. The data set contains 30 162 individuals, and we split it into a training (30 000) and a test (162) data set. We train a CatBoost model on the training data to predict an individual's probability of making over $50 000 a year and use the test data to compute the evaluation criterion. We used a relatively small test set due to memory constraints, and we chose CatBoost as it outperformed the other prediction models we fitted (LM, GAM, RF, NN). Table 2 shows that the best method is the CatBoost separate approach, while second place is shared by the RF separate and VAEAC methods. Note that the differences in the MSE_v scores are very small. As in the previous experiments, we observe that using a separate regression method with the same form as f is beneficial. The ctree approach supports mixed data, but we deemed it infeasible due to a very long computation time. Furthermore, the surrogate regression methods based on (5) ran out of memory, as X_aug consists of 30 000 × (2^14 − 2) = 491 460 000 training observations. The training time constitutes the majority of the total time for the separate regression and surrogate regression methods, while the predicting step only takes a couple of minutes.

Recommendations
In this section, we propose a list of advice for when to use the different method classes and approaches based on the results of the simulation studies and real-world data experiments. The list is not exhaustive. Hence, it must not be interpreted as definite rules but as guidance and points that should be considered when using conditional Shapley values for model explanation.
1. For data sets with no or minuscule feature dependencies, the independence approach is the simplest method to use.
2. In general, a parametric approach with the correct (or nearly correct) parametric assumption about the data distribution generates the most accurate Shapley values.
• The copula method makes no assumption about the marginals of the data, only about the copula/dependence structure, which makes it a more robust method.
• For features that do not fit the assumed distribution in the parametric approach, one can consider transformations, for example, power transformations, to make the data more Gaussian-like.
• For categorical features, one can use, e.g., encodings or entity embeddings to represent the categorical features as numerical. This is needed, as no directly applicable multivariate distribution exists for mixed data. However, there exist copulas that support mixed data.
• If the parametric methods are not applicable, the next best option is (often) a generative or separate regression method, where all considered approaches support mixed data sets by default.
3. For the separate and surrogate regression methods, using a method with the same form as the predictive model f provides more precise Shapley value estimates.
• For some predictive models, e.g., the linear regression model in Figure 2, we know that the true conditional model is also a linear model. Thus, using a regression method which can model a linear model (e.g., lm, GAM, PPR) produces more accurate Shapley values. However, the form of the true conditional model is usually unknown for most predictive models.
• It is important that the regression method used is flexible enough to properly estimate/model the predictive model f .
• In the numerical simulation studies, the separate regression methods performed relatively better compared to the other method classes for higher feature dependence.
In the real-world experiments, the separate regression methods were also (among) the best approaches on data sets with moderate dependence.
• In general, conducting hyperparameter tuning of the regression methods improves the precision of the produced explanations, but this increases the computation time.
• In the simulation studies, a PPR separate approach with fixed L = |S| (often) provides fast and accurate Shapley value explanations; see Appendix D.
4. The modeling of the conditional distributions p(x_S̄ | x_S) in the Monte Carlo-based methods is independent of the predictive model f.
• For popular data sets, one can fine-tune an empirical, parametric, or generative method and let other researchers reuse the method to estimate Shapley values for their own predictive models.
• If a researcher is to explain several predictive models fitted to the same data, then reusing the generated Monte Carlo samples will save computation time.
5. There is a time-accuracy trade-off between the different method classes and approaches.
• The simplest separate and surrogate regression methods are rapidly trained, while the complex methods are time-consuming. However, this is a one-time upfront cost. In return, all regression-based methods produce the Shapley value explanations almost instantly. Thus, developers can develop the predictive model f simultaneously with a suitable regression-based method and deploy them together.
The user of f will then get predictions and explanations almost instantaneously.
• In contrast, several of the Monte Carlo-based methods are trained considerably faster than many of the regression-based methods but are, in return, substantially slower at producing the Shapley value explanations.
8. For high-dimensional settings, the number of models to fit in the separate regression class is infeasible. Then, the surrogate regression methods and the VAEAC approach with arbitrary conditioning can be useful. However, their accuracy will likely also decrease in higher dimensions. In high-dimensional settings, one can, e.g., group the features into relevant groups or use tractable estimation strategies (Chen, Covert, et al. 2022, Section 5.2) to simplify the Shapley value computations.

Conclusion
In this article, we have discussed a large selection of Monte Carlo integration and regression-based methods used to estimate conditional Shapley values for model explanation. In agreement with the literature (Covert et al. 2021; Chen, Covert, et al. 2022), we have divided the studied methods into six different method classes. For each class, we have given an overview of the idea, reviewed earlier proposed methods within the class, and finally proposed and developed several new approaches for most classes. The existing and novel approaches have been systematically evaluated through a series of simulation studies with increasing complexity, as such an evaluation has until now been lacking in the field of conditional Shapley values (Chen, Covert, et al. 2022).
We also conducted several experiments on real-world data sets from the UCI Machine Learning Repository. The rankings of the method classes and approaches differed slightly between the numerical simulation studies and the real-world experiments. The most accurate Shapley value explanations in the simulation studies were generally produced by a parametric method with a correctly (or nearly correctly) assumed data distribution. This is intuitive, as making a correct parametric assumption is advantageous throughout statistics. However, the true data distribution is seldom known, e.g., for real-world data sets. In the simulation studies with moderate feature dependence levels, the second-best method class was generally the generative class with the ctree and VAEAC methods, which outperformed the independence, empirical, separate regression, and surrogate regression methods. For high feature dependence, the separate regression methods improved relative to the other classes, particularly the PPR separate method. Using a separate regression method with the same form as the predictive model proved beneficial.
In the real-world experiments, the parametric methods fell behind the best approaches, except for the simplest data set with Gaussian-like structures. In general, the best approaches in the real-world experiments belong to the separate regression method class and have the same form as the predictive model. However, the NN-Olsen surrogate method tied the best separate regression method in one experiment, and the VAEAC approach was marginally more precise in another experiment. The second-best method class varied for the different data sets, with all method classes, except the independence and empirical, taking at least one second place each.
In addition to the accuracy of the methods, we also investigated the computation time. The regression-based methods are often slow to train, but they produce the Shapley value explanations almost instantaneously. In contrast, the Monte Carlo-based methods are often faster to train but drastically slower at generating the Shapley value explanations. Finally, we gave some recommendations and considerations for when to use the different method classes and approaches.
In further work, one direction is to extend the investigation into higher dimensions to verify that the tendencies and order of the methods we discovered remain. However, one would then probably need to sample a subset of the coalitions to cope with the exponential complexity of Shapley values. In agreement with Chen, Covert, et al. (2022), one can also try to determine robust architectures, training procedures, and hyperparameter optimization for the generative and surrogate regression methods, investigate how non-optimal approaches change the estimated conditional Shapley values, and finally evaluate bias in estimated conditional Shapley values for data with known conditional distributions.
In Appendix A, we describe implementation details for the methods introduced in Section 3. We provide more details about the parametric methods in Appendix B. In Appendix C, we elaborate on approaches for estimating conditional Shapley value explanations used in the main text and describe other methods. We provide additional simulation studies in Appendix D. In Appendix E, we include plots of some simulated and real-world data sets. We apply additional methods to the real-world experiments and decompose the computation times of the methods in Appendix F. Finally, in Appendix G, we provide a schematic overview of the conditional Shapley value explanation framework and the estimation methods within the explainable artificial intelligence field. In addition, we have explored 884 simulation configurations and include all result figures in the Supplement.

A Implementation Details
In this section, we describe implementation details for the methods introduced in Section 3. We use the R-package shapr (Sellereite and Jullum 2019). The independence, empirical, Gaussian, copula, and ctree methods are implemented in the shapr package, and we use default hyperparameter values. For the other methods, we estimate V and multiply it with R to get the estimated Shapley values. Olsen et al. (2022) implement the VAEAC approach as an add-on to the shapr package, and we use the default architecture and hyperparameters. For the numerical simulation studies in Section 4, we train the VAEAC approach for 200 epochs and use the estimated model parameters at the epoch with the lowest validation error, where 25% of the data constitutes the validation data. For the more complex real-world data distributions in Section 5, the VAEAC approach needs more training epochs to learn to model the data distributions properly. For the Abalone_cont, Abalone_all, Diabetes, Wine, and Adult data sets, we let the number of epochs be 10 000, 40 000, 5000, 10 000, and 200, respectively. Other configurations than the default architecture and hyperparameters might reduce the number of needed training epochs. In Appendices D and F, we provide VAEAC approaches with other numbers of epochs for the numerical simulation studies and the real-world data experiments, respectively.
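As a minimal sketch, with hypothetical objects x_train, y_train, x_test, and model, and assuming the version 0.2.0 interface of shapr, computing Shapley value explanations typically looks like the following; argument names may differ in other versions.

    library(shapr)
    # Set up the explainer on the training data and the fitted model.
    explainer <- shapr(x_train, model)
    # The prediction without any features, here taken as the mean training response.
    p0 <- mean(y_train)
    # Estimate the Shapley values for the test observations with a chosen approach.
    explanation <- explain(x_test, explainer, approach = "gaussian",
                           prediction_zero = p0)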
Throughout the article, if not otherwise specified, we use K = 250 Monte Carlo samples in (3) for the Monte Carlo-based methods, which Olsen et al. (2022) found to be a fair trade-off between accuracy and computation time. However, recall that the empirical and ctree methods (often) use fewer samples and instead weight them, as described in Sections 3.2 and 3.4.1, respectively. The independence approach is implemented as a special case of the empirical approach in version 0.2.0 of the shapr package. Hence, it does not support mixed data. For data sets with categorical features, we have implemented our own independence* approach, which samples directly from the training data.
For the Burr and GH approaches, we estimate the parameters of the distributions by maximizing the likelihood function using the Nelder-Mead optimization routine (Nelder and Mead 1965), with default parameters in the optim function in base R (R Core Team 2020). Tuning the hyperparameters of the optimization algorithm and/or using a more advanced fitting procedure might improve the approaches. We run the optimization procedures until convergence. The numbers of parameters to estimate in the Burr and GH distributions are 2M + 1 and (M + 1)(M + 4)/2, respectively. The optimization of the GH method relies on good starting values, which we get from the ghyp package (Weibel et al. 2022). The ghyp package uses a sophisticated multi-cycle, expectation, conditional estimation (MCECM) algorithm to estimate the parameters for another, more general parameterization of the GH distribution, which lacks closed-form expressions for the conditional distributions.
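As an illustration, a minimal sketch of the Burr fitting step could look as follows, assuming the density form in Appendix B.2 and hypothetical objects x_train (an n × M matrix of positive observations) and theta0 (starting values ordered as κ, b_1, ..., b_M, r_1, ..., r_M):

    # Negative log-likelihood of the M-dimensional Burr density in Appendix B.2.
    neg_loglik_burr <- function(theta, x) {
      M <- ncol(x)
      kappa <- theta[1]
      b <- theta[2:(M + 1)]
      r <- theta[(M + 2):(2 * M + 1)]
      if (kappa <= 0 || any(b <= 0) || any(r <= 0)) return(1e10)  # keep parameters valid
      xb <- sweep(x, 2, b, `^`)                 # x_m^{b_m}
      s  <- 1 + rowSums(sweep(xb, 2, r, `*`))   # 1 + sum_m r_m x_m^{b_m}
      ll <- nrow(x) * (lgamma(kappa + M) - lgamma(kappa) + sum(log(b * r))) +
        sum(sweep(log(x), 2, b - 1, `*`)) -
        (kappa + M) * sum(log(s))
      -ll
    }
    fit <- optim(theta0, neg_loglik_burr, x = x_train, method = "Nelder-Mead",
                 control = list(maxit = 5000))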
For the separate regression methods, we tune (some of) the hyperparameters of the different methods using cross-validation procedures implemented in the packages, by us, or by using the caret package (Kuhn 2022). The LM separate approach was fitted using the lm function in the stats package in base R. We use the mgcv package (Wood 2022), with default parameters, to fit the GAM separate method. Note that in the mgcv package, the smoothing parameters in the penalized regression splines are selected by generalized cross-validation during the fitting procedure. The PPR separate method uses the ppr function in the stats package with default parameters, except the number of terms L, which we determine by cross-validation. The RF separate approach is based on the ranger package (Wright and Ziegler 2017). We use 500 decision trees and the caret package to do cross-validation on mtry, splitrule, and min.node.size, while we use default values for the remaining hyperparameters. Finally, the CatBoost separate method uses the CatBoost algorithm (Prokhorenkova et al. 2018), which is based on gradient-boosted decision trees, with default parameters (most notably, 1000 trees with depth 6). We employ early stopping of the CatBoost method if no improvement of the evaluation metric value was made in 100 iterations. One could employ cross-validation to tune the hyperparameters, but this would increase the computation time drastically as the CatBoost algorithm has many hyperparameters. An alternative is to tune only some of them.
For the surrogate regression methods, we use the same packages as above and tune the same hyperparameters if not otherwise specified. For the RF surrogate method, we reduced the number of trees from 500 to 200 due to high computation time. For the CatBoost surrogate approach, we increase the maximum number of trees to 10 000, but we still employ the same early stopping regime.
Originally, Frye et al. (2021) let the masking value be −1, as they only consider positive data, but this is not applicable for unbounded features. We let the value be −5 in the simulations and real-world experiments in Sections 4 and 5, respectively, which is a value not present in the data sets. For the NN-Frye surrogate approach, we use the same fully connected neural network and carry out the same cross-validation as in Frye et al. (2021). That is, depth = 2, width ∈ {128, 256, 512}, batch_size = 256, and learning rate ∈ {0.001, 0.0001} in the Adam optimizer (Kingma and Ba 2015). We use 75% of the data to train the networks and the remaining observations as validation data. We use the network parameters at the epoch with the lowest validation error as the final model. Frye et al. (2021) use 2000–10 000 training epochs, while we use num_epochs = 3000 to make the method more time-wise competitive and because the validation error attains its minimum long before the last epoch in the simulation studies. An alternative is to let num_epochs be arbitrarily large and stop the training if no improvement has been made to the validation error for a fixed number of epochs, that is, to employ early stopping.
In the NN-Olsen surrogate approach, we use batch normalization layers, ELU activation functions, and skip connections with summation over each layer in the network. We carry out similar hyperparameter tuning as for the NN-Frye surrogate approach. That is, depth = 3, width ∈ {32, 64, 128}, num_epochs = 500, batch_size = 128 (as we duplicate the batch size), and learning rate ∈ {0.01, 0.001, 0.0001} in the Adam optimizer. Instead of specifying num_epochs, another option could have been to train the network until a stopping criterion was met, e.g., no improvement in the validation measure for a specific number of epochs. We observe relatively small differences between the nine different hyperparameter choices, and one could thus potentially reduce the training time by a factor of nine by using width = 64 and learning rate = 0.001 as default values. The same also applies to the NN-Frye surrogate approach. The networks are implemented in torch (Falbel and Luraschi 2022).
In the real-world data experiments in Section 5, we omit the cross-validation of the hyperparameters in the NN surrogate approaches to make them more time-wise competitive. We let learning rate = 0.001 and width = 256 in the NN-Frye surrogate approach, while we use the same learning rate for the NN-Olsen surrogate method but let width = 64. The convergence rates of the networks' validation errors vary across the real-world data experiments. Hence, we use a different num_epochs for each experiment. For the Abalone_cont, Abalone_all, Diabetes, Wine, and Adult data sets, we let num_epochs in the NN-Frye surrogate method be 40 000, 40 000, 10 000, 40 000, and 3000, respectively. The corresponding values for the NN-Olsen surrogate method are 20 000, 10 000, 2500, 10 000, and 500. Other configurations than the architecture and hyperparameters set above might reduce the number of needed learning epochs. In Appendix F, we provide NN surrogate methods with other numbers of epochs, as a higher num_epochs can make the methods more precise but at the cost of increased computation time.

B Additional Information About the Parametric Methods
In this section, we elaborate on the copula approach and give a short introduction to the multivariate Burr and generalized hyperbolic distributions.

B.1 Copulas
An M-dimensional copula is a multivariate distribution function C with uniformly distributed marginals on [0, 1]. Sklar's theorem states that every multivariate distribution F with marginals F_1, F_2, ..., F_M can be written as F(x_1, ..., x_M) = C(F_1(x_1), ..., F_M(x_M)) for some appropriate M-dimensional copula C. In fact, this copula has the expression C(u_1, ..., u_M) = F(F_1^{-1}(u_1), ..., F_M^{-1}(u_M)), where the F_j^{-1}(u_j) are the inverse distribution functions of the marginals. While other copulas may be used, the Gaussian copula has the benefit that we may use the analytical expressions for the conditional distributions of the Gaussian distribution.
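To make the mechanics concrete, a rough sketch of the Gaussian copula idea for continuous data follows; x_train is a hypothetical data matrix, and MASS is assumed available.

    # Map each margin to standard normal scores; the dependence is then
    # captured by the correlation matrix of the scores.
    n <- nrow(x_train)
    z_train <- apply(x_train, 2, function(v) qnorm(rank(v) / (n + 1)))
    R_hat <- cor(z_train)
    # Sample z_Sbar | z_S = z_star with the standard Gaussian conditional formulas.
    cond_sample_z <- function(z_star, S, R_hat, K) {
      Sb  <- setdiff(seq_len(ncol(R_hat)), S)
      A   <- R_hat[Sb, S, drop = FALSE] %*% solve(R_hat[S, S, drop = FALSE])
      mu  <- drop(A %*% z_star)
      Sig <- R_hat[Sb, Sb] - A %*% R_hat[S, Sb, drop = FALSE]
      MASS::mvrnorm(K, mu, Sig)  # K Monte Carlo samples on the z-scale
    }
    # The z-samples are mapped back to the data scale with each margin's
    # empirical quantile function, e.g., quantile(x_train[, j], pnorm(z_j)).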
The Gaussian copula model used by Aas, Jullum, et al. (2021) is very flexible with regard to the marginal distributions, but quite restrictive in the dependence structures it can capture. It can only represent radially symmetric dependence relationships and does not allow for tail dependence (i.e., joint occurrence of extreme events has small probability). One can use other copulas in the copula approach instead. For example, Aas, Nagler, et al. (2021) use vine copulas, more specifically, a particular type of R-vines (regular vines) called D-vines (Kurowicka and Cooke 2005) when they estimate conditional Shapley values. Regular vines do not exclude categorical data, but the methods become more complicated when categorical features are included; hence, Aas, Nagler, et al. (2021) exclude them. Zhao and Udell (2020) propose a semiparametric algorithm to impute missing values for mixed data sets via a Gaussian copula.

B.2 Burr Distribution
The Burr distribution allows for heavy-tailed, skewed marginals and nonlinear dependencies, which can be found in real-world data sets (Takahasi 1965). The density of the M-dimensional Burr distribution is given by

\[ p(x_1, \dots, x_M) = \frac{\Gamma(\kappa + M)}{\Gamma(\kappa)} \, \frac{\prod_{m=1}^{M} b_m r_m x_m^{b_m - 1}}{\big(1 + \sum_{m=1}^{M} r_m x_m^{b_m}\big)^{\kappa + M}}, \quad x_m > 0. \]

The M-dimensional Burr distribution has 2M + 1 parameters, namely, κ, b_1, ..., b_M, and r_1, ..., r_M. Furthermore, the Burr distribution is a compound Weibull distribution with the gamma distribution as compounder (Takahasi 1965), and it can also be seen as a special case of the Pareto IV distribution (Yari and Jafari 2006). Any conditional distribution of the Burr distribution is in itself a Burr distribution (Takahasi 1965). Without loss of generality, assume that the first S < M features are the unobserved features; then the conditional density p(x_1, ..., x_S | x_{S+1} = x*_{S+1}, ..., x_M = x*_M), where x* indicates the conditioned-on values, is an S-dimensional Burr density. The associated parameters are then κ̃, b̃_1, ..., b̃_S, and r̃_1, ..., r̃_S, where κ̃ = κ + M − S, b̃_j = b_j, and

\[ \tilde{r}_j = \frac{r_j}{1 + \sum_{m=S+1}^{M} r_m (x^*_m)^{b_m}}, \quad \text{for all } j = 1, 2, \dots, S. \]
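The conditional parameters translate directly into code; below is a small sketch, where burr_cond_pars is a hypothetical helper, cond_idx indexes the conditioned-on features, and x_cond holds their observed values x*.

    burr_cond_pars <- function(kappa, b, r, cond_idx, x_cond) {
      denom <- 1 + sum(r[cond_idx] * x_cond^b[cond_idx])
      list(kappa = kappa + length(cond_idx),  # kappa_tilde = kappa + (M - S)
           b     = b[-cond_idx],              # b_tilde_j = b_j
           r     = r[-cond_idx] / denom)      # r_tilde_j = r_j / denom
    }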

B.3 Generalized Hyperbolic Distribution
The generalized hyperbolic distribution GH(λ, ω, µ, Σ, β) is parameterized by an index parameter λ, concentration parameter ω, location vector µ, dispersion matrix Σ, and skewness vector β (Browne and McNicholas 2015). A random variable x is GH distributed if it can be represented by x = µ + Wβ + √W U, where W ∼ GIG(λ, ω, ω), U ∼ N(0, Σ), and W is independent of U. GIG is the generalized inverse Gaussian distribution introduced by Good (1953). The density of the M-dimensional GH distribution is given by

\[ f(x) = \left[ \frac{\omega + \delta(x; \mu, \Sigma)}{\omega + \beta^\top \Sigma^{-1} \beta} \right]^{(\lambda - M/2)/2} \frac{K_{\lambda - M/2}\!\left( \sqrt{(\omega + \beta^\top \Sigma^{-1} \beta)\,(\omega + \delta(x; \mu, \Sigma))} \right) \exp\{ (x - \mu)^\top \Sigma^{-1} \beta \}}{(2\pi)^{M/2} \, |\Sigma|^{1/2} \, K_\lambda(\omega)}, \]

where δ(x; µ, Σ) = (x − µ)^⊤ Σ^{−1} (x − µ) is the squared Mahalanobis distance between x and µ, and K_λ is the modified Bessel function of the third kind with index λ. Wei et al. (2019) showed that when x is partitioned as (x_S, x_S̄), the conditional distribution of x_S̄ given x_S = x*_S is also generalized hyperbolic, GH*(λ̃, χ̃, ψ̃, µ̃, Σ̃, β̃), with updated parameters given in Wei et al. (2019). Here the GH* indicates that another parameterization of the GH distribution, proposed by McNeil et al. (2015), is used for the conditional distribution due to technical reasons.
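The mixture representation above suggests a simple way to sample from the GH distribution; the following sketch assumes a GIG sampler such as GIGrvg::rgig and the MASS package are available.

    rGH <- function(n, lambda, omega, mu, Sigma, beta) {
      M <- length(mu)
      W <- GIGrvg::rgig(n, lambda = lambda, chi = omega, psi = omega)  # W ~ GIG(lambda, omega, omega)
      U <- MASS::mvrnorm(n, rep(0, M), Sigma)                          # U ~ N(0, Sigma)
      matrix(mu, n, M, byrow = TRUE) + outer(W, beta) + sqrt(W) * U    # x = mu + W*beta + sqrt(W)*U
    }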

C Additional Approaches
In this section, we provide more information about the methods used in the main text, describe additional approaches we have used, and point out potential methods that can be incorporated into the Shapley value explanation framework in the future.
C.1 The Missingness During Training Procedure Covert et al. (2021, Appendix E.2) and Chen, Covert, et al. (2022, Section 5.1.3) describe a procedure where they directly estimate the conditional expectation by modifying the training process of the predictive model f such that it handles missing features. That is, they train f on a particular objective function such that its predictions of observations with missing features are equivalent to marginalizing out the features using the conditional distribution. However, as we focus on model-agnostic post hoc explanation for arbitrary f and most predictive models do not support missingness, we skipped this procedure in the main part of the article.

C.2 The Generative Method Class
Here we describe alternative versions of the VAEAC approach that we have considered and then list other potential methods in the generative method class.

C.2.1 VAEAC with Response Feature
The VAEAC-f approach includes the predicted response of the predictive model ŷ = f(x) as an additional feature that is always unobserved in the deployment phase. That is, the extended training data takes the form {(x^{[i]}, ŷ^{[i]})}_{i=1}^{N_train}, where ŷ^{[i]} = f(x^{[i]}). This idea was proposed by Ivanov et al. (2019), the creators of the VAEAC methodology, and they argued that this extension could improve the modeling of the data, especially for multi-modal data. When conditioning on x_S, the VAEAC-f approach will generate (x_S̄, ŷ). In the indirect version (VAEAC-f-indir), we discard the generated ŷ and evaluate the model at the Monte Carlo samples as usual, while in the direct version (VAEAC-f-dir), we skip this intermediate step and rather use the generated ŷ directly. This saves time if f is computationally expensive to call. We use the same hyperparameters for these two approaches as for the original VAEAC approach and the estimated model parameters at the epoch with the lowest validation error; see Appendix A. We consider several maximum numbers of epochs and indicate this by including the number in the method name. For example, VAEAC-f-dir-500 means that we trained the VAEAC-f-dir method for 500 epochs.

C.2.2 VAEAC with Paired Sampling
The VAEAC-paired approach is identical to the VAEAC method described in Section 3.4.2, except that we use paired sampling when generating the masks. That is, both S and S̄ are applied to the same observation in the training phase of the VAEAC model.

C.2.3 Potential Generative Methods
We can use various applicable generative methods to generate the conditional Monte Carlo samples and Shapley values, such as non-parametric vine copulas (Aas, Nagler, et al. 2021).
We now provide a non-exhaustive list of other applicable generative methods, which, to the best of our knowledge, have yet to be used in Shapley value estimation. Computing the Monte Carlo samples coincides with the field of multiple imputation of missing values. The methods in this rich field can be categorized into two classes (Zheng and Charoenphakdee 2022). The first class contains the iterative approaches: Multivariate Imputation by Chained Equations (MICE) (Van Buuren and Groothuis-Oudshoorn 2011) and MissForest (Stekhoven and Bühlmann 2011). The second class contains the deep generative models: Multiple Imputation using Denoising Autoencoders (MIDA) (Gondara and Wang 2018), Missing Data Importance-weighted Autoencoder (MIWAE) (Mattei and Frellsen 2019), Generative Adversarial Imputation Nets (GAIN) (Yoon et al. 2018), and Conditional Score-based Diffusion Models for Tabular data (CSDI_T) (Zheng and Charoenphakdee 2022). Other methods are the Arbitrary Conditioning Flow model (ACFlow) (Li et al. 2020), the Neural Conditioner (NC) (Belghazi et al. 2019), Neural Autoregressive Distribution Estimation (NADE) (Uria et al. 2016), and Universal Marginalizers (UM) (Douglas et al. 2017).

C.3 The Separate and Surrogate Method Class
In this section, we describe additional regression-based approaches. Note that all the regression methods can, in theory, be used both in the separate and surrogate regression frameworks. Some of them might however be infeasible for the latter in practice due to memory or time constraints, especially for large training data sets and high dimensions. Note that not all the regression methods minimize the mean squared error loss function.
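To fix ideas, the following minimal sketch outlines the separate regression framework, with hypothetical objects f, x_train, x_test (data frames with named columns), and M, and with lm as a placeholder for any of the regression methods below.

    # For each coalition S, regress the predictions f(x) onto the features
    # in S to estimate v(S, x) = E[f(x) | x_S].
    fx <- predict(f, x_train)  # the response is the model's own predictions
    coalitions <- unlist(lapply(seq_len(M - 1),
                                function(k) combn(M, k, simplify = FALSE)),
                         recursive = FALSE)
    fits <- lapply(coalitions, function(S) {
      lm(y ~ ., data = data.frame(y = fx, x_train[, S, drop = FALSE]))
    })
    # Estimated contribution functions for the test observations:
    vhat <- sapply(seq_along(coalitions),
                   function(j) predict(fits[[j]], newdata = x_test))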

C.3.1 Polynomial Regression
Polynomial regression is an extension of linear regression where we model the relationship as a pth-degree polynomial in each feature. That is, the model takes the following form:

\[ f(x) = \beta_0 + \sum_{m=1}^{M} \sum_{k=1}^{p} \beta_{m,k} \, x_m^k. \]

We estimate the coefficients of the polynomial model using the lm function in base R with formula = "y ∼ poly(X1, deg = p) + ... + poly(XM, deg = p)", where p is the degree. We call the approach Poly-p.
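For instance, the formula string can be built programmatically for any M and p; dat is a hypothetical data frame with columns y, X1, ..., XM.

    p <- 3
    rhs <- paste0("poly(X", 1:M, ", degree = ", p, ")", collapse = " + ")
    fit <- lm(as.formula(paste("y ~", rhs)), data = dat)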

C.3.2 Linear Regression with Interactions
We extend the linear regression model by including interactions between the features. For example, we get the following model when we include first-order interactions:

\[ f(x) = \beta_0 + \sum_{m=1}^{M} \beta_m x_m + \sum_{j<k} \beta_{jk} \, x_j x_k. \]

For second-order interactions, we additionally include the triple products:

\[ f(x) = \beta_0 + \sum_{m=1}^{M} \beta_m x_m + \sum_{j<k} \beta_{jk} \, x_j x_k + \sum_{j<k<l} \beta_{jkl} \, x_j x_k x_l. \]

We estimate the coefficients in the interaction model using the lm function in base R with formula = "y ∼ (.)^{o+1}", where o is the order. We call the approach LM-inter-o.

C.3.3 Polynomial Regression with Interactions
Here we extend the linear regression model with interactions by also allowing for polynomial terms. For example, the polynomial regression model with polynomial degree 2 and interactions of one order lower takes the following form:

\[ f(x) = \beta_0 + \sum_{m=1}^{M} \big( \beta_{m,1} x_m + \beta_{m,2} x_m^2 \big) + \sum_{j<k} \beta_{jk} \, x_j x_k. \]

We estimate the coefficients in the polynomial interaction model using the lm function in base R with formula = "y ∼ poly(X1, ..., XM, deg = d)", where d is the degree. For the surrogate version, we do not include interactions with the binary mask features, as the number of coefficients to be estimated drastically increases. We call the approach Poly-inter-d.

C.3.4 Generalized Additive Models
We also fit GAMs using the gam package (Hastie 2022), which differs slightly from the mgcv package discussed in Appendix A. In the gam package, we can directly specify the degrees of freedom for the splines. We consider three different versions: one with df = 5 (GAM-5), another with df = 10 (GAM-10), and in the last, we conduct cross-validation using the caret package to tune the degrees of freedom (GAM-CV).

C.3.5 Elastic Net Regression
Elastic Net models add regularization to the model coefficients, and the popular Lasso and Ridge regression models are special cases. However, they do not minimize the MSE, but we still include them. The objective function for the Gaussian family is

\[ \min_{\beta_0, \beta} \; \frac{1}{2N} \sum_{i=1}^{N} \big( y_i - \beta_0 - x_i^\top \beta \big)^2 + \lambda \left[ \frac{1 - \alpha}{2} \|\beta\|_2^2 + \alpha \|\beta\|_1 \right], \]

where λ ≥ 0 is a regularization parameter and 0 ≤ α ≤ 1 is a compromise between Ridge (α = 0) and Lasso regression (α = 1). We consider α ∈ {0, 0.5, 1} and call the corresponding methods Ridge, Elastic, and Lasso. We use the glmnet package (Friedman, Hastie, et al. 2010) to fit the models and use the package's cross-validation procedure to tune λ.
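A minimal sketch with the glmnet interface, where X, y, and X_new are assumed objects:

    library(glmnet)
    cv_fit <- cv.glmnet(X, y, alpha = 0.5)  # alpha = 0: Ridge, alpha = 1: Lasso
    yhat <- predict(cv_fit, newx = X_new, s = "lambda.min")  # lambda chosen by CV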

C.3.6 Principal Component Regression
The difference between regular linear regression and principal component regression (PCR) is that the latter regresses the response on the principal components instead of the original features. For more details, see Hastie et al. (2009, pp. 79-80). We use the pls package (Liland et al. 2021) to fit the PCR model and use the package's cross-validation procedure to determine the number of principal components to include in the final model. We call the approach PCR.

C.3.7 Partial Least Squares
The partial least squares (PLS) regression model is similar to PCR, but the PLS also uses the response when constructing the linear combinations of the features for regression. For more details, see Hastie et al. (2009, pp. 80-82). We use the pls package (Liland et al. 2021) to fit the PLS model and use the package's cross-validation procedure to determine the number of components to include in the final regression model. We call the approach PLS.
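Both models follow the same pattern in the pls package; a sketch, where dat and dat_new are assumed data frames with response y:

    library(pls)
    pcr_fit <- pcr(y ~ ., data = dat, validation = "CV")   # principal component regression
    pls_fit <- plsr(y ~ ., data = dat, validation = "CV")  # partial least squares
    ncomp <- selectNcomp(pls_fit, method = "onesigma")     # choose the number of components
    yhat <- predict(pls_fit, newdata = dat_new, ncomp = ncomp)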

C.3.8 Projection Pursuit Regression
Section 3.5.3 and Appendix A describe the PPR model and explain that we use cross-validation to determine the number of terms L in the PPR separate approach. An alternative is the PPR-fixed separate approach, where we let L = |S|. This method is much faster than PPR separate but still competitive; see Appendix D. For larger values of M, letting L = min(|S|, L_max), where L_max is a threshold, will reduce the computation time even more.
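A sketch of the PPR-fixed rule for a single coalition S, where dat_S (the response and the features in S) and L_max are assumptions:

    # One PPR per coalition; the number of ridge terms follows the coalition size.
    fit_S <- ppr(y ~ ., data = dat_S, nterms = min(length(S), L_max))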

C.3.9 Support Vector Machines
Support vector machines (SVMs) used for regression are also known as support vector regression; see Hastie et al. (2009, Ch. 12.3) for an introduction. We use ε-regression and a radial kernel, but one could also consider ν-regression and linear, polynomial, and sigmoid kernels. We use the WeightSVM package (Xu et al. 2021) to fit the SVM, as it supports weighting of the observations, but the more well-known e1071 package (Meyer et al. 2022) could also have been used. We call the approach SVM.

C.3.10 K-Nearest Neighbors
The K-nearest neighbors (KNN) regression model is a memory-based approach that does not require any model to be fit. For a given individual x*, the model finds the K closest observations in the training data and returns the mean response of these observations. We use the kknn package (Schliep and Hechenbichler 2016) to train the KNN model and to conduct hyperparameter tuning. We call the approach KNN.

C.3.11 Single Decision Tree
A regression decision tree partitions the feature space into a set of rectangles and predicts a new observation's response as the mean response of the training observations in the particular partition. CART, C4.5, and CTree are popular methods for tree-based regression, which are very simple to understand yet powerful; see, e.g., Hastie et al. (2009, Ch. 9.2). We use the rpart package (Therneau and Atkinson 2022) to fit a decision tree with complexity parameter 0.001 and then prune the tree afterward. We call the approach Tree.

C.3.12 Random Forest
In the main text, the RF approach was tuned using cross-validation, which leads to a large computation time. Here we propose a default approach called RF-def with 500 trees and default hyperparameter values in the ranger package (Wright and Ziegler 2017). A potential improvement is to vary the number of trees based on the coalition size |S| instead of having a fixed number as, e.g., 500 trees might be excessive when S is a singleton.

C.3.13 Boosting
Both XGBoost (Chen, He, et al. 2015) and CatBoost (Prokhorenkova et al. 2018) are gradient-boosted decision trees, but they differ in that the latter supports categorical data by default, while the former requires the user to do, e.g., one-hot encoding. We include two versions of the XGBoost approach: one where we use default hyperparameter values (XGBoost-def) and one where we tune nrounds, max_depth, eta, gamma, and colsample_bytree using the default grid in the caret package. We call the latter approach XGBoost, which is time-consuming due to the extensive hyperparameter tuning.

C.3.14 Neural Networks and Multilayer Perceptron
In Appendices D and F, we explore NN surrogate methods with the hyperparameters defined in Section 5 but with different maximum numbers of learning epochs. E.g., NN-Olsen-500 surrogate indicates that num_epochs = 500, but we still use the network weights at the epoch with the lowest validation error. We have also implemented a version that employs early stopping, as discussed in Appendix A. More precisely, we stop the training if no improvement has been made to the validation error in 150 epochs. Additionally, this version initiates ten networks to reduce the likelihood of poorly initiated network parameters, as discussed in Appendix F. We train the ten networks for fifteen epochs and continue only with the network with the lowest validation error. We denote these methods by NN-Olsen-ES and NN-Frye-ES.
The multilayer perceptron (MLP) is a fully connected feedforward artificial neural network. We include some small networks to compare these against the large NN-Olsen surrogate and NN-Frye surrogate methods described in Section 3.6.2 and Appendix A. We call the approach, e.g., NN-[u, v, w], which means that the network has three layers where the number of neurons in the layers are u, v, and w, respectively. We use the RSNNS package (Bergmeir and Benítez 2012), with default hyperparameters and 200 epochs.

C.3.15 Potential Methods
Bénard et al. (2022) use a projected random forest to estimate Shapley effects, which is not the same as Shapley values in (1). The projected random forest is a surrogate regression model which provides predictions of the output conditioned on any feature subset. Thus, their procedure can be adapted to estimate conditional Shapley values. Jethani et al. (2021) propose another neural network based procedure that skips the modeling of the data/response altogether by training a complex neural network which takes in the full input feature vector x and directly outputs the Shapley values φ.

C.3.16 Surrogate Regression Methods in High-Dimensions
In high-dimensional settings, to reduce the computational cost, one can consider training the surrogate regression model on a sampled subset of the augmented representations in (5).
In that case, one should ensure that all coalitions are present. Uniform sampling will mostly yield coalitions with approximately half of the features present, as the number of coalitions with |S| entries is given by the binomial coefficient \binom{M}{|S|}. Therefore, Covert et al. (2021) propose to first uniformly sample the coalition size, i.e., |S| ∼ U[1, M − 1], and then sample |S| features with uniform probability. Recall that the number of terms in the Shapley value formula in (1) grows at an exponential rate; hence, in higher dimensions, it is common practice to estimate the Shapley values based on a collection of coalitions sampled with replacement (Lundberg and Lee 2017; Chen, Covert, et al. 2022; Olsen et al. 2022). We can then create X_aug based only on these coalitions. Furthermore, many regression models support weighting of the observations, which we can set to the sampling frequencies of the different coalitions. Olsen et al. (2022) use a similar idea when sampling masks. For regression models that do not support weights, one can duplicate the relevant data, but this naïve approach increases the number of rows in the augmented design matrix.
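A sketch of this two-stage sampling scheme:

    sample_coalition <- function(M) {
      size <- sample(seq_len(M - 1), 1)  # first draw |S| uniformly on {1, ..., M-1}
      sort(sample(M, size))              # then a coalition of that size, uniformly
    }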

D Additional Simulation Studies
In this section, we extend the numerical simulation studies in Section 4 in two directions. First, in Appendix D.1, we include more setups with Gaussian data. Second, in Appendix D.2, we use the multivariate Burr distribution instead of the multivariate Gaussian.
The results for the first two linear setups are very similar to those obtained in Section 4.1. The correct parametric approaches are the most accurate, while the generative method class is generally the second-best class for moderate ρ. However, the separate regression method class is the second-best class for higher values of ρ.
The results of the gam one and gam two experiments are almost identical to those obtained in the lm no interactions experiment, which is unsurprising as the setups are very similar. When we include one nonlinear term, the LM separate is the most accurate, but as expected, this changes when we include more nonlinear terms. In this case, the correct parametric methods are the most accurate, together with the GAM separate method. Again, we see a tendency for the separate regression methods to become more precise for higher dependence levels. The same holds for the parametric methods. The results of the gam five experiment are nearly identical to those described for the gam three experiment in the main text.
For the gam some interactions and gam many interactions experiments, we obtain indistinguishable results. Furthermore, the results also coincide with those we observed in Section 4.2; see Figure 12. Generally, the parametric methods are superior, with the generative methods as a close runner-up, except for ρ = 0.9, where the separate regression methods constitute the second-best method class, in particular, the PPR separate method. Furthermore, using the PPR-fixed separate method introduced in Appendix C.3.8 drastically decreases the computational cost without sacrificing much precision.

D.2 Burr Distributed Experiments
In this section, we repeat the same simulation studies as in Section 4 and Appendix D.1, but we replace the multivariate Gaussian data with multivariate Burr data. The Burr distribution is strictly positive, heavy-tailed, skewed, and has nonlinear dependence and known conditional distributions; see Appendix B.2.
We sample N_train = 1000 training and N_test = 250 test observations from a Burr(κ, b, r) distribution. We let b = {5, 4, 6, 5, 3, 6, 5, 5} and r = {4, 3, 5, 2, 5, 3, 5, 1}, while we vary the scale parameter κ ∈ {0.5, 1.0, 1.5, 2.0, 2.5, 3.0}. Here, a low κ indicates high dependence. The average Pearson correlations for the six values of κ are 0.80, 0.63, 0.46, 0.36, 0.29, and 0.25, respectively. In Figures 17 and 18 in Appendix E, we display plots of the Burr data when κ = 1 and κ = 3, respectively. A larger value of κ makes the Burr distribution more Gaussian-like, while lower values of κ produce more extreme observations due to the heavy right tail of the Burr distribution. We observe that the methods struggle with these extreme observations, and the individual MAE is (often) higher for observations in the outer region of the data distribution. That is, the data contain few similar observations from which the methods can learn the conditional structure. In Figures 13 to 15, we present a selection of the results. The results for the other settings are very similar.
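For reference, Burr samples can be generated via the compound-Weibull representation stated in Appendix B.2; the following is a sketch under that representation, with a gamma mixing variable G.

    rburr <- function(n, kappa, b, r) {
      M <- length(b)
      G <- rgamma(n, shape = kappa)   # gamma compounder
      E <- matrix(rexp(n * M), n, M)  # independent unit exponentials
      xb <- sweep(E, 1, G, `/`)       # E_im / G_i
      xb <- sweep(xb, 2, r, `/`)      # ... / r_m
      sweep(xb, 2, 1 / b, `^`)        # x_im = (E_im / (G_i r_m))^(1/b_m)
    }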
We observe similar results for the Burr data as we did for the Gaussian data. That is, using the correct parametric approach, in this case, the Burr approach, yields the most accurate Shapley value explanations. The GH method also performs well, even though it makes an incorrect parametric assumption, while the Gaussian and copula methods perform relatively worse. The best method outside the parametric method class is generally a VAEAC approach, where the number in the name indicates the maximum number of epochs. However, for the less complex setups without interaction terms, the VAEAC approaches are often outperformed by some of the separate regression methods, particularly the GAM separate and PPR separate approaches. We do not see a systematic benefit of choosing a large value for the maximum number of epochs in the VAEAC approaches. Thus, the differences are likely due to better initialized random weights in the networks, similar to what we discuss for the NN-Olsen surrogate method in Appendix F.1. The VAEAC-f-dir approaches are consistently outperformed by the VAEAC and VAEAC-f-indir methods, and there does not seem to be a systematic winner between the latter two methods. Some of the additional regression-based methods proposed in Appendix C.3 perform relatively well, such as the proposed LM-inter separate and Poly-inter separate methods.
For the surrogate regression approaches, we notice that the LM-inter surrogate and Poly-inter surrogate methods often outperform the complex NN-Frye surrogate and NN-Olsen surrogate approaches. However, for the complex gam more interactions experiment, the NN-Olsen surrogate approach with a high number of epochs is the overall best surrogate regression approach. For the complex setups in Section 4.2 and Appendix D.1 with interactions, we also let the predictive model f be a random forest and a projection pursuit regression model, as done in Section 4.5. This had a minor effect on the overall results. However, some of the non-smooth regression methods, such as CatBoost separate, performed relatively better when f was also non-smooth, just as in the main text. We also looked at N_train ∈ {100, 5000}, and the overall tendencies remained.
In Table 3, we report the CPU times for the approaches used in the gam more interactions experiment with Burr distributed features, κ = 2, N_train = 1000, and N_test = 250. We see similar time tendencies as discussed in Section 4.3, but note the time differences between the default and cross-validated versions of the separate regression methods. For example, the RF separate method takes approximately 84 times longer than the RF-def separate method due to the extra cost of tuning the hyperparameters. However, the RF separate method obtains lower MAE scores; see Figure 15. We observe similar tendencies for the PPR separate and PPR-fixed separate methods. Here, we only report the total time, but the time decomposition is similar to the one in Table 1. That is, the separate regression and surrogate regression methods spend the vast majority of their time on training, while the predicting step only takes a couple of seconds. In contrast, the Monte Carlo-based approaches spend most of their time on the predicting and generating steps, in that order.

E Characteristics of the Data Sets
In this section, we provide pairwise scatter plots, marginal density functions, and pairwise Pearson correlation coefficients (for the continuous features) between the features in some of the data sets we have used in this article. In Figures 16 to 18, we include figures for a few of the simulated Gaussian and Burr distributed data sets from the numerical simulation studies in Section 4 and Appendix D, respectively. Further, plots for the Abalone, Diabetes, and Wine data sets from Section 5 are provided in Figures 19 to 21, respectively. We omit the Adult data set as it is chaotic and difficult to interpret due to the large number of levels in the categorical features. However, the pairwise Pearson correlation coefficients for the continuous features are all close to zero. In Figures 16 to 18, we include plots for the Gaussian distribution with ρ = 0.9 and the Burr distribution with κ ∈ {1, 3}, respectively. For the Gaussian features, the pairwise correlation between features i and j is given by ρ^{|i−j|}, i.e., it decreases from 0.9 to 0.9^7 ≈ 0.48 in the M = 8 setting. For the Burr distribution, the marginals become more right-skewed, and the correlation approximately doubles when we decrease κ from 3 to 1. The average correlations are 0.28 and 0.61 for κ = 3 and κ = 1, respectively.
For the Abalone data set (Figure 19), there is clear nonlinearity and heteroscedasticity among the pairs of features and a significant pairwise correlation between the features. All continuous features have a pairwise correlation above 0.775, or 0.531 when grouped by the categorical feature Sex. There is a clear distinction between infants and females/males. All marginals are right-skewed, except for the features Length and Diameter, which are left-skewed.
The Diabetes data set ( Figure 20) shows a fairly strong correlation between many features. For example, the correlation between S1 and S2 is 0.90. On average, the Age feature is the least correlated feature with the other features. Most scatter plots and marginal distributions display structures and marginals somewhat similar to Gaussian distribution, except the S4 feature, which has a multi-modal marginal.
In contrast, for the (Red) Wine data set (Figure 21), most scatter plots and marginal density functions display structures and marginals far from the Gaussian distribution, as most marginals are right-skewed. The largest correlation in absolute value is 0.683 (between pH and fix acid), while most other pairs of features have no to moderate correlation.

F Real-World Data Experiments: Computation Time
Tables 4 to 8 present the decomposed CPU times for the Abalone_cont, Abalone_all, Diabetes, Wine, and Adult experiments from Section 5, respectively. The tables also include the MSE_v scores for the different methods. The total CPU times are decomposed into the same three categories as in Section 4.3: training, generating, and predicting. Recall that we ran the first four experiments on the Mac system specified in Section 4.3, while the Adult experiment was run on the shared computer server described in Section 5. The CPU times will differ from computer to computer.
Here, we include some methods in addition to those used in Section 5, such as the PCR separate approach (Appendix C.3.6), which performs similarly to the best methods for the Diabetes experiment. Recall that the predictive model f in the Diabetes experiment is a PCR model. Hence, this supports our finding in the main text that we should use a separate regression method with the same form as f for accurate Shapley value estimates. Additionally, we include the RF-def separate method (Appendix C.3.12) with default hyperparameters to illustrate the need for cross-validation to tune the hyperparameters. We also include versions of the VAEAC, NN-Olsen surrogate, and NN-Frye surrogate approaches with default hyperparameters but with different numbers of training epochs. It should also be noted that the PPR-fixed separate method with a fixed number of terms L = |S| produces almost as good results as the cross-validated alternative PPR separate in a fraction of the time.
As in Section 4.3, the training step is the most time-consuming step for the separate regression and surrogate regression methods, while the predicting step often takes only a couple of seconds. For the Abalone and Diabetes data sets, creating the augmented training data set takes approximately 2 seconds, while it takes approximately 14 seconds for the Wine data set. The time increase is due to a larger M (compared to Abalone) and a larger N_train (compared to Diabetes). In comparison, the Monte Carlo-based methods use most of their time generating the Monte Carlo samples in the Abalone and Diabetes experiments. This contrasts with the timings in Table 1, where the predicting step was the slowest. The time difference is caused by the GAM model in the gam more interactions experiment being more computationally expensive to call than the PPR and PCR models in the Abalone and Diabetes experiments, respectively. For the Wine experiment, the predicting step is the most expensive, which is caused by the RF model being more computationally costly to call, and by there being more calls due to a larger M (compared to Abalone) and N_test (compared to Diabetes).
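To indicate why the augmented training data set grows with M and N_train, the sketch below enumerates all 2^M − 2 non-trivial coalitions and masks the features outside each coalition. Masking with zero and appending a binary mask indicator are simplifying assumptions for illustration, not the exact encoding used by the surrogate regression methods.

```r
# Hedged sketch: augment each of the N_train rows of X with every non-trivial
# coalition S, giving N_train * (2^M - 2) rows in total.
augment_data <- function(X, mask_value = 0) {
  M <- ncol(X)
  coalitions <- as.matrix(expand.grid(rep(list(c(FALSE, TRUE)), M)))
  coalitions <- coalitions[rowSums(coalitions) %in% seq_len(M - 1), , drop = FALSE]
  do.call(rbind, lapply(seq_len(nrow(coalitions)), function(i) {
    S <- coalitions[i, ]
    X_masked <- X
    X_masked[, !S] <- mask_value       # mask the features outside the coalition S
    cbind(X_masked, matrix(as.numeric(S), nrow(X), M, byrow = TRUE))  # append mask
  }))
}

X <- matrix(rnorm(100 * 4), 100, 4)  # N_train = 100, M = 4
dim(augment_data(X))                 # 1400 x 8, i.e., 100 * (2^4 - 2) rows
```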
When excluding the training time, which is only spent once and can be considered an upfront cost, it is evident that the regression-based methods are superior with respect to computation time. For example, consider the best Monte Carlo-based and regression-based methods for the Abalone cont experiment, i.e., the VAEAC-10000 and PPR separate methods, respectively. The VAEAC-10000 approach uses 551.9 seconds to explain 1044 predictions, an average of 0.53 seconds per explanation. In contrast, the PPR separate method explains all 1044 predictions in 0.5 seconds. Thus, there is a speed difference of a factor of approximately 1104, which is essential when the number of predictions to explain is large.
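The quoted factor follows directly from the reported timings:

```r
vaeac_total <- 551.9; ppr_total <- 0.5; n_explain <- 1044
vaeac_total / n_explain  # ~0.53 seconds per explanation for VAEAC-10000
ppr_total / n_explain    # ~0.0005 seconds per explanation for PPR separate
vaeac_total / ppr_total  # speed factor of ~1104
```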
If training time is not a limiting factor, we can use more time to train the NN-Olsen surrogate and NN-Frye surrogate methods, as these methods are slow to train but fast in the predicting step. In contrast to the numerical simulation studies in Section 4, the validation errors for some of the complex real-world experiments were still decreasing for these approaches, indicating that more training would be beneficial. Table 4 shows that the MSE_v scores for the different versions of the NN-Olsen surrogate approach decrease when we increase the number of epochs, leading it to share first place with the PPR separate approach. However, recall that we use the network at the epoch with the lowest validation error, which was the 6457th epoch for the NN-Olsen-20000 surrogate approach; this best-epoch checkpointing is sketched below. This means that the increased performance was not due to the additional number of training epochs, but rather to better random initialization values, which caused the network parameters to converge to a better local optimum. The same tendency also holds for the other real-world experiments. For example, there is essentially no difference in the MSE_v scores of the NN-Olsen-500 surrogate and NN-Olsen-10000 surrogate methods for the Diabetes data set in Table 6. Furthermore, the validation data is randomly extracted and removed from the training data for each NN surrogate method. Thus, it might be that for some NN surrogate methods and data sets, we were unlucky in that the training and validation data were not representative of the test data. This is more likely to happen for small data sets.

Table 6 shows that the RF-def separate approach with default hyperparameters provides an MSE_v score almost as low as that of the cross-validated counterpart RF separate, whose training time is approximately 140 times longer. For the surrogate regression version, we see that the RF-def surrogate method outperforms the RF surrogate method, even though the default hyperparameters are an option in the cross-validation procedure.
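The best-epoch checkpointing mentioned above can be sketched as follows with the torch R package; the toy data, architecture, and epoch budget are placeholders, not the networks used in our experiments.

```r
# Minimal sketch: keep the parameters from the epoch with the lowest
# validation error (toy data and architecture for illustration only).
library(torch)
x_tr <- torch_randn(200, 8); y_tr <- torch_randn(200, 1)  # training data
x_va <- torch_randn(50, 8);  y_va <- torch_randn(50, 1)   # validation data
model <- nn_sequential(nn_linear(8, 64), nn_relu(), nn_linear(64, 1))
opt <- optim_adam(model$parameters, lr = 0.001)
best <- list(val = Inf, epoch = 0, state = NULL)
for (epoch in 1:500) {
  model$train(); opt$zero_grad()
  loss <- nnf_mse_loss(model(x_tr), y_tr)
  loss$backward(); opt$step()
  model$eval()
  val <- with_no_grad(nnf_mse_loss(model(x_va), y_va)$item())
  if (val < best$val) {  # record the best-validation epoch and its parameters
    best <- list(val = val, epoch = epoch,
                 state = lapply(model$state_dict(), function(p) p$clone()))
  }
}
model$load_state_dict(best$state)  # revert to the best-validation epoch
```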

F.1 Analysis of the NN-Olsen surrogate method for the Abalone Data Set
In this section, we take a closer look at the effect of initialization values and hyperparameters on the NN-Olsen surrogate method for the two experiments on the Abalone data set.
For the Abalone all experiment, we fitted ten versions of the NN-Olsen-10000 surrogate method with different initialization seeds. These networks were trained on the shared computer server described in Section 5 and had an average training time of 14:02:46.6. In comparison, the training CPU time for the NN-Olsen-10000 surrogate method in Table 5 was almost 3.5 times higher when trained on the MAC system described in Section 4.3. The average MSE_v was 1.214, with a standard deviation of 0.0085. Furthermore, the best epoch was, on average, the 5509th, with a standard deviation of 1660. The large standard deviation means there is large variability in when the networks reach their minimum validation error. Seven versions reached their minimum before epoch 5000, while the slowest reached its minimum at the 8920th epoch. This means that if num_epochs = 5000, the performance of seven of the networks would not be influenced by the reduced number of epochs. In contrast, the precision of the three remaining versions would decrease. This highlights the need for good network initialization values when num_epochs is limited, or for choosing a large value of num_epochs.
We also wanted to investigate the effect of the hyperparameters on the NN-Olsen-10000 surrogate method. We considered the same hyperparameter grid as in Appendix A, i.e., lr ∈ {0.01, 0.001, 0.0001} and width ∈ {32, 64, 128}, and we call the corresponding method NN-Olsen-CV-10000 surrogate. We fit ten versions of the NN-Olsen-CV-10000 surrogate method with differently initialized network weights. The average best epoch was the 7706th, with a standard deviation of 2496, while the average MSE_v was 1.208, with a standard deviation of 0.0236. This score is nearly identical to the average score we obtained for the NN-Olsen-10000 surrogate method above with default hyperparameters: lr = 0.001 and width = 64. The default hyperparameters were chosen two times, but lr = 0.0001 and width = 128 was the best combination five times. For these five repetitions, we obtained an average MSE_v of 1.198, with a standard deviation of 0.0053. However, the average best epoch was then the 9456th, with a standard deviation of 473, which is close to the maximum number of epochs. Thus, we might see further improvements for this hyperparameter combination by increasing num_epochs. We ran one network with num_epochs = 20 000, which obtained its best validation score after 15 518 epochs. The method's training time was 1:08:07:48.1, and it obtained an MSE_v score of 1.214, which is at the same level as the previous versions.
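The cross-validation protocol above can be summarized by the following hedged sketch; fit_and_score() is a hypothetical stand-in for training the surrogate network and returning its minimum validation error, here replaced by a dummy score so the snippet runs.

```r
# Grid from Appendix A and a repeated-seed loop; fit_and_score() is a stub.
grid <- expand.grid(lr = c(0.01, 0.001, 0.0001), width = c(32, 64, 128))
fit_and_score <- function(lr, width, seed) {
  # hypothetical stand-in: train with (lr, width, seed), return validation MSE
  rnorm(1)^2  # placeholder score
}
best_per_seed <- lapply(1:10, function(seed) {
  set.seed(seed)
  scores <- mapply(fit_and_score, grid$lr, grid$width, MoreArgs = list(seed = seed))
  grid[which.min(scores), ]  # best (lr, width) combination for this seed
})
```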
We repeated the investigations for the Abalone cont experiment. For the ten versions of the NN-Olsen-10000 surrogate approach, we obtain an average MSE_v score of 1.178, with a standard deviation of 0.0068. This score is lower than the one reported in Tables 2 and 4. Thus, it is likely that the version reported there had poorly initialized network parameters, or that the training and validation data sets were not representative. The average best epoch was the 5410th, with a standard deviation of 1842, and the average training time was 10:45:13.1. The best of the ten versions obtained an MSE_v score of 1.167, beating all other methods. We also fitted ten versions of the NN-Olsen-CV-10000 surrogate method. They obtained an average MSE_v score of 1.177, with a standard deviation of 0.0067. That is, there is minimal improvement in conducting cross-validation. The default hyperparameters were never the best hyperparameter combination. The lr = 0.0001 and width = 128 combination was the best five times, while lr = 0.001 and width = 128 was the best four times. For the former combination, the average MSE_v score is 1.171, with a standard deviation of 0.0015. The average best number of epochs is 9253, meaning that we should consider increasing num_epochs. We ran one network with num_epochs = 20 000, which obtained its best validation score after 11 523 epochs. The method's training time was 1:02:02:24.5, and it obtained an MSE_v score of 1.169, equal to the two best methods: NN-Olsen-20000 surrogate and PPR separate. However, the latter is approximately 700 times faster to train.

Several methods exist to stabilize and robustify neural networks: we can regularize the network parameters, apply drop-out during training, or create an ensemble model of several networks. One can also initialize several networks and only continue to train the best-performing one after a fixed number of epochs. The latter is done in the NN-Frye-ES and NN-Olsen-ES surrogate methods, but we do not see a systematic improvement. In the R package torch (Falbel and Luraschi 2022), the weights and biases in each layer are uniformly initialized from U(−1/√N_in, 1/√N_in), where N_in is the number of inputs to the linear layer. Other initialization schemes exist, such as Xavier initialization (Glorot and Bengio 2010) and Kaiming initialization (He et al. 2015), where the latter considers the rectifier nonlinearities to initialize the network parameters robustly. Discovering better procedures and ensuring representative training, validation, and test data should be a focus and is mentioned as further work in the conclusion in Section 7.

G Schematic Overview of the Methods
Figure 22 provides a schematic overview of this article's method classes and methods for computing conditional Shapley value explanations. Furthermore, the figure also shows conditional Shapley values' place within the explainable artificial intelligence field as a model-agnostic explanation framework with local explanations. Note that the ellipses represent the additional methods introduced in Appendix C. Furthermore, LIME is an explanation framework developed by Ribeiro et al. (2016), and we refer to, e.g., Molnar (2022, Section 9.3) and Redelmeier, Jullum, Aas, and Løland (2021) for more on counterfactual explanations. In this article, we used Shapley values to provide local explanations for models fitted to tabular data, but different Shapley value-based frameworks have been developed for other settings. For example, Lundberg, Erion, Chen, et al. (2020) develop model-specific local and global Shapley value explanations for tree ensemble models.
Covert et al. (2020) introduce a model-agnostic global explanation framework based on Shapley values. Krzyziński et al. (2023) propose time-dependent Shapley value-based explanations for machine learning survival models. Bento et al. (2021) extend Shapley values to the sequential domain and introduce a model-agnostic Shapley value explanation framework for sequential decision-making models, such as recurrent neural networks. Chen, Lundberg, et al. (2022) explain a series of models by propagating Shapley values. Jethani et al. (2021) use Shapley values to explain classifications made by image classifiers. Mastropietro et al. (2022) and Duval and Malliaros (2021) introduce new methodologies based on Shapley values to explain graph neural network models. Wang et al. (2021) include knowledge about a causal graph between the features when creating Shapley value-based explanations. Heskes et al. (2020) also introduce a causal Shapley value methodology that exploits causal knowledge.