
1 Introduction

Machine learning (ML) models have gained widespread adoption, demonstrating their ability to model complex dependencies and make accurate predictions [32]. Besides accurate predictions, practitioners and scientists are often equally interested in understanding the data-generating process (DGP) to gain insights into the underlying relationships and mechanisms that drive the observed phenomena [53]. Since analytical information about the DGP is mostly unavailable, one option is to analyze a predictive model as a surrogate. Although this approach has potential pitfalls, it can serve as a viable alternative for gaining insights into the inherent patterns and relationships within the observed data, particularly when the generalization error of the ML model is small [43]. Regrettably, the complex and often non-linear nature of many ML models renders them opaque, which presents a significant challenge to understanding them.

A broad range of interpretable ML (IML) methods has been proposed over the last decades [11, 25]. These include local techniques that explain only one specific prediction as well as global techniques that aim to explain the whole ML model or the DGP; model-specific techniques that require access to model internals (e.g., gradients) as well as model-agnostic techniques that can be applied to any model; and feature effects methods, which reflect the change in the prediction depending on the value of the feature of interest (FOI), as well as feature importance (FI) methods, which assign an importance value to each feature depending on its influence on the prediction performance. We argue that in many scenarios, analysts are interested in reliable statistical, population-level inference regarding the underlying DGP [41, 60], instead of “simply” explaining the model’s internal mechanisms or heuristic computations whose exact meaning regarding the DGP is at the very least unclear or not explicitly stated at all. If an IML technique is used for such a purpose, it should ideally be clear what property of the DGP it estimates and, since we nearly always compute on stochastic and finite data, how variance and uncertainty are handled. The relevance of IML in the context of scientific inference has been recognized in general [53] as well as in specific subfields, e.g., in medicine [8] or law [15]. Krishna et al. [34] illustrate the disorientation of practitioners when choosing an IML method: in their study, practitioners from both industry and science were asked to choose between different IML methods and explain their choices. The participants predominantly based their choice on superficial criteria such as publication year or whether the method’s outputs aligned with their prior intuition, highlighting the absence of clear guidelines and selection criteria for IML techniques.

Motivating Example. The well-known “bike sharing” data set [17] includes 731 observations and 12 features corresponding to, e.g., weather, temperature, wind speed, season, and day of the week. Suppose a data scientist is not only interested in achieving accurate predictions of the number of bike rentals per day but also in learning about the DGP to identify how the features are associated with the target. She trains a default random forest (RF, test-RMSE: 623, test-\(R^2\): 0.90) and, for analyzing the DGP, decides to use two FI methods: permutation feature importance (PFI) and leave-one-covariate-out (LOCO) with L2 loss (details on these follow in Sects. 5 and 7). Unfortunately, she obtains somewhat contradictory results, shown in Fig. 1: both methods agree that temperature (temp), season (season), the number of days elapsed since the start of data collection in 2011 (days_since_2011), and humidity (hum) belong to the six most important features, but the rankings of these features differ between the methods. She is unsure which feature in the DGP is the most important one, what the disagreement of the FI methods means, and, most importantly, what she can confidently infer from the results about the underlying DGP. We will address her questions in the following sections.

Fig. 1. Six most important features following (a) PFI and (b) LOCO.
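For readers who want to reproduce this kind of comparison, the following Python sketch contrasts PFI and LOCO for a random forest. It is our own minimal rendering, not the original analysis: it assumes the raw UCI day.csv file of the bike sharing data set [17] (with its original column names and target cnt) and omits the feature engineering, e.g., days_since_2011, behind Fig. 1.

```python
# Sketch, not the original analysis: PFI vs. LOCO on the bike sharing data.
# Assumes the raw UCI "day.csv" file; column names are those of that file.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

bike = pd.read_csv("day.csv")  # hypothetical local path
X = bike.drop(columns=["instant", "dteday", "casual", "registered", "cnt"])
y = bike["cnt"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)

# PFI (Sect. 5): permute one feature at a time in the test data and
# measure the increase in loss (Eq. 1).
pfi = permutation_importance(rf, X_te, y_te, n_repeats=50,
                             scoring="neg_mean_squared_error", random_state=0)

# LOCO (Sect. 7): refit without the feature and compare test risks (Eq. 16).
full_risk = mean_squared_error(y_te, rf.predict(X_te))
loco = {}
for col in X.columns:
    rf_red = RandomForestRegressor(random_state=0).fit(X_tr.drop(columns=col), y_tr)
    loco[col] = mean_squared_error(y_te, rf_red.predict(X_te.drop(columns=col))) - full_risk
```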

Contributions and Outline. This paper assesses the usefulness of several FI methods for gaining insight into associations between features and the prediction target in the DGP. Our work is the first concrete and encompassing guide for global, loss-based, model-agnostic FI methods, directed toward researchers who aim to make informed decisions on the choice of FI methods for analyzing (in)dependence relations in the data. The literature review in Sect. 3 highlights the current state of the art and identifies a notable absence of guidelines. Section 4 defines the types of feature-target associations within the DGP that shall be analyzed with the FI methods. In Sect. 5, we discuss methods that remove features by perturbing them; in Sect. 6, methods that remove features by marginalizing them out; and in Sect. 7, methods that remove features by refitting the model without the respective features. In each of the three sections, we first briefly introduce the FI methods, followed by an interpretation guideline according to the association types introduced in Sect. 4. At the end of each section, our results are stated mathematically, with some proofs provided in Appendix A. We return to our motivating example and additionally illustrate our theoretical results in a simulation study in Sect. 8, and we formulate recommendations and practical advice in Sect. 9. We mainly analyze the estimands of the considered FI methods, but it should be noted that the interpretation of the estimates comes with additional challenges. Hence, we briefly discuss approaches to measure and handle their uncertainty in Sect. 10 and conclude in Sect. 11 with open challenges.

2 General Notation

Let \(\mathcal {D} = \left( \left( \textbf{x}^{(1)}, y^{(1)}\right) , \ldots , \left( \textbf{x}^{(n)}, y^{(n)}\right) \right) \) be a data set of n observations sampled i.i.d. from a p-dimensional feature space \(\mathcal {X}= \mathcal {X}_{1} \times \ldots \times \mathcal {X}_{p}\) and a target space \(\mathcal {Y}\). The set of all features is denoted by \(P = \{1,\ldots ,p\}\). The realized feature vectors are \(\textbf{x}^{(i)}= (x^{(i)}_1, \ldots , x^{(i)}_p)^\top \), \(i \in \{1, \ldots , n\}\), and \(\textbf{y}= \left( y^{(1)}, \ldots , y^{(n)}\right) ^\top \) are the realized labels. The associated random variables are \(X=(X_1, \ldots , X_p)^\top \) and Y, respectively. Marginal random variables for a subset of features \(S \subseteq P\) are denoted by \(X_S\). The complement of S is denoted by \(-S = P \setminus S\). Single features and their complements are denoted by j and \(-j\), respectively. Probability distributions are denoted by F, e.g., \(F_Y\) is the marginal distribution of Y. If two random vectors, e.g., feature sets \(X_J\) and \(X_K\), are unconditionally independent, we write \(X_J \perp X_K\); if they are unconditionally dependent, which we also call unconditionally associated, we write \(X_J \not\perp X_K\).

We assume an underlying true functional relationship \(f_{\text {true}}: \mathcal {X}\rightarrow \mathcal {Y}\) that implicitly defines the DGP by \(Y = f_{\text {true}}(X) + \epsilon \). It is approximated by an ML model \(\hat{f}: \mathcal {X}\rightarrow \mathbb {R}^g\), estimated on training data \(\mathcal {D}\). In the case of a regression model, \(\mathcal {Y}= \mathbb {R}\) and \(g = 1\). If \(\hat{f}\) represents a classification model, g is greater than or equal to 1: for binary classification (e.g., \(\mathcal {Y}= \{0,1\}\)), g is 1; for multi-class classification, g is the number of decision values or probabilities, one for each possible outcome class. The ML model \(\hat{f}\) is determined by the so-called learner or inducer \(\mathcal {I}: (\mathcal {D}, \lambda ) \mapsto \hat{f}\) that uses hyperparameters \(\lambda \) to map a data set \(\mathcal {D}\) to a model \(\hat{f}\in \mathcal {H}\) in the hypothesis space \(\mathcal {H}\). Given a loss function \(L: \mathcal {Y}\times \mathbb {R}^g \rightarrow \mathbb {R}_{0}^+\), the risk of a model \(\hat{f}\) is defined as the expected loss \(\mathcal {R}(\hat{f}) = \mathbb {E}[L(Y,\hat{f}(X))]\).

3 Related Work

Several papers provide a general overview of existing IML methods [11, 12, 25, 26], but they all have a very broad scope and do not discuss scientific inference. Freiesleben et al. [19] propose a general procedure to design interpretations for scientific inference and provide a broad overview of suitable methods; in contrast, we provide concrete interpretation rules for FI methods. Hooker et al. [30] analyze FI methods based on the reduction in prediction performance when the FOI is unknown. We examine FI techniques and provide recommendations depending on different types of feature-target associations.

This paper builds on a range of work that assesses how FI methods can be interpreted: Strobl et al. [56] extended PFI [7] for random forests by using the conditional distribution instead of the marginal distribution when permuting the FOI, resulting in the conditional feature importance (CFI); Molnar et al. [42] modified CFI to a model-agnostic version where the dependence structure is estimated by trees; König et al. [33] generalize PFI and CFI to a more general family of FI techniques called relative feature importance (RFI) and assess what insight into the dependence structure of the data they provide; Covert et al. [10] derive theoretical links between Shapley additive global importance (SAGE) values and properties of the DGP; Watson and Wright [58] propose a CFI based conditional independence test; Lei et al. [35] introduce LOCO and are among the first to base FI on hypothesis testing; Williamson et al. [60] present a framework for loss-based FI methods based on model refits, including hypothesis testing; and Au et al. [4] focus on FI methods for groups of features instead of individual features, such as leave-one-group-out importance (LOGO).

In addition to the interpretation methods discussed in this paper, other FI approaches exist. Another branch of IML deals with variance-based FI methods, which target the FI of an ML model and not necessarily the DGP, as they only use the prediction function of the model without considering the ground truth. For example, the feature importance ranking measure (FIRM) [63] uses a feature effect function and defines its standard deviation as the importance value. A similar method [23] uses the standard deviation of the partial dependence (PD) function [20] as an FI measure. The Sobol index [55] is a more general variance-based method based on a decomposition of the prediction function into main effects and higher-order effects (i.e., interactions), estimating the variance of each component to quantify its importance [45]. Lundberg et al. [38] introduced the SHAP summary plot as a global FI measure based on aggregating local SHAP values [39], which are defined only with respect to the prediction function, without considering the ground truth.

4 Feature-Target Associations

When analyzing the FI methods, we focus on whether they provide insight into (conditional) (in)dependencies between a feature \(X_j\) and the prediction target Y. More specifically, we are interested in understanding whether they provide insight into the following relations:

(A1): Unconditional association (\(X_j \not\perp Y\)).

(A2): Conditional association ...

(A2a): ... given all remaining features \(X_{-j}\) (\(X_j \not\perp Y \,|\, X_{-j}\)).

(A2b): ... given any user-specified set \(X_G\), \(G \subset P\backslash \{j\}\) (\(X_j \not\perp Y \,|\, X_G\)).

An unconditional association (A1) indicates that a feature \(X_j\) provides information about Y, i.e., knowing the feature on its own allows us to predict Y better; if \(X_j\) and Y are independent, this is not the case. On the other hand, a conditional association (A2) with respect to (w.r.t.) a set \(S \subseteq P\backslash \{j\}\) indicates that \(X_j\) provides information about Y even if we already know \(X_S\). When analyzing the suitability of the FI methods to gain insight into (A1) to (A2b), it is important to consider that no single FI score can simultaneously provide insight into more than one type of association. In supervised ML, we are often interested in the conditional association between \(X_j\) and Y given \(X_{-j}\) (A2a), i.e., whether \(X_j\) helps to predict Y better if we already have access to all other features.

For example, given measurements of several biomarkers and a disease outcome, a doctor may not only be interested in a well-performing black-box prediction model based on all biomarkers but also in understanding which biomarkers are associated with the disease (A1). Furthermore, the doctor may want to understand whether measuring a biomarker is strictly necessary for achieving optimal predictive performance (A2a) and to understand whether a set of other biomarkers G can replace the respective biomarker (A2b).

Example 1 shows that conditional association does not imply unconditional association ((A2) \(\not \Rightarrow \) (A1)). Additionally, unconditional association does not imply conditional association, as Example 2 demonstrates ((A1) \(\not \Rightarrow \) (A2)).

Example 1

Let \(X_1, X_2 \sim Bern(0.5)\) be independent features and \(Y := X_1 \oplus X_2\) (where \(\oplus \) is the XOR operation). Then, all three variables are pairwise independent, but \(X_1\) and \(X_2\) together allow us to predict Y perfectly.

Example 2

Let \(Y := X_1\) with \(X_1 \sim N(0,1)\) and \(X_2 := X_1 + \epsilon _2\) with \(\epsilon _2 \sim N(0, 0.1)\). Although \(X_2\) provides information about Y, all of this information is also contained in \(X_1\). Thus, \(X_2\) is unconditionally associated with Y but conditionally independent from Y given \(X_1\).
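Both examples are easy to verify numerically. The following minimal simulation (sample size and seed are our own choices; \(N(\mu , v)\) is read as mean and variance) illustrates them; note that for the binary variables in Example 1, zero correlation is equivalent to independence.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Example 1 (XOR): pairwise independent, but jointly predictive.
x1 = rng.integers(0, 2, n)
x2 = rng.integers(0, 2, n)
y = x1 ^ x2
print(np.corrcoef(x1, y)[0, 1])   # ~0: X1 alone is uninformative about Y
print(np.mean(y == (x1 ^ x2)))    # 1.0: X1 and X2 together determine Y

# Example 2: X2 is unconditionally associated with Y ...
x1 = rng.normal(size=n)
y = x1.copy()                     # Y := X1
x2 = x1 + rng.normal(scale=np.sqrt(0.1), size=n)
print(np.corrcoef(x2, y)[0, 1])   # ~0.95
# ... but conditionally independent of Y given X1: what X2 adds
# beyond X1 is pure noise, unrelated to Y.
print(np.corrcoef(x2 - x1, y)[0, 1])  # ~0
```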

Furthermore, conditional (in)dependence w.r.t. one feature set does not imply (in)dependence w.r.t. another, e.g., (A2a) \(\not \Leftrightarrow \) (A2b). This can be demonstrated by adding unrelated features to the DGPs and conditioning sets of Examples 1 and 2.

5 Methods Based on Univariate Perturbations

Methods based on univariate perturbations quantify the importance of a feature of interest (FOI) by comparing the model’s performance before and after replacing the FOI \(X_j\) with a perturbed version \(\tilde{X}_j\) (in practice, obtained by permuting the observed values):

$$\begin{aligned} \text {FI}_j = \mathbb {E}\left[ L\left( Y, \hat{f}(\tilde{X}_{j}, X_{-j})\right) \right] - \mathbb {E}\left[ L\left( Y, \hat{f}(X)\right) \right] . \end{aligned}$$
(1)

The idea behind this approach is that if perturbing the feature increases the prediction error, the feature should be important for Y. Below, we discuss the three methods PFI (Sect. 5.1), CFI (Sect. 5.2), and RFI (Sect. 5.3), which differ in their perturbation scheme: the perturbation in PFI [7, 18] preserves the feature’s marginal distribution while destroying all dependencies with the other features \(X_{-j}\) and the target Y, i.e.,

$$\begin{aligned} \tilde{X}_j \sim F_{X_j} \quad \text {with} \quad \tilde{X}_j \perp (X_{-j}, Y). \end{aligned}$$
(2)

CFI [56] perturbs the FOI while preserving its dependencies with the remaining features, i.e.,

$$\begin{aligned} \tilde{X}_j \sim F_{X_j \,|\, X_{-j}}. \end{aligned}$$
(3)

RFI [33] is a generalization of PFI and CFI since the perturbations preserve the dependencies with any user-specified set G, i.e.,

$$\begin{aligned} \tilde{X}_j \sim F_{X_j \,|\, X_G}. \end{aligned}$$
(4)

To indicate on which set G the perturbation of j is conditioned, we write RFI\(_j^G\). We obtain PFI by setting \(G=\emptyset \) and CFI by setting \(G=-j\). As will be shown, the type of perturbation strongly affects which features are considered relevant.
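As a minimal sketch of Eq. (1), the following Python code estimates perturbation-based FI; the conditional sampler for CFI is a deliberately crude linear-Gaussian heuristic for illustration (in practice, one would use, e.g., the tree-based approach of Molnar et al. [42]), and an RFI sampler would condition on \(X_G\) only. All function names are ours.

```python
# Sketch of perturbation-based FI (Eq. 1). `loss` is, e.g.,
# sklearn.metrics.mean_squared_error; X is a 2-d NumPy array.
import numpy as np
from sklearn.linear_model import LinearRegression

def perturbation_fi(model, X, y, j, loss, perturb, n_repeats=50, seed=0):
    """FI_j = E[L(Y, f(X~_j, X_-j))] - E[L(Y, f(X))]."""
    rng = np.random.default_rng(seed)
    base_risk = loss(y, model.predict(X))
    deltas = []
    for _ in range(n_repeats):
        X_pert = X.copy()
        X_pert[:, j] = perturb(X, j, rng)
        deltas.append(loss(y, model.predict(X_pert)) - base_risk)
    return float(np.mean(deltas))

def perturb_marginal(X, j, rng):
    # PFI (Eq. 2): permuting preserves the marginal of X_j and breaks
    # its dependencies with X_-j and Y.
    return rng.permutation(X[:, j])

def perturb_conditional(X, j, rng):
    # CFI (Eq. 3), crude sketch: model E[X_j | X_-j] linearly and permute
    # only the residuals, approximately preserving P(X_j | X_-j).
    X_rest = np.delete(X, j, axis=1)
    reg = LinearRegression().fit(X_rest, X[:, j])
    residuals = X[:, j] - reg.predict(X_rest)
    return reg.predict(X_rest) + rng.permutation(residuals)
```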

5.1 Permutation Feature Importance (PFI)

Insight into (A1): Non-zero PFI does not imply an unconditional association with Y (Negative Result 5.12). In the proof of Negative Result 5.12, we construct an example where the PFI is non-zero because the perturbation breaks the dependence between the features (and not because of an unconditional association with Y). Based on this, one may conjecture that unconditional feature independence is a sufficient assumption for non-zero PFI to imply an unconditional association with Y; however, this is not the case, as Negative Result 5.13 demonstrates. For non-zero PFI to imply an unconditional association with Y, the features must be independent conditional on Y instead (Result 5.11).

Zero PFI does not imply independence between the FOI and the target (Negative Result 5.14). Suppose the model did not detect the association, e.g., because it is a suboptimal fit or because the loss does not incentivize the model to learn the dependence; PFI may then be zero although the FOI is associated with Y. In the proof of Negative Result 5.14, we demonstrate the problem for the L2 loss, for which the optimal prediction is the conditional expectation (and thus neglects dependencies in higher moments). For cross-entropy optimal predictors and given feature independence (both with and without conditioning on Y), zero PFI implies unconditional independence of the FOI and Y (Result 5.11).

Insight into Association Conditional on \(X_G\) or \(X_{-j}\) (A2): PFI relates to unconditional (in)dependence and, thus, is not suitable for insight into conditional (in)dependence (see Sect. 4).

Result 5.11

(PFI interpretation). If \(X_j \perp X_{-j} \,|\, Y\), it holds that

$$\begin{aligned} \text {PFI}_j \ne 0 \;\Rightarrow \; X_j \not\perp Y. \end{aligned}$$
(5)

For cross-entropy loss and the respective optimal model, given \(X_j \perp X_{-j}\) and \(X_j \perp X_{-j} \,|\, Y\), it holds that

$$\begin{aligned} \text {PFI}_j = 0 \;\Rightarrow \; X_j \perp Y. \end{aligned}$$
(6)

Proof

The first implication directly follows from Theorem 1 in [33]. The second follows from the more general Result 5.31.   \(\square \)

Negative Result 5.12

Non-zero PFI does not imply an unconditional association between the FOI and the target: \(\text {PFI}_j \ne 0 \not \Rightarrow X_j \not\perp Y\).

Proof

(Counterexample). Let \(Y, X_1 \sim N(0,1)\) be two independent random variables, \(X_2:= X_1\), and let the prediction model be \(\hat{f}(x) = x_1 - x_2\). This model has an expected L2 loss of 1, as \(\mathbb {E}[L(Y, X_1-X_2)] = \mathbb {E}[Y^2] = 1\). Now let \(\tilde{X}_1 \sim F_{X_1}\) be the perturbed version of \(X_1\), so that \(\tilde{X}_1 \perp (X_2, Y)\). The expected L2 loss under perturbation is \(\mathbb {E}[(Y - (\tilde{X}_1-X_2))^2] = \text {Var}(Y - \tilde{X}_1 + X_2) = 3\), which implies PFI\(_1 = 2\). So PFI\(_1\) is non-zero, but \(X_1 \perp Y\).    \(\square \)

Negative Result 5.13

Unconditional feature independence is not sufficient: even if \(X_j \perp X_{-j}\), \(\text {PFI}_j \ne 0 \not \Rightarrow X_j \not\perp Y\).

Proof

(Counterexample). Let \(X_1, X_2 \sim Bern(0.5)\) with \(X_1 \perp X_2\), and \(Y:= X_1 \oplus X_2\), where \(\oplus \) is the XOR operation. Consider the perfect prediction model \(\hat{f}(x) = x_1 \oplus x_2\), where \(\hat{f}\) encodes the posterior probability for \(Y=1\) (here, Y can only be 0 or 1). This model has a cross-entropy loss of 0, since \(Y = \hat{f}(X)\). Furthermore, it holds that \(X_1 \perp Y\). Again, let \(\tilde{X}_1\) be the perturbed version of \(X_1\), and let \(\tilde{\hat{Y}} := \hat{f}(\tilde{X}_1, X_2)\) denote the resulting prediction. One can easily verify that \(\tilde{\hat{Y}} \perp Y\) and \(Y, \tilde{\hat{Y}} \sim Bern(0.5)\). Thus, the prediction \(\tilde{\hat{Y}}\) using the perturbed feature \(\tilde{X}_1\) assigns probability 1 to the correct or the wrong class, each with probability 0.5. Hence, the cross-entropy loss for the perturbed prediction is non-zero (actually, positive infinity), and PFI\(_1 \ne 0\).    \(\square \)

Negative Result 5.14

Zero FI does not imply (conditional) independence: \(\text {FI}_j = 0 \not \Rightarrow X_j \perp Y \,|\, X_G\) for any \(G \subseteq P \backslash \{j\}\), even if the model is L2-optimal.

NB: This result holds for any FI method based on univariate perturbations (Eq. 1), i.e., PFI, CFI, and RFI.

Proof

(Counterexample). If a model does not rely on a feature \(X_j\), then FI\(_j = 0\). We construct an example where \(\hat{f}\) is L2-optimal but does not rely on the feature \(X_1\), although \(X_1\) is dependent on Y conditional on any set \(G \subseteq P \backslash \{1\}\). Let \(Y\,|\,X_1, X_2 \sim N(X_2, X_1^2)\) with \(X_1, X_2 \sim N(0,1)\) and \(X_1 \perp X_2\). Then, Y is dependent on \(X_1\) conditional on any set \(G \subseteq P \backslash \{1\}\), i.e., both \(G = \emptyset \) and \(G = \{2\}\): for small \(|X_1|\), extreme values of Y are less likely than for, say, \(X_1=100\), irrespective of whether we know \(X_2\). Now consider \(\hat{f}(x) = x_2\). \(\hat{f}\) is L2-optimal since \(\mathbb {E}[Y|X]=X_2\), but \(\hat{f}\) does not depend on \(X_1\).    \(\square \)
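The counterexample can be checked numerically. In the following sketch (with the conditional variance written as \(X_1^2\), and sample size and seed of our choosing), perturbing \(X_1\) leaves the loss of the L2-optimal model unchanged, although Y clearly depends on \(X_1\) through its variance.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500_000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = rng.normal(loc=x2, scale=np.abs(x1))   # Y | X1, X2 ~ N(X2, X1^2)

pred = x2                                  # L2-optimal: E[Y | X] = X2
x1_perm = rng.permutation(x1)              # perturb X1 ...
pred_pert = x2                             # ... but f ignores X1 entirely
print(np.mean((y - pred_pert) ** 2) - np.mean((y - pred) ** 2))  # 0.0: PFI_1 = 0
# Yet Y depends on X1: its spread grows with |X1|.
print(np.corrcoef(np.abs(x1), np.abs(y - x2))[0, 1])  # clearly positive
```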

5.2 Conditional Feature Importance (CFI)

Insight into (A2a): Since CFI preserves associations between features, non-zero CFI implies a conditional dependence on Y, even if the features are dependent (Result 5.21). The converse generally does not hold, so Negative Result 5.14 also applies to CFI. However, for cross-entropy optimal models, zero CFI implies conditional independence (Result 5.21).

Insight into (A1) and (A2b): Since CFI provides insight into conditional dependence (A2a), it follows from Sect. 4 that CFI is not suitable to gain insight into (A1) and (A2b).

Result 5.21

(CFI interpretation). For CFI, it holds that

$$\begin{aligned} \text {CFI}_j \ne 0 \;\Rightarrow \; X_j \not\perp Y \,|\, X_{-j}. \end{aligned}$$
(7)

For cross-entropy optimal models, the converse holds as well.

Proof

The first implication follows from Theorem 1 in [33]. The second follows from the more general Result 5.31.    \(\square \)

5.3 Relative Feature Importance (RFI)

Insight into (A2b): Result 5.31 generalizes Results 5.11 and 5.21. While PFI and CFI are sensitive to dependencies conditional on no or all remaining features, RFI is sensitive to conditional dependencies w.r.t. a user-specified feature set G. Nevertheless, we must be careful with our interpretation if features are dependent. RFI may be non-zero even if the FOI is not associated with the target (Negative Result 5.32). In general, zero RFI does not imply independence (Negative Result 5.14). Still, for cross-entropy optimal models and under independence assumptions, insight into conditional independence w.r.t. G can be gained (Result 5.31).

Insight into (A1) and (A2a): If features are conditionally independent given Y, setting G to \(\emptyset \) (yielding PFI) enables insight into unconditional dependence. Setting G to \(-j\) (yielding CFI) enables insight into the conditional association given all other features.

Result 5.31

(RFI interpretation). For \(R=P\backslash (G\,\cup \,\{ j\})\), it holds that

$$\begin{aligned} \text {RFI}_j^G \ne 0 \;\Rightarrow \; X_j \not\perp (X_R, Y) \,|\, X_G. \end{aligned}$$
(8)

For cross-entropy optimal predictors and \(G \subseteq P \backslash \{j\}\) with \(X_j \perp X_R \,|\, X_G\) and \(X_j \perp X_R \,|\, (X_G, Y)\), it holds that

$$\begin{aligned} \text {RFI}_j^G = 0 \;\Rightarrow \; X_j \perp Y \,|\, X_G. \end{aligned}$$
(9)

Proof

The first implication follows directly from Theorem 1 in [33]. The proof of the second implication can be found in Appendix A.1.

Negative Result 5.32

\(\text {RFI}_j^G \ne 0 \not \Rightarrow X_j \not\perp Y\).

Proof

(Counterexample). Let \(G = \emptyset \). Then, \(\text {RFI}_j^G = \text {PFI}_j\), and the result follows directly from Negative Result 5.12.    \(\square \)

6 Methods Based on Marginalization

In this section, we assess SAGE value functions (SAGEvf) and SAGE values [10]. The methods remove features by marginalizing them out of the prediction function. The marginalization [39] is performed using either the conditional or marginal expectation. These so-called reduced models are defined as

$$\begin{aligned} \hat{f}^m_S(x_S) &= \mathbb {E}_{X_{-S}}\left[ \hat{f}(x_S, X_{-S})\right] , \quad \text { and}\\ \hat{f}^c_S(x_S) &= \mathbb {E}_{X_{-S}\,|\,X_S = x_S}\left[ \hat{f}(x_S, X_{-S})\right] , \end{aligned}$$
(10)

where \(\hat{f}^m\) is the marginal and \(\hat{f}^c\) the conditional-sampling-based version, and \(\hat{f}^m_{\emptyset }=\hat{f}^c_{\emptyset }\) is the average model prediction, e.g., \(\mathbb {E}[Y]\) for an L2 loss optimal model and \(\mathbb {P}(Y)\) for a cross-entropy loss optimal model. Based on these, SAGEvf quantify the change in performance that the model restricted to the FOIs achieves over the average prediction:

$$\begin{aligned} v^{m/c}(S) = \mathbb {E}\left[ L\left( Y, \hat{f}^{m/c}_{\emptyset }\right) \right] - \mathbb {E}\left[ L\left( Y, \hat{f}^{m/c}_S(X_{S})\right) \right] \end{aligned}$$
(11)

We abbreviate SAGEvf depending on the distribution used for the restricted prediction function (i.e., \(\hat{f}^m\) or \(\hat{f}^c\)) with mSAGEvf (\(v^m\)) and cSAGEvf (\(v^c\)).
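The following sketch estimates the marginal variant of Eqs. (10) and (11) by Monte Carlo, imputing \(X_{-S}\) from a background sample; the conditional variant \(v^c\) would additionally require a model of \(P(X_{-S}\,|\,X_S)\). The background-sample approach and all names are our illustrative choices.

```python
import numpy as np

def f_marginal(model, x_S, S, X_background):
    """Reduced model f^m_S(x_S): impute X_-S from its marginal (Eq. 10)."""
    X_imp = X_background.copy()      # background rows supply X_-S
    X_imp[:, S] = x_S                # clamp the features in S
    return model.predict(X_imp).mean()

def sage_vf_marginal(model, X, y, S, loss, n_background=100, seed=0):
    """mSAGEvf v^m(S): risk of the average prediction minus risk of f^m_S (Eq. 11)."""
    rng = np.random.default_rng(seed)
    bg = X[rng.choice(len(X), size=n_background, replace=False)]
    pred_S = np.array([f_marginal(model, x[S], S, bg) for x in X])
    pred_empty = np.full(len(y), model.predict(bg).mean())  # average prediction
    return loss(y, pred_empty) - loss(y, pred_S)
```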

SAGE values [10] regard FI quantification as a cooperative game, where the features are the players, and the overall performance is the payoff. The surplus performance (surplus payoff) enabled by adding a feature to the model depends on which other features the model can already access (coalition). To account for the collaborative nature of FI, SAGE values use Shapley values [52] to divide the payoff for the collaborative effort (the model’s performance) among the players (features). SAGE values are calculated as the weighted average of the surplus evaluations over all possible coalitions \(S \subseteq P\setminus \{j\}\):

$$\begin{aligned} \phi ^{m / c}_j(v) &= \frac{1}{p} \sum _{S \subseteq P\setminus \{j\}} \left( {\begin{array}{c}p-1\\ |S|\end{array}}\right) ^{-1} \left( v^{m / c}(S \cup \{j\}) - v^{m / c}(S) \right) , \end{aligned}$$
(12)

where the superscript in \(\phi _j\) denotes whether the marginal \(v^m(S)\) or conditional \(v^c(S)\) value function is used.
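Evaluating Eq. (12) exactly requires all \(2^{p-1}\) coalitions. A standard workaround, used in spirit by the approximation algorithm of Covert et al. [10], samples random feature orderings, whose average surplus reproduces the Shapley weighting; the generic sketch below (our own simplified rendering, not the authors’ algorithm) can be combined with any value function:

```python
import numpy as np

def shapley_from_orderings(value_fn, p, n_perm=200, seed=0):
    """Estimate the Shapley values of Eq. (12) for a value function v: S -> v(S)."""
    rng = np.random.default_rng(seed)
    phi = np.zeros(p)
    for _ in range(n_perm):
        order = rng.permutation(p)
        S, v_prev = [], value_fn([])      # v(empty set) = 0 by Eq. (11)
        for j in order:
            v_new = value_fn(S + [int(j)])
            phi[j] += v_new - v_prev      # surplus of adding j to coalition S
            S, v_prev = S + [int(j)], v_new
    return phi / n_perm

# Usage sketch: plugging in the marginal value function from above yields mSAGE
# values; a conditional value function would yield cSAGE values.
# sage = shapley_from_orderings(lambda S: sage_vf_marginal(model, X, y, S, loss),
#                               p=X.shape[1])
```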

6.1 Marginal SAGE Value Functions (mSAGEvf)

Insight into (A1): Like PFI, mSAGE value functions use marginal sampling and thereby break feature dependencies. Hence, mSAGEvf may be non-zero (\(v^m(\{j\}) \ne 0\)) although the respective feature is not associated with Y (Negative Result 6.12). While a feature independence assumption was sufficient for PFI to gain insight into pairwise (in)dependence, this is generally not the case for mSAGEvf: the feature marginalization step may lead to non-zero importance for non-optimal models (Negative Result 6.13). Given feature independence and an L2 or cross-entropy optimal model, a non-zero mSAGEvf implies an unconditional association; the converse only holds for cross-entropy optimal models (Result 6.11).

Insight into Association Conditional on \(X_G\) or \(X_{-j}\) (A2): The method mSAGEvf does not provide insight into the dependence between the FOI and Y (Negative Result 6.12) unless the features are independent and the model is optimal w.r.t. L2 or cross-entropy loss (Result 6.11). In that case, mSAGEvf can be linked to (A1) and, thus, is not suitable for (A2) (Sect. 4).

Result 6.11

(mSAGEvf interpretation). For L2 or cross-entropy loss-optimal models (and the respective loss) and \(X_j \perp X_{-j}\), it holds that

$$\begin{aligned} v^m(\{j\}) \ne 0 \;\Rightarrow \; X_j \not\perp Y. \end{aligned}$$
(13)

For cross-entropy optimal predictors, the converse holds as well.

Proof

The proof can be found in Appendix A.2.

Negative Result 6.12

\(v^m(\{j\}) \ne 0 \not \Rightarrow X_j \not\perp Y\), even for a loss-optimal model.

Proof

(Counterexample). Let us assume the same DGP and model as in the proof of Negative Result 5.12. In this setting, both the full model \(\hat{f}(x) = x_1 - x_2\) (which equals 0 almost surely, since \(X_2 = X_1\)) and the average prediction \(\hat{f}^m_\emptyset = 0\) are optimal, but the reduced model \(\hat{f}^m_1(x_1) = x_1 \ne 0\) is sub-optimal. Thus, \(v^m(\{1\}) \ne 0\) (although \(X_1 \perp Y \,|\, X_G\) for any \(G \subseteq P \backslash \{1\}\)).    \(\square \)

Negative Result 6.13

If the model is not loss-optimal, \(v^{m/c}(\{j\}) \ne 0 \not \Rightarrow X_j \not\perp (X_{-j}, Y)\).

Proof

(Counterexample). Let \(X_1, Y \sim N(0,1)\), and let \(X_{-1}\) be some (potentially multivariate) random variable with \(X_1 \perp (X_{-1}, Y)\) and \(X_{-1} \perp Y\). Let \(\hat{f}(x) = x_1\) be the prediction model. Then, \(\hat{f}^m_\emptyset = \hat{f}^c_\emptyset = \mathbb {E}[X_1] = 0\) and \(\hat{f}^m_1(x_1) = \hat{f}^c_1(x_1) = x_1\). Since the optimal prediction is \(\hat{Y}^* = \mathbb {E}[Y|X] = \mathbb {E}[Y] = 0\), the average prediction \(\hat{f}^{m/c}_\emptyset = 0\) is loss-optimal, whereas \(\hat{f}^{m/c}_1(x_1) = x_1\) is not. Consequently, \(v^{m/c}(\{1\}) \ne 0\) (although \(X_1\) is independent of target and features). Notably, the example works for both \(v^m\) and \(v^c\).    \(\square \)

6.2 Conditional SAGE Value Functions (cSAGEvf)

Insight into (A1): As for mSAGEvf, model optimality w.r.t. L2 or cross-entropy loss is needed to gain insight into the dependencies in the data (Negative Result 6.13). However, since cSAGEvf preserve the associations between features, the assumption of independent features is not required to gain insight into unconditional dependencies (Result 6.21).

Insight into Association Conditional on \(X_G\) or \(X_{-j}\) (A2): Since cSAGEvf provide insight into (A1), they are unsuitable for gaining insight into (A2) (see Sect. 4). However, the difference between cSAGEvf for different sets, called surplus cSAGEvf (scSAGEvf\(^G_j:=v^c(G\,\cup \,\{j\}) - v^c(G)\), where \(G\subseteq P\backslash \{j\}\) is user-specified), provides insight into conditional associations (Result 6.21).

Result 6.21

(cSAGEvf interpretation). For L2 loss or cross-entropy loss optimal models, it holds that:

$$\begin{aligned} v^c(\{j\}) \ne 0 \;\Rightarrow \; X_j \not\perp Y \quad \text {and} \quad \text {scSAGEvf}_j^G \ne 0 \;\Rightarrow \; X_j \not\perp Y \,|\, X_G. \end{aligned}$$
(14)

For cross-entropy loss, the respective converse holds as well.

Proof

The first implication (and the respective converse) follows from the second (and the respective converse) by setting \(G=\emptyset \). The second implication was proven in Theorem 1 in [40]. For the converse, [10] show that for cross-entropy optimal models \(v^c(G\,\cup \,\{j\})-v^c(G) = I(Y, X_j\,|\,X_G)\), and it holds that \(I(Y, X_j\,|\,X_G) = 0 \Leftrightarrow X_j \perp Y \,|\, X_G\).    \(\square \)

6.3 SAGE Values

Since non-zero cSAGEvf imply (conditional) dependence and cSAGE values are based on scSAGEvf of different coalitions, cSAGE values are only non-zero if a conditional dependence w.r.t. some conditioning set is present (see Result 6.31).

Result 6.31

Assuming an L2 or cross-entropy optimal model, the following interpretation rule for cSAGE values holds for a feature \(X_j\):

$$\begin{aligned} \phi ^c_j(v^c) \ne 0 \;\Rightarrow \; \exists \, S \subseteq P \backslash \{j\}: \; X_j \not\perp Y \,|\, X_S. \end{aligned}$$
(15)

For cross-entropy optimal models, the converse holds as well.

Proof

The proof can be found in Appendix A.3.

We cannot give clear guidance on the implications of mSAGE for (A1) to (A2b) and leave a detailed investigation for future work.

7 Methods Based on Model Refitting

This section addresses FI methods that quantify importance by removing features from the data and refitting the ML model. For LOCO [35], the difference in risk of the original model and a refitted model \(\hat{f}_{-j}^r\) relying on every feature but the FOI \(X_j\) is computed:

$$\begin{aligned} \text {LOCO}_j = \mathbb {E}\left[ L\left( Y, \hat{f}_{-j}^r(X_{-j})\right) \right] - \mathbb {E}\left[ L\left( Y, \hat{f}(X)\right) \right] , \end{aligned}$$
(16)

where the refitted model \(\hat{f}_{-j}^r\) is obtained with the same learner \(\mathcal {I}\) and hyperparameters \(\lambda \) as \(\hat{f}\).

Williamson et al. [60] generalize LOCO from a single FOI to a feature set \(S \subseteq P\). As they do not assign an acronym, we from here on call it Williamson’s Variable Importance Measure (WVIM):

$$\begin{aligned} \text {WVIM}_S = \mathbb {E}\left[ L\left( Y, \hat{f}_{-S}^r(X_{-S})\right) \right] - \mathbb {E}\left[ L\left( Y, \hat{f}(X)\right) \right] . \end{aligned}$$
(17)

WVIM, also known as LOGO [4], obviously equals LOCO for \(S=\{j\}\). For \(S=P\), the optimal refit reduces to the optimal constant prediction, e.g., \(\hat{f}_{-S}^r(x_{-S})=\hat{f}_{\emptyset }^r(x_{\emptyset }) = \mathbb {E}[Y]\) for an L2-optimal model and \(\hat{f}_{\emptyset }^r(x_{\emptyset }) = \mathbb {P}(Y)\) for a cross-entropy optimal model.

7.1 Leave-One-Covariate-Out (LOCO)

For L2 and cross-entropy optimal models, LOCO is similar to \(v^c(\{j\}\,\cup \,-j) - v^c(-j)\), with the difference that we do not obtain the reduced model by marginalizing out one of the features, but rather by refitting the model. As such, the interpretation is similar to the one of cSAGEvf (Result 7.11).

Result 7.11

For an L2 or cross-entropy optimal model and the respective optimal reduced model \(\hat{f}^r_{-j}\), it holds that \(\text {LOCO}_j \ne 0 \;\Rightarrow \; X_j \not\perp Y \,|\, X_{-j}\). For cross-entropy loss, the converse holds as well.

Proof

For cross-entropy and L2-optimal fits, the reduced model that we obtain from conditional marginalization behaves the same as the optimal refit (for cross-entropy loss \(\hat{f}^r_S = \hat{f}^c_S = \mathbb {P}(Y|X_S)\), for L2 loss \(\hat{f}^r_S = \hat{f}^c_S = \mathbb {E}[Y|X_S]\)) [10, Appendix B], and thus LOCO\(_j = v^c(\{j\}\,\cup \,-j) - v^c(-j)\). As such, the result follows directly from Result 6.21.    \(\square \)

7.2 WVIM as Relative FI and Leave-One-Covariate-In (LOCI)

For \(S=\{j\}\), the interpretation is the same as for LOCO. Another approach to analyzing the relative importance of the FOI is to investigate the surplus WVIM (sWVIM\(^{-G}_j\)) for a group \(G\subseteq P\backslash \{j\}\):

$$\begin{aligned} \text {sWVIM}^{-G}_j &= \mathbb {E}\left[ L\left( Y, \hat{f}_{G}^r(X_{G})\right) \right] - \mathbb {E}\left[ L\left( Y, \hat{f}_{G\,\cup \,\{j\}}^r(X_{G\,\cup \,\{j\}})\right) \right] . \end{aligned}$$
(18)

sWVIM\(^{-G}_j\) is the refit-based counterpart of scSAGEvf\(^G_j\); the two differ only in the way features are removed, so the interpretation is similar to that of scSAGEvf. A special case results for \(G = \emptyset \), i.e., the difference in risk between the optimal constant prediction and a model relying on the FOI only. We refer to this leave-one-covariate-in importance as LOCI\(_j\). For cross-entropy or L2-optimal models, the interpretation is the same as for cSAGEvf, since LOCI\(_j = v^c(\{j\})\) (Result 7.21).

Result 7.21

For L2 or cross-entropy optimal learners, it holds that

$$\begin{aligned} \text {LOCI}_j \ne 0 \;\Rightarrow \; X_j \not\perp Y \quad \text {and} \quad \text {sWVIM}_j^{-G} \ne 0 \;\Rightarrow \; X_j \not\perp Y \,|\, X_G. \end{aligned}$$
(19)

For cross-entropy, the converse holds as well.

Proof

For L2-optimal models, \(\hat{f}^c_\emptyset = \mathbb {E}[Y] = \hat{f}^r_\emptyset \) and \(\hat{f}^c_G = \mathbb {E}[Y|X_G] = \hat{f}^r_G\). For cross-entropy optimal models, \(\hat{f}^c_\emptyset = \mathbb {P}(Y) = \hat{f}^r_\emptyset \) and \(\hat{f}^c_G = \mathbb {P}(Y|X_G) = \hat{f}^r_G\). Thus, the interpretation is the same as for cSAGEvf (Result 6.21).    \(\square \)
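To make the refit-based quantities of Eqs. (16) to (18) concrete, the following sketch uses scikit-learn’s clone to keep the learner \(\mathcal {I}\) and its hyperparameters fixed across refits; the L2 loss and all names are our own assumptions.

```python
import numpy as np
from sklearn.base import clone
from sklearn.metrics import mean_squared_error

def wvim(learner, X_tr, y_tr, X_te, y_te, S):
    """WVIM_S (Eq. 17): risk of a refit without the features in S,
    minus the risk of the full model; LOCO_j is the case S = [j]."""
    keep = [k for k in range(X_tr.shape[1]) if k not in set(S)]
    full = clone(learner).fit(X_tr, y_tr)
    reduced = clone(learner).fit(X_tr[:, keep], y_tr)
    return (mean_squared_error(y_te, reduced.predict(X_te[:, keep]))
            - mean_squared_error(y_te, full.predict(X_te)))

def loci(learner, X_tr, y_tr, X_te, y_te, j):
    """LOCI_j (Sect. 7.2): optimal constant prediction vs. a refit on X_j alone."""
    const_risk = mean_squared_error(y_te, np.full(len(y_te), y_tr.mean()))
    uni = clone(learner).fit(X_tr[:, [j]], y_tr)
    return const_risk - mean_squared_error(y_te, uni.predict(X_te[:, [j]]))
```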

8 Examples

We can now answer the open questions of the motivating example from the introduction (Sect. 1). To illustrate our recommendations (summarized in Table 1), we additionally apply the FI methods to a simplified setting where the DGP and the model’s mechanism are known and intelligible, including features with different roles.

Table 1. Summary of our results. The abbreviation “CE” stands for cross-entropy loss and “L2” for L2-loss, each with the respective optimal model.

Returning to Our Motivating Example. From Negative Result 5.12, we know that PFI can assign high FI values to features even if they are not associated with the target but only with other features that are associated with the target. Conversely, LOCO only assigns non-zero values to features conditionally associated with the target (here: bike rentals per day, see Result 7.11). We can therefore conclude that at least the features weathersit, season, temp, mnth, windspeed, and weekday are conditionally associated with the target, while the top-ranked features according to PFI tend to share information with other features or may not be associated with bike rentals per day at all.

Illustrative Example with Known Ground-Truth. This example includes five features \(X_1, \dots , X_5\) and a target Y with the following dependence structure (visualized in Fig. 2, left plot):

  • \(X_1, X_3\) and \(X_5\) are independent and standard normal: \(X_j \sim N(0,1)\),

  • \(X_2\) is a noisy copy of \(X_1\): \(X_2 := X_1 + \epsilon _2, \epsilon _2 \sim N(0, 0.001)\),

  • \(X_4\) is a (more) noisy copy of \(X_3\): \(X_4 := X_3 + \epsilon _4, \epsilon _4 \sim N(0, 0.1)\),

  • Y depends on \(X_4\) and \(X_5\) via linear effects and a bivariate interaction:

    \(Y:= X_4 + X_5 + X_4*X_5 + \epsilon _Y,\, \epsilon _Y \sim N(0, 0.1).\)

Regarding (A1), features \(X_3, X_4\) and \(X_5\) are unconditionally associated with Y, while only \(X_5\) is conditionally associated with Y given all other features (A2a).

We sample \(n = 10,000\) observations from the DGP and use 70% of the observations to train two models: a linear model (LM) with additional pair-wise interactions between all features (test-MSE \(=0.0103\), test-\(R^2=0.9966\)), and a random forest (RF) using default hyperparameters (test-MSE \(=0.0189\), test-\(R^2=0.9937\)). We apply the FI methods with L2 loss to both models on the 30% test data, using 50 repetitions for methods that marginalize or perturb features. We present the results in Fig. 2. The right plot shows each feature’s FI value relative to the most important feature (which is scaled to 1).
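The DGP above can be simulated in a few lines. The following sketch (the seed is our own choice; \(N(0, v)\) is read as mean and variance) generates data as used in this experiment:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=np.sqrt(0.001), size=n)  # noisy copy of X1
x3 = rng.normal(size=n)
x4 = x3 + rng.normal(scale=np.sqrt(0.1), size=n)    # noisier copy of X3
x5 = rng.normal(size=n)
y = x4 + x5 + x4 * x5 + rng.normal(scale=np.sqrt(0.1), size=n)
X = np.column_stack([x1, x2, x3, x4, x5])
```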

Fig. 2. Left: Graph illustrating the model and data level associations. Right: Results of FI methods for the LM in panel (a) and the RF in panel (b); importance values are relative to the most important feature.

(A1): LOCI and cSAGEvf correctly identify \(X_3\), \(X_4\) and \(X_5\) as unconditionally associated. PFI correctly identifies \(X_4\) and \(X_5\) as relevant, but it misses \(X_3\), presumably because the model predominantly relies on \(X_4\). For the LM, PFI additionally considers \(X_1\) and \(X_2\) to be relevant, although they are fully independent of Y; due to the strong correlation between \(X_1\) and \(X_2\), the trained model includes the term \(0.36x_1-0.36x_2\), which cancels out under the unperturbed, original distribution but causes performance drops when the dependence between \(X_1\) and \(X_2\) is broken via perturbation. Similar observations can be made for mSAGEvf, with the difference that \(X_1\) and \(X_2\) receive negative importance. The reason is that mSAGEvf compares the performance of the average prediction to a prediction where all but one feature are marginalized out; we would expect that adding a feature improves performance, but for \(X_1\) and \(X_2\), performance worsens because adding the feature breaks the dependence between \(X_1\) and \(X_2\).

(A2): CFI, LOCO, and scSAGEvf\(^{-j}\) correctly identify \(X_5\) as conditionally associated, as expected. cSAGE correctly identifies the features that are dependent on Y conditional on some set S, namely \(X_3\), \(X_4\) and \(X_5\). The results of mSAGE for the RF are similar to those of cSAGE; on the LM, the results are quite inconclusive, as most features have a negative importance.

Overall, the example empirically illustrates the differences between the methods as theoretically shown in Sects. 5 to 7.

9 Summary and Practical Considerations

In Sects. 5 to 7, we presented three different classes of FI techniques: techniques based on univariate perturbations, techniques based on marginalization, and techniques based on model refitting. In principle, each approach can be used to gain partial insights into questions (A1) to (A2b). However, the practicality of the methods depends on the specific application. In the following, we discuss some aspects that may be relevant to practitioners.

For (A1), PFI, mSAGEvf, cSAGEvf, and LOCI are, in theory, suitable. However, PFI and mSAGEvf require assumptions about feature independence, which are typically unrealistic. cSAGEvf require marginalizing out features using a multivariate conditional distribution \(P(X_{-j}|X_j)\), which can be challenging since not only the dependencies between \(X_j\) and \(X_{-j}\) but also those within \(X_{-j}\) must be modeled. LOCI requires fitting a univariate model, which is computationally much less demanding than the cSAGEvf computation.

For (A2a), a comparatively more challenging task, CFI, scSAGEvf, and LOCO are suitable, but it is unclear which of the methods is preferable in practice. While CFI and scSAGEvf require a model of the univariate conditional \(P(X_j|X_{-j})\), LOCO requires fitting a model to predict Y from \(X_{-j}\). For (A2b), the practical requirements depend on the size of the conditioning set: the closer the conditioning set is to \(-j\), the fewer features have to be marginalized out for scSAGEvf, and the fewer feature dependencies are broken by RFI's perturbation, reducing the risk of extrapolation. For sWVIM, larger conditioning sets imply more expensive model refits.

Importantly, all three questions (A1) to (A2b) could also be assessed with unconditional or conditional independence tests, e.g., mutual information [9], partial correlation tests [5], kernel-based measures such as the Hilbert-Schmidt independence criterion [24, 62], or the generalized covariance [51]. This seems particularly appropriate for question (A1), where we simply assess the association structure of a bivariate distribution. Methods like mSAGEvf can arguably be considered overly complex and computationally expensive for such a task.
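For instance, for (A2)-style questions under linear-Gaussian assumptions, a simple partial correlation test can be run directly on the data, without fitting any ML model. The following sketch (our own; the p-value ignores the degrees of freedom consumed by the regressions and is therefore approximate) tests \(X_j \perp Y \,|\, X_G\) by correlating regression residuals:

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

def partial_correlation_test(xj, y, X_G):
    """Test X_j independent of Y given X_G via the correlation of the
    residuals after regressing both on X_G (linear-Gaussian assumption)."""
    res_x = xj - LinearRegression().fit(X_G, xj).predict(X_G)
    res_y = y - LinearRegression().fit(X_G, y).predict(X_G)
    return stats.pearsonr(res_x, res_y)   # (partial correlation, p-value)
```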

10 Statistical Inference for FI Methods

So far, we have described how the presented FI methods should behave in theory or as point estimators. However, the estimation of FI values is inherently subject to various sources of uncertainty introduced during the FI estimation procedure, model training, or model selection [41, 60]. This section reviews available techniques to account for uncertainty in FI by applying methods of statistical inference, e.g., statistical tests and the estimation of confidence intervals (CIs).

All FI methods in this paper measure the expected loss. To prevent biased or misleading estimates due to overfitting, it is crucial to calculate FI values on independent test data not seen during training, in line with best practices in ML performance assessment [37, 54]. Computing FI values on training data may lead to wrong conclusions: for example, Molnar et al. [43] demonstrated that even if features are random noise and not associated with the target, some features are incorrectly deemed important when FI values are computed on training data instead of test data. If no large dedicated test set is available, or the data set is too small for a simple holdout split, resampling techniques such as cross-validation or the bootstrap provide practical solutions [54].

In the following, we will first provide an overview of method-specific approaches and then summarize further ideas about more general ones.

PFI and CFI. Molnar et al. [41] address the uncertainty of model-specific PFI and CFI values caused by estimating expected values via Monte Carlo integration on a fixed test data set and model. To additionally address the variance of the learning algorithm, they introduce the learner-PFI, computed using resampling techniques such as bootstrapping or subsampling, on a held-out test set within each resampling iteration. They also propose variance-corrected Wald-type CIs to compensate for the underestimation of variance caused by the models of different resampling iterations partially sharing training data. For CFI, Watson and Wright [58] address sampling uncertainty by comparing instance-wise loss values. They use Fisher’s exact (permutation) tests and paired t-tests for hypothesis testing; the latter rely on the central limit theorem and apply to all decomposable loss functions, i.e., those calculated by averaging instance-wise losses.
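The paired t-test idea translates to a few lines of code: compare instance-wise losses with and without the conditional perturbation. The sketch below is our simplified rendering of this approach for the L2 loss; the conditional sampler is assumed to be given.

```python
import numpy as np
from scipy import stats

def cfi_paired_ttest(model, X, y, j, sample_conditional, seed=0):
    """H0: CFI_j <= 0. Compares instance-wise squared errors before and
    after conditionally perturbing feature j (Watson & Wright style)."""
    rng = np.random.default_rng(seed)
    X_pert = X.copy()
    X_pert[:, j] = sample_conditional(X, j, rng)  # draw from P(X_j | X_-j)
    loss_orig = (y - model.predict(X)) ** 2       # instance-wise L2 losses
    loss_pert = (y - model.predict(X_pert)) ** 2
    # One-sided paired t-test: the perturbation should increase the loss.
    return stats.ttest_rel(loss_pert, loss_orig, alternative="greater")
```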

SAGE. Since the exact calculation of SAGE values is computationally expensive, the original SAGE paper [10] introduced an efficient approximation algorithm. The authors show that, by the central limit theorem, the approximation converges to the correct values and that its variance decreases linearly with the number of iterations. They briefly mention that the variance of the approximation can be estimated at a specific iteration and used to construct CIs (the same underlying idea as the Wald-type CIs for the model-specific PFI mentioned above).

WVIM Including LOCO. Lei et al. [35] introduced statistical inference for LOCO by splitting the data into two parts: one for model fitting and one for estimating LOCO. They further employ hypothesis tests and construct CIs using sign tests or the Wilcoxon signed-rank test. The interpretation of the results is limited to the importance of the FOI for an ML algorithm’s model estimated on a fixed training data set. Williamson et al. [60] construct Wald-type CIs for LOCO and WVIM based on k-fold cross-validation and sample splitting. Compared to LOCO, this provides a more general interpretation, as it considers the FI of an ML algorithm trained on samples of a particular size, i.e., due to cross-validation, the results are not tied to a single training data set. The approach is related to [41] but removes features via refitting instead of sampling and does not consider any variance correction. The authors note that, while sample splitting helps to address issues related to zero-importance features having an incorrect type I error or CI coverage, it may not fully leverage all available information in the data set to train a model.

PIMP. The PIMP heuristic [2] is based on model refits and was initially developed to address bias in FI measures such as PFI within random forests. However, PIMP is a general procedure with broader applicability across various FI methods [36, 43]. PIMP involves repeatedly permuting the target to disrupt its associations with the features while preserving the feature dependencies, training a model on the data with the permuted target, and computing PFI values. This yields a collection of PFI values (called null importances) under the assumption of no association between the FOI and the target. The PFI value of the model trained on the original data is then compared with the distribution of null importances to identify significant features (a simplified sketch follows at the end of this section).

Methods Based on the Rashomon Set. The Rashomon set refers to a collection of models that perform equally well but may differ in how they construct the prediction function and in the features they rely on. Fisher et al. [18] consider the Rashomon set of a specific model class (e.g., decision trees) defined by a performance threshold and propose a method to measure FI within this set: for each model in the Rashomon set, the FI of the FOI is computed, and its range across all models is reported. Other works include the Variable Importance Cloud (VIC) [13], providing a visual representation of FI values over different model types; the Rashomon Importance Distribution (RID) [14], providing the FI distribution across the set and CIs to characterize uncertainty around FI point estimates; and ShapleyVIC [44], extending VIC to SAGE values and using a variance estimator for constructing CIs. The main idea is to address uncertainty in model selection by analyzing a Rashomon set, hoping that some of these models reflect the underlying DGP and assign similar FI values to features.

Multiple Comparisons. Testing multiple FI values simultaneously poses a challenge known as the multiple comparisons problem: the risk of falsely rejecting true null hypotheses increases with the number of comparisons. Practitioners can mitigate it, e.g., by controlling the family-wise error rate or the false discovery rate [43, 49].
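A simplified sketch of the PIMP procedure (our own rendering; for brevity, importances are computed in-sample, whereas held-out data should be used in practice, cf. the beginning of this section):

```python
import numpy as np
from sklearn.base import clone
from sklearn.inspection import permutation_importance

def pimp_pvalues(learner, X, y, n_null=100, seed=0):
    """Null importances via target permutation (PIMP, simplified sketch)."""
    rng = np.random.default_rng(seed)
    observed = permutation_importance(clone(learner).fit(X, y), X, y,
                                      random_state=0).importances_mean
    null = np.empty((n_null, X.shape[1]))
    for b in range(n_null):
        y_perm = rng.permutation(y)              # breaks all X-Y associations,
        model_b = clone(learner).fit(X, y_perm)  # preserves feature dependencies
        null[b] = permutation_importance(model_b, X, y_perm,
                                         random_state=0).importances_mean
    # p-value: share of null importances at least as large as the observed one
    return (1 + (null >= observed).sum(axis=0)) / (1 + n_null)
```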

11 Open Challenges and Further Research

Feature Interactions. FI computations are usually complicated by the presence of strong and higher-order interactions [43]. Such interactions typically have to be specified manually in (semi-)parametric statistical models, whereas the complex non-parametric ML models to which we usually apply model-agnostic IML techniques include higher-order interaction effects automatically. While recent advances have been made in visualizing the effects of feature interactions and quantifying their contribution to the prediction function [3, 23, 27], we feel that this topic is somewhat underexplored in the context of loss-based FI methods, i.e., regarding how much an interaction contributes to the predictive performance. A notable exception is SAGE, which, however, does not explicitly quantify the contribution of interactions to the predictive performance but rather distributes interaction importance evenly among all interacting features. Future work could extend this by combining ideas from functional decomposition [3, 27], FI measures based on it [29], and loss-based methods as in SAGE.

Model Selection and AutoML. As a subtle but important point, it seems somewhat unclear to which model classes or learning algorithms the covered techniques can or should be applied if DGP inference is the goal. From a mechanistic perspective, these model-agnostic FI approaches can be applied to basically any model class, which seems to be the case in current applications. However, as Williamson et al. [60] note, and following our results, many statements in Sects. 5 to 7 only hold for a “loss-optimal model”. First of all, in practice, constructing a loss-optimal model with certainty is virtually impossible. Does this imply that we should squeeze out as much predictive performance as possible, regardless of the incurred extra model complexity? Williamson et al. [60] use the “super learner” in their definition and implementation of WVIM [59], and modern AutoML systems like AutoGluon [16] are based on the same principle. While we fully understand that choice and find the combination of AutoML and IML techniques very exciting, we are unsure about the trade-off costs. Certainly, this is a computationally expensive approach. But we also worry about the underlying implications for FI methods (and IML techniques more generally) when models of basically the highest order of complexity, which usually contain nearly unconstrained higher-order interactions, are used. We think this issue requires further analysis.

Rashomon Sets and Model Diagnosis. Expanding on the previous issue: in classical statistical modeling, models are usually not validated by checking predictive performance metrics only. The Rashomon effect tells us that in quite a few scenarios, very similarly performing models exist that give rise to different response surfaces and different IML interpretations. This suggests that ML researchers and data scientists may have to expand their model validation toolbox in order to be better able to exclude misspecified models.

Empirical Performance Comparisons. We have tried to compile a succinct list of results describing what can be inferred about the DGP from various FI methods. However, such theoretical analyses often considerably simplify the complexity of the real-world scenarios to which these techniques are applied. It is therefore usually a good idea to complement mathematical analysis with informative, detailed, and carefully constructed empirical benchmarks. Unfortunately, little such benchmark work exists in this area. Admittedly, benchmarking is not easy for FI, as ground truths are often only available in simulations, which, in turn, lack the complexity of real-world data sets; moreover, even in simulations, concrete “importance ground truth numbers” might be debatable. Many studies compare local importance methods [1, 26], but few compare global ones: e.g., Blesch et al. [6] and Covert et al. [10] compare FI methods across different data sets, metrics, and ML models. However, these comparisons do not differentiate the methods with respect to the association types considered in our paper.

Causality. Beyond association, scientific practitioners are often interested in causation (see, e.g., [21, 22, 50, 57, 61]). In our example from Sect. 4, the doctor may not only want to predict the disease but may also want to treat it. Knowing which features are associated with the disease is insufficient for that purpose – association remains on rung 1 of the so-called ladder of causation [47]: Although the symptoms are associated with the disease, treating them does not affect the disease. To gain insight into the effects of interventions (rung 2), experiments and/or causal knowledge and specialized tools are required [28, 31, 46, 48].