1 Introduction

Variable importance (VI) is a major topic in statistics and machine learning. It is the basis of most if not all feature selection methods, which analysts use to identify key drivers of variation in an outcome of interest and/or create more parsimonious models (Guyon & Elisseeff, 2003; Kuhn & Johnson, 2019; Meinshausen & Bühlmann, 2010). Many importance measures have been proposed in recent years, either for specific algorithms or more general applications. Several different notions of VI—some overlapping, some inconsistent—have emerged from this literature. We examine these in greater detail in Sect. 2.1.

One fundamental difference between various importance measures is whether they test the marginal or conditional independence of features. To evaluate response variable Y's marginal dependence on predictor \({X}_{j}\), we test against the following null hypothesis:

$$H_{0}^{m} :~X_{j} \bot \left( {Y,\user2{X}_{{ - j}} } \right),$$

where \({\user2{X}}_{-j}\) represents a set of covariates. A measure of conditional dependence, on the other hand, tests against a different null hypothesis:

$$H_{0}^{c} :~X_{j} \bot Y|\user2{X}_{{ - j}} .$$

Note that \({X}_{j}\)'s marginal VI may be high due to its association with either Y or \({{\user2{X}}}_{-j}\). This is why measures of marginal importance tend to favor correlated predictors. Often, however, our goal is to determine whether \({X}_{j}\) adds any new information—in other words, whether Y is dependent on \({X}_{j}\) even after conditioning on \({{\user2{X}}}_{-j}\). This becomes especially important when the assumption of feature independence is violated.
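To make the distinction concrete, consider a toy simulation in which \(X_1\) is correlated with \(X_2\) but carries no information about Y beyond what \(X_2\) already provides. The following Python sketch (purely illustrative; the variable names, coefficients, and sample size are our own choices, not from the paper) shows that \(X_1\)'s marginal correlation with Y is large while its partial correlation given \(X_2\) is near zero:

```python
import math
import random

random.seed(1)
n = 20000

# X2 drives Y; X1 is correlated with X2 but adds no information about Y.
x2 = [random.gauss(0, 1) for _ in range(n)]
x1 = [0.9 * v + math.sqrt(1 - 0.9 ** 2) * random.gauss(0, 1) for v in x2]
y = [v + random.gauss(0, 1) for v in x2]

def corr(a, b):
    # Pearson correlation, standard library only.
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(va * vb)

r_x1y, r_x1x2, r_x2y = corr(x1, y), corr(x1, x2), corr(x2, y)

# Partial correlation of X1 and Y given X2: near zero under H0^c,
# even though the marginal correlation r_x1y is substantial.
partial = (r_x1y - r_x1x2 * r_x2y) / math.sqrt(
    (1 - r_x1x2 ** 2) * (1 - r_x2y ** 2)
)
print(round(r_x1y, 2), round(partial, 2))
```

A marginal test would flag \(X_1\) as important here; a conditional test would not.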

Tests of conditional independence (CI) are common in the causal modelling literature. For instance, the popular PC algorithm (Spirtes et al., 2000), which converges on a set of directed acyclic graphs (DAGs) consistent with some observational data, relies on the results of CI tests to recursively remove the edges between nodes. Common parametric examples include the partial correlation test for continuous variables or the \({\chi }^{2}\) test for categorical data. A growing body of literature in recent years has examined nonparametric alternatives to these options. We provide an overview of several such proposals in Sect. 2.2.

In this paper, we introduce a new CI test to measure VI. The conditional predictive impact (CPI) quantifies the contribution of one or several features to a given algorithm's predictive performance, conditional on a complementary feature subset. Our work relies on so-called “knockoff” variables (formally defined in Sect. 2.3) to provide negative controls for feature testing. Because knockoffs are, by construction, exchangeable with their observed counterparts and conditionally independent of the response, they enable a paired testing approach without any model refitting. Unlike the original knockoff filter, however, our methods are not limited to certain types of datasets or algorithms.

The CPI is extremely modular and general. It can be used with any combination of knockoff sampler, supervised learner, and loss function. It can be efficiently computed in high dimensions without sparsity constraints. We demonstrate that the CPI is an unbiased estimator, provably consistent under minimal assumptions. We develop statistical inference procedures for evaluating its magnitude, precision, and significance. Finally, we demonstrate the measure's utility on a variety of real and simulated datasets.

The remainder of this paper is structured as follows. We review related work in Sect. 2. We present theoretical results in Sect. 3, where we also outline an efficient algorithm for estimating the CPI, along with corresponding p-values and confidence intervals. We test our procedure on real and simulated data in Sect. 4, comparing its performance with popular alternatives under a variety of regression and classification settings. Following a discussion in Sect. 5, we conclude in Sect. 6.

2 Related work

In this section, we survey the relevant literature on VI estimation, CI tests, and the knockoff filter.

2.1 Variable importance measures

The notion of VI may feel fairly intuitive at first, but closer inspection reveals a number of underlying ambiguities. One important dichotomy is that between global and local measures, which respectively quantify the impact of features on all or particular predictions. This distinction has become especially important with the recent emergence of interpretable machine learning techniques designed to explain individual outputs of black box models (e.g., Lundberg & Lee, 2017; Ribeiro et al., 2016; Wachter et al., 2018). In what follows, we restrict our focus to global importance measures.

Another important dichotomy is that between model-specific and model-agnostic approaches. For instance, a number of methods have been proposed for estimating importance in linear regression (Barber & Candès, 2015; Grömping, 2007; Lindeman et al., 1980), random forests (Breiman, 2001; Kursa & Rudnicki, 2010; Strobl et al., 2008), and neural networks (Bach et al., 2015; Gevrey et al., 2003; Shrikumar et al., 2017). These measures have the luxury of leveraging an algorithm's underlying assumptions and internal architecture for more precise and efficient VI estimation.

Other, more general techniques have also been developed. Van der Laan (2006) derives efficient influence curves and inference procedures for a variety of VI measures. Hubbard et al. (2018) build on this work, proposing a data-adaptive method for estimating the causal influence of variables within the targeted maximum likelihood framework (van der Laan & Rose, 2018). Williamson et al. (2021) describe an ANOVA-style decomposition of a regressor’s R2 into feature-wise contributions. Feng et al. (2018) design a neural network to efficiently compute this decomposition using multi-task learning. Fisher et al. (2019) propose a number of “reliance” statistics, calculated by integrating a loss function over the empirical distribution of covariates while holding a given feature vector constant.

Perhaps the most important distinction between various competing notions of VI is the aforementioned split between marginal and conditional measures. The topic has received considerable attention in the random forest literature, where Breiman’s popular permutation technique (2001) has been criticized for failing to properly account for correlations between features (Gregorutti et al., 2015; Nicodemus et al., 2010). Conditional alternatives have been developed (Mentch & Hooker, 2016; Strobl et al., 2008), but we do not consider them here, as they are specific to tree ensembles.

Our proposed measure resembles what Fisher et al. (2019) call “algorithm reliance” (AR). The authors do not have much to say about AR in their paper, the majority of which is instead devoted to two related statistics they term “model reliance” (MR) and “model class reliance” (MCR). These measure the marginal importance of a feature subset in particular models or groups of models, respectively. Only AR measures the importance of the subset conditional on remaining covariates for a given supervised learner, which is our focus here. Fisher et al. (2019) derive probabilistic bounds for MR and MCR, but not AR. They do not develop hypothesis testing procedures for any of their reliance statistics.

2.2 Conditional independence tests

CI tests are the cornerstone of constraint-based and hybrid methods for causal graph inference and Bayesian network learning (Koller & Friedman, 2009; Korb & Nicholson, 2009; Scutari & Denis, 2014). Assuming the causal Markov condition and faithfulness—which together state (roughly) that statistical independence implies graphical independence and vice versa—a number of algorithms have been developed that use CI tests to discover an equivalence class of DAGs consistent with a set of observational data (Maathuis et al., 2009; Spirtes et al., 2000; Verma & Pearl, 1991).

Shah and Peters (2020) have shown that there exists no uniformly valid CI test. Parametric assumptions are typically deployed to restrict the range of alternative hypotheses, which is default behavior for most causal discovery software (e.g., Kalisch et al., 2012; Scutari, 2010). However, more flexible methods have been introduced. Much of this literature relies on techniques that embed the data in a reproducing kernel Hilbert space (RKHS). For instance, Fukumizu et al. (2008) use a normalized cross-covariance operator to test the association between features in the RKHS. A null distribution is approximated via permutation. Doran et al. (2014) build on Fukumizu et al.’s work with a modified permutation scheme intended to capture the effects of CI. Zhang et al. (2011) derive a test statistic from the traces of kernel matrices, using a gamma null distribution to compute statistical significance.

Another general family of methods for CI testing is rooted in information theory. Fleuret (2004) proposes a fast binary variable selection procedure using conditional mutual information. Similar techniques have been used to infer directionality in networks (Vejmelka & Paluš, 2008), cluster features together for dimensionality reduction (Martínez Sotoca & Pla, 2010), and reason about the generalization properties of supervised learners (Steinke & Zakynthinou, 2020). These approaches typically rely either on discretization procedures to convert all inputs to categorical data, or strong parametric assumptions to handle continuous spaces.

Several authors have proposed alternative tests to avoid the inefficiencies of kernel methods and the binning often required by information theoretic algorithms. For instance, Strobl et al. (2018) employ a fast Fourier transform to reduce the complexity of matrix operations. Methods have been developed for estimating regularized, nonlinear partial correlations (Ramsey, 2014; Shah & Peters, 2020). Lei et al. (2018) and Rinaldo et al. (2019) study the leave-one-covariate-out (LOCO) procedure, in which an algorithm is trained on data with and without the variable of interest. The predictive performance of nested models is compared to evaluate the conditional importance of the dropped feature.

Our proposal is conceptually similar to LOCO, which can in principle be extended to feature subsets of arbitrary dimension. The method enjoys some desirable statistical properties when used in conjunction with sample splitting. For instance, Rinaldo et al. (2019) derive convergence rates for LOCO parameters, while Lei et al. (2018) prove finite sample error control using conformal inference. However, retraining an algorithm for each CI test is potentially infeasible, especially for complex learners and/or large datasets. With knockoffs, we can directly import LOCO's statistical guarantees without any model refitting.

2.3 The knockoff framework

Our work builds on the knockoff procedure originally conceived by Barber and Candès (2015) and later refined by Candès et al. (2018). Central to this approach is the notion of a knockoff variable. Given an n × p input matrix \({\user2{X}}\), we define a knockoff matrix of equal dimensionality \({\tilde{\varvec{X}}}\) as any matrix that meets the following two criteria:

  (a) Pairwise exchangeability. For any proper subset \(S \subset \left[ p \right] = \{ 1, \ldots ,p\}\),

$$\left( {\user2{X},{\tilde{\varvec{X}}}} \right)_{{swap\left( S \right)}} \overset{d}{=} \left( {\user2{X},{\tilde{\varvec{X}}}} \right),$$

    where \(\overset{d}{=}\) represents equality in distribution and the swapping operation is defined below.

  (b) Conditional independence. \({\tilde{\varvec{X}}}\perp Y|{\user2{X}}.\)

    A swap is obtained by switching the entries \({X}_{j}\) and \({\tilde{X}}_{j}\) for each \(j\in S\). For example, with \(p=3\) and \(S = \left\{ {1,~3} \right\}\):

    $$(X_{1} ,X_{2} ,X_{3} ,\tilde{X}_{1} ,\tilde{X}_{2} ,\tilde{X}_{3} )_{{swap\left( S \right)}} \overset{d}{=} \left( {\tilde{X}_{1} ,~X_{2} ,\tilde{X}_{3} ,X_{1} ,\tilde{X}_{2} ,X_{3} } \right).$$

Knockoffs provide negative controls for conditional independence testing. The intuition behind the method is that if \({X}_{j}\) does not significantly outperform \({\tilde{\varvec{X}}_{j}}\) by some relevant importance measure, then the original feature may be safely removed from the final model.
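The swap operation itself is purely mechanical. A minimal Python sketch (0-indexed, with hypothetical placeholder values) reproduces the \(p=3\), \(S=\{1,3\}\) example above:

```python
def swap(x, x_tilde, S):
    # Swap the entries X_j and X~_j for each j in S (0-indexed),
    # returning the concatenated vector (X, X~)_{swap(S)}.
    x_new, xk_new = list(x), list(x_tilde)
    for j in S:
        x_new[j], xk_new[j] = x_tilde[j], x[j]
    return x_new + xk_new

# p = 3, S = {1, 3} in the paper's 1-based indexing -> {0, 2} here.
out = swap(["x1", "x2", "x3"], ["k1", "k2", "k3"], {0, 2})
print(out)  # -> ['k1', 'x2', 'k3', 'x1', 'k2', 'x3']
```

Pairwise exchangeability requires that the swapped and unswapped vectors be equal in distribution, which is a property of the knockoff sampler, not of this bookkeeping step.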

Practical implementation requires both a method for generating knockoffs and a decision procedure for variable selection. The subject has quickly become a busy one in statistics and machine learning, with most authors focusing on the former task. In this paper, we instead tackle the latter, developing a general framework for testing conditional variable importance.

Constructing nontrivial knockoffs is a considerable challenge. Numerous methods have been proposed, including but not limited to: second-order Gaussian knockoffs (Candès et al., 2018); conditional permutation sampling (Berrett et al., 2020); hidden Markov models (Sesia et al., 2019); generative deep neural networks (Romano et al., 2020); Metropolis–Hastings sampling (Bates et al., 2020); conditional density estimation (Tansey et al., 2021); and normalizing flows (Hansen et al., 2021). A complete review of these proposals is beyond the scope of this paper. Bates et al. (2020) demonstrate that no efficient knockoff sampler exists for arbitrary probability distributions, suggesting that algorithms will have to make some assumptions about the data generating process to strike a reasonable balance between sensitivity and specificity.

The original knockoff papers introduce a novel algorithm for controlling the false discovery rate (FDR) in variable selection problems. The goal is to find the minimal subset \(\mathcal{S}\subset \left[p\right]\) such that, conditional on \(\{{X}_{j}{\}}_{j\in \mathcal{S}}\), Y is independent of all other variables. Call this the Markov blanket of Y (Pearl, 1988). Null features form a complementary set \(\mathcal{R}=[p]\backslash \mathcal{S}\) such that \(k\in \mathcal{R}\) if and only if \({X}_{k}\perp Y|\{{X}_{j}{\}}_{j\in \mathcal{S}}\). The FDR is given by the expected proportion of false positives among all declared positives:

$${\text{FDR}} ={\mathbb{E}}\left[ {\frac{{\left| {{\widehat{\mathcal{S}}} \cap {\mathcal{R}}} \right|}}{{\left| {{\widehat{\mathcal{S}}}} \right| \vee 1}}} \right],$$

where \(\widehat{\mathcal{S}}\) is the output of the decision procedure and the “\(\vee 1\)” in the denominator enforces the convention that FDR = 0 when \(\left|\widehat{\mathcal{S}}\right|=0\).
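The quantity inside the expectation, the false discovery proportion of a single run, is straightforward to compute. A small Python sketch (set-based, with hypothetical index sets for \(\widehat{\mathcal{S}}\) and \(\mathcal{R}\)):

```python
def fdp(selected, nulls):
    # False discovery proportion for one run: |S_hat ∩ R| / (|S_hat| ∨ 1).
    # The FDR is the expectation of this quantity over repeated runs.
    selected, nulls = set(selected), set(nulls)
    return len(selected & nulls) / max(len(selected), 1)

print(fdp({1, 2, 5}, {5, 6}))  # one null among three selections -> 0.3333...
print(fdp(set(), {5, 6}))      # empty selection -> 0.0 by convention
```

The "\(\vee 1\)" in the denominator appears here as `max(len(selected), 1)`, guaranteeing a zero (rather than undefined) proportion when nothing is selected.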

Barber and Candès (2015) demonstrate a method for guaranteed finite sample FDR control when (i) statistics for null variables are symmetric about zero and (ii) large positive statistics indicate strong evidence against the null. We will henceforth refer to this method as the adaptive thresholding test (ATT). Unlike other common techniques for controlling the FDR (e.g., Benjamini & Hochberg, 1995; Benjamini & Yekutieli, 2001; Storey, 2002), the ATT does not require p-values as an intermediate step. Candès et al. (2018) argue that this is a benefit in high-dimensional settings, where p-value calculations can be unreliable.

Acknowledging that p-values may still be desired in some applications, however, Candès et al. also propose the conditional randomization test (CRT), which provides one-sided Monte Carlo p-values by repeatedly sampling from the knockoff distribution. Experiments indicate that the CRT is slightly more powerful than the ATT, but the authors caution that the former is computationally intensive and do not recommend it for large datasets. That has not stopped other groups from advancing formally similar proposals (e.g., Berrett et al., 2020; Tansey et al., 2021).

We highlight several important shortcomings of the ATT: (1) Not all algorithms provide feature scoring statistics. (2) The ATT requires a large number of variables to reliably detect true positives. (3) Because the ATT does not perform individual hypothesis tests, it cannot provide confidence or credible intervals for particular variables. In what follows, we present alternative inference procedures for conditional independence testing designed to address all three issues.

3 Conditional predictive impact

The basic intuition behind our approach is that important features should be informative—that is, their inclusion should improve the predictive performance of an appropriate algorithm as measured by some preselected loss function. Moreover, the significance of improvement should be quantifiable so that error rates can be controlled at user-specified levels.

Consider an n × p feature matrix \({\user2{X}}\in \mathcal{X}\) and corresponding n × 1 response variable \(Y\in \mathcal{Y}\), which combine to form the dataset \(\user2{Z}=\left(\user2{X},Y\right)\in \mathcal{Z}\). Each observation \({\user2{z}}_{i}=({\user2{x}}_{i},{y}_{i})\) is an i.i.d. sample from a fixed but unknown joint probability distribution, \(\mathbb{P}\left(\user2{Z}\right)=\mathbb{P}(\user2{X},Y)\). Let \({\user2{X}}^{S}\subseteq ({X}_{1},\dots ,{X}_{p})\) denote some subset of features whose predictive impact we intend to quantify, conditional on the (possibly empty) set of remaining covariates \({\user2{X}}^{R}=\user2{X}\backslash {\user2{X}}^{S}\). Data can now be expressed as a triple, \(\user2{Z}=({\user2{X}}^{S},{\user2{X}}^{R},Y)\). We remove the predictive information in \({\user2{X}}^{S}\) while preserving the covariance structure of the predictors by replacing the submatrix with the corresponding knockoff variables, \( {\tilde{\varvec{X}}}^{S}\), rendering a new dataset, \({\tilde{\varvec{Z}}}=({\tilde{\varvec{X}}}^{S},{\user2{X}}^{R},Y)\).

Define a function \(f\in \mathcal{F}\), \(\mathcal{F}:\mathcal{X}\to \mathcal{Y}\), as a mapping from features to outcomes. We evaluate a model’s performance using some real-valued, non-negative loss function \(L:\mathcal{F}\times \mathcal{Z}\to {\mathbb{R}}_{\ge 0}\). Define the risk of f with respect to \(\user2{Z}\) as its expected loss over the joint probability distribution \(\mathbb{P}\left(\user2{Z}\right),\) \(R\left(f,\user2{Z}\right)=\mathbb{E}\left[L\left(f,\user2{Z}\right)\right].\) Our strategy is to replace the conditional null hypothesis defined in Sect. 1 with the following:

$${H}_{0}:R\left(f,\user2{Z}\right)\ge R\left(f,{\tilde{\varvec{Z}}}\right).$$

In other words, we test whether the model performs better using the original or the knockoff data. We note that this is potentially weaker than \({H}_{0}^{c}\), as many popular loss functions restrict attention to just the first moment. However, we argue that if such a loss function is appropriate, then a conditional predictive test is a better choice than a conditional independence test. For instance, if the goal is simply to minimize out-of-sample \({L}_{2}\) loss, then we have no need for features that do not improve mean square error, even if they encode information about higher moments (e.g., predictive variance or skewness). Alternatively, if confidence intervals are important for a given task, then the loss function should reflect that, as likelihood-based measures do. In general, recovering the full Markov blanket of Y may be unnecessary.

3.1 Consistency and convergence

The CPI of submatrix \({\user2{X}}^{S}\) measures the extent to which the feature subset improves predictions made using model f. Assume that the loss function L can be evaluated for each sample i. We define the following random variable:

$${{{\Delta }}_i} = L\left({f,{{\tilde {\user2 {z}}}_i}} \right) - L\left( {f,{{\user2 {z}}_i}} \right).$$

This vector represents the difference in sample-wise loss between predictions made using knockoff data and original data. We define the CPI by taking its expectation:

$${\text{CPI}}\left( {\user2{{X}}^{S} } \right) = {\mathbb{E}}\left[\Delta \right].$$

Note that the CPI is always a function of some feature subset \({\user2{X}}^{S}\). We suppress the dependency for notational convenience moving forward.

To consistently estimate this statistic, it is necessary and sufficient to show that we can consistently estimate the risk of model \(f\). The population parameter \(R(f,\user2{Z})\) is estimated using the empirical risk formula:

$$R_{{{\text{emp}}}} \left( {f,\user2{Z}} \right) = \frac{1}{m}\mathop \sum \limits_{{i = 1}}^{m} L\left( {f,\user2{z}_{i} } \right).$$

The goal in estimating risk is to evaluate how well the model generalizes beyond its training data, so the m samples in the empirical risk formula above constitute a test set drawn independently from \(\user2{Z}\), distinct from the \(n\) samples used to fit \(f\). In practice, this is typically achieved by some resampling procedure like cross-validation or bootstrapping. In what follows, we presume that unit-level loss \(L(f,{\user2 {z}}_{i})\) is always an out-of-sample evaluation, such that \(f\) was trained on data excluding \({\user2{z}}_{i}\).

The empirical risk minimization (ERM) principle is a simple decision procedure in which we select the function f that minimizes empirical risk in some function space \(\mathcal{F}\). A celebrated result of Vapnik and Chervonenkis (1971), independently derived by Sauer (1972) and Shelah (1972), is that the ERM principle is consistent with respect to \(\mathcal{F}\) if and only if the function space is of finite VC dimension. Thus, for any algorithm that meets this minimal criterion, the empirical risk \({R}_{\mathrm{e}\mathrm{m}\mathrm{p}}\left(f,\user2{Z}\right)\) converges uniformly in probability to \(R\left(f,\user2{Z}\right)\) as n \(\to \infty\), which means the estimate:

$$\begin{aligned} \widehat{{{\text{CPI}}}} & = \frac{1}{n}\mathop \sum \limits_{{i = 1}}^{n} \left[ {L\left( {f,{\tilde{{\user2 {z}}}}_{i} } \right) - L\left( {f,{\user2 {z}}_{i} } \right)} \right] \\ & = R_{{{\text{emp}}}} \left( {f,{\tilde{{\user2 {Z}}}}} \right) - R_{{{\text{emp}}}} \left( {f,\user2{Z}} \right) \\ & = \frac{1}{n}\mathop \sum \limits_{{i = 1}}^{n} {{\Delta }}_{i} \\ \end{aligned}$$

is likewise guaranteed to converge. Because the CPI inherits the convergence properties of the learner f, it imposes no additional smoothness, sparsity, parametric, or dimensionality constraints upon the data.
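Computationally, \(\widehat{\text{CPI}}\) is simply the mean of the paired loss differences. The accompanying software is an R package; the following Python sketch, with made-up loss values, is only an illustration of the estimator:

```python
def cpi_hat(loss_knockoff, loss_original):
    # CPI estimate: mean of Delta_i = L(f, z~_i) - L(f, z_i),
    # where both losses are out-of-sample, unit-level evaluations
    # from the SAME fitted model f (no refitting required).
    deltas = [lk - lo for lk, lo in zip(loss_knockoff, loss_original)]
    return sum(deltas) / len(deltas), deltas

# Hypothetical unit-level losses under knockoff and original data.
est, deltas = cpi_hat([0.9, 1.2, 1.1, 0.8], [0.5, 0.7, 0.9, 0.6])
print(round(est, 3))  # -> 0.325
```

A positive estimate indicates that substituting knockoffs for \({\user2{X}}^{S}\) degraded predictive performance, i.e. that the subset carries conditional predictive information.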

Though finite complexity thresholds have been derived for many algorithms—e.g., projective planes, decision trees, boosting machines, and neural networks (Shalev-Shwartz & Ben-David, 2014)—it is worth noting that some popular supervised learners do in fact have infinite VC dimension. This is the case, for instance, with methods that rely on the radial basis function kernel, widely used in support vector machines and Gaussian process regression. The learning theoretic properties of these algorithms are better described with other measures such as the Rademacher complexity and PAC-Bayes bounds (Guedj, 2019). However, as we show in Sect. 4, the CPI shows good convergence properties even when used with learners of infinite VC dimension.

Inference procedures for the CPI can be designed using any paired difference test. Familiar frequentist examples include the t-test and the Fisher exact test, which we use for large- and small-sample settings, respectively. Bayesian analogues can easily be implemented as well. Rouder et al. (2009) advocate an analytic strategy for calculating Bayes factors for t-tests. Wetzels et al. (2009) and Kruschke (2013) propose more general methods based on Markov chain Monte Carlo sampling, although they differ in their proposed priors and decision procedures. Care should be taken when selecting a prior distribution in the Bayesian setting, especially with small sample sizes. Tools for Bayesian inference are implemented in the accompanying cpi package; however, for brevity’s sake, we restrict the following sections to frequentist methods.

3.2 Large sample inference: paired t-tests

By the central limit theorem, empirical risk estimates will tend to be normally distributed around the true population parameter value. Thus we use paired, one-sided t-tests to evaluate statistical significance when samples are sufficiently large (\(n\ge 30\) or thereabouts).

The variable \(\Delta\) has mean \(\widehat{\mathrm{C}\mathrm{P}\mathrm{I}}\) and standard error SE = \(s/\sqrt{n}\), where \(s\) denotes the sample standard deviation:

$$s = \sqrt {\frac{1}{{n - 1}}\mathop \sum \limits_{{i = 1}}^{n} (\Delta _{i} - \widehat{{{\text{CPI}}}})^{2} } .$$

The t-score for \(\widehat{\mathrm{C}\mathrm{P}\mathrm{I}}\) is given by t = \(\widehat{\mathrm{C}\mathrm{P}\mathrm{I}}\)/SE, and we may compute p-values by comparing this statistic to the most tolerant distribution consistent with \({H}_{0}:R(f,\user2{Z})\ge R(f,\stackrel{\sim }{\user2{Z}})\), namely \({t}_{n-1}\). To control Type I error at level \(\alpha\), we reject \({H}_{0}\) for all t greater than or equal to the \((1-\alpha )\) quantile of \({t}_{n-1}\). This procedure can easily be modified to adjust for multiple testing.
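The test reduces to a one-sample, one-sided location test on the \(\Delta_i\). The Python sketch below (standard library only; simulated loss differences, not real data) substitutes the standard normal for \(t_{n-1}\), which is adequate at the large sample sizes this subsection assumes:

```python
import math
import random
from statistics import NormalDist

def cpi_t_test(deltas, alpha=0.05):
    # One-sided paired test of H0: CPI <= 0, run on the unit-level loss
    # differences Delta_i. Uses the standard normal in place of t_{n-1},
    # a large-n approximation (the paper uses the exact t reference).
    n = len(deltas)
    cpi = sum(deltas) / n
    s = math.sqrt(sum((d - cpi) ** 2 for d in deltas) / (n - 1))
    se = s / math.sqrt(n)
    t = cpi / se
    p = 1 - NormalDist().cdf(t)                         # one-sided p-value
    lower = cpi - se * NormalDist().inv_cdf(1 - alpha)  # CI lower bound
    return cpi, t, p, lower

random.seed(0)
deltas = [random.gauss(0.5, 1.0) for _ in range(100)]  # simulated Delta_i
cpi, t, p, lower = cpi_t_test(deltas)
print(round(cpi, 2), p < 0.05)
```

The one-sided confidence interval \([\text{lower}, \infty)\) returned here matches the Wald-type interval discussed below.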

We can relax the assumption of homoskedasticity if reliable estimates of predictive precision are available. Construct a 2n × (n + 1) feature matrix \(\user2{X}\) with columns for each unit i = {1,…,n}, as well as an indicator variable for data type D (original vs. knockoff). Let \(\user2{W}\) be a 2n × 2n diagonal matrix such that \({\user2{W}}_{ii}\) denotes the weight assigned to the ith prediction. For instance, in a regression setting, this could be the inverse of the expected residual variance for i. Then solve a weighted least squares regression, with the response variable \(\user2{y}\) equal to the observed loss for each unit-data type combination:

$${\widehat{\varvec{\gamma}}} = {(\varvec{X}^{\prime}{\varvec{WX}})^{-1}}{\varvec{X}}^{\prime}{\varvec{Wy}}$$

The t-statistic and p-value associated with coefficient \({\widehat{\gamma }}_{D}\) can then be used to test the CPI of the substituted variable(s) under a heteroskedastic error model.

Confidence intervals around \(\widehat{\mathrm{C}\mathrm{P}\mathrm{I}}\) may be constructed in the typical manner. The lower bound is set by subtracting from our point estimate the product of SE and \({F}_{n-1}^{-1}(1-\alpha )\), where \({F}_{n-1}(\bullet )\) denotes the CDF of \({t}_{n-1}\). Using this formula, we obtain a 95% confidence interval for \({\widehat{\mathrm {CPI}}}\) by calculating \(\left[ {\widehat{{{\text{CPI}}}} - {\text{SE}} \times F_{{n - 1}}^{{ - 1}} \left( {0.95} \right),~\infty } \right)\). As n grows large, this interval converges to the Wald-type interval, \(\left[ {\widehat{{{\text{CPI}}}} - {\text{SE}} \times \Phi ^{{ - 1}} \left( {0.95} \right),~\infty } \right)\), where \({\Phi }\) represents the standard normal CDF.

The t-testing framework also allows for analytic power calculations. Let \({t}^{*}\) denote the critical value \({t}^{*}={F}_{n-1}^{-1}(1-\alpha )\). Then Type II error is given by the formula \(\beta ={F}_{n-1}({t}^{*}-\delta )\), where \(\delta\) represents the postulated effect size. Statistical power is just the complement \(1-\beta\), and rearranging this equation with simple algebra allows us to determine the sample size required to detect a given effect at some fixed Type I error \(\alpha\).
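Under a normal approximation to \(t_{n-1}\), the sample-size calculation has a closed form: the noncentrality \(\delta = \text{effect}\sqrt{n}/s\) must reach \(z_{1-\alpha} + z_{\text{power}}\). A Python sketch (the specific effect size and standard deviation are illustrative choices):

```python
import math
from statistics import NormalDist

def required_n(effect, sd, alpha=0.05, power=0.8):
    # Normal-approximation sample size for the one-sided paired test:
    # solve effect * sqrt(n) / sd = z_{1-alpha} + z_{power} for n.
    z = NormalDist().inv_cdf
    n = ((z(1 - alpha) + z(power)) * sd / effect) ** 2
    return math.ceil(n)

# Units needed to detect a mean loss difference of 0.2 (sd = 1)
# with one-sided alpha = 0.05 and 80% power.
print(required_n(effect=0.2, sd=1.0))  # -> 155
```

For moderate n the exact \(t_{n-1}\) quantiles give slightly larger answers; the approximation above is the standard large-sample shortcut.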

3.3 Small sample inference: Fisher exact tests

The applicability of the central limit theorem is dubious when sample sizes are small. In such cases, exact p-values may be computed for a slightly modified null hypothesis using Fisher’s method (1935). Rather than focusing on overall risk, this null hypothesis states that replacing \({\user2{X}}^{S}\) with the knockoff submatrix \({\tilde{\varvec{X}}}^{S}\) has no impact on unit-level loss. More formally, we test against the following:

$$H_0^{{\text{FEP}}}:L\left(f, {\varvec{z}}_{i}\right) \ge L\left(f,\tilde{\varvec{z}}_{i}\right),\quad i = 1, \ldots,n.$$

Under this null hypothesis, which is sufficient but not necessary for the conditional predictive \({H}_{0}\), we may implement a permutation scheme in which the CPI is calculated for all possible assignments of data type D. Consider a 2n × 3 matrix with columns for unit index U = {1,1,…,n,n}, data type D \(\in\) {0,1}, and loss L. We permute the rows of D subject to the constraint that every sample’s loss is recorded under both original and knockoff predictions. For each possible vector D, compute the resulting CPI and compare the value of our observed statistic, \(\widehat{\mathrm{CPI}}\), to the complete distribution. Note that this paired setup dramatically diminishes the possible assignment space from an unmanageable \(\binom{2n}{n}\), corresponding to a Bernoulli trial design, to a more reasonable \({2}^{n}\). The one-tailed Fisher exact p-value (FEP) is given by the formula:

$${\text{FEP}}\left( {\widehat{{{\text{CPI}}}}} \right) = \frac{1}{{2^{n} }}\mathop \sum \limits_{{b = 1}}^{{2^{n} }} {\mathbb{I}}\left( {\widetilde{{{\text{CPI}}}}_{b} \ge \widehat{{{\text{CPI}}}}} \right),$$

where \(\mathbb{I}(\bullet )\) represents the indicator function and \({\stackrel{\sim }{\mathrm{CPI}}}_{b}\) is the CPI resulting from the bth permutation of D.
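For small n, the \({2}^{n}\) assignments can be enumerated outright. Swapping the data-type label within a pair simply negates \(\Delta_i\), so the permutation distribution can be generated by sign flips, as in this Python sketch (with hypothetical loss differences):

```python
from itertools import product

def fisher_exact_p(deltas):
    # Enumerate all 2^n paired assignments of data type D. Flipping the
    # label for unit i negates Delta_i, so each assignment is a sign
    # pattern; count how often the permuted CPI >= the observed CPI.
    n = len(deltas)
    observed = sum(deltas) / n
    count = 0
    for signs in product((1, -1), repeat=n):
        perm = sum(s * d for s, d in zip(signs, deltas)) / n
        if perm >= observed:
            count += 1
    return count / 2 ** n

# All differences positive, so only the identity assignment ties
# the observed CPI: p = 1 / 2^5.
print(fisher_exact_p([0.4, 0.5, 0.2, 0.2, 0.3]))  # -> 0.03125
```

Because the identity assignment is always counted, the smallest attainable p-value is \(1/{2}^{n}\), which is why this exact test is reserved for small samples.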

To construct a confidence interval for \(\widehat{\mathrm{C}\mathrm{P}\mathrm{I}}\) at level \(1-\alpha\), we use our empirical null distribution. Find the critical value CPI* such that FEP(CPI*) = \(\alpha\). Then a \(\left(1-\alpha \right)\times 100\%\) confidence interval for \(\widehat{\mathrm{C}\mathrm{P}\mathrm{I}}\) is given by \([\widehat{\mathrm{C}\mathrm{P}\mathrm{I}}-\mathrm{C}\mathrm{P}{\mathrm{I}}^{\mathrm{*}},\infty )\). For n large, approximate calculations can be made by sampling from the set of \({2}^{n}\) permissible permutations. In this case, however, we add 1 to both the numerator and denominator to ensure unbiased inference (Phipson & Smyth, 2010).

3.4 Computational complexity

To summarize, we outline the proposed algorithm for testing the conditional importance of feature subsets for supervised learners in pseudocode below. This algorithm executes in \(\mathcal{O}(ak+g+h)\) time and \(\mathcal{O}(a+k+g+h)\) space. We take the complexity of the learner a and knockoff sampler g to be given. The empirical risk estimator k can be made more or less complex depending on the resampling procedure. The most efficient option for evaluating generalization error is the holdout method, in which a model is trained on a random subset of the available data and tested on the remainder. Unfortunately, this procedure can be unreliable with small sample sizes. Popular alternatives include the bootstrap and cross-validation. Both require considerable model refitting, which can be costly when a is complex.

The inference procedure h is quite efficient in the parametric case—on the order of \(\mathcal{O}\)(n) for the t-test—but scales exponentially with the sample size when using the permutation-based approach. As noted above, the complexity of the Fisher test can be bounded by setting an upper limit on the number of permutations B used to approximate the empirical null distribution. The standard error of a p-value estimate made using such an approximation is \(\sqrt{{p}^{*}(1-{p}^{*})/B}\), where p* represents the true p-value. This expression is maximized at \({p}^{*}=0.5\), corresponding to a standard error of \(1/(2\sqrt{B})\). Thus, to guarantee a standard error of at most 0.001, it would suffice to use B = 250,000 permutations, an eminently feasible computation on a modern laptop. Space complexity is dominated by the learner a and knockoff sampler g because most resampling procedures refit the same learner a fixed number of times and the inference procedures described above execute in \(\mathcal{O}(1)\) space.
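The number of permutations needed for a target Monte Carlo standard error follows directly from inverting the worst-case formula, as this short Python sketch shows:

```python
import math

def permutations_needed(max_se):
    # Worst case p* = 0.5 gives SE = 1 / (2 * sqrt(B)); invert for B.
    return math.ceil((1 / (2 * max_se)) ** 2)

# Permutations required to guarantee a p-value standard error <= 0.001.
print(permutations_needed(0.001))  # -> 250000
```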

[Figure: pseudocode for the CPI estimation and inference algorithm.]

4 Experiments

All experiments were conducted in the R statistical computing environment, version 3.6.2. Code for reproducing all results and figures can be found in the dedicated GitHub repository: https://github.com/dswatson/cpi_paper

4.1 Simulated data

We report results from a number of simulation studies. First, we analyze the statistical properties of the proposed tests under null and alternative hypotheses. We proceed to compare the sensitivity and specificity of the CPI to those of several alternative measures.

Data were simulated under four scenarios, corresponding to all combinations of independent vs. correlated predictors and linear vs. nonlinear outcomes. Because conditional importance is most relevant in the case of correlated predictors, results for the two scenarios with independent features are left to the Supplement. In the linear setting, ten variables were drawn from a multivariate Gaussian distribution \(\mathcal{N}(0, \Sigma)\), with covariance matrix \(\Sigma_{ij} = 0.5^{|i-j|}\). A continuous outcome \(Y\) was calculated as \(Y = \mathbf{X}\boldsymbol{\beta} + \varepsilon\), where \(\boldsymbol{\beta} = (0.0, 0.1, \dots, 0.9)'\) and \(\varepsilon \sim \mathcal{N}(0, 1)\). In the nonlinear scenario, we keep the same predictors but generate the response from a transformed matrix, \(Y = \mathbf{X}^{*}\boldsymbol{\beta} + \varepsilon\), where

$$x_{ij}^{*} = \begin{cases} +1, & \text{if } \Phi^{-1}(0.25) \le x_{ij} \le \Phi^{-1}(0.75) \\ -1, & \text{otherwise} \end{cases}$$

with the same \(\boldsymbol{\beta}\) and \(\varepsilon\) as in the linear case. A similar simulation was performed for a classification outcome, where the response \(Y\) was drawn from a binomial distribution with probability \([1 + \exp(-\mathbf{X}\boldsymbol{\beta})]^{-1}\) and \([1 + \exp(-\mathbf{X}^{*}\boldsymbol{\beta})]^{-1}\) for the linear and nonlinear scenarios, respectively.
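For concreteness, this data-generating process can be reproduced in a few lines. Python is shown here for illustration; the paper's experiments were run in R.

```python
import numpy as np

rng = np.random.default_rng(2020)
n, p = 1000, 10

# Covariance Sigma_ij = 0.5^|i-j|: correlation decays with the distance
# between feature indices, so neighboring predictors are correlated.
idx = np.arange(p)
Sigma = 0.5 ** np.abs(idx[:, None] - idx[None, :])
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

# Effect sizes increase from 0.0 (a null feature) to 0.9.
beta = np.arange(p) / 10.0
y_linear = X @ beta + rng.normal(size=n)

# Nonlinear scenario: map each entry to +1 inside the interquartile
# range of the standard normal and -1 outside, then reuse beta.
q25 = -0.6744897501960817  # Phi^{-1}(0.25); Phi^{-1}(0.75) = -q25
X_star = np.where((X >= q25) & (X <= -q25), 1.0, -1.0)
y_nonlinear = X_star @ beta + rng.normal(size=n)
```

Since \(\beta_1 = 0\), the first predictor is conditionally independent of the response by construction, which is what makes Type I error measurable in these simulations.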

Knockoffs for all simulated data were generated using the second-order Gaussian technique described by Candès et al. (2018) and implemented in the knockoff package, version 0.3.2 (Patterson & Sesia, 2018).

4.2 Type I and Type II errors

We simulate \({10}^{4}\) datasets with \(n=1000\) observations and compute the CPI using four different learning algorithms: linear/logistic regression (LM), random forest (RF), artificial neural network (ANN), and support vector machine (SVM). Risk was estimated using holdout sampling with a train/test ratio of 2:1. For regression models, we used mean square error (MSE) and mean absolute error (MAE) loss functions; for classification, we used cross entropy (CE) and mean misclassification error (MMCE). We computed p-values via the inference procedures described in Sect. 3, i.e. paired t-tests and Fisher exact tests. For Fisher tests we used \({10}^{4}\) permutations.

Linear and logistic regressions were built using the functions lm and glm, respectively, from the R package stats (R Core Team, 2019). RFs were built using the ranger package (Wright & Ziegler, 2017), with 500 trees. ANNs were built with the nnet package (Venables & Ripley, 2002), with 20 hidden units and a weight decay of 0.1. SVMs were built with the e1071 package (Meyer et al., 2018), using a Gaussian radial basis function (RBF) kernel and \(\sigma =1\). Unless stated otherwise, all parameters were left to their default values. Resampling was performed with the mlr package (Bischl et al., 2016).

Significance levels for all tests were fixed at \(\alpha =0.05\). For each simulation, we calculated the CPI values, Type I errors, Type II errors, empirical coverage, and t-statistics, where applicable. Results for MSE loss with p = 10 are shown in Fig. 1. Similar plots for MAE, CE, and MMCE loss functions are presented in Figs. S1–S10 of the Supplement. Coverage probabilities are shown in Tables S1–S8 of the Supplement.

Fig. 1

Simulation results for continuous outcome with MSE loss and correlated predictors. A: Boxplots of simulated CPI values of variables \({X}_{1},\dots ,{X}_{10}\) with increasing effect size. The red line indicates a CPI value of \(0\), corresponding to a completely uninformative predictor. B: Histograms of simulation replications of t-statistics of variables with effect size \(0\). The distribution of the expected t-statistic under the null hypothesis is shown in red. C: Proportion of rejected hypotheses at \(\alpha =0.05\) as a function of effect size. Results at effect size 0 correspond to the Type I error, at effect sizes \(>0\) to statistical power. The dashed line indicates the nominal level, \(\alpha =0.05\)

For continuous outcomes, the CPI controlled Type I error with all four learners and reached 100% power under all settings, with the exception of the LM on nonlinear data. No observable differences were detected between the MSE and MAE loss functions. We found similar results for categorical outcomes. The CPI controlled Type I error for the MMCE and CE loss functions with all four learners. The LM once again performed poorly on nonlinear data, as expected. The Fisher test had slightly greater power than the t-test, and statistical power was generally greater with CE loss than with MMCE loss.

4.3 Comparative performance

We use the same simulation setup to compare the CPI's performance to that of three other global, nonparametric, model-agnostic measures of CI, each of which relies on identical or stronger testing assumptions:

  • ANOVA: Williamson et al. (2021)’s nonparametric ANOVA-inspired VI.

  • LOCO: Lei et al. (2018)’s leave-one-covariate-out procedure.

  • GCM: Shah and Peters (2020)’s generalized covariance measure.

Unfortunately, software for Hubbard et al. (2018)’s targeted maximum likelihood VI statistic was still under development at the time of testing, and beta versions generated errors. Candès et al. (2018)’s probabilistic knockoff procedure can be extended to nonparametric models, but requires an algorithm-specific VI measure, which not all learners provide. We consider this method separately in Sect. 4.4. Kernel methods do not work with arbitrary algorithms and were therefore excluded. Information theoretic measures typically require a discretization step that does not extend well to high dimensions, and were therefore also excluded. We restrict this section to the regression setting, as none of the other methods considered here are designed for classification problems.

Training and test sets are of equal size, with n \(\in\) {100, 500, 1000}. In each case, we fit LM, RF, ANN, and SVM regressions, as described previously. We estimate the VI of all features on the test set for every model. This procedure was repeated \({10}^{4}\) times. Results for \(n=1000\) and \(p=10\) are plotted in Fig. 2. Similar results for n = {100, 500} and p = {20, 50, 100} are included in the Supplement.

Fig. 2

Comparative performance of VI measures across different simulations and algorithms. Plots depict the proportion of rejected hypotheses at \(\alpha =0.05\) as a function of effect size. Results at effect size \(0\) correspond to Type I error, at effect sizes \(>0\) to statistical power. The dashed line indicates the nominal level, \(\alpha =0.05\). These results were computed using training and test samples of n = 1000 and p = 10. Similar results were obtained for sample sizes of n = {100, 500} and p = {20, 50, 100} (see Supplement, Sect. 2)

All methods have high Type II error rates when fitting an LM to nonlinear data, highlighting the dangers of model misspecification. GCM appears to dominate in the linear setting but struggles to detect VI in nonlinear simulations. LOCO is somewhat conservative, often falling short of the nominal Type I error rate under the null hypothesis. However, the method fails to control Type I error in the case of an ANN trained on nonlinear data. The nonparametric ANOVA generally performs poorly, especially with RF regressions, where Type I error rates reach up to 100%.

The CPI outperforms all competitors with nonlinear data, and achieves greater power than ANOVA or LOCO in the linear case. GCM is the only other method to control Type I error under all simulation settings, but it has nearly zero power with nonlinear data.

4.4 Knockoff filter

To compare the performance of the CPI with that of the original knockoff filter, we followed the simulation procedure described in Sect. 4 of Candès et al. (2018). An \(n = 300 \times p = 1000\) matrix was sampled from a multivariate Gaussian distribution \(\mathcal{N}(0, \Sigma)\), with covariance matrix \(\Sigma_{ij} = \rho^{|i-j|}\). A continuous outcome \(Y\) was calculated as \(Y = \mathbf{X}\boldsymbol{\beta} + \varepsilon\), where \(\varepsilon \sim \mathcal{N}(0, 1)\) and the coefficient vector \(\boldsymbol{\beta}\) contains just 60 nonzero entries, with random signs and variable effect sizes. We vary \(\rho\) with fixed nonzero \(|\beta| = 1\), and vary effect size with fixed \(\rho = 0\).

We train a series of lasso regressions (Tibshirani, 1996) on the data using the original design matrix and tenfold cross-validation to calculate the CPI, and the expanded n × 2p design matrix for the knockoff filter. VI for the latter was estimated using the difference statistic originally proposed by Barber and Candès (2015):

$$W_{j} = \left| {\hat{\beta }_{j} } \right| - \left| {\hat{\beta }_{{j + p}} } \right|,$$

where \({\widehat{\beta }}_{j}\) and \({\widehat{\beta }}_{j+p}\) represent coefficients associated with a feature and its knockoff, respectively, at some fixed value of the Lagrange multiplier \(\lambda\). Variables are selected based on the ATT method described in Sect. 2.3. The hyperparameter \(\lambda\) is tuned via tenfold cross-validation, per the default settings of the glmnet package (Friedman et al., 2010). Power and FDR are averaged over \({10}^{4}\) iterations for each combination of effect size and autocorrelation coefficient. FDR is estimated via the ATT procedure for the knockoff filter and via the Benjamini–Hochberg procedure (Benjamini & Hochberg, 1995) for the CPI.
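The Benjamini–Hochberg step-up rule used for the CPI is simple to implement. A minimal sketch (analogous in spirit to thresholding after R's `p.adjust(method = "BH")`, shown here in Python):

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.10):
    """Boolean mask of hypotheses rejected at FDR level q."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Step-up rule: find the largest k with p_(k) <= (k/m) * q and
    # reject the k smallest p-values.
    thresh = (np.arange(1, m + 1) / m) * q
    below = p[order] <= thresh
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])
        reject[order[: k + 1]] = True
    return reject

# Three small p-values survive at q = 0.10; the large one does not.
assert benjamini_hochberg([0.01, 0.5, 0.02, 0.03], q=0.10).sum() == 3
```

Note that the step-up rule can reject a hypothesis whose p-value exceeds its own threshold, provided a later sorted p-value falls below its threshold; this is what distinguishes it from naive per-test cutoffs.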

The CPI is more powerful than the original knockoff filter for all effect sizes at \(\rho =0\), but less powerful for high autocorrelation of \(\rho =0.5\) (see Fig. 3). Both methods generally control the FDR at the target rate of 10%. The only exceptions are under small effect sizes, where the knockoff filter shows slightly inflated errors. Similar results for p = 2000 are included in the Supplement.

Fig. 3

Power and FDR as a function of effect size and autocorrelation for CPI and knockoff filter. Target FDR is \(10\%\). Results are from a lasso regression with n = 300 and p = 1000. Each point represents \(10^4\) replications. Similar results were obtained for p = 2000 (see Supplement, Sect. 3)

Note that in addition to being a more powerful test under most conditions, the CPI has other important advantages over the ATT. Whereas the latter can only be applied to algorithms with inbuilt feature scoring statistics, the former requires nothing more than a valid loss function. Whereas the ATT struggles to select important variables in low-dimensional settings, the CPI imposes no dimensionality constraints. Finally, the CPI is more informative, insofar as it provides feature-level p-values and confidence (or credible) intervals.

4.5 Real data

In this section, we apply the CPI to real low- and high-dimensional datasets.

4.5.1 Boston housing

We analyzed the Boston housing data (Harrison & Rubinfeld, 1978), which consists of 506 observations and 14 variables. This benchmark dataset is available in the UCI Machine Learning Repository (Dua & Graff, 2017). The dependent variable is the median price of owner-occupied houses in census tracts in the Boston metropolitan area in 1970. The independent variables include the average number of rooms, crime rates, and air pollution.

Using LM and SVM regressions, we computed CPI, standard errors, and t-test p-values for each feature, adjusting for multiple testing using Holm’s (1979) procedure. We used an RBF kernel for the SVM, measured performance via MSE, and used \(5\) subsampling iterations to evaluate empirical risk. The results are shown in Fig. 4. We found significant effects at \(\alpha =0.05\) for the average number of rooms (rm), percentage of lower status of the population (lstat), pupil-teacher ratio (ptratio), and several other variables with both LM and SVM, which is in line with previous analyses (Friedman & Popescu, 2008; Williamson et al., 2021). Interestingly, the SVM assigned a higher CPI value to several variables compared to the LM. For example, the proportion of owner-occupied units built prior to 1940 (age) significantly increased the predictive performance of the SVM but had approximately zero impact on the LM. The reason for this difference might be a nonlinear interaction between rm and age, which was also observed by Friedman and Popescu (2008).
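Holm's step-down adjustment applied here can be sketched as follows (a Python analogue of R's `p.adjust(method = "holm")`):

```python
import numpy as np

def holm_adjust(pvals):
    """Holm (1979) step-down adjusted p-values. Each sorted p-value is
    scaled by the number of hypotheses not yet rejected; a running
    maximum then enforces monotonicity, and values are capped at 1."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * (m - np.arange(m))
    adj_sorted = np.minimum(np.maximum.accumulate(scaled), 1.0)
    adjusted = np.empty(m)
    adjusted[order] = adj_sorted  # restore the original ordering
    return adjusted

# Matches p.adjust(c(0.01, 0.04, 0.03), method = "holm") in R.
assert np.allclose(holm_adjust([0.01, 0.04, 0.03]), [0.03, 0.06, 0.06])
```

Unlike Benjamini–Hochberg, Holm's method controls the family-wise error rate, which is the stricter guarantee appropriate for this small set of feature-level tests.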

Fig. 4

Results of the Boston housing experiment. For each variable in the data set, the CPI value is shown, computed with a linear model and a support vector machine. Whiskers represent standard errors. Non-significant variables at \(\alpha =0.05\) after adjustment for multiple testing are shaded

4.5.2 Breast cancer

We examined gene expression profiles of human breast cancer samples downloaded from GEO series GSE3165. Only the \(94\) arrays of platform GPL887 (Agilent Human 1A Microarray V2) were included. These data were originally analyzed by Herschkowitz et al. (2007) and later studied by Lim et al. (2009). We follow their pre-processing pipeline, leaving 13,064 genes. All samples were taken from tumor tissue and classified into one of six molecular subtypes: basal-like, luminal A, luminal B, Her2, normal-like, and claudin-low.

Basal-like breast cancer (BLBC) is an especially aggressive form of the disease, and BLBC patients generally have a poor prognosis. Following Wu and Smyth (2012), we defined a binary response vector to indicate whether samples are BLBC. Gene sets were downloaded from the curated C2 collection of the MSigDB and tested for their association with this dichotomous outcome.

We trained an RF classifier with \({10}^{4}\) trees to predict BLBC based on microarray data. Second-order knockoffs were sampled using an approximate semidefinite program with block-diagonal covariance matrices of maximum dimension 4000 × 4000. We test the CPI for each of the \(2609\) gene sets in the C2 collection for which at least \(25\) genes were present in the expression matrix. Models were evaluated using the CE loss function on out-of-bag samples.

We calculate p-values for each CPI via the t-test and corresponding q-values using the Benjamini–Hochberg procedure. We identify \(660\) significantly enriched gene sets at q ≤ 0.05, including \(24\) of \(73\) explicitly breast-cancer-derived gene sets and \(6\) of \(13\) gene sets indicative of basal signatures. Almost all top results are from cancer studies or other biologically relevant research (see Fig. 5). These include \(4\) sets of BRCA1 targets (mutations in BRCA1 are known to be associated with BLBC; Turner & Reis-Filho, 2006) and \(4\) sets of ESR1 targets, which are markers for the luminal A subtype (Sørlie et al., 2003).

Fig. 5

Results for the top \(50\) gene sets. For each gene set, the CPI value is shown, computed with a random forest. Whiskers represent standard errors

By comparison, popular pathway enrichment tests like GSEA (Subramanian et al., 2005) and CAMERA (Wu & Smyth, 2012) respectively identify \(137\) and \(74\) differentially enriched pathways in this dataset at 5% FDR. These results are especially notable given that those methods rely on marginal associations between gene expression and clinical outcomes, whereas the CPI is a conditional test with a more restrictive null hypothesis, and should theoretically have less power to detect enrichment when features within a gene set are correlated with others outside it. Despite collinearity between genes, the CPI still identifies a large number of biologically meaningful gene sets differentiating BLBC tumors from other breast cancer subtypes.

5 Discussion

Shah and Peters (2020) have demonstrated that no CI test can be uniformly valid against arbitrary alternatives, a sort of no-free-lunch (NFL) theorem for CI. Bates et al. (2020) prove a similar NFL theorem for constructing knockoff variables, showing that no algorithm can efficiently compute nontrivial knockoffs for arbitrary input distributions. The original NFL theorem for optimization is well-known (Wolpert & Macready, 1997). Together, these results delimit the scope of the CPI. The method is highly modular, in the sense that it works with any well-chosen combination of supervised learner, loss function, and knockoff sampler. However, it is simultaneously constrained by these choices. The CPI cannot be guaranteed to control Type I error or have any power against the null when knockoffs are poorly constructed or models are misspecified. If the selected loss function ignores distributional information from higher moments, then the CPI will as well.

In our experiments, we employed a variety of risk estimators, including cross-validation, subsampling, out-of-bag estimates, and the holdout method. Results did not depend on these choices, suggesting that analysts may use whichever is most efficient for the problem at hand. We also used a number of different learning algorithms and found that all showed good convergence properties in finite samples—even the SVM, which is known to have infinite VC dimension, and therefore violates the consistency criterion cited in Sect. 3.

Computational bottlenecks can complicate the use of this procedure for high-dimensional datasets. It took approximately \(49\) hours to generate second-order knockoffs for the gene expression matrix described in Sect. 4.5.2. However, as noted in Sect. 2.3, knockoff sampling is an active area of research, and it is reasonable to expect future advances to speed up the procedure considerably.

6 Conclusion

We propose the conditional predictive impact (CPI), a global measure of variable importance and general test of conditional independence. It works for regression and classification problems using any combination of knockoff sampler, supervised learning algorithm, and loss function. It imposes no parametric or sparsity constraints, and can be efficiently computed on data with many observations and features. Our inference procedures are fast and effective, able to simultaneously control Type I error and achieve high power in finite samples. We have shown that our approach is consistent and unbiased under minimal assumptions. Empirical results demonstrate that the method performs favorably against a number of alternatives for a range of supervised learners and data generating processes.

We envision several avenues for future research in this area. Localized versions of the CPI algorithm could be used to detect the conditional importance of features on particular predictions. Model-specific methods could be implemented to speed up the procedure. We are currently working on applications for causal discovery and inference, an especially promising direction for this approach.