Understanding complex predictive models with ghost variables

Framed in the literature on Interpretable Machine Learning, we propose a new procedure to assign a measure of relevance to each explanatory variable in a complex predictive model. We assume that we have a training set to fit the model and a test set to check its out-of-sample performance. We propose to measure the individual relevance of each variable by comparing the predictions of the model in the test set with those obtained when the variable of interest is substituted (in the test set) by its ghost variable, defined as the prediction of this variable from the rest of the explanatory variables. In linear models it is shown that, on the one hand, the proposed measure gives results similar to leave-one-covariate-out (loco, at a lower computational cost) and outperforms random permutations, and, on the other hand, it is strongly related to the usual F-statistic measuring the significance of a variable. In nonlinear predictive models (such as neural networks or random forests) the proposed measure identifies the relevant variables in an efficient way, as shown by a simulation study comparing ghost variables with other alternative methods (including loco and random permutations, and also knockoff variables and estimated conditional distributions). Finally, we study the joint relevance of the variables by defining the relevance matrix as the covariance matrix of the vectors of effects on predictions when using every ghost variable. Our proposal is illustrated with simulated examples and the analysis of a large real data set.


Introduction
In a stimulating and provocative paper, Breiman (2001) shook the statistical community by arguing that traditional Statistics was no longer the only way to reach conclusions from data. He noted that, in addition to the Data Modeling Culture (traditional Statistics), an Algorithmic Modeling Culture was emerging. At that time Breiman was almost surely thinking of Machine Learning, and now we could also think of Data Science. The algorithms proposed by the new culture (neural networks (NN), support vector machines, random forests, etc.) often have better predictive accuracy than traditional statistical models (linear regression, logistic regression, etc.). On the other hand, statistical models explain better the relationship between the response and the input variables. In fact, these new-culture algorithms are usually called black boxes.
The need to understand the effects of the variables increases with prediction rules based on Big Data, that is, data sets with a large number of variables, p, and observations, n, and even with p > n, because: (1) in some models, such as NN, the effect of a variable is a nonlinear function of its weights in different linear combinations, making its total effect difficult to see; (2) the effect of a variable strongly correlated with others cannot be well understood without considering their joint effects. The crucial importance of understanding the effect of the variables in algorithmic models has often been recognized; for instance, Ribeiro, Singh, and Guestrin (2016b) state that a vital concern remains: if the users do not trust a model or a prediction, they will not use it. There is a powerful research line on variable importance measures in algorithmic models, whose main goal is to provide interpretability tools shedding light on these black-box models, that has been labeled as eXplainable Artificial Intelligence (XAI) (e.g., Baehrens et al. 2010, Štrumbelj and Kononenko 2010, Biran and Cotton 2017, Nott 2017, Biecek 2018, Delicado 2019, and Miller 2019) or Interpretable Machine Learning (IML) (e.g., Molnar et al. 2018, Molnar 2019, Murdoch et al. 2019, and Greenwell et al. 2019). An often used approach in neural networks (NN) is to look at the derivatives of the prediction function with respect to each variable. See Intrator and Intrator (2001), who introduced the generalized weights in NN, and Montavon et al. (2018) for a survey of this field. However, these procedures have limitations when it comes to showing the general pattern of interactions among variables.
We consider in this article the general framework of a prediction problem involving the random vector (X, Z, Y), X ∈ R^p, Z ∈ R and Y ∈ R, where Y is the response variable that should be predicted from (X, Z). It is assumed that there exist a training sample of size n_1 (used to fit the prediction rule) and a test sample of size n_2 (used for checking the out-of-sample prediction accuracy of the fitted rule). A simple approach to define the importance of the variable Z consists in measuring its contribution to the forecast of the response. To compute this measure the model is fitted twice: first including both X and Z and then excluding Z. This approach is often used to decide whether the variable Z should be included in the model. However, other alternatives are possible. Instead of deleting the variable Z, Breiman (2001) proposed to randomly permute its values in the test sample to create a new variable Z' and to compare the model predictions using the original variable Z and the randomized one Z'. This method has two main advantages over deleting the variable: (1) only one predictive model has to be fitted (the one using all the explanatory variables); and (2) the forecast comparison is made between two models with the same number of variables. A drawback is that Z' is unrelated not only to Y but also to X, so the joint effect of Z with X will be lost. It is worth mentioning that both methods, deleting Z and replacing it by Z', are model agnostic (Ribeiro et al. 2016a) because they can be applied to any predictive model, which is a desirable property. Despite its popularity, using random permutations for interpreting black-box prediction algorithms has received numerous criticisms. See, for instance, Hooker and Mentch (2019), who advocate against this practice.
In this article we propose a third approach to measure the effect of a variable that combines the advantages of deleting and randomizing: replacing the variable by its ghost variable, that is, its prediction from the remaining explanatory variables. The proofs of the results in Sections 3 and 4 are deferred to the appendices. An Online Supplement includes part of the output corresponding to the real data example.
Remark. The term ghost variable has been used in computer science to refer to auxiliary variables that appear in a program to facilitate its specification and verification, for instance, to count the number of iterations of a loop. Once the program is running correctly they can be eliminated from it. We believe that this name has not been used before in Statistics although, of course, the missing-value approach to identify the effect of data points or variables has a long and successful tradition in identifying outliers, influential observations and sensitive points in time series and multivariate analysis.

Relevance for random variables
Let us consider the prediction problem involving the random vector (X, Z, Y), X ∈ R^p, Z ∈ R and Y ∈ R, where Y is the response variable that should be predicted from (X, Z). A prediction function f : R^{p+1} → R has expected loss (or risk) R(f(X, Z), Y) = E(L(f(X, Z), Y)), where L : R × R → R^+ is a loss function measuring the cost associated with predicting Y by f(X, Z); in particular, the quadratic loss is L(y, y') = (y − y')^2.
We consider the problem of measuring the effect of Z on f when predicting Y by f(X, Z). Assume also that we can compute a reduced version of f, say f_p(X), that uses only X. A usual way to check the relevance of the variable Z is to compute R(f_p(X), Y) − R(f(X, Z), Y). An alternative way is to compute the change in the prediction function, E(L(f(X, Z), f_p(X))). These two relevance measures do not necessarily coincide, but they do in a relevant case: under quadratic loss and assuming Y = f(X, Z) + ε, with ε a zero-mean error independent of (X, Z),

R(f_p(X), Y) − R(f(X, Z), Y) = E((f(X, Z) − f_p(X))^2) = E(L(f(X, Z), f_p(X))).

Both measures evaluate the effect of the single variable Z by suppressing it from the prediction function f. We choose to use E(L(f(X, Z), f_p(X))) because it does not depend on the noisy part of the response variable Y.
An alternative approach, which does not require f_p, is to replace Z by an independent copy Z': a random variable with the same marginal distribution as Z but independent from (X, Y), and to define the relevance of Z by E(L(f(X, Z), f(X, Z'))). However, this approach has some limitations. Consider the case of f being additive in X and Z: f(X, Z) = s_1(X) + s_2(Z), with E(s_2(Z)) = 0. Under quadratic loss,

E(L(f(X, Z), f(X, Z'))) = E((s_2(Z) − s_2(Z'))^2) = 2 Var(s_2(Z)).

If additional linearity holds, s_2(Z) = Zβ_z, then this relevance is 2β_z^2 Var(Z). These results are similar to those obtained by Zhu et al. (2015) or Gregorutti et al. (2017). At a first glance these relevance measures (2 Var(s_2(Z)) or 2β_z^2 Var(Z)) seem suitable, but: (1) The relevance of Z would be the same in one case where X and Z are independent (so Z encodes exclusive information about Y) and in another case where X and Z are strongly related (in which case X could make up for the absence of Z). Clearly Z is more relevant in the first case than in the second one.
(2) The replacement of Z by an independent copy Z' implies an alteration of the prediction function f(X, Z). Consider again the simple case of the linear predictor f(X, Z) = β_0 + X^T β_x + Zβ_z. Replacing Z by Z' is equivalent to using the reduced prediction function β_0 + X^T β_x + Z'β_z, where Z'β_z is a zero-mean random variable independent from (X, Y) that does not contribute in any way to the prediction of Y. A better alternative would be to use the reduced version of f given just by f(X, E(Z)) = β_0 + X^T β_x + E(Z)β_z. A more useful replacement for Z is E(Z|X), the ghost variable for Z, that is, the best prediction of Z as a function of X according to quadratic loss. Note that if there is dependence between X and Z, |Z − E(Z|X)| is expected to be lower than |Z − E(Z)|, so f(X, E(Z|X)) is expected to be closer to f(X, Z) than f(X, E(Z)). Therefore, when Z is not available, replacing it by E(Z|X) allows X to contribute a little bit more to the prediction of Y than replacing Z by E(Z). The larger this extra contribution of X, the smaller the relevance of Z in the prediction of Y, which can be measured by

E(L(f(X, Z), f(X, E(Z|X)))).

Observe that replacing Z by E(Z|X) is equivalent to considering the reduced version f_p(X) of f(X, Z), where f_p(x) = f(x, E(Z|X = x)). It should be noted that f(X, E(Z|X)) = E(f(X, Z)|X) for linear functions f(X, Z). It follows that removing Z or replacing it by E(Z|X) are equivalent in multiple linear regression.
Under quadratic loss, if f is additive in X and Z, the resulting relevance measures are E(Var(s_2(Z)|X)) in the additive case and β_z^2 E(Var(Z|X)) in the linear case. Observe that E(Var(s_2(Z)|X)) and β_z^2 E(Var(Z|X)) coincide, respectively, with Var(s_2(Z)) and β_z^2 Var(Z) when X and Z are independent, but otherwise the first two would be preferred to the second ones as relevance measures of Z.
The following Section 3 deals with the practical implementation of the three approaches introduced so far for measuring the relevance of a variable: suppressing it, replacing it by an independent copy, or replacing it by a ghost variable. Particular attention is paid to additive and linear prediction models, and quadratic loss is assumed. Let m(x, z) = E(Y|X = x, Z = z) be the regression function of Y on (X, Z), the best prediction function for Y under quadratic loss. Any prediction function of Y, f(x, z), is also an estimator m̂(x, z) of the regression function m(x, z), and vice versa. So from now on we will talk indistinctly of prediction functions or regression function estimators.

Relevance in data sets
Consider a training sample of n_1 independent realizations of (X, Z, Y), S_1 = {(x_{1.i}, z_{1.i}, y_{1.i}), i = 1, ..., n_1}, and let (X_1, z_1, y_1), with X_1 ∈ R^{n_1×p}, z_1 ∈ R^{n_1×1} and y_1 ∈ R^{n_1×1}, be the matrix representation of the training sample, which is used to estimate the regression function m(x, z) by m̂(x, z). The expected quadratic loss of m̂ (Mean Square Prediction Error) is

MSPE(m̂) = E((Y − m̂(X, Z))^2),

where (X, Z, Y) is independent from the training sample. In order to estimate MSPE(m̂), a test sample is observed: n_2 independent realizations of (X, Z, Y), S_2 = {(x_{2.i}, z_{2.i}, y_{2.i}), i = 1, ..., n_2}, which are also independent from the training sample. Let (X_2, z_2, y_2) be the test sample in matrix format, with X_2 ∈ R^{n_2×p}, z_2 ∈ R^{n_2×1} and y_2 ∈ R^{n_2×1}, and let z'_2 ∈ R^{n_2×1} be a random permutation of the elements of the column vector z_2, with i-th element z'_{2.i}. The estimation of MSPE(m̂) from the test sample is

(1/n_2) Σ_{i=1}^{n_2} (y_{2.i} − m̂(x_{2.i}, z_{2.i}))^2.

This is, under mild conditions, a consistent and unbiased estimator of MSPE(m̂).
In a similar way, the test sample is used to obtain the sampling versions of the three variable relevance measures introduced in Section 2:

Rel_V(Z) = (1/n_2) Σ_{i=1}^{n_2} (m̂(x_{2.i}, z_{2.i}) − m̂_V(i))^2,

where for V = Om, relevance by omission, m̂_V(i) = m̂_p(x_{2.i}); for V = RP, relevance by random permutation, m̂_V(i) = m̂(x_{2.i}, z'_{2.i}); and for V = Gh, relevance by ghost variable, m̂_V(i) = m̂(x_{2.i}, ẑ_{2.i}), where ẑ_{2.i} = Ê(Z|X = x_{2.i}). Scaled versions of these relevance measures can be defined by dividing them by an estimate of the residual variance (the estimated MSPE(m̂), for instance), as in Breiman (2001). We present the results in unscaled form to simplify the presentation. Now we proceed to explore the relevance measures when they are computed in the multiple linear regression model (and, in some cases, in the additive model).
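As an illustration, the following R sketch (not the authors' code; the object model stands for any fitted prediction rule with a predict method, and the column names of the test data frame are illustrative) computes the sampling versions of the relevance by random permutation and by ghost variable:

    # Test-sample relevance of the variable z for a fitted model `model`.
    # Only predict() on the test set is needed, so the computation is model agnostic.
    pred_full <- predict(model, newdata = test)

    # Relevance by random permutation: permute z within the test sample
    test_rp   <- test
    test_rp$z <- sample(test_rp$z)
    rel_rp    <- mean((pred_full - predict(model, newdata = test_rp))^2)

    # Relevance by ghost variable: replace z by its prediction from the other
    # explanatory variables (here a linear model fitted on the test sample)
    ghost_fit <- lm(z ~ . - y, data = test)
    test_gh   <- test
    test_gh$z <- fitted(ghost_fit)
    rel_gh    <- mean((pred_full - predict(model, newdata = test_gh))^2)

The relevance by omission is not included in the sketch because it requires refitting the prediction model without z.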

Relevance by omission in linear regression
Suppose zero-mean variables and the linear regression function m(x, z) = x^T β_x + zβ_z, for which the natural reduced version is m_p(x) = x^T β_0. Let (β̂_x, β̂_z) and β̂_0 be, respectively, the ordinary least squares (OLS) estimated coefficients of both models, and let ŷ_{1.X.z} = X_1 β̂_x + z_1 β̂_z and ŷ_{1.X} = X_1 β̂_0. Let se(β̂_z) be the estimated standard error of β̂_z and let t_z = β̂_z / se(β̂_z) be the standard t-test statistic for the null hypothesis H_0 : β_z = 0. Let F_z = t_z^2 be the F-test statistic for the same null hypothesis. Standard computations in the linear model (see, e.g., Seber and Lee 2003, or check the online supplement, where we have included them for the sake of completeness) lead to the identity

Rel^Train_Om(Z) = F_z σ̂^2 / n_1,

where Rel^Train_Om(Z) = (1/n_1) (ŷ_{1.X.z} − ŷ_{1.X})^T (ŷ_{1.X.z} − ŷ_{1.X}) is the relevance by omission of the variable Z, evaluated in the training sample, and σ̂^2 = (y_1 − ŷ_{1.X.z})^T (y_1 − ŷ_{1.X.z}) / (n_1 − p − 1). That is, measuring the relevance of a variable by its omission in the training sample is equivalent to the standard practice of testing its significance. Moreover, the same computations indicate that Rel^Train_Om(Z) = β̂_z^2 σ̂^2_{z.x,n_1}, where σ̂^2_{z.x,n_1} is a consistent estimator of σ^2_{z.x}, the residual variance in the model Z = X^T α + ε_z, computed from the training sample. This is a sampling version of the expression β_z^2 E(Var(Z|X)) that we obtained in Section 2 as the value of the relevance by omission or by ghost variables at the population level (we saw there that both coincide for linear models).
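The identity above can be checked numerically with a few lines of R on simulated data (a sketch under assumed variable names, not taken from the paper):

    # Check Rel^Train_Om(Z) = F_z * sigma2_hat / n1 in a small linear model
    set.seed(1)
    n1 <- 500
    x1 <- rnorm(n1); x2 <- rnorm(n1)
    z  <- 0.7 * x1 + rnorm(n1)
    y  <- x1 + x2 + 0.5 * z + rnorm(n1)

    fit_full <- lm(y ~ x1 + x2 + z)
    fit_red  <- lm(y ~ x1 + x2)

    rel_om_train <- mean((fitted(fit_full) - fitted(fit_red))^2)
    Fz           <- summary(fit_full)$coefficients["z", "t value"]^2
    sigma2_hat   <- summary(fit_full)$sigma^2
    c(rel_om_train, Fz * sigma2_hat / n1)   # the two numbers coincide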
When using the test sample to compute the relevance by omission we obtain similar results. The vectors of predicted values in the test sample are ŷ_{2.X.z} = X_2 β̂_x + z_2 β̂_z and ŷ_{2.X} = X_2 β̂_0. Therefore, the relevance by omission of the variable Z is

Rel_Om(Z) = (1/n_2) (ŷ_{2.X.z} − ŷ_{2.X})^T (ŷ_{2.X.z} − ŷ_{2.X}).

Then it can be proved (the detailed computation can be found in the online supplement) that

Rel_Om(Z) = F_z (σ̂^2 / n_1) (σ̂^2_{z.x,n_1,n_2} / σ̂^2_{z.x,n_1}),

where σ̂^2_{z.x,n_1,n_2} is another consistent estimator of σ^2_{z.x} that depends on both the training and the test samples (see Proposition 2 in the online supplement).

Relevance by random permutations in the linear and additive models
The use of random permutations implies independence (conditional on the observed test sample) between the replacing variable z'_{2.i} and the other explanatory variables x_{2.i}. We will analyze directly the additive model and show the results for linear regression as a particular case. Assume that an additive model is fitted in the training sample, m̂(x, z) = β̂_0 + Σ_{j=1}^p ŝ_j(x_j) + ŝ_{p+1}(z), with β̂_0 = Σ_{i=1}^{n_1} y_{1.i} / n_1, Σ_{i=1}^{n_1} ŝ_j(x_{1.ij}) / n_1 = 0, j = 1, ..., p, and Σ_{i=1}^{n_1} ŝ_{p+1}(z_{1.i}) / n_1 = 0 for identifiability reasons. These identities are only approximately true when taking averages over the test sample. Observe that

Rel_RP(Z) = (1/n_2) Σ_{i=1}^{n_2} (ŝ_{p+1}(z_{2.i}) − ŝ_{p+1}(z'_{2.i}))^2 ≈ 2 (1/n_2) Σ_{i=1}^{n_2} ŝ_{p+1}(z_{2.i})^2.

The approximation follows from the fact that (1/n_2) Σ_{i=1}^{n_2} ŝ_{p+1}(z_{2.i}) ŝ_{p+1}(z'_{2.i}) has expected value over random permutations equal to {(1/n_2) Σ_{i=1}^{n_2} ŝ_{p+1}(z_{2.i})}^2 ≈ 0 and variance of order O(1/n_2).
In the special case of linear regression, ŝ_j(x_j) = β̂_j (x_j − x̄_{1.j}) for all j = 1, ..., p, and ŝ_{p+1}(z) = β̂_z (z − z̄_1), where x̄_{1.j} = Σ_{i=1}^{n_1} x_{1.ij} / n_1 and z̄_1 = Σ_{i=1}^{n_1} z_{1.i} / n_1. Then, in this case,

Rel_RP(Z) ≈ 2 β̂_z^2 (1/n_2) Σ_{i=1}^{n_2} (z_{2.i} − z̄_1)^2 ≈ 2 β̂_z^2 S^2_z,

where S^2_z is the sample variance of Z in the test sample. Observe that Rel_RP(Z) is not a multiple of the statistic F_z used to test the null hypothesis H_0 : β_z = 0, because the variance of β̂_z, the OLS estimator of β_z, is not a multiple of 1/Var(Z), except in the particular case that X_1 and z_1 are uncorrelated.

Relevance by ghost variables in linear regression
To obtain a model-agnostic proposal, the training sample S_1 should be used only through the estimated prediction function. Therefore we propose to estimate E(Z|X) using the data in the test sample S_2: we fit a simple prediction model (for instance, a multiple linear regression or an additive model) and compute ẑ_{2.X.i} = m̂_z(x_{2.i}), i = 1, ..., n_2. Let us assume now that the regression function of Y is linear and that it is estimated by OLS in the training sample (X_1, z_1, y_1) ∈ R^{n_1×(p+2)}. Let β̂_x and β̂_z be the estimated coefficients, and let σ̂^2 be the estimated residual variance. Let (X_2, z_2, y_2) ∈ R^{n_2×(p+2)} be the test sample in matrix format. We fit by OLS the regression of z_2 on X_2 in the test sample to obtain the ghost values for z_2, ẑ_{2.2} = X_2 α̂_2 = X_2 (X_2^T X_2)^{-1} X_2^T z_2. The vectors of predicted values for the response are ŷ_{2.X.z} = X_2 β̂_x + z_2 β̂_z, when using (X, Z) as predictors, and ŷ_{2.X.ẑ} = X_2 β̂_x + ẑ_{2.2} β̂_z, when using (X, Ẑ_X), that is, replacing Z by the ghost variable. Therefore, the relevance by ghost variables of the variable Z is

Rel_Gh(Z) = (1/n_2) (ŷ_{2.X.z} − ŷ_{2.X.ẑ})^T (ŷ_{2.X.z} − ŷ_{2.X.ẑ}) = β̂_z^2 (z_2 − ẑ_{2.2})^T (z_2 − ẑ_{2.2}) / n_2.

The following result connects this relevance measure with the F-test for β_z. The proof can be found in Appendix A.2.
Theorem 1. Assume that the regression function of Y over (X, Z) is linear, that it is estimated by OLS, and that the ghost variable for Z is also estimated by OLS. Then

Rel_Gh(Z) = F_z (σ̂^2 / n_1) (σ̂^2_{z.x,n_2} / σ̂^2_{z.x,n_1}) = F_z (σ̂^2 / n_1) (1 + O_p(n_1^{-1/2} + n_2^{-1/2})),

where σ̂^2_{z.x,n_2} and σ̂^2_{z.x,n_1} are two consistent estimators of σ^2_{z.x} (the residual variance in the linear regression model Z = X^T α + ε_z), the first one depending on the test sample, and the second one on the training sample.
The parallelism between deleting the variable Z and replacing it by a ghost variable goes further in the linear regression model. In the linear model m_z(x) = x^T α, let α̂_1 and α̂_2 be the OLS estimators of α in the training and test samples, respectively. They are expected to be close to each other, because both are OLS estimators of the same parameter. This expected proximity and the results stated in Appendix A.1 lead us to write

ŷ_{2.X.ẑ} = X_2 β̂_x + ẑ_{2.2} β̂_z = X_2 β̂_x + X_2 α̂_2 β̂_z = X_2 (β̂_x + α̂_2 β̂_z) ≈ X_2 (β̂_x + α̂_1 β̂_z) = X_2 β̂_0 = ŷ_{2.X}.
That is, using the ghost variable ẑ_{2.2} leads to predictions of Y in the test sample similar to those obtained by removing the variable z_1 when fitting the model in the training sample. But using the ghost variable has two clear advantages: first, only one model has to be fitted in the training sample (the model with all the explanatory variables), and second, the estimated prediction function is the only element we have to save from the training sample (and, consequently, our proposal is model agnostic). As a consequence, from now on we will no longer consider the relevance by omission separately.
As a last remark, we would like to emphasize that the relevance by a ghost variable allows us to approximate a very relevant statistic in the linear regression model, namely the F-statistic for testing the null hypothesis H_0 : β_z = 0. This approximation only requires the estimated prediction function obtained from a training sample, and a test sample. Therefore the relevance measure by a ghost variable allows us to evaluate a pseudo F-statistic for any statistical or algorithmic prediction model.
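For a general (possibly nonlinear) prediction model, this pseudo F-statistic can be sketched in R as follows (an illustration under assumed names; resid_var would typically be the test-sample estimate of MSPE(m̂)):

    # Pseudo F-statistic for variable `var`, based on the relevance by ghost variables
    pseudo_F <- function(model, test, var, resid_var, n_train) {
      pred_full <- predict(model, newdata = test)
      # ghost variable: prediction of `var` from the other explanatory variables
      ghost_fit <- lm(reformulate(setdiff(names(test), c("y", var)), response = var),
                      data = test)
      test_gh        <- test
      test_gh[[var]] <- fitted(ghost_fit)
      rel_gh <- mean((pred_full - predict(model, newdata = test_gh))^2)
      # Theorem 1: in the linear model rel_gh is approximately F_z * resid_var / n_train,
      # so inverting that relation gives a pseudo F-statistic for any model
      rel_gh * n_train / resid_var
    }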

Understanding sets of variables: The relevance matrix
The variable relevance measures introduced so far have certain parallels with the usual practice in the detection of influential cases and outliers in regression, where each observation is removed in turn and the effect of these deletions on the values predicted by the model (or on the estimates of the parameters) is computed. Peña and Yohai (1995) introduced the influence matrix by first looking at the influence vectors that measure the effect of deleting each observation on the vector of forecasts for the whole data set, and then computing the n × n symmetric matrix that contains the scalar products between these vectors. Thus the influence matrix has Cook's statistics on its diagonal and the scalar products of these effects outside the diagonal. These authors showed that the eigen-structure of the influence matrix contains useful information to detect influential subsets or multiple outliers avoiding masking effects. We propose a similar idea by computing a case-variable relevance matrix (denoted by A below). We assume first that the variables are substituted by their ghost variables as explained before, and afterwards we will consider the substitution by random permutations.
In this section we deal with all the explanatory variables in a symmetric way and consider the prediction of Y from the p components of X through the regression function m(x) = E(Y|X = x). We assume that a training sample (X_1, y_1) ∈ R^{n_1×(p+1)} has been used to build an estimator m̂(x) of m(x), and that an additional test sample (X_2, y_2) ∈ R^{n_2×(p+1)} is available.
Let X_{2.[j]} = (x_{2.1}, ..., x_{2.(j−1)}, x_{2.(j+1)}, ..., x_{2.p}) be the matrix X_2 without the j-th column, and let H_{2.[j]} be the projection matrix on the column space of X_{2.[j]}. Let x̂_{2.j} = H_{2.[j]} x_{2.j} be the projection of x_{2.j} over the column space of the other columns of X_2. Thus x̂_{2.j} is the j-th ghost variable. Note that alternative regression models (additive models, for instance) could also be used to define the ghost variable. Let X_{2.(j)} = (x_{2.1}, ..., x_{2.j−1}, x̂_{2.j}, x_{2.j+1}, ..., x_{2.p}) be the regressor matrix in the test sample where the j-th variable has been replaced by the j-th ghost variable. We use m̂(X_2) to denote the n_2-dimensional column vector with i-th element equal to the result of applying the function m̂ to the i-th row of X_2. Let Ŷ_2 = m̂(X_2) and Ŷ_{2.(j)} = m̂(X_{2.(j)}).
We define the n_2 × p matrix A whose j-th column is Ŷ_2 − Ŷ_{2.(j)}. Observe that the element a_{ij} of A, i = 1, ..., n_2, j = 1, ..., p, measures the change in the response prediction for the i-th case in the test sample when the j-th variable has been replaced by its ghost variable. Finally, we define the relevance matrix as the p × p matrix

V = (1/n_2) A^T A.

Then, the element (j, k) of V is

v_{jk} = (1/n_2) Σ_{i=1}^{n_2} (m̂(x_{2.i}) − m̂(x_{2.(j).i})) (m̂(x_{2.i}) − m̂(x_{2.(k).i})),

where x_{2.i} and x_{2.(j).i} are, respectively, the i-th row of X_2 and of X_{2.(j)}, that is, the i-th test case with its j-th coordinate x_{2.j.i} replaced by the corresponding element x̂_{2.j.i} of the ghost variable x̂_{2.j}. In particular, the diagonal of the relevance matrix V has elements

v_{jj} = (1/n_2) Σ_{i=1}^{n_2} (m̂(x_{2.i}) − m̂(x_{2.(j).i}))^2 = Rel_Gh(X_j).

The advantage of working with the matrix V, instead of just computing univariate relevance measures, is that V contains additional information in its off-diagonal elements, which we explore through the examination of its eigen-structure.
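A compact R sketch of the construction of A and V (an assumed implementation: ghost variables estimated by linear regressions on the test sample; model and X2 are illustrative names, with X2 a data frame of explanatory variables only):

    # Case-variable relevance matrix A and relevance matrix V = A'A / n2
    relevance_matrix <- function(model, X2) {
      pred_full <- predict(model, newdata = X2)
      A <- sapply(names(X2), function(j) {
        # ghost variable for column j: its prediction from the remaining columns
        ghost <- fitted(lm(reformulate(setdiff(names(X2), j), response = j), data = X2))
        X2_gh      <- X2
        X2_gh[[j]] <- ghost
        pred_full - predict(model, newdata = X2_gh)     # j-th column of A
      })
      V <- crossprod(A) / nrow(X2)
      list(A = A, V = V, eig = eigen(V))                # eigen-structure of V
    }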

The relevance matrix for linear regression
Consider the particular case where m̂(x) is the OLS estimator of a multiple linear regression, m̂(x) = x^T β̂, with β̂ estimated in the training sample. Then the vector of predicted values in the test sample is Ŷ_2 = X_2 β̂. Writing β̂ = (β̂_1, ..., β̂_p)^T, the vector of predicted values when using the j-th ghost variable is

Ŷ_{2.(j)} = X_{2.(j)} β̂ = Ŷ_2 − (x_{2.j} − x̂_{2.j}) β̂_j,

so the j-th column of A is (x_{2.j} − x̂_{2.j}) β̂_j. The relevance matrix is then

V = (1/n_2) A^T A = D_{β̂} G D_{β̂},

where D_{β̂} = diag(β̂_1, ..., β̂_p) and G = (1/n_2) (X_2 − X̂_2)^T (X_2 − X̂_2), X̂_2 being the matrix with each variable replaced by its ghost variable. The elements (j, k) of G and V are, respectively,

g_{jk} = (1/n_2) (x_{2.j} − x̂_{2.j})^T (x_{2.k} − x̂_{2.k})  and  v_{jk} = β̂_j β̂_k g_{jk}.

Observe that, in the regression of x_{2.j} over X_{2.[j]}, σ̂^2_{[j]} = g_{jj} is the residual variance estimation that uses n_2 as the denominator. The following result summarizes the properties of the relevance matrix V and, in particular, its relationship with the partial correlation matrix. The proof can be found in Appendix A.3.
Theorem 2. Let P be the matrix that contains the partial correlation coefficients in the test sample as non-diagonal elements and has −1 in the diagonal. Then

G = −D_{σ̂} P D_{σ̂}, with D_{σ̂} = diag(σ̂_{[1]}, ..., σ̂_{[p]}),

and consequently

V = −D_{β̂} D_{σ̂} P D_{σ̂} D_{β̂}.

Therefore Rel_Gh(X_j), the j-th element of the diagonal of V, admits the alternative expression Rel_Gh(X_j) = β̂_j^2 σ̂^2_{[j]}, and the partial correlation coefficient between variables j and k when controlling by the rest of the variables can be computed as

ρ̂_{jk.R} = −v_{jk} / (β̂_j β̂_k σ̂_{[j]} σ̂_{[k]}).

The expressions for the partial correlation coefficient appearing in Theorem 2 are reminiscent of the well-known formula

ρ̂_{jk.R} = −s^{jk} / (s^{jj} s^{kk})^{1/2},

where s^{jk} is the element (j, k) of S_2^{-1}, the inverse of the covariance matrix S_2 of the test sample X_2. This coincidence, and the observation that s^{jj} is the inverse of (x_{2.j} − x̂_{2.j})^T (x_{2.j} − x̂_{2.j}) (a consequence of the inversion formula for a block matrix; see Appendix A.1), imply the next Corollary.
Corollary 1. Let S_2 be the covariance matrix of the test sample X_2. Then G and V can be written in terms of the elements of S_2^{-1}, so that in the linear case the relevance matrix can be computed without explicitly fitting the ghost-variable regressions.

Relevance matrix by random permutations in linear regression
We analyze now the structure of the relevance matrix when random permutations are used instead of ghost variables. We focus on the case of multiple linear regression. Define Ṽ = (1/n_2) Ã^T Ã, where the j-th column of Ã is Ŷ_2 − X̃_{2.(j)} β̂ = (x_{2.j} − x̃_{2.j}) β̂_j, and x̃_{2.j}, the j-th column of X̃_{2.(j)}, is a random permutation of x_{2.j}. Therefore,

Ṽ ≈ D_{β̂} D_S (R + I_p) D_S D_{β̂},

where D_S = diag(S_1, ..., S_p), S^2_j is the sample variance (computed dividing by n_2) of x_{2.j}, and R is the correlation matrix of the test sample X_2. The "approximately equal to" sign (≈) indicates that cross-product terms involving the permuted columns, such as X̃_2^T X_2 / n_2, are matrices with elements approximately equal to 0, because 0 is their expected value under random permutations. We conclude that the correlation structure of Ṽ coincides with that of the sample correlation matrix R, and it has diagonal elements 2 β̂_j^2 S^2_j = Rel_RP(X_j). We have found analogous expressions for V and Ṽ that allow us to identify the main differences between both relevance matrices. First, V is related to partial correlations, while Ṽ is associated with ordinary correlations. Second, the expression of V includes the estimated residual variances in the regressions of each variable over the rest, while the usual sample variances appear in the expression of Ṽ. These findings suggest that the eigen-structure of Ṽ will probably be related to the principal component analysis of the test-sample explanatory matrix X_2, but hopefully new knowledge can be found when analyzing the eigen-structure of V.
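The analogous computation with random permutations (a sketch under the same assumptions as the previous one) only changes the replacement of each column:

    # Relevance matrix by random permutations, V~ = A~'A~ / n2
    relevance_matrix_rp <- function(model, X2) {
      pred_full <- predict(model, newdata = X2)
      A_tilde <- sapply(names(X2), function(j) {
        X2_rp      <- X2
        X2_rp[[j]] <- sample(X2_rp[[j]])                # random permutation of column j
        pred_full - predict(model, newdata = X2_rp)
      })
      crossprod(A_tilde) / nrow(X2)
    }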

Examples
We analyze now some examples of the performance of the relevance measures we have discussed. In Section 5.1 we use synthetic data, while in Section 5.2 we use real data.

Synthetic data sets
We introduce three simulated examples to show how relevance measures work in practice. The first two examples have only three and five explanatory variables and are toy examples where we can look at the details of the procedure in linear regression and additive models. The third one is a regression example with 200 explanatory variables.

Example 1: A small linear model. Our first example follows the lines of the hypothetical situation outlined in the Introduction. The data come from the linear model Y = X_1 + X_2 + X_3 + ε, where X_1 is independent from X_2 and X_3, which are highly correlated (ρ = 0.95). Additionally it is assumed that (X_1, X_2, X_3) is multivariate normal with zero mean, and that Var(ε) = Var(X_i) = 1, for i = 1, 2, 3.

We generate a training sample of size n_1 = 2000 and a test sample of size n_2 = 1000 according to this setting. Using the training set, we fit a linear model (ŷ = β̂_0 + β̂_1 x_1 + β̂_2 x_2 + β̂_3 x_3), which presents a coefficient of determination R^2 = 0.8326. The t-values corresponding to the estimated coefficients β̂_i, i = 1, 2, 3, are, respectively, 43.549, 14.514 and 13.636, all of them highly significant. The t-value corresponding to β̂_1 is the largest one because the variances of β̂_2 and β̂_3 are larger than that of β̂_1, as X_2 and X_3 are strongly correlated.
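A minimal R sketch of this setting (the seed and the exact implementation are assumptions, not the authors' code):

    # Example 1: X1 independent of (X2, X3), cor(X2, X3) = 0.95, Y = X1 + X2 + X3 + eps
    library(MASS)
    set.seed(1)
    Sigma <- matrix(c(1, 0,    0,
                      0, 1,    0.95,
                      0, 0.95, 1), nrow = 3, byrow = TRUE)
    gen <- function(n) {
      X <- mvrnorm(n, mu = rep(0, 3), Sigma = Sigma)
      data.frame(x1 = X[, 1], x2 = X[, 2], x3 = X[, 3],
                 y  = rowSums(X) + rnorm(n))
    }
    train <- gen(2000)
    test  <- gen(1000)
    fit   <- lm(y ~ x1 + x2 + x3, data = train)   # R^2 close to the reported 0.8326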
We compute the variable relevances, as well as the relevance matrix, using first ghost variables. Figure 1 summarizes our findings. The relevance of each variable is represented in the first plot. We can see that the relevance by ghost variables of X_1 is much larger than that of X_2 and X_3. The second graphic in the first row compares the values of the relevance by ghost variables with the corresponding F-statistics (conveniently transformed). It can be seen that for every explanatory variable both values are almost equal. Blue dashed lines in these two graphics indicate the critical value beyond which an observed relevance cannot be considered null, at significance level α = 0.01. According to Theorem 1, this critical value is computed as F_{1−α} σ̂^2 / n_1, where F_{1−α} is the (1 − α) quantile of the F distribution with 1 and n_1 − p − 1 degrees of freedom, and σ̂^2 is the estimated residual variance.
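In R, under the assumptions just stated (objects n1, p and sigma2_hat holding the training sample size, the number of remaining explanatory variables and the estimated residual variance), this critical value could be obtained as:

    # Critical value for null relevance at significance level alpha = 0.01,
    # following the approximation Rel_Gh(Z) ~ F_z * sigma2_hat / n1 of Theorem 1
    qf(0.99, df1 = 1, df2 = n1 - p - 1) * sigma2_hat / n1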
The methods we propose are not limited to reproducing the significance statistics but, through the relevance matrix, provide information on the joint effect that groups of variables have on the response. This goes beyond the standard output even in the linear model. In the present example we compute the relevance matrix by ghost variables V and observe that v_23 < 0. The negative sign was expected because of Theorem 2, which states that v_23 = −β̂_2 β̂_3 ρ̂_{23.1} σ̂_{[2]} σ̂_{[3]}, where ρ̂_{23.1} is the partial correlation coefficient between the second and third variables, computed in the test sample. Observe that in this example ρ̂_{23.1} is close to the correlation coefficient (0.95) because X_2 and X_3 are independent from X_1, and β̂_j ≈ β_j = 1, j = 2, 3, so v_23 must be negative. From v_22, v_33 and v_23, we compute the correlation between the second and third columns of matrix A and obtain a value of −0.9493. The last plot in the upper row of Figure 1 shows the eigenvalues of matrix V, and the plots in the lower row represent the components of each eigenvector. These plots give the same information as the Principal Component Analysis (PCA) of matrix A: the first PC coincides with the first column of A, the second PC is a multiple of the difference of the second and third columns of A (which have a strong negative correlation), and the third PC, which has an eigenvalue very close to 0, indicates that the sum of the second and third columns of A is almost constant.
In terms of variable relevance, from Figure 1 we draw the following conclusions: (i) the first explanatory variable is the most relevant and it affects the response variable in an isolated way (there is a unique eigenvector of V associated with X_1); (ii) the other two variables have similar relevances, and their relevances are strongly related (there is a pair of eigenvectors associated with X_2 and X_3). Now we study the results obtained when computing the relevance by random permutations, shown in Figure 2 (the structure of this figure is similar to that of Figure 1, with the only exception that now the second plot in the first row shows the relationship between the relevance by ghost variables and the relevance by random permutations). We can see that the three variables have a similar relevance when computing it by random permutations. In this case the three eigenvalues are far from 0. Again there is one eigenvector associated exclusively with X_1, and the other two are linear combinations of X_2 and X_3. Now the first eigenvector has coefficients with the same sign in X_2 and X_3, indicating that the second and third columns of Ã are positively correlated, as expected (because the element (2, 3) of Ṽ has the same sign as the correlation coefficient between X_2 and X_3).
The analysis of the relevance by random permutations has shown that (i) the three explanatory variables have similar relevance, and that (ii) the second and third variables have a joint influence on the response variable. We conclude that the information provided by the ghost variables is more useful than that obtained from random permutations.

Example 2: An additive model. In this example we generate data according to a model that is additive in X_1, (X_2, X_3), X_4 and X_5. The zero-mean residual ε is normal with Var(ε) = 1/4, and the explanatory variables (X_1, X_2, X_3, X_4, X_5) are multivariate normal with zero means and a block-structured covariance matrix.
So there are two independent blocks of regressors, (X 1 , X 2 ) and (X 3 , X 4 , X 5 ), that are linked by the regression function because one of the additive terms jointly depends on two variables, X 2 and X 3 , each belonging to one of the two blocks.
Training and test samples have been generated, with sizes n_1 = 2000 and n_2 = 1000, respectively. An additive model has been fitted to the training set using the function gam from the R library mgcv (Wood 2017) with the right model specification: y~s(x1)+s(x2,x3)+s(x4)+s(x5). The fitted model presents an adjusted coefficient of determination R^2 = 0.873.
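A minimal sketch of this fit in R (assuming data frames train and test with columns x1, ..., x5 and y generated as described above):

    # Fit the additive model of Example 2 with mgcv and predict on the test set
    library(mgcv)
    fit_gam <- gam(y ~ s(x1) + s(x2, x3) + s(x4) + s(x5), data = train)
    summary(fit_gam)$r.sq                      # adjusted R^2, reported above as 0.873
    pred_test <- predict(fit_gam, newdata = test)

The fitted object fit_gam can then be passed to the relevance-matrix sketch given earlier to reproduce analyses of the type shown in Figures 3 and 4.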
Figure 3 shows the results of the relevance by ghost variables. All the explanatory variables have similar relevance. The eigen-structure of the relevance matrix V reveals the following facts. The first and fourth eigenvectors are associated mainly with the effects of X_1 and X_2 on the fitted function, while the other three eigenvectors are associated with X_3, X_4 and X_5. So the presence of two separate blocks of regressors is clear. Moreover, in the second block, the role of X_3 is different from those of X_4 and X_5 (see eigenvectors 2 and 3).
Regarding relevance by random permutations, the results are shown in Figure 4. In this case the relevance of X_2 and X_3 is larger than that of the other explanatory variables. The eigen-structure of the relevance matrix Ṽ is different from that of V. The first eigenvector indicates that variables X_2 and X_3 jointly affect the response Y (something already known because the model includes the term s(x2, x3)). Eigenvectors 2, 3 and 4 point out that it is difficult to separate the effects of all the regressors on the response. Finally, the fifth eigenvector is an indication that X_4 and X_5 have a certain differentiated influence on Y.
In this example we have seen again that the results from using ghost variables are more useful than those from random permutations: in Figure 3 we discover the two existing separate blocks of explanatory variables, whereas in Figure 4 we did not find anything new, as the joint influence of X_2 and X_3 over Y was already included in the fitted model.

Example 3: A large linear model
We consider now a linear model with 200 standard normal explanatory variables, divided into four independent blocks of 50 variables each: X_{i.j}, i = 1, ..., 4, j = 1, ..., 50. The first and third blocks contain mutually independent variables, while the variables within the second and fourth blocks are correlated. Only the variables in the first two blocks enter the regression function, with coefficients 1/2 for the first block and 1 for the second one; blocks 3 and 4 are unrelated to the response. The variable relevances for the 200 explanatory variables are computed, first using ghost variables and then by random permutations. Figure 5 summarizes the findings. The top left panel indicates that the relevance by ghost variables (quite similar to the transformed F-statistics) is larger for the 50 variables in the first block (uncorrelated variables related with Y) than for those in the second block (correlated variables related with Y), while the rest of the variables (blocks 3 and 4, both independent from Y) are almost irrelevant. The relevance by ghost variables takes into account the dependence structure, detects that each variable in the first block contains exclusive information about the response (this is not the case in the second block), and concludes that variables in the first block are more relevant than those in the second one, even if the corresponding coefficients are 1/2 and 1, respectively. Conclusions from relevance by random permutations are different (top right panel): this method assigns larger relevance to variables in the second block than to those in the first one, just because the corresponding coefficient is larger for block 2 than for block 1, without taking into account the dependence structure among the explanatory variables.
The panels in the second row of Figure 5 show the eigenvalues of the relevance matrices V (left) and Ṽ (right). In the left panel there are marked steps at eigenvalues 50 and 99. The first 50 eigenvalues correspond to eigenvectors that are linear combinations of the first 50 columns of matrix A (first block of explanatory variables), and the following 49 eigenvalues correspond to columns 51 to 100 (second block). There are two additional eigenvectors (179 and 180, not reported here, with eigenvalues very close to zero) having almost constant weights in the 50 variables of the second block (and also significant weight in a few variables of blocks 3 and 4): they indicate that the sum of columns 51 to 100 of matrix A is almost constant. (This is coherent with the spectral analysis, not reported here, of the covariance of columns 51 to 100 of matrix A: the last eigenvalue is almost 0, much smaller than the 49th, and the corresponding eigenvector has almost constant coordinates.) Finally, the eigenvalues 100 to 200 are almost null. They correspond mainly to variables in blocks 3 and 4, which are irrelevant for Y.
In the central right panel we can see that there is a main eigenvalue in matrix Ṽ, corresponding to an eigenvector (not reported here) that is proportional to the sum of the columns 51 to 100 of matrix Ã. The eigenvalues 2 to 101 correspond to eigenvectors (not reported here) that are linear combinations of several of the first 100 columns of Ã. The remaining eigenvalues (almost null) correspond to variables in blocks 3 and 4.
The last two plots in Figure 5 show the dendrograms corresponding to two hierarchical clusterings of the 200 explanatory variables. In the first one (bottom left panel) the squared elements of the relevance matrix V are taken as similarities between variables, while matrix Ṽ has been used in the second one (bottom right panel). Both dendrograms are able to detect blocks of variables 1 and 2, while blocks 3 and 4 appear as a unique cluster.

A real data example: Rent housing prices
We present a real data example on rental housing, coming from the Spanish real estate search portal Idealista (www.idealista.com), which allows customers to search for housing based on several criteria (geography, number of rooms, price, and various other filters) among the offers posted by other customers.
We started working from the data set downloaded from www.idealista.com by Alejandro Germán-Serrano on February 27th, 2018 (available at https://github.com/seralexger/idealista-data; accessed April 12th, 2019). This data set contained 67201 rows (posted rental offers active at the download date) and 19 attributes, corresponding to all cities in Spain. Some offers were activated for the first time in 2010.
We have selected the posts corresponding to Madrid and Barcelona (16480 rows) and finally work with the 17 variables (some of them already in the original data set, others calculated from the original attributes) listed in Table 1.
In order to predict the logarithm of the price as a function of the other 16 explanatory variables, we have fitted three predictive models: a linear regression, an additive model, and a neural network. For each model, the relevance of the variables has been computed by ghost variables and by random permutations (70% of the data are used as the training set, and the rest as the test set). For both the linear and the additive model, the standard output offers rich information about the statistical significance of each explanatory variable. In these cases the relevance analysis provides complementary information that in most cases confirms the standard one, although the relevance matrix can shed new light. The situation is different for the neural network model: in this case the relevance analysis provides genuinely new insights into the importance of the explanatory variables, or groups of them.
We show below the results for the linear model and for the neural network.The results for the additive model are not included here (they are accessible as online material) because they are not very different from those obtained in the linear model.

Linear model
Table 2 shows the standard output of the fitted linear model (we have used the R function lm). Only three variables are not significant at level 0.001. There are 7 variables with t-value larger than 10 in absolute value: Barcelona, categ.distr, floor, log.size (this one being the most significant variable), rooms, bathrooms, and log activation. The adjusted coefficient of determination R^2 is 0.7599.
The relevance by ghost variables results are shown in Figure 6. We can see (first row, first column plot) that the 7 most relevant variables are the seven cited before (those with the largest t-values in absolute value). This is a consequence of the existing relation (Theorem 1) between the relevance by ghost variables and the F-values (the squares of the t-values) in the linear model. This plot shows that log.size is the most relevant variable, followed by categ.distr, bathrooms and Barcelona. The relevance of rooms and floor is much lower.

Table 1: List of variables used in the rent housing prices example. The response variable is price (in logarithms), and the others are the explanatory variables. The district price level indicator categ.distr has been computed in Barcelona and Madrid separately. For each district (there are N_B = 10 districts in Barcelona and N_M = 21 in Madrid) the third quartile of price is computed. Then these N values (where N = N_B or N_M) are classified by their own quartiles, and the values −3, −1, 1 and 3 are assigned to them accordingly. Finally this district value is used to define categ.distr for all the houses in each district.
price: Monthly rental price, the response variable (in logs).
Barcelona: 1 for houses in Barcelona, 0 for those in Madrid.
categ.distr: An indication of the district price level, taking the values −3, −1, 1 and 3. See the caption for details.
type.chalet, type.duplex, type.penthouse, type.studio: These 4 variables are the binarization of the original variable type, with 5 levels: flat (the most frequent), chalet, duplex, penthouse and studio.
floor: Floor where the house is located.
hasLift: 1 if the house has a lift, 0 otherwise.
floorLift: abs(floor)*(1-hasLift).
size: Surface, in square meters.
exterior: 1 if the house is exterior, 0 otherwise.
rooms: Number of bedrooms.
bathrooms: Number of bathrooms.
hasParkingSpace: 1 if the house has a parking space, 0 otherwise.
ParkingInPrice: 1 if the parking space is included in the price, 0 otherwise.
log activation: Logarithm of the number of days since the first activation of the post.

Regarding the analysis of the relevance matrix V, only the eigenvectors explaining more than 1% of the total relevance are plotted. The first eigenvector accounts for 60% of the total relevance, and it is associated with the size of the houses (log.size, bathrooms and rooms). The second eigenvector (15% of the total relevance) is mostly related with the district price level (categ.distr) and with some small interaction of bathrooms and rooms. The third eigenvector is a combination of the six most relevant variables, with bathrooms having the largest weight. The other 4 eigenvectors (from 4th to 7th) seem to be associated mainly with one explanatory variable each (log activation, Barcelona, floor and rooms, respectively), with little interaction with the others.
The analysis of relevance by random permutations (Figure 7) agrees, in general terms, with the previous one, although there are small differences. Now the most relevant variables are log.size and bathrooms, with categ.distr, rooms, log activation, and Barcelona somewhat relegated. There are 6 eigenvectors each responsible for more than 1% of the total relevance. Each one is mainly associated with one of the previously mentioned 6 explanatory variables, in the order of citation. The main difference with the analysis by ghost variables is that, when using random permutations, no interactions between explanatory variables are detected.

Neural network
A one-hidden-layer neural network has been fitted using the nnet function from the R package nnet (Venables and Ripley 2002). The explanatory variables were centered and scaled before the fitting. The tuning parameters, size (number of neurons in the hidden layer) and decay, were chosen with caret (Kuhn 2018) by 10-fold cross-validation (in the training set). The candidate values for size were 10, 15, and 20, and those for decay were 0, 0.1, 0.3, and 0.5.
Finally the chosen values were size=10 and decay=0.5.With these values, the whole training sample was used to fit the neural network and the results were stored in the object nnet.logprice.
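A sketch of this tuning step (an assumed reconstruction, not the authors' code; the data frame train_set and the response name log.price are illustrative):

    # 10-fold CV over the candidate values of size and decay, as described above
    library(caret)
    set.seed(1)
    grid <- expand.grid(size = c(10, 15, 20), decay = c(0, 0.1, 0.3, 0.5))
    ctrl <- trainControl(method = "cv", number = 10)
    nnet.logprice <- train(log.price ~ ., data = train_set, method = "nnet",
                           tuneGrid = grid, trControl = ctrl,
                           preProcess = c("center", "scale"),
                           linout = TRUE, trace = FALSE, maxit = 500)
    nnet.logprice$bestTune   # reported above as size = 10, decay = 0.5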
Table 3 shows the scarce information obtained when printing the output of the nnet function. Additionally, a quantity similar to the coefficient of determination for this fit has been computed and printed, with a value of 0.79. It is a little bit larger than for the linear and the additive models, so the neural network provides the best fit in the sense of maximum coefficient of determination. Note that the output in Table 3 does not provide any insight into which explanatory variables are most responsible for that satisfactory fit. Relevance measures will be of help in this respect.
The results on relevance by ghost variables for the fitted neural network are shown in Figure 8. In addition to the 7 variables that were relevant for the linear and additive models, now the 4 variables related with the type of house appear with small but non-null relevance. The same can be said for the variable ParkingInPrice. Again, the two most relevant variables are log.size and categ.distr, in this order.
There are 10 eigenvectors with percentages of explained relevance greater than 1% (the first 9 are shown in Figure 8). The first eigenvector (47% of the total relevance) corresponds almost entirely to log.size, and the second one (20%) to categ.distr (in fact, the first 3 eigenvectors are similar to those in the linear model, Figure 6). The variables bathrooms and rooms appear together in several eigenvectors, always accompanying other variables. Barcelona and log activation present a similar joint behavior. The variables referring to the different types of houses appear in eigenvectors 4th to 9th (except the 6th one). The 7th eigenvector is mainly related with floor.
Let us analyze now the relevance by random permutations (Figure 9). In this case all the variables, except type.chalet, have non-null relevance. The results are quite different from what we have seen before. Now the most relevant variable is type.penthouse, followed first by floor and bathrooms, and then by log activation, ParkingInPrice, rooms, categ.distr, Barcelona, and type.studio. Interestingly, log.size has a very low relevance now, contrary to what we have seen before.
The variables type.penthouse, floor and bathrooms dominate the first 3 eigenvectors of the relevance matrix Ṽ (jointly accounting for almost 60% of the total relevance). Eigenvectors 4 to 9 seem to be each related with just one explanatory variable.
When computing relevance by ghost variables, the most relevant variable is log.size, while by random permutations the most relevant ones are type.penthouse, floor and bathrooms. In order to determine which of these two methods better identifies the really important variables, we compute the relevance by omission of these four variables, removing them (one at a time) from the set of explanatory variables and then training the neural network again (observe that these computations are time consuming, because the optimal tuning parameters have to be selected each time we remove an explanatory variable). The most important variables are those with a larger relevance by omission. Table 4 shows these values. It follows that the most important variable (according to relevance by omission) is log.size, the same result obtained with relevance by ghost variables. The other variables (those labeled as relevant by random permutations) have a very low value of relevance by omission, compared with log.size. The last row of Table 4 contains the relevance by omission when removing simultaneously the three most relevant variables according to random permutations. It follows that the relevance by omission of this group of variables is still much lower than that of log.size alone.

Summarizing the relevance results for the neural network, it can be said that they present certain differences with respect to the linear and additive models. First, the results with ghost variables and with random permutations are completely different. Second, the variables found as most important with random permutations were among the least important in the linear and additive models. Third, in the relevance by ghost variables more variables appear as relevant, but the order of the variables by relevance does not change with respect to the results of the linear and additive models. Fourth, the random permutation approach does not seem to be useful for NN in this example, whereas the ghost variable approach finds the same results as the relevance by omission with much higher computational efficiency.
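A sketch of the relevance-by-omission computation behind Table 4 (reusing the tuning objects grid and ctrl from the previous sketch; the data frame names and the response name log.price are illustrative):

    # Retrain the network without one variable (or a set of variables) and
    # compare test-set predictions with those of the full model.
    rel_omission <- function(vars, train_df, test_df, full_pred) {
      f <- reformulate(setdiff(names(train_df), c("log.price", vars)),
                       response = "log.price")
      m <- caret::train(f, data = train_df, method = "nnet",
                        tuneGrid = grid, trControl = ctrl,
                        preProcess = c("center", "scale"),
                        linout = TRUE, trace = FALSE, maxit = 500)
      mean((full_pred - predict(m, newdata = test_df))^2)
    }
    # e.g., rel_omission("log.size", train_set, test_set, predict(nnet.logprice, test_set))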

Conclusions and further research
We have defined the relevance of a variable in a complex model by its contribution to out-of-sample prediction and proposed a new way to measure this contribution: to compare the predictions of the model that includes this variable with those of a model in which this variable is substituted by its ghost variable, that is, an estimation of the variable using the other explanatory variables. We have also shown that this approach is more useful than the usual procedures of deleting the variable or using a random permutation of it. We have proved that in linear regression this approach is related to the F-statistic used to check the significance of the variable, and therefore the computation of the relevance of a variable in a complex prediction model is an extension of the significance concept to other models in which this concept is not usually considered. With many dependent explanatory variables, the relevance of a variable by itself is less useful and it is important to consider the joint contribution of sets of variables to the out-of-sample prediction. We have introduced the relevance matrix as a way to find groups of related variables with important joint effects on out-of-sample prediction. In the linear model, we have proved the relationship between the relevance matrix and the matrix of partial correlations of the explanatory variables.
The research in this paper has opened many problems for further investigation. As was indicated in the Introduction, the ghost variables are related to the knockoff variables used for controlling the false discovery rate in large regression problems, and this possible connection, as well as the use of the relevance matrix as a screening device, requires further research. Also, there are some aspects of the application of these ideas to neural network prediction functions that should be further analyzed. First, the estimation of the ghost variables can also be carried out by NN models, to allow for nonlinear relations among the explanatory variables. See for instance Sánchez-Morales et al. (2019) for the estimation of missing values in NN. Second, a comparison of this approach with the local sensitivity analysis that looks at the derivatives of the prediction function, often used in NN, will be important. Third, the extension of these ideas to binary (or multi-class) classification is an important problem for discrimination.

A.1

For the linear model, the OLS coefficients of the full model (fitted on (X, Z)) and of the reduced model (fitted on X alone) in the training sample are linked by β̂_0 = β̂_x + α̂ β̂_z, where α̂ is the OLS coefficient vector of the regression of z_1 on X_1 and k = (z_1 − X_1 α̂)^T (z_1 − X_1 α̂) denotes the corresponding residual sum of squares, and finally β̂_x = β̂_0 − α̂ β̂_z.

A.2 Proof of Theorem 1
By definition, the relevance by a ghost variable of the variable Z is

Rel_Gh(Z) = β̂_z^2 (z_2 − ẑ_{2.2})^T (z_2 − ẑ_{2.2}) / n_2 = β̂_z^2 k_Gh / n_2,

where k_Gh = (z_2 − ẑ_{2.2})^T (z_2 − ẑ_{2.2}) and k has been defined in Appendix A.1. The proof concludes by observing that σ̂^2_{z.x,n_2} = k_Gh / n_2 is an estimator of the residual variance in the linear regression model Z = X^T α + ε_z, as are also k / (n_1 − p) and σ̂^2_{z.x,n_1} = k / n_1 = ((n_1 − p) / n_1) (k / (n_1 − p)). The expression involving the O_p notation is derived by standard arguments for the limit of a quotient.

A.3 Proof of Theorem 2
We start by proving that the matrix G has generic non-diagonal element g_{jk} = −ρ̂_{jk.R} σ̂_{[j]} σ̂_{[k]} for j ≠ k, where ρ̂_{jk.R} is the partial correlation coefficient between variables j and k when controlling for the rest of the variables, and σ̂^2_{[j]} is the j-th element in the diagonal of G. That is, we have to prove that the cosine of the angle between the vectors of residuals x_{2.j} − x̂_{2.j} and x_{2.k} − x̂_{2.k} is equal to minus the cosine of the angle between the vectors of residuals a = x_{2.j} − x̂_{2.j.R} and b = x_{2.k} − x̂_{2.k.R}, obtained when regressing x_{2.j} and x_{2.k}, respectively, over R = X_{2.[jk]}, the matrix with columns x_{2.1}, ..., x_{2.p} except x_{2.j} and x_{2.k}. We now use the notation P_U(x) to denote the projection of the vector x onto the linear space U, and {R, x} for the subspace generated by the columns of the matrix R and the vector x. Observe that

x̂_{2.j} = P_{{R, x_{2.k}}}(x_{2.j}) = P_{{R, b}}(x_{2.j}) = P_{{R, b}}(x̂_{2.j.R} + a) = P_{{R, b}}(x̂_{2.j.R}) + P_{{R, b}}(a) = x̂_{2.j.R} + P_b(a).

Therefore

x_{2.j} − x̂_{2.j} = x_{2.j} − x̂_{2.j.R} − P_b(a) = a − P_b(a),

and, by the same argument, x_{2.k} − x̂_{2.k} = b − P_a(b). Lemma 1 below then gives cos(α(a − P_b(a), b − P_a(b))) = −cos(α(a, b)) = −ρ̂_{jk.R}, which proves the claim.

Proposition 2. Assume that the regression function of Y over (X, Z) is linear and that it is estimated by OLS. Then

Rel_Om(Z) = F_z (σ̂^2 / n_1) (σ̂^2_{z.x,n_1,n_2} / σ̂^2_{z.x,n_1}),

where σ̂^2_{z.x,n_1,n_2} and σ̂^2_{z.x,n_1} are two estimators of σ^2_{z.x} (the residual variance in the linear regression model Z = X^T α + ε_z), the first one depending on both the training sample and the test sample, and the second one (also appearing in Proposition 1) only on the training sample.

Proof. The vectors of predicted values in the test sample are ŷ_{2.X.z} = X_2 β̂_x + z_2 β̂_z and, using equation (2) in the proof of Proposition 1, ŷ_{2.X} = X_2 β̂_0 = X_2 (β̂_x + α̂_1 β̂_z) = X_2 β̂_x + ẑ_{2.1} β̂_z, where ẑ_{2.1} = X_2 α̂_1 is the prediction of z_2 using the linear model fitted in the training sample to predict Z from X. Therefore, using the same notation as in Proposition 1, the relevance by omission of the variable Z is

Rel_Om(Z) = β̂_z^2 (z_2 − ẑ_{2.1})^T (z_2 − ẑ_{2.1}) / n_2 = β̂_z^2 k_Om / n_2,

where k_Om = (z_2 − ẑ_{2.1})^T (z_2 − ẑ_{2.1}) and k has been defined in the Appendix A.1 of the paper. Observe that both k_Om / n_2 and k / (n_1 − p) are consistent estimators of the residual variance in the linear regression model Z = X^T α + ε_z. The proof concludes when defining σ̂^2_{z.x,n_1,n_2} = k_Om / n_2 and using σ̂^2_{z.x,n_1} defined in Proposition 1. The expression involving the O_p notation is derived by standard arguments for the limit of a quotient.

Figure 1: Relevance by ghost variables in Example 1. The blue dashed lines mark the critical values for testing null relevance, α = 0.01. The dashed red lines are placed at 0.

Figure 2: Relevance by random permutations in Example 1. The dashed red lines are placed at 0.

Figure 3: Relevance by ghost variables in Example 3.

Figure 6: Rent housing prices: Relevance by ghost variables for the linear model. Only the eigenvectors explaining more than 1% of the total relevance are plotted.

Figure 7: Rent housing prices: Relevance by random permutations for the linear model. Only the eigenvectors explaining more than 1% of the total relevance are plotted.

Figure 8: Relevance by ghost variables for the neural network model.

Figure 9: Relevance by random permutations for the neural network model.

Lemma 1. Let a and b be two non-null vectors of R^d. Let P_b(a) be the projection vector of a over b, and let α(a, b) be the angle between a and b. Then cos(α(a − P_b(a), b − P_a(b))) = −cos(α(a, b)).

Proof. Given that cos(α(a, b)) = a^T b / (||a|| ||b||) and P_b(a) = cos(α(a, b)) ||a|| (b / ||b||) = (a^T b) b / ||b||^2, it follows that a^T P_a(b) = P_b(a)^T b = a^T b, P_b(a)^T P_a(b) = cos^2(α(a, b)) a^T b and ||P_b(a)||^2 = P_b(a)^T P_b(a) = cos^2(α(a, b)) ||a||^2. Therefore (a − P_b(a))^T (b − P_a(b)) = −a^T b (1 − cos^2(α(a, b))), while ||a − P_b(a)||^2 = ||a||^2 (1 − cos^2(α(a, b))) and ||b − P_a(b)||^2 = ||b||^2 (1 − cos^2(α(a, b))), and the result follows by taking the quotient.

Figure 10: Rent housing prices: Relevance by ghost variables for the additive model. Only the eigenvectors explaining more than 1% of the total relevance are plotted. The order of the variables is the same as in the additive model output in Table 5.

Figure 11: Rent housing prices: Relevance by random permutations for the additive model. Only the eigenvectors explaining more than 1% of the total relevance are plotted. The order of the variables is the same as in the additive model output in Table 5.

Table 2: Rent housing prices: Standard output of the linear model.

Table 3: Rent housing prices: Output of the neural network model.

Table 4: Relevance by omission of several explanatory variables, and of a group of them, when fitting neural network models.

Table 5: Rent housing prices: Standard output of the additive model.