Evaluating Feature Attribution Methods in the Image Domain

Feature attribution maps are a popular approach to highlight the most important pixels in an image for a given prediction of a model. Despite a recent growth in popularity and available methods, little attention is given to the objective evaluation of such attribution maps. Building on previous work in this domain, we investigate existing metrics and propose new variants of metrics for the evaluation of attribution maps. We confirm a recent finding that different attribution metrics seem to measure different underlying concepts of attribution maps, and extend this finding to a larger selection of attribution metrics. We also find that metric results on one dataset do not necessarily generalize to other datasets, and methods with desirable theoretical properties such as DeepSHAP do not necessarily outperform computationally cheaper alternatives. Based on these findings, we propose a general benchmarking approach to identify the ideal feature attribution method for a given use case. Implementations of attribution metrics and our experiments are available online.


Introduction
Deep neural networks have for some years been the state of the art for a number of predictive tasks, such as image classification (Krizhevsky et al., 2012; Simonyan and Zisserman, 2015; He et al., 2015), language modeling (Vaswani et al., 2017; Brown et al., 2020) and reinforcement learning (Mnih et al., 2013; Silver et al., 2017; Lillicrap et al., 2019). This has led to their widespread adoption in many areas of machine learning. Such models are, however, notorious for their black box nature: due to the large numbers of parameters and complex neural architectures, their predictions become very difficult or even impossible to understand. Interpretability of predictive models is a very useful property for many different reasons: it allows us to extract understandable knowledge from large datasets, potentially leading to new knowledge about the data itself, debug models when they fail, and explain predictions to end users to build trust in the system (Doshi-Velez and Kim, 2017). In some cases, the ability to explain predictions is crucial for model deployment.
For this reason, a number of different techniques have been proposed to try to make neural networks more explainable. Proposed approaches include extracting interpretable rules (Ribeiro et al., 2018), counterfactual explanations (Wachter et al., 2017; Dandl et al., 2020), model distillation (Liu et al., 2018), and feature attribution (Ribeiro et al., 2016; Selvaraju et al., 2017; Sundararajan et al., 2017). In this work, we focus on the latter type of explanation. Feature attribution explanations are among the most popular techniques for explaining image classification models, because they can easily be visualized as a heatmap showing which pixels in an image are important (in the case of color images, the attribution value of a pixel can be defined to be the average or maximum absolute value of its three color components). Feature attribution techniques can also be used to measure the importance of hidden neurons or layers (Shrikumar et al., 2017; Selvaraju et al., 2017), although the focus in this work is on pixel attribution in the image domain.
The exact task of feature attribution can be interpreted in different ways, leading to some discussion about which properties a feature attribution method should satisfy (Chen et al., 2020). Feature attribution methods can roughly be divided into four categories using two properties: local vs. global, and model- vs. data-centric. The first property concerns the locality of the explanation: in local feature attribution, we map the importance of features in a given sample, whereas in global feature attribution, we map the importance of features for all samples in a dataset (also called feature importance). The second property concerns the target of the explanation: in model-centric feature attribution, we are interested in the importance of features for a specific model, whereas in data-centric feature attribution, we try to measure the informativeness of features in the data, independent of any specific model. The latter can be estimated using classical statistical or information-theoretical techniques (Chandrashekar and Sahin, 2014). Model- and data-centric feature attributions are not necessarily the same, as a model can often make predictions using only a subset of the informative features, or even using features that are generally non-informative (in which case the model is overfitting). In this work, we specifically evaluate local, model-centric feature attributions.
Because of the desire for model explanations and the widespread popularity of deep neural networks in the domain of image classification, many feature attribution methods have been proposed in recent years. These methods can roughly be divided into backpropagation-based and perturbation-based methods (Ancona et al., 2018). Each attribution method creates a different explanation for the same prediction. This has naturally led to the question of explanation quality: which methods work best? This turns out to be a very difficult question, since there is no ground truth available in the form of "perfect" feature attribution scores.
Attempts at evaluating feature attribution explanations can roughly be categorized into three types of approaches. The first is human evaluation. This includes simply looking at an explanation and seeing if it "makes sense", or performing a user study to see how helpful explanations are for predicting model behaviour (Schmidt and Biessmann, 2019). The disadvantage of these approaches is that such user studies are difficult to set up, and their results are inherently subjective. It has been shown that an explanation that makes sense to humans is not necessarily true to the underlying workings of the model (Adebayo et al., 2018). Also, a user study is generally infeasible to perform for each use case, and it is unclear whether results from user studies can be generalized to different datasets.
A second approach is to define a set of desirable properties, or axioms, that a method should have (Lundberg and Lee, 2017; Sundararajan et al., 2017). Examples of such axioms include local accuracy, missingness and consistency (Lundberg and Lee, 2017). Such approaches are more objective in nature, but recent work has shown that methods that conform to these axioms are still not necessarily accurate (Adebayo et al., 2018). Some of these axioms can also be implemented in different ways, leading to a number of methods that all conform to certain axioms, but still provide different explanations for the same prediction (Sundararajan and Najmi, 2019).
Finally, we can define quantitative measures that try to indicate the quality of an explanation by measuring the behaviour of the model or explanation after applying specific perturbations (Ancona et al., 2018; Yeh et al., 2019). A simple example of this kind of measure is Deletion (Samek et al., 2015). Here, we iteratively mask the top n most important features, as indicated by the explanation. If the features that were marked as important are truly important, we would expect the output of the model to drop rapidly with increasing n.
In this work, we implement a number of both existing and newly proposed metrics for evaluating feature attribution methods. These metrics are evaluated on a large number of attribution methods, and we investigate the results on 8 different datasets of varying dimensionality. Our contributions are as follows:
• We expand on the work done in Tomsett et al. (2020) to gain insight into the behaviour of different existing metrics, as well as some newly proposed ones.
• We propose a new metric based on Sensitivity-n (Ancona et al., 2018), and empirically show that it provides results with a higher signal-to-noise ratio on high-dimensional datasets.
• We find that the performance of some methods is complementary to that of other methods, suggesting that a combination of these attribution methods may be valuable.
• We find that, depending on the dataset, methods with strong theoretical foundations such as DeepSHAP (Lundberg and Lee, 2017) do not necessarily outperform their computationally cheaper counterparts such as DeepLIFT (Shrikumar et al., 2017). This suggests that a benchmarking approach can be useful to check if a computationally intensive method is truly more valuable than a simpler one for a given use case.
• Finally, we provide general guidelines to perform such a benchmark for a given use case.

Related Work
Although the systematic evaluation of feature attribution methods is a fairly recent topic, a number of attempts have already been made to systematically and objectively evaluate the quality of explanations. An early, intuitive way of evaluating feature attributions was proposed by Samek et al. (2015). In this approach, the top k most important features are removed by replacing them with random noise. Subsequently, the difference in output of the model is measured. If the most important features are truly important to the model, we expect a sharp drop in confidence for the predicted class.
A more general approach was proposed by Ancona et al. (2018), called Sensitivity-n. This metric is computed by removing a number of random subsets of n pixels from the image, and measuring the correlation of the difference in output with the sum of attribution values of those removed pixels. This allows one to assess the accuracy of the attribution values in general, rather than just that of the most important features.
A possible problem with the metrics mentioned above is the fact that masking inputs in images can introduce high-frequency artifacts, which can push the images outside of their normal data distribution. This can cause the model to produce arbitrary outputs. Although the exact impact of this problem on the scores produced by metrics is unclear, some efforts to resolve it have already been made, including the Remove And Retrain (ROAR) procedure (Hooker et al., 2018). Here, the authors attempt to resolve the out-of-distribution (OOD) problem by modifying the Deletion metric by Samek et al. (2015) such that after every iteration, the model is retrained on the data where the top k pixels are removed. The reasoning is that in this way, the model learns to regard the mean-valued pixels as uninformative.
However, we argue that this metric is not measuring the same kind of feature attribution as the original Deletion metric. Because the model is retrained after each iteration, it is able to detect and use different parts of the input to make a prediction. Also, there is no guarantee that the model, after retraining, will consider the masked pixels as uninformative: the shape or location of regions with that specific color (the dataset mean) can still be very informative. In other words, ROAR can only assess the ability of methods to map local, data-specific feature attributions. Since the methods we are evaluating are designed to map local, model-specific attributions, we do not incorporate this metric in our benchmark.
More recently, Yeh et al. (2019) proposed Infidelity and Max-Sensitivity, two complementary metrics that measure the accuracy of a method and its robustness against small, insignificant perturbations, respectively. Recent work has shown that some feature attribution methods, much like neural networks themselves, are vulnerable to adversarial attacks (Ghorbani et al., 2019). This makes the robustness of explanations an interesting property to measure in addition to the accuracy. Yang and Kim (2019) proposed a synthetic data approach, where objects from MSCOCO (Lin et al., 2015) were pasted into background images from MiniPlaces (Zhou et al., 2017). A model is then trained to classify either the background or the object in the image. Because it is known where in the image the object was pasted, a relative form of ground truth is available. From this, a number of metrics are derived. However, as opposed to the previously proposed metrics, these metrics are tied to a specific dataset, and cannot be calculated for any given dataset and model. For this reason, we do not consider this approach.
Another related approach was proposed by Adebayo et al. (2018). In this work, a relative form of ground truth is created by randomizing the parameter values of the network, layer by layer. The assumption is that the feature attribution map should be significantly different for a trained model vs. a randomized model. Methods that return the same attribution map for both models appear to be independent of the model parameters. This approach, however, does not provide a numerical value that captures the "quality" of the explanation, but rather acts as a pass/fail "sanity check", and is therefore also not considered in this work.
Recently, Tomsett et al. (2020) have shown that some of these metrics are very dependent on implementation details, and do not appear to be measuring the same underlying properties of explanations. This is shown by measuring the correlation between different metric scores. The authors find that details such as how pixels are masked (by setting them to 0 vs. replacing them with random noise), or in what order they are masked (by decreasing or increasing importance), have a great influence on the quality scores given by the metric. This suggests that these metrics, although they are all designed to measure the "accuracy" of explanations, appear to be measuring different underlying properties. We build upon this work by applying a similar but more extensive analysis on a larger number of metrics and methods. In doing so, we can draw more global conclusions about how different methods and metrics relate to each other, and which methods and metrics may be most desirable for specific use cases.

Definitions and Notation
We define an instance as a vector $x \in X \subseteq \mathbb{R}^d$. A model is defined as a function $m : X \to \mathbb{R}^o$, where $d$ is the number of inputs (pixels, or color values of pixels in the case of RGB images) and $o$ is the number of output classes. Note that the output of the model for a given class $c$ can be any real number. In many cases, the output of the model is followed by a softmax function $\sigma$, mapping the outputs from $\mathbb{R}^o$ to $[0, 1]^o$. In this case, the original outputs are called logits. In this work, we consider the logit values as the actual output of the model. We write $m(x)_c$ for the $c$-th component of the output of model $m$ on instance $x$.
We denote the model class as $\mathcal{M} = \{m\}$; this represents the set of possible model instantiations (for example, the set of all possible neural networks, or all possible neural networks of a given architecture). An attribution method is a function $E \in \mathcal{E}$, with $E : \mathcal{M} \times X \times \{1, \dots, o\} \to \mathbb{R}^d$. The result of this function is called an attribution map. We explicitly mention the model $m \in \mathcal{M}$ as an argument of this function to indicate that we consider local, model-specific attributions, which are dependent on a specific combination of instance and model. The output class is also an argument of the attribution method, as attributions can be calculated for each output of the model. Finally, we define an attribution metric as a function $M : \mathcal{M} \times X \times \mathbb{R}^d \times \{1, \dots, o\} \to \mathbb{R}$ (type I) or $M : \mathcal{M} \times X \times \mathcal{E} \times \{1, \dots, o\} \to \mathbb{R}$ (type II), mapping a model $m$, an instance $x$, an attribution map $e$ (type I) or attribution method $E$ (type II), and an output class $c$ to a single real number which represents the quality of the attributions given by $e$ for output $c$ of model $m$ on instance $x$.
For an instance $x$ and attribution map $e$, we will denote $x_e^{k}$ and $x_e^{-k}$ as the instance $x$ where respectively the $k$ most or least important inputs are removed according to $e$. This removal, or "masking out", of features can be implemented in a number of different ways, which will be discussed in detail in Section 6.2. In the case of color images, we define the attribution value of a pixel as the average value of its color components, and proceed analogously.
Finally, we will denote $S = \{S^l\}_{l=1}^{L}$, $S^l \in \{0, 1\}^d$, as the set of segments of an input sample $x$ as produced by a given segmentation algorithm, where $S^l_i = 1$ if the $i$-th input feature is part of segment $l$, and $S^l_i = 0$ otherwise. The attribution value of a segment $S^l$ can then simply be computed as the average attribution value of its input features: $e_{S^l} := \frac{\|e \odot S^l\|_1}{\|S^l\|_1}$, where $\odot$ indicates element-wise multiplication. For an instance $x$, a corresponding segmentation $S$, and an attribution map $e$, we will denote $x_{e,S}^{k}$ (resp. $x_{e,S}^{-k}$) as the same sample $x$ where the $k$ most (resp. least) important segments are masked out.
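The segment attribution defined above can be illustrated with a minimal NumPy sketch. The function name `segment_attributions` and the row-wise mask layout are assumptions made for this example only. Note that, read literally, the L1 norm in the numerator averages the absolute attribution values within each segment.

```python
import numpy as np

def segment_attributions(e, segments):
    """Attribution value per segment: ||e * S^l||_1 / ||S^l||_1.

    e:        attribution map, shape (d,)
    segments: binary masks, shape (L, d); segments[l, i] == 1 if feature i
              belongs to segment l (hypothetical layout for this sketch)
    """
    e = np.asarray(e, dtype=float)
    segments = np.asarray(segments, dtype=float)
    # Element-wise product keeps only the attributions inside each segment;
    # the L1 norm of a binary mask is simply the segment size.
    return np.abs(e * segments).sum(axis=1) / segments.sum(axis=1)

# Toy example: four features split into two segments of two features each.
seg_attr = segment_attributions([1.0, -1.0, 2.0, 4.0],
                                [[1, 1, 0, 0], [0, 0, 1, 1]])
```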

Attribution Metrics
We now describe the different metrics that were used to evaluate the attribution methods. A summary of general properties can be seen in Table 1:
• Attribution range: Indicates which parts of the attribution map the metric actually evaluates. We denote metrics that evaluate the most important, least important, or all inputs as high-end, low-end or overall metrics, respectively. For example, $\mathrm{Del}_{MoRF}$ and $\mathrm{Del}_{LeRF}$ evaluate the high and low end, respectively, because they measure the influence of removing the top and bottom k pixels (see further).
• Masking: Indicates whether the metric relies on masking inputs in its implementation. Metrics that do can be implemented in different ways, as the choice of a neutral value to replace pixels with is not obvious (see Section 6.2).
• Data type: Indicates which types of data the metric can be applied to. In our case, a metric can either be applied to any kind of data, or only to image data (for example, because it relies on an image segmentation algorithm (Rieger and Hansen, 2020), or an adversarial patch (Qiu Lin et al., 2019)).
• Complexity: Indicates the computational complexity of the metric expressed as a number of forward passes through the model. $C_{mth}$ is the complexity of the attribution method being evaluated, also expressed as a number of forward/backward passes.
• Interface: We define two interfaces for attribution metrics:
  - Type I: $M : \mathcal{M} \times X \times \mathbb{R}^d \times \{1, \dots, o\} \to \mathbb{R}$. A type I metric accepts an attribution map $e \in \mathbb{R}^d$ to evaluate. This allows one to compute the metric result for any attribution map, regardless of whether the implementation of the attribution method that generated it is available or not.
  - Type II: $M : \mathcal{M} \times X \times \mathcal{E} \times \{1, \dots, o\} \to \mathbb{R}$. A type II metric needs access to the attribution method under evaluation $E \in \mathcal{E}$, because the method needs to be re-executed at some point in the computation of the metric. If the implementation of the attribution method is not available, this type of metric cannot be computed.

Deletion
Table 1: Summary of metrics. Footnote 1: provided that an adversarial patch is already available.
The first and most widely known metric is Deletion (Samek et al., 2015). This metric works by iteratively removing the top k most important pixels from an image. This is done by masking the pixels with some value (see Section 6.2). An ordering of pixels where the most important pixels are ranked highest will cause a steep decrease in the output confidence of the model. This can be summarized by computing the area under the perturbation curve (AUC), where a low AUC corresponds to a good explanation. Samek et al. (2015) also consider removing the pixels in reversed order of importance. In that case, a high AUC value indicates a good attribution map. We call this variant Deletion LeRF (Least Relevant First), and the original Deletion MoRF (Most Relevant First):
$$\mathrm{Del}_{MoRF}(m, x, e, c) = \mathrm{AUC}\left(\left\{ m(x_e^{k})_c \right\}_{k=0}^{L}\right), \qquad \mathrm{Del}_{LeRF}(m, x, e, c) = \mathrm{AUC}\left(\left\{ m(x_e^{-k})_c \right\}_{k=0}^{L}\right)$$
Here, $L$ is the maximum number of inputs masked, and $c$ is the output that the attribution is intended to explain (usually this is the highest output of the model, which corresponds to the class that the model assigned to $x$). For large images, we can approximate this value by taking a fixed number of steps with a constant step size. The MoRF variant evaluates the high end of the attribution map, whereas the LeRF variant evaluates the low end. We choose $L$ such that at most 15% of pixels are masked, which corresponds to the original approach in Samek et al. (2015). This limits the influence of out-of-distribution effects: as more pixels are removed, the image moves further away from the original data manifold, making the result less representative. This metric scales linearly in the number of steps taken to compute the AUC.
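The Deletion procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: inputs are flattened to one dimension, masking replaces inputs by a constant `mask_value`, the AUC is computed with the trapezoidal rule on a normalised axis, and all names (`deletion_auc`, `max_frac`, `n_steps`) are hypothetical.

```python
import numpy as np

def deletion_auc(model, x, e, c, max_frac=0.15, n_steps=10,
                 mask_value=0.0, morf=True):
    """Area under the Deletion curve for attribution map e.

    model: callable mapping an array of shape (d,) to outputs of shape (o,)
    morf:  True masks the most relevant inputs first, False the least relevant
    """
    x = np.asarray(x, dtype=float)
    e = np.asarray(e, dtype=float)
    L = max(1, int(max_frac * x.size))        # mask at most ~15% of the inputs
    order = np.argsort(-e if morf else e, kind="stable")
    # A fixed number of (roughly) equally spaced masking levels between 0 and L.
    ks = np.unique(np.linspace(0, L, n_steps + 1).astype(int))
    outputs = []
    for k in ks:
        x_masked = x.copy()
        x_masked[order[:k]] = mask_value      # assumed masking: constant value
        outputs.append(model(x_masked)[c])
    outputs = np.asarray(outputs)
    fracs = ks / ks[-1]                       # normalise the k-axis to [0, 1]
    # Trapezoidal rule over the perturbation curve.
    return float(np.sum((outputs[1:] + outputs[:-1]) / 2.0 * np.diff(fracs)))

# Toy linear model with one output; for x = 1 the map e = w * x is exact.
w = np.arange(20.0)
toy_model = lambda z: np.array([w @ z])
x0 = np.ones(20)
auc_morf = deletion_auc(toy_model, x0, w * x0, c=0, morf=True)
auc_lerf = deletion_auc(toy_model, x0, w * x0, c=0, morf=False)
```

For a faithful attribution map, the MoRF score should be low and the LeRF score high, which is what the toy example produces.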

Insertion
A simple variant of the Deletion metric is Insertion (Petsiuk et al., 2018). This metric works entirely analogously to Deletion, but instead of iteratively removing pixels from the original image, we now iteratively insert pixels of the original image onto a blank background (which is again defined by the masking procedure).
Analogously to the Deletion metric, we can again define two variants of Insertion: Insertion LeRF and Insertion MoRF, where respectively the least and most relevant pixels are inserted first. Since inserting the $k$ most important features is the same as removing the $d - k$ least important ones, we can define Insertion as follows:
$$\mathrm{Ins}_{MoRF}(m, x, e, c) = \mathrm{AUC}\left(\left\{ m(x_e^{-(d-k)})_c \right\}_{k=0}^{L}\right), \qquad \mathrm{Ins}_{LeRF}(m, x, e, c) = \mathrm{AUC}\left(\left\{ m(x_e^{d-k})_c \right\}_{k=0}^{L}\right)$$
Note that if $L = d$, $\mathrm{Ins}_{MoRF} = \mathrm{Del}_{LeRF}$ and $\mathrm{Ins}_{LeRF} = \mathrm{Del}_{MoRF}$. Again, the MoRF and LeRF variants measure the high and low end, respectively. This metric also scales linearly with the number of steps.
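The relation between Insertion and Deletion can be checked numerically on a toy model. The sketch below assumes flattened inputs, constant-value masking and attribution values without ties; the helper names are hypothetical. It computes full perturbation curves (all $k$ from 0 to $d$) and compares the MoRF Insertion curve with the reversed LeRF Deletion curve.

```python
import numpy as np

def deletion_curve(model, x, e, c, mask_value=0.0, morf=True):
    """m(x)_c after masking the k most (MoRF) or least (LeRF) relevant inputs.

    x and e are 1-D float arrays; the curve covers k = 0, ..., d.
    """
    order = np.argsort(-e if morf else e, kind="stable")
    curve = []
    for k in range(x.size + 1):
        x_masked = x.copy()
        x_masked[order[:k]] = mask_value
        curve.append(model(x_masked)[c])
    return np.asarray(curve)

def insertion_curve(model, x, e, c, mask_value=0.0, morf=True):
    """m(.)_c after inserting k inputs of x onto a blank (masked) background."""
    order = np.argsort(-e if morf else e, kind="stable")
    curve = []
    for k in range(x.size + 1):
        x_inserted = np.full_like(x, mask_value)
        x_inserted[order[:k]] = x[order[:k]]
        curve.append(model(x_inserted)[c])
    return np.asarray(curve)

# Toy non-linear model with a single output.
rng = np.random.default_rng(0)
w, x0 = rng.normal(size=8), rng.normal(size=8)
toy_model = lambda z: np.array([np.tanh(w @ z)])
e0 = w * x0

ins_morf = insertion_curve(toy_model, x0, e0, c=0, morf=True)
del_lerf = deletion_curve(toy_model, x0, e0, c=0, morf=False)
```

Inserting the $k$ most relevant inputs yields the same image as deleting the $d - k$ least relevant ones, so the two curves contain the same values in reversed order and have equal area.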

Minimal Subset
The previously mentioned Deletion and Insertion metrics only take into account the model's confidence in the originally predicted class $c$. However, the actual prediction of the model is also dependent on the confidence of the other classes. The removal of certain pixels could, for example, hardly influence the output confidence in $c$, but drastically change the confidence of another class $c'$, causing the model to change its overall prediction. To mitigate this problem, we introduce Minimal Subset Deletion and Minimal Subset Insertion.
These metrics work by iteratively removing (resp. inserting) the top $k$ most important pixels from the image, and recording the smallest value for $k$ that causes the prediction of the model to change. For Minimal Subset Insertion specifically, the prediction must change into the originally predicted class $c$.
For reasons analogous to those given for Deletion/Insertion, this metric evaluates the high end of the attribution map. Both variants scale linearly with the number of dimensions $d$.
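A minimal sketch of Minimal Subset Deletion is given below, assuming flattened inputs and constant-value masking. The function name and the convention of returning `None` when the prediction never changes are choices made for this example.

```python
import numpy as np

def minimal_subset_deletion(model, x, e, mask_value=0.0):
    """Smallest k such that masking the k most relevant inputs changes the
    predicted class. Returns None if the prediction never changes."""
    x = np.asarray(x, dtype=float)
    original_class = int(np.argmax(model(x)))
    order = np.argsort(-np.asarray(e, dtype=float), kind="stable")
    x_masked = x.copy()
    for k, idx in enumerate(order, start=1):
        x_masked[idx] = mask_value            # cumulative masking
        if int(np.argmax(model(x_masked))) != original_class:
            return k
    return None

# Toy two-class model: class 0 depends on the inputs, class 1 is constant.
W = np.array([[3.0, 2.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 0.0]])
b = np.array([0.0, 2.5])
toy_model = lambda z: W @ z + b
x0 = np.ones(4)

k_good = minimal_subset_deletion(toy_model, x0, e=[3.0, 2.0, 1.0, 0.0])
k_bad = minimal_subset_deletion(toy_model, x0, e=[0.0, 1.0, 2.0, 3.0])
```

A lower value indicates a better attribution map: the map that ranks the truly influential inputs first flips the prediction after two masked inputs, whereas the reversed ranking needs all four.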

IROF
Iterative Removal Of Features (IROF) (Rieger and Hansen, 2020) is an extension of Deletion, where a segmentation $S$ of the input sample $x$ is used. Instead of iteratively masking the $k$ most important inputs, we now mask the $k$ most important segments. This can reduce the number of forward passes needed, and can provide insight into the quality of an attribution at a larger scale: an algorithm that is able to find the top few pixels that maximally perturb the network when removed might score very well on Deletion, but less well on IROF. If another algorithm correctly identifies the most important "regions", it might score better on IROF and worse on Deletion. In some cases, the latter might be more interesting, as this would likely correspond to explanations that are less noisy and more easily readable.
We can define $\mathrm{IROF}_{MoRF/LeRF}$ analogously to $\mathrm{Del}_{MoRF/LeRF}$, leading to the following definitions:
$$\mathrm{IROF}_{MoRF}(m, x, e, c) = \mathrm{AUC}\left(\left\{ m(x_{e,S}^{k})_c \right\}_{k=0}^{|S|}\right), \qquad \mathrm{IROF}_{LeRF}(m, x, e, c) = \mathrm{AUC}\left(\left\{ m(x_{e,S}^{-k})_c \right\}_{k=0}^{|S|}\right)$$
Note that, even though all segments are removed in IROF, we classify this metric as high-end. This is because the metric score still depends most on the most important image segments: if those are identified correctly, the model output will decrease quickly, and the other segments will have little influence on the metric score. IROF scales linearly with the number of segments $|S|$, and is only applicable to image data because of the dependence on an image segmentation algorithm. We implement IROF using the SLIC algorithm (Achanta et al., 2010), with an approximate number of segments of 100.
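The segment-wise masking behind IROF can be sketched as follows. The segmentation is assumed to be given as an integer label per input feature (for example, a flattened superpixel label map); the function name, the constant-value masking and the use of the mean absolute attribution per segment are assumptions for this illustration.

```python
import numpy as np

def irof_curve(model, x, e, labels, c, mask_value=0.0, morf=True):
    """Model output for class c after masking 0, 1, ..., |S| segments.

    labels: integer segment id per input feature, shape (d,)
    """
    x = np.asarray(x, dtype=float)
    e = np.asarray(e, dtype=float)
    labels = np.asarray(labels)
    seg_ids = np.unique(labels)
    # Segment attribution: mean absolute attribution of the segment's features.
    seg_attr = np.array([np.abs(e[labels == s]).mean() for s in seg_ids])
    order = seg_ids[np.argsort(-seg_attr if morf else seg_attr, kind="stable")]
    x_masked = x.copy()
    curve = [model(x_masked)[c]]
    for s in order:
        x_masked[labels == s] = mask_value    # mask one whole segment per step
        curve.append(model(x_masked)[c])
    return np.asarray(curve)

# Toy example: six features in three segments, linear model with one output.
w = np.array([1.0, 1.0, 4.0, 4.0, 2.0, 2.0])
toy_model = lambda z: np.array([w @ z])
labels0 = np.array([0, 0, 1, 1, 2, 2])
x0 = np.ones(6)
curve_morf = irof_curve(toy_model, x0, w * x0, labels0, c=0, morf=True)
curve_lerf = irof_curve(toy_model, x0, w * x0, labels0, c=0, morf=False)
```

The metric value is then the area under such a curve; with only $|S| + 1$ forward passes, the cost no longer depends on the number of pixels.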

Sensitivity-n
The previous metrics have only considered the most or least important features. This can be a problem if the inputs contain a large number of features, in which case a large proportion of the features is hardly evaluated, or has a small influence on the evaluation. To get a more global assessment of the quality of feature attributions, Sensitivity-n was introduced (Ancona et al., 2018). Formally, Sensitivity-n is defined as follows (quoted from Ancona et al., 2018, where mathematical notation was adjusted to conform to ours): "An attribution method satisfies Sensitivity-n when the sum of the attributions for any subset of features of cardinality $n$ is equal to the variation of the output $m(x)_c$ caused by removing the features in the subset."
Since no attribution method exactly satisfies Sensitivity-n for all values of $n$, the metric instead measures how well the sum of attributions $\sum_{s \in S} e_s$ correlates with the difference in output $m(x)_c - m(x_S)_c$, using the Pearson correlation coefficient (where $x_S$ denotes the instance $x$ with all features in $S$ removed, and $e_s$ denotes the attribution of feature $s$ according to attribution map $e$). We can compute Sensitivity-n as:
$$\text{Sens-}n(m, x, e, c) = r\left(\left\{\sum_{s \in S_i} e_s\right\}_{i=1}^{k}, \left\{m(x)_c - m(x_{S_i})_c\right\}_{i=1}^{k}\right)$$
Here, $S_i$ is a random subset of inputs of size $n$, and $r(X, Y)$ is the Pearson correlation coefficient between variables $X$ and $Y$. The correlation is computed using $k$ randomly selected subsets $S_i$. We choose $k = 100$, which corresponds to the configuration in Ancona et al. (2018).
The number of possible subsets of features grows exponentially with $d$. Because of this, the approximation made by this metric will get exponentially worse for increasing image size. To mitigate this problem, we introduce a segmented variant of Sensitivity-n, called Seg-Sensitivity-n. This metric works by first segmenting the input image $x$ into segments $S$, and then removing random subsets of segments instead of features. Since the number of segments is drastically lower than the number of features, selecting 100 random subsets gives a more representative sample, which we expect will increase the signal-to-noise ratio of this metric.
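$$\text{Seg-Sens-}n(m, x, e, c) = r\left(\left\{\sum_{l \in L_i} e_{S^l}\right\}_{i=1}^{k}, \left\{m(x)_c - m(x_{S_{L_i}})_c\right\}_{i=1}^{k}\right)$$
Here, $S$ is the segmentation of instance $x$ (represented as a set of segments $\{S^l\}$), $L_i$ is a random subset of $n$ segment indices, and $x_{S_{L_i}}$ denotes the instance $x$ with all segments in $L_i$ removed. The correlation is now computed using $k = 100$ randomly selected subsets of segments $L_i$. Since the subsets of features/segments are chosen randomly, Sensitivity-n and Seg-Sensitivity-n evaluate the overall attribution map. Both metrics scale linearly in the number of subsets $k$.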
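A minimal sketch of Sensitivity-n follows, assuming flattened inputs and constant-value masking; the function name and the `seed` parameter are introduced for this example only. For a linear model without bias and a zero mask value, the attribution map $e = w \odot x$ satisfies Sensitivity-n exactly, so the correlation should be 1 up to rounding.

```python
import numpy as np

def sensitivity_n(model, x, e, c, n, k=100, mask_value=0.0, seed=0):
    """Pearson correlation between summed attributions and output differences
    over k random subsets of n features."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    e = np.asarray(e, dtype=float)
    base = model(x)[c]
    attr_sums, output_diffs = [], []
    for _ in range(k):
        subset = rng.choice(x.size, size=n, replace=False)
        x_masked = x.copy()
        x_masked[subset] = mask_value
        attr_sums.append(e[subset].sum())
        output_diffs.append(base - model(x_masked)[c])
    return float(np.corrcoef(attr_sums, output_diffs)[0, 1])

# Toy linear model: e = w * x is an exact attribution for zero masking.
rng0 = np.random.default_rng(1)
w, x0 = rng0.normal(size=50), rng0.normal(size=50)
toy_model = lambda z: np.array([w @ z])
score_exact = sensitivity_n(toy_model, x0, w * x0, c=0, n=5)
score_shuffled = sensitivity_n(toy_model, x0, rng0.permutation(w * x0), c=0, n=5)
```

The segmented variant follows the same pattern, except that subsets of segment indices are sampled and whole segments are masked, with the per-segment attribution values taking the place of the per-feature values.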

Infidelity
Infidelity (Yeh et al., 2019) generalizes the previous metrics from perturbation by masking to general perturbations. This is done by comparing the difference in output after an arbitrary perturbation with the dot product of the perturbation vector $I$ and the attribution map $e$. The perturbation vector is a random variable $I \in \mathbb{R}^d$ with probability measure $\mu_I$. The infidelity of an attribution map $e$ for an input sample $x$ and class $c$ is then defined as follows:
$$\mathrm{INFD}(m, x, e, c) = \mathbb{E}_{I \sim \mu_I}\left[\left(\beta\, I^\top e - \left(m(x)_c - m(x - I)_c\right)\right)^2\right]$$
Here, $\beta$ acts as a normalizing term (called optimal scaling in the original paper) to make the values for different explanation methods comparable. We use two variants of Infidelity proposed in Yeh et al. (2019), defined by their perturbation vectors:
• Noisy baseline: in this case, $I = x - \epsilon$, where $\epsilon$ is a Gaussian random vector. This corresponds to a robust variant of the completeness axiom (Lundberg and Lee, 2017), where we take a Gaussian random vector centered around a zero baseline, instead of a constant zero baseline.
• Square removal ($\mathrm{INFD}_{SQ}$): in this case, $I$ has a uniform distribution over square patches of the image $x$ of some predefined size. This can better capture spatial relationships in the images, as the removal of single pixels actually removes very little information if the surrounding pixels are still intact.
Since the perturbations happen on the entire image or on randomly selected squares, respectively, they evaluate the overall attribution map. Infidelity scales linearly with the number of samples $k$ used to approximate the expected value. In this work, $k = 1000$, which corresponds to the original implementation by Yeh et al. (2019).
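The Monte Carlo estimate of Infidelity can be sketched as below. This is an illustration under assumptions rather than the reference implementation: $\beta$ is taken as the least-squares optimal scale $\mathbb{E}[I^\top e \cdot \Delta] / \mathbb{E}[(I^\top e)^2]$ with $\Delta = m(x)_c - m(x - I)_c$, the noise level `sigma` is an arbitrary value, and the names `infidelity` and `noisy_baseline` are hypothetical.

```python
import numpy as np

def infidelity(model, x, e, c, sample_perturbation, k=1000, seed=0):
    """Monte Carlo estimate of E[(beta * I.e - (m(x)_c - m(x - I)_c))^2]."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    e = np.asarray(e, dtype=float)
    base = model(x)[c]
    dots = np.empty(k)
    diffs = np.empty(k)
    for i in range(k):
        I = sample_perturbation(x, rng)
        dots[i] = I @ e
        diffs[i] = base - model(x - I)[c]
    # Assumed scaling: least-squares optimal beta for the sampled perturbations.
    beta = np.mean(dots * diffs) / np.mean(dots ** 2)
    return float(np.mean((beta * dots - diffs) ** 2))

def noisy_baseline(x, rng, sigma=0.2):
    # I = x - eps with eps ~ N(0, sigma^2), so x - I is a noisy zero baseline.
    # sigma = 0.2 is an arbitrary choice for this sketch.
    return x - rng.normal(0.0, sigma, size=x.shape)

# Toy linear model: the gradient w explains every perturbation exactly.
rng0 = np.random.default_rng(1)
w, x0 = rng0.normal(size=10), rng0.normal(size=10)
toy_model = lambda z: np.array([w @ z])
infd_gradient = infidelity(toy_model, x0, w, 0, noisy_baseline, k=200)
infd_random = infidelity(toy_model, x0, rng0.normal(size=10), 0,
                         noisy_baseline, k=200)
```

Lower values are better. A square-removal variant would only require a different `sample_perturbation` that copies the values of `x` inside a random square patch and is zero elsewhere.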

Max-Sensitivity
Max-Sensitivity (Yeh et al., 2019) is the only metric that is not designed to evaluate the correctness of an attribution map, but rather the robustness of the attribution map against small perturbations. It does this by adding small perturbations to the sample and recomputing the attribution map on the perturbed samples. The maximum value of the $L_\infty$-norm of the difference between the original and perturbed attribution map is measured. To make different attribution methods comparable, the attribution maps are normalized to unit norm before computing Max-Sensitivity.
$$\text{Max-Sens}(m, x, E, c) = \max_{\|\delta\| \le r} \left\| E(m, x + \delta, c) - E(m, x, c) \right\|_\infty$$
Here, $r$ is the maximum size of the added perturbation. We choose $r = 0.1$, as in Yeh et al. (2019). Note that this is a type II metric, meaning that it needs access to the attribution method $E \in \mathcal{E}$ rather than just the attribution map $e \in \mathbb{R}^d$. This metric scales linearly with the number of samples $k$ (here chosen to be 50, as in Yeh et al. (2019)) used to approximate the maximum value, and with the number of forward/backward passes $C_{mth}$ necessary to compute the attribution map. Note that this can result in very large runtimes when evaluating computationally complex methods.
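The sampling approximation of Max-Sensitivity can be sketched as follows. The perturbations are assumed here to be drawn uniformly from an $L_\infty$ ball of radius $r$, and attribution maps are normalized to unit Euclidean norm; both are choices made for this sketch, as are the function names and the closed-form toy attribution methods.

```python
import numpy as np

def max_sensitivity(method, model, x, c, r=0.1, k=50, seed=0):
    """Largest L-inf change of the normalized attribution map over k random
    perturbations of x. `method(model, x, c)` returns an attribution map."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)

    def unit(a):
        norm = np.linalg.norm(a)
        return a / norm if norm > 0 else a

    e = unit(method(model, x, c))
    worst = 0.0
    for _ in range(k):
        delta = rng.uniform(-r, r, size=x.shape)   # assumed perturbation law
        e_perturbed = unit(method(model, x + delta, c))
        worst = max(worst, float(np.max(np.abs(e_perturbed - e))))
    return worst

# Toy linear model with closed-form attribution methods.
w = np.array([1.0, -2.0, 3.0])
toy_model = lambda z: np.array([w @ z])
gradient = lambda m, z, c: w.copy()            # analytic gradient of the toy model
input_x_gradient = lambda m, z, c: w * z

x0 = np.ones(3)
sens_gradient = max_sensitivity(gradient, toy_model, x0, c=0)
sens_ixg = max_sensitivity(input_x_gradient, toy_model, x0, c=0)
```

Because the attribution method is re-executed for every sample, the cost is roughly $k$ times that of a single attribution map, which is why this type II metric becomes expensive for slow methods.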

Impact Coverage
The final metric we consider is Impact Coverage (Qiu Lin et al., 2019). This metric works by applying an adversarial patch to the image, and computing feature attributions on the adversarially attacked image. If the adversarial attack was successful, we would expect a large proportion of the attribution to be inside of the adversarial patch, as the patch caused the model to change its output.
We quantify this by computing the intersection-over-union (IoU) between the $k$ most important pixels according to the attribution map $E(m, x, c)$ (denoted here as the set $T$), where $k$ is the number of pixels covered by the patch, and the patch $P$ itself. A score of 1 would indicate that the most important pixels perfectly cover the adversarial patch.
$$\mathrm{IC}(m, x, E, c) = \frac{|T \cap P|}{|T \cup P|}$$
Here, $P$ is the set of features that were covered by the adversarial patch. Note that this metric, like Max-Sensitivity, is also a type II metric. Impact Coverage evaluates only the high end of the attribution map. Since the attribution method needs to be executed on the attacked image, this metric has the same complexity as the method being evaluated. Note that an adversarial patch is also needed to compute this metric, meaning that this complexity is only valid when the adversarial patch is given (that is, when evaluating this metric on a large number of attribution maps for the same model). Impact Coverage can only be computed for image data with a sufficiently high resolution, such that an adversarial patch can be generated successfully.
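The intersection-over-union step can be sketched as follows, assuming that the attribution map has already been computed on the attacked image and that the patch location is available as a binary mask. Generating the adversarial patch is outside the scope of this sketch, and the function name is hypothetical.

```python
import numpy as np

def impact_coverage(e, patch_mask):
    """IoU between the k most important features and the patch, where k is
    the number of features covered by the patch."""
    e = np.asarray(e, dtype=float).ravel()
    patch = np.asarray(patch_mask, dtype=bool).ravel()
    k = int(patch.sum())
    top_k = np.zeros_like(patch)
    top_k[np.argsort(-e, kind="stable")[:k]] = True   # the set T
    union = np.logical_or(top_k, patch).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(top_k, patch).sum() / union)

patch0 = [1, 1, 1, 0, 0, 0]
ic_partial = impact_coverage([0.9, 0.8, 0.1, 0.2, 0.7, 0.0], patch0)
ic_perfect = impact_coverage([3.0, 2.0, 1.0, 0.0, 0.0, 0.0], patch0)
```

In the first example two of the three top-ranked features fall inside the patch, giving an intersection of 2 and a union of 4; in the second, the top-ranked features coincide with the patch.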

Attribution Methods
We now briefly describe the attribution methods used in this study. An overview is given in Table 2.
The Gradient method (Simonyan et al., 2014) uses the gradient of the model with respect to the input as the explanation: $E_{Grad}(m, x, c) = \frac{\partial m(x)_c}{\partial x}$. This gradient describes for each input dimension how much a change in that input dimension would affect the output value $m(x)_c$ in a small neighborhood around $x$. InputXGradient (Shrikumar et al., 2017) is an extension of the Gradient method, where the gradient of the model is multiplied element-wise with the original input $x$: $E_{IxG}(m, x, c) = \frac{\partial m(x)_c}{\partial x} \odot x$. This is done to reduce the problem of "gradient saturation" that can occur when the Gradient method is used. This approach can also be viewed from the perspective of a linear model: the gradient is a linear approximation to the model $m$ in the direct neighborhood around $x$. The product of the weight and the corresponding feature value represents the effect of the feature on the output of this linear model.
Grad-CAM (Selvaraju et al., 2017) uses the gradients of the output with respect to the feature maps of the final convolutional layer to compute the importance of each feature map for the target class c. The importances of all feature maps are then combined to produce a coarse attribution map of the same size as the final convolutional layer. This attribution map is then upsampled using linear interpolation to match the size of the original image.
Deconvolution (Zeiler and Fergus, 2014) is a modified backpropagation algorithm which only propagates non-negative gradients through ReLU activation functions. Apart from how it handles ReLU activation functions, this method is equivalent to the Gradient method. Guided Backpropagation (or GuidedBackprop; Springenberg et al., 2014) is an extension of the Deconvolution method, where gradients are only backpropagated if they are positive, and if their respective inputs are positive. As such, the normal backpropagation is "guided" by both input signals and output signals. Guided Grad-CAM (Selvaraju et al., 2017) is simply the element-wise product of Guided Backpropagation and upsampled Grad-CAM attributions. This combines the fine-grained attributions of Guided Backpropagation with the localization properties of the attribution maps produced by Grad-CAM.
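The difference between the Gradient, Deconvolution and Guided Backpropagation methods lies entirely in how a gradient is passed backwards through a ReLU. The sketch below illustrates the three rules for a single ReLU layer. It is a simplified illustration, not the implementation used in the experiments; the function name `relu_backward` and the toy values are hypothetical.

```python
import numpy as np

def relu_backward(grad_out, pre_activation, rule="gradient"):
    """Backward pass through one ReLU under three different rules.

    grad_out:       gradient arriving from the layer above
    pre_activation: input of the ReLU during the forward pass
    """
    grad_out = np.asarray(grad_out, dtype=float)
    pre_activation = np.asarray(pre_activation, dtype=float)
    if rule == "gradient":
        # Ordinary backpropagation: pass the gradient where the input was positive.
        return grad_out * (pre_activation > 0)
    if rule == "deconvolution":
        # Only non-negative gradients are propagated, regardless of the input.
        return np.maximum(grad_out, 0.0)
    if rule == "guided_backprop":
        # Gradient must be positive AND the forward input must be positive.
        return np.maximum(grad_out, 0.0) * (pre_activation > 0)
    raise ValueError(f"unknown rule: {rule}")

z = np.array([1.0, -2.0, 3.0, -4.0])   # forward inputs of the ReLU
g = np.array([0.5, 0.5, -1.0, -1.0])   # incoming gradients

grad_rule = relu_backward(g, z, "gradient")
deconv_rule = relu_backward(g, z, "deconvolution")
guided_rule = relu_backward(g, z, "guided_backprop")
```

In this toy example, only the first unit has both a positive input and a positive incoming gradient, so it is the only one that survives under Guided Backpropagation.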
IntegratedGradients (Sundararajan et al., 2017) is another approach to address gradient saturation. It does this by defining a "baseline input" $\bar{x}$, which should represent the absence of feature information (for example, a completely gray image). The gradient of the model is then integrated along a linear path from $\bar{x}$ to $x$: $E_{IG}(m, x, c) = (x - \bar{x}) \odot \int_0^1 \frac{\partial m(\bar{x} + \alpha (x - \bar{x}))_c}{\partial x} d\alpha$. The integral is approximated using a sum over $n$ equidistant points along the linear path. A new problem with this approach is the choice of baseline. Depending on the circumstances, it can be difficult to choose a baseline input that is effectively "neutral". Also, pixels that have values similar to the baseline will be under-attributed by IntegratedGradients. To mitigate this problem, ExpectedGradients (Erion et al., 2020) uses multiple instances from the training dataset as baseline values, and averages the IntegratedGradients attributions: $E_{EG}(m, x, c) = \mathbb{E}_{\bar{x} \sim D, \alpha \sim U(0,1)} \left[ (x - \bar{x}) \odot \frac{\partial m(\bar{x} + \alpha (x - \bar{x}))_c}{\partial x} \right]$.
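The approximation of the path integral by a sum over equidistant points can be sketched as follows. The example uses a toy scalar model with a hand-written gradient so that it runs without an automatic differentiation library; the model, its weights `W` and the midpoint placement of the points are assumptions made for illustration only.

```python
import numpy as np

W = np.array([0.8, -0.5, 0.3])  # weights of a toy model (hypothetical)

def model(x):
    """Toy scalar model m(x) = tanh(w . x)."""
    return np.tanh(W @ x)

def model_grad(x):
    """Analytic gradient of the toy model with respect to x."""
    return (1.0 - np.tanh(W @ x) ** 2) * W

def integrated_gradients(x, baseline, grad_fn, n_steps=200):
    """Approximate the path integral with n_steps equidistant points."""
    # Midpoints of n_steps equal sub-intervals of [0, 1].
    alphas = (np.arange(n_steps) + 0.5) / n_steps
    path_grads = np.stack(
        [grad_fn(baseline + a * (x - baseline)) for a in alphas]
    )
    # Average gradient along the path, scaled by the input difference.
    return (x - baseline) * path_grads.mean(axis=0)

x = np.array([1.0, 2.0, -1.0])
baseline = np.zeros(3)
attr = integrated_gradients(x, baseline, model_grad)

# The attributions should sum to m(x) - m(baseline), up to the
# discretisation error of the sum.
completeness_gap = abs(attr.sum() - (model(x) - model(baseline)))
```

The final check reflects a known property of IntegratedGradients: the attributions add up to the difference between the model output at the input and at the baseline, which is also why a feature equal to its baseline value receives zero attribution.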
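The SmoothGrad (Smilkov et al., 2017) method attempts to reduce noise in the Gradient explanation by averaging the gradient over a number of noisy copies of the original input: $E_{SG}(m, x, c) = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial m(x + \epsilon_i)_c}{\partial x}$, where the $\epsilon_i$ are random noise vectors added to the input. VarGrad (Adebayo et al., 2018) is a variant of this method, where the variance is taken over the gradients of noisy copies instead of the expected value.
A method that does not depend on the gradient of the model is LIME (Ribeiro et al., 2016). For this reason, LIME is applicable to any model, even if the model internals are unknown. LIME generates a set of perturbed instances by removing random subsets of features from $x$ and weighting the resulting instances with a proximity measure $\pi_x$. The model output on these perturbed instances is then computed, and a sparse linear model is trained to approximate the original model around $x$. In the case of images, the instances are first converted into a "simplified representation" using a superpixel algorithm. LIME then returns an attribution value for each superpixel rather than each input dimension. This can easily be converted back to the original input space by assigning each input dimension inside a superpixel the same attribution value. KernelSHAP (Lundberg and Lee, 2017) is a modification of LIME, where the perturbed instances are weighted by a specific kernel function. This kernel function causes the linear model to approximate the Shapley values of the input features, which have a number of desirable theoretical properties known as "axioms" (Štrumbelj and Kononenko, 2010).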
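Two steps of the superpixel-based methods are easy to make concrete: the conversion of superpixel attributions back to pixel space, and the kernel used by KernelSHAP to weight perturbed instances. The sketch below assumes a precomputed integer segmentation map and uses the Shapley kernel as given by Lundberg and Lee (2017); the function names and toy values are hypothetical.

```python
import numpy as np
from math import comb

def superpixel_to_pixel(superpixel_attr, segments):
    """Give every pixel the attribution value of the superpixel it belongs to.

    segments: integer array (H, W) with values in 0..n_superpixels-1
    """
    return np.asarray(superpixel_attr)[np.asarray(segments)]

def shapley_kernel_weight(n_features, n_present):
    """Weight of a perturbed instance in which n_present features are kept."""
    if n_present == 0 or n_present == n_features:
        # The empty and the full coalition receive infinite weight;
        # in practice they are enforced as constraints.
        return float("inf")
    return (n_features - 1) / (
        comb(n_features, n_present) * n_present * (n_features - n_present)
    )

# A 3x3 image divided into three superpixels.
segments = np.array([[0, 0, 1],
                     [0, 2, 1],
                     [2, 2, 1]])
superpixel_attr = np.array([0.5, -1.0, 2.0])
pixel_attr = superpixel_to_pixel(superpixel_attr, segments)

w_one_of_four = shapley_kernel_weight(4, 1)
w_two_of_four = shapley_kernel_weight(4, 2)
```

The kernel gives the largest weights to perturbed instances in which very few or almost all superpixels are present, since these are the most informative about the contribution of individual features.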
DeepLIFT (Shrikumar et al., 2017) is another backpropagation-based method. This method defines a reference point in the input space (similar to the baseline input for IntegratedGradients), and compares the activation of each neuron to its activation on the reference point (the "reference activation"). Attribution scores are then assigned according to that difference. We specifically use DeepLIFT with the Rescale rule (for more details, see Shrikumar et al. (2017)). DeepSHAP (Lundberg and Lee, 2017) extends the DeepLIFT algorithm to approximate Shapley values, by computing and averaging DeepLIFT attributions for multiple reference images.

Experimental Setup
In this section, we describe the datasets used in the experiments, the different implementations of feature masking, and the methods of statistical analysis that we performed on the metric scores.

Datasets
All experiments were conducted on 14 attribution methods and 8 datasets. The datasets can be divided into three groups:
• Low-dimensional datasets (28x28x1): MNIST, FashionMNIST
• Medium-dimensional datasets (32x32x3): CIFAR-10, CIFAR-100, SVHN
• High-dimensional datasets (224x224x3): ImageNet, Caltech-256, Places-365
For the low-dimensional datasets, a simple CNN architecture (2 convolutional layers with 32 and 64 channels, followed by a fully connected hidden layer with 128 nodes) was trained. For the medium- and high-dimensional datasets, we used ResNet20 and ResNet18, respectively. The models for the low- and medium-dimensional datasets were trained up to a test set accuracy of at least 90%, except for CIFAR-100, where a top-five accuracy of 90.6% was reached. For Caltech-256 and Places-365, the models were trained up to a top-five test set accuracy of 91.6% and 83.7%, respectively. For ImageNet, the built-in ResNet18 model of torchvision was used, obtaining a top-five accuracy of 89.08%. The metric scores were computed for all attribution methods on 256 correctly classified samples for each dataset. Note that an adversarial patch was only generated for the high-dimensional datasets (ImageNet, Caltech-256 and Places-365), which means that the Impact Coverage could only be computed for these datasets.

Masking
Except for Infidelity and Impact Coverage, all metrics depend in some way on the masking of pixels to remove information. When masking pixels, we try to replace the pixel value with some "neutral" value that is expected to remove the original information contained in the pixel. However, the choice of this neutral value is not obvious (Sturmfels et al., 2020). We consider three options:
• Dataset mean: The first and simplest way of masking is by replacing the pixel with a constant zero value (in the case of color images, we do the same for each color channel). Since the data is z-normalized to have µ = 0 and σ = 1, this is equivalent to changing the pixel into the average pixel value over the full dataset. The disadvantage here is that, if the pixel was already close to the average value, the value doesn't change much and the information might not be properly destroyed.
• Uniform random: To mitigate the problem of the constant masking value, we can also draw values from a random distribution. In our case, we use the standard uniform distribution U(0, 1). This can help destroy spatial information if larger regions of the image need to be masked.
• Blur: A disadvantage of the previous two methods is their tendency to introduce high-frequency information in the data. This can push the data outside of the original data distribution, which can cause the model to give arbitrary outputs. To mitigate this problem, the pixels can instead be replaced by a blurred value, which is the average of its neighboring pixels. The size of this neighborhood can be expressed using the kernel size. This technique again has the disadvantage that the information in the pixel might not be properly destroyed by blurring it.
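The three masking options can be sketched as a single helper function. This is a minimal illustration under stated assumptions: images are z-normalized arrays of shape (H, W, C), the blur is a plain box filter that includes the pixel itself, and the kernel size is odd. The function name `mask_pixels` and its parameters are hypothetical.

```python
import numpy as np

def mask_pixels(image, mask, strategy="mean", kernel_size=3, rng=None):
    """Replace the pixels selected by `mask` using one of three strategies.

    image: float array (H, W, C), assumed z-normalized
    mask:  bool array (H, W), True for pixels to be removed
    """
    image = np.asarray(image, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    out = image.copy()
    if strategy == "mean":
        # Zero equals the dataset mean for z-normalized data.
        out[mask] = 0.0
    elif strategy == "uniform":
        rng = np.random.default_rng() if rng is None else rng
        out[mask] = rng.uniform(0.0, 1.0, size=out[mask].shape)
    elif strategy == "blur":
        # Box blur: average over a kernel_size x kernel_size neighborhood.
        pad = kernel_size // 2
        padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
        blurred = np.zeros_like(image)
        h, w = image.shape[:2]
        for dy in range(kernel_size):
            for dx in range(kernel_size):
                blurred += padded[dy:dy + h, dx:dx + w]
        blurred /= kernel_size ** 2
        out[mask] = blurred[mask]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return out

# Toy 3x3 single-channel image with one bright pixel in the center.
image = np.zeros((3, 3, 1))
image[1, 1, 0] = 9.0
mask = np.zeros((3, 3), dtype=bool)
mask[1, 1] = True

mean_masked = mask_pixels(image, mask, "mean")
blur_masked = mask_pixels(image, mask, "blur", kernel_size=3)
rand_masked = mask_pixels(image, mask, "uniform", rng=np.random.default_rng(0))
```

In the toy example the blurred center pixel keeps one ninth of its original value, which illustrates the stated disadvantage of blurring: part of the original information survives the masking.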

Statistical Analysis
We give a brief overview of the statistical techniques used to analyse the behaviour of metric scores. We first use a statistical significance test to identify which methods outperform a basic random baseline on the metrics. Next, we compute the correlations of scores between different metrics. We then study the consistency of method rankings as given by each metric. Finally, we propose a technique to compare two methods in more detail.

Statistical Significance Testing
For each method-metric pair, a statistical test is used to verify if the method performs significantly better than a uniform baseline on this metric. The uniform baseline is defined as a "pseudo-method", which simply assigns random values u ∼ U(0, 1) to each input. This random baseline is generated once for every image, such that the same baseline attribution map is compared to each of the attribution maps computed by the attribution methods. Note that a different baseline method, such as an edge detection algorithm, could also be used as an alternative baseline. Since we cannot assume normality of the data, we use a Wilcoxon Signed Rank test (which is the non-parametric equivalent of a paired t-test). Wherever applicable, we mask using the dataset mean value. If the result of the test is significant, we also report the median score difference to the baseline as a measure of effect size. Since the absolute values of most metrics carry little to no semantic meaning, these effect sizes are only relevant relative to each other. For this reason, the effect sizes are scaled to [0, 1] for each metric, such that the best-performing method has an effect size of 1.
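The testing procedure for a single metric can be sketched with `scipy.stats.wilcoxon`. The sketch makes several assumptions that are not fixed by the text: higher metric scores are taken to be better, a one-sided test is used, the effect size is the median of the paired differences, and the significance level of 0.01 is passed as a parameter. The function name `compare_to_baseline` and the synthetic scores are hypothetical.

```python
import numpy as np
from scipy.stats import wilcoxon

def compare_to_baseline(method_scores, baseline_scores, alpha=0.01):
    """Return scaled effect sizes for methods that significantly beat the baseline.

    method_scores:   dict mapping method name -> scores of shape (n_images,)
    baseline_scores: scores of the random baseline on the same images
    """
    effects = {}
    for name, scores in method_scores.items():
        diff = np.asarray(scores) - np.asarray(baseline_scores)
        # Paired, one-sided test: is the method better than the baseline?
        p_value = wilcoxon(diff, alternative="greater").pvalue
        if p_value < alpha:
            effects[name] = float(np.median(diff))
    if effects:
        # Scale so that the best-performing method has effect size 1.
        best = max(effects.values())
        effects = {name: e / best for name, e in effects.items()}
    return effects

# Synthetic scores for 256 images (illustrative values only).
rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, size=256)
scores = {
    "strong": baseline + 1.0 + rng.normal(0.0, 0.1, size=256),
    "weak": baseline + 0.5 + rng.normal(0.0, 0.1, size=256),
    "worse": baseline - 1.0 + rng.normal(0.0, 0.1, size=256),
}
effects = compare_to_baseline(scores, baseline)
```

Because the test is paired, the large image-to-image variation of the baseline scores cancels out, and only the per-image differences determine significance.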

Inter-Metric Correlation
Inter-metric correlations are computed as the Spearman rank correlation between metric scores, averaged over all methods (except the random baseline).
These correlations allow us to identify which metrics are measuring different underlying aspects, and which metrics are mutually redundant.
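A minimal sketch of this computation is given below. It assumes that the scores are stored in an array of shape (methods, images, metrics) and computes the Spearman correlation as the Pearson correlation of per-metric ranks; the function name `inter_metric_correlation` and the synthetic data are hypothetical.

```python
import numpy as np
from scipy.stats import rankdata

def inter_metric_correlation(scores):
    """Spearman correlation between metrics, averaged over methods.

    scores: array of shape (n_methods, n_images, n_metrics)
    returns: array of shape (n_metrics, n_metrics)
    """
    scores = np.asarray(scores, dtype=float)
    per_method = []
    for method_scores in scores:
        # Rank the images separately for each metric, then correlate the ranks.
        ranks = rankdata(method_scores, axis=0)
        per_method.append(np.corrcoef(ranks, rowvar=False))
    return np.mean(per_method, axis=0)

# Synthetic example: 2 methods, 50 images, 3 metrics.
rng = np.random.default_rng(0)
metric_a = rng.normal(size=(2, 50))
metric_b = -np.exp(metric_a)          # monotonically decreasing in metric_a
metric_c = rng.normal(size=(2, 50))   # unrelated to the other two
scores = np.stack([metric_a, metric_b, metric_c], axis=-1)

corr = inter_metric_correlation(scores)
```

The second synthetic metric is a decreasing function of the first, so their rank correlation is -1 even though the relation is non-linear. A pair of metrics with such a strong correlation, positive or negative, carries largely redundant information.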

Ranking Consistency
Ranking consistency assesses how consistent a metric is in ranking the methods across the different images. This is measured using Krippendorff's α (Krippendorff, 2018). Krippendorff's α is a statistic usually used to measure inter-rater reliability: the degree to which different raters (for example, for a psychological test) agree in their assessments. Krippendorff's α is defined as follows: $\alpha = 1 - \frac{D_o}{D_e}$, where $D_o$ is the observed disagreement, and $D_e$ is the disagreement expected by chance. Before computing α, we convert the metric score data into rank data by ranking the methods for each image. This rank data is then treated as ordinal data in the computation of α. For more details on how these values are computed, we refer the reader to Krippendorff (2011). If α = 1, then the ranking is perfectly consistent: the ranking of methods produced by the metric is identical for each image. If α = 0, then the ranking is completely random.
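The computation can be sketched as follows for the case without missing values. The sketch treats each image as a rater and each method as a rated unit, and uses the ordinal distance function of Krippendorff (2011); this mapping, the function names and the synthetic scores are assumptions made for illustration.

```python
import numpy as np

def krippendorff_alpha_ordinal(ratings):
    """Krippendorff's alpha for ordinal data without missing values.

    ratings: integer array (n_raters, n_units) with values in 0..K-1
    """
    ratings = np.asarray(ratings)
    n_raters, n_units = ratings.shape
    k = int(ratings.max()) + 1
    # Coincidence matrix: pairs of values assigned to the same unit.
    coincidence = np.zeros((k, k))
    for u in range(n_units):
        counts = np.bincount(ratings[:, u], minlength=k).astype(float)
        pairs = np.outer(counts, counts) - np.diag(counts)
        coincidence += pairs / (n_raters - 1)
    n_c = coincidence.sum(axis=1)
    n = n_c.sum()
    # Ordinal distance between values c and g.
    delta2 = np.zeros((k, k))
    for c in range(k):
        for g in range(c + 1, k):
            d = n_c[c:g + 1].sum() - (n_c[c] + n_c[g]) / 2.0
            delta2[c, g] = delta2[g, c] = d ** 2
    d_observed = (coincidence * delta2).sum() / n
    d_expected = (np.outer(n_c, n_c) * delta2).sum() / (n * (n - 1))
    return 1.0 - d_observed / d_expected

def ranking_consistency(scores):
    """scores: (n_images, n_methods) metric scores; methods are ranked per image."""
    ranks = np.asarray(scores).argsort(axis=1).argsort(axis=1)
    return krippendorff_alpha_ordinal(ranks)

rng = np.random.default_rng(0)
# Same ordering of the 6 methods on every image.
consistent = np.tile(np.arange(6.0), (50, 1)) + rng.uniform(0.0, 0.1, size=(50, 6))
# No relation between method and score.
random_scores = rng.normal(size=(50, 6))

alpha_consistent = ranking_consistency(consistent)
alpha_random = ranking_consistency(random_scores)
```

For the consistent scores the observed disagreement is zero, so α equals 1; for the random scores the observed disagreement is close to the expected disagreement, so α is close to 0.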

Pairwise Comparison Using CLES
Once a global overview of method performance has been established using the statistical significance test (Section 6.3.1), two or more methods can be selected for a more detailed comparison. Such a comparison is then made by performing a new statistical test, this time comparing the methods to each other, rather than to a trivial random baseline.
In this case, we use the Common Language Effect Size (CLES) to measure the difference between methods. This measure is simply the fraction of images where method A outperforms method B. This effect size measure is less informative when comparing methods to the random baseline, as we expect methods to at least consistently outperform the baseline, leading to a saturated effect size of 1. If two methods are selected that are more similar in their performance, the CLES can give an intuition of how often one method (usually a more computationally complex one) outperforms the other.
For example, if the difference in metric scores is very large, but the CLES is only slightly larger than 0.5, this would mean that method A outperforms method B only in a weak majority of images. If computational cost is a concern, this can make it more attractive to choose the computationally cheaper method.
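The CLES itself is a one-line computation. The sketch below assumes paired scores on the same images, that higher scores are better, and that ties do not count as a win; the function name `cles` and the toy values are hypothetical.

```python
import numpy as np

def cles(scores_a, scores_b):
    """Fraction of images on which method A scores higher than method B."""
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    return float(np.mean(a > b))

# Method A wins clearly on three images and loses narrowly on two.
scores_b = np.zeros(5)
scores_a = np.array([5.0, 4.0, 6.0, -0.1, -0.1])

effect = cles(scores_a, scores_b)
mean_difference = float(np.mean(scores_a - scores_b))
```

In this toy example the mean score difference is large (2.96), but the CLES is only 0.6: method A is better on three of five images. This is the situation described above, in which a large difference in scores does not imply that one method is better on nearly every image.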

Results
In this section, we describe the results of the Wilcoxon Signed Rank tests, inter-metric correlations, and ranking consistency of metric scores on the different datasets. Finally, we perform a pairwise comparison of DeepSHAP and DeepLIFT on MNIST, CIFAR-10 and ImageNet.

Wilcoxon Signed Rank Tests
The results of the Wilcoxon Signed Rank tests are shown in Figure 1. For each method-metric pair, a square is drawn if the result of the Wilcoxon Signed Rank test is significant (p < 0.01), with the size and color of the square indicating the effect size (median difference with the random baseline). Effect sizes are normalized, such that a value of 1 corresponds to the largest effect size for a given metric.
We can clearly distinguish the low-, medium- and high-dimensional datasets in the results. In the low-dimensional case, most methods significantly outperform the random baseline on nearly all metrics. In the medium-dimensional case, we see more complementarity in the results, although this still depends strongly on the dataset. For CIFAR-10 and CIFAR-100, GradCAM/KernelSHAP and DeepSHAP/DeepLIFT/ExpectedGradients form a complementary pair, with GradCAM/KernelSHAP outperforming DeepSHAP/DeepLIFT/ExpectedGradients on some metrics and vice versa. For SVHN, a number of methods (DeepLIFT, DeepSHAP, ExpectedGradients) significantly outperform the baseline across all metrics.
In the high-dimensional case, we see fewer differences between the datasets. The same complementarity between GradCAM/KernelSHAP and DeepSHAP/DeepLIFT/ExpectedGradients is again noticeable for all three datasets, suggesting that it is linked to the complexity or dimensionality of the classification problem. We also note that GradCAM and KernelSHAP significantly outperform the other methods on Impact Coverage (COV), which could only be computed for high-dimensional datasets. This complementarity can be explained by examining the inner workings of the methods. GradCAM works by upsampling features in the last convolutional layer (Selvaraju et al., 2017), and KernelSHAP uses an image segmentation step before computing the actual Shapley values (Lundberg and Lee, 2017). These properties result in very non-granular attribution maps. Conversely, DeepLIFT, DeepSHAP and ExpectedGradients are all based on modified versions of the gradient, which tends to produce very granular attribution maps. This difference in granularity could be the source of the observed complementarity.
We draw two conclusions from these results:
1. Depending on the dataset, very simple and computationally cheap methods (such as Gradient or InputXGradient) can perform nearly as well as the computationally more expensive methods such as ExpectedGradients or DeepSHAP.
2. Complementarity between methods, where some methods outperform other methods on a subset of metrics and vice versa, suggests that a combination of attribution maps given by different methods might provide more information than the individual attribution maps. This is related to the idea proposed in Tomsett et al. (2020) that different metrics might be measuring different underlying aspects of the attribution maps.

Inter-Metric Correlations
Inter-metric correlations are shown in Figure 2 for MNIST, CIFAR-10 and ImageNet (results on the other datasets were similar). In general, we note similar patterns of correlations for the three datasets. Most metrics have relatively low correlations, suggesting that they might be measuring different underlying aspects of the attribution maps, as proposed in Tomsett et al. (2020). We also note strong negative correlations between certain pairs of metrics, more specifically MoRF/LeRF pairs, which suggests that these pairs contain largely redundant information. This insight can be used to reduce computational cost in future benchmarking efforts, by selecting only MoRF or LeRF metrics. Finally, we note that correlations between segmented and non-segmented metrics (for example, Deletion and IROF) are stronger for low-dimensional datasets. This is to be expected, since the low dimensionality of the data causes segments to be composed of only a few pixels.
Table 3 shows inter-metric correlations of different metric implementations on ImageNet (results on the other datasets were generally similar). We note that, although different metrics have relatively low correlations, correlations between different implementations of the same metric are generally quite high. We can conclude from this that different implementations of the same metric generally provide redundant information. We recommend first deciding which masking procedure makes most sense for a given dataset and/or model, rather than performing full measurements using a large number of masking procedures.

Ranking Consistency
The values of α for all datasets are shown in Figure 3. We mask using the dataset mean where applicable. It can be observed that most of the metrics are most consistent on the low-dimensional datasets (MNIST, FashionMNIST). Impact Coverage was only measured for high-dimensional datasets because of the reliance on an adversarial patch, and has the highest values of α. We also see that there is no clear pattern between the medium- and high-dimensional datasets, implying that α doesn't simply decrease with increasing dimensionality. We note that our proposed segmented variant of Sensitivity-n has a higher α for high-dimensional datasets, confirming the intuition that this metric has a higher signal-to-noise ratio for high-dimensional data.
Overall, there is no subset of metrics that is generally superior to all others. From this, we conclude that the ideal subset of metrics to measure depends on the dataset and model. Different implementations of the same metric (using different masking procedures) generally have similar values for α. An overview of Krippendorff's α for all metric implementations is given in Appendix 10.

Figure 3: Krippendorff's α for default implementations of different metrics on all datasets. Low-, medium- and high-dimensional datasets are indicated in green, blue and red tones, respectively. Impact Coverage (Cov) was only computed for the high-dimensional datasets due to the requirement of an adversarial patch (see Section 4.8).

Pairwise Comparison of Methods
We use the framework proposed in Section 6.3.4 to compare the performance of DeepSHAP and DeepLIFT on MNIST, CIFAR-10 and ImageNet. We choose these two methods because they have very similar results across all datasets in Figure 1, which is to be expected as DeepSHAP is based on DeepLIFT. However, DeepSHAP is computationally much more expensive than DeepLIFT, so if the fraction of images where it outperforms DeepLIFT is relatively small, it might not be worth the cost. The results are shown in Figure 4. Each bar corresponds to a single metric. A bar is only drawn if the corresponding Wilcoxon signed rank test was significant (p < 0.01). The bars are centered on 0.5, since a CLES of 0.5 would indicate that both methods are equivalent, each outperforming the other in 50% of cases.
We see that, although performance in terms of absolute metric scores is very similar between the two methods (as shown in Section 7.1), the Common Language Effect Size (CLES) varies greatly depending on the dataset. On ImageNet, DeepSHAP outperforms DeepLIFT for most images, with the CLES ranging between 60% and 80% for most metrics. On CIFAR-10 however, the difference between the two methods is much smaller. Finally, on MNIST, DeepSHAP is outperformed by DeepLIFT on a majority of images, for almost all metrics. This indicates that the relative performance of methods is strongly dependent on the dataset in question.

Sensitivity-n vs. Seg-Sensitivity-n
To compare our proposed metric Seg-Sensitivity-n to the original Sensitivity-n, we measure the stability of both metrics in two ways. First, we measure the signal-to-noise ratio (SNR) of both metrics. We compute both Seg-Sensitivity-n and Sensitivity-n scores 100 times on 256 images (where the same images were used for both metrics). We then compute the SNR of the metric for each image as $\mu^2 / \sigma^2$, where $\mu$ is the mean of the 100 metric values, and $\sigma$ is the standard deviation. The results are shown in the left part of Figure 5. Note that the SNR of Seg-Sensitivity-n for high-dimensional datasets (ImageNet, Caltech-256 and Places-365) is significantly higher than the SNR of Sensitivity-n. On the other datasets, the SNR is also larger for Seg-Sensitivity-n, although the difference is smaller.
A different way to measure the stability is to look at the noise fraction of variance. To compute this, we compute the ratio of the within-sample variance (the variance of the 100 repeated measurements for each sample) to the between-sample variance (the total variance of all measurements on all samples). A low noise fraction of variance corresponds to a clear signal. These results are shown in the right plot of Figure 5. We see again that the noise fraction of variance for Sensitivity-n is much larger than for Seg-Sensitivity-n, especially on the high-dimensional datasets.
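Both stability measures can be computed from an array of repeated measurements, as in the following sketch. It assumes an array of shape (images, repeats) and averages the within-sample variance over images before dividing by the total variance; that averaging step, the function name `stability` and the synthetic data are assumptions.

```python
import numpy as np

def stability(values):
    """Per-image SNR and overall noise fraction of variance.

    values: array of shape (n_images, n_repeats) with repeated metric scores
    """
    values = np.asarray(values, dtype=float)
    mu = values.mean(axis=1)
    var_within = values.var(axis=1, ddof=1)   # variance over the repeats
    snr = mu ** 2 / var_within                # mu^2 / sigma^2 per image
    var_total = values.var(ddof=1)            # variance over all measurements
    noise_fraction = var_within.mean() / var_total
    return snr, noise_fraction

# Synthetic example: 256 images, 100 repeats, same signal, two noise levels.
rng = np.random.default_rng(0)
signal = rng.normal(1.0, 0.5, size=(256, 1))
clean = signal + rng.normal(0.0, 0.05, size=(256, 100))
noisy = signal + rng.normal(0.0, 1.0, size=(256, 100))

snr_clean, frac_clean = stability(clean)
snr_noisy, frac_noisy = stability(noisy)
```

With low measurement noise, almost all of the total variance comes from differences between images, so the noise fraction is close to 0. With high measurement noise, most of the variance is noise and the fraction approaches 1.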

Guidelines
We have performed an extensive study of the behaviour of a large number of attribution metrics and methods, on a collection of image datasets with varying complexity and dimensionality. From this investigation, we draw a number of general conclusions:
• Metric scores vary strongly for different datasets. This implies that the performance of attribution methods should be measured for each specific use case, rather than drawing general conclusions from the results on a set of benchmark datasets.
• Most metrics tend to have low ranking consistency, shown by the relatively low values of Krippendorff's α. From this we conclude that a statistical testing approach should be used to draw any dataset-wide conclusions. Also, the ranking consistency values of metrics themselves are not consistent across datasets, implying that there is no generally superior metric.
• We extend the conclusion from Tomsett et al. (2020) that metrics don't necessarily measure the same underlying concept to a larger number of metrics, including Sensitivity-n (Ancona et al., 2018), Infidelity (Yeh et al., 2019), IROF (Rieger and Hansen, 2020) and Impact Coverage (Qiu Lin et al., 2019). This can be seen in the low inter-metric correlation values between these metrics.
• Finally, we also introduce Seg-Sensitivity-n as an extension of Sensitivity-n (Ancona et al., 2018), and show that it has a higher signal-to-noise ratio than Sensitivity-n on high-dimensional datasets.
From these conclusions, we propose a set of benchmarking guidelines for developers seeking to select the best feature attribution method for their specific use case (see Figure 6):
1. Baseline selection: First, a baseline attribution method must be defined. In general, a uniform random baseline can be used, but more specific baselines can also be chosen depending on the use case (for example, an edge detector).
2. Metric selection: Next, a selection of metric implementations must be made. This can be done manually, if such a selection of metrics is obvious from the use case and there is a clear approach to masking available (for example, if all images have a constant background color), or using a pilot study. In such a pilot study, a large number of metrics and masking approaches are tested on a limited number of images. We then recommend computing inter-metric correlations and Krippendorff's α values, and selecting those metric implementations that have high Krippendorff's α and low inter-metric correlations. In this way, a minimal number of images can be used to draw dataset-wide conclusions, and metric scores will contain a minimal amount of redundant information.
3. Rough statistical analysis: Once the metric scores are computed for all methods and the baseline, a rough overview of method performance can be made using a pairwise, non-parametric statistical test (such as a Wilcoxon signed rank test). We specifically recommend this test because metric scores are dependent on the specific image being used (Tomsett et al., 2020), and metric scores don't necessarily follow a normal distribution.
4. Detailed comparison: Using the rough overview made in the previous step, we recommend selecting a smaller number of well-performing attribution methods, if possible with varying computational complexity. Those methods can then be compared in more detail using new pairwise Wilcoxon tests, this time between the two methods rather than between a method and the baseline. The CLES can then be used to assess the fraction of cases where one method is superior to another. Based on these results, as well as the complexity of the methods being considered, a final selection of an ideal method can then be made.
An example application of these guidelines can be found in Appendix C.
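The metric selection step of the guidelines (high Krippendorff's α, low inter-metric correlation) could be automated with a simple greedy rule, sketched below. The guidelines do not prescribe a specific selection algorithm or thresholds: the greedy strategy, the function name `select_metrics`, the cut-off values `min_alpha` and `max_abs_corr`, and the toy numbers are all hypothetical.

```python
import numpy as np

def select_metrics(names, alphas, corr, min_alpha=0.3, max_abs_corr=0.7):
    """Greedily keep consistent metrics that are not redundant with kept ones.

    names:  list of metric names
    alphas: Krippendorff's alpha per metric, from a pilot study
    corr:   inter-metric correlation matrix, same order as names
    """
    alphas = np.asarray(alphas, dtype=float)
    corr = np.asarray(corr, dtype=float)
    selected = []
    # Visit metrics from most to least consistent.
    for i in np.argsort(alphas)[::-1]:
        if alphas[i] < min_alpha:
            continue  # ranking too inconsistent across images
        if all(abs(corr[i, j]) <= max_abs_corr for j in selected):
            selected.append(int(i))
    return [names[i] for i in selected]

# Toy pilot-study results for four metrics (illustrative values only).
names = ["A", "B", "C", "D"]
alphas = [0.60, 0.55, 0.20, 0.40]
corr = np.array([
    [1.0, -0.9, 0.1, 0.1],
    [-0.9, 1.0, 0.0, 0.2],
    [0.1, 0.0, 1.0, 0.3],
    [0.1, 0.2, 0.3, 1.0],
])

chosen = select_metrics(names, alphas, corr)
```

In the toy example, metric B is dropped because it is strongly negatively correlated with the more consistent metric A, and metric C is dropped because its α is too low. The absolute value of the correlation is used so that strongly negatively correlated pairs are also treated as redundant.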

Conclusion
In this paper, we have investigated the behaviour of a large number of attribution methods and metrics on a variety of image datasets. From this investigation, we conclude that the choice of attribution metric should depend on the specific use case, and that a statistical testing approach should be used to draw any global conclusions. We propose a general set of guidelines for performing a benchmarking study to select the appropriate attribution method for a given use case.
The results described in this paper leave a number of directions for future research. First of all, the observation that metrics don't necessarily measure the same underlying concept of feature attribution maps leads to the question of what those underlying concepts might be. A better understanding of those underlying concepts can lead to more directed benchmarking efforts and the development of better methods and/or metrics. A possible link can be made with the concepts of necessity and sufficiency, found in the literature on causality (Pearl, 2009). Secondly, the complementarity of results for different methods implies that a combination of different attribution maps can be more informative than a single one. This can lead to the development of new methods, generalizing the concept of feature attribution itself. Finally, the application of the benchmark procedure on new datasets can shed light on what the best attribution methods are for a given problem domain. An important example is the domain of biomedical imaging. Here, medical practitioners are often interested in what the most important regions of a radiographic image are for a specific prediction (Arun et al., 2020a,b; Aggarwal et al., 2020), in order to build trust in the model and identify when a model might be making a mistake. Application of the general guidelines given above can help developers choose the right attribution method in this case.

Figure 1: Results of Wilcoxon signed rank tests. Effect sizes are normalized for each metric, such that 1 corresponds to the largest recorded effect size (difference in median values). A square is only drawn if the corresponding Wilcoxon signed rank test was significant (p < 0.01). Impact Coverage (Cov) was only computed for the high-dimensional datasets due to the requirement of an adversarial patch (see Section 4.8).

Figure 2: Inter-metric correlations for MNIST, CIFAR-10 and ImageNet. Impact Coverage (Cov) was only computed for ImageNet due to the requirement of an adversarial patch (see Section 4.8).

Figure 4: Comparison of DeepSHAP vs. DeepLIFT using Common Language Effect Size.

Figure 6: Visual overview of our proposed benchmarking guidelines.

Figure 8: Results of Wilcoxon Signed Rank Tests for all metrics on all datasets (contd.).

Figure 10: Krippendorff α for all metrics.Values above 0.3 are shown in green, others are shown in red.
Inter-metric correlations of final selection of metrics.
also introduces an alternative variant of the Deletion metric, where the pixels are masked in

Table 2: Summary of methods. Complexity is expressed as number of executions of the model. n and m are path length and number of perturbed samples/baselines, respectively. Both of these are hyperparameters of the method. Note that, even though many methods have the same asymptotic complexity, the typical values of hyperparameters can vary a lot; for example, DeepSHAP usually needs far fewer samples than KernelSHAP or LIME, making it computationally less expensive.