Evaluation of the Explanatory Power of Layer-wise Relevance Propagation using Adversarial Examples

Approaches for visualizing and explaining the decision process of convolutional neural networks (CNNs) have recently received increasing attention. Particularly popular approaches are so-called saliency methods, which aim to assign a valence to each input pixel based on its importance and influence on the classification via saliency maps. In this paper, we contribute a novel analysis approach built on adversarial examples to investigate the explanatory power of saliency methods, exemplified by layer-wise relevance propagation (LRP). Based on the hypothesis that distinct decisions, such as an image's classification and the classification of its corresponding adversarial examples, should yield dissimilar saliency maps to provide transparent rationales, we break down relevance scores of images and corresponding adversarial examples and analyze them using a comprehensive statistical evaluation. It turns out that different relevance decomposition rules of LRP do not lead to clearly distinguishable saliency maps for images and corresponding adversarial examples, neither in terms of their contour lines nor in terms of the statistical analysis.

Saliency methods attempt to explain an algorithm's decision by assigning pixel-level values that reflect the importance of input components in terms of their contribution to the classification result. Therefore, saliency methods generally lead to so-called saliency maps [30] (also known as input contribution heatmaps or feature importance maps), which try to explain the decision process of CNNs through i. Input Modification, i.e., assigning a relevance to a pixel based on the drop in prediction probability caused by the pixel's perturbation [20,38,39], ii. Class Activation, i.e., combining the activation pattern of a higher-level layer with further information, such as the network's output [40], iii. Backpropagation, i.e., tracing the contribution of the output nodes backwards through the network to the input nodes [3,16,38].
A particularly common backpropagation approach is layer-wise relevance propagation (LRP) introduced by Lapuschkin [3,19]. Layer-wise relevance propagation relies on the assumption that the total amount of relevance is preserved when decomposing the classification decision backwards to the pixel-wise relevance scores. This so-called layer-wise conservation principle postulates that the sum of relevance assigned to neurons in a CNN layer remains the same for two adjacent layers. Despite the popularity of using saliency methods to explain DL models, a significant number of papers have been published that examine the stability and robustness of saliency methods more closely [1,3,12,15,18,19,23,28]. Therefore, our work aims to explore the use of adversarial examples as a further tool to help evaluate the robustness and explanatory power of techniques devoted to the explainability of DL models, exemplified by LRP.

Related Work on the Explanatory Power of Saliency Methods
In the work of Samek et al. [28], the amount of changed classification probability is presented as a possible measure to evaluate the explanation of the decision process provided by LRP. For the investigation of the explanatory power of LRP, they suggest replacing the input variables considered most relevant with samples from a probability distribution, such as the Uniform or Dirichlet distribution. In this case, a large decrease in classification probability caused by a perturbation of the input variables with the highest relevance scores is considered to be an indicator of a suitable explanation. A similar idea is followed by Bach et al. [3], who evaluated the impact of single value perturbations on the detection result by flipping pixels with highly positive and highly negative relevance scores, as well as pixels with relevance scores close to zero. Lapuschkin [19] presents a more generalized approach by employing an iterative greedy procedure to evaluate the expected behavior of LRP.
The work of Ghorbani et al. [12], on the other hand, shows for various gradient-based methods and DeepLIFT that the same object classification for two extremely similar images can be explained by different saliency maps. They perform slight modifications to the input images that preserve the classification of the originals but lead to a substantial difference in saliency maps. Similarly, Kindermans et al. [18] analyze the invariance of saliency map generating methods to transformations of the input data which have no impact on the prediction outcome. Instead of modifying input images, Heo et al. [15] adversarially manipulate the classification model, leaving the model accuracy unchanged while achieving a dramatic change in explanation. They are thus able to perform model manipulations that result in modified models classifying an input object with nearly the same classification probability as the original model. However, the saliency maps differ significantly depending on the underlying model. Further, Adebayo et al. [1] present a sanity check for saliency methods comprising model randomization and data randomization tests. Based on their observations, they find that some saliency methods (e.g., gradient ⊙ input) can be interpreted as implicitly implemented techniques analogous to edge detection, tending to detect edges rather than explain decisions.
Similarities in saliency maps of marginally perturbed images intentionally designed to cause a major shift in classification (also known as adversarial examples [29]) and the corresponding originals are also observed by other authors, such as Gu and Tresp [14] and Brama et al. [5]. Aiming to use saliency maps of adversarial examples to develop defense strategies against adversarial attacks [29,37], Brama et al. [5] use binarized saliency maps based on the 5% of the highest scoring pixels (i.e., pixels most relevant for the classification result) to reveal class-discriminating information and illustrate similarities. Neither the work by Brama et al. [5] nor the work by Gu and Tresp [14] goes beyond a visual comparison of contour lines within the saliency maps. This applies to the work of Montavon et al. [24] as well. However, visual inspection is insufficient to assess whether an explanation is model-sensitive, as Adebayo et al. demonstrate in [1].

Contribution
In this paper, we investigate the use of minimally invasive classification shift perturbations, more precisely adversarial examples [29,37], to evaluate the robustness and stability of the explanations obtained from saliency maps, exemplified by LRP. Our approach focuses on a relevance score-independent input modification causing a completely different classification, and therefore clearly differs from [3,19,28], which measure the variation in classification probabilities following relevance score-dependent perturbation of input variables. This also distinguishes our approach from the approach taken by Ghorbani et al. [12], where the input image is perturbed as well, yet without changing the classification decision. Moreover, the approach of Ghorbani et al. [12] is motivated by the expectation that, given a reproducible and consistent explanatory pattern, a minor input change should not affect the classification decision, and hence the saliency map.
In contrast, we present a novel method to analyze the explanatory power of LRP by breaking down LRP-relevance scores of images and corresponding adversarial examples based on the hypothesis that distinct decisions should yield dissimilar saliency maps even for small changes of the input variables to provide comprehensible rationales. In other words: to explain a turn in a classifier's original decision, the decision turn should result in a variation in the saliency map for a significant proportion of the input variables considered most relevant for achieving the original decision.
Furthermore, we present a novel approach to statistically compare LRP-based saliency maps using relevance score distributions and relevance score rankings. We provide a comprehensive statistical analysis of LRP-based saliency maps of images and corresponding adversarial examples in terms of changes in relevance score distribution and variations in the scores of components marked highly relevant by LRP before and after image perturbation. Consequently, our approach extends current analysis of the explanatory power of saliency maps [3,5,14] by going beyond a rather subjective visual comparison of contour lines. Using multiple relevance decomposition rules of LRP, we demonstrate that different decomposition rules do not produce clearly distinguishable saliency maps for images and corresponding adversarial examples, neither regarding their contour lines nor regarding the statistical measures mentioned above. Finally, we assess the suitability of our presented approach as a potential evaluation tool for the adequacy of saliency methods, and thus as an extension of existing methods [1,12,15,18], by analyzing whether the differences in saliency maps for pairs of images and adversarial examples are significant enough to explain the difference in classification.
The saliency maps and adversarial examples underlying our analyses are generated based on a simple CNN architecture using CIFAR-10 [4] data and the L-BFGS attack [6,29,37]. A schematic representation of our approach is shown in Fig. 1. The related source code, as well as the generated adversarial examples, can be found in [7].

Outline
The remainder of the paper is organized as follows. Section 2 gives a brief introduction to the fundamental principles and applied methodologies underlying the analysis of this work. In Sect. 3, the experimental setup for generating the adversarial examples, as well as the relevance scores and the procedure of their analysis, are described, including the underlying dataset and the applied CNN architecture. Finally, the most significant results of the analysis are summarized in Sect. 4, followed by a brief discussion and a conclusion (Sect. 5).

Fundamental Principles
This section briefly outlines the basic principles of layer-wise relevance propagation and adversarial examples including general definitions and generation techniques.

Layer-wise Relevance Propagation
Layer-wise relevance propagation represents a methodical approach aiming to increase the transparency and interpretability of individual classification decisions. LRP strives to identify important components by decomposing the classifier's output into individual contributions of the input components. The application of LRP is based on the assumption that the classifier is decomposable into $n_L \in \mathbb{N}$ individual layers of size $n_l \in \mathbb{N}$ and therefore, the classification function $f: \mathbb{R}^n \to C$ can be represented as a composition of the functions $f_l$ with $l \in \{1, \dots, n_L\}$, i.e., $f = f_{n_L} \circ \dots \circ f_1$, whereby $C$ denotes the set of available classes. The classification function is required to create mappings between intermediate representations of the input data $X \in \mathbb{R}^n$, which are generally denoted by $z^l \in \mathbb{R}^{n_l}$. The mapping between the $i$-th component of layer $l$ and the $j$-th component of layer $l+1$ is defined by $z_{i \to j}^{l,l+1}$, so that

$$z_j^{l+1} = \sum_{i=1}^{n_l} z_{i \to j}^{l,l+1} \qquad (1)$$

holds for all $j \in \{1, \dots, n_{l+1}\}$ (see Fig. 2). Using layer-wise relevance propagation, the degree of a component's influence on the final decision score is measured by the relevance score $R$, which constitutes a relative measure of a component's contribution to the network's outcome. The sign of $R$ indicates the direction of the contribution, whereby $R > 0$ indicates a positive and $R < 0$ a contradictory contribution to the classifier's outcome, i.e., a contribution that contradicts the final decision score. A component with a relevance score close to zero is expected to be irrelevant with regard to the decision made by the classifier.

Fig. 2 Principle procedure of layer-wise relevance propagation, showing the idea of redistributing the relevance score $R_j^{l+1}$, $j \in \{1, \dots, n_{l+1}\}$, of the $j$-th component of layer $l+1$ (right) in dependence of the corresponding input components' forward contributions (left)
Proceeding from the model's final layer $n_L$ with an initial relevance score of $R^{n_L} = f(X)$, LRP successively propagates the relevances backwards through the network until the input layer is reached. Considering a multi-class classification problem solved via deep neural networks, the classifier's prediction usually results in a vector containing probabilities for each existing class. In this case, the relevance score $R^{n_L}$ is initialized with the value of the class that is supposed to be explained. Under the assumption that the relevance $R_j^{l+1}$ of each component $z_j^{l+1}$ of layer $l+1$ (cf. Eq. 1) has already been identified, the relevance of the previous layer's components $z_i^l$ is given by

$$R_i^l = \sum_{j=1}^{n_{l+1}} R_{i \leftarrow j}^{l,l+1} \qquad (2)$$

(cf. Fig. 2). The relevance message $R_{i \leftarrow j}^{l,l+1}$, directed from component $j$ to component $i$, describes the share of the relevance score $R_j^{l+1}$ that can be traced back to the $i$-th component of layer $l$ and can be determined, for example, according to the decomposition rules listed in Table 1. The signum function occurring in Table 1 is defined as follows:

$$\operatorname{sign}(x) = \begin{cases} 1, & x \geq 0 \\ -1, & x < 0 \end{cases} \qquad (3)$$

Positive forward contributions $\left(z_{i \to j}^{l,l+1}\right)^{+}$ and negative forward contributions $\left(z_{i \to j}^{l,l+1}\right)^{-}$ are defined by

$$\left(z_{i \to j}^{l,l+1}\right)^{+} = \max\left(0, z_{i \to j}^{l,l+1}\right) \quad \text{and} \quad \left(z_{i \to j}^{l,l+1}\right)^{-} = \min\left(0, z_{i \to j}^{l,l+1}\right) \qquad (4)$$

Further decomposition rules, as well as a detailed description of the aforementioned decomposition rules, can be found in [3], [19] and [24].
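The redistribution step described above can be illustrated for a single fully connected layer. The following is a minimal NumPy sketch, not the implementation used in this work: the function names `lrp_dense` and `safe_div`, the omission of bias terms, the convention sign(0) = 1, and the coupling β = α − 1 for the αβ-rule are assumptions made for illustration, following the rule definitions commonly given in the LRP literature [3,19,24].

```python
import numpy as np

def safe_div(num, den):
    """Element-wise num / den that returns 0 wherever den is 0."""
    den = np.broadcast_to(den, num.shape)
    return np.divide(num, den, out=np.zeros_like(num), where=den != 0)

def lrp_dense(a, W, R_next, rule="0", eps=0.01, alpha=1.0):
    """Redistribute the relevance R_next of layer l+1 onto layer l.

    a      : activations of layer l, shape (n_l,)
    W      : weights, shape (n_l, n_next); biases are ignored (assumption)
    R_next : relevance scores of layer l+1, shape (n_next,)
    """
    z = a[:, None] * W                      # forward contributions z_{i->j}
    if rule == "0":                         # LRP-0
        msg = safe_div(z, z.sum(axis=0)) * R_next
    elif rule == "eps":                     # LRP-eps with sign(0) = +1
        zj = z.sum(axis=0)
        msg = z / (zj + eps * np.where(zj >= 0, 1.0, -1.0)) * R_next
    elif rule == "alphabeta":               # LRP-alpha-beta, beta = alpha - 1 (assumption)
        beta = alpha - 1.0
        zp, zn = np.maximum(z, 0.0), np.minimum(z, 0.0)
        msg = (alpha * safe_div(zp, zp.sum(axis=0))
               - beta * safe_div(zn, zn.sum(axis=0))) * R_next
    else:
        raise ValueError(f"unknown rule: {rule}")
    return msg.sum(axis=1)                  # R_i = sum_j R_{i<-j}

# Hypothetical toy layer with three inputs and two outputs.
a = np.array([1.0, 2.0, 0.5])
W = np.array([[0.5, -1.0], [-0.25, 0.5], [1.0, 1.0]])
R_next = np.array([0.7, 0.3])

R_0 = lrp_dense(a, W, R_next, rule="0")
R_eps = lrp_dense(a, W, R_next, rule="eps", eps=0.1)
R_ab = lrp_dense(a, W, R_next, rule="alphabeta", alpha=2.0)
```

In this toy setting, LRP-0 and LRP-αβ pass on the full relevance of 1.0, which matches the layer-wise conservation principle, whereas LRP-ε passes on slightly less because the stabilizer enlarges the denominator.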

Adversarial Examples
Adversarial examples are especially common in the area of image classification and object recognition, intentionally designed to deceive machine learning models and provoke misclassifications with high probabilities. They are characterized by a close resemblance to the training data and cannot be differentiated from regular input images by human observers (see Fig. 3).
Given a reference image $X \in \mathbb{R}^n$ and a target class $c \in C$, the generation of an adversarial example $\tilde{X}$ can be described as the optimization problem

$$\min_{\eta} \|\eta\| \quad \text{subject to} \quad f(X + \eta) = c, \quad L \leq X + \eta \leq U,$$

with $\eta = \tilde{X} - X$ meaning the discrepancy between the reference image and the adversarial example $\tilde{X} \in \mathbb{R}^n$. The parameters $L, U \in \mathbb{R}^n$ represent the component-wise lower and upper bounds on the pixel values of the adversarial example belonging to the target class $c \in C$. The determination of the minimal perturbation $\eta^*$ that is needed to provoke a misclassification of the reference image is a complex, nontrivial problem. There is a wide range of algorithms, so-called adversarial attacks, which enable the approximate solution of this problem. Adversarial attacks, such as the Fast Gradient Sign method or the L-BFGS attack (see Sect. 3.2), are usually based on different algorithmic approaches and assumptions. An extensive survey of existing adversarial attacks is given by [29] and [37].

Dataset and Network Architecture
The generation of the adversarial examples forming the foundation of the analysis covered within this paper and the training of the underlying classifier is based on the dataset CIFAR-10.In the research area of machine learning, CIFAR-10 is a commonly used benchmark dataset of RGB images characterized by a comparatively low image resolution (32 × 32).The dataset comprises 60,000 images (50,000 training and 10,000 test samples) belonging to the classes airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck, which consist of 6,000 samples each.The available classes are denoted by c i ∈ C for i ∈ {1, . . ., 10} and thus, C = {c i | i ∈ {1, . . ., 10}} holds.
In the field of image classification, there is already a considerable number of CNN architectures that achieve excellent results on CIFAR-10 [4]. These CNNs are generally very deep and equipped with innovative architectural elements (e.g., skip connections [17]). To circumvent potential dependencies on specific architectural design and ensure a straightforward and transparent analysis of relevant features, we deliberately selected a CNN characterized by a simpler and shallower network architecture. The network's architectural design is based solely on fundamental structural elements, such as convolutional, max-pooling, and fully connected layers, whose configuration is inspired by the state-of-the-art classifier VGG [17] (cf. Table 2). Unlike VGG networks, our CNN architecture features a significantly smaller number of trainable parameters. VGG16 [31], for instance, includes approximately 138 million trainable parameters, while our CNN only consists of 307,936 trainable parameters (cf. Table 2). The CIFAR-10-based training and evaluation of the selected network leads to a training accuracy of 90.87 % and a validation accuracy of 89.02 %. Therefore, 8,902 test images and 47,552 training images are correctly classified.

Generation of Adversarial Examples
The generation of the adversarial examples is based on the 8,902 correctly classified images of CIFAR-10's test dataset using the L-BFGS attack (cf. Sect. 2.2). The L-BFGS attack is an iterative white-box attack based on the limited memory BFGS method for bound constrained optimization (short L-BFGS-B), a numerical optimization algorithm described in detail by Byrd et al. [6]. Since the L-BFGS attack is a targeted adversarial attack, the attack's desired target class c ∈ C needs to be specified in advance. To obtain a wide variety of adversarial examples for later analysis, every class of CIFAR-10 other than the image's true class is chosen once to be the attack's target class. Therefore, the attack is executed nine times for each correctly classified image X ∈ R^n, i.e., for all targets c ∈ C with c ≠ y. Here, y ∈ C denotes the image's true label. The implementation of the L-BFGS attack provided by the Python library Foolbox 2.4.0 [11,26] was used to create the adversarial examples. Further information regarding the algorithmic specification of the attack's implementation can be found in [32].
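To make the idea of a targeted, bound-constrained attack more concrete, the following sketch applies SciPy's L-BFGS-B optimizer to a toy linear softmax classifier. It is only an illustration under strong assumptions and does not reproduce the Foolbox implementation used in this work: the classifier `W`, the fixed trade-off constant `c`, and the function name `lbfgs_attack` are hypothetical, and the search over the trade-off constant that L-BFGS attacks usually perform is omitted.

```python
import numpy as np
from scipy.optimize import minimize

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def lbfgs_attack(x, target, W, c=0.01, lower=0.0, upper=1.0):
    """Targeted attack on the toy classifier f(x) = softmax(W @ x).

    Minimizes c * ||eta||^2 + cross-entropy(f(x + eta), target)
    subject to the box constraints lower <= x + eta <= upper.
    """
    def objective(x_adv):
        eta = x_adv - x
        cross_entropy = -np.log(softmax(W @ x_adv)[target] + 1e-12)
        return c * np.dot(eta, eta) + cross_entropy

    result = minimize(objective, x0=x.copy(), method="L-BFGS-B",
                      bounds=[(lower, upper)] * x.size)
    return result.x

# Hypothetical three-class problem with three "pixels" in [0, 1].
W = 5.0 * np.eye(3)
x = np.array([0.9, 0.1, 0.1])        # classified as class 0
x_adv = lbfgs_attack(x, target=1, W=W)

original_class = int(np.argmax(W @ x))
adversarial_class = int(np.argmax(W @ x_adv))
```

For real images, the perturbation found this way is typically far smaller relative to the input than in this three-dimensional toy problem, where the class change requires moving across most of the admissible pixel range.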

Generation of Relevance Scores
The relevance scores for both adversarial examples and original images (cf. Fig. 1) are determined according to the basic relevance decomposition rule (LRP-0), the ε-rule (LRP-ε), as well as the αβ-rule (LRP-αβ) (cf. Table 1), implemented in Python 3.6.10 using TensorFlow 2.1.0. The advanced decomposition rules LRP-ε and LRP-αβ are executed for different parameter values, i.e., for ε ∈ {0.0001, 0.01, 0.1, 1} and α ∈ {1, 2}, to evaluate the parameters' effects on the final relevance scores as well. Given the tensorial representation of the input images (32 × 32 pixels with three color channels), the application of the relevance decomposition rules according to Sect. 3.3 results in 3072 relevance scores per adversarial example or original image, respectively.
To allow a clear distinction between relevant and irrelevant input components, especially when passing on off-manifold data like adversarial examples, the softmax pre-activation values are used as initial relevance scores, instead of utilizing the classifier's final probabilistic output. To avoid excessively large relevance scores and enable comparability while maintaining the relevance scores' ratios and signs within each image, the final relevance scores are normalized separately for each image using the maximum norm. As opposed to [24], there is no composition of different decomposition rules depending on a layer's position within the architecture of the CNN.
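The per-image normalization by the maximum norm can be sketched in a few lines. The function name `normalize_relevance` and the example values are hypothetical; the sketch only illustrates that dividing by the largest absolute score maps all scores into [−1, 1] while leaving their signs and mutual ratios untouched.

```python
import numpy as np

def normalize_relevance(R):
    """Scale the relevance scores of one image by their maximum norm."""
    max_abs = np.max(np.abs(R))
    # An all-zero map is returned unchanged to avoid a division by zero.
    return R / max_abs if max_abs > 0 else R

R = np.array([4.0, -2.0, 0.5, 0.0])   # hypothetical raw relevance scores
R_norm = normalize_relevance(R)
```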

Analysis of the Relevance Scores
The analysis is conducted based on the relevance scores of 77,402 generated adversarial examples, as well as the relevance scores of the correctly classified test images of CIFAR-10 (cf. Sect. 3.3). For our investigations, we applied established methods of descriptive statistics and exploratory data analysis, such as expected values, standard deviations, quantile values and ranges, as well as visual evaluation and verification via histograms and saliency maps. The statistical analysis is performed separately for relevance scores above and below zero, due to the different interpretation of positive and negative relevance scores (cf. Sect. 2.1). Input components with high positive or particularly contradictory contributions to the classifier's outcome, i.e., components with high absolute relevance scores, are of particular interest in the context of our analyses. If the hypothesis holds that different classification decisions must lead to distinct saliency maps to provide reasonable explanations, there should be a significant discrepancy between adversarial examples and original images, especially in extreme value ranges. Therefore, the focus is on the components with the most extreme relevance scores in each sample, as well as the largest 1 % of the positive relevance scores and the smallest 1 % of the negative relevance scores. This selection is due to the determined quantile values shown in Table 3 and Table 4. In addition to the analysis of significantly influential components of adversarial examples and original images, non-influential components, i.e., components with relevance scores close to zero, are examined as well. Due to their comparatively low relevance scores, components with a positive score below 0.001 and a negative score above −0.001 are assumed to be irrelevant for the classification.
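A possible way to compute such descriptive statistics for a single normalized relevance map is sketched below. The function name `summarize`, the dictionary keys, and the synthetic input are assumptions made for illustration; exact zeros are counted as irrelevant here, which the description above leaves open.

```python
import numpy as np

def summarize(R, q=0.99, zero_tol=1e-3):
    """Descriptive statistics of one relevance map, split by sign."""
    pos, neg = R[R > 0], R[R < 0]
    return {
        "mean": R.mean(),
        "std": R.std(),
        # Threshold of the largest 1 % of the positive scores.
        "pos_top_threshold": np.quantile(pos, q) if pos.size else np.nan,
        # Threshold of the smallest 1 % of the negative scores.
        "neg_bottom_threshold": np.quantile(neg, 1 - q) if neg.size else np.nan,
        # Share of components assumed irrelevant (|R| < 0.001).
        "share_irrelevant": np.mean(np.abs(R) < zero_tol),
    }

# Synthetic stand-in for the 3072 relevance scores of one image.
rng = np.random.default_rng(0)
R = rng.normal(0.0, 0.05, size=3072)
R = R / np.max(np.abs(R))
stats = summarize(R)
```

Applying the same function to the relevance map of an original image and to that of its adversarial example yields two sets of values that can be compared directly.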
Furthermore, we establish a ranking describing the relevance shift between components of original images and components of adversarial examples triggered by the application of the adversarial attack (cf. Fig. 4). For this purpose, the input components of each image are sorted separately and in descending order according to their relevance score, without distinguishing between positive and negative values. The relevance shift of each component is defined by the difference between the position of a component in the relevance ranking based on the original image and the position of a component in the relevance ranking based on the corresponding adversarial example. Hence, a positive shift indicates a positional degradation and a negative shift a positional enhancement of a component when looking at adversarial examples. A shift of zero implies that the position of a component remains unchanged. When analyzing the positional shift of individual components, we focus primarily on the components with the largest or the largest 1 % of the positive relevance scores, similar to the statistical evaluation. Additionally, we investigate the change in position for the most relevant 10 % of the components. In the following, the relevance ranking for components of adversarial examples is referred to as adversarial relevance ranking and the ranking based on components of original images is referred to as original relevance ranking.
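The ranking procedure can be sketched as follows. Two points are assumptions rather than statements of the original method: the components are sorted by their signed relevance score, and the shift is computed as adversarial position minus original position, which is the order that matches the stated sign convention (positive shift = degradation). The function names and toy values are hypothetical.

```python
import numpy as np

def ranking_positions(R):
    """Position of every component (0 = most relevant) in descending order."""
    order = np.argsort(-R, kind="stable")
    positions = np.empty_like(order)
    positions[order] = np.arange(R.size)
    return positions

def relevance_shift(R_orig, R_adv):
    """Positive values: the component ranks lower in the adversarial ranking."""
    return ranking_positions(R_adv) - ranking_positions(R_orig)

# Hypothetical relevance scores of four components.
R_orig = np.array([0.9, 0.1, -0.3, 0.5])
R_adv = np.array([0.4, 0.2, -0.1, 0.8])
shift = relevance_shift(R_orig, R_adv)
```

In this example, the first component falls from position 0 to position 1 (shift +1), while the last component rises from position 1 to position 0 (shift −1).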

Classification Accuracy
The application of the L-BFGS attack according to the experimental setup sketched above (cf. Sect. 3.2) results in a total of 77,402 adversarial examples, which corresponds to a success rate of 96.61 %. The vast majority of adversarial examples, more precisely 99 %, show a classification probability p_c above 93.37 % towards their respective target class c ∈ C. Only in 0.5 % of the cases, p_c ≤ 54.82 % holds. Hence, in these cases, the adversarial attack

Visual Evaluation
The visual verification and direct comparison between the input contribution heatmaps of an original image and its corresponding adversarial examples reveal no significant differences (cf. Figs. 5 and 6). Despite strongly divergent classification decisions and high classification accuracy, there is almost no difference between components of adversarial examples marked relevant and relevant components of original images. Even though individual pixels undergo minor changes in the absolute magnitude of their relevance scores and some previously insignificant pixels seem to become relevant as a result of the adversarial attack, the majority of pixels appear to have a strong impact on both the classification of the adversarial example and the classification of the original image. This can be observed especially for input contribution heatmaps derived from relevance scores obtained by applying LRP-αβ with α = 1 (cf. Figs. 5 and 6). Even though some background components seem to become relevant for the classifier's outcome through the changes induced by the adversarial attack, the contours of the original objects are clearly visible in the input contribution heatmaps of the adversarial examples. This implies that components marked relevant for original images seem to be relevant for adversarial examples as well, albeit leading to a significantly different classification decision with a high accuracy towards the pre-defined target class (here automobile, bird, cat, horse and truck). In some cases, the negative relevance scores even overpower the positive ones (e.g., Fig. 6, original class bird, target class cat), replicating the original object's contour lines. Thus, the heatmaps seem to clearly contradict the result of the classifier.
Accordingly, a visual verification seems to be ambiguous and not sufficient to explain the entirely different classification results of original images and adversarial examples. Furthermore, the question arises whether a visual verification of relevance scores based on human interpretation of contour lines can actually explain the influence of a component within the complex structure of a deep neural network. To base the evaluation on more than a visual inspection, a statistical evaluation of the differences in saliency maps is presented in the following section.

Statistical Evaluation
Regardless of the applied decomposition rule, the statistical evaluation shows that on average 0.36 % of the adversarial components have a relevance score of zero, and therefore are considered non-influential to the final decision score f(X), X ∈ R^n. For correctly classified images, on average only 0.24 % of the components have a relevance score of zero. Looking at the quantile values in Tables 3 and 4, only 1 % of the positive and 1 % of the negative relevance scores appear to be significant for the final classification decision. However, the majority of the components seem to have no significant impact according to LRP, as their relevance scores are an order of magnitude lower than those below and above the 1 % and 99 % quantiles, respectively. This is also reflected by the relevance scores' expected value, which ranges from zero (LRP-0) to 0.0516 (LRP-αβ, α = 1) for adversarial examples and from zero to 0.055 for original images.
Considering the relevance scores' quantile values, summarized in Tables 3 and 4, there is no discernible difference between the relevance scores of adversarial examples and the relevance scores of original images. In both cases, the relevance scores obtained by LRP-0 and LRP-ε with ε ∈ {0.0001, 0.01, 0.1} are symmetrically distributed around zero. The distributions of the relevance scores obtained by LRP-ε with ε = 1 and LRP-αβ with α ∈ {1, 2}, on the other hand, are slightly skewed to the right, which is due to the nature of the applied decomposition rules.
In the case of LRP-ε, the parameter ε absorbs a certain amount of relevance and thus eliminates weak or contradictory contributions as ε grows. Accordingly, with an increasing parameter value the number of irrelevant components increases and only the most salient components survive, which is also reflected by the quantile values in Tables 3 and 4. Furthermore, it can be observed that the relevance scores' standard deviation also declines with growing ε, showing values below 0.1184. Additionally, the gaps between the quantile values of the relevance scores change for ε = 1 and the relevance scores of allegedly influential components tend to become even larger. This seems to allow a more precise distinction of relevant and irrelevant features. In contrast to LRP-ε, the observed distribution shift for the relevance scores obtained by LRP-αβ is due to the different weighting of positive and negative forward contributions. Especially interesting is the significant difference in the lower quantile values for relevance scores of adversarial components and components of original images for α = 1. In the case of original images, 2 % of the components are assigned a relevance score less than zero, whereas only 0.1 % of the relevance scores associated with adversarial examples are in the negative value range. Nevertheless, in both cases, the relevance scores have similar expected values (0.0516 for adversarial examples, 0.055 for original images) and standard deviations (0.0930 for adversarial examples, 0.1013 for original images). Similar to LRP-ε, the variation of the gap between the relevance scores' quantile values can be observed for LRP-αβ as well.
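The absorbing effect of ε can be made tangible with a small calculation. Assuming the commonly used form of the ε-rule, in which each relevance message is divided by z_j + ε · sign(z_j), a neuron with summed forward contribution z_j passes on only the fraction z_j / (z_j + ε · sign(z_j)) of its relevance. The value z_j = 0.5 and the function name `retained_fraction` are hypothetical.

```python
def retained_fraction(z_j, eps):
    """Fraction of a neuron's relevance passed on by the epsilon rule."""
    sgn = 1.0 if z_j >= 0 else -1.0
    return z_j / (z_j + eps * sgn)

# The epsilon values examined in this work, applied to a hypothetical z_j.
eps_values = (0.0001, 0.01, 0.1, 1)
fractions = [retained_fraction(0.5, eps) for eps in eps_values]
for eps, frac in zip(eps_values, fractions):
    print(f"eps = {eps}: {frac:.4f} of the relevance is passed on")
```

For this neuron, almost all relevance survives for ε = 0.0001, whereas only a third is passed on for ε = 1. Neurons with a small |z_j| lose proportionally more, which is one way to read the observation that only the most salient components survive as ε grows.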
Regardless of the applied decomposition rule (i.e., LRP-0, LRP-ε or LRP-αβ), the examination and direct comparison of the relevance scores for both adversarial examples and original images using quantiles, expected values and standard deviations revealed no major differences between their relevance scores. Even the analysis of highly influential or non-influential components showed neither significant differences between the relevance scores of adversarial examples and original images, nor general differences between positive and negative relevance scores. Hence, the statistical analysis indicates a rather ambiguous behavior of LRP as well (cf. Sect. 4.2.1), supporting the conjecture of insufficient explanatory power, especially when considering defective data such as adversarial examples.

Relevance Ranking
The results above are also supported by the established relevance ranking for components of original images and components of adversarial examples, as well as by the direct comparison of their ranking positions according to the ranking procedure described above. The examination of the components with the highest relevance score in each original image pursuant to LRP-αβ with α = 1 shows that 96.72 % of these components are also among the 10 % of the components most relevant for the classification of the corresponding adversarial example. In 75.21 % of the cases, the most relevant component of the original image even belongs to the 1 % top-scored components of the associated adversarial example. When looking at the positional shift of an original image's most relevant component, which still belongs to the top 1 % or top 10 % of the most relevant components in the adversarial ranking, a comparatively small change in position can be observed. For the majority of the components (more precisely 70 % of them), the change in position is below 7 when considering the top 1 % of the components within the adversarial ranking, and less than 54 when taking the top 10 % into account. This observation is also illustrated by Fig. 7. In 3.75 % of the cases, the most relevant component of the original image and the most relevant component of the corresponding adversarial example are identical.
Considering the 1 % of the original images' most relevant components according to LRP-αβ with α = 1, it can be observed that 43.37 % of them also belong to the 1 % of the top-ranked components of the corresponding adversarial examples. Approximately 93 % of the original images' top 1 % even belong to the top 10 % of the adversarial components mainly responsible for the classifier's outcome. As illustrated by Fig. 7, the absolute positional change of an original component within the adversarial relevance ranking is less than 10 in 70 % of the cases when looking at the 1 % of the top-ranked adversarial components, and less than 54 when considering the top 10 %. Of particular interest is the change in position of the most [...] scores obtained by LRP-αβ with α = 1. Therefore, these results will not be discussed further. However, the main results can be found in Table 5.
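The overlap figures reported above can be computed along the following lines. This is a sketch under assumptions: the function names are hypothetical, components are ranked by signed relevance score, and the size of a top set is obtained by rounding the requested fraction of the 3072 components, which may differ from the rounding used for the reported values.

```python
import numpy as np

def top_k_indices(R, fraction):
    """Indices of the highest-scoring fraction of the components."""
    k = max(1, int(round(fraction * R.size)))
    return np.argsort(-R, kind="stable")[:k]

def top_overlap(R_orig, R_adv, frac_orig=0.01, frac_adv=0.10):
    """Share of the original top components found in the adversarial top set."""
    top_orig = top_k_indices(R_orig, frac_orig)
    top_adv = top_k_indices(R_adv, frac_adv)
    return np.isin(top_orig, top_adv).mean()

# Synthetic stand-ins for two relevance maps with 3072 components each.
rng = np.random.default_rng(1)
R_orig = rng.normal(size=3072)
R_adv = R_orig + rng.normal(scale=0.1, size=3072)   # slightly perturbed copy

overlap_same = top_overlap(R_orig, R_orig)       # identical maps
overlap_reversed = top_overlap(R_orig, -R_orig)  # reversed ranking
overlap_perturbed = top_overlap(R_orig, R_adv)
```

Identical maps give an overlap of 1 and a fully reversed ranking gives 0, so the reported values of roughly 93 % and above lie close to the case in which the adversarial attack leaves the set of top-ranked components unchanged.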
Since the majority of the top-ranked 1 % experience only a marginal change in position of 2.3 per thousand and the majority of the top-ranked 10 % merely undergo a relative position change of 1.76 %, these top-score shifts between original images and adversarial examples cannot be considered a reliable foundation for explaining the change in classification w.r.t. adversarial examples and original images. Given the fact that the analysis did not discriminate between class affiliations or target class dependencies, these results indicate a general characteristic problem of layer-wise relevance propagation.
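The relative position changes quoted above follow from the absolute bounds of 7 and 54 positions and the 3072 relevance scores per image, as the following short calculation shows (variable names are chosen for illustration only).

```python
n_components = 32 * 32 * 3            # 3072 relevance scores per image

shift_top1 = 7 / n_components         # bound observed for the top 1 %
shift_top10 = 54 / n_components       # bound observed for the top 10 %

print(f"top 1 %:  {shift_top1 * 1000:.1f} per thousand")
print(f"top 10 %: {shift_top10 * 100:.2f} %")
```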

Discussion
Adversarial examples are generally characterized by high similarity to the original data. Therefore, edges in images rarely undergo significant changes in adversarial attacks (cf. Fig. 3). LRP clearly highlights this feature by carving out almost identical contour lines for both the original image and the adversarial examples (cf. Figs. 5 and 6), although they are classified differently. Consequently, LRP emphasizes the image contour lines rather than actually explaining the network's decision. This finding is also supported by the observations of Adebayo et al. [1], who show that some saliency methods (e.g., gradient input) work like an edge detector, in combination with the work of Ancona et al. [2], who show that gradient input is strongly related to LRP and even equivalent in some configurations. Taking further into account that visual comparison between object contours and saliency maps is a "poor guide in judging whether the saliency map is sensitive to the underlying model" [1], our results lead to the conclusion that assigning a relevance score to individual input components based on a layer-wise conservation principle to measure their importance in the decision process does not properly explain the behavior of a deep neural network.
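The relation between LRP and gradient input mentioned above, as well as the conservation principle, can be made concrete with a toy sketch. It assumes a bias-free two-layer ReLU network with random weights and an ε-rule with a very small stabilizer (close to LRP-0). This is an illustrative configuration, not the CNN or the settings used in our experiments, and the helper `lrp_linear` is a hypothetical name.

```python
import numpy as np


def lrp_linear(a, W, R_out, eps=1e-12):
    """Redistribute the relevance R_out of a bias-free linear layer
    z = a @ W onto its inputs a (epsilon rule with a tiny stabilizer)."""
    z = a @ W
    sign = np.where(z >= 0, 1.0, -1.0)
    s = R_out / (z + eps * sign)
    return a * (W @ s)


rng = np.random.default_rng(0)
x = rng.normal(size=8)          # toy input with 8 components
W1 = rng.normal(size=(8, 6))    # first layer weights (no bias)
W2 = rng.normal(size=(6, 3))    # output layer weights (no bias)

# Forward pass: linear -> ReLU -> linear.
z1 = x @ W1
a1 = np.maximum(z1, 0.0)
out = a1 @ W2
c = int(np.argmax(out))         # explained class

# LRP backward pass, starting from the score of class c only.
R_out = np.zeros_like(out)
R_out[c] = out[c]
R_hidden = lrp_linear(a1, W2, R_out)
R_input = lrp_linear(x, W1, R_hidden)

# Gradient of out[c] with respect to x, multiplied by the input.
grad = W1 @ ((z1 > 0) * W2[:, c])
grad_times_input = x * grad
```

In this restricted setting, `R_input` coincides with `grad_times_input` up to numerical precision, and the relevance scores sum to the output score `out[c]`. Both properties depend on the stated assumptions (no biases, ReLU activations, negligible ε) and do not carry over to other decomposition rules such as LRP-αβ.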
However, this rationale is not inconsistent with other evaluations that use relevance score-dependent perturbations of input components to analyze the explanatory power of LRP [3,19,28]. Since these approaches change components of objects marked as relevant, they technically change the edges of these objects, which should reduce the classification probability. But this does not explain how the network arrives at its decisions, because in the case of adversarial examples, where components allegedly relevant according to LRP mainly remain unchanged (cf. Figs. 7 and 8), the classification decision is completely overturned.

Conclusion
In this paper, we presented a comprehensive statistical analysis and a novel approach to evaluate the explanatory power of LRP using adversarial examples as relevance score-independent perturbation. The performed analyses demonstrate that there is no significant difference between the saliency maps of adversarial images and the corresponding original ones. This leads to the conclusion that there is no evidence that LRP in its current version explains the CNN's decision process for original images or adversarial examples in a comprehensible way. Nevertheless, our analyses show that adversarial examples offer the potential to uncover inconsistencies in the robustness and stability of explanations obtained by saliency methods. We believe that adversarial examples are a useful addition to the means of evaluating the explanatory power of such methods.
While our work was a first step in this direction, the presented approach can be used for consistency evaluations of other explainability methods (e.g., LIME [27,35]) as well.
Funding Open Access funding enabled and organized by Projekt DEAL.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Fig. 1 Procedure of analyzing the explanatory power of LRP using adversarial examples. First, the adversarial examples are created based on the original data (cf. Sect. 3.2). Then, the relevance scores are generated for original images and the adversarial examples using LRP (cf. Sect. 3.3). Finally, the relevance scores are comprehensively analyzed, compared, and evaluated using established statistical methods (cf. Sect. 3.4)

Fig. 3 Results of the L-BFGS attack targeting the classes airplane, automobile, bird, cat, deer, dog, frog, ship and truck based on an image X ∈ R^n (bottom row, third from the left) originally assigned to the class horse. All adversarial examples show a classification probability p_c over 99% towards the attack's corresponding target class c ∈ C

Fig. 4 Description of the procedure for generating the relevance ranking based on the relevance scores of original images and corresponding adversarial examples (see input). Step 1: The relevance scores (here LRP Score) are sorted for each image and associated adversarial examples individually in descending order. The pixel with the highest score is ranked first, while the pixel with the lowest score is ranked last. A pixel can be identified by its position in the image frame (here Pixel Pos.). Step 2: The position of each pixel in the original relevance ranking and the adversarial relevance ranking is compared, resulting in a positional relevance shift for each pixel (here Pos. Shift)

Fig. 5 Results of the application of LRP-ε and LRP-αβ to correctly classified images of CIFAR-10 belonging to the classes automobile, bird, cat, horse and truck

Fig. 6 Results of the application of LRP-ε and LRP-αβ to adversarial examples belonging to the target classes c ∈ {automobile, bird, cat, horse, truck}
3.4. Particularly striking are the results of the relevance ranking for adversarial examples and original images based on the relevance scores obtained by applying LRP-αβ with α = 1.

Fig. 7 Frequency distribution of the positional change of the most relevant component of each original image according to LRP-αβ, α = 1, which also belongs to the top 1% (left) or the top 10% (right) of the adversarial input components

Fig. 9 Frequency distribution of the positional change of the 10% most relevant components of each original image according to LRP-αβ, α = 1, which also belong to the top 10% of the adversarial input components

Table 1 Relevance message based on commonly used decomposition rules

Table 2 Detailed specification of the CNN architecture

Table 4 Quantiles of the original images' relevance scores obtained by applying LRP-0, LRP-ε and LRP-αβ

Table 5 Share of the original images' top-ranked components that also belong to the corresponding share of the adversarial examples' top-ranked components based on the relevance scores obtained by LRP-αβ with α = 2 and LRP-ε with ε = 1