Improving Interpretability via Regularization of Neural Activation Sensitivity

State-of-the-art deep neural networks (DNNs) are highly effective at tackling many real-world tasks. However, their wide adoption in mission-critical contexts is hampered by two major weaknesses: their susceptibility to adversarial attacks and their opaqueness. The former raises concerns about the security and generalization of DNNs in real-world conditions, whereas the latter impedes users' trust in their output. In this research, we (1) examine the effect of adversarial robustness on interpretability and (2) present a novel approach for improving the interpretability of DNNs that is based on regularization of neural activation sensitivity. We compare the interpretability of models trained using our method to that of standard models and models trained using state-of-the-art adversarial robustness techniques. Our results show that adversarially robust models are superior to standard models and that models trained using our proposed method are even better than adversarially robust models in terms of interpretability.


Introduction
In recent years, deep neural networks (DNNs) have increasingly been used to tackle many complex tasks previously thought to be solvable only by humans, with accuracy often surpassing that of humans. These tasks are part of technical domains such as computer vision, natural language processing, and anomaly detection, and use cases such as medical diagnosis, text translation, self-driving cars, fraud detection, and malware detection.
Two major obstacles to the wider adoption of DNNs in mission-critical tasks are (1) their vulnerability to adversarial attacks [6,7,10,19,26] and, more generally, concerns about their robustness when confronted with real-world data, and (2) their opaqueness, which makes it difficult to trust their output [17,34].
Extensive research has been performed to address the first obstacle, mainly consisting of approaches for the detection of adversarial examples [15,16,24,30,39] and methods for training robust models [19,29,35,43]. To address the second obstacle, research has focused on creating a priori interpretable models and developing methods for creating post-hoc explanations for existing models [28,34,38,40].
The terms interpretability and explainability are often incorrectly used interchangeably. We adopt the definitions proposed by Gilpin et al. [18]: Interpretability is a measure of how well a human can understand the way a system (in our case, a DNN) functions. Explainability of DNNs is a field of research with the goal of either answering the question of why a DNN produces a specific output for a given input (for example, by assigning contribution scores to different neurons of the network signifying their importance in steering the network towards that output) or describing what the network "learns," e.g., by producing visualizations of concepts learned by specific neurons [33].
In this paper, we examine the relationship between the obstacles mentioned above by performing quantitative and qualitative analysis of the effect of a model's robustness (achieved via adversarial training and Jacobian regularization) on its interpretability, and by introducing a regularization-based approach which is conceptually similar to other approaches for improving adversarial robustness but has a substantial effect on the model's interpretability.
Recent research has shown that adversarial robustness positively affects interpretability [1,32,42,45]. However, these studies mainly evaluated their methods using low-resolution inputs and datasets such as MNIST [27] and CIFAR-10 [25], and only provided anecdotal evidence of improved interpretability by presenting saliency maps of a few input images. Moreover, these studies failed to pinpoint the specific trait of adversarially robust models that makes them more interpretable.
We hypothesize that the most important trait in this respect is the models' increased robustness to random noise in a certain $\epsilon$-radius of the data manifold, as opposed to adversarial robustness, which is essentially an approximation of a model's robustness to the worst possible perturbations of a certain $\epsilon$-radius. Therefore, we construct a regularization term explicitly aimed at reducing the model's sensitivity to random perturbations.
Our main contributions in this paper are: (1) a quantitative comparison, based on well-accepted metrics, of the interpretability of adversarially trained models vs. standard ones on a high-resolution image dataset; (2) the discovery that Jacobian regularization is approximately as effective as adversarial training for improving model interpretability; (3) a novel, regularization-based approach that outperforms both adversarial training and Jacobian regularization in the interpretability of the trained model, with a similarly low decline in accuracy.

Figure 1. The process used to compute neuron activation differences (MD) and sensitivity (NS) for a single neuron $ij$ with a single example $x$. A neuron's sensitivity is measured by its behavior, i.e., the variation in its activation values for a given input $x$ and the corresponding $N$ perturbations $x_p^k$ sampled from the surrounding $L_2$ sphere. $M_{ij}(\cdot)$ denotes the activation value of neuron $j$ in layer $i$ for a given input; each activation value is presented in a different color, e.g., the activation of the last perturbation, $x_p^3$, is presented in blue. Then, MD is used to calculate activation differences, and NS is used to assess a neuron's sensitivity based on the computed MD and normalization factors. Note that this simplified example of NsLoss computation is presented to demonstrate the intuition and isn't a complete description of the computation process.

Post-hoc Explainability Techniques for DNNs
There are many methods for explaining the predictions of machine learning models, and more specifically DNNs. In the computer vision domain, explanations usually consist of a saliency/attribution map which scores each input pixel based on its positive or negative effect on encouraging the model to classify the input as a specific class. The methods used in our evaluation are:
• Integrated Gradients (IG) [40] calculates the importance score by summing the gradients on images interpolated between a baseline $x'$ and the input image $x$. The baseline image $x'$ represents the absence of the input features. Therefore, by computing the path integral between the baseline $x'$ and the real input $x$ of the partial derivative of the model over each feature, we obtain multiple estimates of the importance of every feature; this avoids the problem of saturated local gradients. Formally, the importance score for feature $i$ is computed as:
$$IG_i(x) = (x_i - x'_i) \int_{0}^{1} \frac{\partial f\left(x' + \alpha (x - x')\right)}{\partial x_i} \, d\alpha$$
where $f$ represents the model, $x_i$ represents one feature of the input, and $\alpha$ is the integration variable, which defines the position on the path between $x'$ and $x$.
• SHapley Additive exPlanations (SHAP) [28] (specifically Gradient SHAP) is a local explainability method that approximates the Shapley values [36] of the input features by computing the expected values of the gradients when adding Gaussian noise to each input. Since it computes the expectations of gradients using different reference points, it can be viewed as an approximation of Integrated Gradients.
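To make the Integrated Gradients description above concrete, the following is a minimal sketch that approximates the path integral with a midpoint Riemann sum. It is an illustration under stated assumptions, not the implementation used in the evaluation: gradients are estimated with central finite differences so that no autodiff library is needed, and `toy_model`, the step count, and the all-zero baseline are hypothetical choices.

```python
import math


def integrated_gradients(f, x, baseline, steps=50, h=1e-5):
    """Approximate Integrated Gradients with a midpoint Riemann sum.

    f        : scalar-valued model, called on a list of floats
    x        : input features
    baseline : reference input representing the absence of features
    Gradients are estimated by central finite differences (an assumption
    made here to avoid any autodiff dependency).
    """
    n = len(x)
    grad_sums = [0.0] * n
    for s in range(steps):
        alpha = (s + 0.5) / steps  # position on the straight path, in (0, 1)
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i in range(n):
            up, down = point[:], point[:]
            up[i] += h
            down[i] -= h
            grad_sums[i] += (f(up) - f(down)) / (2.0 * h)
    # Scale the path-averaged gradient by the input-baseline difference.
    return [(xi - b) * g / steps for xi, b, g in zip(x, baseline, grad_sums)]


def toy_model(v):
    """Hypothetical scalar "model": a sigmoid of a weighted sum."""
    return 1.0 / (1.0 + math.exp(-(2.0 * v[0] - 1.0 * v[1] + 0.5 * v[2])))


x = [1.0, 2.0, -1.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(toy_model, x, baseline)
```

A useful sanity check for any IG implementation is the completeness property: the attributions should sum approximately to the difference between the model output at the input and at the baseline.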

Adversarial Evasion Attacks and Defenses
Adversarial evasion attacks are methods for producing adversarial examples: model inputs that closely resemble valid inputs but result in drastically different model outputs. Formally, given a classifier $M(\cdot): \mathbb{R}^d \to \mathbb{R}^C$, an input sample $x \in \mathbb{R}^d$, and a correct class label $c$, we call $\delta \in \mathbb{R}^d$ an adversarial perturbation and $x' = x + \delta$ an adversarial example if:
$$\arg\max_{k} M(x')_k \neq c \quad \text{and} \quad \|\delta\| \leq \epsilon$$
where $\|\cdot\|$ is a distance metric, and $\epsilon > 0$ is the maximum perturbation size allowed, which is set at a small positive value to constrain the perturbation so that the resulting adversarial example is indistinguishable from the original sample to the naked human eye, thus making it potentially useful in various adversarial scenarios.
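The definition above can be checked mechanically. The sketch below assumes the two conditions are a norm bound on the perturbation and a change of the predicted class away from the correct label; `toy_classifier`, the choice of the $L_2$ norm, and the numeric values are hypothetical and serve only as an illustration.

```python
import math


def l2_norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(t * t for t in v))


def predict(logits):
    """Index of the largest logit, i.e., the predicted class."""
    return max(range(len(logits)), key=logits.__getitem__)


def is_adversarial(model, x, delta, true_class, eps, norm=l2_norm):
    """True if delta stays within the budget eps and flips the prediction."""
    x_adv = [xi + di for xi, di in zip(x, delta)]
    return norm(delta) <= eps and predict(model(x_adv)) != true_class


def toy_classifier(v):
    """Hypothetical two-class linear classifier returning logits."""
    return [v[0] - v[1], v[1] - v[0]]


x = [0.55, 0.45]        # classified as class 0 by toy_classifier
delta = [-0.08, 0.08]   # small perturbation, L2 norm of about 0.113
within_budget = is_adversarial(toy_classifier, x, delta, true_class=0, eps=0.12)
over_budget = is_adversarial(toy_classifier, x, delta, true_class=0, eps=0.10)
```

With the larger budget the perturbation qualifies as adversarial; with the smaller one it is rejected because its norm exceeds the allowed size, even though it still changes the prediction.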
Extensive research has been performed on countering adversarial attacks, mostly focused on methods for detecting adversarial examples and methods for training robust models. The latter is of greater relevance to the current research, and below we highlight two such methods.

Adversarial training [19,29] is a method in which a model is trained to correctly classify adversarial examples by presenting them to the model during the training process. More precisely, this method solves the saddle point problem:
$$\min_{\theta} \; \mathbb{E}_{(x,y) \sim \mathcal{D}} \left[ \max_{\|\delta\| \leq \epsilon} L(\theta, x + \delta, y) \right]$$
in which the aim is to find model parameters $\theta$ that minimize the expected value, over samples $(x, y)$ drawn from the data distribution $\mathcal{D}$, of the worst-case increase in the model loss $L$ due to input perturbations $\delta$. Practically, the method consists of modifying the standard training loss so that it is applied to adversarial examples constructed from the training batch samples instead of the original training samples. Madry et al. [29] used projected gradient descent (PGD) for generating adversarial examples during model training.

Jacobian regularization [21,23] is a method in which the Frobenius norm of the Jacobian matrix containing the partial derivatives of the model's logits over the inputs is added as an extra loss term, resulting in the minimization of the Jacobian norm during model training. This was found to effectively push the model's decision boundaries away from the data manifold [21], thus improving the model's robustness to adversarial attacks.
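As an illustration of the inner maximization used in adversarial training, here is a minimal sketch of PGD under an $L_\infty$ constraint. It is a simplification under stated assumptions: the model is a hypothetical logistic-regression classifier with an analytic input gradient, the random initialization is omitted to keep the example deterministic, and the weights, step size, and inputs are illustrative values.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss_and_input_grad(w, b, x, y):
    """Binary cross-entropy of a logistic model and its gradient w.r.t. x."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    loss = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return loss, [(p - y) * wi for wi in w]


def pgd_linf(w, b, x, y, eps, step, n_steps):
    """Projected gradient ascent on the loss inside an L-infinity ball."""
    x_adv = list(x)
    for _ in range(n_steps):
        _, grad = loss_and_input_grad(w, b, x_adv, y)
        # Ascent step along the sign of the input gradient.
        x_adv = [xa + step * ((g > 0) - (g < 0)) for xa, g in zip(x_adv, grad)]
        # Projection: clip each coordinate back into [x - eps, x + eps].
        x_adv = [min(max(xa, xi - eps), xi + eps) for xa, xi in zip(x_adv, x)]
    return x_adv


w, b = [2.0, -3.0], 0.5      # hypothetical model parameters
x, y = [1.0, 0.5], 1         # one training sample with label 1
eps = 4 / 255
x_adv = pgd_linf(w, b, x, y, eps=eps, step=eps / 4, n_steps=7)
clean_loss, _ = loss_and_input_grad(w, b, x, y)
adv_loss, _ = loss_and_input_grad(w, b, x_adv, y)
```

In adversarial training, the training loss would then be evaluated on `x_adv` rather than on `x` when updating the model parameters.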

Related Work
The notion of adversarially robust models being more interpretable is not new. Zhang and Zhu [45] showed that adversarially trained convolutional neural networks (CNNs) produce explanations that rely on the global shape of the input images, in contrast to standard CNNs, which focus more on textures that are inherently more sensitive to small perturbations. Tsipras et al. [42] observed that adversarially trained models, by virtue of the constraints imposed by adversarial training that have the effect of reducing sensitivity to small perturbations, are more aligned with human vision, which is evident in explanations that emphasize features that are more human-perceivable. In [1], the authors studied the effect of adversarial training on feature-level explanations of internal CNN layers and showed that adversarially trained models produce feature-level explanations that are "purified" in that they are much less noisy and better represent high-level visual concepts. Noack et al. [32] proposed a method for leveraging model explanations to improve robustness, by adding terms to the training loss that penalize the cosine of the angle between the explanation vector and the loss gradient, as well as the norm of the loss gradient vector. They showed that minimizing these two terms improves adversarial robustness.

Method Overview
The proposed method aims to improve the interpretability of neural network classifiers. We introduce NsLoss, a novel regularization term that penalizes the classifier for high sensitivity of the network's neurons to input perturbations. Thus, we define a new training loss function as follows:
$$Loss = CELoss + \lambda \cdot NsLoss$$
where CELoss is the standard cross-entropy loss, NsLoss is our new loss term, and $\lambda$ is the weight of the new loss term.
We apply the proposed method on a pretrained model by continuing to train it with the custom loss function for a predefined number of epochs using a standard stochastic gradient descent-based optimizer and a relatively low learning rate to allow the model's interpretability to improve without "unlearning" the classification task.
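The combined objective can be sketched as follows, assuming it is the cross-entropy loss plus the NsLoss term scaled by the weight λ described in the Hyperparameters section. The function names and the numeric values are hypothetical; in practice the NsLoss value would come from the computation described in the next section.

```python
import math


def cross_entropy(logits, label):
    """Cross-entropy (natural log) of one sample, computed stably from logits."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[label]


def total_loss(logits_batch, labels, ns_loss_value, lam):
    """Mean cross-entropy over the batch plus the weighted regularization term."""
    ce = sum(cross_entropy(l, y) for l, y in zip(logits_batch, labels)) / len(labels)
    return ce + lam * ns_loss_value


# Hypothetical batch: two samples, 10 classes, uninformative (all-zero) logits.
logits_batch = [[0.0] * 10, [0.0] * 10]
labels = [3, 7]
loss = total_loss(logits_batch, labels, ns_loss_value=0.02, lam=60.0)
```

Because the regularizer enters as a separate additive term, its relative influence can be tuned through λ alone without changing the classification loss.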

NsLoss Regularization Term
The inputs and parameters used to compute the NsLoss are as follows:
• $M$ - The model being trained.
• $X$ - Batch of samples for which the loss is computed.
• $\epsilon_{ns}$ - Hyperparameter specifying the radius of the $L_2$ ball in which random perturbations used to compute the loss are generated.
• $N$ - Number of perturbations to generate for each sample.

We begin by computing the normalized sensitivity of each neuron in the model to random perturbations of the input within an $L_2$ ball with radius $\epsilon_{ns}$. Given the $j$-th neuron of the $i$-th layer of the model:
1. For every sample $x \in X$, generate $N$ random samples $x_p^k = x + \delta^k$, $k = 1, \ldots, N$, where each perturbation $\delta^k$ is drawn at random from the $L_2$ ball of radius $\epsilon_{ns}$. We denote the set of perturbed samples by $X_p$.
2. Evaluate $M_{i,j}(X)$ and $M_{i,j}(X_p)$, the activations of the $j$-th neuron in the $i$-th layer of the model on each original and each randomly perturbed sample, respectively.
3. Compute the mean absolute activation of neuron $i,j$ on the batch $X$, $\frac{1}{|X|} \sum_{m=1}^{|X|} |M_{i,j}(X[m])|$, where $X[m]$ is the $m$-th sample in the batch $X$.
4. Compute $MD_{i,j}$, the mean absolute difference between the activations of the neuron on perturbed and original samples.
5. Compute the sensitivity of the neuron, $NS_{i,j}$, from $MD_{i,j}$ and normalization factors that involve $|M_i|$, the number of neurons in the $i$-th layer of $M$.

Figure 1 illustrates the process discussed so far, for a single neuron $ij$ and a single example $x$. Finally, we compute the NsLoss: the final loss is simply the mean neuron sensitivity weighted by the neuron's mean absolute activation, which accounts for the neuron's contribution to the model's output.

From the implementation perspective, it is important to note that although we described the algorithm for computing NsLoss so that we compute the neuron sensitivity values for each neuron separately (for simplicity's sake), in practice, it is straightforward to implement the computation of the aggregated loss using tensor operations on entire input batches and model layers, effectively making the time spent on loss computation negligible compared to model forward passes.
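The steps above can be sketched in plain Python as follows. This is a loose illustration, not the exact NsLoss formula: the normalization in step 5 is assumed here to divide each neuron's mean activation difference by the mean absolute activation of its layer, the final aggregation is assumed to be an activation-weighted mean over all neurons, and perturbations are assumed to be drawn uniformly from the $L_2$ ball. `toy_activations` is a hypothetical two-layer network that returns the activations of every layer.

```python
import math
import random


def sample_l2_ball(dim, radius, rng):
    """Draw one perturbation uniformly from the L2 ball (assumed distribution)."""
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(t * t for t in g)) or 1.0
    r = radius * rng.random() ** (1.0 / dim)
    return [r * t / norm for t in g]


def toy_activations(x):
    """Hypothetical network: returns [hidden ReLU activations, logits]."""
    hidden = [max(0.0, x[0] - x[1]), max(0.0, 0.5 * x[0] + x[1])]
    logits = [hidden[0] + 2.0 * hidden[1], hidden[0] - hidden[1]]
    return [hidden, logits]


def ns_loss_sketch(model, batch, eps_ns, n_pert, rng, tiny=1e-8):
    clean = [model(x) for x in batch]
    # Steps 1-2: activations on n_pert random perturbations of every sample.
    perturbed = [
        [model([xi + di for xi, di in zip(x, sample_l2_ball(len(x), eps_ns, rng))])
         for _ in range(n_pert)]
        for x in batch
    ]
    weighted_sum, weight_total = 0.0, 0.0
    for i in range(len(clean[0])):
        width = len(clean[0][i])  # number of neurons in layer i
        # Step 3: mean absolute activation of each neuron on the clean batch.
        mean_abs = [sum(abs(c[i][j]) for c in clean) / len(batch) for j in range(width)]
        # Step 4: mean absolute difference between perturbed and clean activations.
        mean_diff = [
            sum(abs(p[i][j] - c[i][j]) for c, ps in zip(clean, perturbed) for p in ps)
            / (len(batch) * n_pert)
            for j in range(width)
        ]
        # Step 5 (ASSUMED normalization): divide by the layer's mean activation.
        layer_scale = sum(mean_abs) / width + tiny
        for j in range(width):
            sensitivity = mean_diff[j] / layer_scale
            weighted_sum += mean_abs[j] * sensitivity
            weight_total += mean_abs[j]
    # ASSUMED aggregation: activation-weighted mean of neuron sensitivities.
    return weighted_sum / (weight_total + tiny)


batch = [[2.0, 1.0], [0.5, 0.3]]
loss_zero = ns_loss_sketch(toy_activations, batch, 0.0, 5, random.Random(0))
loss_small = ns_loss_sketch(toy_activations, batch, 0.01, 5, random.Random(0))
loss_large = ns_loss_sketch(toy_activations, batch, 0.1, 5, random.Random(0))
```

A tensor-based version would replace the per-neuron loops with batched operations over whole layers, but the per-neuron form keeps the correspondence with the numbered steps visible.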

The Effectiveness of NsLoss Regularization
The NsLoss regularization term is constructed in a way that penalizes the model for small random input perturbations causing large differences in the activations of both the output neurons (logits) and internal neurons of the model. This has the obvious effect of optimizing the model to minimize the activation differences and, as a result, to minimize the magnitude of the model's gradients with regard to inputs in the vicinity of the training set and, by generalization, the test set. Once aggregated on the entire training set during the training process, this is expected to have the effect of minimizing the model's gradients in an $\epsilon_{ns}$-neighborhood of the entire data manifold, where $\epsilon_{ns}$ is the hyperparameter specifying the radius of the $L_2$ ball from which random input perturbations are sampled during the computation of the NsLoss regularization term.

Hyperparameters
A description of the hyperparameters used in our approach is provided below.
• $N$ - The number of perturbed samples used to estimate the neuron sensitivity; $N = 5$ was used in all of our experiments.
• $\epsilon_{ns}$ - The radius of the $L_2$ ball from which perturbations for neuron sensitivity computation are generated.
• $\lambda$ - The weight of the loss term. We strive to choose the largest value of $\lambda$ that does not harm the model's cross-entropy loss and validation set accuracy. Specifically, we follow the protocol below to select the value of $\lambda$:
1. For a standard model, compute the values of NsLoss on 10 random batches from the training and validation sets and store the average value as $NsLoss_0$.
3. Perform a binary search by selecting values of $\lambda$ that are larger and smaller than $\lambda_0$ but in the same order of magnitude. For each such $\lambda$, start training the model for an epoch. If the training cross-entropy loss reaches the cross-entropy of a random guess ($\log_2 NumClasses$), then $\lambda$ is too high; otherwise it can be increased further.
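The binary search in the protocol above can be sketched as follows. The callback `collapses` is hypothetical: it stands for "train for one epoch with this λ and report whether the training cross-entropy reached the random-guess level". Here it is replaced by a stub with an arbitrary threshold so the example runs on its own, and the search bounds are illustrative values within one order of magnitude. The random-guess threshold uses base 2 as stated in the text; a framework that computes cross-entropy with natural logarithms would need base e instead.

```python
import math


def random_guess_cross_entropy(num_classes, base=2.0):
    """Cross-entropy of a uniform guess over num_classes, in the given log base."""
    return math.log(num_classes, base)


def search_lambda(lam_safe, lam_too_high, collapses, n_iters=6):
    """Binary search for the largest lambda that does not collapse training.

    lam_safe     : a value known not to harm the cross-entropy loss
    lam_too_high : a value known to drive the loss to the random-guess level
    collapses    : hypothetical callback, lambda -> bool (one-epoch trial)
    """
    for _ in range(n_iters):
        mid = (lam_safe + lam_too_high) / 2.0
        if collapses(mid):
            lam_too_high = mid
        else:
            lam_safe = mid
    return lam_safe


threshold = random_guess_cross_entropy(10)   # 10 classes, as in ImageNette


def stub_collapses(lam):
    """Stand-in for a real one-epoch training run (arbitrary cut-off of 73)."""
    return lam > 73.0


best_lambda = search_lambda(10.0, 100.0, stub_collapses)
```

Each evaluation of the real callback costs one training epoch, so a small number of iterations is usually sufficient to narrow λ down within its order of magnitude.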

Evaluation
The objective of our experiments is to examine the effect of adversarial training, Jacobian regularization, and our proposed NsLoss regularization term on the quality of the explanations both qualitatively (i.e., visual improvement) and quantitatively (using objective metrics).We also compare the results to those obtained by a baseline model.

Datasets, Models, Robust Training Methods
In all of our experiments we use the ImageNette [22] dataset, which contains high-resolution images from a 10-class subset of the popular ImageNet [13] dataset. We use VGG19 [37] and PreResNet10 [20] models pretrained on ImageNet and fine-tuned on ImageNette for a clean test accuracy of 96.2% and 96.8%, respectively. We use Madry's adversarial training method [29], as implemented by the "robustness" library [14], to train robust models for our evaluation. We retrain the model for 50 epochs against a PGD adversary in the $L_\infty$ norm, using seven PGD steps, $\epsilon = 4/255$, and random initialization. We use the implementation of Jacobian regularization presented by Hoffman et al. [21]. Tables 1 and 2 summarize the configurations and hyperparameters used to train all of the models evaluated.
Figure 2. Two images of a parachute (top rows) and two images of a church (bottom rows), as well as a comparison of the attribution maps obtained for these images by the VGG19 model [37] trained using the standard, NsLoss, JacobReg and adversarial training methods. The labels on the left indicate the attribution method used (IG and GS). The attribution maps generated by our method are framed in red.

Evaluation Metrics
Various recent studies [3,8,31] attempted to determine what properties an attribution-based explanation should have. They showed that one metric alone is insufficient to provide explanations that are meaningful to humans. As suggested by Bhatt et al. [5], three desirable criteria for feature-based explanation functions are: low sensitivity, high faithfulness, and low complexity. Therefore, we evaluate the different techniques based on these three well-studied properties:
1. Sensitivity - measures how strongly the explanations vary within a small local neighborhood of the input when the model prediction remains approximately the same [3,44]. In our evaluation, we use max-sensitivity, avg-sensitivity [44], and the local Lipschitz estimate [3].
2. Faithfulness - estimates how the presence (or absence) of features influences the prediction score, i.e., whether removing highly important features results in model accuracy degradation [2,4,5]. In our evaluation, we use the faithfulness correlation [5] and faithfulness estimate [2].
3. Complexity - captures the complexity of explanations, i.e., how many features are used to explain a model's prediction [5,9]. In our evaluation, we use complexity [5] and sparseness [9].
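To give a feel for the two complexity metrics, the sketch below follows their commonly used definitions: complexity as the entropy of the normalized absolute attributions, and sparseness as the Gini index of the absolute attributions. These are assumptions about the definitions in [5] and [9]; the implementations in the evaluation library may differ in details such as normalization, and the two attribution vectors are hypothetical.

```python
import math


def complexity(attributions):
    """Entropy of the fractional contribution of each feature (lower = simpler)."""
    total = sum(abs(a) for a in attributions)
    probs = [abs(a) / total for a in attributions]
    return -sum(p * math.log(p) for p in probs if p > 0)


def sparseness(attributions):
    """Gini index of absolute attributions (higher = sparser)."""
    v = sorted(abs(a) for a in attributions)  # ascending order
    d, total = len(v), sum(v)
    weighted = sum(val / total * (d - k + 0.5) / d for k, val in enumerate(v, start=1))
    return 1.0 - 2.0 * weighted


focused = [0.0, 0.0, 0.0, 1.0]       # all importance on one feature
diffuse = [0.25, 0.25, 0.25, 0.25]   # importance spread evenly

focused_scores = (complexity(focused), sparseness(focused))
diffuse_scores = (complexity(diffuse), sparseness(diffuse))
```

An explanation concentrated on a single feature has zero entropy and a high Gini index, whereas an evenly spread explanation has maximal entropy and a Gini index of zero, which matches the direction of "better" stated for these two metrics.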

Explainability Methods
We use the Integrated Gradients [40] and Gradient SHAP [28] explanation methods in our evaluation.
We argue that since the majority of local explanation methods use model gradients, any improvement on these explanation methods using the proposed method is likely to be successfully transferred to other gradient-based methods.

Quantitative Evaluation Results
We start by examining the performance of the compared methods, considering the three aforementioned explanation-quality criteria (i.e., sensitivity, faithfulness, and complexity), applied to the values of the methods' respective explanations.
For the sensitivity criteria, lower values are better; for the faithfulness criteria, higher values are better; and for the complexity criteria, lower complexity values and higher sparseness values are better. The scores were computed and averaged over the entire test set from the ImageNette dataset. The Quantus library [12] was employed for XAI evaluation and Integrated Gradients was used as the base attribution method. To facilitate meaningful comparisons, all models compared were retrained to have roughly the same test accuracy on natural images (except for the standard model), which is presented in column 9. To support the hypothesis that robust models tend to be more interpretable, we also provide the robust accuracy for each method, obtained using the AutoAttack [11] library, in column 10.

Table 4. Comparison of attribution quality on the VGG19 model trained using the four different methods presented in Table 2; Integrated Gradients [41] is used as the base attribution method. (↑) indicates that a higher value is better, and (↓) indicates that a lower value is better. The scores were computed and averaged on the entire ImageNette test set.
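In Table 3, we examine the PreResNet10 model trained using the four different methods presented in Table 1. Compared to the other methods, our proposed NsLoss method results in a significant improvement in attribution quality, and it achieves the best performance on nearly all of the metrics.

Table 5. The effect of the NsLoss λ hyperparameter (rows) on the accuracy and the explanation metrics (columns). The scores were obtained when PreResNet10 was trained for 10 epochs with the corresponding λ value.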
With regard to the sensitivity criterion, NsLoss and JacobReg obtain comparable scores on the max-sensitivity and avg-sensitivity metrics, which represents a reduction of over 116% in sensitivity compared to the standard method. Adversarial training leads on these two metrics, obtaining a score that is twice as good as theirs, but this comes at the cost of a 10% decrease in the clean test accuracy.
On the local Lipschitz estimate metric, NsLoss is superior with an improvement of over 500% vs. the standard method; this is followed by adversarial training with an improvement of 266%.With regard to the faithfulness criteria, NsLoss outperforms the other methods by a significant margin on both metrics.Furthermore, for the complexity criterion, the NsLoss method achieves the best scores on the two metrics examined.
Table 4 presents a similar performance comparison, this time using variants of the VGG19 model presented in Table 2. Regarding the sensitivity criterion, the NsLoss and JacobReg methods significantly improve the sensitivity, with a decrease of 317% and 156% respectively for both the max and avg sensitivity metrics, whereas for the local Lipschitz estimate metric NsLoss outperforms both alternatives.For the faithfulness criterion, on the faithfulness correlation metric, the JacobReg method is clearly superior, whereas on the faithfulness estimation metric, NsLoss performed favorably.For the complexity criterion, NsLoss performed the best in all metrics tested, albeit by a moderate margin.
Based on the quantitative evaluation results, we can conclude with sufficient certainty that (1) robust models are superior to standard models in terms of interpretability, since their explanations are less sensitive, more faithful, and less complex; and (2) the proposed NsLoss method produces models that are more interpretable than those produced by both the adversarial training and Jacobian regularization methods.

Qualitative Evaluation Results
Figures 2 and 3 present the attribution maps of images from ImageNette's test set for VGG19 and PreResNet10, respectively. The NsLoss attribution maps for the parachute images (top two rows in Figure 2) demonstrate the ability of the trained NsLoss model to capture the key region of interest in the image (the parachute itself) and produce an attribution map that focuses precisely on that region, while ignoring the background. The other methods compared (except for the standard method) were also able to capture the region of interest in the image, but their maps were rather noisy and much less sharp.
Following Smilkov et al. [38], we use visual coherence to indicate that the salient areas highlight mainly the object of interest, rather than the background.As can be seen in the attribution maps of the church images (bottom two rows in Figure 2) and the dogs (in Figure 3), the standard saliency maps demonstrate quite poor visual coherence, as they focus mainly on the background, rather than on the object itself.In contrast to this, the NsLoss, JacobReg, and adversarial training methods provide more visually coherent maps; however, based on the figures provided, as well as the comprehensive analysis we performed, NsLoss is found to consistently provide the most visually coherent and least noisy maps compared to the other methods, regardless of the explainability method used.

Effect of Regularization Term (λ) and Training Length
NsLoss makes use of several hyperparameters. We present the effect of (1) λ, the weight of the regularization term, and (2) the number of training epochs on select interpretability metrics.
Regularization term weight (λ). In Table 5, it can be observed that as λ increases, there is a gradual improvement in the results for all explanation-quality metrics: the max-sensitivity scores decrease and the faithfulness estimation and sparseness increase, suggesting an improvement of the models' interpretability. However, this improvement is accompanied by a certain drop in the model's accuracy. Therefore, the λ value must be carefully chosen; we suggest that readers follow the protocol described in Section 4.4.
Training epochs. Figure 4 presents plots for three chosen attribution quality metrics and clean test accuracy, over 80 training epochs. It can be seen that the max-sensitivity values decrease relatively quickly, right from the first epoch. Both the faithfulness estimation and sparseness continue to improve moderately as epochs progress. Moreover, the results show that there is an interpretability-accuracy tradeoff, and a gradual drop in accuracy can be seen. Therefore, the training process should be monitored, choosing the 'sweet spot' where there is a balance between the desired explanation quality and the required accuracy of the model.

Conclusions and Future Work
Our experimental results validate the effectiveness of adversarial training, Jacobian regularization, and our novel regularization-based approach (NsLoss) in improving the models' interpretability by changing the model's behavior such that state-of-the-art explainability methods produce explanations that are more focused and better aligned with human perception. This supports previous research results and hypotheses about the positive effect of adversarial training on model interpretability and sets the stage for further research in the field. Moreover, we quantitatively demonstrated the superiority of our proposed method using well-accepted metrics for measuring the quality of explanations, as well as representative qualitative evidence based on saliency map visualizations.
Future work may include: (1) testing our method on other computer vision tasks; we expect to see very similar results on other datasets, with medical imaging being a natural choice, since it is a field where interpretability is crucial in order to establish trust in the system's output; (2) applying NsLoss to other domains (beyond computer vision), which should be fairly straightforward as the method makes no assumptions about the nature of the input or the model's architecture; and (3) exploring the effect of NsLoss regularization on the nature of features learnt by the model, similar to related work conducted for adversarially trained models [1,42,45]. Extrapolating the results presented in this paper leads us to believe that NsLoss-trained models learn features that are even more aligned with human perception than adversarially trained models.

Figure 3. Comparison of the feature maps obtained by the PreResNet10 [20] model which was trained using the standard, NsLoss, JacobReg, and adversarial training methods on images of dogs from the ImageNette [22] test set. The labels on the left indicate the attribution method used (IG and GS). The attribution maps generated by our method are framed in red.

Figure 4. The effect of the number of training epochs on test accuracy and a subset of explanation-quality metrics. The scores were obtained by training PreResNet10 for 80 epochs using the NsLoss method with hyperparameter λ = 60. Panels: (a) clean accuracy, (b) faithfulness estimation [2], (c) max-sensitivity [44], and (d) sparseness [9], each plotted over epochs.

Table 1. Configurations of the PreResNet10 models.

Table 2. Configurations of the VGG19 models.

Table 3. Comparison of attribution quality on the PreResNet10 model trained using the four different methods presented in Table 1; Integrated Gradients [41] is used as the base attribution method. (↑) indicates that a higher value is better, and (↓) indicates that a lower value is better. A value in bold indicates the best score in the column, whereas an underlined value indicates that this value is second best. The scores were computed and averaged on the entire ImageNette test set.
Table 5 column headers: λ, Max-Sensitivity [44], Faithfulness Est. [2], Sparseness [9], Accuracy.