1 Introduction

In recent years, deep neural networks (DNNs) have increasingly been used to tackle many complex tasks in various domains, including tasks previously thought to be solvable only by humans, with accuracy often surpassing that of humans. These domains include computer vision, natural language processing, and anomaly detection, where DNNs have been employed in a variety of use cases such as medical diagnosis, text translation, self-driving cars, fraud detection, and malware detection.

There are two major obstacles limiting wider adoption of DNNs in mission-critical tasks: (1) their vulnerability to adversarial attacks (Goodfellow et al., 2014; Carlini & Wagner, 2017; Kurakin et al., 2016; Chen et al., 2019; Brendel et al., 2017) and, more generally, concerns about their robustness when confronted with real-world data, and (2) their opaqueness, which makes it difficult to trust their output (Gerlings et al., 2021; Ribeiro et al., 2016).

Extensive research has been performed to address the first obstacle, with studies mainly proposing approaches for the detection of adversarial examples (Metzen et al., 2017; Feinman et al., 2017; Song et al., 2017; Fidel et al., 2019; Katzir & Elovici, 2018) and methods for training robust models (Goodfellow et al., 2014; Madry et al., 2017; Salman et al., 2020; Wong et al., 2020; Altinisik et al., 2022; Wang et al., 2020; Ding et al., 2020). To address the second obstacle, research has focused on creating a priori, interpretable models and developing methods capable of providing post-hoc explanations for existing models (Smilkov et al., 2017; Sundararajan et al., 2017; Lundberg & Lee, 2017; Ribeiro et al., 2016).

The terms interpretability and explainability are often incorrectly used interchangeably. We adopt the definitions proposed by Gilpin et al. (2018): Interpretability is a measure of how well a human can understand the way a system (in our case - a DNN) functions. Explainability of DNNs is a field of research that aims to answer the question of why a DNN produces a specific output for a given input (e.g., by assigning contribution scores to different neurons of the network signifying their importance in steering the network towards that output), or to describe what the network “learns” (e.g., by producing visualizations of concepts learned by specific neurons, as in Olah et al. (2017)). Improved interpretability in DNNs is typically evaluated by showing that conventional explainability techniques yield explanations that are more comprehensible to humans, as assessed both visually and through quantitative metrics.

In this paper, we examine the relationship between the aforementioned obstacles through quantitative and qualitative analyses of the effect of a model’s robustness (achieved via adversarial training and Jacobian regularization) on its interpretability, and by introducing a regularization-based approach that is conceptually similar to other approaches aimed at improving adversarial robustness but has a substantial effect on the model’s interpretability.

Recent research has shown that adversarial robustness positively affects interpretability (Zhang & Zhu, 2019; Tsipras et al., 2018; Allen-Zhu & Li, 2022; Noack et al., 2021; Margeloiu et al., 2020). However, most studies evaluated their methods using low-resolution datasets such as the MNIST (LeCun et al., 1998) and CIFAR-10 (Krizhevsky et al., 2009) datasets or only provided anecdotal evidence of improved interpretability by presenting saliency maps of a few input images.

The main contributions of this paper are as follows: (1) We provide a quantitative comparison, based on well-accepted metrics, of the interpretability of adversarially trained models vs. standard models on high-resolution image datasets; (2) We show that Jacobian regularization is nearly as effective as adversarial training for improving model interpretability; (3) We propose a novel regularization-based approach that outperforms both adversarial training and Jacobian regularization in terms of the interpretability of the trained model, with a similarly small decline in accuracy. We believe that these contributions are significant since: (1) to the best of our knowledge, our study is the first to show quantitatively, rather than via merely anecdotal evidence, the enhanced interpretability of adversarially trained models, and (2) our proposed approach further improves interpretability without the overhead of adversarial training.

2 Background

2.1 Post-hoc explainability techniques for DNNs

There are many methods for explaining the predictions of machine learning models, and more specifically DNNs. In the computer vision domain, explanations usually consist of a saliency/attribution map that assigns a score to each input pixel based on its positive or negative effect on encouraging the model to classify the input as a specific class. Some of the most prominent methods, which we use in our evaluation, include: Integrated Gradients (IG) (Sundararajan et al., 2017), SHapley Additive exPlanations (SHAP) (Lundberg & Lee, 2017) (and specifically Gradient SHAP (GS)), Layer-wise Relevance Propagation (LRP) (Bach et al., 2015), and Gradient-weighted Class Activation Mapping (GradCam) (Selvaraju et al., 2017). Additional information on the various explanation methods used in the evaluation is provided in Appendix A.

2.2 Adversarial evasion attacks and defenses

Adversarial evasion attacks are methods of producing adversarial examples, which are model inputs that closely resemble valid inputs but result in drastically different model outputs. Formally, given a classifier \(M(\cdot ): R^d \rightarrow R^C\), an input sample \(x \in R^d\), and a correct class label c, we denote \(\delta \in R^d\) an adversarial perturbation and \(x' = x + \delta \) an adversarial example if:

$$\begin{aligned} \begin{aligned}&M(x') \ne c, \\&s.t.: ||\delta || < \epsilon \end{aligned} \end{aligned}$$
(1)

where \(||\cdot ||\) is a distance metric, and \(\epsilon > 0\) is the maximum perturbation size allowed, which is set to a small positive value so that the resulting adversarial example is indistinguishable from the original sample to the naked human eye, making it potentially useful in various adversarial scenarios.

Countering adversarial attacks has been the subject of a great deal of research. Studies have largely aimed at proposing methods capable of detecting adversarial examples or methods for training robust models. The latter is relevant to our study, and two such methods are highlighted below.

Adversarial training (Goodfellow et al., 2014; Madry et al., 2017) is a method in which a model is trained to correctly classify adversarial examples by presenting them to the model during the training process. More precisely, this method solves the saddle point problem:

$$\begin{aligned} \min _\theta \mathbb {E}_{(x,y)}\left[ \max _{\delta : ||\delta || < \epsilon }L(\theta , x + \delta , y)\right] \end{aligned}$$

in which the aim is to find model parameters that minimize the expected value of the worst case increase in model loss due to input perturbations. Practically, the method consists of modifying the standard training loss so that it is applied to adversarial examples constructed from the training batch samples instead of the original training samples. Madry et al. (2017) used projected gradient descent (PGD) to generate adversarial examples during model training. More recent studies have advanced the original adversarial training approach by addressing the inherent tendency of overfitting in adversarially trained models (Altinisik et al., 2022). These advancements include adversarial training with a variable perturbation budget to maximize margins from decision boundaries (Wang et al., 2020), and differentiating between correctly classified and misclassified examples during adversarial training (Ding et al., 2020).
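
To make the procedure concrete, the following is a minimal PyTorch sketch of an \(L_\infty \) PGD inner maximization and a single adversarial training step. The function names and hyperparameter values are illustrative; they do not correspond to the implementation of the robustness library used later in our experiments.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps, alpha, num_steps):
    """Approximate the inner maximization with L_inf projected gradient descent."""
    delta = torch.empty_like(x).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(num_steps):
        loss = F.cross_entropy(model(x + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        # Ascend the loss, then project back onto the L_inf ball of radius eps.
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach().requires_grad_(True)
    return (x + delta).detach()

def adversarial_training_step(model, optimizer, x, y,
                              eps=8 / 255, alpha=2 / 255, num_steps=7):
    """One outer-minimization step: train on adversarial examples built from the batch."""
    x_adv = pgd_attack(model, x, y, eps, alpha, num_steps)
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```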

Jacobian regularization (Jakubovitz & Giryes, 2018; Hoffman et al., 2019) is a method in which the Frobenius norm of the Jacobian matrix containing the partial derivatives of the model’s logits with respect to the inputs is added as an extra loss term, resulting in the minimization of the Jacobian norm during model training. This was found to effectively push the model’s decision boundaries away from the data manifold (Hoffman et al., 2019), thus improving the model’s robustness to adversarial attacks.
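
For intuition, the sketch below shows one way to estimate the squared Frobenius norm of the input-output Jacobian with random projections, in the spirit of Hoffman et al. (2019). It is a simplified sketch under our own assumptions, not the reference implementation used in our experiments.

```python
import torch

def jacobian_frobenius_reg(model, x, n_proj=1):
    """Estimate E[||J||_F^2], where J = d(logits)/d(input), via random projections.
    For a random unit vector v in logit space, E[||v^T J||^2] = ||J||_F^2 / C,
    so scaling by the number of classes C gives an unbiased estimate."""
    x = x.clone().requires_grad_(True)
    logits = model(x)
    num_classes = logits.shape[1]
    reg = 0.0
    for _ in range(n_proj):
        v = torch.randn_like(logits)
        v = v / v.norm(dim=1, keepdim=True)          # random unit vector per sample
        # Vector-Jacobian product v^T J via one backward pass (kept in the graph).
        jv, = torch.autograd.grad(logits, x, grad_outputs=v,
                                  create_graph=True, retain_graph=True)
        reg = reg + num_classes * jv.pow(2).sum() / x.shape[0]
    return reg / n_proj

# Usage (illustrative): loss = ce_loss + lambda_jr * jacobian_frobenius_reg(model, images)
```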

3 Related work

The observation that adversarially robust models are more interpretable is not new. Zhang and Zhu (2019) showed that adversarially trained convolutional neural networks (CNNs) produce explanations that rely on the global shape of the input images, in contrast to standard CNNs which focus more on textures that are inherently more sensitive to small perturbations.

Tsipras et al. (2018) observed that by virtue of the constraints imposed by adversarial training, which have the effect of reducing sensitivity to small perturbations, adversarially trained models are more aligned with human vision, which is evident in the produced explanations that emphasize features that are more human-perceivable.

In Allen-Zhu and Li (2022), the authors studied the effect of adversarial training on feature-level explanations of internal CNN layers and showed that adversarially trained models produce feature-level explanations that are “purified," in that they are much less noisy and better represent high-level visual concepts.

Noack et al. (2021) proposed a method for leveraging model explanations to improve robustness, by adding terms to the training loss that penalize the cosine of the angle between the explanation vector and the loss gradient, as well as the norm of the loss gradient vector. They showed that minimizing these two terms improves adversarial robustness. Margeloiu et al. (2020) explored the impact of adversarial training on the interpretability of CNNs for skin cancer diagnosis, demonstrating that adversarially trained CNNs produce clearer and more visually consistent saliency maps, especially highlighting melanoma characteristics.

The work in the literature most similar to ours is that of Zhang et al. (2020) who proposed augmenting adversarial training with the regularization of the sensitivity of the top-k sensitive neurons of each DNN layer in order to improve model robustness to adversarial attacks. While conceptually similar to our method of regularizing neuron sensitivities, our work differs in the following respects: (1) We regularize the sensitivity of all neurons, not just the top-k sensitive ones. (2) We regularize the sensitivity of the neurons by measuring sensitivity to random noise instead of adversarial examples, which is computationally lighter. (3) We regularize neuron sensitivity in order to improve interpretability rather than robustness. (4) In contrast to merely providing visual representations of a limited set of input samples as done by Zhang et al., we perform a quantitative evaluation to validate the improvement in interpretability observed. Furthermore, we performed our evaluation on multiple datasets, including a medical dataset.

4 Proposed method

4.1 Method overview

The proposed method aims to improve the interpretability of neural network classifiers. We introduce NsLoss, a novel regularization term that penalizes the classifier for high sensitivity of the network’s neurons to input perturbations. Thus, we define a new training loss function as follows:

$$\begin{aligned} \begin{aligned} L = CELoss + \lambda \cdot NsLoss \end{aligned} \end{aligned}$$
(2)

where CELoss is the standard cross-entropy loss, and NsLoss is our new loss term.

We apply the proposed method on a pretrained model by continuing to train it with the custom loss function for a predefined number of epochs using a standard stochastic gradient descent-based optimizer and a relatively low learning rate to allow the model’s interpretability to improve without “unlearning” the classification task.
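
As an illustration, the following PyTorch sketch shows this fine-tuning procedure with the combined loss of Eq. (2). Here, `ns_loss` is assumed to be a callable implementing the regularizer defined in Sect. 4.2 (a sketch of which is given there), and the optimizer settings are illustrative defaults rather than the values used in our experiments.

```python
import torch
import torch.nn.functional as F

def finetune_with_nsloss(model, loader, ns_loss, lam, eps_ns, n_pert,
                         epochs=10, lr=1e-4):
    """Continue training a pretrained classifier with L = CELoss + lambda * NsLoss (Eq. 2),
    using a low learning rate so the classification task is not 'unlearned'."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            ce = F.cross_entropy(model(x), y)
            ns = ns_loss(model, x, eps_ns, n_pert)   # regularizer of Sect. 4.2
            loss = ce + lam * ns
            loss.backward()
            optimizer.step()
    return model
```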

4.2 NsLoss regularization term

The inputs and parameters used to compute the NsLoss are as follows:

  • M - The model being trained.

  • X - Batch of samples for which the loss is computed.

  • \(\epsilon _{ns}\) - Hyperparameter specifying the radius of the \(L_p\) sphere from which random perturbations used to compute the loss are sampled. An \(L_p\) sphere around a point \(x_0\) with radius r is the set \(\{x \in \mathbb {R}^d: ||x-x_0||_p = r\}\).

  • N - Number of perturbations to generate for each sample.

We begin by computing the normalized sensitivity of each neuron in the model to random perturbations of the input within an \(L_p\) sphere with radius \(\epsilon _{ns}\). Given the j-th neuron of the i-th layer of the model:

  1. For every sample \(x \in X\), generate N random samples: \(x_1, x_2,..., x_N\) s.t. \(||x-x_k||_p = \epsilon _{ns}\) and store them in tensors \(X_p^{(k)}\) for \(k \in [1, N]\) (note that the "p" in \(X_p\) stands for "perturbed" and is not related to the norm order of the \(L_p\) sphere from which the random samples are drawn).

  2. Evaluate \(M_{i,j}(X)\) and \(M_{i,j}(X_p^{(k)})\), the activations of the j-th neuron in the i-th layer of the model, on each original and randomly perturbed sample, respectively.

  3. Compute the mean absolute activation of neuron ij on batch X:

    $$\begin{aligned} \begin{aligned} MA_{i,j} \leftarrow \frac{\sum _{m=1}^{|X|}{|M_{i,j}(X[m])|}}{|X|} \end{aligned} \end{aligned}$$
    (3)

    where X[m] is the m-th sample in batch X.

  4. Compute the mean absolute difference between the activations of the neuron on the perturbed and original samples:

    $$\begin{aligned} \begin{aligned} MD_{i,j} \leftarrow \frac{\sum _{k=1}^{N}\sum _{m=1}^{|X|}{|{M_{i,j}(X[m])-M_{i,j}(X_p^{(k)}[m])|}}}{N \cdot |X|} \end{aligned} \end{aligned}$$
    (4)

  5. Compute the sensitivity of the neuron:

    $$\begin{aligned} \begin{aligned} NS_{i,j} \leftarrow \frac{MD_{i,j}}{\epsilon _{ns} \cdot |M_i| \cdot MA_{i,j}} \end{aligned} \end{aligned}$$
    (5)

    where \(|M_i|\) is the number of neurons in the i-th layer of M.

Figure 1 illustrates the process described so far for a single neuron ij and example x.

Finally, we compute the NsLoss as follows:

$$\begin{aligned} \begin{aligned} NsLoss(M, X, \epsilon _{ns}, N) \leftarrow \sum _{i=1}^{|M|}\sum _{j=1}^{|M_i|}{NS_{i,j} \cdot MA_{i,j}} \end{aligned} \end{aligned}$$
(6)

The final loss is simply the mean neuron sensitivity weighted by each neuron’s mean absolute activation, which accounts for the neuron’s contribution to the model’s output. From the implementation perspective, note that for the sake of simplicity, the algorithm described here computes the sensitivity of each neuron separately; in practice, the aggregated loss can be computed using tensor operations on entire input batches and model layers, making the time spent on loss computation negligible compared to the model’s forward passes.
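
The sketch below illustrates one possible vectorized PyTorch implementation of the NsLoss computation (Eqs. 3-6) using forward hooks. Treating every element of a layer's flattened output as a "neuron", the default choice of hooked layers, and the small constant added for numerical stability are assumptions of this sketch rather than details prescribed above.

```python
import torch

def _sphere_noise(x, eps, p):
    """Random noise with ||noise||_p = eps for each sample (flattened over non-batch dims)."""
    noise = torch.randn_like(x).flatten(1)
    noise = noise.sign() if p == float("inf") else noise / noise.norm(p=p, dim=1, keepdim=True)
    return (eps * noise).view_as(x)

def ns_loss(model, x, eps_ns, n_pert, layers=None, p=2):
    """Sketch of NsLoss. `layers` lists the modules whose outputs act as the 'neurons';
    by default all leaf modules (and hence also the final logits) are used."""
    if layers is None:
        layers = [m for m in model.modules() if len(list(m.children())) == 0]
    captured = {}
    handles = [m.register_forward_hook(
                   lambda _m, _in, out, idx=i: captured.__setitem__(idx, out))
               for i, m in enumerate(layers)]
    try:
        model(x)
        clean = {i: a.flatten(1) for i, a in captured.items()}       # (batch, |M_i|)
        ma = {i: a.abs().mean(0) for i, a in clean.items()}          # MA_{i,j}, Eq. (3)
        md = {i: torch.zeros_like(v) for i, v in ma.items()}
        for _ in range(n_pert):
            model(x + _sphere_noise(x, eps_ns, p))
            for i, a in captured.items():
                # Accumulate |M_{i,j}(x) - M_{i,j}(x_p)| averaged over the batch, Eq. (4).
                md[i] = md[i] + (clean[i] - a.flatten(1)).abs().mean(0)
        loss = x.new_zeros(())
        for i in md:
            md_i = md[i] / n_pert                                    # MD_{i,j}
            ns_i = md_i / (eps_ns * md_i.numel() * (ma[i] + 1e-12))  # NS_{i,j}, Eq. (5)
            loss = loss + (ns_i * ma[i]).sum()                       # Eq. (6)
        return loss
    finally:
        for h in handles:
            h.remove()
```

As noted above, the cost is dominated by the N additional forward passes per batch rather than by the loss arithmetic itself.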

Fig. 1

A simplified illustration of the process used to compute neuron activation differences (MD) and sensitivity (NS) for a single neuron ij with a single example x. A neuron’s sensitivity is measured based on its behavior, i.e., the variation in its activation values for a given input x and the corresponding N perturbations \(x^k_p\) sampled from the surrounding \(L_p\) sphere of radius \(\epsilon _{ns}\) - a sphere with the distance metric \(||\cdot ||_p\). \(M_{ij}(\cdot )\) denotes the activation value of neuron j in layer i for a given input; each activation value is presented in a different color, e.g., the activation of the last perturbation \(x^3_p\) is presented in blue. Then, MD is used to calculate the activation differences, and NS is used to assess a neuron’s sensitivity based on the computed MD and normalization factors

4.3 NsLoss regularization’s effectiveness

The NsLoss regularization term is constructed in a way that penalizes the model for small random input perturbations that cause large differences in the activations of the model’s output neurons (logits) and internal neurons. This has the obvious effect of optimizing the model to minimize the activation differences and, as a result, minimizing the magnitude of the model’s gradients with respect to inputs in the vicinity of the training set and, by generalization, in the vicinity of the test set. Once aggregated on the entire training set during the training process, this is expected to have the effect of minimizing the model’s gradients in an \(\epsilon _{ns}\) neighborhood of the entire data manifold, where \(\epsilon _{ns}\) is the hyperparameter specifying the radius of the \(L_p\) sphere from which random input perturbations are sampled during the computation of the NsLoss regularization term.
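
To make this argument more explicit, a first-order (Taylor) approximation combined with Hölder's inequality gives, for a perturbation \(\delta \) with \(||\delta ||_p = \epsilon _{ns}\):

$$\begin{aligned} |M_{i,j}(x+\delta ) - M_{i,j}(x)| \approx |\nabla _x M_{i,j}(x)^\top \delta | \le \epsilon _{ns}\, ||\nabla _x M_{i,j}(x)||_q, \quad \frac{1}{p} + \frac{1}{q} = 1 \end{aligned}$$

so driving the activation differences \(MD_{i,j}\) toward zero for all neurons (including the logits) pushes the corresponding input gradients toward zero in an \(\epsilon _{ns}\) neighborhood of the training data. This is a heuristic first-order view rather than a formal guarantee.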

4.4 Hyperparameters

A description of the hyperparameters used in our approach is provided below.

  • N - The number of perturbed samples used to estimate the neuron sensitivity; \(N=5\) was used in all of our experiments. The value of N linearly affects the training computation budget. Limited experimentation with values larger than five showed no improvement in the interpretability metrics.

  • \(\epsilon _{ns}\) - The radius of the \(L_p\) sphere from which perturbations for the neuron sensitivity computation are generated. We chose values aligned with best practices from the adversarial robustness literature, as these values correspond to negligible perceptual artifacts in the perturbed images.

  • p - The order of the \(L_p\) distance measure. A common choice for p is 2 or \(\infty \).

  • \(\lambda \) - The weight of the loss term. We aim to select the highest value of \(\lambda \) that does not harm the model’s cross-entropy loss and validation set’s accuracy. The following protocol is followed to select the value of \(\lambda \):

    1. For a standard model, compute the values of NsLoss on 10 random batches from the training and validation sets and store the average value as \(NsLoss_0\).

    2. Choose \(\lambda _0 = \frac{\log _2{NumClasses}}{NsLoss_0}\).

    3. Perform a binary search by selecting values of \(\lambda \) that are higher and lower than \(\lambda _{0}\) but in the same order of magnitude. As an illustrative example, if \(\lambda _0 = 5\), we can choose the lower value to be 1 and the higher value to be 10. The exact choice of values is less important, as long as the smaller value has a negligible influence on the model’s accuracy while the larger value leads to a deterioration in the model’s accuracy. For each such \(\lambda \), train the model for one epoch. If the training cross-entropy loss reaches the cross-entropy of a random guess (\(\log _2{NumClasses}\)), the \(\lambda \) value is too high; otherwise, it can be increased further. A code sketch of this selection protocol is provided after the list.
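
The sketch below translates this protocol into code. The helper names (`initial_lambda`, `train_one_epoch_with`, `training_ce`) are placeholders, and the use of \(\log _2\) mirrors the protocol above; if the cross-entropy implementation uses natural logarithms, the base should be adjusted accordingly.

```python
import math
import torch

def initial_lambda(model, batches, ns_loss, eps_ns, n_pert, num_classes):
    """Steps 1-2: average NsLoss over ~10 random batches and set
    lambda_0 = log2(NumClasses) / NsLoss_0."""
    with torch.no_grad():
        values = [ns_loss(model, x, eps_ns, n_pert).item() for x, _ in batches]
    return math.log2(num_classes) / (sum(values) / len(values))

def select_lambda(train_one_epoch_with, training_ce, lam_low, lam_high,
                  num_classes, tol=0.1):
    """Step 3: binary search between a 'safe' lambda (negligible effect on accuracy)
    and one that is too high (training collapses to random-guess cross-entropy).
    `train_one_epoch_with(lam)` trains a fresh copy of the model for one epoch and
    `training_ce()` returns its training cross-entropy afterwards."""
    random_guess_ce = math.log2(num_classes)
    while lam_high - lam_low > tol * lam_low:
        lam = 0.5 * (lam_low + lam_high)
        train_one_epoch_with(lam)
        if training_ce() >= random_guess_ce:   # model collapsed: lambda too high
            lam_high = lam
        else:
            lam_low = lam
    return lam_low
```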

For adversarial training and Jacobian regularization, we used standard hyperparameters from the relevant literature (Madry et al., 2017; Hoffman et al., 2019).

5 Evaluation

The objective of our experiments is to examine the effect of Adversarial Training, Jacobian regularization, and our proposed NsLoss regularization term on the quality of the explanations both qualitatively (i.e., by observing visual improvement) and quantitatively (by using objective metrics). We also compare the results to those obtained by a baseline model.

5.1 Evaluation setup

We assess the effectiveness of our method across three diverse datasets: Imagenette (Howard, 2019), a collection of high-resolution images from a 10-class subset of ImageNet (Deng et al., 2009); CIFAR10 (Krizhevsky et al., 2009), a widely used dataset for image classification; and a subset of the HAM10000 dermatology dataset (Tschandl et al., 2018). The HAM10000 subset focuses on three balanced classes: melanoma, melanocytic nevi, and benign keratosis, as recommended by Margeloiu et al. (2020).

For the models, we employ the VGG19 (Simonyan & Zisserman, 2014), PreResNet10 (He et al., 2016) and DenseNet (Huang et al., 2018) architectures - all pretrained on ImageNet. These models were fine-tuned on Imagenette, CIFAR10 and the aforementioned HAM10000 subset. On Imagenette, the models achieved clean test accuracy rates of 96.2%, 96.8% and 99.3% for VGG19, PreResNet10 and DenseNet, respectively. On HAM10000, the VGG19 and PreResNet10 models obtained clean test accuracy rates of 87.6% and 84.0%, respectively. On CIFAR10, the VGG19 and PreResNet10 models obtained clean test accuracy rates of 92.4% and 91.5% respectively.

We use the VGG19, PreResNet10 and DenseNet architectures, since they achieve near state-of-the-art performance on image classification tasks and are different enough to demonstrate that our method does not overfit to a specific model architecture. To train robust models, we use (1) the adversarial training method of Madry et al. (2017), as implemented by the “robustness" library (Engstrom et al., 2019), to retrain the standard models against a PGD adversary, and (2) the implementation of Jacobian regularization presented by Hoffman et al. (2019). Tables 11, 12, 13, 14, 15, 16, 17 in Appendix B summarize the configurations and hyperparameters used for all the methods evaluated.

5.2 Evaluation metrics

Several recent studies (Carvalho et al., 2019; Montavon et al., 2018; Alvarez-Melis & Jaakkola, 2018) attempted to determine what properties an attribution-based explanation should have. These studies showed that the use of just one metric is insufficient to produce explanations that are meaningful to humans. Bhatt et al. (2020) established three desirable criteria for feature-based explanation functions: low sensitivity, high faithfulness, and low complexity. Therefore, we evaluate the different techniques based on these three well-studied properties:

  1. Sensitivity - measures how much the explanations vary within a small local neighborhood of the input when the model prediction remains approximately the same (Alvarez-Melis & Jaakkola, 2018; Yeh et al., 2019). In our evaluation, we use the max-sensitivity, avg-sensitivity (Yeh et al., 2019), and Local Lipschitz estimate (Alvarez-Melis & Jaakkola, 2018) metrics.

  2. Faithfulness - estimates how the presence (or absence) of features influences the prediction score, i.e., whether removing highly important features results in a degradation of model accuracy (Bhatt et al., 2020; Bach et al., 2015; Alvarez Melis & Jaakkola, 2018). In our evaluation, we use the faithfulness correlation (Bhatt et al., 2020) and faithfulness estimate (Alvarez Melis & Jaakkola, 2018) metrics.

  3. Complexity - captures the complexity of explanations, i.e., how many features are used to explain a model’s prediction (Chalasani et al., 2020; Bhatt et al., 2020). In our evaluation, we use the complexity (Bhatt et al., 2020) and sparseness (Chalasani et al., 2020) metrics.

5.3 Explainability methods

Various explanation methods are used in our evaluation, including Integrated Gradients (IG) (Sundararajan et al., 2017), Gradient SHAP (GS) (Lundberg & Lee, 2017), Layer-wise Relevance Propagation (LRP) (Bach et al., 2015), and GradCam (Selvaraju et al., 2017). For the implementation, we employ the Captum library (Kokhlikyan et al., 2020). We note that the implementation of Gradient SHAP we used (the Captum library) applies SmoothGrad (Smilkov et al., 2017) by default.
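
For reference, the following minimal sketch shows how attribution maps like those used in our evaluation can be produced with Captum; the zero-image baselines and the target handling are illustrative choices, not necessarily the exact settings used in our experiments.

```python
import torch
from captum.attr import IntegratedGradients, GradientShap

def attribute_batch(model, x, target):
    """Compute Integrated Gradients and Gradient SHAP attributions for a batch `x`
    with respect to the (integer) class indices in `target`."""
    model.eval()
    ig = IntegratedGradients(model)
    attr_ig = ig.attribute(x, baselines=torch.zeros_like(x), target=target)

    gs = GradientShap(model)
    # Gradient SHAP draws baselines from a reference distribution and adds noise
    # to the inputs (the SmoothGrad-like behavior noted above).
    baselines = torch.cat([torch.zeros_like(x), x])
    attr_gs = gs.attribute(x, baselines=baselines, target=target)
    return attr_ig, attr_gs
```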

We argue that since the majority of local explanation methods use model gradients, any improvement observed on these explanation methods when using the proposed method is likely to be successfully transferred to other gradient-based methods (Tables 1, 2).

5.4 Quantitative evaluation results

We start by examining the performance of the compared methods, considering the three aforementioned explanation-quality criteria (i.e., sensitivity, faithfulness, and complexity), applied to the values of the methods’ respective explanations.

Tables 3, 4, 5 summarize the results on the Imagenette dataset for the PreResNet10, DenseNet and VGG19 models, respectively. For the HAM10000 dataset, we report the results in Tables 6 and 7 for VGG19 and PreResNet10. On the CIFAR10 dataset, we report the results for the VGG19 and PreResNet10 models in Tables 8 and 9.

The scores were computed and averaged over the entire test set. The Quantus library (Hedström et al., 2022) was employed for XAI evaluation and either Integrated Gradients (IG), Gradient SHAP (GS), or Layer-wise Relevance Propagation (LRP) were used as the base attribution method. To facilitate meaningful comparisons, all of the compared models were retrained to have roughly the same test accuracy on natural images (except for the standard model), which is presented in the ninth column. To support the hypothesis that robust models tend to be more interpretable than standard models, we also provide the robust accuracy for models re-trained with Jacobian Regularization and Adversarial Training, obtained using the AutoAttack (Croce & Hein, 2020) library, in the tenth column. The robust accuracy is computed by running AutoAttack on each sample in the test set to obtain an adversarial test set and then computing the accuracy of the model on the adversarial test set with the original target labels.
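
A sketch of how such scores can be computed with the Quantus library is given below; the metric constructors and the call pattern follow the documented Quantus interface, but exact argument names may vary between library versions and should be treated as assumptions.

```python
import numpy as np
import quantus

def evaluate_attributions(model, x_batch, y_batch, a_batch, device="cpu"):
    """Score precomputed attributions `a_batch` (e.g., from Integrated Gradients)
    for a batch of inputs/labels with a few of the metrics used in this paper."""
    metrics = {
        "max_sensitivity": quantus.MaxSensitivity(),
        "faithfulness_correlation": quantus.FaithfulnessCorrelation(),
        "sparseness": quantus.Sparseness(),
    }
    return {
        name: float(np.mean(metric(
            model=model, x_batch=x_batch, y_batch=y_batch, a_batch=a_batch,
            device=device,
            explain_func=quantus.explain,
            explain_func_kwargs={"method": "IntegratedGradients"},
        )))
        for name, metric in metrics.items()
    }
```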

In each table, the methods (standard, NsLoss \(L_2\), NsLoss \(L_{\infty }\), JacobReg, and Adversarial Training) are listed in the first column, and the respective values for the metrics: max-sensitivity (Max), avg-sensitivity (Avg) (Yeh et al., 2019), local Lipschitz estimate (LL) (Alvarez-Melis & Jaakkola, 2018), faithfulness correlation (Corr) (Bhatt et al., 2020), faithfulness estimate (Est) (Alvarez Melis & Jaakkola, 2018), complexity (Comp) (Bhatt et al., 2020), and sparseness (Spr) (Chalasani et al., 2020) are presented in columns 2–8.

For the sensitivity criterion, lower values are better; for the faithfulness criterion, higher values are better, and for the complexity criterion, lower complexity values and higher sparseness values are better. (\(\uparrow \)) indicates that a higher value is better, and (\(\downarrow \)) indicates that a lower value is better. A value in bold is the best score in a column, and an underlined value is the second best.

Prior to discussing the results for the seven \(<dataset, model~architecture>\) pairs, we provide a high-level view of the results in the radar plot in Fig. 2 and in Tables 1 and 2, obtained by normalizing and averaging the different metrics’ values over all models and datasets. To create the aggregated results, we first normalize the values of each metric for all methods and for each \(<dataset, model~architecture>\) pair into the range of [0,100] (note that after the normalization, a higher value is better for all metrics). Then, we aggregate all seven experiments by calculating the mean of each metric for each method over the seven \(<dataset, model~architecture>\) pairs. The radar plot is based on the mean values for each of the eight metrics presented in Tables 1 and 2. Note that due to our normalization, these results no longer represent absolute metric values but rather the relative scores of the evaluated methods on each metric.
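
The aggregation can be summarized by the following pandas sketch; the column names and the min-max scaling with a sign flip for lower-is-better metrics are our own assumptions about how such a normalization is typically implemented.

```python
import pandas as pd

LOWER_IS_BETTER = {"max_sens", "avg_sens", "local_lipschitz", "complexity"}

def normalize_and_aggregate(df):
    """`df` has columns: experiment, method, metric, value. Min-max scale each
    (experiment, metric) group to [0, 100], flip lower-is-better metrics so that
    higher is always better, then average per method and metric over experiments."""
    def _norm(group):
        lo, hi = group["value"].min(), group["value"].max()
        score = 100 * (group["value"] - lo) / (hi - lo + 1e-12)
        if group["metric"].iloc[0] in LOWER_IS_BETTER:
            score = 100 - score
        return group.assign(score=score)

    normed = df.groupby(["experiment", "metric"], group_keys=False).apply(_norm)
    return normed.groupby(["method", "metric"])["score"].agg(["mean", "std"])
```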

Fig. 2

Radar plot summarizing all quantitative results by normalizing each metric to the range [0,100] and computing a mean of each metric for each method over all pairs of \(<dataset, model~architecture>\)

Table 1 Normalized mean ± standard deviation scores for each method and metric across seven experiments
Table 2 Normalized mean ± standard deviation scores for each method and metric across seven experiments

We further evaluate the statistical significance of these results. Since we cannot assume a normal distribution of the data, we first computed the Friedman test (Friedman, 1937), which resulted in a P value of 0.00003 (rounded), indicating that the methods differ significantly. Then we performed the Wilcoxon signed-rank test (Woolson, 2007) for each pair of methods and adjusted the P values for multiple tests using the Benjamini-Hochberg method (Benjamini & Hochberg, 1995). Based on the results, all method pairs differ significantly at the 0.05 level, except for the pairs \(<NsLoss L_2,~NsLoss L_\infty>\) and \(<JacobReg,~Adv. Training>\) (the P values of all pairs are presented in Table 18 in Appendix C). This demonstrates the statistical significance of our results. From Fig. 2 and Tables 1 and 2 we can clearly see that (1) both Adversarial Training and Jacobian Regularization significantly improve interpretability over a standard baseline model; and (2) both the NsLoss \(L_2\) and NsLoss \(L_\infty \) methods further significantly improve interpretability compared to Adversarial Training and Jacobian Regularization.
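
The statistical procedure can be reproduced with standard SciPy/statsmodels routines, as sketched below; how the per-method score vectors are assembled (one aligned entry per \(<dataset, model~architecture, metric>\) combination) is an assumption of this sketch.

```python
from itertools import combinations
from scipy.stats import friedmanchisquare, wilcoxon
from statsmodels.stats.multitest import multipletests

def significance_tests(scores):
    """`scores` maps method name -> list of normalized scores, aligned across methods.
    Runs the omnibus Friedman test, then pairwise Wilcoxon signed-rank tests with
    Benjamini-Hochberg adjustment of the P values."""
    methods = list(scores)
    _, friedman_p = friedmanchisquare(*(scores[m] for m in methods))

    pairs = list(combinations(methods, 2))
    raw_p = [wilcoxon(scores[a], scores[b]).pvalue for a, b in pairs]
    _, adjusted_p, _, _ = multipletests(raw_p, method="fdr_bh")
    return friedman_p, dict(zip(pairs, adjusted_p))
```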

In Tables 3, 4, 5, we evaluate the PreResNet10, DenseNet and VGG19 models trained on the Imagenette dataset using the five different methods presented in Tables 11, 12, 13 respectively, using Integrated Gradients (IG), Gradient SHAP (GS) and Layer-wise Relevance Propagation (LRP) as the respective base attribution methods.

Our NsLoss methods demonstrate strong performance across key metrics. In terms of the sensitivity criterion, the NsLoss variants show strong results.

On the PreResNet10 model, NsLoss \(L_{\infty }\) and adversarial training achieve the best results for the max and avg sensitivity metrics, while NsLoss \(L_{\infty }\) is the best for the Local Lipschitz estimate metric. On the DenseNet model, NsLoss \(L_2\) and NsLoss \(L_{\infty }\) achieve the best and second best results for max and avg sensitivity, and the third and second best results for the Local Lipschitz estimate metric. On the VGG19 model, JacobReg performs the best based on the max and avg sensitivity metrics, but NsLoss \(L_2\) and \(L_{\infty }\) are close competitors, achieving the second-best results. In terms of the faithfulness criterion, the NsLoss variants exhibit superior performance.

On the PreResNet10 model, NsLoss \(L_{\infty }\) achieves the highest scores for both metrics. On the DenseNet model, NsLoss \(L_2\) has the best and second-best results for faithfulness estimate and faithfulness correlation, respectively. On the VGG19 model, NsLoss \(L_2\) performs the best based on the faithfulness correlation metric (Corr) and is the second best based on the faithfulness estimation metric (Est), while NsLoss \(L_{\infty }\) performs the best based on the faithfulness estimation metric (Est) and is second best based on the faithfulness correlation metric (Corr). In terms of the complexity criterion, both NsLoss variants obtain the lowest complexity (Com) and highest sparsity (Spr) values on both the PreResNet10 model and DenseNet models.

On the VGG19 model, NsLoss \(L_2\) obtains the best results in terms of both the complexity (Com) and sparsity (Spr), followed closely by NsLoss \(L_{\infty }\) and adversarial training. NsLoss \(L_2\) is second-best for all three sensitivity metrics and NsLoss \(L_2\) and \(L_\infty \) are best and second-best interchangeably on the two Faithfulness metrics.

Table 3 Comparison of the attribution quality of the PreResNet10 model (Imagenette) trained using the five different methods presented in Table 11; Integrated Gradients (IG) (Sundararajan et al., 2017) is used as the base attribution method
Table 4 Comparison of the attribution quality of the DenseNet model (Imagenette) trained using the five different methods presented in Table 12; Gradient SHAP (GS) (Lundberg & Lee, 2017) is used as the base attribution method
Table 5 Comparison of the attribution quality of the VGG19 model (Imagenette) trained using the five different methods presented in Table 13; Layer-wise Relevance Propagation (LRP) (Bach et al., 2015) is used as the base attribution method

In Tables 6, 7, we evaluate the VGG19 and PreResNet10 models trained on the HAM10000 dataset using the different methods presented in Tables 14, 15 respectively, with integrated gradients and gradient SHAP serving as the respective base attribution methods. It is worth noting that despite extensive hyperparameter tuning, when performing Adversarial Training we were unable to achieve a substantial improvement in the robustness or interpretability metrics without the clean accuracy decreasing to around 53%. This makes the interpretability metrics of Adversarial Training problematic to compare against; thus, we mark them in gray and ignore them when choosing the best and second-best scores for each column. Nevertheless, we provide the results for completeness.

Table 6 Comparison of the attribution quality of the VGG19 model (HAM10000) trained using the five different methods presented in Table 14; integrated gradients (Sundararajan et al., 2017) is used as the base attribution method
Table 7 Comparison of the attribution quality of the PreResNet10 model (HAM10000) trained using the five different methods presented in Table 15; gradient SHAP (Lundberg & Lee, 2017) is used as the base attribution method

Across both models, our NsLoss method is a robust performer in terms of the interpretability criteria. In terms of the sensitivity criterion, NsLoss often performs on par with or better than JacobReg and adversarial training, achieving the best or nearly best scores on both models. In terms of the faithfulness criterion, NsLoss leads on the VGG19 model and is a close second on the PreResNet10 model, suggesting a high level of fidelity in its explanations. JacobReg shows competitive performance here but does not consistently outshine NsLoss. In terms of the complexity criterion, NsLoss stands out, particularly on the VGG19 model, obtaining the lowest complexity (Com) and highest sparsity (Spr) values. On the PreResNet10 model, it closely follows JacobReg but achieves slightly better performance in terms of sparsity (Spr). In contrast, adversarial training often demonstrates low sensitivity but lags in terms of faithfulness and complexity, making it less balanced overall. JacobReg performs well in terms of faithfulness but does not consistently outperform NsLoss on other metrics.

In Tables 8, 9 we present the performance comparison for the CIFAR10 dataset, using the VGG19 and PreResNet10 models described in Tables 16, 17, with Gradient SHAP and LRP used as the base attribution methods. For VGG19, JacobReg performs best on both the max and avg sensitivity metrics and NsLoss \(L_{\infty }\) performs best on the LL metric, whereas for almost all other criteria and metrics, the NsLoss \(L_2\) and \(L_{\infty }\) methods outperform the alternatives. For PreResNet10, NsLoss \(L_2\) and \(L_\infty \) perform best for all interpretability metrics, each alternating between the best and second-best results. These observations highlight NsLoss’s ability to deliver feature-based explanations that are simultaneously low in sensitivity, high in faithfulness, and low in complexity, making it a viable candidate for generating explanations that may be easier for humans to interpret.

Table 8 Comparison of attribution quality on the VGG19 model (CIFAR10) trained using the five different methods presented in Table 16; Gradient SHAP (Lundberg & Lee, 2017) is used as the base attribution method
Table 9 Comparison of attribution quality on the PreResNet10 model (CIFAR10) trained using the five different methods presented in Table 17; LRP (Bach et al., 2015) is used as the base attribution method

5.5 Qualitative evaluation results

Figures 3 and 4 present the attribution maps of images from Imagenette’s test set for the VGG19 and PreResNet10 models presented in Tables 13 and 11 respectively. Figure 5 presents the attribution maps of images from HAM10000’s test set for the VGG19 models presented in Table 14. The attribution maps generated by our method are framed in red. The NsLoss attribution maps for the parachute images (top two rows in Fig. 3), as well as the skin lesions in Figs. 5, 6 demonstrate the ability of the NsLoss-trained model to capture the key region of interest in the image (the parachute or lesion itself) and produce an attribution map that focuses precisely on that region, while ignoring the background. While the other methods were also able to capture the region of interest in the image, these regions were rather noisy and less sharp.

Following Smilkov et al. (2017), we use visual coherence to indicate that the salient areas highlight mainly the object of interest rather than the background. As can be seen in the attribution maps of the church images (bottom two rows in Fig. 3), the dogs in Fig. 4, and the skin lesions in Fig. 5, the standard saliency maps show quite poor visual coherence, as they focus mainly on the background rather than on the object itself. In contrast, the non-standard methods provide more visually coherent maps; however, the results presented in these figures, along with the findings included in the supplementary material, show that NsLoss consistently provides the most visually coherent and least noisy maps compared to the other methods, regardless of the explainability method used.

Fig. 3

Comparison of the attribution maps obtained for four images from the Imagenette test set by the VGG19 model trained using the standard, NsLoss \(L_2\), JacobReg, and adversarial training methods. The labels on the left indicate the attribution method used - integrated gradients (IG), layer-wise relevance propagation (LRP), gradient SHAP (GS) and GradCam (GC)

Fig. 4

Comparison of the feature maps obtained by the PreResNet10 (He et al., 2016) model which was trained using the standard, NsLoss \(L_2\), JacobReg, and adversarial training methods on images of dogs from the Imagenette (Howard, 2019) test set. The labels on the left indicate the attribution method used - integrated gradients (IG) and gradient SHAP (GS)

Fig. 5

Attribution map comparison obtained by the VGG19 model (Simonyan & Zisserman, 2014) trained using the standard, NsLoss, JacobReg, and adversarial training methods on images of skin diseases from the HAM10000 (Tschandl et al., 2018) test set. The labels on the left indicate the image class and attribution method used - integrated gradients (IG) (Sundararajan et al., 2017), layer-wise relevance propagation (LRP) (Bach et al., 2015) and Gradient SHAP (GS) (Lundberg & Lee, 2017). The attribution maps generated by our method are framed in red (Color figure online)

Fig. 6

Feature maps obtained on samples from CIFAR10 by the VGG19 model which was trained using the methods described in Table 16. The row labels indicate the attribution method used (IG, GS)

5.6 Hyperparameter studies

Effect of the regularization term and training length NsLoss makes use of several hyperparameters. We examine the effect of (1) \(\lambda \), which is the weight of the regularization term, and (2) the number of training epochs on select explanation-quality metrics.

Table 10 The effect of the NsLoss \(\lambda \) hyperparameter (rows) on the accuracy and explanation metrics (columns). The scores obtained on PreResNet10 trained for 10 epochs with the corresponding \(\lambda \) are presented

Regularization term weight (\(\lambda \)) In Table 10, it can be seen that as \(\lambda \) increases, there is a gradual improvement in the results on all explanation-quality metrics: The max-sensitivity scores decrease, and the faithfulness estimation and sparseness increase, suggesting an improvement in the model’s interpretability. However, this improvement is accompanied by a certain decrease in the model’s accuracy. Therefore, the \(\lambda \) value must be carefully chosen; accordingly, we suggest that readers follow the protocol described in Sect. 4.4.

Training epochs Figure 7 presents plots for three attribution quality metrics and clean test accuracy over 80 training epochs. It can be seen that the max-sensitivity values decrease relatively quickly, right from the first epoch. Both the faithfulness estimation and sparseness continue to improve moderately as the number of epochs increases. The results show that there is an interpretability-accuracy trade-off, and a gradual drop in accuracy can be seen. Therefore, the training process should be monitored to choose the “sweet spot” where there is a balance between the desired interpretability quality and the required accuracy of the model.

Fig. 7

The effect of the number of training epochs on test accuracy based on a subset of explanation-quality metrics. The scores were obtained by training PreResNet10 for 80 epochs using the NsLoss method with hyperparameter \(\lambda =60\). a clean accuracy over epochs, b faithfulness estimation (Alvarez Melis & Jaakkola, 2018) over epochs, c max-sensitivity (Yeh et al., 2019) over epochs, d sparseness (Chalasani et al., 2020) over epochs

6 Conclusions and future work

Our experimental results validate the effectiveness of adversarial training, Jacobian regularization, and our novel regularization-based approach (NsLoss) in improving the model’s interpretability by changing the model’s behavior such that state-of-the-art explainability methods produce explanations that are more focused and better aligned with human perception. This supports the results of prior research and hypotheses about the positive effect of adversarial training on model interpretability and sets the stage for further research in the field. Moreover, our comprehensive evaluation quantitatively demonstrated the superiority of our proposed method using well-accepted metrics for measuring the quality of explanations, and provided representative qualitative evidence of its capabilities based on saliency map visualizations.

Future work may include: (1) evaluating our method on other computer vision tasks and modern model architectures such as vision transformers (Dosovitskiy et al., 2021); we expect to see very similar results on other model architectures, datasets and tasks, such as object detection, pose estimation, and image captioning; (2) applying NsLoss to other domains (beyond computer vision), which should be fairly straightforward, as the method makes no assumptions about the nature of the input or the model’s architecture; (3) exploring the effect of NsLoss regularization on the nature of the features learned by the model, as has been done in prior research on adversarially trained models (Allen-Zhu & Li, 2022; Zhang & Zhu, 2019; Tsipras et al., 2018). Extrapolation of the results presented in this paper leads us to believe that NsLoss-trained models learn features that are better aligned with human perception than adversarially trained models; (4) investigating the effect of using adversarial perturbations instead of random perturbations for the proposed regularization method; (5) training a model to predict a single score from the various interpretability metrics/features that will be aligned with human rating of interpretability; and (6) finding a training-time method that will achieve state-of-the-art results on both adversarial robustness and interpretability.