
1 Introduction

Deep learning is now widely used in state-of-the-art Artificial Intelligence (AI) technology. A Deep Neural Network (DNN) model, however, remains largely a “black box.” AI applications in finance, medicine, and autonomous vehicles demand justifiable predictions, barring most deep learning methods from use. Understanding what goes on inside the “black box” of a DNN, what the model has learned, and how the training data influenced that learning is instrumental, since AI serves humans and should be accountable to humans and society.

In response, Explainable AI (XAI) has popularized a family of visual explanations called saliency methods, which highlight the pixels that are “important” for a model’s final prediction. We contribute multiple works that target understanding deep model behavior through the analysis of saliency maps highlighting the regions of evidence used by the model. We then contribute works that utilize such saliency to obtain models with improved accuracy, network utilization, robustness, and domain generalization. In this chapter, we provide an overview of our contributions in this field.

XAI in Visual Data. Grounding model decisions in visual data has the benefit of being clearly interpretable by humans. The evidence a deep convolutional model uses toward the class conditional probability of a specific class is highlighted in the form of a saliency map. In our work [36], we present applications of spatial grounding in model interpretation, in data annotation assistance for facial expression analysis and medical imaging tasks, and as a diagnostic tool for model misclassifications. We do so in a discriminative way that highlights evidence for every possible outcome given the same input, for any deep convolutional neural network classifier.

We also propose the black-box grounding techniques RISE [22] and D-RISE [23]. Unlike the majority of previous approaches, RISE can produce saliency maps without access to the internal state of the base model, such as weights, gradients, or feature maps. The advantages of such a black-box approach are that RISE does not assume anything about the base model architecture, it can be used to test proprietary models that do not allow full access, and its implementation is easily adapted to a new base model. The saliency is computed by perturbing the input image with a set of randomized masks while keeping track of the changes in the output: major changes in the output are reflected in increased saliency of the perturbed region of the input (see Fig. 2).

Deep recurrent models are state-of-the-art for many vision tasks including video action recognition and video captioning. Models are trained to caption or classify activity in videos, but little is known about the evidence used to make such decisions. Our work was the first to formulate top-down saliency in deep recurrent models for space-time grounding of videos [1]. We do so using a single contrastive backward pass of an already trained model. This enables the visualization of spatiotemporal cues that contribute to a deep model’s classification/captioning output and localization of segments within a video that correspond with a specific action, or phrase from a caption, without explicitly optimizing/training for these tasks.

XAI for Improved Models. We propose three frameworks that utilize explanations to improve model accuracy. The first proposes a guided dropout regularizer for deep networks [39], based on the explanation of a network prediction defined as the firing of neurons along specific paths. The explanation at each neuron is utilized to determine its probability of dropout, rather than dropping out neurons uniformly at random as in standard dropout. Neurons that contribute more to decision making at training time are thus dropped out with higher probability, forcing the network to learn alternative paths in order to maintain loss minimization and resulting in a plasticity-like behavior, a characteristic of human brains. This demonstrates better generalization ability, an increased utilization of network neurons, and a higher resilience to network compression for image/video recognition.

Our second training strategy not only leads to a more explainable AI system for object classification, but, as a consequence, suffers no perceptible accuracy degradation [40]. The training strategy enforces periodic saliency-based feedback to encourage the model to focus on the image regions that directly correspond to the ground-truth object. We propose explainability as a means for bridging the visual-semantic gap between different domains, where model explanations are used as a means of disentangling domain-specific information from otherwise relevant features. We demonstrate that this leads to improved generalization to new domains without hindering performance on the original domain.

Our third strategy is applied at test time and improves model accuracy by zooming in on the evidence and ensuring the model has “the right reasons” for a prediction, defined as reasons that are coherent with those used to make similar correct decisions at training time [2, 3]. The reason/evidence upon which a deep neural network makes a prediction is defined to be the spatial grounding, in the pixel space, for a specific class conditional probability in the model output. We use evidence grounding as the signal to a module that assesses how much one can trust a Convolutional Neural Network (CNN) prediction over another.

The rest of this chapter is organized as follows. Section 2 presents saliency approaches that target explaining how deep neural network models associate input regions to output predictions. Sections 3, 4, and 5 present approaches that utilize explainability in the form of saliency (Sect. 2) to obtain models that possess state-of-the-art in-domain and out-of-domain accuracy, have improved neuron utilization, and are more robust to network compression. Section 6 concludes the presented line of works.

2 Saliency-Based XAI in Vision

In this section we present sample white- and black-box methods for saliency-based explainability of vision models.

2.1 White-Box Models

We first present sample white-box grounding techniques developed for the purpose of explainability of deep vision models. Formulation of white-box techniques assumes knowledge of model architectures and parameters.

Spatial. In a standard spatial CNN, the forward activation of neuron \(a_j\) is computed by \(\widehat{a}_j=\phi (\sum _iw_{ij}\widehat{a}_i+b_i)\), where \(\widehat{a}_i\) is the activation coming from the previous layer, \(\phi \) is a nonlinear activation function, and \(w_{ij}\) and \(b_i\) are the weight from neuron i to neuron j and the added bias at layer i, respectively. Excitation Backprop (EB) was proposed in [37] to identify the task-relevant neurons in any intermediate layer of a pre-trained CNN. EB devises a backpropagation formulation that is able to reconstruct the evidence used by a deep model to make decisions. It computes the probability of each neuron recursively using conditional probabilities \(P(a_i|a_j)\) in a top-down order, starting from a probability distribution over the output units, as follows:

$$\begin{aligned} P(a_i) =\sum _{a_j\in \mathcal {P}_i}P(a_i|a_j)P(a_j) \end{aligned}$$
(1)

where \(\mathcal {P}_i\) is the parent node set of \(a_i\). EB passes top-down signals through excitatory connections having non-negative activations, excluding inhibitory connections from the competition. EB is designed with the assumption that non-negative activations are positively correlated with the detection of specific visual features. Most modern CNNs use ReLU activation functions, which satisfy this assumption. Therefore, negative weights can be assumed not to positively contribute to the final prediction. Assuming \(\mathcal {C}_j\) is the child node set of \(a_j\), for each \(a_i \in \mathcal {C}_j\), the conditional winning probability \(P(a_i|a_j)\) is defined as

$$\begin{aligned} P(a_i|a_j) = {\left\{ \begin{array}{ll} Z_j \widehat{a}_i w_{ij}, &{} \text {if } w_{ij}\ge 0, \\ 0, &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(2)

where \(Z_j\) is a normalization factor such that a probability distribution is maintained, i.e. \(\sum _{a_i\in \mathcal {C}_j}P(a_i|a_j) = 1\). By recursively propagating the top-down signal and preserving the sum of backpropagated probabilities, it is possible to highlight the salient neurons in each layer using Eq. 1, i.e. the neurons that contribute most to a specific task. This has been shown to accurately localize objects in images (corresponding to object classes) in a weakly-supervised way.
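As a concrete illustration, the sketch below implements one EB step (Eqs. 1 and 2) for a single fully-connected layer in NumPy. The function and variable names are ours, not from [37], and batching, convolutional layers, and numerical details are omitted.

```python
# Minimal NumPy sketch of one Excitation Backprop step (Eqs. 1-2) for a
# fully-connected layer; names and shapes are illustrative assumptions.
import numpy as np

def eb_layer(P_out, a_in, W):
    """Propagate neuron winning probabilities one layer down.

    P_out : (J,)   top-down probabilities P(a_j) of the parent layer
    a_in  : (I,)   non-negative forward activations \hat{a}_i (e.g. post-ReLU)
    W     : (I, J) weights w_ij from child neuron i to parent neuron j
    Returns the (I,) probabilities P(a_i) of the child layer.
    """
    W_pos = np.clip(W, 0.0, None)            # keep only excitatory weights (w_ij >= 0)
    contrib = a_in[:, None] * W_pos          # \hat{a}_i * w_ij
    Z = contrib.sum(axis=0, keepdims=True)   # per-parent normalizer Z_j
    Z[Z == 0] = 1.0                          # avoid division by zero for dead parents
    P_cond = contrib / Z                     # Eq. 2: P(a_i | a_j), each column sums to 1
    return P_cond @ P_out                    # Eq. 1: sum_j P(a_i|a_j) P(a_j)
```

Starting from a one-hot distribution over the output classes and applying this step recursively down the network yields the winning probabilities, and hence a saliency map, at any intermediate layer.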

Spatiotemporal. Spatiotemporal explainability is instrumental for applications like action detection and image/video captioning [32]. We extend EB to become spatiotemporal [1]. This work is the first to formulate top-down saliency in deep recurrent models for space-time grounding of videos. In this section we explain the details of our spatiotemporal grounding framework: cEB-R. As illustrated in Fig. 1, we have three main modules: RNN Backward, Temporal normalization, and CNN Backward.

The RNN Backward module implements an excitation backprop formulation for RNNs. Recurrent models such as LSTMs are well-suited for top-down temporal saliency as they explicitly propagate information over time. The extension of EB to recurrent networks, EB-R, is not straightforward since EB must be implemented through the unrolled time steps of the RNN, and since the original RNN formulation contains tanh non-linearities which do not satisfy the EB assumption. [6, 10] conducted an analysis of variations of the standard RNN formulation and discovered that different non-linearities performed similarly for a variety of tasks. Based on this, we use ReLU non-linearities and their corresponding derivatives instead of tanh. This satisfies the EB assumption and results in similar performance on the video recognition and captioning tasks we address.

Working backwards from the RNN’s output layer, we compute the conditional winning probabilities from the set of output nodes O, and the set of dual output nodes \(\overline{O}\):

$$\begin{aligned} P^t(a_i|a_j) = {\left\{ \begin{array}{ll} Z_j \widehat{a}_i^t w_{ij}, &{} \text {if } w_{ij}\ge 0, \\ 0, &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$
(3)
$$\begin{aligned} \overline{P}^t(a_i|a_j) = {\left\{ \begin{array}{ll} Z_j \widehat{a}_i^t \overline{w}_{ij}, &{} \text {if } \overline{w}_{ij}\ge 0, \\ 0, &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$
(4)

\(Z_j = 1/\sum _{i:w_{ij}\ge 0}\hat{a}_i^t w_{ij}\) is a normalization factor such that all conditional probabilities of the children of \(a_j\) (Eqs. 3, 4) sum to 1; \(w_{ij} \in W\), where W is the set of model weights and \(w_{ij}\) is the weight between child neuron \(a_i\) and parent neuron \(a_j\); \(\overline{w}_{ij} \in \overline{W}\), where \(\overline{W}\) is obtained by negating the model weights at the classification layer only. \(\overline{P}^t(a_i|a_j)\) is only needed for contrastive attention.

Fig. 1.

Our proposed framework spatiotemporally highlights/grounds the evidence that an RNN model used in producing a class label or caption for a given input video. In this example, by using our proposed back-propagation method, the evidence for the activity class CliffDiving is highlighted in a video that contains CliffDiving and HorseRiding. Our model employs a single backward pass to produce saliency maps that highlight the evidence that a given RNN used in generating its outputs.

We compute the neuron winning probabilities starting from the prior distribution encoding a given action/caption as follows:

$$\begin{aligned} P^t(a_i) =\sum _{a_j\in \mathcal {P}_i}P^t(a_i|a_j)P^t(a_j) \end{aligned}$$
(5)
$$\begin{aligned} \overline{P}^t(a_i) = \sum _{a_j\in \mathcal {P}_i}\overline{P}^t(a_i|a_j)\overline{P}^t(a_j) \end{aligned}$$
(6)

where \(\mathcal {P}_i\) is the set of parent neurons of \(a_i\).

Replacing tanh non-linearities with ReLU non-linearities to extend EB in time does not suffice for temporal saliency. EB performs normalization at every layer to maintain a probability distribution. For spatiotemporal localization, the Temporal Normalization module normalizes the signals from the desired \(n^{th}\) time-step of a T-frame clip in both time and space (assuming S neurons in the current layer) before they are further backpropagated into the CNN:

$$\begin{aligned} P_N^t(a_i) = P^t(a_i) / \textstyle \sum \nolimits _{t=1}^{T} \textstyle \sum \nolimits _{i=1}^{S} P^t(a_i). \end{aligned}$$
(7)
$$\begin{aligned} \overline{P}_N^t(a_i) = \overline{P}^t(a_i) / \textstyle \sum \nolimits _{t=1}^{T} \textstyle \sum \nolimits _{i=1}^{S} \overline{P}^t(a_i). \end{aligned}$$
(8)

cEB-R computes the difference between the normalized saliency maps obtained by EB-R starting from O and by EB-R starting from \(\overline{O}\) using the negated weights of the classification layer. cEB-R is more discriminative as it grounds the evidence that is unique to a selected class/word and not common to other classes used at training time. This is computed as follows:

$$\begin{aligned} Map^t(a_i) = P_N^t(a_i) - \overline{P}_N^t(a_i). \end{aligned}$$
(9)

For every video frame \(f_t\) at time step t, we use the backprop of [37] for all CNN layers in the CNN Backward module:

$$\begin{aligned} P^t(a_i|a_j) = {\left\{ \begin{array}{ll} Z_j \widehat{a}_i^t w_{ij}, &{} \text {if } w_{ij}\ge 0, \\ 0, &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(10)
$$\begin{aligned} Map^t(a_i) = \sum _{a_j\in P_i} P^t(a_i|a_j)Map^t(a_j) \end{aligned}$$
(11)

where \(\widehat{a}^t_i\) is the activation when frame \(f_t\) is passed through the CNN. \(Map^t\) at the desired CNN layer is the cEB-R saliency map for \(f_t\). Computationally, the complexity of cEB-R is on the order of a single backward pass. Note that for EB-R, \(P_N^t(a_j)\) is used instead of \(Map^t(a_j)\) in Eq. 11.
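To make the temporal normalization and contrastive steps concrete, the sketch below computes Eqs. 7–9 given arrays of winning probabilities produced by EB-R with the original and negated classification weights; the array shapes and names are illustrative assumptions.

```python
# Hedged sketch of the temporal normalization and contrastive map of cEB-R
# (Eqs. 7-9). P and P_bar would come from running EB-R through the unrolled RNN.
import numpy as np

def contrastive_map(P, P_bar):
    """P, P_bar : (T, S) winning probabilities per time step and neuron, computed
    with the original and the negated classification-layer weights, respectively.
    Returns the (T, S) contrastive map Map^t(a_i) to be backpropagated per frame
    through the CNN (Eqs. 10-11)."""
    P_norm = P / P.sum()          # Eq. 7: normalize over time and space
    P_bar_norm = P_bar / P_bar.sum()  # Eq. 8
    return P_norm - P_bar_norm        # Eq. 9: evidence unique to the selected class/word
```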

The general framework has been applied to action localization. We ground the evidence of a specific action using a model trained on this task. The input is a video sequence and the action to be localized, and the output is a sequence of spatiotemporal saliency maps for this action in the video. Performing cEB-R results in a sequence of saliency maps \(Map^{t}\) for \(t=1, ..., T\). These maps can then be used for localizing the action by finding temporal regions of highest aggregate saliency. This has also been applied to other spatiotemporal applications such as image and video captioning.

2.2 Black-Box Models

Black-box methods operate under the assumption that no internal information about the model is available; we can only observe the model’s final output for each input we provide. In this paradigm, explaining the black-box model requires querying it in such a way that the outputs reveal some of its underlying behaviour. These methods are typically slower than white-box approaches since information is obtained at the cost of additional queries to the model.

One way to construct the queries is to run the model on similar versions of the input and analyze the differences in the output. For example, to compute how important different regions of the input are, i.e. to compute saliency, one can mask out certain parts of the image. A significant change in the output indicates that the masked region is important.

Our method RISE [22] builds on this idea. We probe the base model by perturbing the input image using random masks and record its responses to each of the masked images. The saliency map S is computed as a weighted sum of the used masks, where the weights come from the probabilities predicted by the base model (see Fig. 2):

$$\begin{aligned} S_{I, f} = \frac{1}{\sum \limits _{M\in \mathcal {M}} M} \sum _{M\in \mathcal {M}} f(I\odot M)\cdot M, \end{aligned}$$
(12)

where f is the base model, I is the input image, and \(\mathcal {M}\) is the set of generated masks. The mask M has a large weight \(f(I\odot M)\) in the sum only if the score of the base model is high on the masked image, i.e. the mask preserves important regions. We generate masks as uniformly random binary grids (bilinearly upsampled) to refrain from imposing any priors on the resulting saliency maps.
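A minimal sketch of this estimator (Eq. 12) is given below, assuming a hypothetical `model` callable that returns class probabilities for an image. The number of masks, grid size, and keep probability are illustrative, and the paper’s bilinearly upsampled, randomly shifted masks are simplified here to a nearest-neighbor upsampling.

```python
# Minimal sketch of RISE saliency (Eq. 12); `model`, mask count, grid size, and
# keep probability are assumptions for illustration only.
import numpy as np

def rise_saliency(model, image, target_class, n_masks=1000, grid=7, p_keep=0.5):
    """image : (H, W, 3) float array; model(image) -> vector of class probabilities."""
    H, W = image.shape[:2]
    cell_h, cell_w = int(np.ceil(H / grid)), int(np.ceil(W / grid))
    saliency = np.zeros((H, W))
    mask_sum = np.zeros((H, W))
    for _ in range(n_masks):
        grid_mask = (np.random.rand(grid, grid) < p_keep).astype(np.float32)
        # Upsample the binary grid to image size (nearest-neighbor for simplicity).
        mask = np.kron(grid_mask, np.ones((cell_h, cell_w)))[:H, :W]
        score = model(image * mask[..., None])[target_class]
        saliency += score * mask        # weight each mask by the model's score
        mask_sum += mask
    return saliency / np.maximum(mask_sum, 1e-8)  # Eq. 12 normalization
```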

Fig. 2.

RISE overview

RISE can be applied to explain models that predict a distribution over labels given an image, such as classification and captioning models. Classification saliency methods fail, however, when directly applied to object detection models. To generate such saliency maps for object detectors we propose the D-RISE method [23], which accounts for the differences in an object detection model’s structure and output format. To measure the effect of the masks on the model output, we propose a similarity metric between two detection proposals \(d_t\) and \(d_j\):

$$\begin{aligned} s(d_t, d_j) = s_L(d_t, d_j) \cdot s_P(d_t, d_j) \cdot s_O(d_t, d_j). \end{aligned}$$
(13)

This metric computes similarity values for the three components of the detection proposals: localization (bounding box L), classification (class probabilities P), and objectness score (O):

$$\begin{aligned} s_L(d_t, d_j)&= \mathrm {IoU}(L_t, L_j),\end{aligned}$$
(14)
$$\begin{aligned} s_P(d_t, d_j)&= \frac{P_t\cdot P_j}{\Vert P_t\Vert \Vert P_j\Vert }, \end{aligned}$$
(15)
$$\begin{aligned} s_O(d_t, d_j)&= O_j. \end{aligned}$$
(16)

Using the masking technique and the similarity metric, D-RISE can compute saliency maps for object detectors in a similar querying manner. We use D-RISE to gain insights into the detector’s use of context. We demonstrate how saliency can be used to better understand the model’s use of correlations in the data, e.g. ski poles being used as evidence when detecting the ski class. We also demonstrate the utility of saliency maps for detecting accidental or adversarial biases in the data.
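For illustration, the sketch below computes the pairwise similarity of Eqs. 13–16, assuming each detection proposal is a (bounding box, class-probability vector, objectness) tuple; the actual data structures used in [23] may differ.

```python
# Hedged sketch of the D-RISE pairwise similarity (Eqs. 13-16); the detection
# representation is an assumption for illustration.
import numpy as np

def iou(box_a, box_b):
    """Boxes as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def detection_similarity(d_t, d_j):
    """d_t, d_j : (box, class_probs, objectness) tuples."""
    box_t, probs_t, _ = d_t
    box_j, probs_j, obj_j = d_j
    s_l = iou(box_t, box_j)                                         # Eq. 14: localization
    s_p = np.dot(probs_t, probs_j) / (
        np.linalg.norm(probs_t) * np.linalg.norm(probs_j) + 1e-8)   # Eq. 15: classification
    s_o = obj_j                                                     # Eq. 16: objectness
    return s_l * s_p * s_o                                          # Eq. 13
```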

3 XAI for Improved Models: Excitation Dropout

Dropout avoids overfitting on training data, allowing for better generalization on unseen test data. In this work, we target how the dropped neurons are selected, answering the question: Which neurons to drop out?

Fig. 3.

Training pipeline of Excitation Dropout. Step 1: A minibatch goes through the standard forward pass. Step 2: Backward EB is performed until the specified dropout layer; this gives a neuron saliency map at the dropout layer in the form of a probability distribution. Step 3: The probability distribution is used to generate a binary mask for each image of the batch based on a Bernoulli distribution determining whether each neuron will be dropped out or not. Step 4: A forward pass is performed from the specified dropout layer to the end of the network, zeroing the activations of the dropped out neurons. Step 5: The standard backward pass is performed to update model weights.

Our approach [39] is inspired by brain plasticity [8, 17, 18, 29]. We deliberately, and temporarily, paralyze/injure neurons to enforce learning alternative paths in a deep network. At training time, neurons that are more relevant to the correct prediction, i.e. neurons having a high saliency, are given a higher dropout probability. The relevance of a neuron for making a certain prediction is quantified using Excitation Backprop [37]. Excitation Backprop conveniently yields a probability distribution at each layer that reflects neuron saliency, or neuron contribution to the prediction being made. This is utilized in the training pipeline of our approach, named Excitation Dropout, which is summarized in Fig. 3.

Method. In the standard formulation of dropout [9, 31], the suppression of a neuron in a given layer is modeled by a Bernoulli random variable whose parameter p, \(0 < p \le 1\), is the probability of retaining a neuron. Given a specific layer where dropout is applied, during the training phase each neuron is turned off with probability \(1-p\).

We argue for a different approach that is guided in the way it selects neurons to be dropped. In a training iteration, certain paths have high excitation contributing to the resulting classification, while other regions of the network have low responses. We encourage the learning of alternative paths (plasticity) through the temporary damaging of the currently highly excited path. We re-define the probability of retaining a neuron as a function of its contribution to the currently highly excited path:

$$\begin{aligned} p=1-\frac{(1-P)*(N-1)*p_{EB}}{((1-P)*N-1)*p_{EB}+P} \end{aligned}$$
(17)

where \(p_{EB}\) is the probability backpropagated through the EB formulation (Eq. 1) in layer l, P is the base probability of retaining a neuron when all neurons contribute equally to the prediction, and N is the number of neurons in a fully-connected layer l or the number of filters in a convolutional layer l. The retaining probability defined in Eq. 17 drops, with higher probability, the neurons that contribute the most to the recognition of a specific class. By dropping out highly relevant neurons, we retain less relevant ones and thus encourage them to awaken.
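The sketch below computes the retain probability of Eq. 17 and samples the corresponding Bernoulli mask for one dropout layer; `p_eb` stands for the EB distribution over the layer’s N neurons, and the function name and interface are ours.

```python
# Sketch of the Excitation Dropout retain probability (Eq. 17) and its Bernoulli
# mask; interface and names are illustrative assumptions.
import numpy as np

def excitation_dropout_mask(p_eb, P=0.5):
    """p_eb : (N,) EB probability distribution over the neurons of the dropout layer.
    P : base retain probability when all neurons contribute equally.
    Returns a (N,) binary mask (1 = keep, 0 = drop)."""
    N = p_eb.size
    p_retain = 1.0 - ((1.0 - P) * (N - 1) * p_eb) / (
        ((1.0 - P) * N - 1.0) * p_eb + P)                    # Eq. 17
    return (np.random.rand(N) < p_retain).astype(np.float32)
```

When \(p_{EB}\) is uniform (1/N for every neuron), the retain probability reduces to the base value P; the resulting mask zeroes the activations of the dropped neurons in the subsequent forward pass (Step 4 in Fig. 3).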

Results. We evaluate the effectiveness of Excitation Dropout on popular network architectures that employ dropout layers, including AlexNet [14], VGG16 [28], VGG19 [28], and CNN-2 [19]. We perform dropout in the first fully-connected layer of the networks and find that it results in a 1%–5% accuracy improvement in comparison to Standard Dropout and other dropout variants proposed in the literature, including Adaptive, Information, and Curriculum Dropout. These results have been validated on image and video datasets including UCF101 [30], Cifar10 [13], Cifar100 [13], and Caltech256 [7].

Excitation Dropout shows a higher number of active neurons, a higher entropy over activations, and a probability distribution \(p_{EB}\) that is more spread out (higher entropy over \(p_{EB}\)) among the neurons of the layer, leading to a lower peak \(p_{EB}\) and therefore less specialized neurons. These trends are observed consistently over all training iterations for the examined image and video recognition datasets. Excitation Dropout also enables networks to have a higher robustness against network compression for all examined datasets: it maintains a much less steep decline of the ground-truth class probability as more neurons are pruned. Explainability has also been recently used to prune networks for transfer learning from large corpora to more specialized tasks [35].

4 XAI for Improved Models: Domain Generalization

While Sect. 3 focuses on dropping neurons ‘relevant’ to a prediction as a means of network regularization within a particular domain, we now propose using such relevance to focus on domain agnostic features that can aid domain generalization.

Fig. 4.

In this figure we demonstrate how explainability (XAI) can be used to achieve domain generalization from a single source. Training a deep neural network model to enforce explainability, e.g. focusing on the skateboard region (red is most salient, and blue is least salient) for the ground-truth class skateboard in the central training image, enables improved generalization to other domains where the background is not necessarily class-informative. (Color figure online)

We develop a training strategy [40] for deep neural network models that increases explainability, suffers no perceptible accuracy degradation on the training domain, and improves performance on unseen domains.

We posit that the design of algorithms that better mimic the way humans reason, or “explain”, can help mitigate domain bias. Our approach utilizes explainability as a means for bridging the visual-semantic gap between different domains as presented in Fig. 4. Specifically, our training strategy is guided by model explanations and available human-labeled explanations, mimicking interactive human feedback [26]. Explanations are defined as regions of visual evidence upon which a network makes a decision. This is represented in the form of a saliency map conveying how much each pixel contributed to the network’s decision.

Our training strategy periodically guides the forward activations of spatial layer(s) of a Convolutional Neural Network (CNN) trained for object classification. The activations are guided to focus on regions in the image that directly correspond to the ground-truth (GT) class label, as opposed to context that may more likely be domain dependent. The proposed strategy aims to reinforce explanations that are non-domain specific, and alleviate explanations that are domain specific. Classification models are compact and fast in comparison to more complex semantic segmentation models. This allows the compact classification model to possess some properties of a segmentation model without increasing model complexity or test-time overhead.

Method. We enforce focusing on objects in an image by scaling the forward activations of a particular spatial layer l in the network at certain epochs. We generate a multiplicative binary mask for guiding the focus of the network in the layer in which we are enforcing XAI. For an image \(x^i\) that is already explainable, the binary mask is a binarization of the obtained saliency map over the spatial locations \(j=1,\dots ,W\) and \(k=1,\dots ,H\), where W and H are the spatial dimensions of the layer’s output neuron activations; the mask is active at locations of non-zero saliency. This reinforces the activations corresponding to the active saliency regions that have been classified as explainable. For images that need an improved explanation, the binary mask is assigned to be the GT spatial annotation, \(mask^i_{j,k} = g^i_{j,k} ~\forall j ~\forall k\), \(j=1,\dots ,W\), \(k=1,\dots ,H\); the mask is active at GT locations. This increases the frequency at which the network reinforces activations at locations that are likely to be non-domain specific and suppresses activations at locations that are likely to be domain specific. We then perform an element-wise multiplication of the computed mask with the forward activations of layer l, i.e. \(a^{l,i}_{j,k} = mask^i_{j,k} * a^{l,i}_{j,k} ~\forall j ~\forall k\), \(j=1,\dots ,W\), \(k=1,\dots ,H\).
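A minimal sketch of this guiding step is shown below; the array shapes, and the assumption that the saliency and GT masks are already resized to the spatial resolution of layer l, are ours.

```python
# Hedged sketch of saliency-guided activation masking at layer l; shapes and
# the pre-resized masks are illustrative assumptions.
import numpy as np

def guide_activations(activations, saliency=None, gt_mask=None):
    """activations : (C, W, H) forward activations of the guided layer l.
    saliency / gt_mask : (W, H) maps aligned with those activations."""
    if saliency is not None:
        mask = (saliency > 0).astype(activations.dtype)  # active where saliency is non-zero
    else:
        mask = gt_mask.astype(activations.dtype)         # active at GT object locations
    return activations * mask[None, :, :]                # element-wise guiding of layer l
```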

Results. The identification of evidence within a visual input using top-down neural attention formulations [27] can be a powerful tool for domain analysis. We demonstrate that more explainable deep classification models could be trained without hindering their performance.

We train ResNet architectures for the single-label classification task on the popular MSCOCO [15] and PASCAL VOC [4] datasets. The XAI model results in a 25% increase in the number of correctly classified images that also yield better localization/explainability according to the popular pointing game metric. The XAI model learns to rely less on context information without hurting performance.

Thus far, evaluation assumed that a saliency map whose peak overlaps with the GT spatial annotation of the object is a better explanation. We then conduct a human study to confirm our intuitive quantification of an “explainable” model. The study asks users what they think is a better explanation for the presence of an object. XAI evidence was preferred for 67% of the full image population and for 80% of the images for which users indicated a clear winner.

Finally, we demonstrate how the explainable model better generalizes from real images of MSCOCO/PASCAL VOC to six unseen target domains from the DomainNet [20] and Syn2Real [21] datasets (clipart, quickdraw, infograph, painting, sketch, and graphics).

5 XAI for Improved Models: Guided Zoom

In state-of-the-art deep single-label classification models, the top-k \((k=2,3,4, \dots )\) accuracy is usually significantly higher than the top-1 accuracy. This is more evident in fine-grained datasets, where differences between classes are quite subtle. Exploiting the information provided by the top-k predicted classes can boost the final prediction of a model. We propose Guided Zoom [3], a novel way in which explainability can be used to improve model performance. We do so by making sure the model has “the right reasons” for a prediction. The reason/evidence upon which a deep neural network makes a prediction is defined to be the grounding, in the pixel space, for a specific class conditional probability in the model output. Guided Zoom examines how reasonable the evidence used to make each of the top-k predictions is. In contrast to work that implements reasonableness in the loss function, e.g. [24, 25], test-time evidence is deemed reasonable in Guided Zoom if it is coherent with evidence used to make similar correct decisions at training time. This leads to better informed predictions.

Fig. 5.

Pipeline of Guided Zoom. A conventional CNN outputs class conditional probabilities for an input image. Salient patches could reveal that evidence is weak. We refine the class prediction of the conventional CNN by introducing two modules: 1) Evidence CNN determines the consistency between the evidence of a test image prediction and that of correctly classified training examples of the same class. 2) Decision Refinement uses the output of Evidence CNN to refine the prediction of the conventional CNN.

Method. We now describe how Guided Zoom utilizes multiple pieces of discriminative evidence, does not require part annotations, and implicitly enforces part correlations, walking through the main modules depicted in Fig. 5.

Conventional CNNs trained for image classification output class conditional probabilities upon which predictions are made. The class conditional probabilities are the result of some corresponding evidence in the input image. From correctly classified training examples, we generate a reference pool \(\mathcal {P}\) of (evidence, prediction) pairs over which the Evidence CNN will be trained for the same classification task. We recover/ground such evidence using several grounding techniques [1, 22, 27]. We extract the image patch corresponding to the peak saliency region. This patch highlights the most discriminative evidence. However, the next most discriminative patches may also be good additional evidence for differentiating fine-grained categories.

Also, grounding techniques only highlight part(s) of an object. However, a more inclusive segmentation map can be extracted from the already trained model at test time using an iterative adversarial erasing of patches [33]. We augment our reference pool with patches resulting from performing iterative adversarial erasing of the most discriminative evidence from an image. We notice that adversarial erasing results in implicit part localization from most to least discriminative parts. All patches extracted from this process inherit the ground-truth label of the original image. By labeling different parts with the same image ground-truth label, we are implicitly forcing part-label correlations in Evidence CNN.

Including such additional evidence in our reference pool gives a richer description of the examined classes compared to models that recursively zoom into one location while ignoring other discriminative cues [5]. We note that we add an evidence patch to the reference pool only if the removal of the previous salient patch does not affect the correct classification of the sample image. Erasing is performed by adding a black-filled square on the previous most salient evidence to encourage a highlight of the next salient evidence. We then train a CNN model, Evidence CNN, on the generated evidence pool.
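The sketch below illustrates the adversarial-erasing loop used to build the evidence pool; `saliency_fn` and `classify_fn` are hypothetical stand-ins for a grounding technique [1, 22, 27] and the conventional CNN, and the patch size and number of patches are illustrative.

```python
# Hedged sketch of iterative adversarial erasing for evidence-pool construction;
# saliency_fn, classify_fn, patch_size, and n_patches are assumptions.
import numpy as np

def collect_evidence_patches(image, label, saliency_fn, classify_fn,
                             n_patches=3, patch_size=64):
    """image : (H, W, 3) array; label : ground-truth class of the image.
    Returns a list of evidence patches, each inheriting the image's GT label."""
    patches, img = [], image.copy()
    for _ in range(n_patches):
        if classify_fn(img) != label:        # stop once erasing breaks the correct prediction
            break
        sal = saliency_fn(img, label)        # saliency map for the GT class
        y, x = np.unravel_index(np.argmax(sal), sal.shape)   # peak saliency location
        y0, x0 = max(0, y - patch_size // 2), max(0, x - patch_size // 2)
        patches.append(image[y0:y0 + patch_size, x0:x0 + patch_size].copy())
        img[y0:y0 + patch_size, x0:x0 + patch_size] = 0      # black-fill the erased evidence
    return patches
```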

At test time, we analyze whether the evidence upon which a prediction is made is reasonable. We do so by examining the consistency of a test (evidence, prediction) with our reference pool that is used to train Evidence CNN. We exploit the visual evidence used for each of the top-k predictions for Decision Refinement. The refined prediction will be inclined toward each of the top-k classes by an amount proportional to how coherent its evidence is with the reference pool. For example, if the (evidence, prediction) of the second-top predicted class is more coherent with the reference pool of this class, then the refined prediction will be more inclined toward the second-top class.

Assuming test image \(s^j\), where \(j \in \{1,\dots ,m\}\) and m is the number of testing examples, \(s^j\) is passed through the conventional CNN resulting in \(v^{j,0}\), a vector of class conditional probabilities having some top-k classes \(c_1, \dots , c_k\) to be considered for the prediction refinement. We obtain the evidence for each of the top-k predicted classes \(e^{j,c_1}_0, \dots , e^{j,c_k}_0\), and pass each one through the Evidence CNN to get the output class conditional probability vectors \(v^{j,c_1}_0, \dots , v^{j,c_k}_0\). We then perform adversarial erasing to get the next most salient evidence \(e^{j,c_1}_l, \dots , e^{j,c_k}_l\) and their corresponding class conditional probability vectors \(v^{j,c_1}_l, \dots , v^{j,c_k}_l\), for \(l \in \{1, \dots , L\}\). Finally, we compute a weighted combination of all class conditional probability vectors proportional to their saliency (a lower l has more discriminative evidence and is therefore assigned a higher weight \(w_l\)). The estimated, refined class \(c^j_{ref}\) is determined as the class having the maximum aggregate prediction in the weighted combination.
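As an illustration, the sketch below aggregates the Evidence CNN outputs into a refined prediction; flattening the per-class, per-level vectors into a single list with matching weights is a simplification of the notation above.

```python
# Hedged sketch of the Decision Refinement step; the flat-list interface is an
# illustrative simplification of the per-class, per-level notation in the text.
import numpy as np

def refine_prediction(prob_vectors, weights, topk_classes):
    """prob_vectors : list of Evidence CNN class-probability vectors for the
    evidence patches of the top-k classes at all erasing levels.
    weights : matching list of w_l values (larger for more salient evidence).
    topk_classes : indices c_1, ..., c_k of the candidate classes.
    Returns the refined class among the top-k candidates."""
    combined = sum(w * v for w, v in zip(weights, prob_vectors))   # weighted aggregate
    candidate_scores = combined[np.asarray(topk_classes)]
    return topk_classes[int(np.argmax(candidate_scores))]          # refined class c_ref
```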

Results. We show that Guided Zoom results in an improvement of a model’s classification accuracy on four fine-grained classification datasets: CUB-200-2011 Birds [34], Stanford Dogs [11], FGVC-Aircraft [16], and Stanford Cars [12] of various bird species, dog species, aircraft models, and car models.

Guided Zoom is a generic framework that can be directly applied to any deep convolutional model for decision refinement within the top-k predictions. Guided Zoom demonstrates that multi-zooming is more beneficial than a single recursive zoom [5]. We also demonstrate that Guided Zoom further improves the performance of existing multi-zoom approaches [38]. Choosing random patches to be used alongside the original images, as opposed to Guided Zoom patches, yields results comparable to using the original images on their own. Therefore, the performance gains of Guided Zoom are complementary to data augmentation.

6 Conclusion

This chapter presents sample white- and black-box approaches to providing visual grounding as a form of explainable AI. It also presents a human-judgement verification that such visual explainability techniques mostly agree with the evidence humans use for the presence of visual cues. The chapter then demonstrates three strategies for integrating this preliminary form of explainable AI (widely known as saliency maps) into automated algorithms that do not require human feedback, to improve fine-grained accuracy, in-domain and out-of-domain generalization, network utilization, and robustness to network compression.