Adversarial perturbations
Given the n-th image \(x_n\) with ground truth class \(y_n\), correctly predicted by a classifier \(f(x_n)\), an adversarial image \({\hat{x}}_n\) is generated by adding perturbations to \(x_n\) such that the classifier predicts \(f({\hat{x}}_n)=y\) with \(y \ne y_n\), while \(x_n\) and \({\hat{x}}_n\) remain close according to some distance metric. Next, we present the methods for generating adversarial examples through untargeted attacks [6] and targeted attacks [6, 8].
Untargeted attacks We leverage the iterative FGSM (I-FGSM) method [23] to generate adversarial perturbations. The adversarial examples are generated with the basic iterative method as follows:
$$\begin{aligned}&{\hat{x}}_n^0 =x_n \nonumber \\&{\hat{x}}_n^{i+1}=\text {Clip}_{\epsilon }\{{\hat{x}}_n^{i}+\alpha \text {Sign}(\bigtriangledown _{{\hat{x}}_n^i}{\mathcal {L}} ({\hat{x}}_n^i,y_n))\} \end{aligned}$$
(1)
where \({\hat{x}}_n^0\) is the input image at step \(i=0\), \(\bigtriangledown _{{\hat{x}}_n^i}{\mathcal {L}}\) is the derivative of the loss function with respect to the current input image, \(\alpha \) is the step size taken at step i in the direction of the sign of the gradient, and \(\text {Clip}_{\epsilon }\) clips the result to an \(\epsilon \)-neighborhood of the original image.
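For concreteness, a minimal PyTorch-style sketch of the untargeted update in Eq. (1) is given below; the cross-entropy loss, step size \(\alpha \), perturbation budget \(\epsilon \) and number of iterations are illustrative defaults rather than the exact settings of our experiments, and the `model` handle is assumed to return class logits.

```python
import torch
import torch.nn.functional as F

def ifgsm_untargeted(model, x, y, eps=0.06, alpha=0.01, steps=10):
    """Untargeted I-FGSM, Eq. (1): iteratively ascend the loss w.r.t. the image."""
    x_adv = x.clone().detach()                        # x_hat^0 = x_n
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)       # L(x_hat^i, y_n)
        grad = torch.autograd.grad(loss, x_adv)[0]    # gradient w.r.t. the current image
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()       # step in the direction of the gradient sign
            x_adv = torch.max(torch.min(x_adv, x + eps), x - eps)  # Clip_eps: l_inf ball around x_n
            x_adv = x_adv.clamp(0.0, 1.0)             # keep a valid image range
    return x_adv.detach()
```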
Targeted attacks For targeted attacks we force the input image to be misclassified into a specific target class \(y_t\). The following equations are used to create adversarial perturbations for misclassification into the target class.
$$\begin{aligned}&{\hat{x}}_n^0 =x_n \nonumber \\&{\hat{x}}_n^{i+1}=\text {Clip}_{\epsilon }\{{\hat{x}}_n^{i}- \alpha \text {Sign}(\bigtriangledown _{{\hat{x}}_n^i}{\mathcal {L}}( {\hat{x}}_n^i,y_t))\} \end{aligned}$$
(2)
In targeted attacks we maximize the loss with respect to the ground truth class \(y_n\) and minimize the loss with respect to the target class \(y_t\).
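A sketch of the targeted update in Eq. (2) under the same assumptions follows; the only change with respect to the untargeted sketch above is the sign of the step, which now descends the loss towards the assumed target labels `y_target`.

```python
import torch
import torch.nn.functional as F

def ifgsm_targeted(model, x, y_target, eps=0.06, alpha=0.01, steps=10):
    """Targeted I-FGSM, Eq. (2): descend the loss towards the target class y_t."""
    x_adv = x.clone().detach()                          # x_hat^0 = x_n
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y_target)  # L(x_hat^i, y_t)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv - alpha * grad.sign()         # minus sign: minimize the target-class loss
            x_adv = torch.max(torch.min(x_adv, x + eps), x - eps)
            x_adv = x_adv.clamp(0.0, 1.0)
    return x_adv.detach()
```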
Adversarial robustness
Adversarial training Adversarial training [37] is one of the state-of-the-art methods for robustness against adversarial perturbations. In adversarial training, the model \(f^r({\hat{x}}_n)\) finds worst-case adversarial examples and trains the network on them in addition to the clean images, which improves performance against adversarial perturbations. The following objective function is minimized in adversarial training:
$$\begin{aligned} {\mathcal {L}}_{adv}(x_n,y_n)&= \gamma {\mathcal {L}}(x_n,y_n) + (1-\gamma ){\mathcal {L}}({\hat{x}}_n,y) \end{aligned}$$
(3)
where \({\mathcal {L}}(x_n,y_n)\) is the classification loss for clean images, \({\mathcal {L}}({\hat{x}}_n,y)\) is the loss for adversarial images and \(\gamma \) balances the two loss terms.
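A minimal training-step sketch of Eq. (3) is shown below, reusing the untargeted I-FGSM sketch above as the inner attack (our experiments use PGD, see the implementation details); `gamma=0.5` and the optimizer handle are assumptions for illustration.

```python
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, x, y, gamma=0.5):
    """One training step minimizing the combined objective of Eq. (3)."""
    x_adv = ifgsm_untargeted(model, x, y)            # worst-case examples for the current model
    optimizer.zero_grad()
    loss_clean = F.cross_entropy(model(x), y)        # L(x_n, y_n) on clean images
    loss_adv = F.cross_entropy(model(x_adv), y)      # loss on the adversarial images
    loss = gamma * loss_clean + (1.0 - gamma) * loss_adv
    loss.backward()
    optimizer.step()
    return loss.item()
```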
Attribute prediction
We use the class attributes available with the dataset to predict per-image attributes and provide explanations for classification. The model is shown in Fig. 3. At training time our network learns to map image features closer to their ground truth class attributes and farther from the attributes of other classes in the embedding space. At test time, when clean image features are projected into the learned embedding space, the image gets mapped closer to the ground truth class attributes, e.g., “Crested head” and “Red beak” associated with the ground truth class “Cardinal”, see Fig. 3. However, an adversarially perturbed image gets mapped closer to the wrong class attributes, e.g., “Plain head” and “Black beak” belonging to the counterclass “Pine Grosbeak”, Fig. 3.
Given the n-th input image features \(\theta (x_n)\in {\mathcal {X}}\) and output class attributes \(\phi (y_n)\in {\mathcal {Y}}\) from the sample set \({\mathcal {S}}=\{(\theta (x_n),\phi (y_n)), n=1,\ldots ,N\}\), we employ SJE [2] to predict the attributes in an image. SJE learns a mapping \(\mathbb {f}:{\mathcal {X}}\longrightarrow {\mathcal {Y}}\) by minimizing the empirical risk of the form \(\frac{1}{N}\sum _{n=1}^N \Delta (y_n,\mathbb {f}(x_n))\), where \(\Delta : {\mathcal {Y}} \times {\mathcal {Y}} \rightarrow {\mathbb {R}} \) estimates the cost of predicting \(\mathbb {f}(x_n)\) when the ground truth label is \(y_n\).
A compatibility function \(F:{\mathcal {X}}\times {\mathcal {Y}}\rightarrow {\mathbb {R}}\) is defined between input \({\mathcal {X}}\) and output \({\mathcal {Y}}\) space:
$$\begin{aligned} F(x_n,y_n;W)=\theta (x_n)^TW\phi (y_n) \end{aligned}$$
(4)
The pairwise ranking loss \({\mathbb {L}}(x_n,y_n,y)\) is used to learn the parameters \(W\):
$$\begin{aligned} \Delta (y_n,y)+\theta (x_n)^TW\phi (y_n)-\theta (x_n)^TW\phi (y) \end{aligned}$$
(5)
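The following sketch illustrates Eqs. (4) and (5) with NumPy; `class_attributes` is assumed to hold one attribute vector \(\phi (y)\) per class, and the constant margin `delta` stands in for \(\Delta (y_n,y)\).

```python
import numpy as np

def compatibility(theta_x, W, phi_y):
    """Bilinear compatibility F(x, y; W) = theta(x)^T W phi(y), Eq. (4)."""
    return theta_x @ W @ phi_y

def sje_ranking_loss(theta_x, y_true, W, class_attributes, delta=1.0):
    """Pairwise ranking objective of Eq. (5) for one training sample."""
    true_score = compatibility(theta_x, W, class_attributes[y_true])
    violations = [
        delta + compatibility(theta_x, W, phi_y) - true_score  # Delta(y_n, y) + F(x, y) - F(x, y_n)
        for y, phi_y in enumerate(class_attributes)
        if y != y_true
    ]
    # SJE penalizes the most violating class (structured hinge loss).
    return max(0.0, max(violations))
```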
At test time, attributes are predicted for clean images by projecting the image features onto the learned embedding space:
$$\begin{aligned} {\mathbf {A}}_{n,y_n}=\theta (x_n)W \end{aligned}$$
(6)
and for adversarial images by:
$$\begin{aligned} {\hat{\mathbf {A}}}_{n,y}=\theta ({\hat{x}}_n)W \end{aligned}$$
(7)
The image is assigned the label of the nearest output class attribute vector \(\phi (y_n)\).
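A sketch of the test-time prediction in Eqs. (6) and (7) is given below; assigning the label via the highest compatibility score is one simple reading of the nearest-attribute rule and is an assumption of this sketch.

```python
import numpy as np

def predict_attributes_and_label(theta_x, W, class_attributes):
    """Project image features into attribute space (Eqs. 6-7) and classify."""
    predicted_attr = theta_x @ W                     # A = theta(x) W
    scores = class_attributes @ predicted_attr       # compatibility with every class
    return predicted_attr, int(np.argmax(scores))    # label of the best-matching class
```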
Attribute grounding
Thereafter, we ground the predicted attributes on the images for better visual explanations using a pre-trained Faster-RCNN, as in [4]. The pre-trained Faster-RCNN model \({\mathcal {F}}(x_n)\) predicts bounding boxes \(b^j\). For each bounding box j in each image \(x_n\), it predicts a class \({\mathbb {Y}}_{x_n}^j\) and an attribute \({\mathbb {A}}_{x_n}^j\) [3]:
$$\begin{aligned} b_{x_n}^j,{\mathbb {A}}_{x_n}^j,{\mathbb {Y}}_{x_n}^j={\mathcal {F}}(x_n) \end{aligned}$$
(8)
where j is the bounding box index.
Attribute selection for grounding Not all the attributes predicted for an image can be visualized due to visual constraints. Therefore, we select the most discriminative attributes for grounding on the images. Attributes are selected based on the criterion that they change the most when the image is perturbed with adversarial noise. For clean images we use:
$$\begin{aligned} q=\underset{i}{\mathrm {argmax}}({\mathbf {A}}_{n,y_n}^i-\phi (y^i)) \end{aligned}$$
(9)
and for adversarial images we use:
$$\begin{aligned} p=\underset{i}{\mathrm {argmax}}({\hat{\mathbf {A}}}_{n,y}^i-\phi (y_n^i)) \end{aligned}$$
(10)
where i is the attribute index, \({\mathbf {A}}_{n,y_n}^i\) and \({\hat{\mathbf {A}}}_{n,y}^i\) are the attributes predicted by SJE for clean and adversarial images, respectively, and \(\phi (y^i)\) and \(\phi (y_n^i)\) denote the counterclass and ground truth class attributes, respectively. q and p are the indexes of the most discriminative attributes selected based on our criterion.
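The selection criterion of Eqs. (9) and (10) can be sketched as follows; the default `k=1` corresponds to the argmax in the equations, while larger `k` (several attributes per image) is an illustrative generalization.

```python
import numpy as np

def select_discriminative_attributes(pred_clean, pred_adv, phi_true, phi_counter, k=1):
    """Select the attribute indices q and p of Eqs. (9) and (10)."""
    q = np.argsort(pred_clean - phi_counter)[::-1][:k]  # Eq. (9): clean prediction vs. counterclass
    p = np.argsort(pred_adv - phi_true)[::-1][:k]       # Eq. (10): adversarial prediction vs. ground truth
    return q, p
```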
After selecting the most discriminative attributes predicted by SJE using Eqs. 9 and 10, we search for the selected attributes \({\mathbf {A}}_{x_n,y_n}^q, {\mathbf {A}}_{{\hat{x}}_n,y}^p\) among the attributes predicted by Faster-RCNN for each bounding box, \({\mathbb {A}}_{x_n}^j, {\mathbb {A}}_{{\hat{x}}_n}^j\). When the attributes predicted by SJE and Faster-RCNN match, that is, \({\mathbf {A}}_{x_n,y_n}^q = {\mathbb {A}}_{x_n}^j\) and \({\mathbf {A}}_{{\hat{x}}_n,y}^p = {\mathbb {A}}_{{\hat{x}}_n}^j\), we ground them on their respective clean and adversarial images. As shown in Fig. 3, the attributes “Crested head” and “Red beak” are grounded on the image, while “Plain head” and “Black beak” could not be grounded because there is no visual evidence present in the image for these attributes.
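A sketch of this matching step is given below; attribute names and the box format are illustrative, and only attributes with a matching Faster-RCNN box prediction are grounded.

```python
def ground_attributes(selected_attr_names, box_attr_names, boxes):
    """Ground a selected attribute only if some Faster-RCNN box predicts it."""
    grounded = []
    for name in selected_attr_names:
        for box, box_attr in zip(boxes, box_attr_names):
            if box_attr == name:                 # SJE attribute matches a box attribute
                grounded.append((name, box))     # draw this box for the attribute
                break                            # one box per attribute suffices here
    return grounded                              # unmatched attributes remain ungrounded
```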
Example-based explanations
Besides providing attribute-based explanations, we propose to provide counterexample-based explanations, as shown in Fig. 3. We compare example-based explanations in which examples are selected randomly from the counterclass with explanations in which examples are selected based on attributes (Fig. 13).
Example selection through attributes The procedure for example-based explanations using attributes is detailed in Algorithm 1 and the results are shown in Figs. 12 and 13. Given clean images classified correctly, adversarial images misclassified, and their predicted attributes, we search the adversarial class for images whose attributes are most similar to the attributes of the adversarial image and select them as counterexamples, i.e., a “Pine Grosbeak” image with the attributes “Plain head” and “Black beak” is selected as a counterexample (Fig. 3).
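A minimal sketch of this attribute-based selection is shown below; the Euclidean similarity and the number of returned counterexamples `k` are assumptions for illustration, and Algorithm 1 gives the exact procedure.

```python
import numpy as np

def select_counterexamples(pred_adv_attr, counterclass_images, counterclass_attrs, k=3):
    """Pick counterclass images whose attributes best match the adversarial prediction."""
    dists = np.linalg.norm(counterclass_attrs - pred_adv_attr, axis=1)  # attribute similarity
    nearest = np.argsort(dists)[:k]                                     # k most similar images
    return [counterclass_images[i] for i in nearest]
```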
Attribute analysis method
Finally, in this section we introduce our techniques for quantitative analysis of the predicted attributes.
Predicted attribute analysis: standard network In order to analyze attributes in the embedding space, we consider the images which are classified correctly without perturbations and misclassified with perturbations. Our aim is to analyze the change in attributes in the embedding space to verify that the attributes change when the class changes.
We contrast the Euclidean distance between predicted attributes of clean and adversarial samples:
$$\begin{aligned} d_1 = d\{{\mathbf {A}}_{n,y_n},{\hat{\mathbf {A}}}_{n,y}\} =\parallel {\mathbf {A}}_{n,y_n}-{\hat{\mathbf {A}}}_{n,y} \parallel _2 \end{aligned}$$
(11)
with the Euclidean distance between the ground truth attribute vectors of the correct and adversarial classes:
$$\begin{aligned} d_2 = d\{\phi (y_n),\phi (y)\}=\parallel \phi (y_n)-\phi (y) \parallel _2 \end{aligned}$$
(12)
where \({\mathbf {A}}_{n,y_n}\) denotes the predicted attributes for the clean images classified correctly, and \({\hat{\mathbf {A}}}_{n,y}\) denotes the predicted attributes for the adversarial images misclassified by the standard network. The ground truth class attributes are referred to as \(\phi (y_n)\) and the adversarial class attributes are referred to as \(\phi (y)\).
Predicted attribute analysis: robust network We compare the distances between the predicted attributes of only those adversarial images that are classified correctly with the help of an adversarially robust network, \({\hat{\mathbf {A}}}^{{r}}_{n,y_n}\), and classified incorrectly with a standard network, \({\hat{\mathbf {A}}}_{n,y}\):
$$\begin{aligned} d_1 = d\{{\hat{\mathbf {A}}}^{{r}}_{n,y_n},{\hat{\mathbf {A}}}_{n,y}\}=\parallel {\hat{\mathbf {A}}}^{{r}}_{n,y_n}-{\hat{\mathbf {A}}}_{n,y} \parallel _2 \end{aligned}$$
(13)
with the distances between the ground truth class attributes \(\phi (y_n)\) and ground truth adversarial class attributes \(\phi (y)\):
$$\begin{aligned} d_2 = d\{\phi (y_n),\phi (y)\}=\parallel \phi (y_n)-\phi (y) \parallel _2 \end{aligned}$$
(14)
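Both analyses reduce to the same pair of distances, sketched below; for the standard network the inputs are the clean and adversarial predictions (Eqs. 11 and 12), whereas for the robust network the first argument is the robust model's prediction on the adversarial image (Eqs. 13 and 14).

```python
import numpy as np

def attribute_shift_analysis(pred_a, pred_b, phi_true, phi_adv):
    """Compute d1 (Eqs. 11/13) and d2 (Eqs. 12/14) for one image."""
    d1 = np.linalg.norm(pred_a - pred_b)      # shift between the two attribute predictions
    d2 = np.linalg.norm(phi_true - phi_adv)   # shift between the class attribute vectors
    return d1, d2
```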
Implementation details
Image features and adversarial examples We extract image features and generate adversarial images using the fine-tuned Resnet-152. Adversarial attacks are performed using the basic iterative method with \(\epsilon \) values of 0.01, 0.06 and 0.12. The \(l_\infty \) norm is used as a similarity measure between the clean input and the generated adversarial example. For untargeted attacks, the algorithm perturbs the images such that they get misclassified into any alternative counterclass. For targeted attacks, we direct the adversarial examples to be misclassified into randomly selected classes.
Adversarial training For adversarial training, we repeatedly compute adversarial examples while training the fine-tuned Resnet-152 and minimize the loss on these examples. We generate the adversarial examples using the projected gradient descent method, a multi-step variant of FGSM, with \(\epsilon \) values of 0.01, 0.06 and 0.12, as in [27].
Attribute prediction and grounding At test time, the image features are projected onto the learned attribute space and per-image attributes are predicted. The image is assigned the label of the nearest ground truth attribute vector. Since we do not have ground truth part bounding boxes for any of the attribute datasets, the predicted attributes are grounded using a Faster-RCNN pre-trained on the Visual Genome Dataset [22].