1 Introduction

Despite the remarkable success of deep neural networks (DNNs) in various computer vision tasks such as image classification, object detection, and semantic segmentation, DNNs have been found to be vulnerable to adversarial attacks [25]. An adversarial sample is an image carefully crafted to fool the target network by adding small perturbations to the original clean image [2, 13, 23]. This has raised serious concerns about the security of deep learning systems, as they are extensively used in applications that require high levels of robustness and security, such as face recognition for identification and authentication, autonomous driving, safety inspection, and surveillance. Therefore, adversarial attacks on and defenses of deep neural networks have recently emerged as an important research topic in deep learning.

Recently, a number of dense adversarial attack methods have been developed, such as FGSM [9], BIM [15], and PGD [17]. The main objective of these methods is to maximize the attack success rate under an l2 or \(l_{\infty }\) norm noise constraint [17]. However, the resulting attacked images usually have low perceptual quality.

Sparse attack methods based on the l0 norm try to modify as few pixels as possible to attack the image, without limiting the noise magnitude. These methods usually suffer from low efficiency in achieving a high attack success rate, as searching the image space for candidate pixels is computationally intensive [4, 7, 12]. To alleviate this problem, JSMA [21] and GreedyFool [7] predict a saliency or distortion map with a trained network to guide the search process. However, these maps are not directly optimized for image quality and attack success rate. Moreover, attacking under the l0 norm constraint alone does not guarantee a high visual quality of the attacked images.

The problem addressed in this work is to design an adversarial attack method that simultaneously optimizes the attack success rate, the visual quality of the attacked images, and the time efficiency or attack complexity, which can be measured by the number of model inferences. To this end, we propose to take advantage of the high attack efficiency and success rate of l2 or \(l_{\infty }\) constrained attack methods and reshape the perturbation noise to optimize the image quality while maintaining the attack success rate. However, it has been well recognized that the lp norm of image noise does not accurately correlate with human perception of image quality [23, 32]. In the literature, the structural similarity index (SSIM) has been demonstrated to be an effective metric for perceptual image quality [20, 23, 28, 31], and perceptual color distance is also a good metric for human perception [16, 32]. Figure 1 shows two adversarial examples which have almost the same Peak Signal-to-Noise Ratio (PSNR) between the original images (left) and the attacked images (right). However, we can clearly see that the adversarial noise in (a) is much more visible and annoying than that in (b), which corresponds to the significant difference between the SSIM values. We recognize that the adversarial noise has different levels of visibility in different regions of the image. Specifically, with the same amount of noise, the SSIM values of different image regions are quite different; for example, the SSIM values of structural regions are often higher than those of smooth regions. This motivates us to argue that the perceptual quality of adversarial images can be improved by modulating the adversarial noise with the perceptual weights, or noise sensitivity levels, of different image regions.

Fig. 1

Comparison between SSIM and PSNR. The first column shows the original images. In the second column, we can see that the upper image has much more distortion than the lower one when compared with the originals, yet the two PSNR scores are very close. The SSIM score better reflects the visual distortions in the images: a higher SSIM score means fewer distortions

Based on this, we propose a perceptually optimized noise reshaping (PONS) scheme that reshapes the adversarial noise generated by a base attacker and optimizes the visual quality of the attacked images while achieving the same attack success rate. Specifically, in the first stage, we use a convolutional SSIM model to calculate the SSIM value between a clean image and its attacked version. Guided by this SSIM visual quality module, the proposed method learns a perceptual attention network that predicts a perceptual sensitivity map to modulate the adversarial noise in the input image, aiming to maximize the visual quality while achieving the same attack success rate. The perceptual attention network and the attack method are jointly trained.

Within the context of image classification, we recognize that the decision of the network is binary, with the network inference score being compared to a decision threshold. This implies that some regions of the adversarial noise can be removed, or pruned, to further improve the perceptual quality while ensuring that the classification score remains above the threshold, i.e., the image stays in the same adversarial state. To this end, in the second stage, we propose a fast search algorithm that performs iterative block-wise pruning of the adversarial noise without affecting the attack success rate.

We have tested our method on the mini-ImageNet dataset against different defense methods. Our method is able to significantly improve the image visual quality over the base attack method without significant loss of attack success rate. Compared with state-of-the-art attack methods, our method achieves better attack performance in terms of image quality, attack success rate, and efficiency.

The major contributions of this work can be summarized as follows:

  (1) We incorporate perceptual quality optimization into the adversarial attack design and propose a two-stage attack method that reshapes the adversarial noise generated by an initial attack while achieving the same attack success rate.

  (2) We develop a perceptual attention network that learns to predict a perceptual attention map to modulate the adversarial noise so that the SSIM visual quality of the image is optimized without significant loss of the attack success rate.

  (3) We propose a fast binary search algorithm to perform iterative block-wise pruning of the adversarial noise to further improve the perceptual image quality, which does not affect the adversarial state of the image.

  (4) Our experimental results on the mini-ImageNet dataset with different defense schemes show that the proposed method significantly improves the attack performance over the state-of-the-art.

2 Related work

2.1 White-box adversarial attack

This section reviews related white-box adversarial attack methods, as this work belongs to this category. In the white-box setting, the parameters of the target model are exposed to the attack process. Szegedy et al. pointed out an intriguing weakness of deep neural networks within the context of image classification and developed a box-constrained L-BFGS method to generate adversarial examples [25]. To overcome the computational efficiency problem, Goodfellow et al. proposed the fast gradient sign method (FGSM), which performs a single gradient step [9]. Kurakin et al. extended this method to an iterative version and demonstrated adversarial examples in physical-world scenarios by feeding adversarial images captured by a cell-phone camera to an ImageNet Inception classifier [15]. Dong et al. proposed a broad class of momentum-based iterative algorithms to generate more transferable adversarial examples and applied them to an ensemble of models [8]. The PGD method developed in [17] also works in an iterative manner. To improve the transferability of adversarial examples, Xie et al. applied random transformations to the input images at each iteration of the attack process to create more diverse input patterns [30]. To circumvent the "obfuscated gradients" problem introduced by defense methods, Athalye et al. proposed the Backward Pass Differentiable Approximation (BPDA), which provides an approximate gradient when the true gradient is unavailable [1]. Carlini and Wagner proposed a set of three adversarial attacks in [3] and showed that defensive distillation does not significantly increase the robustness of neural networks. Zhao et al. exploited human color perception and improved C&W [3] by minimizing the perturbation size with respect to perceptual color distance [32]. Croce et al. extended the PGD attack to the l0 norm to generate highly sparse adversarial examples [4].

Sparse and minimal-perturbation attack methods attempt to make the perturbations imperceptible by constraining the magnitude of the adversarial noise using an lp norm, such as the l0, l2, or \(l_{\infty }\) norm [22]. To search for a minimal adversarial perturbation for a given image, Moosavi-Dezfooli et al. proposed DeepFool [19], which generates adversarial examples with smaller perturbations than FGSM [9] while achieving similar attack success rates. An extreme case of minimizing the image perturbation is the one-pixel attack, in which only one pixel of the image is changed to fool the classifier; Su et al. achieved an attack success rate of 70.97% on the tested images by changing just one pixel of the input image [24]. SparseFool [18] converts the l0-constrained problem into an l1-constrained problem and exploits the low mean curvature of the decision boundary to compute sparse adversarial perturbations. GreedyFool [7] selects a few of the most effective candidate pixels to modify, using a predicted distortion map as guidance. TSAA [12] translates a benign image into an adversarial image with a generator network trained to learn the mapping between natural images and sparse adversarial images.

2.2 Defense against adversarial attack

In this work, we evaluate attack methods against different defense schemes; here we briefly review the related defense works. Recently, many methods have been developed for defending deep neural networks against adversarial attacks. Adversarial training is a common approach to increase network robustness by adding adversarial examples to the training data [14, 26, 27]. Tramer et al. proposed an ensemble adversarial training method that augments the training data with perturbations transferred from other models [27]. Transforming the input is another way to defend against adversarial attacks, for example through bit-depth reduction, JPEG compression, and total variance minimization [10]. The network structure can also be modified to improve its robustness to adversarial attacks. Dhillon et al. pruned a random subset of activations according to their magnitude to enhance network robustness [6]. Xie et al. proposed a feature denoising method using non-local means or other filters; combined with adversarial training, the adversarial robustness can be substantially improved [29].

3 Method

In this section, we present the proposed perceptually optimized noise reshaping (PONS) method for adversarial attacks.

3.1 Problem formulation

Let X be a natural image and yT its ground-truth label, also referred to as the target label. A classifier Φ𝜃(X) = y takes an image X as input and predicts its label \(y \in \mathcal {Y}\), where \(\mathcal {Y}\) is the output label space. The goal of the adversarial attack is to generate adversarial noise Z, constrained by an lp norm, such that the attacked image Xa = X + Z is misclassified by the target model. In this work, the image quality SSIM(X,Xa) is maximized during the attack process. Mathematically, this is written as

$$ \begin{array}{ll} \underset{Z}{\arg\max} & \mathrm{SSIM}(X, X+Z)\\ \text{s.t.} & {\Phi}_{\theta}(X+Z) \neq {\Phi}_{\theta}(X),\\ & \|Z\|_{p} \leq \epsilon, \end{array} $$
(1)

where 𝜖 is the perturbation budget and p is the order of the norm, which can be 0, 1, 2, or \(\infty \).

3.2 Method overview

In this work, we propose a novel two-stage adversarial attack framework that takes advantage of the high attack efficiency and success rate of a base attack method and reshapes the perturbation noise to optimize the visual quality of the attacked images while maintaining the attack success rate. As illustrated in Fig. 2, the proposed PONS method uses SSIM-based perceptual quality and sensitivity analysis to drive the adversarial noise generation process. In the first stage, we use a convolutional model to calculate the SSIM index between the attacked image and the input, and a perceptual attention network to predict the perceptual sensitivity map, which is used to modulate the adversarial noise generated by the base attack method. The attack network and the perceptual attention map prediction network are learned end-to-end to optimize the SSIM image quality while achieving successful adversarial attacks. In the second stage, we propose a fast binary search method that performs block-wise pruning of the adversarial noise based on the perceptual sensitivity map to further improve the image quality.

Fig. 2

Diagram of the proposed PONS method. The lower figure shows the two-stage attack process. In the first stage, the perceptual attention network (PAN) predicts a perceptual attention mask from features extracted from the images, which modulates the adversarial noise generated by the base attack method. In the second stage, to further reduce the perturbation, block-wise perturbation pruning is applied to identify image blocks in which the perturbation can be turned off. In the upper figure, the perceptual attention network is trained to optimize the image quality (Lq) and attack success rate (Lc), where the SSIM model is used to calculate the SSIM score between the clean image and the adversarial image

3.3 SSIM prediction network and perceptual sensitivity map

SSIM has been extensively used as a metric to measure the perceptual quality of images. Here we give a brief definition; for more details, please refer to [28]. Let A and B be the two images being compared. A window moves pixel by pixel from the top-left corner to the bottom-right corner of the image. At each step, the local statistic θ(Aj,Bj) is calculated within local window j as follows [28]:

$$ \theta(A_{j}, B_{j}) = \frac{(2\cdot m_{A_{j}} m_{B_{j}}+ C_{1})\cdot (2\cdot \sigma_{A_{j}B_{j}} + C_{2})}{(m_{A_{j}}^{2}+m_{B_{j}}^{2}+C_{1})(\sigma_{A_{j}}^{2}+\sigma_{B_{j}}^{2}+C_{2})}, $$
(2)

where \(m_{A_{j}}, m_{B_{j}}, \sigma _{A_{j}}, \sigma _{B_{j}}\), and \(\sigma _{A_{j}B_{j}}\) denote the average intensities of image patches Aj and Bj, their standard deviations, and the covariance between Aj and Bj, respectively. C1 and C2 are two small constants introduced to avoid numerical instability. The SSIM index between A and B is defined by [28]

$$ {\Theta}(A,B) = \frac{{\sum}_{j=1}^{N_{s}} W(A_{j}, B_{j}) \theta(A_{j},B_{j})}{{\sum}_{j=1}^{N_{s}} W(A_{j}, B_{j})}, $$
(3)

where Ns is the number of local windows in the image and W(Aj,Bj) are the weights applied to window j [28]. It should be noted that the above computation is highly nonlinear. In this work, we use a PyTorch model (https://github.com/aserdega/ssim-pytorch) to approximate the SSIM function Θ(⋅,⋅).
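To make the convolutional formulation concrete, the following PyTorch sketch computes the per-window statistic of Eq. (2) at every position using depth-wise convolutions and averages it into the index of Eq. (3). It is a minimal sketch under stated assumptions (a uniform 11 × 11 averaging window, equal window weights, images scaled to [0, 1], and the common constants C1 = 0.01² and C2 = 0.03²) rather than the exact model from the linked repository.

```python
import torch
import torch.nn.functional as F

def ssim_map(a: torch.Tensor, b: torch.Tensor, win: int = 11,
             c1: float = 0.01 ** 2, c2: float = 0.03 ** 2) -> torch.Tensor:
    """Per-window SSIM statistic theta(A_j, B_j) of Eq. (2) for images [B, C, H, W] in [0, 1]."""
    channels = a.shape[1]
    # Uniform averaging window applied per channel (depth-wise convolution).
    kernel = torch.ones(channels, 1, win, win, device=a.device) / (win * win)
    pad = win // 2
    mu_a = F.conv2d(a, kernel, padding=pad, groups=channels)
    mu_b = F.conv2d(b, kernel, padding=pad, groups=channels)
    var_a = F.conv2d(a * a, kernel, padding=pad, groups=channels) - mu_a ** 2
    var_b = F.conv2d(b * b, kernel, padding=pad, groups=channels) - mu_b ** 2
    cov = F.conv2d(a * b, kernel, padding=pad, groups=channels) - mu_a * mu_b
    num = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    den = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return num / den

def ssim_index(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """SSIM index of Eq. (3) with equal window weights: the mean of the per-window map."""
    return ssim_map(a, b).mean()
```

Because every operation is a convolution or element-wise arithmetic, the whole computation is differentiable and can be back-propagated through during the attack.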

We introduce the perceptual sensitivity map to measure the impact of the adversarial noise on the image quality at different locations. For images A and B of size [W,H,C], we propose to use the feature map F(m,n,c), 1 ≤ m ≤ W, 1 ≤ n ≤ H, 1 ≤ c ≤ C, from the last layer of the SSIM calculation network. This feature map represents the impact of the difference between the attacked image and the original clean image on the overall SSIM quality. The perceptual sensitivity map σ(m,n) is defined as

$$ \sigma(m, n) = 1 - \frac{1}{C}\sum\limits_{c=1}^{C} |F(m, n, c)|. $$
(4)
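As a hedged illustration of Eq. (4), the snippet below derives the sensitivity map from a feature map F of the SSIM model; here we assume F is the per-pixel, per-channel SSIM map (e.g., the output of the `ssim_map` sketch in Section 3.3), which is an assumption rather than the exact layer used in the paper.

```python
import torch

def perceptual_sensitivity(f_map: torch.Tensor) -> torch.Tensor:
    """Eq. (4): sigma(m, n) = 1 - (1 / C) * sum_c |F(m, n, c)|.

    f_map: [B, C, H, W] feature map from the last layer of the SSIM model
    (assumed here to be the per-pixel SSIM map). Returns a [B, H, W] map in which
    larger values indicate regions more sensitive to adversarial noise.
    """
    return 1.0 - f_map.abs().mean(dim=1)
```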

Figure 3 shows an example of the perceptual sensitivity map. The smooth areas appear whiter than the texture-rich regions, which means that smooth areas are more sensitive to adversarial perturbations.

Fig. 3

Example of a perceptual sensitivity map. From left to right: the original image, the adversarial image, and the corresponding perceptual sensitivity map. Smooth areas are relatively whiter than texture-rich regions, indicating higher sensitivity to noise

3.4 SSIM-optimized adversarial attack

The core idea of our method is to reshape the perturbation noise of adversarial examples that have been successfully attacked by a base attack method, which usually has a high attack success rate. This work considers PGD and its variant BPDA as the base attack methods. PGD is an iterative version of the original FGSM method, which adds a perturbation along the gradient direction of the loss ∇XL(X,yT;𝜃) to the input X to generate an adversarial example as follows [9]:

$$ X^{a} = X + \epsilon \cdot sign(\nabla_{X} L(X, y^{T}; \theta)), $$
(5)

where L(⋅) is usually defined as the cross-entropy loss, and 𝜖 controls the \(l_{\infty }\)-norm of the difference between X and Xa. The PGD method iteratively updates the image as [17]

$$ X^{a}_{t+1} = {\Gamma}_{\epsilon}[X; \ {X^{a}_{t}} + \alpha \cdot sign(\nabla_{X} L({X^{a}_{t}}, y; \theta))], $$
(6)

where t is the iteration index, α is the step size, and Γ𝜖[X;Xa] is a clipping function that ensures the largest difference between X and Xa does not exceed 𝜖.
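For reference, a compact PyTorch sketch of the \(l_{\infty }\) PGD update of Eq. (6) is given below; the default parameters match those reported for PGD in Section 4.2.1 (𝜖 = 0.05, α = 0.01, 10 iterations), and the function name and interface are our own illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=0.05, alpha=0.01, num_iter=10):
    """Untargeted l_inf PGD, Eq. (6): signed-gradient steps followed by clipping to the eps-ball."""
    x_adv = x.clone().detach()
    for _ in range(num_iter):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()
            x_adv = x + (x_adv - x).clamp(-eps, eps)   # Gamma_eps[X; X^a]: stay within the eps-ball
            x_adv = x_adv.clamp(0.0, 1.0)              # keep a valid pixel range
    return x_adv.detach()
```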

As illustrated in Fig. 3, different image regions have different levels of sensitivity to the adversarial noise. Motivated by this observation, we propose to modulate the adversarial noise by a learned perceptual weight mask \({\mathscr{M}} = [{\mathscr{M}}(m, n)]\), as illustrated in Fig. 2. The perceptually modulated PGD attack is given by

$$ X^{a}_{t+1} = {\Gamma}_{\epsilon}[X+Z_{t+1} \cdot \mathcal{M}], $$
(7)

where \(Z_{t+1}\) is the final perturbation generated by the base attack method at iteration t + 1. The perceptual mask \({\mathscr{M}}\) reshapes the adversarial noise such that the noise magnitude is reduced in those image regions whose visual quality is sensitive to noise.
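A minimal sketch of the modulation step in Eq. (7) is shown below, assuming the mask and the base perturbation are already available; the helper name and the placement of the pixel-range clipping are our assumptions.

```python
import torch

def modulate_noise(x, z, mask, eps):
    """Eq. (7): X^a = Gamma_eps[X + Z * M] -- scale the base noise Z by the mask M, then clip.

    x: clean image, z: perturbation from the base attacker, mask: perceptual mask M in [0, 1]
    broadcastable to z, eps: l_inf perturbation budget.
    """
    z_mod = (z * mask).clamp(-eps, eps)   # keep |X^a - X| <= eps
    return (x + z_mod).clamp(0.0, 1.0)    # keep a valid pixel range
```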

In our proposed method, this perceptual mask \({\mathscr{M}}\) is predicted by the perceptual attention network, which is learned jointly with the target network, the attack network, and the SSIM network. The output mask has the same size as the input image. The proposed perceptual mask prediction network is based on an encoder-decoder structure with a Resnet-18 backbone [11]. The training process maximizes the SSIM quality while achieving successful attacks on the original images. Specifically, let yA be the attack target, i.e., the incorrect label that the attack forces the classifier Φ𝜃 to produce, and let ya be the current classifier softmax output. We use the following cross-entropy loss \({\mathscr{L}}_{C}\) between the attack target yA and the actual classifier output ya:

$$ \mathcal{L}_{C}(y^{A}, y^{a}) = -\sum\limits_{i=1}^{C} {y_{i}^{A}}\cdot \log_{2} {y^{a}_{i}}. $$
(8)

The following loss is then used to train the perceptual attention network

$$ \mathcal{L}_{M} = \mathcal{L}_{C}(y^{A}, y^{a}) - \lambda \cdot {\Theta}(X^{a}_{t+1}, X), $$
(9)

where λ is a weight factor and \({\Theta }(X^{a}_{t+1}, X)\) measures the SSIM quality between the current attacked image \(X^{a}_{t+1}\) and the original image X.
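The joint objective of Eq. (9) can be written compactly as in the sketch below, which reuses the `ssim_index` and `modulate_noise` sketches above; the network interfaces (`pan`, `classifier`) and the default λ = 0.001 (the value used in Section 4.2.1) are illustrative assumptions, not the exact training code.

```python
import torch
import torch.nn.functional as F

def pan_loss(classifier, pan, x, x_adv_base, y_attack, eps, lam=0.001):
    """Eq. (9): L_M = L_C(y^A, y^a) - lambda * Theta(X^a, X).

    pan: perceptual attention network predicting the mask M from the image;
    x_adv_base: image attacked by the base method; y_attack: attack target y^A.
    Reuses ssim_index() and modulate_noise() from the sketches above.
    """
    mask = torch.sigmoid(pan(x))             # M in [0, 1], same spatial size as x
    z = x_adv_base - x                       # adversarial noise Z from the base attack
    x_adv = modulate_noise(x, z, mask, eps)  # Eq. (7)
    l_c = F.cross_entropy(classifier(x_adv), y_attack)   # L_C of Eq. (8)
    l_q = ssim_index(x_adv, x)                            # Theta(X^a, X) of Eq. (3)
    return l_c - lam * l_q
```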

3.5 Block-wise binary pruning of adversarial noise

In the previous section, we learned a network to predict a perceptual attention mask \({\mathscr{M}}(m, n)\) that modulates the adversarial noise. Note that this learned mask is continuous-valued and aims to maximize the overall SSIM quality of the attacked image. This prediction and quality optimization is a global process, and all entries of the mask \({\mathscr{M}}\) are jointly predicted in one network inference. From our experiments, we observe that the value of each individual entry \({\mathscr{M}}(m, n)\) can be fine-tuned to further improve the perceptual quality of the attacked image. To this end, we propose a fast and efficient method, called block-wise binary pruning, to perform local adjustment of the perceptual mask. Specifically, we partition the mask \({\mathscr{M}}\) into equal-sized non-overlapping blocks and, for each block, make a binary decision: the adversarial noise within the block is either kept (indicated by 1) or completely removed, i.e., pruned (indicated by 0). Conceptually, we want to turn off the adversarial noise in those image blocks with high sensitivity to noise, so as to maximize the perceptual quality while still attacking the image successfully.

In (4), we derived the perceptual sensitivity map from the SSIM calculation network. We propose to use this map [σ(m,n)]W×H to guide the block-wise binary pruning of the adversarial noise. Each entry σ(m,n) represents the perceptual sensitivity of an image pixel. The central task is to select a subset of image blocks whose adversarial noise is removed, i.e., set to zero, so that the SSIM perceptual quality of the attacked image is maximized without changing its adversarial state. Clearly, it is computationally intensive to search for this subset of image blocks. More importantly, at each search step we need to run the classification network to check whether the resulting image is still successfully attacked, so the number of search steps has to be very limited.

To address this issue, we propose a fast binary search algorithm to obtain a sub-optimal solution. Specifically, we first sort the block-wise perceptual sensitivity values in ascending order, denoted by σ[k], \(1\leq k\leq \frac {W}{w}\times \frac {H}{h}\), where σ[k] is the mean value of σ(m,n) within the k-th image block of size [w,h]. We aim to find a decision threshold σP such that all adversarial noise in image blocks with σ[k] > σP is removed while the noise in the remaining blocks is kept. With the image blocks sorted by perceptual sensitivity, this search can be efficiently implemented with the following binary search method. Let {It} be the sequence of search positions (or k values), where t is the search step index. At search step t, we set σP = σ[It], i.e., we remove the adversarial noise in all image blocks with indices k > It. Let the corresponding pruned adversarial noise be Z[It]. We then evaluate the classification network on the attacked image X + Z[It] to produce the classification output Φ𝜃(X + Z[It]). If the image is successfully attacked, we denote this by Γ(X + Z[It]) = 1; otherwise, Γ(X + Z[It]) = 0.

Initially, we set I0 = 0 and \(I_{1}=\frac {W}{w}\times \frac {H}{h}\). Certainly, Γ(X + Z[I0]) = 0, since the adversarial noise in all blocks is removed and X + Z[I0] becomes the original image. If Γ(X + Z[I1]) = 0, which means that the original attack is not successful, the search stops and the adversarial attack fails. Otherwise, the proposed binary search proceeds as follows: at search step t, let

$$ I^{+}_{t} = \min\{I_{j}\ |\ {\Gamma}(X+Z[I_{j}]) = 1,\ j< t\}, $$
(10)

which is the smallest index with a successful attack, and

$$ I^{-}_{t} = \max\{I_{j}\ |\ {\Gamma}(X+Z[I_{j}]) = 0,\ j< t\}, $$
(11)

which is the largest index with a failed attack. Then, in our proposed binary search, we set

$$ I_{t+1} = \frac{I^{+}_{t} + I^{-}_{t}}{2}. $$
(12)

Once \(I_{t+1}\) is determined, we evaluate Γ(X + Z[I_{t+1}]) and repeat the above steps until \(I_{t+1} - I_{t} < 1\). The overall binary search-based method is outlined in Algorithm 1. Figure 4 shows four examples of the above binary search process. The horizontal axis is the total number of blocks in which the adversarial noise is removed. As the number of pruned blocks increases, the SSIM quality of the attacked image improves significantly. The ending value represents the point at which the attack fails, which is the target value that the binary search algorithm aims to find.

Algorithm 1 Block-wise binary pruning of adversarial noise via fast binary search
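A hedged sketch of the fast binary search of Algorithm 1 is given below, assuming a single image, an untargeted attack, and the block-averaged sensitivity map of Eq. (4); the interface and variable names are our own illustrative assumptions.

```python
import torch

@torch.no_grad()
def blockwise_binary_pruning(classifier, x, z, sensitivity, y_true, block=4):
    """Fast binary search over sensitivity-sorted blocks (Section 3.5).

    x: clean image [C, H, W]; z: modulated adversarial noise; sensitivity: per-pixel
    map sigma of shape [H, W]; y_true: ground-truth label. Returns the pruned noise,
    or None if the original attack already fails.
    """
    C, H, W = x.shape
    nb_h, nb_w = H // block, W // block
    # Mean sensitivity per block (sigma[k]), then sort blocks in ascending order.
    sigma_blocks = sensitivity.reshape(nb_h, block, nb_w, block).mean(dim=(1, 3)).flatten()
    order = torch.argsort(sigma_blocks)            # least sensitive blocks first

    def attacked(idx):                             # evaluates Gamma(X + Z[I]) for index I = idx
        keep = torch.zeros(nb_h * nb_w, device=x.device)
        keep[order[:idx]] = 1.0                    # keep noise only in the idx least-sensitive blocks
        keep = keep.reshape(nb_h, nb_w).repeat_interleave(block, 0).repeat_interleave(block, 1)
        x_p = (x + z * keep).clamp(0, 1)
        pred = classifier(x_p.unsqueeze(0)).argmax(dim=1).item()
        return pred != y_true, keep

    lo, hi = 0, nb_h * nb_w                        # I^- (always fails) and I^+ (full noise)
    ok, keep = attacked(hi)
    if not ok:
        return None                                # the base attack itself is unsuccessful
    while hi - lo > 1:                             # ~log2(N_b) classifier evaluations
        mid = (lo + hi) // 2
        ok_mid, keep_mid = attacked(mid)
        if ok_mid:
            hi, keep = mid, keep_mid               # smaller index still attacks: prune more
        else:
            lo = mid                               # attack lost: keep more noise
    return z * keep
```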
Fig. 4

SSIM scores keep increasing as the noise is turned off in more and more image blocks, until the classification result of the model changes. The X-axis is the number of image blocks in which the perturbations are turned off. The rightmost number in each figure is the switching point at which the classification result flips

3.6 Attack complexity analysis

Besides the computation consumed by the base attack method, the extra complexity of our method comes from the inference of the perceptual attention prediction network and the binary search for adversarial noise pruning. The first part is a one-time cost, which is relatively small and depends on the model complexity. The complexity of the second part is the number of search steps multiplied by the complexity of the target classification network. Theoretically, the binary search in Algorithm 1 has a complexity of log2(Nb), where \(N_{b} = \frac {W}{w}\times \frac {H}{h}\) is the number of image blocks. In our experiments, the image size is 224 × 224 and the block size is 4 × 4, so the total number of blocks is Nb = 3136. Therefore, the maximum number of model inferences is 12.
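The bound on the number of model inferences can be checked directly:

```python
import math

W = H = 224          # image size used in the experiments
w = h = 4            # block size
n_blocks = (W // w) * (H // h)                   # N_b = 56 * 56 = 3136
max_inferences = math.ceil(math.log2(n_blocks))  # ceil(11.61...) = 12
print(n_blocks, max_inferences)                  # 3136 12
```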

4 Experiments

In this section, we provide extensive experimental results to evaluate the performance of our proposed perceptual attention-guided visual quality optimization algorithm for adversarial attacks.

4.1 Experimental setup and datasets

In this section, we compare the performance of our method with state-of-the-art attack methods under different defense schemes. The methods include two dense attack methods, PGD and BPDA, and four sparse attack methods, Perc_CW [32], PGDL0+σ [4], GreedyFool [7], and TSAA [12]. The performance is measured by the SSIM score between the clean and adversarial images, the attack success rate, and the attack efficiency in terms of the number of target model inferences.

As Perc_CW is optimized for a perceptual color distance score, we also use perceptual color distance to measure the quality of the adversarial images [32]. For Perc_CW and PGDL0+σ, we only compare the adversarial image quality at five different attack success rates, as it is time-consuming to reach a desired attack success rate due to the high-dimensional search space of their hyperparameters. For example, Perc_CW has five parameters that affect the attack success rate. We keep running these two tools with different hyperparameter settings and use the attack result if its attack success rate is within [0.85, 0.95]. GreedyFool [7] and TSAA [12] are l0 norm based attack methods. GreedyFool requires a large number of iterations to obtain good attack results, so we only evaluate it using one model. TSAA is slightly different from the others as it is designed mainly for black-box attacks to improve the transferability of adversarial examples; we still include its results for comparison.

The dataset is mini-ImageNet [5], which consists of 60,000 images from 100 classes. We select 20 classes; for each class, 100 images are randomly selected as test images, and the rest are used for training.

4.2 Experimental results

4.2.1 Resnet-34 with adversarial training for defense

In this experiment, the target model is Resnet-34, which is adversarially trained on the training dataset; its baseline classification accuracy is 76%. During adversarial training, the adversarial examples are generated by PGD with default parameters 𝜖 = 0.05, α = 0.01, and num_iter = 10, where α and num_iter are the step size and the number of iterations, respectively. The parameter λ in (9) is set to 0.001 during training of the mask prediction network. Table 1 shows the average SSIM scores of PGD and our method at different values of 𝜖. The scores are calculated over all tested images that are correctly classified by the target model. We can see that the SSIM gain obtained by the proposed PONS algorithm is quite significant: when 𝜖 = 0.03, which is quite small, the SSIM gain is about 7%, while the loss in attack success rate is very small. Table 2 shows the average SSIM scores of Perc_CW [32], PGDL0+σ [4], and our method. Our method outperforms these two state-of-the-art methods by large margins, especially at high attack success rates. For example, compared with Perc_CW, the SSIM gain of our method is 0.35 at an attack success rate of 0.95, and compared with PGDL0+σ, the SSIM gain is 0.1 at an attack success rate of 0.94.

Table 1 Comparison of SSIM and attack success rate between PGD and our method
Table 2 Comparison of SSIM score against Perc_CW and PGDL0+σ under the same attack success rate. The target model is Resnet-34 with adversarial training

4.2.2 Resnet-101 with feature denoising for defense

In the following experiment, the target model is a Resnet-101 network modified by adding feature denoising layers [29] to reduce the effect of adversarial noise. The model is also adversarially trained using PGD-attacked images and has a classification accuracy of 80% on clean images. Table 3 shows the average SSIM scores of the baseline PGD method and our PONS method at different values of 𝜖. Similar to the previous experiment, our method significantly improves the SSIM quality, while the loss in attack success rate is very small, mostly less than 1%; for 𝜖 = 0.06, 0.07, 0.09, and 0.1, the drop in success rate is zero. Table 4 shows the average SSIM scores of Perc_CW, PGDL0+σ, and our method, where we can see that our method achieves much higher SSIM scores, especially when the attack success rate is greater than 0.9.

Table 3 Comparison of SSIM and attack success rate between PGD and our method
Table 4 SSIM comparison against Perc_CW and PGDL0+σ under the same attack success rate

4.2.3 Resnet-50 with input transformation for defense

Input transformation has been demonstrated to be an effective method for adversarial defense [10]. In this experiment, the input transformation is bit-depth reduction [10], which removes the 5 least significant bits of each pixel value; equivalently, the pixel values are quantized to the set {0, 32, 64, 96, 128, 160, 192, 224}. It should be noted that this bit reduction is not differentiable, so the BPDA method uses an identity function during gradient back-propagation. During the BPDA attack, the number of steps is set to 200 and the learning rate is set to 0.1. The target model is Resnet-50, also adversarially trained using PGD-attacked images, with a baseline classification accuracy of 77.3% on clean images. Table 5 shows the average SSIM scores of BPDA and our method at different values of 𝜖. Our PONS method significantly improves the SSIM visual quality of the images attacked by BPDA, while the drop in attack success rate remains very small. Table 6 compares the average SSIM against Perc_CW and PGDL0+σ; the improvement is consistent and significant.

Table 5 Comparison of SSIM score and attack success rate between BPDA and ours
Table 6 SSIM score comparison against Perc_CW and PGDL0+σ under the same attack success rate
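As a hedged sketch of the bit-depth reduction defense described above and of the identity backward pass that BPDA substitutes for it, one possible PyTorch formulation is the following; the class name and the assumed [0, 1] input range are ours.

```python
import torch

class BitDepthReduction(torch.autograd.Function):
    """Drop the 5 least significant bits of each 8-bit pixel (quantize to multiples of 32)."""

    @staticmethod
    def forward(ctx, x):
        # x in [0, 1]; the quantized values correspond to {0, 32, 64, ..., 224} / 255.
        return torch.floor(x * 255.0 / 32.0) * 32.0 / 255.0

    @staticmethod
    def backward(ctx, grad_output):
        # BPDA: the transform is not differentiable, so approximate its gradient by the identity.
        return grad_output

# Usage sketch: attack the composition of the defense and the classifier.
# logits = classifier(BitDepthReduction.apply(x_adv))
```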

4.2.4 Color distance comparison

We note that Perc_CW is optimized for the perceptual color distance [16, 32] between the attacked image and the original one. Table 7 shows the color distance scores of Perc_CW and our method at different attack success rates in each of the previous tests. A smaller color distance with respect to the natural image means better quality of the adversarial image. We can see that even though our method is not optimized for the color distance, its average color distance is still comparable to that of Perc_CW, and in most cases even better.

Table 7 Color distance comparison at different attack success rates between Perc_CW and our method in each of the previous tests

4.2.5 Comparison to GreedyFool and TSAA

TSAA [12] and GreedyFool [7] generate adversarial samples under an l0 norm constraint, trying to modify as few image pixels as possible. We study their attack performance when a constraint on the perturbation magnitude, i.e., an \(l_{\infty }\) norm constraint, is also applied. Table 8 shows the attack results using the Resnet-34 model of Section 4.2.1 when 𝜖 changes from 0.1 to 0.5 and 1.0, where 𝜖 = 1.0 means the adversarial attack is fully under the l0 norm constraint. The three numbers in the table are the average SSIM, the attack success rate, and the number of model inferences. For TSAA, the performance improves as 𝜖 increases from 0.1 to 0.5 and 1.0, achieving the best average SSIM of 0.65 and attack success rate of 0.53. For GreedyFool, the number of iterations used to search for candidate pixels affects its performance significantly. When 𝜖 = 1.0, GreedyFool achieves its best average SSIM (0.97) and attack success rate (0.98) within 500 iterations, i.e., GF(500) in the table. However, the actual average number of model inferences is as high as 669, which costs a significant amount of time. In comparison, as shown in Table 1, our method achieves an average SSIM of 0.93 and an attack success rate of 0.97 with only 12 model inferences at 𝜖 = 0.1, while GreedyFool in this comparison works fully under the l0 norm constraint. We further checked the Mean Squared Error (MSE) of the attacked images and found that GF(500) has an average MSE of 104.95, while ours is 80.02, which means that GreedyFool adds much more perturbation to the images than our method. For GF(500), when 𝜖 = 0.5, the attack success rate drops to 0.89 at the cost of 1009 model inferences. When only 12 iterations are used for GreedyFool, i.e., GF(12), it fails to attack the images; the success rate is almost zero.

Table 8 Attack performance of TSAA [12] and GreedyFool [7] with different values of perturbation budget

4.3 Ablation studies

In the following experiments, we conduct ablation studies to further understand the performance of the proposed PONS method.

4.3.1 Contribution of algorithm components

Our method has two stages: perceptual attention mask prediction and block-wise adversarial noise pruning. In Fig. 5, we show the average SSIM improvement brought by each stage in the previous tests. The horizontal axis is the perturbation budget 𝜖. In each sub-figure, we show the average SSIM score for (a) the baseline attack method, (b) the baseline method plus the perceptual attention noise reshaping, and (c) the baseline method plus both algorithm modules. In all three tests, both modules contribute significantly to the overall performance. The performance gain achieved by the binary pruning is more significant than that of the perceptual attention noise reshaping, especially for the BPDA attack method. For example, in Fig. 5(b), at 𝜖 = 0.05, the SSIM score of the base attack is 0.84; the perceptual attention module increases it to 0.87, and the block-wise perturbation pruning further increases it to 0.95.

Fig. 5

Sub-figures (a), (b), and (c) show the SSIM improvement from the perceptual attention prediction and the block-wise pruning of the adversarial noise. In all three figures, the black line is the base attack method, the brown line is the SSIM score with only the perceptual attention prediction, and the orange line shows the SSIM score with both the attention mask prediction and the block-wise noise pruning

4.3.2 Comparison of different block sorting schemes

In our method, we use the SSIM sensitivity score to sort the image blocks and identify candidate blocks whose perturbation can be turned off by the binary search algorithm. Table 9 compares the average SSIM scores for different block sorting schemes, namely (1) sorting by the SSIM sensitivity, (2) the original order (no sorting), and (3) random order. From Table 9, we can see that sorting the image blocks by the SSIM sensitivity score achieves the largest improvement in SSIM, while random sorting achieves the least. For example, in the test of BPDA with input transformation at 𝜖 = 0.05, the average SSIM scores are 0.88, 0.90, and 0.95, respectively; sorting by the SSIM sensitivity improves the average SSIM by 5 percentage points compared with the case without any sorting.

Table 9 SSIM comparison of block-wise perturbation pruning with different block sorting methods in three experiments

4.3.3 Effect of SSIM loss ratio

To train the perceptual attention model, Eq. (9) has two loss terms: the cross-entropy between the attack target and the predicted labels, and the SSIM image quality between the attacked image and the input image. The value of the ratio λ affects the quality of the attention mask predicted by the perceptual attention network: more weight on the SSIM term will likely improve the image quality but decrease the attack success rate. We repeat the test of Table 1 in Section 4.2.1, but without the block-wise noise pruning. Table 10 shows the average SSIM and attack success rate for different values of λ. We can see that the performance is quite robust as λ increases from 0.001 to 0.5. When it reaches 1.0, the average SSIM increases but the attack success rate drops significantly compared with the case of 0.001.

Table 10 Attack performance with different values of λ in (9)

4.3.4 Effect of block size in perturbation pruning

In block-wise perturbation pruning, increasing the block size speeds up the process but will likely lower the image quality. Table 11 shows the average SSIM and the number of model inferences when the block size increases from 2 to 16 for the test of Table 1 in Section 4.2.1. The table shows that the average SSIM is highest when the block size is 2, which is reasonable since smaller blocks allow a finer-grained selection of where to turn off the perturbation; however, it is also the slowest among the four settings. For block sizes of 4 and 8, the average SSIM is very close, which means our method can be made even faster without significant loss of image quality.

Table 11 Effect of block size in block-wise perturbation pruning for the test of Table 1 in Section 4.2.1

4.3.5 Perceptual quality comparison

Figure 6 compares the image quality of several adversarial examples generated by the base attack method and our method. For each pair, the left image is generated by the base attack method and the right one is ours. The three rows correspond to 𝜖 = 0.05, 0.06, and 0.07, respectively. For each adversarial image, we also show the SSIM score with respect to the clean image. We can see that our method significantly improves the visual quality of the attacked images. In Fig. 7, we show several pairs to compare the quality of the adversarial images generated by Perc_CW, PGDL0+σ, and our method, together with the SSIM score of each adversarial image. These sample pairs are taken from attack tests with similar attack success rates. We can see that under the same attack success rate, the quality of the images attacked by Perc_CW and PGDL0+σ is worse than ours.

Fig. 6

Image quality comparison between the result of the base attack method and our method. For each pair, the left one is from the base attack method, and the right one is our result. The three image rows correspond to 𝜖 = 0.05,0.06,0.07, respectively

Fig. 7

Image quality comparison against Perc_CW and PGDL0+σ. The first row is the comparison between ours and Perc_CW, and the second row is against PGDL0+σ. The numbers in parentheses are the SSIM scores. Each pair is taken from tests with similar attack success rates

5 Discussion and future work

In the previous section, we compared our method with state-of-the-art dense and sparse attack methods under different defense schemes. The experimental results demonstrate that our method achieves significantly better attack performance in terms of the image quality between the clean and adversarial images, the attack success rate, and the attack efficiency measured by the number of model inferences. GreedyFool under the l0 norm constraint, i.e., GF(500) at 𝜖 = 1.0 in Table 8, achieves a significantly better average SSIM than ours at 𝜖 = 0.1 in Table 1; however, its average MSE and attack complexity are substantially worse than ours.

The advantage of our method is that we build on a base attack method with strong attack capability, so that we only need to optimize the image quality without significant loss of attack success rate, as demonstrated in Tables 1, 3, and 5. On the other hand, the attack success rate of our method is largely determined by the base attack method, since we only reshape the adversarial noise to improve the image quality. To obtain a higher attack success rate, we can increase the perturbation budget, which will inevitably introduce more noise into the images and require a more powerful perceptual attention network to produce high-quality perceptual attention masks. A better base attack method will certainly help improve the performance of our method.

In this work, the perceptual attention network is designed based on the Resnet-18 backbone to demonstrate its effectiveness. If we use a more advanced architecture for the perceptual attention network, we expect to see more improvement in the image quality and attack success rate.

6 Conclusion

In this work, we have observed that most existing adversarial attack methods are designed to maximize the attack success rate under an lp norm constraint and do not fully consider the perceptual sensitivity of different image regions to the adversarial noise. Motivated by this, we propose a novel two-stage attack method to maximize the image perceptual quality as well as the attack success rate. Specifically, we construct and learn a perceptual attention network that generates a perceptual attention mask to modulate the adversarial noise produced by a base attack method, aiming to maximize the visual quality while achieving the same attack success rate. To further improve the perceptual quality, we propose a fast binary search algorithm that performs iterative pruning of the adversarial noise based on the perceptual sensitivity map. Comprehensive evaluations demonstrate that our method can significantly improve the image visual quality over the base attack method without sacrificing the attack success rate. Compared with state-of-the-art adversarial attack methods, our method achieves better attack performance in terms of image quality, attack success rate, and attack efficiency.