Introduction

Deep neural networks (DNNs) are playing an increasingly important role in autonomous driving [1], recommender systems [2], speech recognition [3], natural language processing [4], etc. However, research in recent years has shown that adding human-imperceptible perturbations to the input data can generate adversarial examples, which mislead deep neural networks into producing wrong outputs [5]. Adversarial examples have become one of the serious security risks faced by intelligent systems [6] and could cause unintended consequences [7]. For example, if an autonomous driving system cannot correctly identify traffic signs, it could put people’s lives in danger. If a face recognition authentication system is attacked by adversarial examples and cannot correctly identify the input faces, it could cause property damage. Therefore, studying adversarial examples theoretically and empirically helps us explore the security risks of deep learning models and reduce the harm caused by adversarial attacks.

In 2013, Szegedy et al. [8] successfully constructed adversarial examples using the box-constrained L-BFGS algorithm [9]. Since then, adversarial examples have become a hot research topic. Adversarial attack algorithms can be divided into white-box attacks and black-box attacks according to the available model knowledge [10]. Under the white-box attack environment, attackers have access to all the information about the DNN, including its weight parameters and model structure. In contrast, under the black-box attack environment, attackers do not have access to the internal information of the model and can only generate adversarial examples by querying the model’s output.

Under the white-box attack environment, gradient-based attack algorithms are efficient and easy to implement. Goodfellow et al. [11] proposed the FGSM (Fast Gradient Sign Method), which generates adversarial examples by directly adding tiny perturbations in the gradient direction of the example. Their work [11] points out that the effectiveness of FGSM stems from the existence of linear behaviors in deep learning models. Furthermore, Miyato et al. [12] proposed a variant of FGSM, i.e., FGM (Fast Gradient Method), which normalizes the gradient by the \(l_2\) norm and can generate adversarial perturbations more effectively. Both FGSM and FGM are one-step attack techniques, and their attack success rates are relatively limited. Therefore, some multi-step variants of FGSM have been proposed to improve the attack success rate, such as BIM [13], PGD [14], SLIDE [15], and MIM [16]. The BIM (Basic Iterative Method) generates adversarial examples through multiple iterations of FGSM. It uses the input image as the initial sample and then adds a small step-size perturbation to the sample in each iteration until the adversarial example is generated. Due to its small step size at each iteration, adversarial examples can be generated before the amount of perturbation added to a pixel reaches the maximum limit. Experimental results show that the adversarial examples generated by BIM are better than those of FGSM. Madry et al. [14] proposed the PGD (Projected Gradient Descent) attack algorithm on the basis of BIM. Unlike BIM, PGD adds uniformly distributed random noise as initialization. Experimental results show that the performance of PGD is significantly improved, and the adversarial examples generated by it are better than those of BIM. The SLIDE (Sparse \(l_1\)-Descent Attack) [15] attack algorithm is an \(l_1\) norm variant of PGD. Compared with PGD, the perturbations generated by SLIDE are sparser, and its attack performance is quite competitive with other attacks. The MIM (Momentum Iterative Method) [16] is also a gradient-based iterative attack algorithm similar to PGD, which introduces the concept of momentum: in each iteration, the perturbation is determined by both the current gradient direction and the previously calculated gradient directions. In addition, DeepFool [17] and JSMA (Jacobian Saliency Map Attack) [18] are also typical gradient-based algorithms. The DeepFool algorithm treats the multi-class classification task as a combination of multiple binary classification tasks and assumes that the decision boundary of the classifier is approximately linear. By solving for the minimum distance from the sample to the decision boundary, DeepFool obtains the minimum perturbation needed to generate adversarial examples. The JSMA algorithm only modifies some of the pixels in the image, rather than perturbing the entire image. The perturbed pixels are selected by computing a saliency map from the gradients of the deep neural network with respect to the input pixels.

Although gradient-based white-box attack algorithms have been widely studied for generating adversarial examples, most of them are designed for multi-class classification, and there are only a few attack algorithms specifically for multi-label classification. However, in the real world, multi-label classification models also face adversarial attacks. In multi-label classification [19], a sample often corresponds to more than one label, and there is often a certain correlation between labels [20]. For example, for an image containing “traffic light” and “crosswalk”, when an attacker only considers the “traffic light” label and wants to hide it, the attack may fail because the associated “crosswalk” label may also be hidden. Therefore, it is more challenging to generate adversarial examples for multi-label models, and the effectiveness of gradient-based attack algorithms in multi-label classification is worthy of a comprehensive study.

In order to evaluate the performance of the gradient-based attack algorithms in multi-label classification, we conducted extensive experiments to test nine typical gradient-based attack algorithms. In experiments, we individually tested the performance of all attack algorithms on six different multi-label targeted attack types across two datasets and two models.

The main contributions of this paper are summarized as follows.

  1.

    Nine typical gradient-based algorithms are comprehensively compared in this paper. Among these nine algorithms, the BIM [13], MIM [16], PGD [14], SLIDE [15], and JSMA [18] attack algorithms are transplanted to multi-label attacks for the first time. The remaining four attack algorithms are MLA-LP [21], FGS [22], FG [22], and ML-DP [22]. The MLA-LP attack was specifically designed for multi-label classification models by Zhou et al. [21], while the other three algorithms were transplanted by Song et al. [22] from existing multi-class attacks. This study aims to reveal the performance of gradient-based attack algorithms on multi-label classification models from an experimental perspective. Additionally, the transplantation of these five attack algorithms expands the existing repertoire of multi-label white-box attack algorithms.

  2.

    We evaluate each attack algorithm on two datasets, two models, and six attack types. Experimental results demonstrate that one-step gradient-based attacks exhibit significantly lower performance than iterative attacks, which suggests that multi-label classification models exhibit lower linearity than multi-class classification models. For all attack algorithms, performance is directly influenced by the attack type: attacks involving a larger number of targeted labels are more challenging. Experimental results under the two types of transfer attacks indicate that the adversarial examples generated by all attack algorithms exhibit weak transferability. Additionally, experimental results for different attack types indicate that, for gradient-based attack algorithms, augmenting labels is more difficult than hiding labels, and extreme attacks are the most challenging attack type to achieve.

The rest of this paper is organized as follows. “The task of generating multi-label adversarial examples” introduces the multi-label classification task and multi-label adversarial examples. “Gradient-based algorithms of generating adversarial examples” introduces nine mainstream gradient-based attack algorithms, all of which are involved in our experiments. “Experimental design” introduces the experimental setup, including datasets, models, parameter settings, and evaluation metrics. “Results and discussion” presents the experimental results and analysis. Finally, “Conclusion” briefly summarizes this paper.

The task of generating multi-label adversarial examples

In this section, we briefly introduce some preliminaries for multi-label classification tasks. We then describe multi-label adversarial examples and introduce six types of attacks in multi-label classification.

Multi-label classification

Traditional multi-class classification of images only considers one object in the image, although images in training sets, e.g., ImageNet [23], often contain more than one object. In multi-label classification tasks, we are interested in multiple labels in an image and often need to pay more attention to the connections between labels. For example, when a computer appears on a desk, devices such as a mouse and a keyboard often appear in the same scene at the same time.

Multi-label learning consists of two main tasks: multi-label classification (MLC) and multi-label ranking (MLR). Assuming that F is a multi-label classifier with c labels, an instance \(\textbf{x}\) is classified as \(F(\textbf{x}) = \textbf{y}\). For these two tasks, a multi-label classifier produces different outputs. One is \({\textbf{y}}\in \{-1,1\}^{c\times 1}\), a c-dimensional vector of binary values representing the inclusion relationship between the input and the labels. The other is \({\textbf{y}}\in \{f_1,\ldots ,f_c\}\), which outputs the confidence corresponding to each label in the input image.

Given a confidence threshold t, we can transform the output of multi-label ranking into that of multi-label classification as \(F(\textbf{x})=\{sgn(f_1(\cdot )-t),\ldots ,sgn(f_c(\cdot )-t)\}\), where \(sgn(\cdot )\) is the sign function [22]. Therefore, in this paper, without loss of generality, the gradient-based algorithms are tested on the multi-label classification task.
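As a concrete illustration of this thresholding step, the following is a minimal sketch (assuming PyTorch tensors and the \(\{-1,1\}\) label coding used above; the helper name is ours) that converts per-label confidences into a multi-label classification output:

```python
import torch

def confidences_to_labels(f, t=0.5):
    """Convert per-label confidences f (shape [c]) into a {-1, 1} label vector.

    Corresponds to F(x) = {sgn(f_1 - t), ..., sgn(f_c - t)}, with a confidence
    exactly equal to t mapped to +1, matching f_i(x) >= t => F_i(x) = 1.
    """
    return torch.where(f >= t, torch.ones_like(f), -torch.ones_like(f))

# Example: confidences 0.9, 0.2 and 0.5 map to [1, -1, 1] with t = 0.5.
print(confidences_to_labels(torch.tensor([0.9, 0.2, 0.5])))
```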

Multi-label adversarial examples

Adversarial attacks can be divided into targeted attacks and non-targeted attacks from the perspective of the attack target. In multi-label classification adversarial attacks, the attacker usually pays more attention to the specific result produced by the attack; that is, the target label vector is often given before attacking. Therefore, in this paper, we mainly discuss targeted multi-label adversarial attacks under the white-box attack environment, which are defined as follows.

$$\begin{aligned} \begin{array}{ll} \mathop {\min }\limits _{\textbf{r}} &{} \Vert \textbf{r}\Vert _{p}, \\ \text{ s.t. }&{} F(\textbf{x}+\textbf{r})=\textbf{y}^*,\\ \end{array} \end{aligned}$$
(1)

where \(\textbf{y}^*\) is the target label vector, \(\textbf{r}\) denotes the perturbation, and \(\Vert \cdot \Vert _p\) denotes the \(l_p\) norm. Then, we can define multi-label adversarial examples as \(\textbf{x}^{adv}=\textbf{x}+\textbf{r}\).

According to existing work [22, 24], the typical targets of multi-label adversarial examples can be summarized into the following cases.

  1.

    Hiding single [22]: In this attack type, we select examples containing at least two labels and randomly choose one contained label as the attack label. The target of the attack is to hide the chosen label, i.e., the multi-label classifier is not aware of the selected label in the sample.

  2.

    Hiding all [24]: Hiding all attack means that the generated adversarial example will make the multi-label classifier unable to output any label for this instance, i.e., \(F(\textbf{x}^{adv})=\{-1\}^{c\times 1}\). This attack type is a blind threat to multi-label classifiers because such adversarial examples can make multi-label classifiers unaware of the environment.

  3.

    Random [22]: For each input image, one positive label and one negative label are randomly selected as attack labels among its ground-truth labels. The purpose of the attack is to turn the selected positive label to be negative and turn the selected negative label to be positive.

  4.

    Extreme [22]: This attack strategy is the most extreme case of targeted attacks. The result of this targeted attack is to flip the ground truth labels of all original images, i.e., all positive labels are attacked as negative labels, and all negative labels are attacked as positive labels.

  5.

    Reduction [22]: In the reduction case, we randomly select images containing the specified label for the reduction attack. The target of the reduction attack is to make the multi-label classifier unable to output the specified label for the input example.

  6.

    Augmentation [22]: In the augmentation case, we randomly select images that do not contain the specified label, and the purpose of the augmentation attack is to change the specified negative label into a positive one.

It is worth noting that, following Ref. [22] and without loss of generality, to avoid insufficient test instances due to label imbalance, we select the labels “person” and “sheep” as the specified “reduction” and “augmentation” labels, respectively.
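To make the six cases above concrete, the following is a hypothetical helper (not taken from the paper’s code; label vectors are assumed to be \(\{-1,1\}\)-coded PyTorch tensors, and `spec_idx` stands for the index of the specified “person” or “sheep” label) that builds the target vector \(\textbf{y}^*\) from the ground truth \(\textbf{y}\):

```python
import torch

def build_target(y, attack_type, spec_idx=None):
    """Construct a target label vector y* from the ground truth y in {-1, 1}^c."""
    y_star = y.clone()
    pos = (y == 1).nonzero(as_tuple=True)[0]
    neg = (y == -1).nonzero(as_tuple=True)[0]
    if attack_type == "hiding_single":     # hide one randomly chosen positive label
        y_star[pos[torch.randint(len(pos), (1,))]] = -1
    elif attack_type == "hiding_all":      # hide every label
        y_star[:] = -1
    elif attack_type == "random":          # flip one positive and one negative label
        y_star[pos[torch.randint(len(pos), (1,))]] = -1
        y_star[neg[torch.randint(len(neg), (1,))]] = 1
    elif attack_type == "extreme":         # flip all labels
        y_star = -y
    elif attack_type == "reduction":       # remove the specified label
        y_star[spec_idx] = -1
    elif attack_type == "augmentation":    # add the specified label
        y_star[spec_idx] = 1
    return y_star
```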

Gradient-based algorithms of generating adversarial examples

In this section, nine representative gradient-based attack algorithms are introduced. Among these attack algorithms, FGS, FG, and ML-DP were presented by Song et al. [22], MLA-LP was proposed by Zhou et al. [21], and the other attack algorithms are transplanted by us in this paper. It is noted that FGS, FG, and ML-DP were themselves transplanted from gradient-based attack algorithms for multi-class classification models. Since the BIM, PGD, MIM, and SLIDE attacks only require the models’ gradients to implement attacks, these algorithms can be transplanted to multi-label classification models as long as the models are differentiable.

Fast gradient sign method (FGSM)

The Fast Gradient Sign Method (FGSM) by Goodfellow et al. [11] is a basic attack algorithm in the white-box environment. FGSM first obtains the derivative of the model with respect to the input data, then uses the sign function to obtain the gradient direction, and finally adds a perturbation with a specified step size in this direction to generate adversarial examples. The core function of FGSM in [11] is as follows.

$$\begin{aligned} \textbf{x}^{adv}=\textbf{x}+\alpha \cdot sign\left( \nabla _{\textbf{x}} loss\left( \textbf{x},\textbf{y}\right) \right) , \end{aligned}$$
(2)

where \(\alpha \) is the perturbation step size, \(\textbf{y}\) is the ground truth of the input sample, and \(loss(\cdot )\) is the loss function.

It is worth noting that FGSM is originally a multi-class non-targeted attack algorithm. However, Song et al. [22] adapted it to multi-label classification by transplanting it into a multi-label targeted attack. The core idea of FGSM is to add perturbations in the direction of the gradient of the loss between the input sample and the true label, which increases the loss value when the sample is predicted as the true label. In the case of a multi-label targeted attack, the core function performs the opposite operation: it aims to minimize the loss value when the original sample is predicted as the target label. Therefore, Song et al. [22] proposed the FGS attack by modifying the core function of FGSM as follows.

$$\begin{aligned} \textbf{x}^{adv}=Clip\left( \textbf{x}-\alpha \cdot sign\left( \nabla _{\textbf{x}} J\left( F(\textbf{x}),\textbf{y}^*\right) \right) \right) , \end{aligned}$$
(3)

where the \(Clip(\cdot )\) function clips the pixel values that overflow the boundary so that the final generated adversarial example is a valid image, the \(sign(\cdot )\) function indicates the specific gradient direction, and \(J(\cdot )\) is the cross-entropy loss function. \(F(\textbf{x})\) is the predicted label vector of \(\textbf{x}\), and \(\textbf{y}^*\) is the target label vector.

In this paper, we will follow this approach to perform multi-label transplantation for the FG, BIM, PGD, MIM, and SLIDE attack algorithms.
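The following is a minimal PyTorch sketch of Eq. (3), not the exact implementation used in the experiments. It assumes the model returns per-label probabilities in [0, 1] and that the target vector is recoded from \(\{-1,1\}\) to \(\{0,1\}\) so that binary cross entropy applies; the step size 0.03 matches the FGS setting reported later in “Experimental setting”.

```python
import torch
import torch.nn.functional as F

def fgs_attack(model, x, y_target, alpha=0.03):
    """One-step targeted FGS for a multi-label model, following Eq. (3)."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.binary_cross_entropy(model(x_adv), y_target)  # J(F(x), y*)
    loss.backward()
    # Step against the gradient sign to decrease the loss w.r.t. the target labels.
    x_adv = x_adv - alpha * x_adv.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()                  # Clip to a valid image
```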

Fast gradient method (FGM)

The Fast Gradient Method (FGM) was proposed by Miyato et al. [12] in 2016. Compared with FGSM, FGM considers not only the gradient direction of the model but also the specific step size when calculating the perturbation, normalizing the gradient by the \(l_2\) norm to obtain a better adversarial example. Song et al. [22] transplanted FGM to multi-label classification models and designed the FG attack algorithm. The core function of the FG algorithm in [22] is as follows.

$$\begin{aligned} \textbf{x}^{adv}=Clip\left( \textbf{x}-\alpha \cdot \frac{\nabla _{\textbf{x}} J\left( F(\textbf{x}), \textbf{y}^{*}\right) }{\left\| \nabla _{\textbf{x}} J\left( F(\textbf{x}), \textbf{y}^{*}\right) \right\| _{2}}\right) , \end{aligned}$$
(4)

where \(\alpha \) is a hyperparameter that controls the size of the perturbation, \(J(\cdot )\) is the cross-entropy loss function, and \(\Vert \cdot \Vert _2\) denotes the \(l_2\) norm.

Basic iterative method (BIM)

FGSM, which increases the loss function of the classifier through a one-step operation, often fails to achieve a good attack effect. To address this shortcoming, Kurakin et al. [13] proposed the Basic Iterative Method (BIM), an iterative gradient-based attack algorithm. BIM generates an adversarial perturbation through multiple iterations: at each iteration, it adds a perturbation to the current sample in the direction that maximizes the loss function. BIM is therefore an iterative version of the FGSM algorithm. It performs pixel-value clipping after each perturbation is added to ensure that the final adversarial example is still a valid image. The core function of BIM in [13] is as follows.

$$\begin{aligned} \begin{array}{l} \textbf{x}_{0}^{adv} = \textbf{x}, \\ \textbf{x}_{n+1}^{adv} = Clip\left\{ \textbf{x}_{n}^{adv} + \alpha \cdot sign\left( \nabla _{\textbf{x}_{n}^{adv}} J\left( \textbf{x}_{n}^{adv}, \textbf{y}\right) \right) \right\} , \end{array} \end{aligned}$$
(5)

where \(\textbf{x}_{n+1}^{adv}\) denotes the adversarial example generated at the \((n+1)\)th iteration. At the beginning of BIM, \(\textbf{x}_{0}^{adv}\) is the input image.

In this study, we transplant BIM to multi-label classification models. We modify the core function as follows to adapt the BIM attack to multi-label targeted attacks.

$$\begin{aligned} \begin{array}{l} \textbf{x}_{0}^{adv} = \textbf{x}, \\ \textbf{x}_{n+1}^{adv} = Clip\left\{ \textbf{x}_{n}^{adv} - \alpha \cdot sign\left( \nabla _{\textbf{x}_{n}^{adv}} J\left( F(\textbf{x}_{n}^{adv}), \textbf{y}^*\right) \right) \right\} . \end{array} \end{aligned}$$
(6)

The purpose of the formula (6) is to reduce the gap between the attack target and the current predicted label at each iteration.
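The following is an illustrative sketch of Eq. (6), under the same assumptions as the FGS sketch above (probabilistic outputs, a {0, 1}-coded target vector); the step size 0.001 and 30 iterations match the settings reported later in “Experimental setting”.

```python
import torch
import torch.nn.functional as F

def bim_attack(model, x, y_target, alpha=0.001, n_iter=30):
    """Iterative targeted BIM for a multi-label model, following Eq. (6)."""
    x_adv = x.clone().detach()
    for _ in range(n_iter):
        x_adv.requires_grad_(True)
        loss = F.binary_cross_entropy(model(x_adv), y_target)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # Move a small step opposite to the gradient sign and clip to [0, 1].
        x_adv = (x_adv - alpha * grad.sign()).clamp(0.0, 1.0).detach()
    return x_adv
```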

Projected gradient descent (PGD)

The Projected Gradient Descent (PGD) algorithm was proposed by Madry et al. [14] to address the linearity assumption underlying FGSM and FGM. PGD is an iterative attack algorithm: it performs multiple iterations with a small step size, and each iteration projects the perturbation into a specified range. The core function of PGD in [14] is as follows.

$$\begin{aligned} \begin{array}{l} \textbf{x}_{0}^{adv} = \textbf{x}+\textbf{d}, \\ \textbf{x}_{n+1}^{adv}=\Pi _{\textbf{x}+\mathcal {S}}\left( \textbf{x}_{n}^{adv} +\alpha \cdot sign\left( \nabla _{\textbf{x}_{n}^{adv}} J(\textbf{x}_{n}^{adv}, \textbf{y})\right) \right) , \end{array} \end{aligned}$$
(7)

where \(\textbf{d}\) is a uniformly distributed random perturbation, and \(\Pi _{\textbf{x}+\mathcal {S}}(\cdot )\) is a projection that constrains the generated adversarial examples to a neighborhood of the original sample \(\textbf{x}\) of size \(\mathcal {S}\).

The PGD algorithm is similar to BIM in that it iterates FGSM to generate adversarial perturbations. However, unlike BIM, PGD starts from a randomly perturbed input rather than the original sample and generates an adversarial example after a number of iterations. As reported in [14], PGD significantly improves on BIM, generating adversarial examples with better transferability and attack performance.

In this paper, we modify the core function of PGD to adapt it to multi-label targeted attacks, as follows.

$$\begin{aligned} \begin{array}{l} \textbf{x}_{0}^{adv} = \textbf{x}+\textbf{d}, \\ \textbf{x}_{n+1}^{adv}=Clip\{\textbf{x}_{n}^{adv} -\alpha \cdot sign\left( \nabla _{\textbf{x}_{n}^{adv}} J(F(\textbf{x}_{n}^{adv}), \textbf{y}^*)\right) \}. \end{array} \end{aligned}$$
(8)
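A sketch of Eq. (8), identical to the BIM sketch above except for the uniformly distributed random start \(\textbf{d}\); the initialization radius `eps` is an assumption for illustration, since the paper does not report it explicitly.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y_target, alpha=0.001, n_iter=30, eps=0.03):
    """Targeted PGD for a multi-label model, following Eq. (8)."""
    d = torch.empty_like(x).uniform_(-eps, eps)            # random initialization
    x_adv = (x + d).clamp(0.0, 1.0).detach()
    for _ in range(n_iter):
        x_adv.requires_grad_(True)
        loss = F.binary_cross_entropy(model(x_adv), y_target)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = (x_adv - alpha * grad.sign()).clamp(0.0, 1.0).detach()
    return x_adv
```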

Momentum iterative fast gradient sign method (MIM)

The Momentum Iterative Method (MIM) is a gradient-based iterative attack algorithm similar to PGD. It differs from BIM and PGD in that MIM [16] additionally introduces a momentum term to stabilize the update direction of the gradient.

In the iterative process of MIM, the perturbation at each iteration is related not only to the current gradient direction but also to the previously calculated gradient directions. MIM introduces a decay factor to adjust this correlation: the smaller the decay factor, the smaller the effect of the momentum on the gradient direction of the current iteration. The core function of MIM in [16] is as follows.

$$\begin{aligned} \begin{array}{l} \textbf{g}_{0}=0,\textbf{x}_{0}^{adv} = \textbf{x},\\ \textbf{g}_{n+1}=\mu \cdot \textbf{g}_{n}+ \frac{\nabla _{\textbf{x}_{n}^{adv}} J\left( \textbf{x}_{n}^{adv}, \textbf{y}\right) }{\left\| \nabla _{\textbf{x}_{n}^{adv}} J\left( \textbf{x}_{n}^{adv}, \textbf{y}\right) \right\| _{1}} ,\\ \textbf{x}_{n+1}^{adv}=\textbf{x}_{n}^{adv}+\alpha \cdot sign\left( \textbf{g}_{n+1}\right) , \end{array} \end{aligned}$$
(9)

where \(\mu \) is the decay factor, \(\textbf{g}_{n}\) is the momentum term at the nth iteration, and \(\Vert \cdot \Vert _1\) denotes the \(l_1\) norm.

Since the previous gradients in MIM also affect subsequent iterations, the iteration direction does not deviate much. By integrating the momentum term into the iterative process of the attack, MIM can stably update the attack direction, resulting in more transferable adversarial examples. In this paper, we transplant MIM into the multi-label environment, and the core function is modified as follows.

$$\begin{aligned} \begin{array}{l} \textbf{g}_{0}=0,\textbf{x}_{0}^{adv} = \textbf{x},\\ \textbf{g}_{n+1}=\mu \cdot \textbf{g}_{n}+ \frac{\nabla _{\textbf{x}_{n}^{adv}} J\left( F(\textbf{x}_{n}^{adv}), \textbf{y}^{*}\right) }{\left\| \nabla _{\textbf{x}_{n}^{adv}} J\left( F(\textbf{x}_{n}^{adv}), \textbf{y}^{*}\right) \right\| _{1}} ,\\ \textbf{x}_{n+1}^{adv}=Clip\left\{ \textbf{x}_{n}^{adv}-\alpha \cdot sign\left( \textbf{g}_{n+1}\right) \right\} . \end{array} \end{aligned}$$
(10)
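A sketch of Eq. (10) under the same assumptions as the previous sketches; the gradient is \(l_1\)-normalized and accumulated into the momentum term. The default decay factor `mu=1.0` is an assumption for illustration rather than a value taken from the experiments.

```python
import torch
import torch.nn.functional as F

def mim_attack(model, x, y_target, alpha=0.001, n_iter=30, mu=1.0):
    """Targeted MIM for a multi-label model, following Eq. (10)."""
    x_adv = x.clone().detach()
    g = torch.zeros_like(x)
    for _ in range(n_iter):
        x_adv.requires_grad_(True)
        loss = F.binary_cross_entropy(model(x_adv), y_target)
        grad = torch.autograd.grad(loss, x_adv)[0]
        g = mu * g + grad / grad.abs().sum()               # momentum accumulation
        x_adv = (x_adv - alpha * g.sign()).clamp(0.0, 1.0).detach()
    return x_adv
```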

Sparse \(l_1\)-descent attack (SLIDE)

The SLIDE [15] algorithm is the \(l_1\) norm variant of PGD [14]. The \(l_1\) norm case is trickier than the \(l_\infty \) and \(l_2\) cases adopted by PGD [14], because the steepest descent direction under the \(l_1\) norm is too sparse (it updates a single coordinate of the adversarial perturbation in each step). The SLIDE [15] attack controls the sparsity of the update step through an additional parameter. With moderately sparse update steps, this attack outperforms projected steepest descent by a large margin and is competitive with other attacks using the \(l_1\) norm.
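The following is a simplified sketch of a sparsity-controlled \(l_1\) update step in the spirit of SLIDE; the quantile-based coordinate selection and the parameter names are our own simplification of the description above, not the exact SLIDE or AdverTorch implementation.

```python
import torch

def sparse_l1_step(x_adv, grad, alpha=1.0, sparsity=0.95):
    """One sparsity-controlled descent step in the spirit of SLIDE.

    Only the coordinates whose gradient magnitude lies above the `sparsity`
    quantile are updated, which keeps the per-step perturbation sparse.
    """
    threshold = torch.quantile(grad.abs().flatten(), sparsity)
    mask = (grad.abs() >= threshold).float()                 # keep top coordinates
    direction = grad.sign() * mask
    direction = direction / direction.abs().sum().clamp_min(1e-12)  # l1-normalize
    return x_adv - alpha * direction                         # targeted: descend the loss
```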

Multi-label DeepFool attack (ML-DP)

The ML-DP algorithm by Song et al. [22] is based on the DeepFool algorithm [17]. The algorithm assumes that, when the perturbation is small, the decision boundary is approximately linear and the minimum perturbation is the shortest distance from the sample point to the decision boundary. The core formula of the ML-DP algorithm in [22] is as follows.

$$\begin{aligned} \begin{array}{ll} \mathop {\min }\limits _{\textbf{r}} &{} \Vert \textbf{r}\Vert _2, \\ \text{ s.t. } &{} -\textbf{y}^{*} \odot \left( F(\textbf{x})+\frac{\partial F(\textbf{x})}{\partial \textbf{x}} \textbf{r}\right) \le \textbf{0} . \end{array} \end{aligned}$$
(11)

The objective function of the algorithm is to minimize the perturbation \(\textbf{r}\), and the constraint is that, under the perturbation \(\textbf{r}\), the label confidences have the same signs as \(\textbf{y}^{*}\).

Jacobian saliency map attack (JSMA)

The JSMA algorithm [18] consists of three steps: computing the forward derivative, constructing the saliency map, and selecting the pixel locations that need to be changed. The JSMA algorithm computes the partial derivative of the model’s confidence for each class with respect to the input \(\textbf{x}\). The partial derivative values indicate the degree of influence of different pixels. This is called the forward derivative in [18], as follows.

$$\begin{aligned} \nabla F(\textbf{x})=\frac{\partial F(\textbf{x})}{\partial \textbf{x}}=\left[ \frac{\partial F_{j}(\textbf{x})}{\partial \textbf{x}_{i}}\right] _{i \in 1 \ldots N, j \in 1 \ldots c}, \end{aligned}$$
(12)

where \(F_{j}(\cdot )\) represents the output for class j, and \(\textbf{x}_i\) represents the ith input feature. N represents the number of pixels, and c represents the number of labels. It is worth noting that the internal structure of the model does not significantly affect the calculation of the forward derivative. Therefore, as long as the neural network of the multi-label classification model is differentiable, the forward derivative can be calculated. In this study, we choose the logits layer as the output to calculate the forward derivative.
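A minimal autograd sketch of the forward derivative in Eq. (12), assuming the model returns the logits of the c labels for a single image; the function name and the row-by-row loop are ours, for illustration only.

```python
import torch

def forward_derivative(model, x):
    """Compute the forward derivative dF_j/dx_i of Eq. (12) with autograd.

    `x` has shape [1, 3, H, W]; the result is a Jacobian of shape [c, N],
    where N is the number of input features.
    """
    x = x.clone().detach().requires_grad_(True)
    logits = model(x).squeeze(0)                      # shape [c]
    rows = []
    for j in range(logits.shape[0]):
        grad_j = torch.autograd.grad(logits[j], x, retain_graph=True)[0]
        rows.append(grad_j.flatten())
    return torch.stack(rows)                          # [c, N] Jacobian
```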

Since the original JSMA algorithm only considers the influence of pixel points on a single target class when computing the saliency map, whereas in multi-label classification a sample can correspond to multiple classes, we propose a new way of computing the saliency map to adapt to multi-label targeted attacks. The formula for the saliency map is as follows.

$$\begin{aligned} S(\textbf{x}, \textbf{y}^{*})[i]=\left\{ \begin{array}{ll} 0, &{}\quad \text{ if } \sum _{\textbf{p}_j \ne 0} \textbf{p}_j Jacobian(j, i)<0 \\ \sum _{\textbf{p}_j \ne 0} \textbf{p}_j Jacobian(j, i)-\sum _{\textbf{p}_j=0} Jacobian(j, i), &{}\quad \text{ otherwise } \end{array}\right. , \end{aligned}$$
(13)

where \(S(\cdot )\) represents the saliency map function, \(Jacobian(j,i)\) represents the component of the Jacobian matrix corresponding to the ith pixel and the jth class, \(\textbf{p}=\textbf{y}^{*}-\textbf{y}\), and \(\textbf{p}_j\) represents the difference between the jth target label and the true label, which can take values of \(-1\), 0, or 1.

The purpose of formula (13) is to focus on the impact of each pixel on the target labels that need to be changed. If a feature point has a negative impact on the target labels, it is directly excluded. If a feature point has a positive impact on the target labels, we also evaluate its impact on the other labels that do not need to be changed; if the impact on these irrelevant labels is greater than that on the target labels, the feature point is also excluded.
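A sketch of formula (13), using the Jacobian from the previous sketch and \(\{-1,1\}\)-coded label vectors; taking the sign of \(\textbf{y}^{*}-\textbf{y}\) keeps \(\textbf{p}_j\) in \(\{-1,0,1\}\) as stated above. This is an illustration, not the exact experimental code.

```python
import torch

def saliency_map(jacobian, y, y_target):
    """Multi-label saliency map of formula (13); jacobian has shape [c, N]."""
    p = torch.sign((y_target - y).float())             # p_j in {-1, 0, 1}
    change = p != 0
    target_term = (p[change].unsqueeze(1) * jacobian[change]).sum(dim=0)
    other_term = jacobian[~change].sum(dim=0)
    scores = target_term - other_term
    scores[target_term < 0] = 0.0                      # exclude harmful feature points
    return scores
```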

MLA-LP

The MLA-LP algorithm [21] defines the generation of an adversarial perturbation as a constrained optimization problem. When the constraint is the \(l_\infty \) norm, MLA-LP defines the attack as follows.

$$\begin{aligned} \begin{array}{ll} \mathop {\min }\limits _{\textbf{r}} &{}\Vert \textbf{r}\Vert _{\infty },\\ \text{ s.t. } &{} \frac{\partial loss\left( F(\textbf{x}), \textbf{y}^{*}\right) }{\partial \textbf{x}} \cdot \textbf{r} \le loss\left( \textbf{t}, \textbf{y}^{*}\right) -loss\left( F(\textbf{x}), \textbf{y}^{*}\right) , \end{array} \end{aligned}$$
(14)

where \(loss(\cdot )\) is a specific loss function, and \(\textbf{t}\) is the threshold vector corresponding to every class. By introducing an additional variable a as shown in [21], this optimization problem can be further solved by linear programming as follows.

$$\begin{aligned} \begin{array}{ll} \mathop {\min }\limits _{a, \textbf{r}} &{} a, \\ \text{ s.t. } &{} a \ge r_{i}, \quad i=1,2, \ldots , d, \\ &{} a \ge -r_{i}, \quad i=1,2, \ldots , d, \\ &{} \frac{\partial loss\left( F(\textbf{x}), \textbf{y}^{*}\right) }{\partial \textbf{x}} \cdot \textbf{r} \le loss\left( \textbf{t}, \textbf{y}^{*}\right) -loss\left( F(\textbf{x}), \textbf{y}^{*}\right) . \end{array} \end{aligned}$$
(15)

Because linear programming is used, MLA-LP has a lower time cost and can obtain adversarial examples with very tiny perturbations.
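For illustration, the linear program (15) can be posed to a generic LP solver as in the sketch below (variables \([a, r_1, \ldots, r_d]\); the helper name is ours). This is a toy-dimensional illustration: for a full \(448\times 448\times 3\) image, a sparse formulation would be needed in practice.

```python
import numpy as np
from scipy.optimize import linprog

def mla_lp_step(grad, budget, d):
    """Solve the linear program (15) for one MLA-LP step.

    `grad` is the flattened gradient of the loss w.r.t. the input (length d);
    `budget` is loss(t, y*) - loss(F(x), y*).
    """
    c = np.zeros(d + 1)
    c[0] = 1.0                                           # minimize a
    # a >= r_i and a >= -r_i rewritten as r_i - a <= 0 and -r_i - a <= 0
    A_upper = np.hstack([-np.ones((d, 1)), np.eye(d)])
    A_lower = np.hstack([-np.ones((d, 1)), -np.eye(d)])
    A_grad = np.hstack([[0.0], grad]).reshape(1, -1)     # grad . r <= budget
    A_ub = np.vstack([A_upper, A_lower, A_grad])
    b_ub = np.concatenate([np.zeros(2 * d), [budget]])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (d + 1))
    return res.x[1:]                                     # the perturbation r
```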

Experimental design

In this section, we introduce the experimental design. First, we introduce the two datasets used in the experiments. Second, we introduce the multi-label classification models. Third, we detail the parameter settings of all evaluated attack algorithms. Finally, we present the evaluation metrics used to analyze the experimental results.

Datasets

We use two PASCAL VOC datasets for training the multi-label models and generating adversarial examples. The PASCAL VOC datasets were published for the Visual Object Classes (VOC) challenge, which mainly consists of classification and detection. The classification task is to predict whether an instance of a given class exists in the test image. The detection task is to predict bounding boxes and labels for the objects contained in the test image. In this paper, we select PASCAL VOC 2007 [25] and PASCAL VOC 2012 [26] as the datasets for training models and generating adversarial examples. For convenience, they are referred to as VOC 2007 and VOC 2012 in this paper.

VOC 2007: In the VOC 2007 dataset, there is detailed information for each image, including category, bounding box, and semantic information. This dataset comes from real-world scenarios with 20 categories. Its training set includes 5011 samples, its validation set includes 2510 samples and its test set includes 4952 samples.

VOC 2012: The VOC 2012 dataset is published in the VOC 2012 challenge. Similar to VOC 2007, its samples are all from real-world scenarios and it contains 20 categories. There is only a training set containing 5017 samples and a validation set containing 5821 samples in the VOC 2012 dataset, and there is no test set.

It is worth noting that the VOC 2012 dataset only contains a training set and a validation set. In the experiments, we randomly select 80% of the images from the original VOC 2012 training set as the actual training set used for model training, and the remaining 20% are used as the actual validation set. We use the original VOC 2012 validation set as the actual test set for model training.

Multi-label classification models

We focus on the performance of gradient-based algorithms attacking multi-label classification models. First, we choose the benchmark model proposed by Song et al. [22], which is called ML-LIW in this paper. Since the ML-LIW model does not sufficiently focus on the correlations between labels, we also use the ML-GCN [27] model, which explicitly exploits label correlation, to take into account this important characteristic of multi-label classification. Therefore, the experiments in this paper are conducted on two models, i.e., ML-GCN and ML-LIW. Next, we briefly introduce the two multi-label classification models.

ML-LIW: Song et al. [22] built a multi-label classification model based on the Inception-v3 [28] network pre-trained on the ImageNet dataset, which is called ML-LIW in this paper. In order to better apply it to the multi-label setting, Song et al. replaced the softmax layer of the original network with a sigmoid layer and retrained the classification layer. Furthermore, they consider both instance loss and label loss to ensure the model has good classification performance.

ML-GCN: Chen et al. [27] proposed the multi-label classification model ML-GCN based on graph convolutional networks (GCNs) to capture and exploit the dependencies between labels. The model builds a directed graph over the labels in a data-driven manner, where the nodes are the word vectors of all labels. The correlation between labels is obtained by encoding this directed acyclic graph through the GCN, so ML-GCN can model label correlation to improve the representation and learning ability of the model. The GCN is analyzed in depth and redesigned to make it more suitable for multi-label problems. Experimental results on two authoritative datasets for multi-label image recognition show that ML-GCN significantly outperforms many existing models. Furthermore, their results show that the classifiers learned by the model also maintain a meaningful semantic topology.

During model training, we use the same training parameters for the ML-GCN model as those given in [27]: 100 training epochs, an initial learning rate of 0.01 that is decreased to 10% of its value every 40 epochs, and the SGD optimizer. For the ML-LIW model, we also set 100 training epochs with a learning rate of 0.001, but use the Adam optimizer instead. Furthermore, during model training, all images in the VOC 2007 and VOC 2012 datasets are resized to \(448\times 448\) and the image pixel values are normalized to [0, 1]. For the ML-GCN model, we directly use the authors’ open-source code from the original paper, available at https://github.com/Megvii-Nanjing/ML-GCN, and train the model on the two datasets separately. For ML-LIW, we construct the model based on [22] and likewise train it on the two datasets, respectively. The training results of these two models are shown in Table 1.

Table 1 Performance of ML-GCN and ML-LIW on VOC 2007 and VOC 2012 datasets
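As a small illustration of the preprocessing described above, the following torchvision pipeline resizes images to 448x448 and maps pixel values into [0, 1] (ToTensor already performs the [0, 1] scaling); this mirrors the setup as we understand it and is not taken from the released training code.

```python
import torchvision.transforms as T

# Resize to 448x448 and scale pixel values into [0, 1].
preprocess = T.Compose([
    T.Resize((448, 448)),
    T.ToTensor(),
])
```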

Experimental setting

The premise of the comprehensive evaluation experiments is to provide the same conditions for all attack algorithms. We conduct experiments on sufficient samples and give each attack algorithm the opportunity to exert its best performance, ensuring the fairness of the comparison. Next, the specific parameter settings of the attack algorithms and the number of samples in different attack situations are introduced in detail.

Since the evaluated attack algorithms use different norms to constrain the generated adversarial perturbations, we choose the same parameter settings as far as possible for attack algorithms of the same type and with the same norm. FGS, FG, BIM, MIM, PGD, and SLIDE are implemented with AdverTorch [29], an adversarial attack and defense framework based on PyTorch [30]. For the FGS algorithm, we set \(\alpha \) in Eq. (3) to 0.03. For the FG algorithm, we set \(\alpha \) in Eq. (4) to 1. Since the BIM, MIM, and PGD algorithms use the same norm to limit the perturbation, we uniformly set \(\alpha \) in Eqs. (6), (8), and (10) to 0.001 and set the maximum number of iterations to 30. The SLIDE algorithm generates sparser adversarial perturbations than PGD by controlling the \(l_1\) norm; in order to achieve a performance similar to PGD, we set its \(l_1\) norm parameter in AdverTorch to 300 and its \(l_1\) sparsity parameter to 0.95. For ML-DP, consistent with the original paper [22], the maximum number of iterations is set to 20. For the MLA-LP algorithm, we set the maximum number of iterations to 20, and the other settings remain the same as in the original paper [21]. For the JSMA algorithm, referring to the attack performance analysis in the original paper, we set the proportion of perturbed pixels to 10% of the total number of pixels and set the perturbation step \(\theta \) on each pixel to 0.015.

For the transfer attack experiments, we conducted experiments under two attack types, “hiding single” and “hiding all”, in order to obtain a sufficient number of adversarial examples that can successfully attack the surrogate model. We used the adversarial examples that successfully attacked the surrogate model to attack the other, non-surrogate model, while keeping the dataset unchanged for all attack algorithms.

Considering the number of original samples in the datasets and the accuracy of the attack results, in the comprehensive comparison experiments, for FGSM, FGM, BIM, MIM, PGD, and SLIDE, we randomly select 1000 samples as attack samples for each attack type, each model, and each dataset. If the number of original samples that meet the attack type requirements is less than 1000, we select all samples that meet the attack requirements. For the ML-DP, MLA-LP, and JSMA algorithms, due to the limitation of computing resources, we choose attack samples as follows: for each dataset and each attack type, 100 samples that meet the initial label requirements of the attack are randomly selected. For the transfer experiments, we randomly select 100 samples as attack samples for the FGSM, FGM, BIM, MIM, PGD, SLIDE, ML-DP, and MLA-LP algorithms; for the JSMA algorithm, we randomly select 50 samples that meet the requirements.

Evaluation metrics

In the experiments, we mainly analyze the attack performance of the algorithms in two aspects: the attack success rate of the adversarial attack and the distortion caused by the adversarial perturbations. For the attack success rate, since we focus on the multi-label classification task, an attack is considered successful when the labels predicted for the adversarial example match the attack target. For distortion, since the evaluated attack algorithms are constrained by multiple norms, the evaluation of adversarial perturbations includes the involved norm types. Considering the particularity of the \(l_0\) norm, the sparsity of the adversarial perturbation is not used as an evaluation index. Next, we describe the evaluation metrics in detail.

Attack success rate: When discussing the attack success rate of an algorithm attacking the multi-label classifier, we set the default classification threshold of the classifier for each label to \(t = 0.5\). As mentioned in “Multi-label classification”, if \(f_i(\textbf{x} )\ge 0.5\), then \(F_i(\textbf{x})=1\). Since we focus on targeted attacks, an attack is considered successful when the post-attack label vector is equal to the target label vector. For each attack type, assuming that the number of selected examples is N and the number of successful examples is \(N_{adv}\), the attack success rate is \(N_{adv}/N\).

Distortion: We employ five metrics. In addition to the \(l_1\) norm, \(l_2\) norm, and \(l_\infty \) norm, we introduce the RMSD and the mean perturbation. RMSD is the root mean square deviation of the perturbations over all successful adversarial examples. The mean perturbation is the average perturbation value per pixel over all successful adversarial examples. Both measure the overall size of the adversarial perturbation, but it is worth noting that RMSD is computed on images whose pixel values are in [0, 255], while the mean perturbation is computed on images whose pixel values are in [0, 1].
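For reference, a minimal sketch of how these metrics could be computed for a single adversarial example (assuming PyTorch tensors, [0, 1] pixel values, and \(\{-1,1\}\)-coded label vectors; the function name is ours):

```python
import torch

def evaluate_attack(x, x_adv, y_pred_adv, y_target):
    """Success flag and distortion metrics for one adversarial example.

    RMSD is computed on the [0, 255] scale, the mean perturbation on [0, 1],
    as described above.
    """
    r = (x_adv - x).flatten()
    return {
        "success": bool(torch.equal(y_pred_adv, y_target)),
        "l1": r.abs().sum().item(),
        "l2": r.norm(p=2).item(),
        "linf": r.abs().max().item(),
        "rmsd": (255.0 * r).pow(2).mean().sqrt().item(),
        "mean_perturbation": r.abs().mean().item(),
    }
```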

Results and discussion

This section presents the experimental results of the nine tested gradient-based attack algorithms under six attack types on two datasets and two models. The performance of the tested attack algorithms is analyzed in terms of attack success rate and distortion. We rank the metric values in the experimental results so that the advantages and disadvantages of the tested attack algorithms can be shown more intuitively; the smaller the ranking score, the better the performance under that metric. The source code for this article is available at https://github.com/MiLabHITSZ/2022ChenEmpiricalMLAE.

Hiding single attack results

Table 2 shows the experimental results of the tested attack algorithms under the “hiding single” attack on the target models. Most tested attack algorithms can successfully construct adversarial examples that hide one label of the original image. Next, we evaluate the attack results in terms of attack success rate and distortion.

Attack success rate: Experimental results show that these gradient-based attack algorithms can effectively conduct the “hiding single” attack, except for the ML-DP algorithm. From the ranking scores of the attack success rate, it can be seen that MIM performs best, with the BIM and PGD algorithms next. The success rates of MIM and BIM on the different models are higher than 94%, and the success rates of PGD on all models are higher than 95.6%. The SLIDE, MLA-LP, and JSMA algorithms achieve similar performance, which is lower than that of the BIM and PGD algorithms; their minimum attack success rates on the target models are 93.9%, 64%, and 88%, respectively. The attack success rates of the one-step gradient attack algorithms on the different models are between 50 and 70%. The ML-DP algorithm is the worst: it cannot generate adversarial examples for the ML-GCN model trained on the VOC 2007 dataset, and when attacking the other models, it achieves its highest attack success rate, only 42%, on the ML-LIW model trained on the VOC 2007 dataset.

Distortion: Experimental results show that, for the “hiding single” attack, the perturbation generated by the MLA-LP algorithm has the smallest ranking scores on the \(l_2\) norm, \(l_\infty \) norm, and RMSD metrics, while the perturbation generated by the SLIDE algorithm has the smallest ranking scores on the \(l_1\) norm and mean perturbation metrics. Among the iterative attack algorithms based on FGSM, the perturbation generated by MIM is the largest. Among the non-FGSM-related algorithms, all perturbation metrics of ML-DP are the largest, and the JSMA algorithm yields relatively large perturbations on the \(l_2\) norm, \(l_\infty \) norm, and RMSD metrics.

Table 2 Experimental results under hiding single attack
Table 3 Experimental results under hiding all attack

Hiding all attack results

Table 3 shows the experimental results of all tested attack algorithms under the “hiding all” attack on the ML-GCN and ML-LIW models trained on the VOC 2007 and VOC 2012 datasets. All algorithms can implement the “hiding all” attack, and we analyze the experimental results and ranking scores in detail below.

Attack success rate: From the ranking scores of the attack results, it can be seen that the overall attack success rates of the iterative gradient-based attack algorithms are higher than those of the one-step gradient attack algorithms. The PGD algorithm achieves the best attack success rate, with success rates on all models higher than 96%. MIM is close to PGD, with success rates higher than 95%. The attack performance of BIM is slightly lower than the former two. The attack performance of JSMA is inferior to BIM, with attack success rates on all target models higher than 92%. The ranking scores of SLIDE, MLA-LP, and ML-DP are close, but SLIDE is more stable than the MLA-LP and ML-DP algorithms; its attack success rates are higher than 90.8% on all models. The success rates of the MLA-LP algorithm are higher than 60% on ML-GCN but higher than 96% on ML-LIW. The success rates of ML-DP on ML-LIW are lower than those on ML-GCN: they are above 24% on ML-LIW but above 88% on ML-GCN. The attack success rates of the one-step attack algorithms on all models do not exceed 55%; the success rates of FGS are between 21.7 and 44.7%, and those of FG are between 7.6 and 54.2%.

Distortion: According to Table 3, the amount of perturbation differs significantly among the attack algorithms. The distortion caused by FGM, SLIDE, and MLA-LP is relatively small, while the ML-DP algorithm produces the largest perturbation, making the attack pointless. The \(l_2\) norm, \(l_1\) norm, mean perturbation, and RMSD of the perturbation produced by the SLIDE algorithm are all the smallest under the “hiding all” attack, and the \(l_\infty \) norm of the MLA-LP algorithm is the smallest. The overall distortions of the adversarial perturbations generated by BIM, MIM, and PGD are relatively large. The perturbations of the JSMA algorithm have large \(l_2\) and \(l_\infty \) norm values but relatively small \(l_1\) norm and mean perturbation values.

Random attack results

Table 4 shows the experimental results of the tested attack algorithms under the “random” attack. The “random” attack involves both label augmentation and label reduction, so it is more complex than attack types that only require hiding a label. From the experimental results, it can be seen that one-step gradient-based attack algorithms struggle to conduct the “random” attack, while iterative attack algorithms are more likely to generate adversarial examples. Since the MLA-LP algorithm cannot implement this type of attack, its experimental results are not listed in Table 4. The experimental results of the other attack algorithms are analyzed in detail below in terms of attack success rate and distortion.

Attack success rate: Experimental results show that the success rates of the iterative attack algorithms are higher than those of the one-step algorithms. From the ranking scores, it can be seen that MIM has the best attack performance on all target models, and the success rates of BIM and PGD are similar. It is worth noting that BIM, MIM, and PGD perform poorly when attacking ML-GCN trained on the VOC 2012 dataset, with success rates of 58.9%, 64.2%, and 63.8%, respectively. On the other models, the attack success rates of BIM, MIM, and PGD are similar, all above 90%. The attack success rates of SLIDE and JSMA vary greatly across models: SLIDE varies from 59 to 93% and JSMA from 25 to 60%. The attack success rates of the one-step attack algorithms are generally low; the attack success rates of FGS are less than 7%, and those of FG are less than 14.3%.

Distortion: The ranking scores indicate that, among all tested attack algorithms, the distortion of the perturbation generated by the SLIDE algorithm is the smallest on each metric, while that generated by the ML-DP algorithm is the largest. The perturbations generated by the BIM and JSMA algorithms are slightly larger than those generated by the SLIDE and FG algorithms, and the perturbations generated by MIM and PGD are relatively large. Among the one-step gradient attack algorithms, the distortion caused by the FG algorithm is smaller than that caused by the FGS algorithm on every metric except the \(l_\infty \) norm.

Table 4 Experimental results under random attack

Extreme attack results

Table 5 shows the attack results of the tested attack algorithms on the multi-label classification models. Because the “extreme” attack flips all labels of the original image, it is extremely challenging. Only BIM, PGD, and MIM can successfully generate effective adversarial examples for the target models under the “extreme” attack. Therefore, we only show and compare the performance of these three algorithms below. However, as shown in Table 5, even these three algorithms cannot attack the ML-LIW model trained on the VOC 2007 dataset.

Attack success rate: Among the three attack algorithms that can achieve extreme attacks, the attack success rate of MIM is always better than that of BIM and PGD, and the attack success rate of BIM is better than that of PGD. The attack success rates of MIM and BIM on the ML-GCN model trained on the VOC 2012 dataset are higher than those on the other models, with success rates of 77.1% and 46.8%, respectively. When attacking the ML-LIW model trained on the VOC 2012 dataset, only MIM is successful, with an attack success rate of 13.4%. The experimental results show that most of the tested attack algorithms cannot achieve ideal attack performance under the “extreme” attack.

Distortion: The ranking scores of the attack results show that, among these three algorithms, BIM always maintains the lowest perturbation distortion. Specifically, except for the \(l_\infty \) norm, the norm values of the adversarial perturbation generated by BIM do not exceed 34% of those generated by MIM.

Table 5 Experimental results under extreme attack

Reduction attack results

Table 6 shows the attack results of the tested attack algorithms on the multi-label classification models under the “reduction” attack. All the evaluated attack algorithms can successfully attack the target models. Next, we analyze the experimental results in terms of attack success rate and distortion.

Attack success rate: Experimental results show that, except for the ML-DP algorithm, the iterative attack algorithms are more effective than the one-step gradient algorithms under the “reduction” attack. The success rates of the BIM, MIM, and PGD algorithms are all above 95%, while the highest success rates of the FGS and FG algorithms are 58.5% and 82.0%, respectively. It can be seen from the ranking scores that the SLIDE algorithm is superior to the JSMA and MLA-LP algorithms, and its attack performance is more stable. The success rates of SLIDE, JSMA, and MLA-LP are 94.5–100%, 83–98%, and 62–100%, respectively. The attack performance of ML-DP is the worst, with attack success rates lower than 4%.

Distortion: Under this attack type, the ML-DP algorithm causes the largest perturbation distortion, while the SLIDE and MLA-LP algorithms cause smaller distortion. The perturbation generated by the SLIDE algorithm is sparser, so its \(l_1\) norm and mean perturbation are the smallest. The \(l_2\) norm, \(l_\infty \) norm, and RMSD of the perturbation constructed by the MLA-LP algorithm are all the smallest. In addition, among the iterative attack algorithms based on FGSM, the perturbation distortion generated by BIM is the smallest, while the distortion of MIM and PGD is relatively large.

Augmentation attack results

Table 7 shows experimental results under the “augmentation” attack. Since the MLA-LP algorithm cannot achieve the augmentation attack in the experiment, its performance indicators are not listed in Table 7. The attack success rates and distortions of the other algorithms will be described in detail next.

Attack success rate: Experimental results show that the performance of the gradient-based one-step attack algorithms is poor under the augmentation attack; the FG algorithm performs better than FGS, but its success rate on the target models does not exceed 20%. The FGSM-based iterative attack algorithms achieve their highest attack success rates, above 98.5%, when attacking the ML-LIW model trained on the VOC 2012 dataset, but when they attack the ML-GCN model trained on the VOC 2012 dataset, their attack success rates are only between 36 and 46%. The attack success rate of the JSMA algorithm is only 45% when attacking the ML-LIW model trained on the VOC 2007 dataset, and above 70% in the other cases.

Distortion: The ranking scores show that the SLIDE attack algorithm achieves the minimum distortion on all metrics. The \(l_\infty \) norms caused by the FG and JSMA algorithms are larger, while MIM and FGS cause larger distortions on the metrics other than the \(l_\infty \) norm. In addition, the distortion of the PGD algorithm is close to that of MIM, and the distortion caused by BIM is generally smaller than that caused by the MIM and PGD algorithms.

Table 6 Experimental results under reduction attack
Table 7 Experimental results under augmentation attack

Transfer attack results

Tables 8 and 9 present the results of the transfer experiments under the “hiding single” and “hiding all” attack types, respectively. The surrogate model refers to the model used to generate adversarial examples, while the transfer model refers to the target model of the attack algorithms in the transfer experiments. The attack success rates and distortions of all algorithms are described in detail next.

Attack success rate: Experimental results show that, under the “hiding single” transfer attack, the attack success rates of the one-step attack algorithms are higher than those of the multi-step attack algorithms. Under the “hiding all” transfer attack, the adversarial examples generated by the ML-DP algorithm retain a higher attack success rate after transfer, while the attack success rates of the adversarial examples generated by the other attack algorithms drop significantly.

Distortion: In Tables 8 and 9, the experimental results indicate that, among the one-step attack algorithms, the FGS algorithm produces the largest perturbation in successful attacks. Among the multi-step attack algorithms, the ML-DP algorithm generates the largest perturbation, while the other attack algorithms produce relatively smaller perturbations. Comparing all the algorithms, the ML-DP algorithm generates the largest perturbation, and the FGS algorithm produces a larger perturbation than all multi-step attack algorithms except ML-DP. The SLIDE algorithm generates the smallest perturbation overall.

Table 8 Experimental results under hiding single transfer attack
Table 9 Experimental results under hiding all transfer attack

Summary

Tables 2, 3, 4, 5, 6 and 7 show the experimental results of attacking multi-label classifiers using gradient-based attack algorithms, and Tables 8 and 9 present the experimental results of transfer attacks under the “hiding single” and “hiding all” types. Since the performance of the tested attack algorithms on different datasets and different models is relatively similar, we mainly evaluate these gradient-based white-box attack algorithms from four aspects: attack type, attack success rate, distortion, and transferability.

Attack type: Experimental results demonstrate significant variations in the performance of gradient-based attack algorithms across different attack types. These variations reveal distinct characteristics of gradient-based attacks in the context of multi-label adversarial attacks. The results confirm that the number of flipped labels in the attack target directly determines the difficulty of the attack: the more labels involved in the target, the more challenging the attack becomes. Specifically, the “hiding single” and “reduction” attack types, which involve flipping only one label, are relatively easy for gradient-based attack algorithms to achieve. It is important to note that, in the case of flipping a single label, “augmentation” attacks are more difficult to implement than “hiding single” or “reduction” attacks. This phenomenon suggests that, in multi-label adversarial attacks, it is more challenging to raise the confidence of a single label above the threshold using model gradients than to lower it below the threshold. “Random” attacks are more difficult to achieve than “hiding all” attacks, although the number of flipped labels in “random” attacks is less than or equal to that in “hiding all” attacks; this further confirms that “augmentation” is more difficult than “hiding single” or “reduction”. Naturally, “extreme” attacks, which require flipping all labels, are the most challenging attack type. Additionally, across the different attack types, iterative attack algorithms generally exhibit better attack performance than one-step attack algorithms, except for ML-DP and MLA-LP. This indicates that multi-label classification models have lower linearity than multi-class classification models: the more labels a sample corresponds to, the weaker the linear relationship between the sample and the labels becomes.

Attack success rate: It can be observed from the experimental results that, among all the tested adversarial attack algorithms, MIM exhibits the best attack performance. Overall, except for the MLA-LP and ML-DP algorithms, the iterative gradient-based attack algorithms outperform the one-step gradient-based attack algorithms on the feasible attack types. The MLA-LP attack performs better on the “hiding all”, “hiding single”, and “reduction” attack types, and worse on the “random”, “extreme”, and “augmentation” attack types. For ML-DP, we believe that multi-label classification models are less linear, so it is harder to find the decision boundaries. From the experimental results, we can see that the one-step attack algorithms cannot effectively handle the different gradient directions of multiple labels, so it is difficult for them to generate effective adversarial examples for multi-label classification models. The experimental results of the BIM, MIM, and PGD attack algorithms indicate that the FGSM-based iterative attack algorithms exhibit significant variations in the attack success rates of “random” and “extreme” attacks across different datasets and models. This suggests that, as the attack difficulty increases, the attack success rates of the FGSM-based iterative attack algorithms are more influenced by the training dataset and model structure. Additionally, the MIM algorithm proves to be more effective in the “extreme” attack than the other methods, which confirms that introducing momentum to stabilize the update direction in FGSM-based iterative attacks also has a positive impact on the attack success rate of multi-label targeted attacks.

Distortion: Among all tested attack algorithms except ML-DP, the FGS algorithm has the largest overall perturbation magnitude, with the \(l_2\) norm of its perturbation not exceeding 23. Since the input images in the experiments are resized to \(448\times 448\times 3\), the adversarial examples generated by all the attack algorithms except ML-DP can be considered acceptable, whereas the perturbations generated by the ML-DP algorithm fail to meet the imperceptibility criterion for adversarial examples. Specifically, among the one-step attack algorithms, the FG algorithm has the smallest perturbation magnitude. Among the iterative attack algorithms, MLA-LP has the smallest perturbation in the “hiding single”, “reduction”, and “hiding all” attacks, while SLIDE has the smallest perturbation in the “augmentation” and “random” attacks; for extreme attacks, BIM has the smallest perturbation. Among BIM, MIM, and PGD, the multi-step variants of FGSM, the distortion caused by BIM is the smallest, while the distortion caused by MIM and PGD is relatively large. The distortion caused by the JSMA algorithm is usually greater than that caused by BIM but smaller than that caused by MIM and PGD. In addition, for a given attack algorithm, the distortion of its perturbations is similar across different datasets and models and depends mainly on the algorithm itself.

Transferability: Experimental results indicate that the attack success rates of all attack algorithms, except for the ML-DP algorithm, decrease significantly after transfer. Although the ML-DP algorithm shows better attack success rates under the “hiding all” attack, we do not consider the adversarial examples generated by ML-DP to be effective because of their excessive distortion. Comparing all the attack algorithms, it can be observed that those that generate adversarial examples with larger distortion exhibit stronger transferability, which suggests a correlation between the transferability and the distortion of adversarial examples in multi-label adversarial attacks. Additionally, by comparing the experimental results in Tables 8 and 9, it can be observed that the adversarial examples generated under the “hiding all” attack demonstrate weaker transferability. This indicates that complex attack types in multi-label adversarial attacks pose challenges for the transferability of adversarial examples.

Conclusion

Gradient-based white-box attack algorithms have been widely used to generate adversarial examples. However, most attack algorithms are designed for multi-class classification models. Since multi-label classification models also face the threat of adversarial attacks, in this paper we comprehensively compare the performance of nine typical gradient-based white-box attack algorithms when attacking multi-label classification models, as well as their transferability across different models. Among these nine attack algorithms, the BIM, MIM, PGD, SLIDE, and JSMA algorithms are transplanted to attack multi-label models in this paper, and the remaining attack algorithms are derived from existing work.

Experimental results under six attack types on two datasets and two models show that the gradient-based one-step attack algorithms perform poorly in multi-label adversarial attacks, while the multi-step iterative attack algorithms are more promising for attacking multi-label classifiers. As for the different attack types, it can be seen that, for gradient-based attack algorithms, label augmentation in multi-label adversarial attacks is more difficult to achieve than label hiding. Experimental results under the “hiding single” and “hiding all” transfer attacks show that all gradient-based white-box attack algorithms exhibit poor transferability across different models. Compared to “hiding single”, the “hiding all” attack type poses a greater challenge in terms of the transferability of adversarial examples. In the future, we will further design more effective attack and defense methods to study adversarial attacks on multi-label models.